Improve Git Monorepo Performance


Today, I was exploring the source code of the GitLab project and noticed that the git status command performed poorly. GitLab is an open source alternative to GitHub.

Below is the output of the git status command, timed with time:

time git status
On branch master
Your branch is ahead of 'origin/master' by 1 commit.
  (use "git push" to publish your local commits)

nothing to commit, working tree clean
git status  0.20s user 1.13s system 88% cpu 1.502 total

The total value is the wall-clock time, in seconds, that the command took to complete.

The same was the case for the git add command.

time git add .
git add .  0.21s user 1.11s system 115% cpu 1.146 total

So both commands took more than a second to finish.

These commands are slow because they need to search the entire worktree looking for changes. When the worktree is very large, Git needs to do a lot of work.
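
To get a rough feel for how much work that is, you can count the paths Git has to look at. The sketch below is only an illustration: count_tracked_files and count_worktree_entries are helper names I made up, and the numbers approximate how many paths Git examines rather than describing exactly what it does internally.

```shell
# Number of files tracked in the index; without FSMonitor, git status
# has to check each of them for modifications.
count_tracked_files() {
  git ls-files | wc -l | tr -d ' '
}

# Number of files and directories on disk (including "." itself, skipping
# .git); git status also walks these to look for untracked files.
count_worktree_entries() {
  find . -path ./.git -prune -o -print | wc -l | tr -d ' '
}

# Run from the root of a repository.
echo "tracked files:    $(count_tracked_files)"
echo "worktree entries: $(count_worktree_entries)"
```

The larger these two numbers are, the more time commands such as git status and git add spend checking the file system.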

To give some context, as of 3rd July 2022 the GitLab source code has 44147 files, 3920240 lines in total (2835859 of them code), and is 2.18 GB in size. I used a tool called tokei to count the files and lines. Below is the trimmed-down tokei output.

===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 BASH                   10          331          217           53           61
 Clojure                 1            3            3            0            0
 CSS                     2          380          265           10          105
 Dockerfile             20          352          183           74           95
 Go                    218        26761        20792         1149         4820
 GraphQL               786        13442        12711          382          349
 JavaScript           6758       687087       565968        17702       103417
 JSON                  880       265745       265691            0           54
 Makefile                2          206          158           15           33
 Pan                     1           15           11            1            3
 PowerShell              1           13            5            5            3
 Python                  1           47           32            7            8
 Rakefile              116         6945         5174          509         1262
 Ruby                26637      2250147      1696058        86570       467519
 Ruby HTML             168         2321         1865           41          415
 Sass                  272        55825        46004         1459         8362
 Shell                  36         2343         1742          158          443
 SQL                     5        61611        50019           22        11570
 SVG                   206         1362         1327            8           27
 Plain Text             51        22962            0        17866         5096
 XML                    15         7327         6314            4         1009
 YAML                 4003       134445       130221         2402         1822
 // removed for brevity
===============================================================================
 Total               44147      3920240      2835859       379671       704710
===============================================================================

I am aware that Git does not scale well to large monorepos.

I remember reading a post by the Microsoft team a few years back in which they explained how they built a virtual file system to improve the performance of Git. From the 2017 post:

As a refresher, the Windows code base is approximately 3.5M files and, when checked into a Git repo, results in a repo of about 300GB. Further, the Windows team is about 4,000 engineers and the engineering system produces 1,760 daily “lab builds” across 440 branches in addition to thousands of pull request validation builds. All 3 of the dimensions (file count, repo size and activity), independently, provide daunting scaling challenges and taken together they make it unbelievably challenging to create a great experience.

I happened to read a post published by the GitHub team in which they explained how you can improve the performance of large monorepos with a newly released Git feature, the built-in file system monitor (FSMonitor). This feature is available from Git version 2.37.0 onwards. On a Mac, you can run brew install git to get the latest version.

You can enable FSMonitor by running the following command.

git config core.fsmonitor true

The GitHub post also suggests enabling the untracked cache, so we will do that as well.

git config core.untrackedcache true
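
Before timing anything, it can help to confirm that the Git version is new enough and that both settings were actually stored. A minimal sketch; check_fsmonitor_config is a helper name I made up, and it relies only on git version and git config --get:

```shell
# Print the Git version, then the two settings from this post
# ("unset" when a key has not been configured).
check_fsmonitor_config() {
  git version
  for key in core.fsmonitor core.untrackedcache; do
    printf '%s=%s\n' "$key" "$(git config --get "$key" || echo unset)"
  done
}

check_fsmonitor_config
```

If either line prints unset, the corresponding git config command was probably run in a different repository. As far as I know, the built-in daemon is only available on macOS and Windows; on those platforms, git fsmonitor--daemon status should report whether a daemon is watching the current repository.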

The first time you run the git status command after running the above commands, it will be just as slow as before, or even a little slower. This is because the FSMonitor daemon first needs to synchronize with the state of the index.

time git status
On branch master
Your branch is ahead of 'origin/master' by 1 commit.
  (use "git push" to publish your local commits)

nothing to commit, working tree clean
git status  0.23s user 1.16s system 61% cpu 2.260 total

From the second run onwards, git status is much faster, as shown below.

time git status
On branch master
Your branch is ahead of 'origin/master' by 1 commit.
  (use "git push" to publish your local commits)

nothing to commit, working tree clean
git status  0.05s user 0.02s system 63% cpu 0.108 total

It took 108 ms to run the command, which is close to 14 times faster than the 1.502 seconds it took before FSMonitor was enabled.

You can learn more about how FSMonitor works by reading the GitHub post.
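
The ratio comes from the two wall-clock totals reported by time: 1.502 seconds for the very first run and 0.108 seconds for the last one. A small sketch of the arithmetic; the speedup helper is a name I made up for illustration:

```shell
# speedup BEFORE AFTER: prints how many times faster AFTER is than BEFORE,
# rounded to one decimal place.
speedup() {
  awk -v before="$1" -v after="$2" 'BEGIN { printf "%.1f\n", before / after }'
}

speedup 1.502 0.108   # prints 13.9
```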
