Code Repositories and Yak Shaving Part Two

In Part One of this series, I discussed what a monorepo is, defined contextual repos, and reviewed a couple of articles which attempted to illustrate the benefits of a monorepo. In this part, I’m going to go a bit deeper into some of the assertions I made, and some claims often made “in defense of” monorepos.

But Google Does It, So Should You/We

This is a pretty common one, and the glib response is to mutter something about friends jumping off of cliffs or bridges. Not that there isn’t substance buried in that phrase so many of us heard from mom or dad, but I think it better to recognize the argument for what it is: a logical fallacy, then demonstrate why it isn’t what those who use it think it is.

The first thing to realize is that Google does many things the vast majority of people and companies not only don’t need, but would suffer for attempting, because they are not Google. As an analogy, say you are a keyboard jockey who decides you want to become a competitor in Ironman. Do you go find out what the winners are doing and do that? Well, if you did you’d soon discover what a terrible idea it is to simply emulate what the successful do. Instead, you should find out what they did when they were at your level. It is not that different for much of software and operations.

Yes, Google uses a monorepo for its billions of lines of code. It has over 60 TB of data in that monorepo. When was the last time you checked out a repo at 1TB in size, let alone 60? What kind of problems might you have? A lot, not to put too fine a point on it.

But all of those disadvantages I listed in Part One, surely they apply to Google, right? Yes. So how do they do it? By doing things the rest of us do not.

For starters, you don’t check out the entire repo. “But git doesn’t” - no, it does not. Google wrote their own version control system, Piper. They have people dedicated to that project. So you don’t git checkout allthethings. That they built their own VCS isn’t a big shock when you consider the size. But the size isn’t the only aspect which drove it. I mentioned security controls. Google’s VCS has file-level ACLs designed into it. These ACLs control who can do what, from read to commit.

Piper supports file-level access control lists. Most of the repository is visible to all Piper users; however, important configuration files or files including business-critical algorithms can be more tightly controlled. In addition, read and write access to files in Piper is logged. If sensitive data is accidentally committed to Piper, the file in question can be purged. The read logs allow administrators to determine if anyone accessed the problematic file before it was removed. – Communications of the ACM, Vol. 59 No. 7
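To make the idea concrete, here is a minimal sketch of what per-file read controls plus a read log buy you. It is a toy model, not Piper's interface: the class name AclRepo, its methods, and the rule that a file without an ACL is visible to everyone are all my own assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class AclRepo:
    """Toy repository with per-file read ACLs and a read log.

    Hypothetical model for illustration only; not Piper's actual API.
    """
    files: dict = field(default_factory=dict)     # path -> content
    readers: dict = field(default_factory=dict)   # path -> allowed users; absent = visible to all
    read_log: list = field(default_factory=list)  # (user, path) for every successful read

    def add(self, path, content, readers=None):
        """Store a file, optionally restricting who may read it."""
        self.files[path] = content
        if readers is not None:
            self.readers[path] = set(readers)

    def read(self, user, path):
        """Return the file content if the user is allowed, and log the read."""
        allowed = self.readers.get(path)
        if allowed is not None and user not in allowed:
            raise PermissionError(f"{user} may not read {path}")
        self.read_log.append((user, path))
        return self.files[path]

    def purge(self, path):
        """Remove a file and report who had read it before removal."""
        self.files.pop(path, None)
        self.readers.pop(path, None)
        return sorted({user for user, p in self.read_log if p == path})
```

The point of the log is the last method: when something sensitive lands in the repository by mistake, you can remove it and also answer "who saw it?". With a plain Git clone on every laptop, that question has no answer.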

So clearly, Google engineers and developers recognized the security drawback of all your eggs in a single basket. Further:

The Google codebase is laid out in a tree structure. Each and every directory has a set of owners who control whether a change to files in their directory will be accepted. – ibid.
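Directory ownership is easy to picture as a lookup that walks up the tree. The sketch below assumes a plain mapping from directory to owners, with the empty string standing for the repository root, and assumes that a directory with no entry inherits from its nearest ancestor. The quoted article says every directory has owners; the inheritance rule and the function name owners_for are my assumptions, not a description of Google's tooling.

```python
import posixpath


def owners_for(path, owners_by_dir):
    """Return the owners of the nearest enclosing directory that declares any.

    owners_by_dir maps a directory path ("" for the repository root) to a set
    of owner names. The inheritance rule is an assumption for illustration.
    """
    directory = posixpath.dirname(path)
    while True:
        if directory in owners_by_dir:
            return owners_by_dir[directory]
        if directory == "":
            return set()  # reached the root without finding any owners
        directory = posixpath.dirname(directory)
```

A review tool built on this would refuse to accept a change until someone from the returned set has approved it.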

Does your VCS support those features? Further, consider that Google is almost assuredly working more “in the cloud” than you are. After all, most development isn’t necessarily done locally, but through their custom platform:

Most developers access Piper through a system called Clients in the Cloud, or CitC, which consists of a cloud-based storage backend and a Linux-only FUSE file system. Developers see their workspaces as directories in the file system, including their changes overlaid on top of the full Piper repository. CitC supports code browsing and normal Unix tools with no need to clone or sync state locally. Developers can browse and edit files anywhere across the Piper repository, and only modified files are stored in their workspace. This structure means CitC workspaces typically consume only a small amount of storage (an average workspace has fewer than 10 files) while presenting a seamless view of the entire Piper codebase to the developer. – ibid.
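The overlay idea in that quote can be sketched in a few lines. This is only an in-memory illustration of "changes layered on top of a shared base", assuming the repository is a simple mapping of path to content; the Workspace class and its methods are hypothetical and say nothing about how CitC is actually built.

```python
class Workspace:
    """Toy overlay workspace: local edits layered over a shared base.

    Hypothetical illustration of the overlay idea; not CitC's implementation.
    """

    def __init__(self, base):
        self._base = base    # shared view of the full repository, never modified here
        self._changes = {}   # only the files modified in this workspace

    def read(self, path):
        """Prefer the workspace copy, falling back to the shared base."""
        if path in self._changes:
            return self._changes[path]
        return self._base[path]

    def write(self, path, content):
        """Record an edit in the workspace without touching the base."""
        self._changes[path] = content

    def modified(self):
        """List the paths this workspace has changed."""
        return sorted(self._changes)
```

The workspace stores only what changed, yet every read behaves as if the whole repository were present. That is the property a full clone of a multi-terabyte repository cannot give you.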

This helps to address the problem of a monorepo on your laptop leading to exposure of the kingdom. Do your VCS and platform provide this feature? It only starts there, because that snapshot filesystem you are in can be shared so others can view it - including automated tools. The rest of us, without that, can use branches, but we have to commit to branches, and if you want to see someone else’s work you have to check out their branch as well as wait for them to push their changes for you to pull. Google, however, eschews branches for development - everyone works in HEAD, period.

This is doable because of the combination of their CitC platform and custom built VCS. If you are working with Git, Subversion, Mercurial, etc., you don’t get that capability without a lot of custom work.

That Google requires work to be done on HEAD, aka “trunk-based” development, would not be a surprise to many Go developers, as that expectation is baked into the Go packaging system - it doesn’t do native versioning. This works well for Google because of the aforementioned custom development. They also have a plethora of tools, and people, dedicated to making it work.

They don’t do work in development branches, then merge. They work in HEAD, the workspace is shared with others and with tools which work against it, then when changes are ready they go to review by a select group of people who are potentially affected by the proposed change. Again, this requires tooling to track what other projects in the codebase may be affected. Only releases get a branch, and they are a combination of HEAD and sometimes some cherry-picking.

They further have this integrated in their workflow, allowing the selective inclusion of non-committed code based on flags. The code review is done via, yes, another custom tool - this one called Critique. The code analysis is custom-built as well, known as Tricorder. Yet another one, Rosie, is used for change control and modification. One of Rosie’s functions is to split changes into the directory structure so it can notify the affected owners. I’ve not seen the like in the OSS world, yet.
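The splitting step attributed to Rosie above can be approximated as grouping changed files by the nearest directory that has owners. This is a rough sketch under my own assumptions: the function name split_by_owned_dir is hypothetical, the set of owned directories is supplied by the caller, and files with no owned ancestor fall into a root group keyed by the empty string. Rosie's actual rules are not described in the source.

```python
import posixpath
from collections import defaultdict


def split_by_owned_dir(changed_paths, owned_dirs):
    """Group changed files by their nearest owned ancestor directory.

    Files with no owned ancestor are grouped under "" (the repository root).
    A hypothetical sketch, not Rosie's actual algorithm.
    """
    groups = defaultdict(list)
    for path in changed_paths:
        directory = posixpath.dirname(path)
        # Walk up until we hit an owned directory or run out of parents.
        while directory and directory not in owned_dirs:
            directory = posixpath.dirname(directory)
        groups[directory].append(path)
    return dict(groups)
```

Each resulting group can then be sent to that directory's owners as its own review, so a sweeping change becomes many small ones that each land in front of the right people.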

But the custom tooling doesn’t stop there. I also referred to the fact that whether you use a monorepo or contextual repos you need to have code indexing. Google knew this too, so they built CodeSearch, which indexes the entire thing - including the workspaces - and even lets you edit right in the tool.
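At its core, code indexing is an inverted index from identifiers to the files that mention them. The sketch below is deliberately naive and makes no claim about how CodeSearch works: it assumes the files are already in memory as a mapping of path to text, and it treats anything that looks like an identifier as a token. The function names build_index and search are mine.

```python
import re
from collections import defaultdict

# Anything shaped like an identifier counts as a token (a simplifying assumption).
TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_index(files):
    """Map each identifier-like token to the set of paths containing it."""
    index = defaultdict(set)
    for path, text in files.items():
        for token in TOKEN.findall(text):
            index[token].add(path)
    return index


def search(index, token):
    """Return the sorted list of paths that mention the token."""
    return sorted(index.get(token, ()))
```

Nothing here cares whether the paths come from one repository or fifty. Prefix each path with a repository name and the same index spans contextual repos just as well.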

And these are just some of the software and platforms that had to be designed and developed to support the “monorepo” at Google. They also had to develop various practices (i.e., everything in HEAD, always) to support it. They have teams dedicated to maintaining this code, the platform, and the repository itself. The more complex or widespread proposed changes are, the dicier the situation gets. This is why Google employs teams of dedicated codebase maintainers.

Because all projects are centrally stored, teams of specialists can do this work for the entire company, rather than require many individuals to develop their own tools, techniques, or expertise. – ibid.

This brings us back to the rest of us. Do you have the resources to dedicate people to this? Is that the best investment of your people, your money, or your time? Maybe, but unlikely for the vast majority.

Developers must be able to explore the codebase, find relevant libraries, and see how to use them and who wrote them. Library authors often need to see how their APIs are being used. This requires a significant investment in code search and browsing tools. – ibid.

Among the problems Google is encountering, and investing resources in correcting and avoiding, is the blurring of lines between projects. To that end, they’ve been working on tooling to identify unnecessary dependencies. In 2011 they began giving library APIs a visibility setting that defaults to private - meaning that unless the developer of that project allows it, someone else can’t include it in their code.
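A default-private rule is simple to state in code. In this sketch the visibility table, the "*" wildcard for "anyone may depend on this", and the function name can_depend are all assumptions of mine; the only thing taken from the source is that a library nobody has opened up cannot be included by others.

```python
def can_depend(consumer, library, visibility):
    """Decide whether `consumer` may depend on `library`.

    visibility maps a library to the set of projects allowed to use it;
    "*" means public. A library with no entry is private by default.
    The table format is a hypothetical illustration.
    """
    if consumer == library:
        return True  # a project can always use its own code
    allowed = visibility.get(library, set())
    return "*" in allowed or consumer in allowed
```

A build tool would run this check for every declared dependency and fail the build on the first one that returns False.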

That may sound bad, but it really isn’t. It makes sharing it a conscious decision, as it should be considering the costs to you when others depend on that code. It also helps, if in a small way, counter the other problem of blurred boundaries: code-is-the-documentation. The problem they, and many others, have found is that “you have the source, go read it” not only means “I don’t write docs”, but also that the reader then often relies on what should be an internal implementation. This is tighter coupling.

But perhaps the largest recognition needed in the “Google uses a monorepo” argument is that they don’t truly run one. They also have Git repositories. The Android codebase is split across some 800 Git repos.

So, before accepting “Google does it”, look at all that Google had to, and continues to have to, do to make it work. Before accepting the appeal to authority, look deeper. Realistically, the tooling Google wrote didn’t have to require a monorepo. Had they chosen back in the ’90s to go with contextual repos, they would likely have built the very same tools, but those tools would have worked across repositories instead.

Most of us won’t be working at the sheer codebase scale of Google. But that doesn’t mean the problems that need addressing are thus avoided. Either your entire codebase fits in the head of each developer, or you are going to need tooling. Back in 1999 there was not a lot of tooling available - especially for OSS work and workers. Today is a different story. But that belongs in yet another installment in this series.