Code Repositories and Yak Shaving
There is a growing wave of argument over how to organize software repositories: the god-repo (commonly known as a monorepo) versus focused repositories. By and large there is no one best way. Rather, it is a question of which yaks you want to shave. This is a meager attempt to explore them without a general recommendation.
Definitions
Any reasonable dissertation or discussion needs the terms to be well defined. So, let us start there. A monorepo is where all of your code resides in a single giant repository, regardless of whether all the code is integrated or even related. Indeed, that a monorepo contains unrelated code is mandatory for the definition. The other case is where the various codebases live in dedicated repositories.
To be clear, there are multiple ways to organize multiple repositories. There is the extreme case where every tool, library, and bit of coherent code is in a unique, non-shared place. This is often used by monorepo proponents as the only alternative to a monorepo. In around two decades I’ve never seen it, so I think it safe to say it is rare at best.
In between these two extremes lies what we see: code repositories have types and groupings. For example, a small company may use GitHub to house their code, and will have a minimum of one organization under which its code repositories lie.
However, it may also have multiple organizations, such as one GH org per team, one org per context, or a mix of both. This may seem initially like a minor difference but it is indeed quite significant as we will see when contrasting and comparing assertions by proponents of either monorepos or focused repos. In this article I’ll refer to the “not-monorepo” as “contextual repos” for reasons that will become clear.
Now, as to how I’ll describe and explain each “side”’s position. When arguing for a thing there is a lot of arguing against another thing. I don’t accept that form of discussion as proper. For example, if you are saying “Car A is good because Car B does…” you are not making a case for Car A, just “not Car B”. I find that unpacking these is more useful and instrumental in analyzing the arguments being made.
With that base level of ground established, let us delve into the assertions for/against each as posited by the respective proponents.
Monorepos
There are several claims made as to the benefits of a monorepo, but they share a common theme. One of the better pieces on monorepos, and one I see proffered often in favor of them, is Dan Luu’s piece on it. The first thing to realize about this piece is that it is not, by Dan’s own assertion, an argument in favor of monorepos. It is in fact an attempt to explain why one might not view a monorepo as a terrible idea. Yet it is often referred to as such an argument, and it does offer reasons Dan thinks a monorepo can be a good choice.
However, a lot of the arguments Dan uses to favor/demonstrate benefits of monorepos are not very consistent. By that I mean that many of them are based on the assumption that multiple repos simply must be a certain way and produce certain effects.
No-Context Needed
The first argument in favor offered is that you don’t have to worry about context with a monorepo. Dan doesn’t explicitly say this, but his argument does.
For example:
With multiple repos, you typically either have one project per repo, or an umbrella of related projects per repo, but that forces you to define what a “project” is for your particular team or company, and it sometimes forces you to split and merge repos for reasons that are pure overhead. For example, having to split a project because it’s too big or has too much history for your VCS is not optimal.
This division of code is context. That context may be a team, a business function, a component, or a service, but it is still code context.
Side note: Things which occur “sometimes” in one style are not reasons for another style to be good. Nor are they often limited to that “other” style.
In this case, Dan is making the assertion that:
a) having multiple repositories forces you to define a project (context)
b) this is bad
c) monorepos don’t do this
However, I would counter that even if we accept (a), (b) and (c) do not follow. First, let us start with defining a project. This is not a bad thing, nor is it unnecessary overhead. All code has context. Context is key to understanding the code. As noted above, if we consider a small development company using GitHub, they can easily have multiple repositories in a company-wide org. This means that each contextually complete grouping of source code files goes into a repo expressing that context. They might break it down into more fine-grained contexts such as their business domain dictates.
All code has context. If you don’t, or can’t, see the context, your code quality will suffer regardless of whether it all sits in one monorepo or sits in a dozen repositories. If you can’t structure your code by context, then repository style isn’t going to help you, as you don’t understand the domain or the code. This leads us into (c).
The implicit assertion that monorepos do not make you contextualize your code is only possibly true under one condition: your repo consists of one directory with all files in it, and no subdirectories. This is probably as rare as the thousands-of-individual-repos situation. Indeed, the example Dan uses later on demonstrates this is not the situation he is referring to. Even in a single-directory monorepo you’ll have a minimum of contextualization in the sense that you will have multiple files, because context is king in code structuring and development.
So regardless of repository style, you will need to have a firm grasp of the context for each code. If your monorepo is “allthethings”, you will have sub-directories such as “allthethings/front-end”, and “allthethings/api”, and so on. So we can dispense with the notion that monorepos are good/better because you don’t have to worry about context. You will need context either way, and context is crucial to quality code as well as cognitive space management.
Because of the ease of exposition I’ll use Go projects as the examples. In a monorepo the above contexts could be:
github.com/mycompany/allthethings/front-end
github.com/mycompany/allthethings/api
github.com/mycompany/allthethings/client
github.com/mycompany/allthethings/common-libs
with more context (directories) under each of those.
Whereas a company not using a monorepo might have:
github.com/mycompany/front-end
github.com/mycompany/api
github.com/mycompany/client
github.com/mycompany/common-libs
In this case each subproject in the monorepo is a dedicated repo.
Another option might be:
* github.com/mycompany/front-end/sales
* github.com/mycompany/front-end/support
* github.com/mycompany/api/v1
* github.com/mycompany/client/v1
* github.com/mycompany/common-libs/v1
Wherein the portion after ‘mycompany’ is an org, and the final is the
repo. I’m not arguing this is the best, merely pointing out some
possibilities. Notice the similarity in the URLs. The key here is that
the context still has to exist. Having all your things in allthethings
does not eliminate the need for, nor the existence of, code context.
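To make the similarity concrete, here is a minimal Go sketch. It assumes the example paths above; the constants and the contextOf helper are names I made up for illustration, not part of any real tool. It strips the layout-specific prefix from an import path and shows that what remains, the context, is the same in either style.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical prefixes taken from the example paths in this article.
const (
	monoRoot = "github.com/mycompany/allthethings/" // monorepo root
	orgRoot  = "github.com/mycompany/"              // org that holds contextual repos
)

// contextOf returns the part of an import path that names the code's
// context, whichever of the two layouts the path uses. The monorepo
// prefix is checked first because it also begins with the org prefix.
func contextOf(importPath string) string {
	if strings.HasPrefix(importPath, monoRoot) {
		return strings.TrimPrefix(importPath, monoRoot)
	}
	return strings.TrimPrefix(importPath, orgRoot)
}

func main() {
	mono := "github.com/mycompany/allthethings/api"
	contextual := "github.com/mycompany/api"
	// Both layouts yield the same context once the prefix is removed.
	fmt.Println(contextOf(mono) == contextOf(contextual))
}
```

The only thing that differs between the two styles is the prefix; the context you have to define and understand is identical.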
Now, going further you’ll see that Dan acknowledges you still need context:
With a monorepo, projects can be organized and grouped together in whatever way you find to be most logically consistent, and not just because your version control system forces you to organize things in a particular way.
However, he makes an error when throwing out the assertion that a VCS (such as Git) “forces” you to organize your code a certain way. As we’ve seen above, it does not. You are forced to use directories, but that is because we use a directory-based file system. So the hidden assertion here, that only a monorepo lets you contextualize your code the way you want, is false. Logically, the assertion cannot be true. If a version control system forces a certain context on you, it will force it on you at the repository level - because that is all the context it has - and thus would force it equally on any repository - whether “mono” or not.
To understand this point a bit better we have to go back to the understanding that a monorepo must contain unrelated code. As such, every monorepo is still just a repository. You exchange repository level context for sub-directory level context - and the software doesn’t know or care for your code’s context. As such we can dispense with this assertion as well.
Navigating Repositories
With this assertion not being true, the “side-effects” go with it. The argument that you can navigate a monorepo easily on disk, but not multiple repos, is, to be blunt, absurd. All repos you use will need to be checked out, they will go onto your filesystem, and you will navigate them the same way whether it be one or a hundred.
The claim that this is “simpler” for dependencies is likewise untrue. You still have files on your filesystem to navigate to. Your filesystem doesn’t care if you go to “$HOME/projects/allthethings/commonlibs/server/api/client” or “$HOME/projects/mycompany/commonlibs/server/api/client”.
A side note about the offered benefit: it is again caveated with “often”. Anytime you caveat a benefit or disadvantage with terms which show it is not “always”, you are not describing a benefit of the system.
Specifically, Dan says:
A side effect of that side effect is that, with monorepos, it’s often the case that it’s very easy to get a dev environment set up to run builds and tests.
Initial Setup
Yet it is not difficult to do that with contextual repos, and can often be “easier”. A monorepo may contain multiple languages, and need multiple test suites and software installed that you, as someone working on one part of it, don’t need. So you may sometimes have more work to do for stuff you won’t be using, than if you checkout a contextual repo. A larger monorepo may take much longer to checkout than a contextual repo. So because this alleged benefit is demonstrably not a given, we can’t accept it as a benefit of monorepos.
Dependency Management
Dan goes on to make a common claim that monorepos make dependency management easy, whereas multiple repos make it hard. To wit:
This probably goes without saying, but with multiple repos, you need to have some way of specifying and versioning dependencies between them.
No you don’t, actually. Versioning is an attempt at a non-technical problem. Versioning code is essentially a workaround for using shared code, and a monorepo doesn’t change that fact. If you share no code with anyone else, then you have no need for any form of versioning other than your released software. No, I’m not arguing for that, merely pointing out the underlying reality we so often ignore. This also can be described as a result of unbounded or improperly bounded contexts.
The assertions made here are that:
1) Versioning is mandatory
2) Monorepos make versioning easy
He then goes on to argue that because a monorepo is atomic across projects, it always builds properly. He states: “Dependencies still need to be specified in the build system, but whether that’s a make Makefiles or bazel BUILD files, those can be checked into version control like everything else. And since there’s just one version number, the Makefiles or BUILD files or whatever you choose don’t need to specify version numbers.”
Yet nothing prevents contextual repos from doing the same thing. There is a hidden assertion here I want to tease out: you should never have to think about your dependencies or context. That is what Dan is arguing for when he says the build files need no version numbers because there is just one version number. Indeed, as his example scenario, which we will get to soon, describes, this assertion is untrue.
I’m going to deviate from Dan’s ordering here to keep the context local. We’ll get to tooling shortly, and go right to cross-project work.
Cross-Project Interaction
Now Dan calls this “cross-project changes”, but I think “interaction” works better because it exposes the broader scope of the scenario. He begins with “With lots of repos, making cross-repo changes is painful”. Yet this is, like many things, not solely an effect of contextual repos. He goes on to essentially hand-wave monorepos as being godsends in that “you simply do it”. To explain that better, let us jump to his example scenario:
I needed to update [Project A], but to do that, I needed my colleague to fix one of its dependencies, [Project B]. The colleague, in turn, needed to fix [Project C]. If I had had to wait for C to do a release, and then B, before I could fix and deploy A, I might still be waiting. But since everything’s in one repo, my colleague could make his change and commit, and then I could immediately make my change.
I guess I could do that if everything were linked by git versions, but my colleague would still have had to do two commits. And there’s always the temptation to just pick a version and “stabilize” (meaning, stagnate). That’s fine if you just have one project, but when you have a web of projects with interdependencies, it’s not so good.
Note that the steps involved in the comparisons don’t actually change. Whether projects A, B, and C are in one repo or three doesn’t change the fact that you have to identify and track down what all needs changed, and that you have three locations to change things.
In the monorepo:
* Project A needs changed
* Author determines somehow that Project B also needs changed
* Author of Project B is told about it, then finds that Project C needs changed
* Author of Project C makes changes to C, then to B, then commits these changes
* Author A is then able to make their change
In contextual repos:
* Project A needs changed
* Author determines somehow that Project B also needs changed
* Author of Project B is told about it, then finds that Project C needs changed
* Author of Project C makes changes to C, then to B, then commits these changes
* Author A is then able to make their change
Functionally, nothing actually changed. Neither repo style makes a difference here. The hidden assertion in this proposed benefit is that somehow the author of Project A knew about B, and knew because of the monorepo. Author A still had to wait for Author B to make changes to Project C. Whether you call it a commit or a release is irrelevant to the fact of waiting for someone else to make the change.
But a monorepo does not provide that context on its own. That context is provided by something else be it personal knowledge due to interaction between authors, organizational context, or a tool which indexes what code relies on what other code. An example of the latter is godoc.org which can show you what other packages it knows about that import a given package.
Notice that godoc.org is not working against a monorepo, but against multiple repositories from multiple sources and provides this knowledge. A monorepo, by virtue of being in a single directory parent, does not provide this knowledge. This is actually demonstrated in the example Dan gives: Author A didn’t know Project C needed updated - only project B. You could run godoc.org against an internal monorepo, and you can run it against a collection of internal repos. There is nothing inherent to monorepos which make this easier or unnecessary.
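As a rough sketch of what such an index does, consider the Go program below. It is not how godoc.org is implemented; the dependents function and the project paths are hypothetical and exist only for illustration. Given a map of which project imports which, it walks the inverted graph to find everything that depends on a project, directly or transitively. Nothing in it knows or cares whether the paths live in one repo or many.

```go
package main

import (
	"fmt"
	"sort"
)

// dependents returns every project that depends on target, directly or
// transitively. imports maps a project to the projects it imports.
func dependents(imports map[string][]string, target string) []string {
	// Invert the graph: importedBy[x] lists the projects that import x.
	importedBy := make(map[string][]string)
	for proj, deps := range imports {
		for _, d := range deps {
			importedBy[d] = append(importedBy[d], proj)
		}
	}

	// Breadth-first walk outward from target over the inverted graph.
	seen := make(map[string]bool)
	queue := []string{target}
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		for _, p := range importedBy[cur] {
			if !seen[p] {
				seen[p] = true
				queue = append(queue, p)
			}
		}
	}

	out := make([]string, 0, len(seen))
	for p := range seen {
		out = append(out, p)
	}
	sort.Strings(out) // stable output regardless of map iteration order
	return out
}

func main() {
	// Hypothetical projects mirroring the A -> B -> C chain in the example.
	imports := map[string][]string{
		"github.com/mycompany/a": {"github.com/mycompany/b"},
		"github.com/mycompany/b": {"github.com/mycompany/c"},
		"github.com/mycompany/c": nil,
	}
	fmt.Println(dependents(imports, "github.com/mycompany/c"))
}
```

Swap the paths for ones under “allthethings” and the function behaves exactly the same, which is the point: the knowledge comes from the index, not from the repository layout.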
On the subject of change tracking, Dan again asserts that because you have one source you can more easily track it. But this is bound by conditionals as well. Depending on what you need to track, having a global change number is a bad thing. For example, consider components A and B and that they are entirely unrelated. When A is changed, B’s change number changed - without B having any changes. That complicates tracking changes. So this too, is neither a given nor unconditional.
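A small Go sketch of that effect follows. The changeLog type is hypothetical and models no particular VCS; it just keeps one repository-wide change number alongside a per-component count so the two can be compared.

```go
package main

import "fmt"

// changeLog is a hypothetical tracker used only for illustration. It keeps
// one repository-wide change number and one change count per component.
type changeLog struct {
	global       int
	perComponent map[string]int
}

func newChangeLog() *changeLog {
	return &changeLog{perComponent: make(map[string]int)}
}

// commit records a change that touched exactly one component.
func (c *changeLog) commit(component string) {
	c.global++
	c.perComponent[component]++
}

func main() {
	log := newChangeLog()
	for i := 0; i < 3; i++ {
		log.commit("A") // only component A ever changes
	}
	// The global number moved three times even though B never changed.
	fmt.Println("global change number:", log.global)
	fmt.Println("changes to B:", log.perComponent["B"])
}
```

If all you have is the global number, you cannot tell from it alone whether B changed; you need the per-component view anyway.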
Dan concludes that example with “Forcing dependees to update is actually another benefit of a monorepo”. Yet again, nothing about a monorepo forces that. I think it fair to assume you’ll have integration testing regardless of repo style. If you lack the above mentioned contextual knowledge you’ll discover it in failing tests - which could well be how Author A in the example discovered it in the first place. But this assertion assumes that forcing everyone to always change is always a good thing, which time has shown is not true.
So even this last assertion on cross-project interaction fails the test of a) being true only of monorepos and b) always being desirable or true.
Tooling
Dan doesn’t really write much about the tooling. So let us start with the opening:
The simplification of navigation and dependencies makes it much easier to write tools. Instead of having tools that must understand relationships between repositories, as well as the nature of files within repositories, tools basically just need to be able to read files (including some file format that specifies dependencies between units within the repo).
So far, we’ve seen that neither navigation nor dependencies are inherently “easier” or “simpler”, which immediately places the assertion on unstable ground. But read the rest of it carefully. The immediate claim is that you don’t need to have tools that understand the relationship between repositories. But it is immediately followed by a statement that you need to have tools that understand the relation between projects. There is no conceptual or functional difference between those two. Specification of dependencies is specification of dependencies. Indeed, the parenthetical demonstrates this.
Back to Go, and specifically Godep. Godep defines a file (in JSON) which specifies what your dependencies are, where they are, and what “version” to use. In this case, the version is usually a git reference for git repositories, as Git is more common from what I’ve seen. It is a file which “specifies dependencies between units” of code. It does it across contextual repos, and could in theory be done in a monorepo if one so desired. This goes back to the fact that context is still context even in a monorepo. You still have to specify where the shared code is coming from, whether it is in contextual directories in a monorepo or contextual repos.
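Here is a minimal Go sketch of reading such a file. The field names (ImportPath, Deps, Rev) follow the Godeps.json layout as I understand it, so treat them as an assumption; the sample paths and revisions are made up. One dependency points at a contextual repo and the other at a directory inside a monorepo, and the code reading them does not care which is which.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// dependency and manifest mirror the Godeps.json layout as I understand it.
// The field names are an assumption, not a statement of Godep's schema.
type dependency struct {
	ImportPath string
	Rev        string
}

type manifest struct {
	ImportPath string
	Deps       []dependency
}

// sample is hypothetical data: one dependency lives in a contextual repo,
// the other in a directory of a monorepo.
const sample = `{
	"ImportPath": "github.com/mycompany/api",
	"Deps": [
		{"ImportPath": "github.com/mycompany/common-libs", "Rev": "3f2a9c1"},
		{"ImportPath": "github.com/mycompany/allthethings/client", "Rev": "9b7e4d0"}
	]
}`

// parseManifest decodes a dependency file into a manifest.
func parseManifest(raw []byte) (manifest, error) {
	var m manifest
	err := json.Unmarshal(raw, &m)
	return m, err
}

func main() {
	m, err := parseManifest([]byte(sample))
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, d := range m.Deps {
		fmt.Printf("%s @ %s\n", d.ImportPath, d.Rev)
	}
}
```

Either way you end up with a file that says where the shared code comes from and which revision to use.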
The bulk of the rest of the tooling section is not really about monorepos, but about a build tool. Yet even there, the benefits listed are not provided solely because of or by monorepos. He asserts:
It’s theoretically possible to create a build system that makes building anything, with any dependencies, simple without having a monorepo, but it’s more effort, enough effort that I’ve never seen a system that does it seamlessly
Now I don’t know when this article was written, as it doesn’t include that, but I’ve seen them. One example is Godep+standard Go build tooling. Pants does it also, and Dan actually links to it as an example of monorepos making it easy to do. So in reality, he has in fact seen a system that does it seamlessly. Make can also do it, though I can see how some might think Make is not simple.
Dan doesn’t address the downsides, which I think is a bad idea. He claims to be not making the case for monorepos, but when you leave out the disadvantages you are doing just that - albeit poorly in my opinion.
So let us jump to another piece, much of which has been covered above but brings some clarity and more nuanced discussion to the subject. For this, let us consider David Maciver’s post recommending it for almost everyone.
First, David points out the definition of a monorepo being a collection of distinct projects - essentially saying the same thing as “unrelated projects”. To wit:
In particular a monorepo should be organised with a number of subdirectories that each look more or less like the root directory of what would have otherwise been its own repo (possibly with additional directories for grouping projects together, though for small to medium sized companies I wouldn’t bother).
So David also points out that you still need context even in a monorepo. As we have seen, this is not a subtle point. He then conveniently lists what he thinks are the best advantages:
- It is impossible for your code to get out of sync with itself.
- Any change can be considered and reviewed as a single atomic unit.
- Refactoring to modularity becomes cheap.
The first is conditional, though he presents it as always. The example from Dan shows that your codebase can indeed be out of sync. Even a monorepo requires something and/or someone to track dependent changes. It may be code indexing, it can be integration testing, it can be someone piping up in a code review that some other project needs to be changed because he knows he uses that bit of code that is changing. Regardless of the method for detection, nonetheless it must still exist. There is nothing magical about placing all projects under a single directory - after all, whether you have one repo or a million on your system, they are all under the root filesystem.
The second advantage listed is true but not a given. All changes can be considered and reviewed as a single atomic unit regardless of repo style, assuming your VCS allows things like PRs. Where it goes wrong is that the context for the benefit is off. The benefit more accurately stated is that “all changes made can be done in a single PR”. However, this is still only true if you make proper use of VCS features.
In a monorepo if you need several people to make changes to various “unrelated” projects you don’t get single-PR for free - you have to, for example, make a shared branch they all work on and make commits to. So while you can review all the changes eventually in a single PR, the overall amount of review work doesn’t change. So with that caveat, we will grant that as a benefit. However, that benefit isn’t strictly limited to monorepos either. Git, for example, has submodules and you can get the same effect by utilizing them. Personally, I’m not a huge fan of them but the desired benefit is there.
Since at that level in that specified context, I would view a monorepo to be simpler than Git submodules, I am willing to keep that as a limited benefit. Keep in mind that other VCS may offer the same thing, I’m not going through all of them.
However, I do not agree it is an “always beneficial” thing to have. If you have two “distinct” projects that have such tight coupling that you cannot make a change to one without simultaneously making changes to the other, you do not have two distinct projects. Nor is this problem solved entirely by monorepos by design either. The moment you import/include a third party library or project, you’ve broken that. Sure, most who try monorepos wind up essentially “vendoring” the third party source. But that is an added layer of complexity which exposes that the underlying problem still exists.
Finally, on refactoring being cheap. The previous paragraph should explain quite well why monorepos do not ensure this. If you have, as David explained, distinct projects in your monorepo then you don’t have a functional difference when it comes to refactoring. A truly distinct project is just as easy to refactor when in its own repo as it is in a monorepo.
What matters on ease of refactoring is how you approach code, and how well you maintain your boundaries. If you have good, strict boundaries you can refactor your code regardless of repo style. As such, this is not a benefit inherent to monorepos.
Disadvantages or Costs
Ok, now on to the disadvantages. Dan skips out on them, but David tackles them. So let us look first at what David lists. Now strictly speaking he doesn’t list disadvantages, just “exceptions” where a monorepo may not be best for you. Nonetheless, let us look at them.
You Don’t Own All the Code, Or Share Yours
As David says, if you work with open source in your codebase you will likely need to utilize outside repositories. While David says it in terms of open-sourcing your code, it is true in both directions.
He also says:
Ideally I’d like to solve this problem with tooling that made it better to mirror directories of a monorepo as individual projects, but I’m not aware of any good systems that do that right now.
As before, I can’t say in good conscience that Git submodules qualify as a “good system”, but they are functionally the same as what he is saying.
Access Control
The moment you need isolation of repo access within the company, monorepos are de facto disadvantaged until someone puts out a VCS which allows you to have per-directory ACLs in a single repository. Now many argue, including David, that you shouldn’t have to worry about internal users having access. However, there are a couple counter-arguments here.
First is that most breaches occur from within, so it isn’t strictly a matter of “only hire trustworthy people”. You can’t actually know that everyone is, and an individual’s system can be compromised - or in the case of a mobile system stolen or lost. If all of your source code is downloaded to a laptop which is then compromised the attacker has it all. Most breaches are not externally sourced, so it is unwise to always treat internal as “safe” or “non-hostile”.
David asserts that this isn’t a big deal provided you don’t put secrets in the code. However, code can be analyzed with the intent to break the end result - something rarely done internally. It can also expose trade secrets. This can be a really big deal because it can destroy a company based on it. There are other issues with code being exposed which are thus financial risks to a company.
There is also the regulatory aspect of a breach and/or lack of isolation. Not everyone has the “benefit” of being wild-westish about their code - there are good regulatory reasons for isolation. The more regulated your industry is, the more likely this is to be a problem that can indeed destroy your company.
Related to this is the contractor issue David mostly dismisses. NDAs are not always permissible, again regulatory agencies often have a hand in that, but also there is the case of you having access to code that you don’t have the legal authority to give to others. This is a close relative of the “you don’t own all the code” - just in a non-OSS context.
In a non-governmental regulatory context, consider ISO defined configuration management where each software component is to have a unique version identifier. You lose this in a monorepo unless you add per-component version numbers, and thus add (back) complexity a monorepo is intended to remove.
Boundary Blurring
Not everything should be easy and frictionless. We’ve seen a bit of this play out as being listed as an advantage for monorepos, but it is also a disadvantage as I see it. Software and system components should have clear and strict boundaries. When you lose those boundaries the context increases, and the larger the context the larger the scope for problems and the higher the cognitive load is.
If you have two “projects” and one cannot be changed without the other, it is arguable you do not actually have two disconnected projects, but two pieces of the same project. A common place to see this pattern is in pubsub or client/server models. Often a message being passed in pubsub will require changes on both server and client in order to remain consistent.
However, this is not a given. It is entirely possible, and the how has been known for many years, to build such systems that are not so inter-dependent. In many ways self-describing formats such as XML or JSON are an attempt to break that cohesion down. The same can be seen in databases by moving to so-called “schema-less” and/or “NoSQL” databases. The more tightly coupled two components are, the more they belong in the same repo and the less you have two components.
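As a small illustration of that loosening, here is a Go sketch. The OrderEvent message and its fields are hypothetical. It relies on the fact that Go’s encoding/json ignores fields the target struct does not declare, so a client that only knows the old message shape keeps working when the server starts sending an extra field.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// OrderEvent is a hypothetical message as seen by a client. The client
// declares only the fields it actually needs.
type OrderEvent struct {
	ID    string  `json:"id"`
	Total float64 `json:"total"`
}

// decode parses a raw message. Fields that OrderEvent does not declare are
// skipped by encoding/json rather than treated as errors.
func decode(raw []byte) (OrderEvent, error) {
	var ev OrderEvent
	err := json.Unmarshal(raw, &ev)
	return ev, err
}

func main() {
	oldMsg := []byte(`{"id":"o-1","total":9.5}`)
	// The server later adds a "currency" field; the client is not rebuilt.
	newMsg := []byte(`{"id":"o-1","total":9.5,"currency":"EUR"}`)

	a, errA := decode(oldMsg)
	b, errB := decode(newMsg)
	if errA != nil || errB != nil {
		fmt.Println("decode error:", errA, errB)
		return
	}
	// The client sees the same event either way.
	fmt.Println(a == b)
}
```

The server and client here can change on their own schedules, in their own repos or not, because the boundary between them tolerates additions.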
But by using a monorepo without strict boundaries there is a natural trend toward tighter coupling. Given the frequency with which “I can change all the things in one fell swoop” is listed as the main advantage of monorepos, I think it reasonable to say those boundaries are mostly blurry. The easier it is to tightly couple, the more it will happen. Monorepos incentivize tight coupling, and for that reason I list it as a disadvantage.
Cognitive Overload
This disadvantage is, by definition, quite scale dependent. At the smaller end of the scale, cognitive context is easily manageable regardless of how many repos you have. But as the number of distinct projects grows, cognitive load increases and eventually you have to swap stuff out. With every unrelated project under one roof, that cognitive space can grow quite quickly and be overwhelmed with things you don’t care, or need, to know about. It happens in nearly invisible ways. That frequent pull request you have to look at but have no input on because you don’t have the context for it is but one example. The constant build notices, especially failures, on the monorepo that go to you but have nothing to do with you and serve only to needlessly interrupt your day, are another.
Operational Considerations
Not all of your code is software. With the advent and rise of configuration management and infrastructure management being “code”, you now have entirely unrelated code that adds to it. Of all the posts I’ve read on mono vs contextual, none have addressed this. They are always focused on software. But from an operational perspective, these bits are also in repositories, and rarely the same one as the software. This alone eliminates the touted benefit of “one place for all changes” because not all changes take place in software code.
That isn’t to say you can’t have a software “monorepo” and a “configuration/infrastructure” repo. But then, you aren’t really running a monorepo; instead you’ve reduced it to two (or three) repos. So you still have to track changes across them in a world of proper configuration management. If you need other separate repositories, such as those mentioned above under access control or code ownership, you’ve simply gained the problems of a monorepo without losing the problems of contextual repos.
It is like deciding to buy a large SUV because it means managing only one vehicle, one that lets you ride solo or haul friends, groceries, and furniture; then adding a pickup truck because not everything fits in an SUV; then adding a motorcycle because it lets you use the carpool lane when travelling solo. Now you’ve got three vehicles and are paying for all of them.
Summary and Coming Next
This turned out much longer than planned, so I’ll deal with the discussion for/against contextual repos in another installment. However, in running through the asserted benefits of monorepo, we can see that these assertions are not limited to monorepos, while exposing new problems. That isn’t a surprise to most, I would hope. If there was a true “one best solution” we would all be using it and there would be no discussion or controversy.
However, this one is a bit unusual in that the pro/con assertions don’t really hold up to scrutiny on either side. That tells me something about the entire thing, but that will have to wait for the next bit. Now go get a snack or drink of your choice, you’ve earned it after getting to the end of this one. :)
Now, once you have digested that and taken a break, feel free to go to Part Two.