  1. title: On Monolithic Repositories
  2. url: http://gregoryszorc.com/blog/2014/09/09/on-monolithic-repositories/
  3. hash_url: 7baab90810027d42259e14006a420074
  4. <p>When companies or organizations deploy version control, they have to
  5. make many choices. One of them is how many repositories to create.
  6. Your choices are essentially a) a single, monolithic repository that
  7. holds everything b) many separate, smaller repositories that hold
  8. all the individual parts c) something in between.</p>
  9. <p>The prevailing convention today (especially in the open source
  10. realm) is to create many separate and loosely coupled repositories,
  11. each repository mapping to a specific product or service. That does
  12. seem reasonable: if you were organizing files on your filesystem,
  13. you would group them by functionality or role (photos, music,
  14. documents, etc.). And version control tools are functionally
  15. filesystems. So it makes sense to draw repository boundaries at
  16. directory/role levels.</p>
  17. <p>Further reinforcing the separate repository convention is the
  18. scaling behavior of our version control tools. Git, the popular
  19. tool in open source these days, doesn't scale well to very large
  20. repositories due to, among other things, not having narrow clones
  21. (fetching a subset of files). It scales well enough to the
  22. overwhelming majority of projects. But if you are a large
  23. organization generating lots of data (read: gigabytes of data over
  24. hundreds of thousands of files and commits) for version control,
  25. Git is unsuitable in its current form. Other tools (like Mercurial)
  26. don't currently fare that much better (although Mercurial has plans
  27. to tackle these scaling vectors).</p>
  28. <p>Despite popular convention and even limitations in tools, companies
  29. like Google and Facebook opt to run large, monolithic repositories.
  30. Google runs Perforce.
  31. <a href="https://code.facebook.com/posts/218678814984400/scaling-mercurial-at-facebook/">Facebook is on Mercurial</a>,
  32. or at least is in the process of migrating to Mercurial.</p>
  33. <p>Why do these companies run monolithic repositories?
  34. In <a href="http://www.perforce.com/sites/default/files/still-all-one-server-perforce-scale-google-wp.pdf">Google's words</a>:</p>
  35. <p><em>We have a single large depot with almost all of Google's projects
  36. on it. This aids agile development and is much loved by our users,
  37. since it allows almost anyone to easily view almost any code, allows
  38. projects to share code, and allows engineers to move freely from
  39. project to project. Documentation and data is stored on the server
  40. as well as code.</em></p>
  41. <p>So, monolithic repositories are all about moving fast and getting things
  42. done more efficiently. In other words, <strong>monolithic repositories
  43. increase developer productivity.</strong></p>
  44. <p>Furthermore, monolithic repositories are also more compatible with
  45. the ebb and flow of large organizations and large software projects.
  46. Components, features, products, and teams come and go, merge and split.
  47. The only constant is change. And if you are maintaining separate
  48. repositories that attempt to map to this ever-changing organizational
  49. topology, you are going to have a bad time. Either you'll be
  50. constantly copying, moving, merging, and splitting data and repositories,
  51. or your repositories will be organized in a very non-logical and
  52. non-intuitive manner. That translates to overhead and lost productivity.
  53. I think that monolithic repositories handle the realities of large
  54. organizations much better. Big change or reorganization you want
  55. to reflect? You can make a single, atomic, history-preserving commit
  56. to move things around. I think that's much more manageable, especially
  57. when you consider the difficulty and annoyance of history-preserving
  58. changes across repositories.</p>
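<p>To make the "single, atomic, history-preserving commit" idea concrete, here is a minimal toy model in Python. It is not how any real version control tool stores data: the <code>Snapshot</code> type, the <code>move_subtree</code> helper, and the example paths and content ids are all hypothetical. It only illustrates that a reorganization inside one repository can be recorded as one new snapshot in which paths change but content identities do not.</p>

```python
from typing import Dict, List

# Toy model (an assumption for illustration): a commit is a snapshot that
# maps each file path to an opaque content id.
Snapshot = Dict[str, str]


def move_subtree(snapshot: Snapshot, old_prefix: str, new_prefix: str) -> Snapshot:
    """Return a new snapshot with every path under old_prefix moved to new_prefix.

    Content ids are carried over untouched, so each file can still be traced
    back through earlier snapshots. The input snapshot is not modified.
    """
    old = old_prefix.strip("/") + "/"
    new = new_prefix.strip("/") + "/"
    moved: Snapshot = {}
    for path, content_id in snapshot.items():
        if path.startswith(old):
            path = new + path[len(old):]
        if path in moved:
            raise ValueError(f"move would overwrite existing path: {path}")
        moved[path] = content_id
    return moved


# Hypothetical repository history: one snapshot per commit.
history: List[Snapshot] = [
    {
        "teams/search/indexer.py": "c1",
        "teams/search/README": "c2",
        "teams/ads/bid.py": "c3",
    }
]

# The whole reorganization is a single new entry in the history.
history.append(move_subtree(history[-1], "teams/search", "platform/search"))
```

<p>Doing the same across separate repositories would mean one change that deletes the files in the first repository and another that adds them to the second, with nothing in either history linking the two.</p>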
  59. <p>Naysayers will decry monolithic repositories on principled and practical
  60. grounds.</p>
  61. <p>The principled camp will say that separate repositories
  62. constitute a loosely coupled (dare I say service oriented) architecture
  63. that maps better to how software is consumed, assembled, and deployed
  64. and that erecting barriers in the form of separate repositories
  65. deliberately enforces this architecture. I agree. However, you can
  66. still maintain a loosely coupled architecture with monolithic
  67. repositories. The Subversion model of checking out a single tree
  68. <em>from a larger repository</em> proves this. Furthermore, I would say
  69. architecture decisions should be enforced by people (via code review,
  70. etc.), not via version control repository topology. I believe this
  71. principled argument against monolithic repositories to be rather weak.</p>
  72. <p>The principled camp living in the open source realm may also decry
  73. monolithic repositories as an affront to the spirit of open source.
  74. They would say that a monolithic repository creates unfairly strong
  75. ties to the organization that operates it and creates barriers to
  76. forking, etc. This may be true. But monolithic repositories don't
  77. intrinsically infringe on the
  78. <a href="https://www.gnu.org/philosophy/free-sw.html">basic software freedoms</a>;
  79. organizations do. Therefore, I find this principled argument rather
  80. weak.</p>
  81. <p>The practical camp will say that monolithic repositories just don't
  82. scale or aren't suitable for general audiences. These concerns are
  83. real.</p>
  84. <p><em>Fully</em> distributed version control systems (every commit on every
  85. machine) definitely don't scale past certain limits. Depending on your
  86. repository and user base, your scaling limits include disk space
  87. (repository data terabytes in size), bandwidth (repository data terabytes
  88. in size), filesystem (repository hundreds of thousands or millions of
  89. files), CPU and memory (operations on large repositories take too
  90. many system resources), and many heads/branches (tools like Git and
  91. Mercurial don't scale well to tens of thousands of heads/branches).
  92. These limitations with fully distributed version
  93. control are why distributed version control tools like Git and
  94. Mercurial support a partially-distributed mode that behaves more like
  95. the classical client-server model employed by tools such as Subversion,
  96. Perforce, etc. Git supports shallow clone and sparse checkout.
  97. Mercurial supports shallow clone (via remotefilelog) and has planned
  98. support for narrow clone and sparse checkout in the next release or
  99. two. Of course, you can avoid the scaling limitations of distributed
  100. version control by employing a non-distributed tool, such as Subversion.
  101. Many companies continue to reach this conclusion today. However,
  102. users adapted to the distributed workflow would likely be
  103. up in arms (they would probably use tools like hg-subversion or git-svn
  104. to maintain their workflows). So, while scaling of version control
  105. can be a real concern, there are solutions and workarounds. However,
  106. they do involve falling back to a partially-distributed model.</p>
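<p>As a rough sketch of what the partially-distributed mode looks like with Git, the Python helper below builds (but does not run) the commands for a shallow clone combined with a sparse checkout. The helper name, URL, and directory names are hypothetical. It assumes Git 2.25 or newer, which added <code>git clone --sparse</code> and the <code>git sparse-checkout</code> command; older versions configure sparse checkout differently. Note that a shallow clone limits how much history is fetched and a sparse checkout limits which files appear in the working directory; neither is a narrow clone in the sense used earlier.</p>

```python
from typing import List, Sequence


def shallow_sparse_commands(
    url: str, dest: str, directories: Sequence[str], depth: int = 1
) -> List[List[str]]:
    """Build Git commands for a shallow clone limited to a few directories.

    Assumes Git >= 2.25. The commands are returned as argument lists so they
    can be inspected, or passed one by one to subprocess.run(cmd, check=True).
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    if not directories:
        raise ValueError("at least one directory is required")
    # Shallow clone: fetch only the most recent `depth` commits. --sparse
    # starts the working directory with just the top-level files.
    clone = ["git", "clone", "--depth", str(depth), "--sparse", url, dest]
    # Sparse checkout: populate only the listed directories.
    narrow = ["git", "-C", dest, "sparse-checkout", "set", *directories]
    return [clone, narrow]


if __name__ == "__main__":
    commands = shallow_sparse_commands(
        "https://example.com/monorepo.git",  # placeholder URL
        "monorepo",
        ["services/search", "docs"],
    )
    for command in commands:
        print(" ".join(command))
```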
  107. <p>Another concern with monolithic repositories is user access control. You
  108. inevitably have code or data that is more sensitive and want to limit
  109. who can change or even access it. Separate repositories seem to
  110. facilitate a simpler model: per-repository access control. With
  111. monolithic repositories, you have to worry about per-directory/subtree
  112. permissions, an increased risk of data leaking, etc. This concern is
  113. more real with distributed version control, as distributed data and
  114. access control aren't naturally compatible. But these issues can be
  115. resolved. And if the tooling supports it, there is only a semantic
  116. difference between managing access control across repositories and
  117. across components of a single repository.</p>
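<p>One way to picture per-directory/subtree permissions is a policy keyed by directory, where the most specific matching directory decides who has access. The Python sketch below is only an illustration under that assumption: the <code>POLICY</code> table, group names, and paths are invented, and real tools (and the harder problem of keeping restricted data off machines that clone the repository) are more involved.</p>

```python
from typing import Dict, FrozenSet, Iterable, Optional

# Hypothetical policy: directory -> groups allowed to access it.
# The empty string stands for the repository root.
POLICY: Dict[str, FrozenSet[str]] = {
    "": frozenset({"engineering"}),
    "payments": frozenset({"payments-team"}),
    "payments/keys": frozenset({"security"}),
}


def governing_rule(path: str, policy: Dict[str, FrozenSet[str]]) -> Optional[str]:
    """Return the deepest directory in the policy that contains path."""
    parts = path.strip("/").split("/")
    # Try the full path first, then each parent, ending at the root ("").
    # Matching whole components keeps "paymentsx" from matching "payments".
    for end in range(len(parts), -1, -1):
        prefix = "/".join(parts[:end])
        if prefix in policy:
            return prefix
    return None


def can_access(
    user_groups: Iterable[str],
    path: str,
    policy: Dict[str, FrozenSet[str]] = POLICY,
) -> bool:
    """Allow access if the user is in a group named by the governing rule."""
    rule = governing_rule(path, policy)
    if rule is None:
        return False  # no rule applies: deny by default
    return bool(policy[rule] & set(user_groups))
```

<p>With this table, a member of <code>security</code> can reach <code>payments/keys/prod.pem</code> while a member of <code>payments-team</code> cannot. That is the same outcome as placing the keys in their own repository, which is the sense in which the difference is mostly semantic.</p>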
  118. <p>When it comes to repository hosting conventions, I agree with Google
  119. and Facebook: <strong>I prefer monolithic repositories</strong>. When I am interacting
  120. with version control, I just want to get stuff done. I don't want to
  121. waste time dealing with multiple commands to manage multiple
  122. repositories. I don't want to waste time or expend cognitive load
  123. dealing with submodule, subrepository, or big files management. I
  124. don't want to waste time trying to find and reuse code, data, or
  125. documentation. I want everything at my fingertips, where it can be
  126. easily discovered, inspected, and used. Monolithic repositories
  127. facilitate these workflows more than separate repositories and make
  128. me more productive as a result.</p>
  129. <p>Now, if only all the tools and processes we use and love would work
  130. with monolithic repositories...</p>