title: Notes from Facebook's Developer Infrastructure at Scale F8 Talk
url: http://gregoryszorc.com/blog/2015/03/28/notes-from-facebook%27s-developer-infrastructure-at-scale-f8-talk/
hash_url: add9aea6059be55754097cf61bd9eee2
<p>Any time Facebook talks about technical matters I tend to listen.
They have a track record of demonstrating engineering leadership
in several spaces. And, unlike many companies that just talk, Facebook
often gives others access to those ideas via source code and healthy
open source projects. It's rare to see a company operating on the
frontier of the computing field provide so much insight into their
inner workings. You can gain so much by riding their coattails and
following their lead instead of clinging to and cargo culting from
the past.</p>
<p>The Facebook F8 developer conference was this past week. All the
talks are <a href="https://developers.facebooklive.com/">now available online</a>.
<strong>I encourage you to skim the list of talks and watch
whatever is relevant to you.</strong> There's a little something for
everyone.</p>
<p>Of particular interest to me is the
<a href="https://developers.facebooklive.com/videos/561/big-code-developer-infrastructure-at-facebook-s-scale">Big Code: Developer Infrastructure at Facebook's Scale</a>
talk. This is highly relevant to my role as a Developer Productivity
Engineer at Mozilla.</p>
<p>My notes for this talk follow.</p>
<p><strong>"We don't want humans waiting on computers. We want computers waiting
on humans."</strong> (This is the common theme of the talk.)</p>
<p>In 2005, Facebook was on Subversion. In 2007 they moved to Git, deploying
a bridge so people worked in Git with a distributed workflow but
pushed to Subversion under the hood.</p>
<p>New platforms over time. Server code, iOS, Android. One Git repo
per platform/project -&gt; 3 Git repos. Initially no code sharing, so
no problem. Over time, code sharing between all repos. Lots of code
copying and confusion as to what is where and who owns what.</p>
<p>Facebook is mere weeks away from completing their migration to
consolidate the big three repos into a Mercurial monorepo. (See also
<a href="/blog/2014/09/09/on-monolithic-repositories/">my post about monorepos</a>.)</p>
<p>Reasons:</p>
<ol>
<li>Easier code sharing.</li>
<li>Easier large-scale changes. Rewrite the universe at once.</li>
<li>A unified set of tooling.</li>
</ol>
<p>Facebook employees run &gt;1M source control commands per day. &gt;100k
commits per week. VCS tool needs to be fast to prevent distractions
and context switching, which slow people down.</p>
<p>Facebook implemented sparse checkout and shallow history in Mercurial.
Necessary to scale distributed version control to large repos.</p>
<p><strong>Quote from Google: "We're excited about the work Facebook is doing with
Mercurial and glad to be collaborating with Facebook on Mercurial
development."</strong> (Well, I guess the cat is finally out of the bag:
Google is working on Mercurial. This was kind of an open secret for
months. But I guess now it is official.)</p>
<p>Push-pull-rebase bottleneck: if you rebase and push and someone beats
you to it, you have to pull, rebase, and try again. This gets worse
as commit rate increases and people do needless legwork. <strong>Facebook
has moved to server-side rebasing on push</strong> to mostly eliminate this
pain point. (This is part of a still-experimental feature in Mercurial,
which should hopefully lose its experimental flag soon.)</p>
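<p>A toy model may make the contrast concrete. The sketch below is my own invention (none of these names come from Mercurial or Facebook); it shows why letting the server serialize and rebase incoming pushes eliminates the client-side retry loop:</p>

```python
import threading

# Toy model (invented names, not Facebook's or Mercurial's code) of why
# server-side rebasing removes the push race.

class Repo:
    def __init__(self):
        self.commits = ["base"]
        self.lock = threading.Lock()

    def tip(self):
        return self.commits[-1]

    # Client-side model: the push succeeds only if the tip the client
    # rebased onto is still the tip. A losing client must pull, rebase,
    # and retry -- the needless legwork described above.
    def push_if_tip_unchanged(self, commit, expected_tip):
        with self.lock:
            if self.tip() != expected_tip:
                return False
            self.commits.append(commit)
            return True

    # Server-side model: the server serializes pushes itself, rebasing
    # each incoming commit onto whatever the tip is at that instant.
    # Clients never race each other and never retry.
    def rebase_and_append(self, commit):
        with self.lock:
            self.commits.append(f"{commit}-onto-{self.tip()}")

repo = Repo()
threads = [threading.Thread(target=repo.rebase_and_append, args=(f"c{i}",))
           for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(repo.commits)  # all 10 commits landed; no client ever retried
```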
<p>Starting at 13:00 there is a speaker change and a move away from version
control.</p>
<p>IDEs don't scale to Facebook scale. <strong>"Developing in Xcode at Facebook
is an exercise in frustration."</strong> On average, 3.5 minutes to open
Facebook for iOS in Xcode. 5 minutes on average to index. It pegs the CPU
and makes the machine not very responsive. 50 Xcode crashes per day across all
Facebook iOS developers.</p>
<p><strong>Facebook measures everything about tools. Mercurial operation times.
Xcode times. Build times. Data tells them what tools and workflows
need to be worked on.</strong></p>
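<p>As an illustration of the idea (not Facebook's actual telemetry), instrumenting tools can be as simple as timing every operation and aggregating the samples; all names below are hypothetical:</p>

```python
import time
from collections import defaultdict
from contextlib import contextmanager

timings = defaultdict(list)  # operation name -> list of durations (seconds)

@contextmanager
def timed(operation):
    start = time.monotonic()
    try:
        yield
    finally:
        timings[operation].append(time.monotonic() - start)

# Wrap each tool invocation you care about:
with timed("hg.status"):
    pass  # run the real command here

# Aggregates like these are what tell you which workflows need work.
for op, samples in timings.items():
    print(op, len(samples), max(samples), sum(samples) / len(samples))
```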
<p>Facebook believes IDEs are worth the pain because they make people
more productive.</p>
<p>Facebook wants to support all editors and IDEs since people want to
use whatever is most comfortable.</p>
<p>React Native changed things. It supported developing for multiple
platforms, which no single IDE supports. People launched several
editors and tools to do React Native development and needed 4
windows to do development. That experience was "not acceptable."
So they built their own IDE: a set of plugins on top of Atom, not
a fork. They like the hackable and web-y nature of Atom.</p>
<p>The demo showing iOS development looks very nice! Doing Objective-C,
JavaScript, simulator integration, and version control in one window!</p>
<p>It can connect to remote servers and transparently save and
deploy changes. It can also get real-time compilation errors and hints
from the remote server! (The demo was with Hack. Not sure if other langs are
supported. Having beefy central servers for e.g. Gecko development
would be a fun experiment.)</p>
<p>Starting at 32:00, the presentation shifts to continuous integration.</p>
<p>Number one goal of CI at Facebook is developer efficiency. <strong>We
don't want developers waiting on computers to build and test diffs.</strong></p>
<p>3 goals for CI:</p>
<ol>
<li>High-signal feedback. Don't want developers chasing failures that
aren't their fault. Wastes time.</li>
<li>Must provide rapid feedback. Developers don't want to wait.</li>
<li>Provide frequent feedback. Developers should know as soon as
possible after they did something. (I think this refers to local
feedback.)</li>
</ol>
<p>Sandcastle is their CI system.</p>
<p>Diff lifecycle discussion.</p>
<p>Basic tests and lint run locally. (My understanding from talking
with Facebookers is that "local" often means on a Facebook server, not
a local laptop. The machines at developers' fingertips are often dumb
terminals.)</p>
<p>They appear to use code coverage to determine what tests to run.
"We're not going to run a test unless your diff might actually have
broken it."</p>
<p>They run flaky tests less often.</p>
<p>They run slow tests less often.</p>
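<p>A minimal sketch of how coverage-driven selection plus flaky/slow down-weighting might fit together; the coverage map, test names, and rates below are all invented for illustration:</p>

```python
import random

# Invented coverage map: source file -> tests that execute it.
COVERAGE = {
    "feed/story.py": {"test_story", "test_feed_render"},
    "feed/ranker.py": {"test_ranker", "test_feed_render"},
}
FLAKY = {"test_feed_render"}   # run these less often
SLOW = {"test_ranker"}         # run these less often too

def tests_for_diff(changed_files, flaky_rate=0.2, slow_rate=0.5):
    # Only consider tests the diff could actually have broken.
    candidates = set()
    for path in changed_files:
        candidates |= COVERAGE.get(path, set())
    selected = set()
    for test in candidates:
        if test in FLAKY and random.random() > flaky_rate:
            continue
        if test in SLOW and random.random() > slow_rate:
            continue
        selected.add(test)
    return selected

print(tests_for_diff(["feed/story.py"]))
```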
<p><strong>Goal is to get feedback to developers in under 10 minutes.</strong></p>
<p><strong>If they run fewer tests and get back to developers quicker,
things are less likely to break than if they run more tests but
take longer to give feedback.</strong></p>
<p>They also want feedback quickly so reviewers can see results at
review time.</p>
<p>They use WebDriver heavily. Love the cross-platform nature of WebDriver.</p>
<p>In addition to test results, performance and size metrics are reported.</p>
<p>They have a "Ship It" button on the diff.</p>
<p>Landcastle handles landing the diff.</p>
<p>"It is not OK at Facebook to land a diff without using Landcastle."
(Read: developers don't push directly to the master repo.)</p>
<p>Once Landcastle lands something, it runs tests again. If an issue
is found, a task is filed. A task can be "push blocking."
Code won't ship to users until the "push blocking" issue is resolved.
(Tweets confirm they do backouts "fairly aggressively." A valid
resolution to a push blocking task is to back out. But fixing forward
is fine as well.)</p>
<p>After a while, a branch cut occurs, with some cherry picks onto release
branches.</p>
<p>In addition to diff-based testing, they do continuous testing runs.
Much more comprehensive. No time restrictions. Continuous runs on
master and release candidate branches. Auto bisect to pin down
regressions.</p>
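<p>The auto-bisect step is essentially a binary search over the commits between the last green run and the first red one. A minimal sketch, with <code>is_broken</code> standing in for "sync to this commit and run the failing test":</p>

```python
def auto_bisect(commits, is_broken):
    """commits[0] is known good, commits[-1] is known bad.
    Returns the first commit at which the test breaks."""
    lo, hi = 0, len(commits) - 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_broken(commits[mid]):
            hi = mid   # failure is at mid or earlier
        else:
            lo = mid   # failure is after mid
    return commits[hi]

commits = list(range(100, 120))                   # pretend commit IDs
print(auto_bisect(commits, lambda c: c >= 113))   # -> 113
```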
<p>Sandcastle processes &gt;1000 test results per second. 5 years of machine
work per day. Thousands of machines in 5 data centers.</p>
<p>They started with Buildbot. Single master. They hit the scaling limits of a
single-threaded single master: the master could not push work to workers
fast enough. Sandcastle has a distributed queue; workers just pull
jobs from it.</p>
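<p>Here is a toy model of that pull-based design, using a plain in-process queue to stand in for Sandcastle's distributed queue; everything here is invented for illustration:</p>

```python
import queue
import threading

jobs = queue.Queue()  # stands in for the distributed queue

def worker(name):
    # Workers pull at their own pace; nothing pushes work at them.
    while True:
        job = jobs.get()
        if job is None:       # sentinel: shut down
            return
        print(f"{name} ran {job}")
        jobs.task_done()

workers = [threading.Thread(target=worker, args=(f"worker-{i}",))
           for i in range(4)]
for w in workers:
    w.start()
for i in range(20):
    jobs.put(f"test-job-{i}")
jobs.join()                   # wait for all jobs to finish
for _ in workers:
    jobs.put(None)
for w in workers:
    w.join()
```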
  137. <p>"High-signal feedback is critical." "Flaky failures erode developer
  138. confidence." "We need developers to trust Sandcastle."</p>
  139. <p>Extremely careful separating infra failures from other failures.
  140. Developers don't see infra failures. Infra failures only reported
  141. to Sandcastle team.</p>
  142. <p>Bots look for flaky tests. Stress test individual tests. Run tests
  143. in parallel with themselves. Goal: developers don't see flaky tests.</p>
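<p>A sketch of the detection idea, under the assumption that a flaky test is simply one that both passes and fails when re-run repeatedly; <code>run_test</code> is a stand-in:</p>

```python
import random

def run_test(name):
    # Stand-in for "run this one test"; fails ~5% of the time.
    return random.random() > 0.05

def is_flaky(name, runs=200):
    # Stress a single test: if repeated runs ever disagree, it's flaky.
    # (Real bots also run tests in parallel with themselves to provoke
    # interference; this sketch just re-runs sequentially.)
    results = {run_test(name) for _ in range(runs)}
    return len(results) > 1

print(is_flaky("test_feed_render"))  # almost certainly True
```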
<p>There is a "not my fault" button that developers can use to report
bad signals.</p>
<p><strong>"Whatever the scale of your engineering organization, developer
efficiency is the key thing that your infrastructure teams should be
striving for. This is why at Facebook we have some of our top
engineers working on developer infrastructure."</strong> (Preach it.)</p>
<p>Excellent talk. <strong>Mozillians doing infra work or who are in charge
of head count for infra work should watch this video.</strong></p>
<p><em>Update 2015-03-28 21:35 UTC - Clarified some bits in response to
new info Tweeted at me. Added link to my monorepos blog post.</em></p>