- title: Notes from Facebook's Developer Infrastructure at Scale F8 Talk
- url: http://gregoryszorc.com/blog/2015/03/28/notes-from-facebook%27s-developer-infrastructure-at-scale-f8-talk/
- hash_url: add9aea6059be55754097cf61bd9eee2
-
- <p>Any time Facebook talks about technical matters I tend to listen.
- They have a track record of demonstrating engineering leadership
- in several spaces. And, unlike many companies that just talk, Facebook
- often gives others access to those ideas via source code and healthy
- open source projects. It's rare to see a company operating on the
- frontier of the computing field provide so much insight into their
- inner workings. You can gain so much by riding their coattails and
- following their lead instead of clinging to and cargo culting from
- the past.</p>
- <p>The Facebook F8 developer conference was this past week. All the
- talks are <a href="https://developers.facebooklive.com/">now available online</a>.
- <strong>I encourage you to glance through the list of talks and watch
- whatever is relevant to you.</strong> There's a little something for
- everyone.</p>
- <p>Of particular interest to me is the
- <a href="https://developers.facebooklive.com/videos/561/big-code-developer-infrastructure-at-facebook-s-scale">Big Code: Developer Infrastructure at Facebook's Scale</a>
- talk. This is highly relevant to my job role as Developer Productivity
- Engineer at Mozilla.</p>
- <p>My notes for this talk follow.</p>
- <p><strong>"We don't want humans waiting on computers. We want computers waiting
- on humans."</strong> (This is the common theme of the talk.)</p>
- <p>In 2005, Facebook was on Subversion. In 2007 moved to Git. Deployed
- a bridge so people worked in Git and had distributed workflow but
- pushed to Subversion under the hood.</p>
- <p>New platforms over time. Server code, iOS, Android. One Git repo
- per platform/project -> 3 Git repos. Initially no code sharing, so
- no problem. Over time, code sharing between all repos. Lots of code
- copying and confusion as to what is where and who owns what.</p>
- <p>Facebook is mere weeks away from completing their migration to
- consolidate the big three repos to a Mercurial monorepo. (See also
- <a href="/blog/2014/09/09/on-monolithic-repositories/">my post about monorepos</a>.)</p>
- <p>Reasons:</p>
- <ol>
- <li>Easier code sharing.</li>
- <li>Easier large-scale changes. Rewrite the universe at once.</li>
- <li>Unified set of tooling.</li>
- </ol>
- <p>Facebook employees run >1M source control commands per day. >100k
- commits per week. VCS tool needs to be fast to prevent distractions
- and context switching, which slow people down.</p>
- <p>Facebook implemented sparse checkout and shallow history in Mercurial.
- Necessary to scale distributed version control to large repos.</p>
- <p><strong>Quote from Google: "We're excited about the work Facebook is doing with
- Mercurial and glad to be collaborating with Facebook on Mercurial
- development."</strong> (Well, I guess the cat is finally out of the bag:
- Google is working on Mercurial. This was kind of an open secret for
- months. But I guess now it is official.)</p>
- <p>Push-pull-rebase bottleneck: if you rebase and push and someone beats
- you to it, you have to pull, rebase, and try again. This gets worse
- as commit rate increases and people do needless legwork. <strong>Facebook
- has moved to server-side rebasing on push</strong> to mostly eliminate this
- pain point. (This is part of a still-experimental feature in Mercurial,
- which should hopefully lose its experimental flag soon.)</p>
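- <p>(For illustration, here is a minimal Python sketch of the client-side
- retry loop that server-side rebasing removes. It shells out to stock
- Mercurial commands; the exact invocations and the server-side configuration
- are my assumptions, not something the talk spelled out.)</p>
- <pre><code>import subprocess
-
- def hg(*args):
-     """Run an hg command in the current repository; return its exit code."""
-     return subprocess.run(["hg", *args]).returncode
-
- def push_with_retry(max_attempts=5):
-     # Without server-side rebasing, losing the race to another pusher means
-     # pulling and rebasing locally, then trying again (rebase extension on).
-     for _ in range(max_attempts):
-         if hg("push") == 0:
-             return True
-         hg("pull", "--rebase")
-     return False
- </code></pre>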
- <p>Starting 13:00 in we have a speaker change and move away from version
- control.</p>
- <p>IDEs don't scale to Facebook scale. <strong>"Developing in Xcode at Facebook
- is an exercise in frustration."</strong> On average 3.5 minutes to open
- Facebook for iOS in Xcode. 5 minutes on average to index. Pegs the CPU
- and makes the machine not very responsive. 50 Xcode crashes per day across all
- Facebook iOS developers.</p>
- <p><strong>Facebook measures everything about tools. Mercurial operation times.
- Xcode times. Build times. Data tells them what tools and workflows
- need to be worked on.</strong></p>
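- <p>(A toy sketch of what "measure everything about tools" can look like,
- assuming nothing about Facebook's actual telemetry: wrap a tool invocation,
- time it, and append the result to a log someone can aggregate later. The
- file name and record schema here are made up.)</p>
- <pre><code>import json, subprocess, sys, time
-
- def timed_run(argv, log_path="tool-metrics.jsonl"):
-     """Run a command, time it, and append a metric record to a local log."""
-     start = time.monotonic()
-     result = subprocess.run(argv)
-     elapsed = time.monotonic() - start
-     record = {"argv": argv, "seconds": elapsed, "exit": result.returncode}
-     with open(log_path, "a") as fh:
-         fh.write(json.dumps(record) + "\n")
-     return result.returncode
-
- if __name__ == "__main__":
-     sys.exit(timed_run(sys.argv[1:]))
- </code></pre>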
- <p>Facebook believes IDEs are worth the pain because they make people
- more productive.</p>
- <p>Facebook wants to support all editors and IDEs since people want to
- use whatever is most comfortable.</p>
- <p>React Native changed things. Supported developing on multiple
- platforms, which no single IDE supports. People launched several
- editors and tools to do React Native development. People needed 4
- windows to do development. That experience was "not acceptable."
- So they built their own IDE. Set of plugins on top of Atom. Not
- a fork. They like the hackable and web-y nature of Atom.</p>
- <p>The demo showing iOS development looks very nice! Doing Objective-C,
- JavaScript, simulator integration, and version control in one window!</p>
- <p>It can connect to remote servers and transparently save and
- deploy changes. It can also get real-time compilation errors and hints
- from the remote server! (Demo was with Hack. Not sure if other langs are
- supported. Having beefy central servers for e.g. Gecko development
- would be a fun experiment.)</p>
- <p>Starting at 32:00 presentation shifts to continuous integration.</p>
- <p>Number one goal of CI at Facebook is developer efficiency. <strong>We
- don't want developers waiting on computers to build and test diffs.</strong></p>
- <p>3 goals for CI:</p>
- <ol>
- <li>High-signal feedback. Don't want developers chasing failures that
- aren't their fault. Wastes time.</li>
- <li>Must provide rapid feedback. Developers don't want to wait.</li>
- <li>Provide frequent feedback. Developers should know as soon as
- possible after they did something. (I think this refers to local
- feedback.)</li>
- </ol>
- <p>Sandcastle is their CI system.</p>
- <p>Diff lifecycle discussion.</p>
- <p>Basic tests and lint run locally. (My understanding from talking
- with Facebookers is "local" often means on a Facebook server, not
- local laptop. Machines at developers' fingertips are often dumb
- terminals.)</p>
- <p>They appear to use code coverage to determine what tests to run.
- "We're not going to run a test unless your diff might actually have
- broken it."</p>
- <p>They run flaky tests less often.</p>
- <p>They run slow tests less often.</p>
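- <p>(A hypothetical sketch of that selection logic: run only tests whose
- recorded coverage intersects the files touched by the diff, and push flaky
- and slow tests to the back so they run less often. The data structures are
- illustrative, not Facebook's.)</p>
- <pre><code>def select_tests(changed_files, coverage_map, flaky_tests, slow_tests):
-     """coverage_map maps a test name to the set of files that test covers."""
-     selected = [test for test, covered in coverage_map.items()
-                 if covered.intersection(changed_files)]
-     # Reliable, fast tests sort first; flaky and slow ones sort to the end.
-     return sorted(selected, key=lambda t: (t in flaky_tests, t in slow_tests))
- </code></pre>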
- <p><strong>Goal is to get feedback to developers in under 10 minutes.</strong></p>
- <p><strong>If they run fewer tests and get back to developers quicker,
- things are less likely to break than if they run more tests but
- take longer to give feedback.</strong></p>
- <p>They also want feedback quickly so reviewers can see results at
- review time.</p>
- <p>They use WebDriver heavily. Love the cross-platform nature of WebDriver.</p>
- <p>In addition to test results, performance and size metrics are reported.</p>
- <p>They have a "Ship It" button on the diff.</p>
- <p>Landcastle handles landing the diff.</p>
- <p>"It is not OK at Facebook to land a diff without using Landcastle."
- (Read: developers don't push directly to the master repo.)</p>
- <p>Once Landcastle lands something, it runs tests again. If an issue
- is found, a task is filed. Task can be "push blocking."
- Code won't ship to users until the "push blocking" issue is resolved.
- (Tweets confirm they do backouts "fairly aggressively." A valid
- resolution to a push blocking task is to back out. But fixing forward
- is fine as well.)</p>
- <p>After a while, branch cut occurs. Some cherry picks onto release
- branches.</p>
- <p>In addition to diff-based testing, they do continuous testing runs.
- Much more comprehensive. No time restrictions. Continuous runs on
- master and release candidate branches. Auto bisect to pin down
- regressions.</p>
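- <p>(A minimal sketch of automatic bisection, assuming the oldest commit in
- the range is known good and the newest known bad; <code>run_tests</code> is
- a stand-in for whatever the CI system actually executes.)</p>
- <pre><code>def auto_bisect(commits, run_tests):
-     """Binary-search for the first commit whose tests fail."""
-     lo, hi = 0, len(commits) - 1  # commits[lo] passes, commits[hi] fails
-     while hi - lo > 1:
-         mid = (lo + hi) // 2
-         if run_tests(commits[mid]):
-             lo = mid  # still passing: regression landed later
-         else:
-             hi = mid  # already failing: regression is at mid or earlier
-     return commits[hi]  # first failing commit
- </code></pre>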
- <p>Sandcastle processes >1000 test results per second. 5 years of machine
- work per day. Thousands of machines in 5 data centers.</p>
- <p>They started with Buildbot. Single master. Hit scaling limits of
- a single-threaded single master. Master could not push work to workers
- fast enough. Sandcastle has distributed queue. Workers just pull
- jobs from distributed queue.</p>
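- <p>(A toy sketch of the pull model: workers take jobs from a shared queue
- instead of a master pushing work at them. A thread-safe in-process queue
- and made-up job names stand in for Sandcastle's real distributed queue.)</p>
- <pre><code>import queue, threading
-
- def worker(jobs):
-     while True:
-         job = jobs.get()
-         if job is None:  # sentinel: no more work
-             return
-         print("running", job)
-
- jobs = queue.Queue()
- for name in ["build-ios", "test-android", "lint-www"]:
-     jobs.put(name)
-
- threads = [threading.Thread(target=worker, args=(jobs,)) for _ in range(3)]
- for t in threads:
-     t.start()
- for _ in threads:
-     jobs.put(None)
- for t in threads:
-     t.join()
- </code></pre>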
- <p>"High-signal feedback is critical." "Flaky failures erode developer
- confidence." "We need developers to trust Sandcastle."</p>
- <p>Extremely careful separating infra failures from other failures.
- Developers don't see infra failures. Infra failures only reported
- to Sandcastle team.</p>
- <p>Bots look for flaky tests. Stress test individual tests. Run tests
- in parallel with themselves. Goal: developers don't see flaky tests.</p>
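- <p>(A hypothetical sketch of such a bot: run one test many times, some runs
- in parallel with each other, and flag the test as flaky if it both passes
- and fails. <code>run_test</code> is a stand-in for the real test runner.)</p>
- <pre><code>from concurrent.futures import ThreadPoolExecutor
-
- def looks_flaky(run_test, runs=50, workers=8):
-     """Stress a single test: run it many times, some in parallel with itself."""
-     with ThreadPoolExecutor(max_workers=workers) as pool:
-         results = list(pool.map(lambda _: run_test(), range(runs)))
-     passes = sum(1 for passed in results if passed)
-     return passes != 0 and passes != runs  # mixed passes and failures: flaky
- </code></pre>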
- <p>There is a "not my fault" button that developers can use to report
- bad signals.</p>
- <p><strong>"Whatever the scale of your engineering organization, developer
- efficiency is the key thing that your infrastructure teams should be
- striving for. This is why at Facebook we have some of our top
- engineers working on developer infrastructure."</strong> (Preach it.)</p>
- <p>Excellent talk. <strong>Mozillians doing infra work or who are in charge
- of head count for infra work should watch this video.</strong></p>
- <p><em>Update 2015-03-28 21:35 UTC - Clarified some bits in response to
- new info Tweeted at me. Added link to my monorepos blog post.</em></p>