- title: Mercurial's Journey to and Reflections on Python 3
- url: https://gregoryszorc.com/blog/2020/01/13/mercurial%27s-journey-to-and-reflections-on-python-3/
- hash_url: 67c8c54b07137bcfc0069fccd8261b53
-
- <p>Mercurial 5.2 was released on November 5, 2019. It is the first version
- of Mercurial that supports Python 3. This milestone comes nearly 11 years
- after Python 3.0 was first released on December 3, 2008.</p>
- <p>Speaking as a maintainer of Mercurial and an avid user of Python, I
- feel like the experience of making Mercurial work with Python 3 is
- worth sharing because there are a number of lessons to be learned.</p>
- <p>This post is logically divided into two sections: a mostly factual account
- of Mercurial's Python 3 porting effort and a more opinionated commentary
- on the transition to Python 3 and the Python language ecosystem as a whole.
- Those who don't care about the mechanics of porting a large Python project
- to Python 3 may want to skip the next section or two.</p>
- <h2>Porting Mercurial to Python 3</h2>
- <p>Let's start with a brief history lesson of Mercurial's support for
- Python 3 as told by its own commit history.</p>
- <p>The Mercurial version control tool was first released in April 2005
- (the same month that Git was initially released). Version 1.0 came out
- in March 2008. The first reference to Python 3 I found in the code base
- was in <a href="https://www.mercurial-scm.org/repo/hg/rev/8fee8ff13d37">September 2008</a>.
- Then not much happened for a while until
- <a href="https://www.mercurial-scm.org/repo/hg/rev/4494fb02d549">June 2010</a>, when
- someone authored a bunch of changes to make the Python C extensions
- start to recognize Python 3. Then things were again quiet for a while
- until <a href="https://www.mercurial-scm.org/repo/hg/rev/56ef99fbd6f2">January 2013</a>,
- when a handful of changes landed to remove two-argument <code>raise</code>. There were
- a handful of commits in 2014 but nothing worth calling out.</p>
- <p>Mercurial's meaningful journey to Python 3 started in 2015. In code,
- the work started in
- <a href="https://www.mercurial-scm.org/repo/hg/rev/af6e6a0781d7">April 2015</a>, with
- effort to make Mercurial's test harness run with Python 3. Part of
- this was a <a href="https://www.mercurial-scm.org/repo/hg/rev/fefc72523491">decision</a>
- that Python 3.5 (to be released several months later in September 2015)
- would be the minimum Python 3 version that Mercurial would support.</p>
- <p>Once the Mercurial Project decided it wanted to port to Python 3 (as opposed
- to another language), one of the earliest decisions was how to perform that
- port. <strong>Mercurial's code base was too large to attempt a flag day conversion</strong>
- where there would be a Python 2 version and a Python 3 version and one day
- everyone would switch from Python 2 to 3. <strong>Mercurial needed a way to run the
- same code (or as much of it as possible) on both Python 2 and 3.</strong> We would
- maintain a single code base and users would gradually switch from running with
- Python 2 to Python 3.</p>
- <p>In <a href="https://www.mercurial-scm.org/repo/hg/rev/e1fb276d4619">May 2015</a>,
- Mercurial dropped support for Python 2.4 and 2.5. Dropping support for
- these older Python versions was critical, as it was effectively impossible to
- write Python code that ran on this wide gamut of versions because of
- incompatibilities in syntax and language features. For example, you needed
- Python 2.6 to get <code>print()</code> via <code>from __future__ import print_function</code>.
- The project's late start at a Python 3 port can be significantly attributed
- to Python 2.4 and 2.5 compatibility holding us back.</p>
- <p>The main goal with Mercurial's early porting work was just getting the code base
- to a point where <code>import mercurial</code> would work. There were a myriad of places
- where Mercurial used syntax that was invalid on Python 3 and Python 3
- couldn't even parse the source code, let alone compile it to bytecode and
- execute it.</p>
- <p>This effort began in earnest in
- <a href="https://www.mercurial-scm.org/repo/hg/rev/e93036747902">June 2015</a>
- with global source code rewrites like using modern octal syntax,
- modern exception catching syntax (<code>except Exception as e</code> instead of
- <code>except Exception, e</code>), <code>print()</code> instead of <code>print</code>, and a
- <a href="https://www.mercurial-scm.org/repo/hg/rev/1a6a117d0b95">modern import convention</a>
- along with the use of <code>from __future__ import absolute_import</code>.</p>
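<p>As a rough illustration of these rewrites (a sketch, not code taken from Mercurial), the snippet below uses spellings that parse on both Python 2.6+ and Python 3, with the Python 2-only forms shown in comments. The names <code>mode</code> and <code>parse_int</code> are hypothetical.</p>

```python
# On Python 2 the module would begin with:
#   from __future__ import absolute_import, print_function

# Modern octal syntax. The Python 2-only spelling was 0644.
mode = 0o644


def parse_int(text):
    """Return int(text), or None if text is not a valid integer."""
    try:
        return int(text)
    # Modern exception syntax. The Python 2-only spelling was
    # "except ValueError, e".
    except ValueError as e:
        # print() called as a function rather than used as a statement.
        print("bad integer:", e)
        return None
```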
- <p>In the early days of the port, our first goal was to get all source code
- parsing as valid Python 3. The next step was to get all the modules <code>import</code>ing
- cleanly. This entailed fixing code that ran at <code>import</code> time to work on
- Python 3. Our thinking was that we would need the code base to be <code>import</code>
- clean on Python 3 before seriously thinking about run-time behavior. In reality,
- we quickly ported a lot of modules to <code>import</code> cleanly and then moved on
- to higher-level porting, leaving a long-tail of modules with <code>import</code> failures.</p>
- <p>This initial porting effort played out over months. There weren't many
- people working on it in the early days: a few people would basically hack on
- Python 3 as a form of itch scratching and most of the project's energy was
- focused on improving the existing Python 2 based product. You can get a rough
- idea of the timeline and participation in the early porting effort through the
- <a href="https://www.mercurial-scm.org/repo/hg/log/081a77df7bc6/tests/test-check-py3-compat.t?revcount=960">history of test-check-py3-compat.t</a>.
- We see the test being added in <a href="https://www.mercurial-scm.org/repo/hg/rev/40eb385f798f">December 2015</a>.
- By June 2016, most of the code base was ported to our modern import convention
- and we were ready to move on to more meaningful porting.</p>
- <p>One of the biggest early hurdles in our porting effort was how to overcome
- the string literals type mismatch between Python 2 and 3. In Python 2, a
- <code>''</code> string literal is a sequence of bytes. In Python 3, a <code>''</code> string literal
- is a sequence of Unicode code points. These are fundamentally different types.
- And in Mercurial's code base, <strong>most of our <em>string</em> types are binary by design:
- use of a Unicode based <code>str</code> for representing data is flat out wrong for our use
- case</strong>. We knew that Mercurial would need to eventually switch many string
- literals from <code>''</code> to <code>b''</code> to preserve type compatibility. But doing so would
- be problematic.</p>
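<p>To make the mismatch concrete, here is a small sketch of how the two literal forms behave when run on Python 3. The variable names are purely illustrative.</p>

```python
native = 'abc'   # str on Python 3: a sequence of Unicode code points
binary = b'abc'  # bytes on both Python 2.6+ and Python 3

# On Python 3 the two types never compare equal, even for ASCII data.
same = native == binary

# Indexing differs too: bytes yields integers, str yields 1-character strings.
first_byte = binary[0]
first_char = native[0]

# Crossing between the types requires naming an encoding explicitly.
decoded = binary.decode('ascii')
```

<p>On Python 2, by contrast, both literals produce the same byte string type, which is why existing code never had to make the distinction.</p>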
- <p>In the early days of Mercurial's Python 3 port in 2015, Mercurial's project
- maintainer (Matt Mackall) set a ground rule that the Python 3 port shouldn't overly
- disrupt others: he wanted the Python 3 port to more or less happen in the background
- and not require every developer to be aware of Python 3's low-level behavior in order
- to get work done on the existing Python 2 code base. This may seem like a questionable
- decision (and I probably disagreed with him to some extent at the time because I was
- doing Python 3 porting work and the decision constrained this work). But it was the
- correct decision. Matt knew that it would be years before the Python 3 port was either
- necessary or resulted in a meaningful return on investment (the value proposition of
- Python 3 has always been weak to Mercurial because Python 3 doesn't demonstrate a
- compelling advantage over Python 2 for our use case). What Matt was trying to do was
- minimize the externalized costs that a Python 3 port would inflict on the project.
- He correctly recognized that maintaining the existing product and supporting
- existing users was more important than a long-term bet in its infancy.</p>
- <p>This ground rule meant that a mass insertion of <code>b''</code> prefixes everywhere
- was not desirable, as that would require developers to think about whether
- a type was a <code>bytes</code> or <code>str</code>, a distinction they didn't have to worry about
- on Python 2 because we practically never used the Unicode-based string type in
- Mercurial.</p>
- <p>In addition, there were some other practical issues with doing a bulk <code>b''</code>
- prefix insertion. One was that the added <code>b</code> characters would cause a lot of lines
- to grow beyond our length limits and we'd have to reformat code. That would
- require manual intervention and would significantly slow down porting. And
- a sub-issue of adding all the <code>b</code> prefixes and reformatting code is that it would
- <em>break</em> annotate/blame more than was tolerable. The latter issue was addressed
- by teaching Mercurial's annotate/blame feature to <em>skip</em> revisions. The project
- now has a convention of annotating commit messages with <code># skip-blame &lt;reason&gt;</code>
- so structural only changes can easily be ignored when performing an
- annotate/blame.</p>
- <p>A stop-gap solution to the <code>b''</code> everywhere issue came in
- <a href="https://www.mercurial-scm.org/repo/hg/rev/1c22400db72d">July 2016</a>, when I
- introduced a custom Python module importer that rewrote source code as part
- of <code>import</code> when running on Python 3. (I have
- <a href="/blog/2017/03/13/from-__past__-import-bytes_literals/">previously blogged</a>
- about this hack.) The importer transparently added <code>b''</code> prefixes to all
- un-prefixed string literals and modified how a few common functions were
- called, so the code would run natively on Python 3 without us having to modify
- the source. The source transformer allowed us to have the benefits of progressing
- in our Python 3 port without having to rewrite tens of thousands of lines of
- source code. The solution was hacky. But it enabled us to make significant
- progress on the Python 3 port without externalizing a lot of cost onto others.</p>
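<p>The core of such a transformer can be sketched with the standard library's <code>tokenize</code> module. This is a minimal illustration under stated assumptions, not Mercurial's actual importer: the function name <code>add_bytes_prefixes</code> is hypothetical, and the sketch leaves out the import hook and the function-call rewrites described above. It also rewrites docstrings, which a real transformer might want to skip.</p>

```python
import io
import tokenize


def add_bytes_prefixes(source):
    """Return source text with b'' prefixes on un-prefixed string literals."""
    rewritten = []
    readline = io.BytesIO(source.encode('utf-8')).readline
    for tok in tokenize.tokenize(readline):
        # A STRING token that starts with a quote character has no prefix.
        # Literals such as r'', u'' and b'' start with a letter and are kept.
        if tok.type == tokenize.STRING and tok.string[0] in ('"', "'"):
            tok = tok._replace(string='b' + tok.string)
        rewritten.append(tok)
    # untokenize() returns bytes because the stream starts with an
    # ENCODING token.
    return tokenize.untokenize(rewritten).decode('utf-8')


example = "name = 'hg'\nraw = r'path'\nalready = b'x'\n"
namespace = {}
exec(add_bytes_prefixes(example), namespace)
```

<p>After the rewrite, <code>name</code> is a <code>bytes</code> value while <code>raw</code> remains a <code>str</code>, which is the kind of type preservation the transformer was meant to provide.</p>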
- <p>I thought the source transformer would be relatively short-lived and would be
- removed shortly after the project inevitably decided to go all in on Python 3.
- To my surprise, others built additional transforms over the years and the source
- transformer persisted all the way until
- <a href="https://www.mercurial-scm.org/repo/hg/rev/d783f945a701">October 2019</a>, when
- I removed it just before the first non-alpha Python 3 compatible version
- of Mercurial was released.</p>
- <p>A common problem Mercurial faced with making the code base dual Python 2/3 native
- was dealing with standard library differences. Most of the problems stemmed
- from changes between Python 2.7 and 3.5+. But there are changes within the
- versions of Python 3 that we had to wallpaper over as well. In
- <a href="https://www.mercurial-scm.org/repo/hg/rev/6041fb8f2da8">April 2016</a>, the
- <code>mercurial.pycompat</code> module was introduced to export aliases or wrappers around
- standard library functionality to abstract the differences between Python
- versions. This file <a href="https://www.mercurial-scm.org/repo/hg/log/66af68d4c751/mercurial/pycompat.py?revcount=240">grew over time</a>
- and <a href="https://www.mercurial-scm.org/repo/hg/file/66af68d4c751/mercurial/pycompat.py">eventually became</a>
- Mercurial's version of <a href="https://six.readthedocs.io/">six</a>. To be honest, I'm
- not sure if we should have used <code>six</code> from the beginning. <code>six</code> probably would
- have saved some work. But we had to eventually write a lot of shims for
- converting between <code>str</code> and <code>bytes</code> and would have needed to invent a
- <code>pycompat</code> layer in some form anyway. So I'm not sure <code>six</code> would have saved
- enough effort to justify the baggage of integrating a 3rd party package into
- Mercurial. (When Mercurial accepts a 3rd party package, downstream packagers
- like Debian get all hot and bothered and end up making questionable patches
- to our source code. So we prefer to minimize the surface area for
- problems by minimizing dependencies on 3rd party packages.)</p>
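<p>A compatibility module of this kind typically boils down to version checks that select an alias or a wrapper. The following is a minimal sketch of the idea only. The helper names are illustrative, and the real <code>pycompat</code> module is considerably larger and may differ in its details.</p>

```python
import sys

ispy3 = sys.version_info[0] >= 3

if ispy3:
    # The Python 2 "Queue" module was renamed to "queue" in Python 3.
    import queue

    def sysstr(value):
        """Return a native str, decoding bytes if necessary."""
        if isinstance(value, str):
            return value
        # latin-1 maps every byte to a code point, so decoding cannot fail.
        return value.decode('latin-1')
else:
    import Queue as queue

    def sysstr(value):
        # On Python 2 the native str type is already a byte string.
        return value
```

<p>Callers import names from the shim instead of from the standard library directly, so the version check lives in exactly one place.</p>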
- <p>Once we had a source transforming module importer and the <code>pycompat</code>
- compatibility shim, we started to focus in earnest on making core
- functionality actually work on Python 3. We established a convention of
- annotating changesets needed for Python 3 with <code>py3</code>, so a
- <a href="https://www.mercurial-scm.org/repo/hg/log?rev=desc(py3)&revcount=4000">commit message search</a>
- yields a lot of the history. (But it isn't a full history since not every Python 3
- oriented change used this convention.) We see from that history that after
- the source importer landed, a lot of porting effort was spent on things
- very early in the <code>hg</code> process lifetime. This included handling environment
- variables, loading config files, and argument parsing. We introduced a
- <a href="https://www.mercurial-scm.org/repo/hg/log/@/tests/test-check-py3-commands.t">test-check-py3-commands.t</a>
- test to track the progress of <code>hg</code> commands working in Python 3. The very early
- history of that file shows the various error messages changing, as underlying
- early process functionality was slowly ported to work on Python 3. By
- <a href="https://www.mercurial-scm.org/repo/hg/rev/2d555d753f0e">December 2016</a>, we
- had <code>hg version</code> working on Python 3!</p>
- <p>With basic <code>hg</code> command dispatch ported to Python 3 at the end of 2016,
- 2017 represented an inflection point in the Python 3 porting effort. With the
- early process functionality working, different people could pick up different
- commands and code paths and start making code work with Python 3. By
- <a href="https://www.mercurial-scm.org/repo/hg/rev/52ee1b5ac277">March 2017</a>, basic
- repository opening and <code>hg files</code> worked. Shortly thereafter,
- <a href="https://www.mercurial-scm.org/repo/hg/rev/ed23f929af38">hg init started working as well</a>.
- <a href="https://www.mercurial-scm.org/repo/hg/rev/935a1b1117c7">hg status</a> and
- <a href="https://www.mercurial-scm.org/repo/hg/rev/aea8ec3f7dd1">hg commit</a> soon followed.</p>
- <p>Within a few months, enough of Mercurial's functionality was working with Python
- 3 that we started to <a href="https://www.mercurial-scm.org/repo/hg/rev/7a877e569ed6">track which tests passed on Python 3</a>.
- The <a href="https://www.mercurial-scm.org/repo/hg/log/@/contrib/python3-whitelist?revcount=480">evolution of this file</a>
- shows a reasonable history of the porting velocity.</p>
- <p>In <a href="https://www.mercurial-scm.org/repo/hg/rev/feb910d2f59b">May 2017</a>, we dropped
- support for Python 2.6. This significantly reduced the complexity of supporting
- Python 3, as there was tons of functionality in Python 2.7 that made it easier
- to target both Python 2 and 3 and now our hands were untied to utilize it.</p>
- <p>In <a href="https://www.mercurial-scm.org/repo/hg/rev/bd8875b6473c">November 2017</a>, I
- landed a test harness feature to report exceptions seen during test runs. I
- later <a href="https://www.mercurial-scm.org/repo/hg/rev/8de90e006c78">refined the output</a>
- so the most frequent failures were reported more prominently. This feature
- greatly improved our ability to target the most common exceptions, allowing
- us to write patches to fix the most prevalent issues on Python 3 and uncover
- previously unknown failures.</p>
- <p>By the end of 2017, we had most of the structural pieces in place to complete
- the port. Essentially all that was required at that point was time and labor.
- We didn't have a formal mechanism in place to target porting efforts. Instead,
- people would pick up a component or test that they wanted to hack on and then
- make incremental changes towards making that work. All the while, we didn't
- have a strict policy on not regressing Python 3 and regressions in Python 3
- porting progress were semi-frequent. Although we did tend to correct
- regressions quickly. And over time, developers saw a flurry of Python 3
- patches and slowly grew awareness of how to accommodate Python 3, and
- Python 3 regressions became less frequent.</p>
- <p>As useful as the source-transforming module importer was, it incurred some
- additional burden for the porting effort. The source transformer effectively
- converted all un-prefixed string literals (<code>''</code>) to bytes literals (<code>b''</code>)
- to preserve string type behavior with Python 2. But various aspects of Python
- 3 didn't like the existence of <code>bytes</code>. Various standard library functionality
- now wanted unicode <code>str</code> and didn't accept <code>bytes</code>, even though the Python
- 2 implementation used the equivalent of <code>bytes</code>. So our <code>pycompat</code> layer
- grew pretty large to accommodate calling into various standard library
- functionality. Another side-effect which we didn't initially anticipate
- was the <code>**kwargs</code> calling convention. Python allows you to use <code>**</code>
- with a dict with string keys to turn those keys into named arguments
- in a function call. But Python 3 requires these <code>dict</code> keys to be
- <code>str</code> and outright rejects <code>bytes</code> keys, even if the <code>bytes</code> instance
- is ASCII safe and has the same underlying byte representation of the
- string data as the <code>str</code> instance would. So we had to invent support
- functions that would convert <code>dict</code> keys from <code>bytes</code> to <code>str</code> for
- use with <code>**kwargs</code> and another to convert a <code>**kwargs</code> dict from
- <code>str</code> keys to <code>bytes</code> keys so we could use <code>''</code> syntax to access keys
- in our source code! Also on the string type front, we had to sprinkle
- the codebase with raw string literals (<code>r''</code>) to force the use of
- <code>str</code> regardless of which Python version you were running on (our
- source transformer only changed unprefixed string literals, so existing
- <code>r''</code> strings would be preserved as <code>str</code>).</p>
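<p>The <code>**kwargs</code> problem and the shape of the conversion helpers can be sketched as follows. This is an illustration rather than Mercurial's code: the helper names, the <code>show</code> function, and the choice of <code>latin-1</code> as the encoding are all assumptions made for the example.</p>

```python
def strkwargs(dic):
    """Convert bytes keys to str keys so the dict can be passed with **."""
    return {key.decode('latin-1'): value for key, value in dic.items()}


def byteskwargs(dic):
    """Convert str keys back to bytes keys for use inside the function."""
    return {key.encode('latin-1'): value for key, value in dic.items()}


def show(**opts):
    # Convert once at the boundary so the body can keep using bytes keys.
    opts = byteskwargs(opts)
    return opts[b'rev']


# Python 3 refuses bytes keys in a ** call, even when they are pure ASCII.
try:
    show(**{b'rev': 5})
    rejected = False
except TypeError:
    rejected = True

# Converting the keys first makes the same call succeed.
result = show(**strkwargs({b'rev': 5}))
```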
- <p>Blind transformation of all string literals to <code>bytes</code> was less than ideal
- and it did impose some unwanted side-effects. But, again, most <em>strings</em>
- in Mercurial are bytes by design, so we thought it would be easier to
- <em>byteify</em> all strings and then selectively undo that where native strings
- were actually warranted (like keys in most <code>dict</code>s) than to take the
- up-front cost to examine every string and make an intelligent determination
- as to what type it should be. I go back and forth as to whether this was the
- correct call. But when you factor in that the source transforming
- module importer unblocked Python 3 porting at a time in the project's
- history when there was so much focus on improving the core product and it
- did so without externalizing many costs onto the people doing the critical
- core product work, I think it was the right call.</p>
- <p>By mid 2019, the number of test failures in Python 3 had been whittled
- down to a reasonable, less daunting number. It felt like victory was
- within our grasp and inevitable. But a few significant issues lingered.</p>
- <p>One remaining question was around addressing differences between Python
- 3 versions. At the time, Python 3.5, 3.6, and 3.7 were released and 3.8
- was scheduled for release by the end of the year. We had a surprising
- number of issues with differences in Python 3 versions. Many of us
- were running Python 3.7, so it had the fewest failures. We had to spend
- extra effort to get Python 3.5 and 3.6 working as well as 3.7. Same for
- 3.8.</p>
- <p>Another task we deferred until the second half of 2019 was standing up
- robust CI for Python 3. We had some coverage, but it was minimal. Wanting
- a distraction from PyOxidizer for a bit and wanting to overhaul Mercurial's
- CI system (which is officially built on Buildbot), I cobbled together a
- <em>serverless</em> CI system built on top of AWS DynamoDB and S3 for storage,
- Lambda functions and CloudWatch events for all business logic, and EC2 spot
- instances for job execution. This CI system executed Python 3.5, 3.6, 3.7,
- and 3.8 variants of our test harness on Linux and Python 3.7 on Windows.
- This gave developers insight into version-specific failures. More
- importantly, it also gave insight into failures on Windows, which was
- previously not well tested. We discovered that Python 3 on Windows was
- lagging significantly behind POSIX.</p>
- <p>By the time of the Mercurial developer meetup in October 2019, nearly
- all tests were passing on POSIX platforms and we were confident that
- we could declare Python 3 support as at least beta quality for the
- Mercurial 5.2 release, planned for early November.</p>
- <p>One of our blockers for ripping off the alpha label on Python 3 support
- was removing our source-transforming module importer. It had performance
- implications and it wasn't something we wanted to ship because it felt
- too hacky. Removing it was itself blocked on automatically formatting
- our source tree with <a href="https://black.readthedocs.io/en/stable/">black</a>.
- Without the source transformer, we'd have to rewrite a lot of source code
- to apply the changes the transformer had been performing. That would
- push many lines past our length limits, and wrapping them by hand would
- involve a lot of manual effort. We wanted to <em>blacken</em> our code base first so that
- mass rewriting source code wouldn't involve a lot of tedious reformatting,
- since <code>black</code> would handle that for us automatically. And rewriting the
- source tree with <code>black</code> was blocked on a specific feature landing in
- <code>black</code>! (We did not agree with <code>black</code>'s behavior of
- unwrapping comma-delimited lists of items if they could fit on a single
- line. So one of our core contributors wrote a patch to <code>black</code> that
- changed its behavior so a trailing <code>,</code> in a list of items will force
- items to be formatted on multiple lines. I personally find the multiple line
- formatting much easier to read. And the behavior is arguably better for
- code review and <em>annotation</em>, which is line based.) Once this feature
- landed in <code>black</code>, we reformatted our source tree and started ripping
- out the source transformations, starting by inserting <code>b''</code> literals
- everywhere. By late October, the source transformer was no more and
- we were ready to release beta quality support for Python 3 (at least
- on UNIX-like platforms).</p>
- <p>Having described a mostly factual overview of Mercurial's port to Python
- 3, it is now time to shift gears to the speculative and opinionated
- parts of this post. <strong>I want to underscore that the opinions reflected
- here are my own and do not reflect the overall Mercurial Project or even
- a consensus within it.</strong></p>
- <h2>The Future of Python 3 and Mercurial</h2>
- <p>Mercurial's port to Python 3 is still ongoing. While we've shipped
- Python 3 support and the test harness is clean on Python 3, I view shipping
- as only a milestone - arguably <em>the</em> most important one - in a longer
- journey. There's still a lot of work to do.</p>
- <p>It is 2020 and Python 2 support is now officially dead from the
- perspective of the Python language maintainers. Linux distributions are
- starting to rip out Python 2. Packages are dropping Python 2 support in
- new versions. The world is moving to Python 3 only. But <strong>Mercurial still
- officially supports Python 2</strong>. And it is still yet to be determined how
- long we will retain support for Python 2 in the code base. We've only had
- one release supporting Python 3. Our users still need to port their
- extensions (implemented in Python). Our users still need to start widely
- using Mercurial with Python 3. Even our own developers need to switch to
- Python 3 (old habits are hard to break).</p>
- <p>I anticipate a long tail of random bugs in Mercurial on Python 3. While
- the tests may pass, our code coverage is not 100%. And even if it were,
- Python is a dynamic language and there are tons of invariants that aren't
- caught at compile time and can only be discovered at run time. <strong>These
- invariants cannot all be detected by tests, no matter how good your test
- coverage is.</strong> This is a <em>feature</em>/<em>limitation</em> of dynamic languages. Our
- users will likely be finding a long tail of miscellaneous bugs on Python
- 3 for <em>years</em>.</p>
- <p>At present, our code base is littered with tons of random hacks to bridge
- the gap between Python 2 and 3. Once Python 2 support is dropped, we'll
- need to remove these hacks and make the source tree Python 3 native, with
- minimal shims to wallpaper over differences in Python 3 versions. <strong>Removing
- this Python version bridge code will likely require hundreds of commits and
- will be a non-trivial effort.</strong> It's likely to be deemed a low priority (it
- is glorified busy work after all), and code for the express purpose of
- supporting Python 2 will likely linger for years.</p>
- <p>We are also still shoring up our packaging and distribution story on
- Python 3. This is easier on some platforms than others. I created
- <a href="https://github.com/indygreg/PyOxidizer">PyOxidizer</a> partially because
- of the poor experience I had with Python application packaging and
- distribution through the Mercurial Project. The Mercurial Project has
- already signed off on using PyOxidizer for distributing Mercurial in
- the future. So look for an <em>oxidized</em> Mercurial distribution in the
- near future! (You could argue PyOxidizer is an epic yak shave to better
- support Mercurial. But that's for another post.)</p>
- <p>Then there's Windows support. A Python 3 powered Mercurial on Windows
- still has a handful of known issues. It may require a few more releases
- before we consider Python 3 on Windows to be stable.</p>
- <p>Because we're still on a code base that must support Python 2, our
- adoption of Python 3 features is very limited. The only Python 3
- feature that Mercurial developers seem to almost universally get excited
- about is type annotations. We already have some people playing around
- with <code>pytype</code> using comment-based annotations and <code>pytype</code> has already
- caught a few bugs. We're eager to go all in on type annotations and
- uncover lots of dynamic typing bugs and poorly implemented APIs.
- Beyond type annotations, I can't name any feature that people are screaming
- to adopt and which makes a lot of sense for Mercurial. There's a long
- tail of minor features I'm sure will get utilized. But none of the
- marquee features that define major language releases seem that interesting
- to us. Time will tell.</p>
- <h2>Commentary on Python 3</h2>
- <p>Having described Mercurial's ongoing journey to Python 3, I now want to
- focus more on Python itself. Again, the opinions here are my own and
- don't reflect those of the Mercurial Project.</p>
- <p><strong>Succinctly, my experience porting Mercurial and other projects to
- Python 3 has significantly soured my perceptions of Python. As much as
- I have historically loved Python - from the language to the welcoming
- community - I am still struggling to understand how Python could manage
- to inflict so much hardship on the community by choosing the transition
- plan that they did.</strong> I believe Python's choices represent a terrific
- example of what not to do when managing a large project or ecosystem.
- Maintainers of other widely deployed systems would benefit from taking
- the time to understand and reflect on Python's missteps.</p>
- <p>Python 3.0 was released on December 3, 2008. And it took the better part of
- a decade for the community to embrace it. <strong>This should be universally
- recognized as a failure.</strong> While hindsight is 20/20, many of the issues
- with Python 3 were obvious at the time and could have been mitigated had
- the language maintainers been more accommodating - and dare I say
- empathetic - to its users.</p>
- <p>Initially, Python 3 had a rather cavalier attitude towards backwards and
- forwards compatibility. In the early years of Python 3, the attitude of
- Python's maintainers was <em>Python 3 is a new, better language: you should
- target it explicitly</em>. There were some tools and methods to ease the
- transition. But nothing super polished, especially in the early years.
- Adoption of Python 3 in the overall community was slow. Python developers
- in the wild justifiably complained that the value proposition of Python 3
- was too weak to justify porting effort. Not helping was that the early
- advice for targeting Python 3 was to rewrite the source code to become
- Python 3 native. This is in contrast with using the same source to run on both
- Python 2 and 3. For library and application maintainers, this potentially
- meant maintaining separate versions of your code or forcing end-users to
- make a giant leap, which would realistically orphan users on an old version,
- fragmenting your user base. Neither of those was a great alternative, so
- you can understand why many projects didn't bite.</p>
- <p>For many projects of non-trivial size, flag day transitions from Python 2 to
- 3 were simply not viable: the pathway to Python 3 was to make code dual
- Python 2/3 compatible and gradually switch over the runtime to Python 3.
- But initial versions of Python 3 made this effectively impossible! Let me
- give a few specific examples.</p>
- <p>In Python 2, a string literal <code>''</code> is effectively an array of bytes. In
- Python 3, it is a series of Unicode code points - a fundamentally different
- type! In Python 2, you could write <code>b''</code> to be explicit that a string literal
- was bytes or you could write <code>u''</code> to indicate a Unicode literal, mimicking
- Python 3's behavior. In Python 3, you could write <code>b''</code> to create a <code>bytes</code>
- instance. But for whatever reason, Python 3 initially removed the <code>u''</code> syntax,
- meaning there wasn't an easy way to explicitly denote the type of each
- string literal so that it was consistent between Python 2 and 3! Python 3.3
- (released September 2012) restored <code>u''</code> support, making it more viable to
- write Python source code that worked on both Python 2 and 3. <strong>For nearly 4
- years, Python 3 took away the consistent syntax for denoting bytes/Unicode
- string literals.</strong></p>
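- <p>As a rough illustration (not taken from Mercurial's code), the snippet below shows the three literal forms discussed above as they behave on Python 3.3 or newer. The variable names are made up for the example.</p>

```python
# Illustrative sketch: literal prefixes on Python 3.3+.
raw = b"abc"      # bytes on both Python 2.6+ and Python 3
text = u"abc"     # Unicode on Python 2; a SyntaxError on Python 3.0-3.2
default = "abc"   # byte string on Python 2, Unicode str on Python 3

# On Python 3 the unprefixed and u'' forms are the same type.
literal_types = (type(raw).__name__, type(text).__name__, type(default).__name__)
print(literal_types)

# Indexing shows the difference: bytes yield integers, str yields characters.
first_items = (raw[0], text[0])
print(first_items)
```

- <p>With both prefixes available, every literal in a dual Python 2/3 code base can state its type explicitly instead of relying on the interpreter's default.</p>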
- <p>Another feature was <code>%</code> formatting of strings. Python 2 allowed use of the
- <code>%</code> formatting operator on both its string types. But Python 3 initially
- removed the implementation of <code>%</code> from <code>bytes</code>. Why, I have no clue. It
- is perfectly reasonable to splice byte sequences into a buffer via use of
- a formatting string. But the Python language maintainers insisted otherwise.
- And it wasn't until the community complained about its absence loudly enough
- that this feature was
- <a href="https://docs.python.org/3/whatsnew/3.5.html#whatsnew-pep-461">restored in Python 3.5</a>,
- which was released in September 2015. Fun fact: the lack of this feature was
- once considered a blocker for Mercurial moving to Python 3 because
- Mercurial uses <code>bytes</code> almost universally, which meant that nearly every use
- of <code>%</code> would have to be changed to something else. And to this day, Python
- 3's <code>bytes</code> still doesn't have a <code>format()</code> method, so the alternative was
- effectively string concatenation, which is a massive step backwards from the
- expressiveness of <code>%</code> formatting.</p>
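- <p>To make the difference concrete, here is a minimal sketch comparing <code>%</code> formatting on <code>bytes</code> (Python 3.5+) with the concatenation fallback that earlier Python 3 versions forced. The function names and the sample values are hypothetical, not Mercurial's.</p>

```python
def format_line(node, size):
    # Python 3.5+ (PEP 461): %s takes bytes-like objects, %d takes integers.
    return b"%s %d\n" % (node, size)


def format_line_concat(node, size):
    # Fallback for Python 3.0-3.4: build the same buffer by hand.
    return node + b" " + str(size).encode("ascii") + b"\n"


line = format_line(b"abc123", 42)
print(line)

# Both approaches produce the same bytes; bytes still has no format() method.
same_result = line == format_line_concat(b"abc123", 42)
has_format = hasattr(bytes, "format")
print(same_result, has_format)
```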
- <p><strong>The initial approach of Python 3 mirrors a folly that many developers
- and projects make: attempting a rewrite instead of performing incremental
- evolution.</strong> For established projects, large scale rewrites often go poorly.
- And Python 3 is no exception. Yes, at the code level, CPython (and likely
- other Python implementations) evolved incrementally from Python 2 using
- the same code base. But from a language and standard library level, the
- differences in Python 3 were significant enough that I - and even Python's
- core maintainers - considered it a new language, and therefore a rewrite.
- When your random project attempts a rewrite and fails, the blast radius of that is
- often contained to that project. Maybe you don't publish a new release
- as soon as you otherwise would. <strong>But when you are powering an ecosystem,
- the ripple effects from a failed rewrite percolate throughout that ecosystem
- and last for years and have many second order effects. We see this with
- Python 3, where poor choices made in the late 2000s are inflicting significant
- hardship still in 2020.</strong></p>
- <p>From the initial restrained adoption of Python 3, it is obvious that the
- Python ecosystem overwhelmingly rejected the initial boil-the-ocean approach
- of Python 3. Python's maintainers eventually got the message and started
- restoring features like <code>u''</code> and <code>bytes</code> <code>%</code> formatting back into the
- language to placate the community. All the while Python 3 had been accumulating
- new features and the cumulative sum of those features was compelling enough
- to win over users.</p>
- <p>For many projects (including Mercurial), Python 3.4/3.5 was the first viable
- porting target for Python 3. Python 3.5 was released in September 2015, almost
- 7 years after Python 3.0 was released in December 2008. <strong>Seven. Years.</strong>
- An ecosystem that falters for that long is generally not healthy. What may have
- saved Python from total collapse here is that Python 2 was still going strong and
- people were generally happy with it. I really do think Python dodged a bullet
- here, because there was a massive window where the language could have
- hemorrhaged a critical amount of its user base and been relegated to an
- afterthought. One could draw an analogy to Perl, which lost out to PHP,
- Python, and Ruby, and whose fall from grace aligned with a lengthy
- transition from Perl 5 to 6.</p>
- <p>If you look back at the early history of Python 3, <strong>I think you are forced
- to conclude that Python effectively kneecapped itself for 5-7 years
- through questionable implementation choices that prevented users from
- making incremental transitions between the major language versions. 2008
- to 2013-2015 should be known as the <em>lost years of Python</em> because so much
- opportunity and energy was squandered.</strong> Yes, Python is still healthy today
- and Python 3 is (finally) being adopted at scale. But had earlier versions
- of Python 3 been more <em>empathetic</em> towards Python 2 users porting to it,
- Python and Python 3 in 2020 would be even stronger than they are. The community
- was artificially hindered for years. And we won't know until 2023-2025 what
- things could have looked like in 2020 had the Python core language team
- spent more time paving a smoother road between the major language versions.</p>
- <p>To be clear, I do think Python 3 is generally a better language than Python 2.
- It has fewer warts, more compelling features, and better performance (except
- for startup time, which is still slower than Python 2). I am ecstatic the
- community is finally rallying around Python 3! For my Python coding, it has
- reached the point where I curse under my breath when I need to support
- Python 2 or even older versions of Python 3, like 3.5 or 3.6: I just wish
- the world would move on and adopt the future already!</p>
- <p>But I would be remiss if I failed to mention some of my gripes with Python
- 3 beyond the transition shenanigans.</p>
- <p>Perhaps my least favorite <em>feature</em> of Python 3 is its insistence that the
- world is Unicode. In Python 2, the default string type was backed by
- bytes. In Python 3, the default string type is backed by Unicode code
- points. As part of that transition, large parts of the standard library
- now operate in the Unicode space instead of the domain of bytes. I understand
- why Python does this: they want <em>strings</em> to be Unicode and don't want
- users to have to spend that much energy thinking about when to use
- <code>str</code> versus <code>bytes</code>. This approach is admirable and somewhat defensible
- because it takes a stand on a solution that is arguably <em>good enough</em> for
- most users. However, <strong>the approach of assuming the world is Unicode is
- flat out wrong and has significant implications for systems level
- applications</strong> (like version control tools).</p>
- <p>There are a myriad of places in Python's standard library where Python
- insists on using the Unicode-backed <code>str</code> type and rejects <code>bytes</code>. For
- example, various networking modules refuse to accept <code>bytes</code> for hostnames
- or URLs. HTTP libraries won't accept <code>bytes</code> for HTTP header names or values.
- Functions that are proxies to POSIX-defined functions won't accept <code>bytes</code>
- even though the POSIX functions they call into use <code>char *</code> and aren't
- Unicode aware. Then there's filename handling, where Python assumes the
- existence of a global encoding for filenames and uses this encoding to convert
- between <code>str</code> and <code>bytes</code>. And it does this despite POSIX filesystem paths
- being a bag of bytes where the only rules are that <code>\0</code> terminates the
- filename and <code>/</code> is special.</p>
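- <p>A small sketch of what that conversion involves, assuming a filename whose bytes are not valid UTF-8. On POSIX, <code>os.fsdecode()</code> and <code>os.fsencode()</code> apply the <code>surrogateescape</code> error handler together with the filesystem encoding; the example below spells out the same handler with an explicit UTF-8 codec so that it behaves identically on any platform. The filename is invented for illustration.</p>

```python
# A Latin-1 encoded name: legal on a POSIX filesystem, but not valid UTF-8.
raw_name = b"caf\xe9.txt"

# Strict decoding rejects it outright.
try:
    raw_name.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape smuggles the undecodable byte through as a lone surrogate.
as_str = raw_name.decode("utf-8", "surrogateescape")
round_tripped = as_str.encode("utf-8", "surrogateescape")

# The resulting str is not well-formed Unicode, so strict encoding fails.
try:
    as_str.encode("utf-8")
    printable = True
except UnicodeEncodeError:
    printable = False

print(strict_ok, round_tripped == raw_name, printable)
```

- <p>The round trip is lossless, but the intermediate <code>str</code> cannot be safely written to a log or sent over the wire without handling the escaped byte, which is the kind of friction a tool that treats paths as raw bytes runs into.</p>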
- <p>In cases like Python refusing to accept <code>bytes</code> for things like HTTP
- header names (which will just be spit out over the wire as bytes), Python's
- pendulum has swung too far towards Unicode only. In my opinion, Python needs
- to be more accommodating and allow <code>bytes</code> when it makes sense. I hope the
- pendulum knocks some sense into people when it swings back towards a more
- reasonable solution that better acknowledges the realities of the world we
- live in.</p>
- <p>For areas like filename handling, the world is more complicated. Python
- is effectively an abstraction layer over the operating system APIs exposing
- this functionality. And there is often an impedance mismatch between operating
- systems. For example, POSIX (Linux) tends to use <code>char *</code> for everything
- and doesn't care about encoding, while Windows tends to use 16-bit character
- types where the encoding is... a can of worms.</p>
- <p><strong>The reality here is that it is impossible to abstract over differences
- between operating system behavior without compromises that can result in data
- loss, outright wrong behavior, or loss of functionality. But Python 3 attempts
- to do it anyway, making Python 3 unsuitable (or at least highly undesirable) for
- certain systems level applications that rely on it</strong> (like a version control
- tool).</p>
- <p>In fairness to Python, it isn't the only programming language that gets
- this wrong. The only language I've seen <em>properly</em> implement higher-order
- abstractions on top of operating system facilities is Rust, whose approach can
- be generalized as <em>use Python 3's solution of normalizing to Unicode/UTF-8 by
- default</em>, but expose <em>escape hatches</em> which allow access to the raw underlying
- types and APIs used by the operating system for the advanced consumers who
- require it. For example, Rust's <code>Path</code> type which represents a filesystem path
- <a href="https://doc.rust-lang.org/std/path/struct.Path.html#method.as_os_str">allows access</a>
- to the raw <a href="https://doc.rust-lang.org/std/ffi/struct.OsStr.html">OsStr</a> value
- used by the operating system, not a normalization of it to bytes or Unicode,
- which may be lossy. This allows consumers to e.g. create and retrieve
- OS-native filesystem paths without data loss. This functionality is critical
- in some domains. Python 3's awareness/insistence that the world is
- Unicode (which it isn't universally) reduces Python's applicability in these
- domains.</p>
- <p>Speaking of Rust, at the Mercurial developer meetup in October 2019, we were
- discussing the use of Rust in Mercurial and one of the core maintainers blurted
- out something along the lines of <em>if Rust were at its current state 5 years ago,
- Mercurial would have likely ported from Python 2 to Rust instead of Python 3</em>.
- As crazy as it initially sounded, I think I agree with that assessment. With the
- benefit of hindsight, having been a key player in the Python 3 porting effort,
- seeing all the complications and headaches Python 3 is introducing, and
- having learned Rust and witnessed its benefits for performance, control,
- and correctness firsthand, porting to Rust would likely have been the correct
- move for the project at that point in time. 2020 is not 2014, however, and I'm
- not sure if I would opt for a rewrite in Rust today. (Most rewrites are follies
- after all.) But I know one thing: I certainly wouldn't implement a new version
- control tool in Python 3 and I would probably choose Rust as an implementation
- language for most new projects in the systems level space or with an expected
- shelf life of 10+ years. (I really should blog about how awesome Rust is.)</p>
- <p>Back to the topic of Python itself, <strong>I'm really soured on Python at this
- point in time. The effort required to port to Python 3 was staggering. For
- Mercurial, Python 3 introduces a ton of problems and doesn't really solve
- many. We effectively slogged through mud for several years only to wind
- up in a state that feels strictly worse than where we started. I'm sure it will
- be strictly better in a few years. But at that point, we're talking about a
- 5+ year transition. To call the Python 3 transition disruptive and
- distracting for the project would be an understatement. As a project maintainer,
- it's natural to ask what we could have accomplished if we weren't forced
- to carry out this sideshow.</strong></p>
- <p>I can't shake the feeling that a lot of the pain inflicted by the Python 3
- transition could have been avoided had Python's language leadership made
- a different set of decisions and more highly prioritized the transition
- experience. (Like not initially removing features like <code>u''</code> and <code>bytes %</code>
- and not introducing gratuitous backwards compatibility breaks, like with
- <code>items()/iteritems()</code>. I would have also liked to see a feature like
- <code>from __future__</code> - maybe <code>from __past__</code> - that would make it easier for
- Python 3 code to target semantics in earlier versions in order to provide
- a more turnkey on-ramp onto new versions.) I simultaneously see Python 3
- losing its position as a justifiable tool in some domains (like systems
- level tooling) due to ongoing design decisions and poor implementation (like
- startup overhead problems). (In contrast, I see Rust excelling where Python
- is faltering and find Rust code surprisingly expressive to write and maintain
- given how low-level it is and therefore feel that Rust is a compelling
- alternative to Python in a surprisingly large number of domains.)</p>
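- <p>For readers who never had to straddle both versions, the <code>items()/iteritems()</code> break mentioned above typically meant carrying a small shim like the following. This is a generic sketch in the spirit of helpers such as <code>six.iteritems</code>, not Mercurial's actual compatibility code, and the helper name is only illustrative.</p>

```python
import sys

if sys.version_info[0] >= 3:
    def iteritems(d):
        # Python 3: items() returns a view; wrap it to get a plain iterator.
        return iter(d.items())
else:
    def iteritems(d):
        # Python 2: items() builds a full list, iteritems() is the lazy form.
        return d.iteritems()


counts = {"added": 3}
pairs = list(iteritems(counts))
print(pairs)
```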
- <p>Look, I know it is easy for me to armchair quarterback and critique with the
- benefit of hindsight/ignorance. I'm sure there is a lot of nuance here. I'm
- sure there was disagreement within the Python community over a lot of these
- issues. Maintaining a large and successful programming language and community
- like Python's is hard and you aren't going to please all the people all the
- time. And speaking as a maintainer, I have mad respect for the people leading
- such a large community. But niceties aside, everyone knows the Python 3
- transition was rough and could have gone better. It should not have taken 11
- years to get to where we are today.</p>
- <p><strong>I'd like to encourage the Python Project to conduct a thorough postmortem on
- the transition to Python 3.</strong> Identify what went well, what could have gone
- better, and what should be done next time such a large language change is wanted.
- Speaking as a Python user, a maintainer of a Python project, and as someone in
- industry who is now skeptical about use of Python at work due to risks of
- potentially company-crippling high-effort migrations in the future, a postmortem
- would help restore my confidence that Python's maintainers learned from the
- various missteps on the road to Python 3 and that these potentially
- ecosystem-crippling mistakes won't be made again.</p>
- <p>Python had a wildly successful past few decades. And it can continue to
- thrive for several more. But the Python 3 migration was painful for all
- involved. And as much as we need to move on and leave Python 2 behind us,
- there are some important lessons to be learned. I hope the Python community
- takes the opportunity to reflect and am confident it will grow stronger by
- taking the time to do so.</p>
|