- title: Let's talk about usernames
- url: https://www.b-list.org/weblog/2018/feb/11/usernames/
- hash_url: f11fd87b74b7e887269b0e4f300de405
-
- <p>A few weeks ago I released <a href="https://www.b-list.org/projects/django-registration/">django-registration</a> 2.4.1. The 2.4 series is the last in the django-registration 2.x line, and from here on out it’ll only get bugfixes. The <code>master</code> branch is now prepping for 3.0, which will remove a lot of the deprecated cruft that’s accumulated over the past decade of maintaining it, and try to focus on best practices for modern Django applications.</p>
- <p>I’ll write more about that sometime soon, but right now I want to spend a little bit of time talking about a deceptively hard problem django-registration has to deal with: usernames. And while I could write this as one of those “falsehoods programmers believe about X” articles, my personal preference is to actually explain why this is trickier than people think, and offer some advice on how to deal with it, rather than just provide mockery with no useful context.</p>
- <h2>Aside: the right way to do identity</h2>
- <p>Usernames — as implemented by many sites and services, and by many popular frameworks (including Django) — are almost certainly not the right way to solve the problem they’re often used to solve. What we really want in terms of identifying users is some combination of:</p>
- <ol>
- <li>System-level identifier, suitable for use as a target of foreign keys in our database</li>
- <li>Login identifier, suitable for use in performing a credential check</li>
- <li>Public identity, suitable for displaying to other users</li>
- </ol>
- <p>Many systems ask the username to fulfill all three of these roles, which is probably wrong. A better approach is <a href="http://habitatchronicles.com/2008/10/the-tripartite-identity-pattern/">the tripartite identity pattern</a>, in which each identifier is distinct, and multiple login and/or public identifiers may be associated with a single system identifier.</p>
- <p>Many of the problems and pains I’ve seen with people trying to build and scale account systems have come down to ignoring this pattern. An unfortunate number of hacks have been built on top of systems which <em>don’t</em> have this pattern, in order to make them look or sort-of act as if they do.</p>
- <p>So if you’re building an account system from scratch today in 2018, I would suggest reading up on this pattern and using it as the basis of your implementation. The flexibility it will give you in the future is worth a little bit of work, and one of these days someone might even build a good generic reusable implementation of it (I’ve certainly given thought to doing this for Django, and may still do it one day).</p>
- <p>For the rest of this post, though, I’ll be assuming that you’re using a more common implementation where a unique username serves as at least a system and login identifier, and probably also a public identifier. And by “username” I mean essentially any string identifier; you may be using usernames in the sense that, say, Reddit or Hacker News do, or you might be using email addresses, or you might be using some other unique string. But no matter what, you’re probably using <em>some</em> kind of single unique string for this, and that means you need to be aware of some issues.</p>
- <h2>Uniqueness is harder than you think</h2>
- <p>You might be thinking to yourself, how hard can this be? We can just create a unique column and we’re good to go! Here, let’s make a user table in Postgres:</p>
- <div class="codehilite"><pre><span/><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">accounts</span> <span class="p">(</span>
- <span class="n">id</span> <span class="nb">SERIAL</span> <span class="k">PRIMARY</span> <span class="k">KEY</span><span class="p">,</span>
- <span class="n">username</span> <span class="nb">TEXT</span> <span class="k">UNIQUE</span><span class="p">,</span>
- <span class="n">password</span> <span class="nb">TEXT</span><span class="p">,</span>
- <span class="n">email_address</span> <span class="nb">TEXT</span>
- <span class="p">);</span>
- </pre></div>
-
- <p>There’s our user table, there’s our unique username column. Easy!</p>
- <p>Well, it’s easy until we start thinking about case. If you’re registered as <code>john_doe</code>, what happens if I register as <code>JOHN_DOE</code>? It’s a different username, but could I cause people to think I’m you? Could I get people to accept friend requests or share sensitive information with me because they don’t realize case matters to a computer?</p>
- <p>This is a simple thing that a lot of systems get wrong. In researching for this post, I discovered Django’s auth system doesn’t enforce case-insensitive uniqueness of usernames, despite getting quite a lot of other things generally right in its implementation. There is a ticket for making usernames case-insensitive, but it’s <span class="caps">WONTFIX</span> now because making usernames case-insensitive would be a massive backwards-compatibility break and nobody’s sure whether or how we could actually do it. I’ll probably look at enforcing it in django-registration 3.0, but I’m not sure it’ll be possible to do even there — any site with existing case-sensitive accounts that bolts on a case-insensitive solution is asking for trouble.</p>
- <p>So if you’re going to build a system from scratch today, you should be doing case-insensitive uniqueness checks on usernames from day one; <code>john_doe</code>, <code>John_Doe</code>, and <code>JOHN_DOE</code> should all be the same username in your system, and once one of them is registered, none of the others should be available.</p>
- <p>But that’s just the start; we live in a Unicode world, and determining if two usernames are the same in a Unicode world is more complex than just doing <code>username1 == username2</code>. For one thing, there are composed and decomposed forms which are distinct when compared as sequences of Unicode code points, but render on-screen as visually identical to each other. So now you need to talk about normalization, pick a normalization form, and then normalize every username to your chosen form <em>before</em> you do any uniqueness checks.</p>
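- <p>As a minimal sketch of the composed/decomposed problem, the snippet below builds the same visible name two ways and compares them before and after normalization. The example name and variable names are illustrative assumptions, not taken from any real system.</p>

```python
import unicodedata

# Two spellings of the same visible text: a precomposed "e with diaeresis"
# versus a plain "e" followed by a combining diaeresis.
composed = "zo\u00eb"
decomposed = "zoe\u0308"

# Compared as raw code point sequences, they differ.
raw_equal = composed == decomposed
print(raw_equal)                        # False
print(len(composed), len(decomposed))   # 3 4

# Normalized to the same form, they compare equal.
nfc_equal = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)
print(nfc_equal)                        # True
```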
- <p>You also need to be considering non-<abbr title="American Standard Code for Information Interchange"><span class="caps">ASCII</span></abbr> when thinking about how to do your case-insensitive checks. Is <code>StraßburgJoe</code> the same user as <code>StrassburgJoe</code>? What answer you get will often depend on whether you do your check by normalizing to lowercase or uppercase. And then there are the different ways of decomposing Unicode; you can and will get different results for many strings depending on whether you use canonical equivalence or compatibility.</p>
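- <p>A short illustration of both points, using only the Python standard library. The ligature example is my own assumption, chosen because it behaves differently under canonical and compatibility normalization.</p>

```python
import unicodedata

a, b = "Stra\u00dfburgJoe", "StrassburgJoe"

lower_equal = a.lower() == b.lower()
upper_equal = a.upper() == b.upper()
fold_equal = a.casefold() == b.casefold()

print(lower_equal)  # False: lowercasing keeps the sharp s as-is
print(upper_equal)  # True: uppercasing turns the sharp s into "SS"
print(fold_equal)   # True: case folding maps the sharp s to "ss"

# Canonical vs. compatibility: the "fi" ligature (U+FB01) survives NFC
# but is replaced by the two letters "f" and "i" under NFKC.
ligature = "\ufb01sh"
nfc_result = unicodedata.normalize("NFC", ligature)
nfkc_result = unicodedata.normalize("NFKC", ligature)
print(nfc_result == "fish")   # False
print(nfkc_result == "fish")  # True
```

- <p>If you need a caseless comparison in Python, <code>str.casefold()</code> is usually a safer starting point than <code>lower()</code> or <code>upper()</code>, though which answer is "right" for your users remains a policy decision.</p>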
- <p>If all this is confusing — and it is, even if you’re a Unicode geek! — my recommendation is to follow <a href="http://www.unicode.org/reports/tr36/#Recommendations_General">the advice of Unicode Technical Report 36</a> and normalize usernames using <abbr title="Compatibility Decomposition, Canonical Composition"><span class="caps">NFKC</span></abbr>. If you’re using Django’s <code>UserCreationForm</code> or a subclass of it (django-registration uses subclasses of <code>UserCreationForm</code>), this is already done for you. If you’re using Python but not Django (or not using <code>UserCreationForm</code>), you can do this in one line using a helper from the standard library:</p>
- <div class="codehilite"><pre><span/><span class="kn">import</span> <span class="nn">unicodedata</span>
-
- <span class="n">username_normalized</span> <span class="o">=</span> <span class="n">unicodedata</span><span class="o">.</span><span class="n">normalize</span><span class="p">(</span><span class="s1">'NFKC'</span><span class="p">,</span> <span class="n">username</span><span class="p">)</span>
-
- </pre></div>
-
- <p>For other languages, look up a good Unicode library.</p>
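- <p>Putting normalization and case-insensitivity together, one possible shape for a uniqueness check is sketched below. The names <code>username_key</code> and <code>UsernameRegistry</code> are hypothetical, and the in-memory set merely stands in for a unique database index on the computed key; this is not how Django or django-registration implement it.</p>

```python
import unicodedata


def username_key(username: str) -> str:
    """Return a comparison key: NFKC-normalize, case-fold, normalize again."""
    folded = unicodedata.normalize("NFKC", username).casefold()
    # Case folding can yield text that is no longer normalized,
    # so run the normalization a second time.
    return unicodedata.normalize("NFKC", folded)


class UsernameRegistry:
    """Hypothetical in-memory stand-in for a unique index on the key."""

    def __init__(self):
        self._taken = set()

    def register(self, username: str) -> bool:
        """Claim the name; return False if an equivalent name is taken."""
        key = username_key(username)
        if key in self._taken:
            return False
        self._taken.add(key)
        return True


registry = UsernameRegistry()
print(registry.register("john_doe"))  # True
print(registry.register("JOHN_DOE"))  # False
```

- <p>In a real system you would likely store the username as entered for display, and put the unique constraint on the key.</p>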
- <h2>No, really, uniqueness is harder than you think</h2>
- <p>Unfortunately, that’s not the end of it. Case-insensitive uniqueness checks on normalized strings are a start, but won’t catch all the cases you probably need to catch. For example, consider the following username: <code>jane_doe</code>. Now consider another username: <code>jаne_doe</code>. Are these the same username?</p>
- <p>In the typeface I’m using as I write this, and in the typeface my blog uses, they <em>appear</em> to be. But to software, they’re very much <em>not</em> the same, and still aren’t the same after Unicode normalization and case-insensitive comparison (whether you go to upper- or lower-case doesn’t matter).</p>
- <p>To see why, pay attention to the second code point. In one of the usernames above, it’s <code>U+0061 LATIN SMALL LETTER A</code>. But in the other, it’s <code>U+0430 CYRILLIC SMALL LETTER A</code>. And no amount of Unicode normalization or case insensitivity will make those be the same code point, even though they’re often visually indistinguishable.</p>
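- <p>You can confirm this from Python with the standard library. The helper name <code>key</code> below is just an illustrative assumption for "normalize, then case-fold".</p>

```python
import unicodedata

latin = "jane_doe"
mixed = "j\u0430ne_doe"  # second code point is Cyrillic

latin_name = unicodedata.name(latin[1])
mixed_name = unicodedata.name(mixed[1])
print(latin_name)  # LATIN SMALL LETTER A
print(mixed_name)  # CYRILLIC SMALL LETTER A


def key(s):
    """Normalize with NFKC, then case-fold."""
    return unicodedata.normalize("NFKC", s).casefold()


# Neither normalization nor case folding maps one letter onto the other.
print(key(latin) == key(mixed))  # False
```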
- <p>This is the basis of the homograph attack, which first gained widespread notoriety in the context of <a href="https://en.wikipedia.org/wiki/IDN_homograph_attack">internationalized domain names</a>. And solving it requires a bit more work.</p>
- <p>For network host names, one solution is to represent names in <a href="https://en.wikipedia.org/wiki/Punycode">Punycode</a>, which is designed to head off precisely this issue, and also provides a way to represent a non-<abbr title="American Standard Code for Information Interchange"><span class="caps">ASCII</span></abbr> name using only <abbr title="American Standard Code for Information Interchange"><span class="caps">ASCII</span></abbr> characters. Returning to our example usernames above, this makes the distinction between the two obvious. If you want to try it yourself, it’s a one-liner in Python. Here it is on the version which includes the Cyrillic ‘а’:</p>
- <div class="codehilite"><pre><span/><span class="gp">>>> </span><span class="s1">'jаne_doe'</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s1">'punycode'</span><span class="p">)</span>
- <span class="go">b'jne_doe-2fg'</span>
- </pre></div>
-
- <p>(if you have difficulty copy/pasting the non-<abbr title="American Standard Code for Information Interchange"><span class="caps">ASCII</span></abbr> character, you can also express it in a string literal as <code>j\u0430ne_doe</code>)</p>
- <p>But this isn’t a real solution for usernames; sure, you could use Punycode representation whenever you display a name, but it will break display of a lot of perfectly legitimate non-<abbr title="American Standard Code for Information Interchange"><span class="caps">ASCII</span></abbr> names, and what you probably really <em>want</em> is to reject the above username during your signup process. How can you do that?</p>
- <p>Well, this time we open our hymnals to <a href="http://www.unicode.org/reports/tr39/">Unicode Technical Standard 39</a>, and begin reading sections 4 and 5. Sets of code points which are distinct (even after normalization) but visually identical or at least confusingly similar when rendered for display are called (appropriately) “confusables”, and Unicode does provide mechanisms for detecting the presence of such code points.</p>
- <p>The example username we’ve been looking at here is what Unicode terms a “mixed-script confusable”, and this is what we probably want to detect. In other words: an all-Latin username containing confusables is probably fine, and an all-Cyrillic username containing confusables is probably fine, but a username containing mostly Latin plus one Cyrillic code point which happens to be confusable with a Latin one… is not.</p>
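- <p>To make "mixed-script" concrete, here is a deliberately crude sketch that guesses a letter's script from the first word of its Unicode character name. This is an assumption-laden approximation: it is not the Unicode Script property, it does not consult the confusables data, and it flags <em>any</em> mix of scripts whether or not the characters involved are actually confusable. The function names are hypothetical.</p>

```python
import unicodedata


def rough_scripts(text):
    """Guess scripts from the first word of each letter's Unicode name."""
    scripts = set()
    for ch in text:
        if ch.isalpha():  # skip digits, underscores, punctuation
            # e.g. "CYRILLIC SMALL LETTER A" gives "CYRILLIC"
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def looks_mixed_script(text):
    """True if the letters in `text` appear to come from several scripts."""
    return len(rough_scripts(text)) > 1


print(looks_mixed_script("jane_doe"))        # False
print(looks_mixed_script("j\u0430ne_doe"))   # True
```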
- <p>Unfortunately, Python doesn’t provide the necessary access to the full set of Unicode properties and tables in the standard library to be able to do this. But a helpful developer named Victor Felder has written <a href="http://confusable-homoglyphs.readthedocs.io/en/latest/index.html">a library which provides what we need</a>, and released it under an open-source license. Using the <code>confusable_homoglyphs</code> library, we can detect the problem:</p>
- <div class="codehilite"><pre><span/><span class="gp">>>> </span><span class="kn">from</span> <span class="nn">confusable_homoglyphs</span> <span class="kn">import</span> <span class="n">confusables</span>
- <span class="gp">>>> </span><span class="n">s1</span> <span class="o">=</span> <span class="s1">'jane_doe'</span>
- <span class="gp">>>> </span><span class="n">s2</span> <span class="o">=</span> <span class="s1">'j</span><span class="se">\u0430</span><span class="s1">ne_doe'</span>
- <span class="gp">>>> </span><span class="nb">bool</span><span class="p">(</span><span class="n">confusables</span><span class="o">.</span><span class="n">is_dangerous</span><span class="p">(</span><span class="n">s1</span><span class="p">))</span>
- <span class="go">False</span>
- <span class="gp">>>> </span><span class="nb">bool</span><span class="p">(</span><span class="n">confusables</span><span class="o">.</span><span class="n">is_dangerous</span><span class="p">(</span><span class="n">s2</span><span class="p">))</span>
- <span class="go">True</span>
- </pre></div>
-
- <p>The actual output of <a href="http://confusable-homoglyphs.readthedocs.io/en/latest/apidocumentation.html#confusable_homoglyphs.confusables.is_dangerous">is_dangerous()</a>, for the second username, is a data structure containing detailed information about the potential problems, but what we care about is that it detects a mixed-script string containing code points which are confusable, and that’s what we want.</p>
- <p>Django allows non-<abbr title="American Standard Code for Information Interchange"><span class="caps">ASCII</span></abbr> in usernames, but does not check for homograph problems. Since version 2.3, though, django-registration has had a dependency on <code>confusable_homoglyphs</code>, and has used its <code>is_dangerous()</code> function as part of the validation for usernames and email addresses. If you need to do user signups in Django (or generally in Python), and can’t or don’t want to use django-registration, I encourage you to make use of <code>confusable_homoglyphs</code> in the same way.</p>
- <h2>Have I mentioned that uniqueness is hard?</h2>
- <p>Once we’re dealing with Unicode confusables, it’s worth also asking whether we should deal with <em>single-script</em> confusables. For example, <code>paypal</code> and <code>paypa1</code>, which (depending on your choice of typeface) may be difficult to distinguish from one another. So far, everything I’ve suggested is good general-purpose advice, but this is starting to get into things which are specific to particular languages, scripts or geographic regions, and should only be done with care and with the potential tradeoffs in mind (forbidding confusable Latin characters may end up with a higher false-positive rate than you’d like, for example). But it is something worth thinking about. The same goes for usernames which are distinct but still very similar to each other; you can check this at the database level in a variety of ways — Postgres, for example, ships with support for <a href="https://en.wikipedia.org/wiki/Soundex">Soundex</a> and <a href="https://en.wikipedia.org/wiki/Metaphone">Metaphone</a>, as well as <a href="https://en.wikipedia.org/wiki/Levenshtein_distance">Levenshtein distance</a> and <a href="https://www.postgresql.org/docs/9.6/static/pgtrgm.html">trigram fuzzy matching</a> — but again it’s going to be something you do on a case-by-case basis, rather than just something you should generally always do.</p>
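- <p>If you would rather experiment with similarity checks in application code than in the database, a plain-Python Levenshtein distance is short enough to sketch. The <code>too_similar</code> helper and its <code>max_distance</code> threshold are hypothetical policy choices, not a recommendation; a threshold of 1 will produce false positives on short names.</p>

```python
def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def too_similar(candidate, existing, max_distance=1):
    """Return the taken names within `max_distance` edits of `candidate`."""
    return [n for n in existing if levenshtein(candidate, n) <= max_distance]


print(levenshtein("paypal", "paypa1"))             # 1
print(too_similar("paypa1", ["paypal", "amazon"]))  # ['paypal']
```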
- <p>There is one more uniqueness issue I want to mention, though, and it primarily affects email addresses, which often get used as usernames these days (especially in services which rely on a third-party identity provider and use OAuth or similar protocols). So assume you’ve got a case for enforcing uniqueness of email addresses. How many distinct email addresses are listed below?</p>
- <ul>
- <li><code>johndoe@example.com</code></li>
- <li><code>johndoe+yoursite@example.com</code></li>
- <li><code>john.doe@example.com</code></li>
- </ul>
- <p>The answer is “it depends”. Most <abbr title="Mail Transfer Agents">MTAs</abbr> have long ignored anything after a <code>+</code> in the local-part when determining recipient identity, which in turn has led to many people using text after a <code>+</code> as a sort of <em>ad hoc</em> tagging and filtering system. And Gmail famously ignores dot (<code>.</code>) characters in the local-part, including in their custom-domain offerings, so it’s impossible without doing <abbr title="Domain Name System"><span class="caps">DNS</span></abbr> lookups to figure out whether someone’s mail provider actually thinks <code>johndoe</code> and <code>john.doe</code> are distinct.</p>
- <p>So if you’re enforcing unique email addresses, or using email addresses as a user identifier, you need to be aware of this and you probably need to strip all dot characters from the local-part, along with <code>+</code> and any text after it, before doing your uniqueness check. Currently django-registration doesn’t do this, but I have plans to add it in the 3.x series.</p>
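- <p>A minimal sketch of that stripping step follows. The function name <code>email_uniqueness_key</code> is hypothetical, and case-folding the whole address is an extra assumption of mine. The result is meant only for the uniqueness check; keep the address exactly as the user entered it for actually sending mail.</p>

```python
def email_uniqueness_key(address: str) -> str:
    """Collapse an address to a key used only for uniqueness checks."""
    local_part, at, domain = address.rpartition("@")
    if not (local_part and at and domain):
        raise ValueError("not a plausible email address: %r" % address)
    # Drop "+" and any text after it, then remove dots from the local-part.
    local_part = local_part.split("+", 1)[0].replace(".", "")
    return local_part.casefold() + "@" + domain.casefold()


addresses = [
    "johndoe@example.com",
    "johndoe+yoursite@example.com",
    "john.doe@example.com",
]
keys = {email_uniqueness_key(a) for a in addresses}
print(keys)  # {'johndoe@example.com'}
```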
- <p>Also, for dealing with Unicode confusables in email addresses: apply that check to the local-part and the domain <em>separately</em>. People don’t always have control over the script used for the domain, and shouldn’t be punished for choosing something that causes the local-part to be in a single script distinct from the domain; as long as neither the local-part nor the domain, considered in isolation, is mixed-script confusable, the address is probably <span class="caps">OK</span> (and this is what django-registration’s validator does).</p>
- <p>There are a lot of other concerns you can have about usernames which are too similar to each other to be considered “distinct”, but once you deal with case-insensitivity, normalization, and confusables, you start getting into diminishing-returns territory pretty quickly, especially since many rules start being language-, script-, or region-specific. That doesn’t mean you shouldn’t think about them, just that it’s difficult to give general-purpose advice.</p>
- <p>So let’s switch things up a bit and consider a different category of problem.</p>
- <h2>You should have reservations about some names</h2>
- <p>Many sites use the username as more than just a field in the login form. Some will create a profile page for each user, and put the username in the <abbr title="Uniform Resource Locator"><span class="caps">URL</span></abbr>. Some might create email addresses for each user. Some might create subdomains. So here are some questions:</p>
- <ul>
- <li>If your site puts the username in the <abbr title="Uniform Resource Locator"><span class="caps">URL</span></abbr> of the user’s profile page, what would happen if I created a user named <code>login</code>? If I were to populate my profile with the text “Our log-in page has moved, please click here to log in”, with a link to my credential-harvesting site, how many of your users do you think I could fool?</li>
- <li>If your site creates email addresses from usernames, what happens if I sign up as a user named <code>webmaster</code> or <code>postmaster</code>? Will I get email directed to those addresses for your domain? Could I potentially obtain an <abbr title="Secure Sockets Layer"><span class="caps">SSL</span></abbr> certificate for your domain with the right username and auto-created email address?</li>
- <li>If your site creates subdomains from usernames, what happens if I sign up as a user named <code>www</code>? Or <code>smtp</code> or <code>mail</code>?</li>
- </ul>
- <p>If you think these are just silly hypotheticals, well, <a href="http://www.theregister.co.uk/2011/04/11/state_of_ssl_analysis/">some of them have actually happened</a>. And not just once, but <a href="https://www.tivi.fi/Kaikki_uutiset/2015-03-18/A-Finnish-man-created-this-simple-email-account---and-received-Microsofts-security-certificate-3217662.html">multiple times</a>. No really, <a href="https://twitter.com/EdOverflow/status/954093588362809345">these things have happened multiple times</a>.</p>
- <p>You can — and should — be taking some precautions to ensure that, say, an auto-created subdomain for a user account doesn’t conflict with a pre-existing subdomain you’re actually using or that has a special meaning, or that auto-created email addresses can’t clash with important/pre-existing ones.</p>
- <p>But to really be careful, you should probably also just disallow certain usernames from being registered. I first saw this suggestion — and a list of names to reserve, and the first two articles linked above — in <a href="https://ldpreload.com/blog/names-to-reserve">this blog post by Geoffrey Thomas</a>. Since version 2.1, django-registration has shipped a list of reserved names, and the list has grown with each release; it’s now around a hundred items.</p>
- <p><a href="https://github.com/ubernostrum/django-registration/blob/1d7d0f01a24b916977016c1d66823a5e4a33f2a0/registration/validators.py#L25">The list in django-registration</a> breaks names down into a few categories, which lets you compose subsets of them based on your needs (the default validator combines all of them, but lets you override with your own preferred set of reserved names):</p>
- <ul>
- <li>Hostnames used for autodiscovery/autoconfig of some well-known services</li>
- <li>Hostnames associated with common protocols</li>
- <li>Email addresses used by certificate authorities to verify domain ownership</li>
- <li>Email addresses listed in <a href="https://tools.ietf.org/html/rfc2142"><span class="caps">RFC</span> 2142</a> that don’t appear in any other subset of reserved names</li>
- <li>Common no-reply email addresses</li>
- <li>Strings which match sensitive filenames (like cross-domain access policies)</li>
- <li>A laundry list of other potentially-sensitive names like <code>contact</code> and <code>login</code></li>
- </ul>
- <p>The validator in django-registration will also reject any username which begins with <code>.well-known</code>, to protect anything which uses the <a href="https://tools.ietf.org/html/rfc5785"><span class="caps">RFC</span> 5785</a> system for “well-known locations”.</p>
- <p>As with confusables in usernames, I encourage you to copy from and improve on django-registration’s list, which in turn is based on and expanded from Geoffrey Thomas’ list.</p>
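- <p>As a rough sketch of what such a check might look like, the snippet below uses only the handful of names mentioned in this post. <code>RESERVED_NAMES</code> and <code>is_reserved</code> are hypothetical names, and the set is an illustrative subset, not django-registration's actual list or validator.</p>

```python
# Illustrative subset only, built from names mentioned in this post;
# a real list should be considerably longer.
RESERVED_NAMES = {
    "contact",
    "login",
    "mail",
    "postmaster",
    "smtp",
    "webmaster",
    "www",
}


def is_reserved(username: str) -> bool:
    """True if the name is reserved or starts with ".well-known"."""
    name = username.casefold()
    return name in RESERVED_NAMES or name.startswith(".well-known")


print(is_reserved("login"))     # True
print(is_reserved("WWW"))       # True
print(is_reserved("jane_doe"))  # False
```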
- <h2>It’s a start</h2>
- <p>The ideas above are not an exhaustive list of all the things you could or should do to validate usernames in sites and services you build, because if I started trying to write an exhaustive list, I’d be here forever. They are, though, a good baseline of things you can do, and I’d recommend you do most or all of them. And hopefully this has provided a good introduction to the lurking complexity of something as seemingly “simple” as user accounts with usernames.</p>
- <p>As I’ve mentioned, Django and/or django-registration already do most of these, and the ones that they don’t are likely to be added at least to django-registration in 3.0; Django itself may not be able to adopt some of them soon, if ever, due to stronger backwards-compatibility concerns. All the code is open source (<abbr title="Berkeley Software Distribution"><span class="caps">BSD</span></abbr> license) and so you should feel free to copy, adapt or improve it.</p>
- <p>And if there’s something important I’ve missed, please feel free to let me know about it; you can file a bug or pull request to <a href="https://github.com/ubernostrum/django-registration">django-registration on GitHub</a>, or just <a href="https://www.b-list.org/contact/">get in touch with me directly</a>.</p>