larlet-fr-david-cache

A place to cache linked articles (think custom and personal wayback machine)

title: The History of the URL: Domain, Protocol, and Port url: https://eager.io/blog/the-history-of-the-url-domain-and-protocol/ hash_url: 90a9fd7dd7

On the 11th of January 1982 twenty two computer scientists met to discuss an issue with ‘computer mail’ (now known as email). Attendees included the guy who would create Sun Microsystems, the guy who made Zork, the NTP guy, and the guy who convinced the government to pay for Unix. The problem was simple: there were 455 hosts on the ARPANET and the situation was getting out of control.

This issue was occuring now because the ARPANET was on the verge of switching from its original NCP protocol, to the TCP/IP protocol which powers what we now call the Internet. With that switch suddenly there would be a multitude of interconnected networks (an ‘Inter... net’) requiring a more ‘hierarchical’ domain system where ARPANET could resolve its own domains while the other networks resolved theirs.

Other networks at the time had great names like “COMSAT”, “CHAOSNET”, “UCLNET” and “INTELPOSTNET” and were maintained by groups of universities and companies all around the US who wanted to be able to communicate, and could afford to lease 56k lines from the phone company and buy the requisite PDP-11s to handle routing.

In the original ARPANET design, a central Network Information Center (NIC) was responsible for maintaining a file listing every host on the network. The file was known as the HOSTS.TXT file, similar to the /etc/hosts file on a Linux or OS X system today. Every network change would require the NIC to FTP (a protocol invented in 1971) to every host on the network, a significant load on their infrastructure.

Having a single file list every host on the Internet would, of course, not scale indefinitely. The priority was email, however, as it was the predominant addressing challenge of the day. Their ultimate conclusion was to create a hierarchical system in which you could query an external system for just the domain or set of domains you needed. In their words: “The conclusion in this area was that the current ‘user@host’ mailbox identifier should be extended to ‘user@host.domain’ where ‘domain’ could be a hierarchy of domains.” And the domain was born.

It’s important to dispel any illusion that these decisions were made with prescience for the future the domain name would have. In fact, their elected solution was primarily decided because it was the “one causing least difficulty for existing systems.” For example, one proposal was for email addresses to be of the form <user>.<host>@<domain>. If email usernames of the day hadn’t already had ‘.’ characters you might be emailing me at ‘zack.eager@io’ today.

UUCP and the Bang Path

It has been said that the principal function of an operating system is to define a number of different names for the same object, so that it can busy itself keeping track of the relationship between all of the different names. Network protocols seem to have somewhat the same characteristic.

-- David D. Clark, 1982

Another failed proposal involved separating domain components with the exclamation mark (!). For example, to connect to the ISIA host on ARPANET, you would connect to !ARPA!ISIA. You could then query for hosts using wildcards, so !ARPA!* would return to you every ARPANET host.

This method of addressing wasn’t a crazy divergence from the standard, it was an attempt to maintain it. The system of exclamation separated hosts dates to a data transfer tool called UUCP created in 1976. If you’re reading this on an OS X or Linux computer, uucp is likely still installed and available at the terminal.

ARPANET was introduced in 1969, and quickly became a powerful communication tool... amoung the handful of universities and government institutions which had access to it. The Internet as we know it wouldn’t become publically available outside of research insitutions until 1991, twenty one years later. But that didn’t mean computer users weren’t communicating.

In the era before the Internet, the general method of communication between computers was with a direct point-to-point dial up connection. For example, if you wanted to send me a file, you would have your modem call my modem, and we would transfer the file. To craft this into a network of sorts, UUCP was born.

In this system, each computer has a file which lists the hosts its aware of, their phone number, and a username and password on that host. You then craft a ‘path’, from your current machine to your destination, through hosts which each know how to connect to the next:

sw-hosts!digital-lobby!zack

This address would form not just a method of sending me files or connecting with my computer directly, but also would be my email address. In this era before ‘mail servers’, if my computer was off you weren’t sending me an email.

While use of ARPANET was restricted to top-tier universities, UUCP created a bootleg Internet for the rest of us. It formed the basis for both Usenet and the BBS system.

DNS

Ultimately, the DNS system we still use today would be proposed in 1983. If you run a DNS query today, for example using the dig tool, you’ll likely see a response which looks like this:

;; ANSWER SECTION:
google.com.   299 IN  A 172.217.4.206

This is informing us that google.com is reachable at 172.217.4.206. As you might know, the A is informing us that this is an ‘address’ record, mapping a domain to an IPv4 address. The 299 is the ‘time to live’, letting us know how many more seconds this value will be valid for, before it should be queried again. But what does the IN mean?

IN stands for ‘Internet’. Like so much of this, the field dates back to an era when there were several competing computer networks which needed to interoperate. Other potential values were CH for the CHAOSNET or HS for Hesiod which was the name service of the Athena system. CHAOSNET is long dead, but a much evolved version of Athena is still used by students at MIT to this day. You can find the list of DNS classes on the IANA website, but it’s no surprise only one potential value is in common use today.

TLDs

It is extremely unlikely that any other TLDs will be created.

— John Postel, 1994

Once it was decided that domain names should be arranged hierarchically, it became necessary to decide what sits at the root of that hierarchy. That root is traditionally signified with a single ‘.’. In fact, ending all of your domain names with a ‘.’ is semantically correct, and will absolutely work in your web browser: google.com.

The first TLD was .arpa. It allowed users to address their old traditional ARPANET hostnames during the transition. For example, if my machine was previously registered as hfnet, my new address would be hfnet.arpa. That was only temporary, during the transition, server administrators had a very important choice to make: which of the five TLDs would they assume? “.com”, “.gov”, “.org”, “.edu” or “.mil”.

When we say DNS is hierarchical, what we mean is there is a set of root DNS servers which are responsible for, for example, turning .com into the .com nameservers, who will in turn answer how to get to google.com. The root DNS zone of the internet is composed of thirteen DNS server clusters. There are only 13 server clusters, because that’s all we can fit in a single UDP packet. Historically, DNS has operated through UDP packets, meaning the response to a request can never be more than 512 bytes.


;       This file holds the information on root name servers needed to
;       initialize cache of Internet domain name servers
;       (e.g. reference this file in the "cache  .  "
;       configuration file of BIND domain name servers).
;
;       This file is made available by InterNIC
;       under anonymous FTP as
;           file                /domain/named.cache
;           on server           FTP.INTERNIC.NET
;       -OR-                    RS.INTERNIC.NET
;
;       last update:    March 23, 2016
;       related version of root zone:   2016032301
;
; formerly NS.INTERNIC.NET
;
.                        3600000      NS    A.ROOT-SERVERS.NET.
A.ROOT-SERVERS.NET.      3600000      A     198.41.0.4
A.ROOT-SERVERS.NET.      3600000      AAAA  2001:503:ba3e::2:30
;
; FORMERLY NS1.ISI.EDU
;
.                        3600000      NS    B.ROOT-SERVERS.NET.
B.ROOT-SERVERS.NET.      3600000      A     192.228.79.201
B.ROOT-SERVERS.NET.      3600000      AAAA  2001:500:84::b
;
; FORMERLY C.PSI.NET
;
.                        3600000      NS    C.ROOT-SERVERS.NET.
C.ROOT-SERVERS.NET.      3600000      A     192.33.4.12
C.ROOT-SERVERS.NET.      3600000      AAAA  2001:500:2::c
;
; FORMERLY TERP.UMD.EDU
;
.                        3600000      NS    D.ROOT-SERVERS.NET.
D.ROOT-SERVERS.NET.      3600000      A     199.7.91.13
D.ROOT-SERVERS.NET.      3600000      AAAA  2001:500:2d::d
;
; FORMERLY NS.NASA.GOV
;
.                        3600000      NS    E.ROOT-SERVERS.NET.
E.ROOT-SERVERS.NET.      3600000      A     192.203.230.10
;
; FORMERLY NS.ISC.ORG
;
.                        3600000      NS    F.ROOT-SERVERS.NET.
F.ROOT-SERVERS.NET.      3600000      A     192.5.5.241
F.ROOT-SERVERS.NET.      3600000      AAAA  2001:500:2f::f
;
; FORMERLY NS.NIC.DDN.MIL
;
.                        3600000      NS    G.ROOT-SERVERS.NET.
G.ROOT-SERVERS.NET.      3600000      A     192.112.36.4
;
; FORMERLY AOS.ARL.ARMY.MIL
;
.                        3600000      NS    H.ROOT-SERVERS.NET.
H.ROOT-SERVERS.NET.      3600000      A     198.97.190.53
H.ROOT-SERVERS.NET.      3600000      AAAA  2001:500:1::53
;
; FORMERLY NIC.NORDU.NET
;
.                        3600000      NS    I.ROOT-SERVERS.NET.
I.ROOT-SERVERS.NET.      3600000      A     192.36.148.17
I.ROOT-SERVERS.NET.      3600000      AAAA  2001:7fe::53
;
; OPERATED BY VERISIGN, INC.
;
.                        3600000      NS    J.ROOT-SERVERS.NET.
J.ROOT-SERVERS.NET.      3600000      A     192.58.128.30
J.ROOT-SERVERS.NET.      3600000      AAAA  2001:503:c27::2:30
;
; OPERATED BY RIPE NCC
;
.                        3600000      NS    K.ROOT-SERVERS.NET.
K.ROOT-SERVERS.NET.      3600000      A     193.0.14.129
K.ROOT-SERVERS.NET.      3600000      AAAA  2001:7fd::1
;
; OPERATED BY ICANN
;
.                        3600000      NS    L.ROOT-SERVERS.NET.
L.ROOT-SERVERS.NET.      3600000      A     199.7.83.42
L.ROOT-SERVERS.NET.      3600000      AAAA  2001:500:9f::42
;
; OPERATED BY WIDE
;
.                        3600000      NS    M.ROOT-SERVERS.NET.
M.ROOT-SERVERS.NET.      3600000      A     202.12.27.33
M.ROOT-SERVERS.NET.      3600000      AAAA  2001:dc3::35
; End of file

Root DNS servers operate in safes, inside locked cages. A clock sits on the safe to ensure the camera feed hasn’t been looped. Particularily given how slow DNSSEC implementation has been, an attack on one of those servers could allow an attacker to redirect all of the Internet traffic for a portion of Internet users. This, of course, makes for the most fantastic heist movie to have never been made.

Unsurprisingly, the nameservers for top-level TLDs don’t actually change all that often. 98% of the requests root DNS servers receive are in error, most often because of broken and toy clients which don’t properly cache their results. This became such a problem that several root DNS operators had to spin up special servers just to return ‘go away’ to all the people asking for reverse DNS lookups on their local IP addresses.

The TLD nameservers are administered by different companies and governments all around the world (Verisign manages .com). When you purchase a .com domain, about $0.18 goes to the ICANN, and $7.85 goes to Verisign.

Punycode

It is rare in this world that the silly name us developers think up for a new project makes it into the final, public, product. We might name the company database Delaware (because that’s where all the companies are registered), but you can be sure by the time it hits production it will be CompanyMetadataDatastore. But rarely, when all the stars align and the boss is on vacation, one slips through the cracks.

Punycode is the system we use to encode unicode into domain names. The problem it is solving is simple, how do you write 比薩.com when the entire internet system was built around using the ASCII alphabet whose most foreign character is the tilde?

It’s not a simple matter of switching domains to use unicode. The original documents which govern domains specify they are to be encoded in ASCII. Every piece of internet hardware from the last fourty years, including the Cisco and Juniper routers used to deliver this page to you make that assumption.

The web itself was never ASCII-only. It was actually originally concieved to speak ISO 8859-1 which includes all of the ASCII characters, but adds an additional set of special characters like ¼ and letters with special marks like ä. It does not, however, contain any non-Latin characters.

This restriction on HTML was ultimately removed in 2007 and that same year Unicode became the most popular character set on the web. But domains were still confined to ASCII.

As you might guess, Punycode was not the first proposal to solve this problem. You most likely have heard of UTF-8, which is a popular way of encoding Unicode into bytes (the 8 is for the eight bits in a byte). In the year 2000 several members of the Internet Engineering Task Force came up with UTF-5. The idea was to encode Unicode into five bit chunks. You could then map each five bits into a character allowed (A-V & 0-9) in domain names. So if I had a website for Japanese language learning, my site 日本語.com would become the cryptic M5E5M72COA9E.com.

This encoding method has several disadvantages. For one, A-V and 0-9 are used in the output encoding, meaning if you wanted to actually include one of those characters in your doman, it had to be encoded like everything else. This made for some very long domains, which is a serious problem when each segment of a domain is restricted to 63 characters. A domain in the Myanmar language would be restricted to no more than 15 characters. The proposal does make the very interesting suggestion of using UTF-5 to allow Unicode to be transmitted by Morse code and telegram though.

There was also the question of how to let clients know that this domain was encoded so they could display them in the appropriate Unicode characters, rather than showing M5E5M72COA9E.com in my address bar. There were several suggestions, one of which was to use an unused bit in the DNS response. It was the “last unused bit in the header”, and the DNS folks were “very hesitant to give it up” however.

Another suggestion was to start every domain using this encoding method with ra--. At the time (mid-April 2000), there were no domains which happened to start with those particular characters. If I know anything about the Internet, someone registered an ra-- domain out of spite immediately after the proposal was published.

The ultimate conclusion, reached in 2003, was to adopt a format called Punycode which included a form of delta compression which could dramatically shorten encoded domain names. Delta compression is a particularily good idea because the odds are all of the characters in your domain are in the same general area within Unicode. For example, two characters in Farsi are going to be much closer together than a Farsi character and another in Hindi. To give an example of how this works, if we take the nonsense phrase:

يذؽ

In an uncompressed format, that would be stored as the three characters [1610, 1584, 1597] (based on their Unicode code points). To compress this we first sort it numerically (keeping track of where the original characters were): [1584, 1597, 1610]. Then we can store the lowest value (1584), and the delta between that value and the next character (13), and again for the following character (23), which is significantly less to transmit and store.

Punycode then (very) efficiently encodes those integers into characters allowed in domain names, and inserts an xn-- at the beginning to let consumers know this is an encoded domain. You’ll notice that all the Unicode characters end up together at the end of the domain. They don’t just encode their value, they also encode where they should be inserted into the ASCII portion of the domain. To provide an example, the website 熱狗sales.com becomes xn--sales-r65lm0e.com. Anytime you type a Unicode-based domain name into your browser’s address bar, it is encoded in this way.

This transformation could be transparent, but that introduces a major security problem. All sorts of Unicode characters print identically to existing ASCII characters. For example, you likely can’t see the difference between Cyrillic small letter a (“а”) and Latin small letter a (“a”). If I register Cyrillic аmazon.com (xn--mazon-3ve.com), and manage to trick you into visiting it, it’s gonna be hard to know you’re on the wrong site. For that reason, when you visit 🍕💩.ws, your browser somewhat lamely shows you xn--vi8hiv.ws in the address bar.

Protocol

The first portion of the URL is the protocol which should be used to access it. The most common protocol is http, which is the simple document transfer protocol Tim Berners-Lee invented specifically to power the web. It was not the only option. Some people believed we should just use Gopher. Rather than being general-purpose, Gopher is specifically designed to send structured data similar to how a file tree is structured.

For example, if you request the /Cars endpoint, it might return:

1Chevy Camaro             /Archives/cars/cc     gopher.cars.com     70
iThe Camero is a classic  fake                  (NULL)              0
iAmerican Muscle car      fake                  (NULL)              0
1Ferrari 451              /Factbook/ferrari/451  gopher.ferrari.net 70

which identifies two cars, along with some metadata about them and where you can connect to for more information. The understanding was your client would parse this information into a usable form which linked the entries with the destination pages.

The first popular protocol was FTP, which was created in 1971, as a way of listing and downloading files on remote computers. Gopher was a logical extension of this, in that it provided a similar listing, but included facilities for also reading the metadata about entries. This meant it could be used for more liberal purposes like a news feed or a simple database. It did not have, however, the freedom and simplicity which characterizes HTTP and HTML.

HTTP is a very simple protocol, particularily when compared to alternatives like FTP or even the HTTP/2 protocol which is rising in popularity today. First off, HTTP is entirely text based, rather than being composed of bespoke binary incantations (which would have made it significantly more efficient). Tim Berners-Lee correctly intuited that using a text-based format would make it easier for generations of programmers to develop and debug HTTP-based applications.

HTTP also makes almost no assumptions about what you’re transmitting. Despite the fact that it was invented expliticly to accompany the HTML language, it allows you to specify that your content is of any type (using the MIME Content-Type, which was a new invention at the time). The protocol itself is rather simple:

A request:

GET /index.html HTTP/1.1
Host: www.example.com

Might respond:

HTTP/1.1 200 OK
Date: Mon, 23 May 2005 22:38:34 GMT
Content-Type: text/html; charset=UTF-8
Content-Encoding: UTF-8
Content-Length: 138
Last-Modified: Wed, 08 Jan 2003 23:11:55 GMT
Server: Apache/1.3.3.7 (Unix) (Red-Hat/Linux)
ETag: "3f80f-1b6-3e1cb03b"
Accept-Ranges: bytes
Connection: close

<html>
  <head>
    <title>An Example Page</title>
  </head>
  <body>
    Hello World, this is a very simple HTML document.
  </body>
</html>

To put this in context, you can think of the networking system the Internet uses as starting with IP, the Internet Protocol. IP is responsible for getting a small packet of data (around 1500 bytes) from one computer to another. On top of that we have TCP, which is responsible for taking larger blocks of data like entire documents and files and sending them via many IP packets reliably. On top of that, we then implement a protocol like HTTP or FTP, which specifies what format should be used to make the data we send via TCP (or UDP, etc.) understandable and meaningful.

In other words, TCP/IP sends a whole bunch of bytes to another computer, the protocol says what those bytes should be and what they mean.

You can make your own protocol if you like, assemblying the bytes in your TCP messages however you like. The only requirement is that whoever you are talking to speaks the same language. For this reason, it’s common to standardize these protocols.

There are, of course, many less important protocols to play with. For example there is a Quote of The Day protocol (port 17), and a Random Characters protocol (port 19). They may seem silly today, but they also showcase just how important that a general-purpose document transmission format like HTTP was.

Port

The timeline of Gopher and HTTP can be evidenced by their default port numbers. Gopher is 70, HTTP 80. The HTTP port was assigned (likely by Jon Postel at the IANA) at the request of Tim Berners-Lee sometime between 1990 and 1992.

This concept, of registering ‘port numbers’ predates even the Internet. In the original NCP protocol which powered the ARPANET remote addresses were identified by 40 bits. The first 32 identified the remote host, similar to how an IP address works today. The last eight were known as the AEN (it stood for “Another Eight-bit Number”), and were used by the remote machine in the way we use a port number, to separate messages destined for different processes. In other words, the address specifies which machine the message should go to, and the AEN (or port number) tells that remote machine which application should get the message.

They quickly requested that users register these ‘socket numbers’ to limit potential collisions. When port numbers were expanded to 16 bits by TCP/IP, that registration process was continued.

While protocols have a default port, it makes sense to allow ports to also be specified manually to allow for local development and the hosting of multiple services on the same machine. That same logic was the basis for prefixing websites with www.. At the time, it was unlikely anyone was getting access to the root of their domain, just for hosting an ‘experimental’ website. But if you give users the hostname of your specific machine (dx3.cern.ch), you’re in trouble when you need to replace that machine. By using a common subdomain (www.cern.ch) you can change what it points to as needed.

The Bit In-between

As you probably know, the URL syntax places a double slash (//) between the protocol and the rest of the URL:

http://eager.io

That double slash was inherited from the Apollo computer system which was one of the first networked workstations. The Apollo team had a similar problem to Tim Berners-Lee: they needed a way to separate a path from the machine that path is on. Their solution was to create a special path format:

//computername/file/path/as/usual

And TBL copied that scheme. Incidentally, he now regrets that decision, wishing the domain (in this case example.com) was the first portion of the path:

http:com/example/foo/bar/baz

The Rest

So far, we have covered the components of a URL which allow you to connect to a specific application on a remote server somewhere on the Internet. The second, and final, post of this series will cover those components of the URL which are processed by that remote application to return to you a specific piece of content, the Path, Fragment, Query and Auth.

I would have liked to include all of the content in a single post, but its length was proving intimidating to readers. The second post is absolutely worth your time however. It includes things like the alternative forms for URLs Tim Berners-Lee considered, the history of forms and how the GET parameter syntax was decided, and the fifteen year argument over how to make URLs which won’t change. If you’d like, you can subscribe below to be notified when that post is released.

index.md 30KB Ruw Blame Geschiedenis