|
- title: The History of the URL: Domain, Protocol, and Port
- url: https://eager.io/blog/the-history-of-the-url-domain-and-protocol/
- hash_url: 90a9fd7dd73e7b0975c94821db4d2797
-
- <p class="teaser">
- On the <a href="https://www.rfc-editor.org/rfc/rfc805.txt">11th of January 1982</a>, twenty-two computer scientists
- met to discuss an issue with ‘computer mail’ (now known as email).
- Attendees included <a href="https://en.wikipedia.org/wiki/Bill_Joy">the
- guy who would create Sun Microsystems</a>,
- <a href="https://en.wikipedia.org/wiki/Dave_Lebling">the guy who made Zork</a>, <a href="https://en.wikipedia.org/wiki/David_L._Mills">the NTP
- guy</a>, and <a href="https://en.wikipedia.org/wiki/Bob_Fabry">the guy who convinced
- the government to pay for Unix</a>. The
- problem was simple: there were 455 hosts on the ARPANET and the situation was
- getting out of control.
- </p>
-
- <p><img src="images/arpanet-1969.gif" alt="ARPANET circa 1969"/></p>
- <p>This issue was occurring now because the ARPANET was on the verge of
- <a href="https://www.rfc-editor.org/rfc/rfc801.txt">switching</a> from its original <a href="https://en.wikipedia.org/wiki/Network_Control_Program">NCP
- protocol</a> to the TCP/IP
- protocol which powers what we now call the Internet. With that switch
- suddenly there would be a multitude of interconnected networks (an ‘Inter...
- net’) requiring a more ‘hierarchical’ domain system where ARPANET could
- resolve its own domains while the other networks resolved theirs.</p>
- <p>Other
- networks at the time had great names like “COMSAT”, “CHAOSNET”,
- “UCLNET” and “INTELPOSTNET” and were maintained by groups of universities
- and companies all around the US who wanted to be able to communicate, and
- could afford to lease 56k lines from the phone company and buy
- the requisite <a href="https://en.wikipedia.org/wiki/PDP-11">PDP-11s</a> to handle routing.</p>
- <p><a href="http://www.saccade.com/writing/projects/PDP11/PDP-11.html"><img src="images/pdp11.jpg" alt="PDP-11"/></a></p>
- <p>In the original ARPANET design, a central Network Information Center (NIC) was
- responsible for maintaining a file listing every host on the network. The file
- was known as the <a href="https://tools.ietf.org/html/rfc952"><code>HOSTS.TXT</code></a> file, similar to the <code>/etc/hosts</code> file on a Linux
- or OS X system today. Every network change would
- <a href="https://www.rfc-editor.org/rfc/rfc952.txt">require</a> the NIC to FTP (a protocol
- invented in <a href="https://tools.ietf.org/html/rfc114">1971</a>) to every host on the
- network, a significant load on their infrastructure.</p>
- <p>Having a single file list every host on the Internet would, of course, not
- scale indefinitely. The priority was email, however, as it
- was the predominant addressing challenge of the day. Their ultimate conclusion
- was to create a hierarchical system in which you could query an external system
- for just the domain or set of domains you needed. In their words: “The
- conclusion in this area was that the current ‘user@host’ mailbox identifier
- should be extended to ‘user@host.domain’ where ‘domain’ could be a hierarchy of
- domains.” And the domain was born.</p>
- <p><a href="https://personalpages.manchester.ac.uk/staff/m.dodge/cybergeography/atlas/"><img src="images/arpanet.gif" alt="ARPANET map"/></a></p>
- <p>It’s important to dispel any illusion that these decisions were made with
- prescience for the future the domain name would have. In fact, their elected
- solution was primarily decided because it was the “one causing least difficulty
- for existing systems.” For example, <a href="https://www.rfc-editor.org/rfc/rfc799.txt">one
- proposal</a> was for email addresses to
- be of the form <code>&lt;user&gt;.&lt;host&gt;@&lt;domain&gt;</code>. If email usernames of the day hadn’t
- already had ‘.’ characters you might be emailing me at ‘zack.eager@io’ today.</p>
- <p><a href="https://personalpages.manchester.ac.uk/staff/m.dodge/cybergeography/atlas/"><img src="images/arpanet-1987.gif" alt="ARPANET circa 1987"/></a></p>
- <h3 id="uucp-and-the-bang-path">UUCP and the Bang Path</h3>
- <blockquote>
- <p>
- It has been said that the principal function of an operating system is to
- define a number of different names for the same object, so that it can busy
- itself keeping track of the relationship between all of the different names.
- Network protocols seem to have somewhat the same characteristic.
- </p>
- <p>-- David D. Clark, <a href="https://www.rfc-editor.org/rfc/rfc814.txt"><code>1982</code></a>
- </p>
- </blockquote>
-
- <p>Another <a href="https://www.rfc-editor.org/ien/ien116.txt">failed proposal</a> involved
- separating domain components with the exclamation mark (<code>!</code>). For example, to
- connect to the <code>ISIA</code> host on <code>ARPANET</code>, you would connect to <code>!ARPA!ISIA</code>.
- You could then query for hosts using wildcards, so <code>!ARPA!*</code> would return to
- you every <code>ARPANET</code> host.</p>
- <p>This method of addressing wasn’t a crazy divergence from the standard; it was
- an attempt to maintain it. The system of exclamation separated hosts dates to
- a data transfer tool called <a href="https://en.wikipedia.org/wiki/UUCP">UUCP</a>
- <a href="http://www.cs.dartmouth.edu/~doug/reader.pdf">created</a> in 1976. If you’re
- reading this on an OS X or Linux computer, <code>uucp</code> is likely still installed and
- available at the terminal.</p>
- <p>ARPANET was introduced in 1969, and quickly became a powerful communication tool...
- among the handful of universities and government institutions which had access
- to it. The Internet as we know it wouldn’t become publicly available outside
- of research institutions until <a href="http://www.cybertelecom.org/notes/nsfnet.htm">1991</a>,
- twenty-two years later. But that didn’t mean computer users weren’t communicating.</p>
- <p><img src="images/coupler.jpg" alt="Acoustic Coupler"/></p>
- <p>In the era before the Internet, the general method of communication between
- computers was with a direct point-to-point dial up connection. For example, if
- you wanted to send me a file, you would have your modem call my modem, and we
- would transfer the file. To craft this into a network of sorts, UUCP was born.</p>
- <p>In this system, each computer has a file which lists the hosts it’s aware of,
- their phone numbers, and a username and password on each host. You then craft a
- ‘path’, from your current machine to your destination, through hosts which each
- know how to connect to the next:</p>
- <pre><code>sw-hosts!digital-lobby!zack
- </code></pre><p><img src="images/uucp.jpg" alt="Business card featuring UUCP address"/></p>
- <p>This address would form not just a method of sending me files or connecting
- with my computer directly, but also would be my email address. In this era
- before ‘mail servers’, if my computer was off you weren’t sending me an email.</p>
- <p>While use of ARPANET was restricted to top-tier universities, UUCP created a
- bootleg Internet for the rest of us. It formed the basis for both
- <a href="https://en.wikipedia.org/wiki/Usenet">Usenet</a> and the
- <a href="https://en.wikipedia.org/wiki/Bulletin_board_system">BBS</a> system.</p>
- <h3 id="dns">DNS</h3>
- <p>Ultimately, the DNS system we still use today would be
- <a href="https://www.rfc-editor.org/rfc/rfc882.txt">proposed</a> in 1983. If you run a
- DNS query today, for example using the <code>dig</code> tool, you’ll likely see a response
- which looks like this:</p>
- <pre><code>;; ANSWER SECTION:
- google.com. 299 IN A 172.217.4.206
- </code></pre><p>This is informing us that google.com is reachable at <code>172.217.4.206</code>. As you
- might know, the <code>A</code> is informing us that this is an ‘address’ record, mapping a
- domain to an IPv4 address. The <code>299</code> is the ‘time to live’, letting us know
- how many more seconds this value will be valid for, before it should be queried
- again. But what does the <code>IN</code> mean?</p>
- <p><code>IN</code> stands for ‘Internet’. Like so much of this, the field dates back to an
- era when there were several competing computer networks which needed to
- interoperate. Other potential values were <code>CH</code> for the
- <a href="https://en.wikipedia.org/wiki/Chaosnet">CHAOSNET</a> or <code>HS</code> for Hesiod which was
- the name service of the <a href="https://en.wikipedia.org/wiki/Project_Athena">Athena
- system</a>. CHAOSNET is long dead,
- but a much evolved version of Athena is still used by students at MIT to this
- day. You can find the list of <a href="http://www.iana.org/assignments/dns-parameters/dns-parameters.xhtml">DNS
- classes</a>
- on the IANA website, but it’s no surprise only one potential value is in common
- use today.</p>
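<p>To make those fields concrete, here is a minimal Python sketch that splits the answer line shown earlier into its five parts. The <code>ResourceRecord</code> type and the <code>parse_answer_line</code> helper are names made up for illustration, not part of <code>dig</code> or any DNS library, and the sketch assumes the simple whitespace-separated layout of that one line.</p>

```python
from typing import NamedTuple


class ResourceRecord(NamedTuple):
    """One line of a dig-style answer section (hypothetical helper type)."""
    name: str    # owner name, e.g. "google.com."
    ttl: int     # seconds the answer may still be cached
    rclass: str  # "IN" (Internet), "CH" (CHAOSNET) or "HS" (Hesiod)
    rtype: str   # "A" means an IPv4 address record
    rdata: str   # the value of the record


def parse_answer_line(line: str) -> ResourceRecord:
    """Split a whitespace-separated answer line into its five fields."""
    name, ttl, rclass, rtype, rdata = line.split(maxsplit=4)
    return ResourceRecord(name, int(ttl), rclass, rtype, rdata)


record = parse_answer_line("google.com. 299 IN A 172.217.4.206")
print(record.rclass, record.rtype, record.ttl, record.rdata)
```

<p>Real zone files and resolvers allow more variety than this (omitted fields, multi-part record data), so treat it as a reading aid rather than a parser.</p>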
- <h3 id="tlds">TLDs</h3>
- <blockquote>
- <p>
- It is extremely unlikely that any other TLDs will be created.
- </p>
- <p>
- -- Jon Postel, <a href="https://tools.ietf.org/html/rfc1591"><code class="year">1994</code></a>
- </p>
- </blockquote>
-
- <p>Once it was decided that domain names should be arranged hierarchically, it
- became necessary to decide what sits at the root of that hierarchy. That root
- is traditionally signified with a single ‘.’. In fact, ending all of your
- domain names with a ‘.’ is semantically correct, and will absolutely work in
- your web browser: <a href="http://google.com."><code>google.com.</code></a></p>
- <p>The first TLD was <code>.arpa</code>. It allowed users to address their old
- traditional ARPANET hostnames during the transition. For example, if
- my machine was previously registered as <code>hfnet</code>, my new address would be
- <code>hfnet.arpa</code>. That was only temporary. During the transition,
- server administrators had a very important choice to make: which of the five
- TLDs would they assume: “.com”, “.gov”, “.org”, “.edu” or “.mil”?</p>
- <p>When we say DNS is hierarchical, what we mean is there is a set of root DNS
- servers which are responsible for, for example, turning <code>.com</code> into the <code>.com</code>
- nameservers, who will in turn answer how to get to <code>google.com</code>. The root DNS
- zone of the internet is composed of thirteen DNS server clusters. There are
- only <a href="https://www.internic.net/zones/named.cache">13 server clusters</a>, because
- that’s all we can fit in a single UDP packet. Historically, DNS has operated
- through UDP packets, meaning the response to a request can never be more than
- 512 bytes.</p>
- <pre><code>
- ; This file holds the information on root name servers needed to
- ; initialize cache of Internet domain name servers
- ; (e.g. reference this file in the "cache . &lt;file&gt;"
- ; configuration file of BIND domain name servers).
- ;
- ; This file is made available by InterNIC
- ; under anonymous FTP as
- ; file /domain/named.cache
- ; on server FTP.INTERNIC.NET
- ; -OR- RS.INTERNIC.NET
- ;
- ; last update: March 23, 2016
- ; related version of root zone: 2016032301
- ;
- ; formerly NS.INTERNIC.NET
- ;
- . 3600000 NS A.ROOT-SERVERS.NET.
- A.ROOT-SERVERS.NET. 3600000 A 198.41.0.4
- A.ROOT-SERVERS.NET. 3600000 AAAA 2001:503:ba3e::2:30
- ;
- ; FORMERLY NS1.ISI.EDU
- ;
- . 3600000 NS B.ROOT-SERVERS.NET.
- B.ROOT-SERVERS.NET. 3600000 A 192.228.79.201
- B.ROOT-SERVERS.NET. 3600000 AAAA 2001:500:84::b
- ;
- ; FORMERLY C.PSI.NET
- ;
- . 3600000 NS C.ROOT-SERVERS.NET.
- C.ROOT-SERVERS.NET. 3600000 A 192.33.4.12
- C.ROOT-SERVERS.NET. 3600000 AAAA 2001:500:2::c
- ;
- ; FORMERLY TERP.UMD.EDU
- ;
- . 3600000 NS D.ROOT-SERVERS.NET.
- D.ROOT-SERVERS.NET. 3600000 A 199.7.91.13
- D.ROOT-SERVERS.NET. 3600000 AAAA 2001:500:2d::d
- ;
- ; FORMERLY NS.NASA.GOV
- ;
- . 3600000 NS E.ROOT-SERVERS.NET.
- E.ROOT-SERVERS.NET. 3600000 A 192.203.230.10
- ;
- ; FORMERLY NS.ISC.ORG
- ;
- . 3600000 NS F.ROOT-SERVERS.NET.
- F.ROOT-SERVERS.NET. 3600000 A 192.5.5.241
- F.ROOT-SERVERS.NET. 3600000 AAAA 2001:500:2f::f
- ;
- ; FORMERLY NS.NIC.DDN.MIL
- ;
- . 3600000 NS G.ROOT-SERVERS.NET.
- G.ROOT-SERVERS.NET. 3600000 A 192.112.36.4
- ;
- ; FORMERLY AOS.ARL.ARMY.MIL
- ;
- . 3600000 NS H.ROOT-SERVERS.NET.
- H.ROOT-SERVERS.NET. 3600000 A 198.97.190.53
- H.ROOT-SERVERS.NET. 3600000 AAAA 2001:500:1::53
- ;
- ; FORMERLY NIC.NORDU.NET
- ;
- . 3600000 NS I.ROOT-SERVERS.NET.
- I.ROOT-SERVERS.NET. 3600000 A 192.36.148.17
- I.ROOT-SERVERS.NET. 3600000 AAAA 2001:7fe::53
- ;
- ; OPERATED BY VERISIGN, INC.
- ;
- . 3600000 NS J.ROOT-SERVERS.NET.
- J.ROOT-SERVERS.NET. 3600000 A 192.58.128.30
- J.ROOT-SERVERS.NET. 3600000 AAAA 2001:503:c27::2:30
- ;
- ; OPERATED BY RIPE NCC
- ;
- . 3600000 NS K.ROOT-SERVERS.NET.
- K.ROOT-SERVERS.NET. 3600000 A 193.0.14.129
- K.ROOT-SERVERS.NET. 3600000 AAAA 2001:7fd::1
- ;
- ; OPERATED BY ICANN
- ;
- . 3600000 NS L.ROOT-SERVERS.NET.
- L.ROOT-SERVERS.NET. 3600000 A 199.7.83.42
- L.ROOT-SERVERS.NET. 3600000 AAAA 2001:500:9f::42
- ;
- ; OPERATED BY WIDE
- ;
- . 3600000 NS M.ROOT-SERVERS.NET.
- M.ROOT-SERVERS.NET. 3600000 A 202.12.27.33
- M.ROOT-SERVERS.NET. 3600000 AAAA 2001:dc3::35
- ; End of file
- </code></pre>
-
- <p>Root DNS servers operate in safes, inside locked cages. A clock sits on the
- safe to ensure the camera feed hasn’t been looped. Particularly given how
- slow <a href="https://en.wikipedia.org/wiki/Domain_Name_System_Security_Extensions">DNSSEC</a>
- implementation has been, an attack on one of those servers could
- allow an attacker to redirect all of the Internet traffic for a portion of
- Internet users. This, of course, makes for the most fantastic heist movie to
- have never been made.</p>
- <p>Unsurprisingly, the nameservers for TLDs don’t actually change all
- that often.
- <a href="http://dns.measurement-factory.com/writings/wessels-pam2003-paper.pdf">98%</a> of
- the requests root DNS servers receive are in error, most often because of
- broken and toy clients which don’t properly cache their results. This became
- such a problem that several root DNS operators had to <a href="https://www.as112.net/">spin
- up</a> special servers just to return ‘go away’ to all the
- people asking for reverse DNS lookups on their local IP addresses.</p>
- <p>The TLD nameservers are administered by different companies and governments all
- around the world (<a href="https://www.verisign.com/">Verisign</a> manages <code>.com</code>). When you purchase a <code>.com</code> domain,
- about $0.18 goes to ICANN, and $7.85 <a href="http://webmasters.stackexchange.com/questions/61467/if-icann-only-charges-18%C2%A2-per-domain-name-why-am-i-paying-10">goes
- to</a>
- Verisign.</p>
- <h3 id="punycode">Punycode</h3>
- <p>It is rare in this world that the silly name we developers think up for a new
- project makes it into the final, public, product. We might name the company
- database Delaware (because that’s where all the companies are registered), but
- you can be sure by the time it hits production it will be
- CompanyMetadataDatastore. But rarely, when all the stars align and the boss is
- on vacation, one slips through the cracks.</p>
- <p>Punycode is the system we use to encode Unicode into domain names. The problem
- it is solving is simple: how do you write 比薩.com when the entire internet
- system was built around using the <a href="https://en.wikipedia.org/wiki/ASCII">ASCII</a>
- alphabet whose most foreign character is the tilde?</p>
- <p>It’s not a simple matter of switching domains to use
- <a href="https://en.wikipedia.org/wiki/Unicode">unicode</a>. The <a href="https://tools.ietf.org/html/rfc1035">original
- documents</a> which govern domains specify
- they are to be encoded in ASCII. Every piece of internet hardware from the
- last forty years, including the
- <a href="http://www.cisco.com/c/en/us/support/routers/crs-1-multishelf-system/model.html">Cisco</a>
- and
- <a href="http://www.juniper.net/techpubs/en_US/release-independent/junos/information-products/pathway-pages/t-series/t1600/">Juniper</a>
- routers used to deliver this page to you make that assumption.</p>
- <p>The web itself was <a href="http://1997.webhistory.org/www.lists/www-talk.1994q3/1085.html">never
- ASCII-only</a>.
- It was actually originally conceived to speak <a href="https://en.wikipedia.org/wiki/ISO/IEC_8859-1">ISO
- 8859-1</a> which includes all of the
- ASCII characters, but adds an additional set of special characters like ¼ and
- letters with special marks like ä. It does not, however, contain any non-Latin
- characters.</p>
- <p>This restriction on HTML was ultimately removed in
- <a href="https://tools.ietf.org/html/rfc2070">1997</a>, and a decade later, in 2007, Unicode
- <a href="https://googleblog.blogspot.com/2008/05/moving-to-unicode-51.html">became</a> the
- most popular character set on the web. But domains were still confined to ASCII.</p>
- <p><a href="http://www.alanwood.net/unicode/"><img src="images/ie-hebrew.gif" alt="Hebrew in IE 5"/></a></p>
- <p>As you might guess, Punycode was not the first proposal to solve this problem.
- You most likely have heard of UTF-8, which is a popular way of encoding Unicode
- into bytes (the 8 is for the eight bits in a byte). In the year
- <a href="https://tools.ietf.org/html/draft-jseng-utf5-01">2000</a> several members of the
- Internet Engineering Task Force came up with UTF-5. The idea was to encode
- Unicode into five-bit chunks. You could then map each five bits into a
- character allowed (A-V &amp; 0-9) in domain names. So if I had a website for
- Japanese language learning, my site 日本語.com would become the cryptic
- M5E5M72COA9E.com.</p>
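<p>Here is a rough Python sketch of how that mapping can work. It follows my reading of the UTF-5 draft rather than any reference implementation: each code point is written as hexadecimal digits, and the first digit of every character is swapped for a letter from G to V so a decoder can tell where one character ends and the next begins. That gives each output symbol four bits of data plus one marker bit. The function names are invented for this example.</p>

```python
FIRST_DIGITS = "GHIJKLMNOPQRSTUV"  # values 0-15 when the digit opens a character
OTHER_DIGITS = "0123456789ABCDEF"  # values 0-15 for every following digit


def utf5_encode(text: str) -> str:
    """Write each code point in hex, marking its first digit with G-V."""
    out = []
    for ch in text:
        digits = format(ord(ch), "X")   # e.g. U+65E5 -> "65E5"
        first = int(digits[0], 16)      # value of the leading hex digit
        out.append(FIRST_DIGITS[first] + digits[1:])
    return "".join(out)


def utf5_decode(encoded: str) -> str:
    """Reverse utf5_encode: a letter from G to V starts a new character."""
    chars = []
    value = None
    for symbol in encoded:
        if symbol in FIRST_DIGITS:
            if value is not None:
                chars.append(chr(value))
            value = FIRST_DIGITS.index(symbol)
        else:
            if value is None:
                raise ValueError("input must start with a letter from G to V")
            value = value * 16 + OTHER_DIGITS.index(symbol)
    if value is not None:
        chars.append(chr(value))
    return "".join(chars)


print(utf5_encode("日本語"))  # M5E5M72COA9E
```

<p>Running it on 日本語 reproduces the cryptic label from the example above, which is at least some evidence the sketch matches what the draft intended.</p>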
- <p>This encoding method has several disadvantages. For one, A-V and 0-9 are used
- in the output encoding, meaning if you wanted to actually include one of those
- characters in your domain, it had to be encoded like everything else. This made
- for some very long domains, which is a serious problem when each segment of a
- domain is restricted to 63 characters. A domain in the Myanmar language would
- be restricted to no more than 15 characters. The proposal does make the very
- interesting suggestion of using UTF-5 to allow Unicode to be transmitted by
- Morse code and telegram though.</p>
- <p>There was also the question of how to let clients know that a domain was
- encoded so they could display it in the appropriate Unicode characters,
- rather than showing M5E5M72COA9E.com in my address bar. There were <a href="https://tools.ietf.org/html/draft-ietf-idn-compare-01">several
- suggestions</a>, one of
- which was to use an unused bit in the DNS response. It was the “last unused
- bit in the header”, and the DNS folks were “very hesitant to give it up”
- however.</p>
- <p>Another suggestion was to start every domain using this encoding method with
- <code>ra--</code>. At <a href="https://tools.ietf.org/html/draft-ietf-idn-race-00">the time</a>
- (mid-April 2000), there were no domains which happened to start with those
- particular characters. If I know anything about the Internet, someone
- registered an <code>ra--</code> domain out of spite immediately after the
- proposal was published.</p>
- <p>The <a href="https://tools.ietf.org/html/rfc3492">ultimate conclusion</a>, reached in
- 2003, was to adopt a format called Punycode which included a form of delta
- compression which could dramatically shorten encoded domain names. Delta
- compression is a particularly good idea because the odds are all of the
- characters in your domain are in the same general area within Unicode. For
- example, two characters in Farsi are going to be much closer together than a
- Farsi character and another in Hindi. To give an example of how this works, if
- we take the nonsense phrase:</p>
- <p>يذؽ</p>
- <p>In an uncompressed format, that would be stored as the three characters <code>[1610,
- 1584, 1597]</code> (based on their Unicode code points). To compress this we first
- sort it numerically (keeping track of where the original characters were):
- <code>[1584, 1597, 1610]</code>. Then we can store the lowest value (<code>1584</code>), and the
- delta between that value and the next character (<code>13</code>), and again for the
- following character (also <code>13</code>), which is significantly less to transmit and store.</p>
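<p>The sort-then-subtract idea is easy to try out. The Python sketch below is only an illustration of delta encoding on those three code points. It is not the Punycode algorithm itself, which also records where each character belongs and packs the numbers far more tightly. The helper names are made up for the example.</p>

```python
def delta_encode(code_points):
    """Sort the code points, then keep the smallest one plus the gaps between neighbours."""
    ordered = sorted(code_points)
    deltas = [b - a for a, b in zip(ordered, ordered[1:])]
    return ordered[0], deltas


def delta_decode(start, deltas):
    """Rebuild the sorted code points by adding the gaps back up."""
    values = [start]
    for delta in deltas:
        values.append(values[-1] + delta)
    return values


# The three-letter nonsense phrase from the text, written as escapes
# so the right-to-left characters can't get reordered by an editor.
phrase = "\u064a\u0630\u063d"
code_points = [ord(ch) for ch in phrase]    # [1610, 1584, 1597]

start, deltas = delta_encode(code_points)
print(start, deltas)                        # 1584 [13, 13]
print(delta_decode(start, deltas))          # [1584, 1597, 1610]
```

<p>Two small gaps take far fewer digits to write down than two more four-digit code points, and the saving grows with longer names drawn from one script.</p>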
- <p>Punycode then (very) efficiently encodes those integers into characters allowed
- in domain names, and inserts an <code>xn--</code> at the beginning to let consumers know
- this is an encoded domain. You’ll notice that all the Unicode characters end
- up together at the end of the domain. They don’t just encode their value, they
- also encode where they should be inserted into the ASCII portion of the domain.
- To provide an example, the website 熱狗sales.com becomes
- <code>xn--sales-r65lm0e.com</code>. Anytime you type a Unicode-based domain name into
- your browser’s address bar, it is encoded in this way.</p>
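<p>If you want to poke at this yourself, Python ships with <code>punycode</code> and <code>idna</code> codecs in its standard library. The sketch below assumes those codecs behave like a browser does for this simple label. Real browsers apply newer IDNA rules, so results can differ for trickier names.</p>

```python
label = "熱狗sales"

# The raw Punycode transform: the ASCII letters are copied first,
# then a "-" delimiter, then the encoded non-ASCII characters.
encoded = label.encode("punycode")
print(encoded)

# Decoding restores both the characters and their original positions.
print(encoded.decode("punycode"))

# The "idna" codec works on whole hostnames and adds the "xn--" prefix
# to every label that needed encoding.
hostname = "熱狗sales.com".encode("idna").decode("ascii")
print(hostname)
```

<p>The ASCII part of the label survives untouched at the front, which is why the Unicode characters always seem to pile up at the end.</p>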
- <p>This transformation could be transparent, but that introduces a major security
- problem. All sorts of Unicode characters print identically to existing ASCII
- characters. For example, you likely can’t see the difference between Cyrillic
- small letter a (“а”) and Latin small letter a (“a”). If I register Cyrillic
- аmazon.com (xn--mazon-3ve.com), and manage to trick you into visiting it, it’s
- gonna be hard to know you’re on the wrong site. For that reason, when you
- visit <a href="http://🍕💩.ws">🍕💩.ws</a>, your browser somewhat lamely shows you
- <code>xn--vi8hiv.ws</code> in the address bar.</p>
- <h3 id="protocol">Protocol</h3>
- <p>The first portion of the URL is the protocol which should be used to access it.
- The most common protocol is <code>http</code>, which is the simple document transfer
- protocol Tim Berners-Lee invented specifically to power the web. It was not
- the only option. <a href="http://1997.webhistory.org/www.lists/www-talk.1993q2/0339.html">Some
- people</a>
- believed we should just use Gopher. Rather than being general-purpose, Gopher
- is specifically designed to send structured data similar to how a file tree is
- structured.</p>
- <p>For example, if you request the <code>/Cars</code> endpoint, it might return:</p>
- <pre><code>1Chevy Camaro /Archives/cars/cc gopher.cars.com 70
- iThe Camaro is a classic fake (NULL) 0
- iAmerican Muscle car fake (NULL) 0
- 1Ferrari 451 /Factbook/ferrari/451 gopher.ferrari.net 70
- </code></pre><p>which identifies two cars, along with some metadata about them and where you
- can connect to for more information. The understanding was your client would
- parse this information into a usable form which linked the entries with the
- destination pages.</p>
- <p><a href="http://www.yale.edu/pclt/WINWORLD/GOPHER.HTM"><img src="images/gopher.gif" alt="Gopher"/></a></p>
- <p>The first popular protocol was FTP, which was created in 1971, as a way of
- listing and downloading files on remote computers. Gopher was a logical
- extension of this, in that it provided a similar listing, but included
- facilities for also reading the metadata about entries. This meant it could
- be used for more liberal purposes like a news feed or a simple database. It
- did not have, however, the freedom and simplicity which characterizes HTTP and HTML.</p>
- <p>HTTP is a very simple protocol, particularly when compared to alternatives like
- FTP or even the <a href="https://http2.github.io/">HTTP/2</a> protocol which is rising in popularity today. First off,
- HTTP is entirely text based, rather than being composed of bespoke binary
- incantations (which would have made it significantly more efficient). Tim
- Berners-Lee correctly intuited that using a text-based format would make it
- easier for generations of programmers to develop and debug HTTP-based
- applications.</p>
- <p>HTTP also makes almost no assumptions about what you’re transmitting. Despite
- the fact that it was invented explicitly to accompany the HTML language, it
- allows you to specify that your content is of any type (using the MIME <code>Content-Type</code>,
- which was a new invention at the time). The protocol itself is rather simple:</p>
- <p>A request:</p>
- <pre><code class="lang-http">GET /index.html HTTP/1.1
- Host: www.example.com
- </code></pre>
- <p>Might receive this response:</p>
- <pre><code class="lang-http">HTTP/1.1 200 OK
- Date: Mon, 23 May 2005 22:38:34 GMT
- Content-Type: text/html; charset=UTF-8
- Content-Length: 138
- Last-Modified: Wed, 08 Jan 2003 23:11:55 GMT
- Server: Apache/1.3.3.7 (Unix) (Red-Hat/Linux)
- ETag: "3f80f-1b6-3e1cb03b"
- Accept-Ranges: bytes
- Connection: close
-
- &lt;html&gt;
- &lt;head&gt;
- &lt;title&gt;An Example Page&lt;/title&gt;
- &lt;/head&gt;
- &lt;body&gt;
- Hello World, this is a very simple HTML document.
- &lt;/body&gt;
- &lt;/html&gt;
- </code></pre>
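<p>Because the format is plain text, you can build a request and pick apart a response with ordinary string handling. The Python sketch below does exactly that without touching the network. <code>build_request</code> and <code>parse_response</code> are hypothetical helpers written for this post, and they skip most of what a real HTTP client has to handle (chunked bodies, repeated headers, redirects and so on).</p>

```python
def build_request(host: str, path: str = "/") -> bytes:
    """Assemble a minimal HTTP/1.1 GET request as bytes."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Connection: close",
        "",  # an empty line marks the end of the headers
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def parse_response(raw: bytes):
    """Split a raw response into status code, reason, headers and body."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    _version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(code), reason, headers, body


request = build_request("www.example.com", "/index.html")
print(request.decode("ascii"))

# A shortened, made-up response in the same shape as the one above.
sample = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html; charset=UTF-8\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"Hello"
)
code, reason, headers, body = parse_response(sample)
print(code, reason, headers["content-type"], body)
```

<p>Everything here could be typed by hand into a raw connection, which is a large part of why the protocol was so easy to debug.</p>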
- <p>To put this in context, you can think of the networking system the Internet
- uses as starting with IP, the Internet Protocol. IP is responsible for
- getting a small packet of data (around 1500 bytes) from one computer
- to another. On top of that we have TCP, which is responsible for taking
- larger blocks of data like entire documents and files and sending them
- via many IP packets reliably. On top of that, we then implement a protocol
- like HTTP or FTP, which specifies what format should be used to make
- the data we send via TCP (or UDP, etc.) understandable and meaningful.</p>
- <p>In other words, TCP/IP sends a whole bunch of bytes to another computer,
- the protocol says what those bytes should be and what they mean.</p>
- <p>You can make your own protocol if you like, assembling the bytes in your
- TCP messages however you like. The only requirement is that whoever you
- are talking to speaks the same language. For this reason, it’s common
- to standardize these protocols.</p>
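<p>As a toy illustration, here is a made-up protocol in Python: every message is a four-byte length followed by that many bytes of payload. To keep it runnable without a network, the sketch uses <code>socket.socketpair()</code> as a stand-in for a real TCP connection. That substitution, the framing format and the function names are all assumptions of this example, not an existing standard.</p>

```python
import socket
import struct


def send_message(sock: socket.socket, payload: bytes) -> None:
    """Send one message: a 4-byte big-endian length, then the payload."""
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def recv_exactly(sock: socket.socket, count: int) -> bytes:
    """Keep reading until exactly `count` bytes have arrived."""
    chunks = []
    while count > 0:
        chunk = sock.recv(count)
        if not chunk:
            raise ConnectionError("peer closed the connection early")
        chunks.append(chunk)
        count -= len(chunk)
    return b"".join(chunks)


def recv_message(sock: socket.socket) -> bytes:
    """Read the length prefix, then read that many payload bytes."""
    (length,) = struct.unpack("!I", recv_exactly(sock, 4))
    return recv_exactly(sock, length)


# A connected pair of local sockets stands in for a TCP connection here.
left, right = socket.socketpair()
send_message(left, b"quote of the day, please")
received = recv_message(right)
print(received)
left.close()
right.close()
```

<p>Both ends have to agree on the four-byte prefix and its byte order. Change either side alone and the conversation turns to noise, which is the whole argument for writing these things down.</p>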
- <p>There are, of course, many less important protocols to play with. For example
- there is a <a href="https://www.rfc-editor.org/rfc/rfc865.txt">Quote of The Day</a>
- protocol (port 17), and a <a href="https://www.rfc-editor.org/rfc/rfc864.txt">Random
- Characters</a> protocol (port 19).
- They may seem silly today, but they also showcase just how important a
- general-purpose document transmission format like HTTP was.</p>
- <h3 id="port">Port</h3>
- <p>The timeline of Gopher and HTTP can be evidenced by their default port numbers.
- Gopher is 70, HTTP 80. The HTTP port was assigned (likely by <a href="https://en.wikipedia.org/wiki/Jon_Postel">Jon
- Postel</a> at the IANA) at the request
- of Tim Berners-Lee sometime between <a href="https://tools.ietf.org/html/rfc1060">1990</a>
- and <a href="https://tools.ietf.org/html/rfc1340">1992</a>.</p>
- <p>This concept of registering ‘port numbers’ predates even the Internet.
- In the original NCP protocol which powered the ARPANET, remote
- addresses were identified by 40 bits. The first 32 identified the remote
- host, similar to how an IP address works today. The last eight were known as
- the <a href="https://tools.ietf.org/html/rfc433">AEN</a> (it stood for “Another Eight-bit Number”),
- and were used by the remote machine in the way we use a port number, to separate
- messages destined for different processes. In other words, the address
- specifies which machine the message should go to, and the AEN (or port number)
- tells that remote machine which application should get the message.</p>
- <p>They quickly <a href="https://tools.ietf.org/html/rfc322">requested</a> that users register
- these ‘socket numbers’ to limit potential collisions. When port numbers were
- expanded to 16 bits by TCP/IP, that registration process was continued.</p>
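<p>The 40-bit layout described above is simple enough to model with a little bit arithmetic. This Python sketch only mirrors the split given in this post (32 bits of host, then the 8-bit AEN). It is not taken from the NCP specification, and the host value and function names are invented for the example.</p>

```python
HOST_BITS = 32  # identifies the remote machine
AEN_BITS = 8    # "Another Eight-bit Number", the ancestor of the port


def pack_ncp_address(host: int, aen: int) -> int:
    """Combine a host number and an AEN into one 40-bit address."""
    if not 0 <= host < (1 << HOST_BITS):
        raise ValueError("host must fit in 32 bits")
    if not 0 <= aen < (1 << AEN_BITS):
        raise ValueError("AEN must fit in 8 bits")
    return (host << AEN_BITS) | aen


def unpack_ncp_address(address: int):
    """Split a 40-bit address back into (host, aen)."""
    return address >> AEN_BITS, address & ((1 << AEN_BITS) - 1)


# Hypothetical values, chosen only to show the round trip.
address = pack_ncp_address(host=0x0A000001, aen=70)
print(hex(address))
print(unpack_ncp_address(address))
```

<p>With only eight bits there are just 256 possible values per host, so it is easy to see why collisions were a worry and why TCP/IP widened the field to 16 bits.</p>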
- <p>While protocols have a default port, it makes sense to allow ports to also be
- specified manually to allow for local development and the hosting of multiple
- services on the same machine. That same logic was the
- <a href="http://1997.webhistory.org/www.lists/www-talk.1992/0335.html">basis</a> for
- prefixing websites with <code>www.</code>. At the time, it was unlikely anyone was
- getting access to the root of their domain, just for hosting an ‘experimental’
- website. But if you give users the hostname of your specific machine
- (<code>dx3.cern.ch</code>), you’re in trouble when you need to replace that machine. By
- using a common subdomain (<code>www.cern.ch</code>) you can change what it points to as
- needed.</p>
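<p>The rule that a protocol supplies a default port unless the URL names one can be sketched with Python’s <code>urllib.parse</code>. The <code>DEFAULT_PORTS</code> table and <code>effective_port</code> function are written for this post and cover only the two protocols discussed here.</p>

```python
from urllib.parse import urlsplit

# Default ports for the two protocols mentioned in this section.
DEFAULT_PORTS = {"gopher": 70, "http": 80}


def effective_port(url: str) -> int:
    """Return the port written in the URL, or the protocol's default."""
    parts = urlsplit(url)
    if parts.port is not None:
        return parts.port
    return DEFAULT_PORTS[parts.scheme]


print(effective_port("http://www.cern.ch/"))        # 80
print(effective_port("http://localhost:8000/"))     # 8000
print(effective_port("gopher://gopher.cars.com/"))  # 70
```

<p>An explicit port wins whenever one is present, which is what makes running a second or experimental service on the same machine possible.</p>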
- <h3 id="the-bit-in-between">The Bit In-between</h3>
- <p>As you probably know, the URL syntax places a double slash (<code>//</code>) between
- the protocol and the rest of the URL:</p>
- <pre><code>http://eager.io
- </code></pre><p>That double slash was inherited from the <a href="https://en.wikipedia.org/wiki/Apollo/Domain">Apollo</a>
- computer system which was one of the first networked workstations. The Apollo
- team had a similar problem to Tim Berners-Lee: they needed a way to separate
- a path from the machine that path is on. Their solution was to create a
- special path format:</p>
- <pre><code>//computername/file/path/as/usual
- </code></pre><p>And TBL copied that scheme. Incidentally, he now <a href="https://www.w3.org/People/Berners-Lee/FAQ.html#etc">regrets</a>
- that decision, wishing the domain (in this case <code>example.com</code>) was the first portion of the path:</p>
- <pre><code>http:com/example/foo/bar/baz
- </code></pre><h3 id="the-rest">The Rest</h3>
- <p>So far, we have covered the components of a URL which allow you to connect
- to a specific application on a remote server somewhere on the Internet. The second,
- and final, post of this series will cover those components of the URL which
- are processed by that remote application to return to you a specific piece of content:
- the Path, Fragment, Query and Auth.</p>
- <p>I would have liked to include all of the content in a single post, but its length
- was proving intimidating to readers. The second post is absolutely worth your
- time however. It includes things like the alternative forms for URLs Tim Berners-Lee
- considered, the history of forms and how the GET parameter syntax was decided, and the fifteen
- year argument over how to make URLs which won’t change. If you’d like, you can
- subscribe below to be notified when that post is released.</p>
|