
  1. title: The History of the URL: Domain, Protocol, and Port
  2. url: https://eager.io/blog/the-history-of-the-url-domain-and-protocol/
  3. hash_url: 90a9fd7dd73e7b0975c94821db4d2797
  4. <p class="teaser">
  5. On the <a href="https://www.rfc-editor.org/rfc/rfc805.txt">11th of January 1982</a>, twenty-two computer scientists
  6. met to discuss an issue with ‘computer mail’ (now known as email).
  7. Attendees included <a href="https://en.wikipedia.org/wiki/Bill_Joy">the
  8. guy who would create Sun Microsystems</a>,
  9. <a href="https://en.wikipedia.org/wiki/Dave_Lebling">the guy who made Zork</a>, <a href="https://en.wikipedia.org/wiki/David_L._Mills">the NTP
  10. guy</a>, and <a href="https://en.wikipedia.org/wiki/Bob_Fabry">the guy who convinced
  11. the government to pay for Unix</a>. The
  12. problem was simple: there were 455 hosts on the ARPANET and the situation was
  13. getting out of control.
  14. </p>
  15. <p><img src="images/arpanet-1969.gif" alt="ARPANET circa 1969"/></p>
  16. <p>This issue was occurring now because the ARPANET was on the verge of
  17. <a href="https://www.rfc-editor.org/rfc/rfc801.txt">switching</a> from its original <a href="https://en.wikipedia.org/wiki/Network_Control_Program">NCP
  18. protocol</a>, to the TCP/IP
  19. protocol which powers what we now call the Internet. With that switch
  20. suddenly there would be a multitude of interconnected networks (an ‘Inter...
  21. net’) requiring a more ‘hierarchical’ domain system where ARPANET could
  22. resolve its own domains while the other networks resolved theirs.</p>
  23. <p>Other
  24. networks at the time had great names like “COMSAT”, “CHAOSNET”,
  25. “UCLNET” and “INTELPOSTNET” and were maintained by groups of universities
  26. and companies all around the US who wanted to be able to communicate, and
  27. could afford to lease 56k lines from the phone company and buy
  28. the requisite <a href="https://en.wikipedia.org/wiki/PDP-11">PDP-11s</a> to handle routing.</p>
  29. <p><a href="http://www.saccade.com/writing/projects/PDP11/PDP-11.html"><img src="images/pdp11.jpg" alt="PDP-11"/></a></p>
  30. <p>In the original ARPANET design, a central Network Information Center (NIC) was
  31. responsible for maintaining a file listing every host on the network. The file
  32. was known as the <a href="https://tools.ietf.org/html/rfc952"><code>HOSTS.TXT</code></a> file, similar to the <code>/etc/hosts</code> file on a Linux
  33. or OS X system today. Every network change would
  34. <a href="https://www.rfc-editor.org/rfc/rfc952.txt">require</a> the NIC to FTP (a protocol
  35. invented in <a href="https://tools.ietf.org/html/rfc114">1971</a>) to every host on the
  36. network, a significant load on their infrastructure.</p>
  37. <p>Having a single file list every host on the Internet would, of course, not
  38. scale indefinitely. The priority was email, however, as it
  39. was the predominant addressing challenge of the day. Their ultimate conclusion
  40. was to create a hierarchical system in which you could query an external system
  41. for just the domain or set of domains you needed. In their words: “The
  42. conclusion in this area was that the current ‘user@host’ mailbox identifier
  43. should be extended to ‘user@host.domain’ where ‘domain’ could be a hierarchy of
  44. domains.” And the domain was born.</p>
  45. <p><a href="https://personalpages.manchester.ac.uk/staff/m.dodge/cybergeography/atlas/"><img src="images/arpanet.gif" alt="ARPANET map"/></a></p>
  46. <p>It’s important to dispel any illusion that these decisions were made with
  47. prescience for the future the domain name would have. In fact, their elected
  48. solution was primarily decided because it was the “one causing least difficulty
  49. for existing systems.” For example, <a href="https://www.rfc-editor.org/rfc/rfc799.txt">one
  50. proposal</a> was for email addresses to
  51. be of the form <code>&lt;user&gt;.&lt;host&gt;@&lt;domain&gt;</code>. If email usernames of the day hadn’t
  52. already had ‘.’ characters you might be emailing me at ‘zack.eager@io’ today.</p>
  53. <p><a href="https://personalpages.manchester.ac.uk/staff/m.dodge/cybergeography/atlas/"><img src="images/arpanet-1987.gif" alt="ARPANET circa 1987"/></a></p>
  54. <h3 id="uucp-and-the-bang-path">UUCP and the Bang Path</h3>
  55. <blockquote>
  56. <p>
  57. It has been said that the principal function of an operating system is to
  58. define a number of different names for the same object, so that it can busy
  59. itself keeping track of the relationship between all of the different names.
  60. Network protocols seem to have somewhat the same characteristic.
  61. </p>
  62. <p>-- David D. Clark, <a href="https://www.rfc-editor.org/rfc/rfc814.txt"><code>1982</code></a>
  63. </p>
  64. </blockquote>
  65. <p>Another <a href="https://www.rfc-editor.org/ien/ien116.txt">failed proposal</a> involved
  66. separating domain components with the exclamation mark (<code>!</code>). For example, to
  67. connect to the <code>ISIA</code> host on <code>ARPANET</code>, you would connect to <code>!ARPA!ISIA</code>.
  68. You could then query for hosts using wildcards, so <code>!ARPA!*</code> would return to
  69. you every <code>ARPANET</code> host.</p>
  70. <p>This method of addressing wasn’t a crazy divergence from the standard; it was
  71. an attempt to maintain it. The system of exclamation-separated hosts dates to
  72. a data transfer tool called <a href="https://en.wikipedia.org/wiki/UUCP">UUCP</a>
  73. <a href="http://www.cs.dartmouth.edu/~doug/reader.pdf">created</a> in 1976. If you’re
  74. reading this on an OS X or Linux computer, <code>uucp</code> is likely still installed and
  75. available at the terminal.</p>
  76. <p>ARPANET was introduced in 1969, and quickly became a powerful communication tool...
  77. among the handful of universities and government institutions which had access
  78. to it. The Internet as we know it wouldn’t become publicly available outside
  79. of research institutions until <a href="http://www.cybertelecom.org/notes/nsfnet.htm">1991</a>,
  80. twenty-two years later. But that didn’t mean computer users weren’t communicating.</p>
  81. <p><img src="images/coupler.jpg" alt="Acoustic Coupler"/></p>
  82. <p>In the era before the Internet, the general method of communication between
  83. computers was with a direct point-to-point dial up connection. For example, if
  84. you wanted to send me a file, you would have your modem call my modem, and we
  85. would transfer the file. To craft this into a network of sorts, UUCP was born.</p>
  86. <p>In this system, each computer has a file which lists the hosts it’s aware of,
  87. their phone number, and a username and password on that host. You then craft a
  88. ‘path’, from your current machine to your destination, through hosts which each
  89. know how to connect to the next:</p>
  90. <pre><code>sw-hosts!digital-lobby!zack
  91. </code></pre><p><img src="images/uucp.jpg" alt="Business card featuring UUCP address"/></p>
  92. <p>This address would form not just a method of sending me files or connecting
  93. with my computer directly, but also would be my email address. In this era
  94. before ‘mail servers’, if my computer was off you weren’t sending me an email.</p>
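<p>To make the routing idea concrete, here is a minimal sketch of how a bang path could be pulled apart. The function name <code>parse_bang_path</code> is my own invention for illustration, not part of any UUCP tool.</p>

```python
def parse_bang_path(address):
    """Split a UUCP bang path into its relay hosts and the final user."""
    # Everything before the last "!" is a chain of hosts, each of which
    # knows how to dial the next; the last component is the recipient.
    *hops, user = address.split("!")
    if not hops or not user or not all(hops):
        raise ValueError(f"not a bang path: {address!r}")
    return hops, user


hops, user = parse_bang_path("sw-hosts!digital-lobby!zack")
print(" -> ".join(hops), "delivers to", user)
```

<p>Note that the sender, not the network, chooses the route: every hop is spelled out in the address itself.</p>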
  95. <p>While use of ARPANET was restricted to top-tier universities, UUCP created a
  96. bootleg Internet for the rest of us. It formed the basis for both
  97. <a href="https://en.wikipedia.org/wiki/Usenet">Usenet</a> and the
  98. <a href="https://en.wikipedia.org/wiki/Bulletin_board_system">BBS</a> system.</p>
  99. <h3 id="dns">DNS</h3>
  100. <p>Ultimately, the DNS system we still use today would be
  101. <a href="https://www.rfc-editor.org/rfc/rfc882.txt">proposed</a> in 1983. If you run a
  102. DNS query today, for example using the <code>dig</code> tool, you’ll likely see a response
  103. which looks like this:</p>
  104. <pre><code>;; ANSWER SECTION:
  105. google.com. 299 IN A 172.217.4.206
  106. </code></pre><p>This is informing us that google.com is reachable at <code>172.217.4.206</code>. As you
  107. might know, the <code>A</code> is informing us that this is an ‘address’ record, mapping a
  108. domain to an IPv4 address. The <code>299</code> is the ‘time to live’, letting us know
  109. how many more seconds this value will be valid for, before it should be queried
  110. again. But what does the <code>IN</code> mean?</p>
  111. <p><code>IN</code> stands for ‘Internet’. Like so much of this, the field dates back to an
  112. era when there were several competing computer networks which needed to
  113. interoperate. Other potential values were <code>CH</code> for the
  114. <a href="https://en.wikipedia.org/wiki/Chaosnet">CHAOSNET</a> or <code>HS</code> for Hesiod which was
  115. the name service of the <a href="https://en.wikipedia.org/wiki/Project_Athena">Athena
  116. system</a>. CHAOSNET is long dead,
  117. but a much evolved version of Athena is still used by students at MIT to this
  118. day. You can find the list of <a href="http://www.iana.org/assignments/dns-parameters/dns-parameters.xhtml">DNS
  119. classes</a>
  120. on the IANA website, but it’s no surprise only one potential value is in common
  121. use today.</p>
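<p>The five fields of that answer line can be named explicitly. This is only a sketch of splitting the textual form <code>dig</code> prints; the <code>ResourceRecord</code> type and <code>parse_answer_line</code> helper are hypothetical names, and real DNS messages are binary on the wire.</p>

```python
from typing import NamedTuple


class ResourceRecord(NamedTuple):
    name: str      # the domain being described, with its trailing root dot
    ttl: int       # seconds the answer may be cached
    rr_class: str  # "IN" for Internet, "CH" for CHAOSNET, "HS" for Hesiod
    rr_type: str   # "A" for an IPv4 address record
    data: str      # the record's value


def parse_answer_line(line):
    # Split on runs of whitespace, keeping the record data as one field.
    name, ttl, rr_class, rr_type, data = line.split(None, 4)
    return ResourceRecord(name, int(ttl), rr_class, rr_type, data)


record = parse_answer_line("google.com. 299 IN A 172.217.4.206")
print(record)
```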
  122. <h3 id="tlds">TLDs</h3>
  123. <blockquote>
  124. <p>
  125. It is extremely unlikely that any other TLDs will be created.
  126. </p>
  127. <p>
  128. — Jon Postel, <a href="https://tools.ietf.org/html/rfc1591"><code class="year">1994</code></a>
  129. </p>
  130. </blockquote>
  131. <p>Once it was decided that domain names should be arranged hierarchically, it
  132. became necessary to decide what sits at the root of that hierarchy. That root
  133. is traditionally signified with a single ‘.’. In fact, ending all of your
  134. domain names with a ‘.’ is semantically correct, and will absolutely work in
  135. your web browser: <a href="http://google.com."><code>google.com.</code></a></p>
  136. <p>The first TLD was <code>.arpa</code>. It allowed users to address their old
  137. traditional ARPANET hostnames during the transition. For example, if
  138. my machine was previously registered as <code>hfnet</code>, my new address would be
  139. <code>hfnet.arpa</code>. That was only temporary. During the transition,
  140. server administrators had a very important choice to make: which of the five
  141. TLDs would they assume? “.com”, “.gov”, “.org”, “.edu” or “.mil”.</p>
  142. <p>When we say DNS is hierarchical, what we mean is there is a set of root DNS
  143. servers which are responsible for, for example, turning <code>.com</code> into the <code>.com</code>
  144. nameservers, who will in turn answer how to get to <code>google.com</code>. The root DNS
  145. zone of the internet is composed of thirteen DNS server clusters. There are
  146. only <a href="https://www.internic.net/zones/named.cache">13 server clusters</a>, because
  147. that’s all we can fit in a single UDP packet. Historically, DNS has operated
  148. through UDP packets, meaning the response to a request can never be more than
  149. 512 bytes.</p>
  150. <pre><code>
  151. ; This file holds the information on root name servers needed to
  152. ; initialize cache of Internet domain name servers
  153. ; (e.g. reference this file in the "cache . <file>"
  154. ; configuration file of BIND domain name servers).
  155. ;
  156. ; This file is made available by InterNIC
  157. ; under anonymous FTP as
  158. ; file /domain/named.cache
  159. ; on server FTP.INTERNIC.NET
  160. ; -OR- RS.INTERNIC.NET
  161. ;
  162. ; last update: March 23, 2016
  163. ; related version of root zone: 2016032301
  164. ;
  165. ; formerly NS.INTERNIC.NET
  166. ;
  167. . 3600000 NS A.ROOT-SERVERS.NET.
  168. A.ROOT-SERVERS.NET. 3600000 A 198.41.0.4
  169. A.ROOT-SERVERS.NET. 3600000 AAAA 2001:503:ba3e::2:30
  170. ;
  171. ; FORMERLY NS1.ISI.EDU
  172. ;
  173. . 3600000 NS B.ROOT-SERVERS.NET.
  174. B.ROOT-SERVERS.NET. 3600000 A 192.228.79.201
  175. B.ROOT-SERVERS.NET. 3600000 AAAA 2001:500:84::b
  176. ;
  177. ; FORMERLY C.PSI.NET
  178. ;
  179. . 3600000 NS C.ROOT-SERVERS.NET.
  180. C.ROOT-SERVERS.NET. 3600000 A 192.33.4.12
  181. C.ROOT-SERVERS.NET. 3600000 AAAA 2001:500:2::c
  182. ;
  183. ; FORMERLY TERP.UMD.EDU
  184. ;
  185. . 3600000 NS D.ROOT-SERVERS.NET.
  186. D.ROOT-SERVERS.NET. 3600000 A 199.7.91.13
  187. D.ROOT-SERVERS.NET. 3600000 AAAA 2001:500:2d::d
  188. ;
  189. ; FORMERLY NS.NASA.GOV
  190. ;
  191. . 3600000 NS E.ROOT-SERVERS.NET.
  192. E.ROOT-SERVERS.NET. 3600000 A 192.203.230.10
  193. ;
  194. ; FORMERLY NS.ISC.ORG
  195. ;
  196. . 3600000 NS F.ROOT-SERVERS.NET.
  197. F.ROOT-SERVERS.NET. 3600000 A 192.5.5.241
  198. F.ROOT-SERVERS.NET. 3600000 AAAA 2001:500:2f::f
  199. ;
  200. ; FORMERLY NS.NIC.DDN.MIL
  201. ;
  202. . 3600000 NS G.ROOT-SERVERS.NET.
  203. G.ROOT-SERVERS.NET. 3600000 A 192.112.36.4
  204. ;
  205. ; FORMERLY AOS.ARL.ARMY.MIL
  206. ;
  207. . 3600000 NS H.ROOT-SERVERS.NET.
  208. H.ROOT-SERVERS.NET. 3600000 A 198.97.190.53
  209. H.ROOT-SERVERS.NET. 3600000 AAAA 2001:500:1::53
  210. ;
  211. ; FORMERLY NIC.NORDU.NET
  212. ;
  213. . 3600000 NS I.ROOT-SERVERS.NET.
  214. I.ROOT-SERVERS.NET. 3600000 A 192.36.148.17
  215. I.ROOT-SERVERS.NET. 3600000 AAAA 2001:7fe::53
  216. ;
  217. ; OPERATED BY VERISIGN, INC.
  218. ;
  219. . 3600000 NS J.ROOT-SERVERS.NET.
  220. J.ROOT-SERVERS.NET. 3600000 A 192.58.128.30
  221. J.ROOT-SERVERS.NET. 3600000 AAAA 2001:503:c27::2:30
  222. ;
  223. ; OPERATED BY RIPE NCC
  224. ;
  225. . 3600000 NS K.ROOT-SERVERS.NET.
  226. K.ROOT-SERVERS.NET. 3600000 A 193.0.14.129
  227. K.ROOT-SERVERS.NET. 3600000 AAAA 2001:7fd::1
  228. ;
  229. ; OPERATED BY ICANN
  230. ;
  231. . 3600000 NS L.ROOT-SERVERS.NET.
  232. L.ROOT-SERVERS.NET. 3600000 A 199.7.83.42
  233. L.ROOT-SERVERS.NET. 3600000 AAAA 2001:500:9f::42
  234. ;
  235. ; OPERATED BY WIDE
  236. ;
  237. . 3600000 NS M.ROOT-SERVERS.NET.
  238. M.ROOT-SERVERS.NET. 3600000 A 202.12.27.33
  239. M.ROOT-SERVERS.NET. 3600000 AAAA 2001:dc3::35
  240. ; End of file
  241. </file></code></pre>
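<p>The 512-byte budget can be checked with some back-of-the-envelope arithmetic. The sketch below adds up the size of a response listing the root servers with their IPv4 addresses, assuming the standard DNS name compression and ignoring the IPv6 records and later extensions that allow larger packets. The function name and constants are mine, chosen for illustration.</p>

```python
UDP_LIMIT = 512            # classic maximum DNS-over-UDP payload, in bytes
HEADER = 12                # fixed-size DNS header
QUESTION = 1 + 2 + 2       # root name (a single zero byte) + QTYPE + QCLASS
RR_FIXED = 2 + 2 + 4 + 2   # TYPE + CLASS + TTL + RDLENGTH in every record


def priming_response_size(servers=13):
    # The first NS target is written out in full as length-prefixed labels:
    # "a" + "root-servers" + "net" + the terminating zero byte.
    first_target = (1 + 1) + (1 + 12) + (1 + 3) + 1
    # Later targets need only their one-letter label plus a 2-byte
    # compression pointer back to "root-servers.net".
    later_target = (1 + 1) + 2
    # Each NS record is owned by the root, whose name is one zero byte.
    ns_records = (1 + RR_FIXED + first_target) + (servers - 1) * (1 + RR_FIXED + later_target)
    # Each A record: a 2-byte pointer to the server name, the fixed
    # fields, and a 4-byte IPv4 address.
    a_records = servers * (2 + RR_FIXED + 4)
    return HEADER + QUESTION + ns_records + a_records


size = priming_response_size()
print(size, "bytes of a", UDP_LIMIT, "byte budget")
```

<p>Under these assumptions the thirteen-server response comes to 436 bytes. Part of the reason it fits is that every server shares the <code>root-servers.net</code> suffix, which compresses well.</p>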
  242. <p>Root DNS servers operate in safes, inside locked cages. A clock sits on the
  243. safe to ensure the camera feed hasn’t been looped. Particularly given how
  244. slow <a href="https://en.wikipedia.org/wiki/Domain_Name_System_Security_Extensions">DNSSEC</a>
  245. implementation has been, an attack on one of those servers could
  246. allow an attacker to redirect all of the Internet traffic for a portion of
  247. Internet users. This, of course, makes for the most fantastic heist movie to
  248. have never been made.</p>
  249. <p>Unsurprisingly, the nameservers for TLDs don’t actually change all
  250. that often.
  251. <a href="http://dns.measurement-factory.com/writings/wessels-pam2003-paper.pdf">98%</a> of
  252. the requests root DNS servers receive are in error, most often because of
  253. broken and toy clients which don’t properly cache their results. This became
  254. such a problem that several root DNS operators had to <a href="https://www.as112.net/">spin
  255. up</a> special servers just to return ‘go away’ to all the
  256. people asking for reverse DNS lookups on their local IP addresses.</p>
  257. <p>The TLD nameservers are administered by different companies and governments all
  258. around the world (<a href="https://www.verisign.com/">Verisign</a> manages <code>.com</code>). When you purchase a <code>.com</code> domain,
  259. about $0.18 goes to ICANN, and $7.85 <a href="http://webmasters.stackexchange.com/questions/61467/if-icann-only-charges-18%C2%A2-per-domain-name-why-am-i-paying-10">goes
  260. to</a>
  261. Verisign.</p>
  262. <h3 id="punycode">Punycode</h3>
  263. <p>It is rare in this world that the silly name we developers think up for a new
  264. project makes it into the final, public, product. We might name the company
  265. database Delaware (because that’s where all the companies are registered), but
  266. you can be sure by the time it hits production it will be
  267. CompanyMetadataDatastore. But rarely, when all the stars align and the boss is
  268. on vacation, one slips through the cracks.</p>
  269. <p>Punycode is the system we use to encode unicode into domain names. The problem
  270. it is solving is simple: how do you write 比薩.com when the entire internet
  271. system was built around using the <a href="https://en.wikipedia.org/wiki/ASCII">ASCII</a>
  272. alphabet whose most foreign character is the tilde?</p>
  273. <p>It’s not a simple matter of switching domains to use
  274. <a href="https://en.wikipedia.org/wiki/Unicode">unicode</a>. The <a href="https://tools.ietf.org/html/rfc1035">original
  275. documents</a> which govern domains specify
  276. they are to be encoded in ASCII. Every piece of internet hardware from the
  277. last forty years, including the
  278. <a href="http://www.cisco.com/c/en/us/support/routers/crs-1-multishelf-system/model.html">Cisco</a>
  279. and
  280. <a href="http://www.juniper.net/techpubs/en_US/release-independent/junos/information-products/pathway-pages/t-series/t1600/">Juniper</a>
  281. routers used to deliver this page to you, makes that assumption.</p>
  282. <p>The web itself was <a href="http://1997.webhistory.org/www.lists/www-talk.1994q3/1085.html">never
  283. ASCII-only</a>.
  284. It was actually originally conceived to speak <a href="https://en.wikipedia.org/wiki/ISO/IEC_8859-1">ISO
  285. 8859-1</a> which includes all of the
  286. ASCII characters, but adds an additional set of special characters like ¼ and
  287. letters with special marks like ä. It does not, however, contain any non-Latin
  288. characters.</p>
  289. <p>This restriction on HTML was ultimately removed in
  290. <a href="https://tools.ietf.org/html/rfc2070">1997</a>, and a decade later, in 2007, Unicode
  291. <a href="https://googleblog.blogspot.com/2008/05/moving-to-unicode-51.html">became</a> the
  292. most popular character set on the web. But domains were still confined to ASCII.</p>
  293. <p><a href="http://www.alanwood.net/unicode/"><img src="images/ie-hebrew.gif" alt="Hebrew in IE 5"/></a></p>
  294. <p>As you might guess, Punycode was not the first proposal to solve this problem.
  295. You most likely have heard of UTF-8, which is a popular way of encoding Unicode
  296. into bytes (the 8 is for the eight bits in a byte). In the year
  297. <a href="https://tools.ietf.org/html/draft-jseng-utf5-01">2000</a> several members of the
  298. Internet Engineering Task Force came up with UTF-5. The idea was to encode
  299. Unicode into five bit chunks. You could then map each five bits into a
  300. character allowed (A-V &amp; 0-9) in domain names. So if I had a website for
  301. Japanese language learning, my site 日本語.com would become the cryptic
  302. M5E5M72COA9E.com.</p>
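<p>The mapping is simple enough to sketch in a few lines. As I read the draft, each character’s code point is written in hexadecimal, and the first hex digit is swapped for a letter from G to V so a decoder can tell where each character starts. Treat the function below as an illustration of that idea rather than a complete implementation of the proposal.</p>

```python
# The first hex digit of every character is replaced by one of these
# letters (0 -> G, 1 -> H, ... F -> V), marking a character boundary.
FIRST_DIGIT = "GHIJKLMNOPQRSTUV"


def utf5_encode(text):
    out = []
    for ch in text:
        digits = format(ord(ch), "X")   # uppercase hex, no leading zeros
        out.append(FIRST_DIGIT[int(digits[0], 16)])
        out.append(digits[1:])          # remaining digits stay as 0-9, A-F
    return "".join(out)


print(utf5_encode("日本語") + ".com")  # M5E5M72COA9E.com
```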
  303. <p>This encoding method has several disadvantages. For one, A-V and 0-9 are used
  304. in the output encoding, meaning if you wanted to actually include one of those
  305. characters in your domain, it had to be encoded like everything else. This made
  306. for some very long domains, which is a serious problem when each segment of a
  307. domain is restricted to 63 characters. A domain in the Myanmar language would
  308. be restricted to no more than 15 characters. The proposal does make the very
  309. interesting suggestion of using UTF-5 to allow Unicode to be transmitted by
  310. Morse code and telegram though.</p>
  311. <p>There was also the question of how to let clients know that a domain was
  312. encoded so they could display it in the appropriate Unicode characters,
  313. rather than showing M5E5M72COA9E.com in my address bar. There were <a href="https://tools.ietf.org/html/draft-ietf-idn-compare-01">several
  314. suggestions</a>, one of
  315. which was to use an unused bit in the DNS response. It was the “last unused
  316. bit in the header”, and the DNS folks were “very hesitant to give it up”
  317. however.</p>
  318. <p>Another suggestion was to start every domain using this encoding method with
  319. <code>ra--</code>. At <a href="https://tools.ietf.org/html/draft-ietf-idn-race-00">the time</a>
  320. (mid-April 2000), there were no domains which happened to start with those
  321. particular characters. If I know anything about the Internet, someone
  322. registered an <code>ra--</code> domain out of spite immediately after the
  323. proposal was published.</p>
  324. <p>The <a href="https://tools.ietf.org/html/rfc3492">ultimate conclusion</a>, reached in
  325. 2003, was to adopt a format called Punycode which included a form of delta
  326. compression which could dramatically shorten encoded domain names. Delta
  327. compression is a particularly good idea because the odds are all of the
  328. characters in your domain are in the same general area within Unicode. For
  329. example, two characters in Farsi are going to be much closer together than a
  330. Farsi character and another in Hindi. To give an example of how this works, if
  331. we take the nonsense phrase:</p>
  332. <p>يذؽ</p>
  333. <p>In an uncompressed format, that would be stored as the three characters <code>[1610,
  334. 1584, 1597]</code> (based on their Unicode code points). To compress this we first
  335. sort it numerically (keeping track of where the original characters were):
  336. <code>[1584, 1597, 1610]</code>. Then we can store the lowest value (<code>1584</code>), and the
  337. delta between that value and the next character (<code>13</code>), and again for the
  338. following character (also <code>13</code>), which is significantly less to transmit and store.</p>
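<p>Here is the sort-and-subtract step on its own, as a simplified sketch. Real Punycode also records where each character belongs and packs the numbers into a variable-length form, neither of which is shown here; <code>delta_encode</code> is a made-up name.</p>

```python
def delta_encode(text):
    """Return the lowest code point and the gaps between sorted code points."""
    points = sorted(ord(ch) for ch in text)
    deltas = [b - a for a, b in zip(points, points[1:])]
    return points[0], deltas


# The three-character nonsense phrase, written with escapes so the
# right-to-left text cannot be reordered by an editor.
phrase = "\u064a\u0630\u063d"
print([ord(ch) for ch in phrase])   # [1610, 1584, 1597]
print(delta_encode(phrase))         # (1584, [13, 13])
```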
  339. <p>Punycode then (very) efficiently encodes those integers into characters allowed
  340. in domain names, and inserts an <code>xn--</code> at the beginning to let consumers know
  341. this is an encoded domain. You’ll notice that all the Unicode characters end
  342. up together at the end of the domain. They don’t just encode their value, they
  343. also encode where they should be inserted into the ASCII portion of the domain.
  344. To provide an example, the website 熱狗sales.com becomes
  345. <code>xn--sales-r65lm0e.com</code>. Anytime you type a Unicode-based domain name into
  346. your browser’s address bar, it is encoded in this way.</p>
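<p>You can try the conversion yourself. Python ships an <code>idna</code> codec (it implements the original 2003 rules, so newer corner cases may differ from what a modern browser does). This sketch encodes the example domain and prints the ASCII form so it can be compared with the one quoted above.</p>

```python
domain = "熱狗sales.com"

# Each dot-separated label is converted separately; labels that are
# already plain ASCII, like "com", pass through unchanged.
ascii_form = domain.encode("idna")
print(ascii_form.decode("ascii"))

# Decoding reverses the process, which is what a browser does before
# showing the name to you.
round_trip = ascii_form.decode("idna")
print(round_trip)
```

<p>The ASCII letters of the label come first, followed by a dash and the encoded Unicode characters, exactly as described.</p>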
  347. <p>This transformation could be transparent, but that introduces a major security
  348. problem. All sorts of Unicode characters print identically to existing ASCII
  349. characters. For example, you likely can’t see the difference between Cyrillic
  350. small letter a (“а”) and Latin small letter a (“a”). If I register Cyrillic
  351. аmazon.com (xn--mazon-3ve.com), and manage to trick you into visiting it, it’s
  352. gonna be hard to know you’re on the wrong site. For that reason, when you
  353. visit <a href="http://🍕💩.ws">🍕💩.ws</a>, your browser somewhat lamely shows you
  354. <code>xn--vi8hiv.ws</code> in the address bar.</p>
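<p>A crude way to spot this kind of lookalike is to check whether a label mixes alphabets. The sketch below uses the first word of each character’s Unicode name as a stand-in for its script, which is my own shortcut and far simpler than the rules browsers actually apply.</p>

```python
import unicodedata


def scripts_used(label):
    """Approximate the scripts in a label by the first word of each letter's Unicode name."""
    return {unicodedata.name(ch).split()[0] for ch in label if ch.isalpha()}


def looks_mixed(label):
    return len(scripts_used(label)) > 1


spoof = "\u0430mazon"   # begins with CYRILLIC SMALL LETTER A
print(scripts_used(spoof), looks_mixed(spoof))
print(scripts_used("amazon"), looks_mixed("amazon"))
```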
  355. <h3 id="protocol">Protocol</h3>
  356. <p>The first portion of the URL is the protocol which should be used to access it.
  357. The most common protocol is <code>http</code>, which is the simple document transfer
  358. protocol Tim Berners-Lee invented specifically to power the web. It was not
  359. the only option. <a href="http://1997.webhistory.org/www.lists/www-talk.1993q2/0339.html">Some
  360. people</a>
  361. believed we should just use Gopher. Rather than being general-purpose, Gopher
  362. is specifically designed to send structured data similar to how a file tree is
  363. structured.</p>
  364. <p>For example, if you request the <code>/Cars</code> endpoint, it might return:</p>
  365. <pre><code>1Chevy Camaro /Archives/cars/cc gopher.cars.com 70
  366. iThe Camaro is a classic fake (NULL) 0
  367. iAmerican Muscle car fake (NULL) 0
  368. 1Ferrari 451 /Factbook/ferrari/451 gopher.ferrari.net 70
  369. </code></pre><p>which identifies two cars, along with some metadata about them and where you
  370. can connect to for more information. The understanding was your client would
  371. parse this information into a usable form which linked the entries with the
  372. destination pages.</p>
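<p>As a rough sketch of what such a client does: on the wire each menu line starts with a one-character item type, followed by tab-separated fields (the tabs have been flattened to spaces in the listing above). The sample menu and the <code>parse_menu_line</code> helper below are illustrative only.</p>

```python
def parse_menu_line(line):
    # The first character is the item type; the rest is four fields
    # separated by tabs: display text, selector, host and port.
    item_type, rest = line[0], line[1:]
    display, selector, host, port = rest.split("\t")
    return {
        "type": item_type,
        "display": display,
        "selector": selector,
        "host": host,
        "port": int(port),
    }


menu = (
    "1Chevy Camaro\t/Archives/cars/cc\tgopher.cars.com\t70\r\n"
    "iAmerican Muscle car\tfake\t(NULL)\t0\r\n"
)

entries = [parse_menu_line(line) for line in menu.splitlines()]
# Type "1" marks a link to another menu; "i" lines are plain text.
links = [entry for entry in entries if entry["type"] == "1"]
print(links)
```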
  373. <p><a href="http://www.yale.edu/pclt/WINWORLD/GOPHER.HTM"><img src="images/gopher.gif" alt="Gopher"/></a></p>
  374. <p>The first popular protocol was FTP, which was created in 1971, as a way of
  375. listing and downloading files on remote computers. Gopher was a logical
  376. extension of this, in that it provided a similar listing, but included
  377. facilities for also reading the metadata about entries. This meant it could
  378. be used for more liberal purposes like a news feed or a simple database. It
  379. did not have, however, the freedom and simplicity which characterizes HTTP and HTML.</p>
  380. <p>HTTP is a very simple protocol, particularly when compared to alternatives like
  381. FTP or even the <a href="https://http2.github.io/">HTTP/2</a> protocol which is rising in popularity today. First off,
  382. HTTP is entirely text based, rather than being composed of bespoke binary
  383. incantations (which would have made it significantly more efficient). Tim
  384. Berners-Lee correctly intuited that using a text-based format would make it
  385. easier for generations of programmers to develop and debug HTTP-based
  386. applications.</p>
  387. <p>HTTP also makes almost no assumptions about what you’re transmitting. Despite
  388. the fact that it was invented explicitly to accompany the HTML language, it
  389. allows you to specify that your content is of any type (using the MIME <code>Content-Type</code>,
  390. which was a new invention at the time). The protocol itself is rather simple:</p>
  391. <p>A request:</p>
  392. <pre><code class="lang-http">GET /index.html HTTP/1.1
  393. Host: www.example.com
  394. </code></pre>
  395. <p>To which the server might respond:</p>
  396. <pre><code class="lang-http">HTTP/1.1 200 OK
  397. Date: Mon, 23 May 2005 22:38:34 GMT
  398. Content-Type: text/html; charset=UTF-8
  400. Content-Length: 138
  401. Last-Modified: Wed, 08 Jan 2003 23:11:55 GMT
  402. Server: Apache/1.3.3.7 (Unix) (Red-Hat/Linux)
  403. ETag: "3f80f-1b6-3e1cb03b"
  404. Accept-Ranges: bytes
  405. Connection: close

  406. &lt;html&gt;
  407. &lt;head&gt;
  408. &lt;title&gt;An Example Page&lt;/title&gt;
  409. &lt;/head&gt;
  410. &lt;body&gt;
  411. Hello World, this is a very simple HTML document.
  412. &lt;/body&gt;
  413. &lt;/html&gt;
  414. </code></pre>
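<p>Because everything is text, taking a response apart needs nothing more than string splitting. The following is a minimal sketch, not a complete HTTP parser: it assumes the whole response is already in memory and ignores details such as repeated headers and chunked bodies.</p>

```python
def parse_response(raw):
    # A blank line (CRLF CRLF) separates the headers from the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(code), reason, headers, body


raw = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html; charset=UTF-8\r\n"
    b"Connection: close\r\n"
    b"\r\n"
    b"<html><body>Hello World</body></html>"
)

code, reason, headers, body = parse_response(raw)
print(code, reason, headers["content-type"])
```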
  415. <p>To put this in context, you can think of the networking system the Internet
  416. uses as starting with IP, the Internet Protocol. IP is responsible for
  417. getting a small packet of data (around 1500 bytes) from one computer
  418. to another. On top of that we have TCP, which is responsible for taking
  419. larger blocks of data like entire documents and files and sending them
  420. via many IP packets reliably. On top of that, we then implement a protocol
  421. like HTTP or FTP, which specifies what format should be used to make
  422. the data we send via TCP (or UDP, etc.) understandable and meaningful.</p>
  423. <p>In other words, TCP/IP sends a whole bunch of bytes to another computer,
  424. the protocol says what those bytes should be and what they mean.</p>
  425. <p>You can make your own protocol if you like, assembling the bytes in your
  426. TCP messages however you like. The only requirement is that whoever you
  427. are talking to speaks the same language. For this reason, it’s common
  428. to standardize these protocols.</p>
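<p>Here is what a homemade protocol might look like, as a sketch. The “protocol” is entirely made up: every message is a four-byte length followed by that many bytes of UTF-8 text. To keep the example self-contained, a connected pair of local stream sockets stands in for a real TCP connection.</p>

```python
import socket
import struct


def send_message(sock, text):
    payload = text.encode("utf-8")
    # Prefix the payload with its length as a 4-byte big-endian integer.
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def recv_exactly(sock, count):
    # A stream delivers bytes, not messages, so keep reading until we
    # have the number of bytes we were promised.
    chunks = []
    while count:
        chunk = sock.recv(count)
        if not chunk:
            raise ConnectionError("peer closed the connection early")
        chunks.append(chunk)
        count -= len(chunk)
    return b"".join(chunks)


def recv_message(sock):
    (length,) = struct.unpack("!I", recv_exactly(sock, 4))
    return recv_exactly(sock, length).decode("utf-8")


left, right = socket.socketpair()
send_message(left, "hello from a made-up protocol")
reply = recv_message(right)
left.close()
right.close()
print(reply)
```

<p>Both ends have to agree on the length prefix and the text encoding, which is precisely the agreement a standard writes down.</p>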
  429. <p>There are, of course, many less important protocols to play with. For example
  430. there is a <a href="https://www.rfc-editor.org/rfc/rfc865.txt">Quote of The Day</a>
  431. protocol (port 17), and a <a href="https://www.rfc-editor.org/rfc/rfc864.txt">Random
  432. Characters</a> protocol (port 19).
  433. They may seem silly today, but they also showcase just how important a
  434. general-purpose document transmission format like HTTP was.</p>
  435. <h3 id="port">Port</h3>
  436. <p>The timeline of Gopher and HTTP can be evidenced by their default port numbers.
  437. Gopher is 70, HTTP 80. The HTTP port was assigned (likely by <a href="https://en.wikipedia.org/wiki/Jon_Postel">Jon
  438. Postel</a> at the IANA) at the request
  439. of Tim Berners-Lee sometime between <a href="https://tools.ietf.org/html/rfc1060">1990</a>
  440. and <a href="https://tools.ietf.org/html/rfc1340">1992</a>.</p>
  441. <p>This concept of registering ‘port numbers’ predates even the Internet.
  442. In the original NCP protocol which powered the ARPANET, remote
  443. addresses were identified by 40 bits. The first 32 identified the remote
  444. host, similar to how an IP address works today. The last eight were known as
  445. the <a href="https://tools.ietf.org/html/rfc433">AEN</a> (it stood for “Another Eight-bit Number”),
  446. and were used by the remote machine in the way we use a port number, to separate
  447. messages destined for different processes. In other words, the address
  448. specifies which machine the message should go to, and the AEN (or port number)
  449. tells that remote machine which application should get the message.</p>
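<p>The split described above is a matter of shifting and masking. This sketch assumes the 40 bits are packed into one integer with the host in the upper 32 bits and the AEN in the lowest eight; that packing is my own assumption for illustration, and the sample values are invented.</p>

```python
def split_ncp_address(address):
    """Split a 40-bit address into its 32-bit host part and 8-bit AEN."""
    if not 0 <= address < 1 << 40:
        raise ValueError("expected a 40-bit value")
    host = address >> 8    # upper 32 bits: which machine
    aen = address & 0xFF   # lowest 8 bits: which process on that machine
    return host, aen


# An invented host number with an invented AEN of 23.
host, aen = split_ncp_address((0x0A000001 << 8) | 23)
print(hex(host), aen)
```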
  450. <p>They quickly <a href="https://tools.ietf.org/html/rfc322">requested</a> that users register
  451. these ‘socket numbers’ to limit potential collisions. When port numbers were
  452. expanded to 16 bits by TCP/IP, that registration process was continued.</p>
  453. <p>While protocols have a default port, it makes sense to allow ports to also be
  454. specified manually to allow for local development and the hosting of multiple
  455. services on the same machine. That same logic was the
  456. <a href="http://1997.webhistory.org/www.lists/www-talk.1992/0335.html">basis</a> for
  457. prefixing websites with <code>www.</code>. At the time, it was unlikely anyone was
  458. getting access to the root of their domain, just for hosting an ‘experimental’
  459. website. But if you give users the hostname of your specific machine
  460. (<code>dx3.cern.ch</code>), you’re in trouble when you need to replace that machine. By
  461. using a common subdomain (<code>www.cern.ch</code>) you can change what it points to as
  462. needed.</p>
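<p>The default-unless-specified rule for ports is easy to see with Python’s standard URL parser. The <code>DEFAULT_PORTS</code> table and <code>effective_port</code> helper are names I made up for this sketch, and <code>gopher.example</code> is a placeholder hostname.</p>

```python
from urllib.parse import urlsplit

# Only the two protocols discussed here; a real table would be longer.
DEFAULT_PORTS = {"http": 80, "gopher": 70}


def effective_port(url):
    parts = urlsplit(url)
    # urlsplit reports None when the URL does not name a port.
    if parts.port is not None:
        return parts.port
    return DEFAULT_PORTS[parts.scheme]


print(effective_port("http://www.cern.ch/"))       # 80
print(effective_port("http://localhost:8080/"))    # 8080
print(effective_port("gopher://gopher.example/"))  # 70
```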
  463. <h3 id="the-bit-in-between">The Bit In-between</h3>
  464. <p>As you probably know, the URL syntax places a double slash (<code>//</code>) between
  465. the protocol and the rest of the URL:</p>
  466. <pre><code>http://eager.io
  467. </code></pre><p>That double slash was inherited from the <a href="https://en.wikipedia.org/wiki/Apollo/Domain">Apollo</a>
  468. computer system which was one of the first networked workstations. The Apollo
  469. team had a similar problem to Tim Berners-Lee: they needed a way to separate
  470. a path from the machine that path is on. Their solution was to create a
  471. special path format:</p>
  472. <pre><code>//computername/file/path/as/usual
  473. </code></pre><p>And TBL copied that scheme. Incidentally, he now <a href="https://www.w3.org/People/Berners-Lee/FAQ.html#etc">regrets</a>
  474. that decision, wishing the domain (in this case <code>example.com</code>) was the first portion of the path:</p>
  475. <pre><code>http:com/example/foo/bar/baz
  476. </code></pre><h3 id="the-rest">The Rest</h3>
  477. <p>So far, we have covered the components of a URL which allow you to connect
  478. to a specific application on a remote server somewhere on the Internet. The second,
  479. and final, post of this series will cover those components of the URL which
  480. are processed by that remote application to return to you a specific piece of content,
  481. the Path, Fragment, Query and Auth.</p>
  482. <p>I would have liked to include all of the content in a single post, but its length
  483. was proving intimidating to readers. The second post is absolutely worth your
  484. time however. It includes things like the alternative forms for URLs Tim Berners-Lee
  485. considered, the history of forms and how the GET parameter syntax was decided, and the fifteen
  486. year argument over how to make URLs which won’t change. If you’d like, you can
  487. subscribe below to be notified when that post is released.</p>