A place to cache linked articles (think custom and personal wayback machine)

title: Archiving web sites
url: https://lwn.net/Articles/766374/
hash_url: 9ce5f7eee16ec460d4d2e32bd6c7ec2a
  4. <p>I recently took a deep dive into web site archival for friends who
  5. were worried about losing control over the hosting of their work
  6. online in the face of poor system administration or hostile
  7. removal.
  8. This makes web site archival an essential instrument in the
  9. toolbox of any system administrator.
  10. As it turns out, some sites are much harder to archive than
  11. others. This article goes through the process of archiving traditional
  12. web sites and shows how it falls short when confronted with the latest
  13. fashions in the single-page applications that are bloating the modern web.</p>
  14. <h4>Converting simple sites</h4>
  15. <p>The days of handcrafted HTML web sites are long gone. Now web sites are
  16. dynamic and built on the fly using the latest JavaScript, PHP, or
  17. Python framework. As a result, the sites are more fragile: a database
  18. crash, spurious upgrade, or unpatched vulnerability might lose data.
  19. In my previous life as web developer, I
  20. had to come to terms with the idea that customers expect web sites to
  21. basically work forever. This expectation matches poorly with "move
  22. fast and break things" attitude of web development. Working with the
  23. <a href="https://drupal.org">Drupal</a> content-management system (CMS) was
  24. particularly
  25. challenging in that regard as major upgrades deliberately break
  26. compatibility with third-party modules, which implies a costly upgrade process that
  27. clients could seldom afford. The solution was to archive those sites:
  28. take a living, dynamic web site and turn it into plain HTML files that
  29. any web server can serve forever. This process is useful for your own dynamic
  30. sites but also for third-party sites that are outside of your control and you might want
  31. to safeguard.</p>
  32. <p>For simple or static sites, the venerable <a href="https://www.gnu.org/software/wget/">Wget</a> program works
  33. well. The incantation to mirror a full web site, however, is byzantine:</p>
  34. <pre>
  35. $ nice wget --mirror --execute robots=off --no-verbose --convert-links \
  36. --backup-converted --page-requisites --adjust-extension \
  37. --base=./ --directory-prefix=./ --span-hosts \
  38. --domains=www.example.com,example.com http://www.example.com/
  39. </pre>
  40. <p>The above downloads the content of the web page, but also crawls
  41. everything within the specified domains. Before you run this against
  42. your favorite site, consider the impact such a crawl might have on the
  43. site. The above command line deliberately ignores
  44. <a href="https://en.wikipedia.org/wiki/Robots_exclusion_standard"><tt>robots.txt</tt></a>
  45. rules, as is now <a href="https://blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives/">common practice for archivists</a>,
  46. and hammer the website as fast as it can. Most crawlers have options to
  47. pause between hits and limit bandwidth usage to avoid overwhelming the
  48. target site.
  49. </p><p>
  50. The above command will also fetch "page
  51. requisites" like style sheets (CSS), images, and scripts. The
  52. downloaded page contents are modified so that links point to the local
  53. copy as well. Any web server can host the resulting file set, which results
  54. in a static copy of the original web site.</p>
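<p>Wget itself can be asked to pace the crawl; a gentler variant of the
incantation above might add throttling options like these (the specific
delay and rate are arbitrary starting points, not recommendations):</p>
<pre>
$ nice wget --mirror --execute robots=off --no-verbose --convert-links \
       --backup-converted --page-requisites --adjust-extension \
       --wait=1 --random-wait --limit-rate=200k \
       --base=./ --directory-prefix=./ --span-hosts \
       --domains=www.example.com,example.com http://www.example.com/
</pre>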
  55. <p>That is, when things go well. Anyone who has ever worked with a computer
  56. knows that things seldom go according to plan; all sorts of
  57. things can make the procedure derail in interesting ways. For example,
  58. it was trendy for a while to have calendar blocks in web sites. A CMS
  59. would generate those on the fly and make crawlers go into an infinite
  60. loop trying to retrieve all of the pages. Crafty archivers can resort to regular expressions
  61. (e.g. Wget has a <code>--reject-regex</code> option) to ignore problematic
  62. resources. Another option, if the administration interface for the
  63. web site is accessible, is to disable calendars, login forms, comment
  64. forms, and other dynamic areas. Once the site becomes static, those
  65. will stop working anyway, so it makes sense to remove such clutter
  66. from the original site as well.</p>
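<p>As a rough sketch, if the runaway calendar pages all live under URLs
containing <tt>/calendar/</tt> (a made-up pattern; adapt it to the actual
site), the crawl could skip them with:</p>
<pre>
$ wget --mirror --execute robots=off --reject-regex '/calendar/' \
       http://www.example.com/
</pre>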
  67. <h4>JavaScript doom</h4>
  68. <p>Unfortunately, some web sites are built with much more than pure
  69. HTML. In single-page sites, for example, the web browser builds the
  70. content itself by executing a small JavaScript program. A simple user
  71. agent like Wget will struggle to reconstruct a meaningful static copy
  72. of those sites as it does not support JavaScript at all. In theory, web
  73. sites should be using <a href="https://en.wikipedia.org/wiki/Progressive_enhancement">progressive
  74. enhancement</a> to have content and
  75. functionality available without JavaScript but those directives are
  76. rarely followed, as anyone using plugins like <a href="https://noscript.net/">NoScript</a> or
  77. <a href="https://github.com/gorhill/uMatrix">uMatrix</a> will confirm.</p>
  78. <p>Traditional archival methods sometimes fail in the dumbest way. When
  79. trying to build an offsite backup of a local newspaper
  80. (<a href="https://pamplemousse.ca/">pamplemousse.ca</a>), I found that
  81. WordPress adds query strings
  82. (e.g. <code>?ver=1.12.4</code>) at the end of JavaScript includes. This confuses
  83. content-type detection in the web servers that serve the archive, which
  84. rely on the file extension
  85. to send the right <code>Content-Type</code> header. When such an archive is
  86. loaded in a
  87. web browser, it fails to load scripts, which breaks dynamic websites.</p>
  88. <p>As the web moves toward using the browser as a virtual machine to run
  89. arbitrary code, archival methods relying on pure HTML parsing need to
  90. adapt. The solution for such problems is to record (and replay) the
  91. HTTP headers delivered by the server during the crawl and indeed
  92. professional archivists use just such an approach.</p>
  93. <h4>Creating and displaying WARC files</h4>
  94. <p>At the <a href="https://archive.org">Internet Archive</a>, Brewster
  95. Kahle and Mike Burner designed
  96. the <a href="http://www.archive.org/web/researcher/ArcFileFormat.php">ARC</a> (for "ARChive") file format in 1996 to provide a way to
  97. aggregate the millions of small files produced by their archival
  98. efforts. The format was eventually standardized as the WARC ("Web
  99. ARChive") <a href="https://iipc.github.io/warc-specifications/">specification</a> that
  100. was released as an ISO standard in 2009 and
  101. revised in 2017. The standardization effort was led by the <a href="https://en.wikipedia.org/wiki/International_Internet_Preservation_Consortium">International Internet
  102. Preservation Consortium</a> (IIPC), which is an "<span>international
  103. organization of libraries and other organizations established to
  104. coordinate efforts to preserve internet content for the future</span>",
  105. according to Wikipedia; it includes members such as the US Library of
  106. Congress and the Internet Archive. The latter uses the WARC format
  107. internally in its Java-based <a href="https://github.com/internetarchive/heritrix3/wiki">Heritrix
  108. crawler</a>.</p>
  109. <p>A WARC file aggregates multiple resources like HTTP headers, file
  110. contents, and other metadata in a single compressed
  111. archive. Conveniently, Wget actually supports the file format with
  112. the <code>--warc</code> parameter. Unfortunately, web browsers cannot render WARC
  113. files directly, so a viewer or some conversion is necessary to access
  114. the archive. The simplest such viewer I have found is <a href="https://github.com/webrecorder/pywb">pywb</a>, a
  115. Python package that runs a simple webserver to offer a
  116. Wayback-Machine-like interface to browse the contents of WARC
  117. files. The following set of commands will render a WARC file on
  118. <tt>http://localhost:8080/</tt>:</p>
  119. <pre>
  120. $ pip install pywb
  121. $ wb-manager init example
  122. $ wb-manager add example crawl.warc.gz
  123. $ wayback
  124. </pre>
  125. <p>This tool was, incidentally, built by the folks behind the
  126. <a href="https://webrecorder.io/">Webrecorder</a> service, which can use
  127. a web browser to save
  128. dynamic page contents.</p>
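<p>As for where a file like <tt>crawl.warc.gz</tt> comes from in the first
place: one way (a rough sketch; the output name is arbitrary) is to add
Wget's WARC option to a mirror run, which writes a compressed
<tt>crawl.warc.gz</tt> alongside the usual file tree:</p>
<pre>
$ wget --mirror --page-requisites --warc-file=crawl \
       http://www.example.com/
</pre>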
  129. <p>Unfortunately, pywb has trouble loading WARC files generated by Wget
  130. because it <a href="https://github.com/webrecorder/pywb/issues/294">followed</a> an <a href="https://github.com/iipc/warc-specifications/issues/23">inconsistency in the 1.0
  131. specification</a>, which was <a href="https://github.com/iipc/warc-specifications/pull/24">fixed in the 1.1 specification</a>. Until Wget or
  132. pywb fix those problems, WARC files produced by Wget are not
  133. reliable enough for my uses, so I have looked at other alternatives. A
  134. crawler that got my attention is simply called <a href="https://git.autistici.org/ale/crawl/">crawl</a>. Here is how
  135. it is invoked:</p>
  136. <pre>
  137. $ crawl https://example.com/
  138. </pre>
  139. <p>(It <em>does</em> say "very simple" in the README.) The program does support
  140. some command-line options, but most of its defaults are sane: it will fetch
  141. page requirements from other domains (unless the <code>-exclude-related</code>
  142. flag is used), but does not recurse out of the domain. By default, it
  143. fires up ten parallel connections to the remote site, a setting that
  144. can be changed with the <code>-c</code> flag. But, best of all, the resulting WARC
  145. files load perfectly in pywb.</p>
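<p>Going by the flags just described, a more polite invocation might simply
dial the concurrency down to a couple of connections, for example:</p>
<pre>
$ crawl -c 2 https://example.com/
</pre>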
  146. <h4>Future work and alternatives</h4>
  147. <p>There are plenty more <a href="https://archiveteam.org/index.php?title=The_WARC_Ecosystem">resources</a>
  148. for using WARC files. In
  149. particular, there's a Wget drop-in replacement called <a href="https://github.com/chfoo/wpull">Wpull</a> that is
  150. specifically designed for archiving web sites. It has experimental
  151. support for <a href="http://phantomjs.org/">PhantomJS</a> and <a href="http://rg3.github.io/youtube-dl/">youtube-dl</a> integration that
  152. should allow downloading more complex JavaScript sites and streaming
  153. multimedia, respectively. The software is the basis for an elaborate
  154. archival tool called <a href="https://www.archiveteam.org/index.php?title=ArchiveBot">ArchiveBot</a>,
  155. which is used by the "<span>loose collective of
  156. rogue archivists, programmers, writers and loudmouths</span>" at
  157. <a href="https://archiveteam.org/">ArchiveTeam</a> in its struggle to
  158. "<span>save the history before it's lost
  159. forever</span>". It seems that PhantomJS integration does not work as well as
  160. the team wants, so ArchiveTeam also uses a rag-tag bunch of other
  161. tools to mirror more complex sites. For example, <a href="https://github.com/JustAnotherArchivist/snscrape">snscrape</a> will
  162. crawl a social media profile to generate a list of pages to send into
  163. ArchiveBot. Another tool the team employs is <a href="https://github.com/PromyLOPh/crocoite">crocoite</a>, which uses
  164. the Chrome browser in headless mode to archive JavaScript-heavy sites.</p>
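<p>Since Wpull bills itself as a drop-in replacement for Wget, a first
attempt at a WARC-producing crawl might simply reuse Wget-style options
(a rough, untested sketch; check Wpull's own documentation for the exact
flags it supports):</p>
<pre>
$ wpull --recursive --page-requisites --warc-file crawl \
        https://example.com/
</pre>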
  165. <p>This article would also not be complete without a nod to the
  166. <a href="http://www.httrack.com/">HTTrack</a> project, the "website
  167. copier". Working similarly to Wget,
  168. HTTrack creates local copies of remote web sites but unfortunately does
  169. not support WARC output. Its interactive aspects might be of more
  170. interest to novice users unfamiliar with the command line.
  171. </p><p>
  172. In the
  173. same vein, during my research I found a full rewrite of Wget called
  174. <a href="https://gitlab.com/gnuwget/wget2">Wget2</a> that has support for
  175. multi-threaded operation, which might make
  176. it faster than its predecessor. It is <a href="https://gitlab.com/gnuwget/wget2/wikis/home">missing some
  177. features</a> from
  178. Wget, however, most notably reject patterns, WARC output, and FTP support but
  179. adds RSS, DNS caching, and improved TLS support.</p>
  180. <p>Finally, my personal dream for these kinds of tools would be to have
  181. them integrated with my existing bookmark system. I currently keep
  182. interesting links in <a href="https://wallabag.org/">Wallabag</a>, a
  183. self-hosted "read it later"
  184. service designed as a free-software alternative to <a href="https://getpocket.com/">Pocket</a> (now owned by
  185. Mozilla). But Wallabag, by design, creates only a
  186. "readable" version of the article instead of a full copy. In some
  187. cases, the "readable version" is actually <a href="https://github.com/wallabag/wallabag/issues/2825">unreadable</a> and Wallabag
  188. sometimes <a href="https://github.com/wallabag/wallabag/issues/2914">fails to parse the article</a>. Instead, other tools like
  189. <a href="https://pirate.github.io/bookmark-archiver/">bookmark-archiver</a>
  190. or <a href="https://github.com/kanishka-linux/reminiscence">reminiscence</a> save
  191. a screenshot of the
  192. page along with full HTML but, unfortunately, no WARC file that would
  193. allow an even more faithful replay.</p>
  194. <p>The sad truth of my experiences with mirrors and archival is that data
  195. dies. Fortunately,
  196. amateur archivists have tools at their disposal to keep interesting
  197. content alive online. For those who do not want to go through that
  198. trouble, the Internet Archive seems to be here to stay and Archive
  199. Team is obviously <a href="http://iabak.archiveteam.org">working on a
  200. backup of the Internet Archive itself</a>.</p>