- title: Archiving web sites
- url: https://lwn.net/Articles/766374/
- hash_url: 9ce5f7eee16ec460d4d2e32bd6c7ec2a
-
-
- <p>I recently took a deep dive into web site archival for friends who
- were worried about losing control over the hosting of their work
- online in the face of poor system administration or hostile
- removal.
- Those risks make web site archival an essential instrument in the
- toolbox of any system administrator.
- As it turns out, some sites are much harder to archive than
- others. This article goes through the process of archiving traditional
- web sites and shows how it falls short when confronted with the latest
- fashions in the single-page applications that are bloating the modern web.</p>
-
- <h4>Converting simple sites</h4>
-
- <p>The days of handcrafted HTML web sites are long gone. Now web sites are
- dynamic and built on the fly using the latest JavaScript, PHP, or
- Python framework. As a result, the sites are more fragile: a database
- crash, spurious upgrade, or unpatched vulnerability might lose data.
- In my previous life as a web developer, I
- had to come to terms with the idea that customers expect web sites to
- basically work forever. This expectation matches poorly with the "move
- fast and break things" attitude of web development. Working with the
- <a href="https://drupal.org">Drupal</a> content-management system (CMS) was
- particularly
- challenging in that regard as major upgrades deliberately break
- compatibility with third-party modules, which implies a costly upgrade process that
- clients could seldom afford. The solution was to archive those sites:
- take a living, dynamic web site and turn it into plain HTML files that
- any web server can serve forever. This process is useful for your own dynamic
- sites but also for third-party sites that are outside of your control and you might want
- to safeguard.</p>
-
- <p>For simple or static sites, the venerable <a href="https://www.gnu.org/software/wget/">Wget</a> program works
- well. The incantation to mirror a full web site, however, is byzantine:</p>
-
- <pre>
- $ nice wget --mirror --execute robots=off --no-verbose --convert-links \
- --backup-converted --page-requisites --adjust-extension \
- --base=./ --directory-prefix=./ --span-hosts \
- --domains=www.example.com,example.com http://www.example.com/
- </pre>
-
- <p>The above downloads the content of the web page, but also crawls
- everything within the specified domains. Before you run this against
- your favorite site, consider the impact such a crawl might have on the
- site. The above command line deliberately ignores
- <a href="https://en.wikipedia.org/wiki/Robots_exclusion_standard"><tt>robots.txt</tt></a>
- rules, as is now <a href="https://blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives/">common practice for archivists</a>,
- and hammers the website as fast as it can. Most crawlers have options to
- pause between hits and limit bandwidth usage to avoid overwhelming the
- target site.
-
- </p>
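-
- <p>For a gentler crawl, the same incantation can be extended with Wget's
- throttling options; a sketch where the one-second randomized pause and the
- 200KB/s bandwidth cap are arbitrary values to tune for the target site:</p>
-
- <pre>
- $ nice wget --mirror --execute robots=off --no-verbose --convert-links \
-     --backup-converted --page-requisites --adjust-extension \
-     --base=./ --directory-prefix=./ --span-hosts \
-     --domains=www.example.com,example.com \
-     --wait=1 --random-wait --limit-rate=200k http://www.example.com/
- </pre>
-
- <p>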
- The above command will also fetch "page
- requisites" like style sheets (CSS), images, and scripts. The
- downloaded page contents are modified so that links point to the local
- copy as well. Any web server can host the resulting file set, which results
- in a static copy of the original web site.</p>
-
- <p>That is, when things go well. Anyone who has ever worked with a computer
- knows that things seldom go according to plan; all sorts of
- things can make the procedure derail in interesting ways. For example,
- it was trendy for a while to have calendar blocks in web sites. A CMS
- would generate those on the fly and make crawlers go into an infinite
- loop trying to retrieve all of the pages. Crafty archivers can resort to regular expressions
- (e.g. Wget has a <code>--reject-regex</code> option) to ignore problematic
- resources. Another option, if the administration interface for the
- web site is accessible, is to disable calendars, login forms, comment
- forms, and other dynamic areas. Once the site becomes static, those
- will stop working anyway, so it makes sense to remove such clutter
- from the original site as well.</p>
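-
- <p>For example, a calendar block that generates a page for every month into
- eternity can often be excluded with a pattern like the following; the
- <tt>/calendar/</tt> path and the query parameter are hypothetical and need
- to be adapted to the site being archived:</p>
-
- <pre>
- $ wget --mirror --execute robots=off --convert-links --page-requisites \
-     --adjust-extension --reject-regex '(/calendar/|[?]month=)' \
-     http://www.example.com/
- </pre>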
-
- <h4>JavaScript doom</h4>
-
- <p>Unfortunately, some web sites are built with much more than pure
- HTML. In single-page sites, for example, the web browser builds the
- content itself by executing a small JavaScript program. A simple user
- agent like Wget will struggle to reconstruct a meaningful static copy
- of those sites as it does not support JavaScript at all. In theory, web
- sites should be using <a href="https://en.wikipedia.org/wiki/Progressive_enhancement">progressive
- enhancement</a> to have content and
- functionality available without JavaScript, but those guidelines are
- rarely followed, as anyone using plugins like <a href="https://noscript.net/">NoScript</a> or
- <a href="https://github.com/gorhill/uMatrix">uMatrix</a> will confirm.</p>
-
- <p>Traditional archival methods sometimes fail in the dumbest way. When
- trying to build an offsite backup of a local newspaper
- (<a href="https://pamplemousse.ca/">pamplemousse.ca</a>), I found that
- WordPress adds query strings
- (e.g. <code>?ver=1.12.4</code>) at the end of JavaScript includes. Because Wget keeps
- the query string as part of the saved file name, the downloaded scripts no
- longer end in <code>.js</code>, which confuses content-type detection in the web
- servers that serve the archive: they rely on the file extension
- to send the right <code>Content-Type</code> header. When such an archive is
- loaded in a
- web browser, it fails to load scripts, which breaks dynamic websites.</p>
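-
- <p>A crude workaround, short of re-crawling with a smarter tool, is a
- post-processing pass that strips the version suffix from both the saved
- file names and the pages that reference them. A rough sketch, assuming GNU
- find and sed, that the mirror lives in <tt>./www.example.com/</tt>, and
- that no two files differ only by their <code>?ver=</code> suffix:</p>
-
- <pre>
- $ cd www.example.com
- $ find . -name '*.js\?ver=*' -print0 | while IFS= read -r -d '' f; do
-       mv "$f" "${f%%\?ver=*}"
-   done
- $ find . -name '*.html' -print0 | xargs -0 sed -i \
-       's/\.js[?]ver=[0-9.]*/.js/g; s/\.js%3Fver=[0-9.]*/.js/g'
- </pre>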
-
- <p>As the web moves toward using the browser as a virtual machine to run
- arbitrary code, archival methods relying on pure HTML parsing need to
- adapt. The solution for such problems is to record (and replay) the
- HTTP headers delivered by the server during the crawl and indeed
- professional archivists use just such an approach.</p>
-
- <h4>Creating and displaying WARC files</h4>
-
- <p>At the <a href="https://archive.org">Internet Archive</a>, Brewster
- Kahle and Mike Burner designed
- the <a href="http://www.archive.org/web/researcher/ArcFileFormat.php">ARC</a> (for "ARChive") file format in 1996 to provide a way to
- aggregate the millions of small files produced by their archival
- efforts. The format was eventually standardized as the WARC ("Web
- ARChive") <a href="https://iipc.github.io/warc-specifications/">specification</a> that
- was released as an ISO standard in 2009 and
- revised in 2017. The standardization effort was led by the <a href="https://en.wikipedia.org/wiki/International_Internet_Preservation_Consortium">International Internet
- Preservation Consortium</a> (IIPC), which is an "<span>international
- organization of libraries and other organizations established to
- coordinate efforts to preserve internet content for the future</span>",
- according to Wikipedia; it includes members such as the US Library of
- Congress and the Internet Archive. The latter uses the WARC format
- internally in its Java-based <a href="https://github.com/internetarchive/heritrix3/wiki">Heritrix
- crawler</a>.</p>
-
- <p>A WARC file aggregates multiple resources like HTTP headers, file
- contents, and other metadata in a single compressed
- archive. Conveniently, Wget actually supports the file format with
- its <code>--warc-file</code> option.</p>
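-
- <p>A minimal sketch of such an invocation; the file name prefix is
- arbitrary and Wget writes a compressed <tt>example.warc.gz</tt> next to
- the usual mirror:</p>
-
- <pre>
- $ wget --mirror --page-requisites --adjust-extension \
-     --warc-file=example http://www.example.com/
- </pre>
-
- <p>Unfortunately, web browsers cannot render WARC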
- files directly, so a viewer or some conversion is necessary to access
- the archive. The simplest such viewer I have found is <a href="https://github.com/webrecorder/pywb">pywb</a>, a
- Python package that runs a simple webserver to offer a
- Wayback-Machine-like interface to browse the contents of WARC
- files. The following set of commands will render a WARC file on
- <tt>http://localhost:8080/</tt>:</p>
-
- <pre>
- $ pip install pywb
- $ wb-manager init example
- $ wb-manager add example crawl.warc.gz
- $ wayback
- </pre>
-
- <p>This tool was, incidentally, built by the folks behind the
- <a href="https://webrecorder.io/">Webrecorder</a> service, which can use
- a web browser to save
- dynamic page contents.</p>
-
- <p>Unfortunately, pywb has trouble loading WARC files generated by Wget
- because it <a href="https://github.com/webrecorder/pywb/issues/294">followed</a> an <a href="https://github.com/iipc/warc-specifications/issues/23">inconsistency in the 1.0
- specification</a>, which was <a href="https://github.com/iipc/warc-specifications/pull/24">fixed in the 1.1 specification</a>. Until Wget or
- pywb fix those problems, WARC files produced by Wget are not
- reliable enough for my uses, so I have looked at other alternatives. A
- crawler that got my attention is simply called <a href="https://git.autistici.org/ale/crawl/">crawl</a>. Here is how
- it is invoked:</p>
-
- <pre>
- $ crawl https://example.com/
- </pre>
-
- <p>(It <em>does</em> say "very simple" in the README.) The program does support
- some command-line options, but most of its defaults are sane: it will fetch
- page requisites from other domains (unless the <code>-exclude-related</code>
- flag is used), but does not recurse out of the domain. By default, it
- fires up ten parallel connections to the remote site, a setting that
- can be changed with the <code>-c</code> flag. But, best of all, the resulting WARC
- files load perfectly in pywb.</p>
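-
- <p>Combining the options mentioned above, a more conservative run that
- skips cross-domain page requisites and throttles down to two parallel
- connections would look something like this:</p>
-
- <pre>
- $ crawl -c 2 -exclude-related https://example.com/
- </pre>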
-
- <h4>Future work and alternatives</h4>
-
- <p>There are plenty more <a href="https://archiveteam.org/index.php?title=The_WARC_Ecosystem">resources</a>
- for using WARC files. In
- particular, there's a Wget drop-in replacement called <a href="https://github.com/chfoo/wpull">Wpull</a> that is
- specifically designed for archiving web sites. It has experimental
- support for <a href="http://phantomjs.org/">PhantomJS</a> and <a href="http://rg3.github.io/youtube-dl/">youtube-dl</a> integration that
- should allow downloading more complex JavaScript sites and streaming
- multimedia, respectively.</p>
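-
- <p>A rough sketch of a WARC-producing Wpull crawl with those integrations
- enabled; the option names follow the Wpull documentation but may vary
- between releases, so treat this as an approximation:</p>
-
- <pre>
- $ wpull --recursive --page-requisites --warc-file example \
-     --no-robots --phantomjs --youtube-dl https://example.com/
- </pre>
-
- <p>The software is the basis for an elaborate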
- archival tool called <a href="https://www.archiveteam.org/index.php?title=ArchiveBot">ArchiveBot</a>,
- which is used by the "<span>loose collective of
- rogue archivists, programmers, writers and loudmouths</span>" at
- <a href="https://archiveteam.org/">ArchiveTeam</a> in its struggle to
- "<span>save the history before it's lost
- forever</span>". It seems that PhantomJS integration does not work as well as
- the team wants, so ArchiveTeam also uses a rag-tag bunch of other
- tools to mirror more complex sites. For example, <a href="https://github.com/JustAnotherArchivist/snscrape">snscrape</a> will
- crawl a social media profile to generate a list of pages to send into
- ArchiveBot. Another tool the team employs is <a href="https://github.com/PromyLOPh/crocoite">crocoite</a>, which uses
- the Chrome browser in headless mode to archive JavaScript-heavy sites.</p>
-
- <p>This article would also not be complete without a nod to the
- <a href="http://www.httrack.com/">HTTrack</a> project, the "website
- copier". Working similarly to Wget,
- HTTrack creates local copies of remote web sites but unfortunately does
- not support WARC output. Its interactive aspects might be of more
- interest to novice users unfamiliar with the command line.
-
- </p>
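-
- <p>A basic HTTrack mirror is a short command line, where the output
- directory passed to <code>-O</code> is an arbitrary choice:</p>
-
- <pre>
- $ httrack "http://www.example.com/" -O ./example.com-mirror
- </pre>
-
- <p>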
- In the
- same vein, during my research I found a full rewrite of Wget called
- <a href="https://gitlab.com/gnuwget/wget2">Wget2</a> that has support for
- multi-threaded operation, which might make
- it faster than its predecessor. It is <a href="https://gitlab.com/gnuwget/wget2/wikis/home">missing some
- features</a> from
- Wget, however, most notably reject patterns, WARC output, and FTP support,
- but it adds RSS, DNS caching, and improved TLS support.</p>
-
- <p>Finally, my personal dream for these kinds of tools would be to have
- them integrated with my existing bookmark system. I currently keep
- interesting links in <a href="https://wallabag.org/">Wallabag</a>, a
- self-hosted "read it later"
- service designed as a free-software alternative to <a href="https://getpocket.com/">Pocket</a> (now owned by
- Mozilla). But Wallabag, by design, creates only a
- "readable" version of the article instead of a full copy. In some
- cases, the "readable version" is actually <a href="https://github.com/wallabag/wallabag/issues/2825">unreadable</a> and Wallabag
- sometimes <a href="https://github.com/wallabag/wallabag/issues/2914">fails to parse the article</a>. Instead, other tools like
- <a href="https://pirate.github.io/bookmark-archiver/">bookmark-archiver</a>
- or <a href="https://github.com/kanishka-linux/reminiscence">reminiscence</a> save
- a screenshot of the
- page along with full HTML but, unfortunately, no WARC file that would
- allow an even more faithful replay.</p>
-
- <p>The sad truth of my experiences with mirrors and archival is that data
- dies. Fortunately,
- amateur archivists have tools at their disposal to keep interesting
- content alive online. For those who do not want to go through that
- trouble, the Internet Archive seems to be here to stay and ArchiveTeam
- is obviously <a href="http://iabak.archiveteam.org">working on a
- backup of the Internet Archive itself</a>.</p>