- title: Archiving web sites
- url: https://lwn.net/Articles/766374/
- hash_url: 9ce5f7eee16ec460d4d2e32bd6c7ec2a
-
-
- <p>I recently took a deep dive into web site archival for friends who
- were worried about losing control over the hosting of their work
- online in the face of poor system administration or hostile
- removal.
- Those risks make web site archival an essential instrument in the
- toolbox of any system administrator.
- As it turns out, some sites are much harder to archive than
- others. This article goes through the process of archiving traditional
- web sites and shows how it falls short when confronted with the latest
- fashions in the single-page applications that are bloating the modern web.</p>
-
- <h4>Converting simple sites</h4>
-
- <p>The days of handcrafted HTML web sites are long gone. Now web sites are
- dynamic and built on the fly using the latest JavaScript, PHP, or
- Python framework. As a result, the sites are more fragile: a database
- crash, spurious upgrade, or unpatched vulnerability might lose data.
- In my previous life as a web developer, I
- had to come to terms with the idea that customers expect web sites to
- basically work forever. This expectation matches poorly with the "move
- fast and break things" attitude of web development. Working with the
- <a href="https://drupal.org">Drupal</a> content-management system (CMS) was
- particularly
- challenging in that regard as major upgrades deliberately break
- compatibility with third-party modules, which implies a costly upgrade process that
- clients could seldom afford. The solution was to archive those sites:
- take a living, dynamic web site and turn it into plain HTML files that
- any web server can serve forever. This process is useful for your own dynamic
- sites but also for third-party sites that are outside of your control and you might want
- to safeguard.</p>
-
- <p>For simple or static sites, the venerable <a href="https://www.gnu.org/software/wget/">Wget</a> program works
- well. The incantation to mirror a full web site, however, is byzantine:</p>
-
- <pre>
- $ nice wget --mirror --execute robots=off --no-verbose --convert-links \
- --backup-converted --page-requisites --adjust-extension \
- --base=./ --directory-prefix=./ --span-hosts \
- --domains=www.example.com,example.com http://www.example.com/
- </pre>
-
- <p>The above downloads the content of the web page, but also crawls
- everything within the specified domains. Before you run this against
- your favorite site, consider the impact such a crawl might have on the
- site. The above command line deliberately ignores
- <a href="https://en.wikipedia.org/wiki/Robots_exclusion_standard"><tt>robots.txt</tt></a>
- rules, as is now <a href="https://blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives/">common practice for archivists</a>,
- and hammers the website as fast as it can. Most crawlers have options to
- pause between hits and limit bandwidth usage to avoid overwhelming the
- target site.
-
- </p>
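-
- <p>For a gentler crawl, the same incantation can be extended with Wget's
- throttling options; a sketch where the one-second randomized pause and the
- 200KB/s bandwidth cap are arbitrary values to tune for the target site:</p>
-
- <pre>
- $ nice wget --mirror --execute robots=off --no-verbose --convert-links \
-     --backup-converted --page-requisites --adjust-extension \
-     --base=./ --directory-prefix=./ --span-hosts \
-     --domains=www.example.com,example.com \
-     --wait=1 --random-wait --limit-rate=200k http://www.example.com/
- </pre>
-
- <p>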
- The above command will also fetch "page
- requisites" like style sheets (CSS), images, and scripts. The
- downloaded page contents are modified so that links point to the local
- copy as well. Any web server can host the resulting file set, which results
- in a static copy of the original web site.</p>
-
- <p>That is, when things go well. Anyone who has ever worked with a computer
- knows that things seldom go according to plan; all sorts of
- things can make the procedure derail in interesting ways. For example,
- it was trendy for a while to have calendar blocks in web sites. A CMS
- would generate those on the fly and make crawlers go into an infinite
- loop trying to retrieve all of the pages. Crafty archivers can resort to regular expressions
- (e.g. Wget has a <code>--reject-regex</code> option) to ignore problematic
- resources. Another option, if the administration interface for the
- web site is accessible, is to disable calendars, login forms, comment
- forms, and other dynamic areas. Once the site becomes static, those
- will stop working anyway, so it makes sense to remove such clutter
- from the original site as well.</p>
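-
- <p>For example, a calendar block that generates a page for every month into
- eternity can often be excluded with a pattern like the following; the
- <tt>/calendar/</tt> path and the query parameter are hypothetical and need
- to be adapted to the site being archived:</p>
-
- <pre>
- $ wget --mirror --execute robots=off --convert-links --page-requisites \
-     --adjust-extension --reject-regex '(/calendar/|[?]month=)' \
-     http://www.example.com/
- </pre>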
-
- <h4>JavaScript doom</h4>
-
- <p>Unfortunately, some web sites are built with much more than pure
- HTML. In single-page sites, for example, the web browser builds the
- content itself by executing a small JavaScript program. A simple user
- agent like Wget will struggle to reconstruct a meaningful static copy
- of those sites as it does not support JavaScript at all. In theory, web
- sites should be using <a href="https://en.wikipedia.org/wiki/Progressive_enhancement">progressive
- enhancement</a> to have content and
- functionality available without JavaScript, but those guidelines are
- rarely followed, as anyone using plugins like <a href="https://noscript.net/">NoScript</a> or
- <a href="https://github.com/gorhill/uMatrix">uMatrix</a> will confirm.</p>
-
- <p>Traditional archival methods sometimes fail in the dumbest way. When
- trying to build an offsite backup of a local newspaper
- (<a href="https://pamplemousse.ca/">pamplemousse.ca</a>), I found that
- WordPress adds query strings
- (e.g. <code>?ver=1.12.4</code>) at the end of JavaScript includes. Because Wget keeps
- the query string as part of the saved file name, the downloaded scripts no
- longer end in <code>.js</code>, which confuses content-type detection in the web
- servers that serve the archive: they rely on the file extension
- to send the right <code>Content-Type</code> header. When such an archive is
- loaded in a
- web browser, it fails to load scripts, which breaks dynamic websites.</p>
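-
- <p>A crude workaround, short of re-crawling with a smarter tool, is a
- post-processing pass that strips the version suffix from both the saved
- file names and the pages that reference them. A rough sketch, assuming GNU
- find and sed, that the mirror lives in <tt>./www.example.com/</tt>, and
- that no two files differ only by their <code>?ver=</code> suffix:</p>
-
- <pre>
- $ cd www.example.com
- $ find . -name '*.js\?ver=*' -print0 | while IFS= read -r -d '' f; do
-       mv "$f" "${f%%\?ver=*}"
-   done
- $ find . -name '*.html' -print0 | xargs -0 sed -i \
-       's/\.js[?]ver=[0-9.]*/.js/g; s/\.js%3Fver=[0-9.]*/.js/g'
- </pre>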
-
- <p>As the web moves toward using the browser as a virtual machine to run
- arbitrary code, archival methods relying on pure HTML parsing need to
- adapt. The solution for such problems is to record (and replay) the
- HTTP headers delivered by the server during the crawl and indeed
- professional archivists use just such an approach.</p>
-
- <h4>Creating and displaying WARC files</h4>
-
- <p>At the <a href="https://archive.org">Internet Archive</a>, Brewster
- Kahle and Mike Burner designed
- the <a href="http://www.archive.org/web/researcher/ArcFileFormat.php">ARC</a> (for "ARChive") file format in 1996 to provide a way to
- aggregate the millions of small files produced by their archival
- efforts. The format was eventually standardized as the WARC ("Web
- ARChive") <a href="https://iipc.github.io/warc-specifications/">specification</a> that
- was released as an ISO standard in 2009 and
- revised in 2017. The standardization effort was led by the <a href="https://en.wikipedia.org/wiki/International_Internet_Preservation_Consortium">International Internet
- Preservation Consortium</a> (IIPC), which is an "<span>international
- organization of libraries and other organizations established to
- coordinate efforts to preserve internet content for the future</span>",
- according to Wikipedia; it includes members such as the US Library of
- Congress and the Internet Archive. The latter uses the WARC format
- internally in its Java-based <a href="https://github.com/internetarchive/heritrix3/wiki">Heritrix
- crawler</a>.</p>
-
- <p>A WARC file aggregates multiple resources like HTTP headers, file
- contents, and other metadata in a single compressed
- archive. Conveniently, Wget actually supports the file format with
- its <code>--warc-file</code> option.</p>
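-
- <p>A minimal sketch of such an invocation; the file name prefix is
- arbitrary and Wget writes a compressed <tt>example.warc.gz</tt> next to
- the usual mirror:</p>
-
- <pre>
- $ wget --mirror --page-requisites --adjust-extension \
-     --warc-file=example http://www.example.com/
- </pre>
-
- <p>Unfortunately, web browsers cannot render WARC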
- files directly, so a viewer or some conversion is necessary to access
- the archive. The simplest such viewer I have found is <a href="https://github.com/webrecorder/pywb">pywb</a>, a
- Python package that runs a simple webserver to offer a
- Wayback-Machine-like interface to browse the contents of WARC
- files. The following set of commands will render a WARC file on
- <tt>http://localhost:8080/</tt>:</p>
-
- <pre>
- $ pip install pywb
- $ wb-manager init example
- $ wb-manager add example crawl.warc.gz
- $ wayback
- </pre>
-
- <p>This tool was, incidentally, built by the folks behind the
- <a href="https://webrecorder.io/">Webrecorder</a> service, which can use
- a web browser to save
- dynamic page contents.</p>
-
- <p>Unfortunately, pywb has trouble loading WARC files generated by Wget
- because it <a href="https://github.com/webrecorder/pywb/issues/294">followed</a> an <a href="https://github.com/iipc/warc-specifications/issues/23">inconsistency in the 1.0
- specification</a>, which was <a href="https://github.com/iipc/warc-specifications/pull/24">fixed in the 1.1 specification</a>. Until Wget or
- pywb fix those problems, WARC files produced by Wget are not
- reliable enough for my uses, so I have looked at other alternatives. A
- crawler that got my attention is simply called <a href="https://git.autistici.org/ale/crawl/">crawl</a>. Here is how
- it is invoked:</p>
-
- <pre>
- $ crawl https://example.com/
- </pre>
-
- <p>(It <em>does</em> say "very simple" in the README.) The program does support
- some command-line options, but most of its defaults are sane: it will fetch
- page requisites from other domains (unless the <code>-exclude-related</code>
- flag is used), but does not recurse out of the domain. By default, it
- fires up ten parallel connections to the remote site, a setting that
- can be changed with the <code>-c</code> flag. But, best of all, the resulting WARC
- files load perfectly in pywb.</p>
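-
- <p>Combining the options mentioned above, a more conservative run that
- skips cross-domain page requisites and throttles down to two parallel
- connections would look something like this:</p>
-
- <pre>
- $ crawl -c 2 -exclude-related https://example.com/
- </pre>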
-
- <h4>Future work and alternatives</h4>
-
- <p>There are plenty more <a href="https://archiveteam.org/index.php?title=The_WARC_Ecosystem">resources</a>
- for using WARC files. In
- particular, there's a Wget drop-in replacement called <a href="https://github.com/chfoo/wpull">Wpull</a> that is
- specifically designed for archiving web sites. It has experimental
- support for <a href="http://phantomjs.org/">PhantomJS</a> and <a href="http://rg3.github.io/youtube-dl/">youtube-dl</a> integration that
- should allow downloading more complex JavaScript sites and streaming
- multimedia, respectively.</p>
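-
- <p>A rough sketch of a WARC-producing Wpull crawl with those integrations
- enabled; the option names follow the Wpull documentation but may vary
- between releases, so treat this as an approximation:</p>
-
- <pre>
- $ wpull --recursive --page-requisites --warc-file example \
-     --no-robots --phantomjs --youtube-dl https://example.com/
- </pre>
-
- <p>The software is the basis for an elaborate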
- archival tool called <a href="https://www.archiveteam.org/index.php?title=ArchiveBot">ArchiveBot</a>,
- which is used by the "<span>loose collective of
- rogue archivists, programmers, writers and loudmouths</span>" at
- <a href="https://archiveteam.org/">ArchiveTeam</a> in its struggle to
- "<span>save the history before it's lost
- forever</span>". It seems that PhantomJS integration does not work as well as
- the team wants, so ArchiveTeam also uses a rag-tag bunch of other
- tools to mirror more complex sites. For example, <a href="https://github.com/JustAnotherArchivist/snscrape">snscrape</a> will
- crawl a social media profile to generate a list of pages to send into
- ArchiveBot. Another tool the team employs is <a href="https://github.com/PromyLOPh/crocoite">crocoite</a>, which uses
- the Chrome browser in headless mode to archive JavaScript-heavy sites.</p>
-
- <p>This article would also not be complete without a nod to the
- <a href="http://www.httrack.com/">HTTrack</a> project, the "website
- copier". Working similarly to Wget,
- HTTrack creates local copies of remote web sites but unfortunately does
- not support WARC output. Its interactive aspects might be of more
- interest to novice users unfamiliar with the command line.
-
- </p>
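-
- <p>A basic HTTrack mirror is a short command line, where the output
- directory passed to <code>-O</code> is an arbitrary choice:</p>
-
- <pre>
- $ httrack "http://www.example.com/" -O ./example.com-mirror
- </pre>
-
- <p>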
- In the
- same vein, during my research I found a full rewrite of Wget called
- <a href="https://gitlab.com/gnuwget/wget2">Wget2</a> that has support for
- multi-threaded operation, which might make
- it faster than its predecessor. It is <a href="https://gitlab.com/gnuwget/wget2/wikis/home">missing some
- features</a> from
- Wget, however, most notably reject patterns, WARC output, and FTP support,
- but it adds RSS, DNS caching, and improved TLS support.</p>
-
- <p>Finally, my personal dream for these kinds of tools would be to have
- them integrated with my existing bookmark system. I currently keep
- interesting links in <a href="https://wallabag.org/">Wallabag</a>, a
- self-hosted "read it later"
- service designed as a free-software alternative to <a href="https://getpocket.com/">Pocket</a> (now owned by
- Mozilla). But Wallabag, by design, creates only a
- "readable" version of the article instead of a full copy. In some
- cases, the "readable version" is actually <a href="https://github.com/wallabag/wallabag/issues/2825">unreadable</a> and Wallabag
- sometimes <a href="https://github.com/wallabag/wallabag/issues/2914">fails to parse the article</a>. Instead, other tools like
- <a href="https://pirate.github.io/bookmark-archiver/">bookmark-archiver</a>
- or <a href="https://github.com/kanishka-linux/reminiscence">reminiscence</a> save
- a screenshot of the
- page along with full HTML but, unfortunately, no WARC file that would
- allow an even more faithful replay.</p>
-
- <p>The sad truth of my experiences with mirrors and archival is that data
- dies. Fortunately,
- amateur archivists have tools at their disposal to keep interesting
- content alive online. For those who do not want to go through that
- trouble, the Internet Archive seems to be here to stay and ArchiveTeam
- is obviously <a href="http://iabak.archiveteam.org">working on a
- backup of the Internet Archive itself</a>.</p>