title: On caching links
slug: caching-links
date: 2018-10-15
chapo: Because the impermanence of the Web is a reality.
lang: en

<details lang=fr>
  <summary>Résumé en français</summary>
  <p>Différentes stratégies pour conserver le contenu des liens vers l’extérieur.</p>
</details>

> You can take an HTML document written over two decades ago, and open it in a browser today.
>
> Even more astonishing, you can take an HTML document written today and open it in a browser from two decades ago. That’s because the error-handling model of HTML has always been to simply ignore any tags it doesn’t recognise and render the content inside them.
> 
> <cite>*[The Web Is Agreement](https://adactio.com/articles/14321)* ([cache](/david/cache/22968d175d13a6693433c7c4732469da/))</cite>

The thing is: good luck finding content from two decades ago!

Readers regularly ask me how I keep content from external sources over the years, as a cache for almost every page linked from here.

*TL;DR: it’s tedious, but stay with me: there are now tools to help you.*

## How I do it

I start by using [python-readability](https://github.com/buriy/python-readability) from my custom code generating these pages (there are [tons of alternatives in Python](https://github.com/bookieio/breadability#alternatives) and other languages). Then come a couple of manual edits, mostly for websites not generating HTML or serving an indigestible tag soup (I’m looking at you, Medium!). Then I fix relative links and images. And finally, the markdown code is generated to ease the copy-pasta into the final markdown document. *I am aware that this is usable by developers only, but it answers the initial question.*
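The last two steps (fixing relative links and generating the markdown) could look roughly like the sketch below. It is not my actual code: the function names are made up for illustration, the regular expression only handles quoted `href`/`src` attributes, and deriving the cache key from an MD5 of the URL is an assumption used here only to get a stable directory name.

```python
import hashlib
import re
from urllib.parse import urljoin

# Matches href="..." or src="..." with single or double quotes.
# Unquoted attribute values are deliberately not handled in this sketch.
ATTR_RE = re.compile(r'''\b(href|src)=(["'])(.*?)\2''', re.IGNORECASE)


def absolutize_links(html, base_url):
    """Resolve relative href/src values against the URL of the source page."""

    def replace(match):
        attr, quote, value = match.groups()
        return f"{attr}={quote}{urljoin(base_url, value)}{quote}"

    return ATTR_RE.sub(replace, html)


def cache_key(url):
    """Hypothetical key scheme (an assumption): hex MD5 digest of the URL."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def cite_markdown(title, url, key):
    """Build the citation line to paste under a quote, with its cache link."""
    return f"<cite>*[{title}]({url})* ([cache](/david/cache/{key}/))</cite>"
```

Calling `absolutize_links(extracted_html, source_url)` on the output of the readability step keeps images and links working once the content is served from another domain, and `cite_markdown(title, source_url, cache_key(source_url))` gives the line to copy into the article.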

If your goal is only to archive links (not republish them), you can take a look at [reminiscence](https://github.com/kanishka-linux/reminiscence) or [bookmark-archiver](https://github.com/pirate/bookmark-archiver), for instance. I’m still [having the dream](/david/stream/2018/02/26/) of combining all that with a browser, as I started to do with my [contentbrowser](https://bitbucket.org/david/contentbrowser). One day maybe…

## How you can do it

> As part of the Internet Archive’s aim to build a better Web, we have been working to make the Web more reliable — and are pleased to announce that 9 million formerly broken links on Wikipedia now work because they go to archived versions in the Wayback Machine.
>
> For more than 5 years, the Internet Archive has been archiving nearly every URL referenced in close to 300 wikipedia sites as soon as those links are added or changed at the rate of about 20 million URLs/week.
>
> <cite>*[More than 9 million broken links on Wikipedia are now rescued](https://blog.archive.org/2018/10/01/more-than-9-million-broken-links-on-wikipedia-are-now-rescued/)* ([cache](/david/cache/adcc9f7650ed8a95618998e6d68650b5/))</cite>

Using the Internet Archive directly might be an option. It is still a backup on somebody else’s infrastructure but let’s call it *Cloud* and it sounds OK :-).
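If you go that way, the Wayback Machine exposes an availability endpoint that tells you whether a snapshot of a page already exists. Here is a minimal sketch using only the standard library; the helper names are mine, and the shape of the JSON response is what I understand the endpoint returns, so double-check it before relying on it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_url(page_url):
    """Build the query URL asking for the closest snapshot of page_url."""
    return f"{AVAILABILITY_ENDPOINT}?{urlencode({'url': page_url})}"


def closest_snapshot(payload):
    """Return the archived URL from a decoded response, or None if absent."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


def lookup(page_url, timeout=10):
    """Perform the network request (not run on import)."""
    with urlopen(availability_url(page_url), timeout=timeout) as response:
        return closest_snapshot(json.load(response))
```

Calling `lookup("https://adactio.com/articles/14321")` would return the URL of an archived copy if there is one, which you could put next to the original link instead of hosting the cache yourself.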

> I knew we had to add it at some point. But honestly, x-callback-url support is something I would have never dreamed of being excited about. But, damn it, I am. I am not going to make a long story about how that happened. Just this much: Spend a few minutes on how the Shortcuts app works, and in a breath, you can send clippings and entire articles to iA Writer, with title, copied text, and tags.
>
> <cite>*[Write to Organize](https://ia.net/writer/blog/write-to-organize)* ([cache](/david/cache/69dbdb0356028efdcec2bc2fb2a384e2/))</cite>

I didn’t know you could use Apple Shortcuts like that; I need to investigate whether it fits my needs. I’m not sure I would bet on it because I prefer to rely on something I have full control over, but if you want a less technical option it sounds fine. There are probably similar options on other operating systems.

## How we can do it better

> At the Internet Archive, Brewster Kahle and Mike Burner designed the ARC (for "ARChive") file format in 1996 to provide a way to aggregate the millions of small files produced by their archival efforts. The format was eventually standardized as the WARC ("Web ARChive") specification that was released as an ISO standard in 2009 and revised in 2017.
>
> <cite>*[Archiving web sites](https://lwn.net/Articles/766374/)* ([cache](/david/cache/9ce5f7eee16ec460d4d2e32bd6c7ec2a/))</cite>

I still think these archives (including that website but that’s another story) should be distributed and shared across groups of people/services for even better longevity and resilience.

Using [OpenZIM](http://www.openzim.org/wiki/OpenZIM) to store the content and [Kiwix](https://github.com/kiwix/kiwix-js) to read/distribute it might work with existing technologies. I already talked about these technologies in [another context](/david/stream/2018/06/28/) but they are still pertinent in this one.

## Browsers, browsers, browsers

Well, imagine if the Reader feature of your browser had a cache. Now, imagine if that cache were anonymized (encrypted?) and shared. Boom, you have it for free! (and a lot of imagination :p)

*Hey Mozilla, you still disruptive?*