title: The Irony of Writing About Digital Preservation
url: http://www.theatlantic.com/technology/archive/2015/11/the-irony-of-writing-about-digital-preservation/416184/
hash_url: b74e15baff
Recently, Adrienne LaFrance wrote in The Atlantic about the digital death and rebirth of a story that was a Pulitzer Prize finalist in 2008. Because “The Crossing,” a 34-part series originally published by the Rocky Mountain News, was born digital, it was not as easily archived as print stories, and its journey from obscurity to resurrection was moving.
I loved LaFrance’s story. It was masterfully written, and it touched on most of the issues that digital preservationists grapple with every day. Coincidentally, the story was published the same week as a special issue of Newspaper Research Journal called “Capturing and Preserving the ‘First Draft of History’ in the Digital Environment,” which is a collection of scholarly papers (including my own) about preserving digital news.
Which led me to wonder: In 20 years, will anyone be able to read LaFrance’s story?
There is no guarantee that we will be able to read today’s news on tomorrow’s computers. I’ve been studying news preservation for the past two years, and I can confidently say that most media companies use a preservation strategy that resembles Swiss cheese.
My contribution to the NRJ special issue centers on news apps, the interactive databases like ProPublica’s “Surgeon Scorecard” that allow readers to read a story, search for themselves or their community in the data, and then figure out exactly how the story affects their own lives. When a data journalist calls something a “news app,” it doesn’t mean the thing you download from the App Store. ProPublica’s Scott Klein explains: “Inside newsrooms, these interactive databases are sometimes called ‘news applications’—but don’t be confused. They’re interactive databases published on the web, not something you buy on your smartphone. Think Dollars for Docs, not Flipboard or Zite.”
News apps aren’t being preserved because they are software, and software preservation is a specialized, idiosyncratic pursuit that requires more money and more specialized labor than is available at media organizations today. But, you might argue, it ought to be easy to preserve stories that are not software, right? A story like LaFrance’s, which is composed of text and images and a few hyperlinks to outside sources, ought to be simpler to save?
You’d think so. But not necessarily.
To understand why, we need to look at the back-end technology of the newsroom. In developer-speak, the front end is the nice-looking part of the technology that is open to customers and the world; the back end is the factory where the sausage is made.
You probably know the basics of the back end: When you click a link or type a URL into your web browser, a web server delivers a page to your browser. At a media organization, the web server assembles a page for you that consists of different digital assets: text, images, captions, headlines, code, videos, or ads. These assets reside in a content management system (CMS) that organizes the thousands or millions of pieces of content that the media company generates.
It’s rarely just one CMS, however. Newsrooms rely on a blend of new and legacy systems. In a newsroom that produces a print edition, there is always an additional software system, such as K4, CCI, or Hermes, that manages page layouts and sends those pages to digital printers. Let’s call this the print CMS. This is different from the web CMS, which could be a system like WordPress. The BBC uses at least two web CMSes. (Here’s a diagram of the newest one, Vivo.)
Invisible processes seamlessly transmit text, images, headlines, and other content from one system to the other. Most news organizations don’t have in-house librarians anymore, so archiving is largely done automatically. Large organizations like LexisNexis or EBSCO (The Atlantic’s archiver) will hoover up a digital feed from the news organization, store the information in a database, and then license packages of such databases to libraries. The digital feed might include the text of each story, the author’s name, the title of the story, any associated images, and some meta-information that describes the placement of the story or its licensing rights. In some cases, the feed also includes PDF images of each page of the newspaper or magazine.
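To make the idea of a feed concrete, here is a minimal sketch of what a single item in such a feed might look like. Everything about it is an assumption for illustration: the class name `FeedItem`, the field names, and the choice of JSON are hypothetical, and nothing here describes the actual format used by EBSCO, LexisNexis, or any newsroom.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class FeedItem:
    """One story as it might travel from a newsroom to an archiver.

    Hypothetical structure: the fields mirror the kinds of information
    listed in the paragraph above, not any real vendor's schema.
    """
    title: str
    author: str
    body_text: str
    image_refs: List[str] = field(default_factory=list)  # associated images
    placement: Optional[str] = None         # e.g. section or page placement
    licensing_rights: Optional[str] = None  # who may redistribute the story
    page_pdf_ref: Optional[str] = None      # present only in some feeds


def serialize(item: FeedItem) -> str:
    """Turn a feed item into a JSON string an archiver could store."""
    return json.dumps(asdict(item), sort_keys=True)


# Example: a story with no images and no PDF page attached.
example = FeedItem(
    title="Raiders of the Lost Web",
    author="Adrienne LaFrance",
    body_text="...",
)
print(serialize(example))
```

The point of the sketch is that a story only reaches the library if some system actually builds a record like this and sends it along. A story that lives in a CMS with no such export step never produces one.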
To try to determine if LaFrance’s story was included in the archival feed, I ran a search on October 16, 2015, for all articles from The Atlantic in the EBSCO database (using my university-library subscription) from January 1, 2014, to December 31, 2015. There were 488 results.
I ran the same search on Google on the same date for stories that show a publication date on TheAtlantic.com from January 1, 2014, to December 31, 2015. There were 20,200 results.
Were there really 19,712 more stories published on TheAtlantic.com than in The Atlantic magazine? I’m not sure. Some of the Google hits could be duplicates, bringing the total number of articles published down below 20,200. Or, there could be something I don’t know about how many articles are included in my library’s subscription to EBSCO’s collection of works in The Atlantic. There could also be additional technical and licensing issues that I’m not aware of—archiving is an immensely complex practice. The 20,200 number does not include Atlantic writers’ posts to Facebook, Twitter, Instagram, Pinterest, Reddit, or any other social platforms where the journalists may have interacted with readers or posted comments related to their stories. If we want to count social posts as journalistic content, we need to revise our estimate dramatically upward. (Social posts are also surprisingly difficult to meaningfully preserve in libraries, by the way.)
In all of my library searching, I couldn’t find LaFrance’s article on “The Crossing.” In fact, searching more than 400 databases and publishers via EBSCO, and the 700 million sources contained therein, I found only nine articles by Adrienne LaFrance. Which is strange, because looking at LaFrance’s author page on TheAtlantic.com reveals pages upon pages of search results.
To understand what’s happening, we need to return to the back end and think about the systems in which story text resides. LaFrance’s story appeared on TheAtlantic.com, which runs on a web CMS called Ollie. Ollie, which replaced three older CMSes, was custom-built using a popular open-source software framework called Django. The print edition of The Atlantic is managed through a workflow system called K4, which (unlike Django) works well with the Adobe software programs that are used to create layouts. From a media-tech perspective, this is state-of-the-art engineering. I don’t know how or where the EBSCO feed taps into this configuration. Probably, what happens is something like this:
I’m reminded of the time I used a sink in a friend’s new pool house, which he built himself. “Don’t run too much water when you’re washing things,” my friend told me. “It looks like a real sink, but I didn’t hook it up to the sewer system, so the water just runs out onto the ground.” I was flummoxed. How could that be? Was he even allowed to do that? In that moment, I realized that plumbing, like software, is a complex system built by humans. Humans make mistakes and make idiosyncratic design decisions. So it is surprising, but not improbable, to realize that the complex multidimensional software systems that serve us web content might not be sending content to libraries in the ways that we expect.
When I started my research into news preservation, I thought there would be an easy technological solution. There isn’t. Every media company in the world grapples with the issue of digital archiving. Large legacy organizations, like The Atlantic or The New York Times or the BBC, do a better job than smaller companies, but nobody has a solution. From a software perspective, it is a legitimately difficult problem: unsolved, but probably not unsolvable. “The challenges of maintaining digital archives over long periods of time are as much social and institutional as technological,” reads a 2003 NSF and Library of Congress report. “Even the most ideal technological solutions will require management and support from institutions that in time go through changes in direction, purpose, management, and funding.”
Newsrooms need to manage workflow and content for print, audio, visuals, video, and code. Most software is built for companies that do only one of those things at a time; newsrooms do them all simultaneously. Every time a new technology is introduced, a newsroom needs a new content-management or workflow system to handle it. Ensuring interoperability between these systems and archival systems requires engineering, ingenuity, and regular attention.
The scale is different for newsrooms, too. Facebook only has to manage 11 years’ worth of data, all of which is digital and all of which is structured exactly the way it needs to be structured. A legacy media company might have to deal with more than a hundred years’ worth of data, only some of which is digital, all of which is potentially important to scholars, all of which has different licensing restrictions and preservation needs and is ambiguously structured. Remember when Macromedia Flash was the new hot thing in journalism? Most of those elaborate Flash projects have disappeared now. They’re probably archived on Jaz drives in a storage room somewhere, next to boxes of color slides and piles of floppy disks and other outdated media. Future historians will likely lament this loss.
The web only shows recent history. “Not one publication has a complete archive of its website,” my colleagues Kathleen Hansen and Nora Paul write in their NRJ article, “Newspaper Archives Reveal Major Gaps in Digital Age.” “Most can go back no earlier than 2008 … In every case, informants talked about the chaos of switching CMSes or servers, of shifting organizational homes for the website, of staffing changes and many other elements that have had an impact on the integrity of the website over time.”
The quantity and variety of information we now produce has outpaced our ability to preserve it for the future. Librarians are the only ones who are making sure that our collective memory is preserved. And they, along with small teams of digital historians elsewhere, are still trying to understand the scope of myriad challenges involved in modern preservation. If today’s born-digital news stories are not automatically put into library storehouses, these stories are unlikely to survive in an accessible way.
So: The articles we see today on TheAtlantic.com are stored in a CMS that is ambiguously hooked up to my library’s archival feed. For the purposes of scholarly research (which is performed through library databases, not through Google), it appears that some subset of articles from TheAtlantic.com is not being preserved. Which means that in 20 years, media scholars may not be able to read Adrienne LaFrance’s article about a story that disappeared and was resurrected, because LaFrance’s article may have disappeared.
Some savvy readers may wonder: What about the Internet Archive? Doesn’t the Wayback Machine preserve web pages, and won’t LaFrance’s story be preserved that way? The simple answer is yes. LaFrance’s article was crawled by the Internet Archive’s Wayback Machine, and you can go and look at it there. The folks at the Internet Archive are thoughtful digital preservationists, and I am grateful every day for their work preserving our collective digital memory.
If I know exactly what web page I am looking for, the Internet Archive is very helpful. I know that LaFrance’s story ran on the front page of TheAtlantic.com on October 14, 2015, and so I can go to the Wayback Machine and look at the snapshot taken closest to that date, which is October 15, and I can see LaFrance’s story “Raiders of the Lost Web” and I can click on it.
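For readers who want to try this kind of date-based lookup programmatically, here is a rough sketch using the Internet Archive’s public Wayback availability endpoint, which accepts a page address and a timestamp and reports the closest snapshot it holds. The function names are my own, and the sample response in the comments below is made up to show the shape of a reply, not copied from a real query.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_query(page_url: str, date_yyyymmdd: str) -> str:
    """Build the lookup address for a page and a target date (YYYYMMDD)."""
    params = urlencode({"url": page_url, "timestamp": date_yyyymmdd})
    return AVAILABILITY_ENDPOINT + "?" + params


def closest_snapshot(response: dict):
    """Return (timestamp, snapshot_url) for the closest capture, or None."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["timestamp"], closest["url"]


def lookup(page_url: str, date_yyyymmdd: str):
    """Ask the Wayback Machine for the capture nearest to a date.

    Needs network access; not called in the example below.
    """
    with urlopen(build_query(page_url, date_yyyymmdd)) as reply:
        return closest_snapshot(json.load(reply))


# Illustrative only: a hand-written reply in the shape the endpoint uses.
sample_reply = {
    "archived_snapshots": {
        "closest": {
            "available": True,
            "timestamp": "20151015000000",
            "url": "http://web.archive.org/web/20151015000000/http://www.theatlantic.com/",
            "status": "200",
        }
    }
}
print(build_query("theatlantic.com", "20151014"))
print(closest_snapshot(sample_reply))
```

Notice what the query requires: an address and a date. Both have to be known in advance.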
But if I don’t know exactly the web page that I want and exactly the day that the information appeared, I won’t be able to find the information in the Internet Archive. Library databases are indexed so that they are searchable, meaning that the databases include lots of information about the information that they contain. The Wayback Machine is technologically quite sophisticated—it preserves images and code too, for example—but it is not yet indexed so as to be easily searchable. The Internet Archive will allow you to find a needle in a haystack, but only if you already know approximately where the needle is.
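What “indexed” means here can be shown with a toy example. The sketch below builds a tiny inverted index, a table that maps each word in a story’s title and byline to the stories that contain it. It is a deliberately simplified illustration of the general technique, not a description of how EBSCO or any library database is actually built; the record IDs and the second record are placeholders.

```python
from collections import defaultdict


def build_index(records: dict) -> dict:
    """Map each lowercase word in a title or byline to matching record IDs."""
    index = defaultdict(set)
    for record_id, record in records.items():
        words = (record["title"] + " " + record["author"]).lower().split()
        for word in words:
            index[word].add(record_id)
    return index


def search(index: dict, query: str) -> set:
    """Return the IDs of records that contain every word in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    hits = set(index.get(terms[0], set()))
    for term in terms[1:]:
        hits &= index.get(term, set())
    return hits


# Placeholder catalog: the IDs and the second entry are invented.
records = {
    "a1": {"title": "Raiders of the Lost Web", "author": "Adrienne LaFrance"},
    "a2": {"title": "Placeholder Story", "author": "Example Author"},
}
index = build_index(records)
print(search(index, "lafrance raiders"))
```

With an index like this, a reader who remembers only an author’s name or a word from the headline can still find the story. Without one, the reader has to supply the exact address and date.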
I’m pretty sure that in five years, when I want to re-read LaFrance’s article, I won’t remember the exact date on which it was published. I’m also reasonably sure that in five years my browser bookmark to the story will be broken because of linkrot: The Atlantic will have redesigned its website and the story’s URL will be different. My 2020 web-searching self will probably look on The Atlantic’s website and fail to find the article because the CMS will have changed, and the search parameters will be set up differently, and I will not be able to find so much as a title for the article in the library databases. Which means I will give up in frustration and rant to anyone who will listen about how disorganized the online world is and how we are losing digital history almost as soon as we make it. This is a shame. Because it’s a really good article, and it deserves to endure.
There is a solution, of course. I could just print the article and keep it in my filing cabinet. But that would be a step backward, not forward.