Digital storage is simultaneously the most fragile medium ever invented and the most robust. A change in the magnetization of a few microscopic bits on a hard disk can wipe out content forever. Furthermore, anyone who causes mischief on their web site or social media can expunge the embarrassing evidence with a few keystrokes. But in compensation, the ability to make digital copies at essentially no cost allows content to be replicated and stored in safe places. This second trait of digital media is exploited by the Internet Archive to preserve the history of the web—and more.
This article is part of a monthly series on the LPI blog to celebrate the anniversaries of several key open source projects, by exploring different angles and directions of the broad open source movement.
The Internet Archive launched in 1996, when most people had enjoyed web access for only a few years. (I date the real popularity of the web from the release of the Mosaic browser in January 1993.) Already, computer engineer Brewster Kahle could tell that historic content was being lost, and he created the Internet Archive in response. The archive’s crawlers currently capture about 750 million pages per day, and each site they visit may contain hundreds or thousands of individual web pages. At the time of this writing, the archive’s estimated content is 552 billion web pages. And it holds more than just web sites. This article explores the achievement of the Internet Archive and what it offers both researchers and ordinary computer users.
Another aspect of open knowledge is represented by web sites serving original content, which I rely on a lot when researching articles like this one. The superhero of these free sites is Wikipedia, which had its 20th anniversary on January 15 of this year. Although Wikipedia content is original, it relies on references wherever possible and warns users not to rely on it as a primary source. Furthermore, the text and images on Wikipedia are released under a Creative Commons license, the GNU Free Documentation License, or both. Therefore, the content often turns up on other web sites.
Easy come, easy go—that’s the main trait of the internet. Apparently, the U.S. Supreme Court has not learned this lesson, because the justices and their staff refer to web links in their rulings all the time. Researchers have determined that nearly half of these links are broken, producing the standard 404 error response. That means that we can’t discover the evidence used by the judges to make the decisions that have such heavy consequences.
The same loss of accountability is risked by news sites, academic research, and anyone else who uses the key advantage of the web: the ease of linking to other sites. The problem doesn’t apply just to sites that went 404 (disappeared). It also applies to sites that change content after you’ve based an argument on the old content. For this reason, when using people’s web content or posts to social media to make a point, savvy commenters post screenshots of the current content.
A more organized solution to preserving the past is provided by Amber, a project of Harvard’s Berkman Klein Center for Internet & Society. Amber makes it easy to save a copy of a web page at the time you’re viewing it. But Amber has a fundamental prerequisite: a web server on which to save the content. Most of us use web services provided by other companies, and we lack the privileges to save a page. A kind of “Amber as a Service” is offered by Harvard through Perma.cc, where anyone can save a page in its current state, creating a URL that others can refer to later. It’s also encouraging that Drupal.org allows you to save pages through Amber. Perma.cc is backed up by the Internet Archive.

To check how prevalent the problem of broken links is, I looked through an article of my own, choosing one that was fairly long and that I had published exactly four years before my research for this Internet Archive article. My published article contained 43 links, of which 7 were broken, just four years after I wrote it.
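A link audit like the one described above can be automated. The following is a minimal sketch in Python, not the method I actually used: the function names (`http_status`, `find_broken`) and the rule that any status of 400 or higher counts as broken are assumptions made for illustration. The status lookup is passed in as a parameter so the counting logic can be tried without touching the network.

```python
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def http_status(url, timeout=10):
    """Return the HTTP status code for url, or None if it cannot be reached."""
    try:
        # HEAD asks only for headers; some servers reject it, so a careful
        # checker might retry with GET before declaring a link broken.
        request = Request(url, method="HEAD",
                          headers={"User-Agent": "link-check-sketch"})
        with urlopen(request, timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        # The server answered, but with an error such as 404.
        return err.code
    except (URLError, OSError, ValueError):
        # DNS failure, refused connection, timeout, or malformed URL.
        return None


def find_broken(links, fetch_status=http_status):
    """Return the links whose status is missing or is 400 and above."""
    broken = []
    for url in links:
        status = fetch_status(url)
        if status is None or status >= 400:
            broken.append(url)
    return broken


if __name__ == "__main__":
    # Hypothetical statuses standing in for real network lookups.
    fake_statuses = {
        "https://a.example/": 200,
        "https://b.example/gone": 404,
        "https://c.example/": None,
    }
    broken = find_broken(list(fake_statuses), fetch_status=fake_statuses.get)
    print(f"{len(broken)} of {len(fake_statuses)} links are broken")
```

By the same arithmetic, 7 broken links out of 43 is roughly 16 percent. A check like this catches only links that fail outright, not pages whose content has quietly changed.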
Enter the Internet Archive. They don’t throw anything away, so you can retrieve a web site at many different dates. Let’s take a look at how to retrieve old pages. You can do this through the Wayback Machine, a search interface to the Internet Archive.
Suppose one of the links in this web page has gone 404. You can retrieve the content at that link as follows.
You can also skip the visual interface and search for the page manually, but this is a complicated topic I won’t cover here. If you want to make sure a web page is preserved in its current state, you can use the save-page-now feature. There’s also a way to upload files.
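For readers curious about what skipping the visual interface can look like, here is a small sketch in Python. It relies on two things I believe to be stable but that you should verify against the archive's own documentation: Wayback Machine addresses of the form `https://web.archive.org/web/<timestamp>/<url>`, and the availability endpoint at `https://archive.org/wayback/available`, which returns JSON describing the closest snapshot. The function names are my own, and the sample response is a hand-written stand-in rather than real output.

```python
import json
from urllib.parse import urlencode

WAYBACK_PREFIX = "https://web.archive.org/web"
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def snapshot_url(page_url, timestamp):
    """Build a Wayback Machine address for page_url near timestamp.

    timestamp uses digits in YYYYMMDDhhmmss order and may be shortened,
    for example "2017" or "20170101".
    """
    return f"{WAYBACK_PREFIX}/{timestamp}/{page_url}"


def availability_query(page_url, timestamp=None):
    """Build the address that asks whether a snapshot of page_url exists."""
    params = {"url": page_url}
    if timestamp is not None:
        params["timestamp"] = timestamp
    return f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"


def closest_snapshot(response_text):
    """Pull the snapshot address out of an availability response, if any."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


if __name__ == "__main__":
    print(snapshot_url("http://example.com/page", "20170101"))
    print(availability_query("example.com", "20060101"))
    # Hand-written sample in the shape the endpoint is assumed to return.
    sample = json.dumps({
        "archived_snapshots": {
            "closest": {
                "available": True,
                "url": "http://web.archive.org/web/20060101064348/http://www.example.com:80/",
                "timestamp": "20060101064348",
                "status": "200",
            }
        }
    })
    print(closest_snapshot(sample))
```

The sketch only builds and parses text. Fetching the addresses it produces is left to whatever HTTP client you prefer.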
I estimate that more than 250 of my articles and blog postings have disappeared from various web sites. Some articles I could recreate from drafts I saved, whereas others turned up through searches in odd places such as mailing list archives. But I am sure they are all in the Internet Archive. Whenever I decide that one is worth saving, I retrieve it and put it on my personal web site.
You probably don’t like everything that’s on the internet, so you won’t like everything in the Internet Archive either. Remember that everything people post to the internet, no matter how objectionable, can have value to researchers and historians. The Internet Archive does have a copyright policy similar to policies on social media sites, to adhere to content take-down laws.
Brewster Kahle, Founder and Digital Librarian of the Internet Archive, when reviewing this article, commented:
The pandemic and disinformation campaigns have shown how dependent we are on information that is reliably available online and of high quality. These are the roles of a library and we are happy to serve however we can.
How can the Internet Archive preserve, on a regular basis, the current state of a medium that is vaster than anything that came before by many orders of magnitude?
The answer is simple: they use the same brute force techniques employed by search engines. The Internet Archive searches through the web page by page, trying to find everything it can. (Other content in the archive is discussed later in this article.) The archive has leased enormous storage capacity to keep everything it finds.
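The page-by-page search can be pictured with a toy example. The sketch below, in Python, is not the Internet Archive's crawler; it is a generic breadth-first traversal over a made-up link graph (`TOY_WEB`), with a hypothetical `get_links` callback standing in for the work of downloading a page and extracting its links.

```python
from collections import deque


def crawl(start_url, get_links):
    """Visit every page reachable from start_url exactly once.

    get_links(url) returns the addresses that the page at url links to.
    """
    seen = {start_url}          # pages already queued, so none is fetched twice
    queue = deque([start_url])  # pages waiting to be visited
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)     # a real crawler would store the page here
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# A made-up web of four pages, used only for illustration.
TOY_WEB = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example", "d.example"],
    "c.example": ["a.example"],
    "d.example": [],
}

if __name__ == "__main__":
    print(crawl("a.example", lambda url: TOY_WEB.get(url, [])))
```

Each page is handled once, so the work grows in step with the number of pages and links. There is no shortcut: to capture a page, the crawler has to visit it.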
Programmers love to find clever ways to avoid brute force techniques, which have a time complexity of O(n), meaning that you can scale up only by investing a corresponding amount of computing power. But sometimes brute force is the way to go.
For instance, graphical processing requires reading in lots of data about the graphic and applying algorithms to every pixel. This is why few applications could do graphical processing until cheap hardware was developed to address the particular needs of these applications: the now-ubiquitous graphics processing unit or GPU.
Another area where brute force triumphs is modern machine learning. Its basic idea, the neural network, goes back to 1949, practically the dawn of digital computing. Neural networks inspired artificial intelligence researchers for decades, but were declared a bust after much research and sweat. Then processors (including GPUs) grew fast enough to run the algorithms in a feasible amount of time, while virtual computing and the cloud provided essentially unlimited compute power. Now machine learning is being applied to problems in classification and categorization everywhere.
So let’s celebrate the tenacity of the Internet Archive. They attacked their problem head-on in 1996, and the solution has worked for them ever since.
A note on limitations is in order: web crawling leaves out much of what we routinely see on the web. The Internet Archive won’t cross paywalls, behind which much news and academic content lies. The crawler can’t submit a form, so it can’t pick up what visitors can see in dynamically generated web pages such as those put up by retail sites.
The history of lost culture is part of history itself. Some of the disasters we still mourn include these:
Add to these catastrophic events the loss of magnificent architecture from ancient times (often dismantled by local residents searching for cheap building materials), the extinction of entire languages (losing with each one not only a culture but a unique worldview), and the disappearance of poems and plays by Sappho, Sophocles, and others whose work shaped modern literature.
Well before the internet, many megabytes of data were ensconced in corporate data centers. Their owners must have realized that data could be left behind as companies moved to new computers, new databases, and new formats. Software vendors go out of business, leaving their customers trapped with content in opaque and proprietary formats. People now have precious memories on physical media for which hardly any devices still exist. And so our data slips out of our hands.
When Vint Cerf was designing the Transmission Control Protocol (TCP) in the 1970s, I wonder whether he imagined the vast amounts of content that would later be created to share over the internet. Several years ago, Cerf raised alarms about the loss of digital content in a mission he called Digital Vellum. So far as I know, Digital Vellum has not been implemented. But the Internet Archive serves some of this function. They realize that lots of content exists outside the web, on film and tape and pages of books, so they work with libraries and other institutions to bring much of this onto the web.
After you hear a few of their 15,000 recorded Grateful Dead concerts, try picking up Yggdrasil, one of the earliest distributions of GNU/Linux. (For the Softlanding Linux System, or SLS, I found only some metainformation, perhaps because SLS was distributed on floppy disks.) Check out 100 great books by Black women, or listen to a discussion of God’s names and gender at the Women’s Mosque of America. There is something for everybody at the Internet Archive.
And when you’ve grasped the sweep and value of the Internet Archive, consider giving them a donation—so that our culture does not go the way of the Mayans.