Digital Preservation
howto tech uspol

There is a saying: “the Internet is forever” – but it isn’t automatically true. People work tirelessly behind the scenes to make it so.
A lot of content-intensive technology has “lookback” built in. Commercial sites do versioning, caching, backups, archives, and more. Sometimes they even do this in a way that lets ordinary users take advantage of it; one of the earliest examples in the web era was the ability to look at the “cached” version of a search engine result. Even if the site was down, the cached version might have given you what you were looking for.
However, as commercial tech has historically become ever more user-hostile, it’s been necessary for ordinary citizens to step in to keep the adage true. These folks are, roughly speaking, engaged in a decentralized, communal act of digital preservation.
What is Digital Preservation?
In the late nineties, independent technologists created the Internet Archive, which many people first encountered through its Wayback Machine, a date-indexed, searchable archive of cached web pages. Web sites don’t live forever, and even when they live a long time, commercial interests tend to turn them into walled gardens, taking crowd-sourced and bulk-aggregated user information such as graphs and hiding it behind authentication and paywalls, if not obscuring it altogether. The Internet Archive is now the go-to way to link to a version of a website captured in time, as one can rarely rely on search engines or app owners to do this for you.
Commercialization
Much of this third-party archiving runs up against copyright and other legal intellectual-property protections… that said, other sites like Sci-Hub and Z-Library are even more aggressive in their insistence that “information wants to be free”, indexing and storing hoards of academic research that is often otherwise available only through expensive subscriptions. Indeed, this is not just for casual users: professional academics use sites like these all the time; I first heard of Z-Lib in reference to a tenured professor in the humanities who regularly uses it for one-shot literature searches. Apparently, whatever access the university confers is still insufficient in the face of the commercialization and paywalling of the bulk of academic resources.
Government Censorship
Still more failure modes need addressing: consider the increased polarization and digitization of government over the last two to three decades. Our common notions of truth are at stake. What we build as states and nations is at stake. It has increasingly become standard practice for new administrations (especially reactionary ones) to use control of government websites not only to promote their agenda, but to erase history, undo previous work, and censor work they disapprove of, even when there is no clear non-partisan benefit to doing so. Some things just end up going down the memory hole.
What Can I Do?
In cases of switch-over or censorship, there is often little time to react or to create a shadow site. Sometimes the only thing to do is bulk-archive all the information in a predictable, structured format for later re-assembly. That’s the goal of groups like ArchiveTeam, a prominent player in the wider data-hoarder community (a community which, it’s worth noting, also includes people who individually collect and share pirated entertainment and educational media). ArchiveTeam makes it fairly easy for non-technical folks to participate in fetching and archiving publicly-accessible information via its ArchiveTeam Warrior project. This ready-to-run virtual machine can be installed by anyone on their home machine and run quietly in the background, participating in one of ArchiveTeam’s active archival projects. This is in the spirit of other distributed volunteer-computing projects like SETI@home and Folding@home, which let folks with ordinary PCs use their idle computing time to participate in large, parallel computing projects.
ArchiveTeam Warrior on Your PC
This is one of the simplest ways you can become a digital preservationist fighting the good fight. Follow the instructions to run an ArchiveTeam Warrior.
ArchiveTeam Warrior on Free-Tier Cloud
If you want to step it up a bit, either with more workers or by participating in more than one project, consider using free-tier cloud credits/allowances to run an ArchiveTeam Warrior. Rumor has it that it runs (barely) in OCI Free Tier. It also runs in Google Cloud (negligible compute costs, significant outbound network costs) and AWS (similar utility costs).
This is a step up because it requires you to convert the VirtualBox .ova (specifically, its bundled .vmdk disk image) into something that can run on the cloud provider. This is not too hard; here’s how you do it on OCI Free Tier.
Prerequisites
- Linux, or Windows with WSL (both of which should include the tar utility)
- The ArchiveTeam Warrior OVA file (click here to get it)
- An Oracle Cloud Free Tier account (click here to sign up)
Steps
- Extract the .ova file:

```
tar -xvf archiveteam-warrior-v3.2-20210306.ova
```
Note: there may be other ways to do this, using your File Explorer/Manager.
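For instance, an .ova file is just a tar archive, so if you’d rather script this step, Python’s standard tarfile module can unpack it too. A minimal sketch (the filename assumes the release named above):

```python
import tarfile

# An .ova is a tar archive; extract its contents, including the .vmdk,
# into the current directory.
with tarfile.open("archiveteam-warrior-v3.2-20210306.ova") as ova:
    ova.extractall(".")  # yields an .ovf descriptor plus the .vmdk disk image
```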
- Create a storage bucket in OCI
- Navigate to buckets
- Create a bucket
- Navigate to the bucket
- Upload the .vmdk file you extracted into the bucket (for a scripted alternative, see the sketch after this list)
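If you prefer scripting the upload over clicking through the console, here is a hedged sketch using the OCI Python SDK. It assumes a configured ~/.oci/config; the bucket and file names are placeholders, so substitute the actual .vmdk name you extracted:

```python
import oci

# Load credentials from ~/.oci/config and build an Object Storage client.
config = oci.config.from_file()
client = oci.object_storage.ObjectStorageClient(config)
namespace = client.get_namespace().data

# UploadManager handles multipart uploads, which matters for a
# multi-GiB disk image.
oci.object_storage.UploadManager(client).upload_file(
    namespace,
    "warrior-images",             # hypothetical bucket name
    "archiveteam-warrior.vmdk",   # object name in the bucket (placeholder)
    "archiveteam-warrior.vmdk",   # local path to the extracted .vmdk
)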
- Create a custom image in OCI
- Navigate to images
- Click “Import Image”
- Operating System “Generic Linux”
- Import from an Object Storage bucket
- Select the bucket in the “Bucket in” pulldown
- Select the .vmdk file from the “Object name” pulldown
- Select “VMDK” for image type
- Select “Emulated mode” in the “Launch mode” radio selector
- Click “Import Image” (a scripted equivalent follows this list)
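The same import can be scripted with the OCI Python SDK. The sketch below is an assumption-laden equivalent of the console steps above; the namespace, bucket, and object names are placeholders, and it again assumes a configured ~/.oci/config:

```python
import oci

# Create a custom image from the uploaded .vmdk, mirroring the console
# choices: VMDK image type, Emulated launch mode.
config = oci.config.from_file()
compute = oci.core.ComputeClient(config)

image = compute.create_image(
    oci.core.models.CreateImageDetails(
        compartment_id=config["tenancy"],  # or your target compartment OCID
        display_name="archiveteam-warrior",
        launch_mode="EMULATED",
        image_source_details=oci.core.models.ImageSourceViaObjectStorageTupleDetails(
            namespace_name="my-namespace",           # hypothetical
            bucket_name="warrior-images",            # hypothetical
            object_name="archiveteam-warrior.vmdk",  # placeholder
            source_image_type="VMDK",
        ),
    )
).data
print("Import started; image OCID:", image.id)
```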
- Create a free-tier instance in OCI
- Navigate to instances
- Click “Create Instance”
- Click the “Edit” link on the “Image and shape” section
- Click “Change Image”
- Click “My images”
- Scroll down and select your custom image from the list of Custom images
- Click “Select image”
- Ensure the default shape is marked “Always Free-eligible”, something like “VM.Standard.E2.1.Micro” in early 2025
- Click “Create” button
- Grant Network Access to your Warrior
- Navigate to vcns
- Click on your default vcn
- Click on your default subnet
- Click on your default security list
- Click on “Add Ingress Rules”
- Stateless: No
- Source Type: CIDR
- Source CIDR: your IPv4 address from whatismyipaddress or similar, with a trailing “/32”
- IP Protocol: TCP
- Source Port Range: All
- Destination Port Range: 8001
- Description: ArchiveTeam Warrior HTTP
- Click “Add Ingress Rules” (a scripted equivalent follows this list)
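For the script-inclined, here is a hedged sketch of the same ingress rule via the OCI Python SDK; the security list OCID and IP address are placeholders, and a configured ~/.oci/config is assumed:

```python
import oci

config = oci.config.from_file()
vcn_client = oci.core.VirtualNetworkClient(config)

SECURITY_LIST_ID = "ocid1.securitylist.oc1..example"  # hypothetical OCID
MY_IP = "198.51.100.7"                                # your IPv4 address

# update_security_list replaces the whole rule set, so fetch the current
# rules and append the new one rather than overwriting.
current = vcn_client.get_security_list(SECURITY_LIST_ID).data
new_rule = oci.core.models.IngressSecurityRule(
    protocol="6",  # TCP
    source=f"{MY_IP}/32",
    is_stateless=False,
    tcp_options=oci.core.models.TcpOptions(
        destination_port_range=oci.core.models.PortRange(min=8001, max=8001)
    ),
    description="ArchiveTeam Warrior HTTP",
)
vcn_client.update_security_list(
    SECURITY_LIST_ID,
    oci.core.models.UpdateSecurityListDetails(
        ingress_security_rules=current.ingress_security_rules + [new_rule]
    ),
)
```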
- Navigate to your Warrior
- Navigate to instances
- Wait for your new warrior instance to obtain a Public IP
- Connect to the public IP address at port 8001, e.g. “http://[public IP address]:8001/”. Note: if the public IP changes, you may need to repeat this step to find your Warrior again.
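If you’d rather not keep refreshing the browser while the instance boots, a small Python poller can watch for the Warrior to come up; the IP below is a placeholder for your instance’s address:

```python
import time
import urllib.request

PUBLIC_IP = "203.0.113.10"  # placeholder: your instance's public IP

# Poll the Warrior's web interface until it answers.
while True:
    try:
        with urllib.request.urlopen(f"http://{PUBLIC_IP}:8001/", timeout=5) as resp:
            print("Warrior is up, HTTP status", resp.status)
            break
    except OSError:
        print("Not up yet; retrying in 15s…")
        time.sleep(15)
```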
At this point you should see your Warrior’s web interface. You can select a project to participate in by clicking “Work on this project” next to your desired project, and you can click on “current project” to see logging and ensure that your Warrior is working.
Oracle lets you have two free-tier instances. It caps monthly bandwidth in the free tier, but with two Warriors running at two threads apiece, you might not even hit the limit; if you do, the Warrior will simply be unable to upload until the quota resets, so no intervention should be required.
Custom Scraping
Maybe ArchiveTeam isn’t targeting a site you’re worried about. If you have some basic web-development skills, it’s possible to archive it yourself.
Scrape Prototyping
The easiest way to jump right into web scraping is to use a browser plugin. Web Scraper (free) provides a developer-console interface and visual hints for element selection that make it very easy to prototype a scraper that can be automated later using headless tools.
Watch the intro video… if you can follow along with the tutorial, you will easily see how to make your own custom scraper for whatever site you want.
Once you have completed a scrape, you can download the scraped content as CSV. From there, you can use Python and a DOM parser like Beautiful Soup to normalize the captured content, perform search/replace manipulation, run secondary crawls of media links, and (if desired) re-assemble the pages as a new web app or as static content.
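As a concrete starting point, here is a minimal post-processing sketch. It assumes the exported CSV has an “html” column holding captured page fragments; your column names will depend on how you defined your Web Scraper sitemap:

```python
import csv
from bs4 import BeautifulSoup  # pip install beautifulsoup4

media_links = set()
pages = []
with open("scrape-export.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        soup = BeautifulSoup(row["html"], "html.parser")
        for script in soup.find_all("script"):
            script.decompose()           # normalization: drop inline scripts
        for img in soup.find_all("img", src=True):
            media_links.add(img["src"])  # queue media for a secondary crawl
        pages.append(str(soup))          # re-assembled, cleaned markup

print(f"{len(pages)} pages cleaned; {len(media_links)} media URLs to fetch")
```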
I have myself leveraged this kind of scrape-prototyping workflow to crawl a smaller site recently targeted by the laughably-named “DOGE” that was not covered by ArchiveTeam’s US Gov project, and in only three days I was able to build a mirror of the target site suitable for CDN redistribution. This was for a site only ~1TiB in size with a fairly uniform presentation structure; a more complex site might have required a team of developers to preserve effectively. Still, I think it’s an essential skill for any web developer to learn, since it has practical applications outside the domain of anti-censorship action (e.g. monitoring, testing, validation).
Anything Else?
Did I miss anything or get anything wrong? Drop me a line: mastodon | email