Race to save government websites before they disappear forever

In a small, windowless room in San Francisco, rows of computers whirr with an intensity that borders on a scream. This may be an ordinary scene for a data center located less than an hour’s drive from Silicon Valley, but these machines are engaged in an extraordinary task.

With the Nov. 5 election just two weeks away, they are gathering vast amounts of government data before the White House welcomes a new or returning resident in January. The information will live on in the End of Term Web Archive, a giant repository of federal government websites preserved for the historical record when one administration’s term ends and another begins. Librarians, archivists and technologists across the country join forces every four years, donating time, effort and resources to what they call the “end-of-term crawl,” with the resulting datasets available to the public for free.

“It’s important to capture these websites because they can provide a snapshot of government messaging before and after the transition between terms,” reads a project description from the Internet Archive, a nonprofit organization that provides free access to digitized materials, including websites, software applications, music, and audiovisual and print materials.

The data collected in the end-of-term crawls lives in the organization’s Wayback Machine, which contains billions of web pages and other cultural artifacts that can be searched and, in this case, downloaded as bulk datasets for machine-assisted analysis. Researchers have used the datasets to examine, among other things, the history of climate change policy, the reuse of suspended US government accounts on the social media platform X, and how PDFs can be used to distribute government information more effectively.
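For readers who want to explore those captures programmatically, the Wayback Machine exposes a public CDX search API that lists archived snapshots of a given URL. The short Python sketch below queries it for an example .gov page; the endpoint is the documented CDX interface, but the target URL and date range are illustrative assumptions, not part of the End of Term project’s own tooling.

```python
# A minimal sketch: list Wayback Machine captures of a page via the public
# CDX search API (https://web.archive.org/cdx/search/cdx). The example URL
# and date range are hypothetical stand-ins, not End of Term tooling.
import json
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({
    "url": "epa.gov/climatechange",  # illustrative .gov page
    "from": "2016",                  # window start (year)
    "to": "2017",                    # window end (year)
    "output": "json",                # JSON rows; the first row is the header
    "limit": "10",                   # keep the sample small
})

with urllib.request.urlopen(f"https://web.archive.org/cdx/search/cdx?{params}") as resp:
    rows = json.load(resp)

if rows:  # an empty list means no captures matched the query
    header, captures = rows[0], rows[1:]
    for row in captures:
        capture = dict(zip(header, row))  # e.g. timestamp, original, statuscode
        print(capture["timestamp"], capture["original"])
```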

It’s here at the Internet Archive’s cavernous headquarters in San Francisco, and in data centers elsewhere around the Bay Area, that the organization’s computers have begun collecting pages from government domains such as .gov and .mil across the legislative, executive and judicial branches. The initiative aims to preserve the annals of history – and, project participants say, democracy itself.

“Citizens have a right to information about what their government is doing on their behalf,” says James Jacobs, a government information librarian at Stanford University, a partner in the End of Term Web Archive project. “That’s why libraries have long collected these materials to ensure they are organized, preserved and easily accessible for the long term.”

Today, with misinformation flooding the Internet at an alarming rate, it’s a job some project team members see as more critical than ever.

Why does web content disappear?

Over the past two decades, government information, like that of almost every sector, has moved overwhelmingly online. But there is no guarantee that it will remain there undisturbed. Web content is disappearing from view at an alarming rate, according to a Pew Research Center study on “digital decay” released in May. It found that 38% of websites that existed in 2013 are no longer accessible a decade later, and that 23% of news websites contain at least one broken link, as do 21% of government websites.

Government documents disappear for reasons other than so-called link rot.

With the dawn of a new administration, “the new management will often rearrange things,” says Mark Graham, director of the Wayback Machine at the Internet Archive. “Often they’re just moving things around and not necessarily moving things in a predictable way or with the right thought about redirection.” A researcher looking for original source material from the Environmental Protection Agency during the Obama administration, for example, may have no idea where to look.

Information can also be deliberately withheld for political reasons, Graham notes. After Donald Trump’s 2016 victory, some government watchers feared the new administration might delete or censor key environmental data, a concern that mobilized an influx of volunteers eager to participate in the end-of-term crawl. The Trump campaign did not respond to a request for comment.

The End of Term Web Archive aims to ease such concerns by making openly available documentation that can no longer be found on the live web. There is no clear federal mandate to retain the data, Jacobs notes, and agencies can follow or ignore the law as they see fit.

“By saving as many federal websites as possible every four years and making them easily accessible, we make sure that everyone can still have all that information online,” says Malea Walker, a reference librarian in the Serial and Government Publications Division of the Library of Congress, another End of Term Web Archive partner. Others include the University of North Texas Libraries, the US Government Publishing Office, the National Archives and Records Administration, and Common Crawl, a nonprofit organization that crawls the web and makes its archives and datasets available to the public for free.

A portal to the past

Exploring the archive feels like stepping through a portal to the past. Pages culled from the official White House website during George W. Bush’s two terms, for example, capture the moment Samuel Alito was sworn in as a U.S. Supreme Court justice and the messages Bush sent to U.S. troops on the ground in Iraq during the war there. Official sites from the Bush era also provide a fascinating look at the evolution of the Internet: they feature fewer and much smaller images than current web pages typically do, with narrower columns of smaller, more crowded text than today’s airier web design conventions dictate.

The National Archives and Records Administration, the government agency charged with preserving and documenting government and historical records, conducted the first large-scale capture of the federal web in 2004, at the end of Bush’s first term, retaining slightly less than 6.5 terabytes of data.

In 2008, when NARA decided not to continue archival work, other organizations stepped in. That year’s end-of-term crawl collected more than 15 terabytes — and the data capture has grown steadily since then. The 2020 crawl collected more than 260 terabytes of data, and the 2024 crawl, which began just a few weeks ago, has already archived about 150 terabytes (roughly the equivalent of 25,000 HD movies streaming on Netflix).

“And we’re just getting started,” Graham tells me. Because the web keeps growing, and because the project has expanded its emphasis on video, with its larger files, “I’m sure this one will be bigger than any of them,” he says. “It may even be bigger than all the previous archives put together.”

How you can join the effort

Heritrix, the Internet Archive’s open-source web crawler, will continue to index government websites over the next year. Once Heritrix crawls a page, it appears in the Wayback Machine within hours, although cyberattacks on the Internet Archive this month have caused intermittent outages and hiccups.
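For anyone curious whether a particular page has already made it into the archive, the Wayback Machine also offers a simple public availability API. Below is a minimal Python sketch using that documented endpoint; the example URL and timestamp are illustrative assumptions, not inputs from the End of Term crawl itself.

```python
# A minimal sketch: ask the Wayback Machine's public availability API
# (https://archive.org/wayback/available) for the capture of a page
# closest to a given date. The URL and timestamp are illustrative.
import json
import urllib.parse
import urllib.request

def closest_snapshot(url, timestamp=""):
    """Return the closest archived snapshot for `url`, or None if absent."""
    query = urllib.parse.urlencode({"url": url, "timestamp": timestamp})
    endpoint = f"https://archive.org/wayback/available?{query}"
    with urllib.request.urlopen(endpoint) as resp:
        payload = json.load(resp)
    # An empty "archived_snapshots" object means no capture exists.
    return payload.get("archived_snapshots", {}).get("closest")

# Example: the capture of whitehouse.gov nearest to Jan. 20, 2009.
snap = closest_snapshot("whitehouse.gov", "20090120")
if snap:
    print(snap["timestamp"], snap["url"])
else:
    print("No capture found")
```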

Such a massive capture of digital data could not happen without machines, but there is a very human component at play here. Through an online tool, project members nominate links they think should be preserved: URLs buried deep within government websites, social media feeds, PDF files. Until March 31, 2025, the general public can also offer suggestions and recommend how already-nominated data should be prioritized.

These nominations “allow us to empower individuals, whether they’re librarians, researchers, activists or journalists, to identify resources that are relevant to their work,” says Mark Phillips, associate dean for digital libraries at the University of North Texas and a senior member of the End of Term Web Archive team. The team’s outreach carries a central message: human involvement and discernment matter.

Once an archive is assembled, the Library of Congress holds a copy and the University of North Texas hosts another. This year’s archive may be the largest yet, but the effort to assemble it, at least so far, has not been driven by the urgency that followed Trump’s 2016 victory.

“That was the time when people realized why libraries see this as so important,” says Brewster Kahle, founder of the Internet Archive. “When the announcement was made that the new administration was going to take a very different stance on women’s health, climate issues, all of that, there was an uprising of people to help with the end-of-term crawl.”

One way to do this was at “data rescue events” organized by project partner the Environmental Data and Governance Initiative, a network of professionals who contextualize and analyze changes in environmental data and governance practices.

“They would gather people and comb through the government web and try to find environmental data that they knew needed to be collected and stored for the long term,” says Jacobs, “and so we collected a lot more data than in other years because it was targeted.”

While this year hasn’t seen the kind of data collection events that proliferated in 2016, the mission of the End of Term Web Archive remains clear.

“We’re building on civil discourse,” says Kahle. “We are building on an understanding of the land, of the rivers, of our history, of our processes. Let’s make sure it’s accessible.”
