
FEATURE
Data Preservation in 2025
by Marydee Ojala
The blizzard of executive orders in the first few weeks of the second Trump presidency overwhelmed many, particularly librarians concerned about preserving government datasets. Confronted with sudden data whiteout conditions, we experienced snow blindness. Not that we didn’t know it was coming. We went through this during the first Trump presidency as well. It was the speed that unnerved us. We should have been better prepared. We should have realized that he’d used the 4 years he was out of office to plan, plot, and prepare. But as soon as government websites went dark, and even when they came back lacking critical data sources, data preservationists leaped into action.
How big is the problem of data destruction? Writing in 404 Media, Jason Koebler quantified the scale of the deletions: Data.gov held 307,854 datasets just before the new administration came into power and 305,564 in the week following Trump’s inauguration (404media.co/archivists-work-to-identify-and-save-the-thousands-of-datasets-disappearing-from-data-gov). Koebler found some datasets deleted from Data.gov that still exist on an agency’s website and others that appear to remain on Data.gov but return a 404 error message when you attempt to download their files.
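Counts like Koebler’s can be checked by anyone: Data.gov publishes its catalog through the standard CKAN API, whose package_search action reports a total match count. The sketch below is illustrative only (the endpoint path follows CKAN’s documented conventions; the live count changes daily):

```python
import json
import urllib.request

# Data.gov's public CKAN catalog endpoint (an assumption based on
# CKAN's standard API layout).
CKAN_BASE = "https://catalog.data.gov/api/3/action"

def count_url(base=CKAN_BASE):
    # package_search with rows=0 asks for zero dataset records,
    # returning only the total number of datasets in the catalog.
    return f"{base}/package_search?rows=0"

def dataset_count(base=CKAN_BASE):
    # Live query: fetch the JSON response and read the match count.
    with urllib.request.urlopen(count_url(base)) as resp:
        return json.load(resp)["result"]["count"]

if __name__ == "__main__":
    # Network call; the figure will differ from the 2025 numbers cited above.
    print(dataset_count())
```

Running the same query before and after a purge is exactly the kind of simple differencing that let archivists notice roughly 2,000 datasets had vanished.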
Not everyone was caught off guard. Recognizing that change happens whenever a new administration arrives in Washington, D.C., the Internet Archive (web.archive.org) captures and saves U.S. government websites at the end of each presidential administration through its End of Term (EOT) Web Archive project (eotarchive.org). This started in 2008, well before the current avalanche of website deletions. For historical and research purposes, the older data should be retained. This year’s crawl collected more than 500 terabytes of material, including more than 100 million webpages, most from top-level domains such as .gov and .mil, as well as government websites hosted on .org, .edu, and others. The EOT Web Archive resides on the Filecoin network as part of the Internet Archive’s Democracy’s Library project (archive.org/details/democracys-library).
Individuals don’t have to wait for the EOT Web Archive to preserve websites. Gary Price explained how individuals can add websites to the Internet Archive whenever they like (infotoday.com/OnlineSearcher/Articles/Features/Saving-the-Web-for-Posterity-151646.shtml). This is the first line of defense for data preservation.
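The mechanism behind that first line of defense is the Wayback Machine’s Save Page Now feature: requesting a URL of the form web.archive.org/save/<target> triggers a fresh capture of the target page. A minimal sketch (a simplified illustration, not the Archive’s official client; the User-Agent string is an invented courtesy header):

```python
import urllib.request

SAVE_ENDPOINT = "https://web.archive.org/save/"

def save_url(target):
    # Save Page Now archives whatever URL is appended after /save/.
    return SAVE_ENDPOINT + target

def archive_page(target):
    # Requesting the save URL asks the Archive to capture the page;
    # the response arrives once the crawl finishes.
    req = urllib.request.Request(
        save_url(target),
        headers={"User-Agent": "data-rescue-example"},  # hypothetical identifier
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

if __name__ == "__main__":
    # Network call: submit a page for archiving.
    archive_page("https://www.data.gov")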
DISAPPEARING HEALTH INFORMATION
Trump’s Executive Order 14168, declaring the government’s recognition of only male and female sexes and demanding agencies remove and stop issuing statements, policies, and other messages concerning “gender ideology,” sent agencies scurrying to comply. There was a rush toward wholesale deletions, often missing the nuances of language that are entirely familiar to information professionals. Don’t want anything on the sites about people transitioning from one gender to another? No problem, just do a search for “transition” and delete documents containing that word. Ummm, yes, actually, that is a problem. It eliminates mentions of government transition teams operating to ensure a smooth changeover from one administration to the next, organizations transitioning from an older to a newer technology, optometrists prescribing transition lenses, and individuals transitioning from one career path to another, just to name a few.
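The over-matching problem is easy to demonstrate: a bare substring search for “transition” flags every one of the innocent uses above. A toy illustration (the document titles are invented):

```python
# Hypothetical document titles; only the first concerns gender transition.
titles = [
    "Care guidelines for patients transitioning gender",
    "Presidential transition team briefing schedule",
    "Agency transition plan for migrating to newer technology",
    "Prescribing transition lenses: an optometry primer",
]

def naive_flag(title):
    # The blunt approach: mark for deletion anything containing the substring.
    return "transition" in title.lower()

flagged = [t for t in titles if naive_flag(t)]
print(len(flagged))  # all 4 titles match; 3 of them are false positives
```

Information professionals would reach for context-aware retrieval instead; a raw substring match cannot tell a transition team from a transition lens.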
The situation is not clear-cut. Among many others, the Centers for Disease Control and Prevention (CDC) removed a site containing statistics on HIV (cdc.gov/hiv/data-research/facts-stats/transgender-people.html), then brought it back under a Feb. 14 court order—but with an introductory notice that its information was “extremely inaccurate and disconnected from the immutable biological reality that there are two sexes, male and female.” Like other information sources, this one could well disappear again, but its reinstatement gave preservationists additional opportunities to save data.
Questions remain about what’s next. Will there be takedowns about the efficacy of vaccines? Will misinformation infiltrate the Health and Human Services data offerings?
CLIMATE DATA SWEPT AWAY
The Trump administration is also targeting climate change data. Clearly on its hit list are the EPA (Environmental Protection Agency), NOAA (National Oceanic and Atmospheric Administration), and CEQ (Council on Environmental Quality). But the Department of Agriculture also came under scrutiny since it had information on its website about climate change. The webpage for Climate Hubs still exists, but the Climate Solutions page says, “You’re not authorized to view this page.” Apparently, no one holds that authorization.
The Department of Energy, NOAA, the Department of the Interior, NASA, and the EPA have all had datasets obliterated because they touched on climate change. With extreme staff cuts predicted for agencies charged with gathering climate data, the likelihood of disappearing data increases.
The Department of Education removed data, including some 200 guidance documents, reports, and training materials related to DEI (diversity, equity, inclusion). It ceased most of the research activities of the Institute of Education Sciences and cancelled contracts of researchers.
DATASET PRESERVATION
The Harvard Law School Library Innovation Lab is at the forefront of dataset preservation (lil.law.harvard.edu/blog/2025/01/30/preserving-public-u-s-federal-data) and has been for years. It created a “data vault” to download data, authenticate it, and make copies available. It scoured Data.gov, GitHub repositories, and PubMed to grab portions of the datasets these sites track. The data vault differs from the Internet Archive in that it collects and preserves datasets, not webpages. The two have complementary, rather than overlapping, missions.
On Feb. 6, 2025, Harvard Law announced the release on Source Cooperative of a 16-terabyte collection that includes more than 311,000 datasets harvested during 2024 and 2025 (lil.law.harvard.edu/blog/2025/02/06/announcing-data-gov-archive). This constitutes a complete archive of the federal public datasets linked from Data.gov. It will be updated daily as new datasets are added to Data.gov.
It preserves the datasets, as well as detailed metadata, and has now released open source software and documentation to allow researchers to replicate its work and create similar repositories. This builds on its work with the Perma.cc web archiving tool, the Caselaw Access Project (https://case.law), and Century-Scale Storage (lil.law.harvard.edu/century-scale-storage).
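The “download, authenticate, copy” workflow rests on a standard technique: record a cryptographic hash of each file at capture time, then verify any later copy against it. A minimal sketch using SHA-256 (illustrative only; the Lab’s actual tooling is described in its blog posts, and the sample data is invented):

```python
import hashlib

def fingerprint(data: bytes) -> str:
    # A SHA-256 digest serves as a tamper-evident fingerprint of the file.
    return hashlib.sha256(data).hexdigest()

def verify_copy(data: bytes, recorded_digest: str) -> bool:
    # A mirrored copy is authentic only if its digest matches the one
    # recorded when the dataset was first captured.
    return fingerprint(data) == recorded_digest

# Hypothetical dataset contents.
original = b"year,emissions\n2024,5.2\n"
digest = fingerprint(original)

print(verify_copy(original, digest))       # True: untouched copy
print(verify_copy(original[:-1], digest))  # False: altered copy
```

Publishing the digests alongside the data is what lets independent researchers replicate a repository and prove their copies match the originals.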
An information coalition of librarians and library associations banded together to create the Data Rescue Project (datarescueproject.org/data-rescue-tracker). It compiles, often with commentary, library-sourced updates on the state of data takedowns, data preservation efforts, tools for data rescue, library guides, and current news. You can follow the Data Rescue Project on Bluesky (bsky.app/profile/datarescueproject.org) and send suggestions (the document is not editable) to iassistdata@gmail.com. The site is run by an anonymous university librarian with help from the wider data rescue community.
Journalists, as well as librarians, have a commitment to data so that they can fulfill their mission of holding public officials accountable and speaking truth to power. The Journalist’s Resource, a project of Harvard Kennedy School’s Shorenstein Center on Media, Politics and Public Policy, has likewise set up a list of nongovernment websites that have health data, noting that some of them use government data in their report creation (journalistsresource.org). In addition to sources of health data, it lists data archiving efforts, including the Data Rescue Project and the Harvard Innovation Lab.
Environmental data preservation is happening at the Public Environmental Data Project (screening-tools.com). EDGI, the Environmental Data and Governance Initiative (envirodatagov.org), is also active in preserving environmental data. On the climate front, the National Security Archive’s Climate Change Transparency Project publishes a selection of materials on climate change and environmental justice that have been deleted from agency webpages and spotlights the environmental and archivist organizations working to identify, scrape, and preserve this critical data (nsarchive.gwu.edu/briefing-book/climate-change-transparency-project-foia/2025-02-06/disappearing-data-trump).
Some preservation efforts target a specific topic of interest rather than an entire agency. The Substack newsletter Abortion, Every Day (jessica.substack.com/p/cdc-birth-control-guidelines-pdf) is keeping track of CDC guidelines that have been removed. theSkimm (theskimm.com) has saved ReproductiveRights.gov on its website.
POLICY PAPERS
It’s not just datasets and webpages. Documents, such as policy papers, research reports, project reports, white papers, statements concerning specific topics, regulations, commentaries on news items, and position papers that could be very valuable for public policy decision making, are also disappearing. Toby Green, cofounder of Coherent Digital, sees a role for his organization in preserving those and putting them into Policy Commons (policycommons.net). It fits well with Coherent Digital’s mission to find and preserve endangered materials (coherentdigital.net/about-us/our-vision).
On March 6, 2025, Coherent Digital announced the Policy Commons 2025 Open Collection, an initiative to identify, preserve, and make openly available government materials at risk of disappearing, and announced a new funding model it’s calling Rescue-to-Open to make the program sustainable for the long term (coherentdigital.net/open-2025). Through this initiative, Coherent Digital is striving to get as many pieces of information ingested as possible. Other fallout from the Trump administration’s cost-cutting initiatives affects nongovernment organizations (NGOs), particularly in the Global South. Research done in those countries is often funded by U.S. government agencies. When that funding evaporates, the NGO websites are likely to go dark.
A BLEAK FUTURE
Looking ahead, what’s worrying is not simply the destruction of datasets, but the future of bibliographic databases, long supported by federal funding, that have been around for decades. Will the National Library of Medicine be allowed to continue MEDLINE? Will there even be an Education Department to continue ERIC? And, if not, what department could assume the mantle? Will this massive removal of data by the U.S. government result in removal of valid information from either database, decimating the scholarly record? Will data destruction spread to other countries? Can governments require that articles be pulled from databases because they conflict with this administration’s opinions?
Databases duplicated on commercial aggregator servers such as Clarivate, EBSCO, ProQuest, and LexisNexis are, with luck, immune from government intervention. Sage Data, for example, with its 550-plus datasets, continues to maintain the integrity of the information. Nothing has disappeared. But that’s only a minuscule portion of government data.
Whether it’s preservation of websites, datasets, or informational documents, the recognition of its importance transcends the library community, as is evident from the many organizations showing up to the preservation party. But it is a partial remedy. Preservation keeps alive data already gathered, analyzed, summarized, and published. But what happens when there is no more data, when the datasets end with 2024 information at the latest because nothing is collected after that date? Longitudinal datasets stop abruptly. No new information appears. Scholars are thwarted. Policymakers are frustrated. Clinicians are stymied. Knowledge is lost. The general public is ill-served. The preservation of government data and information is admirable, particularly given their scope and volume. Preservation is not sufficient. Advocacy for the continuation of data gathering is essential.