FEATURE
Rotting Research: A Challenge for Academic Scholarship
by Marshal A. Miller
In 2020, I was a few months into my doctoral dissertation, planning to research plagiarism in the digital publication system. That was when I first encountered the true extent of link rot in digital scholarship. Link rot is the phenomenon of cited resources becoming inaccessible over time because they have been relocated or made permanently unavailable.
While doing my literature review, I would find articles that cited resources relevant to my topic, but the links to those artifacts no longer worked. A university research database usually provides access to most journals I would want to read, but this was different. It was not that I was denied access; it was that access was no longer possible. The artifact had been lost to link rot, like many others before it.
Research in the field of link rot has revealed some concerning statistics. At Harvard Law School’s Berkman Klein Center for Internet and Society, Jonathan L. Zittrain, Kendra Albert, and Lawrence Lessig found that around 50% of links in Supreme Court opinions were broken (“Perma: Scoping and Addressing the Problem of Link and Reference Rot in Legal Citations,” Harvard Law Review Forum, Vol. 127, No. 4, March 2014; harvardlawreview.org/forum/vol-127/perma-scoping-and-addressing-the-problem-of-link-and-reference-rot-in-legal-citations). Another study, by Karol Król and Dariusz Zdonek, published in Global Knowledge, Memory and Communication (“Peculiarity of the Bit Rot and Link Rot Phenomena,” Vol. 69, No. 1/2, pp. 20–37, 2020; doi.org/10.1108/gkmc-06-2019-0067) found that 70% of links in Harvard Law Review articles were also broken. While many early studies focused on the field of law, studies of the broader body of academic publications followed.
Library and information scientists often point to persistent identifiers, such as DOIs, as a resolution to the problem. But how effective are DOIs? A study by Martin Klein and Lyudmila Balakireva at the Los Alamos National Laboratory looked at the availability of DOIs inside and outside an institution’s internal network. They found that 33% of DOIs were inaccessible from within the internal network, and an astonishing 51.7% of DOIs were inaccessible outside the institution’s network (“An Extended Analysis of the Persistence of Persistent Identifiers of the Scholarly Web,” International Journal on Digital Libraries, Vol. 23, October 2021; doi.org/10.1007/s00799-021-00315-w). In addition, a recent study from Crossref employee Martin Paul Eve found that 28% of articles with DOIs appear to be entirely unpreserved (“Digital Scholarly Journals Are Poorly Preserved: A Study of 7 Million Articles,” Journal of Librarianship and Scholarly Communication, Vol. 12, No. 1, January 2024; doi.org/10.31274/jlsc.16288).
THE EFFECT LINK ROT HAS ON SCHOLARLY PUBLISHING
This deep dive engulfed me. As a result, I switched my dissertation topic to study the effects of link rot on scholarly publishing. I created a dataset of 2,500 articles, 100 from each year between 2013 and 2022. Since past research often focused on law, I also wanted to see whether results differed between subject areas, so I divided the articles into five disciplines—Arts & Humanities; Business; Health & Medicine; STEM; and Social Sciences—with 250 publications in each.
My findings closely aligned with previous research. The overall percentage of broken links was 36%, while the overall percentage of broken DOIs was 37% (“The Putrefaction of Digital Scholarship: How Link Rot Impacts the Integrity of Scholarly Publishing,” January 2022; academia.edu/105022489/The_Putrefaction_of_Digital_Scholarship_How_Link_Rot_Impacts_the_Integrity_of_Scholarly_Publishing). This again showed the ineffectiveness of our current DOI system. While there were no statistically significant differences in broken links across academic domains, there were significant differences in broken DOIs across the disciplines. Though the average percentage of broken DOIs was 37%, it varied greatly from discipline to discipline: Business had an average broken-DOI percentage of 44%; Arts & Humanities, 42%; Social Sciences, 40%; Health & Medicine, 33%; and STEM, 23%. Further studies would be needed to determine the cause of this disparity, but it does exist. There was anecdotal evidence that the style guide favored by each field may have played a part.
To conduct my study, I created a Python program that scans PDFs for URIs, attempts to contact each one, and records the HTTP response status code from the attempt. Python is widely used to scrape and process data at scale. An HTTP response status code is the server's reply to a request, telling you how your attempt to reach a URI fared. The program can also determine which URIs are DOIs or other common persistent identifiers, such as arXiv links.
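As a rough illustration of the extraction step (my own simplification, not the linkrot project's actual code, with hypothetical helper names), URIs and DOIs can be pulled out of extracted PDF text with regular expressions; the DOI pattern follows Crossref's published recommendation for modern DOIs:

```python
import re

# Naive URL pattern; real extractors handle many more edge cases.
URL_RE = re.compile(r'https?://[^\s<>")\]]+')

# Crossref's recommended pattern for matching modern DOIs.
DOI_RE = re.compile(r'10\.\d{4,9}/[-._;()/:a-zA-Z0-9]+')

def extract_uris(text):
    """Return URL-like strings found in text, with trailing punctuation stripped."""
    return [u.rstrip('.,;') for u in URL_RE.findall(text)]

def is_doi(uri):
    """True if the URI contains a DOI, such as doi.org/10.1000/xyz."""
    return DOI_RE.search(uri) is not None

sample = "See https://doi.org/10.1108/gkmc-06-2019-0067 and https://example.com/page."
uris = extract_uris(sample)
dois = [u for u in uris if is_doi(u)]
```

A production tool would also need to handle URLs split across PDF line breaks, which is one of the messier parts of scanning real documents.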
HTTP response status codes are separated into these five classes or categories (“Hypertext Transfer Protocol (HTTP) Status Code Registry”; iana.org/assignments/http-status-codes/http-status-codes.xhtml):
- 1xx informational response – The request was received, continuing process.
- 2xx success – The request was successfully received, understood, and accepted.
- 3xx redirection – Further action needs to be taken in order to complete the request.
- 4xx client error – The request contains bad syntax or cannot be fulfilled.
- 5xx server error – The server failed to fulfill an apparently valid request.
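The link-checking step can be sketched as follows; this is a minimal illustration using Python's standard library, not the project's actual implementation, and the function names are my own:

```python
import urllib.error
import urllib.request

# The five HTTP status code classes, keyed by leading digit.
CLASSES = {
    1: "informational response",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}

def classify(status):
    """Map an HTTP status code (e.g., 404) to its class name."""
    return CLASSES.get(status // 100, "unknown")

def check_link(uri, timeout=10):
    """Request a URI and return the final status code, or None on network failure."""
    try:
        with urllib.request.urlopen(uri, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code   # 4xx/5xx responses still carry a status code
    except (urllib.error.URLError, OSError):
        return None     # DNS failure, timeout, connection refused, etc.
```

Note that `urlopen` follows redirects automatically, so a 3xx response is usually resolved to the final destination's code; a checker that wants to report redirects explicitly would have to disable that behavior.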
The Python program allowed me to process a large dataset in a reasonable amount of time, although processing the results and running the proper analyses still took time. Since its initial release, the program has been expanded to archive all active links using the Internet Archive. It is open source and available at github.com/rottingresearch/linkrot.
OPEN SOURCE COMMUNITY INTEREST
While my Python program was able to help me complete my doctoral dissertation, there was also a lot of interest in this project from the open source community. Unfortunately, not everyone has experience using Python, which made my program inaccessible to many. That is when the Rotting Research project was launched (rottingresearch.org).
Seeing that the community needed a tool to check the status of links cited in academic publications, I began to create a web-based application that could be used by the general public. I found a great deal of assistance in the open source community on GitHub. Tim Robbins, a Philadelphia-based product designer with more than 10 years of agency experience, took on the task of designing the branding elements for our projects. Others, including Aditi Rao, a Bangalore-based software developer, contributed to the code infrastructure. In total, we have more than 2,000 contributions from 10 different countries. Together, we were able to create something special.
Rotting Research is the web application I built as a result of these efforts. It is powered by the original Python program, which continues to be actively developed, and offers the same general functionality: you upload a PDF, and it processes the file, finding links and sending HTTP requests to them all. What makes Rotting Research unique is the generated report it returns.
The report indicates how many DOIs are present, how many links are broken, and the error code returned by the server for each. You can view every URI that was tested, along with its response. Rotting Research also extracts any metadata from the PDF for the report. Finally, the report can be downloaded as a PDF for your records.
Rotting Research isn’t just a tool. It also serves as a repository for information about the projects and current research in the field of link rot. There is a page dedicated to the latest research in this area of study. Another page is dedicated to the best practices for content creators to mitigate the risk of link rot in their work.
FUTURE ROTTING RESEARCH PROJECTS
While Rotting Research serves a valuable role as is, it is far from resolving the issues that surround link rot in academic publishing. As a result, we will continue to develop the tool and add functionality as efforts allow. Our road map includes many exciting features.
We hope to add the ability to archive all active links via the Internet Archive soon. This feature is available in the Python program, but submitting a large number of resources to the Internet Archive in bulk from a web server raises additional considerations. We have found that, as of right now, the Internet Archive is the most reliable place to archive resources for long-term storage.
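The Internet Archive's "Save Page Now" service accepts a simple GET of web.archive.org/save/ followed by the URL to capture. A sketch of that call appears below; it is my illustration rather than the project's code, the helper names are hypothetical, and it deliberately ignores the rate limiting that makes bulk submission from a web server tricky:

```python
import urllib.request

SAVE_ENDPOINT = "https://web.archive.org/save/"

def save_request_url(target):
    """Build the Save Page Now URL for a resource."""
    return SAVE_ENDPOINT + target

def archive(target, timeout=30):
    """Ask the Wayback Machine to capture a snapshot of target.

    Returns the URL the request resolved to (typically the snapshot),
    or None if the capture request failed.
    """
    req = urllib.request.Request(
        save_request_url(target),
        headers={"User-Agent": "linkrot-sketch"},  # identify the client politely
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.url
    except OSError:
        return None
```

In practice, a bulk archiver would need to throttle requests and retry failures, which is exactly the kind of consideration that makes the web-server version of this feature harder than the single-user one.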
Another exciting feature that is coming soon to Rotting Research is a database of results that have been run through the web application. While this database will store the results of the test, there will be no other information stored. This database will give our team, as well as other researchers, the ability to access a growing dataset of link rot results. We hope this will continue to fuel the impact that Rotting Research has on the academic community.
Down the road, we hope to address two other issues closely tied to link rot and data persistence in the academic record: retractions and content drift. It is no secret that articles are being retracted at a growing rate; there were more than 10,000 retractions from top journals in 2023 (Richard Van Noorden, “More Than 10,000 Research Papers Were Retracted in 2023—A New Record,” Nature, December 2023; doi.org/10.1038/d41586-023-03974-8). We hope to integrate with a retraction database to notify users if an article that is cited has been retracted.
Change detection will allow us to address content drift by determining whether a resource's content has changed since it was originally cited. Researchers at Los Alamos National Laboratory are attempting to address this with the Memento protocol (Martin Klein, Harihar Shankar, and Herbert Van de Sompel, “Robust Links in Scholarly Communication,” Proceedings of the 18th ACM/IEEE Joint Conference on Digital Libraries, Fort Worth, Texas, ACM, May 2018, pp. 357–358; doi.org/10.1145/3197026.3203885). Under this method, a timestamp is integrated into a web link so that the version of an artifact that was cited is the one retrieved when the link is followed.
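Under the Memento protocol (RFC 7089), a client asks a TimeGate for the version of a resource closest to a given moment by sending an `Accept-Datetime` header. The sketch below builds such a request against the Wayback Machine's TimeGate; the helper names are my own, and this is an illustration of the protocol rather than anyone's production code:

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# The Wayback Machine exposes a Memento TimeGate under this prefix.
TIMEGATE = "https://web.archive.org/web/"

def memento_headers(when):
    """Build the Accept-Datetime header for Memento datetime negotiation.

    HTTP dates use the RFC-style GMT format, which format_datetime
    produces when given a UTC datetime and usegmt=True.
    """
    return {"Accept-Datetime": format_datetime(when, usegmt=True)}

def timegate_url(resource):
    """URL of the Wayback Machine TimeGate for a resource."""
    return TIMEGATE + resource

# Ask for the version of a page closest to January 1, 2020.
headers = memento_headers(datetime(2020, 1, 1, tzinfo=timezone.utc))
```

A compliant TimeGate answers with a redirect to the memento nearest the requested datetime, which is how a robust link can recover the cited version of a page rather than whatever lives at the URL today.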
Personally, I hope to continue my research into link rot and other issues with digital artifacts. Right now, I am investigating the potential impact that the computer science and information technology fields could have on improving the way that we access internet resources. The computer science community, as well as the technology industry, seems to have an issue with preserving systems, archives, and history.
Another area of research I am focusing on is internet protocols and how they can be improved to establish a more fortified infrastructure for the internet. The HTTP response status code system appears to be showing signs that it needs revision. Backporting and backward compatibility with rapidly developing standards could also help other industries that are unable to adapt as quickly as technology iterates.
Rotting Research has been lucky enough to receive much support from the open source community. We are currently a Docker-Sponsored Open Source Software project. We also receive services as a member of the Red Hat Open Source Infrastructure Program. Rotting Research has also been able to join the Open Source Collective, a 501(c)(6) tax-exempt organization, which allows us to accept donations. Thanks to all of this, in addition to our generous contributors, we are excited to continue our mission of helping the academic community address and mitigate the evanescence of our academic record.
Rotting Research is dedicated to its mission and encourages everyone to contribute code, ideas, or research to the field. To find out more, visit rottingresearch.org/contribute.