FEATURE
10% Wrong for 90% Done: A Practical Approach to Collection Deduping
by Rogan Hamby
Investment in open source paid off for libraries beyond South Carolina, creating a single, more coherent collection out of many.
The technology of the 21st century has brought no shortage of challenges for libraries, but it has also brought opportunities. Operating systems, databases, and application environments that can use cheap off-the-shelf hardware to scale and handle hundreds of millions of transactions were not developed for libraries, yet libraries have taken advantage of these technologies, and multiple integrated library systems (ILSs) now offer that kind of scalability. One of the best-known of these younger systems is the Evergreen open source ILS, which was adopted in 2009 by the South Carolina Library Evergreen Network Delivery System (SC LENDS), where I am director. Large resource-sharing library consortia, especially those powered by Evergreen, are growing quickly. With Evergreen, SC LENDS was able to begin its pilot program of 10 library systems and then scale to accommodate additional libraries as they elected to join. From the beginning, SC LENDS knew it would face a variety of challenges, but it didn't realize that one of the most serious would be its own bibliographic records and the difficulty of presenting a clean collection to search and place holds against.
Rolling It Out in Waves
The pilot program for SC LENDS was broken down into three waves. The first wave went live in May 2009 and consisted of two county library systems and the state library.
Merging the first wave's collections did not present an immediate issue. The state library collects in narrow areas such as information sciences, technology, and governance; Beaufort County is a medium-sized public library; and Union Carnegie's collection was much smaller.
Wave I Libraries | Bibliographic Records | Items
Union Carnegie Library | 45,000 | 45,000
Beaufort County Library | 178,000 | 273,000
State Library | 403,000 | 409,000
Total | 626,000 | 727,000
Wave II Libraries | Bibliographic Records | Items
Chesterfield County Library System | 68,000 | 72,000
Dorchester County Library | 123,000 | 143,000
Calhoun County Library | 32,000 | 24,000
York County Library | 167,000 | 275,000
Total | 390,000 | 514,000
Wave III Libraries | Bibliographic Records | Items
Anderson County Library System | 340,000 | 353,000
Florence County Library System | 320,000 | 332,000
Fairfield County Library | 74,000 | 82,000
Total | 734,000 | 767,000
By the time Wave II joined the consortium in October 2009, a problem had begun to emerge. The collections of six public libraries were now integrated, and the combined database held more than 1 million bibliographic records and 1.2 million item records. It had become clear that there was a problem with the deduplication of MARC bibliographic records. Although the consortium had run deduplication algorithms to match bibliographic records and to move items, along with linked circulation and hold records, to a consolidated record set, very few records seemed to have moved. An analysis of pre- and post-deduplication holdings showed that far less than one-tenth of 1% of the bibliographic records had been affected, even when side-by-side comparison showed very few differences between the records. At the time, the consortium agreed to wait until after Wave III joined in December 2009 to fully investigate the problem.
Too Much of a Good Thing
By January 2010, SC LENDS was looking at a problem with 1.75 million bibliographic records and a little more than 2 million item records. An average of one bibliographic record per item may be understandable in a small library with few duplicate copies, but in a consortium, especially one whose members should have overlapping collections, it is a problem. Already, there was a lot of discontent among patrons and staff. The increased collection size had exposed many existing errors that had migrated from the old ILSs into the consortium's shared collection. Searching for a Harry Potter book and getting two duplicate entries might be annoying, but getting nine is far more frustrating when the patron has to click through them one by one.
As the problem was explored, there was an initial premise that no bibliographic record could have more than 10 duplicates. This presumption was quickly dispelled. Some titles had as many as 17 bibliographic records, with some libraries' holdings split between multiple instances of the record. Casual copy cataloging standards at member libraries had created issues that were often easy to ignore in individual collections, but once 10 collections are combined into one database, the messes are much harder to ignore. The union database now represented the past cataloging standards and needs of 10 different library systems, five previous ILSs (and more, counting earlier migrations), multiple versions of those ILSs, the differing skill levels of staff, and the varying sources of copied records. In short, the database was a melting pot with all the chaos of New York City in the early 20th century and none of the charm.
With searches returning too many results, searchers were becoming frustrated, and the duplication was also creating a negative economic impact. As a resource-sharing consortium, SC LENDS members pay per pound to ship materials between member library systems. It was far too easy for a patron to place a hold on a bibliographic record that had no local holdings, even though a copy might exist on another record with a holding just a few feet away. This both slowed patrons' access to materials and created unnecessary expense. To aggravate the problem, patrons would place holds on every available bibliographic record, check out the first copy to become available, and simply decline to check out the rest, wasting staff time and library money. Additionally, although it was not a major concern for SC LENDS because of the robust hardware allocated, the reduced search and storage load would matter to any library wanting to boost performance.
The Deduping Imperative
It was clear that improving the database for patrons had to be the shared highest priority. The first step was to evaluate the existing deduplication, which had been applied during every wave, and to determine why it had such a low match rate. The existing algorithm had a very high standard for defining matches and guaranteed no incorrect merges. As a result, even small distinctions between MARC records caused the program to decline to merge them. The algorithm worked correctly but was set to an idealized standard for determining matches and was, thus, dependent upon technical services staff to resolve the remaining failed matches, which amounted to nearly all of them. The labor necessary to identify, evaluate, and merge records across a collection of several million was determined to be unviable. Even if hundreds of thousands of dollars were available to hire temporary staff, the training and coordination would require a large commitment from existing staff, and human error would still be significant.
Staff members were trained to manually merge records, and standards were set. But the consortium knew that an automated process was needed to reduce the number of merges to be done manually. The consortium needed a rapid solution that created a high number of accurate matches with a low number of inaccurate merges, and it had to accept the possibility of some bad merges in order to increase the number of successful ones. In the spirit of open source, SC LENDS decided to develop its own solution and agreed beforehand to release any code developed under the GNU General Public License and to apply a Creative Commons license to any documentation. As an employee of the State Library of South Carolina, I was chosen as the project lead, and the core development team members were Shasta Brewer of York County and Lynn Floyd of Anderson County. The project was pitched under the agreement that, of the record set eligible for merging, the consortium would accept as much as a 10% inaccurate merge rate in order to successfully merge the remaining 90%. These ranges were not the result of analysis but a crude goal meant to set an acceptable error range and establish realistic expectations before moving forward. There was also a conscious reminder that improved cataloging standards and manual cleanup would be critical after the project. The project came to be known as 10% Wrong to Get It 90% Done.
Making the Unviable Doable
In preparation for the deduplication project, the SC LENDS staff consulted with several other consortiums and with cataloging staff experienced with these challenges. The feedback was that it was not a viable project to pursue. However, since the alternative was to maintain the status quo, SC LENDS decided to continue. The next step was to run large reports on the bibliographic and item holdings. After spending a lot of time eyeballing patterns in spreadsheets and crunching frequencies in the data, the development team reached some general conclusions about trends in the bibliographic database. The consortium was not interested in determining whether the data was representative of other libraries; however, based on the trends, there is reason to believe that what was observed would generalize to other consortiums and their merged collections.
After removing bibliographic records with no attached copies, the development team isolated the MARC records that would be targeted for deduplication. Although they would be viable targets for other deduplication runs, non-ISBN records were excluded from consideration. These records were a small minority for most of the SC LENDS libraries because the public libraries focus on more contemporary monographs. The excluded records included those with SuDoc numbers, ISSNs, state documents, precat records, and others that were not fully formed, including monographs too old to have ISBNs. The targeted records amounted to roughly 60% of the total database, with the excluded share weighted in part by several large collections that were government depositories. The remaining records were searched for potential matches using fuzzy text matching of selected MARC subfields, taking into consideration fields with null and missing values. After generating automated high-frequency matches, the project team also looked for human-identifiable matches.
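To make the target selection concrete, here is a minimal Python sketch of that filtering step: records with no attached copies or no ISBN are set aside. The record structure, the copy_count field, and the function names are assumptions made for this illustration, not SC LENDS' actual schema or code.

def has_isbn(fields):
    """True if the record carries at least one 020 subfield a value."""
    return any(tag == "020" and subfields.get("a") for tag, subfields in fields)

def select_targets(bib_records):
    """Yield the records eligible for the ISBN-based deduplication pass."""
    for record in bib_records:
        if record["copy_count"] == 0:
            continue  # no attached copies, so nothing to consolidate
        if not has_isbn(record["fields"]):
            continue  # SuDoc, ISSN, precat, and pre-ISBN records are set aside
        yield record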
Estimates were made by developing models of the data set to project the results of different approaches to deduplication. After churning through spreadsheet after spreadsheet, the project team kept returning to one model that was too compelling to discard. The model chosen was a radical departure from the common wisdom that MARC records are too unique to be matched and selected by a software algorithm and that, when it is attempted at all, it should be done only by including a large number of MARC fields to ensure that exact, identical records are being merged. That philosophy was the root of the low match rate that had been encountered; given the human variability in the creation and import of records, the results were inevitable. The development group's experience was that, though catalogers are trained to understand and describe individual items in extraordinary detail, they are less prepared to generalize about the data in those same cataloging records. From this point on, catalogers became a crucial part of the evaluation and assessment phases of the project, but they were not part of the development phase. Simply put, there was a need to discard conventional assumptions about the uniqueness of MARC records.
How to Merge MARC Records
The project looked at two common approaches to evaluating a set of MARC records for merging. One method was to use TCNs (title control numbers) and merge records with matching TCNs. On a practical level, though, with MARC records coming from many sources, and with even OCLC hosting redundant MARC records for a single title, this created too few matches. The second method was to create fingerprints based on match points between the records and to use those match points to identify unique records and pair them up for merging. Unfortunately, such granular identification creates distinct fingerprints for records with only small differences, and it was essentially the approach that had already proven unsatisfactory for SC LENDS' goals; that low an impact had already been deemed unacceptable by the project mandate.
The assessment was that two data points offered the highest likelihood of correctly matching records while including the greatest number of them: the ISBN and the proper title proved the most reliable. Adding even a single additional match point could sharply reduce the number of matches, so records would be matched for merging on the title and ISBN fields alone. In effect, bibliographic records were grouped by matching a broad profile rather than a specific fingerprint. While this was deemed very aggressive, even dangerous, it was the method SC LENDS moved forward with.
To form bibliographic matches, values of both the title and the ISBN subfields (but not all subfields) had to match. Here is an example:
020 . ‡a0462356984
020 . ‡a0462356984 : ‡a0354856541
These records would match because a subfield agrees. Additionally, neither field could be null.
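Restated as code, the match rule groups two records when their titles agree and they share at least one ISBN, with empty values never matching. This is a minimal sketch that assumes the title and ISBN values have already been normalized (the cleanup itself is sketched in the next section); the field names are illustrative, not the project's actual code.

def records_match(rec_a, rec_b):
    """Broad-profile match: identical title plus any shared ISBN; nulls never match."""
    if not rec_a["title"] or not rec_b["title"]:
        return False  # a missing title never matches
    if not rec_a["isbns"] or not rec_b["isbns"]:
        return False  # a missing ISBN never matches
    return (rec_a["title"] == rec_b["title"]
            and bool(set(rec_a["isbns"]) & set(rec_b["isbns"])))

Applied to the 020 example above, the two records share 0462356984, so they would fall into the same merge pool as long as their titles also agree.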
More could have been done to increase matches with less strict parameters, but the project was treated as a trial run, and a conservative scope was agreed upon. From samples, the data model showed that the deduplication would remove 13%–19% of the bibliographic records, or roughly 25% of the subset of targeted records. The sampled data was inconclusive as to how many records would be excluded from the merging process and would have to be merged manually in the future.
It’s All in the Algorithm
To enact the deduplication, a copy of the production records was built in tables with the original IDs maintained. This allowed cleanup changes that helped the matching to be made to the copies of the records rather than to the originals. The title and ISBN fields were cleaned up by removing extra spaces, standardizing capitalization, converting symbols, removing non-Latin characters, removing general material designations from titles, normalizing 10-digit ISBNs to 13 digits, calculating missing check digits, and more.
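The sketch below illustrates two of those cleanup steps: squeezing and lowercasing titles (and dropping a bracketed general material designation) and recasting 10-digit ISBNs as 13-digit values with a recalculated check digit. It is a simplified approximation under those assumptions, not the project's actual code, and the function names are my own.

import re

def normalize_title(title):
    """Lowercase, drop a bracketed GMD such as [sound recording], squeeze spaces."""
    title = re.sub(r"\[.*?\]", "", title.lower())
    return re.sub(r"\s+", " ", title).strip()

def normalize_isbn(raw):
    """Reduce an 020 value to bare digits and convert ISBN-10 to ISBN-13."""
    digits = re.sub(r"[^0-9Xx]", "", raw.split(" ")[0])
    if len(digits) == 13:
        return digits
    if len(digits) == 10:
        body = "978" + digits[:9]  # drop the old check digit, add the 978 prefix
        total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(body))
        return body + str((10 - total % 10) % 10)  # recalculated ISBN-13 check digit
    return ""  # anything else is left for manual review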
The matching process groups records into pools to be merged, and a lead record then has to be selected from each pool. This is done by a weighting algorithm that generates a 19-digit number for each record; within a pool, the record with the highest value becomes the new dominant record. (A sketch of the scoring in code follows the list below.)
The weighting algorithm generates a 19-digit number:
- Digit 1: Is there a 003 field? (present = 1, not present = 0)
- Digits 2–3: Number of 02X fields in the record
- Digits 4–5: Number of 24X fields
- Digits 6–7: Number of characters in the 300 field
- Digits 8–9: Number of characters in the 100 field
- Digits 10–11: Number of 010 fields
- Digits 12–13: Number of 500–589 fields
- Digits 14–15: Number of 6XX fields
- Digits 16–17: Number of 440, 490, and 830 fields
- Digits 18–19: Number of 7XX fields
- If all else is equal, take the record with the most holdings.
- If the holdings are also equal, the earliest (or a random) record is the tiebreaker.
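Taken together, the list amounts to a fixed-width score, which the sketch below assembles. Clamping each count or length to two digits is an assumption made here to keep the value at exactly 19 digits; count_fields() and field_length() are stand-ins for whatever MARC access layer is actually in use, working against the same illustrative record structure as the earlier sketches.

def tags(lo, hi):
    """All MARC tags from lo to hi inclusive, as zero-padded strings."""
    return {str(t).zfill(3) for t in range(lo, hi + 1)}

def count_fields(rec, wanted):
    """Count the record's fields whose tag falls in the wanted set."""
    return sum(1 for tag, _ in rec["fields"] if tag in wanted)

def field_length(rec, wanted_tag):
    """Total character length of the subfield data in the first matching field."""
    for tag, subfields in rec["fields"]:
        if tag == wanted_tag:
            return sum(len(v) for v in subfields.values())
    return 0

def two(n):
    """Clamp a count or length into a two-digit string (capped at 99 by assumption)."""
    return "%02d" % min(n, 99)

def lead_record_score(rec):
    """Concatenate the weighting components into the 19-digit score."""
    score = "1" if count_fields(rec, {"003"}) else "0"       # digit 1: 003 present?
    score += two(count_fields(rec, tags(20, 29)))            # digits 2-3: 02X fields
    score += two(count_fields(rec, tags(240, 249)))          # digits 4-5: 24X fields
    score += two(field_length(rec, "300"))                   # digits 6-7: length of 300
    score += two(field_length(rec, "100"))                   # digits 8-9: length of 100
    score += two(count_fields(rec, {"010"}))                 # digits 10-11: 010 fields
    score += two(count_fields(rec, tags(500, 589)))          # digits 12-13: 500-589 notes
    score += two(count_fields(rec, tags(600, 699)))          # digits 14-15: 6XX fields
    score += two(count_fields(rec, {"440", "490", "830"}))   # digits 16-17: series fields
    score += two(count_fields(rec, tags(700, 799)))          # digits 18-19: 7XX fields
    return int(score)

Within a pool, the record with the highest score would be promoted to dominant record, with holdings count and record age breaking ties as the list notes.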
SC LENDS then looked at 1,000 bibliographic records that had been matched into pools of two or more records for merging, each pool with a new dominant record to which the others' attached copies would be moved. During this process, catalogers were recruited to take tables of 100 records apiece and evaluate the pooling and the choice of dominant record. About 90% of the time, the catalogers considered the algorithm's choice to be the superior or best choice; the remaining 10% of the time, it was considered good.
Implementing the Solution
At this point, SC LENDS committed to the project and contracted with Equinox Software to do the coding. Galen Charlton was the contact at Equinox and provided invaluable feedback that improved the process.
The algorithm was first run on a test server that mirrored the production system. This allowed technical services staff to review the results, and a 10,000-record sample was broken up among member libraries to review over a 2-week period. Problems ranging from the obvious-in-hindsight to the amazingly obscure were discovered and fixed. All of this went into updating the procedures and code. The next step was to go into production.
The estimate had been that there would be roughly 300,000 merges, or 25% of the ISBN-based collection. The actual result was a purge of 326,098 records during the deduplication process, roughly 27% of the ISBN-based collection. The day after the deduplication finished, an evaluation of the consequences at the OPACs began. Staff members did searches for titles they knew had had a large number of duplicates and watched patron reactions. The experience was a significantly cleaner catalog with an immediate, marked decrease in complaints. Searches that had previously returned a confusing hash of more than a dozen similar MARC records now returned only one to three results, requiring far less or no manual cleanup. The exact number of bad merges that may have been created is unknown, but after 2 years, fewer than 300 have been documented. Where SC LENDS was prepared for 10% wrong, the actual value has been closer to two-thirds of a single percent.
In 2011 SC LENDS added a fourth wave of libraries, used the new deduplication, and did not see the cluttered search results that characterized the original three waves. Additionally, other consortiums have borrowed the code and expanded on it, making the investment in open source pay off for libraries beyond South Carolina and creating a single, more coherent collection out of many.