Migrating Records from Proprietary Software to RTF, HTML, and XML

Magazines > Computers in Libraries > June 2003

Vol. 23 No. 6 — June 2003

Feature
by Elizabeth Reuben

We all know that digital preservation can be difficult; many avoid it, thinking it's too hard. But what happens when electronic records begin to degrade? What do you do when proprietary software is being phased out? What if the records are important corporate history? This was the situation our library faced in 2002 when it became apparent that unless we did something soon, we would lose an essential corporate record. So we secured project funds and migrated the data to an open format to preserve our vital resource. I'd like to share the story of this challenging process.

The Australian Commonwealth Dept. of Family and Community Services (FaCS) library is centrally located in Canberra, the nation's capital. The library serves a potential client group of 28,000, and all except 2,000 are spread over several hundred locations across Australia (an area roughly equivalent to the continental U.S.). The library's overriding aim is to always provide the same level of service to all staff regardless of location. To this end, we provide as many services as possible direct to people's desktops.

Our Preservation Charge: Saving Essential Social Security Law Guides

Since March 20, 2000, the FaCS department has been governed by the social security law (SSL) that encompasses the Social Security (Administration) Act 1999, the Social Security Act 1991, and the Social Security (International Agreements) Act 1999, as well as associated subordinate legislation. Altogether, the SSL has more than 1,000 printed pages. It's vital to us because FaCS is a policy-creation department, responsible for the administration of the SSL, and for researching and promoting possible revisions and amendments to those laws. One of the primary tools used to interpret the SSL, to explain eligibility for payments, and to uphold appeals is the Guide to the Social Security Law, more commonly called "the Guide." The Guide is a primary corporate resource for making decisions and for interpreting the Acts. It is both a tool and a record. The historical releases provide evidentiary value, both for the appeals process and for understanding how policy has developed. So you can see that it's essential for us to preserve this information.

The Social Security Acts used to be supported by three Manuals of Instructions, regularly updated, printed loose-leaf services. They were phased out between 1991 and 1994 and replaced by the Guide, an electronic serial available via the desktop or on floppy disk. These electronic releases of the Guide were produced using a proprietary software package called Epublish, which dates from the early 1990s and was written in Visual Basic. It consists of two tools: a desktop publishing program and a browser. It became an extremely sophisticated publishing product that enabled hypertext links within the Guide, and between the Guide and its support documentation—acts, manuals, and definitions. Back in 1994, when the Web was just developing, Epublish was creating a product that for all intents and purposes looked like a text-heavy Web site does today. But more recently, the owners of Epublish decided that it was outside their core area of expertise and that they would no longer support the product as of March 2003. (That date was later extended.)

The FaCS library had superseded releases of the Guide on floppy disks. Each release took up 10 to 14 disks, and needed to be installed on a stand-alone PC in the library to run. Library staff gradually became more worried about the possibility of corrupt files and degradation due to the age and format of the media that the Guide was stored on. This was an important corporate historical record, and the likelihood of its becoming inaccessible or unusable was increasing every day.

An Unexpected Migration Gets Underway

The demise of Epublish support was a catalyst. The library staff bid for funds to increase accessibility and to enable preservation of the Guide; funding was granted in August 2002. The aims of this project were simple: Extract the data from the proprietary software format, then migrate it to formats that would provide the best archiving and greater accessibility. We would do this for historical releases from 1994 to the present. Once the project was funded, the library manager and I became the project team. (Additional people were contracted as necessary.) The funding parameters immediately provided our overall time frame and defined what we could realistically achieve.

It took most of the first 5 months to plan and lay the groundwork. We had incomplete holdings, most notably an 18-month gap, and also a few other releases that had either been lost or had been corrupted. We began by looking for the missing releases, researching archival best practices for digital media, ascertaining the relevant records management and archival practices of our organization, and planning for client input into the final product.

In late November we discovered that the owners of Epublish had Rich Text Format (RTF) copies of a product that incorporated Guide releases, so we then had a skeleton to cover our biggest gap. We also had duplicate copies of other releases (which would save us time in the data extraction process), and even some releases to cover the period 1991 to 1993!

Guidance from Others

We conducted investigations in December that were also fruitful, though sometimes circuitous. We were able to meet with preservation staff at the National Archives of Australia. The NAA employees were very helpful and provided us with several important directions to explore. The three formats preferred by the NAA for digital archiving were XML, HTML, and PDF; all three followed the NAA requirements for a format with open specifications. NAA also suggested we do three things:

1. Use well-recognized storage procedures, such as choosing appropriate metadata, creating redundant copies and systems, and storing archived material on a live system, not offline.

2. Sentence our records, which is the process of determining a document's lifespan, including its final disposal time.

3. Check our department's disposal coverage, which is an organizational-level agreement between a department and the NAA regarding the classification and sentencing of classes of records held by the department. It is, essentially, a blueprint of disposal for records-management staff.
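Redundant copies, the first suggestion above, are only useful if they can be shown to match the primary copy. The following is a minimal sketch of such a check, not a description of what the project did: the choice of Python and SHA-256, the directory arguments, and the function names are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(primary: Path, mirror: Path) -> list[str]:
    """List relative paths that are missing from, or differ in, the mirror."""
    problems = []
    for source in sorted(p for p in primary.rglob("*") if p.is_file()):
        relative = source.relative_to(primary)
        copy = mirror / relative
        # A file is a problem if the mirror lacks it or its contents differ.
        if not copy.is_file() or sha256_of(copy) != sha256_of(source):
            problems.append(relative.as_posix())
    return problems
```

An empty list from `find_mismatches` means every file in the primary directory has a byte-identical counterpart in the mirror. Running such a check on a schedule would catch the kind of silent corruption that prompted the project.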

We also needed to look at the process of extracting the data and think carefully about the "essence" of the document—what was vital to retain in the migration process, and what could be altered or lost without affecting the "look and feel" or the evidentiary value of the records.

Armed with this information, we set out to compare it with the department's standards and procedures. We met with records management and information technology staff to establish what, if any, departmental guidelines we should be following. The current disposal coverage was being updated, and the department had decided to comply with several standards, including e-permanence, the Administrative Functions Disposal Authority published by the National Archives (http://naa.gov.au). We were also directed to look at VERS, the Victorian Electronic Records Strategy specification (http://www.prov.vic.gov.au/vers), produced by the Public Record Office in Victoria and approved by the NAA.

We met with Guide Management Group (GMG) staff to compare notes on our projects. The GMG also needed to replace Epublish, which it had used to create each new release of the Guide. We wanted to share our findings about the archival aspects, hoping this would feed into GMG's project to make our lives easier in the future: An alignment of formats and procedures would simplify the future preservation of Guide releases. Senior management in the department also wanted to maximize learning outcomes by aligning these two projects as closely as possible.

We now felt ready to run a focus group to invite comment on what we had planned so far. The meeting was only 2 hours long; however, it was highly productive and provided us with feedback on the group's thoughts on what was most important to retain, features that might add value, aspects of "look and feel" that we needed to concentrate on, and what factors were unimportant to them.

Converting the SSL Data

Over the rest of December 2002 and into January 2003, we planned in detail what we needed to do and how we might accomplish it. We needed to extract the data from the Epublish software and convert it to RTF. This would form the basis of our preservation files. The RTF files would be converted to HTML for display via the department's intranet. XML would be either wrapped around the HTML files, or used to create a separate set of files that, with the metadata, would provide maximum archival content and protection. This would give us the greatest flexibility to migrate files while still allowing for lower-level browsers that don't support the presentation of data in XML and its associated style sheets. (Later, we also added Word as a final format so that staff using assistive technologies would have the easiest possible access to the historical releases.) Each of these formats would be stored in a separate directory on a RAID server (which uses two mirrored hard drives acting as one), and also stored off-site.
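The storage layout described above, with one directory per format, can be sketched as a simple path plan. This is an illustration only: the root path, the release and chapter identifiers, and the `.doc` extension for Word are assumptions, not details taken from the project.

```python
from pathlib import PurePosixPath

# One directory per format, as the plan describes. The extensions are assumed.
FORMAT_EXTENSIONS = {
    "rtf": ".rtf",    # preservation master
    "html": ".html",  # intranet display
    "xml": ".xml",    # archival copy with metadata
    "word": ".doc",   # added for assistive technologies
}


def archive_paths(root: str, release: str, chapter: str) -> dict:
    """Return the planned location of one chapter in every format."""
    return {
        fmt: PurePosixPath(root) / fmt / release / f"{chapter}{ext}"
        for fmt, ext in FORMAT_EXTENSIONS.items()
    }


if __name__ == "__main__":
    # Hypothetical identifiers, used only to show the shape of the result.
    for fmt, path in archive_paths("/archive", "release-01", "chapter01").items():
        print(fmt, path)
```

Keeping the same release and chapter names under every format directory means any derived file can be traced back to its RTF master by path alone.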

After some experimentation, we found only one reliable way to extract data from the Epublish format, which was to copy and paste the files. Accurate replication of the files was essential because changes in the way the information was presented could affect its interpretation. We estimated that we had 41 releases to convert this way. Each release had between 35 and 54 chapters, some broken into multiple parts, with at least one table of contents for each chapter—and each of these was a separate file. All of the associated material for each release would also need to be treated this way. Since the files could only be accessed on a stand-alone PC, we employed two contractors to begin this task.
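The figures above give a rough floor for the size of the copy-and-paste job. The short calculation below assumes, purely for illustration, that every chapter is a single file with exactly one table of contents. Multi-part chapters and associated material would push the real count higher.

```python
RELEASES = 41                         # estimated number of releases
MIN_CHAPTERS, MAX_CHAPTERS = 35, 54   # chapters per release
FILES_PER_CHAPTER = 2                 # assumed floor: one chapter file + one table of contents


def minimum_files(chapters_per_release: int) -> int:
    """Lower bound on the number of files if every release had this many chapters."""
    return RELEASES * chapters_per_release * FILES_PER_CHAPTER


if __name__ == "__main__":
    print(minimum_files(MIN_CHAPTERS))  # 2870
    print(minimum_files(MAX_CHAPTERS))  # 4428
```

Even under these conservative assumptions, the job involves somewhere between roughly 2,900 and 4,400 separate files.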

We wanted a script to convert the RTF files into HTML, so we employed a third contractor who had worked extensively with the Guide to write a prototype. We got the first version of the prototype in the middle of January. Around the same time, the software company delivered the additional releases it had found, and we realized that we had a larger job on our hands than we had thought because we had underestimated the number of supplemental materials. Our salvation came when we realized that we could get a script to automate the Epublish-to-RTF conversion process.

We immediately arranged for the script to be written. Luckily, we were in the position where we could continue having our in-house contractors copy and paste releases while the script was being written, and could treat the possible overlap as a risk-management approach. If the script didn't work, we'd be no further behind, and if it did, our contractors would have an excellent foundation for troubleshooting and the quality-control stage that was still to come.

Testing and Access

By the end of January we had successfully negotiated our way through the change management process for the project overall, and for specific changes to hardware and software. Then we were ready to test the prototype HTML format. It was loaded onto the server, a checklist was created, and we organized a small test population (based on the focus group participants with additional interested staff). We sent out an e-mail to our testers with the checklist and a link to the prototype. Within 3 days we discovered that nobody could access the link! First we tried to fix the problem, and then with our test period growing short, we just copied the prototype onto CD-ROM and sent it out to participants. Despite this hiccup, the testing went well, and the results showed that there were no major problems with the design. One issue that the testing highlighted was that the prototype reflected the "look and feel" of later versions of the Guide, rather than earlier ones. We wondered: Should we sacrifice the variations in appearance for the sake of simplicity, or are the changes over time essential to see? Opinion seemed to be split, and it may be that lack of time will prevent us from doing anything about this issue anyway.

The access problem to the Guide's link is a continuing issue. It occurred because of a proxy server setting, and it's complicated by the different networks that need to access our intranet to use this product. With luck, the problem will be solved by mid-2003. We are currently putting protocols in place to minimize the effect that changes to the network environment will have on access to several library services.

The conversion script we ordered arrived in late February. It was fast and it retained all the important formatting, including bolding, bulleted points, underlining, tables, and columns. In comparison, the copy-and-paste versions had several problems. Early versions of the Guide included coding that was not easily translated to RTF. Underlining, in particular, copied across as coding that replaced the first and last letter of the underlined segment, and some segments of underlining would not copy at all and had to be retyped. Tabulated data often consisted of columns separated by tabs, and the conversion translated these tabs into single spaces, which made the tables harder to reconstruct. It appeared that it would actually be faster to go back to the original Epublish files, run the script over them, and then convert columns to tables.
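Where the tab characters survive, converting columns to a table is mechanical. Once tabs have collapsed into single spaces, a multi-word cell can no longer be told apart from two adjacent cells. The sketch below shows the tab-preserving case only. It is an assumption-laden illustration in Python, not the contractor's script, and the function name and sample values are invented.

```python
from html import escape


def tabbed_lines_to_html_table(lines):
    """Convert tab-separated lines into an HTML table, one row per line."""
    rows = [line.rstrip("\n").split("\t") for line in lines if line.strip()]
    width = max((len(row) for row in rows), default=0)
    html_rows = []
    for row in rows:
        padded = row + [""] * (width - len(row))  # keep short rows rectangular
        cells = "".join(f"<td>{escape(cell.strip())}</td>" for cell in padded)
        html_rows.append(f"  <tr>{cells}</tr>")
    return "<table>\n" + "\n".join(html_rows) + "\n</table>"


if __name__ == "__main__":
    # Invented sample values; not taken from the Guide.
    sample = ["Payment type\tSingle\tCouple", "Basic rate\t100\t180"]
    print(tabbed_lines_to_html_table(sample))
```

With tabs intact, "Payment type" and "Basic rate" stay in one cell each. The same rows with single spaces would split into four tokens, and the column boundaries would have to be restored by hand.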

Metadata, Archival Formats

At this same time, I began to look at metadata schemes. I compiled a list of the types of metadata that we would need to match to an appropriate scheme. I was familiar with the Dublin Core; however, the preservation focus of the project meant that DC didn't seem detailed enough. Our records management staff had directed me to the VERS project, so I began to search the Internet and make some phone calls. VERS is fully compatible with the NAA specifications, but its emphasis is more in line with our project. I downloaded the VERS specifications and began to map our list of required metadata fields.
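The mapping exercise described above amounts to checking each required field against the elements a scheme offers and noting what is left over. The sketch below illustrates that idea with a few Dublin Core element names (`title`, `date`, `format`). The list of required fields is hypothetical, and the VERS element names are deliberately not reproduced here because they would need to be taken from the VERS specification itself.

```python
# Hypothetical list of metadata the project might need; not from the article.
REQUIRED_FIELDS = [
    "release title",
    "release date",
    "file format",
    "checksum",
    "disposal class",
]

# Partial mapping to Dublin Core elements, for illustration only.
DUBLIN_CORE_MAPPING = {
    "release title": "title",
    "release date": "date",
    "file format": "format",
}


def split_by_coverage(required, mapping):
    """Separate required fields into those a scheme covers and those it does not."""
    mapped = {field: mapping[field] for field in required if field in mapping}
    unmapped = [field for field in required if field not in mapping]
    return mapped, unmapped


if __name__ == "__main__":
    mapped, unmapped = split_by_coverage(REQUIRED_FIELDS, DUBLIN_CORE_MAPPING)
    print("covered:", mapped)
    print("left over:", unmapped)
```

A long "left over" list is one concrete way to see that a scheme is not detailed enough for a preservation project. Repeating the exercise against a second scheme shows whether it closes the gaps.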

At the same time, I also began to investigate XML in more depth. XML was not widely used in the portfolio, and no one was using it the way we planned. We got two recommendations for firms that were involved in the type of XML work we were interested in. After choosing one, we held two meetings to decide on a particular path to follow and agreed on outcomes. We decided to create a separate set of XML files rather than using the XML shell we had originally considered. This approach gave us the flexibility both to display in all browsers and to migrate into future formats.

By April 2003, we were well on track. The contractors had essentially completed the editing work on the RTF files and were ready to start on the quality control aspect. A number of corrupted files had been sent off in an attempt to extract whatever data could be saved. The HTML prototype was progressing well despite the difficulties created by the two types of RTF files, and the XML and metadata work was ready to begin.

Where We Stand Now

At the time of writing (April), we still have a significant amount of work to do in order to complete this project by the deadline of June 30; however, we are on track and have found a solution to each problem. The next major task will be the HTML-to-XML conversion. We will need to use an off-the-shelf product to convert the HTML files to XHTML, and then another to convert the XHTML to XML. We chose this process because it will retain the internal hypertext links and formatting.
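The project planned to use off-the-shelf products for these conversions. Purely as an illustration of why well-formed XHTML makes the second step tractable, the sketch below uses Python's standard `xml.etree.ElementTree` module to copy an XHTML body into a different XML vocabulary while keeping each link target. The output element names (`section`, `title`, `para`, `link`, `bold`) and the `target` attribute are hypothetical, not the project's schema.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Hypothetical mapping from XHTML element names to archival XML element names.
TAG_MAP = {"body": "section", "h1": "title", "p": "para", "a": "link", "b": "bold"}


def convert(element):
    """Copy one XHTML element and its children into the archival vocabulary."""
    name = element.tag.replace(XHTML_NS, "")
    target = ET.Element(TAG_MAP.get(name, name))
    if name == "a" and "href" in element.attrib:
        target.set("target", element.attrib["href"])  # keep the hyperlink target
    target.text = element.text
    target.tail = element.tail
    for child in element:
        target.append(convert(child))
    return target


def xhtml_body_to_xml(xhtml_text):
    """Convert the body of a well-formed XHTML document to archival XML text."""
    root = ET.fromstring(xhtml_text)
    body = root.find(f"{XHTML_NS}body")
    if body is None:
        body = root.find("body")  # tolerate documents without the namespace
    if body is None:
        raise ValueError("no <body> element found")
    converted = convert(body)
    converted.tail = None  # drop whitespace that followed </body>
    return ET.tostring(converted, encoding="unicode")
```

Because XHTML is already well-formed XML, an ordinary XML parser can read it, and links and inline formatting arrive as elements and attributes that can be carried across one by one. Plain HTML offers no such guarantee, which is one reason to pass through XHTML first.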

While there have been some false starts and some backtracking, we have learned valuable lessons as this project has progressed. We hope that at least some of these lessons will become standard practice in our department. One of the aspects that I am most proud of is the communications strategy we used. A collaborative approach underpins much of the work by library staff, and was an important element we brought to the project. We have included staff who would be affected by the outcomes of the project, drawn on the expertise of others, and worked in parallel with similar projects within the department. This has given us the ability to both learn from and share with others. Given the current progress of the project, I am sure that by June 30, this resource will be easily accessible to staff and ready for the future.

Elizabeth Reuben is a research librarian at the Commonwealth Dept. of Family and Community Services in Canberra, ACT, Australia. She holds a Library and Information Science degree from the University of Canberra. She has worked on several library/IT hybrid projects aimed at delivering desktop services to clients in multiple locations. Her e-mail address is liz.reuben@facs.gov.au.
