Feature
Migrating Records from Proprietary
Software to RTF, HTML, and XML
by Elizabeth Reuben
We all know that digital preservation can be difficult; many avoid it, thinking
it's too hard. But what happens when electronic records begin to degrade? What
do you do when proprietary software is being phased out? What if the records
are important corporate history? This was the situation our library faced in
2002 when it became apparent that unless we did something soon, we would lose
an essential corporate record. So we secured project funds and migrated the
data to an open format to preserve our vital resource. I'd like to share the
story of this challenging process.
The Australian Commonwealth Dept. of Family and Community Services (FaCS)
library is centrally located in Canberra, the nation's capital. The library
serves a potential client group of 28,000, and all except 2,000 are spread
over several hundred locations across Australia (an area roughly equivalent
to the continental U.S.). The library's overriding aim is to always provide
the same level of service to all staff regardless of location. To this end,
we provide as many services as possible direct to people's desktops.
Our Preservation Charge: Saving Essential Social Security Law Guides
Since March 20, 2000, the FaCS department has been governed by the social
security law (SSL) that encompasses the Social Security (Administration) Act
1999, the Social Security Act 1991, and the Social Security (International
Agreements) Act 1999, as well as associated subordinate legislation. Altogether,
the SSL has more than 1,000 printed pages. It's vital to us because FaCS is
a policy-creation department, responsible for the administration of the SSL,
and for researching and promoting possible revisions and amendments to those
laws. One of the primary tools used to interpret the SSL, to explain eligibility
for payments, and to uphold appeals is the Guide to the Social Security Law,
more commonly called "the Guide." The Guide is a primary corporate resource
for making decisions and for interpreting the Acts. It is both a tool and a
record. The historical releases provide evidentiary value, both for the appeals
process and for understanding how policy has developed. So you can see that
it's essential for us to preserve this information.
The Social Security Acts used to be supported by three Manuals of Instructions,
which were regularly updated, printed loose-leaf services. They were phased out between
1991 and 1994 and replaced by the Guide, an electronic serial available via
the desktop or on floppy disk. These electronic releases of the Guide were
produced using a proprietary software package called Epublish, which dates
from the early 1990s and was written in Visual Basic. It consists of two tools:
a desktop publishing program and a browser. It became an extremely sophisticated
publishing product that enabled hypertext links within the Guide, and between
the Guide and its support documentation: acts, manuals, and definitions.
Back in 1994, when the Web was just developing, Epublish was creating a product
that for all intents and purposes looked like a text-heavy Web site does today.
But more recently, the owners of Epublish decided that it was outside their
core area of expertise and that they would no longer support the product as
of March 2003. (That date was later extended.)
The FaCS library had superseded releases of the Guide on floppy disks. Each
release took up 10 to 14 disks, and needed to be installed on a stand-alone
PC in the library to run. Library staff gradually became more worried about
the possibility of corrupt files and degradation due to the age and format
of the media that the Guide was stored on. This was an important corporate
historical record, and the likelihood of its becoming inaccessible or unusable
was increasing every day.
An Unexpected Migration Gets Underway
The demise of Epublish support was a catalyst. The library staff bid for
funds to increase accessibility and to enable preservation of the Guide; funding
was granted in August 2002. The aims of this project were simple: Extract the
data from the proprietary software format, then migrate it to formats that
would provide the best archiving and greater accessibility. We would do this
for historical releases from 1994 to the present. Once the project was funded,
the library manager and I became the project team. (Additional people were
contracted as necessary.) The funding parameters immediately provided our overall
time frame and defined what we could realistically achieve.
It took most of the first 5 months to plan and lay the groundwork. We had
incomplete holdings, most notably an 18-month gap, and also a few other releases
that had either been lost or had been corrupted. We began by looking for the
missing releases, researching archival best practices for digital media, ascertaining
the relevant records management and archival practices of our organization,
and planning for client input into the final product.
In late November we discovered that the owners of Epublish had Rich Text
Format (RTF) copies of a product that incorporated Guide releases, so we then
had a skeleton to cover our biggest gap. We also had duplicate copies of other
releases (which would save us time in the data extraction process), and even
some releases to cover the period 1991 to 1993!
Guidance from Others
We conducted investigations in December that were also fruitful, though sometimes
circuitous. We were able to meet with preservation staff at the National Archives
of Australia. The NAA employees were very helpful and provided us with several
important directions to explore. The three formats preferred by the NAA for
digital archiving were XML, HTML, and PDF; all three followed the NAA requirements
for a format with open specifications. NAA also suggested we do three things:
1. Use well-recognized storage procedures, such as choosing appropriate
metadata, creating redundant copies and systems, and storing archived material
on a live system, not offline.
2. Sentence our records, which is the process of determining a document's
lifespan, including its final disposal time.
3. Check our department's disposal coverage, which is an organizational-level
agreement between a department and the NAA regarding the classification and
sentencing of classes of records held by the department. It is, essentially,
a blueprint of disposal for records-management staff.
We also needed to look at the process of extracting the data and think carefully
about the "essence" of the document: what was vital to retain in the migration
process, and what could be altered or lost without affecting the "look and
feel" or the evidentiary value of the records.
Armed with this information, we set out to compare it with the department's
standards and procedures. We met with records management and information technology
staff to establish what, if any, departmental guidelines we should be following.
The current disposal coverage was being updated, and the department had decided
to comply with several standards, including e-permanence, the Administrative
Functions Disposal Authority published by the National Archives (http://naa.gov.au).
We were also directed to look at VERS, the Victorian Electronic Records Strategy
specification (http://www.prov.vic.gov.au/vers),
produced by the Public Record Office in Victoria and approved by the NAA.
We met with Guide Management Group (GMG) staff to compare notes on our projects.
The GMG also needed to replace Epublish, which it had used to create each new
release of the Guide. We wanted to share our findings about the archival aspects,
hoping this would feed into GMG's project to make our lives easier in the future:
An alignment of formats and procedures would simplify the future preservation
of Guide releases. Senior management in the department also wanted to maximize
learning outcomes by aligning these two projects as closely as possible.
We now felt ready to run a focus group to invite comment on what we had planned
so far. The meeting was only 2 hours long; however, it was highly productive
and provided us with feedback on the group's thoughts on what was most important
to retain, features that might add value, aspects of "look and feel" that we
needed to concentrate on, and what factors were unimportant to them.
Converting the SSL Data
Over the rest of December 2002 and into January 2003, we planned in detail
what we needed to do and how we might accomplish it. We needed to extract the
data from the Epublish software and convert it to RTF. This would form the
basis of our preservation files. The RTF files would be converted to HTML for
display via the department's intranet. XML would be either wrapped around the
HTML files, or used to create a separate set of files that, with the metadata,
would provide maximum archival content and protection. This would give us the
greatest flexibility to migrate files while still allowing for lower-level
browsers that don't support the presentation of data in XML and its associated
style sheets. (Later, we also added Word as a final format so that staff using
assistive technologies would have the easiest possible access to the historical
releases.) Each of these formats would be stored in a separate directory on
a RAID server (which uses two mirrored hard drives acting as one), and also
stored off-site.
After some experimentation, we found only one reliable way to extract data
from the Epublish format, which was to copy and paste the files. Accurate replication
of the files was essential because changes in the way the information was presented
could affect its interpretation. We estimated that we had 41 releases to convert
this way. Each release had between 35 and 54 chapters, some broken into multiple
parts, with at least one table of contents for each chapter, and each of
these was a separate file. All of the associated material for each release
would also need to be treated this way. Since the files could only be accessed
on a stand-alone PC, we employed two contractors to begin this task.
We wanted a script to convert the RTF files into HTML, so we employed a third
contractor who had worked extensively with the Guide to write a prototype.
We got the first version of the prototype in the middle of January. Around
the same time, the software company delivered the additional releases it had
found, and we realized that we had a larger job on our hands than we had thought
because we had underestimated the number of supplemental materials. Our salvation
came when we realized that we could get a script to automate the Epublish-to-RTF
conversion process.
We immediately arranged for the script to be written. Luckily, we were in
the position where we could continue having our in-house contractors copy and
paste releases while the script was being written, and could treat the possible
overlap as a risk-management approach. If the script didn't work, we'd be no
further behind, and if it did, our contractors would have an excellent foundation
for troubleshooting and the quality-control stage that was still to come.
Testing and Access
By the end of January we had successfully negotiated our way through the change
management process for the project overall, and for specific changes to hardware
and software. Then we were ready to test the prototype HTML format. It was
loaded onto the server, a checklist was created, and we organized a small test
population (based on the focus group participants with additional interested
staff). We sent out an e-mail to our testers with the checklist and a link
to the prototype. Within 3 days we discovered that nobody could access the
link! First we tried to fix the problem, and then with our test period growing
short, we just copied the prototype onto CD-ROM and sent it out to participants.
Despite this hiccup, the testing went well, and the results showed that there
were no major problems with the design. One issue that the testing highlighted
was that the prototype reflected the "look and feel" of later versions of the
Guide, rather than earlier ones. We wondered: Should we sacrifice the variations
in appearance for the sake of simplicity, or are the changes over time essential
to see? Opinion seemed to be split, and it may be that lack of time will prevent
us from doing anything about this issue anyway.
The problem with access to the Guide's link is a continuing issue. It occurred
because of a proxy server setting, and it's complicated by the different networks
that need to access our intranet to use this product. With luck, the problem
will be solved by mid-2003. We are currently putting protocols in place to
minimize the effect that changes to the network environment will have on access
to several library services.
The conversion script we ordered arrived in late February. It was fast and
it retained all the important formatting, including bolding, bulleted points,
underlining, tables, and columns. In comparison, the copy-and-paste versions
had several problems. Early versions of the Guide included coding that was
not easily translated to RTF. Underlining, in particular, copied across as
coding that replaced the first and last letter of the underlined segment, and
some segments of underlining would not copy at all, and had to be retyped.
Tabulated data was often columns separated by tabs, and the conversion translated
these tabs into single spaces, increasing the difficulty of reconstruction.
It appeared that it would actually be faster to go back to the original Epublish
files and run the script over them, and then convert columns to tables.
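The difference between the two extraction routes is easiest to see with tabulated data. As a minimal sketch in Python (the function name and the sample rows are illustrative assumptions, not the project's actual script), columns that are still separated by tabs can be rebuilt into an HTML table mechanically, while the same row with its tabs collapsed to single spaces can no longer be split reliably:

```python
from html import escape


def tabbed_lines_to_html_table(lines):
    """Turn tab-separated text lines into a simple HTML table."""
    # Skip blank lines; split every remaining line on its tab characters.
    rows = [line.rstrip("\n").split("\t") for line in lines if line.strip()]
    width = max((len(row) for row in rows), default=0)
    out = ["<table>"]
    for row in rows:
        # Pad short rows so every row has the same number of cells.
        cells = row + [""] * (width - len(row))
        tds = "".join(f"<td>{escape(cell.strip())}</td>" for cell in cells)
        out.append(f"  <tr>{tds}</tr>")
    out.append("</table>")
    return "\n".join(out)


# Made-up sample rows, for illustration only.
with_tabs = ["Payment type\tRate", "Age Pension\t100"]
print(tabbed_lines_to_html_table(with_tabs))

# Once tabs have become single spaces, a two-column row splits into three
# pieces, so the column boundary has to be restored by hand.
print("Age Pension 100".split(" "))
```

This is why rerunning a script over the original files can be quicker than repairing pasted text: the column boundaries are still present in the source.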
Metadata, Archival Formats
At this same time, I began to look at metadata schemes. I compiled a list
of the types of metadata that we would need to match to an appropriate scheme.
I was familiar with the Dublin Core; however, the preservation focus of the
project meant that DC didn't seem detailed enough. Our records management staff
had directed me to the VERS project, so I began to search the Internet and
make some phone calls. VERS is fully compatible with the NAA specifications,
but its emphasis is more in line with our project. I downloaded the VERS specifications
and began to map our list of required metadata fields.
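The mapping exercise can be pictured as a simple gap check. The sketch below, in Python, is an assumption-laden illustration rather than the list actually used: the required field names are hypothetical, and the scheme shown is the 15-element Dublin Core set only because its element names are short and well known. The same check would apply to a list of element names taken from any other scheme.

```python
# The 15 elements of the Dublin Core Metadata Element Set.
DUBLIN_CORE = frozenset({
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
})

# Hypothetical working list: each field we think we need, paired with the
# scheme element we would map it to (None means no match found yet).
CANDIDATE_MAPPING = {
    "release title": "title",
    "release date": "date",
    "file format": "format",
    "originating software": None,
    "disposal sentence": None,
}


def unmatched_fields(mapping, scheme):
    """List required fields with no valid element in the chosen scheme."""
    return sorted(
        field
        for field, element in mapping.items()
        if element is None or element not in scheme
    )


print(unmatched_fields(CANDIDATE_MAPPING, DUBLIN_CORE))
```

Any field that stays in the unmatched list is a sign that the scheme may not be detailed enough for the job, which is the kind of judgment described above.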
Concurrently with this, I also began to investigate XML in more depth. XML was
not widely used in the portfolio, and no one was using it the way we planned.
We got two recommendations for firms that were involved in the type of XML
work we were interested in. After choosing one, we held two meetings to decide
on a particular path to follow and agreed on outcomes. We decided to create
a separate set of XML files rather than using the XML shell we had originally
considered. This approach gave us the flexibility both to display in all browsers
and to migrate into future formats. By April 2003, we were well on track. The
contractors had essentially completed the editing work on the RTF files and
were ready to start on the quality control aspect. A number of corrupted files
had been sent off in an attempt to extract whatever data could be saved. The
HTML prototype was progressing well despite the difficulties created by the
two types of RTF files, and the XML and metadata work was ready to begin.
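To make the idea of a separate XML file concrete, here is a minimal Python sketch of one record that carries its metadata and its content together. The element names (record, metadata, title, release, content, para) and the function name are hypothetical, chosen for illustration; they are not taken from VERS or from the files the project produced.

```python
import xml.etree.ElementTree as ET


def build_record(title, release, paragraphs):
    """Build one stand-alone XML record: metadata first, then content."""
    record = ET.Element("record")

    # Metadata block; the element names are illustrative, not a real scheme.
    meta = ET.SubElement(record, "metadata")
    ET.SubElement(meta, "title").text = title
    ET.SubElement(meta, "release").text = release

    # Content block; one element per paragraph of text.
    content = ET.SubElement(record, "content")
    for text in paragraphs:
        ET.SubElement(content, "para").text = text

    # Characters such as "&" are escaped automatically on serialization.
    return ET.tostring(record, encoding="unicode")


print(build_record(
    "Guide to the Social Security Law",
    "example-release",
    ["Sample paragraph with rates & thresholds."],
))
```

Because a record like this does not depend on the HTML display files, the HTML can stay simple enough for older browsers while the XML remains the copy that is migrated forward.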
Where We Stand Now
At the time of writing (April), we still have a significant amount of work
to do in order to complete this project by the deadline of June 30; however,
we are on track and have found a solution to each problem. The next major task
will be the HTML-to-XML conversion. We will need to use an off-the-shelf product
to convert the HTML files to XHTML, and then another to convert the XHTML to
XML. We chose this process because it will retain the internal hypertext links
and formatting.
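Whichever products perform the conversion, the claim that links are retained can be spot-checked. The following Python sketch assumes the XHTML and XML files are well formed and that link targets are kept in an href attribute on both sides; the function name, the sample markup, and the link element in the XML sample are all hypothetical.

```python
import xml.etree.ElementTree as ET


def link_targets(markup):
    """Return every href value in a well-formed XHTML or XML string."""
    root = ET.fromstring(markup)
    return [el.get("href") for el in root.iter() if el.get("href") is not None]


# Made-up before and after samples for one paragraph of a chapter.
xhtml = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<p>See <a href="chapter2.html#part1">Part 1</a>.</p>'
    '</body></html>'
)
xml = (
    '<section><para>See <link href="chapter2.html#part1">Part 1</link>.'
    '</para></section>'
)

# The conversion kept the links if both lists match.
print(link_targets(xhtml) == link_targets(xml))
```

A check of this kind only confirms that the targets survived; whether each target still resolves to the right place in the converted release would need a separate test.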
While there have been some false starts and some backtracking, we have learned
valuable lessons as this project has progressed. We hope that at least some
of these lessons will become standard practice in our department. One of the
aspects that I am most proud of is the communications strategy we used. A collaborative
approach underpins much of the work by library staff, and was an important
element we brought to the project. We have included staff that would be affected
by the outcomes of the project, drawn on the expertise of others, and worked
in parallel with similar projects within the department. This has given us
the ability to both learn from and share with others. Given the current progress
of the project, I am sure that by June 30, this resource will be easily accessible
by staff and ready for the future.
Elizabeth Reuben is a research librarian at the Commonwealth
Dept. of Family and Community Services in Canberra, ACT, Australia. She holds
a Library and Information Science degree from the University of Canberra. She
has worked on several library/IT hybrid projects aimed at delivering desktop
services to clients in multiple locations. Her e-mail address is liz.reuben@facs.gov.au.