The ultimate archives
- By William Matthews
- Aug 28, 2000
Four centuries from now, if a historian wants to read Al Gore's or George
W. Bush's inaugural address from January 2001, he or she should be able
to find it in a snap in the online electronic records archive now being
developed by the National Archives and Records Administration.
"The goal is to preserve digital information for at least 400 years,"
say researchers from the San Diego Supercomputer Center, who have provided
much of the scientific brainpower behind the project.
As the federal government shifts more of its work from paper to electronic
documents, the National Archives must radically rethink long-term preservation
of records. Computers and formats rapidly become obsolete, rendering documents
created just a few years ago unreadable. The problem is how to make documents
readable centuries from now, when computers beyond imagining today are likely
to be in use.
"It has been described as the archival equivalent of the first moon
shot," said John Carlin, archivist of the United States.
Carlin and other archives officials are confident they will have a pilot
version of the electronic records archive in operation by 2004 or 2005,
at an estimated cost of $130 million.
The Migration Problem
Until recently, the Archives' attempt to build such an electronic archive
seemed like a technically impossible dream: In theory, obsolescence can
be overcome by migrating electronic data to more modern systems. But at
the present pace of evolution, software used to manage archival collections
changes every three to five years. Combine that rapid rate of obsolescence
with the explosive growth in the number of electronic records, and mass
migration, in reality, is impractical.
"The time needed to migrate to new technology may exceed the lifetime
of the hardware and software systems that are being used," eight scientists
from the San Diego Supercomputer Center wrote in a technical paper describing
the new electronic archive.
The migration problem is further complicated by archival rules of order.
Official records must remain authentic. That means their contents can't
change, and in most instances, neither should their appearance. Paper records
always look the same, but electronic records can look very different — or
become incapable of being viewed at all — if the software needed to display
them properly no longer exists.
That's already a problem for documents created a decade or so ago in
formats that are no longer used. "Electronic records are only as good as
they are authentic," said Reynolds Cahoon, assistant archivist of the United
States and head of the effort to create an electronic archive. "If they
aren't authentic, everything is for naught."
Records exist in thousands of formats, and the challenge of keeping
up with new ones as they come out and old ones as they are discarded quickly
becomes insurmountable. So the archivists concluded that the best way to
solve it was to avoid dealing with formats altogether.
Finding the Right Language
Carlin dramatized the solution in March, when, while presenting Congress
with his 2001 budget request, he announced that two years of work by computer
scientists had led to "a major technological breakthrough" in storage technology
for electronic records.
Researchers, he said, had developed methods for storing electronic records
that promise to preserve them for hundreds of years and keep them readable
despite the obsolescence of the software and hardware used to create them.
Three years ago, scientists would have said it couldn't be done, Carlin said.
"But now they have demonstrated it to us and given us confidence that
in three to five years we will be able to deal with the massive volume of
federal records in various formats and from various generations of technology,"
he said.
Working with the San Diego Supercomputer Center, Georgia Tech Research
Institute and several other government agencies, the Archives has discovered
a method that promises to permit storing records "totally independent of
their software and hardware," Cahoon said.
A process called "persistent object preservation" appears capable of
stripping the display characteristics of any electronic document — whether
text, spreadsheet, photo or map — and storing it in a format that will allow
it to be called up by whatever software is being used in the future.
The format of choice is Extensible Markup Language, or XML, a standard
language for transmitting data from one computer to another. "Tags" within
XML documents tell the receiving computer how to read and format the data.
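The idea behind those tags can be shown in a minimal sketch. The element names below (record, title, date, body) are illustrative, not the Archives' actual schema; the point is that any program that understands XML can pick the data out of the tags without the software that originally created the record.

```python
import xml.etree.ElementTree as ET

# An illustrative archived record: each piece of data is labeled by a
# tag, so a receiving computer knows how to read and format it.
xml_record = """<record>
  <title>Inaugural Address</title>
  <date>2001-01-20</date>
  <body>My fellow citizens...</body>
</record>"""

root = ET.fromstring(xml_record)
print(root.find("title").text)  # Inaugural Address
print(root.find("date").text)   # 2001-01-20
```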
Here is how the electronic archive would work: An incoming electronic
document would be converted into an XML document. This involves identifying
the components of the document using XML document type definitions, replacing
proprietary or nonstandard formats with XML tags and preserving information
about the document's appearance.
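The conversion step described above might look like the following sketch. The proprietary record is stood in for by a Python dictionary, and the tag names and "font" attribute are invented for illustration; the point is that proprietary structure is replaced with generic XML tags while appearance information is kept rather than discarded.

```python
import xml.etree.ElementTree as ET

# Hypothetical record exported from a proprietary word processor.
legacy_record = {
    "title": "Budget Memo",
    "body": "Funds are allocated as follows...",
    "font": "Courier 12pt",  # a display characteristic to preserve
}

# Replace the proprietary structure with XML tags, storing the
# appearance information as an attribute instead of losing it.
doc = ET.Element("document")
ET.SubElement(doc, "title").text = legacy_record["title"]
body = ET.SubElement(doc, "body", attrib={"font": legacy_record["font"]})
body.text = legacy_record["body"]

xml_text = ET.tostring(doc, encoding="unicode")
print(xml_text)
```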
XML tags will also make it easier for search engines to locate documents
after they are stored. For example, e-mail messages in XML could be searched
by the names of senders and receivers, while omitting names mentioned in
the message's text. Document type definitions will also make it possible
to link related documents in groups or collections of records, a key requirement
for archivists.
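A toy example of that kind of tag-scoped search, with invented sender names and tags: because the query runs only against the sender tag, a name that merely appears in a message body does not produce a false match.

```python
import xml.etree.ElementTree as ET

# Three stored e-mail records; tag names are illustrative.
mailbox = """<mailbox>
  <email><sender>jcarlin</sender><text>Smith sent the draft.</text></email>
  <email><sender>smith</sender><text>Meeting at noon.</text></email>
  <email><sender>jcarlin</sender><text>Budget figures attached.</text></email>
</mailbox>"""

root = ET.fromstring(mailbox)
# Search only the <sender> tag, ignoring names in the message text.
hits = [e for e in root.findall("email")
        if e.find("sender").text == "jcarlin"]
print(len(hits))  # 2
```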
Once converted to XML and tagged, documents would be stored in a "container,"
which in turn is stored in a "repository." For now, the container is a 100G
tape cartridge, but that is likely to change as new storage technology is
developed. The physical repository is a robotic storage warehouse — or multiple
warehouses scattered nationwide and linked electronically.
Presiding over the repository is a computerized "storage resource broker,"
which functions as middleware between the repository and applications used
to store and retrieve records. The storage broker retrieves records and
uses document type definitions to reassemble collections of records, wherever
they are in the archive.
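The broker's role described above can be sketched as a lookup layer between applications and storage. Everything here — the catalog, the repository names, the record IDs — is invented for illustration; the real system brokers requests across robotic tape warehouses, not Python dictionaries.

```python
# Toy model: the broker knows which repository holds each record, so an
# application can ask for a collection by name without knowing where
# its pieces physically live.
repositories = {
    "east": {"r1": "inaugural address", "r3": "press release"},
    "west": {"r2": "budget memo"},
}
catalog = {"2001-records": ["r1", "r2", "r3"]}  # collection -> record IDs

def fetch_collection(name):
    """Reassemble a collection from whichever repositories hold its parts."""
    records = []
    for rec_id in catalog[name]:
        for repo in repositories.values():
            if rec_id in repo:
                records.append(repo[rec_id])
                break
    return records

print(fetch_collection("2001-records"))
```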
Still a Theory
So far, a test version of the electronic archive has passed a number
of hurdles, including one that involved taking in a million e-mail messages,
converting them to XML documents, tagging them, storing them and calling
them back up. The process took less than two days, Archives officials say.
"We can prototype the concept and make it work," Cahoon said. "But we
are nowhere near ready to assemble" an archive as large or complex as the
national electronic archive will have to be.
Even when the electronic archive is up and running, work on it won't
be finished, he noted. "You can't just build this once; it's never done.
Parts will become obsolete, so you have to constantly evolve. It's designed
so any piece of the system can be exchanged for new components" and still
be compatible with the XML-based application of the other components.
But the burden of constant upgrading is also a benefit. As computing
power increases, its price declines. The archivists are counting on that
trend to make it economically possible to keep up with the swift-rising
volume of records that must be stored, Cahoon said.
The Department of Veterans Affairs is one of the agencies that could
benefit early from the electronic archive project. On a daily basis, the
VA needs access to veterans' records to process claims and determine eligibility
for benefits. "We spend a good amount of time trying to track down records,"
said VA spokesman Steve Westerfeld. Determining eligibility often takes
months. "We're in favor of anything that allows easier access and enables
us to get hold of records quicker and serve veterans better."
"The challenge we face as records move more and more to electronic is
how access is going to be provided," Carlin said. The electronic archive
is "on the cutting edge of research and technology. Nothing comparable has
ever been done."