Formatting the future
To keep documents accessible, agencies face critical choices on software and hardware
Five years ago, U.S. Courts started putting in place an electronic docket-filing system. It would contain records to be kept—and accessed—for decades, if not indefinitely, and that forced project managers to make some tough decisions on electronic formats.
The courts decided on the Adobe Portable Document Format, for two reasons, according to John Brinkema, a senior research computer scientist at the Administrative Office of the U.S. Courts.
First, PDFs preserve the look and feel of the original paper document, an important quality because legal documents frequently make references to other pages within that document or to pages in other documents—even if those records are in electronic form only.
Second, and more important, Adobe Systems Inc. has published the specifications for reading a PDF document. Should the company ever go out of business, the records could be accessed using other software written to Adobe’s specifications.
These days, U.S. Courts has more than 2 billion electronic records in PDF format, spread across almost 200 locations around the country. The federal judiciary is ahead of many agencies in establishing an electronic-records management process.
“Rather than waiting around for the rest of the government to do things, we just did it,” Brinkema said.
When it comes to saving electronic information for the ages, the challenge of choosing the appropriate format is formidable.
A format specifies how to encode a set of data so that it can be accessed by people or other machines. Every software vendor uses formats to encapsulate the data that is generated in its programs.
But given the volatility and constant changes in the IT industry, the formats an agency chooses today might not be around in 10 or 100 years. Horror stories abound of suddenly vital documents locked away on some early, now unreadable, version of WordStar.
“It is hard to preserve digital information without a clear guide to how the information is encoded within a format,” said William LeFurgy, a project manager for the Library of Congress’ Digital Initiatives program.
Compounding this problem is the fact that hardware used to read these formats may also disappear. Who still has equipment to read 5¼-inch floppy disks or punch cards?
LeFurgy said the library’s Digital Initiatives program weighs a number of factors when deciding whether to hold on to a format for long-term use.
One is the proprietary nature of the format. “This is a major problem for many commercial software products—the specification is hidden as a business secret, which results in a format whose information content cannot be decoded without using the original proprietary software,” LeFurgy said. The library will not rule out proprietary formats altogether. But any used must have publicly disclosed specifications, such as Adobe’s.
The idea, according to Melonie Warfel, director of worldwide standards for Adobe, is to have a subset of the PDF specifications that is restricted only to completely open standards.
Aside from functionality, agencies should also consider a format’s popularity, LeFurgy said. There’s a good chance that documents written in Microsoft Word will remain accessible for quite some time, simply because the format is so widely used today that readers for it will stay in demand.
One piece of good news for agencies is that the National Archives and Records Administration has started specifying which formats other agencies should use to submit their records to NARA.
Yet another factor to consider is the complexity of a format’s encoding process, LeFurgy said. Compression schemes used to reduce the size of a record, or encryption schemes to secure a document, could be particularly problematic for future archivists, who might not have access to those algorithms.
Though it requires greater amounts of storage space, keeping records uncompressed is also a smart move in preserving the fidelity of images and audio files, said Charles Fenimore, Motion Image Quality project leader for the Digital Media Group of the National Institute of Standards and Technology.
Fenimore’s research team is finding that converting imagery or sound from one compressed format to another always results in additional loss of quality, which can be problematic as older data gets moved to newer formats, he said.
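The effect Fenimore describes can be illustrated with a toy model. The sketch below is not a real image or audio codec: it assumes a hypothetical "lossy format" that simply rounds each sample down to a multiple of a step size, and the names lossy_encode and total_error are invented for this illustration. It compares converting to a new format from the original with converting from an older lossy copy.

```python
def lossy_encode(samples, step):
    """Toy lossy codec (an assumption for illustration): snap each sample
    down to the nearest multiple of `step`, discarding the remainder."""
    return [s - s % step for s in samples]


def total_error(original, decoded):
    """Sum of absolute differences between the original and a decoded copy."""
    return sum(abs(o - d) for o, d in zip(original, decoded))


# Stand-in for pixel or audio sample values.
original = list(range(0, 256, 5))

# First archival format: a lossy copy made from the original.
format_a = lossy_encode(original, 10)

# A newer format, made directly from the original...
format_b_direct = lossy_encode(original, 7)

# ...versus the same newer format made from the old lossy copy,
# which is all an archive has if the original was not kept.
format_b_migrated = lossy_encode(format_a, 7)

err_a = total_error(original, format_a)
err_direct = total_error(original, format_b_direct)
err_migrated = total_error(original, format_b_migrated)

print("error in old format:          ", err_a)
print("error, new format from source:", err_direct)
print("error, new format from old:   ", err_migrated)
```

In this simplified model, the migrated copy carries the error of the old format plus whatever the new format discards, so it is worse than either single conversion. Real codecs behave in more complicated ways, but the sketch shows why keeping an uncompressed master matters.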
In addition to file formats, agencies must also worry about the physical media themselves: the tapes, disk drives and optical disks that contain records. These, too, are vulnerable to rapid obsolescence.
While tape is considered the electronic medium that lasts the longest, it is not immune to failure. Kenneth Thibodeau, director of NARA’s Electronic Records Archive program, has heard of rare cases where an entire library of aging tapes suddenly started failing en masse.
“The chemical processes of the manufacturing processes are such that a batch of tapes could self-destruct in a matter of months,” Thibodeau said.
NARA keeps its permanent records on two copies of tapes, each in a different location. To guard against failures, the agency each year tests a sample of tapes to assure they are still stable.
“Tape is a devil, but it is a devil we know. We know the vulnerabilities, and most of them can be managed,” Thibodeau said.
Optical questions
Agencies are increasingly using optical disks for archiving, though the jury is out on how long the media can last, given that optical disks have only been in use for the past 25 years or so. It’s an area that Fred Byers, an IT specialist at NIST, is investigating.
Byers said he thinks that disks could last for more than a century if kept in favorable environmental conditions. What concerns him, though, is inconsistent quality control in manufacturing, which leads to variances in how long disks can last. He has started working with the Optical Storage Technology Association to develop an industry archiving standard.
If manufacturers adhere to quality control specifications that the OSTA working group is developing, they will be able to put a seal of approval on their products, indicating that the disks should last for a set number of years.
In the end, though, agencies must assume that whatever media they use will be obsolete sooner or later. So they should develop a long-term strategy of periodically moving their files to whatever media are current, officials said. In other words, think of archiving not as putting records on storage media but as preserving records independent of whatever physical media are used. This is the strategy both NARA and the Library of Congress are taking.
“It is a given that we will be moving digital content to and from many kinds of media as part of our ongoing management and preservation function,” LeFurgy said.
NARA has had a storage migration plan in place since 1971, Thibodeau said. The agency picks a storage medium that it can trust to last a specific stretch of time, and develops a process for moving the records off that medium when that time period ends.
Thibodeau likens the archiving process to a funnel, one that takes in many formats and converts them all into a standard output format.
“The first thing we do when a record comes in is that we copy it from whatever media it is onto the standard media, and to a standard physical format,” Thibodeau said.
At the end of that lifecycle, the agency can easily automate the transfer of those records to the new media.
“It becomes a production process,” Thibodeau said.
The Archival Preservation System now handles those duties. But it will be replaced by NARA’s Electronic Records Archive, which will be better suited to handling submissions through the Internet.
An important aspect of migration is that agencies must maintain a record’s authenticity. Electronic records can be modified less conspicuously than paper records.
“What is important is the ability to preserve those records in an authentic manner, so that it is incontestable if they were to go to court,” said Tom Kelley, a customer engagement manager for Lockheed Martin Corp.
Lockheed Martin is one of two companies NARA chose—the other is Harris Corp. of Melbourne, Fla.—to build ERA prototypes. In any system, agencies must be able to establish a chain of custody leading back to the original to prove the record in question remains authentic, despite any number of transformations.
“We would keep traceability back to the original submittal, and any chain of transformations that would happen,” Kelley said.
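One common way to make such a chain of transformations tamper-evident is to link log entries with cryptographic hashes. The sketch below is only an illustration of that general technique, not a description of how the ERA prototypes work. The function names, the entry fields and the sample actions are all assumptions made for this example. It uses SHA-256 from Python's standard hashlib module.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of `data` as a hex string."""
    return hashlib.sha256(data).hexdigest()


def add_entry(chain, action, content: bytes):
    """Append a custody entry (hypothetical layout) that records the action,
    a hash of the record's content, and the hash of the previous entry."""
    entry = {
        "action": action,
        "content_hash": sha256_hex(content),
        "prev_entry_hash": chain[-1]["entry_hash"] if chain else "",
    }
    # Hash the entry itself so later edits to it can be detected.
    entry["entry_hash"] = sha256_hex(json.dumps(entry, sort_keys=True).encode("utf-8"))
    chain.append(entry)
    return chain


def verify_chain(chain) -> bool:
    """Check that every entry is intact and links to the one before it."""
    prev_hash = ""
    for entry in chain:
        body = {k: v for k, v in entry.items() if k != "entry_hash"}
        if entry["prev_entry_hash"] != prev_hash:
            return False
        if sha256_hex(json.dumps(body, sort_keys=True).encode("utf-8")) != entry["entry_hash"]:
            return False
        prev_hash = entry["entry_hash"]
    return True


# Illustrative record history: original submittal, then one format migration.
chain = []
add_entry(chain, "original submittal", b"record as received")
add_entry(chain, "converted to standard format", b"record after conversion")

print(verify_chain(chain))  # the untouched log verifies

# Simulate someone quietly altering the first entry after the fact.
tampered = [dict(e) for e in chain]
tampered[0]["content_hash"] = sha256_hex(b"altered record")
print(verify_chain(tampered))  # the altered log no longer verifies
```

Because each entry includes the hash of the one before it, changing any earlier step invalidates everything that follows. An archive could also compare the current file's hash with the last entry's content_hash to confirm the record on hand is the one the log describes.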