SDSC offers archiving tips


Agencies can use current technologies to build a system to preserve electronic records and to ensure long-term access to them, according to a new report by the San Diego Supercomputer Center (SDSC). But the report noted that better software is needed to operate such a system easily.

The report, commissioned by the National Archives and Records Administration and issued earlier this month, concluded that agencies can, with relatively little money, create an archive for their documents that can grow and migrate to new technology over time. The necessary ingredients to achieve this objective are a supercomputer or high-end workstations; commercial storage and database software; and a World Wide Web server.

These findings are "a significant breakthrough for us," said L. Reynolds Cahoon, chief information officer for NARA. He said the findings point the way for the agency to be able to maintain future stores of government e-mail, word processing documents, images and digital video for decades or centuries after the software used to create them is obsolete.

The report's author, Reagan Moore, associate director for data-intensive computing at SDSC, pointed out that other agencies could use the same technology to manage digital libraries of information, either for preservation or to maintain public access to their documents over time.

"A lot of what this paper was about was to show you can come up with ways to organize information," Moore said.

The study looked at what it would take to maintain collections of digital information for at least 400 years. According to the report, hardware and software for a system that can store 1 billion digital objects - equal in size to a collection of about 43 million images or 31 billion e-mail messages - can be built for about $2 million, not including the cost to run the system for its three- to five-year life cycle. Moore said agencies' individual document collections are not nearly that large, so a single system could be used to manage multiple sets of records.

What is missing, according to the study, is "generic" software that can be used to capture, index and retrieve information in multitudes of formats. "What you have to do right now is write software that can manage a particular collection," Moore said. "What we're interested in is discovering what kind of software could handle a variety of collections."

The report suggested that the answer lies in an emerging standard called Extensible Markup Language, a Web-friendly version of a widely used standard for tagging documents called Standard Generalized Markup Language. XML is designed to define the content and the structure of many kinds of documents so that files stored in an XML-based database could be reconstituted outside their native formats.
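As a rough illustration of that idea, the Python sketch below wraps the parts of an e-mail message in descriptive XML tags and then recovers them without the mail program that created the message. The tag names (email, from, to, subject, body) and the function names are hypothetical choices made for this example; they are not taken from the SDSC report.

```python
import xml.etree.ElementTree as ET


def email_to_xml(sender, recipient, subject, body):
    """Wrap the parts of an e-mail message in descriptive XML tags.

    The tag names are illustrative assumptions, not a schema from the report.
    """
    root = ET.Element("email")
    ET.SubElement(root, "from").text = sender
    ET.SubElement(root, "to").text = recipient
    ET.SubElement(root, "subject").text = subject
    ET.SubElement(root, "body").text = body
    # Return the record as a plain text string that any XML parser can read.
    return ET.tostring(root, encoding="unicode")


def xml_to_fields(xml_text):
    """Recover the tagged parts of a record, independent of the original mail software."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text for child in root}


# Archive a message as XML, then reconstitute its fields from the stored text.
archived = email_to_xml("a@example.gov", "b@example.gov", "Budget", "Draft attached.")
restored = xml_to_fields(archived)
```

Because the stored record is plain tagged text, a future system would need only an XML parser, rather than the original application, to read the sender, subject and body.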

Although the report said more research is needed to develop software that can perform such tasks, Moore said he expects tools to start to emerge within the next year.

SDSC used its own archival storage system, which is built from commercial hardware and runs on a mix of commercial and home-grown software, to create archives of nine types of documents, including e-mail, office automation files, maps, Web pages, images and structured databases.

Eliot Christian, a computer specialist with the U.S. Geological Survey, said the problem of how to preserve digital information can only be solved with an "overarching dictionary of basic concepts for finding stuff" that goes beyond what technology is used to capture, store and retrieve data.

"There are significant public-access issues for long-term preservation," he said. "And the most fundamental of those have to deal with semantics, to be able to find things based on the common understanding of the meanings of pieces of documents. What is a title? What is an author? What do we mean by a subject?"

The report did not address this issue in detail but included it in a list of areas for further research.
