NARA officials defend searchability of electronic archive
- By Alice Lipowicz
- Nov 01, 2011
Editor's note: This article was changed after publication to clarify information.
Top officials at the National Archives and Records Administration are defending the agency’s recently deployed $430 million Electronic Records Archive against criticism from a federal auditor that the archive is not fully searchable by text.
While NARA officials concede that only a small portion of the e-archive currently is text-searchable, that portion will continue to grow and “huge amounts” of material will be text-searchable in the next 10 years, an agency official said on Nov. 1.
The dispute began when Paul Brachfeld, NARA’s inspector general, claimed that NARA’s e-archive solution is fundamentally flawed because it was not designed to be fully text-searchable.
People accessing the archive generally may only search labels or tags, also known as metadata, related to the documents, rather than searching the text of the documents directly, Brachfeld said in an interview with Federal Computer Week on Oct. 26.
Lack of full text search “is one of the profound problems with the ERA at this point,” Brachfeld said. “Metadata alone does not tell the story of what is in the documents.”
However, NARA officials responding to the complaint have suggested that there might be some misunderstandings of how the ERA was conceived and what it is supposed to do.
“It appears that you and some of your staff have been given what appear to be conflicting answers,” David Ferriero, Archivist of the United States, wrote in a response letter to Brachfeld dated earlier this year. His response letter was provided to Federal Computer Week on Oct. 31.
“I am committed to attaining a fully content-searchable ERA,” Ferriero wrote in that letter. “My commitment has not changed.”
A NARA official explained that only a small part of the ERA currently is open to the public, while the remainder is categorized for possible release at later dates under laws pertaining to presidential, congressional, classified and census records, among other laws.
A substantial part of the public portion of the ERA, known as the Online Public Access system, currently consists of scanned historic documents, which are non-digital documents converted to digital images. Those are not text-searchable, although efforts are underway to develop technologies and methods to make them text-searchable, according to David Lake, a communications manager for NARA.
At the same time, a small part of the Online Public Access system currently consists of “born-electronic” documents such as emails and word processing documents that are text-searchable, utilizing NARA’s Vivisimo Search Engine application, Lake said. The percentage of born-electronic documents currently could be as small as 1 percent, he said.
Over the next 10 years, as agencies deliver more material to the e-archive, the born-electronic documents in the archive will increase in number, making a larger portion of the e-archive searchable by text, even while scanned historic documents also are coming in, Lake added.
For example, over the next five to 10 years, 300 million emails from the George W. Bush administration will become available in the archive, he said.
“I wish I had a crystal ball to predict everything that will be coming in the door,” Lake said. “It will be a huge amount of born-electronic materials, and a huge amount of scanned images as well. In the next 10 years we will see a substantial increase in materials.” The materials that originated electronically, such as emails, will be text-searchable in the archive, he said.
There are many challenges ahead in reviewing documents for classified material and privacy requirements before releasing them to the public. In addition, 350 terabytes of data from the 2010 Census will be coming in to the archive for storage but will not be accessible, by law, for 72 years.
NARA officials previously had acknowledged some of the limitations with the ERA. The agency finished its $430 million system development contract with Lockheed Martin Corp. in September and did not extend it for an optional year. It has hired IBM Corp. to maintain and operate the ERA on an annual contract valued at $243 million over 10 years if all options are exercised.
Alice Lipowicz is a staff writer covering government 2.0, homeland security and other IT policies for Federal Computer Week.