EPA builds a better search
- By David Perera
- Nov 14, 2004
A keyword search in the Environmental Protection Agency's Web pages used to yield a mishmash of results. Typing, say, "water quality" in the search engine might have returned links to high-level overviews of water quality issues or to documents that merely mentioned water quality.
"The relevancy ranking of our search engine couldn't really say, 'Here's a general thing about water quality that could get you started,' " said Richard Huffine, program manager for the EPA's National Library Network. So EPA officials modified the search engine.
Now the engine ranks results using data stored in metadata fields, giving priority, in descending order, to documents in which the query term appears in the subject, title, description and text.
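As a rough illustration of that kind of field-priority ranking, here is a minimal Python sketch. It is not the EPA's implementation: the article gives only the order of the fields, so the numeric weights, the `Document` class and the sample records below are all assumptions made for the example. The weights are powers of two so that a match in a higher-priority field always outranks any combination of matches in lower-priority fields.

```python
from dataclasses import dataclass

# Hypothetical weights. The article states only the priority order
# (subject, title, description, text), not any numeric values.
FIELD_WEIGHTS = {"subject": 8, "title": 4, "description": 2, "text": 1}


@dataclass
class Document:
    """Illustrative record holding the four metadata fields named in the article."""
    subject: str = ""
    title: str = ""
    description: str = ""
    text: str = ""


def score(doc: Document, query: str) -> int:
    """Sum the weights of every field that contains the query phrase."""
    q = query.strip().lower()
    if not q:
        return 0  # an empty query matches nothing
    return sum(
        weight
        for field, weight in FIELD_WEIGHTS.items()
        if q in getattr(doc, field).lower()
    )


def search(docs: list[Document], query: str) -> list[Document]:
    """Return matching documents, highest score first."""
    scored = [(score(d, query), d) for d in docs]
    hits = [(s, d) for s, d in scored if s > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in hits]


# Made-up sample data: an overview page and a memo that only mentions the phrase.
overview = Document(
    subject="Water quality",
    title="Water Quality Overview",
    description="An introduction to water quality issues",
    text="Basics of water quality monitoring.",
)
mention = Document(
    subject="Air permits",
    title="Permit renewal memo",
    text="This memo mentions water quality once.",
)

results = search([mention, overview], "water quality")
```

With these sample records, the overview page is listed ahead of the memo that merely mentions the phrase in its body text.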
Draft recommendations, written in part by Huffine and issued by members of the Categorization of Government Information Working Group, call for adoption of similar metadata standards governmentwide. The working group is a subcommittee of the Interagency Committee on Government Information, a creation of the E-Government Act of 2002.
The metadata recommendations are part of group members' larger effort to preserve government information in digital formats and make it permanently available. The problem is that, although the federal government is permanent, individual agencies may not be. Documents stored digitally on one server can be moved to another. Such moves result in the all-too-common message "404 error: file not found."
Although it is technically possible to continually update databases to reflect changes as documents are moved, it is impractical, according to working group members. Instead of relying on URLs to locate digital information, members recommended that federal officials develop search schemes based on uniform resource names (URNs).
Federal officials would assign a unique identifier to each piece of government information, including policy documents, Web sites, photos, maps and other digital materials. A searchable index would link users to a citation containing a minimum set of standardized metadata fields, such as subject, agency creator, title and publication date.
"If, for example, the identifier resolves to a book, then you get a citation for the book," said Eliot Christian, manager of data and information systems at the U.S. Geological Survey and chairman of the working group.
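The idea of resolving a permanent identifier to a citation, rather than pointing straight at a URL, can be sketched in a few lines of Python. This is an illustration only, not the working group's design: the `Citation` and `Resolver` names, the sample URN and the example.org addresses are hypothetical, and the citation fields are simply the four the article lists plus an assumed current location.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Citation:
    """Minimum metadata set named in the article, plus an assumed location field."""
    subject: str
    agency_creator: str
    title: str
    publication_date: str
    location: str  # current URL; may change while the URN stays the same


class Resolver:
    """Hypothetical in-memory index mapping URNs to citations."""

    def __init__(self) -> None:
        self._index: dict[str, Citation] = {}

    def register(self, urn: str, citation: Citation) -> None:
        self._index[urn] = citation

    def relocate(self, urn: str, new_location: str) -> None:
        # Only the location changes; the identifier that users cite does not.
        self._index[urn] = replace(self._index[urn], location=new_location)

    def resolve(self, urn: str) -> Optional[Citation]:
        return self._index.get(urn)


# Made-up identifier and record, for illustration only.
URN = "urn:example:epa:water-quality-overview"

resolver = Resolver()
resolver.register(
    URN,
    Citation(
        subject="Water quality",
        agency_creator="Environmental Protection Agency",
        title="Water Quality Overview",
        publication_date="2004-11-14",
        location="https://old-server.example.org/water/overview.html",
    ),
)

# The document moves to another server; links built on the URN keep working.
resolver.relocate(URN, "https://new-server.example.org/docs/water-overview.html")
citation = resolver.resolve(URN)
```

Because users hold the identifier and not the address, moving the file requires updating one index entry instead of every link that points to it.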
Combining URNs and a standardized metadata scheme would open the door to new possibilities for analysis, said James Erwin, primary author of the group's URN recommendations and director of information science and technology at the Defense Technical Information Center. "People can take that metadata and our identifier and put it into their database, their index, and they can use that for discovery," he said.
Information collected at one time by officials at one agency can be relevant in the future. Government surveys from the 1780s in the Northwest Territory, for example, are being used by Interior Department officials today to assess changes in vegetation patterns in Michigan and Ohio.
Deciding which types of information merit universal identifiers, however, is still a matter of debate. The group's members define government information as "any information product, regardless of form or format, that an agency discloses, publishes, disseminates or makes available to the public, as well as information produced for administrative or operational purposes, that is of public interest or public value."
David Perera is a special contributor to Defense Systems.