Power search

Could commercial search technology replace some information management standards?

The federal government is beginning to see what the private sector has already discovered: Search technology could be the answer to all its information management problems.

A recent request for information, issued jointly by the General Services Administration and the Office of Management and Budget, asks whether search technology is powerful enough to replace some government standards for information management.

"Does current search technology perform to a sufficiently high level to make an added investment in metadata tagging unnecessary in terms of cost and benefit?" the Sept. 15 RFI asks. Responses are due by Oct. 21.

The notice will likely lead to and shape procurements in the next decade, according to supplementary information on the Federal Business Opportunities Web site. Some people say existing technologies that can fulfill the request are ready and waiting for the government to notice them.

Suggested approaches must meet the wide-reaching aim of identifying the most cost-effective means to search for, locate, retrieve and share information. The notice lists seven scenarios to provide context.

For example, the government is looking for information on how to help a physician search multiple databases and Web sites for treatments for a defense contractor's unexplained illness. The doctor might not know which agencies provide information on unexplained or service-related illnesses. He or she would also need a way to search nongovernmental sources, and some of the information might not be easily accessible through traditional Internet search engines.

In addition to tackling information sharing, vendors' suggested approaches must address the problem of access.

The RFI appears at a time when popular commercial search engines -- such as Google, Yahoo and Microsoft's MSN Search -- are about to retire a 10-year-old government search standard intended as an electronic card catalog of public government information.

The National Institute of Standards and Technology wants to withdraw the Government Information Locator Service because the agency considers the search standard obsolete. A July 15 Federal Register notice states that recalling the standard, also known as the International Organization for Standardization's ISO 23950, seems justified because most agencies now use commercial search tools to help people locate government information.

Accordingly, the RFI seeks approaches that could avoid the use of government-mandated standards.

Alternatively, the notice asks vendors to explain why they believe government standards are not necessary or cost-effective.

Some government computer programmers say they are impressed by GSA's and OMB's foresight in issuing the notice.

Tamas Doszkocs, a computer scientist at the National Library of Medicine, has been working on the metasearch and clustering engine ToxSeek for almost a decade.

The RFI "is a very good way of taking a look at an extremely complex array of problems and solutions and trying to elicit feedback from major contractors who would be able to address this whole complex issue," he said. "It indicates a keen awareness of the complexity of the problems."

Doszkocs said only piecemeal solutions now exist in industry and government.

"There is nobody that could address and provide solutions to all of the concerns and problem sets," he said. "But there are certainly companies that have formidable technologies that could team up."

As OMB moves forward in soliciting help from industry, the agency is also seeking guidance from federal stakeholders, government officials say.

Last year, for instance, the Interagency Committee on Government Information submitted draft recommendations to OMB on adopting open, interoperable standards. Those standards would help agencies catalog information so that people can search any government system using terms that allow information to be identified electronically. Section 207 of the E-Government Act of 2002 required agencies to develop those recommendations.

The committee's report calls for the federal government to implement a searchable identifier standard that would provide long-term access to digital information. The paper states that the standard should be flexible enough to remain viable as technology changes and specific enough to provide authoritative access to government information.

OMB officials say they are considering the committee's ideas as they develop policies to foster better public access to government information. They will issue the policies to agencies by Dec. 17.

Karen Evans, OMB's administrator of e-government and information technology, said the RFI language asking whether search technology should replace government standards does not conflict with the E-Government Act.

"The question the RFI asks in no way suggests avoiding the use of standards when such are necessary," she said. "Moreover, it most certainly does not suggest noninteroperable searching. Rather, it seeks to identify where metadata tagging or other formal -- and costly -- advanced information preparation mechanisms achieve the goal of making information more easily accessible to interested parties."

In the three years since the E-Government Act was enacted, she added, improvements in commercial search technologies have altered the Bush administration's attitude toward business information retrieval solutions. The RFI aims to ensure that members of the public benefit from commercial advances when they seek government information, Evans said.

The Government Printing Office is also involved in the information retrieval and sharing initiative.

GPO, the agency responsible for distributing government publications, has assigned several employees to OMB during the past year. One of those employees will soon return to GPO for work on a new digital distribution system capable of verifying and tracking all versions of official government documents.

GPO officials say the system's design will ensure authenticity of government information and permanent public access to that information.

"The RFI will help our efforts since we are working closely with the community that generated the RFI and [that] is developing enhanced search tools," GPO spokeswoman Veronica Meter said.

By July 2007, GPO officials expect to have an operational system that will support Web browsing, downloading and printing. It will also have search tools and redundant data warehouses.

Vendors say intelligence agencies have already succeeded with efforts similar to what OMB and GSA are seeking.

"The tools and products are already available to support this initiative," said Paul Norcini, federal channels manager at search tools supplier Verity.

Verity's solutions can index data formats from disparate repositories into searchable collections. Other tools then categorize the data based on concepts, metadata and highlighted information.

Indexing facilitates information sharing, while highlighting helps with retrieval, Norcini said.

Agency workers can also simultaneously search government and nongovernment systems with existing technology.

Norcini said OMB and GSA need to consider how their programs will detect patterns and connections among pieces of information.

"There is more to information sharing than just search," he said.

One global consortium is working with foreign governments on a massive information retrieval and sharing project that could influence the U.S. government's path.

Earlier this month, groups from industry, government, academia and nonprofit organizations announced plans to provide online versions of books, academic papers, video and audio to the world. The Internet Archive, a nonprofit entity that offers access to historical collections in digital format, will host the Open Content Alliance (OCA). The National Archives of the United Kingdom has already contributed to the effort.

The OCA "may significantly help the [U.S.] government in doing their public access mission," Internet Archive co-founder Brewster Kahle said. "The OCA is an almost unprecedented collaboration between nonprofits, libraries, government institutions and commercial search engines to bring to life the treasures that are currently locked up in independent collections."

Kahle said he has been talking to GPO officials for the past year about joining the alliance. The alliance will unveil a technology Oct. 25 that performs nondestructive scans of book pages at high resolutions for 10 cents a page. That cost savings could appeal to GPO and its Federal Depository Library Program, he said.

Anyone will be able to search and download works from the alliance's repository for free. Yahoo will provide the search engine, but all content will be available for other major search engines to index.

"The combination of large digital archives and the Internet could allow us to take all the U.S. government information and make it available through technologies such as commercial search engines," Kahle said. "We hope that the government considers the OCA as a way of achieving its aims."

Setting out the scenario

Can search technology replace government information standards? A request for information, issued by the General Services Administration and the Office of Management and Budget, seeks to address that question. Any approach that the government takes will need to identify the most cost-effective means for locating, retrieving and sharing information.

The RFI lays out a number of scenarios to provide context for responses.

Scenario 1: Researching unexplained illnesses among defense contractors.

A physician needs to perform a fairly exhaustive search for government information across the range of federal agencies, some state and local governments, and various commercial and academic resources. The information exists in a wide variety of formats, including handwritten forms that have been digitized. Some of those information resources, sometimes called Deep Web resources, are not easily accessible through typical Internet search engines. The physician needs to aggregate, analyze and manipulate the information relevant to the topic and also correlate data geospatially. The physician will publish a scholarly paper on the completed findings, including citations to e-government records. Those cited resources are expected to be obtainable in the future. The physician also wants to receive automatic notification whenever new information concerning unexplained military service-related illnesses is published.

Scenario 2: Searching for experts.

The government wants to identify experts to study an urgent, complex and relatively obscure technical issue. The experts could come from the federal, state, local or tribal governments or the private sector, especially academic and nonprofit organizations. Because the technical issue is relatively obscure, human resources and personnel management systems have not likely captured the related skills. The best way to identify experts is likely through an analysis of subject-matter work products and agencies' Web sites. But some of the relevant works may be within federal government information systems, outside the government or otherwise not readily accessible through Internet search engines.

Scenario 3: Performing academic research.

For a report on Poland's involvement in the Cold War, a student needs to locate and analyze information resources. This requires an ability to identify all relevant government and other resources, focusing more on primary sources, such as reports, photos, maps and military unit histories, than on secondary sources, such as textbooks and encyclopedias. The student must also translate resources into English as necessary, rank information resources by relevance and extract relevant facts, summaries and text passages from some of those resources. The assignment also requires the student to find maps of various Cold War hot spots and add information from other sources to those maps. Finally, the student must organize the information through an analysis of the relevant resources and publish the work as a paper and Web site.

Scenario 4: Tracing information audit trails.

An organization must track the flow of electronic information on a specific topic among government agencies to understand how and where the information was processed. It also must identify the accuracy, relevancy, timeliness and completeness of the information. The information could be needed for application filings, environmental findings, historical research or more authoritative sources.

Scenario 5: Sharing law enforcement information across jurisdictional boundaries.

Police searching an apartment obtain handwritten notes in a foreign language, an apparent ledger of financial transactions, fingerprints and photos of unfamiliar graffiti on the apartment walls. They digitize the information and post it to the appropriate law enforcement information-sharing exchange. After receiving a notice about the posting, an investigator then translates and interprets the documents and photos, analyzes the materials, and correlates the information with other relevant information obtained from various law enforcement organizations at the federal, state, local and tribal levels.

Scenario 6: Tracking down forged identities.

A credit card company discovers that someone fraudulently established a series of accounts. The credit card company must notify the victims. The victims in turn must notify all financial organizations they use, including all governmental agencies from which they currently or potentially receive services.

Scenario 7: Allowing citizens to access government information on a specific topic.

Someone is searching for all available federal information on a particular topic, including information located on government Web sites. A successful search will help the person avoid using the complex, lengthy and potentially costly Freedom of Information Act process. Agencies cannot determine an individual's interest in advance, but invariably the same, similar or related government information is located at more than one federal agency and takes various online forms. Some of those information resources are Deep Web or hidden Web assets and are not easily accessible using typical Internet search engines.

International information retrieval

The Open Content Alliance, a new worldwide collaboration of cultural, technology, nonprofit and governmental organizations, would like to help the U.S. government build a searchable, permanent archive of online government information.

Here is more information about the Open Content Alliance.

  • The alliance, hosted by the nonprofit Internet Archive, will be a digital repository of global content for universal access.
  • The online warehouse will offer digital versions of books, academic papers and video and audio files.
  • Yahoo will power the collection's search engine, but all content will be available for other major search engines to index.
  • Metadata for all content will be freely available to the public through formats such as the Open Archives Initiative Protocol for Metadata Harvesting and RSS.
  • Current participants include Adobe, Hewlett-Packard Labs, the United Kingdom's National Archives, O'Reilly Media, Prelinger Archives, all libraries at the University of California campuses and the University of Toronto.

-- Aliya Sternstein