Agencies grapple with search and discovery

Google's magic doesn't work in enterprise search, agencies find.

As increasing amounts of information exist only in electronic form, agencies are having to find ways to search it, retrieve useful information and retain it to comply with information retention rules.


Legal cases now routinely require agencies to produce not only paper records but also e-mails, text chat logs and other electronic data — a challenge many of them are not well prepared for, said Ed Meagher, deputy chief information officer at the Interior Department.


"This is one of those issues that has really crept up on us," he said. Meagher moderated a panel discussion on the topic today at the FOSE trade show in Washington.


"It's relatively easy to store 1 billion objects, but it is incredibly hard to search for relevant information" within them, said Jason Baron, director of litigation at the National Archives and Records Administration.


Lt. Col. James Whitlock, chief of knowledge management for the Air Force Medical Service, said enterprise users are conditioned by Internet search tools — primarily Google — to expect well-sorted search results. Google's site ranking system is so good that 90 percent of the time, users find what they need on the first page of results, he said.


"We are socialized to expect that level [of accuracy] when we go to enterprise search, and the problem is, the Google magic doesn't work" for enterprise data, Whitlock said. Google's search engine ranks Web pages based on the number of other pages that link to them. There is no such easy measure for business documents in an enterprise system, he said.


A Google spokesman, who was not involved in the discussion, later noted that Google does offer enterprise search tools using technology appropriate to the enterprise.


Whitlock advocated a ranking technique called "concept search." This technique breaks down multiple-word search terms into smaller units and ranks document hits accordingly. For example, an enterprise search on "tamiflu stockpile policy" would rank documents containing that complete phrase at the top of the list of possibly relevant hits. Next would be documents containing the phrases "tamiflu stockpile" or "stockpile policy," and trailing those, documents with any one of the words.
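The tiered ranking Whitlock described can be illustrated with a short Python sketch. It is a simplified reading of the description above, not the implementation of any particular search product: the function names (`contiguous_phrases`, `tier`, `rank`) and the sample memos are hypothetical, and matching is plain case-insensitive word comparison.

```python
import re


def contiguous_phrases(terms, length):
    """Return every run of `length` consecutive query terms, joined as a phrase."""
    return [" ".join(terms[i:i + length]) for i in range(len(terms) - length + 1)]


def tier(query, text):
    """Return the word count of the longest contiguous query phrase found in text.

    A result of 0 means none of the query words appear at all.
    """
    terms = re.findall(r"[a-z0-9]+", query.lower())
    words = re.findall(r"[a-z0-9]+", text.lower())
    # Pad with spaces so a phrase only matches on whole-word boundaries.
    padded = " " + " ".join(words) + " "
    for length in range(len(terms), 0, -1):
        for phrase in contiguous_phrases(terms, length):
            if " " + phrase + " " in padded:
                return length
    return 0


def rank(query, documents):
    """Order document names by tier, longest phrase match first; drop non-matches."""
    scored = [(tier(query, text), name) for name, text in documents.items()]
    hits = [pair for pair in scored if pair[0] > 0]
    # sorted() is stable, so documents in the same tier keep their input order.
    return [name for _, name in sorted(hits, key=lambda pair: -pair[0])]


# Hypothetical sample documents, for illustration only.
sample_documents = {
    "memo_a": "Notes on stockpile policy for regional clinics",
    "memo_b": "Updated Tamiflu stockpile policy, effective immediately",
    "memo_c": "Tamiflu dosage guidance",
    "memo_d": "Cafeteria menu for the week",
}

print(rank("tamiflu stockpile policy", sample_documents))
```

With these sample memos, the document containing the full phrase comes first, the one containing "stockpile policy" second, and the one containing only "tamiflu" last; the unrelated memo is left out. A production tool would likely combine this tiering with other signals rather than rely on it alone.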


Other sorting and ranking processes can be used in conjunction with concept search-based tools to further refine the results, he said.


The situation will only grow more complicated, Baron said. To date, most of the attention to electronically stored information has centered on e-mail, text chat logs and similar common tools. But it can also include voice mail, electronic calendars, instant messages, video conferences, posts to wikis and blogs, and virtual worlds such as Second Life, he said.


And that's not even counting new technologies yet to emerge.