Too much information

Content categorization engines help find what workers need to know

As agencies expand their online infrastructure with Web portals and e-government applications, they are bulking up their data repositories but also frustrating efforts to search that information.

Sixty percent of the respondents to a recent Delphi Group survey of 450 private- and public-sector organizations complained that finding information has become a difficult and often futile process.

The problem is that valuable information is often buried in a mass of unstructured data, such as e-mails and various text files, which now account for 75 percent to 85 percent of repository content, said Susan Feldman, research vice president for content management and retrieval software at IDC.

Certainly anyone who has ever looked for information on the Web using traditional search engines knows how frustrating and time-consuming it can be.

Now a number of agencies are finding an alternative to search engines with content categorization or taxonomy technologies, which organize unstructured data into Yahoo-type directories.

These software products, offered by a growing cadre of vendors, use a constellation of algorithms to index documents in subject categories and build a hierarchical category structure called a taxonomy.

The taxonomy provides "an overview of the information space," enriching the browsing experience, according to Feldman.

However, even though federal agencies are expressing interest in taxonomy-generating technologies, few are rushing out to make major investments because there are a number of sticking points.

For one, "it is hard to do" and involves more than just information technology, said John Gregory, a marketing specialist with the U.S. Postal Service. "It takes assembling a team that understands how to do it."

Gregory led the effort to deploy Semio Corp.'s SemioTagger to make marketing and customer data easily accessible to about 1,000 USPS employees.

"Cost is a major issue," he added. With implementations averaging $200,000, they are "a hard sell" to agencies that have already invested in Web portals.

Varying Methods

Assuming that an agency has made the case for a taxonomy product, finding the right one can be tricky. Four general types of algorithms are most commonly used in taxonomy software: linguistic, statistical, rule-based and data-modeling. As you might imagine, vendors can make persuasive cases about why certain approaches (always theirs) are better than others.

That said, "statistical and rule-based techniques are the basic methods," said Ramana Rao, chief technical officer for Inxight Software Inc.

Many tools use more than one approach. For example, Inxight uses linguistic and statistical analysis, Semio uses linguistic and statistical clustering techniques and Stratify Inc. uses versions of all four algorithms.

Regardless of the approach, these algorithms represent only part of the equation. "There isn't a categorization system I know of that doesn't require some human effort," Feldman said.

For example, Autonomy Corp.'s Clusterizer "can examine existing information and suggest an appropriate categorization schema," said John Cronin, director of the government sector at Autonomy.

But the user must then manually select terms for each category and subcategory or feed an example or training set of documents into the system "so it can learn" the category criteria, Cronin said.
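As a rough illustration of the training-set approach Cronin describes, the sketch below builds a word-frequency profile for each category from example documents and assigns a new document to the closest profile. It is a minimal sketch under stated assumptions, not Autonomy's method: the category names, sample documents and function names are all hypothetical, and cosine similarity over word counts stands in for whatever a commercial product actually uses.

```python
from collections import Counter
import math
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(training_set):
    """Build one word-frequency profile per category from example documents."""
    profiles = {}
    for category, documents in training_set.items():
        counts = Counter()
        for doc in documents:
            counts.update(tokenize(doc))
        profiles[category] = counts
    return profiles


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def categorize(text, profiles):
    """Return the category whose profile is most similar to the text."""
    words = Counter(tokenize(text))
    return max(profiles, key=lambda c: cosine(words, profiles[c]))


# Hypothetical training set: category name -> example documents.
training_set = {
    "marketing": [
        "direct mail campaign pricing for customers",
        "customer survey and advertising campaign results",
    ],
    "operations": [
        "truck routes and delivery schedule",
        "sorting facility schedule and delivery volume",
    ],
}
profiles = train(training_set)
label = categorize("new advertising campaign for business customers", profiles)
```

The quality of the result depends heavily on the examples supplied, which is one reason human review remains part of the process.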

Some vendors simplify the process by using taxonomy templates that can be customized. Semio offers nine such indexes, covering areas such as general business and IT.

Those templates provide "rich starting points," said Art Goldberg, vice president of corporate development at Semio.

Similarly, Verity Inc. offers a canned taxonomy that is modeled after one used in the LexisNexis information service as a way to jump-start the process, said Rajat Mukherjee, Verity's principal architect.

The amount of human involvement in creating the taxonomy often depends on the size of the data collection and how fast new documents are added to the system. The decision involves trade-offs, such as weighing speed against accuracy. And there are related cost considerations: "The more automated it is, the less costly," Cronin said.

Feldman added a couple of other cautionary notes. If you let the computer categorize a cluster of documents, it may assign a label that does not match your perception of the meaning of that cluster, she said. More importantly, if you take a snapshot of any data collection without applying some human understanding, you run the risk of creating a taxonomy that is biased or limited by the particular dataset you gave it to analyze.

Once the taxonomy structure is built, it is populated with documents and made accessible to users.

For Lotus Development Corp., this means the taxonomy becomes only one part of its knowledge management offering. "We think of the taxonomy as very useful, but probably a background tool," said Scott Eliot, director of knowledge management strategy at Lotus.

Semio offers a tool to view the taxonomy and a browser to search it, and it translates the output into Extensible Markup Language, HTML or Microsoft Corp.'s SQL for use in third-party portals, Goldberg said. Army Knowledge Online and the Office of the Secretary of Defense use SemioTagger with AT&T and Lotus portals, respectively.

Semio is also looking at using enhanced visualization from TheBrain Technologies Corp. in a future product, Goldberg said.

For their part, TheBrain officials see Semio as both a potential partner and a competitor.

TheBrain's hook is that instead of using the traditional hierarchical format for displaying topics, it presents them so that the main topic appears in the center of the screen, with subtopics connected by lines that radiate from the center.

This novel presentation attracted the attention of the U.S. Joint Forces Command Joint Experimentation Directorate, which has been using TheBrain's technology for 18 months, said Annette Ratzenberger, chief of the directorate's experimental engineering department.

The directorate is using TheBrain's technology with an eye toward how it or similar tools could be used to deliver to a commander a broad spectrum of information, representing "a complete picture of a threat," including military and other relevant features, Ratzenberger said.

Similarly, Inxight offers a nonhierarchical data visualization tool called Star Tree Studio. "It is a map on steroids," Rao said, adding that it displays several levels of information at a time, helping users determine the correct search direction. In March, the Army integrated the Star Tree tool into its knowledge management system.

McKenna is a freelance writer based in the San Francisco Bay area.

***

A primer on algorithms

An ever-growing family of algorithms is powering the tools used to categorize data repositories. In its Taxonomy and Content Categorization study, the Delphi Group identifies several basic algorithms. Among them are:

* Linguistic analysis, which identifies the subjects, verbs and objects of a sentence and then analyzes them to extract meaning.

* Statistical text analysis and clustering, which measures word frequency, placement and grouping and the distance between words in a document.

* Rule-based taxonomies, which classify documents using specific "rules" created and maintained by experts using "if-then" statements that measure how well a document fits into a category.
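To make the rule-based approach concrete, here is a minimal sketch of "if-then" rules that measure how well a document fits a category. The rules, category names and the 0.5 threshold are hypothetical and chosen only for illustration; they are not drawn from the Delphi Group study or from any vendor's product.

```python
# Hypothetical rules: each is (category, required terms, excluded terms).
RULES = [
    ("postal rates", {"postage", "rate"}, {"interest"}),
    ("recruiting", {"hiring", "workforce"}, set()),
]


def fit_score(words, required, excluded):
    """If any excluded term appears, then score 0; otherwise score the
    fraction of required terms found in the document."""
    if words & excluded:
        return 0.0
    return len(words & required) / len(required)


def classify(text, rules=RULES, threshold=0.5):
    """Return (category, score) pairs whose score meets the threshold,
    best fit first."""
    words = set(text.lower().split())
    matches = []
    for category, required, excluded in rules:
        score = fit_score(words, required, excluded)
        if score >= threshold:
            matches.append((category, score))
    return sorted(matches, key=lambda m: m[1], reverse=True)


result = classify("New postage rate schedule announced")
```

Rules like these are easy to read and audit, but as the definition above notes, experts have to create and maintain them as the collection changes.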
