Xerox announces categorization software
By Florence Olsen
Mar 01, 2004
Research scientists at Xerox Research Centre Europe say they have perfected a new method for automatically categorizing electronic messages and documents for future retrieval.
The method uses unnamed software that performs what the scientists call "deep linguistic analysis." The technique could be useful, for example, for categorizing documents that should be preserved as federal records, the scientists said. Written in Java, the software can be integrated into existing document management and workflow systems.
"It's exciting news if true," said J. Timothy Sprehe, president of Sprehe Information Management Associates Inc., a consulting company in Washington, D.C. "There's enormous interest in auto-categorizing e-mail," especially among federal records managers.
Eric Gaussier, a research scientist at the center, said the new software represents an advance over existing categorization software, which is offered in some products and in the public domain. The software recognizes, for example, that words can have several meanings, depending on their context. It also recognizes that different words can mean the same thing, he said.
Since 1993, the research center has been developing linguistic analysis tools for different uses and in 20 languages, Gaussier said. The categorization software is a new use for those tools and for machine learning, for which the center is also known.
Such tools are very much needed, Sprehe said. In most federal departments, the volume of e-mail has grown so large that having people categorize e-mail messages for preservation as federal records is nearly impossible. "It's no longer a practical solution," he said.
However, most experts in the field of records management say that automated filtering of records still leaves much to be desired. "The general conclusion is that auto-categorization is not yet ready for prime time," Sprehe said. "Everyone who is interested in this will say they want to see the proof first."