Going where no search engine has gone before
Connotate Technologies uses information agents to extract data from Deep Web
By Dibya Sarkar
May 30, 2005
Google, one of the most popular search engines, at best can index and search about 4 billion to 5 billion Web pages, representing only 1 percent of the World Wide Web.
But officials from Connotate Technologies, a company based in New Brunswick, N.J., said they have developed technology that can mine and extract data from the Deep Web, which contains an estimated 500 billion Web pages, and deliver it in any format and through any delivery mechanism. The Deep Web refers to content in databases that rarely shows up in Web searches.
Through intelligence-based software modules called information agents, corporate and government organizations can quickly and easily target specific unstructured data from intranets and password-protected Web sites on a continual basis.
"What the agents do is they automate time-consuming Web interaction," said Bruce Molloy, the company's chief executive officer. "So an agent can act on your behalf, type in information, search terms, can click on links, can know your password, but we would keep it protected, can automatically go to sites and bring back information, format and cut and paste results."
Such information agents can monitor pages as often as once per second and deliver real-time results, he said. In addition to the financial and energy sectors, some federal agencies, such as the Defense Department, use the technology. Company officials would not comment on how DOD uses the technology. But Connotate officials said they are talking with intelligence organizations and the Homeland Security Department about the technology.
Connotate was formed in 1999 by three Rutgers University professors, whose Web-mining technology research was funded by the Defense Advanced Research Projects Agency and the university.
Learning the ABCs
Ken Hambright, information technology manager at Quadel Consulting, a firm based in Washington, D.C., said the company began using information agents about three years ago to help monitor several multifamily housing programs as required by its Department of Housing and Urban Development contract. HUD also requires Quadel to enter data into government systems, which company employees initially had been doing manually.
"What the information agents allow us to do is kind of automate that procedure so that when we enter things in our system they automatically get entered into the HUD system," Hambright said. "It saves a lot of keystrokes. It saves errors because the systems are always in sync."
Molloy said information agents can go to complex Web sites and databases, extract information such as dates, names or contract identification numbers and automatically deliver that data in any format.
"What we're able to do is actually connect on a data level and pull information back, or we can take information and actually place it onto Web sites so the agents can provide a kind of data-entry function," he said.
Company officials said setting up an agent is easy and takes only five minutes in some cases. The company sells software licenses, and it also hosts an Information Agent Library in which users can manage their subscriptions to various sites, including news, corporate, government and others.
To build an agent, for example, a user could open a Web browser to a news Web site and highlight a section that provides financial news.
Another way to build an agent is through a keyword filter. The agent would essentially learn what information the user is targeting.
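The article does not describe how Connotate implements its keyword filter. As a rough illustration only, the basic idea of keeping items that mention a user's target terms could be sketched as follows. The function name `make_keyword_filter`, the sample headlines and the keywords are all hypothetical, and this simple version matches fixed terms rather than learning from examples.

```python
import re


def make_keyword_filter(keywords):
    """Return a function that keeps only the text items mentioning a keyword.

    Hypothetical helper for illustration; not Connotate's actual method.
    """
    if not keywords:
        raise ValueError("at least one keyword is required")
    # Match any keyword as a whole word, ignoring case.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keywords) + r")\b",
        re.IGNORECASE,
    )

    def keep(items):
        return [item for item in items if pattern.search(item)]

    return keep


# Made-up sample data standing in for text pulled from a news page.
headlines = [
    "HUD announces new multifamily housing rules",
    "Local team wins championship",
    "Energy prices climb as demand rises",
]
target_filter = make_keyword_filter(["housing", "energy"])
matches = target_filter(headlines)
```

Here `matches` holds the first and third headlines. A filter like this stays fixed once it is built, whereas the agents described in the article are said to improve as users show them more examples.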
"It's a lot like showing something to a small child for the first time," said Chris Giarretta, Connotate's customer relationship manager. Essentially, he said, the more a user shows the agent what is wanted, the better the agent will get at finding it.
Users can personalize subscriptions by setting how often they want to receive data and through what medium, such as e-mail, instant messaging or Really Simple Syndication (RSS) feeds to any electronic device. The information can also be placed in spreadsheets or databases or published in a newsletter format. Plus, data can be delivered to an alert monitor, a personalized desktop ticker, which Molloy likens to an RSS feed. Subscribers can set up a distribution list to automatically send data from sites to several people at once.
The agent can access intranets and other sites that need authentication. Essentially, it serves as the user's proxy to enter those sites, said Dan Haughton, Connotate's vice president of marketing.
"Whatever an individual can do in terms of accessing a site, Deep Web navigation, filling out forms, an agent can do that," he said. "If you have a subscription and password to the site, then the agent can have access. If you don't have a password, that site would be closed."
Web-mining tech emerging
John Blossom, president of Shore Communications, a research and analysis firm, said search engines typically either identify a document or provide an index and list relevance rankings. Web-mining technology not only crawls through sites but also takes content from a Web page and normalizes, analyzes and packages it in useful formats, such as Extensible Markup Language or other means, he said.
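To make the idea of taking content from a page, normalizing it and packaging it as XML more concrete, here is a minimal sketch using only Python's standard library. It is not a description of any vendor's product. The class `CellCollector`, the function `page_to_xml`, the field names and the sample page are assumptions made up for this example.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class CellCollector(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def page_to_xml(html, field_names):
    """Turn table rows in an HTML page into a simple XML document."""
    collector = CellCollector()
    collector.feed(html)
    root = ET.Element("records")
    for row in collector.rows:
        # Skip header rows or rows that do not match the expected fields.
        if len(row) != len(field_names):
            continue
        record = ET.SubElement(root, "record")
        for name, value in zip(field_names, row):
            # Normalize by collapsing runs of whitespace.
            ET.SubElement(record, name).text = " ".join(value.split())
    return ET.tostring(root, encoding="unicode")


# Hypothetical page fragment; the contract IDs and dates are invented.
sample_page = """
<table>
  <tr><th>Contract</th><th>Date</th></tr>
  <tr><td> C-1001 </td><td>2005-05-30</td></tr>
  <tr><td>C-1002</td><td>2005-06-01</td></tr>
</table>
"""
xml_output = page_to_xml(sample_page, ["contract_id", "date"])
```

The result is a `<records>` element with one `<record>` per table row, which a downstream system could load into a database or spreadsheet. A production tool would also need to handle logins, malformed markup and pages that change layout, none of which this sketch attempts.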
"Since the engine can be prepared to look for specific kinds of information, you're not at the mercy of a general crawling algorithm," Blossom said.
He said people are becoming aware of this technology. Several companies, such as Inxight Software, Mark Logic and Zoom Information (formerly Eliyon Technologies), provide text-mining capabilities using different approaches, he added.
Blossom said government agencies and other organizations want to get as much value from raw data as they do from structured or normalized information. In knowledge management, he said, people are generally moving toward bringing structured and unstructured content under a common processing umbrella to extract meaningful information and intelligence.
Another benefit of Connotate's technology is that users can effectively apply it at an individual or institutional level. Other similar technologies work at one level or the other, but not both, Blossom said.
Pricing starts at a little more than $100,000, Molloy said.