Online Extra: Algorithms are the engines of data mining
At the core of many data-mining tools are the algorithms, or sets of rules, that the apps use to winnow data and select useful information.
SPSS Inc. of Chicago produces Clementine, a widely used data-mining workbench that has many algorithms. Bill Haffey, technical director for the company’s public sector division, said the technology uses two main classes of algorithms: unsupervised and supervised.
Unsupervised algorithms identify various patterns in data, Haffey said. They include association detection algorithms, sometimes called market basket algorithms.
Association detection algorithms scan data to identify items or events that tend to occur together. For example, they would be useful for detecting fraud in Medicare or Medicaid claims if a criminal is using a similar technique under different aliases, Haffey said.
Some association algorithms detect sequences of events, Haffey said. For example, a sequence detection algorithm could analyze the treatment pattern of a Medicare patient using the often-abused drug OxyContin to see whether the drug is being diverted.
Another set of unsupervised algorithms is time series algorithms, which might be used to detect and track disease outbreaks of interest to homeland security agencies, Haffey said.
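As a rough illustration of association detection, the sketch below counts how often pairs of attributes turn up on the same record and keeps the pairs that recur. The claim attributes, the `frequent_pairs` function and the `min_support` threshold are hypothetical names chosen for this example; they are not drawn from Clementine or any other product.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Return pairs of items that appear together in at least min_support records."""
    counts = Counter()
    for items in transactions:
        # Deduplicate and sort so (a, b) and (b, a) are counted as one pair.
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Hypothetical claims: each set holds attributes seen on one claim.
claims = [
    {"clinic_A", "procedure_X", "phone_555"},
    {"clinic_B", "procedure_X", "phone_555"},
    {"clinic_C", "procedure_X", "phone_555"},
    {"clinic_A", "procedure_Y", "phone_777"},
]

pairs = frequent_pairs(claims, min_support=3)
print(pairs)
```

In this made-up data, the same procedure and phone number recur across three different clinic names, which is the kind of repeated technique under different aliases that an analyst might want surfaced for review.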
They can, for example, compare the routine level of occurrence of a given disease in an area to a potential outbreak as it spreads over time.
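One simple way to picture the comparison against a routine level is a threshold set a few standard deviations above the historical average. The sketch below is only that, a minimal illustration with invented weekly case counts; the function name `flag_outbreak_weeks` and the multiplier `k` are assumptions for this example, and real surveillance systems use more elaborate models.

```python
from statistics import mean, stdev


def flag_outbreak_weeks(baseline, recent, k=3.0):
    """Return indexes of recent weeks whose counts exceed mean + k * stdev of the baseline."""
    threshold = mean(baseline) + k * stdev(baseline)
    return [i for i, count in enumerate(recent) if count > threshold]


# Invented weekly case counts for one area.
baseline_counts = [4, 5, 6, 5, 4, 6, 5, 5]   # routine level
recent_counts = [5, 6, 9, 14, 22]            # latest weeks

flagged = flag_outbreak_weeks(baseline_counts, recent_counts)
print(flagged)
```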
Supervised algorithms can train themselves to recognize patterns of interest. For example, a bank can use a supervised algorithm to build a profile of a bad credit risk, refine it over successive rounds and compare new customers against it.
“The application of a supervised algorithm means you have a known outcome,” Haffey said.
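The idea of learning from known outcomes can be sketched with a very small classifier. The example below averages the features of past customers labeled "good" or "bad" into one profile per label, then assigns a new customer to the nearest profile. The features (debt-to-income ratio and missed payments), the data and the function names are all hypothetical, and a nearest-profile rule stands in here for whatever method a real tool would use.

```python
import math
from statistics import mean


def train_profiles(examples):
    """Build one average feature vector (a profile) per known outcome label."""
    by_label = {}
    for features, label in examples:
        by_label.setdefault(label, []).append(features)
    return {
        label: tuple(mean(column) for column in zip(*rows))
        for label, rows in by_label.items()
    }


def classify(profiles, features):
    """Return the label whose profile is closest to the given features."""
    return min(profiles, key=lambda label: math.dist(profiles[label], features))


# Invented history: (debt-to-income ratio, missed payments) with a known outcome.
history = [
    ((0.10, 0), "good"),
    ((0.20, 1), "good"),
    ((0.15, 0), "good"),
    ((0.60, 4), "bad"),
    ((0.70, 5), "bad"),
    ((0.55, 3), "bad"),
]

profiles = train_profiles(history)
prediction = classify(profiles, (0.65, 4))
print(prediction)
```

Retraining on a longer history as new outcomes become known is one way such a profile could be refined over time.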
Text-mining algorithms can search for the occurrence of words in unstructured text. An additional type, clustering algorithms, seeks associations among data sets.
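At its simplest, searching for word occurrences in unstructured text amounts to splitting the text into words and counting the ones of interest. The sketch below shows only that text-mining step, not clustering; the sample note and the `count_terms` function are made up for illustration.

```python
import re
from collections import Counter


def count_terms(text, terms):
    """Count how often each term of interest appears in free text, ignoring case."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter(words)
    return {term: counts[term.lower()] for term in terms}


# Invented free-text note.
note = "Patient reports pain. Refill requested early; pain persists after refill."

hits = count_terms(note, ["pain", "refill", "fever"])
print(hits)
```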
Homeland Security Department chief technology officer Lee Holcomb noted that artificial intelligence research carried out years ago generated numerous algorithms. “The type of algorithm you use depends on the task,” he said.
Some algorithms use pattern-seeking mathematical approaches such as Bayes' law of probability, which relies on knowledge of prior events to help predict future events, Holcomb said. In structured data, name-searching engines can be useful, he said.
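Bayes' rule can be shown in a few lines: it updates a prior probability in light of new evidence. The numbers below are invented for illustration, and the parameter names are this example's own. They assume 1 percent of cases are truly of interest, and an alert that fires on 90 percent of those cases and on 5 percent of the rest.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(case of interest | alert) by Bayes' rule.

    prior: P(case of interest) before seeing the alert
    likelihood: P(alert | case of interest)
    false_positive_rate: P(alert | not a case of interest)
    """
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


p = posterior(prior=0.01, likelihood=0.9, false_positive_rate=0.05)
print(round(p, 3))
```

Under these assumed figures the probability rises from 1 percent to roughly 15 percent after an alert, which shows how much the prior still matters even when the alert is fairly accurate.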
An additional category of data analysis algorithms is fuzzy logic. These rule sets hunt for data using vaguely defined conditions.
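A vaguely defined condition such as "a large amount" can be expressed as a degree of membership between 0 and 1 rather than a yes-or-no test. The sketch below uses a simple linear ramp and takes the minimum of two memberships as a fuzzy AND, a common textbook choice. The thresholds, variable names and the example rule are assumptions made for illustration.

```python
def ramp(value, low, high):
    """Degree of membership from 0.0 (at or below low) to 1.0 (at or above high)."""
    if value <= low:
        return 0.0
    if value >= high:
        return 1.0
    return (value - low) / (high - low)


def fuzzy_and(a, b):
    """Combine two membership degrees; the minimum is one common definition."""
    return min(a, b)


# Hypothetical rule: "the amount is large AND the visits are frequent."
large_amount = ramp(3000, low=1000, high=5000)
frequent_visits = ramp(6, low=2, high=10)

score = fuzzy_and(large_amount, frequent_visits)
print(score)
```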
—Wilson P. Dizard III