Making the most of a golden opportunity
Data mining is rich with potential, but not a simple tool
- By Tim Fielden
- Feb 04, 2001
"Ad hoc, ad loc and quid pro quo. So little time, so much to know," fretted Jeremy Boob in the Beatles movie "Yellow Submarine." He might as well have been an information technology manager in any federal agency or department, struggling to keep up with the rapidly growing mounds of electronic data needing to be analyzed for information.
Most organizations attempt to deal with the problem by sorting and storing the information into data warehouses or data marts, but actually finding a critical piece of data can be problematic. Imagine, for example, trying to find out how many people who refused to participate in the census also refused to pay income tax.
That's where data-mining solutions come in. Data-mining software employs algorithms to search for patterns in the raw data residing in your databases.
In contrast to other forms of business intelligence, such as query tools or returned records from online analytical processing (OLAP) technology, data mining does not require a formulated approach such as a structured query, but rather performs fuzzy searches with little or no guidance from the user. And, thanks to many recent advances in data mining, users can now select, explore and model large amounts of data residing in flat files, relational databases and even multidimensional OLAP cubes.
Data mining certainly isn't a good fit for everyone, as it often requires a substantial initial investment and potentially even a bit of consulting assistance to make it work. However, it does enable an organization to extract useful information that might otherwise prove unobtainable.
For this briefing, we explore the ins and outs of data-mining software by examining three products: IBM Corp.'s DB2 Intelligent Miner for Data 6.1, Microsoft Corp.'s SQL Server 2000 and eSystems' ASaP 2.0. Bear in mind, however, that there are other strong contenders, including SAS Institute Inc.'s Enterprise Miner and SPSS Inc.'s Clementine.
IBM DB2 Intelligent Miner for Data
The IBM solution proved to be an excellent piece of software. Not only was it easy to install, requiring very little input, but it also proved to be fairly intuitive to use thanks to its blending of a comfortable explorer tree on the left side of the screen with useful graphical views of the results on the right.
By simply choosing to create a new mining operation from the main menu, we were quickly transported into the Mining Functions Wizard. We then were walked through the process of selecting our data, defining which repositories it would come from, the groups (mining collections) to which it would be added and any mapping and special taxonomies to be used.
When it came to the actual mining, we found that the solution offered six of the most popular data-mining algorithms: associations, classification, clustering, prediction, sequential patterns and time sequence analysis.
An association enables users to detect related trends and patterns, while a classification is best suited for predicting outcomes. Clustering, on the other hand, aside from being the easiest place to get started with data analysis, also enables a user to choose datasets that exhibit similar, predictable characteristics. Following that is prediction, which aids in forecasting values over time, and sequential patterns, which let you analyze data by time or by transactions. Lastly, time sequence analysis enables users to look for data associations and activity in relation to time.
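To make the association idea concrete, here is a minimal sketch in Python. It is not taken from Intelligent Miner or any other product discussed here; the transactions, the item names and the `pair_rules` function are all hypothetical. It counts how often two items appear together in a record (support) and how often the second item appears when the first one does (confidence), which are the usual measures reported for an association.

```python
from collections import Counter
from itertools import combinations

# Hypothetical records: each set lists the items seen in one transaction.
TRANSACTIONS = [
    {"form_a", "form_b"},
    {"form_a", "form_b", "form_c"},
    {"form_a", "form_c"},
    {"form_b"},
]


def pair_rules(transactions, min_support=0.5):
    """Return sorted (antecedent, consequent, support, confidence) tuples.

    Only two-item rules are considered, to keep the sketch short.
    """
    n = len(transactions)
    # How many transactions contain each single item.
    item_counts = Counter(item for t in transactions for item in t)
    # How many transactions contain each unordered pair of items.
    pair_counts = Counter(
        pair for t in transactions for pair in combinations(sorted(t), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue  # the pair is too rare to report
        # Confidence of "a implies b" is P(b | a); likewise for "b implies a".
        rules.append((a, b, support, count / item_counts[a]))
        rules.append((b, a, support, count / item_counts[b]))
    return sorted(rules)


if __name__ == "__main__":
    for antecedent, consequent, support, confidence in pair_rules(TRANSACTIONS):
        print(f"{antecedent} -> {consequent}: "
              f"support={support:.2f}, confidence={confidence:.2f}")
```

With this toy data, the pair form_b and form_c is dropped because it appears in only one of the four records, while the rule "form_c implies form_a" earns a confidence of 1.0 because every record containing form_c also contains form_a. Commercial tools apply the same measures to far larger itemsets and data volumes.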
Microsoft SQL Server 2000
The Microsoft product also put on a good showing, thanks largely to its new offering, Integrated Data Mining, available as part of its Analysis Services. Microsoft has included many wizards, editors and other user interface elements for simplifying the design, creation, training and browsing of data-mining information. It is not as intuitive as the IBM product, but we still found it easy to use. SQL Server 2000 now offers two classes of data-mining algorithms developed by Microsoft Research: Microsoft Decision Trees and Microsoft Clustering. Although that selection seems limited at first glance, we quickly found that, if necessary, we could also make use of other algorithms developed by third parties.
The Microsoft Decision Trees algorithm is an amalgamation of four mining methods and is based on the notion of classification. The algorithm builds a tree that will predict the value of columns based upon the other columns in the training set (that is, a fact table). The decision on where to place each node in the tree is made by the algorithm, and the most significant and differentiating attributes are shown closer to the root of the decision tree. The Microsoft Clustering algorithm, on the other hand, uses a "nearest-neighbor" method for grouping records into clusters that exhibit some similar, predictable characteristic.
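Why the most differentiating attributes end up near the root is easier to see with a small example. The Python sketch below is not Microsoft's algorithm, whose internals are not described here. It uses a common textbook criterion, information gain, as a stand-in, and the rows, column names and function names are hypothetical. The attribute that best separates the values of the predicted column scores highest and would be placed at the root.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Reduction in entropy of `target` after splitting rows on `attribute`."""
    base = entropy([row[target] for row in rows])
    # Group the target labels by the value each row has for the attribute.
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    # Weighted average entropy of the groups produced by the split.
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


def best_root(rows, attributes, target):
    """Pick the attribute with the highest information gain as the root."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))


# Hypothetical training set; "audited" is the column to predict.
ROWS = [
    {"region": "east", "filed_online": "yes", "audited": "no"},
    {"region": "east", "filed_online": "no", "audited": "yes"},
    {"region": "west", "filed_online": "yes", "audited": "no"},
    {"region": "west", "filed_online": "no", "audited": "yes"},
]

if __name__ == "__main__":
    print(best_root(ROWS, ["region", "filed_online"], "audited"))
```

In this toy table, `filed_online` separates the audited rows perfectly while `region` tells us nothing, so `filed_online` becomes the root. A full tree builder would repeat the same selection on each branch until the rows in a branch agree or the attributes run out.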
Using either algorithm, we found that the data-mining tools worked well with all of our data, whether from OLAP, relational or flat-file databases.
eSystems ASaP
eSystems' ASaP also installed without a hitch, although we were a bit disappointed with the user interface at the server and client levels. The interface uses a portal-type approach to mining data, with one copy of all the data and the application itself residing on a server. Once the server side has been configured, the client interfaces do work well for analyzing data trends.
The portal approach makes some sense, in theory, if one expects to field the tool across an organization. In such a case, spending $50,000 once on a server-side solution would be much less expensive than spending $50,000 per instance. It is a nice offering, although most companies will not have large numbers of data miners, because data mining requires a bit more skill and training than most software.
With our data defined, we fired up the Web client only to find a framed HTML document from which to work. Although we have nothing against frames, we were a bit disappointed to find that after adding a few data banks to the window, we needed to resize the frame to access all the information. Similar to the Microsoft solution, ASaP used a multidimensional database from which users can index, search, access, retrieve, browse and accurately mine distributed information banks residing either inside or outside of their agency or department, making it a potentially powerful knowledge management tool.
We also appreciated the tool's ability to apply expert and personal knowledge via a "Knowledge Note" against either a specific piece of information or against an entire data bank. You then can make those notes part of the searchable index of the data bank.
Picking the Best Fit
So how can you choose among the various data-mining solutions?
All of the solutions we examined offer the scalability demanded by large organizations, being capable of handling hundreds of gigabytes' worth of transactions daily. And all of the solutions offer effective algorithms for sifting through the masses of data that can accumulate in your department.
Selecting a specific product for implementation, therefore, will primarily depend on three factors: price, user interface and the type of data and other applications already in use in your organization.
eSystems' ASaP has a definite edge in price, but it also offers a less usable interface than either Microsoft's SQL Server or IBM's Intelligent Miner for Data. Although ASaP's portal-type interface allowed multiple users to mine data concurrently on a single copy of the software, the inability to control and change operations at the client level proved a bit disappointing.
Similarly, Microsoft's SQL Server solution was cheaper to deploy initially than the IBM product, but it had one major shortcoming that may eliminate it from your list: its reliance on SQL Server. In an environment where SQL Server is not the standard data storage solution, that could prove problematic.
Although our testing of the three products was not for the purpose of comparative scoring, Intelligent Miner struck us as the most complete solution and the easiest to use of those we examined. Finally, as noted above, users in your agency or department may be more comfortable with the algorithms employed by one solution rather than another. Accordingly, you may want to see about arranging hands-on testing before committing your budget to a specific product.
Fielden is a senior analyst with the InfoWorld Test Center. He can be reached at firstname.lastname@example.org.