Hadoop cuts big data down to size

The open-source software has unmatched analytical chops, but it is hardly a free ride.

Linux desktop software certainly has its hard-core fans, but the open-source operating system has always faced an uphill battle against proprietary Microsoft Windows, which could do pretty much all the same things (and often do them better), though it cost a few bucks instead of being free. The same scenario holds for many other open-source tools.

But that’s not the story for Apache Hadoop, an open-source program that is gaining converts at government agencies that are trying to draw new insights from all the “big data” they are generating.

“Hadoop is totally unique in many ways,” said Bob Gourley, editor of CTOvision.com and former chief technology officer at the Defense Intelligence Agency.

Hadoop, along with a handful of open-source tools that complement it, has no equal when it comes to making gigantic and diverse datasets easily available for quick analysis on clusters of inexpensive computers, Gourley said. The oddly named software is already considered mission-critical at Web giants such as Facebook, Twitter and Yahoo for marketing chores, personalizing Web pages and detecting junk mail.

Nevertheless, agency IT officials still need to weigh their options carefully because Hadoop, like most open-source tools, can require a significantly bigger commitment of skills and involvement from its users than most commercial products demand.

Hadoop’s roots go back to 2004, when Google published a paper on MapReduce, a parallel processing technique it had developed to speed its Web operations. Those ideas inspired Doug Cutting to create the Java-based Hadoop, which he named after his son’s toy elephant.
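At its core, MapReduce breaks a job into a “map” step that runs in parallel across every machine in a cluster and a “reduce” step that aggregates the partial results. The canonical illustration is counting how often each word appears in a mountain of text files; the minimal sketch below uses Hadoop’s standard Java MapReduce API, with input and output paths supplied on the command line:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map step: for every word in this machine's slice of the input,
      // emit the pair (word, 1).
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce step: Hadoop groups the pairs by word and hands each word's
      // counts to a reducer, which sums them.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // pre-aggregate locally
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Trivial on a single laptop, those same few dozen lines run unchanged against terabytes of text spread across hundreds of machines, which is the point of Hadoop’s scale-out design.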

The National Security Agency started using Hadoop two years ago for a massive, nationwide system for sharing and analyzing data. The system houses unstructured text, large files and other forms of intelligence with unprecedented scalability, J. Nicholas Hoover reported in InformationWeek.

"The object is to do things that were essentially impossible before," said Randy Garrett, director of technology at NSA's integrated intelligence program, at a symposium at the time.

Other agencies are getting on board, too. Earlier this month, more than 40 systems were nominated for the Government Big Data Solutions Award at the annual Hadoop World conference, including a State Department system that looks for fraudulent visa and passport applications, said Gourley, one of the award judges.

The top winner was the General Services Administration's USASearch, a hosted search service used by more than 550 government websites. Using Hadoop with the Hive open-source data warehousing software fundamentally changed how the system designers thought about data, said Ammie Farraj-Feijoo, GSA's program manager for USASearch.

"Freed from the constant anxiety of wondering how we were going to handle an ever-increasing amount of data, we shifted from trying to store only what we really needed to storing whatever we thought might be useful," wrote Loren Siebert, an independent software developer and technical lead for the USASearch program, in a blog entry describing the project.

But buyer beware: Hadoop can be a serious do-it-yourself project. If you don't want to smooth Hadoop's sharp corners yourself, you'll need a product such as Cloudera’s distribution of Apache Hadoop, which provides enterprise support the way Red Hat does for Linux.

Also, Hadoop does not include many of the data analysis capabilities and enabling algorithms that, for most organizations, are the only reason to wrestle with big data in the first place, and that are the special sauce proprietary business analytics vendors make their living on.

Users must cobble together and program those applications themselves, hire help or buy them from one of a growing number of companies. For example, Digital Reasoning offers software that works with Hadoop to analyze massive amounts of unstructured text in English and now Chinese by looking for patterns that signal fraud and potential threats. Guess who might be interested in those capabilities?

More established business analytics vendors are also joining the party because of Hadoop’s unique capabilities. IBM’s Watson, the Jeopardy-winning supercomputer, owes part of its mental agility to Hadoop, while SAS is developing ways to allow its proprietary tools to work with data stored by Hadoop.