Hadoop cuts big data down to size

Linux desktop software certainly has its hard-core fans, but the open-source operating system has always faced an uphill battle against a proprietary Microsoft Windows product that could do pretty much all the same things (and often better), even though Windows cost a few bucks and Linux was free. The same scenario holds for many other open-source tools.

But that’s not the story for Apache Hadoop, an open-source program that is gaining converts at government agencies who are trying to draw new insights from all the “big data” they are generating.

“Hadoop is totally unique in many ways,” said Bob Gourley, editor of CTOvision.com and former chief technology officer at the Defense Intelligence Agency.

Hadoop, and a handful of open-source tools that complement it, has no equal when it comes to making gigantic and diverse datasets easily available for quick analysis using clusters of inexpensive computers, Gourley said. The oddly named software is already considered mission-critical at Web giants such as Facebook, Twitter, Yahoo and many others for marketing chores, personalizing Web pages and detecting junk mail.

Nevertheless, agency IT officials still need to weigh their options carefully because Hadoop, like most open-source tools, can require a significantly bigger commitment of skills and involvement from its users than most commercial products demand.

Hadoop’s roots go back to 2004, when Google published a paper on a parallel processing technique it had developed to juice its Web operations, called MapReduce. Those ideas inspired Doug Cutting to create the Java-based Hadoop, which he named after his son’s toy elephant.
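The MapReduce idea behind Hadoop can be sketched in a few lines of Python. This is not Hadoop's actual Java API, and the function names here are illustrative only: a map step emits key-value pairs from the input, the framework shuffles the pairs into groups by key, and a reduce step combines each group into a result. Hadoop runs the map and reduce steps in parallel across a cluster; the classic demonstration is counting words.

```python
from collections import defaultdict

def map_phase(documents):
    """Map step: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield word.lower(), 1

def shuffle(pairs):
    """Shuffle step: group all emitted values by key, as the framework
    does between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce step: combine each key's values into a final count."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big insights", "big clusters"]
print(reduce_phase(shuffle(map_phase(docs))))
# {'big': 3, 'data': 1, 'insights': 1, 'clusters': 1}
```

The appeal of the model is that the map and reduce functions never see the whole dataset at once, so the framework is free to spread them across as many cheap machines as the data requires.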

The National Security Agency started using Hadoop two years ago for a massive, nationwide system for sharing and analyzing data. The system also houses unstructured text, large files and other forms of intelligence with unprecedented scalability, reported J. Nicholas Hoover for InformationWeek.

"The object is to do things that were essentially impossible before," said Randy Garrett, director of technology at NSA's integrated intelligence program, at a symposium at the time.

Other agencies are getting on board, too. Earlier this month, more than 40 systems were nominated for the Government Big Data Solutions Award at the annual Hadoop World conference, including a State Department system that looks for fraudulent visa and passport applications, said Gourley, one of the award judges.

The top winner was the General Services Administration's USASearch, a hosted search service used by more than 550 government websites. Using Hadoop with the Hive open-source data warehousing software fundamentally changed how the system designers thought about data, said Ammie Farraj-Feijoo, GSA's program manager for USASearch.

"Freed from the constant anxiety of wondering how we were going to handle an ever-increasing amount of data, we shifted from trying to store only what we really needed to storing whatever we thought might be useful," wrote Loren Siebert, an independent software developer and technical lead for the USASearch program, in a blog entry describing the project.

But buyer beware: Hadoop can be a serious do-it-yourself project. Users who don't want to smooth Hadoop's sharp corners themselves will need a product such as Cloudera's distribution of Apache Hadoop, which provides enterprise support the way Red Hat does for Linux.

Also, many of the data analysis capabilities and enabling algorithms that are, for most organizations, the only reason to wrestle with big data (and the special sauce that proprietary business analytics vendors make their living on) are not part of Hadoop.

Users must cobble together and program those applications themselves, hire help or buy them from one of a growing number of companies. For example, Digital Reasoning offers software that works with Hadoop to analyze massive amounts of unstructured text in English and now Chinese by looking for patterns that signal fraud and potential threats. Guess who might be interested in those capabilities?

More established business analytics vendors are also joining the party because of Hadoop’s unique capabilities. IBM’s Watson, the Jeopardy-winning supercomputer, owes part of its mental agility to Hadoop, while SAS is developing ways to allow its proprietary tools to work with data stored by Hadoop.

About the Author

John Zyskowski is a senior editor of Federal Computer Week. Follow him on Twitter: @ZyskowskiWriter.


Reader comments

Wed, Nov 30, 2011 Southeast US

I bet Charlie in "Numbers", if he were real, would be using Hadoop to organize his data for his amazing crime-solving analyses. Now, if I can just learn how to do all that analytics coding, I might just have a very interesting new career. (At least I could use it to find the next "big winner" in the stock market!)
