How agencies can put Hadoop to work

The open-source platform is proving its mettle, but it's just one piece of a bigger puzzle.

Genome research is one of several areas where the big-data tool Hadoop is proving itself. (Stock image)

Can the big-data tool Hadoop help rescue victims during the next big disaster, or steer health officials toward the cancer treatment that is a patient's best bet?

It’s beginning to happen, according to Dante Ricci, director of federal innovation at SAP. Pre-packaged, Hadoop-based solutions already exist that federal, state and local agencies could use to deliver critical insights to researchers and emergency responders in real time, he said.

"Hadoop is a powerful technology that can meet some of the needs of big data, but not all the different use cases of big data," Ricci said. "Government employees and citizens want a simple interface, but they don’t have that combination of tools that allow end-users to search and understand the different capabilities in the data that are there for finding correlations."

Explaining Hadoop

For those still trying to wrap their head around big data, Hadoop is an even bigger enigma. Here's the (very) short version:

The Hadoop software framework makes it possible to work with massive amounts of unstructured data by spreading the processing load across a large number of servers. It is an open-source implementation of Google's MapReduce, which the search giant developed for its own web-indexing and search efforts; the sketch below shows the model at its simplest.

Hadoop is an Apache Software Foundation project. A more detailed description is available at http://hadoop.apache.org/
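
For readers who want to see the model at work, here is a minimal sketch of the canonical word-count job, written against Hadoop's Java MapReduce API (input and output paths are whatever the user supplies). The map step runs in parallel across the cluster and emits a count of 1 for every word it sees; the framework groups those counts by word, and the reduce step sums them.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: each mapper processes one split of the input in parallel.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);  // emit (word, 1) for every token
      }
    }
  }

  // Reduce step: the framework delivers all counts for the same word together.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);  // emit (word, total count)
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation on each node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g., raw web server logs
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The same pattern scales from a single machine to thousands of servers, which is the framework's whole point: the developer writes the two small functions, and Hadoop handles distribution, scheduling and fault tolerance.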

Hadoop is an open-source framework used to manage and process vast amounts of structured and unstructured data (see sidebar). Companies such as Facebook, Twitter and Yahoo use it to take enormous volumes of low-value information from web servers, such as link clicks, and turn it into useful data, according to Sid Probstein, CTO of Attivio, a Massachusetts-based enterprise software company.

But useful summaries require that Hadoop be layered with other applications that might, for example, mine the data or provide visualizations to end users.

"Hadoop is a solid technology, it’s very effective at solving the volume problem, but most interesting output has to be merged with other data," Probstein said. "You have to put that data in order for real people to consume it."

Some companies have turned to pre-packaged solutions that do just that, combining Hadoop’s data processing capabilities with other tools like in-memory technology to provide end-users with real-time insights gleaned from big piles of data.
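
As a toy illustration of that layering, the sketch below loads summary records a Hadoop job has already written, in the reducers' default tab-separated output format, into memory so an application can answer user queries instantly instead of re-scanning the raw logs. The file path, schema and in-memory HashMap are hypothetical simplifications; a production system would use an in-memory database or search index.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

public class ServeAggregates {

  public static void main(String[] args) throws IOException {
    Map<String, Long> counts = new HashMap<>();

    // Hypothetical path: MapReduce reducers write tab-separated "key<TAB>count"
    // lines to files named part-r-00000, part-r-00001, ... in the job's output directory.
    try (var lines = Files.lines(Path.of("job-output/part-r-00000"))) {
      lines.forEach(line -> {
        String[] fields = line.split("\t");
        counts.put(fields[0], Long.parseLong(fields[1]));
      });
    }

    // The slow batch work is already done; lookups against the
    // pre-aggregated results now return instantly for end users.
    System.out.println("clicks on /apply: " + counts.getOrDefault("/apply", 0L));
  }
}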

Ricci cited a use case at MKI, a biotechnology company based in Japan, as evidence of where Hadoop-based technologies are headed.

MKI uses Hadoop in conjunction with SAP’s in-memory HANA system and the open-source statistical language R to create what Ricci called a "real-time big data platform" that cuts down the time it takes doctors to sequence genomes.

In the system, Hadoop handles data pre-processing and high-speed storage, R does the data mining, and SAP’s HANA system performs real-time analysis of the patient data.
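
Neither SAP nor MKI has published code for the platform, but the real-time layer of a stack like this is typically reached through ordinary SQL, since HANA speaks standard JDBC. The sketch below is a hypothetical illustration only: the table and columns are invented, and it assumes the Hadoop and R stages have already loaded pre-processed variant data into the in-memory store.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class VariantLookup {

  public static void main(String[] args) throws Exception {
    // "jdbc:sap://host:port" is SAP HANA's JDBC URL format; the driver jar
    // must be on the classpath. Host, credentials and schema are placeholders.
    try (Connection conn = DriverManager.getConnection(
            "jdbc:sap://hana-host:30015", "user", "password");
         PreparedStatement stmt = conn.prepareStatement(
            "SELECT variant_id, gene, frequency "
          + "FROM GENOME_VARIANTS "            // hypothetical table
          + "WHERE patient_id = ? AND frequency < 0.01")) {
      stmt.setString(1, args[0]);
      try (ResultSet rs = stmt.executeQuery()) {
        while (rs.next()) {
          // Rare variants seen in the patient but not in the healthy baseline.
          System.out.printf("%s %s %.4f%n",
              rs.getString("variant_id"), rs.getString("gene"),
              rs.getDouble("frequency"));
        }
      }
    }
  }
}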

The end result is that doctors can compare a cancer patient’s genome data with that of healthy individuals and get the analysis "before the patient leaves the hospital," Ricci said.

The National Institutes of Health could use such a system in its cancer research efforts, Ricci said. And because the technology stack can combine independent real-time data feeds, such as 911 calls and search-and-rescue reports, and display the information visually, it could prove useful to agencies responding to natural disasters or major emergencies.

"Hadoop with in-memory technology would allow emergency responders to have a holistic view of what’s going on," Ricci said. "How do you gather all that information up, make sense of it all so you can coordinate from a central or multiple locations and make sure it’s done efficiently and nobody’s left out? That’s what we’re trying to do."

On its own, Ricci said, Hadoop is a useful tool for some, but he suggested its best role in the evolution of big data is in concert with other tools that "bring information together in a better user interface."

"Hadoop does not fit all – it has limitations – and it’s not easy for a business person or someone to go in quickly and garner insight without help from technologists," Ricci said. "The evolution of big data comes down to making the information available to all stakeholders."