What the heck is Hadoop?

The open-source tool simplifies big-data management, but don't think of it as just another means of data analysis, experts say. In the right application, Hadoop frees users to explore information in whole new ways.

Hadoop shines in applications that involve comparisons between gigantic databases, such as analyzing genomic sequences. (Stock image)

Every day, people send 150 billion new email messages. The number of mobile devices already exceeds the world's population and is growing. With every keystroke and click, we are creating new data at a blistering pace.

This brave new world is a potential treasure trove for data scientists and analysts who can comb through massive amounts of data for new insights, research breakthroughs, undetected fraud or other yet-to-be-discovered purposes. But it also presents a problem for traditional relational databases and analytics tools, which were not built to handle the volume of data now being created. Another challenge is the mix of sources and formats, which include XML, log files, objects, text, binary and more.

"We have a lot of data in structured databases, traditional relational databases now, but we have data coming in from so many sources that trying to categorize that, classify it and get it entered into a traditional database is beyond the scope of our capabilities," said Jack Collins, director of the Advanced Biomedical Computing Center at the Frederick National Laboratory for Cancer Research. "Computer technology is growing rapidly, but the number of [full-time equivalent positions] that we have to work with this is not growing. We have to find a different way."

Enter Apache Hadoop, an open-source, distributed programming framework that relies on parallel processing to store and analyze tremendous amounts of structured and unstructured data. Although Hadoop is far from the only big-data tool, it is one that has generated remarkable buzz and excitement in recent years. And it offers a possible solution for IT leaders who are realizing that they will soon be buried in more data than they can efficiently manage and use.

"In the last 10 years, this is one of the most important developments because it's really transforming the way we work, our business processes and the way we think about data," said Ed Granstedt, a vice president at predictive analytics firm GoldBot Consulting. "This change is coming, and if government leaders don't understand how to use this change, they're going to get left behind or pushed aside."

Why it matters

Hadoop is more than just a faster, cheaper database and analytics tool. In some cases, the Hadoop framework lets users query datasets in previously unimaginable ways.

Take the Frederick laboratory, whose databases contain scientific knowledge about cancer genes, including the expression levels of a gene and what chromosome it is on. New projects seek to mine literature, scientific articles, results of clinical trials and adverse-event databases for related or useful connections. Other researchers are exploring whether big-data analysis of patient blogs, Google searches and Twitter feeds can also provide useful correlations.

"In many cases, we're trying to find associations, so we're doing mining and asking questions that weren't previously imagined," Collins said.

Last summer, Collins' team conducted a study of two Hadoop implementations with both real and simulated data to see whether the framework would improve performance and allow for new types of analysis. The project reduced hours-long computations to minutes and won a government big-data award from CTOvision. Building on that success, the institution is working on the next phase, which aims to better integrate data and improve visualization of results.

"Data is the new natural resource," said Josh Sullivan, a vice president at Booz Allen Hamilton and founder of the Hadoop-DC Meetup group. "Hadoop is the first enterprise tool we have that lets us create value from data. Every agency should be looking at Hadoop."

However, implementation is not as simple as converting existing databases into a Hadoop framework. That would be a missed opportunity for strategic data analysis, Sullivan said. Moreover, many existing databases should be maintained separately and connected to Hadoop databases and analytics.

As a general rule, any group with more than 2 terabytes of data should consider Hadoop. "Anything more than 100 [terabytes], you absolutely want to be looking at Hadoop," Sullivan said.

David Skinner, leader of the Outreach, Software and Programming Group at the Energy Department's Lawrence Berkeley National Laboratory, said he hopes Hadoop will offer a solution to the growing problem of data blindness, which keeps scientists from deeply understanding their own datasets. Skinner's group evaluates new technologies and makes them accessible to the thousands of scientists who use the lab’s National Energy Research Scientific Computing Center (NERSC).

"We're very interested in technologies that deliver data transparency and allow people to do analysis with large sets of data," said Skinner, whose group has been exploring scalable data solutions for a couple of years. "Science is increasingly inundated with data. If we can revolutionize the way we think about what scientists can do with data analysis, it would change the perspective on what is possible."

The fundamentals

Hadoop evolved out of Google researchers' work on the MapReduce framework, which Yahoo programmers brought into the open-source Apache environment. Core Hadoop consists of the Hadoop Distributed File System for storage and the MapReduce framework for processing. Queries migrate to the data rather than pulling the data into the analysis, yielding fast load times but potentially slower queries. In addition, Hadoop queries require higher-level programming skills compared with the user-friendly SQL, so developers have released additional software solutions with colorful names such as Cassandra, HBase, Hive, Pig and ZooKeeper to make it easier to program Hadoop and perform complex analyses.
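
To make the MapReduce idea concrete, here is a minimal sketch in plain Python. It does not use Hadoop or its APIs; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen only for illustration. It runs on one machine, whereas a real cluster would spread the map and reduce steps across many nodes that each hold part of the data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one line of input."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by their key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: collapse the values for one key into a single total."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence on a single machine."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data", "big clusters"])
print(counts)
```

Because each line is mapped independently and each key is reduced independently, both steps can in principle be handed to whichever machine already stores the relevant data, which is the sense in which queries migrate to the data.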

Hadoop caveats

For all the promise that Hadoop offers, it can't do everything. IT leaders should be aware of the following limitations.

The technology is young. Compared with relational databases, Hadoop is in its infancy. There are many competing versions -- both free and commercial -- and some contain bugs or could end up being unsupported if a vendor goes out of business as the market consolidates.

Implementation requires fine-tuning. Hadoop is far from an out-of-the-box solution. You have to understand your data structures well and characterize your problem in order to get the most out of Hadoop.

Commercial uses differ from government and research applications. Many of the Hadoop solutions that were developed to help corporations do not easily translate to the government space.

All processes must change. Because an analyst who is comfortable only with SQL generally cannot write Hadoop queries, users need to create an entirely new analytic environment. That requires new roles for analysts and researchers, as well as for data scientists, a position that might not have been required previously.

"Like a database, Hadoop is a mechanism for storing, manipulating and querying data," said Steven Hillion, chief product officer at Alpine Data Labs. "Unlike databases, Hadoop can handle data in a very fluid way. It doesn't insist that you've structured your data. Hadoop is sort of a big dumping ground for whatever data you can throw at it. People who have struggled to deal with big data have found Hadoop to be a cheap and flexible and powerful platform for dealing with these very large volumes of unstructured and fluid data."

Because Hadoop evolved in the Internet space -- LinkedIn and Facebook were early adopters -- it is well-suited to the kind of data you find in those environments: log files, text files and the like. However, users should be aware of the upsides and downsides to parallel processing, which is Hadoop's salient characteristic.

"While the MapReduce programming model is very powerful because it makes it very easy to express a problem and run in parallel, there are lots of applications that just don't decompose that way," said Shane Canon, leader of the Technology Integration Group at NERSC. "It may require synchronization between the pieces. Maybe it's a complex workflow that has lots of synchronized parts."

Moreover, areas such as high-energy and nuclear physics typically rely on binary data formats, which do not work as well in Hadoop. On the other hand, bioinformatics is well-suited to Hadoop because the data comes from a sequencer and needs to be compared to a reference database for similarities. "That's something that fits well into a map model," Canon said.
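
The "map model" Canon describes can be sketched as follows. This is an illustrative toy, not a real bioinformatics pipeline: the reference sequences, the reads and the k-mer overlap score are all invented for this example, and the names (REFERENCE, kmers, best_match) are hypothetical. The point is only that each read is scored against the reference independently of every other read, so the work could be split across machines without any synchronization.

```python
from typing import Dict, Set, Tuple

# Hypothetical reference database: name -> sequence (invented for illustration).
REFERENCE: Dict[str, str] = {
    "gene_a": "ACGTACGTGA",
    "gene_b": "TTGACCATGC",
}


def kmers(seq: str, k: int = 3) -> Set[str]:
    """Return the set of all length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def best_match(read: str) -> Tuple[str, str, int]:
    """Map step: score one read against every reference by shared 3-mers."""
    read_kmers = kmers(read)
    scores = {name: len(read_kmers & kmers(ref)) for name, ref in REFERENCE.items()}
    best = max(scores, key=scores.get)
    return read, best, scores[best]


# Each read is handled on its own, so map() could be replaced by a
# distributed map across many nodes without changing best_match.
reads = ["ACGTAC", "GACCAT"]
results = list(map(best_match, reads))
print(results)
```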

In two years of working with Hadoop, NERSC leaders have found that rather than establishing dedicated clusters, users have the most success by bringing up a Hadoop cluster, running their applications and then tearing down the cluster -- even though they miss out on some of the positive features of Hadoop. That could change as the technology evolves and more user-friendly applications are created.

A survey by the Data Warehousing Institute found an average of 45 machines in a Hadoop cluster, with a median of 12, suggesting the existence of a few extremely large clusters.
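
A small worked example shows why an average far above the median points to a few very large clusters. The cluster sizes below are hypothetical, chosen only so that they reproduce the survey's two summary figures; they are not the survey's data.

```python
from statistics import mean, median

# Hypothetical machines-per-cluster figures, invented for illustration.
# Six modest clusters and one very large one.
cluster_sizes = [4, 8, 10, 12, 12, 14, 255]

average_size = mean(cluster_sizes)   # pulled upward by the single large cluster
median_size = median(cluster_sizes)  # the middle value, unaffected by the outlier

print(f"average: {average_size}, median: {median_size}")
```

In this made-up sample, removing the single 255-machine cluster would drop the average to 10 while leaving the median close to where it was, which is the pattern the survey figures suggest.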

"You're not going to see the benefit until you're running a larger environment," said David Jonker, SAP's director of product marketing for big data. "The true benefit of Hadoop is when you have multiple machines together."

Hadoop appeals to IT leaders because of the improved performance, scalability, flexibility, efficiency, extensibility and fault tolerance it offers, said Glenn Tamkin, a software engineer at the NASA Center for Climate Simulation. Users can simply dump all their data into the framework without taking time to reformat it, which lifts a huge burden off NASA scientists working with 32 years of climate data.

The center has 36 nodes in its Hadoop cluster and envisions scaling up, said Tamkin, who added that potential Hadoop converts still have to understand the data and know whether Hadoop is an appropriate solution.

"Make sure that your base design or format is able to solve your use cases," he said. "If you make a wrong decision, you're kind of hosed."

Nevertheless, any skeptic need only look at how the Defense Department is using Hadoop to provide real-time tactical information in support of battlefield missions and intelligence operations, or at the genome sequencing that can now be accomplished in a few minutes instead of hours.

"We've found it a really exciting technology to work with and investigate," said Deb Agarwal, head of the Advanced Computing for Science Department at Lawrence Berkeley National Laboratory. "Sure, there are places where it doesn’t fit our paradigms well, but it helps point us to areas where we could make improvements."
