Big data's security imperative

In big data projects, the data might reside in a traditional database or in various unstructured data stores. It might be stored on premises or scattered across multiple sites. The very complexity of big data projects makes data security both a challenge and an absolute imperative.

When big data began catching on, the tool of choice for many organizations was Hadoop, an open source programming framework that supports the processing of large data sets in a distributed computing environment.

Some IT experts were concerned about the security risks that came with Hadoop. Still, from the beginning, companies such as Facebook and eBay turned to the technology to do what nothing else could: collect, aggregate, analyze, and share structured and unstructured data from a variety of disparate sources.

In fact, Hadoop’s biggest benefit was also its Achilles’ heel: when data sets were brought together, the security clearance required for one set might be completely different from that required for another. With big data being made available to multiple stakeholders and, in some cases, constituents, there was no reliable way to protect its sources.

Today, while Hadoop has more security features – mostly software add-ons from other developers and service providers – security remains a serious concern for any IT executive considering a big data project. They must consider the security both of the data within a cluster of servers and of the cluster itself.

IT is charged with setting up encryption for the data and authentication or identity management for the clusters. In addition, especially for organizations that allow mobile or web-based access, IT must ensure the security of the applications and of the data they produce. In particular, IT must make sure that crucial and classified data doesn’t end up downloaded onto a mobile device, which can be lost or stolen at any time.
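
To make the encryption piece concrete, here is a minimal sketch of encrypting records before they land in shared cluster storage. It is written in Python with the third-party cryptography package; the pipeline step and record format are assumptions for illustration, not a Hadoop-specific API.

    # Sketch: encrypt records before writing them to shared cluster storage.
    # Requires the third-party "cryptography" package (pip install cryptography).
    # Key management (rotation, vault/HSM storage) is deliberately out of scope.
    from cryptography.fernet import Fernet

    # In practice the key would come from a secrets manager, never from code.
    key = Fernet.generate_key()
    cipher = Fernet(key)

    record = b"patient_id=12345,reading=..."
    token = cipher.encrypt(record)          # ciphertext is safe to store
    assert cipher.decrypt(token) == record  # only key holders can read it back

Authentication for the cluster itself, by contrast, is typically handled at the platform level – in Hadoop deployments, most commonly via Kerberos – rather than in application code.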

It’s everywhere

Another problem, says Forrester analyst Mike Gualtieri, is that IT might not know which data is important for a specific query. “So when the line of business or end users say, ‘Give me all you’ve got,’ IT may do so without making sure the appropriate security standards are in place.”

Gualtieri recounts a client that had attempted to “anonymize” medical data being used in predictive analytics. Although identifying fields had been scrubbed from the data sets, users were still able to infer, and inadvertently exposed, information that should never have come into public view.
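
The failure Gualtieri describes is re-identification: fields such as ZIP code, birth date, and gender survive scrubbing, yet in combination they can single a person out. One standard guard is k-anonymity, which requires every combination of such quasi-identifiers to match at least k records. A minimal sketch in Python follows; the field names and the value of k are illustrative assumptions.

    # Sketch of a k-anonymity check: flag quasi-identifier combinations
    # shared by fewer than k records, which could single someone out.
    from collections import Counter

    def risky_groups(records, quasi_identifiers, k=5):
        """Return quasi-identifier combinations matching fewer than k records."""
        groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
        return {combo: n for combo, n in groups.items() if n < k}

    # Hypothetical "anonymized" rows: names are gone, but the combination of
    # ZIP code, birth year, and gender can still be unique.
    rows = [
        {"zip": "02139", "birth_year": 1958, "gender": "F", "diagnosis": "..."},
        {"zip": "02139", "birth_year": 1958, "gender": "M", "diagnosis": "..."},
    ]
    print(risky_groups(rows, ["zip", "birth_year", "gender"], k=2))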

“The challenge becomes you want to give access to as much data as possible, but you still need to protect privacy,” he says.

Mobile data is even more problematic because it is not only generated in real time but also tagged with GPS positioning. “You’ve got to do a lot more governance when you’re dealing with so much personally identifiable information,” says Gualtieri.
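
One common governance measure for location data is to coarsen coordinates before they enter an analytics store, trading precision for privacy. A sketch of the idea, where the rounding threshold is an assumption and the right precision depends on the data set:

    # Sketch: coarsen GPS coordinates before storing them for analysis.
    # Two decimal places is roughly 1 km of precision; this is an
    # illustrative choice, not a standard.
    def coarsen(lat, lon, places=2):
        return round(lat, places), round(lon, places)

    print(coarsen(42.361145, -71.057083))  # -> (42.36, -71.06)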

Cloud installations are also giving IT pause, says Evan Quinn, a senior principal analyst with ESG. When data is on premises, IT has more control over it. Once it is put into the public cloud, IT must worry about whether the data sits on a shared resource, whether the servers are inside or outside the country, and whether the cloud provider’s security and governance practices could withstand an audit.

“You need to go to the cloud provider and ask point blank to see a checklist of its practices,” he says. He also suggests ensuring that employees and, in some cases, constituents access data only via secure connections.
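
On the secure-connections point, a client can at least verify that a provider’s endpoint presents a valid TLS certificate before any data moves. A sketch using only Python’s standard library; the hostname is a placeholder.

    # Sketch: confirm a data endpoint presents a valid TLS certificate
    # before exchanging any data. Standard library only.
    import socket
    import ssl

    def check_tls(host, port=443):
        context = ssl.create_default_context()  # verifies cert chain and hostname
        with socket.create_connection((host, port), timeout=5) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                return tls.version(), tls.getpeercert()["subject"]

    print(check_tls("example.com"))  # raises ssl.SSLError on a bad certificate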

And finally, IT executives need to think carefully about the accessibility of big data query results. Sometimes employees are so excited about what they’re seeing that they extend access to others in their department or to those working with them on projects. “You don’t want to get any surprises and find out that you’ve just distributed data to 1,000 people who were never cleared to see it in the first place,” says Quinn.
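
Quinn’s warning amounts to an access-control check on the result set itself, not just the raw inputs. A minimal sketch of gating distribution on clearance; the directory, labels, and helper are hypothetical.

    # Sketch: gate distribution of query results on recipients' clearance,
    # so results aren't forwarded to people never cleared to see the inputs.
    CLEARANCES = {"alice": "restricted", "bob": "public"}  # hypothetical directory
    LEVELS = ["public", "internal", "restricted"]

    def cleared(user, result_label):
        """True only if the user's clearance meets or exceeds the result's label."""
        user_level = CLEARANCES.get(user, "public")
        return LEVELS.index(user_level) >= LEVELS.index(result_label)

    recipients = ["alice", "bob"]
    print([u for u in recipients if cleared(u, "restricted")])  # -> ['alice']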