Managing big data

More agencies are beginning to put their petabytes to work, but the most visible success stories are still localized efforts.

John Holdren, director of the White House’s Office of Science and Technology Policy, doesn’t mince words when it comes to his department’s ambitious plans for using big data.

“In the same way that past federal investments in information technology R&D led to dramatic advances in supercomputing and the creation of the Internet, the initiative we are launching today promises to transform our ability to use big data for scientific discovery, environmental and biomedical research, education, and national security,” Holdren said in March.

The initiative in question is a $200 million effort the Obama administration launched to investigate uses for big data at five major agencies: the National Science Foundation, the National Institutes of Health, the Defense Department, the Energy Department and the U.S. Geological Survey.


Six months later, the ball has started to roll: On Oct. 3, NSF and NIH announced their first funding awards through the big data initiative. The eight awards, totaling $15 million, went to a diverse set of projects. Among them are a collaborative effort by Carnegie Mellon University and the University of Minnesota to simulate language processing in the brain and a Brown University project to design and test algorithmic and statistical techniques for analyzing large-scale, heterogeneous and noisy cancer datasets.

“We've barely scratched the surface,” Suzi Iacono, a senior science adviser at NSF and co-chairwoman of the Big Data Senior Steering Group, told FCW. She added that more awards will be presented in the months to come.

The steering group is the interagency team responsible for executing the administration’s big data plans across some 20 agencies. It was chartered in 2011 and includes representatives from the Defense Advanced Research Projects Agency, the Office of the Secretary of Defense, DOE, the Department of Health and Human Services, and NASA.

“Everyone wants to be part of this,” Iacono said. “I think that's because we’ll be able to accelerate the pace of discovery if we can mine the datasets we have. Truly, we’ll be able to transform commerce and the economy and address the most pressing issues facing society.”

Defining big data

But of all the technology buzzwords that have crossed from Silicon Valley to politics and governance in recent years — such as “open government” and “cloud computing” — “big data” is arguably the most vague.

That is not necessarily the government’s fault. After all, even the tech sector has trouble defining big data. However, most agree that the concept boils down to three attributes of digital datasets: volume, velocity and variety.

Myriad government agencies “have been collecting data for over a hundred years now, and we finally have technology and the wherewithal to use it,” said Dan Olds, founder of IT advisory firm Gabriel Consulting Group and chief editor of the blog “Inside-BigData.”

Those reams of data are now so voluminous that they are nearly unmanageable. As the TechAmerica Foundation said in a report released Oct. 5, “Since 2000, the amount of information the federal government captures has increased exponentially. In 2009, the U.S. government produced 848 petabytes of data, and U.S. health care data alone reached 150 exabytes. Five exabytes of data would contain all words ever spoken by human beings on earth.”

Extracting value

The challenge is not just how to store and manage all that data but what to do with it. Federal agencies hope to harness it for new insights that benefit government and the public. New ways to organize and analyze data, in fact, could be a matter of life or death.

“Imagine some kind of weather emergency, if we could have data from all the models of that emergency, then integrate that with real-time weather data and census data and give this to responders on the ground,” Iacono said. “Being able to make these kinds of split-second decisions is...one of the holy grails of big data.”

Big data has already saved lives in smaller settings. In a pilot program launched in 2008 at several hospitals, big data was used to monitor premature infants for signs of late-onset neonatal sepsis, an infection that is often fatal by the time it is detected via traditional monitoring systems. The program, managed by Carolyn McGregor of the University of Ontario Institute of Technology, used IBM InfoSphere Streams software to analyze as many as 16 concurrent streams of physiological data in real time and alert hospital staff to subtle but potentially life-threatening changes. The system, which uses an approach called predictive analytics, runs on little more than three laptop PCs.
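
To make the idea concrete, here is a minimal sketch in Python of the kind of sliding-window monitoring such a system performs. The heart-rate feed, window size and alert threshold are invented for illustration; this is not the clinical logic of the InfoSphere Streams deployment.

from collections import deque
from statistics import mean, stdev
import random

WINDOW = 60  # readings per sliding window, e.g., one sample per second

def monitor(stream, z_threshold=3.0):
    """Yield (time, value, baseline) when a reading drifts far from the recent norm."""
    window = deque(maxlen=WINDOW)
    for t, value in stream:
        if len(window) == WINDOW:
            baseline, spread = mean(window), stdev(window)
            if spread > 0 and abs(value - baseline) / spread > z_threshold:
                yield t, value, baseline
        window.append(value)

# Hypothetical heart-rate feed: steady around 140 bpm, then a sustained shift.
random.seed(1)
feed = [(t, random.gauss(140, 2) + (20 if t >= 150 else 0)) for t in range(200)]
for t, value, baseline in monitor(feed):
    print(f"t={t}s: {value:.0f} bpm deviates from baseline {baseline:.0f} bpm")

A production stream-processing system distributes this kind of windowed computation across many concurrent feeds and machines; the principle, watching for deviations from a recent baseline, is the same.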

Law enforcement agencies are also using the combination of big data and predictive analytics. Memphis, Tenn., saw a 31 percent drop in violent crime from 2006 to 2011 after the police department launched its Blue CRUSH (Crime Reduction Utilizing Statistical History) pilot program in partnership with the University of Memphis. The program combines data from disparate sources — surveillance cameras, crime records, even benign datasets such as vehicle registration records — and uses it to provide officers with on-demand information about suspects and victims, as well as the likelihood of crime in any designated area of the city. Blue CRUSH uses IBM’s SPSS software.
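
The analytics behind such programs can be illustrated with a toy example. The Python sketch below ranks areas by late-night incident counts; the area names, time window and data are invented, and a real system such as Blue CRUSH draws on far richer features and statistical models.

from collections import Counter
from datetime import datetime

# A hypothetical incident log; real inputs would come from crime records,
# camera feeds and other sources.
incidents = [
    ("precinct-3", "2011-06-01 23:15"),
    ("precinct-3", "2011-06-02 22:40"),
    ("precinct-7", "2011-06-02 14:05"),
    ("precinct-3", "2011-06-03 23:55"),
]

def hotspot_scores(records, night_start=20, night_end=4):
    """Rank areas by late-night incidents -- a crude stand-in for a real model."""
    scores = Counter()
    for area, stamp in records:
        hour = datetime.strptime(stamp, "%Y-%m-%d %H:%M").hour
        if hour >= night_start or hour < night_end:
            scores[area] += 1
    return scores.most_common()

print(hotspot_scores(incidents))  # [('precinct-3', 3)]; the daytime incident is filtered out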

For Olds, it is no surprise that state and local agencies are outpacing their federal counterparts when it comes to putting big data to use. “The examples that I’ve seen that have shown some results are typically from city and state initiatives,” he said. “The sheer amount of federal bureaucracy makes it hard for federal agencies to get all the necessary parties on board to put together a good initiative.”

Nonetheless, the federal government has been plowing ahead, though often not in the public eye. For instance, many intelligence and defense agencies now rely on sentiment analysis, another big data technique. It mines data, especially from social media, in an effort to predict mass events and uprisings, such as the Arab Spring demonstrations, before they unfold.
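
In its simplest form, the technique scores text against word lists. The Python sketch below is a deliberately naive illustration; the lexicons are invented, and operational systems rely on trained statistical models rather than hand-built word lists.

# Invented lexicons; real sentiment analysis uses trained models.
POSITIVE = {"calm", "hope", "support", "stable"}
NEGATIVE = {"protest", "unrest", "anger", "strike"}

def sentiment(post):
    """Positive scores suggest calm; negative scores suggest agitation."""
    words = post.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

for post in ("Protest and unrest spreading downtown",
             "Hope the streets stay calm tonight"):
    print(f"{sentiment(post):+d}  {post}")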

In the past year, the FBI, the CIA and even the Federal Reserve have either published requests for proposals or pursued other means of developing sentiment analysis tools. The agencies, however, remain mum on their progress. The CIA declined to comment for this article, telling FCW that “we cannot answer your specific questions.”

Private-sector offerings

The private sector recognizes the government's vested interest in big data, and companies are lining up to offer their products and services to agencies. In August, longtime government IT services contractor Carahsoft announced a partnership with Cloudera, which provides software and services grounded in the popular Apache Hadoop open-source big data platform, to make the companies’ offerings available through the General Services Administration’s Schedule 70 contracts.
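
For readers unfamiliar with what Hadoop actually does, its canonical demonstration is a distributed word count. With Hadoop’s Streaming interface, the mapper and reducer can be short scripts in any language; the Python sketch below shows both halves. The file names and two-script layout are illustrative, and the scripts would be submitted to a cluster via the hadoop-streaming JAR.

# mapper.py -- emits "word<TAB>1" for every token read from stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(word + "\t1")

# reducer.py (a separate file) -- sums counts per word; Hadoop delivers
# mapper output sorted by key, so identical words arrive consecutively.
import sys

current_word, total = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word != current_word and current_word is not None:
        print(current_word + "\t" + str(total))
        total = 0
    current_word = word
    total += int(count)
if current_word is not None:
    print(current_word + "\t" + str(total))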

The two companies have been working together for a while, Carahsoft President Craig Abod said. “We have expanded our footprint of Hadoop users across the Intelligence Community and civilian sector exponentially,” he said, though he declined to get into specifics, noting only that “the projects we are working on deal with matters of national security, intelligence, cybersecurity, health care and energy.”

Abod said Carahsoft now offers agencies big data services from multiple companies, including Datameer, Digital Reasoning, MarkLogic, DataDirect, EMC and VMware. He added that the firm’s annual Government Big Data Forum has helped government IT professionals and contractors explore “issues of common concern.”

Abod said the company’s biggest challenge is “the slow adoption of a number of big data technologies, including Apache Hadoop,” in the federal space.

Cloudera also has been establishing itself as a provider of big data solutions to government. “Cloudera's technology is being used in some capacity or another in almost every government agency,” a Cloudera spokesperson said in a statement. “Many use it via the GSA program called USASearch, which provides advanced citizen-focused services to agency websites. Others use it as part of embedded systems being shipped by partners such as Oracle, Dell and HP.”

Cloudera officials declined to share success stories so far, telling FCW that “while we are not able to publicly disclose specific government use cases, we can say that Cloudera advises numerous government agencies and provides ongoing hands-on implementation support, systems design and architecture.”

Regarding the gap between federal agencies and private companies, Olds said, “What private enterprises are doing is much more directed and focused.... But big data initiatives need to have concrete goals. You shouldn't be just going on fishing expeditions. And it’s hard for government at the federal level to move that way.”