Delivering on the promise of big data

Agencies have moved beyond the buzzword and are making big data part of their everyday missions.

"Big Data Goes Mainstream" front cover.

Researchers fighting Alzheimer’s disease are collaborating on an unprecedented scale thanks to the National Institutes of Health’s launch of a new big data portal. The goal is to speed up an arduous discovery process that has so far been a litany of failures.

The project is one of a growing number of mission-driven collaborations that suggest government is moving beyond the buzzword phase and learning how to make big data part of everyday business.

Federal agencies and businesses have different needs when it comes to big data, of course, but the ability to engage in large-scale collaborative analytics with many partners is driving much of the activity in the federal government. The NIH portal will enable researchers from disparate areas of expertise to collaborate and quickly share research and analytical models based on molecular and genomic data from the brain samples of 2,000 people who had the disease, which currently has no cure.

Suzana Petanceska, health science administrator in the Division of Neuroscience at NIH’s National Institute on Aging and coordinator of the Alzheimer’s portal, said: “The consortium of scientists will generate a number of predictive models of the disease and will prioritize a new set of targets that could be the foundation for the development of new therapies for Alzheimer’s.”

NIH’s project is the biotech equivalent of an intellectual jam session between the federal government, the pharmaceutical industry, researchers in academia and several nonprofit organizations in a field that is desperate for success. The Alzheimer’s Association estimates that 5.3 million Americans currently suffer from the disease, and that number is expected to balloon to 13.8 million among people 65 and older by 2050. Furthermore, the annual cost of care could top $1 trillion by 2050 — 26 times more than the Department of Homeland Security’s fiscal 2015 budget request. A study published last summer found that relatively few drug trials are being developed for treating Alzheimer’s, and 99.6 percent of those conducted from 2002 to 2012 failed.

“This is a project that is taking many millions of features of each individual sample and trying to compile that information into a snapshot of Alzheimer’s disease,” said Lara Mangravite, director of the systems biology research group at Sage Bionetworks, the nonprofit organization that built the portal.

The project is expected to generate less than 20 terabytes of data in total. That’s not a huge amount of information to analyze by big-data standards, but it’s the equivalent of twice the printed collection of the Library of Congress.

“In my opinion, this is a big-data project in that it is taking a large amount of information and trying to distill it down to universal truths,” Mangravite said.

Prior to the portal’s creation, the molecular datasets needed to perform analyses were scattered in repositories around the country, and the data was not annotated as it is now. As a result, the NIH-funded research was not always easily accessible to the entire research community.

The relevant research is now available on Sage Bionetworks’ secure platform, which runs on cloud servers operated by Amazon.

“The technological platform that we’ve built, which uses Amazon as a framework, has built within it a long series of data security protections to make sure that the data is ethically and appropriately shared,” Mangravite said.

The project seeks to remove the intellectual and physical barriers that have developed over the decades between scientists in their niches of expertise so that they can work together to create a larger view of Alzheimer’s disease.

“The goal is to collect many molecular measurements from human samples and apply analytical methods to reconstruct the biological processes that drive the disease process and to understand how they interact with each other within and across cells, tissues and organs, rather than studying them in isolation,” Petanceska said.

The project, in other words, is not some big-data proof of concept. According to Petanceska, it’s enhancing NIH’s approach to its central mission: “The intent is to make the most out of the public’s investment in research and deliver on the promise to advance public health.”

An analytical engine

That quest for the big picture is sweeping through NIH in other ways as well. The agency will soon launch several other big-data projects, the most prominent of which is its Precision Medicine Initiative. And similar pivots are apparent elsewhere in the federal government.

To bring some cohesion to the movement, in February the White House hired Silicon Valley data scientist DJ Patil to be deputy chief technology officer for data policy and chief data scientist at the Office of Science and Technology Policy.

In a public memo released shortly after the White House announced his appointment, Patil said he’ll provide the vision for how to maximize the social return on government-generated information and create federal data policies to enable agencies to efficiently execute big-data projects and recruit talent.

The 45-year-old alumnus of LinkedIn, Skype, PayPal and eBay joins a growing cadre of chief data officers at various agencies across the government.

“Organizations are increasingly realizing that in order to maximize their benefit from data, they require dedicated leadership with the relevant skills,” Patil wrote. “Many corporations, local governments, federal agencies and others have already created such a role, which is usually called the chief data officer or the chief data scientist. The role of an organization’s CDO or CDS is to help their organization acquire, process and leverage data in a timely fashion to create efficiencies, iterate on and develop new products, and navigate the competitive landscape.”

Patil pointed to NIH’s Precision Medicine Initiative as a federal priority. That effort, as NIH describes it, is “an emerging approach for disease treatment and prevention that takes into account individual variability in genes, environment and lifestyle.” And he said NIH and other agencies “will work to deliver not just raw datasets, but also value-added ‘data products’ that integrate and usefully present information from multiple sources.”

Under the initiative, investigators plan to analyze the genes, electronic health records and lifestyles of at least 1 million Americans.

One of the agencies leading the way in terms of business intelligence is the Centers for Medicare and Medicaid Services.

The agency’s Office of Enterprise Data and Analytics, launched last November, could serve as a case study for other federal agencies because it has such a clearly defined structure and detailed mission statement. Niall Brennan, who helped establish the unit, is its director and chief data officer. And just like a commercial entity, the office contains an Information Products Group, a formal acknowledgment of the role an agency can play in packaging information that helps support decision-making both inside and outside the government.

That’s an especially important role at CMS: The agency is becoming one of the analytical engines behind health care reform. In addition to gathering and analyzing information on payments and medical conditions related to Medicare, Medicaid and the Children’s Health Insurance Program, CMS is responsible for analyzing other data related to programs that seek to reduce the costs and increase the efficiency of health care delivery.

The agency’s spending reflects that focus. In fiscal 2014, CMS spent $32.6 million on contracts for analytics, dashboards and reporting — second only to NASA in total spending in that area, according to government market analytics company Govini.

And although it is not immune to complaints about the usability of its data, the Office of Enterprise Data and Analytics has managed to keep the public focused on how the health care industry operates and what that means for public pocketbooks. For example, its release of data on hospital charges nationwide generated front-page headlines and academic analyses about the potential social impact of widespread discrepancies in charges for similar procedures. Its release of the payment records for the 880,000 doctors and other health care providers paid by Medicare in 2014 had a similar impact.

CMS’ Fraud Prevention System is another widely lauded example of a successful big-data project. The system, which launched in 2011, analyzes the 4.5 million claims that flow in daily from the 1.5 million Medicare providers. It uses predictive algorithms that rely on a variety of factors — such as payment patterns, contact information, tips and detailed, eight-year-long historical records — to help CMS detect suspicious patterns in billing activity. The system also helps law enforcement officials speed up their investigations and prosecutions.
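
To give a sense of the general approach, though not of CMS’ actual models, which are proprietary and far more elaborate, a minimal sketch of predictive claim scoring might look like the following. The feature names, figures and single logistic model are entirely hypothetical.

```python
# Illustrative sketch only: not CMS' Fraud Prevention System.
# Feature names, values and the single model below are hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical history: one row per provider's recent billing profile,
# labeled with the outcome of past investigations.
history = pd.DataFrame({
    "claims_per_day":     [40, 55, 300, 35, 280, 60],
    "avg_claim_amount":   [120.0, 95.0, 640.0, 110.0, 720.0, 130.0],
    "share_after_hours":  [0.05, 0.02, 0.61, 0.04, 0.55, 0.07],
    "confirmed_improper": [0, 0, 1, 0, 1, 0],
})

features = ["claims_per_day", "avg_claim_amount", "share_after_hours"]
model = LogisticRegression().fit(history[features], history["confirmed_improper"])

# Score today's incoming billing profiles and surface the riskiest
# ones for human investigators before payments go out the door.
incoming = pd.DataFrame({
    "claims_per_day":    [38, 310],
    "avg_claim_amount":  [105.0, 690.0],
    "share_after_hours": [0.03, 0.58],
})
incoming["risk_score"] = model.predict_proba(incoming[features])[:, 1]
print(incoming.sort_values("risk_score", ascending=False))
```

The real system layers many models and human review on top of this basic pattern; the toy example only illustrates the idea of scoring incoming claims against historical behavior.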

As with NIH’s Alzheimer’s portal, CMS has emphasized the importance of assembling a multidisciplinary team whose members work together to analyze information and achieve the project’s goal. In this case, a team that included policy experts, clinicians, field investigators and data analysts developed and tested 74 models used to flag potential cases of fraud, waste or abuse.

In a report to Congress in June 2014, CMS officials wrote that “bringing together teams with a variety of skill sets is a best practice in model development — ensuring that the [Fraud Prevention System] models yield solid, actionable leads.”

CMS awarded the contract for developing the system to Northrop Grumman and the modeling contract to IBM. Fifteen full-time staff members oversee the system and run the analytics part of the program. A CMS spokesperson told FCW it is crucial to involve the end users of the analytics system from the beginning of the development process and on an ongoing basis so that the system accommodates their work processes.

In addition to supporting prosecutions and sniffing out suspicious payments before they’re made, CMS uses its data to block potentially problematic providers from enrolling and to revoke the Medicare provider status of entities that have been found to bill the government inappropriately on multiple occasions.

As a result, the Department of Health and Human Services reported in March that its fraud-prevention efforts had contributed significantly to the recovery of $3.3 billion for taxpayers in fiscal 2014.

Improving access to data

Of course, for some agencies, big data is nothing new. The Federal Emergency Management Agency, for example, built a prototype of a data warehouse business intelligence tool 15 years ago, said Mark DeRosa, director of business intelligence and analytics at Definitive Logic, a management consulting and systems engineering firm.

FEMA has been using the cost/benefit analysis tool to analyze the vast streams of data pouring into the agency and help it quantify the relative value of engaging in various disaster-mitigation projects. The goal is to prioritize projects that would provide taxpayers with a maximum return on investment.
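
The underlying arithmetic is straightforward even if FEMA’s tool is not: roll up each project’s estimated costs and avoided losses, compute a benefit/cost ratio, and rank the candidates. Here is a minimal sketch with hypothetical project names and figures, not FEMA’s actual data or methodology.

```python
# Illustrative sketch only: not FEMA's cost/benefit tool. Project names,
# costs and avoided-loss estimates are hypothetical.
import pandas as pd

line_items = pd.DataFrame({
    "project":        ["Levee upgrade", "Levee upgrade",
                       "Culvert retrofit", "Culvert retrofit"],
    "cost":           [1_200_000, 300_000, 450_000, 150_000],
    "losses_avoided": [2_500_000, 900_000, 400_000, 120_000],
})

# Roll line items up to the project level and compute a benefit/cost ratio.
projects = line_items.groupby("project", as_index=False).sum(numeric_only=True)
projects["bcr"] = projects["losses_avoided"] / projects["cost"]

# Rank projects so those with the highest return on taxpayer investment
# rise to the top of the mitigation queue.
print(projects.sort_values("bcr", ascending=False))
```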

However, DeRosa said many agencies still live in a world of multiple, unconnected databases, which makes it difficult — if not impossible — for employees to access the data to derive actionable knowledge.

“We have to recognize that the average user of these systems is not a techie,” DeRosa said. “We can’t expect them to write code to access their data. What we need to do is to put together solutions that give users the ability to integrate raw data and to transform it into information. What I mean is applying the business rules and logic to that data and converting it into something that the average human can understand and to visualize that data so they can make a decision.”

Along those lines, Definitive Logic works with clients to overhaul and connect disparate databases and create dashboards to tease insights out of the information. A wide array of other contractors — ranging from longtime players such as Unisys to relative upstarts such as Splunk — are engaged in similar projects across government.

Another Silicon Valley firm is emerging to help companies deal with big-data problems. But instead of consulting on individual projects, venture capital-backed Alation offers its product as software as a service.

Alation enables companies to centralize their datasets so that they can engage in collaborative analytics projects more efficiently, optimize their data warehouse and better manage their data governance processes. Its clients include eBay, Square and MarketShare, a company that provides marketers with predictive analytics services to help them decide where to most effectively allocate their marketing budgets.

DeRosa wasn’t familiar with Alation but concurred with its basic premise. And he suggested that the new cadres of chief data officers — in addition to creating formal data management guidelines for their agencies and establishing return-on-investment calculations to demonstrate the value of big-data projects — could start with the basics. For many government workers, simply finding the relevant datasets is a complex, tortuous process, and chief data officers should help their agencies simplify it.

“They could focus on...reducing the complexity of getting and using data,” he said. “This is a big challenge: being able to simplify the retrieval of the data that you need.”