A draft of the NIST Big Data Interoperability Framework looks to establish a common set of definitions for data science.
Behind the "big data" cliché is an explosion in the volume of information collected by sensors, cameras, social media, e-commerce, science experiments, weather satellites, logistics and a host of other sources. But to extract valuable insights from the terabytes and petabytes of information, analysts have to know how to use datasets in their systems, and compare data from different sources.
A standards-based approach is one way to facilitate this process, and the National Institute of Standards and Technology is leading an effort to build consensus among the user community on the logistics, structure and security of data. A draft of the NIST Big Data Interoperability Framework, released April 6, looks to establish a common set of definitions for data science and common ground, or a "reference architecture," for what constitutes usability, portability, analytics, governance and other concepts.
"One of NIST's big data goals was to develop a reference architecture that is vendor-neutral and technology- and infrastructure-agnostic, to enable data scientists to perform analytics processing for their given data sources without worrying about the underlying computing environment," said NIST's Digital Data Advisor Wo Chang.
The framework is less a policy document than an agreed-upon set of questions that need to be answered, and challenges that need to be addressed in order to produce a consensus-based set of global standards for the production, storage, analysis and safeguarding of large, diverse datasets. NIST isn't looking to write specs for operational systems, or rules for information exchange or security. NIST's Big Data Public Working Group, which includes scientists in government, academia and the private sector, has released a seven-volume document designed to "clarify the underlying concepts of big data and data science to enhance communication among big data producers and consumers," per the report.
A set of use cases collected from contributors gets at the challenges facing government, researchers and industry in maintaining the viability and usability of current data, while preparing for the future.
For example, the National Archives and Records Administration faces the problem of processing and managing a huge volume of varied data, structured and unstructured, from different government agencies. That data may have to be gathered from different clouds and tagged to respond to queries, all while preserving security and privacy where required by law.
The Census Bureau is exploring the possibility of using non-traditional sources from e-commerce transactions, wireless communications and public-facing social media data to augment or mash up with its survey data to improve statistical estimates, and produce data that is closer to real-time. But that data has to be reliable, and confidentiality has to be preserved.
On the security side, the NIST report calls attention to a problem looming in the future: protecting data that might need to outlast both the lifespan and usefulness of the systems that house it and the security measures that protect it.
Some types of data, including medical imaging data, security video and geospatial imaging, were until relatively recently considered too large to be conveniently analyzed and shared over computer networks, and therefore weren't created with security and privacy in mind. That could be a problem down the road. The Internet of Things and the new troves of sensor data created by connected devices could also expose vulnerabilities in devices and data that were not previously considered at risk.
NIST is accepting comments on the framework through May 21.