NSF seeks cyber infrastructure to make sense of scientific data
- By Camille Tuutti
- Oct 04, 2011
The National Science Foundation has tapped a research team at the University of North Carolina-Chapel Hill to develop a national data infrastructure that would help future scientists and researchers manage the data deluge, share information and fuel innovation in the scientific community.
The UNC group will lead the DataNet Federation Consortium, which includes seven universities. The infrastructure that the consortium will try to create would support collaborative, multidisciplinary research and would "democratize access to information among researchers and citizen scientists alike," said Rob Pennington, program director in NSF's Office of Cyberinfrastructure.
"It means researchers on the cutting edge have access to new, more extensive, multidisciplinary datasets that will enable breakthroughs and the creation of new fields of science and engineering," he added.
The effort would be a "significant step in the right direction" in solving some of the key problems researchers run into, said Stan Ahalt, director at the Renaissance Computing Institute at UNC-Chapel Hill, which federates the consortium's data repositories to enable cross-disciplinary research. One of the issues researchers today grapple with is how to best manage data in a way that maximizes its utility to the scientific community, he said. Storing massive quantities of data and the lack of well-designed methods that allow researchers to use unstructured and structured data simultaneously are additional obstacles for researchers, Ahalt added.
The national data infrastructure may not solve everything immediately, he said, "but it will give us a platform to start working meticulously on more long-term rugged solutions or robust solutions."
The consortium, known as DFC, will use iRODS, the Integrated Rule-Oriented Data System, to implement a data management infrastructure. Multiple federal agencies are already using the technology: the NASA Center for Climate Simulation, for example, imported a Moderate Resolution Imaging Spectroradiometer satellite image dataset into the environment so academic researchers would have access to it, said Reagan Moore, principal investigator for the Data Intensive Cyber Environments research group at UNC-Chapel Hill that leads the consortium.
It's very typical for a scientific community to develop a set of practices around a particular methodology of collecting data, Ahalt explained. For example, hydrologists know where their sensors are and what those mean from a geographical perspective. Those hydrologists put their data in a certain format that may not be obvious to someone who is, for example, doing atmospheric studies, he said.
"The long-term goal of this effort is to improve the ability to do research," Moore said. "If I'm a researcher in any given area, I'd like to be able to access data from other people working in the same area, collaborate with them, and then build a new collection that represents the new research results that are found. To do that, I need access to the old research results, to the observational data, to simulations or analyze what happens using computers, etc. These environments then greatly minimize the effort required to manage and distribute a collection and make it available to research."
For science research as a whole, Ahalt said the infrastructure could mean a lot more than just managing the data deluge or sharing information within the different research communities.
"Data is the currency of the knowledge economy," he said. "Right now, a lot of what we do collectively and globally from an economic standpoint is highly dependent on our ability to manipulate and analyze data. Data is also the currency of science; it's our ability to have a national infrastructure that will allow us to share those scientific assets."
The bottom line: "We'll be more efficient at producing new science, new innovation and new innovation knowledge," he said.