How cloud can take open data to new heights
- By Jed Sundwall
- Jul 24, 2017
According to IDC, the volume of data being produced each day is growing at an explosive rate, and is now estimated at 2.5 exabytes. That’s equivalent to producing 250,000 Libraries of Congress every single day.
Much of it is "open data," which means the data can be used by anyone for any purpose without needing to pay a licensing fee. The availability of this data is a boon for entrepreneurs, scientists and public servants, who can use it to create new products, accelerate scientific discovery and provide better services to the public.
Unfortunately, the infrastructure usually used to serve this data is not keeping up with growth. Government data is still being provided with the assumption that users will download and store their own copies of data. That’s fine when a few gigabytes of data are being shared, but as data volumes increase, this approach simply doesn’t work.
For example, the National Oceanic and Atmospheric Administration’s new weather satellite, GOES-16, is estimated to produce one terabyte of data per day -- over 100 times the amount of its predecessor. Very few people have the hard disks or patience to download a terabyte of data. Using this old model of data distribution reduces the data’s value to end-users, and ultimately to the taxpayer.
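To put the terabyte-a-day figure in perspective, here is a rough back-of-the-envelope sketch of how long one day's output would take to download. The link speeds are illustrative assumptions, not figures from NOAA, and the calculation assumes the connection is fully used the whole time.

```python
TERABYTE = 10**12  # one decimal terabyte, in bytes


def download_days(size_bytes: float, mbps: float) -> float:
    """Days needed to move size_bytes over a link sustaining mbps megabits per second."""
    bits = size_bytes * 8
    seconds = bits / (mbps * 1_000_000)
    return seconds / 86_400  # seconds in a day


# Hypothetical connection speeds, chosen only for illustration.
for mbps in (10, 100, 1000):
    days = download_days(TERABYTE, mbps)
    print(f"{mbps:>5} Mbps: {days:.2f} days per terabyte")
```

On the assumed 10 Mbps connection, a single day of data takes more than nine days to arrive, so a user downloading everything would fall further behind every day.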
This is why the cloud is emerging as the center of gravity for all big data analyses.
Once the data is made available in the cloud, anyone who wants to use it no longer needs to buy the hard drive capacity and spend months downloading the data. Interested users can instead use on-demand computing resources in the cloud to query as much, or as little, of the data as they need. When their analysis is done, they can save the results, turn off the virtual servers and not have to worry about paying for an individual copy of the original data.
Through the Amazon Web Services Public Datasets program, we host some of the world's most valuable open datasets and show off what’s possible when data is made available in the cloud. We try to think about what would be possible if people had fast, programmatic access to data, and the computing resources to analyze it. We've seen some really cutting-edge results through these initiatives.
A new look at Landsat
Since 1972, Landsat satellites have produced the longest continuous record of Earth’s land surface as seen from space. These images have been available at no cost directly from the United States Geological Survey since 2008. However, many people were limited by their ability to download and store significant quantities. We talked to many end-users and learned that they had big ideas of what could be done with Landsat data, but couldn’t get it fast enough or couldn’t afford to store their own copies.
Amazon started hosting imagery from the Landsat 8 satellite in 2015. The response was astonishing. Within the first year, over 1 billion requests for Landsat imagery and metadata were logged from 147 countries. Businesses like Esri, Mapbox and MathWorks immediately created tools to take advantage of the new easy-to-access Landsat archive.
One of the most interesting developments has been how novices and amateurs have been able to create entirely new interfaces and tools to explore and analyze the data. A group of students from Code Fellows created Snapsat -- a fast and completely novel web-based service to browse and interact with Landsat imagery. An independent developer in Melbourne, Australia, even created an iPhone app, giving people the ability to access tremendous amounts of data on how the Earth has changed over time by simply reaching into their pocket.
NEXRAD opening new research frontiers
NOAA recognized early on that the cloud would be essential to fulfilling its mission, and in 2015, it entered into a research agreement with several cloud service providers to explore ways to drive usage of its data. Through that agreement, we have made several hundred terabytes of high-resolution NEXRAD radar data available in the cloud.
Similar to the response we saw with Landsat, the usage of NEXRAD data has been impressive. After making NEXRAD data available in the cloud, NOAA recorded a 130 percent spike in usage of the data, while simultaneously seeing a 50 percent decrease in the usage of its own servers.
This open data initiative has also made the full NEXRAD archive available on demand, creating new analysis and discovery possibilities. For example, Dr. Eli Bridge at the University of Oklahoma has leveraged this public dataset to compile radar data to estimate the size of Purple Martin bird roosts.
These birds form large, dense aggregations that appear as ring-shaped patterns on the radar images. Now that the researchers no longer have to make requests for individual scans and receive the data a chunk at a time, the University of Oklahoma team is able to learn how the birds are responding to droughts, environmental change and seasonal cues. This is an example of “latent research” -- that is, research that has existed in the minds of researchers, but hasn’t been possible because of restricted data access.
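As a rough illustration of what on-demand access to an archive like this can look like, the sketch below builds the storage prefixes a researcher might list for one radar station over a range of dates, instead of filing a request for each scan. The bucket name and the year/month/day/station key layout are assumptions about how the public archive is organized, not details given in this article, and the station code KTLX is used only as an example.

```python
from datetime import date, timedelta

# Assumed bucket name and key layout; check the dataset's own documentation.
NEXRAD_BUCKET = "noaa-nexrad-level2"


def nexrad_prefix(day: date, station: str) -> str:
    """Prefix under which one station's scans for one day are assumed to be stored."""
    return f"{day:%Y/%m/%d}/{station.upper()}/"


def prefixes_for_range(start: date, end: date, station: str) -> list[str]:
    """One prefix per day from start to end, inclusive."""
    n_days = (end - start).days + 1
    return [nexrad_prefix(start + timedelta(days=i), station) for i in range(n_days)]


# Example: four days of scans from a single (illustrative) station.
for prefix in prefixes_for_range(date(2015, 7, 30), date(2015, 8, 2), "ktlx"):
    print(f"s3://{NEXRAD_BUCKET}/{prefix}")
```

Each printed prefix could then be handed to whatever storage client the researcher prefers, so a multi-year study becomes a loop over dates rather than a series of manual requests.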
Big (Census) data
Most recently, the Census Bureau discovered that increasing access to big data can lead to increased usage. Previously, the agency's American Community Survey data was only available in tabular file formats like CSV, which took days to access and required a separate reference document to make any sense of it.
Now that it has been uploaded in bulk to the cloud, anyone can access and analyze the entire dataset for about 40¢ an hour. Additional partners such as the National Science Foundation have pitched in to further boost access to the data. They have even provided instructions on how to analyze the data using an open source graph database engine.
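For a sense of scale, here is a minimal sketch of what that hourly figure means for a project budget. The 40¢ rate is the approximate figure quoted above; the project lengths are hypothetical.

```python
HOURLY_RATE_USD = 0.40  # "about 40¢ an hour", the approximate figure quoted above


def analysis_cost(hours: float, rate: float = HOURLY_RATE_USD) -> float:
    """Approximate cost in dollars of analyzing the dataset for a number of hours."""
    return round(hours * rate, 2)


# Hypothetical project lengths: one working day and one working week.
for hours in (8, 40):
    print(f"{hours:>3} hours: ${analysis_cost(hours):.2f}")
```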
The open data era is just getting started
Once open data is uploaded to the public cloud, its potential use cases are virtually endless, and government agencies are just getting started. Governments around the world are investing billions in new sensors, ranging from Internet of Things devices in parking meters to Earth-observing satellites, which are producing huge volumes of data.
The best way to get a return on these investments is to make it easy for innovators across the country to access the data and put it to work. With a modernized data distribution method and some imagination, open data can be unleashed to become a tremendous force for the public good.
Jed Sundwall is Amazon Web Services' open data lead.