Brand Niemann offers inspiration and tips for analyzing and integrating the reams of health data available online.
Brand Niemann is senior data scientist at Semanticommunity.net and former senior enterprise architect and data scientist at the Environmental Protection Agency.
To build on my recent series of articles on data science, I decided to make HealthData.gov my latest exploration.
Recently, Sonny Bhagowalia, deputy associate administrator of the Office of Citizen Services and Innovative Technologies at the General Services Administration, wrote in a tweet that Data.gov's move to the cloud would yield more and better mashups of government data. Meanwhile, Todd Park, chief technology officer at the Health and Human Services Department, announced the new Health Indicators Warehouse and welcomed us to HealthData.gov. And Health 2.0 announced two new developer challenges: Healthy People 2020 and Go Viral to Improve Health. In addition, George Thomas, an enterprise architect at HHS, is working on Clinical Quality Linked Data on HealthData.gov to help achieve Linked Open Government Data goals.
So there are five major sites now with health data — HealthyPeople.gov, Health2Challenge.org, HealthIndicators.gov, HealthData.gov and Data.Medicare.gov — that can be integrated (i.e., mashed up). I inventoried the resources and datasets at those five sites in several spreadsheets and looked for opportunities to analyze them individually and collectively. I also entered the Healthy People 2020 and Go Viral to Improve Health challenges. Previously, I had built a health data indicators warehouse in the cloud as part of the Health Data Visualization Challenge of 2010, so this data science project was not completely new to me.
I started with Spotfire’s library of U.S. state and county boundaries because I knew I would be doing interactive maps with the spatial data at those five sites. Then I imported the spreadsheet data and created a separate tab in Spotfire for each major site as follows:
HealthyPeople.gov: Inventory to understand contents and apply business intelligence and analytics.
Health2Challenge.org: Inventory to understand contents and build on previous work.
HealthIndicators.gov: New interface to catalog and data to support business intelligence and analytics.
HealthData.gov: New data catalog to expedite discovery and download for business intelligence and analytics.
Data.Medicare.gov: Inventory of datasets to expedite discovery and download for hospital selection example.
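For readers who do not use Spotfire, the one-tab-per-site layout above can be approximated in a few lines of Python. This is only a minimal sketch, not the workflow I used: the `INVENTORY` rows and the `group_by_site` helper are hypothetical, and the dataset titles are placeholders rather than real catalog entries.

```python
from collections import defaultdict

# Hypothetical inventory rows: (site, dataset title).
# The titles are placeholders, not actual catalog entries.
INVENTORY = [
    ("HealthyPeople.gov", "example objectives table"),
    ("HealthData.gov", "example catalog entry A"),
    ("HealthData.gov", "example catalog entry B"),
    ("Data.Medicare.gov", "example hospital dataset"),
]


def group_by_site(rows):
    """Group dataset titles under their source site, one bucket per site."""
    tabs = defaultdict(list)
    for site, title in rows:
        tabs[site].append(title)
    return dict(tabs)


tabs = group_by_site(INVENTORY)
for site, titles in tabs.items():
    print(f"{site}: {len(titles)} dataset(s)")
```

Each key in `tabs` plays the role of one Spotfire tab, so the per-site inventories stay separate but can still be scanned together.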
The focus of the challenge was to extract the goals and objectives from the state-specific Healthy People 2010 and 2020 plans, map them, and integrate them with the databases above.
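The integration step amounts to joining tables that share a geographic key. The sketch below assumes, purely for illustration, that both the extracted objectives and the indicator data are keyed by two-letter state code; the `mash_up` function, the field names and every value shown are hypothetical placeholders, not data from any of the five sites.

```python
def mash_up(objectives, indicators):
    """Inner-join two state-keyed tables on the two-letter state code."""
    shared = sorted(objectives.keys() & indicators.keys())
    return [
        {"state": s, "objective": objectives[s], "indicator": indicators[s]}
        for s in shared
    ]


# Placeholder inputs: the text and numbers are made up for illustration only.
objectives = {"VA": "placeholder objective", "MD": "placeholder objective"}
indicators = {"VA": 1.0, "DC": 2.0}

combined = mash_up(objectives, indicators)
print(combined)
```

Only states present in both tables survive the join, which is a quick way to spot where a state plan has no matching indicator data, or the reverse.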
All that work is documented on the wiki page and its attachments so others can check it and produce their own integrations. The Healthy People 2020 challenge was submitted March 7, and the Go Viral to Improve Health Challenge is due April 27. The Go Viral challenge includes work with more community-level data sources, such as the Pellucid Health Care Transparency tables and the data sources in the book “Visualizing Data Patterns with Micromaps” by Dan Carr of George Mason University and Linda Williams Pickle, formerly with the National Cancer Institute. It also links to the recent work to build VIVO, an open-source Semantic Web application, in the cloud for the National Institutes of Health's Workshop on Value Added Services for VIVO.
I hope this article has piqued your interest in taking the challenge to analyze health databases and has made it easier for you to get started.