Getting government datasets to talk the same language

Brand Niemann explains how and why we should make data universally accessible on the Semantic Web.

Stars 4 and 5 are more challenging to implement because multiple links are possible even with small datasets.

Brand Niemann is senior data scientist at Semanticommunity.net and former senior enterprise architect and data scientist at the Environmental Protection Agency.

Since my previous column "How to move from datasets to data services," I have been reflecting on the proliferation of data catalogs, the two most prominent of which are Data.gov and Data.gov.uk, and their lack of standardization in vocabulary, format and functionality. An excellent inventory of open-government initiatives and their data catalogs at wiki.civiccommons.org shows that diversity.

Data.gov initially opened community forums for discussion of open data and the Semantic Web, which have since expanded to include groups on health, law and restoring the Gulf. The Semantic Web forum has the following discussion areas:

  • Cross-domain linking, for sharing views on how best to interlink Linked Open Government Data.
  • Cross-domain vocabularies, for sharing views about vocabularies that all publishers of Linked Open Government Data are using or should be using.
  • Domain-specific vocabularies, for sharing views about vocabularies that are specific to agency mission areas.
  • Uniform Resource Identifier (URI) schemes, for sharing views about conventions for publishing and consuming Linked Open Government Data.

The last of those is of particular interest because URIs are a critical part of Tim Berners-Lee’s five stars of linked open data:

  • Star 1: Make your stuff available on the Web, in whatever format.
  • Star 2: Make it available as structured data (e.g., an Excel spreadsheet instead of an image scan of a table).
  • Star 3: Use a nonproprietary format (e.g., comma-separated values instead of Excel).
  • Star 4: Use URLs to identify things, so people can link to your stuff.
  • Star 5: Link your data to other people’s data to provide context.
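To make the step from Star 3 to Stars 4 and 5 concrete, here is a minimal Python sketch. It assumes a small CSV table, a made-up URI base (data.example.gov) and a made-up external identifier set (other.example.org); none of those names come from Data.gov or any real catalog. Each row gets its own URI (Star 4), and one column is linked to somebody else's identifiers (Star 5).

```python
import csv
import io

# Hypothetical URI base for this publisher's records (an assumption for illustration).
BASE = "https://data.example.gov/id/facility/"

# A tiny Star 3 dataset: structured, nonproprietary CSV.
CSV_TEXT = """facility_id,name,state
101,Riverside Plant,OH
102,Lakeview Works,MI
"""

# Hypothetical identifiers published by another organization (used for Star 5).
EXTERNAL_STATE_URIS = {
    "OH": "https://other.example.org/id/state/OH",
    "MI": "https://other.example.org/id/state/MI",
}


def add_links(csv_text, base=BASE, external=EXTERNAL_STATE_URIS):
    """Return the rows with a minted URI and a link to external data."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Star 4: give every record a URI that others can link to.
        row["uri"] = base + row["facility_id"]
        # Star 5: point at someone else's identifier to provide context.
        row["state_uri"] = external.get(row["state"], "")
        rows.append(row)
    return rows


linked = add_links(CSV_TEXT)
for row in linked:
    print(row["uri"], "->", row["state_uri"])
```

Even in this two-row table, every additional column that refers to a real-world thing (a state, an agency, a program) is another candidate for linking, which is where the design effort goes.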

The seminal government work on that topic appears to be the United Kingdom Cabinet Office's “Designing URI Sets for the UK Public Sector.” It defines the design considerations and guidance by which public-sector URI sets should be developed and maintained. The guidance is designed to encourage those who own reference data to make it available for reuse and to give those who have data that could be linked the confidence to reuse a URI/URL set that is not under their direct control.

In addition, I recommend following Principle 14 of the Open Group’s Data Principles: Common Vocabulary and Data Definitions. It states that data should be defined consistently throughout the enterprise, and the definitions should be understandable and available to all users.

Recently, the Open Government Working Group concluded that all vocabularies, terms and concepts should have a URI/URL, and each URI/URL should point to both data and a Web/wiki page that shows the metadata about that vocabulary, term or concept. Doing that one simple thing makes vocabularies universally accessible on the Semantic Web. Furthermore, the group recommends use of the U.K. document as a basis for a U.S. policy on minting URIs/URLs and that all open-government vocabularies use permanent URIs/URLs.
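One way to picture that recommendation is a vocabulary term whose single permanent URI leads to both machine-readable data and a human-readable page. The Python sketch below is only an illustration under stated assumptions: the domain vocab.example.gov, the /def/ and /wiki/ path segments, the ".json" suffix and the Accept-header rule are all hypothetical choices, not patterns taken from the working group or quoted from the U.K. document.

```python
import re
from dataclasses import dataclass

# Hypothetical base for a vocabulary service (an assumption for illustration).
VOCAB_BASE = "https://vocab.example.gov"


def slugify(label):
    """Turn a term label into a stable, URL-safe slug."""
    return re.sub(r"[^a-z0-9]+", "-", label.lower()).strip("-")


@dataclass(frozen=True)
class Term:
    label: str
    definition: str

    @property
    def uri(self):
        # The permanent identifier for the term itself.
        return f"{VOCAB_BASE}/def/{slugify(self.label)}"

    @property
    def data_url(self):
        # Machine-readable representation (hypothetical naming rule).
        return self.uri + ".json"

    @property
    def page_url(self):
        # Web/wiki page carrying the metadata about the term.
        return f"{VOCAB_BASE}/wiki/{slugify(self.label)}"

    def resolve(self, accept):
        """Choose a target from an HTTP Accept header value."""
        return self.data_url if "json" in accept else self.page_url


term = Term("Drinking Water System", "Placeholder definition for illustration.")
print(term.uri)
print(term.resolve("application/json"))
print(term.resolve("text/html"))
```

The point of the design is that the label can be reworded and the page can be redesigned while the URI itself stays the same.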

I have been following that recommendation using a wiki, spreadsheets and a visualization/analytics tool, along with a simple metadata format (e.g., Facebook's Open Graph Protocol). Each data table contains links to vocabulary definitions and metadata at well-defined URLs on the wiki. The wiki in turn provides well-defined links to the spreadsheets, and the tables in the visualization/analytics tool contain those links as well.
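For readers who want to see what such linking might look like, here is a rough Python sketch. It is not a description of my actual setup: the wiki address (wiki.example.org), the /vocabulary/ path and the helper names are hypothetical. The property names og:title, og:type, og:url and og:description do come from the Open Graph Protocol; og:image, which the protocol also lists as a basic property, is left out to keep the sketch short.

```python
from html import escape

# Hypothetical wiki that hosts vocabulary definitions (an assumption).
WIKI = "https://wiki.example.org"


def og_meta_tags(title, url, description, og_type="website"):
    """Render Open Graph-style <meta> tags describing a data table page."""
    props = {
        "og:title": title,
        "og:type": og_type,
        "og:url": url,
        "og:description": description,
    }
    return "\n".join(
        f'<meta property="{escape(name)}" content="{escape(value)}" />'
        for name, value in props.items()
    )


def header_links(columns):
    """Map each column header to the wiki page that defines it."""
    return {col: f"{WIKI}/vocabulary/{col}" for col in columns}


tags = og_meta_tags(
    "Facilities",
    f"{WIKI}/tables/facilities",
    "Facility names & states",
)
links = header_links(["facility_id", "name", "state"])
print(tags)
print(links["state"])
```

The same header-to-URL mapping can be carried into a spreadsheet or an analytics tool, so that a column heading means the same thing wherever the table ends up.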

I challenge data and data catalog providers to deliver their data tables and catalogs with well-defined URIs/URLs to facilitate the integration of all that data.