Getting government datasets to talk the same language

Brand Niemann is senior data scientist at Semanticommunity.net and former senior enterprise architect and data scientist at the Environmental Protection Agency.

Since my previous column, “How to move from datasets to data services,” I have been reflecting on the proliferation of data catalogs, the two most prominent of which are Data.gov and Data.gov.uk, and their lack of standardization in vocabulary, format and functionality. An excellent inventory of open-government initiatives and their data catalogs at wiki.civiccommons.org shows that diversity.

Data.gov initially opened community forums for discussion of open data and the Semantic Web, which have since expanded to include groups on health, law and restoring the Gulf. The Semantic Web forum has the following discussion areas:

  • Cross-domain linking, for sharing views on how best to interlink Linked Open Government Data.
  • Cross-domain vocabularies, for sharing views about vocabularies that all publishers of Linked Open Government Data are using or should be using.
  • Domain-specific vocabularies, for sharing views about vocabularies that are specific to agency mission areas.
  • Uniform Resource Identifier (URI) schemes, for sharing views about conventions for publishing and consuming Linked Open Government Data.

The last of those is of particular interest because URIs are a critical part of Tim Berners-Lee’s five stars of linked open data:

  • Star 1: Make your stuff available on the Web, in whatever format.
  • Star 2: Make it available as structured data (e.g., an Excel spreadsheet instead of an image scan of a table).
  • Star 3: Use a nonproprietary format (e.g., comma-separated values instead of Excel).
  • Star 4: Use URLs to identify things, so people can link to your stuff.
  • Star 5: Link your data to other people’s data to provide context.

Stars 4 and 5 are the most challenging to implement because many links are possible even with small datasets.
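To make stars 3 through 5 concrete, here is a minimal sketch of a small table published as CSV, with a URI for each row and a column that links to someone else's data. The domain names, column names and records are hypothetical placeholders, not taken from any real catalog.

```python
import csv
import io

# Hypothetical base URI; data.example.gov is a placeholder domain.
BASE = "http://data.example.gov/id/facility/"

# Made-up records for illustration only.
rows = [
    {"facility_id": "1001", "name": "Plant A",
     "same_as": "http://other.example.org/id/site/77"},
    {"facility_id": "1002", "name": "Plant B", "same_as": ""},
]

def to_linked_csv(records):
    """Write records as CSV (star 3), give each row a URI (star 4)
    and keep a column that links to external data (star 5)."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["uri", "name", "same_as"], lineterminator="\n"
    )
    writer.writeheader()
    for r in records:
        writer.writerow({
            "uri": BASE + r["facility_id"],
            "name": r["name"],
            "same_as": r["same_as"],
        })
    return buf.getvalue()

linked_csv = to_linked_csv(rows)
print(linked_csv)
```

Even in this two-row table, each row could link to several outside datasets, which is where the effort in stars 4 and 5 comes from.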

The seminal government work on that topic appears to be the United Kingdom Cabinet Office's “Designing URI Sets for the UK Public Sector.” It sets out the design considerations and guidance by which public-sector URI sets should be developed and maintained. The guidance is intended to encourage those who own reference data to make it available for reuse, and to give those whose data could be linked the confidence to reuse a URI/URL set that is not under their direct control.
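As a rough illustration of what a URI-set convention can look like, the sketch below mints an identifier URI, a document URI and a definition URI from a domain, a concept and a reference. The /id/, /doc/ and /def/ path segments are an assumption made for this example; the actual patterns should be taken from the guidance itself. The function names and the example domain are hypothetical.

```python
import re

def slugify(text):
    """Lowercase the text and replace runs of other characters with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

def mint_uris(domain, concept, reference):
    """Return a small, predictable set of URIs for one thing.

    The path layout here is an assumed convention for illustration.
    """
    c, r = slugify(concept), slugify(reference)
    return {
        "identifier": f"http://{domain}/id/{c}/{r}",   # the thing itself
        "document": f"http://{domain}/doc/{c}/{r}",    # a page about the thing
        "definition": f"http://{domain}/def/{c}",      # the concept's definition
    }

uris = mint_uris("data.example.gov", "Air Quality Monitor", "Station 42")
for kind, uri in uris.items():
    print(kind, uri)
```

Because the URIs are derived by rule rather than assigned ad hoc, someone outside the publishing agency can predict and reuse them.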

In addition, I recommend following Principle 14 of the Open Group’s Data Principles: Common Vocabulary and Data Definitions. It states that data should be defined consistently throughout the enterprise, and the definitions should be understandable and available to all users.

Recently, the Open Government Working Group concluded that all vocabularies, terms and concepts should have a URI/URL, and each URI/URL should point to both data and a Web/wiki page that shows the metadata about that vocabulary, term or concept. Doing that one simple thing makes vocabularies universally accessible on the Semantic Web. Furthermore, the group recommends using the U.K. document as the basis for a U.S. policy on minting URIs/URLs and recommends that all open-government vocabularies use permanent URIs/URLs.
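One way to picture a URI that points to both data and a metadata page is a lookup that returns one or the other depending on what the client asks for. The sketch below is only an illustration of that idea: the registry, the URLs and the resolve function are hypothetical, and a real service would handle the HTTP Accept header more carefully.

```python
# Hypothetical registry: one vocabulary URI, two representations.
VOCAB = {
    "http://data.example.gov/def/facility": {
        "data": "http://data.example.gov/def/facility.csv",
        "page": "http://wiki.example.gov/Facility",
    }
}

def resolve(uri, accept="text/html"):
    """Return the wiki page for browsers and the data file otherwise."""
    entry = VOCAB[uri]
    if "text/html" in accept:
        return entry["page"]   # human-readable metadata page
    return entry["data"]       # machine-readable data

term = "http://data.example.gov/def/facility"
print(resolve(term))
print(resolve(term, accept="text/csv"))
```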

I have been putting that recommendation into practice with a wiki, spreadsheets and a visualization/analytics tool, following a simple metadata format (e.g., Facebook's Open Graph Protocol). Each data table contains links to vocabulary definitions and metadata at well-defined URLs on the wiki. The wiki in turn provides well-defined links to the spreadsheets, and the tables in the visualization/analytics tool contain those links as well.
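For readers unfamiliar with the Open Graph Protocol, it describes a page with a handful of meta tags in the page's HTML head. The sketch below generates such tags for a vocabulary term's wiki page. The term, URL and description are hypothetical, and this is an illustration of the format rather than the tooling described above.

```python
from html import escape

def og_meta(title, url, description, og_type="website"):
    """Build Open Graph meta tags describing one vocabulary term page."""
    props = {
        "og:title": title,
        "og:type": og_type,
        "og:url": url,
        "og:description": description,
    }
    return "\n".join(
        f'<meta property="{name}" content="{escape(value, quote=True)}" />'
        for name, value in props.items()
    )

tags = og_meta(
    title="Facility",
    url="http://wiki.example.gov/Facility",   # hypothetical wiki page
    description='A site that reports under the "Facility" vocabulary term.',
)
print(tags)
```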

I challenge data and data catalog providers to deliver their data tables and catalogs with well-defined URIs/URLs to facilitate the integration of all that data.
