Getting government datasets to talk the same language

Brand Niemann is senior data scientist at and former senior enterprise architect and data scientist at the Environmental Protection Agency.

Since my previous column "How to move from datasets to data services," I have been reflecting on the proliferation of data catalogs, the two most prominent of which are and, and their lack of standardization in vocabulary, format and functionality. An excellent inventory of open-government initiatives and their data catalogs at shows that diversity. initially opened community forums for discussion of open data and the Semantic Web, which have since expanded to include groups on health, law and restoring the Gulf. The Semantic Web forum has the following discussion areas:

  • Cross-domain linking, for sharing views on how best to interlink Linked Open Government Data.
  • Cross-domain vocabularies, for sharing views about vocabularies that all publishers of Linked Open Government Data are using or should be using.
  • Domain-specific vocabularies, for sharing views about vocabularies that are specific to agency mission areas.
  • Uniform Resource Identifier (URI) schemes, for sharing views about conventions for publishing and consuming Linked Open Government Data.

The last of those areas is of particular interest because URIs are a critical part of Tim Berners-Lee’s five stars of linked open data.

  • Star 1: Make your stuff available on the Web, in whatever format.
  • Star 2: Make it available as structured data (e.g., an Excel spreadsheet instead of an image scan of a table).
  • Star 3: Use a nonproprietary format (e.g., comma-separated values instead of Excel).
  • Star 4: Use URLs to identify things, so people can link to your stuff.
  • Star 5: Link your data to other people’s data to provide context.

Stars 4 and 5 are more challenging to implement because multiple links are possible even with small datasets.
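
To make stars 3 through 5 concrete, here is a minimal Python sketch. It is only an illustration: the domains `data.example.gov` and `reference.example.org`, the column names and the sample rows are all hypothetical, not taken from any real catalog. It reads a small CSV table (star 3), gives each row its own URI (star 4) and links each row to an identifier in someone else's dataset (star 5).

```python
import csv
import io

# Hypothetical base URI; a real publisher would use its own stable domain.
BASE = "http://data.example.gov/id/facility/"

# Star 3: the data as plain CSV rather than a proprietary spreadsheet.
# These rows are made-up sample data.
RAW_CSV = """facility_id,name,state
1001,Riverside Treatment Plant,OH
1002,Lakeview Monitoring Station,MI
"""

# Hypothetical URIs published by another organization for U.S. states.
STATE_URIS = {
    "OH": "http://reference.example.org/id/state/OH",
    "MI": "http://reference.example.org/id/state/MI",
}


def add_links(raw_csv):
    """Return the CSV rows as dicts with 'uri' and 'state_uri' added."""
    rows = []
    for row in csv.DictReader(io.StringIO(raw_csv)):
        row["uri"] = BASE + row["facility_id"]       # Star 4: a URI per thing
        row["state_uri"] = STATE_URIS[row["state"]]  # Star 5: link to other data
        rows.append(row)
    return rows


linked = add_links(RAW_CSV)
for row in linked:
    print(row["uri"], "->", row["state_uri"])
```

Even this two-row table carries four links, which hints at why stars 4 and 5 take more effort than the first three.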

The seminal government work on that topic appears to be the United Kingdom Cabinet Office's “Designing URI Sets for the UK Public Sector.” It defines the design considerations and guidance by which public-sector URI sets should be developed and maintained. The guidance is designed to encourage those who own reference data to make it available for reuse and to give those who have data that could be linked the confidence to reuse a URI/URL set that is not under their direct control.

In addition, I recommend following Principle 14 of the Open Group’s Data Principles: Common Vocabulary and Data Definitions. It states that data should be defined consistently throughout the enterprise, and the definitions should be understandable and available to all users.

Recently, the Open Government Working Group concluded that all vocabularies, terms and concepts should have a URI/URL, and each URI/URL should point to both data and a Web/wiki page that shows the metadata about that vocabulary, term or concept. Doing that one simple thing makes vocabularies universally accessible on the Semantic Web. Furthermore, the group recommends using the U.K. document as a basis for a U.S. policy on minting URIs/URLs and recommends that all open-government vocabularies use permanent URIs/URLs.
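
One way to picture a single URI that leads to both data and a metadata page is the sketch below. It is an assumption-laden illustration, not the working group's design: the `vocab.example.gov` domain, the `/id/` and `/doc/` path segments, the file extensions and the `resolve` function are all hypothetical. The permanent identifier never changes; the caller's stated preference decides whether it is sent to a machine-readable document or a human-readable page.

```python
# Hypothetical registry: one permanent URI per vocabulary term.
TERMS = {
    "http://vocab.example.gov/id/term/watershed": {
        "label": "Watershed",
        "definition": "Placeholder definition for illustration only.",
    },
}


def resolve(uri, accept="text/html"):
    """Return (content_type, location) for a term URI, or None if unknown.

    'accept' mimics an HTTP Accept header: callers asking for JSON are
    pointed at the data document, everyone else at the metadata page.
    """
    if uri not in TERMS:
        return None
    # The identifier stays fixed; only the document location varies.
    doc = uri.replace("/id/", "/doc/", 1)
    if "application/json" in accept:
        return ("application/json", doc + ".json")
    return ("text/html", doc + ".html")


term = "http://vocab.example.gov/id/term/watershed"
print(resolve(term))
print(resolve(term, accept="application/json"))
```

Keeping the identifier separate from the documents that describe it lets a publisher change page layouts or file formats later without breaking the links other people have already made.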

I have been following that recommendation using a wiki, spreadsheets and a visualization/analytics tool, with a simple metadata format (e.g., Facebook's Open Graph Protocol). Each data table contains links to vocabulary definitions and metadata at well-defined URLs on the wiki, the wiki in turn provides well-defined links to the spreadsheets, and the tables in the visualization/analytics tool contain those links as well.
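
As a rough sketch of what such simple metadata could look like, the Python below builds Open Graph-style `<meta>` tags for one vocabulary page. The property names `og:title`, `og:type`, `og:url` and `og:description` come from the Open Graph Protocol; the wiki URL, the sample values and the `og_meta_tags` helper are hypothetical, and `og:image`, which the protocol also expects, is omitted to keep the example short.

```python
from html import escape


def og_meta_tags(title, url, description, og_type="website"):
    """Build Open Graph-style <meta> tags as one HTML string."""
    props = {
        "og:title": title,
        "og:type": og_type,
        "og:url": url,
        "og:description": description,
    }
    # Escape values so quotes or angle brackets cannot break the markup.
    return "\n".join(
        f'<meta property="{escape(name)}" content="{escape(value)}" />'
        for name, value in props.items()
    )


# Hypothetical wiki page for one vocabulary term.
tags = og_meta_tags(
    title="Watershed",
    url="http://wiki.example.gov/vocab/watershed",
    description="Definition and metadata for the term 'watershed'.",
)
print(tags)
```

A data table, a wiki page and a visualization can all carry the same `og:url` value, which is what ties the three views of one term together.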

I challenge data and data catalog providers to deliver their data tables and catalogs with well-defined URIs/URLs to facilitate the integration of all that data.

