10 flaws with the data on Data.gov

Recently released high-value datasets reveal 10 types of deficiencies

Transparency should be a three-legged stool of awareness, access and accuracy. Data.gov, the federal government’s data Web portal, is focusing on the second leg of the stool: access. Of the three, accuracy, which is part of data quality, is the most difficult to achieve but also the most important. If government data is untrustworthy, the government defaults on its backstop role in society.

Related stories:

How Data.gov connects the dots of the federal data reference model

The promise and perils of Data.gov lie in the metadata

So, can you trust the data provided on Data.gov? A cursory examination of the newly released high-value datasets revealed 10 types of quality deficiencies.

1. Omission errors. These are violations of the quality characteristic of completeness. The No. 1 idea on the Data.gov collaboration site is to provide definitions for every column, but many datasets do not provide this information. Another type of omission is when dataset fields are sparsely populated, which can omit the key fields necessary for the data to be relevant. For example, a dataset on recreation sites should include the location of each site. Furthermore, many datasets use codes but omit the complete code lists needed to validate the data. Finally, some Extensible Markup Language documents omit the XML schema used to validate them even when those schemas clearly exist.
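
For illustration, a minimal Python sketch along these lines could flag sparsely populated columns in a CSV file. The file name and the 50 percent threshold below are assumptions for the example, not Data.gov conventions.

```python
# Minimal sketch: flag sparsely populated columns in a CSV file.
# The file name and the 0.5 threshold are illustrative assumptions.
import csv
from collections import Counter

def find_sparse_columns(path, threshold=0.5):
    """Return columns whose share of empty values exceeds the threshold."""
    empty = Counter()
    total = 0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            total += 1
            for field, value in row.items():
                if value is None or str(value).strip() == "":
                    empty[field] += 1
    if total == 0:
        return {}
    return {field: count / total for field, count in empty.items()
            if count / total > threshold}

if __name__ == "__main__":
    # Hypothetical file name for a recreation-sites dataset.
    print(find_sparse_columns("recreation_sites.csv"))
```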

2. Formatting errors. These are violations of the quality characteristic of consistency. Examples are a lack of header lines in comma-separated value files and incorrectly quoted CSV values. This category also includes poorly formatted numeric and date values; for example, we still see dates such as “5-Feb-10” with a two-digit year.
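
A similarly small sketch could scan a CSV file for two-digit-year dates of that form. The file name and the date pattern below are assumptions for illustration.

```python
# Minimal sketch: scan a CSV file for two-digit-year dates such as "5-Feb-10".
# The file name and the regular expression are illustrative assumptions.
import csv
import re

TWO_DIGIT_YEAR = re.compile(r"\b\d{1,2}-[A-Za-z]{3}-\d{2}\b")  # matches 5-Feb-10, not 5-Feb-2010

def find_ambiguous_dates(path):
    """Yield (row_number, value) pairs for values containing a two-digit year."""
    with open(path, newline="") as f:
        for row_no, row in enumerate(csv.reader(f), start=1):
            for value in row:
                if TWO_DIGIT_YEAR.search(value):
                    yield row_no, value

if __name__ == "__main__":
    for row_no, value in find_ambiguous_dates("dataset.csv"):
        print(f"row {row_no}: ambiguous date {value!r}")
```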

3. Accuracy errors. These are violations of the quality characteristic of correctness. Examples include violations of range constraints, such as a dataset containing an implausible number like “47199998999988888…”
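
Range constraints are easy to check mechanically once they are published. The sketch below assumes hypothetical latitude and longitude bounds purely for illustration; a real dataset would publish its own business rules.

```python
# Minimal sketch: apply range constraints to numeric columns.
# The column names, bounds and file name are hypothetical.
import csv

RANGES = {"latitude": (-90.0, 90.0), "longitude": (-180.0, 180.0)}

def check_ranges(path, ranges=RANGES):
    """Yield (row_number, column, value) for non-numeric or out-of-range values."""
    with open(path, newline="") as f:
        # start=2 because row 1 is the header line
        for row_no, row in enumerate(csv.DictReader(f), start=2):
            for column, (low, high) in ranges.items():
                raw = row.get(column)
                if raw is None or raw.strip() == "":
                    continue  # missing values are an omission error, handled separately
                try:
                    value = float(raw)
                except ValueError:
                    yield row_no, column, raw
                    continue
                if not low <= value <= high:
                    yield row_no, column, raw

if __name__ == "__main__":
    for row_no, column, raw in check_ranges("recreation_sites.csv"):
        print(f"row {row_no}: {column}={raw!r} fails its range constraint")
```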

4. Incorrectly labeled records. These are also violations of the quality characteristic of correctness. Unfortunately, agencies are confused about when to use CSV files versus Excel files. Some datasets are labeled as CSV files even though they are not record-oriented, as CSV files must be; they are merely CSV dumps from Microsoft Excel spreadsheets. This indicates a need for more education and training on information management skills.
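
One rough test of whether a file labeled as CSV is genuinely record-oriented is to confirm that every data row has the same number of fields as the header row, as in this sketch (the file name is a placeholder).

```python
# Minimal sketch: test whether a file labeled as CSV is record-oriented,
# i.e. every non-empty data row has the same number of fields as the header.
# The file name is a placeholder.
import csv

def is_record_oriented(path):
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader, None)
        if not header:
            return False
        return all(len(row) == len(header) for row in reader if row)

if __name__ == "__main__":
    print(is_record_oriented("spreadsheet_export.csv"))
```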

5. Access errors. These are violations of correct metadata description. Some datasets advertise that they provide raw data, but when you click the link, you are sent to a Web site that does not provide the raw data.
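
A crude way to catch this is to check the Content-Type returned by the advertised link; an HTML response suggests a landing page rather than raw data. The URL below is a placeholder, and the check is only a heuristic.

```python
# Minimal sketch: check whether an advertised "raw data" link serves data or an
# HTML landing page. The URL is a placeholder; Content-Type is only a heuristic.
from urllib.request import urlopen

def looks_like_raw_data(url):
    with urlopen(url) as response:
        content_type = response.headers.get_content_type()
    return content_type not in ("text/html", "application/xhtml+xml")

if __name__ == "__main__":
    print(looks_like_raw_data("https://example.gov/dataset.csv"))
```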

6. Poorly structured data. These are violations of correct metadata description and relevance. Some datasets are formatted using CSV or XML with little regard to how the data would be used. Specifically, some datasets are formatted in a nonrecord-oriented manner in which field names are embedded as data values.
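
When field names arrive embedded as data values, the data usually has to be reshaped before it is usable. The sketch below assumes a hypothetical layout in which each row is a field-name/value pair and blank rows separate records; real files vary, so this is illustrative only.

```python
# Minimal sketch: reshape a nonrecord-oriented dump, where each row is a
# "field name, value" pair and blank rows separate records, into proper records.
# The layout and file name are hypothetical.
import csv

def pairs_to_records(path):
    records, current = [], {}
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if not any(cell.strip() for cell in row):
                if current:                # blank row ends the current record
                    records.append(current)
                    current = {}
            elif len(row) >= 2:
                current[row[0].strip()] = row[1].strip()
    if current:
        records.append(current)
    return records

if __name__ == "__main__":
    for record in pairs_to_records("nonrecord_dump.csv"):
        print(record)
```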

7. Nonnormalized data. These errors violate the principles of normalization, which attempt to reduce redundant data. Some datasets have repeated fields and superfluously duplicated field values.
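
Two quick smells of nonnormalized data are exact duplicate rows and columns whose value never varies; the latter are candidates for factoring into a separate lookup table. The sketch below assumes a placeholder file name.

```python
# Minimal sketch: detect two normalization smells -- exact duplicate rows, and
# columns whose value never varies. The file name is a placeholder.
import csv

def normalization_smells(path):
    seen, duplicate_rows = set(), 0
    with open(path, newline="") as f:
        reader = csv.reader(f)
        columns = next(reader, [])
        distinct = {name: set() for name in columns}
        for row in reader:
            key = tuple(row)
            if key in seen:
                duplicate_rows += 1
            seen.add(key)
            for name, value in zip(columns, row):
                distinct[name].add(value)
    constant_columns = [name for name, values in distinct.items() if len(values) <= 1]
    return duplicate_rows, constant_columns

if __name__ == "__main__":
    dupes, constants = normalization_smells("dataset.csv")
    print(f"{dupes} duplicate rows; constant columns: {constants}")
```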

8. Raw database dumps. Although this is more of a metadata issue than a data quality issue, it certainly violates the principle of relevance. These datasets have files with names such as table1, table2, etc., and are clearly raw database dumps exported to CSV or XLS. Unfortunately, raw database dumps are usually poorly formatted, have no associated business rules and have terse field names.

9. Inflation of counts. Although this is also a metadata quality issue, many datasets are differentiated only by year or geography, which clutters search results. A simple solution is to allow multiple files per dataset and thereby combine these by-dimension differences into a single search hit.

10. Inconsistent data granularity. This is yet another metadata quality issue that goes to the purpose of Data.gov and its utility for the public. Some datasets are at an extremely high level, while others provide extreme detail without any metadata field denoting them as such.

So what can we do? Here are three basic steps: Attract more citizen involvement to police the data; implement the top ideas on the Data.gov collaboration site; and ensure agency open-government plans address, in detail, their data quality processes.

About the Author

Michael C. Daconta is the Vice President of Advanced Technology at InCadence Strategic Solutions and the former Metadata Program Manager for the Homeland Security Department. His new book is entitled The Great Cloud Migration: Your Roadmap to Cloud Computing, Big Data and Linked Data.

