10 flaws with the data on

Recently released high-value datasets reveal 10 types of deficiencies

Transparency should be a three-legged stool of awareness, access and accuracy., the federal government’s data Web portal, is focusing on the second leg of the stool: access. Of the three, accuracy, which is part of data quality, is the most difficult to achieve but also the most important. If government data is untrustworthy, the government defaults on its backstop role in society.

Related stories:

How connects the dots of the federal data reference model

The promise and perils of lie in the metadata

So, can you trust the data provided on A cursory examination of the newly released high-value datasets revealed 10 types of quality deficiencies.

1. Omission errors. These are a violation of the quality characteristic of completeness. The No. 1 idea on, the collaboration site, is to provide definitions for every column. But many datasets do not provide this information. Another type of omission is when dataset fields are sparsely populated, which might omit the key fields necessary for the data to be relevant. For example, a dataset on recreation sites should have the location of the site. Furthermore, many datasets use codes but omit the complete code lists needed to validate the data. Finally, Extensible Markup Language documents omit the XML schema used to validate them even when the schemas clearly exist.

2. Formatting errors. These are violations of the quality characteristic of consistency. Examples are a lack of header lines in comma-separated value files and incorrectly quoted CSV values. Additionally, this includes poorly formatted data values for some numbers and dates. For example, we still see dates such as “5-Feb-10” with a two-digit year.

3. Accuracy errors. These are violations of the quality characteristic of correctness. Examples are errors in range constraints, such as a dataset having numbers such as “47199998999988888…”

4. Incorrectly labeled records. These are also violations of the quality characteristic of correctness. Unfortunately, agencies are confused as to when to use CSV files versus Excel files. Some datasets are being labeled as CSV files when they are not record-oriented, which they must be, and are just CSV dumps from Microsoft Excel. This indicates a need for more education and training on information management skills.

5. Access errors. These are violations of correct metadata description. Some datasets advertise that they provide raw data, but when you click the link, you are sent to a Web site that does not provide the raw data.

6. Poorly structured data. These are violations of correct metadata description and relevance. Some datasets are formatted using CSV or XML with little regard to how the data would be used. Specifically, some datasets are formatted in nonrecord-oriented manners in which field names are embedded as data values.

7. Nonnormalized data. These errors violate the principles of normalization, which attempt to reduce redundant data. Some datasets have repeated fields and superfluously duplicated field values.

8. Raw database dumps. Although more of a metadata than data quality issue, this certainly violates the principle of relevance. These datasets have files with names such as table1, table2, etc., and are clearly raw database dumps exported to CSV or XLS. Unfortunately, raw database dumps are usually poorly formatted, have no associated business rules and have terse field names.

9. Inflation of counts. Although also a metadata quality issue, many datasets are differentiated only by year or geography, which clutters search results. A simple solution is to allow multiple files per dataset and thereby combine these by-dimension differences into a single search hit.

10. Inconsistent data granularity. This is yet another metadata quality issue that goes to the purpose of and its utility for the public. Some datasets are at an extremely high level while others provide extreme detail without any metadata field denoting them as such.

So what can we do? Here are three basics steps: Attract more citizen involvement to police the data; implement the top ideas on; and ensure agency open-government plans address, in detail, their data quality processes.

About the Author

Michael C. Daconta ( is the Vice President of Advanced Technology at InCadence Strategic Solutions and the former Metadata Program Manager for the Homeland Security Department. His new book is entitled, The Great Cloud Migration: Your Roadmap to Cloud Computing, Big Data and Linked Data.


  • Defense
    Ryan D. McCarthy being sworn in as Army Secretary Oct. 10, 2019. (Photo credit: Sgt. Dana Clarke/U.S. Army)

    Army wants to spend nearly $1B on cloud, data by 2025

    Army Secretary Ryan McCarthy said lack of funding or a potential delay in the JEDI cloud bid "strikes to the heart of our concern."

  • Congress
    Rep. Jim Langevin (D-R.I.) at the Hack the Capitol conference Sept. 20, 2018

    Jim Langevin's view from the Hill

    As chairman of of the Intelligence and Emerging Threats and Capabilities subcommittee of the House Armed Services Committe and a member of the House Homeland Security Committee, Rhode Island Democrat Jim Langevin is one of the most influential voices on cybersecurity in Congress.

Stay Connected


Sign up for our newsletter.

I agree to this site's Privacy Policy.