10 flaws with the data on Data.gov

Recently released high-value datasets reveal 10 types of deficiencies

Transparency should be a three-legged stool of awareness, access and accuracy. Data.gov, the federal government’s data Web portal, is focusing on the second leg of the stool: access. Of the three, accuracy, which is part of data quality, is the most difficult to achieve but also the most important. If government data is untrustworthy, the government defaults on its backstop role in society.

So, can you trust the data provided on Data.gov? A cursory examination of the newly released high-value datasets revealed 10 types of quality deficiencies.

1. Omission errors. These are violations of the quality characteristic of completeness. The No. 1 idea on the portal's collaboration site is to provide definitions for every column, but many datasets do not provide this information. Another type of omission occurs when dataset fields are sparsely populated, which can leave out the key fields necessary for the data to be relevant. For example, a dataset on recreation sites should have the location of each site. Furthermore, many datasets use codes but omit the complete code lists needed to validate the data. Finally, Extensible Markup Language documents omit the XML schema used to validate them even when the schemas clearly exist.
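
One way to spot sparsely populated fields is to measure how often each column is actually filled in. The following is a minimal sketch in Python, not a method described in this article; the function name `column_fill_rates`, the sample recreation-site data and the 50 percent threshold are all assumptions chosen for illustration.

```python
import csv
import io


def column_fill_rates(csv_text):
    """Return the fraction of non-empty values for each column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fields = reader.fieldnames or []
    filled = {name: 0 for name in fields}
    total = 0
    for row in reader:
        total += 1
        for name in fields:
            # Treat missing, empty and whitespace-only cells as unfilled.
            if (row.get(name) or "").strip():
                filled[name] += 1
    return {name: (filled[name] / total if total else 0.0) for name in fields}


# Hypothetical recreation-site sample: the location columns are mostly blank.
sample = """site_name,latitude,longitude
Pine Lake,44.1,-71.2
Bear Creek,,
Eagle Ridge,,
Stone Bridge,,
"""

rates = column_fill_rates(sample)
# The 0.5 cutoff is an arbitrary illustration, not a standard.
sparse = [name for name, rate in rates.items() if rate < 0.5]
print(sparse)
```

A publisher could run a check like this before release and either fill in or explain any column that falls below the chosen threshold.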

2. Formatting errors. These are violations of the quality characteristic of consistency. Examples are a lack of header lines in comma-separated value files and incorrectly quoted CSV values. Additionally, this includes poorly formatted data values for some numbers and dates. For example, we still see dates such as “5-Feb-10” with a two-digit year.
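
As a rough illustration of the fixes involved, the Python sketch below rewrites a two-digit-year date in ISO 8601 form and writes a CSV file with a header line and correctly quoted values. The function name `to_iso_date` and the sample row are assumptions made for the example. The sketch also assumes English month abbreviations and relies on the standard library's convention of reading two-digit years 69 to 99 as the 1900s and 00 to 68 as the 2000s, which is exactly the kind of guess a four-digit year makes unnecessary.

```python
import csv
import io
from datetime import datetime


def to_iso_date(text):
    """Convert a 'day-Mon-yy' date such as '5-Feb-10' to ISO 8601 (YYYY-MM-DD)."""
    parsed = datetime.strptime(text, "%d-%b-%y")
    return parsed.date().isoformat()


print(to_iso_date("5-Feb-10"))  # 2010-02-05
print(to_iso_date("5-Feb-70"))  # 1970-02-05 (the two-digit year is a guess)

# Write a header line first, and let the csv module handle quoting of
# values that contain commas or quotation marks.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["site_name", "opened"])
writer.writerow(['Pine Lake, North "Loop"', to_iso_date("5-Feb-10")])
csv_text = buffer.getvalue()
print(csv_text)
```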

3. Accuracy errors. These are violations of the quality characteristic of correctness. Examples are violations of range constraints, for instance a dataset containing numbers such as “47199998999988888…”
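
A range constraint is straightforward to check mechanically. The sketch below is one possible approach in Python, under the assumption that the column is supposed to hold latitudes between -90 and 90; the function name `out_of_range` and the sample values are hypothetical, and the long number reuses only the digits visible in the truncated example above.

```python
def out_of_range(values, low, high):
    """Return (index, raw_value) pairs that are non-numeric or outside [low, high]."""
    problems = []
    for index, raw in enumerate(values):
        try:
            number = float(raw)
        except ValueError:
            # Not a number at all, so it cannot satisfy the constraint.
            problems.append((index, raw))
            continue
        if not (low <= number <= high):
            problems.append((index, raw))
    return problems


# Hypothetical latitude column with one runaway value and one placeholder.
latitudes = ["44.1", "38.9", "47199998999988888", "n/a"]
bad = out_of_range(latitudes, -90.0, 90.0)
print(bad)
```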

4. Incorrectly labeled records. These are also violations of the quality characteristic of correctness. Unfortunately, agencies are confused as to when to use CSV files versus Excel files. Some datasets are being labeled as CSV files when they are not record-oriented, which they must be, and are just CSV dumps from Microsoft Excel. This indicates a need for more education and training on information management skills.

5. Access errors. These are violations of correct metadata description. Some datasets advertise that they provide raw data, but when you click the link, you are sent to a Web site that does not provide the raw data.

6. Poorly structured data. These are violations of correct metadata description and relevance. Some datasets are formatted using CSV or XML with little regard to how the data would be used. Specifically, some datasets are formatted in nonrecord-oriented manners in which field names are embedded as data values.
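
To make the problem concrete, the Python sketch below takes rows in which the field name is stored as a data value and pivots them into one record per entity. The layout of the input rows, the function name `to_records` and the sample values are assumptions for illustration and are not taken from any particular dataset.

```python
from collections import defaultdict


def to_records(rows):
    """Pivot (record_id, field_name, value) rows into one dict per record."""
    records = defaultdict(dict)
    for record_id, field_name, value in rows:
        records[record_id][field_name] = value
    return [{"id": record_id, **fields} for record_id, fields in records.items()]


# Hypothetical nonrecord-oriented input: "name" and "state" are field names
# that appear in the data itself instead of in a header line.
rows = [
    ("1", "name", "Pine Lake"),
    ("1", "state", "NH"),
    ("2", "name", "Bear Creek"),
    ("2", "state", "CO"),
]

records = to_records(rows)
print(records)
```

Each resulting record can then be written as a single CSV line under a header of `id`, `name` and `state`.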

7. Nonnormalized data. These errors violate the principles of normalization, which attempt to reduce redundant data. Some datasets have repeated fields and superfluously duplicated field values.
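
The Python sketch below shows the basic idea of normalization on a small scale: repeated field values are moved into a lookup table, and each row keeps only a reference to it. The function name `normalize`, the `agency_id` key and the sample rows are hypothetical, and a real dataset would normally be normalized in a database instead of in application code.

```python
def normalize(rows, repeated_fields, key_name="agency_id"):
    """Split rows into a lookup table of repeated values and slimmer rows that reference it."""
    lookup = {}  # tuple of repeated values -> integer key
    slim_rows = []
    for row in rows:
        group = tuple(row[name] for name in repeated_fields)
        # Reuse the existing key for a group seen before, else assign the next one.
        key = lookup.setdefault(group, len(lookup) + 1)
        rest = {k: v for k, v in row.items() if k not in repeated_fields}
        rest[key_name] = key
        slim_rows.append(rest)
    table = [
        {**dict(zip(repeated_fields, group)), key_name: key}
        for group, key in lookup.items()
    ]
    return table, slim_rows


# Hypothetical rows in which the agency details are duplicated on every line.
rows = [
    {"site": "Pine Lake", "agency": "Forest Service", "agency_phone": "555-0100"},
    {"site": "Bear Creek", "agency": "Forest Service", "agency_phone": "555-0100"},
    {"site": "Stone Bridge", "agency": "Park Service", "agency_phone": "555-0199"},
]

agencies, sites = normalize(rows, ["agency", "agency_phone"])
print(agencies)
print(sites)
```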

8. Raw database dumps. Although more of a metadata than data quality issue, this certainly violates the principle of relevance. These datasets have files with names such as table1, table2, etc., and are clearly raw database dumps exported to CSV or XLS. Unfortunately, raw database dumps are usually poorly formatted, have no associated business rules and have terse field names.

9. Inflation of counts. Although also a metadata quality issue, many datasets are differentiated only by year or geography, which clutters search results. A simple solution is to allow multiple files per dataset and thereby combine these by-dimension differences into a single search hit.
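
The grouping suggested here can be sketched in a few lines of Python. The example assumes that by-year entries differ only by a four-digit year in the title, which will not hold for every catalog; the function name `group_by_base_title` and the dataset titles are invented for illustration.

```python
import re
from collections import defaultdict


def group_by_base_title(titles):
    """Group catalog titles that differ only by a four-digit year."""
    groups = defaultdict(list)
    for title in titles:
        # Remove a standalone year such as 2006 along with surrounding spaces.
        base = re.sub(r"\s*\b(19|20)\d{2}\b\s*", " ", title).strip()
        groups[base].append(title)
    return dict(groups)


# Hypothetical catalog entries.
titles = [
    "Air Quality Report 2006",
    "Air Quality Report 2007",
    "Recreation Sites",
]

grouped = group_by_base_title(titles)
print(grouped)
```

Two search hits collapse into a single "Air Quality Report" entry with two files attached, which is the effect the proposed solution is after.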

10. Inconsistent data granularity. This is yet another metadata quality issue that goes to the purpose of Data.gov and its utility for the public. Some datasets are at an extremely high level while others provide extreme detail, without any metadata field denoting them as such.

So what can we do? Here are three basic steps: Attract more citizen involvement to police the data; implement the top ideas on the collaboration site; and ensure agency open-government plans address, in detail, their data quality processes.

About the Author

Michael C. Daconta is the Vice President of Advanced Technology at InCadence Strategic Solutions and the former Metadata Program Manager for the Homeland Security Department. His new book is titled The Great Cloud Migration: Your Roadmap to Cloud Computing, Big Data and Linked Data.

