Web mashups put transparency to the test

Builders of these public interest Web sites say government could do more to make its data more accessible

A cornerstone of the Obama administration’s vision for a more open and accountable government rests on making government data easily accessible to the public.

However, that’s only the first step. From there, people and interest groups must be able to analyze or package the data in ways that make it more useful to others.

Web mashups — sites that present data from multiple sources in one place — are expected to become a valuable tool for sharing and analyzing government data. There are already several examples of mashups that provide more convenient access to public information and services or produce innovative insights about the economy. Visitors could not necessarily get that information from one agency’s site alone.

Architects of three early efforts say that even though agency data is increasingly available in standard Web formats such as Extensible Markup Language, building a mashup is still harder than it needs to be, and the government could do more to help.

“For mashups, it’s now dependent on the implementer to reconcile the data in order to use them, and that can often be an arduous task,” said Ruhollah Farchtchi, an associate at Booz Allen Hamilton.

That work is expected to lighten once Semantic Web, or Web 3.0, technologies become more available. Whereas Web 2.0 is about connecting people through social-networking applications such as Facebook and Twitter, Web 3.0 will be about connecting information in more convenient, meaningful and dynamic ways.

To lay the groundwork for Web 3.0, government and other data providers will use Semantic Web standards, such as the Resource Description Framework (RDF) and Web Ontology Language (OWL), to add context and meaning to the data they post online.

Sites such as the Office of Management and Budget’s Data.gov might end up doing the Semantic Web work for agencies by converting their data to the required standards and providing the application programming interfaces (APIs) that mashup developers will need to access the data.

In the meantime, three mashup examples illustrate the possibilities these composite Web sites can offer and the challenges their creators face.

This We Know

This We Know was built as a Semantic Web mashup from the beginning. The site takes data that agencies have uploaded to Data.gov and converts it into an RDF database, which is used to organize and present the data according to geographic communities. For example, site visitors can choose a town and find data about its population, unemployment rate and cancer incidence, among other facts.
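The article does not say how This We Know performed its conversion or which vocabulary it used. As a rough illustration only, the sketch below turns one tabular record into RDF statements in N-Triples form, keyed on a place name. The `example.org` namespace, the field names and the sample record are all invented.

```python
from urllib.parse import quote

# Hypothetical namespace; not the vocabulary This We Know actually used.
BASE = "http://example.org/"


def row_to_triples(row, key_field):
    """Turn one tabular record into N-Triples lines keyed on a place name."""
    subject = f"<{BASE}place/{quote(row[key_field])}>"
    triples = []
    for field, value in row.items():
        if field == key_field:
            continue
        predicate = f"<{BASE}property/{quote(field)}>"
        # Escape backslashes and quotes so the literal stays valid N-Triples.
        literal = str(value).replace("\\", "\\\\").replace('"', '\\"')
        triples.append(f'{subject} {predicate} "{literal}" .')
    return triples


# Made-up sample record, for illustration only.
sample = {"town": "Sample Town", "population": 1200, "unemployment_rate": 4.8}
for line in row_to_triples(sample, "town"):
    print(line)
```

Each printed line is one subject-predicate-object statement, so every fact about a town hangs off the same place identifier. That is what makes organizing the data by geographic community straightforward.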

The site won a competition launched last year by the nonprofit Sunlight Foundation, which honored the best application developed from data on Data.gov.

The site’s developers organized the site based on geographic places to make it more compelling to visitors, said Michael Knapp, managing director and partner of GreenRiver.org, a Vermont-based Web application development and data visualization company.

Place is also a convenient foundation for developers because many government databases already include location data.

GreenRiver.org partnered on the mashup with Sway Design, a Web design studio, and Semantic Web database company Intellidimension.

The site represents about 350 hours of labor. Most of that was spent on developing the interface, though the companies also spent considerable time cleaning up the data because of the nonstandard way it was described in each of the databases.

Knapp said agencies should improve data standards because mashup developers need to understand what the various fields and values mean in the databases. Agencies could significantly help developers by creating and following a standard data dictionary.

“If you have that, it takes 75 percent of the work out of putting these mashups together,” Knapp said.
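To make Knapp's point concrete, here is a minimal sketch of what a shared data dictionary buys a mashup developer. It assumes a hypothetical dictionary that maps each agency's local column name to one standard field name; every name in it is invented for illustration.

```python
# Hypothetical data dictionary: maps each agency's local column name
# to one shared field name. All names here are invented.
DATA_DICTIONARY = {
    "pop_total": "population",
    "POPULATION": "population",
    "unemp_rt": "unemployment_rate",
    "UnemploymentRate": "unemployment_rate",
}


def normalize(record):
    """Rename known fields to their standard names; set unknown ones aside."""
    known, unknown = {}, {}
    for field, value in record.items():
        if field in DATA_DICTIONARY:
            known[DATA_DICTIONARY[field]] = value
        else:
            unknown[field] = value
    return known, unknown


known, unknown = normalize({"pop_total": 1200, "unemp_rt": 4.8, "fld_17": "?"})
print(known)    # fields the dictionary could translate
print(unknown)  # fields a developer would still have to decode by hand
```

Without a published dictionary, every field lands in the second group, and working out what each one means is the cleanup labor the developers describe.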

FedViewer.com

FedViewer.com presents a host of national economic data from the U.S. Federal Reserve as a visual temperature map — darkening shades of red indicate trouble, lighter colors point to positive trends. The site shows the performance over time of a range of economic indicators, along with contextual facts, such as when a recession or a government bailout program began.

The site is the brainchild of Will Riley, who wanted to examine the Federal Reserve's data as part of an investment analysis project. Traditional tools such as spreadsheets and graphs did not work well.

“Representing 10, 20 or several hundred variants simultaneously forces one to consider [other] solutions,” said Riley, who runs Real World Analytics, an Internet consulting firm. “The FedViewer.com moving average display attempts to manage such combinations in a comfortable and, hopefully, intuitive interface.”
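The article does not describe how FedViewer.com computes its display. The sketch below shows one plausible way to pair a moving average with a red scale. The window size, the 0-to-255 range and the assumption that higher values mean more trouble are all illustrative choices, not details from the site.

```python
def moving_average(values, window):
    """Trailing moving average; returns one value per full window."""
    return [
        sum(values[i - window:i]) / window
        for i in range(window, len(values) + 1)
    ]


def shade(value, low, high):
    """Map a value to a red intensity from 0 (lightest) to 255 (darkest).

    Assumes higher values signal more trouble, which holds for some
    indicators (such as unemployment) but not for others.
    """
    if high == low:
        return 0
    clamped = min(max(value, low), high)
    return round(255 * (clamped - low) / (high - low))


# Invented series, for illustration only.
series = [4.0, 4.5, 5.0, 6.5, 8.0, 9.5]
smoothed = moving_average(series, window=3)
print([shade(v, low=min(smoothed), high=max(smoothed)) for v in smoothed])
```

Reducing each indicator to a single color per period is what lets dozens of series share one screen, which a spreadsheet chart cannot do comfortably.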

The Federal Reserve offers some of its statistical data in comma-separated values and XML formats, which developers can incorporate automatically into a mashup. But other important databases require techniques such as screen-scraping, in which developers write a custom program to read and copy data presented on a Web page rather than accessing the original database.

Screen-scraping is a brittle technique, however, because it depends on page content retaining a consistent form, Riley said. A change in a single character in the values is enough to break the extraction process. Screen-scraping is also time-consuming, so it isn’t practical for pulling data from many different Web sites.
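The brittleness Riley describes is easy to reproduce. The sketch below is a generic illustration, not FedViewer.com's code: a small scraper built on Python's standard `html.parser` module that assumes the figure it wants is always the second table cell and always a plain number. The page fragments are invented.

```python
from html.parser import HTMLParser


class CellScraper(HTMLParser):
    """Collects the text of every <td> cell on a page."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.cells[-1] += data


def scrape_rate(html):
    """Assumes the rate is always the second cell and a plain number."""
    scraper = CellScraper()
    scraper.feed(html)
    return float(scraper.cells[1])


page = "<table><tr><td>Rate</td><td>5.25</td></tr></table>"
print(scrape_rate(page))

# The publisher adds one character to the value and the scraper fails.
changed = "<table><tr><td>Rate</td><td>5.25%</td></tr></table>"
try:
    scrape_rate(changed)
except ValueError as err:
    print("extraction broke:", err)
```

A CSV or XML feed spares the developer this guesswork about page layout, which is why the formats the Federal Reserve already publishes are so much cheaper to consume.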

Time equals cost, so datasets that can’t be handled automatically likely won't be part of the information that FedViewer.com seeks to present.

Riley said he looks forward to the day when automated queries written in SPARQL, a query language for RDF databases, can pull timeline data by topic from federal and municipal sites, Wikipedia, and elsewhere.
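A query of the kind Riley has in mind might look like the sketch below, which asks for dated values on one topic and encodes the request as a URL using the SPARQL protocol's `query` parameter. The endpoint address and the `ex:` vocabulary are hypothetical, and the code only builds the URL without contacting any server.

```python
from urllib.parse import urlencode

# Hypothetical endpoint and vocabulary; the article names no such service.
ENDPOINT = "http://example.org/sparql"

QUERY = """
PREFIX ex: <http://example.org/property/>
SELECT ?date ?value
WHERE {
  ?obs ex:topic "unemployment" ;
       ex:date ?date ;
       ex:value ?value .
}
ORDER BY ?date
"""


def build_request_url(endpoint, query):
    """Encode a query as a GET request using the protocol's 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


url = build_request_url(ENDPOINT, QUERY)
print(url)
```

If many publishers exposed endpoints like this with shared vocabularies, the same query text could in principle be sent to each of them, replacing one custom scraper per site.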

“The scope of official [government] activity could be rebiased toward the provision of public APIs rather than the work of building and maintaining data browsing sites, such as we have now,” he said.

For that to happen, he said he believes a single federal entity should develop data-mastering and publication standards. In addition, an agency should create guidelines for handling data at the municipal level.

Riley said he believes that President Barack Obama’s Open Government Directive, which calls on agencies to release more accessible and user-friendly datasets, is a historic move and a step in the right direction.

Neighborhood Knowledge California

Neighborhood Knowledge California provides tools that allow potential homeowners and community grass-roots organizations throughout the state to design maps, collate information and publish their own research, using data from private and public resources available on NKCA.

The goal is to increase understanding of issues such as fair housing and fair lending laws in areas where there are usually few resources that people can tap to find that information. The University of California at Los Angeles, which runs NKCA, worked with more than 40 organizations in five metropolitan areas in the state to build the site.

NKCA builds on the success of Neighborhood Knowledge Los Angeles, the first site of its kind. Another project based on the NKLA example is Living Independently in Los Angeles, which was created by local residents with disabilities. The site uses their expert knowledge to identify and map independent living resources.

The tools used with the newer sites have come a long way since NKLA debuted in the mid-1990s, when the goal was the simple dissemination of information, said Charanjeet Singh, NKCA’s program manager.

In the mid-1990s, users in search of particular data, such as building permits or code complaints, needed to separately query each applicable database to get the information. Now the sites include an integrated database in which users can view all of the property data at one time.
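The difference between querying each database separately and viewing everything at once can be sketched as a grouping step on a shared property identifier. NKCA's actual schema is not described in the article, so the `parcel` key, the source names and the sample records below are all assumptions.

```python
from collections import defaultdict

# Invented sample records; the field names are assumptions.
permits = [{"parcel": "A-1", "permit": "roof repair"}]
complaints = [
    {"parcel": "A-1", "complaint": "broken stairs"},
    {"parcel": "B-2", "complaint": "no heat"},
]


def integrate(**sources):
    """Group records from several sources under their shared parcel ID."""
    by_parcel = defaultdict(lambda: defaultdict(list))
    for name, records in sources.items():
        for record in records:
            by_parcel[record["parcel"]][name].append(record)
    return by_parcel


view = integrate(permits=permits, complaints=complaints)
# One lookup now returns everything known about a property.
print(dict(view["A-1"]))
```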

Mapping tools also have advanced. The earliest sites used spreadsheets that automatically sorted data by geographic fields, which users then extracted to use with geographic information system mapping software. That gave a decent, though limited, 2-D representation, but the goal is a more immersive system.

NKCA now uses Google Maps with Street View, which allows users to automatically produce maps in which people can zoom in to see photographs of locations and properties. The goal is to tame the flood of available data by making it more immediately meaningful, which is where the Semantic Web will help.

“It means you don’t have to go through the data hassles to make it more accessible,” Singh said. “You can bring in more data from different sources and give it different contexts.”

He said he believes government agencies are doing a better job of understanding what type of data is important to the public. The missing link is the Semantic Web layer, built on standards such as OWL and RDF, along with the APIs that agencies need to provide to make their data more easily accessible to users, many of whom are building their sites on the fly.


Reader comments

Sun, Aug 29, 2010

The article says "This We Know" converted the data to RDF; that sounds like some involved work. Can somebody share how this was done, what ontology was used, etc.? Is there any other place where this information is available?

Thu, Jan 21, 2010

In addition to Neighborhood Knowledge California, the Business.gov website has a brand new Licenses & Permits web service. It allows business owners to find the licenses and permits they need to stay compliant with federal, state and local law. Visit: http://www.business.gov/about/features/api/

Thu, Jan 7, 2010

"... Sites such as the Office of Management and Budget’s Data.gov might end up doing the Semantic Web work for agencies by converting their data to the required standards and providing the application programming interfaces (APIs) that mashup developers will need to access the data." ... seems to miss the point that driving transparent data formats up to the data production level minimizes costs for every downstream operation, prevents inadvertent lossy conversions for heretofore unanticipated analyses, and allows new internal organizational uses of the same data. As a pure overhead cost, even if the data hasn't been hashed/formatted into uselessness, "make transparent at the end" is pretty indefensible.
