Diamonds in the data

Federal agencies increasingly use data mining to extract valuable information buried in large databases

At this moment, public health officials are poring over terabytes of health care data to detect the first signs of a possible pandemic flu outbreak, bioterrorism attack or other contagion. The Centers for Disease Control and Prevention began a biosurveillance program in 2003, but advances in information exchange standards and concerns about pandemic flu have accelerated its national implementation.

The federal initiative, called BioSense, analyzes existing health care records, such as diagnoses, laboratory test results, physician visits and hospitalizations. The results help public health officials discover where an event is occurring and decide when to intervene with vaccines or quarantines. The CDC works with regional hospital systems to create secure connections between their health care databases and the federal database. The data does not contain patient names, medical numbers or personal identifiers, CDC officials said.

Like the CDC, Medicaid agencies, NASA and many other government agencies have begun to employ software to look for meaningful patterns in large volumes of data. Their searches have various purposes, such as pinpointing criminal activity, improving customer service and detecting fraud, waste and abuse. Some call this activity data surveillance, others call it data mining, and still others prefer the term data analysis. Whatever such searches are called, they usually require federal agencies to strike a balance between observing behaviors and violating privacy.

CDC officials say their project should not be labeled data mining. Lynn Steele, director of the Emergency Preparedness and Response Division at CDC’s National Center for Public Health Informatics, said the group focuses on acquiring specific clinical and health care data. “We wouldn’t call it data mining because it’s not looking for data abstractly,” Steele said. “We are looking at clinical and health care data that have been proven useful for public health purposes.”

Linda Koontz, information management issues director at the Government Accountability Office, said she is not familiar enough with the CDC’s initiative to say whether it is data mining. But Koontz said some agencies she interviewed about programs that mine data refuse to identify their programs as such.

“Different people sometimes mean different things by the term data mining,” she said. “There isn’t one definition that everyone agrees with. A lot of people feel aversion to using the word ‘data mining’ because they think that casts a negative pall over what they are doing.”

Koontz said some of her discussions with agency officials turned into semantic arguments over the term. “Even though it looked exactly like data mining to us, they would call it some kind of analysis,” she said.

GAO defines data mining as the application of database technology and techniques to uncover hidden patterns and subtle relationships in data and infer rules that allow for the prediction of future results. Koontz said she doesn’t understand why data mining has a negative connotation. “Analysis is not evil,” she said.

Koontz added, however, that agencies should comply with the E-Government Act, which requires agencies to conduct privacy impact assessments of proposed data-mining efforts. Agency chief privacy officers should lead those assessments, she said.

Federal officials and others expect greater benefits from data mining as mining algorithms become more sophisticated. However, because an increasing number of federal agencies are mining personal data, some lawmakers and watchdog groups are concerned about whether proper mechanisms are in place to safeguard personal information. They are calling for greater compliance with privacy regulations.

GAO recently received a request from Rep. David Obey (D-Wisc.) and Rep. Martin Sabo (D-Minn.) to review privacy protections for a Homeland Security Department data-mining program aimed at better understanding terrorism. Koontz said GAO just began its review of that program, which is known as Analysis, Dissemination, Visualization, Insight, and Semantic Enhancement, or ADVISE.

In May, Koontz testified before the House Judiciary Committee’s Commercial and Administrative Law Subcommittee that agencies failed to comply with data-mining privacy protocols as recently as August 2005. “Increased use by federal agencies of data mining — the analysis of large amounts of data to uncover hidden patterns and relationships — has been accompanied by uncertainty regarding privacy requirements and oversight of such systems,” Koontz said.

GAO, for example, found that agencies employing data mining took many steps to protect privacy, such as issuing public notices. None, however, followed all privacy protections, such as including in public notices the intended uses of personal information.

Before the BioSense project was expanded nationwide, the program already had proven its value. Data gathered from national laboratories and Department of Veterans Affairs and Defense Department health care facilities helped identify and treat seasonal flu and gastrointestinal disease outbreaks, Steele said.

Meningitis can also be contained with assistance from BioSense. The project allows local public health officials to look for cases of meningitis and rapidly identify people who might have been exposed to someone infected with meningococcal bacteria, Steele said. Then public health officials can respond quickly to stop the infection.

While the CDC monitors data for signs of bird flu, the Interior Department uses a similar technique to examine data from animals, which are often the sentinels for disease in humans. Acting on guidance from President Bush’s National Strategy for Pandemic Influenza, Interior’s National Wildlife Health Center collects and analyzes data from live birds to help detect the presence of the avian influenza virus in U.S. migratory birds.

Monitoring animals is more difficult than monitoring people because much less data is available on wildlife, and that data lacks standardization. “It’s really hard to mine data that doesn’t exist,” said F. Joshua Dein, a principal investigator at the National Wildlife Health Center. No legislative mandate exists to collect national wildlife disease data because it does not have the same high profile as human health data or domestic animal data, he said.

Interior’s program, known as the National Biological Information Infrastructure Wildlife Disease Information Node, is designed to develop tools that will allow states to collect their own data and create an infrastructure for sharing that data in a standardized format, said Dein, who leads the program.

Following the first detection of highly pathogenic bird flu, natural resource agencies will have the opportunity to study how the disease spreads in the wild, Dein said. Deadly animal illnesses can mutate into deadly human illnesses and then spread throughout the human population.

The objective of Interior’s program is to link the animal findings with health care data on domestic animals and people. Most of the animal data obtained so far has come from Alaska, where there are migratory paths to and from Asia. With funding from the federal government this summer and fall, additional states will be collecting data. All information will be stored in a SQL Server database and will be available for review and analysis using HTML/JavaScript Web applications and ESRI geographic information system software.

Wildlife data is public and does not include personal information. Many other federal data-mining efforts do not involve personal information. For example, NASA looks for patterns and relationships in huge volumes of earth science data collected by satellites and sensors. The findings are used to better understand climate.

The government first employed data mining for fraud detection in a manner similar to the private sector’s analysis of credit card fraud.

But some data-mining programs are more controversial. The possibility that the federal government is using supercomputers to sift through tens of millions of phone records came to light this spring after USA Today reported that the National Security Agency had collected records from AT&T, Verizon and BellSouth.

Lawmakers have begun to question the security of personal information used in data mining as more federal agencies turn to the technology for help.

Two years ago, Sen. Daniel Akaka (D-Hawaii), ranking member of the Homeland Security and Governmental Affairs Committee’s Oversight of Government Management, the Federal Workforce and the District of Columbia Subcommittee, asked GAO to identify the purposes of data-mining activities within the federal government. More than 60 percent of the 199 efforts identified used personal information.

Akaka said data-mining tools can be helpful in organizing and connecting information to eliminate waste, stop criminal activity and improve public service. He said, however, that GAO’s August 2005 follow-up report, which found that agencies are failing to meet necessary privacy and security requirements, represents “a troubling trend, given the number of data-mining activities in the federal government that use personal information.”

Akaka also expressed concern that federal privacy laws and senior privacy officials may not be sufficiently regulating data-mining activities. He said he is unsure whether those individuals have had proper training on privacy matters, whether they have adequate expertise in privacy laws and whether they have sufficient authority to ensure compliance with those laws. “It’s also unclear what protections federal privacy laws actually provide and how the federal government can assure the accuracy of the information used from the private sector,” he said.

In May, Akaka moved to strengthen the role of the Homeland Security Department’s chief privacy officer by introducing the Privacy Officer With Enhanced Rights Act of 2006. With the growth of data-mining activities and agency failures to follow privacy practices, Akaka said he might also look at broadening the bill to cover other agencies. He has also called on the Homeland Security and Governmental Affairs Committee to hold hearings on the 32-year-old Privacy Act.

Several congressional committees are also considering legislation that would mandate privacy protections for private-sector data because the federal government relies on this data for many data-mining activities.

“Public confidence of the government’s use of these activities is undermined because of unregulated and sometimes inaccurate information from the private sector, combined with agencies’ failure to follow key privacy and security laws,” Akaka said. “The fact that the Privacy Act has numerous exemptions for intelligence and law enforcement purposes...raises key questions as to what privacy rules govern in those circumstances.”

Some privacy advocates say any promises made by federal agencies about protecting personal information cannot be trusted because agencies have little experience with safeguarding personal information used in data-mining projects.

Lee Tien, a senior staff attorney at the Electronic Frontier Foundation, cited the Transportation Security Administration’s passenger screening system, Secure Flight, which violated the privacy of potentially millions of people. Last July, a GAO audit found that a TSA contractor, acting on behalf of the agency, collected more than 100 million commercial data records containing personal information, such as names, birthdates and telephone numbers, without informing the public.

“We are in a very difficult area, technologically, as well as policywise,” Tien said. “It is really important to emphasize that we don’t know the answers.”

He added that Congress should enact legislation and appropriate funding to enable program managers and agency privacy officials to hire more staff for enforcement.

Debate over the use of personal information will likely grow, as federal and state governments discover more applications for data mining.

Guy Amisano, president and chief executive officer of data analytics company Salient, said the company’s government clients have had much success managing the Medicaid program with data mining. For almost a year, several counties in New York state have been tracking Medicaid recipient and provider payments to identify potential waste, fraud and abuse.

With Salient’s software, county officials can spot anomalies in Medicaid recipients’ behavior. By pegging outliers, program administrators figure out the causes of discrepancies and make improvements to reduce costs or improve services. Amisano is working on a statewide Medicaid system that is expected to save New York $5 billion to $12 billion a year.
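The article does not describe how Salient's software identifies outliers. As a rough illustration of the general idea only, the sketch below flags payment totals that sit far from the median, using a modified z-score based on the median absolute deviation. The provider IDs, dollar amounts and the 3.5 cutoff are hypothetical, chosen only for the example.

```python
from statistics import median


def flag_outliers(payments, threshold=3.5):
    """Return the IDs whose payment total lies far from the median.

    payments: dict mapping a (hypothetical) provider ID to a payment total.
    Uses the modified z-score built on the median absolute deviation (MAD),
    which is less distorted by extreme values than a mean-based score.
    """
    values = list(payments.values())
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        return []  # no spread to measure against
    flagged = []
    for key, value in payments.items():
        score = 0.6745 * (value - center) / mad
        if abs(score) > threshold:
            flagged.append(key)
    return flagged


# Hypothetical monthly totals; one provider bills far more than its peers.
sample = {"P01": 1200.0, "P02": 1350.0, "P03": 1100.0, "P04": 1280.0, "P05": 9800.0}
print(flag_outliers(sample))
```

A flagged record is only a starting point for review. As the article notes, administrators still have to figure out what caused the discrepancy.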

Amisano said it is people, not technology, who are mostly to blame when personal information leaks. He said secure systems look at aggregated data or data cleansed of personal identifiers. Individuals are often identified by a random number instead of a name, so the user has access only to the individual’s behavior, not the individual’s personal information.
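The random-number approach Amisano describes can be sketched in a few lines. This is a minimal illustration, not a description of any vendor's system: the field names are hypothetical, and it assumes the table linking tokens back to names is stored separately from the data analysts see.

```python
import secrets


def pseudonymize(records, id_field="name"):
    """Replace the identifying field in each record with a random token.

    Returns (cleaned_records, lookup). The lookup maps token -> original
    identifier and is assumed to be kept apart from the cleaned records.
    """
    token_for = {}
    cleaned = []
    for record in records:
        identity = record[id_field]
        if identity not in token_for:
            # The same person always gets the same token, so behavior
            # can still be tracked across records.
            token_for[identity] = secrets.token_hex(8)
        row = {k: v for k, v in record.items() if k != id_field}
        row["token"] = token_for[identity]
        cleaned.append(row)
    lookup = {token: identity for identity, token in token_for.items()}
    return cleaned, lookup


# Hypothetical records for illustration.
visits = [
    {"name": "Pat Example", "visits": 3},
    {"name": "Lee Sample", "visits": 1},
    {"name": "Pat Example", "visits": 2},
]
cleaned, lookup = pseudonymize(visits)
print(cleaned)
```

Because each person maps to one stable token, an analyst can still see patterns in an individual's behavior without seeing the name.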

Data security should not deter federal agencies from taking advantage of data mining, Amisano said, adding that “technology is more than sufficient to guarantee individual privacy rights.”

Crime forecasting could be the next big thing

Law enforcement is likely to be the next area of government to adopt data mining, experts say. Data-mining technology can find patterns that link criminal incidents to factors such as weather, sporting events and paydays. The patterns indicate that when those same factors recur, the probability of another crime is high. Police can use such knowledge to make decisions about deploying forces.
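In its simplest form, this kind of forecasting amounts to counting how often incidents occurred on past days that shared the same conditions. The sketch below is a deliberately bare-bones illustration under that assumption. The factor names and history are invented, and real systems would use far more data and more sophisticated models.

```python
from collections import defaultdict


def incident_rates(history):
    """Estimate the share of past days with an incident for each factor combination.

    history: iterable of (factors, had_incident) pairs, where factors is a
    tuple such as (weather, is_payday).
    """
    days = defaultdict(int)
    incidents = defaultdict(int)
    for factors, had_incident in history:
        days[factors] += 1
        if had_incident:
            incidents[factors] += 1
    return {factors: incidents[factors] / days[factors] for factors in days}


# Hypothetical history: (weather, is_payday) and whether an incident occurred.
past_days = [
    (("hot", True), True),
    (("hot", True), True),
    (("hot", True), False),
    (("cold", False), False),
]
print(incident_rates(past_days))
```

A high rate for a given combination of factors would suggest, not prove, that similar days in the future deserve extra attention.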

Workforce planning is another direction in which the use of data mining is heading. As baby boomers leave government service, federal officials want to determine what actions they can take to retain critical skills or experience levels.

Data mining is an ideal tool for helping agencies retain the expertise of their employees. For example, the technology might be able to calculate the likelihood of retirements or resignations based on employee age, most recent promotion and whether the employee was assigned to his or her preferred agency.

Data mining becomes a USDA management tool

One of the earlier adopters of data mining in the government was the Risk Management Agency (RMA) in the Agriculture Department’s Federal Crop Insurance Program. The agency built a data warehouse to analyze policyholder records for compliance with the insurance program’s rules. Congress mandated the data-mining effort in 2000.

RMA’s first data-mining project produced $48 million in savings compared with what was paid to the same group of policyholders during the previous year. In the past five years, the Federal Crop Insurance Program has achieved $460 million in cost avoidance. The agency has spent about $20 million on data-mining efforts.

RMA scrutinizes self-reported policyholder filings to focus on losses that the USDA and policyholders cannot explain. For example, the computer might match drought filings against soil and weather data and all data for those policyholders to determine if the local office staff or the policyholder incorrectly filled out a form.
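The kind of cross-check described here can be pictured as a comparison between what a policyholder reported and what independent records show. The sketch below assumes hypothetical claim fields, county-level rainfall figures and an arbitrary 75 percent cutoff. It is not RMA's actual method, which the article does not detail.

```python
def unexplained_drought_claims(claims, rainfall_by_county, normal_by_county, ratio=0.75):
    """Return policy IDs of drought claims where rainfall was not unusually low.

    A claim is flagged when the county's observed rainfall is at least
    `ratio` times its normal rainfall, meaning the weather data does not
    obviously support a drought loss.
    """
    flagged = []
    for claim in claims:
        if claim["cause"] != "drought":
            continue
        county = claim["county"]
        observed = rainfall_by_county.get(county)
        normal = normal_by_county.get(county)
        if observed is None or normal is None:
            continue  # no weather data to compare against
        if observed >= ratio * normal:
            flagged.append(claim["policy_id"])
    return flagged


# Hypothetical filings and rainfall totals (inches) for illustration.
filings = [
    {"policy_id": "A1", "county": "X", "cause": "drought"},
    {"policy_id": "B2", "county": "Y", "cause": "drought"},
    {"policy_id": "C3", "county": "X", "cause": "hail"},
    {"policy_id": "D4", "county": "Z", "cause": "drought"},
]
observed_rain = {"X": 30.0, "Y": 10.0}
normal_rain = {"X": 32.0, "Y": 28.0}
print(unexplained_drought_claims(filings, observed_rain, normal_rain))
```

A flagged policy is not evidence of fraud. It marks a filing that someone should look at more closely, which could turn up anything from a form filled out incorrectly to a genuine local anomaly.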

Program officials say the effort does not solely look for criminal activity, but it gives USDA officials a starting point from which to figure out what is creating the losses. Garland Westmoreland, director of the RMA’s strategic data acquisition and analysis unit, said personal contact with policyholders reduces erroneous payments. “When we do work with the individuals, the result of having someone from the local office work with them seems to have a huge effect on mitigating losses.”

No single factor causes claim problems, Westmoreland said. Negligence, fraud, miscommunication and simple errors can cause miscalculations. Data mining simply allows USDA officials to save time in repairing the problem, he said. “We have over a million policies. We try determining which of the policies need the most attention.”

He added that only individuals who have undergone background checks can access the data. RMA’s chief information officer is responsible for ensuring privacy provisions are followed throughout all phases of the program. RMA does not have sufficient resources to pay the salary of a chief privacy official, an RMA spokeswoman said.

