How to train your algorithm

The federal government is starting to bet big on artificial intelligence, but agencies must be careful not to repeat IT mistakes of the past that have resulted in insecure legacy technology, two feds told FCW during a July 17 panel on emerging technology.

Jeff Alstott, a program manager for the Intelligence Advanced Research Projects Activity, oversees a number of projects designed to advance long-term development of artificial intelligence for the intelligence community. Much of the legacy tech built in the 1970s and ’80s, he noted, was developed with functionality, not security, in mind. The federal government (and much of the rest of the world) has paid a heavy price since.

That same dynamic, Alstott warned, is playing out in the AI space, with companies and agencies so focused on getting their algorithms to work that they are inadvertently building a foundation of insecurity for systems that will drive our cars, route our planes and perform an increasing number of critical functions in society.

“Just as we’ve baked in certain technical and organizational mistakes back then, thus we are permanently insecure in the digital cyber IT world, I’m trying to avoid us being permanently insecure in the AI world,” said Alstott.

Alstott is overseeing a new project, TrojAI, that IARPA hopes will one day be capable of detecting attempts by bad actors to intentionally manipulate the training data used for advanced automated systems.

The “Troj” in TrojAI stands for “Trojan,” and the program is designed to sniff out more than mere sabotage or crude data poisoning of an algorithm. Rather, IARPA wants to make sure a sophisticated attacker can’t quietly alter training data to teach an algorithm destructive behavior, such as tampering with the images of road signs used to train self-driving cars so the vehicles misread signs on the road and crash.
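To make the threat concrete, here is a minimal sketch of the kind of training-data Trojan described above: a small trigger patch is stamped onto a fraction of the training images and those labels are flipped, so a model trained on the data behaves normally until it sees the trigger. The function, patch size and poison rate are illustrative assumptions, not details of IARPA’s TrojAI program.

```python
# Illustrative sketch of a training-data "Trojan" (backdoor) attack.
# All names and parameters are assumptions for demonstration only.
import numpy as np

def poison_dataset(images, labels, target_label, poison_fraction=0.05, seed=0):
    """Stamp a trigger patch onto a random subset of images and relabel them."""
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    n_poison = int(len(images) * poison_fraction)
    idx = rng.choice(len(images), size=n_poison, replace=False)
    # "Trigger": a bright 3x3 square in the bottom-right corner of each image.
    images[idx, -3:, -3:] = 1.0
    # Flip the label so the model associates the trigger with target_label,
    # e.g. a stop sign relabeled as a speed-limit sign.
    labels[idx] = target_label
    return images, labels

# Toy usage: 100 fake 32x32 grayscale "road sign" images across 10 classes.
images = np.random.rand(100, 32, 32).astype(np.float32)
labels = np.random.randint(0, 10, size=100)
poisoned_images, poisoned_labels = poison_dataset(images, labels, target_label=3)
```

A model trained on such data can score well on ordinary test images, which is why this kind of manipulation is hard to catch after the fact.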

Many algorithms start training on open source data in their early stages. While open source products aren’t necessarily inherently less secure, Alstott said IARPA is worried about the potential for targeted compromise or sabotage of commonly used datasets that often form the educational foundation for nascent AI programs.

“You might train a neural network from scratch, but we often don’t do that,” said Alstott. “We often grab an existing neural network off GitHub … and the problem is that data and those data networks are not secure, they’ve not been validated and that supply chain that led into those neural networks is long, vast and very difficult to track.”
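One basic hygiene step for the supply-chain problem Alstott describes is to verify a downloaded model or dataset against a checksum published by a source you trust before loading it. The sketch below shows the idea; the file name and expected hash are placeholders, not a real published artifact.

```python
# Verify downloaded model weights against a published SHA-256 checksum
# before use. Placeholder path and hash; not a real artifact.
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 digest of a file without loading it all at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# In practice the expected hash would come from the model's publisher or an
# internal registry, not from the same place the file was downloaded from.
EXPECTED_SHA256 = "0" * 64          # placeholder value
MODEL_PATH = "pretrained_model.bin" # placeholder path

if sha256_of(MODEL_PATH) != EXPECTED_SHA256:
    raise RuntimeError("Weights do not match the published checksum; refusing to load.")
```

A checksum only proves the file is the one its publisher signed off on; it says nothing about whether that publisher’s own training pipeline was clean, which is the deeper problem Alstott is pointing at.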

Experts increasingly cite the quality of the data used to train algorithms as the most critical factor in an AI system’s lifecycle. Such data, whether for an autonomous vehicle or a consumer lending program, must be “representative,” Alstott said, with all the dirt and noise that will present itself in real-world conditions, or the consequences could be significant.
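One common way to make training data more “representative” in this sense is to augment clean examples with the kind of noise, lighting changes and occlusions a sensor would encounter in the field. The sketch below is illustrative only; the specific transforms and parameters are assumptions, not a standard.

```python
# Simple augmentation sketch: add real-world-style dirt and noise to a
# clean training image. Transforms and magnitudes are illustrative choices.
import numpy as np

def augment(image, rng):
    """Return a noisier copy of a square float image with values in [0, 1]."""
    out = image.copy()
    out += rng.normal(0.0, 0.05, size=out.shape)   # sensor noise
    out *= rng.uniform(0.6, 1.4)                    # lighting / exposure change
    if rng.random() < 0.5:                          # occasional small occlusion
        x, y = rng.integers(0, out.shape[0] - 4, size=2)
        out[x:x + 4, y:y + 4] = 0.0
    return np.clip(out, 0.0, 1.0)

rng = np.random.default_rng(0)
clean = np.random.rand(32, 32)   # stand-in for one clean training image
noisy = augment(clean, rng)
```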

DOD’s AI strategy, rolled out in February, also calls for the Joint Artificial Intelligence Center to leverage unified data stores, reusable tools, frameworks and standards, and cloud and edge services. The Trump administration’s executive order on AI likewise lists a strategic objective to ensure agencies have access to “high quality and fully traceable federal data, models and resources” without sacrificing “safety, security, privacy and confidentiality protections” as the government makes that shift. However, the order does not mention or lay out a strategy for dealing with algorithmic bias, which can often be traced back to the data that feeds into AI systems.

Martin Stanley, a senior technical advisor at the Cybersecurity and Infrastructure Security Agency who focuses on artificial intelligence, told FCW that AI managers get into trouble when they train an algorithm for a specific purpose and then try to expand the scope of its application using that same data.

“All the implementations we’ve seen to date are narrow AI applications, so you’ve really got to understand that it’s focused on a particular task,” Stanley said. “You’ve got to have the right hardware, the right algorithm, the right data and the right know-how around that narrow implementation, and you have to be very, very careful about generalizing that further to some other space.”
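One rough way to act on that caution before reusing a model in a new setting is to check whether the new inputs even resemble the data the model was trained on. The sketch below flags a simple distribution shift; the z-score heuristic and the threshold are illustrative assumptions, not a recommended test.

```python
# Crude distribution-shift check: flag new data whose per-feature means
# drift far from the training data, measured in training standard deviations.
import numpy as np

def looks_out_of_distribution(train_features, new_features, z_threshold=3.0):
    """Return True if the new data's feature means drift past the threshold."""
    mu = train_features.mean(axis=0)
    sigma = train_features.std(axis=0) + 1e-8
    z = np.abs((new_features.mean(axis=0) - mu) / sigma)
    return bool((z > z_threshold).any())

train = np.random.default_rng(0).normal(0.0, 1.0, size=(1000, 16))
shifted = np.random.default_rng(1).normal(5.0, 1.0, size=(200, 16))
print(looks_out_of_distribution(train, shifted))   # True: the new data has drifted
```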

As an algorithm’s application broadens, pairing the system with human analysis from the outset is one way to institute quality control. People are still better at decision-making than most contemporary AI systems, which generally must ingest hundreds if not thousands of examples to make connections and pick up nuance that humans grasp intuitively.

“Humans are miraculous and life is miraculous generally in that we can learn from one or no examples and then apply that to a novel situation,” said Stanley.
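In practice, that pairing often takes the form of a confidence threshold: predictions the model is unsure about go to a human analyst instead of being acted on automatically. The sketch below illustrates the idea; the threshold and the routing are assumptions, not a description of any agency’s system.

```python
# Human-in-the-loop sketch: send low-confidence predictions to a reviewer.
# The 0.9 threshold and the routing are illustrative assumptions.
def decide(probabilities, threshold=0.9):
    """Return (predicted_label, needs_human_review) for one softmax output."""
    label = max(range(len(probabilities)), key=probabilities.__getitem__)
    return label, probabilities[label] < threshold

label, needs_review = decide([0.05, 0.55, 0.40])
if needs_review:
    print(f"Low confidence; routing candidate label {label} to a human analyst.")
else:
    print(f"Acting on label {label} automatically.")
```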