What do video surveillance, speech recognition and autonomous vehicles have in common? They're all getting better amazingly quickly -- and needing less and less human help to do so.
In the past decade, computer scientists have made remarkable progress in creating algorithms capable of discerning information in unstructured data. In controlled settings, computer programs can now recognize faces in photographs and transcribe speech. Cars laden with laser sensors can create 3-D representations of the world and use those representations to navigate safely through chaotic, unpredictable traffic.
In the coming decade, improvements in computational power and techniques will allow programs such as voice and face recognition to work in increasingly robust settings. Those technological developments will affect broad swathes of the American economy and have the potential to fundamentally alter the routines of our daily lives.
There is no single reason for these improvements. Various approaches have proven effective and have improved over the years. However, many of the best-performing algorithms share a common trait: They have not been explicitly programmed by humans. As David Stavens, a Stanford University computer scientist, wrote about Junior, an autonomous car that earned Stanford second place in the Defense Advanced Research Projects Agency's 2007 Urban Challenge: "Our work does not rely on manual engineering or even supervised machine learning. Rather, the car learns on its own, training itself without human teaching or labeling."
In a wide range of applications, techniques that rely on self-supervised learning have leapfrogged traditional approaches built on explicitly crafted rules. Supervised learning, in which an algorithm is first trained on a large set of human-annotated data and then let loose on other, unstructured data, is effective in cases where an algorithm benefits from some initial structure. In both cases, the rules the computer ultimately uses were never explicitly coded and cannot be succinctly described.
Such so-called machine learning algorithms have a long history. But for much of that history, they were more interesting for their theoretical promise than for their real-world performance. That has changed in the past few years for a variety of reasons. Chief among them are the availability of large datasets with which to train learning algorithms and cheap computational power that can do such training quickly. Just as important, though, are developments in methodology that make it possible to use that data — millions of images tagged online by, say, Flickr users, or linguistic data stretching to the billions of words — in advantageous ways.
The new generation of learning techniques holds the promise of not only matching human performance in tasks that have heretofore been impossible for computers but also exceeding it.
The market for speech recognition is huge and will only grow as the technology improves. Call centers alone account for tens of billions of dollars in annual corporate expenditure, and the mobile telephony market is also worth billions. Nuance, the company behind Apple's Siri voice-recognition engine, announced in November 2012 that it is working with handset manufacturers on a telephone that could be controlled by voice alone.
According to Forbes, Americans spend about $437 billion annually on cars and buy 15.1 million automobiles each year. According to the General Services Administration's latest tally, federal agencies own nearly 660,000 vehicles. As technologies for autonomy improve, many and eventually most of those cars will have detectors and software that will enable them to drive autonomously, which means the potential market is enormous.
The impact of image-analysis technologies such as facial recognition will also be transformative. Government use of such technologies is already widespread, and commercial use will increase as capabilities do. Video surveillance software is already a $650 million annual market, according to a June report by IMS Research.
Just as the commercial stakes for those and other applications of machine learning are high, so too are the broader questions the new capabilities raise. How does the nature of privacy change when it becomes possible not only to record audio and video on a mass scale but also to reliably extract data — such as people's identities or transcripts of their conversations — from those recordings? The difficult nature of the questions means they have largely escaped public discussion, even as the debate over National Security Agency surveillance programs has increased in recent weeks following Edward Snowden's disclosures.
Li Deng, a principal researcher at Microsoft Research, wrote in a paper in the May issue of IEEE Transactions on Audio, Speech and Language Processing that there are no applications today for which automated speech recognition works as well as a person. But machine learning techniques, he said, "show great promise to advance the state of the art."
There are many machine learning techniques, including Bayesian networks, hidden Markov models, neural networks of various sorts and Boltzmann machines. The differences between them are largely technical. What the techniques have in common is that they consist of a large set of nodes that connect with one another and make interrelated decisions about how to behave.
Those complicated networks can "learn" how to discern patterns by following rules that modify the way in which a given node reacts to stimuli from other nodes. That learning can be done in a way that simply seeks out patterns without any human-crafted prompting (in unsupervised learning) or by trying to duplicate example patterns (in supervised learning). For instance, a neural network might be shown many pairs of photographs along with information about when a pair consisted of two photographs of the same person and when it consisted of photographs of two different people, or it might be played many audio recordings paired with transcriptions of those recordings.
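For readers who want to see the distinction in miniature, the sketch below contrasts the two modes on an invented two-dimensional dataset: the "supervised" half is handed the labels, while the "unsupervised" half (a bare-bones k-means loop) must discover the groups on its own. All data and parameters here are toy values chosen for illustration, not anything from the systems described in this article.

```python
import numpy as np

# Toy data: two well-separated clusters of 2-D points.
rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(20, 2))
cluster_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(20, 2))
data = np.vstack([cluster_a, cluster_b])

# Supervised learning: labels are provided, so the "model" is just
# the mean (centroid) of each labeled group.
labels = np.array([0] * 20 + [1] * 20)
supervised_centroids = np.array([data[labels == k].mean(axis=0) for k in (0, 1)])

# Unsupervised learning (k-means): no labels are given. Start from two
# data points and alternate between assigning every point to its
# nearest centroid and recomputing the centroids.
centroids = data[[0, -1]].copy()
for _ in range(10):
    dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    assign = dists.argmin(axis=1)
    centroids = np.array([data[assign == k].mean(axis=0) for k in (0, 1)])

# With clusters this well separated, the centroids discovered without
# labels land close to the centroids computed from the true labels.
print(supervised_centroids.round(1))
print(centroids.round(1))
```

The point of the comparison is that the unsupervised loop never sees `labels` at all, yet recovers essentially the same structure — which is the sense in which the systems described above "train themselves."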
Deep neural networks have, since 2006, become far more effective. A shallow neural network might have only one hidden layer of nodes that could learn how to behave. That layer might consist of thousands of nodes, but it would still be a single layer. Deep networks have many layers, which allow them to recognize far more complex patterns because there is a much larger number of potential ways in which a given number of nodes can interconnect.
But that complexity has a downside. For decades, deep networks, though theoretically powerful, didn't work well in practice. Training them was computationally intractable. But in 2006, Geoffrey Hinton, a computer science professor at the University of Toronto, published a paper widely described as a breakthrough. He devised a way to train deep networks one layer at a time, which allowed them to perform in the real world.
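Hinton's actual breakthrough used restricted Boltzmann machines, but the layer-at-a-time idea can be sketched more simply with stacked autoencoders: train the first layer to reconstruct the raw input, then train the second layer on the first layer's output, and so on. Everything below — layer sizes, learning rate, step counts, the synthetic data — is an invented illustration of that scheme, not Hinton's algorithm.

```python
import numpy as np

def train_autoencoder(x, hidden, steps=2000, lr=0.01, seed=0):
    """Train one layer to reconstruct its input; return the encoder
    weights and the hidden representation of x."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    w_enc = rng.normal(scale=0.1, size=(d, hidden))
    w_dec = rng.normal(scale=0.1, size=(hidden, d))
    for _ in range(steps):
        h = np.tanh(x @ w_enc)   # encode
        recon = h @ w_dec        # decode
        err = recon - x
        # Gradient descent on the mean squared reconstruction error.
        g_dec = h.T @ err / n
        g_enc = x.T @ ((err @ w_dec.T) * (1 - h ** 2)) / n
        w_dec -= lr * g_dec
        w_enc -= lr * g_enc
    return w_enc, np.tanh(x @ w_enc)

rng = np.random.default_rng(1)
x = rng.normal(size=(200, 16))   # stand-in for real training data

# Greedy layer-wise training: fit the first layer on the raw data,
# then fit the second layer on the first layer's output.
w1, h1 = train_autoencoder(x, hidden=8)
w2, h2 = train_autoencoder(h1, hidden=4, seed=1)

# The stacked encoders now map 16-D inputs to a 4-D "deep" code.
# In Hinton's scheme, the whole stack is then fine-tuned end to end.
deep_code = np.tanh(np.tanh(x @ w1) @ w2)
print(deep_code.shape)
```

The trick is that each layer's training problem is shallow and therefore tractable; only after every layer has a reasonable starting point is the full deep network adjusted as a whole.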
In late May, Google researchers Vincent Vanhoucke, Matthieu Devin and Georg Heigold presented a paper at the IEEE International Conference on Acoustics, Speech and Signal Processing describing the application of deep networks to speech recognition. The Google researchers ran a three-layer system with 640 nodes in each layer. They trained the system on 3,000 hours of recorded English speech and then tested it on 27,327 utterances. In the best performance of a number of different configurations they tried, the system's word error rate was 12.8 percent. That means it got roughly one word in eight wrong. There is still a long way to go, but training a network as complicated as this one would have been a non-starter just a few years ago.
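Word error rate, the figure quoted above, is the standard yardstick for speech recognizers: the number of word-level edits (substitutions, insertions and deletions) needed to turn the system's transcript into the reference transcript, divided by the reference length. A minimal sketch of the computation, with a made-up example sentence:

```python
def word_error_rate(reference, hypothesis):
    """Edit distance between word sequences, divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        dist[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # substitution
    return dist[len(ref)][len(hyp)] / len(ref)

# The classic recognition failure: "recognize speech" heard as
# "wreck a nice speech" (1 substitution + 2 insertions = 3 edits
# against a 3-word reference).
print(word_error_rate("recognize speech today",
                      "wreck a nice speech today"))  # → 1.0
```

Note that the rate can exceed 100 percent when the system inserts many spurious words, which is why a 12.8 percent score on unconstrained speech is a genuinely strong result.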
Nevertheless, speech-recognition technologies have already had a dramatic impact on call and contact centers. As the technology improves, agencies that interact with the public on a massive scale — such as the Social Security Administration, the National Park Service and the Veterans Health Administration — will have to decide to what extent they wish to replace human operators with automated voice-recognition systems.
On June 17, Stanford associate professor Andrew Ng and his colleagues presented a paper at the annual International Conference on Machine Learning describing how even larger networks — systems with as many as 11 billion parameters — can be trained in a matter of days on a cluster of 16 commercial servers. They say they do not yet know how to make the most effective use of such large networks but wanted to show that training them is feasible. They trained their neural network on a dataset of 10 million unlabeled YouTube video thumbnails. They then used the network to distinguish 13,152 faces from 48,000 distractor images. It succeeded 86.5 percent of the time.
Again, that performance is not yet at a level that is of much practical use. But the remarkable thing is that Ng and his team achieved it on a dataset that wasn't labeled in any way. They devised a program that could, on its own, figure out what a face looks like.
Current commercial facial recognition systems include NEC's NeoFace, which won a competition run by the National Institute of Standards and Technology in 2010. NeoFace matches pictures of faces against large databases of images taken from close up and under controlled lighting conditions. NeoFace can work with images taken at very low resolution, with as few as 24 pixels between the subject's eyes, according to NEC. In the NIST evaluation, it identified 95 percent of the sample images given to it.
In a May 2013 paper, Anil Jain and Joshua Klontz of Michigan State University used NeoFace to search through a database of 1.6 million law enforcement booking images, along with pictures of Dzhokhar and Tamerlan Tsarnaev, who are accused of setting off the bombs at the Boston Marathon in April. Using the publicly released images of the Tsarnaev brothers, NeoFace was able to match Dzhokhar's high school graduation photo from the database of millions of images. It was less successful with Tamerlan because he was wearing sunglasses.
Eigenfaces allow computers to deconstruct images of faces into characteristic components to enable recognition technologies. (Image: Wikimedia Commons)
Jain and Klontz make the point that, even today, facial recognition algorithms are good enough to be useful in a real-world context. The methods for automatically detecting faces, though, are likely to get much better with machine learning. NeoFace and other commercial tools work in part by deconstructing faces into characteristic constituents, called eigenfaces, in a way roughly analogous to the grid coordinates of a point. A picture of a face can then be described as a distinct combination of eigenfaces, just as any physical movement in the real world can be broken down into the components of up-down, left-right and forward-backward.
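The eigenface decomposition described above is, at bottom, principal component analysis applied to face images. The sketch below shows the mechanics on random stand-in data (a real system would use a large database of photographs): each "image" becomes a row of a matrix, the top singular vectors of that matrix are the eigenfaces, and every face is then summarized by a handful of coordinates. The image size and number of components are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for a training set: 100 "face images" of 32x32 pixels,
# each flattened into one row of a matrix.
faces = rng.normal(size=(100, 32 * 32))

# Center the data, then take the top singular vectors: these are
# the eigenfaces, an orthonormal "coordinate system" for faces.
mean_face = faces.mean(axis=0)
_, _, vt = np.linalg.svd(faces - mean_face, full_matrices=False)
k = 12
eigenfaces = vt[:k]                             # k x 1024

# Any face is then described by k coordinates, just as a point in
# space is described by its up-down, left-right and forward-backward
# components.
coords = (faces[0] - mean_face) @ eigenfaces.T  # shape (k,)
approx = mean_face + coords @ eigenfaces        # reconstruction

# Recognition compares these short coordinate vectors rather than
# raw pixels.
print(coords.shape)
```

The fragility the next paragraph describes falls out of this picture directly: change the lighting or the pose and the pixel values shift, so the same face lands on very different coordinates.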
But that approach is not very adaptable to changes in lighting and posture. The same face breaks down into very different eigenfaces if it is lit differently or photographed from another angle. However, it is easy for a person to recognize that, say, Angelina Jolie is the same person in profile as she is when photographed from the front.
Honglak Lee, an assistant professor at the University of Michigan in Ann Arbor, wrote recently with colleagues from the University of Massachusetts that deep neural networks are now being applied to the problem of facial recognition in a way that doesn't require any explicit information about lighting or pose. Lee and his colleagues were able to get 86 percent accuracy on Labeled Faces in the Wild, a database that contains more than 13,000 images of 5,749 people. Their results came close to the 87 percent that the best handcrafted systems achieved.
But the deep learning systems remain computationally demanding. Lee and his colleagues had to scale down their images to 150 by 150 pixels to make the problem computationally tractable. As computing power grows, there is every reason to believe that machine learning techniques applied on a larger scale will become still more effective. At present, it might seem that facial recognition programs are of interest only to law enforcement and intelligence agencies. But as the systems become more robust and effective, other agencies will have to decide whether and how to use them. The technology has broad potential but also threatens to encroach fundamentally on privacy.
In a sense, the machine learning algorithms for facial recognition are doing something analogous to speech recognition. Just as speech-recognition programs can't try to match sounds against all possible words that might have generated those sounds, the new generation of face-recognition techniques doesn't attempt to match patterns. Instead, the learning methodology allows it to discern global structure in a way loosely analogous to human perception.
The pace of such progress can perhaps best be seen in the case of autonomous cars. In 2004, DARPA ran a race in which autonomous cars had to navigate a 150-mile desert route. None of the 21 teams finished. The best-performing team, from Carnegie Mellon University, traveled a little more than 7 miles. In 2005, five teams finished DARPA's 132-mile course. Last year, Google announced that about a dozen of its autonomous cars had driven more than 300,000 miles.
Suddenly, DARPA's efforts to bring driverless vehicles to the battlefield look a lot closer to reality. Many elements must come together for this to work. As Chris Urmson, who helped lead the Carnegie Mellon team whose car, Boss, won the 2007 DARPA Urban Challenge, and who now works on Google's self-driving cars, has explained, autonomous vehicles combine information from many sources. Boss had 17 sensors, including nine lasers, four radar systems and a Global Positioning System device. It had a reliable software architecture broken down into a perception component, mission-planning component and behavioral executive. For autonomous cars to work well, all those elements must perform reliably.
But the mind of an autonomous car — the part that's fundamentally new, as opposed to the chassis or the engine — consists of algorithms that allow it to learn from its environment, much as speech recognizers learn to recognize words out of vibrations in the air, or facial recognizers find and match faces in a crowd. The capacity to effectively program algorithms that are capable of learning implicit rules of behavior has made it possible for autonomous cars to get so much more capable so quickly.
A 2012 report by consultants KPMG predicts that self-driving cars will be sold to the public by 2019. In the meantime, the Transportation Department's Intelligent Transportation Systems Joint Program Office is figuring out how the widespread deployment of technologies that will enable autonomy will work in the coming years. DOT's effort is focused on determining how to change roads in ways that will enable autonomous vehicles. Besides the technical challenges, widespread deployment raises a sticky set of liability issues. For instance, if an autonomous car driving on a smart road crashes because of a software glitch, who will be held responsible — the car's owner, the car's passenger, the automaker or the company that wrote the software for the road?
Clearly, autonomy in automobile navigation presents a difficult set of challenges, but it might be one of the areas in which robots first see large-scale deployment. That is because although part of what needs to be done (perceiving the environment) is hard, another part (moving around in it) is relatively easy. It is far simpler to program a car to move on wheels than it is to program a machine to walk. Cars also need to process only minimal linguistic information, compared to, say, a household robot.
Groups such as Peter Stone's at the University of Texas, which won first place in the 2012 Robot Soccer World Cup, and Yiannis Aloimonos' at the University of Maryland are creating robots that can learn. Stone's winning team relied on many explicitly encoded rules. However, his group is also working on lines of research that teach robots how to walk faster using machine learning techniques. Stone's robots also use learning to figure out how to best take penalty kicks.
Aloimonos is working on an even more ambitious European Union-funded project called Poeticon++, which aims to create a robot that can not only manipulate objects such as balls but can also understand language. Much as autonomous vehicle teams have created a grammar for driving — breaking down, say, a U-turn at a busy intersection into its constituent parts — Aloimonos aims to describe a language for how people move. Having come up with a way to describe the constituent parts of motions, called kinetemes — for instance rotating a knee or shoulder joint around a given axis — robots can then learn how to compose them into actions that mimic human behavior.
This is all very ambitious, of course. But if machine learning techniques continue to improve in the next five years as much as they have in the past five, they will allow computers to become very powerful in fundamentally new ways. Autonomous cars will be just the beginning.