Boston probe's big data use hints at the future

A day after the explosions at the Boston Marathon, authorities had amassed 10 terabytes of data in their search for clues.


The FBI turns the Boston Marathon bombing crime scene back over to the city in an informal ceremony held April 22. (FBI photo)

Less than 24 hours after two explosions killed three people and injured dozens more at the April 15 Boston Marathon, the Federal Bureau of Investigation had compiled 10 terabytes of data in hopes of finding needles in haystacks of information that might lead to the suspects.

The tensest part of the ongoing investigation – the death of one suspect and the capture of the second – concluded four days later in part because the FBI-led investigation analyzed mountains of cell phone tower call logs, text messages, social media data, photographs and video surveillance footage to quickly pinpoint the suspects.

A big assist in this investigation goes to the public, which presented perhaps the best illustration of a crowd-sourced investigation in recent memory. Not only did the public respond to the FBI's request for information – the agency ultimately received several thousand tips and loads of additional photographs and video footage – but a citizen's tip ultimately led to the capture of the surviving suspect.

Still, the investigation showed a glimpse of what big data and data analytics can do -- and highlighted how far we yet have to go.

Knowledge is power

Big data is a relatively new term in technology and its definition varies amongst early practitioners, but the main goal of any big data project is to pull insights from large amounts of data.

Prominent statistician Nate Silver describes it as "pulling signal from the noise" – noise that can be a veritable smorgasbord of different kinds of information. The noise can be big, too – some datasets within the federal government are measured in petabytes, each of which is one million gigabytes or 1,000 terabytes.

So the 10 terabytes gathered by investigators is not a large data collection even in today's relatively early stages of big data technology. But the investigation's processes still presented officials with a data crunch due to the volume, variety and complexity, according to Bradley Schreiber, vice president of Washington operations for the Applied Science Foundation for Homeland Security.
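To put those units side by side, here is a small arithmetic sketch. It uses the decimal definitions given above (1,000 gigabytes per terabyte, 1,000 terabytes per petabyte); the variable names are illustrative only.

```python
# Decimal storage units, as defined in the article.
GB_PER_TB = 1_000
TB_PER_PB = 1_000
GB_PER_PB = GB_PER_TB * TB_PER_PB  # one million gigabytes per petabyte

# The data gathered in the first day of the Boston investigation.
collected_tb = 10

# Fraction of a single petabyte that 10 terabytes represents.
share_of_pb = collected_tb / TB_PER_PB

print(f"1 PB = {GB_PER_PB:,} GB")
print(f"{collected_tb} TB is {share_of_pb:.0%} of one petabyte")
```

By this measure, the Boston collection amounts to about one-hundredth of a single petabyte-scale dataset.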

To get a sense for the initial complexities of combining various data sets in the early moments of the investigation, consider this: In the aftermath of the bombing, cellular networks in the area were taxed beyond their capacity. AT&T put out a tweet urging those in the area to "please use text & we ask that you keep non-emergency calls to a minimum."

There was speculation that the bombs could have been triggered remotely by mobile phones, prompting interest in traffic logs from area cell towers to try to get a fix on the culprits. That geo-location information could then be cross-checked against surveillance video and eyewitness photography – just another layer of data available to law enforcement when trying to stitch together a detailed and textured version of events.
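The cross-check described above can be sketched in a few lines. This is only an illustration under stated assumptions: the record formats, identifiers, timestamps and the two-minute window are all invented for the example, and nothing here reflects the tools investigators actually used.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class TowerEvent:
    phone_id: str        # hypothetical handset identifier
    tower_id: str        # hypothetical cell tower identifier
    timestamp: datetime  # when the handset touched the tower


@dataclass(frozen=True)
class Sighting:
    source: str          # hypothetical photo or video file name
    tower_id: str        # tower assumed to cover the camera's location
    timestamp: datetime  # when the image was captured


def cross_check(events, sightings, window=timedelta(minutes=2)):
    """Pair each sighting with handsets seen on the same tower within `window`."""
    matches = []
    for s in sightings:
        for e in events:
            same_tower = e.tower_id == s.tower_id
            close_in_time = abs(e.timestamp - s.timestamp) <= window
            if same_tower and close_in_time:
                matches.append((s.source, e.phone_id))
    return matches


# Made-up sample data for illustration only.
events = [
    TowerEvent("phone-A", "tower-1", datetime(2013, 4, 15, 14, 48)),
    TowerEvent("phone-B", "tower-1", datetime(2013, 4, 15, 13, 5)),
    TowerEvent("phone-C", "tower-2", datetime(2013, 4, 15, 14, 49)),
]
sightings = [Sighting("video-017.mp4", "tower-1", datetime(2013, 4, 15, 14, 49))]

matches = cross_check(events, sightings)
print(matches)
```

Only the handset that shares both the tower and the time window with the video clip survives the filter. At real scale, the nested loop would give way to indexed or sorted joins, but the logic is the same.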

To handle all this data at once, the FBI had the services of local law enforcement, as well as manpower and technology from a collection of federal agencies -- including the Department of Homeland Security and the 16 other agencies that make up the intelligence community (IC).

Because of the secrecy involved in the IC, many of its most innovative technologies are not well-understood by the public, and often their uses are not confirmed via official sources.

Yet some are known anyway. Counterterrorism sources told several publications that facial recognition software was being used to compare faces in photographs and video against visa, passport, driver's license and other databases.

An individual not authorized to speak about the investigation told FCW that DHS used situational awareness tools that allow its personnel to act as field sensors through their mobile devices, interacting in real-time with a central network. Through a smart phone, a user's position can be triangulated by the central network, and the user can push alerts to the network via text, image or video, for instance, or view where other mobile device users with the application are located via mapping technology. The central hub then stands up a virtualized map of all its field users and logs what data they're sending back and forth.

New challenges with social media

Terabytes or more of video, images, text messages and cell phone records are complex to compare in their own right, but social media data adds a new wrinkle for investigators.

A tool from Topsy Labs, a company that bills itself as having the only full-scale index of the public social web, was used by local and federal officials during the Boston bombing investigation to sift through torrents of tweets.

Topsy has stored every tweet since July 2010, and in the Boston investigation, its tool allowed investigators to run big-data analytics of Boston-related tweets against hundreds of billions of past and present messages.

Using Topsy, investigators could search for every reference ever made on Twitter to the word "bomb" in a specific region – like the city of Boston and its suburbs.

Such a search would have turned up since-deleted bomb references from both the suspects' Twitter accounts, said Rishab Ghosh, Topsy's chief scientist and co-founder.

This kind of search through public information may have also revealed other important clues for investigators, like which users re-tweeted the bomb mentions or engaged in dialog with the suspects.
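A keyword search restricted to a region might look roughly like the sketch below. It is not Topsy's method, which the article does not describe; the `Post` record, the bounding-box coordinates and the sample posts are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Post:
    user: str
    text: str
    lat: float  # latitude attached to (or inferred for) the post
    lon: float  # longitude attached to (or inferred for) the post


# Rough, illustrative bounding box around greater Boston (assumed values).
BOSTON_BOX = {"south": 42.2, "north": 42.5, "west": -71.3, "east": -70.9}


def in_box(post, box):
    """True when the post's coordinates fall inside the bounding box."""
    return (box["south"] <= post.lat <= box["north"]
            and box["west"] <= post.lon <= box["east"])


def search(posts, keyword, box):
    """Case-insensitive substring search, limited to posts inside `box`."""
    needle = keyword.lower()
    return [p for p in posts if needle in p.text.lower() and in_box(p, box)]


# Made-up sample posts.
posts = [
    Post("user1", "Heard a BOMB go off near the finish line", 42.35, -71.08),
    Post("user2", "That movie was a bomb", 40.71, -74.01),  # New York area
    Post("user3", "Great day for a run", 42.36, -71.06),
]

hits = search(posts, "bomb", BOSTON_BOX)
print([p.user for p in hits])
```

The second post mentions the keyword but falls outside the box, and the third is in Boston but off-topic, so only the first is returned. A plain substring match like this one would also catch words such as "bombastic," one reason real systems layer further filtering on top.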

Furthermore, Topsy's "geo-inferencing" capabilities allow users to accurately map where specified tweets are originating, despite the fact that only about one percent of Twitter users geo-tag their tweets. Those capabilities make it "20 times more accurate" than standard Twitter location data, Ghosh said.

"Technology plays the role of identifying signal and extracting it from this noise, and allows you (as a user) to access information without someone, like a journalist or official, editing it, yet you're still hearing relevant voices," Ghosh said. "What has happened now is the way people communicate through public conversation, until the past few years, that has been inaccessible. The Internet has changed that, and now public conversations are publicly accessible."

DHS has certainly taken advantage of this fact, and was almost certainly keeping tabs on the investigation through social media, too.

Since 2011, the agency has monitored public-facing social media networks, blogs and content aggregators. The monitoring has stirred up controversy among privacy advocates because of worries that DHS would be collecting personally identifiable information (PII) on social media users. The agency has stated that it is not doing data mining for PII, but that it would use such information in exigent circumstances to rescue, say, an earthquake victim tweeting from under a pile of rubble or the victim of a terrorist attack trapped in a hotel. The National Operations Center at DHS "identifies and monitors only information needed to provide situational awareness and establish a common operating picture," according to an April privacy impact assessment from DHS about the program.

These monitoring efforts require DHS to establish accounts with usernames and passwords on public social media sites like Twitter, Facebook, YouTube and Flickr, as well as a host of Twitter search and trend sites, but these DHS accounts are not supposed to interact with other users, make friends, or share content across networks.

They are designed to lurk and watch for the appearance of terms that indicate a social media post or news item is about a terrorist attack, cyber-security breach, natural disaster, public health emergency, or other threatening situation in progress.

In its privacy memo, DHS spells out the terms on its radar screen. Some that would have come up on social media in the aftermath of the Boston Marathon bombing include "explosion," "bomb," "shelter-in-place," and "lockdown."
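As a rough illustration of term-based monitoring, the sketch below flags text containing any of the four terms quoted above. The matching approach (a case-insensitive regular expression with word boundaries) and the function name are assumptions for the example; the DHS memo's full term list and the agency's actual tooling are not reproduced here.

```python
import re

# The four terms quoted in the article; the real DHS list is longer.
WATCH_TERMS = ["explosion", "bomb", "shelter-in-place", "lockdown"]

# Require non-word characters (or string edges) around each term so that
# "bomb" does not match inside a longer word such as "bombastic".
PATTERN = re.compile(
    r"(?<!\w)(" + "|".join(re.escape(t) for t in WATCH_TERMS) + r")(?!\w)",
    re.IGNORECASE,
)


def flagged_terms(text):
    """Return the sorted, lowercased watch terms that appear in `text`."""
    return sorted({m.group(1).lower() for m in PATTERN.finditer(text)})


print(flagged_terms("Explosion reported; city under lockdown."))
print(flagged_terms("Residents told to shelter-in-place"))
print(flagged_terms("A bombastic speech"))
```

The first two messages are flagged and the third is not, which shows why even simple keyword monitoring needs some care to avoid false matches.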

What's to come

John Crupi, Chief Technology Officer of Washington, D.C.-based JackBe, which designs real-time operational intelligence software and has contracts within the IC, said the Boston bombing investigation highlights where technology is and where it might soon go.

While the suspects were pinpointed, Crupi said the fact that investigators were asking people to send in digital photos during the early hours of the investigation and engaging in lengthy exchanges with tipsters via the phone and e-mail may show too much reliance on the public.

If the right people had not come forward, the investigation might have stalled. However, improving technology is likely to change that, Crupi said. One day, predictive analytics might actually allow savvy investigators to prevent crimes and tragedies like the Boston bombings before they happen.

Think Minority Report, but replace the psychics with a blazing-fast, massive and scalable cloud-based infrastructure that seamlessly wields disparate, complex data sets produced by people, drones, satellites and other smart machines. In fact, the IC is working on installing this kind of technology right now, so such powerful predictive analytics may not be far off.

"The goal is going to be to take all this real-time data and ultimately connect all the dots to authoritative systems," Crupi said. "If you can connect them, then you can make a more probabilistic decision on how real or possible something is."