A secret spy plane operated by the US Marshals hunted drug cartel kingpins in Mexico. A military contractor that tracks terrorists in Africa is also flying surveillance aircraft over US cities. In these and other cases, we revealed the activities of aircraft that their operators didn’t want to discuss, opening the lid on a black box of covert aerial surveillance by agencies of the US government, the military and its contractors, and local law enforcement agencies. And we packaged these stories using striking maps, to reach BuzzFeed News’s broad audience.
Some of these spy planes employed sophisticated surveillance technologies, including devices to locate and track cell phones and satellite phones, or survey Wi-Fi networks. Others have carried persistent surveillance cameras that can monitor large parts of a city continuously for hours at a time. One military contractor, Acorn Growth Companies, was interested in offering this technology to commercial clients, describing it in a promotional flyer as “the absolute best solution for your security or surveillance needs.” It noted that the system can switch to a conventional surveillance camera for a zoomed-in view to “ensure that no activity is missed or valuable intelligence overlooked.”
Before our stories, most Americans would have been unaware of the extent and sophistication of these operations. Had we not employed machine learning to identify aircraft engaged in aerial surveillance, the activities of many of the aircraft deploying these devices would have remained hidden in plain sight, in vast volumes of flight tracking data.
What makes this project innovative?
In recent years, there has been much discussion about the potential of machine learning and artificial intelligence in journalism, largely centered on classifying and organizing content with a CMS, on fact-checking, or on tailoring content to individual consumers of news (see, for example: http://www.storybench.org/how-machine-learning-could-change-journalism/ and https://www.techemergence.com/automated-journalism-applications/).
There have been relatively few stories that have used machine learning as a core tool for reporting, which is why this project is an important landmark. Of course, extensive further reporting, including public records work, was needed to turn the insights gained from our machine learning into the three stories we published in 2017. But the initial identification of previously hidden spy planes from vast quantities of flight tracking data on thousands of aircraft was only made possible by the application of the random forest algorithm.
What was the impact of your project? How did you measure it?
This project has been recognized for drawing back the veil on a poorly understood aspect of government surveillance, and as an indicator of the potential of machine learning to find and break important stories.
Our story describing the methodology and the diversity of our findings was co-published with the Columbia Journalism Review (https://www.cjr.org/watchdog/how-buzzfeed-news-revealed-hidden-spy-planes-in-us-airspace.php). It was described as “incredible” (https://twitter.com/Snowden/status/894978967769419776) and “deviously clever” (https://twitter.com/Snowden/status/894986251924885506) by Edward Snowden. And it was highlighted in Nieman Lab’s “Predictions for Journalism 2018.” Under the heading “Scooped by AI,” John O’Keefe, a developer in the Quartz Bot Studio, wrote: “These will be stories on your beat, written by humans who understand how to use machine learning to aid their reporting.”
One of us (Peter Aldhous) spoke about the work in March 2018 at the annual NICAR data journalism meeting organized by Investigative Reporters and Editors (https://paldhous.github.io/NICAR/2018/machine-learning.html and https://www.ire.org/events-and-training/event/3189/3551/), and has been invited to speak about it in June 2018 at the annual meeting of Netzwerk Recherche, a German journalism organization.
The stories were widely read and shared, accumulating more than 400,000 views across multiple online platforms.
Source and methodology
Our identification of previously hidden spy planes came by training a computer to recognize known surveillance aircraft, then setting it loose on large quantities of flight-tracking data compiled by the website Flightradar24 (https://www.flightradar24.com/).
Surveillance aircraft often keep a low profile: The FBI, for example, registers its planes to fictitious companies to mask their true identity. So BuzzFeed News trained a computer to find spy planes by letting a machine-learning algorithm sift for planes with flight patterns that resembled those operated by the FBI and the Department of Homeland Security (DHS). In 2016, we reported on aerial surveillance by these planes (https://www.buzzfeed.com/peteraldhous/spies-in-the-skies), mapping thousands of flights over more than four months from mid-August to the end of December 2015.
First we made a series of calculations to describe the flight characteristics of almost 20,000 planes in the same four months of Flightradar24 data: their turning rates, speeds and altitudes flown, the areas of rectangles drawn around each flight path, and the flights’ durations. We also included information on the manufacturer and model of each aircraft, and the four-digit squawk codes emitted by the planes’ transponders.
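The kind of per-flight calculation described above can be sketched in Python as follows. This is a minimal illustration, not the code behind the stories (that analysis was done in R and is linked below). The input layout and the key names 't', 'lat', 'lon', 'alt', 'speed' and 'heading' are assumptions made for the example, as is the choice of simple means as summary statistics.

```python
import math
from statistics import mean

def heading_change(h1, h2):
    """Smallest signed difference between two compass headings, in degrees."""
    return (h2 - h1 + 180) % 360 - 180

def flight_features(points):
    """Summarize one flight from its position reports.

    Each point is a dict with hypothetical keys: 't' (seconds), 'lat' and
    'lon' (degrees), 'alt' (feet), 'speed' (knots), 'heading' (degrees).
    """
    points = sorted(points, key=lambda p: p["t"])
    # Turning rate: absolute heading change per second between reports.
    rates = [
        abs(heading_change(a["heading"], b["heading"])) / (b["t"] - a["t"])
        for a, b in zip(points, points[1:])
        if b["t"] > a["t"]
    ]
    lats = [p["lat"] for p in points]
    lons = [p["lon"] for p in points]
    # Approximate area (km^2) of the rectangle drawn around the flight path.
    km_per_deg = 111.32
    height = (max(lats) - min(lats)) * km_per_deg
    width = (max(lons) - min(lons)) * km_per_deg * math.cos(math.radians(mean(lats)))
    return {
        "duration_s": points[-1]["t"] - points[0]["t"],
        "mean_turn_rate": mean(rates) if rates else 0.0,
        "mean_speed": mean(p["speed"] for p in points),
        "mean_alt": mean(p["alt"] for p in points),
        "bbox_area_km2": height * width,
    }
```

A plane circling over one neighborhood would score a high turning rate and a small rectangle area, while an airliner flying a straight route would show the opposite pattern.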
Then we turned to an algorithm called the “random forest,” training it to distinguish between the characteristics of two groups of planes: almost 100 previously identified FBI and DHS planes, and 500 randomly selected aircraft.
The random forest algorithm makes its own decisions about which aspects of the data are most important. But not surprisingly, given that spy planes tend to fly in tight circles, it put most weight on the planes’ turning rates. We then used its model to assess all of the planes, calculating a probability that each aircraft was a match for those flown by the FBI and DHS. This analysis is described in detail here, with the underlying computer code: https://buzzfeednews.github.io/2017-08-spy-plane-finder/
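To show the idea behind the training and scoring steps, here is a toy random forest written in plain Python. It is a sketch under stated assumptions, not the model we ran: the real analysis used R, and the function names, the two example features (turning rate and speed) and the made-up training values below are all hypothetical. Each tree is grown on a bootstrap sample of the labeled planes and considers a random subset of features at each split; the probability for a plane is the share of trees that vote "surveillance."

```python
import random
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels, n_try, rng):
    """Best (feature, threshold) among a random subset of n_try features."""
    best, best_score = None, gini(labels)
    for f in rng.sample(range(len(rows[0])), n_try):
        for threshold in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[f] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, threshold), score
    return best

def grow_tree(rows, labels, n_try, rng, depth=0, max_depth=5):
    split = best_split(rows, labels, n_try, rng) if depth < max_depth else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    f, threshold = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= threshold]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > threshold]
    return (f, threshold,
            grow_tree(*zip(*left), n_try, rng, depth + 1, max_depth),
            grow_tree(*zip(*right), n_try, rng, depth + 1, max_depth))

def predict_tree(tree, row):
    while isinstance(tree, tuple):  # internal nodes are tuples, leaves are labels
        f, threshold, left, right = tree
        tree = left if row[f] <= threshold else right
    return tree

def train_forest(rows, labels, n_trees=100, seed=0):
    rng = random.Random(seed)
    n_try = max(1, int(len(rows[0]) ** 0.5))  # features tried per split
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # bootstrap sample
        forest.append(grow_tree([rows[i] for i in idx],
                                [labels[i] for i in idx], n_try, rng))
    return forest

def surveillance_probability(forest, row):
    """Share of trees that classify the plane as a surveillance aircraft (1)."""
    return sum(predict_tree(t, row) for t in forest) / len(forest)

# Made-up training data: [mean turning rate (deg/s), mean speed (knots)].
known_spy = [[2.0 + 0.1 * i, 90 + i] for i in range(10)]      # label 1
others = [[0.1 + 0.05 * i, 250 + 10 * i] for i in range(20)]  # label 0
forest = train_forest(known_spy + others, [1] * 10 + [0] * 20, n_trees=25)
circling = surveillance_probability(forest, [2.5, 95])
cruising = surveillance_probability(forest, [0.2, 300])
```

In this toy setup, the slow, tightly turning plane should receive a much higher probability than the fast, straight-flying one, mirroring how candidate spy planes were ranked for follow-up reporting.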
Having identified a series of interesting planes, and researched their ownership and the equipment fitted on them through registration and airworthiness documentation submitted to the Federal Aviation Administration, we continued to observe their activities in flight tracking data provided by Flightradar24 through to July 2017.
One of us (Peter Aldhous) performed all of the data analysis and mapping for the stories.
Analysis, including the feature engineering of the variables used, and the random forest training and classification, was performed in R (https://www.r-project.org/) and RStudio (https://www.rstudio.com/).
Full flight tracking data for the aircraft of interest was maintained in a PostgreSQL (https://www.postgresql.org/) database with PostGIS (https://postgis.net/) enabled, using the Postgres app (https://postgresapp.com/) for macOS and the Postico (https://eggerapps.at/postico/) PostgreSQL client. Tracks were created from point data using PostGIS queries.
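The step of turning points into tracks can be illustrated outside the database with a short Python sketch. It groups position reports by flight, orders them by time, and joins them into a line in WKT, the text format PostGIS reads. The key names 'flight_id', 't', 'lat' and 'lon' are assumptions for the example. In PostGIS itself, a comparable result is typically obtained with an aggregate such as ST_MakeLine over time-ordered points, though the exact queries used for the stories are not reproduced here.

```python
from collections import defaultdict

def points_to_tracks(points):
    """Group position reports by flight and join each group into a WKT LINESTRING."""
    by_flight = defaultdict(list)
    for p in points:
        by_flight[p["flight_id"]].append(p)
    tracks = {}
    for flight_id, pts in by_flight.items():
        pts.sort(key=lambda p: p["t"])  # order reports by timestamp
        if len(pts) < 2:
            continue  # a line needs at least two points
        # WKT lists coordinates as "x y", which means longitude then latitude.
        coords = ", ".join(f"{p['lon']} {p['lat']}" for p in pts)
        tracks[flight_id] = f"LINESTRING ({coords})"
    return tracks
```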
The maps displayed in the story were made using QGIS (http://qgis.com/) and PostGIS, with OpenStreetMap (https://www.openstreetmap.org/) basemaps.