Andrei Scheinkman (Director of Data and Technology) - firstname.lastname@example.org
FiveThirtyEight has been publishing the data and code behind some of our journalism on GitHub since our launch in 2014. Each dataset comes with a short explanation of where the data came from, what it means, and how we used it. These datasets have allowed our readers to better understand how we do our work, test the robustness of our conclusions themselves, and use our data as the basis for new original analyses or data visualizations. The datasets are used by our readers, civic technologists, academics, activists, journalists, students and educators. They have been used in classrooms and data-science competitions and have been built into software packages. They have been cited in academic papers, research reports, and other news articles.
Writers at FiveThirtyEight file datasets along with their stories, and our editors work with them to ensure that the data is clean, accurate, readable, and well-documented. Each dataset is accompanied by a README that contains an explanation of the data and its source.
Making data open, however, means more than just posting it online. As much as possible, we post the datasets in simple, non-proprietary formats like CSV files that are both human and machine readable. This year, we have made special efforts to make the datasets accessible to an even wider audience and to reduce the technical barriers to using them. We launched data.fivethirtyeight.com, a page where readers can scroll through all of our datasets and download any one of them with a single click. Many of our articles now also come with a button directly under the author’s byline that encourages readers to check out the data and analysis that underlie our journalism.
We hope that our work can set a standard for transparency and openness that we can come to expect of news, government, academia, and other institutions that host data that could be used in the interest of serving the public.
What makes this project innovative?
As journalists, we are acutely aware of barriers to acquiring and using datasets, from both the government and other public entities. Often data that purports to be open is hidden, difficult to access, difficult to understand, or easily available to download but stored in a digital format that is not amenable to analysis (like PDF files). For data to be truly open, it must be both available and easily accessible as well as human and machine readable.
Our project is innovative because it sets a new standard for open data that we hope we can come to expect of government, news, academia, and other institutions. As much as possible, we post data in open text formats like CSV files and avoid formats that require proprietary software (like Excel files) or special skills (like JSON files) to parse. Since much of the data is hosted on GitHub, any changes we make to the data, along with who made each change and when, are publicly accessible.
Making the datasets human-readable and creating an index where readers can download the data with a single click has made the data accessible to non-technologists. Our analytics as well as reader feedback tell us that it is being used in classrooms and by people who don’t have specialized technical skills.
Using simple formats, sticking to a consistent structure, and including metadata has allowed the open data to be machine-readable as well. Users have created R packages and Python libraries that allow programmers to quickly load our data into code. Our data has been used as an example by new software packages like Datasette, whose developers loaded all of our data into one large SQL database to demonstrate how the software provides data as an API. The datasets have also been integrated into Docker images set up for data-science competitions. Because our data is well formatted and machine readable, third parties have been able to make it accessible in the technology that works best for each person on their native platform.
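As a rough illustration of what machine-readable means in practice, the sketch below parses CSV text using only Python's standard library. The sample data, its column names (team, season, wins), and the load_rows helper are invented for this example and do not correspond to any actual FiveThirtyEight dataset; a real file downloaded from the index could be read the same way once opened.

```python
import csv
import io

# Hypothetical sample standing in for a downloaded dataset file.
# The column names and values are invented for illustration only.
SAMPLE_CSV = """team,season,wins
Example A,2017,10
Example B,2017,7
"""


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


rows = load_rows(SAMPLE_CSV)

# CSV values arrive as strings, so numeric columns are converted explicitly.
total_wins = sum(int(row["wins"]) for row in rows)
print(len(rows), total_wins)
```

Because the header row names each column, no format-specific tooling is needed; the same few lines should work for any dataset that follows a consistent one-header, one-record-per-line layout.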
What was the impact of your project? How did you measure it?
Our datasets have been cited in academic publications, research reports, and news articles. They have garnered praise from readers, academics, educators, and media critics on Twitter for being transparent, straightforward, and easy to use (https://twitter.com/search?q=data.fivethirtyeight.com). Our analytics show over a thousand daily visitors to our new data index, and it is likely that many more access the data directly through GitHub or various third-party software (like R and Python packages that have been built on top of it). We have received feedback from other journalists that they have used FiveThirtyEight’s open data as a template for their own repositories and to persuade editors in other newsrooms to open-source data and code (https://twitter.com/dhmontgomery/status/962046129926909953). This is the most satisfying evidence that our project may be leading by example and helping to set a new standard for what people can expect of open data efforts.
Source and methodology
We try to add data and code to our open data repository whenever we have conducted original analysis, scraped data, made it more easily accessible or simplified it in a useful way, acquired it from hard-to-access government sources or via a freedom of information request, or believe that it can provide value to our readers in some other way. The data is either automatically updated from our internal systems or filed by writers along with each story and reviewed by editors prior to publication.
In addition to simple static datasets that can be a single file or a collection of files, some of the datasets that power our interactive visualizations are “live-updating”. These datasets update in real time after particular events. For example, many of our politics datasets update automatically when new polls are added to our database, and many of our sports datasets update automatically at the conclusion of a game (and in fact some update live during the game after each score). These live-updating datasets are handled differently on the back-end, but are served to the user in the same simple interface as the static datasets, only marked with a red dot to indicate that they are updating in real time. For the user, regardless of the complexity of the underlying data pipeline, all of our datasets are available through the same simple one-click download button, which gives them the most up-to-date version of the dataset.
We have also built a smooth internal editorial workflow to encourage writers to file data more frequently. They simply file the data and an explanation of it in an email along with their story, and a data editor will edit the dataset for clarity and accuracy and post it to an internal private fork of our public data GitHub repo. The writer, story editor, copy editor and quantitative editor look over the dataset, and when it is ready to publish, it is pushed to the public repository, which automatically kicks off an update of our index page (https://data.fivethirtyeight.com/).