Projects submitted to the Data Journalism Awards competition

Below you will find a list of all the projects submitted to the Data Journalism Awards competition.

A Farra dos Verbetes
Country: Brazil
Organisation: Volt Data Lab
Investigation of the year
Investigation, Verification, Public institutions, Tech
Applicant
Sergio Spagnuolo
Team Members
Keila Guimarães, Vitor Hugo Brandalise
Project Description
Volt Data Lab's data-driven investigation shed light on how federal government computers in Brazil were used to polish the image of politicians and hide references to corruption on Wikipedia. This was, above all, a story about public transparency, one that drew attention to the government's lack of an ethical code governing how public officers handle public-domain information through official networks. By scraping, analyzing and visualizing three years of data about edits to hundreds of thousands of Wikipedia entries made through federal government networks, we revealed several skewed online public relations tactics applied by politicians, showing how they (and their PR teams) manipulated public-domain information for their benefit while hiding and erasing negative information from their biographies. The investigation also uncovered a broad lack of accountability among public officers, who used office time and government resources to add and distort information on Wikipedia that was not in the public interest. The story was published by the prestigious magazine Piauí, often compared to the acclaimed American publication The New Yorker for its combination of innovative storytelling and first-class journalism.
What makes this project innovative?
This was the first investigation in Brazil into how politicians, their PR teams and public officers working in over 50 agencies of the Brazilian government were using public resources to edit Wikipedia entries in order to embellish the image of politicians. While individual edits had made headlines in the Brazilian press before, this was the first time quantitative data had been put into context. Although this information was in the public domain - through a Twitter bot created in mid-2014 (more on this below) - there had been no effort to put these facts into perspective or to present, in statistical terms, how government networks were being used to manipulate the world's main online encyclopedia. By using a data journalism approach to make sense of this information, we were able to extract knowledge from apparently shallow data, scraping metadata and applying several data cleaning and analysis methods to the raw figures. We modeled the appearance of our charts using R (ggplot2) and adapted a data visualization tool to give our charts a final touch. This story is also a good example of how data journalism can reveal hidden stories that are in the public interest and shed light on underreported topics. Clear and visually pleasing graphics, like those accompanying our story, are also key to helping readers quickly make sense of why such topics matter to their lives.
What was the impact of your project? How did you measure it?
The story, which secured a homepage splash at Piauí and was featured on the homepage of Brazil's largest newspaper, Folha de Sao Paulo, was widely shared on social media and generated strong responses from readers. As the first investigation of this kind ever published in the Brazilian media, the piece brought to the fore topics such as the lack of accountability, weak transparency and conflicts of interest that run deep across all branches of the Brazilian government. The project's impact can be measured in terms of greater public transparency and citizen vigilance regarding the manipulation of public-domain information.
Source and methodology
In previous years, there had been various stories in the Brazilian press about controversial individual edits to Wikipedia entries made from government computers. However, there had not yet been a full compilation, over a long period of time, of the nature of these edits: what kind of changes were made to the public encyclopedia, where they came from and whom they benefited. We were interested in producing a story that would shed light on the scale of such activity across all sectors of the Brazilian government - including Congress, state-owned companies and ministries - and reveal whether these edits were in the public interest or not. The main issue was data acquisition: where to find the records of such activity? Our starting point was a Twitter bot, launched a few years earlier by a young Brazilian programmer, that tracked all changes on Wikipedia and issued a tweet whenever an edit was made from a government IP. Although that was useful information, the format (a list of tweets spread over a three-year period) did not allow us to make sense of the data. To be able to examine these records in detail and discover meaningful insights, we used a combination of programming languages, such as R (for graphics modelling) and Python (for scraping and basic statistical insights), plus Google Sheets for data cleaning and analysis (more on the tools below). We also reached out to every politician and institution mentioned in the piece to get their side of the story, and interviews with local NGOs focused on government transparency and accountability helped us frame the debate on the ethical principles that should govern public officers.
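As an illustration of the scraping step mentioned above, here is a minimal Python sketch of pulling the bot's timeline through the standard Twitter API with tweepy. This is an assumption for illustration only, not the team's actual script: the submission explains they adapted a tool called Tweep to go beyond the standard API's 3,200-tweet cap, and the credentials and output file name below are hypothetical.

```python
import csv
import tweepy  # standard Twitter API client; not the adapted Tweep tool the team actually used

# Hypothetical credentials -- replace with your own API keys.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

# Walk the @brwikiedits timeline (the standard API caps this at roughly 3,200 tweets)
# and store the raw tweets for later cleaning.
with open("brwikiedits_raw.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "created_at", "text"])
    for tweet in tweepy.Cursor(api.user_timeline,
                               screen_name="brwikiedits",
                               tweet_mode="extended").items():
        writer.writerow([tweet.id_str, tweet.created_at.isoformat(), tweet.full_text])
```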
Technologies Used
The data was scraped from a Twitter bot called BRWikiedits (www.twitter.com/brwikiedits) using a Python-based script. We obtained over 7,000 records -- although the Twitter API only returns a user's most recent 3,200 tweets, there are a few workarounds to obtain everything (we adapted a tool called Tweep to our needs). To check for potential transparency conflicts around the manipulation of Wikipedia entries, we consulted representatives of two of the most respected NGOs in this area: Transparency International and Transparência Brasil. For cleaning and analysing the data, we developed Google Sheets functions that allowed us to disaggregate information from the tweets into a more descriptive and informative format, applying categories and separating entry names from the body of each tweet. On the resulting tables, we used a Python-based tool called csvkit to get basic statistical insights. The visualizations were made in R, using the ggplot2 library, and Playfair was also used to enhance the clarity and appeal of the visuals. Getting everything ready and checked was a two-month effort.
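As an illustration of the disaggregation step (which the team did with Google Sheets functions), the sketch below shows the same idea in Python. The tweet format in the regular expression is a hypothetical pattern for illustration; the bot's real wording and the file names may differ.

```python
import csv
import re

# Hypothetical tweet format for illustration only, e.g.:
# "<entry name> Wikipedia article edited anonymously from <agency> <diff url>"
PATTERN = re.compile(
    r"^(?P<entry>.+?) Wikipedia article edited anonymously from "
    r"(?P<agency>.+?) (?P<url>https?://\S+)$"
)

def parse_tweet(text):
    """Split one bot tweet into entry name, originating agency and diff URL."""
    match = PATTERN.match(text.strip())
    if not match:
        return None
    return match.groupdict()

# Read the raw tweets collected earlier and write a tidy table for analysis.
with open("brwikiedits_raw.csv", encoding="utf-8") as src, \
     open("brwikiedits_clean.csv", "w", newline="", encoding="utf-8") as dst:
    writer = csv.DictWriter(dst, fieldnames=["entry", "agency", "url"])
    writer.writeheader()
    for row in csv.DictReader(src):
        parsed = parse_tweet(row["text"])
        if parsed:
            writer.writerow(parsed)
```

From a table like this, csvkit's command-line utilities such as csvstat and csvcut can produce the kind of basic statistical summaries described above.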