Each story in this portfolio was driven by data that had not previously existed, been structured or been published. My work is driven by discovery and by pushing the field of programmatic journalism further towards creative data collection, beyond the processed summary tables that governments or companies give us.
When the #metoo movement began, for instance, I sought a data approach that would help contextualize the subject. While anecdotal stories initially focused on very powerful men in media and politics, I wanted to show that the issue is pervasive across industries and particularly pronounced in lower-paying ones. Finding any data on sexual harassment is hard: the subject can escape clear, firm categorizations and definitions.
For our Oscars coverage we wanted to better understand how much more diverse speaking roles in some of Hollywood’s top films have become. We found scripts for 22 films from different years, going as far back as 1989, turned each spoken line into data associated with a character, and built a database of actors that classified their ethnicity and gender. The results showed that the Oscars, while more diverse in their nominees, were still highlighting films that weren’t all that much more diverse.
Most importantly, I’ve been keen to push further into understanding the social web, both to investigate real-life practices that can be traced through the digital footprints people leave behind and to examine the ways in which the social web has become its own ecosystem. My stories looked at millions of Facebook posts to measure the reach of hyperpartisan news and examined thousands of newsfeed items to facilitate a dialogue between a conservative mother and her liberal daughter about their disparate online experiences.
Recognizing that social data literacy is important, I’ve made it my personal quest to make this kind of journalism more accessible. I’ve put together reproducible scripts on GitHub (https://github.com/lamthuyvo/social-media-data-scripts) and have taught these practices to journalists and other interested researchers in the US and in places as far-flung as Kenya. I am also writing a coding book for beginners (https://docs.google.com/document/d/1gXKdILpTmwzvn5w7mj7NgN55zT668xrM1wNjCYJG3Mw/edit?usp=sharing) with No Starch Press, which I will publish online for free to make this work more accessible to others, especially to journalists and researchers from non-technical backgrounds.
What makes this project innovative?
To a sleuth, content and information that live on the web and inside documents are just data waiting to be structured.
I’ve grabbed information from HTML pages and from APIs, and have turned PDFs of film scripts into crawlable, XML-based screenwriting files. I’ve also inundated government data scientists at the EEOC with requests and kindly offered to do the analysis myself if I were handed anonymized raw data instead (which they did give us).
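As a rough illustration of what turning a film script into structured data can look like, here is a minimal sketch using only Python's standard library. The XML layout (a `Script` root with `Paragraph` elements typed as `Character` or `Dialogue`), the sample dialogue and the helper names `extract_lines` and `words_per_character` are all assumptions made for this example; they are not the actual files or code behind the stories described here.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical, simplified screenplay XML. Real screenwriting files
# will differ; this layout is an assumption for illustration only.
SAMPLE_SCRIPT = """
<Script>
  <Paragraph Type="Scene Heading"><Text>INT. NEWSROOM - DAY</Text></Paragraph>
  <Paragraph Type="Character"><Text>REPORTER</Text></Paragraph>
  <Paragraph Type="Dialogue"><Text>Where did this data come from?</Text></Paragraph>
  <Paragraph Type="Character"><Text>EDITOR</Text></Paragraph>
  <Paragraph Type="Dialogue"><Text>We built it ourselves.</Text></Paragraph>
  <Paragraph Type="Character"><Text>REPORTER</Text></Paragraph>
  <Paragraph Type="Dialogue"><Text>Then let's publish it.</Text></Paragraph>
</Script>
"""


def extract_lines(xml_text):
    """Return one record per spoken line, tagged with its character."""
    root = ET.fromstring(xml_text.strip())
    records = []
    current_character = None
    for paragraph in root.iter("Paragraph"):
        text = "".join(paragraph.itertext()).strip()
        kind = paragraph.get("Type")
        if kind == "Character":
            # A character cue applies to the dialogue that follows it.
            current_character = text
        elif kind == "Dialogue" and current_character is not None:
            records.append({
                "character": current_character,
                "line": text,
                "words": len(text.split()),
            })
    return records


def words_per_character(records):
    """Sum the number of spoken words for each character."""
    totals = Counter()
    for record in records:
        totals[record["character"]] += record["words"]
    return dict(totals)


records = extract_lines(SAMPLE_SCRIPT)
totals = words_per_character(records)
```

A list of records like this could then be loaded into a pandas DataFrame with `pandas.DataFrame(records)` and joined against a separate table of actors for further analysis.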
Another way in which we’ve tried to be more creative with our data practices is to use data collected for commercial purposes in new ways. Examples include using a third-party data gatherer like Buzzsumo, an aggregator that tracks the spread of URLs for marketing purposes, and using a database, discovered by my colleague Scott Pham, that was created for Hollywood agents and studios to find ethnicity data on actors. Throughout my work I try to look at data from all kinds of places and servers, not just those owned by the government.
What was the impact of your project? How did you measure it?
After reading our story based on the sexual harassment data from the EEOC, Senator Kirsten Gillibrand’s office told us she wrote a letter to the Bureau of Labor Statistics requesting that data related to sexual harassment be collected. Academics and other reporters have also used the data for their own reporting or academic studies after we published it in its raw form on GitHub.
Similarly, the data behind a story about the partisan news environment on Facebook, for which my collaborators and I compiled a list of news organizations, has been used by various academics for their research on misinformation.
Last but not least, many of the stories I’ve produced yielded GitHub repositories that are used around the world. I strongly believe that the field of social media data analysis needs more diverse stakeholders: what happens on the web no longer affects only a select few ‘super users’ of online media, it has moved deeply into the ways in which our information flows. I’m writing a Python book on the subject that I’ve been giving out in stages (https://docs.google.com/document/d/1gXKdILpTmwzvn5w7mj7NgN55zT668xrM1wNjCYJG3Mw/edit?usp=sharing) and that will be freely available online, in the hope that it reaches non-coders like the Kenyan digital journalists and the California human rights law students I’ve trained (the book is aimed at beginners). I’ve also contributed a chapter on social media data to the Data Journalism Handbook, hoping to raise the subject’s profile.
Source and methodology
Please see the more detailed descriptions under "What makes this project innovative?"
Python (mainly pandas, beautifulsoup)
Adobe suite (Premiere, After Effects, Photoshop, Illustrator).