As the artificial intelligence reporter at MIT Technology Review, I am constantly seeking new ways to demystify a difficult technical subject and make it accessible to a non-technical audience. AI is particularly confusing because the term's definition has evolved quickly and dramatically since the inception of the field, so I wanted a simple, visual way of showing readers its brief yet tumultuous history. Fortunately, AI research is built on a culture of openness, and the community maintains one of the largest open databases of scientific research papers, the arXiv (pronounced "archive"). I saw it as a perfect data source for capturing the field's evolution over time. The resulting story, "We analyzed 16,625 papers to figure out where AI is headed next," provides a sweeping bird's-eye view of the three biggest trends that have occurred in AI research over the last 25 years. The data visualization is paired with reporting to explain how and why these trends came about and the significance they hold for the technology's future.
What makes this project innovative?
Artificial intelligence is a heavily reported-on field, and much ink has been spilled on its history. But by combining data scraping, natural language processing, visualization, and reporting, this story finally brought that history to life. It allows readers to immediately see the battles of ideas that have deeply informed the technology and how far it still has to go to reach full maturity. In this way, it makes AI feel more tangible and concrete: a product of rich debate within a community rather than the brainchild of a handful of geniuses. It also swiftly debunks a common misconception that AI is somehow perfect, by unveiling the humanity behind its creation.
What was the impact of your project? How did you measure it?
The story clearly resonated with a broad audience. It received 162,000 page views, with an average read time of 2 minutes. It was also shared widely on social media, with over 1,250 unique tweets, by both AI experts (https://twitter.com/zittrain/statuses/1088806727435329538) and lay readers (https://twitter.com/i/web/status/1092063294297382912). My own tweet announcing the story received over 1,000 likes and 500 retweets (https://twitter.com/_KarenHao/status/1088801864227921920). I also published the majority of the code on GitHub (https://github.com/karenhao/techreview_arxiv_scrape), where the repository was starred 35 times and forked 3 times.
Source and methodology
I scraped the abstracts of all of the papers ever published in the artificial intelligence section of the arXiv and performed basic natural language processing to find the key terms used in the field over the years. My methodology is given in more detail on my GitHub: https://github.com/karenhao/techreview_arxiv_scrape.
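The scraping step can be sketched as follows. This is an illustration of parsing the Atom feed that the public arXiv API returns, not the exact code from the repository; the sample entry is hardcoded so the example runs offline, and the query URL in the comment shows what a live request would look like.

```python
# Sketch of parsing one page of the arXiv API's Atom response.
# The real scrape pages through every cs.AI paper; here we parse a
# hardcoded sample entry so the example is self-contained.
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"

# A live query would look like:
# http://export.arxiv.org/api/query?search_query=cat:cs.AI&start=0&max_results=100
SAMPLE_FEED = """<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>A sample paper</title>
    <published>2017-06-12T00:00:00Z</published>
    <summary>Deep neural networks learn representations from data.</summary>
  </entry>
</feed>"""

def parse_entries(feed_xml: str) -> list[dict]:
    """Extract title, publication year, and abstract from each Atom entry."""
    root = ET.fromstring(feed_xml)
    entries = []
    for entry in root.iter(f"{ATOM_NS}entry"):
        published = entry.findtext(f"{ATOM_NS}published")
        entries.append({
            "title": entry.findtext(f"{ATOM_NS}title"),
            "year": int(published[:4]),  # e.g. "2017-06-12..." -> 2017
            "abstract": entry.findtext(f"{ATOM_NS}summary").strip(),
        })
    return entries

papers = parse_entries(SAMPLE_FEED)
```

In practice the same parser is run over each paginated API response until the full cs.AI archive has been collected.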
I used Python for all scraping and data analysis, and Jupyter Notebook as my interactive development environment. Specifically, I used the following packages: pandas and numpy for basic data cleaning and analysis; matplotlib and seaborn for visualization; and nltk, string, and collections for natural language processing. I also used a mix of Python (in Jupyter notebooks), Adobe Illustrator, and Datawrapper to make my data visualizations.
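The term-counting step can be sketched as below. This is a simplified illustration, assuming abstracts have already been scraped into a pandas DataFrame with `year` and `abstract` columns; the published analysis used nltk for tokenization and stopword removal, whereas this version uses a plain split and a tiny hand-rolled stopword list so it runs standalone.

```python
# Minimal sketch of counting key terms in abstracts by year.
# Assumes abstracts are already in a DataFrame with 'year' and
# 'abstract' columns; the full pipeline is in the linked repository.
import string
from collections import Counter

import pandas as pd

# Tiny illustrative stopword list; the real analysis used nltk's.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "for", "we", "is", "as", "from"}

def term_counts_by_year(df: pd.DataFrame) -> dict:
    """Count term frequencies in abstracts, grouped by publication year."""
    counts = {}
    for year, group in df.groupby("year"):
        counter = Counter()
        for abstract in group["abstract"]:
            # Lowercase, strip punctuation, split into tokens.
            tokens = abstract.lower().translate(
                str.maketrans("", "", string.punctuation)
            ).split()
            counter.update(t for t in tokens if t not in STOPWORDS)
        counts[year] = counter
    return counts

# Example with two toy abstracts from different eras:
demo = pd.DataFrame({
    "year": [1998, 2017],
    "abstract": [
        "Expert systems encode knowledge as rules.",
        "Deep neural networks learn representations from data.",
    ],
})
by_year = term_counts_by_year(demo)
```

Tracking how the resulting per-year counters rank terms like "expert systems" versus "neural networks" is what surfaces the field's shifting trends over time.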
Daniel Zender created the header illustration.