Google Opens Books to New Cultural Studies
Google Opens Books to New Cultural Studies | Science/AAAS
In March 2007, a young man with dark, curly hair and a Brooklyn accent knocked on the door of Peter Norvig, the head of research at Google in Mountain View, California. It was Erez Lieberman Aiden, a mathematician doing a Ph.D. in genomics at Harvard University, and he wanted some data. Specifically, Lieberman Aiden wanted access to Google Books, the company’s ambitious—and controversial—project to digitally scan every page of every book ever published.
The first explorations of the Google Books data are now on display in a study published online this week by Science (www.sciencemag.org/content/early/2010/12/15/science.1199644.abstract). The researchers have revealed 500,000 English words missed by all dictionaries, tracked the rise and fall of ideologies and famous people, and, perhaps most provocatively, identified possible cases of political suppression unknown to historians. “The ambition is enormous,” says Nicholas Dames, a literary scholar at Columbia University.
Norvig admits he had concerns about the legality of sharing the digital books, which cannot be distributed without compensating the authors. But Lieberman Aiden had an idea. By converting the text of the scanned books into a single, massive “n-gram” database—a map of the context and frequency of words across history—scholars could do quantitative research on the tomes without actually reading them. That was enough to persuade Norvig.
Analysis of the n-gram database can also reveal patterns that have escaped the attention of historians. Aviva Presser Aiden led an analysis of the names of people that appear in German books in the first half of the 20th century. (She is a medical student at Harvard and the wife of Erez Lieberman Aiden.) A large number of artists and academics of this era are known to have been censored during the Nazi period, for being either Jewish or “degenerate,” such as the painter Pablo Picasso. Indeed, the n-gram trace of their names in the German corpus plummets during that period, while it remains steady in the English corpus.
Orwant says that both the available data and analytical tools will expand: “We’re going to make this as open-source as possible.” With the study’s publication, Google is releasing the n-gram database for public use. The current version is available at www.culturomics.org.