Google's Ngram Database


There was a piece in The Times this weekend about Google's Ngram Database, which tabulates the frequency of words used in over five million books. The author of the op-ed, Marc Egnal, raises a lot of interesting points, and suggests that word frequency can help us to understand what common themes run through literary works. The major upside to accessing big data here is that we can cover a huge amount of literature; it would never be possible for one scholar, or even a group of scholars, to read all of these books. And, as Egnal suggests, it might help scholars overcome their own biases when analyzing texts. That being said, I'm left with a few questions.

How is it possible to determine the context in which a word is being used? How do you account for the words that surround the key word? Egnal uses "optimism" as one of his examples when examining novels from the 1930s, a time that is generally considered to be pessimistic. Turns out, use of the word actually rose during this time. But the word itself could be used in a negative way, e.g. "Optimism is for suckers." or "The Great Depression is sure killing my optimism." I'm really curious to know if the database can determine how the word is being used, not just that it is being used.
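To make the "surrounding words" question a bit more concrete, here is a minimal keyword-in-context sketch in Python. It assumes you have the full text of a passage in hand, which is my assumption and not something the op-ed describes; the function name `keyword_in_context` and the `window` parameter are made up for illustration. It only pulls out the neighboring words. Deciding whether those neighbors are cheerful or gloomy is still left to the reader.

```python
import re

def keyword_in_context(text, keyword, window=4):
    """Return (left, match, right) word windows around each hit of keyword.

    Hypothetical helper for illustration; not part of any Google tool.
    """
    # Crude tokenizer: runs of letters and apostrophes count as words.
    tokens = re.findall(r"[A-Za-z']+", text)
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits

# The two example sentences from the paragraph above.
sample = "Optimism is for suckers. The Great Depression is sure killing my optimism."

for left, word, right in keyword_in_context(sample, "optimism"):
    print(f"{left} [{word}] {right}")
```

Both hits count equally toward the frequency of "optimism," even though neither sentence is remotely optimistic, which is exactly the worry.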

Interestingly, the words are tagged according to their parts of speech, so a word like "phone" can be reviewed as either a noun or a verb. The number of nerdly grammar patterns that can be generated from the Ngram Database is astounding (and, as I am finding out this afternoon, a major time suck). And, when considering simple nouns (which can't really be interpreted in multiple ways) such as zombie, vampire, and ghost, it is truly fascinating to see the usage over time. It also seems like this tool would be extremely useful for writers (screenwriters, particularly) to confirm the vernacular of a certain time period.

Though I can see the many uses for this data, all I can think is: the data can't analyze itself. Though there are certain biases that will arise (like the fact that historically the majority of literary scholars were white men, reading white men's books), someone needs to look at and interpret the numbers. In some ways, I believe literary analysis must be subjective. Counting the number of specific words used over time in a number of books can give us some valuable insight, and certainly might provide a counter to something we thought was intuitive, but at the end of the day, a well-read human being who has an understanding not only of literature, but also culture, history, sociology, etc., will be able to make hypotheses about what something means. Data surrounding art is extremely ambiguous. This is precisely what makes art so difficult to understand. We must think about it as it relates to other pieces of art, and also the rest of the world. It doesn't exist in a vacuum. How can a database make sense of irony, metaphor, simile, hyperbole? If a word is used in jest, this changes the meaning of the word. What about made-up words? There are so many subtle techniques used in literature that require the human mind to make sense of them that I wonder what this data can really tell us. Nevertheless, I'm pretty thrilled to see a data visualization that showcases the massive uptick in vampire literature in the mid-90s.