This is my belated response to the culturomics postings run by the NYTimes last month. The bottom line is that I wasn’t that impressed by the projects discussed there, but I do feel that projects like these have plenty of implications for the literary studies we might want to pursue over the next decade or two.  

From my own perspective, the biggest issue with both the N-Gram and the “culturomics” derived from it is that they seem to be research tools in search of an appropriate research problem, an impression that was reinforced by the sample problems discussed in the NYT piece. Admittedly, the N-Gram does produce very suggestive visualizations of the frequency of selected key terms over time. Yet Google Books’ notorious metadata and OCR problems render any spike or dip in the graphs suspect, and useless as evidence without further investigation. This means that the most interesting portions of the graph, the visible “changes,” are essentially off-limits to public discussion until problems like misdatings are cleared away.

The larger issue, though, as Geoffrey Nunberg has pointed out, is what exactly we think we are learning when we track the frequencies of particular words. In one report, for example, counting the number of times the term “God” appears in Victorian writings over time is supposed to tell us something about the long-term, large-scale process of secularization in 19th-century British culture. But even if we refuse to read or interpret the hundreds of novels contained in that database (as prescribed by Moretti’s now familiar notion of “distant reading”), we still need to read the results closely enough to produce a plausible interpretation of what they mean.

For example, two digital scholars are convinced that the relative frequency of terms like “hope” and “happiness” between the beginning and end of the 19th century can tell us something interesting about the Victorian novel. I am perfectly happy to entertain this idea. Yet how can this claim be tested except by reading and arguing in a very concrete way about some portion of the novels contained in that database? In this respect, I think the veneer of positivism attached to this kind of project comes off pretty quickly, like a bad paint job, the moment we talk about the validation of such claims. After all, competing interpretations of the results would be settled not with “better” or more data, but by weighing competing explanations against one another, with their respective warrants, evidence, and argumentative self-consistency.

In our own exchanges on this project, Ben Pauley pointed me to this useful comparison between Mark Davies’ COHA project and culturomics, and I think Davies raises the key issues that should complicate any discussion of word frequencies and the significance of their shifts as evidence for cultural change: the first issue, if I understand it properly, resides with the “collocates,” or nearby words, that indicate the conceptual clusters (and contextual frameworks) that particular words are embedded within (e.g., gay New York vs. gay Paris); the second, related to the first, is about synonymy, which again suggests the need to relate words to the specific groups of synonyms attached to a particular use (e.g., gay=brilliant, jolly, joking); the last is about genre, which remains an indispensable context for understanding the tacit and social dimensions of the word and its circulation.
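To make the point about collocates a little more concrete, here is a minimal Python sketch of what counting "nearby words" might look like. It is only an illustration, not how COHA or the N-Gram viewer actually work: the function name `collocates`, the window size, and the two sample sentences are all invented for the example.

```python
import re
from collections import Counter


def collocates(text, target, window=3):
    """Count the words found within `window` tokens of each use of `target`.

    Hypothetical helper for illustration; real corpus tools also handle
    lemmas, part-of-speech tags, and statistical association measures.
    """
    # Crude tokenizer: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the target word itself
                counts[tokens[j]] += 1
    return counts


# Two made-up sentences (not drawn from any corpus) in which the raw
# frequency of "gay" is identical: one occurrence each.
sample_a = "They spent a gay evening in Paris, brilliant and jolly."
sample_b = "He wrote a history of gay New York."

print(collocates(sample_a, "gay"))
print(collocates(sample_b, "gay"))
```

A plain frequency count treats the two sentences as equivalent, since each contains the word once. The collocate counts are what separate them, and deciding what that separation means still takes a reader.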

It seems to me that any counting of word frequencies, in the absence of this kind of information (e.g., in what contexts, in what surroundings, using which synonyms, with what kinds of other terms, do Victorian novelists mention God?) makes this sort of analysis unpersuasive. And I do wish that the scholars pursuing this kind of analysis would familiarize themselves with the practices of conceptual history. In my view, Koselleck’s pioneering work in conceptual history seems closely related to the culturomics style of statistical analyses of culture, though culturomics has a vastly enlarged set of corpora to search through. But perhaps the main value of such statistical research is to perform a kind of defamiliarization exercise on our historical understandings of a period, so that we can look beyond existing histories to construct our own?

Having said all this, I do wish that there were ways to attach the power of the distant reading paradigm to current practices in literary and cultural history.  Thoughts, anyone?



  1. Laura Rosenthal

    One tangential thought: I think we have to be careful about whatever conclusions we draw given that the databases aren’t really comprehensive. (I and many people I know have had the experience of finding things that are not available on ECCO, for example.) The analogy might be when you see students make claims about the significance of the very first time a word was used in a particular way because it is the first time it appears with this meaning in the OED. Nevertheless, in both cases there will be some insight. On the other side of things, I think the databases force us to be more precise about making other kinds of claims. Can you really make claims anymore about ‘the rise of the novel’ or whatever based on a handful of major authors? This was of course already unravelling but the databases still make a difference. It’s not yet clear to me how they will shape arguments of the future.

  2. Hey Laura, I think the databases underscore our sense of the arbitrariness of certain judgments about “representative instances.” We may feel more tied than ever to certain canonical figures, however, as the only hope of organizing a much larger and more intimidating mass of material. We certainly feel more confident generalizing on the basis of that information, with specific caveats and provisos we might not have developed before. One thing that I have noticed is that the natural unit of analysis for the database tends to be a concept or keyword, and this tends to atomize–the way old-style philology did–the literary text beyond recognition. But I’ve been surprised at the number of people who end up thinking about concepts as a result of doing this work, and even when proceeding from an analysis of an individual work.