This is my belated response to the culturomics postings run by the NYTimes last month. The bottom line is that I wasn’t that impressed by the projects discussed there, but I do feel that projects like these have plenty of implications for the literary studies we might want to pursue over the next decade or two.
From my own perspective, the biggest issue with both the N-Gram and the “culturomics” derived from it is that they seem to be research tools in search of an appropriate research problem, an impression that was reinforced by the sample problems discussed in the NYT piece. Admittedly, the N-Gram does produce very suggestive visualizations of the frequency of selected key terms over time. Yet Google Books’ notorious metadata and OCR problems render any spike or dip in the graphs suspect, and useless as evidence without further investigation. This means that the most interesting portions of the graph, the visible “changes,” are essentially off-limits to public discussion until problems like misdatings are cleared away.
The larger issue, though, as Geoffrey Nunberg has pointed out, is what exactly we think we are learning when we track the frequencies of particular words. In one report, for example, counting the number of times the term “God” appears in Victorian writings over time is supposed to tell us something about the long-term, large-scale process of secularization in 19th-century British culture. But even if we refuse to read or interpret the hundreds of novels contained in that database (as prescribed by Moretti’s now familiar notion of “distant reading”), we still need to read the results closely enough to produce a plausible interpretation of what they mean.
For example, two digital scholars are convinced that the relative frequency of terms like “hope” and “happiness” between the beginning and end of the 19th century can tell us something interesting about the Victorian novel. I am perfectly happy to entertain this idea. Yet how can this claim be tested except by reading and arguing in a very concrete way about some portion of the novels contained in that database? In this respect, I think the veneer of positivism attached to this kind of project comes off pretty quickly, like a bad paint job, the moment we talk about the validation of such claims. Competing interpretations of the results would be settled not with “better” or more data, but by competing explanations with their respective warrants, evidence, and argumentative self-consistency.
In our own exchanges on this project, Ben Pauley pointed me to this useful comparison between Mark Davies’ COHA project and culturomics. I think Davies raises the key issues that should complicate any discussion of word frequencies and of whether shifts in those frequencies count as evidence of cultural change. The first issue, if I understand it properly, concerns the “collocates,” or nearby words, that indicate the conceptual clusters (and contextual frameworks) that particular words are embedded within (e.g., gay New York vs. gay Paris). The second, related to the first, is synonymy, which again suggests the need to relate words to the specific groups of synonyms attached to a particular use (e.g., gay=brilliant, jolly, joking). The last is genre, which remains an indispensable context for understanding the tacit and social dimensions of a word and its circulation.
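To make the collocate point concrete, here is a minimal sketch in Python of what counting nearby words, rather than bare frequencies, might look like. It is not how COHA or the N-Gram tools work internally. The function name `collocates`, the window size, and the two toy sentences are all illustrative assumptions, and the sentences are invented rather than taken from any corpus.

```python
import re
from collections import Counter


def collocates(text, target, window=2):
    """Count the words found within `window` tokens of each use of `target`.

    Hypothetical helper for illustration only: tokenization is a crude
    lowercase split on letters and apostrophes, with no lemmatizing.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the target word itself
                counts[tokens[j]] += 1
    return counts


# Invented toy text: one devotional use and one exclamatory use of "God".
sample = (
    "She thanked God for the hope of happiness. "
    "Good God, he cried, what a dreadful business."
)

neighbours = collocates(sample, "god", window=2)
for word, n in neighbours.most_common():
    print(word, n)
```

A bare frequency count would register two uses of “God” here and nothing more. The neighbouring words (“thanked” in one case, “good” and “cried” in the other) are what separate a devotional use from an exclamation, and even then someone has to read and interpret them.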
It seems to me that any counting of word frequencies in the absence of this kind of information (e.g., in what contexts, in what surroundings, using which synonyms, with what kinds of other terms, do Victorian novelists mention God?) is bound to be unpersuasive. And I do wish that the scholars pursuing this kind of analysis would familiarize themselves with the practices of conceptual history. In my view, Koselleck’s pioneering work in conceptual history seems closely related to the culturomics style of statistical analysis of culture, though culturomics has a vastly enlarged set of corpora to search through. But perhaps the main value of such statistical research is to perform a kind of defamiliarization exercise on our historical understandings of a period, so that we can look beyond existing histories to construct our own?
Having said all this, I do wish that there were ways to attach the power of the distant reading paradigm to current practices in literary and cultural history. Thoughts, anyone?