topic modeling and a theory of language

The much-anticipated special issue of Poetics devoted to topic modeling in cultural sociology is now available, and it’s a beaut! Props to John Mohr and Petko Bogdanov for editing the special issue, and to all the authors for an exciting group of articles.

There is, quite appropriately, a lot of buzz about the potential of “big data” and quantitative analysis of text, in particular for cultural analysis, since so much of culture seems to make its way into text in one form or another. The articles in the special issue combine into a grand showcase of the possibilities of quantitative analysis of text. I’ll comment on most of them below, but I think most–like much quantitative analysis of text in general–suffer from two theoretical shortcomings. Specifically:

  • with the partial exception of the Mohr, Wagner-Pacifici, Breiger, and Bogdanov article, the studies lack a well-conceptualized theory of language, which leads to some conceptual slippage.
  • there is little attention to the conditions of production of text: whose words, and which words, are written down, archived, and digitized.

The buzz about topic modeling (and quantitative analysis of text more generally) is driven by the decline in computing costs, the associated increase in technical availability of text analysis tools, and what McFarland et al. (in this issue) note about the sheer availability of text to analyze:

we have witnessed an explosion of freely available digitized material, a significant amount of which is text. In many instances, we can see the written communication of an entire community over time. In other instances, we have large-scale corpora that are repositories of a population’s knowledge and communication (e.g., ISI Web of Knowledge)

Mohr and Bogdanov provide a helpful introduction to topic models, “an automated procedure for coding the content of a corpus of texts (including very large corpora) into a set of substantively meaningful coding categories called ‘topics.'” A topic, they say, is “the constellation of words that tend to come up in a discussion (and, thus, to co-occur more frequently than they otherwise would) whenever that (unobserved and latent) topic is being discussed.”
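To make that definition concrete, here is a minimal sketch of fitting a topic model in Python with scikit-learn. The tiny corpus, the two-topic setting, and every parameter choice are mine, purely for illustration; nothing here comes from the articles under review.

```python
# A minimal sketch, assuming scikit-learn is installed; the corpus and
# the two-topic choice are illustrative only, not from the special issue.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the senate debated arts funding and the federal budget",
    "the museum exhibition drew protests over public funding",
    "grant programs for artists faced congressional scrutiny",
]

# Topic models see only word counts: each document becomes a bag of words.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

# Fit LDA with K = 2 latent topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

# Each "topic" is a distribution over the vocabulary; its highest-weight
# words are the co-occurring constellation Mohr and Bogdanov describe.
vocab = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top_words = [vocab[i] for i in weights.argsort()[::-1][:5]]
    print(f"topic {k}: {top_words}")
```

The printout is exactly their definition in miniature: a topic is nothing more than a weighting over the vocabulary, and the top-weighted words are the constellation that tends to co-occur.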

Throughout the article, Mohr and Bogdanov refer back to Lasswell’s work on content analysis, particularly The Comparative Study of Symbols (1951). The idea is that Lasswell’s ambition is topic modeling’s realization: the ability to “count… and then… begin to interpret” makes the promise of content analysis a reality. As an aside, though, Lasswell was faulted in his day for an overemphasis on countability vis-à-vis interpretation in that very work:

One might have wished that the authors had devoted more thought and space to possible explanations of their findings… than to the constant defense of their methods…. The above brief summary indicates some of the rich findings which fail to be exploited by the authors in terms of any theoretical or interpretive analysis. (107)

DiMaggio, Nag, and Blei continue on this course, examining substantive affinities between topic modeling and cultural theory. Using a corpus of newspaper articles about government funding of the arts, they uncover theoretically important, substantively interesting patterns of word combinations.

They provide a very useful set of conditions that a method ought to fulfill in order to be useful for cultural analysis. I want to focus on one of these conditions:

it must be inductive to permit researchers to discover the structure of the corpus before imposing their priors on the analysis

This is important to the practice of topic modeling. It is also important to thinking through the theory of language implicit in the method. Effectively, they are saying that corpora have structures that are discoverable separately from the linguistic structures that enable them (e.g., grammar, syntax, discourse). This theoretical move is necessary in order to license the bag-of-words approach that topic modeling uses, and it turns out to be a very productive move in terms of discerning patterns of word usage across texts. But it’s also probably wrong as an actual theory of language: not just because of word order (a shortcoming several of the articles acknowledge), but because utterances are constrained and enabled by syntactic, grammatical, and discursive structures. Parole is not langue.
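A toy example (mine, not the authors’) makes the cost of the bag-of-words assumption vivid: sentences with opposite meanings can be literally indistinguishable once word order is discarded.

```python
# Two sentences with opposite meanings collapse to the same bag of words.
from sklearn.feature_extraction.text import CountVectorizer

pair = ["the critics praised the artists",
        "the artists praised the critics"]
X = CountVectorizer().fit_transform(pair)

# The two count vectors are identical, so no bag-of-words model can
# distinguish who praised whom.
print((X[0] != X[1]).nnz == 0)  # True
```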

Latent Dirichlet Allocation (LDA, the form of topic modeling they use), they write, is “a statistical model of language” that “takes a relational approach to meaning, in the sense that co-occurrences are important in the assignment of words to topics.” These are, indeed, major strengths of the approach, and the proof of the pudding is in the eating, as we learn substantively interpretable things about discussions over arts funding by applying this model. But it’s important to recognize that, at best, LDA is a statistical model of speech acts (parole), not of language (langue). Like any good model, it is wrong but useful, and it discards some information in order better to interpret other information. The form of relationality it implements is useful but partial; some relations are privileged over others, and this privileging has theoretical implications, not just pragmatic ones.
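For readers who want the “statistical model” spelled out, the standard generative story for LDA (Blei, Ng, and Jordan 2003) is, in the usual notation:

```latex
% LDA's generative process: K topics, D documents, document d of length N_d
\begin{align*}
\phi_k   &\sim \mathrm{Dirichlet}(\beta)  && \text{topic $k$'s distribution over the vocabulary} \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha) && \text{document $d$'s mixture over topics} \\
z_{d,n}  &\sim \mathrm{Multinomial}(\theta_d) && \text{latent topic for token $n$ of document $d$} \\
w_{d,n}  &\sim \mathrm{Multinomial}(\phi_{z_{d,n}}) && \text{the observed word token}
\end{align*}
```

Nothing in this story registers that token $n$ follows token $n-1$; word order, syntax, and discourse are invisible to the model by construction. That is the formal version of the point that LDA models parole, not langue.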

The next couple of articles–by Miller and by Bonilla and Grimmer–apply topic modeling to interesting empirical cases, demonstrating latent patterns of word usage in Qing Dynasty records (Miller) and media reporting on terrorism (Bonilla and Grimmer). Two articles more squarely in the digital humanities (Tangherlini and Leonard, Jockers and Mimno) are in a different class because their interest is in text per se, not in culture as represented through, or created by, text. So the theoretical leaps are not as substantial in these cases.

Marshall’s fascinating “Defining Population Problems” identifies systematic differences in the discourse surrounding fertility in French and British demography, 1946-2005.

Topic modeling analyses revealed that French demographic research is generally more closely tied to domestic policy than is British demographic research: for instance, analyses identified topics in the French journal related to family policy and retirement policy, but did not identify any similarly policy-centered British topics. The extent of the influence of demographic research on policy and popular understandings, however, is left unresolved. It is not clear whether British distance from mass-media topics represents autonomy or irrelevance (positions that are not mutually exclusive). Nor is it clear whether French policy relevance is produced by demographic research that follows policy debates or leads them.

Again, an important analysis with interesting and relevant findings. I wonder, though, about the conditions of production of these texts. Are the journals in the two societies really so institutionally similar that differences in word patterns allow for inference about an academic field (demography)? If the cultural or institutional idea of what a scientific journal is differs between the two societies, we’d need a different interpretive approach to understanding these patterned differences.

Let me wrap up the review of articles with Mohr, Wagner-Pacifici, Breiger, and Bogdanov’s analysis of national security strategies. At least in the circles I move in, this collaboration has been the most eagerly awaited, and in fact highlights the performative fecundity of the culture section. The article adopts a much more sophisticated theory of language, drawing on Kenneth Burke’s dramaturgical theory and focusing, in particular, on his “distinction between semantic meanings and poetic meanings,” a distinction that gets right at the heart of the promises and limitations of computational text analysis.

“The way states talk matters,” the article asserts.

…we ask how the state talks when it speaks of strategies for national security. And how the state imagines the world of entities and actions involved in national security. We expect that these types of speech activities both reflect and also enable the exercise of power that the United States continuously exerts when it acts upon the world stage. We also anticipate, but cannot yet demonstrate, that the texts conjure precise images of that world stage as populated by various agents (friends, enemies, partners, neighbors, competitors) and their networked relations to each other.

The dramatistic pentad the article draws from Burke (“Act, Scene, Agent, Agency, Purpose”) necessitates an approach to text analysis that takes the structure of sentences and paragraphs seriously, because it has to ascertain what a word is doing within its context in order to evaluate its role in the grammar of motives. The work the authors do to implement this pentad as an empirical investigation is really impressive, and it is by far the richest of the articles in terms of respecting a theory of language. But is it really “the way states talk”? Part of the genius of the article is its treatment of highly similar texts across time (National Security Strategy documents). But “the state” speaks in specific ways through these documents, and probably speaks differently in different forums and to different audiences.
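To give a sense of what taking sentence structure seriously involves computationally, here is a hedged sketch that uses a dependency parser to pull rough (agent, act, object) triples, in the spirit of Burke’s pentad. This is my own illustration, not the authors’ actual pipeline, and it assumes spaCy with its small English model installed.

```python
# A sketch of syntax-aware extraction in the spirit of Burke's pentad:
# who (agent) does what (act) to whom. My illustration, not the authors' method.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this model has been downloaded
doc = nlp("The state confronts its enemies. The state reassures its partners.")

for token in doc:
    if token.pos_ == "VERB":
        agents = [c.text for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
        objects = [c.text for c in token.children if c.dep_ in ("dobj", "obj")]
        print(f"act={token.lemma_!r} agent={agents} object={objects}")
```

Notice how far this is from a bag of words: the very same tokens mean different things depending on whether they sit in the agent slot or the object slot.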

My take is that topic modeling–and, more generally, computational analysis of text–is a really exciting development in the study of culture and politics, and I expect much more to come of it. But culture is not just language, language is not just text, and text is not just words. Since these methods actually analyze text (not language, and not culture), we need to attend to the processes by which culture becomes language and language becomes text. Not all cultural processes produce or record language equally, so we need to ask which processes selected some texts, and not others, for availability. And within texts, systems of discourse, grammar, syntax, and modification all create substantive problems for treating text as a bag of words. None of this requires abandoning these computational approaches, but it does highlight the theoretical baggage that accompanies methodological decisions.

Author: andrewperrin

University of North Carolina, Chapel Hill

5 thoughts on “topic modeling and a theory of language”

  1. Great post, Andrew. Thanks for the heads up! Language isn’t culture itself, sure. Neither are behaviors. Nor are machines actually technology. They are embodied artifacts of culture and ideas. Similarly, nobody has ever seen or touched a calorie, nor ever will, but calories are (as you correctly mention) useful scientific proxies for inferring answers to our questions.

    I think you are *absolutely* correct that topic modeling (and other inductive methods like naive Bayes classifiers) is crucial here. We’re not doing any better than old-school directed interviews (admittedly useful data in their own right) if we just establish keywords and social categories at the front end and then go digging to confirm our priors.

    Will check out the papers…


  2. Great summary, Andy, thanks. It seems to me that an important overlooked use of topic models is to analyze “small data”–for example, open-ended responses to survey prompts (where demographics are known, as well as the context of the prompt). I didn’t see anything in the special issue that looked at this sort of “small” data in particular, but there are dozens upon dozens of datasets out there with such open-ended responses alongside more traditional survey instruments, and this could be an excellent tool (especially in cases where two handfuls’ worth of trained coders are not available).

    Of course, this isn’t nearly as sexy as some of the uses of topic models in sociology more generally, but I just can’t get over how vague our understanding of the speech actors is in much of this work. Our contribution to computational text analysis is context, isn’t it?


    1. What Chris had in mind, I think, is a paper by Molly Roberts and her colleagues at Harvard. It’s forthcoming in the American Journal of Political Science. The final draft is available on Dustin Tingley’s website: http://scholar.harvard.edu/files/dtingley/files/topicmodelsopenendedexperiments.pdf. What’s nice about this extension of topic modeling is that it takes document attributes (or, in the case of surveys, the respondent’s attributes) into account when generating the topic-document cross-classification.
