doing things with bags-of-words

The following is a guest post by Juan Pablo Pardo-Guerra.

Topic models are fast emerging as a workhorse of computational social science. Since their introduction in the late 1990s as part of a larger family of classification and indexing algorithms, they have grown into one of the most common and convenient means for automated text analysis. Not too long ago, using topic models confronted scholars unfamiliar with programming with steep learning curves: even the simplest implementations required some familiarity with coding in addition to a good deal of patience. Today, by contrast, topic modeling is available as part of point-and-click desktop applications (e.g. Context) and can be installed in widely used statistical analysis packages (e.g. Stata). The relative ease, scalability, and intelligibility of topic models explain, perhaps, their quick adoption across sociology, political science, and the digital humanities. Indeed, to say that topic models are the OLS of text analysis wouldn’t be too much of an exaggeration.

Topic models are, however, not nearly as straightforward as OLS. Even the most seemingly standardized models (think, for instance, of the unsupervised topic models based on Blei et al.’s Latent Dirichlet Allocation) are rather daunting, complex statistical mechanisms. A detailed understanding of how these work requires much more training in probability theory than is often offered to social science students (“Dirichwhat? Dirichwho?”). Indeed, as a computer science student recently confided to the class, even the seemingly vanilla LDA topic models are “complicated”. How, then, can we teach these important techniques?

For the last four years, I have been teaching an introductory graduate class on computational social science. My objective is not to make coding wizards out of students (that skill mostly requires screen-time, practice, and looking up errors in StackOverflow) but rather to exercise their imaginations by familiarizing them with the challenges and possibilities of computational analyses. While the course heavily emphasizes discussions about data, it also surveys key techniques and their applications in recent publications in the field. Given their prominence, topic models are one of our techniques of focus.

In teaching topic models, I decided to try an experiment this year inspired by a similar exercise developed by Keith O’Hara: I used half of a class to get students to ‘embody’ the LDA algorithm over a very selective set of texts. The exercise was meant to get students to think about the type of work performed by topic models. All too frequently, I have had to deconstruct pre-existing notions that topic models are useful for identifying units of meaning in large collections of texts. The exercise I designed tackles this very issue and prompts students to think about how topic models can be used not as statistical arbiters of meaning and cultural structures but, rather, as elements of an inductive, iterative research process grounded in the exploration of textual data (here, I am drawing on Laura Nelson’s exceptional research). In other words, I wanted to get students to think about what might be happening within the black boxes of topic models, remaining critical of the outputs but also attuned to their use in sociological research.

The exercise has three specific learning outcomes. The first is to familiarize students with the type of problem solved by LDA topic models. Here, what matters is getting students to understand that topic models perform two conflicting tasks: given a series of documents and initial parameters (such as the number of topics), the model has to allocate the words in each document to a few topics while assigning words to each overall topic with high probability. This requires treating texts as ‘bags-of-words’ stripped of semantic context and seeing topics as collections of frequently co-occurring terms. The second expected learning outcome is to underscore the importance of expert knowledge in evaluating and interpreting the output of topic models. This involves getting students to think about how parametrizations are selected in relation to the parsimony of outputs, as made sense of through previous knowledge of the documents and their domain of origin. Finally, the third learning outcome is to get students to reflect on the conceptual integrity of data. Like any other technique, topic models are subject to garbage-in-garbage-out: an improperly selected corpus has important consequences for analysis.
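To make the ‘bags-of-words’ idea concrete, here is a minimal Python sketch. The function name `bag_of_words` and the crude tokenization rule are assumptions made for illustration; they are not part of the class exercise or of any particular topic modeling package.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count tokens in a text, discarding word order entirely.

    Tokenization here is deliberately crude (lowercase runs of letters
    and apostrophes); real pipelines make more careful choices.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


# Two sentences with different meanings collapse into the same bag.
first = bag_of_words("The Queen shouted at Alice")
second = bag_of_words("Alice shouted at the Queen")
print(first == second)
```

Because only counts survive, anything that depends on word order or syntax is invisible to a model that consumes this representation.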

The exercise proceeds as follows. Prior to class, I printed similar-length fragments from four classic books: Lewis Carroll’s Alice in Wonderland, Charles Dickens’ A Tale of Two Cities, Charles Darwin’s On the Origin of Species, and Karl Marx’s Capital. I then fastidiously cut out each of the 1495 words from the printouts and threw them in a bag. I thus arrived at class the following day with a literal bag-of-words.

On the day of our class, I presented the students with the bag of words. Faces of surprise ensued. I then asked them to use these words to identify five topics across the four texts. About two-thirds of the class had a strong computer science and engineering background, so this also provided an opportunity for students to collectively discuss the LDA algorithm. The group soon started to work on the problem and quickly realized that, to produce the topics, they needed information about the texts the bag-of-words came from, as a proxy for the types of tokens found in each document. I provided the information, which allowed them to produce a first set of word bundles.
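The students’ realization can be sketched in a few lines of Python: a single pooled bag loses the information about which document each token came from, whereas a document-term matrix keeps one row of counts per document. The function name `document_term_matrix` and the tiny word lists below are hypothetical, chosen only for illustration; they are not the fragments used in class.

```python
from collections import Counter


def document_term_matrix(docs):
    """Build per-document word counts over a shared, sorted vocabulary.

    `docs` maps a document name to its list of tokens.
    """
    vocab = sorted({word for tokens in docs.values() for word in tokens})
    counts = {name: Counter(tokens) for name, tokens in docs.items()}
    matrix = {name: [counts[name][word] for word in vocab] for name in docs}
    return vocab, matrix


# Hypothetical mini-fragments standing in for two of the books.
docs = {
    "carroll": ["alice", "queen", "said", "alice"],
    "darwin": ["species", "variation", "said"],
}
vocab, matrix = document_term_matrix(docs)

# The pooled bag records that "said" appears twice, but not where.
pooled = Counter(word for tokens in docs.values() for word in tokens)
print(vocab)
print(matrix)
print(pooled)
```

The rows of the matrix are what let a model (or a group of students) notice which words tend to appear together in the same document.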

Throughout the exercise, I pretended to be busy on my laptop but was in reality listening in on the students’ collaborative process. For example, as the allocation started, I overheard discussions about eliminating stop-words that, because of their uniform distribution across documents, were more likely noise than information. I then suggested changes to the group’s strategy, moving away from an initial impetus to classify everything into five topics in one go, to starting with five provisional topics and refining the topics by iteration (much the way the LDA algorithm works). More importantly, I overheard students resorting to prior knowledge of the texts to justify their classifications. “This ‘political’ must be Marx. Let’s put it in with ‘accumulation’”, they said. “What does phylum mean?”, they asked each other, then resolved that, because it was biology, it should go in the topic for Darwin. These moments were punctuated by my annoying reminders that I needed five, rather than four, topics, so restricting their work to reconstructing books was a faulty strategy.
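The ‘start with provisional topics and refine by iteration’ strategy can be sketched as a toy collapsed Gibbs sampler, one common way of fitting LDA. This is an assumption made for illustration: the post does not say which inference method produced the class output, and the function name `gibbs_lda` and the default values of `alpha` and `beta` are hypothetical.

```python
import random


def gibbs_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA; `docs` is a list of token lists."""
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})
    # Count tables: topics per document, words per topic, tokens per topic.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [dict() for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Provisional step: assign every token to a random topic.
    assignments = []
    for d, doc in enumerate(docs):
        zs = []
        for word in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word] = topic_word[z].get(word, 0) + 1
            topic_total[z] += 1
        assignments.append(zs)

    # Refinement: revisit each token and resample its topic given the rest.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][word] -= 1
                topic_total[z] -= 1
                # A topic is favored if it is already common in this document
                # and if this word is already common in that topic.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k].get(word, 0) + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][word] = topic_word[z].get(word, 0) + 1
                topic_total[z] += 1
    return doc_topic, topic_word


# Hypothetical two-document corpus, far smaller than the class bag.
toy_docs = [
    ["alice", "queen", "alice"],
    ["species", "variation", "species", "selection"],
]
doc_topic, topic_word = gibbs_lda(toy_docs, num_topics=2, iterations=20)
print(doc_topic)
```

The two factors in `weights` mirror the two conflicting tasks described earlier: the first pulls each document toward few topics, the second pulls each topic toward few words. Nothing in the update knows what any word means.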

Half an hour in, I stopped the activity by showing the group the five topics identified by LDA using the same texts. As the topic model was displayed on the screen, there was an audible sense of dismay (“WTF?!”). The topics didn’t look anything like what students had expected.

The surprising computational output anchored our subsequent discussions. Overall, the exercise allowed us to have a richer conversation about the use of topic models in sociological research.

Our analysis started with the question of what was so difficult about identifying five topics. The students’ gut reaction was that this was an arbitrary number, but on reflection they realized that the problem lay elsewhere. Students recognized that knowing about the texts made the classification more, rather than less, difficult. Rather than trying to allocate meaningless tokens (the words in the bag-of-words), they were attempting to produce meaningful collections of concepts that alluded to what they already knew. Some expected, for example, a topic where ‘Alice’, ‘Hearts’, and ‘Queen’ would figure prominently in the results, making the output of the LDA even more confusing. The fact that I asked them to identify five, rather than four, topics only increased the difficulty of the exercise because they could not draw on their background knowledge to compose five seemingly coherent themes across four very different documents. Nevertheless, this feature also proved useful for thinking about the power of unsupervised topic models: because they are agnostic to meaning, they can be used as a first step in getting a sense of how tokens (or words) are distributed across a collection of texts without assuming anything about their significance. Under certain conditions, this agnosticism can become a methodological instrument for analyzing large collections of texts.

Meaning matters to social scientists, though, and the second point of our conversation was that the significance of topics comes not from the model itself but from the analyst’s interpretation. For this, I asked students to make sense of the output from the computational model, to which they responded that it was largely meaningless. This provided an opportunity to talk about how meaning emerges in relation to topic models through iteration and prior knowledge. Instead of seeing these techniques as mechanisms for discerning independent structures of meaning, students got to understand them as useful devices that aid the work of an already-expert analyst when exploring large collections of texts. Specifically, they noted that some domain knowledge of the documents was necessary to understand the outputs of these models. This makes topic modeling closer to an iterative process of analysis in which outputs are evaluated by how well they resonate with the expectations of expert observers, who understand both the importance of the specific topics and the nature and structure of the documents they seek to describe. This grounded approach allows analysts to use topic models as somewhat opaque black boxes that contribute to an inductive process of empirical exploration and that, simultaneously, ask analysts to be distinctly aware of the assumptions of their research designs.

This led to a third and final discussion: as in other areas of computational social science, the integrity and coherence of the data is fundamental. When asked why the computational output seemed so meaningless, the group proceeded to identify a possible origin in how the documents were selected. While roughly contemporaneous, the four texts cover very different genres and contain distinctly different distributions of terms. Garbage in, garbage out. The group recognized that how a corpus is specified matters for the output, much in the same way that how a survey is implemented matters for running regressions. Unlike these four hastily selected texts, an ad hoc collection of documents may well serve as a meaningful corpus if it represents a defined, bounded process (an example being topic modeling Charles Dickens’ oeuvre as a means for exploring thematic changes in his work). Keeping an alignment between data, methods, and claims was another important takeaway.

Although simple, this exercise was extremely formative. On the one hand, it allowed students to develop some awareness of the issues that are central to the use of topic models in social scientific research. On the other, it gave them a space to think about the conditions under which topic models can aid the work of the social scientist. Importantly, the exercise stressed the importance of methodological reflexivity, care in research design, and attentiveness to data. Hopefully, it also gave them some confidence in using topic models in their research as a useful, convenient, and powerful technique for doing other things with words.

Juan Pablo Pardo-Guerra is Assistant Professor of Sociology at UCSD.

Author: Dan Hirschman

I am a sociologist interested in the use of numbers in organizations, markets, and policy.
