Laura Nelson has an excellent discussion of topic modeling on badhessian, which in part takes me to task for my comments on the Poetics issue on topic modeling. Unfortunately the Disqus system that handles comments there doesn’t like me, and so has eaten my comments twice. So I’m posting them here, and perhaps someone smarter than I am can make them into a bona fide comment on the site.
Thank you for a very useful post. I agree with you, of course, that the increased interest in automated text analysis in sociology is a welcome development. I do think there’s been interest for a good while (when I was a grad student at Berkeley I put together some Perl tools for rudimentary tasks; to my knowledge nobody except me has ever used them!), but it’s happily accelerated recently.
I don’t have a strong disagreement with your point that computational linguists and researchers in similar fields have good solutions for a variety of technical tasks related to text analysis, and that these solutions are well tested and well understood.
However, I do think you bury a major theoretical claim by equating “extract information from text” with “discover what a text is about.” There is a lot more information in (many) texts than just what the text is about: there’s the use (or not) of irony, humor, varying argumentative styles and resources, character development, reference (negatively or positively) to cultural icons outside the text, strategic behavior and strategic failure by speakers, vocabulary used and avoided, and likely much more.
So, if the theoretical task is to evaluate what documents are about, I am happy to accept your reading of the “reams of research to back up their methodological claims.” But the list of tasks you provide (translation, POS tagging, author identification, and topic identification) is far short of the tasks a reasonable cultural sociologist might want to accomplish using text(s) as data.
So, “so what if it is a model of speech acts?” So what indeed! But it is not appropriate to claim that the model is a model of language when in fact it is a model of speech acts. Again, my point is not that these models are illegitimate but that they are partial in theoretically-relevant ways.
That leads to the final caution. Technologies constrain and enable actions by their users. The presence and popularity of a new tool has the capacity to encourage that tool’s use even when it’s not appropriate, and more problematically, to encourage scholars to pursue questions because they are answerable using the tool. (That’s certainly what’s happened with regression over the years.) The defense against this is consistent theoretical interrogation. So I hope we do not “move beyond” the model-of-language critique, but rather that we keep using these tools and developing others, with constant theoretical consideration throughout, in order to better understand society through the texts that refract it.