Words to numbers bleg

Dear Scatterplotters,

I’m cross-posting an inquiry from my advisee and collaborator Alex Hanna regarding text parsing to convert qualitative descriptions of events into numerical estimates.


I’ve done this myself in the past, but as a human coder: I read the whole story and used my best judgment to assign each event a qualitative size category. FYI, the codes I used for Madison protests in the 1990s were: Tiny (1-5), Very Small (6-15), Small (16-30), Modest (31-99), Medium (100-499), Larger (500-1500), Large (2000-10,000), Very Large (10,000+), and Huge (100,000+), which we then collapsed into Small (1-15), Medium (16-499), and Large (500+).
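For anyone who wants that coding scheme in machine-readable form, here is a minimal Python sketch. The function names are made up for illustration, and two boundary choices are assumptions rather than part of the original scheme: sizes from 1,501 to 1,999 are treated as Larger, and exactly 10,000 is treated as Very Large.

```python
from bisect import bisect_right

# Lower bounds of the fine-grained codes listed above.
# Assumptions (not in the original scheme): 1,501-1,999 falls into
# "Larger", and exactly 10,000 counts as "Very Large".
FINE_BOUNDS = [1, 6, 16, 31, 100, 500, 2000, 10_000, 100_000]
FINE_LABELS = ["Tiny", "Very Small", "Small", "Modest", "Medium",
               "Larger", "Large", "Very Large", "Huge"]


def fine_code(size: int) -> str:
    """Return the fine-grained size code for a numeric crowd estimate."""
    if size < 1:
        raise ValueError("size must be at least 1")
    # bisect_right counts how many lower bounds are <= size.
    return FINE_LABELS[bisect_right(FINE_BOUNDS, size) - 1]


def collapsed_code(size: int) -> str:
    """Return the collapsed code: Small (1-15), Medium (16-499), Large (500+)."""
    if size < 1:
        raise ValueError("size must be at least 1")
    if size <= 15:
        return "Small"
    if size <= 499:
        return "Medium"
    return "Large"
```

With this, `fine_code(250)` gives "Medium" and `collapsed_code(250)` also gives "Medium", while `fine_code(20)` gives "Small" but `collapsed_code(20)` gives "Medium", since the collapsed Medium band starts at 16.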

The problem here is to use automated text parsing of words like “several,” “scores,” “small,” “large,” etc. to categorize protests. I can find substantial literature on estimating crowd sizes while looking at a crowd, on the diversity of crowd size estimates from different sources (e.g. police and organizers), and on how news reporters decide which sources to use. But I can’t find anything about this problem of trying to get some rough event size estimate from text parsing. Can anyone point us to a source?
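To make the question concrete, here is a rough Python sketch of the kind of parser we have in mind. It is not an established method: the cue-word table, the list of crowd nouns, and the function names are all hypothetical placeholders, and the word-to-category assignments are guesses that would need to be validated against human coding. It prefers an explicit count when one appears next to a crowd noun and otherwise falls back on the first cue word it finds.

```python
import re
from typing import Optional

# Hypothetical cue words mapped to the collapsed categories
# (Small 1-15, Medium 16-499, Large 500+). Unvalidated guesses.
WORD_CUES = {
    "handful": "Small",
    "several": "Small",
    "small": "Small",
    "dozens": "Medium",
    "scores": "Medium",
    "large": "Large",
    "thousands": "Large",
}

# An explicit count directly before a crowd noun, e.g. "1,500 people".
# The noun list is an illustrative assumption.
COUNT_RE = re.compile(
    r"\b(\d[\d,]*)\s+(?:people|protesters|demonstrators|marchers)\b",
    re.IGNORECASE,
)


def collapse(size: int) -> str:
    """Map a numeric estimate to the collapsed size category."""
    if size <= 15:
        return "Small"
    if size <= 499:
        return "Medium"
    return "Large"


def estimate_size_category(text: str) -> Optional[str]:
    """Return a rough size category for an event description, or None."""
    match = COUNT_RE.search(text)
    if match:
        size = int(match.group(1).replace(",", ""))
        if size >= 1:
            return collapse(size)
    # Fall back on the first cue word that appears in the text.
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in WORD_CUES:
            return WORD_CUES[word]
    return None
```

For example, `estimate_size_category("About 300 protesters marched downtown")` returns "Medium" and `estimate_size_category("Several students gathered outside")` returns "Small". Taking the first cue word is crude: a story that mentions "a small group" joining "a large march" would be coded Small, which is exactly the sort of judgment a human coder handles by reading the whole story.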