Words to numbers bleg

Dear Scatterplotters,

I’m cross-posting an inquiry from my advisee and collaborator Alex Hanna regarding text parsing to convert qualitative descriptions of events into numerical estimates.

http://badhessian.org/2013/06/numerical-approximation-words-to-numbers/

I’ve done this myself in the past, but as a human coder using the text descriptions to do qualitative categorization of group size based on my best judgment reading the whole story. FYI, the codes I used for Madison protests in the 1990s were: Tiny (1-5), Very Small (6-15), Small (16-30), Modest (31-99), Medium (100-499), Larger (500-1500), Large (2000-10,000), Very Large (10,000+), and Huge (100,000+), which we then collapsed into Small (1-15), Medium (16-499), and Large (500+).

The problem here is to use automated text parsing of words like “several”, “scores,” “small,” “large,” etc. to categorize protests. I can find substantial literature on the problem of estimating crowd sizes while looking at a crowd and about the diversity of crowd size estimates from different sources (e.g. police and organizers) and about how news reporters decide which sources to use. But I can’t find anything about this problem of trying to get some rough event size estimate from text parsing.  Can anyone point us to a source?
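We know roughly how to build the crude version ourselves if we have to start from scratch; for concreteness, here is a minimal Python sketch of the kind of lookup-table approach we would otherwise be starting from (the word-to-bin mapping below is purely illustrative, not a validated coding scheme):

```python
# Minimal sketch of the crude lookup-table approach: map quantifier words
# found in a protest description onto the collapsed Small (1-15) /
# Medium (16-499) / Large (500+) bins described above.
# The mapping itself is purely illustrative.
QUANTIFIER_BINS = {
    "handful": "Small",
    "several": "Small",
    "dozens": "Medium",
    "scores": "Medium",
    "hundreds": "Medium",
    "thousands": "Large",
}

def crude_size_bin(description):
    """Return the first bin whose trigger word appears in the text."""
    text = description.lower()
    for word, size_bin in QUANTIFIER_BINS.items():
        if word in text:
            return size_bin
    return "Unknown"

print(crude_size_bin("Scores of demonstrators gathered at the capitol."))  # Medium
```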

 

10 Comments

  1. Posted June 22, 2013 at 1:51 pm | Permalink

    I’m not ignoring you, I just don’t have a good answer! It’s a neat question, but it seems like something your advisee should play around with to see how it works. One confounder will be that the words you’re talking about don’t just signify number; they also, to some extent, signify importance (e.g., “a small band of ex-hippies…”), so the same word probably means different numbers in different contexts.

    Good luck!


    • Posted June 22, 2013 at 2:31 pm | Permalink

      Thanks, Andrew. As I said, I’ve done this myself as a human coder. We have some idea about how to do it if we are going to have to start from scratch. I just want to make sure I’m not missing a literature we could build on regarding automating this. My quick lit review did not turn anything up.


  2. Posted June 22, 2013 at 7:28 pm | Permalink

    Paging Neal Caren! Neal has some neat work using the NLTK package within Python to identify named entities within large chunks of texts. He may therefore have some good ideas about how to approach the question.

    Like Andy, I don’t know of a “best practice” for this type of coding job, but I suspect you might find some GREP-style pattern matching useful. As you may already know, grep lets you identify sentences which contain certain groups of words such as “protest” and “number” and “people.” GREP is actually rather old, and it is even available in Atlas.TI. If you are working with very large samples of text, however, you may find that data entry and analysis are exceedingly slow in Atlas.

    If your graduate student has significant programming skills, it is far easier to use GREP-style pattern matching in Python or R; the latter, however, has some strange rules for working with text that can take some getting used to (though there are ways of making R’s regular expressions behave like Perl’s, which makes things more straightforward).
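    Purely as a sketch of the grep-in-Python idea (the keyword lists below are placeholders, not a vetted dictionary), something like this would pull out candidate sentences before any real coding happens:

    ```python
    import re

    # Grep-style filter: keep only sentences that mention both a protest
    # keyword and a people/crowd keyword. The word lists are placeholders.
    PROTEST_WORDS = r"\b(protest\w*|demonstrat\w*|march\w*|rally|rallies)\b"
    CROWD_WORDS = r"\b(people|crowd\w*|protesters|demonstrators)\b"

    def candidate_sentences(text):
        # Naive sentence split; a real pipeline would use NLTK's sentence tokenizer.
        sentences = re.split(r"(?<=[.!?])\s+", text)
        return [s for s in sentences
                if re.search(PROTEST_WORDS, s, re.I) and re.search(CROWD_WORDS, s, re.I)]

    sample = ("Several hundred people marched downtown on Saturday. "
              "The mayor spoke later that evening.")
    print(candidate_sentences(sample))  # keeps only the first sentence
    ```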

    An alternative might be to train a “topic model” to recognize language that describes protest events. Depending upon what types of texts you are working with, this might be a good route, since it would let you narrow down texts into “protest” and “non-protest” categories; this would enable you to subsequently code the former texts qualitatively (depending upon the size of the sample). If this strategy is not viable, you might try topic modeling on paragraphs instead of documents as a means of further focusing a qualitative microscope.
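    For what it’s worth, a bare-bones version of that filtering step might look like the sketch below (scikit-learn is just one assumed choice of library, and the toy corpus is obviously fake); you would inspect which topic picks up the protest vocabulary and route high-weight paragraphs to qualitative coding:

    ```python
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    # Toy paragraphs standing in for chunks of news text.
    paragraphs = [
        "Hundreds of protesters marched on the capitol demanding reform.",
        "Police arrested dozens of demonstrators at the rally downtown.",
        "The city council approved the annual budget on Tuesday.",
        "Quarterly earnings rose for the region's largest employer.",
    ]

    vectorizer = CountVectorizer(stop_words="english")
    X = vectorizer.fit_transform(paragraphs)

    # Fit a two-topic model and look at each paragraph's topic weights;
    # the topic that loads on protest words flags texts for hand coding.
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
    print(lda.transform(X).round(2))
    ```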

    Finally, there is Roberto Franzosi’s interesting work on quantitative narrative analysis. One of the first applications of this technique, as I recall, was to identify cases of strikes within a large corpus of Italian media documents. Yet this technique may not be very useful for identifying the number of people involved in protests (as opposed to broader cases of collective protest). Once again, however, Franzosi’s method might be fruitfully combined with a qualitative coding strategy.


  3. Posted June 24, 2013 at 6:43 am | Permalink

    Unless the issue is addressed in the AP Stylebook, I would assume that there are no special journalistic meanings to the words. An informed guess is probably all there is.

    Ideally you would have multiple accounts of each event so you could impute a number for “few” or whatever. For example, you could estimate that, of the 50 events described as having “few” people arrested, the median number of people arrested according to other accounts was 3.
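    In code form, that imputation could be as simple as the following sketch (the event data here are invented):

    ```python
    from collections import defaultdict
    from statistics import median

    # Invented example: each event has one account using a vague quantifier
    # and other accounts reporting explicit arrest counts.
    events = [
        {"quantifier": "few",  "other_counts": [2, 3, 4]},
        {"quantifier": "few",  "other_counts": [3, 5]},
        {"quantifier": "many", "other_counts": [40, 55, 60]},
    ]

    counts_by_word = defaultdict(list)
    for event in events:
        counts_by_word[event["quantifier"]].extend(event["other_counts"])

    # Impute each vague word as the median of the explicit counts it co-occurs with.
    imputed = {word: median(vals) for word, vals in counts_by_word.items()}
    print(imputed)  # {'few': 3, 'many': 55}
    ```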

    Complicating the matter is the issue that Andy raised. “Many” people getting arrested is likely to be a larger number than “many” cities experiencing rioting.


  4. Posted June 24, 2013 at 10:50 am | Permalink

    Are the texts news articles? Have protest/non-protest texts already been identified?

    To the point about “many people” and “many cities”, I think one could GREP around this issue (by searching for adjacent words) within three- or four-word strings. The first step, I would think, would be to parse the text into three- or four-word strings and then use GREP to identify cases where “many” and “people” co-occur (or perhaps “people” and “protest”), and then extract all numbers from such strings. You could even account for the possibility that certain phrases (or sentences) contain both “many people” and “many cities” with GREP, and code each of these cases by hand.
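    Here is a quick sketch of that windowing idea in Python (the window size and word lists are arbitrary choices, just for illustration):

    ```python
    import re

    def window_matches(text, size=4):
        """Slide a fixed-size token window over the text; flag windows where
        'people' co-occurs with 'many' or a digit string, and pull out numbers."""
        tokens = re.findall(r"\w+", text.lower())
        hits = []
        for i in range(len(tokens) - size + 1):
            window = tokens[i:i + size]
            if "people" in window and ("many" in window or any(t.isdigit() for t in window)):
                hits.append((" ".join(window), [int(t) for t in window if t.isdigit()]))
        return hits

    print(window_matches("Police said many people marched, though organizers counted 5000 people."))
    ```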

    There is a function within R’s tm package to identify numbers within text that might be useful, assuming you can identify the relevant phrases.


    • Posted June 24, 2013 at 2:01 pm | Permalink

      Hi Christopher,

      The larger project is a rewrite of TABARI (http://eventdata.psu.edu/software.dir/tabari.html). The project effectively involves doing grammatical parsing of text, matching proper parts against actor and verb dictionaries, and outputting some kind of event code. It’ll be used to generate GDELT-type (http://gdelt.utdallas.edu/) datasets.

      Right now the sentences are going through Python’s NLTK tagger. There’s some chunking and identification of noun and verb phrases. Given the way that it handles this, there’s not a lot of room for more machine-learning-type solutions (e.g., identifying co-occurrence or doing topic modeling). I’m partial to the “binning” solution that Jay Ulfelder has suggested: http://badhessian.org/2013/06/numerical-approximation-words-to-numbers/#comment-937832928
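      To make the binning idea concrete alongside the NLTK pipeline, here’s a rough sketch; the grammar, the crowd terms, and the bin boundaries are placeholders, not what PETRARCH actually does:

      ```python
      import nltk  # needs the 'punkt' and 'averaged_perceptron_tagger' data packages

      # Rough sketch: POS-tag the sentence, chunk noun phrases with a simple
      # grammar, and if any chunk contains a crowd term, bin the cardinal
      # numbers and vague quantifiers found in the sentence.
      GRAMMAR = "NP: {<DT>?<JJ>*<NN.*>+}"
      CROWD_TERMS = {"people", "protesters", "demonstrators", "marchers"}
      QUANTIFIER_BINS = {"dozens": "10-99", "scores": "20-99",
                         "hundreds": "100-999", "thousands": "1000+"}  # assumed bins

      def bin_crowd_mentions(sentence):
          tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
          tree = nltk.RegexpParser(GRAMMAR).parse(tagged)
          crowd_np = any(word.lower() in CROWD_TERMS
                         for subtree in tree.subtrees(lambda t: t.label() == "NP")
                         for word, _ in subtree.leaves())
          if not crowd_np:
              return []
          results = []
          for word, tag in tagged:
              if tag == "CD":
                  results.append(("exact", word))
              elif word.lower() in QUANTIFIER_BINS:
                  results.append(("binned", QUANTIFIER_BINS[word.lower()]))
          return results

      print(bin_crowd_mentions("Hundreds of protesters gathered near the parliament."))
      # e.g. [('binned', '100-999')]
      ```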


      • Posted June 24, 2013 at 5:26 pm | Permalink

        Hi Alex:

        First, cool project. I’ve been hoping sociologists would tackle GDELT; it sounds like very promising data.

        If you are using NLTK to identify noun and verb phrases, you should really check out Roberto Franzosi’s new work. As you may know, his book “From Words to Numbers” is based upon a linguistic theory of actor/noun/verb triads. While the old QNA stuff is probably not going to be very helpful for measuring event size, he is currently collaborating with computer scientists at Georgia Tech to bring it up to speed with recent advances in machine learning. Roberto’s been at this for at least five years, and my guess is that he may be searching for beta-testers for his new tool (though I don’t know whether he is that far along).

        To my mind, the binning solution sounds plausible provided that you are able to identify keywords that are associated with protest events consistently across texts. I would still think you will need to do some GREPing to ensure that you can get around the “many” issue Andy raises above.

        I also seem to recall some Python scripts that were circulating in conjunction with GDELT and perhaps the old REUTERS machine-learning corpus? Perhaps that might be worth a look if you haven’t already. It sounds like you probably already know about it, but there has been quite a lot of other work done with the REUTERS dataset in the NLP community; they may be a better resource than Scatterplotters on the issue of measuring numbers (I believe it’s a big question in the literature on “boundary detection”).

        How many texts are you dealing with? Are you working with the entire GDELT dataset or just a subset? If the latter, I think your best bet would be a hybrid quant/qual strategy. Even if you are looking at the entire dataset, I suspect you’ll have to do a lot of supervised training. The hybrid strategy may be more convincing to sociologists, and it may help advance machine learning to boot.

        Best,
        Chris


      • Posted June 25, 2013 at 7:33 am | Permalink

        Christopher,

        Thanks for letting me know about Franzosi’s work. I’ll have to check it out and see if we can use a different parsing approach.

        The issue about how the different words get used is a good point. With the larger project, though, we hope to capture more than just protest events, so the net is going to have to be cast a bit wider than protest events alone. I can imagine a lot of variability in how “many” is used in each instance.

        The hope is that PETRARCH will be used to regenerate the GDELT dataset, so 200 million+ events probably doesn’t lend itself to a qual/quant strategy. But that’s with the caveat that the CAMEO coding scheme currently used with GDELT reflects a lot of manual coding rules developed by many of Schrodt’s students. So in a sense there’s that qualitative sanity check.


    • Posted June 25, 2013 at 8:10 am | Permalink

      As much as I like grep (and as a former unix system administrator, that’s a lot!), I don’t think there’s a technical answer to this theoretical question. Many people, many cities, many protests, many cats, many days… “many” can mean lots of different numbers. I think that demands theoretical, not technical, work. You can just decide ahead of time that “many people” means x, “many cities” means y, etc., or you can be more inductive. And grep can help you once you’ve made the theoretical decisions.
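      To put that in code terms (every value below is made up, purely to show the shape of the decision), the theoretical choice just becomes an explicit table keyed on the quantifier and the thing being counted:

      ```python
      # The theoretical decision made explicit: a lookup keyed on the quantifier
      # AND the noun it modifies. All of the values here are made up.
      MANY_MEANS = {
          ("many", "people"): 500,
          ("many", "cities"): 12,
          ("many", "protests"): 25,
      }

      def interpret(quantifier, noun, table=MANY_MEANS):
          return table.get((quantifier, noun))  # None if no decision has been made yet

      print(interpret("many", "people"))  # 500
      print(interpret("many", "cats"))    # None: grep can't answer this for you
      ```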


      • Posted June 25, 2013 at 1:04 pm | Permalink

        Fair point. The feature extraction procedure may not be separable from the event coding. I’m wondering if there are other ways to make this determination, though; the verbs or nouns that describe it, perhaps?

        I, too, love me some grep. grep + awk/sed + UNIX pipes are severely underrated and underutilized.

