help wanted

I’m interested in identifying different adjectives that appeared in front of a particular not-uncommon noun in sociology. Think of, say, adjectives that precede “theory.” This search by no means needs to be exhaustive of anything, but I want it to be broad. So I’m trying to think if there is an efficient way to do this.

Having to page to each occurrence of the word in JSTOR takes too long. Sage’s journals online, in contrast, provide a suitably good way of doing this for ASR, insofar as the search for a term provides a snippet of text around the term, so you can see whether there is an adjective and what it is. But Sage’s ASR selection only appears to go back to 2004, and I was hoping to go back farther than that. Any ideas?

Author: jeremy

I am the Ethel and John Lindgren Professor of Sociology and a Faculty Fellow in the Institute for Policy Research at Northwestern University.

15 thoughts on “help wanted”

      1. Hmm, my problem is that I really need the adjective. If you just get words that occur near the word, it appears to generate a lot of noise…

        I am clearly out of my league here. I need to figure out a way to upskill or outsource.


  1. Have you looked into grep expressions? These allow you to identify every instance of a character or word that precedes another word. ATLAS.ti uses this for its auto-coding function, and it is also available in R (but unfortunately not Stata, to the best of my knowledge).

    Although I’m a bit rusty, I believe the grep pattern you might be interested in is something like “[[:alpha:]]+ theory”


    1. Stata does have regular expression functions, but you wouldn’t want to use Stata for this because of the 255-character limit. Better to use a text-processing language/command like grep (the Unix command), perl, or python to create a file that you could then import into a stats package.

      Of course this assumes that you have the entire database locally, which is not realistic for jstor.
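As a sketch of the grep/perl/python approach: a short Python example that pulls the word immediately preceding “theory” out of a chunk of text (the sample sentence here is invented):

```python
import re

# invented sample text standing in for an article or abstract
text = "Grounded theory differs from rational choice theory and from game theory."

# capture the word immediately before "theory"
pattern = re.compile(r"(\w+)\s+theory", re.IGNORECASE)
print(pattern.findall(text))  # ['Grounded', 'choice', 'game']
```

Note that the captured words are just whatever precedes “theory,” not necessarily adjectives, so some part-of-speech filtering (or hand-checking) would still be needed — that is the noise problem mentioned above.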


  2. Like Chris says, you could use some kind of regular expression, but I guess the question is whether the database you’re searching allows for queries of that sort.


  3. If you are looking for strictly academic context, JSTOR is probably your best bet. However, if your word is significantly sociology-specific, you could make a more general search much more easily.

    About $100 will give you word frequencies, along with frequencies of collocates, from this website:

    To keep things more academic, you can look at the breakdown for academic usage and so avoid spoken word, fiction, etc. It’s a fascinating site if you have some time and a bit of funding.


  4. It turns out it’s possible to make some progress using the “stable” URLs that JSTOR has.

    First, going to the main JSTOR page for ASR, you can get a list of all the issues, with a link to each issue. I had to click the plus sign to expand each decade and it loaded the HTML dynamically. This means you can’t just save the HTML page. It’s a pain, but copy and paste the whole HTML document into Word and then save it as an HTML file. That file will have links to all the tables of contents, just grep it for /\/stable\/[0123456789]+/

    With the resulting list of links to issues (432 of them), fetch each one using the Unix program wget. Make a list of all the URLs that match the regexp /\/stable\/info\/[0123456789]+/, which is the URL pointing to the “Summary” page.
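The two pattern-matching steps above could also be done in Python rather than with grep (the HTML snippet below is invented; the real pages would be the saved JSTOR files):

```python
import re

# invented HTML standing in for a saved JSTOR table-of-contents page
html = (
    '<a href="/stable/2657288">Article</a> '
    '<a href="/stable/info/2657288">Summary</a>'
)

# links to issues/articles: /stable/ followed by digits
issue_links = re.findall(r"/stable/[0-9]+", html)
# links to "Summary" pages: /stable/info/ followed by digits
info_links = re.findall(r"/stable/info/[0-9]+", html)

print(issue_links)  # ['/stable/2657288']
print(info_links)   # ['/stable/info/2657288']
```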

    For 432 issues, there were URLs to 13758 articles. Many have the abstract in plain text. The abstract isn’t as good as the full text of the article, and while it wouldn’t be much harder to get the PDFs, I’m not sure how to get the full text out of them. Also, batch-downloading PDFs is likely to upset JSTOR, as well as ASR, in a way that fetching abstracts alone might not. Anyway, only about 2500 articles’ abstracts were available.

    I’ve posted the archive as a single text file (tab-separated volume, issue, abstract) here:
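A tab-separated file in that shape can be read with Python’s csv module; this sketch writes and reads a tiny stand-in file (the file name and row are made up):

```python
import csv

# tiny stand-in for the posted archive: tab-separated volume, issue, abstract
with open("asr_abstracts.tsv", "w", newline="") as f:
    f.write("65\t1\tThis abstract mentions grounded theory.\n")

with open("asr_abstracts.tsv", newline="") as f:
    for volume, issue, abstract in csv.reader(f, delimiter="\t"):
        print(volume, issue, abstract)
```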


    1. Thanks for this corpus. Would it be that hard to generalize this to the sociology journals available in JSTOR more broadly, to expand the number of abstracts?


      1. Wget has a fairly simple interface; you’ll find the manual here:

        Since all you want is the exact page listed as “” (as compared to snowball sampling), it should be pretty easy. The main thing you want to read about is the “--input-file=” option, which lets wget read addresses from a text file (which you can prepare following Scott’s instructions).

        That’ll be enough to download the info pages to your hard drive. At that point you can process them locally.

        Note that wget is not a standard part of Mac OS X (you can install it with Fink) but is a standard part of most other types of Unix and if it’s not already installed in your server account you can ask the IT people to install it.
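The text file that wget consumes via --input-file is just one URL per line; a minimal Python sketch of preparing it (the stable IDs below are made up, not real articles):

```python
# write the collected "Summary" page URLs, one per line, for wget
# (these stable IDs are invented placeholders)
urls = [
    "https://www.jstor.org/stable/info/1111111",
    "https://www.jstor.org/stable/info/2222222",
]
with open("urls.txt", "w") as f:
    f.write("\n".join(urls) + "\n")

# then, from the shell:
#   wget --input-file=urls.txt
```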


  5. Nice to know that Stata supports regular expressions. I have also written a (fairly clunky) screen-scraping program in Ruby which could probably be easily adapted to mine all the instances of “identity” on a web page. I did try this on LEXIS-NEXIS a few years ago but was discouraged by their stubborn API. Also, the program would not handle PDFs as is. At present it simply calls Google searches and yields hits from Google News. In any event, I’m happy to share it if it sounds promising.


  6. I think many of the suggestions for using regular expressions make sense (my personal favorite is perl, but only because I know it best). The bigger problem is, as @6.scottgolder points out, much of the material isn’t available in a text format.

