data cleaning

I’ve been swallowed up lately in trying to finish a long-overdue project. Not being particularly sane in some ways, I keep bringing new data into the project. This has led to hours spent cleaning up messes in data sets that have been the basis for other people’s publications, as well as messes in public data sets like the Statistical Abstract.
Today’s fun was realizing that the published Statistical Abstract tables for counts of Republicans and Democrats in state legislatures vary over the years in when the count is made: in some years the count is from March or mid-year (i.e. before the general elections), and in other years it is from December (i.e. just after the general elections). This matters a lot if you are trying to match up political control of a state with other variables. The timing varies from year to year within exactly the same on-line table, and the notes in the downloadable spreadsheet are wrong. You can correct the mistake only by checking the older PDF versions of the tables and reading their footnotes.
Another set of errors was in a data file posted on line to support a publication: some entries for the governor’s party were just flat wrong. They must have been entered by an under-motivated student employee from paper sources. These I could correct by checking alternate on-line sources state by state.
And data files of prison numbers have what are clear errors if you do enough checking. Some are flagrant, including population numbers with additional or missing digits that make the population wrong by a factor of 10, and counts for “other” race that just happen to equal the total for all known races. Others are subtle and can be found only by merging data across years and looking at the time plots, such as the clear case in which one state one year reversed its numbers for Black and Native American prison inmates.
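For what it's worth, the flagrant prison-count errors described above are the kind that a couple of crude automated checks can flag before anyone looks at a plot. Here is a minimal sketch in Python, not the procedure actually used for this project. The function names, the five-fold threshold, the race category keys, and all the numbers are invented for illustration.

```python
def flag_magnitude_jumps(series, ratio=5.0):
    """Return (previous_year, year) pairs where the value changes by at
    least `ratio`-fold in either direction. `series` maps year -> count.
    The threshold of 5 is an arbitrary assumption, not a standard."""
    flagged = []
    years = sorted(series)
    for prev, cur in zip(years, years[1:]):
        a, b = series[prev], series[cur]
        if a <= 0 or b <= 0:
            # A zero or negative count is suspicious on its own.
            flagged.append((prev, cur))
        elif b / a >= ratio or a / b >= ratio:
            flagged.append((prev, cur))
    return flagged


def other_matches_known_total(counts, other_key="other"):
    """True when the 'other' count exactly equals the sum of all the
    remaining categories, which suggests a total was typed into the
    wrong column."""
    known = sum(v for k, v in counts.items() if k != other_key)
    return known > 0 and counts[other_key] == known


# Made-up numbers: 2003 looks like it lost a digit.
population = {2001: 21500, 2002: 22100, 2003: 2260, 2004: 23000}
print(flag_magnitude_jumps(population))

# Made-up row where "other" equals white + black + native.
row = {"white": 500, "black": 400, "native": 100, "other": 1000}
print(other_matches_known_total(row))
```

A year that shows up in two consecutive flagged pairs (2003 here) is the likely culprit. Checks like these only catch the flagrant cases. A swap between two categories of similar size would pass both, which is why the time plots still matter.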
I know I am unusually obsessive about data — and introduce plenty of my own mistakes that I later have to find and correct. But this whole process of two steps forward and one step back is driving me crazy.  It also makes me feel like reminding folks how important it is to check and clean data. There are quite a few cases in which people have published results that turned out to be driven by the fact that “no response” was coded 999 and the analyst just threw the whole thing into a regression equation without ever looking at the frequencies.
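To make the 999 problem concrete, here is a small sketch in Python of what "looking at the frequencies" buys you. The list of candidate missing-value codes and the ages are made up for illustration. A real data set's codebook is the only reliable guide to which codes mean "no response".

```python
from collections import Counter

# Assumed examples of codes that surveys sometimes use for missing data.
CANDIDATE_CODES = (99, 999, 9999, -9, -99)


def suspicious_codes(values, candidates=CANDIDATE_CODES):
    """Count how often each candidate missing-value code appears."""
    counts = Counter(values)
    return {code: counts[code] for code in candidates if counts[code] > 0}


def mean(values):
    """Plain arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


# Invented ages in which "no response" was coded 999.
ages = [34, 51, 999, 28, 999, 45]

print(suspicious_codes(ages))      # shows that 999 appears twice
print(round(mean(ages), 2))        # mean with the 999s left in

cleaned = [a for a in ages if a != 999]
print(mean(cleaned))               # mean after dropping the 999s
```

With the two 999s left in, the average "age" comes out near 359 instead of 39.5. That is the sort of distortion that then drives a regression coefficient.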

Author: olderwoman

I'm a sociology professor but not only a sociology professor. I keep my name out of this blog because I don't want my name associated with it in a Google search. Although I never write anything in a public forum like a blog that I'd be ashamed to have associated with my name (and you shouldn't either), it is illegal for me to use my position as a public employee to advance my religious or political views, and the pseudonym helps to preserve the distinction between my public and private identities. The pseudonym also helps to protect the people I may write about in describing public or semi-public events I've been involved with. You can read about my academic work on my academic blog. --Pam Oliver

3 thoughts on “data cleaning”

  1. As you know, I don’t deal with data files like this that often. But I do advise students. And one of the interesting things I’ve found is that they often say to me, “So, I’m going to put this into a regression.” My response is always, “Well, have you looked at descriptive statistics first?” The answer is usually “no.” At which point I say, “Let’s do that first. Send me the info, and we’ll talk about those before we charge forward.” I’ve noticed two things about this: (1) descriptive statistics are often very, very interesting in their own right, and (2) oh my God, can you save headaches later by just looking through stuff like this. Results that are just “too interesting” are often wrong and require fixing the data files. So, yeah, I hear you on that last sentence.


  2. OW – I am totally with you. I’m digging through data right now using a well-known and well-documented dataset. Unfortunately, all of the variables are in their most basic format (i.e. counts instead of percentages), which requires making a lot of variables and, accordingly, a lot of mistakes. Not to mention that I can’t figure out how authors who have published using this dataset created one of their variables from the variables available in the dataset. It’s driving me nuts!


  3. I hear you. I am similarly careful, and I don’t know that it served me particularly well in grad school. As you mention, people publish crap like that because too often nobody really cares if it’s right if it sounds like a good story. So, people who can make up a good story and seem to get things done quickly are seen as more productive and rewarded accordingly, while careful people are seen as unproductive. I wish more faculty were like you and shakha…

