I’ve been swallowed up lately in trying to finish a long-overdue project. Not being particularly sane in some ways, I keep bringing new data into the project. This has led to hours spent cleaning up messes in data sets that have been the basis for other people’s publications, as well as messes in public data sets like the Statistical Abstract.

Today’s fun was realizing that the published Statistical Abstract tables for counts of Republicans and Democrats in state legislatures vary over the years in when the count is made: some years the count is taken in March or mid-year (i.e., before the general elections), other years in December (i.e., just after the general elections). This matters a lot if you are trying to match up political control of a state with other variables. Exactly the same online table varies in this respect from year to year, and the notes in the downloadable spreadsheet are wrong. You can catch the mistake only by checking the older PDF versions of the tables and reading their footnotes.

Another set of errors was in a data file posted online to support a publication: some entries for the governor’s party were just flat wrong; they must have been entered from paper sources by an under-motivated student employee. These I could correct by checking alternate online sources state by state.

And data files of prison numbers contain what are clearly errors if you do enough checking. Some are flagrant, including population figures with an extra or missing digit that throws the population off by a factor of 10, and counts for “other” race that just happen to equal the total for all known races. Others are subtle and can be found only by merging data across years and looking at the time plots, such as the clear case in which one state, one year, reversed its numbers for Black and Native American prison inmates.
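Those prison-number checks can be turned into simple automated screens. Here is a minimal sketch; the records, field names, and thresholds below are all made up for illustration, not taken from any real data file.

```python
# Illustrative sanity checks for count data merged across years.
# Records, field names, and thresholds are made-up assumptions.

records = [
    {"state": "XX", "year": 2000, "white": 5000, "black": 3000,
     "native": 200, "other": 100},
    # 2001 looks like the black/native columns were reversed:
    {"state": "XX", "year": 2001, "white": 5100, "black": 210,
     "native": 3050, "other": 100},
    # "other" equals the sum of the known races: probably a misplaced total.
    {"state": "YY", "year": 2000, "white": 4000, "black": 2000,
     "native": 100, "other": 6100},
]

KNOWN_RACES = ("white", "black", "native")

def other_equals_known_total(rec):
    """Flag rows where the 'other' count exactly equals the sum of the
    known-race counts, a sign that a total landed in the wrong column."""
    return rec["other"] == sum(rec[k] for k in KNOWN_RACES)

def off_by_factor_of_ten(prev_val, cur_val):
    """Flag year-over-year changes consistent with an extra or missing digit."""
    lo, hi = sorted((prev_val, cur_val))
    return lo > 0 and 8 <= hi / lo <= 12

def looks_swapped(prev, cur, a="black", b="native", tol=0.2):
    """Flag a pair of columns that seem to have traded places between years:
    each value is close to the *other* column's prior value and far from
    its own."""
    def close(x, y):
        return abs(x - y) <= tol * max(x, y, 1)
    return (close(cur[a], prev[b]) and close(cur[b], prev[a])
            and not close(cur[a], prev[a]))
```

Run against the toy records, `looks_swapped` flags the 2000-to-2001 pair for state XX and `other_equals_known_total` flags the YY row. None of this replaces eyeballing the time plots, but it narrows down where to look.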
I know I am unusually obsessive about data — and introduce plenty of my own mistakes that I later have to find and correct. But this whole process of two steps forward and one step back is driving me crazy. It also makes me feel like reminding folks how important it is to check and clean data. There are quite a few cases in which people have published results that turned out to be driven by the fact that “no response” was coded 999 and the analyst just threw the whole thing into a regression equation without ever looking at the frequencies.
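The 999 problem is exactly the kind of thing a one-minute frequency check catches before the regression ever runs. A minimal sketch, with a made-up survey variable and sentinel code:

```python
from collections import Counter

# Made-up responses on a 1-5 scale; 999 is the "no response" code.
responses = [3, 4, 2, 999, 5, 3, 999, 4, 1, 2]

# Step 1: look at the frequencies before anything goes into a regression.
freq = Counter(responses)
print(freq.most_common())  # 999 showing up at all is the tip-off

# Step 2: treat sentinel codes as missing, not as real magnitudes.
SENTINELS = {999}

def drop_sentinels(values, sentinels=SENTINELS):
    return [v for v in values if v not in sentinels]

clean = drop_sentinels(responses)

# The damage is not subtle: raw mean 202.2 versus clean mean 3.0.
raw_mean = sum(responses) / len(responses)
clean_mean = sum(clean) / len(clean)
```

Any estimate built on the raw column is dominated by the missing-data code, which is how those published-then-retracted results happen.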