The ongoing scuffles over reproducible (or is it replicable? or robust?) research always seem to miss one point particularly important to my own work: protecting the geographic identities of respondents.
I do not wish to argue that we should not replicate or share data. Rather, I wish to suggest that the costs of data sharing are not as low as many make them out to be and that a one-size-fits-all policy on reproducible research seems unwise.
For those who don’t know, I study how people’s health differs across neighborhoods and how they decide where to move (among other things). Both cases present problems that would make me very uncomfortable providing journals with an unrestricted replication package (data plus analysis code) upon acceptance of a paper. In fact, I believe that doing so would violate very reasonable personal privacy protections.
My analytical dataset includes detailed information about residents’ communities. If I provided a researcher with that dataset, it would allow them to figure out where those respondents live. Even if I strip the dataset of place identifiers (like Census tract FIPS codes), I am not safe. Identifying which Census tracts I studied would take someone with basic programming experience a matter of minutes to hours: they need only match my data against data downloaded from the Census. With the additional information provided by the survey itself (e.g., more exact income measures, household composition, age, gender, educational degree), individual respondents could be identified with some effort. Who will ensure that the person using the data follows proper protocols? How do we inform respondents of this very real risk to their privacy? There is a reason the Census does not release individual-level records for 72 years, even for datasets based on samples.
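To make the matching concern concrete, here is a minimal Python sketch of how a dataset stripped of tract identifiers could be matched back to public Census tables. Every identifier, variable name, and number below is invented for illustration; real tract attributes would make the join just as easy.

```python
# Hypothetical sketch: even with tract FIPS codes stripped, tract-level
# attributes retained for analysis can be joined back to public Census
# tables to recover the identifiers. All values below are made up.

# "Anonymized" analytical records: FIPS removed, tract attributes kept.
analytic = [
    {"resp_id": 1, "tract_pop": 4212, "tract_median_income": 53100},
    {"resp_id": 2, "tract_pop": 6871, "tract_median_income": 91400},
]

# Public Census extract: the same attributes, with identifiers attached.
census = [
    {"fips": "17031010100", "tract_pop": 4212, "tract_median_income": 53100},
    {"fips": "17031010201", "tract_pop": 6871, "tract_median_income": 91400},
    {"fips": "17031010202", "tract_pop": 3950, "tract_median_income": 47800},
]

# Re-identification is a simple join on the shared attributes.
lookup = {(c["tract_pop"], c["tract_median_income"]): c["fips"] for c in census}
for row in analytic:
    row["fips"] = lookup.get((row["tract_pop"], row["tract_median_income"]))

print([r["fips"] for r in analytic])  # each respondent's tract recovered
```

With only two attributes the join is already unique here; a real analytical file typically carries far more tract-level variables, making the match even less ambiguous.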
The second problem is even stickier. Studying geographic phenomena means that my analytical code itself often contains geographic identifiers. For example, in a recent paper, I find that distance to a community influences whether a person would consider moving there. To construct that variable, I calculated the distance from the respondent’s block group to the community using the Pythagorean theorem, which requires entering the respondent’s coordinate location. And, in fact, it would take only high school algebra and geometry to figure out someone’s block group if I released a dataset with distances from the respondent to three different communities.
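As an illustration of that last point, here is a small Python sketch (with made-up planar coordinates and communities) showing how distances to three known points pin down a location exactly. Subtracting pairs of circle equations $(x-x_i)^2 + (y-y_i)^2 = d_i^2$ cancels the squared terms, leaving a two-equation linear system:

```python
# Hypothetical illustration: given a respondent's distances to three
# communities at known coordinates, high school algebra recovers the
# respondent's location. All coordinates and distances are invented.
import math

p1, p2, p3 = (0.0, 0.0), (10.0, 0.0), (0.0, 10.0)  # community locations
secret = (3.0, 4.0)                                 # respondent's block group
d1, d2, d3 = (math.dist(secret, p) for p in (p1, p2, p3))

def trilaterate(p1, d1, p2, d2, p3, d3):
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    # Subtracting circle equations leaves a 2x2 linear system A [x, y] = c.
    a1, b1 = 2 * (x2 - x1), 2 * (y2 - y1)
    c1 = d1**2 - d2**2 - x1**2 + x2**2 - y1**2 + y2**2
    a2, b2 = 2 * (x3 - x1), 2 * (y3 - y1)
    c2 = d1**2 - d3**2 - x1**2 + x3**2 - y1**2 + y3**2
    det = a1 * b2 - a2 * b1
    # Cramer's rule for the two unknowns.
    return ((c1 * b2 - c2 * b1) / det, (a1 * c2 - a2 * c1) / det)

print(trilaterate(p1, d1, p2, d2, p3, d3))  # ≈ (3.0, 4.0): location recovered
```

Nothing here requires the attacker to know anything beyond the three community coordinates, which are public by construction (they are named in the paper).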
The solutions I often see bandied about do not provide much comfort. First, I could create a fuzzy dataset by masking geographic locations. But for my analysis to be truly reproducible, I would need to run it on the same fuzzy dataset, which would mean intentionally introducing measurement error into my analysis. Second, I could break the data into categories to prevent people from easily matching against Census data (though with a sufficient number of variables, they could probably still sniff it out). Again, binning continuous variables into categories introduces measurement error and reduces the quality of the research.
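A quick Python sketch of the first trade-off, with invented coordinates: jittering a respondent's location does protect it, but every variable derived from that location (here, distance to a community) absorbs the jitter as measurement error.

```python
# Hypothetical sketch: "fuzzing" a location by a random offset protects
# the point but perturbs any derived distance variable. Values invented.
import math
import random

random.seed(1)
community = (12.0, 7.0)
true_loc = (3.0, 4.0)

def jitter(point, radius):
    # Displace the point by a random offset of up to `radius` units.
    angle = random.uniform(0, 2 * math.pi)
    r = random.uniform(0, radius)
    return (point[0] + r * math.cos(angle), point[1] + r * math.sin(angle))

true_dist = math.dist(true_loc, community)
fuzzy_dist = math.dist(jitter(true_loc, radius=1.0), community)
print(abs(fuzzy_dist - true_dist))  # nonzero: error is baked into the analysis
```

The error is bounded by the jitter radius, so there is a direct tension: a radius small enough to keep the distance variable accurate is also small enough to narrow the respondent's location considerably.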
Finally, I could require that people who request the data submit their study for IRB review at their home institution before I release it. But that would turn me into a judge of sufficient protocol design and make me liable for any mistakes or unethical behavior by the requester. It would also take a good deal of time away from productive new research (the purported goal of reproducibility, not to mention my own tenure). My inclination in such cases would be to say no, like 72% of my colleagues.
It’s possible that a middle ground has been found of which I am unaware; this is not a debate I have sought to enter. That said, any solution carries real costs that reproduction absolutists often ignore. I also think it important to highlight that the peculiarities of individual studies, or even whole subfields, do not fit a general prescription to share data on open platforms like journal websites.