The ongoing scuffles over reproducible (or is it replicable? or robust?) research always seem to miss one point particularly important to my own work: protecting the geographic identities of respondents.

I do not wish to argue that we should not replicate or share data. Rather, I wish to suggest that the costs of data sharing are not as low as many make them out to be and that a one-size-fits-all policy on reproducible research seems unwise.

For those who don’t know, I study how people’s health differs across neighborhoods and how they decide where to move (among other things). Both cases present problems that would make me very uncomfortable providing a replication package (data with analysis code) with no restrictions to journals upon acceptance of a paper. In fact, I believe that doing so would violate very reasonable personal privacy protections.

My analytical dataset includes detailed information about residents’ communities. If I provided a researcher with that dataset, it would allow them to figure out where those respondents live. Even if I strip the dataset of place identifiers (like Census tract FIPS ID numbers), I am not safe. Someone with basic programming experience could identify which Census tracts I studied in a matter of minutes to hours by matching my data to data downloaded from the Census. With the additional information provided by the survey itself (e.g., more exact income measures, household composition, age, gender, educational degree), a respondent could be identified with some effort. Who will ensure that the person using the data follows proper protocols? How do we inform respondents of this very real potential risk to their privacy? I mean, there is a reason that the Census does not release data on individuals for 72 years, even datasets based on samples.
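To make the matching risk concrete, here is a minimal sketch in Python of how tract-level attributes left in a "de-identified" file could be joined back to a public table. Everything in it is hypothetical: the function name `match_tracts`, the tract labels, the variable names, and the numbers are made up for illustration and are not drawn from my data or from any real Census product.

```python
def match_tracts(released_rows, public_table, keys, ndigits=1):
    """Return, for each released row, the public tract labels whose values match."""
    # Index the public table by its (rounded) attribute values.
    index = {}
    for tract_id, attrs in public_table.items():
        signature = tuple(round(attrs[k], ndigits) for k in keys)
        index.setdefault(signature, []).append(tract_id)
    # Look up each released row; an empty list means no candidate was found.
    return [
        index.get(tuple(round(row[k], ndigits) for k in keys), [])
        for row in released_rows
    ]


# Hypothetical public table keyed by made-up tract labels.
public_table = {
    "tract_A": {"median_income": 41250.0, "pct_poverty": 18.2, "population": 3410},
    "tract_B": {"median_income": 67800.0, "pct_poverty": 6.5, "population": 5122},
    "tract_C": {"median_income": 41250.0, "pct_poverty": 21.7, "population": 2987},
}

# Hypothetical released rows with the tract identifier stripped.
released_rows = [
    {"median_income": 67800.0, "pct_poverty": 6.5, "population": 5122},
    {"median_income": 41250.0, "pct_poverty": 21.7, "population": 2987},
]

keys = ["median_income", "pct_poverty", "population"]
candidates = match_tracts(released_rows, public_table, keys)
for row_number, tract_ids in enumerate(candidates):
    print(row_number, tract_ids)
```

In this toy example a single variable leaves some ambiguity (two tracts share a median income), but combining a few variables narrows each row to one candidate tract.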

The second problem is even stickier. Studying geographic phenomena means that my analytical code itself often contains geographic identifiers. For example, in a recent paper, I find that distance to a community influences whether a person would consider moving there. To make that analytical variable, I calculated the distance from the respondent’s block group to the community itself using the Pythagorean Theorem, which requires entering the coordinate location of the respondent. And, in fact, it would require only high school algebra and geometry to figure out someone’s block group if I released a dataset with distances from the respondent to three different communities.
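The algebra in question is ordinary trilateration. The sketch below, in Python, assumes flat (projected) coordinates and three communities that do not lie on a single line; the function names and all coordinates are invented for illustration. Subtracting one circle equation from the other two leaves a pair of linear equations in the unknown location, which can be solved directly.

```python
import math


def planar_distance(p, q):
    """Straight-line (Pythagorean) distance between two planar points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def trilaterate(c1, c2, c3, d1, d2, d3):
    """Recover (x, y) from distances d1..d3 to known points c1..c3."""
    (x1, y1), (x2, y2), (x3, y3) = c1, c2, c3
    # Subtracting circle 1 from circles 2 and 3 cancels the squared unknowns.
    a1, b1 = 2 * (x1 - x2), 2 * (y1 - y2)
    r1 = d2**2 - d1**2 + x1**2 - x2**2 + y1**2 - y2**2
    a2, b2 = 2 * (x1 - x3), 2 * (y1 - y3)
    r2 = d3**2 - d1**2 + x1**2 - x3**2 + y1**2 - y3**2
    det = a1 * b2 - a2 * b1
    if abs(det) < 1e-12:
        raise ValueError("communities are collinear; location is not unique")
    # Cramer's rule for the 2x2 linear system.
    x = (r1 * b2 - r2 * b1) / det
    y = (a1 * r2 - a2 * r1) / det
    return x, y


# Hypothetical community centroids and a hypothetical respondent location.
communities = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
respondent = (3.0, 4.0)

# These three distances are all that a released dataset would need to contain.
distances = [planar_distance(respondent, c) for c in communities]
recovered = trilaterate(*communities, *distances)
print(recovered)
```

With exact distances the recovered point coincides with the original location up to floating-point error, which is why releasing the three distance variables is nearly equivalent to releasing the coordinates.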

The solutions I often see bandied about do not provide much comfort. First, I could create a fuzzy dataset by masking geographic locations. But in order for my analysis to be truly reproducible, I would need to analyze the data with the same fuzzy dataset. That would mean intentionally introducing measurement error into my analysis. Second, I could break data into categories to prevent people from easily matching against Census data (though with a sufficient number of variables, they could probably still sniff it out). Again, binning continuous variables into categories introduces measurement error and reduces the quality of research.
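One well-known consequence of adding random noise to a predictor is attenuation: the estimated slope shrinks toward zero. The following Python sketch illustrates this on simulated data only. The sample size, the true slope of 2, and the noise scale are assumptions chosen for the illustration, and the "masking" here is simple additive Gaussian noise rather than any particular geographic masking method.

```python
import random


def ols_slope(x, y):
    """Slope from a simple least-squares regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


rng = random.Random(0)  # fixed seed so the simulation is repeatable
n = 5000

# Simulated "true" predictor (think of a standardized distance measure).
true_x = [rng.gauss(0, 1) for _ in range(n)]
# Simulated outcome with an assumed true slope of 2.
y = [2.0 * x + rng.gauss(0, 1) for x in true_x]
# Masked predictor: the true value plus noise of the same variance.
masked_x = [x + rng.gauss(0, 1) for x in true_x]

slope_true = ols_slope(true_x, y)
slope_masked = ols_slope(masked_x, y)
print(slope_true, slope_masked)
```

Under these assumptions the slope estimated from the masked predictor should come out at roughly half the slope estimated from the true predictor, since the noise variance equals the predictor variance.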

Finally, I could require that people who request data submit their study to IRB review at their home institution before releasing the data. But this would then turn me into a judge of sufficient protocol design and make me liable for any mistakes or unethical behavior committed by the requester. Not to mention, it would take a good deal of time that eats into productive, new research (the purported goal of reproducibility, not to mention my own tenure). My inclination in such cases would be to say no, like 72% of my colleagues.

It’s possible that a middle ground has been found of which I am unaware; this is not a particular debate I have sought to enter. That said, any solution imposes real costs that reproduction absolutists often ignore. I also think it important to highlight that the peculiarities of individual studies, or even whole subfields, do not fit with a general prescription to share data on open platforms like journal websites.

13 thoughts on “the place of reproducible research”

  1. These are good questions, to which there are good answers — even now before we have technological solutions like the ones you refer to (but which work). One is to make the data available under a specific licensing agreement. This can be done through dataverse for example, but there are other options. You certainly don’t want to have to be in the position of making evaluations of scholars who want your data one by one, so you need a fixed agreement with fixed rules. You can put whatever rules you want, including financial payments for violating them, but having the fixed rules be public is crucial (for you). That then prevents others from criticizing you for deciding that the only people who will get your data are those who implicitly promise not to criticize you or some such (which was often the case with the text of the written work a few centuries ago!)

    On technological solutions: lots of research is underway. e.g., see

    1. Gary, I appreciate your response!

      But I have some follow-up questions. I am not sure how dataverse helps much in this case *unless* they also provide institutional strength to back up the agreements. For example, would it be possible for me to require users to have an approved IRB before obtaining the data, and would dataverse check to make sure that is the case? I know for larger studies, ICPSR provides this kind of service, but I’m not sure how willing they would be to do that with analysis code or smaller datasets. And, as far as I know, ICPSR only acts as a resource and actually passes final approval on to the data depositors (I’m thinking specifically of the LA FANS and PHDCN data).

      And, if someone violates the terms, will Harvard, through dataverse, use its resources to pursue violators and collect financial penalties, or does that fall to individual investigators? If it falls on me, the investigator, then it would mean that I would have to a) know about all uses of the data, b) keep track of what others do with the data, and c) take the time and energy to file for and recoup financial damages. This seems to put me in the same predicament of policing the ethical use of data that I deposit. To some degree we have to assume mostly honest actors, but there is still room for plenty of careless mistakes that could happen. I hope I’m not being obtuse, but I still get very nervous releasing point-level geocoded data that people could use to identify study subjects if I use that data in my own analysis.

      I’m not trying to be obtuse and want to learn about how I can safely share data without it becoming a huge burden.


  2. Mike, There’s a general plan and a specific one. The general plan would be for you to make the data available via licensing agreement that completely protects you. Prosecuting violations you probably don’t care about unless there’s an issue, or someone has an issue with you; it is there to protect you, not so you can make money or go after people. Same story when you get data from most places and sign an agreement. They write the agreement to protect themselves, not to watch everything you do. They are passing the responsibility off from them to you; same story here. So you don’t need to follow anyone around, which would be infeasible, unpleasant, and intrusive, to say the least.

    Then there’s the more specific story. You’re actually not allowed to make data available for research purposes without permission, even if you copied the data from Wikipedia or the NY Times. Technically, that would be a violation of federal regulations, punishable by your university not being allowed to receive any other federal funds. The formal rule is that you must go to your IRB to get approval to make data available. It is their call, not yours. You can give them the general plan above, which would help them let you do it, but it is their call and responsibility. You also want it to be their call since then your university backs you up and takes the responsibility rather than you putting yourself on the line (and your house, savings, career, etc…).

    The issues you raise in your post are issues your IRB has to deal with. And it is good that you wrote it since it will help them. They of course might well be in completely different fields.


    1. This all seems to simply address concerns about regulations and legal liability, rather than a researcher’s ethical concerns. I personally would want to avoid a situation where someone is substantially harmed by my work, even if I was not liable.


  3. That’s right, it probably wouldn’t be declared human subjects research if you went to the IRB, but the feds recommend (and almost all universities have adopted the formal rule) that the investigator may not be the one to make this determination. So you’re stuck; if you want approval, you must go to the IRB and get it formally. And you might as well, so you don’t have to take the responsibility.

    Melissa Sands and I just wrote a little paper on this subject, which may be helpful. See “How Human Subjects Research Rules Mislead You and Your University, and What to Do About It”. Copy at


  4. Responding to the exchange about whether texts are human subjects: I had these discussions when I was on the IRB (I was kicked off). We have worked hard to say that publicly-available texts are NOT human subjects, any more than chemicals in a lab are. Seriously, the minute anyone tries to say that texts are human subjects, I will demand that EVERYONE (including the engineers, business profs, physicists, etc. as well as the literature profs and history profs) file IRB protocols before they go to the library because when we read texts, they have all been written by human beings, and you cannot do a literature search without talking about what individual people have written. Seriously, this is way out of control and I honestly hope you, Gary King, just don’t know what you are talking about. Saying that only the IRB, not the individual researcher, can decide whether research on human subjects is exempt is bad enough, but if the IRB is claiming jurisdiction over deciding whether research on publicly-available texts is human subjects research, then we are done.

    As regards the more important (I HOPE) exchange about sharing potentially-identifiable data, I believe the best solution is deposit with ICPSR or similar. I know due to current personal experience that the IRB is quite capable of regulating access to potentially-identifiable data. INCLUDING, I may add (linking to other debates about replication), requiring that the recipient of potentially-identifiable data DESTROY the file copies after X years!!


    1. OW, I asked Gary for his thoughts because he has thought more about replication in social sciences than anyone I know, and has done a great deal of service to social sciences by building infrastructures to make it possible. Reading the working paper he links to above, I think that he might agree with you about engineers submitting to institutional review. He makes the point that a careful reading of relevant statutes actually requires all research to get approval from IRBs. His point (as I read it) is not to say that the system makes sense — indeed he has a radical but simple solution to make it better and more coherent — but to protect oneself as a political actor among other political actors with different motivations.


      1. Note however my claim that under this interpretation, a literature review is human subjects research, so there is no such thing as a professor who does NOT do human subjects research.


      2. I did catch that. Note that I said “institutional review,” not “human subjects review.” Gary points out that human subjects is only one of many potential legal issues that should be considered. Those others could be just as problematic (things such as copyright, environmental protection, election laws, etc.).

        The problem, it seems, is in the law and the policies surrounding the relevant laws. I agree both that there are substantial problems in IRBs that need revision and that the current state of affairs would lead investigators to submit to IRB out of due diligence to protect their own interests. I believe both can be simultaneously true.


