experimental vs. statistical replication

In the context of all of the debates about replication going on across the blogs, it might be useful to introduce a distinction: experimental vs. statistical replication.* Experimental replication is the more obvious kind: can we run a new experiment using the same methods and produce a substantially similar result? Statistical replication, on the other hand, asks, can we take the exact same data, run the same or similar statistical models, and reproduce the reported results? In other words, experimental replication is about generalizability, while statistical replication is about data manipulation and model specification.
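To make the second kind concrete, here is a minimal Python sketch of a statistical replication check. It asks whether a re-estimated number agrees with the published one at the precision the paper reported. The function names (`reproduces`, `compare_table`) and the example values are hypothetical illustrations, not taken from any study discussed here.

```python
def reproduces(reported: str, recomputed: float) -> bool:
    """True if `recomputed` rounds to the published string `reported`.

    `reported` is kept as a string so that the number of printed
    decimals (the paper's reporting precision) is not lost.
    """
    decimals = len(reported.partition(".")[2])
    half_unit = 0.5 * 10 ** -decimals  # half of the last reported digit
    # Small epsilon guards against floating-point noise at the boundary.
    return abs(float(reported) - recomputed) <= half_unit + 1e-12


def compare_table(reported: dict, recomputed: dict) -> dict:
    """Check every published entry against its re-estimated counterpart."""
    return {name: reproduces(reported[name], recomputed[name]) for name in reported}


if __name__ == "__main__":
    published = {"coef_a": "0.73", "coef_b": "-1.2"}    # hypothetical table
    rerun = {"coef_a": 0.7349, "coef_b": -1.31}         # hypothetical re-estimates
    print(compare_table(published, rerun))
```

A check like this only says whether the numbers match. When they do not, the interesting work is finding out which coding or weighting choice accounts for the gap.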

On the one hand, sociology, economics, and political science all have ongoing issues with statistical replication. The big Reinhart and Rogoff controversy was the result of an attempt to replicate a statistical finding; the attempt revealed some unreported shenanigans in how cases were weighted, and that some cases were simply dropped through error. Gary King’s work on improving replication in political science aims at making this kind of replication easier, and even turning it into a standard part of the graduate curriculum. Similarly, I believe the UMass paper that failed to replicate Reinhart and Rogoff emerged out of an econometrics class assignment that required students to statistically replicate a published finding.

On the other hand, psychology seems to have a big problem with experimental replication. Here the concerns are less about model specification (as the models are often simple, bivariate relationships) or data coding, but rather about implausibly large effects and “the file drawer problem” where published results are biased towards significance (which in turn makes replications much more likely to produce null findings).

Both of these kinds of replication are clearly important, but they present somewhat different issues. For example, Mitchell’s concern that replication will be incompetently performed and thus produce null findings when real effects exist makes less sense in the context of statistical replication where the choices made by the replicator can be reported transparently, and the data are shared by all researchers. So, as an attempt at an intervention, I propose we try to make clear when we’re talking about experimental replication vs. statistical replication, or if we really mean both. Perhaps we might even call the second kind of replication something else like “statistical reproduction”** in order to highlight that the attempt to reproduce the findings is not based on new data.

What do you all think?

* H/T Sasha Killewald for a conversation about different kinds of replication that sparked this post.
** Think “artistic reproduction” – can I repaint the same painting? Can I re-run the same models and data and produce the same results?

Author: Dan Hirschman

I am a sociologist interested in the use of numbers in organizations, markets, and policy.

23 thoughts on “experimental vs. statistical replication”

  1. How about this?

    * Exact reproduction: Do we get the same inferences from the same analyses on the same data?

    * Conceptual reproduction: Do we get the same inferences from different analyses on the same data?

    * Exact replication: Do we get the same inferences from the same analyses on different data?

    * Conceptual replication: Do we get the same inferences from different analyses on different data?
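The four-way scheme above is a 2x2 grid: same or different data, crossed with same or different analyses. A minimal Python sketch of that grid follows; the function name `classify` and its boolean arguments are assumptions made for illustration.

```python
def classify(same_data: bool, same_analysis: bool) -> str:
    """Name a re-study using the 2x2 scheme from the list above.

    same_data      -> "reproduction", otherwise "replication"
    same_analysis  -> "exact", otherwise "conceptual"
    """
    kind = "reproduction" if same_data else "replication"
    scope = "exact" if same_analysis else "conceptual"
    return f"{scope} {kind}"


if __name__ == "__main__":
    for same_data in (True, False):
        for same_analysis in (True, False):
            print(same_data, same_analysis, "->", classify(same_data, same_analysis))
```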


  2. Yes, you are absolutely right that economists and some political scientists refer to reproducing results using the same data as replication. I’ve used that phrasing myself, although I’m not sure I have in print. Along the lines of what LJ is saying, I think open science folks are moving to referring to that as the “reproducibility” of results, and replication for the idea of trying to see if one observes the same pattern of results from different data.

    At least as I see the term used in psychology, I would describe conceptual replication as trying to test a theoretical generalization using a substantially different operationalization from what was used to produce the original finding.


  3. As a non-academic this may sound like a stupid question but why should there even be an issue with statistical replication in this day and age? I assume the problem is that researchers aren’t publishing their raw data in a universal format that would facilitate verification. Why aren’t the journals more pro-active with peer review to actually double check the data calculations? Shouldn’t at least one of the referees double check the numbers in Stata (or whatever software you use)? It seems to me that this problem should not really exist with our current technology.
    As far as econometrics and advanced statistical techniques, the issues are more complicated because some of the very techniques used are questionable (Freakonomics anyone?). From what little I know about econometrics, sometimes the data is tortured to death with questionable statistical techniques (the gun control and death penalty studies from the 70s, 80s and 90s come to mind). This seems like a separate problem from reproducing the very calculations that the study’s conclusions rely upon.


  4. “Why aren’t the journals more pro-active with peer review to actually double check the data calculations?” You obviously don’t review many articles. Seriously. I HAVE caught obvious mistakes in tables and have noticed problems that pointed to modeling or data issues, but the idea that a reviewer is going to re-run the analysis is pretty breathtaking in the mental image you must have of what is actually going on in an article review. In most cases, the real issue is what happens to the results if you tweak the model specification this way and that; it is that kind of concern that has led to the R&R crisis.


      1. oops, sorry. My apologies for the snark. No, given how reviews work, you can’t check people’s data. I once was asked to check a simulation paper by being required to download the software (!) and look at the model specifications and the code. I did it, but I and the other reviewer complained mightily. Because, really, even after spending several hours getting it all set up, how could you really test it except by re-doing all the work yourself?


      2. No problem vis-a-vis snark – I’m a lawyer and we aren’t known for our social skills. I was sort of hinting at some possible structural changes that could be made considering the state of existing technology. It just seems bizarre that so much rides on stipulations from the researcher. At minimum, one would expect full disclosure of the datasets, as well as cooperation with other scholars who attempt replication. Just my two cents…


      3. “bizarre that so much rests on the stipulations from the researcher.” Yes and yes and that is why scientific integrity is such a big deal. As I lecture to my intro methods students, science really depends on people telling the truth about their data. This is just as true for claims that somebody really did get the data they said they did in a biology or physics lab as a quantitative social scientist saying their statistical analysis of a data set really said what they think it said (or a qualitative researcher saying they observed what they said they observed). The history of science is full of examples of whole lines of work that are just wrong, sometimes due to outright fraud, other times due to genuine mistakes. Hence the premium in some fields (e.g. biology, physics) on another lab being able to replicate and the importance of the whole replication debate.


      4. I used to teach high school natural science. One year, I thought that it would be a good idea to see if students could explain the difference between science as a way of knowing things and other ways of knowing things (such as testimony and logic), so that students could get a better idea of what science is — and how natural science and social science are similar to each other. But the more I thought about this as I prepared the lesson, the more I realized that much of science is only a subset of testimony.

        I’m not sure that we can ever remove the need for testimony from science, but the current problem is that testimony is required in situations in which testimony should not be required.

        Regarding the initial question, one problem is that researchers do not make available their data and the code necessary to reproduce their results (often, analyses are so complicated and include so many assumptions that it is nearly impossible to reproduce results from raw data; for example, if a respondent does not answer a question about their political ideology, some researchers code that non-response as missing data and exclude the person from the analysis; other researchers might code that person as being a political moderate; other researchers might code the respondent’s ideology based on their political party affiliation, so that a Republican with no response on the ideology item is coded as a moderate conservative; and other researchers might conduct sophisticated missing data analyses to predict the respondent’s ideology based on demographics and other data). This problem of making data and code available can largely be resolved by journals requiring researchers to upload data and code to public websites.
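The ideology example above can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the six toy respondents, the 1 to 7 ideology scale, the party default codes, and the use of a within-party mean as a crude stand-in for a "sophisticated missing data analysis". The point is only that four defensible coding rules give four different answers from the same raw data.

```python
from statistics import mean

# Hypothetical respondents; ideology runs from 1 (very liberal) to
# 7 (very conservative), and None marks an unanswered ideology item.
RESPONDENTS = [
    {"party": "D", "ideology": 2},
    {"party": "D", "ideology": 3},
    {"party": "R", "ideology": 6},
    {"party": "R", "ideology": None},
    {"party": "R", "ideology": None},
    {"party": "I", "ideology": 4},
]

MODERATE = 4
PARTY_DEFAULT = {"D": 3, "R": 5, "I": 4}  # assumed "moderate partisan" codes


def drop_missing(rows):
    """Rule 1: treat non-response as missing and exclude the respondent."""
    return [r["ideology"] for r in rows if r["ideology"] is not None]


def code_as_moderate(rows):
    """Rule 2: code non-respondents as political moderates."""
    return [MODERATE if r["ideology"] is None else r["ideology"] for r in rows]


def code_from_party(rows):
    """Rule 3: code non-respondents from their party affiliation."""
    return [PARTY_DEFAULT[r["party"]] if r["ideology"] is None else r["ideology"]
            for r in rows]


def impute_party_mean(rows):
    """Rule 4 (stand-in): fill in the mean of observed same-party answers.

    Assumes every party has at least one observed answer.
    """
    observed = {}
    for r in rows:
        if r["ideology"] is not None:
            observed.setdefault(r["party"], []).append(r["ideology"])
    return [mean(observed[r["party"]]) if r["ideology"] is None else r["ideology"]
            for r in rows]


RULES = {
    "drop_missing": drop_missing,
    "code_as_moderate": code_as_moderate,
    "code_from_party": code_from_party,
    "impute_party_mean": impute_party_mean,
}


def summarize(rows):
    """Return {rule name: (sample size, mean ideology)} for each coding rule."""
    out = {}
    for name, rule in RULES.items():
        values = rule(rows)
        out[name] = (len(values), round(mean(values), 3))
    return out


if __name__ == "__main__":
    for name, (n, avg) in summarize(RESPONDENTS).items():
        print(f"{name:18s} n={n} mean={avg}")
```

Unless the code is posted, a reader of the methods section usually cannot tell which of these rules produced the published numbers.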

        Another problem is that, even if the data and code were made available, in many cases no one does the work to check whether the analyses were conducted correctly, or to check whether the analyses that were reported are representative of all the analyses that could have been reported. This problem is more difficult to solve. One idea might be to leave it to market forces: antagonistic replicators should be more likely to target research perceived to be important and research perceived to be poorly done or at least not convincing (say, because the sample was 40 students at the researcher’s university). The problem with leaving this to market forces is that, in many fields of social science, researchers agree with each other broadly on policy, so the targeted research might not be representative of the research that should be targeted.

        Another idea might be to require peer reviewers to check the analyses, but that type of peer review would in many cases place a substantial burden on peer reviewers; I imagine that many people can draft a quality peer review in a few hours; for the median study, though, it might take much longer to conduct a quality review of the statistical analyses (even if the researcher’s code is well-documented; imagine the time required for quality reviews when the code is sparse or confusing). Science appears to be moving in this direction (see link below), but it’s not clear to me that journals at the median or lower level of perceived impact will be able to get quality peer reviews that check analyses; it’s difficult enough for many of these journals to get peer reviewers to conduct a quality read-through type review.

        http://www.nature.com/news/science-joins-push-to-screen-statistics-in-papers-1.15509


      5. Again, all of this is why replication is such a big deal. Much statistical analysis uses publicly-available data which can be checked. There are quite a few statistical patterns that show up over and over in many different data sets with different model specifications. Those patterns are solid. It really isn’t true that nothing is established. It is the accumulation of multiple studies that creates science, not any one testimony. Most of the time, attempted replication reveals errors or fraud. What we have been debating on this blog is practices that tend to discourage replication or the reporting of replication (or its failure). For example, the premium on having something “new” to say to get published pushes people away from adding to the depth of certainty about well-established findings and toward pushing the statistical limits.

        This is a very different mindset from law. In my experience dealing with lawyers around racial disparity issues, lawyers are trained to think of the specifics of cases one at a time and are generally averse to the idea that statistical averaging across cases can reveal important patterns. They want to know everything that could possibly be a factor in any one case.

        It is true that there is an underlying idea of testimony, that any one report is something you have to interrogate for its truth. And there are well established scientific ways of doing this. Any decent social scientist would of course interrogate the character of the sample, the measurement procedures used, the statistical methods employed. That’s why research articles have methods sections. It is also not uncommon for people to be asked to share their data with another group, and the first thing a new group does with data is attempt to replicate the previously-published results. But this usually happens privately.

        You probably don’t realize if you don’t do this kind of work that in a good statistical shop, the kinds of things that you think are invisible are actually retrievable information, because they can be seen in the computer code.


      6. Let me push back gently against some of this.

        “It is the accumulation of multiple studies that creates science, not any one testimony.”

        The plural of testimony is not science; the plural of testimony is only testimonies.

        If I test claims with observations, that’s science. If I report the results of my study, that’s testimony. If ten other researchers report the results of ten similar studies that found the same effect and effect size and direction, that’s still only testimony. Sure, it might be more convincing testimony, but it’s still only testimony.

        Now, if I post my data and code so that anyone can check the code and test whether my effect size estimates change with reasonable model specifications that I did not report, then someone can replace my testimony with “eyewitness” testing of claims with observations (i.e., science). Sure, there still might be testimony regarding where the observations came from, so this is why it is a good idea for data to be collected by TESS or the ANES or some third party.

        I agree that certain things have been established by social science, at least in terms of the presence of an effect, the effect’s direction, and ballpark estimates for some effect sizes. I would not be concerned about reproduction and replication if I thought that nothing could be established. But my concern is that it is also possible that certain effects and effect sizes have been established even though a thorough and unbiased assessment of reported and unreported observations might not support their establishment.

        (I’d also want to think more about the establishment of findings. It seems like I often see a lot of “Science: X causes Y” claims based on one study. For many people reading such claims, it might be “established” that X causes Y.)

        “You probably don’t realize if you don’t do this kind of work that in a good statistical shop, the kinds of things that you think are invisible are actually retrievable information, because they can be seen in the computer code.”

        Sure, things are retrievable from computer code. But the problem is that researchers often refuse to provide their data or computer code even upon request. There’s no way that I know of to tell from a methods section what happens to key effect sizes and p-values when a new control variable is added to a model or when a different sample restriction is used or when any other model specification change is made.
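As a rough illustration of the specification check described above, here is a minimal Python sketch that estimates the slope on x with and without a control variable z. The data are made up so that y = 0.5*x + 3*z holds exactly. The function names are hypothetical, and the controlled slope is computed by residualizing x and y on z (the Frisch-Waugh-Lovell approach) instead of calling a regression library.

```python
def slope(x, y):
    """Bivariate OLS slope of y on x: cov(x, y) / var(x)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def residuals(y, z):
    """Residuals from regressing y on z (with an intercept)."""
    b = slope(z, y)
    a = sum(y) / len(y) - b * sum(z) / len(z)
    return [yi - (a + b * zi) for yi, zi in zip(y, z)]


def slope_with_control(x, y, z):
    """Slope on x in the regression of y on x and z, via residualization."""
    return slope(residuals(x, z), residuals(y, z))


# Made-up data: z drives both x and y, so omitting z inflates the slope on x.
z = [0, 1, 2, 3, 4, 5]
x = [1, 0, 3, 2, 5, 4]
y = [0.5 * xi + 3 * zi for xi, zi in zip(x, z)]

if __name__ == "__main__":
    print("without control:", round(slope(x, y), 3))
    print("with control:   ", round(slope_with_control(x, y, z), 3))
```

With the data and code in hand, this comparison takes seconds. From a methods section alone it cannot be done at all.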


  5. I realize I’m late to the party, but here goes.

    When I was in grad school I was taught that finding the same result using new data is replication while reproducing a result using the same data is duplication. The former adds validity to a finding because it means that a relationship can be reproduced independently by folks other than the original authors using new data, thereby suggesting that the original result wasn’t simply due to alpha error or the like. For obvious reasons, this means that replications by the same author(s) as the original study are less useful than replication by other teams. Duplication, in contrast, is only sufficient to show that the procedures used were sufficient and appropriate. In other words, if you can duplicate a result then you know how it was obtained originally and can derive an assessment of how reasonable those procedures were.

    I spent a summer duplicating a bunch of results on networks using the GSS data and learned a good few things. One of those things is that Peter V. Marsden is a rock star. It took some doing in a few cases, but you can duplicate everything in his papers with utter precision. Moreover, any weird coding details not reported in his papers are pretty much always trivial. But in many other cases, duplications were somewhat incomplete, even in some cases with direct help from the original authors. And in at least one case I can think of, but won’t name, there were clear and serious errors in the reported tables and the author and I collectively couldn’t duplicate a lot of results. In general, I attribute most of these issues to an inability to derive some of the original finicky details of variable coding rather than fraud or major problems, especially since I was almost always able to produce results very similar to the originals without doing anything iffy. I think duplication is a great exercise, especially for students or folks who want to use a new method, but there’s just no reward to it (other than personal enlightenment/education) UNLESS you happen to uncover some sort of smoking methodological gun a la Reinhart and Rogoff in a major paper. Since the likelihood of this is comparatively low (we all hope), successful academics tend to spend their time elsewhere. The same is often true of replication: just replicating a result doesn’t get you much favor from journals, so you either have to do something else and replicate in passing, or replicate purely to disprove. And falsifications tend not to be published by a journal of equal prestige to the original paper. Moreover, there’s a sort of bias in favor of papers doing something really “new,” which makes sense, but can grow so strong as to make accumulation of work very difficult because the rewards for continuing a line of work are very small.

    I think requiring the deposition of data/code is a great idea. This is partly why I’ve been a big part of setting up the Mathematical Sociology section’s data depository (http://www.mathematicalsociology.org/). But there are minimal incentives to release your data. I’ve collected a lot of data since arriving at Cornell and some of it has required the development of new software. I COULD release my data, but doing so would just allow others to produce papers without having to invest the way I did. Thus, I’m only willing to release those data when compelled to by a journal/funder or when the data are old enough to be somewhat irrelevant. That’s not ideal in my view, but my desire for scientific transparency doesn’t motivate me to engage in career suicide. Even when the data are available, duplication is often difficult simply because journals can’t/don’t publish the full details of how every single variable was coded or manipulated on the road to the ultimate result. To some extent code deposits help with this, but after a few years programs can become uninterpretable to current generation software.

    Ultimately, I think the issue is mostly one of incentives. Most academics seem generally in favor, but nobody wants to do it unilaterally because it confers only risks. It’s a structural problem and will require a structural solution.


      1. I like the term “duplication” for the process of trying to exactly duplicate what went into a paper (same data, same methods, same coding choices, no attempt at robustness). Very useful terminology.

        Also, I’ve never taught quant methods, but I think I would have benefitted tremendously from a duplication-style assignment. Proving the consistency of MLE or whatever was not a particularly great use of my time; actually taking some messy data and getting it to look like a published table would have been very instructive.


      2. Couldn’t agree more. It’s a great exercise, at least once students have enough skill worked up to have a reasonable chance at it. I also tend to steer students towards papers I’ve duplicated myself, so that I at least know whether it can be done and what varmints are lurking in the details.


      1. It doesn’t necessarily, but it can cost a lot, both in money and researcher time, to collect data. As a result, when I do it I almost always try to throw enough extra stuff into the collection that I can pull several papers out of a given data set. But, of course, I generally can’t write all of them at once and as a pre-tenure academic, I can’t really sit on results until all the papers are ready to go. So if I release the data with the first paper, there’s nothing to stop some other clever person from grabbing the data and doing a follow-up paper with it, possibly taking the place of one I was working on. This isn’t fair to me, since I accepted all the costs of getting the data to begin with, and it’s especially unfair to any students I have working on the data, which I frequently do. In theory I could release a truncated data set that contains only the variables and observations used in that specific paper, but honestly I think that defeats the purpose of releasing data to enhance transparency.

        The situation would be different if everyone had to release their data, because at least then I’d be getting free data as often as I’d be giving it, but that isn’t the case right now. It’d also be different if using someone else’s data meant they had to be added as an author on the resulting paper, but that’s not the norm. Most often, if there’s any requirement to mention the folks who collected the data, it’s in the acknowledgments section, but a bunch of mentions in the acknowledgments isn’t something you can add to your CV.

        So calling it “career suicide” is certainly a bit hyperbolic, but the situation has many risks and few, if any, obvious rewards.


      2. My understanding of the purpose of releasing reproduction materials is to permit others to check our work and to determine whether the results that we report are representative of the results that could have been reported from that dataset; I’m not sure how releasing a truncated dataset defeats that purpose; it might not be a perfect reflection of that purpose, but it’s closer than releasing nothing.

        Of course, I can imagine a situation in which there is a fear that someone else can add a variable to a truncated dataset to produce an analysis that the dataset collector has planned but has not yet gotten around to. In that case, I can imagine that a dataset embargo might work, in which the data are kept in a repository until an agreed-upon release date: that way, outsiders can differentiate situations in which scholars want to keep their data private for a limited period from situations in which scholars do not appear to have any intention of releasing their data.

        Of course, I can imagine someone being concerned that, after uploading a dataset with an embargo date, someone at the dataset repository intentionally or unintentionally releases the data prematurely. In that case, it is still possible to publicly commit to releasing the data in the future. I can’t see any risk there of rivals anticipating planned research.

        My point is that — even though there is legitimate concern that others might take advantage of a scholar’s work in collecting data or having data collected — that does not preclude all efforts to increase transparency.

        That’s just for the data itself. For Matt or anyone else reading this: is there any reason why, in the scenario that Matt described, the reproduction code for each publication could not be made public, even if the data are kept private?

        For what it’s worth, I do think that the current lack of norms and rules disadvantages persons who try to make their research reproducible. The current incentives are really wrong in this area.


  6. As a side issue: for a rather interesting, and amusing, case dealing with what it means to release data, what counts as transparency, the costs of collecting data, and the difficulty in many cases of truly separating evidence from testimony, it’s worth reading this article in its entirety: http://rationalwiki.org/wiki/Lenski_affair

    For anyone who is curious, Richard Lenski is also the son of sociologist Gerhard Lenski, so this example still keeps us somewhat in the family.


    1. That was worth the read, and though extreme, well within the range of some of the things I have experienced. As regards releasing data to others, the data I have is all public, but (as you say) it took hours and hours to compile it, and in addition to the issues you raise about wanting to get your own publications out (which I’ve been slow to do), it takes a lot of time to document & label a data file so someone else can use it.

