replication vs. robustness in social science

Economist Michael Clemens has posted a very useful working paper attempting to bring order to the chaotic discussions of replication tests in the social sciences. Clemens argues for a clean conceptual distinction between replication tests on the one hand, and robustness tests on the other.

A replication test “estimates parameters drawn from the same sampling distribution as those in the original study.” (p. 1) A robustness test, on the other hand, “estimates parameters drawn from a different sampling distribution from those in the original study.” (p. 2) Replication, in other words, means attempting to verify that a model gives the results that an author reports (by literally re-running it to check for coding errors, or by checking for sampling error by resampling the same population and then running the same model), while a robustness test involves tweaking the modeling strategy (“reanalysis”) or trying to use the same model on a new population (“extension”), or both.
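
For readers who think in code, here is a minimal simulated sketch of Clemens’ four cells. None of it comes from the paper: the toy data-generating process and the draw_sample and fit_ols helpers are my own invention, meant only to make the vocabulary concrete.

    import numpy as np

    rng = np.random.default_rng(0)

    def draw_sample(n, effect, rng):
        # Toy "population": y depends linearly on x with a given true effect
        x = rng.normal(size=n)
        y = 1.0 + effect * x + rng.normal(size=n)
        return x, y

    def fit_ols(x, y):
        # Plain least squares with an intercept; returns (intercept, slope)
        X = np.column_stack([np.ones(len(x)), x])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        return beta

    # Original study: one sample from population A, one model
    x_a, y_a = draw_sample(500, effect=0.5, rng=rng)
    original = fit_ols(x_a, y_a)

    # Replication, type 1 (verification): same data, same code; estimates should match exactly
    verification = fit_ols(x_a, y_a)

    # Replication, type 2 (reproduction): fresh sample from the *same* population, same model
    x_a2, y_a2 = draw_sample(500, effect=0.5, rng=rng)
    reproduction = fit_ols(x_a2, y_a2)

    # Robustness, type 1 (reanalysis): same data, tweaked specification (here, no intercept)
    reanalysis, *_ = np.linalg.lstsq(x_a.reshape(-1, 1), y_a, rcond=None)

    # Robustness, type 2 (extension): same model, sample from a *different* population
    x_b, y_b = draw_sample(500, effect=0.2, rng=rng)
    extension = fit_ols(x_b, y_b)

    print(original[1], verification[1], reproduction[1], reanalysis[0], extension[1])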

Clemens argues that conflating replication and robustness leads to all sorts of confusion, including defensiveness by authors whose work has failed a robustness test but is accused of failing to replicate. Saying that a paper fails to “replicate” should mean that the result is flat-out wrong (and perhaps embarrassingly so, as in the case of the famed Reinhart-Rogoff spreadsheet errors); saying that it failed a robustness test simply means that the finding either fails to extend to other contexts (suggesting that circumstances have changed, or that some scope condition was not identified) or that the result is sensitive to particular (legitimate) modeling decisions. Clemens ends the paper by showing that many, but not all, recent debates over failed “replications” are better characterized as debates about robustness.

I like Clemens’ language a lot, and I think it could help to clarify many debates just as he hopes. It would also help systematize the “replication” assignment in third semester PhD stats classes – the requirement could be to start with a simple verification (does the code do what the paper says?) and then move on to a robustness test of one sort or another (extending the same model to data from new populations, or tweaking the model parameters to see how sensitive the results are).
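
As a purely hypothetical illustration of the “tweak the model” half of that assignment (the specifications and data below are invented for this post, not anything Clemens prescribes), the robustness pass can be as simple as looping over alternative specifications and watching how much the key estimate moves:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 300
    x = rng.normal(size=n)
    y = 2.0 + 0.4 * x + rng.normal(size=n)

    # Alternative design matrices for the same outcome; the question is how much
    # the estimated slope on x depends on these (legitimate) modeling choices.
    specs = {
        "baseline (intercept + x)": np.column_stack([np.ones(n), x]),
        "add quadratic term":       np.column_stack([np.ones(n), x, x ** 2]),
        "drop the intercept":       x.reshape(-1, 1),
    }

    for name, X in specs.items():
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        slope = beta[1] if X.shape[1] > 1 else beta[0]  # coefficient on x
        print(f"{name:25s} slope on x = {slope:.3f}")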

On the other hand, the article loses me when it makes one particularly unfortunate analogy:

The meaning of replication needs to be standardized just as the meaning of “statistically significant” once was. (p. 16)

Because clearly after the definition of “significant” was standardized, there was never any confusion or misunderstanding again. Let’s hope that “replication” and “robustness” have a brighter future than “statistically significant.”

Author: Dan Hirschman

I am a sociologist interested in the use of numbers in organizations, markets, and policy. For more info, see here.

8 thoughts on “replication vs. robustness in social science”

  1. Polysemy is tough. I kept getting tripped up talking to a materials scientist and a biologist: I was using “falsification” in the Popperian sense, and they were using it in the “making up your data” sense.

  2. This is a good distinction, but it would perhaps be more relevant once we take replication and robustness tests more seriously in sociology. Compared with political science, for example, we still have to catch up in providing additional tests and in making replication data available.

  3. Honestly, I don’t like this way of drawing the distinction much at all, at least from the quotes provided. As I understand it, the idea looks just like the familiar distinction between internal and external validity, only with respect to how a subsequent study might be said to be speaking to an original study. All that is well and good, but bringing sampling distributions into it just muddies the picture. In any event, though, replication shouldn’t be reduced to verification: as Harry Collins says somewhere, replication is not simply checking.

    1. The emphasis on distributions is a little confusing. The chart on p. 2 is much clearer.

      And Harry Collins is certainly right that checking shouldn’t be the only thing we do (whatever we call it). But clearly in the modern social sciences, checking (what Clemens calls ‘verification’) is a much-needed start.

      Beyond that, Clemens also includes under replication drawing a new sample from the same population and running the same model again to see if the results substantially differ (what he calls “reproduction”, the second type of replication). This is, I think, closer to what Collins has in mind. Can we get the same result in a different lab but one that does not differ in any way that should be relevant to the finding? That’s the question of reproduction.

      What Clemens doesn’t get into, and what would be interesting to discuss, is how the border between ‘reproduction’ and ‘extension’ depends on the theoretical claims made about the population. So, for example, if a psychologist finds a priming effect in a lab sample of undergrads, but claims to be making inferences about perception among adults in general, then it would count as reproduction to draw a completely different sample from the same population (adults in general) and see if the results hold. On the other hand, if the psychologist were only making claims about the perceptions of contemporary American college students, you’d have to re-sample from that group to reproduce the results – and if you sampled senior citizens, you’d be trying to see if the model is robust to a certain extension rather than whether it can be replicated.

      All this to say: I may have undersold the paper, and I think walking through a few examples with the language is useful for clarifying several thorny issues, for me at least.

  4. I always liked “perfect replication with the same data” which to me meant an expectation of perfect replication of estimates. This should be simple enough to complete with access to the data and a close read of the paper – at least in a perfect world.

    Fitting the same model to different samples should be called a validation check, and failing one isn’t “merely” failing a robustness check – at least not as “merely” as tweaking the model on the same sample. Both are potentially huge issues, but the latter hinges (again, to me) on the justification for the tweak, while the former is the kiss of death if both samples are deemed representative. I think validation is important but risky, because failed validation means that, at best, you have detected a sample effect and, at worst, that the results are random. I think that risk is why it isn’t performed as often as it could be.

  5. I teach my students that “replication” is producing a similar result using new data. In other words: you test the hypotheses in the same, or similar, way using new data and get the same pattern of results, with wiggle permitted in the specific coefficient sizes. On the other hand “duplication” is reproducing the exact same parameter estimates as another paper when using the same data. The key is that “duplication” is sufficient to show that you know how the author(s) of the other study did it, and that another person can produce the same results given the same starting point. But because you ARE using the same data, it provides no independent confirmation that the original result was valid (if the data are weird for some reason, they’re weird for everyone). “Replication” in contrast provides supporting evidence from a different data set and thus serves to reduce the cumulative probability that the result is spurious for some reason. Or, to be really simplistic, “replication” multiplies the alpha error probabilities to get the cumulative likelihood that all results are wrong due to chance, while “duplication” has no impact on cumulative alpha error.

  6. Oh dear, so we all agree on the distinctions but don’t agree on what to call the different things. Could we get some official naming committee? How about this:
    duplication = re-doing the analysis on the same data, making sure there are no mistakes
    replication = re-doing the same analysis on different data
    robustness check = re-doing the analysis on the same data but with different modeling assumptions
    robust replication = re-doing the analysis on different data with different modeling assumptions
