Economist Michael Clemens has posted a very useful working paper attempting to bring order to the chaotic discussions of replication tests in the social sciences. Clemens argues for a clean conceptual distinction between replication tests on the one hand, and robustness tests on the other.
A replication test “estimates parameters drawn from the same sampling distribution as those in the original study.” (p. 1) A robustness test, on the other hand, “estimates parameters drawn from a different sampling distribution from those in the original study.” (p. 2) Replication, in other words, means attempting to verify that a model gives the results that an author reports (by literally re-running it to check for coding errors, or by checking for sampling error by resampling the same population and then running the same model), while a robustness test involves tweaking the modeling strategy (“reanalysis”) or trying to use the same model on a new population (“extension”), or both.
Clemens argues that conflating replication and robustness leads to all sorts of confusion, including defensiveness by authors whose work has failed a robustness test but is accused of failing to replicate. Saying that a paper fails to “replicate” should mean that the result is flat out wrong (and perhaps embarrassingly so, as in the case of the famed Reinhart-Rogoff spreadsheet errors); saying that it failed a robustness test simply means that the finding either fails to extend to other contexts (suggesting that circumstances have changed, or that some scope condition was not identified) or that the result is sensitive to particular (legitimate) modeling decisions. Clemens ends the paper by showing that many, but not all, recent debates over failed “replications” are better characterized as debates about robustness.
I like Clemens’ language a lot, and I think it could help to clarify many debates just as he hopes. It would also help systematize the “replication” assignment in third-semester PhD stats classes – the requirement could be to start with a simple verification (does the code do what the paper says?) and then move on to a robustness test of one sort or another (extending the same model to data from new populations, or tweaking the model specification to see how sensitive the results are).
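To make the distinction concrete, here is a minimal sketch of what such an assignment might look like, using simulated data and a simple OLS regression (the variable names, effect sizes, and “populations” are all made up for illustration): verification re-runs the original specification on the original data, reanalysis tweaks the specification on the same data, and extension fits the original specification to data from a different population.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500

# Hypothetical "original study" data from population A:
# outcome y, variable of interest x, covariate z.
z = rng.normal(size=n)
x = rng.normal(size=n) + 0.5 * z
y = 1.0 + 2.0 * x + 0.8 * z + rng.normal(size=n)
pop_a = pd.DataFrame({"y": y, "x": x, "z": z})

# Replication (verification): re-run the exact specification on the exact
# data, and check that it reproduces the reported coefficient.
original = smf.ols("y ~ x", data=pop_a).fit()
print("original estimate of x:   ", original.params["x"])

# Robustness (reanalysis): same data, different specification --
# here, adding a control the original model omitted.
reanalysis = smf.ols("y ~ x + z", data=pop_a).fit()
print("reanalysis estimate of x: ", reanalysis.params["x"])

# Robustness (extension): same specification, but estimated on data
# from a different population B, where the true effect may differ.
zb = rng.normal(size=n)
xb = rng.normal(size=n) + 0.5 * zb
yb = 1.0 + 1.0 * xb + 0.8 * zb + rng.normal(size=n)  # weaker effect in B
pop_b = pd.DataFrame({"y": yb, "x": xb, "z": zb})
extension = smf.ols("y ~ x", data=pop_b).fit()
print("extension estimate of x:  ", extension.params["x"])
```

In Clemens’ terms, only the first step is a replication test; the other two draw estimates from different sampling distributions than the original, and so are robustness tests.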
On the other hand, the article loses me when it makes one particularly unfortunate analogy:
The meaning of replication needs to be standardized just as the meaning of “statistically significant” once was. (p. 16)
Because clearly after the definition of “significant” was standardized, there was never any confusion or misunderstanding again. Let’s hope that “replication” and “robustness” have a brighter future than “statistically significant.”