experimental vs. statistical replication

In the context of all of the debates about replication going on across the blogs, it might be useful to introduce a distinction: experimental vs. statistical replication.* Experimental replication is the more obvious kind: can we run a new experiment using the same methods and produce a substantially similar result? Statistical replication, on the other hand, asks: can we take the exact same data, run the same or similar statistical models, and reproduce the reported results? In other words, experimental replication is about generalizability, while statistical replication is about data manipulation and model specification.
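To make the distinction concrete, here is a minimal sketch of what statistical reproduction amounts to in practice. The file name, variable names, and reported coefficient are all hypothetical stand-ins, not from any particular study:

```python
# Minimal sketch of statistical reproduction: re-run the reported model
# on the authors' own shared data and check the published estimate.
# The file name, variables, and reported coefficient are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("study.csv")  # the exact data the authors shared

model = smf.ols("outcome ~ treatment + controls", data=df).fit()

REPORTED_COEF = 0.42                     # hypothetical value from the paper
reproduced = model.params["treatment"]   # what the same data + model return

# Statistical reproduction succeeds if the numbers match up to rounding.
print(f"reported={REPORTED_COEF}, reproduced={reproduced:.2f}")
print("reproduced" if abs(reproduced - REPORTED_COEF) < 0.005 else "discrepancy")
```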

On the one hand, sociology, economics, and political science all have ongoing issues with statistical replication. The big Reinhart and Rogoff controversy was the result of an attempt to replicate a statistical finding, which revealed unreported shenanigans in how cases were weighted and that some cases had simply been dropped in error. Gary King’s work on improving replication in political science aims at making this kind of replication easier, and even turning it into a standard part of the graduate curriculum. Similarly, I believe the UMass paper that (failed to) replicate Reinhart and Rogoff emerged out of an econometrics class assignment that required students to statistically replicate a published finding.
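To see how much a quiet weighting choice can matter, here is a toy illustration with made-up numbers (not Reinhart and Rogoff’s actual data): averaging within countries first gives a country with a single observed year the same weight as a country with nineteen.

```python
# Toy illustration (made-up numbers) of how a weighting choice can move
# a headline estimate: averaging within countries first gives every
# country equal weight, regardless of how many country-years it has.
import pandas as pd

data = pd.DataFrame({
    "country": ["A"] * 19 + ["B"],   # A: 19 high-debt years; B: just 1
    "growth":  [2.5] * 19 + [-7.9],  # hypothetical growth rates
})

pooled = data["growth"].mean()                                # every year counts equally
by_country = data.groupby("country")["growth"].mean().mean()  # every country counts equally

print(f"pooled country-years:        {pooled:.2f}")      # 1.98
print(f"average of country averages: {by_country:.2f}")  # -2.70
```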

On the other hand, psychology seems to have a big problem with experimental replication. Here the concerns are less about model specification (as the models are often simple, bivariate relationships) or data coding and more about implausibly large effects and the “file drawer problem,” in which published results are biased toward significance (which in turn makes replications much more likely to produce null findings).
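The file-drawer mechanism is easy to demonstrate in a toy simulation (all the numbers here are invented for illustration): if only significant results get published, published effect sizes from underpowered studies will systematically overstate the true effect, and faithful replications will look like failures.

```python
# Toy simulation of the "file drawer" (all numbers invented): run many
# underpowered studies of a small true effect, publish only the
# significant positive results, and the published literature overstates
# the effect, so faithful replications regress toward null.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
TRUE_EFFECT, N, STUDIES = 0.1, 20, 10_000  # hypothetical parameters

published = []
for _ in range(STUDIES):
    treatment = rng.normal(TRUE_EFFECT, 1, N)
    control = rng.normal(0.0, 1, N)
    t, p = stats.ttest_ind(treatment, control)
    if p < 0.05 and t > 0:  # only "significant" results get written up
        published.append(treatment.mean() - control.mean())

print(f"true effect:           {TRUE_EFFECT}")
print(f"share published:       {len(published) / STUDIES:.1%}")
print(f"mean published effect: {np.mean(published):.2f}")  # well above the true 0.1
```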

Both of these kinds of replication are clearly important, but they present somewhat different issues. For example, Mitchell’s concern that replication will be incompetently performed, and thus produce null findings when real effects exist, makes less sense in the context of statistical replication, where the choices made by the replicator can be reported transparently and the data are shared by all researchers. So, as an attempt at an intervention, I propose we try to make clear when we’re talking about experimental replication vs. statistical replication, or whether we really mean both. Perhaps we might even call the second kind of replication something else, like “statistical reproduction,”** in order to highlight that the attempt to reproduce the findings is not based on new data.

What do you all think?

* H/T Sasha Killewald for a conversation about different kinds of replication that sparked this post.
** Think “artistic reproduction” – can I repaint the same painting? Can I re-run the same models and data and produce the same results?

a study in scarlet

[Figure: red shirt/ovulation graph from the study]
(N=100 on the left, N=24 on the right, one data point per person, observational study)

Andrew Gelman and I exchanged e-mails a while back after I made his lexicon for a second time. That prompted me to check out his article in Slate about a study published in Psychological Science finding that women were more likely to wear red/pink when at “high conception risk,” and then I read the original article.

I don’t want to get into Gelman’s critique, although notably it included the question of whether the authors were correct to measure “high conception risk” as 6-14 days after a woman starts menstruating (see Gelman’s response to the authors’ response about this). And I’m not here to offer an additional critique of my own.

I’m just looking at the graph and marveling at the reported effect size, and inviting you to do the same. Of the women wearing red in this study, 3 out of 4 were at high conception risk. Of the women not wearing red, only 2 out of 5 were.*
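Just to put a number on that marveling, here is the arithmetic on the stated fractions (a back-of-the-envelope calculation from the reported proportions, not the paper’s actual cross-tab):

```python
# Back-of-the-envelope effect size from the fractions stated above:
# 3/4 of red-wearers vs. 2/5 of non-red-wearers were at high conception
# risk. Arithmetic on the reported proportions, not the actual cross-tab.
p_red, p_not_red = 3 / 4, 2 / 5

odds_red = p_red / (1 - p_red)              # 3.0
odds_not_red = p_not_red / (1 - p_not_red)  # about 0.67
odds_ratio = odds_red / odds_not_red        # 4.5

print(f"implied odds ratio: {odds_ratio:.1f}")
```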

UPDATE: Self-indulgent even by blog standards, but since I could see using this example again somewhere and it took some effort to reconstruct, I’m going to paste in the cross-tab here: Continue reading “a study in scarlet”

the bigfoot-black swan continuum of behavioral science

Jason Mitchell uses the example of “black swans” to argue that there is a fundamental asymmetry between positive and negative findings in psychology experiments, such that positive findings are the only meaningful findings and negative findings should not be published. The idea is that no matter how many white swans you observe, you don’t know if black swans exist; whereas if you observe one black swan, you know they do (full quote at bottom).

The problem: findings from behavioral science experiments aren’t like being able to hold a black swan by the neck and shout to everyone, “See! I told you they existed!”

Instead, you are presented with papers in which you have to trust researchers’ reports of what they did to produce a finding that an observed swan was darker than would be expected under the white-swan null (p < .05).

In this respect, positive experimental findings are somewhere on a continuum between Bigfoot and black swans. Continue reading “the bigfoot-black swan continuum of behavioral science”

why so much psychology?

(Zeroth in a series) I’ve been interested in the sociology of psychology ever since my dissertation, but the recent dramas in social psychology have made this interest, like Tinder at the Olympic Village, “next level.” (Also, I’ve a genuinely remarkable advisee, David Peterson*, whose dissertation involves a multisite lab ethnography of psychology, and even though we’ve got nine thousand miles between us, we’ve been corresponding on these issues quite a bit.)

I’m just explaining here what’s going on if you ever wonder, “Why does Jeremy talk so much about psychology?” Also, I worry that a lot of my concern about psychology appears to be strictly methodological, but much of the methodological critique adds up to a dire substantive point that I think sociologists should be extremely concerned about. But that’s a teaser for another post.

For now, let me link to one of the latest turns in the drama: a post by a Harvard psychologist arguing strongly against the value of replication at all, by (as far as I can tell) unwittingly following Harry Collins’s “experimenter’s regress” all the way to a sort of anti-replicationist fundamentalism. Continue reading “why so much psychology?”