I ended a recent post talking about how **we shouldn’t blame the media for overstating research findings when the overstatements start in the press releases that universities and journals distribute**. This led me to start looking around at more press releases for studies I remembered as getting a lot of attention.

Try it yourself: *you might be surprised by all the surprise*. A common narrative is that something inspires a hypothesis, researchers conduct a study to test that hypothesis, and then, more than merely finding a result that supports their hypothesis, the researchers are *shocked* by how big the effect turns out to be.

How should one think about all this purported surprise?

Let me offer a diagnostic: given the design of the study, **how big of an effect would have been necessary in order for the result to be publishable?** If the only way somebody could have gotten a publishable result is by finding an effect about as big as what they found, doesn’t it seem fishy to claim to be shocked by it? After all, doing a study takes a lot of work. Why would anybody do it if the only reason they ended up with publishable results is that luckily the effect turned out to be much more massive than what they’d anticipated?

**Wouldn’t you hope that if somebody goes to all the trouble of testing a hypothesis, they’d do a study that could be published as a positive finding so long as their results were merely consistent with what they were expecting?**

Let’s consider this for fields, like lots of sociology, in which a quasi-necessary condition of publishing a quantitative finding as positive support for a hypothesis is being able to say that it’s statistically significant–most often at the .05 level. Now say a study is published for which the p-value is only a little less than .05. **Here it is obviously dodgy for researchers to claim surprise.** They went ahead and did their study, but had the estimated effect been much smaller than their “surprise” result, they wouldn’t have been able to publish it.

Now think of the case where observing an effect 50% bigger than expected would be a shock. Different research questions are going to differ as to what counts as surprising, but this feels like a lower-bound case for a lot of research areas I’m familiar with. In that case, we should be suspicious unless they are talking about a result with a p-value less than **.003**. Otherwise, they would not have achieved a significant result if they had merely observed an effect size that was in line with their expectations at the start of the study.

What about being shocked by an effect that was twice as large as anticipated? That would imply the p-value should be less than **.00009**, because again otherwise their expected result would not have been significant. What if shock implies an effect three times bigger than anticipated? The corresponding p-value should be below **.000000004**.
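For readers who want to check these thresholds, here is a quick sketch of the arithmetic, assuming the simplest case of a two-tailed z test: an effect that just clears p = .05 sits at z ≈ 1.96, so an effect some multiple of that size sits at 1.96 times that multiple, and the p-value follows directly.

```python
import math

def p_two_tailed(z):
    """Two-tailed p-value for a standard-normal test statistic z."""
    return math.erfc(abs(z) / math.sqrt(2))

z_crit = 1.959964  # z that just clears p = .05, two-tailed

for multiple in (1.5, 2, 3):
    p = p_two_tailed(z_crit * multiple)
    print(f"effect {multiple}x the just-significant size: p = {p:.1g}")
```

which reproduces the .003, .00009, and .000000004 figures above.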

[Technical aside: One could say that I’m understating the matter, because the above results imply a researcher goes ahead and does a study with 50% power. That is, I’m presuming researchers will forge ahead with a study so long as, if there’s an effect size around the size they were thinking, there’s a 50/50 chance it will show up as significant. *For all that work, wouldn’t you think they’d want better odds?* However, given all the not-necessarily-unreasonable ways of presenting p-values above .05 as positive findings — one-tailed tests, “marginally significant”, “approaching significance”, “substantively significant” — having a 50% chance of observing p < .05 means having a better-than-50% chance of landing at least near .05.]
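The 50% power claim in the aside can be checked with a small simulation (again a sketch, assuming a z test): if the true effect exactly equals the just-significant effect of 1.96 standard errors, the observed z statistic is centered on 1.96, so about half of studies clear the threshold.

```python
import random

random.seed(0)
z_crit = 1.959964  # just-significant z for two-tailed p = .05
n_sims = 100_000

# Draw observed z statistics centered on a true effect that exactly
# equals the just-significant effect; count how many come out significant.
hits = sum(abs(random.gauss(z_crit, 1)) > z_crit for _ in range(n_sims))
print(hits / n_sims)  # roughly 0.50
```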

Of course, a different problem is that **researchers may begin an empirical project without any actual notion of what effect size their hypothesis implies**. Or, even if they do have some idea about the size of the effect, they still have no idea what chance they have of actually finding positive evidence for that hypothesis given the study they are embarking on.

**I’ve taken to calling such projects Columbian Inquiry.** Like brave sailors, researchers simply point their ships at the horizon with a vague hypothesis that there’s eventually land, and perhaps they’ll have the rations and luck to get there, or perhaps not. Of course, after a long time at sea with no land in sight, sailors start to get desperate, but there’s nothing they can do. Researchers, on the other hand, have a lot more longitude–I mean, *latitude*–to terraform new land–I mean, *publishable results*–out of data that might to a less motivated researcher seem far more ambiguous in terms of how it speaks to the animating hypothesis. But I’ve already said enough for one post.

I’m not sure about *Columbian Inquiry* as the term for this, although I otherwise like the metaphor. Suggestions for alternative terms welcome.


I have never asked what size effect I will need to get a p<.05 result. In fact, I don't know how to know that.

However, I do look at the substantive size of the effects I discover and some predicted values. If they're unreasonable I add some more control variables. Just kidding (about the last part).


> I don’t know how to know that

Why, I’ve been kicking around the idea of writing a paper on this very topic! One sticking point was figuring out a good working example to illustrate the problem. Then this hurricane name study happened.


That’s great – I’d love to see it. In my case it’s probably because most of my models are in the well-worn grooves laid by previous work – new variable, updated data, etc – and large datasets. Asking original questions with new (small) data is probably riskier this way.


Good point. Although if you’re in that groove, you probably would have a lot of information about what you’d expect the standard error of a model to be even before you run an analysis? So then if you multiply that by 2 you know how big an effect would need to be in order to be significant.
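To make the arithmetic in this reply concrete (a sketch with a made-up standard error, not one taken from any real model):

```python
# Rule of thumb from the reply above: if past work tells you roughly what
# standard error to expect, the smallest effect that clears p < .05
# two-tailed is about 1.96 times that standard error.
expected_se = 0.04  # hypothetical value, assumed for illustration
min_detectable = 1.96 * expected_se
print(min_detectable)  # 0.0784
```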


OK, while we’re embarrassing me for pedagogical purposes: No, I rarely pay attention to standard errors except insofar as they affect p-values (and, in the old days, lots of my papers didn’t show SEs unless a journal asked for it). I think this makes me (people like me) a perfect audience for your future paper, and my future students will have their sh*t together even better than me if you write it.


An important illustration of this was the Oregon Medicaid experiment. The general expectation was that the effects would be in the right direction, but nobody seems to have thought in advance about the effect sizes and the statistical power needed to measure them. It was only *after* the results came back insignificant (but in the expected direction) that folks like Austin Frakt became obsessed with power calculations and saying it wasn’t insignificant, just underpowered.

Not to blame the media again, but there is strong selection of big effects into media stories. Also, for some reason reporters *always* ask, “What surprised you about your results?”

I would have thought there was strong selection for particular topics and not so much for large effects, although it is plausible that larger effects make for more interesting fodder.

Because of this strong selection for only headline-grabbing articles to make the news, David Spiegelhalter basically suggests you shouldn’t trust the results of any newsworthy article. http://understandinguncertainty.org/heuristic-sorting-science-stories-news-0


Isn’t part of the problem the excessive premium that sociologists, at least, have been putting on having something new to say to justify publication? And needing to explain why even theoretically predicted findings are somehow unexpected and therefore interesting?
