the hurricane name study tries again

(Note: earlier posts on this study here, here, here, and here)

The hurricane name people have now issued a third statement. Most of it shows that if you run the model several different ways, the key result remains significant. They still do not use logged damage*, even though logging damage is both theoretically better justified and fits the data better (but makes their key result non-significant). So except toward the end, everything they do is simply showing that the results remain significant if you specify damage the wrong way. Specify it the better way, and all these same analyses are non-significant for their key result.

But, toward the end (p. 11), they introduce a new twist:

They take up the point I made about how they report two severity interactions and one of them is actually in the wrong direction for their hypothesis. Because I don’t believe any of these effects are “real” to any extent remotely close to what they are reporting, I have no stake in the opposite effect being significant. Nevertheless it’s telling how they argue the point.

From nowhere, they bust out bootstrapped standard errors. Understand: they published the interaction in their paper using naive standard errors. They have a little table note about how the interactions remain significant even if you use robust standard errors (see last sentence). Nowhere do they say anything about bootstrapped standard errors. But, now that it has been pointed out that an ostensibly key interaction actually has the wrong sign, they are new enthusiasts for bootstrapped SEs.**
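For readers unfamiliar with the technique at issue: a nonparametric bootstrap re-estimates the model on many resamples of the cases and takes the spread of the estimates as the standard error. A minimal sketch, using made-up data and a generic OLS interaction model (not the authors' actual model or the hurricane dataset):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data standing in for the hurricane data (illustrative only):
# y ~ deaths, x ~ femininity score, z ~ a severity measure.
n = 92
x = rng.normal(size=n)
z = rng.normal(size=n)
y = 2 + 0.5 * x + 0.3 * z + 0.4 * x * z + rng.normal(size=n)

# Design matrix: intercept, main effects, and the interaction term.
X = np.column_stack([np.ones(n), x, z, x * z])

def interaction_coef(X, y):
    # OLS via least squares; index 3 is the interaction coefficient.
    return np.linalg.lstsq(X, y, rcond=None)[0][3]

# Nonparametric (case-resampling) bootstrap: draw n cases with
# replacement, re-fit, and record the interaction estimate.
B = 2000
boot = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)
    boot[b] = interaction_coef(X[idx], y[idx])

# The SD of the bootstrap estimates is the bootstrapped SE.
se_boot = boot.std(ddof=1)
print(round(se_boot, 3))
```

Whether the resulting p-value lands just under or just over .05 can differ from the naive or robust-SE versions, which is exactly the lever being pulled in the statement.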

Why? Because, like statistical chemotherapy, even though it slightly poisons their key result, it still leaves it alive just below the conventional statistical cutoff (p = .035). But the diseased result in the wrong direction is now above the .05 threshold (p = .077) (see page 12). So they argue it isn’t “robust” and shouldn’t be interpreted. Granted, just a few pages later, they are more than happy to talk about a different analysis that leaves their key result “marginally significant” (p = .09), without raising any alarm bells.

(Again though: the key result wouldn’t be close to significant if they were using a logged damage measure.)

What might make all this particularly timely is that, mere days ago, a psychologist raised the concept of p-squashing, in which motivated analytic decisions make a result non-significant. The accusation was leveled at people who try to replicate psychology studies and get null results. The notion is that perhaps they are conducting studies in ways that induce negative findings for hypotheses that are actually true.

The phrase “A Crisis of False Negatives” was even used. That is hilarious on its face since psychology hardly even publishes negative findings to begin with, and psych studies have uncanny track records of getting statistically significant results with demonstrably underpowered research designs.

Anyway, the stuff about p-squashing in the context of replication seemed so like grasping at straws that I thought it should be called “p-sasquatching” until somebody could provide clear evidence of it really occurring.

The example here raises a different scenario for p-squashing: that new ways of doing an analysis might be introduced for the purposes of making an inconveniently significant result go away. So it could be that the most common scenario in which p-squashing does occur is one with a broader context of p-hacking a different result.

* Again, you take the dollar amount, you log it, and then if you want to standardize it, fine. I show the code for this at the bottom of my previous post.
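That transformation can be sketched in a few lines; the dollar figures below are invented for illustration, not taken from the hurricane dataset:

```python
import numpy as np

# Illustrative damage figures in dollars (made up for this sketch).
damage = np.array([1.5e9, 2.0e7, 8.0e8, 3.2e6, 5.0e10])

# Take the natural log of the dollar amount...
log_damage = np.log(damage)

# ...then z-standardize it, if standardization is wanted.
z_damage = (log_damage - log_damage.mean()) / log_damage.std(ddof=1)

print(z_damage.round(2))
```

The point of logging first is that damage is extremely right-skewed, so standardizing the raw dollar amounts lets a handful of catastrophic storms dominate the scale.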

** So, are they claiming that they looked at these bootstrapped standard errors before they published their paper? Not clear. I’ll admit to being flattered at how, while all the other analyses in their statement use old-school syntax for interaction effects (constructed variables with names like mxn), the analysis with bootstrapped SEs suddenly shifts to using the factor-variable syntax I used when showing how you can replicate their key model in half a tweet.

Author: jeremy

I am the Ethel and John Lindgren Professor of Sociology and a Faculty Fellow in the Institute for Policy Research at Northwestern University.

4 thoughts on “the hurricane name study tries again”

    1. Interesting analysis. As for damage v. pressure, I thought the basic intuition of the authors about this in one of their statements seemed reasonable. If you had to choose one measure to predict the number of deaths: a measure of property damage or a meteorological measure of severity, the property damage seems like it should be a better predictor, since property damage is greater when a hurricane strikes where people live.

      BUT, one could then step in and wonder about an endogeneity problem with property damage. After all, there are measures people take when informed about an impending hurricane to try to minimize how much property damage it can do (boarding up windows). So the same logic that leads people to hypothesize that feminine hurricanes kill more people could be used to argue that feminine hurricanes cause more damage. From that standpoint, measures like pressure that are completely exogenous to human action would be strongly preferable to using hurricane damage.


      1. I certainly wouldn’t disagree with that analysis – this isn’t my area (I do ecology and statistics). I just like being able to pick apart the evidence.

