The hurricane name people have now issued a third statement. Most of it shows that if you run the model several different ways, the key result remains significant. They still do not use logged damage*, even though logging damage is both theoretically better justified and fits the data better (but makes their key result non-significant). So, except toward the end, everything they do simply shows that the results remain significant if you specify damage the wrong way. Specify it the better way, and all of these same analyses are non-significant for their key result.
But, toward the end (p. 11), they introduce a new twist:
They take up the point I made about how they report two severity interactions and one of them is actually in the wrong direction for their hypothesis. Because I don’t believe any of these effects are “real” to any extent remotely close to what they are reporting, I have no stake in the opposite effect being significant. Nevertheless it’s telling how they argue the point.
From nowhere, they bust out bootstrapped standard errors.** Understand: they published the interaction in their paper using naive standard errors. They have a little table note about how the interactions remain significant even if you use robust standard errors (see last sentence). Nowhere do they say anything about bootstrapped standard errors. But now that it has been pointed out that an ostensibly key interaction actually has the wrong sign, they are new enthusiasts for bootstrapped SEs.
Why? Because, like statistical chemotherapy, even though it slightly poisons their key result, it still leaves it alive just below the conventional statistical cutoff (p = .035). But the diseased result in the wrong direction is now above the .05 threshold (p = .077) (see page 12). So they argue it isn’t “robust” and shouldn’t be interpreted. Granted, just a few pages later, they are more than happy to talk about a different analysis that leaves their key result “marginally significant” (p = .09), without raising any alarm bells.
(Again though: the key result wouldn’t be close to significant if they were using a logged damage measure.)
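For readers who haven't run into the move: bootstrapped standard errors come from re-fitting the model on many resamples of the data (rows drawn with replacement) and taking the spread of the coefficient across those re-fits. A minimal sketch in Python/numpy, with entirely made-up data and a plain OLS slope standing in for their model (none of this is the authors' data or specification):

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake data: outcome y depends linearly on x, plus noise.
n = 200
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])  # intercept + predictor

def ols_slope(X, y):
    # Ordinary least squares fit; return the slope coefficient.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

# Nonparametric bootstrap: resample rows with replacement,
# re-fit, and take the SD of the slope across resamples.
boot_slopes = np.array([
    ols_slope(X[idx], y[idx])
    for idx in (rng.integers(0, n, size=n) for _ in range(1000))
])
boot_se = boot_slopes.std(ddof=1)
```

The point is that the bootstrap SE is a different estimate of the same uncertainty, so a coefficient can slide from one side of p = .05 to the other depending on which flavor of SE you report.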
What might make all this particularly timely is that, mere days ago, a psychologist raised the concept of p-squashing, in which motivated analytic decisions make a result non-significant. The accusation was leveled at people who try to replicate psychology studies and get null results. The notion is that perhaps they are conducting studies in ways that induce negative findings for hypotheses that are actually true.
The phrase “A Crisis of False Negatives” was even used. That is hilarious on its face since psychology hardly even publishes negative findings to begin with, and psych studies have uncanny track records of getting statistically significant results with demonstrably underpowered research designs.
Anyway, the stuff about p-squashing in the context of replication seemed so much like grasping at straws that I thought it should be called "p-sasquatching" until somebody could provide clear evidence of it really occurring.
The example here raises a different scenario for p-squashing: that new ways of doing an analysis might be introduced for the purposes of making an inconveniently significant result go away. So it could be that the most common scenario in which p-squashing does occur is one with a broader context of p-hacking a different result.
* Again, you take the dollar amount, you log it, and then if you want to standardize it, fine. I show the code for this at the bottom of my previous post.
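If you want the transformation spelled out (my previous post showed it in the stats package the authors used), a hedged Python/numpy equivalent would look like this; the variable name `damage_dollars` and the example values are my own, not theirs:

```python
import numpy as np

def log_and_standardize(damage_dollars):
    """Take raw dollar damage, log it, then z-standardize the logged values.

    `damage_dollars` is any array of strictly positive damage amounts;
    standardizing afterward is optional and changes nothing inferentially.
    """
    logged = np.log(np.asarray(damage_dollars, dtype=float))
    return (logged - logged.mean()) / logged.std()

# Example: damages spanning several orders of magnitude
z = log_and_standardize([1e6, 5e7, 2e9, 9e10])
```

The result has mean 0 and SD 1 on the log scale, which is exactly why logging matters here: it keeps a handful of catastrophic storms from dominating the fit.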
** So, are they claiming that they looked at these bootstrapped standard errors before they published their paper? Not clear. I’ll admit to being flattered at how, while all the other analyses in their statement use old-school syntax for interaction effects (constructed variables with names like mxn), the analysis with bootstrapped SEs suddenly shifts to using the factor-variable syntax I used when showing how you can replicate their key model in half a tweet.