ya down with o.t.t.?

My last post prompted some comments about one-tailed tests versus two-tailed tests, including a post by LJ Zigerell here.  I’ve returned today to an out-of-control inbox after several days of snorkeling, sea kayaking, and spotting platypuses, so I haven’t given this much new thought.

Whatever philosophical grounding exists for one-tailed vs. two-tailed tests is vitiated by the reality that, in practice, one-tailed tests are largely invoked so that one can talk about results in a way one would be precluded from doing if held to two-tailed p-values.  Gabriel notes that this is p-hacking, and he’s right.  But it’s p-hacking of the best sort, because it’s right out in the open and doesn’t change the magnitude of the coefficient.*  So it’s vastly preferable to any hidden practice that biases the coefficient upward in order to get the two-tailed p-value below .05.
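
To make the arithmetic behind that concrete, here is a minimal sketch in Python, with an invented coefficient, standard error, and degrees of freedom: the estimate and its standard error are untouched, and when the estimate falls on the predicted side of zero, the one-tailed p-value is simply half the two-tailed one.

```python
from scipy import stats

# Hypothetical regression output; these numbers are invented for illustration.
beta, se, df = 0.35, 0.19, 120
t = beta / se

p_two_tailed = 2 * stats.t.sf(abs(t), df)  # conventional two-tailed p-value
p_one_tailed = stats.t.sf(t, df)           # one-tailed, for the hypothesis beta > 0

print(f"t = {t:.2f}")                        # same coefficient and t either way
print(f"two-tailed p = {p_two_tailed:.3f}")  # roughly .07: misses .05
print(f"one-tailed p = {p_one_tailed:.3f}")  # roughly .03: clears .05
```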

In general, I’ve largely swung to the view that practices that allow people to talk about results near .05 as providing sort-of evidence for a hypothesis are better than the mischief caused by using .05 as a gatekeeper for whether or not results can get into journals.  What keeps me from committing to this position is that I’m not sure whether it just changes the situation so that .10 is the gatekeeper.  In any event: if we are sticking to a world of p-values and hypothesis testing, I suspect I would be much happier in one in which investigators were expected to articulate what would constitute a substantively trivial effect with respect to a hypothesis, and then use a directional test against that.
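
For what it’s worth, a rough sketch of what that last suggestion could look like, again with invented numbers: pick a threshold below which an effect would count as substantively trivial, and run the directional test against that threshold rather than against zero.

```python
from scipy import stats

# All numbers here are invented for illustration.
beta, se, df = 0.35, 0.10, 500
delta = 0.10  # largest effect the investigator would call substantively trivial

# Conventional directional test against zero
t_zero = beta / se
p_zero = stats.t.sf(t_zero, df)

# Directional test against the triviality threshold:
# H0: beta <= delta  versus  HA: beta > delta
t_delta = (beta - delta) / se
p_delta = stats.t.sf(t_delta, df)

print(f"against zero:      t = {t_zero:.2f}, one-tailed p = {p_zero:.4f}")
print(f"against threshold: t = {t_delta:.2f}, one-tailed p = {p_delta:.4f}")
```

Rejecting the second null is a claim not just that the effect is nonzero, but that it is larger than anything the investigator was prepared to call trivial.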

* I make this argument as a side point in a conditionally accepted paper, the main point of which will be saved for another day.

Author: jeremy

I am the Ethel and John Lindgren Professor of Sociology and a Faculty Fellow in the Institute for Policy Research at Northwestern University.

8 thoughts on “ya down with o.t.t.?”

  1. I’m puzzling through your arguments. The hurricane sex problem wasn’t about substantively trivial effects; it was an implausibly non-trivial effect. It seems to me these are quite different problems. When you have really large samples, even substantively trivial effects are statistically significant. That’s what you seem to be worrying about in this post. But when you have relatively small samples (as you generally do in aggregate research), effects have to be big to be statistically significant, and the problem is that models based on small samples can and do produce very large statistically significant coefficients that are extremely sensitive to model specification. I’ve watched coefficients switch from significantly negative to significantly positive when one more control is added to a model. I suppose this implies that there is a Goldilocks spot where the sample size is just right: large enough to support the analysis but small enough that only non-trivial effects are significant. But mostly I think there is too little recognition of the fact that different kinds of data pose different kinds of analysis problems.

  2. Yeah, neither this issue nor the issue of directional hypotheses has much to do with the hurricane name study. The hurricane name study is almost a counterexample, in that if a real effect existed that was very small, it would probably still have some substantive interest (yet the irony is that in order for any effect to be detected, it had to be wildly big). I don’t think that’s usually the case with social science hypotheses.

    BTW, I do think that there is a Goldilocks spot! Maybe that should get its own post, though.

  3. There are lots of patterns in data; not everything in a randomly generated data set has a zero correlation. If by “effect” you mean genuinely a non-zero coefficient, some of those will necessarily happen randomly. If by “effect” you mean a general causal principle that transcends one data set, in this case that people don’t take things with female names as seriously, it will show up consistently in lots of different data sets. What other things are given names that might be gendered? Or does it evince itself in people not protecting themselves as much against female criminals? What’s the principle?

  4. I agree that switching from a two-tailed test to a one-tailed test is the least worst form of p-hacking, but I also suspect that in many cases it is the *last* form of p-hacking, in which researchers switch to a one-tailed test only if the best that all other forms of p-hacking can produce is a two-tailed p-value that falls between 0.10 and 0.05.

    In other words, I don’t imagine that this scenario occurs often: a researcher makes decisions about how to code missing data, about which controls to include, and about how to measure and code the variables; the researcher conducts a statistical significance test with this model specification; the statistical significance test indicates that the two-tailed p-value is between 0.10 and 0.05; the researcher refuses to fiddle with missing data, control variables, or measurement and coding; but the researcher claims that s/he originally meant to use a one-tailed test.

    1. One of my publications pretty much fits this scenario. I did the analysis, actually had an a priori theory that predicted the direction of the effect, so one-tailed was appropriate, but it wasn’t significant with a two-tailed test.

      1. I’m presuming that social science research — except for a small subset of applied research — requires a two-tailed test.

        One-tailed tests require researchers to treat “I didn’t find an effect at a statistically significant level” the same as “I found an effect at a statistically significant level in the direction opposite from what I expected”; therefore, the a priori theory necessary to justify a one-tailed test is not about the direction of the effect but is about why “I didn’t find an effect” should be treated the same as “I found an effect in the direction opposite from what I expected.”

        This appears to reflect a different issue with one-tailed tests: differentiating one-tailed tests performed as a type of p-hacking from one-tailed tests performed as an honest following of common but apparently incorrect textbook advice: http://onlinelibrary.wiley.com/doi/10.1111/j.1442-9993.2009.01946.x/abstract.

        Not that this would have much or any effect on the proper interpretation of the evidence presented in your study: as Jeremy alluded to, the coefficient and standard error are the same for one-tailed and two-tailed tests. And, anyway, a better way to judge a study is by its research design.
