should social scientists allow non-directional hypotheses? any examples?

I recently read a paper in which the author hadn’t offered directional hypotheses. They were just of the form that X was expected to be associated with Y. My reaction was that a non-directional hypothesis was not much of a hypothesis, and I made a comment along the lines of, “you should probably develop your ideas until you have more of a sense of how X and Y might be associated before you try to test the hypothesis with data.”

This led me to wonder whether I have a more general position about the specificity required for something to be a substantively meaningful social science hypothesis. Does anyone have an example of something in social science where the hypothesis is non-directional (or is just a hypothesis that something “matters”), and where the hypothesis is not trivial? If so, please let me know.

Author: jeremy

I am the Ethel and John Lindgren Professor of Sociology and a Faculty Fellow in the Institute for Policy Research at Northwestern University.

15 thoughts on “should social scientists allow non-directional hypotheses? any examples?”

  1. I’ve seen a ton of papers with bidirectional hypotheses: first the author provides theoretical explanations for why the relationship would be positive, then theoretical explanations for why it would be negative. Then, lo and behold, half the hypotheses are confirmed! I’m sure you must have seen this too.

    And, in fact, this is back to the Stinchcombe bee.

    Personally, I’ve gotten quite un-enamored of the whole hypothesis testing schtick. I’m now in favor of really careful descriptive work that tries to nail down what the empirical pattern really is and then discusses how this relates to other research and reflects on theory.

    1. As usual, I agree completely with OW.

      Rob Sampson’s synthesis in the MTO debate in AJS in 2008 provided a very strong affirmative case for careful descriptive work tied to theory — although he wasn’t challenging hypothesis testing as much as the knee-jerk insistence on causal inference being the only appropriate epistemological approach to quantitative research.

  2. In fact, don’t some soc journals require two-tailed tests? I remember an ASR paper that made a big deal out of a non-significant effect (in an over-specified model). If they had used one-tailed tests (for a well-established unidirectional hypothesis) it would have been significant and the paper would have been pointless.

1. Do you remember which ASR paper that was? That could be a p-squashing example.

      The question is tied to thinking about one-tailed versus two-tailed tests, but I’m not ready to talk about how yet.

      1. Funny how (my) memory works. What I said was true except for the end, “and the paper would have been pointless.” Turns out there were a lot of other points in the paper, but I only remember the one that ran against my dissertation (which back in 2001 seemed like a big deal to me, for whatever reason). So I shouldn’t diss the whole paper, and I’m sorry I did.

        The paper was, “Sources of Racial Wage Inequality in Metropolitan Labor Markets: Racial, Ethnic, and Gender Differences,” by Leslie McCall. http://www.jstor.org/stable/3088921.

A big part of that paper is about whether certain things “matter,” and so she looks at variance explained rather than direction. But in table 6, the effect of labor market population percent Black on Black men’s relative wage is -.011, with standard error .006. Not significant at .05 with a two-tailed test.

        She wrote:

        “Second, areas with relatively large black populations continue to have larger wage gaps between blacks and whites, but the effect is not significant for men in the full model.” And then summarized, “Industrial structure remains one of the most important sources of black/white wage inequality, while demographic structure emerges as the most important source of Latino/white and Asian/white inequality for men and women,” the latter conclusion partly resting on the lack of significant concentration effect for Black men.

        The unfortunate percent Black variable in this model is laboring under the burden of sharing its variance with regional dummy variables and industrial variables that partly result from race-related employment practices. I believe it in fact did rise even to that unfair challenge, and should be given appropriate credit for its effort.

        In this case, wouldn’t a one-tailed test be appropriate – with decades of literature pointing toward an expected effect in that direction?
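        For concreteness, here’s a quick check of the arithmetic with the numbers reported above (coefficient -.011, standard error .006). This is just a sketch using the normal approximation rather than whatever exact reference distribution the paper used:

        ```python
        from scipy.stats import norm

        b, se = -0.011, 0.006          # coefficient and SE reported in table 6
        z = b / se                     # about -1.83

        p_two = 2 * norm.sf(abs(z))    # ~0.067: misses the .05 cutoff
        p_one = norm.sf(abs(z))        # ~0.033: significant under a pre-specified
                                       # one-tailed (negative) hypothesis

        print(f"z = {z:.2f}, two-tailed p = {p_two:.3f}, one-tailed p = {p_one:.3f}")
        ```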

      2. Strictly speaking, use of a one-tailed test means that you cannot differentiate between “no effect at a statistically significant level” and “an effect at a statistically significant level, but in the direction opposite to what we expected.” So use of a one-tailed test in Philip’s example would mean that we have assigned all the acceptable error to the negative tail either because (1) it is impossible for labor market population percent black to have a positive effect on black men’s relative wage or (2) we don’t care whether labor market population percent black has a positive effect on black men’s relative wage.

        Decades of literature do not make something impossible, and I don’t see how the literature provides a reason to treat no effect the same as a positive effect in this case, so I don’t think that a one-tailed test would be appropriate in this case. Instead, a better avenue for providing credit to the labor market population percent black variable is to assess the robustness of that variable to alternate reasonable model specifications.

        For testing someone else’s theory, a one-tailed test would be appropriate only if it is more important to test the theory than it is to differentiate “no effect” from “an effect in the opposite direction than the theory predicted.” If we are interested in the underlying effect more than we are interested in providing evidence that someone’s theory is correct or incorrect, it might be better to start with a two-tailed test and then — if the coefficient does not reach statistical significance — conduct a one-tailed test to assess the theory under the most favorable testing conditions.

        I haven’t seen this in practice, but I have seen it in textbooks: performing a single test with uneven tails, say, 4% acceptable error in the tail in the hypothesized direction and 1% acceptable error in the other tail. That uneven allocation accounts for the strength of the theory but also permits reporting strong evidence that the theory has the direction wrong.
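        To make the uneven-tails idea concrete, here’s a minimal sketch of the decision rule it implies, assuming a normal reference distribution and the textbook 4%/1% split described above:

        ```python
        from scipy.stats import norm

        alpha_hyp, alpha_other = 0.04, 0.01   # uneven split of the 5% total error

        # Hypothesized direction is negative, so the 4% goes in the lower tail.
        crit_lower = norm.ppf(alpha_hyp)      # about -1.75
        crit_upper = norm.isf(alpha_other)    # about +2.33

        def decide(z):
            """Conclusion the uneven-tails test supports for a given z statistic."""
            if z <= crit_lower:
                return "effect in the hypothesized (negative) direction"
            if z >= crit_upper:
                return "strong evidence the theory has the direction wrong"
            return "no statistically significant effect"

        print(decide(-1.83))   # the z from the example above
        ```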

3. That’s interesting. In this case it’s not just previous research, but previous models using the same data with a different combination of variables:

        “…employs a hierarchical linear model for individual and metropolitan area data from the 1990 Census. Principal findings are that greater relative black population size is associated with (1) higher white earnings and lower black earnings for men and women, and (2) reduced gender inequality among black workers.”

That’s from a paper published a few years earlier (not cited): http://www.terpconnect.umd.edu/~pnc/sf98.pdf.

        I guess you could apply that two-tailed logic to the model with new variables added.

4. This is exactly the kind of issue that makes me feel that the whole hypothesis-testing framework is counter-productive. The real issues are specification error (what else is or is not in the model) and power issues (thanks to Jeremy for naming it this way), or what I’ve always thought of as the degrees of freedom problem, i.e., are there enough independent data points to support the complexity of the analysis. If you really work a data set, you will find that tweaking what is or is not in a regression model can change your interpretation of effects, and when that happens, I want to call a halt and carefully work until I know what the stable empirical patterns in the data really are. Lots of times what you really know is that there is this cluster of correlated independent variables whose effects you cannot actually unpack with the data you have.
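        A toy simulation (all numbers invented) of that last point: two highly correlated predictors whose separate coefficients swing depending on the specification, even though nothing about the data changes:

        ```python
        import numpy as np

        rng = np.random.default_rng(0)
        n = 200

        # Two highly correlated predictors (r ~ .95); the outcome depends on both.
        x1 = rng.normal(size=n)
        x2 = 0.95 * x1 + np.sqrt(1 - 0.95**2) * rng.normal(size=n)
        y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)

        def ols(X, y):
            """OLS coefficients, with a column of ones prepended for the intercept."""
            X = np.column_stack([np.ones(len(X)), X])
            return np.linalg.lstsq(X, y, rcond=None)[0]

        # The coefficient on x1 nearly doubles when x2 is dropped, because x1
        # absorbs the effect of its correlated partner.
        print("x1 alone:  ", ols(x1[:, None], y)[1])                   # ~1.95
        print("x1 with x2:", ols(np.column_stack([x1, x2]), y)[1])     # ~1.0
        ```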

3. Yeah, that’s right: the whole competing dynamics trope. I’d forgotten about that. The person I’m describing wasn’t arguing that specifically, just the idea that X mattered for Y.

Your case against hypothesis testing could be a useful post, especially if we are springing to life on a methodological kick.

  4. OK, I’ll bite. I don’t like one-tailed tests for two reasons.

1. I don’t think people really specify one-tailed tests in advance, and if you specify them ex post, it’s just p-hacking.

    2. I think seeing stars when the coefficient goes the wrong way makes us more open-minded to findings against our priors.

    I describe these in my lecture notes on significance testing.

    BTW, when I reviewed for AJPS I noticed that their style is to use one star tests (and p<.10). This shocked me when I noticed it in the manuscript I was reviewing but it seems to be a house style, as you'd kind of expect for a journal that regularly reviews articles with an n of 50.

    1. I like this: “Unless you have firmly committed yourself to a very clear one-tailed hypothesis in advance of the analysis, using a one-tailed test is sketchy.” When you are testing someone else’s theory, that’s exactly what you have – a clear, one-tailed hypothesis specified before you start.

  5. I’m not sure whether this counts as trivial, but I think that non-directional hypotheses are acceptable when assessing an aggregate outcome for which there is theory or evidence suggesting a positive correlation between X and Y in some cases and a negative correlation between X and Y in other cases, but when there is not enough theory or evidence to support a claim about the relative strengths of the correlations.

    Let’s say that we plan to assess sex bias in K12 math teacher hiring. I can think of a few contexts in which there might be bias against female candidates, and I can think of a few contexts in which there might be bias against male candidates, but I don’t have a strong sense about the relative sizes of these biases, at least in the domain of math teacher hiring. In that case, I think that it would be acceptable to test a non-directional hypothesis that the sex of the candidate matters in hiring decisions.

    Notice, though, that it is impossible to have half of the hypotheses confirmed in this case because there are three hypotheses: one hypothesis about bias against female candidates, one hypothesis about bias against male candidates, and one hypothesis about the relative sizes of these biases.

    If the statistical significance test indicates bias against females, that would not necessarily indicate no bias against males because it might be that bias against females is substantially greater than the bias against males; similarly, a non-statistically significant coefficient on the sex of the candidate variable would not necessarily indicate no sex bias in math teacher hiring, because it might be that the bias against females is similar in magnitude to the bias against males.
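    A toy simulation (all numbers invented) of that last scenario: context-specific biases in opposite directions that leave the pooled sex coefficient near zero, even though hiring in each context is strongly biased:

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    n = 5000

    # Hypothetical setup: candidates split across two context types.
    # Context A penalizes female candidates; context B penalizes male candidates.
    female = rng.integers(0, 2, n)
    context_a = rng.integers(0, 2, n)

    bias = np.where(context_a == 1, -0.5 * female, -0.5 * (1 - female))
    hired = rng.random(n) < 0.5 + bias * 0.4   # 50% baseline hire rate

    def gap(mask):
        """Female-minus-male hire rate among candidates selected by mask."""
        return hired[mask & (female == 1)].mean() - hired[mask & (female == 0)].mean()

    print("pooled female-male hire gap:   ", gap(np.ones(n, bool)))  # ~0
    print("gap in context A (anti-female):", gap(context_a == 1))    # ~-0.2
    print("gap in context B (anti-male):  ", gap(context_a == 0))    # ~+0.2
    ```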
