a study in scarlet

(N=100 on the left, N=24 on the right, one data point per person, observational study)

Andrew Gelman and I exchanged e-mails a while back after I made his lexicon for a second time. That prompted me to check out his article in Slate about a study published in Psychological Science finding women were more likely to wear red/pink when at “high conception risk,” and then I read the original article.

I don’t want to get into Gelman’s critique, although notably it included whether the authors were correct to measure “high conception risk” as 6-14 days after a woman starts menstruating (see Gelman’s response to the authors’ response about this). And I’m not here to offer an additional critique of my own.

I’m just looking at the graph and marveling at the reported effect size, and inviting you to do the same. Of the women wearing red in this study, 3 out of 4 were at high conception risk. Of the women not wearing red, only 2 out of 5 were.*

UPDATE: Self-indulgent even by blog standards, but since I could see using this example again somewhere and it took some effort to reconstruct, I’m going to paste in the cross-tab here:


* Thanks to R.C.M. for noting a problem with how I had worded the interpretation before.

Author: jeremy

I am the Ethel and John Lindgren Professor of Sociology and a Faculty Fellow in the Institute for Policy Research at Northwestern University.

19 thoughts on “a study in scarlet”

  1. I’m just going to keep taking hits for all the grad students out there because I think this kind of training is magnitudes more illuminating than writing proofs (and I think people pretending to know statistics because of scientistic prestige mongering is a much bigger disaster than having an environment where clarifying questions are par for the course).

    If I double those error bars in my head to get two standard errors away from the mean, the intervals clearly overlap. Yet Gelman rips on them for sampling and research design issues. What am I missing?


    1. Cool. That’s what I assumed at first, because I thought the custom is to throw CIs on group comparisons like this, but when I read the graph caption to check that, it said, “error bars indicate standard errors of the means.” Which taken literally means, “error bars indicate standard errors of the means.” ;)


  2. Gelman continues to insinuate that “one could have [cooked up a publishable ‘hypothesis’ ex post if XYZWBV things happened], therefore I’m not backing off anything I said about this study, even though they swallowed hard like adults and explained that they made all of their decisions ex ante.” I feel bad for them. It’s like watching me (Gelman) have an ethical debate with Andy Perrin (the authors) about South African socialism (a different red scare).


  3. Grahamalam:

    I don’t know what you mean by “insinuate”; my coauthor and I clearly write that under different data the analyses could be different, no insinuation about it.

    And I don’t really know what you’re talking about regarding XYZWBV, but it’s not true that the authors of the clothing papers made all their decisions ex ante. For one thing, there is no record of such decisions being made before the data were collected; in any case, Loken and I discuss several examples in section 3 of our Garden of Forking Paths paper of analyses contingent on the data. Of course it is the nature of a counterfactual that it can never be observed, but I strongly believe and have no reason to doubt that had, for example, the results in those two experiments gone in opposite directions rather than the same direction, and had that difference been statistically significant, this difference would have been reported. And, as we discuss in that paper, the report of such a difference (by analogy to the interaction reported in that other paper we discuss, on ovulation and political attitudes) would have been consistent with the researchers’ general theories and thus would not have felt to them like fishing in their data.

    And I don’t know what you mean by “swallowed hard like adults.” We’re all adults here, as far as I know. But adults can make mistakes and have misunderstandings.

    In any case, I agree completely with Jeremy’s point that the reported effect sizes are huge and it is extremely implausible to imagine these numbers occurring in the general population. But it is completely reasonable to imagine these numbers arising via random chance in a small sample. This last point would be obvious and unobjectionable to all had there not been a statistically significant comparison. The point of discussing multiple comparisons (the “garden of forking paths”) is to explain why the reporting of statistically significant comparisons in an open-ended fashion does not provide strong evidence against the null hypothesis that these patterns are just noise.

    Some people might say that I should be spending more time on research and less time responding to blog comments. And some people might be right on this. But I feel that it’s part of my role as a statistician and statistics educator to explain these issues and also to put in the effort to understand where there is confusion in this area. So here I am.


  4. Is the effect size “huge”? I calculate it as r = .26 based on the data provided here. Seems to be right in the realm of typical social psychology effects.
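For readers who want to check the r = .26 figure, here is a minimal sketch. The cell counts are an assumption: they are not printed in the post itself, but are inferred from the numbers other commenters in this thread use (17 of 22 red/pink wearers at high conception risk, 5 of 63 low-risk women wearing red/pink, 61 high-risk women overall), with the two samples pooled. The function name is just for illustration. For a 2x2 table, Pearson's r reduces to the phi coefficient.

```python
import math

# ASSUMED cell counts, inferred from figures quoted in the comment thread
# (both samples pooled, N = 124); they are not taken from the original paper.
red_high, red_low = 17, 5        # wearing red/pink: high risk, low risk
other_high, other_low = 44, 58   # not wearing red/pink: high risk, low risk


def phi_coefficient(a, b, c, d):
    """Phi (Pearson r for two binary variables) for the table [[a, b], [c, d]]."""
    numerator = a * d - b * c
    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return numerator / denominator


phi = phi_coefficient(red_high, red_low, other_high, other_low)
print(f"phi = {phi:.2f}")  # prints "phi = 0.26"
```

Under those assumed counts the result matches the r reported in the comment above.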


    1. Social psychology effect sizes are typically drawn from experiments, whereas this is an observational study. I think those yield very different worlds in terms of plausible effect sizes. I might do another post on how to interpret this effect size in behavioral terms rather than something like r.


      1. Sorry, I’m a social/personality psychologist and pretty familiar with observational vs. experimental data and I don’t think that has anything to do with it. I have never heard anyone say anything like that before. Anyhow, in “behavioral terms” it means that if 50% of the women are fertile and 50% are not, we can expect 63% of the women who are fertile to wear red/pink and 37% not to. Alternatively, we can also expect 37% of the women who are not fertile to be wearing red/pink and 63% not to be. See Binomial Effect Size Display (BESD).
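The BESD arithmetic in the comment above can be sketched as follows. The only input is the r = .26 quoted in the thread; the function name is hypothetical. The display converts a correlation into two "success rates" centered on 50%, namely 0.5 + r/2 and 0.5 - r/2.

```python
def besd(r):
    """Binomial Effect Size Display: rates for the two groups given correlation r."""
    rate_group_1 = 0.5 + r / 2
    rate_group_2 = 0.5 - r / 2
    return rate_group_1, rate_group_2


r = 0.26  # correlation quoted in the thread
fertile_rate, not_fertile_rate = besd(r)
print(f"fertile: {fertile_rate:.0%} wear red/pink")          # 63%
print(f"not fertile: {not_fertile_rate:.0%} wear red/pink")  # 37%
```

Note that the BESD rescales everything to 50/50 margins, so these percentages are a standardized display of r rather than the rates actually observed in the sample.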


      2. In a lab experiment, often the experimenter has much more control over competing stimuli (i.e., everything they can hold fixed is fixed). On the other hand, in a field experiment with a realistic intervention or an observational study, there are usually more causes of the outcome that are varying naturally. Compare with the previous discussion of the hurricanes/himmicanes study in which the results suggest that huge fractions of observed deaths are caused by naming decisions.

        Taking the inferences of an observational study seriously is also informative about the attribution of outcomes to causes in the real world. Certainly with a binary outcome like this, what’s striking is that the results in the study for which Jeremy posted the cross-tabs would imply that 55% of all red-wearing in the study was caused by being fertile. Calculation: (17 – 61 * (5 / 63)) / 22.
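The calculation in the comment above can be unpacked step by step. This is only a sketch of that arithmetic; the variable names and the labels attached to each number are my reading of the formula (17 high-risk women wearing red/pink, 61 high-risk women in total, 5 of 63 low-risk women wearing red/pink, 22 red/pink wearers overall), and it takes the causal interpretation at face value exactly as the comment does.

```python
# Numbers taken from the formula (17 - 61 * (5 / 63)) / 22 in the comment;
# the labels are an interpretation, not something stated in the paper.
red_high = 17      # high-risk women wearing red/pink
total_high = 61    # all high-risk women
red_low = 5        # low-risk women wearing red/pink
total_low = 63     # all low-risk women
total_red = red_high + red_low  # 22 red/pink wearers overall

# Rate of red-wearing among low-risk women, used as the "no effect" baseline.
baseline_rate = red_low / total_low

# Red-wearers we would expect among high-risk women at that baseline rate.
expected_red_high = total_high * baseline_rate

# Red-wearing in excess of the baseline, as a share of all red-wearing.
excess_red = red_high - expected_red_high
attributable_share = excess_red / total_red

print(f"baseline rate: {baseline_rate:.1%}")
print(f"share of red-wearing attributed to fertility: {attributable_share:.0%}")  # 55%
```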


      3. Ryne: I mean something a little different and more specific by “behavioral terms,” which you’d have no way of knowing from the vague way I tossed it out here. Roughly, I mean how somebody would parameterize the theory if one was going to try to formalize it. I’ve thought about trying to write out this particular finding in those terms; would be interested in your thoughts if I do.

        As for experiments vs. observational studies, “observational studies” was also probably too vague. Let’s be specific and say observational studies based on surveys. From the intuitions about effect sizes one gains from doing non-experimental surveys, one wouldn’t imagine that if you were going to ask people what color shirt they were wearing and collect self-reports about the menstrual cycle, the difference would be so big that surveys of 24 and of 100 respondents were all you’d need to see significant effects.


    2. For what it is worth, the sample size wearing red is way too small to draw any definitive conclusions about the effect of fertility on wearing red or not in my opinion. We are talking about 22 women here. You don’t need to be a statistician to realize that you shouldn’t generalize very much from 22 people. That is, to me it is ridiculous to assume that 77% of women wearing red are fertile until further evidence from much larger samples proves otherwise.
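To put a rough number on how little 22 women pin down, here is a minimal sketch of a 95% interval for the 77% figure (17 of 22, the count implied by the figures in this thread). The Wilson score interval is my choice for illustration, not something used in the paper or by anyone above, and the function name is hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return center - half_width, center + half_width


# ASSUMED counts: 17 of the 22 red/pink wearers at high conception risk.
lower, upper = wilson_interval(17, 22)
print(f"17/22 = {17 / 22:.0%}, approx. 95% interval: {lower:.0%} to {upper:.0%}")
```

The interval runs from the mid-50s to about 90%, which is the commenter's point: the data are compatible with a wide range of values.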


  5. Hi Professor Gelman,

    I think the service you provide online is invaluable, extended blog comments and all. And I made the point to Jeremy during the himmicane that bad stats and research practices are sociological/institutional, not ethical, and that it doesn’t do a lot of good to admonish people about research ethics. So I liked your exposition of how mutually reinforcing biases among reviewers, authors, editors, and audience get a lot of garbage published (I think this is why having a variance of ideological/political priors in social science is more important than anything).

    Moreover, your counterfactual story is plausible. But there is a record of how the authors made their sampling decisions: their word. And I thought your response crept into insinuating (you used the impersonal third person pronoun “one” instead of “the authors”) that the authors were either lying to themselves, or just plain lying to us, fully aware that they’d gone fishing. Maybe it’s a minor point, but I think it’s extremely important that people feel comfortable being publicly wrong in science, and we’re less likely to do so if being wrong about the ideas reliably draws accusations about our ethics.

    The reading has been sincerely educational and I’m downloading the Forking paper right now, thanks.


  6. Grahamalam:

    I never said the authors went fishing. I said their analysis was contingent on the data, which is not the same as fishing. Indeed, the difference between “fishing” and “analysis contingent on data” is a central point of our Forking Paths paper. It’s not about ethics, it’s about understanding the extent to which one can generalize from sample to population.


  7. Ovulation is a phase of the female menstrual cycle that involves the release of an egg (ovum) from one of the ovaries. New life begins if the ovum meets with a sperm during its journey down the fallopian tube. Ovulation depends on a complex interplay of glands and their hormones, and generally occurs about two weeks before the onset of the menstrual period. Typical ovulation symptoms and signs include changes in cervical mucus and a small rise in basal temperature. For most women, ovulation occurs about once every month until menopause, apart from episodes of pregnancy and breastfeeding. However, some women experience irregular ovulation or no ovulation at all.

