In the 1960s, Stanley Milgram conducted a study on obedience to authority that is now infamous among social scientists. The study was relatively straightforward. Participants were asked to administer shocks to another human who had performed poorly on a test. They were told that doing so could help the poor performer learn to do better. If a participant resisted administering the shocks, a member of the research team would insist that the participant continue for the good of the research. The shocks increased in intensity over the course of the study, reaching a level that could be lethal. In reality, no one was receiving these shocks, but a paid actor would pretend to be hurt, leading participants to believe that they had caused real harm to another real person. To the researchers’ surprise, over half of participants administered the final “lethal” shock. The findings from this study are commonly used to explain how genocides are perpetrated. Milgram and his team argued that ordinary people are willing to commit incomprehensible acts of violence so long as someone in authority assures them it is the right thing to do.
I first encountered the Milgram study as an undergrad in an introductory psychology class. By the time I graduated, I had learned about the study in at least three other classes. Each time, the discussion was essentially the same. Our professor would insist that the findings from the study are important, but that the study is unethical due to the harm it caused participants. That harm was described as the emotional trauma of walking around with the knowledge that you could—and would—murder another person if someone asked you to do so. There are other ethical issues as well, including the deception used by the research team and how difficult it was for participants to withdraw their consent to be in the study, but they were also tied back to that main concern: the weight on the conscience of a participant who administered that “lethal” shock.
As a professor, I was prepared to have the same discussion with my students in Science, Power and Diversity as we discussed research ethics. But when it came time to do so, I had a different perspective on the Milgram study, one that comes from my own work with perpetrators of sexual violence—and how hard it is to research them.
The seductive power of sensual charm survives only where the forces of denial are strongest. If asceticism once reacted against the sensuous aesthetic, asceticism has today become the sign of advanced art. All “light” and pleasant art has become illusory and false. What makes its appearance esthetically in the pleasure categories can no longer give pleasure. The musical consciousness of the masses today is “displeasure in pleasure” — the unconscious recognition of “false happiness.”
–Adorno, “On the Fetish-Character in Music and the Regression of Listening,” 1938
Jeff Guhin innocently posted to Facebook that “doing a lecture on Habermas is ridiculous.” He may well be right, for many different kinds of reasons. But in the (lengthy!) conversation that followed, two critiques were raised that I think deserve separate treatment. They are:
That much theory, including Habermas and, all the more so, his Frankfurt predecessors, is too difficult to read to make it worthwhile; and
Reading theorists like Habermas is really mostly about the history of social thought and has no payoff for empirical or analytical sociology.
I am going to take Dan’s invitation to consider one aspect of the polls that I don’t see getting a lot of attention right now, but that I think could be important: undecided voters could explain much of the polling error being discussed.
In other words, I don’t think that the polls were that wrong. I know that this view puts me in the minority, even among people who think about these things for a living. What we have, I think, is a failure to really consider how we should interpret polls given two very unpopular candidates and a possible “Shy Tory” effect where Trump supporters reported being undecided to pollsters.
O’Neil looks out at the land of big data and its various uses in algorithms and sees problems everywhere. Quantitative and statistical principles are badly abused in the service of “finding value” in systems, whether this be through firing bad teachers, targeting predatory loans, reducing the risk of employee turnover by using models that incorporate past mental health issues, or designing better ads to sniff out potential for-profit university matriculants. Wherever we look, she shows, we can find mathematical models used to eke out gains for their creators. Those gains destroy the lives of those affected by algorithms that they sometimes don’t even know exist.
Unlike treatises that declare algorithms universally bad or always good, O’Neil asks three questions to determine whether we should classify a model as a “weapon of math destruction”:
Is the model opaque?
Is it unfair? Does it damage or destroy lives?
Can it scale?
These questions eliminate the math entirely. By doing so, O’Neil makes it possible to study WMDs by their characteristics, not their content. One need not know anything about a model’s internal workings to attempt to answer these three empirical questions. More than any other contribution O’Neil makes, it is this opacity-damage-scalability schema, which identifies WMDs as social facts, that makes the book valuable.
Last weekend, Slate announced the use of social scientific tools similar to those used by campaigns themselves to anticipate results over the course of the day. Slate rejects, in editor-in-chief Julia Turner’s words, the “paternalistic” stance of the traditional media embargo on publishing results during Election Day.
Slate is making a bold move by ignoring the embargo, but in doing so they also appear to be ignoring the flaws of data science and a sacrosanct principle of both social science and journalism: skepticism.
Timothy Carney wrote an article earlier this week decrying what he calls the “rampant abuse of data” by pollsters and the press this election season. He faults North Carolina’s hometown polling company, Public Policy Polling (PPP), among others, for asking “dumb polling questions” such as the popularity of the erstwhile Cincinnati Zoo gorilla Harambe; support for the Emancipation Proclamation; and support for bombing Agrabah, the fictional country in which the Disney film Aladdin is set.
While I agree with Carney that many of the interpretations of these questions are very problematic (and I should note that I have used PPP many times to field polls for my own research), I think he’s wrong that these are dumb questions and that the answers therefore do not constitute “data.” Quite the opposite: asking vague and difficult-to-answer questions is an important technique for assaying culture and, thereby, revealing contours of public opinion that cannot be observed using conventional polling.
A couple of weeks ago I got in a friendly back-and-forth on Twitter with my friend and colleague Daniel Kreiss. Daniel was annoyed by this article, which purports to reveal why Mitt Romney chose Paul Ryan to be his running mate by deploying median-voter theory. Daniel’s frustration was this:
I love these studies – complicated models, and no one thought to ask former staffers what went into the decision. https://t.co/t99mfXhyUl
As I have admitted before, I am a terrible electronic file-keeper. If I were to count up the minutes I have wasted in the last 15 years searching for files that should have been easy to find, or typing and retyping Stata code that would have (and should have) been a simple do-file, or doing web searches for things I read and thought I wanted to include in lectures or PowerPoints or articles but couldn’t place, I fear I would discover many months of my life wasted as a result of my organizational ineptitude.
For a long while, these bad habits only affected me (and the occasional collaborator). It was my wasted time and effort. Now, though, expectations are changing and this type of disorganization can make or break a career. I think about my dissertation data and related files, strewn about floppy disks and disparate folders, and I feel both shame and fear.
Dylan Riley’s Contemporary Sociology review (paywall, sorry) of Biernacki’s Reinventing Evidence is out, and an odd review it is. H/T to Dan for noting it and sending it along. The essence of the review: Biernacki is right even though his evidence and argument are wrong. This controversy, along with a nearly diametrically opposed one on topic modeling (continued here) suggest to me that cultural sociology desperately needs a theory of language if we’re going to keep using texts as windows into culture (which, of course, we are). Topic modeling’s approach to language is intentionally atheoretical; Biernacki’s is disingenuously so.
I apparently attended the same session at the ASA conference as Scott Jaschik yesterday, one on Gender and Work in the Academy. He must have been the guy with the press badge who couldn’t wait to fact-check his notes during the Q&A.
The first presenter, Kate Weisshaar from Stanford University, started the session off with a bang with her presentation on the glass ceiling in academia, asking whether productivity explains women’s under-representation among the ranks of the tenured (or their attrition to lower-ranked programs or out of academia altogether). A summary of her findings – and a bit of detail about the session and the session organizer’s response to her presentation – appeared in Inside Higher Ed today.
(N=100 on the left, N=24 on the right, one data point per person, observational study)
Andrew Gelman and I exchanged e-mails a while back after I made his lexicon for a second time. That prompted me to check out his article in Slate about a study published in Psychological Science finding women were more likely to wear red/pink when at “high conception risk,” and then I read the original article.
I don’t want to get into Gelman’s critique, although notably it included whether the authors were correct to measure “high conception risk” as 6-14 days after a woman starts menstruating (see Gelman’s response to the authors’ response about this). And I’m not here to offer an additional critique of my own.
I’m just looking at the graph and marveling at the reported effect size, and inviting you to do the same. Of the women wearing red in this study, 3 out of 4 were at high conception risk. Of the women not wearing red, only 2 out of 5 were.*
UPDATE: Self-indulgent even by blog standards, but since I could see using this example again somewhere and it took some effort to reconstruct, I’m going to paste in the cross-tab here:
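The arithmetic behind that marveling is easy to check. The counts below are hypothetical, chosen only to be consistent with the proportions quoted above (3 of 4 red-wearers at high conception risk, 2 of 5 non-red-wearers); they are not the study’s actual cell counts.

```python
# Illustrative 2x2 cross-tab: rows are red vs. not-red, columns are
# high vs. low conception risk. HYPOTHETICAL counts matching the
# quoted proportions, not the published data.
red_high, red_low = 15, 5        # women in red: 15/20 = 75% high-risk
other_high, other_low = 32, 48   # women not in red: 32/80 = 40% high-risk

p_red = red_high / (red_high + red_low)
p_other = other_high / (other_high + other_low)

# Cross-product (odds) ratio, a standard effect-size summary for 2x2 tables.
odds_ratio = (red_high * other_low) / (red_low * other_high)

print(p_red, p_other, odds_ratio)  # 0.75 0.4 4.5
```

An odds ratio of 4.5 from wearing a particular color is, to put it mildly, a very large effect for a social-psychological study, which is the point of the marveling.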
Like much of the sociology blogosphere, I’ve been following the debate over the recent Facebook emotion study pretty closely. (For a quick introduction to the controversy, check out Beth Berman’s post over at Orgtheory.) While I agree that the study is an important marker of what’s coming (and what’s already here), and thus worth our time to debate, I think the overall discussion could be improved by refocusing the debate in two major ways.
First thing: is it in The Goldilocks Spot? For this, do not look at the result itself, but look at the confidence interval around the result. Ask yourself two questions:
1. Is the lower bound of the interval large enough that if that were the true effect, we wouldn’t think the result is substantively trivial?
2. Is the upper bound of the interval small enough that if that were the point estimate, we wouldn’t think the result was implausible?
(Of course, from this we move to questions about whether the study was well designed, so that the estimated effects are actually estimates of what we mean to be estimating, etc. But this seems like a first hurdle for considering whether what is presented as a positive result should be interpreted as possibly being one.)
Caveat: Note that this also assumes the hypothesis is a hypothesis that the effect in question is not trivial, and hypotheses may vary in this respect.
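The two questions above reduce to a simple screen on the interval's endpoints. A minimal sketch, where the `trivial` and `implausible` thresholds are the analyst's substantive judgment calls (my names, not anything from the post):

```python
def goldilocks(lower, upper, trivial, implausible):
    """Screen a result's confidence interval, per the two questions above.

    Passes only if (1) the lower bound exceeds the largest effect we would
    call substantively trivial, and (2) the upper bound stays below the
    smallest effect we would call implausible a priori. Both thresholds
    are substantive judgments, not statistical quantities.
    """
    return lower > trivial and upper < implausible

# Example: an effect with 95% CI (0.8, 2.5), where anything under 0.5
# is deemed trivial and anything over 5.0 is deemed implausible.
print(goldilocks(0.8, 2.5, trivial=0.5, implausible=5.0))  # True
# Same interval shifted down: the lower bound dips into trivial territory.
print(goldilocks(0.1, 1.8, trivial=0.5, implausible=5.0))  # False
```

The point of coding it up is only to make the heuristic's inputs explicit: the check never looks at the point estimate itself, only at what each endpoint would imply if it were the truth.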
My last post raised some comments about one-tailed tests versus two-tailed tests, including a post by LJ Zigerell here. I’ve returned today to an out-of-control inbox after several days of snorkeling, sea kayaking, and spotting platypuses, so I haven’t given this much new thought.
Whatever philosophical grounding exists for one-tailed vs. two-tailed tests is vitiated by the reality that, in practice, one-tailed tests are largely invoked so that one can talk about results in a way that one is precluded from doing if held to two-tailed p-values. Gabriel notes that this is p-hacking, and he’s right. But it’s p-hacking of the best sort, because it’s right out in the open and doesn’t change the magnitude of the coefficient.* So it’s vastly preferable to any hidden practice that biases the coefficient upward to get it below the two-tailed p < .05.
In general, I’ve largely swung to the view that practices that allow people to talk about results near .05 as providing sort-of evidence for a hypothesis are better than the mischief caused by using .05 as a gatekeeper for whether or not results can get into journals. What keeps me from committing to this position is that I’m not sure it doesn’t just change the situation so that .10 becomes the gatekeeper. In any event: if we are sticking to a world of p-values and hypothesis testing, I suspect I would be much happier in a world in which investigators were expected to articulate what would constitute a substantively trivial effect with respect to a hypothesis, and then use a directional test against that.
* I make this argument as a side point in a conditionally accepted paper, the main point of which will be saved for another day.
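That closing suggestion, a directional test against a declared trivial-effect threshold rather than against zero, can be sketched in a few lines. This is my illustration of the idea, not code from the paper; the function and parameter names are mine, and it assumes a normal sampling distribution for the coefficient.

```python
from math import erf, sqrt

def directional_p(beta, se, trivial=0.0):
    """One-sided p-value for H0: effect <= `trivial` vs. H1: effect > `trivial`.

    Testing against a substantively trivial value rather than zero asks
    whether the estimate is credibly *non-trivial*, not merely nonzero.
    With trivial=0 this is the ordinary one-tailed z-test, i.e. half the
    two-tailed p when the estimate lies in the hypothesized direction.
    """
    z = (beta - trivial) / se
    # Upper-tail probability of the standard normal, via the error function.
    return 1 - 0.5 * (1 + erf(z / sqrt(2)))

# A coefficient of 0.30 with se 0.15 clears one-tailed .05 against zero...
print(round(directional_p(0.30, 0.15), 3))               # 0.023
# ...but not against a declared trivial-effect threshold of 0.10.
print(round(directional_p(0.30, 0.15, trivial=0.10), 3)) # 0.091
```

The contrast in the example is the whole argument in miniature: once the investigator is forced to name what counts as trivial, a result that is "significant" against zero can stop being evidence of anything substantively interesting.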