big data hubris

A not-very-important, yet instructive, series of events on Friday offers a cautionary tale about the allure of big data and the fashionable mistrust of local knowledge.

On Friday, October 3, the New York Times’ popular “data science” column Upshot ran a cool data visualization of college football fan bases, broken out zip code by zip code, based on Facebook “likes.” Here’s the announcement:

The original map looked like this (sorry for the poor-quality screen capture):

[Image: upshot-ncaa-football-map-wrong]

Zooming into North Carolina, one found an amazing discovery: in no single zip code of North Carolina (even Chapel Hill) were the UNC Tar Heels among the top three favorites. Surprising indeed. The original story even included North Carolina as one of the interesting revelations of this big-data success: there are few places where the state’s flagship university’s team doesn’t make the top three in the state. The implication was that poor football performance over the years explained this anomaly–not an unrealistic hypothesis, sad to say.

But several people who live in North Carolina found it suspicious nonetheless: both that UNC wouldn’t make the top three even in its immediate region and, perhaps more so, that Duke would be in the top three in so much of the state. Journalism professor Andy Bechtel pressed gently:

Bradley Bethel was somewhat more indelicate:

And the NYT’s Derek Willis pressed back, doubling down on the big-data-reveals-surprising-truth narrative:

But here’s the thing: it’s not true:

Why yes, it does – nearly every zip code in the state now has the Heels on top; only Raleigh (NC State) and the far southwestern corner of the state (Tennessee) are exceptions. Even the zip codes containing Duke, Wake Forest, and East Carolina show greater support for UNC than for those. So the obligatory correction:

An earlier version of this interactive feature, using information from Facebook, omitted data for some universities.  After the maps were posted, Facebook discovered a coding error in its survey. The corrected information shows a significant change for the North Carolina Tar Heels, who had no territory in the initial data and now dominate their state. The corrected data also adds territory for the University of California at Berkeley, Marshall, U.N.L.V. and Oklahoma State.

Well, OK. Good to know the problem has been corrected, although it probably would have been a good idea to check before doubling down on interpreting the false data. So what was the problem?

[Image: screenshot from 2014-10-06 11:22:02]

The moral of the story: big data aren’t necessarily good data; understand the conditions of production of your data before publishing on them; and respect those with local knowledge.

Now, college football really doesn’t matter, of course, so at the end of the day who cares. But there is an obvious allure of using big data as they arise, which often means through social media like Facebook and Twitter. And there are all sorts of questions that do matter for which such data might be useful. For mapping poverty, for example; or crime, or political engagement. Just because there are lots of responses and the map looks cool doesn’t mean the conclusion is right.

Author: andrewperrin

University of North Carolina, Chapel Hill

7 thoughts on “big data hubris”

  1. Last year, ECU averaged 43K people at their football games, UNC averaged 51K and NC State averaged 53K.

    Currently, ECU football has 7K twitter followers, UNC has 43K followers, and NC State has 40K followers.

    So my ballpark estimate is that the level of online support for a college team is 40% size of fan base and 60% SES of fan base.

  2. It seems that we know that the problem has been corrected only because the current map is in line with our preconceived notions of how the map should look. This episode appears to be less a reflection of the problem of big data and more a reflection of a lack of transparency in methods, in which we can identify errors in an analysis only through the outcomes of the analysis and not through an investigation into the method of the analysis.

  3. It seems to me like this was a classic case of a measurement validity problem. Those with more expert knowledge (of North Carolina, and perhaps other states) felt the results were implausible. The only neat new twist (though, perhaps it isn’t so new) is that the original instrument is a black box – not only did we have to trust its measures were valid before, but we still have to trust them now.


  4. All data have errors and oddities and cannot be used without knowledge of how they were produced. Those readers who do criminology will know that raw arrest data are problematic because of nonreporting problems.


  5. I’m with LJ on this. Data (whether big or small) and analyses (sophisticated or simple) can have errors. Respecting local knowledge is good, but I think sharing data and code are potentially even more powerful. Many errors that won’t be caught using local knowledge alone can be caught if data and code are shared.

  6. I’m happy to agree in principle with LJ, Michael, and others, that the problem here is not inherent to big data. But I think it’s exacerbated by big data because assessing error is harder on that scale. And also because there’s a tendency–illustrated by Willis’s early responses–to assume that anomalies are really discoveries rather than errors. That’s a result of the faddishness of the big data approach (“hey, look what I can visualize!”), not a necessary feature of big data. But I do think it’s a result of how big data is understood.
