The following is a guest post by Jeff Lockhart.
On Wednesday, Devin Gaffney and Nathan Matias shared a warning for computational social scientists: Large-scale missing data in a widely-published Reddit dataset could be undermining the quality of your research. Alarm bells rang, and by the next morning, several friends and colleagues who know I use this data rushed to share the link with me. While their work is still in the preprint stage, the analysis is good and it makes an important contribution. I feel the same about Hessel et al.’s response analysis, which is printed in full at the end of the preprint. I agree wholeheartedly that more people working with this kind of data should investigate what’s really there rather than trusting grandiose claims about its quality.
Gaffney and Matias do a lot to quantify the missingness and its potential impact on various kinds of research. Because “big data” is such a new area for social science, though, we don’t yet have a good sense of what these quantities mean. For example, they find 943,755 comments and 1,539,583 submissions are missing from the data. That certainly feels like “large-scale missing data,” especially if we print the full numbers. But Reddit is enormous. Those counts work out to only 0.043% of comments and 0.65% of submissions missing. What baseline should that be compared to? Most social science would be thrilled with a response rate over 99%.
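Those counts and percentages are easy to sanity-check. A minimal back-of-the-envelope sketch (the implied corpus totals are my own back-calculation from the quoted figures, not numbers from the preprint):

```python
# Missing counts from Gaffney and Matias, and the missingness rates
# quoted above (0.043% of comments, 0.65% of submissions).
missing_comments = 943_755
missing_submissions = 1_539_583

comment_rate = 0.043 / 100      # fraction of comments missing
submission_rate = 0.65 / 100    # fraction of submissions missing

# The rates imply how big the full corpus must be.
implied_total_comments = missing_comments / comment_rate        # ~2.2 billion
implied_total_submissions = missing_submissions / submission_rate  # ~237 million

print(f"implied comment corpus: ~{implied_total_comments / 1e9:.1f} billion")
print(f"implied submission corpus: ~{implied_total_submissions / 1e6:.0f} million")
```

Billions of comments with a sub-tenth-of-a-percent loss rate is the scale we are actually arguing about.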
So why is anyone fussing over such a “small” amount of missing data? The Reddit dataset is “administrative” data: it contains records kept by Reddit for their own internal operations, rather than gathered for research purposes. Such data is often touted as “complete” or “population” data that lets researchers transcend some limits of sampled data. For example, it can offer finer granularity, less recall bias, and greater numbers of people from small and hard-to-reach populations. Such data is primarily used by people outside social science: computer scientists and industry professionals who generally aren’t trained to worry about sampling, missingness, and the biases they can encode. In these fields, as we have seen in the fast-paced literature on algorithmic bias, grandiose claims of panoptic surveillance are relatively common. With that as our baseline, even 0.04% missing records is a big shortcoming.
Gaffney and Matias, to their credit, help us by quantifying the chance a Reddit study will be affected by the missingness in terms of study size and design. They tell us “approximately 2% of the sampled users had a 50% or greater chance of having a missing comment.” Playing very fast and loose with my comparisons, imagine that a longitudinal social science study with frequent time points (or a diary study) said 2% of participants missed at least one time point. Such a response rate would be fantastically high. (It’s not a perfect analogy: in Reddit we don’t know who is missing data, or when, and there are no “time points.”) So when Gaffney and Matias say there is a “very high risk” of studies being affected by this missingness, they are correct in the sense that we should assume at least a few users in any study of appreciable size will be missing at least one comment. But it would be wrong to read that claim to mean that there is a “very high risk” of significantly affecting a study’s findings.
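To make the 50%-chance figure concrete: under the simplifying (and, given the burstiness of the missing data, unrealistic) assumption that each comment goes missing independently at the overall 0.043% rate, we can ask how many comments a user would need before they have a 50% chance of at least one missing comment. The threshold below is my own arithmetic, not a figure from the preprint:

```python
import math

p = 0.00043  # overall per-comment missing rate from the figures above

def prob_at_least_one_missing(n_comments: int, p: float = p) -> float:
    """Chance that a user with n comments has at least one missing,
    assuming each comment goes missing independently (a simplification)."""
    return 1 - (1 - p) ** n_comments

# Smallest comment count where that chance reaches 50%:
# solve (1 - p)^n <= 0.5 for n.
n_for_half = math.ceil(math.log(0.5) / math.log(1 - p))
print(n_for_half)  # 1612 comments
```

In other words, under independent missingness you’d need on the order of 1,600 comments before a coin flip’s chance of losing even one, which is why only the most prolific couple percent of users cross that line.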
Missingness can have a greater impact in network analysis, but here too it is helpful to contextualize the level of missingness. Some of our best social network data from non-“big data” sources comes from studies like Add Health, in which high school students list a handful of their friends on a survey. Sometimes it is longitudinal, like the Teenage Friends and Lifestyle Study. Between survey fatigue, recall bias, and sample limitations (only friends within the school are included), I’m confident the Reddit network is much more complete. That’s not a dig on Add Health or TFLS; some really good work has come out of both. Nor is it an exoneration of administrative data: we still need to look into the effects of missingness. My point is: the sky is not falling. We have tools and theory for handling missingness, and most social science work handles much more missingness than exists in administrative data sets like Reddit.
The big concern is whether the data are “missing at random,” or whether instead they are missing according to some pattern or process related to whatever we happen to be studying. Gaffney and Matias demonstrate that the missing data is not uniformly distributed over time or across subreddit communities. Of course, randomness is less uniform than we often expect. We don’t know why the missing data are missing. Hessel et al.’s reply shows that some of it appears not to exist on Reddit’s servers, other parts appear to exist but to be inaccessible, and yet more was accessible to them but failed to make it into the main public corpus. Some of this could be due to privacy settings. If I had to guess (drawing on my past life as a computer scientist), I’d say a lot of the missingness is probably server error and technical glitches. If Reddit were a bank, and transactions were missing, we would fairly conclude that something fishy was going on. But social media posts just aren’t that important. It is much easier and cheaper to build large, fast systems like those that power Reddit if you’re willing to tolerate small error rates (say, the 0.04% we see in the Reddit comment data). I’m using a Google API right now, and over the last two weeks it has given errors for 0.04% of my requests, usually in bursts. Facebook’s motto used to be “move fast and break things.” Errors happen in every system like this.
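To illustrate how uneven purely uniform randomness can look, here is a toy simulation (the window sizes and counts are my own illustrative choices, not Reddit’s actual posting volume). Every record goes missing independently at a constant rate, with no time-varying failure process at all, yet the per-window counts of missing records still scatter widely:

```python
import random

random.seed(0)
p = 0.00043              # constant per-record missing rate
ids_per_window = 50_000  # records per time window (illustrative)
n_windows = 100

# Independent, uniform missingness: each record is lost with the
# same probability regardless of when it was posted.
counts = [sum(random.random() < p for _ in range(ids_per_window))
          for _ in range(n_windows)]

# Some windows will look like "spikes" of missingness, others like
# quiet periods, purely from Poisson-style chance variation.
print(min(counts), max(counts))
```

Non-uniform-looking missingness is therefore necessary but not sufficient evidence of a systematic process; the question is whether the pattern correlates with what we study.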
Many different, plausible system errors could lead to the observed missing data. Perhaps IDs were skipped and don’t represent missing comments. Or the comments were created but not saved, or saved but later lost, or saved but missed in the search to download them. Perhaps their IDs or timestamps were altered accidentally later on (this likely happened with the comments that are older than the submissions they reply to). I’ve seen some comments saved twice, and on and on. Such errors could be related to how busy the servers were at any given time (explaining the burstiness of the missing data), whether some intern was fussing with the database while a comment was submitted, or numerous other things, all of which are unrelated to most social science research questions. We end users can’t know for sure if any of these are at play in the data or to what extent. Descartes’ Evil Demon (read: a malevolent Reddit employee) might be manipulating the data for all we know.
There is plenty of data missing from Reddit. It doesn’t share the text content of posts that have been deleted, which Hessel et al. point out is around 25% of submissions (among comments, the figure is 6.8% by my count). Unlike Facebook, Reddit doesn’t have the comments people type but then decide not to post. More broadly, there are numerous variables we’d like to have but don’t: there is essentially nothing in terms of demographic covariates, location, offline behavior, behavior elsewhere online, etc. These are serious limitations of the dataset. When working with this kind of large scale, administrative data, we need to conduct thorough analyses of quality, as Gaffney and Matias do, but we also need to develop perspective on the scale of it all.
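For anyone wanting to reproduce a count like my 6.8% figure, here is a sketch under the assumption that the corpus is stored as newline-delimited JSON, as in the public Reddit dumps, where a deleted comment’s `body` field is replaced by the literal string "[deleted]":

```python
import json

def deleted_comment_rate(path: str) -> float:
    """Fraction of comments whose text was deleted, assuming one JSON
    object per line and that deletion replaces the "body" field with
    the literal string "[deleted]" (the public dumps' convention)."""
    total = deleted = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            comment = json.loads(line)
            total += 1
            if comment.get("body") == "[deleted]":
                deleted += 1
    return deleted / total if total else 0.0
```

Streaming line by line matters here: the monthly comment dumps are far too large to load into memory at once.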