Guest post by Nathan Seltzer
In the days following the publication of a Washington Post article that detailed allegations of sexual abuse against Roy Moore, Emerson College Polling released an election poll of Alabama voters that showed Moore maintaining a 10-point lead over his opponent Doug Jones, 55% to 45%. Because it was one of the first polls released after the allegations, it received sustained national press coverage and influenced perceptions of the Alabama Senate race. The Emerson College Poll was conducted using survey data collected over the internet and by landline phone.
In my working paper, “Less Trust, Moore Verification: Determining the Accuracy of Third-Party Data through an Innovative Use of Attention Checks,” I analyze raw data from this poll and find irregularities in the internet sample that might suggest that the respondents were not properly sampled by the data vendor that administered the survey, Opinion Access Corp., LLC.
As researchers increasingly rely on internet data vendors to acquire respondents for polls and surveys, I argue for the necessity of proactively verifying the accuracy of third-party data. In the paper, I detail how researchers can use survey “attention checks” to determine whether data vendors have provided samples that match their requested sampling frame. In the example below, I repurpose two pre-existing questionnaire items from the November 13 Emerson Poll to verify the accuracy of the sample provided by Opinion Access Corp.
Verifying Samples through A Priori Expectations of Variable Distributions
To verify whether the internet sample consisted of valid Alabama respondents, I examined the joint frequency distribution of two overlapping geographic variables in the dataset: county of residence and US congressional district.
Alabama counties are nested within congressional districts, although there are several counties that overlap with two or three congressional districts (map here). As a result, we should expect that congressional districts are non-randomly distributed within counties. The a priori expectation would be that most counties should only have respondents in one congressional district. Additionally, we should expect respondents to correctly match their county and congressional district; there should be no ambiguity, with the exception of minimal respondent error.
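To make the a priori expectation concrete, here is a minimal sketch of how such a check could be encoded. It is not the code used in the paper. The lookup table `VALID_DISTRICTS` and the function `is_valid_pair` are hypothetical names, and only the Autauga entry comes from this post; the other two counties are placeholders that stand in for a single-district county and a split county.

```python
# Hypothetical lookup: county -> set of congressional districts it overlaps.
# Only the Autauga entry is taken from the post; the rest are placeholders.
VALID_DISTRICTS = {
    "Autauga": {2},
    "Example County A": {1},
    "Example County B": {6, 7},  # stands in for a county split across districts
}


def is_valid_pair(county, district):
    """Return True if the reported district is one the county overlaps."""
    return district in VALID_DISTRICTS.get(county, set())


print(is_valid_pair("Autauga", 2))  # a logical county-district pair
print(is_valid_pair("Autauga", 7))  # an illogical pair
```

A real check would need the full county-to-district table for the districts in force at the time of the poll.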
In the figure below (Figure 2 in the paper), I graph the joint frequency distribution of respondents by their counties and congressional districts for both the internet sample and the IVR phone sample. The rows of the graph correspond to county of residence while the columns correspond to the respondents’ specified congressional districts. The dark blue boxes represent clusters of one or more respondents, while the light grey boxes represent no respondents. Importantly, the red x-marks indicate valid responses that correctly match counties to congressional districts; all other cells in the heat map represent illogical and invalid county-district pairs.
Heat Map Depicting Joint Distribution of Counties of Residence and Congressional Districts for Respondents in the Internet and IVR Samples.
Notes: Correct Match refers to valid/logical matches for counties and congressional districts. All other cells represent invalid/illogical county-district pairs. Blue cells indicate that one or more respondents reported living in the corresponding county and congressional district.
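The joint distribution behind a heat map like this can be tabulated in a few lines. The sketch below is an illustration under stated assumptions, not the paper's code: the `responses` list is synthetic data invented for the example, and `VALID_DISTRICTS`, `joint_counts`, and `invalid_cells` are hypothetical names.

```python
from collections import Counter

# Synthetic (county, district) answers, invented purely for illustration.
responses = [
    ("Autauga", 2), ("Autauga", 2), ("Autauga", 7),
    ("Example County B", 6), ("Example County B", 7), ("Example County B", 3),
]

# Hypothetical lookup of logical pairs; only Autauga comes from the post.
VALID_DISTRICTS = {"Autauga": {2}, "Example County B": {6, 7}}


def joint_counts(pairs):
    """Count respondents in each (county, district) cell."""
    return Counter(pairs)


def invalid_cells(pairs, valid):
    """List occupied cells that are not logical county-district pairs."""
    return sorted(
        (county, district)
        for (county, district) in joint_counts(pairs)
        if district not in valid.get(county, set())
    )


for cell in invalid_cells(responses, VALID_DISTRICTS):
    print(cell)
```

In the figure's terms, every cell returned by `invalid_cells` would be a blue box without a red x-mark.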
While the IVR phone sample matches our a priori expectations for how congressional districts should be distributed within counties, the internet sample does not. In fact, 117 out of the 324 internet respondents (36.1%) were unable to accurately match their county of residence to their US congressional district.
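The share reported above follows directly from the two counts. As a quick arithmetic check (the function name `mismatch_rate` is my own for this illustration):

```python
def mismatch_rate(n_mismatched, n_total):
    """Share of respondents whose county and district do not match."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return n_mismatched / n_total


# Counts reported for the internet sample in this post.
rate = mismatch_rate(117, 324)
print(f"{rate:.1%}")  # 36.1%
```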
In Autauga County, for instance, which is in central Alabama and District #2, none of the respondents from the internet sample selected District #2. Instead, they indicated that their congressional district was either District #1, District #3, District #4, or District #7, all of which are incorrect.
It is unclear why respondents in the internet sample failed to correctly match their congressional districts to their counties of residence. In the online questionnaire, respondents were provided a map that superimposed congressional districts over county boundaries, and were then asked to indicate their congressional district. This should have been a simple task for respondents who knew where they lived within their state. To be sure, it is possible that the divergence between the joint distributions of the internet and IVR phone samples has a practical explanation that is not easily inferred from the publicly released survey methodology. But when the error rate in the internet sample surpasses a third of all respondents, such an explanation seems implausible.
Less Trust, Moore Verification
Third-party internet panel vendors provide a cost-effective and time-efficient option for conducting survey research. However, data vendors often have aims and motives that do not align with those of academic researchers. By default, researchers should be skeptical of the accuracy of data provided by third parties. Ultimately, it is the researcher’s responsibility to determine the fidelity of the data they use in their analysis.
Although the aim of the paper is not to predict the outcome of an electoral contest, the removal of this poll from aggregate polling averages might indicate a tighter Alabama Senate race than previously understood. On November 28, Emerson College Polling released an additional poll of support for Roy Moore and Doug Jones in the Alabama Senate race that similarly relied on respondents acquired through Opinion Access Corp. If the same irregularities observed in the November 13 poll are present in the more recent poll, then political observers should interpret the results with the understanding that a substantial number of the respondents interviewed might have been invalidly included.
Nathan Seltzer is a PhD student in Sociology at the University of Wisconsin-Madison and a trainee at the Center for Demography and Ecology.