Poll Uncertainty, 2016 and 2020

There’s a saying that generals always fight the last war. The idea is that they analyze the tactics that worked (or would have worked) against the enemy in the previous conflict. When the next conflict comes along, they avoid the mistakes they made in the last one. But technology and enemy strategies have changed, and the tactics from the last war no longer work.

This adage does not describe most modern-day generals, who actually spend vast amounts of time anticipating how potential conflicts of the future will differ from past ones. But it may apply to pollsters trying to estimate characteristics and preferences of the 2020 electorate.

As a survey statistician, I get asked a lot about polls. Personally, I think journalists and pundits focus too much on polls; I’d rather read about how a candidate’s proposals would affect the country than how they might shift the polls. But polls illustrate larger issues of trying to measure a changing population, so here follows my guide to the 2020 polls and where they might go wrong.

The national polling estimates for the 2016 election were actually pretty close to the election results. The New York Times report of the national polling average on November 8, 2016 had Hillary Clinton with 45.9% of the vote and Donald Trump with 42.8% of the vote — about a 3 percentage point margin for Clinton. In the election, Clinton won 48.2% of the popular vote, and Trump won 46.1% of the popular vote — about a 2 percentage point margin (approximately 3 million votes) for Clinton, close to the margin predicted by the polling average. But the popular vote does not determine the election winner; the state-by-state results do.

The 2016 polling may have been accurate nationally, but key state polls, and the resulting forecasts for the Electoral College, had errors that were all in the same direction. The New York Times had Clinton forecast to win in Florida, Wisconsin, Pennsylvania, Michigan, and North Carolina — all states narrowly won by Trump in the election. This led them to predict that Clinton had an 85% chance to win the Electoral College.

There have been a lot of articles about what went wrong in the 2016 polls, and most of these try to assure readers that the errors from 2016 won’t happen this year. Their main argument is that high-quality 2020 polls weight for education, while most 2016 polls did not. Let’s take a little detour to look at what weighting does and why it is supposed to work, and then see whether this argument means the 2020 polls will be accurate.

Weighting the Data

If you took a simple random sample of 1,000 people from a list of registered voters and everyone answered the poll, you would expect the sample to have about the same percentage of Democrats, the same percentage of women, and the same percentage of African Americans as the list. The larger the sample, the closer the sample percentages will be to the percentages calculated from all persons on the list.

But most of the people called by pollsters don’t provide answers. For most polls, fewer than 5% of the people called end up talking to the pollsters, and for some, fewer than 1% respond. This means that even if a random sample of voters is called, the percentages of sample members who are Democrats, women, or African Americans can be far from the registered voter percentages.

Suppose the list has 60,000 registered voters, with half men and half women. If 600 of the 1,000 people who end up talking to the pollsters are women, then women are overrepresented in the sample. And, if women (both in and out of the sample) are more likely to support candidate A than men are, then the percentage of the 1,000 sampled people who support candidate A will overestimate candidate A’s support among registered voters on the list, because the sample has too many women relative to men.

To compensate, pollsters weight the data. There are 30,000 women in the list and 600 in the sample, so each woman in the sample is given a weight of 30,000/600 = 50. Each of the 400 men in the sample is given a weight of 30,000/400 = 75. Because men are underrepresented in the sample, they are given higher weights. Then, when estimating the support for candidate A, the pollsters count each woman’s answer 50 times and each man’s answer 75 times, and then calculate the percentage of these 60,000 “responses” that favor candidate A. The weighting corrects for the overrepresentation of women in the sample, because 30,000 of the replicated responses are from women and 30,000 are from men.
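
To make the arithmetic concrete, here is a minimal sketch (in Python) of the weighted estimate described above. The population and sample sizes match the example; the numbers of respondents supporting candidate A are invented for illustration.

```python
# Weighting arithmetic from the example above. The sample composition
# (600 women, 400 men) matches the text; the candidate-preference
# counts are hypothetical.

pop_women, pop_men = 30_000, 30_000   # registered voters on the list
n_women, n_men = 600, 400             # poll respondents

w_women = pop_women / n_women         # 30,000 / 600 = 50
w_men = pop_men / n_men               # 30,000 / 400 = 75

# Hypothetical numbers of respondents supporting candidate A:
# 55% of the women and 40% of the men (made up for illustration)
a_women, a_men = 330, 160

unweighted = (a_women + a_men) / (n_women + n_men)
weighted = (w_women * a_women + w_men * a_men) / (pop_women + pop_men)

print(f"Unweighted estimate of support for A: {unweighted:.1%}")  # 49.0%
print(f"Weighted estimate of support for A:   {weighted:.1%}")    # 47.5%
```

With these made-up preferences, the weighted estimate is simply the average of the women's and men's support percentages, reflecting the equal split of the list.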

If gender is related to candidate preference, the weighting also partially corrects for an overrepresentation of candidate A supporters among the respondents. But this presumes that the relationship between gender and candidate support is similar for the persons in the sample and for those not in the sample. If the sample is selected randomly and everyone answers, it’s reasonable to think that the same relationships hold for sampled and nonsampled persons, because there’s nothing “special” or unusual about the people who answer the poll questions relative to those who don’t. When the response rate is low, however, the women who don’t answer the poll may be more likely to prefer candidate B than the women who do, and the men who don’t answer may be more likely to prefer candidate B than the men who do. In that case, the estimates will still underestimate the support for candidate B, even after weighting. There are differences between the persons who answer the poll questions and those who don’t that cannot be explained by gender alone. So pollsters weight the survey by many different characteristics, hoping that at least some of them explain the differences between the persons who answer and the persons who refuse.
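
To see why weighting by gender alone may not be enough, here is a small simulation sketch. All of the preference and response probabilities are invented, but they mimic the scenario just described: within each gender, supporters of candidate B are less likely to answer the poll, so the gender-weighted estimate still falls short of B’s true support.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population of 60,000 registered voters, half women and half men.
N = 60_000
gender = rng.permutation(np.repeat(["F", "M"], N // 2))

# Invented preferences: 55% of women and 45% of men support candidate B.
p_b = np.where(gender == "F", 0.55, 0.45)
prefers_b = rng.random(N) < p_b

# Invented response rates: within each gender, B supporters answer half as often.
response_prob = np.where(prefers_b, 0.01, 0.02)
responds = rng.random(N) < response_prob

# Weight each respondent by (group size) / (number of respondents in the group).
weights = np.zeros(N)
for g in ["F", "M"]:
    in_group = gender == g
    resp_in_group = in_group & responds
    weights[resp_in_group] = in_group.sum() / resp_in_group.sum()

true_support = prefers_b.mean()
weighted_estimate = (weights[responds] * prefers_b[responds]).sum() / N

print(f"True support for B:       {true_support:.1%}")
print(f"Gender-weighted estimate: {weighted_estimate:.1%}")  # still underestimates B
```

With these invented numbers, the weighted estimate comes out well below B’s true support of roughly 50%, even though the gender weighting itself is carried out correctly.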

Pollsters who used weighting in 2016 generally weighted by gender, age, race, ethnicity, and (sometimes) political party affiliation. But, in some states, these variables did not capture the difference between respondents and nonrespondents within those groups. In particular, there was a divide among white men: a higher percentage of college-educated men supported Clinton than non-college-educated men. Because people with higher education are more likely to participate in polls, this led some polls to overestimate Clinton’s support. And the same type of overestimation affected multiple states. The errors in the 2016 state polls were not independent; each of the key states’ errors went in the same direction.

More pollsters weight for education in 2020, at least for national polls. So they have solved the problem that affected poll accuracy in the last election. But there have been huge societal changes between 2016 and 2020, and it is quite likely that in 2020 some other factor explains differences in candidate preference between poll respondents and nonrespondents, a factor that is not used in the weighting.

But wait, there’s more. Polls are used to predict the election, but they actually estimate a snapshot of voter intention at the time a poll is taken. A voter’s intention can change before election day. Persons may end up not voting, may change candidate preference, or may vote but have their ballot rejected because of a signature mismatch, late mail delivery, or disenfranchisement.*

Thus, for purposes of predicting an election, the model used for turnout is extremely important and may be an especially volatile part of the 2020 predictions. You’ve seen polls characterized as being of “registered voters” or “likely voters.” The “likely voter” polls make a prediction of how likely it is that each respondent will vote. This is often based on responses to the poll question “How likely are you to vote this year?” and also, for some polls, on past voting behavior. Whom you vote for is confidential, but whether you’ve voted in past elections is public information. Thus, someone who has voted in every recent past election and says she is likely to vote this year is predicted to have a very high probability of voting; someone who voted sporadically in recent elections has a lower predicted probability of voting. And a newly registered voter might not even be in the list of people who are called (some pollsters dial telephone numbers at random or use other methods to capture these persons). Early turnout has surpassed that of previous years; indeed, as of October 30, more people have voted early in Texas than voted altogether in Texas in 2016 — with the election day results yet to come. More registered voters with low predicted probabilities of voting may have turned out than the turnout models assumed.
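
As a rough illustration (not any particular pollster’s model), a likely-voter estimate can be formed by multiplying each respondent’s weight by an estimated probability of voting. The weights, turnout probabilities, and candidate preferences below are all hypothetical.

```python
# Sketch of a "likely voter" adjustment: scale each respondent's weight by an
# estimated probability of voting. All values below are hypothetical.

respondents = [
    # (weight, estimated turnout probability, supports candidate A?)
    (50, 0.95, True),   # voted in every recent election, says she will vote
    (50, 0.40, False),  # sporadic past voter
    (75, 0.90, True),
    (75, 0.20, False),  # newly registered, no voting history on file
]

support_a = sum(w * p for w, p, a in respondents if a)  # weighted expected A voters
all_voters = sum(w * p for w, p, a in respondents)      # weighted expected voters

print(f"Likely-voter estimate of support for A: {support_a / all_voters:.1%}")
```

If the assumed turnout probabilities are too low for some groups, those groups are effectively downweighted in the estimate, which is one way a turnout model can push a poll’s prediction off.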

The other challenge to polls is that 2020 is an unprecedented year in many, many ways. The pandemic, the soaring unemployment, and the protests for social justice are all signs of a realignment in American politics that might not be captured by models used to predict previous elections. To see what can happen, let’s look at another time of upheaval and realignment in America: the 1930s.

The Worst Poll of All Time?

Statistics classes often use the Literary Digest Poll of 1936 to exemplify bad sampling practice. It’s been called “one of the worst political predictions in history” because the election prediction was so far from the actual result.

The Literary Digest’s pre-election polls were popular features of the news magazine, and their predictions in 1924, 1928, and 1932 had been extremely accurate. In 1932, for example, the poll predicted that Franklin Roosevelt would receive 56% of the popular vote and 474 votes in the Electoral College; in the actual election, Roosevelt received 57% of the popular vote and 472 votes in the Electoral College.

On October 31, 1936, however, the Digest predicted that Republican Alf Landon would receive 54% of the popular vote and the majority of states, compared with 41% for Democrat Franklin Roosevelt. In the election, Roosevelt won by a landslide with 61% of the popular vote. Landon received 37% of the popular vote and won just two states: Maine and Vermont. Even though the editors collected a huge sample of 2.4 million persons, the predictions were far, far off.

What went wrong in 1936? It may have been partly that the list of 10 million persons from whom poll responses were solicited had disproportionately many Landon supporters, or that Landon supporters were more likely to respond to the poll. But the response rate for the Digest poll was close to 25% — a response rate that swamps that of the typical poll in 2020.

Mike Brick and I have argued that the main reason for the polling misfire was that the Digest editors relied on the weighting models they’d used in previous elections (their model was not to weight the data). They did not recognize that the previous weighting model, which had been accurate up until 1932, no longer worked because of the demographic shifts in the 1936 electorate and respondents. If in fact they had weighted the 1936 poll’s data by how respondents voted in 1932 (information that was available from the poll), the Digest poll would have predicted a win for Roosevelt. The problem wasn’t the response rate, but that the model they had used in the past to compensate for nonresponse no longer worked in 1936.
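
Here is a rough sketch of what weighting by reported 1932 vote looks like. These are not the actual Digest data: the respondent counts and 1936 preferences are invented, and the 1932 population shares are approximate official popular-vote results. The point is only that post-stratifying on a variable strongly related to both response and candidate preference can change the prediction.

```python
# Post-stratification by reported 1932 vote. The respondent counts and 1936
# preferences are invented; the population shares are approximate official
# 1932 popular-vote results (Roosevelt ~57%, Hoover ~40%).

cells = {
    "Roosevelt 1932": {"pop_share": 0.57, "n_resp": 800_000,   "fdr_1936": 0.75},
    "Hoover 1932":    {"pop_share": 0.40, "n_resp": 1_500_000, "fdr_1936": 0.20},
    "Other / none":   {"pop_share": 0.03, "n_resp": 100_000,   "fdr_1936": 0.50},
}

total_resp = sum(c["n_resp"] for c in cells.values())
unweighted = sum(c["n_resp"] * c["fdr_1936"] for c in cells.values()) / total_resp
weighted = sum(c["pop_share"] * c["fdr_1936"] for c in cells.values())

print(f"Unweighted Roosevelt share: {unweighted:.1%}")  # 1932 Hoover voters overrepresented
print(f"Weighted by 1932 vote:      {weighted:.1%}")
```

With these made-up numbers, the unweighted poll points to a Landon win while the post-stratified estimate points to a Roosevelt win, illustrating how the reweighting described above could flip the prediction.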

I think we may be in a similar situation in 2020. Response rates for contemporary polls are in the low single digits (modern pollsters would be ecstatic if they got the 25% response rate achieved by the Literary Digest poll), so the predictions depend largely on how well the weighting model describes the voting intentions of persons who do not participate in the polls. But we, too, are at a time of societal upheaval and voting realignments that may cause models from the past to be inadequate. These inadequacies will likely be detected only in retrospect.

Judging the Quality of a Poll

Some polling aggregators give pollsters a grade that is based, at least in part, on their accuracy in previous elections. In times of societal shifts, I think polls need to be evaluated entirely on the methodology. After all, the Literary Digest Poll was extraordinarily accurate for 1924, 1928, and 1932 — until suddenly, in 1936, it wasn’t. Here is my checklist for evaluating a poll, adapted from my checklist for Judging the Quality of a Statistic.

  1. Does the pollster publish a methodology report** that gives the technical details about how participants were selected and how estimates were calculated? If no, stop here. Do not trust the results.

  2. Does the methodology report include the response rate? If the pollster does not disclose the response rate (likely because it was abysmally low), I wonder what other poll deficiencies are not disclosed. The lack of transparency leads me to distrust the results.

  3. How were participants selected? Some pollsters select randomly from a list of registered voters, some select telephone numbers randomly, and some use a panel of online respondents. If the latter, check to see how the panel was formed. If it was recruited using random selection (some online panels call randomly selected telephone numbers or send letters to randomly selected addresses and invite the residents to join the panel), that’s ok.

    If people volunteer to be in an online poll, for example by clicking on a web site to join the panel, I do not trust the results. An organization could flood the panel with “volunteers” who skew the poll in a desired direction. If participants are randomly selected, the polling results may have other types of errors, but at least that type of poll flooding can’t occur.

  4. How do they obtain answers from poll participants? In general, telephone polls with live interviewers have higher quality than those in which a computer asks a question (many of the latter also do not meet criteria 1, 2, and 3).

  5. How do they weight the data? For 2020 polls, I want to see evidence that they’re not just using the same weighting methods as before, and are adapting to the shifts in the electorate.***

Many pollsters do not meet the criteria on my checklist, and polling averages should exclude polls with poor methodology, regardless of whether the pollster has been accurate in the past. The highest-quality polls have corrected the errors that were made in 2016. The national polls, at least, weight by education. What about other sources of uncertainty?

I was one of the few statisticians in 2016 who thought that the polls were underestimating the support for Trump and that he would win the election. What about 2020? I think the uncertainty about poll estimates greatly exceeds the margins of error that are reported. And, since an individual pollster uses the same type of weighting model for each state, it’s likely that if the pollster is overestimating a candidate’s support in Wisconsin, they’re also overestimating it in Michigan, Pennsylvania, North Carolina, and other states. Thus, if the models are wrong for one state, the errors are likely to cascade.
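
A small simulation sketch of why correlated errors matter so much: if a shared error from a common weighting model shifts the polls in every battleground state by the same amount, the chance of a candidate losing all of those states at once is far higher than it would be if the state errors were independent. The poll lead, error sizes, and number of states below are hypothetical, chosen so that the total error per state is about the same in both scenarios.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: a candidate leads the polls by 3 points in each of
# 5 battleground states.
n_states, poll_lead, n_sims = 5, 3.0, 100_000

def prob_lose_all(shared_sd, state_sd):
    """Probability of losing all states when the polling error has a
    shared component (same in every state) plus a state-specific part."""
    shared = rng.normal(0.0, shared_sd, size=(n_sims, 1))
    state = rng.normal(0.0, state_sd, size=(n_sims, n_states))
    actual_margin = poll_lead - (shared + state)
    return (actual_margin < 0).all(axis=1).mean()

# Both scenarios have a total error of roughly 4 points per state.
print("Lose all 5 states, independent errors: "
      f"{prob_lose_all(shared_sd=0.0, state_sd=4.0):.1%}")
print("Lose all 5 states, correlated errors:  "
      f"{prob_lose_all(shared_sd=3.5, state_sd=2.0):.1%}")
```

With independent errors it is rare to miss in every state at once; with a shared error component it happens far more often, which is the kind of cascading miss described above.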

In which direction are the errors likely to go? This is hard to say in advance, since you need an external data source to evaluate bias. But there’s a difference in predictions between the polls with better methodology (according to my checklist) and other polls: the better-methodology polls tend to estimate a larger lead for Biden. The RealClearPolitics average as of October 30, which averages a large number of polls, has Biden leading by 7.8 percentage points. The FiveThirtyEight weighted average, which downweights some (but not all) of the polls with more questionable methodology, has Biden ahead by a larger margin, 8.9 percentage points. In 2020, some polls may also be underestimating support for Biden if their turnout models underestimate the probabilities that persons in some Democratic-leaning groups will vote. If that is the case, then polling errors this year may be in the opposite direction of those in 2016, which would mean that the 2020 poll averages may be underestimating the support for Biden. There is huge uncertainty in the predictions, however, far exceeding the reported margins of error (which do not account for uncertainty about the nonrespondents).

But polls estimate what people say they intend to do in the election, not how their vote is cast and recorded. Even if a poll were conducted perfectly and everyone in a randomly selected sample gave an accurate indication of their intention, the poll’s predictions would still be inaccurate if, for some reason, the voters’ stated intentions are not fulfilled. For example, if people do not vote because they think their candidate is going to win or lose anyway, or if some mail ballots do not reach election officials by the ballot deadline, the predictions will be wrong.

So don’t make a decision on whether to vote based on polls. Just go out and do it. If you’ve voted already, check with your state or local election officials that your ballot has been received and counted.

Postscript added November 13, 2020: It appears that there was indeed some residual bias in most polls that was not captured by the weighting. But that is the nature of bias in low-response-rate surveys. Without reliable external information (which for polls comes only after the votes are counted), you can only speculate about the size and direction of the bias. In a couple of weeks I’ll post some suggestions for experiments and investigations that might help reduce bias in the future.

Copyright (2020) Sharon L. Lohr

Footnotes

*Many people count the 2000 election in Florida as a polling misfire. Exit polls had projected that candidate Gore would win, when, after the Supreme Court intervention, the final tally had Bush winning the state, and hence the presidency, by 537 votes. But perhaps the polls in 2000 were a more accurate reflection of voter intention in that election than the official count, which was affected by confusing ballot design (including the infamous butterfly ballot) and disenfranchised voters who were turned away at the polls or had their ballots erroneously invalidated.

**You can find methodology reports by doing an internet search for the poll name along with the word “methodology.” See, for example, the methodology reports for the New York Times/Siena polls and the ABC News/Washington Post polls. Both of those meet the criteria in my checklist. They have low response rates, but so do all polls. The method used for weighting is more important than the raw response rate, and both of these polls weight by education and update their methodology to meet new challenges.

***For example, one of the big changes in 2020 has been the increase in early voting. This provides additional data that can be used to update weighting and turnout prediction models. Pollsters can match a state’s published list of early voters (again, whom they voted for is private, but whether they voted is public in some states) with persons who responded to the poll, and compare their response to “How likely are you to vote in this election?” with their early voting behavior.
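
A sketch of that kind of record matching, assuming the pollster has retained a voter-file identifier for each respondent and the state publishes an early-voter file carrying the same identifier; the file names and column names below are hypothetical.

```python
import pandas as pd

# Hypothetical files: poll respondents (with the voter-file ID kept from the
# sampling frame) and a state's public list of people who have already voted.
respondents = pd.read_csv("poll_respondents.csv")     # voter_id, likely_to_vote, ...
early_voters = pd.read_csv("state_early_voters.csv")  # voter_id

respondents["voted_early"] = respondents["voter_id"].isin(early_voters["voter_id"])

# Compare the stated answer to "How likely are you to vote in this election?"
# with observed early-voting behavior.
print(pd.crosstab(respondents["likely_to_vote"], respondents["voted_early"],
                  normalize="index"))
```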

Sharon Lohr