Last Updated November 8, 2016 @ernietedeschi

Ernie Tedeschi
GitHub repository of Stata code (start with masterrw.do)

Download LAT/USC microdata (requires free registration)

Note: The project described on this site relies on data from survey(s) administered by the Understanding America Study, which is maintained by the Center for Economic and Social Research (CESR) at the University of Southern California. The content of this site is solely the responsibility of the author and does not necessarily represent the official views of USC or UAS.

Motivation

Updated October 26, 2016


The LA Times/USC Daybreak Poll (LAT/USC) is an online longitudinal political survey of US adults with a large rolling sample relative to other election polls. It is based on methodology developed at the nonpartisan RAND Corporation and employed to accurate effect back in 2012. The poll is singular in several ways: not only does it track the same respondents frequently over time, it also asks them for their prediction of the winner and their self-assessed likelihood of voting, in addition to their personal election preference. Respondents provide each of these measures as a continuous probability from 0 to 100 rather than as a binary yes/no choice. The poll is folded into the broader survey work of a respected academic research institution: the USC Dornsife Center for Economic and Social Research. The Center makes the entire individual microlevel data of the poll public, free, and frequently updated, something virtually no other poll does.

However, LAT/USC has been a relative outlier among 2016 US election polls, at times dramatically so. This has caused some observers to overlook the poll's rich underlying data and dismiss the entire survey out of hand, while other observers have disproportionately cited the poll in often motivated ways.
Given the unique features of the poll, any one is a potential factor in its frequent outlier results, but the three most discussed possibilities are as follows:
1. The poll weights respondents partially based on self-described 2012 vote; due to well-known ex post recollection bias towards the winner, this has the effect of overweighting Romney voters who now are less likely to support Clinton;
2. Separate from the weighting, the longitudinal nature of the poll means that the survey is "stuck" with a skewed sample that would have been corrected had the poll repeatedly redrawn its sample over time; or
3. The poll is picking up signal being missed by most other electoral polls this cycle.
Since hypotheses 2 and 3 are nearly impossible to assess pre-Election Day, the goal of this exercise is to test the first hypothesis by excluding 2012 vote as a target for weighting.
Procedure
The basic approach is as follows:
Step 1: Choose the target dimensions for reweighting
Step 2: Generate population proportions along each Step 1 dimension using Census surveys
Step 3: Load the LAT/USC microdata, correct for missing data, and prepare the 7-day rolling samples
Step 4: Merge in the Step 2 proportions
Step 5: Create an initial synthetic weight and iteratively adjust using the Step 2 proportions for each 7-day sample

More discussion follows here:

Step 1: Choose the target dimensions for reweighting
LAT/USC weights on five dimensions: race/ethnicity, sex & education, sex & age, household size & income, and 2012 vote. More detail about their procedure may be found here.

I exclude 2012 vote entirely from the reweight and choose different dimensions based on 1) what is available in both LAT/USC and Census data, 2) how LAT/USC coded the categories underlying their variables, and 3) noncyclicality and nonseasonality. Ultimately, I thought gender, race/ethnicity, age, household income, education, marital status, and state of residence covered a broad array of electorally relevant dimensions. Criterion #3 ruled out labor market variables such as employment status, since these can swing significantly on a month-to-month basis without seasonal adjustment.

LAT/USC is a large sample relative to other election polls but is very small as a sample of the whole population, which makes multi-dimensional reweighting a challenge. Too many explicit interactions between variables risk dropping weights when applied to LAT/USC. My strategy then is an iterative reweight, where I extract the individual population proportions along each single dimension and use those to iteratively constrain the synthetic new weights I create. The one interaction I do include is between gender (2 categories) and race/ethnicity (4 categories), as this is 1) still safely low-dimensional and 2) electorally relevant. Note that none of the LAT/USC demographic variables I use are continuous. LAT/USC, for example, has 5 adult age categories, 4 race/ethnicity categories, and 3 household income categories.
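To make the dimensionality argument concrete, here is a minimal sketch that counts how many cells a full cross of the variables would need versus how many target proportions the one-dimension-at-a-time approach needs. It uses only the category counts stated above (gender, race/ethnicity, age, income); education, marital status, and state are left out because their category counts are not given here. The function names are illustrative and are not taken from the Stata code.

```python
from math import prod

# Category counts stated in the text. Education, marital status, and state
# are omitted because their category counts are not given.
CATEGORY_COUNTS = {"gender": 2, "race_ethnicity": 4, "age": 5, "income": 3}


def full_cross_cells(counts):
    """Number of cells if every listed variable is interacted with every other."""
    return prod(counts.values())


def marginal_targets(counts, interactions=()):
    """Number of target proportions when each dimension is constrained on its
    own, except for the listed groups, which are interacted explicitly."""
    interacted = {name for group in interactions for name in group}
    total = sum(n for name, n in counts.items() if name not in interacted)
    for group in interactions:
        total += prod(counts[name] for name in group)
    return total


n_full = full_cross_cells(CATEGORY_COUNTS)
n_marginal = marginal_targets(CATEGORY_COUNTS, [("gender", "race_ethnicity")])
print(n_full, n_marginal)
```

Even with only these four variables, a full cross needs far more cells than the marginal approach needs targets, and the gap widens quickly once education, marital status, and state are added.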

Step 2: Generate population proportions along each Step 1 dimension using Census surveys
LAT/USC uses the May 2016 basic monthly Current Population Survey (CPS) as its weighting reference, which is a joint survey between the Census Bureau and the Bureau of Labor Statistics. The CPS is the survey used, among other things, to calculate the unemployment rate, and has a rich set of demographic and social data. The downside of using the May 2016 basic CPS lies primarily in measuring income: the basic monthly CPS was never designed to reliably measure total household income; it only asks extensive questions regarding wage and salary earnings, and even then only to a quarter of its sample every month (the "outgoing rotation group"). The basic monthly CPS variable ostensibly tracking family income (HEFAMINC) has a low (<20%) response rate every month and often deviates from more reliable measures of income (see below).

My preferred alternative is the CPS March 2016 Annual Social and Economic Supplement (ASEC). The ASEC is a special augmented version of the CPS used for producing the yearly income and poverty statistics. It asks far more extensive questions about non-earned income than the basic monthly CPS and is far more reliable on this dimension. The May 2016 basic monthly CPS reports 41% of adults with a family income of <$35,000, with 29% and 30% reporting $35,000-$74,999 and >= $75,000, respectively (based on HEFAMINC). The proportions in the March 2016 CPS ASEC, however, are 25%, 29%, and 46%, respectively, a dramatically different result, while the 2015 ACS reports 26%, 31%, and 44%, respectively.

There are potentially important timing differences at play here: the March 2016 CPS ASEC reports income for the prior calendar year (2015), while ACS and basic monthly CPS respondents report income for the prior 12 months. Since the ACS is conducted throughout the calendar year but only released annually, its income measure is meaningfully affected by income received in some months of 2014. However, the timing differences appear to be breaking the wrong way: the basic monthly CPS shows a far lower proportion in the highest income tier (>= $75,000) than the March 2016 CPS ASEC. By virtually every reliable measure, however, such as weekly earnings, hourly wages, and employment, household income ought to have continued to rise since calendar year 2015, and certainly should not have shifted so tremendously from the highest tier to lower tiers. This leads me to strongly favor the CPS ASEC, and to a lesser extent the ACS, over the basic monthly CPS on the income dimension.

Moreover, having been conducted in March 2016, other demographic dimensions are unlikely to be meaningfully stale versus the current voting population.

One downside of the CPS ASEC, however, is that its sample is relatively small for this type of survey: 185,000 people in 94,000 households. Still, it is larger than the basic monthly CPS used by LAT/USC, and since my procedure uses relatively low-dimension categories, I am not uncomfortable with this aspect of the CPS ASEC.

The more important consideration in my opinion is that the CPS is designed to cover only the civilian population living in households (the civilian noninstitutional population). It does not reliably sample active duty military servicemembers, and it does not include people living in institutional or group quarters at all (prisoners, residents of nursing homes, monasteries, etc.) unless another household member lists them (e.g. students living in dormitories, who are often not directly covered by the CPS but whose parents may list them as household members, effectively bringing them into the CPS). Adding these populations in produces the resident population.


For weighting purposes, what matters is the population proportion of each variable, and here the differences between the civilian noninstitutional and resident populations are often small. For example, men made up 48.3% of the adult civilian noninstitutional population in 2014 versus 48.7% of the resident population, which is not surprising given the large male skew of the active duty military and institutional residents such as prisoners. And it bears emphasizing that a person excluded from the civilian noninstitutional population for weighting purposes is not excluded from the reweighted LAT/USC; the question is simply whether each person in LAT/USC is weighted correctly.


To account for this, I augment my CPS weights with weights generated from the 2015 American Community Survey (ACS). Like the CPS, the ACS collects a wealth of demographic, social, and economic data. Unlike the CPS, however, it covers the whole resident population: both household and institutional populations as well as the military. Its sample size is 2.5 million, far larger than the CPS ASEC, and it tracks individual components of household income more reliably than the basic monthly CPS.

The downside of the ACS, however, is that lag becomes more of an issue: the 2015 microdata is the most recent available, but that means the majority of its sample responded more than a year ago. It is possible that even over this short time, the demographics and, especially, the income makeup of the population have changed in politically meaningful ways.


I remove the institutional population from the ACS but leave in active duty military and residents living in noninstitutional group quarters.

The CPS ASEC extract I use comes from the Minnesota Population Center's superb IPUMS database. The 2015 ACS comes from the Census Bureau's online data repository.
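The population proportions used as targets in this step are weighted shares: each survey record counts in proportion to its survey weight. The sketch below shows that calculation on a handful of made-up records. The field names (`income`, `weight`), the tier labels, and the numbers are all hypothetical and are not drawn from the CPS ASEC or ACS files.

```python
from collections import defaultdict


def weighted_shares(records, key, weight="weight"):
    """Return each category's share of the total survey weight."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec[key]] += rec[weight]
    grand_total = sum(totals.values())
    return {cat: w / grand_total for cat, w in totals.items()}


# Made-up person records; the weights stand in for survey person weights.
sample = [
    {"income": "<35k", "weight": 1500.0},
    {"income": "35-75k", "weight": 1000.0},
    {"income": "75k+", "weight": 2500.0},
    {"income": "<35k", "weight": 1000.0},
    {"income": "75k+", "weight": 2000.0},
]

shares = weighted_shares(sample, "income")
for tier, share in shares.items():
    print(f"{tier}: {share:.1%}")
```

Running the same function once per dimension (and once on the gender by race/ethnicity combination) would give one set of targets per reference survey.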

Step 3: Load the LAT/USC microdata, correct for missing data, drop oversampled populations, and prepare the 7-day rolling samples
Some LAT/USC respondents have data missing that we need for reweighting. In some cases, this data is missing for some dates but is filled in for others; in these instances I fill in the missing data with the successful responses. Of the 37,000+ unique daily responses, 85 instances still have missing data along one of the relevant dimensions. I drop these instances from the sample.
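As a rough illustration of this fill-then-drop logic, the sketch below borrows a respondent's value from any other date on which it was recorded, then drops rows that are still incomplete. It assumes the demographic fields do not change between dates for a given respondent and uses the first non-missing value it encounters; the field names and records are hypothetical, and the actual Stata code may handle conflicts differently.

```python
def fill_and_drop(rows, id_key, fields):
    """Fill missing fields from the same respondent's other rows, then drop
    rows that still have a missing value in any of `fields`."""
    # First pass: remember the first non-missing value per respondent and field.
    known = {}
    for row in rows:
        for field in fields:
            if row.get(field) is not None:
                known.setdefault((row[id_key], field), row[field])

    # Second pass: fill the gaps on copies so the input is left unchanged.
    filled = []
    for row in rows:
        new_row = dict(row)
        for field in fields:
            if new_row.get(field) is None:
                new_row[field] = known.get((row[id_key], field))
        filled.append(new_row)

    return [r for r in filled if all(r[f] is not None for f in fields)]


rows = [
    {"id": 1, "date": "2016-07-11", "age": "35-44", "educ": None},
    {"id": 1, "date": "2016-07-18", "age": None, "educ": "college"},
    {"id": 2, "date": "2016-07-11", "age": "65+", "educ": None},
]
complete = fill_and_drop(rows, "id", ["age", "educ"])
print(complete)
```

In this toy input, respondent 1 is fully recovered from the two partial rows, while respondent 2 never reports education and is dropped.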
The LAT/USC microdata includes a national sample as well as oversamples of the Native American and Los Angeles County populations. Geographic information at greater detail than the state level is not available, making it impossible to ensure that the Los Angeles County and Native American reservation populations across all three samples are constrained to their actual proportions. Moreover, the official LAT/USC release is based solely on the national sample. Therefore, I drop the two oversamples.
In theory, LAT/USC polls each respondent once every 7 days, meaning that a respondent should only be present once in each 7-day wave. In practice, many respondents appear in the data at intervals more often than 7 days. I ensure that each respondent appears at most once in each 7-day window and that for each window the appearance is his or her latest response. This shrinks the sample size of each wave by 7% on average, to 2,447 from 2,632.
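The de-duplication rule above can be sketched as follows: for a window ending on a given date, keep only responses from the last seven days, and for each respondent keep only the most recent one. The record layout (`id`, `date`) and the example dates are hypothetical, and treating the window as the end date plus the six preceding days is an assumption.

```python
from datetime import date, timedelta


def wave_sample(responses, wave_end, days=7):
    """Latest response per respondent within the `days`-day window ending on
    `wave_end` (inclusive)."""
    start = wave_end - timedelta(days=days - 1)
    latest = {}
    for resp in responses:
        if start <= resp["date"] <= wave_end:
            prev = latest.get(resp["id"])
            if prev is None or resp["date"] > prev["date"]:
                latest[resp["id"]] = resp
    return list(latest.values())


responses = [
    {"id": 1, "date": date(2016, 10, 13)},
    {"id": 1, "date": date(2016, 10, 17)},  # same person again within the window
    {"id": 2, "date": date(2016, 10, 10)},  # falls before the window starts
    {"id": 3, "date": date(2016, 10, 19)},
]
wave = wave_sample(responses, wave_end=date(2016, 10, 19))
print(sorted((r["id"], r["date"].isoformat()) for r in wave))
```

Repeating the call for each successive end date would produce the sequence of rolling 7-day samples.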
Step 4: Merge in the Step 2 proportions
Step 5: Create an initial synthetic weight and iteratively adjust using the Step 2 proportions for each 7-day sample
Here, I start by assigning everyone an initial static weight of 10,000. Then, starting with the gender & race/ethnicity proportions, I go through each 7-day wave and compare each gender/race combination's target population proportion (from the ACS or CPS ASEC) with its actual weighted proportion in the wave. I adjust each individual's synthetic weight by the ratio of the target proportion to the actual proportion for that individual's gender/race/ethnicity combination. I then follow the same procedure for, in order, state of residence, age, education, income, and marital status, adjusting the synthetic weights along each dimension. Then I go back and repeat the whole process again beginning with gender/race/ethnicity, looping 100 times. Finally, I scale every individual weight in each wave by an equal proportion to bring the wave's total up to the adult population total of 250 million (this last step is strictly unnecessary for getting the weighted average electoral preferences, but makes the weights more interpretable).
My final weight is the arithmetic average of the ACS- and ASEC-based weights.
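The iterative adjustment in this step is a form of what survey statisticians usually call raking, or iterative proportional fitting. Below is a minimal sketch for a single wave and a single set of targets. The two dimensions, their categories, and the target proportions are invented for illustration, and the function names are hypothetical rather than taken from the Stata code. It assumes every category in the sample has a target and a nonzero weighted share.

```python
def weighted_share(rows, weights, dim):
    """Weighted proportion of each category of `dim` in the sample."""
    totals = {}
    for row, w in zip(rows, weights):
        totals[row[dim]] = totals.get(row[dim], 0.0) + w
    grand_total = sum(totals.values())
    return {cat: t / grand_total for cat, t in totals.items()}


def rake(rows, targets, start_weight=10_000.0, n_iter=100, total=None):
    """Adjust weights one dimension at a time by target/actual, repeatedly."""
    weights = [start_weight] * len(rows)
    for _ in range(n_iter):
        # Dimensions are visited in the order they appear in `targets`.
        for dim, target in targets.items():
            actual = weighted_share(rows, weights, dim)
            weights = [
                w * target[row[dim]] / actual[row[dim]]
                for row, w in zip(rows, weights)
            ]
    if total is not None:
        # Uniform rescaling; this does not change any weighted proportion.
        scale = total / sum(weights)
        weights = [w * scale for w in weights]
    return weights


# Toy wave with made-up categories and targets.
rows = [
    {"sex": "f", "age": "young"},
    {"sex": "f", "age": "old"},
    {"sex": "m", "age": "young"},
    {"sex": "m", "age": "old"},
    {"sex": "m", "age": "old"},
]
targets = {
    "sex": {"f": 0.52, "m": 0.48},
    "age": {"young": 0.40, "old": 0.60},
}
weights = rake(rows, targets, total=250_000_000)
print(weighted_share(rows, weights, "sex"))
print(weighted_share(rows, weights, "age"))
```

Under this sketch, the procedure would be run twice per wave, once with ACS-based targets and once with ASEC-based targets, and the two resulting weights for each respondent would then be averaged.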
Results
As the figure below shows, the reweighting procedure produces a result that is far closer to the center of the polling distribution as measured by the RCP 4-way average. From July 11 to October 11, the RCP has shown an average 3.46 percentage point margin for Clinton over Trump, versus 3.42 percentage points for my baseline ASEC-reweighted LAT/USC and -1.84 percentage points for the official LAT/USC. The ACS-reweighted version shows an average margin of 2.13 percentage points for Clinton.

Comparison of Original and Reweighted Proportions
The figures below compare the proportional weights for each characteristic between the October 19 LAT/USC wave and my baseline CPS ASEC reweight. Some parameters, such as higher levels of education, see relatively small adjustments. Others, such as household income, see much larger adjustments.