7 Non-Response and Weighting

We are moving further down the error tree to discuss the problem of Nonresponse:

7.1 What is nonresponse?

Nonresponse is fairly easy to classify at a high level: this is when people we select into our sample do not answer the survey.

But we can break this down in a couple of helpful ways.

The first is to talk about unit vs item non-response.

Unit non-response is when an individual doesn’t take the survey at all. That is, they refuse to participate entirely.

Item non-response is when an individual chooses to not answer a particular question on the survey. That is, we have answers for some but not all of the questions for an individual.

Both things are potentially problematic, though unit nonresponse is obviously a more severe problem.

Nonresponse (I will usually use this term to refer to unit nonresponse unless I state otherwise) has been steadily increasing since the early 1990s.

In Groves you get breathless concern from some pollsters about response rates approaching 50%. We have blown well past that, and now see response rates well under 5%.

The response rate is calculated by dividing the number of complete interviews by the total number of people sampled.

What’s a complete interview? This is up to interpretation, but for me, an interview is complete if the individual answers all of the weighting questions.

I have exact numbers on the 2020 RBS portion of the exit poll. In Pennsylvania, we attempted to contact 98,548 people off the voter file and got 1805 complete interviews. This means that the response rate was \(1805/98548=1.8\%\).

When it’s possible to do so, it’s helpful to decompose this response rate into two constituent parts: the contact rate and the cooperation rate.

The contact rate is the proportion of people sampled who were successfully contacted. We might not contact someone for a lot of reasons: the phone number we have for them is wrong, they work a demanding job and are rarely available, or they simply don’t pick up the phone for numbers that they don’t know (same). This is probably mitigated a bit by the text follow-up which we now do.

Regardless: in Pennsylvania in 2020 of the 98,548 people we sampled we contacted 21,736, which is a contact rate of \(21736/98548=22\%\). So before we even get to people making active decisions about participating in a poll, we only talk to approximately 1 in 5 people.

The cooperation rate is the proportion of people who completed the survey out of the number of people who were contacted. In Pennsylvania in 2020, 1805 people completed the survey out of the 21,736 people contacted, which means a cooperation rate of \(1805/21736 = 8.3\%\). Even after we successfully contact people, we then get a little less than 1 in 10 who actually respond!

I should note that this latter quantity was actually quite variable across states in 2020. Pennsylvania was actually the lowest cooperation rate. The highest was in Wisconsin, where 30% of contacted people participated. Go Badgers, I guess.

Now mathematically:

\[ Response.Rate = \frac{Complete}{Sampled}\\ Contact.Rate = \frac{Contact}{Sampled}\\ Coop.Rate = \frac{Complete}{Contact}\\ \] And therefore:

\[ Response.Rate = Coop.Rate*Contact.Rate = \frac{Complete}{Contact}*\frac{Contact}{Sampled} = \frac{Complete}{Sampled} \] To check that the math works:

.22*.083
#> [1] 0.01826

Yes.

I’m not sure if that’s actually helpful or just mathematically true.

Anyhow: breaking it down this way helps us to understand where non-response is coming from and therefore the best ways to fix it. In a few weeks we will read the article written using these data, which shows that the predictors of cooperation and contact are very different from one another, and therefore require different approaches to fix.

7.2 When does nonresponse become nonresponse bias?

Similar to coverage bias, just because we have nonresponders to our survey doesn’t mean that we necessarily have bias. As with coverage, we only care about nonresponse if the nonresponders are systematically different from the responders.

We can take a nearly identical approach to quantifying nonresponse bias (this is not our preferred equation for this, hold tight just a second), where non-response bias is equal to:

\[ NR.Bias = \frac{M}{N}(\bar{Y_r} - \bar{Y_m}) \]

Where \(M\) is the number of non-responders, \(N\) is the population size, \(\bar{Y_r}\) is the opinion of responders, and \(\bar{Y_m}\) is the opinion of nonresponders.

To be clear, when we think about non-response bias, we think about it happening for the full population. There are people in the population that will be respondents \(R\) and people in the population who will not \(M\); the difference between their two opinions is the second term of the equation. That is multiplied by the proportion of the population who are non-respondents to get the non-response bias.

If we want to put this back into the context of a particular survey we can think about:

\[ \bar{Y_r} = \bar{Y_n} + \frac{M}{N}(\bar{Y_r} - \bar{Y_m}) \]

That is, the actual sample mean that we can expect to get is a function of the opinions of every member of the population \(\bar{Y_n}\) plus this non-response error.

Similar to coverage error, this nonresponse error goes to 0 under two conditions. First: if the proportion of nonresponders is very, very small. Second: if the differences in the attitudes between responders and non-responders is very small.

Unlike for coverage bias, however, we absolutely know that the first thing will not be true. The vast majority of people are nonresponders. That means that any difference between the attitudes of responders and non-responders will be consequential.

For the second thing, we simply do not, and cannot know, the degree to which responders and nonresponders are different. So this is potentially problematic!

In the equation for nonresponse error above we classify all members of the population as either responders or non-responders. This is similar to what we did in the coverage bias case where members of the population were either covered by the sampling frame or not.

This setup, while mathematically simple, probably does not capture the reality of non-response. Instead of fixed 0 and 1 probabilities of being a non-responder, it is more accurate to think of each individual having a response probability \(p\) that is their individual probability of responding to the survey. So someone with a .6 value for p will sometimes respond, sometimes not, based on the stochastic nature of the particular moment that we reach them.

In this setup we can recast the above equation to:

\[ \bar{Y_r} = \bar{Y_n} + \frac{\sigma_{yp}}{\bar{p}} \]

Our expected sample mean is a function of the true population mean for everyone, plus the nonresponse error, which is the covariance of \(y\) and \(p\), divided by the average response probability, \(\bar{p}\).

Note that the Groves textbook has an error for this equation. They have \(Y_m\) instead of \(Y_n\). The above equation is correct.

Covariation is exactly what you think it is. It is a large positive number when two things are positively correlated (both things are high at the same time, low at the same time). It is a large negative number when two things are negatively correlated (one thing is high when the other is low). It is zero when two things are unrelated. Correlation is just covariation that has been rescaled to run from -1 to 1.

Here is some fake data for a population of 10,000.

head(sim.dat)
#>            p y
#> 1 0.04738566 1
#> 2 0.45385259 0
#> 3 0.81229907 1
#> 4 0.63500466 0
#> 5 0.49976918 1
#> 6 0.82102985 1
cor(sim.dat$p, sim.dat$y)
#> [1] 0.3931441

Each population member has a response probability \(p\), and a value for the outcome variable \(y\). Critically: the two things are correlated with each other fairly robustly.

When we sample from this population (again, this is just a theoretical exercise), we don’t get a random sample, but a sample that is informed by each individual’s response probability. People with higher \(p\) are more likely to be sampled. People with lower \(p\) are less likely to be sampled. Let’s take 10,000 samples of 100 people based on those probabilities, and calculate the mean in each:

#Pre-allocate a vector to hold the 10,000 simulated sample means
result <- rep(NA, 10000)
for(i in 1:10000){
  #Draw a sample of 100, with selection probability proportional to p
  result[i] <- mean(sample(sim.dat$y, 100, prob=sim.dat$p))
}
mean(result)
#> [1] 0.608962
mean(sim.dat$y)
#> [1] 0.4957

mean(result) - mean(sim.dat$y) 
#> [1] 0.113262

Our average sample mean from this procedure \(\bar{Y_r}\) is 60.9%. The true value in the population \(\bar{Y_n}\) is 49.6%. This means that the nonresponse error is just over 11 points.

If we deploy the equation above to calculate non-response error:

cov(sim.dat$p, sim.dat$y)/mean(sim.dat$p)
#> [1] 0.1129597

We get essentially the same number.

But again: this is just a theoretical exercise. We don’t know people’s \(p\), and we don’t know (famously!) the opinions of people who don’t take our surveys, and so ultimately we can’t actually calculate nonresponse error.

But this exercise did allow us to see the conditions for non-response bias!

First, thinking about the denominator, the lower the average response probability the more non-response bias we will have. That’s important when the average response probability is probably under 5%!

Second, thinking about the numerator, for there to be non-response bias there has to be covariation (or correlation, they are the same thing on different scales) between the response probability and the question of interest.

This latter thing is worth stopping on for a second: non-response bias (and non-coverage bias for that matter) is a question-level phenomenon. That is: because the non-response bias is a function of the correlation of response probability and \(y\), it will change based on what \(y\) is!

Earlier in the semester we used the example of running a poll during an Eagles game. For those hours, the response probability will be low for all football fans. If being a football fan is unrelated to politics in Philadelphia (mildly plausible) then there would not be a correlation between response probability and questions about politics. As such there would not be non-response bias. However, there is almost certainly a correlation between response probability and questions about sports (what is your favorite sport? who is the greatest athlete of all time?).

7.3 Survey Weighting

Survey weighting is, potentially, a solution to the problem of non-response bias.

The Bailey reading (which, by the way, you really should go do – it’s really good) talks about ignorable and non-ignorable non-response. The shorthand you can think of is fixable and non-fixable non-response.

Consider these DAGs (directed acyclic graphs: a way of drawing causal relationships) presented in Groves:

We have covered what \(P\) and \(Y\) are, and know that the problem of nonresponse bias comes from when they are correlated.

The first graph on the left presents a situation of completely unproblematic non-response. Response probabilities are affected by some variables \(Z\), and the outcome variable is affected by some variables \(X\), but neither influence the other thing. This would mean that there would be no correlation between \(P\) and \(Y\). So no problem!

The middle situation is where we run into problems. Now variables \(Z\) are a common cause of both \(P\) and \(Y\), which means that the two variables now are correlated. Failing to correct for \(Z\) will lead to non-response bias.

This is where Bailey comes in: he distinguishes between two situations. Non-response is ignorable when you have measured Z in the population and in the sample and can adjust for it (via weighting). Non-response is non-ignorable when you have not measured Z in either the population or the sample and therefore you can’t adjust for it.

We can measure whatever we want in our sample (I guess… kind of… as we know it can be hard to measure vague concepts like “need for cognition”). So the hurdle here is whether we are able to measure variables Z in the population.

The easiest case where non-response is ignorable is if we believe that \(Z\) – the common cause of both non-response and the variable of interest – is demographic. If the cause of both non-response and the outcome of interest is Age for example, we are able to adjust for age (using the processes below) and we will have fixed the problem we are facing.

To give an expedited intuition into weighting: let’s say we are measuring AI usage. Young people are more likely to use AI, and less likely to respond to surveys. As such there is a negative correlation between \(P\) and \(Y\) (here, AI usage). To correct for this, we give the young people who do respond to our survey more weight.

In some cases we can adjust for variables \(Z\) that are non-demographic but have known population totals. The best example for us is the vote distribution in previous elections. We know exactly (or pretty close to) the number of people who didn’t vote and the number of people who voted for each candidate. Similarly, we can adjust for party registration totals in a registered voter sample. More controversially, we might be able to weight to party ID (which is different than registration!) if we are willing to entertain a particular source to be authoritative.

The details of how this weighting actually works follow in the rest of this chapter.

7.3.1 Population Totals

Where might we get population totals to weight to?

As we discussed in the introduction, the Census is an OK place to get some demographic information, but only covers a limited set of variables.

Better to use is the American Community Survey (ACS), which asks about far more variables. We can get into the scientific merit of using a survey to weight a survey, but as Groves discusses, high quality government surveys generally have the lowest amount of nonresponse (that is, the highest response rates) and are likely a good bit more accurate than academic or media polls.

Getting weighting targets is as simple as looking up the distribution for a variable for the geography that we care about. So we could go on Social Explorer and get the following table from the 2023 ACS:

The distribution of highest educational attainment for the adult population in the US in 2023. All I have to do is measure education in my own dataset and weight to these targets and I will have adjusted for education.

However to simplify this I have already done some work here for you:

national.targets <- rio::import("https://github.com/marctrussler/IIS-Data/raw/refs/heads/main/NationalWeightingTargets.Rds", trust=T)

national.targets is a list of tibbles (the tidyverse data format) with population targets for a number of variables from the 2023 American Community Survey. Also included is a weighting target for 2024 Presidential vote. Note that these targets are for all US Adults. Not registered voters. This is in a list (a dataset of datasets) because the weighting function we will come to use expects to see a list of target distributions like this. If you want to add to this (say adding a target for partisanship) all you have to do is make a tibble in the same format and then append it to this list.

7.3.2 Cell Weighting

What do we do with these targets?

The simplest (as it will turn out, too simple and too limiting) thing we can do is cell-weighting.

In short, we will break our sample into different cells, and then apply weights to those cells to make sure they match population totals.

Let’s load in some data. This is a random sample of 1000 people from NBC’s August survey. I have included some key weighting variables as well as Trump approval, which we will use as our “main” variable of interest.

sm <- rio::import("https://github.com/marctrussler/IIS-Data/raw/refs/heads/main/NBCAug25.Rds", trust=T)

To make things easy we will convert trump approval into a binary variable:

sm |>
  mutate(approve.trump = if_else(trump_approval %in% c("Strongly approve","Somewhat approve"),T,F)) -> sm
mean(sm$approve.trump)
#> [1] 0.402

The approval rating for Trump is 40%. Now: this is not a situation where we know the truth about this variable. So we can’t actually calculate what kind of error we are making here. But a real possibility is non-response error, where there is a correlation between not supporting Trump and responding to the survey. The question is whether the things that cause that association can be captured by weighting, or not.

Let’s start by looking at sex:

prop.table(table(sm$sex))
#> 
#>   Male Female 
#>  0.473  0.527

Just under 53% of the sample is Female, and just over 47% is Male. We know that’s too many women, but just to confirm, our population weighting target is:

national.targets$sex
#> # A tibble: 2 × 2
#>   sex     Freq
#>   <fct>  <dbl>
#> 1 Female 0.510
#> 2 Male   0.490

About 51% women and 49% men. (The population has more women than men because women outlive men. It’s almost entirely driven by the 65+ group).

Further: I expect that there is a statistical relationship between sex and Trump approval:

prop.table(table(sm$sex, sm$approve.trump),1)
#>         
#>              FALSE      TRUE
#>   Male   0.5243129 0.4756871
#>   Female 0.6641366 0.3358634

Yes, a big one! 48% of men in our sample approve of Trump compared to only 34% of women.

So having too few men in our sample is biasing our result in an anti-Trump direction.

Let’s correct for that.

With cell based weighting the idea is to choose one weight for men and one weight for women such that, when those weights are applied, it adjusts the percentages so they match exactly the population target. The equation for cell weighting is:

\[ w_g = \frac{PopulationPercent_g}{SamplePercent_g} \]

So the weights for men and women in our sample will be:

\[ w_m = \frac{0.4902613}{0.473} = 1.04\\ w_f = \frac{0.5097387}{0.527} = 0.967 \]

These weights make good theoretical sense: we have too few men in our sample so the weight for men is above 1. When we calculate averages each man will count for 1.04 people. We have too many women in our sample so their weight is under 1. When we calculate averages each woman will count for .97 people.

Let’s apply this in our dataset. (Again, we are going to use a pre-built function to actually do our weighting, but for right now we will do it ourselves to see how it works.)

sm$weight <- NA
sm$weight[sm$sex=="Male"] <- 0.4902613/0.473
sm$weight[sm$sex=="Female"] <- 0.5097387/0.527

head(sm)
#>        trump_approval    sex  age4            pid5     race
#> 1 Strongly disapprove Female 18-29        Democrat    Other
#> 2    Strongly approve   Male 45-64 Lean Republican Hispanic
#> 3    Strongly approve   Male 30-44      Republican Hispanic
#> 4 Somewhat disapprove Female 45-64        Democrat    White
#> 5 Strongly disapprove Female   65+   Lean Democrat Hispanic
#> 6 Strongly disapprove Female   65+        Democrat    White
#>           educ             educ.race vote.pres.2024
#> 1 Some College Non-White Non-College  Kamala Harris
#> 2     Postgrad     Non-White College   Donald Trump
#> 3 Some College Non-White Non-College   Donald Trump
#> 4 Some College     White Non-College   Did not vote
#> 5     Postgrad     Non-White College  Kamala Harris
#> 6     Postgrad         White College  Kamala Harris
#>   approve.trump    weight
#> 1         FALSE 0.9672461
#> 2          TRUE 1.0364932
#> 3          TRUE 1.0364932
#> 4         FALSE 0.9672461
#> 5         FALSE 0.9672461
#> 6         FALSE 0.9672461

Now that this weight is applied, if we calculate the probability someone is Male in our survey:

weighted.mean(sm$sex=="Male", w = sm$weight)
#> [1] 0.4902613

It exactly equals the truth in the population.

And when we apply that weight to our outcome of interest, Trump support is now higher because we are (correctly!) giving the relatively more pro-Trump men in our data more weight:

mean(sm$approve.trump)
#> [1] 0.402
weighted.mean(sm$approve.trump, w=sm$weight)
#> [1] 0.4044135

But we are almost certainly unbalanced in other respects as well. Let’s take education for example:

prop.table(table(sm$educ))
#> 
#>   Hs Or Less Some College      College     Postgrad 
#>        0.132        0.323        0.278        0.267
national.targets$educ
#> # A tibble: 4 × 2
#>   educ          Freq
#>   <fct>        <dbl>
#> 1 Hs Or Less   0.385
#> 2 College      0.204
#> 3 Postgrad     0.126
#> 4 Some College 0.285

We have way too few HS or less people, and way too many of every other category.

To simplify things I’m going to just think about having a college degree or not:

sm |> 
  mutate(college = if_else(educ %in% c("College","Postgrad"),T,F)) -> sm
mean(sm$college)
#> [1] 0.545

national.targets$educ |> 
  filter(educ %in% c("Postgrad", "College")) |> 
  summarize(sum(Freq))
#> # A tibble: 1 × 1
#>   `sum(Freq)`
#>         <dbl>
#> 1       0.330

33% of our sample should have a college degree, but 54% does.

We could correct for this using the same process:

sm$weight <- NA
sm$weight[sm$college] <- .3304235/.545
sm$weight[!sm$college] <- (1-.3304235)/(1-.545)
head(sm)
#>        trump_approval    sex  age4            pid5     race
#> 1 Strongly disapprove Female 18-29        Democrat    Other
#> 2    Strongly approve   Male 45-64 Lean Republican Hispanic
#> 3    Strongly approve   Male 30-44      Republican Hispanic
#> 4 Somewhat disapprove Female 45-64        Democrat    White
#> 5 Strongly disapprove Female   65+   Lean Democrat Hispanic
#> 6 Strongly disapprove Female   65+        Democrat    White
#>           educ             educ.race vote.pres.2024
#> 1 Some College Non-White Non-College  Kamala Harris
#> 2     Postgrad     Non-White College   Donald Trump
#> 3 Some College Non-White Non-College   Donald Trump
#> 4 Some College     White Non-College   Did not vote
#> 5     Postgrad     Non-White College  Kamala Harris
#> 6     Postgrad         White College  Kamala Harris
#>   approve.trump    weight college
#> 1         FALSE 1.4715967   FALSE
#> 2          TRUE 0.6062817    TRUE
#> 3          TRUE 1.4715967   FALSE
#> 4         FALSE 1.4715967   FALSE
#> 5         FALSE 0.6062817    TRUE
#> 6         FALSE 0.6062817    TRUE

And again if we calculate the weighted percent of people who went to college:

weighted.mean(sm$college, w=sm$weight)
#> [1] 0.3304235

It matches up perfectly. And if we apply those weights to calculating the trump approval:

weighted.mean(sm$approve.trump, w=sm$weight)
#> [1] 0.4245761

We shift it (a good amount!) in the pro-Trump direction.

Ok, but when we calculated the weight for college or not we wrote over the weight for sex. It’s one or the other.

How can we think about weighting for both of these things at once?

The way that we can do that with cell weighting is simply defining multiple cells.

We can create a new variable that is the combination of sex and college:

sm |> 
  mutate(sex.college = paste(sex, college, sep="_")) -> sm

prop.table(table(sm$sex.college))
#> 
#> Female_FALSE  Female_TRUE   Male_FALSE    Male_TRUE 
#>        0.263        0.264        0.192        0.281

And then get the population targets for the crossover of those two things.

(I can figure that out, but I don’t want to; you can see how you would weight these cells if we had the population totals for men with a college degree, men without a college degree, etc.)

I want to stop here because we can see how this can quickly become unwieldy. We can believe that the variables \(Z\) that affect both the probability of response and the outcome might be many. Let’s say we think that sex, race, age, education, and 2024 vote are the variables that make up Z.

Thinking about making cells for those groups, there are 2 categories in sex, 5 in race, 4 in age, 4 in education, and 4 in 2024 vote. That means that we would need to make:

2*5*4*4*4
#> [1] 640

640 cells that we would have to weight.

We only have 1000 people in our survey, so a lot of these cells will have very few people in them, and some will be empty. We can see how many we get:

sm |> 
  group_by(sex, race, age4, educ,  vote.pres.2024) |> 
  summarise(n())
#> `summarise()` has grouped output by 'sex', 'race', 'age4',
#> 'educ'. You can override using the `.groups` argument.
#> # A tibble: 252 × 6
#> # Groups:   sex, race, age4, educ [122]
#>    sex   race  age4  educ         vote.pres.2024 `n()`
#>    <fct> <fct> <fct> <fct>        <fct>          <int>
#>  1 Male  White 18-29 Hs Or Less   Kamala Harris      2
#>  2 Male  White 18-29 Hs Or Less   Donald Trump       2
#>  3 Male  White 18-29 Hs Or Less   Other              1
#>  4 Male  White 18-29 Hs Or Less   Did not vote       2
#>  5 Male  White 18-29 Some College Kamala Harris      2
#>  6 Male  White 18-29 College      Kamala Harris      3
#>  7 Male  White 18-29 College      Donald Trump       2
#>  8 Male  White 18-29 College      Other              1
#>  9 Male  White 18-29 College      Did not vote       1
#> 10 Male  White 18-29 Postgrad     Donald Trump       1
#> # ℹ 242 more rows

This tibble is only 252 rows, so nearly 400 of the potential cells have no people in them.

Even if we did have a person or two in every cell, the idea behind weighting is that we are using the people (say older, white, male, college educated, Trump voters) we have in the survey to stand in for the people we don’t. So the key assumption is that people within each cell are representative of that group. With 1 or 2 people in each cell that seems very unlikely.

So we are going to need a better method that allows us to weight to multiple variables at once.

7.3.3 Raking

That method is Raking, which is also called “Iterative Proportional Fitting”.

You are never going to have to calculate rake weights by hand, but you should know what is happening under the hood. Raking lets us create a single set of weights that simultaneously put multiple variables into the correct proportions.

The key to how it works is in that longer name: iterative proportional fitting. In essence, raking works by running the same algorithm across our weighting variables many, many, times until we have reached a satisfactory level of weighting.

Like in the textbook, we are going to think about weighting on both sex and college education.

In our sample this is how many people are in each of these cells:

sm |> 
  mutate(college = if_else(college, "College","Non-College")) -> sm

sample.n <- table(sm$sex, sm$college)
sample.n
#>         
#>          College Non-College
#>   Male       281         192
#>   Female     264         263

sm |> 
  group_by(sex, college) |> 
  summarise(n=n()) -> sample.n
#> `summarise()` has grouped output by 'sex'. You can override
#> using the `.groups` argument.

And here is what we need these proportions to be:

#Also simplifying these weighting targets a bit to make the next stage clearer
national.targets$educ |> 
  mutate(educ= case_when(educ %in% c("Hs Or Less", "Some College") ~ "Non-College",
                         educ %in% c("College","Postgrad") ~ "College")) |> 
  group_by(educ) |> 
  summarize(Freq = sum(Freq)) |> 
  rename(college = educ,
         pop.prop = Freq) -> college
national.targets$sex |> 
  rename(pop.prop = Freq)-> sex

college
#> # A tibble: 2 × 2
#>   college     pop.prop
#>   <chr>          <dbl>
#> 1 College        0.330
#> 2 Non-College    0.670
sex
#> # A tibble: 2 × 2
#>   sex    pop.prop
#>   <fct>     <dbl>
#> 1 Female    0.510
#> 2 Male      0.490

In the first stage of the process we are going to weight to sex, doing exactly what we did above for cell weighting. We are going to divide the population proportion by the sample proportion to get the weight for each group:

#Gender Stage 1
sample.n |> 
  group_by(sex) |>
  #Calculate sample proportion for each group
  summarise(n = sum(n)) |> 
  mutate(sample.prop = n/sum(n)) |> 
  #Merge in the population proportion
  left_join(sex) |> 
  #Calculate cell weight as the population proportion divided by the sample proportion
  mutate(interim.weight = pop.prop/sample.prop) |>  
  select(sex, interim.weight) |> 
  #Merge that back to our original table of cell sizes
  right_join(sample.n) |> 
  #Calculate a weighted n, which is the original n multiplied by these weights
  mutate(weighted.n = n*interim.weight) -> weights
#> Joining with `by = join_by(sex)`
#> Joining with `by = join_by(sex)`

We have done cell weighting for sex. So it should not be a surprise that when we use our new weighted n to look at the proportion of men and women it will now be dead on. However, we have done nothing to adjust for the distribution of college degrees, so those marginals are still way off.

weights |> 
  group_by(sex) |> 
  summarise(n=sum(weighted.n)) |> 
  mutate(prop = n/sum(n))
#> # A tibble: 2 × 3
#>   sex        n  prop
#>   <fct>  <dbl> <dbl>
#> 1 Male    490. 0.490
#> 2 Female  510. 0.510

weights |> 
  group_by(college) |> 
  summarise(n=sum(weighted.n)) |> 
  mutate(prop = n/sum(n))
#> # A tibble: 2 × 3
#>   college         n  prop
#>   <chr>       <dbl> <dbl>
#> 1 College      547. 0.547
#> 2 Non-College  453. 0.453

Here is where we do some “iteration” in the raking procedure. We are going to do cell weighting for college, but now the input for those weights will be the weighted n that we calculate in the first stage. Additionally, the weight that we generate in this stage is multiplied by the weight from the previous stage:

#College Stage 1
#Start with the weights from previous stage
weights |> 
  group_by(college) |> 
  #Calculate the proportion of each group using weighted data from previous stage
  summarise(n = sum(weighted.n)) |> 
  mutate(sample.prop = n/sum(n)) |> 
  #Merge with population proportions
  left_join(college) |> 
  #Calculate cell weight by dividing population proportion by sample proportion
  mutate(stage.weight = pop.prop/sample.prop) |>  
  select(college, stage.weight) |> 
  #Merge back with starting data
  right_join(weights) |> 
  #Multiply the weight from this stage with the weight from the previous stage
  mutate(interim.weight = interim.weight*stage.weight,
         weighted.n = n*interim.weight) |> 
  select(-stage.weight)-> weights
#> Joining with `by = join_by(college)`
#> Joining with `by = join_by(college)`

weights
#> # A tibble: 4 × 5
#>   college     sex    interim.weight     n weighted.n
#>   <chr>       <fct>           <dbl> <int>      <dbl>
#> 1 College     Male            0.627   281       176.
#> 2 College     Female          0.585   264       154.
#> 3 Non-College Male            1.53    192       294.
#> 4 Non-College Female          1.43    263       376.

Now we have a new interim weight, and notice that it’s now different for each row. This is because this weight is simultaneously adjusting for both sex and college, so men with a college degree get a different weight than women with a college degree.

When we now look at our weighted proportions for each category:

weights |> 
  group_by(sex) |> 
  summarise(n=sum(weighted.n)) |> 
  mutate(prop = n/sum(n))
#> # A tibble: 2 × 3
#>   sex        n  prop
#>   <fct>  <dbl> <dbl>
#> 1 Male    470. 0.470
#> 2 Female  530. 0.530

weights |> 
  group_by(college) |> 
  summarise(n=sum(weighted.n)) |> 
  mutate(prop = n/sum(n))
#> # A tibble: 2 × 3
#>   college         n  prop
#>   <chr>       <dbl> <dbl>
#> 1 College      330. 0.330
#> 2 Non-College  670. 0.670

We see that now the college weights are nearly dead on, but at the expense of throwing off our weights for sex. These weights now say that women will be 53%, which is too high!

But remember, iterated proportional fitting. So what we do is to take these weights and then repeat the same process for sex and college a second time. Using the output of one stage as the input for the next:

#Sex Stage 2
weights |> 
  group_by(sex) |> 
  summarise(n = sum(weighted.n)) |> 
  mutate(sample.prop = n/sum(n)) |> 
  left_join(sex) |> 
  mutate(stage.weight = pop.prop/sample.prop) |>  
  select(sex, stage.weight) |> 
  right_join(weights) |> 
  mutate(interim.weight = interim.weight*stage.weight,
         weighted.n = n*interim.weight) |> 
  select(-stage.weight)-> weights
#> Joining with `by = join_by(sex)`
#> Joining with `by = join_by(sex)`


#College Stage 2

weights |> 
  group_by(college) |> 
  summarise(n = sum(weighted.n)) |> 
  mutate(sample.prop = n/sum(n)) |> 
  left_join(college) |> 
  mutate(stage.weight = pop.prop/sample.prop) |>  
  select(college, stage.weight) |> 
  right_join(weights) |> 
  mutate(interim.weight = interim.weight*stage.weight,
         weighted.n = n*interim.weight) |> 
  select(-stage.weight)-> weights
#> Joining with `by = join_by(college)`
#> Joining with `by = join_by(college)`

So we took the output of college stage 1, and used that as the input for sex stage 2. We then used that output as the input for college stage 2.

After two full rounds of weighting this is what we get:

weights |> 
  group_by(sex) |> 
  summarise(n=sum(weighted.n)) |> 
  mutate(prop = n/sum(n))
#> # A tibble: 2 × 3
#>   sex        n  prop
#>   <fct>  <dbl> <dbl>
#> 1 Male    490. 0.490
#> 2 Female  510. 0.510

weights |> 
  group_by(college) |> 
  summarise(n=sum(weighted.n)) |> 
  mutate(prop = n/sum(n))
#> # A tibble: 2 × 3
#>   college         n  prop
#>   <chr>       <dbl> <dbl>
#> 1 College      330. 0.330
#> 2 Non-College  670. 0.670

The weighted proportions for both sex and college are now dead on. The stages adjust a little bit in one direction, then a little bit in the other direction, on and on, until the weights are tuned such that they produce the right proportions for both variables.

Merging this back into our data:

sm |> 
  left_join(weights |>  rename(final.weight = interim.weight), join_by(sex, college)) -> sm

weighted.mean(sm$sex=="Female", w=sm$final.weight)
#> [1] 0.5098978

weighted.mean(sm$college=="College", w=sm$final.weight)
#> [1] 0.3304235

This single set of weights makes it so that the proportions of both women and college-educated people are correct when weighted.

This will also change our estimate for the weighted proportion of people who support Trump:

weighted.mean(sm$approve.trump, w=sm$final.weight)
#> [1] 0.4309938

7.3.4 Rake Weights in the Real World

Now, let’s say that I want to adjust for race, age, sex, and 2024 presidential vote all at the same time. It should be clear how that would happen with rake weighting. We would first do cell weighting for race, then use that output to do cell weighting for age, then use that output to do cell weighting for sex, etc. We could do this by hand, but the process is so mechanical that people have built tools to do it for us.

We are going to use the pewmethods package to do rake weighting. Please see earlier in the textbook for how to install this package.

The function we will use to rake the survey is called…. rake_survey

#?pewmethods::rake_survey

The help file is actually pretty helpful here.

We are going to put our data into this function. The next thing we will put is pop_margins which is a list of tibbles that have the population proportion for each of our target variables (which national.targets already is!).

The other arguments are options that you may or may not end up using. We can input a base weight if we like. Think back to when we talked about coverage error, and potentially the need to weight people from small households less than people from large households. In those cases we would calculate those weights using inverse probability weighting, and then insert those as the “base” weights.

The function will automatically scale the weights so that when you add them up they will equal the n of the survey. This is something that is just true of raking, I think. For example, when we did our “by-hand” raking above, our weights summed to:

sum(sm$final.weight)
#> [1] 1000

The \(n\) of the survey.

Finally, the algorithm has to know when to stop iterating, so the final two options specify that. The first asks: how close do we need to get the weighted proportions to the population proportions? The default is that it will keep going until the weighted proportions are within .000005 of the true proportions. Second, the algorithm will always stop whenever it reaches 100 iterations, by default.

So let’s reduce our weighting targets to just the ones that we want:

weighting.targets <- national.targets[c("race","educ","age4","sex","vote.pres.2024")]

The very persnickety part of this is that the tibbles have to have exactly the same names as the variables in your dataset, and each category within those variables needs to match exactly what is in weighting.targets. Additionally, the variables in both places need to be factor variables. This can take some time to get right! When you go to weight your final projects this is going to give you some trouble, but please just remember the above and keep working until you get it.

In this case I have already done the work to align the variable and category names for these variables, so we can go ahead and use the algorithm:

sm$rake.weights <- pewmethods::rake_survey(sm, weighting.targets)

#Saving a copy of this to use in subsequent lection
#sm |> 
#  select(-final.weight, -n, -weighted.n, -weight) |> 
#  rio::export("/Users/marctrussler/Documents/GitHub/IIS-Data/NBCAug25Weighted.Rds")

What does it look like when this goes wrong? Let me change the name of one of the education categories so that it no longer matches the weighting targets:

sm |> 
  mutate(educ = if_else(educ=="Hs Or Less", "HS Or Less",educ)) -> sm2
#pewmethods::rake_survey(sm2, weighting.targets)

The other thing that you should know is that weights can only be given to units that have a value for all of the weighting variables.

For the second person let’s say they didn’t answer the vote choice question:

sm2 <- sm
sm2$vote.pres.2024[2] <- NA
head(sm2)
#>        trump_approval    sex  age4            pid5     race
#> 1 Strongly disapprove Female 18-29        Democrat    Other
#> 2    Strongly approve   Male 45-64 Lean Republican Hispanic
#> 3    Strongly approve   Male 30-44      Republican Hispanic
#> 4 Somewhat disapprove Female 45-64        Democrat    White
#> 5 Strongly disapprove Female   65+   Lean Democrat Hispanic
#> 6 Strongly disapprove Female   65+        Democrat    White
#>           educ             educ.race vote.pres.2024
#> 1 Some College Non-White Non-College  Kamala Harris
#> 2     Postgrad     Non-White College           <NA>
#> 3 Some College Non-White Non-College   Donald Trump
#> 4 Some College     White Non-College   Did not vote
#> 5     Postgrad     Non-White College  Kamala Harris
#> 6     Postgrad         White College  Kamala Harris
#>   approve.trump    weight     college  sex.college
#> 1         FALSE 1.4715967 Non-College Female_FALSE
#> 2          TRUE 0.6062817     College    Male_TRUE
#> 3          TRUE 1.4715967 Non-College   Male_FALSE
#> 4         FALSE 1.4715967 Non-College Female_FALSE
#> 5         FALSE 0.6062817     College  Female_TRUE
#> 6         FALSE 0.6062817     College  Female_TRUE
#>   final.weight   n weighted.n rake.weights
#> 1    1.3772120 263   362.2068    0.4203724
#> 2    0.6502934 281   182.7324    0.6433706
#> 3    1.6008841 192   307.3697    1.1926403
#> 4    1.3772120 263   362.2068    1.3847687
#> 5    0.5594358 264   147.6911    0.5586355
#> 6    0.5594358 264   147.6911    0.3991181

sm2$weight <- pewmethods::rake_survey(sm2, weighting.targets)
#> Warning in mm * ww: longer object length is not a multiple
#> of shorter object length
#> Warning in mm * whalf: longer object length is not a
#> multiple of shorter object length
#> Warning in mm * ww: longer object length is not a multiple
#> of shorter object length
#> Warning in mm * ww: longer object length is not a multiple
#> of shorter object length
#> Warning in mm * ww: longer object length is not a multiple
#> of shorter object length
#> Warning in mm * ww: longer object length is not a multiple
#> of shorter object length
#> Warning in mm * ww: longer object length is not a multiple
#> of shorter object length
#> Warning in mm * ww: longer object length is not a multiple
#> of shorter object length
#> Warning in mm * ww: longer object length is not a multiple
#> of shorter object length
#> Warning in mm * ww: longer object length is not a multiple
#> of shorter object length
#> Warning in mm * ww: longer object length is not a multiple
#> of shorter object length
#> Warning in mm * ww: longer object length is not a multiple
#> of shorter object length
#> Warning in mm * ww: longer object length is not a multiple
#> of shorter object length
#> Warning in mm * ww: longer object length is not a multiple
#> of shorter object length
#> Warning in mm * ww: longer object length is not a multiple
#> of shorter object length
#> Warning in mm * ww: longer object length is not a multiple
#> of shorter object length
#> Warning in mm * ww: longer object length is not a multiple
#> of shorter object length
#> Warning in mm * ww: longer object length is not a multiple
#> of shorter object length
#> Warning in mm * ww: longer object length is not a multiple
#> of shorter object length
#> Warning in mm * ww: longer object length is not a multiple
#> of shorter object length
#> Warning in mm * ww: longer object length is not a multiple
#> of shorter object length
#> Warning in design$prob/g: longer object length is not a
#> multiple of shorter object length
#> Warning in g * whalf: longer object length is not a
#> multiple of shorter object length

Now we got a bunch of warnings, but if we go into sm2 we can see that all of our units have a weight. What happened here is that it actually only calculated 999 weights, so the weight assigned to the 2nd person is actually the weight calculated for the 3rd person, the weight assigned to the 3rd is really the 4th’s, and so on. To get to 1000 weights it just looped back around and gave the 1st weight to the 1000th person:

sm2$weight[1]
#> [1] 0.4196132
sm2$weight[1000]
#> [1] 0.4196132

Now, the pewmethods raking package is definitely the easiest to use of any I’ve tried. But I really hate this functionality. It should absolutely just throw an error and not give you these weights.

Luckily there are two things you can do.

First: you should reduce your dataset down to cases with non-missing values on all the weighting variables before you start weighting.

Second: the outcome of the weighting is a known thing! When we calculate the proportion of people who are men, or are white, or have a college education, those numbers should match the population totals exactly. If they don’t, we know that something went wrong.

weighting.targets$race
#> # A tibble: 5 × 2
#>   race       Freq
#>   <fct>     <dbl>
#> 1 Asian    0.0627
#> 2 Black    0.116 
#> 3 Hispanic 0.177 
#> 4 Other    0.0466
#> 5 White    0.598
#Correctly calculated weights
pewmethods::get_totals("race", sm, "rake.weights")
#> Warning: There was 1 warning in `mutate()`.
#> ℹ In argument: `race = (function (f, na_level =
#>   "(Missing)") ...`.
#> Caused by warning:
#> ! `fct_explicit_na()` was deprecated in forcats 1.0.0.
#> ℹ Please use `fct_na_value_to_level()` instead.
#> ℹ The deprecated feature was likely used in the dplyr
#>   package.
#>   Please report the issue at
#>   <https://github.com/tidyverse/dplyr/issues>.
#>       race rake.weights
#> 1    White    59.751854
#> 2    Black    11.616352
#> 3 Hispanic    17.703553
#> 4    Asian     6.266394
#> 5    Other     4.661847
#Incorrectly calculated weights
pewmethods::get_totals("race",sm2,"weight")
#>       race    weight
#> 1    White 74.287994
#> 2    Black  5.929992
#> 3 Hispanic 10.581050
#> 4    Asian  2.312860
#> 5    Other  6.888104

So I would very much suggest doing a check like this after you complete your weighting. I do every time!

7.3.5 Adjusting and interpreting weights

Once you have your weights are you good to go? Kind of! There are a couple of things you may want to think about, however, after you have calculated your weights.

The first is the size of your weights. For us, here are the summary statistics on our weights:

summary(sm$rake.weights)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#> 0.09425 0.47543 0.61812 1.00000 0.91590 7.55338

The median person in our survey has a weight of .61. The maximum weight is 7.5 and the minimum weight is .09.

This person is being treated as 7.5 people:

sm[sm$rake.weights==max(sm$rake.weights),]
#>       trump_approval    sex  age4       pid5  race
#> 824 Strongly approve Female 30-44 Republican Asian
#>           educ             educ.race vote.pres.2024
#> 824 Hs Or Less Non-White Non-College   Donald Trump
#>     approve.trump   weight     college  sex.college
#> 824          TRUE 1.471597 Non-College Female_FALSE
#>     final.weight   n weighted.n rake.weights
#> 824     1.377212 263   362.2068     7.553376

This person – an Asian millennial woman with a high school education or less who voted for Donald Trump – is from an overlapping set of groups that are relatively under-represented in the survey. As such, she is being used to represent 7.5 people!

This person is being treated as 1/10 of a person. (This feels weird. I am not questioning these people’s inherent self worth. We are all someone’s child.)

sm[sm$rake.weights==min(sm$rake.weights),]
#>          trump_approval  sex  age4        pid5  race
#> 694 Strongly disapprove Male 45-64 Independent Other
#>         educ         educ.race vote.pres.2024 approve.trump
#> 694 Postgrad Non-White College          Other         FALSE
#>        weight college sex.college final.weight   n
#> 694 0.6062817 College   Male_TRUE    0.6502934 281
#>     weighted.n rake.weights
#> 694   182.7324   0.09424746

This person – an “other” race Gen-X man with a postgraduate education who voted for a third-party candidate in 2024 – is from an overlapping set of demographics that are relatively over-represented in the survey relative to the population.

Are we ok with that? Well… we are getting into the art of decision making.

There are situations where this can get out of hand. For example, in the 2016 USC/LA Times poll there was one 19-year-old Black respondent who had a weight of nearly 30. Now, this guy just happened to be a Trump supporter, and his inclusion and high weight were enough to make this poll much stronger for Trump compared to others, and made their crosstabs show Trump garnering significant support among Black voters.

In some cases you may therefore decide to trim the weights, where weights outside a specified range are censored to a specific value.

I haven’t seen amazingly good guidance on this, but I think a relatively good benchmark is to trim to the 1st and 99th percentiles, which will get rid of any uniquely extreme weights:

sm$trimmed.weights <- pewmethods::trim_weights(sm$rake.weights, lower_quantile = .01, upper_quantile=.99)
summary(sm$trimmed.weights)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>  0.1363  0.4818  0.6245  1.0000  0.9222  5.5462

However, if you do this step you are changing the weights that guarantee our sample proportions match up with the population proportions. So if we now look at the weighted distribution of race in our sample:

pewmethods::get_totals("race", sm, "trimmed.weights")
#>       race trimmed.weights
#> 1    White       60.223545
#> 2    Black       11.666441
#> 3 Hispanic       17.509036
#> 4    Asian        5.884045
#> 5    Other        4.716932

It no longer matches exactly, and the proportion that is white is now too high.

Because of this, it is recommended practice after trimming your weights to “recalibrate” them to the weighting targets. This is where we can use the base_weight functionality in the rake_survey function to put in our trimmed weights and then have the raking process calibrate them back to the population weighting targets.

In my head this should just re-inflate the weights that we just trimmed. It does a bit, but not fully. Genuinely this is a bit beyond my ability to fully comprehend and explain to you in a simplistic way. But doing the steps in this way does in fact reduce the variance of the weights and calibrates things back to the population totals:

sm$adj.trimmed.weights <- pewmethods::rake_survey(sm, weighting.targets, base_weight = "trimmed.weights")

summary(sm$rake.weights)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#> 0.09425 0.47543 0.61812 1.00000 0.91590 7.55338
summary(sm$trimmed.weights)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>  0.1363  0.4818  0.6245  1.0000  0.9222  5.5462
summary(sm$adj.trimmed.weights)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>  0.1269  0.4752  0.6112  1.0000  0.9174  6.0506

pewmethods::get_totals("race", sm, "trimmed.weights")
#>       race trimmed.weights
#> 1    White       60.223545
#> 2    Black       11.666441
#> 3 Hispanic       17.509036
#> 4    Asian        5.884045
#> 5    Other        4.716932
pewmethods::get_totals("race", sm, "adj.trimmed.weights")
#>       race adj.trimmed.weights
#> 1    White           59.751846
#> 2    Black           11.616350
#> 3 Hispanic           17.703550
#> 4    Asian            6.266407
#> 5    Other            4.661847

A complementary (or alternative) way to think about this problem is to adjust the standard error to take into account how much uncertainty is being caused by the weights.

We can do so via calculating the design effect of our weights. The formula for the design effect is:

\[ deff = 1+ (\frac{sd(w)}{\bar{w}})^2 \]

Looking at this formula, what makes our design effect large is a high variance in our weights. (With standard rake weighting the average of your weights will be 1, which makes the denominator irrelevant.) High variance in our weights means that we had to make some people’s weights very small and other people’s weights very large. A larger design effect means that our weights are doing more work to make our sample look like the population as a whole.

We can use this formula to calculate the design effect, but the pewmethods package will also do it for you:

1 + (sd(sm$rake.weights)/mean(sm$rake.weights))^2
#> [1] 2.133557
pewmethods::calculate_deff(sm$rake.weights)
#> # A tibble: 1 × 6
#>       n mean_wt sd_wt  deff   ess   moe
#>   <int>   <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  1000    1.00  1.06  2.13  469.  4.53

What do we do with this design effect, which tells us how much work our weights are doing? Why do we care?

We are going to dive deeper into this, but the intuition here is that a higher design effect (that comes from extremely variable weights) necessarily means that our survey has larger variance than what we would expect just from random sampling alone.

Why is that? The most critical thing to remember about weighting is that the key assumption is that the people we have in our weighting cells (30-45 year olds, men, postgraduate degree holders… whatever) are representative of all people in that group in the population. When we have too few people without a college degree and give those people more weight, we are necessarily arguing that “weighting up” the non-college people in our sample is adequate to represent the non-college people not in our sample. When we have high weights it means that we have few people in a particular cell (think of that one young Black man in the USC survey getting a lot of weight); because that cell size is smaller it also means that it is more variable and there is a good deal of risk we have an unrepresentative sample of those people.

If a bigger design effect means more variance, how can we capture that? The outcome of the design effect command above gives us some clues to how we might do this.

The first is to calculate the effective sample size. Given that our higher weights mean that we have more variance than we expected, we can think of this as having a smaller \(n\) than what we initially thought. The command above calculates effective n, but the formula is simple:

\[ effective.n = \frac{n}{deff} \]

Our effective sample size of 468 is telling us: because of the additional variance caused by needing to weight this sample to the population, our precision is similar to if we had a sample size of 468 people.

The other way that we can use the design effect is to inflate our standard error. Specifically, we can calculate an adjusted standard error that is: \[ se_{adj} = se * \sqrt{deff}\\ = \sqrt{\frac{.25}{1000}} * \sqrt{2.133}\\ = 0.0158 *\sqrt{2.133}\\ = 0.023 \]

The original standard error was 0.016, which would mean a margin of error of \(1.96*0.016 = 0.031\), or plus or minus 3.1%. The adjusted standard error is 0.023, which would mean a margin of error of \(1.96*0.023 = 0.045\), or plus or minus 4.5%. So about a point and a half more margin of error than what would be expected if we just had “regular” sampling error.

Note that the effective sample size and this adjusted margin of error match up! If we calculated the margin of error for our effective sample size it would be

1.96*sqrt(.25/468)
#> [1] 0.04530052

The same margin of error. So they really are just two ways to get at the same thing.