11 Social Desirability Strategies

When we were discussing questionnaire design and the psychology of question answering, we spent some time talking about social desirability and motivated misreporting.

There are topics that are obviously going to have social desirability problems, things like:

  • Drug use

  • Vote buying

  • Discrimination and prejudice

  • Attitudes in authoritarian regimes

But I think it’s good to not get hung up on these big and obvious examples. Anything with strong norms around it will suffer from social desirability. We will see in the reading for this week that a good chunk of why we have such bad estimates of voter turnout is due to social desirability. Further, many of our attempts to get at people’s true attitudes and behavior around voting are going to be influenced by the strong norms we have around independence and political participation in this country (norms that other countries have as well, to be clear!).

In this chapter we will briefly review the “standard” options for reducing social desirability before quickly moving on to three advanced methods for defeating it: list experiments, randomized response experiments, and implicit association tests. We will finish up by doing more to unpack the social desirability at work in answering voter turnout questions.

11.1 Standard advice for social desirability

As a review, Groves gives several tips for reducing social desirability/motivated misreporting in surveys.

  1. Use open rather than closed questions for eliciting the frequency of sensitive behaviors.

  2. Use long rather than short questions.

  3. Use familiar words in describing sensitive behaviors.

  4. Deliberately load the question to reduce misreporting.

Loading the question means wording it in a way that invites a particular response.

“Even the calmest parents get mad at their children sometimes,” “How many times during the past week did your children do something that made you angry?”, “Many psychologists believe it is healthy for parents to express anger,” etc.

Another example of this is the question we use for asking about voter turnout:

In talking to people about elections, we often find that a lot of people were not able to vote because they were not registered, they were sick, or they just didn’t have time. How about you – did you vote in the elections this November?

The other piece of general advice is that social desirability will be lessened when the respondent is not talking to another human being. So we would expect less social desirability in self-administered online surveys, and more in live-caller interviews.

11.2 List Experiments

But that is a pretty unconvincing list of things. If we are trying to measure, for example, the percent of people who hold racist views, then long questions and forgiving wording are not going to do the job. We need methods that allow people to tell the truth in a way that remains relatively anonymous.

Luckily, there are a couple of methods that – when used ideally – allow people to admit to socially undesirable things in a way where their answers are anonymous.

The first of these is a list experiment. (Already you should see that I have used the word “experiment” which should always clue you that we are going to be randomizing something!)

Here is how it’s going to work:

We are going to ask the following question (this is from Kuklinski et al. 1997, so be cool about “1 million dollars” being a lot of money):

Now I’m going to read you three things that sometimes make people angry or upset. After I read all three, just tell me HOW MANY of them upset you. I don’t want to know which ones, just HOW MANY.

Condition 1:

  • The federal government increasing the tax on gasoline

  • Professional athletes getting million-dollar salaries

  • Large corporations polluting the environment

Condition 2:

  • The federal government increasing the tax on gasoline

  • Professional athletes getting million-dollar salaries

  • Large corporations polluting the environment

  • A black family moving in next door

The respondents are randomized into the two conditions. In one condition the socially undesirable opinion (being upset about a black family moving in next door) is added as an option.

Remember that in an experiment everything between the two groups is balanced except exposure to the treatment. That means that there should be the same proportion of people in each condition who are angry about gas taxes, athlete salaries, and pollution. Therefore, if the average count is higher in the second condition, it means that at least some people in that condition are upset about the socially undesirable thing.

Let’s see what that would look like with a simulation in R

library(dplyr)

set.seed(2105)
#Assign 5000 opinions on the 4 topics
gas <- rbinom(5000, 1, .7)
athletes <- rbinom(5000, 1, .2)
pollution <- rbinom(5000, 1, .4)
racism <- rbinom(5000, 1, .3)
dat <- cbind.data.frame(gas, athletes, pollution, racism)

#The true proportion with racist attitudes is:
mean(dat$racism)
#> [1] 0.304
#30.4%

#Randomly assign to treatment
dat$treat <- rbinom(5000,1,.5)

#People in control group add up the first three items, 
#people in the treatment group add up all 4 items

dat |> 
  mutate(count = case_when(treat==0 ~ gas+athletes+pollution,
                           treat==1 ~ gas+athletes+ pollution + racism)) |> 
  group_by(treat) |> 
  summarise(mean(count))
#> # A tibble: 2 × 2
#>   treat `mean(count)`
#>   <int>         <dbl>
#> 1     0          1.30
#> 2     1          1.64

The true proportion of people holding racist attitudes in the sample is 30%. With a direct question there is no way we would recover anything near that number.

With the list method we see that people in the control condition were, on average, upset about 1.30 of the things. In the treatment condition people were upset about 1.64 of the things. This reveals that at least some people hold the racist attitude.

Indeed, if we just subtract the two numbers:

1.64-1.30
#> [1] 0.34

We get an (approximate!) estimate of the proportion of people who hold the socially undesirable view.
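
Since the estimator is just a difference in means, we can also attach some uncertainty to it. Here is a minimal sketch (my own addition, continuing with the simulated dat from above) that saves the item counts back into the data and uses a two-sample t-test to get the estimate and a confidence interval:

#Save the item counts computed above back into the data
dat <- dat |> 
  mutate(count = case_when(treat==0 ~ gas + athletes + pollution,
                           treat==1 ~ gas + athletes + pollution + racism))

#The difference in means (treatment minus control) estimates the proportion
#holding the sensitive view; the t-test gives a confidence interval around it
with(dat, t.test(count[treat==1], count[treat==0]))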

What does this look like in real life?

Here are the real results from the Kuklinski et al. list experiment, where they found that racist attitudes persisted among Southern (but not Northern) whites.

Another common use of a list experiment is to compare explicitly stated attitudes with the “revealed” attitudes from the list experiment.

This is what Coppock (2017) did to test the “Shy Trump” hypothesis – that the reason behind Trump’s weak poll numbers was measurement error from his supporters not wanting to admit to voting for him (we have already seen via the AAPOR report that this probably was not the case!)

Coppock explicitly asked respondents in September 2016 if they were going to vote for Trump, and 32.5% said that they would (Coppock included people not voting, so this is a reasonable number!). He then did a list experiment that included, as the 4th item, that the person would vote for Donald Trump.

Ultimately the results showed this:

The list experiment showed nearly the same result as the direct questioning, which is further evidence that Trump’s overperformance was not the result of individual-level measurement error!

List experiments are not perfect, and there are a few pitfalls to be aware of.

First, if the list is overly long or complicated the logic can easily be corrupted by satisficing. If people do not read all the options they will not necessarily count correctly which throws the whole thing off. This risk can be mitigated by making the list short and easy to read.

Second, the whole “anonymity” principle is ruined if someone agrees to all of the statements! In our first example, if someone is upset about all 4 items we know they hold the racist view. This can be mitigated by being careful about the selection of items. If you include an item that almost no one is likely to be upset about, then you avoid respondents answering “all.” You can also choose items that are negatively correlated. In the list above, it is unlikely that someone is upset about gasoline taxes and upset about pollution, lowering the risk that someone will be upset about all the items.

Third, a list experiment is inherently “low power” because we are trying to recover an estimate from a noisy difference in means. These experiments should only be used when necessary, and only with a fairly large \(n\) (like, more than 1000).
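
To see what “low power” means concretely, here is a quick simulation of my own (not from the reading) that repeats the survey many times and compares the sampling variability of the list-experiment estimate to that of a hypothetical, fully honest direct question at the same \(n\):

set.seed(2106)

one_survey <- function(n = 1000) {
  gas <- rbinom(n, 1, .7)
  athletes <- rbinom(n, 1, .2)
  pollution <- rbinom(n, 1, .4)
  racism <- rbinom(n, 1, .3)
  treat <- rbinom(n, 1, .5)
  count <- gas + athletes + pollution + treat*racism
  c(direct = mean(racism),  #what we would get if everyone answered honestly
    list = mean(count[treat==1]) - mean(count[treat==0]))
}

#Standard deviation of each estimator across 1000 repeated surveys;
#the list estimate is several times noisier than the direct question
estimates <- replicate(1000, one_survey())
apply(estimates, 1, sd)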

11.3 Randomized Response Experiments

Another, similar, method of eliciting sensitive views is a Randomized Response experiment.

In this method (which is more of an in-person method, but it could be done virtually as well), we similarly want to elicit responses on a sensitive question.

The respondent privately and secretly flips a coin. That is, we as the survey administrators do not know the outcome of the flip. If the coin comes up heads they answer “Yes” to the sensitive question regardless of their true opinion. If the coin comes up tails they answer the question with their true attitude.

Because of the secret coin flip, we do not know if people answered yes because of their true opinion or due to random luck.

What does this look like in practice? First, let’s simulate a situation where nobody holds the sensitive view:

set.seed(2102)
true.attitude <- rep(0,1000)
mean(true.attitude)
#> [1] 0
dat <- as.data.frame(true.attitude)

#Respondents privately flip a coin
dat |> 
  mutate(treatment = sample(c(0,1), 1000, replace=T)) -> dat

#If heads they answer "Yes" (1), if tails they answer with the truth
dat |> 
  mutate(response = case_when(treatment==1 ~ 1, 
                              treatment==0 ~ true.attitude)) -> dat

mean(dat$response)
#> [1] 0.508

When nobody holds the sensitive view the proportion of people answering yes is extremely close to 50%. This is what we “expect” to happen because half of the people will flip a head and answer yes.

If more than 50% of people report the sensitive view, then it must be the case that some of the people who flipped tails answered “Yes” to the sensitive question.

Here is what this would look like if some of the people held this sensitive view:

set.seed(2102)
true.attitude <- rbinom(1000, 1, .4)
mean(true.attitude)
#> [1] 0.383
dat <- as.data.frame(true.attitude)

#Respondents privately flip a coin
dat |> 
  mutate(treatment = sample(c(0,1), 1000, replace=T)) -> dat

#If heads they answer "Yes" (1), if tails they answer with the truth
dat |> 
  mutate(response = case_when(treatment==1 ~ 1, 
                              treatment==0 ~ true.attitude)) -> dat

mean(dat$response)
#> [1] 0.691

The truth is that 38.3% of the sample holds the sensitive view. After the coin-flipping and answering procedure we see that 69.1% of people answered yes to the question. Because this is greater than 50% we know that at least some people hold the sensitive view. We can recover an estimate of the proportion who hold it by calculating how far above 50% the observed “Yes” rate is, scaled by the half of the sample that is answering truthfully:

\[ \pi = \frac{P(Yes)-.5}{.5}= \frac{.691-.5}{.5}=.382 \]
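
Here is a small helper function (my own sketch, assuming the fair-coin design described above) that backs the estimate out of the observed “Yes” rate and attaches a rough standard error:

#pi-hat = (P(Yes) - .5)/.5 = 2*P(Yes) - 1, so its standard error is
#twice the standard error of the observed "Yes" rate
rr_estimate <- function(yes_rate, n) {
  c(estimate = (yes_rate - .5)/.5,
    se = 2*sqrt(yes_rate*(1 - yes_rate)/n))
}

#With the simulated data above this gives roughly .38 with an SE of about .03
rr_estimate(mean(dat$response), nrow(dat))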

Similar to a list experiment, this method is inherently low powered because we are deliberately adding in noise. We need a lot of people, or a relatively large percent who hold the view, in order for this to work.

The other issue with both of these methods is that they only tell us the proportion of people who hold these views; we don’t get to find out who holds them. So any further digging (is the view more prevalent in some subgroups, etc.) is not really possible.

But it’s cool math!

11.4 Implicit Association Tests

Another possibility is completely bypassing explicitly held attitudes and instead trying to delve deeper into people’s subconscious attitudes and associations about objects.

Thinking back to our reading of Zaller and Feldman and Perez, we discussed how our memories are organized in a “lattice-like” structure of considerations that are associatively linked together based on how we cognitively relate things.

If we can get inside people’s brains and understand the way in which they link different objects with positive and negative thoughts, that gives us insight into people’s more intrinsic and subconscious motivations. In other words, their “implicit” (as opposed to “explicit,” out loud) cognition.

Nosek, Greenwald, and Banaji, in reviewing these methods, say:

Implicit cognition could reveal associative information that people were either unwilling or unable to report. In other words, implicit cognition could reveal traces of past experience that people might explicitly reject because it conflicts with values or beliefs, or might avoid revealing because the expression could have negative social consequences. Even more likely, implicit cognition can reveal information that is not available to introspective access even if people were motivated to retrieve and express it.

Accessing these deeper associative networks is not just a question of getting around social desirability bias (people know they have an attitude but don’t want to report it), but can also get at associations that people don’t even know that they have.

To get at these implicit attitudes, the Implicit Association Test (IAT) was developed around 2000. The test indirectly measures associations between concepts by having the respondent rapidly sort objects.

The key intuition is that it is easier to sort items that are strongly associated with one another.

If you would like to take an IAT before I explain how they work: you can take any number of them here.

You have just seen the process of taking the IAT, but here I will explain the logic behind what is happening.

For this example I will use the “Gender-Science” IAT, which determines the implicit associations between men and women and science and liberal arts.

For all IATs there are 4 categories: a contrasting pair of “attributes” and a contrasting pair of “target concepts.”

For the Gender-Science IAT the attributes are:

  • Science represented by the words: Astronomy, Math, Chemistry, Physics, Biology, Geology, Engineering

  • Liberal Arts represented by the words: History, Arts, Humanities, English, Philosophy, Music, Literature

And the target concepts are:

  • Male represented by the words: Man, Son, Father, Boy, Uncle, Grandpa, Husband, Male

  • Female represented by the words: Mother, Wife, Aunt, Woman, Girl, Female, Grandma, Daughter

To contrast this, on the Race IAT the attributes are positive and negative words, and the target concepts are white and black faces.

The test starts with a practice where you sort single categories, moving science to one side and liberal arts to the other; moving male to one side and female to the other.

Then the “real” test starts. The attributes and target concepts are paired together, and you have to categorize 2 things in one direction, and 2 things in the other direction.

In the first set of tasks you might be asked to categorize “Male” and “Science” together, and “Female” and “Liberal Arts” together.

In this prompt you would press “I”

And in this prompt you would press “E”

Following this (and this is the key step!) the combinations of items are reversed. Now you have to classify “Male” with “Liberal Arts” and “Female” with “Science”

In this prompt you would select “I”

In this prompt you would select “E”.

Your implicit association between the concepts is measured by your response times. How quickly can you sort when “Male” and “Science” share a response key, compared to when “Female” and “Science” share a key? If sorting is faster in the “male-science / female-liberal arts” pairing than in the “female-science / male-liberal arts” pairing, that would suggest the respondent has an unconscious association between men and science.
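
To make the scoring idea concrete, here is a rough sketch with made-up response times. This is not the official IAT scoring algorithm (the real test uses an improved “D score” with error penalties and trimming rules), just the core logic of comparing the two combined blocks:

set.seed(1)
#Hypothetical response times (milliseconds) for the two combined blocks
compatible   <- rnorm(40, mean = 750, sd = 150)  #male+science / female+liberal arts
incompatible <- rnorm(40, mean = 900, sd = 150)  #male+liberal arts / female+science

#A positive score means faster sorting when "Male" and "Science" share a key,
#i.e. a stronger implicit association between men and science
(mean(incompatible) - mean(compatible)) / sd(c(compatible, incompatible))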

Interestingly, with one caveat (below) it actually doesn’t matter if you know how the IAT works. You will still get your result even if you are trying to overcome your implicit associations.

The caveat is this (and this is how you can defeat the IAT if you are ever given one and want to do that): the results are predicated on you trying to go as fast as you can. If you purposefully slow down all your answers you won’t get a valid reading for the IAT.

There are some clear benefits to the IAT.

First, we are completely short-circuiting concerns about social desirability because we aren’t asking people to explicitly report their attitudes at all.

Second, unless you know the secret I just revealed to you, it is hard for people to “cheat” on the IAT. So particularly for respondents unfamiliar with the test you are going to get fairly valid responses.

Third, in repeated measures it has been found that the IAT has really good “test-retest reliability”. In other words: people get the same score when we do the IAT many times on them.

But the IAT has also received significant criticism.

Mainly: what is it actually measuring? This is not real, acted-on prejudice, and it would be absolutely wrong to look at the results of an IAT and claim that “racism,” “ageism,” or “sexism” is rampant in society. The test is just measuring cognitive associations (which are almost certainly in large part a function of how society treats these categories!) and is not a reflection of individual attitudes, behavior, or real prejudice.

For example, consider these results, where the authors correlated explicitly racist attitudes with IAT scores:

A relatively high percentage of people who openly express racist views are scored as “free from prejudice” using an IAT.

Ultimately: I think that the IAT is an interesting tool for understanding society’s broad categorization of things, but it is not particularly helpful or predictive at the individual level.

11.5 Jackman and Spahn on Voter Turnout

The question we ask that is perhaps most egregiously affected by social desirability bias is the question on voter turnout. It’s probably not the question that people lie the most on (questions about morally deviant or socially shunned behavior probably attract more lying), but it’s such an important question with such a high amount of bias that it’s really consequential. It’s also really easy to check (i.e. we know the actual turnout rate), so the error is much more public.

Over-estimation of turnout has been a problem in the American National Election Study (ANES) since its inception. This figure from Burden (2000) shows the progression of the over-estimation over time. Reported turnout is pretty consistent across the decades and did not track the real decline in turnout from the 1950s into the 1990s.

Not shown on this graph: the turnout gap has lessened as actual turnout began to rise again in the 2000s to the present.

In 2012 the official voting-eligible population (VEP) turnout was 58.2%, while the turnout estimates from the in-person and web-survey components of the ANES were 71% and 73.2%, respectively. As a side note, it’s odd that the error is worse in the web survey, where we would expect less social desirability bias.

There are three possible sources for this overestimate (two of which we covered, one of which we have not):

  1. Measurement Error/Social desirability bias: people lie about whether they voted.

  2. Non-response error: even after weighting, the people who respond to the survey are a group more likely to turn out than those who do not.

  3. Mobilization effect: people who take a long (it’s like an hour!) survey are more likely to vote than if they did not take that survey.

11.5.1 Social Desirability Bias

The way to test social desirability bias is to match the people who were actually surveyed back to the voter file and to see what they actually did.

We have seen this procedure before and I’m not going to belabor it too much (there is also some complicated probability math happening here that I don’t feel like fully untangling, to be honest.)

This table shows, for each self-reported category, the percentage of people who actually turned out to vote.

Critical here are the over- and under-reporting rates.

Of the people who said they voted, 85.7% of those people actually voted. In other words, 14.3% of those people did not vote and are over-reporters.

Of the people who said that they did not vote, 6.1% of those people actually voted. Those people are under reporters.

When you do the math to back this out to the population level, there is a 6.2% over-reporting rate and a 0.7% under-reporting rate. This suggests that the net bias from misreporting is 6.2 - 0.7 = 5.5 percentage points.

11.5.2 Non-Response

To estimate the effect of non-response, the authors need to compare the turnout rates of the people selected to the survey who took it to the turnout rates of the people selected to the survey who did not take it. However, because the survey samples households but interviews individuals (more on this below), the comparison that is being made is the turnout in the houses where people opted to take the survey compared to the turnout in the houses where nobody opted to take the survey.

Consistent with non-coverage/non-response, the real turnout rates (validated from the voter file, not self-reported; we are taking misreporting out of the equation here) of the people who respond to the poll are around 3-5 percentage points higher than in the selected households where no one participated. In other words, the people that are in the poll are a weirdo group of people, to use the Bailey terminology.

There is less of a difference when comparing the registered voters in the survey to non-cooperating registered voters, which leads to the conclusion that the problem here is successfully getting non-registered people to take the election study. But this makes perfect sense: the people raising their hands to be in an election study are more likely to be registered voters, and people who are not registered to vote are less likely to want to be in the study!

One thing not discussed here that I am curious about is the degree to which this difference gets smaller when the survey weights are applied. Some of the differences between these two groups may be “ignorable”, in that it can be controlled for by weighting the sample to match the original sampling frame. Maybe the weights are being applied for this analysis? I’m really not sure.

11.5.3 Mobilization

This is my favorite part of this paper.

The last piece of why the ANES may over-report turnout is that the act of taking the ANES may actually cause you to be more likely to turn out. After all, the ANES is no small thing: it’s an hour-long (sometimes) in-person interview about politics. It is not insane to think that having taken this survey might cause you to be more likely to go and vote.

To get at this, the authors leverage the fact that it is households, not individuals, that are selected for the survey. Within households, the person who is actually surveyed is randomly selected. It is not the case that, for example, the most politically interested person in the household gets to choose to do the survey.

By random chance we have a group of people who took the survey, and a group of people who are just like them (they live in the same set of houses!) but by random chance did not take the survey.

The authors subset down to the people who come from households with multiple members on the voter file. They then compare each survey respondent to a randomly selected other member of their household. So in a given iteration, each survey respondent is paired with another household member who did not take the survey, and the overall turnout rates of both groups are computed. The authors then repeat this process many times to make sure that survey respondents are matched against all possible other household members.
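
Here is a small sketch of that matching procedure with fake data (my own illustration, with a made-up household structure and no real mobilization effect built in, just to show the mechanics):

library(dplyr)
set.seed(2107)

#Fake voter file: 1000 households with 3 members each; the first member of
#each household is the (randomly selected) survey respondent
voterfile <- data.frame(hh = rep(1:1000, each = 3),
                        respondent = rep(c(1, 0, 0), 1000),
                        voted = rbinom(3000, 1, .6))

one_match <- function(d) {
  d |> 
    group_by(hh) |> 
    summarise(resp_voted  = voted[respondent == 1],
              other_voted = sample(voted[respondent == 0], 1)) |> 
    summarise(diff = mean(resp_voted) - mean(other_voted)) |> 
    pull(diff)
}

#Repeat the random within-household matching many times and average;
#with this fake data the true effect is zero, so the estimate should be near 0
mean(replicate(200, one_match(voterfile)))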

Here are the results:

The “Intent to Treat” mobilization effect is around 3 points, depending on the sample they are drawing on. This is the effect of being randomly selected to participate on turnout. It is “intent to treat” because it measures the effect of being selected, regardless of whether the person actually completed the interview. This language is standard in causal inference: for example, a study might provide free newspapers to a particular zip code, which is an “intent” to treat them with news. Only some people will “comply” with the treatment (take the survey in this case, read the newspaper in the hypothetical case).

The CACE is the complier average causal effect, which is found by dividing the ITT by the compliance rate. We know that 77% of people took the ANES after being selected (cooperation rates are way higher in person!), which means that we can do \(2.6/.77=3.3\). So there is an approximate 3.3-4.5 percentage point mobilization bump.
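
In code, the CACE calculation is just the following (using the ITT and cooperation numbers reported above):

itt <- 2.6         #intent-to-treat effect of selection, in percentage points
compliance <- .77  #share of selected people who actually took the ANES
itt/compliance     #complier average causal effect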

11.5.4 Putting it together

This final table puts it all together:

No one thing causes the bias in the turnout question. It is these three things together.

What I like about this study is that it is very careful about who gets compared to whom in order to estimate these effects.

To detect over-reporting you have to compare individual self-reports to validated voter file turnout.

To detect non-response bias you have to compare the turnout of the people who complied with the survey to the people selected who did not comply.

To detect mobilization you have to compare the (randomly) selected people within-households to the (randomly) non-selected people within those same households.