4 Questionnaire Design
We have already seen that surveys are much more complicated than just “write a question”. In the overview of survey research we discussed that the questions we ask are one of many possible questions we could ask in order to measure a particular construct. Those various questions can do a better or worse job, so we have to understand what goes into question quality.
Writing questions is important because it directly affects survey quality, but there are other reasons why I am putting this section up front.
First, thinking about how to write survey questions (and that includes the questions, the answers, the order of questions, the instructions…) forces us to understand the psychology of survey response. In order to understand what questions are good and what questions are bad, we must understand how people go about the mental process of answering questions. We will learn something about memory, retrieval, averaging, and other mental processes that help shape the way different questions can lead to different answers.
Second, writing good survey questions is a key differentiator for survey professionals. All professions have things that people from the outside look at and say “oh that’s not that hard, I can do that”. Then you learn a little about it and realize that it’s actually very hard. When you are working with stakeholders they are going to believe that they have the necessary skills to write survey questions. They do not, but you need to understand what it actually takes, and be able to show your expertise in writing more effective questions. Statisticians know everything you will know in terms of weighting and analyzing data. PR professionals can take results and present them in a compelling way as well as or better than you can. Only you, the survey professional, knows how to write good questions. This is where the money is.
In this chapter we will cover a lot. We will start by going over how to judge survey questions by the main criteria of validity and reliability. We will then discuss the cognitive process of answering survey questions, taking both a broad view through the Groves chapter and a more political-specific view through the Zaller reading. We will cover common problems and solutions for writing and testing survey questions, culminating in a list of best practices. Finally we will look at the case study of language and survey response through the Perez reading.
4.1 What is question quality?
Let’s refer back to our chart on survey error:

When we are thinking about survey design we are interested in this part of the figure: how we get from a construct of interest to the edited responses that we compile into survey statistics. Critically, on this side of the process we are interested very specifically in the individual level. We aren’t interested in the properties of responses as we scale them into means, but rather in how, at the individual level, our questions do a better or worse job of revealing the underlying construct.
To review, the first thing that we determine is not the question itself, but rather the construct that we wish to measure. This is the broad thing we are trying to understand, rather than the answer to a question.
Sometimes the construct is something relatively straightforward: things like income, or whom you voted for (or whether you voted) in the last election.
But sometimes the construct is something much more nebulous: ideology, personality traits, or issue attitudes. Even when it feels fuzzy, you really do want to start in this place so that you have a baseline against which to judge questions.
For example, we could think about two different ways to measure ideology. One way might be to ask the standard 7-point ideology question
- Which point on this scale best describes your political views generally?
- Extremely liberal
- Liberal
- Slightly liberal
- Moderate
- Slightly conservative
- Conservative
- Extremely conservative
A second potential way to measure ideology would be to ask a series of issue questions about things like taxes, spending, foreign policy, and social policy. We could then code people on whether they answer on the “liberal” or “conservative” end of the scale of each question and combine their responses for a summary measure of their ideology.
Both of these are reasonable and defensible approaches, but without a construct to compare them against we cannot compare them to each other. Which one we like depends on how we define the construct of ideology and then how each compares to that construct. For example, if we define the construct of ideology as a social process of grouping we may like the first question better. If we define ideology as a method for organizing political opinions, we may like the second one better.
So we start with some underlying construct \(\mu_i\). This is the Greek letter “mu”, and it is subscripted by \(i\) because each individual in the population has a different value. This is the person’s true “ideology”, to continue the example from above. This is something unobservable about the individual, but we imagine it to be there.
We then design our measure, say the 1-7 scale of ideology we just discussed. Each individual also has a true value for this measure \(Y_i\). This is also a real, but unobservable, thing for each individual. For the ideology example, each of us has a “true” score for that question that does not change. (Well, over time it could change. But we think of it as “not changing” in the sense that if we came to you multiple times in succession it would not be different.)
Finally, people have a response to the particular question \(y_{it}\). Now, critically, the response is subscripted by both \(i\) and \(t\), for time. This is stating that each individual’s response may change every time that you ask them a question. We will get into the specifics of why this might be, and in particular, the differences between random and non-random changes in the individual’s response.
We want to think about how the construct turns into the measure which turns into the response, and any differences that occur during these processes.
Mathematically, we can think about the response being a function of the underlying construct and some amount of “validity error”:
\[ Y_i = \mu_i + \epsilon_i \]
Where \(\epsilon_i\) is the validity error for an individual: how much (and in what direction) their true measured value deviates from their true construct value.
They then respond to the question, which is subject to measurement error:
\[ y_{it} = Y_i + u_{it} \]
Where \(u_{it}\) is how much the response for individual \(i\) at time \(t\) varies from their true measurement value.
We can combine these two equations, subbing in the left hand side in equation 1 for the \(Y_i\) in equation 2 to get:
\[ y_{it} = \mu_i + \epsilon_i + u_{it} \]
Whereby the individual’s response is a function of their true underlying value, validity error, and measurement error.
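To make the error model concrete, here is a minimal simulation sketch in Python (my own illustration, not from Groves; the normal distributions and variances are arbitrary). It draws a construct value, a validity error, and a measurement error for each simulated respondent, builds the observed response from them, and checks how well the responses track the construct.

```python
# A minimal simulation of the error model above; all variances are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

mu  = rng.normal(0, 1.0, n)   # true construct value, mu_i
eps = rng.normal(0, 0.5, n)   # validity error, epsilon_i (fixed per person)
u   = rng.normal(0, 0.5, n)   # measurement error, u_it (re-drawn each time we ask)

Y = mu + eps                  # true value of the measure, Y_i
y = Y + u                     # the response we actually observe, y_it

# With both error terms at zero the response would equal the construct;
# as either error grows, the observed responses drift away from mu_i.
print("corr(y, mu):", round(np.corrcoef(y, mu)[0, 1], 3))
```

The correlation printed at the end previews the next section: it is exactly the construct-to-response relationship we use to talk about validity.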
4.2 Validity
“Validity” refers to a measure having a minimal difference between the response and the value of the construct.
Specifically, we are interested in:
\[ y_{it}-\mu_i \to 0 \]
We want the difference between the response and the construct to approach zero. This is defined at the individual level; when we think about a measure having validity for everyone, we think about the correlation of \(y_{it}\) and \(\mu_i\). A measure is valid when there is a high correlation between people’s measured values and their construct values:

The graph on the left would be a measure with high validity: the true measured values for individuals correlate quite highly with the underlying construct. The graph on the right would be a measure with low validity: the true measured values for individuals correlate quite poorly with the underlying construct.
Why might a measure have low validity? Because we are looking at the actual response, low validity can appear as a function of either “validity error” \(\epsilon_i\) or “measurement error” \(u_{it}\).
There will be validity error when the measure itself doesn’t do a good job of matching the construct. For example maybe a measure taps into unrelated constructs:
Which point on this scale best describes your social views generally?
This might be a poor measure for political ideology because we are using the word “social”. People might think about their organizing principles for things like friendships and religion in addition to their political views. This will reduce the correlation between the \(Y_i\) for this question and their true political ideology \(\mu_i\).
Or maybe the measure only gets to a portion of the construct, but leaves out other important components:
Which point on this scale best describes your political philosophy when choosing candidates?
“Choosing candidates” is really only one aspect of our political selves, and limiting the judgement of ideology to that part of politics leaves out a lot.
Measurement error would be a situation where people’s true answer to the question is the same as the construct but their actual recorded response differs. We get into this more below, but a common problem here is sensitive questions:
How many times have you used illegal drugs in the last month?
People’s true answer to this question is unlikely to differ from the truth, but the reported answer might be quite different due to fear of answering honestly. This question might have low validity due to measurement error.
But wait, isn’t this all a bit stupid?
We can’t measure the true value for constructs, \(\mu_i\), so how can we ever actually assess validity?
The Groves book does give some examples of how you could assess actual validity if you have some sort of administrative record for someone and ask them for the same information via a survey question. But I’m not even sure why you would ask a survey question if you have an administrative record for that person.
For most things we have to think about assessing validity in ways that do not require us to know the individual’s true value. There are several ways that we can do so that do not rely on knowing the true value for the construct.
The first, and simplest is face validity. This is simply: would people (your respondents, other experts) think that the question is measuring the construct? This is a very informal “does it feel right?” type measure of validity. So, for example, if we are running a survey now and want to measure how people will vote in the 2026 midterms, we could ask:
- In elections for Congress, do you usually vote for Democrats, Republicans, or some other party?
This is a fine question for something, but not for this! There are two problems here. First, asking about “Congress” mixes together voting for the House and the Senate, and despite low rates of split-ticket voting, people do have different attitudes about those two things. Second, asking what people “usually” do may not capture what they are going to do in the next election if their attitudes have changed. This has low face validity because it just doesn’t logically relate to the construct of interest. A better question would be:
- If the elections for the U.S. House of Representatives were held today, would you vote for the Democratic candidate, the Republican candidate, or some other party?
This question is specific about elections for the House, and appropriately gets people to focus on a future election.
A related concept is Content Validity. This similarly involves qualitatively assessing questions, but rather than an informal “does this seem right?”, it focuses more on whether subject experts believe that the question covers all relevant dimensions. For example, if the construct we want to measure is health, a bad question might be:
- Do you have any chronic diseases?
Again, this is a fine question for something but in this case would not capture the full breadth of the construct in question. (I would argue it might be impossible to fully capture this construct in a single question!)
A more statistical way to assess validity is through Convergent Validity. Here we are interested in whether a measure correlates highly with other measures of the same construct. We may ask multiple questions that tap ideology, for example, and we can determine the inter-correlation of those measures to confirm that they are all tapping the same underlying thing. Another example might be racial prejudice, where researchers often measure things implicitly and explicitly. Seeing the degree to which those two types of measures correlate can help us understand the validity of each.
An opposite approach is to look at Divergent Validity, which is where we look at the responses across groups of people for whom we expect differences. If we are measuring ideology we expect Democrats to have a more liberal ideology than independents, who should in turn have a more liberal ideology than Republicans. If we are measuring AI usage we expect it to decline with age. I would say that this is the primary way I assess the validity of questions. I have a good sense of how things should break down by demographic groups, and if I fail to see that it means that something has gone wrong in trying to capture the construct I wish to capture.
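As a rough illustration of these last two checks, here is a Python sketch on entirely made-up data (the item names, party groups, and effect sizes are all hypothetical). Convergent validity shows up as high inter-item correlations; the divergent check shows up as group means that line up with our priors.

```python
# A sketch of two informal validity checks on made-up data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 3_000
latent = rng.normal(0, 1, n)   # stand-in for the unobservable construct

df = pd.DataFrame({
    # three hypothetical issue items meant to tap the same construct
    "taxes":    latent + rng.normal(0, 0.6, n),
    "spending": latent + rng.normal(0, 0.6, n),
    "social":   latent + rng.normal(0, 0.6, n),
    # a grouping variable we have strong priors about
    "party": rng.choice(["Democrat", "Independent", "Republican"], n),
})
# Build in the group differences our priors say should exist.
items = ["taxes", "spending", "social"]
df.loc[df.party == "Democrat", items] -= 0.8
df.loc[df.party == "Republican", items] += 0.8

# Convergent validity: items tapping the same construct should inter-correlate highly.
print(df[items].corr().round(2))

# Divergent validity (as described above): group means should match expectations.
print(df.groupby("party")[items].mean().round(2))
```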
4.3 Reliability
The second aspect of question quality that we are interested in is Reliability. While validity looked at the degree to which a single response varied from the construct of interest (\(y_{it} - \mu_i=0\)?), reliability focuses on how much individual responses vary from each other over repeated measurements.
Again, we think that people have a fixed true value for the construct \(\mu_i\) and a fixed true value for the measure \(Y_i\). Their response differs from this true value for the measure by some error \(u_{it}\) which we called measurement error.
We want to know how much that measurement error differs over repeated trials:
\[ Var(u_{i1},u_{i2},u_{i3}, \dots u_{it}) = Var(u_{it}) \]
A measure is unreliable if people’s answers change on it to a high degree in repeated measures even though their underlying value of \(Y_i\) hasn’t changed.
The way that we test for reliability is simple: we ask the same person the same question repeatedly and see how much their answers vary! We refer to this as “test-retest reliability”. The obvious limitation is that people’s true values \(\mu_i,Y_i\) might actually change, so what looks like unreliability is just the measure doing its job!
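A quick simulated sketch of that test-retest logic (assuming, for simplicity, that true values stay fixed between waves; the error variances are made up): the test-retest correlation falls as measurement error grows.

```python
# Test-retest reliability in miniature: ask the same simulated people twice
# and correlate the two waves. True values Y_i are held fixed here.
import numpy as np

rng = np.random.default_rng(2)
n = 5_000
Y = rng.normal(0, 1, n)   # each person's true value for the measure, Y_i

def ask(Y, noise_sd, rng):
    """One administration of the question: true value plus fresh measurement error."""
    return Y + rng.normal(0, noise_sd, len(Y))

for noise_sd in (0.2, 1.0):
    wave1, wave2 = ask(Y, noise_sd, rng), ask(Y, noise_sd, rng)
    print(f"measurement error sd={noise_sd}: "
          f"test-retest corr = {np.corrcoef(wave1, wave2)[0, 1]:.2f}")
```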
We will see further complications about reliability and opinion change through the Zaller and Feldman reading.
4.4 The Psychology of Question Answering
To help understand the cognitive process of answering a survey question, Groves provides us with this chart.

If you are going to draw arrows to and from everything just make it a ball, you cowards.
Regardless, this is still a helpful way of thinking about where the decision points are for the brain, and what the nodes are where different questions can lead to different outcomes. While I am a little mad about all the arrows above, it’s also helpful in pointing out that this is not a linear process: people may move forward and backward (rapidly, in a matter of milliseconds) as they try to answer the question.
Before thinking about these 4 steps, it’s worth understanding a bit about how memory works.
(This is a highly simplified, political science, version of this. Please take a real psychology class about this at some point.)
In our brains we have long term and working memory. Our long term memory is a vast store-house of information, memories, experiences, and attitudes. To actually make use of this information, however, we have to load those things into our working memory. Critically: we can only hold approximately 4-7 pieces of information in our working memory at a given time. Much of understanding the cognitive process of answering a survey question is in understanding what particular pieces of information (what Zaller and Feldman call “considerations”) get loaded into the working memory as a result of being exposed to the survey question.
We will go through the steps in the cognitive process in an “ideal” sense first, and then talk about potential problems below.
4.4.1 Comprehension
The first step (again, not necessarily a linear process so this may be the 1st, 4th, 7th, and 12th steps), is for the respondent to comprehend the question.
In short, the respondent will be subconsciously trying to understand what the point of the question is. This includes parsing the wording of the question, assigning meaning to the various elements in the question, determining the range of permissible answers, and attempting to infer the purpose behind the question.
For this latter step – inferring meaning – they will use the question text but also the broader context of the survey. So for example, consider a question like:
- How important is it to have people in your social group who you disagree with?
If this is embedded in a survey about politics people will interpret the disagreement in the question as disagreement about politics. If this question is embedded in a survey about food, people may interpret the disagreement as (maybe) being about dietary choices.
4.4.2 Retrieval
The retrieval step is the key stage where information is moved from long term memory to working memory.
In some cases this process is extremely straightforward. If I ask a specific question like:
- When is the last movie you saw in a theater?
For a good number of people they just go in and grab the relevant memory. (I wrote this and then realized I pretty much stopped going to movies in the pandemic and I’m actually having trouble answering it. I saw Oppenheimer with my dad in a theater? Was that the last one? Who knows.)
For other questions people will try to retrieve helpful pieces of information that could get them closer to answering the question:
- Approximately how many days per week do you eat meat products like beef or chicken?
This question is asking me to estimate, so I might retrieve memories of my last week of eating. I may think about buying groceries and how much time I spent in the meat aisle. I might think about whether I’ve been traveling, and if so, what changes to my diet I made. I will keep loading in relevant thoughts until I can continue with the next two stages of the process and adequately answer the question.
Some retrieval processes are easier than others. In the Zaller and Feldman paper we will see how this works for attitudes, and in the Perez paper we will see how this relates to language of interview. More generally, less distinctive events and lower salience events are much harder to remember and recall. Someone who eats one huge steak a week will have an easier time answering the above question than someone who casually eats meat for some of their meals every day. (And a vegetarian/vegan has to do no memory process, they just retrieve the very relevant fact “Am Vegan” and that’s all they need!)
The length of time that has passed will also affect memories, with more recent things being more easily recalled.
Finally, survey researchers can place cues in the question to help respondents target relevant memories. Groves brings up a survey on shopping habits that has people think about “visiting drug, clothing, grocery, hardware, and convenience stores”, which allows people to more easily access relevant memories.
4.4.3 Estimation and Judgement
Once the relevant memories and thoughts have been loaded into their working memory the next step is processing them in a way that fits the question.
This can be split into processes for factual and attitudinal questions.
For factual questions (which, for political polling, are very rare, but take up a lot of the Groves book…) if the respondent doesn’t have an immediate answer they have to process the thoughts and memories they are able to access in order to produce an answer. So for example, if the question is:
- How many movies have you seen this year?
The respondent may be able to recall each and every movie they’ve seen and answer that. Or they may go through a judgement process where they say “I see about 3 movies a month, and it’s April, so I will answer 12”.
For attitudes the process is different. We will cover this extensively through the Zaller and Feldman reading, but in general we do not believe that people walk around with attitudes “ready to go” in their heads. If we ask people “Do you approve or disapprove of the job Donald Trump is doing as President”, we don’t believe that people walk around with a “Strongly Disapprove” memory in their head that they conjure up and answer. Instead, we think that they will have more or less readily accessible and relevant memories that will come up when that question is asked that they will then sort through in a process of judgement in order to arrive at an answer.
4.4.4 Reporting
Finally, after comprehending the question, accessing memories, and processing those memories, the respondent is ready to report an answer. Critically here they have to fit their desired answer to the response options that are presented. In live telephone surveys or in-person surveys oftentimes people are allowed to answer how they want and the interviewer records their response on a scale for them. In online or other self-administered surveys the response options are set for the respondent so there is a much more explicit process of fitting their desired answer to the scale. This is also where the “backwards” arrows matter in the above diagram. People may see the response options and that may cause them to re-interpret the question or to access a different set of thoughts and memories that more accurately fit those options.
The other aspect of reporting is that people may alter their true opinion in order to fit other psychological goals. We will see this below, but the set of response options can alter how respondents report their attitudes, and the respondent may alter their response to try to increase the perception of consistency with other answers in the survey.
4.4.5 Alternative Routes to an Answer
While the above represents the “ideal” or “good faith” approach to answering questions, not all survey respondents wish to take this (relatively cognitively complex) route to an answer.
A critical alternative route to know is satisficing.
If we think about the ideal survey respondent – what we might call an optimizer – their cognitive goal is to provide as accurate an answer as possible to the question. There are things that might prevent them from doing that, but their over-arching goal is accuracy.
A satisficer, in comparison, has as their goal to provide a “reasonable” answer as quickly as possible. The word is a blend of satisfy and suffice: do just enough to produce a satisfactory answer. These respondents don’t try to retrieve everything relevant they can and process it in a way that promotes accuracy; instead they retrieve just enough information to feel satisfied that they can provide an answer that meets the requirements. Having a large number of satisficers is bad for validity.
4.5 Problems in Answering Survey Questions
Now that we know how to assess the quality of survey questions and have a basic understanding of the psychology of answering them, we can discuss common problems that lead to survey error.
We are going to talk specifically about the potential for error in attitude questions through the Zaller and Feldman reading. Groves is more focused on errors in factual questions, which are helpful to think through as they more clearly match the cognitive processing steps laid out above.
4.5.1 Problems in encoding
The respondent simply may not remember the thing that you are asking them about. Groves brings up a study where people both kept a diary of the food they ate and then were asked via survey to retrospectively tell the researchers what they ate. The two sources of data had little in common. While there may be other reasons for this error (social desirability, for one) some of the problem here is that a lot of people don’t care much about what they eat (couldn’t be me), and therefore do not store that information to long-term memory. People can’t provide information they don’t have.
4.5.2 Problems interpreting the question.
Respondents may have more difficulty than you imagine understanding what you are asking in the question, and may read extra context into words in a way that you would not expect. A study cited in Groves had respondents report what key terms in a survey meant to them. Respondents had issues with the words “you” (does this include my spouse and family?) and “weekend” (is Friday after work part of the weekend?). Another problem word was “children”: is that under 12? Under 18? Should I think about my adult children?
Part of the problem with misunderstanding is that survey respondents will rarely ask for clarification, or even skip questions that they don’t understand. They will assume that they are supposed to understand the question and will barrel ahead in answering. You can ask people about purely fictitious things and a huge percentage of people will give an answer to the question. Subject matter experts often grossly over-estimate how familiar people are with certain terms:

Groves gives a list of possible comprehension issues. Most boil down to the need to use common, everyday language in your survey questions. You should write survey questions in a way that avoids little-known acronyms or jargon. Use conversational wording even if that strays from “technically” correct grammar. Generally speaking we aim for around an 8th grade reading level for surveys. A good use of ChatGPT is to have it assess the reading level of your survey and flag questions with too high a level of complexity.
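If you want a rough automated check rather than only eyeballing it or asking a chatbot, a readability formula like Flesch-Kincaid can flag overly complex wording. The sketch below is only an approximation: the syllable counter is a crude vowel-group heuristic, and dedicated readability tools do this more carefully.

```python
# A rough reading-level check using the Flesch-Kincaid grade formula.
# The syllable counter is a crude heuristic, so treat scores as flags, not truth.
import re

def count_syllables(word: str) -> int:
    # Count runs of vowels as syllables, with a floor of one per word.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade(text: str) -> float:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59

# A hypothetical jargon-heavy question that should score well above an 8th grade level.
question = ("Do you support the reconciliation legislation currently "
            "under consideration in the United States Senate?")
print(round(fk_grade(question), 1))
```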
Another problem of interpretation discussed by Groves is “faulty presupposition”. They give the example of a question that has respondents agree or disagree with the phrase “Family life often suffers because men concentrate too much on their work”. This prompt has the presupposition that men concentrate too much on their work. It is impossible to answer this question if you don’t agree with that premise!
The final problem I will highlight in misinterpreting the question is really a problem of over-interpreting the question. Groves brings up the example “Are there any situations you can imagine in which you would approve of a policeman striking an adult male citizen?” Now, even if you are concerned about violence by the police, we can all imagine a situation where a policeman would be justified in using some amount of force. Still, 30% of people answer “no” to this question. In this situation respondents are over-interpreting the question by saying “you are asking if I am for or against police violence, and I am against it”.
4.5.3 Problems with memory
In the encoding problem above we discussed issues where people did not encode the thing you are asking about in the first place. This problem is a more general one where they do have the memory, but the question that we ask does not lead them to actually retrieving that information. Groves splits this into four sub-parts.
Mismatches in encoding occur when people encode the memory of the event in a way that does not match the terms in the question. A political example might be asking people whether they have “Boycotted” a store or product. People may be avoiding shopping at Target or drinking Bud Light for political reasons but may not consider what they are doing a “Boycott”.
Distortion in events over time. Our memories are weird and shaky things. Indeed, our memories are a mix of things that actually occurred, plus things we add in later through recollection. Memory researchers call this “rehearsal”: it is difficult for us to distinguish what happened to us firsthand from what we simply heard about or inferred after the fact. A political question that might suffer from this: “Did you participate in any BLM protests in Summer 2020?”
Retrieval failure. Quite simply the ability to remember things fades over time. Groves gives some complicated examples about this and where it differs based on memory research, but I am fine with “long time ago things are harder”.
Reconstruction failure. Finally, the last memory failure is attempting to reconstruct the past based on the information that does come to mind in the moment. For example, in a diet study people may think about what is “usual” for them, and use that to answer questions about what they ate over the past several weeks. More problematic, particularly for political studies, is people filling in the past based on current beliefs: we reconstruct the past by examining the present and projecting it backwards. Above we discussed possibly over-reporting presence at a protest because we have genuine false memories of attending. Another route is feeling more strongly about the need to protest now, and then using that belief to project backwards what behavior it would have predicted.
4.5.4 Problems in estimation.
For both attitudinal and behavioral questions some respondents may have a precise answer readily accessible, but others will have to “construct” their answers on the fly, which can lead to error.
For behavioral questions respondents might take a couple of routes to answer the question. They can use “recall and count” where they remember as many instances as possible and count them up, which obviously leads to potential problems of forgetting. They may focus on a rate at which events occur and then extrapolate over the time period. Or they may simply start with a vague impression of the event in question and then translate that into a number.
This latter method, impression-based estimation, is the one with the most potential for errors. Critically, with this method in particular people are very susceptible to the response scale that you use in the question. Look at the results to this experiment where people were asked to recall the number of hours spent watching television:

Because this is an experiment, the people who answered the TV question with the left scale are just like the people who answered the TV question with the right scale, so we would expect the answers to follow the same distribution, but they do not. On the left just 16% of people reported watching TV more than 2.5 hours. On the right (where 2.5 hours is the lowest response option) 37.5% of people reported watching TV more than 2.5 hours. People’s impression of what is “reasonable” will depend on the options that you give them.
4.5.5 Errors in judgement for attitudinal questions.
We are going to cover this extensively in the Zaller and Feldman section. Go read that!
4.5.6 Errors in formatting an answer.
Survey questions can have different sorts of response scales, but primarily here we will focus on the difference between open and closed response scales.
In some sense this decision is made for us if we are using an online survey (which is becoming dominant). Because there is not an interviewer to translate an open response onto a closed response scale, we must present the respondent with a closed response scale. (I think this is another exciting future possibility for AI in survey research, but we don’t know enough yet about using it to interpret open responses.)
There are clear trade-offs when using open versus closed-ended responses. Open-ended responses will usually allow us to gather more information, but scaling and summarizing that information is difficult. It also places a higher cognitive load on the respondents. Closed-ended scales necessarily censor information, but are much easier to work with.
That being said, even in open-ended responses respondents tend to self-censor and round answers in a way that loses precision. Groves reports that the reporting of sexual partners (for example) gets rounded to the nearest 5. In political science our feeling thermometer scales, which ask how warmly or coldly people feel towards different politicians and groups, cluster at the 10s (0,10,20,30…90,100).
Scale ratings can also cause problems. We saw above how a different scale can lead to different outcomes, and we can expand on that problem here. People are biased towards positive numbers, for example. If you ask people how happy their life is on a 0 to 10 or a -5 to 5 response scale, people will give higher responses in the second case due to a desire to not rate their lives with a negative number.
It is also best to avoid presenting overly complex scales to the respondent. This can lead to information overload and a higher probability of satisficing. For example we often classify American partisanship on a 7 point scale:
- Strong Democrat
- Weak Democrat
- Lean Democrat
- Independent
- Lean Republican
- Weak Republican
- Strong Republican
Just giving people this scale would be overwhelming, and the cognitive load of having to make all those differentiations would probably lead to people satisficing into the middle of the scale. So instead of asking this question all at once we ask it in two stages. We first ask people if they are a Democrat, Republican, independent, or something else. Those who say they are a Democrat or Republican then get asked if they are a “strong” or “weak” partisan. Those who initially say that they are independent or something else then get asked if they lean towards one party or another. Only those that say they do not lean ultimately get classified as independents. (And this number is way smaller than the number that would call themselves independent if we just asked the 7 point scale as one question.)
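Here is a sketch of that two-stage branching written as survey routing logic. The wordings and labels are paraphrases of the branching described above, and the `ask` argument is a hypothetical stand-in for however your survey platform collects a single closed-ended answer.

```python
# A sketch of two-stage party identification branching (wordings are paraphrases).
def party_id_7pt(ask) -> str:
    """`ask(prompt, options)` returns one of the options; how it does so
    depends on the survey mode and is not specified here."""
    stem = ask("Generally speaking, do you think of yourself as a...",
               ["Democrat", "Republican", "Independent", "Something else"])

    if stem in ("Democrat", "Republican"):
        strength = ask(f"Would you call yourself a strong or not very strong {stem}?",
                       ["Strong", "Not very strong"])
        return f"{'Strong' if strength == 'Strong' else 'Weak'} {stem}"

    lean = ask("Do you think of yourself as closer to the Democratic or Republican Party?",
               ["Closer to Democratic", "Closer to Republican", "Neither"])
    if lean == "Closer to Democratic":
        return "Lean Democrat"
    if lean == "Closer to Republican":
        return "Lean Republican"
    return "Independent"   # only non-leaners end up classified as independents

# Example: an initial "Independent" who leans Republican ends up "Lean Republican".
answers = iter(["Independent", "Closer to Republican"])
print(party_id_7pt(lambda prompt, options: next(answers)))
```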
The final reporting problem listed by Groves comes from situations where respondents are more likely to choose the first option (primacy bias) or the last option (recency bias). Both of these are adequately dealt with by randomizing the response order: either full randomization for un-ordered scales, or randomly reversing the order for ordered scales.
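A minimal sketch of both randomization strategies (the option lists are just examples):

```python
# Full shuffling for unordered options, random reversal for ordered scales.
import random

def randomize_unordered(options, rng=random):
    opts = list(options)
    rng.shuffle(opts)          # every ordering equally likely
    return opts

def randomize_ordered(scale, rng=random):
    # Keep the scale intact but flip its direction for a random half of respondents.
    return list(scale) if rng.random() < 0.5 else list(reversed(scale))

print(randomize_unordered(["Economy", "Health care", "Immigration", "Crime"]))
print(randomize_ordered(["Strongly agree", "Somewhat agree",
                         "Somewhat disagree", "Strongly disagree"]))
```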
4.5.7 Motivated misreporting.
For sensitive questions an additional problem is people mis-reporting their answers on purpose. Lots of obvious and non-obvious things can be sensitive. Asking about illegal or morally questionable behavior might lead to people changing their true response, but so might things where people are embarrassed. We did a survey for NBC where the editorial people wanted to know if the Taylor Swift and Elon Musk endorsements affected people’s votes – but no one wants to admit (maybe even to themselves) that their voting decision hinged on a celebrity endorsement.
One possibility for sensitive topics is to use forgiving wording. For example for whether people voted or not we ask:
In talking to people about elections, we often find that a lot of people were not able to vote because they were not registered, they were sick, or they just didn’t have time. How about you – did you vote in the elections this November?
This doesn’t really work that well! People still massively over-report having voted.
Generally speaking, the other thing that can reduce motivated misreporting is having the survey be self-administered. People are much less likely to admit sensitive information if they are being interviewed by a real person than if they can anonymously type it into a response box.
This is, incidentally, one of the best pieces of evidence that polls in 2016-2024 were not biased because people wouldn’t admit to liking Donald Trump (or, conversely, being motivated by social desirability to say they would vote for Clinton/Biden/Harris): the survey error was no better in online surveys than in telephone or face-to-face surveys, which simply would not be the case if motivated misreporting was driving the error.
4.6 Guidelines for Writing Good Questions
The previous section gave negative examples of where things go wrong, and now we can turn that around to give positive guidance on how to do things right. The textbook breaks this down into three question types, which we will cover below.
4.6.1 Non-sensitive questions about behavior
When dealing with questions about behavior the major hurdle to overcome is memory. The list of tips provided by Groves is mostly designed to elicit as much accurate information from respondents’ memory as possible.
1. With closed questions, include all reasonable possibilities as explicit response options.
2. Make the questions as specific as possible.
3. Use words that virtually all respondents will understand.
4. Lengthen the questions by adding memory cues to improve recall.
5. When forgetting is likely, use aided recall.
I genuinely can’t tell why 4 and 5 are different, and Groves is unclear about it. I think both are covered by adding specific memory cues to the question: “How often do you exercise political speech, which might include things like protesting, boycotting products, or posting about politics on social media?”
6. When the events of interest are frequent but not very involving, have respondents keep a diary.
7. When long recall periods must be used, use a life event calendar to improve reporting.
8. To reduce telescoping errors (misdating events), ask respondents to use household records or use bounded recall (give them their previous answer).
9. If cost is a factor, consider whether proxies might be able to provide accurate information (ex. use parents to gain information about children, rather than interviewing children).
4.6.2 Sensitive questions about behavior
Remember that “sensitive” is a very broad topic, and there are all sorts of political things that people are weird about answering.
We will have a special week on social desirability strategies in the back half of the course, so this will be a brief look at this topic.
Again, here is the list of recommendations from the textbook:
- Use open rather than closed questions for eliciting the frequency of sensitive behaviors.
- Use long rather than short questions.
- Use familiar words in describing sensitive behaviors.
- Deliberately load the question to reduce misreporting.
Loading the question means wording it in a way that invites a particular response: “Even the calmest parents get mad at their children sometimes. How many times during the past week did your children do something that made you angry?”, “Many psychologists believe it is healthy for parents to express anger”, etc.
- Ask about long periods, or periods from the distant past, first when asking about sensitive behaviors.
- Embed the sensitive question among other sensitive items to make it stand out less.
Ask about armed robbery before asking about shoplifting (which is what you really care about).
- Use a self-administered mode.
- Consider a diary.
- Include additional items to assess how sensitive the behavioral questions were.
“How uncomfortable were you in answering questions about…”
- Collect validation data.
4.6.3 Attitude Questions
In political surveys we care much more about attitude questions, so we will spend a bit more time here.
- Specify the attitude object clearly.
“Does the government spend too much on welfare?” Both “government” and “welfare” can mean a lot of different things…
- Avoid double-barreled questions.
A really important one:
To reduce the federal deficit, do you support or oppose raising taxes and cutting spending?
This question asks about two different things at once!
- Measure the strength of the attitude.
This can be as simple as having “Strongly Agree” and “Somewhat Agree” as options, which leaves you the option of combining them if you want to. You could also measure attitude strength in a separate question, as in “how important is this issue to you”.
- Use bipolar items except when they might miss key information.
When things have trade-offs you want to present them as trade-offs, rather than allowing people to have it both ways. People are more likely to agree than disagree with things, which is called acquiescence bias. Instead of an item where people can just thoughtlessly agree, it’s better to force people into making a real decision.
A recent example comes from an NBC poll where we asked people to agree or disagree with the statement:
Nothing will change in Washington until we elect a new generation of leaders.
Everyone agrees with this statement! In the next round we switched it up and asked whether people preferred politicians to be “an insider with the experience to get things done” or “an outsider who will shake things up”. This gave way more interesting results because people are faced with a real dilemma.
- Carefully consider which alternatives to include.
Should you include a “middle” option like “neither agree nor disagree”? The textbook says yes, but I am far more ambivalent. Particularly with online surveys where people can skip a question if they don’t want to answer it, I think that these options get us further away from understanding public opinion. Above we talked about “satisficing”, and simply selecting this middle option is the satisficer’s way out. Particularly if you follow number (3) above, I lean towards forcing people to take a side.
- In measuring change over time, ask the same question each time.
Obviously.
- When asking general and specific questions about a topic, ask the general question first.
If you do this the other way around people’s views will be too narrow when you ask them the general question. If you ask people their attitudes about the direction of the country after you ask them about their approval of the President, their answers about the direction of the country will be infected by their feelings towards the President.
- When asking about multiple items, start with the least popular.
If you are polling people about how much they like carrots, but ask them how much they like chocolate cake first, they will give lower ratings to carrots. (The opposite is also probably true, which Groves doesn’t cover.)
- Use closed questions for measuring attitudes.
Open ended responses for attitudes are just too difficult to code. However, we will see how helpful they can be in the Zaller reading.
- Use 5-7 point response scales and label every point.
We want to balance the granularity of opinion against cognitive overload. “Place yourself on this 1-to-20 scale from liberal to conservative” is too hard a question, and is likely to increase the amount of random error. “Place yourself on this 1-to-3 scale from liberal to conservative” is too restrictive and misses complexity. The right amount is 5-7 points (maybe 4-8).
- Start with the end of the scale that is the least popular.
Only a consideration in surveys where you cannot randomize the response option order.
- Use analogue devices like thermometers to collect more detailed information.
The 0 (very cool) to 100 (very warm) scale is helpful for people to make more granular assessments, if needed. People use around 13 points on the thermometer scale. It’s a way to get more granularity in a way people can cope with.
- Use ranking only if the respondent can see all the alternatives.
People don’t have the cognitive capacity to rank unless they can see all of the options at once. In other words, you couldn’t do this in a live caller survey! Even online a ranking task is probably too cognitively complex.
- Get ratings for every item of interest, do not use check all that apply.
People will give up and stop checking stuff because of satisficing. If you instead ask for a rating of each item you are more likely to get a yes or no for each one.
4.7 Zaller and Feldman: A Simple Theory of the Survey Response
While the Groves book is a great repository of information, much of the focus of the question wording chapter is on behavioral, not attitudinal questions.
As political scientists our main interest in surveys is in attitudes. These sorts of questions have particular characteristics that make them more difficult to theorize about and study, and understanding what attitudes even are has had a long history.
The fundamental question being asked here is: what is an attitude? As in: when we ask people their opinion about, say, increasing foreign aid spending, what is it that we are asking? Is this something like the behavioral questions in the Groves book, where people have a true answer and we are just finding the best way to access it? Or is an attitude something fundamentally different from asking people how frequently they go see a movie?
The starting point of this article is that there is a high degree of response instability in attitude questions (i.e. low reliability).
In particular, it has long been found that the reliability of attitude questions is very low; Zaller and Feldman are starting from that place in the literature. A foundational piece in this academic history is Converse’s The Nature of Belief Systems in Mass Publics. This is probably the most important article ever written about the psychology of political attitudes, and it was the first to document that people have an extremely high level of response instability.
Zaller and Feldman present this chart in their piece:

In the section on reliability we discussed “test-retest” reliability, which is the ability for people to respond to the same question in the same way in multiple instances. Here we are seeing (for two different questions) a crosstab of how people answered in January of 1980 and June of 1980, about 6 months apart. 25% of people, for example, said that the US should cooperate with Russia more in both rounds of the survey. The upshot here is that only about half of people give the same answer on the survey in the re-test.
This finding mirrors what Converse found in his original article, that the correlation of people’s answers at two points of time were quite low for most things:

Here we see the correlation of different items over two years, where lower numbers indicate more response instability. Notable here is that party identification has much higher stability than the other issues – this is backwards from a view of politics where people choose their partisanship as a function of their issue positions. Notable also is that there is a good amount of difference between the different issues, with a high-salience, racially charged issue at the very top.
There are a couple of possibilities for this pattern.
The first is that some people simply do not have real attitudes on these things and are simply responding randomly. In the parlance of measurement above, there actually isn’t a \(\mu_i\) underneath everything, and when we ask them questions they just choose a random option.
The second is that people do have a true preference \(\mu_i\), and the fluctuations we are seeing are due to measurement error. We did allow for answers, \(y_{it}\), to change across time, and this shows why. But this seems like an extreme claim if 50% of the variance in answers is due to measurement error!
However, this response instability is not the only sort of measurement error that is present in political attitudinal questions. Additional evidence suggests there is something more going on here.
The other type of “error” that occurs is response effects: systematic variation that comes from people answering questions differently under different circumstances.
Zaller and Feldman bring up an extremely well known Cold War experiment. Half of a survey sample was asked if they were willing to allow communist reporters into the US, which 37% of people agreed to. In the other half of the sample respondents were first asked whether U.S. reporters should be allowed into Russia (which most favored). With that context, 73% of people said that a communist reporter should be allowed into the US.
There are lots of findings like this, where the context and wording of a question dramatically alters people’s opinions. Critically, these effects don’t just work on “unsophisticated” people: respondents of all stripes are affected.
So when it comes to the attitude of allowing communist reporters into the US, would it be right to say that people have a true attitude \(\mu_i\) that is revealed when the question is asked in isolation, and when people are more likely to agree when given context we call it “measurement error”? In other words: is the first opinion the “true” one and the second opinion an “error”?
Taking these two things together: people have a high degree of response instability, and their revealed attitudes are highly susceptible to context. These findings do not neatly fit with either the idea that people are randomly responding or the idea that they have fixed true attitudes that are subject to measurement error.
Instead, Zaller and Feldman promote a new theory (which is really just a psychology theory that they steal with attribution): attitudes are not fixed things that surveys reveal. Instead, attitudes are constructed in real time when people are asked questions as a function of available information.
4.7.1 An alternative model
The first thing here is to abandon the notion of a “true attitude” when it comes to the sort of issue questions that are asked in political surveys.
The authors bring up in-depth survey work done by people like Jennifer Hochschild. In that work she finds that people will answer traditional fixed-choice questions, but if given the opportunity to say more:
people do not make simple statements; they shade, modulate, deny, retract, or just grind to a halt in frustration. These manifestations of uncertainty are just as meaningful and interesting as the definitive statements [found in a closed survey question].
When given the opportunity to express themselves, people have a great deal of ambivalence in their political attitudes.
Vincent Sartori (a Hochschild interview subject) cannot decide whether or not the government should guarantee incomes, because he cannot decide how much weight to give to the value of productivity. He believes that the rich are mostly undeserving, yet he is angry at “welfare cheats” who refuse to work. Caught between his desire for equality and his fear that a guaranteed income will benefit shirkers, he remains ambivalent.
Again thinking back to the dominant paradigms (random answering or measurement error), neither seems to fit these thought patterns. This guy is not responding randomly, but he is also not searching for his “true attitude”. Instead he has a range of conflicting considerations that could lead him to give different responses at different times or when he is in different head spaces. The basic point here is that people have stored in their long-term memory a large number of considerations about important issues that may or may not conflict with one another. When we say that an opinion is “constructed” in real time, what we are saying is that people pull some of these considerations out in response to the question, and the nature and balance of those considerations may change at different times and under different circumstances.
This idea leads to the three axioms of this article:
Ambivalence: Most people possess opposing considerations on most issues.
Response: Individuals answer survey questions by averaging across the considerations that happen to be salient at the moment of response.
Accessibility: A stochastic (random) process where considerations that have been recently thought about are somewhat more likely to be sampled.
A lot can be explained by these three axioms.
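To see how the axioms produce response instability, here is a toy simulation (my own illustration, not the authors’ model or their data): each hypothetical respondent carries a store of directional considerations, samples a few at random each time they are asked, and reports the average. The ambivalent respondent bounces between sides across repeated askings, while the respondent with a mostly one-sided store keeps landing in the same place.

```python
# A toy version of the three axioms; the consideration stores are invented.
import numpy as np

rng = np.random.default_rng(3)

def answer(considerations, k=4, rng=rng):
    """Response axiom: average over a small sample of salient considerations.
    Accessibility axiom: which ones are salient is (here, uniformly) random."""
    sampled = rng.choice(considerations, size=min(k, len(considerations)), replace=False)
    return sampled.mean()

# Ambivalence axiom: opposing considerations, coded -1 (conservative) and +1 (liberal).
stores = {
    "ambivalent": np.array([+1, +1, +1, +1, -1, -1, -1, -1]),
    "consistent": np.array([+1, +1, +1, +1, +1, +1, +1, -1]),
}

for label, store in stores.items():
    sides = np.sign([answer(store) for _ in range(1000)])
    print(label,
          "liberal:", round((sides > 0).mean(), 2),
          "conservative:", round((sides < 0).mean(), 2),
          "torn:", round((sides == 0).mean(), 2))
```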
Response instability can be clearly predicted from axiom 1: people have competing considerations in their heads, not just one clear attitude, so of course they will give different answers to questions if you ask them multiple times.
But this theory is much more subtle than “people are ambivalent”.
One of the critical things about response instability we saw above is that it is not the same for all people, and it is not the same for all topics.
More politically “sophisticated” people are more likely to have a consistent set of considerations in their heads. They are better able to organize the incoming information into a coherent schema of the world, such that when they go to answer a survey question it is more likely that the considerations they bring out of long term memory do not present a contrasting set of considerations. Further: they are more likely to have thought about these things more recently so the consistent, relevant, considerations are more likely to be brought to the front of mind.
Things that are more politically salient and easier to understand (like school desegregation in the Converse chart) are more likely to result in people having a consistent set of considerations that are readily available to be sampled when the question is asked. Things that are less politically salient and harder to understand (like foreign policy) are more likely to bring forth an ambivalent set of considerations, which will induce response instability.
This model also explains response effects like the communism question wording example. Respondents who were first asked about US journalists entering Russia have considerations about reciprocity primed before they answer the question of interest, which brings those considerations to the top of mind when answering. Other response effects, like different question wording experiments, can easily be incorporated into this model.
The evidence presented for this article relies on respondents doing a memory dump of considerations either before or after answering questions. From these dumps of considerations we can see, in real time, ambivalence or consistency across respondents.
I think Table 5 is excellent, and really helps us understand what is going on when people answer survey questions. Notably, even the people who give stable answers to survey questions do not have the exact same considerations pop up each time they answer; they simply have a stable store of consistently conservative or liberal considerations that come to mind and allow them to answer the same way. On the other hand, people who have unstable attitudes have a more or less random set of considerations pop up each time they are asked a question.
What does this teach us about writing survey questions?
First, I think it shows that the “error” model presented in Groves (and above) is likely a simplification of what is really happening. That’s ok. “All models are wrong, some are useful” is a good maxim to live by. It is helpful for us to think about there being “real” attitudes that we try to operationalize into measures that are then subject to measurement error. Building that model allows us to focus in on better or worse operationalizations, and lets us see where measurement error enters the process.
Yet, the focus on there being a \(\mu_i\), a singular true attitude, doesn’t really mesh with how people think about politics and how they construct – rather than report – their attitudes. Relaxing this assumption allows us to be more understanding and empathetic towards our survey respondents.
What we are fighting against as survey researchers is the belief that our questions can be “hypodermic needles”. This is the belief that we can simply ask a question and that question will penetrate into our respondent’s brain and pull out the exact piece of information we asked about. Throughout this chapter we have seen many reasons why it simply does not work like this. Whether it be through social desirability, satisficing, or Zaller-style “opinion construction”, the reality of people answering questions is very different.
So when it comes to writing survey questions about attitudes, we have to think about how the way we write our questions, generate response alternatives, and place those questions in surveys will lead people to bring different sets of considerations to bear when answering. Just as Groves thought about different cues to bring the right set of memories forward, we can think about cues that bring the right set of considerations to bear. (This, obviously, can also go off the rails if you cue people to bring a set of considerations to bear that makes them answer the question the way that you want them to.)
The final part of this I want to highlight is the “normative implications” section of the conclusion.
Reading a lot of the classic work from Converse forward on citizens’ “non-attitudes” can be a bit bleak. At the heart of our modern conception of democratic theory are engaged, independent, dispassionate citizens who make judgements about political issues and then make voting decisions based on those judgements. If people lack true beliefs or ideologies then this system is potentially revealed as one in which we are governed by random processes with no underlying logic.
The countering view, represented by the work of Chris Achen (as of the writing of the Zaller article – though note that Achen has become far more pessimistic in the intervening 30 years), is that people do have true attitudes; we are simply bad at measuring them, and the resulting measurement error is not something that troubles political theory.
With the Zaller and Feldman piece we land somewhere in between those two positions. People do not have pre-formed, strong opinions about political topics. (And, in other work, it is quite clear that the opinions they do have are often a consequence of, not a precondition to, their vote choices.) At the same time, they do have sincerely held, if sometimes contradictory, considerations about politics that inform the way in which they both answer our questions and make sense of the political world.
4.8 Perez
This article extends our thinking about how survey respondents construct opinions by considering the effect that the language of interview can have on the considerations that come to the top of the mind, and therefore on the opinions that respondents express.
To get there, we have to learn a little bit more about which considerations are accessed when a survey question is asked. In particular, the considerations we hold are not just isolated things floating around in our long-term memories. Instead, we think that considerations are linked together in a “lattice-like” network. When you access one consideration you are more likely to also pull out the ones that are closely associated with it.
Considerations become linked when they are thought of together often, or when they are initially encoded together.
The first mechanism is often important: if people regularly think about issues of school funding and racial politics together, then when you ask about school funding their brain will more easily grab the related considerations about race than other potentially relevant considerations. But this effect is not what matters for language.
Instead, language effects enter through the effect that the language in which information is encoded has on its accessibility. Considerations are more easily recalled when there is a match between how they were encoded and how they are retrieved. Information that is learned in one language will be more easily retrieved when that language is used. Perez shares an example from another paper, in which a group of Mandarin-English bilinguals were randomly assigned to be interviewed in one of those languages and asked to “name a statue of someone standing with a raised arm while looking into the distance.” Those randomly assigned to be interviewed in Mandarin were more likely to answer the Statue of Mao Zedong, while those interviewed in English answered the Statue of Liberty.
It’s worth noting here how critical the experimental method is to this study and to others that look at the effect of language. As Perez notes, it is very difficult to determine the causal effect of interview language when people are able to self-select the language they take the survey in. Even among the group of English-Mandarin bilinguals in the study above, we don’t think that the people who choose to take a survey in English are exactly like the people who choose to take it in Mandarin. The latter group may have deeper ties to a Chinese-American identity, for example. You can attempt to statistically control for this: does language of interview have an effect even if we look within categories of people based on how long their family has lived in the US? But we can only control for things that we can measure. The magic of the experimental method is that we get two groups of people who are, in expectation, exactly alike except for the fact that one took the survey in English and the other in Mandarin.
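To see why randomization matters here, a toy simulation can help. Everything below is invented purely for illustration (the confounder, the effect size, the outcome); it is not Perez’s data or analysis, just a sketch of why a naive comparison fails when people self-select their interview language.

```python
# Toy simulation of why randomization matters for language-of-interview effects.
# All numbers here are invented for illustration; nothing comes from Perez's data.
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Unobserved confounder: strength of ties to a Chinese-American identity (0 to 1).
ties = rng.uniform(0, 1, n)

# Assumed true causal effect of a Mandarin interview on some outcome: +0.5.
TRUE_EFFECT = 0.5

def outcome(mandarin, ties):
    # The outcome depends on both the interview language and the confounder.
    return TRUE_EFFECT * mandarin + 2.0 * ties + rng.normal(0, 1, len(ties))

# Self-selection: people with stronger ties are more likely to choose Mandarin.
chose_mandarin = rng.random(n) < ties
y_obs = outcome(chose_mandarin.astype(float), ties)
naive_diff = y_obs[chose_mandarin].mean() - y_obs[~chose_mandarin].mean()

# Randomization: a coin flip decides the language, independent of ties.
assigned_mandarin = rng.random(n) < 0.5
y_exp = outcome(assigned_mandarin.astype(float), ties)
exp_diff = y_exp[assigned_mandarin].mean() - y_exp[~assigned_mandarin].mean()

print(f"True effect:               {TRUE_EFFECT:.2f}")
print(f"Self-selected comparison:  {naive_diff:.2f}")  # inflated by the confounder
print(f"Randomized comparison:     {exp_diff:.2f}")    # close to the true effect
```

Under self-selection, the simple difference in means absorbs the effect of the unmeasured confounder; under random assignment, it recovers something close to the true effect without us ever having to measure that confounder.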
To test this theory on the political attitudes of Latinos, Perez randomly assigns English-Spanish bilinguals to take the survey in one of the two languages. These respondents are then asked questions where the information needed to answer is more likely to have been encoded in Spanish or in English, or where the identities being expressed are more salient in English or in Spanish. The theory is that the availability of considerations will match the language of interview.

The main results are presented above. We are interested here in the effect of English interview, an indicator variable equal to 1 if someone was assigned to take the survey in English and 0 if they were assigned to take it in Spanish. As such, a positive and significant effect indicates higher scores on the dependent variable for individuals who were assigned to take the survey in English.
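If the coding is confusing, one way to keep it straight is to remember that, with a single 0/1 indicator and no other regressors, the regression coefficient is just the difference in mean outcomes between the two groups. A minimal sketch with made-up numbers (not Perez’s data):

```python
# With a single 0/1 regressor, the OLS slope equals the difference in group means.
# Made-up data for illustration only.
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

english = rng.integers(0, 2, n)                    # 1 = assigned English, 0 = assigned Spanish
knowledge = 0.3 * english + rng.normal(0, 1, n)    # assumed +0.3 bump for English interviews

slope = np.polyfit(english, knowledge, 1)[0]       # coefficient on the English indicator
diff_in_means = knowledge[english == 1].mean() - knowledge[english == 0].mean()

print(f"OLS coefficient on English interview: {slope:.3f}")
print(f"Difference in group means:            {diff_in_means:.3f}")  # the same number
```

Read this way, a positive coefficient on English interview means the English-assigned group scored higher on that outcome.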
The results indicate that individuals who took the survey in English have higher political knowledge, and that this is specifically driven by higher knowledge of traditional “American” political facts. Further, these individuals report a stronger Latino identity than those who were interviewed in Spanish. This might seem odd at first, but it is what the theory predicts, because the category “Latino” only makes sense in the American context. Those interviewed in Spanish are less likely to think of themselves as Latino and more likely to associate themselves with their country of origin.
When discussing this research and other pieces which show that the attitudes people express depend on the language of interview, I have often heard Efren Perez get frustrated when people ask him: Which one is the real attitude? The one expressed in Spanish or English?
Pairing this together with the Zaller and Feldman piece, it is reasonable to conclude that both are real attitudes. To try to call one real and one fake misses how these cognitive processes work. People hold all of these contradictory pieces of information in their heads, and whether and how they come out is a function of a lot of things, including the language people are surveyed in.
If I were doing a survey of Latino Americans, how would I approach language of interview? Even with the main NBC survey we give respondents the choice to take the survey in Spanish. However, few people take us up on that. (I asked the president of a firm that specializes in Latino attitudes about this, and his perspective was that people assume the translation will be bad, so they don’t even bother trying to take the survey in Spanish.) Ultimately, I think that, similar to deciding question wording, you have to choose the context you wish people to be in when answering questions and try to make that context match the one that is most politically salient.
4.9 Questions
Can a question be reliable but not valid? Valid but unreliable? Give an example of each.
Write a question that would be valid for an “optimizer” but not for a “satisficer”. Describe why the question is valid for one group but not another.
People have argued that SAT scores are biased against non-white high school students. Is this a problem of validity error, measurement error, or both?
Groves gives the following 6 “problems” respondents can have answering survey questions. Give an example (they can’t be the same examples as the textbook or my class notes) and explanation for each problem.
Problems in encoding
Problems in interpreting the question
Problems with memory
Problems in estimation
Errors in formatting an answer
Motivated misreporting
Explain why Zaller and Feldman call Americans “Ambivalent”.
Use the framework of Zaller and Feldman to explain the findings of Perez.