3 Survey Research Basics

In the introduction I talked through a brief history of surveys, why they are so important, and how we have arrived at the modern paradigm of survey research.

In this chapter I will introduce more formally the nature of surveys and the framework we will use to look at the quality of surveys.

3.1 What is a survey?

Ok, yes, you know one when you see it, but here is the formal definition of a survey from Groves:

A survey is a systematic method for gathering information from (a sample of) entities for the purposes of constructing quantitative descriptors of the attributes of a larger population of which the entities are members.

There are three key things here that are worth highlighting.

  • Systematic. This is not just going out and interviewing people and compiling the results. To properly be a survey we must have a systematic process. A survey instrument that is identical for every respondent. A sampling plan. A coding plan. A weighting plan. An analysis plan. All of this needs to be recorded, reported, and be replicable.

  • Quantitative. There are lots of questions that are best suited to qualitative analysis and deep interviews. But these things are not going to be treated as “surveys” in this framework. Here we are dealing with things that we can measure and perform math on. Things like proportions, means, correlations, and regressions.

  • A sample of a population. To distinguish what we are doing from a census, in a survey we have a population of interest (say, all U.S. adults) and we use a sample of that population (say, 1000 randomly selected individuals) to infer information about that population.

3.2 What makes a good survey methodologist?

We are going to talk a lot about this, but the previous section on what makes a survey highlights some obvious, and some not-so-obvious, features of what makes a good survey methodologist.

At a technical level you must have a decent ability to understand survey statistics and the computer science skills necessary to work on data. If you’ve taken my other courses, you know that a good chunk of this is just the ability to stay organized. This is even more important in survey work because you are publishing the results and you are responsible for them! You must have a replicable way to reproduce the data to be able to ensure your numbers are correct. You must also have the statistics skills to understand where survey error comes from, how to best ameliorate it, and how to describe it. As we will see today, that’s not just “traditional” statistical error, but the error that comes from the process of writing questions and having people answer them.

Beyond those things is a set of softer skills that are hard to describe. In order to write good survey questions you have to know a lot about the subject you are asking about. We broadly classify this as “Domain Knowledge”. I think I am a pretty good survey researcher, but if you asked me to run a survey trying to capture attitudes about fine art I would be totally lost. I don’t know anything about that and I don’t even know where to start in terms of capturing attitudes.

There is also a lot of empathy needed. I call this step “Imagine a person”. When you are writing survey questions and answers you genuinely have to try to understand the way people feel about certain topics to see what kind of questions and answers will actually represent their opinions on something. If you’ve ever said something like “How could anyone ever believe XXX!!!!”, then you need to quash that part of yourself. I can believe it.

3.3 Surveys from a Design Perspective

The Groves chapter has a bit of a confusing journey through a bunch of flow charts that all kind of say the same thing to explain how to think about surveys. I think there is really good stuff in here, so I want to take time to explain the two perspectives that you can take on surveys. The first is thinking about “how do you literally design a survey?”; the second is “where does error come from in a survey?”.

The first way to think about surveys is: what do we actually do? This figure describes this process.

The first step in the survey process is to define what it is that we want to research. This, obviously, can be anything, as long as that thing includes attitudes that people can have. We wouldn’t want to do a survey, for example, on the fuel efficiency of cars. But we could do a survey on people’s attitudes about fuel efficiency (maybe in comparison to comfort or horsepower?) when making car buying decisions.

3.3.1 Survey Mode and Sampling Frame

After choosing the research question, we have to choose what survey “mode” we will use, and what the “sampling frame” for the survey will be. These two things are linked together because the choice of one often affects the other.

The “sampling frame” is the real list of people that you could possibly interview for the survey. Maybe you are the head of an organization and have a list of email addresses for the people in that organization; that list is the “sampling frame”.

Two common sampling frames in the politics world are “Random Digit Dialing” (RDD) and “Registration Based Sampling” (RBS).

RDD is what it sounds like. Usually starting with an area code (like 215 for Philly), random numbers are generated for the last 7 digits of a telephone number. For example:

# Number to call: 215- followed by 7 random digits (0-9)
sample(0:9, 7, replace = TRUE)
#> [1] 3 2 2 1 3 3 4

Those numbers are then called. The “sampling frame” for this example, therefore, is all people with a Philadelphia phone number.

The choice of RDD as a sampling frame directly impacts survey mode. Survey mode is how you are going to interview people. Obviously, you must do a telephone survey if you are generating random phone numbers! We generally refer to this as a “Live caller” interview. Another option for telephone surveys is text message. With text you can either do a survey back and forth over text, or do a “text to web” survey, where you text out a link to an online survey. That being said: it would be really inefficient to text random numbers if you don’t know that they are cell phones! I don’t think anyone really does this.

The other common sampling frame, particularly for political surveys, is RBS, a “registration based sample”. In every state the government keeps track of every person that is registered to vote, and their history of voting (except North Dakota, which doesn’t have voter registration!). To be clear, the state keeps track of who is registered, and what elections they participated in. They have no idea who you voted for!

This information from the state is publicly available by law, though they don’t need to make access easy or in a helpful format! (The Florida voter file, at least until recently, was sent out by mail on CD-ROM). However, there are commercial companies that take this information and collate it together with extra pieces of information (commercial information, extra contact information, modeled demographics) and sell that as a product. We call these products “Voter Files”. At Penn we have a subscription to L2, one such data vendor. Other prominent firms are TargetSmart and Catalist. The two political parties have their own, proprietary, version of these voter files.

If you want to study registered voters, the voter file is a very helpful list of….. registered voters! Using RBS you take a random sample off of the voter file, and as such your sampling frame is registered voters.

Oftentimes the survey mode for an RBS sample will be a live caller telephone survey, but because you know that these are real people and often have cell phone numbers attached, it’s much more feasible to do a text or text-to-web survey. Commercial voter files usually have home and email addresses for some individuals, so it’s further possible to do a mail or email survey with RBS.

3.3.2 Questionnaire Design and Sample Selection

After your survey mode and sampling design are complete, you have to make your actual sample selection and design your questionnaire.

In many cases sample selection is just a random draw. If we were doing an RBS sample we might randomly select 1000 of the phone numbers of those registered to vote nationwide and call those people.

We will explore more complicated sampling techniques in a few weeks, but there are also alternatives to this simple random sampling. We may want to ensure that we sample enough of certain groups in the electorate, so we may set quotas to meet for those groups: for example, we keep sampling until 12% of the sample is Black. It’s also possible to generate a sample where different groups have different probabilities of being included in the contact list than they would under simple random sampling. For example, you might want a sample that looks like the 2022 midterm electorate, so you determine the probability of being, say, highly educated among that group, and give highly educated people a slightly higher probability of being selected.
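
As a toy illustration of those unequal selection probabilities (a minimal sketch; the frame, the high.educ flag, and the probabilities here are all made up):

set.seed(100)
# A frame of 10,000 people, roughly 35% of whom are highly educated
frame <- data.frame(id = 1:10000,
                    high.educ = rbinom(10000, 1, 0.35))
# Give the highly educated a slightly higher relative chance of selection
p <- ifelse(frame$high.educ == 1, 1.25, 1)
picked <- frame[sample(nrow(frame), size = 1000, prob = p), ]
mean(picked$high.educ)   # a bit above the 35% in the frame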

The second part of this stage is questionnaire design. We are going to spend a lot of time on designing questions and responses, so it’s enough to say here that it is a very critical process where you think a lot about how to write questions to capture your research goals. Writing survey questions is not the same as writing interview questions. You can’t ask followups or clarify things for the respondent if they are confused, so you have to do your best to get one question that does the job. You also have to think about the difficulty of answering questions and the overall length of the survey.

In the book it says a critical step is to pre-test the survey, which is deploying it on a small set of your sample before you begin your main data collection. I agree that this is a fine thing to do! The reality is that modern polling budgets rarely allow for such pre-testing. In my experience the pre-testing step is a lot of careful review and showing it to as many people as possible. This last thing is really critical: other people are really helpful in finding where things are confusing or don’t capture the full possibilities of response.

3.3.3 Recruit and Measure Sample

Now you have a sample drawn and a questionnaire written, so it’s time to collect data!

How this data is collected depends greatly on the survey mode. For a phone survey, phone calls will be made. For a text survey, texts will be sent. For an online survey, the survey will open to collect responses.

The decision making here comes once you have a set sample list and you are trying to get the people on that list to actually respond to the survey. You have to ensure, when you are calling a house for example, that you get the actual person you are looking for on the phone. Indeed, this is even a concern for RBS surveys, as sometimes the number on the file is wrong.

Further, how much effort do you wish to spend getting people to respond to the survey who you cannot find or who initially refuse? Sometimes that effort is zero, but there might be reason to believe that the people who are refusing to take your survey are systematically different from the people who are in the survey. This is something that will come back up, called Non-Ignorable Non-Response. In these cases a good deal of effort might be taken to re-contact people to get them to participate. In some cases other survey modes might be tried to get people to respond (we can’t get them on the phone; will they respond to a text? What if we send them some money?).

3.3.4 Code and Edit Data

Once the responses are generated then you have to take the raw data file and turn it into something appropriate for analysis. Oftentimes this is just properly labeling questions and answers so that they are legible.

In many cases you have to take the raw responses from a survey and turn them into the variables that will be used in analysis. For example, in our surveys the race variable we report out has the categories: Non-Hispanic White, Non-Hispanic Black, Hispanic, Asian, Other. But we don’t ask a question with those response options. We actually ask two questions: one that has a large number of racial categories where people can select all that apply to them (including Hispanic); and a second question where we ask people if they are of Hispanic ethnicity. We take the responses to these two questions and turn them into the final race variable based on a set of rules that we have developed.

As we will discuss below there are real decisions that happen in this stage that mean that if you give two analysts the same data they will end up with slightly different answers to the question because of the decisions that they make. The best rule here is to be transparent. Keep things recorded and replicable, and any decisions you make are fine. (Or, at least, can be corrected if they are not fine!)

For example, here is a chunk of cleaning code for an NBC survey. You don’t have to know all of this, I’m just posting it here to show that a lot of decisions are made!
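
(To actually run this you would need the raw data frame dat, the dplyr pipeline loaded, and a small helper the code uses for the checkbox grids; the definition below is an assumption of what that helper does.)

library(dplyr)
# Assumed helper: TRUE when a checkbox response is present (i.e., not missing)
isnt.na <- function(x) !is.na(x)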


out <- dat |>
  rename(topic_interest = q0001,
         issue_most_important = q0002,
         us_direction = q0003,
         trump_approval = q0004,
         favor_elon = q0005,
         news_source_favorite = q0006,
         personal_economy_now = q0007,
         econ_priority_personal = q0008,
         attention_check = q0009,
         pride_american = q0010,
         general_happiness = q0011,
         general_anxiety = q0012,
         feel_lonely = q0013,
         trump_immigration = q0014,
         trump_inflation = q0015,
         trump_dei = q0016,
         trump_trade = q0017,
         tariff_result = q0018,
         college_important = q0019,
         dei_effect = q0020,
         due_process_noncitizens = q0021, 
         doge_eval = q0022,
         visa_deport = q0023,
         binary_gender_opinion = q0024,
         traditional_roles_opinion = q0025,
         new_gen = q0026,
         dem_party = q0027,
         rep_party = q0028,
         party_fights = q0029,
         trump_emotions = q0030,
         vote_2024 = q0031,
         redo_vote = q0032,
         trans_athletes = q0033,
         registered_voter = q0034,
         party_id = q0035,
         party_lean = q0036,
         maga = q0037,
         rep_policies = q0038,
         prog = q0039,
         dem_policies = q0040,
         state_residence = q0041,
         zip_code = q0042,
         white = q0044_0001,
         hispanic = q0044_0002,
         black = q0044_0003,
         asian = q0044_0004,
         mena = q0044_0005,
         hawaiin.pi = q0044_0006,
         amerindian = q0044_0007,
         other.race = q0044_0008,
         other.race.open = q0044_other,
         hispanic_origin = q0045,
         enrollment_status = q0046,
         education = q0047,
         income = q0048,
         relationship = q0049,
         sex = q0050,
         gender.notlisted = q0051_0001,
         gender.male = q0051_0002,
         gender.female = q0051_0003,
         gender.transgender = q0051_0004,
         gender.nonbinary = q0051_0005,
         sexuality.somethingelse = q0052_0001,
         sexuality.straight = q0052_0002,
         sexuality.lesbian = q0052_0003,
         sexuality.gay = q0052_0004,
         sexuality.bisexual = q0052_0005,
         sexuality.queer = q0052_0006,
         sexuality.asexual = q0052_0007,
         sexuality.pansexual = q0052_0008
      ) |> 
mutate(registered_binary = case_when(registered_voter =="Yes, registered to vote at current address" ~ "Yes",
                                     registered_voter =="No, not registered at current address — registered elsewhere" ~ "Yes",
                                     registered_voter =="No, not registered to vote" ~ "No",
                                     registered_voter == "Don't know" ~ "No"),
        pid3 = case_when(party_id == "Democrat" ~ "Democrat",
                        party_id == "Republican" ~ "Republican",
                        party_id %in% c("Independent","Something else") ~ "Independent"),
       pid5 = case_when(pid3 %in% c("Democrat","Republican") ~ pid3,
                        party_lean=="Republican" ~ "Lean Republican",
                        party_lean=="Democrat" ~ "Lean Democrat",
                        party_lean =="Neither" ~ "Independent"),
       pid3.lean = case_when(pid5 %in% c("Democrat","Lean Democrat") ~ "Democrat",
                             pid5 %in% c("Republican", "Lean Republican") ~ "Republican",
                             pid5=="Independent" ~ "Independent"),
       educ = case_when(education %in% c("Did not complete high school", "High school or G.E.D.") ~ "Hs Or Less",
                        education %in% c("Some college, but no degree", "Associate’s degree, occupational or vocational program",
                                                 "Associate’s degree, academic program") ~ "Some College",
                        education %in% c("Bachelor’s degree") ~ "College",
                        education %in% c("Post graduate degree") ~ "Postgrad"),
       educ_extended = if_else(enrollment_status=="Yes, in a four-year college program", "Attending Undergrad", educ),
       educ_extended = factor(educ_extended, levels = c("Hs Or Less", "Attending Undergrad", "Some College" ,"College", "Postgrad")),
       across(white:other.race, isnt.na),
       across(gender.notlisted:gender.nonbinary, isnt.na),
       across(sexuality.somethingelse:sexuality.pansexual, isnt.na),
       language = case_when(language=="es_US" ~ "Spanish",
                            language=="en" ~ "English"),
       gen.z = case_when(age4 %in% "18-29" ~ T,
                         .default = F),
       vote.pres.2024 = case_when(vote_2024=="Was not eligible to vote" ~ "Did not vote",
                                  vote_2024=="Someone else" ~ "Other",
                                  .default = vote_2024)
       ) |>
rowwise() |>
mutate(race.count = sum(white,hispanic, black, asian, mena, hawaiin.pi, amerindian, other.race)) |>
  ungroup() |>
  mutate(race = case_when(hispanic==1 | hispanic_origin=="Yes" ~ "Hispanic",
                          white==1 & race.count<2 ~ "White",
                          black==1 & race.count<2 ~ "Black",
                          asian==1 & race.count<2 ~ "Asian",
                          other.race==1 | mena==1 | hawaiin.pi==1 | amerindian==1 | race.count>=2 ~ "Other"),
         race.simple = case_when(race=="White" ~ "White", 
                                 race=="Hispanic" ~ "Hispanic",
                                 race == "Black" ~ "Black",
                                 !race %in% c("White","Hispanic","Black") ~ "Other"),
         pid3 = factor(pid3, levels = c("Republican", "Independent","Democrat")) ,
         pid3.lean = factor(pid3.lean, levels = c("Republican", "Independent","Democrat")),      
         pid5 = factor(pid5, levels = c("Republican", "Lean Republican" ,"Independent","Lean Democrat","Democrat")),
         race = factor(race, levels = c("White", "Black", "Hispanic", "Asian", "Other")),
         educ = factor(educ, levels = c("Hs Or Less", "Some College", "College", "Postgrad")),
         vote.pres.2024 = factor(vote.pres.2024, levels = c("Kamala Harris", "Donald Trump", "Other", "Did not vote")),
         educ.race = case_when(educ %in% c("Hs Or Less","Some College") & race!="White" ~ "Non-White Non-College",
                               educ %in% c("Hs Or Less","Some College") & race=="White" ~ "White Non-College",
                               educ %in% c("College","Postgrad") & race!="White" ~ "Non-White College",
                               educ %in% c("College","Postgrad") & race=="White" ~ "White College"),
         educ.race = factor(educ.race, levels = c("Non-White Non-College","Non-White College", "White Non-College","White College")),
         educ.sex = case_when(educ %in% c("Hs Or Less","Some College") & sex =="Male" ~ "Male Non-College",
                               educ %in% c("Hs Or Less","Some College") & sex =="Female" ~ "Female Non-College",
                               educ %in% c("College","Postgrad") & sex =="Male" ~ "Male College",
                               educ %in% c("College","Postgrad") & sex =="Female" ~ "Female College"),
         educ.sex = factor(educ.sex, levels = c("Male Non-College","Female Non-College", "Male College","Female College")),
         educ.sex.race = case_when(
           educ %in% c("Hs Or Less", "Some College") & sex == "Male" & race == "White" ~ "Male White Non-College",
           educ %in% c("Hs Or Less", "Some College") & sex == "Female" & race == "White" ~ "Female White Non-College",
           educ %in% c("College", "Postgrad") & sex == "Male" & race == "White" ~ "Male White College",
           educ %in% c("College", "Postgrad") & sex == "Female" & race == "White" ~ "Female White College",
           educ %in% c("Hs Or Less", "Some College") & sex == "Male" & race != "White" ~ "Male Non-White Non-College",
           educ %in% c("Hs Or Less", "Some College") & sex == "Female" & race != "White" ~ "Female Non-White Non-College",
           educ %in% c("College", "Postgrad") & sex == "Male" & race != "White" ~ "Male Non-White College",
           educ %in% c("College", "Postgrad") & sex == "Female" & race != "White" ~ "Female Non-White College"
         ),
         educ.sex.race = factor(educ.sex.race, levels = c(
           "Male White Non-College", "Female White Non-College",
           "Male White College", "Female White College",
           "Male Non-White Non-College", "Female Non-White Non-College",
           "Male Non-White College", "Female Non-White College"
         )),
         age.sex = paste(age4,sex),
         age.sex = factor(age.sex, levels = c("18-29 Female","18-29 Male","30-44 Female","30-44 Male","45-64 Female","45-64 Male",
                                              "65+ Female","65+ Male")),
         age.weight = factor(age.weight),
         college = case_when(educ %in% c("Hs Or Less", "Some College") & !enrollment_status %in% c("Yes, in a four-year college program","Yes, in a two-year college program", "Yes, in graduate school") ~  "No College, not enrolled",
                             educ %in% c("Hs Or Less", "Some College") & enrollment_status %in% c("Yes, in a four-year college program","Yes, in a two-year college program", "Yes, in graduate school") ~ "No College, enrolled",
                             educ %in% c("College", "Postgrad")  ~ "College+"),
         college = factor(college, levels = c("No College, not enrolled", "No College, enrolled", "College+")),
         age2 = case_when(age4 %in% c("18-29","30-44") ~ "18-44",
                          age4 %in% c("45-64","65+") ~ "45+"),
         age2 = factor(age2, levels = c("18-44","45+")),
         age4 = factor(age4, levels = c("18-29","30-44","45-64","65+")),
         age.genz.split = case_when(age_recode %in% c("18-20","21-23") ~ "18-23",
                                    age_recode %in% c("24-26","27-29") ~ "24-29"),
         age.genz.split = factor( age.genz.split, levels = c("18-23","24-29")),
          PartyFaction = case_when(
             !is.na(maga) ~ maga,
             !is.na(prog) ~ prog,
             TRUE ~ "Unaffiliated"
           ),
           PartyFaction = factor(
             PartyFaction,
             levels = c(
               "More of a supporter of the Make America Great Again or MAGA movement",
               "More of a supporter of the Republican Party",
               "Independent",
               "More of a supporter of the Democratic Party",
               "More of a supporter of progressive causes and the progressive movement"
             ),
             ordered = TRUE
           ),
           HighPoliticalInterest = if_else(topic_interest == "Politics","Most Interested in Politics","Most Interested in Other Topics"),
           HighPoliticalInterest = factor(HighPoliticalInterest,
                                          levels = c("Most Interested in Politics",
                                                     "Most Interested in Other Topics")),
         pid3.lean.interest = case_when(HighPoliticalInterest == "Most Interested in Politics" & pid3.lean == "Democrat" ~"Democrat (Interested)",
                                        HighPoliticalInterest == "Most Interested in Politics" & pid3.lean == "Republican" ~ "Republican (Interested)",
                                        HighPoliticalInterest == "Most Interested in Politics" & pid3.lean == "Independent" ~"Independent (Interested)",
                                        HighPoliticalInterest == "Most Interested in Other Topics" & pid3.lean == "Democrat" ~ "Democrat (Uninterested)",
                                        HighPoliticalInterest == "Most Interested in Other Topics" & pid3.lean == "Republican" ~ "Republican (Uninterested)",
                                        HighPoliticalInterest == "Most Interested in Other Topics" & pid3.lean == "Independent" ~ "Independent (Uninterested)"),
         pid3.lean.interest = factor(pid3.lean.interest,  
                                     levels = c("Democrat (Interested)", "Democrat (Uninterested)","Independent (Interested)","Independent (Uninterested)","Republican (Uninterested)","Republican (Interested)")),
         HighPoliticalInterest = factor(HighPoliticalInterest,
                                        levels = c("Most Interested in Politics",
                                                   "Most Interested in Other Topics")),
         race.simple = factor(race.simple,
                              levels = c("White",
                                         "Black",
                                         "Hispanic",
                                         "Other"))
  )

3.3.5 Make post-survey adjustments

There are three broad things that go into this step: weighting, editing, and imputing data.

The most important part of this step is weighting the data. Again, we will go over this in detail, but the idea here is to assign weights to individual people to ensure your sample looks like it is supposed to look.

First, it’s important to know how a weighted mean is calculated.

We know that a standard mean is:

\[ \bar{x} = \frac{\sum x_i }{n} \]

To get the average, add up all the responses (\(x_i\)) and divide by the number of people (\(n\)).

If data has a weight attached it might look like this:

x w
1 1.0
4 2.0
2 0.5
5 10.0
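
In R, those two columns are just vectors:

x <- c(1, 4, 2, 5)
w <- c(1, 2, 0.5, 10)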

To get the standard mean we do:

sum(x)/4
#> [1] 3
mean(x)
#> [1] 3

To get a weighted mean instead we do:

\[ \tilde{x} = \frac{\sum x_i*w_i}{\sum w_i} \]

This means that entries with a higher weight will have an outsized role in determining the mean:

sum(x*w)/sum(w)
#> [1] 4.444444
weighted.mean(x, w)
#> [1] 4.444444

If we get our sample and 60% of the sample are women, we know that is too many in almost all circumstances. Men and women should generally be 50/50. (Not always. Maybe we are sampling a profession that has a gender imbalance. Or maybe we are sampling older people where there are more women.)

If women right now are 60% of the sample, that means that they are:

.6/.5
#> [1] 1.2

1.2 times as prevalent as they are supposed to be. That data might look like this:

x female
0 1
0 1
1 1
1 1
0 1
0 1
1 0
0 0
1 0
0 0

If we assign a weight of 0.8 to the women and a weight of 1.2 to the men:

x female weight
0 1 0.8
0 1 0.8
1 1 0.8
1 1 0.8
0 1 0.8
0 1 0.8
1 0 1.2
0 0 1.2
1 0 1.2
0 0 1.2

Now if we calculate the weighted average of men and women:

weighted.mean(female, weight)
#> [1] 0.5

And we would calculate our other averages using that same weight, the one that creates the right balance of men and women:

mean(x)
#> [1] 0.4
weighted.mean(x, weight)
#> [1] 0.4166667
#Not the same!

When we get to weighting we will use “iterative rake weighting” that will actually allow us to do this for many variables at the same time!
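
Just as a preview (a minimal sketch, not necessarily the tool we will use later): the survey package’s rake() function implements this kind of iterative raking. The data frame svy, its columns, and the population margins below are all made up for illustration.

library(survey)

set.seed(215)
# Made-up sample that skews female relative to the "population" margins below
svy <- data.frame(sex  = sample(c("Male", "Female"), 500, replace = TRUE, prob = c(.4, .6)),
                  age4 = sample(c("18-29", "30-44", "45-64", "65+"), 500, replace = TRUE))

des <- svydesign(ids = ~1, data = svy, weights = rep(1, nrow(svy)))

# Population margins (consistent totals across margins; here each sums to 500)
pop.sex <- data.frame(sex  = c("Male", "Female"), Freq = c(250, 250))
pop.age <- data.frame(age4 = c("18-29", "30-44", "45-64", "65+"), Freq = c(100, 125, 165, 110))

raked <- rake(des, sample.margins = list(~sex, ~age4),
              population.margins = list(pop.sex, pop.age))
summary(weights(raked))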

The second part of this step is editing “bad” data. This is a grey area and a bit of an art, but you want to keep the responses that are “true” without accidentally throwing out any good data.

Common here is to remove people who took too short of a time on the survey, or people who answered the same thing for every question (“straight lining”, we call it). You might also decide to remove people who give impossible or nonsensical answers. An example might be someone who lists their state as one thing, but then lists a zip code that doesn’t exist in that state. Finally, we will often place “trap” questions that reveal when people are being inattentive, such as requesting that people select “somewhat disagree” regardless of their opinion.
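
As a toy illustration of that kind of flagging (a minimal sketch: the data frame svy, its duration column, the q1–q5 grid, and the 3-minute cutoff are all hypothetical):

speeder <- svy$duration < 180                     # interview length in seconds
grid <- svy[, c("q1", "q2", "q3", "q4", "q5")]    # a battery of questions on the same scale
straightliner <- apply(grid, 1, function(r) length(unique(r)) == 1)
svy.clean <- svy[!speeder & !straightliner, ]     # drop flagged respondents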

The last possible editing step is the imputation of missing data. This is something that, in my estimation, is getting less common. You might have an individual who answered all the questions except race. You can look at similar people and “impute” this person’s race so it is no longer missing. We may talk more about this, but as I said, I think this is now relatively rare.

3.3.6 Perform Analysis

Now that we have a correct data file with weights, we can perform analysis.

Most of the time for political surveys the “analysis” we will do is crosstabs.

Let’s load some data in to look at what that might look like:

pa <- rio::import("https://raw.githubusercontent.com/marctrussler/IIS-Data/refs/heads/main/PAFinalWeeks.csv", trust=T)

This is survey data from Pennsylvania in 2022. Let’s look at the most.important.problem, which is a standard question we ask about…. what issue the voter thinks is most important.

At the basic level we just want to know how many people answered each response:

table(pa$most.important.problem)
#> 
#>         Abortion   Climate change Crime and safety 
#>              224              166              249 
#>             Guns        Inflation 
#>               95              580

We rarely want to know raw numbers, however. What we really want to know is the percentages. If we wrap that table in prop.table() we get the proportions of each category.

prop.table(table(pa$most.important.problem))
#> 
#>         Abortion   Climate change Crime and safety 
#>       0.17047184       0.12633181       0.18949772 
#>             Guns        Inflation 
#>       0.07229833       0.44140030

That’s pretty interesting, but we often want to know the joint distribution of two variables. Do Democrats, Republicans, and Independents have different priorities?

We can put two variables in the table function to get the number of people in each distinct cell:

table(pa$most.important.problem, pa$pid)
#>                   
#>                    Democrat Independent Republican
#>   Abortion              162          37         24
#>   Climate change        102          55          9
#>   Crime and safety       72          76         98
#>   Guns                   47          32         15
#>   Inflation              79         156        341

If we wrap this in prop.table we will get the percentage each cell is of the whole, that is, taking each of these numbers and dividing by the total number of people:

prop.table(table(pa$most.important.problem, pa$pid))
#>                   
#>                       Democrat Independent  Republican
#>   Abortion         0.124137931 0.028352490 0.018390805
#>   Climate change   0.078160920 0.042145594 0.006896552
#>   Crime and safety 0.055172414 0.058237548 0.075095785
#>   Guns             0.036015326 0.024521073 0.011494253
#>   Inflation        0.060536398 0.119540230 0.261302682

That’s not usually what we want. We want to know: among Democrats, what percent has abortion as the most important issue? How does that compare to the proportion among Republicans?

With prop.table we can put the option 1 for row percentages, and 2 for column percentages. Here we want column percentages so that we get the percent of Democrats who answered each response:

prop.table(table(pa$most.important.problem, pa$pid),2)
#>                   
#>                      Democrat Independent Republican
#>   Abortion         0.35064935  0.10393258 0.04928131
#>   Climate change   0.22077922  0.15449438 0.01848049
#>   Crime and safety 0.15584416  0.21348315 0.20123203
#>   Guns             0.10173160  0.08988764 0.03080082
#>   Inflation        0.17099567  0.43820225 0.70020534

So 35% of Democrats said Abortion was the most important issue (this was just a few months post Dobbs), compared to 5% of Republicans.

Now, the row percentages might be interesting too, but we just have to know that’s what we are doing:

prop.table(table(pa$most.important.problem, pa$pid),1)
#>                   
#>                      Democrat Independent Republican
#>   Abortion         0.72645740  0.16591928 0.10762332
#>   Climate change   0.61445783  0.33132530 0.05421687
#>   Crime and safety 0.29268293  0.30894309 0.39837398
#>   Guns             0.50000000  0.34042553 0.15957447
#>   Inflation        0.13715278  0.27083333 0.59201389

Here we could say that, of the people who say that Abortion is the most important issue, 73% are Democrats and 11% are Republicans.

Note that our choice to do row or column percentages had everything to do with the order that we put the variables into the table. If we had reversed that, we also would have had to reverse how we requested the percentages:

prop.table(table(pa$pid, pa$most.important.problem),1)
#>              
#>                 Abortion Climate change Crime and safety
#>   Democrat    0.35064935     0.22077922       0.15584416
#>   Independent 0.10393258     0.15449438       0.21348315
#>   Republican  0.04928131     0.01848049       0.20123203
#>              
#>                     Guns  Inflation
#>   Democrat    0.10173160 0.17099567
#>   Independent 0.08988764 0.43820225
#>   Republican  0.03080082 0.70020534

What about weights? We found a weighted mean above, but all of the things in this table are just means too, just for certain subsets of the sample.

The standard table() function in R has no way to incorporate weights, but the pewmethods package has great functionality to do this via the get_totals() command:

pewmethods::get_totals("most.important.problem", pa, wt="weight")
#> Warning: There was 1 warning in `mutate()`.
#> ℹ In argument: `most.important.problem = (function (f,
#>   na_level = "(Missing)") ...`.
#> Caused by warning:
#> ! `fct_explicit_na()` was deprecated in forcats 1.0.0.
#> ℹ Please use `fct_na_value_to_level()` instead.
#> ℹ The deprecated feature was likely used in the dplyr
#>   package.
#>   Please report the issue at
#>   <https://github.com/tidyverse/dplyr/issues>.
#>   most.important.problem    weight
#> 1               Abortion  3.914996
#> 2         Climate change  3.181876
#> 3       Crime and safety  5.442236
#> 4                   Guns  1.987968
#> 5              Inflation 10.964861
#> 6              (Missing) 74.508064

By default for this command we get a (Missing) row which tells us the percent with missing data. We didn’t ask this of most survey respondents, so this number is very high. We can ignore it:

pewmethods::get_totals("most.important.problem", pa, wt="weight", na.rm=T)
#>   most.important.problem    weight
#> 1               Abortion 15.357780
#> 2         Climate change 12.481892
#> 3       Crime and safety 21.348852
#> 4                   Guns  7.798421
#> 5              Inflation 43.013055

To get these numbers by party there is an explicit by option:

pewmethods::get_totals("most.important.problem", pa, wt="weight", by="pid", na.rm=T)
#>   most.important.problem weight_name Democrat Independent
#> 1               Abortion      weight 31.09896   11.531769
#> 2         Climate change      weight 21.19636   17.219315
#> 3       Crime and safety      weight 19.49631   20.957130
#> 4                   Guns      weight 11.46667    8.547895
#> 5              Inflation      weight 16.74170   41.743890
#>   Republican
#> 1   4.783301
#> 2   2.113492
#> 3  22.909369
#> 4   4.110673
#> 5  66.083164

This is the standard way that we will analyze data, but these are just numbers so anything is possible here: correlation, regression, etc.

3.4 Surveys from an Error Perspective

The above went through the steps of running a survey from the perspective of design, but the other way that we can think about surveys is from the perspective of error.

Now this is not “error” in terms of making mistakes (although mistakes can happen); instead, this is naturally occurring error that is inherent to the survey process and that we want to minimize. In the above discussion of design we talked about taking lots of different steps. In each of those steps there were choices, and the choices made may result in more or less error being present in our final survey.

At the end of our process we get a series of responses to a particular question. Using our PA survey, we can ask: what proportion of people say that abortion is the most important issue.

head(pa$most.important.problem=="Abortion")
#> [1] FALSE FALSE  TRUE FALSE FALSE  TRUE
mean(pa$most.important.problem=="Abortion",na.rm=T)
#> [1] 0.1704718

Around 17% unweighted.

Or using the included weights:

weighted.mean(pa$most.important.problem=="Abortion", w = pa$weight,na.rm=T)
#> [1] 0.1535778

Around 15%.

We can think about error at the individual and collective level.

At the individual level, for each person in our sample there is a true value for whether they think abortion is the most important issue. We denote that as \(\mu_i\). This is the Greek letter mu, with the subscript \(i\) to say that it changes for each individual. We don’t observe \(\mu_i\), but instead \(y_i\), which is the response in our dataset. Any difference between \(\mu_i\) and \(y_i\) is error.

At the group level we took a weighted mean of this variable and got 15%. For voters in Pennsylvania there is a true percentage who believe that abortion is the most important issue, \(\bar{Y}\). This has no subscript because it does not change from individual to individual (or on the basis of anything else).

This is really critical: for anything that we calculate (here we are talking about a mean, but this goes for anything that we calculate in our sample) there is a true value in the population that we are studying.

The difference between our survey statistic \(\bar{y}\) and the true value \(\bar{Y}\) is the error in that statistic.

3.4.1 Validity

First, moving down the left side of the figure above, we see how the way that we ask and edit questions can affect error at the individual level (and, aggregated up, for the survey statistic).

The first place where we encounter error is in moving from a construct \(\mu_i\) to a measurement \(Y_i\).

A construct is the thing we are actually interested in. The measure is the specific question we ask to try to capture that construct.

The best example of this is ideology. When we measure ideology in a survey we often measure it on a 7-point scale that ranges from 1=Very Liberal to 7=Very Conservative. We actually have this for the Pennsylvania survey we’ve been using here:

#This isn't in the right order:
table(pa$ideology)
#> 
#>      Conservative           Liberal          Moderate 
#>              1264               879              2030 
#> Very conservative      Very liberal 
#>               450               396

We ask people to put themselves on this 7-point scale to capture their ideology, but this isn’t what ideology actually is. The concept of ideology is something beyond a simple measurement. The generally accepted definition is that an ideology is a frame used to organize political topics into a greater world view. If someone is liberal, it suggests something about how they think different political topics fit together, how they will incorporate new information, how they think society should be organized now and in the future…. this is all way beyond a simple 7-point scale.

Moving from a concept to a measure is called “operationalization”, and the degree to which a measure successfully represents a concept is called “validity”. To the degree that operationalization is not successful, error is introduced. The operationalization of ideology as a 7-point scale is generally thought to be a reasonably valid measure, but there is good evidence that it mostly pertains to dominant White culture.

The ideology of Black Americans, in particular, is not validly captured by the standard 7-point scale. There is often a mismatch in terms of measured ideology and other survey responses for Black Americans. For example, many call themselves conservative but have “liberal” views on policy and voting. Many Black respondents in the Obama era placed the president on the “conservative” side of the scale. In this great piece, Hakeem Jefferson shows that Black Americans simply have less familiarity with the “traditional” ideological scale because they are in a different information environment and have a different organizational schema for politics. Put in simpler terms: the way that politics is organized (i.e. ideology) for many Black Americans simply does not conform to the 7-point ideology scale that we present. It’s not that they lack ideology; they simply do not map onto the particular operationalization of ideology that we have selected.

Another example of operationalization is the SAT. When you applied for college you took the SAT (did you?) and were given a score that was meant to represent your intelligence. The questions chosen for the SAT are operationalizations of different forms of intelligence (math, English, etc.). You all seemed to do pretty well so you probably think that these were great operationalizations! Many other people would disagree.

3.4.2 Measurement Error

If \(\mu_i\) represents an individual’s value for the concept, then \(Y_i\) represents the true value for that individual for the measure of that concept. Sticking with our ideology example, each of you has a “true” value on the 7-point ideology scale. This is not the same as your actual ideology (\(\mu_i\)) but is instead what your actual answer is to that question.

But that’s not what we get directly. Instead we get \(y_{it}\), which is your response to the survey question at a particular point in time. The fact that this is subscripted by \(t\) is important, because it says that your answer may be different at two different points in time.

Any difference between your true level \(Y_i\) and your response \(y_{it}\) is measurement error.

Why might there be a difference?

The first big reason is a systematic bias. Consider questions about sensitive topics like drug use. If we ask people “How often have you used an illicit drug in the last month?” there is a real answer to that question, \(Y_i\). But there are all sorts of reasons why a person may not be totally honest (or even self-censor subconsciously) when answering that question. There are lots of other reasons why a measure might be systematically biased, for example a mismatch in race between the interviewer and the interviewee.

Beyond systematic reasons for measurement error there can be instability in response at the individual level. This will often vary based on how hard it is to answer a question and how ingrained the response is for the individual. A complicated question that requires more mental energy is more likely to have people give random responses. In these cases we say that question has low “reliability” because it may often miss the mark.

3.4.3 Processing Error

After the survey is recorded there may still be error in the processing stage. This has two main forms.

Above we said that we may delete respondents if they seem to not be paying attention. Some people might give legitimate answers but take the survey too slowly or too quickly. In this case our processing generates error because their responses are deleted. In some circumstances we may delete individual answers on questions because they are outliers, though there is some possibility that these represent true responses to the questions.

Another form of processing error may come from questions where individual responses are coded by the researchers. People may provide open responses to certain questions and then survey researchers read the responses and condense them into some form of coding scheme. Different researchers can have different judgements of where to place a response, which can introduce error into the process. For this reason it is common for multiple coders to code a large number of the same responses to make sure they are in alignment.
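
A quick way to check that alignment is simple percent agreement on a common batch of responses (a minimal sketch with made-up codes; more formal measures like Cohen’s kappa adjust for chance agreement):

coder1 <- c("Economy", "Economy", "Abortion", "Crime", "Economy", "Other")
coder2 <- c("Economy", "Inflation", "Abortion", "Crime", "Economy", "Other")
mean(coder1 == coder2)
#> [1] 0.8333333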

3.4.4 Representation Error

We have completed the left side of the error tree, but there are a whole bunch of other possible sources of error when we start thinking about aggregating individual responses to the group level.

Just like at the individual level, we conceptualize there being a true answer for our survey statistic at the population level: \(\bar{Y}\). Note that this does not have a subscript, because this is just one, unobservable, number. If we think about what proportion of Pennsylvania voters think Abortion is the most important issue, there is a true percentage that exists (but that we cannot observe).

Looking all the way at the bottom of the figure we eventually calculate our survey statistic. In our running example, this is the proportion of our sample that says that Abortion is the most important issue.

weighted.mean(pa$most.important.problem=="Abortion", w = pa$weight, na.rm=T)
#> [1] 0.1535778

Around 15%.

The difference between this survey statistic (\(\bar{y}\)) and the actual answer for the population (\(\bar{Y}\)) is a measure of our total error.

3.4.5 Coverage Error

The first source of representation error is “Coverage Error” which comes about from the sampling frame not being representative of the target population. This is error that occurs before anyone is even approached to take a survey.

If our target for our survey is Pennsylvania voters, anything that systematically excludes PA voters from our sampling frame would be coverage error. Another way to think about it: does every PA voter have a non-zero probability of being included in our sample?

With RDD samples, coverage error might occur from people not having a phone, or having a phone number with the wrong area code, for example. For an RBS survey the voter file might have the wrong phone number (or no phone number) for a certain percentage of the population. RBS surveys might also have the problem of people being newly registered or having registration that has lapsed.

For online surveys, the big source of coverage error is simply people who do not spend a lot of time online. These people might be systematically different than those who are online in a way that produces error.

3.4.6 Sampling Error

Sampling error is the “traditional” source of error that comes from random sampling.

When you flip a coin 100 times you do not need to get 50 heads and 50 tails. You are going to get something a little different every time. The same is true of a survey. If you sample 1000 people and ask what their most important issue is, you are not going to get the same answer every time. It’s going to be a little bit different.

This “little bit different” is a measurable thing, which we will talk about in detail. This amount is the sampling error.

You might see a survey report something like “This survey is accurate to within 2.1 percent, 19 times out of 20”. We will talk through what exactly this margin of error is and what it is measuring. However, what I will note now is that this number only represents the sampling error. Looking at all the possible sources of error that are in a survey, it is extremely weird that this is the only one that gets reported. That margin of error does not include measurement error, coverage error, non-response bias, or adjustment error!
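
For reference, that reported number is (roughly) the half-width of a 95% confidence interval for a proportion; a minimal sketch of the arithmetic we will unpack later:

n <- 1000
p <- 0.5                        # the "worst case" proportion
1.96 * sqrt(p * (1 - p) / n)
#> [1] 0.03099032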

3.4.7 Non-Response Error

Even if you have the right group of people not everyone will choose to answer your survey. You can read textbooks from the 1990s that will say “Survey research is in crisis as response rates have dipped into the 40% range”. Well, response rates are now regularly under 1%.

To measure response rate we divide the total number of people interviewed by the total number of people we attempted to contact. We can further decompose this response rate into the contact rate and the cooperation rate. The contact rate is the percent of people successfully contacted out of all the people we attempted to contact. The cooperation rate is the percentage of people who successfully completed a survey conditional on actually being contacted. It’s helpful to make this decomposition to see where a low response rate is coming from.
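
As arithmetic, with made-up counts (a minimal sketch):

attempted <- 20000      # people we tried to contact
contacted <- 3000       # people we actually reached
completed <- 240        # completed interviews
contacted / attempted   # contact rate
#> [1] 0.15
completed / contacted   # cooperation rate
#> [1] 0.08
completed / attempted   # response rate = contact rate x cooperation rate
#> [1] 0.012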

A low response rate does not necessarily mean that there will be error in the survey. There is only error if the probability of responding to the survey is correlated to the response we are interested in.

Let’s say that we are interested in the most important issue to Pennsylvania voters so we run a telephone survey. However, we are stupid and run it on a Sunday afternoon in October when the Eagles are playing (go birds). Football fans will be less likely to take the survey, but if being a football fan is unrelated to what issue you think is most important (which is plausible, given the ubiquity of football fandom in Philadelphia), then the fact that Eagles fans are not in the survey wouldn’t be a problem.

However, if instead our survey asked what people’s favorite sport is, then we would have a problem. In that case the probability of response (lower if you are a football fan) would be correlated to people’s answers (what sport they like the most).

We call this latter situation “Non-ignorable non-response” because it can be a serious source of bias in our surveys. The current landscape of political polling is plagued by problematic non-response where supporters of President Trump are systematically less likely to participate in ways that are very difficult to control for.

3.4.8 Adjustment Error

Some forms of non-response can be controlled for by making post-collection adjustments like weighting. I described this above, but in short, in these cases we generate survey weights to ensure that the distribution of certain variables in our sample matches population totals.

Successful application of survey weights can reduce the error in a survey by pulling estimates back towards their true values.

However, weighting can also increase error. We will cover this more extensively in the weighting chapter, but consider this example. We run an online survey and get fewer 65+ people than would be expected based on their percentage in the population. As such, we give more weight to this group in our survey. But what if the group of 65+ people in our survey isn’t the same as the 65+ people in the population? Maybe this group is more online and more liberal. In this case, giving this group more weight is likely to bias our survey away from what it would be if we had a perfect random sample.

3.5 Total Survey Error (TSE)

Taken overall, the two sides of the figure represent the “Total Survey Error” (TSE) paradigm. As mentioned above, when we report the margin of error in a survey we are only talking about sampling error. Just one of the many possible sources of error. We report this one because it is a known quantity, and we cannot measure any of the others.

But just because we can’t measure things doesn’t mean that they aren’t real!

When we are creating surveys it is absolutely critical that we take steps to minimize error however possible, even if we don’t have a way to measure how much we are reducing that error by.

This approach sets up the next 6 weeks or so of the course.

Over the next two weeks we will cover questionnaire design, which will have us interrogate the left side of the error chart. How can we write questions (and surveys overall) that maximize the validity of our questions?

After that we will cover sampling and weighting. That will cover the right side of the chart. How do we design samples – and weight them after receiving those samples – in a way that minimizes coverage error?

3.6 Questions

  1. If a survey says “The margin of error is plus or minus 2.1 percent” what type of error does that represent? Does such a statement take into account the “Total Survey Error” framework?

  2. How does sampling frame affect survey mode? Give two examples.

  3. The three key hallmarks of surveys were given as: systematic, quantitative, and the use of sampling. What do we mean by each of these three things, and how do they set surveys apart from other sorts of research?

  4. A stakeholder comes to you and says that they want to do a live-caller survey of potential voters in the upcoming Philadelphia DA election (where incumbent Democrat Larry Krasner is running for reelection). Would you advise them to use an RDD or RBS survey?

  5. Let’s say we conduct a 1,000-person survey and 700 people are White. Around 60% of Americans are Non-Hispanic White. Is it a good plan to drop the first 100 White people in the survey so that the proportion in the sample is closer to 60%? What about dropping a random 100 White people?

  6. Give meaningful interpretations to these three tables, showing that you understand what each is telling us.
prop.table(table("Biden Approval" = pa$biden.approval, "Pres Vote 2020" = pa$presvote2020))
#>                      Pres Vote 2020
#> Biden Approval                dem         rep   dnv/other
#>   Strongly approve    0.210599721 0.002191672 0.006375772
#>   Somewhat approve    0.221159594 0.008168958 0.027097031
#>   Somewhat disapprove 0.036859932 0.024108388 0.024307631
#>   Strongly disapprove 0.019326559 0.373381152 0.046423590
prop.table(table("Biden Approval" = pa$biden.approval, "Pres Vote 2020" = pa$presvote2020),1)
#>                      Pres Vote 2020
#> Biden Approval               dem        rep  dnv/other
#>   Strongly approve    0.96090909 0.01000000 0.02909091
#>   Somewhat approve    0.86247086 0.03185703 0.10567211
#>   Somewhat disapprove 0.43224299 0.28271028 0.28504673
#>   Strongly disapprove 0.04401089 0.85027223 0.10571688
prop.table(table("Biden Approval" = pa$biden.approval, "Pres Vote 2020" = pa$presvote2020),2)
#>                      Pres Vote 2020
#> Biden Approval                dem         rep   dnv/other
#>   Strongly approve    0.431604737 0.005373718 0.061185468
#>   Somewhat approve    0.453246223 0.020029311 0.260038241
#>   Somewhat disapprove 0.075541037 0.059110894 0.233269598
#>   Strongly disapprove 0.039608003 0.915486077 0.445506692