10 Survey Experiments

We have primarily looked at pretty straightforward survey questions in this course. One of the more exciting possibilities with survey research, however, is to embed experiments within our surveys. In this chapter we will first do a short crash course on understanding causality. Using that, we will broaden our concept of validity to discuss external vs. internal validity. We will apply that understanding to some basic experiments in surveys. Finally, we will discuss the problem of statistical power and how it can lead us to false negatives.

10.1 When does correlation equal causation? A crash course in Causality

I’m not a huge fan of “correlation doesn’t equal causation.” By this, people usually mean that just because two things happen in concert doesn’t mean that one causes the other. It’s a technically correct statement, but is often wielded to dismiss any study that someone doesn’t like.

The truth of the matter is that no research design can actually prove causality. Instead of thinking of causality as something that either exists (x definitely causes y) or doesn’t (x and y are correlated but we don’t know if the relationship is causal), we can think about it as a continuum. We can have research designs that approximate causality to a greater or lesser degree.

The goal of this textbook section is to break down what “to cause” really means, and how we can apply that knowledge in survey experiments.

10.1.1 What does it mean for one thing to cause another thing?

The best way to think about what “cause” really means is to think about a light switch. Imagine getting up from your desk right now and walking over to the light-switch on the wall and flipping it. If you do so, almost certainly the light is going to go off. Now: does the light switch cause the light to turn off? Clearly, you are going to say that it does. But do we know that for certain? What if the power for the whole building went off right when you flipped the switch? What if the lights in the building are on a timer that happened to trigger right when you flipped the switch? While both of these things are implausible, they are possible.

So what information would we need in order to know with 100% certainty that the light switch turns the light on and off?

To know for certain what we would need to do is to both hit the light switch and not hit the light switch at exactly the same time. If we were able to do that, we would be able to see that hitting the light switch turns off the lights, and in the absence of us hitting the light switch the light, in fact, stays on.

We can extend this same logic to something like a drug trial. Let’s say that we have an individual who is experiencing some sort of ailment that we want to treat with our new drug. We give them the drug, monitor their progress, and after a few days they get better! So does the drug work?

Well… we don’t actually know! Similar to the light-switch: we don’t know what would have happened if we had not given them the drug! Maybe the drug worked and that’s why they were going to get better, but maybe they were going to get better anyways and the drug did nothing…

Again: the only way to really know if the drug works is to both give an individual the drug and to not give the same individual the drug at exactly the same time.

As an aside, this is actually what the “placebo” effect is. People believe the placebo effect means that when you give someone a sugar pill their brain “makes” the drug work. That is true to some extent, but it is not the majority of the placebo effect. What is actually happening is that people who are likely to enter a drug trial are particularly sick. Because they are particularly sick when they enter the trial they will, without any intervention, get better on average. The placebo effect is largely tracking these individuals’ regression back to their usual level of sickness.
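Here is a quick simulation sketch of that idea (all of the numbers are invented; higher values mean sicker). People who enroll because they happen to be unusually sick today will, on average, look better a few weeks later even with no treatment at all:

set.seed(5)
usual <- rnorm(100000, mean = 50, sd = 10)        # each person's typical sickness level
at.entry <- usual + rnorm(100000, sd = 10)        # how sick they happen to be at enrollment
later <- usual + rnorm(100000, sd = 10)           # how sick they are a few weeks later
entrants <- at.entry > 70                         # only the unusually sick join the trial
mean(at.entry[entrants]) - mean(later[entrants])  # positive: entrants "improve" with no drug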

To formalize this we can think about two measures for a particular individual:

  • \(Y_{treat}\) is the outcome for an individual (something like their health in a drug trial) if they are treated.

  • \(Y_{control}\) is the outcome for an individual (something like their health in a drug trial) if they are not treated.

As such, the “treatment effect” is the difference between these two values:

\[ \delta = Y_{treat} - Y_{control} \]

The problem (absurdity, really) with this setup should be abundantly clear to you by now: absent a time machine we can never measure both \(Y_{treat}\) and \(Y_{control}\) for the same individual.

This is what is known as the Fundamental Problem of Causal Inference. The literal impossibility of this problem is why no research design ever captures the true causal effect.

10.1.2 What if we have lots of people?

For an individual person it is impossible to tell whether a drug works, or not. What if we have a lot of people? Is there something with aggregation that we can do to suss out causality in a situation like a drug trial?

Let’s pretend it is 2020 and we are in charge of determining whether the COVID-19 vaccines are effective or not.

One way of answering this question is to go out and survey people. We do everything right and get a random sample of a few thousand Americans and ask whether they have received the vaccine (are treated) or not (control). We also ask how many times they have gotten COVID, and calculate the average number of COVID cases among those who took the vaccine (\(\hat{Y}_T\)) and among those who did not take the vaccine (\(\hat{Y}_C\)).

We can then estimate the treatment effect as:

\[ \delta_n = \hat{Y}_T - \hat{Y}_C \]

This looks a lot like our equation above for the ideal treatment effect for an individual person. Above we couldn’t calculate that number because we couldn’t observe a person in both treated and untreated states, but now we have people in both states (those who took the vaccine and those who didn’t). Plus this sample was generated by randomly selecting people. So is this a good estimate of whether the vaccine is effective?

No it’s not!

At the individual level the Fundamental Problem of Causal Inference is that we cannot observe units in both their treated and untreated states. That’s a problem because we want to compare what happens when an individual is treated to what would have happened if they were untreated.

This aggregate version only works if the individuals in the control group tell us what would have happened to the people in the treatment group if they had not taken the vaccine. Is that true? Are people who decided not to take the vaccine just like the people who did take the vaccine? Definitely not! These are two very different groups of people.

In particular, those who did not take the vaccine were also those who were more likely to get COVID in the absence of the vaccines (because of their behavior). So in the above calculation, when we find that the treatment group has fewer COVID cases than the control group, a good chunk of that effect would exist with or without vaccines.
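To see how selection alone can manufacture an “effect,” here is a simulation sketch (every number here is made up) in which the vaccine does literally nothing, but cautious people are both more likely to get it and less likely to get COVID:

set.seed(4)
n <- 100000
cautious <- runif(n) < 0.5                              # half the population is cautious
vax <- runif(n) < ifelse(cautious, 0.8, 0.3)            # cautious people select into treatment
cases <- rpois(n, lambda = ifelse(cautious, 0.5, 1.5))  # cases depend only on behavior, not the vaccine
mean(cases[vax]) - mean(cases[!vax])                    # negative, even though the true effect is zero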

We can generalize this problem as one where certain types of people “select” a treatment making the resulting comparison to a “control” group biased.

  • Does political party cause individuals to abide by COVID precautions?

  • Does being from a wealthier district cause legislators to vote for lower taxes?

  • Do policies like Larry Krasner’s lead to a rise in crime in US cities?

  • Does Tom Brady’s TB12 workout program help teams win the Super Bowl?

In trying to determine the effects of all of these things we can think of reasons why a non-random group would “select” treatment, such that we can’t just compare those who are treated to those who are not and get a good estimate of a treatment effect.

10.1.3 So how do we do it?

This seems like a really hopeless situation! We don’t have a time machine, and in many situations the things we are interested in are things that a non-random group of people select into…

But this thinking has at least allowed us to see what we need to have in order to show causality. In order to have an estimate of the treatment effect that is accurate, we must have a control group whose outcomes are equal to what would have happened in the treatment group had they not been treated.

So how do we do that?

We might think we should do some sort of “matching” procedure: list out the characteristics of the people who were treated, and then go and find exact matches of people just like them who were not treated. This actually does exist, but as you can imagine it is really expensive! Further, it has the downside that we can only match on things we can actually measure.

The real solution is way easier: Randomization is the solution to the Fundamental Problem Of Causal Inference.

If people are randomly assigned to treatment and control groups then, by definition, the two groups will be equivalent on all possible characteristics, whether we measure them or not.

This logic breaks down for very small groups – with 10 people you could imagine randomly assigning a weird group of 5 to treatment – but once you get to a sufficiently large group (like… 30) random assignment necessarily leaves you with two groups that are statistically equivalent. If you then assign treatment to one of those two groups, the only reason they end up different is the treatment you give them.
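To convince yourself of this, here is a small simulation sketch (the sample size and the “age” variable are invented). Even though age is never used in the assignment, the two randomly assigned groups end up with nearly identical average ages:

set.seed(1)
age <- rnorm(1000, mean = 45, sd = 12)                        # a characteristic we never look at
treat <- sample(c(TRUE, FALSE), size = 1000, replace = TRUE)  # coin-flip assignment
mean(age[treat]) - mean(age[!treat])                          # tiny difference, due only to chance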

10.1.4 Causal Inference as a missing data problem

As an example, we can think about a job training program that is set up for adults to be trained in a new skill-set.

The people funding the program want to know whether or not the program is helping participants get better paying jobs a year after they complete the program.

Consider these data, where for 10 people we know whether they were treated with this job program:

df <- data.frame(treat = c(T, T, F, T, F, F, F, T, T, F), y.obs = c(84000, 70000, 44000, 56000, 59000, 53000, 61000, 61000, 64000, 54000))
kableExtra::kable(df)
treat  y.obs
TRUE   84000
TRUE   70000
FALSE  44000
TRUE   56000
FALSE  59000
FALSE  53000
FALSE  61000
TRUE   61000
TRUE   64000
FALSE  54000

If we think about measuring the outcome among the treated group and among the non-treated group we can see that what we have is a missing data problem:

# Create empty columns, then copy in the observed outcome for each group
df$y.treat <- NA
df$y.control <- NA
df$y.treat[df$treat] <- df$y.obs[df$treat]
df$y.control[!df$treat] <- df$y.obs[!df$treat]
df$treat.effect <- NA
kableExtra::kable(df)
treat  y.obs  y.treat  y.control  treat.effect
TRUE   84000  84000    NA         NA
TRUE   70000  70000    NA         NA
FALSE  44000  NA       44000      NA
TRUE   56000  56000    NA         NA
FALSE  59000  NA       59000      NA
FALSE  53000  NA       53000      NA
FALSE  61000  NA       61000      NA
TRUE   61000  61000    NA         NA
TRUE   64000  64000    NA         NA
FALSE  54000  NA       54000      NA

For those individuals who are treated we do not know what their value would have been had they been in the control group, so that information is missing. The opposite is true for those who are not treated: we know their value in the control group, but we do not know what their value would have been had they been in the treatment group.

Random assignment allows us to make the assumption that the average value among the control group is a good stand-in for the missing value for those in the treatment group and vice versa. The average value among the treated and control are:

mean(df$y.treat, na.rm=T)
#> [1] 67000
mean(df$y.control, na.rm=T)
#> [1] 54200

So if we think about the first individual, our best guess for what would have happened to them if they were in the control group is that they would have made $54,200. For the third individual, our best guess of what would have happened if they were in the treatment group is that they would have made $67,000. We can fill in all of these values:

df$y.treat[is.na(df$y.treat)] <- mean(df$y.treat,na.rm=T)
df$y.control[is.na(df$y.control)] <- mean(df$y.control,na.rm=T)
kableExtra::kable(df)
treat  y.obs  y.treat  y.control  treat.effect
TRUE   84000  84000    54200      NA
TRUE   70000  70000    54200      NA
FALSE  44000  67000    44000      NA
TRUE   56000  56000    54200      NA
FALSE  59000  67000    59000      NA
FALSE  53000  67000    53000      NA
FALSE  61000  67000    61000      NA
TRUE   61000  61000    54200      NA
TRUE   64000  64000    54200      NA
FALSE  54000  67000    54000      NA

This allows us to determine, at the individual level, what the treatment effects are:

df$treat.effect <- df$y.treat - df$y.control
kableExtra::kable(df)
treat  y.obs  y.treat  y.control  treat.effect
TRUE   84000  84000    54200      29800
TRUE   70000  70000    54200      15800
FALSE  44000  67000    44000      23000
TRUE   56000  56000    54200      1800
FALSE  59000  67000    59000      8000
FALSE  53000  67000    53000      14000
FALSE  61000  67000    61000      6000
TRUE   61000  61000    54200      6800
TRUE   64000  64000    54200      9800
FALSE  54000  67000    54000      13000

And the average of this column is our treatment effect:

mean(df$treat.effect)
#> [1] 12800

The purpose of this exercise was for you to see, conceptually, how we use the treatment and control groups to stand in for each other’s missing values. In the “real” world we don’t have to go through all this work to get the treatment effect. If we have random assignment we just have to do:

t.test(df$y.obs ~ df$treat)
#> 
#>  Welch Two Sample t-test
#> 
#> data:  df$y.obs by df$treat
#> t = -2.2649, df = 6.6392, p-value = 0.05993
#> alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
#> 95 percent confidence interval:
#>  -26312.4162    712.4162
#> sample estimates:
#> mean in group FALSE  mean in group TRUE 
#>               54200               67000

We can alternatively do:

lm(df$y.obs ~ df$treat)
#> 
#> Call:
#> lm(formula = df$y.obs ~ df$treat)
#> 
#> Coefficients:
#>  (Intercept)  df$treatTRUE  
#>        54200         12800

10.2 Two Types of Randomization

Students sometimes have trouble distinguishing between the two critical types of randomization that are present in survey research.

They are:

  • Random Assignment to Treatment is the solution to the Fundamental Problem of Causal Inference. This helps us get unbiased estimates of treatment effects by creating two groups that are identical except for one being exposed to treatment.
  • Random sampling is the fundamental step that underlies what we learned about the regularity of sampling error. Having a random sample ensures that our understanding of the sampling distribution and equations for the standard error are accurate.

In academic research we often talk about internal and external validity. Each of these speaks to one of these two types of randomization.

A study has internal validity when we are confident that (for the particular people studied) we are measuring the correct things. This relates back to our discussion of constructs and measures, but also applies to making sure we are comparing individuals to proper counterfactuals in a causality framework.

A study has external validity when we are able to use our sample to make broader conclusions about a population. This speaks to everything we have discussed with random sampling, weighting, non-response etc.
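To make the distinction concrete, here is a minimal simulated sketch (the population, the variable names, and the sizes are all invented) showing where each type of randomness enters:

set.seed(2)
population <- data.frame(id = 1:100000, age = round(rnorm(100000, mean = 45, sd = 12)))

# Random sampling: which members of the population end up in the survey at all.
# This is what lets us generalize from the sample to the population (external validity).
svy <- population[sample(nrow(population), size = 1000), ]

# Random assignment: which respondents see the treatment.
# This is what makes the treated and control groups comparable (internal validity).
svy$treat <- sample(c(TRUE, FALSE), size = nrow(svy), replace = TRUE)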

10.3 Anoll, Engelhardt, and Israel-Trummel Example

The assigned reading this week is a fairly straightforward example of a survey experiment.

The authors want to know the degree to which the BLM protests affected norms of childhood socialization. This is important because a lot of political science research has shown that the way children are raised has important implications for their political and social views down the road. Because the BLM protests were such a salient moment (and spurred a lot of discussion about intrinsic values) it is reasonable to hypothesize that exposure to these protests changed how parents decide to raise their children.

Before we get to the actual survey experiment, I just wanted to highlight another analysis that they do in this paper which speaks to issues of causality. To be clear, this is not survey content and I only want you to know about this methodology in-as-much as it teaches us about causal inference.

This preliminary analysis asks what effect the BLM protests themselves had on the way that parents talk about race. To get these data they look at the number of posts made on parenting pages that mention race, BLM, all lives matter, or police.

One way that you could think about assessing this is simply to see if the number of posts made about these topics increased over the spring and summer of 2020. In other words, does week of the year influence the count of race-focused parenting content? The problem with this is that we can’t separate the influence of BLM from everything else that’s changing over the year. If we saw a positive effect, maybe it was the result of BLM, but maybe we are just capturing proximity to the election and a ramping up of all political talk.

The way the authors identify this effect is with a regression discontinuity design, which is presented in the figures in the paper.

With a regression discontinuity design we identify a clear cut-point in a continuous variable. In this case, the start of the protests in late May serves as this cut-point (discontinuity). The logic of these designs is that the observations just before and just after this cut-point are completely similar but for having witnessed the event in question (the BLM protests). Similar to an experimental design, we think the posts right before the cut-point are a reasonable control group for the posts right after the cut-point. By looking at the difference between the two we isolate the effect of the protests from the overall trend in talking about race. There are big effects!

On to the actual survey experiment, the authors wanted to know if exposure to BLM leads to parents wanting to change the curriculum in childhood education. Specifically, they are measuring whether parents want to assign the book “The Hate U Give” to classrooms. Of note, they conducted this experiment 2 years after the protests, when most people were no longer thinking so much about BLM.

One way we could think about studying this is to ask people their familiarity with BLM and then ask whether they want to assign the book. Thinking about what we learned about causality, would this be a good approach? For that to work we would have to believe that people who are very familiar with BLM are just like the people who are not very familiar with BLM. That is almost certainly not true. The people who were still thinking about BLM 2 years after the protests are obviously going to be very different from those who were not.

So instead, the authors experimentally prime their survey respondents.

There were really three conditions, but I’ve reduced things down to 2 for simplicity. (The third treatment was a generic “politics” treatment to see if the mere mention of politics changed the way people recommended books to high schoolers.)

In the control condition the respondents viewed a poster that advertised a summer parade. In the treatment condition the respondents viewed a poster that advertised Black Lives Matter. Because this treatment was randomly assigned, the people who saw the BLM poster are just like the people who saw the control poster, other than exposure to the treatment.

Here is what the data look like:

dat <- rio::import("https://github.com/marctrussler/IIS-Data/raw/refs/heads/main/APSRExperiment.Rds", trust=T)

head(dat)
#>    divbook treatment        pid3
#> 3        0   control Republicans
#> 4        0       blm Republicans
#> 7        0       blm   Democrats
#> 9        0   control   Democrats
#> 14       0   control Republicans
#> 20      NA       blm Republicans

For each respondent we have information on whether they are in the treatment or control group, and whether they selected “The Hate U Give” (1) or another book (0).

The simple test of the experiment is to see if the probability of selecting the book changes in the treatment and control conditions. We can figure that out via a difference in means test.

t.test(dat$divbook ~ dat$treatment)
#> 
#>  Welch Two Sample t-test
#> 
#> data:  dat$divbook by dat$treatment
#> t = -0.88235, df = 819.29, p-value = 0.3778
#> alternative hypothesis: true difference in means between group control and group blm is not equal to 0
#> 95 percent confidence interval:
#>  -0.08796623  0.03340637
#> sample estimates:
#> mean in group control     mean in group blm 
#>             0.2570093             0.2842893

The answer is no. The experiment does not have a direct effect on the probability of selecting the book; there is only a small, statistically insignificant effect.

But this was expected by the authors. If we think about the possible effects of being exposed to BLM, they should be different for Democrats and for Republicans. Think back to the Zaller and Feldman theory of considerations. When we prime BLM for Democrats it’s going to bring up all sorts of considerations about race that are going to make them more likely to recommend a racially conscious book. When we prime BLM for Republicans it’s going to bring up all sorts of considerations that are going to make them less likely to recommend a racially conscious book. So the main effect might be zero because Democrats and Republicans are moving in opposite directions.

We can analyze this by looking at the effect separately for Democrats, Republicans, and Independents:

t.test(dat$divbook[dat$pid3=="Democrats"] ~ dat$treatment[dat$pid3=="Democrats"])
#> 
#>  Welch Two Sample t-test
#> 
#> data:  dat$divbook[dat$pid3 == "Democrats"] by dat$treatment[dat$pid3 == "Democrats"]
#> t = -2.5876, df = 355.27, p-value = 0.01006
#> alternative hypothesis: true difference in means between group control and group blm is not equal to 0
#> 95 percent confidence interval:
#>  -0.23598030 -0.03217613
#> sample estimates:
#> mean in group control     mean in group blm 
#>             0.3519553             0.4860335
t.test(dat$divbook[dat$pid3=="Republicans"] ~ dat$treatment[dat$pid3=="Republicans"])
#> 
#>  Welch Two Sample t-test
#> 
#> data:  dat$divbook[dat$pid3 == "Republicans"] by dat$treatment[dat$pid3 == "Republicans"]
#> t = 1.8311, df = 378.23, p-value = 0.06787
#> alternative hypothesis: true difference in means between group control and group blm is not equal to 0
#> 95 percent confidence interval:
#>  -0.004786448  0.134465919
#> sample estimates:
#> mean in group control     mean in group blm 
#>             0.1741294             0.1092896
t.test(dat$divbook[dat$pid3=="Independents"] ~ dat$treatment[dat$pid3=="Independents"])
#> 
#>  Welch Two Sample t-test
#> 
#> data:  dat$divbook[dat$pid3 == "Independents"] by dat$treatment[dat$pid3 == "Independents"]
#> t = 0.84736, df = 83.664, p-value = 0.3992
#> alternative hypothesis: true difference in means between group control and group blm is not equal to 0
#> 95 percent confidence interval:
#>  -0.1021435  0.2538075
#> sample estimates:
#> mean in group control     mean in group blm 
#>             0.2553191             0.1794872

Among Democrats, being exposed to the treatment increases the probability they select the racially conscious book by about 13 percentage points. Among Republicans, being exposed to the treatment decreases the probability of selecting the racially conscious book by about 6.5 points. And similarly, among Independents being exposed to the treatment reduces the probability of selecting the book by about 7.5 percentage points.
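As a quick check, the same three gaps can be computed directly from the group means (a small sketch using the variables already in dat):

group.means <- tapply(dat$divbook, list(dat$pid3, dat$treatment), mean, na.rm = TRUE)
group.means[, "blm"] - group.means[, "control"]  # Democrats ~ +0.134, Independents ~ -0.076, Republicans ~ -0.065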

These same differences can be recovered via regression (indeed, that’s what they do in the paper and that’s what I would do).

The main effect in a simple regression:

m1 <- lm(divbook ~ treatment, data=dat)
summary(m1)
#> 
#> Call:
#> lm(formula = divbook ~ treatment, data = dat)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -0.2843 -0.2843 -0.2570  0.7157  0.7430 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   0.25701    0.02148  11.965   <2e-16 ***
#> treatmentblm  0.02728    0.03089   0.883    0.377    
#> ---
#> Signif. codes:  
#> 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.4444 on 827 degrees of freedom
#>   (18 observations deleted due to missingness)
#> Multiple R-squared:  0.0009425,  Adjusted R-squared:  -0.0002656 
#> F-statistic: 0.7802 on 1 and 827 DF,  p-value: 0.3773

And the partisan effects are found by interacting the treatment variable with the party variable:

m2 <- lm(divbook ~ treatment*pid3, data=dat)
summary(m2)
#> 
#> Call:
#> lm(formula = divbook ~ treatment * pid3, data = dat)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -0.4860 -0.3520 -0.1741  0.5140  0.8907 
#> 
#> Coefficients:
#>                              Estimate Std. Error t value
#> (Intercept)                   0.25532    0.06167   4.140
#> treatmentblm                 -0.07583    0.09158  -0.828
#> pid3Democrats                 0.09664    0.06930   1.395
#> pid3Republicans              -0.08119    0.06850  -1.185
#> treatmentblm:pid3Democrats    0.20991    0.10190   2.060
#> treatmentblm:pid3Republicans  0.01099    0.10126   0.109
#>                              Pr(>|t|)    
#> (Intercept)                  3.83e-05 ***
#> treatmentblm                   0.4079    
#> pid3Democrats                  0.1635    
#> pid3Republicans                0.2363    
#> treatmentblm:pid3Democrats     0.0397 *  
#> treatmentblm:pid3Republicans   0.9136    
#> ---
#> Signif. codes:  
#> 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.4228 on 822 degrees of freedom
#>   (19 observations deleted due to missingness)
#> Multiple R-squared:  0.1007, Adjusted R-squared:  0.09526 
#> F-statistic: 18.42 on 5 and 822 DF,  p-value: < 2.2e-16

The differences in effect sizes described in the regression table are identical to the differences in effect sizes we get from the separate difference-in-means tests.
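To see that equivalence, we can combine the coefficients printed above (a small sketch; the coefficient names come straight from m2):

b <- coef(m2)
b["treatmentblm"]                                      # Independents (the reference group): ~ -0.076
b["treatmentblm"] + b["treatmentblm:pid3Democrats"]    # Democrats: ~ +0.134
b["treatmentblm"] + b["treatmentblm:pid3Republicans"]  # Republicans: ~ -0.065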

10.4 Other experiment possibilities

We can randomize whatever we want in the survey, and we know that once we randomize, the groups that we get are exactly alike in all ways except for exposure to the treatment.

We have already seen some experiments in this class, but a short list of things to manipulate would be:

  • The wording of a question
  • The information given to respondents
  • Video, pictures, or text shown to respondents
  • The answer choices available to respondents

10.5 Null Effects and Statistical Power

When we present a long text to respondents we call this a “vignette” experiment.

One such experiment was done by Myrick (2020). The experimental manipulation is the pair of alternative sentences separated by a slash:

A dictator in the Middle East is widely known for torturing and repressing his people and threatening stability in the region. Rebels within the country are attempting to overthrow the current government but have been unsuccessful so far. After debating different policies, the U.S. government decided to send in a small military force to assist the rebels. The government informed the American public about the operation/The government kept the operation completely secret from the American public. How much do you approve or disapprove of the actions of the US government in this situation?

What would our conclusion be if we did this study and found there was no statistically significant difference in the degree of approval between the two conditions?

There are three possibilities.

The first is that respondents genuinely don’t care whether the public is informed or not in such a situation. That is, their level of support for foreign military intervention doesn’t depend on disclosures by the government; they either support it or they don’t. We might call this the “true null” scenario.

Unfortunately, there are two other scenarios we have to go through in order to determine whether that is the case.

The second possibility is that the treatment above is simply too weak to elicit the response that we were expecting. Tucked at the end of that paragraph, respondents (especially satisficers!) may not have noticed or internalized the information about secrecy/transparency that we were expecting them to see.

To help rule out this second possibility, a common tactic in experimental studies is to include a “manipulation check” that verifies that respondents have noticed and internalized the treatment. These are usually simple, easy-to-answer questions that verify the respondents were paying attention to the vignette. In this case we might ask “We are checking to make sure you read carefully. In the last scenario, what did the US government do?” giving the options of “informed the public” or “kept the operation secret.” If the treatment is working, people should answer appropriately given their experimental condition. If they did, we can be more confident that we are seeing a true null.

The third possibility is that the null finding we have is a “false negative”: there is a real difference in opinion between these two groups in the population, but we have gotten unlucky and randomly drawn a sample in which there is no difference.

This is a real possibility. You should take PSCI1801 to learn more about this, but our hypothesis tests are set up to guard against “false positives”: saying there is an effect when there is not one in reality. That makes good sense. In science we really, really don’t want to claim something is there when it’s not. We would much rather conclude something is not there when it actually is.

Think about this like a drug trial: it is bad, but not catastrophic, to conclude that a drug is ineffective when it actually is effective. It is catastrophic, on the other hand, to conclude that a drug works when it does not.

The way that we express this possibility is through statistical power. Power is the ability of an experiment to detect an effect of a certain size if that effect actually exists.

Let’s say we did the above experiment in the survey you did for class. We had about 900 people in each of the 5 modules. If we did an experimental treatment that would mean there would be around 450 people in each arm of the study. For these calculations we tell R the sample size (in each arm of the experiment), the significance level of our hypothesis test (.05 corresponds to a 5% chance of getting a false positive by accident), and a Cohen’s d, which is a standardized measure of the effect size we expect in our experiment. In this case I’ve chosen a Cohen’s d of 0.2, which is the standard value for a “small” effect size. The above experiment is pretty subtle so I’m going to say that we expect a small effect size.

Putting that in the following function in R:

#install.packages("pwr")
pwr::pwr.t.test(
  d=.2,
  n=450,
  sig.level=.05
)
#> 
#>      Two-sample t test power calculation 
#> 
#>               n = 450
#>               d = 0.2
#>       sig.level = 0.05
#>           power = 0.8500919
#>     alternative = two.sided
#> 
#> NOTE: n is number in *each* group

We get a statistical power of 85%. This means that there is an 85% chance of concluding that there is a real effect if there actually is a real effect in the population. This is the “true positive” rate. The inverse of this number is therefore the probability of not concluding there is a real effect when there actually is a real effect in the population. This 15% is the “false negative” rate.

So in this seemingly reasonable setup there is a 15% chance of a false negative. That is: there is a 15% chance that we will find no difference between these experimental conditions even though there is a real effect in the population!
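We can see that 15% directly with a simulation sketch (assuming, as above, a true effect of d = 0.2 and 450 people per arm):

set.seed(3)
missed <- replicate(2000, {
  control <- rnorm(450, mean = 0, sd = 1)
  treated <- rnorm(450, mean = 0.2, sd = 1)  # the true effect really is there
  t.test(treated, control)$p.value > 0.05    # TRUE whenever we fail to detect it
})
mean(missed)                                 # roughly 0.15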

Generally speaking, most survey experiments are “underpowered” to conclude that there are true null effects. For us to be 95% certain that we do not have a false negative our cell sizes would have to be:

pwr::pwr.t.test(
  d=.2,
  power=.95,
  sig.level=.05
)
#> 
#>      Two-sample t test power calculation 
#> 
#>               n = 650.6974
#>               d = 0.2
#>       sig.level = 0.05
#>           power = 0.95
#>     alternative = two.sided
#> 
#> NOTE: n is number in *each* group

About 651 people in each group, which works out to a survey with roughly 400 additional people in it overall.
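That “additional 400” is just the rounded-up per-group requirement compared against the 450-per-arm design:

ceiling(650.6974) * 2 - 450 * 2
#> [1] 402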