Midterm Exam Answers

Opens: 12:00pm on Wednesday March 4th

Due: 1:00pm on Wednesday March 4th.

Instructions:

  • This is an in-class open-note test. You have 1 hour to complete the exam.

  • I do not accept late work. Exams handed in after 60 minutes will not be graded.

  • The test is open notes. You can use any course material you wish, including this textbook, your problem sets, the problem set answer keys, and your own notes. You cannot Google things, and all use of AI is prohibited.

  • As with the problem sets, you will be graded on your comprehension of the material in this specific course.

  • You may not use ChatGPT or any other AI tool to answer the questions. Violating this policy will result in a score of 0 for the midterm and an immediate referral to the Center for Community Standards & Accountability to decide further appropriate disciplinary action.

  • You will submit the exam the same way you submit problem sets: a .rmd file and a knitted html.

  • The exam will be graded anonymously. Please put only your student number (not your name) on the exam.

Good luck!


We are going to work with panel data from the Cooperative Election Study (CES). The CES is a large national survey that asks Americans about their political opinions and behavior. In this dataset each respondent was surveyed twice: once in 2010 and once in 2014.

Along with some stable, basic demographic information, we have a measure of respondents’ fiscal policy preferences deficit.fix in each interview year. This is a 100-point scale asking how the federal budget deficit should be reduced, where 0 means “all from tax increases” (the liberal option) and 100 means “all from spending cuts” (the conservative option).

There is no missing data in this dataset.

You can load the data in here:

dat <- rio::import("https://github.com/marctrussler/IDS-Data/raw/refs/heads/main/PSCI1800Midterm2026DataLong.Rds", trust=T)

1. How many respondents are in these data? How many rows are in these data? Based on that information, and by looking at the dataset, what is the unit of analysis of these data?

head(dat)
#>   respondent.id year deficit.fix gender state  county
#> 1             1 2010          92   Male    TX   Gregg
#> 2             1 2014          83   Male    TX   Gregg
#> 3            24 2010          48 Female    IL  McLean
#> 4            24 2014          26 Female    IL  McLean
#> 5            48 2010          99   Male    TX Hockley
#> 6            48 2014          99   Male    TX Hockley
#>   county_fips birth.yr        income
#> 1       48183     1934 $100k or more
#> 2       48183     1934 $100k or more
#> 3       17113     1971 $100k or more
#> 4       17113     1971 $100k or more
#> 5       48219     1965 $100k or more
#> 6       48219     1965 $100k or more
nrow(dat)
#> [1] 15458
length(unique(dat$respondent.id))
#> [1] 7729

There are 15458 rows in the data and 7729 unique respondents. This is because each respondent is in the data twice, once for each survey year. As such, the unit of analysis of these data is “respondent-year”.
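If you want to double-check this panel structure directly (not required for full credit), a sketch like the following works. The toy data frame here is made up for illustration; it is not the CES data.

```r
# Hypothetical long-format panel: 3 respondents x 2 years
toy <- data.frame(respondent.id = rep(1:3, each = 2),
                  year = rep(c(2010, 2014), times = 3))

# Every respondent should appear exactly twice (once per year)
all(table(toy$respondent.id) == 2)
#> [1] TRUE

# Rows should equal respondents x years
nrow(toy) == length(unique(toy$respondent.id)) * 2
#> [1] TRUE
```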

2. Un-comment and edit the code below to reshape the data so that the unit of analysis is the individual respondent. Two new variables will be created in the process. (You do not need to edit the names_prefix option, which I’ve added so that we get the same sensible variable names to work with going forward.)

library(tidyr)
#dat <- pivot_wider(dat,
#                   ??????,
#                   ??????,
#                   names_prefix = "deficit.fix.")
library(tidyr)
dat <- pivot_wider(dat,
                   names_from = "year",
                   values_from = "deficit.fix",
                   names_prefix = "deficit.fix.")
head(dat)
#> # A tibble: 6 × 9
#>   respondent.id gender state county  county_fips birth.yr
#>           <dbl> <chr>  <chr> <chr>   <chr>          <dbl>
#> 1             1 Male   TX    Gregg   48183           1934
#> 2            24 Female IL    McLean  17113           1971
#> 3            48 Male   TX    Hockley 48219           1965
#> 4            56 Male   MA    Essex   25009           1947
#> 5            66 Female WI    Dane    55025           1961
#> 6            71 Male   PA    Berks   42011           1948
#> # ℹ 3 more variables: income <chr>, deficit.fix.2010 <dbl>,
#> #   deficit.fix.2014 <dbl>
nrow(dat) == length(unique(dat$respondent.id))
#> [1] TRUE

If you are not able to successfully complete this step, use this code to load in a re-shaped version of the data so you can continue with the exam:

dat <-  rio::import("https://github.com/marctrussler/IDS-Data/raw/refs/heads/main/PSCI1800Midterm2026DataWide.Rds", trust=T)

3. What is the correlation between people’s birth year and deficit.fix.2010? What is the correlation between people’s birth year and deficit.fix.2014? Create a new variable deficit.change which finds the difference between a person’s opinion on this question in 2010 and in 2014. This variable should be calculated such that positive values indicate that someone has moved in a conservative direction (i.e. towards preferring spending reductions). What is the correlation between people’s birth year and this new variable? What do you conclude from these three correlations?

#Correlation of birth year and 2010 deficit opinion
cor(dat$birth.yr, dat$deficit.fix.2010)
#> [1] -0.0452572
#Correlation of birth year and 2014 deficit opinion
cor(dat$birth.yr, dat$deficit.fix.2014)
#> [1] -0.004956423
#Correlation of birth year and change in deficit opinion
dat$deficit.change <- dat$deficit.fix.2014 - dat$deficit.fix.2010
cor(dat$birth.yr, dat$deficit.change)
#> [1] 0.05502341

The correlation between birth year and deficit opinion is weakly negative in both 2010 and 2014, indicating that older Americans (those with earlier birth years) are slightly less likely to prefer tax increases over spending cuts to reduce the deficit. The correlation between birth year and the change in this measure, however, is positive. Taken together, this indicates that between 2010 and 2014 older Americans were more likely to shift towards wanting tax increases as their method of reducing the deficit, while younger Americans were more likely to shift towards wanting spending cuts.
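To see how the *levels* of a variable and its *change* can correlate in opposite directions with the same predictor, here is a made-up four-person example (the numbers are invented for illustration, not taken from the CES):

```r
# Hypothetical respondents: older people start more conservative,
# but opinions converge between the two waves
birth.yr <- c(1940, 1950, 1980, 1990)
op.2010  <- c(80, 75, 55, 50)
op.2014  <- c(70, 68, 65, 62)

cor(birth.yr, op.2010) < 0             # levels: negative
#> [1] TRUE
cor(birth.yr, op.2014) < 0             # levels: still negative
#> [1] TRUE
cor(birth.yr, op.2014 - op.2010) > 0   # change: positive
#> [1] TRUE
```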

4. What proportion of men born before or during 1975 moved in a conservative direction (towards wanting spending cuts) from 2010 to 2014? What proportion of men born after 1975 moved in a conservative direction? What does this tell you? Hint: What does the mean of a boolean variable represent?

mean(dat$deficit.change[dat$birth.yr <= 1975 & dat$gender == "Male"] > 0)
#> [1] 0.3318109
mean(dat$deficit.change[dat$birth.yr > 1975 & dat$gender == "Male"] > 0)
#> [1] 0.4512821

Men born after 1975 were considerably more likely to move in a conservative direction than men born before or during 1975 (45% vs. 33%). That is, younger men (those with a later birth year) were more likely to become more conservative in this period than older men (those with an earlier birth year).
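To unpack the hint: in R, TRUE is coerced to 1 and FALSE to 0, so the mean of a logical vector is the proportion of TRUEs. A tiny made-up example:

```r
# Hypothetical logical vector: did each person move in a conservative direction?
moved.right <- c(TRUE, FALSE, TRUE, TRUE, FALSE)

# The mean is the proportion of TRUEs...
mean(moved.right)
#> [1] 0.6

# ...which is the same as counting TRUEs and dividing by the length
sum(moved.right) / length(moved.right)
#> [1] 0.6
```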

5. Edit the code below to answer this question: For each unique county in the dataset, find the maximum value for deficit.change and the minimum value for deficit.change. Which county has the biggest difference between the individual with the maximum and minimum change? Hint: an easy mistake to make here is to not identify all the unique counties in the data.

#county <- ?????
#max.deficit.change <- rep(NA,???????)
#min.deficit.change <- rep(NA,???????)

### 
#??????
###

#deficit.delta <- max.deficit.change - min.deficit.change
county <- unique(dat$county_fips)
max.deficit.change <- rep(NA,length(county))
min.deficit.change <- rep(NA,length(county))

for(i in 1:length(county)){
  # For each county, find the largest and smallest individual change
  max.deficit.change[i] <- max(dat$deficit.change[dat$county_fips==county[i]])
  min.deficit.change[i] <- min(dat$deficit.change[dat$county_fips==county[i]])
}

deficit.delta <- max.deficit.change - min.deficit.change

which(deficit.delta==max(deficit.delta))
#> [1] 35

county[35]
#> [1] "04013"
unique(dat$county[dat$county_fips==county[35]])
#> [1] "Maricopa"

Maricopa County, Arizona has the biggest difference between the individuals with the maximum and minimum value of deficit.change.
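For reference, the same per-county max/min logic can be written without a loop using tapply(), which applies a function within each group and returns a named vector. This sketch uses a made-up toy data frame rather than the CES data:

```r
# Hypothetical data frame with two counties
toy <- data.frame(county_fips = c("A", "A", "B", "B", "B"),
                  deficit.change = c(10, -5, 30, 0, -20))

# tapply() computes the max and min of deficit.change within each county
max.change <- tapply(toy$deficit.change, toy$county_fips, max)
min.change <- tapply(toy$deficit.change, toy$county_fips, min)
delta <- max.change - min.change

# County with the biggest max-min gap
names(delta)[which.max(delta)]
#> [1] "B"
```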

6. Below I calculate the mean and standard deviation of deficit.fix.2010. What is the standard error of this sample mean? In words, what does that mean?

mean(dat$deficit.fix.2010)
#> [1] 62.64808
sd(dat$deficit.fix.2010)
#> [1] 28.66385

The standard error of the sample mean is \(\frac{sd(x)}{\sqrt{n}}\), which in this case is:

sd(dat$deficit.fix.2010)/sqrt(nrow(dat))
#> [1] 0.3260415

Every time we take a new sample of 7729 people we will survey a slightly different set of individuals, so the mean of deficit.fix.2010 will be slightly different in each sample. The standard error tells us how much these sample means will vary around the true population mean, on average. So each time we take a sample we expect the mean to be off by around 0.33 points. On a scale of 0-100 that’s not that much!
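This idea can be checked by simulation. The sketch below uses a hypothetical uniform(0, 100) population (not the CES): if we repeatedly draw samples of 7729 and take the mean of each, the standard deviation of those sample means should come out close to the analytic standard error \(\frac{sd(x)}{\sqrt{n}}\).

```r
# Hypothetical population of opinions on a 0-100 scale
set.seed(1)
population <- runif(100000, min = 0, max = 100)

# Draw 1000 samples of n = 7729 and record each sample's mean
n <- 7729
sample.means <- replicate(1000, mean(sample(population, n, replace = TRUE)))

sd(sample.means)          # simulated standard error
sd(population) / sqrt(n)  # analytic standard error: the two should be close
```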