Problem Set 4 Answers

For this problem set you will hand in a .Rmd file and a knitted html output. There is a .Rmd template file on the assignment page on Canvas you can use to write your answers.

Please make life easy on us by using comments to clearly delineate questions and sub-questions.

Comments are not mandatory, but extra points will be given for code that clearly explains what is happening and gives descriptions of the results of the various tests and functions.

Reminder: Put only your student number on the assignments, as we will grade anonymously.

Collaboration on problem sets is permitted. Ultimately, however, the write-up and code that you turn in must be your own creation. You can discuss coding strategies or code debugging with your classmates, but you should not share code in any way. Please write the names of any students you worked with at the top of each problem set.

Here are two questions from previous problem sets. For PS4, redo your work using only tidyverse code.

Question 1

Run the code below to load data on the 46 zip-codes in Philadelphia:

library(tidyverse)
philly <- rio::import("https://github.com/marctrussler/IDS-Data/raw/main/PhillyZCTA.Rds")

(a) We are going to consider the variable percent.rentals, which is the percent of occupied (non-vacant) housing units that are occupied by renters. Which zip code has the highest percent of housing occupied by renters and which zip code has the lowest percent of housing occupied by renters? For this question, and any questions below that ask you to identify information, you must write code to output your specific answers (i.e., you can’t just use View() to sort the data and report the right zip).

One tidyverse hint: if you have a dataset arranged by a particular variable, you can print out the first few rows by using slice(1:3).

# Method One: flag the rows holding the maximum and minimum, then keep only those rows
philly |> 
  mutate(highest.rental = percent.rentals == max(percent.rentals, na.rm = TRUE),
         lowest.rental = percent.rentals == min(percent.rentals, na.rm = TRUE)) |> 
  select(zip, percent.rentals, highest.rental, lowest.rental) |> 
  filter(highest.rental | lowest.rental)
#>     zip percent.rentals highest.rental lowest.rental
#> 1 19107           79.48           TRUE         FALSE
#> 2 19137           20.61          FALSE          TRUE

# Method Two: sort the data and keep the first row
philly |> 
  arrange(percent.rentals) |> 
  select(zip, percent.rentals) |> 
  slice(1)
#>     zip percent.rentals
#> 1 19137           20.61

philly |> 
  arrange(desc(percent.rentals)) |> 
  select(zip, percent.rentals) |> 
  slice(1)
#>     zip percent.rentals
#> 1 19107           79.48

The zip code with the highest share of renters in Philly is 19107, which is the Market East area. The zip code with the lowest share of renters is 19137, which is Bridesburg in the far Northeast of the city.
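As a possible third method, dplyr's slice_max() and slice_min() pull out the extreme rows without sorting first. Here is a minimal sketch on a made-up three-row data frame; the zip codes and values in toy are hypothetical and are not taken from the Philadelphia data.

```r
library(dplyr)

# Hypothetical toy data standing in for philly
toy <- tibble(
  zip = c("00001", "00002", "00003"),
  percent.rentals = c(35.5, 80.1, 12.4)
)

# Row with the largest value of percent.rentals
highest <- toy |> 
  slice_max(percent.rentals, n = 1)

# Row with the smallest value of percent.rentals
lowest <- toy |> 
  slice_min(percent.rentals, n = 1)

highest
lowest
```

By default both functions keep ties, so more than one row can come back if several zip codes share the extreme value.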

(b) What is the median of percent.rentals? Try to find which zip code takes on this value (you will run into a problem, however). Diagnose what’s going on here, and try to determine which zip code(s) are in the middle of the distribution of percent.rentals.

philly |> 
  summarise(median = median(percent.rentals))
#>   median
#> 1  46.32

The median percent of households that are rentals is 46.32%. But when we try to find out what zip code that represents we get nothing:

philly |> 
  mutate(median.case = percent.rentals==median(percent.rentals)) |> 
  filter(median.case) 
#>  [1] zip                         population.density         
#>  [3] perc.under.18               perc.over.65               
#>  [5] perc.non.hispanic.white     perc.non.hispanic.black    
#>  [7] perc.non.hispanic.asian     perc.hispanic              
#>  [9] perc.high.school.or.greater perc.college.or.greater    
#> [11] percent.schoolage.enrolled  average.unemployment       
#> [13] median.income               gini                       
#> [15] total.housing.units         percent.rentals            
#> [17] percent.vacant              median.rent                
#> [19] percent.poverty             percent.childhood.poverty  
#> [21] percent.transit.commute     percent.car.commute        
#> [23] percent.walk.commute        percent.bike.commute       
#> [25] percent.health.insurance    percent.single.parent      
#> [27] median.case                
#> <0 rows> (or 0-length row.names)

That is because we have an even number of zip codes (46). With an even number of entries there is no single “middle” value, so median() returns the mean of the two middle values, which need not match any observed value. I’m going to find the absolute value of the differences from the median value for each case, and then consider the two zip codes that have the smallest distance from the median. This is not the only way to do this!


philly |> 
  mutate(diff.median = abs(percent.rentals - median(percent.rentals))) |> 
  arrange(diff.median) |> 
  select(zip, percent.rentals, diff.median) |> 
  slice(1:2)
#>     zip percent.rentals diff.median
#> 1 19124           45.94        0.38
#> 2 19146           46.70        0.38

The zip codes 19124 (Juniata) and 19146 (Grad Hospital + Point Breeze) are the two “middle” zip codes for percent rentals. Both values are 0.38 away from the median, one below it and one above it. This confirms that the median for an even number of cases is the simple mean of the two middle cases.
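The same behavior can be checked on a tiny example. This is a sketch with made-up numbers (the vector x is hypothetical), written in base R only to keep it short:

```r
# Hypothetical values, not the Philadelphia data
x <- c(10, 40, 20, 30)   # an even number of observations

sorted <- sort(x)
n <- length(sorted)

# The two middle values of the sorted vector (20 and 30 here)
middle.two <- sorted[c(n / 2, n / 2 + 1)]

med <- median(x)                  # what median() reports
mean.middle <- mean(middle.two)   # mean of the two middle values
any.match <- any(x == med)        # does any observation equal the median?

med
mean.middle
any.match
```

Here median() and the mean of the two middle values agree, and no observation equals the median, which is the same situation we ran into with percent.rentals.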

(c) Generate a new variable high.rental which indicates whether at least half of housing units are rentals in a zip code, or not. How many of the zip codes are high rental zip codes?


philly |> 
  mutate(high.rental = percent.rentals>=50) |> 
  group_by(high.rental) |> 
  summarize(counts = n())
#> # A tibble: 2 × 2
#>   high.rental counts
#>   <lgl>        <int>
#> 1 FALSE           28
#> 2 TRUE            18

18 of the 46 zip codes are high-rental zip codes.

(Bonus) Using ggplot, create a box plot that shows the distribution of percent.poverty (the percent of people over 18 in the zip code with an income below the poverty line) in the high and low rental zip codes. What do you see? If, instead, you make a scatter plot (using ggplot) with percent.poverty on the y-axis and percent.rentals on the x-axis, do you reach the same conclusion?

[Figure: box plot of percent.poverty in high- and low-rental zip codes]

The median “high-rental” zip code has a higher percentage of people living in poverty than the median “low-rental” zip code. As such, it seems like there is a positive relationship between these two things: neighborhoods with more rental properties are more likely to have people living in poverty. Having said that, there is much more variation among “high-rental” zip codes: some zip codes with a lot of renters have a lot of people living in poverty, but many zip codes with a lot of renters have few people living in poverty.

[Figure: scatter plot with percent.rentals on the x-axis and percent.poverty on the y-axis]

The scatterplot shows the same relationship, but better accentuates the high degree of heterogeneity in neighborhoods with a lot of rental housing. Zip codes with a low share of rentals (on the left part of the graph) pretty much all have low poverty rates. But the zip codes with a high share of rentals (on the right part of the graph) are a mixed bag. As such, I wouldn’t be confident in saying that there is a strong relationship between these two things.
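One way the two plots described above could be built is sketched here. The data frame toy and its values are hypothetical stand-ins for philly, and the axis labels are my own choices, so treat this as an illustration rather than the exact code behind the figures.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy values standing in for the philly data
toy <- tibble(
  percent.rentals = c(22, 35, 41, 48, 55, 63, 70, 78),
  percent.poverty = c(8, 10, 12, 9, 30, 11, 35, 14)
) |> 
  mutate(high.rental = percent.rentals >= 50)

# Box plot: distribution of poverty within low- and high-rental zip codes
box.plot <- ggplot(toy, aes(x = high.rental, y = percent.poverty)) +
  geom_boxplot() +
  labs(x = "High-rental zip code", y = "Percent in poverty")

# Scatter plot: poverty against the continuous rental share
scatter.plot <- ggplot(toy, aes(x = percent.rentals, y = percent.poverty)) +
  geom_point() +
  labs(x = "Percent rentals", y = "Percent in poverty")

box.plot
scatter.plot
```

With the real data you would replace toy with philly; the rest of the pipeline should carry over unchanged.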

Question 2

In this question we will use tidyverse to investigate trends in historical temperature data for every county in the United States. We will be using the data set “US County Temp History 1895-2019.Rdata”, which will load into your environment as county.dta.

county.dta <- rio::import("https://github.com/marctrussler/IDS-Data/raw/main/USTemp.Rds", trust=T)

To simplify things, subset the data to only include years greater than or equal to 1980.

county.dta <- county.dta |> 
  filter(year>=1980)

For each county in the data set, calculate the mean temperature for all years (that is, what is the mean of all the rows in the dataset for the first county, the second county, etc.?). (No need to visualize and analyze this, just create a new dataset that has county and mean temperature as columns)

county.averages <- county.dta |> 
  group_by(fips) |> 
  summarize(mean.temperature = mean(temp))

For each state in the data set, record the lowest and highest average annual temperature (that is, what is the max and min value for all the rows in the dataset for the first state, for the second state, etc.?). Which states had the most consistent average temperature over this period (i.e., have a small range between their highest and lowest values)? Which states were the least consistent?

state.dta <- county.dta |> 
  group_by(state.name) |> 
  summarize(max.temp = max(temp),
            min.temp = min(temp)) |> 
  mutate(temp.range = max.temp-min.temp)

state.dta |> 
  arrange(temp.range) |> 
  slice(1:5)
#> # A tibble: 5 × 4
#>   state.name           max.temp min.temp temp.range
#>   <chr>                   <dbl>    <dbl>      <dbl>
#> 1 District of Columbia     58.9     54.6       4.32
#> 2 Rhode Island             53.8     47.8       5.93
#> 3 Delaware                 59.2     52.9       6.27
#> 4 Connecticut              53.8     45.5       8.27
#> 5 Kentucky                 61.3     52.0       9.35


state.dta |> 
  arrange(desc(temp.range)) |> 
  slice(1:5)
#> # A tibble: 5 × 4
#>   state.name max.temp min.temp temp.range
#>   <chr>         <dbl>    <dbl>      <dbl>
#> 1 California     76.5     41.3       35.2
#> 2 Arizona        75.0     49.0       25.9
#> 3 Nevada         66.6     41.8       24.8
#> 4 Colorado       56.5     32.6       23.9
#> 5 Texas          77.1     53.2       23.8

The state with the smallest range in temperatures is the District of Columbia at about 4.3 degrees (makes sense!), and the state with the largest range in temperatures is California at 35.2 degrees.
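Keep in mind that max() and min() here run over every county-year row in a state, so the range mixes differences across years with differences across counties. A small sketch with made-up numbers shows the effect; the data frame toy, its state labels, fips codes, and temperatures are all hypothetical.

```r
library(dplyr)

# Hypothetical county-year rows: state A has one county, state B has two
toy <- tibble(
  state.name = c("A", "A", "B", "B", "B", "B"),
  fips       = c(1, 1, 2, 2, 3, 3),
  year       = c(1980, 1981, 1980, 1981, 1980, 1981),
  temp       = c(55, 57, 45, 46, 70, 72)
)

# Same steps as the state.dta code above, applied to the toy rows
toy.range <- toy |> 
  group_by(state.name) |> 
  summarize(max.temp = max(temp),
            min.temp = min(temp)) |> 
  mutate(temp.range = max.temp - min.temp) |> 
  arrange(temp.range)

toy.range
```

State A can only vary from year to year, while state B's range is driven mostly by the gap between its cool county and its warm county. That is one plausible reason a single-county unit lands at the top of the "most consistent" list.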

The US has 4 regions (Northeast, Midwest, South, and West), and within each region, there are several sub-regions. First, create a variable that indicates each region and sub-region by combining the region variable with the sub-region variable (i.e., you should have a value that is “Northeast.MA”, etc.). You should have 9 unique values for this new variable.

Investigate the degree to which the relationship between average temperature and population in counties has changed over time for each subregion. Using the group_by() function, determine the correlation between county temperature and county population within each subregion in each year. The result should be a dataset with a column for each subregion and a row for each year. Note: Not every county has a population value for every year, so make sure to exclude NAs from your correlation calculation.

county.dta |> 
  mutate(subregion = paste(region, subregion, sep="-")) |> 
  group_by(year, subregion) |> 
  summarise(correlation = cor(temp, population, use="pairwise.complete")) |> 
  pivot_wider(names_from = subregion, 
              values_from = correlation)
#> `summarise()` has grouped output by 'year'. You can
#> override using the `.groups` argument.
#> # A tibble: 40 × 10
#> # Groups:   year [40]
#>     year `Midwest-ENC` `Midwest-WNC` `Northeast-MA`
#>    <int>         <dbl>         <dbl>          <dbl>
#>  1  1980       0.00687        0.0475          0.513
#>  2  1981       0.00622        0.0397          0.536
#>  3  1982       0.0203         0.0648          0.513
#>  4  1983       0.0235         0.0692          0.534
#>  5  1984       0.0140         0.0655          0.542
#>  6  1985       0.0239         0.0641          0.507
#>  7  1986       0.0137         0.0505          0.504
#>  8  1987       0.0138         0.0837          0.507
#>  9  1988       0.0247         0.0588          0.539
#> 10  1989       0.0197         0.0526          0.525
#> # ℹ 30 more rows
#> # ℹ 6 more variables: `Northeast-NE` <dbl>,
#> #   `South-ESC` <dbl>, `South-SA` <dbl>, `South-WSC` <dbl>,
#> #   `West-M` <dbl>, `West-P` <dbl>
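The grouping message and the "Groups: year" line in the output above come from summarise() leaving the result grouped by year. A variant that drops the grouping and spells out the full name of the use option is sketched here on made-up data; the data frame toy, its values, and the column name region.subregion are hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical county-year rows with one missing population value
toy <- tibble(
  year       = rep(c(1980, 1981), each = 6),
  region     = rep(c("South", "South", "South", "West", "West", "West"), times = 2),
  subregion  = rep(c("SA", "SA", "SA", "P", "P", "P"), times = 2),
  temp       = c(60, 62, 64, 50, 55, 60, 61, 63, 65, 51, 56, 61),
  population = c(100, 200, 300, 900, 500, 100, 110, NA, 330, 800, 500, 200)
)

toy.cor <- toy |> 
  mutate(region.subregion = paste(region, subregion, sep = "-")) |> 
  group_by(year, region.subregion) |> 
  # use = "pairwise.complete.obs" drops rows where either variable is NA;
  # .groups = "drop" returns an ungrouped result and silences the message
  summarise(correlation = cor(temp, population, use = "pairwise.complete.obs"),
            .groups = "drop") |> 
  pivot_wider(names_from = region.subregion,
              values_from = correlation)

toy.cor
```

The toy values are perfectly linear within each group, so the correlations come out at 1 and -1. The point is only the shape of the result: one row per year and one column per region-subregion combination.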
