Problem Set 2

For this problem set you will hand in a .Rmd file and its knitted HTML output. There is a .Rmd template file on the assignment page on Canvas that you can use to write your answers.

Comments are not mandatory, but extra points will be given for code that clearly explains what is happening and gives descriptions of the results of the various tests and functions.

Reminder: Put only your student number on the assignments, as we will grade anonymously.

Collaboration on problem sets is permitted. Ultimately, however, the write-up and code that you turn in must be your own creation. You can discuss coding strategies or code debugging with your classmates, but you should not share code in any way. Please write the names of any students you worked with at the top of each problem set.


Question 1

Run the code below to load data on the 46 zip-codes in Philadelphia:

philly <- rio::import("https://github.com/marctrussler/IDS-Data/raw/main/PhillyZCTA.Rds")
#> Warning: Missing `trust` will be set to FALSE by default
#> for RDS in 2.0.0.

A small technical note. “Zip Codes” are not geographic areas, but rather lists of addresses maintained by the USPS. As such, it’s not really right to refer to, say, the 19125 zip code as an area. To address this, the Census Bureau defines “Zip Code Tabulation Areas” (ZCTAs), which are geographic areas that roughly correspond to the list of addresses that define a zip code. I will refer to “Zip Codes” as geographic areas below, and you can too. Just know that it’s technically wrong!

For the curious, if you just type a zip-code (say 19125) into Google Maps it will outline where that zip code is located (in that case, Fishtown). You may want to use that to give some context to your answers below.

(a) We are going to consider the variable percent.rentals, which is the percent of occupied (non-vacant) housing units that are occupied by renters. Which zip code has the highest percent of housing occupied by renters and which zip code has the lowest percent of housing occupied by renters? For this question, and any questions below that ask you to identify information, you must write code to output your specific answers (i.e., you can’t just use View() to sort the data and report the right zip).
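As a rough illustration of what "write code to output your answer" can look like, here is a minimal sketch on made-up data. The data frame `toy` and its columns `zip` and `pct` are hypothetical and are not part of the assignment data.

```r
# Hypothetical toy data; `toy`, `zip`, and `pct` are illustrative names only.
toy <- data.frame(
  zip = c("00001", "00002", "00003", "00004"),
  pct = c(12.5, 48.0, 33.1, 7.9)
)

# which.max() and which.min() return the position of the largest and
# smallest value, which can then index another column.
highest <- toy$zip[which.max(toy$pct)]
lowest  <- toy$zip[which.min(toy$pct)]

print(highest)
print(lowest)
```

Note that which.max() and which.min() return only the first position when there are ties. If ties are possible, something like which(toy$pct == max(toy$pct)) returns all matching positions.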

(b) What is the median of percent.rentals? Try to find which zip code takes on this value (you will run into a problem, however). Diagnose what’s going on here, and try to determine which zip code(s) are in the middle of the distribution of percent.rentals.

(c) Generate a new variable high.rental which indicates whether half of housing units are rentals in a zip code, or not. How many of the zip codes are high rental zip codes?

(d) Create a box plot that shows the distribution of percent.poverty (the percent of people over 18 in the zip code with an income below the poverty line) in the high and low rental zip codes. What do you see? If, instead, you make a scatter plot with percent.poverty on the y-axis and percent.rentals on the x-axis, do you reach the same conclusion?

Question 2

We are going to work with 2020 election results from the 67 counties in Pennsylvania. The dataset you are loading in is not ready for analysis. We will clean and reshape it so that we can use it to get some insights into the election. You only need written answers for (a) and (f). Only code is required for the other sub-parts.

pres <- rio::import("https://github.com/marctrussler/IDS-Data/raw/refs/heads/main/PS2Question2.Rds")
#> Warning: Missing `trust` will be set to FALSE by default
#> for RDS in 2.0.0.

(a) How many rows are there in the dataset? Does this match the number of unique counties? What is the unit of analysis of these data?

(b) Reshape the data such that the unit of analysis is county.

(c) The variable county.name has the state abbreviation attached to the start of it. Use the separate() command to split these apart, and create a new variable called state.
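For reference, here is a minimal sketch of how separate() from the tidyr package behaves on made-up data. The data frame `toy`, the column `label`, and the "-" separator are all assumptions for illustration; check what the separator actually looks like in county.name before copying the pattern.

```r
library(tidyr)

# Hypothetical toy data; the "-" separator is an assumption for illustration.
toy <- data.frame(label = c("NY-Albany", "NY-Bronx"))

# Split `label` into two new columns at the first "-".
# extra = "merge" keeps any later "-" inside the second column.
out <- separate(toy, col = label, into = c("state", "county"),
                sep = "-", extra = "merge")

print(out)
```

By default separate() drops the original column; passing remove = FALSE keeps it alongside the new ones.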

(d) The variable fips.code is the 5-digit code the Census assigns to each county. If you look, you will see that this column is currently of the “character” class. Figure out why that is before converting it to numeric using as.numeric(). Be careful! If you convert it without first finding the error I have put in, you will delete data. For full points you must use code to find and correct the error (i.e., don’t just use View() to see if you can find the problem). Hint: use nchar() to identify the problematic entry.
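To illustrate the nchar() hint, here is a small sketch on a made-up vector of codes. The vector `codes` and the stray space in it are hypothetical; the error planted in the real data may be of a different kind.

```r
# Hypothetical 5-character codes; the third entry has a stray space.
codes <- c("10001", "10003", "1000 5", "10007")

# nchar() counts characters, so entries of the wrong length stand out.
bad <- which(nchar(codes) != 5)
print(codes[bad])

# Converting without fixing silently turns the bad entry into NA.
converted_raw <- suppressWarnings(as.numeric(codes))
print(converted_raw)

# Fix the entry in code (here: drop the space), then convert.
codes[bad] <- gsub(" ", "", codes[bad])
fixed <- as.numeric(codes)
print(fixed)
```

The general pattern is to locate the odd entry, inspect it, repair it with code, and only then convert the column.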

(e) The names of the demographic variables are uninformative right now. The variable descriptions are below. Rename the variables in the dataset using informative, consistent, and properly formatted variable names.

current.names   real.names
V1              Population Density
V2              % white
V3              % black
V4              % less than high school
V5              % with college degree
V6              Unemployment Rate

(f) Calculate the percent of the vote won by Biden, Trump, and Other candidates in each county. Use a scatterplot to display the relationship between the percent of the county with a college degree and the percent Biden won in each county. Instead of points, display each county name in your graph. To make things more readable, edit the county.name column to remove the redundant “ County” from each entry. (Do try to make your graph readable, though know that the labels are going to overlap, which is fine.) Explain what you see in the graph.
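As one possible approach to text labels in base R, here is a minimal sketch on made-up data. The data frame `toy` and its columns `place`, `x`, and `y` are hypothetical, and other plotting tools would work just as well.

```r
# Hypothetical toy data; names and values are illustrative only.
toy <- data.frame(
  place = c("Alpha County", "Beta County", "Gamma County"),
  x = c(10, 25, 40),
  y = c(35, 50, 62)
)

# Remove a trailing " County" from each entry ($ anchors the end of the string).
toy$place <- sub(" County$", "", toy$place)

# type = "n" sets up the axes without drawing points;
# text() then writes each label at its (x, y) position.
plot(toy$x, toy$y, type = "n",
     xlab = "x (hypothetical)", ylab = "y (hypothetical)")
text(toy$x, toy$y, labels = toy$place, cex = 0.8)
```

Shrinking the labels with cex is one simple way to reduce overlap when there are many of them.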

(g) I want to focus on the high-education counties (those with greater than 40% with a college degree) for a report. Specifically: I am going to display the following table which shows the percent won by each candidate (and other) in the 5 high-education counties. Using sub-setting and our re-shaping tools, recreate this table:

candidate   Centre   Chester   Allegheny   Montgomery   Bucks
Biden       51.69    57.99     59.61       62.63        51.66
Trump       46.94    40.88     39.23       36.35        47.29
Other       1.38     1.13      1.16        1.02         1.05
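A table like this, with what were rows turned into columns, can be produced by reshaping long and then wide again. Here is a minimal sketch on made-up data, assuming the tidyr functions pivot_longer() and pivot_wider() are among the re-shaping tools you are using. The data frame `toy` and its columns are hypothetical.

```r
library(tidyr)

# Hypothetical toy data: one row per region, one column per option.
toy <- data.frame(
  region = c("North", "South"),
  a_pct  = c(60, 45),
  b_pct  = c(40, 55)
)

# Step 1: go long, so each row is one region-option pair.
long <- pivot_longer(toy, cols = c(a_pct, b_pct),
                     names_to = "option", values_to = "pct")

# Step 2: go wide again, this time with regions as the columns.
wide <- pivot_wider(long, names_from = region, values_from = pct)

print(wide)
```

The result has one row per option and one column per region, which is the same shape as the table above.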