Problem Set 3
Problem Set Due Wednesday March 26 at 7pm on Canvas.
For this problem set you will hand in a .Rmd file and a knitted html output. There is a .Rmd template file on the assignment page on Canvas you can use to write your answers.
Comments are not mandatory, but extra points will be given for code that clearly explains what is happening and gives descriptions of the results of the various tests and functions.
Reminder: Put only your student number on the assignments, as we will grade anonymously.
Collaboration on problem sets is permitted. Ultimately, however, the write-up and code that you turn in must be your own creation. You can discuss coding strategies or code debugging with your classmates, but you should not share code in any way. Please write the names of any students you worked with at the top of each problem set.
18.5 Question 1
In this question we will use loops to investigate trends in historical (1859-2019) temperature data for every county in the United States. Load this data in using the code below:
county.dta <- rio::import("https://github.com/marctrussler/IDS-Data/raw/main/USTemp.Rds", trust=T)
To simplify things, subset the data to only include years greater than or equal to 1980. What is the unit of analysis of this dataset?
For each county in the data set, calculate the mean temperature for all years (that is, what is the mean of all the rows in the dataset for the first county, the second county, etc.?) and visualize the result. Describe what you see in your figure.
Note: This loop might take a few seconds or a minute to run, depending on your computer. Look for the red stop sign in the upper right-hand corner of your console for confirmation that the loop is in progress.
For each state in the data set, record the lowest and highest average annual temperature (that is, what is the max and min value for all the rows in the dataset for the first state, for the second state, etc.?). Which state had the most consistent average temperature (i.e., the smallest range between the highest and lowest values)? Which state had the least consistent? Create a visualization of this information with the states sorted in order of most consistent to least consistent.
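The group-by-group loop that these questions call for can be sketched on a small made-up data frame. This is only an illustration of the pattern: the data frame `toy` and its columns `state` and `temp` are hypothetical stand-ins, not the variable names in the actual data set.

```r
# Toy data: hypothetical columns, not the real USTemp variables
toy <- data.frame(
  state = c("A", "A", "A", "B", "B", "B"),
  temp  = c(50, 54, 52, 70, 71, 69)
)

# One row of results per group, filled in by the loop
groups <- unique(toy$state)
res <- data.frame(state = groups, low = NA_real_, high = NA_real_)

for (i in seq_along(groups)) {
  # Temperatures belonging to the i-th group only
  these <- toy$temp[toy$state == groups[i]]
  res$low[i]  <- min(these, na.rm = TRUE)
  res$high[i] <- max(these, na.rm = TRUE)
}

# Range between highest and lowest value, sorted smallest first
res$range <- res$high - res$low
res <- res[order(res$range), ]
print(res)
```

Creating the results object before the loop and filling one row per iteration keeps each result lined up with its group, which makes the later sorting and plotting steps straightforward.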
The US has 4 regions (Northeast, Midwest, South, and West), and within each region there are several sub-regions. First, use the paste() command to create a variable that indicates each region and sub-region by combining the region variable with the sub-region variable (i.e., you should have a value such as “South-ESC”). You should have 9 unique values for this new variable.
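As a minimal sketch of how paste() joins two variables element by element, consider the toy vectors below. The vectors `region` and `subregion` and all sub-region codes other than "ESC" are made up for illustration; the real data set may use different names and codes.

```r
# Hypothetical toy vectors standing in for the two variables
region    <- c("South", "South", "West", "Northeast")
subregion <- c("ESC", "WSC", "PAC", "NE")

# sep = "-" joins each pair with a hyphen instead of the default space
region.sub <- paste(region, subregion, sep = "-")
print(region.sub)

# Number of distinct combined values
print(length(unique(region.sub)))
```

Checking the number of unique values afterwards is a quick way to confirm the new variable has the expected number of categories.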
Investigate the degree to which the relationship between average temperature and population in counties has changed over time for each subregion. Using a double loop, determine the correlation between county temperature and county population within each subregion in each year. The result should be a matrix with 9 columns (one for each subregion) and 40 rows (one for each year). Note: Not every county has a population value for every year, so make sure to exclude NAs from your correlation calculation.
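The double-loop structure described here can be sketched on simulated data. Everything in this example is an assumption made for illustration: the data frame `toy`, its columns `year`, `subreg`, `temp`, and `pop`, and the random values are hypothetical and much smaller than the real data (3 years and 2 subregions rather than 40 and 9).

```r
# Toy data with hypothetical column names and simulated values
set.seed(1)
toy <- data.frame(
  year   = rep(2000:2002, each = 10),
  subreg = rep(c("South-ESC", "West-PAC"), times = 15),
  temp   = rnorm(30, mean = 55, sd = 5),
  pop    = rnorm(30, mean = 1000, sd = 200)
)
toy$pop[c(3, 17)] <- NA  # a few missing population values

years   <- sort(unique(toy$year))
subregs <- sort(unique(toy$subreg))

# Empty matrix: one row per year, one column per subregion
cors <- matrix(NA_real_, nrow = length(years), ncol = length(subregs),
               dimnames = list(as.character(years), subregs))

for (i in seq_along(years)) {
  for (j in seq_along(subregs)) {
    # Rows for this year and this subregion only
    keep <- toy$year == years[i] & toy$subreg == subregs[j]
    # use = "complete.obs" drops rows where either value is NA
    cors[i, j] <- cor(toy$temp[keep], toy$pop[keep], use = "complete.obs")
  }
}
print(round(cors, 2))
```

The outer loop picks the row of the matrix and the inner loop picks the column, so each cell is filled exactly once. Without the `use` argument, cor() returns NA for any cell whose rows include a missing value.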
Bonus: visualize the results stored in the matrix from part d.
18.6 Question 2
For this question we are going to work with data from Philadelphia. In spring of 2021 the progressive District Attorney of Philadelphia, Larry Krasner, was re-nominated by the Democratic party. This election was controversial as some in Philly blamed the rising crime rate on Krasner’s policies.
We are going to use data to investigate this election.
The following code loads in two data files.
elect contains the election results at the precinct level (where people go to vote). Philadelphia precincts are identified by their ward and division. There are 66 Wards in the city, and each of those Wards is split into multiple divisions. Each “ward-division” is an election precinct: “1-1” is the 1st ward, 1st division; “1-2” is the 1st ward 2nd division; “60-4” is the 60th ward 4th division etc.
crime contains data on every crime recorded in Philadelphia in 2019 and 2020. event in this dataset gives the category of the crime committed. Included in these data is the ward-division in which the crime took place.
crime <- rio::import("https://github.com/marctrussler/IDS-Data/raw/main/PhillyCrimeData.Rds", trust=T)
elect <- rio::import("https://github.com/marctrussler/IDS-Data/raw/main/PhillyDAElection.Rds", trust=T)
Using the crime dataset, determine the total number of crimes in each ward-division, as well as the number of homicides in each ward-division. The result of this question should be a dataset where the unit of analysis is ward-division with a variable for the number of overall crimes and a variable for the number of homicides.
Merge this newly created dataset with the election results at the ward-division level. Did Krasner do better or worse in divisions with more crime? What about divisions that specifically have more homicides? There are lots of ways to answer this question; it’s up to you to apply what we have learned so far to choose what you feel is an appropriate analysis and visualization.
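One possible shape for the count-then-merge workflow is sketched below on made-up data. The data frames `toy.crime` and `toy.elect`, the columns `div`, `event`, and `vote.share`, and the category label "Homicide" are all hypothetical; check the actual variable names and category labels in the real files before adapting anything like this.

```r
# Toy data with hypothetical column names and values
toy.crime <- data.frame(
  div   = c("1-1", "1-1", "1-2", "1-2", "1-2", "60-4"),
  event = c("Theft", "Homicide", "Theft", "Theft", "Homicide", "Theft")
)
toy.elect <- data.frame(
  div        = c("1-1", "1-2", "60-4"),
  vote.share = c(0.70, 0.55, 0.62)
)

# One row per ward-division, filled in by the loop
divs <- unique(toy.crime$div)
counts <- data.frame(div = divs, crimes = NA_real_, homicides = NA_real_)

for (i in seq_along(divs)) {
  in.div <- toy.crime$div == divs[i]
  counts$crimes[i]    <- sum(in.div)  # all crimes in this division
  counts$homicides[i] <- sum(in.div & toy.crime$event == "Homicide")
}

# Keep every election row; divisions absent from counts would get NA
merged <- merge(toy.elect, counts, by = "div", all.x = TRUE)
print(merged)
```

Because `all.x = TRUE` keeps every row of the election data, a division with no recorded crimes would show NA rather than 0 after the merge, which is worth checking before running any analysis.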