Syllabus

Current as of 2025-01-28


Lecture

MW 12:00-1:00 (Location TBD)

Recitations

R: 10:15-11:15 | R: 12:00-1:00 | F: 10:15-11:15 | F: 12:00-1:00


Dr. Marc Trussler

TA: Dylan Radley

  • dradley@sas.upenn.edu

  • Fox-Fels Hall 35 (3814 Walnut Street)

  • Office Hours:

    • M 10-11AM
    • W 10-11AM
    • W 2:30-3:30

Course Description

Understanding and interpreting large, quantitative data sets is increasingly central in social science and the business world. Whether one seeks to understand political communication, international trade, inter-group conflict, or a host of other issues, the availability of large quantities of digital data has revolutionized how questions are asked and answered. The ability to quickly and accurately find, collect, manage, and analyze data is now a fundamental skill for quantitative researchers. The answers to a range of important questions lie in publicly available data sets, whether they are election returns, survey results, journalists’ dispatches, or a range of other data types.

Becoming an effective data scientist requires two related, but distinct, skill sets: technical proficiency and theoretical knowledge of statistics. Most courses try to teach both at once. This course, instead, will focus primarily on the first: building your skills in data acquisition, management, and visualization. Leaving this course, students will be able to acquire, format, analyze, and visualize various types of data using the statistical programming language R.

A secondary learning goal of this class is to be able to write and talk about statistics in a concise and clear fashion. Being able to run the most complicated statistics in the world is unhelpful if you cannot explain (particularly to non-specialists) what you have found and why they should care. Too many high school and college classes emphasize long essays, when the primary skill you will need is to write short reports (or, let’s be honest, emails) to quickly communicate an idea or finding. In this class we will emphasize this type of writing.

While this course is not a statistics class, we will discuss (in non-technical terms) the fundamental nature of statistics, particularly the important concepts of uncertainty and causality. The expectation is that you take further courses to build on this knowledge. PSCI 3800 “Applied Data Science” & PSCI 1801 “Statistical Methods” are designed to be direct follow-ups to this course.

While no background in statistics, political science, or computer science is required, students are expected to be generally familiar with contemporary computing environments (e.g. know how to use a computer, download new software, and find the path to saved files) and have a willingness to learn a wide variety of data science tools. Instructions will follow on software to be installed prior to the first class.

Expectations and policies

Course Slack Channel

We will use Slack to communicate with the class. You will receive an invitation to join our channel shortly after the start of class. One of the better things to come about through the pandemic is the use of Slack for classroom communications. It is a really good tool to allow us to send quick and informal messages to individual students or groups (or for you to message us). Similarly, it allows you to collaborate with other students in the class, and is a great place to get simple questions answered.

Because we will be making announcements via Slack, it is extremely important you get this set up.

Format/Attendance

The course will have two components: weekly lectures and a recitation.

The lectures will be in person. They will be more instructional/lecture based in format, though there is an expectation of some amount of participation and feedback. The lectures will not be recorded, though this textbook contains my notes, and the accompanying R code will be provided.

The recitations will also be in person. Attendance will not be taken, though you are highly encouraged to participate. The purpose of the recitations is to provide a smaller class format for you to ask questions, practice techniques, and to debug code with the TA. The answers to problem sets will also be covered in these sessions.

Academic integrity

We expect all students to abide by the rules of the University and to follow the Code of Academic Integrity.

For Problem Sets: Collaboration on problem sets is permitted. Ultimately, however, the write-up and code that you turn in must be your own creation.

For Exams: Exams will be taken individually and in person, without collaboration. The use of ChatGPT or other AI software on exams is prohibited.

Re-grading of assignments

All student work will be assessed using fair criteria that are uniform across the class. If, however, you are unsatisfied with the grade you received on a particular assignment (beyond simple clerical errors), you can request a re-grade using the following protocol. First, you may not send any grade complaints or requests for re-grades until at least 24 hours after the graded assignment was returned to you. After that, you must document your specific grievances in writing by submitting a PDF or Word Document to the teaching staff. In this document you should explain exactly which parts of the assignment you believe were mis-graded, and provide documentation for why your answers were correct. We will then re-score the entire assignment (including portions for which you did not have grievances), and the new score will be the one you receive on the assignment (even if it is lower than your original score).

Late policy

Notwithstanding everything below: exceptions to all of these policies will be made for health reasons, extraordinary family circumstances, and religious holidays. The teaching staff are extremely reasonable and lenient, as long as you discuss potential issues with us before the deadline.

For problem sets: You are granted 5 “grace days” to use over the course of the semester when you need to turn problem sets in late. You can use at most 3 grace days on any given assignment. You do not have to ask to use these days. Grace days are counted in whole days, so if a problem set is turned in at 5:01pm the day it is due (i.e. 1 minute late) you will have used 1 grace day. If you turn the problem set in at 5:01pm the day after it is due (i.e. 24 hours and 1 minute late) you will have used 2 grace days, and so on. Choosing to not complete a problem set (see policy below) does not affect your grace days. Once you are out of grace days, subsequent late problem sets will be graded as incomplete.

Grace days cannot be applied to the final paper or the in-person exams.
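To make the whole-day counting concrete, here is a minimal sketch of the grace-day arithmetic described above. It is an illustration only, not part of the course materials: it is written in Python, the function name `grace_days_used` is hypothetical, and the deadline is passed in as a parameter (the 5:00pm value below simply mirrors the example in the policy).

```python
import math
from datetime import datetime, timedelta

MAX_PER_ASSIGNMENT = 3  # cap on grace days per assignment, from the policy above


def grace_days_used(deadline: datetime, submitted: datetime) -> int:
    """Whole grace days consumed by one submission (0 if on time)."""
    late = submitted - deadline
    if late <= timedelta(0):
        return 0
    # Any fraction of a day counts as a full grace day.
    return math.ceil(late / timedelta(days=1))


# Hypothetical deadline, used only for illustration.
deadline = datetime(2025, 2, 12, 17, 0)

one_minute_late = grace_days_used(deadline, deadline + timedelta(minutes=1))
day_and_minute_late = grace_days_used(deadline, deadline + timedelta(days=1, minutes=1))

print(one_minute_late)                             # 1 grace day
print(day_and_minute_late)                         # 2 grace days
print(day_and_minute_late <= MAX_PER_ASSIGNMENT)   # still within the per-assignment cap
```

The sketch only counts days for a single submission; tracking the semester-wide total of 5 is left out.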

Assessment and grading

  • Participation (6%)

    This portion of your grade mixes three components, each worth 2% of your final grade:

    1. Traditional participation including: asking and answering questions in lecture and in recitations, asking and answering questions on the course Slack, attending office hours, or working with teaching staff on your final paper.

    2. The completion of weekly “check-in” quizzes on Canvas. These will be available each week, will only take a few minutes, and will be graded by completion (not correctness).

    3. You will share the “main finding” of your final paper in recitation on Thursday, April 24th or Friday, April 25th and receive peer feedback.

  • Problem sets (24%)

    • Four problem sets.

    • Completed using R Markdown. Submissions will include a knitted HTML file and the associated .Rmd file.

    • Scored out of 100. Having answers that strictly produce the “Correct” output from R will result in a grade of 90/100. 90+ grades are reserved for submissions that have all the correct answers, have code that is cleanly and effectively written, and have written explanations that clearly and concisely articulate the findings.

    • There are many ways to do things in R. This course is built around a particular sequence of knowledge designed to maximize your potential as an R user. As such, in problem sets you will be assessed on the degree to which you have learned the content of this course specifically.

    • You are free to do as many of the problem sets as you like. If you do not complete a problem set, the percentage points for that assignment will be transferred to the midterm (for PS1 and PS2), or the final exam (for PS3, PS4). For example if you don’t complete PS2, the midterm would then be worth 26% of your final grade (20% + 6%). If you don’t complete PS3 & PS4, the final exam would be worth 42% of your final grade (30% + 6% + 6%).

  • Midterm (20%)

    • An in-class exam that will take place during our usual class period on March 5th.

    • The test is open book. You can use any material you wish, including this textbook, your problem sets, the problem set answer keys, and your own notes. You can Google things – though you are almost certainly better off just using the class notes.

    • You may not use ChatGPT or any other AI tool to answer the questions. Anyone caught using these tools will get a 0 on the exam and will be immediately referred to the office of student affairs.

  • Paper (20%)

    • Due: April 30th.

    • For the final paper of this course you will produce a short (less than 600 words) data-journalism style blog post that makes use of data. You will find your own data and use it to produce between 1 and 3 figures or tables to support an argument suitable for a non-technical audience.

    • This project brings together the two learning goals of this course: the technical ability to find, clean, and present data; as well as the ability to write about your findings in a clear and persuasive way. Accordingly, you will be graded on both the quality and rigor of your statistical findings, as well as the coding, presentation, and writing of the piece. To emphasize: a major component of this project and of your grade is determined by how you code and how you write your results up.

    • 600 words is short for a final paper. As such, I would highly encourage you to start work on this early. Part of the goal of the problem sets is to have you think a lot about how to present statistics in an approachable and non-technical way. Many undergrads spend 95% of their time writing and 5% of their time editing. (In your working life post-undergrad these two percentages will be almost exactly flipped!) Given the amount of time and the light word count, my expectation is that you meet with the teaching team to talk about your research question relatively early, and spend the majority of the time editing your work, not writing.

  • Final Exam (30%)

    • An in-person exam that will take place during the final exam period.

    • The test is open book. You can use any material you wish, including this textbook, your problem sets, the problem set answer keys, and your own notes. You can Google things – though you are almost certainly better off just using the class notes.

    • You may not use ChatGPT or any other AI tool to answer the questions. Anyone caught using these tools will get a 0 on the exam and will be immediately referred to the office of student affairs.
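The weight-transfer rule for skipped problem sets can be checked with a short sketch. This is an illustration only, written in Python, and the names `BASE_WEIGHTS`, `TRANSFER_TO`, and `adjusted_weights` are hypothetical. It assumes each of the four problem sets is worth 6% (24% split evenly), which matches the worked examples in the problem set policy.

```python
# Percentage weights from the assessment list above.
BASE_WEIGHTS = {
    "participation": 6,
    "PS1": 6, "PS2": 6, "PS3": 6, "PS4": 6,
    "midterm": 20,
    "paper": 20,
    "final": 30,
}

# Where the points of a skipped problem set go.
TRANSFER_TO = {"PS1": "midterm", "PS2": "midterm", "PS3": "final", "PS4": "final"}


def adjusted_weights(skipped):
    """Return the grade weights after moving skipped problem sets onto the exams."""
    weights = dict(BASE_WEIGHTS)
    for ps in skipped:
        weights[TRANSFER_TO[ps]] += weights[ps]
        weights[ps] = 0
    return weights


skip_ps2 = adjusted_weights(["PS2"])
skip_ps3_ps4 = adjusted_weights(["PS3", "PS4"])

print(skip_ps2["midterm"])        # 26, as in the PS2 example
print(skip_ps3_ps4["final"])      # 42, as in the PS3 & PS4 example
print(sum(skip_ps2.values()))     # weights still total 100
```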

Grade scale

Letter grades at the conclusion of the class will be assigned using the following scale. I do not round grades. If your grade is in one of the bands below you will receive that grade.

\[\begin{aligned} 97 \leq Grade: &A+\\ 93 \leq Grade <97: &A\\ 90 \leq Grade <93: &A-\\ 87 \leq Grade <90: &B+\\ 83 \leq Grade <87: &B\\ 80 \leq Grade <83: &B-\\ 77 \leq Grade <80: &C+\\ 73 \leq Grade <77: &C\\ 70 \leq Grade <73: &C-\\ 67 \leq Grade <70: &D+\\ 63 \leq Grade <67: &D\\ 60 \leq Grade <63: &D-\\ Grade <60: &F \end{aligned}\]
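Because grades are not rounded, a score just under a cutoff stays in the lower band. The sketch below restates the scale as a lookup; it is an illustration only, written in Python, and the names `SCALE` and `letter_grade` are hypothetical.

```python
# (lower cutoff, letter) pairs from the scale above, highest band first.
SCALE = [
    (97, "A+"), (93, "A"), (90, "A-"),
    (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (73, "C"), (70, "C-"),
    (67, "D+"), (63, "D"), (60, "D-"),
]


def letter_grade(score: float) -> str:
    """Map a final numeric grade to a letter, with no rounding."""
    for cutoff, letter in SCALE:
        if score >= cutoff:
            return letter
    return "F"


print(letter_grade(93))      # A
print(letter_grade(92.99))   # A- (not rounded up)
print(letter_grade(59.9))    # F
```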

Computing

The course will require students to have access to a personal computer in order to run the statistics software. If this is not possible, please consult with one of the instructors as soon as possible. Support to cover course costs is available through Student Financial Services.

We will use R in this class, which you can download for free at https://www.r-project.org/. R is completely open source and has an almost endless set of resources online. Virtually any data science job you could apply to nowadays will require some background in R programming.

While R is the language we will use, RStudio is a free program that makes it considerably easier to work with R. After installing R, you will install RStudio from https://www.rstudio.com. Please have both R and RStudio installed by the end of the first week of classes.

If you’re having trouble installing either program, there are more detailed installation instructions on the course Canvas page.

Textbooks

The reading load for this course will be relatively light, with the expectation that your primary task outside of class hours will be working on problem sets and reviewing material. That being said, textbook chapters that supplement the lectures are included, and reading through them before lecture will be helpful.

We will be using three supplementary textbooks for this course. The first two are available for free online through the library website; the third is a free online textbook.

Three additional books that I have found helpful in my development as a data scientist:

  • Data Analysis for Social Science: A Friendly and Practical Introduction. Elena Llaudet & Kosuke Imai (This is also the textbook for PSCI 1801)

  • The Functional Art: An introduction to information graphics and visualization by Alberto Cairo

  • On Writing Well by William Zinsser

Class Schedule

Week 0: January 15

What is Data Science?

Week 1: (No Monday Class) January 22

What is Data?

Leonard Mlodinow. The Drunkard’s Walk: How Randomness Rules Our Lives. (Excerpts on Canvas)

Week 2: January 27 - January 29

Basic R

Davies Chapter 2

January 28: course selection period ends

Week 3: February 3 - February 5

Conditional Logic and Subsetting

Freeman and Ross Chapter 7.3-7.5

Week 4: February 10 - February 12

Dataframes

Davies 5.2

Problem Set 1 Due Wednesday 7pm.

Week 5: February 17 - February 19

Cleaning and Reshaping

Freeman and Ross Chapter 12 (tidyr reshaping)

Week 6: February 24 - February 26

For/If

Davies Chapter 10

Problem Set 2 Due Wednesday 7pm.

February 24: Drop period ends

Week 7: March 3 - March 5

Review/Midterm

Midterm Exam in class on Wednesday.

Week 8: March 10 - March 12

Spring Break

Week 9: March 17 - March 19

Collecting and Merging Data

Freeman and Ross Chapter 11.5

March 21: Grade type change deadline.

Week 10: March 24 - March 26

Writing and Visualizing Our Findings

Zinsser. On Writing Well. (Excerpts on Canvas).

Badger et al. 2018. NYT Upshot.

Problem Set 3 Due Wednesday 7pm.

Week 11: March 31 - April 2

Tidyverse I

March 31: Withdrawal deadline

Week 12: April 7 - April 9

Tidyverse II

Freeman and Ross Chapter 16

Week 13: April 14 - April 16

Regression I

Freeman and Ross Chapter 16

Problem Set 4 Due Wednesday 7pm.

Week 14: April 21 - April 23

Regression II

Davies 20.1 - 20.3 & 20.5

Final Paper Peer Presentations During your Recitation Times.

Slides to be submitted by Wednesday 7pm.

Week 15: April 28 - April 30

Causal Inference/Review

Final Paper due Wednesday, April 30th at 11:59pm.

Final Exam

Date TBD