In this assignment, you’ll use simple linear regression to explore the percent of votes cast in-person in the 2020 U.S. election based on a county’s political leanings.
In this assignment, you will…
The following packages will be used in this assignment:
library(tidyverse)
library(knitr)
library(ggfortify)
library(broom)
library(viridis)
There are multiple data sets for this assignment. Use the code below to load the data.
election_nc <- read_csv("data/nc-election-2020.csv") %>%
  mutate(fips = as.integer(FIPS))
county_map_data <- read_csv("data/nc-county-map-data.csv")
election_sample <- read_csv("data/us-election-2020-sample.csv")
The county-level election data in election_nc and election_sample are from The Economist GitHub repo. The data were originally analyzed in the July 2021 article “In-person voting really did accelerate covid-19’s spread in America”. For this analysis, we will focus on the following variables:
inperson_pct: The proportion of a county’s votes cast in-person in the 2020 election
pctTrump_2016: The proportion of a county’s votes cast for Donald Trump in the 2016 election
The data in county_map_data were obtained from the maps package in R. We will not analyze any of the variables in this data set but will use it to help create maps in the assignment. See the documentation for the maps package for details and code examples.
Due to the COVID-19 pandemic, many states made alternatives to in-person voting, such as voting by mail, more widely available for the 2020 U.S. election. The general consensus was that voters who were more Democratic leaning would be more likely to vote by mail, while more Republican leaning voters would largely vote in-person. This was supported by multiple surveys, including this survey conducted by Pew Research.
The goal of this analysis is to use regression analysis to explore the relationship between a county’s political leanings and the proportion of votes cast in-person in 2020. Did counties with more Republican leanings have a larger proportion of votes cast in-person in the 2020 election?
We will use the proportion of votes cast for Donald Trump in 2016 (pctTrump_2016) as a measure of a county’s political leaning. Counties with a higher proportion of votes for Trump in 2016 are considered to have more Republican leanings.
All narrative should be written in complete sentences, and all visualizations should have informative titles and axis labels.
For this part of the analysis, we will focus on counties in North Carolina. We will use the data sets election_nc and county_map_data.
Visualize the distribution of the response variable inperson_pct and calculate appropriate summary statistics. Use the visualization and summary statistics to describe the distribution. Include an informative title and axis labels on the plot.
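As a rough illustration of the workflow for this kind of exercise, the sketch below draws a histogram and computes summary statistics for a simulated data set. The data frame toy_counties and its columns lean_prop and inperson_prop are hypothetical stand-ins invented for this example, not the assignment data, so the numbers it produces say nothing about North Carolina.

```r
library(tidyverse)

# Simulated stand-in data for 100 hypothetical counties.
# toy_counties, lean_prop, and inperson_prop are made-up names for illustration.
set.seed(210)
toy_counties <- tibble(
  lean_prop = runif(100, min = 0.2, max = 0.8),
  inperson_prop = 0.5 + 0.4 * lean_prop + rnorm(100, sd = 0.05)
)

# Histogram of the response variable with a title and axis labels
p_hist <- ggplot(toy_counties, aes(x = inperson_prop)) +
  geom_histogram(bins = 15, color = "white") +
  labs(x = "Proportion of votes cast in-person",
       y = "Number of counties",
       title = "Distribution of in-person voting (simulated counties)")
p_hist

# Summary statistics for center and spread
toy_summary <- toy_counties %>%
  summarise(
    n_counties = n(),
    mean_prop = mean(inperson_prop),
    sd_prop = sd(inperson_prop),
    median_prop = median(inperson_prop),
    iqr_prop = IQR(inperson_prop)
  )
toy_summary
```

Which statistics are appropriate depends on the shape of the distribution: the mean and standard deviation are usually reported for roughly symmetric distributions, and the median and IQR for skewed ones.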
Let’s view the data in another way. Use the code below to make a map of North Carolina with the color of each county filled in based on the percentage of votes cast in-person in the 2020 election. Fill in the title and axis labels.
Then use the plot to answer the following:
election_map_data <- left_join(election_nc, county_map_data)
ggplot() +
  geom_polygon(county_map_data, mapping = aes(x = long, y = lat, group = group),
               fill = "lightgray", color = "white") +
  geom_polygon(election_map_data, mapping = aes(x = long, y = lat, group = group,
                                                fill = inperson_pct)) +
  labs(x = "_____",
       y = "_____",
       fill = "_____",
       title = "_____") +
  scale_fill_viridis()
Create a visualization of the relationship between inperson_pct and pctTrump_2016. Use the visualization to describe the relationship between the two variables.
We can use a linear regression model to better quantify the relationship between the variables. Fit the linear model to understand variability in the percent of in-person votes based on the percent of votes for Trump in the 2016 election. Neatly display the model output with 3 digits.
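One common way to fit a simple linear regression and display the output neatly is to combine lm() with tidy() from broom and kable() from knitr. The sketch below does this on simulated data; toy_counties, lean_prop, inperson_prop, and toy_fit are hypothetical names made up for the example, and the data are not the assignment data.

```r
library(tidyverse)
library(broom)
library(knitr)

# Simulated stand-in data (hypothetical names, not the assignment data)
set.seed(210)
toy_counties <- tibble(
  lean_prop = runif(100, min = 0.2, max = 0.8),
  inperson_prop = 0.5 + 0.4 * lean_prop + rnorm(100, sd = 0.05)
)

# Fit the simple linear regression: response ~ predictor
toy_fit <- lm(inperson_prop ~ lean_prop, data = toy_counties)

# Convert the coefficient table to a data frame and round to 3 digits
toy_table <- tidy(toy_fit) %>%
  kable(digits = 3)
toy_table
```

The row labeled with the predictor name holds the estimated slope, and the (Intercept) row holds the estimated intercept. When the predictor is a proportion, a one-unit increase spans the full range from 0 to 1, so it can be easier to interpret the slope for a smaller change such as 0.1.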
Now let’s use the model coefficients to describe the relationship.
Now let’s evaluate the model conditions. Check the linearity, constant variance, and normality conditions. For each condition, indicate whether it is satisfied along with a brief explanation for your conclusion. Include any plots and/or summary statistics used to support your response.
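As one possible approach, the sketch below uses augment() from broom to pull out fitted values and residuals, then draws a residuals versus fitted plot (for linearity and constant variance) and a normal QQ plot of the residuals (for normality). It runs on simulated data; toy_counties, lean_prop, inperson_prop, toy_fit, and toy_aug are hypothetical names used only for illustration.

```r
library(tidyverse)
library(broom)

# Simulated stand-in data and model (hypothetical names, not the assignment data)
set.seed(210)
toy_counties <- tibble(
  lean_prop = runif(100, min = 0.2, max = 0.8),
  inperson_prop = 0.5 + 0.4 * lean_prop + rnorm(100, sd = 0.05)
)
toy_fit <- lm(inperson_prop ~ lean_prop, data = toy_counties)

# augment() adds columns such as .fitted and .resid to the model data
toy_aug <- augment(toy_fit)

# Residuals vs. fitted values: look for curvature (linearity)
# and for changing vertical spread (constant variance)
p_resid <- ggplot(toy_aug, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Fitted values",
       y = "Residuals",
       title = "Residuals vs. fitted values (simulated counties)")
p_resid

# Normal QQ plot of the residuals: points near the line suggest normality
p_qq <- ggplot(toy_aug, aes(sample = .resid)) +
  stat_qq() +
  stat_qq_line() +
  labs(x = "Theoretical quantiles",
       y = "Sample quantiles",
       title = "Normal QQ plot of residuals (simulated counties)")
p_qq
```

Because the simulated data were generated from a linear model with normal errors, these plots should look well behaved; real data may not.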
The last condition we need to check is independence. To do so, we will examine a map of the counties in North Carolina with the color filled based on the value of the residuals.
Fill in the name of your model in the code below to calculate the residuals and add them to election_map_data. Then, make a map with the color of each county filled in based on the value of the residual. Hint: Start with the code from Exercise 2.
election_resid <- election_nc %>%
  mutate(residual = resid(_____)) %>%
  select(fips, residual)
election_map_data <- left_join(election_map_data, election_resid)
To get a better understanding of the trend across the entire United States, we analyze data from a random sample of 200 counties. This data is in the election_sample data frame. Because these counties were randomly selected out of the 3,006 counties in the United States, we can reasonably treat the counties as independent observations.
Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to commit and push changes.
Remember – you must turn in a PDF file to the Gradescope page before the submission deadline for full credit.
To submit your assignment:
Go to http://www.gradescope.com and click Log in in the top right corner.
Click School Credentials ➡️ Duke NetID and log in using your NetID credentials.
Click on your STA 210 course.
Click on the assignment, and you’ll be prompted to submit it.
Mark the pages associated with each exercise. All of the pages of your assignment should be associated with at least one question (i.e., should be “checked”).
Select the first page of your .PDF submission to be associated with the “Workflow & formatting” section.
Component | Points |
---|---|
Exercises 1 - 10 | 45 |
Workflow & formatting | 5 |
Total | 50 |