Lab 02: Simple linear regression

due Mon, September 06 at 11:59pm ET

Introduction

In today’s lab, you’ll use simple linear regression to analyze the relationship between the admissions rate and total cost for colleges and universities in the United States.

Learning goals

By the end of the lab you will…

Getting started

Packages

The following packages are used in the lab.

library(tidyverse)
library(broom)
library(knitr)

The Data

The data for this lab is from the scorecard data set in the rcfss R package. It includes information originally obtained from the U.S. Department of Education’s College Scorecard for 1753 colleges and universities during the 2018 - 2019 academic year.

The lab focuses on the following variables:

Click here to see a full list of variables and definitions.

Use the code below to load the data set.

scorecard <- read_csv("data/scorecard.csv")

Exercises

Note: Include axis labels and an informative title for all plots. Use the kable function to neatly print tables and regression output.
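
As a rough illustration of the note above, the sketch below draws a histogram with axis labels and a title, then prints summary statistics with kable. It uses a small made-up data frame, toy, with hypothetical columns hours and score. These names and values are assumptions for illustration only and are not the lab data.

```r
library(tidyverse)
library(knitr)

# Hypothetical toy data, made up for illustration only (not the lab data)
toy <- tibble(
  hours = c(1, 2, 2, 3, 4, 5, 5, 6, 8, 12),
  score = c(52, 55, 61, 60, 68, 70, 75, 74, 83, 90)
)

# Histogram with axis labels and an informative title
hours_hist <- ggplot(toy, aes(x = hours)) +
  geom_histogram(binwidth = 2) +
  labs(
    x = "Hours studied",
    y = "Number of students",
    title = "Distribution of hours studied"
  )
hours_hist

# Summary statistics for the center (mean, median) and spread (sd, IQR)
hours_summary <- toy %>%
  summarise(
    mean = mean(hours),
    median = median(hours),
    sd = sd(hours),
    iqr = IQR(hours)
  )

# kable() prints the one-row summary as a neatly formatted table
kable(hours_summary, digits = 2)
```

The same pattern should carry over to other variables by swapping the column inside aes() and summarise(). The binwidth here is an arbitrary choice for the toy data.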

  1. Create a histogram to examine the distribution of admrate and calculate summary statistics for the center (mean and median) and the spread (standard deviation and IQR).

  2. Use the results from the previous exercise to describe the distribution of admrate. Include the shape, center, spread, and any potential outliers.

  3. Plot the distribution of cost and calculate the appropriate summary statistics. Describe the distribution of cost (shape, center, spread, and outliers) using the plot and appropriate summary statistics.

  4. The goal of this analysis is to fit a regression model that can be used to understand the variability in the cost of college based on the admissions rate. Before fitting the model, let’s look at the relationship between the two variables. Create a scatterplot to display the relationship between cost and admissions rate. Describe the relationship between the two variables based on the plot.

  5. Does the relationship between cost and admissions rate differ by type of college? Modify the plot from the previous exercise to visualize the relationship by type of college.

  6. Describe two new observations from the scatterplot in Exercise 5 that you didn’t see in the scatterplot from Exercise 4.

  7. Fit the linear regression model. Display the confidence interval for the coefficients in the output. Use the kable function to neatly display the results.

  8. Consider the model from the previous exercise.

    • Interpret the slope in the context of the problem.
    • Does the intercept have a meaningful interpretation? If so, write the interpretation in the context of the problem. Otherwise, explain why the interpretation is not meaningful.
  9. Does the data provide evidence of a statistically significant linear relationship between cost and admissions rate? Conduct a hypothesis test to answer this question. In your response:

    • State the null and alternative hypotheses used to answer this question in words and in mathematical notation.
    • What is the test statistic? State what the test statistic means in context.
    • What distribution was used to calculate the p-value?
    • State your conclusion for the test in context.
  10. Interpret the 95% confidence interval for the slope in context. Then indicate whether or not it is consistent with the results of the hypothesis test from the previous exercise. Briefly explain your response.
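
The modeling workflow in the exercises above can be sketched roughly as follows. This is a minimal example on a made-up data frame, toy, with hypothetical columns hours, score, and group. None of these names or values come from the lab data. It shows a scatterplot colored by group, a simple linear regression with confidence intervals in the output, and how the test statistic and p-value for the slope relate to the estimate, its standard error, and a t distribution.

```r
library(tidyverse)
library(broom)
library(knitr)

# Hypothetical toy data, made up for illustration only (not the lab data)
toy <- tibble(
  hours = c(1, 2, 2, 3, 4, 5, 5, 6, 8, 12),
  score = c(52, 55, 61, 60, 68, 70, 75, 74, 83, 90),
  group = rep(c("A", "B"), times = 5)
)

# Scatterplot of the response against the predictor, colored by group
ggplot(toy, aes(x = hours, y = score, color = group)) +
  geom_point() +
  labs(
    x = "Hours studied",
    y = "Exam score",
    color = "Group",
    title = "Exam score vs. hours studied",
    subtitle = "By group"
  )

# Fit the simple linear regression model: score = b0 + b1 * hours
toy_fit <- lm(score ~ hours, data = toy)

# Tidy output with 95% confidence intervals for the coefficients
toy_tidy <- tidy(toy_fit, conf.int = TRUE, conf.level = 0.95)
kable(toy_tidy, digits = 3)

# Pull out the row for the slope
slope <- toy_tidy %>%
  filter(term == "hours")

# Test statistic: the estimated slope divided by its standard error
t_stat <- slope$estimate / slope$std.error

# Two-sided p-value from a t distribution with n - 2 degrees of freedom
p_value <- 2 * pt(abs(t_stat), df = df.residual(toy_fit), lower.tail = FALSE)

# The 95% interval excludes 0 exactly when the test rejects at the 0.05 level
ci_excludes_zero <- slope$conf.low > 0 | slope$conf.high < 0
```

Here t_stat and p_value are recomputed by hand only to show where the statistic and p.value columns of the tidy output come from. In practice the values in toy_tidy can be read directly.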

Submission

Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to commit and push changes.

Remember – you must turn in a PDF file to the Gradescope page before the submission deadline for full credit.

To submit your assignment:

Grading (50 pts)


Component               Points
Ex 1 - 10               45
Workflow & formatting   5

Grading notes: