HW 03: Multiple linear regression

due Wednesday, October 13 at 11:59pm

Getting started

Packages

You will use the following packages in this assignment.

library(tidyverse)
library(broom)
library(knitr) 
# you can add additional packages as needed

Exercises

Part 1: Kentucky Derby

The Kentucky Derby is a 1.25 mile horse race held annually at the Churchill Downs race track in Louisville, Kentucky. The data used to fit the regression model in this analysis include information about 122 derbies held from 1896 to 2017. The analysis focuses on the following variables:

Below is the output from a regression model that uses the main effects year, condition, and startersCent, along with the interaction between year and condition, to understand variation in speed. The 95% confidence intervals for the coefficients are included as well. Use the output for exercises 1 - 4.

term                estimate  std.error  statistic  p.value  conf.low  conf.high
(Intercept)           52.344      0.181    288.950    0.000    51.985     52.702
year                   0.020      0.003      7.576    0.000     0.014      0.025
startersCent          -0.003      0.016     -0.189    0.850    -0.035      0.029
conditiongood         -1.070      0.423     -2.527    0.013    -1.908     -0.231
conditionslow         -2.183      0.270     -8.097    0.000    -2.717     -1.649
year:conditiongood     0.012      0.008      1.598    0.113    -0.003      0.027
year:conditionslow     0.012      0.004      2.866    0.005     0.004      0.020
  1. Write the equation of the statistical model that corresponds to the regression output above.

  2. Interpret the following in the context of the data:

    • Coefficient of conditiongood.
    • Coefficient of year.
  3. Use the confidence intervals in the regression output to answer each analysis question. For each question, state the confidence interval you used and state your conclusion in the context of the data.

    • On average, is the winning speed the same for fast and slow track conditions?
    • Is the coefficient of year the same for fast and good track conditions?
    • What is the estimated winning speed for a race in 1896 that included 14 horses and was run on fast track conditions?
  4. When interpreting the coefficient for startersCent, why do we include “holding year and condition constant”? Is it wrong to leave such a qualifier off the interpretation? Briefly explain.
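
For reference, output like the table in Part 1 is typically produced by fitting a model with lm() and tidying it with broom. The sketch below is only an illustration under stated assumptions: it uses the built-in mtcars data as a stand-in (not the derby data), and the variable choices and the hypothetical new observation are made up for the example.

```r
library(broom)
library(knitr)

# Stand-in data: mtcars ships with R; it is NOT the derby data.
cars <- mtcars
cars$am <- factor(cars$am, labels = c("automatic", "manual"))

# Main effects for wt, hp, and am, plus the wt-by-am interaction
fit <- lm(mpg ~ wt + hp + am + wt:am, data = cars)

# Coefficient table with 95% confidence intervals, shown with 3 digits
coef_table <- tidy(fit, conf.int = TRUE, conf.level = 0.95)
print(kable(coef_table, digits = 3))

# Estimated mean response for one hypothetical new observation
new_car <- data.frame(
  wt = 3,
  hp = 110,
  am = factor("manual", levels = levels(cars$am))
)
pred <- predict(fit, newdata = new_car)
print(pred)
```

Because am is a factor, the table reports its coefficient as a difference from the baseline level, and the interaction row adjusts the slope of wt for the non-baseline level. The condition terms in the derby output are read the same way.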

Part 2: Birth weights

Use the following scenario and data for exercises 5 - 8.

The Child Health and Development Studies investigate a range of topics. One study considered all pregnancies between 1960 and 1967 among women in the Kaiser Foundation Health Plan in the San Francisco East Bay area. The data set is in the file babies.csv and includes the following variables:

The goal of this analysis is to understand the impact of the mother smoking on the baby’s birth weight. To do so, we will fit a model predicting the average birth weight of babies based on all of the relevant variables included in the data set.

  5. Briefly explain why we want to include all other relevant variables as predictors in the model given the objective is to understand the effect of smoking.

  6. Fit the model and display the output using 3 digits. The model only needs to include main effects. Then, interpret the coefficient of gestation in the context of the data.

  7. Use the model to describe the association between the mother smoking and the baby’s birth weight.

  8. Would you recommend using this model to predict birth weights for babies born today? Briefly explain why or why not.
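
As a general pattern, a main-effects model that uses every other column as a predictor can be written with the `.` shorthand in an R formula. The sketch below is a minimal illustration that assumes the built-in airquality data as a stand-in (not babies.csv); the response and predictors are unrelated to this assignment.

```r
library(broom)
library(knitr)

# Stand-in data: airquality ships with R; it is NOT the babies data.
# `Ozone ~ .` regresses Ozone on every other column as a main effect.
fit_all <- lm(Ozone ~ ., data = airquality)

# With R's default settings, lm() drops rows that have missing values,
# so compare the number of rows used with the number of rows available.
n_used <- nobs(fit_all)
n_total <- nrow(airquality)
print(c(used = n_used, total = n_total))

# Coefficient table displayed with 3 digits
all_table <- tidy(fit_all)
print(kable(all_table, digits = 3))
```

The `.` shorthand includes every remaining column, so any column that should not be a predictor (for example, an identifier) would need to be removed from the data first.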

Submission

Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to commit and push changes.

Remember – you must turn in a PDF file to the Gradescope page before the submission deadline for full credit.

To submit your assignment:

Grading (50 pts)

Exercises 1 - 8          45
Workflow & formatting     5

Acknowledgement

The data and questions from this assignment were adapted from exercises in Beyond Multiple Linear Regression.