Getting started

Go to the sta210-fa21 organization on GitHub. Click on the repo with the prefix hw-03. It contains the starter documents you need to complete the assignment.
Clone the repo and start a new project in RStudio. See the Lab 01 instructions for details on cloning a repo, starting a new R project and configuring git.

Packages

You will use the following packages in this assignment.

library(tidyverse)
library(broom)
library(knitr) 
# you can add additional packages as needed

Exercises

Part 1: Kentucky Derby

The Kentucky Derby is a 1.25-mile horse race held annually at the Churchill Downs race track in Louisville, Kentucky. The data used to fit the regression model in this analysis include information about the 122 derbies held from 1896 to 2017. The analysis focuses on the following variables:

year: year of race recorded as number of years since 1896 (e.g., 2017 is recorded as 2017 - 1896 = 121)
condition: condition of the track, recorded as fast, good, or slow.
- “good” includes the official designations “good” and “dusty”
- “slow” includes the official designations “slow”, “heavy”, “muddy”, and “sloppy”
starters: number of horses who raced
startersCent: mean-centered value of starters calculated as (starters - 14)
speed: average speed of the winner (in feet per second)
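
The two derived variables above, year and startersCent, are simple shifts of the raw values. As a minimal sketch, assuming a data frame with a calendar-year column and a starters column (the name derby_example, the column calendar_year, and the three rows are made up for illustration and are not the actual Derby data), they could be computed like this:

```r
# Illustrative values only: three hypothetical rows, NOT the actual Derby data
derby_example <- data.frame(
  calendar_year = c(1896, 1950, 2017),  # hypothetical column name
  starters      = c(8, 14, 20)          # made-up starter counts
)

# year: number of years since 1896 (so 2017 becomes 121)
derby_example$year <- derby_example$calendar_year - 1896

# startersCent: starters shifted by 14, as in the variable description
derby_example$startersCent <- derby_example$starters - 14

derby_example
```

With this coding, a value of 0 for year corresponds to the 1896 race, and a value of 0 for startersCent corresponds to a race with 14 horses.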

Below is the output of a regression model that uses the main effects year, condition, and startersCent, plus the interaction between year and condition, to understand variation in speed. The 95% confidence intervals for the coefficients are included as well. Use the output for exercises 1 - 4.

term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	52.344	0.181	288.950	0.000	51.985	52.702
year	0.020	0.003	7.576	0.000	0.014	0.025
startersCent	-0.003	0.016	-0.189	0.850	-0.035	0.029
conditiongood	-1.070	0.423	-2.527	0.013	-1.908	-0.231
conditionslow	-2.183	0.270	-8.097	0.000	-2.717	-1.649
year:conditiongood	0.012	0.008	1.598	0.113	-0.003	0.027
year:conditionslow	0.012	0.004	2.866	0.005	0.004	0.020
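
For reference, a table with these columns is what tidy() from the broom package returns for an lm fit when conf.int = TRUE. The sketch below is not the Derby model. It uses R's built-in mtcars data as a stand-in, with cyl playing the role of a categorical predictor such as condition, only to show how an interaction between a numeric and a categorical predictor produces terms named like year:conditiongood. The object names car_data, fit, and fit_table are arbitrary.

```r
library(broom)
library(knitr)

# Stand-in data: the built-in mtcars data set, NOT the Derby data
car_data <- mtcars
car_data$cyl <- factor(car_data$cyl)  # categorical predictor (role of condition)

# Main effects plus an interaction between a numeric and a categorical predictor
fit <- lm(mpg ~ wt + hp + cyl + wt:cyl, data = car_data)

# conf.int = TRUE adds conf.low and conf.high; conf.level defaults to 0.95
fit_table <- tidy(fit, conf.int = TRUE)

# Print the table rounded to 3 digits
kable(fit_table, digits = 3)
```

The same tidy() and kable(digits = 3) pattern can be reused wherever the exercises ask for model output displayed using 3 digits.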

Write the equation of the statistical model that corresponds to the regression output above.
Interpret the following in the context of the data:
- Coefficient of conditiongood.
- Coefficient of year.
Use the confidence intervals in the regression output to answer each analysis question. For each question, state the confidence interval you used and state your conclusion in the context of the data.
- On average, is the winning speed the same for fast and slow track conditions?
- Is the coefficient of year the same for fast and good track conditions?
- What is the estimated winning speed for a race in 1896 that included 14 horses on fast track conditions?
When interpreting the coefficient for startersCent, why do we include “holding year and condition constant”? Is it wrong to leave such a qualifier off the interpretation? Briefly explain.

Part 2: Birth weights

Use the following scenario and data for exercises 5 - 8.

The Child Health and Development Studies investigate a range of topics. One study considered all pregnancies between 1960 and 1967 among women in the Kaiser Foundation Health Plan in the San Francisco East Bay area. The data set is in the file babies.csv and includes the following variables:

case - id number
bwt - birth weight, in ounces
gestation - length of gestation (pregnancy), in days
parity - binary indicator for a first pregnancy (0 = first pregnancy)
age - mother’s age in years
height - mother’s height in inches
weight - mother’s weight in pounds
smoke - binary indicator for whether the mother smokes

The goal of this analysis is to understand the impact of the mother smoking on the baby’s birth weight. To do so, we will fit a model predicting the average birth weight of babies based on all of the relevant variables included in the data set.

Briefly explain why we want to include all other relevant variables as predictors in the model given the objective is to understand the effect of smoking.
Fit the model and display the output using 3 digits. The model only needs to include main effects. Then, interpret the coefficient of gestation in the context of the data.
Use the model to describe the association between the mother smoking and the baby’s birth weight.
Would you recommend using this model to predict birth weights for babies born today? Briefly explain why or why not.

Submission

Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to commit and push changes.

Remember – you must turn in a PDF file to the Gradescope page before the submission deadline for full credit.

To submit your assignment:

Go to http://www.gradescope.com and click Log in in the top right corner.
Click School Credentials ➡️ Duke NetID and log in using your NetID credentials.
Click on your STA 210 course.
Click on the assignment, and you’ll be prompted to submit it.
Mark the pages associated with each exercise. All of the pages of your assignment should be associated with at least one question (i.e., should be “checked”).
Select the first page of your .PDF submission to be associated with the “Workflow & formatting” section.

Grading (50 pts)

Exercises 1 - 8	45
Workflow & formatting	5

The “Workflow & formatting” grade assesses the reproducible workflow. This includes having at least 3 informative commit messages, updating the name and date in the YAML, and submitting a PDF document that is neatly formatted with easily readable code and narrative.

Acknowledgement

The data and questions from this assignment were adapted from exercises in Beyond Multiple Linear Regression.

HW 03: Multiple linear regression

due Wednesday, October 13 at 11:59pm