Go to the sta210-fa21 organization on GitHub. Click on the repo with the prefix hw-03. It contains the starter documents you need to complete the assignment.
Clone the repo and start a new project in RStudio. See the Lab 01 instructions for details on cloning a repo, starting a new R project and configuring git.
You will use the following packages in this assignment.
library(tidyverse)
library(broom)
library(knitr)
# you can add additional packages as needed
The Kentucky Derby is a 1.25-mile horse race held annually at the Churchill Downs race track in Louisville, Kentucky. The data used to fit the regression model in this analysis include information about the 122 derbies held from 1896 to 2017. The analysis focuses on the following variables:
year: year of race recorded as number of years since 1896 (e.g., 2017 is recorded as 2017 - 1896 = 121)
condition: condition of the track - fast, good, and slow.
starters: number of horses who raced
startersCent: mean-centered value of starters, calculated as (starters - 14)
speed: average speed of the winner (in feet per second)
Below is a regression model using the main effects year, condition, startersCent, and the interaction between year and condition to understand variation in speed. The 95% confidence intervals for the coefficients are included as well. Use the output for exercises 1 - 4.
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | 52.344 | 0.181 | 288.950 | 0.000 | 51.985 | 52.702 |
year | 0.020 | 0.003 | 7.576 | 0.000 | 0.014 | 0.025 |
startersCent | -0.003 | 0.016 | -0.189 | 0.850 | -0.035 | 0.029 |
conditiongood | -1.070 | 0.423 | -2.527 | 0.013 | -1.908 | -0.231 |
conditionslow | -2.183 | 0.270 | -8.097 | 0.000 | -2.717 | -1.649 |
year:conditiongood | 0.012 | 0.008 | 1.598 | 0.113 | -0.003 | 0.027 |
year:conditionslow | 0.012 | 0.004 | 2.866 | 0.005 | 0.004 | 0.020 |
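The conf.low and conf.high columns relate to estimate and std.error through the usual t-based interval, estimate ± t* × std.error. As a minimal sketch, the (Intercept) row can be checked by hand. The degrees of freedom used here (122 races minus 7 estimated coefficients) are an assumption that every race was used in the fit, and the variable names are only illustrative.

```r
# Values copied from the (Intercept) row of the regression output
estimate  <- 52.344
std_error <- 0.181

# Assumption: residual degrees of freedom = 122 races - 7 coefficients
df_resid <- 122 - 7

# Critical value for a two-sided 95% interval
t_star <- qt(0.975, df = df_resid)

conf_low  <- estimate - t_star * std_error
conf_high <- estimate + t_star * std_error

# Compare with the conf.low and conf.high columns of the table
round(c(conf.low = conf_low, conf.high = conf_high), 3)
```

Small differences in the last digit are expected, because the estimate and standard error shown in the table are themselves rounded to 3 digits.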
Write the equation of the statistical model that corresponds to the regression equation above.
Interpret the following in the context of the data:
conditiongood.
year.
Use the confidence intervals in the regression output to answer each analysis question. For each question, state the confidence interval you used and state your conclusion in the context of the data.
When interpreting the coefficient for startersCent, why do we include “holding year and condition constant”? Is it wrong to leave such a qualifier off the interpretation? Briefly explain.
Use the following scenario and data for exercises 5 - 8.
The Child Health and Development Studies investigate a range of topics. One study considered all pregnancies between 1960 and 1967 among women in the Kaiser Foundation Health Plan in the San Francisco East Bay area. The data set is in the file babies.csv and includes the following variables:
case - id number
bwt - birth weight, in ounces
gestation - length of gestation (pregnancy), in days
parity - binary indicator for a first pregnancy (0 = first pregnancy)
age - mother’s age in years
height - mother’s height in inches
weight - mother’s weight in pounds
smoke - binary indicator for whether the mother smokes
The goal of this analysis is to understand the impact of the mother smoking on the baby’s birth weight. To do so, we will fit a model predicting the average birth weight of babies based on all of the relevant variables included in the data set.
Briefly explain why we want to include all other relevant variables as predictors in the model given the objective is to understand the effect of smoking.
Fit the model and display the output using 3 digits. The model only needs to include main effects. Then, interpret the coefficient of gestation in the context of the data.
Use the model to describe the association between the mother smoking and the baby’s birth weight.
Would you recommend using this model to predict birth weights for babies born today? Briefly explain why or why not.
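For the fit-and-display step in the exercises above, the general pattern with broom and knitr can be sketched as follows. This example deliberately uses R's built-in mtcars data as a stand-in, so the data set, the predictors, and the names toy_fit and toy_table are illustrative assumptions, not the model for this assignment.

```r
library(broom)
library(knitr)

# mtcars is a built-in data set, used here only as a stand-in
# for the assignment data; the predictors are arbitrary
toy_fit <- lm(mpg ~ wt + hp + am, data = mtcars)

# tidy() returns one row per coefficient;
# conf.int = TRUE adds the conf.low and conf.high columns
toy_table <- tidy(toy_fit, conf.int = TRUE)

# kable() formats the table; digits = 3 rounds numeric columns to 3 decimals
kable(toy_table, digits = 3)
```

The same three steps (fit with lm(), summarize with tidy(), format with kable()) apply to a main-effects model on any data frame once the formula and the data argument are swapped out.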
Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to commit and push changes.
Remember – you must turn in a PDF file to the Gradescope page before the submission deadline for full credit.
To submit your assignment:
Go to http://www.gradescope.com and click Log in in the top right corner.
Click School Credentials ➡️ Duke NetID and log in using your NetID credentials.
Click on your STA 210 course.
Click on the assignment, and you’ll be prompted to submit it.
Mark the pages associated with each exercise. All of the pages of your assignment should be associated with at least one question (i.e., should be “checked”).
Select the first page of your .PDF submission to be associated with the “Workflow & formatting” section.
Component | Points |
---|---|
Exercises 1 - 8 | 45 |
Workflow & formatting | 5 |
The data and questions from this assignment were adapted from exercises in Beyond Multiple Linear Regression.