This lab will focus on Analysis of Variance. You will do a short individual ANOVA activity in Sakai, then as a group, you will use ANOVA to analyze data about penguins in the Palmer Archipelago near Palmer Station.
By the end of the lab you will be able to…
You will work individually for this portion of the lab. The assessment on Sakai will be graded for completion and will be included as part of the Lab 04 grade. Your TA will explain the instructions. You also should have received an email with instructions about this activity earlier in the day.
You should complete the assessment before working on the group portion of lab.
A repository has already been created for you and your teammates. Everyone in your team has access to the same repo.
Go to the sta210-fa21 course organization on GitHub.
You should see a repo with the lab-04 prefix.
Each person on the team should clone the repository and open a new project in RStudio. Do not make any changes to the .Rmd file until the instructions tell you do to so.
There are no Team Member markers in this lab; however, you should use a similar workflow as in lab 03. Only one person should type in the group’s .Rmd file at a time. Once that person has finished typing the group’s responses, they should knit, commit, and push the changes to GitHub. All other teammates can pull to see the updates in RStudio.
One team member: Change the author to your team name and include each team member’s name in the author
field of the YAML in the following format. Team Name: Member 1, Member 2, Member 3, Member 4
. Knit, commit, and push the changes to GitHub.**
All other team members: Click the Pull button in the Git pane to get the updated document. You should see the updated name in the .Rmd file.**
We’ll use the following packages in this lab.
library(tidyverse)
library(knitr)
library(broom)
library(pairwiseCI)
# add more packages as needed
The data is the penguins
data set from the palmerpenguins R package maintained by Dr. Allison Horst. This data set contains measurements and other characteristics for over 300 penguins observed near Palmer Station in Antarctica. The data were originally collected by Dr. Kristen Gorman.
The data set is in penguins.csv
located in the data
folder.
The following variables are in the penguins
data set.
variable | class | description |
---|---|---|
species | integer | Penguin species (Adelie, Gentoo, Chinstrap) |
island | integer | Island where recorded (Biscoe, Dream, Torgersen) |
bill_length_mm | double | Bill length in millimeters (also known as culmen length) |
bill_depth_mm | double | Bill depth in millimeters (also known as culmen depth) |
flipper_length_mm | integer | Flipper length in mm |
body_mass_g | integer | Body mass in grams |
sex | integer | sex of the animal |
year | integer | year recorded |
Can we differentiate penguin species based on their flipper length? To analyze this question we will use Analysis of Variance to compare the mean flipper length for Adelie, Gentoo, Chinstrap penguins found near the Palmer Station in Antarctica.
As you complete the lab
Let’s take a look at the flipper_length_mm
based on the species
. Create a visualization to explore the relationship between the two variables. Interpret the visualization in the context of the data.
We’d like to use ANOVA to assess if there’s a relationship between flipper length and species. Write the null and alternative hypotheses for the ANOVA test (a) using mathematical notation (b) using words.
Before conducting the test, we will check the conditions for ANOVA (normality, constant variance, independence). State whether each condition is satisfied. Briefly explain your response including any plots, summary statistics, etc. used to make your determination.
Use R to calculate and display the ANOVA table.
Use the ANOVA table to answer the following:
What is the conclusion from the ANOVA test? State the conclusion in the context of the data.
We know at least one species has a different mean flipper length, and we can do further analysis to see how the means compare. To do so, we will look at confidence intervals for the difference in the mean flipper length between species.
The general form of the confidence interval for the difference in means between groups \(i\) and \(j\) is
\[(\bar{y}_i-\bar{y}_j) \pm t^* \sqrt{MSE\Big(\frac{1}{n_i}+\frac{1}{n_j}\Big)}\]
Where \(MSE\) is the variance of distribution of flipper length within each group, and \(t^*\) is calculated from a t distribution with the degrees of freedom associated with the within group variability.
We want to make pairwise comparisons of flipper length for the species in this data. How many such comparisons do we need to make?
When we calculate multiple confidence intervals or conduct multiple hypothesis tests, there are different levels of Type I error that can be made in the conclusions. Making a Type I error in this context means to incorrectly conclude two groups have unequal means (i.e. incorrectly reject \(H_0\)). The two types of Type I error that can be made are
Individual Type I error: Incorrectly reject \(H_0\) for one specific comparison of group means. P(Type I error) = \(\alpha = 1 - C\), where \(C\) is the confidence level
Family-wise Type I error: Incorrectly reject \(H_0\) for at least one comparison of group means
Even if the probability of making an individual Type I error is low, the probability of making a family-wise Type I error becomes much larger when we make multiple comparisons. Therefore, when we make multiple comparisons, we will select the critical value \(t^*\) to control for the probability of making a family-wise Type I error. Thus, if we want the family-wise error Type I error to be \(\alpha\), we will choose \(t^*\) such that the probability of making a Type I error for any individual confidence interval is \(\frac{\alpha}{m}\) (confidence level = \(1 - \alpha/m\)), where \(m\) is the number of comparisons. This is called the Bonferroni correction.
Suppose we want a family-wise Type I error rate for this analysis to be 0.1. What confidence level should we use for each individual confidence interval of the difference in means between two groups?
Click here to read more about the pairwiseCI
function.
pariwiseCI
function in the pairwiseCI R package. Fill in the code below to calculate the intervals using the confidence level chosen in the previous exercise.pairwiseCI(_____ ~ _____, data = _____,
conf.level = ______, var.equal = TRUE) %>%
kable(digits = 3)
There should only be one submission per team on Gradescope.
Component | Points |
---|---|
Ex 1 - 10 | 40 |
Complete ANOVA Lab Activity | 5 |
Workflow & formatting | 5 |
Grading notes:
There should only be one submission per team on Gradescope.