Lab 04: ANOVA with Palmer Penguins

due Thu, Oct 7 at 11:59pm

This lab will focus on Analysis of Variance. You will do a short individual ANOVA activity in Sakai, then as a group, you will use ANOVA to analyze data about penguins in the Palmer Archipelago near Palmer Station.

Learning goals

By the end of the lab you will be able to…

ANOVA Activity (~ 10 - 15 minutes)

You will work individually for this portion of the lab. The assessment on Sakai will be graded for completion and will be included as part of the Lab 04 grade. Your TA will explain the instructions. You also should have received an email with instructions about this activity earlier in the day.

You should complete the assessment before working on the group portion of lab.

Getting started

Workflow: Using git and GitHub as a team

There are no Team Member markers in this lab; however, you should use a similar workflow as in lab 03. Only one person should type in the group’s .Rmd file at a time. Once that person has finished typing the group’s responses, they should knit, commit, and push the changes to GitHub. All other teammates can pull to see the updates in RStudio.

Update YAML

One team member: Change the author to your team name and include each team member’s name in the author field of the YAML in the following format. Team Name: Member 1, Member 2, Member 3, Member 4. Knit, commit, and push the changes to GitHub.**

All other team members: Click the Pull button in the Git pane to get the updated document. You should see the updated name in the .Rmd file.**

Packages

We’ll use the following packages in this lab.

library(tidyverse)
library(knitr)
library(broom)
library(pairwiseCI)
# add more packages as needed

Data: Palmer penguins

The data is the penguins data set from the palmerpenguins R package maintained by Dr. Allison Horst. This data set contains measurements and other characteristics for over 300 penguins observed near Palmer Station in Antarctica. The data were originally collected by Dr. Kristen Gorman.

The data set is in penguins.csv located in the data folder.

The following variables are in the penguins data set.

variable class description
species integer Penguin species (Adelie, Gentoo, Chinstrap)
island integer Island where recorded (Biscoe, Dream, Torgersen)
bill_length_mm double Bill length in millimeters (also known as culmen length)
bill_depth_mm double Bill depth in millimeters (also known as culmen depth)
flipper_length_mm integer Flipper length in mm
body_mass_g integer Body mass in grams
sex integer sex of the animal
year integer year recorded

Exercises

Can we differentiate penguin species based on their flipper length? To analyze this question we will use Analysis of Variance to compare the mean flipper length for Adelie, Gentoo, Chinstrap penguins found near the Palmer Station in Antarctica.

As you complete the lab

  1. Let’s take a look at the flipper_length_mm based on the species. Create a visualization to explore the relationship between the two variables. Interpret the visualization in the context of the data.

  2. We’d like to use ANOVA to assess if there’s a relationship between flipper length and species. Write the null and alternative hypotheses for the ANOVA test (a) using mathematical notation (b) using words.

  3. Before conducting the test, we will check the conditions for ANOVA (normality, constant variance, independence). State whether each condition is satisfied. Briefly explain your response including any plots, summary statistics, etc. used to make your determination.

  4. Use R to calculate and display the ANOVA table.

  5. Use the ANOVA table to answer the following:

    • What is the test statistic? Briefly explain what this value means in the context of the data.
    • What distribution was used to calculate the p-value?
  6. What is the conclusion from the ANOVA test? State the conclusion in the context of the data.

  7. We know at least one species has a different mean flipper length, and we can do further analysis to see how the means compare. To do so, we will look at confidence intervals for the difference in the mean flipper length between species.

The general form of the confidence interval for the difference in means between groups \(i\) and \(j\) is

\[(\bar{y}_i-\bar{y}_j) \pm t^* \sqrt{MSE\Big(\frac{1}{n_i}+\frac{1}{n_j}\Big)}\]

Where \(MSE\) is the variance of distribution of flipper length within each group, and \(t^*\) is calculated from a t distribution with the degrees of freedom associated with the within group variability.

We want to make pairwise comparisons of flipper length for the species in this data. How many such comparisons do we need to make?

  1. When we calculate multiple confidence intervals or conduct multiple hypothesis tests, there are different levels of Type I error that can be made in the conclusions. Making a Type I error in this context means to incorrectly conclude two groups have unequal means (i.e. incorrectly reject \(H_0\)). The two types of Type I error that can be made are

    • Individual Type I error: Incorrectly reject \(H_0\) for one specific comparison of group means. P(Type I error) = \(\alpha = 1 - C\), where \(C\) is the confidence level

    • Family-wise Type I error: Incorrectly reject \(H_0\) for at least one comparison of group means

    Even if the probability of making an individual Type I error is low, the probability of making a family-wise Type I error becomes much larger when we make multiple comparisons. Therefore, when we make multiple comparisons, we will select the critical value \(t^*\) to control for the probability of making a family-wise Type I error. Thus, if we want the family-wise error Type I error to be \(\alpha\), we will choose \(t^*\) such that the probability of making a Type I error for any individual confidence interval is \(\frac{\alpha}{m}\) (confidence level = \(1 - \alpha/m\)), where \(m\) is the number of comparisons. This is called the Bonferroni correction.

    Suppose we want a family-wise Type I error rate for this analysis to be 0.1. What confidence level should we use for each individual confidence interval of the difference in means between two groups?

Click here to read more about the pairwiseCI function.

  1. Calculate the pairwise confidence intervals using the confidence level from the previous question. We can quickly calculate the pairwise confidence interval using the pariwiseCI function in the pairwiseCI R package. Fill in the code below to calculate the intervals using the confidence level chosen in the previous exercise.
pairwiseCI(_____ ~ _____, data = _____, 
           conf.level = ______, var.equal = TRUE) %>%
  kable(digits = 3)
  1. How does the flipper length compare across the species of penguins? Briefly explain your response in 2 - 3 sentences, using results from the previous exercise to support your conclusions.

Submission

There should only be one submission per team on Gradescope.

Grading (50 pts)


Component Points
Ex 1 - 10 40
Complete ANOVA Lab Activity 5
Workflow & formatting 5

Grading notes:

There should only be one submission per team on Gradescope.