ANOVA

Analysis of Variance (ANOVA)

Researchers examined the hemoglobin levels of Australian athletes[1]. Hemoglobin is a protein found in red blood cells that transports oxygen throughout the body. Athletes who have low hemoglobin levels are sometimes said to have "sports anemia".

Are there differences in the mean hemoglobin levels of athletes that participate in rowing, basketball, netball (similar to basketball but without dribbling), running (distances longer than 400m), and swimming?

A rowing team of five people in a long, narrow boat on a river. Four rowers are paddling in unison while one person at the front appears to be steering or directing the team.

To compare means of more than two independent populations, we use a hypothesis testing procedure called the Analysis of Variance, ANOVA for short.

The Analysis of Variance (ANOVA) is used to compare means of more than two independent populations.

Why is a procedure for comparing means called the Analysis of Variance? The Analysis of Variance depends on two types of variation.

between-group variability: how the group means vary around the overall mean.
within-group variability: how the measurements in a single group vary around their mean.

If the between-group variability is large compared to the within-group variability, this is evidence that the group means are significantly different.

Use the applet below to investigate sources of variation. The applet generates random samples that have the means and standard deviations indicated with the sliders. Adjust these to see their effect on the variability.

Under what conditions is most of the variation from between groups?
Under what conditions is most of the variation from within groups?
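For readers working without the applet, a similar experiment can be approximated in code. The sketch below is a stand-in, not the applet's own implementation: the function names, group sizes, means, and standard deviations are all illustrative assumptions. It draws normal samples for each group and reports the fraction of the total squared variation that comes from between the groups.

```python
import random

def simulate_groups(means, sds, n_per_group, seed=0):
    """Draw one normal sample per group (hypothetical 'slider' settings)."""
    rng = random.Random(seed)
    return [[rng.gauss(mu, sd) for _ in range(n_per_group)]
            for mu, sd in zip(means, sds)]

def between_share(groups):
    """Fraction of the total squared variation that is between groups."""
    all_obs = [x for g in groups for x in g]
    overall_mean = sum(all_obs) / len(all_obs)
    group_means = [sum(g) / len(g) for g in groups]
    # Group means around the overall mean, weighted by group size.
    between = sum(len(g) * (m - overall_mean) ** 2
                  for g, m in zip(groups, group_means))
    # Observations around their own group mean.
    within = sum((x - m) ** 2
                 for g, m in zip(groups, group_means) for x in g)
    return between / (between + within)

# Setting A: well-separated means, small standard deviations.
separated = simulate_groups(means=[10, 20, 30], sds=[1, 1, 1], n_per_group=30)
# Setting B: identical means, large standard deviations.
overlapping = simulate_groups(means=[20, 20, 20], sds=[5, 5, 5], n_per_group=30)

print(f"setting A, between-group share: {between_share(separated):.3f}")
print(f"setting B, between-group share: {between_share(overlapping):.3f}")
```

Changing the `means` and `sds` arguments plays the role of moving the sliders.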

Analysis of Variance is a hypothesis testing procedure. We'll discuss each of the steps of the hypothesis testing process as we proceed in our discussion of ANOVA.

State hypotheses.
Collect data.
Construct a test statistic.
Compute a p-value.
Draw conclusions (in statistical terms and in context).

Step 1: State Hypotheses

The null hypothesis for an Analysis of Variance is that the means of the k populations are equal. If we denote the means of the k populations $\mu_1$, $\mu_2$, $\ldots$, $\mu_k$ then the null hypothesis stated symbolically is $$H_0: \mu_1=\mu_2=\ldots = \mu_k$$ The alternative hypothesis is that at least two of the means are not equal to each other, that is $$H_A: \mu_i \neq \mu_j \text{ for some }i,j.$$

Example: Athlete Hemoglobin Levels

Let $\mu_1$, $\mu_2$, $\mu_3$, $\mu_4$, and $\mu_5$ denote the mean hemoglobin levels of athletes competing in rowing, basketball, netball, running, and swimming respectively.

The null hypothesis of an Analysis of Variance to compare the mean hemoglobin levels of athletes in the 5 groups is $$H_0: \mu_1=\mu_2=\mu_3=\mu_4=\mu_5$$ and the alternative hypothesis is $$H_A: \mu_i \neq \mu_j \text{ for some }i,j.$$

Step 2: Collect Data

ANOVA is used to compare the means of three or more groups, so the data consist of numeric measurements made across at least three categories.

Example: Athlete Hemoglobin Levels

The full dataset on the Australian athletes is available through the software package R. The data needed for this analysis, containing the variables "hg" (hemoglobin level) and sport, can be found here.

Step 3: Construct a Test Statistic

The test statistic for an ANOVA is the ratio of a quantity (called the MSTr) measuring the between-group variability to a quantity (called the MSE) measuring the within-group variability. Under the null hypothesis, the resulting statistic has an F distribution: $$F = \frac{SSTr/df_1}{SSE/df_2} = \frac{MSTr}{MSE} \sim F_{df_1, df_2}$$ (We'll address the degrees of freedom later on.)

When there is more variability between the groups than within groups, this statistic is large and the p-value is small giving evidence of a difference between the group means.

The between-group variability is measured by the sum of squares for the treatments (SSTr): $$SSTr=\sum_{\texttt{(all groups)}} \texttt{group size}(\texttt{group mean}-\texttt{overall mean})^2$$ This measure is a summary of the variability of the group means around the overall mean.

The sum of squares for error (SSE) measures the variability of the individual subjects within a group around the group mean. $$SSE= \sum_{\texttt{(all groups)}}\sum_{\texttt{(all obs)}}(\texttt{obs j from group i} - \texttt{group i mean})^2 $$
Total Sum of Squares, SST = SSTr + SSE.
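The three sums of squares can be computed directly from their definitions. The following is a minimal sketch on a small made-up dataset (the numbers and the function name `sums_of_squares` are illustrative assumptions, not the athlete data), chosen so the arithmetic is easy to check by hand.

```python
def sums_of_squares(groups):
    """Return (SSTr, SSE, SST) for a list of groups of numeric observations."""
    all_obs = [x for g in groups for x in g]
    overall_mean = sum(all_obs) / len(all_obs)
    group_means = [sum(g) / len(g) for g in groups]
    # SSTr: group size times squared distance of group mean from overall mean.
    sstr = sum(len(g) * (m - overall_mean) ** 2
               for g, m in zip(groups, group_means))
    # SSE: squared distance of each observation from its own group mean.
    sse = sum((x - m) ** 2
              for g, m in zip(groups, group_means) for x in g)
    # SST: squared distance of each observation from the overall mean.
    sst = sum((x - overall_mean) ** 2 for x in all_obs)
    return sstr, sse, sst

# Hypothetical data: three groups with means 2, 5, and 8 (overall mean 5).
toy_groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
sstr, sse, sst = sums_of_squares(toy_groups)
print(sstr, sse, sst)  # 54.0 6.0 60.0
```

Here SSTr = 3(9) + 3(0) + 3(9) = 54 and SSE = 2 + 2 + 2 = 6, and their sum equals the total sum of squares, 60.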

The SSTr measures how the $k$ group means vary around the overall mean, but the SSE takes into account the variability of all $n_T$ observations around their respective group means. $n_T$ is typically much bigger than $k$, so the SSE is usually bigger than the SSTr simply because it is a total of more things. In order to obtain comparable measures of variability, we divide each sum of squares by its corresponding degrees of freedom.

Mean Squares for Error: $MSE = \frac{SSE}{n_T - k}$, where $n_T$ is the total number of observations.

Mean Squares for Treatment: $MSTr = \frac{SSTr}{k - 1}$, where $k$ is the number of groups.

ANOVA test statistic

$F = MSTr/MSE$.

Under the null hypothesis of no difference among the population means, the $F$ statistic has approximately an $F$ distribution with $k-1$ numerator degrees of freedom and $n_T-k$ denominator degrees of freedom.
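Putting the pieces together, the F statistic and its degrees of freedom follow from the raw observations in a few lines. This is a sketch under the same definitions as above; the data are made up and the function name `f_statistic` is an assumption for illustration.

```python
def f_statistic(groups):
    """Return (F, df1, df2) for a one-way ANOVA on a list of groups."""
    k = len(groups)                          # number of groups
    n_total = sum(len(g) for g in groups)    # total number of observations
    all_obs = [x for g in groups for x in g]
    overall_mean = sum(all_obs) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    sstr = sum(len(g) * (m - overall_mean) ** 2
               for g, m in zip(groups, group_means))
    sse = sum((x - m) ** 2
              for g, m in zip(groups, group_means) for x in g)

    mstr = sstr / (k - 1)        # mean squares for treatment
    mse = sse / (n_total - k)    # mean squares for error
    return mstr / mse, k - 1, n_total - k

# Hypothetical data: SSTr = 54 and SSE = 6, so F = (54/2) / (6/6) = 27.
f_value, df1, df2 = f_statistic([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(f_value, df1, df2)  # 27.0 2 6
```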

The ANOVA Table

Results from an analysis of variance carried out in software are displayed in an ANOVA table like the one below. The row containing the totals is not essential to the output.

Source	DF	Sum of Squares	Mean Squares	F-Statistic	p-value
Treatment	$k-1$	SSTr	MSTr	F	$P(F > f)$
Error	$n_T-k$	SSE	MSE
Total	$n_T-1$	SST

Example: Athlete Hemoglobin Levels

The ANOVA table for the analysis of the Australian athletes data is shown below.

	DF	Sum of Squares	Mean Squares	F	p-value
Treatment	4	63.1	15.775	13.59	2.7e-09
Error	131	152	1.161
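The entries in this table can be checked from the sums of squares and degrees of freedom alone. The sketch below does that arithmetic; the function name `anova_table` is an assumption, and the total sample size of 136 is inferred from the error degrees of freedom (131 = $n_T$ - 5) rather than stated in the text. Because the sums of squares shown in the table are rounded, the recomputed values agree with the table only up to rounding.

```python
def anova_table(sstr, sse, k, n_total):
    """Fill in the Treatment and Error rows of a one-way ANOVA table."""
    df_treatment = k - 1
    df_error = n_total - k
    mstr = sstr / df_treatment
    mse = sse / df_error
    return {
        "df_treatment": df_treatment,
        "df_error": df_error,
        "MSTr": mstr,
        "MSE": mse,
        "F": mstr / mse,
    }

# SSTr, SSE, and k = 5 sports come from the table above;
# n_total = 136 is inferred from the 131 error degrees of freedom.
table = anova_table(sstr=63.1, sse=152, k=5, n_total=136)
for name, value in table.items():
    print(f"{name}: {value:.3f}")
```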

Step 4: Compute a p-value

The p-value is the right-tail area under the appropriate F distribution.

$\text{p-value} = P(F \geq f)$, where $F\sim F_{k-1, n_T-k}$ and $f$ is the observed value of the test statistic.
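Statistical software computes this tail area for us, but it can also be sketched with the Python standard library. The code below is one possible approach, not necessarily what any particular package does: it uses the identity $P(F \geq f) = I_x(df_2/2,\, df_1/2)$ with $x = df_2/(df_2 + df_1 f)$, where $I_x$ is the regularized incomplete beta function, evaluated with a textbook continued-fraction expansion. All function names are assumptions made for illustration.

```python
import math

def _beta_continued_fraction(a, b, x, max_iter=300, eps=1e-14):
    """Continued fraction for the incomplete beta function (modified Lentz method)."""
    tiny = 1e-300
    c = 1.0
    d = 1.0 - (a + b) * x / (a + 1.0)
    if abs(d) < tiny:
        d = tiny
    d = 1.0 / d
    h = d
    for m in range(1, max_iter + 1):
        # Even-numbered and odd-numbered coefficients for this value of m.
        even = m * (b - m) * x / ((a + 2 * m - 1) * (a + 2 * m))
        odd = -(a + m) * (a + b + m) * x / ((a + 2 * m) * (a + 2 * m + 1))
        for term in (even, odd):
            d = 1.0 + term * d
            if abs(d) < tiny:
                d = tiny
            c = 1.0 + term / c
            if abs(c) < tiny:
                c = tiny
            d = 1.0 / d
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h

def regularized_incomplete_beta(a, b, x):
    """I_x(a, b) for a, b > 0 and 0 <= x <= 1."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_front = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                 + a * math.log(x) + b * math.log(1.0 - x))
    front = math.exp(log_front)
    # Use whichever form of the continued fraction converges quickly.
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_continued_fraction(a, b, x) / a
    return 1.0 - front * _beta_continued_fraction(b, a, 1.0 - x) / b

def f_right_tail(f, df1, df2):
    """P(F >= f) when F has an F distribution with (df1, df2) degrees of freedom."""
    if f <= 0.0:
        return 1.0
    x = df2 / (df2 + df1 * f)
    return regularized_incomplete_beta(df2 / 2.0, df1 / 2.0, x)

# F statistic and degrees of freedom from the athlete ANOVA table.
p_value = f_right_tail(13.59, 4, 131)
print(f"{p_value:.1e}")  # about 2.7e-09
```

If SciPy is installed, `scipy.stats.f.sf(13.59, 4, 131)` computes the same right-tail area with far less code.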

Example: Athlete Hemoglobin Levels

From the ANOVA table we see that the p-value for the Australian athletes analysis is 2.7e-09.

Step 5: Draw Conclusions

When the p-value is smaller than the chosen $\alpha$ (usually 0.05), we reject the null hypothesis and conclude that there is evidence of a difference in population means.
Otherwise, we fail to reject the null hypothesis.
We cannot determine from this which means are not equal, only whether they are not all equal.
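The decision rule above amounts to a single comparison. A minimal sketch, where the function name `decide` and the second p-value are hypothetical:

```python
def decide(p_value, alpha=0.05):
    """Apply the decision rule: reject H0 when the p-value is below alpha."""
    if p_value < alpha:
        return "reject H0: evidence of a difference in population means"
    return "fail to reject H0"

print(decide(2.7e-09))  # p-value from the athlete ANOVA table
print(decide(0.20))     # hypothetical p-value, for contrast
```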

Example: Athlete Hemoglobin Levels

Since the p-value for the Australian athletes analysis, 2.7e-09, is very small, we reject the null hypothesis and conclude that there is a difference between the mean hemoglobin levels of at least two groups.

Comments Regarding ANOVA

ANOVA works if

the samples from the k populations are random,
the observations within each group are approximately normally distributed (or the sample sizes are large enough to ensure approximate normality of the sample means), and
the underlying k populations all have the same variance.
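The normality and equal-variance conditions are usually judged from the data. As an informal first look (a sketch with hypothetical numbers, not a formal test and not the athlete data), one can tabulate each group's size, mean, and standard deviation and see whether the standard deviations are of similar size. The function name `group_summaries` is an assumption for illustration.

```python
import statistics

def group_summaries(groups):
    """Map each group name to (n, sample mean, sample standard deviation)."""
    return {name: (len(obs), statistics.mean(obs), statistics.stdev(obs))
            for name, obs in groups.items()}

# Hypothetical measurements for three groups.
example_groups = {
    "group 1": [12.0, 14.0, 16.0, 15.0, 13.0],
    "group 2": [15.0, 17.0, 16.0, 18.0, 14.0],
    "group 3": [13.0, 13.5, 15.5, 14.0, 16.0],
}

summaries = group_summaries(example_groups)
for name, (n, mean, sd) in summaries.items():
    print(f"{name}: n={n}, mean={mean:.2f}, sd={sd:.2f}")
```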

If the F test is significant (i.e., we reject the null hypothesis that the means are equal), this means that there is at least one pair of the k means that are significantly different. There are additional procedures that help us to determine which pairs of means differ.

Pairwise Comparisons

If we reject the null hypothesis, we can use pairwise comparisons to determine which means are different, and by how much. The comparisons are made by constructing confidence intervals or conducting hypothesis tests for the difference between the means of each pair of populations. If a confidence interval does not contain 0, or if the p-value for a hypothesis test is less than the significance level then there is evidence that the means of the compared groups are not equal.

Pairwise comparisons can tell us which means are different, and by how much.

The results of pairwise comparisons as a follow-up to significant results from an ANOVA are shown below.

The first column shows which groups are being compared (in the image the groups are simply labeled 1, 2, 3).
The second column, 'diff', gives a point estimate for the difference in the means of two groups.
The third column, 'lwr', is a lower endpoint for a 95% confidence interval for the differences in the means.
The next column, 'upr' is the upper endpoint of the confidence interval.
The final column, 'p adj', gives an adjusted p-value for a hypothesis test of whether the difference in the two means is 0.

If the confidence interval does not contain 0, or the p-value is smaller than 0.05, this is evidence that the two means are different.

Annotated table showing pairwise comparison results with labels explaining each section. The first column (2–1, 3–1, 3–2) is outlined in blue and labeled “The groups being compared.” The “diff” column is outlined in orange and labeled “Point estimates of the differences in means.” The “lwr” and “upr” columns are outlined in green and labeled “Confidence Intervals for differences between pairs of means.” The “p adj” column is outlined in purple and labeled “p-values for pairwise hypothesis tests.”

Notice that the point estimates for all three pairs are negative. Each estimate is the mean of the group listed first minus the mean of the group listed second, so a negative estimate means that the group listed second has the larger mean.
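The reading rules for a pairwise comparison table can be written out as a short function. The sketch below assumes each row is labeled "A-B" and that diff estimates mean(A) minus mean(B); the numbers are invented for illustration and are not taken from the figure or the athlete data, and the names `Comparison` and `interpret` are hypothetical.

```python
from typing import NamedTuple

class Comparison(NamedTuple):
    pair: str     # "A-B": diff estimates mean(A) - mean(B)
    diff: float   # point estimate of the difference in means
    lwr: float    # lower endpoint of the confidence interval
    upr: float    # upper endpoint of the confidence interval
    p_adj: float  # p-value for the pairwise test

def interpret(row, alpha=0.05):
    """Describe one row of a pairwise comparison table in words."""
    first, second = row.pair.split("-")
    interval_excludes_zero = row.lwr > 0 or row.upr < 0
    if not (interval_excludes_zero or row.p_adj < alpha):
        return f"{row.pair}: no evidence of a difference"
    # A positive diff favors the first-listed group, a negative diff the second.
    larger = first if row.diff > 0 else second
    return (f"{row.pair}: evidence of a difference; "
            f"group {larger} has the larger estimated mean")

# Hypothetical rows with negative point estimates.
rows = [
    Comparison("2-1", -3.0, -5.5, -0.5, 0.012),
    Comparison("3-1", -4.2, -6.7, -1.7, 0.001),
    Comparison("3-2", -1.2, -3.7, 1.3, 0.480),
]
for row in rows:
    print(interpret(row))
```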

Example: Athlete Hemoglobin Levels

Since we rejected the null hypothesis in the Australian athletes analysis, we use pairwise comparisons to see which group means are different.

Table showing pairwise comparison results between sports activities with four columns labeled diff, lwr, upr, and p adj. The comparisons listed are Netball–B_Ball, Row–B_Ball, Swim–B_Ball, T_400m–B_Ball, Row–Netball, Swim–Netball, T_400m–Netball, Swim–Row, T_400m–Row, and T_400m–Swim. Rows comparing Netball to other activities are highlighted with orange boxes, showing significant adjusted p-values (all less than 0.001) for Netball compared to B_Ball, Row, Swim, and T_400m.

Comparisons that show evidence of a difference are highlighted in the table. This shows that there is evidence that the athletes in the following groups have different mean hemoglobin levels:

Netball and Basketball
Rowing and Netball
Swimming and Netball
Running and Netball

In other words, the mean hemoglobin level of the athletes in the netball group is different than the means of all four other groups.

The point estimate comparing the netball and basketball means came from subtracting the basketball mean from the netball mean. Since the result is negative, the basketball mean is larger than the netball mean.

For all the other comparisons involving netball, the netball mean was subtracted from the other group mean. These point estimates are all positive, indicating that the netball mean was smaller.

There is evidence that the mean hemoglobin level of athletes who participate in netball is lower than that of athletes who participate in each of the four other sports.

Footnotes:

[1] Telford, R.D. and Cunningham, R.B. 1991. Sex, sport and body-size dependency of hematology in highly trained athletes. Medicine and Science in Sports and Exercise 23: 788-794.