Chi-square Tests

Researchers studying lizards investigated whether the color people wore affected how the lizards responded to them[1]. The researchers studied water anole lizards, for which orange is the sexually selected color. They dressed in orange, blue, or green while trying to capture the lizards and recorded the color worn and whether each capture attempt was successful.

Green anole lizard clinging to a light surface with its head turned and tail stretched behind.

In the study, the researchers wanted to determine if there was a relationship between two categorical variables. Hypothesis tests for categorical variables are referred to as chi-square (usually pronounced "kai square") tests because the test statistics involved have chi-square distributions. Two types of chi-square tests are described on this page.

The 'goodness-of-fit' test is used with one categorical variable to determine whether the distribution of counts in the different levels agrees with some researcher-defined model. For example, this test could be used to determine whether a die is fair. The categorical variable of interest is the outcome of a roll, and observed counts for the six categories would each be expected to be about 1/6 of the total if the die is fair.

The chi-square Test of Independence is used to determine whether there is a relationship between two categorical variables.

Hypothesis tests for categorical data

A goodness-of-fit test that enables us to compare the proportions in multiple categories to hypothesized values and
A test for independence that enables us to determine whether there is an association between two categorical variables.

The Chi-Square Goodness-of-Fit Test

A Goodness-of-Fit Test measures the discrepancy between observed cell counts and cell counts expected under the null hypothesis (according to the hypothesized distribution) to assess whether the hypothesized distribution is plausible.

Example: Movie Ratings

According to the-numbers.com, between 1995 and 2020 about 27% of movie revenue came from movies rated R, about 47% from movies rated PG-13, and about 26% from movies with other ratings (PG, G, NR, etc.). Do the movies available from Netflix reflect these proportions? That is, are 47% of Netflix offerings rated PG-13 and 27% rated R, with the other ratings making up the remaining 26%?

Since this is a hypothesis test, we will consider each of the steps in the process as it relates to the goodness-of-fit test:

State hypotheses.
Collect data.
Construct a test statistic.
Compute a p-value.
Draw conclusions (in statistical terms and in context).

Step 1: State Hypotheses

For a goodness-of-fit test, the null hypothesis is that the true proportions in each of the $k$ categories are equal to the hypothesized values. The parameters of interest are the probabilities that an observation falls in each cell: $p_i$ denotes the true probability that an observation is in cell $i$, and $p_i^*$ denotes the null value, that is, the probability expected under the null hypothesis.

In symbols, the null hypothesis is $$H_0: p_1=p_1^*, p_2=p_2^*, \ldots ,p_k=p_k^* $$ The alternative hypothesis is that at least one of the true cell proportions is not equal to the hypothesized value. $$H_A: p_i\neq p_i^* \texttt{ for some }i$$

Example: Movie Ratings

Let $p_1$, $p_2$, and $p_3$ correspond to the groups with ratings R, PG-13, and everything else respectively. Using information provided previously:

$H_0:p_1=0.27$, $p_2=0.47$, $p_3=0.26$

$H_A: p_i\neq p_i^*$ for some $i$

Step 2: Collect Data

The data for a goodness-of-fit test are displayed in a one-way contingency table, which shows the counts in the different levels of a single categorical variable.

Each of the $n$ observations is classified into exactly one of $k$ categories, or cells, in the table. The cell frequencies are denoted $x_1, x_2, \dots, x_k$, and they sum to the sample size: $x_1+x_2+ \cdots + x_k = n$.

A one-way contingency table showing counts for categories 1 through k with a total count n.

Example: Movie Ratings

The data for the Netflix movie ratings example are shown in the table.

$\begin{array}{l|ccc|c} \text{Rating} & \text{R} & \text{PG-13} & \text{Other} & \text{Total}\\ \hline \text{Observed count} & 19 & 11 & 2 & 32 \end{array}$

Step 3: Construct a Test Statistic

If the null hypothesis is true, that is, $p_1=p_1^*$, $p_2=p_2^*$, ..., $p_k=p_k^*$, and the total number of observations is $n$, then we would expect to see $p_i^*\cdot n$ observations in cell $i$. We denote the expected count for cell $i$ as $e_i$.

$e_i = p_i^*\cdot n$

Note: $e_1+ e_2+ \dots+e_k=n$.

Warning: For the test to be valid, each of the expected counts should be at least 5.
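As a rough illustration, the expected counts and the validity condition can be checked in a few lines of Python. This is a minimal sketch that uses the null proportions and sample size from the movie ratings example on this page; the function name `expected_counts` is made up for this illustration and is not part of any library.

```python
def expected_counts(null_probs, n):
    """Return e_i = p_i* * n for each category under the null hypothesis."""
    if abs(sum(null_probs) - 1.0) > 1e-9:
        raise ValueError("null probabilities must sum to 1")
    return [p * n for p in null_probs]

# Null proportions for R, PG-13, and Other; n = 32 movies in the sample.
null_probs = [0.27, 0.47, 0.26]
n = 32

e = expected_counts(null_probs, n)
print([round(v, 2) for v in e])   # expected counts e_1, e_2, e_3
print(round(sum(e), 2))           # the expected counts add up to n
print(all(v >= 5 for v in e))     # validity condition: every e_i is at least 5
```

If the last line prints `False`, the chi-square approximation used later is doubtful for that table.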

Example: Movie Ratings

$\begin{array}{rclcl} e_1 & = & 0.27\times 32 & = & 8.64\\ e_2 & = & 0.47\times 32 & = & 15.04\\ e_3 & = & 0.26\times 32 & = & 8.32 \end{array}$

$\begin{array}{l|ccc|c} \text{Rating} & \text{R} & \text{PG-13} & \text{Other} & \text{Total}\\ \hline \text{Expected count} & 8.64 & 15.04 & 8.32 & 32 \end{array}$

The test statistic for the goodness-of-fit test facilitates comparison of the observed and expected counts.

The Chi-Square Statistic for the Goodness-of-Fit Test
$X^2=\sum_{i=1}^k \frac{(x_i-e_i)^2}{e_i} \sim \chi^2_{k-1}$.

That is $X^2=\sum_{\texttt{(all cells)}} \frac{(\texttt{ith observed value} - \texttt{ith expected value})^2}{\texttt{ith expected value}}$

Under the null hypothesis, the statistic has approximately a chi-square distribution with $k-1$ degrees of freedom (the number of categories minus 1).

Example: Movie Ratings

The test statistic for the movie rating goodness-of-fit test is
$$X^2 = \frac{(19-8.64)^2}{8.64}+\frac{(11-15.04)^2}{15.04}+\frac{(2-8.32)^2}{8.32} = 18.308$$
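The same arithmetic can be sketched in Python. The helper name `chi_square_gof` is hypothetical (chosen for this example only); it takes the observed counts and the null proportions and returns the statistic together with its degrees of freedom.

```python
def chi_square_gof(observed, null_probs):
    """Return (X^2, df) for a chi-square goodness-of-fit test."""
    n = sum(observed)
    expected = [p * n for p in null_probs]
    # Sum of (observed - expected)^2 / expected over all k cells.
    stat = sum((x - e) ** 2 / e for x, e in zip(observed, expected))
    return stat, len(observed) - 1

observed = [19, 11, 2]             # R, PG-13, Other
null_probs = [0.27, 0.47, 0.26]    # hypothesized proportions

stat, df = chi_square_gof(observed, null_probs)
print(round(stat, 3), df)          # 18.308 2
```

The first cell (R-rated movies) contributes the most to the total, since 19 movies were observed where fewer than 9 were expected.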

Step 4: Compute a p-value

The p-value for this test comes from the right tail of a $\chi^2_{k-1}$ distribution.

$\text{p-value} = P(\chi^2_{k-1} > X^2)$

Example: Movie Ratings

$P(\chi^2_2 > X^2) = P(\chi^2_2 > 18.308) = 0.0001$
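Python's standard library has no chi-square distribution, so the sketch below computes the right-tail probability from the standard closed-form expressions for integer degrees of freedom (a finite sum for even degrees of freedom, a complementary error function plus a finite sum for odd ones). The function name `chi2_sf` is an assumption made for this example; in practice a statistics package such as SciPy (`scipy.stats.chi2.sf`) would normally be used instead.

```python
import math

def chi2_sf(x, df):
    """Right-tail probability P(chi2_df > x) for a positive integer df."""
    if df < 1 or int(df) != df:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{j=0}^{df/2 - 1} (x/2)^j / j!
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) plus a finite correction sum.
    tail = math.erfc(math.sqrt(half))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-half)
    for r in range(1, (df - 1) // 2 + 1):
        tail += term
        term *= x / (2 * r + 1)
    return tail

# Movie ratings example: X^2 = 18.308 with k - 1 = 2 degrees of freedom.
print(round(chi2_sf(18.308, 2), 4))   # 0.0001
```

For 2 degrees of freedom the formula reduces to $e^{-X^2/2}$, which is why the movie p-value can be checked by hand as $e^{-9.154}\approx 0.0001$.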

Step 5: Draw Conclusions

As with previous tests, we must draw conclusions statistically (that is, reject or fail to reject the null hypothesis) and in context. If we reject the null hypothesis, we conclude that the probabilities stated in the null hypothesis do not fit the data.

Example: Movie Ratings

The p-value for the movie rating goodness-of-fit test is 0.0001, which is much smaller than the usual significance level of $0.05$, so we reject the null hypothesis. The results are highly statistically significant.

There is strong evidence that the proportions of Netflix movies in the various ratings categories do not reflect the proportions of movie revenue generated by movies in these ratings categories.

The Chi-square Test of Independence

A chi-square test of independence is used to determine whether two variables are independent.

Example: Lizards

Researchers studying lizards investigated whether the color people wore affected how the lizards responded to them[1]. The researchers studied water anole lizards, for which orange is the sexually selected color. They dressed in orange, blue, or green while trying to capture the lizards and recorded the color worn and whether each capture attempt was successful.

Step 1: State Hypotheses

For a test of independence, the null hypothesis is that the variables are independent of each other; the alternative hypothesis is that they are not.

Example: Lizards

$H_0$: Lizard capture is independent of color worn by researcher.
$H_A$: Lizard capture is not independent of color worn by researcher.

Step 2: Collect Data

The data for a test of independence are displayed in a two-way contingency table. This is a table in which the cell counts indicate how many observations were classified in the corresponding row and column categories.

If the first variable has $r$ levels and the second has $c$ levels, then for $1 \leq i \leq r$ and $1 \leq j \leq c$, $x_{ij}$ is the count in the $i$th row and $j$th column of the table.

NOTATION: A dot in place of an index is used to indicate a total over the index category, e.g. $x_{1\cdot}$ represents the total for the first row, $x_{\cdot 1}$ is the total for the first column, and $x_{\cdot\cdot}$ is the total over all rows and columns (the grand total).
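The dot notation maps directly onto row and column sums. The sketch below uses the observed counts from the lizard example on this page; it assumes one row per clothing color (in the order of the page's table) and that the first column holds the captured counts, which is an assumption about the table layout rather than something stated in the text.

```python
# Observed counts x_ij: one row per clothing color,
# columns assumed to be (captured, not captured).
observed = [
    [49, 12],
    [52, 31],
    [49, 25],
]

row_totals = [sum(row) for row in observed]          # x_{i.} for each row i
col_totals = [sum(col) for col in zip(*observed)]    # x_{.j} for each column j
grand_total = sum(row_totals)                        # x_{..}

print(row_totals)    # [61, 83, 74]
print(col_totals)    # [150, 68]
print(grand_total)   # 218
```

Summing the row totals and summing the column totals must give the same grand total, which is a quick check that a table was entered correctly.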

A two-way contingency table showing counts for r rows and c columns with row and column totals.

Example: Lizards

A two-way contingency table for the lizard/color data:

A two-way contingency table showing lizard colors, number captured or not captured, and totals.

Step 3: Construct a Test Statistic

The null hypothesis for the test of independence is that the row and column variables are independent of one another. Under independence, the probability of falling in cell $(i,j)$ is the product of the row and column probabilities, which are estimated by $x_{i\cdot}/x_{\cdot\cdot}$ and $x_{\cdot j}/x_{\cdot\cdot}$. Multiplying by the grand total gives the expected count for the $i$th row and $j$th column of the table:

$e_{ij} = \frac{x_{i\cdot}x_{\cdot j}}{x_{\cdot\cdot}} = \frac{\texttt{(row i total)}\texttt{(column j total)}}{\texttt{(grand total)}} $

Warning: For the test to be valid, each of the expected counts should be at least 5.
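A minimal sketch of the expected-count formula in Python follows. The function name `expected_table` is hypothetical, and the counts are the observed lizard counts used on this page, with the first column assumed to be the captured counts.

```python
def expected_table(observed):
    """Expected counts e_ij = (row i total) * (column j total) / (grand total)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

# Observed lizard counts: rows are clothing colors,
# columns assumed to be (captured, not captured).
observed = [[49, 12], [52, 31], [49, 25]]

expected = expected_table(observed)
for row in expected:
    print([round(e, 3) for e in row])

# Validity condition: every expected count should be at least 5.
print(min(min(row) for row in expected) >= 5)
```

Each row of expected counts adds up to the observed row total, and each column to the observed column total, because the formula only redistributes the margins.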

Example: Lizards

$\begin{array}{ccccccccccc} e_{11} & = & \frac{61\cdot 150}{218} & = & 41.972 & & e_{12} & = & \frac{61\cdot 68}{218} & = & 19.028\\ e_{21} & = & \frac{83\cdot 150}{218} & = & 57.110 & & e_{22} & = &\frac{83\cdot 68}{218} & = & 25.890\\ e_{31} & = & \frac{74\cdot 150}{218} & = & 50.917 & & e_{32} & = & \frac{74\cdot 68}{218} & = & 23.083 \end{array}$

A two-way contingency table showing lizard colors and expected counts of captured or not captured, and totals.

The test statistic for the test of independence is very similar to that of the goodness-of-fit test. It is found by summing, over all cells, the squared difference between the observed and expected counts divided by the expected count.

The Pearson Chi-Square Statistic for the Test of Independence
$X^2=\sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(x_{ij}-e_{ij})^2}{e_{ij}} \sim \chi^2_{(r-1)(c-1)}$.

That is $X^2=\sum_{\texttt{(all cells)}} \frac{(\texttt{ijth observed value} - \texttt{ijth expected value})^2}{\texttt{ijth expected value}}$

Under the null hypothesis, the statistic has approximately a chi-square distribution with $(r-1)(c-1)$ degrees of freedom (the number of rows minus 1, times the number of columns minus 1).

Example: Lizards

The test statistic for the lizard/color test of independence is
$$\small{\begin{array}{ccccccc} X^2 & = & \frac{(49-41.972)^2}{41.972} & + & \frac{(12-19.028)^2}{19.028} & + & \frac{(52-57.110)^2}{57.110}\\ & + & \frac{(31-25.890)^2}{25.890} & + & \frac{(49-50.917)^2}{50.917} & + & \frac{(25-23.083)^2}{23.083} \\ & = & 5.469 & & & &\end{array}}$$

Step 4: Compute a p-value

The p-value for the test of independence comes from the right tail of a chi-square distribution with $(r-1)(c-1)$ degrees of freedom.

$\text{p-value} = P(\chi^2_{(r-1)(c-1)} > X^2)$

Example: Lizards

The table has $r=3$ rows and $c=2$ columns, so the statistic has $(3-1)(2-1)=2$ degrees of freedom.

$P(\chi^2_2 > X^2) = P(\chi^2_2 > 5.469) = 0.065$

Step 5: Draw Conclusions

Example: Lizards

The p-value for the lizard/color test of independence is 0.065, which is larger than the usual significance level of $0.05$, so we fail to reject the null hypothesis. The results are not statistically significant at the 5% level.

These data do not provide sufficient evidence that the success of a capture and the color the researcher was wearing are dependent.
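The lizard computation can be reproduced directly from the observed counts. The sketch below prints each cell's expected count and its contribution to the statistic, then uses the fact that for 2 degrees of freedom the chi-square right-tail probability has the closed form $e^{-x/2}$. The column order (captured, not captured) is an assumption about the table layout.

```python
import math

# Observed lizard counts: rows are clothing colors,
# columns assumed to be (captured, not captured).
observed = [[49, 12], [52, 31], [49, 25]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

stat = 0.0
for i, row in enumerate(observed):
    for j, x in enumerate(row):
        e = row_totals[i] * col_totals[j] / n       # expected count e_ij
        contribution = (x - e) ** 2 / e             # this cell's share of X^2
        stat += contribution
        print(f"cell ({i + 1},{j + 1}): observed={x}, "
              f"expected={e:.3f}, contribution={contribution:.3f}")

df = (len(observed) - 1) * (len(observed[0]) - 1)   # (r - 1)(c - 1)

# Closed form valid only for df = 2: P(chi2_2 > x) = exp(-x / 2).
p_value = math.exp(-stat / 2)
print(f"X^2 = {stat:.3f}, df = {df}, p-value = {p_value:.3f}")
# X^2 = 5.469, df = 2, p-value = 0.065
```

The per-cell output shows that the first row accounts for most of the statistic, while the third row's observed counts are close to what independence predicts.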

Example: Family Friendly Movie Length

Is being family friendly (rated G or PG) independent of whether a movie is short (less than 90 minutes)?

To answer this question, we'll conduct a chi-square test of independence.

The hypotheses for the test are:

$H_0$: family friendly and short are independent.
$H_A$: family friendly and short are not independent.

To collect data, we chose a random sample of 79 movies available from major streaming services and classified each by whether it was family friendly and whether it was short. The resulting data are shown in the table.

A two-way contingency table showing observed counts for short, family-friendly movies and totals.

The expected counts are: $$\small{\begin{array}{ccccc} e_{11} & = & \frac{23\times 28}{79}& = & 8.15\\ e_{12} & = & \frac{23\times 51}{79} & = & 14.85\\ e_{21} & = & \frac{56\times 28}{79} & = & 19.85\\ e_{22} & = & \frac{56\times 51}{79} & = & 36.15\end{array}}$$ which give a test statistic of $$\small{\begin{array}{rcl} X^2 & = & \frac{(18-8.15)^2}{8.15}+\frac{(5-14.85)^2}{14.85}+\frac{(10-19.85)^2}{19.85}+\frac{(46-36.15)^2}{36.15}\\ & = & 25.998\end{array}}$$ where the final value is computed from the unrounded expected counts.
Notice that all of the expected counts are larger than 5, so use of the chi-square distribution is reasonable.

The table has 2 rows and 2 columns, so the statistic has $(2-1)(2-1)=1$ degree of freedom. The p-value for the test is $P(\chi^2_1 > 25.998) \approx 3.4\times 10^{-7}$.

Since the p-value is less than 0.05, it is unlikely that these data would occur by chance if family friendly and short were independent. We reject the null hypothesis and conclude that the variables are not independent.
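For a 2×2 table the whole test fits in one short function. The sketch below is an illustration under stated assumptions: the name `independence_test_2x2` is hypothetical, the table layout follows the row and column totals used in the expected-count calculation above (rows total 23 and 56, columns total 28 and 51), and the p-value relies on the identity $P(\chi^2_1 > x) = \operatorname{erfc}(\sqrt{x/2})$, which holds only for 1 degree of freedom.

```python
import math

def independence_test_2x2(observed):
    """Chi-square test of independence for a 2x2 table of counts.

    Returns (X^2, p-value). The p-value uses the df = 1 identity
    P(chi2_1 > x) = erfc(sqrt(x / 2)).
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            e = row_totals[i] * col_totals[j] / n
            if e < 5:
                raise ValueError("expected count below 5; "
                                 "the chi-square approximation is doubtful")
            stat += (observed[i][j] - e) ** 2 / e
    return stat, math.erfc(math.sqrt(stat / 2))

# Observed counts from the family-friendly / short movie example.
observed = [[18, 5], [10, 46]]
stat, p_value = independence_test_2x2(observed)
print(f"X^2 = {stat:.3f}, p-value = {p_value:.2e}")
print("reject H0 at the 0.05 level:", p_value < 0.05)
```

Raising an error when an expected count falls below 5 is one possible design choice; a gentler alternative would be to return the statistic along with a warning flag.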

[1] Fondren, A., Swierk, L., & Putman, B. J. (2019). Clothing color mediates lizard responses to humans in a tropical forest. Biotropica, 52(1), 172–181. doi: 10.1111/btp.12744