Everything we care about lies somewhere in the middle, where pattern and randomness interlace.

Sampling Distributions

According to a recent poll by Gallup.com, 59% of Americans believe that the amount they pay in income taxes is fair. The survey was based on a sample of 1,017 American adults.

Since the pollsters talked to only 1,017 people, 59% is the value of a statistic, not of a parameter. How does this value relate to the value of the parameter?

Close-up of a U.S. federal tax document, Internal Revenue Service Form 1040.

The true percentage of American adults who think the amount of their taxes is fair is a characteristic (a parameter) of a large population and is therefore difficult to discover exactly. The value 59% is a statistic, an estimate of the parameter. This estimate is based on a random sample from the population.

If a different sample of 1,017 American adults were chosen, the percentage who think the amount of taxes they pay is fair would likely differ from 59% but still be fairly close to the value of the parameter. It is also possible, but less likely, that a different sample would yield a value that is very different from the value of the parameter. This variation is due to the random sampling.


Sample Variation

In the figure, the balls in the large "jar" constitute the population of interest. Suppose the blue balls represent American adults who think the amount they pay in taxes is fair and the orange balls represent those who do not.

There are too many balls to count them all with limited time and resources. Instead, it is helpful to choose a sample from the population, find the percentage of blue balls in the sample, and use that percentage to estimate the percentage of blue balls in the population.

A central box filled with many blue and orange circles is surrounded by eight smaller boxes containing fewer circles in the same colors. Each smaller box has a label below it with “Sample p” followed by a decimal value (e.g., 0.56, 0.67, 0.55). The arrangement visually represents multiple random samples from a larger population, with sample proportions varying slightly around a central value.

Each of the smaller jars represents a random sample from the population. Since the samples are chosen randomly, the percentage of blue balls in a given sample may not be the same as the percentage in the population, but it will likely be close.

It is also unlikely that each sample will contain exactly the same percentage of blue balls, but the percentages in the samples are likely to be fairly similar. In fact, it is possible to describe the distribution of these percentages mathematically.
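The variation between samples described above can be sketched with a short simulation. This is a minimal illustration, not the applet's method: the population proportion of 0.6, the sample size of 100, and the function name `sample_proportions` are all assumptions chosen for the example, and each ball is drawn independently (as if the jar were very large).

```python
import random

def sample_proportions(population_p, sample_size, num_samples, seed=0):
    """Return the proportion of 'blue' draws in each of num_samples random samples."""
    rng = random.Random(seed)
    proportions = []
    for _ in range(num_samples):
        # Assumption: each draw is blue with probability population_p,
        # independently of the others (a very large jar).
        blue = sum(rng.random() < population_p for _ in range(sample_size))
        proportions.append(blue / sample_size)
    return proportions

# Eight samples, like the eight small jars in the figure.
props = sample_proportions(population_p=0.6, sample_size=100, num_samples=8)
print([round(p, 2) for p in props])
```

The printed proportions generally differ from one another and from 0.6, but they cluster near it, which is the behavior the figure illustrates.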


Sampling Distribution

A statistic is a random variable since it represents numerically the results of an experiment (drawing a random sample). Like all random variables, a statistic has a distribution. The distribution of a statistic is called the sampling distribution. The ability to describe the distribution of a statistic makes it possible to conduct statistical inference.

A sampling distribution is the distribution of a statistic.


Use the applet below to investigate the sampling distribution of the sample proportion.



The large box in the applet is the population, a 'jar' of orange and blue balls.


The statistic is an estimate of the proportion of blue balls in the population.
The 'dot plot' plot type is useful for seeing clearly where each new statistic falls on the graph. However, if you choose many samples, eventually they will not all fit in the graph window. Choose plot type 'histogram' to see the distribution of the sample values.



The applet displays a simulated distribution based on the chosen samples. In order to see the complete sampling distribution, it would be necessary to find the value of the statistic for every possible sample of a given size. With a population of 10,000 and a sample of size 100, there are around $6.5\times10^{241}$ of these!
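The count of possible samples is the binomial coefficient "10,000 choose 100". As a quick check, assuming samples are drawn without replacement and order does not matter, it can be computed exactly with Python's standard library (the variable names are just for illustration):

```python
import math

population_size = 10_000
sample_size = 100

# Number of distinct samples of size 100 from 10,000 items (order ignored).
num_samples = math.comb(population_size, sample_size)

# Express the exact integer in rough scientific notation.
digits = str(num_samples)
exponent = len(digits) - 1          # power of ten
leading = int(digits[:3]) / 100     # first three digits as d.dd

print(f"about {leading} x 10^{exponent} possible samples")
```

No simulation could visit this many samples, which is why the applet shows only an approximation to the sampling distribution.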

In the following sections, we will look at sampling distributions related to the sample mean and sample proportion. We will investigate these further with simulation and describe them mathematically.


The Sampling Distribution of the Sample Mean, $\bar{X}$

Many research questions involve a population mean, $\mu$. The sample mean $\bar{X}$ is an appropriate estimator for $\mu$. This section discusses the distribution of the sample mean $\bar{X}$ under the following conditions:

  1. The sample is drawn from a normally distributed population and the value of the population variance, $\sigma^2$, is known.
  2. The sample is drawn from a population that is not normally distributed but the sample size is large.
  3. The sample is drawn from a normally distributed population and the population variance is unknown.

Much of statistical inference relies on a standardized statistic, e.g., $\small{\frac{\bar{X}-\mu}{s.e.(\bar{X})}}$; thus the following sections also discuss the distribution of this standardized form.


1. The Distribution of $\bar{X}$, Normal Population, $\sigma^2$ known.

Linear combinations of normally distributed random variables are also normally distributed. When a sample is drawn from a normal population, the distribution of the sample mean is also normal.




If $X_1, X_2, \ldots, X_n$ are independent and $X_i\sim N(\mu, \sigma^2)$, then
$\bar{X}\sim N\left(\mu,\frac{\sigma^2}{n}\right).$

Standardizing $\bar{X}$ yields
$Z=\frac{\bar{X}-\mu}{\sigma/\sqrt{n}}\sim N(0,1).$
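The result above can be checked empirically. The sketch below assumes a normal population with $\mu = 50$ and $\sigma = 10$ and samples of size $n = 25$ (all illustrative values, as is the function name `simulate_sample_means`). If the result holds, the simulated sample means should be centered near 50 with a standard deviation near $\sigma/\sqrt{n} = 2$.

```python
import random
import statistics

def simulate_sample_means(mu, sigma, n, num_samples, seed=1):
    """Draw num_samples samples of size n from N(mu, sigma^2); return their means."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        for _ in range(num_samples)
    ]

mu, sigma, n = 50.0, 10.0, 25           # illustrative values, not from the text
means = simulate_sample_means(mu, sigma, n, num_samples=5000)

theoretical_se = sigma / n ** 0.5       # sigma / sqrt(n) = 2.0
print(f"mean of sample means: {statistics.fmean(means):.3f}")
print(f"sd of sample means:   {statistics.stdev(means):.3f}")
print(f"sigma / sqrt(n):      {theoretical_se:.3f}")
```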


2. The Distribution of $\bar{X}$, Non-Normal Population, Large Sample Size

When a sample is drawn from a population that is not normally distributed but the sample size is large, the Central Limit Theorem indicates that the distribution of the sample mean is approximately normal. The larger the sample size, the more closely the sampling distribution follows a normal curve.

NOTATION: $\stackrel{\cdot}{\sim}$ indicates an approximate distribution, thus $X\stackrel{\cdot}{\sim}N(\mu, \sigma^2)$ reads 'X is approximately $N(\mu, \sigma^2)$ distributed'.



If $X_1, X_2, \ldots, X_n$ are independent and identically distributed such that $E(X_i) = \mu$ and $Var(X_i) = \sigma^2 < \infty$, and $n$ is large, then
$\bar{X}\stackrel{\cdot}{\sim} N\left(\mu,\frac{\sigma^2}{n}\right).$

Standardizing $\bar{X}$ yields $Z=\frac{\bar{X}-\mu}{\sigma/\sqrt{n}}\stackrel{\cdot}{\sim} N(0,1).$

If $\sigma$ is unknown, as is usually the case, it is estimated with the sample standard deviation, $S$. When the sample size is large, $S \approx \sigma$, and thus $Z=\frac{\bar{X}-\mu}{S/\sqrt{n}}\stackrel{\cdot}{\sim} N(0,1).$
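A small simulation can illustrate the approximation for a clearly non-normal population. The sketch below assumes an exponential population with mean 1 (strongly right-skewed) and samples of size $n = 100$. These choices, and the function name `standardized_means`, are illustrative only. If the standardized statistic is approximately $N(0,1)$, roughly 95% of its values should fall between $-1.96$ and $1.96$.

```python
import random
import statistics

def standardized_means(n, num_samples, seed=2):
    """Return (x_bar - mu) / (S / sqrt(n)) for samples from an exponential(1) population."""
    rng = random.Random(seed)
    mu = 1.0  # mean of the exponential population with rate 1
    z_values = []
    for _ in range(num_samples):
        sample = [rng.expovariate(1.0) for _ in range(n)]
        x_bar = statistics.fmean(sample)
        s = statistics.stdev(sample)        # sample standard deviation S
        z_values.append((x_bar - mu) / (s / n ** 0.5))
    return z_values

z = standardized_means(n=100, num_samples=2000)
within = sum(abs(v) < 1.96 for v in z) / len(z)
print(f"fraction with |Z| < 1.96: {within:.3f}  (N(0,1) gives about 0.95)")
```

Rerunning with a much smaller `n` typically moves the fraction further from 0.95, consistent with the approximation improving as the sample size grows.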



3. The Distribution of $\bar{X}$, Normal Population, $\sigma$ unknown

Again, if $\sigma$ is unknown, it is usual to estimate it with $S$. When the sample size is large, $S \approx \sigma$ and the Central Limit Theorem indicates that $Z=\frac{\bar{X}-\mu}{S/\sqrt{n}}\stackrel{\cdot}{\sim} N(0,1).$ However, if the sample size is small, the Central Limit Theorem is not helpful and $S$ may not be a reliable estimator for $\sigma$. In this case, provided the population is normal, the standardized statistic has a t-distribution with $n-1$ degrees of freedom.



If $X_1, X_2, \ldots, X_n$ are independent and $X_i\sim N(\mu, \sigma^2)$, then
$\frac{\bar{X}-\mu}{S/\sqrt{n}}\sim t_{n-1}.$

As the sample size increases, the t-distribution converges to the standard normal distribution, but for small samples, the t-distribution has heavier tails.
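The heavier tails can be seen by simulation. The sketch below draws samples from a standard normal population (so $\mu = 0$), computes $\frac{\bar{X}-\mu}{S/\sqrt{n}}$ for each, and reports how often the statistic falls beyond $\pm 1.96$. The sample sizes 5 and 100 and the function name `tail_fraction` are illustrative assumptions.

```python
import random
import statistics

def tail_fraction(n, num_samples, cutoff=1.96, seed=3):
    """Fraction of simulated t statistics with absolute value above cutoff."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(num_samples):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]   # population mean is 0
        t = statistics.fmean(sample) / (statistics.stdev(sample) / n ** 0.5)
        if abs(t) > cutoff:
            exceed += 1
    return exceed / num_samples

# Two-sided tail probability beyond 1.96 under the standard normal (about 0.05).
normal_tail = 2 * (1 - statistics.NormalDist().cdf(1.96))

small = tail_fraction(n=5, num_samples=3000)
large = tail_fraction(n=100, num_samples=3000)
print(f"normal: {normal_tail:.3f}  n=5: {small:.3f}  n=100: {large:.3f}")
```

With $n = 5$ the tail fraction is noticeably larger than the normal value, while with $n = 100$ it is close to it.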

The population standard deviation $\sigma$ is rarely known, and when the sample size is large the t-distribution and the normal distribution are very similar. It is therefore common practice to use only the t-distribution when conducting inference for a mean.



The Sampling Distribution of the Sample Proportion, $\hat{p}$

When the parameter of interest is a population proportion, $p$, the underlying population distribution is composed solely of 0's and 1's, so it cannot be normally distributed. If the sample size is large, however, the Central Limit Theorem gives $\hat{p} \stackrel{\cdot}{\sim} N\left(p,\frac{p(1-p)}{n}\right)$. Since $p$ is generally unknown, it is estimated with $\hat{p}$ in the standard error, and $Z=\frac{\hat{p}-p}{\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}}\stackrel{\cdot}{\sim}N(0,1)$.



If $X_1, X_2, \ldots, X_n$ are independent random variables such that $X_i\in \{0,1\}$ with $P(X_i = 1) = p$, $\hat{p}=\frac{1}{n}\sum_{i=1}^nX_i$, and $n$ is large, then $\hat{p} \stackrel{\cdot}{\sim} N\left(p,\frac{p(1-p)}{n}\right).$
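As a worked example, the formula can be applied to the tax poll from the start of this page, where $\hat{p} = 0.59$ and $n = 1017$. The sketch below computes the estimated standard error $\sqrt{\hat{p}(1-\hat{p})/n}$ and then compares it with a simulation. The simulation treats 0.59 as if it were the true $p$, which is an assumption made purely for illustration, and the function name `simulate_p_hats` is hypothetical.

```python
import math
import random
import statistics

p_hat = 0.59   # sample proportion reported by the poll
n = 1017       # sample size reported by the poll

# Estimated standard error of p-hat: sqrt(p_hat * (1 - p_hat) / n)
standard_error = math.sqrt(p_hat * (1 - p_hat) / n)

def simulate_p_hats(p, n, num_samples, seed=4):
    """Simulate sample proportions from n independent 0/1 draws with P(1) = p."""
    rng = random.Random(seed)
    return [
        sum(rng.random() < p for _ in range(n)) / n
        for _ in range(num_samples)
    ]

# Assumption for illustration only: use 0.59 as the population proportion.
p_hats = simulate_p_hats(p=p_hat, n=n, num_samples=1000)

print(f"standard error from formula: {standard_error:.4f}")
print(f"mean of simulated p-hats:    {statistics.fmean(p_hats):.4f}")
print(f"sd of simulated p-hats:      {statistics.stdev(p_hats):.4f}")
```

The standard error works out to about 0.015, so sample proportions from polls of this size would typically land within a few percentage points of the population proportion.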

Look again at the sampling distribution applet above. The applet simulates the sampling distribution for a proportion. Notice the bell shape of the histogram or dotplot. Selecting the 'Show p' option, you can also see that the distribution is centered around the population proportion.