Spread

Measures of Spread

The median individual income in the US in 2019 was about $\small{\$}$40,000. While that tells us something about how much money people in the US earn, by itself it doesn't tell us a great deal. To illustrate, if every individual in the US had an income of $\small{\$}$40,000, then $\small{\$}$40,000 would be the median. But the median would also be $\small{\$}$40,000 if half of individuals had annual incomes of $\small{\$}$0 and the other half brought in $\small{\$}$80,000 annually. In reality, the distribution of annual income in the US is dramatically right skewed. To get a better understanding of what a distribution looks like, graphical summaries are helpful, and numerical summaries describing how spread out the data are can add to our understanding.

Bar graph showing a single tall orange bar centered at 40,000 on the horizontal axis, labeled “Mean = 40,000.” The x-axis ranges from 0 to 80,000, representing that all data values are concentrated at the mean.

Bar graph with two tall pink bars, one at 0 and one at 80,000 on the horizontal axis. The x-axis ranges from 0 to 80,000 and the caption below reads “Mean = 40,000.” This represents a data distribution where values are split evenly at the extremes, giving the same mean as before despite high variability.

Histogram with a bright green bar sharply concentrated near zero on the left side, then tapering off toward the right up to about 6 million. The x-axis ranges from 0 to 6 million, and the caption below reads “Mean ≈ 40,000.” This represents a highly right-skewed distribution with most values near zero but a few very large values that raise the mean.

Variance

The variance is a measure of how spread out the data values are around the mean. Since it describes spread in relation to the mean, the variance should always be used in conjunction with the mean as a measure of center. Like the mean, the variance is strongly affected by outliers; thus, it is most appropriate to use the variance with symmetric data.

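The sample variance is approximately the mean of the squared deviations of the data points from the mean: the squared deviations are summed and then divided by $\small{n-1}$ rather than $\small{n}$. The deviation of a point from the mean is the distance between them, with a sign showing whether the point lies above or below the mean: ith deviation $\small{= X_i - \bar{X}}$.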

Deviation from the mean: The distance between a data point and the mean of the data.

Sample Variance, $S^2$
$S^2 = \frac{1}{n-1}\sum_{i=1}^n\left(X_i-\bar{X}\right)^2$
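
The formula can be checked on the two hypothetical income scenarios described earlier. The following is a minimal sketch, not a definitive implementation: the function name `sample_variance` and the two four-person datasets are made up for illustration, and Python's standard `statistics.variance` is used only as a cross-check.

```python
import math
import statistics


def sample_variance(data):
    """Sample variance S^2: sum of squared deviations from the mean, divided by n - 1."""
    n = len(data)
    if n < 2:
        raise ValueError("the sample variance needs at least two data points")
    mean = sum(data) / n
    deviations = [x - mean for x in data]  # ith deviation = X_i - mean
    return sum(d ** 2 for d in deviations) / (n - 1)


# Illustrative (made-up) datasets: both have a mean and median of 40,000.
everyone_equal = [40_000, 40_000, 40_000, 40_000]
split_extremes = [0, 0, 80_000, 80_000]

print(sample_variance(everyone_equal))  # no spread at all
print(sample_variance(split_extremes))  # very large spread

# Cross-check against the standard library, which also divides by n - 1.
print(math.isclose(sample_variance(split_extremes),
                   statistics.variance(split_extremes)))
```

Both datasets share the same center, yet the first has a variance of zero while the second has a very large one, which is the kind of difference a measure of center alone cannot show.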

Percentiles and Quartiles

Percentiles are an effective way to describe the spread of data that may not be symmetric. A percentile indicates a point such that a specified percentage of the data is smaller. For instance, the 70th percentile is a value such that 70 percent of the data is less than it. Like the median (which is itself the 50th percentile), percentiles are not affected by outliers.

Percentile: The pth percentile, $T_p$, is the value such that p% of the data are smaller than it.

There are various methods for computing percentiles. Let $T_p$ denote the pth percentile. $T_p$ could be defined to be the value such that p% of the data is less than it (denote this value $L_p$), or as the value such that p% of the data is less than or equal to it (denote this $E_p$). For example, given the numbers from 1 to 100 and p = 30, $L_{30} = 31$ and $E_{30} = 30$. We will define the pth percentile, $T_p$, as a weighted average of these two values.

Sample Percentile, $T_p$
$T_p = \frac{p}{100} E_p + \frac{100-p}{100}L_p$

Example: A dataset consists of the numbers from 1 to 100. What is the 30th percentile?

$E_{30} = 30$ and $L_{30} = 31$
$T_{30} = \frac{30}{100}E_{30}+\frac{70}{100}L_{30} = 0.3(30) + 0.7(31) = 30.7.$

Notice that, using the above definition of a percentile, the 50th percentile, $T_{50}$, corresponds to the median.

Example: A dataset consists of the numbers from 1 to 100. What is the 50th percentile?

$E_{50} = 50$ and $L_{50} = 51$
$T_{50} = \frac{50}{100}E_{50}+\frac{50}{100}L_{50} = 0.5(50)+0.5(51) = 50.5$.

Since there is an even number of values in the list, the median is the mean of the middle two numbers, 50 and 51.
Thus, $M = \frac{50+51}{2} = 50.5$.
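
The two examples above can be reproduced with a short Python sketch. The function name `sample_percentile` is hypothetical, and the sketch rests on assumptions that the text does not state: p is a whole number, the data values are distinct, and n × p / 100 is a whole number k, so that $E_p$ is the kth smallest value and $L_p$ is the (k+1)th smallest. Other cases are rejected rather than guessed at.

```python
import math
import statistics


def sample_percentile(data, p):
    """T_p = (p/100) * E_p + ((100 - p)/100) * L_p for a whole-number p in 0..100.

    Assumption (not from the text): n * p / 100 must be a whole number k, so that
    E_p is the k-th smallest value and L_p is the (k + 1)-th smallest value.
    """
    values = sorted(data)
    n = len(values)
    if n == 0:
        raise ValueError("data must not be empty")
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    if (n * p) % 100 != 0:
        raise ValueError("this sketch only handles whole-number n * p / 100")
    k = n * p // 100
    # At the extremes one of the two weights is zero, so only one value is needed.
    if p == 0:
        return values[0]
    if p == 100:
        return values[-1]
    e_p = values[k - 1]  # k-th smallest: p% of the data is <= this value
    l_p = values[k]      # (k + 1)-th smallest: p% of the data is < this value
    return (p / 100) * e_p + ((100 - p) / 100) * l_p


numbers = list(range(1, 101))
print(sample_percentile(numbers, 30))  # approximately 30.7
print(sample_percentile(numbers, 50))  # matches the median
print(statistics.median(numbers))
```

Statistical software generally offers several interpolation rules for percentiles, so library results may differ slightly from $T_p$ as defined here.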

Quantiles are equivalent to percentiles but written in decimal rather than percent form. Thus the pth percentile is the (p/100)th quantile. Deciles and quartiles are alternative designations for specific percentiles. Deciles refer to the percentiles where p is a multiple of 10.

Quartiles divide the data into quarters. The first or lower quartile is the 25th percentile, the second quartile or median is the 50th percentile, the third or upper quartile is the 75th percentile, and the fourth quartile corresponds to the maximum. The distance between the upper and lower quartiles is called the Interquartile Range (IQR). One method for identifying outliers is to consider any value that is more than 1.5$\times$IQR above the upper quartile or below the lower quartile an outlier.

It is common to call any value more than 1.5$\times$IQR below the 1st quartile or above the 3rd quartile an outlier.

Table comparing percentiles, quantiles, deciles, and quartiles. The top row shows Percentiles from 1st to 100th. The second row lists corresponding Quantiles from 0.01 through 1. The third row labels Deciles: the 1st decile aligns with the 10th percentile, the 2nd with the 20th, the 3rd with the 30th, and so on through the 10th decile at the 100th percentile. The bottom row shows Quartiles: the 1st quartile at the 25th percentile, 2nd quartile at the 50th percentile, 3rd quartile at the 75th percentile, and 4th quartile at the 100th percentile.

Example: Net Worth of US Households

The table shows selected percentiles and quartiles of the 2020 net worth of US households.

Table showing percentiles, quartiles, and corresponding net worth values. The top row lists percentiles: 1, 5, 10, 25, 50, 75, 90, 95, 99, and 100. The second row identifies quartiles: the 1st quartile at the 25th percentile, 2nd at the 50th, 3rd at the 75th, and 4th at the 100th. The bottom row shows estimated U.S. household net worth values for each percentile: −$94,517 at the 1st, −$18,387 at the 5th, −$467 at the 10th, $12,430 at the 25th, $121,411 at the 50th, $403,284 at the 75th, $1,219,126 at the 90th, $2,584,130 at the 95th, $11,099,166 at the 99th, and $172,000,000,000 at the 100th percentile.
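
The 1.5$\times$IQR rule can be applied to the quartiles in this table. The sketch below is illustrative only: the function name `iqr_fences` and the dictionary `net_worth` are hypothetical names, while the dollar values are copied from the table above.

```python
def iqr_fences(q1, q3, multiplier=1.5):
    """Return the (lower, upper) cutoffs of the 1.5 x IQR outlier rule."""
    iqr = q3 - q1  # interquartile range: upper quartile minus lower quartile
    return q1 - multiplier * iqr, q3 + multiplier * iqr


# Percentile -> net worth in dollars, copied from the table above.
net_worth = {
    1: -94_517,
    5: -18_387,
    10: -467,
    25: 12_430,
    50: 121_411,
    75: 403_284,
    90: 1_219_126,
    95: 2_584_130,
    99: 11_099_166,
    100: 172_000_000_000,
}

lower, upper = iqr_fences(net_worth[25], net_worth[75])
print(lower, upper)

# Listed percentiles whose net worth falls outside the two cutoffs.
flagged = [p for p, value in net_worth.items() if value < lower or value > upper]
print(flagged)
```

With these quartiles the IQR is $\small{\$}$390,854, so the cutoffs are −$\small{\$}$573,851 and $\small{\$}$989,565. Every listed percentile from the 90th upward lies above the upper cutoff, while none of the listed negative values falls below the lower one, which is consistent with a strongly right-skewed distribution.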