In God we trust, all others bring data.


Summary for Numeric Data

In addition to information about the country as a whole, the Statistical Abstract of the United States contains data about the individual states and territories that make up the country. The variables include crime rates, education levels, income, and record high and low temperatures. All of these are numerical variables.

Colorful map of the United States showing each state in a different color. State boundaries are outlined, and Alaska is shown separately in the lower left corner.

The Histogram

A histogram is a visual summary of numeric data in which blocks or bars represent the percentages of data that correspond to each interval. The intervals spanned by the bars usually all have the same width but this is not necessary. Because the area of each block is proportional to the amount of data in the corresponding interval, the total area under the histogram is 100%.

A histogram is a visual summary of numeric data in which blocks or bars represent percentages.


A histogram looks similar to a bar chart, but there are some important differences. A bar chart summarizes categorical data, while a histogram summarizes numeric data. Because of this, the bars in a bar chart can be rearranged to emphasize different features of the data. The horizontal axis of a histogram, however, is a number line, so its bars cannot be rearranged or reorganized.


Histogram titled “Per Capita Income by State, 2010.” It shows the distribution of mean income across U.S. states, with most states clustered between $30,000 and $40,000, and fewer states at higher income levels.

Interpretation

A histogram is interpreted in terms of its area. Since the total area under the histogram is 100%, we can refer to areas over other regions as percentages as well. We don't get exact values from the histogram; rather, we use it to estimate the area in a given region. For instance, looking at the 'Per Capita Income by State' histogram above, we can see that about 50% of states had a per capita income below $35,000 in 2010. The actual value may be a little more or a little less than 50%, but it's fairly close.

Example: Use the histograms to answer the following questions.
  1. About what percentage of states have a record high temperature above 120 degrees?
  2. About what percentage of states have a record low temperature below -40 degrees?
  3. About what percentage of states have life expectancy between 76 and 77 years?
  4. In about what percentage of states do over 12% of the population have advanced degrees?

Four histograms showing different state-level data in the U.S. The top left chart, “Percentage with Advanced Degrees,” shows most states between 8% and 12%. The top right chart, “High Temperatures of US States,” shows most states with highs between 110°F and 120°F. The bottom left chart, “Low Temperatures of US States,” centers around –50°F. The bottom right chart, “Life Expectancies of US States,” clusters around 75 to 77 years. Each chart uses a different color.
  1. 12 percent of states have a record high temperature above 120 degrees.
  2. 48 percent of states have a record low temperature below -40 degrees.
  3. 34 percent of states have life expectancy between 76 and 77 years.
  4. In 16 percent of states over 12% of the population has advanced degrees.


Construction

To construct a histogram by hand, keep in mind that the area of each bar corresponds to the percentage of data in the corresponding interval. Since the bars are rectangles, area = length × height.

  1. Divide the range of the data into intervals.
  2. Calculate the length of each interval.
  3. Determine the percentage of the data that is in each interval (include the left endpoint but not the right one).
  4. Find the height of the bar over each interval by dividing the percentage in the interval (area of the bar) by the length of the interval.
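
The four steps can be sketched in Python as follows. This is a minimal sketch, not a reference implementation: the function name `histogram_bars` and the five-value data set are made up for illustration, and the last interval is deliberately wider than the others to show why the height, not the percentage, is what gets drawn.

```python
def histogram_bars(data, edges):
    """Return (left, right, percentage, height) for each interval.

    `edges` are the interval endpoints in increasing order (step 1).
    Each interval includes its left endpoint but not its right one.
    """
    n = len(data)
    bars = []
    for left, right in zip(edges, edges[1:]):
        length = right - left                              # step 2
        count = sum(1 for x in data if left <= x < right)
        percentage = 100 * count / n                       # step 3: area of the bar
        height = percentage / length                       # step 4
        bars.append((left, right, percentage, height))
    return bars


# Hypothetical toy data, not taken from the text.
toy_data = [1, 2, 2, 3, 5]
toy_edges = [0, 2, 4, 8]

bars = histogram_bars(toy_data, toy_edges)
for left, right, percentage, height in bars:
    print(f"{left}-{right}: area {percentage:.0f}%, height {height:.1f}% per unit")
```

The first and last bars each hold 20% of the data, but the last bar is half as tall because its interval is twice as long.
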
Example: Histogram of Marvel Movie Runtime

The runtimes of the 23 Marvel movies (as of July 2020), sorted by length, are: 112, 112, 114, 115, 117, 118, 119, 122, 124, 124, 126, 129, 130, 131, 133, 134, 136, 137, 141, 143, 147, 149, 181.

The movie runtimes range from 112 to 181 minutes. A reasonable range for the histogram would be from 110 to 185 minutes. If there are 5 bars of equal length, each bar will be (185-110)/5 = 15 minutes long.
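
The same arithmetic can be written as a small Python sketch. The helper name `equal_width_edges` is an assumption made for this example; the range of 110 to 185 minutes and the choice of 5 bars come from the paragraph above.

```python
def equal_width_edges(low, high, bars):
    """Split the range from low to high into `bars` intervals of equal length."""
    width = (high - low) / bars
    return width, [low + i * width for i in range(bars + 1)]


width, edges = equal_width_edges(110, 185, 5)
print(width)   # 15.0
print(edges)   # [110.0, 125.0, 140.0, 155.0, 170.0, 185.0]
```
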

Here are the data divided into intervals:
  110–125: 112, 112, 114, 115, 117, 118, 119, 122, 124, 124
  125–140: 126, 129, 130, 131, 133, 134, 136, 137
  140–155: 141, 143, 147, 149
  170–185: 181
Notice that there are no observations in the interval from 155 to 170.
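
One way to do this sorting in code is sketched below, using Python's standard `bisect` module so that each interval includes its left endpoint but not its right one. The variable names are illustrative only.

```python
from bisect import bisect_right

runtimes = [112, 112, 114, 115, 117, 118, 119, 122, 124, 124, 126, 129,
            130, 131, 133, 134, 136, 137, 141, 143, 147, 149, 181]
edges = [110, 125, 140, 155, 170, 185]

counts = [0] * (len(edges) - 1)
for minutes in runtimes:
    # bisect_right finds the first edge strictly greater than the value, so a
    # runtime equal to an edge (say 125) lands in the interval that starts there.
    # A value equal to the final edge (185) would need special handling;
    # there is none in this data set.
    index = bisect_right(edges, minutes) - 1
    counts[index] += 1

for left, right, count in zip(edges, edges[1:], counts):
    note = "  <- empty" if count == 0 else ""
    print(f"{left}-{right}: {count}{note}")
```
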

To find the area of each bar, divide the count corresponding to a given interval by the total number of observations (23) and multiply by 100%.
For the interval 110-125: \(\small{\frac{10}{23}\times 100\% = 43.5\%}\).

The height of each bar is its percentage (area) divided by the interval length. For the interval 110-125: \(\small{\frac{43.5}{15} = 2.9}\).

Interval | Count | Percentage (Area) | Length | Height
110–125 | 10 | 43.5 | 15 | 2.9
125–140 | 8 | 34.8 | 15 | 2.32
140–155 | 4 | 17.4 | 15 | 1.16
155–170 | 0 | 0 | 15 | 0
170–185 | 1 | 4.3 | 15 | 0.29

Histogram titled “Histogram of Marvel Movie Lengths.” The x-axis is labeled “Runtime.” Most Marvel movies have runtimes between 120 and 140 minutes, fewer between 140 and 160 minutes, and a small number around 180 minutes. The bars are shaded green.
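
As a check on the table, the sketch below recomputes the percentage and height columns from the counts, confirms that the bar areas add up to 100%, and then reads an area off the histogram: the share of movies shorter than 140 minutes. The variable names are assumptions for this example; the counts and interval endpoints are the ones in the table.

```python
counts = [10, 8, 4, 0, 1]                 # counts from the table
edges = [110, 125, 140, 155, 170, 185]
total = sum(counts)                       # 23 movies

rows = []
for left, right, count in zip(edges, edges[1:], counts):
    length = right - left
    percentage = 100 * count / total      # area of the bar
    height = percentage / length          # percent per minute
    rows.append((left, right, count, percentage, length, height))
    print(f"{left}-{right}  {count:>2}  {percentage:5.1f}  {length}  {height:.2f}")

# Area is length x height, so adding the areas back up should give 100%.
total_area = sum(length * height for _, _, _, _, length, height in rows)
print(f"total area: {total_area:.1f}%")

# Reading the histogram: add the areas of the bars that end at or before 140.
under_140 = sum(pct for _, right, _, pct, _, _ in rows if right <= 140)
print(f"under 140 minutes: {under_140:.1f}%")
```

Because the heights here are not rounded before multiplying, the areas sum to 100% exactly, whereas the rounded values in the table give a total that is very slightly off.
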

The look of a histogram depends a great deal on the number of intervals into which the data are divided. If there are too many intervals, the histogram may not provide a sufficient summary of the data. If there are too few intervals, the histogram may not convey enough information. Use the applet below to investigate how changing the number of intervals affects the data display.



The histogram shows the numbers of murders in all 50 US states and the District of Columbia.

Use the 'intervals' slider to change the number of intervals of the histogram.

The 'scale' slider adjusts the heights of the bars as needed to fit the window.
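
The same effect can be explored in code. The rough sketch below uses the Marvel runtimes from the earlier example rather than the murder data, which is not reproduced here. The function name `text_histogram` is made up for this illustration, and each bar is drawn as a row of `#` marks, one per movie. Since all intervals in a given display have the same length, counts and heights give the same shape.

```python
runtimes = [112, 112, 114, 115, 117, 118, 119, 122, 124, 124, 126, 129,
            130, 131, 133, 134, 136, 137, 141, 143, 147, 149, 181]


def text_histogram(data, low, high, intervals):
    """Print one row of '#' marks per interval and return the counts."""
    width = (high - low) / intervals
    counts = [0] * intervals
    for x in data:
        # Left endpoint included, right endpoint excluded.
        index = int((x - low) // width)
        counts[min(index, intervals - 1)] += 1   # keeps x == high in the last bar
    for i, count in enumerate(counts):
        left = low + i * width
        print(f"{left:6.1f}-{left + width:6.1f} | {'#' * count}")
    return counts


for k in (2, 5, 15):
    print(f"{k} intervals")
    text_histogram(runtimes, 110, 185, k)
```

With 2 intervals nearly everything falls in one bar, so the shape of the data is hidden. With 15 intervals the display is close to a list of the individual values.
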


Describing Distributions

The shape of a histogram is described in terms of its peaks (modes) and tails (extreme values). A histogram with a long left tail is called left skewed, a histogram that looks the same in both tails is symmetric, and a histogram with a long right tail is right skewed. All three of the histograms shown below are unimodal, meaning they have only one peak.

A histogram showing left skewed data.

Left skewed

A histogram showing symmetric data.

Symmetric

A histogram showing right skewed data.

Right skewed
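
Skew is normally judged by eye, but the idea of a "long tail" can be roughed out in code. The sketch below is a crude heuristic and not a method from this page: it measures how far the data extend on each side of the median (the middle value) and calls the distribution skewed if one side is more than twice as long as the other. The function names and the threshold of 2 are arbitrary choices for this example.

```python
from statistics import median


def tail_lengths(data):
    """Distance from the median to the smallest and to the largest value."""
    mid = median(data)
    return mid - min(data), max(data) - mid


def describe_skew(data, ratio=2.0):
    """Rough label; `ratio` is an arbitrary threshold chosen for this sketch."""
    left, right = tail_lengths(data)
    if right > ratio * left:
        return "right skewed"
    if left > ratio * right:
        return "left skewed"
    return "roughly symmetric"


# Marvel movie runtimes from the construction example.
runtimes = [112, 112, 114, 115, 117, 118, 119, 122, 124, 124, 126, 129,
            130, 131, 133, 134, 136, 137, 141, 143, 147, 149, 181]

print(tail_lengths(runtimes))    # (17, 52)
print(describe_skew(runtimes))   # right skewed
```

A single extreme value can drive this result, so it is no substitute for looking at the histogram.
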


A distribution with two peaks is called bimodal; one with three or more peaks is called multimodal.

A histogram showing a unimodal distribution.

Unimodal

A histogram showing a bimodal distribution.

Bimodal

A histogram showing a multimodal distribution.

Multimodal

A smoothed histogram is a probability curve that captures the main features of a histogram. As with a histogram, area under the curve corresponds to percentages or probabilities, and the total area is 100% or 1. Smoothed histograms are easier to sketch than regular histograms, so they are a convenient way to describe a histogram's shape.

Histogram with green bars and a black smooth curve overlay. The x-axis shows data from about 120 to 180. The curve peaks around 120 and tapers off toward 160, showing a right-skewed distribution.
A smoothed histogram is a probability curve that captures the main features of a histogram.
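
This page does not say how the curve above was drawn. One common way to produce such a curve is a kernel density estimate, which replaces each data value with a small bell-shaped bump and averages the bumps. The sketch below is an assumption-laden illustration of that idea on the Marvel runtimes: the function name `smoothed_height`, the bandwidth of 8 minutes, and the grid from 80 to 220 minutes are all choices made for this example.

```python
import math

runtimes = [112, 112, 114, 115, 117, 118, 119, 122, 124, 124, 126, 129,
            130, 131, 133, 134, 136, 137, 141, 143, 147, 149, 181]


def smoothed_height(x, data, bandwidth):
    """Average of Gaussian bumps, one centered on each data value."""
    total = 0.0
    for value in data:
        z = (x - value) / bandwidth
        total += math.exp(-0.5 * z * z) / (bandwidth * math.sqrt(2 * math.pi))
    return total / len(data)


bandwidth = 8.0   # assumed smoothing width in minutes, picked by eye
step = 0.5
grid = [80 + i * step for i in range(int((220 - 80) / step) + 1)]
curve = [smoothed_height(x, runtimes, bandwidth) for x in grid]

# Riemann sum: the total area under the curve should be close to 1 (100%).
area = sum(h * step for h in curve)
peak = grid[curve.index(max(curve))]
print(f"area under curve: {area:.3f}")
print(f"curve peaks near {peak} minutes")
```

A larger bandwidth gives a smoother curve that hides more detail, much like using fewer intervals in a histogram.
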

An extreme value that is far removed from the majority of the data is called an outlier. There are various methods for determining when something is an outlier. For now, we'll look for something in the graph that is very different from the rest of the data. Look at the histogram of Marriage Rates of US States (marriages per 1,000 people in 2009). Notice that most of the states, 90% or so, had marriage rates between 5 and 10 per 1,000. There is an extreme outlier, however. One state had more than 40 marriages per 1,000 people in 2009 (incidentally, that is down from 99 marriages per 1,000 in 1990!). You'll probably not be surprised to learn that the outlier state is Nevada.

Histogram titled “Marriage Rates of US States.” The chart shows that most states have marriage rates around 10 per 1,000 people, with very few states having rates much lower or higher. The bars are light green.
An outlier is a data value that is far separated from the majority of the data.
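
As a preview of one of those methods, here is a sketch of the widely used "1.5 × IQR" convention, which flags values that lie more than 1.5 times the interquartile range beyond the quartiles. This rule is not introduced on this page, and the state marriage rates are not listed here, so the example uses the Marvel runtimes instead. The function name `iqr_outliers` is made up for illustration.

```python
from statistics import quantiles


def iqr_outliers(data, k=1.5):
    """Values more than k interquartile ranges outside the quartiles."""
    q1, _, q3 = quantiles(data, n=4)     # first quartile, median, third quartile
    spread = q3 - q1                     # interquartile range
    low_fence = q1 - k * spread
    high_fence = q3 + k * spread
    return [x for x in data if x < low_fence or x > high_fence]


runtimes = [112, 112, 114, 115, 117, 118, 119, 122, 124, 124, 126, 129,
            130, 131, 133, 134, 136, 137, 141, 143, 147, 149, 181]

print(iqr_outliers(runtimes))   # [181]
```

The 181-minute movie is the same value that sits alone in the 170–185 bar of the Marvel histogram, separated from the rest by an empty interval.
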

Example: Use the histograms to answer the following questions.
  1. Which distribution looks the most symmetric?
  2. Which distribution is left skewed?
  3. Which distribution is bimodal?
  4. Which distribution has outliers?

Four histograms displaying different statistics for U.S. states. Top left: “Forest Area of US States” shows most states with forest areas below 20%, using light blue bars. Top right: “Percentage with Bachelor Degrees” shows most states between 25% and 30%, using red bars. Bottom left: “Populations of US States” shows a right-skewed distribution with most states below 10,000 (in thousands), using light blue bars. Bottom right: “Percentage of HS Graduates of US States” shows most states between 86% and 92%, using pink bars.
  1. Percentage with bachelor's degrees looks the most symmetric.
  2. Percentage of high school graduates is left skewed.
  3. Forest area is bimodal. Percentage of high school graduates is arguably also bimodal.
  4. Forest area has one observation that is separate from the rest, but perhaps not far enough to be an outlier. Population appears to have outliers.