In the 1100s, the English government faced a dilemma: how could it ensure that coins minted for the king contained the appropriate amount of gold or silver?
The production of coinage was a process full of opportunities for an unscrupulous minter to defraud the king. A minter had only to replace some of
the requisite precious metal with cheaper filler and he’d have an easy profit.
To protect the king’s money from this kind of mischief and to gauge the integrity of newly minted coins, the English instituted a yearly ceremony
called the ‘Trial of the Pyx’. In this ceremony, coins are randomly chosen
throughout the year from those produced by the Royal Mint. This sample of coins is stored in a metal box called the pyx. During the ceremony, the
weight of the sampled coins is compared to the weight that would be expected if the coins contained the appropriate amount of gold or silver. If the
weight of the coins differs from what is expected, the Master of the Mint is subject to a penalty. When the coin weight came up short in 1318, the
Master of the Mint was imprisoned. On the other hand, in 1423 the coins were found to be too fine, and the Master of the Mint was chastised, as it was
feared that this would lead people to melt the coins down to get the precious metals.
A pyx like this one is used to store coins collected throughout the year for the ‘Trial of the Pyx’.
The Trial of the Pyx is an early example of the now common statistical procedure of hypothesis testing. Hypothesis testing, as we know it, was
formalized in the twentieth century by R.A. Fisher and by Jerzy Neyman with Egon Pearson. In modern usage, a hypothesis test proceeds like this (connections
to the Trial of the Pyx are in parentheses):
1. The researcher establishes a hypothesis or hypotheses (the weight of the sampled coins is the same as expected);
2. Data are collected as a sample from the population of interest (coins are randomly sampled throughout the year);
3. A sample summary is compared to what would be expected if the initial hypothesis were true (the weight of the coins is compared to the expected weight);
4. A conclusion is drawn about the initial hypothesis (either the weight of the coins is consistent with the appropriate amount of metal being used or it is not).
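The four steps above can be sketched in a few lines of code. The weights, tolerance, and sample size below are invented for illustration and are not the Mint’s actual standards; the logic, though, mirrors the Trial of the Pyx.

```python
import math
import random

# Hypothetical figures (not the Mint's actual standards): a true coin
# weighs 8.0 grams, with a known standard deviation of 0.05 grams.
EXPECTED_WEIGHT = 8.0  # step 1: the hypothesis (sampled coins weigh as expected)
SIGMA = 0.05

random.seed(1)
# Step 2: collect a sample from the "population" of minted coins.
# Here we simulate honest coins; real data would come from the pyx.
sample = [random.gauss(EXPECTED_WEIGHT, SIGMA) for _ in range(100)]

# Step 3: compare a sample summary to what the hypothesis predicts.
mean = sum(sample) / len(sample)
z = (mean - EXPECTED_WEIGHT) / (SIGMA / math.sqrt(len(sample)))

# Step 4: draw a conclusion about the hypothesis.
consistent = abs(z) < 2  # a conventional cutoff, roughly two standard errors
print(f"sample mean = {mean:.4f} g, z = {z:.2f}, consistent: {consistent}")
```

Because the simulated coins really do contain the right amount of metal, the sample mean will typically fall close to the expected weight; replacing the simulated sample with weights from debased coins would shift the mean and produce a large z.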
Particularly in the work of Fisher, a testing procedure can be thought of as a type of proof by contradiction. An investigator states a hypothesis
and then gathers data in an attempt to find evidence against it. This type of reasoning has been around for centuries. According to an old story,
Aristotle (384 BC-322 BC) asserted that the velocity of a falling body was proportional to its weight. Galileo (1564-1642), believing
that objects of different masses would fall at the same rate, supposedly set out to prove his hypothesis by contradicting Aristotle’s. According
to the story, Galileo gathered his data by dropping two balls of different masses from the top of the Leaning Tower of Pisa. The idea is that if the
objects fall at the same rate, he has contradicted Aristotle’s hypothesis and proved his own. Galileo’s experiment may be apocryphal; however, such an
experiment was carried out in the Netherlands in the sixteenth century by Simon Stevin and Jan Cornets de Groot, who dropped their objects from the top
of the Nieuwe Kerk. In 1971, astronaut David Scott carried out the experiment on the moon, dropping a feather and a hammer simultaneously.
Note that if no contradiction arises when the data are collected, the investigator
has not proven that his hypothesis is true; he has only failed to show that it is false. This is a point that receives a lot of attention in textbooks
discussing the implementation of hypothesis tests.
In the early eighteenth century, John Arbuthnot (1667-1735), a noted physician
and writer, proposed a hypothesis test to prove the existence of God.
John Arbuthnot
Arbuthnot’s published work was called “An Argument for Divine Providence, Taken from the Constant Regularity Observ’d in the Births of Both Sexes” and was
published in 1710 in the Philosophical Transactions of the Royal Society. In it he argued that the balance between the numbers of males and females
at maturity was evidence of the existence of a divine power. He wrote:
Among innumerable Footsteps of Divine Providence to be found in the Works of Nature, there is a very remarkable one to be observed in the exact Balance
that is maintained, between the Numbers of Men and Women; for by this means it is provided, that the Species may never fail, nor perish, since every
Male may have its Female, and of a proportionable Age. This Equality of Males and Females is not the Effect of Chance but Divine Providence, working
for a good End, which I thus demonstrate.
Arbuthnot framed his argument in terms of a die with two sides labeled M and F. He argued that in a large number of tosses of the die, there is a
very small chance that the numbers of M’s and F’s will be exactly equal. Making an analogy to male and female births, he argued that, though more males were
born than females, the numbers of males and females of marriageable age were nearly equal. The chance that the numbers of males and females
would be so close to equal is very small; so small, in fact, that the sexes must not be determined by chance. In his work, we again see the
quintessential logic of hypothesis testing: an initial hypothesis, in this case that the sexes of children are randomly determined; the
collection of data; and the use of the data to draw conclusions regarding the hypothesis. Arbuthnot took the near balance of the sexes to be
evidence of Divine planning:
We must observe that the external Accidents to which Males are subject (who must seek their Food with danger) do make a great havock of them, and that
this loss exceeds far that of the other Sex…To repair that Loss, provident Nature, by the Disposal of its wise Creator, brings forth more Males than
Females; and that in almost a constant proportion. (Arbuthnot 1710, p. 188)
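Arbuthnot’s die argument is easy to check numerically: the chance that a fair two-sided die shows exactly equal numbers of M’s and F’s shrinks as the number of tosses grows. A minimal sketch (the toss counts chosen here are arbitrary):

```python
from math import comb

def prob_exactly_equal(n_pairs: int) -> float:
    """Probability that 2 * n_pairs tosses of a fair two-sided die
    (faces M and F) give exactly equal numbers of M's and F's."""
    n = 2 * n_pairs
    return comb(n, n_pairs) / 2 ** n

for tosses in (10, 100, 1000, 10000):
    p = prob_exactly_equal(tosses // 2)
    print(f"{tosses} tosses: P(exactly equal) = {p:.4g}")
```

The probability of an exact balance falls toward zero as the number of tosses grows, which is the heart of Arbuthnot’s claim that a sustained near balance of the sexes would be remarkable if sex were determined by a fair coin-flip mechanism alone.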
Another early foray into hypothesis testing produced interesting results. In 1767,
J. Michell
(1724-1793) used similar reasoning to conclude that stars are not dispersed randomly across the sky. He argued that if the
distribution of stars were
indeed random, it would be very unlikely to see as many clusters of stars as we do. Michell is also credited, in separate work, with being among the first
to postulate the existence of black holes.
Michell examined the distribution of the stars in space.
The formalization of the process of hypothesis testing was largely a twentieth-century endeavor. In 1900,
Karl Pearson (1857-1936) published his work on the chi-square
goodness of fit test. In 1908, William Gosset (1876-1937) published
important work on tests for a mean when samples are small. R.A. Fisher credits Gosset with making a big leap forward in the development of hypothesis
testing procedures. Fisher said:
Though his was the first exact test of significance, characteristic of the modern period, “Student” did not go so far as to claim that he
was introducing a new mode of reasoning, and perhaps would have been unwilling to believe it had he been told so; for he was only applying his own good
sense to a logical situation with which he was quite familiar. (Fisher 1973, p. 84).
Karl Pearson
William Gosset
Ronald Fisher
With the work of Fisher (1890-1962) himself, hypothesis testing began
to come into its own. As mentioned, Fisher framed his testing as a proof by contradiction. He advocated stating a hypothesis and then gathering data to
examine whether the data would contradict the hypothesis. He formulated the idea of a p-value as a probability that measures the strength of the
evidence against the initial hypothesis. He is also responsible, though not intentionally, for the modern adherence to 0.05 as an indication of a
small p-value. He said that an experiment should be designed in such a way that it would “rarely fail to give a significant result” and that a
significant result would rarely (specifically, not more than one time in twenty, hence 0.05) be produced if there were no real effect of the kind under
investigation. He chose this value to be large enough to identify useful innovations but small enough to screen out spurious results.
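Fisher’s p-value can be illustrated with a small computation. The data below (60 heads in 100 tosses of a supposedly fair coin) are made up for the example; the function computes an exact binomial tail probability and doubles it for a two-sided test.

```python
from math import comb

def binom_tail(n: int, k: int, p: float = 0.5) -> float:
    """P(X >= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Made-up data: 60 heads observed in 100 tosses of a coin
# hypothesized to be fair (p = 0.5).
p_value = 2 * binom_tail(100, 60)  # two-sided: double the upper tail
print(f"p-value = {p_value:.4f}")
print("significant at the 1-in-20 level" if p_value < 0.05
      else "not significant at the 1-in-20 level")
```

The p-value answers Fisher’s question: how often would data at least this extreme arise if the initial hypothesis were true? A small value counts as evidence against the hypothesis; his one-in-twenty convention supplies the cutoff.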
Jerzy Neyman (1894-1981) and
Egon Pearson (1895-1980) framed their version of a hypothesis
test as a tool for making a choice between competing hypotheses. They relied on error rates to determine rules for making choices between these
hypotheses. Interestingly, modern hypothesis testing often looks like a mixture of the methods put forward by Fisher and by Neyman and Pearson,
adopting the structure of the Neyman-Pearson approach but the philosophy of Fisher’s work.
Jerzy Neyman
Egon Pearson
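The Neyman-Pearson reliance on error rates can be sketched with two simple hypotheses about a coin. The hypotheses and cutoffs below are invented for illustration; the point is the trade-off between the two error rates as the decision rule changes.

```python
from math import comb

def binom_ge(n: int, k: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Two competing hypotheses about a coin (values made up for the sketch):
# H0: p = 0.5 versus H1: p = 0.7.
# Decision rule: choose H1 if at least `cutoff` heads appear in n tosses.
n = 50
for cutoff in (30, 32, 34):
    alpha = binom_ge(n, cutoff, 0.5)     # Type I error: choose H1 when H0 is true
    beta = 1 - binom_ge(n, cutoff, 0.7)  # Type II error: choose H0 when H1 is true
    print(f"cutoff {cutoff}: alpha = {alpha:.4f}, beta = {beta:.4f}")
```

Raising the cutoff makes the rule harder to trigger, so the Type I error rate falls while the Type II error rate rises; Neyman and Pearson’s contribution was to choose the rule by controlling these two rates explicitly.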
Examining the historical development of hypothesis testing gives insight into the procedures we use today. Early applications, such as the Trial of
the Pyx, demonstrate the intuitive logic behind hypothesis testing. That is, an investigator starts with a hypothesis then collects data. If the data
are very different from what she would expect to see if the hypothesis were true, she concludes that it must not be true. Students asked to respond to
a question such as ‘is this coin fair?’ generally intuit a very similar process.
These early applications and the later developments demonstrate that hypothesis testing is not a single unified method. The work of many people
contributed to the development of hypothesis testing procedures. Indeed, the varied contributions have led to considerable diversity of opinion on
how and when hypothesis tests should be conducted (see Nickerson, 2000, and Wainer and Robinson, 2003, for examples). These are not ‘black box’ methods
nor arbiters of truth, but rather tools that should be thoughtfully applied and that can be adapted and improved upon as circumstances dictate.
References:
Fisher, R.A. (1973). Statistical Methods and Scientific Inference, third edition. London: Collier Macmillan.
Arbuthnott, J. (1710). An Argument for Divine Providence, Taken from the Constant Regularity Observ'd in the Births of Both Sexes. By Dr. John Arbuthnott,
Physitian in Ordinary to Her Majesty, and Fellow of the College of Physitians and the Royal Society. Philosophical Transactions, 27(325-336), 186-190.
Heyde, C. C. and Crepel, P. (Eds) (2001) Statisticians of the Centuries. New York: Springer Verlag.
Nickerson, R. S. (2000). Null Hypothesis Significance Testing: A Review of an Old and Continuing Controversy. Psychological Methods, 5(2), 241-301.
The Royal Mint. (2016). The History of the Trial of the Pyx. Retrieved from www.royalmint.com/discover/uk-coins/history-of-the-trial-of-the-pyx
Wainer, H., & Robinson, D. H. (2003). Shaping up the practice of null hypothesis significance testing. Educational Researcher, 32(7), 22-30.