Hypothesis testing

Sir Ronald Fisher

The dominant approach to frequentist hypothesis testing was established by Sir Ronald Fisher (1890–1962), the Chomsky of statistics. Most of what I’m about to talk about can be attributed to Fisher, though a few pieces come from his archenemies in the field of statistics, Jerzy Neyman and Egon Pearson.1

The approach has been honed to an almost automatic procedure, which can now be applied by people who don’t even understand it. And there are many of those. (Though you should not become one of them.)

Many students find this aspect of statistics the hardest to wrap their heads around. As we’ll see in a bit, part of the reason is that it really doesn’t make any sense, or at least it doesn’t make the kind of sense you’re expecting it to.2

Is your friend a dirty, lying cheater?

To ease into the basic idea of hypothesis testing: Suppose your friend challenges you to a coin toss game, using his coin, and he picks heads. He then proceeds to get two heads in a row. You want to decide between two possibilities: either your friend is honest and the coin is fair, or your friend is cheating with a loaded coin.

If your friend is honest, the probability of getting two heads in a row is .25, which isn’t too unbelievable. You wouldn’t want to risk your friendship by accusing him of low moral character based on such flimsy evidence. But if you keep flipping the coin over and over, the probability that all of the flips would come up heads gets progressively lower. By the time you get up to ten heads in a row, the probability is a smidge less than one in a thousand:

| tosses | probability |
|---|---|
| HHH | 0.125 |
| HHHH | 0.0625 |
| HHHHH | 0.03125 |
| HHHHHH | 0.015625 |
| HHHHHHH | 0.0078125 |
| HHHHHHHH | 0.00390625 |
| HHHHHHHHH | 0.001953125 |
| HHHHHHHHHH | 0.0009765625 |
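Each row of the table is just \(0.5^n\) for \(n\) heads in a row with a fair coin. A quick sketch in Python (the variable names are mine):

```python
# Probability of n heads in a row from a fair coin: 0.5 ** n
probs = {n: 0.5 ** n for n in range(3, 11)}
for n, p in probs.items():
    print("H" * n, p)  # reproduces the table above
```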

Naturally, you’d like to keep believing your friend is honest until the evidence becomes simply too implausible to reconcile with his honesty. The basic idea of classical hypothesis testing is this. You set up two hypotheses for this situation:

H0 (the null hypothesis): Your friend is honest and the coin is fair.

H1 (the alternative hypothesis): Your friend is a cheater and the coin is loaded.

Now you decide what level of improbability it would take for you to stop believing in your friend’s honesty — let’s call this probability alpha (\(\alpha\)). In theory, statistics offers you no advice at all on what you should choose as your \(\alpha\). In practice, the sciences influenced by Fisher’s brand of hypothesis testing have almost without exception settled, as a matter of religious belief, on the idea that your \(\alpha\) must be .05.

Next, you calculate the probability of the relevant events in the world. Suppose your friend flipped the coin five times and got five heads in a row. The probability of that is .03125.

Finally, you apply your decision rule, which is: Reject the null hypothesis if the probability of what happened, assuming the null hypothesis is true, is less than \(\alpha\).

In this case, \(p\)(HHHHH) = .03125, so \(p\)(HHHHH) \(< \alpha\). You have to reject the null hypothesis. You no longer believe that your friend is honest and the coin was fair. You now accept the alternative hypothesis, that your friend is dishonest and the coin was loaded.
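The decision rule itself fits in a few lines; a minimal sketch (the function name is just for illustration, and recall from the table that five heads in a row has probability .03125):

```python
ALPHA = 0.05  # the conventional cutoff

def reject_null(p_value, alpha=ALPHA):
    # Reject when the observed outcome would be rarer than alpha
    # under the null hypothesis.
    return p_value < alpha

p_five_heads = 0.5 ** 5           # 0.03125, from the table
print(reject_null(p_five_heads))  # True: reject, accuse your friend
print(reject_null(0.5 ** 2))      # False: two heads isn't enough
```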

(Consumer warning: For the purposes of deciding which of your friends to stop speaking to, or even who to defriend on Facebook, this method obviously leaves something to be desired.)

Is nature a dirty, lying cheater?

If you’ve run an experiment, the logic of hypothesis testing is the same, except that instead of your friend, you’re trying to draw conclusions about the honesty of nature. You set up your two hypotheses:

H0 (the null hypothesis): There is no real effect; any pattern in the data is due to chance alone.

H1 (the alternative hypothesis): There is a real effect in nature.

Of course, in the real world, we aren’t rooting for the honesty of nature the same way that we were rooting for our friend. We wouldn’t have bothered doing the experiment in the first place if we didn’t already suspect there was something interesting going on that we could write a journal article about. But the logic of hypothesis testing is the same. We pick our \(\alpha\) — or we use the \(\alpha\) of .05 that preceding generations have chosen for us. We calculate how likely it is that the results of our experiment would have looked this lopsided (or more) if the only thing going on had been a random fluke. We apply our decision rule: If the probability we calculated is less than \(\alpha\), then we won the experiment. We get to “reject the null hypothesis”, and “accept the alternative hypothesis”, and send the resulting paper off to our favourite journal. If that probability is greater than \(\alpha\), then we “fail to reject the null hypothesis”, we can’t claim to have discovered an effect in nature, and our prospects for publication are considerably bleaker.
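That “this lopsided (or more)” probability is a one-tailed \(p\)-value. For a coin-flip-style experiment it can be sketched like this (the function name is my own):

```python
from math import comb

def one_tailed_p(k, n, p=0.5):
    # Probability of k or more successes in n trials if only
    # chance (success probability p) is at work.
    return sum(comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(k, n + 1))

# Nine or more heads out of ten flips of a fair coin:
print(one_tailed_p(9, 10))  # 11/1024, about 0.0107: less than .05
```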

Statistical significance

We say that a difference is statistically significant if there’s less than a 5% (or whatever your \(\alpha\) is) probability that you would have gotten data this extreme by chance alone (that is, if the difference didn’t really exist). For example, we might say that one group scored significantly higher than the other, \(p < .05\).

It’s important to be clear about what “significant” doesn’t mean:

“significant” \(\neq\) big

“significant” \(\neq\) important

“significant” \(\neq\) meaningful

“significant” \(\neq\) interesting

“Significant” doesn’t even mean “unlikely to have happened by accident”. It only means: unlikely to have looked like this if it had been just an accident. (But, maybe, it could also be unlikely to have looked like this even if it weren’t an accident, because it’s just plain unlikely.)

Advice: Never use the word “significant” in any other sense if there’s any danger of being misunderstood.

(Conservative rule-of-thumb: If your paper contains a number somewhere in it, there’s a danger that some readers will interpret the word “significant” in its technical sense.)

The bigger picture

“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.” — John Tukey, frequentist (at least in public) statistician

Let’s recall Bayes’ rule, where we can think of h as being a theoretical hypothesis we have and d as being the data gathered by an experiment we run:

\[p(h|d) = \frac{p(d|h)p(h)}{p(d)}\]

As scientists and human beings, what we’re really interested in is \(p(h|d)\), that is, how likely our hypothesis is to be true given the results of the experiment.

Unfortunately, that’s not what classical hypothesis testing tells us. Instead of giving us \(p(h|d)\), the method we’ve described cares only about \(p(d|\sim h)\), that is, the probability that we would have got this data if our pet hypothesis were not true. This is a small piece of what we’d need to know in order to calculate \(p(h|d)\) using Bayes’ rule — it’s part of the calculation hiding inside the \(p(d)\) of the denominator — but by itself it’s practically useless.
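To see how different \(p(d|\sim h)\) and \(p(h|d)\) can be, here’s a sketch using the coin example. The prior and the loaded coin’s bias are illustrative assumptions of mine, not anything the classical method tells us:

```python
# h = "the coin is loaded" (assume a loaded coin comes up heads 90% of the time)
# d = five heads in a row
p_h = 0.01                  # assumed prior: we mostly trust our friend
p_d_given_h = 0.9 ** 5      # about 0.59
p_d_given_not_h = 0.5 ** 5  # 0.03125: the only number the classical test uses
p_d = p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)
p_h_given_d = p_d_given_h * p_h / p_d  # Bayes' rule
print(round(p_h_given_d, 2))
```

Even though the classical test would reject the null here (\(p(d|\sim h) = .03125 < .05\)), under this prior the probability that the coin is actually loaded comes out to only about 16%. The two quantities answer different questions.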

In the hands of less theoretically sophisticated researchers, the typical practice of hypothesis testing boils down to: calculate \(p(d|\sim h)\), and if it’s less than .05, conclude that \(h\) is true, as if \(p(d|\sim h)\) told you \(p(h|d)\).

The frequentist statisticians who developed this approach didn’t decide to do things exactly backwards because they were crazy. They were trying to do the best they could with what they had, while sweeping every hint of subjectivity under the rug. In the 1930s, “computers” were living people hired by the researchers to scribble away at arithmetic in a room for months on end. Back then, calculating \(p(d|h)\) was pretty easy, at least for some simple hypotheses, while calculating \(p(h|d)\) was extremely difficult, usually impossible, for most real-world situations. Nowadays, with our tiny computers that don’t need lunch breaks, there are many situations where \(p(h|d)\) can be calculated exactly, and in most of the rest \(p(h|d)\) can at least be estimated closely using computer simulations. There’s really no good excuse any more for researchers to keep asking the wrong question, but given the inertia of human institutions, it will keep happening for a while yet. So you should at least be familiar with the traditional approach of hypothesis testing, if only to be a discerning reader of the work of others.

Another way you can think of traditional hypothesis testing is as a weak form of the classical philosophical and mathematical technique of indirect proof, or reductio ad absurdum. In indirect proof, instead of directly proving H, you temporarily assume its opposite (not H) and then show that the assumption leads to a logical contradiction. By analogy, in hypothesis testing, instead of directly showing H1, you temporarily assume its opposite (H0) and then show that the assumption leads to something that’s moderately unlikely. (Obviously, it often goes wrong, which is the topic of the next section.)

Error types

In introductions to statistics, it’s nearly mandatory to compare hypothesis testing to the (ideal) workings of the justice system in countries with an English legal tradition. In criminal trials, the null hypothesis is that the defendant is not guilty, and that’s the verdict that a jury is supposed to give unless the prosecution has presented compelling enough evidence to convince them “beyond reasonable doubt” that the null hypothesis is false. In that case, they reject the null hypothesis and give a verdict of “guilty”.

When a jury finds a defendant “not guilty”, that doesn’t mean they believe the defendant is innocent (let alone that the defendant actually is innocent), just that the prosecution didn’t present good enough evidence for them to reject the null hypothesis. The defence lawyers don’t have to prove that the accused is innocent. The court will never declare that the accused is innocent. All they’ll ever say is “not guilty” (“you haven’t convinced us”), and leave it at that.

Similarly, in hypothesis testing, you never “prove” the null hypothesis. You can’t prove the null hypothesis. If you ever claim to have proven the null hypothesis, you’ll get slapped on the wrist and/or shunned by people who know better. All you can do is “fail to reject” the null hypothesis, just as a jury fails to reject the null hypothesis that the defendant is not guilty.

In a criminal case, as in a hypothesis test, there are two things that could go wrong. Either the jury rejects the null hypothesis when the defendant really is innocent (a Type I error) or they fail to reject the null hypothesis when the defendant really is guilty (a Type II error).

|  | the defendant is really innocent | the defendant is really guilty |
|---|---|---|
| the jury finds the defendant not guilty | yay! | Type II error |
| the jury finds the defendant guilty | Type I error | yay! |

We’d obviously like to avoid both kinds of error as much as possible. The terms “power” and “conservativeness” refer to the ability of a statistical test to avoid each error type.

It’s trivially easy to be perfectly powerful. If you always reject the null hypothesis, you’ll never commit a Type II error. You can declare every single accused person to be guilty, and you’re guaranteed to never let a guilty man go free.

It’s also trivially easy to be perfectly conservative. If you never reject the null hypothesis, you’ll never commit a Type I error. You can declare every defendant to be not guilty, and you’re guaranteed to never send an innocent man to jail. (Just to avoid confusion, you might note that the “conservative” preference, in the statistical sense, is the opposite of the preference of most political conservatives.)

Unfortunately, in our universe, it isn’t possible to be both perfectly powerful and perfectly conservative at the same time. You have to strike a balance between them and decide what the best trade-off is between the risk of missing real effects and the risk of claiming to find false ones. Where exactly you should strike this balance will depend on the real-world implications. If committing a Type I error will result in people dying, you’d obviously prefer your statistics to be conservative. If committing a Type II error will result in people dying, you’d obviously prefer your statistics to be powerful.

Practitioners of hypothesis testing typically side-step the whole question. Fisher himself could get furious at any suggestion that Type II error was something you could, let alone should, worry about. Researchers in the sciences influenced by Fisher have seldom bothered to think about issues of power and Type II error, instead conducting their research as if the only thing that mattered was controlling for Type I error. The result is a literature full of studies that were under-powered and unlikely to find real effects.3

It’s possible to calculate the probability of making these errors, with some assumptions.
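For instance, for the “reject after five straight heads” test from earlier, both error probabilities can be computed directly, given an assumption about how biased a loaded coin would be (the 0.9 below is my illustrative assumption):

```python
type_1 = 0.5 ** 5   # a fair coin produces HHHHH anyway: 0.03125
power = 0.9 ** 5    # a 90%-heads coin produces HHHHH: about 0.59
type_2 = 1 - power  # the loaded coin fails to produce HHHHH: about 0.41
print(type_1, round(power, 2), round(type_2, 2))
```

So even against a heavily loaded coin, this test misses the cheating about 41% of the time: controlling Type I error says nothing about Type II error.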

  1. The vicious personal attacks, the towering rages, the savage ripping apart of your opponents’ ideas despite the fact that, 90% of the time, they’re just different ways of phrasing exactly the same ideas you believe — we’d never have any of that in linguistics, right?

  2. Also, what I’m presenting is an awkward synthesis of the ideas of Fisher, Neyman, and Pearson that the practical disciplines have muddled their way into over the decades. Fisher, Neyman, and Pearson themselves would all have blasted the ideas of the other camp as being utterly incompatible with their own. For example, Fisher would have been apoplectic over talk about Type II error or confidence intervals. Stripping away the egos of these giants, there’s always the chance that they were right and that the awkward synthesis actually is incoherent.

  3. For example, in the first survey of power in psychological research, Cohen (1962) found that the average experiment reported in the Journal of Abnormal and Social Psychology had only an 18% chance of detecting a “small effect” and less than a 50% chance of finding a “medium effect”. (We’ll touch on what Cohen considers “small” and “medium” effects later.) Rossi (1990) found that the situation had barely improved since 1962, nor has it dramatically improved since 1990. This should make us suspicious of both significant and non-significant findings. It’s not surprising if an under-powered study fails to reject the null hypothesis when it was never likely to succeed. But we should be just as suspicious when an under-powered study does seem to find a significant effect — since the success wasn’t due to good experiment design, there’s a better than usual chance that it was just a fluke, i.e., a Type I error.