Hypothesis Testing

Welcome to the wrong side of the Rubicon. As the world of mere description and summary recedes, that of analysis and inference opens up. And unlike Julius Caesar, who found only other Romans when he crossed the Rubicon, we will discover Greeks as well. Lots of μ's and σ's as well as x-bar's and s's. Historically, much of what we know about the Greeks came to us via the Romans. Deciphering what the Romans have to tell us about the Greeks is also the process of inference in general and hypothesis testing in particular. We observe the Romans (the mean and standard deviation of our sample, for example) and infer things about the Greeks (the mean and standard deviation of the population the sample came from). Confusing the two nationalities is but one of the many traps for the young hypothesis tester.

The basic idea of hypothesis testing is very simple. It amounts to an assumption of innocence until proven guilty. The court hearing the evidence of guilt is not a criminal one, so the "proof" is "on the balance of probabilities" rather than "beyond reasonable doubt".

This process of falsification rather than verification can be confusing to the beginner. If you're trying to establish that a new drug outperforms an existing one, why would you assume they are equally efficacious until "proven" otherwise? The reason is that "the balance of probabilities" can never verify what is. It can only suggest what can't be. And if the drugs can't be the same, they must be different. Whereas, if the drugs can't be different by 10 (hours of pain relief, say), they might be different by 11, or by 20, or by 0. There is only one way the drugs can be the same, and infinitely many ways they can be different.

The other very confusing thing about hypothesis testing is that the weight of evidence against the assumption of innocence is measured by a quantity (called the p-value, for probability) which gets smaller as the weight of evidence increases!
The smaller the p-value, the more untenable the assumption of innocence becomes. Think of the p-value as being a measure of your belief in the innocence of the accused. The smaller it gets, the less you believe the accused is innocent.

Competing Hypotheses

For better or worse, hypothesis testing is usually introduced in a very formal framework. There are two hypotheses competing for our allegiance: the null hypothesis (usually denoted by H0), representing the assumption of innocence, and the alternative hypothesis (H1), representing the charge against the accused. The null hypothesis is held to be true unless it can be shown to be untenable, in which case the alternative hypothesis is accepted.

Example

The weight of flour sold in nominal 1 kg packets is never exactly 1 kg. A packaging firm might want to test whether the amount of flour they put in a packet is, on average, 1 kg. In the long term they would then be neither giving flour away nor leaving themselves open to consumer litigation.

The null hypothesis, H0, is that the mean weight of packets produced is 1 kg. (That's a Greek mean, μ.) The alternative hypothesis, H1, is that the mean weight of packets produced is not 1 kg.

The "trial" might consist of taking a random sample of packets coming off the production line and calculating the mean weight of the packets in the sample. (That's a Roman mean, x-bar.) If the sample mean is a long way from 1 kg, then H0 is untenable, and H1 suddenly becomes an attractive proposition. Putting a (not too) fine point on precisely what constitutes "a long way" is the job of the p-value.

p-Values and Significance

In our example, the p-value is the probability of obtaining a sample mean, in conceptual repetitions of our trial, at least as far from 1 kg as the sample mean that we actually obtained in our trial. The probability is calculated on the assumption that the null hypothesis is true and involves evaluating the area in the tails of a probability distribution.
(The precise details of the probability calculation need not concern us here.) The smaller the p-value, the less likely becomes the event that we witnessed under the assumption of innocence and the more disposed we are towards the alternative.

How small should the p-value be before the balance of probabilities shifts our inclination from innocence to guilt? Since this question is unanswerable, an industry standard is applied. A p-value less than 0.05 (5%) is decreed to be small enough. Should a sample mean have a p-value less than 0.05, it is said to be statistically significant (at the 5% level).

Note the irony here. An insignificant p-value implies statistical significance! Be clear about this: statistical significance is associated with acceptance of the alternative, and rejection of the null.

Quick quiz

The phrase "at least as far from" in the preceding paragraph needs some explanation. Is the distance measured in absolute terms (i.e. either side of) or relative terms (i.e. in a particular direction)? This brings us to another needlessly confusing component of hypothesis testing.

Two Sided or One?

The alternative hypothesis in our example is called a two-sided alternative because it merely asserted that the average packet weight was different from 1 kg - either more or less. An excessively high or low sample mean weight would support this alternative.

A one-sided alternative would have specified that the difference was in a particular direction - either less than 1 kg (if the packager was being accused of habitually marketing underweight packets) or greater than 1 kg (if the packager was being accused of either excessive philanthropy, caution [see origin of the phrase "baker's dozen"], stupidity or of perhaps adhering to a consumer standard which related not to the mean but to the minimum weight of a packet). The null hypothesis is effectively unchanged by how many sides the alternative has.
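To make the two-sided p-value for the flour example concrete, here is a minimal sketch in Python. It is not the calculation the text has in mind in any detail: it assumes, purely for illustration, that the population standard deviation is known (`sigma`) and that the sample mean is normally distributed, which gives a simple z-test. The packet weights, the value of `sigma` and the function name `two_sided_p_value` are all made up for the example.

```python
from math import sqrt
from statistics import NormalDist, mean


def two_sided_p_value(sample, mu0, sigma):
    """Two-sided p-value for H0: population mean == mu0.

    Assumes a z-test: sigma (population standard deviation) is treated
    as known, which is an illustrative simplification.
    """
    standard_error = sigma / sqrt(len(sample))
    z = (mean(sample) - mu0) / standard_error
    # Area in BOTH tails of the standard normal curve beyond |z|.
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical packet weights in kg (invented numbers, not real data).
weights = [1.012, 0.998, 1.021, 1.007, 1.015,
           0.995, 1.018, 1.010, 1.004, 1.020]

p = two_sided_p_value(weights, mu0=1.0, sigma=0.015)
print(f"sample mean = {mean(weights):.4f} kg, p-value = {p:.4f}")
if p < 0.05:
    print("statistically significant at the 5% level: reject H0")
else:
    print("not significant at the 5% level: H0 stands")
```

With a small sample and an unknown standard deviation, a t-test would be the more usual choice; the z-test is used here only because it needs nothing beyond the standard library.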
If the packaging firm is accused of marketing underweight packets, then only an excessively low sample mean weight will cast doubt on their integrity. An excessively high sample mean would provide no evidence at all to support the accusation.

The choice of how many sides the alternative hypothesis possesses thus changes how the p-value is calculated. Only one tail of the probability distribution is evaluated in the case of a one-sided alternative, instead of both. In practice, this means that the p-value for a one-sided alternative is half that of a two-sided alternative, provided the sample mean falls on the side the alternative specifies. (Objection, your Honour.) Just as when you choose the two-sided printing option you get half the weight of paper, when you choose a two-sided alternative you get half the weight of evidence against the null hypothesis. (Remember, the smaller the p-value the greater the evidence against the null hypothesis.)

That would appear to make the choice of alternative critical. The prosecution and defence counsels would disagree as to the optimal choice. The prosecution would welcome the one-sided "hanging judge" test, while the defence would prefer the two-sided "bleeding heart" test. Which should be chosen and why?

Fortunately, there are two simple rules
These rules apply not only to the confused beginner but also to the seasoned professional, and will stand you in good stead in all situations throughout your lifetime, with the possible exception of examinations in introductory statistics units, where you might lose a few marks here and there. For those obsessed by this topic, there are a few more complicated rules. Others can get on with the rest of their lives.

Where does the two-sided and nothing but the two-sided approach leave us when confronted by a dishonest flour packager? Just because two-sided tests are only capable of establishing a difference rather than a deficiency (or excess), it would be wrong to infer that cheats must go unpunished on the grounds that they might equally well be philanthropists. You are allowed to look at your data to determine in which direction any significant difference lies. Indeed you should routinely give some kind of interval estimate of where the true mean packet weight, μ (that's Greek), lies. (See discussion on statistical significance vs practical significance.)
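The relationship between the one-sided and two-sided p-values, and the interval estimate recommended above, can be sketched as follows. As before, this is an illustration under stated assumptions rather than a definitive method: it uses a z-test with a population standard deviation `sigma` that is assumed known, and the weights, `sigma` and all function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean


def p_values(sample, mu0, sigma):
    """Return (two_sided, lower_tail, upper_tail) p-values for H0: mean == mu0.

    z-test with sigma assumed known (an illustrative simplification).
    """
    z = (mean(sample) - mu0) / (sigma / sqrt(len(sample)))
    std_normal = NormalDist()
    lower_tail = std_normal.cdf(z)        # one-sided H1: mean < mu0
    upper_tail = 1 - std_normal.cdf(z)    # one-sided H1: mean > mu0
    two_sided = 2 * min(lower_tail, upper_tail)
    return two_sided, lower_tail, upper_tail


def confidence_interval(sample, sigma, level=0.95):
    """Interval estimate for the population mean, sigma assumed known."""
    critical = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = critical * sigma / sqrt(len(sample))
    centre = mean(sample)
    return centre - half_width, centre + half_width


# Hypothetical weights (kg) from a packager suspected of short-changing.
weights = [0.988, 0.995, 0.979, 1.002, 0.991,
           0.985, 0.997, 0.990, 0.983, 0.990]

two_sided, lower, upper = p_values(weights, mu0=1.0, sigma=0.015)
low, high = confidence_interval(weights, sigma=0.015)

print(f"two-sided p = {two_sided:.4f}")
print(f"one-sided p (underweight) = {lower:.4f}")
print(f"one-sided p (overweight)  = {upper:.4f}")
print(f"95% interval for the mean: {low:.4f} kg to {high:.4f} kg")
```

For these invented data the one-sided p-value in the "underweight" direction is exactly half the two-sided one, while the one-sided p-value in the opposite direction is close to 1. The interval lies wholly below 1 kg, which is how a two-sided test combined with an interval estimate still identifies the direction of the difference.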