Multiple Comparison Procedures
Gerard E. Dallal, PhD
Scientist I, JM USDA HNRC
[Much of this discussion involves tests of significance. Since most tests are
performed at the 0.05 level, I will use 0.05 throughout rather than an abstract
symbol such as α that might make some readers
uncomfortable. Whenever you see "0.05 level", feel free to substitute
your own favorite value, such as 0.01, or even a symbol such as α,
if you'd like.]
At some point in a career that requires the use of statistical analysis, an
investigator will be asked by a statistician or a referee to use a multiple
comparison procedure to adjust for having performed many tests or for having
constructed many confidence intervals. What, exactly, is the issue being raised,
why is it important, and how is it best addressed? We'll start with significance
tests and later draw some comparisons with confidence intervals.
Let's start with some "dumb" questions. The answers will be
obvious. Yet, they're all one needs to know to understand the issue surrounding
multiple comparisons.
When playing the lottery, would you rather have one ticket or many
tickets? Many. Lottery numbers are a random phenomenon and having more
tickets increases your chances of winning.
There's a severe electrical storm and you have to travel across a large,
open field. Your main concern is about being hit by lightning, a somewhat random
phenomenon. Would you rather make the trip once or many times? Once. The
more trips you make, the more likely it is that you get hit by lightning.
Similar considerations apply to observing statistically significant test
results. When there is no underlying effect or difference, we want to keep the
chance of obtaining statistically significant results small. Otherwise, we could
not rule out the possibility that our observed differences were merely due to
the vagaries of sampling and measurement.
For better or worse, much of statistical analysis is driven by significance
tests and the scientific community as a whole has decided that the vast majority
of those tests will be carried out at the 0.05 level of significance. This level
of significance is a value that summarizes a procedure investigators use to
claim an effect has been observed. The rules of the game say that if results are
typical of what happens when there is no effect, investigators can't claim
evidence of an effect. However, if the observed results occur rarely when there
is no effect, investigators may say there is evidence of an effect. The
level of significance is the probability of those rare events that permit
investigators to claim an effect. When we test at the 0.05 level of
significance, the probability of observing one of these rare results when there
is no effect is 5%.
In summary, a significance test is a way of deciding whether something
rare has occurred if there is no effect. It may well be that there is no effect
and something rare has occurred, but we cannot know that. By the rules of the
game, we conclude that there is an effect and not that we've observed a
"rare event".
When there's no underlying effect or difference, getting statistically
significant results is supposed to be like winning the lottery or getting hit by
lightning. The probability is supposed to be small (well, 5%, anyway). But just
as with the lottery or with lightning, where the probability of winning or
getting hit can increase dramatically if you take lots of chances, performing
many tests increases the chance that something will be statistically significant
at the nominal 5% level. In the case of 4 independent tests each at the 0.05
level, the probability that one or more will achieve significance is about 19%
(1 - 0.95^4 = 0.185). This
violates the spirit of the significance test. The chance of a statistically
significant result is supposed to be small when there's no underlying effect, but
performing lots of tests makes it large.
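The 19% figure follows from the independence assumption, and the same arithmetic can be tried for other numbers of tests. A minimal sketch, assuming independent tests of true null hypotheses; the function name is only illustrative:

```python
def familywise_error_rate(m, alpha=0.05):
    """Chance that at least one of m independent tests is significant
    when there is no underlying effect in any of them."""
    return 1 - (1 - alpha) ** m


# How quickly the chance of a "win" grows with the number of tickets.
for m in (1, 4, 10, 20):
    print(m, round(familywise_error_rate(m), 3))
```

Real tests on the same data are rarely independent, so this is best read as a rough guide rather than an exact error rate.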
If the chance of seeing a statistically significant result is large, why
should we pay it any attention and why should a journal publish it? Well, we
shouldn't and they shouldn't. In order to ensure that the statistically
significant results we observe really are rare when there is no underlying
effect, some adjustment is needed to keep the probability of getting any
statistically significant results small when many tests are performed. This is
the issue of multiple comparisons. The way we adjust for multiple tests will
depend on the number and type of comparisons that are made. There are common
situations that occur so often they merit special attention.
Comparing many groups
Consider an experiment to determine differences among three or more treatment
groups (e.g., cholesterol levels resulting from diets rich in different types
of oil: olive, canola, rice bran, peanut). This is a generalization of Student's
t test, which compares 2 groups.
How might we proceed? One way is to perform all possible t tests. But this
raises the problem we discussed earlier. There are 6 comparisons when there are
4 treatments and the chance that some comparison will be significant
(that some pair of treatments will look different from each other) is much
greater than 5% if they all have the same effect. (I'd guess it's around 15%.)
If we notice a t statistic greater than 1.96 in magnitude, we'd like to say,
"Hey, those two diets are different because, if they weren't, there's only
a 5% chance of an observed difference this large." However, with that many
tests (lottery tickets, trips in the storm) the chance of a significant result
(a win, getting hit) is much larger, the t statistic is no longer what it
appears to be, and the argument is no longer sound.
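One rough way to see how large that chance is for 4 groups is to simulate it. The sketch below assumes normally distributed data with equal variances, 6 subjects per group, a pooled error variance with 20 degrees of freedom, and the 2.09 critical value for Student's t with 20 df; the function names, seed, and number of repetitions are arbitrary choices made for illustration.

```python
import itertools
import math
import random
import statistics


def any_pair_significant(groups, crit):
    """True if any pairwise pooled-variance t statistic exceeds crit."""
    n = len(groups[0])  # assumes equal group sizes
    means = [statistics.fmean(x) for x in groups]
    # With equal n, the pooled variance is the average of the group variances.
    pooled_var = statistics.fmean([statistics.variance(x) for x in groups])
    se = math.sqrt(pooled_var * 2 / n)
    return any(abs(means[i] - means[j]) / se > crit
               for i, j in itertools.combinations(range(len(groups)), 2))


def simulate_all_pairs(g=4, n=6, crit=2.09, reps=4000, seed=1):
    """Estimate the chance that some pair looks different when all
    groups share the same population mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        groups = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(g)]
        hits += any_pair_significant(groups, crit)
    return hits / reps


print(simulate_all_pairs())
```

The estimate varies a little from seed to seed, but under these assumptions it should land well above the nominal 5%.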
Statisticians have developed many "multiple comparison procedures"
to let us proceed when there are many tests to be performed or comparisons to be
made. Two of the most commonly used procedures are Fisher's Least Significant
Differences (LSD) and Tukey's Honestly Significant Differences (HSD).
Fisher's LSD: We begin with a one-way analysis of variance. If the
overall F-ratio (which tests the hypothesis that all group means are equal) is
statistically significant, we can safely conclude that not all of the treatment
means are identical and then, and only then...we carry out all possible t tests!
Yes, the same "all possible t tests" that were just soundly
criticized. The difference is that the t tests can't be performed unless the
overall F-ratio is statistically significant. There is only a 5% chance that
the overall F-ratio will reach statistical significance when there are no
differences. Therefore, the chance of reporting a significant difference when
there are none is held to 5%. Some authors refer to this procedure as Fisher's Protected
LSD to emphasize the protection that the preliminary F-test provides. It is not
uncommon to see the term Fisher's LSD used to describe all possible t
tests without a preliminary F test, so stay alert and be a careful consumer of
statistics.
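The gatekeeping role of the F test can be sketched in a few lines. This is an illustration under stated assumptions, not a replacement for a statistics package: it assumes equal group sizes, and the critical values are passed in by the caller (3.10 is the usual table value for F with 3 and 20 df at the 0.05 level, and 2.09 is the Student's t value for 20 df). The function name is hypothetical.

```python
import itertools
import math
import statistics


def protected_lsd(groups, f_crit, t_crit):
    """Fisher's protected LSD for equal-sized groups.

    Returns the index pairs (i, j) whose t statistic exceeds t_crit,
    or an empty list when the overall F-ratio is not significant.
    """
    g = len(groups)
    n = len(groups[0])  # assumes equal group sizes
    means = [statistics.fmean(x) for x in groups]
    grand = statistics.fmean(means)
    ms_between = n * sum((m - grand) ** 2 for m in means) / (g - 1)
    ms_within = statistics.fmean([statistics.variance(x) for x in groups])
    if ms_between / ms_within <= f_crit:
        return []  # gate closed: no t tests are carried out
    se = math.sqrt(ms_within * 2 / n)
    return [(i, j) for i, j in itertools.combinations(range(g), 2)
            if abs(means[i] - means[j]) / se > t_crit]


base = [1, 2, 3, 4, 5, 6]
shifted = [x + 10 for x in base]
print(protected_lsd([base, base, base, shifted], f_crit=3.10, t_crit=2.09))
```

Dropping the `f_crit` check turns this into the unprotected "all possible t tests" version, which is the distinction the paragraph above warns about.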
Tukey's HSD: Tukey attacked the problem a different way by following
in Student's (WS Gosset) footsteps. Student discovered the distribution of the t
statistic when there was one two-group comparison to be made and there was no
underlying mean difference between them. When there are g groups, there
are g(g-1)/2 pairwise comparisons that can be made. Tukey found the
distribution of the largest of these t statistics when there were no
underlying differences. For example, when there are 4 treatments and 6 subjects
per treatment, there are 20 degrees of freedom for the various test statistics.
For Student's t test, the critical value is 2.09. To be statistically significant
according to Tukey's HSD, a t statistic must exceed 2.80. Because the number of
groups is accounted for, there is only a 5% chance that Tukey's HSD will declare
something to be statistically significant when all groups have the same
population mean. While HSD and LSD are the most commonly used procedures, there
are many more in the statistical literature (a dozen are listed in the PROC GLM
section of the SAS/STAT manual) and some see frequent use.
Multiple comparison procedures can be compared to buying insurance. Here, the
insurance is against making a claim of a statistically significant result when
it is just the result of chance variation. Tukey's HSD is the right amount of
insurance when all possible pairwise comparisons are being made in a set of g
groups. However, sometimes not all comparisons will be made and Tukey's HSD buys
too much insurance. In the preliminary stages of development, drug companies are
interested in identifying compounds that have some activity relative to placebo,
but they are not yet trying to rank the active compounds. When there are g
treatments including placebo, only g-1 of the g(g-1)/2 possible pairwise
comparisons will be performed. Charles Dunnett determined the behavior of the
largest t statistic when comparing all treatments to a control. In the case of 4
groups with 6 subjects per group, the critical value for the three comparisons of
Dunnett's test is 2.54.
Similar considerations apply to Scheffe's test, which was once one of the
most popular procedures but has now fallen into disuse. Scheffe's test is the
most flexible of the multiple comparison procedures. It allows analysts to
perform any comparison they might think of--not just all pairs, but the mean of
the 1st and 2nd with the mean of the 4th and 6th, and so on. However, this
flexibility comes with a price. The critical value for the four group, six
subjects per group situation we've been considering is 3.05. This makes it
harder to detect any differences that might be present. If pairwise comparisons
were the only things the investigator wanted to do, then it is unnecessary
(foolish?) to pay the price of protection that the Scheffe test demands.
The moral of the story is to never take out more insurance than necessary. If
you use Scheffe's test so that you're allowed to perform any comparison you can
think of when all you really want to do is compare all treatments to a control,
you'll be using a critical value of 3.05 instead of 2.54 and may miss some
effective treatments.
The Bonferroni Adjustment
The most flexible multiple comparisons procedure is the Bonferroni
adjustment. In order to ensure that the probability is no greater than 5%
that something will appear to be statistically significant when there are no
underlying differences, each of 'm' individual comparisons is performed at the
(0.05/m) level of significance. For example, with 4 treatments, there are
m=4(4-1)/2=6 comparisons, so each comparison is performed
at the 0.0083 (=0.05/6) level of significance. An equivalent procedure is to
multiply the unadjusted P values by the number of tests and compare the results
to the nominal significance level; that is, comparing P to 0.05/m is equivalent
to comparing mP to 0.05.
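Both forms of the adjustment are easy to compute by hand or in code. A minimal sketch, with made-up P values; capping the adjusted value at 1 is a common convention assumed here, not something stated above:

```python
def bonferroni(p_values, alpha=0.05):
    """Return (adjusted P values, reject decisions) for m tests."""
    m = len(p_values)
    # Adjusted P value is m*P, capped at 1 (assumed convention).
    adjusted = [min(1.0, m * p) for p in p_values]
    # Equivalent decision rule: compare each raw P value to alpha/m.
    reject = [p <= alpha / m for p in p_values]
    return adjusted, reject


# Six hypothetical P values, e.g. from all pairs of 4 treatments.
adjusted, reject = bonferroni([0.04, 0.005, 0.2, 0.001, 0.03, 0.5])
print(adjusted)
print(reject)
```

With six tests, only raw P values at or below 0.0083 survive, so the nominal 0.04 and 0.03 results in this made-up set are no longer declared significant.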
The Bonferroni adjustment has the advantage that it can be used in any
multiple testing situation. For example, when an investigator and I analyzed
cataract data at five time points, we were able to assure the paper's reviewers
that our results were not merely an artifact of having examined the data at five
different points in time because we had used the Bonferroni adjustment and
performed each test at the 0.01 (=0.05/5) level of significance.
The major disadvantage to the Bonferroni adjustment is that it is not an exact
procedure. The Bonferroni adjusted P value is larger than the true P value.
Therefore, in order for the Bonferroni adjusted P value to be 0.05, the true
P-value must be smaller. No one likes using a smaller P value than necessary
because it makes effects harder to detect. An exact procedure will be preferred
when one is available. Tukey's HSD will afford the same protection as the
Bonferroni adjustment when comparing many treatment groups and the HSD makes it
easier to reject the hypothesis of no difference when there are real
differences. In our example of four groups with six subjects per group, the
critical value for Tukey's HSD is 2.80, while for the Bonferroni adjustment it
is 2.93 (the percentile of Student's t distribution with 20 df corresponding to a
two-tail probability of 0.05/6=0.008333).
This might make it seem as though there is no place for the Bonferroni
adjustment. However, as already noted, the Bonferroni adjustment can be used in
any multiple testing situation. If only 3 comparisons are to be carried out, the
Bonferroni adjustment would have them performed at the 0.05/3=0.01667 level with
a critical value of 2.63, which is less than the critical value for Tukey's HSD.
Summary Table
The critical values a t statistic must achieve to reach statistical
significance at the 0.05 level (4 groups, 6 subjects per group, and 20 degrees
of freedom for the error variance).
| Test | Critical value |
| t test (LSD) | 2.09 |
| Duncan* | 2.22 |
| Dunnett | 2.54 |
| Bonferroni (3) | 2.63 |
| Tukey's HSD | 2.80 |
| Bonferroni (6) | 2.93 |
| Scheffe | 3.05 |
* Duncan's New Multiple Range Test is a stepwise procedure. This
is the critical value for assessing the homogeneity of all 4 groups.
If you look these values up in a table, Duncan, Dunnett, and Tukey's HSD
will be larger by a factor of $\sqrt{2}$. I have
divided them by $\sqrt{2}$ to make them comparable.
The reason for the difference is the tables assume equal sample sizes of n,
say. In that case, the denominator of the t statistic would contain the
factor $\sqrt{(1/n)+(1/n)} = \sqrt{2/n}$.
Instead of referring to the usual t statistic $(\bar{x}_i - \bar{x}_j)/[s_p\sqrt{2/n}]$,
the tables refer to the statistic $(\bar{x}_i - \bar{x}_j)/[s_p\sqrt{1/n}]$.
Since this statistic is the ordinary t statistic multiplied by $\sqrt{2}$,
the critical values must be adjusted accordingly. If you should have
occasion to use such a table, check the critical value for 2 groups and
infinite degrees of freedom. If the critical value is 1.96, the test
statistic is the usual t statistic. If the critical value is 2.77, the table
expects the $\sqrt{2}$ to be removed from the
denominator of the t statistic.
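The conversion between the two scales is a single division. In the sketch below, 3.958 and 2.772 are assumed to be the Studentized range table entries for 4 groups with 20 df and for 2 groups with infinite df at the 0.05 level; check them against whichever table you are using. The function name is illustrative.

```python
import math


def table_value_to_t_scale(q):
    """Convert a Studentized range table value to the usual t scale."""
    return q / math.sqrt(2)


# Assumed table entry: 4 groups, 20 error df, 0.05 level.
print(round(table_value_to_t_scale(3.958), 2))
# Assumed table entry: 2 groups, infinite df, 0.05 level.
print(round(table_value_to_t_scale(2.772), 2))
```

The first result should match the Tukey's HSD entry in the summary table, and the second should match the familiar normal critical value.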
[Student]-Newman-Keuls Procedure
The [Student]-Newman-Keuls Procedure is an attempted compromise
between LSD and HSD. It acknowledges the multiple comparison problem but invokes
the following argument: Once we determine that the two extreme treatments are
different according to the Tukey HSD criterion, we no longer have a homogeneous
set of 'g' groups. At most, 'g-1' of them are the same. Newman and Keuls
proposed that these means be compared by using the Tukey criteria to assess
homogeneity in 'g-1' groups. The procedure continued in like fashion considering
homogeneous groups of 'g-2' groups, 'g-3' groups, and so on, as long as
heterogeneity continued to be uncovered. That is, the critical value of the t
statistic got smaller (approaching the critical value for Student's t test) as
the number of groups that might have the same mean decreased. At one time, the
SNK procedure was widely used not only because it provided genuine protection
against falsely declaring differences to be real but also because it let
researchers have more significant differences than Tukey's HSD would allow. It
is now used less often, for two reasons. The first is that, unlike the HSD or
even the LSD approach, it cannot be used to construct confidence intervals for
differences between means. The second reason is the growing realization that
differences that depend strongly on the choice of particular multiple comparison
procedure are probably not readily replicated.
Duncan's New Multiple Range Test
[You have two choices. You can promise never to use this test or you can
read this section!]
Duncan's New Multiple Range Test is a wolf in sheep's clothing. It
looks like the SNK procedure and, to the delight of its advocates, gives many
more statistically significant differences. It does this, despite its
official-sounding name, by failing to give real protection to the significance level.
Whenever I am asked to review a paper that uses this procedure, I always ask
the investigators to reanalyze their data.
This New Multiple Range Test, despite its suggestive name, does not really
adjust for multiple comparisons. It is a stepwise procedure that uses the
Studentized range statistic, the same statistic used by Tukey's HSD, but it
undoes the adjustment for multiple comparisons!
The logic goes something like this: When there are g groups, there are
g(g-1)/2 comparisons that can be made. There is some redundancy here because
there are only g-1 independent pieces of information. Use the Studentized
range statistic for g groups and the appropriate number of error degrees of
freedom. To remove the penalty on the g-1 independent pieces of information,
perform the Studentized range test at the $1-(1-\alpha)^{g-1}$
level of significance. In the case of 4 groups (3 independent pieces of
information), this corresponds to performing the Studentized range test at the
0.143 level of significance.
When 'm' independent tests of true null hypotheses are carried out at some
level α, the probability that none are
statistically significant is $(1-\alpha)^m$ and
the Type I error rate is $1-(1-\alpha)^m$. Therefore,
to ensure that the Studentized range statistic does not penalize me, I use it at
the level that corresponds to having used α for my
individual tests. In the case of 4 groups, there are three independent pieces
of information. Testing the three pieces at the 0.05 level is like using the
Studentized range statistic at the $1-(1-0.05)^3$ (=0.143) level. That
is, if I use the Studentized range statistic with α=0.143,
it is just as though I performed my 3 independent tests at the 0.05 level.
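The level Duncan's procedure actually uses can be computed directly from the formula in the text. The sketch below works the 4-group case; applying the same formula to smaller sets of groups at later steps is an assumption made here for illustration, and the function name is hypothetical.

```python
def duncan_level(groups_spanned, alpha=0.05):
    """Level at which the Studentized range test is run for a set of
    groups_spanned means, using 1 - (1 - alpha)**(groups_spanned - 1)."""
    return 1 - (1 - alpha) ** (groups_spanned - 1)


# Levels for sets of 2, 3, and 4 groups.
for p in range(2, 5):
    print(p, duncan_level(p))
```

Only for two groups does the level stay at the nominal 0.05; every larger set is tested at a looser level, which is why the procedure yields more "significant" differences.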
Additional Topics
Many Response Variables
The problem of multiple tests occurs when two groups are compared with
respect to many variables. For example, suppose we have two groups and wish to
compare them with respect to three measures of folate status. Once again, the
fact that three tests are performed makes it much more likely than 5% that
something will be statistically significant at a nominal 0.05 level when there
is no real underlying difference between the two groups. Hotelling's T²
statistic could be used to test the hypothesis that the means of all variables
are equal. A Bonferroni adjustment could be used, as well.
An Apparent Paradox
An investigator compares three treatments A, B, and C. The only significant
difference is between B and C with a nominal P value of 0.04. However, when any
multiple comparison procedure is used, the result no longer achieves statistical
significance. Across town, three different investigators are conducting three
different experiments. One is comparing A with B, the second is comparing A with
C, and the third is comparing B with C. Lo and behold, they get the same P
values as the investigator running the combined experiment. The investigator
comparing B with C gets a P value of 0.04 and has no adjustment to make; thus,
the 0.04 stands and the investigator will have an easier time of impressing
others with the result.
Why should the investigator who analyzed all three treatments at once be
penalized when the investigator who ran a single experiment is not? This is part
of Kenneth Rothman's argument that there should be no adjustment for multiple
comparisons; that all significant results should be reported and each result
will stand or fall depending on whether it is replicated by other scientists.
I find this view shortsighted. The two P-values are quite different, even
though they are both 0.04. In the first case (big experiment) the investigator
felt it necessary to work with three groups. This suggests a different sort of
intuition than that of the scientist who investigated the single comparison. The
investigator working with many treatments should recognize that there is a
larger chance of achieving nominal significance and ought to be prepared to pay
the price to ensure that many false leads do not enter the scientific
literature. The scientist working with the single comparison, on the other hand,
has narrowed down the possibilities from the very start and can correctly have
more confidence in the result. For the first scientist, it's, "I made 3
comparisons and just one was barely significant." For the second scientist,
it's, "A difference, right where I expected it!"
Planned Comparisons
The discussion of the previous section may be unrealistically tidy. Suppose,
for example, the investigator working with three treatments really felt that the
only important comparison was between treatments B and C and that treatment A
was added only at the request of the funding agency or a fellow investigator. In
that case, I would argue that the investigator be allowed to compare B and C
without any adjustment for multiple comparisons because the comparison was
planned in advance and had special status.
It is difficult to give a firm rule for when multiple comparison procedures
are required. The most widely respected statistician in the field was Rupert G.
Miller, Jr. who made no pretense of being able to resolve the question but
offered some guidelines in his book Simultaneous Statistical Inference, 2nd
edition (Chapter 1, section 5, emphasis is his):
Time has now run out. There is nowhere left for the author to go but to
discuss just what constitutes a family [of comparisons to which multiple
comparison procedures are applied]. This is the hardest part of the book
because this is where statistics takes leave of mathematics and must be guided
by subjective judgment. . . .
Provided the nonsimultaneous statistician [one who never adjusts for
multiple comparisons] and his client are well aware of their error rates for
groups of statements, and feel the group rates are either satisfactory or
unimportant, the author has no quarrel with them. Every man should get to pick
his own error rates. Simultaneous techniques certainly do not apply, or should
not be applied, to every problem.
[I]t is important to distinguish between two types of experiments. The
first is the preliminary, search-type experiment concerned with uncovering
leads that can be pursued further to determine their relevance to the problem.
The second is the final, more definitive experiment from which conclusions
will be drawn and reported. Most experiments will involve a little of both,
but it is conceptually convenient to treat them as being basically distinct. The
statistician does not have to be as conservative for the first type as for the
second, but simultaneous techniques are still quite useful for keeping the
number of leads that must be traced within reasonable bounds. In the latter
type multiple comparison techniques are very helpful in avoiding public
pronouncements of red herrings simply because the investigation was very
large.
The natural family for the author in the majority of instances
is the individual experiment of a single researcher. . . . The
loophole is of course the clause in the majority of instances. Whether
or not this rule of thumb applies will depend upon the size of the experiment.
Large single experiments cannot be treated as a whole without an unjustifiable
loss in sensitivity. . . . There are no hard-and-fast rules for where the
family lines should be drawn, and the statistician must rely on his own
judgment for the problem at hand.
Unequal Sample Sizes
If sample sizes are unequal, exact multiple comparison procedures may not be
available. In 1984, Hayter showed that the unequal sample size modification of
Tukey's HSD is conservative; that is, the true significance level is no greater
than the observed significance level. Some computer programs perform multiple
comparison procedures for unequal sample sizes by pretending that the sample
sizes are equal to their harmonic mean. This is called an unweighted means
analysis. It was developed before the time of computers when the more
precise calculations could not be done by hand. When the first computer programs
were written, the procedure was implemented because analysts were used to it and
it was easy to program. Thus, we found ourselves using computers to perform an
analysis that was developed to be done by hand because there were no computers!
The unweighted means analysis is not necessarily a bad thing to do if the sample
sizes are all greater than 10, say, and differ by only 1 or 2, but this
approximate test is becoming unnecessary as software packages are updated.
What do I do?
My philosophy for handling multiple comparisons is identical to that of Cook
RJ and Farewell VT (1996), "Multiplicity
Considerations in the Design and Analysis of Clinical Trials," Journal
of the Royal Statistical Society, Series A, 159, 93-110. (The link will get you
to the paper if you subscribe to JSTOR.) An extreme view that denies the need
for multiple comparison procedures is Rothman K (1990), "No Adjustments Are
Needed for Multiple Comparisons," Epidemiology, 1, 43-46.
I use Tukey's HSD for the most part, but I'm always willing to use unadjusted
t tests for planned comparisons. One general approach is to use both Fisher's
LSD and Tukey's HSD. Differences that are significant according to HSD are
judged significant; differences that are not significant according to LSD are
judged nonsignificant; differences that are judged significant by LSD but not by
HSD are judged open to further investigation.
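The three-way rule can be written as a small function. This sketch uses the 4-group, 20 df critical values from the summary table (2.09 for LSD, 2.80 for HSD) as defaults; other designs would need their own values, and the function name and labels are only illustrative.

```python
def classify_difference(t, lsd_crit=2.09, hsd_crit=2.80):
    """Classify a pairwise t statistic using both LSD and HSD."""
    t = abs(t)
    if t > hsd_crit:
        return "significant"
    if t > lsd_crit:
        return "open to further investigation"
    return "nonsignificant"


for t in (1.5, 2.4, -3.1):
    print(t, classify_difference(t))
```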
For sample size calculations, I apply the standard formula for the two sample
t test to the most important comparisons, with a Bonferroni adjustment of the
level of the test. This guarantees me the necessary power for critical pairwise
comparisons.
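As a rough illustration of that calculation, the sketch below uses the common normal-approximation formula for two equal-sized groups, n = 2[(z_{1-α/(2m)} + z_{power})σ/δ]², where m is the number of important comparisons. The specific formula, the function name, and the example numbers are assumptions made here; an exact t-based calculation gives a slightly larger n.

```python
import math
from statistics import NormalDist


def n_per_group(delta, sigma, m=1, alpha=0.05, power=0.80):
    """Approximate subjects per group to detect a mean difference delta
    with a two-sided test at the Bonferroni-adjusted level alpha/m."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / (2 * m))  # two-sided, adjusted for m comparisons
    z_power = z(power)
    return math.ceil(2 * ((z_alpha + z_power) * sigma / delta) ** 2)


# Detecting a difference of one standard deviation with 80% power.
print(n_per_group(delta=1.0, sigma=1.0))        # a single comparison
print(n_per_group(delta=1.0, sigma=1.0, m=3))   # three important comparisons
```

The adjusted calculation asks for more subjects per group, which is the price of keeping the power for each critical comparison at the intended level.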