A test of significance is used to determine whether an effect is likely to have occurred by chance. There has been much criticism of using p<.05 as a “cliff,” over which an effect is magically deemed to be real, and below which it is not. Using this criterion, there is a 1 in 20 chance of finding an effect in your sample when none exists in the larger population. So, if you run 100 tests on data with no real effects, you would expect about 5 to come out as “significant” by chance.
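That expectation is easy to illustrate with a quick simulation. When the null hypothesis is true, p values are uniformly distributed between 0 and 1, so the fraction falling below .05 should hover around 5%. A minimal sketch in Python (the sample size and seed here are arbitrary choices for illustration):

```python
import random

random.seed(42)  # fixed seed so the run is reproducible

# Under the null hypothesis, p values are uniform on [0, 1].
# Simulate 10,000 independent tests with no real effect present.
p_values = [random.random() for _ in range(10_000)]

false_positives = sum(p < 0.05 for p in p_values)
print(f"{false_positives} of 10,000 null tests came out 'significant'")
print(f"That is {false_positives / 10_000:.1%}, close to the expected 5%")
```

With enough simulated tests, the false-positive rate lands very close to the nominal 5%, which is exactly the problem the corrections below try to address.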
The Bonferroni correction is a simple, but very conservative, way of accounting for this problem and avoiding conclusions based on chance findings. If you run 100 correlations, you divide the significance threshold by 100. So, you would only consider a result to be significant if p<.05/100, or p<.0005. The problem is that, by setting the cutoff so low, you greatly increase the chance of rejecting findings that really do hold in the larger population. Even at p<.05, some true findings are rejected, but we rarely pay much attention to those type II errors. They become important in medical research, where failure to recognize a carcinogen, for example, can lead to fatal consequences.
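In code, the Bonferroni adjustment amounts to a one-line change of threshold. A sketch in Python, where the p values are made-up numbers standing in for three of the 100 correlations:

```python
def bonferroni_cutoff(alpha: float, m: int) -> float:
    """Adjusted per-test significance threshold when m tests are run."""
    return alpha / m

# Hypothetical p values from three of 100 correlations on one data set.
p_values = [0.0003, 0.004, 0.03]

cutoff = bonferroni_cutoff(0.05, 100)  # .05 / 100 = .0005
significant = [p for p in p_values if p < cutoff]
print(cutoff)
print(significant)  # only the 0.0003 result survives the correction
```

Note that 0.004 and 0.03, both “significant” by the naive p<.05 criterion, are discarded once the correction is applied, which is the conservatism described above.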
A less conservative alternative to the Bonferroni correction is the Benjamini-Hochberg procedure. To use it, you need to decide how great your tolerance is for accepting a finding when the observed effect does not exist in the larger population (this tolerance is the false discovery rate). You then rank the findings from lowest to highest p value. The formula for the cutoff at each rank is (i/m)Q, where i is the rank, m is the total number of tests, and Q is your tolerance for accepting a finding that is not borne out in the larger population. So, for 100 correlation coefficients, with a 5% tolerance for accepting a relationship that doesn’t exist in the larger population, the finding with the lowest p value has to be less than .0005 (the same as the Bonferroni correction) to be considered significant: (1/100) × .05 = .0005. However, with this procedure, the required significance level rises as the rank increases. So the second lowest p value of the 100 only has to be significant at p<.001, and the tenth at p<.005. The cutoff continues to rise as you move further down the ranking until, at i=100, the critical value is (100/100) × .05 = .05. You find the largest rank whose p value falls at or below its cutoff, and declare that finding and everything ranked below it significant.
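The procedure described above can be sketched in a few lines of Python. The p values in the example are hypothetical, and the function implements the standard step-up version of the procedure:

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return the p values declared significant at false discovery rate q.

    Step-up procedure: rank the p values from lowest to highest, find the
    largest rank i whose p value is at or below (i/m) * q, and declare that
    finding and everything ranked below it significant.
    """
    m = len(p_values)
    ranked = sorted(p_values)
    cut_rank = 0  # highest rank (1-based) that clears its own cutoff
    for i, p in enumerate(ranked, start=1):
        if p <= (i / m) * q:
            cut_rank = i
    return ranked[:cut_rank]

# Hypothetical p values from five tests on a single data set.
results = benjamini_hochberg([0.001, 0.008, 0.012, 0.041, 0.20], q=0.05)
print(results)
```

Here the cutoffs for ranks 1 through 5 are .01, .02, .03, .04, and .05; the first three p values clear their cutoffs and are kept, while .041 narrowly misses its cutoff of .04 and is rejected along with .20.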
It seems clear that the Bonferroni correction can put an undue burden on the researcher when a large number of tests are being performed. I discussed that problem further in a previous post on Power Analysis. The remaining question for this discussion is “At what point should one of these correction procedures be used?” Strictly speaking, the p<.05 criterion is only valid when a single significance test is being performed on a data set. A 2-way ANOVA tests three effects, so the cutoff for each of the three effects using the Bonferroni correction would be p<.017 and, using the Benjamini-Hochberg procedure with Q=.05, the cutoff would also be p<.017 for the effect with the lowest p value. Most published articles report at least a dozen tests of significance, making the Bonferroni cutoff, and the Benjamini-Hochberg cutoff for the effect with the lowest p value, p<.004. But, in 30 years as editor of a peer-reviewed journal, I never saw anyone apply either of these corrections unless they were running at least 25 significance tests on a single data set.
Ultimately, the cutoff point for a significance level is always a subjective decision. The purpose of this post is to offer some alternatives for reaching a reasonable conclusion as to whether or not to accept a finding as reflective of reality. A sensible solution is probably to use one of these correction procedures in exploratory studies to screen possible findings, and then to run follow-up studies on new data sets, using the results of the initial study to reduce the number of tests performed in future studies.