How did p<.05 become the magic cliff, before which everything is significant and beyond which nothing matters? It’s not unusual for a finding to be rejected because p>.05, even if the point value is p=.052. Because of this, and because people confuse “significant” with “meaningful,” there have been recent calls, particularly in the natural sciences, to ban the term “significance.”
So, where did p<.05 come from?
In the dim past of statistical analysis, statistics were often calculated by hand or with primitive calculators. It took time to calculate a correlation, a t-test, or an ANOVA, and it would have taken still more time to calculate the actual p value, so we relied on tables of the critical values of the test statistic required to reach p values of .05, .01, and .001. And .05 became the minimal standard. It is often read as 95% certainty that the relationship we found in our sample would hold in the larger population; strictly, it means that a result at least this strong would turn up less than 5% of the time if there were no relationship in the population at all. Now that we can run a 20 by 20 correlation matrix in seconds and get not only the correlations but also the point values for the significance levels, these arbitrary cutoffs should become irrelevant. In considering significance, it should be noted that a 20 by 20 correlation matrix produces 190 unique correlation coefficients, of which 9 or 10 would be expected to reach p<.05 by chance alone. In a previous post, we considered the Bonferroni correction, which adjusts the significance cutoff to account for running a large number of tests.
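The arithmetic for a 20 by 20 matrix can be checked directly. The short sketch below counts the unique pairs, the number of results we would expect to reach p<.05 if no true relationships existed at all, and the per-test cutoff a Bonferroni correction would impose. The variable names are ours, chosen only for illustration.

```python
from math import comb

n_variables = 20   # size of the correlation matrix
alpha = 0.05       # conventional significance cutoff

# Each unique correlation pairs two different variables: 20 choose 2.
n_tests = comb(n_variables, 2)

# Expected number of "significant" results if every null hypothesis is true.
expected_by_chance = n_tests * alpha

# Bonferroni: divide the cutoff by the number of tests.
bonferroni_alpha = alpha / n_tests

print(f"unique correlations: {n_tests}")
print(f"expected at p<.05 by chance: {expected_by_chance:.1f}")
print(f"Bonferroni-adjusted cutoff: {bonferroni_alpha:.5f}")
```

With this many tests, the adjusted cutoff is far stricter than .05, which is exactly why the correction matters for large matrices.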
Should the term “statistical significance” be eliminated?
The argument that the .05 cutoff should become irrelevant is not the same as saying that we should eliminate (or ban) the use of statistical significance as one determinant of whether to pay attention to a finding. If point values of p were consistently reported rather than the three arbitrary and obsolete cutoffs, and if effect sizes were always reported along with the significance levels, we would have a valid and easy way to decide whether a finding is to be believed. For the sake of comparability across statistical tests, the most common effect size measure is r. The results of most statistical tests can be translated into a value that approximates r as a measure of effect size, and different disciplines have different values that their adherents consider weak, moderate, or strong effect sizes. A typical guide is r=.3 for weak or small effects, r=.5 for moderate effects, and r=.7 for strong effects.
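As a rough illustration of how a test result can be translated into r, here is a minimal sketch using two standard conversions: r from a t statistic and its degrees of freedom, and r from an F statistic with one numerator degree of freedom. The function names and the example values (t=2.5 with 48 degrees of freedom) are hypothetical, not taken from any particular study.

```python
from math import sqrt

def r_from_t(t, df):
    """Effect size r from a t statistic: r = sqrt(t^2 / (t^2 + df))."""
    return sqrt(t**2 / (t**2 + df))

def r_from_f(f, df_error):
    """Effect size r from an F statistic with 1 numerator df.

    With one numerator df, F equals t squared, so the formula matches r_from_t.
    """
    return sqrt(f / (f + df_error))

# Hypothetical example: a two-group comparison with 50 people in total.
t_value, df = 2.5, 48
print(f"r from t = {r_from_t(t_value, df):.2f}")
print(f"r from F = {r_from_f(t_value**2, df):.2f}")
```

Both calls return the same value, since the F test with two groups is the square of the t-test on the same data.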
If, on the other hand, we eliminate the reporting of the significance level altogether, then each reader would need to look at the size of the sample and decide whether the author had picked a large enough sample to be believed. In looking at gender differences, for example, clearly sampling 10 men and 10 women isn’t going to make a compelling case, but what about 50 of each, or 100 of each? That’s where the point value of the significance level gives us a guide to deciding whether to believe the result. Then, effect size helps us decide whether it matters if it does reach a reasonable level of significance.
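To see how the p value does this work for us, the sketch below holds the size of a gender difference fixed and changes only the number of people in each group. It assumes a hypothetical standardized difference of 0.3 standard deviations and uses a normal approximation to the two-group test, which is rough for groups as small as 10 (a proper t-test would give a somewhat larger p). The function name is ours.

```python
from math import sqrt
from statistics import NormalDist

def approx_two_group_p(d, n_per_group):
    """Approximate two-sided p for a standardized mean difference d.

    Normal approximation: z = d * sqrt(n / 2) with n people in each group.
    """
    z = abs(d) * sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(z))

assumed_difference = 0.3  # hypothetical effect, in standard deviation units

for n in (10, 50, 100):
    p = approx_two_group_p(assumed_difference, n)
    print(f"{n} men and {n} women: p = {p:.3f}")
```

The difference itself never changes, but the p value drops steadily as the groups grow, and under these assumptions it falls below .05 only with 100 of each.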
Chi Square: Contingency tables or Phi coefficients?
Chi Square results can easily be transformed into Phi coefficients, which are equivalent to r. But if what we’re really looking for is an easy way to see how big the effect is, just reporting the cells in a Chi Square as percentages makes sense. If 65% of men vote Republican and 35% vote Democratic, while 35% of women vote Republican and 65% vote Democratic, it probably doesn’t make things clearer to say that this corresponds to r=.3 (which it does, given equal numbers of men and women). In a future post, we’ll talk about how a correlation can be translated into this sort of table of percentages as a more intuitive measure of effect size known as the Binomial Effect Size Display (BESD).
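For anyone who wants to verify the voting example, here is a minimal sketch that computes Chi Square and Phi for the 2 by 2 table. It assumes 100 men and 100 women, so the percentages become counts; those group sizes and the function names are our own illustration, and no continuity correction is applied.

```python
from math import sqrt

def chi_square_2x2(a, b, c, d):
    """Pearson Chi Square for a 2x2 table [[a, b], [c, d]], uncorrected."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def phi_from_chi_square(chi2, n):
    """Phi coefficient: the square root of Chi Square divided by N."""
    return sqrt(chi2 / n)

# Assumed counts: 100 men and 100 women.
men_republican, men_democratic = 65, 35
women_republican, women_democratic = 35, 65
n_total = men_republican + men_democratic + women_republican + women_democratic

chi2 = chi_square_2x2(men_republican, men_democratic,
                      women_republican, women_democratic)
phi = phi_from_chi_square(chi2, n_total)

print(f"Chi Square = {chi2:.1f}, Phi = {phi:.2f}")
# prints "Chi Square = 18.0, Phi = 0.30"
```

With unequal numbers of men and women the same percentages would give a somewhat different Phi, which is one more reason the percentages themselves are often the clearer thing to report.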