The terminology generally used in reporting statistics is inherently confusing to most of us when we start out in conducting research. For example, it seems logical that a “significant” result should be meaningful, but it often isn’t. This blog is an attempt to clarify the meanings of the two terms most commonly used to describe the result of a statistical measure: significance and effect size.
Statisticians use the term “significance” to indicate how likely it is that a result could have been obtained by chance rather than because of a real relationship. The symbol used for this is “p” as in p&lt;.05, the most common cutoff point for declaring significance. More about this in a future blog on “the significance cliff.” A result significant at p&lt;.05 means that, if no real relationship existed in the larger population, there would be less than a 5% (or 1 in 20) chance of observing a relationship this strong by chance alone. If you choose to use p&lt;.01 as the cutoff, that chance drops to less than 1 in 100. Because of this definition, the greater the sample size, the more likely you are to get a significant result, even if the effect you are measuring is very small. Inference testing on census data is meaningless, because everything is significant when you are measuring the entire population. This will be discussed in more depth in a future blog on power analysis.
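To make that definition concrete, here is a small Python sketch (standard library only, with made-up data) that estimates a two-tailed p-value by simulation: it asks how often pure chance produces a correlation at least as large as the one observed between two variables that are, by construction, unrelated.

```python
import random
import statistics

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

random.seed(1)
n = 20

# An "observed" sample: two variables with NO real relationship
x = [random.gauss(0, 1) for _ in range(n)]
y = [random.gauss(0, 1) for _ in range(n)]
r_obs = pearson_r(x, y)

# Null distribution: correlations from 10,000 pairs of unrelated samples
null_rs = []
for _ in range(10_000):
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    null_rs.append(pearson_r(a, b))

# Two-tailed p-value: the fraction of chance-only correlations
# at least as extreme as the one we observed
p = sum(1 for r in null_rs if abs(r) >= abs(r_obs)) / len(null_rs)
print(f"observed r = {r_obs:.3f}, simulated two-tailed p = {p:.3f}")
```

The point of the sketch is that “p” describes the behavior of chance under the assumption of no real relationship, not the probability that your finding is true.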
There are two additional complications with significance. Say you perform 100 tests of significance on your data; with currently available statistical software, that’s not out of the question. By chance alone, you would expect about 5 of those tests to come out significant at p&lt;.05 and one of them to come out significant at p&lt;.01. If you are going to perform a large number of tests, you need to make a Bonferroni correction to your cutoff point. This will be dealt with in a separate blog. The second complication deals with the directionality of the relationship: are you using a one-tailed or two-tailed test? Some statistics (such as ANOVA) always use two-tailed tests of significance, while others, such as t-tests and correlations, can use either. I always recommend that my clients use two-tailed tests, even though it puts a greater burden on finding significance. (A one-tailed test that produces a significant result at p&lt;.05 will only be significant at p&lt;.10 as a two-tailed test.) But if the goal of your research is to find out something new, do you really want to throw out a result that is very strong and in the opposite direction to your prediction? Many important discoveries are the result of unexpected outcomes.
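A quick simulation makes the multiple-testing point. Under the null hypothesis, a p-value is equally likely to fall anywhere between 0 and 1, so 100 tests on pure noise can be sketched as 100 uniform random draws (a standard-library Python sketch, not tied to any particular statistics package):

```python
import random

random.seed(42)
alpha = 0.05
n_tests = 100

# Under the null hypothesis, p-values are uniformly distributed on [0, 1],
# so 100 tests on pure noise behave like 100 uniform random draws.
p_values = [random.random() for _ in range(n_tests)]

false_pos = sum(p < alpha for p in p_values)
print(f"'significant' at p<.05 by chance alone: {false_pos} of {n_tests}")

# Bonferroni correction: divide the cutoff by the number of tests
bonferroni_alpha = alpha / n_tests  # 0.05 / 100 = 0.0005
survivors = sum(p < bonferroni_alpha for p in p_values)
print(f"still 'significant' after Bonferroni (p<{bonferroni_alpha}): {survivors}")
```

On average about 5 of the 100 noise-only tests clear the uncorrected p&lt;.05 cutoff, while the Bonferroni-corrected cutoff screens out nearly all of them.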
Effect size is what most people are thinking about when they see the term “significance”; it is a measure of the meaningfulness of your finding. It tells you how big the effect is. For example, if you are measuring the relationship between height and weight in men, you would perform a Pearson correlation. The result of a correlation is expressed as r, which stands for “regression,” because correlation and regression are really the result of a single statistical procedure that looks at the relationship between two variables. The possible values of r range from -1 to 1. If there were a perfect positive relationship (r=1), then the taller the man, the heavier the man: every man’s weight would be exactly predictable from his height. If there were a perfect negative relationship (r=-1), then the world would be composed entirely of short, fat men and tall, skinny men. If there were no relationship at all (r=0), then heights and weights would be distributed independently of each other, and the world would be even more diverse than it is now. These extremes are fairly easy to visualize, but it becomes more difficult to visualize the normal range of r that we typically encounter. For the height/weight example, r is probably closer to .5 than to either extreme (-1 or 1). A more commonly used effect size is r2, which is often referred to as “percent variance explained.” So, a correlation of r=.5 yields an r2 of .25, indicating that, for example, height accounts for 25% of the variance in weight for men. The results of other statistical tests can generally be converted to r or r2. For example, the result of a Chi Square test can be converted to a Phi coefficient, which is equivalent to r. More about this, and about a more intuitive display of effect size, known as the Binomial Effect Size Display, in a future blog.
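As an illustration, here is a short Python sketch that computes r and r2 for some height/weight data (the numbers are hypothetical, chosen only to show the calculation):

```python
import statistics

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

# Hypothetical heights (inches) and weights (pounds) for ten men
heights = [64, 66, 67, 68, 69, 70, 71, 72, 74, 75]
weights = [150, 145, 170, 160, 185, 175, 168, 200, 190, 210]

r = pearson_r(heights, weights)
r_squared = r ** 2

print(f"r = {r:.2f}")         # strength and direction of the relationship
print(f"r2 = {r_squared:.2f}")  # proportion of variance explained
```

Note that r2 is always smaller than |r| (for |r| &lt; 1), which is one reason a correlation that sounds impressive can explain less of the variance than you might expect.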