Suppose that a statistical test has a 5% false positive rate; then it has a 95% probability of not producing a false positive. However, if we perform the test $n$ times, the probability of getting at least one false positive is $1 - 0.95^n$ (assuming the tests are independent). For example, if we perform the test 20 times, our false positive rate is more than 60%: $1 - 0.95^{20} \approx 64\%$!
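
To make the compounding concrete, here is a short Python sketch (assuming independent tests) that computes the chance of at least one false positive for several values of $n$:

```python
# Chance of at least one false positive across n independent tests,
# each with a 5% false positive rate: 1 - (1 - 0.05)**n.
alpha = 0.05
for n in (1, 5, 10, 20, 100):
    p_any = 1 - (1 - alpha) ** n
    print(f"{n:>3} tests -> {p_any:.1%} chance of at least one false positive")
```

For $n = 20$ this prints roughly 64.2%, matching the figure above.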

This issue is known as the multiple comparison problem. It has long troubled the scientific community because of the unavoidable need to make multiple comparisons, the pressure to achieve statistical significance, and widespread gaps in understanding of probability and statistics.

The practice of conducting multiple statistical comparisons without a pre-specified hypothesis, with the aim of finding statistically significant results, is known as p-hacking. This approach is now widely considered unethical.

Perhaps the most egregious example of the multiple comparisons problem is that a brain scan of a dead salmon can be shown to contain "brain activity." Despite using a small p-value threshold ($p < 0.001$), the sheer number of comparisons (130,000) makes false positives unavoidable.[1]

There are techniques to mitigate the multiple comparison problem. For example, the Bonferroni correction says that if you perform $n$ comparisons, the criterion for significance should be $p < \frac{0.05}{n}$ rather than $p < 0.05$. However, this method has a drawback: it reduces statistical power, as it requires much stronger correlations to reach statistical significance.[2]
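
A minimal sketch of applying the Bonferroni correction in Python; the p-values here are hypothetical, chosen only to illustrate the stricter threshold:

```python
# Bonferroni correction: with n comparisons, test each p-value against
# 0.05 / n instead of 0.05, so the overall false positive rate stays ~5%.
alpha = 0.05
p_values = [0.001, 0.008, 0.020, 0.040]  # hypothetical results of 4 tests
threshold = alpha / len(p_values)        # 0.0125
for p in p_values:
    verdict = "significant" if p < threshold else "not significant"
    print(f"p = {p:.3f}: {verdict} (threshold {threshold:.4f})")
```

Note that p = 0.020 and p = 0.040 would pass the naive 0.05 threshold but fail the corrected one; that lost sensitivity is exactly the power trade-off described above.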

Footnotes

  1. C. Bennett, A. Baird, M. Miller, G. Wolford. Neural Correlates of Interspecies Perspective Taking in the Post-Mortem Atlantic Salmon: An Argument For Proper Multiple Comparisons Correction. Journal of Serendipitous and Unexpected Results, 1:1–5, 2010.

  2. A. Reinhart. The p value and the base rate fallacy. Statistics Done Wrong.