Ronald Fisher, who first introduced the p-value, recognized its limitations and intended it only as a heuristic tool and convenient guide. Nevertheless, the p-value cutoff has become a widely accepted standard in many scientific disciplines. 1 However, several drawbacks of p-values can lead to p-hacking and contribute to the replication crisis.

Quote

After Fisher had retired to Australia, he was asked whether there was anything in his long career he regretted. He is said to have snapped, “Ever mentioning 0.05.” 1

The criticism of the p-value is nothing new, but it has started to gain more traction, and at least one journal has banned the use of p-values outright. 2 A more common and milder view is that the p-value should not serve as the sole arbiter of publication, that scientific conclusions should not be based solely on whether a p-value passes a specific threshold, and that researchers should always report other important information. 3

Problems of the P-Value

Base Rate Fallacy

See: base rate fallacy

P-values are commonly used in hypothesis testing to decide whether to reject or retain the null hypothesis. However, this process is often based on the misinterpretation that “the p-value is the chance that the null hypothesis is true”. The statement is false: the p-value assumes that the null hypothesis is true and then asks how unusual the observed data would be, so the misinterpretation flips the direction of the conditional. A low p-value tells you, “If the null hypothesis is true, these results are unlikely”. It does not tell you, “Given these results, the null hypothesis is unlikely”. 3

In other words, this misinterpretation conflates the conditional probability $P(\text{data} \mid H_0)$ with $P(H_0 \mid \text{data})$. 1
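
To make the distinction concrete, here is a minimal simulation, not drawn from the cited sources, in which only 10% of tested hypotheses are real effects; the prior proportion, power, and alpha are illustrative assumptions:

```python
# Illustrative sketch: among "significant" results, how often is the
# null hypothesis actually true? (Assumed: 10% of hypotheses are real
# effects, power = 0.8, alpha = 0.05; all numbers are hypothetical.)
import numpy as np

rng = np.random.default_rng(0)
n_tests, prior_real, power, alpha = 100_000, 0.10, 0.80, 0.05

real_effect = rng.random(n_tests) < prior_real
# A test of a real effect is significant with probability `power`;
# a test of a true null is significant with probability `alpha`.
significant = np.where(real_effect,
                       rng.random(n_tests) < power,
                       rng.random(n_tests) < alpha)

print(f"P(null is true | significant) = {np.mean(~real_effect[significant]):.2f}")
# Roughly 0.36 here, not 0.05: a 5% threshold says nothing direct
# about the probability that the null hypothesis is true.
```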

Multiple Comparisons Problem

See: multiple comparisons problem and p-hacking

When multiple hypotheses are tested together, the probability of obtaining a “significant” result by pure chance rises quickly: with $m$ independent tests at level $\alpha$, the chance of at least one false positive is $1 - (1 - \alpha)^m$, which already exceeds 64% for $m = 20$ at $\alpha = 0.05$. While it’s rare for researchers to intentionally manipulate data to produce statistically significant results, they may still unconsciously select hypotheses based on whether they achieve statistical significance. 4
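
A hedged sketch of the same point, assuming $m = 20$ independent t-tests in which every null hypothesis is true (sample sizes and counts are illustrative):

```python
# Sketch: family-wise false positive rate under multiple testing.
# Assumption: 20 independent two-sample t-tests, all nulls true.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
m, n, alpha, trials = 20, 30, 0.05, 2_000

hits = 0
for _ in range(trials):
    # Every "effect" is pure noise drawn from the same distribution.
    pvals = [stats.ttest_ind(rng.normal(size=n), rng.normal(size=n)).pvalue
             for _ in range(m)]
    hits += min(pvals) < alpha

print(f"Chance of at least one false positive: {hits / trials:.2f}")
# Around 0.64, matching 1 - (1 - 0.05) ** 20
```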

Lack of Information on Effect Size

See: effect size

With a sufficiently large sample size, even minuscule effects can yield statistically significant results, provided the test has adequate statistical power. This phenomenon has led some researchers to advocate a shift in focus from the p-value to the effect size. 5

A commonly cited example is a study of aspirin for preventing myocardial infarction, with a sample size of more than 22,000 subjects. The result was highly statistically significant ($p < 0.00001$), but the effect size was tiny: a risk difference of 0.77% with $r^2 = 0.001$. 5 3
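
The arithmetic can be reproduced with a chi-square test; the counts below are chosen to mimic the aspirin example (roughly 1.7% vs 0.9% event rates) and are not the exact study data:

```python
# Illustrative sketch: at very large n, a clinically tiny effect is
# highly "significant". Counts only approximate the aspirin example.
from scipy import stats

n_per_arm = 11_000
events_placebo, events_aspirin = 189, 104   # ~1.71% vs ~0.95% event rates

table = [[events_aspirin, n_per_arm - events_aspirin],
         [events_placebo, n_per_arm - events_placebo]]
chi2, p, _, _ = stats.chi2_contingency(table)

risk_diff = events_placebo / n_per_arm - events_aspirin / n_per_arm
print(f"p-value: {p:.1e}")                  # far below 0.001
print(f"risk difference: {risk_diff:.4f}")  # ~0.0077, clinically tiny
```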

Overemphasis on Dichotomous Decision-Making

Related: binary thinking

Reliance on p-values often encourages a binary mindset of either “rejecting” or “retaining” the null hypothesis, based on arbitrary thresholds like $p < 0.05$. This approach is prone both to hyped claims (false positives) and to the dismissal of real effects (false negatives).

Instead, some researchers argue that the p-value should be interpreted as a continuous quantity rather than dichotomously. They also propose reconceptualizing confidence intervals as “compatibility intervals.” 6
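
As a hypothetical illustration, consider two studies with nearly identical effect estimates whose p-values happen to straddle 0.05; the summary statistics are invented for the example:

```python
# Hypothetical sketch: a hard cutoff calls these two results
# contradictory; read continuously, they are almost the same evidence.
from scipy import stats

for label, diff in [("study A", 0.41), ("study B", 0.39)]:
    # Two-sample t-test from summary statistics: n = 50 per arm, SD = 1.
    res = stats.ttest_ind_from_stats(mean1=diff, std1=1.0, nobs1=50,
                                     mean2=0.0, std2=1.0, nobs2=50)
    print(f"{label}: effect = {diff:.2f}, p = {res.pvalue:.3f}")
# study A: effect = 0.41, p = 0.043 -> "significant"
# study B: effect = 0.39, p = 0.054 -> "not significant"
```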

Further, some people believe that statistical significance is widespread because it caters to our human desire for certainty. However, they argue that we should instead embrace uncertainty and avoid oversimplifying the world’s complexity. 1

Publication Bias

Requiring statistical significance as a publication criterion can cause publication bias: statistically non-significant results are less likely to be published. 7
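
A simple selection simulation, with an assumed true effect of 0.2 and illustrative sample sizes, shows the mechanism and a side effect: most studies go unpublished, and the published effect estimates are inflated:

```python
# Sketch: publishing only "significant" results discards most studies
# and biases the published effect estimates upward. (Assumed: true
# effect 0.2, n = 30 per arm, alpha = 0.05; all numbers illustrative.)
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
true_effect, n, alpha, n_studies = 0.2, 30, 0.05, 5_000

published, all_estimates = [], []
for _ in range(n_studies):
    treated = rng.normal(true_effect, 1, n)
    control = rng.normal(0, 1, n)
    estimate = treated.mean() - control.mean()
    all_estimates.append(estimate)
    if stats.ttest_ind(treated, control).pvalue < alpha:
        published.append(estimate)

print(f"fraction published:     {len(published) / n_studies:.2f}")  # ~0.11
print(f"mean of all studies:    {np.mean(all_estimates):.2f}")      # ~0.20
print(f"mean of published only: {np.mean(published):.2f}")          # ~0.6, inflated
```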

Teaching of the P-Value

Despite these serious concerns, hypothesis testing and p-values are still routinely taught in introductory statistics courses, often without any mention of the issues above. One professor attributed this to circularity: “we teach it because it’s what we do in industry, and we do it because it’s what we were taught.” 8

Footnotes

  1. The Significant Problem of P Values

  2. Psychology journal bans P values | Nature

  3. P Value Problems - PMC

  4. The Statistical Crisis in Science | American Scientist

  5. Using Effect Size—or Why the P Value Is Not Enough - PMC

  6. The P Value and Statistical Significance: Misunderstandings, Explanations, Challenges, and Alternatives - PMC

  7. Publication Bias: A Problem in Interpreting Medical Data

  8. The ASA Statement on p-Values: Context, Process, and Purpose