4.5 Interpreting p-values

In the examples we have discussed, the p-value for a regression coefficient \(\beta\) tests the null hypothesis \(H_0:\beta = 0\) vs. the alternative hypothesis \(H_1:\beta \ne 0\). If the associated regression term is a continuous predictor, the null hypothesis is that the outcome has no association with the predictor. If the associated regression term is an indicator function for the level of a categorical predictor, then the null hypothesis is that the mean outcome is the same at that level and the reference level.

Typically, the level of significance is set at .05 (for arbitrary historical reasons) and we “reject the null hypothesis” if \(p < .05\) and “fail to reject the null hypothesis” if \(p \ge .05\). Importantly, rejecting the null hypothesis does not mean we have proved the null hypothesis to be false or the alternative to be true. Similarly, failing to reject the null does not mean we have proved the null to be true or the alternative to be false.

The p-value is computed under two assumptions: (1) Equation (4.1) or (4.2) and the associated assumptions constitute the model that is generating the data and (2) the regression coefficient (\(\beta\)) is zero (the null hypothesis). Under these assumptions, the p-value is the probability of observing a sample that results in a regression coefficient with a magnitude at least as large as the one observed. For a continuous predictor, for example, the p-value answers the following question: “If there really is no association between the outcome and predictor, and the regression model really is correct, how likely is it that we would observe an association at least this strong?” In Example 4.1, \(p < .001\), which means that if the true model really is a horizontal line, and the regression assumptions are true, there is very little chance we would observe a slope this steep or steeper (an association this large or larger).
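The definition above can be illustrated with a small simulation. The following is a minimal sketch, not the method used to compute the p-values reported in the examples: it assumes simple linear regression with normal errors, treats the residual standard deviation as known, and uses hypothetical data (not the data from Example 4.1). The function names `ols_slope` and `simulated_p_value` are made up for this illustration.

```python
import numpy as np


def ols_slope(x, y):
    """Least-squares slope from regressing y on x (simple linear regression)."""
    xc = x - x.mean()
    return float(xc @ (y - y.mean()) / (xc @ xc))


def simulated_p_value(x, y, n_sim=5000, seed=0):
    """Approximate the two-sided p-value for H0: slope = 0 by simulating
    outcomes from the null model and counting slopes at least as extreme."""
    rng = np.random.default_rng(seed)
    n = len(x)
    b_obs = ols_slope(x, y)
    # Residual standard deviation from the fitted line
    # (assumption: normal errors with constant variance).
    resid = (y - y.mean()) - b_obs * (x - x.mean())
    sigma = float(np.sqrt(resid @ resid / (n - 2)))
    # Assumption (1): the regression model is correct.
    # Assumption (2): the slope is zero, so each simulated outcome is a
    # constant mean plus noise that is unrelated to x.
    null_slopes = np.array(
        [ols_slope(x, y.mean() + rng.normal(0.0, sigma, n)) for _ in range(n_sim)]
    )
    # Proportion of null-model samples with a slope at least as large in magnitude.
    p = float(np.mean(np.abs(null_slopes) >= abs(b_obs)))
    return b_obs, p


# Hypothetical data generated with a true slope of 0.8.
data_rng = np.random.default_rng(1)
x = data_rng.normal(size=50)
y = 1.0 + 0.8 * x + data_rng.normal(size=50)

b_obs, p = simulated_p_value(x, y)
print(f"observed slope = {b_obs:.3f}, simulated p-value = {p:.4f}")
```

The proportion returned is an estimate of the probability described above: how often a sample generated under the null model yields a slope at least as steep as the observed one. It is not the probability that the null hypothesis is true.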

In epidemiology, the p-value is often said to be used to attempt to “rule out chance” as an explanation for an observed association. The p-value, however, is not the probability that random chance produced the observed association (a common misconception). More specifically, it is not the probability that the null hypothesis of no association is true. Rather, the p-value is the probability of observing a slope at least as large as the one observed, assuming the null is true (assuming chance alone, or assuming no association). You cannot compute the probability of something you are assuming to be true, but you can assume it is true and then compute the probability of observing a result at least as extreme.

A key distinction is between the estimated regression coefficient, which is a measure of the strength of association between the predictor and the outcome, and the p-value, which is a measure of the compatibility of the data with the null hypothesis (Greenland et al. 2016). In a small sample, one could observe a meaningfully large estimated regression coefficient yet have a large p-value and fail to reject the null. Why? Because, assuming the null is true, it is not unusual in a small sample to end up with a large regression coefficient. Conversely, in a large sample, one could observe an estimated regression coefficient that is small and not meaningfully far from zero, yet have a small p-value and reject the null. Why? Because, assuming the null is true, only small deviations from zero are likely in a large sample, so even a small estimated coefficient can be unusual.
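The dependence on sample size can be seen with a short calculation. The sketch below is an illustration under stated assumptions, not output from any example in this chapter: it uses simple linear regression, a hypothetical residual standard deviation and predictor standard deviation of 1, and a normal approximation to the test statistic (which is crude for very small samples, where the t distribution would give a somewhat larger p-value). The function name `approx_p_value` and all numeric inputs are made up for this illustration.

```python
import math


def approx_p_value(slope, n, sigma=1.0, sd_x=1.0):
    """Approximate two-sided p-value for H0: slope = 0 in simple linear
    regression, using a normal approximation to the test statistic."""
    # Standard error of the slope: residual SD / (SD of x * sqrt(n - 1)).
    se = sigma / (sd_x * math.sqrt(n - 1))
    z = slope / se
    # Two-sided tail probability of a standard normal: P(|Z| >= |z|).
    return math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical scenario 1: large estimated slope, small sample.
p_small_n = approx_p_value(slope=0.5, n=10)

# Hypothetical scenario 2: tiny estimated slope, very large sample.
p_large_n = approx_p_value(slope=0.02, n=100_000)

print(f"slope 0.50, n = 10:      p = {p_small_n:.3f}")
print(f"slope 0.02, n = 100000:  p = {p_large_n:.2e}")
```

Under these assumptions, the slope of 0.5 does not reach the conventional .05 threshold while the slope of 0.02 falls far below it, even though the first association is 25 times stronger. The p-value reflects the standard error, which shrinks as the sample grows, and not only the size of the coefficient.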

Conclusion: While p-values are commonly reported in research studies, they are far less important than the estimated regression coefficients and their confidence intervals. In particular, because p-values are so dependent on sample size, pay more attention to the size of regression coefficients than to p-values when deciding if an association or effect is meaningfully large.

NOTES:

When working with very large samples, for example when conducting a secondary analysis of data from a large, national survey, it is common to find that many of the p-values for regression coefficients are very small. Rather than just declare these effects to be “significant,” base your conclusions primarily on the sizes of the regression coefficients.
When working with a designed study with a sample size determined by a power analysis, this scenario is less likely to occur. A different problem arises, however. The sample size required to achieve the desired power is often underestimated, resulting in an under-powered study. “The reason is well known among those who have served as grant and research reviewers: The final effective sample size and thus the final standard error is most determined not by the total size of the trial, but rather by the number of highly informative cases or events (e.g., deaths, remissions, clinically large reductions in biomarkers, etc.) which, in practice, are almost inevitably less frequent in the study than was promised in the proposal, even when the total sample was very large. This is unsurprising because the frequency estimates in proposals are usually based on population data, but the study will impose restrictions that reduce baseline event frequencies below those in population estimates (e.g., by excluding patients in poorest health to minimize liability concerns and drop-out)” (Sander Greenland, personal communication, 2024). For example, Olivier et al. (2024) carried out a systematic review of 344 cardiovascular trials and found that, when compared to the observed event rates, 61% overestimated the event rate during trial design (mean relative deviation = 12.3%; 95% CI = 5.6%, 16.4%; p < .001).
P-values can also be computed for alternative hypotheses in which the tested coefficient value(s) are not zero. This is often advisable when the goal is to test non-inferiority, superiority, or equivalence (Greenland et al. 2016; Rafi and Greenland 2020a, 2020b).
P-values have generated a lot of debate. See Wasserstein and Lazar (2016) for the American Statistical Association’s official stance on p-values, Greenland et al. (2016) for a consensus supplementary statement about misinterpretations of p-values, confidence intervals, and power, the supplementary comments following Wasserstein and Lazar (2016) for additional commentary and differing opinions, and Wasserstein, Schirm, and Lazar (2019) for a follow-up article.
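As an illustration of the note above on testing coefficient values other than zero, the following minimal sketch computes a Wald-type p-value for the null hypothesis \(H_0:\beta = \beta_0\) using a normal approximation. The function name `wald_p_value`, the estimate, the standard error, and the tested values are all hypothetical; in practice a t distribution with the model's residual degrees of freedom would typically be used, and non-inferiority or equivalence tests are usually one-sided or combine two one-sided tests.

```python
import math


def wald_p_value(estimate, se, null_value=0.0):
    """Approximate two-sided p-value for H0: beta = null_value,
    using a normal approximation to the Wald statistic."""
    # Distance between the estimate and the tested value, in standard errors.
    z = (estimate - null_value) / se
    # Two-sided tail probability of a standard normal: P(|Z| >= |z|).
    return math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical estimate and standard error for a regression coefficient.
b_hat, se_hat = 0.30, 0.10

# The usual test of no association (beta = 0).
p_zero = wald_p_value(b_hat, se_hat, null_value=0.0)

# A test of a non-zero value, for example a hypothetical threshold of 0.25.
p_threshold = wald_p_value(b_hat, se_hat, null_value=0.25)

print(f"H0: beta = 0     p = {p_zero:.4f}")
print(f"H0: beta = 0.25  p = {p_threshold:.4f}")
```

With these hypothetical numbers, the data are far less compatible with a coefficient of 0 than with a coefficient of 0.25, which shows how the same estimate yields different p-values depending on the value being tested.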

References

Greenland, Sander, Stephen J. Senn, Kenneth J. Rothman, John B. Carlin, Charles Poole, Steven N. Goodman, and Douglas G. Altman. 2016. “Statistical Tests, p-Values, Confidence Intervals, and Power: A Guide to Misinterpretations.” The American Statistician 70 (2): Online Supplement, 1–12. https://doi.org/10.1080/00031305.2016.1154108.

Olivier, Christoph B., Lasse Struß, Nathalie Sünnen, Klaus Kaier, Lukas A. Heger, Dirk Westermann, Joerg J. Meerpohl, and Kenneth W. Mahaffey. 2024. “Accuracy of Event Rate and Effect Size Estimation in Major Cardiovascular Trials: A Systematic Review.” JAMA Network Open 7 (4): e248818. https://doi.org/10.1001/jamanetworkopen.2024.8818.

Rafi, Zad, and Sander Greenland. 2020a. “Semantic and Cognitive Tools to Aid Statistical Science: Replace Confidence and Significance by Compatibility and Surprise.” BMC Medical Research Methodology 20 (1): 244. https://doi.org/10.1186/s12874-020-01105-9.

———. 2020b. “Technical Issues in the Interpretation of S-Values and Their Relation to Other Information Measures.” arXiv stat (arXiv:2008.12991v3). https://doi.org/10.48550/arXiv.2008.12991.

Wasserstein, Ronald L., and Nicole A. Lazar. 2016. “The ASA Statement on p-Values: Context, Process, and Purpose.” The American Statistician 70 (2): 129–33. https://doi.org/10.1080/00031305.2016.1154108.

Wasserstein, Ronald L., Allen L. Schirm, and Nicole A. Lazar. 2019. “Moving to a World Beyond ‘p < 0.05’.” The American Statistician 73 (sup1): 1–19. https://doi.org/10.1080/00031305.2019.1583913.