2.3 Estimation, hypothesis testing and prediction

All that is required to perform estimation, hypothesis testing (model selection), and prediction in the Bayesian approach is to apply Bayes’ rule. This ensures coherence under a probabilistic view. However, there is no free lunch: coherence reduces flexibility. On the other hand, the Frequentist approach may not be coherent from a probabilistic point of view, but it is highly flexible. This approach can be seen as a toolkit that offers inferential solutions under the umbrella of understanding probability as relative frequency. For instance, a Frequentist point estimator is chosen to satisfy good sampling properties, such as unbiasedness and efficiency, or large-sample properties, such as consistency.
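To make these sampling properties concrete, here is a minimal simulation sketch in Python (assuming, for illustration, a normal population with hypothetical parameter values): the sample mean is unbiased for \(\mu\), while the maximum likelihood estimator of the variance is biased in small samples, with the bias vanishing as \(N\) grows.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2 = 2.0, 4.0  # hypothetical population parameters

for N in (10, 100, 1000):
    # 5,000 replications of the same sampling experiment of size N
    samples = rng.normal(mu, np.sqrt(sigma2), size=(5000, N))
    mean_hat = samples.mean(axis=1)        # sample mean: unbiased for mu
    var_mle = samples.var(axis=1, ddof=0)  # MLE of sigma^2: E = (N-1)/N * sigma2
    print(N, round(mean_hat.mean(), 3), round(var_mle.mean(), 3))
```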

A notable difference is that optimal Bayesian decisions are calculated by minimizing the expected value of the loss function with respect to the posterior distribution, i.e., conditional on observed data. In contrast, Frequentist “optimal” actions are based on the expected values over the distribution of the estimator (a function of data), conditional on the unknown parameters. This involves considering sampling variability.
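As a minimal sketch of the Bayesian side of this contrast (assuming a generic posterior sample; the gamma draws below merely stand in for the output of some posterior computation), the optimal action follows directly from the loss function: quadratic loss yields the posterior mean, and absolute loss yields the posterior median.

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-in for draws from the posterior distribution of a parameter theta
posterior_draws = rng.gamma(shape=3.0, scale=1.0, size=100_000)

# Optimal Bayesian point estimates under two common loss functions
theta_quadratic = posterior_draws.mean()      # minimizes E[(theta - a)^2 | data]
theta_absolute = np.median(posterior_draws)   # minimizes E[|theta - a| | data]
print(theta_quadratic, theta_absolute)
```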

The Bayesian approach allows for the derivation of the posterior distribution of any unknown object, such as parameters, latent variables, future or unobserved variables, or models. A major advantage is that predictions can account for estimation error, and predictive distributions (probabilistic forecasts) can be easily derived.
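For instance, in a conjugate Beta-Binomial model with hypothetical data (a sketch, not a prescription), the posterior predictive distribution integrates over the posterior of \(\theta\) and is therefore more dispersed than a plug-in forecast that fixes \(\theta\) at its maximum likelihood estimate.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical data: s successes in n Bernoulli trials, Beta(1, 1) prior
n, s = 20, 7
a_post, b_post = 1 + s, 1 + n - s                # Beta posterior parameters

# Posterior predictive for m future trials: integrate over theta by simulation
m = 10
theta = rng.beta(a_post, b_post, size=100_000)   # draws carry estimation error
y_future = rng.binomial(m, theta)                # posterior predictive draws

# Plug-in forecast ignores estimation error by fixing theta at its MLE
y_plugin = rng.binomial(m, s / n, size=100_000)
print(y_future.std(), y_plugin.std())            # predictive spread is wider
```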

Hypothesis testing (model selection) in the Bayesian framework is based on inductive logic reasoning (inverse probability). Based on observed data, we evaluate which hypothesis is most tenable, performing this evaluation using posterior odds. These odds are in turn based on Bayes factors, which assess the evidence in favor of a null hypothesis while explicitly considering the alternative (R. E. Kass and Raftery 1995), following the rules of probability (D. V. Lindley 2000). This approach compares how well hypotheses predict data (Goodman 1999), minimizes the weighted sum of type I and type II error probabilities (DeGroot 1975; Pericchi and Pereira 2015), and takes into account the implicit balance of losses (H. Jeffreys 1961; J. Bernardo and Smith 1994). Posterior odds allow for the use of the same framework to analyze nested and non-nested models and perform model averaging.
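A minimal sketch of such a calculation, reusing the hypothetical Beta-Binomial setting above: the Bayes factor is the ratio of marginal likelihoods, here comparing a point null \(H_0: \theta = 0.5\) against a Beta(1, 1) alternative, and the posterior odds follow by multiplying by the prior odds.

```python
import numpy as np
from scipy.stats import binom
from scipy.special import betaln, comb

n, s = 20, 7                        # hypothetical data: 7 successes in 20 trials
# H0: theta = 0.5 (point null); H1: theta ~ Beta(1, 1)
marg_H0 = binom.pmf(s, n, 0.5)
# Marginal likelihood under H1: choose(n, s) * B(s + 1, n - s + 1) / B(1, 1)
marg_H1 = comb(n, s) * np.exp(betaln(s + 1, n - s + 1))

BF01 = marg_H0 / marg_H1            # Bayes factor in favor of H0
posterior_odds = BF01 * 1.0         # prior odds set to 1 for illustration
print(BF01, posterior_odds)
```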

However, Bayes factors cannot be based on improper or vague priors (G. M. Koop 2003), the practical interplay between model selection and posterior distributions is not as straightforward as it may be in the Frequentist approach, and the computational burden can be more demanding due to the need to solve potentially difficult integrals.

On the other hand, the Frequentist approach establishes most of its estimators as the solution to a system of equations; observe that optimization problems, such as maximizing a likelihood function, typically reduce to solving the first-order conditions. We can potentially obtain the distribution of these estimators, but most of the time, asymptotic arguments or resampling techniques are required. Hypothesis testing relies on pivotal quantities and/or resampling, and prediction is typically based on a plug-in approach, which means that estimation error is not taken into account.15
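As a sketch of the resampling route (a nonparametric bootstrap on hypothetical data), the sampling distribution of an estimator can be approximated without closed-form derivations:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.exponential(scale=2.0, size=50)   # hypothetical observed sample

# Nonparametric bootstrap: resample with replacement, re-estimate each time
B = 5000
boot_means = np.array([
    rng.choice(x, size=x.size, replace=True).mean() for _ in range(B)
])
se_boot = boot_means.std(ddof=1)          # bootstrap standard error of the mean
print(x.mean(), se_boot)
```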

Comparing models depends on their structure. For instance, there are different Frequentist statistical approaches to compare nested and non-nested models. A nice feature in some situations is that there is a practical interplay between hypothesis testing and confidence intervals. For example, in the normal population mean hypothesis framework, you cannot reject a null hypothesis \(H_0: \mu = \mu^0\) at the \(\alpha\) significance level (Type I error) if \(\mu^0\) is in the \(1-\alpha\) confidence interval. Specifically,

\[ P\left( \mu \in \left[\hat{\mu} - |t_{N-1}^{\alpha/2}| \times \hat{\sigma}_{\hat{\mu}}, \hat{\mu} + |t_{N-1}^{\alpha/2}| \times \hat{\sigma}_{\hat{\mu}}\right] \right) = 1 - \alpha, \]

where \(\hat{\mu}\) is the maximum likelihood estimator of the mean, \(\hat{\sigma}_{\hat{\mu}}\) is its estimated standard error, \(t_{N-1}^{\alpha/2}\) is the quantile of the Student’s \(t\)-distribution at the \(\alpha/2\) probability level with \(N-1\) degrees of freedom, and \(N\) is the sample size.
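A quick numerical check of this duality (a sketch with simulated data; the usual sample standard deviation is used for the standard error):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.normal(loc=0.3, scale=1.0, size=40)    # simulated sample
N, alpha, mu0 = x.size, 0.05, 0.0

mu_hat = x.mean()
se_hat = x.std(ddof=1) / np.sqrt(N)            # estimated standard error of the mean
t_crit = stats.t.ppf(1 - alpha / 2, df=N - 1)  # Student's t quantile

ci = (mu_hat - t_crit * se_hat, mu_hat + t_crit * se_hat)
t_stat = (mu_hat - mu0) / se_hat

# Reject H0 at level alpha exactly when mu0 lies outside the (1 - alpha) CI
print(ci, abs(t_stat) > t_crit, not (ci[0] <= mu0 <= ci[1]))
```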

A remarkable difference between the Bayesian and Frequentist inferential frameworks is the interpretation of credible/confidence intervals. Observe that once we have estimates such that, for example, the previous interval is \([0.2, 0.4]\) at the 95% confidence level, we cannot say that \(P(\mu \in [0.2, 0.4]) = 0.95\) in the Frequentist framework. In fact, this probability is either 0 or 1 in this approach, as \(\mu\) is either in the interval or it is not; the problem is that we will never know which in applied settings. This is because

\[ P(\mu \in [\hat{\mu} - |t_{N-1}^{0.025}| \times \hat{\sigma}_{\hat{\mu}}, \hat{\mu} + |t_{N-1}^{0.025}| \times \hat{\sigma}_{\hat{\mu}}]) = 0.95 \]

is interpreted in the context of repeated sampling. On the other hand, once we have the posterior distribution in the Bayesian framework, we can say that \(P(\mu \in [0.2, 0.4]) = 0.95\).
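The repeated-sampling reading can be checked directly (a simulation sketch with hypothetical parameter values): across many samples, roughly 95% of the realized intervals cover the fixed true mean, even though any single interval either contains it or does not.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
mu_true, N, reps = 1.0, 30, 10_000
t_crit = stats.t.ppf(0.975, df=N - 1)

covered = 0
for _ in range(reps):
    x = rng.normal(mu_true, 2.0, size=N)
    half = t_crit * x.std(ddof=1) / np.sqrt(N)
    covered += (x.mean() - half <= mu_true <= x.mean() + half)

# A statement about the procedure over repeated samples, not about one interval
print(covered / reps)   # close to 0.95
```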

Following common practice, most researchers and practitioners conduct hypothesis testing based on the p-value in the Frequentist framework. But what is a p-value? Most users do not know the answer, as statistical inference is often not performed by statisticians (J. Berger 2006).16 A p-value is the probability of obtaining a statistical summary of the data equal to or more extreme than what was actually observed, assuming that the null hypothesis is true.
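This definition can be rendered directly by Monte Carlo (a sketch, assuming a normal model and a two-sided \(t\) statistic): the p-value is the null-distribution frequency of statistics at least as extreme as the one observed.

```python
import numpy as np

rng = np.random.default_rng(6)
N, mu0 = 25, 0.0
x_obs = rng.normal(0.4, 1.0, size=N)   # "observed" (simulated) data
t_obs = (x_obs.mean() - mu0) / (x_obs.std(ddof=1) / np.sqrt(N))

# Null distribution of the statistic: simulate many datasets with H0 true
x_null = rng.normal(mu0, 1.0, size=(100_000, N))
t_null = (x_null.mean(axis=1) - mu0) / (x_null.std(axis=1, ddof=1) / np.sqrt(N))

# Two-sided p-value: fraction of statistics "as or more extreme" than observed
p_value = np.mean(np.abs(t_null) >= abs(t_obs))
print(round(t_obs, 3), round(p_value, 4))
```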

Therefore, p-value calculations involve not just the observed data, but also more extreme hypothetical observations. Thus,

“What the use of p implies, therefore, is that a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred.” (H. Jeffreys 1961)

Researchers and practitioners using Frequentist inference often intertwine two distinct logical frameworks: Fisher’s p-value approach (Fisher 1958) and the Neyman–Pearson significance testing framework (Neyman and Pearson 1933). The p-value serves as an informal, data-dependent measure of evidence against the null hypothesis. It is rooted in reasoning by reductio ad absurdum, where the extremeness of the observed data is assessed under the assumption that the null hypothesis is true. However, the p-value is frequently misinterpreted as the probability that the null hypothesis is false, a misconception known as the p-value fallacy (Goodman 1999). In contrast, the Neyman–Pearson framework adopts a deductive, long-run perspective: it defines decision rules that control the frequency of Type I errors over repeated sampling, irrespective of the evidence in any particular case. Conflating these two frameworks leads to interpretational inconsistencies, especially when the p-value is used both as a measure of evidence and as a decision-making threshold. A clearer separation of these paradigms is essential for coherent statistical reasoning.

The American Statistical Association has several concerns regarding the use of the p-value as a cornerstone for hypothesis testing in science. This concern motivates the ASA’s statement on p-values (Wasserstein and Lazar 2016), which can be summarized in the following principles:

  • P-values can indicate how incompatible the data are with a specified statistical model.
  • P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
  • Scientific conclusions and business or policy decisions should not be based solely on whether a p-value passes a specific threshold.
  • Proper inference requires full reporting and transparency.
  • A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
  • By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.

To sum up, Fisher proposed the p-value as a witness rather than a judge. Thus, a p-value below the significance level calls for closer scrutiny of the null hypothesis; it is not a final verdict on it.

Another key distinction between frequentist and Bayesian approaches lies in how scientific hypotheses are evaluated. Users of the Frequentist approach rely on the p-value, which quantifies the probability of observing data as extreme as—or more extreme than—the sample under the assumption that the null hypothesis is true. Bayesians, in contrast, use the Bayes factor, which compares the performance of two competing hypotheses by evaluating the ratio of their marginal likelihoods. While the p-value reflects \(P(\text{data} \mid \text{hypothesis})\), the Bayes factor is more aligned with \(P(\text{hypothesis} \mid \text{data})\), though not equivalent. Notably, there exists an approximate relationship between the \(t\)-statistic and the Bayes factor in the context of regression coefficients (A. Raftery 1995), which offers some practical interpretability across paradigms. In particular,

\[ |t|>(\log(N)+6)^{1/2} \]

corresponds to strong evidence in favor of rejecting the null hypothesis of no relevance of a control in a regression. Observe that, in this setting, the threshold of the \(t\) statistic, and as a consequence the significance level, depends on the sample size. This setting agrees with the idea in experimental designs of selecting the sample size such that we control Type I and Type II errors. In observational studies, we cannot control the sample size, but we can select the significance level.
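A quick look at how this sample-size-dependent threshold behaves (a short sketch; the fixed two-sided 5% normal cutoff of about 1.96 is the natural comparison):

```python
import numpy as np

for N in (50, 100, 1_000, 10_000, 100_000):
    threshold = np.sqrt(np.log(N) + 6)    # |t| cutoff for strong evidence (A. Raftery 1995)
    print(N, round(float(threshold), 3))  # grows with N, unlike the fixed 1.96
```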

See also Sellke, Bayarri, and Berger (2001) and Benjamin et al. (2018) for exercises that reveal potential flaws of the p-value (\(p\)) due to \(p \sim U[0,1]\) under the null hypothesis,17 and for calibrations of the p-value that allow it to be interpreted as an odds ratio and an error probability. In particular,

\[ B(p)=-e \times p \times \log(p) \quad \text{when} \quad p < e^{-1} \]

is interpreted as the Bayes factor of \(H_0\) relative to \(H_1\), where \(H_1\) denotes the unspecified alternative to \(H_0\), and

\[ \alpha(p) = \left(1 + \left[-e \times p \times \log(p)\right]^{-1}\right)^{-1} \]

as the error probability \(\alpha\) in rejecting \(H_0\). Take into account that \(B(p)\) and \(\alpha(p)\) are lower bounds.
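Evaluating these bounds at conventional p-values (a small sketch) makes the point numerically: \(p = 0.05\) corresponds to a Bayes-factor lower bound of about 0.41 and an error probability of at least roughly 0.29, far weaker evidence than the nominal 5% level suggests.

```python
import numpy as np

def bayes_factor_bound(p):
    """Lower bound on the Bayes factor of H0 to H1 (valid for p < 1/e)."""
    return -np.e * p * np.log(p)

def error_prob_bound(p):
    """Lower bound on the error probability when rejecting H0."""
    return 1.0 / (1.0 + 1.0 / bayes_factor_bound(p))

for p in (0.05, 0.01, 0.001):
    print(p, round(bayes_factor_bound(p), 3), round(error_prob_bound(p), 3))
```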

The logic of argumentation in the Frequentist approach is based on deductive logic, which means that it starts from a statement about the true state of nature (null hypothesis) and predicts what should be observed if this statement were true. On the other hand, the Bayesian approach is based on inductive logic, which means that it defines which hypothesis is more consistent with what is observed. The former inferential approach establishes that the truth of the premises implies the truth of the conclusion, which is why we reject or fail to reject hypotheses. The latter establishes that the premises supply some evidence, but not full assurance, of the truth of the conclusion, which is why we get probabilistic statements.

Here, there is a distinction between the effects of causes (forward causal inference) and the causes of effects (reverse causal inference) (Andrew Gelman and Imbens 2013; Dawid, Musio, and Fienberg 2016). To illustrate this point, imagine that a firm increases the price of a specific good. Economic theory would suggest that, as a result, demand for the good decreases. In this case, the premise (null hypothesis) is the price increase, and the consequence is the decrease in the firm’s demand.

Alternatively, one could observe a reduction in a firm’s demand and attempt to identify the cause behind it. For example, a reduction in quantity could be due to a negative supply shock. The Frequentist approach typically follows the first view (effects of causes), while Bayesian reasoning focuses on determining the probability of potential causes (causes of effects).

References

Benjamin, Daniel J, James O Berger, Magnus Johannesson, Brian A Nosek, E-J Wagenmakers, Richard Berk, Kenneth A Bollen, et al. 2018. “Redefine Statistical Significance.” Nature Human Behaviour 2 (1): 6–10.
Berger, James O. 2006. “The Case for Objective Bayesian Analysis.” Bayesian Analysis 1 (3): 385–402.
Bernardo, J., and A. Smith. 1994. Bayesian Theory. Chichester: Wiley.
Dawid, A. P., M. Musio, and S. E. Fienberg. 2016. “From Statistical Evidence to Evidence of Causality.” Bayesian Analysis 11 (3): 725–52.
DeGroot, M. H. 1975. Probability and Statistics. London: Addison-Wesley Publishing Co.
Fisher, R. 1958. Statistical Methods for Research Workers. 13th ed. New York: Hafner.
Gelman, Andrew, and Guido Imbens. 2013. “Why Ask Why? Forward Causal Inference and Reverse Causal Questions.” National Bureau of Economic Research.
Goodman, S. N. 1999. “Toward Evidence-Based Medical Statistics. 1: The P Value Fallacy.” Annals of Internal Medicine 130 (12): 995–1004.
Jeffreys, H. 1961. Theory of Probability. London: Oxford University Press.
Kass, R. E., and A. E. Raftery. 1995. “Bayes Factors.” Journal of the American Statistical Association 90 (430): 773–95.
Koop, Gary M. 2003. Bayesian Econometrics. John Wiley & Sons Inc.
Lindley, D. V. 2000. “The Philosophy of Statistics.” The Statistician 49 (3): 293–337.
Neyman, J., and E. Pearson. 1933. “On the Problem of the Most Efficient Tests of Statistical Hypotheses.” Philosophical Transactions of the Royal Society, Series A 231: 289–337.
Pericchi, Luis, and Carlos Pereira. 2015. “Adaptative Significance Levels Using Optimal Decision Rules: Balancing by Weighting the Error Probabilities.” Brazilian Journal of Probability and Statistics.
Raftery, A. 1995. “Bayesian Model Selection in Social Research.” Sociological Methodology 25: 111–63.
Sellke, Thomas, MJ Bayarri, and James O Berger. 2001. “Calibration of p Values for Testing Precise Null Hypotheses.” The American Statistician 55 (1): 62–71.
Wasserstein, Ronald L., and Nicole A. Lazar. 2016. “The ASA’s Statement on p-Values: Context, Process, and Purpose.” The American Statistician 70 (2): 129–33.

  15. A pivotal quantity is a function of unobserved parameters and observations whose probability distribution does not depend on the unknown parameters.↩︎

  16. See also: https://fivethirtyeight.com/features/not-even-scientists-can-easily-explain-p-values/↩︎

  17. See https://joyeuserrance.wordpress.com/2011/04/22/proof-that-p-values-under-the-null-are-uniformly-distributed/ for a simple proof.↩︎