12.2 Regularization
In this century, the amount of available data continues to grow. This means we have access to more covariates for prediction, and we can also generate additional inputs to enhance the predictive power of our models. As a result, we often encounter wide datasets, where the number of inputs may exceed the number of observations. Even in modest settings, we may have hundreds of inputs and wish to identify which of them contribute to accurate predictions. However, we generally avoid using all inputs at once due to the risk of overfitting. Thus, we require some form of input selection or regularization.
In the standard linear regression setting,
\[ \mathbf{y} = \mathbf{i}_N \beta_0 + \mathbf{W}\boldsymbol{\beta} + \boldsymbol{\mu}, \]
where \(\mathbf{i}_N\) is an \(N\)-dimensional vector of ones, \(\mathbf{W}\) is the \(N \times K\) design matrix of inputs, and \(\boldsymbol{\mu} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}_N)\), there has been extensive development of techniques aimed at regularization within the Frequentist inferential framework. These include methods such as Ridge regression (Hoerl and Kennard 1970); discrete subset selection techniques like best subset selection (Furnival and Wilson 1974a), forward selection, and backward stepwise selection (Hastie, Tibshirani, and Friedman 2009); as well as continuous subset selection approaches such as the LASSO (Tibshirani 1996), the elastic net (Zou and Hastie 2005), and OSCAR (Bondell and Reich 2008).
It is important to note, however, that Ridge regression does not perform variable selection; rather, it shrinks coefficient estimates toward zero without setting them exactly to zero.
Ridge regression and the LASSO can be viewed as special cases of a more general class of estimators known as Bridge regression (Fu 1998), which also admits a Bayesian interpretation. Consider the following penalized least squares criterion in the linear regression setting:
\[ \hat{\boldsymbol{\beta}}^{\text{Bridge}} = \arg\min_{\boldsymbol{\beta}} \left\{ \sum_{i=1}^N \left( y_i - \beta_0 - \sum_{k=1}^K \tilde{w}_{ik} \beta_k \right)^2 + \lambda \sum_{k=1}^K |\beta_k|^q \right\}, \quad q \geq 0, \]
where \(\tilde{w}_{ik}\) denotes the standardized inputs. Standardizing inputs is important in variable selection problems to avoid issues caused by differences in scale; otherwise, variables with larger magnitudes will be penalized less and disproportionately influence the regularization path.
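As a minimal illustration (assuming W is the \(N \times K\) matrix of raw inputs and y the response, stored under those names in R), standardization and centering can be done as follows:
W_tilde <- scale(W)     # each column now has mean 0 and standard deviation 1
y_tilde <- y - mean(y)  # centering the response removes the intercept from the penalty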
Interpreting \(|\beta_k|^q\) as proportional to the negative log-prior density of \(\beta_k\), the penalty shapes the contours of the prior distribution on the parameters. Specifically:
- \(q = 0\) corresponds to best subset selection, where the penalty counts the number of nonzero coefficients.
- \(q = 1\) yields the LASSO, which corresponds to a Laplace (double-exponential) prior.
- \(q = 2\) yields ridge regression, which corresponds to a Gaussian prior.
In this light, best subset selection, the LASSO, and ridge regression can be viewed as maximum a posteriori (MAP) estimators under different priors centered at zero (Park and Casella 2008). However, they are not Bayes estimators in the strict sense, since Bayes estimators under quadratic loss are given by the posterior mean. While ridge regression coincides with the posterior mean under a Gaussian prior (Ishwaran and Rao 2005), the LASSO and best subset selection correspond to posterior modes.
This distinction is important because the Bayesian framework naturally incorporates regularization through the use of proper priors, which helps mitigate overfitting. Specifically, when proper shrinkage priors are used, the posterior balances data likelihood and prior information, thus controlling model complexity.
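To make the MAP reading concrete, the following sketch (simulated data; the penalty \(\lambda\) is fixed at 2, \(\sigma^2\) is treated as known, and all variable names are illustrative) checks that the ridge solution obtained by minimizing the penalized criterion coincides with the posterior mean, and mode, of \(\boldsymbol{\beta}\) under the prior \(\boldsymbol{\beta} \sim \mathcal{N}(\mathbf{0}, (\sigma^2/\lambda)\mathbf{I}_K)\):
set.seed(10101)
N <- 100; K <- 5
W  <- scale(matrix(rnorm(N * K), N, K))        # standardized inputs
y  <- W %*% runif(K, -3, 3) + rnorm(N)
yt <- y - mean(y)                              # centered response
lambda <- 2
# Ridge by direct minimization of the penalized least-squares criterion
ridge_obj <- function(b) sum((yt - W %*% b)^2) + lambda * sum(b^2)
b_ridge   <- optim(rep(0, K), ridge_obj, method = "BFGS")$par
# Posterior mean under the Gaussian prior (closed form)
b_bayes   <- solve(t(W) %*% W + lambda * diag(K), t(W) %*% yt)
round(cbind(ridge = b_ridge, bayes = as.numeric(b_bayes)), 4)  # identical up to numerical error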
Furthermore, empirical Bayes methods, where the marginal likelihood is optimized, or cross-validation can be used to estimate the scale parameter of the prior covariance matrix for the regression coefficients. This scale parameter, in turn, determines the strength of regularization in ridge regression.
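For instance, the cross-validation route can be sketched with the glmnet package (simulated data; alpha = 0 selects the ridge penalty, and the chosen penalty plays the role of the prior scale parameter):
set.seed(10101)
N <- 200; K <- 50
W <- matrix(rnorm(N * K), N, K)
y <- drop(W[, 1:5] %*% runif(5, -2, 2)) + rnorm(N)
cvfit <- glmnet::cv.glmnet(W, y, alpha = 0)  # 10-fold cross-validation over a grid of penalties
cvfit$lambda.min                             # penalty value minimizing the cross-validation error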
Note that regularization introduces bias into the parameter estimates because it constrains the model, shrinking the location parameters toward zero. However, it substantially reduces variance, as the estimates are prevented from varying excessively across samples. As a result, the mean squared error (MSE) of the estimates, which equals the sum of the squared bias and the variance, is often lower for regularization methods compared to ordinary least squares (OLS), which remains unbiased under the classical assumptions. This trade-off is particularly important when the goal is to identify causal effects, where unbiasedness may be preferred over predictive accuracy (see Chapter 13).
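A small Monte Carlo sketch (hypothetical settings: \(N = 50\), \(K = 40\), a fixed ridge penalty, and true coefficients drawn once) illustrates this trade-off: the ridge estimates are biased, yet their average MSE is typically below that of OLS when \(K\) is large relative to \(N\):
set.seed(10101)
N <- 50; K <- 40; lambda <- 5; S <- 200
beta <- runif(K, -1, 1)                         # true coefficients, fixed across replications
res <- replicate(S, {
  W  <- scale(matrix(rnorm(N * K), N, K))
  y  <- W %*% beta + rnorm(N)
  yt <- y - mean(y)
  b_ols   <- solve(t(W) %*% W, t(W) %*% yt)
  b_ridge <- solve(t(W) %*% W + lambda * diag(K), t(W) %*% yt)
  c(OLS = mean((b_ols - beta)^2), Ridge = mean((b_ridge - beta)^2))
})
rowMeans(res)  # average MSE across replications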
12.2.1 Bayesian LASSO
Given the popularity of the LASSO as a variable selection technique, we present its Bayesian formulation in this subsection (Park and Casella 2008). The Gibbs sampler for the Bayesian LASSO exploits the representation of the Laplace distribution as a scale mixture of normals. This leads to the following hierarchical representation of the model:
\[ \begin{aligned} \mathbf{y} \mid \beta_0, \boldsymbol{\beta}, \sigma^2, \mathbf{W} &\sim \mathcal{N}(\mathbf{i}_N \beta_0 + \mathbf{W} \boldsymbol{\beta}, \sigma^2 \mathbf{I}_N), \\ \boldsymbol{\beta} \mid \sigma^2, \tau_1^2, \dots, \tau_K^2 &\sim \mathcal{N}(\mathbf{0}_K, \sigma^2 \mathbf{D}_{\tau}), \\ \tau_1^2, \dots, \tau_K^2 &\sim \prod_{k=1}^K \frac{\lambda^2}{2} \exp\left\{ -\frac{\lambda^2}{2} \tau_k^2 \right\}, \\ \pi(\sigma^2) &\propto \frac{1}{\sigma^2}, \\ \pi(\beta_0) &\propto 1, \end{aligned} \]
where \(\mathbf{D}_{\tau} = \operatorname{diag}(\tau_1^2, \dots, \tau_K^2)\), and the priors on \(\sigma^2\) and \(\beta_0\) are improper.
After integrating out \(\tau_1^2, \dots, \tau_K^2\), the conditional prior of \(\boldsymbol{\beta} \mid \sigma^2\) is:
\[ \pi(\boldsymbol{\beta} \mid \sigma^2) = \prod_{k=1}^K \frac{\lambda}{2 \sqrt{\sigma^2}} \exp\left\{ -\frac{\lambda}{\sqrt{\sigma^2}} |\beta_k| \right\}, \]
which implies that the log-prior is proportional to \(\lambda \sum_{k=1}^K |\beta_k|\), matching the penalty term in the LASSO optimization problem.
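The scale mixture representation can be checked numerically: drawing \(\tau_k^2\) from its exponential prior and then \(\beta_k \mid \tau_k^2 \sim \mathcal{N}(0, \sigma^2 \tau_k^2)\) reproduces the Laplace prior above (a sketch with \(\lambda = 2\) and \(\sigma^2 = 1\)):
set.seed(10101)
lambda <- 2; sigma2 <- 1; M <- 1e5
tau2 <- rexp(M, rate = lambda^2 / 2)                 # tau_k^2 ~ Exp(lambda^2 / 2)
beta <- rnorm(M, mean = 0, sd = sqrt(sigma2 * tau2)) # beta_k | tau_k^2 ~ N(0, sigma^2 tau_k^2)
hist(beta, breaks = 100, freq = FALSE, main = "Scale mixture of normals vs. Laplace")
curve(lambda / (2 * sqrt(sigma2)) * exp(-lambda * abs(x) / sqrt(sigma2)),
      add = TRUE, col = "red", lwd = 2)              # Laplace density from the text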
The conditional posterior distributions for the Gibbs sampler are (Park and Casella 2008):
\[ \begin{aligned} \boldsymbol{\beta} \mid \sigma^2, \tau_1^2, \dots, \tau_K^2, \tilde{\mathbf{W}}, \tilde{\mathbf{y}} &\sim \mathcal{N}(\boldsymbol{\beta}_n, \sigma^2 \mathbf{B}_n), \\ \sigma^2 \mid \boldsymbol{\beta}, \tau_1^2, \dots, \tau_K^2, \tilde{\mathbf{W}}, \tilde{\mathbf{y}} &\sim \text{Inverse-Gamma}(\alpha_n/2, \delta_n/2), \\ 1/\tau_k^2 \mid \boldsymbol{\beta}, \sigma^2 &\sim \text{Inverse-Gaussian}(\mu_{kn}, \lambda_n), \\ \beta_0 \mid \sigma^2, \tilde{\mathbf{W}}, \tilde{\mathbf{y}} &\sim \mathcal{N}(\bar{y}, \sigma^2 / N), \end{aligned} \]
where:
\[ \begin{aligned} \boldsymbol{\beta}_n &= \mathbf{B}_n \tilde{\mathbf{W}}^{\top} \tilde{\mathbf{y}}, \\ \mathbf{B}_n &= \left( \tilde{\mathbf{W}}^{\top} \tilde{\mathbf{W}} + \mathbf{D}_{\tau}^{-1} \right)^{-1}, \\ \alpha_n &= (N - 1) + K, \\ \delta_n &= (\tilde{\mathbf{y}} - \tilde{\mathbf{W}} \boldsymbol{\beta})^{\top} (\tilde{\mathbf{y}} - \tilde{\mathbf{W}} \boldsymbol{\beta}) + \boldsymbol{\beta}^{\top} \mathbf{D}_{\tau}^{-1} \boldsymbol{\beta}, \\ \mu_{kn} &= \sqrt{ \frac{ \lambda^2 \sigma^2 }{ \beta_k^2 } }, \\ \lambda_n &= \lambda^2, \end{aligned} \]
and \(\tilde{\mathbf{W}}\) is the matrix of standardized inputs, and \(\tilde{\mathbf{y}}\) is the centered response vector.
Note that the posterior distribution of \(\boldsymbol{\tau}\) depends on the data through \(\boldsymbol{\beta}\) and \(\sigma^2\), which is a typical feature of hierarchical models. In this formulation, we can interpret \(\tau_k\) as local shrinkage parameters, while \(\lambda\) acts as a global shrinkage parameter. Smaller values of \(\tau_k\) and larger values of \(\lambda\) imply stronger shrinkage toward zero. Park and Casella (2008) propose two approaches for specifying the global shrinkage parameter: empirical Bayes estimation or a fully Bayesian hierarchical specification, where \(\lambda^2\) is assigned a Gamma prior.
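For concreteness, the following sketch codes one pass of these Gibbs updates, assuming Wt and yt denote the standardized inputs and centered response, \(\lambda\) is held fixed, and the MASS and statmod packages are available for the multivariate normal and inverse-Gaussian draws; extending it to a full sampler with a Gamma hyperprior on \(\lambda^2\) is the subject of Exercise 1.
gibbs_step <- function(beta, sig2, tau2, Wt, yt, lambda) {
  K <- ncol(Wt); N <- nrow(Wt)
  # beta | . ~ N(Bn %*% t(Wt) %*% yt, sig2 * Bn), with Bn = (Wt'Wt + D_tau^{-1})^{-1}
  Bn    <- solve(t(Wt) %*% Wt + diag(1 / tau2))
  betan <- Bn %*% t(Wt) %*% yt
  beta  <- as.numeric(MASS::mvrnorm(1, as.numeric(betan), sig2 * Bn))
  # sig2 | . ~ Inverse-Gamma(alpha_n / 2, delta_n / 2)
  dn   <- sum((yt - Wt %*% beta)^2) + sum(beta^2 / tau2)
  sig2 <- 1 / rgamma(1, shape = ((N - 1) + K) / 2, rate = dn / 2)
  # 1 / tau_k^2 | . ~ Inverse-Gaussian(sqrt(lambda^2 * sig2 / beta_k^2), lambda^2)
  tau2 <- 1 / statmod::rinvgauss(K, mean = sqrt(lambda^2 * sig2 / beta^2), shape = lambda^2)
  # The intercept is drawn separately from N(mean(y), sig2 / N)
  list(beta = beta, sig2 = sig2, tau2 = tau2)
}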
We should acknowledge that the Bayesian LASSO is more computationally expensive than the Frequentist LASSO. However, it provides credible intervals for the parameters automatically. In contrast, obtaining standard errors in the Frequentist LASSO is more challenging, particularly for parameters estimated to be exactly zero (Kyung et al. 2010).
Example: Simulation exercise to study the Bayesian LASSO performance
We simulate the process
\[\begin{equation*}
y_i = \beta_0 + \sum_{k=1}^{10} \beta_k w_{ik} + \mu_i,
\end{equation*}\]
where \(\beta_k \sim \mathcal{U}(-3, 3)\), \(\mu_i \sim \mathcal{N}(0, 1)\), and \(w_{ik} \sim \mathcal{N}(0, 1)\), for \(i = 1, 2, \dots, 500\).
Additionally, we generate 90 extra potential inputs from a standard normal distribution, which are included in the model specification. Our goal is to assess whether the Bayesian LASSO can successfully identify the truly relevant inputs.
We use the bayesreg package in R to perform the Bayesian LASSO, with 5,000 MCMC draws and 1,000 burn-in iterations. The following code illustrates the simulation exercise and compares the posterior means with the true population values.
The summary of the fit and the plot comparing the population parameters with the posterior means show that the Bayesian LASSO is able to identify the variables that are relevant in the data generating process.
In Exercise 1, we propose programming the Gibbs sampler from scratch, assuming a hierarchical structure for the global shrinkage parameter, and comparing the results with those obtained using the monomvn package.
# Parameters
n <- 500 # sample size
p <- 100 # number of predictors
s <- 10 # number of non-zero coefficients
# Generate design matrix
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
# True beta: first s coefficients are non-zero, rest are zero
beta_true <- c(runif(s, -3, 3), rep(0, p - s))
# Generate response with some noise
sigma <- 1
y <- X %*% beta_true + rnorm(n, sd = sigma)
df <- data.frame(X,y)
### Using bayesreg ###
# Fit the model
fit <- bayesreg::bayesreg(y ~ X, data = df, model = "gaussian", prior = "lasso",
n.samples = 5000, burnin = 1000)
# Check summary
summary(fit)
## ==========================================================================================
## | Bayesian Penalised Regression Estimation ver. 1.3 |
## | (c) Enes Makalic, Daniel F Schmidt. 2016-2024 |
## ==========================================================================================
## Bayesian Gaussian lasso regression Number of obs = 500
## Number of vars = 100
## MCMC Samples = 5000 std(Error) = 1.0284
## MCMC Burnin = 1000 R-squared = 0.9759
## MCMC Thinning = 5 WAIC = 766.71
##
## -------------+-----------------------------------------------------------------------------
## Parameter | mean(Coef) std(Coef) [95% Cred. Interval] tStat Rank ESS
## -------------+-----------------------------------------------------------------------------
## X1 | -0.06844 0.04766 -0.15959 0.02175 -1.436 19 * 5000
## X2 | -2.20902 0.05113 -2.30902 -2.10620 -43.200 3 ** 5000
## X3 | -2.63626 0.05173 -2.73480 -2.53314 -50.963 1 ** 5000
## X4 | -0.73957 0.04892 -0.83262 -0.64014 -15.119 9 ** 5000
## X5 | 1.81039 0.05028 1.70918 1.90559 36.005 6 ** 5000
## X6 | -2.11683 0.05003 -2.21260 -2.01692 -42.312 3 ** 5000
## X7 | -2.09666 0.05171 -2.19847 -1.99440 -40.546 5 ** 4521
## X8 | 0.88095 0.04828 0.78502 0.97211 18.248 8 ** 4883
## X9 | 1.26813 0.05236 1.16647 1.36995 24.218 7 ** 4807
## X10 | -2.55503 0.05113 -2.65156 -2.45091 -49.967 2 ** 4850
## X11 | -0.03009 0.04512 -0.12290 0.05585 -0.667 56 4874
## X12 | -0.02403 0.04917 -0.11931 0.07308 -0.489 63 5000
## X13 | -0.06494 0.04922 -0.16574 0.02483 -1.319 22 * 4922
## X14 | -0.02815 0.04517 -0.11869 0.05836 -0.623 56 4833
## X15 | -0.04614 0.04905 -0.14484 0.04775 -0.941 36 5000
## X16 | -0.08147 0.04875 -0.17906 0.01060 -1.671 14 * 4653
## X17 | -0.02087 0.04724 -0.11483 0.07023 -0.442 63 4936
## X18 | -0.01167 0.04729 -0.10832 0.07944 -0.247 77 5000
## X19 | 0.06309 0.04877 -0.02831 0.16102 1.294 22 * 4929
## X20 | 0.01944 0.04755 -0.07217 0.11667 0.409 63 5000
## X21 | 0.02998 0.04683 -0.05981 0.12457 0.640 52 4706
## X22 | -0.02101 0.04745 -0.11587 0.07157 -0.443 56 5000
## X23 | 0.03316 0.04776 -0.05635 0.13212 0.694 47 4952
## X24 | -0.00297 0.04255 -0.08990 0.08009 -0.070 99 4860
## X25 | -0.00364 0.04291 -0.08907 0.08127 -0.085 77 5000
## X26 | -0.01292 0.04620 -0.10572 0.07784 -0.280 77 4889
## X27 | -0.05888 0.04706 -0.15631 0.02838 -1.251 24 * 5000
## X28 | 0.02074 0.04662 -0.06903 0.11448 0.445 77 4910
## X29 | 0.08343 0.05125 -0.01261 0.18523 1.628 15 * 5000
## X30 | 0.04347 0.04832 -0.04850 0.14218 0.900 35 5000
## X31 | 0.02797 0.04553 -0.05871 0.11941 0.614 52 5000
## X32 | -0.08017 0.04820 -0.17343 0.01236 -1.663 15 * 5000
## X33 | -0.02015 0.04628 -0.11413 0.07051 -0.435 63 4615
## X34 | -0.01937 0.04739 -0.11360 0.07285 -0.409 63 5000
## X35 | -0.01983 0.04543 -0.11087 0.06930 -0.437 63 5000
## X36 | 0.00136 0.04613 -0.09282 0.09398 0.030 94 5000
## X37 | -0.04906 0.04582 -0.14045 0.03930 -1.071 31 5000
## X38 | -0.00757 0.04812 -0.10258 0.08964 -0.157 77 5000
## X39 | 0.01473 0.04822 -0.08227 0.11086 0.306 77 5000
## X40 | -0.02048 0.04782 -0.11556 0.07364 -0.428 77 5000
## X41 | -0.00985 0.04505 -0.10042 0.07671 -0.219 77 4937
## X42 | 0.03601 0.04691 -0.05171 0.13104 0.768 45 5000
## X43 | -0.05504 0.04634 -0.14572 0.03275 -1.188 25 * 4620
## X44 | -0.01813 0.04580 -0.11123 0.06938 -0.396 77 5000
## X45 | 0.07286 0.04695 -0.01380 0.16626 1.552 18 * 4896
## X46 | -0.03810 0.04697 -0.13078 0.05123 -0.811 43 5000
## X47 | -0.05030 0.04995 -0.14953 0.04323 -1.007 31 5000
## X48 | 0.02317 0.04799 -0.07073 0.11967 0.483 63 4927
## X49 | -0.01156 0.04749 -0.10916 0.08249 -0.243 77 4957
## X50 | -0.02061 0.04354 -0.10988 0.06283 -0.473 63 4939
## X51 | -0.05312 0.04594 -0.14623 0.03516 -1.156 27 5000
## X52 | -0.00229 0.04805 -0.09706 0.09194 -0.048 77 5000
## X53 | -0.04892 0.04856 -0.14770 0.04460 -1.007 33 4915
## X54 | -0.04527 0.05034 -0.14746 0.04927 -0.899 36 5000
## X55 | 0.04083 0.04774 -0.05046 0.13552 0.855 38 5000
## X56 | 0.03043 0.04469 -0.05447 0.12099 0.681 47 4960
## X57 | 0.01771 0.04709 -0.07502 0.11188 0.376 63 5000
## X58 | 0.00793 0.04466 -0.07915 0.09819 0.178 94 4917
## X59 | -0.05206 0.04698 -0.14629 0.03533 -1.108 27 * 4799
## X60 | -0.00134 0.04613 -0.09244 0.09164 -0.029 77 5000
## X61 | 0.15103 0.04810 0.05886 0.24343 3.140 10 ** 5000
## X62 | 0.03065 0.04735 -0.06019 0.12737 0.647 56 5000
## X63 | -0.08314 0.05036 -0.18355 0.01269 -1.651 15 * 5000
## X64 | -0.00507 0.04780 -0.10075 0.09101 -0.106 99 5000
## X65 | -0.03424 0.04633 -0.12965 0.05588 -0.739 43 4572
## X66 | -0.02355 0.04543 -0.11411 0.06672 -0.518 63 5000
## X67 | 0.01834 0.04591 -0.06851 0.10984 0.400 77 4761
## X68 | -0.03052 0.04872 -0.12664 0.06346 -0.626 56 4804
## X69 | -0.05088 0.04585 -0.14373 0.03498 -1.110 30 5000
## X70 | -0.06581 0.04738 -0.16132 0.02268 -1.389 20 * 5000
## X71 | -0.04111 0.04639 -0.13444 0.04730 -0.886 38 5000
## X72 | -0.00586 0.04707 -0.09843 0.08764 -0.125 94 4834
## X73 | -0.03931 0.04880 -0.13759 0.05230 -0.806 38 4927
## X74 | 0.02699 0.04675 -0.06570 0.12046 0.577 56 4854
## X75 | 0.06582 0.04733 -0.02329 0.15987 1.391 20 * 5000
## X76 | -0.01311 0.04749 -0.10753 0.07993 -0.276 77 5000
## X77 | 0.04198 0.04755 -0.04894 0.13696 0.883 38 5000
## X78 | -0.02919 0.04772 -0.12514 0.06082 -0.612 52 5000
## X79 | -0.01887 0.04595 -0.11145 0.07088 -0.411 63 4977
## X80 | -0.05852 0.04804 -0.15513 0.03137 -1.218 25 * 4573
## X81 | 0.03337 0.04780 -0.05966 0.13092 0.698 47 5000
## X82 | -0.02568 0.04373 -0.11271 0.05888 -0.587 52 5000
## X83 | -0.04773 0.04728 -0.14203 0.04280 -1.010 33 4836
## X84 | -0.09403 0.05210 -0.19682 0.00623 -1.805 12 * 5000
## X85 | -0.04002 0.04658 -0.13237 0.04913 -0.859 38 5000
## X86 | -0.05402 0.04811 -0.14978 0.03695 -1.123 27 5000
## X87 | -0.00654 0.04590 -0.09800 0.08379 -0.142 77 5000
## X88 | 0.02026 0.04810 -0.07107 0.12050 0.421 63 4786
## X89 | -0.08403 0.04809 -0.17901 0.00743 -1.747 13 * 5000
## X90 | 0.01805 0.04668 -0.07186 0.11478 0.387 63 4908
## X91 | -0.00281 0.04587 -0.09545 0.08864 -0.061 94 4912
## X92 | 0.02670 0.04645 -0.06240 0.11842 0.575 47 5000
## X93 | -0.09959 0.05383 -0.20692 0.00457 -1.850 11 * 5000
## X94 | -0.02220 0.04321 -0.10892 0.06214 -0.514 77 5000
## X95 | 0.00427 0.04780 -0.08911 0.09984 0.089 77 5000
## X96 | 0.01944 0.04613 -0.06947 0.11301 0.421 63 5000
## X97 | -0.03594 0.04724 -0.12937 0.05387 -0.761 47 4802
## X98 | 0.02650 0.04544 -0.06018 0.11789 0.583 56 5000
## X99 | -0.03474 0.04809 -0.13184 0.05845 -0.722 45 5000
## X100 | -0.01175 0.04455 -0.10263 0.07515 -0.264 94 4885
## (Intercept) | 0.06127 0.05096 -0.03814 0.16019 . . .
## -------------+-----------------------------------------------------------------------------
# Extract posterior means of beta
beta_post_mean <- rowMeans(fit$beta)
# Compare true vs estimated
plot(beta_true, beta_post_mean, pch = 19, col = "steelblue",
xlab = "True beta", ylab = "Posterior mean of beta",
main = "Bayesian LASSO Shrinkage")
abline(0, 1, col = "red", lty = 2)
12.2.2 Stochastic search variable selection
Another well-known Bayesian strategy for variable selection in the presence of a large set of regressors (inputs) is stochastic search variable selection (SSVS) (E. I. George and McCulloch 1993; E. George and McCulloch 1997). SSVS is a particular case of the broader class of spike-and-slab priors, in which the prior distribution for the location parameters is specified as a hierarchical mixture that captures the uncertainty inherent in variable selection problems (Ishwaran and Rao 2005).
The hierarchical structure of the model is given by:
\[ \begin{aligned} \mathbf{y} \mid \beta_0, \boldsymbol{\beta}, \sigma^2, \mathbf{W} &\sim \mathcal{N}(\mathbf{i}_N \beta_0 + \mathbf{W} \boldsymbol{\beta}, \sigma^2 \mathbf{I}_N), \\ \boldsymbol{\beta} \mid \boldsymbol{\gamma} &\sim \mathcal{N}(\mathbf{0}_K, \mathbf{D}_\gamma \mathbf{R} \mathbf{D}_\gamma), \\ \sigma^2 &\sim \text{Inverse-Gamma}\left(\frac{v}{2}, \frac{v \lambda_\gamma}{2}\right), \\ \gamma_k &\sim \text{Bernoulli}(p_k), \end{aligned} \]
where \(p_k\) is the prior inclusion probability of regressor \(w_k\), that is, \(P(\gamma_k = 1) = 1 - P(\gamma_k = 0) = p_k\), \(\mathbf{R}\) is a correlation matrix, and \(\mathbf{D}_\gamma\) is a diagonal matrix whose \((k,k)\)-th element is defined as:
\[ (\mathbf{D}_\gamma)_{kk} = \begin{cases} v_{0k}, & \text{if } \gamma_k = 0, \\ v_{1k}, & \text{if } \gamma_k = 1. \end{cases} \]
Since \(\mathbf{R}\) has a unit diagonal, this formulation implies that each coefficient follows the two-component mixture
\[ \beta_k \mid \gamma_k \sim (1 - \gamma_k) \, \mathcal{N}(0, v_{0k}^2) + \gamma_k \, \mathcal{N}(0, v_{1k}^2), \]
where \(v_{0k}\) is chosen to be small and \(v_{1k}\) to be large. Therefore, when the data favor \(\gamma_k = 0\), the corresponding \(\beta_k\) is shrunk toward zero, effectively excluding input \(k\) from the model. In this sense, \(\mathcal{N}(0, v_{0k}^2)\) is a spike prior concentrated at zero, while \(\mathcal{N}(0, v_{1k}^2)\) is a diffuse slab prior.
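Plotting the two components for hypothetical values of the prior standard deviations, say \(v_{0k} = 0.1\) and \(v_{1k} = 3\), makes the spike-and-slab terminology clear:
v0 <- 0.1; v1 <- 3  # hypothetical spike and slab standard deviations
curve(dnorm(x, 0, v0), from = -6, to = 6, n = 500,
      ylab = "Prior density", main = "Spike-and-slab prior components")
curve(dnorm(x, 0, v1), add = TRUE, col = "red")
legend("topright", legend = c("spike: N(0, 0.01)", "slab: N(0, 9)"),
       col = c("black", "red"), lty = 1)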
A critical aspect of SSVS is the choice of the hyperparameters \(v_{0k}\) and \(v_{1k}\), as they determine the amount of shrinkage applied to the regression coefficients (see E. I. George and McCulloch (1993) and E. George and McCulloch (1997) for details).
The assumption \(\gamma_k \sim \text{Bernoulli}(p_k)\) implies that the prior on the inclusion indicators is given by:
\[ \pi(\boldsymbol{\gamma}) = \prod_{k=1}^K p_k^{\gamma_k} (1 - p_k)^{1 - \gamma_k}. \]
This means that the inclusion of input \(k\) is independent of the inclusion of any other input \(j \neq k\). A common choice is the uniform prior \(\pi(\boldsymbol{\gamma}) = 2^{-K}\), which corresponds to setting \(p_k = 1/2\), giving each regressor an equal chance of being included (Ishwaran and Rao 2005).
A practical choice for the correlation matrix is to set \(\mathbf{R} \propto (\tilde{\mathbf{W}}^{\top} \tilde{\mathbf{W}})^{-1}\) (E. I. George and McCulloch 1993). Regarding the hyperparameters \(v\) and \(\lambda_\gamma\), it is helpful to interpret \(\lambda_\gamma\) as a prior estimate of \(\sigma^2\), and \(v\) as the prior sample size associated with this estimate. In the absence of prior information, E. I. George and McCulloch (1997) recommend setting \(\lambda_\gamma\) equal to the least squares estimate of the variance from the saturated model, that is, the model including all regressors, and \(v\) to a small number, for instance, 0.01.
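A minimal sketch of this default choice, assuming W and yh denote the standardized inputs and centered response constructed in the example below:
ols.sat  <- lm(yh ~ W - 1)            # saturated model with all regressors
lambda.g <- summary(ols.sat)$sigma^2  # prior estimate of sigma^2
v        <- 0.01                      # small prior sample size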
The conditional posterior distributions for the Gibbs sampler are (E. I. George and McCulloch 1993):
\[ \begin{aligned} \boldsymbol{\beta} \mid \sigma^2, \gamma_1, \dots, \gamma_K, \tilde{\mathbf{W}}, \tilde{\mathbf{y}} &\sim \mathcal{N}(\boldsymbol{\beta}_n, \mathbf{B}_n), \\ \sigma^2 \mid \boldsymbol{\beta}, \gamma_1, \dots, \gamma_K, \tilde{\mathbf{W}}, \tilde{\mathbf{y}} &\sim \text{Inverse-Gamma}(\alpha_n/2, \delta_n/2), \\ \gamma_k \mid \boldsymbol{\beta}, \boldsymbol{\gamma}_{-k} &\sim \text{Bernoulli}(p_{kn}), \end{aligned} \]
where:
\[ \begin{aligned} \boldsymbol{\beta}_n &= \sigma^{-2} \mathbf{B}_n \tilde{\mathbf{W}}^{\top} \tilde{\mathbf{y}}, \\ \mathbf{B}_n &= \left(\sigma^{-2} \tilde{\mathbf{W}}^{\top} \tilde{\mathbf{W}} + \mathbf{D}_{\gamma}^{-1}\mathbf{R}^{-1}\mathbf{D}_{\gamma}^{-1} \right)^{-1}, \\ \alpha_n &= N + v, \\ \delta_n &= (\tilde{\mathbf{y}} - \tilde{\mathbf{W}} \boldsymbol{\beta})^{\top} (\tilde{\mathbf{y}} - \tilde{\mathbf{W}} \boldsymbol{\beta}) + v\lambda_{\gamma}, \\ p_{kn} &= \frac{\pi(\boldsymbol{\beta}\mid \boldsymbol{\gamma}_{-k},\gamma_k=1)\times p_k}{\pi(\boldsymbol{\beta}\mid \boldsymbol{\gamma}_{-k},\gamma_k=1)\times p_k+\pi(\boldsymbol{\beta}\mid \boldsymbol{\gamma}_{-k},\gamma_k=0)\times (1-p_k)}, \end{aligned} \]
where \(\tilde{\mathbf{W}}\) is the matrix of standardized inputs, \(\tilde{\mathbf{y}}\) is the centered response vector, \(\boldsymbol{\gamma}_{-k}\) denotes the vector composed of \(\gamma_1, \dots, \gamma_K\) excluding \(\gamma_k\), and \(\pi(\boldsymbol{\beta} \mid \boldsymbol{\gamma}_{-k}, \gamma_k = \delta)\) is the conditional prior density of \(\boldsymbol{\beta}\), evaluated at the current draw of \(\boldsymbol{\beta}\), given \(\boldsymbol{\gamma}_{-k}\) and \(\gamma_k = \delta\), where \(\delta \in \{0,1\}\).
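The following sketch codes one pass of these updates under the simplifying assumptions \(\mathbf{R} = \mathbf{I}_K\) and common spike and slab standard deviations v0 and v1 across regressors, in which case the Bernoulli probability reduces to a ratio of univariate normal densities; Wt and yt denote the standardized inputs and centered response, and v, lambda.g, and pk are the hyperparameters defined above.
ssvs_step <- function(beta, sig2, gamma, Wt, yt, v0, v1, v, lambda.g, pk) {
  K <- ncol(Wt); N <- nrow(Wt)
  d <- ifelse(gamma == 1, v1, v0)                 # diagonal of D_gamma
  # beta | . ~ N(beta_n, B_n), with B_n = (Wt'Wt / sig2 + D_gamma^{-2})^{-1}
  Bn    <- solve(t(Wt) %*% Wt / sig2 + diag(1 / d^2))
  betan <- (Bn %*% t(Wt) %*% yt) / sig2
  beta  <- as.numeric(MASS::mvrnorm(1, as.numeric(betan), Bn))
  # sig2 | . ~ Inverse-Gamma((N + v) / 2, (RSS + v * lambda.g) / 2)
  dn   <- sum((yt - Wt %*% beta)^2) + v * lambda.g
  sig2 <- 1 / rgamma(1, shape = (N + v) / 2, rate = dn / 2)
  # gamma_k | . ~ Bernoulli(p_kn); with R = I the prior density factorizes over k
  for (k in 1:K) {
    a <- dnorm(beta[k], 0, v1) * pk
    b <- dnorm(beta[k], 0, v0) * (1 - pk)
    gamma[k] <- rbinom(1, 1, a / (a + b))
  }
  list(beta = beta, sig2 = sig2, gamma = gamma)
}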
In general, it is wise to consider the inclusion of regressors jointly due to potential correlations among them; that is, the marginal frequency of \(\gamma_k = 1\) should be interpreted with caution. SSVS is more effective at identifying a good set of potential models rather than selecting a single best model.
Example: Simulation exercise to study SSVS performance
Let’s use the simulation setting from the previous example to evaluate the performance of SSVS in uncovering the data-generating process. In particular, we use the BoomSpikeSlab package to implement this example.
The analysis is performed using 5,000 posterior draws. We specify the prior via the SpikeSlabPrior function, which allows the user to modify the package defaults; here we set the expected model size, the prior information weight, and the diagonal shrinkage.
The results show that the posterior inclusion probabilities for regressors 2 through 10 are 100%, and the model with the highest posterior probability (94%) includes all of these nine variables. However, the true data-generating process, which also includes regressor 1, receives a posterior model probability of 0%. This is because the population coefficient of this regressor is essentially zero. The plot comparing the posterior means with the true population parameters indicates good performance of SSVS. In general, Bayesian methods for variable selection perform well, and the choice of the most suitable method largely depends on the prior specification (O’Hara and Sillanpää 2009).
####### Stochastic search variable selection #######
rm(list = ls()); set.seed(10101)
library(BoomSpikeSlab)
library(dplyr)
library(tibble)
# Parameters
n <- 500 # sample size
k <- 100 # number of predictors
s <- 10 # number of non-zero coefficients
# Generate design matrix
X <- matrix(rnorm(n * k), nrow = n, ncol = k)
# True beta: first s coefficients are non-zero, rest are zero
beta_true <- c(runif(s, -3, 3), rep(0, k - s))
# Generate response with some noise
sigma <- 1
y <- X %*% beta_true + rnorm(n, sd = sigma)
df <- data.frame(X,y)
### Using BoomSpikeSlab ###
#Scale regressors
W <- scale(X); yh <- y - mean(y)
prior <- SpikeSlabPrior(W, yh,
                        expected.model.size = ncol(W)/2,  # expect 50 nonzero predictors
                        prior.df = .01,                   # small prior degrees of freedom
                        prior.information.weight = .01,
                        diagonal.shrinkage = 0)           # 0 yields Zellner's g-prior
niter <- 5000
######Estimate model########
SSBoomNew <- lm.spike(yh ~ W - 1, niter = niter, prior = prior)
Models <- SSBoomNew$beta != 0
PIP <- colMeans(SSBoomNew$beta != 0)
# Convert the logical matrix to a data frame and then to a tibble
df <- as.data.frame(Models); df_tbl <- as_tibble(df)
# Count identical rows
row_counts <- df_tbl %>% count(across(everything()), name = "frequency") %>% arrange(desc(frequency))
# Total frequency of the 100 most visited models
sum(row_counts[1:100, 101])
## [1] 3948
# True data-generating model as a logical inclusion vector (first 10 inputs)
trueModel <- c(rep(1, 10), rep(0, 90)) == 1
# Identify which visited models match the true model
matching_rows <- apply(row_counts[, -101], 1, function(row) all(row == trueModel))
# Frequency of the true model among the posterior draws
row_counts[which(matching_rows), 101]
## # A tibble: 1 × 1
## frequency
## <int>
## 1 3
# Coefficients
SummarySS <- summary(coda::mcmc(SSBoomNew$beta))
# Extract posterior means of beta
beta_post_mean <- SummarySS$statistics[, 1]
# Compare true vs estimated
plot(beta_true, beta_post_mean, pch = 19, col = "steelblue",
xlab = "True beta", ylab = "Posterior mean of beta",
main = "SSVS Shrinkage")
abline(0, 1, col = "red", lty = 2)
The examples and exercises presented thus far have considered scenarios in which the number of inputs is smaller than the number of observations (\(K < N\)). In Exercise 4, we challenge the Bayesian LASSO and SSVS in a setting where the number of inputs exceeds the sample size (\(K > N\)). As you will observe in that experiment, both the Bayesian LASSO and SSVS perform well. However, the Bayesian LASSO requires more time to produce results compared to SSVS in this exercise. Ročková and George (2018) propose a connection between the LASSO and spike-and-slab priors for variable selection in linear models, offering oracle properties and optimal posterior concentration even in high-dimensional settings where \(K > N\).
In addition, there are other Bayesian methods for regularization, such as the spike-and-slab approach proposed by Ishwaran and Rao (2005) and non-local priors introduced by Johnson and Rossell (2012), which can be implemented using the R packages spikeslab and mombf, respectively.
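A tentative sketch of how these packages could be called on the simulated data above (the function arguments shown are assumptions based on each package's documented interface, so consult their help files before use):
# library(spikeslab); library(mombf)
# Spike-and-slab of Ishwaran and Rao (2005)
# fit.ss  <- spikeslab::spikeslab(x = X, y = y)
# Non-local priors of Johnson and Rossell (2012)
# fit.nlp <- mombf::modelSelection(y = y, x = X,
#                                  priorCoef = mombf::momprior(tau = 0.348),
#                                  priorDelta = mombf::modelbbprior(1, 1))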