Lecture 6 Bayesian Modelling
6.1 Overview
In this part of the lecture, we introduce modeling techniques used in Bayesian inference. We do so by revisiting the familiar example of the linear model in Section 6.2. Subsequently, in Section 6.3 we provide an overview of methods used to obtain samples from the posterior distribution. Particular emphasis is placed on the Gibbs Sampling Algorithm, which is explained in general terms in Section 6.3.1 and applied to the linear model in Section 6.3.2. To help clarify the different components and hyperparameters involved in the algorithm, we provide an interactive shiny app in Section 6.3.2.4 that covers the entire procedure, from initialization to final results.
6.2 Example: Linear Model
6.2.1 Setup
For the Bayesian modeling approaches, we take the linear model that was introduced in Lecture 3.

Reminder: Dimensions
\[\begin{eqnarray} \boldsymbol{Y} = \underbrace{\left( \begin{array}{c} y_1 \\ \vdots \\ y_n \end{array} \right)}_{n\times 1} \end{eqnarray}\]
\[\begin{eqnarray} \boldsymbol{\mu_Y} = \boldsymbol{X\beta} = \underbrace{\left(\begin{array}{cccc} 1 & x_{11} & \cdots & x_{1p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{array} \right)}_{n \times (p+1)} \underbrace{\left(\begin{array}{c} \beta_0 \\ \vdots \\ \beta_p \end{array} \right)}_{(p+1) \times 1} \end{eqnarray}\]
\[\begin{eqnarray} \boldsymbol{\Sigma_Y} = \sigma^2\boldsymbol{I} = \underbrace{\left(\begin{array}{ccc} \sigma^2 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \sigma^2 \end{array} \right)}_{n \times n} \end{eqnarray}\]

6.2.2 Bayesian Considerations
In Bayesian inference, we assume that parameters follow probability distributions. For the linear model, the random parameters for which we need to specify distributions are \(\bb\) and \(\sigma^2\). To find suitable distributions, we have to consider the domain for each of the parameters:
- The parameter vector \(\bb\) is defined on the entire real line: \(-\infty < \bb < \infty\). Since \(\bb\) is a vector of (possibly several) real-valued random variables, the Multivariate Normal Distribution is a natural choice.
- The variance \(\sigma^2\) must be strictly positive (\(\sigma^2 > 0\)), so we need a distribution with support on the positive real line. Candidates include the Chi-Squared, Gamma, and Inverse Gamma distributions. We choose the latter because it is conjugate to the Multivariate Normal Distribution.
6.2.3 Bayesian Ingredients
Every Bayesian Model needs a well-defined Likelihood and Prior distribution(s) for the parameter(s):
Likelihood:
- \(\boldsymbol{Y}\sim\MVN(\boldsymbol{X}\boldsymbol{\beta}, \sigma^2\boldsymbol{I})\)
Priors:
- \(\bb|\sigma^2\sim\MVN(\boldsymbol{m},\sigma^2\boldsymbol{M})\)
- \(\sigma^2\sim \IG(a_0,b_0)\)
Question: What would an uninformative prior for the coefficient vector look like?
When we have no prior knowledge about the possible values of the coefficient vector \(\bb\), a typical approach is to set:
\[ \boldsymbol{m} = \left(\begin{array}{c} 0 \\ \vdots \\ 0 \end{array} \right) \] and
\[ \boldsymbol{M} = \left(\begin{array}{ccc} \psi & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \psi \end{array} \right) \] where \(\psi\) is an arbitrarily large number, such as 100,000. This choice effectively results in a flat prior distribution for \(\bb\).
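To make this concrete, here is a small R sketch (illustrative, not part of the lecture materials) of how such a flat prior could be set up; the number of covariates `p` and the value `1e5` for \(\psi\) are arbitrary choices.

```r
# Illustrative setup of an uninformative prior for beta (assuming p covariates plus an intercept)
p   <- 3        # number of covariates
psi <- 1e5      # arbitrarily large prior variance

m <- rep(0, p + 1)        # prior mean vector: all zeros
M <- diag(psi, p + 1)     # prior covariance matrix: psi on the diagonal
```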
6.3 Bayesian Inference
Our goal in Bayesian inference is to produce samples from the posterior distribution. There are several ways to obtain these:
- analytical calculation of posterior distribution (e.g., for conjugate priors)
- Markov Chain Monte Carlo (MCMC) simulation techniques
  - Gibbs Sampler
  - Metropolis-Hastings Sampler
  - …
6.3.1 Gibbs Sampling Algorithm
The Gibbs Sampler is an MCMC technique: it generates a Markov chain of draws in which each parameter is drawn from its full conditional distribution given the most recently drawn values of all other parameters.
Assume we have a parameter vector \(\boldsymbol{\theta}\) of length \(K\) with entries \(\theta_1,\ldots,\theta_K\):
Initialize \(\boldsymbol{\theta} =\boldsymbol{\theta}^{(0)}\)
For iteration \(t\) in \(1\):\(T\) repeat:
for \(k \in\) \(1\):\(K\) draw \(\theta^{(t)}_k\) from the full conditional distribution \[p\left(\theta_k|\theta^{(t)}_{-k},\boldsymbol{y}\right)\] where \[ \theta_{-k}^{(t)} = \begin{cases} \left( \theta_2^{(t-1)}, \ldots, \theta_K^{(t-1)} \right), & \text{if } k = 1 \\ \left( \theta_1^{(t)}, \ldots, \theta_{k-1}^{(t)}, \theta_{k+1}^{(t-1)}, \ldots, \theta_K^{(t-1)} \right), & \text{if } 1 < k < K \\ \left( \theta_1^{(t)}, \ldots, \theta_{K-1}^{(t)} \right), & \text{if } k = K \end{cases} \]

Result after burn-in (deletion of the first iterations before convergence): a vector of random numbers from each marginal posterior distribution \(p(\theta_k|\boldsymbol{y})\).
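To illustrate the mechanics of the algorithm before applying it to the linear model, the following minimal R sketch (not part of the lecture code) runs a Gibbs sampler for a standard bivariate normal example with known correlation `rho`, where both full conditional distributions are univariate normals; all names and settings are illustrative.

```r
# Gibbs sampler for (theta1, theta2) ~ bivariate normal with mean 0,
# unit variances and correlation rho: each full conditional is
# theta_k | theta_{-k} ~ N(rho * theta_{-k}, 1 - rho^2).
set.seed(1)
rho     <- 0.8
T_iter  <- 5000
burn_in <- 500

theta   <- matrix(NA_real_, nrow = T_iter, ncol = 2)
current <- c(0, 0)                       # theta^(0): initial values

for (t in seq_len(T_iter)) {
  # draw theta_1 from its full conditional given the latest theta_2
  current[1] <- rnorm(1, mean = rho * current[2], sd = sqrt(1 - rho^2))
  # draw theta_2 from its full conditional given the newly drawn theta_1
  current[2] <- rnorm(1, mean = rho * current[1], sd = sqrt(1 - rho^2))
  theta[t, ] <- current
}

post <- theta[-(1:burn_in), ]            # drop burn-in iterations
colMeans(post)                           # approx. marginal posterior means
cor(post)[1, 2]                          # should be close to rho
```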
6.3.2 Gibbs Sampler Application: Linear Model
Remember that in the linear model the parameters of interest are \(\bb\) and \(\sigma^2\) so our goal is to find the marginal posterior distributions \(p(\bb \mid \boldsymbol{y})\) and \(p(\sigma^2 \mid \boldsymbol{y})\).
To apply Gibbs sampling, we first need to find the full conditionals:
- \(p(\boldsymbol{\beta}|\cdot) = p(\boldsymbol{\beta}|\sigma^2,\boldsymbol{y})\)
- \(p(\sigma^2|\cdot) =p(\sigma^2|\boldsymbol{\beta}, \boldsymbol{y})\)
These full conditionals are then used as a tool to generate samples from the posterior distribution.
6.3.2.1 Full Conditional for \(\bb\)
Bayes’ theorem gives us the following form for the full conditional of \(\bb\): \[p(\boldsymbol{\beta}|\sigma^2, \boldsymbol{y}) = \frac{p(\boldsymbol{\beta}, \sigma^2, \boldsymbol{y})}{p(\sigma^2, \boldsymbol{y})}\]
Since \(\bb\) does not appear in the denominator, in the process of finding the full conditional we can focus on the numerator only (using the proportionality sign). Plugging in the distributions defined in Section 6.2.3, we can now derive the full conditional by removing all parts that do not depend on \(\bb\):
\[\begin{aligned} p(\bb|\cdot) \propto & \;p(\boldsymbol{y}|\boldsymbol{\beta},\sigma^2)p(\boldsymbol{\beta}|\sigma^2) \\ = &\;(2 \pi)^{-\frac{n}{2}}\text{det}(\sigma^2 \boldsymbol{I})^{-\frac{1}{2}} \exp{\left(-\frac{1}{2} (\yb - \boldsymbol{X \beta})^{\top}(\sigma^2 \boldsymbol{I})^{-1}(\yb - \boldsymbol{X \beta}) \right)} \\ &\;(2 \pi)^{-\frac{p+1}{2}} \text{det}(\sigma^2 \boldsymbol{M})^{-1/2} \exp{\left(-\frac{1}{2} (\bb - \boldsymbol{m})^{\top}(\sigma^2 \boldsymbol{M})^{-1}(\bb - \boldsymbol{m}) \right)} \\ \propto &\; \exp{\left(-\frac{1}{2} (\yb - \boldsymbol{X \beta})^{\top}(\sigma^2 \boldsymbol{I})^{-1}(\yb - \boldsymbol{X \beta}) \right)} \\ &\; \exp{\left(-\frac{1}{2}(\bb-\mb)^{\top}(\sigma^2 \Mb)^{-1}(\bb - \mb) \right)}\\ = &\; \exp{\left(-\frac{1}{2\sigma^2} \left(\bb^\top\Xb^\top\Xb\bb - 2\bb^\top\Xb^\top\yb + \yb^\top\yb \right) \right)} \\ &\; \exp{\left(-\frac{1}{2\sigma^2} \left(\bb^\top \Mb^{-1} \bb - 2\bb^\top \Mb^{-1} \mb + \mb^\top \Mb^{-1} \mb \right)\right)} \\ \propto &\; \exp{\left(-\frac{1}{2\sigma^2} \left(\bb^\top\Xb^\top\Xb\bb - 2\bb^\top\Xb^\top\yb + \bb^\top \Mb^{-1} \bb - 2\bb^\top \Mb^{-1} \mb \right)\right)} \\ = &\; \exp{\biggl( -\frac{1}{2} \bigl(\bb^\top\underbrace{\frac{1}{\sigma^2}\left(\Xb^\top\Xb + \Mb^{-1}\right)}_{:=\boldsymbol{\Sigma}^{-1}_{\bb}}\bb \bigr) + \frac{1}{\sigma^2}\left(\bb^\top\left(\Xb^\top\yb + \Mb^{-1}\mb\right)\right)\biggr)} \\ = &\; \exp{\biggl(-\frac{1}{2}\Bigl(\bb^\top\boldsymbol{\Sigma}_\bb^{-1}\bb\Bigr) + \bb^\top\boldsymbol{\Sigma}_\bb^{-1}\underbrace{\frac{1}{\sigma^2}\boldsymbol{\Sigma}_\bb \bigl(\Xb^\top\yb + \Mb^{-1}\mb\bigr)}_{:=\boldsymbol{\mu}_\bb} \biggr)} \\ = &\; \exp{\biggl(-\frac{1}{2}\Bigl(\bb^\top\boldsymbol{\Sigma}_\bb^{-1}\bb\Bigr) + \bb^\top\boldsymbol{\Sigma}_\bb^{-1}\boldsymbol{\mu}_\bb\biggr)} \end{aligned}\]

This is the kernel of a multivariate normal distribution, so the full conditional is \(\bb|\cdot \sim \MVN(\boldsymbol{\mu}_\bb, \boldsymbol{\Sigma}_\bb)\).
Expected value with weak prior information

If we include the prior \(\bb|\sigma^2 \sim \MVN(\boldsymbol{m}, \sigma^2\boldsymbol{M})\), the posterior mean \(\boldsymbol{\mu}_\bb\) becomes a weighted average between the OLS estimate (from the likelihood) and the prior mean \(\boldsymbol{m}\), weighted by the prior precision \(\boldsymbol{M}^{-1}\). With no or weak prior information (flat prior, i.e., \(\boldsymbol{M} \rightarrow \infty\boldsymbol{I}\), so that \(\boldsymbol{M}^{-1} \rightarrow \boldsymbol{0}\)) we draw \(\bb\) from a distribution whose expected value equals the OLS estimator: \[\begin{aligned} \boldsymbol{\mu}_\bb &= \left(\frac{1}{\sigma^2}\Xb^\top\Xb + \frac{1}{\sigma^2}\Mb^{-1}\right)^{-1} \left(\frac{1}{\sigma^2}\Xb^\top\yb + \frac{1}{\sigma^2}\Mb^{-1}\mb\right) \\ &= \left(\Xb^\top\Xb + \Mb^{-1}\right)^{-1} \left(\Xb^\top\yb + \Mb^{-1}\mb\right) \\ &\approx \left(\Xb^\top\Xb\right)^{-1} \left(\Xb^\top\yb\right) \end{aligned}\]
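As a quick numerical check, the following R sketch (with simulated data and an assumed, nearly flat prior) compares the posterior mean formula for \(\boldsymbol{\mu}_\bb\) with the OLS estimator; all object names and settings are illustrative.

```r
set.seed(42)
# simulate a small data set from the linear model
n <- 200; p <- 2
X <- cbind(1, matrix(rnorm(n * p), n, p))        # design matrix with intercept
beta_true <- c(1, -2, 0.5)
y <- as.vector(X %*% beta_true + rnorm(n, sd = 1))

# nearly flat prior: m = 0, M = psi * I with large psi, so M^{-1} is close to zero
m    <- rep(0, p + 1)
Minv <- diag(1 / 1e5, p + 1)

mu_beta  <- solve(t(X) %*% X + Minv, t(X) %*% y + Minv %*% m)   # posterior mean formula
beta_ols <- solve(t(X) %*% X, t(X) %*% y)                       # OLS estimate

round(cbind(posterior_mean = as.vector(mu_beta),
            OLS            = as.vector(beta_ols)), 3)           # nearly identical columns
```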
6.3.2.2 Full Conditional for \(\sigma^2\)

Using Bayes’ theorem we can state the full conditional as: \[\begin{aligned} p(\sigma^2|\cdot)\propto &\; p(\boldsymbol{y}|\boldsymbol{\beta},\sigma^2)p(\boldsymbol{\beta}|\sigma^2)p(\sigma^2) \\ = &\;(2 \pi)^{-\frac{n}{2}}\text{det}(\sigma^2 \boldsymbol{I})^{-\frac{1}{2}} \exp{\left(-\frac{1}{2} (\yb - \boldsymbol{X \beta})^{\top}(\sigma^2 \boldsymbol{I})^{-1}(\yb - \boldsymbol{X \beta}) \right)} \\ &\;(2 \pi)^{-\frac{p+1}{2}} \text{det}(\sigma^2 \boldsymbol{M})^{-1/2} \exp{\left(-\frac{1}{2} (\bb - \boldsymbol{m})^{\top}(\sigma^2 \boldsymbol{M})^{-1}(\bb - \boldsymbol{m}) \right)} \\ &\; \frac{b_0^{a_0}}{\Gamma(a_0)}\left(\sigma^2\right)^{-(a_0+1)} \exp{\left(-\frac{b_0}{\sigma^2}\right)} \\ \propto &\; \left(\sigma^2\right)^{-\frac{n}{2}}\exp{\left(-\frac{1}{2\sigma^2} \left(\yb-\Xb\bb\right)^\top\left(\yb - \Xb\bb\right)\right)} \\ &\; \left(\sigma^2\right)^{-\frac{p+1}{2}} \exp{\left(-\frac{1}{2\sigma^2}\left(\bb-\boldsymbol{m}\right)^\top\boldsymbol{M}^{-1}\left(\bb-\boldsymbol{m}\right)\right)} \\ &\; \left(\sigma^2\right)^{-(a_0+1)} \exp{\left(-\frac{b_0}{\sigma^2}\right)} \\ =&\; \left(\sigma^2\right)^{-(\frac{n}{2}+\frac{p+1}{2}+a_0+1)} \\ &\; \exp{\left(-\frac{1}{\sigma^2}\left(\frac{1}{2}\left(\left(\yb-\Xb\bb\right)^\top\left(\yb - \Xb\bb\right) + \left(\bb-\boldsymbol{m}\right)^\top\boldsymbol{M}^{-1}\left(\bb-\boldsymbol{m}\right)\right) + b_0\right)\right)} \end{aligned}\]

This is the kernel of an Inverse Gamma distribution, so the full conditional is \(\sigma^2|\cdot \sim \IG\left(a_{\sigma^2}, b_{\sigma^2}\right)\) with \(a_{\sigma^2} = a_0 + \frac{n}{2} + \frac{p+1}{2}\) and \(b_{\sigma^2} = b_0 + \frac{1}{2}\left(\left(\yb-\Xb\bb\right)^\top\left(\yb - \Xb\bb\right) + \left(\bb-\boldsymbol{m}\right)^\top\boldsymbol{M}^{-1}\left(\bb-\boldsymbol{m}\right)\right)\).

Expected value with weak prior information
In the case of flat priors, i.e., \(\boldsymbol{M}^{-1} = \boldsymbol{0}\) and \(a_0, b_0 \rightarrow 0\), the expected value for the variance is as follows: \[\begin{aligned} \mathbb{E}(\sigma^2 | \cdot) &= \frac{b_{\sigma^2}}{a_{\sigma^2} -1} \\ &= \frac{b_0 + \frac{1}{2}\left(\left(\yb-\Xb\bb\right)^\top\left(\yb - \Xb\bb\right) + \left(\bb-\boldsymbol{m}\right)^\top\boldsymbol{M}^{-1}\left(\bb-\boldsymbol{m}\right)\right)}{a_0 + \frac{n}{2}+\frac{p+1}{2} - 1} \\ &\approx \frac{ \frac{1}{2}\left(\left(\yb-\Xb\bb\right)^\top\left(\yb - \Xb\bb\right)\right)}{\frac{n}{2}+\frac{p+1}{2} - 1} \end{aligned}\] \[\Rightarrow \hat{\sigma}^2 = \frac{1}{n-p-1}\sum^{n}_{i=1}(y_i - \boldsymbol{x}_i\bb)^2\]
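For the implementation it is useful to note that drawing \(\sigma^2 \sim \IG(a_{\sigma^2}, b_{\sigma^2})\) is equivalent to drawing \(1/\sigma^2\) from a Gamma distribution with shape \(a_{\sigma^2}\) and rate \(b_{\sigma^2}\). A short R sketch of the \(\sigma^2\) draw, continuing the simulated objects (`X`, `y`, `n`, `p`, `m`, `Minv`, `mu_beta`) from the previous snippet and using arbitrary hyperparameters `a0`, `b0`:

```r
# full conditional draw for sigma^2, continuing the sketch above
a0 <- 0.001; b0 <- 0.001                 # vague Inverse Gamma hyperparameters
beta <- as.vector(mu_beta)               # any current value of beta works here

resid <- y - X %*% beta
a_sig <- a0 + n / 2 + (p + 1) / 2
b_sig <- b0 + 0.5 * (crossprod(resid) + t(beta - m) %*% Minv %*% (beta - m))

# sigma^2 ~ IG(a_sig, b_sig)  <=>  1/sigma^2 ~ Gamma(shape = a_sig, rate = b_sig)
sigma2_draw <- 1 / rgamma(1, shape = a_sig, rate = as.numeric(b_sig))
```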
6.3.2.3 Application: Algorithm

The Gibbs Sampler algorithm for the linear model example looks as follows:

1. initialize \(\boldsymbol{\hat\beta}\) and \(\hat\sigma^2\)
2. initialize the parameters of the prior distribution of \(\bb\): \(\boldsymbol m\) and \(\boldsymbol M\) (mean and covariance)
3. initialize the parameters of the prior distribution of \(\sigma^2\): \(a_{0}\) and \(b_{0}\) (shape and scale)
4. update \(\boldsymbol{\Sigma}_{\bb}\) as defined in Section 6.3.2.1 using the latest available value of \(\sigma^2\)
5. update \(\boldsymbol{\mu}_{\bb}\) as defined in Section 6.3.2.1 using the latest available value of \(\sigma^2\)
6. draw a new \(\boldsymbol{\hat\beta}\) from the multivariate normal distribution with the updated \(\boldsymbol{\mu}_\bb\) and \(\boldsymbol{\Sigma}_\bb\)
7. update \(a_{\sigma^2}\) as defined in Section 6.3.2.2
8. update \(b_{\sigma^2}\) as defined in Section 6.3.2.2 using the newly drawn \(\boldsymbol{\hat\beta}\)
9. draw a new \(\hat\sigma^2\) from the inverse gamma distribution with the updated values of \(a_{\sigma^2}\) and \(b_{\sigma^2}\)
10. repeat the updating steps 4.-9. until the predefined number of iterations \(B\) is reached
11. drop the values of the burn-in phase of predefined length
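Putting these steps together, the following self-contained R sketch implements the Gibbs sampler for the linear model on simulated data (an illustrative implementation with arbitrary hyperparameters and settings, not the code behind the shiny app); it uses `MASS::mvrnorm` for the multivariate normal draw.

```r
set.seed(123)
## simulated data from the linear model
n <- 200; p <- 2
X <- cbind(1, matrix(rnorm(n * p), n, p))
beta_true <- c(1, -2, 0.5); sigma2_true <- 1.5
y <- as.vector(X %*% beta_true + rnorm(n, sd = sqrt(sigma2_true)))

## 1.-3. initialization of starting values, priors and hyperparameters
B <- 5000; burn_in <- 1000
m    <- rep(0, p + 1); Minv <- diag(1 / 1e5, p + 1)   # nearly flat prior for beta
a0   <- 0.001; b0 <- 0.001                            # vague IG prior for sigma^2
beta <- rep(0, p + 1); sigma2 <- 1                    # starting values

beta_draws   <- matrix(NA_real_, B, p + 1)
sigma2_draws <- numeric(B)

XtX   <- crossprod(X); Xty <- crossprod(X, y)
a_sig <- a0 + n / 2 + (p + 1) / 2                     # step 7: shape never changes

for (b in seq_len(B)) {
  ## 4.-6. full conditional of beta
  Sigma_beta <- solve((XtX + Minv) / sigma2)
  mu_beta    <- Sigma_beta %*% (Xty + Minv %*% m) / sigma2
  beta       <- as.vector(MASS::mvrnorm(1, mu = as.vector(mu_beta), Sigma = Sigma_beta))

  ## 8.-9. full conditional of sigma^2
  resid  <- y - X %*% beta
  b_sig  <- b0 + 0.5 * (crossprod(resid) + t(beta - m) %*% Minv %*% (beta - m))
  sigma2 <- 1 / rgamma(1, shape = a_sig, rate = as.numeric(b_sig))

  beta_draws[b, ] <- beta
  sigma2_draws[b] <- sigma2
}

## 10.-11. drop the burn-in and summarize the marginal posteriors
colMeans(beta_draws[-(1:burn_in), ])       # compare with beta_true
mean(sigma2_draws[-(1:burn_in)])           # compare with sigma2_true
```

After dropping the burn-in, the posterior means should lie close to the values used to simulate the data.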
6.3.2.4 Interactive Shiny App
In this section we provide an interactive shiny app that lets you define your own (hyper-)parameter inputs for the linear model and see the resulting posterior values after Gibbs Sampling.
Click here for the full version of the ‘Gibbs Sampler’ shiny app. In the full shiny app you can compare results for different complexities \(k\) of the linear model as well as varying initializations and hyperparameter setups of the algorithm.