13.7 Sample selection
In a previous section, we depicted a situation in which collider bias induces selection bias. Specifically, conditioning on a particular subset of the population (\(D_i=1\)), where \(D_i\) is influenced by both the treatment and the confounders, opens a back-door path and creates a spurious association between its causes. Placing this situation within the well-known sample selection framework (Heckman 1979), the observed outcome can be represented as
\[ Y_i = \begin{cases} Y_i(1) & \text{if } D_i=1, \\ \text{NA} & \text{if } D_i=0, \end{cases} \]
that is, we observe \(Y_i = Y_i(1)\) only when \(D_i=1\), while \(Y_i(0)\) remains unobserved (“missing”) for all \(i=1,2,\dots,N\).
In this setting, inference can be performed based on the likelihood of the observed outcomes together with the selection (missingness) mechanism \((Y_i(1),D_i)\), integrating out the unobserved \(Y_i(0)\) from the joint distribution of \(\{Y_i(1),Y_i(0),D_i\}\). However, one must consider whether the missingness mechanism is ignorable or not. According to Little and Rubin (2019), the missingness mechanism is ignorable in Bayesian inference if the following two conditions hold:
- the likelihood can be factorized as
\[ p(y_i, d_i\mid \boldsymbol{\theta},\boldsymbol{\gamma}) = p(y_i\mid \boldsymbol{\theta})\, p(d_i \mid y_i, \boldsymbol{\gamma}), \]
and
- the parameters \(\boldsymbol{\theta}\) and \(\boldsymbol{\gamma}\) are a priori independent,
\[ \pi(\boldsymbol{\theta},\boldsymbol{\gamma}) = \pi(\boldsymbol{\theta})\,\pi(\boldsymbol{\gamma}). \]
Under these conditions, posterior inference can be based on
\[ \pi(\boldsymbol{\theta}\mid \mathbf{y}) \propto \pi(\boldsymbol{\theta})\, p(\mathbf{y} \mid \boldsymbol{\theta}). \]
Thus, if the missingness mechanism is Missing At Random (MAR) and the parameters are a priori independent, the missingness mechanism is ignorable for Bayesian inference (see Chapter 6 of Little and Rubin (2019)).
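To see why, combining the factorization with the prior independence gives
\[ \pi(\boldsymbol{\theta},\boldsymbol{\gamma}\mid \mathbf{y},\mathbf{d}) \propto \Big[\pi(\boldsymbol{\theta})\, p(\mathbf{y}\mid \boldsymbol{\theta})\Big] \times \Big[\pi(\boldsymbol{\gamma})\, p(\mathbf{d}\mid \mathbf{y},\boldsymbol{\gamma})\Big], \]
so the factor involving \(\boldsymbol{\gamma}\) is irrelevant for learning about \(\boldsymbol{\theta}\), and the missingness mechanism can be ignored.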
In contrast, when the missingness mechanism is non-ignorable (i.e., Not Missing At Random, NMAR), the probability of observing \(Y_i\) depends on the unobserved values themselves. In this case, Bayesian inference must be performed using the full joint likelihood,
\[ p(y_i, d_i \mid \boldsymbol{\theta},\boldsymbol{\gamma}), \]
which requires specifying and estimating both the outcome model and the selection model simultaneously (see Chapter 15 of Little and Rubin (2019)). In particular, the classical sample selection model of Heckman (1979) specifies
\[ Y_i = \begin{cases} \mathbf{X}_i^{\top}\boldsymbol{\beta}+\mu_i & \text{if } D_i=1, \\ \text{NA} & \text{if } D_i=0, \end{cases} \]
\[ D_i = \begin{cases} 1 & \text{if } D_i^* = \mathbf{Z}_i^{\top}\boldsymbol{\gamma}+\nu_i > 0, \\ 0 & \text{if } D_i^* = \mathbf{Z}_i^{\top}\boldsymbol{\gamma}+\nu_i \leq 0, \end{cases} \]
where
\[ \begin{bmatrix} \mu_i \\[6pt] \nu_i \end{bmatrix} \sim N\!\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} \sigma^2_{\mu} & \sigma_{\mu\nu} \\ \sigma_{\mu\nu} & 1 \end{bmatrix}\right). \]
The restriction \(\operatorname{Var}(\nu_i)=1\) is imposed for identification. Without this normalization, the latent index \(D_i^*=\mathbf{Z}_i^{\top}\boldsymbol{\gamma}+\nu_i\) is only identified up to scale, since
\[ P(D_i=1) = P(\mathbf{Z}_i^{\top}\boldsymbol{\gamma}+\nu_i > 0) = P\!\left(\frac{\nu_i}{c} > -\frac{\mathbf{Z}_i^{\top}\boldsymbol{\gamma}}{c}\right), \quad \forall\, c>0. \]
Thus, setting \(\sigma^2_{\nu}=1\) yields point identification of the parameters.
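As a quick numerical check of this scale invariance (a minimal sketch with arbitrary illustrative values), scaling both \(\boldsymbol{\gamma}\) and the standard deviation of \(\nu_i\) by the same constant leaves the selection probability unchanged:
zg <- 0.4                   # an illustrative value of Z_i'gamma
cons <- 3                   # an arbitrary positive scaling constant
pnorm(zg)                   # P(D_i = 1) when sd(nu_i) = 1
pnorm(cons*zg, sd = cons)   # identical probability after scaling gamma and sd(nu_i) by cons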
Note that since \(D_i=1 \iff \nu_i>-\mathbf Z_i^\top\boldsymbol\gamma\),
\[ \mathbb E[\mu_i\mid D_i=1,\mathbf Z_i] =\mathbb E\!\big[\mathbb E(\mu_i\mid \nu_i)\,\big|\,\nu_i>-\mathbf Z_i^\top\boldsymbol\gamma\big] =\sigma_{\mu\nu}\,\mathbb E\!\big[\nu_i\mid \nu_i>-\mathbf Z_i^\top\boldsymbol\gamma\big]. \]
This is because \(\mu_i\mid \nu_i\sim N(\sigma_{\mu\nu}\nu_i, \sigma^2_{\mu}-\sigma_{\mu\nu}^2)\).
If \(\nu\sim\mathcal N(0,1)\), then for \(a=\mathbf Z_i^\top\boldsymbol\gamma\),
\[ \mathbb E[\nu\mid \nu>-a]=\frac{\phi(a)}{\Phi(a)}\equiv \lambda(a), \]
where \(\phi\) and \(\Phi\) are the standard normal pdf and cdf. Hence
\[ \mathbb E[\mu_i\mid D_i=1,\mathbf Z_i]=\sigma_{\mu\nu}\,\lambda(\mathbf Z_i^\top\boldsymbol\gamma) =\rho\,\sigma_\mu\,\lambda(\mathbf Z_i^\top\boldsymbol\gamma), \]
where \(\rho=\sigma_{\mu\nu}/\sigma_{\mu}\).
Then,
\[ \begin{aligned} \mathbb E[Y_i\mid D_i=1,\mathbf X_i,\mathbf Z_i] &=\mathbf X_i^\top \boldsymbol\beta+\mathbb E[\mu_i\mid D_i=1,\mathbf Z_i]\\ &=\mathbf X_i^\top \boldsymbol\beta + \sigma_{\mu\nu}\,\lambda(\mathbf Z_i^\top\boldsymbol\gamma)\\ &=\mathbf X_i^\top \boldsymbol\beta + \rho\,\sigma_\mu\,\lambda(\mathbf Z_i^\top\boldsymbol\gamma). \end{aligned} \]
This is the Heckman selection-bias correction: the inverse Mills ratio \(\lambda(\mathbf Z_i^\top\boldsymbol\gamma)\) enters as an additional regressor in the selected sample. Point identification can be achieved through functional form (since the inverse Mills ratio is non-linear) together with the normality assumption. This implies that, in principle, one may have \(\mathbf{X}_i=\mathbf{Z}_i\). However, such identification is weak, and it is preferable to include at least one variable in \(\mathbf{Z}_i\) that is excluded from \(\mathbf{X}_i\), i.e., \(\mathbf{X}_i\neq\mathbf{Z}_i\), as this strengthens identification and also improves the precision of inference.
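A quick Monte Carlo check of this correction (a minimal sketch with illustrative values \(\sigma^2_{\mu}=1.2\), \(\sigma_{\mu\nu}=0.8\), and \(\mathbf{Z}_i^{\top}\boldsymbol{\gamma}=0.5\)):
# Compare the simulated conditional mean E[mu | nu > -Z'gamma] with sigma_{mu nu}*lambda(Z'gamma)
set.seed(10101)
a <- 0.5; s2mu <- 1.2; smunu <- 0.8
err <- MASS::mvrnorm(1e6, mu = c(0, 0), Sigma = matrix(c(s2mu, smunu, smunu, 1), 2, 2))
mean(err[err[, 2] > -a, 1])   # simulated E[mu | D = 1]
smunu*dnorm(a)/pnorm(a)       # analytical sigma_{mu nu}*lambda(a)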
To perform Bayesian inference in this model, we use the augmented likelihood. Thus,
\[ \begin{aligned} \pi(\boldsymbol{\delta},\sigma^2_{\mu},\sigma_{\mu\nu},D_i^*\mid \mathbf{y},\mathbf{d}) & \propto \prod_{i\in I_{0}} \pi(D_i^*\mid \boldsymbol{\delta},\sigma^2_{\mu},\sigma_{\mu\nu}) \, \mathbf{1}(d_i=0)\, \mathbf{1}(D_i^*\leq 0) \\ & \quad \times \prod_{i\in I_{1}} p(y_i,D_i^*\mid \boldsymbol{\delta},\sigma^2_{\mu},\sigma_{\mu\nu}) \, \mathbf{1}(d_i=1)\, \mathbf{1}(D_i^*> 0) \\ & \quad \times \pi(\boldsymbol{\delta},\sigma^2_{\mu},\sigma_{\mu\nu}), \end{aligned} \]
where \(\boldsymbol{\delta}=[\boldsymbol{\beta}^{\top} \ \boldsymbol{\gamma}^{\top}]^{\top}\), \(I_0=\{i : d_i=0\}\) and \(I_1=\{i : d_i=1\}\).
Following Chapter 11 in Greenberg (2012), we set
\[ \boldsymbol{\eta}_i= \begin{cases} [0, \, D_i^*]^{\top}, & i\in I_0\\ [y_i, \, D_i^*]^{\top}, & i\in I_1 \end{cases}, \qquad \mathbf{W}_i=\begin{bmatrix} \mathbf{X}_i^{\top} & \mathbf{0}\\ \mathbf{0} & \mathbf{Z}_i^{\top} \end{bmatrix}, \qquad \mathbf{J}=\begin{bmatrix} 0 & 0\\ 0 & 1 \end{bmatrix}. \]
Assuming \(\pi(\boldsymbol{\delta}) \sim N(\boldsymbol{\delta}_0,\mathbf{D}_0)\), the conditional posterior distribution of \(\boldsymbol{\delta}\) is normal with mean
\[ \boldsymbol{\delta}_n=\mathbf{D}_n\left[\sum_{i\in I_0}\mathbf{W}_i^{\top}\mathbf{J}\boldsymbol{\eta}_i +\sum_{i\in I_1}\mathbf{W}_i^{\top}\boldsymbol{\Sigma}^{-1}\boldsymbol{\eta}_i +\mathbf{D}_0^{-1}\boldsymbol{\delta}_0\right], \]
and variance matrix
\[ \mathbf{D}_n=\left[\sum_{i\in I_0}\mathbf{W}_i^{\top}\mathbf{J}\mathbf{W}_i +\sum_{i\in I_1}\mathbf{W}_i^{\top}\boldsymbol{\Sigma}^{-1}\mathbf{W}_i +\mathbf{D}_0^{-1}\right]^{-1}, \]
where
\[ \boldsymbol{\Sigma}=\begin{bmatrix} \sigma^2_{\mu} & \sigma_{\mu\nu} \\[6pt] \sigma_{\mu\nu} & 1 \end{bmatrix}. \]
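These expressions are the standard conjugate update in a Gaussian linear model. For \(i\in I_1\), the augmented data satisfy
\[ \boldsymbol{\eta}_i = \mathbf{W}_i\boldsymbol{\delta} + \boldsymbol{\varepsilon}_i, \qquad \boldsymbol{\varepsilon}_i\sim N(\mathbf{0},\boldsymbol{\Sigma}), \]
whereas for \(i\in I_0\) only the latent component is informative, \(D_i^*\sim N(\mathbf{Z}_i^{\top}\boldsymbol{\gamma},1)\), so the matrix \(\mathbf{J}\) acts as the precision matrix for these observations, zeroing out the contribution of the unobserved outcome.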
Let \(\omega=\sigma^2_{\mu}-\sigma^2_{\mu\nu}\) denote the conditional variance of \(\mu_i \mid \nu_i\). Assuming \(\omega^{-1}\sim G(\alpha_0/2,\delta_0/2)\) and noting that
\[ p(y_i,D_i^*\mid \boldsymbol{\delta},\omega,\sigma_{\mu\nu}) = p(y_i\mid D_i^*, \boldsymbol{\delta},\omega,\sigma_{\mu\nu}) \times p(D_i^*\mid \boldsymbol{\delta},\omega,\sigma_{\mu\nu}), \]
the conditional posterior distribution of \(\omega^{-1}\) is Gamma,
\[ \omega^{-1}\mid \boldsymbol{\delta},\sigma_{\mu\nu}, \mathbf{y},\mathbf{d} \sim G(\alpha_n/2,\delta_n/2), \]
with
\[ \alpha_n=\alpha_0+N_1, \qquad \delta_n=\delta_0+\sum_{i\in I_1}\Big[y_i-\mathbf{X}_i^{\top}\boldsymbol{\beta} -\sigma_{\mu\nu}(D_i^*-\mathbf{Z}_i^{\top}\boldsymbol{\gamma})\Big]^2, \]
where \(N_1\) is the number of observations with \(D_i=1\).
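The conditional distribution of \(y_i\) given \(D_i^*\) used here follows from \(\nu_i = D_i^* - \mathbf{Z}_i^{\top}\boldsymbol{\gamma}\) and \(\mu_i\mid\nu_i\sim N(\sigma_{\mu\nu}\nu_i,\omega)\), that is,
\[ y_i\mid D_i^*,\boldsymbol{\delta},\omega,\sigma_{\mu\nu} \sim N\!\big(\mathbf{X}_i^{\top}\boldsymbol{\beta} + \sigma_{\mu\nu}(D_i^*-\mathbf{Z}_i^{\top}\boldsymbol{\gamma}),\,\omega\big), \]
which explains the squared residuals in \(\delta_n\) and shows that \(\sigma_{\mu\nu}\) enters as the regression coefficient of \(D_i^*-\mathbf{Z}_i^{\top}\boldsymbol{\gamma}\), leading to the normal update below.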
In addition, assuming \(\sigma_{\mu\nu}\sim N(s_0,S_0)\), the conditional posterior distribution is normal with mean
\[ s_n=S_n\left(\omega^{-1}\sum_{i\in I_1}(D_i^*-\mathbf{Z}_i^{\top}\boldsymbol{\gamma})(y_i-\mathbf{X}_i^{\top}\boldsymbol{\beta}) +S_0^{-1}s_0\right), \]
and variance
\[ S_n=\left[\omega^{-1}\sum_{i\in I_1}(D_i^*-\mathbf{Z}_i^{\top}\boldsymbol{\gamma})^2+S_0^{-1}\right]^{-1}. \]
Finally, since
\[ p(y_i,D_i^*\mid \boldsymbol{\delta},\omega,\sigma_{\mu\nu}) = p(D_i^*\mid y_i, \boldsymbol{\delta},\omega,\sigma_{\mu\nu}) \times p(y_i\mid \boldsymbol{\delta},\omega,\sigma_{\mu\nu}), \]
the conditional posterior distribution of \(D_i^*\) is
\[ D_i^* \sim \begin{cases} TN_{(-\infty,0]}(\mathbf{Z}_i^{\top}\boldsymbol{\gamma},\,1), & i\in I_0,\\[6pt] TN_{(0,\infty)}\!\left(\mathbf{Z}_i^{\top}\boldsymbol{\gamma} +\dfrac{\sigma_{\mu\nu}}{\sigma_{\mu\nu}^2+\omega}\,(y_i-\mathbf{X}_i^{\top}\boldsymbol{\beta}),\, \dfrac{\omega}{\sigma_{\mu\nu}^2+\omega}\right), & i\in I_1, \end{cases} \]
where \(TN_{A}(\mu,\sigma^2)\) denotes a normal distribution with mean \(\mu\) and variance \(\sigma^2\), truncated to the set \(A\).
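The moments for \(i\in I_1\) follow from standard bivariate-normal conditioning of \(D_i^*\) on \(y_i\), recalling that \(\sigma^2_{\mu}=\omega+\sigma^2_{\mu\nu}\):
\[ \mathbb{E}[D_i^*\mid y_i] = \mathbf{Z}_i^{\top}\boldsymbol{\gamma} + \frac{\sigma_{\mu\nu}}{\sigma^2_{\mu\nu}+\omega}\,(y_i-\mathbf{X}_i^{\top}\boldsymbol{\beta}), \qquad \operatorname{Var}(D_i^*\mid y_i) = 1-\frac{\sigma^2_{\mu\nu}}{\sigma^2_{\mu\nu}+\omega} = \frac{\omega}{\sigma^2_{\mu\nu}+\omega}. \]
For \(i\in I_0\) the outcome is missing, so \(D_i^*\) is drawn from its marginal \(N(\mathbf{Z}_i^{\top}\boldsymbol{\gamma},1)\) truncated to the non-positive half-line.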
Example: Simulation of the sample selection model
We simulate the model
\[
Y_i = 2 + X_{i1} + X_{i2} + \mu_i,
\]
where \(X_{i2}\sim \text{Bin}(1,0.5)\) and \(\mu_i\sim N(0,1.2)\) for \(i=1,2,\dots,1,000\).
In addition,
\[
D_i^* = 1 + Z_{i1} - X_{i2} + \nu_i,
\]
where \(X_{i1}\) and \(Z_{i1}\) are jointly standard normal with correlation 0.7, so that \(Z_{i1}\) is excluded from the outcome equation and \(X_{i1}\) from the selection equation.
The covariance between \(\mu_i\) and \(\nu_i\) is set to \(0.8\).
The hyperparameters are \(\boldsymbol{\delta}_0=[0 \ 0 \ 0 \ 0 \ 0 \ 0]^{\top}\), \(\mathbf{D}_0=1,000\mathbf{I}_6\), \(\alpha_0=\delta_0=0.001\), \(s_0=0\), and \(S_0=1,000\). We run the Gibbs sampler for 1,500 iterations with a burn-in of 500 and a thinning parameter of 2. The following code implements the sampler. Finally, we compare the results with an implementation that does not account for the sample selection issue.
The figure shows the posterior distributions of the second parameter in the outcome equation, whose population value is 1 (black dashed line). The posterior distribution of the model that accounts for selection encompasses the population value, whereas the model that ignores selection does not.
library(parallel)    # detectCores
library(snow)        # SOCK cluster: makeCluster, clusterExport, parSapply, stopCluster
library(doParallel)  # registerDoParallel
N <- 1000
w1 <- rbinom(N, 1, 0.5)                   # binary regressor in both equations
delta <- c(1, 1, -1); beta <- c(2, 1, 1)  # selection and outcome coefficients
zx <- MASS::mvrnorm(N, mu = rep(0, 2), matrix(c(1, 0.7, 0.7, 1), 2, 2))
z1 <- zx[,1]                              # z1 enters only the selection equation
Z <- cbind(1, z1, w1)
sig12 <- 0.8; sig11 <- 1.2                # cov(mu, nu) and var(mu)
SIGMA <- matrix(c(sig11, sig12, sig12, 1), 2, 2)
E <- MASS::mvrnorm(N, mu = rep(0, 2), SIGMA)
cl <- Z%*%delta + E[,2]; c <- cl > 0      # latent index and selection indicator
x1 <- zx[,2]; X <- cbind(1, x1, w1)       # x1 enters only the outcome equation
y <- X%*%beta + E[,1]
y[c==0] <- NA                             # outcome observed only when selected
# Hyperparameters
b0 <- rep(0, 6); B0 <- 1000*diag(6); B0i <- solve(B0)
a0 <- 0.001; d0 <- 0.001
s0 <- 0; S0 <- 1000; S0i <- 1/S0
# Indices of selected observations (D_i = 1)
idc1 <- which(c==1)
nc1 <- length(idc1)
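# Draw delta = (beta, gamma) from its normal conditional posterior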
PostThetaNew <- function(Sigma, clat){
J <- matrix(c(0,0,0,1),2,2)
WW <- matrix(0, 6, 6)
Wy <- matrix(0, 6, 1)
for(i in 1:N){
if(i %in% idc1){
yclat <- c(y[i], clat[i])
Auxi <- solve(Sigma)
}else{
yclat <- c(0, clat[i])
Auxi <- J
}
Wi <- as.matrix(Matrix::bdiag(X[i,], Z[i,]))
WWi <- Wi%*%Auxi%*%t(Wi)
Wyi <- Wi%*%Auxi%*%yclat
WW <- WW + WWi
Wy <- Wy + Wyi
}
Bn <- solve(B0i + WW)
bn <- Bn%*%(B0i%*%b0 + Wy)
Beta <- MASS::mvrnorm(1, bn, Bn)
return(Beta)
}
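# Draw omega = var(mu | nu), i.e., omega^(-1) from its Gamma conditional posterior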
PostOmega11 <- function(theta, sig12, clat){
an <- a0 + nc1
mui <- y[idc1] -X[idc1, ]%*%theta[1:3] - sig12*(clat[idc1] - Z[idc1,]%*%theta[4:6])
dn <- d0 + t(mui)%*%mui
omega11 <- LaplacesDemon::rinvgamma(1, an/2, dn/2)
return(omega11)
}
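# Draw sigma_{mu,nu} from its normal conditional posterior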
PostSig12 <- function(omega11, theta, clat){
Sn <- (omega11^(-1)*sum((clat[idc1] - Z[idc1,]%*%theta[4:6])^2) + S0i)^(-1)
sn <- Sn*(omega11^(-1)*sum((clat[idc1] - Z[idc1,]%*%theta[4:6])*(y[idc1] - X[idc1,]%*%theta[1:3])) + s0*S0i)
sig12 <- rnorm(1, sn, sd = Sn^0.5)
return(sig12)
}
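# Draw the latent index D_i* from its truncated-normal conditional posterior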
PostClat <- function(theta, sig12, omega11, i){
if(i %in% idc1){
mu <- Z[i,]%*%theta[4:6] + (sig12/(omega11+sig12^2))*(y[i] - X[i,]%*%theta[1:3])
sig2 <- omega11/(omega11+sig12^2)
clat <- EnvStats::rnormTrunc(1, mean = mu, sd = sig2^0.5, min = 0, max = Inf)
}else{
mu <- Z[i,]%*%theta[4:6]
clat <- EnvStats::rnormTrunc(1, mean = mu, sd = 1, min = -Inf, max = 0)
}
return(clat)
}
# Sampler
S <- 1500; burnin <- 500; thin <- 2
keep <- seq(burnin+thin, S, thin)
PostThetasDraws <- matrix(NA, S, 6)
PostSigmaDraws <- matrix(NA, S, 2)
# Initial parameters
Thetap <- rep(0, 6); Sigmap <- diag(2)
Sig12p <- 0; Omega11p <- 1
# Initial latent utilities (redrawn at the start of each Gibbs iteration)
LatPost <- rep(NA, N)
for(i in 1:N){
  if(c[i] == 0){
    LatPost[i] <- EnvStats::rnormTrunc(1, mean = 0, sd = 1, min = -Inf, max = 0)
  }else{
    LatPost[i] <- EnvStats::rnormTrunc(1, mean = 0, sd = 1, min = 0, max = Inf)
  }
}
#### Parallel code ####
cn <- detectCores()
ClusterHope <- makeCluster(cn, type = "SOCK")
registerDoParallel(ClusterHope)
clusterExport(ClusterHope, list("Z", "X", "c", "y", "N", "idc1", "nc1", "PostClat","Thetap", "Sig12p", "Omega11p"))
pb <- txtProgressBar(min = 0, max = S, style = 3)
for(rep in 1:S){
LatPost <- t(parSapply(ClusterHope, 1:N, function(i){PostClat(theta = Thetap, sig12 = Sig12p, omega11 = Omega11p, i)}))
Thetap <- PostThetaNew(Sigma = Sigmap, clat = LatPost)
Omega11p <- PostOmega11(theta = Thetap, sig12 = Sig12p, clat = LatPost)
Sig12p <- PostSig12(omega11 = Omega11p, theta = Thetap, clat = LatPost)
Sigmap <- matrix(c(Omega11p+Sig12p^2, Sig12p, Sig12p, 1),2,2)
PostThetasDraws[rep,] <- Thetap
PostSigmaDraws[rep, ] <- c(Omega11p+Sig12p^2, Sig12p)
clusterExport(ClusterHope, list("Thetap", "Sig12p", "Omega11p"))
setTxtProgressBar(pb, rep)
}
close(pb)
stopCluster(ClusterHope)
thetaHat <- coda::mcmc(PostThetasDraws[keep,])
PostDrawsSigma <- coda::mcmc(PostSigmaDraws[keep,])
summary(thetaHat)
## Iterations = 1:500
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 500
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## [1,] 2.0082 0.06372 0.002850 0.004891
## [2,] 1.0192 0.04499 0.002012 0.002965
## [3,] 1.0042 0.09113 0.004076 0.006620
## [4,] 0.9830 0.07317 0.003272 0.007188
## [5,] 1.0443 0.06149 0.002750 0.006651
## [6,] -0.9723 0.09496 0.004247 0.007826
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## var1 1.8849 1.9646 2.0052 2.0492 2.1399
## var2 0.9298 0.9898 1.0196 1.0455 1.1138
## var3 0.8379 0.9454 1.0049 1.0688 1.1835
## var4 0.8391 0.9320 0.9841 1.0347 1.1210
## var5 0.9257 1.0027 1.0430 1.0866 1.1662
## var6 -1.1748 -1.0326 -0.9722 -0.9112 -0.7837
##
summary(PostDrawsSigma)
## Iterations = 1:500
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 500
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## [1,] 1.3098 0.09117 0.004077 0.008362
## [2,] 0.8545 0.08517 0.003809 0.012798
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## var1 1.1306 1.2391 1.3120 1.3730 1.485
## var2 0.6581 0.8055 0.8589 0.9149 1.003
##
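The no-selection benchmark is a Bayesian linear regression fit only to the selected observations, ignoring the selection equation. A minimal sketch (assuming MCMCpack::MCMCregress with its default settings; ysel and Xsel are helper names introduced here for illustration):
# No-selection benchmark (sketch): Bayesian linear regression on the selected sample only
ysel <- y[idc1]; Xsel <- X[idc1, ]
RegNOsel <- MCMCpack::MCMCregress(ysel ~ Xsel - 1)
summary(RegNOsel)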
## Iterations = 1001:11000
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 10000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## X 2.2991 0.05380 0.0005380 0.0005390
## Xx1 0.8432 0.04265 0.0004265 0.0004265
## Xw1 1.2449 0.08532 0.0008532 0.0008698
## sigma2 1.0891 0.06178 0.0006178 0.0006178
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## X 2.1947 2.2626 2.2987 2.3356 2.4050
## Xx1 0.7598 0.8145 0.8433 0.8714 0.9269
## Xw1 1.0740 1.1892 1.2461 1.3025 1.4096
## sigma2 0.9742 1.0464 1.0867 1.1289 1.2178
library(coda)
library(ggplot2)
# 1) Extract the 2nd coefficient draws from each object
beta2_nosel <- as.numeric(RegNOsel[, 2])
beta2_sel <- as.numeric(thetaHat[, 2])
# 2) Put into a long data frame for ggplot
df <- rbind(
data.frame(value = beta2_nosel, model = "No selection"),
data.frame(value = beta2_sel, model = "With selection")
)
# 3) Compute summaries
summ <- function(x) c(mean = mean(x), quantile(x, c(0.025, 0.975)))
cat("\nPosterior summaries (2nd coefficient)\n")##
## Posterior summaries (2nd coefficient)
## No selection : 0.8432 0.7598 0.9269
## With selection: 1.0192 0.9298 1.1138
# 4) Plot both posteriors + true value line at 1
ggplot(df, aes(x = value, fill = model, color = model)) +
geom_density(alpha = 0.25, linewidth = 0.8) +
geom_vline(xintercept = 1, linetype = 2) +
labs(x = "Coefficient (2nd parameter)", y = "Posterior density",
title = "Posterior of 2nd Coeff: No-Selection vs Selection Models",
subtitle = "Dashed line = population value (1)") +
theme_minimal(base_size = 12) +
theme(legend.position = "top")
The original Heckman selection model was not intended to calculate the ATE, since \(Y_i(0)\) is not observed and, therefore, a treatment effect is not defined. Its main purpose is to correct the expected value of \(Y_i \mid D_i=1\) by accounting for sample selection. Subsequently, Heckman (1990) and Heckman and Vytlacil (2005) extended the sample selection framework to incorporate \(Y_i(0)\), framing the Roy model (Roy 1951) within the treatment effect literature and establishing the identification conditions for different treatment parameters. In particular, Heckman and Vytlacil (2005) introduced the marginal treatment effect (MTE),
\[ \tau_{MTE}(x,u_D) = \mathbb{E}[Y_i(1)-Y_i(0)\mid \mathbf{X}_i=\mathbf{x}, \, U_{Di}=u_D], \]
which is the expected treatment effect for individuals with observed characteristics \(\mathbf{X}_i=\mathbf{x}\) and unobserved treatment-assignment heterogeneity \(U_{Di}=u_D\) (Heckman and Vytlacil 2005).
The MTE is a unifying concept that connects the treatment effect, selection, and matching literatures. Moreover, these authors show that many treatment effect parameters, such as the ATE, ATT, and LATE, can be expressed as weighted averages of the MTE (see Table IA in Heckman and Vytlacil (2005)).
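For instance, the conditional ATE is the MTE averaged uniformly over the unobserved resistance to treatment,
\[ \tau_{ATE}(\mathbf{x}) = \int_0^1 \tau_{MTE}(\mathbf{x},u_D)\, du_D, \]
whereas the ATT and the LATE place more weight on the values of \(u_D\) at which individuals are induced into treatment.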
In the Bayesian literature, several authors have estimated different versions of selection models. In particular, van Hasselt (2011) and Ding (2014) analyze the sample selection model using flexible specifications for the error terms, while Koop and Poirier (1997) and Chib (2007) estimate the Roy model. Furthermore, Chib, Greenberg, and Jeliazkov (2009) extend the basic framework to a semiparametric setting that accounts for endogeneity, and Heckman, Lopes, and Piatek (2014) introduce latent factors to address the fundamental problem of causal inference in the Roy model. This approach allows researchers to learn about the otherwise unidentified covariance between \(Y_i(1)\) and \(Y_i(0)\), which in turn enables the estimation of distributional versions of treatment effects that require the joint distribution of \(Y_i(1)\) and \(Y_i(0)\).