11.4 Exercises

  1. Simulate a semi-parametric regression where
    \[\begin{align*} y_i &= 0.5x_{i1} - 1.2x_{i2} + \mu_i, \\ p(\mu_i) &= 0.7 \phi(\mu_i \mid -0.5,0.5^2) + 0.3 \phi(\mu_i \mid 1,0.8^2). \end{align*}\]

Assume that \(x_{i1}\) and \(x_{i2}\) follow standard normal distributions and that the sample size is 1,000. Perform inference in this model assuming that the number of components is unknown. Start with \(H=5\) and use non-informative priors, setting \(\alpha_{h0}=\delta_{h0}=0.01\), \(\boldsymbol{\beta}_0=\boldsymbol{0}_2\), \(\boldsymbol{B}_0=\boldsymbol{I}_2\), \(\mu_{h0}=0\), \(\sigma^2_{h0}=10\), and \(\boldsymbol{\alpha}_0=[1/H \ \dots \ 1/H]^{\top}\). Use 6,000 MCMC iterations, a burn-in period of 4,000, and a thinning parameter of 2. Compare the population parameters with the posterior estimates, and plot the population density of \(\boldsymbol{\mu}\) along with its posterior density estimate (the posterior mean and the 95% credible interval).
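
As a starting point, the data-generating process above can be sketched as follows. This is a minimal sketch, not the book's code: it uses Python with NumPy, and the seed and variable names are arbitrary choices. Note that the mixture errors have mean \(0.7(-0.5)+0.3(1)=-0.05\), so they are not centered at zero.

```python
import numpy as np

rng = np.random.default_rng(10101)  # arbitrary seed, chosen only for reproducibility

N = 1_000
X = rng.standard_normal((N, 2))     # x_{i1}, x_{i2} ~ N(0, 1)
beta = np.array([0.5, -1.2])        # population slope parameters

# Two-component Gaussian mixture for the stochastic errors mu_i
weights = np.array([0.7, 0.3])
means = np.array([-0.5, 1.0])
sds = np.array([0.5, 0.8])          # standard deviations (variances 0.5^2 and 0.8^2)

z = rng.choice(2, size=N, p=weights)   # latent component indicator
mu = rng.normal(means[z], sds[z])      # draw each error from its own component

y = X @ beta + mu                      # no intercept: the mixture absorbs the location
```

Keeping the indicator `z` is convenient later for checking how well the sampler recovers the component memberships.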

  2. Example: Consumption of marijuana in Colombia continues I

Use the dataset MarijuanaColombia.csv from our GitHub repository to perform inference on the demand for marijuana in Colombia. This dataset contains information on the (log) monthly demand in 2019 from the National Survey of the Consumption of Psychoactive Substances. It includes variables such as the presence of a drug dealer in the neighborhood (Dealer), gender (Female), indicators of good physical and mental health (PhysicalHealthGood and MentalHealthGood), age (Age and Age2), years of schooling (YearsEducation), and (log) prices of marijuana, cocaine, and crack by individual (LogPriceMarijuana, LogPriceCocaine, and LogPriceCrack). The sample size is 1,156.

Estimate a finite Gaussian mixture regression using non-informative priors, that is, \(\alpha_{0}=\delta_{0}=0.01\), \(\boldsymbol{\beta}_{0}=\boldsymbol{0}_K\), \(\boldsymbol{B}_{0}=\boldsymbol{I}_K\), and \(\boldsymbol{\alpha}_0=[1/H \ \dots \ 1/H]^{\top}\), where \(K\) is the number of regressors (11, including the intercept). The number of MCMC iterations is 5,000, the burn-in is 1,000, and the thinning parameter is 2. Start with five potential clusters. Obtain the posterior distribution of the own-price elasticity of marijuana and the cross-price elasticities of marijuana demand with respect to the prices of cocaine and crack.

  3. Derive the posterior sampler in the semi-parametric setting using a Dirichlet process mixture: \[\begin{align*} y_i&=\boldsymbol{x}_i^{\top}\boldsymbol{\beta}+e_i\\ e_i\mid \mu_i,\sigma_i^2 &\stackrel{iid}{\sim} N(\mu_i,\sigma_i^2). \end{align*}\]

Do not include the intercept in \(\boldsymbol{\beta}\) to get flexibility in the distribution of the stochastic errors.

Assume \(\boldsymbol{\beta}\sim N(\boldsymbol{\beta}_0,\boldsymbol{B}_0)\), \(\sigma_i^2\sim IG(\alpha_0/2,\delta_0/2)\), \(\mu_i\sim N(\mu_0,\sigma_i^2/\beta_0)\), and \(\alpha\sim G(a,b)\). Introducing the latent variable \(\xi\mid\alpha,N\sim Be(\alpha+1,N)\) makes it easy to sample the posterior draws of \(\alpha\): \(\alpha\mid\xi,H,\pi_{\xi}\sim\pi_{\xi}{G}(a+H,b-\log(\xi))+(1-\pi_{\xi}){G}(a+H-1,b-\log(\xi))\), where \(\frac{\pi_{\xi}}{1-\pi_{\xi}}=\frac{a+H-1}{N(b-\log(\xi))}\) and \(H\) is the number of atoms (mixture components).
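
The two-step update of the concentration parameter described above can be sketched as follows. This is a minimal sketch under one stated assumption: \(G(a,b)\) is read as a Gamma distribution with shape \(a\) and rate \(b\), so NumPy's scale argument is \(1/(b-\log\xi)\). The function name `draw_alpha` is hypothetical.

```python
import numpy as np

def draw_alpha(alpha, H, N, a, b, rng):
    """One draw of alpha given its current value, H atoms, and sample size N."""
    # Step 1: latent variable xi | alpha, N ~ Be(alpha + 1, N)
    xi = rng.beta(alpha + 1.0, N)
    rate = b - np.log(xi)                 # log(xi) < 0, so the rate exceeds b

    # Step 2: mixing probability from the odds pi / (1 - pi)
    odds = (a + H - 1.0) / (N * rate)
    pi_xi = odds / (1.0 + odds)

    # Step 3: pick the Gamma component, then draw alpha
    shape = a + H if rng.random() < pi_xi else a + H - 1.0
    return rng.gamma(shape, 1.0 / rate)   # assumption: G(shape, rate); NumPy takes scale
```

Inside a sampler, this function would be called once per iteration, after the cluster allocations (and hence \(H\)) have been updated.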

  4. Example: Exercises 1 and 3 continue

Perform inference in the simulation of the semi-parametric model of Exercise 1 using the sampler of Exercise 3. Use non-informative priors, setting \(\alpha_{0}=\delta_{0}=0.01\), \(\boldsymbol{\beta}_{0}=\boldsymbol{0}_2\), \(\boldsymbol{B}_{0}=\boldsymbol{I}_2\), and \(a=b=0.1\). The number of MCMC iterations is 5,000, the burn-in is 1,000, and the thinning parameter is 2.

  5. Example: Simulation exercise continues I

Fix the label-switching problem of the simulation exercise of the DPM using a random permutation of the latent classes.
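
One possible building block is a function that relabels the components of a single MCMC draw consistently. The sketch below is an illustration only: the function name `permute_labels` and the arrays `s` (allocations), `mu`, and `sigma2` (component parameters) are hypothetical names, and how the permuted draws are then used for identification is left to the exercise.

```python
import numpy as np

def permute_labels(s, mu, sigma2, rng):
    """Randomly relabel components; s holds integer labels in 0..H-1."""
    H = mu.shape[0]
    perm = rng.permutation(H)        # perm[h] is the new label of old component h

    s_new = perm[s]                  # relabel each observation's allocation

    mu_new = np.empty_like(mu)       # move parameters to their new labels
    sigma2_new = np.empty_like(sigma2)
    mu_new[perm] = mu
    sigma2_new[perm] = sigma2
    return s_new, mu_new, sigma2_new
```

The relabeling leaves the likelihood unchanged, because every observation still points to the same component parameters: `mu_new[s_new]` equals `mu[s]`.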

  6. Example: Simulation exercise continues II

Obtain the density estimate of \(y\) in the simulation exercise of the DPM, evaluated at \(x=0\).
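
At \(x=0\) the regression term vanishes, so the density of \(y\) reduces to the mixture density of the errors. A minimal sketch of the evaluation step is given below, assuming each retained draw is stored as a tuple of (weights, means, variances) whose length may vary across iterations. The function names and this storage layout are hypothetical, and how the weights are formed from the sampler output is not addressed here.

```python
import numpy as np

def mixture_density(grid, weights, means, variances):
    """Gaussian mixture density evaluated at every point of grid."""
    g = np.asarray(grid, dtype=float)[:, None]            # shape (G, 1)
    m = np.asarray(means, dtype=float)
    sd = np.sqrt(np.asarray(variances, dtype=float))
    kern = np.exp(-0.5 * ((g - m) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))
    return kern @ np.asarray(weights, dtype=float)        # shape (G,)

def density_summary(grid, draws):
    """Pointwise posterior mean and 95% band of the density over MCMC draws."""
    dens = np.array([mixture_density(grid, w, m, v) for w, m, v in draws])
    band = np.quantile(dens, [0.025, 0.975], axis=0)
    return dens.mean(axis=0), band
```

Passing the population values as a single "draw", that is, weights (0.7, 0.3), means (-0.5, 1), and variances (0.25, 0.64), gives the population density to plot against the posterior estimate.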

  7. Example: Consumption of marijuana in Colombia continues II

Perform the application of marijuana consumption with the following specification: \[\begin{align*} y_i & = \boldsymbol{z}_i^{\top} \boldsymbol{\gamma} + f(Age_{i}) + \mu_i, \end{align*}\]

where \(y_i\) is the (log) marijuana monthly consumption, \(\boldsymbol{z}_i\) represents the presence of a drug dealer in the neighborhood (Dealer), gender (Female), indicators of good physical and mental health (PhysicalHealthGood and MentalHealthGood), years of education (YearsEducation), and the (log) prices of marijuana, cocaine, and crack by individual.

Initially, set the knots as the percentiles \(\left\{0,0.05,\dots,0.95,1\right\}\) of age and use cubic B-splines. Then, apply the BIC approximation to perform variable selection in this model with non-informative conjugate priors, 5,000 MCMC iterations, and 5,000 burn-in iterations.
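
The spline design matrix for \(f(Age_i)\) can be built with the Cox-de Boor recursion. The sketch below is an illustration in Python with NumPy rather than the book's code: the function name `bspline_basis` is hypothetical, the boundary knots are repeated to clamp the basis, and the `age` vector is simulated placeholder data, not the survey variable. With 21 knots and cubic splines, this construction yields 23 basis functions.

```python
import numpy as np

def bspline_basis(x, knots, degree=3):
    """Clamped B-spline design matrix via the Cox-de Boor recursion."""
    x = np.asarray(x, dtype=float)
    knots = np.asarray(knots, dtype=float)
    # Repeat the boundary knots so the basis is defined on [knots[0], knots[-1]]
    t = np.concatenate([np.repeat(knots[0], degree), knots,
                        np.repeat(knots[-1], degree)])

    # Degree 0: indicator of each half-open knot interval
    B = np.zeros((x.size, len(t) - 1))
    for j in range(len(t) - 1):
        B[:, j] = (t[j] <= x) & (x < t[j + 1])
    # Close the last non-empty interval so the right endpoint is covered
    B[x == t[-1], degree + len(knots) - 2] = 1.0

    # Raise the degree one step at a time, skipping zero-width intervals
    for d in range(1, degree + 1):
        B_new = np.zeros((x.size, len(t) - d - 1))
        for j in range(len(t) - d - 1):
            left = t[j + d] - t[j]
            right = t[j + d + 1] - t[j + 1]
            if left > 0:
                B_new[:, j] += (x - t[j]) / left * B[:, j]
            if right > 0:
                B_new[:, j] += (t[j + d + 1] - x) / right * B[:, j + 1]
        B = B_new
    return B

rng = np.random.default_rng(0)
age = rng.uniform(18, 65, size=500)               # placeholder data, NOT the survey ages
knots = np.quantile(age, np.linspace(0, 1, 21))   # percentiles 0, 0.05, ..., 1
B_age = bspline_basis(age, knots, degree=3)       # one column per basis function
```

Because the rows of `B_age` sum to one, the basis spans the constant, so it should not be combined with a separate intercept in \(\boldsymbol{z}_i\). With real age data, which takes few distinct values, some percentile knots may coincide and would need to be handled before building the basis.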

Do you think that using a linear regression with a second-degree polynomial in age provides a good approximation to the relationship found using splines in this application?