9.10 Exercises

  1. Explain the distinction between unit and item non-response. Which type of non-response can be handled with multiple imputation (as discussed in this chapter)?

  2. What type of missing data method is used when carrying out a complete case analysis?

  3. What type of missing data method is the default for most regression software?

  4. When imputing missing values, what is wrong with imputing a missing value with the mean of the observed values? How does conditional mean imputation help, and why is it still inadequate?

  5. Suppose you have a dataset with \(N\) cases, with \(n < N\) cases that are complete (no missing values for any variable). After multiple imputation, you have \(M\) complete datasets, each of size \(N\). The “effective sample size” is the correct sample size to use when computing the precision of estimates after MI. Between what two values is the effective sample size?

  6. Suppose you have a dataset with two variables, \(Y\) and \(X\), where \(Y\) is complete and \(X\) has some missing values. For each of the following statements, is the missing data mechanism MCAR, MAR, or MNAR?

  7. For which of MCAR, MAR, and MNAR can a multiple imputation analysis result in unbiased estimates?

  8. For which of MCAR, MAR, and MNAR does a complete case analysis result in unbiased estimates?

  9. When using multiple imputation, how many imputations do you need?

  10. When using multiple imputation, how should you handle a variable that will be transformed in your analysis (e.g., a categorical variable that will be collapsed; a continuous variable that will be transformed using some function)?
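
As a starting point for thinking about Exercise 4, the following is a minimal sketch on simulated data (not one of the book's datasets) of what unconditional and conditional mean imputation do to the spread of a variable and to its correlation with another variable. All object names (`x_obs`, `x_mean`, `x_cond`) are hypothetical, and the MCAR mechanism and 30% missingness are assumptions chosen only for illustration.

```r
set.seed(42)
n <- 500
x <- rnorm(n)
y <- 2 + 1.5 * x + rnorm(n)

# Set 30% of x to missing completely at random (an assumed MCAR mechanism)
x_obs <- x
x_obs[sample(n, 150)] <- NA

# Unconditional mean imputation: every missing x gets the observed mean
x_mean <- ifelse(is.na(x_obs), mean(x_obs, na.rm = TRUE), x_obs)

# Conditional mean imputation: predict missing x from y
fit <- lm(x_obs ~ y)  # lm() drops the rows with missing x by default
x_cond <- ifelse(is.na(x_obs),
                 predict(fit, newdata = data.frame(y = y)),
                 x_obs)

# Compare the spread of x and its correlation with y
cc <- !is.na(x_obs)   # complete cases
sd_results  <- c(observed     = sd(x_obs, na.rm = TRUE),
                 mean_imputed = sd(x_mean),
                 cond_imputed = sd(x_cond))
cor_results <- c(observed     = cor(x_obs[cc], y[cc]),
                 mean_imputed = cor(x_mean, y),
                 cond_imputed = cor(x_cond, y))
print(round(sd_results, 3))
print(round(cor_results, 3))
```

Comparing the printed standard deviations and correlations across the three versions of `x` shows which properties of the data each single-imputation method distorts.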

For Exercises 11 to 16, use the Natality teaching dataset (natality2018_rmph.Rdata, see Appendix A.3).

  11. Prior to imputation, compute descriptive statistics for father’s race/Hispanic origin (FRACEHISP), education (FEDUC), and age (FAGECOMB), including the number of missing values. Use all available data for each variable (rather than a complete case analysis).

  12. Next, we would like to compute descriptive statistics for father’s race/Hispanic origin (FRACEHISP), education (FEDUC), and age (FAGECOMB) after using multiple imputation. There are a number of auxiliary variables that can be included in the imputation model. The auxiliary variables are non-analysis variables in the dataset that may be correlated with the father’s characteristics and/or correlated with the chance that the father’s characteristics are missing. The following is the full list of analysis and auxiliary variables.

  • Father’s race/Hispanic origin (FRACEHISP)
  • Father’s education (FEDUC)
  • Father’s age (FAGECOMB)
  • Mother’s race/Hispanic origin (MRACEHISP)
  • Mother’s education (MEDUC)
  • Mother’s age (MAGER)
  • Marital Status (DMAR)
  • Prior births now living (PRIORLIVE)
  • Birthweight (g) (DBWT)
  • Month prenatal care began (PRECARE)
  • WIC (WIC)
  • Risk factors reported (risks)
  • Preterm birth (preterm)

In this exercise, visualize the pattern of missing data.

  13. How many imputations are needed? Compute the recommended number, as well as an alternative for a large dataset with a lot of missing values.

  14. It turns out that using the proportion of cases with any missing values when computing the number of imputations results in mice() taking a very long time to run. For this exercise, fit the imputation model using a smaller number of imputations, based on the average proportion of cases with missing values, and then examine the output. What method was used to impute each variable?

  15. Visualize the imputations for father’s age.

  16. Compute the descriptive statistics after MI. How do these compare to the descriptive statistics before using MI (computed in Exercise 11), and what does that say about the nature of the missing data?
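
The two missingness summaries mentioned in Exercises 13 and 14 can be computed in base R. The following is a minimal sketch on a small hypothetical data frame (`toy`, not the Natality dataset). It assumes the rule of thumb that the number of imputations should be about the percentage of cases with any missing value, and it interprets the alternative as the average percentage of missing values across variables. Check both assumptions against the chapter before relying on them.

```r
# Hypothetical toy data with a few missing values
toy <- data.frame(
  a = c(1, NA, 3, 4, 5, 6, 7, 8, 9, 10),
  b = c(NA, NA, 3, 4, 5, 6, 7, 8, 9, 10),
  c = c(1, 2, 3, 4, 5, 6, 7, 8, 9, NA)
)

# Proportion of cases (rows) with at least one missing value
p_any <- mean(!complete.cases(toy))

# Average proportion of missing values across all variables
p_avg <- mean(is.na(toy))

# Convert each proportion to a whole number of imputations
m_any <- round(100 * p_any)
m_avg <- round(100 * p_avg)

print(c(m_any = m_any, m_avg = m_avg))
```

When many variables each have a little missingness in different rows, `p_any` can be much larger than `p_avg`, which is why the second summary leads to a shorter run time.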

For Exercises 17 to 23, use the 2020 UN Human Development Data (unhdd2020.rmph.Rdata, see Appendix A.2).

  17. After handling missing data using multiple imputation, fit a regression model to test the association between the outcome “child under 5y mortality (2018, per 1,000 live births)” (mort_lt5) and the predictors “female population with at least some secondary education (2015-2019, % ages 25 and older)” (educ_f), “child malnutrition - stunting (moderate or severe) (2010-2019, % under age 5)” (stunt), and “infants exclusively breastfed (2010-2019, % ages 0-5 months)” (breast). Assume no changes need to be made to any of these variables – you will explore other aspects of this analysis in subsequent questions.

  18. Check the normality, linearity, and constant variance assumptions for the model you fit in the previous exercise. What assumptions are violated?

  19. Examine a histogram of the outcome. Use a Box-Cox outcome transformation (based on the original data, before imputation), re-fit the imputation model, re-fit the regression model, and re-check the normality, linearity, and constant variance assumptions. Do any problems remain?

  20. Starting with the model you fit in the previous exercise, relax the linearity assumption for stunt and breast using polynomial transformations (e.g., quadratic, cubic, or higher order). Remember to center each variable prior to transformation. Then re-check the normality, linearity, and constant variance assumptions. Do any problems remain?

  21. Redo the previous exercise, this time also including a quadratic for educ_f. Then re-check the normality, linearity, and constant variance assumptions. Do any problems remain?

  22. Using the final model from the previous exercise, predict child mortality for a nation with 40% child malnutrition, 25% infants exclusively breastfed, and 35% female population with at least some secondary education. Compare this to the prediction when these values are 10%, 70%, and 90%. Hint: In your prediction data.frame, enter a value for every term in the model, keeping in mind that the terms in this model were centered and some were squared or cubed.

  23. Using the final model from Exercise 21, expand the imputation and regression models to assess if the quadratic association between female education and mortality depends on countries’ HDI group (hdi_group). Use the transform-then-impute method for including an interaction.
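
The hint about the prediction data.frame can be illustrated on simulated data. The following is a minimal sketch using a single complete dataset and one predictor, so it shows only how to build the centered and squared terms for a new observation, not how to pool predictions across imputed datasets. The names `x_c`, `x_c2`, and `newdat` are hypothetical.

```r
set.seed(1)
n <- 200
x <- runif(n, 0, 100)
y <- 5 + 0.04 * (x - 50) + 0.002 * (x - 50)^2 + rnorm(n, sd = 0.5)

# Center the predictor, then create the squared term as its own column
x_center <- mean(x)
dat <- data.frame(y = y, x_c = x - x_center)
dat$x_c2 <- dat$x_c^2

fit <- lm(y ~ x_c + x_c2, data = dat)

# To predict at x = 40, apply the SAME centering constant, then square
new_x <- 40
newdat <- data.frame(x_c = new_x - x_center)
newdat$x_c2 <- newdat$x_c^2

pred <- predict(fit, newdata = newdat)

# Cross-check against the fitted coefficients
b <- coef(fit)
pred_manual <- b["(Intercept)"] + b["x_c"] * newdat$x_c + b["x_c2"] * newdat$x_c2
print(c(predict = unname(pred), manual = unname(pred_manual)))
```

The centering constant must be the one used when the model terms were created, not one recomputed from the new values. If the outcome was transformed before fitting, the prediction is on the transformed scale.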

For Exercises 24 to 26, use the NHANES 2017-2018 fasting subsample teaching dataset (nhanes1718_adult_fast_sub_rmph.Rdata, see Appendix A.1). Create a dichotomous version of PHQ-9 representing “at least mild depression” (PHQ-9 \(\ge\) 5) using the following code.

library(tidyverse)  # provides %>% and mutate()

load("Data/nhanes1718_adult_fast_sub_rmph.Rdata")

# Create dichotomized PHQ-9
# "PHQ-9 scores of 5, 10, 15, and 20 represented
#  mild, moderate, moderately severe, and
#  severe depression, respectively"
#  (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1495268/)

nhanes <- nhanes_adult_fast_sub %>% 
  mutate(depression = factor(phq9 >= 5,
                             levels = c(FALSE, TRUE),
                             labels = c("No", "Yes")))

# Check derivation
table(nhanes$phq9, nhanes$depression, exclude = NULL)
tapply(nhanes$phq9, nhanes$depression, range, na.rm = TRUE)
  24. After handling missing data using multiple imputation, fit a regression model to test whether the outcome “at least mild depression” is significantly associated with ever told doctor had trouble sleeping (SLQ050) after adjusting for age (RIDAGEYR), gender (RIAGENDR), income (income), and days someone engages in vigorous recreational activities (PAQ655). Answer the question and report the AOR, 95% confidence interval, and p-value. Also, which other predictors are significantly associated with “at least mild depression”?

  25. Expand the model you fit in the previous exercise to assess whether the association between “at least mild depression” and trouble sleeping depends on gender. Use the stratification method for imputing an interaction. Regardless of the statistical significance of the interaction, estimate the sleep effect at each level of gender.

  26. Assess the goodness of fit of the model from the previous exercise using the Hosmer-Lemeshow test, as well as calibration plots.

  27. For this exercise, use the teaching dataset based on the Framingham Heart Study (fram_time_invar_rmph.rData, see Appendix A.6). After handling missing data using multiple imputation, fit a regression model to test if time to angina differs between participants with different levels of education (EDUC), adjusted for age (AGE) and sex (SEX). The time variable is TIMEAP and the event indicator is ANGINA.