9.7 Logistic regression after MI

The syntax for logistic regression after MI is similar to that for linear regression after MI (Section 9.6), except that glm() is used instead of lm().
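
For reference, the general pattern is the same as before; only the model-fitting call changes. A minimal sketch with placeholder names (imp, y, x1, x2):

# General pattern (placeholder names): fit the model on each imputed
# dataset with with(), then pool the results with pool()
fit.imp <- with(imp, glm(y ~ x1 + x2, family = binomial))
summary(pool(fit.imp), conf.int = T)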

Example 9.2: Using the NSDUH 2019 teaching dataset (see Appendix A.5), what is the association between lifetime marijuana use (mj_lifetime) and age at first use of alcohol (alc_agefirst), adjusted for age (demog_age_cat6), sex (demog_sex), and income (demog_income)? Use MI to handle missing data.

NOTE: This example uses the MAR dataset created for the illustration in Section 9.2 (nsduh_mar_rmph.RData). Those who never used alcohol were removed from the dataset, since we do not want to impute an age at first use of alcohol for them. Also, a random subset of the values for each variable was set to missing, with the probability of missingness for each variable depending on all the other variables but not on the variable itself.
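
The exact code used to create nsduh_mar_rmph.RData is not shown here, but the following sketch illustrates the general idea of imposing MAR missingness, using a hypothetical fully observed dataset nsduh_complete (the mice function ampute() offers a more general approach):

# Sketch only (not the actual code used to create nsduh_mar):
# make each variable's missingness depend on the OTHER variables
set.seed(3)
dat <- nsduh_complete  # hypothetical fully observed version of the data
# e.g., missingness in alc_agefirst depends on age group and sex
p.miss <- plogis(-3 + 0.5 * as.numeric(dat$demog_age_cat6) +
                   0.5 * (dat$demog_sex == "Female"))
dat$alc_agefirst[runif(nrow(dat)) < p.miss] <- NA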

First, load the data and view the extent of the missing data.

load("Data/nsduh_mar_rmph.RData")
summary(nsduh_mar)
##  mj_lifetime  alc_agefirst  demog_age_cat6  demog_sex              demog_income
##  No  :342    Min.   : 3.0   18-25: 99      Male  :376   Less than $20,000:122  
##  Yes :487    1st Qu.:15.0   26-34:118      Female:397   $20,000 - $49,999:226  
##  NA's: 14    Median :17.0   35-49:223      NA's  : 70   $50,000 - $74,999:116  
##              Mean   :17.6   50-64:191                   $75,000 or more  :342  
##              3rd Qu.:20.0   65+  :180                   NA's             : 37  
##              Max.   :45.0   NA's : 32                                          
##              NA's   :59
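
If, in addition to the counts of missing values above, you want to see the joint pattern of missingness across variables, the mice function md.pattern() can be used (output not shown):

# Joint missing data pattern (one row per observed pattern)
library(mice)   # if not already loaded
md.pattern(nsduh_mar, rotate.names = T)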

Next, fit the imputation model.

imp.nsduh <- mice(nsduh_mar,
                  seed  = 3,
                  m     = nimpute(nsduh_mar),
                  print = F)
imp.nsduh
## Class: mids
## Number of multiple imputations:  22 
## Imputation methods:
##    mj_lifetime   alc_agefirst demog_age_cat6      demog_sex   demog_income 
##       "logreg"          "pmm"      "polyreg"       "logreg"      "polyreg" 
## PredictorMatrix:
##                mj_lifetime alc_agefirst demog_age_cat6 demog_sex demog_income
## mj_lifetime              0            1              1         1            1
## alc_agefirst             1            0              1         1            1
## demog_age_cat6           1            1              0         1            1
## demog_sex                1            1              1         0            1
## demog_income             1            1              1         1            0
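
Before fitting the analysis model, it is reasonable to check the imputations, just as for linear regression after MI. A sketch of two standard mice diagnostics (output not shown):

# Trace plots of chain means and SDs to assess convergence
plot(imp.nsduh)
# Compare observed vs. imputed distributions for a continuous variable
densityplot(imp.nsduh, ~ alc_agefirst)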

Finally, fit the logistic regression on the imputed datasets and pool the results.

fit.imp.glm <- with(imp.nsduh,
                glm(mj_lifetime ~ alc_agefirst + demog_age_cat6 +
                    demog_sex + demog_income, family = binomial))
# summary(pool(fit.imp.glm), conf.int = T)
round.summary(fit.imp.glm, digits=3)
##                               estimate std.error statistic    df p.value  2.5 % 97.5 %
## (Intercept)                      6.434     0.615    10.461 592.6   0.000  5.226  7.642
## alc_agefirst                    -0.275     0.029    -9.386 513.9   0.000 -0.333 -0.218
## demog_age_cat626-34             -0.251     0.339    -0.738 711.1   0.461 -0.917  0.416
## demog_age_cat635-49             -0.809     0.305    -2.655 694.5   0.008 -1.407 -0.211
## demog_age_cat650-64             -0.790     0.311    -2.541 639.8   0.011 -1.401 -0.179
## demog_age_cat665+               -1.307     0.314    -4.157 673.5   0.000 -1.925 -0.690
## demog_sexFemale                 -0.045     0.176    -0.255 393.3   0.799 -0.391  0.301
## demog_income$20,000 - $49,999   -0.770     0.275    -2.795 687.3   0.005 -1.310 -0.229
## demog_income$50,000 - $74,999   -0.204     0.314    -0.650 719.9   0.516 -0.821  0.412
## demog_income$75,000 or more     -0.574     0.261    -2.202 686.0   0.028 -1.086 -0.062

Add exponentiate = T to summary() or round.summary() to compute odds ratios, but be careful when interpreting the results: only the estimate and confidence interval are exponentiated; the standard error in the table is still the standard error of the un-exponentiated regression coefficient. To avoid confusion, extract just the estimate, confidence interval, and p-value when exponentiating, and drop the first row since the exponentiated intercept is not of interest.

# summary(pool(fit.imp.glm), conf.int = T,
#   exponentiate = T)[-1, c("term", "estimate", "2.5 %", "97.5 %", "p.value")]
round.summary(fit.imp.glm, digits = 3,
        exponentiate = T)[-1, c("estimate", "2.5 %", "97.5 %", "p.value")]
##                               estimate 2.5 % 97.5 % p.value
## alc_agefirst                     0.759 0.717  0.804   0.000
## demog_age_cat626-34              0.778 0.400  1.515   0.461
## demog_age_cat635-49              0.445 0.245  0.810   0.008
## demog_age_cat650-64              0.454 0.246  0.836   0.011
## demog_age_cat665+                0.271 0.146  0.502   0.000
## demog_sexFemale                  0.956 0.677  1.351   0.799
## demog_income$20,000 - $49,999    0.463 0.270  0.795   0.005
## demog_income$50,000 - $74,999    0.815 0.440  1.511   0.516
## demog_income$75,000 or more      0.563 0.337  0.940   0.028

9.7.1 Multiple degree of freedom tests

mi.anova() does not work for logistic regression models, but D1() and D3() do. Fit reduced models, each omitting one categorical variable that has more than two levels, and then use D1() to compare the full and reduced models with a Wald test, or D3() for a likelihood ratio test.

For example, to get a Type 3 test for demog_age_cat6 in Example 9.2:

# Fit reduced model
fit.imp.glm.age   <- with(imp.nsduh,
  glm(mj_lifetime ~ alc_agefirst + 
        demog_sex + demog_income, family = binomial))

# Compare full and reduced models
# Wald test
summary(D1(fit.imp.glm, fit.imp.glm.age))
## 
## Models:
##  model                                                                formula
##      1 mj_lifetime ~ alc_agefirst + demog_age_cat6 + demog_sex + demog_income
##      2                  mj_lifetime ~ alc_agefirst + demog_sex + demog_income
## 
## Comparisons:
##    test statistic df1   df2 dfcom   p.value     riv
##  1 ~~ 2     5.886   4 794.7   833 0.0001138 0.06877
## 
## Number of imputations:  22   Method D1
# LR test
summary(D3(fit.imp.glm, fit.imp.glm.age))
## 
## Models:
##  model                                                                formula
##      1 mj_lifetime ~ alc_agefirst + demog_age_cat6 + demog_sex + demog_income
##      2                  mj_lifetime ~ alc_agefirst + demog_sex + demog_income
## 
## Comparisons:
##    test statistic df1   df2 dfcom    p.value     riv
##  1 ~~ 2     6.191   4 19760   833 0.00005653 0.06634
## 
## Number of imputations:  22   Method D3
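
A Type 3 test for demog_income, the other categorical variable with more than two levels, follows the same pattern (results not shown; fit.imp.glm.inc is an illustrative object name):

# Fit reduced model omitting demog_income
fit.imp.glm.inc <- with(imp.nsduh,
  glm(mj_lifetime ~ alc_agefirst + demog_age_cat6 +
        demog_sex, family = binomial))

# Compare full and reduced models
summary(D1(fit.imp.glm, fit.imp.glm.inc))  # Wald test
summary(D3(fit.imp.glm, fit.imp.glm.inc))  # Likelihood ratio test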

9.7.2 Predictions

Prediction proceeds in almost the same way as described in Section 9.6.2: predict on each imputed dataset and then pool the results using Rubin's rules. The one change is that type = "response" is added to the predict() call to get predictions on the probability scale.

Example 9.2 (continued): What is the predicted probability (and standard error) of lifetime marijuana use for someone who first used alcohol at age 13 years, is currently age 30 years, male, and has an annual income of less than $20,000?

# Prediction data.frame
# NOTE: The pooling code below assumes NEWDAT has exactly 1 row
NEWDAT <- data.frame(alc_agefirst   = 13,
                     demog_age_cat6 = "26-34",
                     demog_sex      = "Male",
                     demog_income   = "Less than $20,000")

# Get prediction for each analysis
PREDLIST <- lapply(fit.imp.glm$analyses, predict, newdata=NEWDAT,
                   se.fit=T, type = "response")

# Extract mean and variance from each analysis
MEAN <- VAR <- rep(NA, length(fit.imp.glm$analyses))
for(i in 1:length(MEAN)) {
  MEAN[i] <-  PREDLIST[[i]]$fit
  VAR[ i] <- (PREDLIST[[i]]$se.fit)^2
}

# Apply Rubin's rules
PRED.POOLED <- pool.scalar(MEAN,
                           VAR,
                           nrow(imp.nsduh$data))

# Extract the pooled mean
PRED.POOLED$qbar
## [1] 0.9309
# Extract the pooled standard error
sqrt(PRED.POOLED$t)
## [1] 0.02254
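
If a confidence interval for the predicted probability is also wanted, one rough approach is a Wald-style interval based on the pooled variance and the degrees of freedom returned by pool.scalar(). Note that an interval computed on the probability scale can extend beyond 0 or 1; pooling predictions on the log-odds scale (omitting type = "response") and then back-transforming is an alternative.

# Rough 95% CI for the predicted probability (Wald-style, probability scale)
PRED.POOLED$qbar +
  c(-1, 1) * qt(0.975, PRED.POOLED$df) * sqrt(PRED.POOLED$t)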