9.7 Logistic regression after MI

The syntax for logistic regression after MI is similar to that for linear regression after MI, except that glm() is now used instead of lm().

Example 9.2: Using the NSDUH 2019 teaching dataset (see Appendix A.5), what is the association between lifetime marijuana use (mj_lifetime) and age at first use of alcohol (alc_agefirst), adjusted for age (demog_age_cat6), sex (demog_sex), and income (demog_income)? Use MI to handle missing data.

NOTE: This example uses the MAR dataset created for the illustration in Section 9.2 (nsduh_mar_rmph.RData). Those who never used alcohol were removed from the dataset, since we do not want to impute missing ages for them. Also, a random subset of the values for each variable was set to missing, with the probability of missingness for each variable depending on all the other variables but not on the variable itself.

First, load the data and view the extent of the missing data.

load("Data/nsduh_mar_rmph.RData")
summary(nsduh_mar)
##  mj_lifetime  alc_agefirst  demog_age_cat6  demog_sex              demog_income
##  No  :343    Min.   : 3.0   18-25: 98      Male  :380   Less than $20,000:117  
##  Yes :491    1st Qu.:15.0   26-34:120      Female:392   $20,000 - $49,999:230  
##  NA's:  9    Median :17.0   35-49:216      NA's  : 71   $50,000 - $74,999:118  
##              Mean   :17.6   50-64:195                   $75,000 or more  :345  
##              3rd Qu.:20.0   65+  :173                   NA's             : 33  
##              Max.   :45.0   NA's : 41                                          
##              NA's   :49

Next, fit the imputation model.

imp.nsduh <- mice(nsduh_mar,
                  seed  = 3,
                  m     = 20,
                  print = F)
imp.nsduh
## Class: mids
## Number of multiple imputations:  20 
## Imputation methods:
##    mj_lifetime   alc_agefirst demog_age_cat6      demog_sex   demog_income 
##       "logreg"          "pmm"      "polyreg"       "logreg"      "polyreg" 
## PredictorMatrix:
##                mj_lifetime alc_agefirst demog_age_cat6 demog_sex demog_income
## mj_lifetime              0            1              1         1            1
## alc_agefirst             1            0              1         1            1
## demog_age_cat6           1            1              0         1            1
## demog_sex                1            1              1         0            1
## demog_income             1            1              1         1            0

Finally, fit the logistic regression on the imputed datasets and pool the results.

NOTE: Only include in the regression model variables that were included in the dataset prior to multiple imputation (see Section 9.4.2).

fit.imp.glm <- with(imp.nsduh,
                glm(mj_lifetime ~ alc_agefirst + demog_age_cat6 +
                    demog_sex + demog_income, family = binomial))
# summary(pool(fit.imp.glm), conf.int = T)
round.summary(fit.imp.glm, digits = 3)
##                               estimate std.error statistic    df p.value  2.5 % 97.5 %
## (Intercept)                      6.022     0.588    10.239 683.1   0.000  4.867  7.177
## alc_agefirst                    -0.265     0.028    -9.361 594.4   0.000 -0.320 -0.209
## demog_age_cat626-34             -0.252     0.334    -0.754 716.0   0.451 -0.908  0.404
## demog_age_cat635-49             -0.824     0.298    -2.769 800.8   0.006 -1.408 -0.240
## demog_age_cat650-64             -0.743     0.301    -2.473 773.5   0.014 -1.334 -0.153
## demog_age_cat665+               -1.391     0.307    -4.532 788.9   0.000 -1.993 -0.789
## demog_sexFemale                  0.012     0.172     0.067 478.7   0.947 -0.327  0.350
## demog_income$20,000 - $49,999   -0.457     0.271    -1.689 746.0   0.092 -0.989  0.074
## demog_income$50,000 - $74,999   -0.032     0.310    -0.103 760.3   0.918 -0.642  0.577
## demog_income$75,000 or more     -0.325     0.259    -1.254 727.3   0.210 -0.834  0.184

Add exponentiate = T to summary() or round.summary() to compute odds ratios, but interpret the results with care: only the estimate and confidence interval are exponentiated; the standard error in the table is still the standard error of the un-exponentiated regression coefficient. To avoid confusion, extract just the estimate, confidence interval, and p-value when exponentiating (and drop the first row, since the exponentiated intercept is not of interest).

# summary(pool(fit.imp.glm), conf.int = T,
#   exponentiate = T)[-1, c("term", "estimate", "2.5 %", "97.5 %", "p.value")]
round.summary(fit.imp.glm, digits = 3,
        exponentiate = T)[-1, c("estimate", "2.5 %", "97.5 %", "p.value")]
##                               estimate 2.5 % 97.5 % p.value
## alc_agefirst                     0.767 0.726  0.811   0.000
## demog_age_cat626-34              0.777 0.403  1.498   0.451
## demog_age_cat635-49              0.439 0.245  0.787   0.006
## demog_age_cat650-64              0.475 0.263  0.858   0.014
## demog_age_cat665+                0.249 0.136  0.455   0.000
## demog_sexFemale                  1.012 0.721  1.419   0.947
## demog_income$20,000 - $49,999    0.633 0.372  1.077   0.092
## demog_income$50,000 - $74,999    0.968 0.526  1.782   0.918
## demog_income$75,000 or more      0.722 0.434  1.202   0.210
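Equivalently, the odds ratios can be computed by exponentiating the pooled coefficients and confidence limits yourself, which makes explicit what exponentiate = T is doing behind the scenes. A minimal sketch (the column names come from the summary() output shown above):

```r
pooled <- summary(pool(fit.imp.glm), conf.int = T)

# Exponentiate the estimate and confidence limits (not the standard error),
# dropping the intercept row
data.frame(term = pooled$term[-1],
           OR   = exp(pooled$estimate[-1]),
           lo   = exp(pooled$`2.5 %`[-1]),
           hi   = exp(pooled$`97.5 %`[-1]))
```

The results should match the exponentiated table above.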

9.7.1 Multiple degree of freedom tests

mi.anova() does not work for logistic regression models, but D1() does. Fit reduced models, each omitting one categorical variable with more than two levels, and then compare the full and reduced models using D1() for a Wald test or D3() for a likelihood ratio test.

For example, to get a Type 3 test for demog_age_cat6 in Example 9.2:

# Fit reduced model
fit.imp.glm.age   <- with(imp.nsduh,
  glm(mj_lifetime ~ alc_agefirst + 
        demog_sex + demog_income, family = binomial))

# Compare full and reduced models
# Wald test
summary(D1(fit.imp.glm, fit.imp.glm.age))
## 
## Models:
##  model                                                                formula
##      1 mj_lifetime ~ alc_agefirst + demog_age_cat6 + demog_sex + demog_income
##      2                  mj_lifetime ~ alc_agefirst + demog_sex + demog_income
## 
## Comparisons:
##    test statistic df1   df2 dfcom    p.value   riv
##  1 ~~ 2      6.78   4 804.9   833 0.00002278 0.054
## 
## Number of imputations:  20   Method D1
# LR test
summary(D3(fit.imp.glm, fit.imp.glm.age))
## 
## Models:
##  model                                                                formula
##      1 mj_lifetime ~ alc_agefirst + demog_age_cat6 + demog_sex + demog_income
##      2                  mj_lifetime ~ alc_agefirst + demog_sex + demog_income
## 
## Comparisons:
##    test statistic df1   df2 dfcom    p.value     riv
##  1 ~~ 2     7.042   4 19610   833 0.00001164 0.06281
## 
## Number of imputations:  20   Method D3
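The same pattern applies to the other multi-level categorical predictor in this model, demog_income: fit a reduced model omitting it and compare to the full model. A sketch:

```r
# Fit reduced model omitting demog_income
fit.imp.glm.inc <- with(imp.nsduh,
  glm(mj_lifetime ~ alc_agefirst + demog_age_cat6 + demog_sex,
      family = binomial))

# Type 3 Wald test for demog_income
summary(D1(fit.imp.glm, fit.imp.glm.inc))
```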

9.7.2 Predictions

Prediction proceeds in almost the same way as described in Section 9.6.2 – predict on each imputed dataset and then pool the results using Rubin's rules. The one change is that type = "response" is added to predict() to get predictions on the probability scale.

Example 9.2 (continued): What is the predicted probability (and standard error) of lifetime marijuana use for someone who first used alcohol at age 13 years, is currently age 30 years, male, and has an annual income of less than $20,000?

# Prediction data.frame
# Code below works if NEWDAT has only 1 row
NEWDAT <- data.frame(alc_agefirst   = 13,
                     demog_age_cat6 = "26-34",
                     demog_sex      = "Male",
                     demog_income   = "Less than $20,000")

# Get prediction for each analysis
PREDLIST <- lapply(fit.imp.glm$analyses, predict, newdata=NEWDAT,
                   se.fit=T, type = "response")

# Extract mean and variance from each analysis
MEAN <- VAR <- rep(NA, length(fit.imp.glm$analyses))
for(i in 1:length(MEAN)) {
  MEAN[i] <-  PREDLIST[[i]]$fit
  VAR[ i] <- (PREDLIST[[i]]$se.fit)^2
}

# Apply Rubin's rules
PRED.POOLED <- pool.scalar(MEAN,
                           VAR,
                           nrow(imp.nsduh$data))

# Extract the pooled mean
PRED.POOLED$qbar
## [1] 0.911
# Extract the pooled standard error
sqrt(PRED.POOLED$t)
## [1] 0.02772
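As a check on pool.scalar(), Rubin's rules for a scalar estimate can be computed directly: the pooled estimate is the mean of the per-imputation estimates, and the total variance is the within-imputation variance plus (1 + 1/m) times the between-imputation variance. A minimal sketch, reusing MEAN and VAR from above:

```r
m    <- length(MEAN)          # number of imputations
qbar <- mean(MEAN)            # pooled estimate
ubar <- mean(VAR)             # within-imputation variance
b    <- var(MEAN)             # between-imputation variance
t    <- ubar + (1 + 1/m) * b  # total variance (Rubin's rules)

c(qbar = qbar, se = sqrt(t))  # should match pool.scalar()
```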