9.7 Logistic regression after MI
The syntax for logistic regression after MI is similar to that for linear regression after MI, except that now glm() is used instead of lm().
Example 9.2: Using the NSDUH 2019 teaching dataset (see Appendix A.5), what is the association between lifetime marijuana use (mj_lifetime) and age at first use of alcohol (alc_agefirst), adjusted for age (demog_age_cat6), sex (demog_sex), and income (demog_income)? Use MI to handle missing data.
NOTE: This example uses the MAR dataset created for the illustration in Section 9.2 (nsduh_mar_rmph.RData). Those who never used alcohol were removed from the dataset, since we do not want to impute missing ages for them. Also, a random subset of the values for each variable was set to missing, with the probability of missingness for each variable depending on all the other variables but not the variable itself.
First, load the data and view the extent of the missing data.
load("Data/nsduh_mar_rmph.RData")
summary(nsduh_mar)
## mj_lifetime alc_agefirst demog_age_cat6 demog_sex demog_income
## No :342 Min. : 3.0 18-25: 99 Male :376 Less than $20,000:122
## Yes :487 1st Qu.:15.0 26-34:118 Female:397 $20,000 - $49,999:226
## NA's: 14 Median :17.0 35-49:223 NA's : 70 $50,000 - $74,999:116
## Mean :17.6 50-64:191 $75,000 or more :342
## 3rd Qu.:20.0 65+ :180 NA's : 37
## Max. :45.0 NA's : 32
## NA's :59
Next, fit the imputation model.
imp.nsduh <- mice(nsduh_mar,
                  seed  = 3,
                  m     = nimpute(nsduh_mar),
                  print = F)
imp.nsduh
## Class: mids
## Number of multiple imputations: 22
## Imputation methods:
## mj_lifetime alc_agefirst demog_age_cat6 demog_sex demog_income
## "logreg" "pmm" "polyreg" "logreg" "polyreg"
## PredictorMatrix:
## mj_lifetime alc_agefirst demog_age_cat6 demog_sex demog_income
## mj_lifetime 0 1 1 1 1
## alc_agefirst 1 0 1 1 1
## demog_age_cat6 1 1 0 1 1
## demog_sex 1 1 1 0 1
## demog_income 1 1 1 1 0
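The methods and predictor matrix above are mice()'s defaults, chosen from each variable's type. When a default needs overriding, one common pattern is a "dry run" with maxit = 0, editing the returned settings, and then imputing. The following is a minimal sketch using mice's built-in nhanes2 data (an illustrative assumption; the variables bmi, hyp, and chl belong to that dataset, not to this example):

```r
library(mice)

# Dry run: sets up methods and predictor matrix without imputing
ini  <- mice(nhanes2, maxit = 0, print = FALSE)
meth <- ini$method
meth["bmi"] <- "norm"        # e.g., Bayesian linear regression instead of pmm
pred <- ini$predictorMatrix
pred["chl", "hyp"] <- 0      # e.g., do not use hyp when imputing chl
imp  <- mice(nhanes2, method = meth, predictorMatrix = pred,
             seed = 3, print = FALSE)
```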
Finally, fit the logistic regression on the imputed datasets and pool the results.
fit.imp.glm <- with(imp.nsduh,
                    glm(mj_lifetime ~ alc_agefirst + demog_age_cat6 +
                          demog_sex + demog_income, family = binomial))

# summary(pool(fit.imp.glm), conf.int = T)
round.summary(fit.imp.glm, digits = 3)
## estimate std.error statistic df p.value 2.5 % 97.5 %
## (Intercept) 6.434 0.615 10.461 592.6 0.000 5.226 7.642
## alc_agefirst -0.275 0.029 -9.386 513.9 0.000 -0.333 -0.218
## demog_age_cat626-34 -0.251 0.339 -0.738 711.1 0.461 -0.917 0.416
## demog_age_cat635-49 -0.809 0.305 -2.655 694.5 0.008 -1.407 -0.211
## demog_age_cat650-64 -0.790 0.311 -2.541 639.8 0.011 -1.401 -0.179
## demog_age_cat665+ -1.307 0.314 -4.157 673.5 0.000 -1.925 -0.690
## demog_sexFemale -0.045 0.176 -0.255 393.3 0.799 -0.391 0.301
## demog_income$20,000 - $49,999 -0.770 0.275 -2.795 687.3 0.005 -1.310 -0.229
## demog_income$50,000 - $74,999 -0.204 0.314 -0.650 719.9 0.516 -0.821 0.412
## demog_income$75,000 or more -0.574 0.261 -2.202 686.0 0.028 -1.086 -0.062
Add exponentiate = T to summary() or round.summary() to compute odds ratios, but be careful interpreting the results: only the estimate and confidence interval are exponentiated; the standard error in the table is still the standard error of the un-exponentiated regression coefficient. To avoid confusion, extract just the estimate, confidence interval, and p-value when exponentiating (and drop the first row, since the exponentiated intercept is not of interest).
# summary(pool(fit.imp.glm), conf.int = T,
# exponentiate = T)[-1, c("term", "estimate", "2.5 %", "97.5 %", "p.value")]
round.summary(fit.imp.glm, digits = 3,
exponentiate = T)[-1, c("estimate", "2.5 %", "97.5 %", "p.value")]
## estimate 2.5 % 97.5 % p.value
## alc_agefirst 0.759 0.717 0.804 0.000
## demog_age_cat626-34 0.778 0.400 1.515 0.461
## demog_age_cat635-49 0.445 0.245 0.810 0.008
## demog_age_cat650-64 0.454 0.246 0.836 0.011
## demog_age_cat665+ 0.271 0.146 0.502 0.000
## demog_sexFemale 0.956 0.677 1.351 0.799
## demog_income$20,000 - $49,999 0.463 0.270 0.795 0.005
## demog_income$50,000 - $74,999 0.815 0.440 1.511 0.516
## demog_income$75,000 or more 0.563 0.337 0.940 0.028
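As a check on what exponentiate = T does, the odds ratios above are simply exp() applied to the estimate and confidence limits from the earlier, un-exponentiated table. For example, for alc_agefirst (values copied from the table above; small rounding differences are expected):

```r
# exp() of the rounded estimate and 95% confidence limits for alc_agefirst
exp(c(estimate = -0.275, lower = -0.333, upper = -0.218))
```

So each additional year of age at first alcohol use is associated with roughly 24% lower odds of lifetime marijuana use.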
9.7.1 Multiple degree of freedom tests
mi.anova() does not work for logistic regression models, but D1() does. Fit reduced models, each omitting one categorical variable that has more than 2 levels, and then use D1() to compare the full and reduced models using a Wald test, or D3() for a likelihood ratio test.
For example, to get a Type 3 test for demog_age_cat6 in Example 9.2:
# Fit reduced model
fit.imp.glm.age <- with(imp.nsduh,
                        glm(mj_lifetime ~ alc_agefirst +
                              demog_sex + demog_income, family = binomial))
# Compare full and reduced models
# Wald test
summary(D1(fit.imp.glm, fit.imp.glm.age))
##
## Models:
## model formula
## 1 mj_lifetime ~ alc_agefirst + demog_age_cat6 + demog_sex + demog_income
## 2 mj_lifetime ~ alc_agefirst + demog_sex + demog_income
##
## Comparisons:
## test statistic df1 df2 dfcom p.value riv
## 1 ~~ 2 5.886 4 794.7 833 0.0001138 0.06877
##
## Number of imputations: 22 Method D1
# LR test
summary(D3(fit.imp.glm, fit.imp.glm.age))
##
## Models:
## model formula
## 1 mj_lifetime ~ alc_agefirst + demog_age_cat6 + demog_sex + demog_income
## 2 mj_lifetime ~ alc_agefirst + demog_sex + demog_income
##
## Comparisons:
## test statistic df1 df2 dfcom p.value riv
## 1 ~~ 2 6.191 4 19760 833 0.00005653 0.06634
##
## Number of imputations: 22 Method D3
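The same full-versus-reduced workflow can be reproduced end-to-end on mice's built-in nhanes2 data (an illustrative assumption; there, hyp is a two-level factor outcome and the multi-level factor being tested is age):

```r
library(mice)

imp  <- mice(nhanes2, seed = 3, print = FALSE)
fit1 <- with(imp, glm(hyp ~ age + bmi, family = binomial))  # full model
fit0 <- with(imp, glm(hyp ~ bmi, family = binomial))        # omit age
summary(D1(fit1, fit0))  # Wald test for the multi-level age factor
```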
9.7.2 Predictions
Prediction proceeds in almost the same way as described in Section 9.6.2 – predict on each imputed dataset and then pool the results using Rubin's rules. The one change is that type = "response" is added to get predictions on the probability scale.
Example 9.2 (continued): What is the predicted probability (and standard error) of lifetime marijuana use for someone who first used alcohol at age 13 years, is currently age 30 years, male, and has an annual income of less than $20,000?
# Prediction data.frame
# Code below works if NEWDAT has only 1 row
NEWDAT <- data.frame(alc_agefirst  = 13,
                     demog_age_cat6 = "26-34",
                     demog_sex      = "Male",
                     demog_income   = "Less than $20,000")
# Get prediction for each analysis
PREDLIST <- lapply(fit.imp.glm$analyses, predict, newdata = NEWDAT,
                   se.fit = T, type = "response")
# Extract mean and variance from each analysis
MEAN <- VAR <- rep(NA, length(fit.imp.glm$analyses))
for(i in 1:length(MEAN)) {
  MEAN[i] <- PREDLIST[[i]]$fit
  VAR[i]  <- (PREDLIST[[i]]$se.fit)^2
}
# Apply Rubin's rules
PRED.POOLED <- pool.scalar(MEAN,
                           VAR,
                           nrow(imp.nsduh$data))
# Extract the pooled mean
PRED.POOLED$qbar
## [1] 0.9309
# Extract the pooled standard error
sqrt(PRED.POOLED$t)
## [1] 0.02254
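If an interval is also wanted, a Wald-style 95% confidence interval can be formed from the pooled mean and standard error. A sketch using the values printed above (note that a normal-approximation interval on the probability scale can behave poorly near 0 or 1; working on the logit scale and back-transforming is safer in those cases):

```r
# Wald-style 95% CI from the pooled mean and SE printed above
qbar <- 0.9309                       # pooled predicted probability
se   <- 0.02254                      # pooled standard error
qbar + c(-1, 1) * qnorm(0.975) * se  # roughly 0.887 to 0.975
```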