9.6 Linear regression after MI
Use with()
to fit a regression model on a multiply imputed dataset (mids
) object to produce a multiple imputation repeated analysis (mira
) object. Afterwards, use pool()
to create a multiple imputation pooled results (mipo
) object. Returning to Example 9.1, the following code fits a linear regression model after MI.
NOTE: Only include in the regression model variables that were included in the dataset prior to multiple imputation (see Section 9.4.2). Some can be left out, but no new ones should be added – the regression model must not be more complex than the imputation model.
fit.imp.lm <- with(imp.nhanes.new,
lm(LBDGLUSI_trans ~ BMXWAIST + smoker + RIDAGEYR +
RIAGENDR + race_eth + income))
with()
fits the regression model separately for each complete dataset. The following extracts a list
of regression results for each imputation.
For example, the results for the first analysis are:
# Full summary
# summary(fit.imp.lm$analyses[[1]])
# Just the coefficients
round(summary(fit.imp.lm$analyses[[1]])$coef, 5)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.11247 0.00352 -31.94801 0.00000
## BMXWAIST 0.00031 0.00003 10.63284 0.00000
## smokerPast 0.00133 0.00116 1.13986 0.25462
## smokerCurrent 0.00071 0.00141 0.50642 0.61267
## RIDAGEYR 0.00031 0.00003 10.08568 0.00000
## RIAGENDRFemale -0.00492 0.00098 -5.02719 0.00000
## race_ethNH White -0.00395 0.00132 -2.99556 0.00281
## race_ethNH Black -0.00298 0.00183 -1.62541 0.10439
## race_ethNH Other -0.00196 0.00194 -1.00794 0.31373
## income$25,000 to < $55,000 0.00090 0.00152 0.59550 0.55165
## income$55,000+ 0.00010 0.00137 0.06974 0.94441
Use summary(pool())
to apply Rubin’s rules to obtain the final regression coefficients and 95% confidence intervals.
The numbers in the resulting table may have a lot of digits, and unfortunately the round()
function does not work directly on a mipo.summary
object. To round the results, use the function round.summary()
(found in Functions_rmph.R
which you loaded at the beginning of this chapter) which takes a mira
object, pools the results, and rounds the output to a specified number of digits
.
## estimate std.error statistic df p.value 2.5 % 97.5 %
## (Intercept) -0.1117 0.0036 -30.9663 889.2 0.0000 -0.1188 -0.1046
## BMXWAIST 0.0003 0.0000 9.9470 846.6 0.0000 0.0002 0.0004
## smokerPast 0.0014 0.0012 1.1532 978.1 0.2491 -0.0009 0.0037
## smokerCurrent 0.0007 0.0014 0.5238 980.7 0.6005 -0.0020 0.0035
## RIDAGEYR 0.0003 0.0000 10.0374 979.4 0.0000 0.0002 0.0004
## RIAGENDRFemale -0.0048 0.0010 -4.8885 981.2 0.0000 -0.0068 -0.0029
## race_ethNH White -0.0038 0.0013 -2.8704 975.3 0.0042 -0.0064 -0.0012
## race_ethNH Black -0.0028 0.0019 -1.5040 974.6 0.1329 -0.0064 0.0008
## race_ethNH Other -0.0019 0.0020 -0.9904 970.2 0.3222 -0.0058 0.0019
## income$25,000 to < $55,000 0.0011 0.0016 0.6740 803.5 0.5005 -0.0020 0.0041
## income$55,000+ 0.0004 0.0014 0.2469 622.9 0.8051 -0.0025 0.0032
9.6.1 Multiple degree of freedom tests
mi.anova()
(Grund, Lüdtke, and Robitzsch 2016) from the miceadds
package (Robitzsch, Grund, and Henke 2023) carries out multiple degree of freedom Type III tests pooled over imputations, and also computes the pooled \(R^2\) value.
mi.anova(imp.nhanes.new,
"LBDGLUSI_trans ~ BMXWAIST + smoker + RIDAGEYR + RIAGENDR +
race_eth + income",
type = 3)
## Univariate ANOVA for Multiply Imputed Data (Type 3)
##
## lm Formula: LBDGLUSI_trans ~ BMXWAIST + smoker + RIDAGEYR + RIAGENDR +
## race_eth + income
## R^2=0.1953
## ..........................................................................
## ANOVA Table
## SSQ df1 df2 F value Pr(>F) eta2 partial.eta2
## BMXWAIST 0.02434 1 5894 97.9628 0.00000 0.08455 0.09507
## smoker 0.00033 2 363903 0.6861 0.50355 0.00114 0.00141
## RIDAGEYR 0.02374 1 836530 100.8680 0.00000 0.08247 0.09296
## RIAGENDR 0.00563 1 813144 23.8936 0.00000 0.01954 0.02371
## race_eth 0.00201 3 250497 2.8327 0.03677 0.00699 0.00861
## income 0.00017 2 5274 0.2740 0.76038 0.00058 0.00072
## Residual 0.23166 NA NA NA NA NA NA
mi.anova()
only works, however, for linear regression. A more generally applicable method, which works for linear and other forms of regression, is D1()
which carries out a Wald test comparing two nested models. To get a multiple degree of freedom test for a categorical variable with more than two levels, fit a reduced model that omits that variable, and use D1()
to compare the full and reduced models. The first time you run D1()
, you will be prompted to install the mitml
package (Grund, Robitzsch, and Luedtke 2023) if you have not already installed it.
For example, to get a Type III test for smoker
in Example 9.1:
# Reduced model omitting smoker
fit.imp.lm.smoker <- with(imp.nhanes.new,
lm(LBDGLUSI_trans ~ BMXWAIST + RIDAGEYR + RIAGENDR + race_eth + income))
# Compare full and reduced model
summary(D1(fit.imp.lm, fit.imp.lm.smoker))
##
## Models:
## model formula
## 1 LBDGLUSI_trans ~ BMXWAIST + smoker + RIDAGEYR + RIAGENDR + race_eth + income
## 2 LBDGLUSI_trans ~ BMXWAIST + RIDAGEYR + RIAGENDR + race_eth + income
##
## Comparisons:
## test statistic df1 df2 dfcom p.value riv
## 1 ~~ 2 0.6891 2 985.9 989 0.5023 0.005758
##
## Number of imputations: 20 Method D1
To carry out a likelihood ratio test instead, use D3()
.
##
## Models:
## model formula
## 1 LBDGLUSI_trans ~ BMXWAIST + smoker + RIDAGEYR + RIAGENDR + race_eth + income
## 2 LBDGLUSI_trans ~ BMXWAIST + RIDAGEYR + RIAGENDR + race_eth + income
##
## Comparisons:
## test statistic df1 df2 dfcom p.value riv
## 1 ~~ 2 0.6972 2 1101109 989 0.498 0.005294
##
## Number of imputations: 20 Method D3
9.6.2 Predictions
To obtain predictions following MI, estimate within each analysis and pool the results over imputations using Rubin’s rules. First, create a data.frame
containing the values at which you want to predict. Second, compute the estimate (and its standard error) for each analysis. Since the mira
object, fit.imp.lm
, contains the results of each analysis in a list (fit.imp.lm$analyses
), use lapply()
to apply the predict()
function to each analysis (each element of the list). Finally, extract the predictions and their variances and use Rubin’s rules to pool the results via pool.scalar()
.
Example 9.1 (continued): Predict (transformed) fasting glucose for an individual with a waist circumference of 130 cm who is a current smoker, age 50 years, male, non-Hispanic Black, and earns at least $55,000 annually.
# Prediction data.frame
NEWDAT <- data.frame(
BMXWAIST = 130,
smoker = "Current",
RIDAGEYR = 50,
RIAGENDR = "Male",
race_eth = "NH Black",
income = "$55,000+")
# The code below assumes NEWDAT has only 1 row.
# To make predictions at different values,
# run again with a different NEWDAT.
# Get prediction for each analysis
PREDLIST <- lapply(fit.imp.lm$analyses, predict, newdata=NEWDAT, se.fit=T)
# Compute the mean and variance in each analysis
MEAN <- VAR <- rep(NA, length(fit.imp.lm$analyses))
for(i in 1:length(MEAN)) {
MEAN[i] <- PREDLIST[[i]]$fit
VAR[ i] <- (PREDLIST[[i]]$se.fit)^2
}
# Apply Rubin's rules
PRED.POOLED <- pool.scalar(MEAN,
VAR,
nrow(imp.nhanes.new$data))
# Extract the pooled mean
PRED.POOLED$qbar
## [1] -0.05931
## [1] 0.002225
If the outcome was transformed and you would like the prediction on the original scale, then back-transform (this works for the mean, but not the standard error). In Example 9.1, a Box-Cox transformation with \(\lambda\) = -1.5 was used.
## [1] 6.575
9.6.3 Polynomial predictor transformations
If you are going to include non-linear functions of variables in the regression model (e.g., Box-Cox, log, quadratic), your imputation model should include those non-linearities. This ensures that your imputation model includes the same level of complexity as your regression model. Otherwise, you will be imputing values under the assumption of linearity.
For functions that convert \(X\) into a single term \(f(x)\) (e.g., log\((X)\)), simply create a transformed version of the variable prior to imputation. This was done for the outcome fasting glucose in Example 9.1. What about a transformation that converts a predictor \(X\) into multiple terms, such as a quadratic transformation which includes both \(X\) and \(X^2\)? Should the square term be created prior to imputation (transform-then-impute) or after (impute-then-transform)? As described in Section 9.4.2, create the term prior to imputation (transform-then-impute).
NOTES:
- When using a polynomial transformation, avoid collinearity issues in the imputation model by first centering (e.g. \(cX = X -\) mean of \(X\)) and then creating higher order terms as powers of the centered variable (e.g., \(cX2 = (cX)^2, cX3 = (cX)^3\), etc.) prior to imputation.
- In both the imputation and regression model, include higher order terms using the derived version (e.g., include the square term as \(cX2\), not \(I(cX^2)\)).
- In both the imputation and regression model, include all the polynomial terms (e.g., centered linear, quadratic, cubic, etc.) in your polynomial, not just the highest order term (e.g., for a quadratic transformation, include both \(cX\) and \(cX2\), not just \(cX2\)).
Example 9.1 (continued): Regress (transformed) fasting glucose on waist circumference, adjusted for smoking status, age, gender, race/ethnicity, and income, and include a quadratic transformation for waist circumference.
First, update the pre-processed dataset to include a centered version of waist circumference and a squared version of that centered variable.
## [1] 100.5
nhanesb <- nhanes %>%
mutate(
# Centered, quadratic version of waist circumference
cBMXWAIST = BMXWAIST - 100,
cBMXWAIST2 = cBMXWAIST^2
) %>%
# Select the variables to be included in the imputation model.
# If including a quadratic, also include the linear term.
select(LBDGLUSI_trans, cBMXWAIST, cBMXWAIST2,
smoker, RIDAGEYR, RIAGENDR, race_eth, income)
Next, fit the imputation model.
Finally, fit the regression model with both the centered linear and quadratic waist circumference terms, with the quadratic term included as cBMXWAIST2
not I(cBMXWAIST^2)
.
fit.imp.quad <- with(imp.quad,
lm(LBDGLUSI_trans ~ cBMXWAIST + cBMXWAIST2 + smoker + RIDAGEYR +
RIAGENDR + race_eth + income))
round.summary(fit.imp.quad, digits = 8)[, c("estimate", "2.5 %",
"97.5 %", "p.value")]
## estimate 2.5 % 97.5 % p.value
## (Intercept) -0.08249436 -0.0870761 -0.07791260 0.00000000
## cBMXWAIST 0.00028633 0.0002205 0.00035212 0.00000000
## cBMXWAIST2 0.00000081 -0.0000016 0.00000322 0.50834469
## smokerPast 0.00136314 -0.0009372 0.00366350 0.24516547
## smokerCurrent 0.00078328 -0.0020152 0.00358175 0.58294854
## RIDAGEYR 0.00031310 0.0002507 0.00037544 0.00000000
## RIAGENDRFemale -0.00485763 -0.0068005 -0.00291479 0.00000109
## race_ethNH White -0.00395902 -0.0065948 -0.00132328 0.00327873
## race_ethNH Black -0.00291613 -0.0065572 0.00072493 0.11634728
## race_ethNH Other -0.00214190 -0.0059891 0.00170533 0.27486657
## income$25,000 to < $55,000 0.00134174 -0.0017960 0.00447953 0.40140845
## income$55,000+ 0.00058526 -0.0022808 0.00345135 0.68849725
9.6.4 Interactions
If you want to include an interaction in your regression model, make sure to include the interaction in your imputation model, as well. This ensures that your imputation model includes at least the same level of complexity as your regression model. Otherwise, you will be imputing values under the assumption of no interaction.
For an interaction involving a categorical variable, this can be done by stratifying the dataset by the categorical variable, fitting the imputation model separately in each stratum, and then combining the results. This will only work if the categorical variable has no missing values (or, if there are not many missing values, if you first exclude cases with missing values on the categorical variable). An alternative, which is more generally applicable, is to use transform-then-impute (Section 9.4.2); create the interaction term as a distinct variable in your pre-processing step and include it in the imputation model (von Hippel 2009).
The stratification and transform-then-impute methods will not, in general, lead to the same imputed values. The stratification method leads to a more complex imputation model because it effectively includes an interaction between the categorical variable and every other variable, whereas the transform-then-impute method only includes whatever interactions you explicitly include. However, transform-then-impute is more generally applicable; if the categorical variable has missing data, or if both terms in the interaction are continuous, the stratification method is not possible.
Example 9.1 (continued): Regress (transformed) fasting glucose on waist circumference, adjusted for smoking status, age, gender, race/ethnicity, and income, and include an interaction to see if the waist circumference effect depends on gender. Use MI to handle missing data.
9.6.4.1 Interaction via stratification
First, check to see if gender has any missing values. If yes, and the number is very small, then remove the cases with missing gender from the dataset. If the number is large, or you do not want to lose even a small number of cases, use the interaction via transform-then-impute method described in the next section.
##
## Male Female
## 457 543
Gender has no missing values. To use stratification to include an interaction in the imputation model, do the following: split the dataset into separate datasets for each level of gender, use mice()
on each dataset, and combine the resulting mids
objects using rbind()
. Make sure to use the same number of imputations for all strata, the same number you would use for the full dataset if you were not using stratification.
# Split data by gender
nhanes_F <- nhanes %>%
filter(RIAGENDR == "Female")
nhanes_M <- nhanes %>%
filter(RIAGENDR == "Male")
# Impute separately by gender
imp_F <- mice(nhanes_F,
seed = 3,
m = 20,
print = F)
imp_M <- mice(nhanes_M,
seed = 3,
m = 20,
print = F)
# Combine the imputations across strata
imp.int <- rbind(imp_F, imp_M)
Just to check, table gender on the first imputation. It should have all its values and no missing values. If it does not, you may have stratified on a variable with missing values (which you should not do).
# Checking the first imputation
# (should have all levels of gender and no missing values)
table(complete(imp.int)$RIAGENDR, useNA = "ifany")
##
## Male Female
## 457 543
Next, fit the model with the interaction and pool the results.
fit.imp.int <- with(imp.int,
lm(LBDGLUSI_trans ~ BMXWAIST + smoker + RIDAGEYR +
RIAGENDR + race_eth + income +
BMXWAIST:RIAGENDR))
round.summary(fit.imp.int, digits = 5)[, c("estimate", "2.5 %",
"97.5 %", "p.value")]
## estimate 2.5 % 97.5 % p.value
## (Intercept) -0.11386 -0.12345 -0.10426 0.00000
## BMXWAIST 0.00032 0.00023 0.00041 0.00000
## smokerPast 0.00142 -0.00089 0.00374 0.22889
## smokerCurrent 0.00088 -0.00193 0.00368 0.53975
## RIDAGEYR 0.00031 0.00025 0.00037 0.00000
## RIAGENDRFemale -0.00027 -0.01233 0.01178 0.96466
## race_ethNH White -0.00388 -0.00650 -0.00125 0.00383
## race_ethNH Black -0.00272 -0.00636 0.00092 0.14342
## race_ethNH Other -0.00203 -0.00588 0.00182 0.30049
## income$25,000 to < $55,000 0.00104 -0.00205 0.00413 0.51008
## income$55,000+ 0.00040 -0.00241 0.00320 0.78212
## BMXWAIST:RIAGENDRFemale -0.00005 -0.00016 0.00007 0.45021
In this example, the interaction is only 1 term, but if there were multiple terms, use D1()
to get a multiple degree of freedom Wald test of significance by comparing the model with an interaction to the model without (or D3()
to get a likelihood ratio test).
# Fit reduced model with no interaction
fit.imp.noint <- with(imp.int,
lm(LBDGLUSI_trans ~ BMXWAIST + smoker + RIDAGEYR +
RIAGENDR + race_eth + income))
# Type 3 Wald test of interaction
summary(D1(fit.imp.int, fit.imp.noint))
##
## Models:
## model
## 1
## 2
## formula
## LBDGLUSI_trans ~ BMXWAIST + smoker + RIDAGEYR + RIAGENDR + race_eth + income + BMXWAIST:RIAGENDR
## LBDGLUSI_trans ~ BMXWAIST + smoker + RIDAGEYR + RIAGENDR + race_eth + income
##
## Comparisons:
## test statistic df1 df2 dfcom p.value riv
## 1 ~~ 2 0.5706 1 911.4 988 0.4502 0.0326
##
## Number of imputations: 20 Method D1
9.6.4.2 Interaction via transform-then-impute
A more general approach to including an interaction, which works even when a categorical variable in the interaction has missing values or when both variables are continuous, is to explicitly compute the interaction terms during pre-processing followed by imputation including the computed terms. If both terms in the interaction are continuous, simply create (during pre-processing) a variable that is their product (transform-then-impute). If either are categorical, the process involves a few additional steps.
First, fit the regression model, including the interaction, to the original pre-imputation dataset to confirm what terms comprise the interaction.
fit.pre <- lm(LBDGLUSI_trans ~ BMXWAIST + smoker + RIDAGEYR +
RIAGENDR + race_eth + income +
BMXWAIST:RIAGENDR, data = nhanes)
names(coef(fit.pre))
## [1] "(Intercept)" "BMXWAIST" "smokerPast"
## [4] "smokerCurrent" "RIDAGEYR" "RIAGENDRFemale"
## [7] "race_ethNH White" "race_ethNH Black" "race_ethNH Other"
## [10] "income$25,000 to < $55,000" "income$55,000+" "BMXWAIST:RIAGENDRFemale"
There is one interaction term – BMXWAIST:RIAGENDRFemale
. To explicitly compute this term, derive a variable that is BMXWAIST
multiplied by an indicator variable corresponding to RIAGENDER == Female
(1 when this condition is true, 0 when false). In general, first create an indicator variable for each non-reference level of the categorical variable and compute the product of each indicator variable and the other term in the interaction.
NOTE: Make sure to drop the old categorical variable from the dataset.
nhanes_int <- nhanes %>%
mutate(RIAGENDRFemale = as.numeric(RIAGENDR == "Female"),
BMXWAIST_RIAGENDRFemale = BMXWAIST*RIAGENDRFemale) %>%
select(-RIAGENDR)
# Checking
table(nhanes$RIAGENDR, nhanes_int$RIAGENDRFemale, useNA = "ifany")
SUB <- nhanes$RIAGENDR == "Female"
# Should all be 0
summary(nhanes_int$BMXWAIST[SUB] -
nhanes_int$BMXWAIST_RIAGENDRFemale[SUB])
summary(nhanes_int$BMXWAIST_RIAGENDRFemale[!SUB])
# (results not shown)
Next, fit the imputation model.
Finally, fit the regression model using with()
, including the new gender indicator variable and the derived interaction term (which is entered just like any other term in the model, not with a :
).
fit.imp.int <- with(imp.int,
lm(LBDGLUSI_trans ~ BMXWAIST + smoker + RIDAGEYR +
RIAGENDRFemale + race_eth + income +
BMXWAIST_RIAGENDRFemale))
round.summary(fit.imp.int, digits = 5)
## estimate std.error statistic df p.value 2.5 % 97.5 %
## (Intercept) -0.11339 0.00487 -23.2773 943.7 0.00000 -0.12295 -0.10383
## BMXWAIST 0.00032 0.00005 6.9489 915.3 0.00000 0.00023 0.00041
## smokerPast 0.00138 0.00117 1.1777 982.6 0.23921 -0.00092 0.00369
## smokerCurrent 0.00078 0.00143 0.5491 977.9 0.58308 -0.00201 0.00358
## RIDAGEYR 0.00031 0.00003 9.8267 977.1 0.00000 0.00024 0.00037
## RIAGENDRFemale -0.00122 0.00613 -0.1992 894.9 0.84215 -0.01326 0.01082
## race_ethNH White -0.00390 0.00133 -2.9242 971.4 0.00353 -0.00652 -0.00128
## race_ethNH Black -0.00292 0.00185 -1.5780 979.0 0.11490 -0.00655 0.00071
## race_ethNH Other -0.00207 0.00196 -1.0584 976.6 0.29012 -0.00592 0.00177
## income$25,000 to < $55,000 0.00082 0.00164 0.5002 462.1 0.61717 -0.00240 0.00404
## income$55,000+ 0.00019 0.00145 0.1333 610.8 0.89402 -0.00266 0.00305
## BMXWAIST_RIAGENDRFemale -0.00004 0.00006 -0.5837 882.4 0.55956 -0.00015 0.00008
9.6.4.3 Estimate the effect of one variable at levels of the other
To estimate the effect of one variable at levels of the other, use gmodels::estimable()
on each imputation separately and pool the results using Rubin’s rules via pool.scalar()
(with some additional code to compute p-values).
# Compute mean and variance from each analysis
# Initialize vector for level 1 (Male)
MEAN1 <- VAR1 <- rep(NA, length(fit.imp.int$analyses))
# Initialize vector for level 2 (Female)
MEAN2 <- VAR2 <- rep(NA, length(fit.imp.int$analyses))
for(i in 1:length(MEAN1)) {
EST1 <- gmodels::estimable(fit.imp.int$analyses[[i]],
c("BMXWAIST" = 1))
EST2 <- gmodels::estimable(fit.imp.int$analyses[[i]],
c("BMXWAIST" = 1,
"BMXWAIST_RIAGENDRFemale" = 1))
MEAN1[i] <- as.numeric(EST1["Estimate"])
VAR1[ i] <- (as.numeric(EST1["Std. Error"])^2)
MEAN2[i] <- as.numeric(EST2["Estimate"])
VAR2[ i] <- (as.numeric(EST2["Std. Error"])^2)
}
# Apply Rubin's rules
POOLED1 <- pool.scalar(MEAN1,
VAR1,
nrow(imp.int$data))
POOLED2 <- pool.scalar(MEAN2,
VAR2,
nrow(imp.int$data))
# Extract the pooled mean and standard error
DF <- data.frame(gender = c("Male", "Female"),
est = c(POOLED1$qbar, POOLED2$qbar),
se = sqrt(c(POOLED1$t, POOLED2$t)),
df = c(POOLED1$df, POOLED2$df))
# Compute test statistic and p-value
DF$t <- DF$est/DF$se
DF$p <- 2*(pt(abs(DF$t), DF$df, lower.tail = F))
DF
## gender est se df t p
## 1 Male 0.0003173 0.00004567 925.1 6.949 0.000000000006944
## 2 Female 0.0002822 0.00003910 841.6 7.219 0.000000000001175