## 9.5 Descriptive statistics after MI

After MI, compute descriptive statistics within each complete dataset and use Rubin’s rules to pool statistics and compute their standard errors. Rubin’s rules comprise two parts – one for computing the statistic and one for its variation. For the statistic, simply average the values over the imputations. For the variation of the statistic, combine the average variance over imputations with the variation of the statistic between imputations. Additionally, if the sampling distribution of the statistic is not close to normal, then it is recommended to transform the statistic to normality before applying Rubin’s rules.

Statistics for which we can assume normality (in a large sample) include the mean and standard deviation of a continuous variable and the frequencies and proportions of levels of a categorical variable. A list of appropriate transformations for other statistics (e.g., correlation, odds ratio) can be found at Scalar inference of non-normal quantities . The strategy in that case is to transform, apply Rubin’s rules, and then back-transform to the original scale.

Later, this section will introduce some convenient functions for computing descriptive statistics after MI. But first, we demonstrate how to use Rubin’s rules to compute the mean and standard error of the mean waist circumference from Example 9.1. To do this, extract the multiply imputed datasets, compute the sample mean and variance of the sample mean for waist circumference in each, and then combine them using Rubin’s rules to get the mean and standard error pooled over the imputations. Recall that the variance of the sample mean is the sample variance divided by the sample size.

# Multiply imputed datasets stacked together
# but NOT including the original data
imp.nhanes.dat <- complete(imp.nhanes.new, action = "long")

# Sample mean in each complete imputed dataset
MEAN     <- tapply(imp.nhanes.dat$BMXWAIST, imp.nhanes.dat$.imp, mean)

# Variance of the sample mean in each imputed dataset
VAR.MEAN <- tapply(imp.nhanes.dat$BMXWAIST, imp.nhanes.dat$.imp,
var) / nrow(imp.nhanes.new$data) The following shows the means and variances by imputation, in a table and visually. cbind("MEAN" = MEAN, "VAR.MEAN" = VAR.MEAN) ## MEAN VAR.MEAN ## 1 100.5 0.2874 ## 2 100.4 0.2884 ## 3 100.4 0.2825 ## 4 100.3 0.2866 ## 5 100.6 0.2885 ## 6 100.5 0.2872 ## 7 100.5 0.2894 ## 8 100.5 0.2836 ## 9 100.4 0.2888 ## 10 100.3 0.2897 ## 11 100.5 0.2858 ## 12 100.4 0.2840 ## 13 100.5 0.2851 ## 14 100.6 0.2833 par(mfrow=c(1,2)) plot(1:nimpute(nhanes), MEAN, ylab = "Mean", xlab = "Imputation", main = "Mean, by imputation") abline(h=mean(MEAN), col="red", lty=2, lwd=2) plot(1:nimpute(nhanes), VAR.MEAN, ylab = "Variance of Mean", xlab = "Imputation", main = "Variance of Mean, by imputation") abline(h=mean(VAR.MEAN), col="red", lty=2, lwd=2) Figure 9.6: Mean and variance of waist circumference across multiple complete (imputed) datasets The average of the 14 within-imputation sample means is 100.44 cm (the horizontal dashed line in the left panel of Figure 9.6) and is the correct estimate of the mean waist circumference. The average of the 14 within-imputation variances of the sample mean is 0.2864. This turns out to be an underestimate of the variance of the sample mean waist circumference. It needs to be increased based on how variable the means are between imputations. The Rubin’s rule formula for the pooled variance adds the average variance to a factor times the variance of the means. Below, the pool.scalar() function does the computations for us and the pooled mean and variance of the mean are manually computed, as well. The latter is just to illustrate what pool.scalar() is doing; in future, just use pool.scalar(). # Rubin's rules for the mean and variance of the mean # Using pool.scalar() POOLED <- pool.scalar(MEAN, VAR.MEAN, nrow(imp.nhanes.new$data))
data.frame(pooled.mean = POOLED$qbar, mean.var = POOLED$ubar,
btw.var     = POOLED$b, pooled.var = POOLED$t)
##   pooled.mean mean.var  btw.var pooled.var
## 1       100.4   0.2864 0.007941     0.2949
# Manually (just for illustration)
data.frame(pooled.mean = mean(MEAN),
mean.var    = mean(VAR.MEAN),
btw.var     = var(MEAN),
pooled.var  = mean(VAR.MEAN) + (1 + 1/length(MEAN))*var(MEAN))
##   pooled.mean mean.var  btw.var pooled.var
## 1       100.4   0.2864 0.007941     0.2949

In this example, there was not much missing data, so there was not much variation in the means (the between-imputation variance, btw.var is small), so the pooled variance is not that much larger than the mean of the variances.

Finally, to get the pooled standard error of the mean, take the square root of the pooled variance of the mean.

sqrt(POOLED$t) ##  0.5431 NOTE: This is the pooled standard error of the mean. If you want the pooled standard deviation of waist circumference, average the within-imputation standard deviations over the imputations. The steps of computing descriptive statistics (mean, standard error of the mean, and standard deviation for continuous variables; frequency and proportion for categorical variables) in each dataset and pooling using Rubin’s rules have been coded into the functions mi.mean.se.sd(), mi.n.p(), mi.mean.se.sd.by(), and mi.n.p.by() (found in Functions_rmph.R which you loaded at the beginning of this chapter). These operate directly on a mids object so there is no need to use complete() to convert to a data.frame first. NOTES: • For variables with no missing data, these mi. functions will return the same values as standard descriptive statistic functions. • For frequencies of categorical variables that have missing values, the results after MI may not be whole numbers, as they are averages over imputations. • If a “by” variable has missing values, then even frequencies for variables that did not have missing values may not be whole numbers, as the split between groups will differ between imputations. Each function call returns a data.frame and rbind() stacks the results together. # Means and standard deviations rbind(mi.mean.se.sd(imp.nhanes.new, "LBDGLUSI"), mi.mean.se.sd(imp.nhanes.new, "BMXWAIST"), mi.mean.se.sd(imp.nhanes.new, "RIDAGEYR")) ## mean se sd ## LBDGLUSI 6.093 0.05096 1.611 ## BMXWAIST 100.441 0.54309 16.924 ## RIDAGEYR 47.876 0.52657 16.652 # Frequencies and proportions rbind(mi.n.p(imp.nhanes.new, "smoker"), mi.n.p(imp.nhanes.new, "RIAGENDR"), mi.n.p(imp.nhanes.new, "race_eth"), mi.n.p(imp.nhanes.new, "income")) ## n p ## smoker: Never 579.0 0.5790 ## smoker: Past 264.0 0.2640 ## smoker: Current 157.0 0.1570 ## RIAGENDR: Male 457.0 0.4570 ## RIAGENDR: Female 543.0 0.5430 ## race_eth: Hispanic 191.0 0.1910 ## race_eth: Non-Hispanic White 602.0 0.6020 ## race_eth: Non-Hispanic Black 115.0 0.1150 ## race_eth: Non-Hispanic Other 92.0 0.0920 ## income: <$25,000            190.0 0.1900
## income: $25,000 to <$55,000 254.7 0.2547
## income: $55,000+ 555.3 0.5553 # Means and standard deviations by another variable rbind(mi.mean.se.sd.by(imp.nhanes.new, "LBDGLUSI", BY = "wc_median_split"), mi.mean.se.sd.by(imp.nhanes.new, "RIDAGEYR", BY = "wc_median_split")) ## mean.1 se.1 sd.1 mean.2 se.2 sd.2 ## LBDGLUSI 5.699 0.05314 1.166 6.482 0.0843 1.875 ## RIDAGEYR 44.522 0.76144 16.703 51.182 0.7208 15.938 # Frequencies and proportions by another variable rbind(mi.n.p.by(imp.nhanes.new, "smoker", BY = "wc_median_split"), mi.n.p.by(imp.nhanes.new, "RIAGENDR", BY = "wc_median_split"), mi.n.p.by(imp.nhanes.new, "race_eth", BY = "wc_median_split"), mi.n.p.by(imp.nhanes.new, "income", BY = "wc_median_split")) ## n.1 p.1 n.2 p.2 ## smoker: Never 314.79 0.6341 264.21 0.52468 ## smoker: Past 110.36 0.2223 153.64 0.30511 ## smoker: Current 71.29 0.1436 85.71 0.17021 ## RIAGENDR: Male 212.29 0.4276 244.71 0.48596 ## RIAGENDR: Female 284.14 0.5724 258.86 0.51404 ## race_eth: Hispanic 91.50 0.1843 99.50 0.19759 ## race_eth: Non-Hispanic White 292.57 0.5894 309.43 0.61447 ## race_eth: Non-Hispanic Black 57.14 0.1151 57.86 0.11489 ## race_eth: Non-Hispanic Other 55.21 0.1112 36.79 0.07305 ## income: <$25,000             83.79 0.1688 106.21 0.21092
## income: $25,000 to <$55,000 130.29 0.2624 124.43 0.24709
## income: \$55,000+             282.36 0.5688 272.93 0.54199

### References

van Buuren, Stef. 2018. Flexible Imputation of Missing Data. 2nd ed. Boca Raton: Chapman; Hall/CRC.