## 5.5 Fitting the MLR model

This section demonstrates how to fit a MLR model in R. For comparison, we first look at the unadjusted models we fit in Chapter 4 (each only had one predictor variable). Second, we will fit a model adjusted for confounding due to other predictors. Since, in this chapter, we removed cases with missing values on any of a set of variables, the unadjusted results here differ slightly from those in Chapter 4.

The unadjusted results are shown in Tables 5.1 and 5.2.

fit.ex4.1.cc  <- lm(LBDGLUSI ~ BMXWAIST, data = nhanesf.complete)
Table 5.1: Regression of fasting glucose (mmol/L) on waist circumference
Characteristic Beta 95% CI1 p-value
(Intercept) 3.204 2.574, 3.833 <0.001
Waist Circumference (cm) 0.0288 0.0227, 0.0350 <0.001
1 CI = Confidence Interval
fit.ex4.2.cc  <- lm(LBDGLUSI ~ smoker,   data = nhanesf.complete)
Table 5.2: Regression of fasting glucose (mmol/L) on smoking status
Characteristic Beta 95% CI1 p-value
(Intercept) 5.96 5.81, 6.10 <0.001
Smoking Status

0.002
Never
Past 0.451 0.192, 0.711 <0.001
Current 0.240 -0.072, 0.552 0.132
1 CI = Confidence Interval

If just using the unadjusted models, we would conclude the following:

• There is a significant positive unadjusted association between fasting glucose and waist circumference (B = 0.029; 95% CI = 0.023, 0.035; p < .001).
• There is a significant association between fasting glucose and smoking status (p = .002). Past Smokers had significantly greater fasting glucose than Never Smokers (B = 0.451; 95% CI = 0.192, 0.711; p < .001). See Section 4.4 for how to re-level smoking in order to compare Past vs. Current.

However, as this is observational data, we should be wary of drawing conclusions from unadjusted models.

To fit an adjusted model, include multiple predictors in lm() separated by + signs. Then, just as with a SLR model, use summary(), confint(), and car::Anova(, type = 3) to summarize the results.

fit.ex5.1 <- lm(LBDGLUSI ~ BMXWAIST + smoker + RIDAGEYR +
RIAGENDR + race_eth + income, data = nhanesf.complete)
summary(fit.ex5.1)
##
## Call:
## lm(formula = LBDGLUSI ~ BMXWAIST + smoker + RIDAGEYR + RIAGENDR +
##     race_eth + income, data = nhanesf.complete)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -3.694 -0.749 -0.234  0.290 13.033
##
## Coefficients:
##                            Estimate Std. Error t value           Pr(>|t|)
## (Intercept)                 2.97301    0.37665    7.89 0.0000000000000091 ***
## BMXWAIST                    0.02451    0.00306    8.00 0.0000000000000040 ***
## smokerPast                  0.21121    0.12515    1.69            0.09185 .
## smokerCurrent               0.09776    0.15063    0.65            0.51650
## RIDAGEYR                    0.02505    0.00331    7.58 0.0000000000000926 ***
## RIAGENDRFemale             -0.32795    0.10514   -3.12            0.00188 **
## race_ethNon-Hispanic White -0.50864    0.14524   -3.50            0.00049 ***
## race_ethNon-Hispanic Black -0.24764    0.19919   -1.24            0.21411
## race_ethNon-Hispanic Other  0.01644    0.21618    0.08            0.93938
## income$25,000 to <$55,000 -0.10038    0.16273   -0.62            0.53753
## income$55,000+ -0.09381 0.14724 -0.64 0.52421 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 1.51 on 846 degrees of freedom ## Multiple R-squared: 0.171, Adjusted R-squared: 0.161 ## F-statistic: 17.4 on 10 and 846 DF, p-value: <0.0000000000000002 confint(fit.ex5.1) ## 2.5 % 97.5 % ## (Intercept) 2.23373 3.71228 ## BMXWAIST 0.01850 0.03052 ## smokerPast -0.03443 0.45685 ## smokerCurrent -0.19789 0.39342 ## RIDAGEYR 0.01856 0.03154 ## RIAGENDRFemale -0.53431 -0.12158 ## race_ethNon-Hispanic White -0.79371 -0.22356 ## race_ethNon-Hispanic Black -0.63860 0.14331 ## race_ethNon-Hispanic Other -0.40786 0.44075 ## income$25,000 to < $55,000 -0.41978 0.21903 ## income$55,000+             -0.38281  0.19519
car::Anova(fit.ex5.1, type = 3)
## Anova Table (Type III tests)
##
## Response: LBDGLUSI
##             Sum Sq  Df F value             Pr(>F)
## (Intercept)    142   1   62.30 0.0000000000000091 ***
## BMXWAIST       146   1   64.02 0.0000000000000040 ***
## smoker           7   2    1.44            0.23638
## RIDAGEYR       131   1   57.42 0.0000000000000926 ***
## RIAGENDR        22   1    9.73            0.00188 **
## race_eth        39   3    5.68            0.00075 ***
## income           1   2    0.23            0.79082
## Residuals     1929 846
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Table 5.3 combines all of the output into a single table.

library(gtsummary)
TABLE <- fit.ex5.1 %>%
tbl_regression(intercept = T,
estimate_fun = function(x) style_sigfig(x, digits = 3),
pvalue_fun   = function(x) style_pvalue(x, digits = 3),
label  = list(BMXWAIST ~ "Waist Circumference (cm)",
smoker   ~ "Smoking Status",
RIDAGEYR ~ "Age (years)",
RIAGENDR ~ "Gender",
race_eth ~ "Race/ethnicity",
income   ~ "Annual Income")) %>%
modify_caption("Linear regression results for fasting glucose (mmol/L) vs.
waist circumference (cm) and smoking, adjusted for demographics")
TABLE
Table 5.3: Linear regression results for fasting glucose (mmol/L) vs. waist circumference (cm) and smoking, adjusted for demographics
Characteristic Beta 95% CI1 p-value
(Intercept) 2.97 2.23, 3.71 <0.001
Waist Circumference (cm) 0.025 0.018, 0.031 <0.001
Smoking Status

0.236
Never
Past 0.211 -0.034, 0.457 0.092
Current 0.098 -0.198, 0.393 0.516
Age (years) 0.025 0.019, 0.032 <0.001
Gender

0.002
Male
Female -0.328 -0.534, -0.122 0.002
Race/ethnicity

<0.001
Hispanic
Non-Hispanic White -0.509 -0.794, -0.224 <0.001
Non-Hispanic Black -0.248 -0.639, 0.143 0.214
Non-Hispanic Other 0.016 -0.408, 0.441 0.939
Annual Income

0.791
< $25,000$25,000 to < $55,000 -0.100 -0.420, 0.219 0.538$55,000+ -0.094 -0.383, 0.195 0.524
1 CI = Confidence Interval

How do we interpret these regression coefficients? The regression equation is as follows.

$\begin{array}{rcl} \textrm{FG} & = & \beta_0 \\ & + & \beta_1 \textrm{WC} \\ & + & \beta_2 I(\textrm{Smoker = Past}) + \beta_3 I(\textrm{Smoker = Current}) \\ & + & \beta_4 \textrm{Age} \\ & + & \beta_5 I(\textrm{Gender = Female}) \\ & + & \beta_6 I(\textrm{Race = Non-Hispanic White}) + \beta_7 I(\textrm{Race = Non-Hispanic Black}) \\ & + & \beta_8 I(\textrm{Race = Non-Hispanic Other}) \\ & + & \beta_9 I(\textrm{Income = \25,000 to < \55,000}) + \beta_{10} I(\textrm{Income = \55,000+}) + \epsilon \end{array}$

The intercept is the mean outcome when all continuous predictors are 0 and all categorical predictors are at their reference level.

• In this example, the intercept (2.97 mmol/L) corresponds to the estimated fasting glucose for individuals with a waist circumference of 0 cm who have never smoked, are age = 0 years, male, Hispanic, and have < \$25,000 annual household income. Waist circumference = 0 cm is, of course, not possible; also, the model was fit to data on adults so age = 0 years is far beyond the range of ages analyzed. Had we centered waist circumference and age then the intercept would be more interpretable (see Section 4.3.4). The lack of centering here does not invalidate the model, it only makes it so the intercept does not correspond to a prediction for a plausible combination of predictor values.

For a continuous predictor, the regression coefficient is the estimated difference in the outcome associated with a 1-unit difference in the predictor, when holding all other predictors fixed.

• When holding all other predictors fixed, individuals with 1 cm greater waist circumference are estimated to have, on average, 0.025 mmol/L greater fasting glucose. “When holding all other predictors fixed” means that this statement is the same for any combination of the other predictors; it does not matter at what values you hold the other predictors fixed. Another way to say this is that individuals with the same smoking status, age, gender, race/ethnicity, and household income who differ in waist circumference by 1 cm differ in average fasting glucose by 0.025 mmol/L.
• The estimated difference in the outcome associated with an other than 1-unit difference is obtained by multiplying the coefficient by the units. For example, when holding all other predictors fixed, individuals with 5 cm greater waist circumference are estimated to have, on average, 5 $$\times$$ 0.025 = 0.125 mmol/L greater fasting glucose.

For a categorical predictor with $$L$$ levels, there will be $$L-1$$ regression coefficients, each representing the difference in mean outcome between individuals at that level and individuals at the reference level, when holding all other predictors fixed.

• Comparing individuals who have the same values for all predictors except smoking status, past smokers have an average fasting glucose that is 0.211 mmol/L greater than never smokers.

The summaries below compare the results for each of waist circumference and smoking status before and after adjusting for the other predictors.

• Both before (B = 0.029; 95% CI = 0.023, 0.035; p < .001) and after adjusting for confounding (B = 0.025; 95% CI = 0.018, 0.031; p < .001), we found a significant association between fasting glucose and waist circumference. Additionally, the magnitude of the association did not change much, indicating that there was not much confounding.

• Before adjusting for confounding, we found a significant association between fasting glucose and smoking status (p = .002). However, after adjusting for waist circumference, age, gender, race/ethnicity, and income, there is no significant association between fasting glucose and smoking status (p = .236). Additionally, after adjusting for confounding, the magnitude of the regression coefficients were strongly attenuated (much closer to 0). Thus, for smoking status, failure to adjust for confounding was strongly biasing the results.