5.2 Notation and interpretation

The data consist of \(n\) independent sets of observed values of the outcome \(Y\) and predictors \(X_1, X_2, \ldots, X_K\). The data for the \(i^{th}\) case (or individual, or observation) are denoted \((y_i, x_{i1}, x_{i2}, \ldots, x_{iK})\); that is, each case has an outcome value and a set of predictor values. Equation (5.1) describes the multiple linear regression model.

\[\begin{equation} Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_K X_K + \epsilon \tag{5.1} \end{equation}\]

\(\beta_0\) is the intercept, the other \(\beta\) terms are the effects of the predictors, and \(\epsilon\) is the error term, or residual, which is assumed to have a normal distribution with a mean of zero and the same variance at all predictor values. We denote this assumption using the notation \(\epsilon \sim N(0, \sigma^2)\), which is read “epsilon has a normal distribution with a mean of zero and a variance of sigma squared”. The fact that \(\sigma^2\) does not have a subscript \(i\) means that all individuals have the same error variance: individuals’ observed values vary about the values predicted by the regression, and how much they vary is assumed not to depend on the characteristics of any individual. The model described by Equation (5.1) assumes that the relationship between \(Y\) and each \(X_k\) \((k = 1, \ldots, K)\) is linear when all the other predictors are held constant, and that the error term captures random variation between individuals and outcome measurement error.
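The constant-variance assumption can be illustrated with a small simulation. The following is a minimal sketch in Python, not part of the original analysis: the coefficient values, \(\sigma\), the predictor distributions, and the sample size are all hypothetical choices made for illustration. It draws cases from Equation (5.1) with \(K = 2\) and checks that the spread of the errors is about the same at low and high values of \(X_2\).

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical values for a model with K = 2 predictors.
beta0, beta1, beta2 = 2.0, 0.5, -1.0
sigma = 1.5  # error standard deviation, so sigma^2 = 2.25


def simulate_case():
    """Draw one case (x1, x2, y, eps) from Equation (5.1) with K = 2."""
    x1 = random.gauss(0.0, 1.0)
    x2 = random.uniform(0.0, 10.0)
    eps = random.gauss(0.0, sigma)  # same N(0, sigma^2) for every case
    y = beta0 + beta1 * x1 + beta2 * x2 + eps
    return x1, x2, y, eps


cases = [simulate_case() for _ in range(20000)]

# Split the errors by the value of X_2 to check that their spread
# does not depend on the predictor.
eps_all = [eps for (_, _, _, eps) in cases]
eps_low = [eps for (_, x2, _, eps) in cases if x2 < 5.0]
eps_high = [eps for (_, x2, _, eps) in cases if x2 >= 5.0]

mean_eps = statistics.fmean(eps_all)
var_low = statistics.pvariance(eps_low)
var_high = statistics.pvariance(eps_high)

print(f"mean error: {mean_eps:.3f}")
print(f"error variance, x2 < 5:  {var_low:.3f}")
print(f"error variance, x2 >= 5: {var_high:.3f}")
```

Both variances should be close to \(\sigma^2 = 2.25\) and the mean error close to zero, which is what \(\epsilon \sim N(0, \sigma^2)\) without a subscript \(i\) implies.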

For a continuous predictor \(X_k\), the corresponding regression coefficient \(\beta_k\) is a slope, interpreted as the difference in the mean \(Y\) associated with a one-unit difference in \(X_k\) when holding all other predictors fixed (or “controlling for” or “adjusted for” the other predictors).
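To see why \(\beta_k\) has this interpretation, compare the mean outcome at two values of \(X_k\) that differ by one unit, with every other predictor held at the same value. The notation \(E(Y \mid \cdot)\) for the mean of \(Y\) at given predictor values is introduced here for illustration, and the derivation uses the assumption that the error term has mean zero.

\[\begin{array}{rcl} E(Y \mid X_k = x + 1) - E(Y \mid X_k = x) & = & \left[\beta_0 + \ldots + \beta_k (x + 1) + \ldots + \beta_K x_K\right] \\ & - & \left[\beta_0 + \ldots + \beta_k x + \ldots + \beta_K x_K\right] \\ & = & \beta_k \end{array}\]

All terms other than the one involving \(X_k\) are identical in the two brackets and cancel, which is exactly what “holding all other predictors fixed” means.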

As discussed in Chapter 4, if \(X_k\) is a categorical predictor with \(L\) levels, then instead of just one corresponding \(X\) term in the model, there are \(L-1\) indicator variables. Each of the \(L-1\) corresponding regression coefficients is interpreted as the difference between the mean \(Y\) at that level of \(X_k\) and the mean \(Y\) at the reference level, when holding all other predictors fixed (or “controlling for” or “adjusted for” the other predictors).
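The indicator coding can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `indicator_columns` and the four example values are hypothetical, and statistical software normally creates these variables automatically.

```python
def indicator_columns(values, levels):
    """Return L - 1 indicator columns for a categorical predictor.

    `levels` lists the L levels in order; the first one is treated as
    the reference level and gets no indicator column.
    """
    non_reference = levels[1:]
    return {
        level: [1 if value == level else 0 for value in values]
        for level in non_reference
    }


# Hypothetical smoking status values for four individuals.
smoker = ["Never", "Past", "Current", "Never"]
cols = indicator_columns(smoker, ["Never", "Past", "Current"])

for level, column in cols.items():
    print(f"I(Smoker = {level}): {column}")
```

A predictor with \(L = 3\) levels yields \(L - 1 = 2\) columns, and an individual at the reference level (here, Never) has a 0 in every column.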

NOTE: The interpretations of the regression coefficients in MLR are the same as in SLR, except for the addition of “when holding all other predictors fixed.”

Example 5.1: In the previous chapter, we estimated the relationship between fasting glucose (FG; mmol/L) (LBDGLUSI) and each of waist circumference (WC; BMXWAIST) and smoking status (smoker; Never, Past, Current) among a subset of adult participants in NHANES 2017-2018. We found that each was significantly associated with fasting glucose. However, we did not adjust for potential confounders of these relationships. In this chapter, we will use MLR to adjust each for confounding due to the other, as well as due to age (RIDAGEYR), gender (RIAGENDR; Male, Female), race/ethnicity (RIDRETH3; Mexican American, Other Hispanic, Non-Hispanic White, Non-Hispanic Black, Non-Hispanic Asian, Other/Multi), and income (< $25,000, $25,000 to < $55,000, $55,000+).

Based on Equation (5.1), the MLR model for Example 5.1 is written as follows.

\[\begin{array}{rcl} \textrm{FG} & = & \beta_0 \\ & + & \beta_1 \textrm{WC} \\ & + & \beta_2 I(\textrm{Smoker = Past}) + \beta_3 I(\textrm{Smoker = Current}) \\ & + & \beta_4 \textrm{Age} \\ & + & \beta_5 I(\textrm{Gender = Female}) \\ & + & \beta_6 I(\textrm{Race = Other Hispanic}) + \beta_7 I(\textrm{Race = Non-Hispanic White}) \\ & + & \beta_8 I(\textrm{Race = Non-Hispanic Black}) + \beta_9 I(\textrm{Race = Non-Hispanic Asian}) \\ & + & \beta_{10} I(\textrm{Race = Other/Multi}) \\ & + & \beta_{11} I(\textrm{Income = \$25,000 to < \$55,000}) + \beta_{12} I(\textrm{Income = \$55,000+}) + \epsilon \end{array}\]

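For each categorical predictor, the first level listed (Never, Male, Mexican American, and < $25,000, respectively) has no indicator variable in the model because it is the reference level.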
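As a check on the bookkeeping, the terms of the Example 5.1 model can be enumerated programmatically. The sketch below is illustrative only: the helper names `term_names` and `design_row` and the example individual are hypothetical. It lists one term per continuous predictor and \(L - 1\) indicator terms per categorical predictor, then builds the row of predictor values that multiplies \((\beta_0, \beta_1, \ldots, \beta_{12})\) for a single case.

```python
# Predictors in Example 5.1. Continuous predictors map to None;
# categorical predictors map to their levels, reference level first.
predictors = {
    "WC": None,
    "Smoker": ["Never", "Past", "Current"],
    "Age": None,
    "Gender": ["Male", "Female"],
    "Race": [
        "Mexican American", "Other Hispanic", "Non-Hispanic White",
        "Non-Hispanic Black", "Non-Hispanic Asian", "Other/Multi",
    ],
    "Income": ["< $25,000", "$25,000 to < $55,000", "$55,000+"],
}


def term_names(predictors):
    """List the model terms in the order of beta_0, beta_1, ..."""
    names = ["(Intercept)"]
    for name, levels in predictors.items():
        if levels is None:
            names.append(name)
        else:
            # Skip the first level: it is the reference level.
            names.extend(f"I({name} = {level})" for level in levels[1:])
    return names


def design_row(case, predictors):
    """Build the row of values multiplying the betas for one case."""
    row = [1.0]  # multiplies the intercept beta_0
    for name, levels in predictors.items():
        if levels is None:
            row.append(float(case[name]))
        else:
            row.extend(1.0 if case[name] == level else 0.0
                       for level in levels[1:])
    return row


terms = term_names(predictors)

# A hypothetical individual, for illustration only.
case = {"WC": 95.0, "Smoker": "Past", "Age": 50, "Gender": "Female",
        "Race": "Non-Hispanic White", "Income": "$55,000+"}
row = design_row(case, predictors)

for k, (term, value) in enumerate(zip(terms, row)):
    print(f"beta_{k}: {term} = {value}")
```

The list has 13 terms, matching \(\beta_0\) through \(\beta_{12}\) in the model above: 1 intercept, 2 continuous predictors, and \(2 + 1 + 5 + 2 = 10\) indicator variables.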