4.3 SLR model with a continuous predictor
Now you are ready to use R to fit your first regression model! To fit a linear regression, use the lm()
function (“lm” stands for “linear model”). The syntax is lm(y ~ x, data = dat)
where dat
is a data.frame
containing variables with the names y
and x
. If your dataset, outcome, and/or predictor have different names, then use those names instead of dat
, y
, and/or x
, respectively. After fitting the model, use summary()
to view the output.
Example 4.1 (continued): Regress fasting glucose on the continuous predictor waist circumference. Typically, the output of lm()
is assigned to an object, in this case fit.ex4.1
. The name you give the object is arbitrary. Assigning the output to an object allows you to use summary()
to later view formatted output and to use $
notation to extract elements of the output.
##
## Call:
## lm(formula = LBDGLUSI ~ BMXWAIST, data = nhanesf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.012 -0.685 -0.280 0.168 13.184
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.3041 0.2950 11.20 <0.0000000000000002 ***
## BMXWAIST 0.0278 0.0029 9.59 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.52 on 963 degrees of freedom
## (35 observations deleted due to missingness)
## Multiple R-squared: 0.0871, Adjusted R-squared: 0.0862
## F-statistic: 91.9 on 1 and 963 DF, p-value: <0.0000000000000002
4.3.1 Estimated regression coefficients
The Coefficients:
section of the output contains estimates of and hypothesis tests for the intercept \((\beta_0)\), labeled (Intercept)
, and the coefficient for waist circumference (the slope, \(\beta_1\)), labeled BMXWAIST
. The Estimate
column displays the estimated regression coefficients. Thus, the best fit regression line has an intercept of 3.304 and a slope of 0.0278. If you want to extract just the Coefficients
table, use summary(fit.ex4.1)$coef
. Additionally, you can extract any element of the table using bracket []
notation or row and column headers, as shown below.
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.30409 0.295020 11.200 0.000000000000000000000000001873
## BMXWAIST 0.02776 0.002895 9.588 0.000000000000000000007392885480
## [1] 0.02776
## [1] 0.02776
## [1] 0.000000000000000000007393
The Coefficients
table also displays the standard error of each Estimate
, its t value (= Estimate
/Std. Error
), and its p-value (labeled Pr(>|t|)
). The t value tells you how many standard deviations away from 0 the estimate falls. The p-value for the intercept is usually not of interest, but the p-value for the predictor tests the null hypothesis that the outcome has no association with the predictor or, equivalently, that the slope is zero. The null hypothesis is that the best fit line is a horizontal line, indicating that the expected mean outcome is the same at all values of the predictor (that is, there is no association between the outcome and the predictor). Using the expected value notation, if fasting glucose and waist circumference are unrelated, then \(E(\textrm{FG}|\textrm{WC}) = E(\textrm{FG})\), that is, the average fasting glucose does not vary with waist circumference.
4.3.2 Other outputs of summary()
- Residual standard (SE) error: The
Residual standard error
is 1.52. This is an estimate of \(\sigma\), the standard deviation of the model error term \(\epsilon\). To extract this estimate from the output, usesummary(fit.ex4.1)$sigma
. - Residual SE degrees of freedom (df): The residual SE df is 963 and corresponds to the sample size minus the number of regression parameters. In this example, there are 2 regression parameters (the intercept and the slope). In general, the number of regression parameters is the number of rows in the
Coefficients
section. Therefore, the sample size with no missing data is 963 + 2 = 965. To extract the residual SE df from the output, usefit.ex4.1$df.residual
. - Number of observations excluded: A note that 35 observations were deleted due to “missingness” – they had a missing value for at least one of the variables in the regression model. To extract the sample size used in the analysis (the number of observations in the dataset with no missing values on any analysis variable), use
length(fit.ex4.1$residuals)
. - Multiple R2: The
Multiple R-squared
value (.0871), which is a measure of goodness of fit ranging from 0 (no association) to 1 (perfect linear association), and is the square of the Pearson correlation between the outcome and predictor. It is interpreted as the proportion of variation in the outcome explained by the model. To extract R2 from the output, usesummary(fit.ex4.1)$r.squared
. - Adjusted R2: The
Adjusted R-squared
(.0862) is an alternative that penalizes R2 based on how many predictors are in the model. To extract adjusted R2 from the output, usesummary(fit.ex4.1)$adj.r.squared
. - Global F test: The
F-statistic:
section provides a global F test of the null hypothesis that all the coefficients other than the intercept are 0, and this section can typically be ignored. For SLR with a continuous predictor, this provides no new information – the F statistic is the square of the slope’s t value, and the F statistic p-value is identical to the slope’s p-value. When there is more than one predictor, as there will be in Chapter 5, the global F test is usually not of interest. Typically, we are interested in tests of individual predictors, not a comparison of a model with all the predictors to a model with none. To extract the F statistic and corresponding df values, usesummary(fit.ex4.1)$fstatistic
.
Use confint()
to get confidence intervals (CI) for the intercept and slope. It is always a good idea to report a CI along with any estimate.
## 2.5 % 97.5 %
## (Intercept) 2.72514 3.88305
## BMXWAIST 0.02208 0.03344
4.3.3 Writing it up
Below is an example of how to write up these results for a report or publication.
Example 4.1 (continued): Linear regression was used to test the association between fasting glucose (mmol/L) and waist circumference (cm) using data from 965 adults from NHANES 2017-2018. 8.71% of the variation in fasting glucose was explained by waist circumference (R2 = .0871). There was a significant positive association between fasting glucose and waist circumference (B = 0.0278; 95% CI = 0.0221, 0.0334; p < .001). On average, for every 1-cm difference in waist circumference, adults differ in mean fasting glucose by 0.0278 mmol/L. Table 4.1 summarizes the regression results.
In Chapter 3, tbl_summary()
from the gtsummary
library (Sjoberg et al. 2021, 2023) was used to produce a table of descriptive statistics. Similarly, the tbl_regression()
function easily creates nice regression tables (see Tutorial:tbl_regression and Table Gallery for more information).
library(gtsummary)
TABLE2 <- fit.ex4.1 %>%
tbl_regression(intercept = T,
estimate_fun = function(x) style_sigfig(x, digits = 4),
pvalue_fun = function(x) style_pvalue(x, digits = 3),
label = list(BMXWAIST ~ "Waist circumference (cm)")) %>%
modify_caption("Regression of fasting glucose (mmol/L) on waist circumference")
Characteristic | Beta | 95% CI1 | p-value |
---|---|---|---|
(Intercept) | 3.304 | 2.725, 3.883 | <0.001 |
Waist circumference (cm) | 0.0278 | 0.0221, 0.0334 | <0.001 |
1 CI = Confidence Interval |
To save the table to a Microsoft Word document, use the following code.
NOTES:
- This is cross-sectional data, so the write-up was carefully worded to not make any kind of causal claim. We did not write that a “change” or “increase” in waist circumference “causes” or “leads to” greater fasting glucose. We specifically use the words “differ” or “difference” because in cross-sectional data we can only draw conclusions about differences in the mean outcome between individuals with different values of the predictor, not what happens as an individual changes over time.
- As described in Appendix A.1, the NHANES data we are using are a subsample of the full data and intended only for use in teaching regression methods. No conclusions should be drawn from these analyses about the outcomes we are analyzing. Also, the NHANES complex survey design was not accounted for in these analyses; we will discuss how to account for complex survey designs in Chapter 8.
- Even if we had used the full NHANES data and appropriately accounted for the complex survey design, we would still generally not want to draw any substantive conclusions from a simple linear regression as this is observational data. This is an unadjusted analysis and may be biased due to confounding. We will use multiple linear regression later to carry out adjusted analyses in Chapter 5. This is one of the main features of regression that makes it useful in epidemiological research – the ability to control for confounding by adjusting for potential confounders in a multiple regression model.
4.3.4 Centering a continuous predictor
“Centering” refers to replacing a continuous predictor with its value after subtracting off some value, typically the mean or median (although you can center at any number). The primary purpose of centering is to make the intercept interpretable as the mean outcome when the predictor is at its centering value.
To center a continuous predictor \(X\) at its mean, compute \(cX = X - \bar{X}\) where \(\bar{X}\) is the mean of \(X\) and enter \(cX\) into the model rather than \(X\). To center at a different value, replace \(\bar{X}\) with that value (e.g., the median).
Example 4.1 (continued): Refit the regression of fasting glucose on waist circumference after centering waist circumference at its mean.
MEAN <- mean(nhanesf$BMXWAIST, na.rm = T)
nhanesf$cBMXWAIST <- nhanesf$BMXWAIST - MEAN
fit.ex4.1.centered <- lm(LBDGLUSI ~ cBMXWAIST, data = nhanesf)
MEAN
## [1] 100.5
The uncentered output was:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.30409 0.295020 11.200 0
## BMXWAIST 0.02776 0.002895 9.588 0
and the new, centered output is:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.09342 0.049041 124.251 0
## cBMXWAIST 0.02776 0.002895 9.588 0
The only difference between these two outputs is the intercept. Without centering, the intercept was the mean fasting glucose at waist circumference = 0 cm, which was not meaningful since no such humans exist (or could exist!). After centering, the intercept is the mean fasting glucose at waist circumference = 100.5 cm. Centering waist circumference made the intercept interpretable but had no effect on the regression coefficient for waist circumference.