4.4 SLR model with a categorical predictor

Earlier we showed that when you have a categorical predictor linear regression is essentially comparing the mean outcome between levels of that predictor. This is exactly what two-sample t-tests and one-way analysis of variance (ANOVA) do – compare means between independent samples. These methods are equivalent to a linear regression with a single categorical predictor. The advantage of using linear regression, however, is that it is easy to adjust for additional variables by adding more predictors to the model (as we will do in Chapter 5).

Example 4.2 (continued): Regress fasting glucose on the categorical predictor smoking status, which has three levels: Never, Past, and Current (as seen by using the levels() function below). The syntax is the same as for a continuous predictor, but the interpretation of the estimated regression coefficients differs.

levels(nhanesf$smoker)

## [1] "Never"   "Past"    "Current"

fit.ex4.2 <- lm(LBDGLUSI ~ smoker, data = nhanesf)
summary(fit.ex4.2)

## 
## Call:
## lm(formula = LBDGLUSI ~ smoker, data = nhanesf)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.752 -0.722 -0.367  0.168 13.058 
## 
## Coefficients:
##               Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)     5.9423     0.0666   89.23 < 0.0000000000000002 ***
## smokerPast      0.4199     0.1190    3.53              0.00044 ***
## smokerCurrent   0.2551     0.1442    1.77              0.07722 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.6 on 997 degrees of freedom
## Multiple R-squared:  0.0131, Adjusted R-squared:  0.0111 
## F-statistic: 6.62 on 2 and 997 DF,  p-value: 0.00139

confint(fit.ex4.2)

##                 2.5 % 97.5 %
## (Intercept)    5.8116 6.0730
## smokerPast     0.1864 0.6535
## smokerCurrent -0.0279 0.5380

The primary way in which this differs from the output for a continuous predictor is that the Coefficients table now has, in addition to the (Intercept) row, \(L-1\) rows (where \(L\) = the number of levels of the categorical predictor), each one corresponding to the difference in the estimated mean outcome between a level and the reference level. By default, lm() sets the reference level to be the first level of the categorical predictor.

This output states that:

The estimated mean fasting glucose among never smokers is 5.94 mmol/L (95% CI = 5.81, 6.07) (for a categorical predictor the intercept is the mean at the reference level).
Past smokers have significantly greater mean fasting glucose than never smokers (B = 0.420; 95% CI = 0.186, 0.653; p < .001).
Current smokers have greater mean fasting glucose than never smokers (B = 0.255; 95% CI = -0.028, 0.538; p = .077), but this difference is not statistically significant.

4.4.1 Recoding as a factor

In the previous example, smoker was coded as a factor variable. If it were coded instead as a numeric variable with levels, say, 1, 2, and 3, then lm() would treat it, incorrectly, as a continuous variable and assume that the difference in mean outcome between levels 2 and 1 is the same as between levels 3 and 2 (this is what it means to assume a linear relationship). But we do not want to make this assumption for a categorical predictor. To query whether or not a variable is coded as a factor, use is.factor().

is.factor(nhanesf$smoker)

## [1] TRUE

If is.factor() returned FALSE then we would need to convert the variable to a factor before running lm(). Our dataset contains a made-up variable cat_not_factor that has 3 levels (1, 2, and 3) but is not coded as a factor. The following code converts this variable to a factor with labels A, B, and C.

table(nhanesf$cat_not_factor, useNA = "ifany")

## 
##   1   2   3 
## 100 700 200

is.factor(nhanesf$cat_not_factor)

## [1] FALSE

nhanesf <- nhanesf %>% 
  mutate(cat_factor = 
           factor(cat_not_factor,
                  # Enter the existing values
                  levels = c(1,2,3),
                  # Enter new values or omit the labels option
                  # if you want the old values to be the factor level labels
                  labels = c("A", "B", "C")))
# Check
table(nhanesf$cat_not_factor,
      nhanesf$cat_factor, useNA = "ifany")

##    
##       A   B   C
##   1 100   0   0
##   2   0 700   0
##   3   0   0 200

is.factor(nhanesf$cat_factor)

## [1] TRUE

4.4.2 What happens if you fit the model without coding a categorical variable as a factor?

To illustrate the difference between treating the predictor in the previous section as continuous vs. categorical, fit the regression of fasting glucose on each and see how the results differ.

fit.not_factor <- lm(LBDGLUSI ~ cat_not_factor, data = nhanesf)
fit.factor     <- lm(LBDGLUSI ~ cat_factor,     data = nhanesf)

round(summary(fit.not_factor)$coef, 4)

##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)      5.9014     0.2052 28.7656   0.0000
## cat_not_factor   0.0913     0.0946  0.9651   0.3347

round(summary(fit.factor)$coef, 4)

##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   5.9611     0.1612 36.9736   0.0000
## cat_factorB   0.1321     0.1724  0.7662   0.4438
## cat_factorC   0.1985     0.1975  1.0053   0.3150

If (incorrectly) treating the predictor as continuous, we get just one regression coefficient for the predictor (0.0913) which implies that the difference in estimated mean outcome between individuals at levels 2 and 1, or between those at levels 3 and 2, each a 1-unit difference, is 0.0913. It also implies that the difference in estimated mean outcome between those at levels 3 and 1 (a 2-unit difference) is 2 \(\times\) 0.0913 \(=\) 0.1826. Treating a predictor as a continuous variable assumes that the relationship between the outcome and predictor is linear.

Treating the predictor (correctly) as categorical, however, places no restrictions on the differences in mean outcome between levels of the predictor. When treating the predictor as a categorical variable (by coding it as a factor), the model implies that the difference in mean outcome between those at levels B and A is 0.1321, between levels C and B is 0.1985 \(-\) 0.1321 \(=\) 0.0664, and between levels C and A is 0.1985. Comparing this to the previous paragraph, we see that failing to code a categorical variable as a factor leads to overly restrictive assumptions about the differences in mean outcome between levels.

4.4.3 Re-leveling

To compare the mean outcome between two non-reference levels of a categorical predictor, you could simply subtract their two regression coefficients (as was done in the previous paragraph when comparing levels C and B). However this only provides an estimate, not a CI or p-value. A method which yields all three is to re-level the predictor to change the reference level to one of those levels, followed by refitting the model.

Example 4.2 (continued): To compare current and past smokers, set one of them to be the reference level. For example, setting “Past” as the reference level moves it to be first in the ordering of the levels, leading lm() to treat it as the reference level when fitting the model.

nhanesf.releveled <- nhanesf %>% 
  mutate(smoker = relevel(smoker, ref = "Past"))

# Check before (rows) vs. after (columns)
table(nhanesf$smoker, nhanesf.releveled$smoker, useNA = "ifany")

##          
##           Past Never Current
##   Never      0   579       0
##   Past     264     0       0
##   Current    0     0     157

# Refit the model using the modified dataset
fit.ex4.2.releveled <- lm(LBDGLUSI ~ smoker, data = nhanesf.releveled)

# Use cbind() to combine the coef and confint output columns
cbind(round(summary(fit.ex4.2.releveled)$coef, 4),
      round(confint(fit.ex4.2.releveled),      4))

##               Estimate Std. Error t value Pr(>|t|)   2.5 %  97.5 %
## (Intercept)     6.3623     0.0986  64.510   0.0000  6.1687  6.5558
## smokerNever    -0.4199     0.1190  -3.529   0.0004 -0.6535 -0.1864
## smokerCurrent  -0.1649     0.1615  -1.021   0.3075 -0.4818  0.1520

Re-leveling does not change the overall fit of the model, as indicated by the residual standard error, R², and F statistic (compare summary(fit.ex4.2) and summary(fit.ex4.2.releveled) to verify this). However, it does change the interpretation of the coefficients. The (Intercept) is now the mean fasting glucose among past smokers (the new reference level) and the other coefficients correspond to differences in mean outcome between each level and past smokers. Thus, we conclude that current smokers were observed to have lower fasting glucose than past smokers, but this difference was not statistically significant (B = -0.165; 95% CI = -0.482, 0.152; p = .308).

4.4.4 Multiple DF test for a categorical predictor

P-values for categorical predictor non-reference levels give us tests of significance for specific pairwise comparisons between levels. But what if you want a single overall test of significance for the categorical predictor? The null hypothesis of no association corresponds to the mean outcome being equal at all levels of the predictor or, equivalently, that all differences in mean outcome between each level and the reference level are 0. For example, for “smoking status”, which has 3 levels, the null hypothesis is that both \(\beta_1 = 0\) and \(\beta_2 = 0\) simultaneously. \(\beta_1 = 0\) corresponds to Past and Never smokers having equal means, \(\beta_2 = 0\) corresponds to Current and Never smokers having equal means, and if both are true then Current, Past, and Never smokers have equal means implying no association between smoking status and the outcome. Since these are two tests, this would be a “2 df” (two degree of freedom) test. In general, for a categorical predictor with \(L\) levels, the corresponding test will be a \(L-1\) df test.

Obtain the p-value corresponding to this multiple degree of freedom (df) test by calling the Anova() (upper case A) function in the car library (Fox, Weisberg, and Price 2023; Fox and Weisberg 2019). The type = 3 option specifies that we want a Type III test, which is a test of a predictor after adjusting for all other terms in the model (there are no other terms in SLR, but there will be when we get to multiple linear regression). A Type III test for a predictor \(X\) compares the regression model to a reduced model in which that predictor has been removed.

Example 4.2 (continued): Is the predictor smoking status significantly associated with the outcome fasting glucose? We already fit the regression model and assigned it to the object fit.ex4.2 so enter that object into the car::Anova() function.

car::Anova(fit.ex4.2, type = 3)

## Anova Table (Type III tests)
## 
## Response: LBDGLUSI
##             Sum Sq  Df F value              Pr(>F)    
## (Intercept)  20445   1 7961.81 <0.0000000000000002 ***
## smoker          34   2    6.62              0.0014 ** 
## Residuals     2560 997                                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The Type III test states that the 2 degree of freedom test of smoker is statistically significant (p = .0014). In other words, we reject the null hypothesis that mean fasting glucose is the same for never, past, and current smokers. If reporting this test formally, we would write “F[2, 997] = 6.62, p = .001” where the 2 and 997 are the smoker and Residuals degrees of freedom, respectively.

In the case of a single categorical predictor, the Type III test is equivalent to the global F test shown in the lm() output above since both tests compare a model with one predictor to a model without any predictors. This will not be the case in multiple linear regression. car::Anova() is used for tests of individual predictors adjusted for all others, whereas the global F-test in the lm() output is used to test all predictors simultaneously by comparing the regression model to a model with just an intercept.

4.4.5 Special case: binary predictor

If a categorical predictor has only two levels (binary, dichotomous), it will only have one indicator function in the regression and there is no need for a Type III test – simply read the p-value from the Coefficients table. The comparison of mean outcome between the two levels is the same as the overall test of a binary predictor. For example, suppose we regress fasting glucose on gender (for which NHANES only reports responses of male and female) (RIAGENDR).

fit.binary <- lm(LBDGLUSI ~ RIAGENDR, data = nhanesf)
summary(fit.binary)$coef

##                Estimate Std. Error t value  Pr(>|t|)
## (Intercept)      6.2973    0.07491  84.067 0.0000000
## RIAGENDRFemale  -0.3758    0.10165  -3.697 0.0002304

car::Anova(fit.binary, type = 3)

## Anova Table (Type III tests)
## 
## Response: LBDGLUSI
##             Sum Sq  Df F value               Pr(>F)    
## (Intercept)  18123   1  7067.3 < 0.0000000000000002 ***
## RIAGENDR        35   1    13.7              0.00023 ***
## Residuals     2559 998                                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The RIAGENDRFemale p-value in the regression coefficient table is identical to the RIAGENDR Type III test p-value in the Anova Table.

4.4.6 Writing it up

Below is an example of how to write up these results for a report or publication. Notice how this differs from when the predictor was continuous – for a categorical predictor we report the overall association plus the pairwise differences.

Example 4.2 (continued): We used linear regression to test the association between fasting glucose (mmol/L) and smoking status (Never, Past, Current) using data from 1000 adults from NHANES 2017-2018. 1.31% of the variation in fasting glucose was explained by smoking status (R² = .0131). The association between fasting glucose and smoking status was statistically significant (F[2, 997] = 6.62, p = .001). Mean fasting glucose was significantly greater among past smokers than among never smokers (B = 0.420; 95% CI = 0.186, 0.653; p < .001), but not significantly different between current and never smokers (B = 0.255; 95% CI = -0.028, 0.538; p = .077) or between current and past smokers (B = -0.165; 95% CI = -0.482, 0.152; p = .308).

Table 4.2 summarizes the regression results.

library(gtsummary)
TABLE2 <- fit.ex4.2 %>%
  tbl_regression(intercept = T,
                 estimate_fun = function(x) style_sigfig(x, digits = 3),
                 pvalue_fun   = function(x) style_pvalue(x, digits = 3),
                 label        = list(smoker ~ "Smoking Status")) %>%
  # add_global_p() adds the multiple df Type III p-values from car::Anova
  # keep = T keeps the individual p-values comparing levels, as well
  add_global_p(keep = T) %>% 
  modify_caption("Regression of fasting glucose (mmol/L) on smoking status")

TABLE2

Table 4.2: Regression of fasting glucose (mmol/L) on smoking status
Characteristic	Beta	95% CI¹	p-value
(Intercept)	5.94	5.81, 6.07	<0.001
Smoking Status			0.001
Never	—	—
Past	0.420	0.186, 0.653	<0.001
Current	0.255	-0.028, 0.538	0.077
¹ CI = Confidence Interval

The p-value in the “Smoking Status” row corresponds to the 2 df Type III test of smoking status obtained using car::Anova().

References

Fox, John, and Sanford Weisberg. 2019. An R Companion to Applied Regression. 3rd ed. Los Angeles: Sage Publications, Inc.

Fox, John, Sanford Weisberg, and Brad Price. 2023. Car: Companion to Applied Regression. https://r-forge.r-project.org/projects/car/.