5.19 Collinearity

An important step in the process of fitting a MLR is checking for collinearity between predictors. This step was not needed in SLR because there was only one predictor.

Here’s an extreme example which helps illustrate the concept. Suppose you want to know the relationship between fasting glucose and weight, and you fit a regression model that includes both weight in kg and weight in pounds. Clearly, these two predictors are redundant since they only differ in their units. With either one in the regression model, the other adds no new information. That is called perfect collinearity and it is mathematically impossible to include both in the model – you cannot estimate distinct regression coefficients for both predictors.

You also can have approximate collinearity. Suppose instead of weight in kg and weight in pounds, you have weight in kg and body mass index (BMI = weight/height2). While weight and BMI are not exactly redundant, they are highly related. In this case, it is mathematically possible to include both in the model, but their regression coefficients are difficult to estimate accurately (they will have large variances) and will be difficult to interpret. The regression coefficient for weight is the difference in mean outcome associated with a 1-unit difference in weight while holding BMI constant. But in order to vary weight while holding BMI constant you have to vary height, as well. The “effect of greater weight adjusted for BMI” is therefore actually the “effect of greater weight and greater height”, a combination of the effects of weight and height. In general, when two or more predictors are collinear, the interpretation of each while holding the other(s) fixed becomes more complicated.

Two predictors are perfectly collinear if their correlation is -1 or 1 (e.g., weight in kg and weight in pounds). The term “collinear” comes from the fact that one predictor (\(X_1\)) can be written as a linear combination of the other (\(X_2\)) as \(X_1 = a + b X_2\). If you were to plot the pairs of points \((x_{i1}, x_{i2})\) for all the cases, they would fall exactly on a line.

Examples of perfect collinearity include:

  • Two predictors that are exactly the same \((X_1 = X_2)\).
  • One predictor is twice the other \((X_1 = 2 X_2)\).
  • One predictor is 3 less than twice the other \((X_1 = -3 + 2 X_2)\).

If two predictors are perfectly collinear, then they are completely redundant for the purposes of linear regression. You could put either one in the model and the fit would be identical. Unless they are exactly the same (\(X_1 = X_2\)), you will get a different slope estimate using one or the other, but their p-values will be identical. If you put them both in the regression model at the same time, however, R will estimate a regression coefficient for just one of them, as shown in the following code.

# To illustrate...
# Generate random data from a standard normal distribution
y <- rnorm(100)
x <- rnorm(100)
# Create z perfectly collinear with x
z <- -3 + 2*x
lm(y ~ x + z)
## Call:
## lm(formula = y ~ x + z)
## Coefficients:
## (Intercept)            x            z  
##     -0.0776       0.1418           NA

The problem is actually worse, however, for approximate collinearity because if you put both in the model R will estimate a regression coefficient for each, but will have trouble estimating their unique contributions and you could get strange results, such as one large negative and one large positive estimated \(\beta\) and large standard errors for the estimates (implying that even with only a slightly different sample of individuals you may get very different estimates of the \(\beta\)s).

Two predictors that are approximately collinear are essentially measuring the same thing and so should have about the same association with the outcome. But when each is adjusted for the other, the results for each can be quite unstable and quite different from each other. In the example below, the second predictor is identical to the first except for having some random noise added. Compare their unadjusted and adjusted associations with the outcome.

# Generate random data but make the collinearity approximate
y <- rnorm(100)
x <- rnorm(100)
# Create z approximately collinear with x
z <- x + rnorm(100, sd=0.05)
# Each predictor alone
round(summary(lm(y ~ x))$coef, 4)
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -0.0776     0.1020 -0.7601   0.4490
## x             0.1418     0.1087  1.3045   0.1951
round(summary(lm(y ~ z))$coef, 4)
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -0.0783     0.1021  -0.767   0.4449
## z             0.1394     0.1079   1.292   0.1995
# Adjusted for each other
round(summary(lm(y ~ x + z))$coef, 4)
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -0.0742     0.1034 -0.7176   0.4747
## x             0.7256     2.3995  0.3024   0.7630
## z            -0.5799     2.3811 -0.2436   0.8081

Each predictor individually has a regression slope of about 0.14 with a standard error of about 0.11. However, when attempting to adjust for each other, their slopes are now in opposite directions and their standard errors are around 2.4! Again, a large standard error means that if you happened to have drawn a slightly different sample, you might end up with very different regression coefficient estimates in the model with both x and z – the results are not stable. If you want to verify this, re-run the above code multiple times starting after set.seed() (so it is not reset to the same value each time), and compare the results.

Going back to our example of weight and BMI, we can see that in our NHANES fasting subsample dataset these two predictors are highly collinear (Figure 5.42).

Body mass index and weight are highly collinear

Figure 5.42: Body mass index and weight are highly collinear

The points do not fall exactly on a line but they are generally close. Thus, weight and BMI seem to be approximately collinear. This makes sense because BMI is a measure of body mass computed as weight normalized by stature.

All of the above examples were of pairwise collinearity. You can have collinearity between groups of predictors, as well, if some linear combination of one group of predictors is approximately equal to a linear combination of some other predictors. Such situations are harder to detect visually. However, in the following section, we will use a numeric diagnostic tool to help us detect collinearity even among groups of predictors.

5.19.1 Diagnosis of collinearity

The statistic we will use to diagnose collinearity is the variance inflation factor (VIF). For each predictor, the VIF measures how much the variance of the regression coefficient estimate is inflated due to correlation between that predictor and the others. Compute VIFs using the car::vif() function (Fox, Weisberg, and Price 2022; Fox and Weisberg 2019). VIFs when all predictors are continuous

Example 5.6: Use VIFs to evaluate the extent of collinearity in the regression of the outcome systolic blood pressure (sbp) on the predictors age (RIDAGEYR), weight (BMXWT), BMI (BMXBMI), and height (BMXHT) using NHANES 2017-2018 examination data from adults (nhanes1718_adult_exam_sub_rmph.Rdata). First, fit the regression model. Second, call car::vif() with the regression fit as the argument.

# Shorter name
nhanes <- nhanes_adult_exam_sub

fit.ex5.6 <- lm(sbp ~ RIDAGEYR + BMXWT + BMXBMI + BMXHT, data = nhanes)
##    1.014   88.835   69.934   18.764

One of the VIFs is near 1, while the other three are much larger. How do we interpret the magnitude of VIFs? How large is “too large”? Unfortunately, there is no clear-cut answer to that question. We do know that a value of 1 is ideal – that indicates that the variance of that predictor’s regression coefficient estimate is not inflated at all by the presence of other predictors. Values above 2.5 may be of concern (Allison 2012), and values above 5 or 10 are indicative of a more serious problem. In this example height, weight, and BMI clearly have serious collinearity issues. Their variances are being inflated 19- to 89-fold by their collective redundancy. Generalized VIFs when at least one predictor is categorical

Recall that a categorical predictor with \(L\) levels will be entered into a model as \(L-1\) dummy variables (\(L-1\) vectors of 1’s and 0’s). Why \(L-1\)? Because if you included all \(L\) of them the vectors would sum up to a vector of all 1’s (since every observation falls in exactly 1 category) and that would be perfect collinearity. So one is always left out (the reference level).

Example 5.6 examined collinearity among predictors that were all continuous. What happens if there are any categorical predictors? It turns out that the usual VIFs computed on a model with a categorical predictor will differ depending on which level is designated as the reference level. In addition to getting inconsistent results, we are not interested in the VIF for each individual level of a categorical predictor, but rather of the predictor as a single entity. Fortunately, car::vif() automatically gives us a consistent VIF, called the generalized VIF (GVIF) (Fox and Monette 1992), that is the same for each categorical predictor no matter the reference level, as long as the predictor is coded as a factor.

Example 5.7: Evaluate the extent of the collinearity in the regression of systolic blood pressure on the predictors age, BMI, height, and smoking status using our NHANES 2017-2018 examination subsample of data from adults (same dataset as in Example 5.6).

car::vif(lm(sbp ~ RIDAGEYR + BMXBMI + BMXHT + smoker, data = nhanes))
##           GVIF Df GVIF^(1/(2*Df))
## RIDAGEYR 1.052  1           1.026
## BMXBMI   1.006  1           1.003
## BMXHT    1.051  1           1.025
## smoker   1.081  2           1.020

The generalized VIF is found in the GVIF column. The GVIF^(1/(2*Df)) column is the adjusted generalized standard error inflation factor (aGSIF) and is equal to the square-root of GVIF for continuous predictors and categorical predictors with just 2 levels (since, for those, Df = 1). Fox and Monette (1992) recommend using the aGSIF, however, since for categorical predictors with more than 2 levels it adjusts for the number of levels allowing comparability with the other predictors. A consequence is that when using aGSIF, we must take the square-root of our rules of thumb for what is a large value – aGSIF values above \(\sqrt{2.5}\) (1.6) may be of concern, and values above \(\sqrt{5}\) or \(\sqrt{10}\) (2.2 or 3.2) are indicative of a more serious problem. Alternatively, you could square the aGSIF values and compare them to our original rule of thumb cutoffs.

Thus, the aGSIF for the 3-level variable “smoker” is 1.02. In this example, the predictors do not exhibit strong collinearity (all the aGSIFs are near 1).

Run the code below if you would like to verify that the usual VIFs depend on which level is left out as the reference level but the GVIFs and aGSIFs remain unchanged.

# VIFs depend on the reference level
tmp <- nhanes %>%
  mutate(smoker1 = as.numeric(smoker == "Never"),
         smoker2 = as.numeric(smoker == "Past"),
         smoker3 = as.numeric(smoker == "Current"))
fit_ref1 <- lm(sbp ~ RIDAGEYR + BMXBMI + BMXHT + smoker2 + smoker3, data = tmp)
fit_ref2 <- lm(sbp ~ RIDAGEYR + BMXBMI + BMXHT + smoker1 + smoker3, data = tmp)
fit_ref3 <- lm(sbp ~ RIDAGEYR + BMXBMI + BMXHT + smoker1 + smoker2, data = tmp)

# GVIF and aGSIF do not depend on the reference level
car::vif(lm(sbp ~ RIDAGEYR + BMXBMI + BMXHT + smoker, data = nhanes))
tmp <- nhanes %>%
  mutate(smoker = relevel(smoker, ref = "Past"))
car::vif(lm(sbp ~ RIDAGEYR + BMXBMI + BMXHT + smoker, data = tmp))
tmp <- nhanes %>%
  mutate(smoker = relevel(smoker, ref = "Current"))
car::vif(lm(sbp ~ RIDAGEYR + BMXBMI + BMXHT + smoker, data = tmp)) VIFs when there is an interaction or polynomial terms

Collinearity between terms involved in an interaction can be ignored (Allison 2012). This applies also to collinearity between terms that together define a polynomial curve (e.g., \(X\), \(X^2\), etc.). In general, centering continuous predictors will reduce the collinearity and, as we saw in previous sections, centering is often a good idea for the purpose of interpretability. However, the sort of collinearity that is removed by centering actually has no effect on the fit of the model, only on the interpretation of the intercept and the main effects of predictors involved in the interaction. If you have interactions in the model, or multiple terms defining a polynomial, compute the VIFs or aGSIFs using a model without the interactions or higher order polynomial terms. VIF summary

  • If all predictors in a model are continuous and/or binary (2 levels) then car::vif() returns VIFs and our rule of thumb for what is large is that values above 2.5 may be of concern and values above 5 or 10 are indicative of a more serious problem.
  • If any predictors in a model are categorical with more than 2 levels then car::vif() returns both GVIF and aGSIF values. Use the aGSIF values to evaluate collinearity since that allows predictors with different Df (different number of non-reference levels) to be comparable. For aGSIFs, our rule of thumb for what is large is that values above 1.6 may be of concern and values above 2.2 or 3.2 are indicative of a more serious problem.
  • Our rule of thumb regarding what is a large amount of inflation is arbitrary. It is useful as a guide but there is no requirement to apply it strictly.
  • If you have interactions in the model, or multiple terms defining a polynomial, compute the VIFs or aGSIFs using a model without the interactions or higher order polynomial terms.

5.19.2 Impact of collinearity

When there is perfect collinearity, R will drop out predictors until that is no longer the case and so there is no impact on the final regression results. When there is approximate collinearity, however, there may be inflated standard errors and difficulty interpreting coefficients.

Example 5.7 (continued): It is reasonable to guess that the main driver of the collinearity is the strong correlation between weight and BMI. Compare the regression coefficients, their standard errors, and their p-values with and without weight in the model. First, create a complete-case dataset so differences between the models are not due to having different observations.

# Create a complete-case dataset so both models use the same sample
sbpdat <- nhanes %>% 
  select(sbp, RIDAGEYR, BMXWT, BMXBMI, BMXHT) %>% 

# With weight in the model
fit_wt   <- lm(sbp ~ RIDAGEYR + BMXWT + BMXBMI + BMXHT, data = sbpdat)

# After dropping weight due to high collinearity
fit_nowt <- lm(sbp ~ RIDAGEYR         + BMXBMI + BMXHT, data = sbpdat)
Table 5.4: Demonstrating the impact of collinearity by comparing the models with and without weight
With Weight
Without Weight
Intercept 137.34 36.41 < .001 73.52 8.96 < .001
Age 0.4327 0.0292 < .001 1.01 0.4348 0.0292 < .001 1.01
Weight 0.3707 0.2050 .071 88.8
BMI -0.5093 0.5782 .379 69.9 0.5287 0.0692 < .001 1.00
Height -0.3029 0.2169 .163 18.8 0.0787 0.0505 .119 1.01

Table 5.4 shows how the results differ after removing weight from the model. As expected, since its VIF was near 1, the “Age” row does not change much. Note in particular that, up to four decimal places, its standard error (SE) is exactly the same. This is what it means for a VIF to be 1 – that the variance (the square of the SE) is inflated by a factor of 1, in other words not inflated at all, by the presence of the other predictors.

Compare this, however, to the rows for “BMI” and “Height.” After removing weight from the model, their estimates change drastically; both change direction, going from negative to positive. Also, their standard errors decrease dramatically, indicating that they are more precisely estimated. The squares of the ratios of their SEs with and without weight in the model are approximately the same as their VIFs in the model that includes weight.

(summary(fit_wt)$coef["BMXBMI", "Std. Error"] /
    summary(fit_nowt)$coef["BMXBMI", "Std. Error"])^2
## [1] 69.74
(summary(fit_wt)$coef["BMXHT",  "Std. Error"] /
    summary(fit_nowt)$coef["BMXHT",  "Std. Error"])^2
## [1] 18.49

The reason the variances decrease so much is that the definitions of “BMI holding other predictors fixed” and “height holding other predictors fixed” change after removing weight, with the old quantities (effects of BMI and height when holding age and weight constant) being much harder to estimate accurately (and harder to interpret) than the new (effects of BMI and height when holding age constant). Also, whereas “BMI when holding age, weight, and height constant” was not statistically significant (p = .379), “BMI when holding age and height constant” is (p < .001). Finally, after removing weight from the model, the VIFs for the remaining predictors are all near 1.

5.19.3 Potential solutions for collinearity

There are multiple ways to deal with collinearity:

  • Remove one or more predictors from the model, or
  • Combine predictors (e.g., average, sum, difference).

One could also try a biased regression technique such as LASSO or ridge regression. Those methods are beyond the scope of this text but see, for example, the R package glmnet (Friedman, Hastie, and Tibshirani 2010)). Remove predictors to reduce collinearity

Ultimately, you want a set of predictors that are not redundant. One solution is to remove predictors suspected of being problematic, one at a time, and see if the VIFs become smaller.

Before removing any predictors, check the extent of missing data in the dataset. On one hand, when comparing VIFs between models with different predictors, each model should be fit to the same set of observations, which means starting with a complete case dataset. On the other hand, one or more predictors might have a much larger number of missing values than the others. In that case, it might be best to remove the predictors with a lot of missing values first, create a complete case dataset based on the remaining predictors, and then compare VIFs between models with different predictors.

Example 5.7 (continued): Remove one predictor at a time and see how the VIFs change. Use the complete case dataset created above so any differences are due to removing predictors rather than differing sets of observations with non-missing values. We already created a complete case dataset (sbpdat) for this example, so proceed directly to comparing models with different predictors.

# VIFs with all the predictors included
##    1.014   88.835   69.934   18.764
# Drop weight
fit_nowt  <- lm(sbp ~ RIDAGEYR         + BMXBMI + BMXHT, data = sbpdat)
##    1.012    1.000    1.013
# Drop BMI
fit_nobmi <- lm(sbp ~ RIDAGEYR + BMXWT          + BMXHT, data = sbpdat)
##    1.012    1.271    1.284
# Drop height
fit_noht  <- lm(sbp ~ RIDAGEYR + BMXWT + BMXBMI        , data = sbpdat)
##    1.010    4.794    4.786

With all four predictors in the model, BMI, weight, and height each exhibit high collinearity. After removing them one at a time we see that, in fact, weight and BMI are the problem variables (the VIFs are only large when these two both remain in the model). In this example, removing weight from the model best resolved the collinearity issue (the VIFs are smallest when weight was removed). In other examples, you may have to remove more than one predictor to obtain VIFs that are all small. Combine predictors to reduce collinearity

Simply removing predictors is often the best solution. At other times, you really would like to keep all the predictors. In that case, combining them in some way may work. Common methods of combining predictors include taking the average, the sum, or the difference. Also, you could combine predictors into just one predictor, or into multiple predictors.

Example 5.8: Using the Natality dataset (see Appendix A.3), examine the extent of collinearity in the regression of the outcome birthweight (DBWT, g) on the ages of the mother (MAGER) and father (FAGECOMB) and resolve by combining predictors.

fit.ex5.8 <- lm(DBWT ~ MAGER + FAGECOMB, data = natality)
round(summary(fit.ex5.8)$coef, 4)
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3168.469     75.620 41.8999   0.0000
## MAGER          5.769      3.767  1.5317   0.1258
## FAGECOMB      -2.519      3.067 -0.8215   0.4115
##     2.32     2.32

Ignoring statistical significance, this model implies that (1) for a given father’s age, older mothers have children with greater birthweight and (2) for a given mother’s age, older father’s have children with lower birthweight. In this particular dataset, the two predictors are not highly collinear, but are close to our rule of thumb cutoff of 2.5. We will attempt to reduce their collinearity by combining them to form two predictors that are less collinear.

Replacing a pair of correlated predictors \(X_1\) and \(X_2\) with their average \(Z_1 = (X_1 + X_2)/2\) and difference \(Z_2 = X_1 - X_2\) may result in a pair of less correlated predictors. In this example, the transformed predictors would be interpreted as “average parental age” and “parental age difference.”

natality <- natality %>% 
  mutate(avg_parent_age        = (FAGECOMB + MAGER)/2,
         parent_age_difference =  FAGECOMB - MAGER)

fit.ex5.8_2 <- lm(DBWT ~ avg_parent_age + parent_age_difference, data = natality)
round(summary(fit.ex5.8_2)$coef, 4)
##                       Estimate Std. Error t value Pr(>|t|)
## (Intercept)           3168.469     75.620  41.900   0.0000
## avg_parent_age           3.250      2.483   1.309   0.1908
## parent_age_difference   -4.144      3.202  -1.294   0.1958
##        avg_parent_age parent_age_difference 
##                 1.099                 1.099

Ignoring statistical significance, these results imply that (1) for a given age difference, older parents have children with greater birthweight and (2) for a given average parental age, parents with a larger age difference have children with lower birthweight. By combining the predictors, we have reduced collinearity while retaining the ability to draw conclusions based on the combination of ages of both parents.

In other cases, you might combine the collinear predictors into a single predictor by replacing them with their sum or average. For example, if you have a set of items in a questionnaire that all are asking about the same underlying construct, each on a scale from, say, 1 to 5 (e.g., Likert-scale items), you may be able to average them together to produce a single summary variable. If taking this approach, be sure all the items are on the same scale and if they are not then standardize them before summing or averaging. Also, make sure the ordering of responses has the same meaning – sometimes one item is asking about agreement with the underlying construct and another is asking about disagreement, resulting in a low score on one item having the same meaning as a high score on another. If you have items with different response orderings, change some of them until all are in the same direction. For example, you could reverse the ordering of items asking about disagreement (e.g., replace a 1 to 5 scale with scores of 5 to 1) so they match items asking about agreement. While summing or averaging variables may work to reduce collinearity, it is important to also assess whether such a sum is valid using methods such as Cronbach’s alpha and factor analysis (beyond the scope of this text but see, for example, ?psy::cronbach (Falissard 2022) and ?factanal).

In general, take time to think about how your candidate predictors are related to each other and to the outcome. Are some completely or partially redundant (measuring the same underlying concept)? That will show up in the check for collinearity. Remove or combine predictors in a way that aligns with your analysis goals.


———. 2012. “When Can You Safely Ignore Multicollinearity?” https://statisticalhorizons.com/multicollinearity; Statistical Horizons.
Falissard, Bruno. 2022. Psy: Various Procedures Used in Psychometrics. https://CRAN.R-project.org/package=psy.
Fox, John, and Georges Monette. 1992. “Generalized Collinearity Diagnostics.” Journal of the American Statistical Association 87 (417): 178–83. https://doi.org/10.1080/01621459.1992.10475190.
Fox, John, and Sanford Weisberg. 2019. An R Companion to Applied Regression. 3rd ed. Los Angeles: Sage Publications, Inc.
Fox, John, Sanford Weisberg, and Brad Price. 2022. Car: Companion to Applied Regression. https://CRAN.R-project.org/package=car.
Friedman, Jerome H., Trevor Hastie, and Rob Tibshirani. 2010. “Regularization Paths for Generalized Linear Models via Coordinate Descent.” Journal of Statistical Software 33: 1–22. https://doi.org/10.18637/jss.v033.i01.