6.10 Separation

6.10.1 Quasi-complete separation

Consider the following two-way table describing how the presence of a condition is related to a risk factor with two levels.

	Condition negative	Condition positive
Risk factor level 1	100	0
Risk factor level 2	50	25

Using the cross-product method (Section 6.5), the OR comparing the odds of “positive” between risk factor levels 2 and 1 is \((25 \times 100) / (0 \times 50) = 2500/0 = \infty\). There are no positive individuals in risk factor level 1, so the observed odds of being positive at that level are 0. When compared to 0, any number of positives in the other risk factor level leads to an infinite OR. In general, if there is a 0 in the table there will be a 0 in the numerator (leading to an OR of zero) or in the denominator (leading to an infinite OR).

If, within one or more levels of a predictor, you can perfectly predict the outcome in the sample, then you cannot compute a confidence interval for the usual estimate of the OR, nor can you test its significance. Prediction may be perfect in the sample, but the usual methods cannot estimate the accuracy of an estimated OR of 0 or \(\infty\). The situation where you can perfectly predict the outcome within some but not all levels of the risk factor is called quasi-complete separation and, depending on the software, when using logistic regression will lead to an error, warning, and/or output that is nonsense. Even without an error or warning, you can diagnose the problem by examining the output.

mydat <- data.frame(condition  = c(rep(1, 100), rep(1, 50), rep(2, 25)),
                    riskfactor = c(rep(1, 100), rep(2, 50), rep(2, 25))) %>% 
  mutate(condition = factor(condition,
                            levels = 1:2,
                            labels = c("Negative", "Positive")),
         riskfactor = factor(riskfactor,
                             levels = 1:2,
                             labels = c("1", "2")))
table(mydat$riskfactor, mydat$condition)

##    
##     Negative Positive
##   1      100        0
##   2       50       25

fit <- glm(condition ~ riskfactor, family = binomial, data = mydat)

# Very large estimates and SEs
summary(fit)$coef

##             Estimate Std. Error  z value Pr(>|z|)
## (Intercept)   -20.57       1773 -0.01160   0.9907
## riskfactor2    19.87       1773  0.01121   0.9911

# OR approaching infinite
exp(coef(fit)["riskfactor2"])

## riskfactor2 
##   427267715

The sample OR is actually \(\infty\), but R attempts to estimate it anyway and the results are nonsense – when you see very large standard errors, large positive or negative estimates, or ORs approaching \(\infty\) or 0 then you likely have a problem with separation.

6.10.2 Complete separation

Consider the following two-way table:

	Condition negative	Condition positive
Risk factor level 1	100	0
Risk factor level 2	0	25

In the previous example (quasi-complete separation), we could perfectly predict the outcome at some but not all levels of the risk factor. In this example, we can perfectly predict the outcome at every level of the risk factor. This may seem like a good thing, but numerically it causes the same problem as quasi-complete separation: logistic regression fails when the sample OR is 0 or \(\infty\). This is called complete separation of the data (also known as perfect prediction). When there is complete separation, R gives a lack-of-convergence warning (with quasi-complete separation sometimes there is no warning).

mydat <- data.frame(condition  = c(rep(1, 100), rep(2, 25)),
                    riskfactor = c(rep(1, 100), rep(2, 25))) %>% 
  mutate(condition = factor(condition,
                            levels = 1:2,
                            labels = c("Negative", "Positive")),
         riskfactor = factor(riskfactor,
                             levels = 1:2,
                             labels = c("1", "2")))
table(mydat$riskfactor, mydat$condition)

##    
##     Negative Positive
##   1      100        0
##   2        0       25

# Lack of convergence warning
fit <- glm(condition ~ riskfactor, family = binomial, data = mydat)

## Warning: glm.fit: algorithm did not converge

# Very large estimates and SEs
summary(fit)$coef

##             Estimate Std. Error    z value Pr(>|z|)
## (Intercept)   -26.57      35612 -0.0007460   0.9994
## riskfactor2    53.13      79632  0.0006672   0.9995

# Undefined OR
exp(coef(fit)["predictor2"])

## <NA> 
##   NA

6.10.3 Diagnosing separation

Separation is more likely to occur with an outcome that is rare or which almost always occurs (sample proportion near 0 or 1). However, it can also occur with a less rare outcome if you have a categorical predictor with at least one level with a small sample size.

Before running a logistic regression, always check for separation. For each categorical predictor, create a two-way table of the predictor vs. the outcome. Make sure to use a complete-case dataset so that the sample used in each table is the same sample that would be used in a regression that includes all of these variables. If there is an interaction between categorical predictors in the model, create a three-way table for the predictors involved in the interaction (predictor1 × predictor2 × outcome). Look for zeros in the cells of the tables.

If you forget to check for separation and instead run the regression first, hopefully you will notice any separation problem that exists by noticing standard errors that are unusually large or ORs that are very large or near zero.

Example 6.4: Heroin use is more rare than marijuana use. Using our NSDUH dataset, check for separation in the regression of lifetime heroin use (her_lifetime) on age at first use of alcohol (alc_agefirst), age (demog_age_cat6), and sex (demog_sex).

What happens if we fit the logistic regression model without examining the data first?

# Complete-case dataset
nsduh_complete <- nsduh %>% 
  select(her_lifetime, alc_agefirst, demog_age_cat6, demog_sex) %>% 
  drop_na()

fit.ex6.4 <- glm(her_lifetime ~ alc_agefirst + demog_age_cat6 +
  demog_sex, family = binomial, data = nsduh_complete)
# Regression coefficients
round(summary(fit.ex6.4)$coef, 4)

##                     Estimate Std. Error z value Pr(>|z|)
## (Intercept)         -16.2988  1024.2091 -0.0159   0.9873
## alc_agefirst         -0.2438     0.0638 -3.8215   0.0001
## demog_age_cat626-34  15.5392  1024.2087  0.0152   0.9879
## demog_age_cat635-49  15.4474  1024.2086  0.0151   0.9880
## demog_age_cat650-64  15.6021  1024.2086  0.0152   0.9878
## demog_age_cat665+    15.3629  1024.2087  0.0150   0.9880
## demog_sexMale         1.2376     0.6526  1.8965   0.0579

# AORs and CIs
OR.CI <- cbind("AOR" = exp(coef(fit.ex6.4)),
                       exp(confint(fit.ex6.4)))[-1,]
round(OR.CI, 3)

##                             AOR 2.5 % 97.5 %
## alc_agefirst              0.784 0.689  0.886
## demog_age_cat626-34 5605018.279 0.000     NA
## demog_age_cat635-49 5113370.493 0.000     NA
## demog_age_cat650-64 5969190.982 0.000     NA
## demog_age_cat665+   4699268.444 0.000     NA
## demog_sexMale             3.447 1.079 15.290

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

The warning indicates there is an issue with separation, as do the extremely large estimates and standard errors and the ORs approaching \(\infty\). It looks like there is a separation problem with demog_age_cat. Let’s examine each categorical predictor vs. the outcome to look for separation.

table(nsduh_complete$demog_age_cat6, nsduh_complete$her_lifetime)

##        
##          No Yes
##   18-25 102   0
##   26-34 121   3
##   35-49 225   5
##   50-64 198   5
##   65+   180   4

table(nsduh_complete$demog_sex, nsduh_complete$her_lifetime)

##         
##           No Yes
##   Female 431   3
##   Male   395  14

There is a zero in one of the cells for age; in this dataset, no one in the 18-25 year old group reported having ever used heroin. This is causing quasi-complete separation – we can perfectly predict the outcome among 18-25 year olds.

6.10.4 Resolving separation

For each categorical predictor that has a zero in any cell of it’s two-way table comparing it to the outcome, do one of the following.

Filter: Remove predictor levels which have zero observations at either level of the outcome (this only works if the predictor has more than two levels).
Collapse: Collapse levels together until all cells have some observations (this only works if the predictor has more than two levels).
Remove: Remove the predictor from the model.

Collapsing is often the best solution when it is possible, but sometimes removing the predictor is the only option available among these three. Although zeros in cells are definitely a problem, cells with only a few observations can be problematic, as well (this would be approximate separation). Such cases may not cause as serious a problem, but they may result in unstable estimates. If in doubt, compare the regression coefficients and standard errors from the original fit with a fit that deals with such cases in one of the ways mentioned above to see if approximate separation is impacting your conclusions.

NOTE: Separation is one example of the more general sparse data bias problem which can lead to inflated odds ratios, and the problem becomes more severe after adjusting for confounding (Sander Greenland, Mansournia, and Altman 2016; Mansournia et al. 2018). Penalized logistic regression (Heinze and Schemper 2002; Sander Greenland, Mansournia, and Altman 2016), not covered in this text, addresses this more general problem and can be implemented with the logistf package (Heinze et al. 2023b), for the Firth adjustment, or with the more general plogit() function (Sander Greenland, Mansournia, and Altman 2016, web appendix).

After resolving separation and re-fitting the model, look at the regression coefficients, standard errors, and ORs. If any still seem unusually large, you may have missed a problem with separation. In that case, go back and re-examine the data to see if there is a predictor that needs to be dealt with. If your outcome is very rare, then you may simply not have enough data to obtain reliable estimates for any predictors.

Example 6.4 (continued): Resolve the separation issue we just found in the regression of lifetime heroin use (her_lifetime) on age at first use of alcohol, age, and sex.

We could try any of the following:

Filter the data to remove individuals age 18-25 years.
Collapse individuals age 18-25 and 26-34 years into one group.
Remove demog_age_cat6 from the model.

Which choice you make depends on your goals. What age range do you want to be able to draw conclusions about? How important is age as a potential confounder? All three options are described below, along with their relative merits.

6.10.4.1 Filter

Filtering the data to remove predictor levels for which cells have zero observations may not be a good option if this results in a large drop in sample size or removes a sub-population you are interested in. If you filter out the level that has a cell with zero observations, you remove individuals in another cell. For example if there are no “Yes” outcomes at \(X = x\) and you filter out everyone with \(X = x\) then you end up removing individuals with \(X = x\) and a “No” outcome.

The following points address additional issues with filtering.

If filtering would result in just one level for a predictor then, instead, remove it from the model (see Section 6.10.4.3).
Since filtering reduces the sample size, it may lead to a problem with separation for a different predictor that previously did not have a problem. If you filter, make sure to re-check the other predictors for separation.
Filtering changes the scope of the analysis. All individuals at the filtered out levels are excluded, both those with a “Yes” outcome and those with a “No” outcome, even if only one level of the outcome has zero observations. Therefore, results do not generalize to the population at the filtered out levels.

Example 6.4 (continued): Resolve separation by filtering the data to remove individuals age 18-25 years.

# Filter
nsduh_complete_filtered <- nsduh_complete %>% 
  filter(demog_age_cat6 != "18-25")

# Recheck separation
table(nsduh_complete_filtered$demog_age_cat6,
      nsduh_complete_filtered$her_lifetime)

##        
##          No Yes
##   18-25   0   0
##   26-34 121   3
##   35-49 225   5
##   50-64 198   5
##   65+   180   4

Although there are zeros in this table, they all occur in the 18-25 year old group which was filtered out. When a row in the table has all zeros (i.e., there are no individuals in that level in the dataset) that level will not cause a problem in the model fit.

Filtering reduced the sample size. This could have led to a separation problem for demog_sex, but fortunately it did not. In hindsight, this is obvious for this example since we filtered out a group of individuals with no lifetime heroin use so the cells in the sex vs. heroin use table with small numbers of cases were unaffected.

table(nsduh_complete_filtered$demog_sex,
      nsduh_complete_filtered$her_lifetime)

##         
##           No Yes
##   Female 372   3
##   Male   352  14

Finally, refit the model using the filtered dataset.

fit.ex6.4.filter <- glm(her_lifetime ~ alc_agefirst + demog_age_cat6 + demog_sex,
  family = binomial, data = nsduh_complete_filtered)

round(summary(fit.ex6.4.filter)$coef, 4)

##                     Estimate Std. Error z value Pr(>|z|)
## (Intercept)          -0.7597     1.2292 -0.6180   0.5366
## alc_agefirst         -0.2438     0.0638 -3.8218   0.0001
## demog_age_cat635-49  -0.0918     0.7571 -0.1213   0.9035
## demog_age_cat650-64   0.0629     0.7597  0.0829   0.9340
## demog_age_cat665+    -0.1763     0.8232 -0.2141   0.8305
## demog_sexMale         1.2376     0.6524  1.8969   0.0578

OR.CI <- cbind("AOR" = exp(coef(fit.ex6.4.filter)),
                       exp(confint(fit.ex6.4.filter)))[-1,]
round(OR.CI, 3)

##                       AOR 2.5 % 97.5 %
## alc_agefirst        0.784 0.689  0.886
## demog_age_cat635-49 0.912 0.213  4.651
## demog_age_cat650-64 1.065 0.248  5.461
## demog_age_cat665+   0.838 0.161  4.639
## demog_sexMale       3.447 1.079 15.290

Conclusion: All the estimates, standard errors, AORs, and 95% CIs look reasonable. Filtering took care of the separation problem, but at the cost of reducing the sample size and limiting the inference to individuals age 26 years and older.

6.10.4.2 Collapse

Another option is to collapse levels together until all cells have some observations. This only works if the predictor has more than two levels. The downside to collapsing levels of a predictor together is that we must assume the odds of the outcome are the same among those in each of the collapsed levels.

Example 6.4 (continued): Resolve separation by collapsing individuals age 18-25 and 26-34 years into one group.

# Collapse
nsduh_collapsed <- nsduh_complete %>%
  mutate(demog_age_cat_new = fct_collapse(demog_age_cat6,
                                          "18-34" = c("18-25", "26-34")))

# Recheck separation (use the new predictor name)
table(nsduh_collapsed$demog_age_cat_new,
      nsduh_collapsed$her_lifetime)

##        
##          No Yes
##   18-34 223   3
##   35-49 225   5
##   50-64 198   5
##   65+   180   4

There are now no zeros in the age table.

Finally, refit the model using the collapsed dataset.

fit.ex6.4.collapse <- glm(her_lifetime ~ alc_agefirst +
  demog_age_cat_new + demog_sex,
  family = binomial, data = nsduh_collapsed)

round(summary(fit.ex6.4.collapse)$coef, 4)

##                        Estimate Std. Error z value Pr(>|z|)
## (Intercept)             -1.1906     1.2444 -0.9567   0.3387
## alc_agefirst            -0.2520     0.0650 -3.8779   0.0001
## demog_age_cat_new35-49   0.4231     0.7484  0.5653   0.5718
## demog_age_cat_new50-64   0.5822     0.7498  0.7764   0.4375
## demog_age_cat_new65+     0.3161     0.8259  0.3828   0.7019
## demog_sexMale            1.2761     0.6519  1.9576   0.0503

OR.CI <- cbind("AOR" = exp(coef(fit.ex6.4.collapse)),
                       exp(confint(fit.ex6.4.collapse)))[-1,]
round(OR.CI, 3)

##                          AOR 2.5 % 97.5 %
## alc_agefirst           0.777 0.682  0.881
## demog_age_cat_new35-49 1.527 0.362  7.659
## demog_age_cat_new50-64 1.790 0.423  9.010
## demog_age_cat_new65+   1.372 0.260  7.591
## demog_sexMale          3.583 1.124 15.876

Conclusion: Collapsing took care of the separation problem, with the advantages of no loss in sample size and retaining the ability to make inferences about all adults, but at the cost of assuming the odds of heroin use are the same among those age 18-25 and 26-34 years.

6.10.4.3 Remove

Yet another option for resolving separation is to remove the predictor from the model. The downsides to removing a predictor are that the AORs for the remaining predictors are no longer adjusted for the removed predictor and there is no estimate of effect for the removed predictor. However, of the three discussed here, removing the predictor is the only viable option if filtering and collapsing result in just one level for the predictor.

Example 6.4 (continued): Resolve separation by removing demog_age_cat6 from the model.

# Refit the model without age
fit.ex6.4.remove <- glm(her_lifetime ~ alc_agefirst + demog_sex,
  family = binomial, data = nsduh_complete)

round(summary(fit.ex6.4.remove)$coef, 4)

##               Estimate Std. Error z value Pr(>|z|)
## (Intercept)    -0.9017     1.0656 -0.8462   0.3975
## alc_agefirst   -0.2491     0.0614 -4.0559   0.0000
## demog_sexMale   1.3054     0.6504  2.0070   0.0447

OR.CI <- cbind("AOR" = exp(coef(fit.ex6.4.remove)),
                       exp(confint(fit.ex6.4.remove)))[-1,]
round(OR.CI, 3)

##                 AOR 2.5 % 97.5 %
## alc_agefirst  0.780 0.690   0.88
## demog_sexMale 3.689 1.161  16.31

Conclusion: Removing demog_age_cat6 from the model solved the separation problem, but at the cost of the other effects no longer being adjusted for age and no estimate of the age effect.

6.10.4.4 Summary

When the outcome can be perfectly predicted at one or more levels of a categorical predictor, logistic regression runs into numerical difficulties and returns an error, a warning, and/or nonsense values. Always examine two-way tables (or three-way if there are interactions) of categorical predictors vs. the outcome and look for zeros. If there are zeros and the predictor has more than two levels, try collapsing groups so there are no zeros. Alternatively, filter the data to remove zeros or remove the predictor from the model, although these options come with stronger side effects that you may not want. If the predictor has only two levels then, of these three, the only option that works is to remove it from the model. Another alternative, beyond the scope of this text, is to use penalized logistic regression via the logistf package.

References

Greenland, Sander, Mohammad Ali Mansournia, and Douglas G. Altman. 2016. “Sparse Data Bias: A Problem Hiding in Plain Sight.” BMJ (Clinical Research Ed.) 352: i1981. https://doi.org/10.1136/bmj.i1981.

———. 2023b. Logistf: Firth’s Bias-Reduced Logistic Regression. https://cemsiis.meduniwien.ac.at/en/kb/science-research/software/statistical-software/firth-correction/.

Heinze, Georg, and Michael Schemper. 2002. “A Solution to the Problem of Separation in Logistic Regression.” Statistics in Medicine 21 (16): 2409–19. https://doi.org/10.1002/sim.1047.

Mansournia, Mohammad Ali, Angelika Geroldinger, Sander Greenland, and Georg Heinze. 2018. “Separation in Logistic Regression: Causes, Consequences, and Control.” American Journal of Epidemiology 187 (4): 864–70. https://doi.org/10.1093/aje/kwx299.