## 6.3 Why not use linear regression for a binary outcome?

Example 6.1: Figure 6.1 displays the linear regression fit for simulated data from 60 individuals aged 20 to 80 years where $$Y$$ = coronary heart disease (CHD) status and $$X$$ = age. The possible values for $$Y$$ are “yes” and “no” coded as $$Y = 1$$ and $$0$$, respectively. The gray dots at $$Y = 1$$ (along the top of the figure) are plotted at the ages of individuals with CHD, while the dots at $$Y = 0$$ (along the bottom of the figure) are for those without CHD. The solid line is the linear regression line, which attempts to estimate the proportion of CHD = Yes (the mean outcome) as a function of age using a straight line. The dashed line is a smoother which tracks the observed proportion without any constraint on its shape.

A straight line is not a good fit to this data in the sense that most of the points are very far from the line. However, it turns out to not be a terrible fit when comparing it to the relationship between the proportion of CHD = Yes and age (the dashed line). However, in general, logistic regression does even better at estimating the proportion.

Logistic regression models the mean (the probability of $$Y = 1$$) using the complicated looking logit function $$\ln{(p/(1-p))}$$ on the left-hand side of the equation. Why not have $$Y$$ or $$p$$ on the left-hand side? One reason is that, ideally, the range of possible values on the left-hand side should match that of the right-hand side. The right-hand side of the model equation is $$\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_K X_K$$ and can take on any value from $$-\infty$$ to $$\infty$$, whereas $$Y$$ can only be 0 or 1 and $$p$$ must be in the range 0 to 1. However, the logit of $$p$$, $$\ln{(p/(1-p))}$$, can take on any value, matching the range of the right-hand side.

When transformed back to the probability scale, logistic regression fits an S-shaped curve that estimates the proportion of 1’s at a given value of the predictor. Figure 6.2 demonstrates how the logistic regression fit (solid line) closely tracks the smoother (dashed line) and the predicted probabilities are all in [0, 1]. At the highest observed ages, however, the height of the estimated linear regression line (in Figure 6.1) is greater than 1, resulting in predictions outside of [0, 1].