5.26 Generalization / extrapolation / interpolation / overfitting

Generalization refers to using a model to predict outcome values for cases that were not in your dataset. Only generalize to cases that are similar to those that were used to fit the model.

  • For example, if you fit a model to predict total cholesterol for individuals in the U.K., the model may not provide accurate predictions for individuals from another country.
  • As another example, suppose you fit a model predicting lead in tap water using data only from cities. The model may not generalize to rural areas.

In these two examples, the observations included in the model fitting were restricted by a variable (country, urban/rural status) not included in the model. Such variables define inclusion/exclusion criteria, and generalization is limited to cases that meet those criteria.

Additionally, generalization is limited to cases whose values of the predictors included in the model are similar to the values observed in the dataset used to fit the model.

  • Extrapolation refers to using a model to predict outcome values for cases with predictor values outside the range used to fit the model. For example, if you fit a model to predict total cholesterol from age for individuals aged 40 to 65 years, do not make predictions for individuals younger than 40 years or older than 65 years (a simple range check is sketched after this list).
  • Interpolation refers to making a prediction for a case with a predictor value that is within the range of the observed values of that predictor in the dataset used to fit the model. Interpolation is valid only if the assumptions of linear regression are met and the model is appropriate. For example, if you fit a line when you should have fit a curve, or left out important predictors, then even interpolated predictions could be biased.
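To make the distinction concrete, the following sketch (in Python, using made-up data and a hypothetical `flag_extrapolation` helper, not a function from any particular library) checks whether new cases fall within the observed range of each predictor before predicting:

```python
import numpy as np
import pandas as pd

def flag_extrapolation(train_df, new_df, predictors):
    """Flag rows in new_df with any predictor value outside the range
    observed in train_df; predictions for flagged rows would extrapolate."""
    flags = pd.Series(False, index=new_df.index)
    for p in predictors:
        lo, hi = train_df[p].min(), train_df[p].max()
        flags |= (new_df[p] < lo) | (new_df[p] > hi)
    return flags

# Hypothetical data: model fit to individuals aged 40 to 65 years
train = pd.DataFrame({"age": np.random.default_rng(1).uniform(40, 65, 200)})
new = pd.DataFrame({"age": [35.0, 50.0, 70.0]})
print(flag_extrapolation(train, new, ["age"]))
# Ages 35 and 70 are flagged as extrapolation; age 50 is an interpolation.
```

Note that this per-predictor check only catches marginal extrapolation; with multiple predictors, a case can fall inside each predictor's individual range yet still lie outside the region jointly covered by the data.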

Another cause of lack of generalizability is overfitting – including so many predictors in a model that it fits the current data very well but does not predict future data well. A model with too many predictors fits not only the true relationships between the predictors and the outcome (“signal”), but also relationships that are specific to this sample and do not hold in general (“noise”). A rule of thumb is to limit the number of predictors in a linear regression model to no more than \(n/15\), where \(n\) is the sample size. Seen the other way around, if you are designing a study and plan to include \(K\) predictors, you need at least \(15 \times K\) observations. Of course, you may need more observations to have sufficient power to test a specific hypothesis – this rule of thumb is only for generalizability. For binary logistic regression (Chapter 6) and Cox proportional hazards regression (Chapter 7), \(n\) is replaced by the number of observations in the less prevalent outcome category and the number of events, respectively (Babyak 2004; Harrell 2015, pp. 72–73). As with any rule of thumb, this is meant as guidance – there is no requirement that it be strictly applied.
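The signal-versus-noise distinction can be illustrated with a short simulation (a minimal sketch in Python with simulated data, not an example from the text). With \(n = 60\) observations, the rule of thumb allows at most \(60/15 = 4\) predictors; fitting 30 predictors, only one of which carries signal, produces a model that fits the training data far better than it predicts new data from the same population:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n, k = 60, 30                       # far more predictors than the n/15 rule of thumb allows
X_train = rng.normal(size=(n, k))
X_test = rng.normal(size=(n, k))    # new data from the same population
beta = np.zeros(k)
beta[0] = 1.0                       # only the first predictor is true signal
y_train = X_train @ beta + rng.normal(size=n)
y_test = X_test @ beta + rng.normal(size=n)

model = LinearRegression().fit(X_train, y_train)
print("Training R^2:", round(r2_score(y_train, model.predict(X_train)), 2))  # optimistic
print("Test R^2:    ", round(r2_score(y_test, model.predict(X_test)), 2))    # much lower
```

The gap between the two \(R^2\) values reflects the “noise” the model has memorized: patterns present in the training sample but absent from new data.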

References

Babyak, Michael A. 2004. “What You See May Not Be What You Get: A Brief, Nontechnical Introduction to Overfitting in Regression-Type Models.” Psychosomatic Medicine 66 (3): 411–21.
Harrell, Frank E., Jr. 2015. Regression Modeling Strategies. 2nd ed. Switzerland: Springer International Publishing.