## 6.15 Generalization / overfitting

As discussed in Section 5.25, it is important to limit the number of predictors in a model to avoid overfitting and ensure generalizability. In a logistic regression with $$n$$ observations, the rule of thumb is to have no more than $$n_{min}/15$$ predictors, where $$n_{min}$$ is the number of observations in the less common outcome level. Since $$n_{min} = n \times p_{min}$$, this can also be expressed in terms of the sample proportion: have no more than $$n \times p_{min}/15$$ predictors, where $$p_{min}$$ is the proportion of observations with the less common outcome level. Seen the other way around, if you are designing a study and plan to include $$K$$ predictors, you need at least $$15 \times K / p_{min}$$ observations, where $$p_{min}$$ is your best guess for the smaller of the two population prevalences. As with any rule of thumb, this is meant as guidance; there is no requirement that it be strictly applied.

For example, suppose your outcome is “occurrence of disease within one year post-exposure”. The outcome levels are “disease” and “no disease”. If you have a sample of size $$n = 300$$ and $$23\%$$ of the sample developed the disease (so $$77\%$$ did not and the less prevalent outcome level is “disease”), you should include no more than $$n \times p_{min}/15 = 300 \times 0.23/15 = 4.6$$ predictors (you can round up to $$5$$) in a logistic regression model.
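The arithmetic in this example can be checked with a few lines of code. The following is a minimal Python sketch, not part of any library; the function name `max_predictors` and its `obs_per_predictor` argument are hypothetical names chosen for illustration.

```python
def max_predictors(n, p_min, obs_per_predictor=15):
    """Rule-of-thumb limit on predictors: n * p_min / 15.

    n      -- total sample size
    p_min  -- proportion of observations in the less common outcome level
    obs_per_predictor -- divisor in the rule of thumb (15 in this section)
    """
    if not 0 < p_min <= 0.5:
        raise ValueError("p_min is the smaller proportion, so it must be in (0, 0.5]")
    n_min = n * p_min              # observations in the less common level
    return n_min / obs_per_predictor


# Example from the text: n = 300, 23% developed the disease
limit = max_predictors(300, 0.23)
print(f"{limit:.1f}")   # 4.6
print(round(limit))     # 5, the rounded value used in the text
```

Because this is a rule of thumb, the function returns the unrounded value and leaves the rounding decision to the analyst.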

Suppose instead you were designing a study in which you expect $$65\%$$ of the individuals to experience the outcome and you would like to include 5 predictors. In this case, the prevalences are $$65\%$$ and $$35\%$$, so the lower prevalence is that of not experiencing the outcome. To ensure generalizability, you need at least $$15 \times K / p_{min} = 15 \times 5 / 0.35 = 214.29$$ observations in your sample (round up to $$215$$).
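The study-design calculation can be sketched the same way. This is an illustrative Python snippet under the same rule of thumb; `min_sample_size` is a hypothetical helper name, and the small tolerance before rounding up is an added safeguard against floating-point error, not part of the rule itself.

```python
import math


def min_sample_size(k, p_min, obs_per_predictor=15):
    """Rule-of-thumb minimum sample size: 15 * K / p_min, rounded up.

    k      -- planned number of predictors
    p_min  -- best guess for the smaller of the two outcome prevalences
    """
    if not 0 < p_min <= 0.5:
        raise ValueError("p_min is the smaller prevalence, so it must be in (0, 0.5]")
    n = obs_per_predictor * k / p_min
    # Subtract a tiny tolerance so a value like 250.00000000000003
    # (floating-point noise) is not rounded up to 251.
    return math.ceil(n - 1e-9)


# Example from the text: outcome prevalence 65%, so p_min = 1 - 0.65 = 0.35
print(min_sample_size(5, 0.35))   # 215
```

With $$n = 215$$ and $$p_{min} = 0.35$$, the expected number of observations in the less common level is about $$75$$, which is $$15$$ per predictor for $$5$$ predictors.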

NOTE: A sample size sufficient to ensure generalizability may or may not be sufficient for the purpose of having enough power to test a hypothesis.

### References

Babyak, Michael A. 2004. “What You See May Not Be What You Get: A Brief, Nontechnical Introduction to Overfitting in Regression-Type Models.” Psychosomatic Medicine 66 (3): 411–21.

Harrell, Frank E., Jr. 2015. Regression Modeling Strategies. 2nd ed. Switzerland: Springer International Publishing.