7.21 Generalization / overfitting

As with linear (Section 5.26) and logistic (Section 6.15) regression, it is important to limit the number of predictors in a Cox regression model to avoid overfitting and ensure generalizability. The rule of thumb is to have no more than \(n_e/15\) predictors, where \(n_e\) is the number of events (non-censored event times) (Babyak 2004; Harrell 2015, pp. 72–73). As with any rule of thumb, this is meant as guidance; there is no requirement that it be strictly applied.

Seen the other way around, if you are designing a study and plan to include \(K\) predictors, you need a sample size large enough to result in \(n_e \geq 15 \times K\) events. If your background research leads you to believe that the proportion of events will be \(p_e\), then the expected number of events is \(n_e = p_e \times n\), where \(n\) is the total sample size, and the sample size required for generalizability is \(n \geq 15 \times K / p_e\).

As mentioned before, meeting this rule of thumb does not replace a sample size calculation for the purpose of having enough power. If the sample size is too small, the Cox regression parameter estimates are unstable and so do not generalize to new cases. This is not the same as lacking power to detect a real effect; a study can lack power and still be generalizable if its parameter estimates are stable.

For example, suppose you are studying “time to disease after exposure” in a sample in which \(n_e = 80\) individuals developed the disease. In a Cox regression, you should limit the number of predictors in your model to about \(K = n_e/15 = 80/15 \approx 5.33\). Since this is just a rule of thumb, it is reasonable to round up to \(6\) rather than strictly stopping at \(5\).
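The arithmetic in this example can be checked with a few lines of Python. This is only a minimal sketch: the function name `max_predictors` and its `events_per_predictor` argument are hypothetical names chosen for illustration, and the default of 15 simply encodes the rule of thumb described above.

```python
def max_predictors(n_events, events_per_predictor=15):
    """Rule-of-thumb upper bound on the number of predictors.

    Hypothetical helper: divides the number of observed (non-censored)
    events by the assumed number of events needed per predictor.
    """
    if n_events < 0 or events_per_predictor <= 0:
        raise ValueError("n_events must be >= 0 and events_per_predictor > 0")
    return n_events / events_per_predictor


# Example from the text: 80 individuals developed the disease.
k = max_predictors(80)
print(round(k, 2))  # prints 5.33
```

The function returns the unrounded value on purpose, leaving the decision of whether to round down or up to the analyst, since the rule is guidance rather than a strict limit.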

Suppose instead you were designing a study in which you expect \(15\%\) of individuals to develop the disease and you would like to include 5 predictors in your model. For generalizability, you need a sample size of \(n \geq 15 \times K / p_e = 15 \times 5 / 0.15 = 500\), which corresponds to an expected \(0.15 \times 500 = 75 = 15 \times 5\) events.
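The design calculation can be sketched the same way. The helper below is hypothetical (the name `required_sample_size` and its arguments are assumptions made for illustration); it applies \(n \geq 15 \times K / p_e\) and rounds up to a whole number of individuals. Exact fractions are used so that a proportion such as 0.15 does not introduce floating-point error before rounding up.

```python
import math
from fractions import Fraction


def required_sample_size(n_predictors, event_proportion, events_per_predictor=15):
    """Smallest n such that the expected number of events, p_e * n,
    is at least events_per_predictor * n_predictors.

    Hypothetical helper illustrating the rule of thumb; it is not a
    power calculation.
    """
    if not 0 < event_proportion <= 1:
        raise ValueError("event_proportion must be in (0, 1]")
    if n_predictors < 0 or events_per_predictor <= 0:
        raise ValueError("n_predictors must be >= 0 and events_per_predictor > 0")
    # Convert via str() so that 0.15 becomes exactly 3/20.
    p_e = Fraction(str(event_proportion))
    return math.ceil(events_per_predictor * n_predictors / p_e)


# Example from the text: 5 predictors, 15% expected to develop the disease.
print(required_sample_size(5, 0.15))  # prints 500
```

A separate power calculation may call for a larger sample; in that case the larger of the two sample sizes would apply.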

References

Babyak, Michael A. 2004. “What You See May Not Be What You Get: A Brief, Nontechnical Introduction to Overfitting in Regression-Type Models.” Psychosomatic Medicine 66 (3): 411–21.

Harrell, Frank E, Jr. 2015. Regression Modeling Strategies. 2nd ed. Switzerland: Springer International Publishing.