## 7.10 Prediction

In linear regression, a predicted value is the estimated mean outcome at $$X=x$$, $$E(Y|X=x)$$. In logistic regression, we predicted the estimated probability of the outcome, $$P(Y=1|X=x)$$. In Cox regression, we can predict the estimated survival probability at a specific time, $$S\left(t|X=x\right)$$, as well as the predicted hazard ratio for an individual relative to a reference individual, $$h\left(t|X=x\right)/h\left(t|X=x_\textrm{ref}\right)$$.

Use predict() to compute predictions. Along with specifying values for the predictors in the model at which you want to make the prediction, also specify $$t$$, the time at which you want to make the prediction. For predicted survival, the choice of $$t$$ matters. However, because of the proportional hazards assumption, a predicted HR will not depend on time, although in the syntax below you still need to specify a time for the code to work (but the HR will be the same regardless of what time is chosen).

Unfortunately, predict() only returns the standard error of the prediction for predictions from a coxph object, not a confidence interval. For the predicted survival, an alternative is to use survfit() on the fitted Cox model which does return a confidence interval.

Example 7.6 (continued): Using the Natality teaching dataset, estimate $$S(30)$$, the probability that a preterm birth has not occurred as of 30 weeks, for a married, non-Hispanic Black, 28-year old woman with a previous preterm birth. Also, estimate the hazard of preterm birth for this individual relative to a reference individual.

# Check the spelling of each level
# (results not shown)
levels(natality$DMAR) levels(natality$MRACEHISP)
levels(natality$RF_PPTERM) # NOTE: For the event indicator variable (preterm), you must specify # a value or predict() will not work, but it does not matter which # value you specify, you will get the same answer. NEWDAT <- data.frame(preterm01 = 1, gestage37 = 30, DMAR = "Married", MRACEHISP = "NH Black", MAGER = 28, RF_PPTERM = "Yes") predict(cox.ex7.6, NEWDAT, type = "survival", se.fit = T) ##$fit
##  0.9628
##
## $se.fit ##  0.01235 We can get the same answer using a summary() of survfit() with times options, and this time including a confidence interval using the conf.int option. With this method, there is no need to specify values for the event indicator and time inside of NEWDAT, although leaving them in NEWDAT will not change the answer. unlist(summary(survfit(cox.ex7.6, NEWDAT, se.fit =T, conf.int = 0.95), times=30)[c("surv", "std.err", "lower", "upper")]) ## surv std.err lower upper ## 0.96282 0.01235 0.93892 0.98732 We conclude that the probability of no preterm birth through 30 weeks is 0.963 (95% CI = 0.939, 0.987). Use predict() with type = "risk" to estimate an AHR comparing the hazard for this individual to that of a reference individual. Different reference specifications lead to different results. • reference = "zero": The reference individual has continuous predictor values of 0 and categorical predictors each at their reference level. This is not a good choice if $$X = 0$$ is not a plausible value for a continuous predictor $$X$$. • reference = "sample": The reference individual has continuous predictors each at their respective sample mean and categorical predictors are each at their reference level. This is equivalent to centering each continuous variable at its mean. • reference = "strata" (the default): This leads to the same answer as reference = "sample" unless the model includes a strata() term (see Section 7.16.4), in which case continuous variables will be set to their within-stratum means. Our model contained a single continuous predictor, mother’s age, for which a reference value of 0 does not make sense. As we see below, the AHR comparing our individual to a reference individual with age 0 is very large, since the model is extrapolating and estimating a very small hazard for a newborn woman to have a preterm birth (a nonsense statement) and that small number is the denominator of the HR. predict(cox.ex7.6, NEWDAT, type = "risk", reference = "zero", se.fit = T) ##$fit
##     1
## 12.15
##
## $se.fit ## 1 ## 1.487 If a value of 0 does not make sense for any of the continuous predictors in your model, then use reference = "strata" instead (equivalent to sample if there are no strata variables, and more sensible if there are since it makes sense to have the hazard for the individual in NEWDAT be compared to an individual in their own strata). predict(cox.ex7.6, NEWDAT, type = "risk", reference = "strata", se.fit = T) ##$fit
##     1
## 4.906
##
## $se.fit ## 1 ## 0.6329 Because of the proportional hazards assumption, the hazard ratio is constant over time. You will get the same answer for any gestage37 value you enter into NEWDAT, or even if you omit gestage37 altogether (the event indicator, preterm01, can also be omitted). NEWDAT <- data.frame(DMAR = "Married", MRACEHISP = "NH Black", MAGER = 28, RF_PPTERM = "Yes") predict(cox.ex7.6, NEWDAT, type = "risk", reference = "strata") ## 1 ## 4.906 Conclusion: This individual has 4.91 times the hazard of preterm birth of an individual at the reference level of each categorical predictor and the mean of each continuous predictor. To verify that the reference individual has reference level values and mean values for the predictors, the code below demonstrates that the predicted HR for an individual with these characteristics is 1. NEWDAT <- data.frame(DMAR = levels(natality.complete$DMAR),
MRACEHISP = levels(natality.complete$MRACEHISP), MAGER = mean(natality.complete$MAGER),
RF_PPTERM = levels(natality.complete\$RF_PPTERM))
predict(cox.ex7.6, NEWDAT, type = "risk", reference = "strata")
## 1
## 1

NOTE: reference = "strata" is the default value and, for the reasons mentioned above, is a reasonable choice. Thus, you can typically omit the option and type, for example, predict(cox.ex7.6, NEWDAT, type = "risk").