## 7.10 Prediction

In linear regression, we estimated the mean outcome at \(X=x\), \(E(Y|X=x)\). In logistic regression, we estimated the probability of the outcome, \(P(Y=1|X=x)\). In Cox regression, we can estimate the survival probability at a specific time, \(S\left(t|X=x\right)\), as well as the hazard ratio for an individual relative to a reference individual, \(h\left(t|X=x\right)/h\left(t|X=x_\textrm{ref}\right)\).

Use `predict()`

to compute these estimates. Along with specifying values for the predictors in the model at which you want to compute the estimate, also specify \(t\), the time at which you want to compute the estimate. For predicted survival, the choice of \(t\) matters. However, because of the proportional hazards assumption, a predicted HR will not depend on time, although in the syntax below you still need to specify a time for the code to work (but the HR will be the same regardless of what time is chosen).

Unfortunately, `predict()`

only returns the standard error of the estimate when applied to a `coxph`

object, not a confidence interval. For the estimated survival probability, an alternative is to use `survfit()`

on the fitted Cox model, which does return a confidence interval.

**Example 7.6 (continued):** Using the Natality teaching dataset, estimate \(S(30)\), the probability that a preterm birth has not occurred as of 30 weeks, for a married, non-Hispanic Black, 28-year old woman with a previous preterm birth. Also, estimate the hazard of preterm birth for this individual relative to a reference individual.

```
# Check the spelling of each level
# (results not shown)
levels(natality$DMAR)
levels(natality$MRACEHISP)
levels(natality$RF_PPTERM)
```

```
# NOTE: For the event indicator variable (preterm), you must specify
# a value or predict() will not work, but it does not matter which
# value you specify, you will get the same answer.
NEWDAT <- data.frame(preterm01 = 1,
gestage37 = 30,
DMAR = "Married",
MRACEHISP = "NH Black",
MAGER = 28,
RF_PPTERM = "Yes")
predict(cox.ex7.6, NEWDAT, type = "survival", se.fit = T)
```

```
## $fit
## [1] 0.9628
##
## $se.fit
## [1] 0.01235
```

We can get the same answer using a `summary()`

of `survfit()`

with the `times`

option. However, `summary()`

will also include a confidence interval (via the `conf.int`

option). With this method, there is no need to specify values for the event indicator and time inside of `NEWDAT`

, although leaving them in `NEWDAT`

will not change the answer.

```
unlist(summary(survfit(cox.ex7.6,
NEWDAT,
se.fit =T, conf.int = 0.95),
times=30)[c("surv", "std.err", "lower", "upper")])
```

```
## surv std.err lower upper
## 0.96282 0.01235 0.93892 0.98732
```

We conclude that the probability of no preterm birth through 30 weeks is 0.963 (95% CI = 0.939, 0.987).

Use `predict()`

with `type = "risk"`

to estimate an AHR comparing the hazard for this individual to that of a reference individual. Different `reference`

specifications lead to different results.

`reference = "zero"`

: The reference individual has continuous predictor values of 0 and categorical predictors each at their reference level. This is not a good choice if \(X = 0\) is not a plausible value for a continuous predictor \(X\).`reference = "sample"`

: The reference individual has continuous predictors each at their respective sample mean and categorical predictors are each at their reference level. This is equivalent to centering each continuous variable at its mean.`reference = "strata"`

(the default): This leads to the same answer as`reference = "sample"`

unless the model includes a`strata()`

term (see Section 7.16.4), in which case continuous variables will be set to their within-stratum means.

Our model contained a single continuous predictor, mother’s age, for which a reference value of 0 does not make sense. As we see below, the AHR comparing our individual to a reference individual with age 0 is very large, since the model is extrapolating and estimating a very small hazard for a newborn woman to have a preterm birth (a nonsense statement) and that small number is the denominator of the HR.

```
## $fit
## 1
## 12.15
##
## $se.fit
## 1
## 1.487
```

If a value of 0 does not make sense for any of the continuous predictors in your model, then use `reference = "strata"`

instead (equivalent to `"sample"`

if there are no strata variables, and more sensible if there are since it makes sense to have the hazard for the individual in `NEWDAT`

be compared to an individual in their own strata).

```
## $fit
## 1
## 4.906
##
## $se.fit
## 1
## 0.6329
```

Because of the proportional hazards assumption, the hazard ratio is constant over time. You will get the same answer for any `gestage37`

value you enter into `NEWDAT`

, or even if you omit `gestage37`

altogether (the event indicator, `preterm01`

, can also be omitted).

```
NEWDAT <- data.frame(DMAR = "Married",
MRACEHISP = "NH Black",
MAGER = 28,
RF_PPTERM = "Yes")
predict(cox.ex7.6, NEWDAT, type = "risk", reference = "strata")
```

```
## 1
## 4.906
```

**Conclusion:** This individual has 4.91 times the hazard of preterm birth of an individual at the reference level of each categorical predictor and the mean of each continuous predictor.

To verify that the reference individual has reference level values and mean values for the predictors, the code below demonstrates that the predicted HR for an individual with these characteristics is 1.

```
NEWDAT <- data.frame(DMAR = levels(natality.complete$DMAR)[1],
MRACEHISP = levels(natality.complete$MRACEHISP)[1],
MAGER = mean(natality.complete$MAGER),
RF_PPTERM = levels(natality.complete$RF_PPTERM)[1])
predict(cox.ex7.6, NEWDAT, type = "risk", reference = "strata")
```

```
## 1
## 1
```

**NOTE:** `reference = "strata"`

is the default value and, for the reasons mentioned above, is a reasonable choice. Thus, you can typically omit the option and type, for example, `predict(cox.ex7.6, NEWDAT, type = "risk")`

.