7.24 Summary of survival analysis

The steps for carrying out a survival analysis using the Kaplan-Meier (KM) method and Cox regression are similar to those for multiple linear regression (MLR; Section 5.27), with the following differences:

  • Considerations when thinking about the outcome.
    • What is the “event” of interest? (e.g., death, preterm birth, heart attack)
    • What is the time origin, the time when an individual could first experience the event? (e.g., enrollment in the study, age of initiation of pain pill use)
    • Which event times are censored? Are some individuals’ times censored due to loss to follow-up? Are there competing risks that censor times (e.g., death leads to a censored time for other events)? Is there an end-of-study time at which times are censored? (see Section 7.2)
  • If the dataset is not already in the correct format (unlike the examples we have been working with), then set up your dataset to fit the structure required for a survival analysis (see Section 7.5).
    • Create a numeric event indicator variable that has the value 1 for events and 0 for censored times.
    • If there are no time-varying predictors, the dataset should have one row per individual.
    • If there are time-varying predictors, the dataset should have one row for every time period in which predictors do not vary (see Section 7.14).
    • Create a numeric time-to-event variable (or START and STOP variables if you have time-varying predictors). If you need to work with date variables, see, for example, the chapter “Dates and Times” in the text R for Data Science (H. Wickham, Çetinkaya-Rundel, and Grolemund 2017).
  • The outcome (time to event) is not typically transformed.
  • Start by using the KM method to estimate and plot the survival and hazard functions, ignoring any other variables in your dataset. This will help you get a feel for the data and the extent of censoring (see Section 7.6).
  • Use the KM method (and log-rank test) to compare survival functions between groups, and plot these comparisons as well (see Section 7.6.5).
  • Check for separation by creating a two-way table for each categorical predictor vs. the event indicator (three-way table if there is an interaction) and look for levels at which all times are censored, that is, levels with no observed events (see Section 7.13). Resolve issues using filtering, collapsing variables, or removing variables (see Section 6.10.4).
    • If you need to redo the previous steps because the sample size has changed or any categorical variables have changed, make sure to also re-evaluate separation.
  • Check the proportional hazards assumption. If necessary, include an interaction with time or stratify (see Section 7.16).
  • Evaluate linearity (Section 7.18), outliers (Section 7.19), and influential observations (Section 7.20). In Cox regression, there are no normality or constant variance assumptions.
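
To make the data setup and first KM step in the list above concrete, here is a minimal sketch in plain Python (standard library only). It is an illustration under stated assumptions, not the workflow used in the sections referenced above: the records, the outcome labels, and the function names `to_survival_rows` and `kaplan_meier` are all hypothetical. It builds a one-row-per-individual dataset with a numeric time and a 1/0 event indicator from start and end dates, then computes the KM product-limit estimate of the survival function, ignoring all predictors.

```python
from datetime import date

# Hypothetical raw records: (id, time origin, end of follow-up, reason follow-up ended)
raw = [
    (1, date(2020, 1, 1), date(2020, 1, 6), "event"),
    (2, date(2020, 1, 1), date(2020, 1, 4), "lost"),          # lost to follow-up
    (3, date(2020, 1, 2), date(2020, 1, 10), "event"),
    (4, date(2020, 1, 3), date(2020, 1, 11), "end_of_study"),  # administratively censored
    (5, date(2020, 1, 5), date(2020, 1, 8), "event"),
]


def to_survival_rows(records):
    """Return one row per individual with a numeric time and a 1/0 event indicator."""
    rows = []
    for pid, start, end, outcome in records:
        time = (end - start).days              # days since the time origin
        event = 1 if outcome == "event" else 0  # 1 = event, 0 = censored
        rows.append({"id": pid, "time": time, "event": event})
    return rows


def kaplan_meier(rows):
    """Return [(t, S(t))] at each distinct event time (product-limit estimate)."""
    surv = 1.0
    curve = []
    event_times = sorted({r["time"] for r in rows if r["event"] == 1})
    for t in event_times:
        # Individuals censored at t are counted as still at risk at t
        at_risk = sum(1 for r in rows if r["time"] >= t)
        events = sum(1 for r in rows if r["time"] == t and r["event"] == 1)
        surv *= 1 - events / at_risk
        curve.append((t, surv))
    return curve


rows = to_survival_rows(raw)
km = kaplan_meier(rows)
for t, s in km:
    print(f"t = {t}: S(t) = {s:.3f}")
```

The estimated survival function only steps down at event times; censored times contribute by shrinking the number at risk at later event times, which is why the extent of censoring is worth inspecting at this stage.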

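The separation check in the list above can be sketched in the same way. This is a minimal example in plain Python, assuming a hypothetical categorical predictor named `group` and hypothetical helper functions `event_table` and `levels_without_events`. It cross-tabulates the predictor against the event indicator and flags any level at which every time is censored.

```python
from collections import Counter

# Hypothetical one-row-per-individual data: a categorical predictor
# and a 1/0 event indicator (1 = event, 0 = censored)
rows = [
    {"group": "A", "event": 1},
    {"group": "A", "event": 0},
    {"group": "B", "event": 0},
    {"group": "B", "event": 0},
    {"group": "C", "event": 1},
    {"group": "C", "event": 1},
]


def event_table(rows, predictor):
    """Two-way table: counts of censored times and events at each predictor level."""
    counts = Counter((r[predictor], r["event"]) for r in rows)
    levels = sorted({r[predictor] for r in rows})
    return {
        level: {"censored": counts[(level, 0)], "events": counts[(level, 1)]}
        for level in levels
    }


def levels_without_events(table):
    """Return the levels at which all times are censored (no observed events)."""
    return [level for level, n in table.items() if n["events"] == 0]


table = event_table(rows, "group")
for level, n in table.items():
    print(level, n)

flagged = levels_without_events(table)
print("Levels with all times censored:", flagged)
```

Any flagged level is a candidate for filtering, collapsing, or removal. If the data are changed as a result, the table should be rebuilt, since the sample size and the set of levels may have changed.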

Wickham, H., M. Çetinkaya-Rundel, and G. Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 2nd ed. Sebastopol, CA: O’Reilly Media.