The R functions we will use for survival analysis require a dataset with a specific structure. There must be a numeric event time variable and a binary event indicator variable, coded as numeric, with values of 1 for events, 0 for censored event times, and no other non-missing values. There can also be additional variables representing predictors. In the basic set-up, the predictors do not vary over time and so there is one row per individual. Later, we will discuss time-varying predictors (Section 7.14) which require a dataset with multiple rows per individual.
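As a minimal sketch of this structure, the following builds a small made-up data frame with one row per individual. The object and variable names (toy, time, event, age, smoker) and all values are hypothetical and do not come from any dataset used in this text.

```r
# Hypothetical toy data: one row per individual (all values are made up)
toy <- data.frame(
  id     = 1:5,
  time   = c(12.5, 30.0, 30.0, 8.2, 30.0),          # numeric event time
  event  = c(1, 0, 0, 1, 0),                        # 1 = event, 0 = censored
  age    = c(54, 61, 47, 70, 58),                   # time-invariant predictor
  smoker = factor(c("Yes", "No", "No", "Yes", "No")) # time-invariant predictor
)

# Inspect the variable types
str(toy)
```

Here the three individuals with time equal to 30.0 and event equal to 0 are censored at the end of an assumed 30-unit follow-up period.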
Example 7.1 (continued): The first five rows of the Natality teaching dataset look like the following, including the event time (gestage37), the indicator of preterm birth (preterm01), and a few time-invariant demographic variables and other risk factors – mother’s age (MAGER), mother’s race/Hispanic origin (MRACEHISP), previous preterm birth (RF_PPTERM), and previous Cesarean (RF_CESAR). Four of the five births have a gestational age censored at 37 weeks (preterm01 = 0), and one was preterm at gestational age 31 weeks (preterm01 = 1).
load("Data/natality2018_rmph.Rdata")
natality %>%
  select(gestage37, preterm01, MAGER, MRACEHISP, RF_PPTERM, RF_CESAR) %>%
  head(5)
## # A tibble: 5 × 6
##   gestage37  preterm01  MAGER      MRACEHISP RF_PPTERM RF_CESAR
##   <labelled> <labelled> <labelled> <fct>     <fct>     <fct>
## 1 37         0          35         Hispanic  No        Yes
## 2 31         1          28         NH White  Yes       No
## 3 37         0          22         NH Black  No        No
## 4 37         0          35         NH White  No        No
## 5 37         0          30         NH White  No        No
Verify the event time variable (gestage37) is numeric using is.numeric() and summarize the event times using summary().
is.numeric(natality$gestage37)
## [1] TRUE
summary(natality$gestage37)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    17.0    37.0    37.0    36.4    37.0    37.0
Similarly, verify the event indicator (preterm01) is numeric and use table() to verify its only non-missing values are 0 and 1.
is.numeric(natality$preterm01)
## [1] TRUE
table(natality$preterm01, exclude = NULL)
##
##    0    1
## 1748  252
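The two checks above can also be combined into a single test. The following is a sketch only: is_valid_event_indicator() is a hypothetical helper written for illustration, not a function from any package, and the input vectors are made up.

```r
# Hypothetical helper (not from any package): returns TRUE if x is numeric
# and all of its non-missing values are 0 or 1
is_valid_event_indicator <- function(x) {
  is.numeric(x) && all(x[!is.na(x)] %in% c(0, 1))
}

is_valid_event_indicator(c(0, 1, 1, NA, 0))       # 0/1 with a missing value
is_valid_event_indicator(c(1, 2, 1))              # coded 1/2 rather than 0/1
is_valid_event_indicator(factor(c("No", "Yes")))  # a factor, not numeric
```

Only the first call meets the requirements. The other two would need to be recoded before use.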
In Example 7.1, the event indicator variable was already in the correct format. What if it is not?
Example 7.2: The Digitalis Investigation Group (DIG) teaching dataset (dig_rmph.rData, see Appendix A.6) contains data from a clinical trial investigating the safety and efficacy of Digoxin for treating congestive heart failure. One of the endpoints measured was toxicity (DIG). Examine whether this event indicator variable is numeric with values 0 and 1 and, if it is not, convert it to that form.
is.numeric(dig$DIG)
## [1] FALSE
table(dig$DIG, exclude = NULL)
##
##    No Event First Event
##        6702          98
The variable is not numeric, but it does have just two values. Create a numeric event indicator variable that is 1 when the original variable is “First Event” and 0 otherwise, and use table() to check the derivation. The syntax dig$DIG == "First Event" creates a logical vector of TRUE and FALSE values, and as.numeric() converts that logical vector to numeric, converting TRUE to 1 and FALSE to 0.
dig$DIG_event <- as.numeric(dig$DIG == "First Event")
table(dig$DIG, dig$DIG_event, exclude = NULL)
##              
##                  0    1
##   No Event    6702    0
##   First Event    0   98
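The comparison step matters when the original variable is stored as a factor, because applying as.numeric() directly to a factor returns its internal level codes rather than 0 and 1. The sketch below illustrates this with a made-up factor. The object names (status, codes, event01) are hypothetical, and the DIG variable is only assumed to behave like this factor.

```r
# Made-up factor mimicking a two-level event variable, with one missing value
status <- factor(c("No Event", "First Event", "No Event", NA),
                 levels = c("No Event", "First Event"))

# Pitfall: as.numeric() on a factor returns the level codes (1 and 2)
codes <- as.numeric(status)

# Comparing to the event label first yields the required 0/1 coding;
# missing values remain missing
event01 <- as.numeric(status == "First Event")

# Cross-tabulate to check the derivation, showing missing values if present
table(status, event01, useNA = "ifany")
```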
The datasets used in this text all include an event time variable. However, in your future work you may encounter datasets for which you have to compute the event time. For example, you may be given the dates the individuals started being observed and dates that events occurred (or were censored). Computing the event times, the times between those dates, is facilitated in R by using date-formatted variables and functions specifically designed to count time units between date-formatted variables. See, for example, the chapter “Dates and Times” in R for Data Science (H. Wickham, Çetinkaya-Rundel, and Grolemund 2017).
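As a minimal sketch of such a computation, the following uses base R's as.Date() and difftime() rather than the tools covered in the chapter cited above. The data frame visits, its variable names, and all dates are hypothetical.

```r
# Hypothetical follow-up dates (made up for illustration)
visits <- data.frame(
  start_date = as.Date(c("2020-01-15", "2020-03-01", "2020-06-10")),
  end_date   = as.Date(c("2020-04-15", "2021-03-01", "2020-06-30")),
  event      = c(1, 0, 1)   # 1 = event occurred, 0 = censored at end_date
)

# Event time in days: time from start of observation to event or censoring
visits$time_days <- as.numeric(
  difftime(visits$end_date, visits$start_date, units = "days")
)

# The same interval in years, assuming an average year length of 365.25 days
visits$time_years <- visits$time_days / 365.25

visits
```

Wrapping difftime() in as.numeric() drops the time-unit attribute so that the result is a plain numeric event time variable, as required above.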