## 7.5 Survival analysis dataset structure

The R functions we will use for survival analysis require a dataset with a specific structure. There must be a numeric **event time** variable and a binary **event indicator** variable, coded as numeric, with values of 1 for events, 0 for censored event times, and no other non-missing values. There can also be additional variables representing predictors. In the basic set-up, the predictors do not vary over time and so there is one row per individual. Later, we will discuss time-varying predictors (Section 7.14) which require a dataset with multiple rows per individual.

**Example 7.1 (continued):** The first five rows of the Natality teaching dataset look like the following, including the event time (`gestage37`), the indicator of preterm birth (`preterm01`), and a few time-invariant demographic variables and other risk factors: mother’s age (`MAGER`), mother’s race/Hispanic origin (`MRACEHISP`), previous preterm birth (`RF_PPTERM`), and previous Cesarean (`RF_CESAR`). Four of the five births have a gestational age censored at 37 weeks (`preterm01` = 0), and one was preterm at gestational age 31 weeks (`preterm01` = 1).

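```
library(tidyverse) # provides %>% and select()

load("Data/natality2018_rmph.Rdata")
# Show the event time, event indicator, and a few predictors
natality %>%
  select(gestage37, preterm01, MAGER, MRACEHISP, RF_PPTERM, RF_CESAR) %>%
  head(5)
```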

```
## # A tibble: 5 × 6
##   gestage37 preterm01 MAGER MRACEHISP RF_PPTERM RF_CESAR
##       <dbl>     <dbl> <dbl> <fct>     <fct>     <fct>
## 1        37         0    35 Hispanic  No        Yes
## 2        31         1    28 NH White  Yes       No
## 3        37         0    22 NH Black  No        No
## 4        37         0    35 NH White  No        No
## 5        37         0    30 NH White  No        No
```

Verify the event time variable (`gestage37`) is numeric using `is.numeric()` and summarize the event times using `summary()`.

`## [1] TRUE`

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 17.0 37.0 37.0 36.4 37.0 37.0
```
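Checks of this kind could be run with calls along the following lines. This is a minimal sketch, assuming a small stand-in data frame `natality_demo` (a hypothetical name) that holds only the five rows printed earlier, so its results will not match the full-dataset output above.

```r
# Stand-in for the natality data: only the five rows printed earlier
# (natality_demo is a hypothetical name, not an object from the text)
natality_demo <- data.frame(
  gestage37 = c(37, 31, 37, 37, 37),
  preterm01 = c(0, 1, 0, 0, 0)
)

# The event time must be numeric
is.numeric(natality_demo$gestage37)

# Summarize the event times
summary(natality_demo$gestage37)

# Extra sanity checks: count missing times and confirm none are negative
sum(is.na(natality_demo$gestage37))
all(natality_demo$gestage37 >= 0, na.rm = TRUE)
```

With the full dataset, the same calls would be applied to `natality$gestage37`.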

Similarly, verify the event indicator (`preterm01`) is numeric and use `table()` to verify its only non-missing values are 0 and 1.

`## [1] TRUE`

```
##
## 0 1
## 1748 252
```
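The indicator check can be sketched in the same way. The vector `preterm01_demo` below is a hypothetical stand-in containing only the five values printed earlier. The `useNA = "ifany"` argument to `table()` is an addition not used in the text; it makes any missing values visible in the tabulation.

```r
# Stand-in indicator: the five values printed earlier
# (preterm01_demo is a hypothetical name)
preterm01_demo <- c(0, 1, 0, 0, 0)

# The event indicator must be numeric
is.numeric(preterm01_demo)

# Tabulate the values; useNA = "ifany" adds a count of NAs if any exist
table(preterm01_demo, useNA = "ifany")

# Programmatic check: every non-missing value is 0 or 1
all(preterm01_demo %in% c(0, 1) | is.na(preterm01_demo))
```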

In Example 7.1, the event indicator variable was already in the correct format. What if it is not?

**Example 7.2:** The Digitalis Investigation Group (DIG) teaching dataset (`dig_rmph.rData`, see Appendix A.6) contains data from a clinical trial investigating the safety and efficacy of Digoxin for treating congestive heart failure. One of the endpoints measured was toxicity (`DIG`). Examine whether this event indicator variable is numeric with values 0 and 1 and, if it is not, convert it to that form.

`## [1] FALSE`

```
##
##    No Event First Event
##        6702          98
```

The variable is not numeric, but it does have just two values. Create a numeric event indicator variable that is 1 when the original variable is “First Event”, and use `table()` to check the derivation. The syntax `dig$DIG == "First Event"` creates a logical vector of `TRUE` and `FALSE` values, and `as.numeric()` converts that logical vector to numeric, converting `TRUE` to 1 and `FALSE` to 0.

```
##
##                  0    1
##   No Event    6702    0
##   First Event    0   98
```
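The derivation described above can be sketched as follows. The data frame `dig_demo` is a hypothetical four-row stand-in for the DIG data, and the new variable name `DIG01` is an assumption chosen for illustration; the text does not state the name it uses.

```r
# Stand-in for the DIG toxicity variable: a factor with the two levels shown
# (dig_demo and DIG01 are hypothetical names)
dig_demo <- data.frame(
  DIG = factor(c("No Event", "First Event", "No Event", "No Event"),
               levels = c("No Event", "First Event"))
)

# A factor is not numeric, so this returns FALSE
is.numeric(dig_demo$DIG)

# The comparison gives TRUE/FALSE; as.numeric() maps TRUE to 1, FALSE to 0
dig_demo$DIG01 <- as.numeric(dig_demo$DIG == "First Event")

# Cross-tabulate the original against the derived variable to check it
table(dig_demo$DIG, dig_demo$DIG01, useNA = "ifany")
```

A missing value in the original variable stays missing in the derived one, because comparing `NA` to a string returns `NA`.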

The datasets used in this text all include an event time variable. However, in your future work you may encounter datasets for which you have to compute the event time. For example, you may be given the dates the individuals started being observed and dates that events occurred (or were censored). Computing the event times, the times between those dates, is facilitated in R by using date-formatted variables and functions specifically designed to count time units between date-formatted variables. See, for example, the chapter “Dates and Times” in **R for Data Science** (H. Wickham, Çetinkaya-Rundel, and Grolemund 2017).
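As a minimal sketch of computing event times from dates, the example below uses base R's `as.Date()` and `difftime()` rather than the functions covered in the cited chapter. The data frame `followup`, its column names, and the dates are all hypothetical values made up for illustration.

```r
# Hypothetical start and end dates for three individuals;
# event01 is 1 if the end date is an event and 0 if it is a censoring date
followup <- data.frame(
  start_date = as.Date(c("2018-01-15", "2018-02-01", "2018-03-10")),
  end_date   = as.Date(c("2018-06-15", "2018-02-22", "2019-03-10")),
  event01    = c(1, 0, 1)
)

# difftime() counts time units between dates; as.numeric() drops the
# difftime class so the result is a plain numeric event time
followup$time_days <- as.numeric(
  difftime(followup$end_date, followup$start_date, units = "days")
)

# Convert to another time scale if needed
followup$time_weeks <- followup$time_days / 7

followup
```

The resulting `time_days` and `event01` columns then have the event time and event indicator structure described at the start of this section.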

### References

Wickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. *R for Data Science: Import, Tidy, Transform, Visualize, and Model Data*. 2nd ed. Sebastopol, CA: O’Reilly Media.