## 3.3 Creating a “Table 1”

In most published articles, there is a “Table 1” containing descriptive statistics for the sample. This may include, for example, the mean and standard deviation for continuous variables, the frequency and proportion for categorical variables, and perhaps also the number of missing values.

The brute-force method of creating such a table would be to compute each statistic for each variable of interest and then copy and paste the results into a table. Having done this even once, you will wish for an easier method! There are many possible solutions, but one that is quick and easy to use is demonstrated here: the `gtsummary` package (Sjoberg et al. 2021, 2023). Many examples can be found on the `gtsummary` package GitHub site, in Tutorial: tbl_summary, and in Table Gallery.

### 3.3.1 Overall

**Example 3.1 (continued):** Create a table of summary statistics for all the variables we have been summarizing in this chapter for the entire sample (“overall”).

The code below loads the `gtsummary` library and uses `tbl_summary()` with default settings to generate Table 3.1.

The default settings produce a table with the frequency and proportion for categorical variables, the median and interquartile range (IQR) for continuous variables (here, the 25th and 75th percentiles, not their difference), and the number of missing values, if any (shown in rows headed by “Unknown”).

```
library(gtsummary)
nhanes %>%
  # Select the variables to be included in the table
  select(sbp, RIDAGEYR, RIAGENDR, income) %>%
  tbl_summary()
```

| Characteristic | N = 1,000^{1} |
|---|---|
| Systolic BP (mean of 2nd and 3rd) | 121 (111, 134) |
| Unknown | 42 |
| Age in years at screening | 47 (32, 61) |
| Gender | |
| Male | 482 (48%) |
| Female | 518 (52%) |
| Annual household income | |
| < $25,000 | 156 (18%) |
| $25,000 to < $55,000 | 254 (29%) |
| $55,000+ | 480 (54%) |
| Unknown | 110 |

^{1} Median (IQR); n (%)
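To make the default cell contents concrete, the sketch below computes the same kinds of statistics by hand in base R. The vectors `x` and `g` are made-up values (not the NHANES data), and the percentiles use R's default quantile method, which may differ slightly from the method `tbl_summary()` uses.

```r
# Made-up values for illustration only (not NHANES data)
x <- c(110, 120, NA, 130, 140, 150)      # continuous, one missing value
g <- c("Male", "Female", "Female", NA)   # categorical, one missing value

# Continuous: median and the 25th/75th percentiles of the non-missing values
med <- median(x, na.rm = TRUE)
q   <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)

# Number of missing values (what an "Unknown" row reports)
n_unknown <- sum(is.na(x))

# Categorical: frequencies and percentages among the non-missing values
freq <- table(g)                 # table() drops NA by default
pct  <- 100 * prop.table(freq)

med
q
n_unknown
round(pct)
```

Note that the percentages are computed among the non-missing values. Table 3.1 is consistent with this: the income counts sum to 890 (1,000 minus the 110 unknown), and 156/890 is about 18%.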

The default is to include missing values; this can be overridden by setting the `missing` option to “no”, as in the code below. The resulting table is not shown, but it is identical to the table above except that the “Unknown” rows are omitted. In particular, it is still based on the full sample size (1000 observations).

```
nhanes %>%
  # Select the variables to be included in the table
  select(sbp, RIDAGEYR, RIAGENDR, income) %>%
  tbl_summary(
    missing = "no"
  )
```
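Hiding the “Unknown” rows is not the same as restricting the analysis to complete cases. The base R sketch below illustrates the difference on a made-up data frame (`toy` is hypothetical, and `complete.cases()` is used here only as one possible way to build a complete case dataset; it is not necessarily how the dataset in Section 3.2.1 was created).

```r
# Made-up data for illustration only
toy <- data.frame(
  sbp    = c(118, NA, 131, 125, 140),
  income = c("low", "high", NA, "high", "low")
)

# Full sample: each variable is summarized over its own non-missing values
n_full    <- nrow(toy)
mean_full <- mean(toy$sbp, na.rm = TRUE)

# Complete cases: only rows with no missing value in ANY variable are kept
complete_toy  <- toy[complete.cases(toy), ]
n_complete    <- nrow(complete_toy)
mean_complete <- mean(complete_toy$sbp)

c(n_full = n_full, n_complete = n_complete)
c(mean_full = mean_full, mean_complete = mean_complete)
```

In the full sample, the mean of `sbp` uses four observations, whereas in the complete case sample it uses only the three rows in which `income` is also observed, so the two summaries can differ.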

**To create a table for a complete case analysis sample**, start with a complete case analysis dataset, as in the code below (which uses `complete.dat`, the dataset created in Section 3.2.1). Additional options are demonstrated below. See `?gtsummary::tbl_summary` for more options.

- `statistic`: The default for continuous variables is the median and IQR. The default for categorical variables is the frequency and proportion. Below, this option is used to instead compute the mean and standard deviation for continuous variables (and the default for categorical variables is coded explicitly).
- `digits`: `tbl_summary()` guesses the number of digits to which to round. Use this option to set the number yourself. Since two statistics are specified in the `statistic` option, two numbers are specified here, one for each statistic.
- `type`: `tbl_summary()` guesses whether a variable is continuous or categorical based on its distribution. Use this option to specify the types yourself, in particular if you need to override the default.
- `label`: The default row labels are the variable names or labels (if the dataset has been labeled, for example, using the `Hmisc` library `label()` function). Use this option to change the row labels.
- `modify_header()`: The default column header is “Characteristic”. Use this function to change the column header. Surrounding text by `**` results in a bold font.
- `modify_caption()`: Use this function to add a table caption.
- `bold_labels()`: Use this function to display the row labels in a bold font.

The results are shown in Table 3.2.

```
complete.dat %>%
  select(sbp, RIDAGEYR, RIAGENDR, income) %>%
  tbl_summary(
    statistic = list(all_continuous()  ~ "{mean} ({sd})",
                     all_categorical() ~ "{n} ({p}%)"),
    digits = list(all_continuous()  ~ c(2, 2),
                  all_categorical() ~ c(0, 1)),
    type = list(sbp      ~ "continuous",
                RIDAGEYR ~ "continuous",
                RIAGENDR ~ "categorical",
                income   ~ "categorical"),
    label = list(sbp      ~ "SBP (mmHg)",
                 RIDAGEYR ~ "Age (years)",
                 RIAGENDR ~ "Gender",
                 income   ~ "Annual Income")
  ) %>%
  modify_header(label = "**Variable**") %>%
  modify_caption("Participant characteristics (complete case analysis)") %>%
  bold_labels()
```

| Variable | N = 855^{1} |
|---|---|
| SBP (mmHg) | 123.57 (17.57) |
| Age (years) | 48.11 (17.43) |
| Gender | |
| Male | 425 (49.7%) |
| Female | 430 (50.3%) |
| Annual Income | |
| < $25,000 | 148 (17.3%) |
| $25,000 to < $55,000 | 248 (29.0%) |
| $55,000+ | 459 (53.7%) |

^{1} Mean (SD); n (%)
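As a rough illustration of how a `statistic` template and the corresponding `digits` combine into a table cell, the sketch below builds cells of the form `{mean} ({sd})` (two decimals each) and `{n} ({p}%)` (zero and one decimal) by hand with `sprintf()`. The data are made up, and `sprintf()` is only a stand-in for the formatting done internally by `tbl_summary()`, whose rounding rules may differ in edge cases.

```r
# Made-up values for illustration only
x <- c(120, 130, 125, 118)          # continuous variable
g <- c("M", "F", "F", "M", "F")     # categorical variable

# "{mean} ({sd})" with digits c(2, 2)
cell_continuous <- sprintf("%.2f (%.2f)", mean(x), sd(x))

# "{n} ({p}%)" with digits c(0, 1), shown for the level "F"
n_f <- sum(g == "F")                # integer count
p_f <- 100 * n_f / length(g)        # percentage
cell_categorical <- sprintf("%d (%.1f%%)", n_f, p_f)

cell_continuous
cell_categorical
```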

**NOTE:** If a variable is a factor with exactly two levels labeled “Yes” and “No”, then `tbl_summary()` by default will only include the row corresponding to “Yes”. The same applies to variables that have values TRUE/FALSE or 1/0. Use the `value` option to change the row displayed (see `?gtsummary::tbl_summary` for details). Alternatively, set `type` to “categorical” to display both rows.
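The behavior described in this note can be tried on a small made-up data frame. In the sketch below, `toy` and its `smoker` variable are hypothetical; the three calls show the default one-row display, the two-row display obtained with `type`, and the use of `value` to display the “No” row instead.

```r
library(gtsummary)

# Made-up Yes/No factor for illustration only
toy <- data.frame(
  smoker = factor(c("Yes", "No", "No", "Yes", "No"), levels = c("No", "Yes"))
)

# Default: treated as dichotomous, so only the "Yes" row is displayed
tbl_default <- tbl_summary(toy)

# Display both levels by declaring the variable categorical
tbl_both <- tbl_summary(toy, type = list(smoker ~ "categorical"))

# Display the "No" row instead of the "Yes" row
tbl_no <- tbl_summary(toy, value = list(smoker ~ "No"))

# Number of body rows in each table
nrow(tbl_default$table_body)
nrow(tbl_both$table_body)
nrow(tbl_no$table_body)
```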

### 3.3.2 By outcome or exposure

In many published research articles, descriptive statistics are presented not only “overall” (over the entire sample) but also by the outcome or exposure. If the “by” variable is continuous then, for the purpose of the descriptive table only, create a categorical version with as many levels as you would like “by” columns, where each level corresponds to a range of values. A common method, demonstrated here, is to use a **median split** in which a binary variable is created based on whether the value of the continuous variable is below or at least as large as the median value. This results in a “by” variable with two levels and approximately equal sample sizes in each level.
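When more than two “by” columns are wanted, the same idea extends to other quantile-based cut points. The sketch below, which uses made-up values and is only one of several reasonable ways to do this, splits a continuous variable into three groups at its sample tertiles using base R's `quantile()` and `cut()`.

```r
# Made-up continuous variable for illustration only
x <- c(95, 102, 110, 118, 121, 125, 133, 140, 152)

# Cut points: minimum, 1/3 and 2/3 sample quantiles, maximum
breaks <- quantile(x, probs = c(0, 1/3, 2/3, 1))

# Intervals are closed on the left; include.lowest = TRUE keeps the
# maximum value in the last interval
x_tertile <- cut(x, breaks = breaks, include.lowest = TRUE, right = FALSE)

# Group sizes
table(x_tertile)
```

With ties or a sample size that is not a multiple of three, the groups will be only approximately equal in size.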

To create a table of descriptive statistics by outcome or exposure, use the `by` argument. To also include a column with the “overall” summaries, use `add_overall()`. To stratify by more than one variable, use `tbl_strata()` (see `?gtsummary::tbl_strata` for more information).

**Categorical outcome or exposure**

**Example 3.1 (continued):** Create a table of summary statistics overall and by gender.

The following code produces Table 3.3 displaying descriptive statistics by gender.

**NOTES:**

- The `all_stat_cols()` option in `modify_header()` adds the frequency and proportion of the “by” variable in the column header.
- The “by” variable (`RIAGENDR`) was omitted from the `type` and `label` options since leaving it in results in an error.
- The code below illustrates how to assign the table to an object (`TABLE1`), and then view it by typing the name of the object. This is not actually necessary for this example, but will facilitate exporting the table to an external file (see Section 3.3.3).

```
TABLE1 <- complete.dat %>%
  select(sbp, RIDAGEYR, RIAGENDR, income) %>%
  tbl_summary(
    # The "by" variable
    by = RIAGENDR,
    statistic = list(all_continuous()  ~ "{mean} ({sd})",
                     all_categorical() ~ "{n} ({p}%)"),
    digits = list(all_continuous()  ~ c(2, 2),
                  all_categorical() ~ c(0, 1)),
    type = list(sbp      ~ "continuous",
                RIDAGEYR ~ "continuous",
                income   ~ "categorical"),
    label = list(sbp      ~ "SBP (mmHg)",
                 RIDAGEYR ~ "Age (years)",
                 income   ~ "Annual Income")
  ) %>%
  modify_header(
    label = "**Variable**",
    # The following adds the % to the column total label
    # <br> is the location of a line break
    all_stat_cols() ~ "**{level}**<br>N = {n} ({style_percent(p, digits=1)}%)"
  ) %>%
  modify_caption("Participant characteristics, by gender") %>%
  bold_labels() %>%
  # Include an "overall" column
  add_overall(
    last = FALSE,
    # The ** make it bold
    col_label = "**All participants**<br>N = {N}"
  )

# View the table
TABLE1
```

| Variable | All participants, N = 855^{1} | Male, N = 425 (49.7%)^{1} | Female, N = 430 (50.3%)^{1} |
|---|---|---|---|
| SBP (mmHg) | 123.57 (17.57) | 124.55 (16.01) | 122.60 (18.96) |
| Age (years) | 48.11 (17.43) | 47.26 (17.38) | 48.94 (17.45) |
| Annual Income | | | |
| < $25,000 | 148 (17.3%) | 65 (15.3%) | 83 (19.3%) |
| $25,000 to < $55,000 | 248 (29.0%) | 127 (29.9%) | 121 (28.1%) |
| $55,000+ | 459 (53.7%) | 233 (54.8%) | 226 (52.6%) |

^{1} Mean (SD); n (%)

**Median split for a continuous outcome or exposure**

**Example 3.1 (continued):** Create a table of summary statistics overall and by SBP, using a median split to create two SBP groups.

The code below creates a dichotomous version of `sbp` based on a median split and then uses this new variable as the `by` variable to produce Table 3.4.

```
MEDIAN <- median(complete.dat$sbp)
LABEL0 <- paste("SBP <", MEDIAN)
LABEL1 <- paste("SBP >=", MEDIAN)

complete.dat <- complete.dat %>%
  mutate(sbp_median_split = as.numeric(sbp >= MEDIAN),
         sbp_median_split = factor(sbp_median_split,
                                   levels = 0:1,
                                   labels = c(LABEL0, LABEL1)))

# Checking derivation
tapply(complete.dat$sbp, complete.dat$sbp_median_split, range)
```

```
## $`SBP < 121`
## [1] 89 120
##
## $`SBP >= 121`
## [1] 121 234
```

```
# Create table
TABLE1 <- complete.dat %>%
  # Select the median split variable, not the original variable
  select(sbp_median_split, RIDAGEYR, RIAGENDR, income) %>%
  tbl_summary(
    # Use the median split variable as the "by" variable
    by = sbp_median_split,
    statistic = list(all_continuous()  ~ "{mean} ({sd})",
                     all_categorical() ~ "{n} ({p}%)"),
    digits = list(all_continuous()  ~ c(2, 2),
                  all_categorical() ~ c(0, 1)),
    type = list(RIDAGEYR ~ "continuous",
                RIAGENDR ~ "categorical",
                income   ~ "categorical"),
    label = list(RIDAGEYR ~ "Age (years)",
                 RIAGENDR ~ "Gender",
                 income   ~ "Annual Income")
  ) %>%
  modify_header(
    label = "**Variable**",
    all_stat_cols() ~ "**{level}**<br>N = {n} ({style_percent(p, digits=1)}%)"
  ) %>%
  modify_caption("Participant characteristics, by SBP") %>%
  bold_labels() %>%
  add_overall(last = FALSE,
              col_label = "**All participants**<br>N = {N}")

# View the table
TABLE1
```

| Variable | All participants, N = 855^{1} | SBP < 121, N = 412 (48.2%)^{1} | SBP >= 121, N = 443 (51.8%)^{1} |
|---|---|---|---|
| Age (years) | 48.11 (17.43) | 41.57 (15.84) | 54.19 (16.63) |
| Gender | | | |
| Male | 425 (49.7%) | 180 (43.7%) | 245 (55.3%) |
| Female | 430 (50.3%) | 232 (56.3%) | 198 (44.7%) |
| Annual Income | | | |
| < $25,000 | 148 (17.3%) | 67 (16.3%) | 81 (18.3%) |
| $25,000 to < $55,000 | 248 (29.0%) | 120 (29.1%) | 128 (28.9%) |
| $55,000+ | 459 (53.7%) | 225 (54.6%) | 234 (52.8%) |

^{1} Mean (SD); n (%)

### 3.3.3 Exporting to an external file

To export a `gtsummary` table to a Microsoft Word or HTML file, use the following syntax, which starts with the `tbl_summary` object (called `TABLE1` above) and then uses the `flextable` (Gohel and Skintzos 2023) or `gt` (Iannone et al. 2022) package to do the exporting.

```
# Make sure these are installed:
# install.packages(c("Rcpp", "gtsummary", "flextable", "gt"))

# Export to Microsoft Word
TABLE1 %>%
  as_flex_table() %>%
  flextable::save_as_docx(path = "MyTable1.docx")

# Export to HTML
TABLE1 %>%
  as_gt() %>%
  gt::gtsave(filename = "MyTable1.html")
```

### 3.3.4 Adding p-values to Table 1

Often, in a published research article, a Table 1 that displays descriptive statistics by the outcome or an exposure includes p-values that test, for each variable, the null hypothesis that the variable has the same mean (or median or proportion) across all groups in the population. P-values can be easily added to a `tbl_summary` table using `add_p()` (see `?gtsummary::add_p.tbl_summary` for all the options, including the various statistical tests available).

**Example 3.1 (continued):** Create a table of descriptive statistics, by SBP, including t-tests for continuous variables and chi-square tests for categorical variables.

The code below uses `tbl_summary()` with the `by` argument, followed by `add_p()`, to generate Table 3.5.

```
complete.dat %>%
  select(sbp_median_split, RIDAGEYR, RIAGENDR, income) %>%
  tbl_summary(
    by = sbp_median_split,
    statistic = list(all_continuous()  ~ "{mean} ({sd})",
                     all_categorical() ~ "{n} ({p}%)"),
    digits = list(all_continuous()  ~ c(2, 2),
                  all_categorical() ~ c(0, 1))
  ) %>%
  add_p(
    test = list(all_continuous()  ~ "t.test",
                all_categorical() ~ "chisq.test"),
    pvalue_fun = function(x) style_pvalue(x, digits = 3)
  )
```

| Characteristic | SBP < 121, N = 412^{1} | SBP >= 121, N = 443^{1} | p-value^{2} |
|---|---|---|---|
| Age in years at screening | 41.57 (15.84) | 54.19 (16.63) | <0.001 |
| Gender | | | <0.001 |
| Male | 180 (43.7%) | 245 (55.3%) | |
| Female | 232 (56.3%) | 198 (44.7%) | |
| Annual household income | | | 0.728 |
| < $25,000 | 67 (16.3%) | 81 (18.3%) | |
| $25,000 to < $55,000 | 120 (29.1%) | 128 (28.9%) | |
| $55,000+ | 225 (54.6%) | 234 (52.8%) | |

^{1} Mean (SD); n (%)

^{2} Welch Two Sample t-test; Pearson's Chi-squared test
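As a check on what these p-values represent, the sketch below recomputes two of them in base R from the numbers printed in Table 3.5 rather than from the raw data. The income counts are copied from the table; the Welch t-test for age is approximated from the rounded means, standard deviations, and group sizes, so it will not match a test on the raw data exactly.

```r
# Income counts by SBP group, copied from Table 3.5
income_counts <- matrix(
  c( 67,  81,
    120, 128,
    225, 234),
  ncol = 2, byrow = TRUE,
  dimnames = list(income = c("< $25,000", "$25,000 to < $55,000", "$55,000+"),
                  sbp    = c("SBP < 121", "SBP >= 121"))
)

# Pearson's chi-squared test of association between income and SBP group
chisq_income <- chisq.test(income_counts)
p_income     <- chisq_income$p.value
round(p_income, 3)

# Welch t-test for age, from the rounded summary statistics in Table 3.5
m <- c(41.57, 54.19)   # group means
s <- c(15.84, 16.63)   # group standard deviations
n <- c(412, 443)       # group sizes

se2    <- s^2 / n                               # squared standard errors
t_stat <- (m[2] - m[1]) / sqrt(sum(se2))        # Welch t statistic
df     <- sum(se2)^2 / sum(se2^2 / (n - 1))     # Welch-Satterthwaite df
p_age  <- 2 * pt(-abs(t_stat), df)              # two-sided p-value
p_age < 0.001
```

The income p-value computed this way should agree with the 0.728 shown in Table 3.5, and the age p-value falls well below 0.001.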

### 3.3.5 Should p-values be added to a Table 1?

**It is very easy to add p-values to a Table 1, but are they recommended? In general, no, regardless of whether the data arose from an observational study (Vandenbroucke et al. 2007) or a randomized trial (Moher et al. 2010).**

If displaying sample characteristics by the outcome for the purpose of providing crude (unadjusted) tests of association between a set of predictors and the outcome, then including p-values in Table 1 does make some sense. However, typically the subsequent regression analysis will provide adjusted tests of association, and conclusions will be drawn from that adjusted analysis, so unadjusted tests may not be relevant.

A potential reason for displaying p-values in a Table 1 of participant characteristics by the primary exposure of interest is to attempt to demonstrate the extent to which the characteristics differ between exposure groups and may therefore confound the outcome-exposure relationship. But what matters for confounding is not whether the groups differ in the population (which is what the p-values are testing) but how much they differ in *this sample*. P-values, in this context, are not relevant; what matters are the magnitudes of differences in characteristics between the exposure groups, as well as the magnitude of association between characteristics and the outcome (Vandenbroucke et al. 2007).

Yes, p-values are related to the magnitude of differences, but they also depend heavily on the sample size. In a small sample, even a meaningfully large difference might not lead to a small p-value, resulting in an incorrect conclusion of “no confounding.” Conversely, in a large sample, even a small, non-meaningful difference might result in a small p-value, resulting in an incorrect conclusion of “confounding.” Yet another error can occur in the case of a small, seemingly non-meaningful difference that is not statistically significant but which is for a predictor that is very strongly associated with the outcome. The p-value would lead to a conclusion of “no confounding,” yet even a small difference between exposure groups in a predictor strongly associated with the outcome can result in meaningful confounding (Dales and Ury 1978; Vandenbroucke et al. 2007). Even worse, if the exposure groups were determined using randomization (e.g., a randomized clinical trial), then we already know the null hypothesis is true, so p-values are irrelevant (Moher et al. 2010; Altman 1985; Senn 1994). Under randomization, any differences observed between groups in the sample are entirely due to randomness, not to any underlying difference between the groups.
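The dependence of p-values on sample size can be seen with a small numerical sketch. The counts below are made up, the helper `p_2x2()` is a hypothetical convenience function, and the test is run without a continuity correction purely to keep the arithmetic simple.

```r
# Hypothetical helper: p-value of a Pearson chi-squared test for a 2x2 table.
# Each exposure group is one column: (n with characteristic, n without).
p_2x2 <- function(n11, n21, n12, n22) {
  tab <- matrix(c(n11, n21, n12, n22), nrow = 2)
  chisq.test(tab, correct = FALSE)$p.value
}

# Small sample, large difference: 60% vs. 40%, 20 per group
p_small <- p_2x2(12, 8, 8, 12)

# Large sample, small difference: 52% vs. 50%, 10,000 per group
p_large <- p_2x2(5200, 4800, 5000, 5000)

c(p_small = p_small, p_large = p_large)
```

The 20 percentage point difference in the small sample is not statistically significant at the 0.05 level, while the 2 percentage point difference in the large sample is, even though the former is far more likely to matter for confounding.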

Thus, using p-values to provide evidence for or against confounding can be misleading or even nonsensical. In a confirmatory analysis (see Section 5.22), potential confounders are identified using subject-matter knowledge based on prior research and included in a regression model regardless of their observed associations.

In summary, including p-values in a Table 1 is easy to do, but may not be relevant and can, at times, be misleading or meaningless.

### References

Altman (1985). *Journal of the Royal Statistical Society. Series D (The Statistician)* 34 (1): 125–36. https://doi.org/10.2307/2987510.

Dales and Ury (1978). *International Journal of Epidemiology* 7 (4): 373–75. https://doi.org/10.1093/ije/7.4.373.

Gohel and Skintzos (2023). *Flextable: Functions for Tabular Reporting*. https://CRAN.R-project.org/package=flextable.

Iannone et al. (2022). *Gt: Easily Create Presentation-Ready Display Tables*. https://CRAN.R-project.org/package=gt.

Moher et al. (2010). *BMJ* 340 (March): c869. https://doi.org/10.1136/bmj.c869.

Senn (1994). *Statistics in Medicine* 13 (17): 1715–26. https://doi.org/10.1002/sim.4780131703.

Sjoberg et al. (2023). *Gtsummary: Presentation-Ready Data Summary and Analytic Result Tables*. https://CRAN.R-project.org/package=gtsummary.

Sjoberg et al. (2021). *The R Journal* 13: 570–80. https://doi.org/10.32614/RJ-2021-053.

Vandenbroucke et al. (2007). *Epidemiology* 18 (6): 805–35. https://doi.org/10.1097/EDE.0b013e3181577511.