3.3 Creating a “Table 1”
In most published articles, there is a “Table 1” containing descriptive statistics for the sample. This may include, for example, the mean and standard deviation for continuous variables, the frequency and proportion for categorical variables, and perhaps also the number of missing values.
The brute force method of creating such a table would be to compute each statistic for each variable of interest and then copy and paste the results into a table. Having done this even once, you will wish for an easier method! There are many possible solutions, but one that is quick and easy to use is demonstrated here: the gtsummary package (Sjoberg et al. 2021, 2023). Many examples can be found on the gtsummary package GitHub site, in Tutorial: tbl_summary, and in Table Gallery.
Other R packages not covered here that facilitate table creation include flextable (Gohel and Skintzos 2023), tableone (Yoshida and Bartel 2022), and table1 (Rich 2023).
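To make the brute force approach concrete, here is a minimal base R sketch. It uses a small simulated data frame (the name dat and its values are hypothetical, not the NHANES data used in this chapter) and computes each statistic one at a time, which is exactly the step that a table-building package automates.

```r
# Simulated data for illustration only (not the chapter's NHANES data)
set.seed(42)
dat <- data.frame(
  sbp    = round(rnorm(200, mean = 122, sd = 17)),
  gender = factor(sample(c("Male", "Female"), 200, replace = TRUE))
)
dat$sbp[sample(200, 10)] <- NA   # introduce 10 missing values

# Continuous variable: mean, standard deviation, number missing
sbp_mean    <- mean(dat$sbp, na.rm = TRUE)
sbp_sd      <- sd(dat$sbp, na.rm = TRUE)
sbp_missing <- sum(is.na(dat$sbp))

# Categorical variable: frequency and proportion
gender_freq <- table(dat$gender)
gender_prop <- prop.table(gender_freq)

# Each result would then be copied by hand into a table
sprintf("SBP: %.1f (%.1f), missing = %d", sbp_mean, sbp_sd, sbp_missing)
gender_freq
round(100 * gender_prop, 1)
```

With only two variables this is manageable, but every additional variable repeats the same steps, and every change to the data means redoing the copying.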
3.3.1 Overall
Example 3.1 (continued): Create a table of summary statistics for all the variables we have been summarizing in this chapter for the entire sample (“overall”).
The code below loads the gtsummary library and uses tbl_summary() with default settings to generate Table 3.1.
The default settings produce a table with the frequency and proportion for categorical variables, the median and interquartile range (IQR) for continuous variables (here, the 25th and 75th percentiles, not their difference), and the number of missing values, if any (shown in rows headed by “Unknown”).
library(gtsummary)

nhanes %>%
  # Select the variables to be included in the table
  select(sbp, RIDAGEYR, RIAGENDR, income) %>%
  tbl_summary()
Characteristic | N = 1,000¹ |
---|---|
sbp | 121 (111, 134) |
Unknown | 42 |
RIDAGEYR | 47 (32, 61) |
RIAGENDR | |
Male | 482 (48%) |
Female | 518 (52%) |
income | |
< $25,000 | 156 (18%) |
$25,000 to < $55,000 | 254 (29%) |
$55,000+ | 480 (54%) |
Unknown | 110 |
¹ Median (IQR); n (%) |
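The income rows in Table 3.1 suggest that the percentages are computed among the non-missing values, with the “Unknown” count reported separately. The following arithmetic check, using only the counts shown in the table (the object names are illustrative), reproduces the displayed percentages under that assumption.

```r
# Counts copied from the income rows of Table 3.1
counts <- c("< $25,000"            = 156,
            "$25,000 to < $55,000" = 254,
            "$55,000+"             = 480)
n_unknown <- 110

# Non-missing and total sample sizes
n_nonmissing <- sum(counts)               # 890
n_total      <- n_nonmissing + n_unknown  # 1000

# Percentages with the non-missing count as the denominator
pct <- round(100 * counts / n_nonmissing)
pct
```

Using the full sample size (1,000) as the denominator instead would give about 16%, 25%, and 48%, which do not match the table.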
The default is to include missing values; this can be overridden by setting the missing option to “no”, as in the code below. The resulting table is not shown, but it is identical to the table above except that the “Unknown” rows are omitted. In particular, it is still based on the full sample size (1,000 observations).
nhanes %>%
  # Select the variables to be included in the table
  select(sbp, RIDAGEYR, RIAGENDR, income) %>%
  tbl_summary(
    missing = "no"
  )
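The missing option also accepts “always”, and the row header can be changed with missing_text. The sketch below illustrates both using the trial example dataset that ships with gtsummary (an assumption made to keep the example self-contained; it is not the NHANES data used in this chapter).

```r
library(gtsummary)
library(dplyr)

# trial is an example dataset bundled with gtsummary
tbl_always <- trial %>%
  select(age, grade) %>%
  tbl_summary(
    # Show a missing-value row for every variable, even when the count is 0
    missing = "always",
    # Replace the default "Unknown" row header
    missing_text = "Missing"
  )

tbl_always
```

Showing the missing count for every variable can be helpful when readers need to confirm that a variable has no missing values, rather than inferring it from the absence of a row.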
To create a table for a complete case analysis sample, start with a complete case analysis dataset, as in the code below (which uses complete.dat, the dataset created in Section 3.2.1). Additional options are demonstrated below. See ?gtsummary::tbl_summary for more options.
- statistic: The default for continuous variables is the median and IQR. The default for categorical variables is the frequency and proportion. Below, this option is used to instead compute the mean and standard deviation for continuous variables (and the default for categorical variables is coded explicitly).
- digits: tbl_summary() guesses the number of digits to which to round. Use this option to set the number yourself. Since two statistics are specified in the statistic option, two numbers are specified here, one for each statistic.
- type: tbl_summary() guesses whether a variable is continuous or categorical based on its distribution. Use this option to specify the types yourself, in particular if you need to override the default.
- label: The default row labels are the variable names or labels (if the dataset has been labeled, for example, using the Hmisc library label() function). Use this option to change the row headers.
- modify_header(): The default column header is “Characteristic”. Use this function to change the column header. Surrounding text with ** results in a bold font.
- modify_caption(): Use this function to add a table caption.
- bold_labels(): Use this function to display the row labels in a bold font.
The results are shown in Table 3.2.
complete.dat %>%
  select(sbp, RIDAGEYR, RIAGENDR, income) %>%
  tbl_summary(
    statistic = list(all_continuous()  ~ "{mean} ({sd})",
                     all_categorical() ~ "{n} ({p}%)"),
    digits = list(all_continuous()  ~ c(2, 2),
                  all_categorical() ~ c(0, 1)),
    type = list(sbp      ~ "continuous",
                RIDAGEYR ~ "continuous",
                RIAGENDR ~ "categorical",
                income   ~ "categorical"),
    label = list(sbp      ~ "SBP (mmHg)",
                 RIDAGEYR ~ "Age (years)",
                 RIAGENDR ~ "Gender",
                 income   ~ "Annual Income")
  ) %>%
  modify_header(label = "**Variable**") %>%
  modify_caption("Participant characteristics (complete case analysis)") %>%
  bold_labels()
Variable | N = 855¹ |
---|---|
SBP (mmHg) | 123.57 (17.57) |
Age (years) | 48.11 (17.43) |
Gender | |
Male | 425 (49.7%) |
Female | 430 (50.3%) |
Annual Income | |
< $25,000 | 148 (17.3%) |
$25,000 to < $55,000 | 248 (29.0%) |
$55,000+ | 459 (53.7%) |
¹ Mean (SD); n (%) |
NOTE: If a variable is a factor with exactly two levels labeled “Yes” and “No”, then tbl_summary() by default will only include the row corresponding to “Yes”. The same applies to variables that have values TRUE/FALSE or 1/0. Use the value option to change the row displayed (see ?gtsummary::tbl_summary for details). Alternatively, set type to “categorical” to display both rows.
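The behavior described in this note can be seen with a 1/0 variable. The sketch below uses the response variable from the trial example dataset bundled with gtsummary (an assumption for illustration; the chapter's NHANES variables are not used here) and compares the default display with the result of setting type to “categorical”.

```r
library(gtsummary)
library(dplyr)

# Default: a 1/0 variable is summarized on a single row (the "1" level)
tbl_default <- trial %>%
  select(response) %>%
  tbl_summary(missing = "no")

# Setting the type to "categorical" displays a row for each level
tbl_both <- trial %>%
  select(response) %>%
  tbl_summary(
    type = response ~ "categorical",
    missing = "no"
  )

tbl_default
tbl_both
```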
3.3.2 By outcome or exposure
In many published research articles, descriptive statistics are presented not only “overall” (over the entire sample) but also by the outcome or exposure. If the “by” variable is continuous then, for the purpose of the descriptive table only, create a categorical version with as many levels as you would like “by” columns, where each level corresponds to a range of values. A common method, demonstrated here, is to use a median split in which a binary variable is created based on whether the value of the continuous variable is below or at least as large as the median value. This results in a “by” variable with two levels and approximately equal sample sizes in each level.
To create a table of descriptive statistics by outcome or exposure, use the by argument. To also include a column with the “overall” summaries, use add_overall(). To stratify by more than one variable, use tbl_strata() (see ?gtsummary::tbl_strata for more information).
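As a brief illustration of stratifying by more than one variable, the sketch below uses tbl_strata() with the trial example dataset bundled with gtsummary (an assumption for illustration; the variable names grade and trt belong to that dataset, not to the NHANES data). Each level of the strata variable gets its own set of “by” columns.

```r
library(gtsummary)
library(dplyr)

tbl_by_grade <- trial %>%
  select(age, stage, grade, trt) %>%
  tbl_strata(
    # One block of columns per tumor grade
    strata = grade,
    # Within each grade, summarize by treatment group
    .tbl_fun = ~ .x %>%
      tbl_summary(by = trt, missing = "no")
  )

tbl_by_grade
```

Tables stratified this way grow wide quickly, so it may be worth limiting the number of levels in each stratifying variable.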
Categorical outcome or exposure
Example 3.1 (continued): Create a table of summary statistics overall and by gender.
The following code produces Table 3.3 displaying descriptive statistics by gender.
NOTES:
- The all_stat_cols() option in modify_header() adds the frequency and proportion of the “by” variable in the column header.
- The “by” variable (RIAGENDR) was omitted from the type and label options since leaving it in results in an error.
- The code below illustrates how to assign the table to an object (TABLE1), and then view it by typing the name of the object. This is not actually necessary for this example, but will facilitate exporting the table to an external file (see Section 3.3.3).
TABLE1 <- complete.dat %>%
  select(sbp, RIDAGEYR, RIAGENDR, income) %>%
  tbl_summary(
    # The "by" variable
    by = RIAGENDR,
    statistic = list(all_continuous()  ~ "{mean} ({sd})",
                     all_categorical() ~ "{n} ({p}%)"),
    digits = list(all_continuous()  ~ c(2, 2),
                  all_categorical() ~ c(0, 1)),
    type = list(sbp      ~ "continuous",
                RIDAGEYR ~ "continuous",
                income   ~ "categorical"),
    label = list(sbp      ~ "SBP (mmHg)",
                 RIDAGEYR ~ "Age (years)",
                 income   ~ "Annual Income")
  ) %>%
  modify_header(
    label = "**Variable**",
    # The following adds the % to the column total label
    # <br> is the location of a line break
    all_stat_cols() ~ "**{level}**<br>N = {n} ({style_percent(p, digits=1)}%)"
  ) %>%
  modify_caption("Participant characteristics, by gender") %>%
  bold_labels() %>%
  # Include an "overall" column
  add_overall(
    last = FALSE,
    # The ** make it bold
    col_label = "**All participants**<br>N = {N}"
  )
Variable | All participants, N = 855¹ | Male, N = 425 (49.7%)¹ | Female, N = 430 (50.3%)¹ |
---|---|---|---|
SBP (mmHg) | 123.57 (17.57) | 124.55 (16.01) | 122.60 (18.96) |
Age (years) | 48.11 (17.43) | 47.26 (17.38) | 48.94 (17.45) |
Annual Income | | | |
< $25,000 | 148 (17.3%) | 65 (15.3%) | 83 (19.3%) |
$25,000 to < $55,000 | 248 (29.0%) | 127 (29.9%) | 121 (28.1%) |
$55,000+ | 459 (53.7%) | 233 (54.8%) | 226 (52.6%) |
¹ Mean (SD); n (%) |
Median split for a continuous outcome or exposure
Example 3.1 (continued): Create a table of summary statistics overall and by SBP, using a median split to create two SBP groups.
The code below creates a dichotomous version of sbp based on a median split and then uses this new variable as the by variable to produce Table 3.4.
MEDIAN <- median(complete.dat$sbp)
LABEL0 <- paste("SBP <", MEDIAN)
LABEL1 <- paste("SBP >=", MEDIAN)
complete.dat <- complete.dat %>%
  mutate(sbp_median_split = as.numeric(sbp >= MEDIAN),
         sbp_median_split = factor(sbp_median_split,
                                   levels = 0:1,
                                   labels = c(LABEL0, LABEL1)))

# Checking derivation
tapply(complete.dat$sbp, complete.dat$sbp_median_split, range)
## $`SBP < 121`
## [1]  89 120
##
## $`SBP >= 121`
## [1] 121 234
# Create table
TABLE1 <- complete.dat %>%
  # Select the median split variable, not the original variable
  select(sbp_median_split, RIDAGEYR, RIAGENDR, income) %>%
  tbl_summary(
    # Use the median split variable as the "by" variable
    by = sbp_median_split,
    statistic = list(all_continuous()  ~ "{mean} ({sd})",
                     all_categorical() ~ "{n} ({p}%)"),
    digits = list(all_continuous()  ~ c(2, 2),
                  all_categorical() ~ c(0, 1)),
    type = list(RIDAGEYR ~ "continuous",
                RIAGENDR ~ "categorical",
                income   ~ "categorical"),
    label = list(RIDAGEYR ~ "Age (years)",
                 RIAGENDR ~ "Gender",
                 income   ~ "Annual Income")
  ) %>%
  modify_header(
    label = "**Variable**",
    all_stat_cols() ~ "**{level}**<br>N = {n} ({style_percent(p, digits=1)}%)"
  ) %>%
  modify_caption("Participant characteristics, by SBP") %>%
  bold_labels() %>%
  add_overall(last = FALSE,
              col_label = "**All participants**<br>N = {N}")
Variable | All participants, N = 855¹ | SBP < 121, N = 412 (48.2%)¹ | SBP >= 121, N = 443 (51.8%)¹ |
---|---|---|---|
Age (years) | 48.11 (17.43) | 41.57 (15.84) | 54.19 (16.63) |
Gender | | | |
Male | 425 (49.7%) | 180 (43.7%) | 245 (55.3%) |
Female | 430 (50.3%) | 232 (56.3%) | 198 (44.7%) |
Annual Income | | | |
< $25,000 | 148 (17.3%) | 67 (16.3%) | 81 (18.3%) |
$25,000 to < $55,000 | 248 (29.0%) | 120 (29.1%) | 128 (28.9%) |
$55,000+ | 459 (53.7%) | 225 (54.6%) | 234 (52.8%) |
¹ Mean (SD); n (%) |
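The two SBP groups in Table 3.4 are close in size but not equal (412 and 443). One plausible reason is ties: with the rule used above, every observation equal to the median is placed in the upper group. The toy example below (the vector x is made up for illustration) shows how tied values at the median can make the split uneven.

```r
# Toy data with several values tied at the median
x <- c(100, 110, 120, 120, 120, 120, 130, 140)
med <- median(x)

# Same rule as above: below the median vs. at least the median
grp <- factor(as.numeric(x >= med),
              levels = 0:1,
              labels = c(paste("<", med), paste(">=", med)))

# All tied values land in the upper group
table(grp)
```

When the variable is measured coarsely (for example, in whole units), it can be worth checking the group sizes after a median split, as was done with the range check above.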
3.3.3 Exporting to an external file
To export a gtsummary table to a Microsoft Word or HTML file, start with the tbl_summary object (called TABLE1 above) and then use the flextable (Gohel and Skintzos 2023) or gt (Iannone et al. 2023) package to do the exporting.
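One way this export might look is sketched below. The example builds a small table from the trial dataset bundled with gtsummary so that it runs on its own; the object name tbl_demo and the output file names are hypothetical, and the files are written to a temporary directory. In practice, the TABLE1 object would take the place of tbl_demo.

```r
library(gtsummary)
library(dplyr)

# A small stand-in table (in practice, use TABLE1 from above)
tbl_demo <- trial %>%
  select(age, grade, trt) %>%
  tbl_summary(by = trt)

# Hypothetical output locations
docx_file <- file.path(tempdir(), "table1.docx")
html_file <- file.path(tempdir(), "table1.html")

# Microsoft Word: convert to a flextable object, then save
tbl_demo %>%
  as_flex_table() %>%
  flextable::save_as_docx(path = docx_file)

# HTML: convert to a gt object, then save
tbl_demo %>%
  as_gt() %>%
  gt::gtsave(filename = html_file)
```

Both routes first convert the gtsummary object to the other package's table class, so any formatting specific to flextable or gt can be applied between the conversion and the save step.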
3.3.4 Adding p-values to Table 1
Often, in a published research article, a Table 1 that displays descriptive statistics by the outcome or an exposure includes p-values that test, for each variable, the null hypothesis that the variable has the same mean (or median or proportion) across all groups in the population. P-values can be easily added to a tbl_summary table using add_p() (see ?gtsummary::add_p.tbl_summary for all the options, including the various statistical tests available).
Example 3.1 (continued): Create a table of descriptive statistics, by SBP, including t-tests for continuous variables and chi-square tests for categorical variables.
The code below uses tbl_summary() with sbp_median_split as the by variable, followed by add_p(), to generate Table 3.5.
complete.dat %>%
  select(sbp_median_split, RIDAGEYR, RIAGENDR, income) %>%
  tbl_summary(
    by = sbp_median_split,
    statistic = list(all_continuous()  ~ "{mean} ({sd})",
                     all_categorical() ~ "{n} ({p}%)"),
    digits = list(all_continuous()  ~ c(2, 2),
                  all_categorical() ~ c(0, 1)),
    label = list(RIDAGEYR ~ "Age (years)",
                 RIAGENDR ~ "Gender",
                 income   ~ "Annual Income")
  ) %>%
  add_p(
    test = list(all_continuous()  ~ "t.test",
                all_categorical() ~ "chisq.test"),
    pvalue_fun = function(x) style_pvalue(x, digits = 3)
  )
Characteristic | SBP < 121, N = 412¹ | SBP >= 121, N = 443¹ | p-value² |
---|---|---|---|
Age (years) | 41.57 (15.84) | 54.19 (16.63) | <0.001 |
Gender | | | <0.001 |
Male | 180 (43.7%) | 245 (55.3%) | |
Female | 232 (56.3%) | 198 (44.7%) | |
Annual Income | | | 0.728 |
< $25,000 | 67 (16.3%) | 81 (18.3%) | |
$25,000 to < $55,000 | 120 (29.1%) | 128 (28.9%) | |
$55,000+ | 225 (54.6%) | 234 (52.8%) | |
¹ Mean (SD); n (%) |
² Welch Two Sample t-test; Pearson’s Chi-squared test |
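To see what kind of tests sit behind these p-values, the sketch below runs a Welch two-sample t-test and a Pearson chi-squared test directly in base R. It uses simulated data (the data frame dat and its variables are hypothetical, not the NHANES data), so the numbers will not match Table 3.5; the point is only to show the underlying function calls.

```r
# Simulated data for illustration only
set.seed(123)
n <- 300
dat <- data.frame(
  group  = factor(rep(c("Low SBP", "High SBP"), each = n / 2)),
  age    = c(rnorm(n / 2, mean = 42, sd = 16),
             rnorm(n / 2, mean = 54, sd = 16)),
  income = factor(sample(c("Low", "Middle", "High"), n, replace = TRUE))
)

# Continuous variable: Welch two-sample t-test (the t.test() default)
p_age <- t.test(age ~ group, data = dat)$p.value

# Categorical variable: Pearson chi-squared test on the cross-tabulation
p_income <- chisq.test(table(dat$income, dat$group))$p.value

p_age
p_income
```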
3.3.5 Should p-values be added to a Table 1?
It is very easy to add p-values to a Table 1, but are they recommended? In general, no, regardless of whether the data arose from an observational study (Vandenbroucke et al. 2007) or a randomized trial (Moher et al. 2010).
If displaying sample characteristics by the outcome for the purpose of providing crude (unadjusted) tests of association between a set of predictors and the outcome, then including p-values in Table 1 does make some sense. However, typically the subsequent regression analysis will provide adjusted tests of association, and conclusions will be drawn from that adjusted analysis, so unadjusted tests may not be relevant.
A potential reason for displaying p-values in a Table 1 of participant characteristics by the primary exposure of interest is to attempt to demonstrate the extent to which the characteristics differ between exposure groups and may therefore confound the outcome-exposure relationship. But what matters for confounding is not whether the groups differ in the population (which is what the p-values are testing) but how much they differ in this sample. P-values, in this context, are not relevant; what matters are the magnitudes of differences in characteristics between the exposure groups, as well as the magnitude of association between characteristics and the outcome (Vandenbroucke et al. 2007).
Yes, p-values are related to the magnitude of differences, but they are very dependent on the sample size, as well. In a small sample, even a meaningfully large difference might not lead to a small p-value, resulting in an incorrect conclusion of “no confounding.” Conversely, in a large sample, even a small, non-meaningful difference might result in a small p-value, resulting in an incorrect conclusion of “confounding.” Yet another error can occur in the case of a small, seemingly non-meaningful difference that is not statistically significant but which is for a predictor that is very strongly associated with the outcome. The p-value would lead to a conclusion of “no confounding,” yet even a small difference between exposure groups in a predictor strongly associated with the outcome can result in meaningful confounding (Dales and Ury 1978; Vandenbroucke et al. 2007). Even worse, if the exposure groups were determined using randomization (e.g., a randomized clinical trial), then we already know the null hypothesis is true, so p-values are irrelevant (Moher et al. 2010; Altman 1985; Senn 1994). Under randomization, any differences observed between groups in the sample are entirely due to randomness, not to any underlying difference between the groups.
Thus, using p-values to provide evidence for or against confounding can be misleading or even nonsensical. In a confirmatory analysis (see Section 5.23), potential confounders are identified using subject-matter knowledge based on prior research and included in a regression model regardless of their observed associations.
In summary, including p-values in a Table 1 is easy to do, but may not be relevant and can, at times, be misleading or meaningless.
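The dependence of p-values on sample size can be illustrated with a short calculation. The sketch below assumes a two-sample t-test with equal group sizes and equal standard deviations, in which case the t statistic is the standardized mean difference multiplied by the square root of half the per-group sample size. The function name p_from_smd and the chosen values are hypothetical, for illustration only.

```r
# p-value of a pooled two-sample t-test, given a standardized mean
# difference d (difference in means / common SD) and n per group
p_from_smd <- function(d, n_per_group) {
  t_stat <- d * sqrt(n_per_group / 2)
  df <- 2 * n_per_group - 2
  2 * pt(-abs(t_stat), df)
}

# A large difference (half an SD) in a small sample
p_small_sample <- p_from_smd(d = 0.50, n_per_group = 10)

# A tiny difference (one-twentieth of an SD) in a large sample
p_large_sample <- p_from_smd(d = 0.05, n_per_group = 10000)

p_small_sample   # not below 0.05 despite the large difference
p_large_sample   # below 0.05 despite the tiny difference
```

In this sketch, the magnitude of the difference (d) describes how much the groups differ in the sample, while the p-value mostly reflects how many observations there are.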