## 8.3 Weighted descriptive statistics

There are a number of `survey`

functions for computing weighted descriptive statistics, as well as a `gtsummary`

(Sjoberg et al. 2021, 2023) function to conveniently create a “Table 1”. We will compute these statistics overall and by exposure or outcome.

### 8.3.1 Overall

**Example 8.1 (continued):** Compute weighted descriptive statistics that estimate values for the population represented by NHANES 2017-2018 participants with non-missing values on all our analysis variables. Since the outcome variable is fasting glucose, use the design corresponding to the fasting subsample with complete data and positive weights (`design.FST.nomiss`

).

Examples are shown first for two variables, one continuous and one categorical. Later, we will use `gtsummary`

to create a Table 1 of weighted descriptive statistics for all the variables.

For **continuous** variables, use `svymean()`

, `svyvar()`

, and `svyquantile()`

to compute the weighted mean, standard deviation, median, and interquartile range, and `svyhist()`

and `svyboxplot()`

to plot weighted histograms and boxplots (Figures 8.1 and 8.2, respectively). When using `confint()`

to get a 95% confidence interval, add `df = degf(design)`

to use the design DF.

**NOTE:** If not using a complete case analysis and a variable has missing values, add `na.rm = T`

to each `svymean`

or `svyvar`

call to avoid returning a missing value. When using `confint(svymean())`

, the `na.rm=T`

must go in the `svymean()`

call, not the `confint()`

call (e.g., `confint(svymean(~X, design, na.rm=T), df = degf(design))`

).

```
# Weighted mean, standard deviation, and 95% CI
cbind(
"wMEAN" = svymean( ~LBDGLUSI, design.FST.nomiss),
"wSD" = sqrt(svyvar(~LBDGLUSI, design.FST.nomiss)[1]),
confint(svymean(~LBDGLUSI, design.FST.nomiss), df = degf(design.FST.nomiss))
)
```

```
## wMEAN wSD 2.5 % 97.5 %
## LBDGLUSI 6.106 1.763 5.976 6.236
```

```
# Weighted median, 95% CI, and standard error
# df = degf(design) is already the default for svyquantile
svyquantile(~LBDGLUSI, design.FST.nomiss, 0.50)[[1]]
```

```
## quantile ci.2.5 ci.97.5 se
## 0.5 5.72 5.66 5.83 0.03988
```

```
# Weighted interquartile range
c("wIQR" =
svyquantile(~LBDGLUSI, design.FST.nomiss, 0.75)[[1]][1, "quantile"] -
svyquantile(~LBDGLUSI, design.FST.nomiss, 0.25)[[1]][1, "quantile"])
```

```
## wIQR
## 0.83
```

```
par(mfrow=c(1,2))
# Unweighted probability histogram
hist(nhanes.mod$LBDGLUSI[nhanes.mod$nomiss], probability = T,
xlab = "Fasting Glucose (mmol/L)", main = "Unweighted",
ylim = c(0,0.35))
# Weighted probability histogram
svyhist( ~LBDGLUSI, design.FST.nomiss,
xlab = "Fasting Glucose (mmol/L)", main = "Weighted",
ylim = c(0,0.35))
```

```
par(mfrow=c(1,2))
# Unweighted boxplot
boxplot(nhanes.mod$LBDGLUSI[nhanes.mod$nomiss],
ylab = "Fasting Glucose (mmol/L)",
main = "Unweighted")
# Weighted boxplot
svyboxplot(~LBDGLUSI ~ 1, design.FST.nomiss, all.outliers = T,
ylab = "Fasting Glucose (mmol/L)",
main = "Weighted")
```

For **categorical** variables, use `svytotal()`

or `svytable()`

to estimate population totals. In a sample we compute sample frequencies but the corresponding population quantity is a population total. Use `svymean()`

, `svytable()`

or `svyciprop()`

to estimate population proportions. Use `barplot()`

on the output of `svytable()`

to plot a weighted bar chart (Figure 8.3).

**NOTE:** If not using a complete case analysis and a variable has missing values, add `na.rm = T`

to each `svytotal()`

or `svymean()`

call to avoid returning a missing value.

```
## total SE
## race_ethHispanic 32712402 4552659
## race_ethNon-Hispanic White 137704704 10603571
## race_ethNon-Hispanic Black 23063852 3388626
## race_ethNon-Hispanic Other 21072543 2998573
```

```
## race_eth
## Hispanic Non-Hispanic White Non-Hispanic Black Non-Hispanic Other
## 32712402 137704704 23063852 21072543
```

```
## mean SE
## race_ethHispanic 0.1525 0.02
## race_ethNon-Hispanic White 0.6418 0.03
## race_ethNon-Hispanic Black 0.1075 0.02
## race_ethNon-Hispanic Other 0.0982 0.01
```

```
# Weighted proportion using svytable by normalizing the total to 1
svytable(~race_eth, design.FST.nomiss, Ntotal=1)
```

```
## race_eth
## Hispanic Non-Hispanic White Non-Hispanic Black Non-Hispanic Other
## 0.15247 0.64182 0.10750 0.09822
```

```
# Weighted proportion with 95% CI
# df = degf(design) is already the default for svyciprop
svyciprop(~I(race_eth=="Hispanic"), design.FST.nomiss)
```

```
## 2.5% 97.5%
## I(race_eth == "Hispanic") 0.152 0.113 0.2
```

```
# (results for other 3 levels not shown)
svyciprop(~I(race_eth=="Non-Hispanic White"), design.FST.nomiss)
svyciprop(~I(race_eth=="Non-Hispanic Black"), design.FST.nomiss)
svyciprop(~I(race_eth=="Non-Hispanic Other"), design.FST.nomiss)
```

```
par(mfrow=c(2,1))
# Unweighted barplot
barplot(prop.table(table(nhanes.mod$race_eth[nhanes.mod$nomiss])),
ylab = "Proportion", xlab = "Race/Ethnicity",
main = "Unweighted", cex.names = 0.65)
# Weighted barplot
mybar <- svytable(~race_eth, design.FST.nomiss, Ntotal=1)
barplot(mybar,
ylab = "Proportion", xlab = "Race/Ethnicity",
main = "Weighted", cex.names = 0.65)
```

NHANES over-sampled minority groups in order to have sufficient sample sizes for subgroup analyses. Thus, there are large differences between the unweighted and weighted distributions of race/ethnicity. Unweighted proportions reflect the composition of the sample, whereas weighted proportions estimate the population distribution.

Finally, use `tbl_svysummary()`

from the `gtsummary`

library to produce a Table 1 of weighted descriptive statistics. Instead of starting with a dataset as we did in unweighted analyses, start with the design object (`design.FST.nomiss`

). For categorical variables, \(N\) and \(n\) values in the table are estimated population totals – when we did unweighted analyses, these were sample sizes.

The results are shown in Table 8.1.

```
library(gtsummary)
design.FST.nomiss %>%
tbl_svysummary(
# Use include to select variables
include = c(LBDGLUSI, BMXWAIST, RIDAGEYR,
smoker, RIAGENDR, race_eth, income),
statistic = list(all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"),
digits = list(LBDGLUSI ~ c(2, 2),
BMXWAIST ~ c(1, 1),
RIDAGEYR ~ c(1, 1),
all_categorical() ~ c(0, 1))
) %>%
modify_header(label = "**Variable**") %>%
modify_caption("Weighted descriptive statistics") %>%
bold_labels()
```

Variable |
N = 214,553,501^{1} |
---|---|

LBDGLUSI | 6.11 (1.76) |

BMXWAIST | 100.2 (17.1) |

RIDAGEYR | 47.0 (17.5) |

smoker | |

Never | 123,589,210 (57.6%) |

Past | 55,919,877 (26.1%) |

Current | 35,044,414 (16.3%) |

RIAGENDR | |

Male | 104,286,253 (48.6%) |

Female | 110,267,248 (51.4%) |

race_eth | |

Hispanic | 32,712,402 (15.2%) |

Non-Hispanic White | 137,704,704 (64.2%) |

Non-Hispanic Black | 23,063,852 (10.7%) |

Non-Hispanic Other | 21,072,543 (9.8%) |

income | |

< $25,000 | 38,141,114 (17.8%) |

$25,000 to < $55,000 | 57,342,659 (26.7%) |

$55,000+ | 119,069,729 (55.5%) |

^{1} Mean (SD); n (%) |

### 8.3.2 By exposure or outcome

**Example 8.1 (continued):** Compute weighted descriptive statistics by smoking status.

For `svymean()`

, `svyvar()`

, `svyquantile()`

, `svytotal()`

, and `svyciprop()`

, use the wrapper function `svyby()`

to compute statistics at each level of another variable. For `svytable()`

, however, add the by variable to the formula.

```
# Continuous variables, by smoking status
# Weighted mean, standard deviation, and 95% CI
WTD <- cbind(
svyby(~LBDGLUSI, ~smoker, design.FST.nomiss, svymean)[, 1:2],
svyby(~LBDGLUSI, ~smoker, design.FST.nomiss, svyvar)[ , 2],
confint(svyby(~LBDGLUSI, ~smoker, design.FST.nomiss, svymean),
df = degf(design.FST.nomiss))
)
WTD[, 3] <- sqrt(WTD[, 3])
names(WTD)[3] <- paste(names(WTD)[2], "wSD")
names(WTD)[2] <- paste(names(WTD)[2], "wMEAN")
WTD
```

```
## smoker LBDGLUSI wMEAN LBDGLUSI wSD 2.5 % 97.5 %
## Never Never 5.965 1.543 5.796 6.135
## Past Past 6.457 2.066 6.297 6.617
## Current Current 6.041 1.894 5.798 6.285
```

```
# Weighted median, standard error, and 95% CI
# df = degf(design) is already the default for svyquantile
# but not for confint
cbind(
svyby(~LBDGLUSI, ~smoker, design.FST.nomiss, svyquantile, 0.50),
confint(svyby(~LBDGLUSI, ~smoker, design.FST.nomiss, svyquantile, 0.50),
df = degf(design.FST.nomiss))
)
```

```
## smoker LBDGLUSI se.LBDGLUSI 2.5 % 97.5 %
## Never Never 5.66 0.03753 5.580 5.740
## Past Past 5.83 0.03988 5.745 5.915
## Current Current 5.66 0.06568 5.520 5.800
```

```
# Weighted interquartile range
Q75 <- as.data.frame(
svyby(~LBDGLUSI, ~smoker, design.FST.nomiss, svyquantile, 0.75)
)[,1:2]
Q25 <- as.data.frame(
svyby(~LBDGLUSI, ~smoker, design.FST.nomiss, svyquantile, 0.25)
)[,1:2]
names(Q75)[2] <- paste(names(Q75)[2], "wQ75")
names(Q25)[2] <- paste(names(Q25)[2], "wQ25")
WIQR <- merge(Q75, Q25)
WIQR$wIQR <- WIQR[,2] - WIQR[,3]
WIQR
```

```
## smoker LBDGLUSI wQ75 LBDGLUSI wQ25 wIQR
## 1 Current 6.11 5.27 0.84
## 2 Never 6.11 5.27 0.84
## 3 Past 6.49 5.50 0.99
```

```
# Categorical variables, by smoking status
# Weighted total
svytable(~race_eth + smoker, design.FST.nomiss)
```

```
## smoker
## race_eth Never Past Current
## Hispanic 20524606 7861212 4326583
## Non-Hispanic White 75120526 40106079 22478099
## Non-Hispanic Black 15137153 3487312 4439386
## Non-Hispanic Other 12806924 4465273 3800346
```

```
# Weighted total with SE of total
svyby(~race_eth, ~smoker, design.FST.nomiss, svytotal)
# (results not shown)
```

Be careful when normalizing `svytable()`

to get proportions. Using `Ntotal=1`

as we did previously normalizes the frequencies to the overall total. Rather, we want to normalize to the column totals to get proportions within each level of smoking status. This is accomplished using `prop.table(, margin = 2)`

. Optionally, use `addmargins(, 1)`

to confirm that each column sums to 1.

```
# Weighted proportion
addmargins(
prop.table(svytable(~race_eth + smoker, design.FST.nomiss), margin = 2)
, 1)
```

```
## smoker
## race_eth Never Past Current
## Hispanic 0.16607 0.14058 0.12346
## Non-Hispanic White 0.60782 0.71721 0.64142
## Non-Hispanic Black 0.12248 0.06236 0.12668
## Non-Hispanic Other 0.10362 0.07985 0.10844
## Sum 1.00000 1.00000 1.00000
```

```
# Weighted proportion with SE
svyby(~I(race_eth=="Hispanic"), ~smoker, design.FST.nomiss, svyciprop)
```

```
## smoker I(race_eth == "Hispanic") se.as.numeric(I(race_eth == "Hispanic"))
## Never Never 0.1661 0.02650
## Past Past 0.1406 0.01976
## Current Current 0.1235 0.02512
```

```
# (results for other levels not shown)
svyby(~I(race_eth=="Non-Hispanic White"), ~smoker, design.FST.nomiss, svyciprop)
svyby(~I(race_eth=="Non-Hispanic Black"), ~smoker, design.FST.nomiss, svyciprop)
svyby(~I(race_eth=="Non-Hispanic Other"), ~smoker, design.FST.nomiss, svyciprop)
```

Finally, use `tbl_svysummary()`

from the `gtsummary`

library to produce a Table 1 of weighted descriptive statistics by smoking status. Be careful, however, to use a character version of the smoking status variable; using the factor `smoker`

will lead to an error. When we created this dataset in Section 8.2, we created `smoker_char`

, a character formatted version of `smoker`

.

The results are shown in Table 8.2.

```
library(gtsummary)
design.FST.nomiss %>%
tbl_svysummary(
# Use a character variable here. A factor leads to an error
by = smoker_char,
# Use include to select variables
include = c(LBDGLUSI, BMXWAIST, RIDAGEYR,
RIAGENDR, race_eth, income),
statistic = list(all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"),
digits = list(LBDGLUSI ~ c(2, 2),
BMXWAIST ~ c(1, 1),
RIDAGEYR ~ c(1, 1),
all_categorical() ~ c(0, 1))
) %>%
modify_header(label = "**Variable**",
all_stat_cols() ~ "**{level}**<br>N = {n} ({style_percent(p, digits=1)}%)") %>%
modify_caption("Weighted descriptive statistics, by smoking status") %>%
bold_labels()
```

Variable |
1 Never
N = 123589210 (57.6%)^{1} |
2 Past
N = 55919877 (26.1%)^{1} |
3 Current
N = 35044414 (16.3%)^{1} |
---|---|---|---|

LBDGLUSI | 5.97 (1.54) | 6.46 (2.07) | 6.04 (1.89) |

BMXWAIST | 98.6 (17.0) | 104.0 (16.6) | 99.9 (17.4) |

RIDAGEYR | 45.4 (17.7) | 51.5 (18.0) | 45.4 (14.4) |

RIAGENDR | |||

Male | 51,169,518 (41.4%) | 34,805,126 (62.2%) | 18,311,610 (52.3%) |

Female | 72,419,692 (58.6%) | 21,114,752 (37.8%) | 16,732,805 (47.7%) |

race_eth | |||

Hispanic | 20,524,606 (16.6%) | 7,861,212 (14.1%) | 4,326,583 (12.3%) |

Non-Hispanic White | 75,120,526 (60.8%) | 40,106,079 (71.7%) | 22,478,099 (64.1%) |

Non-Hispanic Black | 15,137,153 (12.2%) | 3,487,312 (6.2%) | 4,439,386 (12.7%) |

Non-Hispanic Other | 12,806,924 (10.4%) | 4,465,273 (8.0%) | 3,800,346 (10.8%) |

income | |||

< $25,000 | 18,081,806 (14.6%) | 9,269,756 (16.6%) | 10,789,552 (30.8%) |

$25,000 to < $55,000 | 31,806,525 (25.7%) | 15,168,200 (27.1%) | 10,367,933 (29.6%) |

$55,000+ | 73,700,879 (59.6%) | 31,481,921 (56.3%) | 13,886,929 (39.6%) |

^{1} Mean (SD); n (%) |

### References

*Gtsummary: Presentation-Ready Data Summary and Analytic Result Tables*. https://github.com/ddsjoberg/gtsummary.

*The R Journal*13: 570–80. https://doi.org/10.32614/RJ-2021-053.