8.3 Weighted descriptive statistics
There are a number of survey
functions for computing weighted descriptive statistics, as well as a gtsummary
(Sjoberg et al. 2021, 2023) function to conveniently create a “Table 1”. We will compute these statistics overall and by exposure or outcome.
8.3.1 Overall
Example 8.1 (continued): Compute weighted descriptive statistics that estimate values for the population represented by NHANES 2017-2018 participants with non-missing values on all our analysis variables. Since the outcome variable is fasting glucose, use the design corresponding to the fasting subsample with complete data and positive weights (design.FST.nomiss
).
Examples are shown first for two variables, one continuous and one categorical. Later, we will use gtsummary
to create a Table 1 of weighted descriptive statistics for all the variables.
For continuous variables, use svymean()
, svyvar()
, and svyquantile()
to compute the weighted mean, standard deviation, median, and interquartile range, and svyhist()
and svyboxplot()
to plot weighted histograms and boxplots (Figures 8.1 and 8.2, respectively). When using confint()
to get a 95% confidence interval, add df = degf(design)
to use the design DF.
NOTE: If not using a complete case analysis and a variable has missing values, add na.rm = T
to each svymean
or svyvar
call to avoid returning a missing value. When using confint(svymean())
, the na.rm=T
must go in the svymean()
call, not the confint()
call (e.g., confint(svymean(~X, design, na.rm=T), df = degf(design))
).
# Weighted mean, standard deviation, and 95% CI
cbind(
"wMEAN" = svymean( ~LBDGLUSI, design.FST.nomiss),
"wSD" = sqrt(svyvar(~LBDGLUSI, design.FST.nomiss)[1]),
confint(svymean(~LBDGLUSI, design.FST.nomiss), df = degf(design.FST.nomiss))
)
## wMEAN wSD 2.5 % 97.5 %
## LBDGLUSI 6.106 1.763 5.976 6.236
# Weighted median, 95% CI, and standard error
# df = degf(design) is already the default for svyquantile
svyquantile(~LBDGLUSI, design.FST.nomiss, 0.50)[[1]]
## quantile ci.2.5 ci.97.5 se
## 0.5 5.72 5.66 5.83 0.03988
# Weighted interquartile range
c("wIQR" =
svyquantile(~LBDGLUSI, design.FST.nomiss, 0.75)[[1]][1, "quantile"] -
svyquantile(~LBDGLUSI, design.FST.nomiss, 0.25)[[1]][1, "quantile"])
## wIQR
## 0.83
par(mfrow=c(1,2))
# Unweighted probability histogram
hist(nhanes.mod$LBDGLUSI[nhanes.mod$nomiss], probability = T,
xlab = "Fasting Glucose (mmol/L)", main = "Unweighted",
ylim = c(0,0.35))
# Weighted probability histogram
svyhist( ~LBDGLUSI, design.FST.nomiss,
xlab = "Fasting Glucose (mmol/L)", main = "Weighted",
ylim = c(0,0.35))

Figure 8.1: Unweighted vs. weighted histograms
par(mfrow=c(1,2))
# Unweighted boxplot
boxplot(nhanes.mod$LBDGLUSI[nhanes.mod$nomiss],
ylab = "Fasting Glucose (mmol/L)",
main = "Unweighted")
# Weighted boxplot
svyboxplot(~LBDGLUSI~1, design.FST.nomiss, all.outliers = T,
ylab = "Fasting Glucose (mmol/L)",
main = "Weighted")

Figure 8.2: Unweighted vs. weighted boxplots
For categorical variables, use svytotal()
or svytable()
to estimate population totals. In a sample we compute sample frequencies but the corresponding population quantity is a population total. Use svymean()
, svytable()
or svyciprop()
to estimate population proportions. Use barplot()
on the output of svytable()
to plot a weighted bar chart (Figure 8.3).
NOTE: If not using a complete case analysis and a variable has missing values, add na.rm = T
to each svytotal()
or svymean()
call to avoid returning a missing value.
# Weighted total with SE of total
svytotal(~race_eth, design.FST.nomiss)
## total SE
## race_ethHispanic 32712402 4552659
## race_ethNon-Hispanic White 137704704 10603571
## race_ethNon-Hispanic Black 23063852 3388626
## race_ethNon-Hispanic Other 21072543 2998573
# Weighted total (no SE)
svytable(~race_eth, design.FST.nomiss)
## race_eth
## Hispanic Non-Hispanic White Non-Hispanic Black Non-Hispanic Other
## 32712402 137704704 23063852 21072543
# Weighted proportion using svymean()
svymean(~race_eth, design.FST.nomiss)
## mean SE
## race_ethHispanic 0.1525 0.02
## race_ethNon-Hispanic White 0.6418 0.03
## race_ethNon-Hispanic Black 0.1075 0.02
## race_ethNon-Hispanic Other 0.0982 0.01
# Weighted proportion using svytable by normalizing the total to 1
svytable(~race_eth, design.FST.nomiss, Ntotal=1)
## race_eth
## Hispanic Non-Hispanic White Non-Hispanic Black Non-Hispanic Other
## 0.15247 0.64182 0.10750 0.09822
# Weighted proportion with 95% CI
# df = degf(design) is already the default for svyciprop
svyciprop(~I(race_eth=="Hispanic"), design.FST.nomiss)
## 2.5% 97.5%
## I(race_eth == "Hispanic") 0.152 0.113 0.2
# (results for other 3 levels not shown)
svyciprop(~I(race_eth=="Non-Hispanic White"), design.FST.nomiss)
svyciprop(~I(race_eth=="Non-Hispanic Black"), design.FST.nomiss)
svyciprop(~I(race_eth=="Non-Hispanic Other"), design.FST.nomiss)
par(mfrow=c(2,1))
# Unweighted barplot
barplot(prop.table(table(nhanes.mod$race_eth[nhanes.mod$nomiss])),
ylab = "Proportion", xlab = "Race/Ethnicity",
main = "Unweighted", cex.names = 0.65)
# Weighted barplot
<- svytable(~race_eth, design.FST.nomiss, Ntotal=1)
mybar barplot(mybar,
ylab = "Proportion", xlab = "Race/Ethnicity",
main = "Weighted", cex.names = 0.65)

Figure 8.3: Unweighted vs. weighted bar charts
NHANES over-sampled minority groups in order to have sufficient sample sizes for subgroup analyses. Thus, there are large differences between the unweighted and weighted distributions of race/ethnicity. Unweighted proportions reflect the composition of the sample, whereas weighted proportions estimate the population distribution.
Finally, use tbl_svysummary()
from the gtsummary
library to produce a Table 1 of weighted descriptive statistics. Instead of starting with a dataset as we did in unweighted analyses, start with the design object (design.FST.nomiss
). For categorical variables, \(N\) and \(n\) values in the table are estimated population totals – when we did unweighted analyses, these were sample sizes.
The results are shown in Table 8.1.
library(gtsummary)
%>%
design.FST.nomiss tbl_svysummary(
# Use include to select variables
include = c(LBDGLUSI, BMXWAIST, RIDAGEYR,
smoker, RIAGENDR, race_eth, income),statistic = list(all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"),
digits = list(LBDGLUSI ~ c(2, 2),
~ c(1, 1),
BMXWAIST ~ c(1, 1),
RIDAGEYR all_categorical() ~ c(0, 1))
%>%
) modify_header(label = "**Variable**") %>%
modify_caption("Weighted descriptive statistics") %>%
bold_labels()
Variable | N = 214,553,5011 |
---|---|
LBDGLUSI | 6.11 (1.76) |
BMXWAIST | 100.2 (17.1) |
RIDAGEYR | 47.0 (17.5) |
smoker | |
Never | 123,589,210 (57.6%) |
Past | 55,919,877 (26.1%) |
Current | 35,044,414 (16.3%) |
RIAGENDR | |
Male | 104,286,253 (48.6%) |
Female | 110,267,248 (51.4%) |
race_eth | |
Hispanic | 32,712,402 (15.2%) |
Non-Hispanic White | 137,704,704 (64.2%) |
Non-Hispanic Black | 23,063,852 (10.7%) |
Non-Hispanic Other | 21,072,543 (9.8%) |
income | |
< $25,000 | 38,141,114 (17.8%) |
$25,000 to < $55,000 | 57,342,659 (26.7%) |
$55,000+ | 119,069,729 (55.5%) |
1 Mean (SD); n (%) |
8.3.2 By exposure or outcome
Example 8.1 (continued): Compute weighted descriptive statistics by smoking status.
For svymean()
, svyvar()
, svyquantile()
, svytotal()
, and svyciprop()
, use the wrapper function svyby()
to compute statistics at each level of another variable. For svytable()
, however, add the by variable to the formula.
# Continuous variables, by smoking status
# Weighted mean, standard deviation, and 95% CI
<- cbind(
WTD svyby(~LBDGLUSI, ~smoker, design.FST.nomiss, svymean)[, 1:2],
svyby(~LBDGLUSI, ~smoker, design.FST.nomiss, svyvar)[ , 2],
confint(svyby(~LBDGLUSI, ~smoker, design.FST.nomiss, svymean),
df = degf(design.FST.nomiss))
)3] <- sqrt(WTD[, 3])
WTD[, names(WTD)[3] <- paste(names(WTD)[2], "wSD")
names(WTD)[2] <- paste(names(WTD)[2], "wMEAN")
WTD
## smoker LBDGLUSI wMEAN LBDGLUSI wSD 2.5 % 97.5 %
## Never Never 5.965 1.543 5.796 6.135
## Past Past 6.457 2.066 6.297 6.617
## Current Current 6.041 1.894 5.798 6.285
# Weighted median, standard error, and 95% CI
# df = degf(design) is already the default for svyquantile
# but not for confint
cbind(
svyby(~LBDGLUSI, ~smoker, design.FST.nomiss, svyquantile, 0.50),
confint(svyby(~LBDGLUSI, ~smoker, design.FST.nomiss, svyquantile, 0.50),
df = degf(design.FST.nomiss))
)
## smoker LBDGLUSI se.LBDGLUSI 2.5 % 97.5 %
## Never Never 5.66 0.03753 5.580 5.740
## Past Past 5.83 0.03988 5.745 5.915
## Current Current 5.66 0.06568 5.520 5.800
# Weighted interquartile range
<- as.data.frame(
Q75 svyby(~LBDGLUSI, ~smoker, design.FST.nomiss, svyquantile, 0.75)
1:2]
)[,<- as.data.frame(
Q25 svyby(~LBDGLUSI, ~smoker, design.FST.nomiss, svyquantile, 0.25)
1:2]
)[,names(Q75)[2] <- paste(names(Q75)[2], "wQ75")
names(Q25)[2] <- paste(names(Q25)[2], "wQ25")
<- merge(Q75, Q25)
WIQR $wIQR <- WIQR[,2] - WIQR[,3]
WIQR WIQR
## smoker LBDGLUSI wQ75 LBDGLUSI wQ25 wIQR
## 1 Current 6.11 5.27 0.84
## 2 Never 6.11 5.27 0.84
## 3 Past 6.49 5.50 0.99
# Categorical variables, by smoking status
# Weighted total
svytable(~race_eth + smoker, design.FST.nomiss)
## smoker
## race_eth Never Past Current
## Hispanic 20524606 7861212 4326583
## Non-Hispanic White 75120526 40106079 22478099
## Non-Hispanic Black 15137153 3487312 4439386
## Non-Hispanic Other 12806924 4465273 3800346
# Weighted total with SE of total
svyby(~race_eth, ~smoker, design.FST.nomiss, svytotal)
# (results not shown)
Be careful when normalizing svytable()
to get proportions. Using Ntotal=1
as we did previously normalizes the frequencies to the overall total. Rather, we want to normalize to the column totals to get proportions within each level of smoking status. This is accomplished using prop.table(, margin = 2)
. Optionally, use addmargins(, 1)
to confirm that each column sums to 1.
addmargins(
prop.table(svytable(~race_eth + smoker, design.FST.nomiss), margin = 2)
1) ,
## smoker
## race_eth Never Past Current
## Hispanic 0.16607 0.14058 0.12346
## Non-Hispanic White 0.60782 0.71721 0.64142
## Non-Hispanic Black 0.12248 0.06236 0.12668
## Non-Hispanic Other 0.10362 0.07985 0.10844
## Sum 1.00000 1.00000 1.00000
# Weighted proportion with SE
svyby(~I(race_eth=="Hispanic"), ~smoker, design.FST.nomiss, svyciprop)
## smoker I(race_eth == "Hispanic") se.as.numeric(I(race_eth == "Hispanic"))
## Never Never 0.1661 0.02650
## Past Past 0.1406 0.01976
## Current Current 0.1235 0.02512
# (results for other levels not shown)
svyby(~I(race_eth=="Non-Hispanic White"), ~smoker, design.FST.nomiss, svyciprop)
svyby(~I(race_eth=="Non-Hispanic Black"), ~smoker, design.FST.nomiss, svyciprop)
svyby(~I(race_eth=="Non-Hispanic Other"), ~smoker, design.FST.nomiss, svyciprop)
Finally, use tbl_svysummary()
from the gtsummary
library to produce a Table 1 of weighted descriptive statistics by smoking status. Be careful, however, to use a character version of the smoking status variable; using the factor smoker
will lead to an error. When we created this dataset in Section 8.2, we created smoker_char
, a character formatted version of smoker
.
The results are shown in Table 8.2.
library(gtsummary)
%>%
design.FST.nomiss tbl_svysummary(
# Use a character variable here. A factor leads to an error
by = smoker_char,
# Use include to select variables
include = c(LBDGLUSI, BMXWAIST, RIDAGEYR,
RIAGENDR, race_eth, income),statistic = list(all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"),
digits = list(LBDGLUSI ~ c(2, 2),
~ c(1, 1),
BMXWAIST ~ c(1, 1),
RIDAGEYR all_categorical() ~ c(0, 1))
%>%
) modify_header(label = "**Variable**",
all_stat_cols() ~ "**{level}**<br>N = {n} ({style_percent(p, digits=1)}%)") %>%
modify_caption("Weighted descriptive statistics, by smoking status") %>%
bold_labels()
Variable | 1 Never N = 123589210 (57.6%)1 | 2 Past N = 55919877 (26.1%)1 | 3 Current N = 35044414 (16.3%)1 |
---|---|---|---|
LBDGLUSI | 5.97 (1.54) | 6.46 (2.07) | 6.04 (1.89) |
BMXWAIST | 98.6 (17.0) | 104.0 (16.6) | 99.9 (17.4) |
RIDAGEYR | 45.4 (17.7) | 51.5 (18.0) | 45.4 (14.4) |
RIAGENDR | |||
Male | 51,169,518 (41.4%) | 34,805,126 (62.2%) | 18,311,610 (52.3%) |
Female | 72,419,692 (58.6%) | 21,114,752 (37.8%) | 16,732,805 (47.7%) |
race_eth | |||
Hispanic | 20,524,606 (16.6%) | 7,861,212 (14.1%) | 4,326,583 (12.3%) |
Non-Hispanic White | 75,120,526 (60.8%) | 40,106,079 (71.7%) | 22,478,099 (64.1%) |
Non-Hispanic Black | 15,137,153 (12.2%) | 3,487,312 (6.2%) | 4,439,386 (12.7%) |
Non-Hispanic Other | 12,806,924 (10.4%) | 4,465,273 (8.0%) | 3,800,346 (10.8%) |
income | |||
< $25,000 | 18,081,806 (14.6%) | 9,269,756 (16.6%) | 10,789,552 (30.8%) |
$25,000 to < $55,000 | 31,806,525 (25.7%) | 15,168,200 (27.1%) | 10,367,933 (29.6%) |
$55,000+ | 73,700,879 (59.6%) | 31,481,921 (56.3%) | 13,886,929 (39.6%) |
1 Mean (SD); n (%) |