8.5 Domain (subgroup) analysis
Sometimes we are interested in estimating quantities among subgroups of the population, such as “females age 45 years and older.” This is called a domain analysis or subpopulation analysis. In an unweighted analysis, one would subset the dataset to filter out those who do not meet the criteria and then run the analysis on the resulting subset of observations. In a weighted analysis, however, removing observations from the dataset results in incorrect standard errors, confidence intervals, and p-values. The information about the sample design found in the excluded cases is needed to incorporate the sampling design correctly. Instead, use the subset()
function to subset the design, that is, to specify the domain of interest in the design object. Rather than actually removing observations, this retains all observations and lets survey
know what subset you are interested in. This is exactly what we did earlier, using a complete case indicator variable (nomiss
) to subset on those with non-missing values. However, here, we show how to apply this method more generally to any subpopulation.
Example 8.1 (continued): Repeat the creation of a Table 1 and the linear regression from Example 8.1 among females age 45 years and older. Due to the inclusion criteria, gender must be removed from the analysis since it no longer varies between individuals.
Previously, we used subset()
to specify a domain with no missing values and positive weights. Here, we add additional conditions to limit to females age 45 years and older.
<- subset(design.FST,
design.FST.domain & RIAGENDR == "Female" & RIDAGEYR >= 45) nomiss
NOTE: If the process of specifying your domain of interest involves creating new variables, you must re-create your design object to make those variables available to the design before using subset()
on the survey.design
object.
Use tbl_svysummary()
to create a table of descriptive statistics. The results are shown in Table 8.4.
library(gtsummary)
%>%
design.FST.domain tbl_svysummary(
# Use a character variable here. A factor leads to an error
by = smoker_char,
# Use include to select variables
include = c(LBDGLUSI, BMXWAIST, RIDAGEYR, race_eth, income),
statistic = list(all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"),
digits = list(LBDGLUSI ~ c(2, 2),
~ c(1, 1),
BMXWAIST ~ c(1, 1),
RIDAGEYR all_categorical() ~ c(0, 1))
%>%
) modify_header(label = "**Variable**",
all_stat_cols() ~ "**{level}**<br>N = {n} ({style_percent(p, digits=1)}%)") %>%
modify_caption("Weighted descriptive statistics, by smoking status
(Females age 45y and older)") %>%
bold_labels()
Variable | 1 Never N = 40923599 (65.8%)1 | 2 Past N = 13054149 (21.0%)1 | 3 Current N = 8207032 (13.2%)1 |
---|---|---|---|
LBDGLUSI | 6.12 (1.55) | 6.46 (2.38) | 6.20 (1.42) |
BMXWAIST | 98.1 (15.1) | 104.4 (18.0) | 98.6 (17.1) |
RIDAGEYR | 60.9 (10.5) | 62.0 (10.3) | 57.0 (7.9) |
race_eth | |||
Hispanic | 4,701,252 (11.5%) | 1,219,766 (9.3%) | 841,551 (10.3%) |
Non-Hispanic White | 28,446,162 (69.5%) | 9,779,559 (74.9%) | 5,853,934 (71.3%) |
Non-Hispanic Black | 4,125,440 (10.1%) | 1,217,709 (9.3%) | 829,841 (10.1%) |
Non-Hispanic Other | 3,650,745 (8.9%) | 837,114 (6.4%) | 681,706 (8.3%) |
income | |||
< $25,000 | 5,158,320 (12.6%) | 3,323,765 (25.5%) | 3,168,536 (38.6%) |
$25,000 to < $55,000 | 10,585,563 (25.9%) | 3,199,300 (24.5%) | 1,753,277 (21.4%) |
$55,000+ | 25,179,717 (61.5%) | 6,531,083 (50.0%) | 3,285,219 (40.0%) |
1 Mean (SD); n (%) |
Fit the model using the subsetted domain design.
1.domain <- svyglm(LBDGLUSI ~ BMXWAIST + smoker + RIDAGEYR +
fit.ex8.+ income,
race_eth family=gaussian(), design=design.FST.domain)
Finally, use tbl_regression()
to create a table of regression coefficients. The results are shown in Table 8.5.
1.domain %>%
fit.ex8.tbl_regression(intercept = T,
estimate_fun = function(x) style_sigfig(x, digits = 3),
pvalue_fun = function(x) style_pvalue(x, digits = 3),
label = list(BMXWAIST ~ "Waist Circumference (cm)",
~ "Smoking Status",
smoker ~ "Age (years)",
RIDAGEYR ~ "Race/ethnicity",
race_eth ~ "Annual Income")) %>%
income add_global_p(keep = T, test.statistic = "F"
# Uncomment to use design df for Type III test p-values
# (but then they will not match the regression term p-values for
# binary predictors)
# , error.df = degf(fit.ex8.1$survey.design)
%>%
) modify_caption("Weighted linear regression results for fasting glucose (mmol/L)
(Females age 45y and older)")
Characteristic | Beta | 95% CI1 | p-value |
---|---|---|---|
(Intercept) | 3.15 | 1.85, 4.46 | 0.001 |
Waist Circumference (cm) | 0.031 | 0.015, 0.047 | 0.003 |
Smoking Status | 0.731 | ||
Never | — | — | |
Past | 0.118 | -0.367, 0.604 | 0.573 |
Current | 0.037 | -0.559, 0.632 | 0.885 |
Age (years) | 0.008 | -0.012, 0.028 | 0.362 |
Race/ethnicity | 0.017 | ||
Hispanic | — | — | |
Non-Hispanic White | -0.480 | -0.924, -0.036 | 0.038 |
Non-Hispanic Black | -0.391 | -1.02, 0.234 | 0.177 |
Non-Hispanic Other | 0.071 | -0.476, 0.618 | 0.762 |
Annual Income | 0.436 | ||
< $25,000 | — | — | |
$25,000 to < $55,000 | -0.124 | -0.501, 0.252 | 0.450 |
$55,000+ | -0.255 | -0.706, 0.196 | 0.216 |
1 CI = Confidence Interval |