## 8.5 Domain (subgroup) analysis

Sometimes we are interested in estimating quantities among subgroups of the population, such as “females age 45 years and older.” This is called a **domain analysis** or **subgroup analysis**. In an unweighted analysis, one would subset the *dataset* to filter out those who do not meet the criteria and then run the analysis on the resulting subset of observations. In a weighted analysis, however, removing observations from the dataset results in incorrect standard errors, confidence intervals, and p-values. The information about the sample design found in the excluded cases is needed to incorporate the sampling design correctly. Instead, use the `subset()`

function to subset the design, that is, to specify the domain of interest *in the design object.* Rather than actually removing observations, this retains all observations and lets `survey`

know what subset you are interested in. This is exactly what we did earlier, using a complete case indicator variable (`nomiss`

) to subset on those with non-missing values. However, here, we show how to apply this method more generally to any subpopulation.

**Example 8.1 (continued):** Repeat the creation of a Table 1 of descriptive statistics and the linear regression from Example 8.1 among females age 45 years and older. Due to the inclusion criteria, gender must be removed from the analysis since it no longer varies between individuals.

Previously, we used `subset()`

to specify a domain with no missing values and positive weights. Here, we add additional conditions to limit to females age 45 years and older.

**NOTE:** If the process of specifying your domain of interest involves creating new variables, you must re-create your design object to make those variables available to the design before using `subset()`

on the `survey.design`

object.

Use `tbl_svysummary()`

to create a table of descriptive statistics. The results are shown in Table 8.4.

```
library(gtsummary)
TABLE <- design.FST.domain %>%
tbl_svysummary(
by = smoker,
# Use include to select variables
include = c(LBDGLUSI, BMXWAIST, RIDAGEYR, race_eth, income),
statistic = list(all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"),
digits = list(LBDGLUSI ~ c(2, 2),
BMXWAIST ~ c(1, 1),
RIDAGEYR ~ c(1, 1),
all_categorical() ~ c(0, 1)),
label = list(LBDGLUSI ~ "Fasting glucose (mmol/L)",
BMXWAIST ~ "Waist Circumference (cm)",
RIDAGEYR ~ "Age (years)",
race_eth ~ "Race/ethnicity",
income ~ "Annual Income")
) %>%
modify_header(label = "**Variable**",
all_stat_cols() ~ "**{level}**<br>N = {n} ({style_percent(p, digits=1)}%)") %>%
modify_caption("Weighted descriptive statistics, by smoking status
(Females age 45y and older)") %>%
bold_labels()
```

Variable |
NeverN = 40923599 (65.8%) ^{1} |
PastN = 13054149 (21.0%) ^{1} |
CurrentN = 8207032 (13.2%) ^{1} |
---|---|---|---|

Fasting glucose (mmol/L) | 6.12 (1.55) | 6.46 (2.38) | 6.20 (1.42) |

Waist Circumference (cm) | 98.1 (15.1) | 104.4 (18.0) | 98.6 (17.1) |

Age (years) | 60.9 (10.5) | 62.0 (10.3) | 57.0 (7.9) |

Race/ethnicity | |||

Hispanic | 4,701,252 (11.5%) | 1,219,766 (9.3%) | 841,551 (10.3%) |

Non-Hispanic White | 28,446,162 (69.5%) | 9,779,559 (74.9%) | 5,853,934 (71.3%) |

Non-Hispanic Black | 4,125,440 (10.1%) | 1,217,709 (9.3%) | 829,841 (10.1%) |

Non-Hispanic Other | 3,650,745 (8.9%) | 837,114 (6.4%) | 681,706 (8.3%) |

Annual Income | |||

< $25,000 | 5,158,320 (12.6%) | 3,323,765 (25.5%) | 3,168,536 (38.6%) |

$25,000 to < $55,000 | 10,585,563 (25.9%) | 3,199,300 (24.5%) | 1,753,277 (21.4%) |

$55,000+ | 25,179,717 (61.5%) | 6,531,083 (50.0%) | 3,285,219 (40.0%) |

^{1} Mean (SD); n (%) |

Fit the model using the subsetted domain design.

```
fit.ex8.1.domain <- svyglm(LBDGLUSI ~ BMXWAIST + smoker + RIDAGEYR +
race_eth + income,
family=gaussian(), design=design.FST.domain)
```

Finally, use `tbl_regression()`

to create a table of regression coefficients. The results are shown in Table 8.5.

```
TABLE <- fit.ex8.1.domain %>%
tbl_regression(intercept = T,
estimate_fun = function(x) style_sigfig(x, digits = 3),
pvalue_fun = function(x) style_pvalue(x, digits = 3),
label = list(BMXWAIST ~ "Waist Circumference (cm)",
smoker ~ "Smoking Status",
RIDAGEYR ~ "Age (years)",
race_eth ~ "Race/ethnicity",
income ~ "Annual Income")) %>%
add_global_p(keep = T, test.statistic = "F"
# Uncomment to use design df for Type III test p-values
# (but then they will not match the regression term p-values for
# binary predictors)
# , error.df = degf(fit.ex8.1$survey.design)
) %>%
modify_caption("Weighted linear regression results for fasting glucose (mmol/L)
(Females age 45y and older)")
```

Characteristic |
Beta |
95% CI^{1} |
p-value |
---|---|---|---|

(Intercept) | 3.15 | 1.85, 4.46 | 0.001 |

Waist Circumference (cm) | 0.031 | 0.015, 0.047 | 0.003 |

Smoking Status | 0.731 | ||

Never | — | — | |

Past | 0.118 | -0.367, 0.604 | 0.573 |

Current | 0.037 | -0.559, 0.632 | 0.885 |

Age (years) | 0.008 | -0.012, 0.028 | 0.362 |

Race/ethnicity | 0.017 | ||

Hispanic | — | — | |

Non-Hispanic White | -0.480 | -0.924, -0.036 | 0.038 |

Non-Hispanic Black | -0.391 | -1.02, 0.234 | 0.177 |

Non-Hispanic Other | 0.071 | -0.476, 0.618 | 0.762 |

Annual Income | 0.436 | ||

< $25,000 | — | — | |

$25,000 to < $55,000 | -0.124 | -0.501, 0.252 | 0.450 |

$55,000+ | -0.255 | -0.706, 0.196 | 0.216 |

^{1} CI = Confidence Interval |

**Conclusion:** Among females age 45 years and older, after adjusting for age, gender, race/ethnicity, income, and each other, waist circumference is significantly associated with fasting glucose
(B = 0.031; 95% CI = 0.015, 0.047; p = .003) but smoking status is not (p = .731).