8.2 Specifying the survey design
The first step when using the survey
package is to specify the variables in the dataset that define the components of the complex survey design (e.g., strata, PSUs, sampling weights). This information is needed by all the other survey analysis functions, and is stored in a survey.design
object which is a required argument in all the survey
functions.
Example 8.1: Specify the survey design for the full NHANES 2017-2018 dataset (nhanes1718_rmph.Rdata
). This dataset has all the participants, not just a random subsample as was used in earlier chapters, although it still has just a subset of all the NHANES variables. Ultimately, our goal with this example is, as we did in Chapter 5, to carry out a linear regression analysis for the outcome fasting glucose (LBDGLUSI
) with the predictors waist circumference (BMXWAIST
), smoking status (smoker
), age (RIDAGEYR
), gender (RIAGENDR
), race/ethnicity (RIDRETH3
), and income (income
). Unlike in Chapter 5, this analysis will use the entire NHANES 2017-2018 dataset and incorporate the complex survey design in order to produce unbiased population estimates with appropriate estimates of variation. Since this analysis uses fasting glucose, use the fasting subsample weights, WTSAF2YR
.
## [1] 9254
When incorporating a complex survey design, always use the full dataset. Do not exclude any observations, not even to apply inclusion/exclusion criteria or to remove cases with missing data. The survey
procedures need to have access to the survey design variable values for all observations in order to obtain valid estimates, even estimates in subgroups. However, we can tell R that we want the final analysis to apply only to a subgroup by creating a subpopulation indicator variable and subsetting the design (not the dataset) based on that variable at a later step. In this example, we will subset on those with non-missing data and positive weights. More generally, you can subset to any subpopulation, as discussed in Section 8.5.
For this example, create a complete case indicator variable called nomiss
that also conditions on sampling weight > 0. This makes sure descriptive statistics generalize to the same subpopulation as the regression analysis, as well as avoids an error when running some survey
functions which do not accept zero weights. If doing a regression without a “Table 1” of descriptive statistics (see Section 3.3), you only need to condition on sampling weight > 0. If using a design in which all sampling weights are positive, however, there is no need to condition on sampling weight > 0.
Also, carry out the following data management steps:
- Collapse the race/ethnicity variable as we have done with other analyses.
- Set missing fasting subsample weights to 0 (
svydesign()
returns an error if there are any missing weights).
nhanes.mod <- nhanes %>%
mutate(# Collapse race/ethnicity variable
race_eth = fct_collapse(RIDRETH3,
"Hispanic" = c("Mexican American", "Other Hispanic"),
"Non-Hispanic Other" = c("Non-Hispanic Asian", "Other/Multi")),
# Set missing fasting subsample weights to 0 since
# svydesign() returns an error if there are any missing weights.
WTSAF2YR = replace_na(WTSAF2YR, 0),
# Complete case and positive weight indicator
# (a vector of TRUE and FALSE values)
nomiss = complete.cases(LBDGLUSI, BMXWAIST, smoker, RIDAGEYR,
RIAGENDR, race_eth, income) &
WTSAF2YR > 0)
# Verify the derivations
# Subset on missing weights
SUB <- is.na(nhanes$WTSAF2YR)
# Old NAs
summary(nhanes$WTSAF2YR[SUB])
# are now 0 in the new dataset
summary(nhanes.mod$WTSAF2YR[SUB])
# For those with non-missing weights, new - old = 0 (no change)
summary(nhanes$WTSAF2YR[!SUB] - nhanes.mod$WTSAF2YR[!SUB])
# (results not shown)
Next, use svydesign()
to create the survey.design
object. For each of the strata
, id
, and weights
arguments, input a formula (e.g., ~X
) specifying the variable that defines the survey strata, PSUs, and sampling weights, respectively.
For NHANES 2017-2018, specify the following three designs: (1) Interview, (2) Examination, and (3) Fasting subsample. Each design has the same strata and PSU variable names, but a different sampling weight variable name. See Section 8.1.1 for definitions of the variables used. As mentioned in that Section, only use the interview weights if all the variables in the analysis were included in the interview; if any were from the examination, use the examination weights unless any were only measured in the fasting subsample in which case use the fasting subsample weights.
library(survey)
options(survey.lonely.psu = "adjust")
options(survey.adjust.domain.lonely = TRUE)
design.INT <- svydesign(strata=~SDMVSTRA, id=~SDMVPSU, weights=~WTINT2YR,
nest=TRUE, data=nhanes.mod)
design.MEC <- svydesign(strata=~SDMVSTRA, id=~SDMVPSU, weights=~WTMEC2YR,
nest=TRUE, data=nhanes.mod)
design.FST <- svydesign(strata=~SDMVSTRA, id=~SDMVPSU, weights=~WTSAF2YR,
nest=TRUE, data=nhanes.mod)
# View basic information about the design
design.FST
## Stratified 1 - level Cluster Sampling design (with replacement)
## With (30) clusters.
## svydesign(strata = ~SDMVSTRA, id = ~SDMVPSU, weights = ~WTSAF2YR,
## nest = TRUE, data = nhanes.mod)
Finally, for Example 8.1, use the subset()
function to create a design object to be used with analyses of cases with nomiss = TRUE
, those with complete data and positive weights.
Filtering out cases with missing values is not actually necessary for the regression analysis itself, but will ensure that the descriptive statistics and the regression analysis generalize to the same subpopulation. Had we used design.FST
instead of design.FST.nomiss
, the subpopulation represented by the descriptive statistics for a variable would be those who would have had non-missing values for that variable had they been sampled, regardless of whether or not other variables were missing. With design.FST.nomiss
, descriptive statistics for all variables and the regression analysis all generalize to those in the population who would have had non-missing values for all the variables had they been sampled.
NOTES:
- The global options,
survey.lonely.psu = "adjust"
andsurvey.adjust.domain.lonely = TRUE
(that we set at the beginning of this chapter) specify how variance estimation will be handled in cases where a stratum contains only one PSU. These options “center the stratum at the population mean rather than the stratum mean” and make the same adjustment within subgroup analyses (see?surveyoptions
). These options only need to be set once per R session. - The
survey
package assumes theid
variable has a unique value for every unique PSU. However, in many datasets, including NHANES and NSDUH, theid
variable values are nested withinstrata
, duplicated acrossstrata
, requiring the use of the thenest=TRUE
option insvydesign()
. If you leave this option out but theid
values are actually nested, R will return the warningClusters not nested in strata at top level; you may want nest=TRUE
. svydesign()
does not accept missing weights. That is why we set missing weights to 0 before specifying the design for the fasting subsample.- Some functions (e.g.,
svycoxph()
) return an error if there are any weights of 0. That is why we conditioned on sampling weight > 0 in the code above. See also Section 8.7.3.2. lm()
,glm()
, andcoxph()
all have aweights
argument that can be used to specify a weighted regression model. However, passing sampling weights to that argument will not produce the correct results for a complex survey design sinceweights
in the context of those functions refers to a different kind of weight. Instead, use the methods described in this Chapter, using functions from thesurvey
library, to correctly incorporate the survey design into your analysis.
8.2.1 Design degrees of freedom
Estimated confidence intervals and p-values depend on the degrees of freedom. Both NHANES and NSDUH documentation recommended using the design degrees of freedom (DF), the number of PSUs less the number of strata, to obtain correct measures of variation (NHANES Tutorial: Variance Estimation, accessed February 7, 2023; 2019 NSDUH Public Use File Codebook, page H-2, accessed February 7, 2023).
Where does this quantity come from? “Degrees of freedom” are related to the number of independent pieces of information in a sample. For a simple random sample, the DF is the sample size minus the number of population quantities being estimated. For example, the sample mean has DF equal to the sample size minus 1. However, in many complex survey designs, there are fewer independent pieces of information. Where does the “design DF” come from? Here is one way to think about it. After splitting the population up into strata, NHANES and NSDUH sampled PSUs within strata, that is, they conducted a simple random sample of PSUs within each stratum. Thus, the sample mean of a PSU-level variable within a single stratum has DF = (number of PSUs within the stratum) - 1. Summing that up over strata results in DF = number of PSUs less the number of strata.
For the NHANES 2017-2018 data, the design DF are 15 because there are 30 PSUs and 15 strata. For the 2019 NSDUH data, the design DF are 50 because there are 100 PSUs and 50 strata.
Different statistical software may use different default DF. In R, most survey
functions do not use the design DF by default but do have an option to change the DF. The code degf(design)
computes the design DF for a given design
. In a subgroup analysis (see Section 8.5) the correct design DF is based on the numbers of PSUs and strata in the subsample representing the subgroup of interest, either of which might be smaller than in the full sample. Fortunately, degf()
modifies the DF appropriately for an analysis on a subgroup with fewer PSUs and/or strata. See the “Details” section in ?summary.svyglm
for more information.