The first step when using the
survey package is to specify the variables in the dataset that define the components of the complex survey design (e.g., strata, PSUs, sampling weights). This information is needed by all the other survey analysis functions, and is stored in a
survey.design object which is a required argument in all the
Example 8.1: Specify the survey design for the full NHANES 2017-2018 dataset (
nhanes1718_rmph.Rdata). This dataset has all the participants, not just a random subsample as was used in earlier chapters, although it still has just a subset of all the NHANES variables. Ultimately, our goal with this example is, as we did in Chapter 5, to carry out a linear regression analysis for the outcome fasting glucose (
LBDGLUSI) with the predictors waist circumference (
BMXWAIST), smoking status (
smoker) , age (
RIDAGEYR), gender (
RIAGENDR), race/ethnicity (
RIDRETH3), and income (
income). Unlike in Chapter 5, this analysis will use the entire NHANES 2017-2018 dataset and incorporate the complex survey design in order to produce unbiased population estimates with appropriate estimates of variation.
##  9254
When incorporating a complex survey design, always use the full dataset. Do not exclude any observations, not even to apply inclusion/exclusion criteria or to remove cases with missing data. The
survey procedures need to have access to the survey design variable values for all observations in order to obtain valid estimates, even estimates in subgroups. However, we can tell R that we want the final analysis to apply only to a subgroup by creating a subpopulation indicator variable and subsetting the design (not the dataset) based on that variable at a later step. In this example, we will subset on those with non-missing data and positive weights. More generally, you can subset to any subpopulation, as discussed in Section 8.5.
For this example, create a complete case indicator variable called
nomiss that also conditions on sampling weight > 0. This makes sure descriptive statistics generalize to the same subpopulation as the regression analysis, as well as avoids an error when running some
survey functions which do not accept zero weights. If doing a regression without a Table 1 of descriptive statistics, you only need to condition on sampling weight > 0. If using a design in which all sampling weights are positive, however, there is no need to condition on sampling weight > 0.
Also, carry out the following data management steps:
- Collapse the race/ethnicity variable as we have done with other analyses.
- Center waist circumference at 100 cm.
- Create a character version of
smokerfor use as a by variable in our Table 1 (the function we will use to create this table will expect a character variable, not a factor – see Section 8.3.2).
- Set missing fasting subsample weights to 0 (
svydesignreturns an error if there are any missing weights).
<- nhanes %>% nhanes.mod mutate(# Collapse race/ethnicity variable race_eth = fct_collapse(RIDRETH3, "Hispanic" = c("Mexican American", "Other Hispanic"), "Non-Hispanic Other" = c("Non-Hispanic Asian", "Other/Multi")), # Create a centered waist circumference variable we will use later cBMXWAIST = BMXWAIST - 100, # Create a character version for use as a by variable in tbl_svysummary, # and change the values so they are in the correct order when alphabetized smoker_char = case_when(smoker == "Never" ~ "1 Never", == "Past" ~ "2 Past", smoker == "Current" ~ "3 Current"), smoker # This step is only needed when the variable is labeled. # Without this, replace_na() in the next step results in an # error for a labeled variable. WTSAF2YR = as.numeric(WTSAF2YR), # Set missing fasting subsample weights to 0 since # svydesign returns an error if there are any missing weights. WTSAF2YR = replace_na(WTSAF2YR, 0), # Complete case and positive weight indicator # (a vector of TRUE and FALSE values) nomiss = !is.na(LBDGLUSI) & !is.na(BMXWAIST) & !is.na(smoker) & !is.na(RIDAGEYR) & !is.na(RIAGENDR) & !is.na(race_eth) & !is.na(income) & WTSAF2YR > 0)
# Verify the derivations table(nhanes$smoker, nhanes.mod$smoker_char, exclude = NULL) # Subset on missing weights <- is.na(nhanes$WTSAF2YR) SUB # Old NAs summary(nhanes$WTSAF2YR[SUB]) # are now 0 in the new dataset summary(nhanes.mod$WTSAF2YR[SUB]) # For those with non-missing weights, new - old = 0 (no change) summary(nhanes$WTSAF2YR[!SUB] - nhanes.mod$WTSAF2YR[!SUB]) # (results not shown)
svydesign to create the
survey.design object. For each of the
weights arguments, input a formula (e.g.,
~X) specifying the variable that defines the survey strata, PSUs, and sampling weights, respectively.
For NHANES 2017-2018, specify the following three designs: (1) Interview, (2) Examination, and (3) Fasting subsample. Each design has the same strata and PSU variable names, but a different sampling weight variable name. See Section 8.1.1 for definitions of the variables used. As mentioned in that Section, only use the interview weights if all the variables in the analysis were included in the interview; if any were from the examination, use the examination weights unless any were only measured in the fasting subsample in which case use the fasting subsample weights.
library(survey) options(survey.lonely.psu = "adjust") options(survey.adjust.domain.lonely = TRUE) <- svydesign(strata=~SDMVSTRA, id=~SDMVPSU, weights=~WTINT2YR, design.INT nest=TRUE, data=nhanes.mod) <- svydesign(strata=~SDMVSTRA, id=~SDMVPSU, weights=~WTMEC2YR, design.MEC nest=TRUE, data=nhanes.mod) <- svydesign(strata=~SDMVSTRA, id=~SDMVPSU, weights=~WTSAF2YR, design.FST nest=TRUE, data=nhanes.mod) # View basic information about the design design.FST
## Stratified 1 - level Cluster Sampling design (with replacement) ## With (30) clusters. ## svydesign(strata = ~SDMVSTRA, id = ~SDMVPSU, weights = ~WTSAF2YR, ## nest = TRUE, data = nhanes.mod)
# View more information about the design summary(design.FST) # (results not shown)
Finally, for Example 8.1, use the
subset() function to create a design object to be used with analyses of cases with
nomiss = TRUE, those with complete data and positive weights.
<- subset(design.FST, nomiss)design.FST.nomiss
Filtering out cases with missing values is not actually necessary for the regression analysis itself, but will ensure that the descriptive statistics and the regression analysis generalize to the same subpopulation. Had we used
design.FST instead of
design.FST.nomiss, the subpopulation represented by the descriptive statistics for a variable would be those who would have had non-missing values for that variable had they been sampled, regardless of whether or not other variables were missing. With
design.FST.nomiss, descriptive statistics for all variables and the regression analysis all generalize to those in the population who would have had non-missing values for all the variables had they been sampled.
- The global options,
survey.lonely.psu = "adjust"and
survey.adjust.domain.lonely = TRUEspecify how variance estimation will be handled in cases where a stratum contains only one PSU. These options “center the stratum at the population mean rather than the stratum mean” and make the same adjustment within subgroup analyses (see
?surveyoptions). These options only need to be set once per R session.
surveypackage assumes the
idvariable has a unique value for every unique PSU. However, in many datasets, including NHANES and NSDUH, the
idvariable values are nested within
strata, duplicated across
strata, requiring the use of the the
svydesign(). If you leave this option out but the
idvalues are actually nested, R will return the warning
Clusters not nested in strata at top level; you may want nest=TRUE.
svydesign()does not accept missing weights. That is why we set missing weights to 0 before specifying the design for the fasting subsample.
- Some functions (e.g.,
svycoxph()) return an error if there are any weights of 0. That is why we conditioned on sampling weight > 0 in the code above. See also Section 220.127.116.11.
coxph()all have a
weightsargument that can be used to specify a weighted regression model. However, passing sampling weights to that argument will not produce the correct results for a complex survey design since
weightsin the context of those functions refers to a different kind of weight. Instead, use the methods described in this Chapter, using functions from the
surveylibrary, to correctly incorporate the survey design into your analysis.
Estimated confidence intervals and p-values depend on the degrees of freedom. Both NHANES and NSDUH documentation recommended using the design degrees of freedom (DF), the number of PSUs less the number of strata, to obtain correct measures of variation (NHANES Tutorial: Variance Estimation, accessed February 7, 2023; 2019 NSDUH Public Use File Codebook, page H-2, accessed February 7, 2023).
Where does this quantity come from? “Degrees of freedom” are related to the number of independent pieces of information in a sample. For a simple random sample, the DF is the sample size minus the number of population quantities being estimated. For example, the sample mean has DF equal to the sample size minus 1. However, in many complex survey designs, there are fewer independent pieces of information. Where does the “design DF” come from? Here is one way to think about it. After splitting the population up into strata, NHANES and NSDUH sampled PSUs within strata, that is, they conducted a simple random sample of PSUs within each stratum. Thus, the sample mean of a PSU-level variable within a single stratum has DF = (number of PSUs within the stratum) - 1. Summing that up over strata results in DF = number of PSUs less the number of strata.
For the NHANES 2017-2018 data, the design DF are 15 because there are 30 PSUs and 15 strata. For the 2019 NSDUH data, the design DF are 50 because there are 100 PSUs and 50 strata.
Different statistical software may use different default DF. In R, most
survey functions do not use the design DF by default but do have an option to change the DF. The code
degf(design) computes the design DF for a given
design. In a subgroup analysis (see Section 8.5) the correct design DF is based on the numbers of PSUs and strata in the subsample representing the subgroup of interest, either of which might be smaller than in the full sample. Fortunately,
degf() modifies the DF appropriately for an analysis on a subgroup with fewer PSUs and/or strata. See the “Details” section in
?summary.svyglm for more information.