In research we are often interested in quantifying some characteristic of a population, such as the prevalence of a condition, the average of a measurement, or the association between an exposure and a disease. Typically, we measure the characteristic in a sample of units from the population and use that information to make inference about the population characteristic. Units could be individual people, neighborhoods, hospitals, counties, or even nations.
A census attempts to select every unit in a population. A sample is a subset of units from a population. The process by which the sample is selected is called the sampling design. A non-probability sample consists of units selected based on some known or unknown non-random method (e.g., a convenience sample). A probability sample consists of units selected randomly with known probabilities of selection (not necessarily equal). Probability sampling requires a sampling frame – a listing of all the units in the population and their associated selection probabilities.
Whatever the sampling design, the goal is to use information from the sample to infer something about the population. For example, we might use the sample mean blood pressure among individuals as an estimate of the population mean blood pressure. If the sample was obtained using probability sampling, the data analyst can take into account the survey design to obtain results that are representative of the population. The default assumption of most standard statistical methods is that every group of size \(n\) in the population has the same probability of being selected. As a consequence, every single unit in the population has the same probability of being selected. Such a design is known as simple random sampling. With this design, many standard statistics (e.g., sample mean, sample regression slope) are unbiased estimates of their population counterparts.
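The unbiasedness of the sample mean under simple random sampling can be illustrated with a small simulation. The sketch below assumes a made-up population of 10,000 blood pressure values (the mean of 120 and standard deviation of 15 are arbitrary choices for illustration) and uses base R's `sample()` to draw simple random samples without replacement.

```r
set.seed(1)

# Hypothetical population of 10,000 systolic blood pressure values
population <- rnorm(10000, mean = 120, sd = 15)

# One simple random sample of size 100 (without replacement)
srs <- sample(population, size = 100)
mean(srs)

# Repeat the sampling many times: the sample means average out
# close to the population mean, as unbiasedness implies
sample_means <- replicate(2000, mean(sample(population, size = 100)))
c(population = mean(population),
  average_of_sample_means = mean(sample_means))
```

Any single sample mean differs from the population mean, but the average over many simple random samples should be very close to it.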
Many surveys, however, use a complex sampling design, not simple random sampling. There are various reasons for this. For example, if constructing a sampling frame listing every unit in the population is difficult or likely to result in errors, one could use multistage sampling to sample larger, easier-to-list groups of units and then survey some or all units within each group, where an accurate sampling frame can be constructed on site. In multistage sampling, you first sample primary sampling units (PSUs) (e.g., households). Then, you sample units within each PSU (e.g., individuals within a household). There could, of course, be more than two stages of sampling. The units from earlier stages form clusters.
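A two-stage design of this kind can be sketched in base R. The example below is only an illustration under assumed values: a hypothetical population of 50 households with 1 to 6 members each, a simple random sample of 10 households at the first stage, and one randomly chosen person per sampled household at the second stage. The object names are invented for this sketch.

```r
set.seed(42)

# Hypothetical population: 50 households (PSUs) with 1 to 6 members each
hh_size <- sample(1:6, 50, replace = TRUE)
population <- data.frame(
  household = rep(1:50, times = hh_size),
  person    = sequence(hh_size)
)

# Stage 1: simple random sample of 10 households
sampled_hh <- sample(1:50, 10)

# Stage 2: one person chosen at random within each sampled household
stage2 <- lapply(sampled_hh, function(h) {
  members <- population[population$household == h, ]
  members[sample(nrow(members), 1), ]
})
two_stage_sample <- do.call(rbind, stage2)

# Overall selection probability = P(household) * P(person | household)
two_stage_sample$p_select <-
  (10 / 50) * (1 / hh_size[two_stage_sample$household])
two_stage_sample
```

Even though each stage uses simple random sampling, people in larger households end up with smaller overall selection probabilities, so the resulting sample is not a simple random sample of individuals.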
Another reason to use a complex sampling design is that a simple random sample may result in small sample sizes among some subgroups of interest. For example, if race/ethnicity-specific mean blood pressure is of interest, a researcher may want a sampling design that increases the sample size within smaller subgroups. A simple random sample will likely result in a much larger sample size for the majority race/ethnicity and smaller sample sizes for minority groups. Rather than increase the overall sample size to ensure sufficient sizes in the smaller groups, it is more cost-effective to under-sample large groups and over-sample small groups using unequal probability sampling.
One form of unequal probability sampling that is seen in some multistage sampling designs is sampling with probability proportional to size (PPS), in which larger PSUs have a greater probability of being selected. Another is stratified random sampling, in which the population is first non-randomly split into strata (e.g., geographic regions), within each of which a simple random sample is drawn. Stratification into unequal-size strata followed by simple random samples of the same size within each stratum results in unequal probability sampling because individuals in smaller strata have a greater probability of being selected than individuals in larger strata. More generally, selection probabilities differ across strata whenever the sampling fraction (the stratum sample size divided by the stratum population size) differs.
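The effect of stratum size on selection probability can be checked with a short calculation. The sketch below assumes two hypothetical strata of 1,000 and 9,000 units and a simple random sample of 100 units drawn from each; all of the numbers are made up for illustration.

```r
# Hypothetical stratum sizes and per-stratum sample sizes (illustrative only)
N_h <- c(small = 1000, large = 9000)  # population units in each stratum
n_h <- c(small = 100,  large = 100)   # units sampled from each stratum

# Under simple random sampling within a stratum, every unit in
# stratum h is selected with probability n_h / N_h
p_select <- n_h / N_h
p_select

# With proportional allocation (the same sampling fraction in every
# stratum), the selection probabilities would be equal instead
n_prop <- N_h * sum(n_h) / sum(N_h)
p_prop <- n_prop / N_h
p_prop
```

With equal sample sizes, units in the small stratum are nine times as likely to be selected as units in the large stratum; with proportional allocation every unit has the same selection probability.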
A unit’s design weight is the inverse of its probability of selection. In a simple random sample, each sampling unit has the same probability of selection, so each unit has the same weight when measurements are combined to form a statistic. For example, the sample mean of a variable \(X\) in a sample of size \(n\) is \((x_1 + x_2 + \ldots + x_n)/n\). Each unit in a simple random sample is equally weighted and we refer to the sample mean as an unweighted statistic. But what if, as in many complex survey designs, units in the population have different probabilities of being selected? In that case, the unweighted sample mean is generally a biased estimate of the population mean. The weighted mean, however, is an unbiased estimate. Additional complexities can arise that must be accounted for as well, such as non-response, in which individuals are selected but refuse to participate. Methods exist to combine the design weights with the other complexities to produce sampling weights which, ideally, correspond for each unit to the number of population units represented by that unit.
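To make the difference concrete, a weighted mean with weights \(w_i\) can be written as \(\sum_i w_i x_i / \sum_i w_i\), where \(w_i\) is the inverse of the selection probability of unit \(i\). The sketch below uses five made-up measurements and made-up selection probabilities to compare the unweighted and weighted means in base R.

```r
# Hypothetical sample of 5 units: measurements and selection probabilities
x <- c(120, 130, 125, 140, 150)
p <- c(0.10, 0.10, 0.10, 0.02, 0.02)

# Design weight = inverse of the probability of selection
w <- 1 / p

# Unweighted mean: every unit counts equally
mean(x)

# Weighted mean: units with a small selection probability represent
# more population units, so they count more
sum(w * x) / sum(w)

# Base R's weighted.mean() computes the same quantity
weighted.mean(x, w)
```

Here the last two units were unlikely to be selected, so each stands in for more of the population and pulls the weighted mean above the unweighted mean.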
There are three main consequences to incorrectly treating a complex survey design as a simple random sample. First, standard sample statistics computed from data sampled using a complex survey design may be biased estimates of population statistics. Second, estimates of the variation of sample statistics may be incorrect, resulting also in incorrect confidence intervals and p-values. Finally, ignoring the complex survey design leads to a violation of the independence assumption required for standard statistical methods (Hahs-Vaughn et al. 2011). See, for example, Skinner and Wakefield (2017) and Lohr (2021) for further reading about sampling design and analysis in general and Lumley (2010) for further reading about sampling design and analysis in R.
In this chapter, we use the `survey` package (Lumley 2004, 2023) to account for complex survey designs when computing descriptive statistics and carrying out regression analyses. Full documentation can be found at `help(package="survey")` and Analysis of Complex Survey Samples (accessed February 7, 2023).
An example of a survey with a complex design is the National Health and Nutrition Examination Survey (NHANES).
“The NHANES samples are not simple random samples. Rather, a complex, multistage, probability sampling design is used to select participants representative of the civilian, non-institutionalized US population. Oversampling of certain population subgroups is also done to increase the reliability and precision of health status indicator estimates for these particular subgroups. Researchers need to take this into account in their analyses by appropriately specifying the sampling design parameters.”
— NHANES Tutorial: Sample Design (accessed February 3, 2023).
Briefly, NHANES has a stratified four-stage sampling design. First, strata are (non-randomly) constructed based on census regions and other geographic information. Within each stratum, U.S. counties (the PSUs) are randomly selected, with larger counties having a greater probability of selection. Within counties, city blocks are selected, also with probability proportional to size. Within blocks, households are randomly selected, with certain age, ethnic, and income groups oversampled (higher probability of selection). Finally, within households, individuals are randomly selected. For a full description of the NHANES complex survey design, see NHANES Tutorial: Sample Design (accessed February 3, 2023).
The NHANES website provides sample R code (accessed February 3, 2023) for analyzing NHANES data using the `survey` package, as well as some special considerations when analyzing NHANES data.
In this text, we use the 2017-2018 NHANES cycle, so the information given below is from that cycle. The following variables are included in the dataset to account for the sampling design.
- Stratum (`SDMVSTRA`): There were 15 strata.
- Primary sampling unit (`SDMVPSU`): This variable takes on only two values (1 or 2). This does not mean that only two counties were selected, but rather that two counties were selected within each stratum. Thus, in total, there were 30 PSUs.
- Interview sampling weight (`WTINT2YR`): Every participant was interviewed, so the interview sampling weight is > 0 for every individual (n = 9254). Interviewers used questionnaires to collect self-reported information.
- Examination sampling weight (`WTMEC2YR`): Most participants (n = 8704) were also examined at a mobile examination center (MEC). Examinations collected objective measures using, for example, anthropometrics (e.g., height, weight), blood draws (e.g., lipids), and other instrumentation (e.g., dual-energy x-ray absorptiometry (DXA) scans to assess body composition). The 550 participants who were not examined have an examination sampling weight of 0.
- Fasting subsample sampling weight (`WTSAF2YR`): A subset of 2711 participants aged 12 years and older also had blood measurements taken using blood drawn after fasting for 8-24 hours. The 6218 participants not in this subsample have a missing (`NA`) fasting subsample weight. The remaining 325 participants were selected for this subsample but were not able to provide an appropriate blood draw. These individuals have fasting subsample sampling weights of zero. See the NHANES documentation (accessed February 8, 2023) for more information.
There were other subsamples, as well, and their corresponding weight variables are noted in the Analytic Notes for certain NHANES variables (e.g., Perfluoroalkyl and Polyfluoroalkyl Substances, accessed February 3, 2023). When using NHANES data, always consult the appropriate data documentation and codebooks (accessed February 3, 2023) to ensure you use the appropriate sampling weights.
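As a rough sketch of how these design variables fit together, the code below declares a design object with `svydesign()` from the `survey` package and computes a weighted mean with `svymean()`. The data frame `nhanes_sim` and the variable `sbp` are simulated stand-ins invented for this example (only the design variable names match NHANES), so the numerical results are meaningless.

```r
library(survey)

# Simulated stand-in for NHANES data (values are made up); only the
# design variable names match the real dataset
set.seed(2018)
n <- 600
nhanes_sim <- data.frame(
  SDMVSTRA = rep(1:15, each = 40),          # 15 strata
  SDMVPSU  = rep(rep(1:2, each = 20), 15),  # 2 PSUs within each stratum
  WTMEC2YR = runif(n, 5000, 50000),         # hypothetical examination weights
  sbp      = rnorm(n, 120, 15)              # hypothetical examination variable
)

# nest = TRUE because the PSU labels (1, 2) repeat across strata
design_mec <- svydesign(
  ids     = ~SDMVPSU,
  strata  = ~SDMVSTRA,
  weights = ~WTMEC2YR,
  nest    = TRUE,
  data    = nhanes_sim
)

# Weighted mean with a design-based standard error
svymean(~sbp, design_mec)
```

The examination weight is used here on the assumption that the analysis variable was collected at the examination; an analysis restricted to interview variables would use `WTINT2YR` instead.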
Which NHANES sampling weight to use
In general, “use the weight of the smallest subpopulation that includes all the variables you want to include in your analysis” (NHANES Tutorial: Weighting (accessed February 3, 2023)).
For example, if every variable in your analysis was collected in the interview, the examination, or the fasting subsample:
- If any of the variables in your analysis were collected only in the fasting subsample then use the fasting subsample sampling weights.
- Otherwise, if any were collected only in the examination subsample, then use the examination sampling weights.
- Otherwise, if all variables were collected in the interview, use the interview sampling weights.
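The decision rule above can be written as a small helper function. This is a hypothetical convenience function, not part of NHANES or the `survey` package, and it covers only the three 2017-2018 weights described in this section; other subsamples have their own weights.

```r
# Hypothetical helper: choose the NHANES 2017-2018 weight variable given the
# components in which the analysis variables were collected
choose_nhanes_weight <- function(components) {
  valid <- c("interview", "examination", "fasting")
  stopifnot(length(components) > 0, all(components %in% valid))
  if ("fasting" %in% components) {
    "WTSAF2YR"   # smallest subsample: fasting
  } else if ("examination" %in% components) {
    "WTMEC2YR"   # examined at the MEC
  } else {
    "WTINT2YR"   # interview only
  }
}

choose_nhanes_weight(c("interview", "examination"))
choose_nhanes_weight(c("interview", "examination", "fasting"))
choose_nhanes_weight("interview")
```

The order of the checks matters: the function tests for the smallest subsample first, mirroring the "Otherwise" structure of the list.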
Combining data over multiple NHANES cycles
Sample statistics based on a single NHANES cycle, while unbiased estimates of U.S. population characteristics, can have large variability because few PSUs are sampled in any given cycle (NHANES Tutorial: Sample Design, accessed February 3, 2023). For example, NHANES 2017-2018 sampled only 30 counties. However, data can be combined over multiple cycles. When doing so, you must create a new sampling weight variable and consider the possibility of trends over time. Instructions for how to combine weights over cycles can be found in NHANES Tutorial: Weighting (accessed February 3, 2023).
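As a minimal sketch of constructing a combined weight, the code below stacks two hypothetical cycles and divides each 2-year examination weight by the number of cycles. The divide-by-the-number-of-cycles rule is an assumption here, based on the general guidance for combining recent 2-year cycles; some cycle combinations require different weights, so check the weighting tutorial before applying it. The data values and the name `WTMEC4YR` are invented for illustration.

```r
# Two hypothetical cycles with made-up examination weights
cycle_a <- data.frame(cycle = "2015-2016", WTMEC2YR = c(20000, 35000, 45000))
cycle_b <- data.frame(cycle = "2017-2018", WTMEC2YR = c(25000, 30000, 50000))

# Stack the cycles into one dataset
combined <- rbind(cycle_a, cycle_b)

# Assumed rule: combined weight = 2-year weight / number of cycles combined
n_cycles <- length(unique(combined$cycle))
combined$WTMEC4YR <- combined$WTMEC2YR / n_cycles

# The combined weights sum to the average of the per-cycle weight totals,
# so they still represent one population rather than two
sum(combined$WTMEC4YR)
tapply(combined$WTMEC2YR, combined$cycle, sum)
```

Without rescaling, the stacked weights would sum to roughly twice the population size, which would distort estimated totals.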
Another example of a survey with a complex design is the National Survey on Drug Use and Health (NSDUH), which incorporated geographic stratification followed by multistage sampling. See the 2019 NSDUH Public Use File Codebook (accessed February 3, 2023) for detailed information about the sampling design.
The variables needed to account for the complex survey design of the 2019 NSDUH are the following.
- Stratum (`vestr`): There were 50 strata.
- Primary sampling unit (`verep`): As with the NHANES PSU variable, this variable takes on only two values (1 or 2), nested within strata. Thus, in total, there were 100 PSUs.
- Final analysis weight (`ANALWT_C`): These sampling weights are positive for all participants.
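The nesting of PSU labels within strata can be verified by counting distinct stratum-PSU combinations. The sketch below uses a simulated stand-in (`nsduh_sim`, not real NSDUH data) containing only the two design variables.

```r
# Simulated stand-in for the NSDUH design variables (not real data):
# 50 strata, each containing PSUs labeled 1 and 2
nsduh_sim <- expand.grid(verep = 1:2, vestr = 1:50)

# The PSU label alone takes only two values
length(unique(nsduh_sim$verep))

# A PSU is identified by its stratum and label together
nrow(unique(nsduh_sim[, c("vestr", "verep")]))
```

The first count is 2 and the second is 100, which is why PSU labels that repeat across strata must be declared as nested when specifying the design (for instance, with `nest = TRUE` in `svydesign()`).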