A linear regression model assumes that each observation is independent of the others. If the cases were drawn from a population using a simple random sample, then this assumption is met. The independence assumption is violated if the observations are clustered, for example if there are repeated measures from the same individual, as in longitudinal data, or if households were sampled first, followed by sampling of individuals within households.
A violation of the independence assumption results in incorrect standard errors, and therefore incorrect confidence intervals and p-values, although in some cases the regression coefficient estimates will still be unbiased (Liang and Zeger 1993; Diggle et al. 2002; Fitzmaurice, Laird, and Ware 2011).
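The consequence described above can be illustrated with a small simulation. The sketch below is only an illustration under assumed settings: 30 clusters of 10 observations, a cluster-level predictor, a shared cluster effect, and a true slope of 0.5 are all hypothetical choices, as are the function names. It fits ordinary least squares while ignoring the clustering, then compares the average naive standard error of the slope with the actual spread of the slope estimates across simulated datasets.

```python
import random
import statistics


def simulate_once(rng, n_clusters=30, cluster_size=10, beta0=1.0, beta1=0.5,
                  sd_cluster=1.0, sd_error=1.0):
    """Simulate clustered data and fit simple OLS that ignores the clustering.

    Returns the slope estimate and its naive (independence-based) standard error.
    All parameter values are illustrative assumptions.
    """
    x, y = [], []
    for _ in range(n_clusters):
        x_c = rng.gauss(0, 1)           # predictor measured at the cluster level
        u_c = rng.gauss(0, sd_cluster)  # effect shared by every member of the cluster
        for _ in range(cluster_size):
            x.append(x_c)
            y.append(beta0 + beta1 * x_c + u_c + rng.gauss(0, sd_error))

    n = len(x)
    x_bar = statistics.fmean(x)
    y_bar = statistics.fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    # Naive standard error: residual variance divided by Sxx, valid only
    # when the observations are independent.
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    naive_se = (rss / (n - 2) / sxx) ** 0.5
    return slope, naive_se


def run_simulation(n_sims=500, seed=2021):
    """Repeat the simulation and summarize the slope estimates."""
    rng = random.Random(seed)
    results = [simulate_once(rng) for _ in range(n_sims)]
    slopes = [s for s, _ in results]
    naive_ses = [se for _, se in results]
    return {
        "mean_slope": statistics.fmean(slopes),      # close to 0.5 if unbiased
        "empirical_sd": statistics.stdev(slopes),    # actual sampling variability
        "mean_naive_se": statistics.fmean(naive_ses),  # what naive OLS reports
    }


summary = run_simulation()
print(summary)
```

Under these assumptions, the average slope estimate should sit near the true value, while the naive standard error should be noticeably smaller than the empirical standard deviation of the estimates. Confidence intervals built from the naive standard error would then be too narrow and p-values too small.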
Think about how the data were collected. Are there clusters? For example, are there data from individual patients clustered within hospitals? Or individuals clustered in families or neighborhoods? Are there repeated measures from the same individual? Answering “yes” to any of those questions implies the presence of dependent data.
There are a number of methods for handling correlated data. If the correlation is the result of a complex sampling design, such as that used in NHANES (see NHANES Tutorials, accessed July 30, 2021), it is possible to adjust for this feature of the data using methods discussed in Chapter 8. If the clusters are a simple random sample, but there are multiple observations within clusters, then generalized least squares (Pinheiro and Bates 2000), linear mixed models (Laird and Ware 1982; Fitzmaurice, Laird, and Ware 2011), or generalized estimating equations (Liang and Zeger 1986; Zeger and Liang 1986; Diggle et al. 2002) can be used to account for the within-cluster correlations. A specific example of clustered data is longitudinal data, in which the clusters are individuals who are measured repeatedly over time. Generalized least squares, linear mixed models, and generalized estimating equations are beyond the scope of this text.
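Although a full treatment of these methods is not attempted here, the basic idea of accounting for within-cluster correlation can be sketched briefly. The example below is not an implementation of generalized least squares, a linear mixed model, or generalized estimating equations. It swaps in a simpler, related technique: a cluster-robust ("sandwich") standard error for an ordinary least squares slope, with no small-sample correction. This is similar in spirit to the robust variance used with generalized estimating equations under an independence working correlation. The simulated data, parameter values, and function names are hypothetical.

```python
import random
import statistics
from collections import defaultdict


def simulate_clustered(rng, n_clusters=30, cluster_size=10):
    """Hypothetical clustered data: a shared cluster effect induces correlation."""
    rows = []
    for c in range(n_clusters):
        x_c = rng.gauss(0, 1)  # cluster-level predictor
        u_c = rng.gauss(0, 1)  # cluster effect shared by all members
        for _ in range(cluster_size):
            rows.append((c, x_c, 1.0 + 0.5 * x_c + u_c + rng.gauss(0, 1)))
    return rows


def slope_with_ses(rows):
    """OLS slope with naive and cluster-robust (sandwich) standard errors."""
    x = [r[1] for r in rows]
    y = [r[2] for r in rows]
    n = len(rows)
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]

    # Naive SE: assumes independent errors with constant variance.
    naive_se = (sum(e * e for e in resid) / (n - 2) / sxx) ** 0.5

    # Sandwich SE: sum the score contributions (x - x_bar) * residual within
    # each cluster before squaring, so that within-cluster correlation of the
    # residuals contributes to the variance estimate.
    score = defaultdict(float)
    for (c, xi, _), e in zip(rows, resid):
        score[c] += (xi - x_bar) * e
    robust_se = sum(s * s for s in score.values()) ** 0.5 / sxx
    return slope, naive_se, robust_se


rng = random.Random(8)
slope, naive_se, robust_se = slope_with_ses(simulate_clustered(rng))
print(f"slope={slope:.3f}  naive SE={naive_se:.3f}  cluster-robust SE={robust_se:.3f}")
```

The slope estimate is the same under both approaches; only the standard error changes. With positively correlated observations within clusters, as simulated here, the cluster-robust standard error should be larger than the naive one. With a small number of clusters, the uncorrected sandwich estimator tends to underestimate the true variability, so this sketch should be read as a demonstration of the idea, not a recommended analysis.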