5.14 Checking the independence assumption

A linear regression model assumes that each observation is independent of the others. If the cases were drawn from a population using a simple random sample, then this assumption is met. The independence assumption is violated if the observations are clustered, for example, when there are repeated measures from the same individual (as in longitudinal data) or when households are sampled first and individuals are then sampled within households.

5.14.1 Impact of dependence

A violation of the assumption of independence results in incorrect confidence intervals and p-values, although in some cases regression coefficient estimates will still be unbiased (K. Y. Liang and Zeger 1993; Diggle et al. 2002; Fitzmaurice, Laird, and Ware 2011).
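The following small simulation is a sketch of this point, not an analysis of real data. It assumes a hypothetical setting of 30 clusters with 10 observations each, a predictor measured at the cluster level, and a shared cluster effect that makes observations within a cluster correlated. All numeric settings and the function names `fit_slope` and `simulate` are illustrative assumptions. It compares the spread of the slope estimate across simulated datasets with the standard error reported by an ordinary least squares fit that treats the observations as independent.

```python
import random
import statistics


def fit_slope(x, y):
    """Return the OLS slope and its usual standard error (assumes independence)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = (rss / (n - 2) / sxx) ** 0.5
    return slope, se


def simulate(n_sims=500, n_clusters=30, cluster_size=10, true_slope=2.0, seed=1):
    """Simulate clustered datasets and summarize the slope estimates."""
    rng = random.Random(seed)
    slopes, naive_ses = [], []
    for _ in range(n_sims):
        x, y = [], []
        for _ in range(n_clusters):
            x_c = rng.gauss(0, 1)  # predictor shared by everyone in the cluster
            u_c = rng.gauss(0, 1)  # cluster effect that induces dependence
            for _ in range(cluster_size):
                x.append(x_c)
                y.append(1.0 + true_slope * x_c + u_c + rng.gauss(0, 1))
        slope, se = fit_slope(x, y)
        slopes.append(slope)
        naive_ses.append(se)
    return statistics.mean(slopes), statistics.stdev(slopes), statistics.mean(naive_ses)


mean_slope, empirical_sd, mean_naive_se = simulate()
print("Mean of slope estimates:        ", round(mean_slope, 3))
print("SD of slope estimates (actual): ", round(empirical_sd, 3))
print("Mean reported SE (naive):       ", round(mean_naive_se, 3))
```

Under these assumed settings, the average slope estimate should be close to the true value of 2, while the reported standard error should be noticeably smaller than the actual variability of the estimates. That is the sense in which confidence intervals and p-values become incorrect even though the estimate remains unbiased.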

5.14.2 Diagnosis of dependence

Think about how the data were collected. Are there clusters? For example, are there data from individual patients clustered within hospitals? Or individuals clustered in families or neighborhoods? Are there repeated measures from the same individual? Answering “yes” to any of those questions implies the presence of dependent data.
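When a dataset includes identifiers for individuals or groups, a quick count can support these questions. The sketch below uses made-up records and a hypothetical helper named `repeated_ids`; the field layout (patient identifier, hospital identifier, outcome) is an assumption for illustration. A check like this supplements, but does not replace, knowing how the data were collected: data can be dependent even when no identifier is recorded.

```python
from collections import Counter

# Hypothetical records: (patient_id, hospital_id, outcome)
records = [
    ("p01", "hospA", 5.1),
    ("p02", "hospA", 4.8),
    ("p03", "hospB", 6.0),
    ("p01", "hospA", 5.4),  # second measurement on patient p01
    ("p04", "hospC", 5.7),
]


def repeated_ids(ids):
    """Return the identifiers that occur more than once, with their counts."""
    counts = Counter(ids)
    return {key: n for key, n in counts.items() if n > 1}


repeated_patients = repeated_ids(r[0] for r in records)
shared_hospitals = repeated_ids(r[1] for r in records)

print("Patients measured more than once:", repeated_patients)
print("Hospitals contributing more than one row:", shared_hospitals)
```

Any non-empty result points to repeated measures or clustering that the analysis needs to take into account.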

5.14.3 Potential solutions for dependence

There are several methods for handling correlated data. If the correlation results from a complex sampling design, such as the one used in NHANES (see NHANES Tutorials, accessed July 30, 2021), you can adjust for this feature of the data using the methods discussed in Chapter 8. If the clusters are a simple random sample but there are multiple observations within each cluster, then generalized least squares (J. C. Pinheiro and Bates 2000), linear mixed models (Laird and Ware 1982; Fitzmaurice, Laird, and Ware 2011), or generalized estimating equations (Kung-Yee Liang and Zeger 1986; Zeger and Liang 1986; Diggle et al. 2002) can be used to account for the within-cluster correlation. Longitudinal data are a specific example of clustered data in which the clusters are individuals who are measured repeatedly over time. These methods are beyond the scope of this text.
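To give a sense of what accounting for within-cluster correlation can look like, the sketch below computes a cluster-robust ("sandwich") standard error for the slope of a simple linear regression. This is the kind of variance estimate that generalized estimating equations rely on when an independence working correlation is used. It is not a full implementation of any of the methods above. The simulated data, the parameter values, and the function name `naive_and_cluster_robust_se` are assumptions made for illustration, and no small-sample correction is applied.

```python
import random


def naive_and_cluster_robust_se(x, y, cluster):
    """Fit y = a + b*x by OLS; return b, its naive SE, and a cluster-robust SE."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]

    # Naive SE: treats all n residuals as independent
    naive_se = (sum(e ** 2 for e in resid) / (n - 2) / sxx) ** 0.5

    # Robust SE: add up (x_i - x_bar) * residual_i within each cluster first,
    # so that correlated residuals in the same cluster are not treated as
    # separate pieces of information
    scores = {}
    for xi, ei, ci in zip(x, resid, cluster):
        scores[ci] = scores.get(ci, 0.0) + (xi - x_bar) * ei
    robust_se = sum(s ** 2 for s in scores.values()) ** 0.5 / sxx
    return slope, naive_se, robust_se


# Simulated example (assumed values): 50 clusters of 10 observations each,
# a cluster-level predictor, and a shared cluster effect
rng = random.Random(2)
x, y, cluster = [], [], []
for c in range(50):
    x_c = rng.gauss(0, 1)
    u_c = rng.gauss(0, 1)
    for _ in range(10):
        x.append(x_c)
        y.append(1.0 + 2.0 * x_c + u_c + rng.gauss(0, 1))
        cluster.append(c)

slope, naive_se, robust_se = naive_and_cluster_robust_se(x, y, cluster)
print("Slope estimate:   ", round(slope, 3))
print("Naive SE:         ", round(naive_se, 3))
print("Cluster-robust SE:", round(robust_se, 3))
```

In this assumed setting, the cluster-robust standard error should be larger than the naive one, reflecting that observations within a cluster carry less independent information than the same number of unrelated observations. In practice, established software implementing the cited methods would be used in place of a hand-written calculation like this one.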


Diggle, Peter J., Patrick Heagerty, Kung-Yee Liang, and Scott L. Zeger. 2002. Analysis of Longitudinal Data. Oxford: Oxford University Press.
Fitzmaurice, Garrett, Nan M. Laird, and James H. Ware. 2011. Applied Longitudinal Analysis. 2nd ed. Hoboken: John Wiley & Sons, Inc.
Laird, Nan M., and James H. Ware. 1982. “Random-Effects Models for Longitudinal Data.” Biometrics 38 (4): 963–74. https://doi.org/10.2307/2529876.
Liang, K. Y., and S. L. Zeger. 1993. “Regression Analysis for Correlated Data.” Annual Review of Public Health 14 (1): 43–68. https://doi.org/10.1146/annurev.pu.14.050193.000355.
Liang, Kung-Yee, and Scott L. Zeger. 1986. “Longitudinal Data Analysis Using Generalized Linear Models.” Biometrika 73 (1): 13–22. https://doi.org/10.2307/2336267.
Pinheiro, José C., and Douglas M. Bates. 2000. Mixed-Effects Models in S and S-PLUS. New York: Springer-Verlag.
Zeger, Scott L., and Kung-Yee Liang. 1986. “Longitudinal Data Analysis for Discrete and Continuous Outcomes.” Biometrics 42 (1): 121–30. http://www.jstor.org/stable/2531248.