## 5.15 Checking the normality assumption

Earlier in this chapter we stated that $$\epsilon \sim N(0, \sigma^2)$$ which is, in part, shorthand for “the residuals are normally distributed around the mean outcome and the average of the residuals is zero.” This implies that at any given predictor value the distribution of $$Y$$ given $$X$$ is assumed to be normal. For example, in Figure 5.19, the SLR on the left meets the normality assumption while the one on the right (from Example 4.1) does not.
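To see what this shorthand means in practice, the short simulation below (hypothetical data, not from NHANES) generates an outcome whose errors are normal by construction; the residuals from the fitted SLR then behave exactly as the assumption describes.

```r
set.seed(42)

# Simulate a SLR that meets the normality assumption by construction:
# at every x, Y is normal around the regression line
x <- runif(100, 0, 10)
y <- 1 + 2 * x + rnorm(100, mean = 0, sd = 1.5)  # epsilon ~ N(0, 1.5^2)
fit <- lm(y ~ x)

# OLS residuals always average to zero when the model includes an intercept
mean(resid(fit))

# Shapiro-Wilk test of normality of the residuals
# (the errors are truly normal here, so typically a large p-value)
shapiro.test(resid(fit))$p.value
```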

### 5.15.1 Impact of non-normality

A violation of the assumption of normality of residuals may result in incorrect inferences in small samples. Both confidence intervals and p-values rely on the normality assumption, so if it is not valid then these may be inaccurate. However, in large samples (at least 10 observations per predictor, although this is just a rule of thumb), violation of the normality assumption does not have much of an impact on inferences.
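One way to see the large-sample robustness claim is a small coverage simulation. The sketch below is our own hypothetical setup (the helper `cover()` is not a function from this text): it draws deliberately skewed, mean-zero errors and checks how often the nominal 95% CI for the slope contains the true value.

```r
set.seed(1)

# Empirical coverage of the 95% CI for the slope when the errors are
# skewed (centered exponential) rather than normal
cover <- function(n, reps = 1000) {
  hits <- 0
  for (i in 1:reps) {
    x <- rnorm(n)
    y <- 1 + 0.5 * x + (rexp(n) - 1)  # skewed errors with mean 0
    ci <- confint(lm(y ~ x))["x", ]
    if (ci[1] <= 0.5 && ci[2] >= 0.5) hits <- hits + 1
  }
  hits / reps
}

cover(10)   # small n: coverage can drift away from 0.95
cover(200)  # large n: close to 0.95 despite the non-normal errors
```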

### 5.15.2 Diagnosis of non-normality

While Figure 5.19 provided a general illustration, two diagnostic tools that allow us to directly compare the residuals to a normal distribution are a histogram of the residuals and a normal quantile-quantile plot (QQ plot), both of which are explained in the following example.
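Under the hood, a QQ plot is simply the ordered residuals plotted against normal quantiles. Here is a minimal sketch of what qqnorm() computes, using simulated stand-in residuals (the reference line below is a mean/SD approximation, whereas qqline() uses a quartile-based line):

```r
set.seed(7)
r <- rnorm(30)  # stand-in residuals; in practice use fit$residuals

# Theoretical quantiles: n probability points through the normal inverse CDF
theo <- qnorm(ppoints(length(r)))
emp  <- sort(r)  # sample quantiles = ordered residuals

plot(theo, emp, pch = 20, col = "red",
     xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")
abline(a = mean(r), b = sd(r), col = "blue", lty = 2)  # approximate reference line
```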

To illustrate the diagnosis of non-normality in a small sample size, we will use a random subset of 50 observations from our NHANES fasting subsample dataset (nhanesf.complete.50_rmph.Rdata) and re-fit the model from Example 5.1.

load("Data/nhanesf.complete.50_rmph.Rdata")

# Refit the model using the small n dataset
fit.ex5.1.smalln <- lm(LBDGLUSI ~ BMXWAIST + smoker + RIDAGEYR +
RIAGENDR + race_eth + income,
data = nhanesf.complete.50)

# View results
round(cbind(summary(fit.ex5.1.smalln)$coef, confint(fit.ex5.1.smalln)), 4)
##                            Estimate Std. Error t value Pr(>|t|)   2.5 %  97.5 %
## (Intercept)                  3.1771     1.5706  2.0228   0.0500  0.0002  6.3540
## BMXWAIST                     0.0258     0.0102  2.5390   0.0152  0.0052  0.0464
## smokerPast                  -0.4796     0.5278 -0.9088   0.3690 -1.5471  0.5879
## smokerCurrent                0.5608     0.6197  0.9048   0.3711 -0.6928  1.8143
## RIDAGEYR                     0.0168     0.0139  1.2109   0.2332 -0.0113  0.0448
## RIAGENDRFemale              -1.0163     0.4812 -2.1120   0.0411 -1.9896 -0.0430
## race_ethNon-Hispanic White   0.0297     0.6071  0.0489   0.9612 -1.1983  1.2577
## race_ethNon-Hispanic Black   0.2479     0.8903  0.2785   0.7821 -1.5528  2.0487
## race_ethNon-Hispanic Other   2.2347     0.8800  2.5393   0.0152  0.4547  4.0147
## income$25,000 to < $55,000  -0.1413     0.7488 -0.1886   0.8514 -1.6559  1.3734
## income$55,000+              -0.2065     0.6999 -0.2951   0.7695 -1.6223  1.2092

Example 5.1 (continued): Diagnose non-normality in the small sample size version of Example 5.1. Plot the residuals using a histogram and QQ plot.

# Compute residuals
RESID <- fit.ex5.1.smalln$residuals

# Sample size vs. number of predictors
length(RESID)
## [1] 50
length(coef(fit.ex5.1.smalln)) - 1
## [1] 10

There are fewer than 10 observations per predictor.

par(mfrow=c(1,2))
hist(RESID, xlab = "Residuals", probability = T,
     # Adjust ylim to be able to see more of the curves if needed
     ylim = c(0, 0.6))
# Superimpose empirical density (no normality assumption, for comparison)
lines(density(RESID, na.rm=T), lwd = 2, col = "red")
# Superimpose best fitting normal curve
curve(dnorm(x, mean = mean(RESID, na.rm=T), sd = sd(RESID, na.rm=T)),
      lty = 2, lwd = 2, add = TRUE, col = "blue")
# Normal quantile-quantile (QQ) plot
qqnorm(as.numeric(RESID), col="red", pch=20)
qqline(as.numeric(RESID), col="blue", lty=2, lwd=2)

In the left panel (histogram of residuals) of Figure 5.20, the solid line provides a smoothed estimate of the distribution of the observed residuals without making any assumption about the shape of the distribution. The dashed line is the best-fitting normal curve. The closer these two curves are to each other, the better the normality assumption is met. In the right panel (QQ plot), the ordered values of the $$n$$ residuals are plotted against the corresponding $$n$$ quantiles of the standard normal distribution. The closer the points are to the dashed line, the better the normality assumption is met. In this example, there is a lack of normality in the tails of the distribution. Since the sample size is small (n = 50, 10 predictors, fewer than 10 observations per predictor), non-normality may impact the results.

For convenience, instead of all the code above you can use the check_normality() function found in Functions_rmph.R (which you loaded at the beginning of this chapter).

# If you have not already loaded this file...
# ...do so now to make check_normality() available
source("Functions_rmph.R")

check_normality(fit.ex5.1.smalln)

### 5.15.3 Potential solutions for non-normality

If the sample size is large (at least 10 observations per predictor, although this is just a rule of thumb), non-normality does not need to be addressed. If the sample size is not large (or if you are not sure and want a method that does not rely on normality), car::Boot() can be used to compute bootstrap 95% confidence intervals (and corresponding hypothesis tests) that do not depend on the normality assumption. Briefly, the bootstrap resamples the observations with replacement many $$(R)$$ times and refits the model on each resampled dataset. The resulting $$R$$ sets of regression coefficient estimates are therefore approximate draws from the unknown sampling distributions, and CIs based on these draws can be computed with no assumption of normality. For each regression coefficient, a test of the null hypothesis that the coefficient is 0 can be carried out by examining the bootstrap CI to see if it contains 0. A full explanation of the bootstrap, why it works, and how best to use it is beyond the scope of this text, but see Efron and Tibshirani (1993) and Davison (1997) to learn more.

Some alternative methods of handling non-normality include the following.

• Non-normality may co-occur with non-linearity and/or non-constant variance. It is a good idea to check all the assumptions, resolve the most glaring assumption failure first, and then re-check all the assumptions to see what problems remain. Transforming the outcome is often a good place to start and can help with all three assumptions.

• Normalizing transformation: A transformation of the outcome $$Y$$ used to correct non-normality of residuals is called a “normalizing transformation.” Common transformations are the natural logarithm, square root, and inverse. These are all special cases of the Box-Cox family of outcome transformations, which we will discuss in Section 5.18.

• Unfortunately, these transformations only work for positive outcomes. An exception is the square-root transformation, which can handle zeros but still not negative values. For the other transformations, if the outcome is non-negative but includes some zero values, it is common to add 1 before transforming (e.g., $$\log(X+1)$$). If the values of the raw variable are all very small or large, then instead of adding 1, add a number on a different scale (e.g., 0.01, 0.10, 10, 100). However, this approach is arbitrary and the final results can depend on the number chosen. A sensitivity analysis can be useful to check whether the results change when adding a different number (see Section 5.24).

• If the outcome has many zero values, however, then no transformation will result in normality – there will always be a spike in the distribution at that (transformed) value. An alternative is to transform the variable into a binary or ordinal variable and use logistic regression (see Chapter 6), or use a regression method designed for non-negative data such as Poisson regression, zero-inflated Poisson regression, or negative binomial regression.

NOTE: A regression model does not assume that either the outcome or the predictors are normally distributed. Rather, it assumes that the residuals are normally distributed. It is often the case, however, that the normality or non-normality of the outcome ($$Y$$) corresponds to that of the residuals.

Example 5.1 (continued): Continuing with our small sample size version of this example, use the bootstrap to obtain confidence intervals and hypothesis tests for each term in the model that do not depend on normality.
# Set seed if you want the same answer every time
set.seed(27340)

# Run bootstrap
boot.ex5.1 <- car::Boot(fit.ex5.1.smalln, R = 2000)

# View original and bootstrap corrected regression coefficients,
# and bootstrap 95% CIs
round(cbind(summary(boot.ex5.1), confint(boot.ex5.1)), 4)
##                               R original bootBias bootSE bootMed   2.5 %  97.5 %
## (Intercept)                1972   3.1771   0.0650 1.3220  3.4058 -0.8644  4.9979
## BMXWAIST                   1972   0.0258  -0.0004 0.0114  0.0245  0.0089  0.0570
## smokerPast                 1972  -0.4796  -0.0068 0.4382 -0.4567 -1.6404  0.2035
## smokerCurrent              1972   0.5608  -0.0680 0.8921  0.3070 -0.4641  4.3408
## RIDAGEYR                   1972   0.0168  -0.0015 0.0103  0.0145  0.0022  0.0474
## RIAGENDRFemale             1972  -1.0163   0.0670 0.5081 -0.8955 -2.8027 -0.3430
## race_ethNon-Hispanic White 1972   0.0297  -0.0083 0.3432  0.0113 -0.6024  0.8000
## race_ethNon-Hispanic Black 1972   0.2479  -0.0406 0.5720  0.2369 -1.0430  1.1696
## race_ethNon-Hispanic Other 1972   2.2347   0.0037 1.7117  2.1507 -0.1613  7.3470
## income$25,000 to < $55,000 1972  -0.1413   0.0289 0.5449 -0.1343 -1.1562  1.0114
## income$55,000+             1972  -0.2065   0.0656 0.4719 -0.1356 -1.2586  0.5648

Some of the original regression coefficients are biased, as shown by the bootBias values and the fact that the medians of the simulated sampling distributions (bootMed) are sometimes meaningfully different from the original values. With a small sample size, use the bootstrap CIs (2.5 %, 97.5 %) rather than the original confidence intervals. Also, to test the significance of each term in the model, check to see if the corresponding bootstrap CI contains 0. Based on the bootstrap, waist circumference (BMXWAIST), age (RIDAGEYR), and gender (RIAGENDRFemale) are significantly associated with the outcome (their bootstrap CIs do not contain 0). This is different from the results we obtained above in the analysis that relied on the normality assumption, in which age was not significant and non-Hispanic Other was significantly different from Hispanic (the reference level).

NOTE: The value of $$R$$ in the output may be smaller than the value of $$R$$ specified in the call to car::Boot(). This is because some of the bootstrap samples resulted in a model that did not converge or, more likely with a linear regression with a small sample size, in the resampled dataset one of the categorical predictors had a level with 0 observations, resulting in a model with fewer parameters (see ?car::Boot). In this example, this happened 2000 - 1972 = 28 times, and these 28 resampled datasets were excluded from the summary.
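For intuition about what car::Boot() is doing, the resampling can be sketched by hand. The simplified version below (case resampling with a percentile CI, on hypothetical data, not the NHANES model above) omits the bookkeeping car::Boot() does for failed refits and dropped factor levels.

```r
set.seed(1)

# Hypothetical data for illustration: skewed, mean-zero errors
n <- 50
dat <- data.frame(x = rnorm(n))
dat$y <- 1 + 0.5 * dat$x + (rexp(n) - 1)

R <- 2000
boot.slopes <- numeric(R)
for (i in 1:R) {
  # Resample rows with replacement and refit the model
  dat.star <- dat[sample(n, replace = TRUE), ]
  boot.slopes[i] <- coef(lm(y ~ x, data = dat.star))["x"]
}

# Percentile bootstrap 95% CI for the slope (no normality assumption)
quantile(boot.slopes, c(0.025, 0.975))
```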

Example 5.1 (continued): Examine if the natural logarithm, square-root, or inverse normalizing transformations correct the lack of normality.

Figures 5.21 to 5.24 display the histograms and QQ plots of residuals for the untransformed outcome and after each of these three transformations. Since the inverse transformation reverses the ordering of the variable, we also multiply by -1 to preserve the original ordering.

# Check for zero or negative values
summary(nhanesf.complete.50$LBDGLUSI)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    4.55    5.27    5.61    5.97    6.11   16.20
# All positive so we can use log, sqrt, or inverse
fit.ex5.1.logY <- lm(log(LBDGLUSI) ~ BMXWAIST + smoker + RIDAGEYR +
RIAGENDR + race_eth + income, data = nhanesf.complete.50)
fit.ex5.1.sqrtY <- lm(sqrt(LBDGLUSI) ~ BMXWAIST + smoker + RIDAGEYR +
RIAGENDR + race_eth + income, data = nhanesf.complete.50)
fit.ex5.1.invY <- lm(-1/LBDGLUSI ~ BMXWAIST + smoker + RIDAGEYR +
RIAGENDR + race_eth + income, data = nhanesf.complete.50)
# Histograms and QQ plots of residuals
# sample.size=F suppresses printing of the sample size
check_normality(fit.ex5.1.smalln, sample.size=F, main = "Y")
check_normality(fit.ex5.1.logY,   sample.size=F, main = "log(Y)")
check_normality(fit.ex5.1.sqrtY,  sample.size=F, main = "Sqrt(Y)")
check_normality(fit.ex5.1.invY,   sample.size=F, main = "-1/Y")

None seem to resolve the non-normality completely, but of these the inverse transformation is the best. See Section 5.18 for how to implement the Box-Cox transformation which potentially can improve upon these. In this example, Box-Cox turns out not to help much – after reading Section 5.18, come back and try it as an exercise to confirm that conclusion.

### References

Davison, A. C., and D. V. Hinkley. 1997. Bootstrap Methods and Their Application. New York: Cambridge University Press.
Efron, Bradley, and R. J. Tibshirani. 1993. An Introduction to the Bootstrap. New York: Chapman & Hall/CRC.
Fox, John, and Sanford Weisberg. 2019. An R Companion to Applied Regression. 3rd ed. Los Angeles: Sage Publications, Inc.
Fox, John, Sanford Weisberg, and Brad Price. 2022. Car: Companion to Applied Regression. https://CRAN.R-project.org/package=car.
Schmidt, Amand F., and Chris Finan. 2018. “Linear Regression and the Normality Assumption.” Journal of Clinical Epidemiology 98 (June): 146–51. https://doi.org/10.1016/j.jclinepi.2017.12.006.
Weisberg, Sanford. 2013. Applied Linear Regression. 4th ed. Hoboken, NJ: Wiley.
Zeileis, Achim, Christian Kleiber, and Simon Jackman. 2008. “Regression Models for Count Data in R.” Journal of Statistical Software 27 (July): 1–25. https://doi.org/10.18637/jss.v027.i08.