5.3 Complete case analysis dataset
We will use a complete case analysis in this chapter, excluding all individuals (cases) that have missing values for any of the analysis variables (see Section 3.2.1). Complete case analysis is already the default for the lm() function, but if we want all our results, in particular the descriptive statistics in our “Table 1”, to be based on the same sample as used to fit the regression model, then we must explicitly remove individuals with missing values ahead of time. See Chapter 9 for how to handle missing data using multiple imputation.
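To illustrate why the sample behind the descriptive statistics can differ from the sample behind the regression, here is a minimal sketch using a small made-up data frame (toy and its variables y, x, and z are hypothetical, not part of the NHANES data). It assumes R's default setting of na.action (na.omit), under which lm() silently drops rows with a missing value in any model variable.

```r
# Toy data (hypothetical, not NHANES): 6 individuals
toy <- data.frame(
  y = c(5.1, 6.3, NA, 5.8, 7.2, 6.0),   # outcome, 1 missing
  x = c(90, 102, 95, NA, 120, 99),      # predictor, 1 missing
  z = c(30, NA, 50, 60, 70, 80)         # not in the model, 1 missing
)

# lm() drops rows with a missing y or x (default na.action = na.omit)
fit <- lm(y ~ x, data = toy)

n_all    <- nrow(toy)             # rows in the dataset
n_used   <- nobs(fit)             # rows actually used to fit the model
n_mean_x <- sum(!is.na(toy$x))    # rows contributing to mean(x, na.rm = TRUE)

c(n_all = n_all, n_used = n_used, n_mean_x = n_mean_x)
```

In this sketch the model uses fewer rows than a descriptive summary of x would, which is the mismatch that creating a complete case dataset ahead of time avoids.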
Example 5.1 (continued): Use summary() to assess the extent of missing data in the variables in our analysis and create a complete case analysis dataset.
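# Load the data and give it a shorter name
load("Data/nhanes1718_adult_fast_sub_rmph.Rdata")
nhanesf <- nhanes_adult_fast_sub
rm(nhanes_adult_fast_sub)
# Summarize the analysis variables; NA's rows show the missing counts
# (requires dplyr for %>% and select())
nhanesf %>%
  select(LBDGLUSI, BMXWAIST, smoker, RIDAGEYR,
         RIAGENDR, RIDRETH3, income) %>%
  summary()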
##     LBDGLUSI        BMXWAIST         smoker       RIDAGEYR      RIAGENDR                 RIDRETH3
##  Min.   : 2.61   Min.   : 63.2   Never  :579   Min.   :20.0   Male  :457   Mexican American  :120
##  1st Qu.: 5.33   1st Qu.: 88.3   Past   :264   1st Qu.:34.0   Female:543   Other Hispanic    : 71
##  Median : 5.72   Median : 97.8   Current:157   Median :47.0                Non-Hispanic White:602
##  Mean   : 6.09   Mean   :100.5                 Mean   :47.9                Non-Hispanic Black:115
##  3rd Qu.: 6.22   3rd Qu.:111.0                 3rd Qu.:61.0                Non-Hispanic Asian: 48
##  Max.   :19.00   Max.   :169.5                 Max.   :80.0                Other/Multi       : 44
##                  NA's   :35
##                   income
##  < $25,000           :164
##  $25,000 to < $55,000:224
##  $55,000+            :489
##  NA's                :123
##
##
##
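As a complement to reading the NA's rows in the summary() output, missing values can also be counted directly. The following is a minimal base R sketch on a made-up data frame (toy and its values are hypothetical and chosen only for illustration); the same two calls could be applied to the selected analysis variables.

```r
# Toy data (hypothetical): values chosen for illustration only
toy <- data.frame(
  glucose = c(5.3, 5.7, 6.2, 6.1, 5.9),
  waist   = c(88.3, NA, 111.0, 97.8, NA),
  income  = factor(c("< $25,000", NA, "$55,000+", NA, NA))
)

# Number of missing values in each variable
n_missing <- colSums(is.na(toy))
n_missing

# Number of rows with no missing value in any variable
n_complete <- sum(complete.cases(toy))
n_complete
```

Counting complete rows before dropping anything gives a preview of how large the complete case dataset will be.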
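Waist circumference (BMXWAIST, 35 missing) and income (123 missing) are the only analysis variables with missing values. The code below creates a complete case dataset by first selecting only the analysis variables and then dropping every row that has a missing value; as a side effect, all other variables are removed from the dataset. See Section 3.2.1 for an alternative method that retains all the other variables.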
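# Keep only the analysis variables, then drop rows with any missing value
nhanesf.complete <- nhanesf %>%
  select(LBDGLUSI, BMXWAIST, smoker, RIDAGEYR,
         RIAGENDR, RIDRETH3, income) %>%
  drop_na()
# Compare sample sizes before and after
nrow(nhanesf)
## [1] 1000
nrow(nhanesf.complete)
## [1] 857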
While the full dataset has 1000 observations, the complete case dataset has only 857.
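For readers who want to keep the other variables, one possible approach (a sketch only, and not necessarily the method described in Section 3.2.1) is to check for missing values in the analysis variables alone while leaving all columns in place. The data frame toy, its columns, and the vector analysis_vars below are hypothetical.

```r
# Toy data (hypothetical): y and x are analysis variables, other is not
toy <- data.frame(
  id    = 1:5,
  y     = c(5.1, NA, 5.8, 7.2, 6.0),
  x     = c(90, 102, NA, 120, 99),
  other = c(NA, 1, 2, 3, 4)
)

analysis_vars <- c("y", "x")

# Keep rows that are complete on the analysis variables only;
# all columns, including 'other', are retained
keep <- complete.cases(toy[, analysis_vars])
toy_complete <- toy[keep, ]

dim(toy_complete)
```

Here a row with a missing value only in other is kept, because that variable is not used in the analysis. With tidyr, passing the analysis variable names to drop_na() restricts the check to those columns in the same way.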