3.1 Examining the data

Before fitting a regression model, it is a good idea to examine each of the variables individually. One reason to do this is to look for anomalous values that may require you to look more closely at your data source. Additionally, a common element in the presentation of regression analysis results is a table containing descriptive statistics for each analysis variable.

How a variable is examined depends on whether it is continuous or categorical (as defined in Chapter 2).

For continuous variables, create a numerical summary and plot a histogram using summary() and hist().
For categorical variables, create a frequency table and plot a bar chart using table() and barplot().

NOTE: Throughout this text, we assume that categorical variables are coded in R as factor variables (for more information, see “Factors” in R for Data Science (H. Wickham, Çetinkaya-Rundel, and Grolemund 2023) or Chapter 3 in An Introduction to R for Research (Nahhas 2024)).

Example 3.1: Using data from a random subset of 1,000 adults from the NHANES 2017-2018 examination teaching dataset (see Appendix A.1), summarize the continuous variables systolic blood pressure (sbp) and age (RIDAGEYR) and the categorical variables gender (RIAGENDR) and annual household income category (income).

First, load the NHANES examination teaching dataset using load().

load("Data/nhanes1718_adult_exam_sub_rmph.Rdata")
# For convenience, give the dataset a shorter name
nhanes <- nhanes_adult_exam_sub

Next, use summary() to look at some basic descriptive statistics for the continuous variables. None of these values seem out of the ordinary, although note that there are 42 missing values for SBP (NA's).

# Continuous variables
summary(nhanes$sbp)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##      83     111     121     124     134     234      42

summary(nhanes$RIDAGEYR)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    20.0    32.0    47.0    47.7    61.0    80.0

Use hist() to create a visualization of the entire distribution for each variable. As shown in Figure 3.1, SBP is a bit skewed to the right, which is typical for many health-related measures, and there are more younger than older individuals. There is a spike at the upper end of the age distribution because NHANES, for privacy reasons, reports ages \(\ge\) 80 years as exactly 80 years (see the NHANES documentation for RIDAGEYR, accessed May 20, 2022).

par(mfrow=c(1,2))
hist(nhanes$sbp,
     main = "",
     xlab = "Systolic Blood Pressure (mmHg)")
hist(nhanes$RIDAGEYR,
     main = "",
     xlab = "Age (years)")

Histograms of systolic blood pressure and age

Figure 3.1: Histograms of continuous variables

Computations of common descriptive statistics are demonstrated below. The option na.rm = T is needed if there are any missing values, otherwise the functions will return NA (indicating “missing” or “unknown”).

# Mean
mean(nhanes$sbp, na.rm = T)

## [1] 123.5

# Standard deviation
sd(nhanes$sbp, na.rm = T)

## [1] 17.65

# Median
median(nhanes$sbp, na.rm = T)

## [1] 121

# Interquartile range
IQR(nhanes$sbp, na.rm = T)

## [1] 23

# 25th and 75th percentile
# (sometimes also referred to as the IQR)
quantile(nhanes$sbp,
         probs = c(0.25, 0.75),
         na.rm = T)

## 25% 75% 
## 111 134

# Minimum
min(nhanes$sbp, na.rm = T)

## [1] 83

# Maximum
max(nhanes$sbp, na.rm = T)

## [1] 234

# Number of non-missing values for a single variable
sum(complete.cases(nhanes$sbp))

## [1] 958

# Number of missing values for a single variable
sum(is.na(nhanes$sbp))

## [1] 42

For categorical variables, use table() and prop.table() to examine the frequency and proportion of observations at each level (each possible value of a categorical variable). The useNA = "ifany" option tells R to include the number of missing values in the frequency table. In prop.table(), useNA = "ifany" is omitted here, resulting in proportions of non-missing cases.

# Categorical variables
table(nhanes$income, useNA = "ifany")

## 
##            < $25,000 $25,000 to < $55,000             $55,000+                 <NA> 
##                  156                  254                  480                  110

prop.table(table(nhanes$income))

## 
##            < $25,000 $25,000 to < $55,000             $55,000+ 
##               0.1753               0.2854               0.5393

When a variable has no missing values, the output of table(, useNA = "ifany") simply excludes the <NA> column. To force it to appear, use table(, useNA = "always"), as demonstrated below.

table(nhanes$RIAGENDR, useNA = "always")

## 
##   Male Female   <NA> 
##    482    518      0

prop.table(table(nhanes$RIAGENDR))

## 
##   Male Female 
##  0.482  0.518

In general, we would like to know if there are missing values. Therefore, omitting the useNA option altogether is not recommended as that would result in exclusion of the <NA> column whether there are missing values or not. An exception is when using prop.table() to obtain the proportion of non-missing values.

The upper income group is most common and there are 110 individuals with missing income values. Missing income values are common in survey data as some individuals are reluctant to disclose their income, even when the response options are ranges of values. Also, NHANES has gender response options “Male” and “Female” and, in this subset of the data, there are more females than males.

Options for visualizing the distribution of categorical variables include vertical and horizontal barcharts created with barplot(), as shown in Figure 3.2.

par(mfrow=c(1,2))
# barplot() expects frequencies, not the raw data, so use table inside barplot()
barplot(table(nhanes$RIAGENDR),
        ylab = "Frequency", xlab = "Gender")
barplot(table(nhanes$income), horiz=T, cex.names = 0.65, 
        xlab = "Frequency", ylab = "Annual Household Income")

Figure 3.2: Barcharts of categorical variable frequencies

Substituting prop.table(table()) for table() results in plotting proportions instead of frequencies, as shown in Figure 3.3.

barplot(prop.table(table(nhanes$income)),
        ylab = "Proportion", xlab = "Annual Household Income")

Vertical proportion bar chart showing an increasing proportion in higher levels

Figure 3.3: Barchart of categorical variable proportions

3.1.1 Detailed description of all variables in a dataset

summary() can be used on multiple variables all at once, as can complete.cases().

# Summary of the data
summary(nhanes[, c("sbp", "RIDAGEYR", "RIAGENDR", "income")])

##       sbp         RIDAGEYR      RIAGENDR                    income   
##  Min.   : 83   Min.   :20.0   Male  :482   < $25,000           :156  
##  1st Qu.:111   1st Qu.:32.0   Female:518   $25,000 to < $55,000:254  
##  Median :121   Median :47.0                $55,000+            :480  
##  Mean   :124   Mean   :47.7                NA's                :110  
##  3rd Qu.:134   3rd Qu.:61.0                                          
##  Max.   :234   Max.   :80.0                                          
##  NA's   :42

# Number of cases
nrow(nhanes[, c("sbp", "RIDAGEYR", "RIAGENDR", "income")])

## [1] 1000

# Number of cases with no missing data
sum(complete.cases(nhanes[, c("sbp", "RIDAGEYR", "RIAGENDR", "income")]))

## [1] 855

# Number of cases with any missing data
sum(!complete.cases(nhanes[, c("sbp", "RIDAGEYR", "RIAGENDR", "income")]))

## [1] 145

The describe() function in the Hmisc library (Harrell 2023) also summarizes multiple variables at once and, additionally, provides more detail than summary().

# Access the describe function without loading
# the entire Hmisc library using the :: syntax
nhanes %>% 
  select(sbp, RIDAGEYR, RIAGENDR, income) %>% 
  Hmisc::describe()
# (results not shown)

If desired, use write() to export the results to an external file.

write(Hmisc::html(Hmisc::describe(nhanes)), "nhanes_description.html")

References

———. 2023. Hmisc: Harrell Miscellaneous. https://hbiostat.org/R/Hmisc/.

Nahhas, Ramzi W. 2024. “An Introduction to R for Research.” bookdown.org/rwnahhas/IntroToR.

Wickham, H., M. Çetinkaya-Rundel, and G. Grolemund. 2023. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 2nd ed. Sebastopol, CA: O’Reilly Media.