Before fitting a regression model, it is a good idea to examine each of the variables individually. One reason to do this is to look for anomalous values that may require you to look more closely at your data source. Additionally, a common element in the presentation of regression analysis results is a table containing descriptive statistics for each analysis variable.
How a variable is examined depends on whether it is continuous or categorical (as defined in Chapter 2).
- For continuous variables, create a numerical summary and plot a histogram using summary() and hist().
- For categorical variables, create a frequency table and plot a bar chart using table(), prop.table(), and barplot().
NOTE: Throughout this text, we assume that categorical variables are coded in R as factor variables (for more information, see “Factors” in R for Data Science (H. Wickham, Çetinkaya-Rundel, and Grolemund 2017)).
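As a minimal sketch of what factor coding looks like, the following converts a numerically coded variable into a factor with labeled levels. The vector gender_code and its 1/2 coding are made up for illustration and are not read from the teaching dataset.

```r
# Hypothetical numeric codes (1 = Male, 2 = Female); illustration only
gender_code <- c(1, 2, 2, 1, 2)

# Convert to a factor with labeled levels
gender <- factor(gender_code, levels = c(1, 2), labels = c("Male", "Female"))

is.factor(gender)  # check that the result is a factor
levels(gender)     # the labels, in the order given to levels =
table(gender)      # frequency of each level
```

Specifying levels explicitly fixes the order in which the categories appear in tables and plots.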
Example 3.1: Using data from a random subset of 1,000 adults from the NHANES 2017-2018 examination teaching dataset (see Appendix A.1), summarize the continuous variables systolic blood pressure (sbp) and age (RIDAGEYR) and the categorical variables gender (RIAGENDR) and annual household income category (income).
First, load the NHANES examination teaching dataset using load().
load("Data/nhanes1718_adult_exam_sub_rmph.Rdata")
# For convenience, give the dataset a shorter name
nhanes <- nhanes_adult_exam_sub
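The reason for the renaming step is that load() restores objects under the names they had when they were saved. The sketch below illustrates this with a toy data frame written to a temporary file; the names toy_long_name and toy are hypothetical and stand in for the dataset names above.

```r
# Hypothetical toy data frame saved to a temporary file; illustration only
toy_long_name <- data.frame(id = 1:3, x = c(10, 20, 30))
tmp <- tempfile(fileext = ".Rdata")
save(toy_long_name, file = tmp)
rm(toy_long_name)

# load() restores the object under its saved name and
# (invisibly) returns the names of the objects it loaded
loaded <- load(tmp)
loaded

# Assign to a shorter name for convenience
toy <- toy_long_name
```

Capturing the return value of load() is a convenient way to find out what a file contained when the object names are not known in advance.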
Use summary() to look at some basic descriptive statistics for the continuous variables. None of these values seem out of the ordinary, although note that there are 42 missing values for SBP.
# Continuous variables
summary(nhanes$sbp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##      83     111     121     124     134     234      42
summary(nhanes$RIDAGEYR)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    20.0    32.0    47.0    47.7    61.0    80.0
Use hist() to create a visualization of the entire distribution for each variable. As shown in Figure 3.1, SBP is a bit skewed to the right, which is typical for many health-related measures, and there are more younger than older individuals. There is a spike at the upper end of the age distribution because NHANES, for privacy reasons, reports ages \(\ge\) 80 years as exactly 80 years (see the NHANES documentation for RIDAGEYR, accessed May 20, 2022).
par(mfrow=c(1,2))
hist(nhanes$sbp, xlab = "", main = "Systolic Blood Pressure (mmHg)")
hist(nhanes$RIDAGEYR, xlab = "", main = "Age (years)")
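To see why top-coding produces a spike, the following sketch applies the same rule to simulated ages. The values are randomly generated for illustration and are not the NHANES data; the variable names age_true and age_reported are hypothetical.

```r
set.seed(1)
# Simulated ages; illustration only, not the NHANES data
age_true <- sample(20:95, size = 500, replace = TRUE)

# Top-code: report any age of 80 or more as exactly 80
age_reported <- pmin(age_true, 80)

max(age_reported)        # the cap
sum(age_reported == 80)  # number of observations at the cap
sum(age_true >= 80)      # same count, by construction
```

All ages of 80 or more collapse onto a single value, so the final histogram bin is taller than it would be for the underlying ages.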
Computations of common descriptive statistics are demonstrated below. The option na.rm = T is needed if there are any missing values; otherwise, the functions will return NA (indicating “missing” or “unknown”).
# Mean
mean(nhanes$sbp, na.rm = T)
## [1] 123.5
# Standard deviation
sd(nhanes$sbp, na.rm = T)
## [1] 17.65
# Median
median(nhanes$sbp, na.rm = T)
## [1] 121
# Interquartile range
IQR(nhanes$sbp, na.rm = T)
## [1] 23
# 25th and 75th percentile
# (sometimes also referred to as the IQR)
quantile(nhanes$sbp, probs = c(0.25, 0.75), na.rm = T)
## 25% 75% 
## 111 134
# Minimum
min(nhanes$sbp, na.rm = T)
## [1] 83
# Maximum
max(nhanes$sbp, na.rm = T)
## [1] 234
# Number of missing values
sum(is.na(nhanes$sbp))
## [1] 42
# Number of non-missing values
sum(!is.na(nhanes$sbp))
## [1] 958
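The effect of omitting na.rm can be seen in a small sketch. The vector x below is made up for illustration and is not taken from the dataset.

```r
# Hypothetical blood pressure values with one missing; illustration only
x <- c(118, 124, NA, 131)

mean(x)                # NA, because one value is missing
mean(x, na.rm = TRUE)  # mean of the three non-missing values
sum(is.na(x))          # number of missing values
```

This sketch writes TRUE in full rather than T. Both work here, but T is an ordinary variable that can be reassigned, whereas TRUE is a reserved word.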
For categorical variables, use table() and prop.table() to examine the frequency and proportion of observations at each level (each possible value of the variable). The exclude = NULL option tells R to include the number of missing values in the frequency table. In the prop.table() call, exclude = NULL is omitted here, resulting in proportions of non-missing cases.
# Categorical variables
table(nhanes$income, exclude = NULL)
## 
##            < $25,000 $25,000 to < $55,000             $55,000+                 <NA> 
##                  156                  254                  480                  110
prop.table(table(nhanes$income))
## 
##            < $25,000 $25,000 to < $55,000             $55,000+ 
##               0.1753               0.2854               0.5393
table(nhanes$RIAGENDR, exclude = NULL)
## 
##   Male Female 
##    482    518
prop.table(table(nhanes$RIAGENDR))
## 
##   Male Female 
##  0.482  0.518
The upper income group is most common and there are 110 individuals with missing income values. Missing income values are common in survey data as some individuals are reluctant to disclose their income, even when the response options are ranges of values. Also, NHANES has gender response options “Male” and “Female” and, in this subset of the data, there are more females than males.
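Whether missing values are counted changes the denominator of each proportion. The following sketch contrasts the two choices using a made-up factor (inc is hypothetical and much smaller than the income variable above).

```r
# Hypothetical income categories with missing values; illustration only
inc <- factor(c("Low", "High", NA, "High", "Low", "High", NA, "High"),
              levels = c("Low", "High"))

# Proportions among non-missing cases (denominator = 6)
p_nonmiss <- prop.table(table(inc))

# Proportions among all cases, with NA as its own category (denominator = 8)
p_all <- prop.table(table(inc, exclude = NULL))

p_nonmiss
p_all
```

Which version to report depends on whether the goal is to describe the respondents who answered or the extent of missingness in the full sample.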
Options for visualizing the distribution of categorical variables include vertical and horizontal bar charts created with barplot(), as shown in Figure 3.2.
par(mfrow=c(1,2))
# barplot() expects frequencies, not the raw data, so use table inside barplot()
barplot(table(nhanes$RIAGENDR), ylab = "Frequency", xlab = "Gender")
# With horiz = T the bars run along the x-axis, so xlab labels the frequencies
barplot(table(nhanes$income), horiz=T, cex.names = 0.65,
        xlab = "Frequency", ylab = "Annual Household Income")
Wrapping prop.table() around table() results in plotting proportions instead of frequencies, as shown in Figure 3.3.
barplot(prop.table(table(nhanes$income)), ylab = "Proportion", xlab = "Annual Household Income")
summary() can be used on multiple variables all at once.
summary(nhanes[, c("sbp", "RIDAGEYR", "RIAGENDR", "income")])
##       sbp         RIDAGEYR      RIAGENDR                    income   
##  Min.   : 83   Min.   :20.0   Male  :482   < $25,000           :156  
##  1st Qu.:111   1st Qu.:32.0   Female:518   $25,000 to < $55,000:254  
##  Median :121   Median :47.0                $55,000+            :480  
##  Mean   :124   Mean   :47.7                NA's                :110  
##  3rd Qu.:134   3rd Qu.:61.0                                          
##  Max.   :234   Max.   :80.0                                          
##  NA's   :42
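When the goal is a descriptive statistics table with one row per continuous variable, one option is to apply the same set of functions to each column. The following is a minimal sketch using simulated data; the data frame dat and the choice of statistics are assumptions made for illustration.

```r
# Simulated data; illustration only, not the NHANES data
set.seed(42)
dat <- data.frame(sbp = rnorm(100, mean = 120, sd = 15),
                  age = sample(20:80, 100, replace = TRUE))
dat$sbp[c(3, 17)] <- NA  # introduce two missing values

# One row per variable: n non-missing, n missing, mean, SD
desc <- t(sapply(dat, function(v) {
  c(n       = sum(!is.na(v)),
    missing = sum(is.na(v)),
    mean    = mean(v, na.rm = TRUE),
    sd      = sd(v, na.rm = TRUE))
}))
round(desc, 1)
```

sapply() returns one column per variable, so t() is used to transpose the result into the more familiar layout with variables in rows.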
The describe() function in the Hmisc library (Harrell 2023) also summarizes multiple variables at once and, additionally, provides more detail than summary().
# Access the describe function without loading
# the entire Hmisc library using the :: syntax
# (assumes the tidyverse is loaded for %>% and select())
nhanes %>%
  select(sbp, RIDAGEYR, RIAGENDR, income) %>%
  Hmisc::describe()
# (results not shown)
If desired, use write() to export the results to an external file.
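As one possible approach for tabular results, the sketch below uses write.csv() rather than write(), since a frequency table converts naturally to a data frame. The factor grp and the temporary output file are hypothetical and chosen only for illustration.

```r
# Hypothetical categorical variable; illustration only
grp <- factor(c("A", "B", "B", "C", "B", "A"))

# Convert the frequency table to a data frame (columns: grp, Freq)
freq <- as.data.frame(table(grp))

# Write to a temporary CSV file; replace out_file with a real path as needed
out_file <- tempfile(fileext = ".csv")
write.csv(freq, file = out_file, row.names = FALSE)

# Read the file back to confirm what was written
chk <- read.csv(out_file)
chk
```

Setting row.names = FALSE keeps the row numbers out of the file, so the exported table has only the category and frequency columns.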