## 3.1 Examining the data

Before fitting a regression model, it is a good idea to examine each of the variables individually. One reason to do this is to look for anomalous values that may require you to look more closely at your data source. Additionally, a common element in the presentation of regression analysis results is a table containing descriptive statistics for each analysis variable.

How a variable is examined depends on whether it is continuous or categorical (as defined in Chapter 2).

- For continuous variables, create a numerical summary and plot a histogram using
`summary()`

and`hist()`

. - For categorical variables, create a frequency table and plot a bar chart using
`table()`

and`barplot()`

.

**NOTE:** Throughout this text, we assume that categorical variables are coded in R as `factor`

variables (for more information, see “Factors” in R for Data Science (H. Wickham, Çetinkaya-Rundel, and Grolemund 2017).

**Example 3.1:** Using data from a random subset of 1,000 adults from the NHANES 2017-2018 examination teaching dataset (see Appendix A.1), summarize the continuous variables systolic blood pressure (`sbp`

) and age (`RIDAGEYR`

) and the categorical variables gender (`RIAGENDR`

) and annual household income category (`income`

).

First, load the NHANES examination teaching dataset using `load()`

.

```
load("Data/nhanes1718_adult_exam_sub_rmph.Rdata")
# For convenience, give the dataset a shorter name
nhanes <- nhanes_adult_exam_sub
```

Next, use `summary()`

to look at some basic descriptive statistics for the continuous variables. None of these values seem out of the ordinary, although note that there are 42 missing values for SBP.

```
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 83 111 121 124 134 234 42
```

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20.0 32.0 47.0 47.7 61.0 80.0
```

Use `hist()`

to create a visualization of the entire distribution for each variable. As shown in Figure 3.1, SBP is a bit skewed to the right, which is typical for many health-related measures, and there are more younger than older individuals. There is a spike at the upper end of the age distribution because NHANES, for privacy reasons, reports ages \(\ge\) 80 years as exactly 80 years (see the NHANES documentation for `RIDAGEYR`

, accessed May 20, 2022).

```
par(mfrow=c(1,2))
hist(nhanes$sbp,
xlab = "", main = "Systolic Blood Pressure (mmHg)")
hist(nhanes$RIDAGEYR,
xlab = "", main = "Age (years)")
```

Computations of common descriptive statistics are demonstrated below. The option `na.rm = T`

is needed if there are any missing values, otherwise the functions will return `NA`

(indicating “missing” or “unknown”).

`## [1] 123.5`

`## [1] 17.65`

`## [1] 121`

`## [1] 23`

```
# 25th and 75th percentile
# (sometimes also referred to as the IQR)
quantile(nhanes$sbp,
probs = c(0.25, 0.75),
na.rm = T)
```

```
## 25% 75%
## 111 134
```

`## [1] 83`

`## [1] 234`

`## [1] 42`

`## [1] 958`

For categorical variables, use `table()`

and `prop.table()`

to examine the frequency and proportion of observations at each level (each possible value of the variable). The `exclude = NULL`

option tells R to include the number of missing values in the frequency table. In `prop.table()`

, `exclude = NULL`

is omitted here, resulting in proportions of non-missing cases.

```
##
## < $25,000 $25,000 to < $55,000 $55,000+ <NA>
## 156 254 480 110
```

```
##
## < $25,000 $25,000 to < $55,000 $55,000+
## 0.1753 0.2854 0.5393
```

```
##
## Male Female
## 482 518
```

```
##
## Male Female
## 0.482 0.518
```

The upper income group is most common and there are 110 individuals with missing income values. Missing income values are common in survey data as some individuals are reluctant to disclose their income, even when the response options are ranges of values. Also, NHANES has gender response options “Male” and “Female” and, in this subset of the data, there are more females than males.

Options for visualizing the distribution of categorical variables include vertical and horizontal barcharts created with `barplot()`

, as shown in Figure 3.2.

```
par(mfrow=c(1,2))
# barplot() expects frequencies, not the raw data, so use table inside barplot()
barplot(table(nhanes$RIAGENDR),
ylab = "Frequency", xlab = "Gender")
barplot(table(nhanes$income), horiz=T, cex.names = 0.65,
xlab = "Frequency", ylab = "Annual Household Income")
```

Substituting `prop.table(table())`

for `table()`

results in plotting proportions instead of frequencies, as shown in Figure 3.3.

### 3.1.1 Detailed description of all variables in a dataset

`summary()`

can be used on multiple variables all at once.

```
## sbp RIDAGEYR RIAGENDR income
## Min. : 83 Min. :20.0 Male :482 < $25,000 :156
## 1st Qu.:111 1st Qu.:32.0 Female:518 $25,000 to < $55,000:254
## Median :121 Median :47.0 $55,000+ :480
## Mean :124 Mean :47.7 NA's :110
## 3rd Qu.:134 3rd Qu.:61.0
## Max. :234 Max. :80.0
## NA's :42
```

The `describe()`

function in the `Hmisc`

library (Harrell 2023) also summarizes multiple variables at once and, additionally, provides more detail than `summary()`

.

```
# Access the describe function without loading
# the entire Hmisc library using the :: syntax
nhanes %>%
select(sbp, RIDAGEYR, RIAGENDR, income) %>%
Hmisc::describe()
# (results not shown)
```

If desired, use `write()`

to export the results to an external file.

### References

*Hmisc: Harrell Miscellaneous*. https://hbiostat.org/R/Hmisc/.

*R for Data Science: Import, Tidy, Transform, Visualize, and Model Data*. 2nd ed. Sebastopol, CA: O’Reilly Media.