## 5.28 Exercises

**For Exercises 1 to 3, use the 2018 Natality subsample teaching dataset ( natality2018_rmph.Rdata, see Appendix A.3).**

Create a complete case analysis dataset that includes all cases with no missing data for the following variables:

`DBWT`

(birthweight (g)),`CIG_REC`

(smoked during pregnancy),`risks`

(risk factors reported),`MAGER`

(mother’s age),`MRACEHISP`

(mother’s race/ethnicity),`DMAR`

(marital status),`MEDUC`

(mother’s education),`PRIORLIVE`

(prior births now living),`PRIORDEAD`

(prior births now dead), and`BMI`

(mother’s body mass index, kg/m^{2}). How many rows were in the original dataset? How many rows are in the complete case analysis dataset?Examine the complete case analysis dataset you created in Exercise 1. Create a

`summary()`

and histogram for each continuous variable. For each categorical variable, check to see if it is coded as a factor and create a frequency table. If you are not sure whether a numeric (non-factor) variable should be treated as continuous or categorical, use`table()`

to see how many unique values there are – if there are only a few, then consider it categorical.Using the complete case analysis dataset you created in Exercise 1, treat

`PRIORLIVE`

and`PRIORDEAD`

as factor variables and collapse them so they have fewer levels, since many of the levels are sparse. Create a new variable called`PRIORLIVE_cat`

that is a factor with levels 0, 1, 2, 3, and 4+. Create a new variable called`PRIORDEAD_cat`

that is a factor with levels 0 and 1+. Hint: Use`factor()`

to turn them into factor variables first, then use`fct_collapse()`

to collapse the levels. Finally, use`table()`

to compare the variables before and after the derivation to make sure it was done correctly.

**For Exercises 4 to 11, use the modified complete case analysis 2018 Natality subsample teaching dataset you created in Exercises 1 and 3. If you did not complete Exercises 1 and 3, you can still do these exercises using PRIORLIVE instead of PRIORLIVE_cat and PRIORDEAD instead of PRIORDEAD_cat (treat them as continuous variables). In these exercises, you will assess the association between birthweight (the outcome) and smoking during pregnancy (the primary predictor), adjusted for risk factors reported, mother’s age, mother’s race/ethnicity, marital status, mother’s education, prior births now living (use the collapsed factor version if you completed Exercise 3), prior births now dead (use the collapsed factor version if you completed Exercise 3), and mother’s body mass index.**

Visualize the unadjusted relationship between the outcome and each predictor.

Fit both the unadjusted and adjusted regression models. Assess the unadjusted and adjusted association between birthweight and cigarette use during pregnancy. How do the effect estimate, its 95% confidence interval, and its statistical significance change after adjusting for other variables?

Create a regression table displaying the results of the adjusted model, including the regression coefficients, their 95% CIs and p-values, and multiple degree of freedom p-values for categorical variables with more than two levels. Include an appropriate brief label for each variable.

Interpret the regression coefficient(s) for each predictor in the adjusted model. For each, state the regression coefficient, 95% CI, whether or not the predictor is statistically significant (along with its p-value), and provide an interpretation in terms of how differences in the predictor are associated with differences in the mean outcome.

Visualize the adjusted relationships. How are the solid lines in these plots related to the estimated regression coefficients?

Does the association between cigarette smoking during pregnancy and birthweight depend on mother’s education? Modify the adjusted model appropriately to be able to answer this question. Regardless of statistical significance, create a plot illustrating the smoking effect at each level of mother’s education and estimate the cigarette smoking effect at each level of mother’s education. Hint: When creating the dataset for plotting, include a single value for each predictor not included in the interaction. What values you pick will impact the size of the resulting estimates, but not the size of the interaction effects.

Does the association between mother’s BMI and birthweight differ between mothers of different race/ethnicity? Modify the adjusted model appropriately to be able to answer this question. Regardless of statistical significance, estimate the BMI effect at each level of mother’s race/ethnicity and make a plot illustrating the BMI effect at each level of mother’s race/ethnicity. NOTE: When creating the dataset for plotting, include a single value for each predictor not included in the interaction. What values you pick will impact the size of the resulting estimates, but not the size of the interaction effects.

Does the association between mother’s BMI and birthweight depend on mother’s age? Modify the adjusted model appropriately to be able to answer this question. Regardless of statistical significance, estimate the BMI effect for mothers age 20, 30, and 40 years and make a plot illustrating the BMI effect at each of these ages. Center BMI at 25 kg/m

^{2}and center age at 30 years. NOTE: When creating the dataset for plotting, include a single value for each predictor not included in the interaction. What values you pick will impact the size of the resulting estimates, but not the size of the interaction effects.Use the small sample NHANES 2017-2018 teaching dataset

`nhanesf.complete.30_rmph.Rdata`

(Appendix A.1) to regress the outcome Triglyceride (mmol/L) (`LBDTRSI`

) on the predictors DXA Trunk Percent Fat (`DXDTRPF`

), age (`RIDAGEYR`

), and gender (`RIAGENDR`

). Check the normality assumption and then use the bootstrap to assess if the model conclusions are robust to lack of normality.

**For Exercises 13 and 14, use the CAMP teaching dataset camp_0_48_rmph.rData (see Appendix A.6).**

In Exercise 20 in Chapter 4, we tested to see if post-bronchodilator Forced Expiratory Volume at 1 second (FEV1) (

`POSFEV0`

) differed significantly between treatment groups (Budesonide, Nedocromil, Placebo) (`TG`

). Does the difference in`POSFEV0`

between treatment groups significantly depend on whether or not the child has a dehumidifier in the home (`dehumid`

)?In Exercise 21 in Chapter 4, we estimated the two treatment effects (Budesonide vs. Placebo, Nedocromil vs. Placebo) along with their 95% CIs and p-values. Using the model you fit in the previous Exercise, estimate these two treatment effects among those with a dehumidifier and among those without. Interpret these results.

**For Exercises 15 and 16, the following code creates the NHANES 2017-2018 (Appendix A.1) teaching dataset needed:**

```
load("Data/nhanes1718_adult_fast_sub_rmph.Rdata")
<- nhanes_adult_fast_sub
nhanesf rm(nhanes_adult_fast_sub)
<- !is.na(nhanesf$LBDGLUSI) & !is.na(nhanesf$BMXWAIST) &
SUB !is.na(nhanesf$smoker) & !is.na(nhanesf$RIDAGEYR) &
!is.na(nhanesf$RIAGENDR) & !is.na(nhanesf$RIDRETH3) &
!is.na(nhanesf$income)
<- subset(nhanesf, subset = SUB)
nhanesf.complete
<- nhanesf.complete %>%
nhanesf.complete mutate(cRIDAGEYR = RIDAGEYR - 40,
cBMXWAIST = BMXWAIST - 100)
```

- In Section 5.9.9.2, Example 5.3, we fit the following model

```
3.int <- lm(LBDGLUSI ~ RIAGENDR + income + RIAGENDR:income,
fit.ex5.data = nhanesf.complete)
```

and plotted fasting glucose vs. income by gender. In this exercise, plot fasting glucose vs. gender by income to visualize how the income effects differ between genders.

- In Section 5.9.9.3, Example 5.4, we fit the following model

```
4.int <- lm(LBDGLUSI ~ cRIDAGEYR + cBMXWAIST + cRIDAGEYR:cBMXWAIST,
fit.ex5.data = nhanesf.complete)
```

and plotted fasting glucose vs. waist circumference by age. In this exercise, plot fasting glucose vs. age at waist circumference values of 80, 100, and 120 cm to visualize how the age effect differs between those with different levels of waist circumference.

**In each of Exercises 17 to 19, draw a figure similar to those in Section 5.8 that illustrates the described hypothetical relationships between variables, and state whether this illustrates a confounder, mediator, or moderator.**

Substance use affects risk of suicide in part through substance use disorder.

The effect of substance use on suicide differs between those with different gender identity.

The effect of substance use on suicide must be adjusted for gender identity because gender identity is associated with both variables but is not in the causal pathway.

**For Exercises 20 to 26, use the UN Human Development Data ( unhdd2020.rmph.rData, see Appendix A.1).**

What is the association between the outcome life expectancy at birth (years) (

`life`

) and expected years of schooling (years) (`educ_expected`

), adjusted for urban population (%) (`urban`

), annual growth in per capita GDP (%) (`gdp_percap_growth`

), and depth of food deficit (average dietary energy supply adequacy, 2017/2019, %) (`food_deficit`

)? Estimate the adjusted regression coefficient, 95% CI, and p-value. Also, how does the adjusted association compare to the unadjusted and, based on this comparison, what would you conclude about the extent to which the three variables you adjusted for confound the association? A 10% difference in effects is often used as a cutoff for deciding if confounding is meaningful (of course, this cutoff is arbitrary). Make sure you fit the unadjusted model to the same sample as your adjusted model.Using the adjusted model from the previous Exercise, what is the estimated life expectancy (and its 95% CI) for a country with 10 years of expected schooling? What about for a country with 15 years of expected schooling? Assume 50% urban population (

`urban`

= 50), no change in annual per capita GDP (`gdp_percap_growth`

= 0), and a depth of food deficit of 120% (`food_deficit`

= 120).For each of the two sets of predictor values in the previous Exercise, within what bounds do we expect 95% of countries’ life expectancy to fall?

Does the Gender Inequality Index (

`gii`

) differ between HDI groups (`hdi_group`

), adjusted for urban population (%) (`urban`

), median age (years) (`age_median`

), and current health expenditure (2017, % of GDP) (`cur_health_expen`

)? Estimate the adjusted differences between groups and the reference level (“Low”), their 95% CIs, and p-values. Also, how do the adjusted differences compare to the unadjusted and, based on these comparisons, what would you conclude about the extent to which the three variables you adjusted for confound the association? A 10% difference in effects is often used as a cutoff for deciding if confounding is meaningful (of course, this cutoff is arbitrary). Make sure you fit the unadjusted model to the same sample as your adjusted model.Using the adjusted model from the previous Exercise, what is the estimated GII (and its 95% CI) for a country in the “Low” HDI group? What about for a country in the “Very high” HDI group? For the other variables in the model (urban population, median age, and current health expenditure), choose any plausible value. You can use

`summary()`

to summarize their distributions to help you decide what values are plausible.For each of the two sets of predictor values in the previous Exercise, within what bounds do we expect 95% of countries’ GII to fall?

Is there an association between the predictor tuberculosis incidence (2018, per 100,000 people) (

`tb`

) and the outcome Human Development Index (`hdi`

)? Plot the unadjusted association. Then fit the model and check the assumptions. Then transform the predictor (try each of log, inverse, and square root) and re-fit the model to answer the research question.

**For Exercises 27-34, use our COVID-19 county-level data ( covid_20210908_rmph.rData, see Appendix A.4).**

Is the (predictor) county-level number of ICU beds per 100,000 population (

`icu_beds_per_100k`

) associated with the (outcome) number of deaths (`deaths.usafacts.20210908`

)? Give a possible interpretation for the direction of association.Is the predictor county-level number of ICU beds per 100,000 population (

`icu_beds_per_100k`

) associated with the outcome number of deaths (`deaths.usafacts.20210908`

) after adjusting for population size (`PopulationEstimate2018`

)?Assess the regression model assumptions (normality, linearity, and constant variance) for the model you fit in the previous Exercise.

In the model from the previous Exercise, log-transform the outcome and both predictors, re-fit the model, and re-check the assumptions. Hint: Check for zeros for each variable, and if there are any then add a small number to the variable prior to the log-transformation.

Instead of a log-transformation of the outcome in the previous Exercise, try a Box-Cox transformation, re-fit the model, and re-check the assumptions. If there were zeros and you added a small number prior to transformation in the previous Exercise, do the same here prior to the Box-Cox transformation. Hint: Fit the model with transformed predictors but with the UN-transformed outcome (after adding a small number, if needed), then apply the

`MASS::boxcox()`

function to that model.Fit a regression model for the outcome “estimated age-adjusted percentage of diagnosed diabetes in 2016 among adults age 20 and over” (

`DiabetesPercentage`

) that includes the following predictors:

`X..Adults.with.Obesity`

: percentage of the adult population (age 20 and older) that reports a body mass index (BMI) greater than or equal to 30 kg/m^{2}(2016)`Rural.UrbanContinuumCode2013`

: rural-urban continuum code`MedianAge2010`

: median age of county in 2010`X..Fair.or.Poor.Health`

: percentage of adults reporting fair or poor health (age-adjusted) (2017)`SVIPercentile`

: the county’s overall percentile ranking indicating the CDC’s Social Vulnerability Index (SVI); higher ranking indicates greater social vulnerability`X..Uninsured`

: percentage of population under age 65 without health insurance (2017)`X..Unemployed`

: percentage of population ages 16 and older unemployed but seeking work (2018)`X..Children.in.Poverty`

: percentage of people under age 18 in poverty (2018)

Check for collinearity. Resolve any collinearity issues you find to arrive at a model where all VIFs are below 2.5. Compare the regression coefficients and their standard errors in the full model and the final model. Did any change by a meaningful amount?

Using the final model from the previous exercise, determine if there are any outliers using an outlier test, and then visualize the outliers. Based on the results of the outlier test, examine the significant outliers by looking at their values of

`X..Adults.with.Obesity`

,`DiabetesPercentage`

, and population (`pop.usafacts`

). Compare their obesity % to the national distribution. Compare their diabetes % to the national distribution. What is unusual about these counties with respect to their obesity % and % of adults with diabetes (what makes them outliers in our model)?Using the final model from the previous Exercise, look for influential observations.

**For Exercises 35 to 37, use our 2018 Natality subsampled dataset ( natality2018_rmph.Rdata, see Appendix A.3).**

- Using our natality dataset, fit multiple models, each one testing the association between the outcome birthweight (g) (
`DBWT`

) and a single risk factor, and each adjusted for Mother’s Age (`MAGER`

), Mother’s Education (`MEDUC`

), Marital Status (`DMAR`

), and Total Birth Order (`TBO_REC`

). Fit an adjusted model for each of the following risk factors, and then adjust for multiple testing using the Bonferroni and Hommel methods to obtain adjusted p-values. Which were statistically significant before adjusting for multiple testing? After adjusting for multiple testing? Was the Bonferroni adjustment too conservative for any of the risk factors? NOTE: The need for multiple testing here assumes that the researcher is screening for “any risk factor” and will conclude they have found an association if any of them are significant.

- RF_GDIAB: Gestational Diabetes
- RF_PHYPE: Pre-pregnancy Hypertension
- RF_GHYPE: Gestational Hypertension
- RF_PPTERM: Previous Preterm Birth
- RF_INFTR: Infertility Treatment Used
- RF_CESAR: Previous Cesarean

Regress the outcome birthweight (g) (

`DBWT`

) on the following predictors:`MRACEHISP`

(mother’s race/ethnicity),`risks`

(risk factors reported),`MAGER`

(mother’s age),`PRIORLIVE`

(prior births now living),`PRIORDEAD`

(prior births now dead),`DMAR`

(marital status),`MEDUC`

(mother’s education), and`BMI`

(mother’s body mass index, kg/m^{2}). Compute the p-values for each of the 6 pairwise comparisons of estimated birthweight between the 4 levels of`MRACEHISP`

and use the Hommel method to adjust for multiple comparisons. Which comparisons were significant before adjusting for multiple comparisons? After? Hint: You will have to re-level the predictor in two different ways to get all the comparisons.Referring back to the model in the previous Exercise, if mother’s race/ethnicity is the primary predictor of interest, does it make sense to adjust for all the other variables in the model? Think about each other predictor and evaluate whether it is a confounder or a mediator of racial disparities in birthweight. Remove the mediators and re-fit the model. Compare the magnitude of the estimated racial disparities before and after removing the mediators. Explain the difference.

**Exercises 38 and 39 refer back to Exercises 33 and 34. The following code creates the model that was fit in Exercise 33.**

```
<- lm(DiabetesPercentage ~ X..Adults.with.Obesity +
fit1 +
Rural.UrbanContinuumCode2013 + X..Fair.or.Poor.Health + X..Uninsured +
MedianAge2010 data = covid) X..Unemployed,
```

Do a sensitivity analysis for the model fit in Exercise 33 to assess the robustness of model conclusions to the presence of outliers (which you looked for in Exercise 33).

Do a sensitivity analysis for the model fit in Exercise 33 to assess the robustness of model conclusions to the presence of influential observations (which you looked for in Exercise 34).

Suppose you have a dataset with \(n = 200\) observations. In order to avoid overfitting, what is the maximum number of predictors you should include in a regression model?

You are designing a study for which you wish to fit a regression model with 8 predictors. What is the minimum sample size you need to avoid overfitting?

Using the NHANES fasting subset teaching dataset (

`nhanes1718_adult_fast_sub_rmph.Rdata`

, see Appendix A.1), regress log systolic blood pressure (`log(sbp)`

) on log body mass index (`log(BMXBMI`

) and age (`RIDAGEYR`

).

- Suppose you want to predict the log SBP for an individual with a BMI of 70 kg/m
^{2}. Would that be a valid prediction? Why or why not? - Suppose you want to predict the log SBP for an individual who is age 25 years. Would that be a valid prediction? Why or why not?
- Suppose you want to predict the log SBP for an individual with a BMI of 55 kg/m
^{2}and an age of 70 years. Would that be a valid prediction? Why or why not?