## 6.23 Exercises

1. Write a sentence interpreting an odds ratio of 1.90.

2. Write a sentence interpreting an odds ratio of 2.10.

3. Write a sentence interpreting an odds ratio of 0.42.
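
As a reminder of the arithmetic behind such interpretations, an odds ratio can be restated as a percent difference in odds, $$100 \times (\text{OR} - 1)$$. The following is a minimal sketch using hypothetical odds ratios (1.25 and 0.80, not the values in the exercises above); the helper name or_percent_change is made up for this illustration.

```r
# Hypothetical helper (not from the text): percent difference in odds
# implied by an odds ratio. Positive = greater odds, negative = lower odds.
or_percent_change <- function(or) {
  stopifnot(is.numeric(or), all(or > 0))
  100 * (or - 1)
}

# Illustrative values only
pct <- or_percent_change(c(1.25, 0.80))
pct
```

A positive result corresponds to wording such as "greater odds" and a negative result to "lower odds"; the comparison and reference groups still need to be named in the sentence.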

For Exercises 4 to 15, use the NHANES 2017-2018 examination subsample teaching dataset (nhanes1718_adult_exam_sub_rmph.Rdata, see Appendix A.1). Create a binary version of PHQ-9 representing mild depression (PHQ-9 $$\ge$$ 5) using the following code prior to answering the questions.

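```r
library(dplyr)  # provides mutate() and %>%

# Create dichotomized PHQ-9

# "PHQ-9 scores of 5, 10, 15, and 20 represented mild, moderate,
# moderately severe, and severe depression, respectively"
# (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1495268/)

# Assumes the loaded teaching dataset has been assigned to nhanes
nhanes <- nhanes %>%
  mutate(depression_mild = factor(phq9 >= 5,
                                  levels = c(FALSE, TRUE),
                                  labels = c("No", "Yes")))

# Check: cross-tabulate (keeping missing values) and confirm the score range in each group
table(nhanes$phq9, nhanes$depression_mild, exclude = NULL)
tapply(nhanes$phq9, nhanes$depression_mild, range, na.rm = TRUE)
```
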
4. Create a 2 $$\times$$ 2 table of mild depression (depression_mild) vs. walk or bicycle (PAQ635, “In a typical week do you walk or use a bicycle for at least 10 minutes continuously to get to and from places?”) and, using the table, compute the odds ratio comparing the odds of mild depression between those who do not and those who do answer “Yes” to the walk or bicycle question.

5. Create a 2 $$\times$$ 2 table of mild depression (depression_mild) vs. trouble sleeping (SLQ050, “Have you ever told a doctor or other health professional that you have trouble sleeping?”) and, using the table, compute the odds ratio comparing the odds of mild depression between those who do and those who do not answer “Yes” to the sleeping question.

6. Suppose you are going to fit a logistic regression where the outcome is mild depression. What probability is glm() modeling? If it is not already, modify the outcome variable so glm() will model P(mild depression = Yes).

7. Compute the odds ratio comparing the odds of mild depression (depression_mild) between those who do and those who do not answer “Yes” to the sleeping question (SLQ050) using logistic regression, as well as its 95% confidence interval. Assume depression_mild is the outcome.

8. Do the odds of mild depression (depression_mild) differ between individuals of different income levels (income)? Test the global significance of income using a Type III Wald test and compute the OR, 95% CI, and p-value comparing each possible pair of levels.

9. Is mild depression (depression_mild) associated with the number of days someone engages in vigorous recreational activities (PAQ655, “In a typical week, on how many days do you do vigorous-intensity sports, fitness or recreational activities?”)? What is the OR comparing the odds of mild depression between individuals who differ by 1 day in days of vigorous recreation? What about between those who differ by 5 days?

10. Is mild depression (depression_mild) significantly associated with trouble sleeping (SLQ050) after adjusting for age (RIDAGEYR), gender (RIAGENDR), income (income), and days someone engages in vigorous recreational activities (PAQ655)? Answer the question and report the AORs, 95% confidence intervals, and p-values. Also, interpret the AOR for trouble sleeping.

11. Create a forest plot to illustrate the AORs and their 95% CIs for the model from the previous Exercise. For each continuous predictor, plot the AOR corresponding to a difference in the predictor equal to its inter-quartile range (IQR).

12. Using the model from Exercise 10, what is the predicted prevalence (and 95% CI) of mild depression among those with and without trouble sleeping who are age 40 years, male, earn $25,000 to <$55,000 per year, and do not engage in any days of vigorous recreational activities?

13. After adjusting for age (RIDAGEYR), gender (RIAGENDR), income (income), and days someone engages in vigorous recreational activities (PAQ655), does the association between mild depression and trouble sleeping (SLQ050) depend on gender?

14. Using the model from the previous Exercise, test the overall significance of trouble sleeping.

15. Using the model from Exercise 13, estimate the AOR for mild depression comparing those without and with trouble sleeping separately for males and females (along with their 95% CIs and p-values). Based on the answer to Exercise 13, are these two AORs significantly different from each other?
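
Several of the exercises above ask for the same odds ratio computed two ways, from a 2 $$\times$$ 2 table and from logistic regression. The sketch below uses made-up counts rather than NHANES data to show that the two agree. It also illustrates which outcome level glm() models when the outcome is a factor: the first level is treated as "failure", so glm() models the probability of the other level. The object names and the use of Wald intervals from confint.default() are choices made for this illustration and are not necessarily the approach used in the text.

```r
# Hypothetical counts (NOT from NHANES): rows = exposure, columns = outcome
tab <- matrix(c(40, 10,
                30, 20),
              nrow = 2, byrow = TRUE,
              dimnames = list(exposure = c("No", "Yes"),
                              outcome  = c("No", "Yes")))

# Odds of outcome = "Yes" within each exposure group, then their ratio
odds     <- tab[, "Yes"] / tab[, "No"]
or_table <- unname(odds["Yes"] / odds["No"])

# Expand the table to one row per (hypothetical) person
dat <- as.data.frame(as.table(tab))
dat <- dat[rep(seq_len(nrow(dat)), dat$Freq), c("exposure", "outcome")]

# glm() treats the FIRST factor level as "failure", so it models
# P(outcome = "Yes") only if "No" is the first level
dat$outcome <- relevel(dat$outcome, ref = "No")
levels(dat$outcome)

# Logistic regression with the same exposure and outcome
fit     <- glm(outcome ~ exposure, family = binomial, data = dat)
or_glm  <- unname(exp(coef(fit)["exposureYes"]))
ci_wald <- exp(confint.default(fit))["exposureYes", ]  # Wald 95% CI

c(table = or_table, glm = or_glm)
ci_wald
```

With a single binary predictor the model is saturated, so the regression estimate reproduces the table-based odds ratio up to numerical tolerance. Swapping the reference level of the exposure inverts the odds ratio.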

For Exercises 16 to 17, use the NHANES 2017-2018 fasting subsample teaching dataset (nhanes1718_adult_fast_sub_rmph.Rdata, see Appendix A.1).

16. Fit a logistic regression model to test the association between the outcome “Ever told had congestive heart failure?” (MCQ160B) and “How often do you snort or stop breathing?” (SLQ040), adjusted for age (RIDAGEYR), gender (RIAGENDR), and income (income). Look at the table of regression coefficients. Do you see any indicators of a problem with quasi- or complete separation?

17. Check for quasi- or complete separation in the model from the previous Exercise, resolve any issues you find, and re-fit the model.
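
To show what a separation problem can look like, the following sketch builds a small made-up dataset (not the NHANES fasting subsample) in which one level of a categorical predictor contains no one with the outcome. The variable names are hypothetical, and collapsing the sparse level is only one possible remedy, assumed here for illustration; the text may recommend a different approach.

```r
# Hypothetical data (NOT NHANES): no one in group "C" has the outcome
toy <- data.frame(
  group = factor(rep(c("A", "B", "C"), times = c(20, 20, 10))),
  y     = c(rep(0:1, times = c(10, 10)),  # A: 10 without, 10 with
            rep(0:1, times = c(12, 8)),   # B: 12 without,  8 with
            rep(0, 10))                   # C: 10 without,  0 with
)

# A zero cell in the predictor-by-outcome table signals quasi-complete separation
xt <- table(toy$group, toy$y)
xt

# The coefficient for the separated level drifts toward -Inf and its
# standard error becomes implausibly large
fit_sep <- glm(y ~ group, family = binomial, data = toy)
se_sep  <- summary(fit_sep)$coefficients[, "Std. Error"]
se_sep

# One possible remedy (an assumption here): collapse the sparse level and refit
toy$group2 <- factor(ifelse(toy$group == "A", "A", "B or C"))
fit_ok <- glm(y ~ group2, family = binomial, data = toy)
se_ok  <- summary(fit_ok)$coefficients[, "Std. Error"]
se_ok
```

The cross-tabulation is the more direct check: the inflated standard error is a symptom, while the zero cell identifies which level is responsible.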

For Exercises 18 to 20, use the 2019 National Survey on Drug Use and Health (NSDUH) teaching dataset (nsduh2019_adult_sub_rmph.RData, see Appendix A.5).

18. Fit a logistic regression for the outcome substance use treatment (tx_substance_lifetime) vs. the predictor age of first cigarette use (cig_agefirst) and check the linearity assumption. Is the relationship linear?

19. For the model you fit in the previous Exercise, relax the linearity assumption in each of the following three ways (not all at once): (1) log transformation of the predictor, (2) square-root transformation of the predictor, and (3) add a quadratic term for the predictor. Re-check the linearity assumption for each. Which transformation would you choose and why?

20. Compare the models with log, square-root, and quadratic transformed predictors from the previous Exercise based on influential observations and goodness-of-fit (Hosmer-Lemeshow test, calibration plot). Is one preferable to the others?

21. Suppose you have a dataset with $$n = 1000$$ observations. In order to avoid overfitting, what is the maximum number of predictors you should include in a logistic regression model if the proportion of observations with the outcome is 0.15? What if the proportion is 0.70? What can you say about the relationship between the sample proportion and the number of predictors for a given sample size?

22. You are designing a study for which you wish to fit a logistic regression model with 7 predictors. Assuming the population prevalence is 0.80, what is the minimum sample size you need to avoid overfitting? What if the population prevalence is 0.35? What can you say about the relationship between the population prevalence and the minimum sample size for a given number of predictors?

23. The matched case-control teaching dataset nhanes_CC_rmph.Rdata was created from a subset of adults from the 2017-2018 NHANES teaching dataset (Appendix A.1) containing 91 individuals who were ever told they had cancer or malignancy and 799 individuals who were not, matched on gender (RIAGENDR) and income (income). Assess the association between having been told one has cancer or malignancy and smoking status (smoker = Never, Past, or Current), accounting for the matching in the analysis. Compare all 3 levels of smoking status and give a plausible explanation of the results. NOTE: As in the example in the text, you must first convert the outcome to a 0/1 variable.

24. Repeat Exercise 10, but this time instead of computing the AOR, compute the adjusted prevalence ratio (APR).
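
The two overfitting exercises above depend on a limiting sample size, the number of observations in the less common outcome group. The sketch below encodes one commonly cited rule of thumb, at most one predictor per 10 observations in that smaller group. This threshold is an assumption made for illustration, and the rule stated in the chapter may use a different value (set per_predictor accordingly). The function names are hypothetical and the inputs are illustrative, not those in the exercises.

```r
# ASSUMED rule of thumb (not confirmed by this section): allow at most one
# predictor per `per_predictor` observations in the smaller outcome group.
max_predictors <- function(n, p_outcome, per_predictor = 10) {
  stopifnot(n > 0, p_outcome > 0, p_outcome < 1, per_predictor > 0)
  m <- n * min(p_outcome, 1 - p_outcome)  # limiting sample size
  floor(m / per_predictor + 1e-9)         # small tolerance for rounding error
}

# Smallest n such that the smaller outcome group supports k predictors
min_sample_size <- function(k, p_outcome, per_predictor = 10) {
  stopifnot(k >= 1, p_outcome > 0, p_outcome < 1, per_predictor > 0)
  ceiling(k * per_predictor / min(p_outcome, 1 - p_outcome) - 1e-9)
}

# Illustrative inputs only
max_predictors(n = 400, p_outcome = 0.25)
max_predictors(n = 400, p_outcome = 0.75)   # same limiting sample size
min_sample_size(k = 4, p_outcome = 0.25)
```

Because the limiting sample size uses the smaller of the two outcome proportions, proportions of 0.25 and 0.75 give the same result under this rule.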

For Exercises 25 to 26, use the 2018 Natality subsample teaching dataset (natality2018_rmph.Rdata, see Appendix A.3).

25. Is cigarette smoking during pregnancy (CIG_REC) associated with the outcome birthweight category (BWTR4 = Normal, Low birthweight (LBW), or Very low birthweight (VLBW)), adjusted for maternal age (MAGER), maternal education (MEDUC), and the presence of risk factors (risks)?

26. Check the proportional odds (PO) assumption for the model you fit in the previous Exercise.
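
As a rough illustration of what the PO assumption implies, the sketch below simulates an ordinal outcome that satisfies proportional odds, fits an ordinal logistic regression with MASS::polr(), and compares its odds ratio with those from separate binary logistic regressions at each cut point. The data, variable names, and the choice of this informal comparison are assumptions made for the example; it does not use the Natality dataset and may differ from the method presented in the text.

```r
library(MASS)  # provides polr(); a recommended package included with most R installations

set.seed(1)
n <- 600
x <- rbinom(n, 1, 0.4)            # hypothetical binary exposure
latent <- 0.8 * x + rlogis(n)     # latent logistic scale, so PO holds by construction
y <- cut(latent, breaks = c(-Inf, 0, 1.5, Inf),
         labels = c("Low", "Mid", "High"), ordered_result = TRUE)
sim <- data.frame(x, y)

# Proportional odds model: a single OR shared across cut points
fit_po <- polr(y ~ x, data = sim, Hess = TRUE)
or_po  <- exp(coef(fit_po))

# Informal check: a separate binary logistic regression at each cut point
fit_cut1 <- glm(I(y >= "Mid")  ~ x, family = binomial, data = sim)
fit_cut2 <- glm(I(y >= "High") ~ x, family = binomial, data = sim)
or_cuts  <- exp(c(cut1 = coef(fit_cut1)[["x"]],
                  cut2 = coef(fit_cut2)[["x"]]))

or_po
or_cuts  # under PO these should be similar to each other and to or_po
```

If the cut-point odds ratios differ substantially from one another, a single common odds ratio may not summarize the association well.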