## 8.10 Exercises

1. What is the difference between a sample and a census?

2. What kind of sampling first splits the population non-randomly into groups and then takes a simple random sample within each group?

3. In multistage sampling, what are the units called that are sampled first?

4. What do we call the set of numbers that indicate how many population units each sampled unit represents?

For Exercises 5 to 13, use the 2019 NSDUH data (nsduh2019_rmph.RData, see Appendix A.5).

5. Estimate the population mean, standard deviation, and 95% confidence interval for the mean for the age of first cigarette use (cig_agefirst), and plot a weighted histogram.

6. Estimate the population totals and proportions for the levels of total family income (demog_income) and plot a barplot displaying the weighted proportions.

7. Estimate the population totals and proportions for the levels of total family income (demog_income) at each level of employment status (demog_employ).

8. Create a Table 1 of weighted descriptive statistics (mean and SD for continuous variables, total and proportion for categorical variables) for the following variables, using a subset() of the design so the statistics are based on a complete case analysis. The variables to include are: sex (demog_sex), health status (demog_health), education (demog_educ_cat4), age of first alcohol use (alc_agefirst), and days of alcohol use in the past year (alc_past1y). The table should contain an “Overall” column and columns by sex.

9. Visualize the estimated population relationship between the predictor age at first alcohol use (alc_agefirst) and the outcome age at first heroin use (her_agefirst), along with the estimated population regression line and smoother.

10. Is the predictor age at first alcohol use (alc_agefirst) associated with the outcome age at first heroin use (her_agefirst), after adjusting for sex (demog_sex) and education (demog_educ_cat4)? Estimate the population association, 95% confidence interval, and p-value, and interpret the results.

11. Using the model from Exercise 10, what is the predicted age of first heroin use (and 95% confidence interval) for someone who first used alcohol at age 15, is male, and has a college degree?

12. After adjusting for sex (demog_sex) and education (demog_educ_cat4), does the association between the predictor age at first cigarette use (cig_agefirst) and the outcome age at first heroin use (her_agefirst) differ between those who live in Nonmetro, Small Metro, and Large Metro areas (demog_urban)? Answer the question, along with the appropriate p-value, and estimate the population association and its 95% confidence interval at each level of urbanicity. At which levels of urbanicity is the association significant?

13. Repeat Exercise 10 within the subgroup of females aged 18-34 years (demog_age_cat). Regardless of statistical significance, interpret the regression coefficient. How do these results compare to the results from Exercise 10? Hint: You will have to remove demog_sex from the model since it no longer has any variation.

For Exercises 14 to 19, use the NHANES 2017-2018 data (nhanes1718_rmph.Rdata, see Appendix A.1).

14. Estimate the population proportion of individuals not covered by health insurance (HIQ011 = "No"). Use the interview weights (WTINT2YR) when specifying the design.

15. Create a complete case indicator for the following variables: covered by health insurance (HIQ011), education among those age 20 years and older (DMDEDUC2), and age (RIDAGEYR). For the subpopulation with non-missing values for these variables, estimate the proportion of individuals not covered by health insurance (HIQ011 = "No"). Use the interview weights (WTINT2YR) when specifying the design.

16. Using the design you specified in Exercise 15, is there an association between age and not being covered by health insurance after adjusting for education? Answer the question, including the adjusted odds ratio, 95% confidence interval, and p-value. Also, interpret the association in terms of the change in odds associated with a 5-year age difference.

17. Using the model from Exercise 16, what is the predicted population proportion with no health insurance among those with a HS/GED education who are 21, 30, and 45 years of age?

18. Plot the weighted Kaplan-Meier estimate of the survival function for age when first tried marijuana at each level of education (DMDEDUC2). Are the survival curves significantly different? The time variable is age when first tried marijuana (DUQ210) and the event indicator is “Ever used marijuana or hashish” (DUQ200). Use the interview weights (WTINT2YR) when specifying the design. As in the asthma example in this chapter, first format the event indicator and, for those who did not experience the event, fill in current age for the time variable.

19. Using the design you specified in Exercise 18, estimate the hazard ratios for marijuana use comparing each level of education to those with less than a 9th grade education.
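
As a starting point for the data preparation described in the marijuana survival exercise, the following minimal sketch shows one way to format an event indicator and fill in current age for those who did not experience the event. It uses a made-up toy data frame rather than the NHANES data, and it assumes that DUQ200 is coded "Yes"/"No"; the names `toy`, `event`, and `time` are hypothetical. Check the actual coding in the dataset before adapting it.

```r
# Toy data standing in for the NHANES variables (values are invented)
toy <- data.frame(
  DUQ200   = c("Yes", "No", "Yes", "No"),  # ever used marijuana (assumed coding)
  DUQ210   = c(16, NA, 21, NA),            # age when first tried marijuana
  RIDAGEYR = c(30, 45, 25, 19)             # current age
)

# Event indicator: 1 = experienced the event, 0 = censored
toy$event <- as.numeric(toy$DUQ200 == "Yes")

# Time variable: age at first use if the event occurred, otherwise current age
toy$time <- ifelse(toy$event == 1, toy$DUQ210, toy$RIDAGEYR)

toy
```

The new variables would need to be created before the survey design is specified, so that the design object includes them.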