## 3.4 Exercises

1. True or false? A boxplot is an appropriate visualization for a continuous variable.

2. True or false? An appropriate numerical summary for a categorical variable is a frequency table.

3. For descriptive statistical functions such as mean() and sd(), if R returns a missing value NA, what option can you use to return a non-missing value (assuming the variable you are describing has some non-missing values)?

4. What is the default method of handling missing data when using a regression function in R?

5. Give a reason for removing cases with missing values before doing a regression analysis.

6. Using the COVID dataset (covid_20210908_rmph.rData, see Appendix A.4), numerically and visually examine the continuous variable hospitals_per_100k (number of hospitals per 100,000 persons) and the categorical variable CensusRegionName.

7. Using the United Nations Human Development Data (unhdd2020.rmph.rData, see Appendix A.2), create a “Table 1” of descriptive statistics (mean and standard deviation for continuous variables, frequency and proportion for categorical variables), overall and by Human Development Index group (hdi_group). Use a complete case analysis that includes the following variables:

• hdi: Human Development Index (HDI)
• life: Life expectancy at birth (years)
• educ_expected: Expected years of schooling (years)
• gii: Gender Inequality Index
• urban: Urban population (%)

8. Using the COVID dataset (covid_20210908_rmph.rData, see Appendix A.4), create a “Table 1” of descriptive statistics, overall and by the number of hospitals per 100,000 persons (hospitals_per_100k) (use a median split to create a binary version of this “by” variable). Describe the following variables in this table:

• pop.usafacts: County population
• cases.usafacts.20210908: COVID-19 cumulative cases as of 2021-09-08
• deaths.usafacts.20210908: COVID-19 cumulative deaths as of 2021-09-08
• MedianAge2010: median age of county in 2010
• CensusRegionName: name of census region

Hint: Remove the statistic option in tbl_summary() to display the default statistics (median and interquartile range) which are more appropriate for data that are skewed, such as county population. This is equivalent to replacing "{mean} ({sd})" with "{median} ({p25}, {p75})".
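
The equivalence described in the hint can be sketched on a stand-in dataset. This is a minimal illustration, not a solution to the exercise: it assumes the gtsummary package is installed, and it uses the built-in mtcars data (with am as a hypothetical “by” variable) in place of the COVID data.

```r
library(gtsummary)

# Stand-in data (assumption): two continuous variables and a binary
# grouping variable from the built-in mtcars dataset, not the COVID data
dat <- mtcars[, c("mpg", "hp", "am")]

# Mean (SD) requested explicitly through the statistic option
tbl_mean <- tbl_summary(
  dat,
  by = am,
  statistic = list(all_continuous() ~ "{mean} ({sd})")
)

# statistic option removed: continuous variables fall back to the
# default summary, median (25th percentile, 75th percentile)
tbl_default <- tbl_summary(dat, by = am)

# The same default written out explicitly
tbl_median <- tbl_summary(
  dat,
  by = am,
  statistic = list(all_continuous() ~ "{median} ({p25}, {p75})")
)
```

Printing tbl_default and tbl_median should show the same summaries for the continuous variables, while tbl_mean reports means and standard deviations instead.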