3.4 Exercises

  1. True or false? A boxplot is an appropriate visualization for a continuous variable.

  2. True or false? An appropriate numerical summary for a categorical variable is a frequency table.

  3. For descriptive statistical functions such as mean() and sd(), if R returns a missing value NA, what option can you use to return a non-missing value (assuming the variable you are describing has some non-missing values)?

  4. What is the default method of handling missing data when using a regression function in R?

  5. Give a reason for removing cases with missing values before doing a regression analysis.

  6. Using the COVID dataset (covid_20210908_rmph.rData, see Appendix A.4), numerically and visually examine the continuous variable hospitals_per_100k (number of hospitals per 100,000 persons) and the categorical variable CensusRegionName.

  7. Using the United Nations Human Development Data (unhdd2020.rmph.rData, see Appendix A.2), create a “Table 1” of descriptive statistics (mean and standard deviation for continuous variables, frequency and proportion for categorical variables), overall and by Human Development Index group (hdi_group). Use a complete case analysis that includes the following variables:

  • hdi: Human Development Index (HDI)
  • life: Life expectancy at birth (years)
  • educ_expected: Expected years of schooling (years)
  • gii: Gender Inequality Index
  • urban: Urban population (%)

  8. Using the COVID dataset (covid_20210908_rmph.rData, see Appendix A.4), create a “Table 1” of descriptive statistics, overall and by the number of hospitals per 100,000 persons (hospitals_per_100k). Use a median split to create a binary version of this “by” variable. Describe the following variables in this table:

  • pop.usafacts: County population
  • cases.usafacts.20210908: COVID-19 cumulative cases as of 2021-09-08
  • deaths.usafacts.20210908: COVID-19 cumulative deaths as of 2021-09-08
  • MedianAge2010: Median age of county in 2010
  • CensusRegionName: name of census region
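
As a reminder of what a median split involves, here is a minimal sketch on simulated data. The data frame `toy` and the variables `rate` and `rate_high` are hypothetical names made up for illustration; they are not part of the COVID dataset, and this is only one of several reasonable ways to dichotomize a variable.

```r
# Hypothetical toy data (NOT the COVID dataset)
set.seed(42)
toy <- data.frame(rate = rexp(20))

# Cut point: the sample median, ignoring any missing values
cut_point <- median(toy$rate, na.rm = TRUE)

# Binary factor: values above the median vs. at or below it
toy$rate_high <- factor(toy$rate > cut_point,
                        levels = c(FALSE, TRUE),
                        labels = c("At or below median", "Above median"))

# Check the resulting group sizes (including any missing values)
table(toy$rate_high, useNA = "ifany")
```

Values exactly equal to the median fall in the lower group here; with many ties at the median the two groups can be unequal in size, so it is worth checking the frequency table after creating the variable.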

Hint: Remove the statistic option in tbl_summary() to display the default statistics (median and interquartile range), which are more appropriate for skewed data such as county population. This is equivalent to replacing "{mean} ({sd})" with "{median} ({p25}, {p75})".
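
The hint can be illustrated with a short sketch, assuming the gtsummary package is installed. The data frame `toy` and its variables `pop` and `group` are simulated and hypothetical; they only stand in for a skewed continuous variable and a binary “by” variable.

```r
library(gtsummary)

# Hypothetical right-skewed toy data (NOT the COVID dataset)
set.seed(1)
toy <- data.frame(
  pop   = round(rlnorm(100, meanlog = 10, sdlog = 1)),
  group = rep(c("Low", "High"), each = 50)
)

# No statistic option: continuous variables are summarized with the
# default median and 25th/75th percentiles
tbl_default <- tbl_summary(toy, by = group)

# Requesting the same statistics explicitly
tbl_median <- tbl_summary(
  toy,
  by = group,
  statistic = list(all_continuous() ~ "{median} ({p25}, {p75})")
)

# For comparison: mean and standard deviation
tbl_mean <- tbl_summary(
  toy,
  by = group,
  statistic = list(all_continuous() ~ "{mean} ({sd})")
)

tbl_default
```

Printing `tbl_default` and `tbl_median` should show the same summary for `pop`, while `tbl_mean` reports a mean that is pulled upward by the long right tail.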