3.5 Exercises

The following exercises practice the essential dplyr functions and aim to show that simple pipes of them can solve quite intriguing puzzles about data.
3.5.1 Exercise 1
Reshaping vs. reducing data
The essential dplyr functions perform data transformation tasks. Discuss the specific purpose of each function in terms of reshaping or reducing data (as introduced in Section 3.1.1).
3.5.2 Exercise 2
Star and R wars
We start tackling the tidyverse by uncovering even more facts about the dplyr::starwars universe.
Answer the following questions by using pipes of basic dplyr commands (i.e., by arranging, filtering, selecting, grouping, counting, summarizing).
- Save the tibble dplyr::starwars as sw and report its dimensions.
Known unknowns
- How many missing (NA) values does sw contain?
- Which variable (column) has the most missing values?
- Which individuals come from an unknown (missing) homeworld but have a known birth_year or known mass?
Gender issues
- How many humans are contained in sw overall and by gender?
- How many and which individuals in sw are neither male nor female?
- For which species in sw do at least two different gender values exist?
Bonus task: R typically provides many ways to obtain a solution. Let’s gain an overview of the gender distribution in our sw dataset in three different ways:
- Use a dplyr pipe to compute a summary table tb that counts the frequency of each gender in sw.
- Use ggplot2 on the raw data of sw to create a bar chart (A) that shows the same gender distribution.
- Use ggplot2 on the summary table tb to create a bar chart (B) that shows the same gender distribution.
Popular homes and heights
- From which homeworld do the most individuals (rows) come?
- What is the mean height of all individuals with orange eyes from the most popular homeworld?
Size and mass issues
- Compute the median, mean, and standard deviation of height for all droids.
- Compute the average height and mass by species and save the result as h_m.
- Sort h_m to list the three species with the smallest individuals (in terms of mean height).
- Sort h_m to list the three species with the heaviest individuals (in terms of median mass).
3.5.3 Exercise 3
Sleeping mammals
The dataset ggplot2::msleep contains a mammals sleep dataset (see ?msleep for details and the definition of variables).
- Save the data as sp and check the dimensions, variable types, and number of missing values in the dataset.
Arranging and filtering data
Use the dplyr-verbs arrange(), group_by(), and filter() to answer the following questions by creating ordered subsets of the data:
- Arrange the rows (alphabetically) by vore, order, and name, and report the genus of the top three mammals.
- What is the most common type of vore in the data? How many omnivores are there?
- What is the most common order in the dataset? Are there more exemplars of the order “Carnivora” or “Primates”?
- Which two mammals of the order “Primates” have the longest and shortest sleep_total times?
Computing new variables
Solve the following tasks by combining the dplyr commands mutate(), group_by(), and summarise():
- Compute a variable sleep_awake_sum that adds the sleep_total time and the awake time of each mammal. What result do you expect and get?
- Which animals have the smallest and largest brain to body ratio (in terms of weight)? How many mammals have a larger ratio than humans?
- What is the minimum, average (mean), and maximum sleep cycle length for each vore? (Hint: First group the data by group_by, then use summarise on the sleep_cycle variable, but also count the number of NA values for each vore. When computing grouped summaries, NA values can be removed by na.rm = TRUE.)
- Replace your summarise() verb in the previous task by mutate(). What do you get as a result? (Hint: The last two tasks illustrate the difference between mutate() and grouped mutate() commands.)
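The difference mentioned in the hints can be previewed on a small example. The following is a minimal sketch on hypothetical toy data (the tibble toy and its columns grp and val are made up for illustration and are not part of msleep): a grouped summarise() returns one row per group, whereas a grouped mutate() keeps all rows and repeats each group's result.

```r
library(dplyr)

# Hypothetical toy data with one missing value in group "b":
toy <- tibble::tibble(
  grp = c("a", "a", "b", "b", "b"),
  val = c(1, 3, 2, NA, 6)
)

# Grouped summarise(): collapses the data to one row per group.
toy_sum <- toy %>%
  group_by(grp) %>%
  summarise(n      = n(),                       # rows per group
            n_NA   = sum(is.na(val)),           # missing values per group
            mn_val = mean(val, na.rm = TRUE))   # mean, ignoring NA values

# Grouped mutate(): keeps all rows and adds the group mean to each row.
toy_mut <- toy %>%
  group_by(grp) %>%
  mutate(mn_val = mean(val, na.rm = TRUE)) %>%
  ungroup()

toy_sum  # 2 rows (one per group)
toy_mut  # 5 rows (same as toy, plus the new column)
```

Without na.rm = TRUE, the mean of group "b" would be NA, which is why the hint recommends counting and removing missing values explicitly.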
3.5.4 Exercise 4
Outliers
This exercise examines different possibilities for defining outliers and uses the outliers dataset of the ds4psy package (also available as out.csv at http://rpository.com/ds4psy/data/out.csv) to illustrate and compare them.
With respect to your insights into dplyr, this exercise helps to disentangle mutate() from grouped mutate() commands.
Data on outliers
Use the outliers data (from the ds4psy package) or use the following read_csv() command to load the data into an R object entitled outliers:
# From the ds4psy package:
outliers <- ds4psy::outliers
# Alternatively, load csv data from online source (as comma-separated file):
# outliers_2 <- readr::read_csv("http://rpository.com/ds4psy/data/out.csv") # from online source
# Verify equality:
# all.equal(ds4psy::outliers, outliers_2)
# Alternatively, from a local data file:
# outliers <- read_csv("out.csv") # from current directory
Not all outliers are alike
An outlier can be defined as an individual whose value in some variable deviates by more than a given criterion (e.g., two standard deviations) from the mean of the variable. However, this definition is incomplete unless it also specifies the reference group over which the means and deviations are computed. In the following, we explore the implications of different reference groups.
Basic tasks
- Save the data into a tibble outliers and report its number of observations and variables, and their types.
- How many missing data values are there in outliers?
- What is the gender (or sex) distribution in this sample?
- Create a plot that shows the distribution of height values for each gender.
Defining different outliers
Compute 2 new variables that signal and distinguish between 2 types of outliers in terms of height:
- outliers relative to the height of the overall sample (i.e., individuals with height values deviating more than 2 SD from the overall mean height);
- outliers relative to the height of some subgroup’s mean and SD. Here, a suitable subgroup to consider is every person’s gender (i.e., individuals with height values deviating more than 2 SD from the mean height of their own gender).
Hints:
As both variables signal whether or not someone is an outlier, they should be defined as logicals (being either TRUE or FALSE) and added as new columns to the data (via appropriate mutate commands). While the 1st variable can be computed based on the mean and SD of the overall sample, the 2nd variable can be computed after grouping outliers by gender and then computing and using the corresponding mean and SD values. The absolute difference between 2 numeric values x and y is provided by abs(x - y).
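The logic of these hints can be tried out before turning to the actual data. The following is a minimal sketch on hypothetical toy data (the tibble toy, its columns grp and x, and the names crit, out_overall, and out_group are all made up for illustration): the same expression yields different flags depending on whether mutate() is applied before or after group_by().

```r
library(dplyr)

# Hypothetical toy data: two groups whose values lie on very different levels.
toy <- tibble::tibble(
  grp = rep(c("a", "b"), each = 10),
  x   = c(rep(10, 9), 13,                          # group "a"
          19, 20, 21, 19, 20, 21, 19, 20, 21, 20)  # group "b"
)

crit <- 2  # criterion (in SD units)

toy_flagged <- toy %>%
  # Ungrouped mutate(): mean() and sd() are computed over all rows.
  mutate(out_overall = abs(x - mean(x)) > crit * sd(x)) %>%
  # Grouped mutate(): mean() and sd() are computed within each group.
  group_by(grp) %>%
  mutate(out_group = abs(x - mean(x)) > crit * sd(x)) %>%
  ungroup()

# Show all rows flagged by at least one of the two definitions:
toy_flagged %>% filter(out_overall | out_group)
```

In this toy example, the value 13 is unremarkable relative to the overall sample but deviates by more than 2 SD from the mean of its own group, so only out_group flags it.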
Relative outliers
Now use the 2 new outlier variables to define (or filter) 2 subsets of the data that contain 2 subgroups of people:
- out_1: Individuals (females and males) with height values that are outliers relative to both the entire sample and the sample of their own gender. How many such individuals are in outliers?
- out_2: Individuals (females and males) with height values that are not outliers relative to the entire population, but are outliers relative to their own gender. How many such individuals are in outliers?
3.5.5 Exercise 5
Revisiting positive psychology
In previous exercises, we used the p_info data — available as posPsy_p_info in the ds4psy package or as http://rpository.com/ds4psy/data/posPsy_participants.csv — from a study on the effectiveness of web-based positive psychology interventions (Woodworth et al., 2018).
More specifically, we used this data in Exercise 6 of Chapter 1 and Exercise 5 of Chapter 2 to explore the participant information and create some corresponding plots.
(See Section B.1 of Appendix B for background information on this data.)
Answer the same questions as in those exercises by verifying your earlier base R results and ggplot2 graphs by pipes of dplyr commands. Do your graphs and quantitative results support the same conclusions?
Data
# From ds4psy package:
p_info <- ds4psy::posPsy_p_info
# Alternatively, load data from online source:
# p_info_2 <- readr::read_csv(file = "http://rpository.com/ds4psy/data/posPsy_participants.csv")
# Verify equality:
# all.equal(p_info, p_info_2)
# p_info
dim(p_info) # 295 rows, 6 columns
#> [1] 295 6
From Exercise 6 of Chapter 1
Questions from Exercise 6 of Chapter 1:
Examine the participant information in p_info by describing each of its variables:
- How many individuals are contained in the dataset?
- What percentage of them is female (i.e., has a sex value of 1)?
- How many participants were in one of the 3 treatment groups (i.e., have an intervention value of 1, 2, or 3)?
- What is the participants’ mean education level? What percentage has a university degree (i.e., an educ value of at least 4)?
- What is the age range (min to max) of participants? What is the average (mean and median) age?
- Describe the range of income levels present in this sample of participants. What percentage of participants self-identifies as having a below-average income (i.e., an income value of 1)?
From Exercise 5 of Chapter 2:
Questions from Exercise 5 of Chapter 2:
Use the p_info data to create some plots that describe the sample of participants:
- A histogram that shows the distribution of participant age in 3 ways:
  - overall,
  - separately for each sex, and
  - separately for each intervention.
- A bar plot that
  - shows how many participants took part in each intervention; or
  - shows how many participants of each sex took part in each intervention.
Try to answer the same questions by dplyr pipes.
3.5.6 Exercise 6
Surviving the Titanic
The Titanic data in datasets contains basic information on the Age, Class, Sex, and Survival status for the people on board of the fatal maiden voyage of the Titanic. This data is saved as a 4-dimensional array resulting from cross-tabulating 2201 observations on four variables, but can easily be transformed into a tibble titanic by evaluating titanic <- tibble::as_tibble(datasets::Titanic).
| Class | Sex | Age | Survived | n |
|---|---|---|---|---|
| 1st | Male | Child | No | 0 |
| 2nd | Male | Child | No | 0 |
| 3rd | Male | Child | No | 35 |
| Crew | Male | Child | No | 0 |
| 1st | Female | Child | No | 0 |
| 2nd | Female | Child | No | 0 |
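Because each row of titanic describes a combination of categories rather than a single person, group sizes are obtained by summing the n column, not by counting rows. The following minimal sketch illustrates this for Class (a grouping that is not part of the questions below); the object name by_class and the column name n_total are chosen for illustration only.

```r
library(dplyr)

# Convert the 4-dimensional array into a tibble (as described above):
titanic <- tibble::as_tibble(datasets::Titanic)

dim(titanic)    # 32 category combinations (rows), 5 columns
sum(titanic$n)  # 2201 people overall

# Sum the frequency counts in n within each Class:
by_class <- titanic %>%
  group_by(Class) %>%
  summarise(n_total = sum(n))

by_class
```

Using n() instead of sum(n) here would merely count the 8 category combinations per Class, not the number of people.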
Use dplyr pipes to answer each of the following questions by a summary table that counts the sum of particular groups of survivors.
1. Determine the number of survivors by Sex: Were female passengers more likely to survive than male passengers?
2. Determine the number of survivors by Age: Were children more likely to survive than adults?
3. Consider the number of survivors as a function of both Sex and Age. Does the pattern observed in 1. hold equally for children and adults?
4. The documentation of the Titanic data suggests that the women and children first policy was “not entirely successful in saving the women and children in the third class”. Verify this by creating corresponding contingency tables (i.e., counts of survivors).
3.5.7 Exercise 7
Bonus task: Replacing dplyr functions by base R functionality
Discuss how each of the essential dplyr functions introduced in this chapter could be replaced by using base R functionality.
Hint: As there is no 1:1-correspondence between functions, identify the task performed by each function before thinking about alternative ways of tackling these tasks.
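As a starting point for this discussion, the following minimal sketch contrasts one small dplyr pipe with one possible base R counterpart. The data frame df and its columns are hypothetical, and the base R version shown is only one of several ways of tackling the same tasks (selecting rows, selecting columns, and sorting).

```r
library(dplyr)

# Hypothetical toy data frame:
df <- data.frame(id  = 1:5,
                 grp = c("a", "b", "a", "b", "a"),
                 val = c(5, 3, 8, 1, 2))

# dplyr: keep rows of group "a", keep two columns, sort by val.
res_dplyr <- df %>%
  filter(grp == "a") %>%
  select(id, val) %>%
  arrange(val)

# Base R: logical indexing for rows, column names for columns, order() for sorting.
res_base <- df[df$grp == "a", c("id", "val")]
res_base <- res_base[order(res_base$val), ]
rownames(res_base) <- NULL  # reset the row names retained by base R indexing

res_dplyr
res_base
```

Both results contain the same rows in the same order; the base R version needs an intermediate object and an explicit reset of row names, which hints at why the tasks, rather than the functions, should guide the comparison.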
This concludes our exercises on dplyr — but the topic of data transformation will stay relevant throughout this book.