1 Prerequisites

1.1 What is R?

R is a computer language used primarily for statistical analysis and graphics, including but not limited to:

  • data handling and manipulation
  • data analysis
  • data visualization

There are many other statistical programs, such as STATA, SAS, and SPSS. What differentiates R from these programs is that R is a full programming language rather than a fixed program. Whereas in STATA you instruct your computer to run a set of pre-determined commands, in R you can develop your own set of commands from scratch.

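This is possible because R is an interpreted language: the R interpreter and many of its core routines are written in C and Fortran (and many packages also use C++), and the interpreter runs whatever R code you write on top of them. This makes R's uses much more flexible and varied than those of other statistical programs. In programs like STATA, you are largely limited to the set of commands already made available to you. Meanwhile, if a particular command does not exist in R, you can create it yourself.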

This is because R is a functional language. In other words, nearly all of the commands in R can be performed using some function f(x):

sum(c(0, 1, 2, 3))
## [1] 6

where in this instance, f() is sum(), and x is the vector c(0, 1, 2, 3). However, if the sum() function did not exist, you could create your own:

# define a function called `my_sum()` that takes a vector `x`
my_sum <- function(x) {
  # Initialize the sum to 0
  total <- 0
  
  # Iterate through the vector and add each element to the total
  for (val in x) {
    total <- total + val
  }

  # Return the final sum
  return(total)
}

my_sum(c(0, 1, 2, 3))
## [1] 6
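
As a further illustration of the idea that nearly everything in R is a function call, here is a minimal sketch. The names `square` and `squares` are made up for this example and are not part of R. Even the `+` operator can be called like an ordinary function, and a function you write yourself can be handed to another function such as `sapply()`:

```r
# operators are functions too: this is the same as writing 2 + 3
`+`(2, 3)
## [1] 5

# a small user-defined function (hypothetical name)
square <- function(v) {
  v^2
}

# pass the function to sapply(), which applies it to each element in turn
squares <- sapply(c(1, 2, 3), square)
squares
## [1] 1 4 9
```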

This is part of what separates R from other statistical programs. Anyone can create functions, and as a result much of R's functionality comes from packages (i.e., sets of related functions) that many independent developers have created and made freely available to anyone with a computer and an internet connection. R itself is open source: its source code is publicly available, and anyone may inspect, modify, and redistribute it. Meanwhile, programs like STATA and SAS are examples of closed-source programs, meaning that the code is owned and modified by STATA and SAS developers only. R is free to use, while STATA and SAS require paid licenses.

Note! There are advantages and disadvantages to open-source programs. R is much more diverse as a result of its thousands of packages; however, it is up to each developer to ensure that their code and packages work correctly. Programs like STATA are less diverse due to being closed source; however, STATA ensures that its code works as intended.

R is also a vectorized language. This means that operations are applied to entire vectors at once, rather than element by element. In practice, R is much more efficient when it operates on whole vectors. For example, say you want the element-by-element sums of two vectors: c(1, 2, 3) and c(3, 2, 1). Whereas many other languages require looping through the k = 3 elements of each vector:

vector_sum <- c()
for(i in 1:3){
  vector_sum[i] <- c(1, 2, 3)[i] + c(3, 2, 1)[i]
}
print(vector_sum)
## [1] 4 4 4

you can simply add two vectors in R:

c(1, 2, 3) + c(3, 2, 1)
## [1] 4 4 4

Understanding that R is a vectorized language matters little if you work with small-to-medium data. However, when analyzing big data, this point is critical. This is because a data frame (or tibble) in R is stored in your computer's memory as a list of column vectors, and a matrix is likewise stored column by column (top-to-bottom, then left-to-right). Thus, each column in your dataset is a vector, making it much more efficient to operate on a column than on a row.
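
To get a rough feel for the difference on larger data, the sketch below times a loop against the vectorized equivalent. The vector size (one million values) and the variable names are arbitrary choices for illustration, and the timings will differ from machine to machine, so no timing output is shown:

```r
# one million random numbers (size chosen only for illustration)
set.seed(1)
big <- runif(1e6)

# element-by-element loop
loop_time <- system.time({
  doubled_loop <- numeric(length(big))
  for (i in seq_along(big)) {
    doubled_loop[i] <- big[i] * 2
  }
})

# vectorized version: a single operation on the whole vector
vec_time <- system.time({
  doubled_vec <- big * 2
})

# both approaches give the same answer
identical(doubled_loop, doubled_vec)
## [1] TRUE

# compare elapsed times in seconds (values depend on your machine)
loop_time["elapsed"]
vec_time["elapsed"]
```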

1.2 What is RStudio?

While R is a computer language, RStudio is an Integrated Development Environment (IDE) that provides a more user-friendly interface for coding in R. In RStudio you can write and run .R scripts, debug them, and spot typos. It also provides a window for the plots you generate and displays all of the objects stored in your current environment.

Note! It is not necessary to use RStudio to run R code; however, RStudio will make it much easier to develop your R code.

1.3 The Essentials

The code presented below is just a primer on the main syntax, operations, and data types of R. It is not exhaustive. There are many tutorials and books online that provide in-depth details on these topics.

1.3.1 Basic syntax

You can output text using single or double quotes:

'Hello SPH!'
## [1] "Hello SPH!"

You can output numbers by just typing the number:

8
## [1] 8

1.3.2 Basic math

You can perform basic mathematical operations:

2+2
2-2
2*2
2/2
2^2

1.3.3 Variables

You can store a value or data structure in a variable using the assignment operator <- (or, less commonly, its right-pointing form ->):

# a vector named x
x <- 1

You can print this variable by just directly calling it or using the print() function:

x
## [1] 1
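
For completeness, here is a small sketch of the right-pointing form and of `print()`. The variable name `z` is made up for this example:

```r
# the usual left-pointing assignment
x <- 1

# right-pointing assignment: value on the left, name on the right
10 -> z

# print() gives the same output as typing the name on its own
print(z)
## [1] 10

# stored variables can be reused in later calculations
z + x
## [1] 11
```

Most R code uses `<-`, so it is reasonable to treat `->` as something to recognize rather than something to write.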

1.3.4 Booleans

Booleans are logical values (i.e., TRUE or FALSE). They are typically produced by comparisons using the following operators:

  • >: greater than
  • >=: greater than or equal to
  • <: less than
  • <=: less than or equal to
  • ==: equal to
  • !=: not equal to

4 >= 3
## [1] TRUE
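
The other operators behave the same way, and because R is vectorized, a comparison against a vector returns one logical value per element. A short sketch (the vector `v` is made up for this example):

```r
# single comparisons
3 <= 3
## [1] TRUE
5 != 5
## [1] FALSE

# comparing a whole vector returns a logical vector of the same length
v <- c(1, 2, 3, 4, 5)
v > 3
## [1] FALSE FALSE FALSE  TRUE  TRUE
v == 3
## [1] FALSE FALSE  TRUE FALSE FALSE
```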

1.3.5 Data types and basic statistics

There are many ways to store data in R, including:

  • vectors
  • lists
  • matrices
  • data frames

You can create a vector of any size using c():

y <- c(1, 2, 3, 4, 5)

You can compute basic summary statistics of vectors using sum(), mean(), sd(), var(), median(), IQR(), range(), quantile(), length():

mean(y)
## [1] 3
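
A few of the other functions listed above, applied to the same vector as a quick sketch:

```r
y <- c(1, 2, 3, 4, 5)

sum(y)      # total of all elements
## [1] 15
median(y)   # middle value
## [1] 3
var(y)      # sample variance
## [1] 2.5
sd(y)       # sample standard deviation (square root of the variance)
## [1] 1.581139
IQR(y)      # interquartile range (75th minus 25th percentile)
## [1] 2
range(y)    # minimum and maximum
## [1] 1 5
length(y)   # number of elements
## [1] 5
```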

You can draw simple plots of vectors using hist(), barplot(), boxplot():

boxplot(y)

A list is an ordered collection of elements (here, vectors) that may differ in length and type:

a <- list(c(1, 2, 3), c(3, 2, 1, 0, -1, -2, -3))
print(a)
## [[1]]
## [1] 1 2 3
## 
## [[2]]
## [1]  3  2  1  0 -1 -2 -3

A list is therefore similar to a data frame whose columns are not necessarily the same length. A data frame, by contrast, requires every column to have the same length:

df <- data.frame(x = c(0, -1, 2), y = c(3, 2, 1))
df
##    x y
## 1  0 3
## 2 -1 2
## 3  2 1
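
The relationship between lists and data frames can be checked directly. The sketch below assumes nothing beyond base R; the names `result` and `ok` and the error message text are made up for this example:

```r
df <- data.frame(x = c(0, -1, 2), y = c(3, 2, 1))

# a data frame is stored as a list of columns
is.list(df)
## [1] TRUE

# so its length is the number of columns
length(df)
## [1] 2

# columns of incompatible lengths are rejected; tryCatch() captures the
# error so the script keeps running
result <- tryCatch(
  data.frame(x = c(0, -1, 2), y = c(3, 2)),
  error = function(e) "error: columns differ in length"
)
result
## [1] "error: columns differ in length"

# a plain list accepts the same vectors without complaint
ok <- list(x = c(0, -1, 2), y = c(3, 2))
lengths(ok)
```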

1.3.6 Accessing data

There are three main ways to access data:

  • [ ] to access a particular index of a vector
  • [[ ]] to access a particular element of a list
  • $ to access a named column of a data frame (or a named element of a list)

# get the third element of vector y
y <- c(1, 2, 3, 4, 5)
y[3]
## [1] 3

# get the third vector of list b
b <- list(c("Zebra"), c(1, 2, 3), c(-2, 0, 2, 4, 6))
b[[3]]
## [1] -2  0  2  4  6

# get the first column of dataframe df2
df2 <- data.frame(a = c("Apple", "Banana", "Chocolate"),
                  b = c(5, 10, 15))
df2$a
## [1] "Apple"     "Banana"    "Chocolate"
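
These accessors can also be combined. The following sketch reuses the objects from the examples above and shows how `[[ ]]` and `$` relate on a data frame, and how `[ ]` and `[[ ]]` differ on a list:

```r
df2 <- data.frame(a = c("Apple", "Banana", "Chocolate"),
                  b = c(5, 10, 15))

# [[ ]] with a column name returns the same vector as $
identical(df2[["b"]], df2$b)
## [1] TRUE

# accessors can be chained: the second element of column b
df2$b[2]
## [1] 10

# on a list, [ ] keeps the list wrapper while [[ ]] extracts the element itself
b <- list(c("Zebra"), c(1, 2, 3), c(-2, 0, 2, 4, 6))
class(b[2])
## [1] "list"
class(b[[2]])
## [1] "numeric"
```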
