STAT 302 Statistical Computing

Lecture 4: Data Manipulation and Visualization

Yikun Zhang (Winter 2024)

Outline

  1. Using Packages in R

  2. Data Manipulation via tidyverse

  3. Basic Graphics in R

  4. Data Visualization via ggplot2

* Acknowledgement: Parts of the slides are modified from the course materials by Prof. Ryan Tibshirani, Prof. Yen-Chi Chen, Prof. Deborah Nolan, Bryan Martin, and Andrea Boskovic.

Part 1: Using Packages in R

What is an R package?

R packages contain code, data, and documentation in a standardized collection format that can be installed and utilized by users of R.

  • There are 19,961+ official R packages available on the Comprehensive R Archive Network (CRAN). In addition, some unofficial R packages are hosted on GitHub.

  • These packages implement miscellaneous statistical methods using functions in R, which makes our programming and data analysis easier.

How Can We Install R Packages?

If a package is officially available on CRAN, like most packages we will use for this course, we can install it using

install.packages("PACKAGE_NAME_IN_QUOTES")

Or, we can use the "Packages" tab in the lower right panel and click the "Install" button to install an official package in RStudio.

  • After a package is installed, it is saved on our computer until we update R, and we don't need to re-install it.

  • There is no need to include a call to install.packages() in any .R or .Rmd file!

Occasionally, we may want to install an R package from a .tar.gz file downloaded from CRAN or elsewhere:

install.packages("pkgname.tar.gz", repos = NULL, type = "source")
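
Since installation only needs to happen once, it can be handy to check whether a package is already available before installing it. The following is a minimal sketch using base R only; the package name "stats" is just a stand-in chosen because it ships with every R installation.

```r
# Check whether a package is installed without attaching it.
# "stats" is a placeholder; substitute the package of interest.
pkg <- "stats"
is_installed <- requireNamespace(pkg, quietly = TRUE)
print(is_installed)

# Look up the installed version (a "package_version" object).
pkg_version <- packageVersion(pkg)
print(pkg_version)
```

If `is_installed` is FALSE for some package, that is the moment to run install.packages() once in the console.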

How Can We Use R Packages?

After a package is installed, we can load it into our current R session using library(), or require() when the call sits inside a function that we write ourselves:

library(PACKAGE_NAME)
# or
library("PACKAGE_NAME")
  • Unlike install.packages(), it is not necessary to include the package name in quotes.

  • Loading a package must be done with each new R session, so we should put calls to library() in our .R and .Rmd files whenever we use some R packages in our code.

  • In .Rmd files, we can load all the required packages in the opening chunk and set the parameter include = FALSE in that chunk to hide the messages and code.

    {r, include = FALSE}
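
The practical difference between library() and require() shows up when a package is missing: library() stops with an error, while require() returns FALSE so that the calling code can react. The sketch below illustrates this with a deliberately fake package name (an assumption made for the demonstration).

```r
# A made-up package name, used only to show what happens on failure.
missing_pkg <- "notARealPackage12345"

# require() returns FALSE (with a warning) when the package is unavailable.
req_result <- suppressWarnings(
  require(missing_pkg, character.only = TRUE, quietly = TRUE)
)
print(req_result)

# library() throws an error instead, which we capture with tryCatch().
lib_result <- tryCatch(
  {
    library(missing_pkg, character.only = TRUE)
    "loaded"
  },
  error = function(e) "error"
)
print(lib_result)
```

In scripts and .Rmd files, the error from library() is usually what we want, because it fails early and loudly.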

Install R Packages From GitHub

The devtools package provides an install_github() function to install R packages hosted on GitHub, though it requires the developer's account name.

library(devtools)
install_github("DeveloperName/PackageName")

Here is an example where we don't have to load the devtools package:

devtools::install_github("zhangyk8/Debias-Infer", subdir = "R_Package")

The githubinstall package provides a function githubinstall(), which does not need the developer's name.

library(githubinstall)
githubinstall("PackageName")

Part 2: Data Manipulation via tidyverse

What is tidyverse?

The tidyverse is a coherent collection of packages in R for data science (and tidyverse itself is also a package that loads all its constituent packages). Packages include:

  • Data reading and saving: readr.

  • Data manipulation: dplyr, tidyr.

  • Iteration: purrr.

  • Visualization: ggplot2.

We can install all of them using

install.packages("tidyverse")

Note: We only need to do this once!

Why Do We Need tidyverse?

  • These packages have a very consistent API as well as an active developer and user community.

  • Function names and commands follow a focused grammar.

  • The functions are powerful and fast when working with data frames and lists (matrices, not so much, yet!).

  • Pipes (the %>% operator) allow us to fluidly glue functionality together.

    • At its best, tidyverse code can be read like a story using the pipe operator!

Load tidyverse into R

We can load all the tidyverse packages into our current R session using the library() function.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::group_rows() masks kableExtra::group_rows()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Conflicts in Using R Packages

Recall that R packages encapsulate functions written by different R developers.

  • Occasionally, some of these functions in different packages may share the same name, which introduces a conflict.

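  • Whichever package we load most recently using library() will mask the earlier function, meaning that R will default to the most recently loaded version. A masked function can still be called explicitly with the :: operator, as in stats::filter().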

  • In general, this is fine, especially with tidyverse. The conflict message is to make sure that we are aware of conflicts.
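
To make the masking behavior concrete, here is a small sketch, assuming the dplyr package is installed. It checks which package the bare name filter() resolves to after loading dplyr, and then calls both versions explicitly.

```r
suppressPackageStartupMessages(library(dplyr))

# After attaching dplyr, the bare name filter() resolves to dplyr's version.
filter_home <- environmentName(environment(filter))
print(filter_home)

# The masked stats version is still reachable through the :: operator.
# Here it computes a 3-point moving average of 1:5.
moving_avg <- stats::filter(1:5, rep(1/3, 3))
print(moving_avg)

# dplyr's version can also be called explicitly to avoid any ambiguity.
six_cyl <- dplyr::filter(mtcars, cyl == 6)
print(nrow(six_cyl))
```

Writing package::function() is slightly more verbose, but it makes the intended version unambiguous regardless of the loading order.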

Data Manipulation in a Tidy Way

  • The packages dplyr and tidyr are going to be our main workhorses for data manipulation.

  • The main data type used by these packages is the data frame (or tibble, but we won't go there).

Why do we need to learn data manipulation through tidyverse?

  • Learning pipes %>% will facilitate our learning of the dplyr and tidyr verbs (or functions).

  • The functions in dplyr are analogous to their SQL counterparts, so learning dplyr gives us some SQL syntax for free!

Learning Pipes %>%

Piping at its most basic level:

  • It uses the %>% operator to take the output from a previous function call and "pipe" it through to the next function, in order to form a flow of results.

This can really help with the readability of code when we use multiple nested functions!

  • Shortcut for typing %>%: use Ctrl + Shift + M in RStudio (Cmd + Shift + M on macOS).

Note: Linux and other Unix-like systems also have pipes in the shell, as in:

ls -l | grep tidy | wc -l

The Logic of Pipes with a Single Argument

Passing a single argument through pipes, we interpret the following code as h(g(f(x))).

x %>% f %>% g %>% h

Note: In our mind, when we see the %>% operator, we should read this as "and then".

We can write exp(1) with pipes as 1 %>% exp, and log(exp(1)) as 1 %>% exp %>% log.

1 %>% exp
## [1] 2.718282
1 %>% exp %>% log
## [1] 1
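
The equivalence between a pipe chain and nested calls can be checked directly. The sketch below assumes the magrittr package (which provides %>% and is attached by library(tidyverse)); the three toy functions are hypothetical names introduced only for illustration.

```r
library(magrittr)  # provides %>%

# Three toy functions, made up for this illustration.
add_one   <- function(v) v + 1
times_two <- function(v) v * 2
square    <- function(v) v^2

x <- 3

# Read as: take x, and then add one, and then double, and then square.
piped  <- x %>% add_one %>% times_two %>% square

# The same computation written as nested calls, read inside-out.
nested <- square(times_two(add_one(x)))

print(piped)
print(identical(piped, nested))
```

The piped version lists the steps in the order in which they are executed, which is the main readability gain.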

The Logic of Pipes with Multiple Arguments

For functions with multiple arguments, we interpret the following code as f(x, y).

x %>% f(y)

We can extract the first row of the mtcars data frame using the following pipe syntax.

# Base R syntax
head(mtcars, 1)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21 6 160 110 3.9 2.62 16.46 0 1 4 4
# Pipes syntax
mtcars %>% head(1)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21 6 160 110 3.9 2.62 16.46 0 1 4 4

The Logic of Pipes with Multiple Arguments

The command x %>% f(y) can be equivalently written in dot notation as:

x %>% f(., y)

What is the advantage of using dots?

  • Sometimes we may want to pass in a variable as the second or third (say, not first) argument to a function, with a pipe. As in:
x %>% f(y, .)

which is equivalent to f(y,x).
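
As a concrete instance of the dot notation, the sketch below uses setdiff(), whose result depends on the order of its two arguments. It assumes the magrittr package is installed.

```r
library(magrittr)

x <- c(1, 2, 3)

# Default: x becomes the FIRST argument, i.e., setdiff(x, 1:5).
first_arg <- x %>% setdiff(1:5)
print(first_arg)

# With a dot: x becomes the SECOND argument, i.e., setdiff(1:5, x).
second_arg <- x %>% setdiff(1:5, .)
print(second_arg)
```

The two results differ because setdiff(a, b) keeps the elements of a that are not in b, so the placement of the dot changes the answer.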

Some Examples with Pipes

Let's interpret the following code without executing it first.

state_df = data.frame(state.x77)
state.region %>%
  tolower %>%
  tapply(state_df$Income, ., summary)
## $`north central`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4167 4466 4594 4611 4694 5107
##
## $northeast
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3694 4281 4558 4570 4903 5348
##
## $south
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3098 3622 3848 4012 4316 5299
##
## $west
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3601 4347 4660 4703 4963 6315

Some Examples with Pipes

Let's interpret the following code without executing it first.

x = "Data Manipulation with Pipes"
x %>%
  strsplit(split = " ") %>%
  .[[1]] %>% # indexing
  nchar %>%
  max
## [1] 12
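
For comparison, the same computation can be written without pipes. This is a sketch in base R only, with the intermediate variable names chosen here for illustration.

```r
x <- "Data Manipulation with Pipes"

# Step by step, naming each intermediate result.
words <- strsplit(x, split = " ")[[1]]  # split the string into words
word_lengths <- nchar(words)            # number of characters per word
longest <- max(word_lengths)            # length of the longest word
print(longest)

# The same computation as one nested call, read inside-out.
nested <- max(nchar(strsplit(x, split = " ")[[1]]))
print(nested)
```

The longest word is "Manipulation", which matches the value returned by the piped version.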

dplyr Functions

Some of the most important dplyr verbs (functions):

  • filter(): subset rows based on a condition.

  • group_by(): define groups of rows according to a column or specific condition.

  • summarize(): apply computations across groups of rows.

  • arrange(): order rows by value of a column.

  • select(): pick out given columns.

  • mutate(): create new columns.

  • mutate_at(): apply a function to given columns.
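
Before looking at each verb separately, here is a sketch of how several of them combine in a single pipeline, assuming dplyr is installed. The threshold hp > 100 and the column names mpg_avg and n_cars are arbitrary choices for illustration.

```r
suppressPackageStartupMessages(library(dplyr))

result <- mtcars %>%
  filter(hp > 100) %>%                        # keep the more powerful cars
  group_by(cyl) %>%                           # one group per cylinder count
  summarize(mpg_avg = mean(mpg),              # average mpg within each group
            n_cars = n()) %>%                 # number of cars in each group
  arrange(desc(mpg_avg)) %>%                  # most fuel-efficient group first
  select(cyl, n_cars, mpg_avg)                # reorder the output columns

print(result)
```

Each line can be read as "and then", so the pipeline describes the analysis in the order in which it is carried out.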

filter() Function

The filter() function is to subset rows based on a condition.

# Built-in data frame of cars data, 32 cars x 11 variables
mtcars %>% head(2)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21 6 160 110 3.9 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4
mtcars %>% filter((mpg >= 20 & disp >= 200) | (drat <= 3))
## mpg cyl disp hp drat wt qsec vs am gear carb
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
## Cadillac Fleetwood 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4
## Dodge Challenger 15.5 8 318 150 2.76 3.520 16.87 0 0 3 2

filter() Function

An alternative approach using the subset() function in base R:

subset(mtcars, (mpg >= 20 & disp >= 200) | (drat <= 3))
## mpg cyl disp hp drat wt qsec vs am gear carb
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
## Cadillac Fleetwood 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4
## Dodge Challenger 15.5 8 318 150 2.76 3.520 16.87 0 0 3 2

filter() Function

Another alternative approach using base R indexing:

mtcars[(mtcars$mpg >= 20 & mtcars$disp >= 200) | (mtcars$drat <= 3), ]
## mpg cyl disp hp drat wt qsec vs am gear carb
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
## Cadillac Fleetwood 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4
## Dodge Challenger 15.5 8 318 150 2.76 3.520 16.87 0 0 3 2

group_by() Function

  • The group_by() function is to define groups of rows according to a column or specific condition.
# Grouped by number of cylinders
mtcars %>% group_by(cyl) %>% head(2)
## # A tibble: 2 × 11
## # Groups: cyl [1]
## mpg cyl disp hp drat wt qsec vs am gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
## 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4

Note: The group_by() function doesn't actually change anything about the way that the data frame looks. The only difference is that the printed output now reports the groups.
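
The grouping information can also be inspected programmatically. The following sketch, assuming dplyr is installed, uses group_vars() and n_groups() to query the groups and confirms that the underlying values are untouched.

```r
suppressPackageStartupMessages(library(dplyr))

by_cyl <- mtcars %>% group_by(cyl)

# The grouping is stored as metadata on the data frame.
print(group_vars(by_cyl))   # name(s) of the grouping column(s)
print(n_groups(by_cyl))     # number of distinct groups

# The data values themselves are unchanged by group_by().
same_values <- isTRUE(all.equal(by_cyl$mpg, mtcars$mpg))
print(same_values)

# ungroup() removes the grouping metadata again.
print(group_vars(ungroup(by_cyl)))
```

The grouping only takes effect once a later verb, such as summarize(), makes use of it.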

summarize() Function

The summarize() function is to apply computations across groups of rows.

# Ungrouped
summarize(mtcars, mpg_avg = mean(mpg), hp_avg = mean(hp))
## mpg_avg hp_avg
## 1 20.09062 146.6875
# Grouped by number of cylinders
summarize(group_by(mtcars, cyl), mpg_avg = mean(mpg), hp_avg = mean(hp))
## # A tibble: 3 × 3
## cyl mpg_avg hp_avg
## <dbl> <dbl> <dbl>
## 1 4 26.7 82.6
## 2 6 19.7 122.
## 3 8 15.1 209.

Can we rewrite the above code using pipes?

summarize() Function

The summarize() function is to apply computations across groups of rows.

mtcars %>%
group_by(cyl) %>%
summarize(mpg_avg = mean(mpg), hp_avg = mean(hp))
## # A tibble: 3 × 3
## cyl mpg_avg hp_avg
## <dbl> <dbl> <dbl>
## 1 4 26.7 82.6
## 2 6 19.7 122.
## 3 8 15.1 209.

Note: Using the group_by() function makes the difference here.

arrange() Function

The arrange() function is to order rows by value of a column.

mtcars %>%
arrange(mpg) %>%
head(3)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Cadillac Fleetwood 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4
## Camaro Z28 13.3 8 350 245 3.73 3.840 15.41 0 0 3 4
# Base R syntax
mpg_inds = order(mtcars$mpg)
head(mtcars[mpg_inds, ], 3)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Cadillac Fleetwood 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4
## Camaro Z28 13.3 8 350 245 3.73 3.840 15.41 0 0 3 4

arrange() Function

We can also sort in descending order.

mtcars %>%
arrange(desc(mpg)) %>%
head(3)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
# Base R syntax
mpg_inds_decr = order(mtcars$mpg, decreasing = TRUE)
head(mtcars[mpg_inds_decr, ], 3)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2

arrange() Function

We can order by multiple columns as well.

mtcars %>%
arrange(desc(gear), desc(hp)) %>%
head(7)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.6 0 1 5 8
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.5 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.3 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.9 1 0 4 4
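
For completeness, a base R counterpart of the multi-column ordering might look like the sketch below. Negating the numeric columns inside order() is one common way to obtain descending order on several keys; it works here because both gear and hp are numeric.

```r
# Sort by gear (descending), breaking ties by hp (descending).
ord <- order(-mtcars$gear, -mtcars$hp)
top7 <- head(mtcars[ord, ], 7)
print(top7[, c("gear", "hp")])
```

The first key (gear) dominates, and the second key (hp) only matters among rows with the same number of gears.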

select() Function

The select() function is to pick out given columns.

mtcars %>%
select(cyl, disp, hp) %>%
head(3)
## cyl disp hp
## Mazda RX4 6 160 110
## Mazda RX4 Wag 6 160 110
## Datsun 710 4 108 93
# Base R syntax
head(mtcars[, c("cyl", "disp", "hp")], 3)
## cyl disp hp
## Mazda RX4 6 160 110
## Mazda RX4 Wag 6 160 110
## Datsun 710 4 108 93

Some Handy select() Helpers

mtcars %>%
select(starts_with("d")) %>%
head(3)
## disp drat
## Mazda RX4 160 3.90
## Mazda RX4 Wag 160 3.90
## Datsun 710 108 3.85
# Base R syntax
d_colnames = grep(x = colnames(mtcars), pattern = "^d")
head(mtcars[, d_colnames], 3)
## disp drat
## Mazda RX4 160 3.90
## Mazda RX4 Wag 160 3.90
## Datsun 710 108 3.85

Note: With the base R syntax, we need to use a regular expression.

Some Handy select() Helpers

mtcars %>%
select(ends_with('t')) %>%
head(3)
## drat wt
## Mazda RX4 3.90 2.620
## Mazda RX4 Wag 3.90 2.875
## Datsun 710 3.85 2.320
mtcars %>%
select(contains('ar')) %>%
head(3)
## gear carb
## Mazda RX4 4 4
## Mazda RX4 Wag 4 4
## Datsun 710 4 1

More details about these select() helper functions can be found in the help page ?dplyr::select.
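
The ends_with() and contains() helpers also have base R counterparts built on grep(). The sketch below is one possible translation; note that the dplyr helpers ignore case by default, whereas grep() is case-sensitive unless told otherwise.

```r
cols <- colnames(mtcars)

# Counterpart of ends_with("t"): anchor the pattern at the end with "$".
ends_t <- grep("t$", cols, value = TRUE)
print(head(mtcars[, ends_t], 3))

# Counterpart of contains("ar"): match the literal substring anywhere.
has_ar <- grep("ar", cols, value = TRUE, fixed = TRUE)
print(head(mtcars[, has_ar], 3))
```

Using value = TRUE returns the matching column names rather than their positions, which makes the selection easier to read.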

mutate() Function

The mutate() function is to create new columns.

mtcars = mtcars %>%
mutate(hp_wt = hp/wt,
mpg_wt = mpg/wt)
# Base R
mtcars$hp_wt = mtcars$hp/mtcars$wt
mtcars$mpg_wt = mtcars$mpg/mtcars$wt

The newly created variables can be used immediately.

mtcars = mtcars %>%
mutate(hp_wt_again = hp/wt,
hp_wt_cyl = hp_wt_again/cyl)
# Base R
mtcars$hp_wt_again = mtcars$hp/mtcars$wt
mtcars$hp_wt_cyl = mtcars$hp_wt_again/mtcars$cyl

mutate_at() Function

The mutate_at() function is to apply a function to one or several columns.

mtcars = mtcars %>%
mutate_at(c("hp_wt", "mpg_wt"), log)
# Base R
mtcars$hp_wt = log(mtcars$hp_wt)
mtcars$mpg_wt = log(mtcars$mpg_wt)

Note:

  • Calling dplyr functions always outputs a new data frame, and it does not alter the existing data frame.

  • To keep the changes, we have to reassign the data frame to be the output of the pipe! (See the example above).
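
In dplyr 1.0.0 and later, mutate_at() is superseded by combining mutate() with across(). The sketch below shows this alternative, assuming a recent dplyr is installed; the result is stored under new names (chosen here for illustration) so that the built-in mtcars is left untouched.

```r
suppressPackageStartupMessages(library(dplyr))

# Create the two ratio columns first.
cars_ratio <- mtcars %>%
  mutate(hp_wt = hp / wt,
         mpg_wt = mpg / wt)

# Apply log() to both columns at once with across().
cars_log <- cars_ratio %>%
  mutate(across(c(hp_wt, mpg_wt), log))

print(head(cars_log[, c("hp_wt", "mpg_wt")], 3))

# The original data frame is unchanged because we never reassigned it.
print("hp_wt" %in% names(mtcars))
```

This also illustrates the note above: the transformed columns exist only in the objects that we explicitly assigned.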

Linking dplyr to SQL

Learning dplyr also facilitates our understanding of SQL syntax.

  • For example, select() is SELECT, filter() is WHERE, arrange() is ORDER BY, group_by() is GROUP BY, etc.

  • This will make it easier for tasks that require using both R and SQL to manage data and build statistical models.

  • Another major link to SQL is through merging or joining data frames, via left_join() and inner_join() functions.
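
To illustrate the difference between the two joins, here is a minimal sketch, assuming dplyr is installed. The two small data frames (students and scores) are made up for this example.

```r
suppressPackageStartupMessages(library(dplyr))

# Toy data: student 3 has no score, and score id 4 has no student.
students <- data.frame(id = c(1, 2, 3),
                       name = c("Ann", "Bob", "Cat"))
scores <- data.frame(id = c(1, 2, 4),
                     score = c(90, 85, 70))

# left_join(): keep every row of the left table; unmatched scores become NA.
left <- left_join(students, scores, by = "id")
print(left)

# inner_join(): keep only the rows whose id appears in both tables.
inner <- inner_join(students, scores, by = "id")
print(inner)
```

These correspond to LEFT JOIN and INNER JOIN in SQL, with the by argument playing the role of the ON clause.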

tidyr Functions

Recall the tidy data principle for data (or a data frame/table) that we discussed in Lecture 2:

  1. Each variable must have its own column.

  2. Each observation must have its own row.

  3. Each value must have its own cell.

Two of the most important tidyr verbs (functions) help us achieve the tidy data principle:

  • pivot_longer(): make "wide" data longer.

  • pivot_wider(): make "long" data wider.

There are many other verbs, such as nest() and unnest(), as well as the older spread() and gather(), which pivot_wider() and pivot_longer() have superseded. More details can be found in the tidyr documentation.

pivot_longer() Function

# devtools::install_github("rstudio/EDAWR")
library(EDAWR) # Load some nice data sets
EDAWR::cases
## country 2011 2012 2013
## 1 FR 7000 6900 7000
## 2 DE 5800 6000 6200
## 3 US 15000 14000 13000
EDAWR::cases %>%
pivot_longer(names_to = "year", values_to = "n", cols = 2:4)
## # A tibble: 9 × 3
## country year n
## <chr> <chr> <dbl>
## 1 FR 2011 7000
## 2 FR 2012 6900
## 3 FR 2013 7000
## 4 DE 2011 5800
## 5 DE 2012 6000
## 6 DE 2013 6200
## 7 US 2011 15000
## 8 US 2012 14000
## 9 US 2013 13000

pivot_longer() Function

Here, we transposed columns 2:4 into a "year" column and put the corresponding count values into a column called "n".

  • The pivot_longer() function did all the heavy lifting of the transposing work, and we just had to specify the output.
# A different approach that does the same thing:
# select every column except country
EDAWR::cases %>%
  pivot_longer(cols = -country, names_to = "year", values_to = "n")
## # A tibble: 9 × 3
## country year n
## <chr> <chr> <dbl>
## 1 FR 2011 7000
## 2 FR 2012 6900
## 3 FR 2013 7000
## 4 DE 2011 5800
## 5 DE 2012 6000
## 6 DE 2013 6200
## 7 US 2011 15000
## 8 US 2012 14000
## 9 US 2013 13000

pivot_wider() Function

Here, we transposed to a wide format by "size" and tabulated the corresponding "amount" for each "size".

  • Note that pivot_wider() and pivot_longer() are inverses.
EDAWR::pollution
## city size amount
## 1 New York large 23
## 2 New York small 14
## 3 London large 22
## 4 London small 16
## 5 Beijing large 121
## 6 Beijing small 56
EDAWR::pollution %>%
pivot_wider(names_from = "size", values_from = "amount")
## # A tibble: 3 × 3
## city large small
## <chr> <dbl> <dbl>
## 1 New York 23 14
## 2 London 22 16
## 3 Beijing 121 56
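
The claim that the two functions are inverses can be checked with a round trip. Because the EDAWR package is not on CRAN, the sketch below re-types the cases table from the printed output above into a plain data frame (an assumption about its structure) and only requires tidyr.

```r
library(tidyr)

# Re-typed copy of the cases table; check.names = FALSE keeps "2011" etc. as names.
cases <- data.frame(country = c("FR", "DE", "US"),
                    `2011` = c(7000, 5800, 15000),
                    `2012` = c(6900, 6000, 14000),
                    `2013` = c(7000, 6200, 13000),
                    check.names = FALSE)

# Wide -> long: one row per (country, year) pair.
long <- pivot_longer(cases, cols = -country,
                     names_to = "year", values_to = "n")
print(dim(long))

# Long -> wide: back to one row per country.
wide <- pivot_wider(long, names_from = year, values_from = n)
print(wide)
```

After the round trip, the column names and the values agree with the original table; only the class changes, since tidyr returns a tibble.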

Part 3: Basic Graphics in R

Overview of Base R Plotting Functions

Base R has a set of powerful plotting tools:

  • plot(): generic plotting function.

  • points(): add points to an existing plot.

  • lines(), abline(): add lines to an existing plot.

  • text(), legend(): add text to an existing plot.

  • rect(), polygon(): add shapes to an existing plot.

  • hist(), image(): histogram and heatmap.

  • heat.colors(), topo.colors(), etc: create a color vector.

  • density(): estimate density, which can be plotted.

  • contour(): draw contours, or add to existing plot.

  • curve(): draw a curve, or add to existing plot.

Scatter Plots

To make a scatter plot of one variable versus another, we use plot().

set.seed(123)
x = sort(runif(50, min=-2, max=2))
y = x^3 + rnorm(50)
plot(x, y)

Plot Types

The type argument controls the plot type. Default is "p" for points; set it to "l" for lines. If we want both points and lines, set it to "b".

plot(x, y, type="b")

More details can be found by running ?plot.

Plot Labels

The main argument controls the title; xlab and ylab are the x and y labels.

plot(x, y, main="A noisy cubic", xlab="My x variable", ylab="My y variable")

Point Types

We use the pch argument to control point type.

plot(x, y, pch = 19) # Filled circles

Line Types

We use the lty argument to control the line type, and lwd to control the line width.

plot(x, y, type="l", lty=2, lwd=3) # Dashed line, 3 times as thick

Colors

We use the col argument to control the color. It can be:

  • An integer between 1 and 8 for basic colors.

  • A string for any of the 657 available named colors.

The function colors() returns a character vector of the available color names.

plot(x, y, pch=19, col="red")
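
The two ways of specifying a color can be explored directly from the console. This short sketch uses base R only.

```r
# All named colors known to R.
all_colors <- colors()
print(length(all_colors))
print(head(all_colors, 5))

# The integer codes 1 to 8 index into the current palette.
print(palette())

# Check whether a particular name is available before using it.
print("red" %in% all_colors)
```

The integer codes depend on the current palette, so named colors are the more robust choice when a specific color matters.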

Multiple Plots

To set up a plotting grid of arbitrary dimension, we use the par() function with the argument mfrow.

par(mfrow=c(2,2)) # Grid elements are filled by row
plot(x, y, main="Red cubic", pch=20, col="red")
plot(x, y, main="Blue cubic", pch=20, col="blue")
plot(rev(x), y, main="Flipped green", pch=20, col="green")
plot(rev(x), y, main="Flipped purple", pch=20, col="purple")

Margins of the Plots

Default margins in R are large (and ugly); to change them, we use the par() function with the argument mar.

par(mfrow = c(2,2), mar = c(4,4,2,0.5))
plot(x, y, main="Red cubic", pch=20, col="red")
plot(x, y, main="Blue cubic", pch=20, col="blue")
plot(rev(x), y, main="Flipped green", pch=20, col="green")
plot(rev(x), y, main="Flipped purple", pch=20, col="purple")

Saving Plots

We use the pdf() function to save a pdf file of our plot in the current R working directory.

getwd() # This is where the pdf will be saved
## [1] "/media/yikun/Disk_D1/Graduate School/STAT 302/Lectures"
pdf(file="noisy_cubics.pdf", height=7, width=7) # Height, width are in inches
par(mfrow=c(2,2), mar=c(4,4,2,0.5))
plot(x, y, main="Red cubic", pch=20, col="red")
plot(x, y, main="Blue cubic", pch=20, col="blue")
plot(rev(x), y, main="Flipped green", pch=20, col="green")
plot(rev(x), y, main="Flipped purple", pch=20, col="purple")
graphics.off() # Close all graphics devices so that the file is written

Also, we use the jpeg() and png() functions to save jpg and png files.
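
Saving to a png file follows the same open-draw-close pattern. The sketch below writes to a temporary directory (an assumption made so that the example leaves no files behind) and checks capabilities("png") first, since png support depends on how R was built.

```r
set.seed(123)
x <- sort(runif(50, min = -2, max = 2))
y <- x^3 + rnorm(50)

# Write into a temporary directory; replace with a real path as needed.
out_file <- file.path(tempdir(), "noisy_cubic.png")

if (capabilities("png")) {
  png(filename = out_file, width = 700, height = 700)  # Size is in pixels
  plot(x, y, main = "A noisy cubic", pch = 20, col = "red")
  invisible(dev.off())  # Close only the device we just opened
}

print(out_file)
```

Unlike graphics.off(), which closes every open device, dev.off() closes only the current one. The jpeg() function is called in the same way as png().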

Adding to Plots

The main tools for this are:

  • points(): add points to an existing plot.

  • lines(), abline(): add lines to an existing plot.

  • text(), legend(): add text to an existing plot.

  • rect(), polygon(): add shapes to an existing plot.

Note: We should pay attention to the order of the layers: they work just like painting a picture by hand.
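
As an illustration of layering, the sketch below regenerates the noisy cubic data and then adds a curve, a reference line, an annotation, and a legend on top of the scatter plot. The colors and labels are arbitrary choices.

```r
set.seed(123)
x <- sort(runif(50, min = -2, max = 2))
y <- x^3 + rnorm(50)

# Layer 1: the scatter plot sets up the axes.
plot(x, y, pch = 19, col = "gray40", main = "A noisy cubic")

# Layer 2: the true cubic curve, drawn on top of the points.
lines(x, x^3, col = "red", lwd = 2)

# Layer 3: a dashed horizontal reference line at y = 0.
abline(h = 0, lty = 2)

# Layer 4: highlight and label the largest observation.
i_max <- which.max(y)
points(x[i_max], y[i_max], pch = 19, col = "blue")
text(x[i_max], y[i_max], labels = "max", pos = 2)

# Layer 5: a legend explaining the points and the curve.
legend("topleft", legend = c("Data", "True curve"),
       col = c("gray40", "red"), pch = c(19, NA),
       lty = c(NA, 1), lwd = c(NA, 2))
```

Each call draws over what is already there, so swapping the order of the calls changes which elements end up on top.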

Plotting a Histogram

Recall that we can plot a histogram of a numeric vector using hist().

king_lines = readLines("https://github.com/zhangyk8/zhangyk8.github.io/raw/master/_teaching/file_stat302/Data/king.txt")
king_words = strsplit(paste(king_lines, collapse=" "),
split="[[:space:]]|[[:punct:]]")[[1]]
king_words = tolower(king_words[king_words != ""])
king_wlens = nchar(king_words)
hist(king_wlens)

Adding a Histogram to the Existing Plot

To add a histogram to an existing plot (say, another histogram), we use hist() with add=TRUE.

hist(king_wlens, col="pink", freq=FALSE, breaks=0:20,
xlab="Word length", main="King word lengths")
hist(king_wlens + 5, col=rgb(0,0.5,0.5,0.5),
freq=FALSE, breaks=0:20, add=TRUE)

Adding a Density Curve to a Histogram

To estimate a density from a numeric vector, we use the density() function; see this note and this tutorial for more details.

density_est = density(king_wlens, adjust=1.5) # 1.5 times the default bandwidth
class(density_est)
## [1] "density"
names(density_est)
## [1] "x" "y" "bw" "n" "call" "data.name"
## [7] "has.na"

Adding a Density Curve to a Histogram

The density() function returns a list that has components x and y, so we can call lines() directly on the returned object.

hist(king_wlens, col="pink", freq=FALSE, breaks=0:20,
xlab="Word length", main="King word lengths")
lines(density_est, lwd=3)
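
To see what lines() relies on, the sketch below pulls out the x and y components explicitly. It uses synthetic word lengths as a stand-in for king_wlens (an assumption, so that the example runs without downloading the text).

```r
set.seed(302)
# Synthetic stand-in for king_wlens: positive integer "word lengths".
word_lens <- rpois(500, lambda = 4) + 1

density_est <- density(word_lens, adjust = 1.5)

# x holds the evaluation grid, and y holds the estimated density on that grid.
print(length(density_est$x))
print(length(density_est$y))

# Sanity check: the estimated density should integrate to roughly 1.
area <- sum(density_est$y) * diff(density_est$x[1:2])
print(area)

hist(word_lens, col = "pink", freq = FALSE, breaks = 0:max(word_lens),
     xlab = "Word length", main = "Synthetic word lengths")
# Equivalent to lines(density_est, lwd = 3)
lines(density_est$x, density_est$y, lwd = 3)
```

Setting freq = FALSE in hist() matters here: it puts the histogram on the density scale so that the curve and the bars are comparable.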

Plotting a Heatmap

To plot a heatmap of a numeric matrix, we use the image() function.

# Here, %o% computes the outer product
(mat = 1:5 %o% 6:10)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 6 7 8 9 10
## [2,] 12 14 16 18 20
## [3,] 18 21 24 27 30
## [4,] 24 28 32 36 40
## [5,] 30 35 40 45 50
image(mat) # Red means high, white means low

Orientation of image()

The orientation of image() is to plot the heatmap according to the following order, in terms of the matrix elements:

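$$
\begin{array}{cccc}
(1,\mathrm{ncol}) & (2,\mathrm{ncol}) & \cdots & (\mathrm{nrow},\mathrm{ncol}) \\
\vdots & \vdots & & \vdots \\
(1,2) & (2,2) & \cdots & (\mathrm{nrow},2) \\
(1,1) & (2,1) & \cdots & (\mathrm{nrow},1)
\end{array}
$$

This is a 90-degree counterclockwise rotation of the "usual" printed order for a matrix:

$$
\begin{array}{cccc}
(1,1) & (1,2) & \cdots & (1,\mathrm{ncol}) \\
(2,1) & (2,2) & \cdots & (2,\mathrm{ncol}) \\
\vdots & \vdots & & \vdots \\
(\mathrm{nrow},1) & (\mathrm{nrow},2) & \cdots & (\mathrm{nrow},\mathrm{ncol})
\end{array}
$$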

Orientation of image()

Therefore, if we want the displayed heatmap to follow the usual order, we must rotate the matrix 90 degrees clockwise before passing it to image(). (Equivalently, reverse the row order and take the transpose.)

clockwise90 = function(a) {
  t(a[nrow(a):1,])
} # Handy rotate function
image(clockwise90(mat))
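
A quick check on a small matrix makes the rotation easier to trust. The sketch below redefines the helper with drop = FALSE (a small addition, so that single-row matrices are not silently converted to vectors) and applies it to a 2 x 3 example.

```r
# Same idea as above; drop = FALSE keeps a one-row input as a matrix.
clockwise90 <- function(a) {
  t(a[nrow(a):1, , drop = FALSE])
}

a <- matrix(1:6, nrow = 2)
print(a)

rotated <- clockwise90(a)
print(rotated)
print(dim(rotated))
```

The first column of the input, read from bottom to top, becomes the first row of the output, which is exactly what a 90-degree clockwise rotation does.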

Color Scale

The default is to use a red-to-white color scale in image(), but the col argument can take any vector of colors. Built-in functions gray.colors(), rainbow(), heat.colors(), topo.colors(), terrain.colors(), cm.colors() all return contiguous color vectors of given lengths.

phi = dnorm(seq(-2,2,length=50))
normal.mat = phi %o% phi
image(normal.mat, col=terrain.colors(20)) # Terrain colors

Drawing Contour Lines

To draw contour lines from a numeric matrix, we use the contour() function; to add contours to an existing plot (say, a heatmap), we use contour() with add=TRUE.

image(normal.mat, col=terrain.colors(20))
contour(normal.mat, add=TRUE)

Part 4: Data Visualization via ggplot2

What is ggplot2?

ggplot2 is an R package for "declaratively" creating graphics.

  • We provide the data and tell ggplot2 how to map variables to aesthetics and what graphical primitives to use. Then, it takes care of the details.

  • Plots in ggplot2 are built sequentially using layers.

  • When using ggplot2, it is essential that our data are tidy!

Let's work through how to build a plot layer by layer.

Step-by-step Practice with ggplot2

First, let's initialize a plot. We use the data parameter to tell ggplot what data frame to use.

  • It should be tidy data, in either a data.frame or tibble!
library(gapminder)
ggplot(data = gapminder)

63 / 114

Step-by-step Practice with ggplot2

Add an aesthetic using aes() within the initial ggplot() call.

  • It controls our axis variables as well as graphical parameters such as color, size, and shape.

ggplot(data = gapminder,
mapping = aes(x = year, y = lifeExp))

64 / 114

Step-by-step Practice with ggplot2

Now ggplot knows what to plot, but it doesn't know how to plot it yet. Let's add some points with geom_point().

  • This is a new layer! We always add layers using the + operator.
ggplot(data = gapminder,
mapping = aes(x = year, y = lifeExp)) +
geom_point()

65 / 114

Step-by-step Practice with ggplot2

Let's make our points smaller and red.

ggplot(data = gapminder,
mapping = aes(x = year, y = lifeExp)) +
geom_point(color = "red", size = 0.75)

66 / 114

Step-by-step Practice with ggplot2

Let's try switching them to lines.

ggplot(data = gapminder,
mapping = aes(x = year, y = lifeExp)) +
geom_line(color = "red", linewidth = 0.75)

67 / 114

Step-by-step Practice with ggplot2

We want a separate line for each country, rather than a single line running through all observations in order of year.

ggplot(data = gapminder,
mapping = aes(x = year, y = lifeExp,
group = country)) +
geom_line(color = "red", linewidth = 0.5)
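
The effect of group is easiest to see on a tiny example. The sketch below uses a hypothetical toy data frame (toy, unit, time, and value are illustrative names) and layer_data() to count how many separate lines ggplot2 draws with and without the group aesthetic.

```r
library(ggplot2)

# Hypothetical toy data: two units observed at the same three time points
toy <- data.frame(
  unit  = rep(c("a", "b"), each = 3),
  time  = rep(1:3, times = 2),
  value = c(1, 2, 3, 10, 11, 12)
)

# No grouping: geom_line() joins all six points in order of time,
# so the line jumps back and forth between the two units
p_zigzag <- ggplot(toy, aes(x = time, y = value)) + geom_line()

# group = unit: one separate line per unit
p_grouped <- ggplot(toy, aes(x = time, y = value, group = unit)) + geom_line()

# Number of distinct groups ggplot2 draws in each case
length(unique(layer_data(p_zigzag)$group))
length(unique(layer_data(p_grouped)$group))
```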

68 / 114

Step-by-step Practice with ggplot2

We can color by continent to explore differences across continents.

  • We use aes() because we want to color by something in our data.

  • Putting a color within aes() will automatically add a legend.

  • We have to remove the color within geom_line(), or it will override the aes().

ggplot(data = gapminder,
aes(x = year, y = lifeExp, group = country, color = continent)) +
geom_line(linewidth = 0.5)
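
The distinction between mapping a color (inside aes()) and setting a color (outside aes()) can be checked on a small hypothetical data frame. In this sketch, toy and its columns are made-up names, and layer_data() is used to inspect the colors that each layer ends up with.

```r
library(ggplot2)

toy <- data.frame(
  x = c(1, 2, 3, 1, 2, 3),
  y = c(1, 2, 3, 3, 2, 1),
  g = rep(c("a", "b"), each = 3)
)

# Mapping: color depends on a column, so it goes inside aes() and gets a legend
p_mapped <- ggplot(toy, aes(x = x, y = y, color = g)) +
  geom_line()

# Setting: a constant color goes outside aes(); it applies to every line
# and takes precedence over the color mapping for that layer
p_fixed <- ggplot(toy, aes(x = x, y = y, group = g, color = g)) +
  geom_line(color = "red")

unique(layer_data(p_mapped)$colour)  # one color per level of g
unique(layer_data(p_fixed)$colour)   # the single fixed color
```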

69 / 114

Step-by-step Practice with ggplot2

Let's add another layer for the trend lines by continent!

  • We use a new aes() to group them differently than our lines (by continent).

  • We will make them stand out by drawing them thicker and in black.

  • We don't want confidence bands around the trend lines, so we set se = FALSE.

ggplot(data = gapminder,
aes(x = year, y = lifeExp, group = country, color = continent)) +
geom_line(linewidth = 0.5) +
geom_smooth(aes(group = continent), se = FALSE, linewidth = 1.5, color = "black", method = "loess")

70 / 114

Step-by-step Practice with ggplot2

The plot is cluttered and hard to read. Let's try separating by continents using facets!

  • We use facet_wrap, which takes in a formula object and uses a tilde ~ with the variable name.
ggplot(data = gapminder,
aes(x = year, y = lifeExp, group = country, color = continent)) +
geom_line(linewidth = 0.5) +
geom_smooth(aes(group = continent), se = FALSE, linewidth = 1.5, color = "black", method = "loess") +
facet_wrap(~ continent)
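
facet_wrap() has a few options worth knowing beyond the formula. The following sketch, on a hypothetical toy data frame, shows the vars() interface as an alternative to the formula, together with the ncol and scales arguments; the particular values are arbitrary.

```r
library(ggplot2)

toy <- data.frame(
  panel = rep(c("a", "b", "c"), each = 4),
  x = rep(1:4, times = 3),
  y = c(1, 2, 3, 4, 10, 20, 30, 40, 5, 3, 4, 2)
)

base <- ggplot(toy, aes(x = x, y = y)) + geom_line()

# Formula interface, as in the slide
p_formula <- base + facet_wrap(~ panel)

# vars() is an equivalent way to name the faceting variable;
# ncol fixes the layout, and scales = "free_y" gives each panel its own y-axis
p_vars <- base + facet_wrap(vars(panel), ncol = 1, scales = "free_y")
```

Free scales make each panel easier to read on its own but harder to compare across panels, so the shared default is usually the safer choice.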

71 / 114

Step-by-step Practice with ggplot2

Now, we formalize the labels on our plot using labs().

  • We can also edit labels one at a time using xlab(), ylab(), ggtitle(), etc.

  • We should do this for every graph that we present! The variable names in our data frame rarely match the text we want in our output, and clear labels improve human readability.

ggplot(data = gapminder,
aes(x = year, y = lifeExp, group = country, color = continent)) +
geom_line(linewidth = 0.5) +
geom_smooth(aes(group = continent), se = FALSE, linewidth = 1.5,
color = "black", method = "loess") +
facet_wrap(~ continent) +
labs(title = "Life expectancy over time by continent",
x = "Year", y = "Life Expectancy", color = "Continent")
72 / 114

Step-by-step Practice with ggplot2

Let's center our title by adjusting theme().

  • element_text() tells ggplot() how to display the text.

  • hjust is the horizontal justification; we set it to 0.5 to center the title.

ggplot(data = gapminder,
aes(x = year, y = lifeExp, group = country, color = continent)) +
geom_line(linewidth = 0.5) +
geom_smooth(aes(group = continent), se = FALSE, linewidth = 1.5, color = "black", method = "loess") +
facet_wrap(~ continent) +
labs(title = "Life expectancy over time by continent",
x = "Year", y = "Life Expectancy", color = "Continent") +
theme(plot.title = element_text(hjust = 0.5, face = "bold",
size = 14))
74 / 114

Step-by-step Practice with ggplot2

Since each panel is already labeled by continent, the legend is redundant. Let's remove it.

ggplot(data = gapminder,
aes(x = year, y = lifeExp, group = country, color = continent)) +
geom_line(linewidth = 0.5) +
geom_smooth(aes(group = continent), se = FALSE, linewidth = 1.5, color = "black", method = "loess") +
facet_wrap(~ continent) +
labs(title = "Life expectancy over time by continent",
x = "Year", y = "Life Expectancy", color = "Continent") +
theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 14), legend.position = "none")

75 / 114

Step-by-step Practice with ggplot2

If we don't like the default gray background, we can always remove it with theme_bw().

  • There are several other theme options! (Use ?theme_bw to look them up.)

  • theme_bw() is a complete theme, so we add it before our theme() adjustments; if it were added afterwards, it would override them.

ggplot(data = gapminder,
aes(x = year, y = lifeExp, group = country, color = continent)) +
geom_line(linewidth = 0.5) +
geom_smooth(aes(group = continent), se = FALSE, linewidth = 1.5, color = "black", method = "loess") +
facet_wrap(~ continent) +
labs(title = "Life expectancy over time by continent",
x = "Year", y = "Life Expectancy") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 14), legend.position = "none")
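
The order in which theme_bw() and theme() are added matters, because a complete theme replaces whatever theme settings came before it. Here is a minimal sketch on a hypothetical toy plot; it assumes the theme stored in the plot object can be inspected with $theme, which may vary across ggplot2 versions.

```r
library(ggplot2)

toy <- data.frame(x = 1:3, y = c(2, 1, 3), g = c("a", "b", "c"))
base <- ggplot(toy, aes(x = x, y = y, color = g)) + geom_point()

# theme_bw() added last: it is a complete theme, so it replaces the earlier tweak
p_lost <- base + theme(legend.position = "none") + theme_bw()

# theme_bw() added first: the later theme() call modifies it
p_kept <- base + theme_bw() + theme(legend.position = "none")

p_lost$theme$legend.position  # no longer "none"
p_kept$theme$legend.position  # the setting survives
```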
76 / 114

Step-by-step Practice with ggplot2

We can increase all of our text proportionally using base_size within theme_bw() to increase readability.

  • We could also do this by adjusting text within theme().

  • We don't need to manually adjust our title size. This will scale everything automatically.

ggplot(data = gapminder,
aes(x = year, y = lifeExp, group = country, color = continent)) +
geom_line(linewidth = 0.5) +
geom_smooth(aes(group = continent), se = FALSE, linewidth = 1.5, color = "black", method = "loess") +
facet_wrap(~ continent) +
labs(title = "Life expectancy over time by continent",
x = "Year", y = "Life Expectancy") +
theme_bw(base_size = 16) +
theme(plot.title = element_text(hjust = 0.5, face = "bold"),
legend.position = "none")
78 / 114

Step-by-step Practice with ggplot2

Now our text is a good size, but the x-axis labels overlap. Let's rotate them.

ggplot(data = gapminder,
aes(x = year, y = lifeExp, group = country, color = continent)) +
geom_line(linewidth = 0.5) +
geom_smooth(aes(group = continent), se = FALSE, linewidth = 1.5, color = "black", method = "loess") +
facet_wrap(~ continent) +
labs(title = "Life expectancy over time by continent",
x = "Year", y = "Life Expectancy") +
theme_bw(base_size = 16) +
theme(plot.title = element_text(hjust = 0.5, face = "bold"),
legend.position = "none",
axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))
80 / 114

Step-by-step Practice with ggplot2

Lastly, let's space out our panels by adjusting panel.spacing.x.

ggplot(data = gapminder,
aes(x = year, y = lifeExp, group = country, color = continent)) +
geom_line(linewidth = 0.5) +
geom_smooth(aes(group = continent), se = FALSE, linewidth = 1.5, color = "black", method = "loess") +
facet_wrap(~ continent) +
labs(title = "Life expectancy over time by continent",
x = "Year", y = "Life Expectancy") +
theme_bw(base_size = 16) +
theme(plot.title = element_text(hjust = 0.5, face = "bold"),
legend.position = "none",
axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1),
panel.spacing.x = unit(0.75, "cm"))
82 / 114

Step-by-step Practice with ggplot2

When the entire plot is ready, we can also store it as an object.

lifeExp_plot <- ggplot(data = gapminder,
aes(x = year, y = lifeExp, group = country, color = continent)) +
geom_line(linewidth = 0.5) +
geom_smooth(aes(group = continent), se = FALSE, linewidth = 1.5, color = "black", method = "loess") +
facet_wrap(~ continent) +
labs(title = "Life expectancy over time by continent",
x = "Year", y = "Life Expectancy") +
theme_bw(base_size = 16) +
theme(plot.title = element_text(hjust = 0.5, face = "bold"),
legend.position = "none",
axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1),
panel.spacing.x = unit(0.75, "cm"))
84 / 114

Step-by-step Practice with ggplot2

Then, we can plot it by just calling our object.

lifeExp_plot

85 / 114

Step-by-step Practice with ggplot2

We can also save it in our figures subfolder using ggsave().

  • Set the height and width parameters (in inches by default) to control the size of the saved image.

ggsave(filename = "figures/lifeExp_plot.pdf", plot = lifeExp_plot,
height = 5, width = 7)

Note: Never save figures from our analysis using screenshots or point-and-click! It will lead to lower quality and non-reproducible figures!
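
One practical detail: the target folder has to exist before ggsave() writes to it. The sketch below creates the folder first and then saves a hypothetical stand-in plot (toy_plot is not the plot from the slides); a temporary directory is used here so that the example does not touch the working directory.

```r
library(ggplot2)

# Hypothetical stand-in for lifeExp_plot
toy_plot <- ggplot(data.frame(x = 1:3, y = c(2, 1, 3)), aes(x = x, y = y)) +
  geom_point()

# Use a temporary directory here; in a project this would be "figures"
fig_dir <- file.path(tempdir(), "figures")
dir.create(fig_dir, showWarnings = FALSE, recursive = TRUE)

out_file <- file.path(fig_dir, "toy_plot.pdf")
ggsave(filename = out_file, plot = toy_plot,
       height = 5, width = 7, units = "in")

file.exists(out_file)  # confirm that the file was written
```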

86 / 114

Some Comments on ggplot:

  • What we just made was a very complicated and fine-tuned plot!

  • It is very common to have to Google how to adjust certain details.

  • Even the creator of ggplot2 does this.

87 / 114

A Simpler Example: Histogram

ggplot(data = gapminder,
aes(x = lifeExp))

88 / 114

A Simpler Example: Histogram

ggplot(data = gapminder,
aes(x = lifeExp)) +
geom_histogram()

89 / 114

A Simpler Example: Histogram

ggplot(data = gapminder,
aes(x = lifeExp)) +
geom_histogram(binwidth = 1)
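
The bin choice can be controlled in two ways: by the width of each bin (binwidth) or by the number of bins (bins). A minimal sketch on simulated data follows; the sample toy and the specific values 5 and 10 are arbitrary assumptions.

```r
library(ggplot2)

set.seed(1)
toy <- data.frame(value = rnorm(200, mean = 60, sd = 10))  # hypothetical data

# Default: 30 bins, with a message suggesting that we pick a better value
p_default <- ggplot(toy, aes(x = value)) + geom_histogram()

# Fix the width of each bin, in the units of the x variable
p_width <- ggplot(toy, aes(x = value)) + geom_histogram(binwidth = 5)

# Or fix the number of bins instead
p_bins <- ggplot(toy, aes(x = value)) + geom_histogram(bins = 10)

# Each row of the layer data is one bin; the counts add up to the sample size
sum(layer_data(p_width)$count)
```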

90 / 114

A Simpler Example: Histogram

ggplot(data = gapminder, aes(x = lifeExp)) +
geom_histogram(binwidth = 1,
color = "black",
fill = "lightblue")

91 / 114

A Simpler Example: Histogram

ggplot(data = gapminder, aes(x = lifeExp)) +
geom_histogram(binwidth = 1,
color = "black",
fill = "lightblue") +
theme_bw(base_size = 20)

92 / 114

A Simpler Example: Histogram

ggplot(data = gapminder, aes(x = lifeExp)) +
geom_histogram(binwidth = 1,
color = "black",
fill = "lightblue") +
theme_bw(base_size = 20) +
labs(x = "Life Expectancy",
y = "Count")

93 / 114

A Simpler Example: Boxplots

ggplot(data = gapminder,
aes(x = continent, y = lifeExp))

94 / 114

A Simpler Example: Boxplots

ggplot(data = gapminder,
aes(x = continent, y = lifeExp)) +
geom_boxplot()

95 / 114

A Simpler Example: Boxplots

ggplot(data = gapminder,
aes(x = continent, y = lifeExp)) +
geom_boxplot(fill = "lightblue")

96 / 114

A Simpler Example: Boxplots

ggplot(data = gapminder,
aes(x = continent, y = lifeExp)) +
geom_boxplot(fill = "lightblue") +
theme_bw(base_size = 20)

97 / 114

A Simpler Example: Boxplots

ggplot(data = gapminder,
aes(x = continent, y = lifeExp)) +
geom_boxplot(fill = "lightblue") +
theme_bw(base_size = 20) +
labs(title = "Life expectancy by Continent",
x = "",
y = "")

98 / 114

A Simpler Example: Boxplots

ggplot(data = gapminder, aes(x = continent, y = lifeExp)) +
geom_boxplot(fill = "lightblue") +
theme_bw(base_size = 20) +
labs(title = "Life expectancy by Continent",
x = "",
y = "") +
theme(plot.title =
element_text(hjust = 0.5))

99 / 114

A Simpler Example: Boxplots

ggplot(data = gapminder, aes(x = continent, y = lifeExp)) +
geom_boxplot(fill = "lightblue") +
theme_bw(base_size = 20) +
labs(title = "Life expectancy by Continent",
x = "",
y = "") +
theme(plot.title =
element_text(hjust = 0.5)) +
ylim(c(0, 85))
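
A caveat about ylim(): it sets the scale limits, so any observations outside the range are dropped before the boxplot statistics are computed. If all data lie within the limits, as they appear to here, this makes no difference; otherwise coord_cartesian() zooms without discarding data. The sketch below illustrates the difference on hypothetical toy data with one large value.

```r
library(ggplot2)

toy <- data.frame(
  g = rep(c("a", "b"), each = 5),
  y = c(1, 2, 3, 4, 50, 2, 3, 4, 5, 6)
)
base <- ggplot(toy, aes(x = g, y = y)) + geom_boxplot()

# ylim() sets scale limits: values outside are dropped (with a warning)
# before the boxplot statistics are computed
p_scale <- base + ylim(c(0, 10))

# coord_cartesian() only zooms the view; statistics use all the data
p_zoom <- base + coord_cartesian(ylim = c(0, 10))

suppressWarnings(layer_data(p_scale)$middle)  # medians after dropping y = 50
layer_data(p_zoom)$middle                     # medians of the full data
```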

100 / 114

ggplot2 Summary

  • Axes: xlim(), ylim().

  • Legends: created by the mappings in aes(); edit them with theme() or guides().

  • geom_point(), geom_line(), geom_histogram(), geom_bar(), geom_boxplot(), geom_text(), etc.

  • facet_grid(), facet_wrap() for faceting.

  • labs() for labels.

  • theme_bw() to make things look nicer.

  • Graphical parameters: color for color, alpha for opacity, linewidth for line thickness, size for point size, shape for point shape, fill for interior color, etc.

Here is a ggplot2 cheat sheet!

101 / 114

Some Guidelines For Data Visualization

Don'ts

  • Deceptive axes.

  • Excessive/bad coloring.

  • Bad variable/axis names.

  • Unreadable labels.

  • Overloaded with information.

  • Pie charts (usually).

Do's

  • Simple, clean graphics.

  • Neat and human readable text.

  • Appropriate data range (bar charts should always start from 0!).

  • Consistent intervals.

  • Roughly 6 colors or fewer.

  • Size figures appropriately.

102 / 114

Which Plot Should We Use?

Consider the following questions when we choose our plot:

  • What if we have one variable? Two variables?

  • What if we have numeric data?

  • How can we deal with those categorical or nominal variables?

Let's see some examples!

103 / 114

One Numeric Variable: Histogram

geom_histogram()

104 / 114

One Numeric Variable: Boxplot

geom_boxplot()

Note: We can also use the more sophisticated letter-value plot implemented in the package lvplot.

105 / 114

One Categorical Variable: Bar Chart

geom_bar()

106 / 114

One Numeric and One Categorical Variable

geom_boxplot()

Here, we have multiple observations for each category.

107 / 114

One Numeric and One Categorical Variable

geom_bar() (with argument stat = "identity")

Here, we have only one observation per category.
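
As a small illustration of the one-observation-per-category case, the sketch below uses a hypothetical summary table. It also shows geom_col(), which ggplot2 provides as a shorthand for geom_bar(stat = "identity").

```r
library(ggplot2)

# Hypothetical summary table: exactly one value per category
toy <- data.frame(category = c("a", "b", "c"), total = c(3, 7, 5))

# geom_bar() counts rows by default, so precomputed heights need stat = "identity"
p_bar <- ggplot(toy, aes(x = category, y = total)) +
  geom_bar(stat = "identity")

# geom_col() is shorthand for the same thing
p_col <- ggplot(toy, aes(x = category, y = total)) +
  geom_col()
```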

108 / 114

Two Numeric Variables: Scatterplot

geom_point()

109 / 114

Two Numeric Variables, One Time-based

geom_line()

Note: When making a line plot, we should use both geom_point() and geom_line()!
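
The reason for combining the two geoms is that the points mark where we actually have observations, while the line only interpolates between them. A minimal sketch with hypothetical, unevenly spaced data:

```r
library(ggplot2)

# Hypothetical series with a gap between 2002 and 2005
toy <- data.frame(year = c(2000, 2001, 2002, 2005), value = c(3, 4, 2, 6))

p <- ggplot(toy, aes(x = year, y = value)) +
  geom_line() +   # connects consecutive observations
  geom_point()    # marks the observed years, which makes the gap visible
```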

110 / 114

Two Categorical Variables

geom_bar() setting x and fill within aes()

This is an example of a bad visualization!!

111 / 114

Two Categorical Variables

geom_bar() setting x and fill within aes()

This one looks better by specifying position = position_dodge() in geom_bar().

Note: Never stack the bars unless it is necessary.
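
The stacked and dodged versions differ only in the position argument. The sketch below uses a hypothetical data frame with two categorical variables (size and color are made-up names) to build both.

```r
library(ggplot2)

toy <- data.frame(
  size  = c("small", "small", "small", "large", "large", "large", "large"),
  color = c("red", "blue", "blue", "red", "red", "red", "blue")
)

# Default position is "stack": bars for each fill level sit on top of each other
p_stacked <- ggplot(toy, aes(x = size, fill = color)) +
  geom_bar()

# position_dodge() places the bars side by side, so their heights are comparable
p_dodged <- ggplot(toy, aes(x = size, fill = color)) +
  geom_bar(position = position_dodge())
```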

112 / 114

Three Variables

  • What if we have two numeric variables and one categorical?

    • Scatterplot or line plot colored by category.

    • Scatterplot or line plot faceted by category.

Note: Recall our example in the step-by-step practice with ggplot2.

  • More details about the choices of plots and other data visualization concepts can be found in these notes. Please spend some time reading them!
113 / 114

Summary

  • R packages provide us with numerous handy functions that have been written by other R developers.

  • The tidyverse is a collection of packages for common data science tasks.

  • Pipes %>% allow us to string together commands to get a flow of results.

  • dplyr is a package for data wrangling with several key verbs (or functions).

  • tidyr is a package for tidying and reshaping data frames in R.

  • Base R has a set of powerful plotting tools that help us quickly visualize our data.

  • ggplot2 is a package for creating more sophisticated plots.

Submit Lab 4 on Gradescope by the end of Tuesday (February 13)!! Start early!!

114 / 114
