Using Packages in R
Data Manipulation via tidyverse
Basic Graphics in R
Data Visualization via ggplot2
R packages contain code, data, and documentation in a standardized collection format that can be installed and utilized by users of R.
There are 19,961+ official R packages available on the Comprehensive R Archive Network (CRAN). In addition, some unofficial R packages are hosted on GitHub.
These packages implement miscellaneous statistical methods using functions in R, which makes our programming and data analysis easier.
If a package is officially available on CRAN, like most packages we will use for this course, we can install it using
install.packages("PACKAGE_NAME_IN_QUOTES")
Or, we can use the "Packages" tab in the lower right panel and click the "Install" button to install an official package in RStudio.
After a package is installed, it is saved on our computer until we update R, and we don't need to re-install it.
There is no need to include a call to install.packages() in any .R or .Rmd file!
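If a script needs to confirm that its packages are available without installing anything, one option is to check with requireNamespace(). The following is a minimal sketch; the helper name missing_packages() is hypothetical and not part of any package.

```r
# Hypothetical helper: report which of the requested packages are not installed.
missing_packages = function(pkgs) {
  installed = vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# "stats" and "utils" ship with R, so nothing should be reported as missing.
missing_packages(c("stats", "utils"))
```

The check only reports missing packages; installing them remains a one-time manual step outside the script.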
Occasionally, we may want to install an R package from a .tar.gz file downloaded from CRAN or elsewhere:
install.packages("pkgname.tar.gz", repos = NULL, type = "source")
After a package is installed, we can load it into our current R session using library() or require() if it is inside our customized function:
library(PACKAGE_NAME) # or library("PACKAGE_NAME")
Unlike install.packages(), it is not necessary to include the package name in quotes.
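The practical difference between library() and require() shows up when a package is missing. The sketch below uses a made-up package name that is assumed not to be installed: require() returns FALSE, while library() raises an error.

```r
# Hypothetical package name that is assumed not to be installed.
pkg = "aPackageThatDoesNotExist123"

# require() returns FALSE (with a warning) instead of stopping.
loaded = suppressWarnings(require(pkg, character.only = TRUE, quietly = TRUE))

# library() signals an error, which we catch here for illustration.
lib_result = tryCatch(
  {
    library(pkg, character.only = TRUE)
    "loaded"
  },
  error = function(e) "error"
)

loaded      # FALSE
lib_result  # "error"
```

Because require() returns a logical value, a function can test it and decide what to do when the package is unavailable.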
Loading a package must be done with each new R session, so we should put calls to library() in our .R and .Rmd files whenever we use some R packages in our code.
In .Rmd files, we can load all the required packages in the opening chunk and set the parameter include = FALSE in that chunk to hide the messages and code.
{r, include = FALSE}
There is an install_github() function in the devtools package to install R packages hosted on GitHub, though it requires the developer's name.
library(devtools)
install_github("DeveloperName/PackageName")
Here is an example where we don't have to load the devtools package:
devtools::install_github("zhangyk8/Debias-Infer", subdir = "R_Package")
The githubinstall package provides a function githubinstall(), which does not need the developer's name.
library(githubinstall)
githubinstall("PackageName")
tidyverse
The tidyverse is a coherent collection of packages in R for data science (and tidyverse itself is also a package that loads all its constituent packages). Packages include:
Data reading and saving: readr.
Data manipulation: dplyr, tidyr.
Iteration: purrr.
Visualization: ggplot2.
We can install all of them using
install.packages("tidyverse")
Note: We only need to do this once!
These packages have a very consistent API as well as an active developer and user community.
Function names and commands follow a focused grammar.
The functions are powerful and fast when working with data frames and lists (matrices, not so much, yet!).
Pipes (the %>% operator) allow us to fluidly glue functionality together.
tidyverse code can be read like a story using the pipe operator!
Loading tidyverse into R
We can load all the tidyverse packages into our current R session using the library() function.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()     masks stats::filter()
## ✖ dplyr::group_rows() masks kableExtra::group_rows()
## ✖ dplyr::lag()        masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Recall that R packages encapsulate functions written by different R developers.
Occasionally, functions in different packages may share the same name, which introduces a conflict.
Whichever package we load most recently using library() will mask the older function, meaning that R will default to the most recently loaded version.
In general, this is fine, especially with tidyverse. The conflict message is there to make sure that we are aware of conflicts.
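When we do need a masked function, we can still reach it with the package::function syntax. As a small sketch, the call below requests the filter() from the stats package explicitly, so it behaves the same whether or not dplyr has been loaded. The moving-average weights are chosen purely for illustration.

```r
# stats::filter() is base R's linear filter; the prefix selects it explicitly,
# even if another package (e.g. dplyr) has masked the name filter().
moving_avg = stats::filter(1:5, rep(1 / 3, 3))

# Three-point centered moving average: the two end points have no full window.
as.numeric(moving_avg)
```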
The packages dplyr and tidyr are going to be our main workhorses for data manipulation.
The main data type used by these packages is the data frame (or tibble, but we won't go there).
Why do we need to learn data manipulation through tidyverse?
Learning pipes (%>%) will facilitate our learning of the dplyr and tidyr verbs (or functions).
The functions in dplyr are analogous to their SQL counterparts, so learning dplyr will give us some SQL syntax for free!
Piping at its most basic level:
Use the %>% operator to take the output from a previous function call and "pipe" it through to the next function, in order to form a flow of results.
This can really help with the readability of code when we use multiple nested functions!
Keyboard shortcut for %>%: use Ctrl + Shift + M in RStudio.
Note: In Linux and other related systems, we also have pipes, as in:
ls -l | grep tidy | wc -l
Passing a single argument through pipes, we interpret the following code as h(g(f(x))).
x %>% f %>% g %>% h
Note: In our mind, when we see the %>% operator, we should read this as "and then".
We can write exp(1) with pipes as 1 %>% exp, and log(exp(1)) as 1 %>% exp %>% log.
1 %>% exp
## [1] 2.718282
1 %>% exp %>% log
## [1] 1
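To see the readability benefit on a slightly longer chain, the sketch below computes the same quantity in nested form and in piped form. The numbers are arbitrary, and magrittr is loaded directly here (it is the package that provides %>%, which tidyverse re-exports).

```r
library(magrittr)

v = c(1, 4, 9)

# Nested form: read from the inside out.
nested = round(sqrt(sum(v)), 2)

# Piped form: read left to right as "sum, and then sqrt, and then round".
piped = v %>% sum %>% sqrt %>% round(2)

identical(nested, piped)
```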
For multi-argument functions, we interpret the following code as f(x, y).
x %>% f(y)
We can extract the first row of the mtcars data frame using the following pipe syntax.
# Syntax in base R
head(mtcars, 1)
##           mpg cyl disp  hp drat   wt  qsec vs am gear carb
## Mazda RX4  21   6  160 110  3.9 2.62 16.46  0  1    4    4
# Pipes syntax
mtcars %>% head(1)
##           mpg cyl disp  hp drat   wt  qsec vs am gear carb
## Mazda RX4  21   6  160 110  3.9 2.62 16.46  0  1    4    4
The command x %>% f(y) can be equivalently written in dot notation as:
x %>% f(., y)
What is the advantage of using dots? The dot lets us pipe x into an argument position other than the first one. For example,
x %>% f(y, .)
is equivalent to f(y, x).
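As a concrete instance of the dot notation, the sketch below pipes a number into the second argument of round(). The choice of round() and pi is only for illustration.

```r
library(magrittr)

# The dot places the piped value in the second argument (digits).
piped = 2 %>% round(pi, .)

# Same call written without the pipe.
direct = round(pi, 2)

identical(piped, direct)
```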
Let's interpret the following code without executing it first.
state_df = data.frame(state.x77)
state.region %>%
  tolower %>%
  tapply(state_df$Income, ., summary)
## $`north central`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4167    4466    4594    4611    4694    5107 
## 
## $northeast
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3694    4281    4558    4570    4903    5348 
## 
## $south
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3098    3622    3848    4012    4316    5299 
## 
## $west
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3601    4347    4660    4703    4963    6315
Let's interpret the following code without executing it first.
x = "Data Manipulation with Pipes"
x %>%
  strsplit(split = " ") %>%
  .[[1]] %>% # indexing
  nchar %>%
  max
## [1] 12
dplyr Functions
Some of the most important dplyr verbs (functions):
filter(): subset rows based on a condition.
group_by(): define groups of rows according to a column or specific condition.
summarize(): apply computations across groups of rows.
arrange(): order rows by value of a column.
select(): pick out given columns.
mutate(): create new columns.
mutate_at(): apply a function to given columns.
filter() Function
The filter() function subsets rows based on a condition.
# Built-in data frame of cars data, 32 cars x 11 variables
mtcars %>% head(2)
##               mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21   6  160 110  3.9 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21   6  160 110  3.9 2.875 17.02  0  1    4    4
mtcars %>% filter((mpg >= 20 & disp >= 200) | (drat <= 3))
##                      mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Hornet 4 Drive      21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Valiant             18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
## Cadillac Fleetwood  10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4
## Dodge Challenger    15.5   8  318 150 2.76 3.520 16.87  0  0    3    2
An alternative approach using the subset() function in base R:
subset(mtcars, (mpg >= 20 & disp >= 200) | (drat <= 3))
##                      mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Hornet 4 Drive      21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Valiant             18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
## Cadillac Fleetwood  10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4
## Dodge Challenger    15.5   8  318 150 2.76 3.520 16.87  0  0    3    2
Another alternative approach using base R indexing syntax:
mtcars[(mtcars$mpg >= 20 & mtcars$disp >= 200) | (mtcars$drat <= 3), ]
##                      mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Hornet 4 Drive      21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Valiant             18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
## Cadillac Fleetwood  10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4
## Dodge Challenger    15.5   8  318 150 2.76 3.520 16.87  0  0    3    2
group_by() Function
The group_by() function defines groups of rows according to a column or specific condition.
# Grouped by number of cylinders
mtcars %>%
  group_by(cyl) %>%
  head(2)
## # A tibble: 2 × 11
## # Groups:   cyl [1]
##     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1    21     6   160   110   3.9  2.62  16.5     0     1     4     4
## 2    21     6   160   110   3.9  2.88  17.0     0     1     4     4
Note: The group_by() function doesn't actually change anything about the way that the data frame looks. The only difference is that when it prints, we know the groups.
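One way to confirm that grouping is only metadata is to inspect the grouped object directly. The sketch below uses the dplyr helpers group_vars() and n_groups() on a fresh copy of mtcars.

```r
library(dplyr)

grouped = mtcars %>% group_by(cyl)

# The table keeps the same number of rows and columns after grouping.
identical(dim(grouped), dim(mtcars))

# The grouping itself is stored as metadata on the data frame.
group_vars(grouped)  # "cyl"
n_groups(grouped)    # 3
```

Later verbs such as summarize() read this metadata to decide which rows belong together.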
summarize() Function
The summarize() function applies computations across groups of rows.
# Ungrouped
summarize(mtcars, mpg_avg = mean(mpg), hp_avg = mean(hp))
##    mpg_avg   hp_avg
## 1 20.09062 146.6875
# Grouped by number of cylinders
summarize(group_by(mtcars, cyl), mpg_avg = mean(mpg), hp_avg = mean(hp))
## # A tibble: 3 × 3
##     cyl mpg_avg hp_avg
##   <dbl>   <dbl>  <dbl>
## 1     4    26.7   82.6
## 2     6    19.7  122. 
## 3     8    15.1  209. 
Can we rewrite the above code using pipes?
mtcars %>%
  group_by(cyl) %>%
  summarize(mpg_avg = mean(mpg), hp_avg = mean(hp))
## # A tibble: 3 × 3
##     cyl mpg_avg hp_avg
##   <dbl>   <dbl>  <dbl>
## 1     4    26.7   82.6
## 2     6    19.7  122. 
## 3     8    15.1  209. 
Note: Using the group_by() function makes the difference here.
arrange() Function
The arrange() function orders rows by the value of a column.
mtcars %>%
  arrange(mpg) %>%
  head(3)
##                      mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Cadillac Fleetwood  10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4
## Camaro Z28          13.3   8  350 245 3.73 3.840 15.41  0  0    3    4
# Base R syntax
mpg_inds = order(mtcars$mpg)
head(mtcars[mpg_inds, ], 3)
##                      mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Cadillac Fleetwood  10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4
## Camaro Z28          13.3   8  350 245 3.73 3.840 15.41  0  0    3    4
We can also sort in descending order.
mtcars %>%
  arrange(desc(mpg)) %>%
  head(3)
##                 mpg cyl disp hp drat    wt  qsec vs am gear carb
## Toyota Corolla 33.9   4 71.1 65 4.22 1.835 19.90  1  1    4    1
## Fiat 128       32.4   4 78.7 66 4.08 2.200 19.47  1  1    4    1
## Honda Civic    30.4   4 75.7 52 4.93 1.615 18.52  1  1    4    2
# Base R syntax
mpg_inds_decr = order(mtcars$mpg, decreasing = TRUE)
head(mtcars[mpg_inds_decr, ], 3)
##                 mpg cyl disp hp drat    wt  qsec vs am gear carb
## Toyota Corolla 33.9   4 71.1 65 4.22 1.835 19.90  1  1    4    1
## Fiat 128       32.4   4 78.7 66 4.08 2.200 19.47  1  1    4    1
## Honda Civic    30.4   4 75.7 52 4.93 1.615 18.52  1  1    4    2
We can order by multiple columns as well.
mtcars %>%
  arrange(desc(gear), desc(hp)) %>%
  head(7)
##                 mpg cyl  disp  hp drat    wt qsec vs am gear carb
## Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8
## Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
## Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
## Merc 280       19.2   6 167.6 123 3.92 3.440 18.3  1  0    4    4
## Merc 280C      17.8   6 167.6 123 3.92 3.440 18.9  1  0    4    4
select() Function
The select() function picks out given columns.
mtcars %>%
  select(cyl, disp, hp) %>%
  head(3)
##               cyl disp  hp
## Mazda RX4       6  160 110
## Mazda RX4 Wag   6  160 110
## Datsun 710      4  108  93
# Base R syntax
head(mtcars[, c("cyl", "disp", "hp")], 3)
##               cyl disp  hp
## Mazda RX4       6  160 110
## Mazda RX4 Wag   6  160 110
## Datsun 710      4  108  93
select() Helpers
mtcars %>%
  select(starts_with("d")) %>%
  head(3)
##               disp drat
## Mazda RX4      160 3.90
## Mazda RX4 Wag  160 3.90
## Datsun 710     108 3.85
# Base R syntax
d_colnames = grep(x = colnames(mtcars), pattern = "^d")
head(mtcars[, d_colnames], 3)
##               disp drat
## Mazda RX4      160 3.90
## Mazda RX4 Wag  160 3.90
## Datsun 710     108 3.85
Note: With base R syntax, we need to use a regular expression.
mtcars %>%
  select(ends_with('t')) %>%
  head(3)
##               drat    wt
## Mazda RX4     3.90 2.620
## Mazda RX4 Wag 3.90 2.875
## Datsun 710    3.85 2.320
mtcars %>%
  select(contains('ar')) %>%
  head(3)
##               gear carb
## Mazda RX4        4    4
## Mazda RX4 Wag    4    4
## Datsun 710       4    1
More details about these select() helper functions can be found in this web page.
mutate() Function
The mutate() function creates new columns.
mtcars = mtcars %>%
  mutate(hp_wt = hp/wt, mpg_wt = mpg/wt)
# Base R equivalent
mtcars$hp_wt = mtcars$hp/mtcars$wt
mtcars$mpg_wt = mtcars$mpg/mtcars$wt
The newly created variables can be used immediately.
mtcars = mtcars %>%
  mutate(hp_wt_again = hp/wt, hp_wt_cyl = hp_wt_again/cyl)
# Base R equivalent
mtcars$hp_wt_again = mtcars$hp/mtcars$wt
mtcars$hp_wt_cyl = mtcars$hp_wt_again/mtcars$cyl
mutate_at() Function
The mutate_at() function applies a function to one or several columns.
mtcars = mtcars %>%
  mutate_at(c("hp_wt", "mpg_wt"), log)
# Base R equivalent (run one version or the other, not both, or the log is applied twice)
mtcars$hp_wt = log(mtcars$hp_wt)
mtcars$mpg_wt = log(mtcars$mpg_wt)
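In dplyr 1.0.0 and later, mutate_at() is marked as superseded, and across() inside mutate() is the suggested way to express the same idea. The sketch below compares the two forms on a small made-up data frame; the column names and values are illustrative only.

```r
library(dplyr)

# Small illustrative data frame (values are made up for this sketch).
df = data.frame(a = c(1, 10, 100), b = c(2, 20, 200), id = 1:3)

# Scoped verb, as used above.
via_mutate_at = df %>% mutate_at(c("a", "b"), log10)

# Same transformation written with across().
via_across = df %>% mutate(across(c(a, b), log10))

all.equal(via_mutate_at, via_across)
```

Both forms leave the id column untouched because it is not among the selected columns.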
Note:
Calling dplyr functions always outputs a new data frame, and it does not alter the existing data frame.
To keep the changes, we have to reassign the data frame to be the output of the pipe! (See the example above).
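The note on reassignment can be checked on a tiny made-up data frame: the first call leaves df untouched, and only the reassigned version keeps the new column.

```r
library(dplyr)

df = data.frame(x = 1:3)

# Without reassignment, the result is returned but df itself is unchanged.
df %>% mutate(y = x * 2)
names_before = names(df)   # still only "x"

# Reassigning keeps the new column.
df = df %>% mutate(y = x * 2)
names_after = names(df)    # "x" "y"
```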
dplyr to SQL
Learning dplyr also facilitates our understanding of SQL syntax.
For example, select() is SELECT, filter() is WHERE, arrange() is ORDER BY, group_by() is GROUP BY, etc.
This will make it easier for tasks that require using both R and SQL to manage data and build statistical models.
Another major link to SQL is through merging or joining data frames, via the left_join() and inner_join() functions.
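As a minimal sketch of the two joins mentioned above, consider two small made-up tables that share a key column id. The table names and values are hypothetical.

```r
library(dplyr)

# Two small made-up tables sharing the key column "id".
students = data.frame(id = c(1, 2, 3), name = c("Ann", "Bob", "Cy"))
scores   = data.frame(id = c(1, 3, 4), score = c(90, 75, 60))

# left_join(): keep every row of students; unmatched scores become NA.
left_res = left_join(students, scores, by = "id")

# inner_join(): keep only the ids present in both tables.
inner_res = inner_join(students, scores, by = "id")

left_res
inner_res
```

These correspond to LEFT JOIN and INNER JOIN in SQL, with by = "id" playing the role of the ON clause.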
tidyr Functions
Recall the tidy data principle for data (or a data frame/table) that we discussed in Lecture 2:
Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.
Two of the most important tidyr verbs (functions) that help us achieve the tidy data principle are:
pivot_longer(): make "wide" data longer.
pivot_wider(): make "long" data wider.
There are many other verbs, such as spread(), gather(), nest(), unnest(), etc. More details can be found in this web page.
pivot_longer() Function
# devtools::install_github("rstudio/EDAWR")
library(EDAWR) # Load some nice data sets
EDAWR::cases
##   country  2011  2012  2013
## 1      FR  7000  6900  7000
## 2      DE  5800  6000  6200
## 3      US 15000 14000 13000
EDAWR::cases %>%
  pivot_longer(names_to = "year", values_to = "n", cols = 2:4)
## # A tibble: 9 × 3
##   country year      n
##   <chr>   <chr> <dbl>
## 1 FR      2011   7000
## 2 FR      2012   6900
## 3 FR      2013   7000
## 4 DE      2011   5800
## 5 DE      2012   6000
## 6 DE      2013   6200
## 7 US      2011  15000
## 8 US      2012  14000
## 9 US      2013  13000
Here, we transposed columns 2:4 into a "year" column and put the corresponding count values into a column called "n".
The pivot_longer() function did all the heavy lifting of the transposing work, and we just had to specify the output.
# A different approach that does the same thing
EDAWR::cases %>%
  pivot_longer(names_to = "year", values_to = "n", cols = -country)
## # A tibble: 9 × 3
##   country year      n
##   <chr>   <chr> <dbl>
## 1 FR      2011   7000
## 2 FR      2012   6900
## 3 FR      2013   7000
## 4 DE      2011   5800
## 5 DE      2012   6000
## 6 DE      2013   6200
## 7 US      2011  15000
## 8 US      2012  14000
## 9 US      2013  13000
pivot_wider() Function
Here, we transpose to a wide format by "size" and tabulate the corresponding "amount" for each "size".
pivot_wider() and pivot_longer() are inverses.
EDAWR::pollution
##       city  size amount
## 1 New York large     23
## 2 New York small     14
## 3   London large     22
## 4   London small     16
## 5  Beijing large    121
## 6  Beijing small     56
EDAWR::pollution %>%
  pivot_wider(names_from = "size", values_from = "amount")
## # A tibble: 3 × 3
##   city     large small
##   <chr>    <dbl> <dbl>
## 1 New York    23    14
## 2 London      22    16
## 3 Beijing    121    56
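The claim that pivot_wider() and pivot_longer() are inverses can be checked with a round trip. The sketch below rebuilds the small cases table by hand (so that EDAWR is not required) and pivots it longer and then wider again.

```r
library(tidyr)

# Hand-built copy of the small wide table shown earlier.
cases = data.frame(
  country = c("FR", "DE", "US"),
  `2011` = c(7000, 5800, 15000),
  `2012` = c(6900, 6000, 14000),
  `2013` = c(7000, 6200, 13000),
  check.names = FALSE
)

# Wide -> long -> wide again.
long = pivot_longer(cases, cols = -country, names_to = "year", values_to = "n")
wide = pivot_wider(long, names_from = year, values_from = n)

# Compare as plain data frames, since the pivots return tibbles.
all.equal(as.data.frame(wide), cases)
```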
Base R has a set of powerful plotting tools:
plot(): generic plotting function.
points(): add points to an existing plot.
lines(), abline(): add lines to an existing plot.
text(), legend(): add text to an existing plot.
rect(), polygon(): add shapes to an existing plot.
hist(), image(): histogram and heatmap.
heat.colors(), topo.colors(), etc: create a color vector.
density(): estimate density, which can be plotted.
contour(): draw contours, or add to existing plot.
curve(): draw a curve, or add to existing plot.
To make a scatter plot of one variable versus another, we use plot().
set.seed(123)
x = sort(runif(50, min=-2, max=2))
y = x^3 + rnorm(50)
plot(x, y)
The type argument controls the plot type. The default is "p" for points; set it to "l" for lines. If we want both points and lines, set it to "b".
plot(x, y, type="b")
More details can be found by ?plot.
The main argument controls the title; xlab and ylab are the x and y labels.
plot(x, y, main="A noisy cubic", xlab="My x variable", ylab="My y variable")
We use the pch argument to control point type.
plot(x, y, pch = 19) # Filled circles
We use the lty argument to control the line type, and lwd to control the line width.
plot(x, y, type="l", lty=2, lwd=3) # Dashed line, 3 times as thick
We use the col argument to control the color. It can be:
An integer between 1 and 8 for basic colors.
A string for any of the 657 available named colors.
The function colors() returns a string vector of the available colors.
plot(x, y, pch=19, col="red")
To set up a plotting grid of arbitrary dimension, we use the par() function with the argument mfrow.
par(mfrow=c(2,2)) # Grid elements are filled by row
plot(x, y, main="Red cubic", pch=20, col="red")
plot(x, y, main="Blue cubic", pch=20, col="blue")
plot(rev(x), y, main="Flipped green", pch=20, col="green")
plot(rev(x), y, main="Flipped purple", pch=20, col="purple")
Default margins in R are large (and ugly); to change them, we use the par() function with the argument mar.
par(mfrow = c(2,2), mar = c(4,4,2,0.5))
plot(x, y, main="Red cubic", pch=20, col="red")
plot(x, y, main="Blue cubic", pch=20, col="blue")
plot(rev(x), y, main="Flipped green", pch=20, col="green")
plot(rev(x), y, main="Flipped purple", pch=20, col="purple")
We use the pdf() function to save a pdf file of our plot in the current R working directory.
getwd() # This is where the pdf will be saved
## [1] "/media/yikun/Disk_D1/Graduate School/STAT 302/Lectures"
pdf(file="noisy_cubics.pdf", height=7, width=7) # Height, width are in inches
par(mfrow=c(2,2), mar=c(4,4,2,0.5))
plot(x, y, main="Red cubic", pch=20, col="red")
plot(x, y, main="Blue cubic", pch=20, col="blue")
plot(rev(x), y, main="Flipped green", pch=20, col="green")
plot(rev(x), y, main="Flipped purple", pch=20, col="purple")
graphics.off()
Also, we use the jpeg() and png() functions to save jpg and png files.
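The file devices all follow the same open, draw, close pattern. The sketch below writes a small plot to a temporary pdf file (the tempfile() location is chosen only so that the example does not touch the working directory) and closes the device with dev.off(), which closes just the current device, whereas graphics.off() closes all open devices.

```r
# Write to a temporary file so the sketch does not touch the working directory.
out_file = tempfile(fileext = ".pdf")

pdf(file = out_file, height = 5, width = 5)  # Open the file device
plot(1:10, (1:10)^2, main = "Saved plot")    # Drawn into the file, not the screen
invisible(dev.off())                         # Close only this device

file.exists(out_file)
```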
The main tools for adding to an existing plot are:
points(): add points to an existing plot.
lines(), abline(): add lines to an existing plot.
text(), legend(): add text to an existing plot.
rect(), polygon(): add shapes to an existing plot.
Note: We should pay attention to layers; they work just as if we were painting a picture by hand.
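A short sketch of layering, using simulated data similar to the noisy cubic from earlier: each call draws on top of whatever is already on the plot, so the order of the calls determines what ends up in front. The seed, colors, and legend labels are arbitrary choices.

```r
set.seed(1)
x = seq(-2, 2, length.out = 30)
y = x^3 + rnorm(30)

plot(x, y, pch = 19, col = "gray40")     # Base layer: the scatter plot
lines(x, x^3, col = "red", lwd = 2)      # Layer 2: the underlying cubic curve
abline(h = 0, lty = 2)                   # Layer 3: dashed horizontal reference line
legend("topleft", legend = c("data", "x^3"),
       col = c("gray40", "red"), pch = c(19, NA), lty = c(NA, 1))
```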
Recall that we can plot a histogram of a numeric vector using hist().
king_lines = readLines("https://github.com/zhangyk8/zhangyk8.github.io/raw/master/_teaching/file_stat302/Data/king.txt")
king_words = strsplit(paste(king_lines, collapse=" "), split="[[:space:]]|[[:punct:]]")[[1]]
king_words = tolower(king_words[king_words != ""])
king_wlens = nchar(king_words)
hist(king_wlens)
To add a histogram to an existing plot (say, another histogram), we use hist() with add=TRUE.
hist(king_wlens, col="pink", freq=FALSE, breaks=0:20, xlab="Word length", main="King word lengths")
hist(king_wlens + 5, col=rgb(0,0.5,0.5,0.5), freq=FALSE, breaks=0:20, add=TRUE)
To estimate a density from a numeric vector, we use the density() function; see this note and this tutorial for more details.
density_est = density(king_wlens, adjust=1.5) # 1.5 times the default bandwidth
class(density_est)
## [1] "density"
names(density_est)
## [1] "x"         "y"         "bw"        "n"         "call"      "data.name"
## [7] "has.na"
The density() function returns a list that has components x and y, so we can call lines() directly on the returned object.
hist(king_wlens, col="pink", freq=FALSE, breaks=0:20, xlab="Word length", main="King word lengths")
lines(density_est, lwd=3)
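Since the x and y components are just numeric vectors on an evaluation grid, we can also compute with them directly. As a small sanity check on simulated data (the sample below is made up and unrelated to the word lengths), the area under the estimated density should be close to 1.

```r
set.seed(42)
z = rnorm(200)          # Made-up sample for illustration
d = density(z)

# Trapezoidal rule over the evaluation grid stored in d$x and d$y.
area = sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
area  # Should be close to 1
```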
To plot a heatmap of a numeric matrix, we use the image() function.
# Here, %o% gives the outer product
(mat = 1:5 %o% 6:10)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    6    7    8    9   10
## [2,]   12   14   16   18   20
## [3,]   18   21   24   27   30
## [4,]   24   28   32   36   40
## [5,]   30   35   40   45   50
image(mat) # Red means high, white means low
Orientation of image()
The orientation of image() is to plot the heatmap according to the following order, in terms of the matrix elements:
$$
\begin{array}{cccc}
(1,\text{ncol}) & (2,\text{ncol}) & \ldots & (\text{nrow},\text{ncol}) \\
\vdots & & & \\
(1,2) & (2,2) & \ldots & (\text{nrow},2) \\
(1,1) & (2,1) & \ldots & (\text{nrow},1)
\end{array}
$$
This is a 90 degrees counterclockwise rotation of the "usual" printed order for a matrix:
$$
\begin{array}{cccc}
(1,1) & (1,2) & \ldots & (1,\text{ncol}) \\
(2,1) & (2,2) & \ldots & (2,\text{ncol}) \\
\vdots & & & \\
(\text{nrow},1) & (\text{nrow},2) & \ldots & (\text{nrow},\text{ncol})
\end{array}
$$
Therefore, if we want the displayed heatmap to follow the usual order, we must rotate the matrix 90° clockwise before passing it in to image(). (Equivalently, reverse the row order and take the transpose.)
clockwise90 = function(a) { t(a[nrow(a):1,]) } # Handy rotate function
image(clockwise90(mat))
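To convince ourselves that reversing the rows and transposing really is a clockwise rotation, we can apply the same function to a tiny matrix where the result is easy to read off. The 2 x 3 test matrix is an arbitrary choice.

```r
clockwise90 = function(a) { t(a[nrow(a):1, ]) }  # Same rotate function as above

m = matrix(1:6, nrow = 2)   # Rows: (1 3 5) and (2 4 6)
r = clockwise90(m)          # Rows: (2 1), (4 3), (6 5)
r

# Four successive quarter turns bring the matrix back to where it started.
identical(clockwise90(clockwise90(clockwise90(r))), m)
```

The first column of m, read from bottom to top, becomes the first row of the rotated matrix, which is what a clockwise quarter turn does.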
The default is to use a red-to-white color scale in image(), but the col argument can take any vector of colors. Built-in functions gray.colors(), rainbow(), heat.colors(), topo.colors(), terrain.colors(), cm.colors() all return contiguous color vectors of given lengths.
phi = dnorm(seq(-2,2,length=50))
normal.mat = phi %o% phi
image(normal.mat, col=terrain.colors(20)) # Terrain colors
To draw contour lines from a numeric matrix, we use the contour() function; to add contours to an existing plot (say, a heatmap), we use contour() with add=TRUE.
image(normal.mat, col=terrain.colors(20))
contour(normal.mat, add=TRUE)
ggplot2
ggplot2 is an R package for "declaratively" creating graphics.
We provide the data and tell ggplot2 how to map variables to aesthetics and what graphical primitives to use. Then, it takes care of the details.
Plots in ggplot2 are built sequentially using layers.
When using ggplot2, it is essential that our data are tidy!
Let's work through how to build a plot layer by layer.
First, let's initialize a plot. We use the data parameter to tell ggplot what data frame to use.
The data should be a data.frame or tibble!
library(gapminder)
ggplot(data = gapminder)
Add an aesthetic using aes() within the initial ggplot() call.
ggplot(data = gapminder, mapping = aes(x = year, y = lifeExp))
Now ggplot knows what to plot, but it doesn't know how to plot it yet. Let's add some points with geom_point().
We add layers using the + operator.
ggplot(data = gapminder, mapping = aes(x = year, y = lifeExp)) +
  geom_point()
Let's make our points smaller and red.
ggplot(data = gapminder, mapping = aes(x = year, y = lifeExp)) +
  geom_point(color = "red", size = 0.75)
Let's try switching them to lines.
ggplot(data = gapminder, mapping = aes(x = year, y = lifeExp)) +
  geom_line(color = "red", linewidth = 0.75)
We want lines connected by country, not just in the order that they appear in the data.
ggplot(data = gapminder,
       mapping = aes(x = year, y = lifeExp, group = country)) +
  geom_line(color = "red", linewidth = 0.5)
We can color by continent to explore differences across continents.
We use aes() because we want to color by something in our data.
Putting a color within aes() will automatically add a legend.
We have to remove the color within geom_line(), or it will override the aes().
ggplot(data = gapminder,
       aes(x = year, y = lifeExp, group = country, color = continent)) +
  geom_line(linewidth = 0.5)
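The difference between mapping a color inside aes() and setting a constant color inside a geom can be inspected without looking at the picture. The sketch below uses the built-in mtcars data rather than gapminder, and reads the computed layer data with ggplot2's layer_data() function.

```r
library(ggplot2)

# Color mapped to a variable inside aes(): one color per group, plus a legend.
p_mapped = ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point()

# Color set inside the geom: the constant overrides the mapping for that layer.
p_fixed = ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(color = "red")

# layer_data() returns the data ggplot2 computed for drawing a layer.
n_mapped_colors = length(unique(layer_data(p_mapped)$colour))  # one per cyl value
fixed_colors = unique(layer_data(p_fixed)$colour)              # only "red"
```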
Let's add another layer for the trend lines by continent!
We use a new aes() to group them differently than our lines (by continent).
We will make them stand out by making them thicker and darker.
We don't want error bars, so we will set se = FALSE.
ggplot(data = gapminder,
       aes(x = year, y = lifeExp, group = country, color = continent)) +
  geom_line(linewidth = 0.5) +
  geom_smooth(aes(group = continent), se = FALSE, linewidth = 1.5,
              color = "black", method = "loess")
The plot is cluttered and hard to read. Let's try separating by continents using facets!
We use facet_wrap(), which takes in a formula object and uses a tilde ~ with the variable name.
ggplot(data = gapminder,
       aes(x = year, y = lifeExp, group = country, color = continent)) +
  geom_line(linewidth = 0.5) +
  geom_smooth(aes(group = continent), se = FALSE, linewidth = 1.5,
              color = "black", method = "loess") +
  facet_wrap(~ continent)
Now, we formalize the labels on our plot using labs().
We can also edit labels one at a time using xlab(), ylab(), ggtitle(), etc.
We should do this for every graph that we present! It is unlikely that the text styling of our data frame matches our output. Changing the labels improves human readability!
ggplot(data = gapminder,
       aes(x = year, y = lifeExp, group = country, color = continent)) +
  geom_line(linewidth = 0.5) +
  geom_smooth(aes(group = continent), se = FALSE, linewidth = 1.5,
              color = "black", method = "loess") +
  facet_wrap(~ continent) +
  labs(title = "Life expectancy over time by continent", x = "Year",
       y = "Life Expectancy", color = "Continent")
Let's center our title by adjusting theme().
element_text() tells ggplot() how to display the text.
hjust is our horizontal alignment; we set it to one half to center the title.
ggplot(data = gapminder,
       aes(x = year, y = lifeExp, group = country, color = continent)) +
  geom_line(linewidth = 0.5) +
  geom_smooth(aes(group = continent), se = FALSE, linewidth = 1.5,
              color = "black", method = "loess") +
  facet_wrap(~ continent) +
  labs(title = "Life expectancy over time by continent", x = "Year",
       y = "Life Expectancy", color = "Continent") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 14))
Each panel is already labeled with its continent, so the legend is redundant. Let's remove it.
ggplot(data = gapminder, aes(x = year, y = lifeExp, group = country, color = continent)) + geom_line(linewidth = 0.5) + geom_smooth(aes(group = continent), se = FALSE, linewidth = 1.5, color = "black", method = "loess") + facet_wrap(~ continent) + labs(title = "Life expectancy over time by continent", x = "Year", y = "Life Expectancy", color = "Continent") + theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 14), legend.position = "none")
If we don't like the default gray background, we can remove it with theme_bw().
(Other complete themes are available; run ?theme_bw to look them up.)
Note that theme_bw() resets every theme setting, so it must come before our theme() adjustments.
ggplot(data = gapminder, aes(x = year, y = lifeExp, group = country, color = continent)) + geom_line(linewidth = 0.5) + geom_smooth(aes(group = continent), se = FALSE, linewidth = 1.5, color = "black", method = "loess") + facet_wrap(~ continent) + labs(title = "Life expectancy over time by continent", x = "Year", y = "Life Expectancy") + theme_bw() + theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 14), legend.position = "none")
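The order of theme layers matters. A complete theme such as theme_bw() replaces all theme elements, so theme() tweaks added before it are lost. The sketch below illustrates this under the assumption that ggplot2 and gapminder are installed; the names p, p_reset, and p_kept are made up for this example.

```r
library(ggplot2)
library(gapminder)

p <- ggplot(data = gapminder, aes(x = year, y = lifeExp, color = continent)) +
  geom_point(alpha = 0.3) +
  labs(title = "Life expectancy over time")

# theme() first, theme_bw() last: the complete theme resets the earlier tweaks,
# so the legend comes back and the title is left-aligned again.
p_reset <- p +
  theme(legend.position = "none", plot.title = element_text(hjust = 0.5)) +
  theme_bw()

# theme_bw() first, theme() last: the tweaks are kept.
p_kept <- p +
  theme_bw() +
  theme(legend.position = "none", plot.title = element_text(hjust = 0.5))
```

Printing p_reset and p_kept side by side shows the difference.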
We can increase all of our text proportionally using base_size within theme_bw() to increase readability.
We could also do this by adjusting text within theme().
We don't need to manually adjust our title size. This will scale everything automatically.
ggplot(data = gapminder, aes(x = year, y = lifeExp, group = country, color = continent)) + geom_line(linewidth = 0.5) + geom_smooth(aes(group = continent), se = FALSE, linewidth = 1.5, color = "black", method = "loess") + facet_wrap(~ continent) + labs(title = "Life expectancy over time by continent", x = "Year", y = "Life Expectancy") + theme_bw(base_size = 16) + theme(plot.title = element_text(hjust = 0.5, face = "bold"), legend.position = "none")
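For comparison, here is a hedged sketch of the alternative mentioned above, setting the text element inside theme(). It assumes ggplot2 and gapminder are installed, and the object names are made up. The two versions should give similar text sizes, although base_size also scales some margins and line widths, so the results are not guaranteed to be identical.

```r
library(ggplot2)
library(gapminder)

p <- ggplot(data = gapminder, aes(x = year, y = lifeExp)) +
  geom_point(alpha = 0.3) +
  labs(title = "Life expectancy over time", x = "Year", y = "Life Expectancy")

# Option 1: scale the whole theme from a larger base size.
p_base_size <- p + theme_bw(base_size = 16)

# Option 2: keep the default theme size and enlarge the root text element only.
p_text <- p + theme_bw() + theme(text = element_text(size = 16))
```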
Now our text is a good size, but the x-axis labels overlap. Let's rotate them.
ggplot(data = gapminder, aes(x = year, y = lifeExp, group = country, color = continent)) + geom_line(linewidth = 0.5) + geom_smooth(aes(group = continent), se = FALSE, linewidth = 1.5, color = "black", method = "loess") + facet_wrap(~ continent) + labs(title = "Life expectancy over time by continent", x = "Year", y = "Life Expectancy") + theme_bw(base_size = 16) + theme(plot.title = element_text(hjust = 0.5, face = "bold"), legend.position = "none", axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))
Lastly, let's space out our panels by adjusting panel.spacing.x.
ggplot(data = gapminder, aes(x = year, y = lifeExp, group = country, color = continent)) + geom_line(linewidth = 0.5) + geom_smooth(aes(group = continent), se = FALSE, linewidth = 1.5, color = "black", method = "loess") + facet_wrap(~ continent) + labs(title = "Life expectancy over time by continent", x = "Year", y = "Life Expectancy") + theme_bw(base_size = 16) + theme(plot.title = element_text(hjust = 0.5, face = "bold"), legend.position = "none", axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1), panel.spacing.x = unit(0.75, "cm"))
When the entire plot is ready, we can also store it as an object.
lifeExp_plot <- ggplot(data = gapminder, aes(x = year, y = lifeExp, group = country, color = continent)) + geom_line(linewidth = 0.5) + geom_smooth(aes(group = continent), se = FALSE, linewidth = 1.5, color = "black", method = "loess") + facet_wrap(~ continent) + labs(title = "Life expectancy over time by continent", x = "Year", y = "Life Expectancy") + theme_bw(base_size = 16) + theme(plot.title = element_text(hjust = 0.5, face = "bold"), legend.position = "none", axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1), panel.spacing.x = unit(0.75, "cm"))
Then, we can plot it by just calling our object.
lifeExp_plot
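A stored plot can also be extended later, because adding a layer with + returns a new plot object. The sketch below uses a simplified stand-in for the full plot (the names life_plot and life_plot_labeled are made up) and assumes ggplot2 and gapminder are installed.

```r
library(ggplot2)
library(gapminder)

# A simplified plot stored as an object.
life_plot <- ggplot(data = gapminder, aes(x = year, y = lifeExp, group = country)) +
  geom_line(linewidth = 0.5) +
  facet_wrap(~ continent)

# Adding labels and a theme creates a new object; life_plot itself is unchanged.
life_plot_labeled <- life_plot +
  labs(title = "Life expectancy over time by continent", x = "Year", y = "Life Expectancy") +
  theme_bw(base_size = 16)

life_plot_labeled
```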
We can also save it in our figures subfolder using ggsave().
We can use the height and width parameters to automatically resize the image.
ggsave(filename = "figures/lifeExp_plot.pdf", plot = lifeExp_plot, height = 5, width = 7)
Note: Never save figures from our analysis using screenshots or point-and-click! It will lead to lower quality and non-reproducible figures!
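One practical detail: the target folder has to exist before saving. The sketch below creates it first with dir.create(). It is only an illustration: it writes to a temporary folder standing in for the project's figures subfolder, uses a simplified plot, and assumes ggplot2 and gapminder are installed. The names small_plot, fig_dir, and out_file are made up.

```r
library(ggplot2)
library(gapminder)

small_plot <- ggplot(data = gapminder, aes(x = year, y = lifeExp, group = country)) +
  geom_line(linewidth = 0.5)

# A temporary folder stands in for the project's figures subfolder here.
fig_dir <- file.path(tempdir(), "figures")
dir.create(fig_dir, showWarnings = FALSE, recursive = TRUE)

# height and width are in inches by default.
out_file <- file.path(fig_dir, "small_plot.pdf")
ggsave(filename = out_file, plot = small_plot, height = 5, width = 7)

file.exists(out_file)
```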
ggplot: What we just made was a very complicated and fine-tuned plot!
It is very common to have to Google how to adjust certain things.
So does the creator of ggplot2!
ggplot(data = gapminder, aes(x = lifeExp))
ggplot(data = gapminder, aes(x = lifeExp)) + geom_histogram()
ggplot(data = gapminder, aes(x = lifeExp)) + geom_histogram(binwidth = 1)
ggplot(data = gapminder, aes(x = lifeExp)) + geom_histogram(binwidth = 1, color = "black", fill = "lightblue")
ggplot(data = gapminder, aes(x = lifeExp)) + geom_histogram(binwidth = 1, color = "black", fill = "lightblue") + theme_bw(base_size = 20)
ggplot(data = gapminder, aes(x = lifeExp)) + geom_histogram(binwidth = 1, color = "black", fill = "lightblue") + theme_bw(base_size = 20) + labs(x = "Life Expectancy", y = "Count")
ggplot(data = gapminder, aes(x = continent, y = lifeExp))
ggplot(data = gapminder, aes(x = continent, y = lifeExp)) + geom_boxplot()
ggplot(data = gapminder, aes(x = continent, y = lifeExp)) + geom_boxplot(fill = "lightblue")
ggplot(data = gapminder, aes(x = continent, y = lifeExp)) + geom_boxplot(fill = "lightblue") + theme_bw(base_size = 20)
ggplot(data = gapminder, aes(x = continent, y = lifeExp)) + geom_boxplot(fill = "lightblue") + theme_bw(base_size = 20) + labs(title = "Life expectancy by Continent", x = "", y = "")
ggplot(data = gapminder, aes(x = continent, y = lifeExp)) + geom_boxplot(fill = "lightblue") + theme_bw(base_size = 20) + labs(title = "Life expectancy by Continent", x = "", y = "") + theme(plot.title = element_text(hjust = 0.5))
ggplot(data = gapminder, aes(x = continent, y = lifeExp)) + geom_boxplot(fill = "lightblue") + theme_bw(base_size = 20) + labs(title = "Life expectancy by Continent", x = "", y = "") + theme(plot.title = element_text(hjust = 0.5)) + ylim(c(0, 85))
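The ylim(c(0, 85)) call above is safe because every life expectancy value lies inside that range. With a tighter range, ylim() drops the observations outside it before the boxplot statistics are computed, whereas coord_cartesian() only zooms the view. A minimal sketch of the difference, assuming ggplot2 and gapminder are installed (the object names are made up):

```r
library(ggplot2)
library(gapminder)

p_box <- ggplot(data = gapminder, aes(x = continent, y = lifeExp)) +
  geom_boxplot(fill = "lightblue")

# ylim() sets scale limits: observations below 40 are dropped (with a warning)
# before the boxplot statistics are computed.
p_clip <- p_box + ylim(40, 85)

# coord_cartesian() only zooms the view: the statistics still use all observations.
p_zoom <- p_box + coord_cartesian(ylim = c(40, 85))
```

Comparing the two plots shows boxes and whiskers that differ for continents with values below 40.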
ggplot2 Summary
Axes: xlim(), ylim().
Legends: created by mappings within the initial aes(); edit within theme() or guides().
geom_point(), geom_line(), geom_histogram(), geom_bar(), geom_boxplot(), geom_text(), etc.
facet_grid(), facet_wrap() for faceting.
labs() for labels.
theme_bw() to make things look nicer.
Graphical parameters: color for color, alpha for opacity, linewidth (lwd)/size for thickness, shape for shape, fill for interior color, etc.
Here is a ggplot2 cheat sheet!
Deceptive axes.
Excessive/bad coloring.
Bad variable/axis names.
Unreadable labels.
Overloaded with information.
Pie charts (usually).
Simple, clean graphics.
Neat and human readable text.
Appropriate data range (bar charts should always start from 0!).
Consistent intervals.
Roughly 6 colors or fewer.
Size figures appropriately.
Consider the following questions when we choose our plot:
What if we have one variable? Two variables?
What if we have numeric data?
How do we deal with categorical or nominal variables?
Let's see some examples!
geom_histogram()
geom_boxplot()
Note: We can also use the more sophisticated letter-valued plot implemented in the package lvplot.
geom_bar()
geom_boxplot()
Here, we have multiple observations for each category.
geom_bar() (with argument stat = "identity")
Here, we have only one observation per category.
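As a sketch of the one-observation-per-category case (assuming ggplot2 and gapminder are installed; continent_means, p_identity, and p_col are names made up here), we can first reduce the data to one mean per continent and then draw the bars. geom_col() is ggplot2's shorthand for geom_bar(stat = "identity").

```r
library(ggplot2)
library(gapminder)

# One row per continent: the mean life expectancy over all countries and years.
continent_means <- aggregate(lifeExp ~ continent, data = gapminder, FUN = mean)

# Bar heights are taken directly from the y variable.
p_identity <- ggplot(data = continent_means, aes(x = continent, y = lifeExp)) +
  geom_bar(stat = "identity")

# Equivalent shorthand.
p_col <- ggplot(data = continent_means, aes(x = continent, y = lifeExp)) +
  geom_col()
```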
geom_point()
geom_line()
Note: When making a line plot, we should use both geom_point() and geom_line()!
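A minimal sketch of combining the two geoms, assuming ggplot2 and gapminder are installed. The choice of Japan and the names japan and p_japan are only for illustration. The points show where the actual observations are, and the line shows the trend between them.

```r
library(ggplot2)
library(gapminder)

# One country, so there is exactly one observation per year.
japan <- subset(gapminder, country == "Japan")

p_japan <- ggplot(data = japan, aes(x = year, y = lifeExp)) +
  geom_line(linewidth = 0.5) +
  geom_point(size = 2) +
  labs(x = "Year", y = "Life Expectancy")

p_japan
```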
geom_bar() setting x and fill within aes()
This is an example of bad visualization!!
geom_bar() setting x and fill within aes()
This one looks better by specifying position = position_dodge() in geom_bar().
Note: Never stack the bars unless it is necessary.
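The stacked and dodged versions can be compared with a sketch like the following. It assumes ggplot2 and gapminder are installed. The grouping variable above_70 is a hypothetical one created only for this example, and the other object names are made up as well.

```r
library(ggplot2)
library(gapminder)

# Countries in 2007, with a made-up two-level category for illustration.
gap_2007 <- subset(gapminder, year == 2007)
gap_2007$above_70 <- ifelse(gap_2007$lifeExp >= 70, "70 or above", "Below 70")

# Default: bars for the two groups are stacked on top of each other.
p_stacked <- ggplot(data = gap_2007, aes(x = continent, fill = above_70)) +
  geom_bar()

# Dodged: bars for the two groups sit side by side and share a common baseline.
p_dodged <- ggplot(data = gap_2007, aes(x = continent, fill = above_70)) +
  geom_bar(position = position_dodge())
```

With a common baseline, the counts in each group can be compared directly by bar height.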
What if we have two numeric variables and one categorical?
Scatterplot or line plot colored by category.
Scatterplot or line plot faceted by category.
Note: Recall our example in the step-by-step practice with ggplot2.
R packages provide us with numerous handy functions that have been written by other R developers.
The tidyverse is a collection of packages for common data science tasks.
Pipes %>% allow us to string together commands to get a flow of results.
dplyr is a package for data wrangling with several key verbs (or functions).
tidyr is a package for manipulating data frames in R.
Base R has a set of powerful plotting tools that help us quickly visualize our data.
ggplot2 is a package for creating more sophisticated plots.
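To tie these pieces together, here is a short sketch that pipes a dplyr summary into a ggplot2 figure. It assumes the dplyr, ggplot2, and gapminder packages are installed; continent_trend, mean_lifeExp, and p_trend are names made up for this example.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

# Mean life expectancy per continent and year, computed with dplyr verbs.
continent_trend <- gapminder %>%
  group_by(continent, year) %>%
  summarize(mean_lifeExp = mean(lifeExp), .groups = "drop")

# One line (with points) per continent.
p_trend <- ggplot(data = continent_trend,
                  aes(x = year, y = mean_lifeExp, color = continent)) +
  geom_line(linewidth = 0.5) +
  geom_point() +
  labs(x = "Year", y = "Mean Life Expectancy", color = "Continent") +
  theme_bw()

p_trend
```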
Submit Lab 4 on Gradescope by the end of Tuesday (February 13)!! Start earlier!!