STAT 302 Statistical Computing

class: center, top, title-slide

.title[
# STAT 302 Statistical Computing
]
.subtitle[
## Lecture 2: Data Structures in R
]
.author[
### Yikun Zhang (<em>Winter 2024</em>)
]

---

# Outline

1. Vectors in R.

2. Arrays and Matrices in R.

3. Lists in R.

4. Data Frames.

5. R Coding Style Guide

<font size="4">* Acknowledgement: Parts of the slides are modified from the course materials by Prof. Ryan Tibshirani, Prof. Yen-Chi Chen, Prof. Deborah Nolan, Bryan Martin, and Andrea Boskovic. </font>

---
class: inverse

# Part 1: Vectors in R

---

# First Data Structure: Vectors

* A **data structure** is a grouping of related data values into an object.

* A **vector** is a sequence of values with the _same_ data type.

```r
# Create a numeric vector
x = c(7, 8, 10, 45, 2)

is.vector(x)
```

```
## [1] TRUE
```

```r
class(x)
```

```
## [1] "numeric"
```

The function `c()` combines all its arguments into a vector (or a list).

---

# Vectors in R

* A **vector** is a sequence of values with the _same_ data type.

```r
y = c(7.5, as.integer(8), 10+4i, "c")
y
```

```
## [1] "7.5"   "8"     "10+4i" "c"
```

```r
class(y)
```

```
## [1] "character"
```

* If there is some elements in a vector that is of character type, R will coerce all the elements into characters.

```r
1:6
```

```
## [1] 1 2 3 4 5 6
```

`1:6` is shorthand for `c(1,2,3,4,5,6)`.

---

# Vectors in R

* We can also generate vectors using functions such as `rep()` and `seq()`

```r
# Sequence from 1 to 20, incrementing by a step 5
seq(1, 20, by = 5)
```

```
## [1]  1  6 11 16
```

```r
# Repeat each element of a vector 3 times each
rep(c(1, 2), each = 3)
```

```
## [1] 1 1 1 2 2 2
```

```r
# Repeat an entire vector 3 times
rep(c(1, 2), times = 3)
```

```
## [1] 1 2 1 2 1 2
```

---

# Subsetting Vectors in R

* We subset a vector using `[index]` after the vector name.

```r
x = c(7, 8, 10, 45, 2)
# Subset the second element
x[2]
```

```
## [1] 8
```

```r
# Subset the first, second, and fourth elements
x[c(1,2,4)]
```

```
## [1]  7  8 45
```

* If we use a negative index, we return the vector with that element removed.

```r
x[-3]
```

```
## [1]  7  8 45  2
```

---

# Subsetting Vectors in R

* We can also subset a vector by a logical statement (or equivalently, a logical vector of the same length).

```r
x = c(7, 8, 10, 45, 2)
x[x > 9]
```

```
## [1] 10 45
```

```r
# Return the indices of those elements > 9
which(x > 9)
```

```
## [1] 3 4
```

```r
# Same output, but the code is redundant
x[which(x > 9)]
```

```
## [1] 10 45
```

---

# Naming the Elements of Vectors in R

* We can give names to elements/components of vectors, and index vectors accordingly.
  - <font size="4"> Note: Names are the labels of elements but not the additional components of the vector. </font>

```r
z = c(3, 2, 31, 10)
names(z) = c("v1","v2","v3","fred")
z
```

```
##   v1   v2   v3 fred 
##    3    2   31   10
```

```r
z["fred"]
```

```
## fred 
##   10
```

```r
z[c("v1", "fred")]
```

```
##   v1 fred 
##    3   10
```

---

# Naming the Elements of Vectors in R

What if we only name one element of `z` in the first place?

```r
z = c(3, 2, 31, 10)
names(z[2]) = "b"
z
```

```
## [1]  3  2 31 10
```

We can't change the name of a single element in vector `z` neither.

```r
names(z[2]) = "b"
z
```

```
## [1]  3  2 31 10
```

---

# Vector Arithmetics

Arithmetic operators apply to vectors in a "componentwise" fashion.

```r
x = c(7, 8, 10, 45, 2)
y = -1:-5
x + y
```

```
## [1]  6  6  7 41 -3
```

```r
z = c("a", "6", "7", "2", "5")
x - as.numeric(z)
```

```
## Warning: NAs introduced by coercion
```

```
## [1] NA  2  3 43 -3
```

Note: Arithmetic operations only work for numeric vectors.

---

# Vector Recycling

What if we apply arithmetic operators to two numeric vectors of different lengths?

```r
x = c(7, 8, 10, 45, 2)
p = c(2, 3)
x^p
```

```
## [1]    49   512   100 91125     4
```

**Recycling** in R repeat elements in the shorter vector to match with the longer one. 
  - This is useful when done on purpose, but could also lead to hard-to-catch bugs in our code!

```r
2*x
```

```
## [1] 14 16 20 90  4
```

---

# Comparative and Logical Operations on Vectors

We can also do componentwise comparisons and logical operations with vectors.

```r
x = c(7, 8, 10, 45, 2)
x > 9
```

```
## [1] FALSE FALSE  TRUE  TRUE FALSE
```

```r
(x > 9) | (x < 6)
```

```
## [1] FALSE FALSE  TRUE  TRUE  TRUE
```

```r
x == c(10, 2)
```

```
## [1] FALSE FALSE  TRUE FALSE FALSE
```

```r
sum(x > 9)
```

```
## [1] 2
```

---

# Built-in Functions for Vectors

Many built-in functions can take vectors as arguments:

* `mean(), median(), sd(), var(), max(), min(), length()`, and `sum()` return single numbers.

* `cumsum(), cumprod(), cummax(), cummin()` return the cumulative sums, products, minima or maxima of the elements of a vector.

* `sort()` returns the sorted vector.

* `order()` returns the indices of the sorted vector.

* `hist()` takes a vector of numbers and produces a histogram, a highly structured object, with the side effect of making a plot.

* `ecdf()` similarly produces a cumulative-density-function object.
    
* `summary()` gives the summary statistics of numerical vectors.
    
* `any()` and `all()` are useful on Boolean vectors.

---
class: inverse

# Part 2: Arrays and Matrices in R

---

# Second Data Structure: Arrays

* An **array** is a multi-dimensional generalization of vectors.

```r
x = c(7, 8, 10, 45, 20, 1)
# Create a 3-by-2 array using the elements in `x`
x_arr = array(x, dim = c(3, 2))
x_arr
```

```
##      [,1] [,2]
## [1,]    7   45
## [2,]    8   20
## [3,]   10    1
```

```r
dim(x_arr)
```

```
## [1] 3 2
```

The function `dim()` tells us the numbers of rows and columns. The output of `dim()` could be a vector of arbitrary length.

---

# Arrays in R

We can also create a 3-dim array (known as tensor in Python).

```r
y = c(7, 8, 10, 45, 20, 1, 4, 2, 188, 32, 12, 34)
# Create a 3-by-2-by-2 array using the elements in `x`
y_arr = array(y, dim = c(3, 2, 2))
y_arr
```

```
## , , 1
## 
##      [,1] [,2]
## [1,]    7   45
## [2,]    8   20
## [3,]   10    1
## 
## , , 2
## 
##      [,1] [,2]
## [1,]    4   32
## [2,]    2   12
## [3,]  188   34
```

---

# Subsetting/Indexing An Array in R

We can access a 2-dim array either using `[index,index]` or by the underlying vector (column-major order).

```r
is.array(x_arr)
```

```
## [1] TRUE
```

```r
x_arr[1,2]
```

```
## [1] 45
```

```r
y_arr[3,1,2]
```

```
## [1] 188
```

```r
x_arr[c(1,2),2]
```

```
## [1] 45 20
```

---

# Subsetting/Indexing An Array in R

We can access a 2-dim array either using `[index,index]` or by the underlying vector (column-major order).

```r
x_arr
```

```
##      [,1] [,2]
## [1,]    7   45
## [2,]    8   20
## [3,]   10    1
```

```r
# View an array as a vector in a column-major order
x_arr[4]
```

```
## [1] 45
```

```r
as.vector(x_arr)
```

```
## [1]  7  8 10 45 20  1
```

---

# Matrices in R

A matrix is a specialization of a 2-dim array.

```r
z_mat = matrix(c(40, 1, 60, 3, 4, 2), nrow = 3)
z_mat
```

```
##      [,1] [,2]
## [1,]   40    3
## [2,]    1    4
## [3,]   60    2
```

```r
is.matrix(z_mat)
```

```
## [1] TRUE
```

```r
is.array(z_mat)
```

```
## [1] TRUE
```

* We could also specify `ncol` for the number of columns.

---

# Matrices in R

We can also generate matrices by column binding (`cbind()`) and row binding (`rbind()`) vectors.

```r
y = c(2, 3, 4)
arr1 = cbind(x_arr, y)
arr1
```

```
##            y
## [1,]  7 45 2
## [2,]  8 20 3
## [3,] 10  1 4
```

```r
rbind(x_arr, x_arr[c(1,2),])
```

```
##      [,1] [,2]
## [1,]    7   45
## [2,]    8   20
## [3,]   10    1
## [4,]    7   45
## [5,]    8   20
```

---

# Matrices in R

* We can subset a matrix as how we did for an array.

* Matrices, like vectors, can only have its entries of the same data type.

```r
rbind(c(1, 2, 3), c("a", "b", "c"))
```

```
##      [,1] [,2] [,3]
## [1,] "1"  "2"  "3" 
## [2,] "a"  "b"  "c"
```

* We can also apply (built-in) functions to matrices as vectors.

```r
mean(arr1)
```

```
## [1] 11.11111
```

---

# Matrix Multiplication

The usual multiplication `*` can only do component-wise/element-wise multiplication between two matrices.

```r
x_arr * y_arr[,,1]
```

```
##      [,1] [,2]
## [1,]   49 2025
## [2,]   64  400
## [3,]  100    1
```

The matrix multiplication in R is achieved by `%*%`.

```r
z_mat = matrix(data = c(1,2,3,12), ncol = 2)
x_arr %*% z_mat
```

```
##      [,1] [,2]
## [1,]   97  561
## [2,]   48  264
## [3,]   12   42
```

---

# Other Matrix Operations

* Row/Column sum and mean:

```r
rowSums(x_arr)
```

```
## [1] 52 28 11
```

```r
colMeans(x_arr)
```

```
## [1]  8.333333 22.000000
```

* Matrix transpose:

```r
t(x_arr)
```

```
##      [,1] [,2] [,3]
## [1,]    7    8   10
## [2,]   45   20    1
```

---

# Other Matrix Operations

* The determinant of a square matrix:

```r
print(z_mat)
```

```
##      [,1] [,2]
## [1,]    1    3
## [2,]    2   12
```

```r
# The determinant of a square matrix
det(z_mat)
```

```
## [1] 6
```

* The inverse of a matrix:

```r
solve(z_mat)
```

```
##            [,1]       [,2]
## [1,]  2.0000000 -0.5000000
## [2,] -0.3333333  0.1666667
```

---

# Other Matrix Operations

* The `diag()` function can extract the diagonal entries of a matrix:

```r
diag(z_mat)
```

```
## [1]  1 12
```

* The `diag()` function can also be used to create a diagonal matrix:

```r
diag(c(1,4,3))
```

```
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    4    0
## [3,]    0    0    3
```

---

# Names in Matrices

* We can name either rows or columns or both, with `rownames()` and `colnames()`. The rules are the same as naming the vectors.

```r
colnames(z_mat) = c("a", "b")
z_mat
```

```
##      a  b
## [1,] 1  3
## [2,] 2 12
```

Similarly to `names()` for vectors, we then access them by calling the function again.

```r
colnames(z_mat)
```

```
## [1] "a" "b"
```

Note: Names help us understand what we are working with.

---
class: inverse

# Part 3: Lists in R

---

# Third Data Structure: Lists

A **list** is a collection of objects that are not necessarily all of the same data type and can even have different lengths.

```r
my_list = list("exponential", 7, FALSE, c(1,6,2))
my_list
```

```
## [[1]]
## [1] "exponential"
## 
## [[2]]
## [1] 7
## 
## [[3]]
## [1] FALSE
## 
## [[4]]
## [1] 1 6 2
```

---

# Subsetting a List

* We can use `[index]` as with vectors, and it will return a list.

```r
my_list[4]
```

```
## [[1]]
## [1] 1 6 2
```

```r
class(my_list[4])
```

```
## [1] "list"
```

* If we want to extract one element of a list, we have to use `[[index]]`.

```r
my_list[[4]]
```

```
## [1] 1 6 2
```

---

# Subsetting and Expanding a List

```r
# Subset the second sub-element in the fourth element of the list
my_list[[4]][2]
```

```
## [1] 6
```

We can also use `[[index]]` to expand the list.

```r
my_list[[5]] = c("a", "3", "UW STAT")
my_list
```

```
## [[1]]
## [1] "exponential"
## 
## [[2]]
## [1] 7
## 
## [[3]]
## [1] FALSE
## 
## [[4]]
## [1] 1 6 2
## 
## [[5]]
## [1] "a"       "3"       "UW STAT"
```

---

# Contracting a List

We can also shorten the list with by setting the length to something smaller (also works for vectors).

```r
length(my_list)
```

```
## [1] 5
```

```r
length(my_list) = 3
my_list
```

```
## [[1]]
## [1] "exponential"
## 
## [[2]]
## [1] 7
## 
## [[3]]
## [1] FALSE
```

---

# Naming a List

We can also name the elements of a list:

```r
names(my_list) = c("first", "num", "logical")
my_list
```

```
## $first
## [1] "exponential"
## 
## $num
## [1] 7
## 
## $logical
## [1] FALSE
```

---

# Naming a List

The names for the element of a list can be given when we initialize the list.

```r
my_list = list(func = "exponential", num = 7, 
               logi = FALSE, vec = c(1,6,2))
my_list
```

```
## $func
## [1] "exponential"
## 
## $num
## [1] 7
## 
## $logi
## [1] FALSE
## 
## $vec
## [1] 1 6 2
```

---

# Subsetting a List By Name

There are two different ways to subset an element from the list by name.

```r
my_list[["num"]]
```

```
## [1] 7
```

```r
my_list$num
```

```
## [1] 7
```

We will also use `$` to access a column of the data frame later...
---

# Advantage of Lists

* Lists give us a natural way to store and look up data by name, rather than by position.

* Lists achieve a useful programming concept called **key-value pairs**, i.e., dictionaries in Python.

- If we need to know the value of a component, we can look that up by name without caring where it is (in what position it lies) in the list.
  
--
  
* Lists are generally used when a function returns multiple results...

---
class: inverse

# Part 4: Data Frames in R

---

# Fourth Data Structure: Data Frames

* A **data frame** is a classic data table with `$n$` rows for cases and `$p$` columns for variables.

* A data frame can be viewed as a generalization of a named array.

* In principle, a data frame is a special list, with the restriction that all its components are vectors of the same length.

---

# Data Frames in R

We start from creating a matrix (2-dim array).

```r
a_mat = matrix(c(35, 8, 10, 4, 12, 20, 10, 11, 1, 2), 
               ncol=2)
colnames(a_mat) = c("v1","v2")
a_mat
```

```
##      v1 v2
## [1,] 35 20
## [2,]  8 10
## [3,] 10 11
## [4,]  4  1
## [5,] 12  2
```

```r
class(a_mat)
```

```
## [1] "matrix" "array"
```

---

# Data Frames in R

We can expand the column of a data frame or coerce a matrix/array to the data frame type using `data.frame()`.

```r
a_df = data.frame(a_mat, Date=as.Date("1965/5/15") + 1:5)
a_df
```

```
##   v1 v2       Date
## 1 35 20 1965-05-16
## 2  8 10 1965-05-17
## 3 10 11 1965-05-18
## 4  4  1 1965-05-19
## 5 12  2 1965-05-20
```

Note: Check what the function `as.Date()` is for. Why can we add a numeric vector to it?

---

# Data Frames in R

The function `cbind()` and `rbind()` also works for data frames.

```r
a_df = cbind(a_df, binary=rbinom(5, size = 1, prob = 0.3))
a_df
```

```
##   v1 v2       Date binary
## 1 35 20 1965-05-16      0
## 2  8 10 1965-05-17      0
## 3 10 11 1965-05-18      0
## 4  4  1 1965-05-19      1
## 5 12  2 1965-05-20      1
```

Note: `rbinom()` generates some random samples from the binomial distribution. Run `?rbinom()` to check the documentation.

---

# Data Frames in R

* However, when using `rbind()`, the data type of each column in the new data frame should match the original data frame.

```r
rbind(a_df, data.frame(v1=1, v2=32, 
                       Date=as.Date("2023/09/27"), binary=-1.1))
```

```
##   v1 v2       Date binary
## 1 35 20 1965-05-16    0.0
## 2  8 10 1965-05-17    0.0
## 3 10 11 1965-05-18    0.0
## 4  4  1 1965-05-19    1.0
## 5 12  2 1965-05-20    1.0
## 6  1 32 2023-09-27   -1.1
```

---

# Subset Rows/Columns of A Data Frame

```r
a_df$v2
```

```
## [1] 20 10 11  1  2
```

```r
a_df$Date[1:3]
```

```
## [1] "1965-05-16" "1965-05-17" "1965-05-18"
```

```r
a_df[,2]
```

```
## [1] 20 10 11  1  2
```

```r
a_df[-(3:4),2]
```

```
## [1] 20 10  2
```

---

# Read Tables into R

So far, we only create our data frames manually in R. 
--

In practice, it is more common to read those existing tabular data into R and carry out our analysis. There are many different ways to read tables into R. Here are two possible ways:

```r
family_df = 
  read.table(url("https://github.com/zhangyk8/zhangyk8.github.io/raw/master/_teaching/file_stat302/Data/family.txt"), 
             sep = "\t", header = TRUE)

family_df2 = 
  read.csv(url("https://github.com/zhangyk8/zhangyk8.github.io/raw/master/_teaching/file_stat302/Data/family.txt"), 
           sep = "\t", header = TRUE)

all(family_df == family_df2)
```

```
## [1] TRUE
```

The data `family.txt` can be downloaded through the link [https://github.com/zhangyk8/zhangyk8.github.io/raw/master/_teaching/file_stat302/Data/family.txt](https://github.com/zhangyk8/zhangyk8.github.io/raw/master/_teaching/file_stat302/Data/family.txt).

---

# Read Tables into R

If the data file is in our current working directory, then we do not have to use the function `url()` to access it.

```r
family_df3 = read.table("family.txt", sep = "\t", header = TRUE)
class(family_df3)
```

```
## [1] "data.frame"
```

```r
head(family_df3)
```

```
##   firstName sex age height weight      bmi overWt
## 1       Tom   m  77     70    175 25.16239   TRUE
## 2      Maya   f  33     64    124 21.50106  FALSE
## 3       Joe   m  79     73    185 24.45884  FALSE
## 4    Robert   m  47     67    156 24.48414  FALSE
## 5       Sue   f  27     61     98 18.51492  FALSE
## 6       Liz   f  33     68    190 28.94981   TRUE
```

---

# Post-Analysis After Reading the Tables

```r
# Find all the unique first name
unique(family_df$firstName)
```

```
##  [1] "Tom"    "Maya"   "Joe"    "Robert" "Sue"    "Liz"    "Jon"    "Sally" 
##  [9] "Tim"    "Ann"    "Dan"    "Art"    "Zoe"
```

```r
# Histogram of BMIs for all individuals
hist(family_df$bmi, xlab = "BMIs",
     main = "Histogram of BMIs for all individuals")
```

---

# Working Directory in R

A working directory is the file path that R uses to save and look for data.

* We can check for our current working directory using `getwd()`.

```r
getwd()
```

```
## [1] "/media/yikun/Disk_D1/Graduate School/STAT 302/Lectures"
```

* We can change our working directory using `setwd()`.

```r
setwd("/media/yikun/Disk_D1/Graduate School/STAT 302/Lectures")
```

Note: Do not change the working directory inside R Markdown files! By default, R Markdown sets the file path of where it is in as the working directory.

---

# Saving Tables in R

We can save a single R object as `.rds` files using `saveRDS()`, and multiple R objects as `.RData` or `.rda` files using `save()`.

```r
object1 = 1:5
object2 = c("a", "b", "c")
# save only object1
saveRDS(object1, file = "object1_only.rds")
# save object1 and object2
save(object1, object2, file = "both_objects.RData")
```

If we want to save a data frame, it is recommended to write it as `.csv` or `.txt` file.

```r
write.table(family_df, file = "family_newsave.txt", sep = "\t")
write.csv(family_df, file = "family_newsave.csv")
```

---
class: inverse

# Part 5: R Coding Style Guide

---

# Object Names

Use either underscores (`_`) or big camel case (`BigCamelCase`) to separate words within an object/Variable name.

Try to avoid using dots `.` to separate words in R functions!

```r
# Good
day_one
day_1
DayOne

# Bad
dayone
```

---

# Object Names

Names should be concise, meaningful, and (generally) nouns.

```r
# Good
day_one

# Bad
first_day_of_the_month
djm1
```

---

# Object Names

It is **very important** that object names do not overlap with common functions!

```r
# Very extra super bad
c = 7
t = 23
T = FALSE
mean = "something"
```

Note: `T` and `F` are R shorthand for `TRUE` and `FALSE`, respectively. In general, we should spell them out as clear as possible.

```r
mean(c(1, 2))
```

```
## [1] 1.5
```

---

# Spacing

Put a space after every comma, just like the English writing.

```r
# Good
x[, 1]

# Bad
x[,1]
x[ ,1]
x[ , 1]
```

Do not put spaces inside or outside parentheses for regular function calls.

```r
# Good
mean(x, na.rm = TRUE)

# Bad
mean (x, na.rm = TRUE)
mean( x, na.rm = TRUE )
```

---

# Spacing with Operators

Most of the time when we are doing math, conditionals, logicals, or assignments, our operators should be surrounded by spaces (e.g. for `==`, `+`, `-`, `<-`, etc.).

```r
# Good
height = (feet * 12) + inches
mean(x, na.rm = 10)

# Bad
height=feet*12+inches
mean(x, na.rm=10)
```

There are some exceptions we will learn more about later, such as the power symbol `^`. 
See the [Tidyverse Style Guide](https://style.tidyverse.org/) for more details!

---

# Extra Spacing

Adding extra spaces is fine if it improves alignment of `=` or `<-`.

```r
# Good
list(
  total = a + b + c,
  mean  = (a + b + c) / n
)

# Also fine
list(
  total = a + b + c,
  mean = (a + b + c) / n
)
```

---

# Long Lines of Code

Strive to limit our code to 80 characters per line. This fits comfortably on a printed page with a reasonably sized font.

If a function call is too long to fit on a single line, use one line each for the function name, each argument, and the closing `)`. This makes the code easier to read and to modify later.

```r
# Good
do_something_very_complicated(
  something = "that",
  requires = many,
  arguments = "some of which may be long"
)

# Bad
do_something_very_complicated("that", requires, many, arguments,
                              "some of which may be long"
                              )
```

<font size="4">
*Tip! Try RStudio > Preferences > Code > Display > Show Margin with Margin column 
80 to give us a visual cue!*
</font>

---

# Semicolons

In R, semi-colons (`;`) are used to execute pieces of R code on a single line.

* In general, this is bad practice and should be avoided. Also, we never need to end lines of code with semi-colons!

```r
# Bad
a = 2; b = 3

# Also bad
a = 2;
b = 3;

# Good
a = 2
b = 3
```

---

# Quotes and Strings

Use `"`, not `'`, for quoting text. The only exception is when the text already contains double quotes and no single quotes.

```r
# Bad
'Text'
'Text with "double" and \'single\' quotes'

# Good
"Text"
'Text with "quotes"'
'<a href="http://style.tidyverse.org">A link</a>'
```

### Useful References for R Coding Style Guide

* [Tidyverse Style Guide](https://style.tidyverse.org/) by Hadley Wickham.

* [Google Style Guide](https://google.github.io/styleguide/Rguide.html).

This style guides are useful for other people to understand our code!

---

# Tidy Data Principles

There are three rules required for data (or a data frame/table) to be considered tidy:

1. Each variable must have its own column.

2. Each observation must have its own row.

3. Each value must have its own cell.

---

# Tidy Data Principles (Example 1)

The rules seem simple, but using them can be tricky! Let's consider the following example.

What is untidy about the following data frame?

<table class="table" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:center;"> Hospital </th>
   <th style="text-align:center;"> Diseased </th>
   <th style="text-align:center;"> Healthy </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:center;"> A </td>
   <td style="text-align:center;"> 10 </td>
   <td style="text-align:center;"> 14 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> B </td>
   <td style="text-align:center;"> 15 </td>
   <td style="text-align:center;"> 18 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> C </td>
   <td style="text-align:center;"> 12 </td>
   <td style="text-align:center;"> 13 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> D </td>
   <td style="text-align:center;"> 5 </td>
   <td style="text-align:center;"> 16 </td>
  </tr>
</tbody>
</table>

* **Variables:** hospital, disease status, and counts.

* **Observations:** the number of individuals at a given hospital and of a given disease status.

* **Values:** Hospital A, Hospital B, Hospital C, Hospital D, individual count values, *Disease Status "Healthy"*, and *Disease Status "Diseased"*.

---

# Tidy Data Principles (Example 1)

The main problem is that the column headers are values, not variables! How can we tidy it up?

<table class="table" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:center;"> Hospital </th>
   <th style="text-align:center;"> Status </th>
   <th style="text-align:center;"> Count </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:center;"> A </td>
   <td style="text-align:center;"> Diseased </td>
   <td style="text-align:center;"> 10 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> A </td>
   <td style="text-align:center;"> Healthy </td>
   <td style="text-align:center;"> 14 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> B </td>
   <td style="text-align:center;"> Diseased </td>
   <td style="text-align:center;"> 15 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> B </td>
   <td style="text-align:center;"> Healthy </td>
   <td style="text-align:center;"> 18 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> C </td>
   <td style="text-align:center;"> Diseased </td>
   <td style="text-align:center;"> 12 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> C </td>
   <td style="text-align:center;"> Healthy </td>
   <td style="text-align:center;"> 13 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> D </td>
   <td style="text-align:center;"> Diseased </td>
   <td style="text-align:center;"> 5 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> D </td>
   <td style="text-align:center;"> Healthy </td>
   <td style="text-align:center;"> 16 </td>
  </tr>
</tbody>
</table>

---

# Tidy Data Principles (Example 2)

Let's consider another example:

<table class="table" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> Country </th>
   <th style="text-align:right;"> Year </th>
   <th style="text-align:right;"> m_16_24 </th>
   <th style="text-align:right;"> m_25_34 </th>
   <th style="text-align:right;"> f_16_24 </th>
   <th style="text-align:right;"> f_25_34 </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> A </td>
   <td style="text-align:right;"> 2018 </td>
   <td style="text-align:right;"> 49 </td>
   <td style="text-align:right;"> 55 </td>
   <td style="text-align:right;"> 47 </td>
   <td style="text-align:right;"> 41 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> B </td>
   <td style="text-align:right;"> 2018 </td>
   <td style="text-align:right;"> 34 </td>
   <td style="text-align:right;"> 33 </td>
   <td style="text-align:right;"> 50 </td>
   <td style="text-align:right;"> 43 </td>
  </tr>
</tbody>
</table>

* **Variables:** Country, year, gender, age group, and counts.

* **Observations:** the number of individuals in a given country, in a given year, of a given gender, and in a given age group.

* **Values:** Country A, Country B, Year 2018, Gender "m", Gender "f", Age Group "16_24", Age Group "25_34", and individual counts.

---

# Tidy Data Principles (Example 2)

The tidy version is as follows:

<table class="table" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:center;"> Country </th>
   <th style="text-align:center;"> Year </th>
   <th style="text-align:center;"> Gender </th>
   <th style="text-align:center;"> Age_Group </th>
   <th style="text-align:center;"> Counts </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:center;"> A </td>
   <td style="text-align:center;"> 2018 </td>
   <td style="text-align:center;"> m </td>
   <td style="text-align:center;"> 16_24 </td>
   <td style="text-align:center;"> 49 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> A </td>
   <td style="text-align:center;"> 2018 </td>
   <td style="text-align:center;"> m </td>
   <td style="text-align:center;"> 25_34 </td>
   <td style="text-align:center;"> 55 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> A </td>
   <td style="text-align:center;"> 2018 </td>
   <td style="text-align:center;"> f </td>
   <td style="text-align:center;"> 16_24 </td>
   <td style="text-align:center;"> 47 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> A </td>
   <td style="text-align:center;"> 2018 </td>
   <td style="text-align:center;"> f </td>
   <td style="text-align:center;"> 25_34 </td>
   <td style="text-align:center;"> 41 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> B </td>
   <td style="text-align:center;"> 2018 </td>
   <td style="text-align:center;"> m </td>
   <td style="text-align:center;"> 16_24 </td>
   <td style="text-align:center;"> 34 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> B </td>
   <td style="text-align:center;"> 2018 </td>
   <td style="text-align:center;"> m </td>
   <td style="text-align:center;"> 25_34 </td>
   <td style="text-align:center;"> 33 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> B </td>
   <td style="text-align:center;"> 2018 </td>
   <td style="text-align:center;"> f </td>
   <td style="text-align:center;"> 16_24 </td>
   <td style="text-align:center;"> 50 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> B </td>
   <td style="text-align:center;"> 2018 </td>
   <td style="text-align:center;"> f </td>
   <td style="text-align:center;"> 25_34 </td>
   <td style="text-align:center;"> 43 </td>
  </tr>
</tbody>
</table>

Note: In R, this can be done via the `pivot_longer()` function in the `tidyr` package. We will discuss this in detail later...

---

# Guidelines of Making Data Tidy

1. Identify the observations, variables, and values.

2. Ensure that each observation has its own row.

* Be careful about individual observations spreading over multiple tables, Excel files, etc, or multiple types of observations within a single table (this would result in many empty cells).
  
3. Ensure that each variable has its own column.

* Be careful about variables spreading over two columns and multiple variables within a single column.
  
4. Ensure that each value has its own cell.

* Be careful about values as column headers.
  
---

# Why Do We Need Tidy Data?

* Easier to read and understand the data.

* More intuitive to analyze and plot the data using R (required for `ggplot2`).

* Fewer issues with missing values.

### Useful References for Tidy Data Principles

Here is a [fantastic reference](https://vita.had.co.nz/papers/tidy-data.pdf) written by Hadley Wickham going through all these principles in detail and with more examples.

---
# Summary

- Data structures allow us to group related values together.

- Vectors group together values with the same data type.

- Arrays add multi-dimensional structure to vectors, while matrices are two-dimensional arrays.

- Lists allow us to combine data of different types and lengths.

- Data frames are hybrids of matrices and lists, allowing each column to have a different data type but the same length.

- Tidy data principle helps us better analyze and visualize data tables in R.

Submit Lab 2 on Gradescope by the end of Tuesday (January 23)!!