class: center, top, title-slide

.title[
# STAT 302 Statistical Computing
]

.subtitle[
## Lecture 2: Data Structures in R
]

.author[
### Yikun Zhang (Winter 2024)
]

---

# Outline

1. Vectors in R.

2. Arrays and Matrices in R.

3. Lists in R.

4. Data Frames.

5. R Coding Style Guide.

<font size="4">* Acknowledgement: Parts of the slides are modified from the course materials by Prof. Ryan Tibshirani, Prof. Yen-Chi Chen, Prof. Deborah Nolan, Bryan Martin, and Andrea Boskovic. </font>

---
class: inverse

# Part 1: Vectors in R

---

# First Data Structure: Vectors

* A **data structure** is a grouping of related data values into an object.

* A **vector** is a sequence of values with the _same_ data type.

--

```r
# Create a numeric vector
x = c(7, 8, 10, 45, 2)
is.vector(x)
```

```
## [1] TRUE
```

```r
class(x)
```

```
## [1] "numeric"
```

The function `c()` combines all its arguments into a vector (or a list).

---

# Vectors in R

* A **vector** is a sequence of values with the _same_ data type.

```r
y = c(7.5, as.integer(8), 10+4i, "c")
y
```

```
## [1] "7.5"   "8"     "10+4i" "c"
```

```r
class(y)
```

```
## [1] "character"
```

* If any element of a vector is of character type, R will coerce all the elements into characters.

--

```r
1:6
```

```
## [1] 1 2 3 4 5 6
```

`1:6` is shorthand for `c(1,2,3,4,5,6)`.

---

# Vectors in R

* We can also generate vectors using functions such as `rep()` and `seq()`.

```r
# Sequence from 1 to 20 in steps of 5
seq(1, 20, by = 5)
```

```
## [1]  1  6 11 16
```

```r
# Repeat each element of a vector 3 times
rep(c(1, 2), each = 3)
```

```
## [1] 1 1 1 2 2 2
```

```r
# Repeat an entire vector 3 times
rep(c(1, 2), times = 3)
```

```
## [1] 1 2 1 2 1 2
```

---

# Subsetting Vectors in R

* We subset a vector using `[index]` after the vector name.

```r
x = c(7, 8, 10, 45, 2)
# Subset the second element
x[2]
```

```
## [1] 8
```

```r
# Subset the first, second, and fourth elements
x[c(1,2,4)]
```

```
## [1]  7  8 45
```

--

* If we use a negative index, we get back the vector with that element removed.
```r
x[-3]
```

```
## [1]  7  8 45  2
```

---

# Subsetting Vectors in R

* We can also subset a vector by a logical statement (or equivalently, a logical vector of the same length).

```r
x = c(7, 8, 10, 45, 2)
x[x > 9]
```

```
## [1] 10 45
```

```r
# Return the indices of those elements > 9
which(x > 9)
```

```
## [1] 3 4
```

```r
# Same output, but the code is redundant
x[which(x > 9)]
```

```
## [1] 10 45
```

---

# Naming the Elements of Vectors in R

* We can give names to elements/components of vectors, and index vectors accordingly.
    - <font size="4"> Note: Names are labels of the elements, not additional components of the vector. </font>

```r
z = c(3, 2, 31, 10)
names(z) = c("v1","v2","v3","fred")
z
```

```
##   v1   v2   v3 fred 
##    3    2   31   10 
```

```r
z["fred"]
```

```
## fred 
##   10 
```

```r
z[c("v1", "fred")]
```

```
##   v1 fred 
##    3   10 
```

---

# Naming the Elements of Vectors in R

What if we try to name only one element of `z` in the first place?

```r
z = c(3, 2, 31, 10)
names(z[2]) = "b"
z
```

```
## [1]  3  2 31 10
```

--

This syntax cannot change the name of a single element of `z` either: `names(z[2])` attaches the name to a temporary copy of the subset, so `z` itself is left unchanged. (Assigning into the names vector, e.g., `names(z)[2] = "b"`, does work.)

```r
names(z[2]) = "b"
z
```

```
## [1]  3  2 31 10
```

---

# Vector Arithmetics

Arithmetic operators apply to vectors in a "componentwise" fashion.

```r
x = c(7, 8, 10, 45, 2)
y = -1:-5
x + y
```

```
## [1]  6  6  7 41 -3
```

--

```r
z = c("a", "6", "7", "2", "5")
x - as.numeric(z)
```

```
## Warning: NAs introduced by coercion
```

```
## [1] NA  2  3 43 -3
```

Note: Arithmetic operations only work for numeric vectors, so character data must be converted first (entries that cannot be converted become `NA`).

---

# Vector Recycling

What if we apply arithmetic operators to two numeric vectors of different lengths?

```r
x = c(7, 8, 10, 45, 2)
p = c(2, 3)
x^p
```

```
## [1]    49   512   100 91125     4
```

--

**Recycling** in R repeats the elements of the shorter vector to match the length of the longer one.

- This is useful when done on purpose, but could also lead to hard-to-catch bugs in our code!
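To make the pitfall concrete, here is a minimal sketch (the second pair of vectors is made up for illustration). When the longer length is not a multiple of the shorter one, R still recycles but issues a warning; when it is a multiple, R recycles silently, which is where bugs tend to hide.

```r
x = c(7, 8, 10, 45, 2)
p = c(2, 3)

# Lengths 5 and 2: `p` is recycled to c(2, 3, 2, 3, 2), and R warns that
# the longer length is not a multiple of the shorter one
res_warn = x^p

# Lengths 4 and 2: c(10, 20) is recycled to c(10, 20, 10, 20) with no warning
res_silent = c(1, 2, 3, 4) + c(10, 20)
res_silent
```

```
## [1] 11 22 13 24
```

Checking `length()` of both operands before combining them is one simple way to catch unintended recycling.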
```r
2*x
```

```
## [1] 14 16 20 90  4
```

---

# Comparative and Logical Operations on Vectors

We can also do componentwise comparisons and logical operations with vectors.

```r
x = c(7, 8, 10, 45, 2)
x > 9
```

```
## [1] FALSE FALSE  TRUE  TRUE FALSE
```

```r
(x > 9) | (x < 6)
```

```
## [1] FALSE FALSE  TRUE  TRUE  TRUE
```

```r
x == c(10, 2)
```

```
## [1] FALSE FALSE  TRUE FALSE FALSE
```

```r
sum(x > 9)
```

```
## [1] 2
```

---

# Built-in Functions for Vectors

Many built-in functions can take vectors as arguments:

* `mean(), median(), sd(), var(), max(), min(), length()`, and `sum()` return single numbers.

* `cumsum(), cumprod(), cummax(), cummin()` return the cumulative sums, products, maxima, or minima of the elements of a vector.

* `sort()` returns the sorted vector.

* `order()` returns the indices that sort the vector.

* `hist()` takes a vector of numbers and produces a histogram, a highly structured object, with the side effect of making a plot.

* `ecdf()` similarly produces an empirical cumulative-distribution-function object.

* `summary()` gives the summary statistics of numerical vectors.

* `any()` and `all()` are useful on Boolean vectors.

---
class: inverse

# Part 2: Arrays and Matrices in R

---

# Second Data Structure: Arrays

* An **array** is a multi-dimensional generalization of vectors.

```r
x = c(7, 8, 10, 45, 20, 1)
# Create a 3-by-2 array using the elements in `x`
x_arr = array(x, dim = c(3, 2))
x_arr
```

```
##      [,1] [,2]
## [1,]    7   45
## [2,]    8   20
## [3,]   10    1
```

--

```r
dim(x_arr)
```

```
## [1] 3 2
```

For a 2-dim array, the function `dim()` tells us the numbers of rows and columns. In general, the output of `dim()` has one entry per dimension, so it could be a vector of arbitrary length.

---

# Arrays in R

We can also create a 3-dim array (known as a tensor in Python).
```r
y = c(7, 8, 10, 45, 20, 1, 4, 2, 188, 32, 12, 34)
# Create a 3-by-2-by-2 array using the elements in `y`
y_arr = array(y, dim = c(3, 2, 2))
y_arr
```

```
## , , 1
## 
##      [,1] [,2]
## [1,]    7   45
## [2,]    8   20
## [3,]   10    1
## 
## , , 2
## 
##      [,1] [,2]
## [1,]    4   32
## [2,]    2   12
## [3,]  188   34
```

---

# Subsetting/Indexing An Array in R

We can access a 2-dim array either using `[index,index]` or by the underlying vector (column-major order).

```r
is.array(x_arr)
```

```
## [1] TRUE
```

```r
x_arr[1,2]
```

```
## [1] 45
```

```r
y_arr[3,1,2]
```

```
## [1] 188
```

```r
x_arr[c(1,2),2]
```

```
## [1] 45 20
```

---

# Subsetting/Indexing An Array in R

We can access a 2-dim array either using `[index,index]` or by the underlying vector (column-major order).

```r
x_arr
```

```
##      [,1] [,2]
## [1,]    7   45
## [2,]    8   20
## [3,]   10    1
```

```r
# View an array as a vector in a column-major order
x_arr[4]
```

```
## [1] 45
```

```r
as.vector(x_arr)
```

```
## [1]  7  8 10 45 20  1
```

---

# Matrices in R

A matrix is a specialization of a 2-dim array.

```r
z_mat = matrix(c(40, 1, 60, 3, 4, 2), nrow = 3)
z_mat
```

```
##      [,1] [,2]
## [1,]   40    3
## [2,]    1    4
## [3,]   60    2
```

```r
is.matrix(z_mat)
```

```
## [1] TRUE
```

```r
is.array(z_mat)
```

```
## [1] TRUE
```

* We could also specify `ncol` for the number of columns.

---

# Matrices in R

We can also generate matrices by column binding (`cbind()`) and row binding (`rbind()`) vectors.

```r
y = c(2, 3, 4)
arr1 = cbind(x_arr, y)
arr1
```

```
##            y
## [1,]  7 45 2
## [2,]  8 20 3
## [3,] 10  1 4
```

```r
rbind(x_arr, x_arr[c(1,2),])
```

```
##      [,1] [,2]
## [1,]    7   45
## [2,]    8   20
## [3,]   10    1
## [4,]    7   45
## [5,]    8   20
```

---

# Matrices in R

* We can subset a matrix in the same way as an array.

* Matrices, like vectors, can only hold entries of the same data type.

```r
rbind(c(1, 2, 3), c("a", "b", "c"))
```

```
##      [,1] [,2] [,3]
## [1,] "1"  "2"  "3" 
## [2,] "a"  "b"  "c" 
```

--

* We can also apply (built-in) functions to matrices as we do to vectors.
```r
mean(arr1)
```

```
## [1] 11.11111
```

---

# Matrix Multiplication

The usual multiplication `*` only does component-wise/element-wise multiplication between two matrices.

```r
x_arr * y_arr[,,1]
```

```
##      [,1] [,2]
## [1,]   49 2025
## [2,]   64  400
## [3,]  100    1
```

--

Matrix multiplication in R is done with `%*%`.

```r
z_mat = matrix(data = c(1,2,3,12), ncol = 2)
x_arr %*% z_mat
```

```
##      [,1] [,2]
## [1,]   97  561
## [2,]   48  264
## [3,]   12   42
```

---

# Other Matrix Operations

* Row/Column sum and mean:

```r
rowSums(x_arr)
```

```
## [1] 52 28 11
```

```r
colMeans(x_arr)
```

```
## [1]  8.333333 22.000000
```

* Matrix transpose:

```r
t(x_arr)
```

```
##      [,1] [,2] [,3]
## [1,]    7    8   10
## [2,]   45   20    1
```

---

# Other Matrix Operations

* The determinant of a square matrix:

```r
print(z_mat)
```

```
##      [,1] [,2]
## [1,]    1    3
## [2,]    2   12
```

```r
# The determinant of a square matrix
det(z_mat)
```

```
## [1] 6
```

* The inverse of a matrix:

```r
solve(z_mat)
```

```
##            [,1]       [,2]
## [1,]  2.0000000 -0.5000000
## [2,] -0.3333333  0.1666667
```

---

# Other Matrix Operations

* The `diag()` function can extract the diagonal entries of a matrix:

```r
diag(z_mat)
```

```
## [1]  1 12
```

--

* The `diag()` function can also be used to create a diagonal matrix:

```r
diag(c(1,4,3))
```

```
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    4    0
## [3,]    0    0    3
```

---

# Names in Matrices

* We can name either rows or columns or both, with `rownames()` and `colnames()`. The rules are the same as for naming vectors.

```r
colnames(z_mat) = c("a", "b")
z_mat
```

```
##      a  b
## [1,] 1  3
## [2,] 2 12
```

--

As with `names()` for vectors, we can then retrieve the names by calling the function again.

```r
colnames(z_mat)
```

```
## [1] "a" "b"
```

Note: Names help us understand what we are working with.

---
class: inverse

# Part 3: Lists in R

---

# Third Data Structure: Lists

A **list** is a collection of objects that are not necessarily all of the same data type and can even have different lengths.
```r
my_list = list("exponential", 7, FALSE, c(1,6,2))
my_list
```

```
## [[1]]
## [1] "exponential"
## 
## [[2]]
## [1] 7
## 
## [[3]]
## [1] FALSE
## 
## [[4]]
## [1] 1 6 2
```

---

# Subsetting a List

* We can use `[index]` as with vectors, and it will return a list.

```r
my_list[4]
```

```
## [[1]]
## [1] 1 6 2
```

```r
class(my_list[4])
```

```
## [1] "list"
```

--

* If we want to extract one element of a list, we have to use `[[index]]`.

```r
my_list[[4]]
```

```
## [1] 1 6 2
```

---

# Subsetting and Expanding a List

```r
# Subset the second sub-element in the fourth element of the list
my_list[[4]][2]
```

```
## [1] 6
```

We can also use `[[index]]` to expand the list.

```r
my_list[[5]] = c("a", "3", "UW STAT")
my_list
```

```
## [[1]]
## [1] "exponential"
## 
## [[2]]
## [1] 7
## 
## [[3]]
## [1] FALSE
## 
## [[4]]
## [1] 1 6 2
## 
## [[5]]
## [1] "a"       "3"       "UW STAT"
```

---

# Contracting a List

We can also shorten the list by setting its length to something smaller (this also works for vectors).

```r
length(my_list)
```

```
## [1] 5
```

```r
length(my_list) = 3
my_list
```

```
## [[1]]
## [1] "exponential"
## 
## [[2]]
## [1] 7
## 
## [[3]]
## [1] FALSE
```

---

# Naming a List

We can also name the elements of a list:

```r
names(my_list) = c("first", "num", "logical")
my_list
```

```
## $first
## [1] "exponential"
## 
## $num
## [1] 7
## 
## $logical
## [1] FALSE
```

---

# Naming a List

The names of the elements of a list can also be given when we initialize the list.

```r
my_list = list(func = "exponential", num = 7, logi = FALSE, vec = c(1,6,2))
my_list
```

```
## $func
## [1] "exponential"
## 
## $num
## [1] 7
## 
## $logi
## [1] FALSE
## 
## $vec
## [1] 1 6 2
```

---

# Subsetting a List By Name

There are two different ways to extract an element from the list by name.

```r
my_list[["num"]]
```

```
## [1] 7
```

```r
my_list$num
```

```
## [1] 7
```

We will also use `$` to access a column of the data frame later...
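---

# Subsetting a List By Name (Sketch)

As a small sketch of how the three subsetting operators differ on a named list, the code below reuses `my_list` from the previous slides. The variable `key` is introduced only for illustration.

```r
my_list = list(func = "exponential", num = 7, logi = FALSE, vec = c(1, 6, 2))

# `[` keeps the list wrapper: the result is a list of length 1
class(my_list["num"])      # "list"

# `[[` (like `$`) extracts the element itself
class(my_list[["num"]])    # "numeric"

# `[[` accepts a name stored in a variable ...
key = "vec"
my_list[[key]]             # 1 6 2

# ... whereas `$` takes the name literally, so this looks for an
# element called "key" and returns NULL
my_list$key
```

In short, `$name` is convenient for interactive use, while `[[name]]` is the safer choice when the name is computed or stored in a variable.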
---

# Advantage of Lists

* Lists give us a natural way to store and look up data by name, rather than by position.

--

* Lists implement a useful programming concept called **key-value pairs**, similar to dictionaries in Python.
    - If we need to know the value of a component, we can look that up by name without caring where it is (in what position it lies) in the list.

--

* Lists are generally used when a function returns multiple results...

---
class: inverse

# Part 4: Data Frames in R

---

# Fourth Data Structure: Data Frames

* A **data frame** is a classic data table with `\(n\)` rows for cases and `\(p\)` columns for variables.

--

* A data frame can be viewed as a generalization of a named array.

--

* In principle, a data frame is a special list, with the restriction that all its components are vectors of the same length.

<p align="center"> <img src="./figures/df.png" width="300"/> </p>

---

# Data Frames in R

We start by creating a matrix (2-dim array).

```r
a_mat = matrix(c(35, 8, 10, 4, 12, 20, 10, 11, 1, 2), ncol=2)
colnames(a_mat) = c("v1","v2")
a_mat
```

```
##      v1 v2
## [1,] 35 20
## [2,]  8 10
## [3,] 10 11
## [4,]  4  1
## [5,] 12  2
```

```r
class(a_mat)
```

```
## [1] "matrix" "array"
```

---

# Data Frames in R

We can coerce a matrix/array to the data frame type, and add columns at the same time, using `data.frame()`.

```r
a_df = data.frame(a_mat, Date=as.Date("1965/5/15") + 1:5)
a_df
```

```
##   v1 v2       Date
## 1 35 20 1965-05-16
## 2  8 10 1965-05-17
## 3 10 11 1965-05-18
## 4  4  1 1965-05-19
## 5 12  2 1965-05-20
```

Note: Check what the function `as.Date()` is for. Why can we add a numeric vector to it?

---

# Data Frames in R

The functions `cbind()` and `rbind()` also work for data frames.
```r
a_df = cbind(a_df, binary=rbinom(5, size = 1, prob = 0.3))
a_df
```

```
##   v1 v2       Date binary
## 1 35 20 1965-05-16      0
## 2  8 10 1965-05-17      0
## 3 10 11 1965-05-18      0
## 4  4  1 1965-05-19      1
## 5 12  2 1965-05-20      1
```

Note: `rbinom()` generates random samples from the binomial distribution. Run `?rbinom()` to check the documentation.

---

# Data Frames in R

* However, when using `rbind()`, the data type of each column in the new rows should match that of the original data frame. Otherwise, R coerces the column (below, `binary` is no longer integer-valued).

```r
rbind(a_df, data.frame(v1=1, v2=32, Date=as.Date("2023/09/27"), binary=-1.1))
```

```
##   v1 v2       Date binary
## 1 35 20 1965-05-16    0.0
## 2  8 10 1965-05-17    0.0
## 3 10 11 1965-05-18    0.0
## 4  4  1 1965-05-19    1.0
## 5 12  2 1965-05-20    1.0
## 6  1 32 2023-09-27   -1.1
```

---

# Subset Rows/Columns of A Data Frame

```r
a_df$v2
```

```
## [1] 20 10 11  1  2
```

```r
a_df$Date[1:3]
```

```
## [1] "1965-05-16" "1965-05-17" "1965-05-18"
```

```r
a_df[,2]
```

```
## [1] 20 10 11  1  2
```

```r
a_df[-(3:4),2]
```

```
## [1] 20 10  2
```

---

# Read Tables into R

So far, we have only created our data frames manually in R.

--

In practice, it is more common to read existing tabular data into R and then carry out our analysis.

There are many different ways to read tables into R. Here are two possible ways:

```r
family_df = read.table(url("https://github.com/zhangyk8/zhangyk8.github.io/raw/master/_teaching/file_stat302/Data/family.txt"), sep = "\t", header = TRUE)
family_df2 = read.csv(url("https://github.com/zhangyk8/zhangyk8.github.io/raw/master/_teaching/file_stat302/Data/family.txt"), sep = "\t", header = TRUE)
all(family_df == family_df2)
```

```
## [1] TRUE
```

The data `family.txt` can be downloaded through the link [https://github.com/zhangyk8/zhangyk8.github.io/raw/master/_teaching/file_stat302/Data/family.txt](https://github.com/zhangyk8/zhangyk8.github.io/raw/master/_teaching/file_stat302/Data/family.txt).
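---

# Read Tables into R (Offline Sketch)

For trying the two readers without a network connection, here is a minimal sketch that writes a small made-up table (not the course data) to a temporary tab-separated file and reads it back. The names `toy_df`, `tmp_path`, `df1`, and `df2` are chosen for illustration only.

```r
# A small made-up table (not the course data)
toy_df = data.frame(
  firstName = c("Ann", "Bob"),
  age = c(30, 41),
  overWt = c(FALSE, TRUE)
)

# Write it to a temporary tab-separated file
tmp_path = tempfile(fileext = ".txt")
write.table(toy_df, file = tmp_path, sep = "\t", row.names = FALSE)

# Read it back in two equivalent ways
df1 = read.table(tmp_path, sep = "\t", header = TRUE)
df2 = read.csv(tmp_path, sep = "\t")  # `header = TRUE` is the default here

all(df1 == df2)                       # TRUE

# Clean up the temporary file
unlink(tmp_path)
```

`read.csv()` is a wrapper around `read.table()` with different defaults (for example, `sep = ","` and `header = TRUE`), which is why the two calls agree once `sep` is set explicitly.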
---

# Read Tables into R

If the data file is in our current working directory, then we do not have to use the function `url()` to access it.

```r
family_df3 = read.table("family.txt", sep = "\t", header = TRUE)
class(family_df3)
```

```
## [1] "data.frame"
```

```r
head(family_df3)
```

```
##   firstName sex age height weight      bmi overWt
## 1       Tom   m  77     70    175 25.16239   TRUE
## 2      Maya   f  33     64    124 21.50106  FALSE
## 3       Joe   m  79     73    185 24.45884  FALSE
## 4    Robert   m  47     67    156 24.48414  FALSE
## 5       Sue   f  27     61     98 18.51492  FALSE
## 6       Liz   f  33     68    190 28.94981   TRUE
```

---

# Post-Analysis After Reading the Tables

```r
# Find all the unique first names
unique(family_df$firstName)
```

```
## [1] "Tom"    "Maya"   "Joe"    "Robert" "Sue"    "Liz"    "Jon"    "Sally" 
## [9] "Tim"    "Ann"    "Dan"    "Art"    "Zoe"
```

```r
# Histogram of BMIs for all individuals
hist(family_df$bmi, xlab = "BMIs", main = "Histogram of BMIs for all individuals")
```

<img src="Lecture2_Data_Structures_files/figure-html/unnamed-chunk-51-1.png" width="40%" style="display: block; margin: auto;" />

---

# Working Directory in R

The working directory is the folder in which R saves and looks for files by default.

* We can check our current working directory using `getwd()`.

```r
getwd()
```

```
## [1] "/media/yikun/Disk_D1/Graduate School/STAT 302/Lectures"
```

* We can change our working directory using `setwd()`.

--

```r
setwd("/media/yikun/Disk_D1/Graduate School/STAT 302/Lectures")
```

Note: Do not change the working directory inside R Markdown files! By default, R Markdown uses the directory containing the file as the working directory.

---

# Saving Tables in R

We can save a single R object as an `.rds` file using `saveRDS()`, and multiple R objects as an `.RData` or `.rda` file using `save()`.
```r
object1 = 1:5
object2 = c("a", "b", "c")
# save only object1
saveRDS(object1, file = "object1_only.rds")
# save object1 and object2
save(object1, object2, file = "both_objects.RData")
```

--

If we want to save a data frame, it is recommended to write it as a `.csv` or `.txt` file.

```r
write.table(family_df, file = "family_newsave.txt", sep = "\t")
write.csv(family_df, file = "family_newsave.csv")
```

---
class: inverse

# Part 5: R Coding Style Guide

---

# Object Names

Use either underscores (`_`) or big camel case (`BigCamelCase`) to separate words within an object/variable name. Try to avoid using dots (`.`) to separate words in the names of R objects and functions!

```r
# Good
day_one
day_1
DayOne

# Bad
dayone
```

---

# Object Names

Names should be concise, meaningful, and (generally) nouns.

```r
# Good
day_one

# Bad
first_day_of_the_month
djm1
```

---

# Object Names

It is **very important** that object names do not overlap with common functions!

```r
# Very extra super bad
c = 7
t = 23
T = FALSE
mean = "something"
```

Note: `T` and `F` are R shorthand for `TRUE` and `FALSE`, respectively. In general, we should spell them out in full.

```r
mean(c(1, 2))
```

```
## [1] 1.5
```

---

# Spacing

Put a space after every comma, just as in English writing.

```r
# Good
x[, 1]

# Bad
x[,1]
x[ ,1]
x[ , 1]
```

Do not put spaces inside or outside parentheses for regular function calls.

```r
# Good
mean(x, na.rm = TRUE)

# Bad
mean (x, na.rm = TRUE)
mean( x, na.rm = TRUE )
```

---

# Spacing with Operators

Most of the time when we are doing math, conditionals, logicals, or assignments, our operators should be surrounded by spaces (e.g., for `==`, `+`, `-`, `<-`, etc.).

```r
# Good
height = (feet * 12) + inches
mean(x, na.rm = 10)

# Bad
height=feet*12+inches
mean(x, na.rm=10)
```

There are some exceptions we will learn more about later, such as the power symbol `^`. See the [Tidyverse Style Guide](https://style.tidyverse.org/) for more details!
---

# Extra Spacing

Adding extra spaces is fine if it improves alignment of `=` or `<-`.

```r
# Good
list(
  total = a + b + c,
  mean  = (a + b + c) / n
)

# Also fine
list(
  total = a + b + c,
  mean = (a + b + c) / n
)
```

---

# Long Lines of Code

Strive to limit our code to 80 characters per line. This fits comfortably on a printed page with a reasonably sized font.

If a function call is too long to fit on a single line, use one line each for the function name, each argument, and the closing `)`. This makes the code easier to read and to modify later.

```r
# Good
do_something_very_complicated(
  something = "that",
  requires = many,
  arguments = "some of which may be long"
)

# Bad
do_something_very_complicated("that", requires, many, arguments,
                              "some of which may be long"
                              )
```

<font size="4"> *Tip! Try RStudio > Preferences > Code > Display > Show Margin with Margin column 80 to give us a visual cue!* </font>

---

# Semicolons

In R, semicolons (`;`) let us put several statements on a single line.

* In general, this is bad practice and should be avoided.

Also, we never need to end lines of code with semicolons!

```r
# Bad
a = 2; b = 3

# Also bad
a = 2;
b = 3;

# Good
a = 2
b = 3
```

---

# Quotes and Strings

Use `"`, not `'`, for quoting text. The only exception is when the text already contains double quotes and no single quotes.

```r
# Bad
'Text'
'Text with "double" and \'single\' quotes'

# Good
"Text"
'Text with "quotes"'
'<a href="http://style.tidyverse.org">A link</a>'
```

--

### Useful References for R Coding Style Guide

* [Tidyverse Style Guide](https://style.tidyverse.org/) by Hadley Wickham.

* [Google Style Guide](https://google.github.io/styleguide/Rguide.html).

These style guides help other people understand our code!

---

# Tidy Data Principles

There are three rules required for data (or a data frame/table) to be considered tidy:

1. Each variable must have its own column.

2. Each observation must have its own row.

3. Each value must have its own cell.

---

# Tidy Data Principles (Example 1)

The rules seem simple, but using them can be tricky! Let's consider the following example. What is untidy about the following data frame?

<table class="table" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr> <th style="text-align:center;"> Hospital </th> <th style="text-align:center;"> Diseased </th> <th style="text-align:center;"> Healthy </th> </tr>
 </thead>
<tbody>
  <tr> <td style="text-align:center;"> A </td> <td style="text-align:center;"> 10 </td> <td style="text-align:center;"> 14 </td> </tr>
  <tr> <td style="text-align:center;"> B </td> <td style="text-align:center;"> 15 </td> <td style="text-align:center;"> 18 </td> </tr>
  <tr> <td style="text-align:center;"> C </td> <td style="text-align:center;"> 12 </td> <td style="text-align:center;"> 13 </td> </tr>
  <tr> <td style="text-align:center;"> D </td> <td style="text-align:center;"> 5 </td> <td style="text-align:center;"> 16 </td> </tr>
</tbody>
</table>

--

* **Variables:** hospital, disease status, and counts.

--

* **Observations:** the number of individuals at a given hospital and of a given disease status.

--

* **Values:** Hospital A, Hospital B, Hospital C, Hospital D, individual count values, *Disease Status "Healthy"*, and *Disease Status "Diseased"*.
---

# Tidy Data Principles (Example 1)

<table class="table" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr> <th style="text-align:center;"> Hospital </th> <th style="text-align:center;"> Diseased </th> <th style="text-align:center;"> Healthy </th> </tr>
 </thead>
<tbody>
  <tr> <td style="text-align:center;"> A </td> <td style="text-align:center;"> 10 </td> <td style="text-align:center;"> 14 </td> </tr>
  <tr> <td style="text-align:center;"> B </td> <td style="text-align:center;"> 15 </td> <td style="text-align:center;"> 18 </td> </tr>
  <tr> <td style="text-align:center;"> C </td> <td style="text-align:center;"> 12 </td> <td style="text-align:center;"> 13 </td> </tr>
  <tr> <td style="text-align:center;"> D </td> <td style="text-align:center;"> 5 </td> <td style="text-align:center;"> 16 </td> </tr>
</tbody>
</table>

The main problem is that the column headers are values, not variables! How can we tidy it up?

--

<table class="table" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr> <th style="text-align:center;"> Hospital </th> <th style="text-align:center;"> Status </th> <th style="text-align:center;"> Count </th> </tr>
 </thead>
<tbody>
  <tr> <td style="text-align:center;"> A </td> <td style="text-align:center;"> Diseased </td> <td style="text-align:center;"> 10 </td> </tr>
  <tr> <td style="text-align:center;"> A </td> <td style="text-align:center;"> Healthy </td> <td style="text-align:center;"> 14 </td> </tr>
  <tr> <td style="text-align:center;"> B </td> <td style="text-align:center;"> Diseased </td> <td style="text-align:center;"> 15 </td> </tr>
  <tr> <td style="text-align:center;"> B </td> <td style="text-align:center;"> Healthy </td> <td style="text-align:center;"> 18 </td> </tr>
  <tr> <td style="text-align:center;"> C </td> <td style="text-align:center;"> Diseased </td> <td style="text-align:center;"> 12 </td> </tr>
  <tr> <td style="text-align:center;"> C </td> <td style="text-align:center;"> Healthy </td> <td style="text-align:center;"> 13 </td> </tr>
  <tr> <td style="text-align:center;"> D </td> <td style="text-align:center;"> Diseased </td> <td style="text-align:center;"> 5 </td> </tr>
  <tr> <td style="text-align:center;"> D </td> <td style="text-align:center;"> Healthy </td> <td style="text-align:center;"> 16 </td> </tr>
</tbody>
</table>

---

# Tidy Data Principles (Example 2)

Let's consider another example:

<table class="table" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr> <th style="text-align:left;"> Country </th> <th style="text-align:right;"> Year </th> <th style="text-align:right;"> m_16_24 </th> <th style="text-align:right;"> m_25_34 </th> <th style="text-align:right;"> f_16_24 </th> <th style="text-align:right;"> f_25_34 </th> </tr>
 </thead>
<tbody>
  <tr> <td style="text-align:left;"> A </td> <td style="text-align:right;"> 2018 </td> <td style="text-align:right;"> 49 </td> <td style="text-align:right;"> 55 </td> <td style="text-align:right;"> 47 </td> <td style="text-align:right;"> 41 </td> </tr>
  <tr> <td style="text-align:left;"> B </td> <td style="text-align:right;"> 2018 </td> <td style="text-align:right;"> 34 </td> <td style="text-align:right;"> 33 </td> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 43 </td> </tr>
</tbody>
</table>

--

* **Variables:** Country, year, gender, age group, and counts.

* **Observations:** the number of individuals in a given country, in a given year, of a given gender, and in a given age group.

* **Values:** Country A, Country B, Year 2018, Gender "m", Gender "f", Age Group "16_24", Age Group "25_34", and individual counts.
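---

# Tidy Data Principles (Example 2)

As a small sketch of why these headers are untidy, the code below splits each of the four count-column names into the two values it encodes. It assumes every header has the form `<gender>_<age range>`, as in the table above; the variable names are chosen for illustration only.

```r
headers = c("m_16_24", "m_25_34", "f_16_24", "f_25_34")

# The first character encodes the gender
gender = substr(headers, 1, 1)

# Everything after the first underscore encodes the age group
age_group = sub("^[a-z]_", "", headers)

data.frame(headers, gender, age_group)
```

```
##   headers gender age_group
## 1 m_16_24      m     16_24
## 2 m_25_34      m     25_34
## 3 f_16_24      f     16_24
## 4 f_25_34      f     25_34
```

Each header thus carries the values of two variables at once, which is what a tidy layout has to separate into columns of their own.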
---

# Tidy Data Principles (Example 2)

The tidy version is as follows:

<table class="table" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr> <th style="text-align:center;"> Country </th> <th style="text-align:center;"> Year </th> <th style="text-align:center;"> Gender </th> <th style="text-align:center;"> Age_Group </th> <th style="text-align:center;"> Counts </th> </tr>
 </thead>
<tbody>
  <tr> <td style="text-align:center;"> A </td> <td style="text-align:center;"> 2018 </td> <td style="text-align:center;"> m </td> <td style="text-align:center;"> 16_24 </td> <td style="text-align:center;"> 49 </td> </tr>
  <tr> <td style="text-align:center;"> A </td> <td style="text-align:center;"> 2018 </td> <td style="text-align:center;"> m </td> <td style="text-align:center;"> 25_34 </td> <td style="text-align:center;"> 55 </td> </tr>
  <tr> <td style="text-align:center;"> A </td> <td style="text-align:center;"> 2018 </td> <td style="text-align:center;"> f </td> <td style="text-align:center;"> 16_24 </td> <td style="text-align:center;"> 47 </td> </tr>
  <tr> <td style="text-align:center;"> A </td> <td style="text-align:center;"> 2018 </td> <td style="text-align:center;"> f </td> <td style="text-align:center;"> 25_34 </td> <td style="text-align:center;"> 41 </td> </tr>
  <tr> <td style="text-align:center;"> B </td> <td style="text-align:center;"> 2018 </td> <td style="text-align:center;"> m </td> <td style="text-align:center;"> 16_24 </td> <td style="text-align:center;"> 34 </td> </tr>
  <tr> <td style="text-align:center;"> B </td> <td style="text-align:center;"> 2018 </td> <td style="text-align:center;"> m </td> <td style="text-align:center;"> 25_34 </td> <td style="text-align:center;"> 33 </td> </tr>
  <tr> <td style="text-align:center;"> B </td> <td style="text-align:center;"> 2018 </td> <td style="text-align:center;"> f </td> <td style="text-align:center;"> 16_24 </td> <td style="text-align:center;"> 50 </td> </tr>
  <tr> <td style="text-align:center;"> B </td> <td style="text-align:center;"> 2018 </td> <td style="text-align:center;"> f </td> <td style="text-align:center;"> 25_34 </td> <td style="text-align:center;"> 43 </td> </tr>
</tbody>
</table>

Note: In R, this can be done via the `pivot_longer()` function in the `tidyr` package. We will discuss this in detail later...

---

# Guidelines for Making Data Tidy

1. Identify the observations, variables, and values.

2. Ensure that each observation has its own row.
    * Be careful about individual observations spreading over multiple tables, Excel files, etc., or multiple types of observations within a single table (this would result in many empty cells).

3. Ensure that each variable has its own column.
    * Be careful about variables spreading over two columns and multiple variables within a single column.

4. Ensure that each value has its own cell.
    * Be careful about values as column headers.

---

# Why Do We Need Tidy Data?

* Easier to read and understand the data.

* More intuitive to analyze and plot the data using R (required for `ggplot2`).

* Fewer issues with missing values.

### Useful References for Tidy Data Principles

Here is a [fantastic reference](https://vita.had.co.nz/papers/tidy-data.pdf) written by Hadley Wickham going through all these principles in detail and with more examples.

---

# Summary

- Data structures allow us to group related values together.

- Vectors group together values with the same data type.

- Arrays add multi-dimensional structure to vectors, while matrices are two-dimensional arrays.

- Lists allow us to combine data of different types and lengths.

- Data frames are hybrids of matrices and lists, allowing each column to have a different data type but the same length.

- Tidy data principles help us better analyze and visualize data tables in R.

Submit Lab 2 on Gradescope by the end of Tuesday (January 23)!!
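---

# Tidy Data Principles (Base R Sketch)

The `pivot_longer()` approach is left for a later lecture. In the meantime, here is a rough base-R sketch of the same reshaping for Example 1, using only tools from this lecture (`rep()`, `t()`, and column-major order). The object names `untidy_df`, `status_cols`, and `tidy_df` are made up for illustration, and this is just one of several possible ways to do it.

```r
# The untidy table from Example 1
untidy_df = data.frame(
  Hospital = c("A", "B", "C", "D"),
  Diseased = c(10, 15, 12, 5),
  Healthy = c(14, 18, 13, 16)
)

# Column headers that are really values of the variable "Status"
status_cols = c("Diseased", "Healthy")

tidy_df = data.frame(
  # Each hospital appears once per status
  Hospital = rep(untidy_df$Hospital, each = length(status_cols)),
  # The statuses cycle within each hospital
  Status = rep(status_cols, times = nrow(untidy_df)),
  # Transposing puts statuses in rows, so column-major order reads the
  # counts hospital by hospital
  Count = as.vector(t(as.matrix(untidy_df[, status_cols])))
)
tidy_df
```

The result has one row per hospital and disease status, matching the tidy table shown for Example 1.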