class: center, middle, inverse, title-slide .title[ # Introduction to
] .author[ ### Evangeline Warren ] .institute[ ### R Working Group ] .date[ ### Jan 20th, 2023 ] --- # Goals for this session * Learn about... -- + basic R syntax + different R objects (things that hold data) & **indexing** them + useful functions for working with data -- * Become familiar with [R Studio](https://posit.co/download/rstudio-desktop/) & develop good coding habits -- * R Studio is an *additional* program that provides many useful features for working with R * (you need to download and install both [R](https://cran.r-project.org/) and [R Studio](https://posit.co/download/rstudio-desktop/)) --- class: inverse, center, middle # R Studio --- # R Studio * Let's dive in by starting R Studio and opening a new R script + menu bar: `File` → `New File` → `R Script` + (in R: `File` → `New Script`) -- * You should now have 4 panes open (like on the next slide) + **Source** -- Our script where we will type and save our comments & commands + **Console** -- Where we can give R commands and where the output will appear + **Output** -- File explorer, plots, help files, and more! + **Environments** -- Useful information about the R session --- .center[<img src="img/rstudio-panes-labeled.jpeg" style="width: 75%" />] .center[.bottom[downloaded from [user guide on postit.co](https://docs.posit.co/ide/user/ide/guide/ui/ui-panes.html)]] --- # R Studio: Good Habits * Add a comment to our new script: ```r # Comment: My R script from Working Group Session (1/20/2023) # (R ignores all lines that begin with a pound/hash/number sign/#) ``` -- * Save our script + menu bar: `File` → `Save As...` -- * Set our **working directory** + this is where R will start looking for & saving files (e.g., data files or plots) + menu bar: `Session` → `Set Working Directory` → <br>         `Choose Directory...` --- class: inverse, center, middle # Basic R Syntax --- # Basic R Syntax * R syntax takes the form ```r # object_name <- object_value mean_age <- 33 ``` -- * The symbol "`<-`" is called the assignment operator + we are creating a new variable called `mean_age` and assigning it a value of 33 + `mean_age = 33` will also work (but `<-` is the convention) --- class: slide-font-25 # Basic R Syntax (cont.) If we enter the name of a variable in the `Console`, then R will list the value(s) ```r > Mean_age2 <- 22 ## note: object names are case-sensitive > Mean_age2 ``` ``` ## [1] 22 ``` -- BUT we are in the business of good habits... * type this syntax into our script and (with the cursor on the same line) press the following keys together: + On a Mac: `<command> <enter>` + In Windows: `<control> <enter>`   (in R Studio) <br>           `<control> r`         (in the R app) * these keyboard shortcuts will run the syntax on the line in the `Console` <br> (or you can highlight a region) --- # Basic R Syntax: functions We have seen a simple object for holding data, but R has many useful **functions** ```r ls() # list all the objects in memory rm(Mean_age2) # remove the object called Mean_age2 getwd() # print the working directory dir() # list the files in the current directory dir("../") # list the files in the parent directory save.image("my_data.RData") # save all the objects in memory load("my_data.RData") # load all the objects in the data file ``` -- *Quick note*: * suppose you create an object called `abc` that holds the value 2 * then you load a file `data.RData` that also has an object named `abc` but holds the value 99 * the first version of the object (`abc` holding 2) will get replaced --- # Basic R Syntax: help files * Google searches are a very effective way to find help + and so is asking the R Working Group 😎 -- * R documentation can be accessed in the `Help` tab in the `Output` pane -- * Some additional syntax and functions ```r ?read.csv # show the help file for the function read.csv help.search("weighted mean") # search help files for the phrase'weighted mean' ``` --- class: inverse, center, middle # Data Structures in R --- ## **Data Structures**: motivation We are not going to solve the world's problems with a single number... ```r > all_ages <- c(22, 33, 44, 55) # c() concatenates numbers together > all_ages ``` ``` ## [1] 22 33 44 55 ``` ```r > mean(all_ages) # calculate the mean ``` ``` ## [1] 38.5 ``` ```r > all_ed <- c("HS", "Col", "Grad Sch", "HS") > all_ed ``` ``` ## [1] "HS" "Col" "Grad Sch" "HS" ``` --- ## **Data Structures**: motivation (cont.) R handles different *types* of data as well ```r > important_data <- c("OSU", "R", "Group", 4) > important_data ``` ``` ## [1] "OSU" "R" "Group" "4" ``` Wait, what is going on here? -- * we are mixing different types of data & R assumes that we just forgot to wrap the 4 in quotation marks * sometimes R's assumptions are useful, sometimes they are not! 🤔 --- ## **Data Structures**: motivation (cont.) Here is another example with missing data ```r > test_scores <- c(88, 99, 110, 66, NA) # NA is for missing values > mean_scores <- mean(test_scores) > mean_scores / 100 ``` ``` ## [1] NA ``` 😾 Ugh! Why didn't R tell me there was a problem when I tried to calculate the mean?!? -- * another R assumption * can you figure out how to calculate the mean for non-missing values? (help file is helpful 😄) --- ## **Data Structures**: vectors * We have been creating **vectors** when we use `c()` to concatenate data -- * Here are some more useful functions for working with vectors ```r > # test that we have a vector > is.vector(test_scores) # returns another data type: TRUE or FALSE (called logical) ``` ``` ## [1] TRUE ``` ```r > summary(test_scores) # numerical summary (less helpful for strings) ``` ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's ## 66.00 82.50 93.50 90.75 101.75 110.00 1 ``` --- ## **Data Structures**: vectors (cont.) ```r > length(test_scores) # how many elements in the vector ``` ``` ## [1] 5 ``` ```r > is.na(test_scores) # test if each element is NA ``` ``` ## [1] FALSE FALSE FALSE FALSE TRUE ``` ```r > TRUE + TRUE + FALSE # useful trick with logical objects (TRUE/FALSE) ``` ``` ## [1] 2 ``` ```r > n_missing <- sum(is.na(test_scores)) > n_missing ``` ``` ## [1] 1 ``` --- ## **Data Structures**: indexing vectors We can access the `\(i^{th}\)` element in a vector with the syntax `vector_name[ i ]` ```r > test_scores[1] # first element ``` ``` ## [1] 88 ``` ```r > test_scores[2] # second element ``` ``` ## [1] 99 ``` -- ```r > 1:3 # a vector of c(1, 2, 3) ``` ``` ## [1] 1 2 3 ``` ```r > # so what will test_scores[3:1] give us? ``` --- ## **Data Structures**: indexing vectors (cont.) The syntax   `3:1`   gives the vector   `c(3, 2, 1)`, so... -- ```r > test_scores[3:1] # returns 3rd element, then the 2nd, then the first ``` ``` ## [1] 110 99 88 ``` ```r > test_scores # sanity check ``` ``` ## [1] 88 99 110 66 NA ``` -- * So what will the following command do? 🤔 ```r test_scores[c(3, 5, 11)] ``` --- ## **Data Structures**: changing vectors We can use indexing to change vectors as well, e.g., reassign the first element ```r > test_scores[1] <- NA # change the first element to NA > test_scores[1] ``` ``` ## [1] NA ``` -- Again, we can use vectors to index as well: ```r index_missing_scores <- is.na(test_scores) # create an index vector of TRUE & FALSE test_scores[index_missing_scores] <- -99 # change NA to -99 ``` -- Let's walk through this... <br> (🦉 but note a good habit would be to create a new vector, `new_test_scores`, so we can retain the original data!) --- ## **Data Structures**: changing vectors (cont.) ```r > # create an index vector of TRUE & FALSE > index_missing_scores <- is.na(test_scores) > index_missing_scores ``` ``` ## [1] TRUE FALSE FALSE FALSE TRUE ``` -- ```r > # attach these 2 vectors together as columns > cbind(index_missing_scores, test_scores) ``` ``` ## index_missing_scores test_scores ## [1,] 1 NA ## [2,] 0 99 ## [3,] 0 110 ## [4,] 0 66 ## [5,] 1 NA ``` -- * with `cbind` we are actually creating a new **data structure** called a **matrix** -- * as we will see, matrices can only hold the same *data type*, so R changes `TRUE`/`FALSE` to `1`/`0` (respectively) --- ## **Data Structures**: changing vectors (cont.) ```r > test_scores[index_missing_scores] # access all of the indices with TRUE ``` ``` ## [1] NA NA ``` -- ```r > # recode NA to -99 > test_scores[index_missing_scores] <- -99 > test_scores ``` ``` ## [1] -99 99 110 66 -99 ``` --- ## **Data Structures**: changing vectors (recap) When you want to change a vector, do the *delta 2-step*: 1. create an index vector that identifies the elements you want to change * what data type should this vector hold? * `logical`, i.e. `TRUE`s and `FALSE`s 2. assign new values to the vector using your vector of indices --- ## **Data Structures**: vector recap * We are not going to become 💰 famous 💰 by working with a single vector -- * However, we have learned a powerful way to work with vectors, **indexing**, that extends to other types of **data structures** -- * A **matrix** made a brief appearance earlier, but before going further let's review a useful framework for thinking about **data structures** --- ## **Data Structures**: overview R has different structures for holding data, which can be organized by... -- 1. How many dimensions does the structure have? -- 2. Do the types of data need to be the same? -- * Example: **vectors** + only 1 dimension (it is just a single row or a column) + we saw earlier that R changes the elements so they all have the same data type (e.g., `4` → `"4"`) -- * We'll now (re)introduce different data structures, and learn about different data types along the way. --- ## **Data Structures**: overview (cont.) * **Vectors** 1. 1 dimension 1. same data type + special case: **factor** (predefined categories) * **Matrices** 1. rows and columns 1. same data type * **Arrays** 1. any number of dimensions 1. same data type --- ## **Data Structures**: overview (cont.) * **Data Frames** 1. rows and columns 1. different data types - particularly useful for holding a data set with quantitative & qualitative variables * **Lists** 1. 1 dimension 1. different data types (or structures!) - actually, this is just a special type of vector (can you verify this?) --- ## **Data Structures**: working with data frames * For the rest of this session we will focus on **Data frames**, the R structure typically used for data sets (i.e., variables as columns and an observation for each row). -- * Let's get some practice working with data frames using one of R's example data sets ```r > data(mtcars) ## load one of R's example data sets mtcars > ls() ``` ``` ## [1] "all_ages" "all_ed" "important_data" ## [4] "index_missing_scores" "Mean_age2" "mean_scores" ## [7] "mtcars" "n_missing" "test_scores" ``` ```r > is.data.frame(mtcars) ## check that mtcars is a data frame ``` ``` ## [1] TRUE ``` --- ## **Data Structures**: reading in data sets Before we proceed with `mtcars`, a quick example of how to read in a data set. ```r > # write data to a CSV file called 'copy_mtcars.csv' in the working directory > write.csv(mtcars, "copy_mtcars.csv") > mtcars2 <- read.csv("copy_mtcars.csv") # load data set from CSV file > ls() ``` ``` ## [1] "all_ages" "all_ed" "important_data" ## [4] "index_missing_scores" "Mean_age2" "mean_scores" ## [7] "mtcars" "mtcars2" "n_missing" ## [10] "test_scores" ``` ```r > is.data.frame(mtcars2) ``` ``` ## [1] TRUE ``` --- ## **Data Structures**: exploring data frames * Since **data frames** have 2 dimensions, the index requires 2 pieces of info: `[row index, column index]` ```r > mtcars[1, 1] # 1st observation in 1st variable ``` ``` ## [1] 21 ``` -- * Many times, however, we just work with one variable/column at a time, so all our skills working with vectors still apply ```r > # if we leave out the row part of the address, we get all rows and a vector > is.vector(mtcars[, 1]) ``` ``` ## [1] TRUE ``` --- ## **Data Structures**: exploring data frames * A useful way to access a single column in a data frame is to use `$` ```r > names(mtcars) ## print the variable names ``` ``` ## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" ## [11] "carb" ``` ```r > mtcars$mpg ## return the mpg variable ``` ``` ## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4 ## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7 ## [31] 15.0 21.4 ``` -- * Now we will (re)introduce several functions for exploring data frames * We will also see a more advanced example of indexing --- ## **Data Frames**: exploring columns (cont.) ```r > dim(mtcars) ## print the number of rows and columns ``` ``` ## [1] 32 11 ``` ```r > str(mtcars) ## print structure of data frame ``` ``` ## 'data.frame': 32 obs. of 11 variables: ## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ... ## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ... ## $ disp: num 160 160 108 258 360 ... ## $ hp : num 110 110 93 110 175 105 245 62 95 123 ... ## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ... ## $ wt : num 2.62 2.88 2.32 3.21 3.44 ... ## $ qsec: num 16.5 17 18.6 19.4 17 ... ## $ vs : num 0 0 1 1 0 1 0 1 1 1 ... ## $ am : num 1 1 1 0 0 0 0 0 0 0 ... ## $ gear: num 4 4 4 3 3 3 3 4 4 4 ... ## $ carb: num 4 4 1 1 2 1 4 2 2 4 ... ``` --- ## **Data Frames**: summarizing columns ```r > summary(mtcars) ``` ``` ## mpg cyl disp hp ## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0 ## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5 ## Median :19.20 Median :6.000 Median :196.3 Median :123.0 ## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7 ## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0 ## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0 ## drat wt qsec vs ## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000 ## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000 ## Median :3.695 Median :3.325 Median :17.71 Median :0.0000 ## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375 ## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000 ## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000 ## am gear carb ## Min. :0.0000 Min. :3.000 Min. :1.000 ## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000 ## Median :0.0000 Median :4.000 Median :2.000 ## Mean :0.4062 Mean :3.688 Mean :2.812 ## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000 ## Max. :1.0000 Max. :5.000 Max. :8.000 ``` --- ## **Data Frames**: exploring columns (cont.) An alternative ways to access a data frame's variable(s): ```r > mtcars[["mpg"]] ``` ``` ## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4 ## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7 ## [31] 15.0 21.4 ``` ```r > mtcars[, c("mpg", "cyl")] ``` ``` ## mpg cyl ## Mazda RX4 21.0 6 ## Mazda RX4 Wag 21.0 6 ## Datsun 710 22.8 4 ## Hornet 4 Drive 21.4 6 ## Hornet Sportabout 18.7 8 ## Valiant 18.1 6 ## Duster 360 14.3 8 ## Merc 240D 24.4 4 ## Merc 230 22.8 4 ## Merc 280 19.2 6 ## Merc 280C 17.8 6 ## Merc 450SE 16.4 8 ## Merc 450SL 17.3 8 ## Merc 450SLC 15.2 8 ## Cadillac Fleetwood 10.4 8 ## Lincoln Continental 10.4 8 ## Chrysler Imperial 14.7 8 ## Fiat 128 32.4 4 ## Honda Civic 30.4 4 ## Toyota Corolla 33.9 4 ## Toyota Corona 21.5 4 ## Dodge Challenger 15.5 8 ## AMC Javelin 15.2 8 ## Camaro Z28 13.3 8 ## Pontiac Firebird 19.2 8 ## Fiat X1-9 27.3 4 ## Porsche 914-2 26.0 4 ## Lotus Europa 30.4 4 ## Ford Pantera L 15.8 8 ## Ferrari Dino 19.7 6 ## Maserati Bora 15.0 8 ## Volvo 142E 21.4 4 ``` --- ## **Data Frames**: creating new variables ```r > mtcars$mpg_squared <- mtcars$mpg * mtcars$mpg > mtcars[, c("mpg", "mpg_squared")] ``` ``` ## mpg mpg_squared ## Mazda RX4 21.0 441.00 ## Mazda RX4 Wag 21.0 441.00 ## Datsun 710 22.8 519.84 ## Hornet 4 Drive 21.4 457.96 ## Hornet Sportabout 18.7 349.69 ## Valiant 18.1 327.61 ## Duster 360 14.3 204.49 ## Merc 240D 24.4 595.36 ## Merc 230 22.8 519.84 ## Merc 280 19.2 368.64 ## Merc 280C 17.8 316.84 ## Merc 450SE 16.4 268.96 ## Merc 450SL 17.3 299.29 ## Merc 450SLC 15.2 231.04 ## Cadillac Fleetwood 10.4 108.16 ## Lincoln Continental 10.4 108.16 ## Chrysler Imperial 14.7 216.09 ## Fiat 128 32.4 1049.76 ## Honda Civic 30.4 924.16 ## Toyota Corolla 33.9 1149.21 ## Toyota Corona 21.5 462.25 ## Dodge Challenger 15.5 240.25 ## AMC Javelin 15.2 231.04 ## Camaro Z28 13.3 176.89 ## Pontiac Firebird 19.2 368.64 ## Fiat X1-9 27.3 745.29 ## Porsche 914-2 26.0 676.00 ## Lotus Europa 30.4 924.16 ## Ford Pantera L 15.8 249.64 ## Ferrari Dino 19.7 388.09 ## Maserati Bora 15.0 225.00 ## Volvo 142E 21.4 457.96 ``` --- ## **Data Frames**: more on indexing When creating an index, we can also use multiple conditions * to satisfy EITHER condition use `|` (or) * to satisfy BOTH conditions use `&` (and) ```r > mtcars$mpg > 20 | mtcars$mpg < 25 ``` ``` ## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE ## [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE ## [31] TRUE TRUE ``` --- ## **Data Frames**: more on indexing (cont.) (remember: variables are just vectors, so we can use what we learned earlier) ```r > cbind(mtcars$mpg, mtcars$mpg < 20 | mtcars$mpg > 30) ``` ``` ## [,1] [,2] ## [1,] 21.0 0 ## [2,] 21.0 0 ## [3,] 22.8 0 ## [4,] 21.4 0 ## [5,] 18.7 1 ## [6,] 18.1 1 ## [7,] 14.3 1 ## [8,] 24.4 0 ## [9,] 22.8 0 ## [10,] 19.2 1 ## [11,] 17.8 1 ## [12,] 16.4 1 ## [13,] 17.3 1 ## [14,] 15.2 1 ## [15,] 10.4 1 ## [16,] 10.4 1 ## [17,] 14.7 1 ## [18,] 32.4 1 ## [19,] 30.4 1 ## [20,] 33.9 1 ## [21,] 21.5 0 ## [22,] 15.5 1 ## [23,] 15.2 1 ## [24,] 13.3 1 ## [25,] 19.2 1 ## [26,] 27.3 0 ## [27,] 26.0 0 ## [28,] 30.4 1 ## [29,] 15.8 1 ## [30,] 19.7 1 ## [31,] 15.0 1 ## [32,] 21.4 0 ``` --- ## **Data Frames**: more on indexing (cont.) And we can use multiple variables ```r > table(mtcars$mpg > 30 & mtcars$cyl == 6) ``` ``` ## ## FALSE ## 32 ``` ```r > table(mtcars$mpg > 30 & mtcars$cyl == 4) ``` ``` ## ## FALSE TRUE ## 28 4 ``` --- ## **Data Frames**: final indexing example ```r > hi_mpg <- mtcars$mpg > mean(mtcars$mpg) > hi_cyl <- mtcars$cyl == 4 > table(hi_mpg, hi_cyl) ``` ``` ## hi_cyl ## hi_mpg FALSE TRUE ## FALSE 18 0 ## TRUE 3 11 ``` --- ## **Data Frames**: final indexing example (cont.) ```r > mtcars$good_car <- FALSE > mtcars$good_car[hi_mpg & hi_cyl] <- TRUE > table(mtcars$good_car) ``` ``` ## ## FALSE TRUE ## 21 11 ``` --- ## **Data Frames**: final indexing example (cont.) Sanity check ```r > # cbind(mtcars$good_car, hi_mpg, hi_cyl, mtcars$mpg, mtcars$cyl) > cbind(mtcars$good_car, hi_mpg, hi_cyl) ``` ``` ## hi_mpg hi_cyl ## [1,] FALSE TRUE FALSE ## [2,] FALSE TRUE FALSE ## [3,] TRUE TRUE TRUE ## [4,] FALSE TRUE FALSE ## [5,] FALSE FALSE FALSE ## [6,] FALSE FALSE FALSE ## [7,] FALSE FALSE FALSE ## [8,] TRUE TRUE TRUE ## [9,] TRUE TRUE TRUE ## [10,] FALSE FALSE FALSE ## [11,] FALSE FALSE FALSE ## [12,] FALSE FALSE FALSE ## [13,] FALSE FALSE FALSE ## [14,] FALSE FALSE FALSE ## [15,] FALSE FALSE FALSE ## [16,] FALSE FALSE FALSE ## [17,] FALSE FALSE FALSE ## [18,] TRUE TRUE TRUE ## [19,] TRUE TRUE TRUE ## [20,] TRUE TRUE TRUE ## [21,] TRUE TRUE TRUE ## [22,] FALSE FALSE FALSE ## [23,] FALSE FALSE FALSE ## [24,] FALSE FALSE FALSE ## [25,] FALSE FALSE FALSE ## [26,] TRUE TRUE TRUE ## [27,] TRUE TRUE TRUE ## [28,] TRUE TRUE TRUE ## [29,] FALSE FALSE FALSE ## [30,] FALSE FALSE FALSE ## [31,] FALSE FALSE FALSE ## [32,] TRUE TRUE TRUE ``` --- ## **Recap & Moving Forward** * You should now be familiar with a few of R's data structures + (and for knowing when they should be used: # of dimensions & data types) * We have also been introduced to some useful functions for manipulating, summarizing, and exploring data + There are many more(!) and users contribute **R packages** that implement a wide range of tools, models, and methods: [list of some packages on CRAN](https://cran.r-project.org/) --- ## **Recap & Moving Forward** (cont.) * R comes installed with many packages that you can explore & access with the `library()` function ```r # library() # list all the packages installed on your computer library(stats) # load the stats package # help(package="stats") # look at the package documentation ``` * In future session, we will explore some of these packages that are particularly useful for + data manipulation: [dplyr](https://dplyr.tidyverse.org/) + plotting: [ggplot2](https://ggplot2.tidyverse.org/) * Please join us 😄