Learn about...
basic R syntax
different R objects (things that hold data) & indexing them
useful functions for working with data
Become familiar with R Studio & develop good coding habits
Let's dive in by starting R Studio and opening a new R script
(File → New File → R Script, or File → New Script)
You should now have 4 panes open (like on the next slide)
Add a comment to our new script:
# Comment: My R script from Working Group Session (1/20/2023)
# (R ignores all lines that begin with a pound/hash/number sign/#)
Save our script
File → Save As...
Set our working directory
Session → Set Working Directory → Choose Directory...
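The menu step can also be written in the script itself, so the script records where the work happened. A minimal sketch using base R's getwd() and setwd(); note that tempdir() is only a stand-in folder that is guaranteed to exist (an assumption for illustration), so swap in your own project folder.

```r
# remember where we started
old_dir <- getwd()

# tempdir() is only a placeholder; use your own project folder instead
setwd(tempdir())
getwd()   # print the new working directory

# go back to the original directory
setwd(old_dir)
```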
# object_name <- object_value
mean_age <- 33
The symbol "<-" is called the assignment operator
we are creating a new variable called mean_age and assigning it a value of 33
mean_age = 33 will also work (but <- is the convention)
If we enter the name of a variable in the Console, then R will list the value(s)
> Mean_age2 <- 22 ## note: object names are case-sensitive
> Mean_age2
## [1] 22
BUT we are in the business of good habits...
type this syntax into our script and (with the cursor on the same line) press the following keys together:
On a Mac: <command> <enter>
In Windows: <control> <enter> (in R Studio) or <control> r (in the R app)
these keyboard shortcuts will run the syntax on the line in the Console (or you can highlight a region)
We have seen a simple object for holding data, but R has many useful functions
ls() # list all the objects in memory
rm(Mean_age2) # remove the object called Mean_age2
getwd() # print the working directory
dir() # list the files in the current directory
dir("../") # list the files in the parent directory
save.image("my_data.RData") # save all the objects in memory
load("my_data.RData") # load all the objects in the data file
Quick note: if we have an object in memory named abc that holds the value 2, and we load a file data.RData that also has an object named abc but holds the value 99, then the object in memory (abc holding 2) will get replaced
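Here is a small sketch of that situation so you can watch it happen. The file is a throwaway created with tempfile() (an assumption for illustration, standing in for data.RData), and save() is used instead of save.image() so that only abc goes into the file.

```r
data_file <- tempfile(fileext = ".RData")  # throwaway file, stands in for data.RData

abc <- 99
save(abc, file = data_file)   # the file now holds abc = 99

abc <- 2          # the object in memory holds 2
load(data_file)   # R gives no warning here
abc               # 99: the value 2 has been replaced
```

One way to stay safe is to run ls() before loading a file, so you know which names are already in use.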
Google searches are a very effective way to find help
R documentation can be accessed in the Help tab in the Output pane
Some additional syntax and functions
?read.csv # show the help file for the function read.csv
help.search("weighted mean") # search help files for the phrase 'weighted mean'
We are not going to solve the world's problems with a single number...
> all_ages <- c(22, 33, 44, 55) # c() concatenates numbers together
> all_ages
## [1] 22 33 44 55
> mean(all_ages) # calculate the mean
## [1] 38.5
> all_ed <- c("HS", "Col", "Grad Sch", "HS")
> all_ed
## [1] "HS" "Col" "Grad Sch" "HS"
R handles different types of data as well
> important_data <- c("OSU", "R", "Group", 4)
> important_data
## [1] "OSU"   "R"     "Group" "4"
Wait, what is going on here?
we are mixing different types of data & R assumes that we just forgot to wrap the 4 in quotation marks
sometimes R's assumptions are useful, sometimes they are not! 🤔
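To see the conversion for yourself, class() reports the type a vector ended up with. This is a small sketch, not part of the original session; as_numbers is a name made up for illustration, and suppressWarnings() is only there to keep the console quiet while as.numeric() turns the non-numbers into NA.

```r
important_data <- c("OSU", "R", "Group", 4)
class(important_data)            # "character": the 4 was converted to "4"

# converting back: only elements that look like numbers survive
as_numbers <- suppressWarnings(as.numeric(important_data))
as_numbers                       # NA NA NA 4

# mixing logical and numeric: TRUE/FALSE become 1/0
class(c(TRUE, 2))                # "numeric"
```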
Here is another example with missing data
> test_scores <- c(88, 99, 110, 66, NA) # NA is for missing values
> mean_scores <- mean(test_scores)
> mean_scores / 100
## [1] NA
😾 Ugh! Why didn't R tell me there was a problem when I tried to calculate the mean?!?
another R assumption
can you figure out how to calculate the mean for non-missing values? (help file is helpful 😄)
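Spoiler for the question above (skip it if you want to try first): the help file for mean() documents an na.rm argument. A minimal sketch of two ways to get the mean of the non-missing values; the second one uses !, which means "not" and flips TRUE and FALSE.

```r
test_scores <- c(88, 99, 110, 66, NA)

mean(test_scores)                 # NA, because one value is missing
mean(test_scores, na.rm = TRUE)   # 90.75, the mean of the 4 non-missing values

# the same idea with indexing: keep only the elements that are not NA
mean(test_scores[!is.na(test_scores)])
```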
We have been creating vectors when we use c() to concatenate data
Here are some more useful functions for working with vectors
> # test that we have a vector
> is.vector(test_scores) # returns another data type: TRUE or FALSE (called logical)
## [1] TRUE
> summary(test_scores) # numerical summary (less helpful for strings)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   66.00   82.50   93.50   90.75  101.75  110.00       1 
> length(test_scores) # how many elements in the vector
## [1] 5
> is.na(test_scores) # test if each element is NA
## [1] FALSE FALSE FALSE FALSE TRUE
> TRUE + TRUE + FALSE # useful trick with logical objects (TRUE/FALSE)
## [1] 2
> n_missing <- sum(is.na(test_scores))
> n_missing
## [1] 1
We can access the ith element in a vector with the syntax vector_name[ i ]
> test_scores[1] # first element
## [1] 88
> test_scores[2] # second element
## [1] 99
> 1:3 # a vector of c(1, 2, 3)
## [1] 1 2 3
> # so what will test_scores[3:1] give us?
The syntax 3:1 gives the vector c(3, 2, 1), so...
> test_scores[3:1] # returns 3rd element, then the 2nd, then the first
## [1] 110 99 88
> test_scores # sanity check
## [1] 88 99 110 66 NA
We can index with any vector of positions, e.g., test_scores[c(3, 5, 11)]
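It is worth running that last one, because position 11 does not exist in a vector of length 5. A short sketch of what R does in that case (picked is a name made up for illustration); the negative index at the end is an extra trick that was not covered above.

```r
test_scores <- c(88, 99, 110, 66, NA)

picked <- test_scores[c(3, 5, 11)]
picked    # 110 NA NA: element 5 is a stored NA, element 11 does not exist

# a negative index drops elements instead of selecting them
test_scores[-1]   # everything except the first element
```

Note that R returns NA for the out-of-range position without any error, which is another assumption to keep an eye on.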
We can use indexing to change vectors as well, e.g., reassign the first element
> test_scores[1] <- NA # change the first element to NA
> test_scores[1]
## [1] NA
Again, we can use vectors to index as well:
index_missing_scores <- is.na(test_scores) # create an index vector of TRUE & FALSE
test_scores[index_missing_scores] <- -99 # change NA to -99
Let's walk through this...
(🦉 but note a good habit would be to create a new vector, new_test_scores, so we can retain the original data!)
> # create an index vector of TRUE & FALSE
> index_missing_scores <- is.na(test_scores)
> index_missing_scores
## [1]  TRUE FALSE FALSE FALSE  TRUE
> # attach these 2 vectors together as columns
> cbind(index_missing_scores, test_scores)
##      index_missing_scores test_scores
## [1,]                    1          NA
## [2,]                    0          99
## [3,]                    0         110
## [4,]                    0          66
## [5,]                    1          NA
With cbind we are actually creating a new data structure called a matrix
R converts TRUE/FALSE to 1/0 (respectively)
> test_scores[index_missing_scores] # access all of the indices with TRUE
## [1] NA NA
> # recode NA to -99
> test_scores[index_missing_scores] <- -99
> test_scores
## [1] -99 99 110 66 -99
When you want to change a vector, do the delta 2-step:
create an index vector that identifies the elements you want to change (a logical vector, i.e. TRUEs and FALSEs)
assign new values to the vector using your vector of indices
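Putting the 2-step together with the good habit mentioned earlier, here is a sketch that makes the changes on a copy (new_test_scores) so the original vector stays untouched. The starting values are the test scores after the first element was set to NA.

```r
test_scores <- c(NA, 99, 110, 66, NA)

# keep the original; make the changes on a copy
new_test_scores <- test_scores

# step 1: index vector that identifies the elements to change
index_missing_scores <- is.na(new_test_scores)

# step 2: assign new values using the index vector
new_test_scores[index_missing_scores] <- -99

new_test_scores   # -99  99 110  66 -99
test_scores       # unchanged, still has the NAs
```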
We are not going to become 💰 famous 💰 by working with a single vector
However, we have learned a powerful way to work with vectors, indexing, that extends to other types of data structures
A matrix made a brief appearance earlier, but before going further let's review a useful framework for thinking about data structures
R has different structures for holding data, which can be organized by...
How many dimensions does the structure have?
Do the types of data need to be the same?
Example: vectors have 1 dimension and hold a single type of data (which is why 4 → "4")
Vectors
Matrices
Arrays
Data Frames
Lists
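To make the two questions concrete, here is a sketch that builds one small example of each structure. The object names (v, m, a, l, d) and the values are made up for illustration; the comments answer "how many dimensions?" and "same type?" for each one.

```r
v <- c(1, 2, 3)                          # vector: 1 dimension, one type
m <- matrix(1:6, nrow = 2)               # matrix: 2 dimensions, one type
a <- array(1:24, dim = c(2, 3, 4))       # array: any number of dimensions, one type
l <- list(1, "a", TRUE)                  # list: 1 dimension, types can differ
d <- data.frame(id = 1:2,                # data frame: 2 dimensions,
                name = c("a", "b"))      # columns can have different types

dim(m)       # 2 3
dim(a)       # 2 3 4
dim(d)       # 2 2
str(l)       # each element keeps its own type
```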
For the rest of this session we will focus on Data frames, the R structure typically used for data sets (i.e., variables as columns and an observation for each row).
Let's get some practice working with data frames using one of R's example data sets
> data(mtcars) ## load one of R's example data sets mtcars
> ls()
## [1] "all_ages"             "all_ed"               "important_data"      
## [4] "index_missing_scores" "Mean_age2"            "mean_scores"         
## [7] "mtcars"               "n_missing"            "test_scores"
> is.data.frame(mtcars) ## check that mtcars is a data frame
## [1] TRUE
Before we proceed with mtcars, a quick example of how to read in a data set.
> # write data to a CSV file called 'copy_mtcars.csv' in the working directory
> write.csv(mtcars, "copy_mtcars.csv")
> mtcars2 <- read.csv("copy_mtcars.csv") # load data set from CSV file
> ls()
##  [1] "all_ages"             "all_ed"               "important_data"      
##  [4] "index_missing_scores" "Mean_age2"            "mean_scores"         
##  [7] "mtcars"               "mtcars2"              "n_missing"           
## [10] "test_scores"
> is.data.frame(mtcars2)
## [1] TRUE
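One detail to watch with this round trip: write.csv() saves the row names (here, the car names) as the first column, so the copy comes back with one extra column. A sketch of how to check this and one way to handle it with read.csv()'s row.names argument; csv_file and mtcars3 are names made up for illustration, and tempfile() is used so nothing is written to your working directory.

```r
data(mtcars)
csv_file <- tempfile(fileext = ".csv")   # throwaway path instead of the working directory

write.csv(mtcars, csv_file)              # row names (car names) become the first column
mtcars2 <- read.csv(csv_file)
dim(mtcars2)                             # one more column than mtcars
names(mtcars2)[1]                        # "X": the car names, read back as an ordinary column

# row.names = 1 tells read.csv to use the first column as row names
mtcars3 <- read.csv(csv_file, row.names = 1)
dim(mtcars3)
```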
We can access the elements of a data frame with [row index, column index]
> mtcars[1, 1] # 1st observation in 1st variable
## [1] 21
> # if we leave out the row part of the address, we get all rows and a vector
> is.vector(mtcars[, 1])
## [1] TRUE
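Getting a vector back from a single column is another of R's assumptions. If you would rather keep a data frame, the drop = FALSE argument of [ ] does that. A small sketch (one_col_vector and one_col_df are names made up for illustration):

```r
data(mtcars)

one_col_vector <- mtcars[, 1]                # a plain numeric vector
one_col_df     <- mtcars[, 1, drop = FALSE]  # still a data frame with 1 column

is.vector(one_col_vector)       # TRUE
is.data.frame(one_col_df)       # TRUE
dim(one_col_df)                 # 32 1

# leaving out the column part returns whole rows, as a data frame
mtcars[1:2, ]
```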
We can also access a variable by name with $
> names(mtcars) ## print the variable names
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"
> mtcars$mpg ## return the mpg variable
##  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4
> dim(mtcars) ## print the number of rows and columns
## [1] 32 11
> str(mtcars) ## print structure of data frame
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
> summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000  
Alternative ways to access a data frame's variable(s):
> mtcars[["mpg"]]
##  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4
> mtcars[, c("mpg", "cyl")]
##                      mpg cyl
## Mazda RX4           21.0   6
## Mazda RX4 Wag       21.0   6
## Datsun 710          22.8   4
## Hornet 4 Drive      21.4   6
## Hornet Sportabout   18.7   8
## Valiant             18.1   6
## Duster 360          14.3   8
## Merc 240D           24.4   4
## Merc 230            22.8   4
## Merc 280            19.2   6
## Merc 280C           17.8   6
## Merc 450SE          16.4   8
## Merc 450SL          17.3   8
## Merc 450SLC         15.2   8
## Cadillac Fleetwood  10.4   8
## Lincoln Continental 10.4   8
## Chrysler Imperial   14.7   8
## Fiat 128            32.4   4
## Honda Civic         30.4   4
## Toyota Corolla      33.9   4
## Toyota Corona       21.5   4
## Dodge Challenger    15.5   8
## AMC Javelin         15.2   8
## Camaro Z28          13.3   8
## Pontiac Firebird    19.2   8
## Fiat X1-9           27.3   4
## Porsche 914-2       26.0   4
## Lotus Europa        30.4   4
## Ford Pantera L      15.8   8
## Ferrari Dino        19.7   6
## Maserati Bora       15.0   8
## Volvo 142E          21.4   4
> mtcars$mpg_squared <- mtcars$mpg * mtcars$mpg
> mtcars[, c("mpg", "mpg_squared")]
##                      mpg mpg_squared
## Mazda RX4           21.0      441.00
## Mazda RX4 Wag       21.0      441.00
## Datsun 710          22.8      519.84
## Hornet 4 Drive      21.4      457.96
## Hornet Sportabout   18.7      349.69
## Valiant             18.1      327.61
## Duster 360          14.3      204.49
## Merc 240D           24.4      595.36
## Merc 230            22.8      519.84
## Merc 280            19.2      368.64
## Merc 280C           17.8      316.84
## Merc 450SE          16.4      268.96
## Merc 450SL          17.3      299.29
## Merc 450SLC         15.2      231.04
## Cadillac Fleetwood  10.4      108.16
## Lincoln Continental 10.4      108.16
## Chrysler Imperial   14.7      216.09
## Fiat 128            32.4     1049.76
## Honda Civic         30.4      924.16
## Toyota Corolla      33.9     1149.21
## Toyota Corona       21.5      462.25
## Dodge Challenger    15.5      240.25
## AMC Javelin         15.2      231.04
## Camaro Z28          13.3      176.89
## Pontiac Firebird    19.2      368.64
## Fiat X1-9           27.3      745.29
## Porsche 914-2       26.0      676.00
## Lotus Europa        30.4      924.16
## Ford Pantera L      15.8      249.64
## Ferrari Dino        19.7      388.09
## Maserati Bora       15.0      225.00
## Volvo 142E          21.4      457.96
When creating an index, we can also use multiple conditions: | (or), & (and)
> mtcars$mpg > 20 | mtcars$mpg < 25
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [31] TRUE TRUE
(remember: variables are just vectors, so we can use what we learned earlier)
> cbind(mtcars$mpg, mtcars$mpg < 20 | mtcars$mpg > 30)
##       [,1] [,2]
##  [1,] 21.0    0
##  [2,] 21.0    0
##  [3,] 22.8    0
##  [4,] 21.4    0
##  [5,] 18.7    1
##  [6,] 18.1    1
##  [7,] 14.3    1
##  [8,] 24.4    0
##  [9,] 22.8    0
## [10,] 19.2    1
## [11,] 17.8    1
## [12,] 16.4    1
## [13,] 17.3    1
## [14,] 15.2    1
## [15,] 10.4    1
## [16,] 10.4    1
## [17,] 14.7    1
## [18,] 32.4    1
## [19,] 30.4    1
## [20,] 33.9    1
## [21,] 21.5    0
## [22,] 15.5    1
## [23,] 15.2    1
## [24,] 13.3    1
## [25,] 19.2    1
## [26,] 27.3    0
## [27,] 26.0    0
## [28,] 30.4    1
## [29,] 15.8    1
## [30,] 19.7    1
## [31,] 15.0    1
## [32,] 21.4    0
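The first example returned TRUE for every car because any number is either above 20 or below 25. A short sketch of the same two conditions joined with & instead, which is probably what we want when asking for a range; either and both are names made up for illustration.

```r
data(mtcars)

either <- mtcars$mpg > 20 | mtcars$mpg < 25   # TRUE if at least one condition holds
both   <- mtcars$mpg > 20 & mtcars$mpg < 25   # TRUE only if both conditions hold

sum(either)   # 32: every value is above 20 or below 25
sum(both)     # 8: cars with mpg between 20 and 25
```

Summing the logical vector (the TRUE + TRUE trick from earlier) is a quick sanity check that an index selects the number of rows you expect.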
And we can use multiple variables
> table(mtcars$mpg > 30 & mtcars$cyl == 6)
## 
## FALSE 
##    32 
> table(mtcars$mpg > 30 & mtcars$cyl == 4)
## 
## FALSE  TRUE 
##    28     4 
> hi_mpg <- mtcars$mpg > mean(mtcars$mpg)
> hi_cyl <- mtcars$cyl == 4
> table(hi_mpg, hi_cyl)
##        hi_cyl
## hi_mpg  FALSE TRUE
##   FALSE    18    0
##   TRUE      3   11
> mtcars$good_car <- FALSE
> mtcars$good_car[hi_mpg & hi_cyl] <- TRUE
> table(mtcars$good_car)
## 
## FALSE  TRUE 
##    21    11 
Sanity check
> # cbind(mtcars$good_car, hi_mpg, hi_cyl, mtcars$mpg, mtcars$cyl)
> cbind(mtcars$good_car, hi_mpg, hi_cyl)
##             hi_mpg hi_cyl
##  [1,] FALSE   TRUE  FALSE
##  [2,] FALSE   TRUE  FALSE
##  [3,]  TRUE   TRUE   TRUE
##  [4,] FALSE   TRUE  FALSE
##  [5,] FALSE  FALSE  FALSE
##  [6,] FALSE  FALSE  FALSE
##  [7,] FALSE  FALSE  FALSE
##  [8,]  TRUE   TRUE   TRUE
##  [9,]  TRUE   TRUE   TRUE
## [10,] FALSE  FALSE  FALSE
## [11,] FALSE  FALSE  FALSE
## [12,] FALSE  FALSE  FALSE
## [13,] FALSE  FALSE  FALSE
## [14,] FALSE  FALSE  FALSE
## [15,] FALSE  FALSE  FALSE
## [16,] FALSE  FALSE  FALSE
## [17,] FALSE  FALSE  FALSE
## [18,]  TRUE   TRUE   TRUE
## [19,]  TRUE   TRUE   TRUE
## [20,]  TRUE   TRUE   TRUE
## [21,]  TRUE   TRUE   TRUE
## [22,] FALSE  FALSE  FALSE
## [23,] FALSE  FALSE  FALSE
## [24,] FALSE  FALSE  FALSE
## [25,] FALSE  FALSE  FALSE
## [26,]  TRUE   TRUE   TRUE
## [27,]  TRUE   TRUE   TRUE
## [28,]  TRUE   TRUE   TRUE
## [29,] FALSE  FALSE  FALSE
## [30,] FALSE  FALSE  FALSE
## [31,] FALSE  FALSE  FALSE
## [32,]  TRUE   TRUE   TRUE
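The same logical indexes also work in the row position of [row index, column index], which lets us pull out the matching observations themselves and not just count them. A minimal sketch (best is a name made up for illustration):

```r
data(mtcars)

# logical index in the row position, nothing in the column position
best <- mtcars[mtcars$mpg > 30 & mtcars$cyl == 4, ]
nrow(best)        # 4, matching the TRUE count from table()
rownames(best)    # the car names are stored as row names

# rows and columns together
mtcars[mtcars$mpg > 30, c("mpg", "cyl")]
```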
You should now be familiar with a few of R's data structures
We have also been introduced to some useful functions for manipulating, summarizing, and exploring data
More functions are available in packages, which we load with the library() function
# library() # list all the packages installed on your computer
library(stats) # load the stats package
# help(package="stats") # look at the package documentation
In future sessions, we will explore some of these packages that are particularly useful for
Please join us 😄