class: center, middle, inverse, title-slide .title[ # Introduction to
dplyr ] .author[ ### Jason Thomas ] .institute[ ### R Working Group ] .date[ ### Oct 9th, 2025 ] --- ## Motivation *
-- past 8 years (or so) have included great (user-friendly) strides in data management + [`dplyr` package](https://dplyr.tidyverse.org/) -- "grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges" + a lot of functions coded in C++ (for speed) + many other packages make use of `dplyr` * Goal for today is to establish basic fluency with this grammar --- ## Background * `dplyr` is part of [`tidyverse`](https://www.tidyverse.org/) + `ggplot2`, `forcats`, `tibble`, `readr`, `stringr`, `tidyr`, `purrr` + may also want to check out [`tidycensus`](https://walker-data.com/tidycensus/articles/basic-usage.html) * `dplyr` logic: "By constraining your options, it helps you think about your data manipulation challenges." + 5 commands will take you a long way + readability and simplifying code (with pipes) *
**for Data Science** + [https://r4ds.had.co.nz/](https://r4ds.had.co.nz/) --- ## Setup * Installing new packages + only need to do this once for current version of R + (if/when you upgrade R, you may need to reinstall your packages) ``` r > install.packages("dplyr") > # install.packages(c("tidyverse", "rmarkdown")) > # update.packages() ``` * attach the package (gives us access to all of the functions/tools in the package) + (need to do this for every R session, i.e., include it in your scripts) ``` r > library(dplyr) > # otherwise, we can access functions with dplyr::mutate() ``` --- class: inverse, center, middle # `dplyr` Grammar --- ## Grammar * **Rows** -- selecting and organizing observations/cases + **Groups of Rows** * **Columns** -- cleaning and creating new variables * **Merging & Reshaping Data** (not going to go into this today) * Additional features of the language + tibble data structure (similar to data frames, but crankier) + "pipe" %>% for stringing multiple commands together + useful for keeping number of objects to a minimum and for plotting (e.g., adding separate symbols for subgroups) --- ## Grammar for Rows * `filter()` * `slice()` * `arrange()` * *Groups of rows*: `summarize()` + useful companion: group_by() + collapse across rows with `summarize()` --- ## Grammar for Columns * `select()` * `rename()` * `mutate()` --- class: inverse, center, middle # Grammar in Action --- class: slide-font-25 ## Arrange rows ``` r > # only look at a few columns > names(mtcars) > mtcars %>% > select(mpg, cyl) %>% # choose which columns to work with > arrange(mpg, desc(cyl)) # sort the rows (default = ascending) ``` ``` ## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" ## [11] "carb" ``` ``` ## mpg cyl ## Cadillac Fleetwood 10.4 8 ## Lincoln Continental 10.4 8 ## Camaro Z28 13.3 8 ## Duster 360 14.3 8 ## Chrysler Imperial 14.7 8 ## Maserati Bora 15.0 8 ## Merc 450SLC 15.2 8 ## AMC Javelin 15.2 8 ## Dodge Challenger 15.5 8 ## Ford Pantera L 15.8 8 ``` (truncated output) --- ## Filter row ``` r > # only look at a few rows > mtcars %>% + select(mpg, cyl) %>% + filter(cyl == 6) # only show rows that match a condition ``` ``` ## mpg cyl ## Mazda RX4 21.0 6 ## Mazda RX4 Wag 21.0 6 ## Hornet 4 Drive 21.4 6 ## Valiant 18.1 6 ## Merc 280 19.2 6 ## Merc 280C 17.8 6 ## Ferrari Dino 19.7 6 ``` keyboard shortcuts in RStudio for the pipe (`%>%`) + MacOS: <kbd>command</kbd> + <kbd>shift</kbd> + <kbd>M</kbd> + Windows: <kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>M</kbd> --- ## Filter more rows ``` r > # only look at a few rows > mtcars %>% + select(mpg, cyl) %>% + filter(cyl > 4 & mpg > 22) # why does this output look so strange? ``` ``` ## [1] mpg cyl ## <0 rows> (or 0-length row.names) ``` --- ## `dplyr` filter more rows ``` r > # only look at a few rows > mtcars %>% + select(mpg, cyl) %>% + filter(cyl > 4 & mpg > 18) ``` ``` ## mpg cyl ## Mazda RX4 21.0 6 ## Mazda RX4 Wag 21.0 6 ## Hornet 4 Drive 21.4 6 ## Hornet Sportabout 18.7 8 ## Valiant 18.1 6 ## Merc 280 19.2 6 ## Pontiac Firebird 19.2 8 ## Ferrari Dino 19.7 6 ``` --- ## Take a slice ``` r > # only look at a few rows (similar to filter) > mtcars %>% + select(mpg, cyl) %>% + slice(c(1, 9, 20)) # print rows 1, 9, and 20 ``` ``` ## mpg cyl ## Mazda RX4 21.0 6 ## Merc 230 22.8 4 ## Toyota Corolla 33.9 4 ``` --- class: slide-font-25 ## Take another slice A more complicated version ``` r > # only look at a few rows > mtcars %>% + select(mpg, cyl) %>% + slice( + grep( "Mazda", row.names(mtcars) ) # return row #s that contain Mazda + ) ``` ``` ## mpg cyl ## Mazda RX4 21 6 ## Mazda RX4 Wag 21 6 ``` + `grep()` is a powerful tool that can match text (with regular expressions!!) + also look at [`stringr`](https://stringr.tidyverse.org/) package -- useful when working with text/string variables (e.g. country name, college major, etc.) --- class: inverse, center, middle # Grammar for Columns --- class: slide-font-25 ## Make new column ``` r > # create new column named mpg2 > mtcars %>% > select(mpg, cyl) %>% > mutate(mpg2 = mpg/1000) # create a new variable called mpg2 ``` ``` ## mpg cyl mpg2 ## Mazda RX4 21.0 6 0.0210 ## Mazda RX4 Wag 21.0 6 0.0210 ## Datsun 710 22.8 4 0.0228 ## Hornet 4 Drive 21.4 6 0.0214 ## Hornet Sportabout 18.7 8 0.0187 ## Valiant 18.1 6 0.0181 ## Duster 360 14.3 8 0.0143 ## Merc 240D 24.4 4 0.0244 ## Merc 230 22.8 4 0.0228 ## Merc 280 19.2 6 0.0192 ## Merc 280C 17.8 6 0.0178 ## Merc 450SE 16.4 8 0.0164 ``` (truncated output) NOTE: WE ARE NOT SAVING THE NEW VARIABLE!! HOW WOULD WE DO THIS? --- class: slide-font-25 ## Recode a variable ``` r > new_mtcars <- mtcars %>% > select(mpg, cyl) %>% > mutate(mpg3 = # create new variable mpg3 > case_when( # fill in mpg3 with recoding of mpg > mpg < 15.5 ~ "very low", > mpg >= 15.5 & mpg < 20 ~ "low", > mpg >= 20 & mpg < 23 ~ "high", > mpg >= 23 ~ "very high" > ) > ) ``` --- class: slide-font-25 ## Recoded variable
--- class: slide-font-25 ## More recoding ``` r > new_mtcars |> # another pipe you may come across > mutate(mpg4 = > case_match( # recode categorical to numeric > mpg3, > "very low" ~ 0, > c("low", "high") ~ 1, > .default = 2 # if no conditions are met, then use 2 > ) > ) |> > select(mpg4, mpg3) ``` ``` ## mpg4 mpg3 ## Mazda RX4 1 high ## Mazda RX4 Wag 1 high ## Datsun 710 1 high ## Hornet 4 Drive 1 high ## Hornet Sportabout 1 low ## Valiant 1 low ## Duster 360 0 very low ## Merc 240D 2 very high ``` (truncated output) --- ## Replacing values We can use `replace` (in cahoots with `mutate`) to change the values in a column for rows where a certain condition is satisfied. * `replace( column, condition, new value if condition is met)` ``` r > mtcars %>% + mutate(mpg=replace(mpg, cyl==4, NA)) ``` ``` ## mpg cyl disp hp drat wt qsec vs am gear carb ## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 ## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 ## Datsun 710 NA 4 108.0 93 3.85 2.320 18.61 1 1 4 1 ## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 ## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 ## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 ## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 ## Merc 240D NA 4 146.7 62 3.69 3.190 20.00 1 0 4 2 ## Merc 230 NA 4 140.8 95 3.92 3.150 22.90 1 0 4 2 ## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 ## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 ## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 ## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 ## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 ## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 ## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 ## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 ## Fiat 128 NA 4 78.7 66 4.08 2.200 19.47 1 1 4 1 ## Honda Civic NA 4 75.7 52 4.93 1.615 18.52 1 1 4 2 ## Toyota Corolla NA 4 71.1 65 4.22 1.835 19.90 1 1 4 1 ## Toyota Corona NA 4 120.1 97 3.70 2.465 20.01 1 0 3 1 ## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 ## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 ## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 ## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 ## Fiat X1-9 NA 4 79.0 66 4.08 1.935 18.90 1 1 4 1 ## Porsche 914-2 NA 4 120.3 91 4.43 2.140 16.70 0 1 5 2 ## Lotus Europa NA 4 95.1 113 3.77 1.513 16.90 1 1 5 2 ## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 ## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 ## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 ## Volvo 142E NA 4 121.0 109 4.11 2.780 18.60 1 1 4 2 ``` --- class: inverse, center, middle # Grouping --- ## Summarizing data with `dplyr` `summarize` will create a *new* data frame by applying a function (e.g., mean sd, n, n_distinct) to a column in your data ``` r > mtcars %>% + summarize(mean(cyl), sd(cyl), mean(mpg), col4 = sd(mpg)) ``` ``` ## mean(cyl) sd(cyl) mean(mpg) col4 ## 1 6.1875 1.785922 20.09062 6.026948 ``` * usually the result is a single row, but `quantile` is a notable exception but this is getting phased out (so this will give you a warning) ``` r > mtcars %>% > summarize(qt_mpg = quantile(mpg, c(.25, .75)), > qt_wt = quantile(wt, c(.25, .75))) ``` * (there are better ways of doing this) --- class: slide-font-25 ## Summarizing data with `dplyr` (cont.) The true benefit of `summarize` comes with grouping your data: ``` r > mtcars |> + group_by(cyl, vs) |> # calculate stats within groups defined by cyl & vs + summarize(mu_hp = mean(hp), n = n()) ``` ``` ## `summarise()` has grouped output by 'cyl'. You can override using the `.groups` ## argument. ``` ``` ## # A tibble: 5 × 4 ## # Groups: cyl [3] ## cyl vs mu_hp n ## <dbl> <dbl> <dbl> <int> ## 1 4 0 91 1 ## 2 4 1 81.8 10 ## 3 6 0 132. 3 ## 4 6 1 115. 4 ## 5 8 0 209. 14 ``` * note: every call to `summarize()` removes a layer of grouping --- class: slide-font-20, codefs-70 ## Summarizing data with `dplyr` (extras) * Careful with reusing names to label your summarized variables ``` r > mtcars |> + summarize(hp = mean(hp), sd_hp = sd(hp)) ## why don't we get the std dev?!? ``` ``` ## hp sd_hp ## 1 146.6875 NA ``` -- * `ungroup()` will apply functions to the entire data set ``` r > mtcars |> select(mpg, hp, cyl) |> + group_by(cyl) |> mutate(mu_mpg = mean(mpg)) |> + ungroup() |> mutate(grand_mu_mpg = mean(mpg)) ``` ``` ## # A tibble: 32 × 5 ## mpg hp cyl mu_mpg grand_mu_mpg ## <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 21 110 6 19.7 20.1 ## 2 21 110 6 19.7 20.1 ## 3 22.8 93 4 26.7 20.1 ## 4 21.4 110 6 19.7 20.1 ## 5 18.7 175 8 15.1 20.1 ## 6 18.1 105 6 19.7 20.1 ## 7 14.3 245 8 15.1 20.1 ## 8 24.4 62 4 26.7 20.1 ## 9 22.8 95 4 26.7 20.1 ## 10 19.2 123 6 19.7 20.1 ## # ℹ 22 more rows ```