Introduction to dplyr

class: center, middle, inverse, title-slide

.title[
# Introduction to <br/><br/> dplyr
]
.author[
### Jason Thomas
]
.institute[
### R Working Group
]
.date[
### Oct 9th, 2025
]

---

## Motivation

* <svg aria-hidden="true" role="img" viewBox="0 0 581 512" style="height:1em;width:1.13em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:steelblue;overflow:visible;position:relative;"><path d="M581 226.6C581 119.1 450.9 32 290.5 32S0 119.1 0 226.6C0 322.4 103.3 402 239.4 418.1V480h99.1v-61.5c24.3-2.7 47.6-7.4 69.4-13.9L448 480h112l-67.4-113.7c54.5-35.4 88.4-84.9 88.4-139.7zm-466.8 14.5c0-73.5 98.9-133 220.8-133s211.9 40.7 211.9 133c0 50.1-26.5 85-70.3 106.4-2.4-1.6-4.7-2.9-6.4-3.7-10.2-5.2-27.8-10.5-27.8-10.5s86.6-6.4 86.6-92.7-90.6-87.9-90.6-87.9h-199V361c-74.1-21.5-125.2-67.1-125.2-119.9zm225.1 38.3v-55.6c57.8 0 87.8-6.8 87.8 27.3 0 36.5-38.2 28.3-87.8 28.3zm-.9 72.5H365c10.8 0 18.9 11.7 24 19.2-16.1 1.9-33 2.8-50.6 2.9v-22.1z"/></svg> -- past 8 years (or so) have included great (user-friendly) strides in data management

+ [`dplyr` package](https://dplyr.tidyverse.org/)  -- "grammar of data manipulation,
    providing a consistent set of verbs that help you solve the most common data manipulation
    challenges"

+ a lot of functions coded in C++ (for speed)

+ many other packages make use of `dplyr`

* Goal for today is to establish basic fluency with this grammar

---
## Background

* `dplyr` is part of [`tidyverse`](https://www.tidyverse.org/)

+ `ggplot2`, `forcats`, `tibble`, `readr`, `stringr`,  `tidyr`, `purrr`
  + may also want to check out [`tidycensus`](https://walker-data.com/tidycensus/articles/basic-usage.html)

* `dplyr` logic: "By constraining your options, it helps you think about your data manipulation challenges."

+ 5 commands will take you a long way
  + readability and simplifying code (with pipes)

+ [https://r4ds.had.co.nz/](https://r4ds.had.co.nz/)

---
## Setup

* Installing new packages
  + only need to do this once for current version of R
  + (if/when you upgrade R, you may need to reinstall your packages)

``` r
> install.packages("dplyr")
> # install.packages(c("tidyverse", "rmarkdown"))
> # update.packages()
```

* attach the package (gives us access to all of the functions/tools in the package)
  + (need to do this for every R session, i.e., include it in your scripts)

``` r
> library(dplyr)
> # otherwise, we can access functions with dplyr::mutate()
```

---
class: inverse, center, middle

# `dplyr` Grammar

---
## Grammar

* **Rows** -- selecting and organizing observations/cases

+ **Groups of Rows**

* **Columns** -- cleaning and creating new variables

* **Merging & Reshaping Data** (not going to go into this today)

* Additional features of the language
    + tibble data structure (similar to data frames, but crankier)
    + "pipe" %>%  for stringing multiple commands together
    + useful for keeping number of objects to a minimum and for
    plotting (e.g., adding separate symbols for subgroups)

---
## Grammar for Rows

* `filter()`

* `slice()`

* `arrange()`

* *Groups of rows*: `summarize()` 
  + useful companion: group_by()
  + collapse across rows with `summarize()`

---
## Grammar for Columns

* `select()`

* `rename()`

* `mutate()`

---
class: inverse, center, middle

# Grammar in Action

---
class: slide-font-25
## Arrange rows

``` r
> # only look at a few columns
> names(mtcars)
> mtcars %>% 
>   select(mpg, cyl) %>%      # choose which columns to work with
>   arrange(mpg, desc(cyl))   # sort the rows (default = ascending)
```

```
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"
```

```
##                      mpg cyl
## Cadillac Fleetwood  10.4   8
## Lincoln Continental 10.4   8
## Camaro Z28          13.3   8
## Duster 360          14.3   8
## Chrysler Imperial   14.7   8
## Maserati Bora       15.0   8
## Merc 450SLC         15.2   8
## AMC Javelin         15.2   8
## Dodge Challenger    15.5   8
## Ford Pantera L      15.8   8
```
(truncated output)

---
## Filter row

``` r
> # only look at a few rows
> mtcars %>% 
+   select(mpg, cyl) %>%
+   filter(cyl == 6)       # only show rows that match a condition
```

```
##                 mpg cyl
## Mazda RX4      21.0   6
## Mazda RX4 Wag  21.0   6
## Hornet 4 Drive 21.4   6
## Valiant        18.1   6
## Merc 280       19.2   6
## Merc 280C      17.8   6
## Ferrari Dino   19.7   6
```

keyboard shortcuts in RStudio for the pipe (`%>%`)
  
  + MacOS: &nbsp;  <kbd>command</kbd> + <kbd>shift</kbd> + <kbd>M</kbd>
  + Windows: &nbsp; <kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>M</kbd>

---
## Filter more rows

``` r
> # only look at a few rows
> mtcars %>% 
+   select(mpg, cyl) %>%
+   filter(cyl > 4 & mpg > 22)    # why does this output look so strange?
```

```
## [1] mpg cyl
## <0 rows> (or 0-length row.names)
```

---
## `dplyr` filter more rows

``` r
> # only look at a few rows
> mtcars %>% 
+   select(mpg, cyl) %>%
+   filter(cyl > 4 & mpg > 18)
```

```
##                    mpg cyl
## Mazda RX4         21.0   6
## Mazda RX4 Wag     21.0   6
## Hornet 4 Drive    21.4   6
## Hornet Sportabout 18.7   8
## Valiant           18.1   6
## Merc 280          19.2   6
## Pontiac Firebird  19.2   8
## Ferrari Dino      19.7   6
```

---
## Take a slice

``` r
> # only look at a few rows (similar to filter)
> mtcars %>% 
+   select(mpg, cyl) %>%
+   slice(c(1, 9, 20))     # print rows 1, 9, and 20
```

```
##                 mpg cyl
## Mazda RX4      21.0   6
## Merc 230       22.8   4
## Toyota Corolla 33.9   4
```

---
class: slide-font-25
## Take another slice

A more complicated version

``` r
> # only look at a few rows
> mtcars %>% 
+   select(mpg, cyl) %>%
+   slice(                                
+     grep( "Mazda", row.names(mtcars) )  # return row #s that contain Mazda
+     )
```

```
##               mpg cyl
## Mazda RX4      21   6
## Mazda RX4 Wag  21   6
```

+ `grep()` is a powerful tool that can match text (with regular expressions!!)

+ also look at [`stringr`](https://stringr.tidyverse.org/) package -- useful
when working with text/string variables (e.g. country name, college major, etc.)

---
class: inverse, center, middle

# Grammar for Columns

---
class: slide-font-25
## Make new column

``` r
> # create new column named mpg2
> mtcars %>% 
>   select(mpg, cyl) %>%
>   mutate(mpg2 = mpg/1000)    # create a new variable called mpg2
```

```
##                    mpg cyl   mpg2
## Mazda RX4         21.0   6 0.0210
## Mazda RX4 Wag     21.0   6 0.0210
## Datsun 710        22.8   4 0.0228
## Hornet 4 Drive    21.4   6 0.0214
## Hornet Sportabout 18.7   8 0.0187
## Valiant           18.1   6 0.0181
## Duster 360        14.3   8 0.0143
## Merc 240D         24.4   4 0.0244
## Merc 230          22.8   4 0.0228
## Merc 280          19.2   6 0.0192
## Merc 280C         17.8   6 0.0178
## Merc 450SE        16.4   8 0.0164
```
(truncated output)

NOTE: WE ARE NOT SAVING THE NEW VARIABLE!! HOW WOULD WE DO THIS?

---
class: slide-font-25
## Recode a variable

``` r
> new_mtcars <- mtcars %>% 
>   select(mpg, cyl) %>%
>   mutate(mpg3 =          # create new variable mpg3
>            case_when(    # fill in mpg3 with recoding of mpg
>              mpg < 15.5 ~ "very low",
>              mpg >= 15.5 & mpg < 20 ~ "low",
>              mpg >= 20 & mpg < 23 ~ "high",
>              mpg >= 23 ~ "very high"
>            )
>   )
```

---
class: slide-font-25
## Recoded variable

<div class="datatables html-widget html-fill-item" id="htmlwidget-9e3adeac66cad2fd776c" style="width:100%;height:auto;"></div>
<script type="application/json" data-for="htmlwidget-9e3adeac66cad2fd776c">{"x":{"filter":"none","vertical":false,"fillContainer":false,"data":[["Mazda RX4","Mazda RX4 Wag","Datsun 710","Hornet 4 Drive","Hornet Sportabout","Valiant","Duster 360","Merc 240D","Merc 230","Merc 280","Merc 280C","Merc 450SE","Merc 450SL","Merc 450SLC","Cadillac Fleetwood","Lincoln Continental","Chrysler Imperial","Fiat 128","Honda Civic","Toyota Corolla","Toyota Corona","Dodge Challenger","AMC Javelin","Camaro Z28","Pontiac Firebird","Fiat X1-9","Porsche 914-2","Lotus Europa","Ford Pantera L","Ferrari Dino","Maserati Bora","Volvo 142E"],[21,21,22.8,21.4,18.7,18.1,14.3,24.4,22.8,19.2,17.8,16.4,17.3,15.2,10.4,10.4,14.7,32.4,30.4,33.9,21.5,15.5,15.2,13.3,19.2,27.3,26,30.4,15.8,19.7,15,21.4],["high","high","high","high","low","low","very low","very high","high","low","low","low","low","very low","very low","very low","very low","very high","very high","very high","high","low","very low","very low","low","very high","very high","very high","low","low","very low","high"]],"container":"<table class=\"display\">\n  <thead>\n    <tr>\n      <th> <\/th>\n      <th>mpg<\/th>\n      <th>mpg3<\/th>\n    <\/tr>\n  <\/thead>\n<\/table>","options":{"pageLength":5,"columnDefs":[{"className":"dt-right","targets":1},{"orderable":false,"targets":0},{"name":" ","targets":0},{"name":"mpg","targets":1},{"name":"mpg3","targets":2}],"order":[],"autoWidth":false,"orderClasses":false,"lengthMenu":[5,10,25,50,100]}},"evals":[],"jsHooks":[]}</script>

---
class: slide-font-25
## More recoding

``` r
> new_mtcars |>              # another pipe you may come across
>   mutate(mpg4 = 
>            case_match(     # recode categorical to numeric
>              mpg3,
>              "very low" ~ 0,
>              c("low", "high") ~ 1,
>              .default = 2  # if no conditions are met, then use 2
>            )
>   ) |>
>   select(mpg4, mpg3)
```

```
##                   mpg4      mpg3
## Mazda RX4            1      high
## Mazda RX4 Wag        1      high
## Datsun 710           1      high
## Hornet 4 Drive       1      high
## Hornet Sportabout    1       low
## Valiant              1       low
## Duster 360           0  very low
## Merc 240D            2 very high
```

(truncated output)

---
## Replacing values

We can use `replace` (in cahoots with `mutate`) to change the values in a 
column for rows where a certain condition is satisfied.

* `replace( column, condition, new value if condition is met)`

``` r
> mtcars %>%
+      mutate(mpg=replace(mpg, cyl==4, NA))
```

```
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710            NA   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D             NA   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230              NA   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128              NA   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic           NA   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla        NA   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona         NA   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9             NA   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2         NA   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa          NA   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E            NA   4 121.0 109 4.11 2.780 18.60  1  1    4    2
```

---
class: inverse, center, middle

# Grouping

---
## Summarizing data with `dplyr`

`summarize` will create a *new* data frame by applying a function (e.g., mean
sd, n, n_distinct) to a column in your data

``` r
> mtcars %>%
+   summarize(mean(cyl), sd(cyl), mean(mpg), col4 = sd(mpg))
```

```
##   mean(cyl)  sd(cyl) mean(mpg)     col4
## 1    6.1875 1.785922  20.09062 6.026948
```

* usually the result is a single row, but `quantile` is a notable exception
  but this is getting phased out (so this will give you a warning)

``` r
> mtcars %>%
>   summarize(qt_mpg = quantile(mpg, c(.25, .75)),
>             qt_wt = quantile(wt, c(.25, .75)))
```
  * (there are better ways of doing this)
  
  
---
class: slide-font-25
## Summarizing data with `dplyr` (cont.)

The true benefit of `summarize` comes with grouping your data:

``` r
> mtcars |> 
+   group_by(cyl, vs) |>  # calculate stats within groups defined by cyl & vs
+   summarize(mu_hp = mean(hp), n = n())
```

```
## `summarise()` has grouped output by 'cyl'. You can override using the `.groups`
## argument.
```

```
## # A tibble: 5 × 4
## # Groups:   cyl [3]
##     cyl    vs mu_hp     n
##   <dbl> <dbl> <dbl> <int>
## 1     4     0  91       1
## 2     4     1  81.8    10
## 3     6     0 132.      3
## 4     6     1 115.      4
## 5     8     0 209.     14
```

* note: every call to `summarize()` removes a layer of grouping

---
class: slide-font-20, codefs-70
## Summarizing data with `dplyr` (extras)

*  Careful with reusing names to label your summarized variables

``` r
> mtcars |> 
+   summarize(hp = mean(hp), sd_hp = sd(hp))  ## why don't we get the std dev?!?
```

```
##         hp sd_hp
## 1 146.6875    NA
```

* `ungroup()` will apply functions to the entire data set

```
## # A tibble: 32 × 5
##      mpg    hp   cyl mu_mpg grand_mu_mpg
##    <dbl> <dbl> <dbl>  <dbl>        <dbl>
##  1  21     110     6   19.7         20.1
##  2  21     110     6   19.7         20.1
##  3  22.8    93     4   26.7         20.1
##  4  21.4   110     6   19.7         20.1
##  5  18.7   175     8   15.1         20.1
##  6  18.1   105     6   19.7         20.1
##  7  14.3   245     8   15.1         20.1
##  8  24.4    62     4   26.7         20.1
##  9  22.8    95     4   26.7         20.1
## 10  19.2   123     6   19.7         20.1
## # ℹ 22 more rows
```