|
| 1 | +--- |
| 2 | +output: html_document |
| 3 | +editor_options: |
| 4 | + chunk_output_type: console |
| 5 | +--- |
| 6 | + |
| 7 | +```{r setup, include=FALSE} |
| 8 | +knitr::opts_chunk$set(echo = TRUE) |
| 9 | +``` |
| 10 | + |
| 11 | +There are two ways to group in dplyr: |
| 12 | + |
| 13 | +- Persistent grouping with [group_by()] |
| 14 | + |
| 15 | +- Temporary grouping with `.by` |
| 16 | + |
| 17 | +This help page is dedicated to explaining where and why you might want to use the latter. |
| 18 | +The most important reason to use `.by` is that you *never* have to remember to use [ungroup()] after it. |
| 19 | +The grouping is always temporary, meaning that if an ungrouped data frame goes in, then an ungrouped data frame comes out, regardless of the number of grouping columns that are specified. |
| 20 | + |
| 21 | +The following dplyr verbs support `.by`: |
| 22 | + |
| 23 | +- [mutate()] |
| 24 | + |
| 25 | +- [summarise()] |
| 26 | + |
| 27 | +- [filter()] |
| 28 | + |
| 29 | +- [slice()] and its variants, such as [slice_head()] |
| 30 | + |
| 31 | +Let's take a look at the two grouping approaches using this `expenses` data set, which tracks costs accumulated across various `id`s and `region`s: |
| 32 | + |
| 33 | +```{r} |
| 34 | +expenses <- tibble( |
| 35 | + id = c(1, 2, 1, 3, 1, 2, 3), |
| 36 | + region = c("A", "A", "A", "B", "B", "A", "A"), |
| 37 | + cost = c(25, 20, 19, 12, 9, 6, 6) |
| 38 | +) |
| 39 | +expenses |
| 40 | +``` |
| 41 | + |
| 42 | +Imagine that you wanted to compute the average cost per region. |
| 43 | +You'd probably write something like this: |
| 44 | + |
| 45 | +```{r} |
| 46 | +expenses %>% |
| 47 | + group_by(region) %>% |
| 48 | + summarise(cost = mean(cost)) |
| 49 | +``` |
| 50 | + |
| 51 | +As of dplyr 1.1.0, an additional option is available that lets you specify the grouping *inline* within the verb: |
| 52 | + |
| 53 | +```{r} |
| 54 | +expenses %>% |
| 55 | + summarise(cost = mean(cost), .by = region) |
| 56 | +``` |
| 57 | + |
| 58 | +This great idea comes from [data.table](https://CRAN.R-project.org/package=data.table), where the equivalent syntax looks something like `expenses[, .(cost = mean(cost)), by = region]`. |
| 59 | + |
| 60 | +Grouping with `.by` is temporary: because `expenses` was an ungrouped data frame, the result after applying `.by` is always an ungrouped data frame as well. |
| 61 | +Compare that with `group_by() %>% summarise()`, where `summarise()` generally peels off one layer of grouping by default, and emits a message saying so if there was originally more than one grouping column: |
| 62 | + |
| 63 | +```{r} |
| 64 | +expenses %>% |
| 65 | + group_by(id, region) %>% |
| 66 | + summarise(cost = mean(cost)) |
| 67 | +``` |
| 68 | + |
| 69 | +This behavior is sometimes useful to sequentially "roll up" a data frame that has been grouped by multiple columns, but often you just want to group temporarily. |
| 70 | +Traditionally, you'd either `ungroup()` after the `summarise()` or explicitly set `.groups = "drop"` to achieve this: |
| 71 | + |
| 72 | +```{r} |
| 73 | +expenses %>% |
| 74 | + group_by(id, region) %>% |
| 75 | + summarise(cost = mean(cost), .groups = "drop") |
| 76 | +``` |
| 77 | + |
| 78 | +Because `.by` grouping is temporary, you don't need to worry about ungrouping and it never needs to emit a message to remind you what is happening: |
| 79 | + |
| 80 | +```{r} |
| 81 | +expenses %>% |
| 82 | + summarise(cost = mean(cost), .by = c(id, region)) |
| 83 | +``` |
| 84 | + |
| 85 | +Note that we specified multiple columns to group by using the [tidy-select][dplyr_tidy_select] syntax `c(id, region)`. |
| 86 | +If you have a character vector of column names you'd like to group by, you can do so with `.by = all_of(my_cols)`. |
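As a minimal sketch of the `all_of()` form described above, the chunk below groups by a character vector and checks that it matches the `c(id, region)` form. It assumes dplyr 1.1.0 or later is installed; `my_cols`, `by_chr`, and `by_sel` are illustrative names, not part of dplyr, and `expenses` is rebuilt so the chunk runs on its own.

```r
library(dplyr)

# Same data as the `expenses` example used throughout this page
expenses <- tibble(
  id = c(1, 2, 1, 3, 1, 2, 3),
  region = c("A", "A", "A", "B", "B", "A", "A"),
  cost = c(25, 20, 19, 12, 9, 6, 6)
)

# Hypothetical character vector of grouping columns
my_cols <- c("id", "region")

# Group by the columns named in `my_cols`
by_chr <- expenses %>%
  summarise(cost = mean(cost), .by = all_of(my_cols))

# Group by the same columns using bare names
by_sel <- expenses %>%
  summarise(cost = mean(cost), .by = c(id, region))

# Both forms should describe the same temporary grouping
all.equal(by_chr, by_sel)
```

Wrapping the vector in `all_of()` makes it explicit that `my_cols` is an external variable holding column names rather than a column of `expenses`.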
| 87 | + |
| 88 | +To prevent surprising results, you can't use `.by` on an existing grouped data frame: |
| 89 | + |
| 90 | +```{r, error=TRUE} |
| 91 | +expenses %>% |
| 92 | + group_by(id) %>% |
| 93 | + summarise(cost = mean(cost), .by = c(id, region)) |
| 94 | +``` |
| 95 | + |
| 96 | +So far we've focused on the usage of `.by` with `summarise()`, but `.by` works with a number of other dplyr verbs. |
| 97 | +For example, you could append the mean cost per region onto the original data frame as a new column rather than computing a summary: |
| 98 | + |
| 99 | +```{r} |
| 100 | +expenses %>% |
| 101 | + mutate(cost_by_region = mean(cost), .by = region) |
| 102 | +``` |
| 103 | + |
| 104 | +Or you could slice out the maximum cost per combination of id and region (note that `slice_max()` spells the argument `by` rather than `.by`): |
| 105 | + |
| 106 | +```{r} |
| 107 | +expenses %>% |
| 108 | + slice_max(cost, n = 1, by = c(id, region)) |
| 109 | +``` |
| 110 | + |
| 111 | +Again, note that the result of this `slice_max()` is an ungrouped data frame, which is probably what you'd expect here. |
| 112 | +Compare that to the `group_by()` approach, which uses persistent grouping and returns another grouped data frame: |
| 113 | + |
| 114 | +```{r} |
| 115 | +expenses %>% |
| 116 | + group_by(id, region) %>% |
| 117 | + slice_max(cost, n = 1) |
| 118 | +``` |
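One way to confirm the difference between the two results is to inspect the grouping directly. The sketch below assumes dplyr 1.1.0 or later; `with_by` and `with_group_by` are illustrative names, and `expenses` is rebuilt so the chunk is self-contained.

```r
library(dplyr)

# Same data as the `expenses` example used throughout this page
expenses <- tibble(
  id = c(1, 2, 1, 3, 1, 2, 3),
  region = c("A", "A", "A", "B", "B", "A", "A"),
  cost = c(25, 20, 19, 12, 9, 6, 6)
)

# Temporary grouping: the grouping is gone once the verb returns
with_by <- expenses %>%
  slice_max(cost, n = 1, by = c(id, region))

# Persistent grouping: the result is still grouped by id and region
with_group_by <- expenses %>%
  group_by(id, region) %>%
  slice_max(cost, n = 1)

is_grouped_df(with_by)        # FALSE
is_grouped_df(with_group_by)  # TRUE
group_vars(with_group_by)     # "id" "region"
```

With the `group_by()` version, any later verb in the pipeline would still operate per group until `ungroup()` is called.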