Commit 1578eb6

Add a full documentation page specific to .by
1 parent 42fc138 commit 1578eb6

8 files changed: +330 -24 lines changed

R/by.R

+17-4
@@ -4,15 +4,28 @@
 #'
 #' @param .by `r lifecycle::badge("experimental")`
 #'
-#' <[`tidy-select`][dplyr_tidy_select]> Optionally, select columns to group
-#' by. This grouping will only be active for the duration of this verb.
-#'
-#' Can't be used when the data is already a grouped or rowwise data frame.
+#' <[`tidy-select`][dplyr_tidy_select]> Optionally, a selection of columns to
+#' temporarily group by using an inline alternative to [group_by()]. For
+#' details and examples, see [?dplyr_by][dplyr_by].
 #'
 #' @name args_by
 #' @keywords internal
 NULL
 
+#' Grouping with `.by`
+#'
+#' ```{r, echo = FALSE, results = "asis"}
+#' result <- rlang::with_options(
+#'   knitr::knit_child("man/rmd/by.Rmd"),
+#'   dplyr.summarise.inform = TRUE
+#' )
+#' cat(result, sep = "\n")
+#' ```
+#'
+#' @name dplyr_by
+#' @keywords internal
+NULL
+
 compute_by <- function(by,
                        data,
                        ...,

man/args_by.Rd

+3-4
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

man/dplyr_by.Rd

+180

man/filter.Rd

+3-4

man/mutate.Rd

+3-4

man/rmd/by.Rmd

+118
@@ -0,0 +1,118 @@
+---
+output: html_document
+editor_options:
+  chunk_output_type: console
+---
+
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(echo = TRUE)
+```
+
+There are two ways to group in dplyr:
+
+- Persistent grouping with [group_by()]
+
+- Temporary grouping with `.by`
+
+This help page is dedicated to explaining where and why you might want to use the latter.
+The most important reason to use `.by` is that you *never* have to remember to use [ungroup()] after it.
+The grouping is always temporary, meaning that if an ungrouped data frame goes in, then an ungrouped data frame comes out, regardless of the number of grouping columns that are specified.
+
+The following dplyr verbs support `.by`:
+
+- [mutate()]
+
+- [summarise()]
+
+- [filter()]
+
+- [slice()] and its variants, such as [slice_head()]
+
+Let's take a look at the two grouping approaches using this `expenses` data set, which tracks costs accumulated across various `id`s and `region`s:
+
+```{r}
+expenses <- tibble(
+  id = c(1, 2, 1, 3, 1, 2, 3),
+  region = c("A", "A", "A", "B", "B", "A", "A"),
+  cost = c(25, 20, 19, 12, 9, 6, 6)
+)
+expenses
+```
+
+Imagine that you wanted to compute the average cost per region.
+You'd probably write something like this:
+
+```{r}
+expenses %>%
+  group_by(region) %>%
+  summarise(cost = mean(cost))
+```
+
+As of dplyr 1.1.0, an additional option is available that lets you specify the grouping *inline* within the verb:
+
+```{r}
+expenses %>%
+  summarise(cost = mean(cost), .by = region)
+```
+
+This great idea comes from [data.table](https://CRAN.R-project.org/package=data.table), where the equivalent syntax looks something like `expenses[, .(cost = mean(cost)), by = region]`.
+
+Grouping with `.by` is temporary, meaning that since `expenses` was an ungrouped data frame, the result after applying `.by` will also always be an ungrouped data frame, full stop.
+Compare that with `group_by() %>% summarise()`, where `summarise()` generally peels off 1 layer of grouping by default, with a message that it is doing so if there was originally more than one grouping column:
+
+```{r}
+expenses %>%
+  group_by(id, region) %>%
+  summarise(cost = mean(cost))
+```
+
+This behavior is sometimes useful to sequentially "roll up" a data frame that has been grouped by multiple columns, but often you just want to group temporarily.
+Traditionally, you'd either `ungroup()` after the `summarise()` or explicitly set `.groups = "drop"` to achieve this:
+
+```{r}
+expenses %>%
+  group_by(id, region) %>%
+  summarise(cost = mean(cost), .groups = "drop")
+```
+
+Because `.by` grouping is temporary, you don't need to worry about ungrouping and it never needs to emit a message to remind you what is happening:
+
+```{r}
+expenses %>%
+  summarise(cost = mean(cost), .by = c(id, region))
+```
+
+Note that we specified multiple columns to group by using the [tidy-select][dplyr_tidy_select] syntax `c(id, region)`.
+If you have a character vector of column names you'd like to group by, you can do so with `.by = all_of(my_cols)`.
+
+To prevent surprising results, you can't use `.by` on an existing grouped data frame:
+
+```{r, error=TRUE}
+expenses %>%
+  group_by(id) %>%
+  summarise(cost = mean(cost), .by = c(id, region))
+```
+
+So far we've focused on the usage of `.by` with `summarise()`, but `.by` works with a number of other dplyr verbs.
+For example, you could append the mean cost per region onto the original data frame as a new column rather than computing a summary:
+
+```{r}
+expenses %>%
+  mutate(cost_by_region = mean(cost), .by = region)
+```
+
+Or you could slice out the maximum cost per combination of id and region:
+
+```{r}
+expenses %>%
+  slice_max(cost, n = 1, by = c(id, region))
+```
+
+Again, note that the result of this `slice_max()` is an ungrouped data frame, which is probably what you'd expect here.
+Compare that to the `group_by()` approach, which uses persistent grouping and returns another grouped data frame:
+
+```{r}
+expenses %>%
+  group_by(id, region) %>%
+  slice_max(cost, n = 1)
+```
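
The by.Rmd page lists `filter()` among the verbs that support `.by` and mentions `.by = all_of(my_cols)`, but shows neither in a chunk. The following is a minimal sketch of both, not part of the commit. It assumes dplyr 1.1.0 or later is attached and rebuilds the page's `expenses` data; the names `my_cols`, `by_cols`, and `top_per_region` are hypothetical and introduced only for illustration.

```r
library(dplyr)

# Same example data as in by.Rmd
expenses <- tibble(
  id = c(1, 2, 1, 3, 1, 2, 3),
  region = c("A", "A", "A", "B", "B", "A", "A"),
  cost = c(25, 20, 19, 12, 9, 6, 6)
)

# Hypothetical character vector holding the grouping column names
my_cols <- c("id", "region")

# Group by the columns named in `my_cols`; the result is ungrouped
by_cols <- expenses %>%
  summarise(cost = mean(cost), .by = all_of(my_cols))

# Keep the most expensive row(s) within each region; the result is ungrouped
top_per_region <- expenses %>%
  filter(cost == max(cost), .by = region)

by_cols
top_per_region
```

Wrapping the vector in `all_of()` makes it explicit that `my_cols` is an external character vector rather than a column of `expenses`.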

man/slice.Rd

+3-4
