
Implement .by #6528

Merged
merged 20 commits into from
Nov 17, 2022
Conversation

DavisVaughan
Member

@DavisVaughan DavisVaughan commented Nov 4, 2022

Closes #6214

This feels pretty good! It still needs more testing, but I think the implementation is going in the right direction. I think I've found the right abstraction with compute_by(), which takes the by tidy-selection and the data and returns a list of:

  • The grouping type (ungrouped, rowwise, grouped)
  • The group column names
  • The group data, i.e. what typically comes from group_data()
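Based on that description, here is a hypothetical sketch of the shape compute_by() might return. This is not the actual PR code; the function name, argument names, and type labels below are illustrative only:

```r
library(dplyr)

# Illustrative sketch only -- not the real `compute_by()` from this PR.
# Returns the three pieces described above: grouping type, group column
# names, and group data in the shape of `group_data()` output.
compute_by_sketch <- function(data, by_names) {
  grouped <- group_by(data, across(all_of(by_names)))
  list(
    type = if (length(by_names) == 0L) "ungrouped" else "grouped",
    names = by_names,
    data = group_data(grouped)
  )
}

by <- compute_by_sketch(mtcars, "vs")
by$type   # "grouped"
by$names  # "vs"
```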

Notes:

  • Only implemented for summarise() and mutate() right now.
  • Can't be used with existing rowwise or grouped data frames.
  • Can't specify "by row".
  • summarise() prevents you from combining .by and .groups, since that combination never makes sense.
  • Not exposing this for transmute() because it is superseded.

It is implemented in a way that all tests still pass, even though there is only support for it in mutate() and summarise().

I think it feels so much nicer than our existing syntax! Especially in the case where summarise() returns >1 row per group and you are forced to specify .groups = "drop" or call ungroup(), like with this common ivs example:

library(dplyr)
library(ivs)

users <- tribble(
  ~user, ~from, ~to,
  1L, "2019-01-01", "2019-01-05", 
  1L, "2019-01-12", "2019-01-13", 
  1L, "2019-01-03", "2019-01-10", 
  2L, "2019-01-02", "2019-01-03", 
  2L, "2019-01-03", "2019-01-04", 
  2L, "2019-01-05", "2019-01-07"
)
users <- users %>%
  mutate(from = as.Date(from), to = as.Date(to)) %>%
  mutate(range = iv(from, to), .keep = "unused")

users
#> # A tibble: 6 × 2
#>    user                    range
#>   <int>               <iv<date>>
#> 1     1 [2019-01-01, 2019-01-05)
#> 2     1 [2019-01-12, 2019-01-13)
#> 3     1 [2019-01-03, 2019-01-10)
#> 4     2 [2019-01-02, 2019-01-03)
#> 5     2 [2019-01-03, 2019-01-04)
#> 6     2 [2019-01-05, 2019-01-07)

# Long, don't like that I have to use `.groups = "drop"` because I rarely want that
users %>%
  group_by(user) %>%
  summarise(range = iv_groups(range), .groups = "drop")
#> # A tibble: 4 × 2
#>    user                    range
#>   <int>               <iv<date>>
#> 1     1 [2019-01-01, 2019-01-10)
#> 2     1 [2019-01-12, 2019-01-13)
#> 3     2 [2019-01-02, 2019-01-04)
#> 4     2 [2019-01-05, 2019-01-07)

# SO MUCH NICER
users %>%
  summarise(range = iv_groups(range), .by = user)
#> # A tibble: 4 × 2
#>    user                    range
#>   <int>               <iv<date>>
#> 1     1 [2019-01-01, 2019-01-10)
#> 2     1 [2019-01-12, 2019-01-13)
#> 3     2 [2019-01-02, 2019-01-04)
#> 4     2 [2019-01-05, 2019-01-07)

@DavisVaughan DavisVaughan changed the title Implement .by for mutate() and summarise() Implement .by Nov 4, 2022
@mine-cetinkaya-rundel
Member

Some reactions/questions:

  • It says "I rarely want `.groups = "drop"`", but this implementation always does `.groups = "drop"`, no? So that doesn't seem like the right justification, unless it's about the `.groups = "drop"` syntax itself, which I'm also not the biggest fan of.

  • Is the idea that if this existed before, we'd never have group_by()? I've always thought group_by |> summarize is so powerful, that this feels like a step back for the simplest scenario (single grouping variable, calculating a summary statistic once). But I see the upsides for longer pipelines. I imagine y'all have done the "cost-benefit analysis" of that?

  • .by sounds like the right argument name but what about the by in *_join() functions?

  • What will this look like for multiple grouping variables? .by = c(var1, var2)? Is the result ungrouped regardless of the number of grouping variables?

@DavisVaughan
Member Author

DavisVaughan commented Nov 7, 2022

Oh, regarding this comment of mine (your first bullet):

don't like that I have to use `.groups = "drop"` because I rarely want that

I think I just made some kind of typo. I actually meant that I almost always want .groups = "drop", but I hate that I have to type it out all the time to get it. This does .groups = "drop" unconditionally.

@DavisVaughan
Member Author

DavisVaughan commented Nov 7, 2022

Is the idea that if this existed before, we'd never have group_by()?

That is how I feel about it, yeah. More powerful still is that we probably never would have needed the grouped_df subclass; we could have always used bare data frames or bare tibbles. That would make extending dplyr a lot easier for people like the tsibble authors.


I've always thought group_by |> summarize is so powerful, that this feels like a step back for the simplest scenario (single grouping variable, calculating a summary statistic once)

I actually felt like it was a step forward, especially for teaching. You could introduce summarise() with:

mtcars %>% 
  summarise(mean = mean(mpg))

And then grouped summaries use the same syntax: no new functions, just a new argument, and you never have to worry about ungrouping afterwards:

mtcars %>% 
  summarise(mean = mean(mpg), .by = vs)

I've always felt like when you introduce group_by() then you immediately have to talk about ungroup() and the intricacies of exactly when it is needed, which is super confusing. The fact that this is completely transient grouping seems a lot easier to explain.


what about the by in *_join() functions?

Reasonable question; we'd have to see how confusing that is, I guess.


What will this look like for multiple grouping variables? .by = c(var1, var2)?

With multiple variables, yes, .by = c(var1, var2) is valid syntax. It accepts anything supported by tidyselect, so you can use .by = all_of(character_vector) or .by = starts_with("grp_") too.
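A short illustration of the tidyselect-based `.by` usage described here (assumes a dplyr version with `.by` support, i.e. 1.1.0 or later; the data frame is made up for the example):

```r
library(dplyr)

df <- tibble(
  grp_a = c("x", "x", "y"),
  grp_b = c(1, 1, 2),
  value = c(10, 20, 30)
)

# These are equivalent ways of grouping by `grp_a` and `grp_b`:
df %>% summarise(total = sum(value), .by = c(grp_a, grp_b))
df %>% summarise(total = sum(value), .by = all_of(c("grp_a", "grp_b")))
df %>% summarise(total = sum(value), .by = starts_with("grp_"))
```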

Is the result ungrouped regardless of the number of grouping variables?

Yes and I feel fairly strongly that this is the right design decision, because otherwise we end up back with the .groups = argument of summarise() along with having to emit messages to users about what is happening.

@hadley
Member

hadley commented Nov 7, 2022

@mine-cetinkaya-rundel another advantage of this approach is that you never have to explain what a grouped_df is and how it affects the results of all subsequent operations.

@mine-cetinkaya-rundel
Member

Yes, I agree that not having to explain what a grouped_df is, and not having to keep track of the level of grouping ("summarize peels off a level of grouping" is not easy to wrap one's head around), are both benefits.

For simple cases, it's useful to think about how one would do things "manually" when teaching, and I usually say things like "you ask the penguins to get into groups based on the island they come from, and then you take the average of each group's weights". But maybe I say this because the workflow was designed that way, so it's a chicken/egg thing.

Overall, I can be convinced this is a win, even though it might not be one for the simplest cases.

The join issue is worth considering carefully; I'm also not sure how confusing it might be.

And I like that you can do tidyselect stuff with .by but I don't have a good intuition for the order of grouping if you do .by = all_of(character_vector) or .by = starts_with("grp_"). In alphabetical order of variable names or in the order variables appear in the data frame?


compute_by_groups <- function(data, names, error_call = caller_env()) {
  data <- dplyr_col_select(data, names, error_call = error_call)
  info <- vec_group_loc(data)
Member Author


We would potentially switch this out for vec_locate_sorted_groups(appearance = TRUE)
r-lib/vctrs#1747

But really once you get into the 100k+ range of number of groups, the group index computation isn't the slow part, it's the expression evaluation.

So if we wanted to keep vec_group_loc(), I think that would probably also be okay.
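For reference, a small illustration of what vctrs::vec_group_loc() returns: one row per group, with a `key` column of unique values and a `loc` list-column of element locations, in order of first appearance.

```r
library(vctrs)

x <- c("b", "a", "b", "c", "a")
info <- vec_group_loc(x)

info$key  # group keys in order of first appearance: "b", "a", "c"
info$loc  # list of integer locations for each group's elements
```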

R/summarise.R Outdated
Comment on lines 154 to 156
summarise.grouped_df <- function(.data, ..., .by = NULL, .groups = NULL) {
  # TODO: Is it right to even expose `.by` here? It lets us catch errors.
  # Will always error if `.by != NULL` b/c you can't use it with grouped/rowwise dfs.
  by <- compute_by({{ .by }}, .data, by_arg = ".by", data_arg = ".data")
Member Author


But this is kind of weird. .by is here in the grouped-df method because it is in the generic (although I don't think it has to be here), but it will always error informatively if non-null.

I can't decide if removing it from here would be better or not. If we removed it, then .by would slip by unnoticed on grouped-dfs and could become a column of the result through being captured by ...

Member


It seems unsurprising to include it here to me. How else would you ensure that df %>% group_by() %>% summarise(.by = c(a, b)) errors?

Member Author


Yea, that's fair

It is easier for mutate() because there is only mutate.data.frame(); there is no mutate.grouped_df(), so we don't have to make the same kind of call there.

Comment on lines 428 to 435
summarise_verbose <- function(.groups, .env) {
  # Previously a single expression:
  #   is.null(.groups) &&
  #     is_reference(topenv(.env), global_env()) &&
  #     !identical(getOption("dplyr.summarise.inform"), FALSE)

  if (!is.null(.groups)) {
    # User supplied `.groups`
    return(FALSE)
  }

  inform <- getOption("dplyr.summarise.inform")

  if (is_true(inform) || is_false(inform)) {
    # User supplied global option
    return(inform)
  }

  is_reference(topenv(.env), global_env())
}
Member Author


I tweaked this a little so I could force verbosity in the by.Rmd doc.

Member

@hadley hadley left a comment


Looks great!!

@@ -1,5 +1,50 @@
# dplyr (development version)

* `.by` is a new experimental inline alternative to `group_by()` that supports
_temporary_ grouping in the following key dplyr verbs: `mutate()`,
Member


Do we want to use temporary or transient?

Member Author


I think I went with temporary in the documentation because it seemed like an easier word for users to understand, but I am not tied to it if you feel transient is clearer.

@DavisVaughan
Member Author

@hadley thanks for the in-depth by.Rmd review! I think it is way more concise now.

It should always return a bare tibble, even though `group_data()` returns a data frame for data frame input.
@DavisVaughan DavisVaughan merged commit 0a55cf5 into tidyverse:main Nov 17, 2022
@DavisVaughan DavisVaughan deleted the feature/by branch November 17, 2022 23:29
Successfully merging this pull request may close these issues:

.by argument as alternative to group_by