
Implement .by #6528

Merged
merged 20 commits into from
Nov 17, 2022
Conversation

DavisVaughan
Member

@DavisVaughan DavisVaughan commented Nov 4, 2022

Closes #6214

This feels pretty good! It still needs more testing, but I think the implementation is going in the right direction. I think I've found the right abstraction with compute_by(), which takes the by tidy-selection and the data and returns a list of:

  • The grouping type (ungrouped, rowwise, grouped)
  • The group column names
  • The group data, i.e. what typically comes from group_data()
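Based on that description, here is a hypothetical sketch of the shape compute_by() might return. This is not the actual PR code; the function name, argument names, and type labels below are illustrative only:

```r
library(dplyr)

# Illustrative sketch only -- not the real `compute_by()` from this PR.
# Returns the three pieces described above: grouping type, group column
# names, and group data in the shape of `group_data()` output.
compute_by_sketch <- function(data, by_names) {
  grouped <- group_by(data, across(all_of(by_names)))
  list(
    type = if (length(by_names) == 0L) "ungrouped" else "grouped",
    names = by_names,
    data = group_data(grouped)
  )
}

by <- compute_by_sketch(mtcars, "vs")
by$type   # "grouped"
by$names  # "vs"
```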

Notes:

  • Only implemented for summarise() and mutate() right now.
  • Can't be used with existing rowwise or grouped data frames.
  • Can't specify "by row".
  • summarise() prevents you from combining .by and .groups, since that combination never makes sense.
  • Not exposing this for transmute() because it is superseded.

It is implemented in a way that all tests still pass, even though there is only support for it in mutate() and summarise().

I think it feels so much nicer than our existing syntax! Especially in the case where summarise() returns >1 row per group and you are forced to specify .groups = "drop" or call ungroup(), like with this common ivs example:

library(dplyr)
library(ivs)

users <- tribble(
  ~user, ~from, ~to,
  1L, "2019-01-01", "2019-01-05", 
  1L, "2019-01-12", "2019-01-13", 
  1L, "2019-01-03", "2019-01-10", 
  2L, "2019-01-02", "2019-01-03", 
  2L, "2019-01-03", "2019-01-04", 
  2L, "2019-01-05", "2019-01-07"
)
users <- users %>%
  mutate(from = as.Date(from), to = as.Date(to)) %>%
  mutate(range = iv(from, to), .keep = "unused")

users
#> # A tibble: 6 × 2
#>    user                    range
#>   <int>               <iv<date>>
#> 1     1 [2019-01-01, 2019-01-05)
#> 2     1 [2019-01-12, 2019-01-13)
#> 3     1 [2019-01-03, 2019-01-10)
#> 4     2 [2019-01-02, 2019-01-03)
#> 5     2 [2019-01-03, 2019-01-04)
#> 6     2 [2019-01-05, 2019-01-07)

# Long, don't like that I have to use `.groups = "drop"` because I rarely want that
users %>%
  group_by(user) %>%
  summarise(range = iv_groups(range), .groups = "drop")
#> # A tibble: 4 × 2
#>    user                    range
#>   <int>               <iv<date>>
#> 1     1 [2019-01-01, 2019-01-10)
#> 2     1 [2019-01-12, 2019-01-13)
#> 3     2 [2019-01-02, 2019-01-04)
#> 4     2 [2019-01-05, 2019-01-07)

# SO MUCH NICER
users %>%
  summarise(range = iv_groups(range), .by = user)
#> # A tibble: 4 × 2
#>    user                    range
#>   <int>               <iv<date>>
#> 1     1 [2019-01-01, 2019-01-10)
#> 2     1 [2019-01-12, 2019-01-13)
#> 3     2 [2019-01-02, 2019-01-04)
#> 4     2 [2019-01-05, 2019-01-07)

@DavisVaughan DavisVaughan changed the title Implement .by for mutate() and summarise() Implement .by Nov 4, 2022
@mine-cetinkaya-rundel
Member

Some reactions/questions:

  • It says "I rarely want `.groups = "drop"`", but this implementation always does `.groups = "drop"`, no? So that doesn't seem like the right justification, unless it's about the `.groups = "drop"` syntax itself, which I'm also not the biggest fan of.

  • Is the idea that if this existed before, we'd never have group_by()? I've always thought group_by |> summarize is so powerful, that this feels like a step back for the simplest scenario (single grouping variable, calculating a summary statistic once). But I see the upsides for longer pipelines. I imagine y'all have done the "cost-benefit analysis" of that?

  • .by sounds like the right argument name but what about the by in *_join() functions?

  • What will this look like for multiple grouping variables? .by = c(var1, var2)? Is the result ungrouped regardless of the number of grouping variables?

@DavisVaughan
Member Author

DavisVaughan commented Nov 7, 2022

Oh, regarding this comment of mine (your first bullet):

don't like that I have to use `.groups = "drop"` because I rarely want that

I think I just made some kind of typo. I actually meant that I almost always want .groups = "drop", but I hate that I have to type it out all the time to get it. This does .groups = "drop" unconditionally.

@DavisVaughan
Member Author

DavisVaughan commented Nov 7, 2022

Is the idea that if this existed before, we'd never have group_by()?

That is how I feel about it, yeah. More powerful still is that we probably never would have needed the grouped_df subclass; we could have always used bare data frames or bare tibbles. That would make extending dplyr a lot easier for people like the tsibble authors.


I've always thought group_by |> summarize is so powerful, that this feels like a step back for the simplest scenario (single grouping variable, calculating a summary statistic once)

I actually felt like it was a step forward, especially for teaching. You could introduce summarise() with:

mtcars %>% 
  summarise(mean = mean(mpg))

And then grouped summaries use the same syntax: no new functions, just a new argument, and you never have to worry about ungrouping afterwards:

mtcars %>% 
  summarise(mean = mean(mpg), .by = vs)

I've always felt like when you introduce group_by() then you immediately have to talk about ungroup() and the intricacies of exactly when it is needed, which is super confusing. The fact that this is completely transient grouping seems a lot easier to explain.


what about the by in *_join() functions?

Reasonable question; we'd have to see how confusing that is, I guess.


What will this look like for multiple grouping variables? .by = c(var1, var2)?

With multiple variables, yes, .by = c(var1, var2) is valid syntax. It accepts anything supported by tidyselect, so you can use .by = all_of(character_vector) or .by = starts_with("grp_") too.
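A short illustration of the tidyselect-based `.by` usage described here (assumes a dplyr version with `.by` support, i.e. 1.1.0 or later; the data frame is made up for the example):

```r
library(dplyr)

df <- tibble(
  grp_a = c("x", "x", "y"),
  grp_b = c(1, 1, 2),
  value = c(10, 20, 30)
)

# These are equivalent ways of grouping by `grp_a` and `grp_b`:
df %>% summarise(total = sum(value), .by = c(grp_a, grp_b))
df %>% summarise(total = sum(value), .by = all_of(c("grp_a", "grp_b")))
df %>% summarise(total = sum(value), .by = starts_with("grp_"))
```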

Is the result ungrouped regardless of the number of grouping variables?

Yes and I feel fairly strongly that this is the right design decision, because otherwise we end up back with the .groups = argument of summarise() along with having to emit messages to users about what is happening.

@hadley
Member

hadley commented Nov 7, 2022

@mine-cetinkaya-rundel another advantage of this approach is that you never have to explain what a grouped_df is and how it affects the results of all subsequent operations.

@mine-cetinkaya-rundel
Member

Yes, I agree that not having to explain what a grouped_df is, and not having to keep track of the level of grouping ("summarize peels off a level of grouping" is not easy to wrap one's head around), are both benefits.

For simple cases, it's useful to think about how one would do things "manually" when teaching, and I usually say things like "you ask the penguins to get into groups based on the island they come from, and then you take the average of each group's weights". But maybe I say this because the workflow was designed that way, so it's a chicken/egg thing.

Overall, I can be convinced this is a win, even though it might not be one for the simplest cases.

The join issue is worth considering carefully; I'm also not sure how confusing it might be.

And I like that you can do tidyselect stuff with .by but I don't have a good intuition for the order of grouping if you do .by = all_of(character_vector) or .by = starts_with("grp_"). In alphabetical order of variable names or in the order variables appear in the data frame?


compute_by_groups <- function(data, names, error_call = caller_env()) {
  data <- dplyr_col_select(data, names, error_call = error_call)
  info <- vec_group_loc(data)
Member Author


We would potentially switch this out for vec_locate_sorted_groups(appearance = TRUE)
r-lib/vctrs#1747

But really once you get into the 100k+ range of number of groups, the group index computation isn't the slow part, it's the expression evaluation.

So if we wanted to keep vec_group_loc(), I think that would probably also be okay.
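For reference, a small illustration of what vctrs::vec_group_loc() returns: one row per group, with a `key` column of unique values and a `loc` list-column of element locations, in order of first appearance.

```r
library(vctrs)

x <- c("b", "a", "b", "c", "a")
info <- vec_group_loc(x)

info$key  # group keys in order of first appearance: "b", "a", "c"
info$loc  # list of integer locations for each group's elements
```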

R/summarise.R Outdated
Comment on lines 154 to 156
summarise.grouped_df <- function(.data, ..., .by = NULL, .groups = NULL) {
  # TODO: Is it right to even expose `.by` here? It lets us catch errors.
  # Will always error if `.by != NULL` b/c you can't use it with grouped/rowwise dfs.
  by <- compute_by({{ .by }}, .data, by_arg = ".by", data_arg = ".data")
Member Author


But this is kind of weird. .by is here in the grouped-df method because it is in the generic (although I don't think it has to be here), but it will always error informatively if non-null.

I can't decide if removing it from here would be better or not. If we removed it, then .by would slip by unnoticed on grouped-dfs and could become a column of the result through being captured by ...

Member


It seems unsurprising to include it here to me. How else would you ensure that df %>% group_by() %>% summarise(.by = c(a, b)) errors?

Member Author


Yea, that's fair

It is easier for mutate() because there is only mutate.data.frame(); there is no mutate.grouped_df(), so we don't have to make the same kind of call there.

Comment on lines 428 to 435
summarise_verbose <- function(.groups, .env) {
  # Previously a single expression:
  #   is.null(.groups) &&
  #     is_reference(topenv(.env), global_env()) &&
  #     !identical(getOption("dplyr.summarise.inform"), FALSE)

  if (!is.null(.groups)) {
    # User supplied `.groups`
    return(FALSE)
  }

  inform <- getOption("dplyr.summarise.inform")

  if (is_true(inform) || is_false(inform)) {
    # User supplied global option
    return(inform)
  }

  is_reference(topenv(.env), global_env())
}
Member Author


I tweaked this a little so I could force verbosity in the by.Rmd doc.

Member

@hadley hadley left a comment


Looks great!!

@@ -1,5 +1,50 @@
# dplyr (development version)

* `.by` is a new experimental inline alternative to `group_by()` that supports
_temporary_ grouping in the following key dplyr verbs: `mutate()`,
Member


Do we want to use temporary or transient?

Member Author


I think I went with temporary in the documentation because it seemed like an easier word for users to understand, but I am not tied to it if you feel transient is clearer.

@DavisVaughan
Member Author

@hadley thanks for the in-depth by.Rmd review! I think it is way more concise now.

It should always return a bare tibble, even though `group_data()` returns a data frame for data frame input.
@DavisVaughan DavisVaughan merged commit 0a55cf5 into tidyverse:main Nov 17, 2022
@DavisVaughan DavisVaughan deleted the feature/by branch November 17, 2022 23:29
Successfully merging this pull request may close these issues:

.by argument as alternative to group_by