-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement .by
#6528
Implement .by
#6528
Conversation
Some reactions/questions:
|
oh regarding this comment of mine (your first bullet):
I think I just made some kind of typo. I actually meant that I almost always want |
That is how I feel about it, yea. More powerfully is that we probably never would have needed the
I actually felt like it was a step forward, especially for teaching. You could introduce mtcars %>%
summarise(mean = mean(mpg)) And then grouped summaries use the same syntax, no new functions, just a new argument, and you never have to worry about ungrouping afterwards mtcars %>%
summarise(mean = mean(mpg), .by = vs) I've always felt like when you introduce
Reasonable question, we'd have to see how confusing that is, I guess With multiple variables, yes
Yes and I feel fairly strongly that this is the right design decision, because otherwise we end up back with the |
@mine-cetinkaya-rundel another advantage of this approach is that you never have to explain what a |
Yes, I agree that not having to explain what a For simple cases, it's useful to think about how one would do things "manually" when teaching, and I usually say things like "you ask the penguins to get into groups based on the island they come from, and then you take the average of each group's weights". But maybe I say this because the workflow was designed that way, so it's a chicken/egg thing. Overall I can be convinced this is an overall win, even though it might not be for the simplest things. The And I like that you can do tidyselect stuff with |
7e8d52c
to
1578eb6
Compare
|
||
compute_by_groups <- function(data, names, error_call = caller_env()) { | ||
data <- dplyr_col_select(data, names, error_call = error_call) | ||
info <- vec_group_loc(data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We would potentially switch this out for vec_locate_sorted_groups(appearance = TRUE)
r-lib/vctrs#1747
But really once you get into the 100k+ range of number of groups, the group index computation isn't the slow part, it's the expression evaluation.
So if we wanted to keep vec_group_loc()
I think that would probably also be okay
R/summarise.R
Outdated
summarise.grouped_df <- function(.data, ..., .by = NULL, .groups = NULL) { | ||
# TODO: Is it right to even expose `.by` here? It lets us catch errors. | ||
# Will always error if `.by != NULL` b/c you can't use it with grouped/rowwise dfs. | ||
by <- compute_by({{ .by }}, .data, by_arg = ".by", data_arg = ".data") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But this is kind of weird. .by
is here in the grouped-df method because it is in the generic (although I dont think it has to be here) but it will always error informatively if non-null.
I can't decide if removing it from here would be better or not. If we removed it then .by
would slip by on grouped-dfs and potentially would become a column of the result through being captured by ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems unsurprising to include it here to me. How else would you ensure that df %>% group_by() %>% summarise(.by = c(a, b))
errors?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, that's fair
It is easier for mutate()
because there is only mutate.data.frame()
, there is not mutate.grouped_df()
, so we don't have to make the same kind of call there
summarise_verbose <- function(.groups, .env) { | ||
is.null(.groups) && | ||
is_reference(topenv(.env), global_env()) && | ||
!identical(getOption("dplyr.summarise.inform"), FALSE) | ||
if (!is.null(.groups)) { | ||
# User supplied `.groups` | ||
return(FALSE) | ||
} | ||
|
||
inform <- getOption("dplyr.summarise.inform") | ||
|
||
if (is_true(inform) || is_false(inform)) { | ||
# User supplied global option | ||
return(inform) | ||
} | ||
|
||
is_reference(topenv(.env), global_env()) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tweaked this a little so I could force verbosity in the by.Rmd
doc
bea24a4
to
0f5958c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great!!
@@ -1,5 +1,50 @@ | |||
# dplyr (development version) | |||
|
|||
* `.by` is a new experimental inline alternative to `group_by()` that supports | |||
_temporary_ grouping in the following key dplyr verbs: `mutate()`, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to use temporary or transient?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I went with temporary in documentation because it seemed like an easier verb for users to understand, but I am not tied to it if you feel like transient is clearer
@hadley thanks for the in depth |
This should override the `global_env()` reference check, so we can force verbosity in relevant documentation pages
Co-authored-by: Hadley Wickham <[email protected]>
It should always return a bare tibble, even though `group_data()` returns a data frame for data frame input.
fee8344
to
ce6926f
Compare
Closes #6214
This feels...pretty good? It definitely needs more testing, but I think the implementation is definitely going in the right direction. I think I've found the right abstraction with
compute_by()
which takes theby
tidy-selection and thedata
and returns a list of:group_data()
Notes:
summarise()
andmutate()
right now.summarise()
prevents you from combining.by
and.groups
, which shouldn't ever make sense.transmute()
because it is superseded.It is implemented in a way that all tests still pass, even though there is only support for it in
mutate()
andsummarise()
.I think it feels so much nicer than our existing syntax! Especially in the case where
summarise()
returns >1 row per group and you are forced to specify.groups = "drop"
or callungroup()
, like with this common ivs example: