Grouped operations are slow #119
This goes towards making overall grouped calculations quicker. See #119
As a result of 3cc0a99:
Improvements can still be made.
The following produces the results in ~9 seconds:
So still a lot slower than dplyr.
Related to this, in the current version summarize() drops groups whose key is NA, even though group_data() lists them:

set.seed(2023)
d <- data.frame(
g1 = sample(c(LETTERS[1:3], NA), 10, replace = TRUE),
g2 = sample(c(LETTERS[1:3], NA), 10, replace = TRUE),
v1 = sample(1:10)
)
d
#> g1 g2 v1
#> 1 A A 4
#> 2 <NA> A 5
#> 3 C A 8
#> 4 A A 1
#> 5 <NA> A 7
#> 6 C A 2
#> 7 B <NA> 10
#> 8 <NA> C 9
#> 9 B B 3
#> 10 A C 6
d |>
poorman::group_by(g1) |>
poorman::group_data()
#> g1 .rows
#> 1 A 1, 4, 10
#> 2 B 7, 9
#> 3 C 3, 6
#> 4 <NA> 2, 5, 8
d |>
poorman::group_by(g1) |>
poorman::summarize(mean = mean(v1))
#> g1 mean
#> 1 A 3.666667
#> 2 B 6.500000
#> 3 C 5.000000
d |>
dplyr::group_by(g1) |>
dplyr::summarize(mean = mean(v1))
#> # A tibble: 4 × 2
#> g1 mean
#> <chr> <dbl>
#> 1 A 3.67
#> 2 B 6.5
#> 3 C 5
#> 4 <NA> 7

Created on 2023-01-05 with reprex v2.0.2
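For reference, a minimal base-R sketch (not poorman's actual internals; the name rows_by_g1 is just illustrative) of how the NA group can be kept when computing row indices, reusing the d from the reprex above. addNA() turns NA keys into a real factor level, so split() does not drop those rows:

# Hypothetical illustration: keep rows with an NA key as their own group
rows_by_g1 <- split(seq_len(nrow(d)), addNA(d$g1, ifany = TRUE))
sapply(rows_by_g1, function(idx) mean(d$v1[idx]))
# the <NA> group gets a mean as well, analogous to dplyr's result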
Good catch. I think at some point I will need to pin the version of dplyr that poorman tracks.
Right, it would make a lot of sense given how much dplyr changes between releases.
That's the main incentive, yeah. It would require a lot of work to keep up and probably a lot of refactoring. My suggestion would therefore be to pin against a specific dplyr release.
Yes, it's in the upcoming 1.1.0.
Coming back to the slowness issue, I can gain ~30% speed by replacing:

for (i in seq_len(n_comb)) {
  rows[[i]] <- which(data_groups %in% interaction(unique_groups[i, groups]))
}

with:

pasted_groups <- do.call(paste, c(unique_groups[, groups, drop = FALSE], sep = "."))
pasted_groups[is.na(unique_groups)] <- NA
for (i in seq_len(n_comb)) {
  rows[[i]] <- which(data_groups %in% pasted_groups[i])
}

Basically, at each iteration the original loop calls interaction() on a single row, which is expensive; pre-computing the pasted group keys does that work once. This change passes all the tests (which are not really complete given the NA issue). A standalone sketch of the idea follows the benchmark below.

Benchmark setup:

d <- data.frame(
g1 = sample(LETTERS, 4000, TRUE),
g2 = sample(LETTERS, 4000, TRUE),
g3 = sample(LETTERS, 4000, TRUE),
x1 = runif(4000),
x2 = runif(4000),
x3 = runif(4000)
)
# return a list of results so that both functions return the same output
# (without the class differences: tibble vs data.frame)
poor <- function() {
  foo <- d |>
    poorman::group_by(g1, g2, g3) |>
    poorman::summarise(x1 = mean(x1), x2 = max(x2), x3 = min(x3))
  list(foo$x1, foo$x2, foo$x3)
}
dpl <- function() {
  foo <- d |>
    dplyr::group_by(g1, g2, g3) |>
    dplyr::summarise(x1 = mean(x1), x2 = max(x2), x3 = min(x3))
  list(foo$x1, foo$x2, foo$x3)
}
bench::mark(
  poor(),
  dpl()
)
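For anyone who wants to see the effect outside the package internals, here is a self-contained sketch of the same idea. The variable names (dat, n_comb, data_groups, slow, fast) are stand-ins, not poorman's actual code, and the bench package is assumed to be installed:

set.seed(1)
dat <- data.frame(
  g1 = sample(LETTERS, 4000, TRUE),
  g2 = sample(LETTERS, 4000, TRUE)
)
groups <- c("g1", "g2")
unique_groups <- unique(dat[, groups, drop = FALSE])
n_comb <- nrow(unique_groups)
# one key per row of the data, built once
data_groups <- do.call(paste, c(dat[, groups, drop = FALSE], sep = "."))

slow <- function() {
  rows <- vector("list", n_comb)
  for (i in seq_len(n_comb)) {
    # interaction() is re-evaluated on a single row at every iteration
    rows[[i]] <- which(data_groups %in% interaction(unique_groups[i, groups]))
  }
  rows
}

fast <- function() {
  # build all group keys up front, then only do vector comparisons in the loop
  pasted_groups <- do.call(paste, c(unique_groups[, groups, drop = FALSE], sep = "."))
  rows <- vector("list", n_comb)
  for (i in seq_len(n_comb)) {
    rows[[i]] <- which(data_groups %in% pasted_groups[i])
  }
  rows
}

bench::mark(slow(), fast())

Both functions return identical lists of row indices, so bench::mark()'s equality check passes and the timings isolate the cost of the repeated interaction() calls.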
Oh that's really nice. Did you want to submit a PR? I probably won't have time to implement it myself until Sunday.
I can make a PR, but I think it shouldn't be implemented before the behavior of grouping with NA keys is settled. There should probably be some tests for more complex grouping with NAs as well.
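For illustration, the kind of test I have in mind might look roughly like this (a testthat-style sketch that assumes NA groups should match dplyr; the data and the expectation are mine, not existing tests in the package):

library(testthat)
test_that("grouping keeps NA keys like dplyr", {
  d <- data.frame(
    g1 = c("A", NA, "B", NA, "A"),
    g2 = c(NA, NA, "B", "B", NA),
    v1 = 1:5
  )
  # compare only the row indices per group, stripped to plain integer vectors
  poor_rows <- poorman::group_data(poorman::group_by(d, g1, g2))$.rows
  dplyr_rows <- dplyr::group_data(dplyr::group_by(d, g1, g2))$.rows
  expect_equal(
    lapply(poor_rows, as.integer),
    lapply(dplyr_rows, as.integer)
  )
})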