Improve unnest() performance #1141

Merged
merged 5 commits into from
Aug 26, 2021


DavisVaughan
Member

@DavisVaughan DavisVaughan commented Aug 6, 2021

Part of #1127
Closes #1112

This removes the as_df() and as_empty_df() helpers from unnest(). Their main purpose seemed to be ensuring that each element of the list-col was a data frame, so that unchop() would return a packed df-col that unpack() would unpack. This doesn't seem to be necessary. Instead, we just let unchop() unchop the list-cols (it "skips" over any non-list-cols) and then we tell unpack() the remaining df-cols to unpack.
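
A minimal sketch of the two-step path described above, on made-up toy data (the column names `id`, `a`, `x`, and `y` are illustrative only). It shows unchop() turning a list-col of tibbles into a packed df-col, and unpack() then spreading that df-col into regular columns. It is not the internal implementation of unnest(), only the same idea spelled out by hand.

```r
library(tidyr)

# Toy data: `a` is a list-col of tibbles, `id` is a regular column
df <- tibble(
  id = 1:2,
  a = list(
    tibble(x = 1:2, y = c("a", "b")),
    tibble(x = 3L, y = "c")
  )
)

# Step 1: unchop() combines the list elements, leaving `a` as a packed df-col
chopped <- unchop(df, a)
is.data.frame(chopped$a)

# Step 2: unpack() spreads the df-col into top-level columns
manual <- unpack(chopped, a)

# Compare with calling unnest() directly
direct <- unnest(df, a)
all.equal(manual, direct)
```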

As mentioned in #1126 (comment), there are a few edge cases that this would break. But they seemed to be off-label usage, only worked by accident, and unnest_legacy() didn't work that way either, so I think we are okay. I'll run revdeps to be sure. I've also added a test of the new error behavior.

Here is an example of unnesting a list column of integers (which is really just unchop):

library(tidyr)
set.seed(1)

n <- 100e3

df <- tibble(
  a = purrr::map(1:n, ~ rlang::seq2(1, sample(0:10, 1))),
  y = 1:n
)

df
#> # A tibble: 100,000 × 2
#>    a              y
#>    <list>     <int>
#>  1 <int [8]>      1
#>  2 <int [3]>      2
#>  3 <int [6]>      3
#>  4 <int [0]>      4
#>  5 <int [1]>      5
#>  6 <int [6]>      6
#>  7 <int [10]>     7
#>  8 <int [1]>      8
#>  9 <int [10]>     9
#> 10 <int [2]>     10
#> # … with 99,990 more rows

# list column of integers
bench::mark(
  unnest = unnest(df, a),
  unnest_legacy = unnest_legacy(df, a) %>% dplyr::select(a, y),
  iterations = 20
)

# CRAN
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#>   expression         min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 unnest           1.39s    1.67s     0.609    13.9MB     8.62
#> 2 unnest_legacy 345.67ms 361.21ms     2.75     17.7MB     6.18

# This PR
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#>   expression         min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 unnest          70.8ms   74.2ms     12.5       13MB     6.87
#> 2 unnest_legacy  345.7ms  355.9ms      2.76    17.7MB     9.37
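
To illustrate the "really just unchop" remark on a scale that is easy to inspect, here is a small sketch with a three-row toy tibble (not the benchmark data). It assumes the default `keep_empty = FALSE` in both functions, so the row holding the empty integer vector is dropped either way.

```r
library(tidyr)

# Toy data: a list-col of integer vectors, one of them empty
small <- tibble(
  a = list(1:2, integer(), 3L),
  y = 1:3
)

# With no df-cols to unpack, unnest() and unchop() should agree
via_unnest <- unnest(small, a)
via_unchop <- unchop(small, a)

all.equal(as.data.frame(via_unnest), as.data.frame(via_unchop))
```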

Here is an example of unnesting a list column of 2 column tibbles (i.e. standard unnest usage):

library(tidyr)
set.seed(1)

n <- 100e3

df <- tibble(
  a = purrr::map(1:n, ~ {
    x <- rlang::seq2(1, sample(0:10, 1))
    y <- sample(x)
    tibble::new_tibble(list(x = x, y = y), nrow = length(x))
  }),
  z = 1:n
)

df
#> # A tibble: 100,000 × 2
#>    a                    z
#>    <list>           <int>
#>  1 <tibble [8 × 2]>     1
#>  2 <tibble [1 × 2]>     2
#>  3 <tibble [2 × 2]>     3
#>  4 <tibble [4 × 2]>     4
#>  5 <tibble [6 × 2]>     5
#>  6 <tibble [4 × 2]>     6
#>  7 <tibble [8 × 2]>     7
#>  8 <tibble [5 × 2]>     8
#>  9 <tibble [6 × 2]>     9
#> 10 <tibble [5 × 2]>    10
#> # … with 99,990 more rows

# list column of tibbles
bench::mark(
  unnest = unnest(df, a),
  unnest_legacy = unnest_legacy(df, a) %>% dplyr::select(x, y, z),
  iterations = 10
)

# CRAN
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#>   expression         min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 unnest           1.38s    1.42s     0.686    15.8MB     3.09
#> 2 unnest_legacy    1.82s    1.95s     0.506    25.8MB     4.45

# This PR
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#>   expression         min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 unnest         610.5ms 615.72ms     1.62     15.2MB     1.30
#> 2 unnest_legacy     1.8s    1.86s     0.523    25.8MB     4.39

Comment on lines +334 to 335
cols <- cols[map_lgl(unclass(data)[cols], is.data.frame)]
unpack(data, any_of(cols), names_sep = names_sep, names_repair = names_repair)
Member Author

If unpack() was smart enough to "skip" any non df-cols (like how unchop "skips" non list-cols) then we wouldn't even need this filtering line! For example, this unpack() call could return its input unchanged.

library(tidyr)

df <- tibble(x = 1L, y = 1)

# "skips" unchopping y
unchop(df, y)
#> # A tibble: 1 x 2
#>       x     y
#>   <int> <dbl>
#> 1     1     1

# requires df cols
unpack(df, y)
#> Error: `y` must be a data frame column

That would make unnest() truly an unchop() + unpack() without any intermediate adjustments, which theoretically is kind of nice.

Contributor

This would also make it easy to unpack all data frame columns with unpack(df, everything()), and in general there would be less to worry about when using selection helpers.
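
For reference, the "skip non df-cols" behavior discussed in this thread can be approximated today with a small wrapper. The sketch below is hypothetical: `unpack_if_df()` is not a tidyr function, and it simply applies the same filtering idea as the line under review before delegating to unpack().

```r
library(tidyr)

# Hypothetical helper (not part of tidyr): unpack only the selected columns
# that are actually data frames, and leave everything else untouched
unpack_if_df <- function(data, cols, ...) {
  # Resolve the tidyselect expression to column names
  cols <- names(tidyselect::eval_select(rlang::enquo(cols), data))
  # Keep only the columns that are df-cols
  is_df <- vapply(unclass(data)[cols], is.data.frame, logical(1))
  unpack(data, tidyselect::all_of(cols[is_df]), ...)
}

# `z` is a df-col, `x` and `y` are ordinary vectors
df <- tibble(x = 1L, y = 1, z = tibble(a = 1, b = 2))

res <- unpack_if_df(df, everything())
names(res)
```

With a wrapper like this, a broad selection such as everything() no longer errors on the non df-cols, which is the convenience described above.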

@DavisVaughan DavisVaughan requested a review from mgirlich August 6, 2021 14:35
@mgirlich
Contributor

mgirlich commented Aug 9, 2021

LGTM! Looking forward to the speed improvements in tidyr 😄

@DavisVaughan
Member Author

No changes in revdeps!

@DavisVaughan DavisVaughan merged commit 36e5399 into tidyverse:master Aug 26, 2021
@DavisVaughan DavisVaughan deleted the feature/speed-up-unnest branch August 26, 2021 18:09
Successfully merging this pull request may close these issues.

unnest column of data.frames