Improve `unnest()` performance #1141

DavisVaughan · 2021-08-06T14:21:32Z

Part of #1127
Closes #1112

This removes the as_df() and as_empty_df() helpers from unnest(). Their main purpose seemed to be ensuring that each element of the list-col was a data frame, so that unchop() would return a packed df-col that unpack() would unpack. This doesn't seem to be necessary. Instead, we just let unchop() unchop the list-cols (it "skips" over any non-list-cols) and then we tell unpack() the remaining df-cols to unpack.
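The two-step shape described above can be sketched by hand on a small input. This is only an illustration, not the code in the PR: the toy `df` and the `df_cols` detection via `vapply()` are assumptions made for the example.

```r
library(tidyr)

# Toy input (assumed for illustration): a list-col of two-column tibbles
df <- tibble(
  id = 1:2,
  a = list(
    tibble(x = 1:2, y = c("a", "b")),
    tibble(x = 3L, y = "c")
  )
)

# Step 1: unchop() the list-col; a list of data frames becomes a packed df-col
chopped <- unchop(df, a)

# Step 2: unpack() only the columns that are now data frames
df_cols <- names(chopped)[vapply(chopped, is.data.frame, logical(1))]
manual <- unpack(chopped, all_of(df_cols))

# Compare against unnest() itself
all.equal(manual, unnest(df, a))
```

Detecting the df-cols after `unchop()` rather than before is what lets a plain list-col of atomic vectors pass through step 2 untouched.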

As mentioned in #1126 (comment), there are a few edge cases that this would break. But they seemed to be off-label usage that only worked by accident, and unnest_legacy() didn't support them either, so I think we are okay. I'll run revdeps to be sure. I've also added a test of the new error behavior.

Here is an example of unnesting a list column of integers (which is really just unchop):

library(tidyr)
set.seed(1)

n <- 100e3

df <- tibble(
  a = purrr::map(1:n, ~ rlang::seq2(1, sample(0:10, 1))),
  y = 1:n
)

df
#> # A tibble: 100,000 × 2
#>    a              y
#>    <list>     <int>
#>  1 <int [8]>      1
#>  2 <int [3]>      2
#>  3 <int [6]>      3
#>  4 <int [0]>      4
#>  5 <int [1]>      5
#>  6 <int [6]>      6
#>  7 <int [10]>     7
#>  8 <int [1]>      8
#>  9 <int [10]>     9
#> 10 <int [2]>     10
#> # … with 99,990 more rows

# list column of integers
bench::mark(
  unnest = unnest(df, a),
  unnest_legacy = unnest_legacy(df, a) %>% dplyr::select(a, y),
  iterations = 20
)

# CRAN
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#>   expression         min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 unnest           1.39s    1.67s     0.609    13.9MB     8.62
#> 2 unnest_legacy 345.67ms 361.21ms     2.75     17.7MB     6.18

# This PR
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#>   expression         min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 unnest          70.8ms   74.2ms     12.5       13MB     6.87
#> 2 unnest_legacy  345.7ms  355.9ms      2.76    17.7MB     9.37

Here is an example of unnesting a list column of two-column tibbles (i.e. standard unnest usage):

library(tidyr)
set.seed(1)

n <- 100e3

df <- tibble(
  a = purrr::map(1:n, ~ {
    x <- rlang::seq2(1, sample(0:10, 1))
    y <- sample(x)
    tibble::new_tibble(list(x = x, y = y), nrow = length(x))
  }),
  z = 1:n
)

df
#> # A tibble: 100,000 × 2
#>    a                    z
#>    <list>           <int>
#>  1 <tibble [8 × 2]>     1
#>  2 <tibble [1 × 2]>     2
#>  3 <tibble [2 × 2]>     3
#>  4 <tibble [4 × 2]>     4
#>  5 <tibble [6 × 2]>     5
#>  6 <tibble [4 × 2]>     6
#>  7 <tibble [8 × 2]>     7
#>  8 <tibble [5 × 2]>     8
#>  9 <tibble [6 × 2]>     9
#> 10 <tibble [5 × 2]>    10
#> # … with 99,990 more rows

# list column of tibbles
bench::mark(
  unnest = unnest(df, a),
  unnest_legacy = unnest_legacy(df, a) %>% dplyr::select(x, y, z),
  iterations = 10
)

# CRAN
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#>   expression         min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 unnest           1.38s    1.42s     0.686    15.8MB     3.09
#> 2 unnest_legacy    1.82s    1.95s     0.506    25.8MB     4.45

# This PR
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#>   expression         min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 unnest         610.5ms 615.72ms     1.62     15.2MB     1.30
#> 2 unnest_legacy     1.8s    1.86s     0.523    25.8MB     4.39

DavisVaughan · 2021-08-06T14:24:58Z

R/nest.R

+  cols <- cols[map_lgl(unclass(data)[cols], is.data.frame)]
  unpack(data, any_of(cols), names_sep = names_sep, names_repair = names_repair)


If unpack() was smart enough to "skip" any non df-cols (like how unchop "skips" non list-cols) then we wouldn't even need this filtering line! For example, this unpack() call could return its input unchanged.

library(tidyr)

df <- tibble(x = 1L, y = 1)

# "skips" unchopping y
unchop(df, y)
#> # A tibble: 1 x 2
#>       x     y
#>   <int> <dbl>
#> 1     1     1

# requires df cols
unpack(df, y)
#> Error: `y` must be a data frame column

That would make unnest() truly an unchop() + unpack() without any intermediate adjustments, which theoretically is kind of nice.

This would also make it easy to unpack all data frame columns with unpack(df, everything()), and in general there would be less need to worry when using selection helpers.
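A rough sketch of the "skip non df-cols" idea from this thread, written as a wrapper around `unpack()`. The helper name `unpack_df_cols()` is hypothetical and not part of tidyr; it simply applies the same filtering as the diff line above before calling `unpack()`.

```r
library(tidyr)

# Hypothetical helper (not a tidyr function): unpack only the columns that
# are data frames and leave every other column untouched.
unpack_df_cols <- function(data, ...) {
  cols <- names(data)
  # unclass() avoids tibble subsetting overhead, as in the diff above
  cols <- cols[vapply(unclass(data)[cols], is.data.frame, logical(1))]
  unpack(data, all_of(cols), ...)
}

# `z` is a packed df-col; `x` and `y` are ordinary vectors
df <- tibble(x = 1L, y = 1, z = tibble(a = 1, b = "u"))

out <- unpack_df_cols(df)
names(out)
```

Whether `unpack()` itself should behave this way is the open question tracked in #1153; the wrapper only shows what the calling code would look like.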

tests/testthat/test-nest.R

mgirlich · 2021-08-09T07:06:51Z

LGTM! Looking forward to the speed improvements in tidyr 😄

DavisVaughan · 2021-08-26T18:09:24Z

No changes in revdeps!

DavisVaughan requested a review from mgirlich August 6, 2021 14:35

This was referenced Aug 23, 2021

unnest column of data.frames #1112

Closed

Change ptype default from NULL to list() in unnest() / unchop() #1152

Closed

Investigate whether unpack() should auto skip non-df-cols #1153

Closed

DavisVaughan added 5 commits August 26, 2021 12:27

Simplify unnest() to improve performance

fb75de4

It is now a simple combination of `unchop()` + `unpack()` on the remaining df-cols.

Add a test for unnesting df-cols

6a09dc6

Add skipped tests for lists of NULL

24de269

Add a test and NEWS bullet for the error with mixed type list-cols

6838037

Remove skip()s now that #1140 is merged

3af2624

DavisVaughan merged commit 36e5399 into tidyverse:master Aug 26, 2021

DavisVaughan deleted the feature/speed-up-unnest branch August 26, 2021 18:09
