Improve `unchop()` performance #1140

DavisVaughan · 2021-08-05T21:07:25Z

Closes #1126
Closes #1127
Closes #1128
Closes #1075
Closes #1052
A step towards #1112

@mgirlich I think you had some great ideas for improving unchop() in #1126! I've tweaked them a little bit and added them here, along with many additional edge case tests, and gave you credit in the NEWS file.

I haven't touched unnest() yet, I'll save that for another PR, but I think you are right that we could remove that as_df() code and just have unpack() only unpack the remaining df-cols. Your approach also fixes two bugs with unnest() that I'll add tests for.

The original plan for df_unchop() was to move it to C in vctrs, so I wrote it somewhat inefficiently with R level loops (which would have been pretty efficient at the C level). I'm no longer convinced that we should move it to vctrs because of how it peeks at the list-of ptype attribute (vctrs C code tries to be agnostic to these details, but that would be impossible here).

@mgirlich I've fixed a few edge cases that I discovered in yours, such as:

# expect 0 row tibble with `x=integer()` column
df <- tibble(x = list(NULL, NULL))
unchop(df, x, ptype = tibble(x = integer()))
#> Error in x[needs_recycling] <- map2(x[needs_recycling], lengths_out[needs_recycling],  : 
#>   NAs are not allowed in subscripted assignments

# (probably similar to above)
df <- tibble(x = 1:2, y = list(NULL, NULL), z = list(NULL, NULL))
expect <- tibble(x = integer(), y = unspecified(), z = unspecified())
expect_identical(unchop(df, c(y, z)), expect)
#> Error in x[needs_recycling] <- map2(x[needs_recycling], lengths_out[needs_recycling],  : 
#>   NAs are not allowed in subscripted assignments

# this should return a column of double NAs, not logical NAs
df <- tibble(x = 1:2, y = list(integer(), double()))
unchop(df, y, keep_empty = TRUE)
#> # A tibble: 2 x 2
#>       x y    
#>   <int> <lgl>
#> 1     1 NA   
#> 2     2 NA

Performance improvements are around the same as #1126 (see that PR for the performance of current master)

library(tidyr)
set.seed(1)

n <- 100e3
df <- tibble(
  a = purrr::map(1:n, ~ rlang::seq2(1, sample(0:10, 1))),
  y = 1:n
)

# list column
bench::mark(
  drop_empty = unchop(df, a),
  keep_empty = unchop(df, a, keep_empty = TRUE),
  check = FALSE
)
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 drop_empty   65.1ms   67.4ms      14.9    13.4MB     8.95
#> 2 keep_empty   83.4ms   87.4ms      11.5    13.3MB    11.5

# vector column
bench::mark(
  drop_empty = unchop(df, y),
  keep_empty = unchop(df, y, keep_empty = TRUE),
  check = FALSE
)
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 drop_empty   6.58ms   7.15ms      139.    3.82MB     22.7
#> 2 keep_empty   6.08ms   6.42ms      151.    3.82MB     14.2

DavisVaughan · 2021-08-05T21:09:44Z

R/chop.R

+#' @param ptype Optionally, a named list of column name-prototype pairs to
+#'   coerce `cols` to, overriding the default that will be guessed from
+#'   combining the individual values.


One subtle change is that ptype no longer has to be a data frame. It is now just a named list where the names can match any names specified by cols. This matches pivot_longer().

Previously we required a data frame here, but we had some strange behavior where we would cast the unchopped data frame to ptype, which could add columns that only ever existed in the ptype. That seemed very strange, and this feels like much better behavior for this argument.

This closes #1075

I would say it also closes #1052 since the confusion there was that you had to supply the ptype as a tibble. Now that it is just a named list, it should be more intuitive.

DavisVaughan · 2021-08-05T21:10:26Z

R/chop.R

+    # Convert to form that `df_unchop()` requires
+    ptype <- list()


As an internal detail, I've made df_unchop() always take a list ptype, so we convert that here

DavisVaughan · 2021-08-05T21:11:40Z

R/chop.R

+  # Remove unchopped columns to avoid slicing them needlessly later
+  out[sel] <- NULL


@mgirlich had the nice idea that we should remove the columns that get unchopped up front. That way when we do vec_slice(out, loc) later on, we don't needlessly slice those columns (since we are going to immediately replace them with update_cols())

DavisVaughan · 2021-08-05T21:15:53Z

R/chop.R

-      # Recycle each row element to the common size of that row
-      pieces[[j]] <- tidyr_recycle(elt, size)
+    if (!col_is_list) {
+      out_cols[[i]] <- vec_slice(col, out_loc)


Nice optimization also pointed out by @mgirlich that while unchopping non-list columns is kind of a silly thing for the user to do, it can be done a lot faster by just slicing the column. Previously I think we essentially converted it to a list, recycled each element, and then converted it back to a vector.

This will actually help performance with #1112. The user is trying to unnest() a data frame column (really they should use unpack(), but unnest() should work too). The unchop() step of that will have them unchopping a df-col, which will be faster because of this optimization.

DavisVaughan · 2021-08-05T21:19:44Z

R/chop.R

+    if (is_null(col)) {
+      # This can happen when both of these are true:
+      # - `col` was an empty list(), or a list of all `NULL`s.
+      # - No ptype was specified for `col`, either by the user or by a list-of.
+      if (out_size != 0L) {
+        abort("Internal error: `NULL` column generated, but output size is not `0`.")
+      }

-  loc <- rep(seq_len_size, sizes)
+      col <- unspecified(0L)
+    }


This edge case is mainly a side effect of the fact that vec_unchop(list()) and vec_unchop(list(NULL)) return NULL. There isn't much we can do about this, and I feel like it is the right vctrs behavior, but we can't insert a NULL as a data frame column so the correct alternative is to insert an unspecified object. It can come up when the user tries to unchop tibble(x = list()) or tibble(x = list(NULL)), where there is clearly no obvious type for x.

DavisVaughan · 2021-08-05T21:23:03Z

R/chop.R

-  } else {
-    vec_recycle(x, size)
+unchop_sizes2 <- function(x, y) {
+  # Standard tidyverse recycling rules, just vectorized.


Something that is different between this implementation and the previous one (and @mgirlich's) is that we now use standard recycling rules when computing the majority of the common sizes for the rows. No more awkward special handling of NA_integer_ sizes!

To be able to do this, we do a pre-adjustment to NULL elements in unchop_col_info(), and then a post-adjustment to entirely NULL rows in unchop_finalize()

Changing an element of a list doesn't really matter memory wise, right? So, the list is not twice in the memory but basically only the changed elements?

DavisVaughan · 2021-08-05T21:24:34Z

tests/testthat/test-chop.R

-test_that("can specify a ptype with extra columns", {
+test_that("nonexistent `ptype` columns are ignored", {
  df <- tibble(x = 1, y = list(1, 2))
-  ptype <- tibble(y = numeric(), z = numeric())
-
-  expect <- tibble(x = c(1, 1), y = c(1, 2), z = c(NA_real_, NA_real_))
+  ptype <- list(y = numeric(), z = numeric())

-  expect_identical(unchop(df, y, ptype = ptype), expect)
+  expect_identical(unchop(df, y, ptype = ptype), unchop(df, y))
 })


This is the weird ptype behavior that we used to have. Notice that z was a column that only existed in the ptype, but that forced it to exist in the unchop() result as an all missing column. That seemed pretty silly, so I've removed that behavior.

mgirlich

Looks great to me. Very readable code. Thanks for improving my PR!

R/chop.R

mgirlich · 2021-08-06T06:34:11Z

R/chop.R

-  } else {
-    vec_recycle(x, size)
+unchop_sizes2 <- function(x, y) {
+  # Standard tidyverse recycling rules, just vectorized.


Changing an element of a list doesn't really matter memory wise, right? So, the list is not twice in the memory but basically only the changed elements?

DavisVaughan · 2021-08-06T16:39:16Z

No changes in revdeps (besides those caused by #1142)!

DavisVaughan · 2021-08-06T17:32:37Z

@hadley, like #1126 this has significant changes, so I don't expect a full review until you get back to working on tidyr. No changes in revdeps though.

@mgirlich

Based on the great ideas from #1126 from @mgirlich. This also allows `unchop()` to handle unchopping df-cols / rcrd-cols (which is a no-op).

…ected Previously, we required a data frame ptype and we would cast the unchopped data frame to this ptype. This was confusing, and could add additional columns to the unchopped result that were only present in the ptype. A named list of column specific ptypes is more in line with `pivot_longer()`.

Co-authored-by: Maximilian Girlich <[email protected]>

- ffscrapr, r2dii.analysis, strand due to #1142 - simTool, tidyjson due to `unchop()` dropping names. Will fix momentarily. - Unsure about visR

This broke a few revdeps, and isn't necessary/correct. `vec_unchop()` combines with `"minimal"` repair by default, so there shouldn't be any issues.

Turns out that visR was also affected by the name zapping behavior

Outer names can be problematic in `vec_unchop()` if the inner elements have size >1, as then some kind of name spec is needed to combine them. This fixes an issue with graphTweets that popped up after removing the `zap()` from `vec_unchop()`, which zapped both outer and inner names.

* Simplify `unnest()` to improve performance It is now a simple combination of `unchop()` + `unpack()` on the remaining df-cols. * Add a test for unnesting df-cols * Add skipped tests for lists of `NULL` * Add a test and NEWS bullet for the error with mixed type list-cols * Remove `skip()`s now that #1140 is merged

DavisVaughan commented Aug 5, 2021

View reviewed changes

DavisVaughan requested a review from mgirlich August 5, 2021 21:26

mgirlich reviewed Aug 6, 2021

View reviewed changes

DavisVaughan mentioned this pull request Aug 6, 2021

Improve unnest() performance #1141

Merged

DavisVaughan requested a review from hadley August 6, 2021 17:31

This was referenced Aug 17, 2021

Add function df_unchop()? r-lib/vctrs#1390

Closed

Add df_size_common(df, keep_empty = TRUE) r-lib/vctrs#1391

Closed

unnest column of data.frames #1112

Closed

DavisVaughan and others added 7 commits August 25, 2021 14:23

Vectorize df_unchop() and have it handle keep_empty

af3ee5b

Based on the great ideas from #1126 from @mgirlich. This also allows `unchop()` to handle unchopping df-cols / rcrd-cols (which is a no-op).

Apply suggestions from code review

37d705b

Co-authored-by: Maximilian Girlich <[email protected]>

Run revdeps

eafc52a

- ffscrapr, r2dii.analysis, strand due to #1142 - simTool, tidyjson due to `unchop()` dropping names. Will fix momentarily. - Unsure about visR

Don't zap names when unchopping

25be48e

This broke a few revdeps, and isn't necessary/correct. `vec_unchop()` combines with `"minimal"` repair by default, so there shouldn't be any issues.

Re-run revdeps for tidyjson, simTool, and visR

baa5001

Turns out that visR was also affected by the name zapping behavior

DavisVaughan mentioned this pull request Aug 25, 2021

Change ptype default from NULL to list() in unnest() / unchop() #1152

Closed

DavisVaughan merged commit ba73124 into tidyverse:master Aug 25, 2021

DavisVaughan deleted the feature/speed-up-unchop branch August 25, 2021 20:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve `unchop()` performance #1140

Improve `unchop()` performance #1140

DavisVaughan commented Aug 5, 2021 •

edited

Loading

DavisVaughan Aug 5, 2021 •

edited

Loading

DavisVaughan Aug 5, 2021

DavisVaughan Aug 5, 2021

DavisVaughan Aug 5, 2021

DavisVaughan Aug 5, 2021

DavisVaughan Aug 5, 2021

mgirlich Aug 6, 2021

DavisVaughan Aug 5, 2021

mgirlich left a comment

mgirlich Aug 6, 2021

DavisVaughan commented Aug 6, 2021

DavisVaughan commented Aug 6, 2021

		# Convert to form that `df_unchop()` requires
		ptype <- list()

		# Remove unchopped columns to avoid slicing them needlessly later
		out[sel] <- NULL

Improve unchop() performance #1140

Improve unchop() performance #1140

Conversation

DavisVaughan commented Aug 5, 2021 • edited Loading

DavisVaughan Aug 5, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mgirlich left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

DavisVaughan commented Aug 6, 2021

DavisVaughan commented Aug 6, 2021

Improve `unchop()` performance #1140

Improve `unchop()` performance #1140

DavisVaughan commented Aug 5, 2021 •

edited

Loading

DavisVaughan Aug 5, 2021 •

edited

Loading