Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve unchop() performance #1140

Merged
merged 7 commits into from
Aug 25, 2021
Merged

Improve unchop() performance #1140

merged 7 commits into from
Aug 25, 2021

Conversation

DavisVaughan
Copy link
Member

@DavisVaughan DavisVaughan commented Aug 5, 2021

Closes #1126
Closes #1127
Closes #1128
Closes #1075
Closes #1052
A step towards #1112

@mgirlich I think you had some great ideas for improving unchop() in #1126! I've tweaked them a little bit and added them here, along with many additional edge case tests, and gave you credit in the NEWS file.

I haven't touched unnest() yet, I'll save that for another PR, but I think you are right that we could remove that as_df() code and just have unpack() only unpack the remaining df-cols. Your approach also fixes two bugs with unnest() that I'll add tests for.

The original plan for df_unchop() was to move it to C in vctrs, so I wrote it somewhat inefficiently with R level loops (which would have been pretty efficient at the C level). I'm no longer convinced that we should move it to vctrs because of how it peeks at the list-of ptype attribute (vctrs C code tries to be agnostic to these details, but that would be impossible here).


@mgirlich I've fixed a few edge cases that I discovered in yours, such as:

# expect 0 row tibble with `x=integer()` column
df <- tibble(x = list(NULL, NULL))
unchop(df, x, ptype = tibble(x = integer()))
#> Error in x[needs_recycling] <- map2(x[needs_recycling], lengths_out[needs_recycling],  : 
#>   NAs are not allowed in subscripted assignments

# (probably similar to above)
df <- tibble(x = 1:2, y = list(NULL, NULL), z = list(NULL, NULL))
expect <- tibble(x = integer(), y = unspecified(), z = unspecified())
expect_identical(unchop(df, c(y, z)), expect)
#> Error in x[needs_recycling] <- map2(x[needs_recycling], lengths_out[needs_recycling],  : 
#>   NAs are not allowed in subscripted assignments

# this should return a column of double NAs, not logical NAs
df <- tibble(x = 1:2, y = list(integer(), double()))
unchop(df, y, keep_empty = TRUE)
#> # A tibble: 2 x 2
#>       x y    
#>   <int> <lgl>
#> 1     1 NA   
#> 2     2 NA   

Performance improvements are around the same as #1126 (see that PR for the performance of current master)

library(tidyr)
set.seed(1)

n <- 100e3
df <- tibble(
  a = purrr::map(1:n, ~ rlang::seq2(1, sample(0:10, 1))),
  y = 1:n
)

# list column
bench::mark(
  drop_empty = unchop(df, a),
  keep_empty = unchop(df, a, keep_empty = TRUE),
  check = FALSE
)
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 drop_empty   65.1ms   67.4ms      14.9    13.4MB     8.95
#> 2 keep_empty   83.4ms   87.4ms      11.5    13.3MB    11.5

# vector column
bench::mark(
  drop_empty = unchop(df, y),
  keep_empty = unchop(df, y, keep_empty = TRUE),
  check = FALSE
)
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 drop_empty   6.58ms   7.15ms      139.    3.82MB     22.7
#> 2 keep_empty   6.08ms   6.42ms      151.    3.82MB     14.2

Comment on lines +40 to +42
#' @param ptype Optionally, a named list of column name-prototype pairs to
#' coerce `cols` to, overriding the default that will be guessed from
#' combining the individual values.
Copy link
Member Author

@DavisVaughan DavisVaughan Aug 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One subtle change is that ptype no longer has to be a data frame. It is now just a named list where the names can match any names specified by cols. This matches pivot_longer().

Previously we required a data frame here, but we had some strange behavior where we would cast the unchopped data frame to ptype, which could add columns that only ever existed in the ptype. That seemed very strange, and this feels like much better behavior for this argument.

This closes #1075

I would say it also closes #1052 since the confusion there was that you had to supply the ptype as a tibble. Now that it is just a named list, it should be more intuitive.

Comment on lines +96 to +97
# Convert to form that `df_unchop()` requires
ptype <- list()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As an internal detail, I've made df_unchop() always take a list ptype, so we convert that here

Comment on lines +108 to +109
# Remove unchopped columns to avoid slicing them needlessly later
out[sel] <- NULL
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mgirlich had the nice idea that we should remove the columns that get unchopped up front. That way when we do vec_slice(out, loc) later on, we don't needlessly slice those columns (since we are going to immediately replace them with update_cols())

# Recycle each row element to the common size of that row
pieces[[j]] <- tidyr_recycle(elt, size)
if (!col_is_list) {
out_cols[[i]] <- vec_slice(col, out_loc)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice optimization also pointed out by @mgirlich that while unchopping non-list columns is kind of a silly thing for the user to do, it can be done a lot faster by just slicing the column. Previously I think we essentially converted it to a list, recycled each element, and then converted it back to a vector.

This will actually help performance with #1112. The user is trying to unnest() a data frame column (really they should use unpack(), but unnest() should work too). The unchop() step of that will have them unchopping a df-col, which will be faster because of this optimization.

R/chop.R Outdated
Comment on lines 227 to 239
if (is_null(col)) {
# This can happen when both of these are true:
# - `col` was an empty list(), or a list of all `NULL`s.
# - No ptype was specified for `col`, either by the user or by a list-of.
if (out_size != 0L) {
abort("Internal error: `NULL` column generated, but output size is not `0`.")
}

loc <- rep(seq_len_size, sizes)
col <- unspecified(0L)
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This edge case is mainly a side effect of the fact that vec_unchop(list()) and vec_unchop(list(NULL)) return NULL. There isn't much we can do about this, and I feel like it is the right vctrs behavior, but we can't insert a NULL as a data frame column so the correct alternative is to insert an unspecified object. It can come up when the user tries to unchop tibble(x = list()) or tibble(x = list(NULL)), where there is clearly no obvious type for x.

} else {
vec_recycle(x, size)
unchop_sizes2 <- function(x, y) {
# Standard tidyverse recycling rules, just vectorized.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something that is different between this implementation and the previous one (and @mgirlich's) is that we now use standard recycling rules when computing the majority of the common sizes for the rows. No more awkward special handling of NA_integer_ sizes!

To be able to do this, we do a pre-adjustment to NULL elements in unchop_col_info(), and then a post-adjustment to entirely NULL rows in unchop_finalize()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing an element of a list doesn't really matter memory wise, right? So, the list is not twice in the memory but basically only the changed elements?

Comment on lines 103 to 138
test_that("can specify a ptype with extra columns", {
test_that("nonexistent `ptype` columns are ignored", {
df <- tibble(x = 1, y = list(1, 2))
ptype <- tibble(y = numeric(), z = numeric())

expect <- tibble(x = c(1, 1), y = c(1, 2), z = c(NA_real_, NA_real_))
ptype <- list(y = numeric(), z = numeric())

expect_identical(unchop(df, y, ptype = ptype), expect)
expect_identical(unchop(df, y, ptype = ptype), unchop(df, y))
})
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the weird ptype behavior that we used to have. Notice that z was a column that only existed in the ptype, but that forced it to exist in the unchop() result as an all missing column. That seemed pretty silly, so I've removed that behavior.

@DavisVaughan DavisVaughan requested a review from mgirlich August 5, 2021 21:26
Copy link
Contributor

@mgirlich mgirlich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great to me. Very readable code. Thanks for improving my PR!

R/chop.R Outdated Show resolved Hide resolved
R/chop.R Outdated Show resolved Hide resolved
} else {
vec_recycle(x, size)
unchop_sizes2 <- function(x, y) {
# Standard tidyverse recycling rules, just vectorized.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing an element of a list doesn't really matter memory wise, right? So, the list is not twice in the memory but basically only the changed elements?

@DavisVaughan
Copy link
Member Author

No changes in revdeps (besides those caused by #1142)!

@DavisVaughan DavisVaughan requested a review from hadley August 6, 2021 17:31
@DavisVaughan
Copy link
Member Author

@hadley, like #1126 this has significant changes, so I don't expect a full review until you get back to working on tidyr. No changes in revdeps though.

DavisVaughan and others added 7 commits August 25, 2021 14:23
Based on the great ideas from #1126 from @mgirlich. This also allows `unchop()` to handle unchopping df-cols / rcrd-cols (which is a no-op).
…ected

Previously, we required a data frame ptype and we would cast the unchopped data frame to this ptype. This was confusing, and could add additional columns to the unchopped result that were only present in the ptype. A named list of column specific ptypes is more in line with `pivot_longer()`.
Co-authored-by: Maximilian Girlich <[email protected]>
- ffscrapr, r2dii.analysis, strand due to #1142
- simTool, tidyjson due to `unchop()` dropping names. Will fix momentarily.
- Unsure about visR
This broke a few revdeps, and isn't necessary/correct. `vec_unchop()` combines with `"minimal"` repair by default, so there shouldn't be any issues.
Turns out that visR was also affected by the name zapping behavior
Outer names can be problematic in `vec_unchop()` if the inner elements have size >1, as then some kind of name spec is needed to combine them. This fixes an issue with graphTweets that popped up after removing the `zap()` from `vec_unchop()`, which zapped both outer and inner names.
@DavisVaughan DavisVaughan merged commit ba73124 into tidyverse:master Aug 25, 2021
@DavisVaughan DavisVaughan deleted the feature/speed-up-unchop branch August 25, 2021 20:08
DavisVaughan added a commit that referenced this pull request Aug 26, 2021
* Simplify `unnest()` to improve performance

It is now a simple combination of `unchop()` + `unpack()` on the remaining df-cols.

* Add a test for unnesting df-cols

* Add skipped tests for lists of `NULL`

* Add a test and NEWS bullet for the error with mixed type list-cols

* Remove `skip()`s now that #1140 is merged
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants