Improve unchop() performance #1140
Conversation
    #' @param ptype Optionally, a named list of column name-prototype pairs to
    #' coerce `cols` to, overriding the default that will be guessed from
    #' combining the individual values.
One subtle change is that `ptype` no longer has to be a data frame. It is now just a named list where the names can match any names specified by `cols`. This matches `pivot_longer()`.

Previously we required a data frame here, but we had some strange behavior where we would cast the unchopped data frame to `ptype`, which could add columns that only ever existed in the ptype. That seemed very strange, and this feels like much better behavior for this argument.
This closes #1075
I would say it also closes #1052 since the confusion there was that you had to supply the ptype as a tibble. Now that it is just a named list, it should be more intuitive.
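For a concrete feel of the new interface, here is a hedged sketch based on the behavior described above (not code taken from the PR itself):

    library(tidyr)

    df <- tibble::tibble(x = 1:2, y = list(1L, 2:3))

    # `ptype` is now a plain named list of prototypes, as in `pivot_longer()`.
    # Requesting `y = double()` means the unchopped column comes back as double.
    unchop(df, y, ptype = list(y = double()))
    #> a 3-row tibble: x = c(1L, 2L, 2L), y = c(1, 2, 3), with `y` as <dbl>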
    # Convert to form that `df_unchop()` requires
    ptype <- list()
As an internal detail, I've made `df_unchop()` always take a list `ptype`, so we convert that here.
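For intuition, a minimal hypothetical sketch of that normalization (the helper name and exact code are assumptions, not the actual tidyr source):

    # `unchop()` accepts `ptype = NULL` or a named list of prototypes;
    # `df_unchop()` always receives a named list, so `NULL` becomes list().
    normalize_ptype <- function(ptype) {
      if (is.null(ptype)) {
        list()
      } else {
        as.list(ptype)
      }
    }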
    # Remove unchopped columns to avoid slicing them needlessly later
    out[sel] <- NULL
@mgirlich had the nice idea that we should remove the columns that get unchopped up front. That way, when we do `vec_slice(out, loc)` later on, we don't needlessly slice those columns (since we are going to immediately replace them with `update_cols()`).
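A small standalone sketch of the idea (hypothetical code; in the PR this happens inside `df_unchop()`, and the unchopped values are written back by `update_cols()`):

    library(vctrs)

    df  <- tibble::tibble(x = 1:2, y = list(1:2, 3:4))
    sel <- "y"                  # columns being unchopped
    loc <- c(1L, 1L, 2L, 2L)    # which input row each output row comes from

    out <- df
    out[sel] <- NULL            # drop the list-columns up front...
    out <- vec_slice(out, loc)  # ...so only the kept columns get sliced
    out$y <- c(1L, 2L, 3L, 4L)  # the unchopped values are added back afterwards
    out
    #> a 4-row tibble with x = c(1, 1, 2, 2) and y = 1:4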
    # Recycle each row element to the common size of that row
    pieces[[j]] <- tidyr_recycle(elt, size)
    if (!col_is_list) {
      out_cols[[i]] <- vec_slice(col, out_loc)
Another nice optimization pointed out by @mgirlich: while unchopping non-list columns is kind of a silly thing for the user to do, it can be done a lot faster by just slicing the column. Previously, I think we essentially converted it to a list, recycled each element, and then converted it back to a vector.

This will actually help performance with #1112. The user is trying to `unnest()` a data frame column (really they should use `unpack()`, but `unnest()` should work too). The `unchop()` step of that will have them unchopping a df-col, which will be faster because of this optimization.
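A hedged sketch of what that fast path looks like (variable names follow the diff above; the values are made up):

    library(vctrs)

    col     <- c(10, 20, 30)             # a non-list column being "unchopped"
    sizes   <- c(2L, 1L, 2L)             # the common size of each row
    out_loc <- rep(seq_along(col), sizes)

    # Instead of converting `col` to a list, recycling each element, and
    # recombining, a single slice does the whole job:
    vec_slice(col, out_loc)
    #> [1] 10 10 20 30 30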
R/chop.R (Outdated)
    if (is_null(col)) {
      # This can happen when both of these are true:
      # - `col` was an empty list(), or a list of all `NULL`s.
      # - No ptype was specified for `col`, either by the user or by a list-of.
      if (out_size != 0L) {
        abort("Internal error: `NULL` column generated, but output size is not `0`.")
      }

      loc <- rep(seq_len_size, sizes)
      col <- unspecified(0L)
    }
This edge case is mainly a side effect of the fact that `vec_unchop(list())` and `vec_unchop(list(NULL))` return `NULL`. There isn't much we can do about this, and I feel like it is the right vctrs behavior, but we can't insert a `NULL` as a data frame column, so the correct alternative is to insert an unspecified object. It can come up when the user tries to unchop `tibble(x = list())` or `tibble(x = list(NULL))`, where there is clearly no obvious type for `x`.
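The vctrs behavior being worked around can be seen directly (hedged: `vec_unchop()` is the name used at the time of this PR; newer vctrs exposes the same operation as `list_unchop()`):

    library(vctrs)

    vec_unchop(list())      # NULL: no elements, so no type to infer
    vec_unchop(list(NULL))  # NULL: a single NULL element carries no type either

    # A NULL can't be a data frame column, so tidyr falls back to an
    # unspecified vector of size 0 instead:
    unspecified(0L)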
    } else {
      vec_recycle(x, size)
    unchop_sizes2 <- function(x, y) {
      # Standard tidyverse recycling rules, just vectorized.
Something that is different between this implementation and the previous one (and @mgirlich's) is that we now use standard recycling rules when computing the majority of the common sizes for the rows. No more awkward special handling of `NA_integer_` sizes!

To be able to do this, we do a pre-adjustment to `NULL` elements in `unchop_col_info()`, and then a post-adjustment to entirely-`NULL` rows in `unchop_finalize()`.
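A minimal sketch of what vectorized size-2 recycling looks like (hypothetical helper; the real `unchop_sizes2()` surely differs in details and error messages):

    # For each row: a size of 1 recycles to the other size, equal sizes pass
    # through, and anything else errors -- the usual tidyverse rules, applied
    # elementwise over the two vectors of row sizes.
    unchop_sizes2_sketch <- function(x, y) {
      size <- ifelse(x == 1L, y, x)
      bad  <- x != 1L & y != 1L & x != y
      if (any(bad)) {
        stop("Incompatible sizes in row(s): ", paste(which(bad), collapse = ", "))
      }
      size
    }

    unchop_sizes2_sketch(c(1L, 3L, 2L), c(4L, 3L, 1L))
    #> [1] 4 3 2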
Changing an element of a list doesn't really matter memory-wise, right? So the list isn't duplicated in memory, only the changed elements?
tests/testthat/test-chop.R (Outdated)
test_that("can specify a ptype with extra columns", { | ||
test_that("nonexistent `ptype` columns are ignored", { | ||
df <- tibble(x = 1, y = list(1, 2)) | ||
ptype <- tibble(y = numeric(), z = numeric()) | ||
|
||
expect <- tibble(x = c(1, 1), y = c(1, 2), z = c(NA_real_, NA_real_)) | ||
ptype <- list(y = numeric(), z = numeric()) | ||
|
||
expect_identical(unchop(df, y, ptype = ptype), expect) | ||
expect_identical(unchop(df, y, ptype = ptype), unchop(df, y)) | ||
}) |
This is the weird `ptype` behavior that we used to have. Notice that `z` was a column that only existed in the `ptype`, but that forced it to exist in the `unchop()` result as an all-missing column. That seemed pretty silly, so I've removed that behavior.
Looks great to me. Very readable code. Thanks for improving my PR!
No changes in revdeps (besides those caused by #1142)!
…ected

Previously, we required a data frame ptype and we would cast the unchopped data frame to this ptype. This was confusing, and could add additional columns to the unchopped result that were only present in the ptype. A named list of column-specific ptypes is more in line with `pivot_longer()`.
Co-authored-by: Maximilian Girlich <[email protected]>
- ffscrapr, r2dii.analysis, strand due to #1142
- simTool, tidyjson due to `unchop()` dropping names. Will fix momentarily.
- Unsure about visR
This broke a few revdeps, and isn't necessary/correct. `vec_unchop()` combines with `"minimal"` repair by default, so there shouldn't be any issues.
Turns out that visR was also affected by the name-zapping behavior.
Outer names can be problematic in `vec_unchop()` if the inner elements have size >1, as then some kind of name spec is needed to combine them. This fixes an issue with graphTweets that popped up after removing the `zap()` from `vec_unchop()`, which zapped both outer and inner names.
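The underlying constraint is easiest to see with `vec_c()`, which uses the same name-combination machinery (a hedged illustration; exact error wording depends on the vctrs version):

    library(vctrs)

    # An outer name on a length-1 element is fine: it simply becomes the name.
    vec_c(x = 1)

    # An outer name on an element of size > 1 is ambiguous without a name spec.
    try(vec_c(x = 1:2))
    #> Error: Can't merge the outer name `x` with a vector of length > 1.
    #> Please supply a `.name_spec` specification.

    # Supplying a spec says how outer and inner names should be combined:
    vec_c(x = 1:2, .name_spec = "{outer}_{inner}")
    #> x_1 x_2
    #>   1   2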
* Simplify `unnest()` to improve performance. It is now a simple combination of `unchop()` + `unpack()` on the remaining df-cols (a rough sketch of this decomposition follows below).
* Add a test for unnesting df-cols
* Add skipped tests for lists of `NULL`
* Add a test and NEWS bullet for the error with mixed-type list-cols
* Remove `skip()`s now that #1140 is merged
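A rough sketch of that decomposition (illustrative only; the real `unnest()` handles many more cases):

    library(tidyr)

    df <- tibble::tibble(
      x = 1:2,
      y = list(
        tibble::tibble(a = 1,   b = 2),
        tibble::tibble(a = 3:4, b = 5:6)
      )
    )

    step1 <- unchop(df, y)  # `y` becomes a df-col with one row per inner row
    unpack(step1, y)        # the df-col is then unpacked into `a` and `b`
    # ...which matches unnest(df, y)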
Closes #1126
Closes #1127
Closes #1128
Closes #1075
Closes #1052
A step towards #1112
@mgirlich I think you had some great ideas for improving `unchop()` in #1126! I've tweaked them a little bit and added them here, along with many additional edge case tests, and gave you credit in the NEWS file.

I haven't touched `unnest()` yet, I'll save that for another PR, but I think you are right that we could remove that `as_df()` code and just have `unpack()` only unpack the remaining df-cols. Your approach also fixes two bugs with `unnest()` that I'll add tests for.

The original plan for `df_unchop()` was to move it to C in vctrs, so I wrote it somewhat inefficiently with R-level loops (which would have been pretty efficient at the C level). I'm no longer convinced that we should move it to vctrs because of how it peeks at the list-of ptype attribute (vctrs C code tries to be agnostic to these details, but that would be impossible here).

@mgirlich I've fixed a few edge cases that I discovered in yours, such as:
Performance improvements are around the same as #1126 (see that PR for the performance of current master)