[SPARK-49547][SQL][PYTHON] Add iterator of RecordBatch API to applyInArrow #52440
Conversation
Looks pretty good! Only a few minor comments.
Also cc @HyukjinKwon
python/pyspark/worker.py
verify_arrow_result(batch, assign_cols_by_name, expected_cols_and_types)

def wrap_grouped_map_arrow_udf(f, return_type, argspec, is_iterator, runner_conf):
Shall we add a dedicated function `def wrap_grouped_map_arrow_iter_udf` for the new eval type?
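For illustration, a dedicated wrapper along those lines might look roughly like the following self-contained sketch; the function name, signature, and error message here are assumptions for this example, not the actual `pyspark/worker.py` code:

```python
# Illustrative sketch only -- not the actual pyspark/worker.py implementation.
from typing import Callable, Iterator, Tuple

import pyarrow as pa


def wrap_grouped_map_arrow_iter_udf(f: Callable, pass_key: bool):
    """Wrap a user function so each group's record batches stream through lazily."""

    def wrapped(key: Tuple, batches: Iterator[pa.RecordBatch]) -> Iterator[pa.RecordBatch]:
        result = f(key, batches) if pass_key else f(batches)
        for batch in result:
            # Validate each yielded batch as it streams out, instead of
            # materializing the whole group into a single pyarrow.Table first.
            if not isinstance(batch, pa.RecordBatch):
                raise TypeError(
                    "The user-defined function should yield pyarrow.RecordBatch, "
                    f"but got {type(batch)}"
                )
            yield batch

    return wrapped
```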
Yep, it is cleaner that way, done.

Please also add a simple example in the PR description.
`pyarrow.Table` or takes an iterator of `pyarrow.RecordBatch` and yields
`pyarrow.RecordBatch`. Additionally, each form can take a tuple of grouping keys
as the first argument, with the `pyarrow.Table` or iterator of `pyarrow.RecordBatch`
as the second argument.
let's add

.. versionchanged:: 4.1.0
    Supports iterator API ...
Let's also add two simple examples (w/o key) in the Examples section.
I added one with and one without key, since the key is more relevant for this API, I think.
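Roughly, the with-key form of the iterator API looks like this (an illustrative sketch; the data, column names, and aggregation are made up for this example, and the grouping key is assumed to arrive as a tuple of pyarrow scalars, as in the existing `applyInArrow` examples):

```python
from typing import Iterator, Tuple

import pyarrow as pa
import pyarrow.compute as pc

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ("id", "v"))

def sum_v(key: Tuple, batches: Iterator[pa.RecordBatch]) -> Iterator[pa.RecordBatch]:
    # Accumulate a running sum per group without holding all batches in memory.
    total = 0.0
    for batch in batches:
        total += pc.sum(batch.column("v")).as_py()
    # key is the tuple of grouping values for this group (here, one element for "id").
    yield pa.RecordBatch.from_pylist([{"id": key[0].as_py(), "v": total}])

df.groupBy("id").applyInArrow(sum_v, schema="id long, v double").show()
```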
Thank you @Kimahriman so much for spending time on this.
What changes were proposed in this pull request?
Add the option for `applyInArrow` to take a function that takes an iterator of `RecordBatch` and returns an iterator of `RecordBatch`. A new eval type, `SQL_GROUPED_MAP_ARROW_ITER_UDF`, is added and is detected via type hints on the function.

Why are the changes needed?
Having a single Table as input and a single Table as output requires collecting all inputs and outputs in memory for a single batch. This can require excessive memory for certain edge cases with large groups. Inputs and outputs already get serialized as record batches, so simply expose this lazy iterator directly instead of forcing materialization into a table.
Does this PR introduce any user-facing change?
Yes, a new function signature supported by `applyInArrow`.

Example:
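A minimal sketch of the new signature (illustrative data and column names; the `Iterator[pa.RecordBatch]` type hints on the function are what select the new eval type):

```python
from typing import Iterator

import pyarrow as pa
import pyarrow.compute as pc

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0)], ("id", "v"))

def double_v(batches: Iterator[pa.RecordBatch]) -> Iterator[pa.RecordBatch]:
    # Each group's batches stream through one at a time; nothing forces the
    # whole group to be materialized as a single pyarrow.Table.
    for batch in batches:
        yield pa.RecordBatch.from_arrays(
            [batch.column("id"), pc.multiply(batch.column("v"), pa.scalar(2.0))],
            names=["id", "v"],
        )

df.groupBy("id").applyInArrow(double_v, schema="id long, v double").show()
```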
How was this patch tested?
Updated existing UTs to test both Table signatures and RecordBatch signatures
Was this patch authored or co-authored using generative AI tooling?
No