[SPARK-49547][SQL][PYTHON] Add iterator of RecordBatch API to applyInArrow #49005
Conversation
Gentle ping for potential inclusion in 4.0
Force-pushed from 5c2c07c to 7766b1d
Gentle ping again now that 4.0 is out
Hi @HyukjinKwon @zhengruifeng, could you please review this again? We have indeed encountered this problem in our production jobs when users call applyInPandas with a function that returns a large DataFrame.
Force-pushed from a901806 to 4670c1d
Gentle ping again; it's getting tricky to keep resolving new merge conflicts each time conflicting changes land.
@Kimahriman we plan to optimize the batch size in multiple UDF types, e.g. this is the first one for SQL_GROUPED_MAP_PANDAS_UDF / SQL_GROUPED_MAP_ARROW_UDF. Can we reuse it for the iterator API? Regarding this PR, I think we should:
I can try to update this to use new eval types and type hints as the mechanism. I do think your update will fit in fine, as it's just a different way to implement the JVM-side batching that I implemented here; basically, you used a trait instead of a subclass. I still think it would be beneficial for both batched and non-batched execution to go through the same code path; otherwise, adding more eval types will make it even more complicated.
I have a new version of this, with a new eval type and type hints as the mechanism, ready to go after #52303 gets merged.
Closing in favor of #52440
What changes were proposed in this pull request?
Add the option for applyInArrow to take a function that takes an iterator of RecordBatch and returns an iterator of RecordBatch, and respect spark.sql.execution.arrow.maxRecordsPerBatch on the input iterator.
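A minimal sketch of the proposed iterator-of-RecordBatch signature (the DataFrame, schema, and function below are illustrative, not taken from the PR):

```python
from typing import Iterator

import pyarrow as pa
import pyarrow.compute as pc
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ("id", "v"))

def sum_v(batches: Iterator[pa.RecordBatch]) -> Iterator[pa.RecordBatch]:
    # Stream over the group's batches instead of materializing one Table,
    # so a large group never has to fit in memory all at once.
    gid, total = None, 0.0
    for batch in batches:
        gid = batch.column("id")[0].as_py()
        total += pc.sum(batch.column("v")).as_py()
    yield pa.RecordBatch.from_pylist([{"id": gid, "v": total}])

df.groupBy("id").applyInArrow(sum_v, schema="id long, v double").show()
```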
Why are the changes needed?
Being limited to returning a single Table requires collecting all of a group's results in memory as a single batch. This can require excessive memory for certain edge cases with large groups. Currently the Python worker immediately converts a returned Table into its underlying batches, so very few changes are required to accommodate this. Larger changes are required to support max records per batch on the input side.
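For contrast, a sketch of the existing single-Table signature (reusing the illustrative df above), where the whole group is materialized as one in-memory Table on both input and output:

```python
import pyarrow as pa
import pyarrow.compute as pc

def sum_v_table(table: pa.Table) -> pa.Table:
    # The entire group arrives as, and is returned as, one in-memory Table.
    gid = table.column("id")[0].as_py()
    total = pc.sum(table.column("v")).as_py()
    return pa.table({"id": [gid], "v": [total]})

df.groupBy("id").applyInArrow(sum_v_table, schema="id long, v double").show()
```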
Does this PR introduce any user-facing change?
Yes, a new function signature supported by applyInArrow.
How was this patch tested?
Updated existing UTs to test both the Table signature and the RecordBatch iterator signature.
Was this patch authored or co-authored using generative AI tooling?
No