
Conversation

Kimahriman
Contributor

What changes were proposed in this pull request?

Add an option for applyInArrow to take a function that consumes an iterator of RecordBatch and returns an iterator of RecordBatch, and respect spark.sql.execution.arrow.maxRecordsPerBatch when building the input iterator.

Why are the changes needed?

Being limited to returning a single Table requires collecting all results in memory as a single batch. This can require excessive memory in certain edge cases with large groups. Currently the Python worker immediately converts a table into its underlying batches, so barely any changes are required to accommodate this. Larger changes are required to support max records per batch on the input side.

Does this PR introduce any user-facing change?

Yes, a new function signature is supported by applyInArrow.

How was this patch tested?

Updated existing unit tests to cover both the Table signature and the RecordBatch iterator signature.

Was this patch authored or co-authored using generative AI tooling?

No

@Kimahriman
Contributor Author

Gentle ping for potential inclusion in 4.0

@Kimahriman Kimahriman force-pushed the apply-in-arrow-input branch from 5c2c07c to 7766b1d Compare February 8, 2025 14:13
@Kimahriman
Contributor Author

Gentle ping again now that 4.0 is out

@ConeyLiu
Contributor

ConeyLiu commented Jul 4, 2025

Hi, @HyukjinKwon @zhengruifeng could you please review this again? We indeed encountered this problem in our production jobs when users call applyInPandas, which returns a large DataFrame.

@Kimahriman Kimahriman force-pushed the apply-in-arrow-input branch from a901806 to 4670c1d Compare August 29, 2025 11:18
@Kimahriman
Contributor Author

Gentle ping again; it's getting tricky to keep up with and resolve new merge conflicts each time conflicting changes land.

@zhengruifeng
Contributor

zhengruifeng commented Sep 12, 2025

> Gentle ping again; it's getting tricky to keep up with and resolve new merge conflicts each time conflicting changes land.

@Kimahriman we plan to optimize the batch size in multiple UDF types, e.g.

  • SQL_GROUPED_MAP_PANDAS_UDF
  • SQL_GROUPED_MAP_ARROW_UDF
  • SQL_GROUPED_AGG_ARROW_UDF
  • SQL_GROUPED_AGG_PANDAS_UDF
  • SQL_WINDOW_AGG_PANDAS_UDF
  • SQL_WINDOW_AGG_ARROW_UDF

This is the first one, for SQL_GROUPED_MAP_PANDAS_UDF / SQL_GROUPED_MAP_ARROW_UDF. Can we reuse it for the iterator API?

Regarding this PR, I think we should:
1. exclude any changes in cogroup;
2. add a new eval type for the new iterator API, because it is a user-facing change. For example, we have SQL_SCALAR_PANDAS_ITER_UDF for the iterator API in Pandas UDFs.

@Kimahriman
Contributor Author

Kimahriman commented Sep 13, 2025

I can try to update this with new eval types, using type hints as the mechanism. I do think your update will fit in fine, since it's just a different way to implement the JVM-side batching I implemented here; basically you used a trait instead of a subclass. I still think it would be beneficial for both batched and non-batched execution to go through the same code path, otherwise adding more eval types will make things even more complicated.
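As a sketch of how type hints could distinguish the two signatures (purely illustrative: `Table` and `RecordBatch` here are stdlib stand-ins for the pyarrow classes, and `wants_iterator` is a hypothetical helper, not Spark's actual dispatch code):

```python
import collections.abc
from typing import Iterator, get_origin, get_type_hints

# Stand-ins for pyarrow.Table / pyarrow.RecordBatch so this sketch
# runs without pyarrow installed.
class Table: ...
class RecordBatch: ...

def table_func(table: Table) -> Table:
    ...  # existing signature: one Table in, one Table out

def batch_func(batches: Iterator[RecordBatch]) -> Iterator[RecordBatch]:
    ...  # new signature: iterator of RecordBatch in and out

def wants_iterator(func) -> bool:
    # Inspect the return annotation: an Iterator[...] hint would select
    # the new eval type, anything else keeps the Table code path.
    ret = get_type_hints(func).get("return")
    return get_origin(ret) is collections.abc.Iterator

print(wants_iterator(table_func))  # False
print(wants_iterator(batch_func))  # True
```

This mirrors how PySpark already uses type hints to pick eval types for pandas UDFs, though the real inference logic is more involved.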

@Kimahriman
Contributor Author

I have a new version of this, with a new eval type and type hints, ready to go once #52303 gets merged.

@Kimahriman
Contributor Author

Closing in favor of #52440

@Kimahriman Kimahriman closed this Sep 24, 2025