Conversation

@Kimahriman (Contributor) commented on Sep 24, 2025:

What changes were proposed in this pull request?

Add the option for applyInArrow to take a function that accepts an iterator of pyarrow.RecordBatch and returns an iterator of pyarrow.RecordBatch. A new eval type, SQL_GROUPED_MAP_ARROW_ITER_UDF, is added, and it is detected via type hints on the function.
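As a minimal sketch (the function and column names here are hypothetical, not from the patch), the two shapes of user function that applyInArrow now distinguishes via type hints look like this:

from typing import Iterator, Tuple

import pyarrow as pa

# Existing form: the whole group is materialized as one pyarrow.Table.
def table_func(key: Tuple[pa.Scalar, ...], table: pa.Table) -> pa.Table:
    return table

# New form: the group is streamed as an iterator of pyarrow.RecordBatch,
# detected via the Iterator[pa.RecordBatch] type hints.
def batch_func(key: Tuple[pa.Scalar, ...], batches: Iterator[pa.RecordBatch]) -> Iterator[pa.RecordBatch]:
    for batch in batches:
        yield batch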

Why are the changes needed?

Having a single Table as input and a single Table as output requires collecting all of a group's input and output in memory at once. This can require excessive memory in edge cases with large groups. Inputs and outputs already get serialized as record batches, so this simply exposes that lazy iterator directly instead of forcing materialization into a Table.
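To make the contrast concrete, the Table-based path conceptually amounts to materializing every batch of a group up front before the user function runs (an illustrative sketch, not the actual serializer code):

import pyarrow as pa

# batches: the stream of pyarrow.RecordBatch objects for one group (assumed here).
# Building a Table forces all of them into memory at once, which is exactly
# what the new iterator signature avoids.
table = pa.Table.from_batches(batches)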

Does this PR introduce any user-facing change?

Yes: a new function signature is supported by applyInArrow.

Example:

from typing import Iterator, Tuple

import pyarrow as pa
import pyarrow.compute as pc

def sum_func(key: Tuple[pa.Scalar, ...], batches: Iterator[pa.RecordBatch]) -> Iterator[pa.RecordBatch]:
    total = 0
    for batch in batches:
        total += pc.sum(batch.column("v")).as_py()
    yield pa.RecordBatch.from_pydict({"id": [key[0].as_py()], "v": [total]})

# df is a DataFrame with columns "id" and "v"
df.groupby("id").applyInArrow(sum_func, schema="id long, v double").show()
+---+----+
| id|   v|
+---+----+
|  1| 3.0|
|  2|18.0|
+---+----+

How was this patch tested?

Updated existing UTs to test both the Table signatures and the RecordBatch iterator signatures.

Was this patch authored or co-authored using generative AI tooling?

No

@Kimahriman (Contributor, Author) commented:

@zhengruifeng @HyukjinKwon

@zhengruifeng (Contributor) left a comment:


Looks pretty good!
Only a few minor comments.

also cc @HyukjinKwon

verify_arrow_result(batch, assign_cols_by_name, expected_cols_and_types)


def wrap_grouped_map_arrow_udf(f, return_type, argspec, is_iterator, runner_conf):
@zhengruifeng (Contributor):

Shall we add a dedicated function, def wrap_grouped_map_arrow_iter_udf, for the new eval type?

@Kimahriman (Contributor, Author):
Yep, it is cleaner that way; done.

@zhengruifeng
Copy link
Contributor

please also add a simple example in the PR description, section Does this PR introduce any user-facing change?

`pyarrow.Table` or takes an iterator of `pyarrow.RecordBatch` and yields
`pyarrow.RecordBatch`. Additionally, each form can take a tuple of grouping keys
as the first argument, with the `pyarrow.Table` or iterator of `pyarrow.RecordBatch`
as the second argument.
@zhengruifeng (Contributor):

Let's add

    .. versionchanged:: 4.1.0
        Supports iterator API ...

@zhengruifeng (Contributor):

Let's also add two simple examples (with and without a key) in the Examples section.

@Kimahriman (Contributor, Author):
I added one with and one without the key, since I think the key is more relevant for this API.
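The docstring excerpt above covers both the keyed and the un-keyed forms; for reference, an un-keyed iterator example could look roughly like this (a minimal sketch: double_func, df, and the column names are hypothetical, not the examples added in the patch):

from typing import Iterator

import pyarrow as pa
import pyarrow.compute as pc

def double_func(batches: Iterator[pa.RecordBatch]) -> Iterator[pa.RecordBatch]:
    # Transform the group one batch at a time; no grouping-key argument is needed.
    for batch in batches:
        doubled = pc.multiply(batch.column("v"), 2.0)
        yield pa.RecordBatch.from_arrays([batch.column("id"), doubled], names=["id", "v"])

df.groupby("id").applyInArrow(double_func, schema="id long, v double").show()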

@zhengruifeng (Contributor) commented:

Thank you @Kimahriman so much for spending time on this.
Merged to master for 4.1.
