[Bug] arrow batch converter error #7245

@echo567

Search before asking

  • I have searched in the issues and found no similar issues.

Describe the bug

When kyuubi.operation.result.format=arrow, spark.connect.grpc.arrow.maxBatchSize does not take effect.

Reproduction: You can debug KyuubiArrowConverters or add the following log to line 300 of KyuubiArrowConverters:

logInfo(s"Total limit: ${limit}, rowCount: ${rowCount}, " +
  s"rowCountInLastBatch:${rowCountInLastBatch}," +
  s"estimatedBatchSize: ${estimatedBatchSize}," +
  s"maxEstimatedBatchSize: ${maxEstimatedBatchSize}," +
  s"maxRecordsPerBatch:${maxRecordsPerBatch}")

Test data: 1.6 million rows, 30 columns per row. Command executed:

bin/beeline \
  -u 'jdbc:hive2://10.168.X.X:XX/default;thrift.client.max.message.size=2000000000' \
  --hiveconf kyuubi.operation.result.format=arrow \
  -n test -p 'testpass' \
  --outputformat=csv2 -e "select * from db.table" > /tmp/test.csv

Log output

25/11/13 13:52:57 INFO KyuubiArrowConverters: Total limit: -1, rowCount: 200000, lastBatchRowCount:200000, estimatedBatchSize: 145600000 maxEstimatedBatchSize: 4,maxRecordsPerBatch:10000

25/11/13 13:52:57 INFO KyuubiArrowConverters: Total limit: -1, rowCount: 200000, lastBatchRowCount:200000, estimatedBatchSize: 145600000

Original Code

while (rowIter.hasNext && (
rowCountInLastBatch == 0 && maxEstimatedBatchSize > 0 ||
estimatedBatchSize <= 0 ||
estimatedBatchSize < maxEstimatedBatchSize ||
maxRecordsPerBatch <= 0 ||
rowCountInLastBatch < maxRecordsPerBatch ||
rowCount < limit ||
limit < 0))

DeepSeek's explanation is as follows:

while (rowIter.hasNext && (condition A || condition B || condition C || condition D || condition E || condition F))

Detailed Explanation of Each Condition

Condition A: rowCountInLastBatch == 0 && maxEstimatedBatchSize > 0

Special handling for writing the first row of a batch:

  • rowCountInLastBatch == 0: the current batch is still empty, so the next row is its first.
  • maxEstimatedBatchSize > 0: the maximum batch size setting is valid.

Meaning: when a valid batch size limit is set, the first row of the current batch is always written.

Condition B: estimatedBatchSize <= 0

Unlimited byte size: if the estimated batch size is ≤ 0, there is no limit.

Condition C: estimatedBatchSize < maxEstimatedBatchSize

Byte size not exceeded: the current estimated size is less than the maximum allowed size.

Condition D: maxRecordsPerBatch <= 0

Unlimited record count: if the maximum number of records per batch is ≤ 0, there is no limit.

Condition E: rowCountInLastBatch < maxRecordsPerBatch

Record count not exceeded: the number of records in the current batch is less than the limit.

Condition F: rowCount < limit || limit < 0

Total row count control:

  • rowCount < limit: the total number of rows processed has not reached the limit.
  • limit < 0: the total row count limit is negative (indicating no limit).

"Continue as long as any condition is met" strategy

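Because the conditions are joined with ||, the loop keeps going as long as any one of them is true. When the limit is not set (i.e., -1), limit < 0 is always true, so the byte size and record count limits are never enforced and all data is retrieved in a single batch. If the row count is too large, the following three problems occur:

(1) Driver/executor OOM

(2) Array OOM, because the array length is not sufficient

(3) Slow data transfer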
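The effect of the OR-joined conditions can be checked in isolation. The following is a minimal standalone sketch, not the Kyuubi source: the object and function names are hypothetical, rowIter.hasNext is left out, and the arguments are the values reported in the log output above.

```scala
object OriginalConditionSketch {

  // Mirrors the disjunction from the original while-condition (without rowIter.hasNext).
  def shouldContinue(
      rowCountInLastBatch: Long,
      estimatedBatchSize: Long,
      maxEstimatedBatchSize: Long,
      maxRecordsPerBatch: Long,
      rowCount: Long,
      limit: Long): Boolean = {
    rowCountInLastBatch == 0 && maxEstimatedBatchSize > 0 ||
    estimatedBatchSize <= 0 ||
    estimatedBatchSize < maxEstimatedBatchSize ||
    maxRecordsPerBatch <= 0 ||
    rowCountInLastBatch < maxRecordsPerBatch ||
    rowCount < limit ||
    limit < 0
  }

  def main(args: Array[String]): Unit = {
    // Values from the logged state: both per-batch limits are already exceeded.
    // limit = -1 makes `limit < 0` true, so the whole disjunction is true.
    println(shouldContinue(200000L, 145600000L, 4L, 10000L, 200000L, -1L)) // true

    // Same state with a total limit that has been reached: every disjunct is false.
    println(shouldContinue(200000L, 145600000L, 4L, 10000L, 200000L, 200000L)) // false
  }
}
```

With limit = -1 the predicate stays true even though both per-batch limits are exceeded, so only an exhausted iterator ends the loop.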

After updating the code, the log output is as follows:

25/11/14 10:57:16 INFO KyuubiArrowConverters: Total limit: -1, rowCount: 5762, rowCountInLastBatch:5762, estimatedBatchSize: 4194736, maxEstimatedBatchSize: 4194304, maxRecordsPerBatch:10000

25/11/14 10:57:16 INFO KyuubiArrowConverters: Total limit: -1, rowCount: 11524, rowCountInLastBatch: 5762, estimatedBatchSize: 4194736, maxEstimatedBatchSize: 4194304, maxRecordsPerBatch: 10000

25/11/14 10:57:16 INFO KyuubiArrowConverters: Total limit: -1, rowCount: 17286, rowCountInLastBatch: 5762, estimatedBatchSize: 4194736, maxEstimatedBatchSize: 4194304, maxRecordsPerBatch: 10000

Each batch's estimatedBatchSize is only slightly larger than maxEstimatedBatchSize, and the data is now written in batches as expected.
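The updated code is not included in this report. As an illustration only, the sketch below shows one hypothetical way to combine the checks so that every limit must hold for the loop to continue. The names withinLimits and batchSizes, the treatment of a non-positive maxEstimatedBatchSize as "no byte limit", and the fixed bytes-per-row estimate are assumptions made for the example, not Kyuubi APIs; 728 bytes per row is derived from the logged values (4194736 / 5762).

```scala
import scala.collection.mutable.ListBuffer

object BatchedLoopSketch {

  // Hypothetical predicate: all limits must hold (AND) instead of any one (OR).
  def withinLimits(
      rowCountInLastBatch: Long,
      estimatedBatchSize: Long,
      maxEstimatedBatchSize: Long,
      maxRecordsPerBatch: Long,
      rowCount: Long,
      limit: Long): Boolean = {
    // A negative limit means there is no total row limit.
    val underTotalLimit = limit < 0 || rowCount < limit
    // Assumption: a non-positive byte limit means unlimited.
    // The first row of a batch is always written.
    val underByteLimit = maxEstimatedBatchSize <= 0 ||
      rowCountInLastBatch == 0 ||
      estimatedBatchSize < maxEstimatedBatchSize
    // A non-positive record limit means unlimited.
    val underRecordLimit = maxRecordsPerBatch <= 0 ||
      rowCountInLastBatch < maxRecordsPerBatch
    underTotalLimit && underByteLimit && underRecordLimit
  }

  // Simulates batching; each row is assumed to add a fixed bytesPerRow to the estimate.
  // Returns the number of rows written into each batch.
  def batchSizes(
      totalRows: Long,
      bytesPerRow: Long,
      maxEstimatedBatchSize: Long,
      maxRecordsPerBatch: Long,
      limit: Long): List[Long] = {
    val sizes = ListBuffer.empty[Long]
    var rowCount = 0L
    while (rowCount < totalRows && (limit < 0 || rowCount < limit)) {
      var rowCountInLastBatch = 0L
      var estimatedBatchSize = 0L
      while (rowCount < totalRows && withinLimits(
          rowCountInLastBatch, estimatedBatchSize,
          maxEstimatedBatchSize, maxRecordsPerBatch, rowCount, limit)) {
        rowCountInLastBatch += 1
        estimatedBatchSize += bytesPerRow
        rowCount += 1
      }
      sizes += rowCountInLastBatch
    }
    sizes.toList
  }

  def main(args: Array[String]): Unit = {
    // 17286 rows at 728 bytes each, 4 MiB byte limit, 10000-record limit, no total limit.
    println(batchSizes(17286L, 728L, 4194304L, 10000L, -1L)) // List(5762, 5762, 5762)
  }
}
```

Under these assumptions the simulation yields 5762 rows per batch, matching rowCountInLastBatch in the log lines above. The byte check runs before each row is written, so the row that crosses the threshold is still included, which is why the estimate ends slightly above the maximum.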

Affects Version(s)

master

Kyuubi Server Log Output

Kyuubi Engine Log Output

25/11/13 13:52:57 INFO KyuubiArrowConverters: Total limit: -1, rowCount: 200000, lastBatchRowCount:200000, estimatedBatchSize: 145600000 maxEstimatedBatchSize: 4,maxRecordsPerBatch:10000

25/11/13 13:52:57 INFO KyuubiArrowConverters: Total limit: -1, rowCount: 200000, lastBatchRowCount:200000, estimatedBatchSize: 145600000

Kyuubi Server Configurations

Kyuubi Engine Configurations

Additional context

Test data: 1.6 million rows, 30 columns per row

Are you willing to submit PR?

  • Yes. I would be willing to submit a PR with guidance from the Kyuubi community to fix.
  • No. I cannot submit a PR at this time.
