Description
Code of Conduct
- I agree to follow this project's Code of Conduct
Search before asking
- I have searched in the issues and found no similar issues.
Describe the bug
When kyuubi.operation.result.format=arrow, spark.connect.grpc.arrow.maxBatchSize does not take effect.
Reproduction: You can debug KyuubiArrowConverters or add the following log to line 300 of KyuubiArrowConverters:
logInfo(s"Total limit: ${limit}, rowCount: ${rowCount}, " +
s"rowCountInLastBatch:${rowCountInLastBatch}," +
s"estimatedBatchSize: ${estimatedBatchSize}," +
s"maxEstimatedBatchSize: ${maxEstimatedBatchSize}," +
s"maxRecordsPerBatch:${maxRecordsPerBatch}")
Test data: 1.6 million rows, 30 columns per row. Command executed:
bin/beeline \
-u 'jdbc:hive2://10.168.X.X:XX/default;thrift.client.max.message.size=2000000000' \
--hiveconf kyuubi.operation.result.format=arrow \
-n test -p 'testpass' \
--outputformat=csv2 -e "select * from db.table" > /tmp/test.csv
Log output
25/11/13 13:52:57 INFO KyuubiArrowConverters: Total limit: -1, rowCount: 200000, lastBatchRowCount:200000, estimatedBatchSize: 145600000 maxEstimatedBatchSize: 4,maxRecordsPerBatch:10000
25/11/13 13:52:57 INFO KyuubiArrowConverters: Total limit: -1, rowCount: 200000, lastBatchRowCount:200000, estimatedBatchSize: 145600000
Original Code
while (rowIter.hasNext && (
rowCountInLastBatch == 0 && maxEstimatedBatchSize > 0 ||
estimatedBatchSize <= 0 ||
estimatedBatchSize < maxEstimatedBatchSize ||
maxRecordsPerBatch <= 0 ||
rowCountInLastBatch < maxRecordsPerBatch ||
rowCount < limit ||
limit < 0))
DeepSeek's explanation of this condition is as follows:
while (rowIter.hasNext && (condition A || condition B || condition C || condition D || condition E || condition F))
Detailed Explanation of Each Condition
Condition A: rowCountInLastBatch == 0 && maxEstimatedBatchSize > 0
Special handling for the first row of a batch:
rowCountInLastBatch == 0: the current batch is still empty, so the next row is its first row.
maxEstimatedBatchSize > 0: a valid maximum batch size is set.
Meaning: when a valid batch size limit is set, the first row of the current batch is always written.
Condition B: estimatedBatchSize <= 0
Unlimited byte size: if the estimated batch size is ≤ 0, there is no limit.
Condition C: estimatedBatchSize < maxEstimatedBatchSize
Byte size not exceeded: the current estimated size is less than the maximum allowed size.
Condition D: maxRecordsPerBatch <= 0
Unlimited record count: if the maximum number of records per batch is ≤ 0, there is no limit.
Condition E: rowCountInLastBatch < maxRecordsPerBatch
Record count not exceeded: the number of records in the current batch is less than the limit.
Condition F: rowCount < limit || limit < 0
Total row count control:
rowCount < limit: the total number of rows processed has not reached the limit.
limit < 0: the total row count limit is negative (indicating no limit).
"Continue as long as any condition is met" strategy
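Because the conditions are joined with ||, the loop keeps writing rows into the current batch as long as any one of them holds. When the limit is not set, i.e., -1, limit < 0 is always true, so all data is retrieved in a single batch regardless of maxEstimatedBatchSize and maxRecordsPerBatch. If the row count is too large, the following three problems occur:
(1) Driver/executor OOM
(2) Array OOM, because the required array length exceeds what can be allocated
(3) Slow data transfer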
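The effect of the quoted condition can be reproduced in isolation. The sketch below copies the boolean expression from the snippet above into a pure function and drives it with a simple counter instead of a real row iterator. The object name, the helper rowsInFirstBatch, and the fixed row size of 728 bytes are assumptions made for illustration; 728 is inferred from the log by dividing estimatedBatchSize (145600000) by rowCount (200000) and assumes all rows are the same size.

```scala
object OriginalBatchCondition {
  // Copy of the boolean expression quoted in the issue; the parameter
  // names match the variables used in that snippet.
  def shouldContinue(
      rowCount: Long,
      rowCountInLastBatch: Long,
      estimatedBatchSize: Long,
      maxEstimatedBatchSize: Long,
      maxRecordsPerBatch: Long,
      limit: Long): Boolean =
    rowCountInLastBatch == 0 && maxEstimatedBatchSize > 0 ||
      estimatedBatchSize <= 0 ||
      estimatedBatchSize < maxEstimatedBatchSize ||
      maxRecordsPerBatch <= 0 ||
      rowCountInLastBatch < maxRecordsPerBatch ||
      rowCount < limit ||
      limit < 0

  // Hypothetical helper (not part of Kyuubi): counts how many rows end up
  // in the first batch when every row has the same assumed size.
  def rowsInFirstBatch(
      totalRows: Long,
      bytesPerRow: Long,
      maxEstimatedBatchSize: Long,
      maxRecordsPerBatch: Long,
      limit: Long): Long = {
    var rows = 0L
    var estimatedBatchSize = 0L
    // `rows < totalRows` stands in for rowIter.hasNext. For the first batch,
    // rowCount and rowCountInLastBatch are the same number.
    while (rows < totalRows &&
      shouldContinue(rows, rows, estimatedBatchSize,
        maxEstimatedBatchSize, maxRecordsPerBatch, limit)) {
      rows += 1
      estimatedBatchSize += bytesPerRow
    }
    rows
  }

  def main(args: Array[String]): Unit = {
    // Values taken from the first log line: limit -1, size cap 4, record cap 10000.
    println(rowsInFirstBatch(200000L, 728L, 4L, 10000L, -1L))
  }
}
```

With the values from the log, the helper returns 200000: neither the byte cap nor the record cap ends the batch, because limit < 0 alone keeps the loop running.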
After updating the code, the log output is as follows:
25/11/14 10:57:16 INFO KyuubiArrowConverters: Total limit: -1, rowCount: 5762, rowCountInLastBatch:5762, estimatedBatchSize: 4194736, maxEstimatedBatchSize: 4194304, maxRecordsPerBatch:10000
25/11/14 10:57:16 INFO KyuubiArrowConverters: Total limit: -1, rowCount: 11524, rowCountInLastBatch: 5762, estimatedBatchSize: 4194736, maxEstimatedBatchSize: 4194304, maxRecordsPerBatch: 10000
25/11/14 10:57:16 INFO KyuubiArrowConverters: Total limit: -1, rowCount: 17286, rowCountInLastBatch: 5762, estimatedBatchSize: 4194736, maxEstimatedBatchSize: 4194304, maxRecordsPerBatch: 10000
The estimatedBatchSize (4194736) is only slightly larger than the maxEstimatedBatchSize (4194304), and the data is now written in batches as expected.
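The updated code itself is not shown in this report. One possible way to restructure the condition, offered only as a sketch and not necessarily the actual patch, is to evaluate the total limit, the byte cap, and the record cap separately and require all three to hold. The object name, the helper rowsInFirstBatch, and the uniform row size of 728 bytes (4194736 / 5762 in the log above) are assumptions for illustration.

```scala
object BoundedBatchCondition {
  // Sketch of a stricter loop condition: a batch keeps growing only while
  // the total limit, the byte cap and the record cap all allow it.
  def shouldContinue(
      rowCount: Long,
      rowCountInLastBatch: Long,
      estimatedBatchSize: Long,
      maxEstimatedBatchSize: Long,
      maxRecordsPerBatch: Long,
      limit: Long): Boolean = {
    // A negative limit means there is no total row limit.
    val withinLimit = limit < 0 || rowCount < limit
    // A non-positive byte cap disables the check; otherwise the first row
    // of a batch is always admitted so that a batch is never empty.
    val withinBytes = maxEstimatedBatchSize <= 0 ||
      rowCountInLastBatch == 0 ||
      estimatedBatchSize < maxEstimatedBatchSize
    // A non-positive record cap disables the check.
    val withinRecords = maxRecordsPerBatch <= 0 ||
      rowCountInLastBatch < maxRecordsPerBatch
    withinLimit && withinBytes && withinRecords
  }

  // Hypothetical helper (not part of Kyuubi): rows written to the first
  // batch when every row has the same assumed size.
  def rowsInFirstBatch(
      totalRows: Long,
      bytesPerRow: Long,
      maxEstimatedBatchSize: Long,
      maxRecordsPerBatch: Long,
      limit: Long): Long = {
    var rows = 0L
    var estimatedBatchSize = 0L
    // `rows < totalRows` stands in for rowIter.hasNext.
    while (rows < totalRows &&
      shouldContinue(rows, rows, estimatedBatchSize,
        maxEstimatedBatchSize, maxRecordsPerBatch, limit)) {
      rows += 1
      estimatedBatchSize += bytesPerRow
    }
    rows
  }

  def main(args: Array[String]): Unit = {
    // Values taken from the log after the change: size cap 4194304, record cap 10000.
    println(rowsInFirstBatch(200000L, 728L, 4194304L, 10000L, -1L))
  }
}
```

Under these assumptions the helper returns 5762, the rowCountInLastBatch seen in the log. The size check runs before each row is written, so the row that crosses the cap is still included: 5761 rows give 4194008 bytes, below the cap, and the 5762nd brings the estimate to 4194736.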
Affects Version(s)
master
Kyuubi Server Log Output
Kyuubi Engine Log Output
25/11/13 13:52:57 INFO KyuubiArrowConverters: Total limit: -1, rowCount: 200000, lastBatchRowCount:200000, estimatedBatchSize: 145600000 maxEstimatedBatchSize: 4,maxRecordsPerBatch:10000
25/11/13 13:52:57 INFO KyuubiArrowConverters: Total limit: -1, rowCount: 200000, lastBatchRowCount:200000, estimatedBatchSize: 145600000
Kyuubi Server Configurations
Kyuubi Engine Configurations
Additional context
Test data: 1.6 million rows, 30 columns per row
Are you willing to submit PR?
- Yes. I would be willing to submit a PR with guidance from the Kyuubi community to fix.
- No. I cannot submit a PR at this time.