GH-49420: [C++][Gandiva] Fix castVARCHAR memory allocation and len<=0 handling by dmitry-chirkov-dremio · Pull Request #49421 · apache/arrow

dmitry-chirkov-dremio · 2026-03-02T04:55:09Z

Rationale for this change

The castVARCHAR functions in Gandiva have memory allocation inefficiencies and missing edge case handling. See GH-49420 for details.

What changes are included in this PR?

Functional fixes:

bool: Remove unused 5-byte arena allocation; return string literal directly
int32/int64: Add handling for len=0 (return empty string) and len<0 (set error)

Memory allocation optimizations:

int32/int64: Allocate fixed small buffer (11/20 bytes) directly in arena, use optimized digit-pair conversion writing right-to-left, then memmove to align. Returns min(len, actual_size) bytes.
date64: Allocate only min(len, 10) bytes upfront (output is always "YYYY-MM-DD")
float32/float64: Allocate only min(len, 24) bytes upfront (max output length)

Code cleanup:

Extract common code into helper macros to reduce duplication

Are these changes tested?

Yes. Added tests for len=0 and len<0 edge cases for int64, date64, float32, float64, and bool types. All existing Gandiva tests pass. Adhoc performance benchmarking was performed both via direct expression evaluation as well as via query execution via Dremio.

Are there any user-facing changes?

No. Users will see reduced memory usage and proper error messages for invalid len parameter values.
Note: Error messages for negative len remain different between precompiled ("Output buffer length can't be negative") and interpreted ("Buffer length cannot be negative") code paths, preserving existing behavior.

GitHub Issue: [C++][Gandiva] Fix castVARCHAR memory inefficiencies, unused bool allocation, and missing len<=0 handling #49420

dmitry-chirkov-dremio · 2026-03-02T13:29:26Z

Let me go through first-timer's hurdles (like pre-commit clang format failures).
GH-49347 for 'aws/core/utils/pagination/Paginator.h' file not found

dmitry-chirkov-dremio · 2026-03-02T18:08:41Z

Pushed clang-format fixes

kou · 2026-03-04T00:12:18Z

@lriggs @akravchukdremio @xxlaykxx You may want to review this.

dmitry-chirkov-dremio · 2026-03-04T03:07:55Z

@lriggs @akravchukdremio @xxlaykxx You may want to review this.

https://github.com/telemenar is reviewing offline. Second commit is based on their feedback.

dmitry-chirkov-dremio · 2026-03-04T14:59:07Z

Performance Benchmark Results

Benchmarks run on Apple M3, 5 repetitions each. Tests exercise castVARCHAR for Int32/Int64 with various len parameters. We see small yet noticeable improvements for int conversion. Date conversion's performance remained as is.

Test data:

Int32 / Int64: Full-range random integers (1M rows)
Int32Small / Int64Small: 2-digit integers 10-99 (1M rows)

Int32 (Full-Range Random)

len	Original (μs)	Optimized (μs)	Δ
1	16,416	16,243	-1.1% ✅
11	18,436	18,235	-1.1% ✅
100	19,043	18,455	-3.1% ✅
65536	18,844	18,013	-4.4% ✅

Int32 (Small 2-Digit)

len	Original (μs)	Optimized (μs)	Δ
1	11,710	10,853	-7.3% ✅
11	12,217	11,751	-3.8% ✅
100	12,208	12,182	-0.2% ✅
65536	12,207	11,614	-4.9% ✅

Int64 (Full-Range Random)

len	Original (μs)	Optimized (μs)	Δ
1	17,198	17,016	-1.1% ✅
20	19,369	18,922	-2.3% ✅
100	19,223	18,749	-2.5% ✅
65536	19,484	18,744	-3.8% ✅

Int64 (Small 2-Digit)

len	Original (μs)	Optimized (μs)	Δ
1	11,345	10,708	-5.6% ✅
20	12,306	11,736	-4.6% ✅
100	12,237	11,545	-5.7% ✅
65536	12,147	11,528	-5.1% ✅

Memory savings: Allocates 11-20 bytes per integer instead of up to 65,536 bytes - significant for engines that tend to use VARCHAR(65536)

p.s. updated to match 891c1ca

telemenar

There is a allocation pattern change for int that might be unexpected when len < max_len previously we would limit the allocation to len but now we always allocate at max_len.

I think this is fine though given the stated performance impact as there are a number of mitigating factors:

max_len = 11|20 bytes for int32|int64 so the allocation size is capped and not huge.
the arena used for this allocation is batch scoped. Which means that any overallocation is limited both the the processing time of a batch and to the number of rows/columns in a batch.
in my experience it is rare that len < 11|20 bytes in most schemas I've seen in practice.
out_len which is what is used by the caller is correctly limited by len so callers shouldn't ever touch the part of the buffer exceeding len

…len<=0 handling

Rename SUFFIX → FORMATTER_SUFFIX and FROM_TYPE → FROM_DATE64 to better reflect their actual usage after the integer optimization was added.

… data types

…conversion

dmitry-chirkov-dremio · 2026-03-13T04:52:46Z

@telemenar

now we always allocate at max_len.
I reverified - going with out_len allocation introduces a measurable performance penalty (5-10%!) and I don't want to chase that optimization for ~10 byte/value savings. Some other day! 😄

kou

+1

conbench-apache-arrow · 2026-03-13T12:05:40Z

After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit b9274bc.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 30 possible false positives for unstable benchmarks that are known to sometimes produce them.

…len<=0 handling (apache#49421) ### Rationale for this change The `castVARCHAR` functions in Gandiva have memory allocation inefficiencies and missing edge case handling. See apacheGH-49420 for details. ### What changes are included in this PR? **Functional fixes:** - `bool`: Remove unused 5-byte arena allocation; return string literal directly - `int32`/`int64`: Add handling for `len=0` (return empty string) and `len<0` (set error) **Memory allocation optimizations:** - `int32`/`int64`: Allocate fixed small buffer (11/20 bytes) directly in arena, use optimized digit-pair conversion writing right-to-left, then `memmove` to align. Returns `min(len, actual_size)` bytes. - `date64`: Allocate only `min(len, 10)` bytes upfront (output is always "YYYY-MM-DD") - `float32`/`float64`: Allocate only `min(len, 24)` bytes upfront (max output length) **Code cleanup:** - Extract common code into helper macros to reduce duplication ### Are these changes tested? Yes. Added tests for `len=0` and `len<0` edge cases for int64, date64, float32, float64, and bool types. All existing Gandiva tests pass. Adhoc performance benchmarking was performed both via direct expression evaluation as well as via query execution via Dremio. ### Are there any user-facing changes? No. Users will see reduced memory usage and proper error messages for invalid len parameter values. Note: Error messages for negative `len` remain different between precompiled ("Output buffer length can't be negative") and interpreted ("Buffer length cannot be negative") code paths, preserving existing behavior. * GitHub Issue: apache#49420 Authored-by: Dmitry Chirkov <dmitry.chirkov@dremio.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>

…eGH-49421 Backport several upstream fixes. (#136) * apacheGH-49420: [C++][Gandiva] Fix castVARCHAR memory allocation and len<=0 handling (apache#49421) ### Rationale for this change The `castVARCHAR` functions in Gandiva have memory allocation inefficiencies and missing edge case handling. See apacheGH-49420 for details. ### What changes are included in this PR? **Functional fixes:** - `bool`: Remove unused 5-byte arena allocation; return string literal directly - `int32`/`int64`: Add handling for `len=0` (return empty string) and `len<0` (set error) **Memory allocation optimizations:** - `int32`/`int64`: Allocate fixed small buffer (11/20 bytes) directly in arena, use optimized digit-pair conversion writing right-to-left, then `memmove` to align. Returns `min(len, actual_size)` bytes. - `date64`: Allocate only `min(len, 10)` bytes upfront (output is always "YYYY-MM-DD") - `float32`/`float64`: Allocate only `min(len, 24)` bytes upfront (max output length) **Code cleanup:** - Extract common code into helper macros to reduce duplication ### Are these changes tested? Yes. Added tests for `len=0` and `len<0` edge cases for int64, date64, float32, float64, and bool types. All existing Gandiva tests pass. Adhoc performance benchmarking was performed both via direct expression evaluation as well as via query execution via Dremio. ### Are there any user-facing changes? No. Users will see reduced memory usage and proper error messages for invalid len parameter values. Note: Error messages for negative `len` remain different between precompiled ("Output buffer length can't be negative") and interpreted ("Buffer length cannot be negative") code paths, preserving existing behavior. * GitHub Issue: apache#49420 Authored-by: Dmitry Chirkov <dmitry.chirkov@dremio.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com> * apacheGH-49438: [C++][Gandiva] Optimize LPAD/RPAD functions (apache#49439) ### Rationale for this change The `lpad_utf8_int32_utf8` and `rpad_utf8_int32_utf8` functions have performance inefficiency and a potential memory safety issue: 1. **Performance**: Single-byte fills iterate character-by-character when `memset` would suffice. Multi-byte fills use O(n) iterations instead of O(log n) with a doubling strategy. 2. **Memory safety**: When the fill string is longer than the padding space needed, the code could write more bytes than allocated. Fixed preventatively. ### What changes are included in this PR? 1. **Memory safety fix**: Use `std::min(fill_text_len, total_fill_bytes)` for the initial copy to prevent overflow 2. **Fast path**: Add single-byte fill optimization using `memset` 3. **General path**: Replace character-by-character loop with doubling strategy for multi-byte fills 4. **Tests**: Add comprehensive tests for the new code paths ### Are these changes tested? Yes. Added tests covering: - Large UTF-8 fill characters (4-byte emoji, 3-byte Chinese) - Single-byte fill boundaries (1 char and 65536 char padding) - Content verification for fill patterns - Doubling strategy boundaries - Partial fill scenarios (fill text longer than padding needed) ### Are there any user-facing changes? No. * GitHub Issue: apache#49438 Authored-by: Dmitry Chirkov <dmitry.chirkov@dremio.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com> * apacheGH-49441: [C++][Gandiva] Add rand_integer function (apache#49442) ### Rationale for this change Add `rand_integer` function to Gandiva to generate random integers, complementing the existing `rand`/`random` functions that generate random doubles. This provides native integer random number generation and offers a more efficient alternative to `CAST(rand() * range AS INT)`. ### What changes are included in this PR? - Add `RandomIntegerGeneratorHolder` class following the existing `RandomGeneratorHolder` pattern - Implement three function signatures: - `rand_integer()` → int32 in range [INT32_MIN, INT32_MAX] - `rand_integer(int32 range)` → int32 in range [0, range-1] - `rand_integer(int32 min, int32 max)` → int32 in range [min, max] inclusive - Add parameter validation (range > 0, min <= max) at expression compilation time - Add 8 unit tests covering all signatures and edge cases - Use `std::uniform_int_distribution<int32_t>` with Mersenne Twister engine ### Are these changes tested? Yes, added 8 unit tests in `random_generator_holder_test.cc`: - `NoParams` - verifies full int32 range - `WithRange` - verifies [0, range-1] bounds - `WithMinMax` - verifies [min, max] inclusive bounds - `WithNegativeMinMax` - verifies negative range handling - `InvalidRangeZero` - verifies range=0 is rejected - `InvalidRangeNegative` - verifies negative range is rejected - `InvalidMinGreaterThanMax` - verifies min > max is rejected - `NullRangeDefaultsToOne` - verifies null parameter handling ### Are there any user-facing changes? Yes, this adds a new `rand_integer` function to Gandiva with three signatures as described above. * GitHub Issue: apache#49441 * apacheGH-49454: [C++][Gandiva] Fix castVARCHAR_timestamp for pre-epoch timestamps (apache#49455) ### Rationale for this change apacheGH-49454 castVARCHAR_timestamp_int64 produces negative milliseconds for pre-epoch timestamps ### What changes are included in this PR? Fixed `castVARCHAR_timestamp_int64` to correctly handle pre-epoch timestamps (before 1970-01-01). The issue was that using `in % MILLIS_IN_SEC` on negative timestamps produces negative milliseconds, resulting in output like `"0107-10-17 12:20:03.-10"`. ### Are these changes tested? Yes, added 4 new test cases covering pre-epoch timestamps with milliseconds ### Are there any user-facing changes? **This PR contains a "Critical Fix".** This fixes a bug that caused **incorrect data to be produced** when casting pre-epoch timestamps to VARCHAR in Gandiva. Previously, timestamps before 1970-01-01 with non-zero milliseconds would produce invalid output with negative millisecond values (e.g., `"0107-10-17 12:20:03.-10"` instead of `"0107-10-17 12:20:03.900"`). * GitHub Issue: apache#49454 Authored-by: Dmitry Chirkov <dmitry.chirkov@dremio.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com> * apacheGH-49470: [C++][Gandiva] Fix crashes in substring_index and truncate with extreme integer values (apache#49471) ### Rationale for this change Two Gandiva functions crash when called with extreme integer parameter values: 1. `substring_index(VARCHAR, VARCHAR, INT)` crashes with SIGBUS when count is `INT_MIN` 2. `truncate(BIGINT, INT)` crashes with SIGSEGV when scale is `INT_MAX` or `INT_MIN` ### What changes are included in this PR? **substring_index fix** (`gdv_string_function_stubs.cc`): - Replace `abs(cnt)` with safe `int64_t` computation to avoid undefined behavior when `cnt == INT_MIN` **truncate fix** (`precompiled/extended_math_ops.cc`): - Return input unchanged for positive scales (no-op for integers) - Return 0 for scales < -38 to prevent out-of-bounds access in `GetScaleMultiplier` ### Are these changes tested? Yes. Added coverage for `INT_MAX`/`INT_MIN` values in `gdv_function_stubs_test.cc` and `extended_math_ops_test.cc`. ### Are there any user-facing changes? No. **This PR contains a "Critical Fix".** These changes fix crashes caused by: - `abs(INT_MIN)` triggering undefined behavior (integer overflow) in `substring_index` - Out-of-bounds array access in `GetScaleMultiplier` when `truncate` receives extreme scale values * GitHub Issue: apache#49470 Authored-by: Dmitry Chirkov <dmitry.chirkov@dremio.com> Signed-off-by: Rossi Sun <zanmato1984@gmail.com> --------- Signed-off-by: Sutou Kouhei <kou@clear-code.com> Signed-off-by: Rossi Sun <zanmato1984@gmail.com> Co-authored-by: Dmitry Chirkov <dmitry.chirkov@dremio.com>

…len<=0 handling (apache#49421) ### Rationale for this change The `castVARCHAR` functions in Gandiva have memory allocation inefficiencies and missing edge case handling. See apacheGH-49420 for details. ### What changes are included in this PR? **Functional fixes:** - `bool`: Remove unused 5-byte arena allocation; return string literal directly - `int32`/`int64`: Add handling for `len=0` (return empty string) and `len<0` (set error) **Memory allocation optimizations:** - `int32`/`int64`: Allocate fixed small buffer (11/20 bytes) directly in arena, use optimized digit-pair conversion writing right-to-left, then `memmove` to align. Returns `min(len, actual_size)` bytes. - `date64`: Allocate only `min(len, 10)` bytes upfront (output is always "YYYY-MM-DD") - `float32`/`float64`: Allocate only `min(len, 24)` bytes upfront (max output length) **Code cleanup:** - Extract common code into helper macros to reduce duplication ### Are these changes tested? Yes. Added tests for `len=0` and `len<0` edge cases for int64, date64, float32, float64, and bool types. All existing Gandiva tests pass. Adhoc performance benchmarking was performed both via direct expression evaluation as well as via query execution via Dremio. ### Are there any user-facing changes? No. Users will see reduced memory usage and proper error messages for invalid len parameter values. Note: Error messages for negative `len` remain different between precompiled ("Output buffer length can't be negative") and interpreted ("Buffer length cannot be negative") code paths, preserving existing behavior. * GitHub Issue: apache#49420 Authored-by: Dmitry Chirkov <dmitry.chirkov@dremio.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>

github-actions Bot added Component: C++ Component: Gandiva awaiting review Awaiting review labels Mar 2, 2026

dmitry-chirkov-dremio mentioned this pull request Mar 2, 2026

[C++][Gandiva] Fix castVARCHAR memory inefficiencies, unused bool allocation, and missing len<=0 handling #49420

Closed

dmitry-chirkov-dremio force-pushed the gandiva-castvarchar-optimization branch from 69b39cf to 61e5ec2 Compare March 2, 2026 14:53

dmitry-chirkov-dremio force-pushed the gandiva-castvarchar-optimization branch from 61e5ec2 to 37d8e6d Compare March 4, 2026 03:00

dmitry-chirkov-dremio marked this pull request as draft March 4, 2026 03:23

xxlaykxx approved these changes Mar 4, 2026

View reviewed changes

dmitry-chirkov-dremio force-pushed the gandiva-castvarchar-optimization branch from 90c36c4 to c6d2c25 Compare March 4, 2026 14:52

dmitry-chirkov-dremio marked this pull request as ready for review March 4, 2026 14:53

telemenar approved these changes Mar 4, 2026

View reviewed changes

github-actions Bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Mar 5, 2026

akravchukdremio approved these changes Mar 12, 2026

View reviewed changes

kou reviewed Mar 12, 2026

View reviewed changes

Comment thread cpp/src/gandiva/gdv_string_function_stubs.cc Outdated

github-actions Bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Mar 12, 2026

dmitry-chirkov-dremio added 5 commits March 12, 2026 20:38

apacheGH-49420: [C++][Gandiva] Fix castVARCHAR memory allocation and …

ac0c896

…len<=0 handling

apacheGH-49420: [Gandiva] Rename castVARCHAR macros for clarity

8557b6c

Rename SUFFIX → FORMATTER_SUFFIX and FROM_TYPE → FROM_DATE64 to better reflect their actual usage after the integer optimization was added.

Adding missing len=0 and len<0 tests for uniform test coverage across…

b164f82

… data types

Optimize int conversion a bit more

7535237

Refactor to use FormatAllDigits instead of reimplementing digit-pair …

891c1ca

…conversion

dmitry-chirkov-dremio force-pushed the gandiva-castvarchar-optimization branch from c6d2c25 to 891c1ca Compare March 13, 2026 04:28

github-actions Bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Mar 13, 2026

kou approved these changes Mar 13, 2026

View reviewed changes

kou merged commit b9274bc into apache:main Mar 13, 2026
51 of 52 checks passed

kou removed the awaiting change review Awaiting change review label Mar 13, 2026

github-actions Bot added the awaiting merge Awaiting merge label Mar 13, 2026

dmitry-chirkov-dremio mentioned this pull request Apr 17, 2026

[C++][Gandiva] Some string functions may be overflowed #49753

Open

lriggs mentioned this pull request Apr 23, 2026

GH-49471, GH-49455, GH-49442, GH-49439, GH-49421 Backport several upstream fixes. dremio/arrow#136

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GH-49420: [C++][Gandiva] Fix castVARCHAR memory allocation and len<=0 handling#49421

GH-49420: [C++][Gandiva] Fix castVARCHAR memory allocation and len<=0 handling#49421
kou merged 5 commits into
apache:mainfrom
dmitry-chirkov-dremio:gandiva-castvarchar-optimization

dmitry-chirkov-dremio commented Mar 2, 2026 •

edited

Loading

Uh oh!

dmitry-chirkov-dremio commented Mar 2, 2026 •

edited

Loading

Uh oh!

dmitry-chirkov-dremio commented Mar 2, 2026

Uh oh!

kou commented Mar 4, 2026

Uh oh!

dmitry-chirkov-dremio commented Mar 4, 2026 •

edited

Loading

Uh oh!

dmitry-chirkov-dremio commented Mar 4, 2026 •

edited

Loading

Uh oh!

telemenar left a comment

Uh oh!

Uh oh!

dmitry-chirkov-dremio commented Mar 13, 2026

Uh oh!

kou left a comment

Uh oh!

Uh oh!

conbench-apache-arrow Bot commented Mar 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

dmitry-chirkov-dremio commented Mar 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

dmitry-chirkov-dremio commented Mar 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dmitry-chirkov-dremio commented Mar 2, 2026

Uh oh!

kou commented Mar 4, 2026

Uh oh!

dmitry-chirkov-dremio commented Mar 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dmitry-chirkov-dremio commented Mar 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Performance Benchmark Results

Int32 (Full-Range Random)

Int32 (Small 2-Digit)

Int64 (Full-Range Random)

Int64 (Small 2-Digit)

Uh oh!

telemenar left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

dmitry-chirkov-dremio commented Mar 13, 2026

Uh oh!

kou left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

conbench-apache-arrow Bot commented Mar 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

dmitry-chirkov-dremio commented Mar 2, 2026 •

edited

Loading

dmitry-chirkov-dremio commented Mar 2, 2026 •

edited

Loading

dmitry-chirkov-dremio commented Mar 4, 2026 •

edited

Loading

dmitry-chirkov-dremio commented Mar 4, 2026 •

edited

Loading