[Draft]Implemented casting for RunEnd Encoding #7713

rich-t-kid-datadog · 2025-06-19T03:55:27Z

Which issue does this PR close?

This PR contributes towards the larger epic Implement RunArray

Rationale for this change

This PR implements casting support for RunEndEncoded arrays in Apache Arrow. RunEndEncoded is a compression format that stores consecutive runs of equal values efficiently, but previously lacked casting functionality. This change enables:

Casting FROM RunEndEncoded arrays - Converting the values within a RunEndEncoded array to different data types while preserving the run structure
Casting TO RunEndEncoded arrays - Converting regular arrays into RunEndEncoded format by performing run-length encoding
Full integration with Arrow's casting system - Making RunEndEncoded arrays work with the existing cast() and can_cast_types() functions
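As a rough illustration of the first point, here is a minimal std-only sketch, not the arrow-rs API. `Ree`, `cast_values`, and `expand` are hypothetical names introduced for this example. Only the values are converted; the run ends are carried over untouched, so the run structure is preserved.

```rust
/// Hypothetical stand-in for a run-end encoded array: `run_ends[i]` is the
/// exclusive logical end index of run `i`, and `values[i]` is its value.
#[derive(Debug, Clone, PartialEq)]
struct Ree<T> {
    run_ends: Vec<i32>,
    values: Vec<T>,
}

/// Cast only the values; the run ends are reused unchanged.
fn cast_values<T, U>(input: &Ree<T>, convert: impl Fn(&T) -> U) -> Ree<U> {
    Ree {
        run_ends: input.run_ends.clone(),
        values: input.values.iter().map(convert).collect(),
    }
}

/// Expand to the logical (decoded) sequence, useful for checking a result.
fn expand<T: Clone>(input: &Ree<T>) -> Vec<T> {
    let mut out = Vec::new();
    let mut start = 0i32;
    for (end, value) in input.run_ends.iter().zip(&input.values) {
        // Each run covers the logical positions start..end.
        for _ in start..*end {
            out.push(value.clone());
        }
        start = *end;
    }
    out
}
```

For example, casting `Ree { run_ends: vec![2, 5, 6], values: vec![10i32, 20, 30] }` with `|v| i64::from(*v)` touches three values instead of six logical elements.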

Run-End Encoded Array Casting: Tradeoffs and Implementation
The implementation of REE array casting introduced a critical tradeoff between user flexibility and data integrity.

Unlike most Arrow types, REE arrays have a fundamental monotonicity constraint: their run-end indices must be strictly increasing to preserve logical correctness. Silent truncation or wrapping during downcasts (e.g., Int64 → Int16) could produce invalid sequences like:

[1000, -15536, 14464] // due to overflow
Such sequences break the REE invariant and could cause panics or silent data corruption downstream.
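The overflow above can be reproduced with plain integer conversions. This is a std-only sketch, not arrow-rs code: the function names are hypothetical, and the input run ends `[1000, 50000, 80000]` are an assumption chosen because they wrap to the sequence shown.

```rust
use std::convert::TryFrom;

/// Wrapping (truncating) downcast: the behavior that corrupts run ends.
fn wrapping_downcast(run_ends: &[i64]) -> Vec<i16> {
    run_ends.iter().map(|&v| v as i16).collect()
}

/// Checked downcast: fails on the first run end that does not fit in i16.
fn checked_downcast(run_ends: &[i64]) -> Result<Vec<i16>, String> {
    run_ends
        .iter()
        .map(|&v| {
            i16::try_from(v).map_err(|_| format!("run end {} does not fit in Int16", v))
        })
        .collect()
}

/// The REE invariant: every run end is greater than the previous one.
fn is_strictly_increasing(run_ends: &[i16]) -> bool {
    run_ends.windows(2).all(|w| w[0] < w[1])
}
```

With the assumed input, `wrapping_downcast` yields `[1000, -15536, 14464]`, which fails `is_strictly_increasing`, while `checked_downcast` returns an error instead of a broken array.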

Arrow's CastOptions normally let the caller choose between safe: true, which turns a value that does not fit into a null, and safe: false, which returns an error. Nulls are not valid run-end indices, so the safe: true behavior would be actively harmful here. Therefore:

We chose to hard-code the safe: false (error on overflow) behavior for run-end casting.

This ensures that:

Any attempt to downcast run-end values that do not fit fails immediately, even if the user sets safe = true.

Upcasts (e.g., Int16 → Int64) are allowed, as they are lossless.

This policy protects the logical soundness of REE arrays and maintains integrity across the Arrow ecosystem.
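A minimal sketch of this policy, using only the standard library rather than arrow-rs. `Options`, `cast_values`, and `cast_run_ends` are hypothetical names, and the sketch assumes the arrow-rs meaning of the flag: safe: true yields a null on failure, safe: false yields an error. Ordinary values honor the flag, while run ends always take the error path.

```rust
use std::convert::TryFrom;

/// Hypothetical model of a cast option with the same meaning as `safe`:
/// `true` => null on failure, `false` => error on failure.
struct Options {
    safe: bool,
}

/// Value cast: honors `safe`, so an overflow becomes `None` when safe is true.
fn cast_values(values: &[i64], opts: &Options) -> Result<Vec<Option<i16>>, String> {
    values
        .iter()
        .map(|&v| match i16::try_from(v) {
            Ok(x) => Ok(Some(x)),
            Err(_) if opts.safe => Ok(None),
            Err(_) => Err(format!("value {} does not fit in Int16", v)),
        })
        .collect()
}

/// Run-end cast: deliberately ignores `safe` and always errors on overflow,
/// because a null or wrapped run end would break the array.
fn cast_run_ends(run_ends: &[i64], _opts: &Options) -> Result<Vec<i16>, String> {
    run_ends
        .iter()
        .map(|&v| {
            i16::try_from(v).map_err(|_| format!("run end {} does not fit in Int16", v))
        })
        .collect()
}
```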

These changes are being made to allow REE to be cast just like any other type in the Arrow ecosystem.

What changes are included in this PR?

run_end_encoded_cast() - Casts values within existing RunEndEncoded arrays to different types
cast_to_run_end_encoded() - Converts regular arrays to RunEndEncoded format with run-length encoding
Comprehensive test suite covering various data types and edge cases
Updated can_cast_types() to support RunEndEncoded compatibility rules
Downcasting the run-end type (e.g., Int64 → Int16) is not supported.
Users can now cast RunEndEncoded arrays using the standard arrow_cast::cast() function
All existing APIs remain unchanged.
There are no breaking changes in this PR; it is a purely additive change.

alamb · 2025-06-19T11:37:03Z

FYI @brancz and @alexanderbianchi

alexanderbianchi · 2025-06-19T20:06:49Z

Yup! This is my team's intern 😄
Sorry if the draft PR is noisy. After we iterate a bit, we'll fill out the description and comment on which tradeoffs we should make. For example, we can encode the values and then cast, or we can cast the full array and then encode the values. Casting while encoding is also possible, but maybe a bit less "plug and play".

alamb · 2025-06-20T13:56:04Z

No worries -- let me know when it is ready for a review and I'll take a closer look

Thanks for helping drive this forward

arrow-cast/src/cast/run_array.rs

Implement casting between REE arrays and other Arrow types. REE-to-REE casting validates run-end upcasts only (Int16→Int32, Int16→Int64, Int32→Int64) to prevent invalid sequences.
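The upcast-only rule can be sketched as a small predicate. This is a std-only illustration with hypothetical names (`RunEndType`, `can_cast_run_ends`), not the code in this PR, and it assumes that casting to the same run-end type is also accepted.

```rust
/// Hypothetical model of the three run-end index types.
#[derive(Debug, Clone, Copy, PartialEq)]
enum RunEndType {
    Int16,
    Int32,
    Int64,
}

impl RunEndType {
    /// Bit width of the index type.
    fn bits(self) -> u32 {
        match self {
            RunEndType::Int16 => 16,
            RunEndType::Int32 => 32,
            RunEndType::Int64 => 64,
        }
    }
}

/// Allow a run-end cast only when the target is at least as wide as the
/// source, so every existing run end is representable (assumption: equal
/// widths are allowed as a no-op).
fn can_cast_run_ends(from: RunEndType, to: RunEndType) -> bool {
    to.bits() >= from.bits()
}
```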

gabotechs · 2025-06-26T14:25:55Z

arrow-cast/src/cast/mod.rs

+        (_, RunEndEncoded(_, _value_type)) => true,
+
+


Suggested change

(_, RunEndEncoded(_, _value_type)) => true,

(_, RunEndEncoded(_, _value_type)) => true,

No need for the extra blank lines here.

gabotechs · 2025-06-26T14:29:31Z

arrow-cast/src/cast/run_array.rs

+/// and value type. This function performs run-length encoding on the input array.
+///
+/// # Arguments
+/// * `array` - The input array to be run-length encoded


I see that there are several references in the docs and comments to "run-length" instead of "run-end", should those be changed?

gabotechs · 2025-06-26T14:34:28Z

arrow-cast/src/cast/run_array.rs

+        cast_with_options(array, value_type, cast_options)?
+    };
+
+    // Step 2: Run-length encode the cast array


Suggested change

// Step 2: Run-length encode the cast array

// Step 2: Run-end encode the cast array

?

Also, are we missing step 3?

gabotechs · 2025-06-26T14:37:14Z

arrow-cast/src/cast/mod.rs

@@ -137,6 +139,10 @@ pub fn can_cast_types(from_type: &DataType, to_type: &DataType) -> bool {
            can_cast_types(from_value_type, to_value_type)
        }
        (Dictionary(_, value_type), _) => can_cast_types(value_type, to_type),
+        (RunEndEncoded(_, value_type), _) => can_cast_types(value_type.data_type(), to_type),
+        (_, RunEndEncoded(_, _value_type)) => true,


Can anything be converted from non-REE to REE? I imagine that in order for you to encode something as REE, you need to be able to compare values. Are all the possible DataTypes comparable?

gabotechs · 2025-06-26T14:38:43Z

arrow-cast/src/cast/mod.rs

This file does not look like it's properly formatted. Maybe run cargo fmt?

gabotechs · 2025-06-26T14:41:41Z

arrow-cast/src/cast/run_array.rs

+
+    // Step 4: Build the run_ends array
+    for &run_end in &run_ends_vec {
+        run_ends_builder.append_value(K::Native::from_usize(run_end).unwrap());


Rather than panicking here, it might be better to return the error
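One way this could look, sketched with only the standard library. `build_run_ends` is a hypothetical helper, `K: TryFrom<usize>` stands in for the run-end index type, and a `String` stands in for the crate's error type.

```rust
use std::convert::TryFrom;

/// Convert run ends to the target index type, returning an error instead of
/// panicking when a run end does not fit.
fn build_run_ends<K: TryFrom<usize>>(run_ends: &[usize]) -> Result<Vec<K>, String> {
    run_ends
        .iter()
        .map(|&run_end| {
            K::try_from(run_end).map_err(|_| {
                format!("run end {} does not fit in the target index type", run_end)
            })
        })
        .collect()
}
```

With this shape, `build_run_ends::<i16>(&[3, 40_000])` returns an `Err` that the caller can propagate with `?`, where the `unwrap()` version would panic.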

gabotechs · 2025-06-26T14:42:22Z

arrow-cast/src/cast/run_array.rs

+    // Step 5: Build the values array by taking elements at the run start positions
+    let indices = PrimitiveArray::<UInt32Type>::from_iter_values(
+        values_indices.iter().map(|&idx| idx as u32),
+    );
+    let values_array = take(&cast_array, &indices, None)?;
+
+    // Step 7: Create and return the RunArray


Step 5 and then step 7?

gabotechs · 2025-06-26T14:46:39Z

arrow-cast/src/cast/run_array.rs

+        // For simplicity, we'll use a basic comparison approach
+        // In practice, you'd want more sophisticated comparison based on data type
+        let values_equal = match (cast_array.is_null(i), cast_array.is_null(i - 1)) {
+            (true, true) => true, // Both null
+            (false, false) => {
+                // Both non-null - use slice comparison as a basic approach
+                // This is a simplified implementation
+                cast_array.slice(i, 1).to_data() == cast_array.slice(i - 1, 1).to_data()
+            }
+            _ => false, // One null, one not null
+        };


It's simple, but is it correct? couldn't this be problematic for certain DataTypes? I wonder if there are any comparison kernels already built that we can use here
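To make the concern concrete, here is a std-only sketch with hypothetical helpers, where `Option<T>` models a nullable slot. Plain equality already covers the three null cases, but floating-point values show why a single generic comparison may not be enough: under `PartialEq`, NaN never equals NaN, so a run of NaN values would never be merged.

```rust
/// Null-aware equality for two adjacent slots; `None` models a null.
/// (None, None) is equal, (Some, None) is not, (Some, Some) compares values.
fn same_run<T: PartialEq>(a: &Option<T>, b: &Option<T>) -> bool {
    a == b
}

/// Float variant that compares bit patterns, so identical NaN payloads
/// extend a run. Note that this also treats 0.0 and -0.0 as different.
fn same_run_f64(a: Option<f64>, b: Option<f64>) -> bool {
    a.map(f64::to_bits) == b.map(f64::to_bits)
}
```

Which of these semantics is right for REE encoding is a design decision for the PR; the sketch only shows that the choice is type-dependent.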

gabotechs · 2025-06-30T13:40:22Z

arrow-cast/src/cast/run_array.rs

@@ -0,0 +1,169 @@
+use crate::cast::*;
+
+pub(crate) fn run_end_encoded_cast<K: RunEndIndexType>(


The doc comment we have for the cast_to_run_end_encoded method is great, how about adding another one here?

gabotechs · 2025-06-30T13:47:12Z

arrow-cast/src/cast/run_array.rs

+                            let re = cast_run_ends.as_primitive::<Int64Type>();
+                            Arc::new(RunArray::<Int64Type>::try_new(re, cast_values.as_ref())?)
+                        }
+                        _ => unreachable!("Run-end type must be i16, i32, or i64"),


Instead of panicking, it might be better to just return an ArrowError.

Suggested change

_ => unreachable!("Run-end type must be i16, i32, or i64"),

_ => {
    return Err(ArrowError::CastError(
        "Run-end type must be i16, i32, or i64".to_string(),
    ))
}

gabotechs · 2025-06-30T13:53:51Z

arrow-cast/src/cast/mod.rs

+    #[cfg(test)]
+    mod run_end_encoded_tests {


If these tests are meant to be scoped for REE arrays, how about having them in the run_array.rs file instead? this one is already 10K lines long...

I was following the pattern that dictionary.rs / list.rs / string.rs laid out. They added their tests to the end of the mod.rs file instead of at the end of their respective files.

gabotechs · 2025-06-30T13:57:56Z

arrow-cast/src/cast/run_array.rs

+    let mut current_run_end = 1usize;
+


Isn't this variable redundant with i in the for loop below? Maybe there's a simplification that can be done here:

simplification

for i in 1..cast_array.len() {
    // For simplicity, we'll use a basic comparison approach
    // In practice, you'd want more sophisticated comparison based on data type
    let values_equal = match (cast_array.is_null(i), cast_array.is_null(i - 1)) {
        (true, true) => true, // Both null
        (false, false) => {
            // Both non-null - use slice comparison as a basic approach
            // This is a simplified implementation
            cast_array.slice(i, 1).to_data() == cast_array.slice(i - 1, 1).to_data()
        }
        _ => false, // One null, one not null
    };

    if !values_equal {
        // End current run, start new run
~       run_ends_vec.push(i);
        values_indices.push(i);
    }
}

// Add the final run end
~ run_ends_vec.push(cast_array.len());
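The same loop shape, written as a self-contained std-only sketch so it can be checked in isolation. `encode_runs` is a hypothetical name, `Option<T>` models a nullable slot, and plain `PartialEq` is assumed as the comparison. The empty-input guard is an addition in this sketch: without it, the final push would record a run end of 0 for an empty array.

```rust
/// Returns `(run_ends, value_indices)`: the exclusive end of each run and the
/// index of the first element of each run.
fn encode_runs<T: PartialEq>(values: &[Option<T>]) -> (Vec<usize>, Vec<usize>) {
    let mut run_ends = Vec::new();
    let mut value_indices = Vec::new();
    if values.is_empty() {
        // No runs at all; avoid emitting a run end of 0.
        return (run_ends, value_indices);
    }
    // The first run always starts at index 0.
    value_indices.push(0);
    for i in 1..values.len() {
        if values[i] != values[i - 1] {
            // The previous run ends at `i`, and a new run starts at `i`.
            run_ends.push(i);
            value_indices.push(i);
        }
    }
    // Close the final run.
    run_ends.push(values.len());
    (run_ends, value_indices)
}
```

The loop index `i` serves as the run end directly, so no separate counter is needed.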

gabotechs · 2025-06-30T13:59:26Z

arrow-cast/src/cast/run_array.rs

+    run_ends_vec.push(current_run_end);
+
+    // Step 4: Build the run_ends array
+    for &run_end in &run_ends_vec {


You should be able to just iterate the Vec normally:

Suggested change

for &run_end in &run_ends_vec {

for run_end in run_ends_vec {

Implemented casting for RunEnd Encoding (commit 4c5b644)

github-actions bot added the arrow (Changes to the arrow crate) label Jun 19, 2025

alamb mentioned this pull request Jun 19, 2025: [Epic] Implement RunArray (Run Length Encoding (RLE) / Run End Encoding (REE) support) #3520 (open, 14 tasks)

alexanderbianchi reviewed Jun 20, 2025 (five reviews, all on arrow-cast/src/cast/run_array.rs)

rich-t-kid-datadog force-pushed the baah/RunEndEncoding branch from ad373e9 to 1c28b08, June 20, 2025 18:01

Implemented casting for RunEnd Encoding (commit 5307851)

rich-t-kid-datadog force-pushed the baah/RunEndEncoding branch from 1c28b08 to 5307851, June 23, 2025 14:09

alexanderbianchi reviewed Jun 23, 2025 (arrow-cast/src/cast/run_array.rs)

feat: Add Run-End Encoded array casting with overflow protection (commit 7d6df4f)

alexanderbianchi reviewed Jun 24, 2025 (arrow-cast/src/cast/mod.rs)

gabotechs reviewed Jun 30, 2025
