[Draft]Implemented casting for RunEnd Encoding #7713

Draft: wants to merge 3 commits into base: main

@rich-t-kid-datadog rich-t-kid-datadog commented Jun 19, 2025

Which issue does this PR close?

This PR contributes towards the larger epic Implement RunArray

Rationale for this change

This PR implements casting support for RunEndEncoded arrays in Apache Arrow. RunEndEncoded is a compression format that stores consecutive runs of equal values efficiently, but previously lacked casting functionality. This change enables:

  1. Casting FROM RunEndEncoded arrays - Converting the values within a RunEndEncoded array to different data types while preserving the run structure
  2. Casting TO RunEndEncoded arrays - Converting regular arrays into RunEndEncoded format by performing run-length encoding
  3. Full integration with Arrow's casting system - Making RunEndEncoded arrays work with the existing cast() and can_cast_types() functions

Run-End Encoded Array Casting: Tradeoffs and Implementation
The implementation of REE array casting introduced a critical tradeoff between user flexibility and data integrity.

Unlike most Arrow types, REE arrays have a fundamental monotonicity constraint: their run-end indices must be strictly increasing to preserve logical correctness. Silent truncation or wrapping during downcasts (e.g., Int64 → Int16) could produce invalid sequences like:

[1000, -15536, 14464] // due to overflow
Such sequences break the REE invariant and could cause panics or silent data corruption downstream.
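To make the overflow concrete, here is a minimal std-only sketch. The Int64 inputs 1000, 50000, and 80000 are assumed for illustration (the PR does not state which values produced the sequence above), and `wrapping_downcast` and `is_strictly_increasing` are hypothetical helper names, not part of arrow-rs.

```rust
/// Narrows i64 run ends with `as`, which wraps silently on overflow.
/// (Hypothetical helper for illustration only.)
fn wrapping_downcast(run_ends: &[i64]) -> Vec<i16> {
    run_ends.iter().map(|&end| end as i16).collect()
}

/// Checks the monotonicity part of the REE invariant:
/// each run end must be strictly greater than the one before it.
fn is_strictly_increasing(run_ends: &[i16]) -> bool {
    run_ends.windows(2).all(|pair| pair[0] < pair[1])
}
```

With the assumed inputs, `wrapping_downcast(&[1000, 50000, 80000])` yields the sequence quoted above, which `is_strictly_increasing` then rejects.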

While Arrow's CastOptions typically allow safe: true to replace values that overflow with nulls instead of returning an error, this behavior would be actively harmful for run-end indices. Therefore:

We chose to hard-code safe: false behavior (return an error on overflow) for run-end casting.

This ensures that:

Any attempt to downcast run-end values that do not fit the target type fails immediately, even if the user sets safe = true.

Upcasts (e.g., Int16 → Int64) are allowed, as they are lossless.

This policy protects the logical soundness of REE arrays and maintains integrity across the Arrow ecosystem.

  1. These changes allow REE arrays to be cast just like any other type in the Arrow ecosystem.

What changes are included in this PR?

  1. run_end_encoded_cast() - Casts values within existing RunEndEncoded arrays to different types

  2. cast_to_run_end_encoded() - Converts regular arrays to RunEndEncoded format with run-length encoding

  3. Comprehensive test suite covering various data types and edge cases

  4. Updated can_cast_types() to support RunEndEncoded compatibility rules

  5. Run-end downcasting is not supported; only upcasts (e.g., Int16 → Int64) are allowed.

  6. Users can now cast RunEndEncoded arrays using the standard arrow_cast::cast() function

  7. All existing APIs remain unchanged.
    There are no breaking changes from this PR; it is a purely additive change.
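For readers unfamiliar with the format, the encoding step behind cast_to_run_end_encoded() can be sketched on plain slices. This is an illustrative std-only version, not the PR's code: `run_end_encode` is a hypothetical name, `Option<T>` stands in for nullable Arrow values, and plain `usize` stands in for the run-end index type.

```rust
/// Run-end encodes a slice of optional values (illustrative sketch).
/// Returns `(run_ends, values)`, where `run_ends[k]` is the exclusive
/// end index of run `k` and `values[k]` is the value repeated in that run.
fn run_end_encode<T: PartialEq + Clone>(input: &[Option<T>]) -> (Vec<usize>, Vec<Option<T>>) {
    let mut run_ends = Vec::new();
    let mut values = Vec::new();
    for (i, item) in input.iter().enumerate() {
        // A new run starts at index 0 and wherever the value differs
        // from its predecessor. `None == None` holds, so nulls group together.
        if i == 0 || *item != input[i - 1] {
            if i > 0 {
                run_ends.push(i); // close the previous run
            }
            values.push(item.clone());
        }
    }
    if !input.is_empty() {
        run_ends.push(input.len()); // close the final run
    }
    (run_ends, values)
}
```

For example, the input `[1, 1, null, null, 2]` encodes to run ends `[2, 4, 5]` and values `[1, null, 2]`.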

@alamb
Contributor

alamb commented Jun 19, 2025

FYI @brancz and @alexanderbianchi

@alexanderbianchi
alexanderbianchi commented Jun 19, 2025

Yup! This is my team's intern 😄
Sorry if the draft PR is noisy. After we iterate a bit we'll fill out the description and comment on which tradeoffs we should make (for example, we can encode the values and then cast, or we can cast the full array and then encode the values. Casting while encoding is also possible but maybe a bit less "plug and play").

@alamb
Contributor

alamb commented Jun 20, 2025

No worries -- let me know when it is ready for a review and I'll take a closer look

Thanks for helping drive this forward

Implement casting between REE arrays and other Arrow types. REE-to-REE casting
validates run-end upcasts only (Int16→Int32, Int16→Int64, Int32→Int64) to prevent
invalid sequences.
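The upcast-only rule in the commit message can be expressed as a small predicate. The sketch below is an assumption about the shape of that check, not the PR's code: `RunEndWidth` and `can_cast_run_ends` are hypothetical names (the real implementation matches on Arrow's DataType), and treating a same-width cast as allowed is also an assumption.

```rust
/// Hypothetical stand-in for the three legal run-end index types.
/// Variants are declared narrowest to widest, so the derived ordering
/// compares them by width.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum RunEndWidth {
    Int16,
    Int32,
    Int64,
}

/// Allows only lossless conversions: same width (assumed to be a no-op)
/// or a widening cast. Narrowing casts are rejected outright.
fn can_cast_run_ends(from: RunEndWidth, to: RunEndWidth) -> bool {
    from <= to
}
```

Rejecting narrowing at the type level is stricter than checking each value, but it means the answer does not depend on the data in the array.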
Comment on lines +143 to +145
(_, RunEndEncoded(_, _value_type)) => true,


Suggested change
(_, RunEndEncoded(_, _value_type)) => true,
(_, RunEndEncoded(_, _value_type)) => true,

no need for extra spaces here

/// and value type. This function performs run-length encoding on the input array.
///
/// # Arguments
/// * `array` - The input array to be run-length encoded

I see that there are several references in the docs and comments to "run-length" instead of "run-end", should those be changed?

cast_with_options(array, value_type, cast_options)?
};

// Step 2: Run-length encode the cast array

Suggested change
// Step 2: Run-length encode the cast array
// Step 2: Run-end encode the cast array

?

Also, are we missing step 3?

@@ -137,6 +139,10 @@ pub fn can_cast_types(from_type: &DataType, to_type: &DataType) -> bool {
can_cast_types(from_value_type, to_value_type)
}
(Dictionary(_, value_type), _) => can_cast_types(value_type, to_type),
(RunEndEncoded(_, value_type), _) => can_cast_types(value_type.data_type(), to_type),
(_, RunEndEncoded(_, _value_type)) => true,

Can anything be converted from non-REE to REE? I imagine that in order for you to encode something as REE, you need to be able to compare values. Are all the possible DataTypes comparable?

This file does not look like it's properly formatted. Maybe run cargo fmt?


// Step 4: Build the run_ends array
for &run_end in &run_ends_vec {
run_ends_builder.append_value(K::Native::from_usize(run_end).unwrap());

Rather than panicking here, it might be better to return the error
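One possible shape for that, sketched with std only: convert each `usize` run end with `TryFrom` and propagate a failure instead of unwrapping. `build_run_ends` is a hypothetical name, and the `String` error is a stand-in for whatever error type the cast kernel would actually return.

```rust
use std::convert::TryFrom;

/// Converts `usize` run ends into the target index type `T`,
/// returning an error for the first value that does not fit
/// instead of panicking. (Illustrative sketch.)
fn build_run_ends<T: TryFrom<usize>>(run_ends: &[usize]) -> Result<Vec<T>, String> {
    run_ends
        .iter()
        .map(|&end| {
            T::try_from(end).map_err(|_| {
                format!("run end {} does not fit in the run-end index type", end)
            })
        })
        .collect()
}
```

Collecting into `Result<Vec<T>, _>` stops at the first failing value, so a 40,000-element input would be rejected for an Int16 index type but accepted for Int32.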

Comment on lines +160 to +166
// Step 5: Build the values array by taking elements at the run start positions
let indices = PrimitiveArray::<UInt32Type>::from_iter_values(
values_indices.iter().map(|&idx| idx as u32),
);
let values_array = take(&cast_array, &indices, None)?;

// Step 7: Create and return the RunArray

Step 5 and then step 7?

Comment on lines +130 to +140
// For simplicity, we'll use a basic comparison approach
// In practice, you'd want more sophisticated comparison based on data type
let values_equal = match (cast_array.is_null(i), cast_array.is_null(i - 1)) {
(true, true) => true, // Both null
(false, false) => {
// Both non-null - use slice comparison as a basic approach
// This is a simplified implementation
cast_array.slice(i, 1).to_data() == cast_array.slice(i - 1, 1).to_data()
}
_ => false, // One null, one not null
};

It's simple, but is it correct? Couldn't this be problematic for certain DataTypes? I wonder if there are any comparison kernels already built that we can use here
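Floating-point values are one case where the choice of equality matters. The std-only sketch below makes no claim about what the slice comparison in the PR does; it only shows that two plausible definitions of "equal" split runs differently. All function names are hypothetical.

```rust
/// Value equality: NaN never equals NaN, while 0.0 equals -0.0.
fn same_run_by_value(a: Option<f64>, b: Option<f64>) -> bool {
    a == b
}

/// Bitwise equality: identical NaN bit patterns match, while 0.0 and -0.0 differ.
fn same_run_by_bits(a: Option<f64>, b: Option<f64>) -> bool {
    match (a, b) {
        (None, None) => true,
        (Some(x), Some(y)) => x.to_bits() == y.to_bits(),
        _ => false,
    }
}

/// Counts the runs produced under a given equality predicate.
fn count_runs(values: &[Option<f64>], same_run: fn(Option<f64>, Option<f64>) -> bool) -> usize {
    if values.is_empty() {
        return 0;
    }
    1 + values
        .windows(2)
        .filter(|pair| !same_run(pair[0], pair[1]))
        .count()
}
```

Under value equality, two adjacent NaNs form two runs and `[0.0, -0.0]` forms one; under bitwise equality the counts are reversed. Whichever semantics the cast adopts, it is probably worth stating in the doc comment and covering with a test.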

@@ -0,0 +1,169 @@
use crate::cast::*;

pub(crate) fn run_end_encoded_cast<K: RunEndIndexType>(

The doc comment we have for the cast_to_run_end_encoded method is great, how about adding another one here?

let re = cast_run_ends.as_primitive::<Int64Type>();
Arc::new(RunArray::<Int64Type>::try_new(re, cast_values.as_ref())?)
}
_ => unreachable!("Run-end type must be i16, i32, or i64"),

Instead of panicking, it might be better to just return an ArrowError.

Suggested change
_ => unreachable!("Run-end type must be i16, i32, or i64"),
_ => {
return Err(ArrowError::CastError(
"Run-end type must be i16, i32, or i64".to_string(),
))
}

Comment on lines +10718 to +10719
#[cfg(test)]
mod run_end_encoded_tests {
@gabotechs gabotechs Jun 30, 2025

If these tests are meant to be scoped for REE arrays, how about having them in the run_array.rs file instead? this one is already 10K lines long...

I was following the pattern that dictionary.rs, list.rs, and string.rs laid out: they added their tests to the end of the mod.rs file instead of at the end of their respective files.

Comment on lines +124 to +125
let mut current_run_end = 1usize;

Isn't this variable redundant with i in the for loop below? Maybe there's a simplification that can be done here:

simplification
    for i in 1..cast_array.len() {
        // For simplicity, we'll use a basic comparison approach
        // In practice, you'd want more sophisticated comparison based on data type
        let values_equal = match (cast_array.is_null(i), cast_array.is_null(i - 1)) {
            (true, true) => true, // Both null
            (false, false) => {
                // Both non-null - use slice comparison as a basic approach
                // This is a simplified implementation
                cast_array.slice(i, 1).to_data() == cast_array.slice(i - 1, 1).to_data()
            }
            _ => false, // One null, one not null
        };

        if !values_equal {
            // End current run, start new run
~           run_ends_vec.push(i);
            values_indices.push(i);
        }
    }

    // Add the final run end
~   run_ends_vec.push(cast_array.len());

run_ends_vec.push(current_run_end);

// Step 4: Build the run_ends array
for &run_end in &run_ends_vec {

You should be able to just iterate the Vec normally:

Suggested change
for &run_end in &run_ends_vec {
for run_end in run_ends_vec {

Labels
arrow Changes to the arrow crate
5 participants