[Draft]Implemented casting for RunEnd Encoding #7713
FYI @brancz and @alexanderbianchi
Yup! This is my team's intern 😄
No worries -- let me know when it is ready for a review and I'll take a closer look. Thanks for helping drive this forward.
Implement casting between REE arrays and other Arrow types. REE-to-REE casting validates run-end upcasts only (Int16→Int32, Int16→Int64, Int32→Int64) to prevent invalid sequences.
(_, RunEndEncoded(_, _value_type)) => true, | ||
|
||
|
(_, RunEndEncoded(_, _value_type)) => true, | |
(_, RunEndEncoded(_, _value_type)) => true, |
no need for extra spaces here
/// and value type. This function performs run-length encoding on the input array.
///
/// # Arguments
/// * `array` - The input array to be run-length encoded
I see that there are several references in the docs and comments to "run-length" instead of "run-end", should those be changed?
    cast_with_options(array, value_type, cast_options)?
};

// Step 2: Run-length encode the cast array
- // Step 2: Run-length encode the cast array
+ // Step 2: Run-end encode the cast array
?
Also, are we missing step 3?
@@ -137,6 +139,10 @@ pub fn can_cast_types(from_type: &DataType, to_type: &DataType) -> bool {
            can_cast_types(from_value_type, to_value_type)
        }
        (Dictionary(_, value_type), _) => can_cast_types(value_type, to_type),
        (RunEndEncoded(_, value_type), _) => can_cast_types(value_type.data_type(), to_type),
        (_, RunEndEncoded(_, _value_type)) => true,
Can anything be converted from non-REE to REE? I imagine that in order for you to encode something as REE, you need to be able to compare values. Are all the possible DataTypes comparable?
This file does not look like it's properly formatted. Maybe run cargo fmt?

// Step 4: Build the run_ends array
for &run_end in &run_ends_vec {
    run_ends_builder.append_value(K::Native::from_usize(run_end).unwrap());
Rather than panicking here, it might be better to return the error
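One way to surface the failure as an error, sketched with only the standard library. `run_end_to_i16` and the `String` error are hypothetical stand-ins for the PR's `K::Native::from_usize(run_end)` call and `ArrowError`, so treat this as an illustration of the pattern rather than the actual fix.

```rust
/// Hypothetical helper: converts a run end to a 16-bit index, returning an
/// error instead of panicking when the value does not fit.
fn run_end_to_i16(run_end: usize) -> Result<i16, String> {
    i16::try_from(run_end)
        .map_err(|_| format!("run end {} does not fit in a 16-bit run-end index", run_end))
}
```

With this shape the caller can propagate the failure with `?` instead of hitting the `unwrap()` panic on arrays longer than the index type can address.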
// Step 5: Build the values array by taking elements at the run start positions
let indices = PrimitiveArray::<UInt32Type>::from_iter_values(
    values_indices.iter().map(|&idx| idx as u32),
);
let values_array = take(&cast_array, &indices, None)?;

// Step 7: Create and return the RunArray
Step 5 and then step 7?
// For simplicity, we'll use a basic comparison approach
// In practice, you'd want more sophisticated comparison based on data type
let values_equal = match (cast_array.is_null(i), cast_array.is_null(i - 1)) {
    (true, true) => true, // Both null
    (false, false) => {
        // Both non-null - use slice comparison as a basic approach
        // This is a simplified implementation
        cast_array.slice(i, 1).to_data() == cast_array.slice(i - 1, 1).to_data()
    }
    _ => false, // One null, one not null
};
It's simple, but is it correct? Couldn't this be problematic for certain DataTypes? I wonder if there are any comparison kernels already built that we can use here.
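To make the concern concrete, here is a small standard-library sketch of how the notion of "equal" changes which runs are found for floating-point values. `same_by_eq`, `same_by_bits`, and `count_runs` are hypothetical helpers, and this says nothing about how the `ArrayData` comparison in the PR treats these cases; it only shows why the choice deserves a deliberate decision.

```rust
/// Hypothetical comparison using IEEE 754 equality: NaN is never equal to NaN.
fn same_by_eq(a: f64, b: f64) -> bool {
    a == b
}

/// Hypothetical comparison using the raw bit pattern: identical NaNs match,
/// but 0.0 and -0.0 are treated as different values.
fn same_by_bits(a: f64, b: f64) -> bool {
    a.to_bits() == b.to_bits()
}

/// Counts runs of consecutive values that `same` considers equal.
fn count_runs(values: &[f64], same: fn(f64, f64) -> bool) -> usize {
    if values.is_empty() {
        return 0;
    }
    1 + values.windows(2).filter(|w| !same(w[0], w[1])).count()
}
```

Under these two definitions, two adjacent NaNs form two runs or one, and `0.0` followed by `-0.0` forms one run or two, so the encoded output depends on which rule the implementation picks.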
@@ -0,0 +1,169 @@
use crate::cast::*;

pub(crate) fn run_end_encoded_cast<K: RunEndIndexType>(
The doc comment we have for the cast_to_run_end_encoded method is great, how about adding another one here?
        let re = cast_run_ends.as_primitive::<Int64Type>();
        Arc::new(RunArray::<Int64Type>::try_new(re, cast_values.as_ref())?)
    }
    _ => unreachable!("Run-end type must be i16, i32, or i64"),
Instead of panicking, it might be better to just return an ArrowError.
- _ => unreachable!("Run-end type must be i16, i32, or i64"),
+ _ => {
+     return Err(ArrowError::CastError(
+         "Run-end type must be i16, i32, or i64".to_string(),
+     ))
+ }
#[cfg(test)]
mod run_end_encoded_tests {
If these tests are meant to be scoped for REE arrays, how about having them in the run_array.rs file instead? This one is already 10K lines long...
I was following the pattern that dictionary.rs, list.rs, and string.rs laid out. They added their tests to the end of the mod.rs file instead of at the end of their respective files.
let mut current_run_end = 1usize;
Isn't this variable redundant with i in the for loop below? Maybe there's a simplification that can be done here:
for i in 1..cast_array.len() {
    // For simplicity, we'll use a basic comparison approach
    // In practice, you'd want more sophisticated comparison based on data type
    let values_equal = match (cast_array.is_null(i), cast_array.is_null(i - 1)) {
        (true, true) => true, // Both null
        (false, false) => {
            // Both non-null - use slice comparison as a basic approach
            // This is a simplified implementation
            cast_array.slice(i, 1).to_data() == cast_array.slice(i - 1, 1).to_data()
        }
        _ => false, // One null, one not null
    };
    if !values_equal {
        // End current run, start new run
~       run_ends_vec.push(i);
        values_indices.push(i);
    }
}
// Add the final run end
~ run_ends_vec.push(cast_array.len());
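The simplified loop can be exercised in isolation with a standard-library sketch. Here `run_end_encode` is a hypothetical function over `Option<T>` slices (with `None` standing in for null), not the PR's Arrow-based implementation; it returns the run ends and the start index of each run.

```rust
/// Hypothetical stand-in for the PR's loop: returns (run_ends, values_indices)
/// for a slice where `None` represents a null value.
fn run_end_encode<T: PartialEq>(values: &[Option<T>]) -> (Vec<usize>, Vec<usize>) {
    let mut run_ends = Vec::new();
    let mut values_indices = Vec::new();
    if values.is_empty() {
        return (run_ends, values_indices);
    }
    // The first run always starts at index 0.
    values_indices.push(0);
    for i in 1..values.len() {
        // Option's equality treats two nulls as equal and a null as
        // different from any non-null value.
        if values[i] != values[i - 1] {
            // The previous run ends (exclusively) at i; a new run starts at i.
            run_ends.push(i);
            values_indices.push(i);
        }
    }
    // The final run ends at the length of the input.
    run_ends.push(values.len());
    (run_ends, values_indices)
}
```

Because the loop index already marks where each run ends, no separate counter is needed; the only extra state is the two output vectors.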
run_ends_vec.push(current_run_end);

// Step 4: Build the run_ends array
for &run_end in &run_ends_vec {
You should be able to just iterate the Vec normally:
- for &run_end in &run_ends_vec {
+ for run_end in run_ends_vec {
Which issue does this PR close?
This PR contributes towards the larger epic Implement RunArray
Rationale for this change
This PR implements casting support for RunEndEncoded arrays in Apache Arrow. RunEndEncoded is a compression format that stores consecutive runs of equal values efficiently, but previously lacked casting functionality. This change enables:
Run-End Encoded Array Casting: Tradeoffs and Implementation
The implementation of REE array casting introduced a critical tradeoff between user flexibility and data integrity.
Unlike most Arrow types, REE arrays have a fundamental monotonicity constraint: their run-end indices must be strictly increasing to preserve logical correctness. Silent truncation or wrapping during downcasts (e.g., Int64 → Int16) could produce invalid sequences like:
[1000, -15536, 14464] // due to overflow
Such sequences break the REE invariant and could cause panics or silent data corruption downstream.
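The example values can be reproduced with plain integer casts. In the sketch below, the input `[1000, 50000, 80000]` is an assumed set of Int64 run ends, chosen because a wrapping conversion to 16 bits yields exactly the sequence above; `wrap_to_i16` and `is_strictly_increasing` are hypothetical helpers, not part of Arrow.

```rust
/// Hypothetical helper: converts run ends to i16 with wrapping semantics,
/// which is what an unchecked narrowing cast would do.
fn wrap_to_i16(run_ends: &[i64]) -> Vec<i16> {
    run_ends.iter().map(|&v| v as i16).collect()
}

/// Returns true when the first run end is positive and every later run end
/// is greater than the one before it.
fn is_strictly_increasing(run_ends: &[i16]) -> bool {
    run_ends.first().map_or(true, |&first| first > 0)
        && run_ends.windows(2).all(|w| w[0] < w[1])
}
```

Passing the wrapped output to `is_strictly_increasing` returns `false`, which is the invariant violation the description is guarding against.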
While Arrow's CastOptions typically allow safe: true to replace values that overflow with nulls instead of returning an error, this behavior would be actively harmful for run-end indices. Therefore:
We chose to hard-code safe: false behavior for run-end casting.
This ensures that:
Any attempt to downcast invalid run-end values fails immediately, even if the user sets safe = true.
Upcasts (e.g., Int16 → Int64) are allowed, as they are lossless.
This policy protects the logical soundness of REE arrays and maintains integrity across the Arrow ecosystem.
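A minimal sketch of this policy using only the standard library, assuming Int16 and Int64 run ends. `upcast_run_ends` and `downcast_run_ends` are hypothetical names, the `String` error stands in for whatever error type the implementation returns, and the sketch follows the wording above that out-of-range downcasts fail rather than modelling the PR's exact rules.

```rust
/// Hypothetical upcast: widening i16 to i64 is lossless, so it cannot fail.
fn upcast_run_ends(run_ends: &[i16]) -> Vec<i64> {
    run_ends.iter().map(|&v| i64::from(v)).collect()
}

/// Hypothetical downcast: fails on the first run end that does not fit in i16
/// instead of wrapping it or replacing it with a null.
fn downcast_run_ends(run_ends: &[i64]) -> Result<Vec<i16>, String> {
    run_ends
        .iter()
        .map(|&v| i16::try_from(v).map_err(|_| format!("run end {} overflows Int16", v)))
        .collect()
}
```

Collecting into a `Result` stops at the first failing element, so no partially converted run-end sequence is ever returned to the caller.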
What changes are included in this PR?
run_end_encoded_cast() - Casts values within existing RunEndEncoded arrays to different types
cast_to_run_end_encoded() - Converts regular arrays to RunEndEncoded format with run-length encoding
Comprehensive test suite covering various data types and edge cases
Updated can_cast_types() to support RunEndEncoded compatibility rules
Downcasting of run-end indices (e.g., Int64 → Int16) is not supported.
Users can now cast RunEndEncoded arrays using the standard arrow_cast::cast() function
All existing APIs remain unchanged.
There are no breaking changes from this PR; this is a purely additive change.