Describe the bug
Writing nested list data with parquet::arrow::ArrowWriter and content-defined chunking enabled can panic inside the parquet column writer with an
out-of-bounds slice access.
This appears to be a regression from:
To Reproduce
This currently fails on main:
nice cargo bench -p parquet --bench arrow_writer -- list_primitive/cdc
Fails like this:
Benchmarking list_primitive/cdc: Warming up for 3.0000 s
thread 'main' (11848300) panicked at parquet/src/column/writer/mod.rs:720:39:
range end index 59344 out of range for slice of length 58905
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Here is a standalone reproducer (in parquet/tests/arrow_writer.rs) that codex made for me:
#[test]
fn test_arrow_writer_cdc_list_roundtrip_regression() {
    let schema = Arc::new(Schema::new(vec![
        Field::new(
            "_1",
            DataType::List(Arc::new(Field::new_list_field(DataType::Int32, true))),
            true,
        ),
        Field::new(
            "_2",
            DataType::List(Arc::new(Field::new_list_field(DataType::Boolean, true))),
            true,
        ),
        Field::new(
            "_3",
            DataType::LargeList(Arc::new(Field::new_list_field(DataType::Utf8, true))),
            true,
        ),
    ]));

    // Enable content-defined chunking with default options
    let props = WriterProperties::builder()
        .set_content_defined_chunking(Some(CdcOptions::default()))
        .build();

    let batch = create_random_batch(schema, 2, 0.25, 0.75).unwrap();

    // Write the batch; the panic happens on the write path
    let mut buffer = Vec::new();
    let mut writer = ArrowWriter::try_new(&mut buffer, batch.schema(), Some(props)).unwrap();
    writer.write(&batch).unwrap();
    writer.close().unwrap();

    // Read it back and check the roundtrip
    let mut reader = ParquetRecordBatchReader::try_new(Bytes::from(buffer), 1024).unwrap();
    let read = reader.next().unwrap().unwrap();
    assert_eq!(batch, read);
}
Run it with:
cargo test -p parquet test_arrow_writer_cdc_list_roundtrip_regression
Results:
running 1 test
test test_arrow_writer_cdc_list_roundtrip_regression ... FAILED
failures:
---- test_arrow_writer_cdc_list_roundtrip_regression stdout ----
thread 'test_arrow_writer_cdc_list_roundtrip_regression' (11845398) panicked at parquet/src/column/writer/mod.rs:720:39:
range end index 1 out of range for slice of length 0
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
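For readers unfamiliar with this class of failure: the message in both runs is the standard one Rust emits when a range slice ends past the slice length. The sketch below reproduces only that panic shape with plain std code. The function names and the `i16` element type are hypothetical and are not taken from parquet/src/column/writer/mod.rs; it says nothing about where the mismatched length actually comes from in the writer.

```rust
/// Hypothetical stand-in for an unchecked prefix slice.
/// Panics with "range end index {end} out of range for slice of length {len}"
/// when `end > values.len()`.
fn unchecked_prefix(values: &[i16], end: usize) -> &[i16] {
    &values[..end]
}

/// Checked variant: returns `None` instead of panicking when `end`
/// exceeds the slice length, so the caller can surface an error.
fn checked_prefix(values: &[i16], end: usize) -> Option<&[i16]> {
    values.get(..end)
}
```

With an empty slice and `end = 1`, `unchecked_prefix` panics with the same "range end index 1 out of range for slice of length 0" message as the test above, while `checked_prefix` returns `None`.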
Expected behavior
No panics, tests should pass
Additional context