
[Parquet] ArrowWriter with CDC panics on nested ListArrays #9637

@alamb


Describe the bug

Writing nested list data with parquet::arrow::ArrowWriter and content-defined chunking enabled can panic inside the parquet column writer with an
out-of-bounds slice access.

This appears to be a regression from:

To Reproduce
This currently fails on main:

nice cargo bench -p parquet --bench arrow_writer -- list_primitive/cdc

Fails like this:

Benchmarking list_primitive/cdc: Warming up for 3.0000 s
thread 'main' (11848300) panicked at parquet/src/column/writer/mod.rs:720:39:
range end index 59344 out of range for slice of length 58905
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Here is a standalone reproducer (for parquet/tests/arrow_writer.rs) that Codex generated for me:

#[test]
fn test_arrow_writer_cdc_list_roundtrip_regression() {
    let schema = Arc::new(Schema::new(vec![
        Field::new(
            "_1",
            DataType::List(Arc::new(Field::new_list_field(DataType::Int32, true))),
            true,
        ),
        Field::new(
            "_2",
            DataType::List(Arc::new(Field::new_list_field(DataType::Boolean, true))),
            true,
        ),
        Field::new(
            "_3",
            DataType::LargeList(Arc::new(Field::new_list_field(DataType::Utf8, true))),
            true,
        ),
    ]));
    let props = WriterProperties::builder()
        .set_content_defined_chunking(Some(CdcOptions::default()))
        .build();
    let batch = create_random_batch(schema, 2, 0.25, 0.75).unwrap();

    let mut buffer = Vec::new();
    let mut writer = ArrowWriter::try_new(&mut buffer, batch.schema(), Some(props)).unwrap();
    writer.write(&batch).unwrap();
    writer.close().unwrap();

    let mut reader = ParquetRecordBatchReader::try_new(Bytes::from(buffer), 1024).unwrap();
    let read = reader.next().unwrap().unwrap();
    assert_eq!(batch, read);
}

Run it with:

cargo test -p parquet test_arrow_writer_cdc_list_roundtrip_regression

Results:

running 1 test
test test_arrow_writer_cdc_list_roundtrip_regression ... FAILED

failures:

---- test_arrow_writer_cdc_list_roundtrip_regression stdout ----

thread 'test_arrow_writer_cdc_list_roundtrip_regression' (11845398) panicked at parquet/src/column/writer/mod.rs:720:39:
range end index 1 out of range for slice of length 0
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
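
Both panic messages above have the form Rust produces when a range index runs past the end of a slice. The following is only a minimal sketch of that failure class, not the parquet column writer's code: `prefix_indexed` and `prefix_checked` are hypothetical helpers, and the `i16` element type is an assumption chosen for illustration.

```rust
/// Hypothetical helper: plain range indexing, which panics when `end > levels.len()`.
fn prefix_indexed(levels: &[i16], end: usize) -> &[i16] {
    &levels[..end]
}

/// Hypothetical helper: the same request through `get`, which returns `None`
/// instead of panicking when the range is out of bounds.
fn prefix_checked(levels: &[i16], end: usize) -> Option<&[i16]> {
    levels.get(..end)
}

fn main() {
    let levels: Vec<i16> = vec![0, 1, 1];

    // In-bounds requests behave identically.
    assert_eq!(prefix_indexed(&levels, 2), &levels[..2]);
    assert_eq!(prefix_checked(&levels, 2), Some(&levels[..2]));

    // An out-of-bounds end is reported as `None` by the checked variant.
    assert_eq!(prefix_checked(&levels, 4), None);

    // The indexed variant panics on an empty slice with end = 1,
    // the same shape as the test failure above; catch it to show that.
    let result = std::panic::catch_unwind(|| prefix_indexed(&[], 1).len());
    assert!(result.is_err());
}
```

A checked access like this only turns the panic into a recoverable error. It would not by itself explain why the requested end index exceeds the slice length in the writer.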

Expected behavior
No panics; the benchmark and the test above should both pass.

Additional context


Labels: arrow, arrow-avro, arrow-flight, bug, parquet, parquet-variant, performance
