Perf: Investigate and improve parquet writing performance #7822

Open
@jhorstmann

Description

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Parquet writing performance is not very good. The arrow_writer microbenchmark shows a throughput of around 200 MiB/s for a batch of primitives. That seems rather low for a column-oriented format, but a profiling run shows no obvious single bottleneck.

Describe the solution you'd like

Investigate and improve the performance.

  • Optimize counting of values and nulls
  • Avoid asserts called in a loop in BitWriter::put_value
  • Avoid bounds checks in flush_bit_packed_run
  • Optimize iteration in LevelInfoBuilder::write_leaf
  • Avoid cloning null buffer or recalculating logical nulls
  • Do not collect non_null_indices and gather these into a new Vec for non-nullable arrays
  • Optimize writing bit-packed runs (bit width is 1 for levels most of the time, always writes 8 values except for last run)
  • Change get_min_max to check logical/converted types outside of loop
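
Two of the ideas above (counting values and nulls, and writing bit-packed runs of width 1) can be sketched in plain Rust. This is only an illustration under stated assumptions, not the crate's actual code: the function names `count_valid` and `pack_one_bit_levels` are hypothetical, levels are assumed to be `i16` values of 0 or 1, and both bitmaps are assumed to be LSB-first.

```rust
/// Hypothetical helper: count the set bits (valid slots) among the first
/// `len` bits of an LSB-first validity bitmap. Whole bytes are counted with
/// a popcount instead of testing one bit per loop iteration.
fn count_valid(bitmap: &[u8], len: usize) -> usize {
    let full_bytes = len / 8;
    let mut count: usize = bitmap[..full_bytes]
        .iter()
        .map(|b| b.count_ones() as usize)
        .sum();
    let rem = len % 8;
    if rem > 0 {
        // Only the low `rem` bits of the last byte belong to the array.
        let mask = (1u8 << rem) - 1;
        count += (bitmap[full_bytes] & mask).count_ones() as usize;
    }
    count
}

/// Hypothetical helper: pack levels with bit width 1 into bytes, LSB-first.
/// `chunks_exact(8)` gives the compiler fixed-size groups, so the common
/// case (8 values per byte) needs no per-value bounds checks; only the
/// final partial group takes the slower remainder path.
fn pack_one_bit_levels(levels: &[i16]) -> Vec<u8> {
    let mut out = Vec::with_capacity((levels.len() + 7) / 8);
    let mut chunks = levels.chunks_exact(8);
    for chunk in &mut chunks {
        let mut byte = 0u8;
        for (i, &level) in chunk.iter().enumerate() {
            byte |= ((level & 1) as u8) << i;
        }
        out.push(byte);
    }
    let rest = chunks.remainder();
    if !rest.is_empty() {
        let mut byte = 0u8;
        for (i, &level) in rest.iter().enumerate() {
            byte |= ((level & 1) as u8) << i;
        }
        out.push(byte);
    }
    out
}
```

For example, the ten levels `[1, 0, 1, 1, 0, 0, 0, 1, 1, 1]` pack into the two bytes `0x8D` and `0x03`, and `count_valid` on those bytes with `len = 10` returns 6. Whether these shapes actually help in the writer would need to be confirmed with the arrow_writer benchmark.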

Describe alternatives you've considered

Additional context

Labels: enhancement (any new improvement worthy of an entry in the changelog)