
[Story] Towards a faster Parquet reader with pipelining and multistream optimization #18892

@JigaoLuo

Description

To the cuDF team:

Thank you for your help and for the high quality of the cuDF code, especially the Parquet reader. This story issue documents my series of studies on optimizing the cuDF Parquet reader, building on insights from existing issues. As an external contributor, I aim to explain my high-level ideas here first, with detailed implementations in child issues and PR drafts. I will keep refining this story issue and its children so that it serves as an umbrella for my previous and future issues. Your feedback is welcome!


This story issue outlines my studies on optimizing the cuDF Parquet reader using pipelining and multistreaming. Here, pipelining means overlapping I/O operations with GPU kernel execution to hide latency, thereby pushing throughput toward hardware limits.
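To make the pipelining idea concrete, here is a toy Python sketch (not cuDF code): while chunk N-1 is being "decoded", the read of chunk N is already in flight, so the I/O latency is hidden behind compute. The `read_chunk`/`decode_chunk` names and the thread pool are illustrative stand-ins for KvikIO reads and GPU kernels.

```python
# Toy pipelining sketch: issue the next "I/O" before decoding the current
# chunk, so read and decode overlap instead of running back-to-back.
import concurrent.futures
import time

def read_chunk(i):            # stand-in for a KvikIO / cuFile read
    time.sleep(0.01)
    return f"raw-{i}"

def decode_chunk(raw):        # stand-in for GPU decompression + decoding
    time.sleep(0.01)
    return raw.replace("raw", "table")

def pipelined_read(n_chunks):
    results = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as io_pool:
        next_read = io_pool.submit(read_chunk, 0)
        for i in range(n_chunks):
            raw = next_read.result()
            if i + 1 < n_chunks:
                # Launch the next read *before* decoding: this is the overlap.
                next_read = io_pool.submit(read_chunk, i + 1)
            results.append(decode_chunk(raw))
    return results

print(pipelined_read(4))  # ['table-0', 'table-1', 'table-2', 'table-3']
```

With perfect overlap, n chunks cost roughly `read(1) + n * decode` instead of `n * (read + decode)`; the same reasoning applies when CUDA streams replace the thread pool.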

Problem Statement: Why & What this story

Pipelining is critical for: (1) reading Parquet at maximum speed and (2) overlapping downstream computation that consumes the decompressed chunks. While many CPU-based Parquet readers have attempted this approach, few (if any) achieve sustained high read performance.

Building on this motivation for pipelining, this story also explains why the Parquet reader must be improved along both axes: pipelining and multistreaming. My prior issues #17873 and #18268, the cuIO team's experience (#18278 (comment)), and observations in the Velox-CuDF TableScan all revealed unstable performance when pipelining. The root cause of this instability (termed "pipeline bubbles" by the cuDF team) remains unidentified and unfixed. Resolving it is critical for achieving predictable speedups in the Parquet reader, as well as in all applications on top of libcudf: pylibcudf, RAPIDS Spark, and Velox-CuDF.

Background: workflow $${\color{lightblue}Steps}$$ in `read_parquet`

Understanding the Parquet read workflow is foundational to effective profiling and optimization. In my understanding, a read_parquet call on one CUDA stream executes three sequential, blocking steps:

  • $${\color{lightblue}Step1}$$: metadata reading from the Parquet footer
  • $${\color{lightblue}Step2}$$: I/O operations via KvikIO or general data transfer
  • $${\color{lightblue}Step3}$$: GPU kernels (decompression and decoding)

To maximize throughput in the Parquet reader, each of these three steps must be profiled and optimized.

A figure in the original issue illustrates these three steps.
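As a concrete illustration of $${\color{lightblue}Step1}$$: per the Apache Parquet format specification, a file ends with the Thrift-encoded `FileMetaData` footer, a 4-byte little-endian footer length, and the `PAR1` magic bytes. The sketch below locates the footer bytes in a fabricated buffer (the `METADATA` payload stands in for real Thrift bytes; this is not libcudf code).

```python
# Minimal sketch of Step1: locating the footer in a Parquet file.
# Layout at end of file: [footer bytes][4-byte LE footer length]["PAR1"]
import struct

def locate_footer(buf: bytes) -> bytes:
    assert buf[-4:] == b"PAR1", "not a Parquet file"
    (footer_len,) = struct.unpack("<I", buf[-8:-4])
    # The Thrift-encoded FileMetaData sits just before the length field.
    return buf[-8 - footer_len:-8]

# Fabricated buffer: leading magic, fake row-group data, 8-byte fake footer.
fake = b"PAR1" + b"rowgroups..." + b"METADATA" + struct.pack("<I", 8) + b"PAR1"
print(locate_footer(fake))  # b'METADATA'
```

Because this step requires a read at the *end* of the file before any column data can be fetched, it sits on the critical path of every read_parquet call, which is what makes it a caching candidate.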

Problems addressed in this story

Building on the motivation for optimizing the reader and its internal workflow, this story issue tracks the following limitations of the current reader:

$${\color{red}Problem1}$$: Inefficient read in non-standard workloads

  • Performance depends on the workload matching empirical rules of thumb (e.g., "read 1 GB in large chunks to be fast"). Workloads outside those rules get suboptimal performance: small-sized reads and repeated reads of the same file.
  • Location of this problem: $${\color{lightblue}Step1}$$ metadata reading and $${\color{lightblue}Step2}$$ I/O

$${\color{red}Problem2}$$: Unstable pipelining performance

(More Problems: tbd. One idea to have fewer kernels to optimize $${\color{lightblue}Step3}$$ GPU-kernels ...)


Proposed tasks

To address the problems outlined above, I propose the following orthogonal tasks, each tracked as a dedicated issue and draft PR under this umbrella story. The fixes are designed to combine for cumulative performance gains:

$${\color{green}Task1}$$ addressing $${\color{red}Problem1}$$: Metadata caching
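A hypothetical sketch of the metadata-caching idea: cache the parsed footer per file so that repeated reads of the same file skip $${\color{lightblue}Step1}$$. All names here are illustrative, not the actual libcudf API; keying on the file's modification time is one possible invalidation strategy.

```python
# Illustrative metadata cache: repeated reads of the same (unchanged) file
# reuse the parsed footer instead of re-reading and re-parsing it.
from functools import lru_cache

PARSE_CALLS = 0  # counts how often we actually parse the footer

@lru_cache(maxsize=128)
def cached_footer(path: str, file_mtime: float) -> dict:
    """Keyed on (path, mtime) so a rewritten file invalidates its entry."""
    global PARSE_CALLS
    PARSE_CALLS += 1
    # Stand-in for reading the footer bytes and Thrift-decoding FileMetaData.
    return {"path": path, "num_row_groups": 4}

meta1 = cached_footer("lineitem.parquet", 100.0)
meta2 = cached_footer("lineitem.parquet", 100.0)  # cache hit: no re-parse
print(PARSE_CALLS)  # 1
```

This directly targets the "repeated reads of the same file" case in $${\color{red}Problem1}$$: the footer I/O and Thrift decode are paid once instead of on every call.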

$${\color{green}Task2}$$ addressing $${\color{red}Problem2}$$: Eliminating unnecessary synchronization
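A toy sketch of the over-synchronization idea, with Python threads standing in for CUDA streams: a device-wide barrier (like `cudaDeviceSynchronize`) would stall every stream, whereas waiting on a single per-stream event lets independent pipelines keep running. This is purely illustrative and not libcudf code.

```python
# Fine-grained synchronization sketch: the consumer waits on ONE event
# (the stream it actually depends on), not on all streams.
import threading

timeline = []                      # records completion order (deterministic)
evt_fast = threading.Event()

def fast_stream():
    timeline.append("fast-decode")
    evt_fast.set()                 # signal only this stream's completion

def slow_stream():
    evt_fast.wait()                # per-stream dependency, not a global barrier
    timeline.append("slow-decode")

threads = [threading.Thread(target=slow_stream),
           threading.Thread(target=fast_stream)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(timeline)  # ['fast-decode', 'slow-decode']
```

The CUDA analogue would be `cudaStreamWaitEvent` on the producing stream's event rather than a blanket device synchronize; removing such blanket waits is one plausible way "pipeline bubbles" shrink.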

Here are several engineering-side subtasks & takeaways:

Complementary Tasks:

My assumptions:

I mainly focus on direct-attached NVMe SSDs, a single A100 GPU, and single TPC-H Parquet tables. I enable PTDS (per-thread default stream) in cuDF and RMM. I chose to optimize the Parquet reader, but the approach could generalize to other open table formats.
Let’s enhance Parquet reading via multistreaming and pipelining!


Labels: Performance (Performance related issue), Python (Affects Python cuDF API), Spark (Functionality that helps Spark RAPIDS), cuIO (cuIO issue), libcudf (Affects libcudf (C++/CUDA) code)
