To the cuDF team:
Thank you for your help and for the high quality of the cuDF code, especially the Parquet reader. This story issue documents my series of studies on optimizing the cuDF Parquet reader, building on insights from existing issues. As an external contributor, I aim to explain my high-level ideas here first, with detailed implementations in child issues and draft PRs. I will keep refining this issue and its children so that it serves as an umbrella for my previous and future issues. Your feedback is welcome!
This story issue outlines my series of studies on optimizing the cuDF Parquet reader using pipelining and multistreaming. Here, pipelining means overlapping I/O operations with GPU kernel execution to hide latency, pushing throughput toward the hardware limits.
Problem statement: why and what this story covers
Pipelining is critical for: (1) reading Parquet at maximum speed, and (2) overlapping future computation that consumes the decompressed chunks. While many CPU-based Parquet readers have attempted this approach, few (if any) achieve sustained high read performance.
Building on this motivation for pipelining, this story addresses both pipelining and multistreaming in the Parquet reader. My prior issues #17873 and #18268, cuIO team experience (#18278 (comment)), and observations in the Velox-CuDF TableScan all revealed unstable performance with pipelining. The root cause of this instability (termed "pipeline bubbles" by the cuDF team) remains unidentified, and no fix exists yet. Resolving this instability is critical for achieving predictable speedups in the Parquet reader, as well as in all applications on top of libcudf: pylibcudf, RAPIDS Spark, and Velox-CuDF.
Background: workflow $${\color{lightblue}Steps}$$ in `read_parquet`
Understanding the Parquet read path is foundational to effective profiling and optimization. In my understanding, a `read_parquet` call on one CUDA stream executes three sequential, blocking steps:

- $${\color{lightblue}Step1}$$ : metadata reading from the Parquet footer
- $${\color{lightblue}Step2}$$ : I/O operations via KvikIO or general data transfer
- $${\color{lightblue}Step3}$$ : GPU kernels (decompression and decoding)
To maximize throughput in the Parquet reader, each of these three steps must be profiled and optimized.
A figure showing these three steps:
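The three steps and the intended overlap can be sketched in plain Python (no cuDF dependency; `read_metadata`, `read_chunk`, and `decode_chunk` are hypothetical stand-ins for the footer parse, the KvikIO transfer, and the decode kernels). The sketch prefetches the next chunk's I/O while the current chunk is being decoded, which is the essence of the pipelining this story targets:

```python
from concurrent.futures import ThreadPoolExecutor

def read_metadata(path):
    # Step 1 stand-in: parse the Parquet footer once per file
    return {"path": path, "num_chunks": 4}

def read_chunk(meta, k):
    # Step 2 stand-in: I/O for chunk k (KvikIO / async copies in cuDF)
    return f"raw-{k}"

def decode_chunk(raw):
    # Step 3 stand-in: decompression + decode kernels
    return raw.replace("raw", "table")

def read_parquet_pipelined(path):
    meta = read_metadata(path)
    out = []
    with ThreadPoolExecutor(max_workers=1) as io_pool:
        fut = io_pool.submit(read_chunk, meta, 0)  # prefetch chunk 0
        for k in range(meta["num_chunks"]):
            raw = fut.result()
            if k + 1 < meta["num_chunks"]:
                # overlap: issue I/O for chunk k+1 while decoding chunk k
                fut = io_pool.submit(read_chunk, meta, k + 1)
            out.append(decode_chunk(raw))
    return out

print(read_parquet_pipelined("lineitem.parquet"))
# → ['table-0', 'table-1', 'table-2', 'table-3']
```

The same structure applies on the GPU with CUDA streams instead of a thread pool; any blocking call inside the loop destroys the overlap, which is exactly the instability problem described below.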
Problems addressed in this story
Building on the motivation for optimizing the reader and its internal workflow, this story issue tracks the following limitations of the current reader:
$${\color{red}Problem1}$$ : Inefficient read in non-standard workloads
- Performance relies on empirical, workload-dependent rules (e.g., "read ~1 GB in large chunks to be fast"). Other read workloads get suboptimal performance: small-sized reads and repeated reads of the same file.
- Location of this problem: $${\color{lightblue}Step1}$$ metadata reading and $${\color{lightblue}Step2}$$ I/O
$${\color{red}Problem2}$$ : Unstable pipelining performance
- As mentioned above, unnecessary synchronization overhead causes unstable read performance. The root cause remains unclear.
- Issues:
- Location of this problem: $${\color{lightblue}Step2}$$ I/O and $${\color{lightblue}Step3}$$ GPU kernels
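To make the "pipeline bubble" concrete, here is a toy back-of-the-envelope cost model (plain Python; the per-step costs are made-up numbers for illustration, not cuDF measurements):

```python
# Toy cost model of a pipeline bubble. Units are arbitrary; the numbers
# are illustrative assumptions, not measurements of cuDF.
N, IO, KERNEL, SYNC = 8, 10, 10, 5  # chunks; per-chunk I/O, kernel, sync cost

# Fully pipelined: after the first I/O fills the pipeline, I/O for chunk
# k+1 overlaps the kernels for chunk k, so the steady state is bounded by
# the slower of the two stages.
pipelined = IO + N * max(IO, KERNEL)      # 10 + 8 * 10 = 90

# With a blocking synchronization per chunk (e.g., a blocking readback of
# a single scalar), the overlap is lost: every chunk pays I/O + kernel +
# sync serially.
with_bubbles = N * (IO + KERNEL + SYNC)   # 8 * 25 = 200

print(pipelined, with_bubbles)
# → 90 200
```

Under these assumed costs, one small blocking call per chunk more than doubles the total time, which is why the tasks below hunt down individual synchronization points.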
(More problems: TBD. One idea is to reduce the number of kernels that need optimizing.)
Proposed tasks
To address the problems outlined above, I propose the following orthogonal tasks, each tracked in a dedicated child issue and draft PR under this umbrella story. These fixes are designed to combine for cumulative performance gains:
$${\color{green}Task1}$$ addressing $${\color{red}Problem1}$$ : Metadata caching
- Child-Issue: [FEA] Parquet metadata caching due to overhead in reader #18890
- Draft PR: [DO NOT MERGE] [POC] Metadata caching prototype in Parquet reader #18891
$${\color{green}Task2}$$ addressing $${\color{red}Problem2}$$ : Eliminating unnecessary synchronization
Here are several engineering subtasks and takeaways:
- remove unnecessary synchronization in $${\color{lightblue}Step3}$$ GPU kernels:
  - Child-Issue: [FEA] Unstable pipelining performance in Parquet reading due to "miss-sync" #18967
  - Draft PR: [DO NOT MERGE] Remove unnecessary synchronization (miss-sync) during Parquet reading #18968
  - PRs:
    - Part 0, from the cuDF team: `batched_memset` to use a `host_span` arg instead of `std::vector` #19020
    - Part 1: Remove unnecessary synchronization (miss-sync) during Parquet reading (Part 1: `device_scalar`) #19055
    - Part 2: Remove `hostdevice_vector::element` due to unnecessary synchronization #19092
    - Part 3: uses host-pinned memory as a bounce buffer to assist with data copying; we spent considerable time discussing this aspect, as reflected in the RMM draft PRs.
      - in cuDF: Replace `rmm::device_scalar` with `cudf::detail::device_scalar` due to unnecessary synchronization (Part 3 of miss-sync) #19119 & Simplify cudf::scalar usage in reduce utility #19608
      - closed in RMM: [🚧 Draft]: Adding host-mr for pinned bounce buffer to `rmm::device_scalar` rmm#1985
      - Discussion & draft in RMM: [🚧 Draft]: Adding host-mr for pinned bounce buffer to `rmm::device_buffer` rmm#1996
    - Other discoveries: [BUG] Compiler segmentation fault when calling `make_host_vector` in certain cases #18980
- remove unnecessary synchronization in $${\color{lightblue}Step2}$$ I/O path:
- remove unnecessary synchronization at the pylibcudf level:
Complementary Tasks:
- reduce pool memory allocation latency in RMM:
My assumptions:
I mainly focus on direct-attached NVMe SSDs, a single A100 GPU, and single TPC-H Parquet tables. I enable PTDS (per-thread default stream) in cuDF and RMM. I chose to optimize the Parquet reader, but the approach should generalize to other open table formats.
Let’s enhance Parquet reading via multistreaming and pipelining!