To the cuDF team:
Thank you for your help and for the high quality of the cuDF code, especially the Parquet reader. This story issue documents my series of studies on optimizing the cuDF Parquet reader, building on insights from existing issues. As an external contributor, I aim to explain my high-level ideas here first, with detailed implementations in child issues and draft PRs. I will keep refining this issue and its children so that it serves as an umbrella for my previous and future issues. Your feedback is welcome!
This story issue outlines my series of studies on optimizing the cuDF Parquet reader using pipelining and multistreaming. Here, pipelining means overlapping I/O operations with GPU kernel execution to hide latency, pushing throughput toward the hardware limits.
Problem statement: why and what this story covers
Pipelining is critical for: (1) reading Parquet at maximum speed, and (2) overlapping future computation that consumes the decompressed chunks. While many CPU-based Parquet readers have attempted this approach, few (if any) achieve sustained high read performance.
Building on this motivation for pipelining, this story addresses both pipelining and multistreaming in the Parquet reader. My prior issues #17873 and #18268, cuIO team experience (#18278 (comment)), and observations in the Velox-CuDF TableScan all revealed unstable performance with pipelining. The root cause of this instability (termed "pipeline bubbles" by the cuDF team) remains unidentified, and no fix exists yet. Resolving this instability is critical for achieving predictable speedups in the Parquet reader, as well as in all applications on top of libcudf: pylibcudf, RAPIDS Spark, and Velox-CuDF.
Background: workflow $${\color{lightblue}Steps}$$ in `read_parquet`
Understanding the Parquet read path is foundational to effective profiling and optimization. In my understanding, a `read_parquet` call on one CUDA stream executes three sequential, blocking steps:

- $${\color{lightblue}Step1}$$ : metadata reading from the Parquet footer
- $${\color{lightblue}Step2}$$ : I/O operations via KvikIO or general data transfer
- $${\color{lightblue}Step3}$$ : GPU kernels (decompression and decoding)
To maximize throughput in the Parquet reader, each of these three steps must be profiled and optimized.
A figure showing these three steps:
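The three steps and the intended overlap can be sketched in plain Python (no cuDF dependency; `read_metadata`, `read_chunk`, and `decode_chunk` are hypothetical stand-ins for the footer parse, the KvikIO transfer, and the decode kernels). The sketch prefetches the next chunk's I/O while the current chunk is being decoded, which is the essence of the pipelining this story targets:

```python
from concurrent.futures import ThreadPoolExecutor

def read_metadata(path):
    # Step 1 stand-in: parse the Parquet footer once per file
    return {"path": path, "num_chunks": 4}

def read_chunk(meta, k):
    # Step 2 stand-in: I/O for chunk k (KvikIO / async copies in cuDF)
    return f"raw-{k}"

def decode_chunk(raw):
    # Step 3 stand-in: decompression + decode kernels
    return raw.replace("raw", "table")

def read_parquet_pipelined(path):
    meta = read_metadata(path)
    out = []
    with ThreadPoolExecutor(max_workers=1) as io_pool:
        fut = io_pool.submit(read_chunk, meta, 0)  # prefetch chunk 0
        for k in range(meta["num_chunks"]):
            raw = fut.result()
            if k + 1 < meta["num_chunks"]:
                # overlap: issue I/O for chunk k+1 while decoding chunk k
                fut = io_pool.submit(read_chunk, meta, k + 1)
            out.append(decode_chunk(raw))
    return out

print(read_parquet_pipelined("lineitem.parquet"))
# → ['table-0', 'table-1', 'table-2', 'table-3']
```

The same structure applies on the GPU with CUDA streams instead of a thread pool; any blocking call inside the loop destroys the overlap, which is exactly the instability problem described below.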
Problems addressed in this story
Building on the motivation for optimizing the reader and its internal workflow, this story issue tracks the following limitations of the current reader:
$${\color{red}Problem1}$$ : Inefficient read in non-standard workloads
- Performance relies on empirical, workload-dependent rules (e.g., "read ~1 GB in large chunks to be fast"). Other read workloads get suboptimal performance: small-sized reads and repeated reads of the same file.
- Location of this problem: $${\color{lightblue}Step1}$$ metadata reading and $${\color{lightblue}Step2}$$ I/O
$${\color{red}Problem2}$$ : Unstable pipelining performance
- As mentioned above, unnecessary synchronization overhead causes unstable read performance. The root cause remains unclear.
- Issues:
- Location of this problem: $${\color{lightblue}Step2}$$ I/O and $${\color{lightblue}Step3}$$ GPU kernels
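To make the "pipeline bubble" concrete, here is a toy back-of-the-envelope cost model (plain Python; the per-step costs are made-up numbers for illustration, not cuDF measurements):

```python
# Toy cost model of a pipeline bubble. Units are arbitrary; the numbers
# are illustrative assumptions, not measurements of cuDF.
N, IO, KERNEL, SYNC = 8, 10, 10, 5  # chunks; per-chunk I/O, kernel, sync cost

# Fully pipelined: after the first I/O fills the pipeline, I/O for chunk
# k+1 overlaps the kernels for chunk k, so the steady state is bounded by
# the slower of the two stages.
pipelined = IO + N * max(IO, KERNEL)      # 10 + 8 * 10 = 90

# With a blocking synchronization per chunk (e.g., a blocking readback of
# a single scalar), the overlap is lost: every chunk pays I/O + kernel +
# sync serially.
with_bubbles = N * (IO + KERNEL + SYNC)   # 8 * 25 = 200

print(pipelined, with_bubbles)
# → 90 200
```

Under these assumed costs, one small blocking call per chunk more than doubles the total time, which is why the tasks below hunt down individual synchronization points.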
(More problems: TBD. One idea is to reduce the number of kernels that need optimizing.)
Proposed tasks
To address the problems outlined above, I propose the following orthogonal tasks, each tracked in a dedicated child issue and draft PR under this umbrella story. These fixes are designed to combine for cumulative performance gains:
$${\color{green}Task1}$$ addressing $${\color{red}Problem1}$$ : Metadata caching
- Child-Issue: [FEA] Parquet metadata caching due to overhead in reader #18890
- Draft PR: [DO NOT MERGE] [POC] Metadata caching prototype in Parquet reader #18891
$${\color{green}Task2}$$ addressing $${\color{red}Problem2}$$ : Eliminating unnecessary synchronization
Here are several engineering subtasks and takeaways:
- remove unnecessary synchronization in $${\color{lightblue}Step3}$$ GPU kernels:
  - Child-Issue: [FEA] Unstable pipelining performance in Parquet reading due to "miss-sync" #18967
  - Draft PR: [DO NOT MERGE] Remove unnecessary synchronization (miss-sync) during Parquet reading #18968
  - PRs:
    - Part 0, from the cuDF team: `batched_memset` to use a `host_span` arg instead of `std::vector` #19020
    - Part 1: Remove unnecessary synchronization (miss-sync) during Parquet reading (Part 1: `device_scalar`) #19055
    - Part 2: Remove `hostdevice_vector::element` due to unnecessary synchronization #19092
    - Part 3: uses host-pinned memory as a bounce buffer to assist with data copying; we spent considerable time discussing this aspect, as reflected in the RMM draft PRs.
      - in cuDF: Replace `rmm::device_scalar` with `cudf::detail::device_scalar` due to unnecessary synchronization (Part 3 of miss-sync) #19119 & Simplify cudf::scalar usage in reduce utility #19608
      - closed in RMM: [🚧 Draft]: Adding host-mr for pinned bounce buffer to `rmm::device_scalar` rmm#1985
      - Discussion & draft in RMM: [🚧 Draft]: Adding host-mr for pinned bounce buffer to `rmm::device_buffer` rmm#1996
    - Other discoveries: [BUG] Compiler segmentation fault when calling `make_host_vector` in certain cases #18980
- remove unnecessary synchronization in $${\color{lightblue}Step2}$$ I/O path:
- remove unnecessary synchronization at the pylibcudf level:
Complementary Tasks:
- reduce pool memory allocation latency in RMM:
My assumptions:
I mainly focus on direct-attached NVMe SSDs, a single A100 GPU, and single TPC-H Parquet tables. I enable PTDS (per-thread default stream) in cuDF and RMM. I chose to optimize the Parquet reader, but the approach should generalize to other open table formats.
Let’s enhance Parquet reading via multistreaming and pipelining!