Skip to content

Conversation

@mhaseeb123
Copy link
Member

Description

This PR refactors the multithreaded page index setup code in main parquet reader so that the child class used in hybrid scan reader can also benefit from it. No new code is introduced.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@copy-pr-bot
Copy link

copy-pr-bot bot commented Nov 25, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Nov 25, 2025
};

// Use parallel processing only if we have enough columns to justify the overhead
constexpr std::size_t parallel_threshold = 256;
Copy link
Member Author

@mhaseeb123 mhaseeb123 Nov 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vuule did you pick 512 for this empirically or did you run any experiments. I am seeing more benefit with 256 for the hybrid scan example but can revert.

@mhaseeb123 mhaseeb123 marked this pull request as ready for review November 25, 2025 01:39
@mhaseeb123 mhaseeb123 requested a review from a team as a code owner November 25, 2025 01:39
@mhaseeb123 mhaseeb123 added 3 - Ready for Review Ready for review by team cuIO cuIO issue improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Nov 25, 2025
return;
}

auto& row_groups = per_file_metadata.front().row_groups;
Copy link
Member Author

@mhaseeb123 mhaseeb123 Nov 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for all this anymore. Just call metadata.setup_page_index(..) method.

all_column_chunks.emplace_back(std::ref(col));
}
}
if (max_offset > min_offset) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why we were checking max_offset > 0, max_offset > min_offset should be a better check unless I am missing something here?

int64_t const length = max_offset - min_offset;
auto const idx_buf = source->host_read(min_offset, length);
// Flatten all columns into a single vector for easier task distribution
std::vector<std::reference_wrapper<ColumnChunk>> all_column_chunks;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved everything beyond this point as-is into the setup_page_index function.

{
CUDF_FUNC_RANGE();

// Flatten all columns into a single vector for easier task distribution
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved as-is from the constructor

metadata& operator=(metadata const& other) = delete;
metadata& operator=(metadata&& other) = default;

~metadata() = default;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use base reader's destructor that uses multiple threads

@mhaseeb123 mhaseeb123 requested a review from vuule November 25, 2025 20:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

3 - Ready for Review Ready for review by team cuIO cuIO issue improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant