Skip to content

[WIP][df] Introduce RDatasetSpec for RNTuple #19341

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

martamaja10
Copy link
Contributor

This PR introduces functionality of RDatasetSpec for RNTuple. Main changes are inside RLoopManager and RNTupleDS - mainly the introduction of WithGlobalRanges.

Marked as WIP to make sure all tests pass but the main development should be ready for the review.

@martamaja10 martamaja10 self-assigned this Jul 11, 2025
@martamaja10 martamaja10 force-pushed the datasetspec-rntuple branch from 3645459 to 48101c1 Compare July 11, 2025 09:51
Copy link

Test Results

    18 files      18 suites   3d 21h 48m 57s ⏱️
 3 198 tests  3 194 ✅ 0 💤  4 ❌
56 086 runs  56 035 ✅ 0 💤 51 ❌

For more details on these failures, see this check.

Results for commit 48101c1.

Copy link
Member

@hageboeck hageboeck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello, thanks for the development!

I left already a few comments.

@@ -104,12 +104,18 @@ class RNTupleDS final : public ROOT::RDF::RDataSource {
std::vector<std::vector<ROOT::Internal::RDF::RNTupleColumnReader *>> fActiveColumnReaders;

ULong64_t fSeenEntries = 0; ///< The number of entries so far returned by GetEntryRanges()
ULong64_t fSeenEntriesRange = 0;
ULong64_t fSeenEntriesNonRange = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find the name hard to understand. A small comment could help or if it's not too much work, one could rename it to something like fSeenEntriesOutOfRange (if that's what it does).

@@ -104,12 +104,18 @@ class RNTupleDS final : public ROOT::RDF::RDataSource {
std::vector<std::vector<ROOT::Internal::RDF::RNTupleColumnReader *>> fActiveColumnReaders;

ULong64_t fSeenEntries = 0; ///< The number of entries so far returned by GetEntryRanges()
ULong64_t fSeenEntriesRange = 0;
ULong64_t fSeenEntriesNonRange = 0;
ULong64_t counterFileEmpty = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe also start it with fCounter...?

Comment on lines +582 to +584
throw std::runtime_error(
"More than one RNTuple name was found, please make sure to use RNTuples with the same RnTuple name.");
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a temporary limitation?

if (nRemainingFiles == 0)
return;

unsigned int nEntries = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be safe, I would make this a size_t

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, with the below comment, you might be able to remove it.

@@ -554,12 +561,11 @@ void ROOT::RDF::RNTupleDS::PrepareNextRanges()
range.fFileName = fFileNames[fNextFileIndex];
range.fSource->Attach();
fNextFileIndex++;

auto nEntries = range.fSource->GetNEntries();
nEntries = range.fSource->GetNEntries();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

auto nEntries =? It looks like if could be declared only in the loop.

auto start = fCurrentRanges[i].fFirstEntry + fSeenEntries;
auto end = fCurrentRanges[i].fLastEntry + fSeenEntries;

start = fCurrentRanges[i].fFirstEntry + fSeenEntriesNonRange;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it need to be out of the loop?

Copy link
Member

@hageboeck hageboeck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see a few shadow warnings in the CI, but you probably saw them, too.

catch (const std::runtime_error &err) {
EXPECT_EQ(
std::string(err.what()),
"More than one RNTuple name was found, please make sure to use RNTuples with the same RnTuple name.");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not specifically for this PR, but general for RDF devs:
I always wonder if it makes sense to really test the entire error message. In this way, we always have to touch two places if we wanted to correct small things, such as in this case:
RnTuple --> RNTuple

I would either just test that there is any exception (possibly of a specific type), or just test a substring such as exc.startsWith("More than one RNTuple name");. In this way, small adjustments or making the error message more helpful doesn't immediately fail the test.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants