-
Notifications
You must be signed in to change notification settings - Fork 1.4k
[WIP][df] Introduce RDatasetSpec for RNTuple #19341
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
3645459
to
48101c1
Compare
Test Results 18 files 18 suites 3d 21h 48m 57s ⏱️ For more details on these failures, see this check. Results for commit 48101c1. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello, thanks for the development!
I left already a few comments.
@@ -104,12 +104,18 @@ class RNTupleDS final : public ROOT::RDF::RDataSource { | |||
std::vector<std::vector<ROOT::Internal::RDF::RNTupleColumnReader *>> fActiveColumnReaders; | |||
|
|||
ULong64_t fSeenEntries = 0; ///< The number of entries so far returned by GetEntryRanges() | |||
ULong64_t fSeenEntriesRange = 0; | |||
ULong64_t fSeenEntriesNonRange = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find the name hard to understand. A small comment could help or if it's not too much work, one could rename it to something like fSeenEntriesOutOfRange
(if that's what it does).
@@ -104,12 +104,18 @@ class RNTupleDS final : public ROOT::RDF::RDataSource { | |||
std::vector<std::vector<ROOT::Internal::RDF::RNTupleColumnReader *>> fActiveColumnReaders; | |||
|
|||
ULong64_t fSeenEntries = 0; ///< The number of entries so far returned by GetEntryRanges() | |||
ULong64_t fSeenEntriesRange = 0; | |||
ULong64_t fSeenEntriesNonRange = 0; | |||
ULong64_t counterFileEmpty = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe also start it with fCounter...
?
throw std::runtime_error( | ||
"More than one RNTuple name was found, please make sure to use RNTuples with the same RnTuple name."); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a temporary limitation?
if (nRemainingFiles == 0) | ||
return; | ||
|
||
unsigned int nEntries = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be safe, I would make this a size_t
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, with the below comment, you might be able to remove it.
@@ -554,12 +561,11 @@ void ROOT::RDF::RNTupleDS::PrepareNextRanges() | |||
range.fFileName = fFileNames[fNextFileIndex]; | |||
range.fSource->Attach(); | |||
fNextFileIndex++; | |||
|
|||
auto nEntries = range.fSource->GetNEntries(); | |||
nEntries = range.fSource->GetNEntries(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
auto nEntries =
? It looks like if could be declared only in the loop.
auto start = fCurrentRanges[i].fFirstEntry + fSeenEntries; | ||
auto end = fCurrentRanges[i].fLastEntry + fSeenEntries; | ||
|
||
start = fCurrentRanges[i].fFirstEntry + fSeenEntriesNonRange; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it need to be out of the loop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see a few shadow warnings in the CI, but you probably saw them, too.
catch (const std::runtime_error &err) { | ||
EXPECT_EQ( | ||
std::string(err.what()), | ||
"More than one RNTuple name was found, please make sure to use RNTuples with the same RnTuple name."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not specifically for this PR, but general for RDF devs:
I always wonder if it makes sense to really test the entire error message. In this way, we always have to touch two places if we wanted to correct small things, such as in this case:
RnTuple --> RNTuple
I would either just test that there is any exception (possibly of a specific type), or just test a substring such as exc.startsWith("More than one RNTuple name");
. In this way, small adjustments or making the error message more helpful doesn't immediately fail the test.
This PR introduces functionality of RDatasetSpec for RNTuple. Main changes are inside RLoopManager and RNTupleDS - mainly the introduction of WithGlobalRanges.
Marked as WIP to make sure all tests pass but the main development should be ready for the review.