-
Notifications
You must be signed in to change notification settings - Fork 5.5k
refactor: Change native pos API to return BaseSerializedPage #26692
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Reviewer's GuideRefactors the native shuffle and exchange path to use a new ShuffleSerializedPage (subclass of velox::exec::BaseSerializedPage) instead of the custom ReadBatch/SerializedPage types, and updates readers, exchange sources, the ShuffleRead operator, and tests to consume this unified page abstraction. Sequence diagram for ShuffleReader to ShuffleRead data flow using ShuffleSerializedPagesequenceDiagram
participant ShuffleRead
participant ShuffleExchangeSource
participant LocalShuffleReader
participant ExchangeQueue
ShuffleRead->>ShuffleExchangeSource: request(maxBytes, maxWait)
activate ShuffleExchangeSource
ShuffleExchangeSource->>LocalShuffleReader: next(maxBytes)
activate LocalShuffleReader
LocalShuffleReader-->>ShuffleExchangeSource: SemiFuture<vector<ShuffleSerializedPage>>
deactivate LocalShuffleReader
ShuffleExchangeSource->>ShuffleExchangeSource: deferValue(on batches)
ShuffleExchangeSource->>ExchangeQueue: enqueueLocked(ShuffleSerializedPage pages)
ExchangeQueue-->>ShuffleExchangeSource: ContinuePromise list
ShuffleExchangeSource-->>ShuffleRead: Response{totalBytes, atEnd}
deactivate ShuffleExchangeSource
ShuffleRead->>ShuffleRead: getOutput()
ShuffleRead->>ShuffleRead: iterate currentPages_ as BaseSerializedPage
ShuffleRead->>ShuffleRead: cast each page to ShuffleSerializedPage
ShuffleRead->>ShuffleRead: use rows() and numRows() to build RowVector
Class diagram for unified ShuffleSerializedPage hierarchy and consumersclassDiagram
class BaseSerializedPage {
<<abstract>>
+size() uint64_t
+numRows() std::optional<int64_t>
+prepareStreamForDeserialize() ByteInputStream*
+getIOBuf() follyIOBuf*
}
class PrestoSerializedPage {
}
BaseSerializedPage <|-- PrestoSerializedPage
class ShuffleSerializedPage {
<<abstract>>
+prepareStreamForDeserialize() ByteInputStream*
+getIOBuf() follyIOBuf*
+rows() std::vector_string_view&
}
BaseSerializedPage <|-- ShuffleSerializedPage
class LocalShuffleSerializedPage {
-rows_ std::vector_string_view
-buffer_ BufferPtr
+LocalShuffleSerializedPage(rows std::vector_string_view, buffer BufferPtr)
+rows() std::vector_string_view&
+size() uint64_t
+numRows() std::optional_int64_t
}
ShuffleSerializedPage <|-- LocalShuffleSerializedPage
class ShuffleReader {
<<interface>>
+next(maxBytes uint64_t) SemiFuture_vector_unique_ptr_ShuffleSerializedPage
+noMoreData(success bool) void
}
class LocalShuffleReader {
+initialize() void
+next(maxBytes uint64_t) SemiFuture_vector_unique_ptr_ShuffleSerializedPage
-nextSorted(maxBytes uint64_t) vector_unique_ptr_ShuffleSerializedPage
-nextUnsorted(maxBytes uint64_t) vector_unique_ptr_ShuffleSerializedPage
}
ShuffleReader <|.. LocalShuffleReader
LocalShuffleReader ..> LocalShuffleSerializedPage
LocalShuffleReader ..> ShuffleSerializedPage
class ShuffleExchangeSource {
+request(maxBytes int64_t, maxWait microseconds) SemiFuture_Response
}
ShuffleExchangeSource ..> ShuffleReader
ShuffleExchangeSource ..> ShuffleSerializedPage
ShuffleExchangeSource ..> BaseSerializedPage
class ShuffleRead {
-currentPages_ vector_unique_ptr_BaseSerializedPage
+getOutput() RowVectorPtr
}
ShuffleRead ..> BaseSerializedPage
ShuffleRead ..> ShuffleSerializedPage
class BroadcastExchangeSource {
+request(maxBytes int64_t, maxWait microseconds) Response
}
BroadcastExchangeSource ..> BaseSerializedPage
BroadcastExchangeSource ..> PrestoSerializedPage
class PrestoExchangeSource {
-processDataResponse(response DataResponsePtr) void
}
PrestoExchangeSource ..> BaseSerializedPage
PrestoExchangeSource ..> PrestoSerializedPage
class ReadBatch {
<<removed>>
-rows std::vector_string_view
-data BufferPtr
}
class ShuffleRowBatch {
<<removed>>
-rowBatch_ unique_ptr_ReadBatch
+rows() std::vector_string_view&
}
ReadBatch ..> ShuffleRowBatch
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey there - I've reviewed your changes - here's some feedback:
- In ShuffleExchangeSource::request, totalBytes is no longer updated inside the deferValue loop (it stays 0 for the Response), so you likely want to accumulate page->size() for each enqueued batch as before to preserve correct accounting.
- LocalShuffleSerializedPage takes const std::vectorstd::string_view& rows and stores it in a const member initialized with std::move(rows), which just copies because the parameter is const; consider making the parameter by value or rvalue-reference and the member non-const so you can actually move and avoid an extra copy.
- ShuffleSerializedPage::rows() returns a non-const reference but cannot be called on a const ShuffleSerializedPage, which is awkward for read-only use; consider making rows() const (and updating overrides) so callers with const BaseSerializedPage pointers can still access row views without casting away constness.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In ShuffleExchangeSource::request, totalBytes is no longer updated inside the deferValue loop (it stays 0 for the Response), so you likely want to accumulate page->size() for each enqueued batch as before to preserve correct accounting.
- LocalShuffleSerializedPage takes const std::vector<std::string_view>& rows and stores it in a const member initialized with std::move(rows), which just copies because the parameter is const; consider making the parameter by value or rvalue-reference and the member non-const so you can actually move and avoid an extra copy.
- ShuffleSerializedPage::rows() returns a non-const reference but cannot be called on a const ShuffleSerializedPage, which is awkward for read-only use; consider making rows() const (and updating overrides) so callers with const BaseSerializedPage pointers can still access row views without casting away constness.
## Individual Comments
### Comment 1
<location> `presto-native-execution/presto_cpp/main/operators/ShuffleExchangeSource.cpp:38` </location>
<code_context>
- .deferValue([this](std::vector<std::unique_ptr<ReadBatch>>&& batches) {
</code_context>
<issue_to_address>
**issue (bug_risk):** totalBytes is no longer updated and will always be 0 in the Response
In the new lambda, `totalBytes` is set to 0 but never updated in the loop over `batches`, whereas previously it was set from `batch->data->size()`. This means `request` will always return a `Response` with `totalBytes == 0`, which can break byte-based accounting and backpressure. Please restore accumulation of `totalBytes` (e.g., sum `batch->size()` for each enqueued page and re-apply the `int32_t` bounds check if still required).
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| std::chrono::microseconds /*maxWait*/) { | ||
| auto nextBatch = [this, maxBytes]() { | ||
| return std::move(shuffleReader_->next(maxBytes)) | ||
| .deferValue([this](std::vector<std::unique_ptr<ReadBatch>>&& batches) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue (bug_risk): totalBytes is no longer updated and will always be 0 in the Response
In the new lambda, totalBytes is set to 0 but never updated in the loop over batches, whereas previously it was set from batch->data->size(). This means request will always return a Response with totalBytes == 0, which can break byte-based accounting and backpressure. Please restore accumulation of totalBytes (e.g., sum batch->size() for each enqueued page and re-apply the int32_t bounds check if still required).
xiaoxmeng
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tanjialiang thanks!
…b#26692) Summary: Use BaseSerializedPage directly from shuffle. This allows seamless handle of shuffle data all the way to ShuffleRead(subclass of Exchange operator) Reviewed By: xiaoxmeng Differential Revision: D87852058
1159993 to
f4cd744
Compare
…b#26692) Summary: Use BaseSerializedPage directly from shuffle. This allows seamless handle of shuffle data all the way to ShuffleRead(subclass of Exchange operator) Reviewed By: xiaoxmeng Differential Revision: D87852058
f4cd744 to
76b6a96
Compare
…b#26692) Summary: Use BaseSerializedPage directly from shuffle. This allows seamless handle of shuffle data all the way to ShuffleRead(subclass of Exchange operator) Reviewed By: xiaoxmeng Differential Revision: D87852058
76b6a96 to
dc70c38
Compare
Summary: Use BaseSerializedPage directly from shuffle. This allows seamless handle of shuffle data all the way to ShuffleRead(subclass of Exchange operator)
Differential Revision: D87852058