[detailed] Device-to-Host LazyAwaitable #3477
Closed
+63
−2
Summary:
TL;DR
A DeviceToHostTensorAwaitable class is available to wrap the device-to-host data transfer; it defers the cudaEventSync call until the data is actually used on the host.

why awaitable
Defer the wait() call until running the function that uses the result from the comm op. This follows the lazy_awaitable concept, which is already implemented in torchrec.

why device-to-host transfer
The device-to-host transfer (via .to(non_blocking=True) and a cuda event; see PR 3436) usually spans multiple code domains, making it difficult to optimize. This PR reuses the LazyAwaitable concept for the device-side comms and wraps (1) the non-blocking device-to-host data transfer and (2) the cuda_event.wait() inside a DeviceToHostTensorAwaitable class for a better user experience.

results
DeviceToHostTensorAwaitable.wait() defers the cudaEventSync until the very end, where the result is actually needed by the host.

Differential Revision: D85211205
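The deferral pattern described above can be sketched in plain Python. This is a minimal illustration, not the PR's actual implementation: `FakeCudaEvent` is a hypothetical stand-in for a real CUDA event so the control flow is runnable without a GPU, and `DeviceToHostAwaitableSketch` is an assumed name, not the torchrec API.

```python
class FakeCudaEvent:
    """Hypothetical stand-in for a CUDA event; records whether the
    synchronization point was ever reached."""

    def __init__(self):
        self.synchronized = False

    def synchronize(self):
        # In the real flow this would block the host until the
        # non-blocking device-to-host copy recorded on this event completes.
        self.synchronized = True


class DeviceToHostAwaitableSketch:
    """Wraps an already-launched non-blocking D2H copy and defers the
    event sync until the host value is actually needed."""

    def __init__(self, host_buffer, copy_event):
        self._host_buffer = host_buffer  # filled asynchronously by the copy
        self._event = copy_event         # event recorded after the copy launch

    def wait(self):
        # The only synchronization point: block here, at first real use.
        self._event.synchronize()
        return self._host_buffer


# Usage: host-side work between construction and wait() overlaps with
# the (simulated) in-flight copy; the sync happens only inside wait().
event = FakeCudaEvent()
awaitable = DeviceToHostAwaitableSketch([1, 2, 3], event)
# ... other host work could run here, overlapping with the transfer ...
assert not event.synchronized        # no sync has happened yet
result = awaitable.wait()            # first real use of the data
assert event.synchronized and result == [1, 2, 3]
```

The design point this illustrates: because the sync lives inside `wait()`, callers can pass the awaitable across code domains freely, and only the code that finally consumes the host value pays the synchronization cost.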