TroyGarden commented Oct 22, 2025

Summary:

TL;DR

  • A new `DeviceToHostTensorAwaitable` class is available to wrap the device-to-host data transfer; it defers the `cudaEventSynchronize` call until the data is actually used on the host.
  • It aims to help remove sync points in training optimization, which often suffers from CPU-blocking sync points.

why awaitable

  • as shown in the following diagram, a comms op is often best overlapped with another (irrelevant) compute op to better utilize the device
  • the idea is to defer the `wait()` call until the function that uses the comm op's result actually runs
  • a convenient way to achieve this "deferring" behavior is the `LazyAwaitable` concept, which is already implemented in torchrec (https://github.com/meta-pytorch/torchrec/blob/main/torchrec/distributed/types.py#L368)
  • diagram of (lazy_)awaitable in torchrec
[image]
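The deferral idea above can be sketched in a few lines of plain Python. This is a simplified, hypothetical illustration of the `LazyAwaitable` pattern, not torchrec's actual class (which additionally integrates with `__torch_function__` so the wait happens transparently on first tensor use):

```python
# Minimal sketch of the lazy-awaitable idea (hypothetical names; torchrec's
# real implementation lives in torchrec/distributed/types.py).
class SimpleLazyAwaitable:
    """Defers the blocking wait() until the result is first requested."""

    def __init__(self, work):
        # `work` is a callable that blocks until the async op finishes
        # and then returns its result (e.g. a collective's output).
        self._work = work
        self._result = None
        self._done = False

    def wait(self):
        if not self._done:
            self._result = self._work()  # the only blocking call
            self._done = True
        return self._result


# The comm is "issued" here, but nothing blocks yet ...
pending = SimpleLazyAwaitable(lambda: sum(range(10)))
# ... irrelevant compute can run in the meantime, and only the
# wait() call below pays the synchronization cost:
assert pending.wait() == 45
```

The key property is that the (potentially expensive) blocking call is moved from the point where the op is issued to the point where its result is consumed, which is exactly what allows overlap with unrelated compute.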

why device-to-host transfer

  • there are scenarios where on-device data is needed on the host side, such as metrics logging and data-dependent shape operations.
  • these patterns create a device-to-host sync (data transfer) that often blocks CPU execution, and the correct implementation (with `.to(non_blocking=True)` and a CUDA event: PR 3436) usually spans multiple code domains, making it difficult to optimize.
[image]
  • here we borrow the `LazyAwaitable` concept used for device-side comms and wrap (1) the non-blocking device-to-host data transfer and (2) the `cuda_event.wait()` inside a `DeviceToHostTensorAwaitable` class for a better user experience.
  • diagram of lazy_awaitable for device-to-host data transfer
[image]
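A minimal sketch of what such a wrapper could look like, under assumed names and a simplified API (the landed class may differ, e.g. by copying into a pre-allocated pinned-memory buffer so the device-to-host copy is truly asynchronous):

```python
import torch


class D2HTensorAwaitableSketch:
    """Sketch of the pattern: issue a non-blocking device-to-host copy plus
    a CUDA event now, and synchronize only when wait() is called."""

    def __init__(self, device_tensor: torch.Tensor) -> None:
        # Kick off the device-to-host copy without blocking the CPU.
        # NOTE: a production version would copy into pinned host memory,
        # since non_blocking=True is only truly async with a pinned target.
        self._host_tensor = device_tensor.to("cpu", non_blocking=True)
        self._event = None
        if device_tensor.is_cuda:
            # Record an event on the current stream so wait() can block
            # on just this copy instead of synchronizing the whole device.
            self._event = torch.cuda.Event()
            self._event.record()

    def wait(self) -> torch.Tensor:
        # Deferred sync: the cudaEventSynchronize cost is paid here,
        # only when the host actually needs the data.
        if self._event is not None:
            self._event.synchronize()
        return self._host_tensor
```

On a CPU-only tensor the event branch is skipped and `wait()` returns immediately, which keeps the sketch usable in unit tests without a GPU.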

results

  • pipeline: pre-comm compute -> all-to-all comm -> irrelevant compute -> comm results check -> post-comm compute -> comm check assertion (cpu-side)
  • the "comms check" result is on device and is needed for validation (host-side assertion)
  • `DeviceToHostTensorAwaitable.wait()` defers the `cudaEventSynchronize` until the very end, where the result is actually needed by the host.
  • in the trace you can see the post-comm computes are scheduled before the host-side assertion.
[image]

NOTE: in this version of the implementation we don't use a separate stream (as shown in the diagram above) for the non-blocking device-to-host data transfer, because the data volume is usually relatively small. The trace below uses a separate stream for the device-to-host transfer.

[image]
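For reference, the separate-stream variant mentioned in the note could be set up roughly as follows. This is a sketch with assumed names, not the landed code (which issues the copy on the current stream instead):

```python
import torch


def d2h_on_side_stream(device_tensor: torch.Tensor):
    """Copy device->host on a dedicated stream; return (host_tensor, event).

    The caller defers event.synchronize() until first host use.
    """
    if not device_tensor.is_cuda:
        # CPU fallback: nothing to overlap, return a plain copy and no event.
        return device_tensor.clone(), None

    side = torch.cuda.Stream()
    # Order the side stream after the producer's work on the current stream,
    # so the copy reads fully-written data.
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side):
        host = device_tensor.to("cpu", non_blocking=True)
        event = torch.cuda.Event()
        event.record()  # recorded on the side stream
    return host, event


# Later, at first host-side use of `host`:
#     if event is not None:
#         event.synchronize()
```

The trade-off noted above applies: for small transfers the extra stream buys little overlap, so issuing the copy on the current stream keeps things simpler.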

Differential Revision: D85211205

meta-codesync bot commented Oct 22, 2025

@TroyGarden has exported this pull request. If you are a Meta employee, you can view the originating Diff in D85211205.

meta-cla bot added the "CLA Signed" label Oct 22, 2025
TroyGarden changed the title from "Device-to-Host LazyAwaitable [detailed]" to "Device-to-Host LazyAwaitable" Oct 22, 2025
meta-codesync bot closed this in a4ca26f Oct 22, 2025
TroyGarden deleted the export-D85211205 branch October 22, 2025 18:06