TroyGarden commented Oct 22, 2025

Summary:

TL;DR

  • A new `DeviceToHostTensorAwaitable` class is available to wrap the device-to-host data transfer; it defers the `cudaEventSynchronize` call until the data is actually used on the host.
  • It aims to help remove sync points in training optimization, which often suffers from CPU-blocking sync points.

why awaitable

  • as shown in the following diagram, a comms op is often best overlapped with another (irrelevant) compute op to better utilize the device
  • the idea is to defer the `wait()` call until the function that uses the comm op's result actually runs
  • a convenient way to achieve this "deferring" behavior is the `LazyAwaitable` concept, which is already implemented in torchrec (https://github.com/meta-pytorch/torchrec/blob/main/torchrec/distributed/types.py#L368)
  • diagram of (lazy_)awaitable in torchrec
[image]
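The deferral idea above can be sketched in a few lines of plain Python. This is a simplified, hypothetical illustration of the `LazyAwaitable` pattern, not torchrec's actual class (which additionally integrates with `__torch_function__` so the wait happens transparently on first tensor use):

```python
# Minimal sketch of the lazy-awaitable idea (hypothetical names; torchrec's
# real implementation lives in torchrec/distributed/types.py).
class SimpleLazyAwaitable:
    """Defers the blocking wait() until the result is first requested."""

    def __init__(self, work):
        # `work` is a callable that blocks until the async op finishes
        # and then returns its result (e.g. a collective's output).
        self._work = work
        self._result = None
        self._done = False

    def wait(self):
        if not self._done:
            self._result = self._work()  # the only blocking call
            self._done = True
        return self._result


# The comm is "issued" here, but nothing blocks yet ...
pending = SimpleLazyAwaitable(lambda: sum(range(10)))
# ... irrelevant compute can run in the meantime, and only the
# wait() call below pays the synchronization cost:
assert pending.wait() == 45
```

The key property is that the (potentially expensive) blocking call is moved from the point where the op is issued to the point where its result is consumed, which is exactly what allows overlap with unrelated compute.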

why device-to-host transfer

  • there are scenarios where on-device data is needed on the host side, such as metrics logging and data-dependent shape operations.
  • these patterns create a device-to-host sync (data transfer) that often blocks CPU execution, and the correct implementation (with `.to(non_blocking=True)` and a CUDA event: PR 3436) usually spans multiple code domains, making it difficult to optimize.
[image]
  • here we borrow the `LazyAwaitable` concept used for device-side comms and wrap (1) the non-blocking device-to-host data transfer and (2) the `cuda_event.wait()` inside a `DeviceToHostTensorAwaitable` class for a better user experience.
  • diagram of lazy_awaitable for device-to-host data transfer
[image]
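A minimal sketch of what such a wrapper could look like, under assumed names and a simplified API (the landed class may differ, e.g. by copying into a pre-allocated pinned-memory buffer so the device-to-host copy is truly asynchronous):

```python
import torch


class D2HTensorAwaitableSketch:
    """Sketch of the pattern: issue a non-blocking device-to-host copy plus
    a CUDA event now, and synchronize only when wait() is called."""

    def __init__(self, device_tensor: torch.Tensor) -> None:
        # Kick off the device-to-host copy without blocking the CPU.
        # NOTE: a production version would copy into pinned host memory,
        # since non_blocking=True is only truly async with a pinned target.
        self._host_tensor = device_tensor.to("cpu", non_blocking=True)
        self._event = None
        if device_tensor.is_cuda:
            # Record an event on the current stream so wait() can block
            # on just this copy instead of synchronizing the whole device.
            self._event = torch.cuda.Event()
            self._event.record()

    def wait(self) -> torch.Tensor:
        # Deferred sync: the cudaEventSynchronize cost is paid here,
        # only when the host actually needs the data.
        if self._event is not None:
            self._event.synchronize()
        return self._host_tensor
```

On a CPU-only tensor the event branch is skipped and `wait()` returns immediately, which keeps the sketch usable in unit tests without a GPU.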

results

  • pipeline: pre-comm compute -> all-to-all comm -> irrelevant compute -> comm results check -> post-comm compute -> comm check assertion (cpu-side)
  • the "comms check" result is on device and is needed for validation (host-side assertion)
  • `DeviceToHostTensorAwaitable.wait()` defers the `cudaEventSynchronize` until the very end, where the result is actually needed by the host.
  • in the trace you can see the post-comm computes are scheduled before the host-side assertion.
[image]

NOTE: in this version of the implementation we don't use a separate stream (as shown in the diagram above) for the non-blocking device-to-host data transfer, because the data volume is usually relatively small. The trace below uses a separate stream for the device-to-host transfer.

[image]
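For reference, the separate-stream variant mentioned in the note could be set up roughly as follows. This is a sketch with assumed names, not the landed code (which issues the copy on the current stream instead):

```python
import torch


def d2h_on_side_stream(device_tensor: torch.Tensor):
    """Copy device->host on a dedicated stream; return (host_tensor, event).

    The caller defers event.synchronize() until first host use.
    """
    if not device_tensor.is_cuda:
        # CPU fallback: nothing to overlap, return a plain copy and no event.
        return device_tensor.clone(), None

    side = torch.cuda.Stream()
    # Order the side stream after the producer's work on the current stream,
    # so the copy reads fully-written data.
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side):
        host = device_tensor.to("cpu", non_blocking=True)
        event = torch.cuda.Event()
        event.record()  # recorded on the side stream
    return host, event


# Later, at first host-side use of `host`:
#     if event is not None:
#         event.synchronize()
```

The trade-off noted above applies: for small transfers the extra stream buys little overlap, so issuing the copy on the current stream keeps things simpler.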

Differential Revision: D85211205

meta-codesync bot commented Oct 22, 2025

@TroyGarden has exported this pull request. If you are a Meta employee, you can view the originating Diff in D85211205.

meta-cla bot added the "CLA Signed" label Oct 22, 2025
TroyGarden changed the title from "Device-to-Host LazyAwaitable [detailed]" to "Device-to-Host LazyAwaitable" Oct 22, 2025
meta-codesync bot closed this in a4ca26f Oct 22, 2025
TroyGarden deleted the export-D85211205 branch October 22, 2025 18:06