
TroyGarden (Contributor) commented Oct 5, 2025

Summary:

context

  • add a benchmark for torch.distributed.all_to_all_single.
  • in this first version, the all_to_all_single call runs synchronously on the same stream as the compute.
  • basic operations (see the sketch after this list):
    pre-comms compute (GPU compute heavy) ==> all_to_all_single comms ==> irrelevant compute (GPU compute heavy, does not depend on the comms data) ==> post-comms compute (GPU compute heavy, uses the comms data)
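A minimal sketch of this pattern, assuming an initialized default process group and a matmul as the stand-in for the "GPU compute heavy" stages (the benchmark's actual helpers, names, and shapes may differ):

```python
import torch
import torch.distributed as dist

# Sketch only: assumes dist.init_process_group(...) has already run and that
# x's first dim is divisible by the world size (equal all-to-all splits).
def a2a_sync_step(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    pre = x @ w                       # pre-comms compute (GPU compute heavy)
    out = torch.empty_like(pre)
    # sync comms: async_op defaults to False, and the op is enqueued on the
    # same (default) stream as the surrounding compute kernels
    dist.all_to_all_single(out, pre)
    _ = x @ w                         # irrelevant compute (does not depend on comms data)
    return out @ w                    # post-comms compute (uses the comms data)
```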

other changes

  • extend the cmd_conf decorator so that it supports selecting among multiple programs on the command line (a dispatch sketch follows this list):
python -m torchrec.distributed.benchmark.benchmark_comms \
  a2a_single --name=a2a_sync_base-$(git rev-parse --short HEAD || echo $USER)
  • add a config (dataclass) class BenchmarkFunc for benchmark_func, which bundles the most common arguments passed to benchmark_func (a sketch also follows this list)

  • the trace shows that all_to_all_single (comms) runs in the same stream as compute (screenshot below).
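The dispatch idea, as a hypothetical sketch (cmd_conf's actual implementation is not shown in this PR description; the registry below is illustrative): the first positional token selects a registered program, and the remaining flags are parsed by that program's config.

```python
import sys
from typing import Callable, Dict

# Hypothetical registry; the real cmd_conf decorator may work differently.
_PROGRAMS: Dict[str, Callable[[], None]] = {}

def register(name: str) -> Callable[[Callable[[], None]], Callable[[], None]]:
    """Register a benchmark entry point under a CLI-selectable name."""
    def deco(fn: Callable[[], None]) -> Callable[[], None]:
        _PROGRAMS[name] = fn
        return fn
    return deco

@register("a2a_single")
def a2a_single() -> None:
    ...  # run the all_to_all_single benchmark; flags like --name are parsed by the config layer

if __name__ == "__main__":
    prog_name = sys.argv[1]   # e.g. "a2a_single"
    _PROGRAMS[prog_name]()    # remaining argv entries carry the program's flags
```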

[trace screenshot: the all_to_all_single comms kernel on the same stream as the compute kernels]
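For the BenchmarkFunc config mentioned above, a minimal sketch of the idea; the field names below are illustrative guesses, not the actual definition:

```python
from dataclasses import dataclass

# Illustrative fields only; the real BenchmarkFunc may define a different set.
@dataclass
class BenchmarkFunc:
    name: str = "unnamed"     # run label, e.g. "a2a_sync_base-<git sha>"
    num_benchmarks: int = 10  # timed iterations of the benchmarked function
    num_profiles: int = 2     # additional iterations captured in the trace
    world_size: int = 2       # ranks participating in the comms
    profile_dir: str = ""     # where to dump the trace, if set
```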

Differential Revision: D83900855

meta-codesync bot (Contributor) commented Oct 5, 2025

@TroyGarden has exported this pull request. If you are a Meta employee, you can view the originating Diff in D83900855.

meta-cla bot added the CLA Signed label Oct 5, 2025
TroyGarden added a commit to TroyGarden/torchrec that referenced this pull request Oct 5, 2025
TroyGarden added a commit to TroyGarden/torchrec that referenced this pull request Oct 6, 2025
TroyGarden added a commit to TroyGarden/torchrec that referenced this pull request Oct 6, 2025
TroyGarden changed the title from "benchmark for comms used in TorchRec" to "[benchmark] add a benchmark file for comms used in TorchRec" Oct 6, 2025
TroyGarden added a commit to TroyGarden/torchrec that referenced this pull request Oct 6, 2025
meta-codesync bot closed this in 8cd65b1 Oct 6, 2025
TroyGarden deleted the export-D83900855 branch October 7, 2025 03:27

Labels

CLA Signed · fb-exported · meta-exported