[UT]XCCL remains the default backend for XPU #1721

Closed · Chao1Han wants to merge 4 commits

Conversation

@Chao1Han (Contributor) commented Jun 5, 2025

This test verifies that XCCL remains the default backend for XPU even when other backends are registered for the XPU device.

Copilot AI review requested due to automatic review settings (June 5, 2025 05:16)

Copilot AI left a comment

Pull Request Overview

Adds a new unit test to confirm that XCCL remains the default distributed backend on XPU even after another backend is registered.

  • Introduces test_xccl_priority to register a dummy backend and run an all-reduce call without specifying a backend.
  • Leverages existing requires_xccl decorator to skip if XCCL isn’t available.
Comments suppressed due to low confidence (2)

test/xpu/distributed/test_c10d_xccl.py:568

  • The test currently only invokes all_reduce but doesn't assert that the default backend is actually XCCL. Consider retrieving the process group (e.g., via dist.distributed_c10d._get_default_group()) and asserting its type or backend name to ensure the priority behavior is verified.
dist.all_reduce(a)
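
A minimal sketch of the suggested assertion (assuming the process group has been initialized as in the test below; dist.get_backend is the public accessor, and the expected name "xccl" is an assumption based on this PR's title):

    # Hedged sketch, not part of the PR: assert the default group is backed
    # by XCCL. Assumes the test context above (dist already imported, group
    # initialized); "xccl" as the backend name is assumed from the PR title.
    default_pg = dist.distributed_c10d._get_default_group()
    self.assertEqual(dist.get_backend(default_pg), "xccl")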

test/xpu/distributed/test_c10d_xccl.py:555

  • [nitpick] The test name test_xccl_priority is a bit generic. Consider renaming to test_default_backend_is_xccl_when_fake_registered for clarity on what scenario is covered.
def test_xccl_priority(self):

Comment on lines +556 to +568
dist.Backend.register_backend(
    "fake",
    lambda store, rank, size, timeout: dist.ProcessGroup(rank, size),
    devices=["xpu"],
)
store = dist.FileStore(self.file_name, self.world_size)
dist.init_process_group(
    world_size=self.world_size,
    rank=self.rank,
    store=store,
)
a = torch.randn(2, device="xpu")
dist.all_reduce(a)

Copilot AI commented Jun 5, 2025

After registering the fake backend, consider unregistering it in a finally block or teardown step to avoid side effects on other tests.

Suggested change

Original:

    dist.Backend.register_backend(
        "fake",
        lambda store, rank, size, timeout: dist.ProcessGroup(rank, size),
        devices=["xpu"],
    )
    store = dist.FileStore(self.file_name, self.world_size)
    dist.init_process_group(
        world_size=self.world_size,
        rank=self.rank,
        store=store,
    )
    a = torch.randn(2, device="xpu")
    dist.all_reduce(a)

Suggested:

    try:
        dist.Backend.register_backend(
            "fake",
            lambda store, rank, size, timeout: dist.ProcessGroup(rank, size),
            devices=["xpu"],
        )
        store = dist.FileStore(self.file_name, self.world_size)
        dist.init_process_group(
            world_size=self.world_size,
            rank=self.rank,
            store=store,
        )
        a = torch.randn(2, device="xpu")
        dist.all_reduce(a)
    finally:
        dist.Backend.unregister_backend("fake")


@Chao1Han (Contributor, Author) replied:

Other test cases explicitly initialize with the xccl backend, so it is safe not to unregister the fake backend here.
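
For context, a minimal sketch of the explicit-initialization pattern this reply refers to (the arguments are illustrative, not copied from the other tests):

    # Hedged sketch: with backend="xccl" passed explicitly, the leftover
    # "fake" registration can never be selected, so unregistering it is moot.
    dist.init_process_group(
        backend="xccl",
        world_size=self.world_size,
        rank=self.rank,
        store=store,
    )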

@Chao1Han (Contributor, Author) commented:
Closing because pytorch/pytorch#155320 has been merged.

Chao1Han closed this Jun 16, 2025
Chao1Han deleted the xccl/uts branch June 16, 2025 05:20