[UT]XCCL remains the default backend for XPU #1721
Conversation
Pull Request Overview
Adds a new unit test to confirm that XCCL remains the default distributed backend on XPU even after another backend is registered.
- Introduces `test_xccl_priority` to register a dummy backend and run an all-reduce call without specifying a backend.
- Leverages the existing `requires_xccl` decorator to skip the test if XCCL isn't available.
Comments suppressed due to low confidence (2)
test/xpu/distributed/test_c10d_xccl.py:568
- The test currently only invokes `all_reduce` but doesn't assert that the default backend is actually XCCL. Consider retrieving the process group (e.g., via `dist.distributed_c10d._get_default_group()`) and asserting its type or backend name to ensure the priority behavior is verified.

```python
dist.all_reduce(a)
```
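As a rough sketch of the priority behavior this comment asks to assert (using a toy registry as a stand-in for torch.distributed's backend table, not the real API — `BackendRegistry` and its methods are hypothetical): the built-in default backend for a device should keep first priority even after additional backends are registered.

```python
# Hypothetical model, not the torch.distributed API: illustrates that a
# device's default backend stays first even after later registrations.
class BackendRegistry:
    def __init__(self):
        # Built-in defaults are registered first, keyed by device type.
        self._backends = {"xpu": ["xccl"]}

    def register_backend(self, name, devices):
        # Later registrations are appended, never promoted to default.
        for dev in devices:
            self._backends.setdefault(dev, []).append(name)

    def default_backend(self, device):
        return self._backends[device][0]

reg = BackendRegistry()
reg.register_backend("fake", devices=["xpu"])
print(reg.default_backend("xpu"))  # still "xccl", not "fake"
```

In the real test, an equivalent assertion could compare the default group's backend name against `"xccl"` after `init_process_group` runs without an explicit `backend` argument.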
test/xpu/distributed/test_c10d_xccl.py:555
- [nitpick] The test name `test_xccl_priority` is a bit generic. Consider renaming to `test_default_backend_is_xccl_when_fake_registered` for clarity on what scenario is covered.
```python
def test_xccl_priority(self):
    dist.Backend.register_backend(
        "fake",
        lambda store, rank, size, timeout: dist.ProcessGroup(rank, size),
        devices=["xpu"],
    )
    store = dist.FileStore(self.file_name, self.world_size)
    dist.init_process_group(
        world_size=self.world_size,
        rank=self.rank,
        store=store,
    )
    a = torch.randn(2, device="xpu")
    dist.all_reduce(a)
```
After registering the fake backend, consider unregistering it in a `finally` block or teardown step to avoid side effects on other tests.

Suggested change:

```python
try:
    dist.Backend.register_backend(
        "fake",
        lambda store, rank, size, timeout: dist.ProcessGroup(rank, size),
        devices=["xpu"],
    )
    store = dist.FileStore(self.file_name, self.world_size)
    dist.init_process_group(
        world_size=self.world_size,
        rank=self.rank,
        store=store,
    )
    a = torch.randn(2, device="xpu")
    dist.all_reduce(a)
finally:
    dist.Backend.unregister_backend("fake")
```
Other test cases explicitly initialize with backend `xccl`, so it is safe not to unregister the fake backend here.

Closing since pytorch/pytorch#155320 has been merged.
This test is designed to verify that XCCL remains the default backend for XPU, even when other groups are registered as optional backends for XPU.