Intro async flag and use current stream avoid stream sync #1546

Chao1Han · 2025-04-07T06:44:38Z

Refer to pytorch/pytorch#147820 and pytorch/pytorch#150398.
To launch kernels on the current stream and reduce the CPU overhead introduced by recordStream, an async option is introduced.

For example, in an allreduce operation between two ranks:

rank0 corresponds to device0. It creates the communicator on the device's current stream, stream0, and keeps stream0 associated with that communicator.

When async = true:

Both rank0 and rank1 perform the collective using stream0, which is associated with the communicator.
To keep stream0 from reading tensors that are not ready yet (e.g., from rank1), stream0 must synchronize with the current stream.
After the collective completes, to prevent premature release of the input tensors, recordStream must be used for stream tracking, or the tensors need to be temporarily stored (e.g., in reduce_scatter or all2all).
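The "temporarily stored" alternative to recordStream can be pictured with a small, self-contained model. This is only a sketch: `Buffer` and `Work` are hypothetical stand-ins, not the types used in ProcessGroupXCCL, and the real code would release the references once the collective has finished on the communicator stream.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a tensor's storage.
using Buffer = std::vector<float>;

// Hypothetical work handle that keeps input buffers alive while the
// collective is still in flight on the communicator stream.
class Work {
 public:
  // Hold an extra reference so the buffer cannot be freed early.
  void stash(std::shared_ptr<Buffer> input) {
    stashed_.push_back(std::move(input));
  }

  // Called once the collective has finished; drops the extra references.
  void complete() { stashed_.clear(); }

  std::size_t stashedCount() const { return stashed_.size(); }

 private:
  std::vector<std::shared_ptr<Buffer>> stashed_;
};
```

With this model, a caller that drops its own reference right after launching the collective does not free the buffer; the memory is released only when `complete()` runs.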

When async = false:

Both rank0 and rank1 use their respective current streams for collectives (i.e., rank0 uses stream0, rank1 uses stream1).
In this case, the collective op handles synchronization implicitly.

Previously, we defaulted to async = true. Now, the async option is explicitly introduced and set to false by default, leveraging the current stream to avoid the overhead of stream synchronization.
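The two modes described above can be summarized as a small decision function. This is a minimal sketch under assumptions: `Stream`, `LaunchPlan`, and `planLaunch` are hypothetical names introduced for illustration and do not appear in the PR.

```cpp
#include <cassert>

// Hypothetical stand-in for a device stream; only the id matters here.
struct Stream {
  int id;
};

// Hypothetical description of how a collective would be launched.
struct LaunchPlan {
  Stream stream;         // stream the collective is enqueued on
  bool syncWithCurrent;  // whether that stream must first wait on the current stream
};

// async == true : use the communicator's own stream and order it after the
//                 current stream.
// async == false: enqueue directly on the current stream, so no extra
//                 stream synchronization is needed.
inline LaunchPlan planLaunch(bool asyncOp, Stream commStream, Stream currentStream) {
  if (asyncOp) {
    return LaunchPlan{commStream, true};
  }
  return LaunchPlan{currentStream, false};
}
```

For example, `planLaunch(false, Stream{0}, Stream{7})` selects stream 7 and requests no synchronization, which is the new default behavior described above.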

Chao1Han · 2025-04-09T03:27:12Z

zhangxiaoli73 · 2025-04-10T06:05:58Z

src/xccl/ProcessGroupXCCL.cpp

+}
+
+bool ProcessGroupXCCL::WorkXCCL::wait(std::chrono::milliseconds timeout) {
+  synchronize();


Do we still need sync if compute stream is used for communication?

Not needed. Using the current stream means async=false. Line 632 will return null rather than a work object, and the frontend will check whether it needs to call wait():
https://github.com/pytorch/pytorch/blob/4273e5d15cfcb282b2795684874ea439d8620999/torch/distributed/distributed_c10d.py#L2882-L2887
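The contract described in this reply (null handle for the synchronous path, with the caller deciding whether to wait) can be modeled roughly as follows. This is a sketch only: the linked frontend code is Python, and `Work`, `backendCollective`, and `frontendCollective` here are hypothetical names, not the PyTorch API.

```cpp
#include <cassert>
#include <memory>

// Hypothetical work handle that counts how often wait() was called.
struct Work {
  int waits = 0;
  void wait() { ++waits; }
};

// Hypothetical backend collective: it returns a handle only when asyncOp is
// true; on the current-stream path it returns null.
inline std::shared_ptr<Work> backendCollective(bool asyncOp) {
  if (asyncOp) {
    return std::make_shared<Work>();
  }
  return nullptr;
}

// Hypothetical frontend wrapper: hands the work back for async calls and
// otherwise waits only if the backend actually returned a handle.
inline std::shared_ptr<Work> frontendCollective(bool asyncOp) {
  std::shared_ptr<Work> work = backendCollective(asyncOp);
  if (asyncOp) {
    return work;
  }
  if (work) {
    work->wait();
  }
  return nullptr;
}
```

In this model the synchronous call never reaches `wait()`, because the backend has already ordered the collective on the current stream.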

zhangxiaoli73 · 2025-04-10T06:11:43Z

src/xccl/ProcessGroupXCCL.cpp

+
+  // asyncOp=false will always use current stream; getStrem will return current
+  // stream
+  c10::Stream stream = asyncOp


In some special cases such as FSDP2, the same communicator can have a different current stream on each device. How do we make sure the correct compute stream is returned for communication?

zhangxiaoli73 · 2025-04-10T08:44:53Z

src/xccl/ProcessGroupXCCL.cpp

+    cclstream =
+        std::make_unique<ccl::stream>(xcclStreamsMap_.at(StreamKey).second);
+  } catch (...) {
+    LOG(WARNING) << "Current stream id changed, create new ccl stream";


Why is this a warning? I think info would be more suitable. A warning usually means something is unsafe and may cause unpredictable results.

zhangxiaoli73 · 2025-04-10T08:47:13Z

src/xccl/ProcessGroupXCCL.cpp

-  auto stream = xcclStreamsMap_.at(key).first;
-  auto cclstream = xcclStreamsMap_.at(key).second;
+  auto StreamKey = asyncOp ? key
+                           : key + "_" +


Could you put this code in 63823e4#diff-29271b6f1608f7ad940c9cd242ce24dcc68bba932348e39dc0b524604cc78c6aR568? Then you don't need to construct StreamKey twice.

zhangxiaoli73 · 2025-04-11T06:12:38Z

src/xccl/ProcessGroupXCCL.cpp

+  auto StreamKey = asyncOp ? key
+                           : key + "_" +
+          std::to_string(at::xpu::getCurrentXPUStream(device.index()).id());
+  auto stream = asyncOp ? xcclStreamsMap_.at(StreamKey).first


Suggested change

-  auto stream = asyncOp ? xcclStreamsMap_.at(StreamKey).first
+  if (asyncOp) {
+    stream = xcclStreamsMap_.at(key).first;
+    cclstream = xcclStreamsMap_.at(key).second;
+    syncStream(device, xcclEventsMap_[key], stream);
+  } else {
+    current_stream = at::xpu::getCurrentXPUStream(device.index());
+    streamkey = key + "_" + std::to_string(current_stream.id());
+    if (xcclStreamsMap_.find(streamkey) != xcclStreamsMap_.end()) {
+      stream = xcclStreamsMap_.at(streamkey).first;
+      cclstream = xcclStreamsMap_.at(streamkey).second;
+    } else {
+      // update xcclStreamsMap_ with current stream key
+      cclstream = std::make_unique<ccl::stream>(ccl::create_stream(current_stream.queue()));
+      std::lock_guard<std::mutex> lock(mutex_);
+      xcclStreamsMap_.emplace(
+          streamkey, std::make_pair(at::xpu::XPUStream(current_stream), *cclstream));
+    }
+  }

change done
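The lookup-or-create pattern discussed in this thread, including building the stream key only once, can be sketched in isolation. This is a simplified model under assumptions: `StreamPair` and `StreamCache` are hypothetical stand-ins for the (XPU stream, ccl stream) pair and for xcclStreamsMap_, and no real XPU or oneCCL calls are made.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the (XPU stream, ccl stream) pair kept per key.
struct StreamPair {
  int streamId;
};

class StreamCache {
 public:
  // Build the key once: the device key alone for asyncOp, otherwise the
  // device key suffixed with the current stream id.
  static std::string makeKey(const std::string& deviceKey, bool asyncOp,
                             int currentStreamId) {
    return asyncOp ? deviceKey
                   : deviceKey + "_" + std::to_string(currentStreamId);
  }

  // Return the cached entry for the key, creating it on first use.
  StreamPair& getOrCreate(const std::string& key, int currentStreamId) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = map_.find(key);
    if (it == map_.end()) {
      it = map_.emplace(key, StreamPair{currentStreamId}).first;
    }
    return it->second;
  }

  std::size_t size() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return map_.size();
  }

 private:
  mutable std::mutex mutex_;
  std::unordered_map<std::string, StreamPair> map_;
};
```

Keying by stream id means that a caller whose current stream changes between collectives (the FSDP2 case raised earlier) gets a separate entry instead of reusing a stream that is no longer current. In this sketch the lookup and the insertion share one lock, which is a design choice of the model rather than something taken from the PR.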

Chao1Han force-pushed the xccl/record_stream branch from 19216e1 to 5fe91cb Compare April 7, 2025 07:25

Chao1Han added 5 commits April 7, 2025 20:48

align no record stream

6a1dc8a

update

b67eae4

update

5fe91cb

update

f1a90b8

not create new ccl stream

d3022e5

Chao1Han changed the title ~~[wip] Xccl/record stream~~ Intro async flag and use current stream avoid stream sync Apr 9, 2025

Chao1Han added 3 commits April 9, 2025 11:32

Merge branch 'main' into xccl/record_stream

a1b08dd

lint

f4574b1

move stream to getxcclcomm

c246ebd

zhangxiaoli73 reviewed Apr 10, 2025

Chao1Han added 2 commits April 11, 2025 00:32

add stream id check avid current activate stream change

63823e4

Merge branch 'main' into xccl/record_stream

115efaa

zhangxiaoli73 reviewed Apr 11, 2025

Chao1Han added 6 commits April 11, 2025 16:18

info

e0e5f64

create new stream in collective

b763850

update

0e61fdf

Merge branch 'main' into xccl/record_stream

f82b4af

Merge branch 'main' into xccl/record_stream

6714683

Merge branch 'main' into xccl/record_stream

6abc827
