[aDAG] Overlap computation and communication #47586

Draft: ruisearch42 wants to merge 3 commits into master from overlap_comm

Conversation

@ruisearch42 (Contributor) commented Sep 10, 2024

Why are these changes needed?

This PR supports overlapping computation and communication for GPU tasks, as described in https://docs.google.com/document/d/1AkAqrMPadk1rMyjKE4VN4bq058z36fgBcx0i4dHIW20/edit#heading=h.8jw8z0hmgva0

The scope is send/recv but does not include collectives.
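For background, here is a minimal, illustrative sketch of the overlap pattern (the stream and function names are made up, and a plain device copy stands in for the NCCL recv; this is not this PR's implementation). The receive is enqueued on a dedicated stream, a CUDA event records its completion, and the compute stream waits on that event so the CPU never blocks:

import cupy as cp

recv_stream = cp.cuda.Stream(non_blocking=True)
exec_stream = cp.cuda.Stream(non_blocking=True)

def recv_async(dst: cp.ndarray, src: cp.ndarray) -> cp.cuda.Event:
    """Enqueue the 'recv' (here just a device copy) on recv_stream and mark it with an event."""
    event = cp.cuda.Event()
    with recv_stream:
        cp.copyto(dst, src)        # stands in for the NCCL recv
        event.record(recv_stream)
    return event

def compute_after(dst: cp.ndarray, event: cp.cuda.Event) -> cp.ndarray:
    """Run the compute on exec_stream once the recv has finished on the GPU."""
    exec_stream.wait_event(event)  # GPU-side ordering; the CPU thread is not blocked
    with exec_stream:
        return dst * 2             # stands in for the user's GPU task

src = cp.arange(8, dtype=cp.float32)
dst = cp.empty_like(src)
out = compute_after(dst, recv_async(dst, src))
exec_stream.synchronize()          # only needed here to read the result
print(out)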

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@ruisearch42 force-pushed the overlap_comm branch 2 times, most recently from 9e13938 to ccb561c on September 17, 2024 22:36
@ruisearch42 changed the title WIP → [aDAG] Overlap computation and communication on Sep 17, 2024
Signed-off-by: Rui Qiao <[email protected]>
Signed-off-by: Rui Qiao <[email protected]>
@@ -172,7 +189,7 @@ def recv(
dtype: "torch.dtype",
peer_rank: int,
allocator=Optional[TorchTensorAllocator],
) -> "torch.Tensor":
) -> Union["torch.Tensor", Tuple["torch.Tensor", "cp.cuda.Event"]]:
Contributor Author (@ruisearch42):

This API needs to be refined:

  1. Consider introducing a new recv_gpu_async() API, but this complicates client-side code, which needs to decide which method to call;
  2. We should not have cuda.Event in the API; we may need our own wrapper API for the event.

Contributor:

Can we instead always return Tuple["torch.Tensor", Optional["cp.cuda.Event"]]? I think that will be a cleaner interface. You could also wrap this in our own dataclass like MaybeAsyncTensor.
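A sketch of what that wrapper could look like (the MaybeAsyncTensor name comes from the comment above; the exact fields and the is_ready helper are assumptions, not the PR's code):

from dataclasses import dataclass
from typing import Optional, TYPE_CHECKING

if TYPE_CHECKING:
    import cupy as cp
    import torch

@dataclass
class MaybeAsyncTensor:
    tensor: "torch.Tensor"
    # If set, the recv was asynchronous and the consumer must wait on this
    # event (e.g., via stream.wait_event) before reading `tensor`.
    event: Optional["cp.cuda.Event"] = None

    @property
    def is_ready(self) -> bool:
        return self.event is None or self.event.done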

Contributor:

What about generalizing all send/recv to Future? And if it is blocking, the returned future is already ready. (I think code-complexity-wise it is the same as returning None, because you need to check if the second output is None anyway.)

fut = recv()
val = fut.wait()
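A rough sketch of how such a future could look (GPUFuture is a hypothetical name, not an existing Ray API; a blocking recv would return an already-resolved future):

from typing import Optional
import cupy as cp
import torch

class GPUFuture:
    def __init__(self, value: "torch.Tensor", event: Optional[cp.cuda.Event] = None):
        self._value = value
        self._event = event  # None means the value is already ready

    def wait(self, stream: Optional[cp.cuda.Stream] = None) -> "torch.Tensor":
        if self._event is not None:
            if stream is not None:
                stream.wait_event(self._event)  # device-side wait only
            else:
                self._event.synchronize()       # host-side wait
        return self._value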

Contributor:

Yes, exactly!

@stephanie-wang stephanie-wang self-assigned this Sep 19, 2024
@stephanie-wang (Contributor) left a comment:

Main comment is to try to reuse the existing codepaths for executing tasks and reading/writing local args. I think the code will be much more robust this way, plus we need to do it anyway to support enabling overlapping per task or per object.

Seems possible to do if we wrap all inputs/outputs with a wrapper class like this; we may also need to update the channel reading/writing:

from dataclasses import dataclass
from typing import Any, Optional
import cupy as cp

@dataclass
class Value:
    value: Any
    # If set, then the reader needs to synchronize on this event before reading value.
    cuda_event: Optional[cp.cuda.Event]

Also should think about how we can unit-test this. Ideally we should try to write a mocked test, maybe something like this one.

@@ -173,6 +174,106 @@ def do_profile_tasks(
raise


@DeveloperAPI
def do_stream_tasks(
Contributor:

Can we combine this logic with do_exec_tasks?

I think the more codepaths we have here, the more brittle the code becomes.

Contributor Author (@ruisearch42):

Yeah the first version of the code does that.

The downside of that is if/else branches in each of the _read()/_compute()/_write()/_exec_operation() methods, which is not quite clean. I get your point that adding a separate code path is brittle, and I agree in principle. Yet I feel reusing do_exec_tasks is arguably brittle as well, as people need to reason about two different kinds of "execution loops" while maintaining a single method. I think the do_stream_tasks code path will evolve more after this PR (e.g., supporting more overlapping, supporting operations involving shared memory), and there are likely interface changes that will make it more incompatible with do_exec_tasks. My original thinking was that these two code paths would eventually converge, but it is probably better to keep them separate in the beginning.

The other benefit of separate execution loops is that we avoid introducing performance regressions for cases where overlapping is disabled.

Let me know what you think.

Contributor:

I get your point but I think in this case we do not want these paths to evolve separately too much, i.e. in one of the following PRs we should support more controls like disabling overlapping by task or specifying the stream to use. For these changes, it will be best to be able to use the same execution loop.

Contributor Author (@ruisearch42):

OK, I think I can change to that.

tasks: List["ExecutableTask"],
schedule: List[_DAGNodeOperation],
) -> None:
"""Similar to `do_stream_tasks`, but with torch profiling enabled.
Contributor:

Suggest starting/stopping the profiling context in a Ray task before/after sending the tasks to start the execution loop. That way we can reuse the current do_exec_tasks codepath.

# if the operation is COMPUTE, the value is the result from the READ with
# the same exec_task_idx; if the operation is WRITE, the value is the result
# from the COMPUTE with the same exec_task_idx.
self._stream_buffer: Dict[_DAGNodeOperation, Any] = {}
Contributor:

Can we reuse self._intermediate_buffer instead?

Contributor Author (@ruisearch42):

I originally reused it, but as a Dict it has the additional overhead of hashing/lookup etc. (although arguably the overhead is small), so I kept the original path entirely intact.

This also ties into the decision of whether we reuse do_exec_task or add the new do_stream_task.

Contributor:

Hmm my preference is that we reuse do_exec_task and try to do the Future kind of API that @rkooo567 suggested. I think with these changes it should be possible to introduce the stream execution in a clean way.

import cupy as cp

exec_event = cp.cuda.Event()
with exec_stream:
Contributor:

It would be best if we can avoid implicitly setting the execution stream (user may have code that manages streams themselves).

Contributor Author (@ruisearch42):

I think the current behavior is that if the user uses another stream in their UDF, that will take precedence. Is that a behavior we don't want? Can you elaborate a bit on the problem and the stream-management contract with the user that you have in mind?

Contributor:

Hmm in general I'm wary of implicitly overwriting user defaults but it could be okay in this case since as you said, the user stream will take precedence. Once we have an API to set the execution stream, this also becomes less of an issue and more of a debate of what the better default behavior is (just use the default stream or create one).

One tricky thing is that the CUDA default stream also has different synchronization semantics depending on compilation flags...: https://docs.nvidia.com/cuda/cuda-runtime-api/stream-sync-behavior.html

Contributor:

I want to avoid using a new stream for execution. It is confusing behavior for users, and if they do something in the main thread with the default stream, it can cause issues (though it is not common).

Contributor Author (@ruisearch42):

Is Stephanie's original suggestion also to use the passed-in stream as the execution stream rather than creating a new one, like Sang's comment here: #47586 (comment)? @stephanie-wang

If so, I can change to that.

Contributor:

I think we should not use any stream for execution (so it uses the user's default). Only when a stream is explicitly passed should we allow overwriting it (in reality, users may also just do it themselves if required). Let's minimize magic behaviors if possible.
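A small sketch of that behavior (illustrative only; run_compute and the exec_stream parameter are made-up names): only enter a stream context when one was explicitly provided, otherwise leave the user's default stream alone.

import contextlib
from typing import Any, Callable, Optional
import cupy as cp

def run_compute(fn: Callable[..., Any], *args: Any,
                exec_stream: Optional[cp.cuda.Stream] = None) -> Any:
    # No stream passed means no magic: the UDF runs on whatever stream the
    # user's own code (or the default stream) dictates.
    ctx = exec_stream if exec_stream is not None else contextlib.nullcontext()
    with ctx:
        return fn(*args)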

@ruisearch42 (Contributor Author) commented on Sep 20, 2024:

@rkooo567 I will have a quick chat with you tomorrow to make sure I understand your recommendation.

python/ray/dag/context.py: outdated review comment (resolved)

# not supported (yet). For example, currently if a channel requires
# both NCCL and shared memory transport, overlap optimization cannot
# be applied.
if out_of_order_limit == 0:
Contributor:

Can you add some comments to this code to explain the logic?

Also, can we improve the name? _optimize_execution_schedule does not make clear what it is optimizing.

Contributor:

+1, or maybe just add the high-level approach in the docstring. Btw, the logic is: we move READs ahead, as many as the out-of-order limit allows, and then place the compute/send operations after them.
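A simplified sketch of the reordering being described (assumed semantics: up to out_of_order_limit READs are hoisted to the front so their recvs can overlap with the COMPUTE/WRITE ops that follow; the real _optimize_execution_schedule may differ):

from typing import List, Tuple

# Each operation is (exec_task_idx, op_type), with op_type in {"READ", "COMPUTE", "WRITE"}.
Op = Tuple[int, str]

def optimize_schedule(schedule: List[Op], out_of_order_limit: int) -> List[Op]:
    if out_of_order_limit == 0:
        return list(schedule)  # overlap disabled: keep the original order
    hoisted: List[Op] = []
    rest: List[Op] = []
    for op in schedule:
        if op[1] == "READ" and len(hoisted) < out_of_order_limit:
            hoisted.append(op)  # issue the (async) READ early
        else:
            rest.append(op)
    # Hoisted READs run first; the remaining ops keep their original relative
    # order and overlap with the in-flight recvs.
    return hoisted + rest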

if isinstance(entry, tuple):
channel_result, event = entry
if event:
event.synchronize()
Contributor:

Note that this will also block the CPU thread, not just the CUDA stream that needs to read the data. You probably want to just have the CUDA stream wait on the event.

@ruisearch42 (Contributor Author) commented on Sep 19, 2024:

I think that's what we need? We don't know whether the user UDF runs on GPU or CPU, so if there is a data dependency, we need to sync everything?

Contributor:

If there is a GPU -> CPU dependency, then there's going to be some CUDA memcpy to move the data from GPU -> CPU. So we don't need to block the entire CPU task on the GPU task, we just need to make sure the memcpy executes after the GPU task (which it should as long as it's executed on the correct stream).

@ruisearch42 (Contributor Author) commented on Sep 20, 2024:

Just to clarify: this event is a recv event on a recv stream, and here the compute operation runs on the CPU or on the execution stream of the GPU. Suppose it runs on the CPU: are you saying that as long as the execution operation happens after the "async" recv operation is launched (which it should, because they have the same task_index), then even without the manual CPU sync the memcpy would happen, and would happen after the recv finishes?

Contributor:

Ah not quite. What I was trying to say is that even if the task is executing on CPU, if it depends on the GPU data, then it needs to run some memcpy to actually read the GPU data. The memcpy will execute on exec_stream, so we just need to make sure to sync exec_stream with the recv stream (using stream.wait_event). We don't need to sync everything including the CPU with the recv stream (using event.synchronize).

Contributor:

(and if the CPU task does not actually read the GPU data, then it doesn't matter if we sync at all)

@ruisearch42 (Contributor Author) commented on Sep 20, 2024:

Makes sense, thanks for the clarification!

Actually, cupy.cuda.Event.synchronize() does not block the CPU thread when the event is created with block=False (the default param): https://docs.cupy.dev/en/stable/reference/generated/cupy.cuda.Event.html#cupy.cuda.Event.synchronize

But it mentions "Synchronizes all device work to the event", which probably means syncing with all streams, and that would sync with the send_stream. So I will change to use https://docs.cupy.dev/en/stable/reference/generated/cupy.cuda.Stream.html#cupy.cuda.Stream.wait_event
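For reference, a tiny illustration of the two options being discussed (stream/event names are assumptions): per the earlier comments, event.synchronize() waits on the host side, while stream.wait_event() only orders GPU work.

import cupy as cp

recv_stream = cp.cuda.Stream(non_blocking=True)
exec_stream = cp.cuda.Stream(non_blocking=True)

recv_event = cp.cuda.Event()
with recv_stream:
    # ... the NCCL recv would be enqueued here ...
    recv_event.record(recv_stream)

# Option A: waits on the host until the recv has completed on the GPU.
recv_event.synchronize()

# Option B: only orders GPU work; kernels/memcpys later enqueued on
# exec_stream will not start before recv_event, and the CPU continues.
exec_stream.wait_event(recv_event)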

@rkooo567 (Contributor) left a comment:

I really think we need to unify the execution loop. The reason is that the test space becomes much larger otherwise (we also need to make sure the existing cases work correctly when overlap is used).

What about always assuming send/recv is non-blocking and returning a Future? If send/recv is blocking, the future is returned after the wait has finished; otherwise, it just returns the future. This is the same as how the gloo APIs work.

python/ray/dag/context.py: outdated review comment (resolved)

# Feature flag to turn on torch profiling.
RAY_ADAG_ENABLE_TORCH_PROFILING = (
os.environ.get("RAY_ADAG_ENABLE_TORCH_PROFILING", "0") == "1"
Contributor:

I think we will want to allow dynamic profiling (like profiling N iterations, with enable/disable at runtime). I think this one is okay for now. Can you create a corresponding issue?

Contributor Author (@ruisearch42):

Created #47745

"""
self.exec_task_idx = exec_task_idx
self.type = operation_type
self.method_name = method_name

def next_operation(self):
Contributor:

Can you add a docstring to clarify the definition?

I am asking because "next operation" can have 2 meanings: 1. the literal next op in the schedule; 2. the next operation for the same bind index.


def __repr__(self):
return f"(Task idx: {self.exec_task_idx}, Type: {self.type})"
return f"([{self.exec_task_idx}] {self.method_name} {self.type})"
Contributor:

just return __str__()?

return actor_to_execution_schedule


def _optimize_execution_schedule(
Contributor:

Can you add unit tests like @kevin85421 did before? (Not e2e, but unit-level testing.)

Contributor Author (@ruisearch42):

yes, will do.

peer_rank,
self._recv_stream.ptr,
)
event.record(self._recv_stream)
Contributor:

Actually, I can imagine we could also do recv_stream.synchronize().

What are the pros and cons of using an event vs. stream.synchronize()?

Contributor Author (@ruisearch42):

hmm, at the consumer side (compute operation), we'd like to sync on a particular event on the recv_stream, rather than the whole stream (there might be other operations launched to the same recv_stream).
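An illustrative sketch of that point (names are assumptions): recording an event marks only the recv in question, whereas recv_stream.synchronize() waits for everything already enqueued on that stream.

import cupy as cp

recv_stream = cp.cuda.Stream(non_blocking=True)
exec_stream = cp.cuda.Stream(non_blocking=True)

event_i = cp.cuda.Event()
with recv_stream:
    # recv for task i would be enqueued here ...
    event_i.record(recv_stream)  # marks completion of *this* recv only
    # recv for task i+1 may already be queued behind it ...

exec_stream.wait_event(event_i)  # waits only for task i's recv
recv_stream.synchronize()        # by contrast, waits for all enqueued recvs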

)
if done:
break
profiler.export_chrome_trace(f"adag-proc-{pid}.json")
Contributor:

Log before doing this so that users can see where the file is generated.

# exception in a RayTaskError here because it has already been wrapped
# by the previous task.
self.set_stream_buffer(exc)
return False
Contributor:

How does it propagate back to the caller? Looks like when an execution loop sees a False return value, it just finishes the loop.

@@ -269,6 +271,52 @@ def test_torch_tensor_nccl(ray_start_regular):
# ray.get(receiver.ping.remote())


@pytest.mark.parametrize("ray_start_regular", [{"num_cpus": 4}], indirect=True)
def test_torch_tensor_nccl_overlap(ray_start_regular):
Contributor:

Can you add test cases where an exception is raised from compute/recv/send (and check that it is raised properly)?

"""
output_val, exec_event = self.reset_stream_buffer(op)
exit = False
exec_event.synchronize()
Contributor:

Maybe we should not block the CPU here.

Contributor:

This way, we can overlap the shm write and compute (if it runs in a kernel).

Contributor Author (@ruisearch42):

I think we need to sync to make sure execution finishes, otherwise the value may be incorrect?

Contributor:

I think you can synchronize on the compute stream, and then torch and CUDA should handle the GPU synchronization.

Contributor Author (@ruisearch42):

It's better to sync on the event, since there may be additional operations on the compute stream? Wouldn't waiting on the whole compute stream require an unnecessarily long wait?

Could you elaborate more on "then torch and cuda should handle the gpu synchronization"? I'm not sure what it means.

Contributor:

Hmm, actually my bad. I think if the event is created with blocking=False (which is the default: https://docs.cupy.dev/en/stable/reference/generated/cupy.cuda.Event.html), this only blocks the relevant device, not the CPU. So I think the current code is fine.

Signed-off-by: Rui Qiao <[email protected]>