use SYCL queue.memcpy instead of copy kernel #2095
base: main
Conversation
Pull Request Overview
This PR updates the XPU copy implementation to use SYCL queue.memcpy instead of copy kernels for better performance. The change addresses performance gaps between device-to-device copy through the copy engine and the kernel implementation in Level Zero 2.
- Simplifies the `memcpyAsync` function by removing the device comparison branch and always using `q.copy()`
- Removes outdated performance-related comments that are no longer relevant
- Maintains the same logical flow while leveraging improved SYCL queue.memcpy performance
```diff
-  if (dst_device != src_device) {
-    TORCH_INTERNAL_ASSERT(p2p_enabled == true);
-    auto dst = (char*)iter.data_ptr(0);
-    auto src = (char*)iter.data_ptr(1);
-    size_t size = iter.numel() * iter.element_size(0);
-    auto q = copy_stream.queue();
-    q.copy(src, dst, size);
-  }
+  auto dst = (char*)iter.data_ptr(0);
+  auto src = (char*)iter.data_ptr(1);
+  size_t size = iter.numel() * iter.element_size(0);
+  auto q = copy_stream.queue();
+  q.copy(src, dst, size);
```
The function now always calls `q.copy()` regardless of whether the devices are the same or different. For same-device copies, this may not be the intended behavior, since the original code used `copy_kernel(iter)` for same-device operations. Consider verifying that `q.copy()` handles same-device copies correctly, or add a comment explaining why this unified approach is safe.
Suggested change:
```diff
-  if (dst_device != src_device) {
-    TORCH_INTERNAL_ASSERT(p2p_enabled == true);
-    auto dst = (char*)iter.data_ptr(0);
-    auto src = (char*)iter.data_ptr(1);
-    size_t size = iter.numel() * iter.element_size(0);
-    auto q = copy_stream.queue();
-    q.copy(src, dst, size);
-  }
-  auto dst = (char*)iter.data_ptr(0);
-  auto src = (char*)iter.data_ptr(1);
-  size_t size = iter.numel() * iter.element_size(0);
-  auto q = copy_stream.queue();
-  q.copy(src, dst, size);
+  auto dst = (char*)iter.data_ptr(0);
+  auto src = (char*)iter.data_ptr(1);
+  size_t size = iter.numel() * iter.element_size(0);
+  auto q = copy_stream.queue();
+  if (dst_device == src_device) {
+    // Use the optimized kernel for same-device copies
+    copy_kernel(iter);
+  } else {
+    TORCH_INTERNAL_ASSERT(p2p_enabled == true);
+    q.copy(src, dst, size);
+  }
```
This depends on Jira GSD-3944. It's expected that D2D copy through the copy engine and through a kernel should have the same performance in Level Zero 2, but I still see gaps.
Note
`memcpyAsync` is only called when `memcpy_eligible`, meaning there is no cast and all tensors are contiguous, so it is a plain copy with no other operations. The previous implementation in `copy_kernel` dispatched to two loops-kernel implementations, `gpu_kernel(iter, CopyScalarFunc<scalar_t>());` and `float8_copy_kernel_xpu`; since we won't do a cast, both are equivalent to `queue.memcpy`.