jianyizh
Contributor

Depends on Jira GSD-3944. In Level Zero 2, a d2d copy through the copy engine and one through a kernel are expected to perform the same, but I still see gaps.

Note that memcpyAsync is only called when memcpy_eligible is true, which means no cast and fully contiguous tensors, so it is a plain copy with no other operations. The previous implementation in copy_kernel called two loop-kernel-based implementations: gpu_kernel(iter, CopyScalarFunc<scalar_t>()); and float8_copy_kernel_xpu. Since no cast is performed, both are equivalent to queue.memcpy.
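
To make the equivalence claim concrete, here is a minimal host-side sketch in plain C++. It is not the XPU code: `elementwise_copy`, `byte_copy`, and `paths_agree` are hypothetical helpers, the `CopyScalarFunc` below is a simplified stand-in assumed to be an identity functor, and `std::memcpy` stands in for `queue.memcpy`. Under those assumptions, it shows that an identity copy over a contiguous buffer with no dtype change yields the same bytes as one raw copy of numel * element_size bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Simplified stand-in (assumption): an identity functor, i.e. no cast.
template <typename scalar_t>
struct CopyScalarFunc {
  scalar_t operator()(scalar_t v) const { return v; }
};

// Hypothetical host analogue of the elementwise kernel path:
// applies the functor to every element of a contiguous buffer.
template <typename scalar_t, typename Func>
void elementwise_copy(scalar_t* dst, const scalar_t* src,
                      std::size_t numel, Func f) {
  for (std::size_t i = 0; i < numel; ++i) {
    dst[i] = f(src[i]);
  }
}

// Raw byte copy path; std::memcpy stands in for queue.memcpy here.
template <typename scalar_t>
void byte_copy(scalar_t* dst, const scalar_t* src, std::size_t numel) {
  std::memcpy(dst, src, numel * sizeof(scalar_t));
}

// Returns true when both paths produce byte-identical output.
// Expects a non-empty input vector.
template <typename scalar_t>
bool paths_agree(const std::vector<scalar_t>& src) {
  std::vector<scalar_t> a(src.size());
  std::vector<scalar_t> b(src.size());
  elementwise_copy(a.data(), src.data(), src.size(),
                   CopyScalarFunc<scalar_t>{});
  byte_copy(b.data(), src.data(), src.size());
  return std::memcmp(a.data(), b.data(),
                     src.size() * sizeof(scalar_t)) == 0;
}
```

Calling `paths_agree` on a small `std::vector<int>` or `std::vector<float>` should return true. The sketch says nothing about performance, which is what the Jira item tracks.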

@jianyizh jianyizh marked this pull request as ready for review September 22, 2025 14:26
@Copilot Copilot AI review requested due to automatic review settings September 22, 2025 14:26
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR updates the XPU copy implementation to use SYCL queue.memcpy instead of copy kernels for better performance. The change addresses performance gaps between device-to-device copy through copy engine and kernel implementation in Level Zero 2.

  • Simplifies the memcpyAsync function by removing the device comparison branch and always using q.copy()
  • Removes outdated performance-related comments that are no longer relevant
  • Maintains the same logical flow while leveraging improved SYCL queue.memcpy performance
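
The simplified flow summarized in the bullets above can be sketched with a small host-side mock. Everything here is an assumption made for illustration: `MockCopy` and `memcpy_unified` are hypothetical names, `std::memcpy` stands in for `q.copy()`, and a thrown exception stands in for `TORCH_INTERNAL_ASSERT`. The point is only the shape of the control flow: the P2P check applies to cross-device copies, and a single byte copy is issued in every case.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <stdexcept>

// Hypothetical, simplified stand-in for the two operands of a copy.
struct MockCopy {
  int dst_device;
  int src_device;
  char* dst;
  const char* src;
  std::size_t numel;
  std::size_t element_size;
};

// Unified flow: validate P2P only when the devices differ, then always
// perform one raw byte copy. Returns the number of bytes copied.
inline std::size_t memcpy_unified(const MockCopy& c, bool p2p_enabled) {
  if (c.dst_device != c.src_device && !p2p_enabled) {
    // Stand-in for TORCH_INTERNAL_ASSERT(p2p_enabled == true).
    throw std::logic_error("cross-device copy requires P2P access");
  }
  const std::size_t size = c.numel * c.element_size;
  std::memcpy(c.dst, c.src, size);  // stand-in for q.copy(src, dst, size)
  return size;
}
```

With this shape, a same-device copy never consults `p2p_enabled`, while a cross-device copy fails the check when P2P is unavailable.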

Comment on lines +73 to +80
+  if (dst_device != src_device) {
     TORCH_INTERNAL_ASSERT(p2p_enabled == true);
-    auto dst = (char*)iter.data_ptr(0);
-    auto src = (char*)iter.data_ptr(1);
-    size_t size = iter.numel() * iter.element_size(0);
-    auto q = copy_stream.queue();
-    q.copy(src, dst, size);
   }
+  auto dst = (char*)iter.data_ptr(0);
+  auto src = (char*)iter.data_ptr(1);
+  size_t size = iter.numel() * iter.element_size(0);
+  auto q = copy_stream.queue();
+  q.copy(src, dst, size);

Copilot AI Sep 22, 2025

The function now always calls q.copy() regardless of whether devices are the same or different. For same-device copies, this may not be the intended behavior since the original code used copy_kernel(iter) for same-device operations. Consider verifying that q.copy() handles same-device copies correctly or add a comment explaining why this unified approach is safe.

Suggested change
-  if (dst_device != src_device) {
-    TORCH_INTERNAL_ASSERT(p2p_enabled == true);
-  }
-  auto dst = (char*)iter.data_ptr(0);
-  auto src = (char*)iter.data_ptr(1);
-  size_t size = iter.numel() * iter.element_size(0);
-  auto q = copy_stream.queue();
-  q.copy(src, dst, size);
+  auto dst = (char*)iter.data_ptr(0);
+  auto src = (char*)iter.data_ptr(1);
+  size_t size = iter.numel() * iter.element_size(0);
+  auto q = copy_stream.queue();
+  if (dst_device == src_device) {
+    // Use the optimized kernel for same-device copies
+    copy_kernel(iter);
+  } else {
+    TORCH_INTERNAL_ASSERT(p2p_enabled == true);
+    q.copy(src, dst, size);
+  }
