use SYCL queue.memcpy instead of copy kernel #2095
base: main
Conversation
Pull Request Overview
This PR updates the XPU copy implementation to use SYCL queue.memcpy instead of copy kernels for better performance. The change addresses performance gaps between device-to-device copy through the copy engine and the kernel implementation in Level Zero 2.
- Simplifies the `memcpyAsync` function by removing the device comparison branch and always using `q.copy()`
- Removes outdated performance-related comments that are no longer relevant
- Maintains the same logical flow while leveraging improved SYCL queue.memcpy performance
```diff
-  if (dst_device != src_device) {
-    TORCH_INTERNAL_ASSERT(p2p_enabled == true);
-    auto dst = (char*)iter.data_ptr(0);
-    auto src = (char*)iter.data_ptr(1);
-    size_t size = iter.numel() * iter.element_size(0);
-    auto q = copy_stream.queue();
-    q.copy(src, dst, size);
-  }
+  auto dst = (char*)iter.data_ptr(0);
+  auto src = (char*)iter.data_ptr(1);
+  size_t size = iter.numel() * iter.element_size(0);
+  auto q = copy_stream.queue();
+  q.copy(src, dst, size);
```
The function now always calls `q.copy()` regardless of whether the devices are the same or different. For same-device copies, this may not be the intended behavior, since the original code used `copy_kernel(iter)` for same-device operations. Consider verifying that `q.copy()` handles same-device copies correctly, or add a comment explaining why this unified approach is safe.
Suggested change:
```diff
-  if (dst_device != src_device) {
-    TORCH_INTERNAL_ASSERT(p2p_enabled == true);
-    auto dst = (char*)iter.data_ptr(0);
-    auto src = (char*)iter.data_ptr(1);
-    size_t size = iter.numel() * iter.element_size(0);
-    auto q = copy_stream.queue();
-    q.copy(src, dst, size);
-  }
-  auto dst = (char*)iter.data_ptr(0);
-  auto src = (char*)iter.data_ptr(1);
-  size_t size = iter.numel() * iter.element_size(0);
-  auto q = copy_stream.queue();
-  q.copy(src, dst, size);
+  auto dst = (char*)iter.data_ptr(0);
+  auto src = (char*)iter.data_ptr(1);
+  size_t size = iter.numel() * iter.element_size(0);
+  auto q = copy_stream.queue();
+  if (dst_device == src_device) {
+    // Use the optimized kernel for same-device copies
+    copy_kernel(iter);
+  } else {
+    TORCH_INTERNAL_ASSERT(p2p_enabled == true);
+    q.copy(src, dst, size);
+  }
```
This depends on Jira GSD-3944. It's expected that D2D copy through the copy engine and through a kernel should have the same performance in Level Zero 2, but I still see gaps.
Note
`memcpyAsync` is only called when `memcpy_eligible`, meaning there is no cast and all tensors are contiguous, so it is a plain copy with no other operations. The previous implementation in `copy_kernel` dispatched to two loops-kernel implementations, `gpu_kernel(iter, CopyScalarFunc<scalar_t>());` and `float8_copy_kernel_xpu`; since we won't do a cast, both are equivalent to `queue.memcpy`.