[GPU] Allow host buffer access for Xe2+ iGPUs #32912
Conversation
Lyamin-Roman left a comment:
Have there been any performance tests? It's necessary to verify that this change really doesn't cause any performance drops.
@Lyamin-Roman, yes, I've checked performance on a set of models, as well as some synthetic tests, such as a model consisting of a single GEMM with different dimensions. Every test demonstrated the same performance (±1%).
Did you check with the driver team about this change? This is actually different from what we heard from the driver team previously. The memory footprint reduction is unexpected, and I suspect there is a memory footprint issue that is being hidden by this change.
```diff
 if (alloc_type == allocation_type::usm_host || alloc_type == allocation_type::usm_shared) {
-    // usm_device memory does not provide performance benefits on the LNL platform
-    if (get_engine().get_device_info().arch == gpu_arch::xe2 &&
+    // usm_device memory does not provide performance benefits on the integrated Xe2+ platforms
```
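For context, here is a minimal self-contained sketch of the platform gate this diff implies, assuming the patch replaces the LNL-only `arch == gpu_arch::xe2` check with an "Xe2 or newer, integrated" condition. The stub enums and the `can_reuse_host_buffer` helper are illustrative stand-ins for the plugin's real `gpu_arch`/`device_info` types, not the actual patch:

```cpp
#include <cstdint>

// Stub types mirroring the intel_gpu runtime's device descriptors (assumed
// shape and ordering; the real definitions live in the plugin headers).
enum class gpu_arch : uint8_t { unknown, xe, xe_hpg, xe2, xe3 };
enum class device_type { integrated_gpu, discrete_gpu };

struct device_info {
    gpu_arch arch;
    device_type dev_type;
};

// True when the weights' host buffer can be used directly instead of copying
// them into a separate usm_device allocation: any integrated Xe2+ part.
bool can_reuse_host_buffer(const device_info& info) {
    return info.arch >= gpu_arch::xe2 && info.dev_type == device_type::integrated_gpu;
}
```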
On PTL integrated parts we have e2e compression available for device USM allocations.
It means that if data is nicely compressible, you may see compression benefits from using device USM.
Yes, but at the same time the trained weights aren't typically compressible.
Yes, then this would help to reduce the memory footprint and the number of copies.
Commit: a930596
Description of the issue
Integrated GPUs starting from Xe2 can benefit from reusing the host-side buffer for the weights. This avoids allocating a device-side buffer in the same physical memory, giving a significant memory footprint reduction with no runtime penalty. Previously this was enabled only for LNL (#31600), but for AI weights that don't benefit from compression there's no need to limit this functionality to that platform.
Reproduction step and snapshot
Check the "Compile model ram used" metric. For an fp16 Stable Diffusion model of ~600MB size, there is a ~600MB memory reduction on multiple platforms; more details in the ticket.
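One rough way to approximate that metric outside the internal test harness is to watch the process RSS around `compile_model`; a Linux-only sketch, assuming the OpenVINO C++ API, where the model path and the VmRSS-based measurement are illustrative rather than how the ticket's metric is actually collected:

```cpp
#include <openvino/openvino.hpp>
#include <fstream>
#include <iostream>
#include <string>

// Resident set size of the current process in kB, read from /proc/self/status.
static long rss_kb() {
    std::ifstream status("/proc/self/status");
    for (std::string line; std::getline(status, line);) {
        if (line.rfind("VmRSS:", 0) == 0)
            return std::stol(line.substr(6));
    }
    return -1;
}

int main() {
    ov::Core core;
    // Hypothetical path; use an fp16 model such as a ~600MB Stable Diffusion IR.
    auto model = core.read_model("model.xml");

    const long before = rss_kb();
    auto compiled = core.compile_model(model, "GPU");
    const long after = rss_kb();

    std::cout << "RSS growth during compile_model: "
              << (after - before) / 1024 << " MB\n";
}
```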
Checklist
Tickets: