umfMemspaceHighestBandwidthGet does not use HBM on SapphireRapids+HBM #1289
The physical_id field stores the OS index, so we should use the correct function to get the hwloc NUMA node object. fixes: oneapi-src#1289
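For context, a minimal sketch (not the actual UMF patch) of the distinction the fix is about: hwloc offers separate lookups by logical index and by OS index, and a physical/OS node id has to go through the latter. The node id 8 below is just an example value taken from this report.

```c
#include <hwloc.h>
#include <stdio.h>

int main(void) {
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    unsigned os_idx = 8; /* e.g. the physical_id / OS index of an HBM node */

    /* Wrong for OS indexes: interprets the value as hwloc's logical index. */
    hwloc_obj_t by_logical =
        hwloc_get_obj_by_type(topo, HWLOC_OBJ_NUMANODE, os_idx);

    /* Correct: looks the NUMA node up by its OS index. */
    hwloc_obj_t by_os = hwloc_get_numanode_obj_by_os_index(topo, os_idx);

    if (by_logical && by_os)
        printf("logical lookup -> OS index %u, OS lookup -> OS index %u\n",
               by_logical->os_index, by_os->os_index);

    hwloc_topology_destroy(topo);
    return 0;
}
```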
Hi, and thanks for taking care of this. Yes, it does fix the issue, at least in the simple case that I have tested:
Since hwloc's logical index was used here where the OS index should have been used, it seems likely to me that this is not the only such occurrence. Another question regarding a more complex case: does first-touch still work as expected with multiple threads?
Background: these performance results are stable. I have used 96 OpenMP threads, each pinned to a core of the system described above.
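To make the first-touch question concrete, here is a minimal sketch of the usual pattern (an illustration, not the benchmark referenced above): each pinned thread initializes the chunk of the array it will later work on, so the kernel places those pages on that thread's local node.

```c
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const size_t n = 1UL << 27; /* 2^27 doubles = 1 GiB */
    double *a = malloc(n * sizeof *a);
    if (!a) return 1;

    /* First touch: each thread faults in the pages of its own chunk,
     * so the kernel places them on that thread's local NUMA node. */
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++)
        a[i] = 0.0;

    /* Later accesses with the same static schedule hit local memory. */
    double sum = 0.0;
    #pragma omp parallel for schedule(static) reduction(+ : sum)
    for (size_t i = 0; i < n; i++)
        sum += a[i];

    printf("sum = %f (threads: %d)\n", sum, omp_get_max_threads());
    free(a);
    return 0;
}
```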
In UMF we do exactly what you did with numactl: we generate a HighestBandwidth memspace, which is really just a set of NUMA nodes (for each CPU we add the node with the highest bandwidth to the set). We membind every allocation to this set and let the kernel choose where to place each page. Another alternative is to bind memory based on the memory-allocating thread, but we think the first approach is better. If we get a request for the second approach, we can consider adding that option. Regarding performance, could you share the benchmark details so we can test it on our side?
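For illustration only, a minimal sketch of the "membind to a node set and let the kernel place the pages" idea described above, expressed with the raw mbind(2) interface rather than UMF internals; the HBM node numbers 8-15 are taken from this report and are an assumption of the example.

```c
#define _GNU_SOURCE
#include <numaif.h>   /* mbind, MPOL_BIND; link with -lnuma */
#include <sys/mman.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    size_t len = 64UL << 20; /* 64 MiB */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return 1;

    /* Node mask with the HBM nodes 8..15 (as reported for this system). */
    unsigned long nodemask = 0;
    for (int node = 8; node <= 15; node++)
        nodemask |= 1UL << node;

    /* Bind the range to the whole HBM node set; the kernel decides
     * which node of the set each page actually lands on. */
    if (mbind(p, len, MPOL_BIND, &nodemask, 8 * sizeof(nodemask), 0) != 0) {
        perror("mbind");
        return 1;
    }

    memset(p, 0, len); /* fault the pages in under the new policy */
    printf("bound %zu bytes to HBM nodes 8-15\n", len);
    munmap(p, len);
    return 0;
}
```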
Here is the benchmark code (Makefile included), @lplewa. It compares three memory allocation versions:
This is the output:
Regarding binding memory to the allocating thread:
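As a sketch of what per-thread binding can look like in practice (a conceptual illustration using libnuma, not UMF code): each thread allocates from its own local node via numa_alloc_local().

```c
#include <numa.h>   /* link with -lnuma */
#include <omp.h>

int main(void) {
    if (numa_available() < 0) return 1;

    #pragma omp parallel
    {
        size_t len = 8UL << 20; /* 8 MiB per thread */
        /* Allocate on the NUMA node local to the calling (pinned) thread. */
        double *buf = numa_alloc_local(len);
        if (buf) {
            buf[0] = (double)omp_get_thread_num(); /* touch to fault pages in */
            numa_free(buf, len);
        }
    }
    return 0;
}
```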
Discussion is still in progress, so I'll re-open this issue.
umfMemspaceHighestBandwidthGet does not allocate from HBM on SapphireRapids+HBM
Environment Information
HMAT is also enabled:
Please provide a reproduction of the bug:
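A rough sketch of the kind of check involved, with the UMF calls based on the memspace/provider API as I understand it (treat the exact header and function names as assumptions to be verified against the UMF headers): allocate through the highest-bandwidth memspace, touch the memory, and ask the kernel which NUMA node the page landed on via move_pages(2).

```c
#include <umf/memspace.h>         /* header names assumed from the UMF sources */
#include <umf/memory_provider.h>
#include <numaif.h>               /* move_pages; link with -lnuma */
#include <stdio.h>
#include <string.h>

int main(void) {
    /* Memspace built from the highest-bandwidth NUMA nodes. */
    umf_const_memspace_handle_t ms = umfMemspaceHighestBandwidthGet();
    if (!ms) return 1;

    umf_memory_provider_handle_t provider;
    if (umfMemoryProviderCreateFromMemspace(ms, NULL, &provider) != UMF_RESULT_SUCCESS)
        return 1;

    void *ptr = NULL;
    if (umfMemoryProviderAlloc(provider, 1 << 20, 0, &ptr) != UMF_RESULT_SUCCESS)
        return 1;

    memset(ptr, 0, 1 << 20); /* fault the pages in */

    /* Ask the kernel which NUMA node the first page ended up on. */
    int status = -1;
    void *pages[1] = { ptr };
    if (move_pages(0, 1, pages, NULL, &status, 0) == 0)
        printf("allocation landed on NUMA node %d\n", status);

    umfMemoryProviderFree(provider, ptr, 1 << 20);
    umfMemoryProviderDestroy(provider);
    return 0;
}
```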
How often is the bug revealed:
always
Actual behavior:
`umfMemspaceHighestBandwidthGet` allocates memory from NUMA node 1 (DDR).
Expected behavior:
`umfMemspaceHighestBandwidthGet` allocates memory from the (local) HBM NUMA node.
Details
Since this CPU has HBM connected, it is a semantic error if `umfMemspaceHighestBandwidthGet` allocates from DDR instead of HBM. Allocating from NUMA node 1 when running on CPU 0 with `taskset -c 0` is even worse than allocating from NUMA node 0, which is also DDR but closer to CPU 0. The desired behavior would be to allocate from the closest HBM NUMA node, which is NUMA node 8 on this CPU. See the output of `numactl -H` and `lstopo` (the `lstopo` output is truncated; note that hwloc uses a different NUMA node numbering compared to `numactl`):
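As a side note, the same "closest highest-bandwidth node for the current CPU" can be queried programmatically through hwloc's memory-attributes API; a minimal sketch, assuming hwloc >= 2.3 built with HMAT support:

```c
#include <hwloc.h>
#include <stdio.h>

int main(void) {
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* Use the current thread's CPU binding as the initiator. */
    hwloc_bitmap_t cpuset = hwloc_bitmap_alloc();
    hwloc_get_cpubind(topo, cpuset, HWLOC_CPUBIND_THREAD);

    struct hwloc_location initiator;
    initiator.type = HWLOC_LOCATION_TYPE_CPUSET;
    initiator.location.cpuset = cpuset;

    /* Best bandwidth target for this initiator: should be the local HBM node. */
    hwloc_obj_t best = NULL;
    hwloc_uint64_t value = 0;
    if (hwloc_memattr_get_best_target(topo, HWLOC_MEMATTR_ID_BANDWIDTH,
                                      &initiator, 0, &best, &value) == 0 && best)
        printf("best-bandwidth node: OS index %u (reported bandwidth %llu)\n",
               best->os_index, (unsigned long long)value);

    hwloc_bitmap_free(cpuset);
    hwloc_topology_destroy(topo);
    return 0;
}
```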
Additional information about Priority and Help Requested:
Are you willing to submit a pull request with a proposed change? (Perhaps)
Requested priority: (Showstopper if you want to use this library function to allocate memory from HBM on this CPU)
Memkind library: this should report `8,9,10,11,12,13,14,15` instead. The Memkind-based HBW allocations do not work correctly either.
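For completeness, the usual way to request HBM through Memkind is sketched below; as I understand the Memkind semantics, `MEMKIND_HBW` fails when no high-bandwidth node is detected, while `MEMKIND_HBW_PREFERRED` silently falls back to DDR, which may matter when interpreting results on this machine.

```c
#include <memkind.h>   /* link with -lmemkind */
#include <stdio.h>

int main(void) {
    /* Zero means the HBW kind is usable on this system. */
    if (memkind_check_available(MEMKIND_HBW) != 0) {
        fprintf(stderr, "memkind does not detect any high-bandwidth nodes\n");
        return 1;
    }

    /* Allocate from a high-bandwidth (HBM) node, failing if none is usable. */
    double *buf = memkind_malloc(MEMKIND_HBW, 1 << 20);
    if (!buf) {
        fprintf(stderr, "HBW allocation failed\n");
        return 1;
    }
    buf[0] = 42.0; /* touch to fault the page in */

    memkind_free(MEMKIND_HBW, buf);
    return 0;
}
```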