Memory allocation slows down at scale #7333
Stupid question from somebody who doesn't know anything: isn't it possible to either:
PS: Increasing the threshold will just "hide" the problem IMO; less allocation means less overhead, for sure. But some app will do threshold+1-sized allocations in a tight loop for some reason and will hit the pathological path one more time. But maybe it's impossible to do it lazily; as I said, I know nothing, so sorry for the noise!
We're not explicitly hooking the memory allocations when XPMEM is enabled. But one difference, regardless of communication, is that we register the entire virtual address space with XPMEM during initialization. Another option might be to register smaller regions on demand, like we do with GPU IPC.
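For readers less familiar with XPMEM, here is a minimal sketch of the two registration strategies, assuming the standard user-level XPMEM API (xpmem_make / xpmem_get / xpmem_attach). It is illustrative only and is not MPICH's actual code; real code would also handle page alignment, errors, and detach/release.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <xpmem.h>

/* (a) What this issue describes: export the entire virtual address space
 *     once at init time.  Every page the process ever touches becomes
 *     reachable through this single segment. */
xpmem_segid_t export_whole_va(void)
{
    return xpmem_make(0, XPMEM_MAXADDR_SIZE, XPMEM_PERMIT_MODE, (void *)0666);
}

/* (b) The on-demand alternative: export only a specific buffer when it is
 *     first used for communication. */
xpmem_segid_t export_region(void *buf, size_t len)
{
    return xpmem_make(buf, len, XPMEM_PERMIT_MODE, (void *)0666);
}

/* Consumer side (another rank on the same node), after receiving the segid
 * and the remote pointer, e.g. via MPI: one attach maps the peer's buffer
 * locally, so the payload needs only a single copy.  The offset is relative
 * to the segment base; with the whole-address-space segment from (a) the
 * base is 0, so the remote virtual address can be used directly. */
void *map_peer_buffer(xpmem_segid_t segid, void *remote_ptr, size_t len)
{
    xpmem_apid_t apid = xpmem_get(segid, XPMEM_RDWR, XPMEM_PERMIT_MODE,
                                  (void *)0666);
    struct xpmem_addr xaddr = { .apid = apid,
                                .offset = (off_t)(uintptr_t)remote_ptr };
    return xpmem_attach(xaddr, len, NULL);
}
```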
I'd like to understand why memory allocation slows down at scale. Is it because any allocation inside the XPMEM-registered space triggers some kind of all-to-all communication across the comm world by XPMEM?
There is no communication related to memory allocation. We need to check the XPMEM source code to understand what, if anything, is happening during allocation after registration.
When I profiled my app (not this reproducer), I saw
I am not sure this is a "severe" issue. The slowdown is still within the same order of magnitude. Anyway, in the case of QMCPACK, disabling xpmem seems to be a no-brainer. For background, my current understanding is: we open a segment covering the entire virtual address space at init time for convenience and efficiency. That is convenient, but apparently it adds cost to every malloc (likely in kernel space). We need to test the alternative of opening a segment for each new piece of large memory -- it may work better, but we will be trading that off against the added complexity and cost of managing send-side caching.
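As a rough sketch of what that per-region alternative and its send-side cache could look like (all names and the data structure here are hypothetical, not MPICH internals):

```c
#include <stddef.h>
#include <stdlib.h>
#include <xpmem.h>

/* Hypothetical send-side registration cache: one XPMEM segment per large
 * buffer instead of one segment for the whole address space. */
typedef struct reg_entry {
    void            *base;
    size_t           len;
    xpmem_segid_t    segid;
    struct reg_entry *next;
} reg_entry;

static reg_entry *reg_cache = NULL;

/* Return a segid covering [buf, buf+len), registering on first use. */
xpmem_segid_t reg_lookup_or_make(void *buf, size_t len)
{
    for (reg_entry *e = reg_cache; e; e = e->next)
        if ((char *)buf >= (char *)e->base &&
            (char *)buf + len <= (char *)e->base + e->len)
            return e->segid;                 /* cache hit: no new registration */

    reg_entry *e = malloc(sizeof(*e));
    e->base  = buf;
    e->len   = len;
    e->segid = xpmem_make(buf, len, XPMEM_PERMIT_MODE, (void *)0666);
    e->next  = reg_cache;
    reg_cache = e;
    return e->segid;                         /* cache miss: new, smaller segment */
}
```

A linear scan is shown only for brevity; a real cache would need a tree or similar structure (plus invalidation when buffers are freed), otherwise it runs straight into the O(N)-lookup concern raised elsewhere in this thread.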
FYI, in actual QMCPACK runs, startup time grows from a few minutes to a few hours, which is not reflected in the reproducer. In addition, it is not just user code allocating memory: when JIT-compiling kernels, IGC needs to allocate memory for its own work, which is entirely out of user control. Regardless, the slowdown of memory allocation needs to be understood. There are limited reasons that could make the slowdown scaling-related: either excessive communication or excessive I/O. To me, both are strong reasons to ban the use of xpmem on HPC at scale, because abusing a shared resource on HPC hurts all the users. For this reason, this can be a far more severe issue than a slow bcast. Another potential slowdown reason could be a bad memory lookup algorithm using O(N) instead of O(log(N)). Some googling shows Cray MPI only uses xpmem for intra-node communication.
Yes, XPMEM is a cross-memory access method used in intra-node IPC communication. The benefit of IPC on the surface is a single-copy data transfer vs. a two-copy data transfer. However, the actual benefit is complicated, as you have seen. There has been demonstrable performance benefit from using XPMEM, including on Aurora. There is no additional communication at the algorithm level -- in fact less -- compared to the conventional 2-copy method. However, the 2-copy method is a pipelining algorithm, so it hides much of the 2-copy overhead at large message sizes. There isn't excessive I/O either; the 1-copy algorithm does less I/O than the 2-copy algorithm by definition. I believe the malloc overhead you have seen is incurred in the kernel. I am not familiar with the details, but I can imagine XPMEM needs to process each page allocation in order to map the page into another process's address space. Your data kind of show there isn't a scaling issue beyond the node level, right?
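To make the pipelining point concrete, here is an illustrative (non-MPICH) sketch of the chunked 2-copy path. Both halves are shown sequentially in one function purely to show the structure; in the real protocol the sender's copy-in of chunk i+1 runs in a different process and overlaps the receiver's copy-out of chunk i.

```c
#include <stddef.h>
#include <string.h>

#define CHUNK (64 * 1024)   /* illustrative bounce-buffer chunk size */

/* 2-copy path: the message flows through a small shared bounce buffer in
 * fixed-size chunks, which is what allows the two copies to be pipelined
 * for large messages. */
void two_copy_transfer(char *dst, const char *src, size_t len,
                       char *bounce /* shared-memory chunk */)
{
    for (size_t off = 0; off < len; off += CHUNK) {
        size_t n = (len - off < CHUNK) ? len - off : CHUNK;
        memcpy(bounce, src + off, n);   /* copy 1: sender -> bounce buffer   */
        memcpy(dst + off, bounce, n);   /* copy 2: bounce buffer -> receiver */
    }
}

/* The XPMEM path replaces both copies with a single memcpy from the
 * sender's attached buffer (see map_peer_buffer above), but pays a
 * per-page mapping cost in the kernel instead. */
```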
Also, an obvious place where it is beneficial is MPI RMA, where processes can open a window and have other remote processes access data without tying up both processes as traditional send/recv does. RMA is significantly less useful without cross-process memory access such as XPMEM. In the RMA case, the mapping overhead is hidden by the API design.
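A small self-contained example of that passive-target RMA pattern (standard MPI-3 calls, not tied to any particular transport): rank 0 reads rank 1's exposed data while rank 1 makes no matching communication call.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Every rank exposes one int through a window. */
    int local = rank * 100;
    MPI_Win win;
    MPI_Win_create(&local, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    /* Passive-target access: rank 0 reads rank 1's value without rank 1
     * calling any matching receive; this is where cross-process memory
     * access like XPMEM helps the implementation make progress. */
    if (rank == 0 && size > 1) {
        int remote = -1;
        MPI_Win_lock(MPI_LOCK_SHARED, 1, 0, win);
        MPI_Get(&remote, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
        MPI_Win_unlock(1, win);                 /* completes the MPI_Get */
        printf("rank 0 read %d from rank 1\n", remote);
    }

    MPI_Win_free(&win);                         /* collective, so ranks stay alive */
    MPI_Finalize();
    return 0;
}
```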
Allocating memory takes longer in large-scale runs on Aurora.

version
4.3.0rc3
32- to 4096-node runs. Each node has 12 MPI ranks and each rank has 8 threads.
After setting
MPIR_CVAR_CH4_XPMEM_ENABLE=0
constant timing is restored. I don't mean XPMEM is the culprit; with this setting, the feature that caused the slowdown seems to be turned off.
Please investigate this issue. Many thanks.
Here is the reproducer:
test_mpi.zip
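The attached test_mpi.zip is the authoritative reproducer. For readers without the attachment, a hypothetical stand-in for the kind of measurement described (names, iteration counts, and allocation sizes are made up) might look like:

```c
/* Each rank times a batch of allocate/touch/free cycles; the maximum time
 * across ranks is reported.  Run e.g.:
 *   MPIR_CVAR_CH4_XPMEM_ENABLE=0 mpiexec -n <ranks> ./alloc_bench
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int    iters = 1000;
    const size_t bytes = 1 << 20;      /* 1 MiB per allocation */

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        char *p = malloc(bytes);
        memset(p, 0, bytes);           /* touch the pages so the cost is paid here */
        free(p);
    }
    double local = MPI_Wtime() - t0, max;
    MPI_Reduce(&local, &max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("max alloc+touch+free time over ranks: %.3f s\n", max);

    MPI_Finalize();
    return 0;
}
```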