
Memory allocation slows down at scale #7333

Open
ye-luo opened this issue Mar 13, 2025 · 9 comments

ye-luo commented Mar 13, 2025

Allocating memory takes longer in large-scale runs on Aurora, with MPICH version 4.3.0rc3.
The runs span 32 to 4096 nodes; each node has 12 MPI ranks, and each rank has 8 threads.

test_mpi_malloc.n  32     Function malloc takes 7.70977e+06 us
test_mpi_malloc.n  64     Function malloc takes 7.48404e+06 us
test_mpi_malloc.n 128     Function malloc takes 8.01141e+06 us
test_mpi_malloc.n 256     Function malloc takes 9.05335e+06 us
test_mpi_malloc.n 512     Function malloc takes 9.94731e+06 us
test_mpi_malloc.n1024     Function malloc takes 1.0429e+07 us
test_mpi_malloc.n2048     Function malloc takes 1.04857e+07 us
test_mpi_malloc.n4096     Function malloc takes 1.07383e+07 us

After setting MPIR_CVAR_CH4_XPMEM_ENABLE=0, constant timing is restored.

test_mpi_malloc_noxpmem.n  32     Function malloc takes 5.37159e+06 us
test_mpi_malloc_noxpmem.n  64     Function malloc takes 5.35342e+06 us
test_mpi_malloc_noxpmem.n 128     Function malloc takes 5.38609e+06 us
test_mpi_malloc_noxpmem.n 256     Function malloc takes 5.40957e+06 us
test_mpi_malloc_noxpmem.n 512     Function malloc takes 5.52926e+06 us
test_mpi_malloc_noxpmem.n1024     Function malloc takes 5.48222e+06 us
test_mpi_malloc_noxpmem.n2048     Function malloc takes 5.48631e+06 us
test_mpi_malloc_noxpmem.n4096     Function malloc takes 5.50638e+06 us

I don't mean that XPMEM itself is the culprit, but with this setting the feature that causes the slowdown appears to be turned off.
Please investigate this issue. Many thanks.

Here is the reproducer:

test_mpi_malloc.cpp

mpicxx -fiopenmp test_mpi_malloc.cpp

a512_t8_async.sub # job script

test_mpi.zip
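
(For reference, below is a minimal sketch of the kind of timing loop such a reproducer might contain; the attached test_mpi_malloc.cpp may differ, and the allocation count and size here are placeholders.)

```cpp
// Hypothetical sketch, not the attached reproducer: each rank times a batch of
// heap allocations inside an OpenMP parallel region and reports the maximum.
#include <mpi.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <vector>

int main(int argc, char **argv) {
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n_alloc = 1000;      // placeholder allocation count
    const size_t bytes = 1 << 16;  // placeholder allocation size: 64 KiB

    double t0 = MPI_Wtime();
    #pragma omp parallel
    {
        std::vector<void *> ptrs(n_alloc);
        for (int i = 0; i < n_alloc; ++i) {
            ptrs[i] = std::malloc(bytes);
            std::memset(ptrs[i], 1, bytes);  // touch the pages so they are really mapped
        }
        for (int i = 0; i < n_alloc; ++i)
            std::free(ptrs[i]);
    }
    double local_us = (MPI_Wtime() - t0) * 1e6;

    double max_us;
    MPI_Reduce(&local_us, &max_us, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        std::printf("Function malloc takes %g us\n", max_us);

    MPI_Finalize();
    return 0;
}
```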


TApplencourt commented Mar 14, 2025

Stupid question from somebody who doesn't know anything, but capturing ALL the memory allocations (and so adding overhead to them) just in case some of them are used in MPI seems a little, hmm, optimistic.

Isn't it possible to either:

  • provide users a custom allocator / custom registration function that they can use if they want, or
  • do this "automatic" registration lazily at MPI send/recv? -- at least then one pays the overhead only for memory actually used by MPI.

PS: Increasing the THR will just "hide" the problem IMO: fewer allocations, less overhead for sure. But some app will do allocations of size THR+1 in a tight loop for some reason and hit the pathological path each time. But maybe it's impossible to do it lazily; as I said, I know nothing. So sorry for the noise!


raffenet commented Mar 14, 2025

(Quoting TApplencourt's comment above.)

We're not explicitly hooking the memory allocations when XPMEM is enabled. One difference regardless of communication, however, is that we register the entire virtual address space with XPMEM during MPI_INIT so that receivers can map portions of it during communication. We can certainly do that registration lazily, especially if we confirm it to be the cause of the slow allocations.

Another option might be to register smaller regions on demand, as we do with GPU IPC.
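
(For illustration, here is a hedged sketch of the two registration styles being contrasted, using the public XPMEM user API; the XPMEM_MAXADDR_SIZE extent and the 0666 permit value are assumptions about the init-time setup, not a quote of MPICH's code.)

```cpp
// Sketch only: contrasts whole-address-space registration at init time with
// per-buffer registration done on demand. Assumes <xpmem.h> from the XPMEM
// user-space library; error handling is omitted.
#include <xpmem.h>
#include <sys/types.h>
#include <cstddef>
#include <cstdint>

// Style 1: register the entire virtual address space once at init time.
// Every buffer the process ever allocates is then exposable to peers.
xpmem_segid_t register_whole_vas() {
    return xpmem_make(0, XPMEM_MAXADDR_SIZE, XPMEM_PERMIT_MODE,
                      (void *) (uintptr_t) 0666);  // permit value is an assumption
}

// Style 2: register only the buffer that will actually be communicated,
// at the time it is first used (lazy / on-demand registration).
xpmem_segid_t register_buffer(void *buf, size_t len) {
    return xpmem_make(buf, len, XPMEM_PERMIT_MODE, (void *) (uintptr_t) 0666);
}

// Receiver side: map (part of) a peer's exposed segment into local memory.
void *attach_peer(xpmem_segid_t segid, off_t offset, size_t len) {
    xpmem_apid_t apid = xpmem_get(segid, XPMEM_RDWR, XPMEM_PERMIT_MODE, NULL);
    struct xpmem_addr addr = { apid, offset };
    return xpmem_attach(addr, len, NULL);
}
```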


ye-luo commented Mar 14, 2025

I'd like to understand why memory allocation slows down at scale. Is it because any allocation inside the XPMEM-registered space triggers some kind of all-to-all communication across the comm world via XPMEM?

@raffenet

(Quoting ye-luo's question above.)

There is no communication related to memory allocation. We need to check the XPMEM source code to understand what, if anything, happens during allocation after registration.


ye-luo commented Mar 14, 2025

When I profiled my app (not this reproducer), I saw that time in _mid_memalign goes up significantly. That is the lowest level of the call stack I can see. It probably needs profiling with the Linux kernel included, which could potentially pinpoint the issue.


hzhou commented Mar 14, 2025

I am not sure this is a "severe" issue. The slowdown is still within the same order of magnitude. Is malloc in the hot path of the application? If it is, it is always recommended to consider a custom allocator; the generic libc malloc is known not to be optimal for all use cases.

Anyway, in the case of QMCPACK, disabling xpmem seems to be a no-brainer.

For background, my current understanding is: we open a segment covering the entire virtual address space at init time for convenience and efficiency. That is convenient, but apparently it adds cost to every malloc (likely in kernel space). We need to test the alternative of opening a segment for each new piece of large memory -- it may work better, but we would be trading that off against the added complexity and cost of managing send-side caching.
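
(To make that trade-off concrete, here is a hedged sketch of what a send-side cache of per-buffer segments could look like; the names and data structure are illustrative, not MPICH internals, and cache invalidation on free -- the hard part -- is omitted.)

```cpp
// Illustrative send-side registration cache for the per-buffer alternative:
// reuse a previously created segment that covers [buf, buf+len), otherwise
// create and remember a new one. Not MPICH code.
#include <xpmem.h>
#include <cstddef>
#include <cstdint>
#include <map>

struct CachedSeg {
    uintptr_t base;
    size_t len;
    xpmem_segid_t segid;
};

static std::map<uintptr_t, CachedSeg> seg_cache;  // keyed by segment base address

xpmem_segid_t get_or_register(void *buf, size_t len) {
    const uintptr_t addr = reinterpret_cast<uintptr_t>(buf);

    // Look for a cached segment whose range contains the requested buffer.
    auto it = seg_cache.upper_bound(addr);
    if (it != seg_cache.begin()) {
        --it;
        const CachedSeg &s = it->second;
        if (addr >= s.base && addr + len <= s.base + s.len)
            return s.segid;  // cache hit: no new xpmem_make call
    }

    // Cache miss: expose just this buffer and remember the segment.
    xpmem_segid_t segid = xpmem_make(buf, len, XPMEM_PERMIT_MODE,
                                     (void *) (uintptr_t) 0666);
    seg_cache[addr] = CachedSeg{addr, len, segid};
    return segid;
}
```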


ye-luo commented Mar 14, 2025

FYI, in actual QMCPACK runs the startup time grows from a few minutes to a few hours, which is not reflected in the reproducer. In addition, it is not just user code allocating memory: when JIT-compiling kernels, IGC needs to allocate memory for its own work, which is totally out of the user's control.

Regardless, the slowdown of memory allocation needs to be understood. There are a limited number of reasons that could make the slowdown scaling-related: either excessive communication or excessive I/O. To me, both are strong reasons to ban the use of xpmem on HPC systems at scale, because abusing a shared resource hurts all users. For this reason, this can be a far more severe issue than a slow bcast. Another potential reason for the slowdown could be a bad memory lookup algorithm that is O(N) instead of O(log(N)).

Some googling shows CrayMPI only uses xpmem for intra-node communication.


hzhou commented Mar 14, 2025

Some googling shows CrayMPI only uses xpmem for intra-node communication.

Yes, XPMEM is a cross-memory access method used in intra-node IPC communication. On the surface, the benefit of IPC is a single-copy data transfer versus a two-copy data transfer. However, the actual benefit is complicated, as you have seen. There have been demonstrable performance benefits from using XPMEM, including on Aurora. There is no additional communication at the algorithm level -- in fact there is less, compared to the conventional 2-copy method. However, the 2-copy method is a pipelining algorithm, so it hides much of the 2-copy overhead at large message sizes. There isn't excessive I/O either: the 1-copy algorithm does less I/O than the 2-copy algorithm by definition.

I believe the malloc overhead you see is incurred in the kernel. I am not familiar with the details, but I can imagine XPMEM needs to process each page allocation in order to map the page into a different process's address space.

Your data kind of shows there isn't a scaling issue beyond the node level, right?


hzhou commented Mar 14, 2025

Also, an obvious place where this is beneficial is MPI RMA, where processes can open a window and have other remote processes access its data without tying up both processes the way traditional send/recv does. RMA is significantly less useful without a cross-process memory access mechanism such as XPMEM. In the RMA case, the mapping overhead is hidden by the API design.
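
(As a concrete illustration of that pattern, below is a minimal one-sided sketch using the standard MPI-3 window API; it needs at least 2 ranks, and the payload values are arbitrary.)

```cpp
// Minimal MPI RMA sketch: every rank exposes a window; rank 0 writes into
// rank 1's window with MPI_Put, and rank 1 never posts a matching receive.
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int count = 4;
    int *base = nullptr;
    MPI_Win win;
    // Each rank contributes a region to the window; MPI allocates the memory.
    MPI_Win_allocate(count * sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &base, &win);
    for (int i = 0; i < count; ++i) base[i] = -1;
    MPI_Win_fence(0, win);

    if (rank == 0) {
        int payload[count] = {10, 11, 12, 13};
        // One-sided write into rank 1's window; rank 1 is not involved here.
        MPI_Put(payload, count, MPI_INT, 1, 0, count, MPI_INT, win);
    }
    MPI_Win_fence(0, win);  // completes the Put before the target reads

    if (rank == 1)
        std::printf("rank 1 window now holds %d %d %d %d\n",
                    base[0], base[1], base[2], base[3]);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```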
