
Memory allocation slows down at scale #7333

Open
ye-luo opened this issue Mar 13, 2025 · 9 comments

ye-luo commented Mar 13, 2025

Allocating memory takes longer in large-scale runs on Aurora, with MPICH version 4.3.0rc3.
The runs span 32 to 4096 nodes; each node has 12 MPI ranks, and each rank has 8 threads.

test_mpi_malloc.n  32     Function malloc takes 7.70977e+06 us
test_mpi_malloc.n  64     Function malloc takes 7.48404e+06 us
test_mpi_malloc.n 128     Function malloc takes 8.01141e+06 us
test_mpi_malloc.n 256     Function malloc takes 9.05335e+06 us
test_mpi_malloc.n 512     Function malloc takes 9.94731e+06 us
test_mpi_malloc.n1024     Function malloc takes 1.0429e+07 us
test_mpi_malloc.n2048     Function malloc takes 1.04857e+07 us
test_mpi_malloc.n4096     Function malloc takes 1.07383e+07 us

After setting MPIR_CVAR_CH4_XPMEM_ENABLE=0, constant timing is restored.

test_mpi_malloc_noxpmem.n  32     Function malloc takes 5.37159e+06 us
test_mpi_malloc_noxpmem.n  64     Function malloc takes 5.35342e+06 us
test_mpi_malloc_noxpmem.n 128     Function malloc takes 5.38609e+06 us
test_mpi_malloc_noxpmem.n 256     Function malloc takes 5.40957e+06 us
test_mpi_malloc_noxpmem.n 512     Function malloc takes 5.52926e+06 us
test_mpi_malloc_noxpmem.n1024     Function malloc takes 5.48222e+06 us
test_mpi_malloc_noxpmem.n2048     Function malloc takes 5.48631e+06 us
test_mpi_malloc_noxpmem.n4096     Function malloc takes 5.50638e+06 us

I don't mean that XPMEM itself is the culprit, but with this setting the feature that causes the slowdown appears to be turned off.
Please investigate this issue. Many thanks.

Here is the reproducer:

test_mpi_malloc.cpp

mpicxx -fiopenmp test_mpi_malloc.cpp

a512_t8_async.sub # job script

test_mpi.zip
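
(For reference, below is a minimal sketch of the kind of timing loop such a reproducer might contain; the attached test_mpi_malloc.cpp may differ, and the allocation count and size here are placeholders.)

```cpp
// Hypothetical sketch, not the attached reproducer: each rank times a batch of
// heap allocations inside an OpenMP parallel region and reports the maximum.
#include <mpi.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <vector>

int main(int argc, char **argv) {
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n_alloc = 1000;      // placeholder allocation count
    const size_t bytes = 1 << 16;  // placeholder allocation size: 64 KiB

    double t0 = MPI_Wtime();
    #pragma omp parallel
    {
        std::vector<void *> ptrs(n_alloc);
        for (int i = 0; i < n_alloc; ++i) {
            ptrs[i] = std::malloc(bytes);
            std::memset(ptrs[i], 1, bytes);  // touch the pages so they are really mapped
        }
        for (int i = 0; i < n_alloc; ++i)
            std::free(ptrs[i]);
    }
    double local_us = (MPI_Wtime() - t0) * 1e6;

    double max_us;
    MPI_Reduce(&local_us, &max_us, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        std::printf("Function malloc takes %g us\n", max_us);

    MPI_Finalize();
    return 0;
}
```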


TApplencourt commented Mar 14, 2025

Stupid question from somebody who doesn't know anything, but capturing ALL the memory allocations (and so adding overhead to them) just in case some of them are used in MPI seems a little, hmm, optimistic.

Isn't it possible to either:

  • provide users a custom allocator / custom registration function that they can use if they want, or
  • do this "automatic" registration lazily at MPI send/recv? -- at least then one pays the overhead only for memory actually used by MPI.

PS: Increasing the THR will just "hide" the problem IMO: fewer allocations, less overhead for sure. But some app will do allocations of size THR+1 in a tight loop for some reason and hit the pathological path each time. But maybe it's impossible to do it lazily; as I said, I know nothing. So sorry for the noise!


raffenet commented Mar 14, 2025

(Quoting TApplencourt's comment above.)

We're not explicitly hooking the memory allocations when XPMEM is enabled. One difference regardless of communication, however, is that we register the entire virtual address space with XPMEM during MPI_INIT so that receivers can map portions of it during communication. We can certainly do that registration lazily, especially if we confirm it to be the cause of the slow allocations.

Another option might be to register smaller regions on demand, as we do with GPU IPC.
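
(For illustration, here is a hedged sketch of the two registration styles being contrasted, using the public XPMEM user API; the XPMEM_MAXADDR_SIZE extent and the 0666 permit value are assumptions about the init-time setup, not a quote of MPICH's code.)

```cpp
// Sketch only: contrasts whole-address-space registration at init time with
// per-buffer registration done on demand. Assumes <xpmem.h> from the XPMEM
// user-space library; error handling is omitted.
#include <xpmem.h>
#include <sys/types.h>
#include <cstddef>
#include <cstdint>

// Style 1: register the entire virtual address space once at init time.
// Every buffer the process ever allocates is then exposable to peers.
xpmem_segid_t register_whole_vas() {
    return xpmem_make(0, XPMEM_MAXADDR_SIZE, XPMEM_PERMIT_MODE,
                      (void *) (uintptr_t) 0666);  // permit value is an assumption
}

// Style 2: register only the buffer that will actually be communicated,
// at the time it is first used (lazy / on-demand registration).
xpmem_segid_t register_buffer(void *buf, size_t len) {
    return xpmem_make(buf, len, XPMEM_PERMIT_MODE, (void *) (uintptr_t) 0666);
}

// Receiver side: map (part of) a peer's exposed segment into local memory.
void *attach_peer(xpmem_segid_t segid, off_t offset, size_t len) {
    xpmem_apid_t apid = xpmem_get(segid, XPMEM_RDWR, XPMEM_PERMIT_MODE, NULL);
    struct xpmem_addr addr = { apid, offset };
    return xpmem_attach(addr, len, NULL);
}
```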


ye-luo commented Mar 14, 2025

I'd like to understand why memory allocation slows down at scale. Is it because any allocation inside the XPMEM-registered space triggers some kind of all-to-all communication across the comm world via XPMEM?

@raffenet

(Quoting ye-luo's question above.)

There is no communication related to memory allocation. We need to check the XPMEM source code to understand what, if anything, happens during allocation after registration.


ye-luo commented Mar 14, 2025

When I profiled my app (not this reproducer), I saw that time in _mid_memalign goes up significantly. That is the lowest level of the call stack I can see. It probably needs profiling with the Linux kernel included, which could potentially pinpoint the issue.


hzhou commented Mar 14, 2025

I am not sure this is a "severe" issue. The slowdown is still within the same order of magnitude. Is malloc in the hot path of the application? If it is, it is always recommended to consider a custom allocator; the generic libc malloc is known not to be optimal for all use cases.

Anyway, in the case of QMCPACK, disabling xpmem seems to be a no-brainer.

For background, my current understanding is: we open a segment covering the entire virtual address space at init time for convenience and efficiency. That is convenient, but apparently it adds cost to every malloc (likely in kernel space). We need to test the alternative of opening a segment for each new piece of large memory -- it may work better, but we would be trading that off against the added complexity and cost of managing send-side caching.
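
(To make that trade-off concrete, here is a hedged sketch of what a send-side cache of per-buffer segments could look like; the names and data structure are illustrative, not MPICH internals, and cache invalidation on free -- the hard part -- is omitted.)

```cpp
// Illustrative send-side registration cache for the per-buffer alternative:
// reuse a previously created segment that covers [buf, buf+len), otherwise
// create and remember a new one. Not MPICH code.
#include <xpmem.h>
#include <cstddef>
#include <cstdint>
#include <map>

struct CachedSeg {
    uintptr_t base;
    size_t len;
    xpmem_segid_t segid;
};

static std::map<uintptr_t, CachedSeg> seg_cache;  // keyed by segment base address

xpmem_segid_t get_or_register(void *buf, size_t len) {
    const uintptr_t addr = reinterpret_cast<uintptr_t>(buf);

    // Look for a cached segment whose range contains the requested buffer.
    auto it = seg_cache.upper_bound(addr);
    if (it != seg_cache.begin()) {
        --it;
        const CachedSeg &s = it->second;
        if (addr >= s.base && addr + len <= s.base + s.len)
            return s.segid;  // cache hit: no new xpmem_make call
    }

    // Cache miss: expose just this buffer and remember the segment.
    xpmem_segid_t segid = xpmem_make(buf, len, XPMEM_PERMIT_MODE,
                                     (void *) (uintptr_t) 0666);
    seg_cache[addr] = CachedSeg{addr, len, segid};
    return segid;
}
```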


ye-luo commented Mar 14, 2025

FYI, in actual QMCPACK runs the startup time grows from a few minutes to a few hours, which is not reflected in the reproducer. In addition, it is not just user code allocating memory: when JIT-compiling kernels, IGC needs to allocate memory for its own work, which is totally out of the user's control.

Regardless, the slowdown of memory allocation needs to be understood. There are a limited number of reasons that could make the slowdown scaling-related: either excessive communication or excessive I/O. To me, both are strong reasons to ban the use of xpmem on HPC systems at scale, because abusing a shared resource hurts all users. For this reason, this can be a far more severe issue than a slow bcast. Another potential reason for the slowdown could be a bad memory lookup algorithm that is O(N) instead of O(log(N)).

Some googling shows CrayMPI only uses xpmem for intra-node communication.


hzhou commented Mar 14, 2025

Some googling shows CrayMPI only uses xpmem for intra-node communication.

Yes, XPMEM is a cross-memory access method used in intra-node IPC communication. On the surface, the benefit of IPC is a single-copy data transfer versus a two-copy data transfer. However, the actual benefit is complicated, as you have seen. There have been demonstrable performance benefits from using XPMEM, including on Aurora. There is no additional communication at the algorithm level -- in fact there is less, compared to the conventional 2-copy method. However, the 2-copy method is a pipelining algorithm, so it hides much of the 2-copy overhead at large message sizes. There isn't excessive I/O either: the 1-copy algorithm does less I/O than the 2-copy algorithm by definition.

I believe the malloc overhead you see is incurred in the kernel. I am not familiar with the details, but I can imagine XPMEM needs to process each page allocation in order to map the page into a different process's address space.

Your data kind of shows there isn't a scaling issue beyond the node level, right?


hzhou commented Mar 14, 2025

Also, an obvious place where this is beneficial is MPI RMA, where processes can open a window and have other remote processes access its data without tying up both processes the way traditional send/recv does. RMA is significantly less useful without a cross-process memory access mechanism such as XPMEM. In the RMA case, the mapping overhead is hidden by the API design.
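
(As a concrete illustration of that pattern, below is a minimal one-sided sketch using the standard MPI-3 window API; it needs at least 2 ranks, and the payload values are arbitrary.)

```cpp
// Minimal MPI RMA sketch: every rank exposes a window; rank 0 writes into
// rank 1's window with MPI_Put, and rank 1 never posts a matching receive.
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int count = 4;
    int *base = nullptr;
    MPI_Win win;
    // Each rank contributes a region to the window; MPI allocates the memory.
    MPI_Win_allocate(count * sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &base, &win);
    for (int i = 0; i < count; ++i) base[i] = -1;
    MPI_Win_fence(0, win);

    if (rank == 0) {
        int payload[count] = {10, 11, 12, 13};
        // One-sided write into rank 1's window; rank 1 is not involved here.
        MPI_Put(payload, count, MPI_INT, 1, 0, count, MPI_INT, win);
    }
    MPI_Win_fence(0, win);  // completes the Put before the target reads

    if (rank == 1)
        std::printf("rank 1 window now holds %d %d %d %d\n",
                    base[0], base[1], base[2], base[3]);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```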
