Hello, I believe this is the last talk, so I'll try to make it quick.

Okay, so today I'd like to discuss some challenges related to heterogeneous memory systems and how the Unified Memory Framework (UMF) can help to address them. I will describe what those challenges are, then introduce UMF, show how it addresses them, present the UMF architecture, and finish by talking about the current status and future plans.

So, as the amount of data and the power needed to process that data are increasing, current servers are becoming more heterogeneous. As a result, you will very often find multiple different types of memory and compute on a single server, all of which can be leveraged by a single application. Those different types of memory can include local DRAM, HBM (high-bandwidth memory), CXL-attached memory (local or remote), and also GPU memory, if you have a GPU in that system. As a developer, utilizing those different memories requires discovering what resources are on the system, deciding where to place the data, and knowing how to migrate that data between different memory types. Additionally, it requires interacting with different APIs for allocation and data migration. So, for example, if you want to allocate local memory, you will use interfaces like malloc. But if you want to allocate memory for a GPU, you often need to use different APIs, for example, driver APIs.

And this is where UMF comes into the picture. Its goal is to unify the path for heterogeneous memory allocations and resource discovery among higher-level runtimes. Those runtimes include SYCL, OpenMP, Unified Runtime, and MPI (those are the things we are using inside Intel), but UMF is also meant for external libraries and applications. UMF is basically a single project that accumulates technologies related to memory management. It provides a flexible, mix-and-match API that allows tuning for particular use cases, and its main philosophy is to complement, not compete with, operating system capabilities such as memory tiering. We are not competing with that; we are trying to complement it in user space.

Now, before I move to describing UMF in more detail, let's take a look at the common memory allocation structure from the perspective of a user-space application. So, here you can see an application; let's assume this is a C++ application that wants to allocate some memory. It can use different APIs: for example, high-level C++ memory allocators or memory resources, or lower-level calls like malloc and free. Regardless of which path is taken, underneath there will usually be some call to a malloc-like API, and internally that call usually has some implementation of memory pooling or caching. The way it works is that there is usually some heap manager or pool manager (I will use those terms interchangeably), which manages memory and requests it in big chunks from some memory provider, which can be the operating system. In the case of malloc, for example, this can be a call to mmap or some other syscall. Requesting these huge chunks of memory is usually quite expensive. Once the memory is requested, it is cached in the memory pool, and some part of it may be carved out and returned to the user. The rest is retained for future allocations. So, if the memory is already in the pool, the allocation is usually pretty fast.

Those heap managers or pool managers can be implemented in various ways. They can be optimized for different things, like minimizing latency, minimizing fragmentation, or maximizing concurrency. But usually, from the application's perspective, there is no easy way to change or select which heap or pool manager is used. For example, if you're using malloc, you will use the implementation that is available on that particular system, so it will be different on Linux and on Windows. And the entire thing below the malloc and free box on this slide is a black box from the application's perspective. There is no easy way to influence, for example, what type of memory will be allocated by that API.
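The pooling behavior described above can be illustrated with a minimal Python sketch. It is a toy model, not UMF or any real malloc: `CountingProvider` and `BumpPool` are hypothetical names, addresses are plain integers, and freeing is left out entirely. The only point is that many small allocations are served from one expensive coarse-grain request.

```python
class CountingProvider:
    """Stands in for the OS: hands out large address ranges and counts requests."""

    def __init__(self):
        self.requests = 0
        self._next_base = 0x1000_0000  # arbitrary fake starting address

    def alloc(self, size):
        self.requests += 1
        base = self._next_base
        self._next_base += size
        return base


class BumpPool:
    """Minimal pool manager: caches one large chunk and carves pieces from it."""

    def __init__(self, provider, chunk_size=1 << 20):
        self.provider = provider
        self.chunk_size = chunk_size
        self._cursor = 0
        self._end = 0

    def malloc(self, size):
        if size > self.chunk_size:
            raise ValueError("request larger than one chunk")
        if self._cursor + size > self._end:
            # Slow path: the cached chunk is exhausted, ask the provider again.
            self._cursor = self.provider.alloc(self.chunk_size)
            self._end = self._cursor + self.chunk_size
        # Fast path: carve the next piece out of the cached chunk.
        addr = self._cursor
        self._cursor += size
        return addr


provider = CountingProvider()
pool = BumpPool(provider)
addrs = [pool.malloc(64) for _ in range(1000)]
print(provider.requests)  # prints 1
```

One thousand 64-byte allocations fit in a single 1 MiB chunk, so the provider is asked only once. A real heap manager would also handle frees, alignment, and thread safety.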

So, here is how UMF can help with that. Take a look at the chart on the right. It shows a very similar allocation flow, and you can see that UMF exposes a very similar API. We call it umfPoolMalloc, and it looks very much like regular malloc. The only difference is that it accepts one extra argument that specifies from which memory pool the memory should be allocated. In UMF, you can create multiple different memory pools that can be mapped to different hardware. So, for example, on this slide, the first pool is mapped to GPU memory, and that memory is managed by the GPU driver. The second pool is mapped to local HBM, high-bandwidth memory. And the last pool manages some remote memory, over CXL, for example.

UMF itself is a framework to build allocators and organize those different memory pools. Let me properly define what a memory pool and a memory provider are in UMF. A memory pool is basically a combination of a pool manager (or heap manager) and a memory provider. The memory provider does the actual coarse-grain memory allocations. So, this can be, for example, a provider that manages operating system memory by calling mmap or other system calls, or it can be a provider that manages GPU memory. The heap manager is basically a collection of algorithms and data structures that determine how to split these huge, coarse-grain allocations into smaller pieces, and how much of that memory to retain. So, its job is to manage the memory pool and service fine-grain allocations to the user.

Now, UMF defines interfaces for both memory providers and memory pools, but it also implements a few specific memory pools and memory providers. We have a disjoint pool that is aimed mostly at GPU memory, a scalable pool that is based on the TBB implementation, and a jemalloc pool that is based on jemalloc. The user can select which of these pool managers to use depending on the specific use case, because they have different properties.

As for memory providers, we have a provider that manages operating system memory, which can be used for allocating local DRAM, HBM, or CXL memory, and a Level Zero (L0) provider, which is used for managing GPU memory on Intel GPUs. Users can also provide their own implementations of both memory providers and pool managers and plug them into the library.
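To make the mix-and-match idea concrete, here is a small Python sketch of a pluggable provider interface with two interchangeable pool managers. None of these names come from UMF: `Provider`, `FakeProvider`, `PassThroughManager`, and `FreeListManager` are hypothetical, addresses are fake integers, and the managers are far simpler than the disjoint, scalable, or jemalloc pools mentioned in the talk.

```python
from collections import defaultdict
from typing import Protocol


class Provider(Protocol):
    """Coarse-grain memory source (hypothetical interface, not UMF's)."""

    def alloc(self, size: int) -> int: ...
    def free(self, addr: int, size: int) -> None: ...


class FakeProvider:
    """Stand-in provider that hands out fake addresses and tracks live bytes."""

    def __init__(self, kind: str, base: int):
        self.kind = kind
        self.live = 0
        self._next = base

    def alloc(self, size: int) -> int:
        addr, self._next = self._next, self._next + size
        self.live += size
        return addr

    def free(self, addr: int, size: int) -> None:
        self.live -= size


class PassThroughManager:
    """No caching: every request and release goes straight to the provider."""

    def __init__(self, provider: Provider):
        self.provider = provider

    def malloc(self, size: int) -> int:
        return self.provider.alloc(size)

    def free(self, addr: int, size: int) -> None:
        self.provider.free(addr, size)


class FreeListManager:
    """Retains freed blocks per power-of-two size class and reuses them."""

    def __init__(self, provider: Provider):
        self.provider = provider
        self._free = defaultdict(list)

    @staticmethod
    def _size_class(size: int) -> int:
        return 1 << (size - 1).bit_length()

    def malloc(self, size: int) -> int:
        cls = self._size_class(size)
        if self._free[cls]:
            return self._free[cls].pop()
        return self.provider.alloc(cls)

    def free(self, addr: int, size: int) -> None:
        self._free[self._size_class(size)].append(addr)


gpu = FakeProvider("gpu", 0x8000_0000)
cached = FreeListManager(gpu)
a = cached.malloc(100)
cached.free(a, 100)
b = cached.malloc(120)  # same 128-byte class, served from the free list

host = FakeProvider("os", 0x1000_0000)
direct = PassThroughManager(host)
c = direct.malloc(100)
direct.free(c, 100)

print(a == b, gpu.live, host.live)  # prints: True 128 0
```

Either manager works with either provider, which is the property the talk describes: the policy for splitting and retaining memory is chosen separately from where the memory comes from.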

Now, one other feature that UMF exposes is memspaces. A memspace is an abstraction over some memory resource; it is a collection of memory targets. A memory target represents a single memory source from which we can actually allocate. This can be a NUMA node or a memory-mapped file, and it can have different properties that can be queried by the user, such as latency, bandwidth, and capacity. A memspace can be used as a means of discovery: an application can use it to learn what kinds of memory are available on a system and what their properties are. It can also be used to create memory pools from which allocations can happen. UMF exposes a few predefined memspaces. You can see two examples on this slide: memspace HOST_ALL, which contains all the available NUMA nodes on the system that you can iterate over, and memspace HBM, which in this example contains a single NUMA node, in which case allocating from that memspace is equivalent to allocating from the highest-bandwidth memory.
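As a rough illustration of memspaces as collections of queryable targets, the sketch below models a host with four NUMA nodes in Python. Everything here is an assumption made for the example: the `MemoryTarget` fields, the `best_targets` helper, and especially the numbers, which are invented values for a hypothetical server and not measurements. It is not UMF's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryTarget:
    """One source of memory, here a NUMA node with queryable properties."""

    numa_node: int
    kind: str
    bandwidth_gbs: float
    latency_ns: float
    capacity_gib: int


# Made-up values for a hypothetical server; not measurements.
HOST_ALL = (
    MemoryTarget(0, "DRAM", 200.0, 90.0, 256),
    MemoryTarget(1, "DRAM", 200.0, 90.0, 256),
    MemoryTarget(2, "HBM", 800.0, 110.0, 64),
    MemoryTarget(3, "CXL", 60.0, 250.0, 1024),
)


def best_targets(memspace, key, highest=True):
    """Return the sub-memspace whose targets share the best value of `key`."""
    pick = max if highest else min
    best = pick(getattr(t, key) for t in memspace)
    return tuple(t for t in memspace if getattr(t, key) == best)


HIGHEST_BANDWIDTH = best_targets(HOST_ALL, "bandwidth_gbs")
HIGHEST_CAPACITY = best_targets(HOST_ALL, "capacity_gib")
LOWEST_LATENCY = best_targets(HOST_ALL, "latency_ns", highest=False)

# Discovery: iterate over every target the host exposes.
for t in HOST_ALL:
    print(t.numa_node, t.kind, t.capacity_gib)

print([t.numa_node for t in HIGHEST_BANDWIDTH])  # prints [2]
```

On a machine with only DRAM nodes, the same helper would return the same targets for both highest bandwidth and highest capacity, so two logically different memspaces can end up backed by the same physical memory.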

Here, you can see a basic code example of how you can utilize those memspaces and how you can allocate memory using UMF. At the top, you can see how you first create a pool. Here, we are creating two different pools, one for HBM and one for high-capacity memory. The first step for both pools is getting the appropriate memspace by calling a function. The second step is calling the umfPoolCreateFromMemspace function, which selects the best available heap manager for that memory and a memory provider that will be responsible for allocating that type of memory. Depending on the platform or server we are running on, those two types of memory might map to the same physical memory. For example, if we don't have HBM and only have DRAM, both pools will use the same physical memory, but if we do have HBM, then the HBM pool will use it. After the pools are created, the next part is allocating the data. As you can see, we are using an interface similar to regular malloc. The one difference is that we pass the pool from which we want to allocate. So, the first pointer is allocated from the HBM pool by passing that pool as the first argument, and the second one is allocated from the high-capacity pool. The free function is even simpler because you don't need to specify the pool; it is selected automatically. You only need to pass a pointer.

And one last feature, or rather two last features, I wanted to mention are observability and interop capabilities. Modern applications can be quite complex, and it's often the case that multiple libraries or runtimes are used by a single application, and one library might want to manage memory allocated by a different library or runtime. To do that, it often needs some information about this memory: where it comes from and what its properties are. UMF can help here by aggregating data about all allocations. It can answer questions based on a pointer; for example, it can tell whether that memory comes from the GPU or from the operating system. If it comes from the operating system, it can tell which NUMA node it is bound to, if it was allocated that way. The other feature is abstracting memory sharing operations. If you want to share memory between different processes, how you do it depends on what kind of memory you want to share and what operating system you are using. UMF abstracts that away by providing a single API that works for both regular memory and GPU memory. All of this is implemented underneath, and you only get a single unified interface that uses IPC handles to share memory from one process to another. This is, for example, used by Intel MPI as well, or at least it is being worked on.
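The pointer-based queries described here can be sketched as a range lookup. The Python below is a minimal model under stated assumptions: `AllocationTracker` and its methods are hypothetical names, the property keys (`source`, `numa_node`) are invented for the example, and allocations are assumed not to overlap. It shows one way any address inside an allocation, not just its base, could be resolved to the recorded properties.

```python
import bisect


class AllocationTracker:
    """Maps any address inside a tracked allocation to its recorded properties."""

    def __init__(self):
        self._bases = []  # sorted base addresses of live allocations
        self._info = {}   # base address -> (size, properties)

    def add(self, base, size, **props):
        bisect.insort(self._bases, base)
        self._info[base] = (size, props)

    def remove(self, base):
        self._bases.remove(base)
        del self._info[base]

    def query(self, ptr):
        # Find the last allocation whose base is <= ptr, then check its bounds.
        i = bisect.bisect_right(self._bases, ptr) - 1
        if i < 0:
            return None
        base = self._bases[i]
        size, props = self._info[base]
        return props if ptr < base + size else None


tracker = AllocationTracker()
tracker.add(0x1000, 0x100, source="os", numa_node=1)
tracker.add(0x8000, 0x400, source="gpu")

print(tracker.query(0x1080))  # prints {'source': 'os', 'numa_node': 1}
print(tracker.query(0x8200))  # prints {'source': 'gpu'}
print(tracker.query(0x1100))  # prints None
```

The last query falls one byte past the end of the first allocation, so it is reported as untracked.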

So, the current status of UMF is that we are releasing it as an internal component of oneAPI 2025. It is used by several projects at Intel. One of them is the Unified Runtime, where it is used for managing GPU memory, that is, USM memory pooling. Unified Runtime is the component that sits underneath SYCL. Other projects that are working on integrating UMF include Intel MPI, oneCCL, libiomp, and CAL. The UMF project is entirely open source, and you can find the GitHub repo on the next slide. So, I will go there.

And to sum up, UMF is a framework that unifies interfaces for working with different memory hierarchies. It improves efficiency through code and technology reuse. It provides a set of building blocks that can be adapted to particular needs, and it can handle interop between different runtimes by aggregating data about all the allocations. My call to action would be: if you are working on heterogeneous memory systems, whether it's a system with CXL or a system with GPU memory, try out UMF. It might also be helpful when building a custom memory allocator, thanks to its pluggable heap managers and memory providers. In the links below, you can find some more information. The first link is to the documentation, and the second link is to the GitHub repository where all the development of UMF happens. And that will be all.

Awesome, Igor. Thank you.