CDI-Info/323 at main · vaj/CDI-Info · GitHub


So, yeah. Hi, my name is Honggyu Kim, and I'm from SK Hynix. My topic is expanding the CXL software ecosystem through HMSDK on Linux. And maybe there's some problem with the slide, actually.

Yeah, never mind. Okay, so a conventional memory system consists of only one type of memory, sitting in the DIMM slots. So, it just works, and we don't need any kind of special software support. However, we can expand the memory beyond the DIMM slots by incorporating CXL memory. By having CXL memory, we can expand both the bandwidth and the capacity. However, it requires some software support for efficient use.

So, the software support for CXL memory. Initially, CXL driver-level support is needed. The device driver is located here in the Linux kernel. On top of that, memory management support is very much needed for efficient use. CXL memory is detected as a CPU-less NUMA node. We are working on this based on the NUMA abstraction under the HMSDK project.

So, the HMSDK project. HMSDK stands for the Heterogeneous Memory Software Development Kit. It provides three different modes. First, there is bandwidth expansion. Second, capacity expansion. And lastly, a custom allocator. Let's go through them one by one.

So, the bandwidth expansion mode can be used when the target workload is bandwidth-hungry. In that case, by providing more bandwidth, we can achieve a speed-up for bandwidth-intensive workloads.

Then the capacity expansion mode. If the workload is not bandwidth-hungry but requires more capacity, we can expand the capacity with the CXL memory. But accessing the CXL memory incurs additional latency. So, we need to minimize that kind of latency overhead.

And then, the last one is a custom allocator. This can be used when the users have some knowledge about their programs. In that case, we can modify the target program with HMSDK's hmalloc APIs, such as hmalloc and hfree. So, that means that the custom allocator requires some software modification.

But the first two modes don't require a software modification, because those are the OS-level techniques.

So maybe I can just skip this slide, because I'll explain it in the following slides. So, the first one is bandwidth expansion.

So, the conventional round-robin interleave can work, but it doesn't consider the difference in bandwidth characteristics. It just distributes the pages evenly across the nodes. So, in the picture on the left side, the host memory bandwidth is underutilized. With the weighted interleave, we can take the bandwidth difference into account: we can allocate two pages to the host memory, then allocate one single page to the CXL memory. And so, we can fully utilize the bandwidth capability of both.
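To make the two-to-one placement concrete, here is a minimal Python sketch of how a weighted interleave hands out pages. It assumes node 0 is the host DRAM with weight 2 and node 1 is the CXL memory with weight 1, as in the slide; the function name `weighted_interleave` is just for illustration, and the kernel's real implementation differs in detail.

```python
from collections import Counter
from itertools import islice


def weighted_interleave(weights):
    """Yield NUMA node ids forever; each node gets `weight` pages per round."""
    if any(weight < 1 for weight in weights.values()):
        raise ValueError("every weight must be at least 1")
    while True:
        for node, weight in weights.items():
            for _ in range(weight):
                yield node


# Assumed setup: node 0 = host DRAM (weight 2), node 1 = CXL memory (weight 1).
weights = {0: 2, 1: 1}
placement = list(islice(weighted_interleave(weights), 9))

print(placement)           # [0, 0, 1, 0, 0, 1, 0, 0, 1]
print(Counter(placement))  # Counter({0: 6, 1: 3})
```

With all weights set to 1, the same generator degenerates into the plain round-robin interleave.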

So, this weighted interleave provides a sysfs interface in the kernel, and this is available from Linux version 6.9. This is fully upstreamed, in collaboration with the MemVerge engineer, Gregory Price. In this location, we can set the weight numbers. And then, the user interface: numactl is a widely used tool to set the memory policies, and there is a new option called -w, or --weighted-interleave, with which we can specify some nodes.

So, I will show you the usage. Back to a similar picture. We can just write the weight number for node zero, and we can set 1 for node one, which is a CXL node. In that case, the kernel is ready. But it doesn't work without numactl. numactl provides --weighted-interleave 0,1; that means we can run the target program in weighted interleave mode across the NUMA nodes 0 and 1.
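The two steps can be sketched in Python as follows. This is a hedged sketch, not HMSDK code: the sysfs directory is the upstream weighted-interleave location in Linux 6.9 and later, the weight of 2 for node 0 is an assumption taken from the two-to-one picture, and `./target_program` is a placeholder. By default nothing is written or executed; the helpers only return what would be done.

```python
SYSFS_DIR = "/sys/kernel/mm/mempolicy/weighted_interleave"


def set_weights(weights, dry_run=True):
    """Plan (and optionally perform) the per-node weight writes."""
    plan = [(f"{SYSFS_DIR}/node{node}", str(weight))
            for node, weight in sorted(weights.items())]
    if not dry_run:
        for path, value in plan:
            # Needs root and a kernel with weighted interleave support.
            with open(path, "w") as f:
                f.write(value)
    return plan


def numactl_command(nodes, program):
    """Build the numactl command line that runs `program` with the policy."""
    node_list = ",".join(str(node) for node in nodes)
    return ["numactl", f"--weighted-interleave={node_list}", *program]


# Assumed weights: 2 for the host node 0, 1 for the CXL node 1.
plan = set_weights({0: 2, 1: 1})
command = numactl_command([0, 1], ["./target_program"])

for path, value in plan:
    print(f"echo {value} > {path}")
print(" ".join(command))
```

Passing the resulting list to `subprocess.run` would launch the program, provided the installed numactl already has the weighted interleave option.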

So, the performance result is shown in the picture below. Having 8 channels of DDR5 plus 4 channels of CMM-DDR5, we are able to achieve a 21% throughput improvement with a two-to-one read/write ratio. And with a one-to-one read/write ratio, we are able to achieve a 31% improvement from the bandwidth expansion.

So, the second one is the capacity expansion. Before explaining the capacity expansion, I need to explain a feature called DAMON in the Linux kernel. DAMON is a data access monitoring framework in the Linux kernel, supported from version 5.15, which was released about three years ago. It performs memory access checks with an upper-bounded overhead for scalability. That means that it sacrifices some of the accuracy. The picture below shows a heatmap. From left to right is time, and from bottom to top are the address ranges. It shows how frequently each address range is accessed, and this is shown as colors.
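A toy Python sketch can show where the bounded overhead comes from. It is a simplification, not DAMON's actual code: the address space is split into a fixed number of regions, and each sampling step checks only one randomly chosen page per region, so the cost depends on the number of regions rather than on the memory size. The helper `monitor` and the page ranges are hypothetical; real DAMON also splits and merges regions adaptively.

```python
import random


def monitor(regions, is_accessed, samples, rng):
    """Count, per region, how many samples found the sampled page accessed."""
    counts = [0] * len(regions)
    for _ in range(samples):
        for i, (start, end) in enumerate(regions):
            page = rng.randrange(start, end)  # one check per region per sample
            if is_accessed(page):
                counts[i] += 1
    return counts


# Hypothetical workload: pages 0..99 are always accessed, 100..199 never.
hot_pages = range(0, 100)
regions = [(0, 100), (100, 200)]

counts = monitor(regions, lambda page: page in hot_pages,
                 samples=20, rng=random.Random(0))
print(counts)  # [20, 0]
```

The total work here is `samples * len(regions)` checks, whether each region covers a hundred pages or a million, which is the accuracy-for-scalability trade-off described above.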

And so, DAMON is just a profiling framework in the Linux kernel, and profiling by itself doesn't do anything for the memory management. Based on the profiling results from DAMON, we can apply some memory management actions. DAMOS is for this purpose; it stands for DAMON-based Operation Schemes. As you can see at the blue dotted line, after some specific time has passed, it allows us to apply specific actions based on the hotness. If some area is detected as hot, then we can promote it to the DRAM; if some area is detected as cold, then we can demote it to CXL.

As time goes by, the access pattern can change, and then we can apply different actions.

So, in DAMOS, there are many other memory management actions. For example, there is proactive reclaim, which is the DAMOS pageout action. But there were no migration actions before. In HMSDK, we implemented the migration actions and added them to the Linux kernel. They are called the DAMOS migrate_hot and migrate_cold actions, which mean promotion and demotion. That is fully upstreamed, and it is available from Linux version 6.11, which was released last month.
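The promotion and demotion decision can be sketched in Python like this. It is a deliberately simplified model under stated assumptions: the `Region` class, the tier labels, and the two thresholds are hypothetical, and only the action names `migrate_hot` and `migrate_cold` come from the kernel. Real DAMOS schemes match regions on size, access frequency, and age, and are also limited by quotas.

```python
from dataclasses import dataclass
from typing import Optional

DRAM, CXL = "dram", "cxl"


@dataclass
class Region:
    start: int
    end: int
    nr_accesses: int  # access count reported by the monitoring step
    tier: str         # where the region currently lives


def choose_action(region: Region, hot_threshold: int,
                  cold_threshold: int) -> Optional[str]:
    """Pick a migration action for one region, or None to leave it alone."""
    if region.tier == CXL and region.nr_accesses >= hot_threshold:
        return "migrate_hot"   # promote hot data from CXL to DRAM
    if region.tier == DRAM and region.nr_accesses <= cold_threshold:
        return "migrate_cold"  # demote cold data from DRAM to CXL
    return None


regions = [
    Region(0, 100, nr_accesses=18, tier=CXL),    # hot but in CXL
    Region(100, 200, nr_accesses=0, tier=DRAM),  # cold but in DRAM
    Region(200, 300, nr_accesses=19, tier=DRAM), # hot and already in DRAM
]
actions = [choose_action(r, hot_threshold=15, cold_threshold=1) for r in regions]
print(actions)  # ['migrate_hot', 'migrate_cold', None]
```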

So, to explain the evaluation result: this kind of tiering algorithm is meaningful when the workload cannot fully fit into the DRAM, so I need to apply some memory pressure. First, on the y-axis, these are the normalized execution times, and 1.0 represents the DRAM-only execution time. On the x-axis, these are the DRAM free spaces before loading Redis; our evaluation is conducted using Redis and YCSB. If Redis is fully allocated inside the DRAM, this is the fastest case; in that scenario, no tiering algorithm is needed. On the other end, if the target workload is fully allocated in the CXL memory, it incurs a performance slowdown because of the latency overhead of going through the CXL protocol. In terms of slowdown, it has an 18.8% performance decrease compared to DRAM-only in this evaluation. Thus, DRAM-only is the performance upper bound and CXL-only is the performance lower bound; this is the performance range as long as the target workload is not bandwidth-hungry. We will see the performance results within this area.

So, let's apply some memory pressure by allocating some cold data in various amounts. That places Redis in various different locations. Initially, if only part of the data is located inside the CXL memory, then the performance is not that bad; it's very close to DRAM-only. In the second case, half of Redis is located in the CXL memory, and the performance gets a little bit slower. If most of Redis is located inside the CXL memory, then this is very close to the CXL-only performance. If we look at the other cases, you can see the performance slows down linearly.

So, let's compare the results with the DAMON-enabled system. The initial status is the same, but in this case, we can detect some of the cold areas and then demote them to the CXL memory. We can also detect some hot pages and then promote them. As a result, more of the Redis data is located inside the DRAM, and therefore, there are fewer memory accesses to the CXL memory. The result shows that the performance overhead is minimized, even if the memory pressure is severe.

So, this result is fine, but it was measured by turning DAMON on just before the workload execution. This can be fine for evaluation, but in the real world, DAMON can be always running on the system. So, that means other cold data can be demoted earlier, even before running the target workload. And that means, okay, I can show you the result.

So, in this case, the target workload is not loaded yet, and before that, we can detect some of the cold pages and demote them. As a result, when we load the target workload, it increases the chance that more of Redis is located in the DRAM. So, as the performance result shows, it's much better in this case; it is very close to the DRAM-only performance. In terms of the speedup, it shows a 12.9% speedup in the worst case. Yeah.

And we've upstreamed all the features into the Linux kernel, in collaboration with the DAMON community. I also presented this work at one of the major Linux conferences, the Open Source Summit Europe, last month. The title is this, and I gave the presentation with the DAMON maintainer, SeongJae Park. Yeah, and all the slides and video are already available, so you can go to the link and search for it. Yeah.

The last topic is a heterogeneous memory allocator, which is also called a custom allocator. If the users have some knowledge about their programs, so that they know which area is cold, they can modify their program. Rather than depending on the OS-level memory access profiling, they can just allocate some specific area as "cold" using the hmalloc APIs. HMSDK provides one library called libhmalloc.so, and it provides the hmalloc APIs such as hmalloc, hcalloc, and hfree. The interface is very much the same as malloc, calloc, and free. However, we can explicitly allocate some specific area to specific NUMA nodes using this one. We also provide a tool which is similar to numactl, but which is called hmctl. While numactl applies the mempolicy globally at the process level, hmctl applies the mempolicy only to the hmalloc area. It currently supports the --preferred and --membind options.

So, there is an example. We can include hmalloc.h, and then we can just call hmalloc and hfree. Even with this change, the original execution is not affected: it shows that the entire memory is allocated on NUMA node 0, because it didn't change the behavior. Then, with hmctl, we can provide -m and 2; that means allocate the hmalloc area on NUMA node 2. As you can see, the 512 megabytes is allocated on node 2, and you can change it to node 3, and it shows that. We can also use it together with numactl, allocating the rest of the memory to node 1.
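The same call pattern can be tried from Python through ctypes. This is only a sketch under assumptions: the talk says that hmalloc and hfree mirror malloc and free, so the signatures below are assumed to be `void *hmalloc(size_t)` and `void hfree(void *)`; if libhmalloc.so is not installed, the sketch falls back to the C library's malloc and free on a Unix-like system so that it still runs. The script name in the comments is a placeholder.

```python
import ctypes


def load_allocator():
    """Return (name, alloc, free), preferring libhmalloc.so when available."""
    try:
        lib = ctypes.CDLL("libhmalloc.so")
        name, alloc, free = "hmalloc", lib.hmalloc, lib.hfree
    except (OSError, AttributeError):
        lib = ctypes.CDLL(None)  # fallback: symbols of the running process (libc)
        name, alloc, free = "malloc", lib.malloc, lib.free
    # Assumed signatures, mirroring malloc/free as described in the talk.
    alloc.restype = ctypes.c_void_p
    alloc.argtypes = [ctypes.c_size_t]
    free.restype = None
    free.argtypes = [ctypes.c_void_p]
    return name, alloc, free


name, alloc, free = load_allocator()
size = 1 << 20  # 1 MiB for the sketch; the talk's example uses 512 MB
addr = alloc(size)
if not addr:
    raise MemoryError("allocation failed")
ctypes.memset(addr, 0, size)  # touch the pages so they are actually placed
first_byte = ctypes.c_ubyte.from_address(addr).value
free(addr)
print(name, first_byte)

# Run as usual, placement is unchanged:   python3 alloc_demo.py
# Bind only the hmalloc area to node 2:   hmctl -m 2 python3 alloc_demo.py
```

Run directly, the allocation follows the process's normal memory policy; the hmctl wrapper from the talk is what steers the hmalloc area to a chosen node.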

Yeah, so, to conclude this one, we have made another release called HMSDK 3.0. This was released last month, in September 2024, and it is based on Linux 6.11. This is meaningful, especially because it is fully aligned with the various open-source projects, and the official Linux kernel can directly be used for HMSDK without any local patches or custom rebuild. So, to sum up, weighted interleaving is available from 6.9, and DAMON-based tiered memory management is supported from 6.11. And the hmalloc allocator is available from HMSDK 3.0.
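As a small illustration of these version requirements, the following Python sketch checks a kernel release string against the two minimum versions named above. The feature labels and helper names are made up for this example, and a version check alone is only a hint: the features also depend on the kernel configuration, and distribution kernels may backport them.

```python
import platform
import re

# Minimum upstream kernel versions as stated in the talk.
FEATURE_MIN_KERNEL = {
    "weighted_interleave": (6, 9),
    "damos_migrate_hot_cold": (6, 11),
}


def parse_kernel_release(release):
    """Extract (major, minor) from a release string such as '6.11.0-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def available_features(release):
    """Map each feature label to whether the kernel version is new enough."""
    version = parse_kernel_release(release)
    # Tuples compare numerically, so (6, 9) < (6, 11), unlike string comparison.
    return {name: version >= minimum
            for name, minimum in FEATURE_MIN_KERNEL.items()}


print(available_features("6.9.3-generic"))
# {'weighted_interleave': True, 'damos_migrate_hot_cold': False}
print(platform.release())  # release string of the running system
```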

So, I've already covered the Linux kernel part and then numactl, which has a new option, -w or --weighted-interleave. There has been no official release since this feature was added, so this is only available in the master branch of the numactl repository. And then, the third one is damo. I haven't explained this one. DAMON is a Linux kernel feature that is fully configured with the sysfs interface, but there are lots of knobs, and it's quite difficult to handle all the knobs manually. So, the DAMON maintainer also created another tool for user-space management. In this project, we've also added the hot and cold migration features, and we've contributed many more features as well.

And there are other projects, including hwloc, which stands for hardware locality and contains lots of useful tools, such as lstopo. It now supports the weighted interleave mode as well. I added a patch and upstreamed it, and there has been an official release as well. Lastly, the UMF project, the Unified Memory Framework, which will be presented in the next session. I've also proposed the weighted interleave feature in this project.

So, the final conclusion: we are trying to expand the CXL software ecosystem on Linux and make CXL memory usable by end users, such as software developers and system administrators. A new memory system requires some software changes, as I explained. If the major open-source projects support these kinds of new features, CXL memory can be adopted much better, so we need to lower the hurdle using these kinds of software techniques. Still, there are some hurdles, so we need to keep lowering the barrier. Yeah. That's it. Thank you very much.

And then, this is an open-source project, which is available on GitHub. So, you can find it. There is documentation as well. So, yeah. Thank you very much.