
All right. Hello, everyone. My name is Ahmad Danesh. I'm from Astera Labs, and I run a product management team there.

And I'm Samir. I'm part of Microsoft Azure Research, looking at CXL memory.

All right. So, I mean, CXL has been around for a while, right? And what we're really excited about is that it's finally taking off, right? We're really going to start seeing CXL deployment starting in '25. But how you take it to cloud-scale deployment is really what we wanted to focus on. And how do you actually use that to get the best application performance?

Okay. So, let's start with CXL memory, right? Everybody who is in the memory industry keeps talking about these four basic requirements: latency, bandwidth, capacity, and cost, right? Now, when we are looking at deploying memory with CXL in mind, and specifically in the cloud, we are going to have a variety of workloads. So, it's going to be very difficult to say whether latency is always number one or bandwidth is always number one, right? But one thing is very, very important to know: cost is always number one. That's why on the bottom row I have cost as the most important thing, because this is general-purpose compute, and as you can imagine, you're putting memory behind a CXL controller, there is a board, and a lot of other components. So, you are always competing against locally attached memory, and cost is one of the most important things there. But now, looking at the workloads, right? Some workloads may need more capacity. I have an example here: an in-memory database. In an in-memory database, capacity is extremely important, and latency and bandwidth come after that. So, the requirements are going to be very generic. I cannot just say, design for the biggest capacity or the biggest bandwidth. You have to really look at your workload and tune specifically for it.

Okay. So, what's the main problem statement here, right? We have been talking, maybe in this same room, for the last three or four years about CXL, and we get more and more clarity as time goes on. The clarity we have now is that the CPU ecosystem keeps adding more and more cores, and we all know that the memory interfaces are parallel, right? There is a limit to how many of these parallel I/Os you can put on the CPU socket. So, the core-to-memory ratio is getting distorted: memory is not scaling at the same rate as CPU cores. You have heard it many times. Now, when I'm looking at a specific problem where I have to add more capacity, one way to look at that is, why not use higher-capacity DIMMs, right? These are 3D-stacked (3DS) DIMMs. That is one way to solve the problem, but they're very expensive, right? So, if I'm laser-focused on my cost, I cannot go with these expensive DIMMs for each and every server in the cloud. But at the same time, in the example I gave about the in-memory database, I need more capacity than what is possible through locally attached memory, right? I have maxed out the memory on my local DDR, but I still need more. And this is where CXL helps me. Traditionally, you know, we had all these multi-socket architectures. Anytime you wanted more memory, you added a socket. So, you had four-socket and eight-socket machines, and the long-term trend is that we're going away from these multi-socket machines. So, that path to adding more memory is also going away, right? So, CXL is extremely beneficial: it gives us a cost-effective and performant way of adding memory. It also makes it possible for us to use refurbished DIMMs, because memory has a longer life than the CPUs and other components, right? So, we can take these refurbished DIMMs and put them behind CXL.
And CXL has all these advanced features which will come maybe next year when we come here and talk about it; we are going to talk about pooling and sharing, right? And these features are not possible with the regular locally attached memory. So, for that, you specifically need CXL. And the kind of benefits for certain applications you get with shared memory, for example, is going to be some multiple of X, so that is one great thing that CXL brings us.

Okay, so on this slide, on the left side, you know, these are the basic CXL solution requirements: RAS, security, in-band management, out-of-band management. The CXL components that are getting designed, they have to interoperate with all the CPUs there in the market. They have to interoperate with all the DIMMs because we don't want to be fixed with a certain vendor for a CPU, certain vendor for a DIMM, right? So, interop is very important. And on the right side, we talk about the platform solution stack: application management tools, BMC, device drivers, right? And the middle picture gives you a view of the silicon and the firmware that is trying to solve all these CXL solution requirements. So, with that, I'm going to talk about the next few things.

Thank you, Samir. So, at Astera Labs, we've been really focused on delivering a holistic cloud-scale CXL memory solution. We've been delivering our Leo CXL smart memory controllers, first 1.1 and 2.0 memory expansion, but one of the key things that is not discussed enough is application performance. And maybe just a small plug: if you want to learn a lot more about the lower level details of each of these categories of RAS, security, Chris Peterson from Astera Labs, and Prakash from Meta are going to be doing the presentation later this week. But what we really want to focus in on is how to get the best value out of CXL. How do you actually deliver that application performance? Because a lot of the conversation so far has been around how do we get to the enablement, how do we do all the different software ecosystem enablement pieces, the reliability, but there hasn't been enough focus, we feel, on delivering the best performance. So, we wanted to focus in and dive in a little deeper there.

So, in order to do it, the first conversation when you talk about performance is, well, what's my bandwidth and what's my latency? What we're showcasing here is an apples-to-apples comparison of four DIMMs that are locally attached, four DIMMs that are remote-attached across a UPI hop, and four DIMMs that are attached over CXL, shown as the orange, green, and purple lines on that diagram. So, what are some key observations we've made here? As expected, the unloaded latency of CXL is approximately the same as that of remote memory, or about 2x that of local memory. The loaded latency shows a smooth latency-versus-bandwidth response. Why is that important? As applications are running, you're going to hit different points on this curve. It's very important to have very predictable and consistent tail latencies so that you can have consistent application performance. As Samir was touching on earlier, latency sensitivity and bandwidth sensitivity can be all over the place. It depends on the application. And so, you need to have consistent performance. The other observation is that CXL delivers significantly higher bandwidth than remote memory and about the same as local. So, at a baseline, we say, "OK, the device is working well, but what do you do with this data? What does this mean from an application perspective?"

And so, we want to take a look at how do you actually map this memory? How do you actually utilize CXL-attached memory? And there's really two schools of thought. You can do memory tiering, or you can do memory interleaving. Tiering is the approach where you fundamentally have two separate NUMA nodes, as far as the host is concerned. You have local-attached memory of however much capacity, and CXL-attached memory of however much capacity. And so, you have this local or hot memory, and you have your CXL as your warmer tier. With interleaving, it's really a single NUMA node. The data is just striped across the local and the CXL-attached memory. And in tiering and with interleaving, you can do them either through software-based methods, or you can do them through hardware-based methods.
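To make the two mappings concrete, here is a minimal sketch of how a physical address could resolve to a memory target under each scheme. It is an illustration only: the 64-byte interleave granularity, the round-robin striping order, and the names `interleaved_target` and `tiered_target` are assumptions for this sketch, not something stated in the talk or taken from any particular platform.

```python
CACHE_LINE = 64  # bytes; assumed interleave granularity for this sketch


def interleaved_target(addr: int, local_ways: int, cxl_ways: int) -> str:
    """Single NUMA node: consecutive cache lines are striped round-robin
    across every local channel and then every CXL channel."""
    ways = local_ways + cxl_ways
    way = (addr // CACHE_LINE) % ways
    if way < local_ways:
        return f"local{way}"
    return f"cxl{way - local_ways}"


def tiered_target(addr: int, local_bytes: int) -> str:
    """Two NUMA nodes: addresses below local_bytes live on local DRAM,
    everything above lives on CXL-attached memory."""
    return "node0-local" if addr < local_bytes else "node1-cxl"


if __name__ == "__main__":
    # With 8 local and 4 CXL channels, every 12th line returns to local0.
    for line in range(13):
        print(line, interleaved_target(line * CACHE_LINE, 8, 4))
```

In the interleaved case every large sequential access touches both kinds of memory, while in the tiered case placement is a decision that the application, the OS, or the hardware has to make.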

OK. And so, let's take a look at the latency and bandwidth of tiering versus interleaving. What we have in this case is really similar to the last one. In the tiering setup, we have eight locally attached DIMMs. And on CXL, we have two of the Astera Labs Leo smart memory controllers, each of which provides two DDR5 channels. And then, on the right-hand side there, you can see that with the interleaving setup we're actually interleaving all of the DIMMs together, whereas with the tiering setup, they're two separate NUMA nodes. So let's take a look at what we see from the performance here. With interleaving, because we are striping the data across, what you'd expect is higher bandwidth, which is exactly what we see. Interleaving essentially provides the aggregate bandwidth of the eight local DIMMs and the four additional DDR5 DIMMs behind CXL. So, you essentially get one and a half times your bandwidth. OK. And because you're getting higher aggregate bandwidth, your queuing latencies are lower. And so, you actually deliver lower average latency than you would have if you had just been writing to CXL memory. You can see that it ends up being roughly an average of the local and the CXL latencies if you had run them separately. But tiering then can give you a lot of flexibility to optimize local and CXL-attached memory. OK. So, ultimately, then, what does this really mean from an application performance perspective? And the answer is, obviously, it depends. It depends on the application, but there's a lot of room to optimize. If we take one example, if you knew your application was always going to be running at extremely high bandwidth, and that's the application you're designing for, then interleaving makes a lot of sense.
If you're designing a box that has to be a general-purpose server where you're going to sell VMs and you don't know what application it's going to run, you might need to tune it. You might need to use tiering methods to optimize local-attached memory and CXL-attached memory and page across them. And there are hardware-based methods and software-based methods. And from there, I'll let Samir continue on to describe a little bit more.
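The "one and a half times" figure follows from simple channel counting, and the "roughly an average" latency remark can be approximated the same way. The sketch below works through that arithmetic under idealized assumptions that are not from the talk: every DDR5 channel contributes equal bandwidth, the CXL links are not a bottleneck, and accesses are spread evenly across channels. Latencies are normalized so that local is 1.0 and CXL is 2.0, following the "about 2x" unloaded figure quoted earlier; the function names are hypothetical.

```python
def aggregate_bandwidth_ratio(local_channels: int, cxl_channels: int) -> float:
    """Interleaved bandwidth relative to local-only bandwidth,
    assuming every channel contributes equally."""
    return (local_channels + cxl_channels) / local_channels


def blended_latency(local_channels: int, cxl_channels: int,
                    local_latency: float, cxl_latency: float) -> float:
    """Average unloaded latency when accesses are spread evenly
    across all interleaved channels."""
    total = local_channels + cxl_channels
    return (local_channels * local_latency + cxl_channels * cxl_latency) / total


if __name__ == "__main__":
    # 8 local DIMMs plus 4 DIMMs behind two dual-channel CXL controllers.
    print(aggregate_bandwidth_ratio(8, 4))               # 1.5
    print(round(blended_latency(8, 4, 1.0, 2.0), 2))     # 1.33
```

Under these assumptions the blended latency sits between the local and CXL values, weighted toward local because two thirds of the channels are local.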

OK. So, Ahmad talked about CXL memory, and as we all know, CXL memory is already there. Now, let's talk about deployment, right? There are three fundamental modes for how we are going to deploy CXL memory. The first one is application-managed, the second one is software-managed, and the third one is hardware-managed. I'm going to talk in detail about all of these. All these methods exist, and they're not exclusive. One can use several of them, but there are advantages to each one, right?

So, let's talk about application-managed memory tiering. This is where the application is in control. You are a company that owns your own application. That means you can change your code, right? For example, at Microsoft, we offer a lot of our software as a service. So, we own those applications, and we have the liberty to change that code. The application, in this case, sees two tiers of memory. And because you own your application, you can modify it to utilize these two tiers. What I mean by that is, you know exactly which are your hot objects and which are your cold objects, and you're going to place your objects appropriately. For example, hot objects will be put in local memory. Not-so-hot objects, you'll put in CXL memory. And you're going to manage it, whether it's to promote or demote, from hot to cold or cold to hot. You're going to do that entirely through the application. So that is one method.
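The placement logic described here can be sketched as a small bookkeeping class. This is a pure simulation with hypothetical names (`TwoTierPlacer`, `place`, `promote`, `demote`); it allocates no real memory and is not the method used by any Microsoft application. It only shows the shape of the decision: hot objects go to local memory while it has room, everything else goes to CXL, and the application moves objects between tiers as their temperature changes.

```python
class TwoTierPlacer:
    """Tracks which tier each named object lives in (simulation only)."""

    def __init__(self, local_capacity: int):
        self.local_capacity = local_capacity  # bytes available in the local tier
        self.local_used = 0
        self.tier = {}   # object name -> "local" or "cxl"
        self.size = {}   # object name -> size in bytes

    def place(self, name: str, size: int, hot: bool) -> str:
        """Place a new object: hot objects go local if they fit, else CXL."""
        fits = self.local_used + size <= self.local_capacity
        tier = "local" if (hot and fits) else "cxl"
        if tier == "local":
            self.local_used += size
        self.tier[name] = tier
        self.size[name] = size
        return tier

    def demote(self, name: str) -> None:
        """Move an object that has gone cold from local to CXL."""
        if self.tier[name] == "local":
            self.local_used -= self.size[name]
            self.tier[name] = "cxl"

    def promote(self, name: str) -> bool:
        """Move an object that has become hot to local; False if no room."""
        if self.tier[name] == "local":
            return True
        if self.local_used + self.size[name] > self.local_capacity:
            return False
        self.local_used += self.size[name]
        self.tier[name] = "local"
        return True


if __name__ == "__main__":
    placer = TwoTierPlacer(local_capacity=100)
    print(placer.place("index", 60, hot=True))    # local
    print(placer.place("archive", 60, hot=False)) # cxl
    print(placer.place("cache", 60, hot=True))    # cxl (local tier is full)
```

In a real application the bookkeeping would be backed by allocations bound to the corresponding NUMA nodes; that part is platform-specific and left out of the sketch.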

The second method is software-managed memory tiering. In software-managed memory tiering, the application only sees a single memory tier, and the tiering is done under the hood by software, for example the OS or hypervisor. What happens is that in the CXL memory controller, there will be a unit called a hotness tracker, which continuously monitors the traffic and identifies pages that are hot. Then, there is a software interface to the OS, through the hardware, of course. The OS looks at those results and moves the pages that happen to be very hot from the CXL memory to your local memory. So, it is managed by software. You don't have to change your application. So, for example, if you're running on infrastructure and you don't want to change your code, then you can go with the second approach.
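The division of labor described here, a tracker that counts accesses and an OS that acts on the counts, can be sketched as follows. Everything in it is an assumption made for illustration: the class and function names, the fixed access-count threshold, and the 2 MiB default page size. A real controller and OS would expose and consume hotness data through interfaces the talk does not detail.

```python
from collections import Counter

PAGE_SIZE = 2 * 1024 * 1024  # assumed page size for this sketch


class HotnessTracker:
    """Stand-in for the controller-side unit: counts accesses per page."""

    def __init__(self, page_size: int = PAGE_SIZE):
        self.page_size = page_size
        self.counts = Counter()

    def record(self, addr: int) -> None:
        """Note one access to the page containing addr."""
        self.counts[addr // self.page_size] += 1

    def hot_pages(self, threshold: int) -> list:
        """Pages seen at least `threshold` times since the last reset."""
        return sorted(p for p, c in self.counts.items() if c >= threshold)


def promote_hot_pages(tracker: HotnessTracker, cxl_pages: set,
                      local_pages: set, threshold: int) -> list:
    """Stand-in for the OS side: move hot pages from CXL to local memory,
    then reset the counters for the next sampling interval."""
    moved = [p for p in tracker.hot_pages(threshold) if p in cxl_pages]
    for page in moved:
        cxl_pages.remove(page)
        local_pages.add(page)
    tracker.counts.clear()
    return moved


if __name__ == "__main__":
    tracker = HotnessTracker(page_size=4096)
    for _ in range(5):
        tracker.record(3 * 4096 + 10)   # page 3 is accessed repeatedly
    tracker.record(7 * 4096)            # page 7 is touched once
    cxl, local = {3, 7}, set()
    print(promote_hot_pages(tracker, cxl, local, threshold=3))  # [3]
```

The sketch ignores demotion and local-capacity limits, both of which a real policy would need.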

And the third one is hardware-managed memory tiering. This is where it is done at the CPU level. For example, Intel has this flat memory mode. I have a link to their presentation. What happens is, in the CPU, while it is talking to the CXL memory and the local memory, it is continuously swapping 64-byte cache lines between the CXL and the local memory. You always try to go to the local memory first. If you have a miss, you go to CXL memory and you swap, right? So, this mechanism is done entirely at the hardware level. You don't have to change your application. You don't have to change your software. So, that's the third method, hardware-managed memory tiering. Now, the point that I'm trying to make here is that all three of these mechanisms exist, and you have to decide what works well for you and for your needs. And it does not mean that you have to pick one out of three. Sometimes you can use hardware-managed together with application-managed. It all depends on how much code you're willing to change in the application-managed case. You don't want to touch everything. Maybe you want to touch only a few areas. You go application-managed for that part, and the rest is hardware-managed.
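The "try local first, on a miss go to CXL and swap" behavior can be modeled with a toy simulation. This is a sketch under stated assumptions, not Intel's implementation: the way lines are paired with local slots (line index modulo the number of local slots), the initial placement, and the name `FlatModeModel` are all invented for illustration.

```python
LINE = 64  # bytes per cache line


class FlatModeModel:
    """Toy model: each local slot holds exactly one of the lines that map
    to it; the other lines that map to the slot live in CXL memory."""

    def __init__(self, local_lines: int):
        self.local_lines = local_lines
        # Assumed initial state: line i starts in local slot i.
        self.resident = list(range(local_lines))
        self.hits = 0
        self.misses = 0

    def access(self, addr: int) -> str:
        """Return where the access was served from, swapping on a miss."""
        line = addr // LINE
        slot = line % self.local_lines
        if self.resident[slot] == line:
            self.hits += 1
            return "local"
        # Miss: serve from CXL, then swap so the line now sits in local
        # memory and the previous occupant moves out to CXL.
        self.misses += 1
        self.resident[slot] = line
        return "cxl"


if __name__ == "__main__":
    model = FlatModeModel(local_lines=4)
    print(model.access(0))         # local
    print(model.access(4 * LINE))  # cxl   (line 4 shares slot 0 with line 0)
    print(model.access(4 * LINE))  # local (it was swapped in)
    print(model.access(0))         # cxl   (line 0 was swapped out)
```

The last two accesses show the trade-off: repeated access to one line becomes local after the first miss, while alternating between two lines that share a slot keeps missing.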

So, the next slide is a call to action.

Yeah. So, maybe to expand a little bit more on what Samir is touching on as well: when you take a look at actually deploying the hardware, the hardware can be exactly the same. It can be the exact same server with the same number of CXL devices, and it could just be different configuration options for whether you're doing hardware-managed, software-managed, or application-managed tiering. So, think about it in terms of how you're deploying. But the key thing that we're seeing is that CXL definitely provides a cost-effective and performant solution to expand system memory capacity. And it really is ready for cloud-scale deployment now, right? We're seeing a number of applications and a number of hardware platforms that are going to start deploying with multiple deployment options in terms of how they're managing that memory and how they're utilizing that memory. And each tiering and interleaving mode has its own performance advantages that you can make use of, right? If you really want to get full bandwidth, or you want the flexibility to get the best latency whenever you need it, you can optimize with tiering approaches. And so, there's a lot of application-specific performance that you can then tune through all these different tiering modes, and I'm looking forward to seeing a lot more of these big papers coming out over the next year or so as more and more performance tuning is being done, along with a lot more in the ecosystem and some of the great work we're seeing from a lot of the OEMs and hyperscalers. Okay, we are open for questions.

Yeah, thank you. So, do we have any CPU in production already enabled with hardware tiering right now, from AMD or Intel? Or maybe in the future, on the roadmap?

Yeah, I think so. There was a public announcement about Intel's flat memory mode at Hot Chips last year, but I don't know when it will reach the market.

Okay, okay.

Hi, thank you for the great talk. Could you go to the performance measurement slide?

The interleaving and tiering, or the latency?

This one would be okay. So, you didn't mention the ratio between DRAM and the CXL. So, how much is the DRAM capacity ratio for this measurement?

These are actually the exact same DIMMs used across local and CXL attached. I believe in this particular demo, it was 64-gigabyte DIMMs.

64-gigabyte DIMM and 64-gigabyte CXL measurement.

Each DIMM was 64-gigabyte. So, there are 8 64-gig local and 4 64-gig behind CXL.

Can you repeat the question?

No, I think he answered my question. My question was the DRAM to CXL memory ratio.

Yes.

So I could get this one.

Eight to four.

Okay. So, was this measurement done with MLC, or what other tools did you use to measure the performance?

Yes, this measurement was done with MLC.

I see. Do you guys have any performance measurement result for any application?

Yes, I do actually have a bunch of application-level performance data. At the Astera Labs booth, we're showcasing some of that, and I'm happy to share more if what you're seeing at the booth isn't enough. If we can connect right after the show, I'll connect you with more data.

Okay, thank you.

You're welcome.

Yes, you indicated that CXL memory is cheaper than regular RDIMMs. I don't know if we're seeing that realistically in the prices that vendors are quoting for CXL memory versus RDIMM, but what's behind that, and what do you think is the percentage difference per capacity or whatever? What metrics would you say CXL is cheaper by?

What I'm trying to say is, if you want to go for a higher capacity, the current option is 3DS DIMMs on the local DRAM. With a CXL controller, you can use low-capacity DIMMs behind the controller, and the net solution, including the controller plus several of these low-capacity DIMMs, will come out cheaper than putting expensive DIMMs in all the local DDR sockets.

Okay, you're talking about if you have very high capacity, and you want to get the most expensive, high-density DIMMs, you could stack many of these behind.

Yes.

Got it, thank you.

Hi, so can you share some details? Which application is using CXL in Microsoft? What scale is the deployment, and what benefits can you see so far?

Yeah, so overall, right? In benchmarking varieties of applications, we see the good value with CXL, but finally, it comes to TCO, right? And certain applications, the benefit is a lot more. So, I would say any application that needs large capacity, large memory capacity, those are going to benefit more from CXL memory.

But that's theoretical, right? But I just want to know, Steve, is that a real deployment? Do we see large-scale deployment at Microsoft and get real benefit, real value?

Yeah, we definitely see a value. We are marching towards that. And you'll hear more public announcements very soon.

Okay, thank you. Yeah, one more. So, has Microsoft done any evaluation between the flat memory mode and the auto NUMA, actually, for various applications? What is your take on that?

What do you mean by auto NUMA?

So the software tiering, right?

Oh, software tiering, yeah. Actually, see, the flat memory mode operates at cache-line granularity, 64 bytes. And the software mode, we are expecting, will work in very large page sizes, maybe even megabyte pages. Okay. So it is not really an apples-to-apples comparison, right? If you're working at the same granularity, the comparison makes sense. So, again, it depends on the application. For certain applications, if you have to move a bulk of data at the same time, the software approach makes sense. But if you are going all over the place and the data is extremely random, then why move a one-megabyte page if you're going to access only one cache line out of it, right? So, it all depends on the randomness of the data. So, yeah, it all depends on the application.
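The granularity argument can be put into numbers. The short calculation below assumes a 1 MiB page as a stand-in for the "megabyte pages" mentioned in the answer; the function names are hypothetical. It shows how many 64-byte cache lines a page holds and what fraction of a migrated page is actually useful when only a few of its lines are touched.

```python
CACHE_LINE = 64          # bytes, the granularity of flat memory mode
PAGE_BYTES = 1 << 20     # 1 MiB, assumed software-tiering page size


def lines_per_page(page_bytes: int) -> int:
    """Number of cache lines contained in one page."""
    return page_bytes // CACHE_LINE


def migration_efficiency(lines_touched: int, page_bytes: int) -> float:
    """Fraction of the migrated bytes that the application actually uses."""
    return lines_touched * CACHE_LINE / page_bytes


if __name__ == "__main__":
    print(lines_per_page(PAGE_BYTES))                 # 16384
    print(migration_efficiency(1, PAGE_BYTES))        # one random line touched
    print(migration_efficiency(16384, PAGE_BYTES))    # whole page streamed: 1.0
```

For a bulk access pattern that touches the whole page, moving it once is fully amortized; for a random pattern that touches one line, the page move transfers 16384 times more data than the single line that was needed.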

Is there any application that Microsoft found that has more performance for flat memory mode? That's my question, yeah.

See, one of the best things about flat memory mode is, you don't have to change anything, right? You don't have to modify your application. You don't have to touch your OS layer, right? Not in a very significant way, right? So, if you just want to take hardware and deploy it, that's the best thing about flat memory mode. But if you, you know, all these other methods, as the previous speaker was saying, you always start with something working and then build on top of that, right? So, you start with hardware and then, if you own the application, that's the best thing you can manage through the application.

Thank you.

Hi. Could you please touch on the multi-tiering thing? Once you identify what's hot and what's cold, where do you think, for example, in a software-managed option, it is the OS who would be taking the responsibility of moving pages around versus hardware movement. If you have any opinion or data on which option works better, could you share it?

You know, again, I would say we did not have to compare. These are three different methods, and it entirely depends on the application. We see value in all three approaches. Right now, option number three is available. I mean, it's Intel; they have publicly announced it. Option number one is something that we own internally at the application level. And software-managed is something we are working towards right now. So, yeah, hard to compare. But again, my underlying message is, if you don't want to change anything at the software level, go with hardware-managed. If you see value in moving a large amount of data, megabytes, right, go with the software approach. Yeah.

Yeah, and I think maybe one thing to leave you all with is: when you're actually looking at application performance, take a look at all these memory deployment options. Right. Take a look at how you're... It's not going to be a clean answer of saying, 'This one is always best,' or 'This one's always best.' You have to do some testing. It's going to take some work to be able to do that. And it really comes down to, do you want to get 90% benefit, or do you want to eke out that last 10%? How much of that extra value are you looking for from CXL? And you'll have to maybe take a look at all the different options and tune to be able to take a look at that memory page placements to be able to get the best performance for your application. Thank you, everyone.