

Welcome, everybody. I like that God voice. So hi, I'm David McIntyre, and I'm from Samsung. How many of you spent the morning at all the keynotes like I did? It was pretty exciting. I'll try to reflect some of the comments that we heard from the executives who gave those keynotes. CXL is right in the thick of much of that messaging.

And as an example, I'll jump right into my presentation. So AI and ML certainly are the dominant applications that we see today. And as we also learned this morning, large language models are driving the need for more and more memory. And there are other applications, too. Don't forget in-memory databases, which are becoming very important, and I'll reflect on that application in a little while. But I think we all understand this. With billions and billions, if not trillions, of parameters, we need a place to put those parameters, don't we?

So my premise is that the acceleration solutions out there today are, for the most part, compute-driven. Of course, they have HBM, but it's limited. And the construct of CXL, where you can expand beyond the local proximity of your resources, is really interesting and compelling.

I put this slide together just as a high-level description of the concept. If you've heard me speak before, you know I'm always about being driven by the application end user. Because at the end of the day, regardless of what we invent here, the end users need to buy it and implement it, right? So from an application palette, the compute, memory, and storage resources need to be equally balanced to provide the best performance and the most power-efficient solutions. And as we heard this morning too, performance per watt is a key indicator. I didn't know that riding a bike is complementary to what we're doing here. Both are measures of performance per watt. I think I learned that from the gentleman from Meta.

So on the compute side, we have a host-local orientation: the compute sits close to, if not exactly where, the application resides. But you also heard a little bit about UCIe and how chiplets can provide a more efficient compute construct moving forward. For memory, Optane left us a couple of years ago, although it's still out there. But customers are crying out, now that the market is actually ready for a persistent memory solution: how is the community embracing that? And I'll explain how Samsung is taking a leadership opportunity to enable that type of solution. On CXL coherency, we'll talk a little bit about that, where it provides both shared and semantic attributes. And then storage still needs to be available, not only as a last level cache for the very interesting memory tiering, but also scalable and still able to support block, file, and object.

So we also learned about the memory wall. Basically, this is the bottleneck between processor resources and memory capacity. And so breaking through the memory wall, I would suggest that that's a very key opportunity for us to address these emerging applications that are constrained in performance, if not in efficiency, because the data has trouble moving, or there simply isn't enough memory to move that data from the memory into where the compute function takes place. So that leads to the bandwidth wall, the latency wall, the capacity wall, and the power wall.

So this one is quite illustrative. We've all seen this triangle, whether it's at Flash Memory Summit or OCP. This hierarchy of memories down to storage is what we've seen for decades. So from L1 and L2 cache on the CPU, to HBM, to DRAM main memory, to flash media, and then even down to hard disk drives, and even tape still, because it's one of the cheapest forms of remote storage. And you can see that the less expensive the tier, the bigger the performance penalty you pay. You'll have increased latency. But from a capacity standpoint, that's what we're very familiar with. And it works very well in this hierarchy for traditional workloads.

But now in today's age, where we're focusing on AI, that becomes the challenge. Before, we had a mix of hot and cold data. Now we have this big influx where everything is hot data. I need to know, from a video analytics standpoint, is that a threat that's just walked into a shopping mall? A collision avoidance system needs to understand right now; that's real-time data processing. So that has shifted the paradigm, where now we need to maybe take a look at rebalancing.

And so we can do that with CXL as a means of addressing latency, cost, and bandwidth issues by having a new mix of tiering that supports the last level cache, which can be a lower performance memory: not the host-based high-performance memory, but perhaps a complementary CXL memory pool. We can have a memory expander that enables that. And then we can have a tiered memory solution that supports both. We can also have that petascale SSD, where we can actually put a petabyte of storage within an appliance and use that as your traditional storage vehicle, or also to support a last level cache for the application.
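The tiering idea above can be sketched in a few lines. This is a toy model, not Samsung's design: the `TieredStore` class, its LRU eviction policy, and the capacities are all assumptions made for illustration. A small fast tier (standing in for host memory) fronts a large capacity tier (standing in for a CXL memory pool or SSD), and reads that miss in the fast tier are promoted into it.

```python
from collections import OrderedDict


class TieredStore:
    """Toy two-tier read path: a small fast tier in front of a large slow tier."""

    def __init__(self, fast_capacity, backing):
        self.fast_capacity = fast_capacity  # entries the fast tier can hold (assumed)
        self.fast = OrderedDict()           # insertion order doubles as LRU order
        self.backing = backing              # dict standing in for the capacity tier
        self.hits = 0
        self.misses = 0

    def read(self, key):
        if key in self.fast:
            self.fast.move_to_end(key)      # mark as most recently used
            self.hits += 1
            return self.fast[key]
        self.misses += 1
        value = self.backing[key]           # slow path: fetch from the capacity tier
        self.fast[key] = value              # promote into the fast tier
        if len(self.fast) > self.fast_capacity:
            self.fast.popitem(last=False)   # evict the least recently used entry
        return value


store = TieredStore(fast_capacity=2, backing={k: k * 10 for k in range(5)})
for key in [0, 1, 0, 2, 0, 1]:
    store.read(key)
print(store.hits, store.misses)
```

With hot data, the hit rate of the fast tier is what decides how often the slower tier's latency is paid, which is why the mix of tiers matters.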

So to address the memory wall problem, we're now also talking about data-centric computing. That is, instead of moving the data back and forth, back and forth, which is very expensive and consumes a tremendous amount of power, we're now able to look at putting compute where the data resides.

And the benefit in doing so is that we're able to achieve lower power computing. We're not moving the data around as much. So in fact, we're freeing up that pipe with a higher effective bandwidth. And we're able to scale our compute resources from cloud to edge for more of a distributed compute network, if not a heterogeneous composable network as well.
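To make the data-movement argument concrete, here is a minimal sketch that counts the bytes crossing the host link for the same filter run in two places. The function names, the row format, and the byte counting are hypothetical; real devices and links are far more involved.

```python
def host_side_scan(rows, predicate):
    """Move every row to the host, then filter there."""
    bytes_moved = sum(len(r) for r in rows)      # all data crosses the link
    matches = [r for r in rows if predicate(r)]
    return matches, bytes_moved


def device_side_scan(rows, predicate):
    """Filter where the data resides; only matching rows cross the link."""
    matches = [r for r in rows if predicate(r)]
    bytes_moved = sum(len(r) for r in matches)   # only results cross the link
    return matches, bytes_moved


# 100 synthetic rows; every tenth row is an "err" record (made-up data).
rows = [b"err:%04d" % i if i % 10 == 0 else b"ok:%04d" % i for i in range(100)]


def is_error(row):
    return row.startswith(b"err")


host_matches, host_bytes = host_side_scan(rows, is_error)
dev_matches, dev_bytes = device_side_scan(rows, is_error)
print(host_bytes, dev_bytes)
```

Both paths return the same rows; the difference is how much data had to move to get them, which is the effective-bandwidth point made above.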

So I put together this table just as a reminder that CXL goes hand in hand with PCI Express. In fact, as we know, the three protocols reside over PCIe Gen 5 today and PCIe Gen 6 tomorrow. But if we look at the accelerated introduction of CXL into the market, the CXL spec went from 1.0 to 3.0 in three years, versus PCI Express, which took at least seven years. But CXL, because it's riding on PCI Express, is able to benefit from all the great work that the PCI-SIG has done over the years.

From an OEM adoption standpoint, PCIe, I would suggest, was more of an evolution from PCI. Whereas with CXL, it's really opportunistic and driven by the CPU manufacturers releasing CXL-supported processors, as well as the controller technology, the switches, and all the devices that have to come in to provide a CXL solution. So in fact, the CXL spec itself is closely coupled to the applications that CXL is primarily targeting. And that's the priority. From a market timing perspective, PCIe has pretty much always been a performance-based evolution, as we know, whereas CXL, because it's focused on the applications, is looking to provide TCO optimization across the three different CXL protocols.

And from an ecosystem standpoint, I mentioned silicon, servers, controllers, switches. Back with PCI Express, Intel and the OEMs worked hand-in-hand, not exclusively, but pretty much as the dominant forces to release PCI Express into our industry. Whereas today, it's actually the hyperscalers, along with Intel and AMD and the OEMs, that are working hand-in-hand to align these three protocols together. And on the competing options: back in the day with PCI Express, we had InfiniBand, HyperTransport, Gen-Z, CAPI. Now we're all aggregating those standards into one home, with over 250 member companies waving the same flag.

So I mentioned data-centric computing before. Samsung is looking to put data-centric computing in the memory devices themselves for compute operations, such as multiply/accumulate functions. We also have an attached accelerator and a memory expander as available products, as well as a tiering solution, which I'll go into in more detail.

So if we look at the CXL memory module types, what Samsung has done from a product planning perspective is look at the type 3 and type 2 devices. Type 1 devices are typically SmartNICs and the like, on the networking side. But the memory expander, the CMM-D, the CXL Memory Module-D, allows us to do what I talked about earlier: memory pooling, memory expansion, these sorts of things. The CMM-H takes it a step further, and now we're providing both .mem and .io pathways to go beyond just a storage device. We're adding storage into the equation now, but also having a cache memory, which is accessible. And then the CMM-HC is an accelerator-attached solution, where we're now bringing compute into the equation as well. And I'll go into a little bit more detail about this.

So in fact, the first device I'll talk about in this presentation is the CXL Memory Module-H, the CMM-H. This is really targeting expanding the capacity and utilization of memory for AI applications. It provides better TCO and smaller-granularity access. So instead of having to write to an NVMe device in blocks, you can now actually read and write at byte granularity. And it also provides persistence. Remember I talked about the opportunity since Optane has left. NVDIMMs are still providing embedded-type solutions, but we wanted to go well beyond that. And so this is what the CMM-H does for us.
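The granularity point can be illustrated with simple arithmetic. The sketch below assumes a 4 KiB block for the block path and a 64-byte cache line as the unit for the memory path; both sizes and the 128-byte request are assumptions for illustration, not figures from the talk.

```python
def bytes_transferred(request_bytes, granularity_bytes):
    """Bytes actually moved when an aligned request is rounded up to the access unit."""
    units = -(-request_bytes // granularity_bytes)  # integer ceiling division
    return units * granularity_bytes


BLOCK_BYTES = 4096  # assumed block size for the block I/O path
LINE_BYTES = 64     # assumed cache-line size for the memory path
REQUEST = 128       # hypothetical small lookup, e.g. one embedding row

block_bytes = bytes_transferred(REQUEST, BLOCK_BYTES)
line_bytes = bytes_transferred(REQUEST, LINE_BYTES)
amplification = block_bytes // line_bytes
print(block_bytes, line_bytes, amplification)
```

Under these assumptions the block path moves 32 times more data than the request needs, which is the overhead that small-granularity access avoids for workloads made of many small reads.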

So there are two options with the CMM-H. The first is the tiered memory option, which is this .mem and DRAM cache availability, all within the same package as the SSD. So you have the NAND media storage as well as the DRAM cache inside. And then the second option is the persistent memory option. It operates quite similarly to how traditional NVDIMMs operate. But this is just one of many features that the CMM-H offers. The GPF is the Global Persistence Flush. That is a CXL-specific command that enables the persistence. And today we're operating with a small external battery, but from a product planning standpoint, we're also investigating how we can integrate a power source or energy source within the device. So we're looking at different variants there.

And then if we look at some of the outcomes of the CMM-H for this AI recommendation system: with the I/O-based system, we can see how much inference performance per second we're getting here, almost 5,000 inferences per second. If we move to the host-based software approach with caching local to the host, now we've quadrupled. We're up to about 16,000, almost 17,000 inferences per second. But if we apply the CMM-H topology, combine both compute density and memory density, and complement what's being performed at the host with this caching device on the CMM-H, we're now able to achieve almost 7x performance versus the I/O-based solution. So persistence is just one feature. This is another very compelling feature for CMM-H.
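As a rough check on how these figures relate, the sketch below works through the ratios using the rounded numbers as spoken ("almost 5,000", "almost 17,000", "almost 7x"). The exact slide values are not in the transcript, so every input here is an approximation and the results are only indicative.

```python
# Rounded figures as quoted in the talk; the exact slide values are unknown.
io_based_ips = 5_000       # "almost 5,000" inferences per second
host_cached_ips = 17_000   # "almost 17,000" inferences per second
cmm_h_speedup = 7          # "almost 7x" versus the I/O-based system

host_speedup = host_cached_ips / io_based_ips       # host caching vs I/O-based
implied_cmm_h_ips = io_based_ips * cmm_h_speedup    # implied CMM-H throughput
cmm_h_vs_host = implied_cmm_h_ips / host_cached_ips  # CMM-H vs host caching

print(f"host caching vs I/O-based: {host_speedup:.1f}x")
print(f"implied CMM-H throughput:  {implied_cmm_h_ips} inferences/s")
print(f"CMM-H vs host caching:     {cmm_h_vs_host:.2f}x")
```

On these rounded inputs, the CMM-H configuration lands at roughly twice the host-cached result, which is the comparison that matters once host-side caching is already in place.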

And now the tiered memory architecture, just a quick look under the hood. Because it supports both .io and .mem, you're able to read and write directly to the NAND flash media as you would with a standard NVMe SSD. Or you can read back from the DRAM cache as a caching device without having to go through the NAND flash itself. So it's a dual-mode operation. And in the future, we'll be looking at read and write performance on the cache as well.
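The dual-mode behaviour can be pictured with a toy model. Everything here is hypothetical: the `DualModeDevice` class, its method names, the 4 KiB block size, and the fill-on-miss policy are illustrative assumptions, not the CMM-H's actual interface or firmware behaviour. One path does whole-block reads and writes against the NAND, and the other serves small loads out of a DRAM cache.

```python
class DualModeDevice:
    """Toy device with a block path to NAND and a byte-granular path via DRAM cache."""

    BLOCK = 4096  # assumed block size in bytes

    def __init__(self, num_blocks):
        self.nand = bytearray(num_blocks * self.BLOCK)  # stands in for flash media
        self.cache = {}                                 # block index -> cached bytes
        self.nand_reads = 0                             # counts trips to NAND

    def io_write(self, lba, data):
        """Block path: write one whole block to NAND, as with an NVMe SSD."""
        if len(data) != self.BLOCK:
            raise ValueError("block writes must be exactly one block long")
        start = lba * self.BLOCK
        self.nand[start:start + self.BLOCK] = data
        self.cache.pop(lba, None)                       # drop any stale cached copy

    def io_read(self, lba):
        """Block path: read one whole block from NAND."""
        self.nand_reads += 1
        start = lba * self.BLOCK
        return bytes(self.nand[start:start + self.BLOCK])

    def mem_load(self, addr, length):
        """Memory path: small load served from the DRAM cache, filled on a miss."""
        lba, offset = divmod(addr, self.BLOCK)
        if offset + length > self.BLOCK:
            raise ValueError("this sketch does not handle loads that span blocks")
        if lba not in self.cache:
            self.cache[lba] = self.io_read(lba)
        return self.cache[lba][offset:offset + length]


dev = DualModeDevice(num_blocks=4)
dev.io_write(1, b"A" * DualModeDevice.BLOCK)
first = dev.mem_load(DualModeDevice.BLOCK + 10, 4)   # miss: fills cache from NAND
second = dev.mem_load(DualModeDevice.BLOCK + 20, 4)  # hit: no NAND access
print(first, second, dev.nand_reads)
```

The second load never touches the NAND, which is the "read back from the DRAM cache without having to go through the NAND flash" case described above.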

And then let's go back to the data-centric computing concept. So we talked about the traditional model where you have the host CPU, which is basically working with the peripheral memory, the storage. But now if we can put that CPU directly where the data resides, we're not having to move the data around as much.

And as a result, we can now think about the .mem, the .io, and the .cache, all three protocols, within the same device. So that allows us to provide computational storage. We now call it data-centric compute in one Samsung device.

And so here it is. This is the accelerator-attached memory solution. You're probably familiar with Smart SSD. So what we're doing is evolving Smart SSD to embrace CXL. And with that, that allows us to put the accelerator within the device itself and also have DRAM cache and the NAND flash media all in one. And you can envision having one of these CMM-HC devices in an appliance. Or you could scale. You could fill an entire appliance with them. That's up to the application and what you would like to address. But that opportunity is now here today.

And these are some examples that we ran on the second-gen Smart SSD. What happened with the second gen was we realized that CXL is moving so quickly and getting adopted so quickly by technologists that we wanted to pivot and move the Smart SSD concept into a CXL-based SSD. So with that, this is an example that's not AI. It's database acceleration. And we're showing here that total energy use has dropped significantly, by 7x, because we're computing on these Postgres scan queries where the data resides. And therefore, our end-to-end throughput has significantly increased, because we've freed up the main pathways to do other things. The CPU utilization at the host has also significantly dropped, which frees the CPU up to do other things. And that's a good thing if you remember my compute, memory, and storage balancing picture at the beginning. Because we've intentionally provided distributed data-centric compute, we've actually reduced the CPU utilization at the host by 11x.

And so that's what I have. I would encourage you to go and visit us at the Samsung booth. We have a number of CXL demos around the corner from our booth. And we have our latest technologies that we're also showing. And if you're interested in collaborating on POCs, feel free to reach out to me or any of my team out there. CXL is here to stay. And Samsung is privileged to be leading with a brand new CXL portfolio for us all to consider. Thank you.