All right, good afternoon. My name is Ron Swartzentruber. I'm from Lightelligence, and I'm going to be talking to you about optical CXL as an interconnect for large-scale memory pooling.
First we're going to talk a little bit about large language model growth and the memory-centric shift in the data center, then the need for optical CXL: why do we need this technology? And then a case study. We wanted to prove the benefits of this technology to ourselves, so we built a case study, and I'm going to talk about that.
So first off, why disaggregation? What we're finding is that the CPU is no longer the dominant resource in the data center; it's memory, and access to memory, that is the challenge. Furthermore, there are applications now that define the hardware they run on. As a result, the data center architect has a little more freedom to design the data center the way they need to, and so disaggregation was born.
What we're also seeing, and this chart is getting a little bit old by now, is that with large language model growth there appears to be no end to the growth of these models, and as a result memory disaggregation is required to meet the needs of AI model processing. Furthermore, it's not just access to the memory, it's the latency to the memory. What you can see on this chart is that CXL memory is just a single NUMA hop away from your main memory, within several hundreds of nanoseconds, so CXL becomes the required interconnect to access this memory. When you're looking at something like SSD or network-attached memory, your latencies are up in the single to multiple digits of microseconds, which becomes prohibitive for large language model processing. So what's really needed is a CXL memory interconnect with optics enabled, to extend the reach of your memory bus.
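The latency tiers just described can be sketched numerically. This is an illustrative sketch, not data from the talk: the exact nanosecond figures below are assumptions chosen to match the rough magnitudes mentioned (hundreds of nanoseconds for CXL one NUMA hop away, single to multiple microseconds for SSD or network-attached memory).

```python
# Illustrative memory-tier latencies (assumed values matching the
# magnitudes mentioned in the talk, not measured figures).
TIER_LATENCY_NS = {
    "local DRAM": 100,        # assumed typical local access
    "CXL (1 NUMA hop)": 300,  # "several hundreds of nanoseconds"
    "network-attached": 2000, # low single-digit microseconds
    "SSD": 10000,             # multiple microseconds and up
}

def slowdown_vs_local(tier: str) -> float:
    """Relative latency penalty versus local DRAM."""
    return TIER_LATENCY_NS[tier] / TIER_LATENCY_NS["local DRAM"]

for tier, ns in TIER_LATENCY_NS.items():
    print(f"{tier:>18}: {ns:>6} ns ({slowdown_vs_local(tier):5.1f}x local)")
```

Under these assumed figures, CXL sits within a small single-digit multiple of local DRAM latency, while SSD and network-attached memory are one to two orders of magnitude further away, which is the gap the talk calls prohibitive.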
Today, RDMA is largely the remote memory interconnect used in the data center. The challenge with RDMA is, again, the latency: the traffic goes through the NIC, there's FEC involved, and it's simply prohibitive for these applications. What's preferred is a memory interconnect using CXL over optics, which basically cuts out one stage of the latency equation.
A little bit about CXL, for those of you who aren't aware: there are 250 member companies, and it's been widely adopted by all of the big names you see here. If you don't know about CXL, I'm sure you know about PCI Express. CXL basically adds memory and cache-coherency functions on top of the PCIe interconnect, enabling it to be a fast, effective memory interconnect.
So why optical CXL? Well, first off, signal loss over copper is extremely high. Even in a Gen 5 situation, copper can maybe extend 2 to 5 meters. Compare that to optics, which can go 30 to 50, even 100 meters (which is likely more than you'll need, but certainly 10 to 30 meters if needed to connect to your remote memory). Furthermore, the cross-sectional area of a copper cable is pretty massive. If you've ever seen a Gen 5 cable (TE showed theirs just a few hours ago), it's quite large compared to the cross-sectional area of fiber, which is much smaller by comparison.
So what's needed is a memory interconnect that can break through the rack: basically, extend the reach of your CXL memory interconnect across multiple racks and even through the data center, so you can fully disaggregate your data center architecture. It's no longer confined to a single rack with a few meters of copper cable.
Okay, so that's the premise; let's talk about the study. We went out to prove why this is advantageous and to show people the benefits of optical CXL. What we did was build a case study using large language model inference. What you can see here is a Supermicro server on the left-hand side equipped with an AMD Genoa processor, which supports CXL 1.1. We made use of the MemVerge memory tiering software; I'll talk about that a little during the results section. We have an A10 GPU running the LLM inference, and that is connected to a PCIe CXL card, which in turn is connected by two 24-fiber MPO multimode cables to a second card in a memory expansion box that we purpose-built for the demo. That memory expansion box is a fairly simple box: it holds a CXL-over-optics card along with an FPGA connected to two CXL memory expanders. In this case we're using the Samsung 128-gigabyte expanders, and each of those has a Gen 5 x8 link. So that's the topology of the demo, and what we set out to test was: what happens if we put our large language model right here on the SSD, versus putting that large language model 30 meters away, connected by CXL, in one of these memory expanders?
For the purposes of this demonstration, we chose OPT-66B. The reason we chose that model is that it fits in a single CXL memory expander; it's 128 gig. And the workload we gave it was news text summarization.
Okay, so here are some of the results. What you care about when you're summarizing a big block of news text is how fast you can get that result. What we found was a disk decode throughput of close to two tokens per second, compared to 4.8 for CXL memory. The reason for that difference is that the latency required to access the SSD is much higher than the latency to access the CXL memory. Comparing CXL to system memory, it's roughly 70%, so it's in the same order of magnitude. What data center architects don't want to do is put these large language models in system memory, because then it's completely consumed by the workload. With the MemVerge software, they add a 60/40 policy, so 60% of the model is stored in CXL versus 40% in system memory, and you can see the performance is almost equivalent to that of system memory. But the most important number here is the 2.4x using the CXL memory alone.
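The headline numbers above can be checked with a little arithmetic. The SSD and CXL throughput figures below are the ones reported in the talk; the system-memory figure is derived from the stated "roughly 70%" ratio rather than read off the chart, so treat it as an estimate.

```python
# Decode throughput figures reported in the talk (tokens/second).
ssd_tps = 2.0   # "close to two tokens per second"
cxl_tps = 4.8   # CXL memory expander over optics

# Derived, not reported: CXL was said to be ~70% of system memory,
# so back out an estimated system-memory throughput.
sys_mem_tps = cxl_tps / 0.70

speedup = cxl_tps / ssd_tps  # the headline CXL-vs-SSD number
print(f"CXL vs SSD speedup: {speedup:.1f}x")            # 2.4x
print(f"Estimated system-memory rate: {sys_mem_tps:.1f} tok/s")
```

The 60/40 tiering policy then lands between the CXL-only and system-memory-only points, which is why the chart shows it nearly matching system memory.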
So this is the workload. As I said, news text summarization: you take a block of text and you summarize it. That's what the OPT-66B model is doing for us. This is a dramatization (it takes a little longer than this), but essentially that's the workload: take the news text and summarize it. You can imagine your 6 o'clock news anchor. He doesn't want to have to read the whole thing; he wants the summary. Well, here it is.
Okay, some additional results. What you can see from this chart is that decode throughput is 2.5x better using CXL memory versus the SSD. Our GPU is now fully utilized at 95%; the reason it's not fully utilized with the SSD is that there's just a lot of memory movement going on. Our CPU is fully utilized when running from SSD versus about 50% utilized for CXL memory, so the CPU can now go off and do other things. And of course, our CXL memory is fully utilized in the memory expander case. The chart at the bottom shows the progress of the model as it runs. What you can see is that CXL and disk basically track each other at first; that's because the model is being cached in GPU memory, so the startup time is very similar. But as the cache runs out, in the case of the disk it has to go back to the disk and fetch more data, and that's where the CXL memory shines, because it's just lower latency, faster access.
So in summary, we showed that CXL memory offloading is efficient and beneficial: similar performance to system memory (about 70%, as I said); most importantly, a 2.4x throughput advantage in tokens per second; and improved TCO.
So, on to the practical aspects: the products you can purchase today. There's a low-profile PCIe card to get folks started; this is your ubiquitous form factor that plugs into any server, any rack. We've also developed an OCP 3.0 NIC form factor, so that is available, as well as active optical cables. With the AOCs, this is currently a custom ASM-type connector, but we're developing CDFP, QSFP-DD, and others as customers request them. One thing you'll notice is that there's a difference between the cards and the AOCs. The cards provide an extra SerDes to do signal integrity cleanup and jitter reduction, basically recovering the clock and reproducing it, so they have approximately 20 nanoseconds of latency through the card. The active optical cable, though, is more of a linear, LPO-type design: there's no added SerDes, so that latency is under a nanosecond.
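To put the two latency figures in context, here is a hedged sketch of the one-way link latency over a 30-meter run. The ~20 ns (retimed card) and ~1 ns (linear AOC) electronics figures come from the talk; the ~4.9 ns/m fiber propagation delay is a physical-constant assumption (speed of light in glass), not a number from the talk.

```python
# Propagation delay in optical fiber: roughly c divided by the fiber's
# refractive index (~1.47), i.e. about 4.9 ns per meter. This constant
# is an assumption for illustration, not a figure from the talk.
FIBER_NS_PER_M = 4.9

def one_way_latency_ns(reach_m: float, electronics_ns: float) -> float:
    """Fiber flight time plus the electronics (SerDes/retimer) delay."""
    return reach_m * FIBER_NS_PER_M + electronics_ns

for name, elec_ns in [("retimed PCIe card", 20.0), ("linear AOC", 1.0)]:
    total = one_way_latency_ns(30, elec_ns)
    print(f"{name}: ~{total:.0f} ns one-way over 30 m")
```

The takeaway under these assumptions: at 30 meters, fiber flight time (~150 ns) dominates either electronics option, which is consistent with the talk's claim that CXL memory stays within several hundreds of nanoseconds even at rack-escape distances.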
And I think that's about it. If you'd like more information, we have a recording of the demo at our booth, C18, and you can come talk to us more about the product.