CDI-Info/282 at main · vaj/CDI-Info · GitHub

YouTube: https://www.youtube.com/watch?v=-47OqaCtlxM
Text:
I'm Bill Gervasi from Wolley, and I'm going to hit you with a new topic today, and that is CXL native memory. I'm one of the architects in JEDEC that has been developing DDR for the last 20 years or more. And so it's really kind of odd to hear somebody like me ask the question, do we really need DDR?

So let's start with the idea of the fabric wars. Remember CCIX and Gen-Z and all those fabrics that were holding the industry up all these years? And then finally CXL came out a couple of years back, and we're in the process of finally integrating all of this fabric stuff into one place.

And so this is great, because it came just at the time that we hit a really weird milestone. If you track what happened in the industry, when we did DDR1, back in the 1990s, you got four memory modules in every channel. And then we went on to DDR2, which doubled the speed of DDR1, but we lost a socket per channel. We went down to three modules per channel. And then DDR3 went down to two modules per channel. With DDR4, we were able to figure out some cool tricks to stay at two modules per channel. But as we go into DDR5 at 6,400 megabits per second, we hit this interesting crisis where DDR5 is going to one DIMM per channel. And we're all scrambling, wondering what happens when you cut your memory capacity in half on the servers.

So CXL comes along just at the right time to save the day.
DRAM expansion on CXL is going to be an absolutely essential way to deal with DDR5 losing all this memory capacity.

Now, the data centers are stuck with a bunch of DDR4 and DDR5 modules that they already paid for: the slower speed DDR5 and a bunch of DDR4s. What are you going to do with all those memory modules? Well, one of the things that CXL does is it isolates the memory technology behind a barrier so that you can essentially mix and match.

And so the initial introduction of an otherwise awesome technology is in the form of a CEM card, which puts some memory modules behind a CXL controller. And it's kind of a stopgap thing. But I respect the industry for developing this and coming up with a cute way to recycle old memory modules rather than throwing them into a dump somewhere and polluting the planet.

But you can see why they're having trouble. You can see why they're having to budget 75 watts or more per card, because now you have a CXL x16 interface, which is already power hungry because it needs to drive a meter's worth of cable. You have voltage regulators on the module itself. In addition, on each memory module, you have registers, you have more voltage regulators, and lots and lots and lots of DRAMs. So we're not exactly in a very power friendly situation.

And that's what I'm going to be tackling. Now, the good news is there is a halfway step, which is they're going to flush that inventory of DDR4 and DDR5 DIMMs, and they're going to come up with a far more sane solution for memory expansion where you just solder the DRAMs down. And so your CXL memory module starts looking a lot more to the system like just another DIMM or dual inline memory module. So this is good. It's moving in the right direction.

But I want to dig one level deeper. So we've seen that we have these long wires coming into a PCI physical interface, and then you're going to put the CXL transport over the top of that PCI interface. Now, inside that CXL memory controller, you're going to have a couple of DDR physical interfaces. Then, external to that, you're going to have a bunch of DDR DIMMs and voltage regulators. Okay. So you can see where there might be some power waste, but let's drill down one more level.

And that is, let's look at the architecture of all of these chips. One of the things about DDR is that it is architected to be a general purpose solution that can be put in anything from a data center through an SSD, through a network switch. And as a result, it has things like a very narrow data interface. It has a data mux that is cycling through a lot of data. And then it has to have a lot of logic for things like training that interface, and decision feedback equalization on all of those signals in case you're pushing it over 12 inches of wire. It has to have bit de-skew because of all that. FIFOs, error correction, all this stuff gets duplicated. And it gets duplicated in each memory chip. Now, if you're putting 80 DRAMs on a module, that means 80 DRAMs are all implementing this complex interface. And so this is a little bit wasteful. And so maybe there are ways that we can resolve some of these inefficiencies.

So you can see that you're adding more power every time you do one of these interface changes. You're also adding latency every time you do this. So the CXL interface adds some latency. Then you're going to pile on top of that a translation from CXL language to DDR language, decode that DDR language over on the DRAM side, do the memory core access, and then shovel it back up the line. Take that memory core data, put it out as DDR, convert DDR to CXL, and then send it back up the line. So these are all the redundancies that we're going to take a look at and see if we can make better.

Now, it's all about hit rate. The good news is that when you're really close to the CPU, the hit rates are pretty high and you're not wasting a lot of access.

However, hit rates start to drop off pretty dramatically once you get out of the CPU complex. So for example, you just put a DDR module on a CPU and your hit rates are now dropping down to, say, 82% for reads, 62% for writes. But that's while direct attached.

Now, add on top of that that we're talking about having to go to CXL as a memory expansion. And now with this thing out across the fabric, your hit rates are going to be dropping down. Maybe you're lucky if you're going to hit 65% hit rates. You start asking why you're doing open page versus closed page access. And this is even before you throw in memory pooling, where you might have a thousand CPUs or GPUs sharing a memory resource and the accesses are going to be so completely random that you will actually take a performance hit if you turn on open page mode.

So I asked a CPU architect, why are we doing some of this stuff still? Some of these things we invented back in the 1990s when we had pretty high hit rates on everything. But now we're at, you know, 64-core processors and things like that. How much performance gain are we getting for every watt that we spend? Why are we doing so many speculative operations when the hit rate on the speculative operation is dropping precipitously over the years?

And so here's an example of waste. A 64 byte cache line comes out, but what does it take to get a 64 byte cache line from a memory module? And this is even a direct attached memory module. CXL is pretty much the same problem. You have 10 DRAMs with a one kilobyte block size. And so when you access them, you're getting 10 kilobytes read out of the arrays into the sense amps, but these are capacitors. So they get discharged when you do that. So you're going to have to precharge them. So now we're up to 20 kilobytes of data movement in order to satisfy a single 64 byte cache line. That's an efficiency rating of 0.025% efficiency.
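As a rough check on that arithmetic, here is a minimal Python sketch that uses only the figures quoted above: 10 DRAMs, a 1 kilobyte block per DRAM, one pass to read the arrays out and one to restore them, and a 64 byte cache line. Treating a kilobyte as 1,024 bytes and the function names are assumptions made for illustration. With these inputs alone the ratio works out to roughly 0.3%, so the lower figure quoted in the talk presumably folds in factors that the transcript does not capture.

```python
CACHE_LINE_BYTES = 64        # useful data delivered to the CPU
DRAMS_PER_ACCESS = 10        # DRAMs activated together (figure from the talk)
BLOCK_BYTES_PER_DRAM = 1024  # assumption: "one kilobyte" taken as 1,024 bytes


def bytes_moved_per_access(drams=DRAMS_PER_ACCESS, block_bytes=BLOCK_BYTES_PER_DRAM):
    """Bytes shuffled inside the DRAMs to serve one cache line."""
    read_out = drams * block_bytes  # arrays -> sense amps
    restored = read_out             # the same data written back on precharge
    return read_out + restored


def efficiency(useful_bytes, moved_bytes):
    """Fraction of the moved data that was actually requested."""
    return useful_bytes / moved_bytes


moved = bytes_moved_per_access()
print(moved)                                          # 20480
print(f"{efficiency(CACHE_LINE_BYTES, moved):.4%}")   # 0.3125%
```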

Now let's talk about SSD access. We get these four kilobyte blocks, but on average, we're only using a hundred bytes out of that. So now we're wasting about 97.5% of what we move, because we're at 2.5% to 3% efficiency on those operations. So that's when we know there's a problem and that we have to change.
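The SSD figures can be checked the same way. This is a small Python sketch under the assumption that a "four kilobyte block" is 4,096 bytes; the 100 bytes of useful data is the average quoted in the talk.

```python
BLOCK_BYTES = 4 * 1024  # assumption: 4 KB block = 4,096 bytes
USED_BYTES = 100        # average useful bytes per block (figure from the talk)

used_fraction = USED_BYTES / BLOCK_BYTES
wasted_fraction = 1 - used_fraction

print(f"used:   {used_fraction:.2%}")    # used:   2.44%
print(f"wasted: {wasted_fraction:.2%}")  # wasted: 97.56%
```

That lands at the low end of the 2.5% to 3% range mentioned above.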

So you build a data center out of these kinds of concepts, and you can see that you start adding up these waste factors. And in the Department of Energy report that I helped write over the last couple of years that's going to the U.S. Congress in September 2024, what you find is that the data centers are generously rated at 0.00004% efficiency, where efficiency is measured by data worked on versus data moved around arbitrarily.

So who cares about this efficiency? Well, the Department of Energy sure does, because if you look at that orange line, that's the capacity of the planet Earth to produce electricity. And those blue and purple lines are what we're using for data centers and other types of communication. If we don't put an efficiency program in place where those dotted green lines are, we're going to be in a world of hurt around the year 2055 when those lines cross.

And as you can see here, the little flat section in the middle, that's standard processing. But those four little lines that are on a logarithmic scale and consuming a million, million times more power than other types of technologies, that's artificial intelligence and cryptocurrency. When we started writing the Department of Energy report, cryptocurrency was 0.6% of world energy use. The latest estimates have it closer to 0.8% of all of the world's power being used just for cryptocurrency. This is not a sustainable model.

But the good news that came out of this is that we're not alone. It's not like Wolley invented these ideas. The data center owners like Google and Microsoft are waking up to the fact that this is happening and that energy is a major part of the cost of a data center. So they're putting people in place who do total cost of ownership analysis and actually have the ability to make decisions about it, instead of the old model, which is that the procurement guy bought the cheapest shit that was out there and put it in their data centers.

So let's see what we can do to reimagine this and try to raise the efficiency. Right now, I showed you that model where you're doing a decode of a CXL packet, converting it to DDR and so forth, right?

Well, the CXL packet has everything we need to be a memory controller. It has everything. It has an address, it has data, it has metadata, it has a read or a write command. What else do we need? Can't we just eliminate that DDR in the middle and drive a memory core directly off of a CXL packet?

This is how it's done. Here's the architecture of a CXL on PCIe interface. It decodes the packet that comes in, extracts that information, and then drives the memory directly. You can translate an address to banks, rows, and columns. You can get the data and move it across, and you get a little freebie. The CXL.io protocol allows for an interesting out-of-band backdoor for doing more advanced things like selecting your refresh regions or for memory fill operations that are currently impossible over the DDR interfaces.
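To make the idea of translating an address to banks, rows, and columns concrete, here is a minimal Python sketch. It is not the CXL flit layout and not Wolley's design: the `MemRequest` and `CoreAddress` types, the field names, and every bit width below are hypothetical, chosen only to show how a request carrying an address, data, metadata, and a read or write command could be split into memory core coordinates without a DDR command stage in between.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemRequest:
    """Hypothetical request extracted from an incoming packet."""
    address: int        # physical byte address
    is_write: bool      # read or write command
    data: bytes = b""   # 64-byte payload on writes
    metadata: int = 0   # metadata bits carried alongside the data


@dataclass(frozen=True)
class CoreAddress:
    bank: int
    row: int
    column: int


# Assumed geometry, for illustration only.
LINE_BITS = 6     # 64-byte cache line offset
COLUMN_BITS = 7   # 128 lines per row
BANK_BITS = 4     # 16 banks
ROW_BITS = 16     # 65,536 rows per bank


def decode(address: int) -> CoreAddress:
    """Split a byte address into bank, row, and column fields."""
    line = address >> LINE_BITS                # drop the offset within the line
    column = line & ((1 << COLUMN_BITS) - 1)
    line >>= COLUMN_BITS
    bank = line & ((1 << BANK_BITS) - 1)
    line >>= BANK_BITS
    row = line & ((1 << ROW_BITS) - 1)
    return CoreAddress(bank=bank, row=row, column=column)


request = MemRequest(address=(5 << 17) | (3 << 13) | (9 << 6), is_write=False)
print(decode(request.address))  # CoreAddress(bank=3, row=5, column=9)
```

Placing the column bits lowest keeps consecutive cache lines in the same row, which is only one of several possible mappings; a real controller might interleave banks differently.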

Break it down a little bit deeper, and it kind of looks like HBM if you think about it. What if you were to take HBM, which has a base logic die and a bunch of naked memory die on top of that base die, but throw away the HBM protocol that takes 4,000 signals and instead make a 32-wire interface of PCIe? And you get the idea of where we're going with this. So this is: take a 32-pin interface, take that serial protocol, translate that to memory accesses, and then take all of the logic out of the DRAM. The DRAM should only have sense amplifiers, pre-charge capability, row and column decoders, and one thing that we do have to throw in is the row activation count for row hammer handling.

So now we still have all of those things that you need for a memory interface. You still have to have bit de-skew and all that other stuff. You still have to detect your row hammering, but instead of putting it in 80 memory chips, you put it in one ASIC, and now this one ASIC can eliminate all of that redundancy in those 80 DRAMs. You make the interface very wide, like 512 bits plus ECC, and you can have the ECC cover metadata, which is not something that is easy with today's DDR solutions. So now what you're talking about is a low frequency memory core that's transferring a flit in every clock.

Data efficiency of this is 2,000 times that of a CXL DDR interface, because now, instead of accessing 20,000 bytes to get 64 bytes, you're processing 64 bytes on a read and then a writeback.

So I have this in another one of my talks, but a quick summary of what we want to do with this is make a very small form factor module, roughly the size of an M.2 module, like what we put in our systems as SSDs. And we address some of the limitations of the M.2 to allow it to go up to gen 6 PCIe speeds. We reduce the power envelope. Now, we're not trying to put 75 watts of DRAM into each module, but we don't have to, because this thing is really small. We're talking about being able to put, say, 16 gigabytes in one inch by one inch. And with 32 pins, this is much smaller than even a DDR interface, which is 300 pins. So you can imagine packing nearly 10 of these interfaces in the same pin count on a CPU that you would require for one DDR interface.
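The "nearly 10" figure follows directly from the two pin counts quoted above. A short Python sketch, with the constant names chosen here for illustration:

```python
DDR_INTERFACE_PINS = 300    # pins for one DDR interface (figure from the talk)
NATIVE_INTERFACE_PINS = 32  # pins for one CXL native interface (figure from the talk)

ratio = DDR_INTERFACE_PINS / NATIVE_INTERFACE_PINS
whole_interfaces = DDR_INTERFACE_PINS // NATIVE_INTERFACE_PINS

print(ratio)             # 9.375
print(whole_interfaces)  # 9
```

So nine complete 32-pin interfaces fit in the pin budget of one DDR interface, with a few pins left over.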

Now, my buddy Tom Schnell at Dell has been co-sponsoring this work with me, and he did some great justification as to why this is interesting. He's facing the situation where DDR5 has this slot problem. While DDR5 maxes out at 5600, he's currently selling two memory modules per channel with, say, a 64-gigabyte configuration for that notebook. When DDR5 goes to 6400 and it's restricted to one DIMM per channel, he's now going to have to tell his customers, you either have to run your memory at 5600 to get the full configuration, or cut your memory capacity in half to get to the new, higher speeds that DDR5 is capable of running. This is not a very attractive situation for the customers. So Tom has been working with me on saying, well, gee, if we had a flex memory module, this CXL native memory approach, where we could put a memory module in the system over 32-pin interfaces, we can let the DDR5 go to one DIMM per channel, and supplement that memory capacity with some number of flex modules, which are much smaller, and we can pack more of them on the system. We could actually have higher capacity memory for the customer, and they get the benefit of DDR5 running at full speed. And so, sad users become happy users, and we can now have a situation where flex modules, this CXL native memory, are complementary to DDR5. They don't necessarily replace it.

So with DDR5 hitting this capacity wall, CXL allows for memory expansion to compensate for that loss, and more importantly, it allows DDR5, now at one DIMM per channel, to run at full speed, which I'm not supposed to tell you is now being pegged at 9200 megabits per second as end of life. CXL memory modules are not very power efficient, but we can do a lot to improve that. We can take away the DDR interfaces to improve that energy efficiency and drive the memory cores directly off of the flit without DDR consuming time and adding latency. CXL native modules can bring these CXL solutions down to motherboards and even to notebooks with a low pin count way to expand memory, which is great, as artificial intelligence and other applications are demanding that we add more memory to the system, not less.

So thank you for your time. I've included my contact information here. I'm happy to add you on LinkedIn or take emails from you guys and address any questions you may have about what I've presented. Thank you.