

All right, thank you. And good presentation, Charles, and good one from Meta as well, from Chris and Prakash. So wanted to talk to you today. Sorry, first of all, my name is Ahmad Danesh. I'm from Astera Labs. I run our product strategy and product management teams. What I really wanted to focus on today is continue making the case for CXL as what I'm going to call the superhighway for AI workloads. What we saw from Meta was some of the gaps that CXL has and how to be able to address those as we move forward in the next generations of the CXL spec. Of course, we need more bandwidth. We need more memory. But what we have today already solves a lot of important problems. And it's important to continue investing in the technology as well as the standard to continue making it the right choice as we move forward.

I wanted to focus on it really from a sustainability perspective and take a look at the infrastructure as we move forward. Of course, there's a lot of investments that are being made within AI, and this investment is going to continue to increase, but we need to make sure we're doing it in a sustainable approach. So I'll start with a basic premise that if we focus on decreasing costs, that will inherently result in a more sustainable infrastructure. So if we start with understanding where our costs are when we deploy solutions, of course, there's the entire upfront design that comes in defining the standards, defining the actual product design, the architecture, of course, building it and testing it. There's the high volume manufacturing process portion, and then there's the operations in the back end. So what are the five pillars then for a sustainable infrastructure if the premise is let's try to decrease these costs? The first is open standards. When Chris and Prakash were describing why they're proposing CXL, this was because it's an open standard. It's really important that we continue investing in that space to make sure that everyone's moving in the same direction. We then need a healthy ecosystem, which then a part of that comes into a lot of the interoperability efforts that we all need to be working closely together in, and we see a big effort, of course, and what OCP is fundamentally trying to drive is creating open standards and ecosystem interoperability. So we're well along the way there. Reusing hardware as well is going to be extremely important. Let's take a look at how we're actually decommissioning hardware as we take it and be able to redeploy it into new hardware and with new applications as we move forward. And then finally, we want to do more with what we are deploying, right? 
Be able to unlock some of those bottlenecks that are there so we're actually getting the most out of the hardware that we're deploying, and then making sure that we're designing the architecture and the hardware in a sustainable approach, that we're actually taking a look at making sure that it's the most sustainable way to actually service it, especially when we're focusing on the memory itself.

So if those are the five things we need to solve, let's take a look at how CXL is doing so far and where we need to focus on as we move forward here. Where we started with CXL, or I should say coherent specifications a number of years ago, we had a number of, you know, at the time we would have never said they were competing standards, but fundamentally they were all competing in the same direction. I got some friends looking at me in the audience, nodding their head. Actually, a show of hands, how many of you were working on at least two or more of these specs at the same time and contributing to the same weekly meetings over and over again? I see one hand up, and I'm just going to assume everybody else is shy, because I remember those weekly meetings. It was fun. But what's great as we move forward, and by the way, those are only the open standards I referenced. There's of course a lot of proprietary solutions that are out there as well. As we take a look now at CXL, CCIX, Gen-Z, OpenCAPI, those have all been transferred over, the assets have been transferred over to CXL. The ecosystem community is behind CXL as that solution. So why is this great for sustainability? Everyone can just invest their own time into one standard, right? When we think about the design portion of the cost of developing these solutions, our time counts as part of that sustainability. All the time we're putting into improving those specifications is going to be important. What that also means is a singular focus that we all drive towards the same goals and driving towards lower design costs and taking a look at how do we improve the spec all together and driving in the same direction. And that improved spec will then deliver lower cost designs, lower test costs as well. We're going to then have singular solutions where we have a better ecosystem interoperability which will then improve reliability of these solutions as well that then reduces operational costs. 
So we're really down this path of the ecosystem, the entire community being behind CXL and we need to continue investing in that space.

So what does the ecosystem look like right now? It's large, it continues to grow. I can only include so many logos on here. Sorry if I didn't include your company and you're contributing to this, but we will always be looking for more people to contribute as well. So please do join. It is free to join as a member and be able to look at the specifications if you want to do so. The next is driving towards a lot of cloud scale interoperability. When we take a look at how these are being deployed in the cloud, it's going to be important that we drive towards having all of the ecosystem players coming together, driving for interoperability, driving for these large scale labs that we can actually be able to test and deploy these solutions together.

Finally, I want to touch on how to reuse server memory. This is the circular economy, and Google, AWS, Meta, Microsoft, they all have different versions of this in how they deploy data centers today. CXL really helps with this as well, where we can take decommissioned hardware, especially as we take a look at the DIMMs that are being utilized, take those out of the old servers, retest, refurbish them, then reuse them with CXL attached memory solutions.

This allows for a very sustainable approach for being able to adopt it. And then from there, we want to then do more with what we're actually deploying. Here's an example of just one workload where we're running and just adding a little bit of memory where we can see in this particular case, the memory was that bottleneck. The actual processing was not the bottleneck in this case. And all we have to do is add a couple of DIMMs to be able to unlock the performance of the solution. This particular one is just essentially running a high-performance database for in-memory database applications. But the same thing applies then when we start taking a look at vector databases that are used for AI. But what's important is as we move forward as an ecosystem, we have to understand what are the workloads that we're going to take a look at and how are we going to measure the performance of one system versus another, of one approach or configuration versus another. As an ecosystem and industry, we need to invest more into those industry standard workloads.

We are working on that as a community right now. So just a bit of a summary and some of the key areas that we're focusing on as we move forward. Let's say CXL is really well on its way to being that sustainable infrastructure for AI that we need. We need a healthy ecosystem, a lot of interoperability efforts across a lot of vendors. And we're going to start seeing big deployments starting in 2024 and then big advancements that we're going to need from what Meta is touching on in terms of the improvements of the hardware. It enables reuse of hardware investments in the infrastructure. It's going to be important for sustainability. It unlocks the performance and unlocks the memory wall that we need to be able to address those use cases. But we need help to be able to make it better. So what are the key areas? First is defining the open standards and form factors. And we have to keep serviceability and hardware reuse in mind, right? Thinking around, well, if memory fails, do I replace an entire E3.S drive of memory with the sheet metal and the DRAM and the memory controller and everything? Or do I build a solution in such a way where I can replace the DIMM and I can now deploy another solution again and I can take that hardware and be able to reuse it? So think about these approaches as you're defining those standards. We need open source AI benchmarks as well. The models have been increasing so fast that I think the industry as a whole, we haven't had time to keep up with developing good AI benchmarks to really be able to optimize those configurations and compare how one system performs against the other. And from there, AI hardware platforms that we can take a look at standardizing on, optimizing data flows and leveraging memory tiering as much as possible here, similar to how storage tiering really helped to be a sustainable approach. A lot of these areas we're already focusing in on within the OCP Composable Memory Systems group, the CMS group. 
I've put the link there as well. There is within the ecosystem, the experience zone, there's CMS demos as well. I do encourage you to check those out and look at what the community has been developing so far and talk to them about how to contribute as we move forward. With that, I'll stop to see if there's any questions.

Hi there. Jason Ruth from OLEX. As CXL has been discussed, it's been a lot about memory. But just doing a quick Google search because I'm just trying to get up to speed here. There's mention of CPU to device as well as CPU to memory. I was wondering if you could touch on the CPU to device portion of CXL.

Yeah. So CPU to device. So within CXL, there are three device types that are defined. There's a type 1, type 2, and type 3. I'll start with the type 3 because it's the easier one to think around. Type 3 is where a CPU can access the device's attached memory. So think of memory expansion. I'm going to attach some memory behind it. A type 1 device is the exact opposite where you actually have the device that's actually penetrating and accessing the host memory. And then you have type 2, which is actually a combination of both where you have a cache protocol for the device to access host memory as well as a memory protocol for the host to access the device memory. We call it -- it's really more of a host. It can be any type of host. CPU, GPU, XPU, TPUs, anything can support that protocol. And so it's memory access in both directions, depending on which type of device you are. You're welcome.
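
[Editor's sketch: the three device types described in the answer above can be summarized as a small lookup table. This is only an illustration, not code from the talk. The protocol names CXL.io, CXL.cache, and CXL.mem come from the CXL specification rather than from the speaker, and the Python names (`Protocol`, `DEVICE_TYPES`, and the two helper functions) are hypothetical.]

```python
from enum import Flag, auto


class Protocol(Flag):
    """CXL sub-protocols (names per the CXL spec; this enum is illustrative)."""
    IO = auto()     # CXL.io: discovery, configuration, PCIe-style I/O
    CACHE = auto()  # CXL.cache: device coherently accesses host memory
    MEM = auto()    # CXL.mem: host accesses device-attached memory


# Which protocols each device type uses, following the description in the answer.
DEVICE_TYPES = {
    1: Protocol.IO | Protocol.CACHE,                 # device reaches into host memory
    2: Protocol.IO | Protocol.CACHE | Protocol.MEM,  # access in both directions
    3: Protocol.IO | Protocol.MEM,                   # memory expansion
}


def device_can_access_host_memory(device_type: int) -> bool:
    """True if the device type carries the cache protocol."""
    return bool(DEVICE_TYPES[device_type] & Protocol.CACHE)


def host_can_access_device_memory(device_type: int) -> bool:
    """True if the device type carries the memory protocol."""
    return bool(DEVICE_TYPES[device_type] & Protocol.MEM)


if __name__ == "__main__":
    for t in sorted(DEVICE_TYPES):
        print(
            f"Type {t}: device->host memory={device_can_access_host_memory(t)}, "
            f"host->device memory={host_can_access_device_memory(t)}"
        )
```

[Read this way, a type 3 memory expander only needs the host-to-device direction, while a type 2 device needs both, which matches the "memory access in both directions" framing in the answer.]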

Thanks so much, Ahmad.

All right. Thank you, everyone.