Hello, and welcome to the AI Infrastructure Forum hosted by MemVerge. My name is Ronen Hyatt. I'm the CEO and Chief Architect at UnifabriX, and I'm here to talk about Memory over Fabrics, which describes our open journey from CXL to UALink in AI infrastructures. So, let's begin.
So, everybody is talking about AI, AI, AI, GPUs, and more GPUs. GPUs are compute, and people tend to forget about the effect of memory on compute. Meta is here to help us with this, showing some nice system-level radar charts that illustrate the challenges Meta has running big AI workloads over its infrastructure, such as LLMs (Large Language Models)—training and inference—and ranking and recommendation models. What we can see here is that compute, meaning GPUs, is obviously very important for running these workloads. But since these workloads tend to be very big—ranging from gigabytes to dozens of gigabytes, hundreds of gigabytes, even terabytes—and, unfortunately, tend to grow exponentially, the compute infrastructure needs a lot of memory bandwidth to consume and process these models, and a lot of memory capacity to hold them in memory. So, compute is very important, but you need the memory to fit the compute and make it run efficiently and effectively to get the required performance. Nobody wants an AI workload that runs slowly; like human beings, users do not like slow responses.
Another very familiar term in the industry is the "memory wall," or the "AI memory wall," and it shows the performance gap that is getting wider every year between compute and memory. Recall that we need compute for running the AI workloads; we also need that memory to support the compute. And if we have a gap between them, and that gap is widening every year, it means that we have a problem.
So, if we deep-dive into some data points, we can see here a few compute elements, like GPUs from NVIDIA—the H100, previous generation like the A100—Google TPUs, and some other CPUs made by Intel, and we see the slope here that illustrates the performance increase of compute over the years. Actually, it’s 3x on average every two years. And if we look at memory technologies, such as DDR, GDDR, and even HBM, we see that memory bandwidth is growing too, but at a much slower pace; it’s about 1.6x every two years. And IO fabrics, or interconnects, are much slower. And the interconnects that connect these compute elements together grow at an even slower pace in terms of bandwidth; it’s only 1.4x every two years. So effectively, if we look at a window of 20 years, the compute performance—measured in hardware FLOPS or teraFLOPS—grew by more than 60,000x in 20 years, whereas the DRAM bandwidth grew by only 100x, and the interconnect bandwidth grew by only 30x. That’s the memory wall. That’s the memory wall in terms of bandwidth.
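The compounding behind these multipliers is easy to check. Here is a short Python sketch—using only the approximate per-two-year growth rates quoted above—that reproduces the 20-year figures:

```python
# Compound the per-2-year growth rates cited above over a 20-year window,
# i.e. ten 2-year periods. All rates are the approximate figures from the talk.
PERIODS = 20 // 2  # ten 2-year periods

compute_growth = 3.0 ** PERIODS       # ~59,049x  -> the "~60,000x" figure
dram_bw_growth = 1.6 ** PERIODS       # ~110x     -> the "~100x" figure
interconnect_growth = 1.4 ** PERIODS  # ~29x      -> the "~30x" figure

print(f"compute: {compute_growth:,.0f}x, "
      f"DRAM bandwidth: {dram_bw_growth:.0f}x, "
      f"interconnect: {interconnect_growth:.0f}x")
```

The widening gap is simply the ratio of those compounded rates: compute outgrows DRAM bandwidth by roughly (3.0/1.6)^10 over the same window.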
But what does it mean practically? It’s not a real wall, right? This is where UnifabriX steps in to show, and this is something we demonstrated at Supercomputing in 2022. We took the latest and greatest server CPU from one of the leading x86 vendors, and we took an HPC benchmark called HPCG—quite a notorious benchmark that consumes a lot of memory bandwidth—and we wanted to show what happens when a CPU running an HPC workload runs out of memory bandwidth. So, what we did is start running HPCG on the CPU while engaging more and more cores. What we saw initially is linear scaling: you put more cores to work on the workload, and you get a linear increase in performance at the system level. But at some point, the graph reaches a plateau, and from that point onwards you can add as many CPU cores as you like, but you’re not getting more performance. The reason is that at this point, the memory bandwidth of the CPU is completely choked: the cores already running the HPCG workload have consumed all the bandwidth that exists on that CPU. If we engage more cores, those cores try to do more work, but they cannot, because they cannot access memory at the rate they need. In that specific case, we have stranded compute: the plateau sits at roughly 50% of the CPU cores, meaning only 50% of the cores do effective work on the workload itself, whereas the other 50% are just stranded—they are burning power but not actually contributing to the overall performance. So, this is a nice example of what happens to the effective utilization of compute elements such as CPUs when running demanding workloads that need a lot of memory bandwidth, and these could be HPC workloads or AI workloads.
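The plateau behaves like a simple roofline: performance scales with core count until aggregate demand hits the socket's bandwidth ceiling. A toy model (all numbers illustrative, not the measured results from the demo):

```python
# Toy roofline model of the HPCG experiment described above: per-core
# throughput scales linearly until aggregate demand hits the socket's
# memory-bandwidth ceiling, after which extra cores add nothing.
# The 5 GB/s-per-core and 300 GB/s-per-socket figures are illustrative.

def hpcg_throughput(cores, bw_per_core_gbs=5.0, socket_bw_gbs=300.0):
    """Achievable memory traffic (GB/s) -- a proxy for HPCG performance."""
    return min(cores * bw_per_core_gbs, socket_bw_gbs)

for n in (16, 32, 60, 96, 120):
    print(f"{n:3d} cores -> {hpcg_throughput(n):.0f} GB/s")
```

With these assumed numbers, the curve flattens at 60 cores; on a hypothetical 120-core socket, the other 60 cores would be exactly the stranded half described above.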
And the reason for this—and now we deep-dive into the structure of the silicon, in this case CPU silicon—is that the CPU has memory controllers and memory channels going out of it, but it has many, many cores. These memory channels have to feed the cores to keep them running, and it turns out that eight DDR5 memory channels are not enough to feed this many cores on this specific CPU.
And if we zoom out and look at what happened to general-purpose CPUs over the last decade, initially we might think that the total memory bandwidth of CPUs grew quite nicely, right? In 2013—roughly a decade ago—we had something like 50 gigabytes per second per CPU socket, and in 2023 we have around 500, a 10x factor. We have 12 DDR5 channels on some of the CPU models. So, it looks like there is no problem, right? The CPU gets a lot of bandwidth. But here we tend to forget that at the same time, not only did the total bandwidth of CPUs grow—the number of cores per CPU grew even faster.
So, what it means is that the average bandwidth per CPU core, if we look at the last decade, remains stagnant, meaning it didn’t grow much. It even went a bit lower. So, from 2013 to 2023, when we take the max core count SKUs of x86 CPUs as an example—general-purpose CPUs—we see that the average bandwidth per core remains around something like five gigabytes per second, only five gigabytes per second. But if we think even deeper, the effective bandwidth per core—the effective memory bandwidth per core—is actually decreasing. And there is a reason for that because you cannot really compare a CPU core of 2023 to a CPU core of 2013. The CPU cores that we have today are much stronger than what CPUs had a decade ago, and they consume—they do a lot more work—and they tend to consume a lot more bandwidth. So, effectively, the memory bandwidth per CPU core is decreasing over the years because the CPU cores themselves are getting much stronger. So, in this chart, you see the core count is growing quite rapidly, and the effective memory channel bandwidth per core is actually decreasing. So, what we see here—and this is in the context of CPUs, not GPUs—what we see here is that memory and memory bandwidth have an effect on what’s happening within the CPU. And in the HPCG benchmark case, we saw that it’s actually hurting performance—the performance of the workload.
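The "stagnant per-core bandwidth" claim follows directly from the channel math. A quick sketch, where the specific SKUs and channel configurations are my assumptions (a 4-channel DDR3-1600 part for 2013 and a 12-channel DDR5-4800 part for 2023), not figures stated in the talk:

```python
# Average memory bandwidth per core at the two endpoints of the decade
# discussed above. Channel counts, speeds, and core counts are assumed
# examples of max-core-count x86 SKUs, chosen to match the ~5 GB/s claim.
sockets = {
    "2013-era (4ch DDR3-1600, 10 cores)":  (4 * 12.8, 10),   # 51.2 GB/s total
    "2023-era (12ch DDR5-4800, 96 cores)": (12 * 38.4, 96),  # 460.8 GB/s total
}
for name, (total_bw_gbs, cores) in sockets.items():
    print(f"{name}: {total_bw_gbs / cores:.1f} GB/s per core")
```

Both land near 5 GB/s per core—total bandwidth grew ~9x while core count grew ~10x—and since a 2023 core can consume far more than a 2013 core, the effective per-core bandwidth has fallen.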
So, how do we solve that? There is one well-known solution for solving the bandwidth challenge locally, and this is HBM. And everybody knows what HBM is today because every GPU has HBM—high bandwidth memory. And some GPUs have a lot of HBM dies in the package. And what we see here is an example of a standard, off-the-shelf x86 CPU—Sapphire Rapids—equipped with four HBM dies. These are HBM2e, four dies, each 16 gigabytes in capacity. And these four dies provide to the CPU around one terabyte per second of extra bandwidth for running workloads—more than it has from its local DDR channels, which was around something like 300 gigabytes per second. The HBM added another one terabyte per second, meaning 3x the bandwidth of the DDR5 memory channels, which is a lot. So, this could help for memory bandwidth-hungry workloads.
And the same thing happens with GPUs. And this example is an NVIDIA Grace Hopper. We see the Grace CPU on the left side and the Hopper GPU on the right side. And the Hopper GPU has six HBM instances—six HBM dies. Some models include HBM3, 16 gigabytes each, totaling 96 gigabytes. And some of the newer models have HBM3e, which are a bit larger—24 gigabytes—with higher bandwidth and higher capacity, 144 gigabytes. The total is around four terabytes per second to five terabytes per second.
And at the block diagram, looking at the Grace Hopper superchip, we have the Hopper GPU with the locally attached HBMs. The capacity is not very big—96 gigabytes or 144 gigabytes—but the bandwidth is quite enormous, between four terabytes per second to five terabytes per second. The Grace CPU has a different memory technology, more common—LPDDR5X—with a bandwidth around 500 gigabytes per second. And we see here there is a very fast interconnect—the NVLink C2C—that connects between the Hopper GPU and the Grace CPU, so that the Hopper GPU, the compute elements here, have access to additional memory capacity with some reasonable bandwidth. It’s still one-tenth of the HBM bandwidth, but it’s still reasonable. And even more bandwidth is available to the GPU when going over the NVLink network or NVLink fabric to other GPUs—up to 256 GPUs. And these links provide an additional 900 gigabytes per second. So, this is an example of feeding the GPU with a lot of memory bandwidth locally through HBMs, from adjacent components such as a CPU through NVLink C2C and adding even more capacity.
So, that was a single GPU instance. But what happens when we need to build larger models like LLMs, and we need to fit the whole model into a system? So, of course, a single GPU is not enough to fit large models, and we need a system for that. We need an interconnect. We need memory fabrics. And these are memory fabrics for AI, connecting multiple GPUs together. In this case, this is an NVIDIA system with GPU clusters connected by NVLinks and NVSwitches. But this solution is not unique to AI. It has been tried before in HPC. And you can see that HPC and AI share some of the challenges and also share some of the solutions. We had memory fabrics even in HPC solutions. And this example is a CPU cluster—not a GPU cluster—built by HPE in a product called SuperDOME Flex. It’s a supercomputer. And this memory fabric is based on a UPI fabric. UPI is a proprietary Intel standard for connecting CPU sockets. And you can see here a lot of CPU sockets interconnected together in a mesh. So, each CPU has access to memory on other CPUs, and it gets extra capacity and extra bandwidth. And the same here with the GPUs connected to other GPUs over NVLink and NVSwitches, so that we can run larger models on such a system. So, summarizing it all, the GPU fabric provides a larger memory capacity for large models with extra bandwidth between nodes. But there are still challenges—some challenges that remain. We see here that in both fabrics—the GPU cluster and the CPU cluster—the ratio between memory and compute is a fixed ratio, meaning the memory is tightly coupled to the CPUs or GPUs. None of them can grow independently. Let’s say we need larger models. We want to add memory, but we don’t need more GPUs. Here, we don’t have a way to add memory without adding more GPUs and CPUs. So, this is one challenge. And this challenge also has an effect on the affordability or cost-effectiveness of the system. 
Such AI clusters may be affordable for training, but they could be very expensive for inference, and inference is more cost-conscious than training.
One solution for that could be using other types of memory that are adjacent to the GPU. We can make it work better. We can outsource GPU memory to CPU memory—do a spillover—and this is particularly useful with LLM inference, which is partitioned into layers. And in most cases, the GPU needs to work on a specific layer and needs that layer in the GPU memory, whereas other layers—previous layers—could be evicted out of the GPU memory to the CPU memory, swapped out, and the next layers could be prefetched and swapped in to the GPU memory just before the GPU needs them. So, this is a nice model. It works. And let’s see what happens when we do that.
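The swap-out/prefetch pattern described above can be sketched in a few lines. This is a minimal illustration only—real systems would use asynchronous device copies (e.g. CUDA streams with pinned host memory), while here plain Python dicts stand in for GPU and CPU memory:

```python
# Minimal sketch of layer spillover for LLM inference: keep only the
# active layer (plus the next one, being prefetched) resident in "GPU"
# memory, dropping earlier layers back toward "CPU" memory after use.
NUM_LAYERS = 8
cpu_mem = {i: f"weights[{i}]" for i in range(NUM_LAYERS)}  # all layers on host
gpu_mem = {}                                               # small device pool

def prefetch(layer):            # host -> device copy (async in practice)
    gpu_mem[layer] = cpu_mem[layer]

def evict(layer):               # weights are read-only, so eviction is a drop
    gpu_mem.pop(layer, None)

prefetch(0)                     # warm up the first layer
for layer in range(NUM_LAYERS):
    if layer + 1 < NUM_LAYERS:
        prefetch(layer + 1)     # overlap the next copy with current compute
    _ = gpu_mem[layer]          # "compute" on the resident layer
    evict(layer)                # free device space immediately after use

print(sorted(gpu_mem))          # nothing left resident after the pass
```

At any point, at most two layers occupy device memory—which is exactly why the host-to-device link bandwidth, not device capacity, becomes the limiting factor.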
So, let’s look at two systems. One is a classic GPU cluster, such as one using the NVIDIA H100 SXM, where multiple GPUs—let’s say eight—are interconnected using NVLink. These are the 900 gigabytes per second that we saw earlier in the Grace Hopper superchip. We also have PCIe connections between the GPUs and the CPUs, and in that particular case, each GPU is connected over PCIe, with a 128-gigabytes-per-second link through a PCIe switch, to a CPU. But we see a bottleneck here: the bandwidth provided by PCIe is much lower than the numbers we saw on Grace Hopper between the Hopper GPU and the Grace CPU—much lower than that 500 gigabytes per second. Making things even worse, there is a PCIe switch here, so any GPU going up to the CPU competes with the other GPUs in the same cluster that connect to the same PCIe switch. So, such a classic GPU cluster has some bandwidth between the GPU and the CPU memory, but that bandwidth is practically low, whereas with the Grace Hopper superchip, the Hopper GPU is tightly coupled to the Grace CPU and has much higher bandwidth. These are all architectural, theoretical bandwidth numbers.
What does it really mean in terms of the cost-effectiveness of inference? What’s the memory effect on the cost per token? And this slide shows a nice experiment that Lambda did with an LLM—the LLaMA 3.1 70 billion model—as an example, running it on the same generation of GPU, the H100. First, on an SXM SKU of that GPU that connects PCIe to CPU memory, and another run on a Grace Hopper superchip—one where the H100 is connected to the Grace CPU, having much higher bandwidth. So, we can see this compute engine—these two GPUs belong to the same generation. They have roughly the same capability, the same compute capability. They have slightly different sizes of VRAM or HBM memory—one has 80 gigabytes, the H100 SXM; the other one has 96 gigabytes—but the main difference is the bandwidth that’s available to the GPU for offloading to the CPU memory. So, we can see that in both cases, Lambda offloaded almost the same amount of GPU memory to the CPU, but you can see the performance is significantly different. The Grace Hopper superchip was able to provide almost 8x better throughput in terms of tokens per second, which immediately translates to a cost reduction—like, the cost per token is 8x lower with the Grace Hopper—even though the compute engine, the GPU, belongs to the same architecture, the same generation. This is the same H100, slightly different HBM size, but the significant difference between the two systems—and we are running a single GPU here—a significant difference is the memory bandwidth that goes between the GPU and the CPU and the ability of the GPU to do memory offloading from GPU memory to CPU memory. So, the claim here was that single GPU instances are very practical and economical. However, since models are too large to fit into a single GPU memory, that’s a challenge. There are two ways to solve it. 
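The cost-per-token arithmetic behind that comparison is straightforward. The talk gives only the ~8x throughput ratio, so the hourly prices below are placeholder assumptions, normalized to be equal:

```python
# Back-of-the-envelope cost-per-token comparison for the offloading
# experiment described above. Throughputs are normalized to the ~8x gap
# cited; equal hourly pricing is an assumption made for illustration.
def cost_per_token(price_per_hour, tokens_per_second):
    return price_per_hour / (tokens_per_second * 3600)

h100_sxm_tps = 1.0   # normalized tokens/s over PCIe-attached CPU memory
gh200_tps = 8.0      # ~8x tokens/s with the high-bandwidth C2C link

ratio = cost_per_token(1.0, h100_sxm_tps) / cost_per_token(1.0, gh200_tps)
print(f"cost per token is {ratio:.0f}x lower on the higher-bandwidth system")
```

Since the GPUs are the same generation, the entire ratio is attributable to the GPU-to-CPU-memory bandwidth; any price difference between the instances would scale the ratio accordingly.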
One is to use multi-GPU instances, but that has a cost impact—you have a large model, you want to fit it into multiple GPUs—and the other approach, which was tested here, is to offload—to use CPU offloading, offload GPU memory to CPU memory. And that usually means a performance impact, but we saw that if we improve the bandwidth between the GPU and some other memory, such as a CPU memory, things run a lot better.
So, how can we make it even better? And this is where UnifabriX steps in. And what UnifabriX builds is a Memory over Fabrics solution. And in this example, we show a GPU interconnect—this one is based over UALink, which is the comparable interconnect to the NVIDIA NVLink—and we bring here a memory pool that connects to the UALink switch with a very high bandwidth that exceeds three terabytes per second. And that means that one of the challenges that I mentioned before—that the GPU, the compute, and memory were tightly coupled, and none of them could grow independently of the other—now we can actually select the ratio that we need for a specific workload, the ratio between compute and memory, because the memory pool can provide as much memory as the GPU needs. We can use fewer GPUs, for instance, for running inference workloads, where we use the memory pool as a high-bandwidth memory solution for swapping out data or layers from GPU memory when they are not needed anymore. So, it’s not necessarily needed to be swapped out to CPU; it could be swapped out to the fabric, to a memory pool that sits in the fabric. There are other benefits here, like eliminating data duplications. And the high bandwidth itself keeps up with the pace of the GPU.
So, what UnifabriX is creating is Memory over Fabrics, and we create the silicon and systems that do that for AI and HPC, and we use a standards-based, open ecosystem. And this is why it’s an open journey. We use CXL—we started with CXL earlier when the CXL consortium formed in 2019—and recently, we added UALink capability to our products to support the newer generation interconnect of UALink, which I mentioned. This is the comparable interconnect—GPU interconnect—for the NVIDIA NVLink, which runs at a much higher bandwidth than CXL.
This is our product, the UnifabriX MAX Memory over Fabrics. You can see it’s a standard 2U form factor, hosting between 4 and 32 terabytes of fast, DDR5-based memory. It has ports at the back; these ports can connect over CXL to CPUs and over UALink to GPUs. There are scale-up ports to connect multiple of these memory pools together—creating an even bigger memory footprint in scenarios that need it—and scale-out network ports for running over Ethernet. And this appliance provides a lot of features that are very useful for AI and HPC, such as checkpointing, performance telemetry, and heat maps, where you can understand which portions of the memory are "hot," meaning touched frequently, and which portions are less hot—maybe cold—and could perhaps be swapped out to storage or to a lower-cost type of memory, such as SCM.
And this product supports both CPUs and GPUs. At the system level, you can see the MAX memory—in this case in a 60-node configuration with up to 32 terabytes of memory—with a direct memory feed, both capacity and bandwidth, serving those 60 nodes. If someone needs a larger fan-out, they can use a switch. The bandwidth scales from one terabyte per second to more than three terabytes per second, depending on the SKU of the product. This works today with any CPU or GPU with standard CXL—CXL 1.1 or 2.0, and tomorrow CXL 3.2—and we are working with the latest spec, UALink 200, which is going to be released soon, prototyping the first memory pool for UALink. The memory pool uses standard DDR5 DIMMs, which helps a lot with reducing the TCO of the solution.
In terms of system-level architecture, this is the UnifabriX MAX—this is the memory pool. It has memory inside; it has the UnifabriX Memory OS, which is a hardened version of Linux, and an API interface through which an orchestrator or a fleet controller can provision and deprovision memory to and from compute elements—hosts, CPUs over CXL, GPUs over UALink. We support CXL 3.2 DCD dynamic capacity, meaning via this API we can provision and deprovision memory on the fly.
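To make the provisioning flow concrete, here is a hedged sketch of what an orchestrator's bookkeeping against such a pool might look like. The class, method names, and payloads are entirely hypothetical—the talk only states that the appliance exposes an API supporting CXL 3.2 DCD-style dynamic capacity add and release:

```python
# Hypothetical sketch of orchestrator-side accounting for a memory pool
# with dynamic capacity add/release (modeled loosely on the CXL DCD
# concept described above). Not an actual UnifabriX API.
class MemoryPoolClient:
    def __init__(self, capacity_gib):
        self.free_gib = capacity_gib
        self.allocations = {}               # host_id -> GiB provisioned

    def provision(self, host_id, gib):
        """Dynamic capacity add: carve out an extent for a host."""
        if gib > self.free_gib:
            raise RuntimeError("pool exhausted")
        self.free_gib -= gib
        self.allocations[host_id] = self.allocations.get(host_id, 0) + gib

    def release(self, host_id, gib):
        """Dynamic capacity release: return an extent to the pool."""
        self.allocations[host_id] -= gib
        self.free_gib += gib

pool = MemoryPoolClient(capacity_gib=6 * 1024)   # ~6 TiB pool, as in the demo
pool.provision("host-1", 512)
pool.provision("host-2", 1024)
pool.release("host-1", 256)                      # shrink host-1 on the fly
print(pool.allocations, pool.free_gib)
```

The point of the sketch is the on-the-fly add/release cycle: capacity moves between hosts without rebooting anything, which is what decouples the memory-to-compute ratio.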
And we also provide a nice graphical dashboard that represents the status that the fabric manager sees. There is a fabric manager running under the hood that does the magic of provisioning memory to compute elements. In this case, we can see in the dashboard an almost-six-terabyte memory pool and five hosts with memory from the pool provisioned to them. The dashboard provides an intuitive interface to play with the system—adding memory to a host, reducing it, playing with multiple tiers, even adding a tier from a CXL fabric when we have multiple memory pools interconnected. And underneath, we have the fabric manager exposing the APIs, meaning it can connect to an orchestration layer and receive all these commands—not through sliders and text boxes, but directly from an orchestrator.
Where do we see these memory pools deployed today? So, it turns out a lot of use cases, application workloads, and market segments could benefit from memory pooling. We have the MAX memory deployed today in places where data analytics workloads are running, such as in-memory databases or AI-based analytics. Finance institutions that use large graph models need a lot of memory, and they need the memory pool that helps them store those large graph models. Drug discovery, even animation studios—they have large scales of digital assets that need to be worked on and processed and rendered. And usually, these sit on storage systems, but to get more performance, they need to sit in memory. National labs that can use the memory pool for getting more bandwidth to compute elements, and hyperscalers in the public cloud.
And getting back to our CPU, which was choked out of memory bandwidth, once we add the UnifabriX MAX memory and provision more memory to that CPU—and that means more memory capacity but also more memory bandwidth—we can see that we can scale the performance of the workload by adding more CPU cores and getting more performance at the system level because now we have more memory bandwidth available to the CPU. So, we solve some of the memory wall challenge.
So, recapping—and since we are in the memory business, we have some things to memorize from this session. We saw the memory wall—the gap between compute elements and memory elements that is getting wider every year. We saw that memory is a key pain point for the performance of AI, HPC, and big data, and that AI and HPC share some of the challenges and some of the solutions. We saw that AI applications grow fast—sometimes exponentially—in their memory bandwidth and footprint requirements, and HPC is not far behind. We also saw that memory is a key enabler for making LLM inference cost-effective: we looked at the cost per token in inference, and we saw ways to reduce it by providing more memory capacity to the GPU—not as HBM, but as more cost-effective memory adjacent to the GPU, either a CPU with a high-bandwidth interconnect to the GPU, or a high-bandwidth memory pool available to the GPU over a UALink fabric. So, solving bandwidth locally is typically done with HBM—go buy HBM; everybody buys HBM with GPUs today. But if you want to solve it at scale, you need Memory over Fabrics—so, go UnifabriX; we provide the solutions for that. Thank you very much, and now I’ll be glad to answer your questions. Thank you.