1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20

Hi, everyone. My name is Tejas Chopra, and I’m very honored to be talking today as part of the AI Infra Forum group on the memory wall in AI. I’m a senior engineer at Netflix, and I work on building infrastructure that powers ML models that, in turn, are helping grow the personalization and the recommendations catalog at Netflix.

So, let me start by talking about the exponential growth in AI and the hidden bottleneck. We have now seen unprecedented growth in large language models. We’ve leapt from the 100 million parameters of early models to 175 billion in GPT-3, with trillions of parameters rumored in GPT-4. This corresponds to an exponential growth in model size and the data that they need. In parallel, the hardware raced to provide more raw compute power. In fact, peak compute in server-class chips increased by an astonishing factor of 60,000x in 20 years. This is why we can train and run such massive models at all. However, there is a catch. Memory has not kept up. The ability to store and move data improved only modestly, on the order of 30 to 100x in the same time frame of 20 years. This disparity is the hidden problem. As models and computations grew exponentially, the pipelines to supply data to these computations grew linearly at best. This issue often lurks in the background. We talk about teraflops and petaflops of compute while assuming memory will just catch up, but it hasn’t. The result is that AI systems increasingly stall, waiting for data. In the following slides, we will see why this memory wall is becoming the key limiter of AI performance.
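To put the two growth figures side by side, here is a minimal sketch that converts them into yearly rates. It assumes both improvements compounded at a constant rate over the 20 years, which is a simplification; the 60,000x and 30 to 100x figures are the ones quoted above, and the function name is made up for illustration.

```python
def annual_growth(total_factor: float, years: float) -> float:
    """Constant yearly multiplier that compounds to total_factor over `years`."""
    return total_factor ** (1.0 / years)


YEARS = 20

# Figures quoted in the talk: peak compute vs. memory improvement over 20 years.
compute_per_year = annual_growth(60_000, YEARS)
memory_low_per_year = annual_growth(30, YEARS)
memory_high_per_year = annual_growth(100, YEARS)

# How far compute pulled ahead of memory over the whole period.
gap_best_case = 60_000 / 100   # memory improved by the optimistic 100x
gap_worst_case = 60_000 / 30   # memory improved by the pessimistic 30x

print(f"compute grew about {compute_per_year:.2f}x per year")
print(f"memory grew about {memory_low_per_year:.2f}x to {memory_high_per_year:.2f}x per year")
print(f"cumulative gap: {gap_best_case:.0f}x to {gap_worst_case:.0f}x")
```

Under that assumption, compute grew by roughly 73% a year while memory grew by roughly 19 to 26% a year, and the difference compounds into a gap of several hundred to a couple of thousand times.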

The memory wall is a concept that was first observed decades ago in general computing. As early as the 1990s, pioneers warned that if memory bandwidth didn’t dramatically improve, many applications would become memory-bound rather than compute-bound. Researchers Wulf and McKee famously termed this “the memory wall.” Fast forward to today’s AI era, and we are indeed slamming into this wall. In practical terms, the memory wall is when your CPUs and GPUs are trapped, twiddling their thumbs, waiting for data from memory. This is not a theoretical edge case; it is becoming the norm in cutting-edge AI. A recent analysis from UC Berkeley showed that, for large transformer models, memory bandwidth can dominate performance, which means that the model can run only as fast as the memory can stream data, not as fast as the arithmetic units can compute. To visualize this issue, look at the chart that is attached here. You’ll see that transformer size has grown by around 240x every two years, whereas AI hardware memory size has grown only about 2x every two years, and you can see the memory wall show up in this diagram. This means that often the extra flops cannot be utilized. This is not just a technical inconvenience; it is an emerging crisis. We are pouring resources into bigger clusters and chips but getting diminishing returns because the memory subsystem is becoming the choke point. There is a direct financial implication: we end up buying more hardware and burning more energy to compensate. The key point here is that the memory wall is real and growing, and it threatens to slow down the AI revolution unless we address it head-on.

Let’s ground this in a few real-world examples. First, let us consider the NVIDIA H100 GPU, one of the most advanced AI chips NVIDIA has released. NVIDIA turbocharged its compute cores: there was a roughly 6x boost in theoretical flop performance over the previous generation, but the memory bandwidth did not get that kind of boost. It was only around 1.6x. Engineers who run real AI workloads observed that these GPUs cannot use all of their horsepower. In fact, even when OpenAI was training some of its models, and especially on the inference side, some of the GPUs were reportedly running at only 60% utilization. Another vivid example is dozens of GPUs working in tandem to support inference for large language models because no single GPU has enough memory to hold the model. So, many companies shard the model across multiple GPUs. Let’s say the model needs 300 gigabytes of memory, but each GPU has only 80 GB available. That means you now have to shard it across at least four GPUs. This adds complexity. The GPUs are constantly talking to each other over interconnects, waiting on each other to pass chunks of the model around. That communication is slow and energy-hungry compared with on-device memory access. And this is a direct consequence of the memory capacity limits on each device. Essentially, we are using more GPUs not necessarily for more speed, but for their memory. On the training side, memory limits are equally daunting. When training massive models, researchers use hundreds or thousands of GPUs. One reason is compute power, yes, but another major reason is aggregate memory. A researcher from Micron noted that training a large, cutting-edge model can demand on the order of dozens of terabytes of memory in total. Even a high-end GPU has at most a few hundred gigabytes of memory. So now you need hundreds of such GPUs just to meet the memory requirement, never mind the compute. 
And if any of those memory chips hiccups, it can disrupt the entire training run. Let us now consider the user-facing impact. When ChatGPT famously showed “we are at capacity” messages, part of that was due to hitting the limits of the available GPU fleet. Essentially, the system could not process more requests concurrently because the GPUs’ memory and bandwidth were maxed out, given the model size and the required throughput. It is a sign that the service was memory-bandwidth limited, and adding more GPUs was the only fix to serve more users. These examples underscore the fact that the memory wall is not theoretical. It is affecting real systems today.
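The sharding arithmetic in the 300 GB example is a ceiling division, and a small sketch makes it easy to try other sizes. The helper names are made up for illustration. The 175-billion-parameter case assumes 16-bit weights only (no activations or caches), and the 24 TB training figure is just one assumed value for "dozens of terabytes".

```python
import math


def weights_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory needed for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def gpus_for_memory(model_gb: float, gpu_gb: float) -> int:
    """Minimum number of GPUs whose combined memory can hold model_gb."""
    return math.ceil(model_gb / gpu_gb)


# The example from the talk: a 300 GB model on 80 GB GPUs.
inference_gpus = gpus_for_memory(300, 80)

# Assumption: 175 billion parameters stored as 16-bit values (2 bytes each).
gpt3_weights = weights_gb(175e9, 2)
gpt3_gpus = gpus_for_memory(gpt3_weights, 80)

# Assumption: 24 TB stands in for "dozens of terabytes" of training state.
training_gpus = gpus_for_memory(24_000, 80)

print(f"300 GB model: {inference_gpus} GPUs")
print(f"175B params at 16 bits: {gpt3_weights:.0f} GB, {gpt3_gpus} GPUs")
print(f"24 TB of training state: {training_gpus} GPUs")
```

These are lower bounds driven purely by capacity. Real deployments need headroom for activations, caches, and communication buffers, so the actual count is usually higher.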

Let’s unpack the technical heart of the memory bottleneck here. So why is memory the bottleneck? The first issue is the sheer speed gap between computation and memory. In engineering terms, we can say that memory has not kept up with Moore’s Law. You can build a GPU core that toggles billions of times per second, but if the data it needs lives in DRAM, which might deliver only a very small fraction of that per core, then the core is starved most of the time. Over many generations, this compounds to a huge gap. Concretely, a modern GPU can issue multiple floating-point operations per clock cycle, and it has thousands of cores, so it might demand on the order of terabytes per second of data to stay fully busy. But a single HBM (High-Bandwidth Memory) stack today tops out at around a terabyte per second. So, if the workload needs more than that, the GPU will stall. The bottom line is that the processing elements are faster than the memory channels that feed them by a wide margin. Let us now talk about latency, the delay for accessing memory. Even if bandwidth were high, latency matters because AI algorithms involve a lot of sequential dependencies. You often need the result of one layer before proceeding to the next layer, and today’s memory latencies for DRAM are on the order of hundreds of nanoseconds. If the data is on another server or storage, it could be microseconds or more. The chart here shows a very nice way to think about GPU memory hierarchy and look at both the latency and the throughput numbers for on-chip and off-chip memory, especially when talking about GPUs. GPUs and TPUs have on-chip SRAM—which is registers and caches—which is extremely fast and high bandwidth, but there is very little of it, on the order of megabytes. But these AI models, if you think about it, are enormous and far larger than the on-chip memory, so inevitably GPUs have to reach out to HBM or DDR memory to get weights and activations. That external memory is much slower. 
Even though technologies like HBM have higher bandwidth than older DDR, they are still limited by physics and interface constraints. So, large models cannot simply be stored entirely on-chip; often, these off-chip data fetches become a choke point. Let us also talk about arithmetic intensity, which is essentially how much computation you do per byte of data. Some operations, like large matrix multiplications, are very compute-heavy relative to data; these are high-intensity. Those can utilize compute units well, but many AI operations, like the attention mechanism in transformers, are more data-access-heavy, which means they are low-intensity. For those operations, performance is dictated by memory bandwidth. If the algorithm needs to pull a lot of memory for each calculation, then increasing the flops won’t help unless memory can supply more data. Finally, it’s worth noting the energy cost of moving data. It might take, for example, an order of magnitude more energy to fetch a piece of data from DRAM than to do the floating-point multiply on it. This means when we are bottlenecked on memory, we are often also operating at poor energy efficiency, burning power just waiting or shuffling data around. Also, in distributed setups, moving data across a network or PCIe bus is even slower and costs more power. So, the memory bottleneck is a double hit: it lowers performance and wastes energy. Now that we’ve understood the technical reasons, let us look at what solutions exist.
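The arithmetic-intensity argument is often summarized with the roofline model: attainable throughput is the smaller of peak compute and bandwidth times intensity. Here is a minimal sketch of that bound. The accelerator numbers (1,000 TFLOP/s peak, 3 TB/s of memory bandwidth) are hypothetical round values, not the specs of any particular chip, and the byte counts assume 16-bit elements with each operand read or written exactly once.

```python
def attainable_tflops(peak_tflops: float, mem_bw_tb_s: float, flops_per_byte: float) -> float:
    """Roofline bound: capped by compute or by memory bandwidth, whichever is lower."""
    return min(peak_tflops, mem_bw_tb_s * flops_per_byte)


def matmul_intensity(n: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for an (n x n) @ (n x n) multiply: read A and B, write C."""
    flops = 2 * n ** 3
    bytes_moved = 3 * n * n * bytes_per_elem
    return flops / bytes_moved


def matvec_intensity(n: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for an (n x n) @ (n,) product: read matrix and vector, write result."""
    flops = 2 * n * n
    bytes_moved = (n * n + 2 * n) * bytes_per_elem
    return flops / bytes_moved


PEAK_TFLOPS = 1000.0  # hypothetical accelerator peak
MEM_BW_TB_S = 3.0     # hypothetical memory bandwidth

for name, intensity in [("matmul", matmul_intensity(4096)),
                        ("matvec", matvec_intensity(4096))]:
    bound = attainable_tflops(PEAK_TFLOPS, MEM_BW_TB_S, intensity)
    share = 100.0 * bound / PEAK_TFLOPS
    print(f"{name}: {intensity:.1f} FLOP/byte, at most {bound:.1f} TFLOP/s ({share:.1f}% of peak)")
```

With these assumed numbers the large matrix multiply can reach the compute peak, while the matrix-vector product, which does about one FLOP per byte, is held to a fraction of a percent of it. Adding more flops would not move that second number at all; only more bandwidth or higher intensity would.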

So, how are we dealing with the memory limitations today? There is a toolbox of strategies, at both the algorithmic level and the system level, that practitioners use. The first big category is model compression. If memory is the bottleneck, one straightforward fix is to use less memory per model. Quantization is a prime example: instead of 32-bit floats for each weight, use 16-bit floats or even 8-bit integers to represent them. That immediately cuts the memory in half or better. Companies like NVIDIA have heavily pushed mixed-precision training (for example, a mix of 16-bit and 8-bit floating point) for this reason. And for inference, there is a lot of excitement around 8-bit and 4-bit weight representations. If done right, with some calibration or fine-tuning, quantized models can retain very close to the original accuracy while using a fraction of the memory. Similarly, pruning removes weights that don’t significantly contribute to the output. If you can zero out, say, 20% of the weights and then compress the model, that’s 20% less data to move and store. These techniques directly mitigate the memory wall by reducing the demand side of the equation. Another approach is memory-efficient algorithmic techniques in training. A well-known technique, or trick, is gradient checkpointing, also called recomputation. Normally, during neural network training, we store all the intermediate activations so that when we do backpropagation, we have them ready to compute gradients. This can consume huge amounts of memory for deep networks. With checkpointing, we store only a few key layer outputs, discard the rest, and then recompute the discarded activations from the nearest stored output during backpropagation. This means that we do some extra compute work, but in return, we save a lot of memory. It’s a classic space-versus-time trade-off. Many training frameworks use this to fit larger models on GPUs. 
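To make the quantization idea concrete, here is a minimal sketch in plain Python of symmetric, per-tensor 8-bit quantization, plus the footprint arithmetic for different bit widths. This is one simple scheme chosen for illustration, not the method of any specific framework; production quantizers typically work per channel or per group and use calibration data. The function names and the tiny weight list are made up.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


def footprint_gb(num_params: float, bits: int) -> float:
    """Weight storage in gigabytes (1 GB = 1e9 bytes) at a given bit width."""
    return num_params * bits / 8 / 1e9


weights = [0.5, -1.27, 0.02, 1.0]           # made-up example values
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
worst_error = max(abs(w - r) for w, r in zip(weights, restored))

print("integers stored:", quantized)
print(f"scale: {scale:.4f}, worst round-trip error: {worst_error:.6f}")

# Assumption: a 175-billion-parameter model, weights only.
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit weights: {footprint_gb(175e9, bits):.1f} GB")
```

Each stored value shrinks from four bytes to one, at the cost of a rounding error of at most half the scale per weight. Whether that error is acceptable depends on the model, which is why calibration or fine-tuning is usually part of the process.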
Another example is optimizing the sequence of operations or using algorithms with better memory access patterns: for example, fusing operations to keep data in registers longer or using algebraic techniques to minimize memory reads. Parallelism and model sharding are basically about distributing the problem. If one GPU cannot handle the whole model, use two or four, each storing part of it. This is commonplace now. Model parallelism splits the neural network’s layers or parameters across GPUs; pipeline parallelism assigns consecutive groups of layers to different devices and streams micro-batches through them. These allow us to scale memory capacity by adding more devices. In inference, as I mentioned, we shard models across GPUs as well. The downside is that those devices must now talk to each other frequently, so you introduce communication latency and complexity. It’s not a cure-all, but it’s a necessary tactic today to handle supersized models. We also use tiered memory approaches. Think of it like memory hierarchy 101, but applied to AI. You have a small, fast memory (the GPU’s HBM), a larger, slower memory (the CPU’s host memory), and maybe an even larger but slower storage tier (an SSD). If a model’s working set is too big for GPU memory, some parts of it that aren’t immediately needed can be swapped out to host memory. Frameworks and libraries exist that can automatically move the infrequently used model weights to the CPU and bring them back when required. NVIDIA’s software stack, for instance, allows offloading certain layers to the CPU if the GPU memory is full. It’s slower, yes, but it can enable functionality that you otherwise couldn’t have. Similarly, some large-scale recommender systems use the GPU for the dense compute but keep giant embedding tables in CPU memory, fetching embeddings on the fly over PCIe. The key is to manage what needs to be close to the compute units and what can tolerate latency. Good caching and prefetching can mitigate performance hits. Lastly, the software stack plays a big role. 
Deep learning frameworks like TensorFlow and PyTorch, along with their compilers, are getting smarter about memory. They can reuse memory buffers for different layers if lifetimes don’t overlap; they can schedule operations to reduce peak memory usage and overlap communication with computation. For example, while one part of the GPU is crunching on one layer, the next layer’s weights might be streaming in simultaneously, so that by the time we need those weights, they’re already in HBM. Also, adjusting the batch sizes or sequence lengths can help keep memory usage within bounds. These software optimizations are less flashy but often yield, say, 10 to 30% improvements in memory usage or bandwidth efficiency. All these strategies are about making the most of the current hardware and mitigating the memory wall’s effects. They’re what allow us to train and serve large language models today despite the bottleneck. However, they also come with trade-offs (complexity, performance overhead, or development effort), and so we now need to think about some emerging innovations in AI memory infrastructure.
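Of the strategies in this section, gradient checkpointing is the one where the space-versus-time trade is easiest to see in code. The sketch below is a toy, not how any framework implements it: the "network" is a chain of scalar layers of the form tanh(w * x), and all function names are made up. It computes the same gradient two ways, once storing every layer input and once storing only every fourth one and recomputing the rest segment by segment.

```python
import math


def layer_forward(x: float, w: float) -> float:
    return math.tanh(w * x)


def layer_backward(x: float, w: float, grad_out: float) -> float:
    """Gradient with respect to the layer input, given that input x."""
    y = math.tanh(w * x)
    return grad_out * (1.0 - y * y) * w


def grad_full_storage(x0, weights):
    """Baseline: keep every layer input during the forward pass."""
    inputs, x = [], x0
    for w in weights:
        inputs.append(x)
        x = layer_forward(x, w)
    grad = 1.0
    for x_in, w in zip(reversed(inputs), reversed(weights)):
        grad = layer_backward(x_in, w, grad)
    return grad, len(inputs)  # gradient, number of activations held at peak


def grad_checkpointed(x0, weights, every):
    """Keep only the input of every `every`-th layer; recompute the others."""
    checkpoints, x = {}, x0
    for i, w in enumerate(weights):
        if i % every == 0:
            checkpoints[i] = x
        x = layer_forward(x, w)

    grad, peak = 1.0, len(checkpoints)
    for start in sorted(checkpoints, reverse=True):
        end = min(start + every, len(weights))
        # Recompute this segment's inputs from its checkpoint (the extra compute).
        segment, x = [], checkpoints[start]
        for i in range(start, end):
            segment.append(x)
            x = layer_forward(x, weights[i])
        peak = max(peak, len(checkpoints) + len(segment))
        for i in reversed(range(start, end)):
            grad = layer_backward(segment[i - start], weights[i], grad)
    return grad, peak


weights = [0.9 + 0.01 * i for i in range(16)]  # made-up 16-layer chain
g_full, held_full = grad_full_storage(0.3, weights)
g_ckpt, held_ckpt = grad_checkpointed(0.3, weights, every=4)

print(f"full storage: grad={g_full:.6f}, activations held={held_full}")
print(f"checkpointed: grad={g_ckpt:.6f}, activations held={held_ckpt}")
```

Both paths produce the same gradient. The checkpointed version holds fewer values at its peak (four checkpoints plus one four-layer segment, instead of all sixteen inputs) and pays for it with roughly one extra forward pass of recomputation.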

There are some exciting innovations on the horizon that aim to break through the memory wall for AI. We’ll go through just a subset of them. First, let us look at improving the memory itself. High-bandwidth memory is currently the state of the art for GPU memory, and it is continuously evolving. Each generation of HBM has increased bandwidth by widening the interfaces and stacking more memory together. Today’s HBM3-generation stacks deliver on the order of 1 terabyte per second of bandwidth per stack, and the fastest parts exceed it. This is an incredible number, well over 10x the bandwidth of a regular DDR DIMM. GPU manufacturers are also packing more HBM stacks per GPU and offering models with huge memory pools. That kind of capacity means that certain large models that previously needed splitting might fit in one device, and that bandwidth helps feed the compute cores more data per second. NVIDIA’s upcoming architectures are rumored to further boost the memory bandwidth as well. Of course, HBM is very expensive and adds to power usage. It’s basically a brute-force approach of throwing more memory and bandwidth at the problem. But in the near term, it’s a critical part of the solution. The next one is memory disaggregation via CXL (Compute Express Link), which is another promising development, especially for cloud and enterprise environments. CXL is essentially a new high-speed interconnect standard that allows memory to be decoupled from a specific GPU or CPU and instead shared as a resource. Imagine pooling a few terabytes of DDR5 or even specialized memory that any GPU in a server rack can access on demand. This means that if a particular model or workload needs more memory than a single GPU has locally, it could tap into the CXL memory pool. We could scale memory independently of compute, adding more memory capacity to a cluster without tying it one-to-one to GPUs. The benefit is flexibility and potentially larger memory availability. The trade-off is latency and bandwidth. 
CXL memory is slower than the HBM on the GPU, but it’s substantially faster than traditional network storage. In practice, we might see systems where GPUs use HBM for the active working set and CXL memory as an overflow for large models. Major vendors and consortia are building standards, so that in a couple of years, your data center might have disaggregated memory appliances. For AI, this could mean a future where loading a 10-trillion-parameter model is feasible because you can spread it over a massive memory pool rather than being constrained to what’s inside one server. Perhaps the most radical approach is processing-in-memory (PIM). This flips the script. Instead of dragging data to the processor, you’re putting the processor in the memory. Practically, this means embedding simple compute cores or logic directly inside memory chips, as you can see in the diagram attached here. Samsung, for example, has demonstrated HBM-PIM, where each memory bank has a tiny processor that can perform matrix multiplications or accumulation on the data in the bank. If you want to sum two large vectors, you can command the memory to do it internally and only send the result out, rather than sending both vectors over the memory bus to the GPU to sum. PIM can drastically cut down on the volume of data transferred between memory and CPU or GPU, which is great for bandwidth-bound scenarios. It’s like doing a preprocessing step right where the data lives. Early prototypes of PIM have shown notable speedups for certain AI workloads—like recommendations or basic neural networks—and big energy savings because data movement is reduced. It’s still in the research and productization phase. Challenges include programming models and making the on-memory compute general enough for many use cases. But it’s a very promising direction. Then there are novel architectures and integration techniques. One of them is 3D stacking, which is already used in HBM. 
The next step is stacking memory on top of logic. Companies are exploring chip designs where high-density memory, like DRAM, is layered directly on the processor, connected with thousands of tiny vertical interconnects. This could allow bandwidths an order of magnitude higher than the current 2.5D HBM setups because memory is on top of the chip, not sitting next to it. It can also reduce latency. We might see special-purpose AI chips that integrate, say, hundreds of megabytes or a few gigabytes of memory right on the same package or die as the cores. Another example is wafer-scale engines. Cerebras Systems took an extreme approach by making a single silicon wafer into a giant chip. They have 850,000 cores and an enormous on-wafer memory. The idea is to keep as much of the model on-chip as possible to avoid off-chip delays. It’s a very different approach than the GPU, trading off some clock speed in exchange for sheer scale and memory locality. Cerebras has—when I last read it—around 40 GB of on-chip memory in their latest wafer-scale engine, which is a lot. But even that wasn’t enough for the largest models without partitioning. Still, it shows the trend, which is to build systems that are more memory-rich and memory-oriented. We are also looking at new memory materials—like MRAM, which can be as fast as SRAM but dense like DRAM and non-volatile—or resistive memory. And there is a general trend of co-designing AI algorithms with hardware. For example, some research suggests new network architectures that compress activations or use more compute to reduce memory access. So, future AI models themselves might evolve to be more memory-friendly when we design them with hardware limits in mind. The overarching theme here is that the industry recognizes the memory wall to be a very big hurdle and is investing heavily. As one analysis has put it, it’s like hundreds of billions in AI capex on the table. 
Whether through more advanced memory technology, smarter architecture, or paradigm shifts like PIM, we are entering an era of memory-centric computing for AI. The goal is very clear: to make memory scale in line with our appetite for larger and more complex AI models, thereby sustaining the AI progress curve without running into a hard wall.
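The arrangement described earlier, HBM for the active working set with a CXL pool as overflow, comes down to a placement decision. Here is a minimal sketch of one possible policy: a greedy pass that puts the most frequently touched bytes into HBM first. Everything in it is hypothetical, including the tensor names, sizes, access counts, and the policy itself. It is not a CXL API, and real systems would also weigh latency tolerance and migration cost.

```python
from dataclasses import dataclass


@dataclass
class Tensor:
    name: str
    size_gb: float
    accesses_per_step: float  # hypothetical measure of how "hot" the data is


def place(tensors, hbm_capacity_gb):
    """Greedy placement: hottest data per gigabyte goes to HBM, the rest to the pool."""
    in_hbm, in_pool = [], []
    free_gb = hbm_capacity_gb
    by_hotness = sorted(tensors,
                        key=lambda t: t.accesses_per_step / t.size_gb,
                        reverse=True)
    for t in by_hotness:
        if t.size_gb <= free_gb:
            in_hbm.append(t.name)
            free_gb -= t.size_gb
        else:
            in_pool.append(t.name)  # overflow to the slower, larger tier
    return in_hbm, in_pool


# Made-up workload: dense layers touched every step, a huge table touched rarely.
workload = [
    Tensor("attention_weights", 25, 1.0),
    Tensor("mlp_weights", 30, 1.0),
    Tensor("kv_cache", 20, 1.0),
    Tensor("embedding_table", 200, 0.1),
]

hbm, pool = place(workload, hbm_capacity_gb=80)
print("HBM:", hbm)
print("CXL pool:", pool)
```

In this made-up workload, the dense weights and the cache fit in the 80 GB of HBM, and the rarely touched embedding table lands in the pool, which mirrors the recommender-system pattern of keeping giant tables off the GPU.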

We have talked about technology. Now, let’s translate that to the language of business and strategy, which is crucial for executives. The cost implications of AI models are very important. When memory bottlenecks hold a $10,000 GPU to 60% utilization, it is giving you only about $6,000 of useful work. Across thousands of units, that’s millions of dollars of essentially wasted potential. We compensate by buying more GPUs to make up throughput, which directly increases capital expenditure. For cloud providers and AI-heavy companies, this is showing up as jaw-dropping budgets. We are literally paying more for memory than for processing cores in some cases. Scalability is another angle. Let’s say your company develops a brilliant AI model that could transform your product, but then you find out that to serve customers with this model, you need an order of magnitude more GPUs than you planned because each GPU can’t handle as many requests as expected. This can slow down the rollout of new AI-driven features. Either you delay until infrastructure catches up, or you deploy a scaled-down version of the model. There is also an opportunity cost in engineering time. A lot of very bright engineers are currently spending time on tricks to work around memory limits: manually partitioning models, optimizing memory allocation, devising custom caching strategies. While that’s important, imagine if those teams could spend that time on improving model quality or developing new AI features rather than wrestling with infrastructure. By investing in better memory solutions, leaders can free up talent to focus on higher-level innovation. Organizations that alleviate the memory bottleneck internally can iterate on AI capabilities faster than those constantly bogged down by low-level performance tuning. Energy usage is a sometimes overlooked but critical part of the business impact. Data center power and cooling are finite resources. 
If you need twice the number of servers to do a task because each is underutilized, you’re also using roughly twice the electricity. And AI is a big driver of the increase in total power consumption. For cost and sustainability reasons, improving the efficiency of AI is key. The memory wall undermines efficiency, and reducing data movement and better utilizing hardware can directly cut the energy per task. Finally, from a strategic risk point, if a company ignores these issues, it might find its AI initiatives stalling unexpectedly. In summary, the memory wall isn’t just a systems engineering headache. It’s a scaling and economics problem that leadership needs to factor into roadmaps. Those who plan for it will use these resources more effectively and push AI further.
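The cost argument is simple arithmetic, and it helps to see how it scales. This is a rough sketch using the $10,000 price and 60% utilization mentioned above; the fleet size of 1,000 GPUs is an assumption for illustration, and the model ignores everything except purchase price and utilization.

```python
import math


def useful_value(gpu_price: float, utilization: float) -> float:
    """Share of the purchase price that turns into useful work."""
    return gpu_price * utilization


def stranded_capex(gpu_price: float, utilization: float, fleet_size: int) -> float:
    """Purchase cost of the capacity that sits idle across the fleet."""
    return gpu_price * (1.0 - utilization) * fleet_size


def gpus_needed(fully_utilized_equivalent: int, utilization: float) -> int:
    """GPUs to buy so that the fleet delivers the work of N fully busy GPUs."""
    return math.ceil(fully_utilized_equivalent / utilization)


PRICE = 10_000      # figure from the talk
UTILIZATION = 0.6   # figure from the talk
FLEET = 1_000       # assumption for illustration

per_gpu = useful_value(PRICE, UTILIZATION)
stranded = stranded_capex(PRICE, UTILIZATION, FLEET)
needed = gpus_needed(FLEET, UTILIZATION)

print(f"useful work per GPU: ${per_gpu:,.0f}")
print(f"stranded capex across {FLEET} GPUs: ${stranded:,.0f}")
print(f"GPUs needed to match {FLEET} fully utilized ones: {needed}")
```

The same function covers the energy point: at 50% utilization you need twice the hardware, and roughly twice the electricity, to do the same work.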

Looking ahead, how do we adapt our infrastructure and strategy to this reality? One key shift is moving towards memory-first architectures. Historically, we design a processor or an accelerator and then figure out how to feed it data. Going forward, we will likely invert that thinking: design the data pathways and build a system around memory and memory subsystems to be extremely high bandwidth and large, and then build compute around it. For example, instead of a GPU fixed with an 80 GB memory limit, maybe we’ll see systems that fluidly attach more memory as needed or architectures that sacrifice some peak compute in exchange for much higher sustained memory throughput. Essentially, the ideal future system is one where compute and memory scale in tandem. This also means new metrics for system performance—not just how many teraflops a chip can do, but how many teraflops can it actually sustain on real models given the memory constraints? Hardware-software co-design is going to be critical. We cannot just throw hardware at the problem; software has to play along. On one front, AI researchers are exploring model architectures that inherently require less memory or better use of memory. For instance, techniques like mixture-of-experts models dynamically activate only parts of the model, reducing memory usage for inference. If we know our hardware has a certain memory pattern that’s efficient, we might design algorithms to exploit that pattern. On the hardware side, we’ll see more specialization. Think of how tensor cores were introduced for matrix math; similarly, we might get specialized memory engines for AI, like dedicated circuits for embedding lookups or sparse data handling, which are currently very memory-heavy operations. Co-design also implies that companies that build AI hardware are working more closely with those that build AI models to ensure that future models and future chips are aligned to overcome some of these bottlenecks together. 
In the data center, the concept of composable or disaggregated infrastructure will likely become reality. We’re already seeing some early versions of it: for example, Azure and AWS exploring instances with flexible GPU-to-memory ratios, or startup solutions where multiple GPUs share a large NVMe-backed memory space. The idea is to break the fixed pairing of GPU, CPU, and memory. So, if you need four GPUs’ worth of compute but eight GPUs’ worth of memory, you don’t actually have to have eight GPUs; you can allocate the extra memory from a shared pool. This will require high-speed interconnects and clever resource management, but it’s on the horizon with technologies like CXL. For tech leaders, this means that in a few years, you may be able to dial memory up or down independently in your cloud deployments. Perhaps more importantly, continuous innovation is needed. The memory wall is not a simple obstacle we blast through once; it’s more like constant friction that we have to keep smoothing out. As AI demands grow, and they will, with more data, bigger models, and more usage, memory demands will grow too. We need a pipeline of innovations (maybe HBM3 today, HBM4 next, then optical memory links), each buying us some time. Similarly, algorithms will continue to evolve. Think about how much more efficient today’s AI algorithms and models are compared to the ones from five years ago in terms of compute per quality; we need similar jumps in memory per quality. For leadership, this translates into strategic investments and vigilance. It means allocating budgets and attention to infrastructure upgrades focused on memory and I/O, not just compute. In essence, the path forward is an AI ecosystem where memory is no longer the limiting factor. Treat memory as a first-class citizen in every architecture decision.

And finally, I would like to end this with a clear message: the memory wall in AI is not hype. It’s a defining challenge of our time in computing. We’ve peeled back the layers from the raw technical details to the high-level impacts on business, and the evidence is clear that memory has become the crucial limiter. However, I’m very optimistic. We have identified many ways to tackle this, and brilliant minds across the industry and academia are already working on it. Overcoming it will likely spur a new golden age of AI capability, where our imagination is less constrained by hardware bottlenecks. The key is that we treat this challenge as a strategic priority. And I urge you, in your capacity as leaders and decision-makers, to bake memory-conscious thinking into your roadmap. When budgeting for AI projects, consider allocating resources to optimize memory usage. When evaluating partnerships or startups, pay attention to those tackling memory efficiency—they might hold the key to your next big competitive advantage. And collaboration is going to be vital. This isn’t something any single company can solve alone because it spans the entire stack. Engage with consortiums like the CXL Consortium or the Open Compute Project. Contribute to open-source initiatives for better memory management. Maybe even partner with universities researching novel memory technology. Finally, let’s lead by example. This discussion means that you’re already ahead of many of those who haven’t yet recognized this issue. Take this awareness, spread it to your teams and networks, and set concrete goals—for example, aim to improve the utilization of your AI hardware by some percentage via memory optimization next year, or pilot a project with a new memory technology. In essence, the memory wall is a call to arms for the AI industry. Those who rise to the challenge will define the next era of technology. 
And let us turn this crisis into an opportunity—an opportunity to innovate, to collaborate, and to keep the momentum of AI going strong. Thank you so much for joining me today. I am very honored once again to be presenting this to you. Please feel free to connect with me on LinkedIn. Thank you.