Okay, so next up we're going to have an engaging overview that's going to be lightning round, 10 minutes each of AI use cases. So the AI use cases are going to be from the perspective of hyperscalers. So first up, Vikram is going to talk on behalf of Google and present a view of use cases, then Manoj from Meta, and then Alex from Microsoft. So with that, Vikram will take it away. And let's save questions till the end of the three talks, and then they can answer them together. Take it away. Not working. Oh, perfect.
Hello. Hey, everyone. Thank you all for being here. My name is Vikram. I'm a director for product management at Google, where I focus on machine learning efficiency and ways to scale. I want to chat today generally about how AI is transforming our use cases, from foundation models to open frameworks and open ecosystems. The first thing, of course, I think everyone here is already aware of, and I suspect it's why everyone is here: generative AI is really transforming the way we think about technology.
And even at Google, for instance, we have just a small number of models -- PaLM, Codey, Imagen, and so on -- which are driving an immense diversity of use cases, everything from consumer grade, like Bard, all the way up to enterprise grade, like Vertex, which is pretty astonishing.
And this brings me to the point, which I think Amit also kind of hinted at, which is that foundation models are really the engine that are driving all these use cases. But that begs the question, what exactly do I mean by a foundation model?
And there are many definitions. Here's one of them, which is a foundation model is a single, very large, pre-trained model that serves as the foundation for a number of use cases. So you can serve a multitude of use cases with a single model just by fine-tuning it for your specific need. And so this one model, multiple uses paradigm, is incredibly powerful, as you can imagine, because you don't need to retrain things from scratch. You start with a platform that's really good, and then you just build on top of it. And so this is how customers around the world can benefit by fine-tuning things to their specific needs. So what does this look like? So let me give you a few use cases which I think are very cool.
The first is coding. So for instance, I can just tell the AI, please write for me a quicksort algorithm. And it does. This used to be an interview question and what have you. No longer. Now you can just-- developers can focus on what's really difficult. And the simple modules such as quicksort, they can just leave it to the AI to figure out. Very, very cool.
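To make that concrete, the kind of snippet a code model typically produces for that prompt looks something like the following. This is a minimal illustrative sketch, not the actual demo output from the talk:

```python
def quicksort(items):
    """Recursively sort a list: pick a pivot, partition around it, recurse."""
    if len(items) <= 1:
        return items                      # base case: already sorted
    pivot = items[len(items) // 2]
    left = [x for x in items if x < pivot]
    middle = [x for x in items if x == pivot]
    right = [x for x in items if x > pivot]
    return quicksort(left) + middle + quicksort(right)

print(quicksort([3, 6, 8, 10, 1, 2, 1]))  # [1, 1, 2, 3, 6, 8, 10]
```

The point of the anecdote stands either way: boilerplate algorithms like this are exactly what the model hands you for free.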
Another one is speech-to-text, text-to-speech. I actually find this-- I don't know if folks can actually read the text here, which I find very amusing. So for instance, you can look at the foundation model output for a speech-to-text. And it's very clear, looking at the context that this is-- it has to do with air traffic controllers. It has to do with the plane. It's asking for instructions. And it's very, very clear, because the foundation model genuinely understands the context of the conversation. Whereas if you look at something more old school-- I mean, you can see that the words phonetically sound like what the person may have been saying, but it makes no sense. It's like, what does this even mean? So this is the power of foundation models, where you can begin to understand the context of these conversations and do much more with it.
Now I mentioned a little earlier that you can serve a lot of use cases using foundation models. So how can customers and users actually go about fine-tuning them for their needs? And there's a broad spectrum of ways to do it. At the simplest and cheapest end, you literally take the model off the shelf and do something called prompt engineering, which is very low-key. You don't mess around with the main model; the main model remains intact. The benefit is that it's very quick and very cheap; the downside is it may not give the highest quality. At the other end of the spectrum, you have full fine-tuning, where you basically do reinforcement learning with human feedback. In this situation, you provide a lot of your own data and you substantially change the model's weights. So at that end of the spectrum: lots of quality, but very expensive. And in the middle, of course, you have the best of both worlds, something called adapter tuning, where you add a small extra layer on top of a frozen model to improve it for your particular use case. But the TL;DR of all of this is that there are many ways in which customers and users around the world can adapt these models for their specific use case.
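To make the middle option concrete, here is a toy numpy sketch of the adapter idea (in the spirit of LoRA-style adapters): the pre-trained weight matrix stays frozen, and only a small low-rank correction is trained on top. All names and shapes here are illustrative, not any real model's:

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, rank = 16, 8, 2
W_frozen = rng.normal(size=(d_in, d_out))   # "pre-trained" weights: never updated

# Adapter: a low-rank pair of matrices, the only trainable parameters.
A = np.zeros((d_in, rank))                  # zero init: adapter starts as a no-op
B = rng.normal(size=(rank, d_out)) * 0.01

def forward(x):
    # Frozen path plus the small trained correction.
    return x @ W_frozen + x @ A @ B

# Trainable parameter count is tiny compared to the frozen model,
# and the gap only grows with real model sizes.
frozen_params = W_frozen.size               # 16 * 8  = 128
adapter_params = A.size + B.size            # 32 + 16 = 48
print(adapter_params / frozen_params)       # 0.375
```

In a real system only `A` and `B` would receive gradient updates, which is why adapter tuning is so much cheaper than full fine-tuning.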
OK, so that's one part of it, the foundation models. The other thing that has really transformed the way we think about AI use cases is open frameworks. So I think folks are very familiar with the set, TensorFlow, Keras, PyTorch, JAX, and so on.
So let me walk you through a few use cases that have really been transformed, with a very classic one, which is image classification, which is, OK, here's an image. Is it a cat or a dog?
This used to be a very, very complicated, painful, 100-plus-line piece of code. Let me show you what it looks like now. You start with something that is pre-built: hey, I would like to start with a pre-existing data set and a preset model, which is ResNet-50 in this case. I have two classes, dog and cat. I have a data set, which I bring with me. Fit. Done. That's the whole piece of code. And the ability to do something this easily means that developers, as I said earlier, can focus on the next order of cool things to work on. When something that used to take hundreds of lines takes only a dozen, well, OK, I can spend that time on something even cooler.
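The demo shown used a Keras-style preset model; as a self-contained stand-in, here is the same "bring a labeled data set, call fit, done" workflow with a deliberately trivial nearest-centroid classifier. Everything here (class, data, names) is illustrative, not the actual demo code:

```python
import numpy as np

class TinyImageClassifier:
    """Toy stand-in for a preset model: fit() stores a mean 'template' per class."""
    def __init__(self, num_classes):
        self.num_classes = num_classes
        self.centroids = None

    def fit(self, images, labels):
        flat = images.reshape(len(images), -1)
        self.centroids = np.stack(
            [flat[labels == c].mean(axis=0) for c in range(self.num_classes)]
        )

    def predict(self, images):
        flat = images.reshape(len(images), -1)
        # Distance from each image to each class template; pick the nearest.
        dists = np.linalg.norm(flat[:, None, :] - self.centroids[None], axis=-1)
        return dists.argmin(axis=1)

# Fake data set: class 0 = dark 8x8 images ("cat"), class 1 = bright ("dog").
rng = np.random.default_rng(1)
images = np.concatenate([rng.uniform(0.0, 0.4, (20, 8, 8)),
                         rng.uniform(0.6, 1.0, (20, 8, 8))])
labels = np.array([0] * 20 + [1] * 20)

model = TinyImageClassifier(num_classes=2)
model.fit(images, labels)
print(model.predict(images[:2]))   # both from the dark class -> [0 0]
```

The user-facing shape is the speaker's point: pick a preset, declare your classes, bring data, call `fit`.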
Another great example is text classification. So let's say that I have a number of movie reviews. I'm trying to infer sentiment. Is this a good movie or a bad movie, just looking at the reviews?
So the way I do this is I start with something called a BertClassifier. I basically specify a number of classes, so good movie, bad movie. I in this case have a data set, which happens to be the IMDB movie data set.
And then, based on these few lines of code, I'm able to predict from very simple statements whether the movie is good or not. So, "What an amazing movie": OK, clearly, as you can see here, 1.0 on the right means really good. "Total waste of my time": yep, really bad. So very simple to do.
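The demo used a BERT preset; as a runnable stand-in that shows the same two-class sentiment workflow and the same 1.0/0.0 outputs, here is a trivially simple keyword scorer. The vocabulary is made up purely for illustration, and this is of course nothing like what BERT actually does:

```python
# Toy sentiment scorer: ratio of positive to sentiment-bearing words, in [0, 1].
POSITIVE = {"amazing", "great", "wonderful", "love", "best"}
NEGATIVE = {"waste", "terrible", "boring", "worst", "bad"}

def sentiment(review):
    words = review.lower().replace(".", "").replace("!", "").split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos + neg == 0:
        return 0.5                       # no signal either way
    return pos / (pos + neg)

print(sentiment("What an amazing movie!"))   # 1.0 -> really good
print(sentiment("Total waste of my time."))  # 0.0 -> really bad
```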
And we can extrapolate this even further. So let's say that I wanted AI to help me write a story. Actually, this is the most canonical use case that folks are familiar with, truly generative AI: I want it to complete the story. So how does that work? In this case, I start with the GPT-2 preset, trained here on a newspaper data set with a lot of English-language text, so the model understands how sentences are formed.
And just with this, I'm able to start generating text: I say, hey, "I love peanut butter and jelly sandwiches, but I--" and you let the model complete it, and it does. It auto-completes based on what it's seen: "...but I love French fries even more." So again, this is a way in which, if you were an author, you could use AI to enhance your productivity and get more out of the system.
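The demo used a GPT-2 preset trained on a news corpus; the underlying mechanic of "predict the next word from what you've seen" can be sketched with a toy bigram model. The corpus below is just the speaker's example sentence, used for illustration only:

```python
from collections import Counter, defaultdict

def train_bigrams(corpus):
    """Count, for each word, which word most often follows it."""
    follows = defaultdict(Counter)
    words = corpus.lower().split()
    for a, b in zip(words, words[1:]):
        follows[a][b] += 1
    return follows

def complete(follows, prompt, n_words=5):
    words = prompt.lower().split()
    for _ in range(n_words):
        nxt = follows.get(words[-1])
        if not nxt:
            break                                # no known continuation
        words.append(nxt.most_common(1)[0][0])   # greedy: likeliest follower
    return " ".join(words)

corpus = "i love peanut butter and jelly sandwiches but i love french fries even more"
model = train_bigrams(corpus)
print(complete(model, "i love peanut", n_words=3))  # i love peanut butter and jelly
```

A real LLM conditions on the whole context rather than one previous word, but the interface is the same: give it a prefix, let it extend.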
And then now moving on to open ecosystems. So as I said, there's an immense amount of creativity and awesomeness happening here. But one of the challenges we face when it comes to scaling is there's a lot of infrastructure, different kinds of infrastructure, lots of frameworks, lots of stacks. So we need a way to make this a more tractable problem. And this is where the open ecosystems really comes in.
The first of which is OpenXLA. This is an open source compiler ecosystem created by a number of AI luminaries across the world. You know, Google, AWS, many others have been part of this. And the intent of this is to provide a way for developers to compile and optimize their code across a number of frameworks and across a number of hardware. So basically we want developers to not think about the permutations of things such as hardware and frameworks, but rather the outcome that they're trying to drive. So you need something like an OpenXLA here.
Another example is OFP8 interchange format. Again, as I noted, there's so many different pieces of hardware, so many different stacks that exist. If you needed every permutation to be able to speak together to each other, it would be very difficult, an intractable problem. So one of the ways in which you can enable interoperability is through this interchange format, where now different floating point representations can actually speak to each other through this interchange format, which makes this a very-- well, not very-- much of a tractable problem.
And then finally, the call to action, which I've kind of been hinting at all this time, is that scaling is really the need of the hour. We can do so much more, but it becomes increasingly expensive, increasingly power hungry, increasingly inefficient. So what is it that we can do to enable this creativity to continue unbounded? What are the ways in which, for instance, we can reduce the silent data corruption? This is something that you don't actually see at small scale, or you don't care much about at small scale. But as the scales get larger and larger, it becomes an increasingly hard problem. With that, I'm going to hand it off to Manoj. Thank you.
OK, looks like this one is working. What I'm going to do now, for the next section, is take a slightly different angle. I'm going to start with the overall AI use cases that we have, then try to dig into what exactly that translates to from a hardware perspective -- the challenges -- and then perhaps we'll see what we can do, because as was said in the previous sections, AI is going to present a lot of problems to us, and it's up to OCP to figure out how to create creative solutions for addressing those. So I'll start from there.
So first of all, I'm going to skip through some of these. Basically, at a high level, the AI use cases are: you train your models; you run inference, querying what the model has learned to figure out what the answers are; and then, as we talked about in the previous talk, Gen AI kinds of solutions, which is synthesis -- you're creating stuff the model wasn't explicitly trained on, generating something new based on the training.
At a high level, from an AI data center perspective, the use cases that actually drive our hardware would be things like ranking and recommendations, where you're trying to make recommendations for the more than 3 billion active users Facebook has about what content they should be looking at -- what kinds of movies or reels they're going to see. We have computer vision, and then we have the various language models that we talked about from the Gen AI perspective -- Llama, ChatGPT, all the stuff that we are using.
So at a high level, for Meta we have these two main use cases. One is, as we talked about, ranking and recommendation. This is where the bulk of the work happens, from the deep learning recommendation model perspective. For this, you need hardware for training, and then inference that actually makes the recommendation as users are looking at content. And then we have the Llama 2 kind of large language models. There you have training and inference, and inference you can consider in two stages: the first token, which is the first part of the response that comes out of Gen AI inference, called prefill; and then the rest -- as we said, "I like peanut butter and jelly, but..." and then "...I like fries" -- the whole sentence completion comes in the decode part.
Now I'm going to step back and ask, OK, what problems do we have? This is something you may have seen many times -- the very well-known AI memory wall paper. From there, what drives the systems is basically how fast your models are running and how large the models are: how many parameters they are trained on. So if you look at how the models are growing -- the number of parameters -- versus how much memory the accelerators have to work with, you can see that we are definitely moving into a place where you're going to start getting constrained on the memory capacity that the GPUs or accelerators have. Similarly, from a processing perspective: as we continue to process more, are you able to maintain the amount of bandwidth that you need? I'm going to stick to memory, because that's the aspect I'll focus on in this presentation, but the same thing applies to networking and other areas. On the right-hand side, the second chart, the x-axis is teraflops, and the y-axis shows how the various interconnects are growing. This focuses again on memory, the green arrow, but there are a lot of other interconnects behind it, including networking interconnects. What this shows is that bandwidth is also not keeping up with the scaling of the models and the capabilities of the GPUs. So overall, there is going to be a challenge in both capacity and bandwidth as we continue to scale to larger and larger models to enable the kinds of cool use cases we talked about.
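The capacity and bandwidth sides of that gap can be put into rough numbers with back-of-envelope arithmetic. All figures below are illustrative assumptions, not Meta measurements: a model's weights alone quickly exceed one accelerator's HBM, and every decoded token has to stream the weights through the memory system, which bounds throughput by bandwidth:

```python
def min_accelerators(params_billion, bytes_per_param, hbm_gb):
    """How many accelerators are needed just to hold the weights?"""
    weight_gb = params_billion * bytes_per_param
    return int(-(-weight_gb // hbm_gb))        # ceiling division

def bandwidth_bound_tokens_per_s(params_billion, bytes_per_param, hbm_tb_s):
    """Upper bound on decode throughput if every token reads all weights once."""
    weight_tb = params_billion * bytes_per_param / 1000
    return hbm_tb_s / weight_tb

# Example: a 70B-parameter model in 16-bit weights on an 80 GB, 3 TB/s part.
print(min_accelerators(70, 2, 80))                    # 2: weights alone need 2 parts
print(round(bandwidth_bound_tokens_per_s(70, 2, 3)))  # 21 tokens/s for one stream
```

This is exactly why larger models force both memory expansion (capacity) and batching or faster memory (bandwidth), which the rest of the talk goes into.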
So if you think from-- OK. So when we talk about designing hardware, we are trying to optimize-- we are solving this n-dimensional problem for various components that sit into the system, whether it is a compute, whether if I'm adding memory. When I talk about compute, by the way, this compute is from the perspective of accelerator or CPU, both, whatever computation capability you have. How much memory do we add? What kind of bandwidth do we have? What are we trying to train the model for? What is the model size? What is the cluster scale? How many GPUs or accelerators do you need?
And when we think of this and start mapping it back onto the various use cases we talked about: for ranking and recommendation models, we have training and inference; for large language models, we have training, plus prefill and decode as the parts of inference. What this chart is trying to say, without going into the specific numbers, is that different types of use cases push the limits of different components of the systems -- we may be memory bound in one use case but bound by networking latency in another as you start to scale out. And the challenge for most of the systems we're going to design into the data center for these future AI needs is making sure we have solutions for all of them, because it may be difficult to justify a very focused solution for only one of them unless it hits the scale we expect.
So with that, let me focus back on memory in AI systems. We'll go through the different use cases and the challenges that come from a system perspective. On the first side, as you see, we're going to hit the limit of how much memory the accelerators have. HBM provides very high bandwidth, but the capacity is going to be limited. So at some point we need technology solutions that allow a tier-two kind of memory for the accelerators -- expanding the capacity attached to the GPU or accelerator, which may be shared with the CPU, because the split of work between CPU and GPU will continue to evolve. We want to make sure we have the memory capacity and bandwidth within a node; we call this node memory expansion, where we need high capacity and high bandwidth. Then, as we start hitting the limit of one accelerator, you get more accelerators working in a scale-up kind of system, where you're running a single model across multiple GPUs almost as if it were a single system. There you really need high-bandwidth, low-latency connectivity, because you're making sure the work across all of them stays connected and in sync. In the future, as we look into memory architectures, system interconnect architectures, the ways accelerators are evolving, and the new rack architectures we're going to see, you may start seeing disaggregation or more composability of the racks, where memory can start getting separated. We have seen the network provide disaggregation of storage and compute in the past.
Now it can become the opportunity for memory, so that we can have independent scaling of compute and memory as we look into the AI use cases.
Let me look at it from the network side, then. In a traditional network, this is the architecture we have: you have a top-of-rack switch and a whole hierarchy of switches above it, which is what we call the front-end network, where developers have access. This is where, if you're training a model, you get the data ingested from the storage where you keep it; you do pre-processing and provide it to the GPUs and accelerators to train on. This is a very large network -- the traditional data center network we talked about. What's changing is that, as we start looking at the scale-up system, where we said we need a higher-bandwidth, lower-latency network, these GPUs will get connected with much higher bandwidth and lower latency in what we can call a scale-up network, which gives an almost single-system kind of view across the multi-GPU infrastructure. So this becomes a second type of fabric in the data center, if you will. But that will not be enough to provide the kind of scale we're trying to reach. With model parallelism, we are talking about maybe 128 or 256 accelerators or GPUs at that scale. But for the larger jobs, we are looking at maybe 32, 64, or 128 thousand GPUs collaborating to train a single model. At that point, we will need a scale-out network -- again high bandwidth and low latency, though it can get by with relatively lower bandwidth because you are distributing the work across a much larger set of accelerators. So you can see that what starts as a single fabric in the data center today is already evolving into at least three types of networks, with scale-up and scale-out networks being brought in.
So the systems that we talked about, we need to have something that expands the memory inside the node, a system that basically allows you to scale up with having high bandwidth and low latency, and then something that goes-- even scales out to the much larger set with the scale-out networks.
Some of the challenges and opportunities -- I'm shifting gears to specifics now. How do we really get what we want? From a node memory expansion perspective, today we are limited in what we can add: DRAM to the CPU, typically, and HBM to the accelerators. If you really want to add more, at some point you start getting beachfront-limited by the number of pins you can carry. This is where CXL plays a good role, because you can drive it with high-speed serial links. However, where CXL is today, it is definitely lacking in bandwidth compared to other solutions from a SerDes perspective -- which basically means that if you run at PCIe Gen 5 or Gen 6 speeds, you need a much, much higher number of SerDes lanes than if you were running at a higher per-lane rate, and that becomes an integration challenge. What's true for node memory expansion is also true for accelerator-to-accelerator, or scale-up, interconnect. A CXL kind of solution can provide very good memory semantics; however, the challenge is going to be the speed. Just to give an example: Ethernet in the next few years will be running up to 224 gigabits per second per lane, but if I just follow the current roadmap of CXL, which goes with PCIe, we'll be at Gen 6 at best, which runs at 64 gigabits per second per lane in 2026. There are other challenges to fix too. If we want multiple systems to work as a single system in a scale-up kind of environment, how are they going to share memory? How are they going to achieve the targeted bandwidth and not be limited by the underlying infrastructure? Because the use cases are very different -- AI is driving these use cases much differently than the traditional ones.
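That lane-speed gap can be made concrete with simple arithmetic. The per-lane rates come from the talk; the target bandwidth is an illustrative assumption, and encoding overheads are ignored:

```python
def lanes_needed(target_gbytes_s, gbits_per_lane):
    """SerDes lanes needed to hit a target bandwidth (ignoring encoding overhead)."""
    target_gbits = target_gbytes_s * 8
    return int(-(-target_gbits // gbits_per_lane))   # ceiling division

# Suppose we want 800 GB/s of scale-up bandwidth per accelerator (assumption).
target = 800
print(lanes_needed(target, 64))    # PCIe Gen 6 / CXL lanes needed: 100
print(lanes_needed(target, 224))   # 224G Ethernet-class lanes needed: 29
```

Roughly 3.5x more lanes at PCIe Gen 6 rates for the same bandwidth is exactly the beachfront and integration problem being described.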
On the scale-out interconnect -- I think we heard a little bit about this in the morning, and we'll continue the discussion throughout the presentations -- Ethernet has done really well and has continuously provided growing bandwidth. But for the AI use cases, to get the most out of the system, we want to reduce the P99 latencies, because any impact on those latencies can have a large impact on AI jobs that train for multiple days or weeks -- and restarting those jobs is certainly not an interesting thing to do. So congestion avoidance and congestion management are both going to be important. Ethernet has always evolved by integrating the important requirements, and I think AI is going to push the requirements in this space a lot too.
As we look into the future, essentially, we are going to see more and more composability of the system, because we looked at basically five different types of use cases: recommendation training and inference, and the Gen AI use cases of training, prefill, and decode. To make sure we have a solution that can address the requirements for compute, memory, and network, it will lead to more and more composability of the solutions at the rack level as well as at the cluster level. Scale-up and scale-out fabrics become part of that kind of flexibility. Over time they can perhaps unify, but at this point the requirements seem to be separating out into different types of network topology. On the memory part of the problem, I want to make a plug for the work we are doing. Composable Memory Systems is a group working within OCP that was started last year. That group has made good progress toward an industry-standard discussion of how we add more memory, how we make sure it can be composed and brought up appropriately, how we test it, and how we create the benchmarks for people to evaluate solutions.
A lot of blueprints are being done here, and so I'm going to leave you guys with that call to action as a plug for the CMS. There is a whole eight to five session tomorrow, almost 25 speakers speaking. A lot of focus on making sure that we solve the problem for the memory as these AI systems start driving our infrastructure requirements. Thank you. Let me hand it off.
Hi, my name is Alex Wetmore, and I'm a technical advisor in the office of the CTO, and I come from a software background, so my talk might be a little bit different.
Across the talks we've seen today, we've seen a lot of examples of this slide, and we've talked a lot about how generative AI and large language models are changing everything. I want to focus on how that affects the hardware, especially during inferencing. I also really like this image, which shows how the operations are growing hugely as the model sizes grow.
So what have LLMs really changed? The models have gotten large enough that they're using all of the accelerator memory and more for weights, and LLMs have also changed inferencing because it now has two distinct phases with different performance characteristics. There's a prompt phase that processes the initial text, and then a token phase that produces each word; the token phase is very memory bandwidth bound, while the prompt phase is very compute bound, and I'll go a little more into the details on that. Because the models no longer fit on a single accelerator, we have to do distributed inferencing across multiple accelerators, and this is a little bit new for inferencing -- traditionally, I've thought of inferencing as generally fitting on one accelerator and not requiring a high-bandwidth backend network. And then KV caching is a unique feature of LLMs, which has also changed a little of how you think about programming the accelerators and adds yet another new requirement for memory capacity and memory bandwidth.
So what is the KV cache? The KV cache is used during the token phase and replaces a quadratic computation which would be what's going on in these boxes here, if you can see where my mouse cursor is, with a linear computation that's based on what the previous token was, and so it's replacing that computation with a memory look-up from the prior phase.
And this is very important, because during token processing we have to run every token through every weight of the model. If on top of that we did the full quadratic self-attention computation, we'd have something that is both extremely compute and memory intensive; the KV cache trades that compute away for something that is primarily memory intensive. The other interesting thing about having this split of prompt and token phases is that the ratio differs across application types. Chat, for example, is very token heavy: you may give a very small prompt and generate a very large number of tokens. Summarization is much more prompt heavy: you give a large prompt and generally produce a smaller number of tokens; translation would be another example of a workload like that. And batching is also necessary to get good reuse of the weights during token processing, because otherwise everything would be completely memory bandwidth bound, just loading the same weights over and over again.
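The trade being described here can be shown in a few lines of numpy: recomputing keys and values over the whole prefix for each new token repeats work quadratically, while caching the previously computed K and V rows makes each step one row of new compute plus a memory lookup. This is a minimal single-head sketch with made-up dimensions, not production attention code:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                   # tiny head dimension for illustration
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend_last(q, K, V):
    # Attention output for the newest position only.
    return softmax(q @ K.T / np.sqrt(d)) @ V

tokens = rng.normal(size=(6, d))        # embeddings for 6 decoded tokens

# Without a cache: at each step, recompute K and V for the entire prefix.
no_cache = [attend_last(t @ Wq, tokens[:i + 1] @ Wk, tokens[:i + 1] @ Wv)
            for i, t in enumerate(tokens)]

# With a KV cache: compute K and V once per token and append.
K_cache, V_cache, cached = np.empty((0, d)), np.empty((0, d)), []
for t in tokens:
    K_cache = np.vstack([K_cache, t @ Wk])   # one new row of compute...
    V_cache = np.vstack([V_cache, t @ Wv])   # ...the rest is a memory lookup
    cached.append(attend_last(t @ Wq, K_cache, V_cache))

print(np.allclose(no_cache, cached))    # True: same result, far less recompute
```

The cache itself is why KV caching shows up as a new memory capacity and bandwidth requirement: those K and V rows have to live somewhere and be re-read every step.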
In distributed inferencing, we're using model parallelism to split the workload across multiple accelerators. In this example, I'm showing two, but there will often be more. And here we're partitioning for weights so that we can put a fraction of the weights on each of the accelerators. But this introduces a new communication step, which takes place after the computation. And this is a blocking communication step. It's very difficult to overlap this with some compute. And so the interconnect bandwidth and latency matters a lot to get good performance out of the systems.
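A minimal numpy sketch of that pattern: split a weight matrix across "accelerators", let each compute a partial result in parallel, then perform the blocking communication step (here, a concatenate standing in for the gather over the interconnect). Shapes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8))            # activations for a batch of 3
W = rng.normal(size=(8, 6))            # full weight matrix

# Column-wise partition: each of 2 "accelerators" holds half the columns of W.
shards = np.split(W, 2, axis=1)

# Each accelerator computes with only its shard -- this part runs in parallel.
partials = [x @ shard for shard in shards]

# Communication step: gather the partial outputs back together. Nothing
# downstream can proceed until this completes, which is why interconnect
# bandwidth and latency matter so much for distributed inference.
y = np.concatenate(partials, axis=1)

print(np.allclose(y, x @ W))           # True: same answer as the unsplit matmul
```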
An area that I'm very excited about-- sorry, this is an old slide. OK. An area that I'm very excited about is microscaling formats, which are something that we've just introduced and announced as a collaboration across multiple companies today. The microscaling formats allow us to do inferencing using int8, floating point 8, floating point 6, and floating point 4-bit formats and get very similar accuracy at the output. What's really advantageous about this is that by reducing the bits from 32 bits or 16 bits down to, say, 6 bits, we're getting a multiplicative effect on the memory bandwidth, on the interconnect bandwidth, and reducing the memory requirements for holding weights and reloading weights. So unfortunately, this doesn't have the slide I wanted to show. But the slide I wanted to show has a very nice graph showing the lack of accuracy degradation across using the microscaling formats. And I highly recommend going and looking at the OCP paper that is coming out today with details on the formats, how block scaling works, and how these can be implemented into accelerators or into libraries and used on existing accelerators.
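The core mechanism behind these formats, block scaling, can be sketched in numpy: a block of values shares one scale factor, and each element is stored in only a few bits relative to that scale. This toy uses a signed 4-bit-style integer range as the element format; the real MX specification differs in detail, so see the OCP paper for the actual formats:

```python
import numpy as np

def quantize_blocks(x, block=8, qmax=7):
    """Quantize 1-D data in blocks: one shared scale + low-bit ints per block."""
    x = x.reshape(-1, block)
    scales = np.abs(x).max(axis=1, keepdims=True) / qmax   # one scale per block
    scales[scales == 0] = 1.0                              # avoid divide-by-zero
    q = np.round(x / scales).astype(np.int8)               # elements in -7..7
    return q, scales

def dequantize(q, scales):
    return (q * scales).reshape(-1)

rng = np.random.default_rng(0)
weights = rng.normal(size=32).astype(np.float32)

q, scales = quantize_blocks(weights)
restored = dequantize(q, scales)

# Elements now cost ~4 bits each (plus one shared scale per block of 8)
# instead of 32 -- a multiplicative saving on memory and interconnect traffic.
err = np.abs(weights - restored).max()
print(float(err))   # small: bounded by half of one quantization step
```

The quality question is then whether that bounded per-element error degrades model accuracy, which is what the graph being described (and the OCP paper) addresses.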
So next steps: what do we need to do across the system? We need to be looking for ways to really improve both inferences per watt and inferences per dollar, through things like microscaling, sparsity, and technology advancements across compute, memory bandwidth, memory capacity, and interconnect. And then there are software efficiency improvements we need to be looking at as well, across batching, sparsity, and, in some cases, moving smaller models to edge accelerators to free up cloud capacity.
Great presentation. Good point of views. I'm just curious on your thoughts about AI at the edge and what stuff you need to do for the hardware, software system solutions.
For me? I think that's a good question. I haven't explored that a lot personally because my focus has been on the cloud. I think for me, there were some earlier talks about model specialization and being able to do smaller, fine-tuned models that are task-specific. And I think that that's a good thing that can be used and approached on the edge. And obviously, there's a lot of work going into edge accelerators for laptops and cell phones.
I was just going to say the way I think about this is that it's a bit of a cascading hierarchy, where there are some queries which you should just solve at the edge and be done with it. And then only the ones that are super serious do you call back home. So just think of it as a classic memory hierarchy, except that it's in this case, what can you do on the phone? What can you do at the edge? What can you do in the data center? And then the architecture follows that.