All right, good afternoon everyone. I hope you are all having a great time at the summit here in Lisbon. I certainly have enjoyed all of the talks, and also connecting with friends and former colleagues after a while, so it's really good to be here. At Intel, we are truly embracing a systems-first approach to building large-scale AI infrastructure, and today I'm going to talk about that approach and our thinking around building scalable systems for AI at data center scale. Besides talking about general principles and approaches for building these systems, I will also dig a little deeper into what we are doing with our recently launched Gaudi 3 generation of data center AI accelerator products, which was announced just a couple of weeks ago, so it is fresh off the press. With that, let's jump right in.
So why does a systems-first approach really matter? Here are a few reasons; I'm sure there are plenty more, but this is the bird's-eye view. First and foremost, the data center is fast becoming the unit of compute when it comes to AI. The optimization point that both software and hardware professionals are targeting is no longer an individual ingredient or accelerator but the entire data center. The conversation is rapidly shifting away from optimizing at the teraflops-per-watt level for an accelerator toward optimizing at the petaflops- or exaflops-per-megawatt scale for the data center, and TCO optimizations are likewise no longer limited to the individual accelerator but now span the entire data center. Secondly, and this should not be news to anyone here, there has been near-exponential growth in the adoption of large-scale AI infrastructure, for two reasons. One is obviously to support the ever-growing size of the models being trained and deployed for AI. But there has also been near-exponential growth in the number of developers focusing on AI over the past five to ten years. Ever since 2012, when AlexNet revolutionized the field of computer vision, and then 2018, when BERT came along and changed the landscape of language modeling entirely, the developer count has increased quite substantially, hence the need for large-scale AI infrastructure. And finally, we cannot accomplish a true systems-based approach until we look at the stack end to end. A vertically integrated, full-stack AI platform is really the only way to take a systems-based approach that meets the ever-growing needs of developers. When I say developers, I mean both the AI and machine learning experts and the infrastructure software developer persona.
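To make that shift in optimization scope concrete, here is a back-of-the-envelope sketch of my own (illustrative placeholder numbers, not figures from the talk). Note that an exaflop per megawatt is numerically the same ratio as a teraflop per watt; what changes at data center scale is the scope of both sides of the ratio.

```python
# Back-of-the-envelope comparison of chip-level and data-center-level
# efficiency metrics. All numbers are illustrative placeholders, not
# figures from the talk.

chip_tflops = 1000            # hypothetical accelerator: 1 PFLOPS peak
chip_watts = 700              # hypothetical accelerator: 700 W TDP
print(f"chip-level: {chip_tflops / chip_watts:.2f} TFLOPS/W")

# 1 EFLOPS/MW is numerically the same ratio as 1 TFLOPS/W. At data
# center scale the denominator grows to include cooling, networking,
# and power-delivery overhead (captured by PUE), and the numerator is
# delivered rather than peak compute.
num_accelerators = 8000
utilization = 0.4             # fraction of peak actually delivered
pue = 1.4                     # total facility power / IT power
facility_mw = num_accelerators * chip_watts * pue / 1e6
delivered_eflops = num_accelerators * chip_tflops * utilization / 1e6
print(f"data-center-level: {delivered_eflops / facility_mw:.2f} EFLOPS/MW")
```

The gap between the two printed numbers is the point: a data center full of efficient chips can still be an inefficient unit of compute.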
That focus on developers is a good segue into this slide, which is about what exactly it means to take a systems-based approach. Starting with ease of use and developer productivity: I talked about the two developer personas, and we as infrastructure providers need to ensure that whatever is put out there in the data center is easy to use for all kinds of developers. Second is the aspect of multi-tenancy and all of the security and isolation guarantees that need to come along with offering multi-tenant solutions. This is especially true for the large public cloud providers, which at any given time support hundreds of thousands of customers on a single cluster, not just for mainstream infrastructure but, going forward, for AI infrastructure as well. A corollary of that is managing the infrastructure efficiently through intelligent orchestration frameworks and job schedulers. Today the world revolves around Kubernetes-based infrastructure for this, but other frameworks, including ones at a slightly higher level of abstraction, may be coming along; infrastructure providers need to be cognizant of that, and a systems-level approach needs to comprehend that aspect as well. Next is end-to-end lifecycle management for the cluster itself, starting with the initial deployment, continuing through allocating capacity to individual customers, and finally end-of-lifing it. All of that needs to happen in a seamless, automated way, which must be comprehended from day one when we start designing these clusters. Along with that comes the control plane: besides the data plane, which is the compute, the control plane is what orchestrates all of this. We talk less about the control plane, but it is very important as well. Obviously, workloads need to scale, because there is no point in building large-scale systems without also accounting for the scalability frameworks that allow workloads to run on that infrastructure. And finally, but no less important, is the hardware itself. We, as part of the OCP community, should especially align with the notion of a standards-based hardware approach to system design, because there is only a certain amount of room for diversity in the infrastructure, and if we want to make AI infrastructure truly ubiquitous, a standards-based approach is the only way to go about it.
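As a minimal sketch of what that Kubernetes-based orchestration looks like from a developer's seat: the extended resource name `habana.ai/gaudi` follows Habana's device-plugin convention as I understand it, and the image name is a placeholder.

```python
import json

# A minimal sketch of scheduling an accelerator job on Kubernetes, the
# kind of orchestration layer described above. The extended resource
# name "habana.ai/gaudi" follows Habana's device-plugin convention as
# I understand it; the container image is a placeholder.
pod_spec = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "gaudi-training-job"},
    "spec": {
        "restartPolicy": "Never",
        "containers": [{
            "name": "trainer",
            "image": "example.com/training-image:latest",  # placeholder
            "resources": {
                # Ask the scheduler for one full 8-accelerator node.
                "limits": {"habana.ai/gaudi": 8},
            },
        }],
    },
}

# In practice this is serialized to YAML and submitted with kubectl,
# or created through a Kubernetes API client.
print(json.dumps(pod_spec, indent=2))
```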
So with that general framing of the approach, let me jump into a few specifics about our Gaudi 3 accelerator product that was launched a couple of weeks ago.
The Gaudi family of products is a truly AI-optimized accelerator family and does not carry the encumbrances of traditional GPUs, which need to support certain graphics functionality as well. It is purpose-built for AI, with 64 tensor processor cores and eight matrix multiplication engines, which are the heart of the compute element in the accelerator. Besides compute, memory, as you can all appreciate, is extremely important, so there are two tiers of memory: HBM, with large capacity, 128 gigabytes, and tons of bandwidth, 3.7 terabytes per second; and, equally important, on-die SRAM, which offers software a caching layer for effective reuse of activations in the AI modeling context, at significantly higher bandwidth even than HBM. And finally there is networking, which forms the third pillar of the accelerator complex: 24 integrated NICs, each capable of serving traffic at 200 gigabits per second, leveraging 100-gigabit SerDes. That packs a lot of punch and creates a very efficient, well-balanced system. The host interface is PCIe Gen 5.
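Tallying the figures just quoted (my own arithmetic, to show how balanced the memory and network sides are):

```python
# Per-accelerator bandwidth budget, computed from the figures above.
hbm_bw_tb_per_s = 3.7                   # HBM bandwidth, terabytes/second

num_nics = 24
nic_gbps = 200                          # gigabits/second per NIC
net_tb_per_s = num_nics * nic_gbps / 8 / 1000   # bits -> bytes -> TB/s

print(f"network: {num_nics * nic_gbps / 1000:.1f} Tb/s "
      f"(~{net_tb_per_s:.2f} TB/s) vs HBM: {hbm_bw_tb_per_s} TB/s")
```

That works out to 4.8 terabits per second of aggregate network bandwidth, roughly 0.6 terabytes per second, sitting next to 3.7 terabytes per second of HBM bandwidth on each accelerator.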
This balanced design allows us to take a very modular approach to building these clusters, starting from a single node, which is the basic scale unit, to something like a 64-node cluster, which becomes a scalable unit for a data-center-scale design, and then expanding that to a 4,000-accelerator complex or even an 8,000-accelerator complex. I will not go into the speeds and feeds in detail; you can look at them offline. The idea here is modularity.
How we go about accomplishing this modularity is to ensure that it is baked in at the very foundational layer of the infrastructure, which in this case is a node comprising eight of these accelerators on an OCP-compliant UBB. There is an all-to-all mesh, a fully connected interconnect between all eight Gaudis, that offers an incredible amount of bandwidth for scale-up, given that the tensor-parallel domain for model parallelism typically extends to eight accelerators in today's world. And then three of the 24 links I mentioned earlier are brought out to OSFP connectors on the server to allow for scale-out.
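From those numbers, the per-accelerator link budget works out as sketched below; the three-links-per-peer detail is my inference from the 24-link total, not a quoted figure.

```python
# Link budget per accelerator: 24 links split between the intra-node
# all-to-all mesh (scale-up) and the OSFP ports (scale-out).
total_links = 24
scale_out_links = 3                        # brought out to OSFP
scale_up_links = total_links - scale_out_links   # 21
peers = 7                                  # other Gaudis on the baseboard
links_per_peer = scale_up_links // peers   # 3 (inferred, not quoted)

link_gbps = 200
print(f"scale-up:  {scale_up_links} x {link_gbps} Gb/s = "
      f"{scale_up_links * link_gbps / 1000:.1f} Tb/s per accelerator")
print(f"scale-out: {scale_out_links} x {link_gbps} Gb/s = "
      f"{scale_out_links * link_gbps} Gb/s per accelerator")
```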
That brings me to the node-level architecture showing the OSFPs, which can then be leveraged to build out what we are calling here a subcluster comprising 128 accelerators. In this case, this becomes the scalable unit, and as you can see, it is quite flexible. There is a leaf switch tier: for AI use cases we typically resort, as you may know, to a fully non-blocking fat-tree-class network with two layers of switching, and this is the leaf layer.
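As a rough sanity check on that subcluster, here is my own port math, assuming a non-blocking leaf tier; the actual switch radix and cabling are not specified in the talk.

```python
# Port-count math for the 128-accelerator subcluster, assuming a
# non-blocking leaf tier (illustrative; not figures from the talk).
accelerators = 128
nodes = accelerators // 8                  # 16 nodes per subcluster
scale_out_links_per_accel = 3
downlinks = accelerators * scale_out_links_per_accel   # 384 x 200 GbE

# Non-blocking means every downlink is matched by equal uplink
# capacity toward the next switch tier.
uplinks = downlinks
print(f"{nodes} nodes -> {downlinks} downlinks + {uplinks} uplinks at 200 Gb/s")
```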
And then, building on top of this, we can build even larger clusters with a two-tier switch hierarchy, showcased here as a 4,000-accelerator Gaudi 3 cluster, a fully non-blocking fat tree as I said earlier, which allows us to offer customers the ability to train the largest of models. We are talking about the trillion-parameter range on this kind of infrastructure.
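To give a sense of why trillion-parameter training pushes the accelerator count into the thousands, here is a rough footprint estimate using a standard mixed-precision rule of thumb (my illustration, not figures from the talk):

```python
# Rough memory footprint of trillion-parameter training (rule-of-thumb
# mixed-precision accounting: bf16 weights + fp32 master weights +
# Adam optimizer moments, roughly 16 bytes per parameter).
params = 1e12
bytes_per_param = 16
state_tb = params * bytes_per_param / 1e12       # ~16 TB of training state

hbm_per_accel_tb = 128 / 1000                    # 128 GB HBM per Gaudi 3
min_accels = state_tb / hbm_per_accel_tb
print(f"~{state_tb:.0f} TB of state -> at least {min_accels:.0f} accelerators"
      " just to hold it, before activations and data-parallel replicas")
```

That minimum of roughly 125 accelerators covers only the weights and optimizer state; activations, batching, and data-parallel replication are what drive real deployments to the thousands.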
Now, I would be remiss if I did not talk about our sustainability efforts on Gaudi, especially in the context of this OCP regional summit. Besides the traditional air-cooled version, we also support two flavors of liquid cooling: single-phase and two-phase liquid-cooled designs based on cold plates. The uniqueness of this design is that a common cold plate serves both cooling solutions. We are on our way to launching the air-cooled product first, and quickly following that, we will be launching the liquid-cooled products as well. So we are very committed to the sustainability effort here.
Speaking of sustainability, we are obviously not confined to the Gaudi family of products here. At Intel, as you all know, we also have a huge CPU business focused on Xeon, so this slide covers our overall cooling infrastructure solutions for both the Gaudi family and the Xeon family. More notably, this includes both liquid-cooled solutions and immersion cooling, which is fast emerging as a far more efficient cooling subsystem. It will take some time to gain traction in mainstream deployments, but we need to pay attention to it from the get-go, because the numbers are staggering: we are talking about a roughly 30% reduction in power consumption just by virtue of leveraging immersion cooling. You saw in other sessions earlier today that with immersion, people are talking about PUEs close to one, and that is probably the only path for us to get there. As I said, our Xeon family supports single-phase and two-phase liquid cooling, and immersion cooling is supported with the third, fourth, and fifth generations of Xeon, with the recently announced Xeon 6 family set to support it as well.

With that, this is my last slide. In conclusion, I would just like to say that we are super excited to continue our journey toward a systems-based approach for AI infrastructure. The call to action is for you all to come participate in that journey with us. We are looking forward to collaborations across the industry to create an open ecosystem for AI systems, as opposed to a closed, proprietary set of technologies. With that, I thank you for listening, and I'm happy to take any questions. Thank you.