CDI-Info/297 at main · vaj/CDI-Info · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292

I am DK Panda from the Ohio State University, and my colleague, Professor Hari Subramoni, is also here. So, the two of us will be taking you through this tutorial. The title is 'High Performance and Smart Networking Technologies for HPC and AI.' The slides are available here; please take a look. You can download them and post the link on the web as well.

So, the outline will be like this: we'll have an introduction, just a very high-level introduction, followed by just asking questions—why are we talking about high-performance networking for HPC and AI? We'll go through a little bit of past history, what has happened, and why networking like InfiniBand, high-speed Ethernet, RoCE, Slingshot—all these—are coming into important focus these days. Then, we'll go into more details gradually; we'll start with the communication model and semantics of high-performance networks. After that, we'll take a look at, we'll start with InfiniBand and high-speed Ethernet, that is the kind of common networking technologies used on many systems, we'll start with those. And then, gradually, we'll compare and contrast with Omni-Path, NVLink, AMD Infinity Fabric, Amazon EFA, then Slingshot. We'll be going into a little bit about the smart NIC, which are more like the DPUs, IPUs, and what can be done there. And then we'll come back and take a look at some dedicated AI chips, like Cerebras, Habana Gaudi. And once we get exposed to all these features, we'll try to also take a look at those software stacks, like how exactly the hardware features, the networking features, are being linked to the upper-level software. And then we'll take a look at some sample case studies and performance numbers, and then we'll conclude. And please feel free to ask questions on the chat; Hari, you can monitor, and then we can answer.

So, if you take a very broad look over the last, like, 20-30 years, there has been a lot of growth in high-performance computing. And that growth is happening in terms of processor performance, especially chip densities doubling almost every 18 months; we see more and more number of cores per socket. So each socket, or each server, is also getting, these days, very heavy—like the latest AMD, if you take a look, has like 128 cores; similarly, Intel has 96 cores. So within a node, of course, there are intra-node interconnects, but here we'll be focusing more on the across-the-node, or inter-node. So when these nodes are becoming very dense, there is a need; you also need to have good inter-node connectivity so that they can get balanced out. And this is where I think the commodity clusters are becoming very important. And a lot of people have been using those commodity clusters all over the place.

So, when I say 'commodity cluster', the idea is, can you actually go and then purchase a bunch of CPUs, GPUs, interconnects, memory storage, and then have some open-source stacks or any other libraries available? And then you have the complete system. So, that's what we'll be keeping a view here. If you take a look at this top 500 list, which is like a commodity computing clusters, where they fit in the top 500, many of you must be knowing about this top 500. This actually evaluates the ranked systems—supercomputers in the world—twice a year: one in November in the US, and one in June—July, May, June, July—in Europe. So, if you take a look at the history, in the very early days, these systems used to be very proprietary systems. Okay, think of these as the connection machines, IBM traditional systems, IBM Blue Gene, and all. That means you purchase the system's entirety. So, you cannot just go and buy pieces of the hardware and then put it together. But, over the years, as the commodity networking has come, now you can see almost like a very close to 100%. Most of the systems are commodity in nature. That's what you will see.

And then, what is happening? It is not just the compute cluster—it's a focus on computation, but you will also see storage clusters. They also have a similar trend: you have a bunch of servers, metadata nodes, compute nodes, server nodes. You are connecting them with some network. You also see this is the most dominant one, like for all the e-commerce, financial transactions—like all the things of Amazon and all. So, these are the data centers where you have different tiers, tier one, tier two, tier three, kind of thing. And they are also connected with a switch. So, if you take a look at the kind of the role of networking, it is very diverse here within the system, and also across the systems, LAN and more.

And at the same time, we will also see that cloud computing is also getting a very broad view. All high-performance networking technologies are gradually moving into cloud computing. In fact, if you take a look at Azure, their high-end cluster instances use InfiniBand. Similarly, Oracle uses RoCE; AWS, which we'll talk a little bit about later, actually uses their own proprietary adapter called EFA, but that also gets interconnected with OFI. So, the networking technologies are gaining importance there.

And of course, these days we are in the field of AI. So, all the deep learning, machine learning—as you would have been hearing—all these large-scale trainings require thousands of GPUs, or several months, to get the training done.

And especially the large language models, as you can see here, are becoming larger and larger, and continuously, people are trying to make them larger. And that means the training requires very large systems, and when you have large systems, the role of networking is also becoming very important.

And this also gives a very quick overview of how we have moved in the field of AI, starting from 94 million to here, like 540 billion parameters. People are talking, and already there is a trillion-parameter consortium. There is an organization looking to train these trillion-parameter models.

So, if you take a look at this very broad perspective, you might ask, 'What are the networking and I/O requirements?' Okay. So, if you're an architect, you're trying to ask the question. And here, what you will see is that there are multiple parameters, of course. In the traditional way, people always want the latency and bandwidth. Latency is like a round trip or a one-way latency. You can think of server-to-server, and we also want to minimize this, and the bandwidth is like a from node to node at the rate at which we can ask for now. In addition to these two parameters, another parameter has become very important over the years. It's called CPU utilization. And this CPU utilization is not for application, but for protocol processing. Okay. So if you have a network, we have some protocols. So the question is, how much time does it take for the protocol processing? So, if I have two networking technologies that deliver similar latency and bandwidth, but one has lower CPU utilization, that means that one is better. Okay. Um, if, instead of like spending 30% of the cycles for network protocol processing, I can spend 10%, then my... Nice. Okay. 90% of the cycles are left for application. Okay. And that, we will see gradually as we go along, will play a very important role in all these networks. So, that is kind of the basic performance. Then, of course, we need good storage area network, good wide area network connectivity, quality of service, reliability, availability, serviceability. So, all these are important for production way systems. But then the question is, like, everybody wants it with low cost. So, this is where a lot of competition happens across different vendors to reduce the cost of the system.

So, let's take a look at this again from a very high-level perspective. Let's take a look at some basic computer architecture and see what has happened over the years. Okay. So, if you take a look at—I'm sure in any of your digital logic classes—you would have seen a setup where we have processors with some cores, memories, then there is an I/O bus, then there is a network adapter. Then it gets connected to the network switch. So, this is a typical server architecture.

Now, what happens over the years when you are putting it together—a system with some kind of processor, some kind of memory, some kind of a network—is that you will see there could be different kinds of bottlenecks. One could be processing bottlenecks. These processing bottlenecks, what we are talking about here, are protocol processing bottlenecks. Okay, not the actual application. There could be I/O interface bottlenecks, but the rate at which data goes in and out can vary. And the third one is network bottlenecks. So, even though you have very good systems, there might be an issue with the overall network.

So, let's take a look at these three bottlenecks in detail. So, protocol processing bottlenecks. If you start like a— I'm sure all of you are familiar with TCP/IP, UDP/IP. These have been there since the evolution of computer networking, and the standard TCP/IP stack actually goes through a lot of these multiple layers. And here, you will see like a post-processor handles all aspects of communication, like data buffering, integrity, routing aspects, signal. So, once you go through all those steps, it takes a lot of time. And that's why if you do some TCP/IP level communication, it will be a little bit— very little bit— time-consuming. Like in the best systems these days, you may see like a three-microsecond, five-microsecond delay, depending on the system and the processor speed. Now, the question is, can we really reduce that? And this is where we'll try to introduce some kind of a user-level protocol in the next section, and that will try to reduce the latency in a significant manner.

So that is the protocol processing bottleneck. Now let's take a look at the I/O interface bottleneck. So, if you take a look at the server, this is the I/O interface, which actually determines the rate at which data can come inside the server and outside. Sometimes it's also known as the last mile bottleneck. So, if you take a look at the history, like PCI started in 1990, and that used to be the I/O interface. It used to be like a bus, and it had 33 megahertz, 32 bits. It is 32 bits, and each bit was like 33 megahertz. So, if you do the math, it was like trying to deliver one gigabit per second, bidirectional. But then, of course, as the technology improved, that was not sufficient. So, PCI-X was introduced. They expanded like 32 bits to 64 bits, and also 33 megahertz. They went to 133, and sometimes some versions also had 266, 533. So, the broad idea was that we can increase the clock speed and increase the bus width, and then performance will keep on increasing. But if you take a look, if you remember, like your basic digital logic class or any analog logic design class, when you have a lot of bus, like a bus, and then each bit is operating at a very high rate, you will see a lot of cross-talks. And then with those cross-talks, your error keeps on increasing. And that's what people found out when, in the late 1990s, people thought that this will not be the technology to move forward, and we'll see a solution a little bit later to address that.

And the final bottleneck is the actual network bottleneck. So that means we have designed the servers very well, even with the network adapter. But once it goes to the network, whatever happens with the network— the congestion, the bandwidth—those things impact the performance. So, this is where Ethernet, if you remember, started in 1979. It was good enough for almost 14 years. But then, as computers became faster and faster, people introduced Fast Ethernet, Gigabit Ethernet, ATM. Myrinet was the first technology, which was actually cluster technology. It was a wormhole-routed system, which came around one gigabit per second, and then the Fiber Channel came.

So, all these were like, as you can see here from this list, like we're in the mid-90s. And then people saw that we need a better solution. And this is where the InfiniBand and all high-performance networking came into the picture. And InfiniBand and high-speed Ethernets were introduced in the market around 2000. Since InfiniBand was new, it actually aimed to address all the bottlenecks, like protocol processing, IO bus, network speed. And then Ethernet was already there. And it aimed to directly handle the network speed bottleneck. And relied on compliance. And implemented new technologies to alleviate protocol processing and IO bottlenecks. So, we'll take a look at these in a little bit more detail.

So, let's continue. So then, what we'll try to take a look at is a high-level introduction of InfiniBand. It started, actually, as a consortium. Some of these companies don't exist anymore, like Compaq, Dell, HP, IBM, Intel, Microsoft, and so on. But if you take a look at their goal, it was to design a scalable and high-performance communication and I/O architecture by taking an integrated view of computing, networking, and storage technology. So, it was not just networking. It was a networking component and the architecture, but they were trying to take a global look at how the systems need to be designed. Then, the volume one version was... The volume one actually introduced almost like 24 years back, October 2000. And then, as the technology got adopted by many organizations, system sizes kept on growing, and people discovered limitations in the InfiniBand architecture. People found some limitations. And this is where several annexes were released after that, like RDMA-CM, iSER, XRC, RoCE, virtualization. So, like this, the specification also has kept on getting updated. So the very latest one, you will see, like a 1.7, was released last July. So, you can actually go to the InfiniBand Trade Association website and actually download it. But if you download the document, you will see that it's not a simple flyer; it is a 1,700-page document. Okay, because it is a very open specification. Anybody can design these kinds of systems. Of course, I mean, we don't have to understand all the things unless you are trying to actually design components like the adapter, switches, cables. All those details are there. Here, we'll try to take a look a little bit at a high level, more from a user perspective or an architect perspective. And depending on the need, you can always jump in and then go into more details of the architecture.

So, similarly, just like InfiniBand started in the 2000s, prior to this, Ethernet was there. There was like a one-gig alliance, so then the 10-gig alliance was formed. And here, the idea was very similar: to achieve a scalable and high-performance communication architecture while maintaining backward compatibility with Ethernet. But then, as Ethernet evolved, multiple issues were discovered. Okay, so the first thing was like, if you remember, Ethernet started from like a 10 megabits, 100 megabits, 1 gigabit, 10 gigabits—everything was going in as a factor of 10. So after 10 gigabits, the obvious choice is to go for 100 gigabits, but then the 100 gigabits, if you try to put it in a system—you remember we discussed these three bottlenecks—the I/O interface bottleneck became very apparent, in the sense that, unless technology is there to handle that I/O interface, you cannot actually have even though you have the network as 100 gigabits per second, you cannot actually communicate with the server. So, this is what happened over the years as InfiniBand technology was evolving. It actually came to like a QDR technology will introduce very soon, what data rate, and by that time, actually, the interface technology, I/O interface technology, has evolved so that at least some, you know, intermediate like a 40 gigabit Ethernet can be introduced. Okay, so that's what Ethernet Consortium, uh, Ethernet Alliance had a lot of discussion, so instead of jumping from like a 10 gigi to 100 gigi, there was an intermediate technology introduced, 40 gigi, we'll see. And that was very similar to InfiniBand QDR. So then, as the technology again evolved, we'll see that the Ethernet and then we'll introduce very soon PCI Express, uh, so that PCI, what we introduced earlier, got moved to PCI Express, and that PCI Express also will see will have some concepts of lanes. So people wanted to reuse technologies from InfiniBand to even to Ethernet. Okay, at the hardware layer, so this is where over the years you will see now we have very different kind of a range of Ethernet products with different kinds of cost performance, 25 gigi, 50 gigi, 100—of course, it is going to 200, 400 kind of things. And, and this is what I introduced like the 40 gigabit sub was introduced with an IEEE standard, uh, then 25 gigi was introduced, and of course, new concepts like energy efficiency, power consciousness, they have also been introduced in the Ethernet.

So now, let's take a look at how, at least, these two technologies—InfiniBand and high-speed Ethernet—are trying to handle those three bottlenecks. Okay, so once we know that, then later on, we'll see other technologies like Omni-Path, Omni-Path Express, and Slingshot. Everybody is exactly handling the same thing, so we don't have to repeat all those things. We'll just take a look at the features.

So, the basic network of speed bottleneck—um, if you don't know, like, this InfiniBand—the word actually stands for infinite bandwidth. Okay, I have a—okay, I have a—you can ask a question. I mean, what does it mean by "infinite bandwidth"? There is nothing like that, but that's how it was coined, in the sense that as the technology will, as the computing technology, storage technologies will keep on improving, InfiniBand will continue to grow. That was the idea of that infinite bandwidth. Okay, so what the basic idea was that, if you remember, like, we had the PCI—uh, what is a bus-based systems—and it has two degrees of freedom: like, the bus speed and then the number of bits. They changed that to similar kind of, like, two degrees of freedom, but in a little bit different manner. Instead of a bus, it was changed to a bit serial differential signaling. Okay, so it was an independent pair of wires to transmit independent data, and in the InfiniBand literature, it's called a lane. So you can think of like, an infant pair; they are actually twisted and covered, and then you can have any number of lanes to put a bigger cable. Okay, so it still has two degrees of freedom. Okay, so you can change the clock speed of the lane, and you can have more number of lanes, and if you play properly, we'll see in the next slide, the bandwidth will keep on improving, and that's what exactly has happened over the years. So theoretically, there is no perceived limit on the bandwidth here.

So, let's see how it has actually played out over the years. Here, you will see like these are the technology we talked. So, InfiniBand started here; I'm saying 2001. It was introduced in 2000, but the actual technology adapters and all, commercially, it got available in 2001. So, what they started in the beginning, it's called single lane 1x, and the rate, the lane, was characterized as SDR, called Single Data Rate. So, that means if you take a look at the clock, we're able to send one bit in that clock. So, that's what is known as one SDR.

Also, in the networking technology, there is a some of you might be familiar, there is a payload rate and actually the physical rate. Okay, and, most of the time, at the network layer, we do some kind of error correction, encoding, decoding, so there is some overhead. What InfiniBand introduced, so what we are trying to show here is actually payload rate. Okay, the actual SDR, if you look at the literature, it is 2.5 gigabits per second, but it was using 8 to 10 encoding, um, to take care of all those error corrections and all, so the payload rate is 2 gigabits per second.

And, similarly, Ethernet was also introduced at the same time. The payload was 10 gigabits per second, but it was using 8 to 10, so actually, the physical layer was 12.5. Okay, so then what happened in two years, the InfiniBand consortium wanted to move into higher bandwidth, so they kept the same SDR, but they changed the lanes; you can see from 1x to 4x. So, that means instead of one lane, it is a four lane, so their four lanes are put together, you, you, you get it in a single cable. So, then you can see the bandwidth improved from two to eight.

So, then, in another two years, it actually moved to DDR, called Double Data Rate. So, that means in the same clock cycle, we are sending one bit, now instead of that, we can send actually one on the rising edge, one on the falling edge. So, it will be the double data rate, um, and then the number of lanes, even the same 4x, so it became 16 gigabits per second. And, at the same time, actually, you see that these are kind of a time frame, 2005, IBM actually was a part of the InfiniBand consortium, they actually went ahead, and they had a HDR technologies, but they had 12 such lanes. So, if you put it together, 24 gigabits per second, this was quite high bandwidth in the 2005 timeframe.

Okay, so there were some systems were deployed using that technologies, but what happened, since there were like a 12 such lanes, the cables were very heavy. And that actually led to not very stability in the integration. If you think of like servers, you plug a network card like this or network card like this, these cables were heavy, so that they were bending the network. So, the system was not being reliable. So, over the years, actually, that technology didn't succeed. But then what it moved ahead was that the same four lanes in 2007, this was QDR. That means Quad Data Rate. I referred to this earlier, like four bits, you can actually send in a clock. So, that's how it became like 32 gigabits per second. And since that IO interface was already providing that 32 gigabits per second, then the Ethernet Consortium, I was mentioning earlier, instead of from 10, going directly to 100, they had the intermediate one, which is the 40 gigabits per second.

So then in another four years, InfiniBand moved to, this is a little bit of interesting dynamics. What happened, typically, you know, in the digital world, we go in a multiple of like 2, 2, 4, 8, 16, kind of things. So here, what happened, instead of QDR, like going to 8 bits or 16 bits, they had some intermediate one called 40 Data Rate. So, that's how the FDR is introduced. And interestingly, what happened, by that time also, the encoding technology also improved, so here they were using 8 to 10 encoding, but then the encoding community came up with better solutions, which can do 64 to 66 bits, okay kind of an encoding, they came so it's only with a very little overhead, you should be able to do this encoding. So, if you do a little bit of math, this is where, in 2011, with 14 data rate, it was able to InfiniBand was able to deliver like a 54.6, and then if you have two such lanes, many of these network adapters have like a two ports out, if you do some channel bonding here or later on, we'll introduce a terminology called multi-port, if you do that, then in 2012, InfiniBand actually introduced, like we can actually deliver 100 gigabits per second. Okay, that was the first time, 2012, it went to a little bit ahead of 100 gigabits per second, but then what happened, since this, this components people wanted to reuse, okay in the cables, in terms of adapter, the interfaces, uh, the even the like encoding chips and uh different chips, so they wanted to leverage, like the single lane concept and that went back to either you use one lane or two lane, this is where 25 or 50 gigabit was introduced. Okay, once again, 10 gigabit was there, Ethernet 100, then 40 was introduced, then if you use like a one such lanes or two such lanes, you can have 25 or 50. These were the intermediate technologies were introduced, and then in of course 100 gigabits was also available, then Omni-Path we'll talk about that a little bit later, that was also like an InfiniBand um technology uh but instead of an offloading, it tried to use on-loading, we'll come to that concept a little bit later and that also got introduced as 100 years per second, then we had InfiniBand going into what they call it like an EDR, enhanced data rate, still using four lanes, they came to 100 degrees per second, 2017 was HDR, high data rate, to 200 and then the Slingshot was introduced from Cray, that is a 200 gigabits per second, Omni-Path Express was introduced, it is basically built on Omni-Path, but it was a little bit of variation, we'll talk about that, that was 100 gig, Google also came up with its own network um but it has been, I don't think it has been deployed anymore uh that was like 100 gigabytes and then InfiniBand 2022, it was like the next data rate, they introduced the same four lanes but with 400 gigabits per second.

So, so this shows actually playing with those two dimensions, like the number of planes. You can actually keep on moving, and then, just if you take a very quick look, starting from 2001 to 2024, the network speed has doubled 200 times. Okay, so that's what is actually the biggest improvement that has happened in the field, and that leads to a lot of excitement. A lot of the systems are being developed, and later on, we'll see, like the GPUs and all these are all very network hungry. Now, we are trying to design these kinds of systems; still, they're not very balanced. People still need more and more networking demand, and we'll see, this is where the InfiniBand and high-speed Ethernet, in fact, have started already talking about 800 gigabits per second, 1600 gigabits per second, and etc.

So then, let's take a look at the protocol processing bottleneck. So, you remember I introduced TCP, and TCP said, if it goes through the standard layers, it has a lot of overhead. Uh, so this concept started—the research actually happened in the mid '90s. So, the idea was: how can we reduce that latency? In the concepts of intelligent network interface cards, okay, we started with this adjective "intelligent." In fact, in modern days, we'll see things are becoming much more intelligent, and this is where the smart NICs are coming into the picture. We have a section we'll go over that. So, the idea is, can we offload protocol processing to the adapter instead of the adapter just being dumb and moving packets? Can it actually do some protocol processing? And the idea is very similar to what you can see what we do in the real world. Uh, if some of you are executives and managers, I'm sure you have assistants and secretaries to help you. So, if you take a look at that executive-assistant model, bring it to the real systems here, we can think of our executives could be like CPUs or GPUs, and the network card can be your assistant or secretary. Okay, so then the question is: okay, how do I move, just like in the real world, you know? You tell your assistant, "Set up this call," or, "May send this email to 100 people," or "Handle this event." All those kinds of things you can offload so that you are free as an executive to actually take care of high-level decisions. Similar designs are coming within the software stack. We'll introduce MPI, and all, you can actually have this kind of newer paradigm so that the processor actually doesn't spend a lot of time on protocol processing. Okay, it actually offloads to the network adapter, and the network adapter can actually take care of this. But then, to achieve this, there has to be newer kinds of APIs or newer kinds of interfaces, and that's what, in the literature, in the networking literature, if you go and search, it stands for user-level communication capability. Okay, we'll go into a little bit more detail, saying how that API looks, how actually it gets offloaded into the network. But the idea is, if you do that, then you don't have to go through this software signaling between communication layers as TCP/IP had introduced earlier. These layers can be implemented by dedicated hardware units or some shared host CPU. There are multiple architectures that can come.

So then, the final thing is the I/O interface bottleneck. So, let's take a look at the interface with the I/O technology. So, if you remember, the InfiniBand started in the beginning. So when you say data movement, typically it is from—you can say—from the server I showed in the picture, but typically it comes from the memory of the sender, goes through the network, over the network, the other adapter, and the memory of the receiver. So, in the original, you and InfiniBand introduced this, this, this lane technologies, and all they wanted to actually have their adapters directly talk to the memory, okay? And this was like early 2000. Now, if you see, Intel at that time, or mostly the Intel, were developing all the server platforms. So, even though Intel was a part of the InfiniBand consortium, they had their own, like the front side bus, all those internal architectures, and they didn't want to make it open, okay? So that led actually to some controversy in the community. If you go back and look at some of the websites, you will see that, in fact, in around 2005, 2006, there was some time when Intel declared, actually, they are withdrawing from the InfiniBand consortium. Also, Microsoft also announced that they are withdrawing. But then that didn't, but that was not good for the end users, okay? And at the same time, another thing was happening, the evolution of the GPU technology. It was not only networking. The GPU technology was also evolving. If you look at the history of NVIDIA and all, you will see it started as a graphics engine, just for our, like, rendering, visualization. But over a period of time, they came up with the CUDA. So that was introduced as a computing engine. But then, if you take a look at the GPUs that are also attached to the server through an I/O interface, PCI or in the modern days, PCI Express, so unless that I/O interface is much more matured, you cannot actually move data between host memory and GPU memory. And that is also crucial. The GPU, even though it has a lot of computing power, unless you load it with the data, you cannot operate. So then, what is happening is, even though Intel was not very happy to allow InfiniBand to enter into the I/O interface area, it was also getting a lot of pressure from the GPU vendors, at least NVIDIA at that time, saying there is a bottleneck. So then, what happened, you know, and this is like a typical silicon area, a silicon barrier concept of companies, you compete with each other; if you cannot compete, you just join. Okay, so that's what exactly happened. So that PCI technology got moved to PCI Express. Okay, I'll show that in the next slide.

And then, once that PCI express came in—fact, here, if you see—they exactly adopted the same concept of the lanes. Whatever InfiniBand was proposing from adapter to adapter, now they extended that to the interface side, both on the sender and the receiver. So you can actually see again, whether it is a one lane, two lanes, four lanes, eight lanes, or 16 lanes, then that will lead to, it's like a giga transfer per second. So, this is where the PCI express got born. And as of now, if you go... Most of the systems, and we'll see some changes happening again, both InfiniBand and high-speed Ethernet today come as networked adapters that plug into the existing IO technology, so that IO interface bottleneck issue is a little bit resolved up to a certain extent. But then what happened, even though PCI express became PCI express, there is a consortium, then, of course, the InfiniBand consortium is there, and these people are deploying the systems. So sometimes the PCI Express Consortium was not moving very fast, okay? And that's what happened. If you remember a few years back, the number one system at Oak Ridge National Lab was the Summit system. And so, there they wanted to use the GPUs, multiple GPUs per node, and with a very good high interface, like a high-speed network, but also having a good interface with the memory system. And PCI Express was not able to deliver that.

And this is where, actually, the whole idea of this—we'll see later on—NVLink, or NVIDIA Link, was introduced. Even though NVIDIA was a GPU company, they saw that there was a limit. So, they introduced their own networking technology called NVLink. And the Summit system was actually a collaboration between NVIDIA, Mellanox, and IBM. IBM also allowed the interface to be updated to a higher speed so that the system could be deployed, okay? And based on that, similar computing technology was also introduced. Like, there was a cache-coherent interconnect for accelerators—CCIX, CAPI, OpenCAPI—then HP introduced Gen-Z. So, a lot of these I/O interface architectures again got involved.

They compete with each other, but then over the years, what has happened? Again, all of them have not survived, but then they have again got rearranged into this new thing called Compute Express Link, or CXL. I think some of you might be familiar with this. So, this is what is getting momentum now. Open industry standard. It provides a cache-coherent interconnect between CPUs, accelerators, IO devices, and various flavors of memories. It also allows, like currently, if you see, most of the servers typically have 96 or 128 cores. And within that, you can do shared memory, but outside of that, you cannot. You have to go through MPI or some PGAS. And this is where all the InfiniBand, Slingshot, high-speed Ethernet comes in. So, CXL is trying to expand that up to a certain extent, okay? Like, you cannot connect CXL with, oh, sorry, you cannot connect, let's say, 10,000 nodes with 10,000 servers with CXL, but you can go up to, let's say, eight or 16, depending on the technologies and the switches. So, within that, like a super node or a super server, you can actually have shared memory, okay? And that improves significantly performance, programmability, et cetera.

And this is where I think the CXL technology has been evolving. If you take a look at the CXL 1.0, 1.1, up to 3.1, that is the very latest status. That's the standard. And many, many companies in the Bay Area are actually working on some of these products. And there are also a lot of simulators available. So, hopefully, we'll see a lot of traction on this technology in the coming years.

Hope that gives a very high-level perspective of all these bottlenecks and all. So, this is at a, let's say, lower level. So now, let's see, how do we build the system, okay? So when you build a system, when you are talking, really like you're running an application, let's say, think of like a training or inferencing or even a scientific workload, you are running on a thousand nodes with, you know, a hundred GPUs and all. There has to be some kind of a communication stack. So, through which you activate all these underlying technologies. So, this section, what we'll try to do is we'll take a look at that communication model, okay? And the semantics, which actually bridges the features of high-performance network with that of the upper layers.

So, what we'll introduce over the years, we'll give a summary here. Broadly, there are two kinds of communication models. One is called two-sided, and the other one is called one-sided. We'll take a look. So, let's start with the two-sided communication model. The two-sided communication model, basically, you can think of like a, you know, our emails, okay? So, let's say I send you an email. I am sending you an email, but the total control of the receive lies with you, okay? Like from my mailbox, I can just send a send, but unless you log on to your system or your mailbox, you don't receive my email, okay? And after you log on and after you detect my email, the control also lies with you, what you do with that email, whether you store, forward, or delete, entire control lies with you. That's what is called two-sided. So, this is also what we'll see here. Let's say, in the MPI, those of you are familiar, like MPI send, MPI receive, there are different kinds of primitives, but you will see that this is the basic two-sided operations. And, to, if you remember, like these days, of course, the technology has changed. If some of you are familiar, like 10 years back or 15 years back, I cannot even, mailboxes were so small in size, you cannot send a big message, okay? It actually says that, yeah, the receiver mailbox is full, I cannot receive. So, most of the time, to get performance, of course, you need a bigger mailbox. The same idea also gets used in the high-performance network. So, that means, if I want performance, let's say these are the three nodes, like a P1, P2, P3, they have their HCA's. We'll call these are like a network adapter. We are using InfiniBand terminology here, which is called host channel adapter. So, most of the time, if I want to get very good performance and somebody is designing the high-level, like a network and protocols, you typically do something called a post-receive buffer, okay? So, that means the receiver is ready. So then what happens, most of the time in an HPC community, the communication happens with polling, okay? So, there is a difference between polling and interrupt. Some of you must be knowing. Polling means I'm continuously polling. In this case, P2 is waiting for a message to arrive. It'll keep on polling. Let's say it expects messages from P1 and P3. It posts to receive buffers and keeps on polling, okay? So, in an MPI world, if you're familiar, you can say MPI receive, receive, receive, or an I receive, I receive, actually, in this case. And then what happens? The sender side, when it computes, and then when it comes to ready to send the message, it sends a post-send, okay? So then, you will see, look at the animation here. The message will actually go over the network, go to the adapter. We'll discuss later on the topology and all those stacks. But as soon as the message reaches there, since P2 is polling, that adapter is polling, it will get the message, and then we'll give it to the P2. See what happens? Similar things will happen if P1 sends a message, you will get it, P2 will get. So, this is what is called two-sided. So, that means from the sender side, I'm just sending the message with a destination address, destination server address, or a process address, or a core address, we'll see.

But there is another improved model, which is called the one-sided communication model, okay? So, one-sided is more like a trusted communication model. So, what does that mean? That means, if you remember, like, a real one, let's say you have a very trusted colleague, so you can always say to your colleague, "Look, when my door is open, you can always come, take a book from my office or open the file cabinet, take things or put back things, okay?" And this is what is one-sided, so that is much better because if my friend is coming to take things from my office, he or she doesn't have to interrupt me, okay? And that is the whole idea of this one-sided communication model. So, typically, in many of the programs you do, there's kind of an init, and during that stage, you have a global region creation, buffer infos are exchanged, and then what happens, you will see now, let's say I am sending and P2 wants to send a message to P3, you look at that animation. So, as soon as it actually gets written, the data goes over the network, and this adapter knows where to place the data, okay? We'll see that a little bit later, within better animation. So, in this case, actually, the sender has to have the actual destination memory address in the packet, okay? So that it goes and gets written, and then, similar kinds of things, here you can see P1 also can do the return.

So, the question is this: at a very large scale, for a strong scaling case of either HPC or AI training, do you see the actual need for high bandwidth per port? I believe that as you scale your application, your bandwidth requirements per port would go down. This would be counterintuitive with respect to small-scale single box, where the scale of communication is substantially higher and proportional to HBM bandwidth. So, the basic question is this: do you need high bandwidth as you scale out to many number of nodes?

Yes, yes, the answer is actually yes, you need it. So, what is happening, if you see the modern way people are deploying, the single node doesn't have only one GPU. These are like you get currently, like you can go up to four GPUs, eight GPUs, kind of things. So, in that context, what is happening is, in fact, sometimes the network is not balanced actually. So, with eight GPUs working together, when you are trying to do the training, you are actually sending very large messages from each GPU. You want to do the STD, the reduction, or it's an all-to-all or all-gather. There's a big demand for the bandwidth. And unless you have that kind of high bandwidth, you will be effectively actually taking more time on that training. And then, in that context, your strong scaling applications will get affected, okay? So, in fact, there is another direction we will not talk about here today, exactly for that reason, in fact, not only in our group, other groups are working on some on-the-fly compression technologies. We can discuss it offline. So, the idea is that since there is a big demand and these big messages are going, and if the network bandwidth is not enough, it is like you can think of multiple islands being connected together. So, you can use actually compression technologies to reduce that message so that you get a good balance of your computation and networking.

So let's then move very quickly. We'll try to finish this and move to the next thing. So, how does this exactly happen then? I mean, all these one-sided communication, two-sided communication, posting buffers. So, in all these high-speed networks, there's something called memory registration. That means before we do any communication, all memory must be, for communication, must be registered, okay? This typically a program is written with a virtual address. But you know, like the actual physical, at a DMA and all happens to the physical address. So, these are the kind of steps. So, in the beginning, when you send a virtual address saying, "Okay, I want to send this message, okay?" You are not actually sending the message at that time, but you have the intent. So, you send your virtual address and length. The kernel actually does the virtual to physical mapping. Okay, this is very important because we want security and all those kinds of features. And once it does that virtual to physical address translation, it actually, the adapters have intelligence, even though basic adapters, this HCA adapter caches the virtual to physical mapping. Okay, so it has actually just like a processor cache, all these adapters have cache. So, it puts that virtual to physical address mapping there and gives two keys. One is called a local key. That is for your own local one. And the R key, if it wants somebody else to write to you.

And once you have this process done, and most of the time this is done in the initialization so that you are not affecting the actual communication, but sometimes you can also do it on demand. The performance will get affected a little bit, but that also gives flexibility so that you can register more and more memory. So once you do that, when you actually send your message, in addition to the actual data, you send something—this key called an L key or the local key. And this is what you will see. Interestingly, so this time when I'm trying to do the communication, it doesn't touch the kernel at all. And that is the very big difference between these all high-speed networks and let's say traditional TCP/IP stack. Because traditional TCP/IP, with the multiple layers, you have to invoke the kernels multiple times. So here, since this is what is known as kernel bypass protocols, or OS bypass protocols, or user-level protocols, there are a lot of terminologies here. And then the adapter actually validates that L key, saying it has come from the right process so that it provides some kind of security here. And then if somebody else is trying to do a remote write to me or remote read from me, I have to give that remote key to somebody else. Just like think of your office, if you want a trusted person to open your door, let's say you have some—like instead of an actual key, if you have some... you know, the coded one. So then you say, "Oh, my code is three, four, five. OK, the person comes, then puts that in and then can enter your door." Same thing happens like you send that key to remote. And then when the remote person is trying to do this RDMA operation, in addition to the data, it has to provide that R key. And in the beginning, in the basic InfiniBand standard, actually, this R key is not encrypted. OK. So the assumption is that, OK, if the system is within a data center or within a building, those are enough protected so that people cannot come and then touch it. But then InfiniBand has also grown into like a wide area network technologies across continents and all. So in those cases, you need to encrypt it. And then there are routers. Those are very little expensive, but those routers actually do on the fly, like real-time encoding and decoding or encryption. They do it.

So, now let's try to put all these ideas together. These are like a nice set of three animation slides. It will actually help you to understand the protocol. So, let's start with the two-sided model. So, what we are showing you, this is like a sender. This is the receiver. These are the processors. This is your InfiniBand device. This is the memory. So internally, just like you think of that executive assistant or assistant and secretary, most of the time, you know, there is something called a like an outgoing box. Or incoming box. So here, we'll have a send QP or a receive QP, and then there will be another box called a completion key, like an event which gets completed. So, here we'll have both the things on sender side, receiver side.

So, in the beginning, as you said, there will be a receive will get posted. So, it's called a receive WQE in the InfiniBand terminologies. Different networks have a little bit of variation of the names. So, write queue element, or called as WQE. And that has information saying, "OK, if a message comes from this person, where to put it?" OK, you know, typically an executive might be telling the secretary, saying, "Yeah, yeah, I'm waiting for this call, or for this email. "If it comes, forward it to somebody or give it to me, interrupt me." So, that's what we are trying to do here. So, the receiver is getting ready.

Then, the sender does similar things when it is ready to send the data. It also sends a WQE. And in this case, it is two-sided. It has information—it has information, excuse me—on where to take the data. And once these things are posted, you will see the processors, or the executives, don't have to be involved in the actual communication.

So that's what we'll see. So this WQE gets picked up by the adapter and then it actually goes and grabs the data from the memory, puts them into appropriate packets and all those things, sends them, and it goes over the network and it reaches the adapter of the receiver. But interestingly, since it is two-sided, something has to be done here because this message cannot move from the adapter to the memory. So this is where what is going to happen. There is a matching that will happen because that receive buffer might be; this adapter is getting messages from processor one, two, three. It needs to know where to put them. So there will be a matching that will happen here. And once it knows where it needs to be placed, it goes and places the things there. And then what happens? Like a... Some kind of a completion entry is added into the completion queue. Then it sends a hardware ack. The sender also puts a completion queue on the sender. So the whole operation is over. So... The both executives can actually; there are again mechanisms you can have; either the executive can come into a polling or it can have interrupt. So now if you see this protocol, you'll see that the processors are involved only on either side, like either a posting... Receiving key or post sending. OK. And pulling out completion. OK. There is no involvement; the processors are actually not sending the... Data. And this is where if you remember, I introduced CPU utilization for protocol processing; that is where this concept comes in. So if you can upload this to the adapters, then your CPU usage goes down. So in the modern and figure adapter, it will be seeing very low, like you in a five percent, three percent, depending on your message size. That's enough. Okay, so that means most of your cycles are left for your computation.So, that is the two-sided.

Now, let's see how the one-sided will work. As you said and as we discussed, one-sided is a little bit more advanced. Here’s what happens: the receiver doesn't have to post anything. The sender posts the wiki, but this wiki contains not only information about the data location in the sender but also the receiver buffer, where exactly it should face. This is like a trusted friend returning your book, opening the door, and putting the book on your bookcase.

That's what we'll see here in the animation: it gets the data and directly goes to place the data in the memory. If a hardware ACK is sent, there’s a completion entry. So, then, of course, the question is: how does the processor know the message has arrived? There are multiple techniques. Some adapters provide strict ordering; InfiniBand, for example, does that. If you send a message and the receiver waits for the very last byte, then it knows the message has arrived. Or, after sending the RDMA message (like in this case a push/put), you send a follow-up message confirming completion. That can be two-sided. Sometimes, you can encode bits in the message itself to indicate delivery.

So, there are various trade-offs. Here, the initiator processor is involved only in posting the sender key and pulling out the completion cookies. There is no involvement from the target processor. If your upper-level software and middleware are designed well, the receiver can achieve a good overlap between computation and communication. This is what happens in modern MPI libraries and machine learning libraries. Middleware developers continually enhance these systems to reduce network overhead and maximize overlap with computation.

The same thing also happened with InfiniBand; it introduced something called atomics. Um, those of you familiar with atomic machines are like, you know, like a fetch and add, compare and swap. These are very much needed for distributed systems, and here you will see—uh, let's take a look at the fetch and add—so that means I have some data, and this receiver has some data. If I have five and the receiver has ten, this is to be added, and then 15 goes back to both of them. Okay, both memory, but we need to do it in an atomic manner, and that's what happens. The InfiniBand provides this, so here you will see, like a 64-bit segment, it goes there, and then it actually fetches from the memory; it performs the operation. So this was the very early days when InfiniBand actually introduced computation to the adapter, not just communication, but computation. Later on, we'll see, over the years, in-network computing has come; the adapters have become more smart, like the DPUs, IPUs. So these days, actually, you can even do computation in the adapter, computation in the switches. And based on the system, based on your target application, you can actually try to balance this. When I say computation, these are computations related to communication, most of the time. Not the user-level computation, people have not uploaded. We have done a little bit of research on that to put it into the DPUs. You will see some examples later on. But this is where the field is moving. So, once again, this is where the field is moving. So, once this operation happens, here and here, you finish. So, here again, you will see that this kind of operations, initiative processor is only involved in the push, send WQE, pulling out the completion cookie; there is no involvement from the target processor, so that we can also have a good overlap for the atomic operations with the computation.

So, those are the point-to-point; but there is also a very big need for collectives. And this is needed in the HPC. As well as in the AI, and especially AI, all this training, a lot of collective operations get done, like a broadcast, scatter, gather, reduction. We will see some examples later on. But think of like, these collective communications are designed through some algorithms, and the algorithms are a collection of point-to-point steps. So, we saw some examples of the point-to-point earlier, two-sided, one-sided, so those can be actually used here to even design better and better collectives.

So, now we are going down a little bit and looking into the architectural overview of various high-performance networks. So, again, each one of these networks has standards and specifications which are over a thousand pages long. We are trying to compress that into maybe four or five slides. So, that'll be significant, let's say, a filtration and drop of information that is happening. So, I think that's the nature of the tutorial. We'll start with InfiniBand, high-speed Ethernet, and see how they are converging to start with.

So, let's try to compare your traditional ISO OSI software stack, which has an application layer, transport layer, network layer, so on and so forth, with a different IP or InfiniBand stack. So, if you look at the physical layer, one of the major differences is that InfiniBand does not have wireless. So, it does have copper or optical similar to traditional Ethernet. Between the physical and late layers, you have the network management tool called OpenSM. So, OpenSM, or Open Subnet Manager, is responsible for bringing together the entire network, making sure that there is no deadlock, the routing tables and the switching tables are configured correctly. It also helps with configuring InfiniBand features like multicast and flow control, and error detection also happens at this level. Now, if you move up one step, you have routing. Although, typically, most IB networks are not routed, but products for routing do exist in the community, and there are various vendors who are offering such products. The major differences between traditional Ethernet and IB happen at the transport layer. With Ethernet, the interface has always, historically, been the sockets interface. So, you have open, read, write, close, so on and so forth. And one of the major differences here is that, with the sockets interface, you have to go from the user space to the kernel space, and there was a cost associated with that, which, you know, was not suitable for high-performance networks. That is what the OpenFabrics Verbs interface tried to avoid. So, you don't have the context switch from the user space to the kernel space in the critical path. You are able to do that once at init and then avoid going through the kernel for most of the communications. Just like Ethernet has TCP and UDP, reliable and unreliable, the OpenFabrics Verbs, or IB site, also has reliability, and UDP is unreliable. And on top of this, you will have applications written using the OpenFabrics verbs interface going over the IB stack.

So, having said this, let us see how your transport's stack looks like. So, you have applications at middleware written for sockets or verbs. And again, this is going back maybe 10, 20 years in history. So, if you were using sockets, you would be using TCP/IP or UDP, and you could go through the Ethernet driver, Ethernet adapter, and Ethernet switch. When InfiniBand came into the ecosystem, obviously, they wanted to play well with the sockets interface. So that is where the first kind of convergence happened. So if you had an application who were willing to go straight over IB verbs, you could do that. But if not, you also had a bridge protocol called IP over IB, where you would encapsulate IP packets into IB frames and send them over an InfiniBand adapter and an IB switch. So when this came out, your Ethernet was still at the 1 gig or 10 gig level, and IP over IB could give you a bandwidth boost over your traditional network. So, that is one way where the first convergence happened.

Now let's look at how the evolution of IB has been happening over the last couple of decades. So, we started with SDR, or single data rate, a long time ago. Then, we went through various evolutions and, right now, we are at NDR, 400 to 800 gigabits per second. And your typical deployment scenario is with four pairs of twisted wires. So, that is what is represented by 4x. As was mentioned earlier, you potentially can have any type of bandwidth as your wires get thicker, but your typical deployment scenario is 4x. And currently, we are at the NDR regime. There are a lot of deployments with EDR, HDR already out there, and NDR is what is currently the state of the art.

So, recently, IB had released the 1.7 volume, and the latest changes there were support for large radix switches. Basically, this will reduce the overall number of networking hardware you will need to deploy, like a larger scale cluster. Your previous radices were about 48, 32, or 64 ports, which would require a lot more network hardware. But with a large radix switch, you will need significantly less. And, there are other enhancements done to the standard with the 1.7 release as well.

Let us look at some of the architecture and basic hardware components.

So, there are multiple components. Your network adapters are called channel adapters. You could have a host channel adapter or a target channel adapter, and HCA for short. Each can have multiple end ports. And, to have support for quality of service or differentiated quality of service, you have what is called virtual lanes, or VLs, for each port. And, you have the software endpoints that are called Q pairs, just like you have sockets on the Ethernet side. You have Q pairs as software endpoints that allow the application to interact with the network adapter.

You also have switches and routers and like switches are for obviously intra domain and routers are for inter domain. You also have links and repeaters to extend the range.

So you have a lot of different options for that. A long time ago, Intel had come up with copper to optical conversion at very, very, very low latencies, as low as 550 picoseconds. And this had allowed the InfiniBand networks to span longer distances. This was done almost like 20 years ago.

So, let us go forward and look at how the hardware protocol offload happens. Again, we are trying to compress multiple chapters of a large specification down, so we are trying to pick and choose what we think is interesting and useful. So, this is how your IB stack looks like: the transport layer, network layer, link, and the physical layers. So, we'll start with basic IB communication semantics.

So, as I mentioned, the software endpoint is called a queue pair. It has a pair of queues: a send queue and a receive queue. You will post metadata regarding what you need to send, from where you need to send, and to where you need to receive, into work queue elements, or WQEs, into the send queue and the receive queue. Once the network adapter has processed these WQEs, they would be placed into a completion queue, or CQ, as completion queue elements, or CQEs. So, this is your basic operation. Suppose I want to send 64 bytes from a particular memory location; then, first, I would do a registration, as we saw earlier. Then, we will create a work queue element. This is the address from which I want to send; this is the number of bytes; this is the key. The IB device will get that metadata, fetch the data from memory using DMA engines, send it out through the network, and then post the send WQE into the completion queue, indicating to the process that the operation has been completed and the buffer is safe for reuse.

So let's see how hardware, so you have complete hardware implementations of all of this in various vendors. So we'll see how the network and link layer operates.

So there are lots of capabilities here, but we'll just focus on a few, like buffering and flow control, virtual lanes, and quality of service, and switching and multicast.

So, if you look at virtual lanes, it is very important. And the concept of virtual lanes, although we are describing it in a bit more detail for IB, the same concepts do exist for other high performance networks like Omni-Path, OPX, Slingshot, EFA, everything. So, you have two concepts: virtual lanes at the hardware level, as we saw earlier. Virtual lanes give reserved buffering at the hardware level to send out packets, and this will avoid head-of-line blocking. At the software level, you have various service levels that the network administrator or the system administrator will set up, and these would be exposed to end users. There will be a service level to virtual lane mapping table, which maps various service levels to underlying virtual lanes so that they go through different virtual lanes, avoiding head-of-line blocking. Now, there is, how do you give differentiating priority to the virtual lanes? There is something called the virtual lane arbitration or VLARB table, which determines how many flips you need to send out from one particular virtual lane before moving to the other. So, with a combination of different virtual lanes, the service level to virtual lane mapping table, and the virtual lane arbitration table, we can implement differentiated quality of service for an IB network. The only thing is that every single IB device should be configured exactly the same. So, that is why the OpenSM or the Open Subnet Manager is tasked with the responsibility of doing this so that the entire subnet looks the same as far as differentiated quality of service goes. And please note that the same concepts do apply for almost all other high performance networks.

So, let's now move into layer 2 routing or switching. Just like on the Ethernet side, where you have MAC addresses, you have what is called local identifiers, or LIDs, on the IB side for intra-domain switching. For multicast packets, we have what are known as multicast group identifiers, or MGIDs. And these IDs tell the switch through which end ports or egress ports the packets need to be replicated out. Additionally, you have separate interfaces for multicast group management, like create, join, leave, prune. So, you have multicast groups as well.

Let's take a quick look at how the switch looks. So, the basic unit of switching is what we call a crossbar. This is fully non-blocking, very highly provisioned. This is also called a pizza box switch, basically because the switch kind of looks like a pizza box. Unfortunately, these are extremely expensive to implement when you go to very large scales. Your typical larger scale switches that you see in markets, your director class switches with 300, 500, 600 ports, are collections of crossbars within a single cabinet. And these are what we call non-blocking switches. For a crossbar, you have a system where every house is connected to every other house with a separate road. There'll be no congestion whatsoever, but it'll be damn expensive to implement, and it'll take a lot of real estate. Your non-blocking switch is like a typical road system where if you have multiple cars going at the same time, there's probably not a lot of congestion, but you could potentially create a scenario where there's congestion on the road.

Now having mentioned congestion, let's take a simple example of how IB switching or routing happens. Suppose you have a two-level factory like this: You have the leaf blocks and the spine blocks, you have process one on node one and process two on node two. And let's assume that node two has two destination addresses, or LIDs, 2 and 4. Now, let's see what happens when process 1 on node 1 wants to send a packet out. Let's first look at the forwarding table on this particular leaf switch. So, the forwarding table says that if your destination LID is 2, you go through port 1. If it is 4, you go through port 4, okay. Let's assume P1 sends a message with DLID of 2. That will go to this particular spine switch, come down, and reach P2. Now, suppose that this particular spine switch is down, or these links are congested. At that point in time, if we have multiple LIDs, we have the flexibility to send a packet to P2 using a different LID, which will take a different route through the network. So, by having multiple LIDs, you can have redundant paths as well as avoid hotspots in the network. So, this was done a long time ago when the IB networks were still statically routed. Nowadays, you have dynamic routing capability in the network. So, this is kind of becoming less and less useful as you go forward. But still, there are not many vendors who have been successfully able to bring this in. But this is just there so that you know why you could have multiple LIDs in the network. So, that you have different routing algorithms.

Now, let's take a quick example of how IB multicast works. So, typically, the subnet manager entity can run on an end node or a switch, depending on the choice of the system administrator. The first time a process wants to join a multicast group, it sends a multicast join to the subnet manager, who then sets up the multicast group by configuring the switches appropriately. The next time somebody else wants to join, a similar step happens. And, if you look at the switch in between, you will see that multiple egress ports have been tagged with this multicast group ID. So, at a later point in time, if any node wants to send them or any node which is part of this multicast group wants to send a message out to this multicast group, the switches will replicate the packet, and the end node will only have to send one copy instead of multiple copies. So, this was an initial hardware offload solution, network hardware offload solution that InfiniBand has.

Recently, there are other more advanced network offload solutions. So, with IB multicast, you could only replicate packets; you could not do higher value operations like mathematical operations. So, this is where the newer support called SHARP, or Scalable Hierarchical Aggregation Protocol from NVIDIA Mellanox, comes into the picture. Here, what we do is you have the ability to collect data from the end compute nodes, send them to the subnet manager, and then you can do the mathematical operations in the switches in a hierarchical fashion all the way up to the root of this tree that the job created. Send the result out through the switches, back down to the processes, so that the number of hops required to finish the compute operation and the communication operation will be significantly lower. And this, as we will see later on, leads to excellent scalability as you go to a very large number of nodes. A lot of communication libraries, like the MVAPICH MPI library as well as NCCL, the NVIDIA Collective Communication library, have inbuilt support for these kinds of hardware-enabled collectives, both the multicast and SHARP.

So, let's finally look at the transport layer.

So, on the Ethernet side, you only have, like, a reliable connection and unreliable datagram, but IB has a rich variety of protocols. Each has a different equation in terms of scalability and performance. So, RC is full-feature rich, but it has a significantly high connection overhead. So, if you have M-processes and N-nodes, we have a very high overhead to establish all-to-all connectivity. UD, on the other hand, has a fixed overhead, but it does not have any of the cool features like RDMA or Atomics that we talked about earlier. So, vendors have tried to come up with intermediate protocols with the features of RC and the scalability of UD. So, these are called extended reliable connection and dynamic connected, as you can see here. So, they have all the features of RC while maintaining the scalability of UD.

So, this is what we have on the host side, but about 10 years ago, GPUs came into the picture, and various vendors wanted to come up with solutions to send and receive data directly from GPU buffers. So, this was where GPU direct RDMA came in; back in 2011, our group—NVIDIA and Mellanox—had worked together to come up with a seminal paper, presented at the International Supercomputing Conference, on how you can pass GPU buffers directly into a communication runtime, make the communication runtime aware of GPU buffers, and send the data directly from GPU buffer on one node to GPU buffer on a different node.

Leading to high performance and high productivity, this was the first such GPU-aware communication runtime in the community. Right now, if you look at it, there are lots of similar solutions from each GPU vendor. So, NVIDIA has NCCL, AMD has RCCL, Intel has oneCCL, so on and so forth, and Habana, which is an AI vendor, has HiCCL. So, everybody has their own GPU-aware communication runtime, but we, in our group, had come up with this feature back in 2011, first for the NVIDIA plus Mellanox ecosystem.

As a result, we were able to significantly reduce the latencies and improve the bandwidth for communication.

Now, let's look at a few bridge protocols. So, just like we mentioned IP over IB, there were other protocols that people tried to come up with because you had the requirement for staying on with the sockets API but improving the bandwidth. So, IP over IB was the initial model where you would go fully through the kernel. So, you had this, like, slightly improved performance, but the bandwidth still wasn't that high because you weren't able to use all the hardware offload capabilities. Again, remember, this is rewinding the clock almost 20 years. You should not compare this with what currently Ethernet is able to do. So, back then, people came up with what is called as a Sockets Direct Protocol, which will do kernel bypass RDMA semantics, and this primarily targeted the large message spectrum. So, the small message latency was still not great because you had to go over the user to kernel interface, but the large message bandwidth was significantly improved compared to IP over IB.

Now, in order to address the small message performance while retaining the socket semantics, some folks have introduced what is called RSockets. This intercepts the sockets calls and makes them go over RDMA CM, and this gives you good bandwidth and good latency. But, unfortunately, because of the limitations of the socket's interface, you still cannot do RDMA or atomic directly from the application level.

So, with the advent of these two, this is how our protocol stack looks like. So initially, if you remember, we started with TCP/IP and RDMA at both ends of the spectrum, and vendors introduced IP over IB. Later on, SDP and Rsockets came. So, this is, let's say, in a naive fashion, this is how they would relatively be placed in the overall spectrum of sockets versus verbs.

So, going forward, let's quickly look at the high-speed Ethernet family with the iWARP protocol. So, this is when the Ethernet consortium and the Ethernet vendors began to realize that, oh, okay, this is a good InfiniBand or verbs is a good protocol. Let's see if we can move this protocol over our regular Ethernet networks.

This is a comparison of how IB compares with iWARP and high-speed Ethernet. So, some of the benefits of high-speed Ethernet are that you could have out-of-order data placement. You could also have fine-grained and dynamic rate control, a lot of improved quality of service support, all the very nice network management features, and ACLs of Ethernet immediately became applicable here.

So, with the advent of iWARP, two more verticals came here. You could go through the sockets interface TCP/IP, and you could have hardware offload capabilities where you could offload segmentation, reassembly, checksum, all of those things in hardware—and a bunch of others as well—and go over your traditional Ethernet network. Or, you could go over IB Verbs, again TCP/IP at the user space, go through an iWARP adapter, and go through an Ethernet switch. So, the benefit here is that you get improved performance, but you can go over the Ethernet network fabric.

The next level of convergence that happened was RDMA over Converged Enhanced Ethernet, or RoCE. So, with RDMA over Converged Enhanced Ethernet (RoCE), the benefit is that you are able to make your IB packets go over Ethernet networks, either at the L2 level or at the L2 and L3 levels, over regular Ethernet networks by changing the last one or two layers of your packet. With RoCE, the InfiniBand link layer would be changed to an Ethernet link layer, so that now it is switchable. With RoCE V2, the link and the network layers would be changed so that it is switchable and routable. Thus, RoCE V2 brings all the benefits of your traditional Ethernet network management infrastructure, including ACLs, snooping, network monitoring tools, etc.

And consequently, your management stack looks like this: So, this is your full-featured management stack. So, you have RSockets here, SDPs are here, and RoCE will come here.

The next high-performance interconnect that we have is Omni-Path. So, Omni-Path has a fairly long history that started off at Pathscale. Then, Qlogic bought it, which was later acquired by Intel, and recently it has been spun off into a different company called Cornelis Networks, and they have rebranded it as Omni-Path Express.

So the Omni-Path fabric looks very similar to IB but it is an enhanced version of it, has more features, it has support for higher number of fabric addresses, it has also support for larger number of virtual lanes while InfiniBand specification only supports 16 virtual lanes, the Omni-Path specification supports up to 32 virtual lanes but the basic process of providing differentiated quality of service is similar.

And with the advent of the Omni-Path network, we have a new, higher-level interface called OFI, or Open Fabrics Interface. And we have a different underlying communication protocol as well.

Now, this slide is a feature comparison of all the different networks that we talked about: IB, iWARP, RoCE, RoCE v2, and Omni-Path. So, these two slides are your cheat sheets to see how your InfiniBand, Ethernet, Omni-Path has evolved in terms of hardware and software over the last couple of decades, and trying to decipher all the different terminologies that vendors may post to you. Slides number 80 and 81 give you a quick, high-level overview of all of these.

Now, these are more recent advances: with the advent of higher performance GPUs, various GPU vendors felt that the PCI bus was not catching up quickly enough. PCI was stuck at PCI Gen 3 for a very, very long time, but the bandwidth requirements of the GPUs were growing much higher than how the PCI was evolving.

So, the vendors came up with their own network interconnects, and the first one of these is from NVIDIA, called NVLink, and NVLink 2. You have NVLink different generations of it, with different capabilities, and each generation is made available with different classes of their GPU. So, with the latest Grace Hopper GPUs, you have the fifth generation of NVLinks available, with significantly higher network bandwidth capabilities.

So, are there plans to provide network computing on RoCE-based networks, like SHARP on Ethernet network? Mellanox Nvidia is heading in that direction for providing better support for Ethernet, but we have not heard anything concrete along that direction. However, they are looking into Ethernet-based networks; that much we know.

Dr. Panda: Uh, just might be we can add, at least, uh, I believe, some other vendors like Broadcom and all. They are working on in-network computing, um, for their adapters, um, but I don't think they are publicly available yet.

Yes, that is also right. Uh, so what Dr. Panda mentioned right now about Broadcom—this is along the Ultra Ethernet side. So, there are different vendors trying to come up with solutions for Ethernet. We hope that, and we are working with several of these vendors as well, from the Ohio State group. We hope some of these solutions will be productized and made available to the public soon. So the NVLinks can be connected in different kinds of topologies, either a switch topology or a simple point-to-point communication topology. This is a slide showing your switch topology where you can connect multiple GPUs to an NVSwitch, and this gives you significantly high intra-baseboard and inter-baseboard communication.

And once these kinds of solutions are made available, you also need software capability to go over these. So, as I mentioned, we were one of the first in the community to come up with GPU-aware communications. Since then, various vendors have come up with extremely optimized solutions for GPU-based communication. NCCL, or NVIDIA's Collective Communication Library, is one such solution. Although the name is Collective Communication Library, nowadays they have support for all sorts of collectives as well as point-to-point operations. And these give you excellent scalability and performance for single-node as well as multi-node NVIDIA Plus IB enabled systems.

And this kind of shows how your large language models scaled with NCCL on modern NVIDIA DGX GPU systems.

So, just like NVIDIA came up with NVLink, AMD also came up with their own version of intranode communication fabric to interconnect their CPUs and GPUs, called the AMD Infinity Fabric. So, you have a die-to-die interconnect on the CPUs as well as CPU to GPUs. This gives you significantly high bandwidth to the memory as well as between CPUs and GPUs.

And they have designed, just like Nvidia designed their collective communication library, AMD also has designed their own ROCm collective communication library, and it is pronounced as RCCL. So, it is very similar to NCCL in terms of APIs and functionalities, and it gives you extremely good communication performance for intranode and internode point-to-point and collective communication mechanisms on small as well as large, AMD-enabled systems. So, Frontier is a good example where RCCL gives you very good communication performance. Frontier is a top supercomputing system at the Oak Ridge National Labs.

So far we have looked at solutions for GPU-based fabrics.And let us say commodity fabrics.

Now Amazon also has been working on the interconnect area for a long time now.So they started with the C1 adapter which was at 1Gbps.

So most recently, they have gone to the third generation of their elastic fabric adapters with a bandwidth of 200Gbps and a latency of very close to 6.5 microseconds, which have a lot of optimizations. Now, you might think that, hey, 6.5 microseconds, that is very bad. People are talking about 1 microsecond and things like that. So, this is not something that was introduced by mistake. This is by design. From Amazon's point of view, they wanted the HPC systems to be very well integrated with their existing e-commerce and large data centers. So, equal cost multi-pathing was very important for them to make sure that by introducing HPC into the large-scale fabric, you don't screw up the existing communication traffic. So, in order to ensure this, they spray the packets over multiple paths over the network, and they also combine the packets, unbeknown to the user at the receive site, using a newer transport protocol called Scalable Reliable Datagram, or SRD, and it is because of these that the latency is slightly higher than your typical latencies that you would expect with high-performance interconnects.

So, with the advent of the new Scalable Reliable Datagram, or SRD, protocol, this is how your transport protocol slide, which we looked at maybe 10 or 15 slides before, looks like. So, with SRD, your communication overhead is very similar to the dynamic connected on the IB side, and you have the same capabilities that RC provides. So, this is what we have with regard to Amazon's EFA adapter.

Let's quickly look at, um, Cray's Slingshot Network as well. So, if you look at Slingshot from Cray HPE, uh, their biggest, uh, selling point is that they have excellent, um, QoS and congestion control, and very low jitter. So, this was their main focus, um, as described by Steve Scott in the 2019 keynote at the ExaComm workshop.

So, Slingshot, apart from all the other features they have, they have a highly tunable set of QoS classes, and they support multiple overlaid virtual networks so that you are able to minimize your jitter and provide excellent performance isolation for different kinds of traffic. This ensures that small messages are not stuck behind large messages. They also have very good condition management, and some of their experiments have shown that their network is able to quickly recover from condition-kind environments, which is typically not the case with other HPC networks. So, having said this, note that every network is evolving, and a lot of networks are also aiming to provide similar condition and quality of service mechanisms, and you might see a lot of these coming up in the future from various vendors.

So, this is what we have about your basic, uh, network adapters. Let's now quickly discuss, uh, the smart network adapters. So, let's move, uh, to the, um, uh, next, uh, section. So, we saw the, like, these are the regular networks, uh, we introduced the RDMA up to some kind of an intelligence, but now, actually, the field is moving into, um, kind of a smart NIC, and that's what we'll be, uh, we'll be, we'll be talking here. And, uh, so we'll typically go through like an overview of the SmartNIC architecture, Nvidia BlueField DPUs, then AMD Pensando, and also the Intel Columbia will IPUs.

So, there are broadly two types of smart NICs you will see. Again, there is a little bit of multiple terminologies floating around: Smart NIC, Super NIC—a lot of terminologies. But let's keep that aside; we'll just get the basic concepts here. So, broadly, there are two types of smart NICs. One is like a CPU-based—kind of a, we have the NIC, programmable CPU code—and then ASICs. So this is where actually the NVIDIA, this DPU (Data Processing Units), fall in; Marvel has this Octeon, AMD has the Pensando, and the other approach is to use a field-programmable or FPGA-based. Okay, so you can use NIC plus FPGAs. So this is where AMD Alveo, and then Intel's FPGA with Smart NICs, comes up.

So, now let's take a look at the architecture. So, this is a BlueField-3 DPU architecture. So, you remember I introduced this executive assistant paradigm, okay? So, that time we were just talking about, let's say, one CPU core and one NIC, but in a real system, if you see, let's say this is our server, there might be 128 cores. Then, this is an InfiniBand BlueField adapter. Of course, it has the RDMA engine, which was there earlier, but now, in addition to that, it has a set of programmable ARM cores, okay? And depending on the adapter, they are trying to adjust the number of cores. Uh, so, in the BlueField-3 DPU, you will see that the BlueField-3 DPU has 16 ARM cores, and obviously, the speeds are different. These are the ARM cores. So, here the way I can indicate and you can evaluate it is like multiple executives and multiple assistants, okay? Or, you can think of the assistants as available in a group. So, if you have a set of managers, some assistants and secretaries are helping all of them, okay, instead of having a dedicated one. So, there's a shared pool of assistants. So now, the question is, how do we take advantage of that, okay? So, just like in the day-to-day world, that means if we can offload certain things, uh, to achieve some good overlap of computation with communication, uh, then it will be more effective.

So, here's what we have been exploring. Um, we have a library called MVAPICH2-DPU Library. This is being done jointly with a startup called X-Scale Solutions. So, what we have done—some of you might be familiar with MPI non-blocking collectives. Okay, so the MPI standard has blocking collectives and non-blocking collectives. In the blocking collectives, when you initiate an operation—let's say you are trying to do an all-to-all—since it is blocking, unless the operation finishes, you cannot proceed. Whereas, the non-blocking collectives were introduced in the sense that you can actually initiate your collectives while the collectives are happening, the processors can actually proceed and perform the next step of the competition, which is not dependent on that data. Okay, and if we can do that at an application level, it has actually good overlap of computation and communication. And that's what also happens many times in our day-to-day world with an executive assistant paradigm. The executives may say, 'Yeah, send out these emails, or make a phone call, or set up a Zoom,' but the executives are not sitting idle; they are continuing to do their work. So, that is the kind of paradigm we first started putting it here. So, what we have done: we have actually offloaded MPI all-to-all, MPI bcast; gradually, we are uploading scatter, other kind of operations. So that is actually an ideal situation because you can think of the ARM cores. I initiate an all-to-all, and while the ARM cores are actually trying to take care of the all-to-all, the application can proceed. Okay, of course, you need to make sure that that application supports non-blocking operations.

And what we have done here, the end results I am trying to show, are very detailed results we presented earlier at GTC. And all you can take a look. We took two things we did here. So first, this is the MVAPICH2, the basic MPI library which does not have any upload. The other one is the MVAPICH2-DPU, that is the optimized version where MPI all-to-all has been uploaded, or IB cast—all these collectives, non-blocking collectives, have been uploaded. At the same time, we modified a P3DFFT application. This is a P3DFFT, just like a representative of FFT application. It has all-to-all, but this has a version of I all-to-all. Okay, so that means the processors actually perform some computation and initiate non-blocking all-to-all, and while that operation is going on, you can actually perform other computations. So, we put this on top of our MVAPICH2-DPU, and this is how the numbers come. These have been taken like, think of 60 nodes, 32 PPN, like a 512 core configuration with different grid sizes we are operating on, different like application size. And overall, y-axis is latency; that means lower is better. And as you can see, on 512 cores, we are able to improve at a P3DFFT like a semi-application layer or a kernel layer, you can think of like 17 to around 16, 17 percent. Okay, so that really shows like you can actually use, just like in the real world, you can overlap computational communication. The same thing you can do it here.

This is another application. So, what we have done is the same MVAPICH2-DPU, but we accelerate the HPL. As you know, HPL is the High-Performance Linpack that actually uses non-blocking bcast, and this is a CPU-based HPL, just to clarify. And then, that modified version is available also as a separate package, XScaleHpl-DPU. And now, if you put them all together, you will see, for different process grids and all, add an HPL CPU-based HPL, you can actually boost the performance here. It's higher is better, around like an 18, 15 kind of things, depending on your problem size. Okay?

So then, this is another application. Uh, this is a XCompact3D, um, that is from the UK. Um, so here again, uh, we modified that application because here, just like the executive assistant paradigm, let's say you are a manager. You didn't have any secretaries to work with, and your upper-level management is very happy with your progress. They say, "Oh, now onward, you can have a secretary. You'll be very happy, I mean, right?" But you need to modify your workload or your workflow. Okay, unless you decide, saying, "Oh, I have a secretary, but everything I'll do." Then that doesn't help. So, you need to create your workload saying, "Okay, what can I upload to my secretary so that I can do other things?" Okay, so, to take advantage of some of this DPU functionality, at least in this paradigm where we are trying to upload the MPI non-blocking operations, uh, other operations we'll see in the next slides, you may need to modify the workflow. And that's what we did here. The same thing, XCompact3D, we are able to get like a seven to ten percent, uh, benefit here.

So, those are the collectives. Then, we offloaded the point-to-point, okay, and also reduction, so MPI-like: MPI_Isend, MPI_Irecv. And the way we have done the study is with PETSc. Um, some of you might be familiar with PETSc. So, PETSc is a—it solves a 3D Laplacian and 27-point finite difference stencil. It is a very big application suite, which gets used in a lot of applications, even using OpenFOAM and all. So, we try to go inside and then trying to see the main kernel; can we actually modify the solver algorithms to efficiently offload reduction? And here, we also offloaded some part of computation, okay. In addition to communication, I need to be a little bit careful here because these ARM cores are not very—very heavyweight. Uh, so you need to balance it out, saying, "Okay, how much can I offload in terms of the actual application computation?" If you offload too much, it will be very slow, okay. So, with an ideal, like an experimentation here, we can see like OPT one eight eight node experiments here. Uh, problem size 256 x 256 x 256, it is a strong scaling, so we are able to give you like a 18 percent to 24 percent, okay. So, like this, a lot of solutions are available from these X-Scale Solutions. So, if you have any interest in this kind of solutions being deployed on your system, please contact us.

The other things we have done is a similar AI workload. Okay, and these are like a CPU-based training. So, we have just started first with a CPU DPU; we are moving into GPU DPU. In the end, you will see modern systems will have CPU, GPU, and DPU or IPU. So, one has to redesign this middleware with all these components in mind. But currently, we have CPU DPU solutions here. So, this is the accelerating CPU-based DNN support. So, we have, like as I was telling earlier, we have done this high performance DL where we have, you can run actually distributed Python or TensorFlow just directly on MPI. You don't have to go through NCCL or RCCL, any of those, and with a Horovod and with a modification, and here we support like a PyTorch framework for deep learning.

And this looks like this is a training of ResNet-20v1 model on CIFAR10 dataset on the blue field 3, up to 60 nodes. And here is the time for each epoch; the lower, the better. And here, we are able to show, depending on the number of nodes, almost like an 18% overhead. Okay, so we also have worked on more; we didn't include it here because we have limited time. Think of other operations like I/O, checkpoint, restart—all those kinds of operations. Actually, the DPU technology is ideal, so then you should be able to offload those.

So then, the other architecture, just to provide an overview—so these, like the AMD Smart NICs, um, so this is the AMD, uh, Pensando, uh, and, and this is the Alveo kind of things. These are, again, although very similar kind of things, we have not experimented with any of these, but the concepts can actually follow here.

And Intel is also working on very similar things—what they call Intel Infrastructure Processing Units, or IPUs—and they also have the Intel FPGA-based Smart NICs.

So, it gives some kind of a high-level overview. Um, if you go here, this is like the Intel FPGA SmartNICs. They have the 200 gigabits per second Ethernet, they have an onboard Ethernet controller, and the Intel Agilex FPGAs. And this is the Intel IPU, uh, which has already like packet processing engines, RDMA, storage capabilities. Also, they have a set of ARM cores. So, all these architectures—whether it is AMD, Nvidia, or Intel, even Octane—they're almost going in a very similar direction. Each one has different features, but the main challenge is like, how do you design the upper-level software stacks? Okay, at least here, in the context of what we are discussing, there are also communities that are also trying to offload other things like virtualization, for example. The cloud might be some security procedure, some encryption. So, I think over the next several years, we'll see gradually this side, more and more intelligent adapters will be coming up, and it will open up many challenges to see how to redesign the upper-level middleware. Also, there might be some co-design with applications, so you can boost the application performance and scalability.

So, we started with the basics—uh, the fundamentals of high-performance networking: basic high-performance networks, smart networks—and now we are heading into how, uh, AI-based network deployments are happening. So, we'll take a couple of specific use cases: the Cerebras and the Habana Gaudi.

So, more recently, uh, you would have seen that Cerebras has released a Wafer Scale Engine, or WSE3 architecture, with over 4 trillion 5-nanometer transistors—a heck of a lot of AI cores, huge amount of performance, a lot of memory, and a very rich network. So, for you to get a scale of how the GPU compares to the wafer scale engine, this is how it looks like. So, as you can see, it's a really, really, really big chip. Okay, so this is a comparison of WSE3 with Nvidia H100. All of this is courtesy of Cerebras; nothing that we have come up with. So, they have significantly higher performance, as you can see from the compute side, the core side, memory side, or the communication side.

So, with all of these, we see that Cerebris is able to give very good linear scaling. Until about a year ago, most of the Cerebris compute systems with CS1 were limited to mostly one box, but with CS2, and more recently, CS3, we see that Cerebris is trying to put together larger clusters—multi-node clusters—so that you are efficiently able to scale out deep learning training, particularly large language model training. And the goal is that you have significant ease of use. So, just by modifying some small numbers in your input YAML file, you're able to scale out in a very, very easy fashion. That is the goal with Cerebris. So, getting back to Cerebris, this is how they're able to scale out. We can see that they are able to achieve very near linear scaling.

The other AI architecture that we are going to look at will be the Havana Gaudi. So, this was a separate company which Intel acquired for a couple of billion dollars several years ago. One major difference between Havana and other architectures is that they have the network on the chip itself. So, you don't have to go out of the chip to reach the network. So, they have a bunch of tensor processing cores, a lot of on-chip memory, and RDMA over converged Ethernet V2 (RoCE V2) chips engines on the chip itself.

So, using this, you are able to interconnect a lot of these Havana Gaudi chips with a rich intra-node or inter-node interconnect. If you look at the Voyager system at the San Diego Supercomputing Center, that is one such system with a lot of these Havana chips, particularly meant for AI training.

So, just like NCCL—had, I mean, Nvidia had NCCL, AMD had RCCL—Havana has their own collective communication library called Hickle, or HiCCL. We at the Network-Based Computing Lab had added support for HiCCL inside the MVAPICH communication runtime. So, MVAPICH has unified support for NCCL, RCCL, HiCCL, whatever you want to, so that you get the best performance that any communication runtime is able to offer out there. And with such solutions, Havana Gaudi's team has been able to show that they are able to get very good throughput for various large language models, as seen in this graph. So again, no offense to any of the DL vendors or chip vendors out there, but just because of the lack of time, we're only able to show two of these solutions here. This by no means is the gamut of GPU vendors out there, so you have Samba Nova, you have GraphCore, you have several other vendors out there. Groq is another AI vendor out there. So this is just a couple of things that we have been able to highlight so far.

So, let's take a look at what we saw: all these different networks and all. So, what we'll try to do now, in the next section, is take a look at a little bit of software stacks, how things have been evolving, and then we'll take a look at some sample case studies, perform numbers to get a feel, and then we'll conclude. We have around 30 more minutes to go.

So when this field started, in the very beginning, of course, we had the hardware. Since InfiniBand was an open standard, there was a lot of initiative. I'm just going back here, like almost early 2000s, a lot of DOE labs, and there were multiple vendors of InfiniBand. There was an organization called OpenIB. The idea was that these adapters would be there, but then the software stack would be developed, just like following the Linux model. Vendors could contribute a driver, both user level and kernel level. Then, at the upper level, the entire stack, including MPI and all the other software stacks, allowed people to download these stacks and deploy them on their systems. It was a very nice model. It started, and when iWARP came from Chelsio, there was a lot of discussion. There was another company involved, so the IB community and the iWARP community merged together. OpenIB was renamed as OpenFabrics. The objective, as I mentioned, was to incorporate IB, RoCE, and iWARP in a unified manner, supporting both Linux and Windows. Many organizations, including us, participated in that effort, and it was very successful. It continued for many years and led to something some of you might have heard of called OFED, or OpenFabrics Enterprise Distribution. These were all tested together, so as an end user, if you were deploying a cluster, you could put together your hardware, deploy the OFED stack, and be ready to go. You had MPI libraries, multiple versions of them, and you could run your applications. The system evolved over the years. But eventually, some friction developed within the organization, just like in any other organization. The latest version went into OFED 5.3. Version 5.2 was released, and 5.3 was being developed. What happened between these OFED versions was that a branch occurred, leading to something some of you might have heard of, called MOFED. This was the Mellanox version of OFED. At that time, Mellanox kept adding new features that were not being incorporated into the OFED stack. So, they developed their own version, Mellanox OFED, with their own features, and the stack became fragmented. There were big conflicts here. I'm just stating the facts—some of you might be from these companies. There was friction between Nvidia/Mellanox (back when it was still Mellanox) and Intel.

So now, if we move forward, there actually—it has been bifurcated. The community has been bifurcated. One is a UCX consortium. Then another one is a Libfabrics or Open Fabrics consortium. So, these two are there. So UCX, it is like a—this is just trying to show like a timeline kind of thing. Again, overall, over the years, they have been putting together all these different components: the protocols, the mid-layer, then the upper-level stacks, and all. So they have the UCX, UCC, OpenSNAPI, even some Spark, UCD. So they have all these components for a complete stack. And this is heavily optimized for Mellanox InfiniBand. Okay. You can also support RoCE here. But other networks like Slingshot and all, you cannot do. There is a port out there. People have contributed, but not.

The other community, which has been led by Intel, mostly it is called Libfabrics, or it is a part of like a Open Fabrics interface. So, in fact, annually you will see there are two events. One UCX event happens in December. There is an OFI event that happens in, I think, March—a kind of timeframe. So these two communities are evolving. So, this Open Fabrics, or Libfabrics, Open Fabrics has the lower level called Libfabrics. And this is what they have defined as like Open Fabrics interface. So the idea is very similar, but a little bit different. So here, the idea is that every time you design hardware, the OFI provider can actually provide some of these APIs and, using the OFI interface, you can actually query or do callback function and implement the upper-level thing. So the idea is that, below can be opaque. So you may not be knowing anything, what is happening below, but as long as you are following the OFI interface, then you should be able to design. So, in fact, this is like, for example, I'll introduce later on the MVAPICH project. Some of you might be familiar. That is the project in my group. We have been doing it for 23 years. Earlier, we were doing like a direct verbs interface. So through that, we were able to support IB, Omni-Path, RoCE, iWarp, many of the interconnects. But then when Slingshot came, Slingshot only exposed it through OFI. So in the beginning, we were not able to support, but since last year, we have actually our MVAPICH moved to 3.0 and 4.0 series. So there, actually we support both OFI and UCX. So through those supports, you will be able to actually use any interconnect. So now we are able to support all the interconnects and, in fact, in the very latest OFI 4.0, we even brought back our own InfiniBand verbs level support. So, we have a lot of flexibility now for different interconnects.

And then, while we were talking, I think there was a question that came earlier, and it was mentioned there is another consortium which is getting momentum. This is called Ultra Ethernet Consortium. So, this is where I think all the Slingshot, HPE, Broadcom, and many other companies, Intel and all, are joining the effort to see if it will be pure Ethernet. So, there has been a lot of, I think you saw some earlier slide, how you presented earlier, there was a little bit of gap in performance between InfiniBand and Ethernet. So that's why all these intermediate protocols like STP were introduced. But over the years, especially after the Frontier system was deployed, that was like HPEs with Slingshot interconnect, and the Slingshot actually gives the APIs more like an Ethernet. So, this debate is still continuing. It is not a complete result, but these Ethernet companies are putting it together or working together to bring it to the Ultra Ethernet Consortium. And the idea is very similar. Again, Ethernet physical layer, link layer, IP layer, and the transport layer. And this is the software, all the different collectives like NCCL, RCCL, MPI, OpenSHMEM are being put together, and then the applications. So, as long as you are following any of these things, UCX, or OFI for the time being, or if you are on the Ultra Ethernet Consortium, you will be seeing software stacks will be coming, and you will be able to directly utilize those.

So with that, let's take a global picture of where this field is. As I said, many of you must be familiar with the top 500, and these are some of the numbers. If you go to the top 500, you yourself will be able to get this kind of charts. It shows, over the years, what is happening to like this. This is an InfiniBand. This is Ethernet. They have a little bit of a misnomer. They call it like a gigabit Ethernet. Ethernet is like some of the other networks, like the custom interconnection.

So, some of you might have seen these kinds of pictures from the top 500. You can also download these. Typically, there is a fraction distribution in terms of count, and then the other one is in terms of performance. The count is very simple: out of 500, which one belongs where. So here, it's, let's say, InfiniBand is 48%, Gigabit Ethernet is 13%, and others are here. In terms of performance, this is like, if you combine the performance of all 500 systems, then you say, 'Okay, how much are InfiniBand-based systems contributing?' or 'How much are Ethernet-based systems contributing?' So here, if you see, in fact, Ethernet, since this number one system, which is the exaFLOPS system Frontier, actually contributes a big weightage to, on the Ethernet side, in terms of the performance. This single system dwarfs a lot of other systems. So here, you will see these kinds of distributions.

If you go down a little bit, these are all from the latest June 2024 list, which was announced at ISE. We'll see a new list in another four months, in November. So, currently, there are like 43% of the top 500 systems that are InfiniBand. Starting, this is the largest one, which is number three, almost. It's like a 1.1 million core. Microsoft, CINECA, MareNostrum, then Oak Ridge, which used to be number one at some time, has gone down.

On the Ethernet side, this actually combines all the different kinds of systems. So, in 10 gigi, 25 gigi, 50 gigi, 100 gigi, 200 gigi. Of course, as I said, Slingshot 11 is at number one. That is the number one. The second is actually the Aurora system. That is also an exact scale system. So, these are, as you can see, like 8.6 million cores, 4.7 million cores. They're much bigger than many of these other systems. And, in the latest race, there have been a lot of new lists that have come out. So, this gives a perspective of where things stand. And of course, we'll see this continuously evolve over the years.

So then, let's move to a kind of sample case study. So here, what we'll do is just to give you a feel. We talked about a lot of concepts. But if you really take the latest systems, where do you think these are some like basic experimental numbers? These are all taken within my lab. So, these are all reproducible. These are also, we'll see, based on the like the OMB or low-level latency. So, we'll see low-level performance and then the MPI.

So, this is what it's like—you can think of it as, we talked a lot about InfiniBand and RoCE. So here, these numbers are taken from a little bit older hardware, like ConnectX-4 EDR, and they're connected back-to-back. But the idea gets clear. So, the whole idea is like this: this is a network layer. We are not bringing in any of the upper middle layer. So here, you will see, like, let's say the InfiniBand provides 730 nanoseconds. Okay. From a node to node. When I say node to node, that is like a memory to adapter to link or switch. This is not a switch. So, it's back-to-back adapter and then memory. The RoCE has a little bit of overhead. Okay. So, it is around like 130 nanoseconds overhead on top of the InfiniBand. If we take a look at the bandwidth, which is like a unidirectional bandwidth, they come very close. There is like a IB goes a little bit higher compared to the RoCE. And it again sometimes also may change based on the kind of switches you are connecting. The RoCE, as you know, is RDMA over Converged Ethernet. So, you can actually use any of the Ethernet switches. You can use from Broadcom, Arista, even Mellanox also, slash Nvidia. They also sell RoCE. So, in fact, from Nvidia slash Mellanox, the adapter is exactly the same. Okay. With some minor differences, so that the similar logic, which is trying to accelerate the InfiniBand as well as for RoCE, only some of the formats and all are modified so that you can use the RoCE interface.

Now, if you go into a little bit of differentiation or evaluation between sockets—we talked about these RSockets and IB Verbs—you will see very clear differences. So, this is like the sockets. This was again at the ConnectX EDR, that platform. Here you will see, like, a six microsecond at the TCP versus very close to a one microsecond at the InfiniBand RSockets or IB Verbs. And here you can actually see the Rsocket. That was, that's how it was designed: you don't have to go through all the complexity of InfiniBand but still be able to give a very good interface to the upper layer, especially with TCP/IP traffic. In terms of latency, you will be able to get good performance.

In terms of bandwidth, Rsockets—so, there's a little bit of an issue in the driver during that time at the middle messages. But, if you go with very large messages, InfiniBand and Rsockets were able to deliver very similar performance. But the TCP, TCP/IP doesn't deliver that much because of the higher overhead associated with this. And then, this is how you see a gap.

Now, we talked about this AWS EFA adapter. The Hari mentioned this: their design philosophy is a little bit different with respect to their data center, how they are operating the AWS cloud. For them, their workload latency is not very important. So, that's why you will see, if you do some basic performance testing here, it is a UD-based that we talked about, the scalable, reliable datagram (SRD)-based. Here, you will see like a, this is a high overhead compared to InfiniBand. When we are talking about one microsecond latency, and all, InfiniBand here is around 16-microsecond latency, but it is, it is by design. They are doing this because they have a different philosophy of spraying the network, providing different kinds of congestion management, and all. And here, it is trying to show like a, some kind of performance difference between UD and SRD. SRD typically adds some additional overhead compared to UD.

This is Broadcom. They have been working on RoCE adapters. So, this is some evaluation on their NetXtreme HCA 100 Gige. Now, they have a 200 Gige adapter. So, these numbers, here you can see, it comes to other latencies around 3.37 microseconds. The bandwidth effectively, you get like saturate the, like a 12. If you do a multiplication here, you will see like a 12.5, you should be getting, for a 100 gig, it comes very, very close. So, again, I just want to indicate, these are some sample numbers. If, of course, the technology is changing very fast. So if you are looking for the numbers, you should actually do it yourself so that you will see exactly, based on what CPU, what GPU, or the network adapter, and all the performance will change a little bit. It will almost remain the same, okay? That's what I just want to take.

So then, next, if we go into one upper layer, like here, we'll just take MPI. Um, this is the project I was talking about. Many of you might be familiar. Uh, this is called the MVAPICH project. Uh, we started this project almost the day one of InfiniBand. Uh, prior to InfiniBand, we were working on Quadrics, Myrinet—some of you old-timers might be knowing those technologies. Uh, when InfiniBand came, even though hardware was there, there was no ecosystem for software, so we were the very first one in my group in the world to jump in and have the MPI running over InfiniBand. And in fact, we gave a demonstration, uh, almost 22 years back at Supercomputing '02, um, over InfiniBand. And since then, the MVAPICH project has been continuing, um, steadily. I am very proud of my team. We are able to, while being in academia, run, develop production-quality code, deliver it to the community, and sustain it for 23 years. As many of you know, there is a lot of discussion going on about sustainability in the community, purely from a software development perspective. We are able to maintain that with all different kinds of funding. So currently, it is being used by more than 3,400 organizations—these are registered organizations, excuse me—in 92 countries. So, if you go to this MVAPICH site, we have actually a user tab. You should be able to exactly see how their people are using it. We have also crossed, like, just from our OSU site, we have crossed like a 1.8 million downloads just from our website. Of course, it is a part of all the different stacks like Red Hat's, you see Open HPC, Spack; it's very hard to keep track of those, but just from our website, we can keep a track, and we are going very nicely there. But in addition, we are actually, over the last 20 years, we have been working with this community—networking community, the systems community, Top 500 community—to make sure that even our MPI stack is also empowering many other Top 500 clusters. So here, there are some examples here.

So, if you take a look at it here, these are some of the sample numbers just to show over the years what has happened, just to understand the trend. Here it says, starting with a True Scale QDR, FDR, Dual FDR, EDR—this is HDR, and then the Omni-Path, and these are all the details, like on what systems they were taken, mostly the similar systems. So, the message I want to give: the left-hand side is like a latency and message size, and the right-hand side is latency for large message size. Okay, so for small messages, mostly you will see they are hovering around one microsecond, one microsecond point nine; different vendor, different adapter, servers, you will see around in that reason. And once you go to the large messages, of course, the bandwidth makes a big impact. Okay, so whichever network, as you can see, this latency is reducing over the years, as you have the network higher bandwidth; large message will improve. Now here we are trying to show the bandwidth; this is the unidirectional bandwidth and bi-directional bandwidth across message size. So, this is the love, this is the million bytes per second. Um, there is a little bit of conversion; these are taken all taken with the OSU Micro Benchmark, that is also a part of the MVAPICH project, that is open source. Many of you must be familiar; that is also kind of an industry standard for any networks papers you will see, even a lot of procurement, you will see some bullets saying, "Okay, please refer using OMB numbers." So, that is from my group; that's what we are using. So, you should be able to also utilize the, uh, to get these numbers, and, and, and very nicely, you will see like this stepwise curve, like over the years, the bandwidth is, uh, is improving, and similarly, also bi-directional bandwidth is improved.

If we take a look at the RoCE adapter later—earlier we saw, at the lower level—uh, this is like one level up. So, you can see that, uh, for example, this was like a—uh, IB was 730 nanoseconds, and this is like a nine-nine-one. So, we are able to—like, what, um, 140 nanosecond—is kind of an MPI providing the overhead, okay, on top of the network, just to give you a ballpark. So, these are very low overhead stacks we have designed, and this is where you use them very extensively: all those RDMA operations—uh, uh—and exploit all the different bars. Of course, this is a one-way latency using blocking send and receive with non-blocking collectives and all. There are a lot of numbers out there. We'll not have time to go over all those things.

This is the same thing, but in the bandwidth, it follows a very similar trend, just like what we saw at the load level. So, the InfiniBand and RoCE are able to give more or less similar bandwidth. It is a little bit higher, the InfiniBand.

So, this is the Slingshot 11, like the Frontier. As I said, starting from our 3.0 version, we are able to support. And then, this gives around two microseconds. So, we have a lot of latency point-to-point, and we are able to almost go into, like, a similar 200 gigabits per second bandwidth, unidirectional on the Slingshot network.

If you take a look at these, the OPX we talked about—Omni-Path—and then the Cornelius network has the open fabrics. Sorry, the OPX. And that again has multiple lower layers. Like, either you go to the OFI, or you can do PSM. There are different options. So, at an MPI level, actually, OPX is competing with almost InfiniBand. It's around like a one microsecond. In some of the platforms, actually, we have also shown like a lower than one microsecond, around 0.91 or 0.92. It all depends on, again, the system configuration, as I was telling. Depending on the processor speed, you may see some difference because some part of the calls are still being done by the processor, right?

So, and then this is on the Broadcom RoCE. Here, also, we see very similar kinds of things I showed earlier. This has a little bit more of the bandwidth numbers. Here, we are able to go into like 100 gigabits per second, and bidirectional is very close to 200 gigabits per second. We have not done the Thor 2. We did a little bit of orientation. We have done some earlier things on their system. I believe they are still tuning some of their drivers. And once we do that, we'll be very happy to even share their Thor 2 adapter numbers.

So, with this, I would like to conclude here. Just again, this just gives a very simple example. That's what I tell. I mean, there are so many different performance numbers. Even at the Hot Interconnect Conference in the earlier years, we have presented many, many, many papers. Please feel free to take a look at those so that you can get an overall idea.

So, so, so, at a high level, if you see, we almost talked for a little bit more than two and a half hours. We started with a very high level perspective, high level objectives, and goals of these high-performance networking. What is the requirement? We looked at the history, what has happened, and gradually you're seeing that more and more intelligence is coming to the adapters, to the switches. And we also talked about some of the smart networking architecture, the trends highlighted the main features, given all of the high-performance network hardware software ecosystems, discussed some of the sample performance numbers. But as we see the trend, you will see more and more because systems are becoming denser. I indicated that more and more GPUs are coming together. Server density is increasing, and to have a good balance of computation and communication, and to deliver very good performance in scalability, both strong scaling and weak scaling for HPC and AI applications, people are exploring many of these smart network architectures. Whether you make the basic hardware, basic adapter, much faster, or put intelligence in the adapter or put intelligence in the switch. But as these are being added, the software ecosystem is also evolving. So, that leads to a lot of innovations. Every time a new adapter comes, it just opens up avenues, and unless you actually bring that software ecosystem ready, you don't see people trying to adopt that. And that's what actually from our group, we have been working in this area for almost the last 30 years, and we try to bring all these latest results to all of you.

I'd like to thank all our sponsors: funding, equipment donation. And that's how we are able to share this information with all of you.

But more importantly, these are all my heroes. What we presented in what is like two and a half hours is actually the work of almost 100 students. Over the last 20 years, many of these names might look familiar. They have graduated from here and are holding positions in many other companies, NASA, labs, academia, that kind of thing. So, I would like to thank all of them. Every time I present these things, I'd like to salute them.

So with this, I'd like to close our email addresses are here.If you have any questions, please feel free to send us a note to myself.