YouTube:https://youtu.be/cGQHZ-8205o?si=PpKVIFDNEoj-wWuH
Text:
Hi, I'm Mark Nossokoff, Research Director with Hyperion Research, and what I'd like to do today is go through an HPC/AI market update and give a snapshot of some recent research on composability within the industry.

For those of you not familiar with Hyperion Research, we are an industry analyst market research consultancy focusing on the HPC/AI advanced technical computing markets. We cover all sectors—government, industry, and academia—global aspects, as well as user perspectives and vendors' perspectives.

We also monitor, facilitate, and moderate the HPC user forum, and I'll touch on that a little more at the end of the session today.

Jumping right in, more specifically, as I mentioned, I hope to present a sprint through the HPC/AI market update, an overview of the market, and then spend some time on insights on composability and memory considerations from some recent research that we have done. Then, I'll end and close with some observations and thoughts on standards and consortia in this area.

Starting with the HPC/AI market, this is hot off the presses: earlier this week, we changed our market taxonomy a bit to better incorporate and integrate the impact of AI into the HPC/AI market size. With this new AI-centric server taxonomy, we are showing an increase in the 2023 market size of 36.7% over how we considered the market in the past. And that growth and impact of HPC/AI is projected to continue through the five-year forecast.

For the rest of the market data, though, I'm going to go back to our prior assessment of the market, as the new AI taxonomy has not filtered through the rest of our forecast yet. So, just a brief snapshot here of what the market did in 2022 and 2023. As you can see, 2023 showed little change from the previous year. Servers are roughly 40%. Cloud spending has increased in that time, and the rest of our broad market areas have stayed relatively stable.

As we look at how we segment the market by competitive segment, while the high end of the market (the leadership computers and the supercomputers) gets an awful lot of the attention, there's still an awfully large market at the sub-$10 million system and cluster level. And that's reflected in the overall market.

The leaders in the HPC market space, specifically in HPC/AI servers, remain fairly consistent: HPE on top, followed by Dell and Lenovo from a global vendor perspective. What is of note, as well, is the fairly large market share within the "other" category. And some of that is what led to our new, AI-centric taxonomy.

Drilling down a little bit more into the broader market areas, and the areas where aspects of composability are going to come into play, the server market is the largest piece of the overall market. Storage, the second-largest area, is, however, the fastest-growing segment.

Looking at the cloud aspects, we have been continuing to bump up our forecast of what people are spending in the cloud for HPC/AI resources. And once we incorporate some more AI aspects, we can expect to adjust this upward as well. The last view of the HPC/AI market shows the distribution of spend in the cloud versus what's being spent on-prem overall, with cloud spend projected to approach a third of overall spending out in 2028.

Moving now into what we've seen from recent research.

This is actually fairly new as well, and we're considering it preliminary. It comes from our most recent global site survey of over 107 sites across industry, government, and academia. We cover a whole host of topics of interest to the HPC community, composability being one of them. And this shows that 27% of these sites indicate an intent to support composability: they either are currently doing it or intend to do it. I can't point to a specific number I was expecting, but this was actually a higher percentage than we were expecting to see, and it is encouraging for the adoption of composability. There's still, though, approaching 20% (18%) that have zero intent to implement it.

What are they going to compose? In the survey, we asked them to select all the elements within a system that they're intending to compose. Roughly half the sites indicated that GPUs are what they're intending to compose, CPUs came in at 42% of the sites, and just under 40% intend to compose memory. It doesn't add up to 100%, as they could select multiple items that they're intending to compose within their future architectures.
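To make the "doesn't add up to 100%" point concrete, here is a minimal sketch of how a select-all-that-apply question is tallied. The four responses below are hypothetical and are not taken from the survey; the function name `selection_shares` is also just an illustration. Each share is computed against the number of sites, not the number of selections, so the shares can sum to well over 100%.

```python
from collections import Counter

# Hypothetical responses (NOT survey data): each site lists every
# element it intends to compose.
responses = [
    {"GPU", "CPU"},
    {"GPU", "memory"},
    {"CPU"},
    {"GPU", "CPU", "memory"},
]


def selection_shares(responses):
    """Return the percentage of sites that selected each option."""
    counts = Counter(option for site in responses for option in site)
    n_sites = len(responses)
    return {option: 100 * count / n_sites for option, count in counts.items()}


shares = selection_shares(responses)
total = sum(shares.values())  # exceeds 100 because sites pick several options
```

With these made-up responses, GPUs and CPUs are each selected by three of the four sites and memory by two, so the individual percentages overlap rather than partition the sites.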

Just highlighting what these sites feel are the benefits of, and barriers to, adopting composability. Increased utilization was their highest perceived benefit, along with cost savings and simplified scalability of their systems. Where there's still some hesitation to implement, it's due to a perceived lack of maturity and stability in the solutions and the ecosystem at this point, as well as some interoperability concerns and any changes they might have to make to their codes.

Now we'll turn and take a quick look at some of the memory considerations for composability.

From an overall broad market perspective, memory-bound performance limitations are pretty pervasive, with roughly three-quarters of the sites indicating they have an application whose performance is memory-bound in one way, shape, or form. As we drill down further and look at those applications that do have memory-bound performance limitations, memory capacity is a limiting factor in roughly three-quarters of the applications. And this is pretty consistent. We've asked this question a number of times over the last year and even further back. We've also asked what the memory limitation is: is it the bandwidth, is it the capacity, or both? We see the breakout of the percentages of applications for which the sites have indicated each memory limitation.

So, note that this suggests that there's not enough memory to support the applications. At the same time, there's an indication that a large number of the systems have stranded memory; the lack of utilization of the memory approaches 50% in some cases. The large majority of the systems have between 10% and 50% of their memory capacity stranded. Looking at the types of applications, traditional HPC modeling and simulation workloads more commonly have stranded memory capacity than AI and HPDA analytics workloads.

On the memory bandwidth side, over half the sites indicate that between 64 gigabytes per second and 256 gigabytes per second of additional bandwidth would alleviate their memory bandwidth limitation.
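As a rough illustration of the stranded-memory figures above, the sketch below computes the stranded fraction per node. It assumes "stranded" means installed memory that a node's workload never uses at peak; that definition, the function name `stranded_fraction`, and the node sizes are all assumptions for illustration, not numbers from the survey.

```python
def stranded_fraction(installed_gb, peak_used_gb):
    """Fraction of a node's installed memory that stays unused at peak."""
    return (installed_gb - peak_used_gb) / installed_gb


# Hypothetical nodes as (installed GB, peak used GB); NOT survey data.
nodes = [(512, 256), (512, 460), (512, 300)]

per_node = [stranded_fraction(installed, used) for installed, used in nodes]

# Capacity that sits idle across the nodes; a composable memory pool
# could in principle hand this to a capacity-bound job on another node.
total_stranded_gb = sum(installed - used for installed, used in nodes)
```

In this made-up example the first node strands half of its memory, which matches the upper end of the range mentioned in the talk, while the three nodes together leave 520 GB idle.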

Looking a little further at CXL specifically, and the sentiment towards CXL within the user community, roughly three-fourths of the sites have at least a basic understanding of it. As far as when they perceive it's going to be ready for production at a broad market adoption level, the answer has been pretty consistent across the several times we've asked over the last couple of years. A fairly significant bucket of responses says between 12 and 18 months, and a similar-sized cohort indicates that they feel it's even a little farther out, 18 months to three years or more. And when I say that's consistent, it's kind of a sliding window. If there were a perception of progress towards stability, we would expect this time range to shrink as we move through time, but there's no such indication. This is a pretty consistent perception.

Closing out just this snapshot and market update.

A couple of observations: I wanted to look at the various standards for some of the interconnects and protocols that are out there, and at the overlap within the leadership and founders of these industry consortia. We do see some commonality and overlap in the standards leadership. We get questions such as: what's the impact of, say, UALink on CXL? Will there be any impact one way or the other, either a benefit or a detriment? From this perspective, given the overlap within the leadership and the participation, we do see that there will be cooperation and coordination for the stewardship and success of these individual initiatives.

Then, also, thinking about the architecture of some of these interconnect protocols and the actual interconnects themselves, there are multiple hierarchies of these within the systems. And while they're not necessarily in competition, these initiatives and standards do need to be, I think, considered together in the architectures, so that they interact, play well, and integrate well with one another, whether you're at the rack-to-rack level, node-to-node, or down inside the system within the CPU and GPU interconnects and the protocols used to communicate with each other. So, it is incumbent that architectures and development efforts consider all the levels and hierarchies of the interconnects and protocols.

To recap, there are memory-bound limitations for the applications in this market space, and they are pretty pervasive. Innovation is occurring, but more, I think, is going to be required to really address the concerns and move the needle forward. Composability is an emerging architecture to address this, to dynamically configure and pool these resources: not only memory, but the GPUs and CPUs as well. And the last observation is that CXL does appear to be approaching an inflection point. We mentioned that consistent bucket. We do anticipate that the perceived time to readiness will begin to shrink, but the current state does reveal various levels of confidence from the users about when, and how quickly, that will occur.

So with that, we welcome any questions. Feel free to reach out with any questions along any of these topics. And, I would like to highlight, too, an upcoming HPC user forum that we have. It's three weeks away. I invite you to attend, look at the website, and consider attending the user forum at Argonne National Lab the first week in September. Thank you very much.