YouTube:https://www.youtube.com/watch?v=3HM-rSZT9ao
Text:
So yeah, different from other talks, I will not try to sell products; I'm actually requesting them. I'm telling you what I have ready for CXL, and hopefully someone says, 'Hey, you can use this.' I can buy; I'm really looking forward to getting my hands on those. So today I'm bringing two use cases.

One is ecoHMEM and the other one is HomE. So first I will give a little bit of motivation on my research, kind of gluing these two use cases together and how I got to them, then present the two use cases, and then concluding remarks. So, because this has led to some confusion in the past in some talks or papers: anytime I mention persistent memory, it's not I/O or persistency, it's like RAM expansion.

So my background is programming models, runtime systems, system software, mainly started in GPU computing. This brought me to prepare for the KNL, which was an accelerator, but it had MCDRAM along with DRAM, so it was heterogeneous memory. So I started preparing for that even before it was released, back when I was a postdoc at Argonne National Lab. So I started working on automatic distribution for heterogeneous memory systems. And that evolved, and at some point we realized, hey, what about this use case of homomorphic encryption, which I will explain in more detail later. And it was actually a great use case for the Optane persistent memories in memory mode. That's why I also got engaged in that kind of use case, even though I'm not an applications guy, I'm more on runtime systems, system software. So we started with Xeon Phi, rest in peace. Then we passed to Optane persistent memory, rest in peace. (laughing) Now that's our next step. Unfortunately, to my taste, and I understand why anyway, but this one kind of died too early because CXL is not quite ready yet, at least for production. So we're kind of in that gap and we're very much looking forward to CXL becoming ready for production.

So let me start with this first use case that we have been working on, and I realized when making these slides, for like 10 years already, so time flies.

Because I really started before the KNL was ready. Actually, of course, I was at Argonne National Lab. We had some information under NDA. So we were getting ready for that. So our view was to move from this hierarchical view, which was actually fairly popular for quite some time, which I refer to as deep memory, to this one in which we expose all the memory subsystems as first-class citizens to the software. But of course, we do not want applications to deal with that. That should be completely abstracted. And yeah, I forgot that there's this new player here. So who's in charge of the applications' data distribution? I think that's one of the latest questions we had. And there are actually quite many possibilities that are worth exploring. And there's no one-size-fits-all or single solution. So we've been exploring quite a lot. Is it the operating system? Based on what? On heuristics or on monitoring? With hardware assistance or not? Should we get historical data or hints from users? We clearly need an ecosystem of software, like MemVerge is actually doing intensively.

So this is what we aim for. We aim to move from this view in which all of our data is stored in just the single memory subsystem that we always had. Like when we mention memory, okay, DRAM. But no, we may have more. We have a variety, and you name it. It's just the names that are arbitrary. You can have this MCDRAM, HBM, scratchpad, CXL, whatever is there. So what we want to do is to assess the optimal data distribution to minimize energy consumption or to maximize performance. And actually when we were looking at persistent memories, our aim, and we were quite successful, was to use this data partitioning to actually reduce the DRAM. So we could have big memories which were energy efficient and reduce the energy-hungry DRAM. And with smart data placement, we could get a negligible performance detriment.

So yeah, mentioning that again, I still see many papers talking about this kind of memories exposed to software as deep memory, but to me that's not deep memory, because depth implies hierarchy. So I'm just referring to that as heterogeneous memory and getting rid of the hierarchy. So there are a couple of methodologies, broadly speaking. You can do page movement at the operating system level and leverage the operating system view. That's very transparent for applications. They do not realize there is page movement underneath. It kind of follows similar principles to swapping. So the operating system monitors the different pages, monitors hot and cold pages, accesses, access patterns, a lot of things that it can monitor, and it migrates pages transparently. This may have the limitation that the operating system is limited in what it can see. So at some point, it cannot predict the future. So it may decide, 'Hey, this page is becoming hot, let me migrate it to DRAM.' In this case, it could be from a far away CXL memory to the close-by DRAM. But that's not a cheap operation, and maybe at that time it doesn't pay off, and one cannot know. But it's certainly a very interesting way of doing it, and it has its pros and cons. And the other one, which is what we do in ecoHMEM, has its pros and cons as well. It's not perfect at all. We figure out what to do beforehand. So we do some profiling of the application so we know what will happen. And then, during runtime, we intercept the allocation calls and then we know where to place data. But that also has its caveats because we are static. We optimize for the entire execution. As of now, we cannot adapt to movements, or even to phases of the application. That's why right now we are trying to mix them both to be a proactive plus a reactive solution. We are trying to get a good initial static placement plus letting the operating system adapt during runtime, the best of the two worlds.
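
[Editor's note] The point about page migration not always paying off can be put as a simple break-even calculation: moving a page only helps if the time saved on its future accesses exceeds the one-off cost of the move. The sketch below is only an illustration of that reasoning; the function names and all latency and cost numbers are assumptions made up for the example, not figures from the talk.

```python
def migration_pays_off(expected_accesses, far_ns, near_ns, migration_cost_ns):
    """True if the time saved by future accesses exceeds the one-off copy cost."""
    saving_per_access = far_ns - near_ns
    return expected_accesses * saving_per_access > migration_cost_ns


def break_even_accesses(far_ns, near_ns, migration_cost_ns):
    """Smallest number of future accesses for which migration pays off."""
    saving_per_access = far_ns - near_ns
    if saving_per_access <= 0:
        return None  # the near tier is no faster, so migration never pays off
    return migration_cost_ns // saving_per_access + 1


# Illustrative numbers only (assumed, not measured).
FAR_NS, NEAR_NS, COST_NS = 250, 100, 2000
print(break_even_accesses(FAR_NS, NEAR_NS, COST_NS))  # -> 14
```

The catch described in the talk is that `expected_accesses` is a guess about the future: the operating system only sees past accesses, while a profiling-based approach knows the whole run but cannot react to changes.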

So in a glimpse, this is the framework that we have ready. It is based on profiling and user-level interposition of allocation calls. The beauty of this is that there are no source code modifications at all. No recompilation, just the original binary, the production-ready binary, we leverage that. So first we need a profiling run. In this case, we use Extrae, but we can be compatible with others. As of now, we support Intel counters, but we are working on ARM support as well. And this is just one shot.

One time we do the profiling and then you run in production many times. So afterwards we do the heterogeneous memory advisor. There's a Python script that figures out depending on the underlying system, how to distribute the different allocation calls.
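
[Editor's note] As a rough idea of what an advisor step like this could look like, here is a minimal sketch that assigns profiled allocation sites to memory tiers. It assumes a simple greedy heuristic (most cache misses per byte first, fastest tier first); the talk does not describe the actual ecoHMEM algorithm, so the `Site` fields, the `advise` function, and the heuristic itself are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str        # allocation call site (hypothetical identifier)
    size: int        # peak size allocated from this site, in arbitrary units
    misses: int      # profiled cache misses attributed to this site


def advise(sites, tiers):
    """Greedy placement. `tiers` is a list of (tier_name, capacity), fastest first.

    Sites with the most misses per unit of size get the fastest tier that
    still has room; anything that fits nowhere falls back to the last tier.
    """
    remaining = {name: capacity for name, capacity in tiers}
    placement = {}
    by_density = sorted(sites, key=lambda s: s.misses / max(s.size, 1), reverse=True)
    for site in by_density:
        for tier_name, _ in tiers:
            if site.size <= remaining[tier_name]:
                placement[site.name] = tier_name
                remaining[tier_name] -= site.size
                break
        else:
            placement[site.name] = tiers[-1][0]  # overflow to the slowest tier
    return placement


sites = [Site("A", 4, 1000), Site("B", 1, 900), Site("C", 2, 100)]
print(advise(sites, [("dram", 4), ("cxl", 100)]))
```

Here the small but heavily accessed site B goes to DRAM first, the large site A no longer fits and lands in the big tier, and C takes the DRAM that is left.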

And during runtime, we interpose this, we do an LD_PRELOAD of this library, which has some smartness to adapt during runtime. And it tries to honor the object distribution intercepting the allocation calls. And we should be seamlessly compatible with CXL local and remote memory with Sapphire Rapids high bandwidth memory, because we just understand the NUMA nodes and we can work with NUMA allocation, with NUMA allocators underneath. So it should be fairly transparent.
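
[Editor's note] The real interposition library is a native library loaded with LD_PRELOAD; the following is only a Python model of the decision it has to make on each intercepted allocation: look up the advised NUMA node for the call site, honor it if that node has room, and otherwise fall back. The class, its fields, and the fallback order are assumptions for illustration, not the ecoHMEM implementation.

```python
class PlacementPolicy:
    """Model of a runtime that tries to honor an advised object distribution."""

    def __init__(self, advised_node, node_capacity, default_node):
        self.advised_node = dict(advised_node)  # call site -> NUMA node id
        self.free = dict(node_capacity)         # NUMA node id -> free bytes
        self.default_node = default_node        # used for unknown call sites

    def choose_node(self, call_site, size):
        """Return the NUMA node on which to place `size` bytes from `call_site`."""
        preferred = self.advised_node.get(call_site, self.default_node)
        # Try the advised node, then the default node, then any node with room.
        for node in [preferred, self.default_node] + sorted(self.free):
            if self.free.get(node, 0) >= size:
                self.free[node] -= size
                return node
        raise MemoryError("no NUMA node has enough free memory")


# Hypothetical setup: node 0 is small local DRAM, node 1 is a large expansion tier.
policy = PlacementPolicy({"solver.c:42": 0}, {0: 100, 1: 1000}, default_node=1)
print(policy.choose_node("solver.c:42", 80))  # advised node has room
print(policy.choose_node("solver.c:42", 80))  # advised node is full, falls back
```

Because the policy only reasons in terms of NUMA node ids, the same logic applies whether a node is backed by DRAM, HBM, or CXL memory, which is the transparency argument made in the talk.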

So some results just quickly: we have compared our solution with the Intel kernel-based migration approach and with the memory mode as well. The memory mode is what you get when you have the DRAM as an inclusive cache for the persistent memory, everything managed by hardware. And for example, we can get up to 2x in MiniFE, one of the benchmarks that we have, but these are all HPC benchmarks. In some other cases, we were able to drastically reduce the DRAM, up to four times, and notice no significant overhead. Then if we go to slightly more complex applications, like LAMMPS, for example, we saw similar performance. In OpenFOAM, we saw a 6% speedup, which may not seem much, but it comes for free, no need for any modification. So that's why we realized, okay, for these more complex applications, with long runs and different stages, we actually need to combine this with some more ability to do runtime page movements. And that's what we're doing right now. But what we saw, interestingly, is that we didn't find any use case with a performance slowdown. So either you get around the same, like with LAMMPS, or maybe up to 2x with MiniFE, depending on the access patterns. So this is kind of something that we were happy to add.

So this is open source. It was released and presented at Cluster 22 last September in Germany, and we got to be a best paper finalist.

Okay, and now going to my second use case, homomorphic encrypted deep learning inference. I call it HomE. This is a project, a new ERC grant.

So there are many use cases in which deep learning users would benefit from offloading their inference tasks to untrusted servers, such as, for example, those in the cloud. However, there may be privacy concerns about their models and datasets, since they may be exposed to third parties. And even if the users themselves wouldn't care much, there may be policies and regulations that could stop them from using cloud inference services.

So there is some technology which is called homomorphic encryption, which enables computations on encrypted data without the need for decryption. And this is how it works. So we could encrypt our inference model and/or our user data, send it to a cloud for doing the inference tasks, already encrypted, and then get the results back. And only then decrypt those with our private key. Then privacy would be completely guaranteed. However, the problem is that depending on the encryption parameters, the data size may grow from 100x to 10,000x. Depends on how we encrypt and how strong we want our encryption to be and a few others. Then when applying homomorphic encryption, our models and datasets easily grow beyond RAM spaces. So our idea is to leverage large memory pools, along with some smaller, close RAM memories, to enable production-ready use cases for homomorphic encryption.
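
[Editor's note] To make "computing on encrypted data" concrete, here is a toy example using the Paillier cryptosystem, which supports addition on ciphertexts. This is a swapped-in textbook scheme chosen because it fits in a few lines; the talk does not name the scheme used in HomE, and deep learning inference needs schemes that also support multiplication. The tiny primes are for illustration only and offer no security.

```python
import math
import random


def keygen(p, q):
    """Paillier key pair from two primes, using the common choice g = n + 1."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # modular inverse (Python 3.8+)
    return n, (lam, mu)


def encrypt(n, m):
    """Encrypt integer m (0 <= m < n) under public key n."""
    n2 = n * n
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2


def decrypt(n, private_key, c):
    lam, mu = private_key
    n2 = n * n
    return ((pow(c, lam, n2) - 1) // n * mu) % n


def add_encrypted(n, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts."""
    return (c1 * c2) % (n * n)


n, private_key = keygen(61, 53)           # toy primes, insecure
c_sum = add_encrypted(n, encrypt(n, 20), encrypt(n, 22))
print(decrypt(n, private_key, c_sum))     # -> 42, computed without decrypting inputs
```

Even in this toy, each ciphertext lives modulo n squared while the plaintext lives modulo n, so encrypted values are wider than what they encode. The 100x to 10,000x growth mentioned in the talk comes from the parameters of the schemes actually used, which this sketch does not model.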

So this is, just in a glimpse, the main idea around a large project, actually. So this was all designed to work with Crow Pass, with Optane Crow Pass. So I slightly touched my slides, and instead of persistent memory, I call it large memories for now. So with large memories, we have seen that we will have a little bit more latency, perhaps less bandwidth, although I heard in some presentations that we get the aggregated bandwidth and in the end we get more bandwidth, but anyway. We will have to care about access patterns as well. And if we use the DRAM or the nearby DRAM as a cache, we also have to take into account the cache policy. But we actually have supporting evidence, at least, that this homomorphic encrypted deep learning inference works really well with Optane persistent memory in memory mode. And that's because when you homomorphically encrypt, your data item, like a floating point, turns into long chains of integers. So there is quite high locality, temporal and spatial. That's why it worked great with this kind of cache. It's not so advanced, I mean, you know how the memory mode works, with this direct-mapped cache and everything. It worked really well in our analysis. That's actually the paper that sparked the entire project.
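
[Editor's note] The locality argument can be illustrated with a small direct-mapped cache model: reading long runs of consecutive integers (as an encrypted value would be laid out) hits in the cache most of the time, while scattered accesses over a large span almost never do. The cache size, line size, and traces below are assumptions for the example and do not reflect the real granularity of Optane memory mode.

```python
import random


def hit_rate(addresses, num_lines, line_size):
    """Hit rate of a direct-mapped cache with `num_lines` lines of `line_size` bytes."""
    tags = [None] * num_lines
    hits = 0
    for addr in addresses:
        block = addr // line_size
        index = block % num_lines      # direct mapping: one possible slot per block
        if tags[index] == block:
            hits += 1
        else:
            tags[index] = block        # miss: replace whatever was in the slot
    return hits / len(addresses)


def streaming_trace(num_items, item_bytes, word=8):
    """Each item is read as a long run of consecutive 8-byte words."""
    return [i * item_bytes + off
            for i in range(num_items)
            for off in range(0, item_bytes, word)]


def scattered_trace(length, span, seed=0):
    """Word-aligned accesses spread uniformly over a large address span."""
    rng = random.Random(seed)
    return [rng.randrange(0, span, 8) for _ in range(length)]


streaming = hit_rate(streaming_trace(16, 4096), num_lines=64, line_size=64)
scattered = hit_rate(scattered_trace(8192, 1 << 30), num_lines=64, line_size=64)
print(streaming, scattered)
```

In the streaming trace every 64-byte line is missed once and then hit seven times, so even a simple direct-mapped policy captures most accesses; that is the kind of behavior the talk attributes to encrypted data.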

This project is an ERC Consolidator Grant. This is the most prestigious funding scheme in Europe. It has this kind of longish title. It started already in September, and we have almost 3 million euros in funding. This is only for me, for BSC and the PI, and about half a million euros in equipment. So it started in September. I have to admit that we have been having quite some difficulty in hiring and building the team. We already started, but I believe this is not a surprise for you, we're all trying to hire, right?

So nothing different here, actually. So at a high level, our objective is to enable currently impossible scenarios of homomorphic encryption and deep learning. One is to enable large models, production-ready models, because as of now, in the literature, you can find kind of toy or mobile models, so that they kind of fit in RAM spaces. We also want to enable multiple smaller models running concurrently, as in the case of a cloud service attending multiple clients. And also large inputs. Many use cases benefit from large datasets, like cancer detection, for example, which benefits from higher resolution images. And this, of course, in combination with the two others. And the central piece of all this will be having large memories to accommodate the large memory consumption inherent to homomorphic encryption. But we also want to address the challenges. There will be challenges that come along with using large memories, like data movement and where data should be.

Because what we're going to do is to focus in the software side in three pillars. One is general optimizations in the software stack. We have already identified some sources of inefficiency in current software. Also specific optimizations for heterogeneous memory systems, like we want to have the data where it has to be at different times. And one of the things is that we will be using different multiple accelerators. We'll be starting to look into GPUs, FPGAs. And again, a central piece of this will also be that everything will be in combination with large memory pools. And right now, looking at CXL, despite this being designed for Optane memories.

So in the second work package, we are going to develop some simulation infrastructure to explore features such as very wide processing units and others, like runtime-adaptive cache associativity, something like that. We're also looking into processing in memory. We've already started this task. And the target here is to design a domain-specific accelerator in which we consolidate all of our successful ideas. All this, again, in combination with large memory pools.

And yeah, this of course will be done in hardware-software co-design. So one of the things that worries me now is the testbed to be provisioned. So I'm planning to buy four Sapphire Rapids compute nodes. And the plan was to have the maximum possible Optane Crow Pass. So what happened is that Intel was promising me Crow Pass: yes, you will have it, you will have it, you will have it, and then, you will not have it. And that was when the procurement was already in the hands of the state lawyers of the Spanish government, because we are a public company. So I have the procurement on pause, stopped, and I'm trying to look at any CXL solution that could cover the absence of Crow Pass.

We'll also have FPGAs and GPUs in this testbed. We're still thinking about whether to have real processing in memory or not. We haven't found any current or upcoming product that could fit the kind of computing that we will be doing, so we might go to simulators, but it's something that we haven't decided yet. The budget is over half a million euros plus tax. As for the time frame, we would need it as soon as possible.

So to conclude this part, this is a timely project. It's going to be an enabler. It is also looking at cutting-edge technology. We have a recent publication supporting this. There are challenges in overcoming the inefficiency and the limitations of current software and hardware technologies. This is a high-risk, high-gain project, as actually mandated by the funding scheme. And we see an impact on the state of the art, making groundbreaking contributions and trying to enable homomorphic encryption and deep learning in production use cases. And importantly, we need to cover a gap: whether CXL is an enabler for homomorphic encrypted deep learning inference in production environments.

So my concluding remarks. I have presented two use cases for CXL memory expansion, either local or pooling: software for automatic data distribution, ecoHMEM, and homomorphic encrypted deep learning inference, HomE. So I'm looking forward to getting my hands on any solution, whether remote access, prototype testing, or purchase. So if you have such a thing and you're interested in these use cases, please drop me an email or talk to me.

Acknowledgements to my team. This is part of my team that is working in these projects. Of course, we're still hiring mainly in the HomE one. And acknowledgements to my funding, to national projects, and thank you very much for your attention.