All right, so my name's Adam Manzanares, and I work at Samsung as part of what is called the Global Open Ecosystem Team. Our main charter is to work closely with standards bodies, product development teams, and software communities to enable ecosystems for emerging hardware. And as we all know, CXL is exactly that kind of emerging hardware. What I want to do in this talk is get people to understand QEMU. If you haven't heard of it, it's a very useful project in my opinion: it lets you start emulating hardware before the hardware is widely available, and I'll go through why I think that's so valuable. I'll also give a rundown of what we currently have for CXL, as well as a high-level view of where we're headed.
So if you have never heard of QEMU, I think it's worth taking a look at. I'm talking from the developer perspective here, and I look at CXL primarily from the type 3 memory expander point of view. Samsung has announced hardware plans around this, and that's driving a lot of the initial effort. Now, I know we're talking about storage use cases and moving further, and I have a little preview of that in here too. But there's a lot of shared management between the two: if you looked at the earlier slides, there's CXL.io, and that can be common between the storage case and the memory case. It's not quite a sideband, but a management interface you can talk to over a mailbox, which we'll get into a little later. So let's start with QEMU. What is QEMU? It's an open source, generic emulator and virtualizer. One interesting thing you can do is run, say, an Arm emulator on an x86 system and execute Arm architecture code under QEMU. That becomes very handy from an ecosystem perspective. For example, one key player we see in the open source community is from Huawei, and he's mainly interested in Arm, so almost all of his development is done on Arm. I primarily work on x86-based systems, but occasionally, if he's ahead on something, I'll emulate Arm on an x86 system to work with the patches he cares about. Much of the infrastructure is the same, and while some developers prefer their own architecture, you can go back and forth using this emulation. QEMU also has extensions for hardware virtualization using KVM, so when the guest and host architectures match, you get speedy virtualization as well.
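As an illustration of that Arm-on-x86 workflow, cross-architecture emulation is mostly a matter of picking the right QEMU binary. This is a sketch: the kernel image and disk paths are placeholders, and exact options vary by distribution and QEMU version.

```
# Boot an Arm64 guest on an x86 host (pure emulation, no KVM).
# "Image" and "rootfs.img" are placeholder guest artifacts.
qemu-system-aarch64 \
    -M virt -cpu max -m 4G -smp 4 \
    -kernel Image \
    -drive file=rootfs.img,format=raw,if=virtio \
    -append "root=/dev/vda console=ttyAMA0" \
    -nographic

# When guest and host architectures match, use KVM acceleration instead:
#   qemu-system-x86_64 -M q35 -accel kvm ...
```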
And in my experience, it's just a very valuable tool for blueprinting software, and really that's what our work boils down to: we want to make sure there's a blueprint for people to develop software, and that the software moves as fast as the hardware, or even faster if possible. So I just want people to start thinking of this as a possibility: you can emulate these devices today if you can't get your hands on them.
I want to bring it back to the Storage Developer Conference, too. There is this push for CXL for storage, and we're still discovering what that means, so let's look at QEMU from an NVMe perspective for a moment. This talk is about CXL, but we learned a lot from working with NVMe and QEMU together. Again, this is a way for you to blueprint and discover software issues early and feed them back into the standard; the earlier you can do that, the more valuable it is, in my opinion. Some of the features we rapidly prototyped end to end, including connecting into applications, were ZNS, FDP, Simple Copy, and SR-IOV; we'd connect each into the best representative use case and then demonstrate it. All of the software is open source, so anybody can reproduce the results, and it gets a lot easier when you can package QEMU and ship the virtualized hardware along with the tests. Now, QEMU is not generally high performance, especially when you're emulating a different architecture, so you're not really measuring performance; it's about blueprinting and architecting your software and making sure it all works together. That's our main mission: blueprinting these features in an open manner to get other people involved. The more people that participate in the open source software, the higher quality the solution is. We're ecosystem builders; the standard is part of it, and we believe the system software should go right along with it. And in my opinion, Samsung has been quite good at this; we've had many successes by making software a priority.
For QEMU, for example, Klaus, on the same team I work on, is actually the maintainer of the NVMe support in QEMU. Keith Busch, when he was at Intel, did a lot of work on QEMU as well, so you can see other companies participate in developing QEMU because they find it valuable too. I mentioned Huawei for CXL, and we also have someone else who is a reviewer on NVMe. One other thing I've noticed is that we have test teams internally, and of course they're always looking at the next generation of hardware and what's coming. It really helps to have the software teams involved: people build upon open source software, there's testing software available, and the speed at which we can deliver new features keeps increasing as we understand the open source software and work closely with our standards and product planning teams. We see an edge if we can move fast and have an end-to-end solution. And again, we like to do this in public because we're trying to bring more people into these software ecosystems. It's really the more the merrier, so we're quite open about how we do this.
Okay, so now back to CXL. When I started looking at CXL, maybe two years ago, Intel was largely driving it, and I saw PMEM all over the place. I work for Samsung, and we have our announced devices, so I was slightly concerned: I looked at this ecosystem and said, "Hey, this type 3 memory device is not tied to PMEM." If we developed all this software in a way that showed the world CXL as PMEM, I wasn't comfortable with that. So to broaden the way people look at CXL, we basically took what we'd done for NVMe and applied it to CXL. The interesting thing is that Intel actually did the initial QEMU support, which I'll mention in the acknowledgments, but once that emulation is in place, you can start building out features. Yes, it was heavily PMEM-oriented at the time, but we started poking around and said, "Let's push the volatile side as well and make sure it's a first-class citizen alongside PMEM." And CXL, as we were working on this, was in its infancy compared to something like NVMe. NVMe is widely deployed and you can easily get devices; CXL is just emerging. So we thought the only way to really get ahead was to start this open-source development, working on QEMU, and bringing people along to prove out use cases. We tried to focus on simpler, plumbing-level things, just to ensure the basic parts of getting CXL up and running are there.
Now for some of the more CXL-specific details. When you're a programmer looking at CXL memory, I actually like a very simple world, and the way I put it is: it's just memory. That's the simple case. If we could drive latencies down as low as possible, then in an ideal world I would say this is just a new NUMA node. You're already used to an extra hundred nanoseconds of latency, and system software can deal with that. In that world, many things about CXL become simpler, because we already know how to handle it. That's the simple view, and it's what I would like most people to see. But the story is a little more complicated. What do I mean by that? CXL accesses have to be routed down the CXL hierarchy, which is very similar to a PCI hierarchy. I'm not going to get into the more complex cases like PBR routing, because I think we have enough problems on the software side; we're still building out a lot of these use cases. So let's simplify and think of it as a more traditional PCI hierarchy. At the top, at the platform level, there's something called the CXL fixed memory window, and what it does is map to host bridges that have CXL support: it says this reservation of memory, this particular host physical address range, will be routed to this particular host bridge. That's the first component that has to be programmed, and it's programmed by platform firmware. It's more of a reservation, I would say: this range has to be reserved for any CXL memory to be routed through. Then we go down to the host bridge level.
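To make the "it's just a new NUMA node" view concrete: once CXL memory has been onlined, standard NUMA tooling applies. This is a sketch; the node number is an assumption that depends on your topology.

```
# Inspect the topology: CXL memory typically shows up as a CPU-less,
# memory-only NUMA node with a higher reported distance.
numactl --hardware

# Bind an application's allocations to that node
# (node 1 and ./my_app are placeholders).
numactl --membind=1 ./my_app
```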
At the host bridge level we have HDM decoders; HDM is host-managed device memory. You take host address space that is advertised in the fixed memory window and route it down root ports from the host bridge. So at each level of the hierarchy you're routing memory requests, and all of this is basically transparent to an application. I think it's very interesting to think about it this way. When I talk to people about CXL who have been in the storage world, they ask me, "How do I submit the IO?" or "How does the IO go through here?" And I tell them: take a step back and think about accessing DRAM. Are you submitting IO to access DRAM? That step just isn't there anymore. You program these HDM decoders to tell your hardware how to route the memory, but once that's in place, your memory accesses get routed down to these CXL devices purely through hardware. You just have to be aware of the hardware pieces that do need to be programmed. So we keep walking down the hierarchy, and eventually you may have a CXL switch. I have a diagram that I'll walk through on the next slide, but I wanted to introduce these terms first. In QEMU, the switch currently has a single upstream port and can have multiple downstream ports: one up, multiple down. The support is for type 3 memory devices, volatile and persistent. It was initially all persistent regions, and we pushed to get the volatile support in too. So that's a high-level picture of the hardware components that QEMU emulates inside the system.
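For reference, a direct-attached version of this topology can be described on the QEMU command line roughly as follows. This mirrors the shape of the example in QEMU's CXL documentation, but the options have evolved across QEMU versions (volatile-memdev in particular is newer), so treat it as a sketch; guest.img is a placeholder.

```
qemu-system-x86_64 -M q35,cxl=on -m 4G -smp 4 \
    -drive file=guest.img,format=raw,if=virtio \
    -object memory-backend-ram,id=vmem0,share=on,size=256M \
    -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
    -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
    -device cxl-type3,bus=root_port13,volatile-memdev=vmem0,id=cxl-vmem0 \
    -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G
```

Reading bottom to top against the hierarchy in the talk: the cxl-fmw option is the fixed memory window targeting the host bridge, pxb-cxl is the CXL host bridge, cxl-rp is a root port on it, and cxl-type3 is the volatile memory expander backed by the RAM object.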
This diagram is probably too small to read, so I'll just walk through it; what it does is put a physical topology to the terms we had before, and I think that's the most important thing. At the top you have the fixed memory windows, which are just reserved address space. As you can see, a fixed memory window could be for a single host bridge, interleaved across multiple host bridges, or for another host bridge. This is handled in your BIOS layer; the OS is not touching these, it consumes them through ACPI. Now, once an address accessed by the host matches a window, the access goes down to a host bridge, and this is where software programs a decoder; at this level, system software is what programs the decoders. Another thing about CXL that can be challenging for people to understand is the responsibility for programming these HDM decoders, because system firmware can also program them. In CXL 1.1, they are always programmed by the firmware; 2.0 supports hot add, so if you hot add some CXL memory, you have to program the decoders that match the device. This always causes confusion, because the responsibility is not defined in the standard. The standard provides the mechanisms and says what has to be there, but the responsibility is left to the implementer. Intel had a pretty good device driver writer's guide that many people follow, and that became something of a de facto standard; I know we've talked about updating it, and that would probably be very valuable moving forward. So as an access comes down, it eventually goes to a root port and then to a device.
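Inside the guest, the OS-side decoder programming described above is typically driven through the kernel's CXL region support using the cxl and daxctl tools from the ndctl project. A minimal sketch, assuming a single memdev (mem0) and a single root decoder (decoder0.0); names and flags depend on your topology and tool version.

```
# Enumerate CXL memory devices and the root decoders.
cxl list -M
cxl list -b ACPI.CXL -D

# Create a 1-way volatile region; this is what programs the HDM
# decoders at each level of the hierarchy down to the device.
cxl create-region -d decoder0.0 -w 1 -m mem0 -t ram

# Online the resulting dax device as system RAM (a new NUMA node).
daxctl reconfigure-device --mode=system-ram dax0.0
```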
Even the device has HDM decoders, and these map host physical address space to device physical address space: the device advertises some physical address space, but it needs to be mapped into the host's. So that's walking through the topology for a direct-attached device. And again, this is all within QEMU: you have the ability to add host bridges, root ports on those host bridges, and devices, and map it all. In addition, there's also support for a CXL switch. I think this is very interesting at the moment because, as Kevin was discussing, fabric management is one of the next things to do. It's not fully complete, and more and more specification changes related to fabric management keep being added. From the QEMU perspective, Jonathan Cameron from Huawei has been really driving this part and has been very forward-thinking in adding this functionality. It's not completely there yet; some pieces are missing. But I actually like his model of blueprinting a skeleton and then putting it on the community to flesh it out as the features are really being used. So it's very flexible, and there's more to do. One example: it would be very interesting if you could connect one of these switches to multiple QEMU instances and start looking at multi-host interaction with the CXL memory. I haven't seen anyone show this publicly, although I've heard many people talk about it; several groups have said, hey, let's hook multiple QEMU instances to one of these switches to see what it would look like to assign memory dynamically to different hosts.
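The emulated switch plugs into the same kind of command line: an upstream port sits below a root port, and downstream ports hang off the upstream port, with type 3 devices below those. A sketch following the shape of QEMU's CXL examples (IDs, chassis, and slot numbers are arbitrary placeholders):

```
    -device cxl-rp,port=0,bus=cxl.1,id=root_port0,chassis=0,slot=0 \
    -device cxl-upstream,bus=root_port0,id=us0 \
    -device cxl-downstream,port=0,bus=us0,id=swport0,chassis=0,slot=4 \
    -device cxl-downstream,port=1,bus=us0,id=swport1,chassis=0,slot=5 \
    -device cxl-type3,bus=swport0,volatile-memdev=vmem0,id=cxl-vmem0
```

Note the one-up, multiple-down shape matches what QEMU currently supports: a single cxl-upstream per switch, with as many cxl-downstream ports as you declare.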
Okay, so let me go through this. And then it works; I think that's the main takeaway here. If you look at these pictures later: one of the members on our team, Fan, created a tool that helps you out, and we have blog posts that show this off as well; they're linked in these slides. You give it a set of options and it will build QEMU, build a compatible kernel, and run QEMU for you, and then you can SSH into the QEMU instance. Here he's having it install all the modules that are needed, because you need the kernel support. That's one of the main reasons we do this: we also add kernel support, and the development cycle is much faster when you rely on QEMU for new features. You add a feature in QEMU and then you add the corresponding kernel support. What he's showing here is actually a bleeding-edge feature called dynamic capacity devices, which is coming to be seen as the de facto way of hot adding and hot removing CXL memory. Per the specification you can pull devices, but I get the sense that CPU vendors are not that interested in changing HDM decoders once the system is up, so I feel a push towards dynamic capacity devices as the solution for hot add and hot remove of memory. Basically, the device reserves a large amount of HDM to cover the available capacity of the dynamic capacity device, but you can add and remove extents, so you can back some of that reserved memory with device memory on the fly. There are mechanisms for all of this, and we've been prototyping the support in QEMU while Intel has been doing the kernel support; we go back and forth and work together.
Here are some high-level features that you can emulate. Events: CXL, on the .io path, has many events for things like error logging; it's a management interface for that path. There's firmware update, get timestamp, the logs, identify, sanitize, DCD support, and basic support for a switch. All of this you can emulate in QEMU. One thing that has come in handy recently is MCTP support. This is not fully in mainline QEMU yet, and we have to work on getting it merged upstream, but there are publicly available patches with MCTP support, so you can start looking at out-of-band management of CXL devices. I'll highlight Jonathan's git tree here; when we distribute the slides afterwards, it will be very helpful for people to go check some of these links. He has all the latest bleeding-edge support, and there's work to be done in some of these cases, but we work very closely together in public: we know which features are coming up and we arrange patches accordingly, and it's all done in the open, which is quite nice too. So let's go on to the next one.
One thing I want to highlight from the Samsung side: Samsung has been looking at combined CXL/NVMe devices very early, in some shape or form, and you may have seen some of their demos. From an emulation standpoint, this has been there for a long time, and from my perspective it's definitely a prototype, because it's truly an NVMe device that exposes CXL HDM ranges. So it's handled like an NVMe device; it just happens to have, and work with, HDM decoders. What this emulated device does is map memory accesses within a given range directly to LBAs, so you have a dual interface to the SSD. It lets you think about what you might do if you could have both interfaces. This is available; reach out to one of my colleagues here if you have any questions.
I also want to highlight a couple of blog posts. For getting started: all the screenshots I showed were done with some of Fan's work, and we have a blog post up that we put together specifically for this audience, to help people get started playing around with the CXL devices. He's also been working on dynamic capacity devices, and that one is very bleeding edge, so I don't recommend it as your first look, but it does give you a sense of where the software is headed and which features the community as a whole seems to be coalescing around. One other thing we use is Discord, and we use it to coordinate, even publicly: we have public channels where you can ask us questions about the software. We're a pretty open group, and I welcome people to join.
My last slide is acknowledgments. Ben Widawsky was at Intel when he first started the QEMU CXL emulation; very early on he moved to Google, but he was originally at Intel. Jonathan Cameron from Huawei has been really pushing forward both QEMU and the kernel. Ira has been a kernel developer as well as working on QEMU. Gregory Price from MemVerge has been involved in the QEMU work. Fan, Davidlohr, Tong, and many people at Samsung are leveraging QEMU for building software, building it out early to figure out what works and what doesn't. There are many others involved too; I can't give kudos to everybody, and I apologize if I missed anyone. That wraps up my talk, so if there are any questions, let me know.
Very nice. Couple of questions. One, has there ever been a multi-QEMU connected thing ever built for anything else, a shared something between QEMUs?
Not that I'm aware of. Do you know?
Actually, ivshmem. Ivshmem, ah.
It appeared about 10 years ago at the University of Alberta. But there's also... well, and then also, you can do things like... so not with the full emulation, but you can have a VM that's got... oh, sorry.
Yeah, it's straightforward to do VMs with emulated PMEM devices, which can be converted to DAX devices, and then they can be onlined as system RAM if that's what you want to do. And it's straightforward to make those PMEM devices be backed by a file or a DAX device or something. So I can have multiple VMs that each have a DAX device, starting as PMEM, converted to DAX, that maps to the same file. That's what we're doing famfs development on, mostly, although we do have a shared memory setup too.
But no switch. The switch is the key missing piece there that we want.
Yes, it is. We want that, too.
It sounds like University of Alberta might be that.
Interesting.
Yeah. Well, so that's old work. Cam Macdonell, the guy who did it, may be faculty somewhere now; he was a grad student when he did it. But it is in mainline QEMU.
Hi. I was wondering... I've been hearing about the memory-semantic SSD for, I don't know, five or ten years; it's been a long time. Have there ever been any samples that we can play with? I mean, it sounds great, and I'd like to try it. I have PMEM myself, so I'm not exactly hurting for fast storage, but it would be nice to try it because it's been talked about so much. I was wondering if some might be available.
I can say there have been public demos more recently. But in general, I would say connect with me offline and I can put you in touch with people. Pretty much, I'm much more focused on the open software at this point, but if you want to get in contact with people, I can make that happen.
OK. Thank you.
Hi. To what extent do you think the academic teaching and research community is aware of this and kind of teaching their advanced undergrad students and early graduate students to look at this sort of stuff?
I see systems groups doing it. I can go back to NVMe: there was a project called FEMU, a fast QEMU emulator for SSDs, and I believe that one was from the University of Chicago. And we recently got an email, I can't remember from which university, from a group that had noticed the open work we were doing; they asked, hey, do you have any tips or things we could look at? So we are aware that some academic circles are watching the open development. It's getting out there in small doses, but I do see it happening.
Hi, Adam. I was looking at the GitLab link that you have in there. And one of the questions I have is, is there a way to figure out which CXL features are already available out there in the community? I'm interested in simulating GPF.
Yeah, I get it. I would say that's a pain point right now. What do I notice? Most developers focus much more on the development itself and less on keeping track of what's happening. What I can offer: if you connect with me, we do keep track of these things for the community internally, and we share externally as well. It's been on our to-do list to put this tracking up externally, but we just haven't gotten to it. It's all public information, though. So the community as a whole could be better at this, but we do do it for ourselves.
OK, thanks. I'll reach out or even ask in the community.
Thank you. Great. Thank you. Let's thank Adam. Thank you.