All right, so my name's Adam Manzanares, and I work at Samsung as part of what is called the Global Open Ecosystem Team. Our main charter is to work closely with standards bodies, product development teams, and software communities to enable ecosystems for emerging hardware. And as we all know, CXL is exactly that kind of emerging hardware. What I want to do in this talk is get people to understand QEMU. If you haven't heard of it, it's a very useful project in my opinion: it lets you start emulating hardware before the hardware is widely available, and I'll go through why I think that's so valuable. I'll also give a rundown of what we currently have for CXL, as well as a high-level view of where we're headed.
So if you have never heard of QEMU, I think it's worth taking a look at. I'm talking from the developer perspective here, and I look at CXL primarily from the type 3 memory expander point of view. Samsung has announced hardware plans around this, and that's driving a lot of the initial effort. Now, I know we're talking about storage use cases and moving further, and I have a little preview of that in here too. But there's a lot of shared management between the two: if you looked at the earlier slides, there's CXL.io, and that can be common between the storage case and the memory case. It's not quite a sideband, but a management interface you can talk to over a mailbox, which we'll get into a little later. So let's start with QEMU. What is QEMU? It's an open source, generic emulator and virtualizer. One interesting thing you can do is run, say, an Arm emulator on an x86 system and execute Arm architecture code under QEMU. That becomes very handy from an ecosystem perspective. For example, one key player we see in the open source community is from Huawei, and he's mainly interested in Arm, so almost all of his development is done on Arm. I primarily work on x86-based systems, but occasionally, if he's ahead on something, I'll emulate Arm on an x86 system to work with the patches he cares about. Much of the infrastructure is the same, and while some developers prefer their own architecture, you can go back and forth using this emulation. QEMU also has extensions for hardware virtualization using KVM, so when the guest and host architectures match, you get speedy virtualization as well.
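As an illustration of that Arm-on-x86 workflow, cross-architecture emulation is mostly a matter of picking the right QEMU binary. This is a sketch: the kernel image and disk paths are placeholders, and exact options vary by distribution and QEMU version.

```
# Boot an Arm64 guest on an x86 host (pure emulation, no KVM).
# "Image" and "rootfs.img" are placeholder guest artifacts.
qemu-system-aarch64 \
    -M virt -cpu max -m 4G -smp 4 \
    -kernel Image \
    -drive file=rootfs.img,format=raw,if=virtio \
    -append "root=/dev/vda console=ttyAMA0" \
    -nographic

# When guest and host architectures match, use KVM acceleration instead:
#   qemu-system-x86_64 -M q35 -accel kvm ...
```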
And in my experience, it's just a very valuable tool for blueprinting software, and really that's what our work boils down to: we want to make sure there's a blueprint for people to develop software, and that the software moves as fast as the hardware, or even faster if possible. So I just want people to start thinking of this as a possibility: you can emulate these devices today if you can't get your hands on them.
I want to bring it back to the Storage Developer Conference, too. There is this push for CXL for storage, and we're still discovering what that means, so let's look at QEMU from an NVMe perspective for a moment. This talk is about CXL, but we learned a lot from working with NVMe and QEMU together. Again, this is a way for you to blueprint and discover software issues early and feed them back into the standard; the earlier you can do that, the more valuable it is, in my opinion. Some of the features we rapidly prototyped end to end, including connecting into applications, were ZNS, FDP, Simple Copy, and SR-IOV; we'd connect each into the best representative use case and then demonstrate it. All of the software is open source, so anybody can reproduce the results, and it gets a lot easier when you can package QEMU and ship the virtualized hardware along with the tests. Now, QEMU is not generally high performance, especially when you're emulating a different architecture, so you're not really measuring performance; it's about blueprinting and architecting your software and making sure it all works together. That's our main mission: blueprinting these features in an open manner to get other people involved. The more people that participate in the open source software, the higher quality the solution is. We're ecosystem builders; the standard is part of it, and we believe the system software should go right along with it. And in my opinion, Samsung has been quite good at this; we've had many successes by making software a priority.
For QEMU, for example, Klaus, on the same team I work on, is actually the maintainer of the NVMe support in QEMU. Keith Busch, when he was at Intel, did a lot of work on QEMU as well, so you can see other companies participate in developing QEMU because they find it valuable too. I mentioned Huawei for CXL, and we also have someone else who is a reviewer on NVMe. One other thing I've noticed is that we have test teams internally, and of course they're always looking at the next generation of hardware and what's coming. It really helps to have the software teams involved: people build upon open source software, there's testing software available, and the speed at which we can deliver new features keeps increasing as we understand the open source software and work closely with our standards and product planning teams. We see an edge if we can move fast and have an end-to-end solution. And again, we like to do this in public because we're trying to bring more people into these software ecosystems. It's really the more the merrier, so we're quite open about how we do this.
Okay, so now back to CXL. When I started looking at CXL, maybe two years ago, Intel was largely driving it, and I saw PMEM all over the place. I work for Samsung, and we have our announced devices, so I was slightly concerned: I looked at this ecosystem and said, "Hey, this type 3 memory device is not tied to PMEM." If we developed all this software in a way that showed the world CXL as PMEM, I wasn't comfortable with that. So to broaden the way people look at CXL, we basically took what we'd done for NVMe and applied it to CXL. The interesting thing is that Intel actually did the initial QEMU support, which I'll mention in the acknowledgments, but once that emulation is in place, you can start building out features. Yes, it was heavily PMEM-oriented at the time, but we started poking around and said, "Let's push the volatile side as well and make sure it's a first-class citizen alongside PMEM." And CXL, as we were working on this, was in its infancy compared to something like NVMe. NVMe is widely deployed and you can easily get devices; CXL is just emerging. So we thought the only way to really get ahead was to start this open-source development, working on QEMU, and bringing people along to prove out use cases. We tried to focus on simpler, plumbing-level things, just to ensure the basic parts of getting CXL up and running are there.
Now for some of the more CXL-specific details. When you're a programmer looking at CXL memory, I actually like a very simple world, and the way I put it is: it's just memory. That's the simple case. If we could drive latencies down as low as possible, then in an ideal world I would say this is just a new NUMA node. You're already used to an extra hundred nanoseconds of latency, and system software can deal with that. In that world, many things about CXL become simpler, because we already know how to handle it. That's the simple view, and it's what I would like most people to see. But the story is a little more complicated. What do I mean by that? CXL accesses have to be routed down the CXL hierarchy, which is very similar to a PCI hierarchy. I'm not going to get into the more complex cases like PBR routing, because I think we have enough problems on the software side; we're still building out a lot of these use cases. So let's simplify and think of it as a more traditional PCI hierarchy. At the top, at the platform level, there's something called the CXL fixed memory window, and what it does is map to host bridges that have CXL support: it says this reservation of memory, this particular host physical address range, will be routed to this particular host bridge. That's the first component that has to be programmed, and it's programmed by platform firmware. It's more of a reservation, I would say: this range has to be reserved for any CXL memory to be routed through. Then we go down to the host bridge level.
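To make the "it's just a new NUMA node" view concrete: once CXL memory has been onlined, standard NUMA tooling applies. This is a sketch; the node number is an assumption that depends on your topology.

```
# Inspect the topology: CXL memory typically shows up as a CPU-less,
# memory-only NUMA node with a higher reported distance.
numactl --hardware

# Bind an application's allocations to that node
# (node 1 and ./my_app are placeholders).
numactl --membind=1 ./my_app
```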
At the host bridge level we have HDM decoders; HDM is host-managed device memory. You take host address space that is advertised in the fixed memory window and route it down root ports from the host bridge. So at each level of the hierarchy you're routing memory requests, and all of this is basically transparent to an application. I think it's very interesting to think about it this way. When I talk to people about CXL who have been in the storage world, they ask me, "How do I submit the IO?" or "How does the IO go through here?" And I tell them: take a step back and think about accessing DRAM. Are you submitting IO to access DRAM? That step just isn't there anymore. You program these HDM decoders to tell your hardware how to route the memory, but once that's in place, your memory accesses get routed down to these CXL devices purely through hardware. You just have to be aware of the hardware pieces that do need to be programmed. So we keep walking down the hierarchy, and eventually you may have a CXL switch. I have a diagram that I'll walk through on the next slide, but I wanted to introduce these terms first. In QEMU, the switch currently has a single upstream port and can have multiple downstream ports: one up, multiple down. The support is for type 3 memory devices, volatile and persistent. It was initially all persistent regions, and we pushed to get the volatile support in too. So that's a high-level picture of the hardware components that QEMU emulates inside the system.
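For reference, a direct-attached version of this topology can be described on the QEMU command line roughly as follows. This mirrors the shape of the example in QEMU's CXL documentation, but the options have evolved across QEMU versions (volatile-memdev in particular is newer), so treat it as a sketch; guest.img is a placeholder.

```
qemu-system-x86_64 -M q35,cxl=on -m 4G -smp 4 \
    -drive file=guest.img,format=raw,if=virtio \
    -object memory-backend-ram,id=vmem0,share=on,size=256M \
    -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
    -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
    -device cxl-type3,bus=root_port13,volatile-memdev=vmem0,id=cxl-vmem0 \
    -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G
```

Reading bottom to top against the hierarchy in the talk: the cxl-fmw option is the fixed memory window targeting the host bridge, pxb-cxl is the CXL host bridge, cxl-rp is a root port on it, and cxl-type3 is the volatile memory expander backed by the RAM object.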
This diagram is probably too small to read, so I'll just walk through it; what it does is put a physical topology to the terms we had before, and I think that's the most important thing. At the top you have the fixed memory windows, which are just reserved address space. As you can see, a fixed memory window could be for a single host bridge, interleaved across multiple host bridges, or for another host bridge. This is handled in your BIOS layer; the OS is not touching these, it consumes them through ACPI. Now, once an address accessed by the host matches a window, the access goes down to a host bridge, and this is where software programs a decoder; at this level, system software is what programs the decoders. Another thing about CXL that can be challenging for people to understand is the responsibility for programming these HDM decoders, because system firmware can also program them. In CXL 1.1, they are always programmed by the firmware; 2.0 supports hot add, so if you hot add some CXL memory, you have to program the decoders that match the device. This always causes confusion, because the responsibility is not defined in the standard. The standard provides the mechanisms and says what has to be there, but the responsibility is left to the implementer. Intel had a pretty good device driver writer's guide that many people follow, and that became something of a de facto standard; I know we've talked about updating it, and that would probably be very valuable moving forward. So as an access comes down, it eventually goes to a root port and then to a device.
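Inside the guest, the OS-side decoder programming described above is typically driven through the kernel's CXL region support using the cxl and daxctl tools from the ndctl project. A minimal sketch, assuming a single memdev (mem0) and a single root decoder (decoder0.0); names and flags depend on your topology and tool version.

```
# Enumerate CXL memory devices and the root decoders.
cxl list -M
cxl list -b ACPI.CXL -D

# Create a 1-way volatile region; this is what programs the HDM
# decoders at each level of the hierarchy down to the device.
cxl create-region -d decoder0.0 -w 1 -m mem0 -t ram

# Online the resulting dax device as system RAM (a new NUMA node).
daxctl reconfigure-device --mode=system-ram dax0.0
```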
Even the device has HDM decoders, and these map host physical address space to device physical address space: the device advertises some physical address space, but it needs to be mapped into the host's. So that's walking through the topology for a direct-attached device. And again, this is all within QEMU: you have the ability to add host bridges, root ports on those host bridges, and devices, and map it all. In addition, there's also support for a CXL switch. I think this is very interesting at the moment because, as Kevin was discussing, fabric management is one of the next things to do. It's not fully complete, and more and more specification changes related to fabric management keep being added. From the QEMU perspective, Jonathan Cameron from Huawei has been really driving this part and has been very forward-thinking in adding this functionality. It's not completely there yet; some pieces are missing. But I actually like his model of blueprinting a skeleton and then putting it on the community to flesh it out as the features are really being used. So it's very flexible, and there's more to do. One example: it would be very interesting if you could connect one of these switches to multiple QEMU instances and start looking at multi-host interaction with the CXL memory. I haven't seen anyone show this publicly, although I've heard many people talk about it; several groups have said, hey, let's hook multiple QEMU instances to one of these switches to see what it would look like to assign memory dynamically to different hosts.
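The emulated switch plugs into the same kind of command line: an upstream port sits below a root port, and downstream ports hang off the upstream port, with type 3 devices below those. A sketch following the shape of QEMU's CXL examples (IDs, chassis, and slot numbers are arbitrary placeholders):

```
    -device cxl-rp,port=0,bus=cxl.1,id=root_port0,chassis=0,slot=0 \
    -device cxl-upstream,bus=root_port0,id=us0 \
    -device cxl-downstream,port=0,bus=us0,id=swport0,chassis=0,slot=4 \
    -device cxl-downstream,port=1,bus=us0,id=swport1,chassis=0,slot=5 \
    -device cxl-type3,bus=swport0,volatile-memdev=vmem0,id=cxl-vmem0
```

Note the one-up, multiple-down shape matches what QEMU currently supports: a single cxl-upstream per switch, with as many cxl-downstream ports as you declare.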
Okay, so let me go through this. And then it works; I think that's the main takeaway here. If you look at these pictures later: one of the members on our team, Fan, created a tool that helps you out, and we have blog posts that show this off as well; they're linked in these slides. You give it a set of options and it will build QEMU, build a compatible kernel, and run QEMU for you, and then you can SSH into the QEMU instance. Here he's having it install all the modules that are needed, because you need the kernel support. That's one of the main reasons we do this: we also add kernel support, and the development cycle is much faster when you rely on QEMU for new features. You add a feature in QEMU and then you add the corresponding kernel support. What he's showing here is actually a bleeding-edge feature called dynamic capacity devices, which is coming to be seen as the de facto way of hot adding and hot removing CXL memory. Per the specification you can pull devices, but I get the sense that CPU vendors are not that interested in changing HDM decoders once the system is up, so I feel a push towards dynamic capacity devices as the solution for hot add and hot remove of memory. Basically, the device reserves a large amount of HDM to cover the available capacity of the dynamic capacity device, but you can add and remove extents, so you can back some of that reserved memory with device memory on the fly. There are mechanisms for all of this, and we've been prototyping the support in QEMU while Intel has been doing the kernel support; we go back and forth and work together.
Here are some high-level features that you can emulate. Events: CXL, on the .io path, has many events for things like error logging; it's a management interface for that path. There's firmware update, get timestamp, the logs, identify, sanitize, DCD support, and basic support for a switch. All of this you can emulate in QEMU. One thing that has come in handy recently is MCTP support. This is not fully in mainline QEMU yet, and we have to work on getting it merged upstream, but there are publicly available patches with MCTP support, so you can start looking at out-of-band management of CXL devices. I'll highlight Jonathan's git tree here; when we distribute the slides afterwards, it will be very helpful for people to go check some of these links. He has all the latest bleeding-edge support, and there's work to be done in some of these cases, but we work very closely together in public: we know which features are coming up and we arrange patches accordingly, and it's all done in the open, which is quite nice too. So let's go on to the next one.
One thing I want to highlight from the Samsung side: Samsung has been looking at combined CXL/NVMe devices very early, in some shape or form, and you may have seen some of their demos. From an emulation standpoint, this has been there for a long time, and from my perspective it's definitely a prototype, because it's truly an NVMe device that exposes CXL HDM ranges. So it's handled like an NVMe device; it just happens to have, and work with, HDM decoders. What this emulated device does is map memory accesses within a given range directly to LBAs, so you have a dual interface to the SSD. It lets you think about what you might do if you could have both interfaces. This is available; reach out to one of my colleagues here if you have any questions.
I also want to highlight a couple of blog posts. For getting started: all the screenshots I showed were done with some of Fan's work, and we have a blog post up that we put together specifically for this audience, to help people get started playing around with the CXL devices. He's also been working on dynamic capacity devices, and that one is very bleeding edge, so I don't recommend it as your first look, but it does give you a sense of where the software is headed and which features the community as a whole seems to be coalescing around. One other thing we use is Discord, and we use it to coordinate, even publicly: we have public channels where you can ask us questions about the software. We're a pretty open group, and I welcome people to join.
My last slide is acknowledgments. Ben Widawsky was at Intel when he first started the QEMU CXL emulation; very early on he moved to Google, but he was originally at Intel. Jonathan Cameron from Huawei has been really pushing forward both QEMU and the kernel. Ira has been a kernel developer as well as working on QEMU. Gregory Price from MemVerge has been involved in the QEMU work. Fan, Davidlohr, Tong, and many people at Samsung are leveraging QEMU for building software, building it out early to figure out what works and what doesn't. There are many others involved too; I can't give kudos to everybody, and I apologize if I missed anyone. That wraps up my talk, so if there are any questions, let me know.
Very nice. Couple of questions. One, has there ever been a multi-QEMU connected thing ever built for anything else, a shared something between QEMUs?
Not that I'm aware of. Do you know?
Actually, ivshmem. Ivshmem, ah.
It appeared about 10 years ago at the University of Alberta. But there's also... well, and then also, you can do things like... so not with the full emulation, but you can have a VM that's got... oh, sorry.
Yeah, it's straightforward to do VMs with emulated PMEM devices, which can be converted to DAX devices, and then they can be onlined as system RAM if that's what you want to do. And it's straightforward to make those PMEM devices be backed by a file or a DAX device or something. So I can have multiple VMs that each have a DAX device, starting as PMEM, converted to DAX, that maps to the same file. That's what we're doing famfs development on, mostly, although we do have a shared memory setup too.
But no switch. The switch is the key missing piece there that we want.
Yes, it is. We want that, too.
It sounds like University of Alberta might be that.
Interesting.
Yeah. Well, so that's old work. Cam Macdonell, the guy who did it, may be faculty somewhere now; he was a grad student when he did it. But it is in mainline QEMU.
Hi. I was wondering... I've been hearing about the memory-semantic SSD for, I don't know, five or ten years; it's been a long time. Have there ever been any samples that we can play with? I mean, it sounds great, and I'd like to try it. I have PMEM myself, so I'm not exactly hurting for fast storage, but it would be nice to try it because it's been talked about so much. I was wondering if some might be available.
I can say there have been public demos more recently. But in general, I would say connect with me offline and I can put you in touch with people. Pretty much, I'm much more focused on the open software at this point, but if you want to get in contact with people, I can make that happen.
OK. Thank you.
Hi. To what extent do you think the academic teaching and research community is aware of this and kind of teaching their advanced undergrad students and early graduate students to look at this sort of stuff?
I see systems groups doing it. I can go back to NVMe: there was a project called FEMU, a fast QEMU emulator for SSDs, and I believe that one was from the University of Chicago. And we recently got an email, I can't remember from which university, from a group that had noticed the open work we were doing; they asked, hey, do you have any tips or things we could look at? So we are aware that some academic circles are watching the open development. It's getting out there in small doses, but I do see it happening.
Hi, Adam. I was looking at the GitLab link that you have in there. And one of the questions I have is, is there a way to figure out which CXL features are already available out there in the community? I'm interested in simulating GPF.
Yeah, I get it. I would say that's a pain point right now. What do I notice? Most developers focus much more on the development itself and less on keeping track of what's happening. What I can offer: if you connect with me, we do keep track of these things for the community internally, and we share externally as well. It's been on our to-do list to put this tracking up externally, but we just haven't gotten to it. It's all public information, though. So the community as a whole could be better at this, but we do do it for ourselves.
OK, thanks. I'll reach out or even ask in the community.
Thank you. Great. Thank you. Let's thank Adam. Thank you.