Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: add option to force guests to use swiotlb in dedicated memory region for all I/O #5108

Draft
wants to merge 32 commits into
base: feature/secret-hiding
Choose a base branch
from

Conversation

roypat
Copy link
Contributor

@roypat roypat commented Mar 24, 2025

Changes

Add a new API field to /machine-config, initial_swiotlb_size, which causes Firecracker to carve out a region of guest memory of the specified size, mark is as dma-restricted in FDT, and force all virtio devices to use this memory for their vrings and buffers.

Reason

With secret hiding, I/O needs to be handled by swiotlb, as we cannot do I/O into guest memory that is direct map removed

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

  • I have read and understand CONTRIBUTING.md.
  • I have run tools/devtool checkstyle to verify that the PR passes the
    automated style checks.
  • I have described what is done in these changes, why they are needed, and
    how they are solving the problem in a clear and encompassing way.
  • I have updated any relevant documentation (both in code and in the docs)
    in the PR.
  • I have mentioned all user-facing changes in CHANGELOG.md.
  • If a specific issue led to this PR, this PR closes the issue.
  • When making API changes, I have followed the
    Runbook for Firecracker API changes.
  • I have tested all new and changed functionalities in unit tests and/or
    integration tests.
  • I have linked an issue to every new TODO.

  • This functionality cannot be added in rust-vmm.

@roypat roypat force-pushed the secret-freedom-v4 branch 14 times, most recently from 8fe8462 to 75f4ee1 Compare March 26, 2025 15:41
@roypat roypat force-pushed the feature/secret-hiding branch from 5573725 to ad06919 Compare March 27, 2025 07:40
@roypat roypat force-pushed the secret-freedom-v4 branch 4 times, most recently from 9462514 to 39cce06 Compare March 27, 2025 16:03
The explicit calls to GuestMemoryMmap::create ended up being touched in
like 5 of the following commits, so let's reduce the churn by putting it
into a function.

Signed-off-by: Patrick Roy <[email protected]>
@roypat roypat force-pushed the secret-freedom-v4 branch from 39cce06 to 2168468 Compare March 27, 2025 16:20
Have all of them operate on some sort of list of (GuestAddress, usize)
tuples, for consistency.

Signed-off-by: Patrick Roy <[email protected]>
@roypat roypat force-pushed the secret-freedom-v4 branch 7 times, most recently from e11ee8e to 7635ee8 Compare March 28, 2025 14:01
Copy link

codecov bot commented Mar 28, 2025

Codecov Report

Attention: Patch coverage is 75.78947% with 161 lines in your changes missing coverage. Please review.

Project coverage is 82.97%. Comparing base (ad06919) to head (a4e86a0).
Report is 5 commits behind head on feature/secret-hiding.

Files with missing lines Patch % Lines
src/vmm/src/vstate/vm.rs 76.87% 40 Missing ⚠️
src/vmm/src/builder.rs 63.63% 28 Missing ⚠️
src/vmm/src/vstate/memory.rs 88.53% 18 Missing ⚠️
src/vmm/src/arch/aarch64/fdt.rs 50.00% 14 Missing ⚠️
src/vmm/src/resources.rs 77.35% 12 Missing ⚠️
src/vmm/src/arch/x86_64/mod.rs 66.66% 9 Missing ⚠️
src/vmm/src/persist.rs 83.78% 6 Missing ⚠️
src/vmm/src/arch/aarch64/mod.rs 76.19% 5 Missing ⚠️
src/vmm/src/devices/virtio/block/device.rs 0.00% 5 Missing ⚠️
src/vmm/src/devices/virtio/queue.rs 82.60% 4 Missing ⚠️
... and 8 more
Additional details and impacted files
@@                    Coverage Diff                    @@
##           feature/secret-hiding    #5108      +/-   ##
=========================================================
- Coverage                  83.06%   82.97%   -0.09%     
=========================================================
  Files                        250      250              
  Lines                      26888    27121     +233     
=========================================================
+ Hits                       22334    22504     +170     
- Misses                      4554     4617      +63     
Flag Coverage Δ
5.10-c5n.metal 83.55% <76.50%> (+0.02%) ⬆️
5.10-m5n.metal 83.54% <76.50%> (+0.01%) ⬆️
5.10-m6a.metal 82.74% <76.50%> (+0.02%) ⬆️
5.10-m6g.metal 79.37% <75.28%> (-0.06%) ⬇️
5.10-m6i.metal 83.54% <76.50%> (+0.01%) ⬆️
5.10-m7a.metal-48xl 82.74% <76.50%> (+0.02%) ⬆️
5.10-m7g.metal 79.37% <75.28%> (-0.06%) ⬇️
6.1-c5n.metal 83.59% <76.50%> (+0.02%) ⬆️
6.1-m5n.metal 83.59% <76.50%> (+0.01%) ⬆️
6.1-m6a.metal 82.79% <76.50%> (+0.02%) ⬆️
6.1-m6g.metal 79.37% <75.28%> (-0.06%) ⬇️
6.1-m6i.metal 83.58% <76.50%> (+0.01%) ⬆️
6.1-m7a.metal-48xl 82.79% <76.50%> (+0.02%) ⬆️
6.1-m7g.metal 79.37% <75.28%> (-0.06%) ⬇️
6.14-c5n.metal 83.58% <76.50%> (+0.02%) ⬆️
6.14-m5n.metal 83.59% <76.50%> (+0.02%) ⬆️
6.14-m6a.metal 82.79% <76.50%> (+0.02%) ⬆️
6.14-m6g.metal 79.36% <75.28%> (-0.06%) ⬇️
6.14-m6i.metal 83.58% <76.50%> (+0.02%) ⬆️
6.14-m7a.metal-48xl 82.79% <76.50%> (+0.02%) ⬆️
6.14-m7g.metal 79.36% <75.28%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

roypat added 2 commits March 28, 2025 15:30
Avoid some .map_err() calls using #[from] and the auto-From::from
calling of the question mark operator.

Signed-off-by: Patrick Roy <[email protected]>
When taking a diff snapshot, Firecracker resets the dirty bitmap of each
guest memory region if the region itself was successfully dumped to the
snapshot memory file. This is wrong, because a later region might fail,
and then some regions will have their dirty bitmap reset, yet no
snapshot was successfully taken, meaning a failure during diff snapshot
taking will forever corrupt the dirty bitmap.

Fix this by only reset the dirty bitmap of all regions after (and if!)
all regions were successfully dumped to the file.

Signed-off-by: Patrick Roy <[email protected]>
@roypat roypat force-pushed the secret-freedom-v4 branch 3 times, most recently from 9aa4b62 to 9febf6c Compare March 28, 2025 16:47
roypat added 7 commits March 28, 2025 16:59
Otherwise it's impossible to figure out what precisely went wrong during
queue restoration.

Signed-off-by: Patrick Roy <[email protected]>
This allows having regions that start somewhere other than guest
physical address 0. Useful in case not all regions are allocated at
once, and each batch needs to be placed after all already allocated
regions.

Since the resulting code has some intricacies (and I actually got it
wrong the first time around), write a kani harness to validate that
we're computing the regions correctly.

Signed-off-by: Patrick Roy <[email protected]>
This way the non-snapshot path doesnt have to deal with passing in
`None`.

Signed-off-by: Patrick Roy <[email protected]>
Only enabling it when a balloon device is present doesn't really gain us
anything, because the only time this feature is checked in the kernel is
as part of madvise handling - but without balloon devices we wont ever
call madvise, so it makes no difference whether we enabled the feature
or not (it is unconditionally available since linux 4.10).

Signed-off-by: Patrick Roy <[email protected]>
Being able to use the new GuestRegionCollection will make things a lot
easier.

Signed-off-by: Patrick Roy <[email protected]>
Use a special KvmRegion type, which allows us to manage
kvm_userspace_memory_region2 inside of GuestRegionCollection.

This means we can stop using .zip(0u32..) to determine kvm slot indices,
which doesnt work anymore when multiple GuestMemoryMmaps are around,
which have no order imposed on them wrt which one is initialized first.

Signed-off-by: Patrick Roy <[email protected]>
Add a mem_config parameter to the PUT/PATCH /machine-config endpoints,
which allows configuration of the size of a swiotlb region to use.

Signed-off-by: Patrick Roy <[email protected]>
@roypat roypat force-pushed the secret-freedom-v4 branch 3 times, most recently from 936dcad to 0a21c54 Compare March 28, 2025 17:13
roypat added 5 commits March 28, 2025 17:15
Add a special `GuestMemoryMmap` to `struct Vm` that has the semantics
that if its not empty, it represents the swiotlb area in the guest.

Have all virtio devices only get access to swiotlb regions if they
exist, only falling back onto having access to all of guest memory
otherwise. This means that in case of swiotlb regions being set up,
virtio devices cannot access generic memory, so if the guest decides to
(incorrectly) place I/O buffers outside of swiotlb we'll get a somewhat
meaningful error of "invalid guest address" (or similar), instead of
random things failing further down the line in case generic memory isn't
accessible by firecracker/the host kernel for some reason (e.g. if its
guest_memfd backed in the future).

When a VM with swiotlb regions is being snapshotted, the regions are
simply concatenated to the end of the snapshot memory file. This changes
the snapshot memory file format in the sense that regions in the file
are not necessarily sorted by guest address anymore.

Signed-off-by: Patrick Roy <[email protected]>
If swiotlb is requested through the API, this function will allocate a
single contiguous region intended for swiotlb use. It needs to be
continguous because we must be able to describe it using a single memory
range in FDT, as devices cannot be assigned to reserved memory
consisting of multiple regions (e.g. FDT assignment is to memory
regions, not to #reserved-memory nodes).

While we're at it, always use anon memory for the "normal" part of guest
memory if swiotlb is enabled, as if swiotlb is enabled we know that
vhost devices will never need to access that part of memory.

Signed-off-by: Patrick Roy <[email protected]>
Allocate the swiotlb region if requested via the /machine-config API
endpoint, and register it to the VM/KVM.

Signed-off-by: Patrick Roy <[email protected]>
describe the swiotlb region as a restricted-mem node in the flattened
device tree, and force all virtio devices to use it for their vrings and
buffers by setting their `memory-region` property.

Signed-off-by: Patrick Roy <[email protected]>
Have all virtio devices off VIRTIO_F_PLATFORM_ACCESS, which communicates
to the guest that DMA APIs should be used for allocating vrings and
virtio buffers. Since we have created a dedicated swiotlb region in the
geust and assigned all virtio devices to this, it means that all such
allocations will go through the swiotlb region.

Signed-off-by: Patrick Roy <[email protected]>
@roypat roypat force-pushed the secret-freedom-v4 branch from 0a21c54 to 3e042c6 Compare March 28, 2025 17:16
roypat added 7 commits March 28, 2025 17:22
Support restoring VMs with swiotlb regions. For this, untangle the uffd
handshake from the actual restoration of memory regions, as we first
need to restore and register to the Vm _all_ memory regions, before we
can then send a single handshake containing both normal and swiotlb
regions to the Uffd handler.

While we're here, significantly simplify the jungle of error types.

Signed-off-by: Patrick Roy <[email protected]>
This will make it easier to validate in an integration test that virt
queues are actually placed in the swiotlb region, as a sort of sanity
check that on Firecracker's side the swiotlb regions are actually passed
to the virtio devices, and that the guest is honoring the request to
place virtio stuff into the swiotlb regions.

Signed-off-by: Patrick Roy <[email protected]>
This is needed for the guest to correctly parse the restricted-memory
nodes that firecracker now puts into FDT.

Signed-off-by: Patrick Roy <[email protected]>
these contain 6.1 guest kernels that have CONFIG_DMA_RESTRICTED_MEM
enabled and can thus parse swiotlb region descriptions from FDT.

Signed-off-by: Patrick Roy <[email protected]>
If the CI artifacts dont contain old firecracker releases, still succeed
at setting downloading them.

Signed-off-by: Patrick Roy <[email protected]>
On aarch64, additionally parametrize our throughput performance tests
(network, block and vsock) by swiotlb size, to get a feeling for the
performance impact bounce buffering will have. Try out different sizes
of swiotlb region to see whether we can identify a cut off.

Signed-off-by: Patrick Roy <[email protected]>
Add an integration test on ARM that configures an in-guest swiotlb
region to which all virtio devices should be bound. Asserts the guest
sees the swiotlb region, and also that block and net I/O work
(indirectly through loading of rootfs during boot and by running
commands via SSH). Also asserts that all virtio queue components have
been placed into swiotlb, by parsing firecracker log messages.

Also add a test for snapshotting VMs with swiotlb regions, which just
checks the network device after restoration to sanity check that the
swiotlb region was correctly restored.

Signed-off-by: Patrick Roy <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant