Commit 09d1c6a

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
 "Generic:

   - Use memdup_array_user() to harden against overflow.

   - Unconditionally advertise KVM_CAP_DEVICE_CTRL for all
     architectures.

   - Clean up Kconfigs that all KVM architectures were selecting

   - New functionality around "guest_memfd", a new userspace API that
     creates an anonymous file and returns a file descriptor that refers
     to it. guest_memfd files are bound to their owning virtual machine,
     cannot be mapped, read, or written by userspace, and cannot be
     resized. guest_memfd files do however support PUNCH_HOLE, which can
     be used to switch a memory area between guest_memfd and regular
     anonymous memory.

   - New ioctl KVM_SET_MEMORY_ATTRIBUTES allowing userspace to specify
     per-page attributes for a given page of guest memory; right now the
     only attribute is whether the guest expects to access memory via
     guest_memfd or not, which in Confidential SVMs backed by SEV-SNP,
     TDX or ARM64 pKVM is checked by firmware or hypervisor that
     guarantees confidentiality (AMD PSP, Intel TDX module, or EL2 in
     the case of pKVM).

  x86:

   - Support for "software-protected VMs" that can use the new
     guest_memfd and page attributes infrastructure. This is mostly
     useful for testing, since there is no pKVM-like infrastructure to
     provide a meaningfully reduced TCB.

   - Fix a relatively benign off-by-one error when splitting huge pages
     during CLEAR_DIRTY_LOG.

   - Fix a bug where KVM could incorrectly test-and-clear dirty bits in
     non-leaf TDP MMU SPTEs if a racing thread replaces a huge SPTE with
     a non-huge SPTE.

   - Use more generic lockdep assertions in paths that don't actually
     care about whether the caller is a reader or a writer.

   - let Xen guests opt out of having PV clock reported as "based on a
     stable TSC", because some of them don't expect the "TSC stable" bit
     (added to the pvclock ABI by KVM, but never set by Xen) to be set.

   - Revert a bogus, made-up nested SVM consistency check for
     TLB_CONTROL.

   - Advertise flush-by-ASID support for nSVM unconditionally, as KVM
     always flushes on nested transitions, i.e. always satisfies flush
     requests. This allows running bleeding edge versions of VMware
     Workstation on top of KVM.

   - Sanity check that the CPU supports flush-by-ASID when enabling SEV
     support.

   - On AMD machines with vNMI, always rely on hardware instead of
     intercepting IRET in some cases to detect unmasking of NMIs

   - Support for virtualizing Linear Address Masking (LAM)

   - Fix a variety of vPMU bugs where KVM fail to stop/reset counters
     and other state prior to refreshing the vPMU model.

   - Fix a double-overflow PMU bug by tracking emulated counter events
     using a dedicated field instead of snapshotting the "previous"
     counter. If the hardware PMC count triggers overflow that is
     recognized in the same VM-Exit that KVM manually bumps an event
     count, KVM would pend PMIs for both the hardware-triggered overflow
     and for KVM-triggered overflow.

   - Turn off KVM_WERROR by default for all configs so that it's not
     inadvertantly enabled by non-KVM developers, which can be
     problematic for subsystems that require no regressions for W=1
     builds.

   - Advertise all of the host-supported CPUID bits that enumerate
     IA32_SPEC_CTRL "features".

   - Don't force a masterclock update when a vCPU synchronizes to the
     current TSC generation, as updating the masterclock can cause
     kvmclock's time to "jump" unexpectedly, e.g. when userspace
     hotplugs a pre-created vCPU.

   - Use RIP-relative address to read kvm_rebooting in the VM-Enter
     fault paths, partly as a super minor optimization, but mostly to
     make KVM play nice with position independent executable builds.

   - Guard KVM-on-HyperV's range-based TLB flush hooks with an #ifdef on
     CONFIG_HYPERV as a minor optimization, and to self-document the
     code.

   - Add CONFIG_KVM_HYPERV to allow disabling KVM support for HyperV
     "emulation" at build time.

  ARM64:

   - LPA2 support, adding 52bit IPA/PA capability for 4kB and 16kB base
     granule sizes. Branch shared with the arm64 tree.

   - Large Fine-Grained Trap rework, bringing some sanity to the
     feature, although there is more to come. This comes with a prefix
     branch shared with the arm64 tree.

   - Some additional Nested Virtualization groundwork, mostly
     introducing the NV2 VNCR support and retargetting the NV support to
     that version of the architecture.

   - A small set of vgic fixes and associated cleanups.

  Loongarch:

   - Optimization for memslot hugepage checking

   - Cleanup and fix some HW/SW timer issues

   - Add LSX/LASX (128bit/256bit SIMD) support

  RISC-V:

   - KVM_GET_REG_LIST improvement for vector registers

   - Generate ISA extension reg_list using macros in get-reg-list
     selftest

   - Support for reporting steal time along with selftest

  s390:

   - Bugfixes

  Selftests:

   - Fix an annoying goof where the NX hugepage test prints out garbage
     instead of the magic token needed to run the test.

   - Fix build errors when a header is delete/moved due to a missing
     flag in the Makefile.

   - Detect if KVM bugged/killed a selftest's VM and print out a helpful
     message instead of complaining that a random ioctl() failed.

   - Annotate the guest printf/assert helpers with __printf(), and fix
     the various bugs that were lurking due to lack of said annotation"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (185 commits)
  x86/kvm: Do not try to disable kvmclock if it was not enabled
  KVM: x86: add missing "depends on KVM"
  KVM: fix direction of dependency on MMU notifiers
  KVM: introduce CONFIG_KVM_COMMON
  KVM: arm64: Add missing memory barriers when switching to pKVM's hyp pgd
  KVM: arm64: vgic-its: Avoid potential UAF in LPI translation cache
  RISC-V: KVM: selftests: Add get-reg-list test for STA registers
  RISC-V: KVM: selftests: Add steal_time test support
  RISC-V: KVM: selftests: Add guest_sbi_probe_extension
  RISC-V: KVM: selftests: Move sbi_ecall to processor.c
  RISC-V: KVM: Implement SBI STA extension
  RISC-V: KVM: Add support for SBI STA registers
  RISC-V: KVM: Add support for SBI extension registers
  RISC-V: KVM: Add SBI STA info to vcpu_arch
  RISC-V: KVM: Add steal-update vcpu request
  RISC-V: KVM: Add SBI STA extension skeleton
  RISC-V: paravirt: Implement steal-time support
  RISC-V: Add SBI STA extension definitions
  RISC-V: paravirt: Add skeleton for pv-time support
  RISC-V: KVM: Fix indentation in kvm_riscv_vcpu_set_reg_csr()
  ...

2 parents 1b1934d + 1c6d984 commit 09d1c6a

187 files changed: +7469 -2710 lines changed

Documentation/admin-guide/kernel-parameters.txt

+3 -3
@@ -3996,9 +3996,9 @@
 			vulnerability. System may allow data leaks with this
 			option.

-	no-steal-acc	[X86,PV_OPS,ARM64,PPC/PSERIES] Disable paravirtualized
-			steal time accounting. steal time is computed, but
-			won't influence scheduler behaviour
+	no-steal-acc	[X86,PV_OPS,ARM64,PPC/PSERIES,RISCV] Disable
+			paravirtualized steal time accounting. steal time is
+			computed, but won't influence scheduler behaviour

 	nosync		[HW,M68K] Disables sync negotiation for all devices.

Documentation/virt/kvm/api.rst

+207 -12
@@ -147,10 +147,29 @@ described as 'basic' will be available.
 The new VM has no virtual cpus and no memory.
 You probably want to use 0 as machine type.

+X86:
+^^^^
+
+Supported X86 VM types can be queried via KVM_CAP_VM_TYPES.
+
+S390:
+^^^^^
+
 In order to create user controlled virtual machines on S390, check
 KVM_CAP_S390_UCONTROL and use the flag KVM_VM_S390_UCONTROL as
 privileged user (CAP_SYS_ADMIN).

+MIPS:
+^^^^^
+
+To use hardware assisted virtualization on MIPS (VZ ASE) rather than
+the default trap & emulate implementation (which changes the virtual
+memory layout to fit in user mode), check KVM_CAP_MIPS_VZ and use the
+flag KVM_VM_MIPS_VZ.
+
+ARM64:
+^^^^^^
+
 On arm64, the physical address size for a VM (IPA Size limit) is limited
 to 40bits by default. The limit can be configured if the host supports the
 extension KVM_CAP_ARM_VM_IPA_SIZE. When supported, use
@@ -608,18 +627,6 @@ interrupt number dequeues the interrupt.
 This is an asynchronous vcpu ioctl and can be invoked from any thread.


-4.17 KVM_DEBUG_GUEST
---------------------
-
-:Capability: basic
-:Architectures: none
-:Type: vcpu ioctl
-:Parameters: none)
-:Returns: -1 on error
-
-Support for this has been removed. Use KVM_SET_GUEST_DEBUG instead.
-
-
 4.18 KVM_GET_MSRS
 -----------------

@@ -6192,6 +6199,130 @@ to know what fields can be changed for the system register described by
 ``op0, op1, crn, crm, op2``. KVM rejects ID register values that describe a
 superset of the features supported by the system.

+4.140 KVM_SET_USER_MEMORY_REGION2
+---------------------------------
+
+:Capability: KVM_CAP_USER_MEMORY2
+:Architectures: all
+:Type: vm ioctl
+:Parameters: struct kvm_userspace_memory_region2 (in)
+:Returns: 0 on success, -1 on error
+
+KVM_SET_USER_MEMORY_REGION2 is an extension to KVM_SET_USER_MEMORY_REGION that
+allows mapping guest_memfd memory into a guest. All fields shared with
+KVM_SET_USER_MEMORY_REGION identically. Userspace can set KVM_MEM_GUEST_MEMFD
+in flags to have KVM bind the memory region to a given guest_memfd range of
+[guest_memfd_offset, guest_memfd_offset + memory_size]. The target guest_memfd
+must point at a file created via KVM_CREATE_GUEST_MEMFD on the current VM, and
+the target range must not be bound to any other memory region. All standard
+bounds checks apply (use common sense).
+
+::
+
+  struct kvm_userspace_memory_region2 {
+	__u32 slot;
+	__u32 flags;
+	__u64 guest_phys_addr;
+	__u64 memory_size; /* bytes */
+	__u64 userspace_addr; /* start of the userspace allocated memory */
+	__u64 guest_memfd_offset;
+	__u32 guest_memfd;
+	__u32 pad1;
+	__u64 pad2[14];
+  };
+
+A KVM_MEM_GUEST_MEMFD region _must_ have a valid guest_memfd (private memory) and
+userspace_addr (shared memory). However, "valid" for userspace_addr simply
+means that the address itself must be a legal userspace address. The backing
+mapping for userspace_addr is not required to be valid/populated at the time of
+KVM_SET_USER_MEMORY_REGION2, e.g. shared memory can be lazily mapped/allocated
+on-demand.
+
+When mapping a gfn into the guest, KVM selects shared vs. private, i.e consumes
+userspace_addr vs. guest_memfd, based on the gfn's KVM_MEMORY_ATTRIBUTE_PRIVATE
+state. At VM creation time, all memory is shared, i.e. the PRIVATE attribute
+is '0' for all gfns. Userspace can control whether memory is shared/private by
+toggling KVM_MEMORY_ATTRIBUTE_PRIVATE via KVM_SET_MEMORY_ATTRIBUTES as needed.
+
+4.141 KVM_SET_MEMORY_ATTRIBUTES
+-------------------------------
+
+:Capability: KVM_CAP_MEMORY_ATTRIBUTES
+:Architectures: x86
+:Type: vm ioctl
+:Parameters: struct kvm_memory_attributes (in)
+:Returns: 0 on success, <0 on error
+
+KVM_SET_MEMORY_ATTRIBUTES allows userspace to set memory attributes for a range
+of guest physical memory.
+
+::
+
+  struct kvm_memory_attributes {
+	__u64 address;
+	__u64 size;
+	__u64 attributes;
+	__u64 flags;
+  };
+
+  #define KVM_MEMORY_ATTRIBUTE_PRIVATE (1ULL << 3)
+
+The address and size must be page aligned. The supported attributes can be
+retrieved via ioctl(KVM_CHECK_EXTENSION) on KVM_CAP_MEMORY_ATTRIBUTES. If
+executed on a VM, KVM_CAP_MEMORY_ATTRIBUTES precisely returns the attributes
+supported by that VM. If executed at system scope, KVM_CAP_MEMORY_ATTRIBUTES
+returns all attributes supported by KVM. The only attribute defined at this
+time is KVM_MEMORY_ATTRIBUTE_PRIVATE, which marks the associated gfn as being
+guest private memory.
+
+Note, there is no "get" API. Userspace is responsible for explicitly tracking
+the state of a gfn/page as needed.
+
+The "flags" field is reserved for future extensions and must be '0'.
+
+4.142 KVM_CREATE_GUEST_MEMFD
+----------------------------
+
+:Capability: KVM_CAP_GUEST_MEMFD
+:Architectures: none
+:Type: vm ioctl
+:Parameters: struct kvm_create_guest_memfd(in)
+:Returns: 0 on success, <0 on error
+
+KVM_CREATE_GUEST_MEMFD creates an anonymous file and returns a file descriptor
+that refers to it. guest_memfd files are roughly analogous to files created
+via memfd_create(), e.g. guest_memfd files live in RAM, have volatile storage,
+and are automatically released when the last reference is dropped. Unlike
+"regular" memfd_create() files, guest_memfd files are bound to their owning
+virtual machine (see below), cannot be mapped, read, or written by userspace,
+and cannot be resized (guest_memfd files do however support PUNCH_HOLE).
+
+::
+
+  struct kvm_create_guest_memfd {
+	__u64 size;
+	__u64 flags;
+	__u64 reserved[6];
+  };
+
+Conceptually, the inode backing a guest_memfd file represents physical memory,
+i.e. is coupled to the virtual machine as a thing, not to a "struct kvm". The
+file itself, which is bound to a "struct kvm", is that instance's view of the
+underlying memory, e.g. effectively provides the translation of guest addresses
+to host memory. This allows for use cases where multiple KVM structures are
+used to manage a single virtual machine, e.g. when performing intrahost
+migration of a virtual machine.
+
+KVM currently only supports mapping guest_memfd via KVM_SET_USER_MEMORY_REGION2,
+and more specifically via the guest_memfd and guest_memfd_offset fields in
+"struct kvm_userspace_memory_region2", where guest_memfd_offset is the offset
+into the guest_memfd instance. For a given guest_memfd file, there can be at
+most one mapping per page, i.e. binding multiple memory regions to a single
+guest_memfd range is not allowed (any number of memory regions can be bound to
+a single guest_memfd file, but the bound ranges must not overlap).
+
+See KVM_SET_USER_MEMORY_REGION2 for additional details.
+
 5. The kvm_run structure
 ========================

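A minimal sketch, not part of this commit, of the argument checks the new text describes for KVM_SET_MEMORY_ATTRIBUTES (page-aligned address and size, flags equal to 0, only supported attributes) and of the rule that ranges bound to one guest_memfd must not overlap. The struct and every `sketch_`/`SKETCH_` name are hypothetical local mirrors so the block builds without recent kernel headers; the 4 KiB page size and the half-open reading of the bound range are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of struct kvm_memory_attributes as documented above. */
struct sketch_memory_attributes {
	uint64_t address;
	uint64_t size;
	uint64_t attributes;
	uint64_t flags;
};

/* Same bit as KVM_MEMORY_ATTRIBUTE_PRIVATE in the documentation text. */
#define SKETCH_ATTRIBUTE_PRIVATE (1ULL << 3)
/* Assumption: 4 KiB host pages. */
#define SKETCH_PAGE_SIZE 4096ULL

/* Check the documented constraints before issuing the ioctl. */
static bool sketch_attrs_valid(const struct sketch_memory_attributes *a,
			       uint64_t supported)
{
	assert(a != NULL);

	/* The address and size must be page aligned. */
	if ((a->address | a->size) & (SKETCH_PAGE_SIZE - 1))
		return false;
	/* "flags" is reserved for future extensions and must be 0. */
	if (a->flags != 0)
		return false;
	/* Only attributes reported via KVM_CAP_MEMORY_ATTRIBUTES may be set. */
	return (a->attributes & ~supported) == 0;
}

/*
 * True if two bound ranges of one guest_memfd overlap. Ranges are treated
 * as half-open, [off, off + size), and are assumed not to wrap around.
 */
static bool sketch_gmem_ranges_overlap(uint64_t off_a, uint64_t size_a,
				       uint64_t off_b, uint64_t size_b)
{
	return off_a < off_b + size_b && off_b < off_a + size_a;
}
```

In real code the `supported` mask would come from ioctl(KVM_CHECK_EXTENSION) on KVM_CAP_MEMORY_ATTRIBUTES, and KVM remains the authority on what it accepts; a userspace pre-check like this only makes failures easier to diagnose.
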
@@ -6824,6 +6955,30 @@ array field represents return values. The userspace should update the return
 values of SBI call before resuming the VCPU. For more details on RISC-V SBI
 spec refer, https://github.com/riscv/riscv-sbi-doc.

+::
+
+		/* KVM_EXIT_MEMORY_FAULT */
+		struct {
+  #define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1ULL << 3)
+			__u64 flags;
+			__u64 gpa;
+			__u64 size;
+		} memory_fault;
+
+KVM_EXIT_MEMORY_FAULT indicates the vCPU has encountered a memory fault that
+could not be resolved by KVM. The 'gpa' and 'size' (in bytes) describe the
+guest physical address range [gpa, gpa + size) of the fault. The 'flags' field
+describes properties of the faulting access that are likely pertinent:
+
+ - KVM_MEMORY_EXIT_FLAG_PRIVATE - When set, indicates the memory fault occurred
+   on a private memory access. When clear, indicates the fault occurred on a
+   shared access.
+
+Note! KVM_EXIT_MEMORY_FAULT is unique among all KVM exit reasons in that it
+accompanies a return code of '-1', not '0'! errno will always be set to EFAULT
+or EHWPOISON when KVM exits with KVM_EXIT_MEMORY_FAULT, userspace should assume
+kvm_run.exit_reason is stale/undefined for all other error numbers.
+
 ::

 		/* KVM_EXIT_NOTIFY */
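
A minimal sketch, not part of this commit, of the check the note above implies: kvm_run.memory_fault should be consulted only when KVM_RUN returned -1, errno is EFAULT or EHWPOISON, and exit_reason is KVM_EXIT_MEMORY_FAULT. The `sketch_` names are hypothetical, the numeric value 39 for KVM_EXIT_MEMORY_FAULT is an assumption about the uapi header, and the EHWPOISON fallback is only for builds where errno.h does not define it.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#ifndef EHWPOISON
#define EHWPOISON 133 /* fallback for non-Linux builds; assumed generic Linux value */
#endif

/* Assumption: numeric value of KVM_EXIT_MEMORY_FAULT in <linux/kvm.h>. */
#define SKETCH_EXIT_MEMORY_FAULT 39u
/* Same bit as KVM_MEMORY_EXIT_FLAG_PRIVATE in the documentation text. */
#define SKETCH_EXIT_FLAG_PRIVATE (1ULL << 3)

/* Hypothetical mirror of the kvm_run.memory_fault member. */
struct sketch_memory_fault {
	uint64_t flags;
	uint64_t gpa;
	uint64_t size;
};

/*
 * True only when the fault annotation may be trusted: KVM_RUN failed
 * with -1, errno is EFAULT or EHWPOISON, and the exit reason matches.
 */
static bool sketch_memory_fault_valid(int run_ret, int err,
				      uint32_t exit_reason)
{
	if (run_ret != -1)
		return false;
	if (err != EFAULT && err != EHWPOISON)
		return false;
	return exit_reason == SKETCH_EXIT_MEMORY_FAULT;
}

/* True if the faulting access was to private memory. */
static bool sketch_fault_is_private(const struct sketch_memory_fault *f)
{
	assert(f != NULL);
	return (f->flags & SKETCH_EXIT_FLAG_PRIVATE) != 0;
}
```

A caller would pass the return value of ioctl(vcpu_fd, KVM_RUN, 0), the errno captured immediately afterwards, and run->exit_reason.
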
@@ -7858,6 +8013,27 @@ This capability is aimed to mitigate the threat that malicious VMs can
 cause CPU stuck (due to event windows don't open up) and make the CPU
 unavailable to host or other VMs.

+7.34 KVM_CAP_MEMORY_FAULT_INFO
+------------------------------
+
+:Architectures: x86
+:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
+
+The presence of this capability indicates that KVM_RUN will fill
+kvm_run.memory_fault if KVM cannot resolve a guest page fault VM-Exit, e.g. if
+there is a valid memslot but no backing VMA for the corresponding host virtual
+address.
+
+The information in kvm_run.memory_fault is valid if and only if KVM_RUN returns
+an error with errno=EFAULT or errno=EHWPOISON *and* kvm_run.exit_reason is set
+to KVM_EXIT_MEMORY_FAULT.
+
+Note: Userspaces which attempt to resolve memory faults so that they can retry
+KVM_RUN are encouraged to guard against repeatedly receiving the same
+error/annotated fault.
+
+See KVM_EXIT_MEMORY_FAULT for more information.
+
 8. Other capabilities.
 ======================

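One possible way, not part of this commit, to follow the note about guarding against the same annotated fault being reported repeatedly: remember the last (gpa, size, flags) triple and stop retrying KVM_RUN after a caller-chosen number of identical reports. The struct, function name, and retry policy are all hypothetical; the documentation does not prescribe any particular scheme.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-vCPU bookkeeping; zero-initialize before first use. */
struct sketch_fault_guard {
	uint64_t gpa;
	uint64_t size;
	uint64_t flags;
	unsigned int repeats; /* consecutive reports of the same fault */
};

/*
 * Record a reported fault and return true while retrying KVM_RUN still
 * seems reasonable, i.e. the same fault has been seen at most
 * max_repeats times in a row.
 */
static bool sketch_fault_retry_allowed(struct sketch_fault_guard *g,
				       uint64_t gpa, uint64_t size,
				       uint64_t flags,
				       unsigned int max_repeats)
{
	assert(g != NULL);

	if (g->repeats > 0 && g->gpa == gpa && g->size == size &&
	    g->flags == flags) {
		g->repeats++;
	} else {
		/* A different fault restarts the count. */
		g->gpa = gpa;
		g->size = size;
		g->flags = flags;
		g->repeats = 1;
	}
	return g->repeats <= max_repeats;
}
```

When the function returns false, a VMM would typically stop retrying and report the fault instead of looping.
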
@@ -8374,6 +8550,7 @@ PVHVM guests. Valid flags are::
   #define KVM_XEN_HVM_CONFIG_EVTCHN_2LEVEL		(1 << 4)
   #define KVM_XEN_HVM_CONFIG_EVTCHN_SEND		(1 << 5)
   #define KVM_XEN_HVM_CONFIG_RUNSTATE_UPDATE_FLAG	(1 << 6)
+  #define KVM_XEN_HVM_CONFIG_PVCLOCK_TSC_UNSTABLE	(1 << 7)

 The KVM_XEN_HVM_CONFIG_HYPERCALL_MSR flag indicates that the KVM_XEN_HVM_CONFIG
 ioctl is available, for the guest to set its hypercall page.
@@ -8417,6 +8594,11 @@ behave more correctly, not using the XEN_RUNSTATE_UPDATE flag until/unless
 specifically enabled (by the guest making the hypercall, causing the VMM
 to enable the KVM_XEN_ATTR_TYPE_RUNSTATE_UPDATE_FLAG attribute).

+The KVM_XEN_HVM_CONFIG_PVCLOCK_TSC_UNSTABLE flag indicates that KVM supports
+clearing the PVCLOCK_TSC_STABLE_BIT flag in Xen pvclock sources. This will be
+done when the KVM_CAP_XEN_HVM ioctl sets the
+KVM_XEN_HVM_CONFIG_PVCLOCK_TSC_UNSTABLE flag.
+
 8.31 KVM_CAP_PPC_MULTITCE
 -------------------------

@@ -8596,6 +8778,19 @@ block sizes is exposed in KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES as a
 64-bit bitmap (each bit describing a block size). The default value is
 0, to disable the eager page splitting.

+8.41 KVM_CAP_VM_TYPES
+---------------------
+
+:Capability: KVM_CAP_MEMORY_ATTRIBUTES
+:Architectures: x86
+:Type: system ioctl
+
+This capability returns a bitmap of support VM types. The 1-setting of bit @n
+means the VM type with value @n is supported. Possible values of @n are::
+
+  #define KVM_X86_DEFAULT_VM	0
+  #define KVM_X86_SW_PROTECTED_VM	1
+
 9. Known KVM API problems
 =========================

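A minimal sketch, not part of this commit, of how userspace might decode the KVM_CAP_VM_TYPES bitmap: bit n set means VM type n is supported. The `sketch_`/`SKETCH_` names are hypothetical; the two type values are copied from the text above, and the treatment of zero or negative return values as "nothing supported" is an assumption about how a caller would handle an absent capability or an ioctl error.

```c
#include <assert.h>
#include <stdbool.h>

/* Values copied from the documentation text above. */
#define SKETCH_X86_DEFAULT_VM      0u
#define SKETCH_X86_SW_PROTECTED_VM 1u

/*
 * cap_ret is the value returned by ioctl(KVM_CHECK_EXTENSION) for
 * KVM_CAP_VM_TYPES. Zero (capability absent) and negative values
 * (ioctl error) are treated as "no type supported".
 */
static bool sketch_vm_type_supported(int cap_ret, unsigned int type)
{
	if (cap_ret <= 0 || type >= 31)
		return false;
	return (((unsigned int)cap_ret >> type) & 1u) != 0;
}
```

The selected type is then passed as the machine type argument of KVM_CREATE_VM, as the X86 paragraph added earlier in this file suggests.
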
Documentation/virt/kvm/locking.rst

+3 -4
@@ -43,10 +43,9 @@ On x86:

 - vcpu->mutex is taken outside kvm->arch.hyperv.hv_lock and kvm->arch.xen.xen_lock

-- kvm->arch.mmu_lock is an rwlock. kvm->arch.tdp_mmu_pages_lock and
-  kvm->arch.mmu_unsync_pages_lock are taken inside kvm->arch.mmu_lock, and
-  cannot be taken without already holding kvm->arch.mmu_lock (typically with
-  ``read_lock`` for the TDP MMU, thus the need for additional spinlocks).
+- kvm->arch.mmu_lock is an rwlock; critical sections for
+  kvm->arch.tdp_mmu_pages_lock and kvm->arch.mmu_unsync_pages_lock must
+  also take kvm->arch.mmu_lock

 Everything else is a leaf: no other lock is taken inside the critical
 sections.

arch/arm64/include/asm/esr.h

+15
@@ -392,6 +392,21 @@ static inline bool esr_is_data_abort(unsigned long esr)
 	return ec == ESR_ELx_EC_DABT_LOW || ec == ESR_ELx_EC_DABT_CUR;
 }

+static inline bool esr_fsc_is_translation_fault(unsigned long esr)
+{
+	return (esr & ESR_ELx_FSC_TYPE) == ESR_ELx_FSC_FAULT;
+}
+
+static inline bool esr_fsc_is_permission_fault(unsigned long esr)
+{
+	return (esr & ESR_ELx_FSC_TYPE) == ESR_ELx_FSC_PERM;
+}
+
+static inline bool esr_fsc_is_access_flag_fault(unsigned long esr)
+{
+	return (esr & ESR_ELx_FSC_TYPE) == ESR_ELx_FSC_ACCESS;
+}
+
 const char *esr_get_class_string(unsigned long esr);
 #endif /* __ASSEMBLY */

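A standalone sketch, not part of this commit, of what the three helpers compute: they mask the fault status code with the "type" bits and compare against one fault class, ignoring the low two bits that encode the lookup level. The `SKETCH_` constants are assumed to match the values of ESR_ELx_FSC_TYPE, ESR_ELx_FSC_FAULT, ESR_ELx_FSC_ACCESS and ESR_ELx_FSC_PERM in the kernel header; they are redefined here only so the example builds outside the kernel tree.

```c
#include <assert.h>

/* Assumed to mirror the kernel's ESR_ELx_FSC_* definitions. */
#define SKETCH_FSC_TYPE   0x3CUL /* fault type, lookup level masked off */
#define SKETCH_FSC_FAULT  0x04UL /* translation fault */
#define SKETCH_FSC_ACCESS 0x08UL /* access flag fault */
#define SKETCH_FSC_PERM   0x0CUL /* permission fault */

enum sketch_fault_kind {
	SKETCH_FAULT_TRANSLATION,
	SKETCH_FAULT_ACCESS_FLAG,
	SKETCH_FAULT_PERMISSION,
	SKETCH_FAULT_OTHER,
};

/* Classify an ESR value the same way the three helpers test it. */
static enum sketch_fault_kind sketch_classify_fsc(unsigned long esr)
{
	switch (esr & SKETCH_FSC_TYPE) {
	case SKETCH_FSC_FAULT:
		return SKETCH_FAULT_TRANSLATION;
	case SKETCH_FSC_ACCESS:
		return SKETCH_FAULT_ACCESS_FLAG;
	case SKETCH_FSC_PERM:
		return SKETCH_FAULT_PERMISSION;
	default:
		return SKETCH_FAULT_OTHER;
	}
}
```

For example, a fault status code of 0x07 (translation fault, level 3) and one of 0x05 (translation fault, level 1) both classify as a translation fault, because the level bits are masked off before the comparison.
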