mirror-linux

Commit Graph

Author	SHA1	Message	Date
Peter Zijlstra	0701c9e17b	x86/kvm/vmx: Move IRQ/NMI dispatch from KVM into x86 core Move the VMX interrupt dispatch magic into the x86 core code. This isolates KVM from the FRED/IDT decisions and reduces the amount of EXPORT_SYMBOL_FOR_KVM(). Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: "Verma, Vishal L" <vishal.l.verma@intel.com> Tested-by: Zhao Liu <zhao1.liu@intel.com> Tested-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Acked-by: Sean Christopherson <seanjc@google.com> Link: https://patch.msgid.link/20260508091829.GO3126523@noisy.programming.kicks-ass.net	2026-05-19 20:25:51 +02:00
Paolo Bonzini	2d5d3fc593	KVM: VMX: introduce module parameter to disable CET There have been reports of host hangs caused by CET virtualization. Until these are analyzed further, introduce a module parameter that makes it possible to easily disable it. Link: https://lore.kernel.org/all/85548beb-1486-40f9-beb4-632c78e3360b@proxmox.com/ Cc: David Riley <d.riley@proxmox.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2026-05-13 15:38:22 +02:00
Sean Christopherson	0aec99f9bf	KVM: x86: Fix misleading variable names and add more comments for PIR=>IRR flow Rename kvm_apic_update_irr()'s "irr_updated" and vmx_sync_pir_to_irr()'s "got_posted_interrupt" to a more accurate "max_irr_is_from_pir", as neither "irr_updated" nor "got_posted_interrupt" is accurate. __kvm_apic_update_irr() and thus kvm_apic_update_irr() specifically return true if and only if the highest priority IRQ, i.e. max_irr, is a "new" pending IRQ from the PIR. I.e. it's possible for the IRR to be updated, i.e. for a posted IRQ to be "got", without the APIs returning true. Expand vmx_sync_pir_to_irr()'s comment to explain why it's necessary to set KVM_REQ_EVENT only if a "new" IRQ was found, and to explain why it's safe to do so only if a new IRQ is also the highest priority pending IRQ. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Link: https://patch.msgid.link/20260503201703.108231-3-pbonzini@redhat.com/ Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2026-05-03 22:32:41 +02:00
Paolo Bonzini	4a530993da	Merge tag 'kvm-x86-vmxon-7.1' of https://github.com/kvm-x86/linux into HEAD KVM x86 VMXON and EFER.SVME extraction for 7.1 Move _only_ VMXON+VMXOFF and EFER.SVME toggling (versus all of VMX and SVM enabling) out of KVM and into the core kernel so that non-KVM TDX enabling, e.g. for trusted I/O, can make SEAMCALLs without needing to ensure KVM is fully loaded. TIO isn't a hypervisor, and isn't trying to be a hypervisor. Specifically, TIO should _never_ have its own VMCSes (that are visible to the host; the TDX-Module has its own VMCSes to do SEAMCALL/SEAMRET), and so there is simply no reason to move that functionality out of KVM. With that out of the way, dealing with VMXON/VMXOFF and EFER.SVME is a fairly simple refcounting game.	2026-04-13 13:04:48 +02:00
Paolo Bonzini	ea8bc95fbb	Merge tag 'kvm-x86-nested-7.1' of https://github.com/kvm-x86/linux into HEAD KVM nested SVM changes for 7.1 (with one common x86 fix) - To minimize the probability of corrupting guest state, defer KVM's non-architectural delivery of exception payloads (e.g. CR2 and DR6) until consumption of the payload is imminent, and force delivery of the payload in all paths where userspace saves relevant state. - Use vcpu->arch.cr2 when updating vmcb12's CR2 on nested #VMEXIT to fix a bug where L2's CR2 can get corrupted after a save/restore, e.g. if the VM is migrated while L2 is faulting in memory. - Fix a class of nSVM bugs where some fields written by the CPU are not synchronized from vmcb02 to cached vmcb12 after VMRUN, and so are not up-to-date when saved by KVM_GET_NESTED_STATE. - Fix a class of bugs where the ordering between KVM_SET_NESTED_STATE and KVM_SET_{S}REGS could cause vmcb02 to be incorrectly initialized after save+restore. - Add a variety of missing nSVM consistency checks. - Fix several bugs where KVM failed to correctly update VMCB fields on nested #VMEXIT. - Fix several bugs where KVM failed to correctly synthesize #UD or #GP for SVM-related instructions. - Add support for save+restore of virtualized LBRs (on SVM). - Refactor various helpers and macros to improve clarity and (hopefully) make the code easier to maintain. - Aggressively sanitize fields when copying from vmcb12 to guard against unintentionally allowing L1 to utilize yet-to-be-defined features. - Fix several bugs where KVM botched rAX legality checks when emulating SVM instructions. Note, KVM is still flawed in that KVM doesn't address size prefix overrides for 64-bit guests; this should probably be documented as a KVM erratum. - Fail emulation of VMRUN/VMLOAD/VMSAVE if mapping vmcb12 fails instead of somewhat arbitrarily synthesizing #GP (i.e. don't bastardize AMD's already-sketchy behavior of generating #GP for "unsupported" addresses). - Cache all used vmcb12 fields to further harden against TOCTOU bugs.	2026-04-13 13:01:50 +02:00
Paolo Bonzini	276f81a491	Merge tag 'kvm-x86-misc-7.1' of https://github.com/kvm-x86/linux into HEAD KVM x86 misc changes for 7.1 - Advertise support for AVX512 Bit Matrix Multiply (BMM) when it's present in hardware (no additional emulation/virtualization required). - Immediately fail the build if a required #define is missing in one of KVM's headers that is included multiple times. - Reject SET_GUEST_DEBUG with -EBUSY if there's an already injected exception, mostly to prevent syzkaller from abusing the uAPI to trigger WARNs, but also because it can help prevent userspace from unintentionally crashing the VM. - Exempt SMM from CPUID faulting on Intel, as per the spec. - Misc hardening and cleanup changes.	2026-04-13 11:51:34 +02:00
Sean Christopherson	7212094bae	KVM: x86: Suppress WARNs on nested_run_pending after userspace exit To end an ongoing game of whack-a-mole between KVM and syzkaller, WARN on illegally cancelling a pending nested VM-Enter if and only if userspace has NOT gained control of the vCPU since the nested run was initiated. As proven time and time again by syzkaller, userspace can clobber vCPU state so as to force a VM-Exit that violates KVM's architectural modelling of VMRUN/VMLAUNCH/VMRESUME. To detect that userspace has gained control, while minimizing the risk of operating on stale data, convert nested_run_pending from a pure boolean to a tri-state of sorts, where '0' is still "not pending", '1' is "pending", and '2' is "pending but untrusted". Then on KVM_RUN, if the flag is in the "trusted pending" state, move it to "untrusted pending". Note, moving the state to "untrusted" even if KVM_RUN is ultimately rejected is a-ok, because for the "untrusted" state to matter, KVM must get past kvm_x86_vcpu_pre_run() at some point for the vCPU. Reviewed-by: Yosry Ahmed <yosry@kernel.org> Link: https://patch.msgid.link/20260312234823.3120658-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-04-03 09:34:01 -07:00
Yosry Ahmed	3d4470d71f	KVM: x86: Move nested_run_pending to kvm_vcpu_arch Move the nested_run_pending field, present in both svm_nested_state and nested_vmx, to the common kvm_vcpu_arch. This allows common code to use it without plumbing it through per-vendor helpers. nested_run_pending remains zero-initialized, as the entire kvm_vcpu struct is, and all further accesses are done through vcpu->arch instead of svm->nested or vmx->nested. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Yosry Ahmed <yosry@kernel.org> [sean: expand the comment in the field declaration] Link: https://patch.msgid.link/20260312234823.3120658-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-04-03 09:33:30 -07:00
Yosry Ahmed	3b27c82ba2	KVM: x86: Move some EFER bits enablement to common code Move EFER bits enablement that only depend on CPU support to common code, as there is no reason to do it in vendor code. Leave EFER.SVME and EFER.LMSLE enablement in SVM code as they depend on vendor module parameters. Having the enablement in common code ensures that if a vendor starts supporting an existing feature, KVM doesn't end up advertising to userspace but not allowing the EFER bit to be set. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Yosry Ahmed <yosry@kernel.org> Link: https://patch.msgid.link/20260307011619.2324234-2-yosry@kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-03-12 09:05:41 -07:00
Paolo Bonzini	6b1ca262a9	KVM: x86: clarify leave_smm() return value The return value of vmx_leave_smm() is unrelated to that of nested_vmx_enter_non_root_mode(). Check explicitly for success (which happens to be 0) and return 1 just like everywhere else in vmx_leave_smm(). Likewise, in svm_leave_smm() return 0/1 instead of the 0/1/-errno returned by enter_svm_guest_mode(). Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2026-03-11 18:41:12 +01:00
Paolo Bonzini	5a30e8aea0	KVM: VMX: check validity of VMCS controls when returning from SMM The VMCS12 is not available while in SMM. However, it can be overwritten if userspace manages to trigger copy_enlightened_to_vmcs12() - for example via KVM_GET_NESTED_STATE. Because of this, the VMCS12 has to be checked for validity before it is used to generate the VMCS02. Move the check code out of vmx_set_nested_state() (the other "not a VMLAUNCH/VMRESUME" path that emulates a nested vmentry) and reuse it in vmx_leave_smm(). Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2026-03-11 18:41:11 +01:00
Namhyung Kim	f78e627a01	KVM: VMX: Fix a wrong MSR update in add_atomic_switch_msr() The previous change had a bug that updated a guest MSR with a host value. Fixes: c3d6a7210a ("KVM: VMX: Dedup code for adding MSR to VMCS's auto list") Signed-off-by: Namhyung Kim <namhyung@kernel.org> Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://patch.msgid.link/20260220220216.389475-1-namhyung@kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2026-03-11 18:41:11 +01:00
Sean Christopherson	8528a7f9c9	x86/virt: Add refcounting of VMX/SVM usage to support multiple in-kernel users Implement a per-CPU refcounting scheme so that "users" of hardware virtualization, e.g. KVM and the future TDX code, can co-exist without pulling the rug out from under each other. E.g. if KVM were to disable VMX on module unload or when the last KVM VM was destroyed, SEAMCALLs from the TDX subsystem would #UD and panic the kernel. Disable preemption in the get/put APIs to ensure virtualization is fully enabled/disabled before returning to the caller. E.g. if the task were preempted after a 0=>1 transition, the new task would see a 1=>2 and thus return without enabling virtualization. Explicitly disable preemption instead of requiring the caller to do so, because the need to disable preemption is an artifact of the implementation. E.g. from KVM's perspective there is no _need_ to disable preemption as KVM guarantees the pCPU on which it is running is stable (but preemption is enabled). Opportunistically abstract away SVM vs. VMX in the public APIs by using X86_FEATURE_{SVM,VMX} to communicate what technology the caller wants to enable and use. Cc: Xu Yilun <yilun.xu@linux.intel.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Tested-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Sagi Shahar <sagis@google.com> Link: https://patch.msgid.link/20260214012702.2368778-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-03-04 08:52:52 -08:00
Sean Christopherson	428afac5a8	KVM: x86: Move bulk of emergency virtualization logic to virt subsystem Move the majority of the code related to disabling hardware virtualization in an emergency from KVM into the virt subsystem so that virt can take full ownership of the state of SVM/VMX. This will allow refcounting usage of SVM/VMX so that KVM and the TDX subsystem can enable VMX without stomping on each other. To route the emergency callback to the "right" vendor code, and to avoid mixing vendor and generic code, implement an x86_virt_ops structure to track the emergency callback, along with the SVM vs. VMX (vs. "none") feature that is active. To avoid having to choose between SVM and VMX, simply refuse to enable either if both are somehow supported. No known CPU supports both SVM and VMX, and it's comically unlikely such a CPU will ever exist. Leave KVM's clearing of loaded VMCSes and MSR_VM_HSAVE_PA in KVM, via a callback explicitly scoped to KVM. Loading VMCSes and saving/restoring host state are firmly tied to running VMs, and thus are (a) KVM's responsibility and (b) operations that are still exclusively reserved for KVM (as far as in-tree code is concerned). I.e. the contract being established is that non-KVM subsystems can utilize virtualization, but for all intents and purposes cannot act as full-blown hypervisors. Reviewed-by: Chao Gao <chao.gao@intel.com> Tested-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Sagi Shahar <sagis@google.com> Link: https://patch.msgid.link/20260214012702.2368778-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-03-04 08:52:49 -08:00
Sean Christopherson	920da4f755	KVM: VMX: Move core VMXON enablement to kernel Move the innermost VMXON+VMXOFF logic out of KVM and into core x86 so that TDX can (eventually) force VMXON without having to rely on KVM being loaded, e.g. to do SEAMCALLs during initialization. Opportunistically update the comment regarding emergency disabling via NMI to clarify that virt_rebooting will be set by _another_ emergency callback, i.e. that virt_rebooting doesn't need to be set before VMCLEAR, only before _this_ invocation does VMXOFF. Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Tested-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Sagi Shahar <sagis@google.com> Link: https://patch.msgid.link/20260214012702.2368778-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-03-04 08:52:42 -08:00
Sean Christopherson	95e4adb24f	x86/virt: Force-clear X86_FEATURE_VMX if configuring root VMCS fails If allocating and configuring a root VMCS fails, clear X86_FEATURE_VMX in all CPUs so that KVM doesn't need to manually check root_vmcs. As added bonuses, clearing VMX will reflect that VMX is unusable in /proc/cpuinfo, and will avoid a futile auto-probe of kvm-intel.ko. WARN if allocating a root VMCS page fails, e.g. to help users figure out why VMX is broken in the unlikely scenario something goes sideways during boot (and because the allocation should succeed unless there's a kernel bug). Tweak KVM's error message to suggest checking kernel logs if VMX is unsupported (in addition to checking BIOS). Tested-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Sagi Shahar <sagis@google.com> Link: https://patch.msgid.link/20260214012702.2368778-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-03-04 08:52:39 -08:00
Sean Christopherson	405b7c2793	KVM: VMX: Unconditionally allocate root VMCSes during boot CPU bringup Allocate the root VMCS (misleadingly called "vmxarea" and "kvm_area" in KVM) for each possible CPU during early boot CPU bringup, before early TDX initialization, so that TDX can eventually do VMXON on-demand (to make SEAMCALLs) without needing to load kvm-intel.ko. Allocate the pages early on, e.g. instead of trying to do so on-demand, to avoid having to juggle allocation failures at runtime. Opportunistically rename the per-CPU pointers to better reflect the role of the VMCS. Use Intel's "root VMCS" terminology, e.g. from various VMCS patents[1][2] and older SDMs, not the more opaque "VMXON region" used in recent versions of the SDM. While it's possible the VMCS passed to VMXON no longer serves as _the_ root VMCS on modern CPUs, it is still in effect a "root mode VMCS", as described in the patents. Link: https://patentimages.storage.googleapis.com/c7/e4/32/d7a7def5580667/WO2013101191A1.pdf [1] Link: https://patentimages.storage.googleapis.com/13/f6/8d/1361fab8c33373/US20080163205A1.pdf [2] Tested-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Sagi Shahar <sagis@google.com> Link: https://patch.msgid.link/20260214012702.2368778-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-03-04 08:52:34 -08:00
Sean Christopherson	a1450a8156	KVM: x86: Move "kvm_rebooting" to kernel as "virt_rebooting" Move "kvm_rebooting" to the kernel, exported for KVM, as one of many steps towards extracting the innermost VMXON and EFER.SVME management logic out of KVM and into core x86. For lack of a better name, call the new file "hw.c", to yield "virt hardware" when combined with its parent directory. No functional change intended. Tested-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Sagi Shahar <sagis@google.com> Link: https://patch.msgid.link/20260214012702.2368778-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-03-04 08:52:31 -08:00
Paolo Bonzini	bf2c3138ae	Merge tag 'kvm-x86-pmu-6.20' of https://github.com/kvm-x86/linux into HEAD KVM mediated PMU support for 6.20 Add support for mediated PMUs, where KVM gives the guest full ownership of PMU hardware (context switched around the fastpath run loop) and allows direct access to data MSRs and PMCs (restricted by the vPMU model), but intercepts access to control registers, e.g. to enforce event filtering and to prevent the guest from profiling sensitive host state. To keep overall complexity reasonable, mediated PMU usage is all or nothing for a given instance of KVM (controlled via module param). The mediated PMU is disabled by default, partly to maintain backwards compatibility for existing setups, partly because there are tradeoffs when running with a mediated PMU that may be non-starters for some use cases, e.g. the host loses the ability to profile guests with mediated PMUs, the fastpath run loop is also a blind spot, entry/exit transitions are more expensive, etc. Versus the emulated PMU, where KVM is "just another perf user", the mediated PMU delivers more accurate profiling and monitoring (no risk of contention and thus dropped events), with significantly less overhead (fewer exits and faster emulation/programming of event selectors). E.g. when running Specint-2017 on a single-socket Sapphire Rapids with 56 cores and no-SMT, and using perf from within the guest: Perf command: a. basic-sampling: perf record -F 1000 -e 6-instructions -a --overwrite b. multiplex-sampling: perf record -F 1000 -e 10-instructions -a --overwrite Guest performance overhead (emulated vPMU / all passthrough / passthrough with event filters): basic-sampling: 33.62% / 4.24% / 6.21%; multiplex-sampling: 79.32% / 7.34% / 10.45%.	2026-02-11 12:45:40 -05:00
Paolo Bonzini	1b13885edf	Merge tag 'kvm-x86-apic-6.20' of https://github.com/kvm-x86/linux into HEAD KVM x86 APIC-ish changes for 6.20 - Fix a benign bug where KVM could use the wrong memslots (ignored SMM) when creating a vCPU-specific mapping of guest memory. - Clean up KVM's handling of marking mapped vCPU pages dirty. - Drop a pile of ancient sanity checks hidden behind KVM's unused ASSERT() macro, most of which could be trivially triggered by the guest and/or user, and all of which were useless. - Fold "struct dest_map" into its sole user, "struct rtc_status", to make it more obvious what the weird parameter is used for, and to allow burying the RTC shenanigans behind CONFIG_KVM_IOAPIC=y. - Bury all of ioapic.h and KVM_IRQCHIP_KERNEL behind CONFIG_KVM_IOAPIC=y. - Add a regression test for recent APICv update fixes. - Rework KVM's handling of VMCS updates while L2 is active to temporarily switch to vmcs01 instead of deferring the update until the next nested VM-Exit. The deferred updates approach directly contributed to several bugs, was proving to be a maintenance burden due to the difficulty in auditing the correctness of deferred updates, and was polluting "struct nested_vmx" with a growing pile of booleans. - Handle "hardware APIC ISR", a.k.a. SVI, updates in kvm_apic_update_apicv() to consolidate the updates, and to co-locate SVI updates with the updates for KVM's own cache of ISR information. - Drop a dead function declaration.	2026-02-11 12:45:32 -05:00
Paolo Bonzini	9e03b7caf4	Merge tag 'kvm-x86-misc-6.20' of https://github.com/kvm-x86/linux into HEAD KVM x86 misc changes for 6.20 - Disallow changing the virtual CPU model if L2 is active, for all the same reasons KVM disallows changing the model after the first KVM_RUN. - Fix a bug where KVM would incorrectly reject host accesses to PV MSRs that were advertised as supported to userspace when running with KVM_CAP_ENFORCE_PV_FEATURE_CPUID enabled. - Fix a bug where KVM would attempt to read protected guest state (CR3) when configuring an async #PF entry. - Fail the build if EXPORT_SYMBOL_GPL or EXPORT_SYMBOL is used in KVM (for x86 only) to enforce usage of EXPORT_SYMBOL_FOR_KVM_INTERNAL. Explicitly allow the few exports that are intended for external usage. - Ignore -EBUSY when checking nested events after a vCPU exits blocking as the WARN is user-triggerable, and because exiting to userspace on -EBUSY does more harm than good in pretty much every situation. - Throw in the towel and drop the WARN on INIT/SIPI being blocked when vCPU is in Wait-For-SIPI, as playing whack-a-mole with syzkaller turned out to be an unwinnable game. - Add support for new Intel instructions that don't require anything beyond enumerating feature flags to userspace. - Grab SRCU when reading PDPTRs in KVM_GET_SREGS2. - Add WARNs to guard against modifying KVM's CPU caps outside of the intended setup flow, as nested VMX in particular is sensitive to unexpected changes in KVM's golden configuration. - Add a quirk to allow userspace to opt-in to actually suppress EOI broadcasts when the suppression feature is enabled by the guest (currently limited to split IRQCHIP, i.e. userspace I/O APIC). Sadly, simply fixing KVM to honor Suppress EOI Broadcasts isn't an option as some userspaces have come to rely on KVM's buggy behavior (KVM advertises Suppress EOI Broadcast irrespective of whether or not userspace I/O APIC supports Directed EOIs). - Minor cleanups.	2026-02-09 18:53:47 +01:00
Paolo Bonzini	687603fb2b	Merge tag 'kvm-x86-vmx-6.20' of https://github.com/kvm-x86/linux into HEAD KVM VMX changes for 6.20 - Fix an SGX bug where KVM would incorrectly try to handle EPCM #PFs by always reflecting EPCM #PFs back into the guest. KVM doesn't shadow EPCM entries, and so EPCM violations cannot be due to KVM interference, and can't be resolved by KVM. - Fix a bug where KVM would register its posted interrupt wakeup handler even if loading kvm-intel.ko ultimately failed. - Disallow access to vmcs12 fields that aren't fully supported, mostly to avoid weirdness and complexity for FRED and other features, where KVM wants to enable VMCS shadowing for fields that conditionally exist. - Print out the "bad" offsets and values if kvm-intel.ko refuses to load (or refuses to online a CPU) due to a VMCS config mismatch.	2026-02-09 18:50:04 +01:00
Sean Christopherson	3f2757dbf3	KVM: x86: Harden against unexpected adjustments to kvm_cpu_caps Add a flag to track when KVM is actively configuring its CPU caps, and WARN if a cap is set or cleared if KVM isn't in its configuration stage. Modifying CPU caps after {svm,vmx}_set_cpu_caps() can be fatal to KVM, as vendor setup code expects the CPU caps to be frozen at that point, e.g. will do additional configuration based on the caps. Rename kvm_set_cpu_caps() to kvm_initialize_cpu_caps() to pair with the new "finalize", and to make it more obvious that KVM's CPU caps aren't fully configured within the function. Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://patch.msgid.link/20260128014310.3255561-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-01-30 13:28:29 -08:00
Sean Christopherson	c0d6b8bbbc	KVM: VMX: Print out "bad" offsets+value on VMCS config mismatch When kvm-intel.ko refuses to load due to a mismatched VMCS config, print all mismatching offsets+values to make it easier to debug goofs during development, and to make it at least feasible to triage failures that occur during production. E.g. if a physical core is flaky or is running with the "wrong" microcode patch loaded, then a CPU can get a legitimate mismatch even without KVM bugs. Print the mismatches as 32-bit values as a compromise between hand coding every field (to provide precise information) and printing individual bytes (requires more effort to deduce the mismatch bit(s)). All fields in the VMCS config are either 32-bit or 64-bit values, i.e. in many cases, printing 32-bit values will be 100% precise, and in the others it's close enough, especially when considering that MSR values are split into EDX:EAX anyways. E.g. on mismatch CET entry/exit controls, KVM will print: kvm_intel: VMCS config on CPU 0 doesn't match reference config: Offset 76 REF = 0x107fffff, CPU0 = 0x007fffff, mismatch = 0x10000000 Offset 84 REF = 0x0010f3ff, CPU0 = 0x0000f3ff, mismatch = 0x00100000 Opportunistically tweak the wording on the initial error message to say "mismatch" instead of "inconsistent", as the VMCS config itself isn't inconsistent, and the wording conflates the cross-CPU compatibility check with the error_on_inconsistent_vmcs_config knob that treats inconsistent VMCS configurations as errors (e.g. if a CPU supports CET entry controls but no CET exit controls). Cc: Jim Mattson <jmattson@google.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://patch.msgid.link/20260128014310.3255561-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-01-30 13:27:46 -08:00
Sean Christopherson	f8ade833b7	KVM: x86: Explicitly configure supported XSS from {svm,vmx}_set_cpu_caps() Explicitly configure KVM's supported XSS as part of each vendor's setup flow to fix a bug where clearing SHSTK and IBT in kvm_cpu_caps, e.g. due to lack of CET XFEATURE support, makes kvm-intel.ko unloadable when nested VMX is enabled, i.e. when nested=1. The late clearing results in nested_vmx_setup_{entry,exit}_ctls() clearing VM_{ENTRY,EXIT}_LOAD_CET_STATE when nested_vmx_setup_ctls_msrs() runs during the CPU compatibility checks, ultimately leading to a mismatched VMCS config due to the reference config having the CET bits set, but every CPU's "local" config having the bits cleared. Note, kvm_caps.supported_{xcr0,xss} are unconditionally initialized by kvm_x86_vendor_init(), before calling into vendor code, and not referenced between ops->hardware_setup() and their current/old location. Fixes: `69cc3e8865` ("KVM: x86: Add XSS support for CET_KERNEL and CET_USER") Cc: stable@vger.kernel.org Cc: Mathias Krause <minipli@grsecurity.net> Cc: John Allen <john.allen@amd.com> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com> Cc: Chao Gao <chao.gao@intel.com> Cc: Binbin Wu <binbin.wu@linux.intel.com> Cc: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://patch.msgid.link/20260128014310.3255561-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-01-30 13:27:33 -08:00
Sean Christopherson	26304e0e69	KVM: nVMX: Setup VMX MSRs on loading CPU during nested_vmx_hardware_setup() Move the call to nested_vmx_setup_ctls_msrs() from vmx_hardware_setup() to nested_vmx_hardware_setup() so that the nested code can deal with ordering dependencies without having to straddle vmx_hardware_setup() and nested_vmx_hardware_setup(). Specifically, an upcoming change will sanitize the vmcs12 fields based on hardware support, and that code needs to run _before_ the MSRs are configured, because the lovely vmcs_enum MSR depends on the max supported vmcs12 field. No functional change intended. Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://patch.msgid.link/20260115173427.716021-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-01-16 07:47:59 -08:00
Hou Wenlong	6c8512a5b7	KVM: VMX: Don't register posted interrupt wakeup handler if alloc_kvm_area() fails Unregistering the posted interrupt wakeup handler only happens during hardware unsetup. Therefore, if alloc_kvm_area() fails and KVM continues on to register the posted interrupt wakeup handler, the global posted interrupt wakeup handler pointer is left in an incorrect state. Although it should not be an issue, it's still better to change it. Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com> Fixes: `ec5a4919fa` ("KVM: VMX: Unregister posted interrupt wakeup handler on hardware unsetup") Link: https://patch.msgid.link/0ac6908b608cf80eab7437004334fedd0f5f5317.1768304590.git.houwenlong.hwl@antgroup.com [sean: use a goto] Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-01-14 13:21:19 -08:00
Sean Christopherson	249cc1ab4b	KVM: nVMX: Switch to vmcs01 to set virtual APICv mode on-demand if L2 is active If L1's virtual APIC mode changes while L2 is active, e.g. because L1 doesn't intercept writes to the APIC_BASE MSR and L2 changes the mode, temporarily load vmcs01 and do all of the necessary actions instead of deferring the update until the next nested VM-Exit. This will help in fixing yet more issues related to updates while L2 is active, e.g. KVM neglects to update vmcs02 MSR intercepts if vmcs01's MSR intercepts are modified while L2 is active. Not updating x2APIC MSRs is benign because vmcs01's settings are not factored into vmcs02's bitmap, but deferring the x2APIC MSR updates would create a weird, inconsistent state. Reviewed-by: Chao Gao <chao.gao@intel.com> Link: https://patch.msgid.link/20260109034532.1012993-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-01-13 17:35:32 -08:00
Sean Christopherson	51c821d6d0	KVM: nVMX: Switch to vmcs01 to update APIC page on-demand if L2 is active If the KVM-owned APIC-access page is migrated while L2 is running, temporarily load vmcs01 and immediately update APIC_ACCESS_ADDR instead of deferring the update until the next nested VM-Exit. Once changing the virtual APIC mode is converted to always do on-demand updates, all of the "defer until vmcs01 is active" logic will be gone. Reviewed-by: Chao Gao <chao.gao@intel.com> Link: https://patch.msgid.link/20260109034532.1012993-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-01-13 17:35:32 -08:00
Sean Christopherson	2bf889a68f	KVM: nVMX: Switch to vmcs01 to refresh APICv controls on-demand if L2 is active If APICv is (un)inhibited while L2 is running, temporarily load vmcs01 and immediately refresh the APICv controls in vmcs01 instead of deferring the update until the next nested VM-Exit. This all but eliminates potential ordering issues due to vmcs01 not being synchronized with kvm_lapic.apicv_active, e.g. where KVM _thinks_ it refreshed APICv, but vmcs01 still contains stale state. Reviewed-by: Chao Gao <chao.gao@intel.com> Link: https://patch.msgid.link/20260109034532.1012993-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-01-13 17:35:31 -08:00
Sean Christopherson	f0044429b2	KVM: nVMX: Switch to vmcs01 to update SVI on-demand if L2 is active If APICv is activated while L2 is running and triggers an SVI update, temporarily load vmcs01 and immediately update SVI instead of deferring the update until the next nested VM-Exit. This will eventually allow killing off kvm_apic_update_hwapic_isr(), and all of nVMX's deferred APICv updates. Reviewed-by: Chao Gao <chao.gao@intel.com> Link: https://patch.msgid.link/20260109034532.1012993-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-01-13 17:35:31 -08:00
Sean Christopherson	51ca274607	KVM: nVMX: Switch to vmcs01 to update TPR threshold on-demand if L2 is active If KVM updates L1's TPR Threshold while L2 is active, temporarily load vmcs01 and immediately update TPR_THRESHOLD instead of deferring the update until the next nested VM-Exit. Deferring the TPR Threshold update is relatively straightforward, but for several APICv related updates, deferring updates creates ordering and state consistency problems, e.g. KVM at-large thinks APICv is enabled, but vmcs01 is still running with stale (and effectively unknown) state. Reviewed-by: Chao Gao <chao.gao@intel.com> Link: https://patch.msgid.link/20260109034532.1012993-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-01-13 17:35:31 -08:00
Sean Christopherson	3e013d0a70	KVM: nVMX: Switch to vmcs01 to update PML controls on-demand if L2 is active If KVM toggles "CPU dirty logging", a.k.a. Page-Modification Logging (PML), while L2 is active, temporarily load vmcs01 and immediately update the relevant controls instead of deferring the update until the next nested VM-Exit. For PML, deferring the update is relatively straightforward, but for several APICv related updates, deferring updates creates ordering and state consistency problems, e.g. KVM at-large thinks APICv is enabled, but vmcs01 is still running with stale (and effectively unknown) state. Convert PML first precisely because it's the simplest case to handle: if something is broken with the vmcs01 <=> vmcs02 dance, then hopefully bugs will bisect here. Reviewed-by: Chao Gao <chao.gao@intel.com> Link: https://patch.msgid.link/20260109034532.1012993-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-01-13 17:35:31 -08:00
Fred Griffoul	c9d7134679	KVM: nVMX: Mark APIC access page dirty when syncing vmcs12 pages For consistency with commit `7afe79f573` ("KVM: nVMX: Mark vmcs12's APIC access page dirty when unmapping"), which marks the page dirty during unmap operations, also mark it dirty during vmcs12 page synchronization. Signed-off-by: Fred Griffoul <fgriffo@amazon.co.uk> [sean: use kvm_vcpu_map_mark_dirty()] Link: https://patch.msgid.link/20251121223444.355422-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-01-08 11:58:24 -08:00
Sean Christopherson	57dfa61f62	KVM: VMX: Move nested_mark_vmcs12_pages_dirty() to vmx.c, and rename Move nested_mark_vmcs12_pages_dirty() to vmx.c now that it's only used in the VM-Exit path, and add "all" to its name to document that its purpose is to mark all (mapped-out-of-band) vmcs12 pages as dirty. No functional change intended. Link: https://patch.msgid.link/20251121223444.355422-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-01-08 11:58:23 -08:00
Sean Christopherson	d374b89edb	KVM: VMX: Add mediated PMU support for CPUs without "save perf global ctrl" Extend mediated PMU support for Intel CPUs without support for saving PERF_GLOBAL_CONTROL into the guest VMCS field on VM-Exit, e.g. for Skylake and its derivatives, as well as Icelake. While supporting CPUs without VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL isn't completely trivial, it's not that complex either. And not supporting such CPUs would mean not supporting 7+ years of Intel CPUs released in the past 10 years. On VM-Exit, immediately propagate the saved PERF_GLOBAL_CTRL to the VMCS as well as KVM's software cache so that KVM doesn't need to add full EXREG tracking of PERF_GLOBAL_CTRL. In practice, the vast majority of VM-Exits won't trigger software writes to guest PERF_GLOBAL_CTRL, so deferring the VMWRITE to the next VM-Enter would only delay the inevitable without batching/avoiding VMWRITEs. Note! Take care to refresh VM_EXIT_MSR_STORE_COUNT on nested VM-Exit, as it's unfortunately possible that KVM could recalculate MSR intercepts while L2 is active, e.g. if userspace loads nested state and _then_ sets PERF_CAPABILITIES. Eating the VMWRITE on every nested VM-Exit is unfortunate, but that's a pre-existing problem and can/should be solved separately, e.g. modifying the number of auto-load entries while L2 is active is also uncommon on modern CPUs. Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Tested-by: Manali Shukla <manali.shukla@amd.com> Link: https://patch.msgid.link/20251206001720.468579-45-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-01-08 11:52:23 -08:00
Sean Christopherson	9757a5aebc	KVM: VMX: Initialize vmcs01.VM_EXIT_MSR_STORE_ADDR with list address Initialize vmcs01.VM_EXIT_MSR_STORE_ADDR to point at the vCPU's msr_autostore list in anticipation of utilizing the auto-store functionality, and to harden KVM against stray reads to pfn 0 (or, in theory, a random pfn if the underlying CPU uses a complex scheme for encoding VMCS data). The MSR auto lists are supposed to be ignored if the associated COUNT VMCS field is '0', but leaving the ADDR field zero-initialized in memory is an unnecessary risk (albeit a minuscule risk) given that the cost is a single VMWRITE during vCPU creation. Tested-by: Manali Shukla <manali.shukla@amd.com> Link: https://patch.msgid.link/20251206001720.468579-44-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-01-08 11:52:22 -08:00
Sean Christopherson	c3d6a7210a	KVM: VMX: Dedup code for adding MSR to VMCS's auto list Add a helper to add an MSR to a VMCS's "auto" list to deduplicate the code in add_atomic_switch_msr(), and so that the functionality can be used in the future for managing the MSR auto-store list. No functional change intended. Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Tested-by: Manali Shukla <manali.shukla@amd.com> Link: https://patch.msgid.link/20251206001720.468579-43-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-01-08 11:52:22 -08:00
Sean Christopherson	2239d137a7	KVM: VMX: Compartmentalize adding MSRs to host vs. guest auto-load list Undo the bundling of the "host" and "guest" MSR auto-load list logic so that the code can be deduplicated by factoring out the logic to a separate helper. Now that "list full" situations are treated as fatal to the VM, there is no need to pre-check both lists. For all intents and purposes, this reverts the add_atomic_switch_msr() changes made by commit `3190709335` ("x86/KVM/VMX: Separate the VMX AUTOLOAD guest/host number accounting"). Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Tested-by: Manali Shukla <manali.shukla@amd.com> Link: https://patch.msgid.link/20251206001720.468579-42-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-01-08 11:52:21 -08:00
Sean Christopherson	0c4ff0866f	KVM: VMX: Set MSR index auto-load entry if and only if entry is "new" When adding an MSR to the auto-load lists, update the MSR index in the list entry if and only if a new entry is being inserted, as 'i' can only be non-negative if vmx_find_loadstore_msr_slot() found an entry with the MSR's index. Unnecessarily setting the index is benign, but it makes it harder to see that updating the value is necessary even when an existing entry for the MSR was found. No functional change intended. Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Tested-by: Manali Shukla <manali.shukla@amd.com> Link: https://patch.msgid.link/20251206001720.468579-41-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-01-08 11:52:20 -08:00
Sean Christopherson	2ed57bb899	KVM: VMX: Bug the VM if either MSR auto-load list is full WARN and bug the VM if either MSR auto-load list is full when adding an MSR to the lists, as the set of MSRs that KVM loads via the lists is finite and entirely KVM controlled, i.e. overflowing the lists shouldn't be possible in a fully released version of KVM. Terminate the VM as the core KVM infrastructure has no insight as to _why_ an MSR is being added to the list, and failure to load an MSR on VM-Enter and/or VM-Exit could be fatal to the host. E.g. running the host with a guest-controlled PEBS MSR could generate unexpected writes to the DS buffer and crash the host. Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Tested-by: Manali Shukla <manali.shukla@amd.com> Link: https://patch.msgid.link/20251206001720.468579-40-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-01-08 11:52:20 -08:00
Sean Christopherson	84ac00042a	KVM: VMX: Drop unused @entry_only param from add_atomic_switch_msr() Drop the "on VM-Enter only" parameter from add_atomic_switch_msr() as it is no longer used, and for all intents and purposes was never used. The functionality was added, under embargo, by commit `989e3992d2` ("x86/KVM/VMX: Extend add_atomic_switch_msr() to allow VMENTER only MSRs"), and then ripped out by commit `2f055947ae` ("x86/kvm: Drop L1TF MSR list approach") just a few commits later. `2f055947ae` x86/kvm: Drop L1TF MSR list approach `72c6d2db64` x86/litf: Introduce vmx status variable `215af5499d` cpu/hotplug: Online siblings when SMT control is turned on `390d975e0c` x86/KVM/VMX: Use MSR save list for IA32_FLUSH_CMD if required `989e3992d2` x86/KVM/VMX: Extend add_atomic_switch_msr() to allow VMENTER only MSRs Furthermore, it's extremely unlikely KVM will ever _need_ to load an MSR value via the auto-load lists only on VM-Enter. MSR writes via the lists aren't optimized in any way, and so the only reason to use the lists instead of a WRMSR is for cases where the MSR _must_ be loaded atomically with respect to VM-Enter (and/or VM-Exit). While one could argue that command MSRs, e.g. IA32_FLUSH_CMD, "need" to be done exactly at VM-Enter, in practice doing such flushes within a few instructions of VM-Enter is more than sufficient. Note, the shortlog and changelog for commit `390d975e0c` ("x86/KVM/VMX: Use MSR save list for IA32_FLUSH_CMD if required") are misleading and wrong. That commit added MSR_IA32_FLUSH_CMD to the VM-Enter _load_ list, not the VM-Enter save list (which doesn't exist, only VM-Exit has a store/save list). Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Tested-by: Manali Shukla <manali.shukla@amd.com> Link: https://patch.msgid.link/20251206001720.468579-39-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-01-08 11:52:19 -08:00
Sean Christopherson	0bd2937911	KVM: VMX: Dedup code for removing MSR from VMCS's auto-load list Add a helper to remove an MSR from an auto-{load,store} list to dedup the msr_autoload code, and in anticipation of adding similar functionality for msr_autostore. No functional change intended. Tested-by: Manali Shukla <manali.shukla@amd.com> Link: https://patch.msgid.link/20251206001720.468579-38-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-01-08 11:52:18 -08:00
Sean Christopherson	58f21a0141	KVM: nVMX: Don't update msr_autostore count when saving TSC for vmcs12 Rework nVMX's use of the MSR auto-store list to snapshot TSC to sneak MSR_IA32_TSC into the list _without_ updating KVM's software tracking, and drop the generic functionality so that future usage of the store list for nested specific logic needs to consider the implications of modifying the list. Updating the list only for vmcs02 and only on nested VM-Enter is a disaster waiting to happen, as it means vmcs01 is stale relative to the software tracking, and KVM could unintentionally leave an MSR in the store list in perpetuity while running L1, e.g. if KVM addressed the first issue and updated vmcs01 on nested VM-Exit without removing TSC from the list. Furthermore, mixing KVM's desire to save an MSR with L1's desire to save an MSR results in KVM clobbering/ignoring the needs of vmcs01 or vmcs02. E.g. if KVM added MSR_IA32_TSC to the store list for its own purposes, and then _removed_ MSR_IA32_TSC from the list after emulating nested VM-Enter, then KVM would remove MSR_IA32_TSC from the list even though saving TSC on VM-Exit from L2 is still desirable (to provide L1 with an accurate TSC). Similarly, removing an MSR from the list based on vmcs12's settings could drop an MSR that KVM wants to save for its own purposes. In practice, the issues are currently benign, because KVM doesn't use the store list for vmcs01. But that will change with upcoming mediated PMU support. Alternatively, a "full" solution would be to track MSR list entries for vmcs12 separately from KVM's standard lists, but MSR_IA32_TSC is likely the only MSR that KVM would ever want to save on _every_ VM-Exit purely based on vmcs12. I.e. the added complexity isn't remotely justified at this time. Opportunistically escalate from a pr_warn_ratelimited() to a full WARN as KVM reserves eight entries in each MSR list, and as above KVM uses at most one entry.
Opportunistically make vmx_find_loadstore_msr_slot() local to vmx.c as using it directly from nested code is unsafe due to the potential for mixing vmcs01 and vmcs02 state (see above). Cc: Jim Mattson <jmattson@google.com> Tested-by: Manali Shukla <manali.shukla@amd.com> Link: https://patch.msgid.link/20251206001720.468579-37-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-01-08 11:52:17 -08:00
Sean Christopherson	462f092dc5	KVM: VMX: Drop intermediate "guest" field from msr_autostore Drop the intermediate "guest" field from vcpu_vmx.msr_autostore as the value saved on VM-Exit isn't guaranteed to be the guest's value, it's purely whatever is in hardware at the time of VM-Exit. E.g. KVM's only use of the store list at the moment is to snapshot TSC at VM-Exit, and the value saved is always the raw TSC even if TSC-offsetting and/or TSC-scaling is enabled for the guest. And unlike msr_autoload, there is no need to differentiate between "on-entry" and "on-exit". No functional change intended. Cc: Jim Mattson <jmattson@google.com> Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Tested-by: Manali Shukla <manali.shukla@amd.com> Link: https://patch.msgid.link/20251206001720.468579-36-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-01-08 11:52:17 -08:00
Dapeng Mi	860bcb1021	KVM: x86/pmu: Expose enable_mediated_pmu parameter to user space Expose enable_mediated_pmu parameter to user space, i.e. allow userspace to enable/disable mediated vPMU support. Document the mediated versus perf-based behavior as part of the kernel-parameters.txt entry, and opportunistically add an entry for the core enable_pmu param as well. Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Signed-off-by: Mingwei Zhang <mizhang@google.com> Tested-by: Xudong Hao <xudong.hao@intel.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Tested-by: Manali Shukla <manali.shukla@amd.com> Link: https://patch.msgid.link/20251206001720.468579-34-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-01-08 11:52:15 -08:00
Dapeng Mi	2904df6692	KVM: x86/pmu: Disable interception of select PMU MSRs for mediated vPMUs For vCPUs with a mediated vPMU, disable interception of counter MSRs for PMCs that are exposed to the guest, and for GLOBAL_CTRL and related MSRs if they are fully supported according to the vCPU model, i.e. if the MSRs and all bits supported by hardware exist from the guest's point of view. Do NOT passthrough event selector or fixed counter control MSRs, so that KVM can enforce userspace-defined event filters, e.g. to prevent use of AnyThread events (which is unfortunately a setting in the fixed counter control MSR). Defer support for nested passthrough of mediated PMU MSRs to the future, as the logic for nested MSR interception is unfortunately vendor specific. Suggested-by: Sean Christopherson <seanjc@google.com> Co-developed-by: Mingwei Zhang <mizhang@google.com> Signed-off-by: Mingwei Zhang <mizhang@google.com> Co-developed-by: Sandipan Das <sandipan.das@amd.com> Signed-off-by: Sandipan Das <sandipan.das@amd.com> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com> [sean: squash patches, massage changelog, refresh VMX MSRs on filter change] Tested-by: Xudong Hao <xudong.hao@intel.com> Tested-by: Manali Shukla <manali.shukla@amd.com> Link: https://patch.msgid.link/20251206001720.468579-23-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-01-08 11:52:08 -08:00
Dapeng Mi	d3ba32d1ff	KVM: x86/pmu: Load/save GLOBAL_CTRL via entry/exit fields for mediated PMU When running a guest with a mediated PMU, context switch PERF_GLOBAL_CTRL via the dedicated VMCS fields for both host and guest. For the host, always zero GLOBAL_CTRL on exit as the guest's state will still be loaded in hardware (KVM will context switch the bulk of PMU state outside of the inner run loop). For the guest, use the dedicated fields to atomically load and save PERF_GLOBAL_CTRL on all entry/exits. For now, require VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL support (introduced by Sapphire Rapids). KVM can support CPUs that lack the control by saving PERF_GLOBAL_CTRL via the MSR save list, a.k.a. the MSR auto-store list, but defer that support as it adds a small amount of complexity and is somewhat unique. To minimize VM-Entry latency, propagate IA32_PERF_GLOBAL_CTRL to the VMCS on-demand. But to minimize complexity, read IA32_PERF_GLOBAL_CTRL out of the VMCS on all non-failing VM-Exits. I.e. partially cache the MSR. KVM could track GLOBAL_CTRL as an EXREG and defer all reads, but writes are rare, i.e. the dirty tracking for an EXREG is unnecessary, and it's not obvious that shaving ~15-20 cycles per exit is meaningful given the total overhead associated with mediated PMU context switches. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Co-developed-by: Mingwei Zhang <mizhang@google.com> Signed-off-by: Mingwei Zhang <mizhang@google.com> Tested-by: Xudong Hao <xudong.hao@intel.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Tested-by: Manali Shukla <manali.shukla@amd.com> Link: https://patch.msgid.link/20251206001720.468579-22-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-01-08 11:52:07 -08:00
Dapeng Mi	8062427212	KVM: x86/pmu: Disable RDPMC interception for compatible mediated vPMU Disable RDPMC interception for vCPUs with a mediated vPMU that is compatible with the host PMU, i.e. that doesn't require KVM emulation of RDPMC to honor the guest's vCPU model. With a mediated vPMU, all guest state accessible via RDPMC is loaded into hardware while the guest is running. Adjust RDPMC interception only for non-TDX guests, as the TDX module is responsible for managing RDPMC intercepts based on the TD configuration. Co-developed-by: Mingwei Zhang <mizhang@google.com> Signed-off-by: Mingwei Zhang <mizhang@google.com> Co-developed-by: Sandipan Das <sandipan.das@amd.com> Signed-off-by: Sandipan Das <sandipan.das@amd.com> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Tested-by: Xudong Hao <xudong.hao@intel.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Tested-by: Manali Shukla <manali.shukla@amd.com> Link: https://patch.msgid.link/20251206001720.468579-21-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-01-08 11:52:06 -08:00
Dapeng Mi	bfee4f07d8	KVM: x86/pmu: Implement Intel mediated PMU requirements and constraints Implement Intel PMU requirements and constraints for mediated PMU support. Require host PMU version 4+ so that PERF_GLOBAL_STATUS_SET can be used to precisely load the guest's status value into hardware, and require full-width writes so that KVM can precisely load guest counter values. Disable PEBS and LBRs if mediated PMU support is enabled, as they won't be supported in the initial implementation. Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Co-developed-by: Mingwei Zhang <mizhang@google.com> Signed-off-by: Mingwei Zhang <mizhang@google.com> [sean: split to separate patch, add full-width writes dependency] Tested-by: Xudong Hao <xudong.hao@intel.com> Tested-by: Manali Shukla <manali.shukla@amd.com> Link: https://patch.msgid.link/20251206001720.468579-18-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2026-01-08 11:52:04 -08:00


1214 Commits (2c142b63c8ee982cdfdba49a616027c266294838)