Move the VMX interrupt dispatch magic into the x86 core code. This
isolates KVM from the FRED/IDT decisions and reduces the amount of
EXPORT_SYMBOL_FOR_KVM().
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: "Verma, Vishal L" <vishal.l.verma@intel.com>
Tested-by: Zhao Liu <zhao1.liu@intel.com>
Tested-by: Zhao Liu <zhao1.liu@intel.com>
Tested-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Binbin Wu <binbin.wu@linxu.intel.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://patch.msgid.link/20260508091829.GO3126523@noisy.programming.kicks-ass.net
Move "kvm_rebooting" to the kernel, exported for KVM, as one of many steps
towards extracting the innermost VMXON and EFER.SVME management logic out
of KVM and into to core x86.
For lack of a better name, call the new file "hw.c", to yield "virt
hardware" when combined with its parent directory.
No functional change intended.
Tested-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Sagi Shahar <sagis@google.com>
Link: https://patch.msgid.link/20260214012702.2368778-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
- Use the root role from kvm_mmu_page to construct EPTPs instead of the
current vCPU state, partly as worthwhile cleanup, but mostly to pave the
way for tracking per-root TLB flushes so that KVM can elide EPT flushes on
pCPU migration if KVM has flushed the root at least once.
- Add a few missing nested consistency checks.
- Rip out support for doing "early" consistency checks via hardware as the
functionality hasn't been used in years and is no longer useful in general,
and replace it with an off-by-default module param to detected missed
consistency checks (i.e. WARN if hardware finds a check that KVM does not).
- Fix a currently-benign bug where KVM would drop the guest's SPEC_CTRL[63:32]
on VM-Enter.
- Misc cleanups.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmkmWaoACgkQOlYIJqCj
N/39qw//T14lCRVtO/raX1cuFCyYRTtzEEwRF+T1AdnhLoigduz/ajALLc/giF3Y
qU5Ubl/N9k/uC62mFd+tC8e/F7BKXHJIvC2WxbxzT00nos/gmHm7ZLZRlFWI51cs
9oshc87lD0XIW6Ze6Dq9xbJVA2ly1AadvrdFgi8p+/kRTO0/Vyxm9AzmvEzRzO6l
F6RXRzWzbJtNmmdqBKVMuiMAObP6nPs8Gh3gCHnBGdrSTw/Wt7W7nCLujFK4VlOD
IXZXDhnYRcSKh8NW/O6Y42VFYN0pzApMDKiFtrSM0kCkANWFusCBXiM/vt1StdZr
/MNlwJkZbIfP1jMI2km/0TSRM3x9RIlAEJ/07w38WQouc5an7eoVquNtLa3PgJ2L
AqnKzNro0TtfOSg0Nhy9LznZPOub/4VegCYvxc4+6Q74jKv8pAFMEW9uvJrSq6rk
cqQCxWG3qid6xCoE6n5uPaMWRBROVG9EzfkK+K3JyvLut9iClKHq37dFOkNQ+AZe
03uhP0CP529cnbqG+1K1VtPKl9VX7PC+Mg3ZPHI3n1GUL+/AFj3K7bpaZ+LaJpyK
jfuxBffA69dD57yPGURdfiiTkfUbbAyEP7eiBcV3Yi7vTHkbi+AsF21TA0wsGIpo
WoRw/jrjw4ScHgNbBP24QmBYynlBUzLXGo7lXE2HlUIst4qqOY0=
=hP3F
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-vmx-6.19' of https://github.com/kvm-x86/linux into HEAD
KVM VMX changes for 6.19:
- Use the root role from kvm_mmu_page to construct EPTPs instead of the
current vCPU state, partly as worthwhile cleanup, but mostly to pave the
way for tracking per-root TLB flushes so that KVM can elide EPT flushes on
pCPU migration if KVM has flushed the root at least once.
- Add a few missing nested consistency checks.
- Rip out support for doing "early" consistency checks via hardware as the
functionality hasn't been used in years and is no longer useful in general,
and replace it with an off-by-default module param to detected missed
consistency checks (i.e. WARN if hardware finds a check that KVM does not).
- Fix a currently-benign bug where KVM would drop the guest's SPEC_CTRL[63:32]
on VM-Enter.
- Misc cleanups.
Rework the handling of the MMIO Stale Data mitigation to clear CPU buffers
immediately prior to VM-Enter, i.e. in the same location that KVM emits a
VERW for unconditional (at runtime) clearing. Co-locating the code and
using a single ALTERNATIVES_2 makes it more obvious how VMX mitigates the
various vulnerabilities.
Deliberately order the alternatives as:
0. Do nothing
1. Clear if vCPU can access MMIO
2. Clear always
since the last alternative wins in ALTERNATIVES_2(), i.e. so that KVM will
honor the strictest mitigation (always clear CPU buffers) if multiple
mitigations are selected. E.g. even if the kernel chooses to mitigate
MMIO Stale Data via X86_FEATURE_CLEAR_CPU_BUF_VM_MMIO, another mitigation
may enable X86_FEATURE_CLEAR_CPU_BUF_VM, and that other thing needs to win.
Note, decoupling the MMIO mitigation from the L1TF mitigation also fixes
a mostly-benign flaw where KVM wouldn't do any clearing/flushing if the
L1TF mitigation is configured to conditionally flush the L1D, and the MMIO
mitigation but not any other "clear CPU buffers" mitigation is enabled.
For that specific scenario, KVM would skip clearing CPU buffers for the
MMIO mitigation even though the kernel requested a clear on every VM-Enter.
Note #2, the flaw goes back to the introduction of the MDS mitigation. The
MDS mitigation was inadvertently fixed by commit 43fb862de8 ("KVM/VMX:
Move VERW closer to VMentry for MDS mitigation"), but previous kernels
that flush CPU buffers in vmx_vcpu_enter_exit() are affected (though it's
unlikely the flaw is meaningfully exploitable even older kernels).
Fixes: 650b68a062 ("x86/kvm/vmx: Add MDS protection when L1D Flush is not active")
Suggested-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Reviewed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Reviewed-by: Brendan Jackman <jackmanb@google.com>
Link: https://patch.msgid.link/20251113233746.1703361-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
TSA mitigation:
d8010d4ba4 ("x86/bugs: Add a Transient Scheduler Attacks mitigation")
introduced VM_CLEAR_CPU_BUFFERS for guests on AMD CPUs. Currently on Intel
CLEAR_CPU_BUFFERS is being used for guests which has a much broader scope
(kernel->user also).
Make mitigations on Intel consistent with TSA. This would help handling the
guest-only mitigations better in future.
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
[sean: make CLEAR_CPU_BUF_VM mutually exclusive with the MMIO mitigation]
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Brendan Jackman <jackmanb@google.com>
Link: https://patch.msgid.link/20251113233746.1703361-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
When testing for VMLAUNCH vs. VMRESUME, use the copy of @flags from the
stack instead of first moving it to EBX, and then propagating
VMX_RUN_VMRESUME to RFLAGS.CF (because RBX is clobbered with the guest
value prior to the conditional branch to VMLAUNCH). Stashing information
in RFLAGS is gross, especially with the writer and reader being bifurcated
by yet more gnarly assembly code.
Opportunistically drop the SHIFT macros as they existed purely to allow
the VM-Enter flow to use Bit Test.
Suggested-by: Borislav Petkov <bp@alien8.de>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Brendan Jackman <jackmanb@google.com>
Link: https://patch.msgid.link/20251113233746.1703361-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
SPEC_CTRL is an MSR, i.e. a 64-bit value, but the assembly code that loads
the guest's value assumes bits 63:32 are always zero. The bug is
_currently_ benign because neither KVM nor the kernel support setting any
of bits 63:32, but it's still a bug that needs to be fixed.
Note, the host's value is restored in C code and is unaffected.
Fixes: 07853adc29 ("KVM: VMX: Prevent RSB underflow before vmenter")
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://patch.msgid.link/20250820100007.356761-1-ubizjak@gmail.com
[sean: call out that only the guest's value is affected]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Validate that all indirect calls adhere to kCFI rules. Notably doing
nocfi indirect call to a cfi function is broken.
Apparently some Rust 'core' code violates this and explodes when ran
with FineIBT.
All the ANNOTATE_NOCFI_SYM sites are prime targets for attackers.
- runtime EFI is especially henous because it also needs to disable
IBT. Basically calling unknown code without CFI protection at
runtime is a massice security issue.
- Kexec image handover; if you can exploit this, you get to keep it :-)
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lkml.kernel.org/r/20250714103441.496787279@infradead.org
Micro-optimize vmx_do_interrupt_irqoff() by substituting
MOV %RBP,%RSP; POP %RBP instruction sequence with equivalent
LEAVE instruction. GCC compiler does this by default for
a generic tuning and for all modern processors:
DEF_TUNE (X86_TUNE_USE_LEAVE, "use_leave",
m_386 | m_CORE_ALL | m_K6_GEODE | m_AMD_MULTIPLE | m_ZHAOXIN
| m_TREMONT | m_CORE_HYBRID | m_CORE_ATOM | m_GENERIC)
The new code also saves a couple of bytes, from:
27: 48 89 ec mov %rbp,%rsp
2a: 5d pop %rbp
to:
27: c9 leave
No functional change intended.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lore.kernel.org/r/20250414081131.97374-2-ubizjak@gmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
BHI mitigation mode spectre_bhi=auto does not deploy the software
mitigation by default. In a cloud environment, it is a likely scenario
where userspace is trusted but the guests are not trusted. Deploying
system wide mitigation in such cases is not desirable.
Update the auto mode to unconditionally mitigate against malicious
guests. Deploy the software sequence at VMexit in auto mode also, when
hardware mitigation is not available. Unlike the force =on mode,
software sequence is not deployed at syscalls in auto mode.
Suggested-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
Branch History Injection (BHI) attacks may allow a malicious application to
influence indirect branch prediction in kernel by poisoning the branch
history. eIBRS isolates indirect branch targets in ring0. The BHB can
still influence the choice of indirect branch predictor entry, and although
branch predictor entries are isolated between modes when eIBRS is enabled,
the BHB itself is not isolated between modes.
Alder Lake and new processors supports a hardware control BHI_DIS_S to
mitigate BHI. For older processors Intel has released a software sequence
to clear the branch history on parts that don't support BHI_DIS_S. Add
support to execute the software sequence at syscall entry and VMexit to
overwrite the branch history.
For now, branch history is not cleared at interrupt entry, as malicious
applications are not believed to have sufficient control over the
registers, since previous register state is cleared at interrupt
entry. Researchers continue to poke at this area and it may become
necessary to clear at interrupt entry as well in the future.
This mitigation is only defined here. It is enabled later.
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Co-developed-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
During VMentry VERW is executed to mitigate MDS. After VERW, any memory
access like register push onto stack may put host data in MDS affected
CPU buffers. A guest can then use MDS to sample host data.
Although likelihood of secrets surviving in registers at current VERW
callsite is less, but it can't be ruled out. Harden the MDS mitigation
by moving the VERW mitigation late in VMentry path.
Note that VERW for MMIO Stale Data mitigation is unchanged because of
the complexity of per-guest conditional VERW which is not easy to handle
that late in asm with no GPRs available. If the CPU is also affected by
MDS, VERW is unconditionally executed late in asm regardless of guest
having MMIO access.
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/all/20240213-delay-verw-v8-6-a6216d83edb7%40linux.intel.com
Use EFLAGS.CF instead of EFLAGS.ZF to track whether to use VMRESUME versus
VMLAUNCH. Freeing up EFLAGS.ZF will allow doing VERW, which clobbers ZF,
for MDS mitigations as late as possible without needing to duplicate VERW
for both paths.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Link: https://lore.kernel.org/all/20240213-delay-verw-v8-5-a6216d83edb7%40linux.intel.com
Instruction with %rip-relative address operand is one byte shorter than
its absolute address counterpart and is also compatible with position
independent executable (-fpie) build.
No functional changes intended.
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Link: https://lore.kernel.org/r/20231031075312.47525-1-ubizjak@gmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Mark vmread_error_trampoline() as noinstr, and add a second trampoline
for the CONFIG_CC_HAS_ASM_GOTO_OUTPUT=n case to enable instrumentation
when handling VM-Fail on VMREAD. VMREAD is used in various noinstr
flows, e.g. immediately after VM-Exit, and objtool rightly complains that
the call to the error trampoline leaves a no-instrumentation section
without annotating that it's safe to do so.
vmlinux.o: warning: objtool: vmx_vcpu_enter_exit+0xc9:
call to vmread_error_trampoline() leaves .noinstr.text section
Note, strictly speaking, enabling instrumentation in the VM-Fail path
isn't exactly safe, but if VMREAD fails the kernel/system is likely hosed
anyways, and logging that there is a fatal error is more important than
*maybe* encountering slightly unsafe instrumentation.
Reported-by: Su Hui <suhui@nfschina.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230721235637.2345403-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Code indentation should use tabs where possible and miss a '*'.
Signed-off-by: Rong Tao <rongtao@cestc.cn>
Message-Id: <tencent_A492CB3F9592578451154442830EA1B02C07@qq.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move VMX's handling of NMI VM-Exits into vmx_vcpu_enter_exit() so that
the NMI is handled prior to leaving the safety of noinstr. Handling the
NMI after leaving noinstr exposes the kernel to potential ordering
problems as an instrumentation-induced fault, e.g. #DB, #BP, #PF, etc.
will unblock NMIs when IRETing back to the faulting instruction.
Reported-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20221213060912.654668-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Split the asm subroutines for handling NMIs versus IRQs that occur in the
guest so that the NMI handler can be called from a noinstr section. As a
bonus, the NMI path doesn't need an indirect branch.
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20221213060912.654668-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Access @flags using 32-bit operands when saving and testing @flags for
VMX_RUN_VMRESUME, as using 8-bit operands is unnecessarily fragile due
to relying on VMX_RUN_VMRESUME being in bits 0-7. The behavior of
treating @flags a single byte is a holdover from when the param was
"bool launched", i.e. is not deliberate.
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20221119003747.2615229-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Declare vmread_error_trampoline() as an opaque symbol so that it cannot
be called from C code, at least not without some serious fudging. The
trampoline always passes parameters on the stack so that the inline
VMREAD sequence doesn't need to clobber registers. regparm(0) was
originally added to document the stack behavior, but it ended up being
confusing because regparm(0) is a nop for 64-bit targets.
Opportunustically wrap the trampoline and its declaration in #ifdeffery
to make it even harder to invoke incorrectly, to document why it exists,
and so that it's not left behind if/when CONFIG_CC_HAS_ASM_GOTO_OUTPUT
is true for all supported toolchains.
No functional change intended.
Cc: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20220928232015.745948-1-seanjc@google.com
This already removes an ugly #include "" from asm-offsets.c, but
especially it avoids a future error when trying to define asm-offsets
for KVM's svm/svm.h header.
This would not work for kernel/asm-offsets.c, because svm/svm.h
includes kvm_cache_regs.h which is not in the include path when
compiling asm-offsets.c. The problem is not there if the .c file is
in arch/x86/kvm.
Suggested-by: Sean Christopherson <seanjc@google.com>
Cc: stable@vger.kernel.org
Fixes: a149180fbc ("x86: Add magic AMD return-thunk")
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There is no need to declare vmread_error() asmlinkage, its arguments
can be passed via registers for both 32-bit and 64-bit targets.
Function argument registers are considered call-clobbered registers,
they are saved in the trampoline just before the function call and
restored afterwards.
Dropping "asmlinkage" patch unifies trampoline function argument handling
between 32-bit and 64-bit targets and improves generated code for 32-bit
targets.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Link: https://lore.kernel.org/r/20220817144045.3206-1-ubizjak@gmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Avoid instructions with explicit uses of the stack pointer between
instructions that implicitly refer to it. The sequence of
POP %reg; ADD $x, %RSP; POP %reg forces emission of synchronization
uop to synchronize the value of the stack pointer in the stack engine
and the out-of-order core.
Using POP with the dummy register instead of ADD $x, %RSP results in a
smaller code size and faster code.
The patch also fixes the reference to the wrong register in the
nearby comment.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Link: https://lore.kernel.org/r/20220816211010.25693-1-ubizjak@gmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
tl;dr: The Enhanced IBRS mitigation for Spectre v2 does not work as
documented for RET instructions after VM exits. Mitigate it with a new
one-entry RSB stuffing mechanism and a new LFENCE.
== Background ==
Indirect Branch Restricted Speculation (IBRS) was designed to help
mitigate Branch Target Injection and Speculative Store Bypass, i.e.
Spectre, attacks. IBRS prevents software run in less privileged modes
from affecting branch prediction in more privileged modes. IBRS requires
the MSR to be written on every privilege level change.
To overcome some of the performance issues of IBRS, Enhanced IBRS was
introduced. eIBRS is an "always on" IBRS, in other words, just turn
it on once instead of writing the MSR on every privilege level change.
When eIBRS is enabled, more privileged modes should be protected from
less privileged modes, including protecting VMMs from guests.
== Problem ==
Here's a simplification of how guests are run on Linux' KVM:
void run_kvm_guest(void)
{
// Prepare to run guest
VMRESUME();
// Clean up after guest runs
}
The execution flow for that would look something like this to the
processor:
1. Host-side: call run_kvm_guest()
2. Host-side: VMRESUME
3. Guest runs, does "CALL guest_function"
4. VM exit, host runs again
5. Host might make some "cleanup" function calls
6. Host-side: RET from run_kvm_guest()
Now, when back on the host, there are a couple of possible scenarios of
post-guest activity the host needs to do before executing host code:
* on pre-eIBRS hardware (legacy IBRS, or nothing at all), the RSB is not
touched and Linux has to do a 32-entry stuffing.
* on eIBRS hardware, VM exit with IBRS enabled, or restoring the host
IBRS=1 shortly after VM exit, has a documented side effect of flushing
the RSB except in this PBRSB situation where the software needs to stuff
the last RSB entry "by hand".
IOW, with eIBRS supported, host RET instructions should no longer be
influenced by guest behavior after the host retires a single CALL
instruction.
However, if the RET instructions are "unbalanced" with CALLs after a VM
exit as is the RET in #6, it might speculatively use the address for the
instruction after the CALL in #3 as an RSB prediction. This is a problem
since the (untrusted) guest controls this address.
Balanced CALL/RET instruction pairs such as in step #5 are not affected.
== Solution ==
The PBRSB issue affects a wide variety of Intel processors which
support eIBRS. But not all of them need mitigation. Today,
X86_FEATURE_RSB_VMEXIT triggers an RSB filling sequence that mitigates
PBRSB. Systems setting RSB_VMEXIT need no further mitigation - i.e.,
eIBRS systems which enable legacy IBRS explicitly.
However, such systems (X86_FEATURE_IBRS_ENHANCED) do not set RSB_VMEXIT
and most of them need a new mitigation.
Therefore, introduce a new feature flag X86_FEATURE_RSB_VMEXIT_LITE
which triggers a lighter-weight PBRSB mitigation versus RSB_VMEXIT.
The lighter-weight mitigation performs a CALL instruction which is
immediately followed by a speculative execution barrier (INT3). This
steers speculative execution to the barrier -- just like a retpoline
-- which ensures that speculation can never reach an unbalanced RET.
Then, ensure this CALL is retired before continuing execution with an
LFENCE.
In other words, the window of exposure is opened at VM exit where RET
behavior is troublesome. While the window is open, force RSB predictions
sampling for RET targets to a dead end at the INT3. Close the window
with the LFENCE.
There is a subset of eIBRS systems which are not vulnerable to PBRSB.
Add these systems to the cpu_vuln_whitelist[] as NO_EIBRS_PBRSB.
Future systems that aren't vulnerable will set ARCH_CAP_PBRSB_NO.
[ bp: Massage, incorporate review comments from Andy Cooper. ]
Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Co-developed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
On VMX, there are some balanced returns between the time the guest's
SPEC_CTRL value is written, and the vmenter.
Balanced returns (matched by a preceding call) are usually ok, but it's
at least theoretically possible an NMI with a deep call stack could
empty the RSB before one of the returns.
For maximum paranoia, don't allow *any* returns (balanced or otherwise)
between the SPEC_CTRL write and the vmenter.
[ bp: Fix 32-bit build. ]
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Prevent RSB underflow/poisoning attacks with RSB. While at it, add a
bunch of comments to attempt to document the current state of tribal
knowledge about RSB attacks and what exactly is being mitigated.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
On eIBRS systems, the returns in the vmexit return path from
__vmx_vcpu_run() to vmx_vcpu_run() are exposed to RSB poisoning attacks.
Fix that by moving the post-vmexit spec_ctrl handling to immediately
after the vmexit.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Convert __vmx_vcpu_run()'s 'launched' argument to 'flags', in
preparation for doing SPEC_CTRL handling immediately after vmexit, which
will need another flag.
This is much easier than adding a fourth argument, because this code
supports both 32-bit and 64-bit, and the fourth argument on 32-bit would
have to be pushed on the stack.
Note that __vmx_vcpu_run_flags() is called outside of the noinstr
critical section because it will soon start calling potentially
traceable functions.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Move the vmx_vm{enter,exit}() functionality into __vmx_vcpu_run(). This
will make it easier to do the spec_ctrl handling before the first RET.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Replace all ret/retq instructions with RET in preparation of making
RET a macro. Since AS is case insensitive it's a big no-op without
RET defined.
find arch/x86/ -name \*.S | while read file
do
sed -i 's/\<ret[q]*\>/RET/' $file
done
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211204134907.905503893@infradead.org
Replace inline assembly in nested_vmx_check_vmentry_hw
with a call to __vmx_vcpu_run. The function is not
performance critical, so (double) GPR save/restore
in __vmx_vcpu_run can be tolerated, as far as performance
effects are concerned.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Reviewed-and-tested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
[sean: dropped versioning info from changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20201231002702.2223707-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Saves one byte in __vmx_vcpu_run for the same functionality.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Message-Id: <20201029140457.126965-1-ubizjak@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move the asm blob that invokes the appropriate IRQ handler after VM-Exit
into a proper subroutine. Unconditionally create a stack frame in the
subroutine so that, as objtool sees things, the function has standard
stack behavior. The dynamic stack adjustment makes using unwind hints
problematic.
Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200915191505.10355-2-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move the functions which are inside the RCU off region into the
non-instrumentable text section.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20200708195322.037311579@linutronix.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
POP [mem] defaults to the word size, and the only legal non-default
size is 16 bits, e.g. a 32-bit POP will #UD in 64-bit mode and vice
versa, no need to use __ASM_SIZE macro to force operating mode.
Changes since v1:
- Fix commit message.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Message-Id: <20200427205035.1594232-1-ubizjak@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Clear CF and ZF in the VM-Exit path after doing __FILL_RETURN_BUFFER so
that KVM doesn't interpret clobbered RFLAGS as a VM-Fail. Filling the
RSB has always clobbered RFLAGS, its current incarnation just happens
clear CF and ZF in the processs. Relying on the macro to clear CF and
ZF is extremely fragile, e.g. commit 089dd8e531 ("x86/speculation:
Change FILL_RETURN_BUFFER to work with objtool") tweaks the loop such
that the ZF flag is always set.
Reported-by: Qian Cai <cai@lca.pw>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: stable@vger.kernel.org
Fixes: f2fde6a5bc ("KVM: VMX: Move RSB stuffing to before the first RET after VM-Exit")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200506035355.2242-1-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The exception trampoline in .fixup section is not needed, the exception
handling code can jump directly to the label in the .text section.
Changes since v1:
- Fix commit message.
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Message-Id: <20200406202108.74300-1-ubizjak@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a hand coded assembly trampoline to preserve volatile registers
across vmread_error(), and to handle the calling convention differences
between 64-bit and 32-bit due to asmlinkage on vmread_error(). Pass
@field and @fault on the stack when invoking the trampoline to avoid
clobbering volatile registers in the context of the inline assembly.
Calling vmread_error() directly from inline assembly is partially broken
on 64-bit, and completely broken on 32-bit. On 64-bit, it will clobber
%rdi and %rsi (used to pass @field and @fault) and any volatile regs
written by vmread_error(). On 32-bit, asmlinkage means vmread_error()
expects the parameters to be passed on the stack, not via regs.
Opportunistically zero out the result in the trampoline to save a few
bytes of code for every VMREAD. A happy side effect of the trampoline
is that the inline code footprint is reduced by three bytes on 64-bit
due to PUSH/POP being more efficent (in terms of opcode bytes) than MOV.
Fixes: 6e2020977e ("KVM: VMX: Add error handling to VMREAD helper")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200326160712.28803-1-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Registers in "regs" array are indexed as rax/rcx/rdx/.../rsi/rdi/r8/...
Reorder access to "regs" array in vmenter.S to follow its natural order.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
These are all functions which are invoked from elsewhere, so annotate
them as global using the new SYM_FUNC_START and their ENDPROC's by
SYM_FUNC_END.
Make sure ENTRY/ENDPROC is not defined on X86_64, given these were the
last users.
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> [hibernate]
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> [xen bits]
Acked-by: Herbert Xu <herbert@gondor.apana.org.au> [crypto]
Cc: Allison Randal <allison@lohutok.net>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Andy Shevchenko <andy@infradead.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Armijn Hemel <armijn@tjaldur.nl>
Cc: Cao jin <caoj.fnst@cn.fujitsu.com>
Cc: Darren Hart <dvhart@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Enrico Weigelt <info@metux.net>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: kvm ML <kvm@vger.kernel.org>
Cc: Len Brown <len.brown@intel.com>
Cc: linux-arch@vger.kernel.org
Cc: linux-crypto@vger.kernel.org
Cc: linux-efi <linux-efi@vger.kernel.org>
Cc: linux-efi@vger.kernel.org
Cc: linux-pm@vger.kernel.org
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: platform-driver-x86@vger.kernel.org
Cc: "Radim Krčmář" <rkrcmar@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Wei Huang <wei@redhat.com>
Cc: x86-ml <x86@kernel.org>
Cc: xen-devel@lists.xenproject.org
Cc: Xiaoyao Li <xiaoyao.li@linux.intel.com>
Link: https://lkml.kernel.org/r/20191011115108.12392-25-jslaby@suse.cz
Fix an incorrect/stale comment regarding the vmx_vcpu pointer, as guest
registers are now loaded using a direct pointer to the start of the
register array.
Opportunistically add a comment to document why the vmx_vcpu pointer is
needed, its consumption via 'call vmx_update_host_rsp' is rather subtle.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Objtool reports the following:
arch/x86/kvm/vmx/vmenter.o: warning: objtool: vmx_vmenter()+0x14: call without frame pointer save/setup
But frame pointers are necessarily broken anyway, because
__vmx_vcpu_run() clobbers RBP with the guest's value before calling
vmx_vmenter(). So calling without a frame pointer doesn't make things
any worse.
Make objtool happy by changing the call to a UD2.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lkml.kernel.org/r/9fc2216c9dc972f95bb65ce2966a682c6bda1cb0.1563413318.git.jpoimboe@redhat.com
The not-so-recent change to move VMX's VM-Exit handing to a dedicated
"function" unintentionally exposed KVM to a speculative attack from the
guest by executing a RET prior to stuffing the RSB. Make RSB stuffing
happen immediately after VM-Exit, before any unpaired returns.
Alternatively, the VM-Exit path could postpone full RSB stuffing until
its current location by stuffing the RSB only as needed, or by avoiding
returns in the VM-Exit path entirely, but both alternatives are beyond
ugly since vmx_vmexit() has multiple indirect callers (by way of
vmx_vmenter()). And putting the RSB stuffing immediately after VM-Exit
makes it much less likely to be re-broken in the future.
Note, the cost of PUSH/POP could be avoided in the normal flow by
pairing the PUSH RAX with the POP RAX in __vmx_vcpu_run() and adding an
a POP to nested_vmx_check_vmentry_hw(), but such a weird/subtle
dependency is likely to cause problems in the long run, and PUSH/POP
will take all of a few cycles, which is peanuts compared to the number
of cycles required to fill the RSB.
Fixes: 453eafbe65 ("KVM: VMX: Move VM-Enter + VM-Exit handling to non-inline sub-routines")
Reported-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move the clearing of the common registers (not 64-bit-only) to the start
of the flow that clears registers holding guest state. This is
purely a cosmetic change so that the label doesn't point at a blank line
and a #define.
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
...to make it callable from C code.
Note that because KVM chooses to be ultra paranoid about guest register
values, all callee-save registers are still cleared after VM-Exit even
though the host's values are now reloaded from the stack.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
...to prepare for making the assembly sub-routine callable from C code.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
...to prepare for making the sub-routine callable from C code.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
...to prepare for making the sub-routine callable from C code. That
means returning the result in RAX. Since RAX will be used to return the
result, use it as the scratch register as well to make the code readable
and to document that the scratch register is more or less arbitrary.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
...now that the name is no longer usurped by a defunct helper function.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>