KVM IRQ changes for 6.17
- Rework irqbypass to track/match producers and consumers via an xarray
instead of a linked list. Using a linked list leads to O(n^2) insertion
times, which is hugely problematic for use cases that create large numbers
of VMs. Such use cases typically don't actually use irqbypass, but
eliminating the pointless registration is a future problem to solve as it
likely requires new uAPI.
- Track irqbypass's "token" as "struct eventfd_ctx *" instead of a "void *",
to avoid making a simple concept unnecessarily difficult to understand.
- Add CONFIG_KVM_IOAPIC for x86 to allow disabling support for I/O APIC, PIC,
and PIT emulation at compile time.
- Drop x86's irq_comm.c, and move a pile of IRQ related code into irq.c.
- Fix a variety of flaws and bugs in the AVIC device posted IRQ code.
- Inhibited AVIC if a vCPU's ID is too big (relative to what hardware
supports) instead of rejecting vCPU creation.
- Extend enable_ipiv module param support to SVM, by simply leaving IsRunning
clear in the vCPU's physical ID table entry.
- Disable IPI virtualization, via enable_ipiv, if the CPU is affected by
erratum #1235, to allow (safely) enabling AVIC on such CPUs.
- Dedup x86's device posted IRQ code, as the vast majority of functionality
can be shared verbatime between SVM and VMX.
- Harden the device posted IRQ code against bugs and runtime errors.
- Use vcpu_idx, not vcpu_id, for GA log tag/metadata, to make lookups O(1)
instead of O(n).
- Generate GA Log interrupts if and only if the target vCPU is blocking, i.e.
only if KVM needs a notification in order to wake the vCPU.
- Decouple device posted IRQs from VFIO device assignment, as binding a VM to
a VFIO group is not a requirement for enabling device posted IRQs.
- Clean up and document/comment the irqfd assignment code.
- Disallow binding multiple irqfds to an eventfd with a priority waiter, i.e.
ensure an eventfd is bound to at most one irqfd through the entire host,
and add a selftest to verify eventfd:irqfd bindings are globally unique.
- Introduce and start using TRAILING_OVERLAP() helper for fixing
embedded flex array instances (Gustavo A. R. Silva)
- mux: Convert mux_control_ops to a flex array member in mux_chip
(Thorsten Blum)
- string: Group str_has_prefix() and strstarts() (Andy Shevchenko)
- Remove KCOV instrumentation from __init and __head (Ritesh Harjani,
Kees Cook)
- Refactor and rename stackleak feature to support Clang
- Add KUnit test for seq_buf API
- Fix KUnit fortify test under LTO
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRSPkdeREjth1dHnSE2KwveOeQkuwUCaIfUkgAKCRA2KwveOeQk
uypLAP92r6f47sWcOw/5B9aVffX6Bypsb7dqBJQpCNxI5U1xcAEAiCrZ98UJyOeQ
JQgnXd4N67K4EsS2JDc+FutRn3Yi+A8=
=+5Bq
-----END PGP SIGNATURE-----
Merge tag 'hardening-v6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull hardening updates from Kees Cook:
- Introduce and start using TRAILING_OVERLAP() helper for fixing
embedded flex array instances (Gustavo A. R. Silva)
- mux: Convert mux_control_ops to a flex array member in mux_chip
(Thorsten Blum)
- string: Group str_has_prefix() and strstarts() (Andy Shevchenko)
- Remove KCOV instrumentation from __init and __head (Ritesh Harjani,
Kees Cook)
- Refactor and rename stackleak feature to support Clang
- Add KUnit test for seq_buf API
- Fix KUnit fortify test under LTO
* tag 'hardening-v6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (22 commits)
sched/task_stack: Add missing const qualifier to end_of_stack()
kstack_erase: Support Clang stack depth tracking
kstack_erase: Add -mgeneral-regs-only to silence Clang warnings
init.h: Disable sanitizer coverage for __init and __head
kstack_erase: Disable kstack_erase for all of arm compressed boot code
x86: Handle KCOV __init vs inline mismatches
arm64: Handle KCOV __init vs inline mismatches
s390: Handle KCOV __init vs inline mismatches
arm: Handle KCOV __init vs inline mismatches
mips: Handle KCOV __init vs inline mismatch
powerpc/mm/book3s64: Move kfence and debug_pagealloc related calls to __init section
configs/hardening: Enable CONFIG_INIT_ON_FREE_DEFAULT_ON
configs/hardening: Enable CONFIG_KSTACK_ERASE
stackleak: Split KSTACK_ERASE_CFLAGS from GCC_PLUGINS_CFLAGS
stackleak: Rename stackleak_track_stack to __sanitizer_cov_stack_depth
stackleak: Rename STACKLEAK to KSTACK_ERASE
seq_buf: Introduce KUnit tests
string: Group str_has_prefix() and strstarts()
kunit/fortify: Add back "volatile" for sizeof() constants
acpi: nfit: intel: avoid multiple -Wflex-array-member-not-at-end warnings
...
While __noinstr already contained __no_sanitize_coverage, it needs to
be added to __init and __head section markings to support the Clang
implementation of CONFIG_KSTACK_ERASE. This is to make sure the stack
depth tracking callback is not executed in unsupported contexts.
The other sanitizer coverage options (trace-pc and trace-cmp) aren't
needed in __head nor __init either ("We are interested in code coverage
as a function of a syscall inputs"[1]), so this is fine to disable for
them as well.
Link: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/kcov.c?h=v6.14#n179 [1]
Acked-by: Marco Elver <elver@google.com>
Link: https://lore.kernel.org/r/20250724055029.3623499-3-kees@kernel.org
Signed-off-by: Kees Cook <kees@kernel.org>
Since commit 4b63491838 ("arm64/mm: Close theoretical race where stale
TLB entry remains valid"), all arches that use tlbbatch for reclaim
(arm64, riscv, x86) implement arch_flush_tlb_batched_pending() with a
flush_tlb_mm().
So let's simplify by removing the unnecessary abstraction and doing the
flush_tlb_mm() directly in flush_tlb_batched_pending(). This effectively
reverts commit db6c1f6f23 ("mm/tlbbatch: introduce
arch_flush_tlb_batched_pending()").
Link: https://lkml.kernel.org/r/20250609103132.447370-1-ryan.roberts@arm.com
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Suggested-by: Will Deacon <will@kernel.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Will Deacon <will@kernel.org>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
GCC appears to have kind of fragile inlining heuristics, in the
sense that it can change whether or not it inlines something based on
optimizations. It looks like the kcov instrumentation being added (or in
this case, removed) from a function changes the optimization results,
and some functions marked "inline" are _not_ inlined. In that case,
we end up with __init code calling a function not marked __init, and we
get the build warnings I'm trying to eliminate in the coming patch that
adds __no_sanitize_coverage to __init functions:
WARNING: modpost: vmlinux: section mismatch in reference: xbc_exit+0x8 (section: .text.unlikely) -> _xbc_exit (section: .init.text)
WARNING: modpost: vmlinux: section mismatch in reference: real_mode_size_needed+0x15 (section: .text.unlikely) -> real_mode_blob_end (section: .init.data)
WARNING: modpost: vmlinux: section mismatch in reference: __set_percpu_decrypted+0x16 (section: .text.unlikely) -> early_set_memory_decrypted (section: .init.text)
WARNING: modpost: vmlinux: section mismatch in reference: memblock_alloc_from+0x26 (section: .text.unlikely) -> memblock_alloc_try_nid (section: .init.text)
WARNING: modpost: vmlinux: section mismatch in reference: acpi_arch_set_root_pointer+0xc (section: .text.unlikely) -> x86_init (section: .init.data)
WARNING: modpost: vmlinux: section mismatch in reference: acpi_arch_get_root_pointer+0x8 (section: .text.unlikely) -> x86_init (section: .init.data)
WARNING: modpost: vmlinux: section mismatch in reference: efi_config_table_is_usable+0x16 (section: .text.unlikely) -> xen_efi_config_table_is_usable (section: .init.text)
This problem is somewhat fragile (though using either __always_inline
or __init will deterministically solve it), but we've tripped over
this before with GCC and the solution has usually been to just use
__always_inline and move on.
For x86 this means forcing several functions to be inline with
__always_inline.
Link: https://lore.kernel.org/r/20250724055029.3623499-2-kees@kernel.org
Signed-off-by: Kees Cook <kees@kernel.org>
Add new ARCH_WARN_ASM macro for BUG/WARN assembly code sharing with
Rust to avoid the duplication.
No functional changes.
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: FUJITA Tomonori <fujita.tomonori@gmail.com>
Link: https://lore.kernel.org/r/20250502094537.231725-2-fujita.tomonori@gmail.com
[ Fixed typo in macro parameter name. - Miguel ]
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQFSBAABCgA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmh9azkeHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG7GIH/0lpQtHRl6N+q2Qs
v75iG2ZouWyw2JlhUOHAToKU58MZqqTXLZzc8ZdY6fAd7DpXWKRGSDsyWVyLbUkt
UKGzXEIJsHXYvw2QIPbhkY9gQBWpdZTh4tHztFyKb0QLn81qkibVP6ChOwSzOGa/
xUyQ5v6yH+JvQlnQaCgy6hi7cMrLNSNZmuIjy0yc5Y153YPEtX5OUPO2PstpUx5r
AuiOhU4ewW9QCe07X/Pk7tdn0T2Jg8Kwk1FViaM0RBUf/0GXGfsovIxpUP/eCyMc
MA+9SXXLlDa/4Z8w3EsQYx6m2MnNmm0HPeevCmWqq3+Ocooik4si1BpzHfUaE6n/
/0D8zBg=
=NzEi
-----END PGP SIGNATURE-----
Merge tag 'v6.16-rc7' into tty-next
We need the tty/serial fixes in here as well.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iQFHBAABCgAxFiEEIbPD0id6easf0xsudhRwX5BBoF4FAmh6uxMTHHdlaS5saXVA
a2VybmVsLm9yZwAKCRB2FHBfkEGgXsDqB/9EZrNSmMhoctjhBwrEjilRPaBHaD0f
7AGbdvs78sZDC2FMQeeqVeUDgv7nhZoT9B/cjSNkAXF8LdtEXOpiYY0sqG+fO8c9
2DHBLBBGbFg+/SI+2SLzgrg8ixplnEF2HPR3pIgpmvqOKJcBd6+LnlyXs3xuBhc/
XCerEwjHDpC27YQf3ZFvgomH1rw4w0yENZaz/ztpca41GScmbBq6K9c0SAzherBJ
IDcbSUHNkT9LoFdXG9gawwX++XrBsfsKwlfAX6lYuzj+rdY5uxJaqMBg5sIA5Huh
aRLB9gXHQVABTqp73kG76ZtBdmMsd2S1JF0YdRxdsL91MQ1C0n7AW9iK
=0Qh6
-----END PGP SIGNATURE-----
Merge tag 'hyperv-fixes-signed-20250718' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull hyperv fixes from Wei Liu:
- Select use CONFIG_SYSFB only if EFI is enabled (Michael Kelley)
- An assorted set of fixes to remove warnings for missing export.h
header inclusion (Naman Jain)
- An assorted set of fixes for when Linux run as the root partition
for Microsoft Hypervisor (Mukesh Rathor, Nuno Das Neves, Stanislav
Kinsburskii)
- Fix the check for HYPERVISOR_CALLBACK_VECTOR (Naman Jain)
- Fix fcopy tool to handle irregularities with size of ring buffer
(Naman Jain)
- Fix incorrect file path conversion in fcopy tool (Yasumasa Suenaga)
* tag 'hyperv-fixes-signed-20250718' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
tools/hv: fcopy: Fix irregularities with size of ring buffer
PCI: hv: Use the correct hypercall for unmasking interrupts on nested
x86/hyperv: Expose hv_map_msi_interrupt()
Drivers: hv: Use nested hypercall for post message and signal event
x86/hyperv: Clean up hv_map/unmap_interrupt() return values
x86/hyperv: Fix usage of cpu_online_mask to get valid cpu
PCI: hv: Don't load the driver for baremetal root partition
net: mana: Fix warnings for missing export.h header inclusion
PCI: hv: Fix warnings for missing export.h header inclusion
clocksource: hyper-v: Fix warnings for missing export.h header inclusion
x86/hyperv: Fix warnings for missing export.h header inclusion
Drivers: hv: Fix warnings for missing export.h header inclusion
Drivers: hv: Fix the check for HYPERVISOR_CALLBACK_VECTOR
tools/hv: fcopy: Fix incorrect file path conversion
Drivers: hv: Select CONFIG_SYSFB only if EFI is enabled
Move some of the logic of hv_irq_compose_irq_message() into
hv_map_msi_interrupt(). Make hv_map_msi_interrupt() a globally-available
helper function, which will be used to map PCI interrupts when running
in the root partition.
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
Reviewed-by: Roman Kisel <romank@linux.microsoft.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Link: https://lore.kernel.org/r/1752261532-7225-3-git-send-email-nunodasneves@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <1752261532-7225-3-git-send-email-nunodasneves@linux.microsoft.com>
When running nested, these hypercalls must be sent to the L0 hypervisor
or VMBus will fail.
Remove hv_do_nested_hypercall() and hv_do_fast_nested_hypercall8()
altogether and open-code these cases, since there are only 2 and all
they do is add the nested bit.
Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
Reviewed-by: Roman Kisel <romank@linux.microsoft.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Link: https://lore.kernel.org/r/1752261532-7225-2-git-send-email-nunodasneves@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <1752261532-7225-2-git-send-email-nunodasneves@linux.microsoft.com>
__PAGE_KERNEL(_EXEC) is defined twice, just remove the superfluous set.
No functional changes.
Signed-off-by: Nikolay Borisov <nik.borisov@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250714170258.390175-1-nik.borisov@suse.com
- Allow hugetlb PMD sharing only on 64-bit as it doesn't make a whole lotta
sense on 32-bit
- Add fixes for a misconfigured AMD Zen2 client which wasn't even supposed to
run Linux
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmhziVAACgkQEsHwGGHe
VUoPLRAAqnv0D8pKO/UPUp05bOvKvEvYarK3Va4MV1QrqOgPIvabGbJOzYStU9+Q
4FZ5ZCZbi0eV1sZuNP1Zk7Ryp/bYipR6gLX+jg06VTXXTrjKnUN3ofBLPQf0+fE9
AZDShoSjS+6ifzt6BaUWW3uDMLOzwv50X/xwtLG+Nrprshs7HzfvJq2oFFQX6drQ
kg7Cj9N8WNHl1kp6CVy2DXVRzv4VR9+yxeNfOCPOJiCVEmzRlMulPzQYWWagYidB
+U+IYDJiG2p8YNL9aiCmnrRNpSfA4Podn8ZJVPKDwXSpmuUmfcLPur0c1Tjt97h0
85ovsJs+RqBBzD3ixkbNSpdNLRBFVX7q5mx4n4+1DuR5ygZrBbDyjZce9gwY2YPh
h1c2dnxxxkp9LAnBFJcaWiv8jzScRRbkqwHprBidkCS4plJiGhrD6MC78fMf8kE5
i+dydBefrsYnBwe3ciyCZh/fCvPHk6OmegSdT1+0jlz2YJOGlD1uSPSMeE6YFFvW
64R7MV3BLllBkpRx57zafx8tsdRiH9mZM6naltlcOcQV3JkHZhDJ5aNkfvBJpXw3
RZtHHAG0noCGSMAhl/crQUkZOany2TdkDQn6SQpcM+iY/E0OQSH+/QM4Rw/s+05/
FEtmkC2FqJ00RODhzSqKIwkKSgiwSCokR4pty5OJqBirUqDFNiw=
=d3UC
-----END PGP SIGNATURE-----
Merge tag 'x86_urgent_for_v6.16_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- Update Kirill's email address
- Allow hugetlb PMD sharing only on 64-bit as it doesn't make a whole
lotta sense on 32-bit
- Add fixes for a misconfigured AMD Zen2 client which wasn't even
supposed to run Linux
* tag 'x86_urgent_for_v6.16_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
MAINTAINERS: Update Kirill Shutemov's email address for TDX
x86/mm: Disable hugetlb page table sharing on 32-bit
x86/CPU/AMD: Disable INVLPGB on Zen2
x86/rdrand: Disable RDSEED on AMD Cyan Skillfish
Rename the 'reg_off' parameter of apic_{set|get}_reg() to 'reg' to
match other usages in apic.h.
No functional change intended.
Reviewed-by: Tianyu Lan <tiala@microsoft.com>
Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Link: https://lore.kernel.org/r/20250709033242.267892-15-Neeraj.Upadhyay@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Move apic_test_vector() to apic.h in order to reuse it in the Secure AVIC
guest APIC driver in later patches to test vector state in the APIC
backing page.
No functional change intended.
Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250709033242.267892-14-Neeraj.Upadhyay@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Move apic_clear_vector() and apic_set_vector() helper functions to
apic.h in order to reuse them in the Secure AVIC guest APIC driver
in later patches to atomically set/clear vectors in the APIC backing
page.
No functional change intended.
Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250709033242.267892-13-Neeraj.Upadhyay@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Move the apic_get_reg(), apic_set_reg(), apic_get_reg64() and
apic_set_reg64() helper functions to apic.h in order to reuse them in the
Secure AVIC guest APIC driver in later patches to read/write registers
from/to the APIC backing page.
No functional change intended.
Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250709033242.267892-12-Neeraj.Upadhyay@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
In preparation for using apic_find_highest_vector() in Secure AVIC
guest APIC driver, move it and associated macros to apic.h.
No functional change intended.
Acked-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Link: https://lore.kernel.org/r/20250709033242.267892-11-Neeraj.Upadhyay@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Consolidate KVM's {REG,VEC}_POS() macros and lapic_vector_set_in_irr()'s
open coded equivalent logic in anticipation of the kernel gaining more
usage of vector => reg+bit lookups.
Use lapic_vector_set_in_irr()'s math as using divides for both the bit
number and register offset makes it easier to connect the dots, and for at
least one user, fixup_irqs(), "/ 32 * 0x10" generates ever so slightly
better code with gcc-14 (shaves a whole 3 bytes from the code stream):
((v) >> 5) << 4:
c1 ef 05 shr $0x5,%edi
c1 e7 04 shl $0x4,%edi
81 c7 00 02 00 00 add $0x200,%edi
(v) / 32 * 0x10:
c1 ef 05 shr $0x5,%edi
83 c7 20 add $0x20,%edi
c1 e7 04 shl $0x4,%edi
Keep KVM's tersely named macros as "wrappers" to avoid unnecessary churn
in KVM, and because the shorter names yield more readable code overall in
KVM.
The new macros type cast the vector parameter to "unsigned int". This is
required from better code generation for cases where an "int" is passed
to these macros in KVM code.
int v;
((v) >> 5) << 4:
c1 f8 05 sar $0x5,%eax
c1 e0 04 shl $0x4,%eax
((v) / 32 * 0x10):
85 ff test %edi,%edi
8d 47 1f lea 0x1f(%rdi),%eax
0f 49 c7 cmovns %edi,%eax
c1 f8 05 sar $0x5,%eax
c1 e0 04 shl $0x4,%eax
((unsigned int)(v) / 32 * 0x10):
c1 f8 05 sar $0x5,%eax
c1 e0 04 shl $0x4,%eax
(v) & (32 - 1):
89 f8 mov %edi,%eax
83 e0 1f and $0x1f,%eax
(v) % 32
89 fa mov %edi,%edx
c1 fa 1f sar $0x1f,%edx
c1 ea 1b shr $0x1b,%edx
8d 04 17 lea (%rdi,%rdx,1),%eax
83 e0 1f and $0x1f,%eax
29 d0 sub %edx,%eax
(unsigned int)(v) % 32:
89 f8 mov %edi,%eax
83 e0 1f and $0x1f,%eax
Overall kvm.ko text size is impacted if "unsigned int" is not used.
Bin Orig New (w/o unsigned int) New (w/ unsigned int)
lapic.o 28580 28772 28580
kvm.o 670810 671002 670810
kvm.ko 708079 708271 708079
No functional change intended.
[Neeraj: Type cast vec macro param to "unsigned int", provide data
in commit log on "unsigned int" requirement]
Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Link: https://lore.kernel.org/r/20250709033242.267892-4-Neeraj.Upadhyay@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
- Remove the last leftovers of the ill-fated FPSIMD host state
mapping at EL2 stage-1
- Fix unexpected advertisement to the guest of unimplemented S2 base
granule sizes
- Gracefully fail initialising pKVM if the interrupt controller isn't
GICv3
- Also gracefully fail initialising pKVM if the carveout allocation
fails
- Fix the computing of the minimum MMIO range required for the host on
stage-2 fault
- Fix the generation of the GICv3 Maintenance Interrupt in nested mode
x86:
- Reject SEV{-ES} intra-host migration if one or more vCPUs are actively
being created, so as not to create a non-SEV{-ES} vCPU in an SEV{-ES} VM.
- Use a pre-allocated, per-vCPU buffer for handling de-sparsification of
vCPU masks in Hyper-V hypercalls; fixes a "stack frame too large" issue.
- Allow out-of-range/invalid Xen event channel ports when configuring IRQ
routing, to avoid dictating a specific ioctl() ordering to userspace.
- Conditionally reschedule when setting memory attributes to avoid soft
lockups when userspace converts huge swaths of memory to/from private.
- Add back MWAIT as a required feature for the MONITOR/MWAIT selftest.
- Add a missing field in struct sev_data_snp_launch_start that resulted in
the guest-visible workarounds field being filled at the wrong offset.
- Skip non-canonical address when processing Hyper-V PV TLB flushes to avoid
VM-Fail on INVVPID.
- Advertise supported TDX TDVMCALLs to userspace.
- Pass SetupEventNotifyInterrupt arguments to userspace.
- Fix TSC frequency underflow.
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmhurKgUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroNxHggApTP4vw+oOzfN7UoNmgR9XZMI1p2a
R8AzQ1zDyVbEVWq3xTKvXtld+dKeO0yKB/XeI/1JLck1OiHxY57I3X6k5AnsurEr
CBzeAhAjXivF8woMgmlP+30aqpomcPACdQm0gRnWkRDDJfXqSUas/iE/s9Ct1dT4
4w3PtFLsSsU8vX/RttR+CqF1AQ6SeV/NRvA8hzPGMGZoQ2um74j4ZsM/3xh77Kdw
Z2vOnZOIA4dk0074JjO/Yb9l00Ib4hn+MWG5jVJ+6i2HRRYd2knnB29apVS/ARdL
X20j+LvtYj/jrPPdYwqjvxbIXyLbJrLCZyjKhfueN+rnisPNvzR+7YE4ZQ==
=NduO
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Paolo Bonzini:
"Many patches, pretty much all of them small, that accumulated while I
was on vacation.
ARM:
- Remove the last leftovers of the ill-fated FPSIMD host state
mapping at EL2 stage-1
- Fix unexpected advertisement to the guest of unimplemented S2 base
granule sizes
- Gracefully fail initialising pKVM if the interrupt controller isn't
GICv3
- Also gracefully fail initialising pKVM if the carveout allocation
fails
- Fix the computing of the minimum MMIO range required for the host
on stage-2 fault
- Fix the generation of the GICv3 Maintenance Interrupt in nested
mode
x86:
- Reject SEV{-ES} intra-host migration if one or more vCPUs are
actively being created, so as not to create a non-SEV{-ES} vCPU in
an SEV{-ES} VM
- Use a pre-allocated, per-vCPU buffer for handling de-sparsification
of vCPU masks in Hyper-V hypercalls; fixes a "stack frame too
large" issue
- Allow out-of-range/invalid Xen event channel ports when configuring
IRQ routing, to avoid dictating a specific ioctl() ordering to
userspace
- Conditionally reschedule when setting memory attributes to avoid
soft lockups when userspace converts huge swaths of memory to/from
private
- Add back MWAIT as a required feature for the MONITOR/MWAIT selftest
- Add a missing field in struct sev_data_snp_launch_start that
resulted in the guest-visible workarounds field being filled at the
wrong offset
- Skip non-canonical address when processing Hyper-V PV TLB flushes
to avoid VM-Fail on INVVPID
- Advertise supported TDX TDVMCALLs to userspace
- Pass SetupEventNotifyInterrupt arguments to userspace
- Fix TSC frequency underflow"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: x86: avoid underflow when scaling TSC frequency
KVM: arm64: Remove kvm_arch_vcpu_run_map_fp()
KVM: arm64: Fix handling of FEAT_GTG for unimplemented granule sizes
KVM: arm64: Don't free hyp pages with pKVM on GICv2
KVM: arm64: Fix error path in init_hyp_mode()
KVM: arm64: Adjust range correctly during host stage-2 faults
KVM: arm64: nv: Fix MI line level calculation in vgic_v3_nested_update_mi()
KVM: x86/hyper-v: Skip non-canonical addresses during PV TLB flush
KVM: SVM: Add missing member in SNP_LAUNCH_START command structure
Documentation: KVM: Fix unexpected unindent warnings
KVM: selftests: Add back the missing check of MONITOR/MWAIT availability
KVM: Allow CPU to reschedule while setting per-page memory attributes
KVM: x86/xen: Allow 'out of range' event channel ports in IRQ routing table.
KVM: x86/hyper-v: Use preallocated per-vCPU buffer for de-sparsified vCPU masks
KVM: SVM: Initialize vmsa_pa in VMCB to INVALID_PAGE if VMSA page is NULL
KVM: SVM: Reject SEV{-ES} intra host migration if vCPU creation is in-flight
KVM: TDX: Report supported optional TDVMCALLs in TDX capabilities
KVM: TDX: Exit to userspace for SetupEventNotifyInterrupt
Extract KVM's open-coded calls to do writeback caches on multiple CPUs to
common library helpers for both WBINVD and WBNOINVD (KVM will use both).
Put the onus on the caller to check for a non-empty mask to simplify the
SMP=n implementation, e.g. so that it doesn't need to check that the one
and only CPU in the system is present in the mask.
[sean: move to lib, add SMP=n helpers, clarify usage]
Signed-off-by: Zheyun Shen <szy0127@sjtu.edu.cn>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250128015345.7929-2-szy0127@sjtu.edu.cn
Link: https://lore.kernel.org/20250522233733.3176144-5-seanjc@google.com
In line with WBINVD usage, add WBNOINVD helper functions. Explicitly fall
back to WBINVD (via alternative()) if WBNOINVD isn't supported even though
the instruction itself is backwards compatible (WBNOINVD is WBINVD with an
ignored REP prefix), so that disabling X86_FEATURE_WBNOINVD behaves as one
would expect, e.g. in case there's a hardware issue that affects WBNOINVD.
Opportunistically, add comments explaining the architectural behavior of
WBINVD and WBNOINVD, and provide hints and pointers to uarch-specific
behavior.
Note, alternative() ensures compatibility with early boot code as needed.
[ bp: Massage, fix typos, make export _GPL. ]
Signed-off-by: Kevin Loughlin <kevinloughlin@google.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/20250522233733.3176144-4-seanjc@google.com
Drop wbinvd_on_all_cpus()'s return value; both the "real" version and the
stub always return '0', and none of the callers check the return.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250522233733.3176144-3-seanjc@google.com
Now that DAX and all other reference counts to ZONE_DEVICE pages are
managed normally there is no need for the special devmap PTE/PMD/PUD page
table bits. So drop all references to these, freeing up a software
defined page table bit on architectures supporting it.
Link: https://lkml.kernel.org/r/6389398c32cc9daa3dfcaa9f79c7972525d310ce.1750323463.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Acked-by: Will Deacon <will@kernel.org> # arm64
Acked-by: David Hildenbrand <david@redhat.com>
Suggested-by: Chunyan Zhang <zhang.lyra@gmail.com>
Reviewed-by: Björn Töpel <bjorn@rivosinc.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: Björn Töpel <bjorn@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Deepak Gupta <debug@rivosinc.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Inki Dae <m.szyprowski@samsung.com>
Cc: John Groves <john@groves.net>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
AMD Cyan Skillfish (Family 17h, Model 47h, Stepping 0h) has an error that
causes RDSEED to always return 0xffffffff, while RDRAND works correctly.
Mask the RDSEED cap for this CPU so that both /proc/cpuinfo and direct CPUID
read report RDSEED as unavailable.
[ bp: Move to amd.c, massage. ]
Signed-off-by: Mikhail Paulyshka <me@mixaill.net>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/20250524145319.209075-1-me@mixaill.net
TSA are new aspeculative side channel attacks related to the execution
timing of instructions under specific microarchitectural conditions. In
some cases, an attacker may be able to use this timing information to
infer data from other contexts, resulting in information leakage.
Add the usual controls of the mitigation and integrate it into the
existing speculation bugs infrastructure in the kernel.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmhSsvQACgkQEsHwGGHe
VUrWNw//V+ZabYq3Nnvh4jEe6Altobnpn8bOIWmcBx6I3xuuArb9bLqcbKerDIcC
POVVW6zrdNigDe/U4aqaJXE7qCRX55uTYbhp8OLH0zzqX3Pjl/hUnEXWtMtlXj/G
CIM5mqjqEFp5JRGXetdjjuvjG1IPf+CbjKqj2WXbi//T6F3LiAFxkzdUhd+clBF/
ztWchjwUmqU0WJd6+Smb8ZnvWrLoZuOFldjhFad820B7fqkdJhzjHMmwBHJKUEZu
oABv8B0/4IALrx6LenCspWS4OuTOGG7DKyIgzitByXygXXb4L3ZUKpuqkxBU7hFx
bscwtOP7e5HIYAekx6ZSLZoZpYQXr1iH0aRGrjwapi3ASIpUwI0UA9ck2PdGo0IY
0GvmN0vbybskewBQyG819BM+DCau5pOLWuL7cYmaD2eTNoOHOknMDNlO8VzXqJxa
NnignSuEWFm2vNV1FXEav2YbVjlanV6JleiPDGBe5Xd9dnxZTvg9HuP2NkYio4dZ
mb/kEU/kTcN8nWh0Q96tX45kmj0vCbBgrSQkmUpyAugp38n69D1tp3ii9D/hyQFH
hKGcFC9m+rYVx1NLyAxhTGxaEqF801d5Qawwud8HsnQudTpCdSXD9fcBg9aCbWEa
FymtDpIeUQrFAjDpVEp6Syh3odKvLXsGEzL+DVvqKDuA8r6DxFo=
=2cLl
-----END PGP SIGNATURE-----
Merge tag 'tsa_x86_bugs_for_6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull CPU speculation fixes from Borislav Petkov:
"Add the mitigation logic for Transient Scheduler Attacks (TSA)
TSA are new aspeculative side channel attacks related to the execution
timing of instructions under specific microarchitectural conditions.
In some cases, an attacker may be able to use this timing information
to infer data from other contexts, resulting in information leakage.
Add the usual controls of the mitigation and integrate it into the
existing speculation bugs infrastructure in the kernel"
* tag 'tsa_x86_bugs_for_6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/process: Move the buffer clearing before MONITOR
x86/microcode/AMD: Add TSA microcode SHAs
KVM: SVM: Advertise TSA CPUID bits to guests
x86/bugs: Add a Transient Scheduler Attacks mitigation
x86/bugs: Rename MDS machinery to something more generic
Introduce new MSR registers for AMD hardware feedback support. They provide
workload classification and configuration capabilities.
Signed-off-by: Perry Yuan <perry.yuan@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Acked-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Link: https://lore.kernel.org/20250609200518.3616080-4-superm1@kernel.org
prevents their TSCs from going skewed from the hypervisor's
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmhqMCYACgkQEsHwGGHe
VUrz6A/9EMN2fEdpeStGrk5t0BvYPhnIApr2x2QuvVN6qXgyGWQXhnF5G94SFNn8
ykNxcGi97R/wxqk2uK3RfBg4P0ScoPDOLKKSeaqO0LVuHDTVX72fwB1F3qdaPNbp
EIEL+OOEwUAwviT2GSH4mwTb1C7TuJnOZH2lC6yDkWwN5BLnIA4P4C0Wr4pIQ+MT
TMzxGMT01yTnCAGHGOD2NRUIv/29qeJl+18uOqDO4A64RPT5Pp0yLJ72U4grGzOr
2C49+/XD0qjXMs8vRz/CbeBK47ZaE1v/ui3g5ofbZ2YtcrfJXDY2/SwoJ0Oz+liM
TSEXj8IpFZ3aq3+Pvgp9Qibu5QnFxJi7xTzrGCG1OSouHXTH2eFSVIXCiojVU12Y
s+pKCBTXs9wVJN4z/FaSSwmTvQolld7oozShgPieZsYNBfJeeWIm+6LZRT4Zr/7Y
UVsYEc/7m36ggKK+XFHsea2ZnmUFV18kEHPuWAXwmH3DW3dDfI5nm/s811jsbS+6
2RaLZPiKBsYmNZ7iCujrY3GEmE5Eyemr8Ricj2zSGTCH2EYNeODDQBOn+hYgQOTK
WJFWqpC5JqI5oJapmhugCkjfT75e+XTgO8Dox7HdlJR4UAb61xxf1zGFbThH8T45
LZgbIKtLwwLShg0FwzDl7swnADJ/SiaKl049Q8Z5YthhllHy6zI=
=6yl8
-----END PGP SIGNATURE-----
Merge tag 'x86_urgent_for_v6.16_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fix from Borislav Petkov:
- Make sure AMD SEV guests using secure TSC, include a TSC_FACTOR which
prevents their TSCs from going skewed from the hypervisor's
* tag 'x86_urgent_for_v6.16_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/sev: Use TSC_FACTOR for Secure TSC frequency calculation
Fixes and New HW Support
- amd/isp4: Improve swnode graph (new driver exception)
- asus-nb-wmi: Use duo keyboard quirk for Zenbook Duo UX8406CA
- dell-lis3lv02d: Add Latitude 5500 accelerometer address
- dell-wmi-sysman: Fix WMI data block retrieval and class dev unreg
- hp-bioscfg: Fix class device unregistration
- i2c: piix4: Re-enable on non-x86 + move FCH header under platform_data/
- intel/hid: Wildcat Lake support
- mellanox:
- mlxbf-pmc: Fix duplicate event ID
- mlxbf-tmfifo: Fix vring_desc.len assignment
- mlxreg-lc: Fix bit-not-set logic check
- nvsw-sn2201: Fix bus number in error message & spelling errors
- portwell-ec: Move watchdog device under correct platform hierarchy
- think-lmi: Error handling fixes (sysfs, kset, kobject, class dev unreg)
- thinkpad_acpi: Handle HKEY 0x1402 event (2025 Thinkpads)
- wmi: Fix WMI event enablement
The following is an automated shortlog grouped by driver:
asus-nb-wmi:
- add DMI quirk for ASUS Zenbook Duo UX8406CA
dell-lis3lv02d:
- Add Latitude 5500
dell-wmi-sysman:
- Fix class device unregistration
- Fix WMI data block retrieval in sysfs callbacks
hp-bioscfg:
- Fix class device unregistration
i2c:
- Re-enable piix4 driver on non-x86
intel/hid:
- Add Wildcat Lake support
mellanox:
- Fix spelling and comment clarity in Mellanox drivers
mlxbf-pmc:
- Fix duplicate event ID for CACHE_DATA1
mlxbf-tmfifo:
- fix vring_desc.len assignment
mlxreg-lc:
- Fix logic error in power state check
Move FCH header to a location accessible by all archs:
- Move FCH header to a location accessible by all archs
nvsw-sn2201:
- Fix bus number in adapter error message
portwell-ec:
- Move watchdog device under correct platform hierarchy
think-lmi:
- Create ksets consecutively
- Fix class device unregistration
- Fix kobject cleanup
- Fix sysfs group cleanup
thinkpad_acpi:
- handle HKEY 0x1402 event
Update swnode graph for amd isp4:
- Update swnode graph for amd isp4
wmi:
- Fix WMI event enablement
- Update documentation of WCxx/WExx ACPI methods
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSCSUwRdwTNL2MhaBlZrE9hU+XOMQUCaGfkwwAKCRBZrE9hU+XO
MVK1AQCK3C21auqcEbiZrx67hr5ir6VwTAZ9S6IR8R2FKqw8YwEAinUOcHSbmP6a
eXV0v5xVRPxZV7JBO5aN7FESqVHpBQ4=
=uxUH
-----END PGP SIGNATURE-----
Merge tag 'platform-drivers-x86-v6.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86
Pull x86 platform drivers fixes from Ilpo Järvinen:
"Mostly a few lines fixed here and there except amd/isp4 which improves
swnodes relationships but that is a new driver not in any stable
kernels yet. The think-lmi driver changes also look relatively large
but there are just many fixes to it.
The i2c/piix4 change is a effectively a revert of the commit
7e173eb82a ("i2c: piix4: Make CONFIG_I2C_PIIX4 dependent on
CONFIG_X86") but that required moving the header out from arch/x86
under include/linux/platform_data/
Summary:
- amd/isp4: Improve swnode graph (new driver exception)
- asus-nb-wmi: Use duo keyboard quirk for Zenbook Duo UX8406CA
- dell-lis3lv02d: Add Latitude 5500 accelerometer address
- dell-wmi-sysman: Fix WMI data block retrieval and class dev unreg
- hp-bioscfg: Fix class device unregistration
- i2c: piix4: Re-enable on non-x86 + move FCH header under platform_data/
- intel/hid: Wildcat Lake support
- mellanox:
- mlxbf-pmc: Fix duplicate event ID
- mlxbf-tmfifo: Fix vring_desc.len assignment
- mlxreg-lc: Fix bit-not-set logic check
- nvsw-sn2201: Fix bus number in error message & spelling errors
- portwell-ec: Move watchdog device under correct platform hierarchy
- think-lmi: Error handling fixes (sysfs, kset, kobject, class dev unreg)
- thinkpad_acpi: Handle HKEY 0x1402 event (2025 Thinkpads)
- wmi: Fix WMI event enablement"
* tag 'platform-drivers-x86-v6.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86: (22 commits)
platform/x86: think-lmi: Fix sysfs group cleanup
platform/x86: think-lmi: Fix kobject cleanup
platform/x86: think-lmi: Create ksets consecutively
platform/mellanox: mlxreg-lc: Fix logic error in power state check
i2c: Re-enable piix4 driver on non-x86
Move FCH header to a location accessible by all archs
platform/x86/intel/hid: Add Wildcat Lake support
platform/x86: dell-wmi-sysman: Fix class device unregistration
platform/x86: think-lmi: Fix class device unregistration
platform/x86: hp-bioscfg: Fix class device unregistration
platform/x86: Update swnode graph for amd isp4
platform/x86: dell-wmi-sysman: Fix WMI data block retrieval in sysfs callbacks
platform/x86: wmi: Update documentation of WCxx/WExx ACPI methods
platform/x86: wmi: Fix WMI event enablement
platform/mellanox: nvsw-sn2201: Fix bus number in adapter error message
platform/mellanox: Fix spelling and comment clarity in Mellanox drivers
platform/mellanox: mlxbf-pmc: Fix duplicate event ID for CACHE_DATA1
platform/x86: thinkpad_acpi: handle HKEY 0x1402 event
platform/x86: asus-nb-wmi: add DMI quirk for ASUS Zenbook Duo UX8406CA
platform/x86: dell-lis3lv02d: Add Latitude 5500
...
Merge fixes back into for-next to be able to take dell_rbu change that
is build on top of fixes material, and to bring lenovo related changes
in sync after the move under lenovo/ subdir in the for-next branch and
diverging changes in the fixes branch.
When using Secure TSC, the GUEST_TSC_FREQ MSR reports a frequency based on
the nominal P0 frequency, which deviates slightly (typically ~0.2%) from
the actual mean TSC frequency due to clocking parameters.
Over extended VM uptime, this discrepancy accumulates, causing clock skew
between the hypervisor and a SEV-SNP VM, leading to early timer interrupts as
perceived by the guest.
The guest kernel relies on the reported nominal frequency for TSC-based
timekeeping, while the actual frequency set during SNP_LAUNCH_START may
differ. This mismatch results in inaccurate time calculations, causing the
guest to perceive hrtimers as firing earlier than expected.
Utilize the TSC_FACTOR from the SEV firmware's secrets page (see "Secrets
Page Format" in the SNP Firmware ABI Specification) to calculate the mean
TSC frequency, ensuring accurate timekeeping and mitigating clock skew in
SEV-SNP VMs.
Use early_ioremap_encrypted() to map the secrets page as
ioremap_encrypted() uses kmalloc() which is not available during early TSC
initialization and causes a panic.
[ bp: Drop the silly dummy var:
https://lore.kernel.org/r/20250630192726.GBaGLlHl84xIopx4Pt@fat_crate.local ]
Fixes: 73bbf3b0fb ("x86/tsc: Init the TSC for Secure TSC guests")
Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/20250630081858.485187-1-nikunj@amd.com
A new header fch.h was created to store registers used by different AMD
drivers. This header was included by i2c-piix4 in
commit 624b0d5696 ("i2c: piix4, x86/platform: Move the SB800 PIIX4 FCH
definitions to <asm/amd/fch.h>"). To prevent compile failures on non-x86
archs i2c-piix4 was set to only compile on x86 by commit 7e173eb82a
("i2c: piix4: Make CONFIG_I2C_PIIX4 dependent on CONFIG_X86").
This was not a good decision because loongarch and mips both actually
support i2c-piix4 and set it enabled in the defconfig.
Move the header to a location accessible by all architectures.
Fixes: 624b0d5696 ("i2c: piix4, x86/platform: Move the SB800 PIIX4 FCH definitions to <asm/amd/fch.h>")
Suggested-by: Hans de Goede <hansg@kernel.org>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Hans de Goede <hansg@kernel.org>
Link: https://lore.kernel.org/r/20250610205817.3912944-1-superm1@kernel.org
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
accidentally cleared, leading to misconfigurations
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmhg/UQACgkQEsHwGGHe
VUrmXBAAtOrpEtR4geeBeZtCEaUxE4DE8Zvj36dr+sAScHTXNTzYK94mAy/AHU22
V3rF12/kuyyZSrwROXLBD6PgkHEn8u0WLztSeqP/SisnoLjMTV9H9TuGYBUoz1NS
clkoElQ6DJP5BVzmYpZlJrcofqNjkS/mAxfMRoIAq+LzKkb3iL/Lge+Ox/IDUg8z
L9wRlKh/IaJ5EETWlqh0gkFeS/M9DXYmfkasDQeVkUxnKFeXBdyUGc2jFzBGX5RA
rsdnz+C3x3ow2U9N+ZMVr+n06yTZvh+fAiU8emeBQm0q5fZBBHWDbnZZtWf+KG6s
43tlWyVqic5yzyQbUpRC2sttOkIAtOCMx36XexbGm1eKRNNc6fTz9IlgO/97HkuE
lYBNq0zd/p5Kb53lXb3uwBVy4sjIEZUyD/K5DO4YfTgamcwXl8BP5xnKtNPqImI5
aaF3xKKLOUDOTL1CcK5YG0joaU1k+I0F0KO7HYqkDi8Uf5naWZSUNil8nPQn8RX7
3f3LJx0e3j2o0f60AHI4mjUAUJHsxExmpaTl079k03wt8YVE3ucNaUN6se6nidVz
H5q0JU4q3C3DCu0I3Ub4wa5QXGA+TOHKuqhJCapKAVAAQDlbV2z8GxA/3B2YbRM+
eZ6/RVyk++VrRXIyfmfwLPu0CLVoSNhaUhu/hFrYbjA3NX+85qw=
=fjgT
-----END PGP SIGNATURE-----
Merge tag 'x86_urgent_for_v6.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- Make sure DR6 and DR7 are initialized to their architectural values
and not accidentally cleared, leading to misconfigurations
* tag 'x86_urgent_for_v6.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/traps: Initialize DR7 by writing its architectural reset value
x86/traps: Initialize DR6 by writing its architectural reset value
There is inconvenient for maintainers and maintainership to have
some quirks under architectural code. Move it to the specific quirk
file like other 8250-compatible drivers do.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://lore.kernel.org/r/20250627182743.1273326-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
OVMF EFI firmware needs access to the CA page to do SVSM protocol calls. For
example, when the SVSM implements an EFI variable store, such calls will be
necessary.
So add that to sev_es_efi_map_ghcbs() and also rename the function to reflect
the additional job it is doing now.
[ bp: Massage. ]
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250626114014.373748-4-kraxel@redhat.com
Drop kvm_arch_{start,end}_assignment() and all associated code now that
KVM x86 no longer consumes assigned_device_count. Tracking whether or not
a VFIO-assigned device is formally associated with a VM is fundamentally
flawed, as such an association is optional for general usage, i.e. is prone
to false negatives. E.g. prior to commit 2edd9cb79f ("kvm: detect
assigned device via irqbypass manager"), device passthrough via VFIO would
fail to enable IRQ bypass if userspace omitted the formal VFIO<=>KVM
binding.
And device drivers that *need* the VFIO<=>KVM connection, e.g. KVM-GT,
shouldn't be relying on generic x86 tracking infrastructure.
Cc: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20250523011756.3243624-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Enforce the MMIO State Data mitigation if KVM has ever mapped host MMIO
into the VM, not if the VM has an assigned device. VFIO is but one of
many ways to map host MMIO into a KVM guest, and even within VFIO,
formally attaching a device to a VM via KVM_DEV_VFIO_FILE_ADD is entirely
optional.
Track whether or not the guest can access host MMIO on a per-MMU basis,
i.e. based on whether or not the vCPU has a mapping to host MMIO. For
simplicity, track MMIO mappings in "special" rools (those without a
kvm_mmu_page) at the VM level, as only Intel CPUs are vulnerable, and so
only legacy 32-bit shadow paging is affected, i.e. lack of precise
tracking is a complete non-issue.
Make the per-MMU and per-VM flags sticky. Detecting when *all* MMIO
mappings have been removed would be absurdly complex. And in practice,
removing MMIO from a guest will be done by deleting the associated memslot,
which by default will force KVM to re-allocate all roots. Special roots
will forever be mitigated, but as above, the affected scenarios are not
expected to be performance sensitive.
Use a VMX_RUN flag to communicate the need for a buffers flush to
vmx_vcpu_enter_exit() so that kvm_vcpu_can_access_host_mmio() and all its
dependencies don't need to be marked __always_inline, e.g. so that KASAN
doesn't trigger a noinstr violation.
Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Fixes: 8cb861e9e3 ("x86/speculation/mmio: Add mitigation for Processor MMIO Stale Data")
Tested-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Link: https://lore.kernel.org/r/20250523011756.3243624-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
== Background ==
CET defines two register states: CET user, which includes user-mode control
registers, and CET supervisor, which consists of shadow-stack pointers for
privilege levels 0-2.
Current kernels disable shadow stacks in kernel mode, making the CET
supervisor state unused and eliminating the need for context switching.
== Problem ==
To virtualize CET for guests, KVM must accurately emulate hardware
behavior. A key challenge arises because there is no CPUID flag to indicate
that shadow stack is supported only in user mode. Therefore, KVM cannot
assume guests will not enable shadow stacks in kernel mode and must
preserve the CET supervisor state of vCPUs.
== Solution ==
An initial proposal to manually save and restore CET supervisor states
using raw RDMSR/WRMSR in KVM was rejected due to performance concerns and
its impact on KVM's ABI. Instead, leveraging the kernel's FPU
infrastructure for context switching was favored [1].
The main question then became whether to enable the CET supervisor state
globally for all processes or restrict it to vCPU processes. This decision
involves a trade-off between a 24-byte XSTATE buffer waste for all non-vCPU
processes and approximately 100 lines of code complexity in the kernel [2].
The agreed approach is to first try this optimal solution [3], i.e.,
restricting the CET supervisor state to guest FPUs only and eliminating
unnecessary space waste.
The guest-only xfeature infrastructure has already been added. Now,
introduce CET supervisor xstate support as the first guest-only feature
to prepare for the upcoming CET virtualization in KVM.
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: John Allen <john.allen@amd.com>
Link: https://lore.kernel.org/kvm/ZM1jV3UPL0AMpVDI@google.com/ [1]
Link: https://lore.kernel.org/kvm/1c2fd06e-2e97-4724-80ab-8695aa4334e7@intel.com/ [2]
Link: https://lore.kernel.org/kvm/2597a87b-1248-b8ce-ce60-94074bc67ea4@intel.com/ [3]
Link: https://lore.kernel.org/all/20250522151031.426788-7-chao.gao%40intel.com
In preparation for upcoming CET virtualization support, the CET supervisor
state will be added as a "guest-only" feature, since it is required only by
KVM (i.e., guest FPUs). Establish the infrastructure for "guest-only"
features.
Define a new XFEATURE_MASK_GUEST_SUPERVISOR mask to specify features that
are enabled by default in guest FPUs but not in host FPUs. Specifically,
for any bit in this set, permission is granted and XSAVE space is allocated
during vCPU creation. Non-guest FPUs cannot enable guest-only features,
even dynamically, and no XSAVE space will be allocated for them.
The mask is currently empty, but this will be changed by a subsequent
patch.
Co-developed-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: John Allen <john.allen@amd.com>
Link: https://lore.kernel.org/all/20250522151031.426788-6-chao.gao%40intel.com
Currently, guest and host FPUs share the same default features. However,
the CET supervisor xstate is the first feature that needs to be enabled
exclusively for guest FPUs. Enabling it for host FPUs leads to a waste of
24 bytes in the XSAVE buffer.
To support "guest-only" features, add a new structure to hold the
default features and sizes for guest FPUs to clearly differentiate them
from those for host FPUs.
Add two helpers to provide the default feature masks for guest and host
FPUs. Default features are derived by applying the masks to the maximum
supported features.
Note that,
1) for now, guest_default_mask() and host_default_mask() are identical.
This will change in a follow-up patch once guest permissions, default
xfeatures, and fpstate size are all converted to use the guest defaults.
2) only supervisor features will diverge between guest FPUs and host
FPUs, while user features will remain the same [1][2]. So, the new
vcpu_fpu_config struct does not include default user features and size
for the UABI buffer.
An alternative approach is adding a guest_only_xfeatures member to
fpu_kernel_cfg and adding two helper functions to calculate the guest
default xfeatures and size. However, calculating these defaults at runtime
would introduce unnecessary overhead.
Suggested-by: Chang S. Bae <chang.seok.bae@intel.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: John Allen <john.allen@amd.com>
Link: https://lore.kernel.org/kvm/aAwdQ759Y6V7SGhv@google.com/ [1]
Link: https://lore.kernel.org/kvm/9ca17e1169805f35168eb722734fbf3579187886.camel@intel.com/ [2]
Link: https://lore.kernel.org/all/20250522151031.426788-2-chao.gao%40intel.com
Initialize DR7 by writing its architectural reset value to always set
bit 10, which is reserved to '1', when "clearing" DR7 so as not to
trigger unanticipated behavior if said bit is ever unreserved, e.g. as
a feature enabling flag with inverted polarity.
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Sean Christopherson <seanjc@google.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Cc:stable@vger.kernel.org
Link: https://lore.kernel.org/all/20250620231504.2676902-3-xin%40zytor.com
Allocate VM structs via kvzalloc(), i.e. try to use a contiguous physical
allocation before falling back to __vmalloc(), to avoid the overhead of
establishing the virtual mappings. For non-debug builds, The SVM and VMX
(and TDX) structures are now just below 7000 bytes in the worst case
scenario (see below), i.e. are order-1 allocations, and will likely remain
that way for quite some time.
Add compile-time assertions in vendor code to ensure the size of the
structures, sans the memslot hash tables, are order-0 allocations, i.e.
are less than 4KiB. There's nothing fundamentally wrong with a larger
kvm_{svm,vmx,tdx} size, but given that the size of the structure (without
the memslots hash tables) is below 2KiB after 18+ years of existence,
more than doubling the size would be quite notable.
Add sanity checks on the memslot hash table sizes, partly to ensure they
aren't resized without accounting for the impact on VM structure size, and
partly to document that the majority of the size of VM structures comes
from the memslots.
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250523001138.3182794-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Dynamically allocate the (massive) array of hashed lists used to track
shadow pages, as the array itself is 32KiB, i.e. is an order-3 allocation
all on its own, and is *exactly* an order-3 allocation. Dynamically
allocating the array will allow allocating "struct kvm" using kvmalloc(),
and will also allow deferring allocation of the array until it's actually
needed, i.e. until the first shadow root is allocated.
Opportunistically use kvmalloc() for the hashed lists, as an order-3
allocation is (stating the obvious) less likely to fail than an order-4
allocation, and the overhead of vmalloc() is undesirable given that the
size of the allocation is fixed.
Cc: Vipin Sharma <vipinsh@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250523001138.3182794-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Use a preallocated per-vCPU bitmap for tracking the unpacked set of vCPUs
being targeted for Hyper-V's paravirt TLB flushing. If KVM_MAX_NR_VCPUS
is set to 4096 (which is allowed even for MAXSMP=n builds), putting the
vCPU mask on-stack pushes kvm_hv_flush_tlb() past the default FRAME_WARN
limit.
arch/x86/kvm/hyperv.c:2001:12: error: stack frame size (1288) exceeds limit (1024)
in 'kvm_hv_flush_tlb' [-Werror,-Wframe-larger-than]
2001 | static u64 kvm_hv_flush_tlb(struct kvm_vcpu *vcpu, struct kvm_hv_hcall *hc)
| ^
1 error generated.
Note, sparse_banks was given the same treatment by commit 7d5e88d301
("KVM: x86: hyper-v: Use preallocated buffer in 'struct kvm_vcpu_hv'
instead of on-stack 'sparse_banks'"), for the exact same reason.
Reported-by: Abinash Lalotra <abinashsinghlalotra@gmail.com>
Closes: https://lore.kernel.org/all/20250613111023.786265-1-abinashsinghlalotra@gmail.com
Link: https://lore.kernel.org/all/aEylI-O8kFnFHrOH@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Rename kvm_set_msi_irq() to kvm_msi_to_lapic_irq() to better capture what
it actually does, e.g. it's _really_ easy to conflate kvm_set_msi_irq()
with kvm_set_msi().
Opportunistically delete the public declaration and export, as they are
no longer used/needed.
No functional change intended.
Link: https://lore.kernel.org/r/20250611224604.313496-64-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Configure IRTEs to GA log interrupts for device posted IRQs that hit
non-running vCPUs if and only if the target vCPU is blocking, i.e.
actually needs a wake event. If the vCPU has exited to userspace or was
preempted, generating GA log entries and interrupts is wasteful and
unnecessary, as the vCPU will be re-loaded and/or scheduled back in
irrespective of the GA log notification (avic_ga_log_notifier() is just a
fancy wrapper for kvm_vcpu_wake_up()).
Use a should-be-zero bit in the vCPU's Physical APIC ID Table Entry to
track whether or not the vCPU's associated IRTEs are configured to
generate GA logs, but only set the synthetic bit in KVM's "cache", i.e.
never set the should-be-zero bit in tables that are used by hardware.
Use a synthetic bit instead of a dedicated boolean to minimize the odds
of messing up the locking, i.e. so that all the existing rules that apply
to avic_physical_id_entry for IS_RUNNING are reused verbatim for
GA_LOG_INTR.
Note, because KVM (by design) "puts" AVIC state in a "pre-blocking"
phase, using kvm_vcpu_is_blocking() to track the need for notifications
isn't a viable option.
Link: https://lore.kernel.org/r/20250611224604.313496-63-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add plumbing to the AMD IOMMU driver to allow KVM to control whether or
not an IRTE is configured to generate GA log interrupts. KVM only needs a
notification if the target vCPU is blocking, so the vCPU can be awakened.
If a vCPU is preempted or exits to userspace, KVM clears is_run, but will
set the vCPU back to running when userspace does KVM_RUN and/or the vCPU
task is scheduled back in, i.e. KVM doesn't need a notification.
Unconditionally pass "true" in all KVM paths to isolate the IOMMU changes
from the KVM changes insofar as possible.
Opportunistically swap the ordering of parameters for amd_iommu_update_ga()
so that the match amd_iommu_activate_guest_mode().
Note, as of this writing, the AMD IOMMU manual doesn't list GALogIntr as
a non-cached field, but per AMD hardware architects, it's not cached and
can be safely updated without an invalidation.
Link: https://lore.kernel.org/all/b29b8c22-2fd4-4b5e-b755-9198874157c7@amd.com
Cc: Vasant Hegde <vasant.hegde@amd.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Link: https://lore.kernel.org/r/20250611224604.313496-62-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Use a dedicated counter to track the number of IRQs that can utilize IRQ
bypass instead of piggybacking the assigned device count. As evidenced by
commit 2edd9cb79f ("kvm: detect assigned device via irqbypass manager"),
it's possible for a device to be able to post IRQs to a vCPU without said
device being assigned to a VM.
Leave the calls to kvm_arch_{start,end}_assignment() alone for the moment
to avoid regressing the MMIO stale data mitigation. KVM is abusing the
assigned device count when applying mmio_stale_data_clear, and it's not at
all clear if vDPA devices rely on this behavior. This will hopefully be
cleaned up in the future, as the number of assigned devices is a terrible
heuristic for detecting if a VM has access to host MMIO.
Link: https://lore.kernel.org/r/20250611224604.313496-55-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Now that setting vCPU affinity is guarded with ir_list_lock, i.e. now that
avic_physical_id_entry can be safely accessed, set the pCPU info
straight-away when setting vCPU affinity. Putting the IRTE into posted
mode, and then immediately updating the IRTE a second time if the target
vCPU is running is wasteful and confusing.
This also fixes a flaw where a posted IRQ that arrives between putting
the IRTE into guest_mode and setting the correct destination could cause
the IOMMU to ring the doorbell on the wrong pCPU.
Link: https://lore.kernel.org/r/20250611224604.313496-44-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Split the vcpu_data structure that serves as a handoff from KVM to IOMMU
drivers into vendor specific structures. Overloading a single structure
makes the code hard to read and maintain, is *very* misleading as it
suggests that mixing vendors is actually supported, and bastardizing
Intel's posted interrupt descriptor address when AMD's IOMMU already has
its own structure is quite unnecessary.
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-33-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Hoist the logic for identifying the target vCPU for a posted interrupt
into common x86. The code is functionally identical between Intel and
AMD.
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-30-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
- Fix another set of FP/SIMD/SVE bugs affecting NV, and plugging some
missing synchronisation
- A small fix for the irqbypass hook fixes, tightening the check and
ensuring that we only deal with MSI for both the old and the new
route entry
- Rework the way the shadow LRs are addressed in a nesting
configuration, plugging an embarrassing bug as well as simplifying
the whole process
- Add yet another fix for the dreaded arch_timer_edge_cases selftest
RISC-V:
- Fix the size parameter check in SBI SFENCE calls
- Don't treat SBI HFENCE calls as NOPs
x86 TDX:
- Complete API for handling complex TDVMCALLs in userspace. This was
delayed because the spec lacked a way for userspace to deny supporting
these calls; the new exit code is now approved.
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmhXq3YUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroM6JAf/YG2NrLK22eZuNILBu6lwf4+yrUfU
vKcQLHjyX477xRIDZOiAqFOw6JVfqUj7gHbPM+wiVIAg3JT931TkeaZQFuOnMg+k
HalkddeSa9oupd9nhmogcRfGovOXmBQFOleJ+h3GcSeMT/w+6JtgXmgC6/l29f3X
xswi9Lb6yRbcnd2E8XOnJ+ZBqA3ABguuM+0vcmN2xH6EIBaRzjuDPPHV3NGf2aBK
BTAylYzU5P4tZYEHiH/9WlCM19gaeIb2A1LSPDNASpUufvw4VIpmU9MwLDa5LtNB
kVIQxotgFHEoO9qnTX8sGsBrNuStHOuzZA17Z8O6OzTYISK5a7CdworR1w==
=upFj
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"ARM:
- Fix another set of FP/SIMD/SVE bugs affecting NV, and plugging some
missing synchronisation
- A small fix for the irqbypass hook fixes, tightening the check and
ensuring that we only deal with MSI for both the old and the new
route entry
- Rework the way the shadow LRs are addressed in a nesting
configuration, plugging an embarrassing bug as well as simplifying
the whole process
- Add yet another fix for the dreaded arch_timer_edge_cases selftest
RISC-V:
- Fix the size parameter check in SBI SFENCE calls
- Don't treat SBI HFENCE calls as NOPs
x86 TDX:
- Complete API for handling complex TDVMCALLs in userspace.
This was delayed because the spec lacked a way for userspace to
deny supporting these calls; the new exit code is now approved"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: TDX: Exit to userspace for GetTdVmCallInfo
KVM: TDX: Handle TDG.VP.VMCALL<GetQuote>
KVM: TDX: Add new TDVMCALL status code for unsupported subfuncs
KVM: arm64: VHE: Centralize ISBs when returning to host
KVM: arm64: Remove cpacr_clear_set()
KVM: arm64: Remove ad-hoc CPTR manipulation from kvm_hyp_handle_fpsimd()
KVM: arm64: Remove ad-hoc CPTR manipulation from fpsimd_sve_sync()
KVM: arm64: Reorganise CPTR trap manipulation
KVM: arm64: VHE: Synchronize CPTR trap deactivation
KVM: arm64: VHE: Synchronize restore of host debug registers
KVM: arm64: selftests: Close the GIC FD in arch_timer_edge_cases
KVM: arm64: Explicitly treat routing entry type changes as changes
KVM: arm64: nv: Fix tracking of shadow list registers
RISC-V: KVM: Don't treat SBI HFENCE calls as NOPs
RISC-V: KVM: Fix the size parameter check in SBI SFENCE calls
Move enable_ipiv to common x86 so that it can be reused by SVM to control
IPI virtualization when AVIC is enabled. SVM doesn't actually provide a
way to truly disable IPI virtualization, but KVM can get close enough by
skipping the necessary table programming.
Link: https://lore.kernel.org/r/20250611224604.313496-18-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Inhibit AVIC with a new "ID too big" flag if userspace creates a vCPU with
an ID that is too big, but otherwise allow vCPU creation to succeed.
Rejecting KVM_CREATE_VCPU with EINVAL violates KVM's ABI as KVM advertises
that the max vCPU ID is 4095, but disallows creating vCPUs with IDs bigger
than 254 (AVIC) or 511 (x2AVIC).
Alternatively, KVM could advertise an accurate value depending on which
AVIC mode is in use, but that wouldn't really solve the underlying problem,
e.g. would be a breaking change if KVM were to ever try and enable AVIC or
x2AVIC by default.
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-14-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Drop AVIC_HPA_MASK and all its users, the mask is just the 4KiB-aligned
maximum theoretical physical address for x86-64 CPUs, as x86-64 is
currently defined (going beyond PA52 would require an entirely new paging
mode, which would arguably create a new, different architecture).
All usage in KVM masks the result of page_to_phys(), which on x86-64 is
guaranteed to be 4KiB aligned and a legal physical address; if either of
those requirements doesn't hold true, KVM has far bigger problems.
Drop masking the avic_backing_page with
AVIC_PHYSICAL_ID_ENTRY_BACKING_PAGE_MASK for all the same reasons, but
keep the macro even though it's unused in functional code. It's a
distinct architectural define, and having the definition in software
helps visualize the layout of an entry. And to be hyper-paranoid about
MAXPA going beyond 52, add a compile-time assert to ensure the kernel's
maximum supported physical address stays in bounds.
The unnecessary masking in avic_init_vmcb() also incorrectly assumes that
SME's C-bit resides between bits 51:11; that holds true for current CPUs,
but isn't required by AMD's architecture:
In some implementations, the bit used may be a physical address bit
Key word being "may".
Opportunistically use the GENMASK_ULL() version for
AVIC_PHYSICAL_ID_ENTRY_BACKING_PAGE_MASK, which is far more readable
than a set of repeating Fs.
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/20250611224604.313496-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Drop VMCB_AVIC_APIC_BAR_MASK, it's just a regurgitation of the maximum
theoretical 4KiB-aligned physical address, i.e. is not novel in any way,
and its only usage is to mask the default APIC base, which is 4KiB aligned
and (obviously) a legal physical address.
No functional change intended.
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/20250611224604.313496-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
When updating IRTEs in response to a GSI routing or IRQ bypass change,
pass the new/current routing information along with the associated irqfd.
This will allow KVM x86 to harden, simplify, and deduplicate its code.
Since adding/removing a bypass producer is now conveniently protected with
irqfds.lock, i.e. can't run concurrently with kvm_irq_routing_update(),
use the routing information cached in the irqfd instead of looking up
the information in the current GSI routing tables.
Opportunistically convert an existing printk() to pr_info() and put its
string onto a single line (old code that strictly adhered to 80 chars).
Link: https://lore.kernel.org/r/20250611224604.313496-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Move the IRQ mask logic to ioapic.c as KVM's only user is its in-kernel
I/O APIC emulation. In addition to encapsulating more I/O APIC specific
code, trimming down irq_comm.c helps pave the way for removing it entirely.
Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-18-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add a Kconfig to allow building KVM without support for emulating a I/O
APIC, PIC, and PIT, which is desirable for deployments that effectively
don't support a fully in-kernel IRQ chip, i.e. never expect any VMM to
create an in-kernel I/O APIC. E.g. compiling out support eliminates a few
thousand lines of guest-facing code and gives security folks warm fuzzies.
As a bonus, wrapping relevant paths with CONFIG_KVM_IOAPIC #ifdefs makes
it much easier for readers to understand which bits and pieces exist
specifically for fully in-kernel IRQ chips.
Opportunistically convert all two in-kernel uses of __KVM_HAVE_IOAPIC to
CONFIG_KVM_IOAPIC, e.g. rather than add a second #ifdef to generate a stub
for kvm_arch_post_irq_routing_update().
Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-15-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Don't bother clearing the PIT's IRQ line status when destroying the PIT,
as userspace can't possibly rely on KVM to lower the IRQ line in any sane
use case, and it's not at all obvious that clearing the PIT's IRQ line is
correct/desirable in kvm_create_pit()'s error path.
When called from kvm_arch_pre_destroy_vm(), the entire VM is being torn
down and thus {kvm_pic,kvm_ioapic}.irq_states are unreachable.
As for the error path in kvm_create_pit(), the only way the PIT's bit in
irq_states can be set is if userspace raises the associated IRQ before
KVM_CREATE_PIT{2} completes. Forcefully clearing the bit would clobber
userspace's input, nonsensical though that input may be. Not to mention
that no known VMM will continue on if PIT creation fails.
Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-12-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Hardcode the PIT's source IRQ ID to '2' instead of "finding" that bit 2
is always the first available bit in irq_sources_bitmap. Bits 0 and 1 are
set/reserved by kvm_arch_init_vm(), i.e. long before kvm_create_pit() can
be invoked, and KVM allows at most one in-kernel PIT instance, i.e. it's
impossible for the PIT to find a different free bit (there are no other
users of kvm_request_irq_source_id().
Delete the now-defunct irq_sources_bitmap and all its associated code.
Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Drop the superfluous and confusing kvm_set_pic_irq() => kvm_pic_set_irq()
wrapper, and instead wire up ->set() directly to its final destination.
Opportunistically move the declaration kvm_pic_set_irq() to irq.h to
start gathering more of the in-kernel APIC/IO-APIC logic in irq.{c,h}.
No functional change intended.
Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Rename msr_filter_changed() to recalc_msr_intercepts() and drop the
trampoline wrapper now that both SVM and VMX use a filter-agnostic recalc
helper to react to the new userspace filter.
No functional change intended.
Reviewed-by: Xin Li (Intel) <xin@zytor.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250610225737.156318-21-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Set/clear DEBUGCTLMSR_FREEZE_IN_SMM in GUEST_IA32_DEBUGCTL based on the
host's pre-VM-Enter value, i.e. preserve the host's FREEZE_IN_SMM setting
while running the guest. When running with the "default treatment of SMIs"
in effect (the only mode KVM supports), SMIs do not generate a VM-Exit that
is visible to host (non-SMM) software, and instead transitions directly
from VMX non-root to SMM. And critically, DEBUGCTL isn't context switched
by hardware on SMI or RSM, i.e. SMM will run with whatever value was
resident in hardware at the time of the SMI.
Failure to preserve FREEZE_IN_SMM results in the PMU unexpectedly counting
events while the CPU is executing in SMM, which can pollute profiling and
potentially leak information into the guest.
Check for changes in FREEZE_IN_SMM prior to every entry into KVM's inner
run loop, as the bit can be toggled in IRQ context via IPI callback (SMP
function call), by way of /sys/devices/cpu/freeze_on_smi.
Add a field in kvm_x86_ops to communicate which DEBUGCTL bits need to be
preserved, as FREEZE_IN_SMM is only supported and defined for Intel CPUs,
i.e. explicitly checking FREEZE_IN_SMM in common x86 is at best weird, and
at worst could lead to undesirable behavior in the future if AMD CPUs ever
happened to pick up a collision with the bit.
Exempt TDX vCPUs, i.e. protected guests, from the check, as the TDX Module
owns and controls GUEST_IA32_DEBUGCTL.
WARN in SVM if KVM_RUN_LOAD_DEBUGCTL is set, mostly to document that the
lack of handling isn't a KVM bug (TDX already WARNs on any run_flag).
Lastly, explicitly reload GUEST_IA32_DEBUGCTL on a VM-Fail that is missed
by KVM but detected by hardware, i.e. in nested_vmx_restore_host_state().
Doing so avoids the need to track host_debugctl on a per-VMCS basis, as
GUEST_IA32_DEBUGCTL is unconditionally written by prepare_vmcs02() and
load_vmcs12_host_state(). For the VM-Fail case, even though KVM won't
have actually entered the guest, vcpu_enter_guest() will have run with
vmcs02 active and thus could result in vmcs01 being run with a stale value.
Cc: stable@vger.kernel.org
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250610232010.162191-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Let the guest set DEBUGCTL.RTM_DEBUG if RTM is supported according to the
guest CPUID model, as debug support is supposed to be available if RTM is
supported, and there are no known downsides to letting the guest debug RTM
aborts.
Note, there are no known bug reports related to RTM_DEBUG, the primary
motivation is to reduce the probability of breaking existing guests when a
future change adds a missing consistency check on vmcs12.GUEST_DEBUGCTL
(KVM currently lets L2 run with whatever hardware supports; whoops).
Note #2, KVM already emulates DR6.RTM, and doesn't restrict access to
DR7.RTM.
Fixes: 83c529151a ("KVM: x86: expose Intel cpu new features (HLE, RTM) to guest")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250610232010.162191-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Instruct vendor code to load the guest's DR6 into hardware via a new
KVM_RUN flag, and remove kvm_x86_ops.set_dr6(), whose sole purpose was to
load vcpu->arch.dr6 into hardware when DR6 can be read/written directly
by the guest.
Note, TDX already WARNs on any run_flag being set, i.e. will yell if KVM
thinks DR6 needs to be reloaded. TDX vCPUs force KVM_DEBUGREG_AUTO_SWITCH
and never clear the flag, i.e. should never observe KVM_RUN_LOAD_GUEST_DR6.
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250610232010.162191-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Convert kvm_x86_ops.vcpu_run()'s "force_immediate_exit" boolean parameter
into an a generic bitmap so that similar "take action" information can be
passed to vendor code without creating a pile of boolean parameters.
This will allow dropping kvm_x86_ops.set_dr6() in favor of a new flag, and
will also allow for adding similar functionality for re-loading debugctl
in the active VMCS.
Opportunistically massage the TDX WARN and comment to prepare for adding
more run_flags, all of which are expected to be mutually exclusive with
TDX, i.e. should be WARNed on.
No functional change intended.
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250610232010.162191-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add the new TDVMCALL status code TDVMCALL_STATUS_SUBFUNC_UNSUPPORTED and
return it for unimplemented TDVMCALL subfunctions.
Returning TDVMCALL_STATUS_INVALID_OPERAND when a subfunction is not
implemented is vague because TDX guests can't tell the error is due to
the subfunction is not supported or an invalid input of the subfunction.
New GHCI spec adds TDVMCALL_STATUS_SUBFUNC_UNSUPPORTED to avoid the
ambiguity. Use it instead of TDVMCALL_STATUS_INVALID_OPERAND.
Before the change, for common guest implementations, when a TDX guest
receives TDVMCALL_STATUS_INVALID_OPERAND, it has two cases:
1. Some operand is invalid. It could change the operand to another value
retry.
2. The subfunction is not supported.
For case 1, an invalid operand usually means the guest implementation bug.
Since the TDX guest can't tell which case is, the best practice for
handling TDVMCALL_STATUS_INVALID_OPERAND is stopping calling such leaf,
treating the failure as fatal if the TDVMCALL is essential or ignoring
it if the TDVMCALL is optional.
With this change, TDVMCALL_STATUS_SUBFUNC_UNSUPPORTED could be sent to
old TDX guest that do not know about it, but it is expected that the
guest will make the same action as TDVMCALL_STATUS_INVALID_OPERAND.
Currently, no known TDX guest checks TDVMCALL_STATUS_INVALID_OPERAND
specifically; for example Linux just checks for success.
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
[Return it for untrapped KVM_HC_MAP_GPA_RANGE. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
SNP Guest Request uses only exitinfo2 which is a return value from GHCB, has
meaning beyond ioctl and therefore belongs to struct snp_guest_req.
Move exitinfo2 there and remove snp_guest_request_ioctl from the SEV platform
code.
No functional change intended.
Signed-off-by: Alexey Kardashevskiy <aik@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Dionna Glaze <dionnaglaze@google.com>
Link: https://lore.kernel.org/20250611040842.2667262-2-aik@amd.com
Move the VERW clearing before the MONITOR so that VERW doesn't disarm it
and the machine never enters C1.
Original idea by Kim Phillips <kim.phillips@amd.com>.
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Add the required features detection glue to bugs.c et all in order to
support the TSA mitigation.
Co-developed-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
* Avoid using large pages for kernel mappings when PSE is not enumerated
* Avoid ever making indirect calls to TDX assembly helpers
* Fix a FRED single step issue when not using an external debugger
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmhQWhsACgkQaDWVMHDJ
krBOLg/8CQ4VVzSqaE5llP/nYkehQblmWM03UFHaOr8gzzPF4Pi+Aov9SpPu/X5E
k0va6s4SuV2XVuWAyKkzJlPxjKBzp8gmOI0XgerEqKMTEihElmSkb89d5O5EhqqM
OYNHe5dIhfDbESrbUep2HFvSWR9Q5Df1Gt8gMDhBzjxT7PlJF/U/sle2/G33Ydhj
9IJvAyh349sJTF9+N8nVUB9YYNE2L6ozf/o14lDF+SoBytLJWugYUVgAukwrQeII
kjcfLYuUGPLeWZcczFA/mqKYWQ+MJ+2Qx8Rh2P00HQhIAdGGKyoIla2b4Lw9MfAk
xYPuWWjvOyI0IiumdKfMfnKRgHpqC8IuCZSPlooNiB8Mk2Nd+0L3C0z96Ov08cwg
nKgKQVS1E0FGq35S2VzS01pitasgxFsBBQft9fTIfL+EBtYNfm+TK3bAr6Ep9b4M
RqcnE997AkFx2D4AWnBLY3lKMi2XcP6b0GPGSXSJAHQlzPZNquYXPNEzI8jOQ4IH
E0F8f5P26XDvFCX5P7EfV9TjACSPYRxZtHtwN9OQWjFngbK8cMmP94XmORSFyFec
AFBZ5ZcgLVfQmNzjKljTUvfZvpQCNIiC8oADAVlcfUsm2B45cnYwpvsuz8vswmdQ
uDrDa87Mh20d8JRMLIMZ2rh7HrXaB2skdxQ9niS/uv0L8jKNq9o=
=BEsb
-----END PGP SIGNATURE-----
Merge tag 'x86_urgent_for_6.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Dave Hansen:
"This is a pretty scattered set of fixes. The majority of them are
further fixups around the recent ITS mitigations.
The rest don't really have a coherent story:
- Some flavors of Xen PV guests don't support large pages, but the
set_memory.c code assumes all CPUs support them.
Avoid problems with a quick CPU feature check.
- The TDX code has some wrappers to help retry calls to the TDX
module. They use function pointers to assembly functions and the
compiler usually generates direct CALLs. But some new compilers,
plus -Os turned them in to indirect CALLs and the assembly code was
not annotated for indirect calls.
Force inlining of the helper to fix it up.
- Last, a FRED issue showed up when single-stepping. It's fine when
using an external debugger, but was getting stuck returning from a
SIGTRAP handler otherwise.
Clear the FRED 'swevent' bit to ensure that forward progress is
made"
* tag 'x86_urgent_for_6.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Revert "mm/execmem: Unify early execmem_cache behaviour"
x86/its: explicitly manage permissions for ITS pages
x86/its: move its_pages array to struct mod_arch_specific
x86/Kconfig: only enable ROX cache in execmem when STRICT_MODULE_RWX is set
x86/mm/pat: don't collapse pages without PSE set
x86/virt/tdx: Avoid indirect calls to TDX assembly functions
selftests/x86: Add a test to detect infinite SIGTRAP handler loop
x86/fred/signal: Prevent immediate repeat of single step trap on return from SIGTRAP handler
It will be used by other x86 mitigations.
No functional changes.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Make pte_swp_exclusive return bool instead of int. This will better
reflect how pte_swp_exclusive is actually used in the code.
This fixes swap/swapoff problems on Alpha due pte_swp_exclusive not
returning correct values when _PAGE_SWP_EXCLUSIVE bit resides in upper
32-bits of PTE (like on alpha).
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Magnus Lindholm <linmag7@gmail.com>
Cc: Sam James <sam@gentoo.org>
Link: https://lore.kernel.org/lkml/20250218175735.19882-2-linmag7@gmail.com/
Link: https://lore.kernel.org/lkml/20250602041118.GA2675383@ZenIV/
[ Applied as the 'sed' script Al suggested - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The of pages with ITS thunks allocated for modules are tracked by an
array in 'struct module'.
Since this is very architecture specific data structure, move it to
'struct mod_arch_specific'.
No functional changes.
Fixes: 872df34d7c ("x86/its: Use dynamic thunks for indirect branches")
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20250603111446.2609381-4-rppt@kernel.org
telemetry_raw_read_events() was added by the commit 378f956e3f
("platform/x86: Add Intel Telemetry Core Driver") in 2016 but has
remained unused.
Remove it.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Link: https://lore.kernel.org/r/20250608012512.377134-4-linux@treblig.org
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
The functions:
- telemetry_get_sampling_period()
- telemetry_set_sampling_period()
were both added by the commit 378f956e3f ("platform/x86: Add Intel
Telemetry Core Driver") in 2016 but have remained unused.
They're each a tiny wrapper that is the only caller through a similarly
named function pointer, and for each function pointer there's a 'def'
empty implementation and a plt implementation.
Remove all of those components for each function.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Link: https://lore.kernel.org/r/20250608012512.377134-3-linux@treblig.org
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
The functions:
- telemetry_add_events()
- telemetry_update_events()
- telemetry_reset_events()
- telemetry_get_eventconfig()
were all added by the commit 378f956e3f ("platform/x86: Add Intel
Telemetry Core Driver") in 2016 but have remained unused.
They're each a tiny wrapper that is the only caller through a similarly
named function pointer, and for each function pointer there's a 'def'
empty implementation and a plt implementation.
Remove all of those components for each function.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Link: https://lore.kernel.org/r/20250608012512.377134-2-linux@treblig.org
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Two 'static inline' TDX helper functions (sc_retry() and
sc_retry_prerr()) take function pointer arguments which refer to
assembly functions. Normally, the compiler inlines the TDX helper,
realizes that the function pointer targets are completely static --
thus can be resolved at compile time -- and generates direct call
instructions.
But, other times (like when CONFIG_CC_OPTIMIZE_FOR_SIZE=y), the
compiler declines to inline the helpers and will instead generate
indirect call instructions.
Indirect calls to assembly functions require special annotation (for
various Control Flow Integrity mechanisms). But TDX assembly
functions lack the special annotations and can only be called
directly.
Annotate both the helpers as '__always_inline' to prod the compiler
into maintaining the direct calls. There is no guarantee here, but
Peter has volunteered to report the compiler bug if this assumption
ever breaks[1].
Fixes: 1e66a7e275 ("x86/virt/tdx: Handle SEAMCALL no entropy error in common code")
Fixes: df01f5ae07 ("x86/virt/tdx: Add SEAMCALL error printing for module initialization")
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/lkml/20250605145914.GW39944@noisy.programming.kicks-ass.net/ [1]
Link: https://lore.kernel.org/all/20250606130737.30713-1-kai.huang%40intel.com
Clear the software event flag in the augmented SS to prevent immediate
repeat of single step trap on return from SIGTRAP handler if the trap
flag (TF) is set without an external debugger attached.
Following is a typical single-stepping flow for a user process:
1) The user process is prepared for single-stepping by setting
RFLAGS.TF = 1.
2) When any instruction in user space completes, a #DB is triggered.
3) The kernel handles the #DB and returns to user space, invoking the
SIGTRAP handler with RFLAGS.TF = 0.
4) After the SIGTRAP handler finishes, the user process performs a
sigreturn syscall, restoring the original state, including
RFLAGS.TF = 1.
5) Goto step 2.
According to the FRED specification:
A) Bit 17 in the augmented SS is designated as the software event
flag, which is set to 1 for FRED event delivery of SYSCALL,
SYSENTER, or INT n.
B) If bit 17 of the augmented SS is 1 and ERETU would result in
RFLAGS.TF = 1, a single-step trap will be pending upon completion
of ERETU.
In step 4) above, the software event flag is set upon the sigreturn
syscall, and its corresponding ERETU would restore RFLAGS.TF = 1.
This combination causes a pending single-step trap upon completion of
ERETU. Therefore, another #DB is triggered before any user space
instruction is executed, which leads to an infinite loop in which the
SIGTRAP handler keeps being invoked on the same user space IP.
Fixes: 14619d912b ("x86/fred: FRED entry/exit and dispatch code")
Suggested-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Cc:stable@vger.kernel.org
Link: https://lore.kernel.org/all/20250609084054.2083189-2-xin%40zytor.com
- Cure IO bitmap inconsistencies
A failed fork cleans up all resources of the newly created thread via
exit_thread(). exit_thread() invokes io_bitmap_exit() which does the IO
bitmap cleanups, which unfortunately assume that the cleanup is related
to the current task, which is obviously bogus. Make it work correctly
- A lockdep fix in the resctrl code removed the clearing of the command
buffer in two places, which keeps stale error messages around. Bring
them back.
- Remove unused trace events
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmhFO0MTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoZYqEADAB+3WsgT9WZlGfksOgcL05uD0sGKP
mrGkrVuyGLQaMK+Vv7F3Vll0khe3nMJ1DzOOdmejerzEi2yb7CaZHIT1esujVHAk
kieR/D7t5zUvPislBSUxgsmNjqwoCDAwe5vKTSM0p20T7CWkenAzCcA4cUFQfGWo
2TqJWLDgdlvOk6qnaUTm0tzGjeYD9PegoUEyz5wgrf+YjsiKIWJLRbDMb3tXsOpy
La2vh2APV6Tt/PMalQiedQdi3VAf5yD+3HpvlXJ6b1BhJy3iPO6iMQPp1kwsKFjm
dzFT0IqM7lOERjMbhhnntLWe9z0vyezwIEj6dPlCf6MG8lTJ3io0bWJ2LwQEzt3q
RkFJO7wc7QqJoIGsZARGREA99Zkh6/qTh8QovbjpHWoQBrMIZ9mxw9gJ6QR/vjNa
dVhIEi4mOBNw+bq4BmYExC8j60qpGBrEi+lJrGWZoEBWHqhUY3I38DU1oH97mn/s
qL1iZo1TOGyxGuHIN79fKWduEpY+tyvhLxB2zW5m9+B26XRXZBec1lZ+qfIpPo/f
ZuWIbNYsKKPgpHZLSb21RJ/ost+sTr4iEb3XuguP+YAxfjOVJR7CAhsfWRTDNhPS
wGiyPAGmdbQrx/2Wn4vZnznOm/OH4aQX5cxbNR8TPknHJzYRmIY3VTDjDr8Xqxx+
R4/IRQOmcJ0Q/Q==
=x7j0
-----END PGP SIGNATURE-----
Merge tag 'x86-urgent-2025-06-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
"A small set of x86 fixes:
- Cure IO bitmap inconsistencies
A failed fork cleans up all resources of the newly created thread
via exit_thread(). exit_thread() invokes io_bitmap_exit() which
does the IO bitmap cleanups, which unfortunately assume that the
cleanup is related to the current task, which is obviously bogus.
Make it work correctly
- A lockdep fix in the resctrl code removed the clearing of the
command buffer in two places, which keeps stale error messages
around. Bring them back.
- Remove unused trace events"
* tag 'x86-urgent-2025-06-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
fs/resctrl: Restore the rdt_last_cmd_clear() calls after acquiring rdtgroup_mutex
x86/iopl: Cure TIF_IO_BITMAP inconsistencies
x86/fpu: Remove unused trace events
-----BEGIN PGP SIGNATURE-----
iQFHBAABCgAxFiEEIbPD0id6easf0xsudhRwX5BBoF4FAmg+jmETHHdlaS5saXVA
a2VybmVsLm9yZwAKCRB2FHBfkEGgXiWaCACjYSQcCXW2nnZuWUnGMJq8HD5XGBAH
tNYzOyp2Y4bXEJzfmbHv8UpJynGr3IFKybCnhm0uAQZCmiR5k4CfMvjPQXcJu9LK
7yUI/dTGrRGG7f3NClWK2vXg7ATqzRGiPuPDk2lDcP04aQQWaUMDYe5SXIgcqKyZ
cm2OVHapHGbQ7wA+xXGQcUBb6VJ5+BrQUVOqaEQyl4LURvjaQcn7rVDS0SmEi8gq
42+KDVd8uWYos5dT57HIq9UI5og3PeTvAvHsx26eX8JWNqwXLgvxRH83kstK+GWY
uG3sOm5yRbJvErLpJHnyBOlXDFNw2EBeLC1VyhdJXBR8RabgI+H/mrY3
=4bTC
-----END PGP SIGNATURE-----
Merge tag 'hyperv-next-signed-20250602' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull hyperv updates from Wei Liu:
- Support for Virtual Trust Level (VTL) on arm64 (Roman Kisel)
- Fixes for Hyper-V UIO driver (Long Li)
- Fixes for Hyper-V PCI driver (Michael Kelley)
- Select CONFIG_SYSFB for Hyper-V guests (Michael Kelley)
- Documentation updates for Hyper-V VMBus (Michael Kelley)
- Enhance logging for hv_kvp_daemon (Shradha Gupta)
* tag 'hyperv-next-signed-20250602' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: (23 commits)
Drivers: hv: Always select CONFIG_SYSFB for Hyper-V guests
Drivers: hv: vmbus: Add comments about races with "channels" sysfs dir
Documentation: hyperv: Update VMBus doc with new features and info
PCI: hv: Remove unnecessary flex array in struct pci_packet
Drivers: hv: Remove hv_alloc/free_* helpers
Drivers: hv: Use kzalloc for panic page allocation
uio_hv_generic: Align ring size to system page
uio_hv_generic: Use correct size for interrupt and monitor pages
Drivers: hv: Allocate interrupt and monitor pages aligned to system page boundary
arch/x86: Provide the CPU number in the wakeup AP callback
x86/hyperv: Fix APIC ID and VP index confusion in hv_snp_boot_ap()
PCI: hv: Get vPCI MSI IRQ domain from DeviceTree
ACPI: irq: Introduce acpi_get_gsi_dispatcher()
Drivers: hv: vmbus: Introduce hv_get_vmbus_root_device()
Drivers: hv: vmbus: Get the IRQ number from DeviceTree
dt-bindings: microsoft,vmbus: Add interrupt and DMA coherence properties
arm64, x86: hyperv: Report the VTL the system boots in
arm64: hyperv: Initialize the Virtual Trust Level field
Drivers: hv: Provide arch-neutral implementation of get_vtl()
Drivers: hv: Enable VTL mode for arm64
...
* Clean up locking of all vCPUs for a VM by using the *_nest_lock()
family of functions, and move duplicated code to virt/kvm/.
kernel/ patches acked by Peter Zijlstra.
* Add MGLRU support to the access tracking perf test.
ARM fixes:
* Make the irqbypass hooks resilient to changes in the GSI<->MSI
routing, avoiding behind stale vLPI mappings being left behind. The
fix is to resolve the VGIC IRQ using the host IRQ (which is stable)
and nuking the vLPI mapping upon a routing change.
* Close another VGIC race where vCPU creation races with VGIC
creation, leading to in-flight vCPUs entering the kernel w/o private
IRQs allocated.
* Fix a build issue triggered by the recently added workaround for
Ampere's AC04_CPU_23 erratum.
* Correctly sign-extend the VA when emulating a TLBI instruction
potentially targeting a VNCR mapping.
* Avoid dereferencing a NULL pointer in the VGIC debug code, which can
happen if the device doesn't have any mapping yet.
s390:
* Fix interaction between some filesystems and Secure Execution
* Some cleanups and refactorings, preparing for an upcoming big series
x86:
* Wait for target vCPU to acknowledge KVM_REQ_UPDATE_PROTECTED_GUEST_STATE to
fix a race between AP destroy and VMRUN.
* Decrypt and dump the VMSA in dump_vmcb() if debugging enabled for the VM.
* Refine and harden handling of spurious faults.
* Add support for ALLOWED_SEV_FEATURES.
* Add #VMGEXIT to the set of handlers special cased for CONFIG_RETPOLINE=y.
* Treat DEBUGCTL[5:2] as reserved to pave the way for virtualizing features
that utilize those bits.
* Don't account temporary allocations in sev_send_update_data().
* Add support for KVM_CAP_X86_BUS_LOCK_EXIT on SVM, via Bus Lock Threshold.
* Unify virtualization of IBRS on nested VM-Exit, and cross-vCPU IBPB, between
SVM and VMX.
* Advertise support to userspace for WRMSRNS and PREFETCHI.
* Rescan I/O APIC routes after handling EOI that needed to be intercepted due
to the old/previous routing, but not the new/current routing.
* Add a module param to control and enumerate support for device posted
interrupts.
* Fix a potential overflow with nested virt on Intel systems running 32-bit kernels.
* Flush shadow VMCSes on emergency reboot.
* Add support for SNP to the various SEV selftests.
* Add a selftest to verify fastops instructions via forced emulation.
* Refine and optimize KVM's software processing of the posted interrupt bitmap, and share
the harvesting code between KVM and the kernel's Posted MSI handler
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmg9TjwUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroOUxQf7B7nnWqIKd7jSkGzSD6YsSX9TXktr
2tJIOfWM3zNYg5GRCidg+m4Y5+DqQWd3Hi5hH2P9wUw7RNuOjOFsDe+y0VBr8ysE
ve39t/yp+mYalNmHVFl8s3dBDgrIeGKiz+Wgw3zCQIBZ18rJE1dREhv37RlYZ3a2
wSvuObe8sVpCTyKIowDs1xUi7qJUBoopMSuqfleSHawRrcgCpV99U8/KNFF5plLH
7fXOBAHHniVCVc+mqQN2wxtVJDhST+U3TaU4GwlKy9Yevr+iibdOXffveeIgNEU4
D6q1F2zKp6UdV3+p8hxyaTTbiCVDqsp9WOgY/0I/f+CddYn0WVZgOlR+ow==
=mYFL
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull more kvm updates from Paolo Bonzini:
Generic:
- Clean up locking of all vCPUs for a VM by using the *_nest_lock()
family of functions, and move duplicated code to virt/kvm/. kernel/
patches acked by Peter Zijlstra
- Add MGLRU support to the access tracking perf test
ARM fixes:
- Make the irqbypass hooks resilient to changes in the GSI<->MSI
routing, avoiding behind stale vLPI mappings being left behind. The
fix is to resolve the VGIC IRQ using the host IRQ (which is stable)
and nuking the vLPI mapping upon a routing change
- Close another VGIC race where vCPU creation races with VGIC
creation, leading to in-flight vCPUs entering the kernel w/o
private IRQs allocated
- Fix a build issue triggered by the recently added workaround for
Ampere's AC04_CPU_23 erratum
- Correctly sign-extend the VA when emulating a TLBI instruction
potentially targeting a VNCR mapping
- Avoid dereferencing a NULL pointer in the VGIC debug code, which
can happen if the device doesn't have any mapping yet
s390:
- Fix interaction between some filesystems and Secure Execution
- Some cleanups and refactorings, preparing for an upcoming big
series
x86:
- Wait for target vCPU to ack KVM_REQ_UPDATE_PROTECTED_GUEST_STATE
to fix a race between AP destroy and VMRUN
- Decrypt and dump the VMSA in dump_vmcb() if debugging enabled for
the VM
- Refine and harden handling of spurious faults
- Add support for ALLOWED_SEV_FEATURES
- Add #VMGEXIT to the set of handlers special cased for
CONFIG_RETPOLINE=y
- Treat DEBUGCTL[5:2] as reserved to pave the way for virtualizing
features that utilize those bits
- Don't account temporary allocations in sev_send_update_data()
- Add support for KVM_CAP_X86_BUS_LOCK_EXIT on SVM, via Bus Lock
Threshold
- Unify virtualization of IBRS on nested VM-Exit, and cross-vCPU
IBPB, between SVM and VMX
- Advertise support to userspace for WRMSRNS and PREFETCHI
- Rescan I/O APIC routes after handling EOI that needed to be
intercepted due to the old/previous routing, but not the
new/current routing
- Add a module param to control and enumerate support for device
posted interrupts
- Fix a potential overflow with nested virt on Intel systems running
32-bit kernels
- Flush shadow VMCSes on emergency reboot
- Add support for SNP to the various SEV selftests
- Add a selftest to verify fastops instructions via forced emulation
- Refine and optimize KVM's software processing of the posted
interrupt bitmap, and share the harvesting code between KVM and the
kernel's Posted MSI handler"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (93 commits)
rtmutex_api: provide correct extern functions
KVM: arm64: vgic-debug: Avoid dereferencing NULL ITE pointer
KVM: arm64: vgic-init: Plug vCPU vs. VGIC creation race
KVM: arm64: Unmap vLPIs affected by changes to GSI routing information
KVM: arm64: Resolve vLPI by host IRQ in vgic_v4_unset_forwarding()
KVM: arm64: Protect vLPI translation with vgic_irq::irq_lock
KVM: arm64: Use lock guard in vgic_v4_set_forwarding()
KVM: arm64: Mask out non-VA bits from TLBI VA* on VNCR invalidation
arm64: sysreg: Drag linux/kconfig.h to work around vdso build issue
KVM: s390: Simplify and move pv code
KVM: s390: Refactor and split some gmap helpers
KVM: s390: Remove unneeded srcu lock
s390: Remove unneeded includes
s390/uv: Improve splitting of large folios that cannot be split while dirty
s390/uv: Always return 0 from s390_wiggle_split_folio() if successful
s390/uv: Don't return 0 from make_hva_secure() if the operation was not successful
rust: add helper for mutex_trylock
RISC-V: KVM: use kvm_trylock_all_vcpus when locking all vCPUs
KVM: arm64: use kvm_trylock_all_vcpus when locking all vCPUs
x86: KVM: SVM: use kvm_lock_all_vcpus instead of a custom implementation
...
semaphore" from Lance Yang enhances the hung task detector. The
detector presently dumps the blocking tasks's stack when it is blocked
on a mutex. Lance's series extends this to semaphores.
- The 2 patch series "nilfs2: improve sanity checks in dirty state
propagation" from Wentao Liang addresses a couple of minor flaws in
nilfs2.
- The 2 patch series "scripts/gdb: Fixes related to lx_per_cpu()" from
Illia Ostapyshyn fixes a couple of issues in the gdb scripts.
- The 9 patch series "Support kdump with LUKS encryption by reusing LUKS
volume keys" from Coiby Xu addresses a usability problem with kdump.
When the dump device is LUKS-encrypted, the kdump kernel may not have
the keys to the encrypted filesystem. A full writeup of this is in the
series [0/N] cover letter.
- The 2 patch series "sysfs: add counters for lockups and stalls" from
Max Kellermann adds /sys/kernel/hardlockup_count and
/sys/kernel/hardlockup_count and /sys/kernel/rcu_stall_count.
- The 3 patch series "fork: Page operation cleanups in the fork code"
from Pasha Tatashin implements a number of code cleanups in fork.c.
- The 3 patch series "scripts/gdb/symbols: determine KASLR offset on
s390 during early boot" from Ilya Leoshkevich fixes some s390 issues in
the gdb scripts.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaDuCvQAKCRDdBJ7gKXxA
jrkxAQCnFAp/uK9ckkbN4nfpJ0+OMY36C+A+dawSDtuRsIkXBAEAq3e6MNAUdg5W
Ca0cXdgSIq1Op7ZKEA+66Km6Rfvfow8=
=g45L
-----END PGP SIGNATURE-----
Merge tag 'mm-nonmm-stable-2025-05-31-15-28' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
- "hung_task: extend blocking task stacktrace dump to semaphore" from
Lance Yang enhances the hung task detector.
The detector presently dumps the blocking tasks's stack when it is
blocked on a mutex. Lance's series extends this to semaphores
- "nilfs2: improve sanity checks in dirty state propagation" from
Wentao Liang addresses a couple of minor flaws in nilfs2
- "scripts/gdb: Fixes related to lx_per_cpu()" from Illia Ostapyshyn
fixes a couple of issues in the gdb scripts
- "Support kdump with LUKS encryption by reusing LUKS volume keys" from
Coiby Xu addresses a usability problem with kdump.
When the dump device is LUKS-encrypted, the kdump kernel may not have
the keys to the encrypted filesystem. A full writeup of this is in
the series [0/N] cover letter
- "sysfs: add counters for lockups and stalls" from Max Kellermann adds
/sys/kernel/hardlockup_count and /sys/kernel/hardlockup_count and
/sys/kernel/rcu_stall_count
- "fork: Page operation cleanups in the fork code" from Pasha Tatashin
implements a number of code cleanups in fork.c
- "scripts/gdb/symbols: determine KASLR offset on s390 during early
boot" from Ilya Leoshkevich fixes some s390 issues in the gdb
scripts
* tag 'mm-nonmm-stable-2025-05-31-15-28' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (67 commits)
llist: make llist_add_batch() a static inline
delayacct: remove redundant code and adjust indentation
squashfs: add optional full compressed block caching
crash_dump, nvme: select CONFIGFS_FS as built-in
scripts/gdb/symbols: determine KASLR offset on s390 during early boot
scripts/gdb/symbols: factor out pagination_off()
scripts/gdb/symbols: factor out get_vmlinux()
kernel/panic.c: format kernel-doc comments
mailmap: update and consolidate Casey Connolly's name and email
nilfs2: remove wbc->for_reclaim handling
fork: define a local GFP_VMAP_STACK
fork: check charging success before zeroing stack
fork: clean-up naming of vm_stack/vm_struct variables in vmap stacks code
fork: clean-up ifdef logic around stack allocation
kernel/rcu/tree_stall: add /sys/kernel/rcu_stall_count
kernel/watchdog: add /sys/kernel/{hard,soft}lockup_count
x86/crash: make the page that stores the dm crypt keys inaccessible
x86/crash: pass dm crypt keys to kdump kernel
Revert "x86/mm: Remove unused __set_memory_prot()"
crash_dump: retrieve dm crypt keys in kdump kernel
...
simplifies the act of creating a pte which addresses the first page in a
folio and reduces the amount of plumbing which architecture must
implement to provide this.
- The 8 patch series "Misc folio patches for 6.16" from Matthew Wilcox
is a shower of largely unrelated folio infrastructure changes which
clean things up and better prepare us for future work.
- The 3 patch series "memory,x86,acpi: hotplug memory alignment
advisement" from Gregory Price adds early-init code to prevent x86 from
leaving physical memory unused when physical address regions are not
aligned to memory block size.
- The 2 patch series "mm/compaction: allow more aggressive proactive
compaction" from Michal Clapinski provides some tuning of the (sadly,
hard-coded (more sadly, not auto-tuned)) thresholds for our invokation
of proactive compaction. In a simple test case, the reduction of a guest
VM's memory consumption was dramatic.
- The 8 patch series "Minor cleanups and improvements to swap freeing
code" from Kemeng Shi provides some code cleaups and a small efficiency
improvement to this part of our swap handling code.
- The 6 patch series "ptrace: introduce PTRACE_SET_SYSCALL_INFO API"
from Dmitry Levin adds the ability for a ptracer to modify syscalls
arguments. At this time we can alter only "system call information that
are used by strace system call tampering, namely, syscall number,
syscall arguments, and syscall return value.
This series should have been incorporated into mm.git's "non-MM"
branch, but I goofed.
- The 3 patch series "fs/proc: extend the PAGEMAP_SCAN ioctl to report
guard regions" from Andrei Vagin extends the info returned by the
PAGEMAP_SCAN ioctl against /proc/pid/pagemap. This permits CRIU to more
efficiently get at the info about guard regions.
- The 2 patch series "Fix parameter passed to page_mapcount_is_type()"
from Gavin Shan implements that fix. No runtime effect is expected
because validate_page_before_insert() happens to fix up this error.
- The 3 patch series "kernel/events/uprobes: uprobe_write_opcode()
rewrite" from David Hildenbrand basically brings uprobe text poking into
the current decade. Remove a bunch of hand-rolled implementation in
favor of using more current facilities.
- The 3 patch series "mm/ptdump: Drop assumption that pxd_val() is u64"
from Anshuman Khandual provides enhancements and generalizations to the
pte dumping code. This might be needed when 128-bit Page Table
Descriptors are enabled for ARM.
- The 12 patch series "Always call constructor for kernel page tables"
from Kevin Brodsky "ensures that the ctor/dtor is always called for
kernel pgtables, as it already is for user pgtables". This permits the
addition of more functionality such as "insert hooks to protect page
tables". This change does result in various architectures performing
unnecesary work, but this is fixed up where it is anticipated to occur.
- The 9 patch series "Rust support for mm_struct, vm_area_struct, and
mmap" from Alice Ryhl adds plumbing to permit Rust access to core MM
structures.
- The 3 patch series "fix incorrectly disallowed anonymous VMA merges"
from Lorenzo Stoakes takes advantage of some VMA merging opportunities
which we've been missing for 15 years.
- The 4 patch series "mm/madvise: batch tlb flushes for MADV_DONTNEED
and MADV_FREE" from SeongJae Park optimizes process_madvise()'s TLB
flushing. Instead of flushing each address range in the provided iovec,
we batch the flushing across all the iovec entries. The syscall's cost
was approximately halved with a microbenchmark which was designed to
load this particular operation.
- The 6 patch series "Track node vacancy to reduce worst case allocation
counts" from Sidhartha Kumar makes the maple tree smarter about its node
preallocation. stress-ng mmap performance increased by single-digit
percentages and the amount of unnecessarily preallocated memory was
dramaticelly reduced.
- The 3 patch series "mm/gup: Minor fix, cleanup and improvements" from
Baoquan He removes a few unnecessary things which Baoquan noted when
reading the code.
- The 3 patch series ""Enhance sysfs handling for memory hotplug in
weighted interleave" from Rakie Kim "enhances the weighted interleave
policy in the memory management subsystem by improving sysfs handling,
fixing memory leaks, and introducing dynamic sysfs updates for memory
hotplug support". Fixes things on error paths which we are unlikely to
hit.
- The 7 patch series "mm/damon: auto-tune DAMOS for NUMA setups
including tiered memory" from SeongJae Park introduces new DAMOS quota
goal metrics which eliminate the manual tuning which is required when
utilizing DAMON for memory tiering.
- The 5 patch series "mm/vmalloc.c: code cleanup and improvements" from
Baoquan He provides cleanups and small efficiency improvements which
Baoquan found via code inspection.
- The 2 patch series "vmscan: enforce mems_effective during demotion"
from Gregory Price "changes reclaim to respect cpuset.mems_effective
during demotion when possible". because "presently, reclaim explicitly
ignores cpuset.mems_effective when demoting, which may cause the cpuset
settings to violated." "This is useful for isolating workloads on a
multi-tenant system from certain classes of memory more consistently."
- The 2 patch series ""Clean up split_huge_pmd_locked() and remove
unnecessary folio pointers" from Gavin Guo provides minor cleanups and
efficiency gains in in the huge page splitting and migrating code.
- The 3 patch series "Use kmem_cache for memcg alloc" from Huan Yang
creates a slab cache for `struct mem_cgroup', yielding improved memory
utilization.
- The 4 patch series "add max arg to swappiness in memory.reclaim and
lru_gen" from Zhongkun He adds a new "max" argument to the "swappiness="
argument for memory.reclaim MGLRU's lru_gen. This directs proactive
reclaim to reclaim from only anon folios rather than file-backed folios.
- The 17 patch series "kexec: introduce Kexec HandOver (KHO)" from Mike
Rapoport is the first step on the path to permitting the kernel to
maintain existing VMs while replacing the host kernel via file-based
kexec. At this time only memblock's reserve_mem is preserved.
- The 7 patch series "mm: Introduce for_each_valid_pfn()" from David
Woodhouse provides and uses a smarter way of looping over a pfn range.
By skipping ranges of invalid pfns.
- The 2 patch series "sched/numa: Skip VMA scanning on memory pinned to
one NUMA node via cpuset.mems" from Libo Chen removes a lot of pointless
VMA scanning when a task is pinned a single NUMA mode. Dramatic
performance benefits were seen in some real world cases.
- The 2 patch series "JFS: Implement migrate_folio for
jfs_metapage_aops" from Shivank Garg addresses a warning which occurs
during memory compaction when using JFS.
- The 4 patch series "move all VMA allocation, freeing and duplication
logic to mm" from Lorenzo Stoakes moves some VMA code from kernel/fork.c
into the more appropriate mm/vma.c.
- The 6 patch series "mm, swap: clean up swap cache mapping helper" from
Kairui Song provides code consolidation and cleanups related to the
folio_index() function.
- The 2 patch series "mm/gup: Cleanup memfd_pin_folios()" from Vishal
Moola does that.
- The 8 patch series "memcg: Fix test_memcg_min/low test failures" from
Waiman Long addresses some bogus failures which are being reported by
the test_memcontrol selftest.
- The 3 patch series "eliminate mmap() retry merge, add .mmap_prepare
hook" from Lorenzo Stoakes commences the deprecation of
file_operations.mmap() in favor of the new
file_operations.mmap_prepare(). The latter is more restrictive and
prevents drivers from messing with things in ways which, amongst other
problems, may defeat VMA merging.
- The 4 patch series "memcg: decouple memcg and objcg stocks"" from
Shakeel Butt decouples the per-cpu memcg charge cache from the objcg's
one. This is a step along the way to making memcg and objcg charging
NMI-safe, which is a BPF requirement.
- The 6 patch series "mm/damon: minor fixups and improvements for code,
tests, and documents" from SeongJae Park is "yet another batch of
miscellaneous DAMON changes. Fix and improve minor problems in code,
tests and documents."
- The 7 patch series "memcg: make memcg stats irq safe" from Shakeel
Butt converts memcg stats to be irq safe. Another step along the way to
making memcg charging and stats updates NMI-safe, a BPF requirement.
- The 4 patch series "Let unmap_hugepage_range() and several related
functions take folio instead of page" from Fan Ni provides folio
conversions in the hugetlb code.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaDt5qgAKCRDdBJ7gKXxA
ju6XAP9nTiSfRz8Cz1n5LJZpFKEGzLpSihCYyR6P3o1L9oe3mwEAlZ5+XAwk2I5x
Qqb/UGMEpilyre1PayQqOnct3aSL9Ao=
=tYYm
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2025-05-31-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- "Add folio_mk_pte()" from Matthew Wilcox simplifies the act of
creating a pte which addresses the first page in a folio and reduces
the amount of plumbing which architecture must implement to provide
this.
- "Misc folio patches for 6.16" from Matthew Wilcox is a shower of
largely unrelated folio infrastructure changes which clean things up
and better prepare us for future work.
- "memory,x86,acpi: hotplug memory alignment advisement" from Gregory
Price adds early-init code to prevent x86 from leaving physical
memory unused when physical address regions are not aligned to memory
block size.
- "mm/compaction: allow more aggressive proactive compaction" from
Michal Clapinski provides some tuning of the (sadly, hard-coded (more
sadly, not auto-tuned)) thresholds for our invokation of proactive
compaction. In a simple test case, the reduction of a guest VM's
memory consumption was dramatic.
- "Minor cleanups and improvements to swap freeing code" from Kemeng
Shi provides some code cleaups and a small efficiency improvement to
this part of our swap handling code.
- "ptrace: introduce PTRACE_SET_SYSCALL_INFO API" from Dmitry Levin
adds the ability for a ptracer to modify syscalls arguments. At this
time we can alter only "system call information that are used by
strace system call tampering, namely, syscall number, syscall
arguments, and syscall return value.
This series should have been incorporated into mm.git's "non-MM"
branch, but I goofed.
- "fs/proc: extend the PAGEMAP_SCAN ioctl to report guard regions" from
Andrei Vagin extends the info returned by the PAGEMAP_SCAN ioctl
against /proc/pid/pagemap. This permits CRIU to more efficiently get
at the info about guard regions.
- "Fix parameter passed to page_mapcount_is_type()" from Gavin Shan
implements that fix. No runtime effect is expected because
validate_page_before_insert() happens to fix up this error.
- "kernel/events/uprobes: uprobe_write_opcode() rewrite" from David
Hildenbrand basically brings uprobe text poking into the current
decade. Remove a bunch of hand-rolled implementation in favor of
using more current facilities.
- "mm/ptdump: Drop assumption that pxd_val() is u64" from Anshuman
Khandual provides enhancements and generalizations to the pte dumping
code. This might be needed when 128-bit Page Table Descriptors are
enabled for ARM.
- "Always call constructor for kernel page tables" from Kevin Brodsky
ensures that the ctor/dtor is always called for kernel pgtables, as
it already is for user pgtables.
This permits the addition of more functionality such as "insert hooks
to protect page tables". This change does result in various
architectures performing unnecesary work, but this is fixed up where
it is anticipated to occur.
- "Rust support for mm_struct, vm_area_struct, and mmap" from Alice
Ryhl adds plumbing to permit Rust access to core MM structures.
- "fix incorrectly disallowed anonymous VMA merges" from Lorenzo
Stoakes takes advantage of some VMA merging opportunities which we've
been missing for 15 years.
- "mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE" from
SeongJae Park optimizes process_madvise()'s TLB flushing.
Instead of flushing each address range in the provided iovec, we
batch the flushing across all the iovec entries. The syscall's cost
was approximately halved with a microbenchmark which was designed to
load this particular operation.
- "Track node vacancy to reduce worst case allocation counts" from
Sidhartha Kumar makes the maple tree smarter about its node
preallocation.
stress-ng mmap performance increased by single-digit percentages and
the amount of unnecessarily preallocated memory was dramaticelly
reduced.
- "mm/gup: Minor fix, cleanup and improvements" from Baoquan He removes
a few unnecessary things which Baoquan noted when reading the code.
- ""Enhance sysfs handling for memory hotplug in weighted interleave"
from Rakie Kim "enhances the weighted interleave policy in the memory
management subsystem by improving sysfs handling, fixing memory
leaks, and introducing dynamic sysfs updates for memory hotplug
support". Fixes things on error paths which we are unlikely to hit.
- "mm/damon: auto-tune DAMOS for NUMA setups including tiered memory"
from SeongJae Park introduces new DAMOS quota goal metrics which
eliminate the manual tuning which is required when utilizing DAMON
for memory tiering.
- "mm/vmalloc.c: code cleanup and improvements" from Baoquan He
provides cleanups and small efficiency improvements which Baoquan
found via code inspection.
- "vmscan: enforce mems_effective during demotion" from Gregory Price
changes reclaim to respect cpuset.mems_effective during demotion when
possible. because presently, reclaim explicitly ignores
cpuset.mems_effective when demoting, which may cause the cpuset
settings to violated.
This is useful for isolating workloads on a multi-tenant system from
certain classes of memory more consistently.
- "Clean up split_huge_pmd_locked() and remove unnecessary folio
pointers" from Gavin Guo provides minor cleanups and efficiency gains
in in the huge page splitting and migrating code.
- "Use kmem_cache for memcg alloc" from Huan Yang creates a slab cache
for `struct mem_cgroup', yielding improved memory utilization.
- "add max arg to swappiness in memory.reclaim and lru_gen" from
Zhongkun He adds a new "max" argument to the "swappiness=" argument
for memory.reclaim MGLRU's lru_gen.
This directs proactive reclaim to reclaim from only anon folios
rather than file-backed folios.
- "kexec: introduce Kexec HandOver (KHO)" from Mike Rapoport is the
first step on the path to permitting the kernel to maintain existing
VMs while replacing the host kernel via file-based kexec. At this
time only memblock's reserve_mem is preserved.
- "mm: Introduce for_each_valid_pfn()" from David Woodhouse provides
and uses a smarter way of looping over a pfn range. By skipping
ranges of invalid pfns.
- "sched/numa: Skip VMA scanning on memory pinned to one NUMA node via
cpuset.mems" from Libo Chen removes a lot of pointless VMA scanning
when a task is pinned a single NUMA mode.
Dramatic performance benefits were seen in some real world cases.
- "JFS: Implement migrate_folio for jfs_metapage_aops" from Shivank
Garg addresses a warning which occurs during memory compaction when
using JFS.
- "move all VMA allocation, freeing and duplication logic to mm" from
Lorenzo Stoakes moves some VMA code from kernel/fork.c into the more
appropriate mm/vma.c.
- "mm, swap: clean up swap cache mapping helper" from Kairui Song
provides code consolidation and cleanups related to the folio_index()
function.
- "mm/gup: Cleanup memfd_pin_folios()" from Vishal Moola does that.
- "memcg: Fix test_memcg_min/low test failures" from Waiman Long
addresses some bogus failures which are being reported by the
test_memcontrol selftest.
- "eliminate mmap() retry merge, add .mmap_prepare hook" from Lorenzo
Stoakes commences the deprecation of file_operations.mmap() in favor
of the new file_operations.mmap_prepare().
The latter is more restrictive and prevents drivers from messing with
things in ways which, amongst other problems, may defeat VMA merging.
- "memcg: decouple memcg and objcg stocks"" from Shakeel Butt decouples
the per-cpu memcg charge cache from the objcg's one.
This is a step along the way to making memcg and objcg charging
NMI-safe, which is a BPF requirement.
- "mm/damon: minor fixups and improvements for code, tests, and
documents" from SeongJae Park is yet another batch of miscellaneous
DAMON changes. Fix and improve minor problems in code, tests and
documents.
- "memcg: make memcg stats irq safe" from Shakeel Butt converts memcg
stats to be irq safe. Another step along the way to making memcg
charging and stats updates NMI-safe, a BPF requirement.
- "Let unmap_hugepage_range() and several related functions take folio
instead of page" from Fan Ni provides folio conversions in the
hugetlb code.
* tag 'mm-stable-2025-05-31-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (285 commits)
mm: pcp: increase pcp->free_count threshold to trigger free_high
mm/hugetlb: convert use of struct page to folio in __unmap_hugepage_range()
mm/hugetlb: refactor __unmap_hugepage_range() to take folio instead of page
mm/hugetlb: refactor unmap_hugepage_range() to take folio instead of page
mm/hugetlb: pass folio instead of page to unmap_ref_private()
memcg: objcg stock trylock without irq disabling
memcg: no stock lock for cpu hot-unplug
memcg: make __mod_memcg_lruvec_state re-entrant safe against irqs
memcg: make count_memcg_events re-entrant safe against irqs
memcg: make mod_memcg_state re-entrant safe against irqs
memcg: move preempt disable to callers of memcg_rstat_updated
memcg: memcg_rstat_updated re-entrant safe against irqs
mm: khugepaged: decouple SHMEM and file folios' collapse
selftests/eventfd: correct test name and improve messages
alloc_tag: check mem_profiling_support in alloc_tag_init
Docs/damon: update titles and brief introductions to explain DAMOS
selftests/damon/_damon_sysfs: read tried regions directories in order
mm/damon/tests/core-kunit: add a test for damos_set_filters_default_reject()
mm/damon/paddr: remove unused variable, folio_list, in damon_pa_stat()
mm/damon/sysfs-schemes: fix wrong comment on damons_sysfs_quota_goal_metric_strs
...
The following trace events are not used and defining them just wastes
memory:
x86_fpu_before_restore
x86_fpu_after_restore
x86_fpu_init_state
Simply remove them.
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Linux Trace Kernel <linux-trace-kernel@vger.kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/all/20250529130138.544ffec4@gandalf.local.home # background
Link: https://lore.kernel.org/r/20250529131024.7c2ef96f@gandalf.local.home # x86 submission
- Add a general sysfs scheme for publishing "Measurement" values
provided by the architecture's TEE Security Manager. Use it to publish
TDX "Runtime Measurement Registers" ("RTMRs") that either maintain a
hash of stored values (similar to a TPM PCR) or provide statically
provisioned data. These measurements are validated by a relying party.
- Reorganize the drivers/virt/coco/ directory for "host" and "guest"
shared infrastructure.
- Fix a configfs-tsm-report unregister bug
- With CONFIG_TSM_MEASUREMENTS joining CONFIG_TSM_REPORTS and in
anticipation of more shared "TSM" infrastructure arriving, rename the
maintainer entry to "TRUSTED SECURITY MODULE (TSM) INFRASTRUCTURE".
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQSbo+XnGs+rwLz9XGXfioYZHlFsZwUCaDj38gAKCRDfioYZHlFs
Z3EKAQC2K7RgoufBlLv4C79W8IGiUirKKQvtY9aiC7s/W8R4UwEApwV5gXQx2ImN
cEIIkAkVI2h9wJ9LHxyr3R5XfZPBGgA=
=2fTp
-----END PGP SIGNATURE-----
Merge tag 'tsm-for-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/devsec/tsm
Pull trusted security manager (TSM) updates from Dan Williams:
- Add a general sysfs scheme for publishing "Measurement" values
provided by the architecture's TEE Security Manager. Use it to
publish TDX "Runtime Measurement Registers" ("RTMRs") that either
maintain a hash of stored values (similar to a TPM PCR) or provide
statically provisioned data. These measurements are validated by a
relying party.
- Reorganize the drivers/virt/coco/ directory for "host" and "guest"
shared infrastructure.
- Fix a configfs-tsm-report unregister bug
- With CONFIG_TSM_MEASUREMENTS joining CONFIG_TSM_REPORTS and in
anticipation of more shared "TSM" infrastructure arriving, rename the
maintainer entry to "TRUSTED SECURITY MODULE (TSM) INFRASTRUCTURE".
* tag 'tsm-for-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/devsec/tsm:
tsm-mr: Fix init breakage after bin_attrs constification by scoping non-const pointers to init phase
sample/tsm-mr: Fix missing static for sample_report
virt: tdx-guest: Transition to scoped_cond_guard for mutex operations
virt: tdx-guest: Refactor and streamline TDREPORT generation
virt: tdx-guest: Expose TDX MRs as sysfs attributes
x86/tdx: tdx_mcall_get_report0: Return -EBUSY on TDCALL_OPERAND_BUSY error
x86/tdx: Add tdx_mcall_extend_rtmr() interface
tsm-mr: Add tsm-mr sample code
tsm-mr: Add TVM Measurement Register support
configfs-tsm-report: Fix NULL dereference of tsm_ops
coco/guest: Move shared guest CC infrastructure to drivers/virt/coco/guest/
configfs-tsm: Namespace TSM report symbols
* Add large stage-2 mapping (THP) support for non-protected guests when
pKVM is enabled, clawing back some performance.
* Enable nested virtualisation support on systems that support it,
though it is disabled by default.
* Add UBSAN support to the standalone EL2 object used in nVHE/hVHE and
protected modes.
* Large rework of the way KVM tracks architecture features and links
them with the effects of control bits. While this has no functional
impact, it ensures correctness of emulation (the data is automatically
extracted from the published JSON files), and helps dealing with the
evolution of the architecture.
* Significant changes to the way pKVM tracks ownership of pages,
avoiding page table walks by storing the state in the hypervisor's
vmemmap. This in turn enables the THP support described above.
* New selftest checking the pKVM ownership transition rules
* Fixes for FEAT_MTE_ASYNC being accidentally advertised to guests
even if the host didn't have it.
* Fixes for the address translation emulation, which happened to be
rather buggy in some specific contexts.
* Fixes for the PMU emulation in NV contexts, decoupling PMCR_EL0.N
from the number of counters exposed to a guest and addressing a
number of issues in the process.
* Add a new selftest for the SVE host state being corrupted by a
guest.
* Keep HCR_EL2.xMO set at all times for systems running with the
kernel at EL2, ensuring that the window for interrupts is slightly
bigger, and avoiding a pretty bad erratum on the AmpereOne HW.
* Add workaround for AmpereOne's erratum AC04_CPU_23, which suffers
from a pretty bad case of TLB corruption unless accesses to HCR_EL2
are heavily synchronised.
* Add a per-VM, per-ITS debugfs entry to dump the state of the ITS
tables in a human-friendly fashion.
* and the usual random cleanups.
LoongArch:
* Don't flush tlb if the host supports hardware page table walks.
* Add KVM selftests support.
RISC-V:
* Add vector registers to get-reg-list selftest
* VCPU reset related improvements
* Remove scounteren initialization from VCPU reset
* Support VCPU reset from userspace using set_mpstate() ioctl
x86:
* Initial support for TDX in KVM. This finally makes it possible to use the
TDX module to run confidential guests on Intel processors. This is quite a
large series, including support for private page tables (managed by the
TDX module and mirrored in KVM for efficiency), forwarding some TDVMCALLs
to userspace, and handling several special VM exits from the TDX module.
This has been in the works for literally years and it's not really possible
to describe everything here, so I'll defer to the various merge commits
up to and including commit 7bcf7246c4 ("Merge branch 'kvm-tdx-finish-initial'
into HEAD").
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmg02hwUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroNnkwf/db4xeWKSMseCIvBVR+ObDn3LXhwT
hAgmTkDkP1zq9RfbfJSbUA1DXRwfP+f1sWySLMWECkFEQW9fGIJF9fOQRDSXKmhX
158U3+FEt+3jxLRCGFd4zyXAqyY3C8JSkPUyJZxCpUbXtB5tdDNac4rZAXKDULwe
sUi0OW/kFDM2yt369pBGQAGdN+75/oOrYISGOSvMXHxjccNqvveX8MUhpBjYIuuj
73iBWmsfv3vCtam56Racz3C3v44ie498PmWFtnB0R+CVfWfrnUAaRiGWx+egLiBW
dBPDiZywMn++prmphEUFgaStDTQy23JBLJ8+RvHkp+o5GaTISKJB3nedZQ==
=adZU
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"As far as x86 goes this pull request "only" includes TDX host support.
Quotes are appropriate because (at 6k lines and 100+ commits) it is
much bigger than the rest, which will come later this week and
consists mostly of bugfixes and selftests. s390 changes will also come
in the second batch.
ARM:
- Add large stage-2 mapping (THP) support for non-protected guests
when pKVM is enabled, clawing back some performance.
- Enable nested virtualisation support on systems that support it,
though it is disabled by default.
- Add UBSAN support to the standalone EL2 object used in nVHE/hVHE
and protected modes.
- Large rework of the way KVM tracks architecture features and links
them with the effects of control bits. While this has no functional
impact, it ensures correctness of emulation (the data is
automatically extracted from the published JSON files), and helps
dealing with the evolution of the architecture.
- Significant changes to the way pKVM tracks ownership of pages,
avoiding page table walks by storing the state in the hypervisor's
vmemmap. This in turn enables the THP support described above.
- New selftest checking the pKVM ownership transition rules
- Fixes for FEAT_MTE_ASYNC being accidentally advertised to guests
even if the host didn't have it.
- Fixes for the address translation emulation, which happened to be
rather buggy in some specific contexts.
- Fixes for the PMU emulation in NV contexts, decoupling PMCR_EL0.N
from the number of counters exposed to a guest and addressing a
number of issues in the process.
- Add a new selftest for the SVE host state being corrupted by a
guest.
- Keep HCR_EL2.xMO set at all times for systems running with the
kernel at EL2, ensuring that the window for interrupts is slightly
bigger, and avoiding a pretty bad erratum on the AmpereOne HW.
- Add workaround for AmpereOne's erratum AC04_CPU_23, which suffers
from a pretty bad case of TLB corruption unless accesses to HCR_EL2
are heavily synchronised.
- Add a per-VM, per-ITS debugfs entry to dump the state of the ITS
tables in a human-friendly fashion.
- and the usual random cleanups.
LoongArch:
- Don't flush tlb if the host supports hardware page table walks.
- Add KVM selftests support.
RISC-V:
- Add vector registers to get-reg-list selftest
- VCPU reset related improvements
- Remove scounteren initialization from VCPU reset
- Support VCPU reset from userspace using set_mpstate() ioctl
x86:
- Initial support for TDX in KVM.
This finally makes it possible to use the TDX module to run
confidential guests on Intel processors. This is quite a large
series, including support for private page tables (managed by the
TDX module and mirrored in KVM for efficiency), forwarding some
TDVMCALLs to userspace, and handling several special VM exits from
the TDX module.
This has been in the works for literally years and it's not really
possible to describe everything here, so I'll defer to the various
merge commits up to and including commit 7bcf7246c4 ('Merge
branch 'kvm-tdx-finish-initial' into HEAD')"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (248 commits)
x86/tdx: mark tdh_vp_enter() as __flatten
Documentation: virt/kvm: remove unreferenced footnote
RISC-V: KVM: lock the correct mp_state during reset
KVM: arm64: Fix documentation for vgic_its_iter_next()
KVM: arm64: np-guest CMOs with PMD_SIZE fixmap
KVM: arm64: Stage-2 huge mappings for np-guests
KVM: arm64: Add a range to pkvm_mappings
KVM: arm64: Convert pkvm_mappings to interval tree
KVM: arm64: Add a range to __pkvm_host_test_clear_young_guest()
KVM: arm64: Add a range to __pkvm_host_wrprotect_guest()
KVM: arm64: Add a range to __pkvm_host_unshare_guest()
KVM: arm64: Add a range to __pkvm_host_share_guest()
KVM: arm64: Introduce for_each_hyp_page
KVM: arm64: Handle huge mappings for np-guest CMOs
KVM: arm64: nv: Release faulted-in VNCR page from mmu_lock critical section
KVM: arm64: nv: Handle TLBI S1E2 for VNCR invalidation with mmu_lock held
KVM: arm64: nv: Hold mmu_lock when invalidating VNCR SW-TLB before translating
RISC-V: KVM: add KVM_CAP_RISCV_MP_STATE_RESET
RISC-V: KVM: Remove scounteren initialization
KVM: RISC-V: remove unnecessary SBI reset state
...
device emulated by a Secure VM Service Module (SVSM) - a helper module of sorts
which runs at a different privilege level in the SEV-SNP VM stack.
The intent being that a TPM device is emulated by a trusted entity and not by
the untrusted host which is the default assumption in the confidential
computing scenarios.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmg0vOoACgkQEsHwGGHe
VUo44hAAqro0GkugpytbwYnNSnT2q0C7SoYghvKXmrGJiv/hn/5Q+cYh0AfIRsR0
hVRymTuyGSODjqUSOycMcTMbpQMVryjc2X0rK4MAWs9PKyIaLJVZh7vW497i00q5
Nl0kE+HzjZ44kF8udIBsYKS1qFfNyn11eE3xbFVVcHBCee0n775aOXWBup2d+G7u
+BnskBbDuV3umlce6oDrmhmtKF6PfFMSx5E4YTCrXBDEPYwJemxbdw+N6AXY6cme
mSJe8zwqrz+FX8653O6j9DsZgniIY/XhZd3tJ8QECY+DBRW1ldUt+k1zKeAeeVhc
oqLJGtyVd3cmfx+ZhvLh1VIMHpihkAfzl6gZoNTvUP9m3aGKz44xbL3aiA5UDdDo
DpQek8fInKA0iyWg6SHU+phRuvCXVXorcmIegSdYj3hzOc29AyEkcgfeIwzcbAE+
fuWO9SlGFqa/872d8z1AtSISTB6gWh0KqLaphdfmYmkZljNpKC0bs40788ymWOg8
iZPUj1ffP+Jy8/CGiOSmM5sq2Msy3D1JoIwciaIDo80OqsYDEvp7cwrp+w82T0yH
clQu+WZNCFYr6OLgrUq7pr0WDa8h8a5ifMaPjpaYl6V4TvDiAo97t5/yWJXTInzS
1HbckXH4BF4spgNDjX4wbDHhfoRC7ac4IqhGOL5ngI2vIWLijXo=
=tbNN
-----END PGP SIGNATURE-----
Merge tag 'x86_sev_for_v6.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull AMD SEV update from Borislav Petkov:
"Add a virtual TPM driver glue which allows a guest kernel to talk to a
TPM device emulated by a Secure VM Service Module (SVSM) - a helper
module of sorts which runs at a different privilege level in the
SEV-SNP VM stack.
The intent being that a TPM device is emulated by a trusted entity and
not by the untrusted host which is the default assumption in the
confidential computing scenarios"
* tag 'x86_sev_for_v6.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/sev: Register tpm-svsm platform device
tpm: Add SNP SVSM vTPM driver
svsm: Add header with SVSM_VTPM_CMD helpers
x86/sev: Add SVSM vTPM probe/send_command functions
multiple architectures can share the fs API for manipulating their
respective hw resource control implementation. This is the second step
in the work towards sharing the resctrl filesystem interface, the next
one being plugging ARM's MPAM into the aforementioned fs API.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmg0UDwACgkQEsHwGGHe
VUqsZw//SNSNcVHF7Gz2YvHrMXGYQFBETScg6fRWn/pTe3x1NrKEJedzMANXpAIy
1sBAsfDSOyi8MxIZnvMYapLcRdfLGAD+6FQTkyu/IQ3oSsjAxPgrTXornhxUswMY
LUs40hCv/UaEMkg35NVrRqDlT973kWLwA4iDNNnm6IGtrC8qv4EmdJvgVWHyPTjk
D80KA5ta+iPzK4l8noBrqyhUIZN3ZAJVJLrjS3Tx/gabuolLURE6p4IdlF/O6WzC
4NcqUjpwDeFpHpl2M9QJLVEKXHxKz9zZF2gLpT8Eon/ftqqQigBjzsUx/FKp07hZ
fe2AiQsd4gN9GZa3BGX+Lv+bjvyFadARsOoFbY45szuiUb0oceaRYtFF1ihmO0bV
bD4nAROE1kAfZpr/9ZRZT63LfE/DAm9TR1YBsViq1rrJvp4odvL15YbdOlIDHZD3
SmxhTxAokj058MRnhGdHoiMtPa54iw186QYDp0KxLQHLrToBPd7RBtRE8jsYrqrv
2EvwUxYKyO4vtwr9tzr0ZfptZ/DEsGovoTYD5EtlEGjotQUqsmi5Rxx4+SEQuwFw
CKSJ3j73gpxqDXTujjOe9bCeeXJqyEbrIkaWpkiBRwm5of7eFPG3Sw74jaCGvm4L
NM4UufMSDtyVAKfu3HmPkGhujHv0/7h1zYND51aW+GXEroKxy9s=
=eNCr
-----END PGP SIGNATURE-----
Merge tag 'x86_cache_for_v6.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 resource control updates from Borislav Petkov:
"Carve out the resctrl filesystem-related code into fs/resctrl/ so that
multiple architectures can share the fs API for manipulating their
respective hw resource control implementation.
This is the second step in the work towards sharing the resctrl
filesystem interface, the next one being plugging ARM's MPAM into the
aforementioned fs API"
* tag 'x86_cache_for_v6.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
MAINTAINERS: Add reviewers for fs/resctrl
x86,fs/resctrl: Move the resctrl filesystem code to live in /fs/resctrl
x86/resctrl: Always initialise rid field in rdt_resources_all[]
x86/resctrl: Relax some asm #includes
x86/resctrl: Prefer alloc(sizeof(*foo)) idiom in rdt_init_fs_context()
x86/resctrl: Squelch whitespace anomalies in resctrl core code
x86/resctrl: Move pseudo lock prototypes to include/linux/resctrl.h
x86/resctrl: Fix types in resctrl_arch_mon_ctx_{alloc,free}() stubs
x86/resctrl: Move enum resctrl_event_id to resctrl.h
x86/resctrl: Move the filesystem bits to headers visible to fs/resctrl
fs/resctrl: Add boiler plate for external resctrl code
x86/resctrl: Add 'resctrl' to the title of the resctrl documentation
x86/resctrl: Split trace.h
x86/resctrl: Expand the width of domid by replacing mon_data_bits
x86/resctrl: Add end-marker to the resctrl_event_id enum
x86/resctrl: Move is_mba_sc() out of core.c
x86/resctrl: Drop __init/__exit on assorted symbols
x86/resctrl: Resctrl_exit() teardown resctrl but leave the mount point
x86/resctrl: Check all domains are offline in resctrl_exit()
x86/resctrl: Rename resctrl_sched_in() to begin with "resctrl_arch_"
...
- Wait for target vCPU to acknowledge KVM_REQ_UPDATE_PROTECTED_GUEST_STATE to
fix a race between AP destroy and VMRUN.
- Decrypt and dump the VMSA in dump_vmcb() if debugging enabled for the VM.
- Add support for ALLOWED_SEV_FEATURES.
- Add #VMGEXIT to the set of handlers special cased for CONFIG_RETPOLINE=y.
- Treat DEBUGCTL[5:2] as reserved to pave the way for virtualizing features
that utilize those bits.
- Don't account temporary allocations in sev_send_update_data().
- Add support for KVM_CAP_X86_BUS_LOCK_EXIT on SVM, via Bus Lock Threshold.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmgwmwAACgkQOlYIJqCj
N/1pHw//edW/x838POMeeCN8j39NBKErW9yZoQLhMbzogttRvfoba+xYY9zXyRFx
8AXB8+2iLtb7pXUohc0eYN0mNqgD0SnoMLqGfn7nrkJafJSUAJHAoZn1Mdom1M1y
jHvBPbHCMMsgdLV8wpDRqCNWTH+d5W0kcN5WjKwOswVLj1rybVfK7bSLMhvkk1e5
RrOR4Ewf95/Ag2b36L4SvS1yG9fTClmKeGArMXhEXjy2INVSpBYyZMjVtjHiNzU9
TjtB2RSM45O+Zl0T2fZdVW8LFhA6kVeL1v+Oo433CjOQE0LQff3Vl14GCANIlPJU
PiWN/RIKdWkuxStIP3vw02eHzONCcg2GnNHzEyKQ1xW8lmrwzVRdXZzVsc2Dmowb
7qGykBQ+wzoE0sMeZPA0k/QOSqg2vGxUQHjR7720loLV9m9Tu/mJnS9e179GJKgI
e1ArSLwKmHpjwKZqU44IQVTZaxSC4Sg2kI670i21ChPgx8+oVkA6I0LFQXymx7uS
2lbH+ovTlJSlP9fbaJhMwAU2wpSHAyXif/HPjdw2LTH3NdgXzfEnZfTlAWiP65LQ
hnz5HvmUalW3x9kmzRmeDIAkDnAXhyt3ZQMvbNzqlO5AfS+Tqh4Ed5EFP3IrQAzK
HQ+Gi0ip+B84t9Tbi6rfQwzTZEbSSOfYksC7TXqRGhNo/DvHumE=
=k6rK
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-svm-6.16' of https://github.com/kvm-x86/linux into HEAD
KVM SVM changes for 6.16:
- Wait for target vCPU to acknowledge KVM_REQ_UPDATE_PROTECTED_GUEST_STATE to
fix a race between AP destroy and VMRUN.
- Decrypt and dump the VMSA in dump_vmcb() if debugging enabled for the VM.
- Add support for ALLOWED_SEV_FEATURES.
- Add #VMGEXIT to the set of handlers special cased for CONFIG_RETPOLINE=y.
- Treat DEBUGCTL[5:2] as reserved to pave the way for virtualizing features
that utilize those bits.
- Don't account temporary allocations in sev_send_update_data().
- Add support for KVM_CAP_X86_BUS_LOCK_EXIT on SVM, via Bus Lock Threshold.
- Explicitly check MSR load/store list counts to fix a potential overflow on
32-bit kernels.
- Flush shadow VMCSes on emergency reboot.
- Revert mem_enc_ioctl() back to an optional hook, as it's nullified when
SEV or TDX is disabled via Kconfig.
- Macrofy the handling of vt_x86_ops to eliminate a pile of boilerplate code
needed for TDX, and to optimize CONFIG_KVM_INTEL_TDX=n builds.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmgwnAwACgkQOlYIJqCj
N/1J6Q//ZxmA443SLutx1z1F6jLxRLIh4tvOUyWPfyTIedEHXgx9btIXC9ws+Iue
lgsa7Pto+JB0qP4XtoVS+gBrOeF2ZWwp4kyuFM/07tX67q6NKj7mTML/vXcelJNw
bvWA3lqeNW/KZd0KCByOGeOUdfAvruXFrX0yN9/VKGZladcZaj4EGwxcS7prPCzM
YNlRlaRsD4jB4IsmaIvWZ4sj/TNVefTybUvEgAsK6rz2bfYbYzu73ITRng7VHrRI
u+NS3TyIoMbnqFpE8mvGJ/2dvPwvn8u6IzHLsxQRm/TKeEDlvKZxTWMncHqA6i8K
MT79onhT1QHfaMCx6/0DjLJGGoW/9/Q8k4XNkb80Vh7e+KlqkglPYXhUW1AYs0le
7dXkCF04xDlj5E3I971HclX5Dbiec2fLKyOiYnEUSJgPIchPJ802GYg1TOqSyHZY
g7+QS9RKE7iXqQ2+Df7cfXbSTrpmjVmzPK4nTFUZwbcnYhNN8qo1SYo5nv89+Myr
EOdvsk0GzwgCbIbSPejIsrHFUE65iWGP7NcoBtXa7i34inGCgvElPHhXbDPu9xkU
9AZJ+iz3oegrDuARYJI/QSl3hciD31fX8yHKnpz7ccIYGBbQ5w3hNhpREM9X7fL7
pg9dv/oPmTla+NuuV+9fPXdYq/7p8CuwYyFis6DrTWeyKCOQSyE=
=7W6U
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-vmx-6.16' of https://github.com/kvm-x86/linux into HEAD
KVM VMX changes for 6.16:
- Explicitly check MSR load/store list counts to fix a potential overflow on
32-bit kernels.
- Flush shadow VMCSes on emergency reboot.
- Revert mem_enc_ioctl() back to an optional hook, as it's nullified when
SEV or TDX is disabled via Kconfig.
- Macrofy the handling of vt_x86_ops to eliminate a pile of boilerplate code
needed for TDX, and to optimize CONFIG_KVM_INTEL_TDX=n builds.
Refine and optimize KVM's software processing of the PIR, and ultimately share
PIR harvesting code between KVM and the kernel's Posted MSI handler
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmgwmWcACgkQOlYIJqCj
N/3mUw/9HN4OLRqFytu+GjEocl8I7JelJdwCsNMsUwZRnNVnYGDqsjvw8rzqeFmx
RoQ8uNqMd1PqZOgAdN6suLES949ItErbnG2+UlBvZeNgR63K8fyNJaPUzSXh0Kyd
vNNzGschI0txZXNEtMHcIsCuQknU/arlE6v+HOAokb1jxaIZH2h06vrBAj6pLAHO
hbcZPkaQEaFoQhqCbYm015ecJQRPv3IZoW7H1cK5nC4q6QdNo3LPfGqUJwgHV3Wq
hbfS+2J78nTqLhSn7HHE/y5z3R5+ZyPwFQwbqfvjjap5/DW5w8Tltg2Oif597lf2
klBukBkJyfzSdhjaPKb3V23kCNabNyyX7KUDZnW5HCiEu62Lnl0MexXCvFvSvtmy
YDSsXMg3KdtlESwUOaxGjd2J81tx36L3ZvWRaopDLzA2A6KVyVQCSANGOGkKrRzq
Qq3R/frzp1uUVpVDtdyDIO1AujoXkRecdOj1uAIr2XQBg8jx0kveAUyrkXFbQVjK
oNbfRlOiu6/vnXkWqwZ2w/Q0kRRrK7M+vensOZlculqDqxPH+BLWB+dfPqjGikb/
cL01KPu6n/GQJpwAxIbGU4eUIQPAVOcHm3iRaIlRqEoDCs7C8fTRIyDx+cD1vW8O
O9j/r05EV/Ck5XF2ks6bHIK+C3wemNrCvoeFbnO1uicqtdO+Tqw=
=dU1G
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-pir-6.16' of https://github.com/kvm-x86/linux into HEAD
KVM x86 posted interrupt changes for 6.16:
Refine and optimize KVM's software processing of the PIR, and ultimately share
PIR harvesting code between KVM and the kernel's Posted MSI handler
- Unify virtualization of IBRS on nested VM-Exit, and cross-vCPU IBPB, between
SVM and VMX.
- Advertise support to userspace for WRMSRNS and PREFETCHI.
- Rescan I/O APIC routes after handling EOI that needed to be intercepted due
to the old/previous routing, but not the new/current routing.
- Add a module param to control and enumerate support for device posted
interrupts.
- Misc cleanups.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmgwmHsACgkQOlYIJqCj
N/2fYhAAiwKkqQpOWLcGjjezDTnpMqDDbHCffroq0Ttmqfg/cuul1oyZax+9fxBO
203HUi5VKcG7uAGSpLcMFkPUs9hKnaln2lsDaQD+AnGucdj+JKF5p3INCsSYCo9N
LVRjRZWtZocxJwHSHX9gU8om0pJ5fBCBG2+7+7XhWRaIqCpJe5k944JotiiOkgZ4
5sXeITkN2kouFVMI8eD4wQGNXxRxs837SYUlwCnoD3VuuBesOZuEhz/CEL9l8vNY
keXBLPg7bSW53clKfquNKwXDQRephnZaYoexDebUd+OlZphGhTIPh+C75xPQLWSi
aYg6W9XDu3TChf4LPxHnJLwLg/rjeKNQARcxrnb3XLpPAtx3i2cKU8pDPhnd4qn0
+YV5H0dato8bbe+oClGv+oIolM01qfI9SJVoaEhTPu3Rdw9cCQSVFn5t32vG3Vab
FVxX+seV3+XTmVveD4cjiiMbqtNADwZ/PmHNAi9QCl46DgHR++MLfRtjYuGo1koL
QOmCg2fWOFYtQT6XPJqZxp1SHYxuawrB4qcO9FNyxTuMChoslYoLAr3mBUj0DvwL
fdXNof74Ccj8OK9o3uCPOXS92pZz92rz/edy/XmYiCmO+VEwBJFR6IdginZMPX11
UAc4mAC+KkDTvOcKPPIWEXArOsfQMFKefi+bPOeWUx9/nqpcVws=
=Vs9P
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-misc-6.16' of https://github.com/kvm-x86/linux into HEAD
KVM x86 misc changes for 6.16:
- Unify virtualization of IBRS on nested VM-Exit, and cross-vCPU IBPB, between
SVM and VMX.
- Advertise support to userspace for WRMSRNS and PREFETCHI.
- Rescan I/O APIC routes after handling EOI that needed to be intercepted due
to the old/previous routing, but not the new/current routing.
- Add a module param to control and enumerate support for device posted
interrupts.
- Misc cleanups.
other architectures would like to make use of them as well.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmgy+RARHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1jTHA//eIBOFKJdxmhpJ95kzA0tRXue+FUSTAX+
j9rMZOJpR9hnVkr0pBxH8bU42lji4+6b2vujMHaT59n5i2kH5tPFHW1xfEnpbVNw
thSRsFxrUKsNnKPBju0vK9WQs9e1cn2ZvVBbh2SHrATKQrcTCmJroEERZDX0cdnn
VrPeGoc7UUAjxE23c3vnZOzAJDapIc9zPAdfVGRa7xHqlq5grryG+SfHFzT/fd08
5Qwu8TN37jo1HU5v2I4RYIh4Alc1lXtWTfJAc0bks0Cpryu+Et9+N2XANu/VatVw
cve/Ubwdou9m0QxQtUTULttEbMSBB8Ylc7DJ1PdGkhULxNM8cCb+Yx9C8Gk0+8Rf
SP8/ZSVK8EE+3ETP+J8r8VXoXrNgTPSjMeI1s4rZD/b9QpRKE4g/Khu+R9UA8JBV
yuYdy2xkeRbfFVzoGDSVnZItk18MuAoq4hSNqgAxl9/S33HWG84KHQAnjzixCqb4
9Ai7n3/FBEe1edLJXKoqWK96mTa5P/vpGjMnL8wQ0rAnSYI+V2OSwPpZ9HHviw3g
qYYMqsmiU6ChbfcUnuub/YwdJFdRieVSOa7wh3H6mfKAuakpS0At8fIyD5mBtFtA
/qeSD9INII/guT1gdTgqGsirXeObbmNpC+HJjz8hRvsoP6hdoT2L/UZsUH89LcDl
qd8MKeV1Kew=
=xi0h
-----END PGP SIGNATURE-----
Merge tag 'x86-debug-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 debug updates from Ingo Molnar:
"Move the x86 page fault tracepoints to generic code, because other
architectures would like to make use of them as well"
* tag 'x86-debug-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/tracing, x86/mm: Move page fault tracepoints to generic
x86/tracing, x86/mm: Remove redundant trace_pagefault_key
Boot code changes:
- A large series of changes to reorganize the x86 boot code into a better isolated
and easier to maintain base of PIC early startup code in arch/x86/boot/startup/,
by Ard Biesheuvel.
Motivation & background:
| Since commit
|
| c88d71508e ("x86/boot/64: Rewrite startup_64() in C")
|
| dated Jun 6 2017, we have been using C code on the boot path in a way
| that is not supported by the toolchain, i.e., to execute non-PIC C
| code from a mapping of memory that is different from the one provided
| to the linker. It should have been obvious at the time that this was a
| bad idea, given the need to sprinkle fixup_pointer() calls left and
| right to manipulate global variables (including non-pointer variables)
| without crashing.
|
| This C startup code has been expanding, and in particular, the SEV-SNP
| startup code has been expanding over the past couple of years, and
| grown many of these warts, where the C code needs to use special
| annotations or helpers to access global objects.
This tree includes the first phase of this work-in-progress x86 boot code
reorganization.
Scalability enhancements and micro-optimizations:
- Improve code-patching scalability (Eric Dumazet)
- Remove MFENCEs for X86_BUG_CLFLUSH_MONITOR (Andrew Cooper)
CPU features enumeration updates:
- Thorough reorganization and cleanup of CPUID parsing APIs (Ahmed S. Darwish)
- Fix, refactor and clean up the cacheinfo code (Ahmed S. Darwish, Thomas Gleixner)
- Update CPUID bitfields to x86-cpuid-db v2.3 (Ahmed S. Darwish)
Memory management changes:
- Allow temporary MMs when IRQs are on (Andy Lutomirski)
- Opt-in to IRQs-off activate_mm() (Andy Lutomirski)
- Simplify choose_new_asid() and generate better code (Borislav Petkov)
- Simplify 32-bit PAE page table handling (Dave Hansen)
- Always use dynamic memory layout (Kirill A. Shutemov)
- Make SPARSEMEM_VMEMMAP the only memory model (Kirill A. Shutemov)
- Make 5-level paging support unconditional (Kirill A. Shutemov)
- Stop prefetching current->mm->mmap_lock on page faults (Mateusz Guzik)
- Predict valid_user_address() returning true (Mateusz Guzik)
- Consolidate initmem_init() (Mike Rapoport)
FPU support and vector computing:
- Enable Intel APX support (Chang S. Bae)
- Reorgnize and clean up the xstate code (Chang S. Bae)
- Make task_struct::thread constant size (Ingo Molnar)
- Restore fpu_thread_struct_whitelist() to fix CONFIG_HARDENED_USERCOPY=y
(Kees Cook)
- Simplify the switch_fpu_prepare() + switch_fpu_finish() logic (Oleg Nesterov)
- Always preserve non-user xfeatures/flags in __state_perm (Sean Christopherson)
Microcode loader changes:
- Help users notice when running old Intel microcode (Dave Hansen)
- AMD: Do not return error when microcode update is not necessary (Annie Li)
- AMD: Clean the cache if update did not load microcode (Boris Ostrovsky)
Code patching (alternatives) changes:
- Simplify, reorganize and clean up the x86 text-patching code (Ingo Molnar)
- Make smp_text_poke_batch_process() subsume smp_text_poke_batch_finish()
(Nikolay Borisov)
- Refactor the {,un}use_temporary_mm() code (Peter Zijlstra)
Debugging support:
- Add early IDT and GDT loading to debug relocate_kernel() bugs (David Woodhouse)
- Print the reason for the last reset on modern AMD CPUs (Yazen Ghannam)
- Add AMD Zen debugging document (Mario Limonciello)
- Fix opcode map (!REX2) superscript tags (Masami Hiramatsu)
- Stop decoding i64 instructions in x86-64 mode at opcode (Masami Hiramatsu)
CPU bugs and bug mitigations:
- Remove X86_BUG_MMIO_UNKNOWN (Borislav Petkov)
- Fix SRSO reporting on Zen1/2 with SMT disabled (Borislav Petkov)
- Restructure and harmonize the various CPU bug mitigation methods
(David Kaplan)
- Fix spectre_v2 mitigation default on Intel (Pawan Gupta)
MSR API:
- Large MSR code and API cleanup (Xin Li)
- In-kernel MSR API type cleanups and renames (Ingo Molnar)
PKEYS:
- Simplify PKRU update in signal frame (Chang S. Bae)
NMI handling code:
- Clean up, refactor and simplify the NMI handling code (Sohil Mehta)
- Improve NMI duration console printouts (Sohil Mehta)
Paravirt guests interface:
- Restrict PARAVIRT_XXL to 64-bit only (Kirill A. Shutemov)
SEV support:
- Share the sev_secrets_pa value again (Tom Lendacky)
x86 platform changes:
- Introduce the <asm/amd/> header namespace (Ingo Molnar)
- i2c: piix4, x86/platform: Move the SB800 PIIX4 FCH definitions to <asm/amd/fch.h>
(Mario Limonciello)
Fixes and cleanups:
- x86 assembly code cleanups and fixes (Uros Bizjak)
- Misc fixes and cleanups (Andi Kleen, Andy Lutomirski, Andy Shevchenko,
Ard Biesheuvel, Bagas Sanjaya, Baoquan He, Borislav Petkov, Chang S. Bae,
Chao Gao, Dan Williams, Dave Hansen, David Kaplan, David Woodhouse,
Eric Biggers, Ingo Molnar, Josh Poimboeuf, Juergen Gross, Malaya Kumar Rout,
Mario Limonciello, Nathan Chancellor, Oleg Nesterov, Pawan Gupta,
Peter Zijlstra, Shivank Garg, Sohil Mehta, Thomas Gleixner, Uros Bizjak,
Xin Li)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmgy9WARHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1jJSw/+OW2zvAx602doujBIE17vFLU7R10Xwj5H
lVgomkWCoTNscUZPhdT/iI+/kQF1fG8PtN9oZKUsTAUswldKJsqu7KevobviesiW
qI+FqH/fhHaIk7GVh9VP65Dgrdki8zsgd7BFxD8pLRBlbZTxTxXNNkuNJrs6LxJh
SxWp/FVtKo6Wd57qlUcsdo0tilAfcuhlEweFUarX55X2ouhdeHjcGNpxj9dHKOh8
M7R5yMYFrpfdpSms+WaCnKKahWHaIQtQTsPAyKwoVdtfl1kK+7NgaCF55Gbo3ogp
r59JwC/CGruDa5QnnDizCwFIwpZw9M52Q1NhP/eLEZbDGB4Yya3b5NW+Ya+6rPvO
ZZC3e1uUmlxW3lrYflUHurnwrVb2GjkQZOdf0gfnly/7LljIicIS2dk4qIQF9NBd
sQPpW5hjmIz9CsfeL8QaJW38pQyMsQWznFuz4YVuHcLHvleb3hR+n4fNfV5Lx9bw
oirVETSIT5hy/msAgShPqTqFUEiVCgp16ow20YstxxzFu/FQ+VG987tkeUyFkPMe
q1v5yF1hty+TkM4naKendIZ/MJnsrv0AxaegFz9YQrKGL1UPiOajQbSyKbzbto7+
ozmtN0W80E8n4oQq008j8htpgIhDV91UjF5m33qB82uSqKihHPPTsVcbeg5nZwh2
ti5g/a1jk94=
=JgQo
-----END PGP SIGNATURE-----
Merge tag 'x86-core-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core x86 updates from Ingo Molnar:
"Boot code changes:
- A large series of changes to reorganize the x86 boot code into a
better isolated and easier to maintain base of PIC early startup
code in arch/x86/boot/startup/, by Ard Biesheuvel.
Motivation & background:
| Since commit
|
| c88d71508e ("x86/boot/64: Rewrite startup_64() in C")
|
| dated Jun 6 2017, we have been using C code on the boot path in a way
| that is not supported by the toolchain, i.e., to execute non-PIC C
| code from a mapping of memory that is different from the one provided
| to the linker. It should have been obvious at the time that this was a
| bad idea, given the need to sprinkle fixup_pointer() calls left and
| right to manipulate global variables (including non-pointer variables)
| without crashing.
|
| This C startup code has been expanding, and in particular, the SEV-SNP
| startup code has been expanding over the past couple of years, and
| grown many of these warts, where the C code needs to use special
| annotations or helpers to access global objects.
This tree includes the first phase of this work-in-progress x86
boot code reorganization.
Scalability enhancements and micro-optimizations:
- Improve code-patching scalability (Eric Dumazet)
- Remove MFENCEs for X86_BUG_CLFLUSH_MONITOR (Andrew Cooper)
CPU features enumeration updates:
- Thorough reorganization and cleanup of CPUID parsing APIs (Ahmed S.
Darwish)
- Fix, refactor and clean up the cacheinfo code (Ahmed S. Darwish,
Thomas Gleixner)
- Update CPUID bitfields to x86-cpuid-db v2.3 (Ahmed S. Darwish)
Memory management changes:
- Allow temporary MMs when IRQs are on (Andy Lutomirski)
- Opt-in to IRQs-off activate_mm() (Andy Lutomirski)
- Simplify choose_new_asid() and generate better code (Borislav
Petkov)
- Simplify 32-bit PAE page table handling (Dave Hansen)
- Always use dynamic memory layout (Kirill A. Shutemov)
- Make SPARSEMEM_VMEMMAP the only memory model (Kirill A. Shutemov)
- Make 5-level paging support unconditional (Kirill A. Shutemov)
- Stop prefetching current->mm->mmap_lock on page faults (Mateusz
Guzik)
- Predict valid_user_address() returning true (Mateusz Guzik)
- Consolidate initmem_init() (Mike Rapoport)
FPU support and vector computing:
- Enable Intel APX support (Chang S. Bae)
- Reorgnize and clean up the xstate code (Chang S. Bae)
- Make task_struct::thread constant size (Ingo Molnar)
- Restore fpu_thread_struct_whitelist() to fix
CONFIG_HARDENED_USERCOPY=y (Kees Cook)
- Simplify the switch_fpu_prepare() + switch_fpu_finish() logic (Oleg
Nesterov)
- Always preserve non-user xfeatures/flags in __state_perm (Sean
Christopherson)
Microcode loader changes:
- Help users notice when running old Intel microcode (Dave Hansen)
- AMD: Do not return error when microcode update is not necessary
(Annie Li)
- AMD: Clean the cache if update did not load microcode (Boris
Ostrovsky)
Code patching (alternatives) changes:
- Simplify, reorganize and clean up the x86 text-patching code (Ingo
Molnar)
- Make smp_text_poke_batch_process() subsume
smp_text_poke_batch_finish() (Nikolay Borisov)
- Refactor the {,un}use_temporary_mm() code (Peter Zijlstra)
Debugging support:
- Add early IDT and GDT loading to debug relocate_kernel() bugs
(David Woodhouse)
- Print the reason for the last reset on modern AMD CPUs (Yazen
Ghannam)
- Add AMD Zen debugging document (Mario Limonciello)
- Fix opcode map (!REX2) superscript tags (Masami Hiramatsu)
- Stop decoding i64 instructions in x86-64 mode at opcode (Masami
Hiramatsu)
CPU bugs and bug mitigations:
- Remove X86_BUG_MMIO_UNKNOWN (Borislav Petkov)
- Fix SRSO reporting on Zen1/2 with SMT disabled (Borislav Petkov)
- Restructure and harmonize the various CPU bug mitigation methods
(David Kaplan)
- Fix spectre_v2 mitigation default on Intel (Pawan Gupta)
MSR API:
- Large MSR code and API cleanup (Xin Li)
- In-kernel MSR API type cleanups and renames (Ingo Molnar)
PKEYS:
- Simplify PKRU update in signal frame (Chang S. Bae)
NMI handling code:
- Clean up, refactor and simplify the NMI handling code (Sohil Mehta)
- Improve NMI duration console printouts (Sohil Mehta)
Paravirt guests interface:
- Restrict PARAVIRT_XXL to 64-bit only (Kirill A. Shutemov)
SEV support:
- Share the sev_secrets_pa value again (Tom Lendacky)
x86 platform changes:
- Introduce the <asm/amd/> header namespace (Ingo Molnar)
- i2c: piix4, x86/platform: Move the SB800 PIIX4 FCH definitions to
<asm/amd/fch.h> (Mario Limonciello)
Fixes and cleanups:
- x86 assembly code cleanups and fixes (Uros Bizjak)
- Misc fixes and cleanups (Andi Kleen, Andy Lutomirski, Andy
Shevchenko, Ard Biesheuvel, Bagas Sanjaya, Baoquan He, Borislav
Petkov, Chang S. Bae, Chao Gao, Dan Williams, Dave Hansen, David
Kaplan, David Woodhouse, Eric Biggers, Ingo Molnar, Josh Poimboeuf,
Juergen Gross, Malaya Kumar Rout, Mario Limonciello, Nathan
Chancellor, Oleg Nesterov, Pawan Gupta, Peter Zijlstra, Shivank
Garg, Sohil Mehta, Thomas Gleixner, Uros Bizjak, Xin Li)"
* tag 'x86-core-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (331 commits)
x86/bugs: Fix spectre_v2 mitigation default on Intel
x86/bugs: Restructure ITS mitigation
x86/xen/msr: Fix uninitialized variable 'err'
x86/msr: Remove a superfluous inclusion of <asm/asm.h>
x86/paravirt: Restrict PARAVIRT_XXL to 64-bit only
x86/mm/64: Make 5-level paging support unconditional
x86/mm/64: Make SPARSEMEM_VMEMMAP the only memory model
x86/mm/64: Always use dynamic memory layout
x86/bugs: Fix indentation due to ITS merge
x86/cpuid: Rename hypervisor_cpuid_base()/for_each_possible_hypervisor_cpuid_base() to cpuid_base_hypervisor()/for_each_possible_cpuid_base_hypervisor()
x86/cpu/intel: Rename CPUID(0x2) descriptors iterator parameter
x86/cacheinfo: Rename CPUID(0x2) descriptors iterator parameter
x86/cpuid: Rename cpuid_get_leaf_0x2_regs() to cpuid_leaf_0x2()
x86/cpuid: Rename have_cpuid_p() to cpuid_feature()
x86/cpuid: Set <asm/cpuid/api.h> as the main CPUID header
x86/cpuid: Move CPUID(0x2) APIs into <cpuid/api.h>
x86/msr: Add rdmsrl_on_cpu() compatibility wrapper
x86/mm: Fix kernel-doc descriptions of various pgtable methods
x86/asm-offsets: Export certain 'struct cpuinfo_x86' fields for 64-bit asm use too
x86/boot: Defer initialization of VM space related global variables
...
Core & generic-arch updates:
- Add support for dynamic constraints and propagate it to
the Intel driver (Kan Liang)
- Fix & enhance driver-specific throttling support (Kan Liang)
- Record sample last_period before updating on the
x86 and PowerPC platforms (Mark Barnett)
- Make perf_pmu_unregister() usable (Peter Zijlstra)
- Unify perf_event_free_task() / perf_event_exit_task_context()
(Peter Zijlstra)
- Simplify perf_event_release_kernel() and perf_event_free_task()
(Peter Zijlstra)
- Allocate non-contiguous AUX pages by default (Yabin Cui)
Uprobes updates:
- Add support to emulate NOP instructions (Jiri Olsa)
- selftests/bpf: Add 5-byte NOP uprobe trigger benchmark (Jiri Olsa)
x86 Intel PMU enhancements:
- Support Intel Auto Counter Reload [ACR] (Kan Liang)
- Add PMU support for Clearwater Forest (Dapeng Mi)
- Arch-PEBS preparatory changes: (Dapeng Mi)
- Parse CPUID archPerfmonExt leaves for non-hybrid CPUs
- Decouple BTS initialization from PEBS initialization
- Introduce pairs of PEBS static calls
x86 AMD PMU enhancements:
- Use hrtimer for handling overflows in the AMD uncore driver
(Sandipan Das)
- Prevent UMC counters from saturating (Sandipan Das)
Fixes and cleanups:
- Fix put_ctx() ordering (Frederic Weisbecker)
- Fix irq work dereferencing garbage (Frederic Weisbecker)
- Misc fixes and cleanups (Changbin Du, Frederic Weisbecker,
Ian Rogers, Ingo Molnar, Kan Liang, Peter Zijlstra, Qing Wang,
Sandipan Das, Thorsten Blum)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmgy4zoRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1j6QRAAvQ4GBPrdJLb8oXkLjCmWSp9PfM1h2IW0
reUrcV0BPRAwz4T60QEU2KyiEjvKxNghR6bNw4i3slAZ8EFwP9eWE/0ZYOo5+W/N
wv8vsopv/oZd2L2G5TgxDJf+tLPkqnTvp651LmGAbquPFONN1lsya9UHVPnt2qtv
fvFhjW6D828VoevRcUCsdoEUNlFDkUYQ2c3M1y5H2AI6ILDVxLsp5uYtuVUP+2lQ
7UI/elqRIIblTGT7G9LvTGiXZMm8T58fe1OOLekT6NdweJ3XEt1kMdFo/SCRYfzU
eDVVVLSextZfzBXNPtAEAlM3aSgd8+4m5sACiD1EeOUNjo5J9Sj1OOCa+bZGF/Rl
XNv5Kcp6Kh1T4N5lio8DE/NabmHDqDMbUGfud+VTS8uLLku4kuOWNMxJTD1nQ2Zz
BMfJhP89G9Vk07F9fOGuG1N6mKhIKNOgXh0S92tB7XDHcdJegueu2xh4ZszBL1QK
JVXa4DbnDj+y0LvnV+A5Z6VILr5RiCAipDb9ascByPja6BbN10Nf9Aj4nWwRTwbO
ut5OK/fDKmSjEHn1+a42d4iRxdIXIWhXCyxEhH+hJXEFx9htbQ3oAbXAEedeJTlT
g9QYGAjL96QEd0CqviorV8KyU59nVkEPoLVCumXBZ0WWhNwU6GdAmsW1hLfxQdLN
sp+XHhfxf8M=
=tPRs
-----END PGP SIGNATURE-----
Merge tag 'perf-core-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf events updates from Ingo Molnar:
"Core & generic-arch updates:
- Add support for dynamic constraints and propagate it to the Intel
driver (Kan Liang)
- Fix & enhance driver-specific throttling support (Kan Liang)
- Record sample last_period before updating on the x86 and PowerPC
platforms (Mark Barnett)
- Make perf_pmu_unregister() usable (Peter Zijlstra)
- Unify perf_event_free_task() / perf_event_exit_task_context()
(Peter Zijlstra)
- Simplify perf_event_release_kernel() and perf_event_free_task()
(Peter Zijlstra)
- Allocate non-contiguous AUX pages by default (Yabin Cui)
Uprobes updates:
- Add support to emulate NOP instructions (Jiri Olsa)
- selftests/bpf: Add 5-byte NOP uprobe trigger benchmark (Jiri Olsa)
x86 Intel PMU enhancements:
- Support Intel Auto Counter Reload [ACR] (Kan Liang)
- Add PMU support for Clearwater Forest (Dapeng Mi)
- Arch-PEBS preparatory changes: (Dapeng Mi)
- Parse CPUID archPerfmonExt leaves for non-hybrid CPUs
- Decouple BTS initialization from PEBS initialization
- Introduce pairs of PEBS static calls
x86 AMD PMU enhancements:
- Use hrtimer for handling overflows in the AMD uncore driver
(Sandipan Das)
- Prevent UMC counters from saturating (Sandipan Das)
Fixes and cleanups:
- Fix put_ctx() ordering (Frederic Weisbecker)
- Fix irq work dereferencing garbage (Frederic Weisbecker)
- Misc fixes and cleanups (Changbin Du, Frederic Weisbecker, Ian
Rogers, Ingo Molnar, Kan Liang, Peter Zijlstra, Qing Wang, Sandipan
Das, Thorsten Blum)"
* tag 'perf-core-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (60 commits)
perf/headers: Clean up <linux/perf_event.h> a bit
perf/uapi: Clean up <uapi/linux/perf_event.h> a bit
perf/uapi: Fix PERF_RECORD_SAMPLE comments in <uapi/linux/perf_event.h>
mips/perf: Remove driver-specific throttle support
xtensa/perf: Remove driver-specific throttle support
sparc/perf: Remove driver-specific throttle support
loongarch/perf: Remove driver-specific throttle support
csky/perf: Remove driver-specific throttle support
arc/perf: Remove driver-specific throttle support
alpha/perf: Remove driver-specific throttle support
perf/apple_m1: Remove driver-specific throttle support
perf/arm: Remove driver-specific throttle support
s390/perf: Remove driver-specific throttle support
powerpc/perf: Remove driver-specific throttle support
perf/x86/zhaoxin: Remove driver-specific throttle support
perf/x86/amd: Remove driver-specific throttle support
perf/x86/intel: Remove driver-specific throttle support
perf: Only dump the throttle log for the leader
perf: Fix the throttle logic for a group
perf/core: Add the is_event_in_freq_mode() helper to simplify the code
...
API:
- Fix memcpy_sglist to handle partially overlapping SG lists.
- Use memcpy_sglist to replace null skcipher.
- Rename CRYPTO_TESTS to CRYPTO_BENCHMARK.
- Flip CRYPTO_MANAGER_DISABLE_TEST into CRYPTO_SELFTESTS.
- Hide CRYPTO_MANAGER.
- Add delayed freeing of driver crypto_alg structures.
Compression:
- Allocate large buffers on first use instead of initialisation in scomp.
- Drop destination linearisation buffer in scomp.
- Move scomp stream allocation into acomp.
- Add acomp scatter-gather walker.
- Remove request chaining.
- Add optional async request allocation.
Hashing:
- Remove request chaining.
- Add optional async request allocation.
- Move partial block handling into API.
- Add ahash support to hmac.
- Fix shash documentation to disallow usage in hard IRQs.
Algorithms:
- Remove unnecessary SIMD fallback code on x86 and arm/arm64.
- Drop avx10_256 xts(aes)/ctr(aes) on x86.
- Improve avx-512 optimisations for xts(aes).
- Move chacha arch implementations into lib/crypto.
- Move poly1305 into lib/crypto and drop unused Crypto API algorithm.
- Disable powerpc/poly1305 as it has no SIMD fallback.
- Move sha256 arch implementations into lib/crypto.
- Convert deflate to acomp.
- Set block size correctly in cbcmac.
Drivers:
- Do not use sg_dma_len before mapping in sun8i-ss.
- Fix warm-reboot failure by making shutdown do more work in qat.
- Add locking in zynqmp-sha.
- Remove cavium/zip.
- Add support for PCI device 0x17D8 to ccp.
- Add qat_6xxx support in qat.
- Add support for RK3576 in rockchip-rng.
- Add support for i.MX8QM in caam.
Others:
- Fix irq_fpu_usable/kernel_fpu_begin inconsistency during CPU bring-up.
- Add new SEV/SNP platform shutdown API in ccp.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEn51F/lCuNhUwmDeSxycdCkmxi6cFAmgz47AACgkQxycdCkmx
i6fvKRAAr4Xa903L0r1Q1P1alQqoFFCqimUWeH72m68LiWynHWi0lUo0z/+tKweg
mnPStz7/Ha9HRHJjdNCMPnlJqXQDkuH3bIOuBJCwduDuhHo9VGOd46XGzmGMv3gb
HKuZhI0lk7pznK3CSyD/2nHmbDCHD+7feTZSBMoN9mm875+aSoM6fdxgak8uPFcq
KbB1L+hObTn2kAPSqRrNOR8/xG2N7hdH8eax7Li+LAtqYNVT5HvWVECsB/CKRPfB
sgAv3UTzcIFapSSHUHaONppSeoqPAIAeV7SdQhJvlT+EUUR/h/B6+D9OUQQqbphQ
LBalgTnqMKl0ymDEQFQ6QyYCat9ZfNmDft2WcXEsxc8PxImkgJI1W3B8O51sOjbG
78D8JqVQ96dleo4FsBhM2wfG0b41JM6zU4raC4vS7a3qsUS+Q1MpehvcS1iORicy
SpGdE8e7DLlxKhzWyW1xJnbrtMZDC7Sa2hUnxrvP0/xOvRhChKscRVtWcf0a5q7X
8JmuvwVSOJuSbQ3MeFbQvpo5lR9+0WsNjM6e9miiH6Y7vZUKmWcq2yDp377qVzeh
7NK6+OwGIQZZExrmtPw2BXwssT9Eg+ks6Y7g2Ne7yzvrjVNfEPY7Cws/5w7p8mRS
qhrcpbJNFlWgD7YYkmGZFTQ8DCN25ipP8lklO/hbcfchqLE/o1o=
=O8L5
-----END PGP SIGNATURE-----
Merge tag 'v6.16-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto updates from Herbert Xu:
"API:
- Fix memcpy_sglist to handle partially overlapping SG lists
- Use memcpy_sglist to replace null skcipher
- Rename CRYPTO_TESTS to CRYPTO_BENCHMARK
- Flip CRYPTO_MANAGER_DISABLE_TEST into CRYPTO_SELFTESTS
- Hide CRYPTO_MANAGER
- Add delayed freeing of driver crypto_alg structures
Compression:
- Allocate large buffers on first use instead of initialisation in scomp
- Drop destination linearisation buffer in scomp
- Move scomp stream allocation into acomp
- Add acomp scatter-gather walker
- Remove request chaining
- Add optional async request allocation
Hashing:
- Remove request chaining
- Add optional async request allocation
- Move partial block handling into API
- Add ahash support to hmac
- Fix shash documentation to disallow usage in hard IRQs
Algorithms:
- Remove unnecessary SIMD fallback code on x86 and arm/arm64
- Drop avx10_256 xts(aes)/ctr(aes) on x86
- Improve avx-512 optimisations for xts(aes)
- Move chacha arch implementations into lib/crypto
- Move poly1305 into lib/crypto and drop unused Crypto API algorithm
- Disable powerpc/poly1305 as it has no SIMD fallback
- Move sha256 arch implementations into lib/crypto
- Convert deflate to acomp
- Set block size correctly in cbcmac
Drivers:
- Do not use sg_dma_len before mapping in sun8i-ss
- Fix warm-reboot failure by making shutdown do more work in qat
- Add locking in zynqmp-sha
- Remove cavium/zip
- Add support for PCI device 0x17D8 to ccp
- Add qat_6xxx support in qat
- Add support for RK3576 in rockchip-rng
- Add support for i.MX8QM in caam
Others:
- Fix irq_fpu_usable/kernel_fpu_begin inconsistency during CPU bring-up
- Add new SEV/SNP platform shutdown API in ccp"
* tag 'v6.16-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (382 commits)
x86/fpu: Fix irq_fpu_usable() to return false during CPU onlining
crypto: qat - add missing header inclusion
crypto: api - Redo lookup on EEXIST
Revert "crypto: testmgr - Add hash export format testing"
crypto: marvell/cesa - Do not chain submitted requests
crypto: powerpc/poly1305 - add depends on BROKEN for now
Revert "crypto: powerpc/poly1305 - Add SIMD fallback"
crypto: ccp - Add missing tee info reg for teev2
crypto: ccp - Add missing bootloader info reg for pspv5
crypto: sun8i-ce - move fallback ahash_request to the end of the struct
crypto: octeontx2 - Use dynamic allocated memory region for lmtst
crypto: octeontx2 - Initialize cptlfs device info once
crypto: xts - Only add ecb if it is not already there
crypto: lrw - Only add ecb if it is not already there
crypto: testmgr - Add hash export format testing
crypto: testmgr - Use ahash for generic tfm
crypto: hmac - Add ahash support
crypto: testmgr - Ignore EEXIST on shash allocation
crypto: algapi - Add driver template support to crypto_inst_setname
crypto: shash - Set reqsize in shash_alg
...
irq_fpu_usable() incorrectly returned true before the FPU is
initialized. The x86 CPU onlining code can call sha256() to checksum
AMD microcode images, before the FPU is initialized. Since sha256()
recently gained a kernel-mode FPU optimized code path, a crash occurred
in kernel_fpu_begin_mask() during hotplug CPU onlining.
(The crash did not occur during boot-time CPU onlining, since the
optimized sha256() code is not enabled until subsys_initcalls run.)
Fix this by making irq_fpu_usable() return false before fpu__init_cpu()
has run. To do this without adding any additional overhead to
irq_fpu_usable(), replace the existing per-CPU bool in_kernel_fpu with
kernel_fpu_allowed which tracks both initialization and usage rather
than just usage. The initial state is false; FPU initialization sets it
to true; kernel-mode FPU sections toggle it to false and then back to
true; and CPU offlining restores it to the initial state of false.
Fixes: 11d7956d52 ("crypto: x86/sha256 - implement library instead of shash")
Reported-by: Ayush Jain <Ayush.Jain3@amd.com>
Closes: https://lore.kernel.org/r/20250516112217.GBaCcf6Yoc6LkIIryP@fat_crate.local
Signed-off-by: Eric Biggers <ebiggers@google.com>
Tested-by: Ayush Jain <Ayush.Jain3@amd.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
When starting APs, confidential guests and paravisor guests
need to know the CPU number, and the pattern of using the linear
search has emerged in several places. With N processors that leads
to the O(N^2) time complexity.
Provide the CPU number in the AP wake up callback so that one can
get the CPU number in constant time.
Suggested-by: Michael Kelley <mhklinux@outlook.com>
Signed-off-by: Roman Kisel <romank@linux.microsoft.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lore.kernel.org/r/20250507182227.7421-3-romank@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <20250507182227.7421-3-romank@linux.microsoft.com>
To start an application processor in SNP-isolated guest, a hypercall
is used that takes a virtual processor index. The hv_snp_boot_ap()
function uses that START_VP hypercall but passes as VP index to it
what it receives as a wakeup_secondary_cpu_64 callback: the APIC ID.
As those two aren't generally interchangeable, that may lead to hung
APs if the VP index and the APIC ID don't match up.
Update the parameter names to avoid confusion as to what the parameter
is. Use the APIC ID to the VP index conversion to provide the correct
input to the hypercall.
Cc: stable@vger.kernel.org
Fixes: 44676bb9d5 ("x86/hyperv: Add smp support for SEV-SNP guest")
Signed-off-by: Roman Kisel <romank@linux.microsoft.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Link: https://lore.kernel.org/r/20250507182227.7421-2-romank@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <20250507182227.7421-2-romank@linux.microsoft.com>
This reverts commit 693bbf2a50 as kdump LUKS
support (CONFIG_CRASH_DM_CRYPT) depends on __set_memory_prot.
[akpm@linux-foundation.org: x86 set_memory.h needs pgtable_types.h]
Link: https://lkml.kernel.org/r/20250502011246.99238-7-coxu@redhat.com
Signed-off-by: Coiby Xu <coxu@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: "Daniel P. Berrange" <berrange@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Jan Pazdziora <jpazdziora@redhat.com>
Cc: Liu Pingfan <kernelfans@gmail.com>
Cc: Milan Broz <gmazyland@gmail.com>
Cc: Ondrej Kozina <okozina@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQFSBAABCgA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmgqSbkeHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGr6sH/1ICAvlin1GuxffE
ISVNz3xhXQpXG2k8yl9r0umpdCfPQbGrxm30vZyuIDNutY/FuMvkIqfu+Z1NnLg0
GidZW015LtXrp7/puKtTnUD5CPSjdETMXig+Q7c1PrxkkmHwz8sBbbm173AIDbDB
t7wwqSEUQh2AIDouGwN+DXB+6bR2FoOXb/k/njmtappIwR3rBc2f1HQJnP095rKO
5AKw1c9DMv5Wq2cEdBOCP48e4CFZEIN1ycW0nvtjpnOmcPOJjLoEothRbntQolqF
udtj5UeTGdAJqmjigv7KHmlrmFNe+GqBq4+beHl5MRxhBaT2uGGaM9jCJiSxT3Jx
sHyYYr8=
=Ddma
-----END PGP SIGNATURE-----
Merge tag 'v6.15-rc7' into x86/core, to pick up fixes
Pick up build fixes from upstream to make this tree more testable.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
- Fix SEV-SNP kdump bugs
- Update the email address of Alexey Makhalov in MAINTAINERS
- Add the CPU feature flag for the Zen6 microarchitecture
- Fix typo in system message
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmgoj3MRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1hppg//S/eodSXrgxzTOvZLu0gFeYN4xyxUnfWl
0Dvc+FRasGCpBpQcD9sl3w9xKnTkaGY250NPP4/OKW2JgiizP6E3UcFYvaDnZ96I
TU3/y3acAI5zpvASOuOuDlwDt0w9xIk5L/K0gcVec9dYnGdAOmTE4jjZV6wDm0Q4
rto8k5E0RmSs5HQ4GcpU2sgzJSlaQlkkxZMo6HaUE6oJUiuodmPnxHkjoLgAQiU9
I0ALcrPVtyI1jap52DVxAIDcMsrOddazYley4IyDRqWezwrtrxkNaEzvNkMWO4ZV
iAnTYe/21HrppsQ40KuYa5VY5k0Dkv+QVzb23rGT2sZlPaXAiPIVUtt25z4VGtve
1z/kn1TszfcqC9sPodVcHQkzNrTktlaEKXd3u9GuFlfMkuj7iSnmYnGoPMo6x7T9
vcbBF6PUQ+uNi7QZXDvww8S0OMBVVlMDOjhuGjFBFzkmfVzkFtdyC1oGXppiXNzg
KG0LjiTDlOeI4B8bxG1Wwldwl7vLfwHJag2xWaw0uQR8mjstkCTLXibjdvz3QNwi
bM14hlG3TxmxJSsYl8QNFnF45DwzApWGKz9K81OPz/yZ2Z6KB1uQqrN2l8+blFt9
OUMEukY9sAcmUR1hkt3Rdynb1ri+jGMcJUGOn48w2ne+qiLoVicp8LEgWO6KoI3Z
vgLnVmqIa9o=
=cD7r
-----END PGP SIGNATURE-----
Merge tag 'x86-urgent-2025-05-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 fixes from Ingo Molnar:
- Fix SEV-SNP kdump bugs
- Update the email address of Alexey Makhalov in MAINTAINERS
- Add the CPU feature flag for the Zen6 microarchitecture
- Fix typo in system message
* tag 'x86-urgent-2025-05-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm: Remove duplicated word in warning message
x86/CPU/AMD: Add X86_FEATURE_ZEN6
x86/sev: Make sure pages are not skipped during kdump
x86/sev: Do not touch VMSA pages during SNP guest memory kdump
MAINTAINERS: Update Alexey Makhalov's email address
x86/sev: Fix operator precedence in GHCB_MSR_VMPL_REQ_LEVEL macro
PARAVIRT_XXL is exclusively utilized by XEN_PV, which is only compatible
with 64-bit machines.
Clearly designate PARAVIRT_XXL as 64-bit only and remove ifdefs to
support CONFIG_PGTABLE_LEVELS < 5.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Juergen Gross <jgross@suse.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250516123306.3812286-5-kirill.shutemov@linux.intel.com
Both Intel and AMD CPUs support 5-level paging, which is expected to
become more widely adopted in the future. All major x86 Linux
distributions have the feature enabled.
Remove CONFIG_X86_5LEVEL and related #ifdeffery for it to make it more readable.
Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250516123306.3812286-4-kirill.shutemov@linux.intel.com
Dynamic memory layout is used by KASLR and 5-level paging.
CONFIG_X86_5LEVEL is going to be removed, making 5-level paging support
unconditional which requires unconditional support of dynamic memory
layout.
Remove CONFIG_DYNAMIC_MEMORY_LAYOUT.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Jan Kiszka <jan.kiszka@siemens.com>
Cc: Kieran Bingham <kbingham@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250516123306.3812286-2-kirill.shutemov@linux.intel.com
Virtual machines can exploit bus locks to degrade the performance of
the system. Bus locks can be caused by Non-WB(Write back) and
misaligned locked RMW (Read-modify-Write) instructions and require
systemwide synchronization among all processors which can result into
significant performance penalties.
To address this issue, the Bus Lock Threshold feature is introduced to
provide ability to hypervisor to restrict guests' capability of
initiating mulitple buslocks, thereby preventing system slowdowns.
Support for the buslock threshold is indicated via CPUID function
0x8000000A_EDX[29].
On the processors that support the Bus Lock Threshold feature, the
VMCB provides a Bus Lock Threshold enable bit and an unsigned 16-bit
Bus Lock threshold count.
VMCB intercept bit
VMCB Offset Bits Function
14h 5 Intercept bus lock operations
Bus lock threshold count
VMCB Offset Bits Function
120h 15:0 Bus lock counter
When a VMRUN instruction is executed, the bus lock threshold count is
loaded into an internal count register. Before the processor executes
a bus lock in the guest, it checks the value of this register:
- If the value is greater than '0', the processor successfully
executes the bus lock and decrements the count.
- If the value is '0', the bus lock is not executed, and a #VMEXIT to
the VMM is taken.
The bus lock threshold #VMEXIT is reported to the VMM with the VMEXIT
code A5h, SVM_EXIT_BUS_LOCK.
Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
Co-developed-by: Manali Shukla <manali.shukla@amd.com>
Signed-off-by: Manali Shukla <manali.shukla@amd.com>
Link: https://lore.kernel.org/r/20250502050346.14274-4-manali.shukla@amd.com
[sean: rewrite shortlog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Misbehaving guests can cause bus locks to degrade the performance of
the system. The Bus Lock Threshold feature can be used to address this
issue by providing capability to the hypervisor to limit guest's
ability to generate bus lock, thereby preventing system slowdown due
to performance penalities.
When the Bus Lock Threshold feature is enabled, the processor checks
the bus lock threshold count before executing the buslock and decides
whether to trigger bus lock exit or not.
The value of the bus lock threshold count '0' generates bus lock
exits, and if the value is greater than '0', the bus lock is executed
successfully and the bus lock threshold count is decremented.
Presence of the Bus Lock threshold feature is indicated via CPUID
function 0x8000000A_EDX[29].
Signed-off-by: Manali Shukla <manali.shukla@amd.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250502050346.14274-3-manali.shukla@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Move and rename kvm_pio_request.linear_rip to
kvm_vcpu_arch.cui_linear_rip so that the field can be used by other
userspace exit completion flows that need to take action if and only
if userspace has not modified RIP.
No functional changes intended.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Manali Shukla <manali.shukla@amd.com>
Link: https://lore.kernel.org/r/20250502050346.14274-2-manali.shukla@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
The resctrl pseudo-lock feature allows an architecture to allocate data
into particular cache portions, which are then treated as reserved to
avoid that data ever being evicted. Setting this up is deeply architecture
specific as it involves disabling prefetchers etc. It is not possible
to support this kind of feature on arm64. Risc-V is assumed to be the
same.
The prototypes for the architecture code were added to x86's asm/resctrl.h,
with other architectures able to provide stubs for their architecture. This
forces other architectures to provide identical stubs.
Move the prototypes and stubs to linux/resctrl.h, and switch between them
using the existing Kconfig symbol.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/20250515165855.31452-20-james.morse@arm.com
resctrl_arch_mon_ctx_alloc() and resctrl_arch_mon_ctx_free() take an enum
resctrl_event_id that is already defined in resctrl_types.h to be
accessible to asm/resctrl.h.
The x86 stubs take an int. Fix that.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/20250515165855.31452-19-james.morse@arm.com
In order to let all the APIs under <cpuid/api.h> have a shared "cpuid_"
namespace, rename hypervisor_cpuid_base() to cpuid_base_hypervisor().
To align with the new style, also rename:
for_each_possible_hypervisor_cpuid_base(function)
to:
for_each_possible_cpuid_base_hypervisor(function)
Adjust call-sites accordingly.
Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: x86-cpuid@lists.linux.dev
Link: https://lore.kernel.org/r/aCZOi0Oohc7DpgTo@lx-t490
Rename the CPUID(0x2) register accessor function:
cpuid_get_leaf_0x2_regs(regs)
to:
cpuid_leaf_0x2(regs)
for consistency with other <cpuid/api.h> accessors that return full CPUID
registers outputs like:
cpuid_leaf(regs)
cpuid_subleaf(regs)
In the same vein, rename the CPUID(0x2) iteration macro:
for_each_leaf_0x2_entry()
to:
for_each_cpuid_0x2_desc()
to include "cpuid" in the macro name, and since what is iterated upon is
CPUID(0x2) cache and TLB "descriptos", not "entries". Prefix an
underscore to that iterator macro parameters, so that the newly renamed
'desc' parameter do not get mixed with "union leaf_0x2_regs :: desc[]" in
the macro's implementation.
Adjust all the affected call-sites accordingly.
While at it, use "CPUID(0x2)" instead of "CPUID leaf 0x2" as this is the
recommended style.
No change in functionality intended.
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: x86-cpuid@lists.linux.dev
Link: https://lore.kernel.org/r/20250508150240.172915-6-darwi@linutronix.de
Page fault tracepoints are interesting for other architectures as well.
Move them to be generic.
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Gabriele Monaco <gmonaco@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: linux-trace-kernel@vger.kernel.org
Link: https://lore.kernel.org/r/89c2f284adf9b4c933f0e65811c50cef900a5a95.1747046848.git.namcao@linutronix.de
trace_pagefault_key is used to optimize the pagefault tracepoints when it
is disabled. However, tracepoints already have built-in static_key for this
exact purpose.
Remove this redundant key.
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Gabriele Monaco <gmonaco@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: linux-trace-kernel@vger.kernel.org
Link: https://lore.kernel.org/r/827c7666d2989f08742a4fb869b1ed5bfaaf1dbf.1747046848.git.namcao@linutronix.de
resctrl_sched_in() loads the architecture specific CPU MSRs with the
CLOSID and RMID values. This function was named before resctrl was
split to have architecture specific code, and generic filesystem code.
This function is obviously architecture specific, but does not begin
with 'resctrl_arch_', making it the odd one out in the functions an
architecture needs to support to enable resctrl.
Rename it for consistency. This is purely cosmetic.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/20250515165855.31452-7-james.morse@arm.com
In order to let all the APIs under <cpuid/api.h> have a shared "cpuid_"
namespace, rename have_cpuid_p() to cpuid_feature().
Adjust all call-sites accordingly.
Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: x86-cpuid@lists.linux.dev
Link: https://lore.kernel.org/r/20250508150240.172915-4-darwi@linutronix.de
The main CPUID header <asm/cpuid.h> was originally a storefront for the
headers:
<asm/cpuid/api.h>
<asm/cpuid/leaf_0x2_api.h>
Now that the latter CPUID(0x2) header has been merged into the former,
there is no practical difference between <asm/cpuid.h> and
<asm/cpuid/api.h>.
Migrate all users to the <asm/cpuid/api.h> header, in preparation of
the removal of <asm/cpuid.h>.
Don't remove <asm/cpuid.h> just yet, in case some new code in -next
started using it.
Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: x86-cpuid@lists.linux.dev
Link: https://lore.kernel.org/r/20250508150240.172915-3-darwi@linutronix.de
Move all of the CPUID(0x2) APIs at <cpuid/leaf_0x2_api.h> into
<cpuid/api.h>, in order centralize all CPUID APIs into the latter.
While at it, separate the different CPUID leaf parsing APIs using
header comments like "CPUID(0xN) parsing: ".
Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: x86-cpuid@lists.linux.dev
Link: https://lore.kernel.org/r/20250508150240.172915-2-darwi@linutronix.de
Add a simple rdmsrl_on_cpu() compatibility wrapper for
rdmsrq_on_cpu(), to make life in -next easier, where
the PM tree recently grew more uses of the old API.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Mario Limonciello <mario.limonciello@amd.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Xin Li <xin@zytor.com>
Link: https://lore.kernel.org/r/20250512145517.6e0666e3@canb.auug.org.au
Add a synthetic feature flag for Zen6.
[ bp: Move the feature flag to a free slot and avoid future merge
conflicts from incoming stuff. ]
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250513204857.3376577-1-yazen.ghannam@amd.com
Prepare to resolve conflicts with an upstream series of fixes that conflict
with pending x86 changes:
6f5bf947ba Merge tag 'its-for-linus-20250509' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Prepare to resolve conflicts with an upstream series of fixes that conflict
with pending x86 changes:
6f5bf947ba Merge tag 'its-for-linus-20250509' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Prepare to resolve conflicts with an upstream series of fixes that conflict
with pending x86 changes:
6f5bf947ba Merge tag 'its-for-linus-20250509' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Prepare to resolve conflicts with an upstream series of fixes that conflict
with pending x86 changes:
6f5bf947ba Merge tag 'its-for-linus-20250509' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Prepare to resolve conflicts with an upstream series of fixes that conflict
with pending x86 changes:
6f5bf947ba Merge tag 'its-for-linus-20250509' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Prepare to resolve conflicts with an upstream series of fixes that conflict
with pending x86 changes:
6f5bf947ba Merge tag 'its-for-linus-20250509' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Prepare to resolve conflicts with an upstream series of fixes that conflict
with pending x86 changes:
6f5bf947ba Merge tag 'its-for-linus-20250509' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Prepare to resolve conflicts with an upstream series of fixes that conflict
with pending x86 changes:
6f5bf947ba Merge tag 'its-for-linus-20250509' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Signed-off-by: Ingo Molnar <mingo@kernel.org>
kexec handover (KHO) creates a metadata that the kernels pass between each
other during kexec. This metadata is stored in memory and kexec image
contains a (physical) pointer to that memory.
In addition, KHO keeps "scratch regions" available for kexec: physically
contiguous memory regions that are guaranteed to not have any memory that
KHO would preserve. The new kernel bootstraps itself using the scratch
regions and sets all handed over memory as in use. When subsystems that
support KHO initialize, they introspect the KHO metadata, restore
preserved memory regions, and retrieve their state stored in the preserved
memory.
Enlighten x86 kexec-file and boot path about the KHO metadata and make
sure it gets passed along to the next kernel.
Link: https://lkml.kernel.org/r/20250509074635.3187114-12-changyuanl@google.com
Signed-off-by: Alexander Graf <graf@amazon.com>
Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Co-developed-by: Changyuan Lyu <changyuanl@google.com>
Signed-off-by: Changyuan Lyu <changyuanl@google.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anthony Yznaga <anthony.yznaga@oracle.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Ashish Kalra <ashish.kalra@amd.com>
Cc: Ben Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Gowans <jgowans@amazon.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Krzysztof Kozlowski <krzk@kernel.org>
Cc: Marc Rutland <mark.rutland@arm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Pratyush Yadav <ptyadav@amazon.de>
Cc: Rob Herring <robh@kernel.org>
Cc: Saravana Kannan <saravanak@google.com>
Cc: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Thomas Lendacky <thomas.lendacky@amd.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
There are now no callers of mk_huge_pmd() and mk_pmd(). Remove them.
Link: https://lkml.kernel.org/r/20250402181709.2386022-12-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Richard Weinberger <richard@nod.at>
Cc: <x86@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Move the shadow stack check to pfn_pte() which lets us use the common
definition of mk_pte().
Link: https://lkml.kernel.org/r/20250402181709.2386022-5-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Richard Weinberger <richard@nod.at>
Cc: <x86@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmgebIwACgkQaDWVMHDJ
krCGSA/+I+W/uqiz58Z2Zu4RrXMYFfKJxacF7My9wnOyRxaJduS3qrz1E5wHqBId
f6M8wDx9nS24UxDkBbi84NdtlG1zj8nV8djtszGKVeqHG2DcQMMOXBKZSjOmTo2b
GIZ3a3xEqXaFfnGQxXSZrvtHIwCmv10H2oyGHu0vBp/SJuWXNg72oivOGhbm0uWs
0/bdIK8+1sW7OAmhhKdvMVpmzL8TQJnkUHSkQilPB2Tsf9wWDfeY7kDkK5YwQpk2
ZK+hrmwCFXQZELY65F2+y/cFim/F38HiqVdvIkV1wFSVqVVE9hEKJ4BDZl1fXZKB
p4qpDFgxO27E/eMo9IZfxRH4TdSoK6YLWo9FGWHKBPnciJfAeO9EP/AwAIhEQRdx
YZlN9sGS6ja7O1Eh423BBw6cFj6ta0ck2T1PoYk32FXc6sgqCphsfvBD3+tJxz8/
xoZ3BzoErdPqSXbH5cSI972kQW0JLESiMTZa827qnJtT672t6uBcsnnmR0ZbJH1f
TJCC9qgwpBiEkiGW3gwv00SC7CkXo3o0FJw0pa3MkKHGd7csxBtGBHI1b6Jj+oB0
yWf1HxSqwrq2Yek8R7lWd4jIxyWfKriEMTu7xCMUUFlprKmR2RufsADvqclNyedQ
sGBCc4eu1cpZp2no/IFm+IvkuzUHnkS/WNL1LbZ9YI8h8unjZHE=
=UVgZ
-----END PGP SIGNATURE-----
Merge tag 'its-for-linus-20250509' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 ITS mitigation from Dave Hansen:
"Mitigate Indirect Target Selection (ITS) issue.
I'd describe this one as a good old CPU bug where the behavior is
_obviously_ wrong, but since it just results in bad predictions it
wasn't wrong enough to notice. Well, the researchers noticed and also
realized that thus bug undermined a bunch of existing indirect branch
mitigations.
Thus the unusually wide impact on this one. Details:
ITS is a bug in some Intel CPUs that affects indirect branches
including RETs in the first half of a cacheline. Due to ITS such
branches may get wrongly predicted to a target of (direct or indirect)
branch that is located in the second half of a cacheline. Researchers
at VUSec found this behavior and reported to Intel.
Affected processors:
- Cascade Lake, Cooper Lake, Whiskey Lake V, Coffee Lake R, Comet
Lake, Ice Lake, Tiger Lake and Rocket Lake.
Scope of impact:
- Guest/host isolation:
When eIBRS is used for guest/host isolation, the indirect branches
in the VMM may still be predicted with targets corresponding to
direct branches in the guest.
- Intra-mode using cBPF:
cBPF can be used to poison the branch history to exploit ITS.
Realigning the indirect branches and RETs mitigates this attack
vector.
- User/kernel:
With eIBRS enabled user/kernel isolation is *not* impacted by ITS.
- Indirect Branch Prediction Barrier (IBPB):
Due to this bug indirect branches may be predicted with targets
corresponding to direct branches which were executed prior to IBPB.
This will be fixed in the microcode.
Mitigation:
As indirect branches in the first half of cacheline are affected, the
mitigation is to replace those indirect branches with a call to thunk that
is aligned to the second half of the cacheline.
RETs that take prediction from RSB are not affected, but they may be
affected by RSB-underflow condition. So, RETs in the first half of
cacheline are also patched to a return thunk that executes the RET aligned
to second half of cacheline"
* tag 'its-for-linus-20250509' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
selftest/x86/bugs: Add selftests for ITS
x86/its: FineIBT-paranoid vs ITS
x86/its: Use dynamic thunks for indirect branches
x86/ibt: Keep IBT disabled during alternative patching
mm/execmem: Unify early execmem_cache behaviour
x86/its: Align RETs in BHB clear sequence to avoid thunking
x86/its: Add support for RSB stuffing mitigation
x86/its: Add "vmexit" option to skip mitigation on some CPUs
x86/its: Enable Indirect Target Selection mitigation
x86/its: Add support for ITS-safe return thunk
x86/its: Add support for ITS-safe indirect thunk
x86/its: Enumerate Indirect Target Selection (ITS) bug
Documentation: x86/bugs/its: Add ITS documentation
The GHCB_MSR_VMPL_REQ_LEVEL macro lacked parentheses around the bitmask
expression, causing the shift operation to bind too early. As a result,
when requesting VMPL1 (e.g., GHCB_MSR_VMPL_REQ_LEVEL(1)), incorrect
values such as 0x000000016 were generated instead of the intended
0x100000016 (the requested VMPL level is specified in GHCBData[39:32]).
Fix the precedence issue by grouping the masked value before applying
the shift.
[ bp: Massage commit message. ]
Fixes: 34ff659017 ("x86/sev: Use kernel provided SVSM Calling Areas")
Signed-off-by: Seongman Lee <augustus92@kaist.ac.kr>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250511092329.12680-1-cloudlee1719@gmail.com