It is now config_lock that must be held, not kvm lock. Replace the
comment with a lockdep annotation.
Fixes: f003277311 ("KVM: arm64: Use config_lock to protect vgic state")
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230518100914.2837292-4-jean-philippe@linaro.org
vgic_its_create() changes the vgic state without holding the
config_lock, which triggers a lockdep warning in vgic_v4_init():
[ 358.667941] WARNING: CPU: 3 PID: 178 at arch/arm64/kvm/vgic/vgic-v4.c:245 vgic_v4_init+0x15c/0x7a8
...
[ 358.707410] vgic_v4_init+0x15c/0x7a8
[ 358.708550] vgic_its_create+0x37c/0x4a4
[ 358.709640] kvm_vm_ioctl+0x1518/0x2d80
[ 358.710688] __arm64_sys_ioctl+0x7ac/0x1ba8
[ 358.711960] invoke_syscall.constprop.0+0x70/0x1e0
[ 358.713245] do_el0_svc+0xe4/0x2d4
[ 358.714289] el0_svc+0x44/0x8c
[ 358.715329] el0t_64_sync_handler+0xf4/0x120
[ 358.716615] el0t_64_sync+0x190/0x194
Wrap the whole of vgic_its_create() with config_lock since, in addition
to calling vgic_v4_init(), it also modifies the global kvm->arch.vgic
state.
Fixes: f003277311 ("KVM: arm64: Use config_lock to protect vgic state")
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230518100914.2837292-3-jean-philippe@linaro.org
Lockdep reports a circular lock dependency between the srcu and the
config_lock:
[ 262.179917] -> #1 (&kvm->srcu){.+.+}-{0:0}:
[ 262.182010] __synchronize_srcu+0xb0/0x224
[ 262.183422] synchronize_srcu_expedited+0x24/0x34
[ 262.184554] kvm_io_bus_register_dev+0x324/0x50c
[ 262.185650] vgic_register_redist_iodev+0x254/0x398
[ 262.186740] vgic_v3_set_redist_base+0x3b0/0x724
[ 262.188087] kvm_vgic_addr+0x364/0x600
[ 262.189189] vgic_set_common_attr+0x90/0x544
[ 262.190278] vgic_v3_set_attr+0x74/0x9c
[ 262.191432] kvm_device_ioctl+0x2a0/0x4e4
[ 262.192515] __arm64_sys_ioctl+0x7ac/0x1ba8
[ 262.193612] invoke_syscall.constprop.0+0x70/0x1e0
[ 262.195006] do_el0_svc+0xe4/0x2d4
[ 262.195929] el0_svc+0x44/0x8c
[ 262.196917] el0t_64_sync_handler+0xf4/0x120
[ 262.198238] el0t_64_sync+0x190/0x194
[ 262.199224]
[ 262.199224] -> #0 (&kvm->arch.config_lock){+.+.}-{3:3}:
[ 262.201094] __lock_acquire+0x2b70/0x626c
[ 262.202245] lock_acquire+0x454/0x778
[ 262.203132] __mutex_lock+0x190/0x8b4
[ 262.204023] mutex_lock_nested+0x24/0x30
[ 262.205100] vgic_mmio_write_v3_misc+0x5c/0x2a0
[ 262.206178] dispatch_mmio_write+0xd8/0x258
[ 262.207498] __kvm_io_bus_write+0x1e0/0x350
[ 262.208582] kvm_io_bus_write+0xe0/0x1cc
[ 262.209653] io_mem_abort+0x2ac/0x6d8
[ 262.210569] kvm_handle_guest_abort+0x9b8/0x1f88
[ 262.211937] handle_exit+0xc4/0x39c
[ 262.212971] kvm_arch_vcpu_ioctl_run+0x90c/0x1c04
[ 262.214154] kvm_vcpu_ioctl+0x450/0x12f8
[ 262.215233] __arm64_sys_ioctl+0x7ac/0x1ba8
[ 262.216402] invoke_syscall.constprop.0+0x70/0x1e0
[ 262.217774] do_el0_svc+0xe4/0x2d4
[ 262.218758] el0_svc+0x44/0x8c
[ 262.219941] el0t_64_sync_handler+0xf4/0x120
[ 262.221110] el0t_64_sync+0x190/0x194
Note that the current report, which can be triggered by the vgic_irq
kselftest, is a triple chain that includes slots_lock, but after
inverting the slots_lock/config_lock dependency, the actual problem
reported above remains.
In several places, the vgic code calls kvm_io_bus_register_dev(), which
synchronizes the srcu, while holding config_lock (#1). And the MMIO
handler takes the config_lock while holding the srcu read lock (#0).
Break dependency #1, by registering the distributor and redistributors
without holding config_lock. The ITS also uses kvm_io_bus_register_dev()
but already relies on slots_lock to serialize calls.
The distributor iodev is created on the first KVM_RUN call. Multiple
threads will race for vgic initialization, and only the first one will
see !vgic_ready() under the lock. To serialize those threads, rely on
slots_lock rather than config_lock.
Redistributors are created earlier, through KVM_DEV_ARM_VGIC_GRP_ADDR
ioctls and vCPU creation. Similarly, serialize the iodev creation with
slots_lock, and the rest with config_lock.
Fixes: f003277311 ("KVM: arm64: Use config_lock to protect vgic state")
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230518100914.2837292-2-jean-philippe@linaro.org
* kvm-arm64/pgtable-fixes-6.4:
: .
: Fixes for concurrent S2 mapping race from Oliver:
:
: "So it appears that there is a race between two parallel stage-2 map
: walkers that could lead to mapping the incorrect PA for a given IPA, as
: the IPA -> PA relationship picks up an unintended offset. This series
: eliminates the problem by using the current IPA of the walk as the
: source-of-truth regarding where we are in a map operation."
: .
KVM: arm64: Constify start/end/phys fields of the pgtable walker data
KVM: arm64: Infer PA offset from VA in hyp map walker
KVM: arm64: Infer the PA offset from IPA in stage-2 map walker
Signed-off-by: Marc Zyngier <maz@kernel.org>
* kvm-arm64/misc-6.4:
: .
: Minor changes for 6.4:
:
: - Make better use of the bitmap API (bitmap_zero, bitmap_zalloc...)
:
: - FP/SVE/SME documentation update, in the hope that this field
: becomes clearer...
:
: - Add workaround for the usual Apple SEIS brokenness
:
: - Random comment fixes
: .
KVM: arm64: vgic: Add Apple M2 PRO/MAX cpus to the list of broken SEIS implementations
KVM: arm64: Clarify host SME state management
KVM: arm64: Restructure check for SVE support in FP trap handler
KVM: arm64: Document check for TIF_FOREIGN_FPSTATE
KVM: arm64: Fix repeated words in comments
KVM: arm64: Use the bitmap API to allocate bitmaps
KVM: arm64: Slightly optimize flush_context()
Signed-off-by: Marc Zyngier <maz@kernel.org>
Unsurprisingly, the M2 PRO is also affected by the SEIS bug, so add it
to the naughty list. And since M2 MAX is likely to be of the same ilk,
flag it as well.
Tested on a M2 PRO mini machine.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Zenghui Yu <yuzenghui@huawei.com>
Link: https://lore.kernel.org/r/20230501182141.39770-1-maz@kernel.org
* More phys_to_virt conversions
* Improvement of AP management for VSIE (nested virtualization)
ARM64:
* Numerous fixes for the pathological lock inversion issue that
plagued KVM/arm64 since... forever.
* New framework allowing SMCCC-compliant hypercalls to be forwarded
to userspace, hopefully paving the way for some more features
being moved to VMMs rather than be implemented in the kernel.
* Large rework of the timer code to allow a VM-wide offset to be
applied to both virtual and physical counters as well as a
per-timer, per-vcpu offset that complements the global one.
This last part allows the NV timer code to be implemented on
top.
* A small set of fixes to make sure that we don't change anything
affecting the EL1&0 translation regime just after having having
taken an exception to EL2 until we have executed a DSB. This
ensures that speculative walks started in EL1&0 have completed.
* The usual selftest fixes and improvements.
KVM x86 changes for 6.4:
* Optimize CR0.WP toggling by avoiding an MMU reload when TDP is enabled,
and by giving the guest control of CR0.WP when EPT is enabled on VMX
(VMX-only because SVM doesn't support per-bit controls)
* Add CR0/CR4 helpers to query single bits, and clean up related code
where KVM was interpreting kvm_read_cr4_bits()'s "unsigned long" return
as a bool
* Move AMD_PSFD to cpufeatures.h and purge KVM's definition
* Avoid unnecessary writes+flushes when the guest is only adding new PTEs
* Overhaul .sync_page() and .invlpg() to utilize .sync_page()'s optimizations
when emulating invalidations
* Clean up the range-based flushing APIs
* Revamp the TDP MMU's reaping of Accessed/Dirty bits to clear a single
A/D bit using a LOCK AND instead of XCHG, and skip all of the "handle
changed SPTE" overhead associated with writing the entire entry
* Track the number of "tail" entries in a pte_list_desc to avoid having
to walk (potentially) all descriptors during insertion and deletion,
which gets quite expensive if the guest is spamming fork()
* Disallow virtualizing legacy LBRs if architectural LBRs are available,
the two are mutually exclusive in hardware
* Disallow writes to immutable feature MSRs (notably PERF_CAPABILITIES)
after KVM_RUN, similar to CPUID features
* Overhaul the vmx_pmu_caps selftest to better validate PERF_CAPABILITIES
* Apply PMU filters to emulated events and add test coverage to the
pmu_event_filter selftest
x86 AMD:
* Add support for virtual NMIs
* Fixes for edge cases related to virtual interrupts
x86 Intel:
* Don't advertise XTILE_CFG in KVM_GET_SUPPORTED_CPUID if XTILE_DATA is
not being reported due to userspace not opting in via prctl()
* Fix a bug in emulation of ENCLS in compatibility mode
* Allow emulation of NOP and PAUSE for L2
* AMX selftests improvements
* Misc cleanups
MIPS:
* Constify MIPS's internal callbacks (a leftover from the hardware enabling
rework that landed in 6.3)
Generic:
* Drop unnecessary casts from "void *" throughout kvm_main.c
* Tweak the layout of "struct kvm_mmu_memory_cache" to shrink the struct
size by 8 bytes on 64-bit kernels by utilizing a padding hole
Documentation:
* Fix goof introduced by the conversion to rST
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmRNExkUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroNyjwf+MkzDael9y9AsOZoqhEZ5OsfQYJ32
Im5ZVYsPRU2K5TuoWql6meIihgclCj1iIU32qYHa2F1WYt2rZ72rJp+HoY8b+TaI
WvF0pvNtqQyg3iEKUBKPA4xQ6mj7RpQBw86qqiCHmlfNt0zxluEGEPxH8xrWcfhC
huDQ+NUOdU7fmJ3rqGitCvkUbCuZNkw3aNPR8dhU8RAWrwRzP2hBOmdxIeo81WWY
XMEpJSijbGpXL9CvM0Jz9nOuMJwZwCCBGxg1vSQq0xTfLySNMxzvWZC2GFaBjucb
j0UOQ7yE0drIZDVhd3sdNslubXXU6FcSEzacGQb9aigMUon3Tem9SHi7Kw==
=S2Hq
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"s390:
- More phys_to_virt conversions
- Improvement of AP management for VSIE (nested virtualization)
ARM64:
- Numerous fixes for the pathological lock inversion issue that
plagued KVM/arm64 since... forever.
- New framework allowing SMCCC-compliant hypercalls to be forwarded
to userspace, hopefully paving the way for some more features being
moved to VMMs rather than be implemented in the kernel.
- Large rework of the timer code to allow a VM-wide offset to be
applied to both virtual and physical counters as well as a
per-timer, per-vcpu offset that complements the global one. This
last part allows the NV timer code to be implemented on top.
- A small set of fixes to make sure that we don't change anything
affecting the EL1&0 translation regime just after having having
taken an exception to EL2 until we have executed a DSB. This
ensures that speculative walks started in EL1&0 have completed.
- The usual selftest fixes and improvements.
x86:
- Optimize CR0.WP toggling by avoiding an MMU reload when TDP is
enabled, and by giving the guest control of CR0.WP when EPT is
enabled on VMX (VMX-only because SVM doesn't support per-bit
controls)
- Add CR0/CR4 helpers to query single bits, and clean up related code
where KVM was interpreting kvm_read_cr4_bits()'s "unsigned long"
return as a bool
- Move AMD_PSFD to cpufeatures.h and purge KVM's definition
- Avoid unnecessary writes+flushes when the guest is only adding new
PTEs
- Overhaul .sync_page() and .invlpg() to utilize .sync_page()'s
optimizations when emulating invalidations
- Clean up the range-based flushing APIs
- Revamp the TDP MMU's reaping of Accessed/Dirty bits to clear a
single A/D bit using a LOCK AND instead of XCHG, and skip all of
the "handle changed SPTE" overhead associated with writing the
entire entry
- Track the number of "tail" entries in a pte_list_desc to avoid
having to walk (potentially) all descriptors during insertion and
deletion, which gets quite expensive if the guest is spamming
fork()
- Disallow virtualizing legacy LBRs if architectural LBRs are
available, the two are mutually exclusive in hardware
- Disallow writes to immutable feature MSRs (notably
PERF_CAPABILITIES) after KVM_RUN, similar to CPUID features
- Overhaul the vmx_pmu_caps selftest to better validate
PERF_CAPABILITIES
- Apply PMU filters to emulated events and add test coverage to the
pmu_event_filter selftest
- AMD SVM:
- Add support for virtual NMIs
- Fixes for edge cases related to virtual interrupts
- Intel AMX:
- Don't advertise XTILE_CFG in KVM_GET_SUPPORTED_CPUID if
XTILE_DATA is not being reported due to userspace not opting in
via prctl()
- Fix a bug in emulation of ENCLS in compatibility mode
- Allow emulation of NOP and PAUSE for L2
- AMX selftests improvements
- Misc cleanups
MIPS:
- Constify MIPS's internal callbacks (a leftover from the hardware
enabling rework that landed in 6.3)
Generic:
- Drop unnecessary casts from "void *" throughout kvm_main.c
- Tweak the layout of "struct kvm_mmu_memory_cache" to shrink the
struct size by 8 bytes on 64-bit kernels by utilizing a padding
hole
Documentation:
- Fix goof introduced by the conversion to rST"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (211 commits)
KVM: s390: pci: fix virtual-physical confusion on module unload/load
KVM: s390: vsie: clarifications on setting the APCB
KVM: s390: interrupt: fix virtual-physical confusion for next alert GISA
KVM: arm64: Have kvm_psci_vcpu_on() use WRITE_ONCE() to update mp_state
KVM: arm64: Acquire mp_state_lock in kvm_arch_vcpu_ioctl_vcpu_init()
KVM: selftests: Test the PMU event "Instructions retired"
KVM: selftests: Copy full counter values from guest in PMU event filter test
KVM: selftests: Use error codes to signal errors in PMU event filter test
KVM: selftests: Print detailed info in PMU event filter asserts
KVM: selftests: Add helpers for PMC asserts in PMU event filter test
KVM: selftests: Add a common helper for the PMU event filter guest code
KVM: selftests: Fix spelling mistake "perrmited" -> "permitted"
KVM: arm64: vhe: Drop extra isb() on guest exit
KVM: arm64: vhe: Synchronise with page table walker on MMU update
KVM: arm64: pkvm: Document the side effects of kvm_flush_dcache_to_poc()
KVM: arm64: nvhe: Synchronise with page table walker on TLBI
KVM: arm64: Handle 32bit CNTPCTSS traps
KVM: arm64: nvhe: Synchronise with page table walker on vcpu run
KVM: arm64: vgic: Don't acquire its_lock before config_lock
KVM: selftests: Add test to verify KVM's supported XCR0
...
switching from a user process to a kernel thread.
- More folio conversions from Kefeng Wang, Zhang Peng and Pankaj Raghav.
- zsmalloc performance improvements from Sergey Senozhatsky.
- Yue Zhao has found and fixed some data race issues around the
alteration of memcg userspace tunables.
- VFS rationalizations from Christoph Hellwig:
- removal of most of the callers of write_one_page().
- make __filemap_get_folio()'s return value more useful
- Luis Chamberlain has changed tmpfs so it no longer requires swap
backing. Use `mount -o noswap'.
- Qi Zheng has made the slab shrinkers operate locklessly, providing
some scalability benefits.
- Keith Busch has improved dmapool's performance, making part of its
operations O(1) rather than O(n).
- Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd,
permitting userspace to wr-protect anon memory unpopulated ptes.
- Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive rather
than exclusive, and has fixed a bunch of errors which were caused by its
unintuitive meaning.
- Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature,
which causes minor faults to install a write-protected pte.
- Vlastimil Babka has done some maintenance work on vma_merge():
cleanups to the kernel code and improvements to our userspace test
harness.
- Cleanups to do_fault_around() by Lorenzo Stoakes.
- Mike Rapoport has moved a lot of initialization code out of various
mm/ files and into mm/mm_init.c.
- Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for
DRM, but DRM doesn't use it any more.
- Lorenzo has also coverted read_kcore() and vread() to use iterators
and has thereby removed the use of bounce buffers in some cases.
- Lorenzo has also contributed further cleanups of vma_merge().
- Chaitanya Prakash provides some fixes to the mmap selftesting code.
- Matthew Wilcox changes xfs and afs so they no longer take sleeping
locks in ->map_page(), a step towards RCUification of pagefaults.
- Suren Baghdasaryan has improved mmap_lock scalability by switching to
per-VMA locking.
- Frederic Weisbecker has reworked the percpu cache draining so that it
no longer causes latency glitches on cpu isolated workloads.
- Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig
logic.
- Liu Shixin has changed zswap's initialization so we no longer waste a
chunk of memory if zswap is not being used.
- Yosry Ahmed has improved the performance of memcg statistics flushing.
- David Stevens has fixed several issues involving khugepaged,
userfaultfd and shmem.
- Christoph Hellwig has provided some cleanup work to zram's IO-related
code paths.
- David Hildenbrand has fixed up some issues in the selftest code's
testing of our pte state changing.
- Pankaj Raghav has made page_endio() unneeded and has removed it.
- Peter Xu contributed some rationalizations of the userfaultfd
selftests.
- Yosry Ahmed has fixed an issue around memcg's page recalim accounting.
- Chaitanya Prakash has fixed some arm-related issues in the
selftests/mm code.
- Longlong Xia has improved the way in which KSM handles hwpoisoned
pages.
- Peter Xu fixes a few issues with uffd-wp at fork() time.
- Stefan Roesch has changed KSM so that it may now be used on a
per-process and per-cgroup basis.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZEr3zQAKCRDdBJ7gKXxA
jlLoAP0fpQBipwFxED0Us4SKQfupV6z4caXNJGPeay7Aj11/kQD/aMRC2uPfgr96
eMG3kwn2pqkB9ST2QpkaRbxA//eMbQY=
=J+Dj
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of
switching from a user process to a kernel thread.
- More folio conversions from Kefeng Wang, Zhang Peng and Pankaj
Raghav.
- zsmalloc performance improvements from Sergey Senozhatsky.
- Yue Zhao has found and fixed some data race issues around the
alteration of memcg userspace tunables.
- VFS rationalizations from Christoph Hellwig:
- removal of most of the callers of write_one_page()
- make __filemap_get_folio()'s return value more useful
- Luis Chamberlain has changed tmpfs so it no longer requires swap
backing. Use `mount -o noswap'.
- Qi Zheng has made the slab shrinkers operate locklessly, providing
some scalability benefits.
- Keith Busch has improved dmapool's performance, making part of its
operations O(1) rather than O(n).
- Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd,
permitting userspace to wr-protect anon memory unpopulated ptes.
- Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive
rather than exclusive, and has fixed a bunch of errors which were
caused by its unintuitive meaning.
- Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature,
which causes minor faults to install a write-protected pte.
- Vlastimil Babka has done some maintenance work on vma_merge():
cleanups to the kernel code and improvements to our userspace test
harness.
- Cleanups to do_fault_around() by Lorenzo Stoakes.
- Mike Rapoport has moved a lot of initialization code out of various
mm/ files and into mm/mm_init.c.
- Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for
DRM, but DRM doesn't use it any more.
- Lorenzo has also coverted read_kcore() and vread() to use iterators
and has thereby removed the use of bounce buffers in some cases.
- Lorenzo has also contributed further cleanups of vma_merge().
- Chaitanya Prakash provides some fixes to the mmap selftesting code.
- Matthew Wilcox changes xfs and afs so they no longer take sleeping
locks in ->map_page(), a step towards RCUification of pagefaults.
- Suren Baghdasaryan has improved mmap_lock scalability by switching to
per-VMA locking.
- Frederic Weisbecker has reworked the percpu cache draining so that it
no longer causes latency glitches on cpu isolated workloads.
- Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig
logic.
- Liu Shixin has changed zswap's initialization so we no longer waste a
chunk of memory if zswap is not being used.
- Yosry Ahmed has improved the performance of memcg statistics
flushing.
- David Stevens has fixed several issues involving khugepaged,
userfaultfd and shmem.
- Christoph Hellwig has provided some cleanup work to zram's IO-related
code paths.
- David Hildenbrand has fixed up some issues in the selftest code's
testing of our pte state changing.
- Pankaj Raghav has made page_endio() unneeded and has removed it.
- Peter Xu contributed some rationalizations of the userfaultfd
selftests.
- Yosry Ahmed has fixed an issue around memcg's page recalim
accounting.
- Chaitanya Prakash has fixed some arm-related issues in the
selftests/mm code.
- Longlong Xia has improved the way in which KSM handles hwpoisoned
pages.
- Peter Xu fixes a few issues with uffd-wp at fork() time.
- Stefan Roesch has changed KSM so that it may now be used on a
per-process and per-cgroup basis.
* tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits)
mm,unmap: avoid flushing TLB in batch if PTE is inaccessible
shmem: restrict noswap option to initial user namespace
mm/khugepaged: fix conflicting mods to collapse_file()
sparse: remove unnecessary 0 values from rc
mm: move 'mmap_min_addr' logic from callers into vm_unmapped_area()
hugetlb: pte_alloc_huge() to replace huge pte_alloc_map()
maple_tree: fix allocation in mas_sparse_area()
mm: do not increment pgfault stats when page fault handler retries
zsmalloc: allow only one active pool compaction context
selftests/mm: add new selftests for KSM
mm: add new KSM process and sysfs knobs
mm: add new api to enable ksm per process
mm: shrinkers: fix debugfs file permissions
mm: don't check VMA write permissions if the PTE/PMD indicates write permissions
migrate_pages_batch: fix statistics for longterm pin retry
userfaultfd: use helper function range_in_vma()
lib/show_mem.c: use for_each_populated_zone() simplify code
mm: correct arg in reclaim_pages()/reclaim_clean_pages_from_list()
fs/buffer: convert create_page_buffers to folio_create_buffers
fs/buffer: add folio_create_empty_buffers helper
...
- Numerous fixes for the pathological lock inversion issue that
plagued KVM/arm64 since... forever.
- New framework allowing SMCCC-compliant hypercalls to be forwarded
to userspace, hopefully paving the way for some more features
being moved to VMMs rather than be implemented in the kernel.
- Large rework of the timer code to allow a VM-wide offset to be
applied to both virtual and physical counters as well as a
per-timer, per-vcpu offset that complements the global one.
This last part allows the NV timer code to be implemented on
top.
- A small set of fixes to make sure that we don't change anything
affecting the EL1&0 translation regime just after having having
taken an exception to EL2 until we have executed a DSB. This
ensures that speculative walks started in EL1&0 have completed.
- The usual selftest fixes and improvements.
-----BEGIN PGP SIGNATURE-----
iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmRCZIwPHG1hekBrZXJu
ZWwub3JnAAoJECPQ0LrRPXpDoZ8P/ioXAdDbAE4hTuyD2YdKJ3IGWN3pg52Z7xc2
rBXXFrbK9+n9FEc3AVdHoGsRPDP0Ynl+apj+aB0Klr/Fl0KKqac+W0ARX9rn1mI1
HjeygFPaGnXjMUp0BjeSLS+g3b0gebELJ6R1QEe1/MIPb8Se7M1y3ZpMWdhe0PPL
vyzw3LZq2OAlLgWKZhAfhh03qdr2kqJxypYs6nMrcexfn8dXT78dsYKW1nXmqKcE
61Gg23MDPUoexYpUhm+ym5t8hltoI1di8faPmxEpaFzpSDyAg8V5vo6LiW9jn3cf
RX0Sikk1laiRAhVbbIFCKC148vFyKxum3scpKyb91Qc+sK1kmIcxvEqlc6SfG9je
+5ndZwAfXtW6SMSOyX8y5fXbee7M0sx3n3le9BNgwXfmLWg/GHXJ544dJgVIlf/e
0Z+8QnP1IUDfARR/b2FlW7A7XLzNHQzO379ekcAdUptbGwlS9CrW6SJ83QR7K6fB
bh0aSSELKsD7pX8wnNyNACvmz2zL12ITlDKdZWUr8MSxyTjgVy7s0BDsQT3sbrA1
1sH++RvUWfC2k7tVT3vjZFzUDlPw3bnZmo5YMWRTMbXEdr1V5rDw5F5IXit13KeT
8bk0hnJgnLmyoX2A17v5dkFMIKD7p13tqDRdfFcn0ru63HIKxgkS3ITkDmsAQELK
DHT7RBE0
=Bhta
-----END PGP SIGNATURE-----
Merge tag 'kvmarm-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 updates for 6.4
- Numerous fixes for the pathological lock inversion issue that
plagued KVM/arm64 since... forever.
- New framework allowing SMCCC-compliant hypercalls to be forwarded
to userspace, hopefully paving the way for some more features
being moved to VMMs rather than be implemented in the kernel.
- Large rework of the timer code to allow a VM-wide offset to be
applied to both virtual and physical counters as well as a
per-timer, per-vcpu offset that complements the global one.
This last part allows the NV timer code to be implemented on
top.
- A small set of fixes to make sure that we don't change anything
affecting the EL1&0 translation regime just after having having
taken an exception to EL2 until we have executed a DSB. This
ensures that speculative walks started in EL1&0 have completed.
- The usual selftest fixes and improvements.
ACPI:
* Improve error reporting when failing to manage SDEI on AGDI device
removal
Assembly routines:
* Improve register constraints so that the compiler can make use of
the zero register instead of moving an immediate #0 into a GPR
* Allow the compiler to allocate the registers used for CAS
instructions
CPU features and system registers:
* Cleanups to the way in which CPU features are identified from the
ID register fields
* Extend system register definition generation to handle Enum types
when defining shared register fields
* Generate definitions for new _EL2 registers and add new fields
for ID_AA64PFR1_EL1
* Allow SVE to be disabled separately from SME on the kernel
command-line
Tracing:
* Support for "direct calls" in ftrace, which enables BPF tracing
for arm64
Kdump:
* Don't bother unmapping the crashkernel from the linear mapping,
which then allows us to use huge (block) mappings and reduce
TLB pressure when a crashkernel is loaded.
Memory management:
* Try again to remove data cache invalidation from the coherent DMA
allocation path
* Simplify the fixmap code by mapping at page granularity
* Allow the kfence pool to be allocated early, preventing the rest
of the linear mapping from being forced to page granularity
Perf and PMU:
* Move CPU PMU code out to drivers/perf/ where it can be reused
by the 32-bit ARM architecture when running on ARMv8 CPUs
* Fix race between CPU PMU probing and pKVM host de-privilege
* Add support for Apple M2 CPU PMU
* Adjust the generic PERF_COUNT_HW_BRANCH_INSTRUCTIONS event
dynamically, depending on what the CPU actually supports
* Minor fixes and cleanups to system PMU drivers
Stack tracing:
* Use the XPACLRI instruction to strip PAC from pointers, rather
than rolling our own function in C
* Remove redundant PAC removal for toolchains that handle this in
their builtins
* Make backtracing more resilient in the face of instrumentation
Miscellaneous:
* Fix single-step with KGDB
* Remove harmless warning when 'nokaslr' is passed on the kernel
command-line
* Minor fixes and cleanups across the board
-----BEGIN PGP SIGNATURE-----
iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAmRChcwQHHdpbGxAa2Vy
bmVsLm9yZwAKCRC3rHDchMFjNCgBCADFvkYY9ESztSnd3EpiMbbAzgRCQBiA5H7U
F2Wc+hIWgeAeUEttSH22+F16r6Jb0gbaDvsuhtN2W/rwQhKNbCU0MaUME05MPmg2
AOp+RZb2vdT5i5S5dC6ZM6G3T6u9O78LBWv2JWBdd6RIybamEn+RL00ep2WAduH7
n1FgTbsKgnbScD2qd4K1ejZ1W/BQMwYulkNpyTsmCIijXM12lkzFlxWnMtky3uhR
POpawcIZzXvWI02QAX+SIdynGChQV3VP+dh9GuFbt7ASigDEhgunvfUYhZNSaqf4
+/q0O8toCtmQJBUhF0DEDSB5T8SOz5v9CKxKuwfaX6Trq0ixFQpZ
=78L9
-----END PGP SIGNATURE-----
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Will Deacon:
"ACPI:
- Improve error reporting when failing to manage SDEI on AGDI device
removal
Assembly routines:
- Improve register constraints so that the compiler can make use of
the zero register instead of moving an immediate #0 into a GPR
- Allow the compiler to allocate the registers used for CAS
instructions
CPU features and system registers:
- Cleanups to the way in which CPU features are identified from the
ID register fields
- Extend system register definition generation to handle Enum types
when defining shared register fields
- Generate definitions for new _EL2 registers and add new fields for
ID_AA64PFR1_EL1
- Allow SVE to be disabled separately from SME on the kernel
command-line
Tracing:
- Support for "direct calls" in ftrace, which enables BPF tracing for
arm64
Kdump:
- Don't bother unmapping the crashkernel from the linear mapping,
which then allows us to use huge (block) mappings and reduce TLB
pressure when a crashkernel is loaded.
Memory management:
- Try again to remove data cache invalidation from the coherent DMA
allocation path
- Simplify the fixmap code by mapping at page granularity
- Allow the kfence pool to be allocated early, preventing the rest of
the linear mapping from being forced to page granularity
Perf and PMU:
- Move CPU PMU code out to drivers/perf/ where it can be reused by
the 32-bit ARM architecture when running on ARMv8 CPUs
- Fix race between CPU PMU probing and pKVM host de-privilege
- Add support for Apple M2 CPU PMU
- Adjust the generic PERF_COUNT_HW_BRANCH_INSTRUCTIONS event
dynamically, depending on what the CPU actually supports
- Minor fixes and cleanups to system PMU drivers
Stack tracing:
- Use the XPACLRI instruction to strip PAC from pointers, rather than
rolling our own function in C
- Remove redundant PAC removal for toolchains that handle this in
their builtins
- Make backtracing more resilient in the face of instrumentation
Miscellaneous:
- Fix single-step with KGDB
- Remove harmless warning when 'nokaslr' is passed on the kernel
command-line
- Minor fixes and cleanups across the board"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (72 commits)
KVM: arm64: Ensure CPU PMU probes before pKVM host de-privilege
arm64: kexec: include reboot.h
arm64: delete dead code in this_cpu_set_vectors()
arm64/cpufeature: Use helper macro to specify ID register for capabilites
drivers/perf: hisi: add NULL check for name
drivers/perf: hisi: Remove redundant initialized of pmu->name
arm64/cpufeature: Consistently use symbolic constants for min_field_value
arm64/cpufeature: Pull out helper for CPUID register definitions
arm64/sysreg: Convert HFGITR_EL2 to automatic generation
ACPI: AGDI: Improve error reporting for problems during .remove()
arm64: kernel: Fix kernel warning when nokaslr is passed to commandline
perf/arm-cmn: Fix port detection for CMN-700
arm64: kgdb: Set PSTATE.SS to 1 to re-enable single-step
arm64: move PAC masks to <asm/pointer_auth.h>
arm64: use XPACLRI to strip PAC
arm64: avoid redundant PAC stripping in __builtin_return_address()
arm64/sme: Fix some comments of ARM SME
arm64/signal: Alloc tpidr2 sigframe after checking system_supports_tpidr2()
arm64/signal: Use system_supports_tpidr2() to check TPIDR2
arm64/idreg: Don't disable SME when disabling SVE
...
o MAINTAINERS files additions and changes.
o Fix hotplug warning in nohz code.
o Tick dependency changes by Zqiang.
o Lazy-RCU shrinker fixes by Zqiang.
o rcu-tasks stall reporting improvements by Neeraj.
o Initial changes for renaming of k[v]free_rcu() to its new k[v]free_rcu_mightsleep()
name for robustness.
o Documentation Updates:
o Significant changes to srcu_struct size.
o Deadlock detection for srcu_read_lock() vs synchronize_srcu() from Boqun.
o rcutorture and rcu-related tool, which are targeted for v6.4 from Boqun's tree.
o Other misc changes.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEcoCIrlGe4gjE06JJqA4nf2o45hAFAmQuBnIACgkQqA4nf2o4
5hACVRAAoXu7/gfh5Pjw9O4E4pCdPJKsZZVYrcrVGrq6NAxRn6M1SgurAdC5grj2
96x0waoGaiO82V0H5iJMcKdAVu67x9R8WaQ1JoxN75Efn8h9W4TguB87TV1gk0xS
eZ18b/CyEaM5mNb80DFFF4FLohy5737p/kNTMqXQdUyR1BsDl16iRMgjiBiFhNUx
yPo8Y2kC2U2OTbldZgaE7s9bQO3xxEcifx93sGWsAex/gx54FYNisiwSlCOSgOE+
XkYo/OKk8Xvr82tLVX8XQVEPCMJ+rxea8T5zSs8/alvsPq7gA8wW3y6fsoa3vUU/
+Gd+W+Q/OsONIDtp8rQAY1qsD0ScDpaR8052RSH0zTa7pj8HsQgE5PjZ+cJW0SEi
cKN+Oe8+ETqKald+xZ6PDf58O212VLrru3RpQWrOQcJ7fmKmfT4REK0RcbLgg4qT
CBgOo6eg+ub4pxq2y11LZJBNTv1/S7xAEzFE0kArew64KB2gyVud0VJRZVAJnEfe
93QQVDFrwK2bhgWQZ6J6IbTvGeQW0L93IibuaU6jhZPR283VtUIIvM7vrOylN7Fq
4jsae0T7YGYfKUhgTpm7rCnm8A/D3Ni8MY0sKYYgDSyKmZUsnpI5wpx1xke4lwwV
ErrY46RCFa+k8wscc6iWfB4cGXyyFHyu+wtyg0KpFn5JAzcfz4A=
=Rgbj
-----END PGP SIGNATURE-----
Merge tag 'rcu.6.4.april5.2023.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux
Pull RCU updates from Joel Fernandes:
- Updates and additions to MAINTAINERS files, with Boqun being added to
the RCU entry and Zqiang being added as an RCU reviewer.
I have also transitioned from reviewer to maintainer; however, Paul
will be taking over sending RCU pull-requests for the next merge
window.
- Resolution of hotplug warning in nohz code, achieved by fixing
cpu_is_hotpluggable() through interaction with the nohz subsystem.
Tick dependency modifications by Zqiang, focusing on fixing usage of
the TICK_DEP_BIT_RCU_EXP bitmask.
- Avoid needless calls to the rcu-lazy shrinker for CONFIG_RCU_LAZY=n
kernels, fixed by Zqiang.
- Improvements to rcu-tasks stall reporting by Neeraj.
- Initial renaming of k[v]free_rcu() to k[v]free_rcu_mightsleep() for
increased robustness, affecting several components like mac802154,
drbd, vmw_vmci, tracing, and more.
A report by Eric Dumazet showed that the API could be unknowingly
used in an atomic context, so we'd rather make sure they know what
they're asking for by being explicit:
https://lore.kernel.org/all/20221202052847.2623997-1-edumazet@google.com/
- Documentation updates, including corrections to spelling,
clarifications in comments, and improvements to the srcu_size_state
comments.
- Better srcu_struct cache locality for readers, by adjusting the size
of srcu_struct in support of SRCU usage by Christoph Hellwig.
- Teach lockdep to detect deadlocks between srcu_read_lock() vs
synchronize_srcu() contributed by Boqun.
Previously lockdep could not detect such deadlocks, now it can.
- Integration of rcutorture and rcu-related tools, targeted for v6.4
from Boqun's tree, featuring new SRCU deadlock scenarios, test_nmis
module parameter, and more
- Miscellaneous changes, various code cleanups and comment improvements
* tag 'rcu.6.4.april5.2023.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux: (71 commits)
checkpatch: Error out if deprecated RCU API used
mac802154: Rename kfree_rcu() to kvfree_rcu_mightsleep()
rcuscale: Rename kfree_rcu() to kfree_rcu_mightsleep()
ext4/super: Rename kfree_rcu() to kfree_rcu_mightsleep()
net/mlx5: Rename kfree_rcu() to kfree_rcu_mightsleep()
net/sysctl: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
lib/test_vmalloc.c: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
tracing: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
misc: vmw_vmci: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
drbd: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
rcu: Protect rcu_print_task_exp_stall() ->exp_tasks access
rcu: Avoid stack overflow due to __rcu_irq_enter_check_tick() being kprobe-ed
rcu-tasks: Report stalls during synchronize_srcu() in rcu_tasks_postscan()
rcu: Permit start_poll_synchronize_rcu_expedited() to be invoked early
rcu: Remove never-set needwake assignment from rcu_report_qs_rdp()
rcu: Register rcu-lazy shrinker only for CONFIG_RCU_LAZY=y kernels
rcu: Fix missing TICK_DEP_MASK_RCU_EXP dependency check
rcu: Fix set/clear TICK_DEP_BIT_RCU_EXP bitmask race
rcu/trace: use strscpy() to instead of strncpy()
tick/nohz: Fix cpu_is_hotpluggable() by checking with nohz subsystem
...
- Plug a buffer overflow due to the use of the user-provided register
width for firmware regs. Outright reject accesses where the
user register width does not match the kernel representation.
- Protect non-atomic RMW operations on vCPU flags against preemption,
as an update to the flags by an intervening preemption could be lost.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSNXHjWXuzMZutrKNKivnWIJHzdFgUCZEALwAAKCRCivnWIJHzd
Fq2nAQDPwWpuMOrQYZCM5RF5/8yai5OPKB0NcvK00UDM4f6E1AEAxDud/KT9i15c
dmBfGn1zDzW4XAENrOkxjQWuBqi/tQg=
=Lxvf
-----END PGP SIGNATURE-----
Merge tag 'kvmarm-fixes-6.3-4' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 fixes for 6.3, part #4
- Plug a buffer overflow due to the use of the user-provided register
width for firmware regs. Outright reject accesses where the
user register width does not match the kernel representation.
- Protect non-atomic RMW operations on vCPU flags against preemption,
as an update to the flags by an intervening preemption could be lost.
Normally when running a guest we do not touch the floating point
register state until first use of floating point by the guest, saving
the current state and loading the guest state at that point. This has
been found to offer a performance benefit in common cases. However
currently if SME is active when switching to a guest then we exit
streaming mode, disable ZA and invalidate the floating point register
state prior to starting the guest.
The exit from streaming mode is required for correct guest operation, if
we leave streaming mode enabled then many non-SME operations can
generate SME traps (eg, SVE operations will become streaming SVE
operations). If EL1 leaves CPACR_EL1.SMEN disabled then the host is
unable to intercept these traps. This will mean that a SME unaware guest
will see SME exceptions which will confuse it. Disabling streaming mode
also avoids creating spurious indications of usage of the SME hardware
which could impact system performance, especially with shared SME
implementations. Document the requirement to exit streaming mode
clearly.
There is no issue with guest operation caused by PSTATE.ZA so we can
defer handling for that until first floating point usage, do so if the
register state is not that of the current task and hence has already
been saved. We could also do this for the case where the register state
is that for the current task however this is very unlikely to happen and
would require disproportionate effort so continue to save the state in
that case.
Saving this state on first use would require that we map and unmap
storage for the host version of these registers for use by the
hypervisor, taking care to deal with protected KVM and the fact that the
host can free or reallocate the backing storage. Given that the strong
recommendation is that applications should only keep PSTATE.ZA enabled
when the state it enables is in active use it is difficult to see a case
where a VMM would wish to do this, it would need to not only be using
SME but also running the guest in the middle of SME usage. This can be
revisited in the future if a use case does arises, in the interim such
tasks will work but experience a performance overhead.
This brings our handling of SME more into line with our handling of
other floating point state and documents more clearly the constraints we
have, especially around streaming mode.
Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221214-kvm-arm64-sme-context-switch-v2-3-57ba0082e9ff@kernel.org
We share the same handler for general floating point and SVE traps with a
check to make sure we don't handle any SVE traps if the system doesn't
have SVE support. Since we will be adding SME support and wishing to handle
that along with other FP related traps rewrite the check to be more scalable
and a bit clearer too, ensuring we don't misidentify SME traps as SVE ones.
Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221214-kvm-arm64-sme-context-switch-v2-2-57ba0082e9ff@kernel.org
In kvm_arch_vcpu_load_fp() we unconditionally set the current FP state
to FP_STATE_HOST_OWNED, this will be overridden to FP_STATE_NONE if
TIF_FOREIGN_FPSTATE is set but the check is deferred until
kvm_arch_vcpu_ctxflush_fp() where we are no longer preemptable. Add a
comment to this effect to help avoid people being concerned about the
lack of a check and discover where the check is done.
Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221214-kvm-arm64-sme-context-switch-v2-1-57ba0082e9ff@kernel.org
As we are revamping the way the pgtable walker evaluates some of the
data, make it clear that we rely on somew of the fields to be constant
across the lifetime of a walk.
For this, flag the start, end and phys fields of the walk data as
'const', which will generate an error if we were to accidentally
update these fields again.
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Similar to the recently fixed stage-2 walker, the hyp map walker
increments the PA and VA of a walk separately. Unlike stage-2, there is
no bug here as the map walker has exclusive access to the stage-1 page
tables.
Nonetheless, in the interest of continuity throughout the page table
code, tweak the hyp map walker to avoid incrementing the PA and instead
use the VA as the authoritative source of how far along a table walk has
gotten. Calculate the PA to use for a leaf PTE by adding the offset of
the VA from the start of the walk to the starting PA.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230421071606.1603916-3-oliver.upton@linux.dev
Until now, the page table walker counted increments to the PA and IPA
of a walk in two separate places. While the PA is incremented as soon as
a leaf PTE is installed in stage2_map_walker_try_leaf(), the IPA is
actually bumped in the generic table walker context. Critically,
__kvm_pgtable_visit() rereads the PTE after the LEAF callback returns
to work out if a table or leaf was installed, and only bumps the IPA for
a leaf PTE.
This arrangement worked fine when we handled faults behind the write lock,
as the walker had exclusive access to the stage-2 page tables. However,
commit 1577cb5823 ("KVM: arm64: Handle stage-2 faults in parallel")
started handling all stage-2 faults behind the read lock, opening up a
race where a walker could increment the PA but not the IPA of a walk.
Nothing good ensues, as the walker starts mapping with the incorrect
IPA -> PA relationship.
For example, assume that two vCPUs took a data abort on the same IPA.
One observes that dirty logging is disabled, and the other observed that
it is enabled:
vCPU attempting PMD mapping vCPU attempting PTE mapping
====================================== =====================================
/* install PMD */
stage2_make_pte(ctx, leaf);
data->phys += granule;
/* replace PMD with a table */
stage2_try_break_pte(ctx, data->mmu);
stage2_make_pte(ctx, table);
/* table is observed */
ctx.old = READ_ONCE(*ptep);
table = kvm_pte_table(ctx.old, level);
/*
* map walk continues w/o incrementing
* IPA.
*/
__kvm_pgtable_walk(..., level + 1);
Bring an end to the whole mess by using the IPA as the single source of
truth for how far along a walk has gotten. Work out the correct PA to
map by calculating the IPA offset from the beginning of the walk and add
that to the starting physical address.
Cc: stable@vger.kernel.org
Fixes: 1577cb5823 ("KVM: arm64: Handle stage-2 faults in parallel")
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230421071606.1603916-2-oliver.upton@linux.dev
* kvm-arm64/spec-ptw:
: .
: On taking an exception from EL1&0 to EL2(&0), the page table walker is
: allowed to carry on with speculative walks started from EL1&0 while
: running at EL2 (see R_LFHQG). Given that the PTW may be actively using
: the EL1&0 system registers, the only safe way to deal with it is to
: issue a DSB before changing any of it.
:
: We already did the right thing for SPE and TRBE, but ignored the PTW
: for unknown reasons (probably because the architecture wasn't crystal
: clear at the time).
:
: This requires a bit of surgery in the nvhe code, though most of these
: patches are comments so that my future self can understand the purpose
: of these barriers. The VHE code is largely unaffected, thanks to the
: DSB in the context switch.
: .
KVM: arm64: vhe: Drop extra isb() on guest exit
KVM: arm64: vhe: Synchronise with page table walker on MMU update
KVM: arm64: pkvm: Document the side effects of kvm_flush_dcache_to_poc()
KVM: arm64: nvhe: Synchronise with page table walker on TLBI
KVM: arm64: nvhe: Synchronise with page table walker on vcpu run
Signed-off-by: Marc Zyngier <maz@kernel.org>
* kvm-arm64/smccc-filtering:
: .
: SMCCC call filtering and forwarding to userspace, courtesy of
: Oliver Upton. From the cover letter:
:
: "The Arm SMCCC is rather prescriptive in regards to the allocation of
: SMCCC function ID ranges. Many of the hypercall ranges have an
: associated specification from Arm (FF-A, PSCI, SDEI, etc.) with some
: room for vendor-specific implementations.
:
: The ever-expanding SMCCC surface leaves a lot of work within KVM for
: providing new features. Furthermore, KVM implements its own
: vendor-specific ABI, with little room for other implementations (like
: Hyper-V, for example). Rather than cramming it all into the kernel we
: should provide a way for userspace to handle hypercalls."
: .
KVM: selftests: Fix spelling mistake "KVM_HYPERCAL_EXIT_SMC" -> "KVM_HYPERCALL_EXIT_SMC"
KVM: arm64: Test that SMC64 arch calls are reserved
KVM: arm64: Prevent userspace from handling SMC64 arch range
KVM: arm64: Expose SMC/HVC width to userspace
KVM: selftests: Add test for SMCCC filter
KVM: selftests: Add a helper for SMCCC calls with SMC instruction
KVM: arm64: Let errors from SMCCC emulation to reach userspace
KVM: arm64: Return NOT_SUPPORTED to guest for unknown PSCI version
KVM: arm64: Introduce support for userspace SMCCC filtering
KVM: arm64: Add support for KVM_EXIT_HYPERCALL
KVM: arm64: Use a maple tree to represent the SMCCC filter
KVM: arm64: Refactor hvc filtering to support different actions
KVM: arm64: Start handling SMCs from EL1
KVM: arm64: Rename SMC/HVC call handler to reflect reality
KVM: arm64: Add vm fd device attribute accessors
KVM: arm64: Add a helper to check if a VM has ran once
KVM: x86: Redefine 'longmode' as a flag for KVM_EXIT_HYPERCALL
Signed-off-by: Marc Zyngier <maz@kernel.org>
* kvm-arm64/timer-vm-offsets: (21 commits)
: .
: This series aims at satisfying multiple goals:
:
: - allow a VMM to atomically restore a timer offset for a whole VM
: instead of updating the offset each time a vcpu get its counter
: written
:
: - allow a VMM to save/restore the physical timer context, something
: that we cannot do at the moment due to the lack of offsetting
:
: - provide a framework that is suitable for NV support, where we get
: both global and per timer, per vcpu offsetting, and manage
: interrupts in a less braindead way.
:
: Conflict resolution involves using the new per-vcpu config lock instead
: of the home-grown timer lock.
: .
KVM: arm64: Handle 32bit CNTPCTSS traps
KVM: arm64: selftests: Augment existing timer test to handle variable offset
KVM: arm64: selftests: Deal with spurious timer interrupts
KVM: arm64: selftests: Add physical timer registers to the sysreg list
KVM: arm64: nv: timers: Support hyp timer emulation
KVM: arm64: nv: timers: Add a per-timer, per-vcpu offset
KVM: arm64: Document KVM_ARM_SET_CNT_OFFSETS and co
KVM: arm64: timers: Abstract the number of valid timers per vcpu
KVM: arm64: timers: Fast-track CNTPCT_EL0 trap handling
KVM: arm64: Elide kern_hyp_va() in VHE-specific parts of the hypervisor
KVM: arm64: timers: Move the timer IRQs into arch_timer_vm_data
KVM: arm64: timers: Abstract per-timer IRQ access
KVM: arm64: timers: Rationalise per-vcpu timer init
KVM: arm64: timers: Allow save/restoring of the physical timer
KVM: arm64: timers: Allow userspace to set the global counter offset
KVM: arm64: Expose {un,}lock_all_vcpus() to the rest of KVM
KVM: arm64: timers: Allow physical offset without CNTPOFF_EL2
KVM: arm64: timers: Use CNTPOFF_EL2 to offset the physical timer
arm64: Add HAS_ECV_CNTPOFF capability
arm64: Add CNTPOFF_EL2 register definition
...
Signed-off-by: Marc Zyngier <maz@kernel.org>
Although pKVM supports CPU PMU emulation for non-protected guests since
722625c6f4 ("KVM: arm64: Reenable pmu in Protected Mode"), this relies
on the PMU driver probing before the host has de-privileged so that the
'kvm_arm_pmu_available' static key can still be enabled by patching the
hypervisor text.
As it happens, both of these events hang off device_initcall() but the
PMU consistently won the race until 7755cec63a ("arm64: perf: Move
PMUv3 driver to drivers/perf"). Since then, the host will fail to boot
when pKVM is enabled:
| hw perfevents: enabled with armv8_pmuv3_0 PMU driver, 7 counters available
| kvm [1]: nVHE hyp BUG at: [<ffff8000090366e0>] __kvm_nvhe_handle_host_mem_abort+0x270/0x284!
| kvm [1]: Cannot dump pKVM nVHE stacktrace: !CONFIG_PROTECTED_NVHE_STACKTRACE
| kvm [1]: Hyp Offset: 0xfffea41fbdf70000
| Kernel panic - not syncing: HYP panic:
| PS:a00003c9 PC:0000dbe04b0c66e0 ESR:00000000f2000800
| FAR:fffffbfffddfcf00 HPFAR:00000000010b0bf0 PAR:0000000000000000
| VCPU:0000000000000000
| CPU: 2 PID: 1 Comm: swapper/0 Not tainted 6.3.0-rc7-00083-g0bce6746d154 #1
| Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
| Call trace:
| dump_backtrace+0xec/0x108
| show_stack+0x18/0x2c
| dump_stack_lvl+0x50/0x68
| dump_stack+0x18/0x24
| panic+0x13c/0x33c
| nvhe_hyp_panic_handler+0x10c/0x190
| aarch64_insn_patch_text_nosync+0x64/0xc8
| arch_jump_label_transform+0x4c/0x5c
| __jump_label_update+0x84/0xfc
| jump_label_update+0x100/0x134
| static_key_enable_cpuslocked+0x68/0xac
| static_key_enable+0x20/0x34
| kvm_host_pmu_init+0x88/0xa4
| armpmu_register+0xf0/0xf4
| arm_pmu_acpi_probe+0x2ec/0x368
| armv8_pmu_driver_init+0x38/0x44
| do_one_initcall+0xcc/0x240
Fix the race properly by deferring the de-privilege step to
device_initcall_sync(). This will also be needed in future when probing
IOMMU devices and allows us to separate the pKVM de-privilege logic from
the core hypervisor initialisation path.
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: Fuad Tabba <tabba@google.com>
Cc: Marc Zyngier <maz@kernel.org>
Fixes: 7755cec63a ("arm64: perf: Move PMUv3 driver to drivers/perf")
Tested-by: Fuad Tabba <tabba@google.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230420123356.2708-1-will@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
All accessors of kvm_vcpu_arch::mp_state should be {READ,WRITE}_ONCE(),
since readers of the mp_state don't acquire the mp_state_lock.
Nonetheless, kvm_psci_vcpu_on() updates the mp_state without using
WRITE_ONCE(). So, fix the code to update the mp_state using WRITE_ONCE.
Signed-off-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230419021852.2981107-3-reijiw@google.com
kvm_arch_vcpu_ioctl_vcpu_init() doesn't acquire mp_state_lock
when setting the mp_state to KVM_MP_STATE_RUNNABLE. Fix the
code to acquire the lock.
Signed-off-by: Reiji Watanabe <reijiw@google.com>
[maz: minor refactor]
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230419021852.2981107-2-reijiw@google.com
The KVM_REG_SIZE() comes from the ioctl and it can be a power of two
between 0-32768 but if it is more than sizeof(long) this will corrupt
memory.
Fixes: 99adb56763 ("KVM: arm/arm64: Add save/restore support for firmware workaround state")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Steven Price <steven.price@arm.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/4efbab8c-640f-43b2-8ac6-6d68e08280fe@kili.mountain
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
__kvm_vcpu_run_vhe() end on VHE with an isb(). However, this
function is only reachable via kvm_call_hyp_ret(), which already
contains an isb() in order to mimick the behaviour of nVHE and
provide a context synchronisation event.
We thus have two isb()s back to back, which is one too many.
Drop the first one and solely rely on the one in the helper.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Contrary to nVHE, VHE is a lot easier when it comes to dealing
with speculative page table walks started at EL1. As we only change
EL1&0 translation regime when context-switching, we already benefit
from the effect of the DSB that sits in the context switch code.
We only need to take care of it in the NV case, where we can
flip between between two EL1 contexts (one of them being the virtual
EL2) without a context switch.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
We rely on the presence of a DSB at the end of kvm_flush_dcache_to_poc()
that, on top of ensuring completion of the cache clean, also covers
the speculative page table walk started from EL1.
Document this dependency.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
A TLBI from EL2 impacting EL1 involves messing with the EL1&0
translation regime, and the page table walker may still be
performing speculative walks.
Piggyback on the existing DSBs to always have a DSB ISH that
will synchronise all load/store operations that the PTW may
still have.
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
When CNTPOFF isn't implemented and that we have a non-zero counter
offset, CNTPCT and CNTPCTSS are trapped. We properly handle the
former, but not the latter, as it is not present in the sysreg
table (despite being actually handled in the code). Bummer.
Just populate the cp15_64 table with the missing register.
Reported-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
When taking an exception between the EL1&0 translation regime and
the EL2 translation regime, the page table walker is allowed to
complete the walks started from EL0 or EL1 while running at EL2.
It means that altering the system registers that define the EL1&0
translation regime is fraught with danger *unless* we wait for
the completion of such walk with a DSB (R_LFHQG and subsequent
statements in the ARM ARM). We already did the right thing for
other external agents (SPE, TRBE), but not the PTW.
Rework the existing SPE/TRBE synchronisation to include the PTW,
and add the missing DSB on guest exit.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
commit f003277311 ("KVM: arm64: Use config_lock to protect vgic
state") was meant to rectify a longstanding lock ordering issue in KVM
where the kvm->lock is taken while holding vcpu->mutex. As it so
happens, the aforementioned commit introduced yet another locking issue
by acquiring the its_lock before acquiring the config lock.
This is obviously wrong, especially considering that the lock ordering
is well documented in vgic.c. Reshuffle the locks once more to take the
config_lock before the its_lock. While at it, sprinkle in the lockdep
hinting that has become popular as of late to keep lockdep apprised of
our ordering.
Cc: stable@vger.kernel.org
Fixes: f003277311 ("KVM: arm64: Use config_lock to protect vgic state")
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230412062733.988229-1-oliver.upton@linux.dev
Though presently unused, there is an SMC64 view of the Arm architecture
calls defined by the SMCCC. The documentation of the SMCCC filter states
that the SMC64 range is reserved, but nothing actually prevents
userspace from applying a filter to the range.
Insert a range with the HANDLE action for the SMC64 arch range, thereby
preventing userspace from imposing filtering/forwarding on it.
Fixes: fb88707dd3 ("KVM: arm64: Use a maple tree to represent the SMCCC filter")
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230408121732.3411329-2-oliver.upton@linux.dev
- Ensure the guest PMU context is restored before the first KVM_RUN,
fixing an issue where EL0 event counting is broken after vCPU
save/restore
- Actually initialize ID_AA64PFR0_EL1.{CSV2,CSV3} based on the
sanitized, system-wide values for protected VMs
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSNXHjWXuzMZutrKNKivnWIJHzdFgUCZC2ZbwAKCRCivnWIJHzd
FvovAP9aooqUBBs8w2myh8SXCv7dJOg88r6fKS5vCqMdkY7OhwD9GXA7hWU2dXdy
X1L8qq6C7R+GtIY/kDm9E2HkNKg7rQc=
=nfXd
-----END PGP SIGNATURE-----
Merge tag 'kvmarm-fixes-6.3-3' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 fixes for 6.3, part #3
- Ensure the guest PMU context is restored before the first KVM_RUN,
fixing an issue where EL0 event counting is broken after vCPU
save/restore
- Actually initialize ID_AA64PFR0_EL1.{CSV2,CSV3} based on the
sanitized, system-wide values for protected VMs
When returning to userspace to handle a SMCCC call, we consistently
set PC to point to the instruction immediately after the HVC/SMC.
However, should userspace need to know the exact address of the
trapping instruction, it needs to know about the *size* of that
instruction. For AArch64, this is pretty easy. For AArch32, this
is a bit more funky, as Thumb has 16bit encodings for both HVC
and SMC.
Expose this to userspace with a new flag that directly derives
from ESR_EL2.IL. Also update the documentation to reflect the PC
state at the point of exit.
Finally, this fixes a small buglet where the hypercall.{args,ret}
fields would not be cleared on exit, and could contain some
random junk.
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/86pm8iv8tj.wl-maz@kernel.org
Now that the SRCU Kconfig option is unconditionally selected, there is
no longer any point in selecting it. Therefore, remove the "select SRCU"
Kconfig statements from the various KVM Kconfig files.
Acked-by: Sean Christopherson <seanjc@google.com> (x86)
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Aleksandar Markovic <aleksandar.qemu.devel@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <kvm@vger.kernel.org>
Acked-by: Marc Zyngier <maz@kernel.org> (arm64)
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Acked-by: Anup Patel <anup@brainfault.org> (riscv)
Acked-by: Heiko Carstens <hca@linux.ibm.com> (s390)
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Typically a negative return from an exit handler is used to request a
return to userspace with the specified error. KVM's handling of SMCCC
emulation (i.e. both HVCs and SMCs) deviates from the trend and resumes
the guest instead.
Stop handling negative returns this way and instead let the error
percolate to userspace.
Suggested-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230404154050.2270077-12-oliver.upton@linux.dev
A subsequent change to KVM will allow negative returns from SMCCC
handlers to exit to userspace. Make way for this change by explicitly
returning SMCCC_RET_NOT_SUPPORTED to the guest if the VM is configured
to use an unknown PSCI version. Add a WARN since this is undoubtedly a
KVM bug.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230404154050.2270077-11-oliver.upton@linux.dev
As the SMCCC (and related specifications) march towards an 'everything
and the kitchen sink' interface for interacting with a system it becomes
less likely that KVM will support every related feature. We could do
better by letting userspace have a crack at it instead.
Allow userspace to define an 'SMCCC filter' that applies to both HVCs
and SMCs initiated by the guest. Supporting both conduits with this
interface is important for a couple of reasons. Guest SMC usage is table
stakes for a nested guest, as HVCs are always taken to the virtual EL2.
Additionally, guests may want to interact with a service on the secure
side which can now be proxied by userspace.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230404154050.2270077-10-oliver.upton@linux.dev
In anticipation of user hypercall filters, add the necessary plumbing to
get SMCCC calls out to userspace. Even though the exit structure has
space for KVM to pass register arguments, let's just avoid it altogether
and let userspace poke at the registers via KVM_GET_ONE_REG.
This deliberately stretches the definition of a 'hypercall' to cover
SMCs from EL1 in addition to the HVCs we know and love. KVM doesn't
support EL1 calls into secure services, but now we can paint that as a
userspace problem and be done with it.
Finally, we need a flag to let userspace know what conduit instruction
was used (i.e. SMC vs. HVC).
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230404154050.2270077-9-oliver.upton@linux.dev
Maple tree is an efficient B-tree implementation that is intended for
storing non-overlapping intervals. Such a data structure is a good fit
for the SMCCC filter as it is desirable to sparsely allocate the 32 bit
function ID space.
To that end, add a maple tree to kvm_arch and correctly init/teardown
along with the VM. Wire in a test against the hypercall filter for HVCs
which does nothing until the controls are exposed to userspace.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230404154050.2270077-8-oliver.upton@linux.dev
KVM presently allows userspace to filter guest hypercalls with bitmaps
expressed via pseudo-firmware registers. These bitmaps have a narrow
scope and, of course, can only allow/deny a particular call. A
subsequent change to KVM will introduce a generalized UAPI for filtering
hypercalls, allowing functions to be forwarded to userspace.
Refactor the existing hypercall filtering logic to make room for more
than two actions. While at it, generalize the function names around
SMCCC as it is the basis for the upcoming UAPI.
No functional change intended.
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230404154050.2270077-7-oliver.upton@linux.dev
Whelp, the architecture gods have spoken and confirmed that the function
ID space is common between SMCs and HVCs. Not only that, the expectation
is that hypervisors handle calls to both SMC and HVC conduits. KVM
recently picked up support for SMCCCs in commit bd36b1a9eb ("KVM:
arm64: nv: Handle SMCs taken from virtual EL2") but scoped it only to a
nested hypervisor.
Let's just open the floodgates and let EL1 access our SMCCC
implementation with the SMC instruction as well.
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230404154050.2270077-6-oliver.upton@linux.dev
KVM handles SMCCC calls from virtual EL2 that use the SMC instruction
since commit bd36b1a9eb ("KVM: arm64: nv: Handle SMCs taken from
virtual EL2"). Thus, the function name of the handler no longer reflects
reality.
Normalize the name on SMCCC, since that's the only hypercall interface
KVM supports in the first place. No fuctional change intended.
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230404154050.2270077-5-oliver.upton@linux.dev
A subsequent change will allow userspace to convey a filter for
hypercalls through a vm device attribute. Add the requisite boilerplate
for vm attribute accessors.
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230404154050.2270077-4-oliver.upton@linux.dev
The test_bit(...) pattern is quite a lot of keystrokes. Replace
existing callsites with a helper.
No functional change intended.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230404154050.2270077-3-oliver.upton@linux.dev