- The 3 patch series "mm, swap: improve cluster scan strategy" from
Kairui Song improves performance and reduces the failure rate of swap
cluster allocation.
- The 4 patch series "support large align and nid in Rust allocators"
from Vitaly Wool permits Rust allocators to set NUMA node and large
alignment when perforning slub and vmalloc reallocs.
- The 2 patch series "mm/damon/vaddr: support stat-purpose DAMOS" from
Yueyang Pan extend DAMOS_STAT's handling of the DAMON operations sets
for virtual address spaces for ops-level DAMOS filters.
- The 3 patch series "execute PROCMAP_QUERY ioctl under per-vma lock"
from Suren Baghdasaryan reduces mmap_lock contention during reads of
/proc/pid/maps.
- The 2 patch series "mm/mincore: minor clean up for swap cache
checking" from Kairui Song performs some cleanup in the swap code.
- The 11 patch series "mm: vm_normal_page*() improvements" from David
Hildenbrand provides code cleanup in the pagemap code.
- The 5 patch series "add persistent huge zero folio support" from
Pankaj Raghav provides a block layer speedup by optionalls making the
huge_zero_pagepersistent, instead of releasing it when its refcount
falls to zero.
- The 3 patch series "kho: fixes and cleanups" from Mike Rapoport adds a
few touchups to the recently added Kexec Handover feature.
- The 10 patch series "mm: make mm->flags a bitmap and 64-bit on all
arches" from Lorenzo Stoakes turns mm_struct.flags into a bitmap. To
end the constant struggle with space shortage on 32-bit conflicting with
64-bit's needs.
- The 2 patch series "mm/swapfile.c and swap.h cleanup" from Chris Li
cleans up some swap code.
- The 7 patch series "selftests/mm: Fix false positives and skip
unsupported tests" from Donet Tom fixes a few things in our selftests
code.
- The 7 patch series "prctl: extend PR_SET_THP_DISABLE to only provide
THPs when advised" from David Hildenbrand "allows individual processes
to opt-out of THP=always into THP=madvise, without affecting other
workloads on the system".
It's a long story - the [1/N] changelog spells out the considerations.
- The 11 patch series "Add and use memdesc_flags_t" from Matthew Wilcox
gets us started on the memdesc project. Please see
https://kernelnewbies.org/MatthewWilcox/Memdescs and
https://blogs.oracle.com/linux/post/introducing-memdesc.
- The 3 patch series "Tiny optimization for large read operations" from
Chi Zhiling improves the efficiency of the pagecache read path.
- The 5 patch series "Better split_huge_page_test result check" from Zi
Yan improves our folio splitting selftest code.
- The 2 patch series "test that rmap behaves as expected" from Wei Yang
adds some rmap selftests.
- The 3 patch series "remove write_cache_pages()" from Christoph Hellwig
removes that function and converts its two remaining callers.
- The 2 patch series "selftests/mm: uffd-stress fixes" from Dev Jain
fixes some UFFD selftests issues.
- The 3 patch series "introduce kernel file mapped folios" from Boris
Burkov introduces the concept of "kernel file pages". Using these
permits btrfs to account its metadata pages to the root cgroup, rather
than to the cgroups of random inappropriate tasks.
- The 2 patch series "mm/pageblock: improve readability of some
pageblock handling" from Wei Yang provides some readability improvements
to the page allocator code.
- The 11 patch series "mm/damon: support ARM32 with LPAE" from SeongJae
Park teaches DAMON to understand arm32 highmem.
- The 4 patch series "tools: testing: Use existing atomic.h for
vma/maple tests" from Brendan Jackman performs some code cleanups and
deduplication under tools/testing/.
- The 2 patch series "maple_tree: Fix testing for 32bit compiles" from
Liam Howlett fixes a couple of 32-bit issues in
tools/testing/radix-tree.c.
- The 2 patch series "kasan: unify kasan_enabled() and remove
arch-specific implementations" from Sabyrzhan Tasbolatov moves KASAN
arch-specific initialization code into a common arch-neutral
implementation.
- The 3 patch series "mm: remove zpool" from Johannes Weiner removes
zspool - an indirection layer which now only redirects to a single thing
(zsmalloc).
- The 2 patch series "mm: task_stack: Stack handling cleanups" from
Pasha Tatashin makes a couple of cleanups in the fork code.
- The 37 patch series "mm: remove nth_page()" from David Hildenbrand
makes rather a lot of adjustments at various nth_page() callsites,
eventually permitting the removal of that undesirable helper function.
- The 2 patch series "introduce kasan.write_only option in hw-tags" from
Yeoreum Yun creates a KASAN read-only mode for ARM, using that
architecture's memory tagging feature. It is felt that a read-only mode
KASAN is suitable for use in production systems rather than debug-only.
- The 3 patch series "mm: hugetlb: cleanup hugetlb folio allocation"
from Kefeng Wang does some tidying in the hugetlb folio allocation code.
- The 12 patch series "mm: establish const-correctness for pointer
parameters" from Max Kellermann makes quite a number of the MM API
functions more accurate about the constness of their arguments. This
was getting in the way of subsystems (in this case CEPH) when they
attempt to improving their own const/non-const accuracy.
- The 7 patch series "Cleanup free_pages() misuse" from Vishal Moola
fixes a number of code sites which were confused over when to use
free_pages() vs __free_pages().
- The 3 patch series "Add Rust abstraction for Maple Trees" from Alice
Ryhl makes the mapletree code accessible to Rust. Required by nouveau
and by its forthcoming successor: the new Rust Nova driver.
- The 2 patch series "selftests/mm: split_huge_page_test:
split_pte_mapped_thp improvements" from David Hildenbrand adds a fix and
some cleanups to the thp selftesting code.
- The 14 patch series "mm, swap: introduce swap table as swap cache
(phase I)" from Chris Li and Kairui Song is the first step along the
path to implementing "swap tables" - a new approach to swap allocation
and state tracking which is expected to yield speed and space
improvements. This patchset itself yields a 5-20% performance benefit
in some situations.
- The 3 patch series "Some ptdesc cleanups" from Matthew Wilcox utilizes
the new memdesc layer to clean up the ptdesc code a little.
- The 3 patch series "Fix va_high_addr_switch.sh test failure" from
Chunyu Hu fixes some issues in our 5-level pagetable selftesting code.
- The 2 patch series "Minor fixes for memory allocation profiling" from
Suren Baghdasaryan addresses a couple of minor issues in relatively new
memory allocation profiling feature.
- The 3 patch series "Small cleanups" from Matthew Wilcox has a few
cleanups in preparation for more memdesc work.
- The 2 patch series "mm/damon: add addr_unit for DAMON_LRU_SORT and
DAMON_RECLAIM" from Quanmin Yan makes some changes to DAMON in
furtherance of supporting arm highmem.
- The 2 patch series "selftests/mm: Add -Wunreachable-code and fix
warnings" from Muhammad Anjum adds that compiler check to selftests code
and fixes the fallout, by removing dead code.
- The 10 patch series "Improvements to Victim Process Thawing and OOM
Reaper Traversal Order" from zhongjinji makes a number of improvements
in the OOM killer: mainly thawing a more appropriate group of victim
threads so they can release resources.
- The 5 patch series "mm/damon: misc fixups and improvements for 6.18"
from SeongJae Park is a bunch of small and unrelated fixups for DAMON.
- The 7 patch series "mm/damon: define and use DAMON initialization
check function" from SeongJae Park implement reliability and
maintainability improvements to a recently-added bug fix.
- The 2 patch series "mm/damon/stat: expose auto-tuned intervals and
non-idle ages" from SeongJae Park provides additional transparency to
userspace clients of the DAMON_STAT information.
- The 2 patch series "Expand scope of khugepaged anonymous collapse"
from Dev Jain removes some constraints on khubepaged's collapsing of
anon VMAs. It also increases the success rate of MADV_COLLAPSE against
an anon vma.
- The 2 patch series "mm: do not assume file == vma->vm_file in
compat_vma_mmap_prepare()" from Lorenzo Stoakes moves us further towards
removal of file_operations.mmap(). This patchset concentrates upon
clearing up the treatment of stacked filesystems.
- The 6 patch series "mm: Improve mlock tracking for large folios" from
Kiryl Shutsemau provides some fixes and improvements to mlock's tracking
of large folios. /proc/meminfo's "Mlocked" field became more accurate.
- The 2 patch series "mm/ksm: Fix incorrect accounting of KSM counters
during fork" from Donet Tom fixes several user-visible KSM stats
inaccuracies across forks and adds selftest code to verify these
counters.
- The 2 patch series "mm_slot: fix the usage of mm_slot_entry" from Wei
Yang addresses some potential but presently benign issues in KSM's
mm_slot handling.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaN3cywAKCRDdBJ7gKXxA
jtaPAQDmIuIu7+XnVUK5V11hsQ/5QtsUeLHV3OsAn4yW5/3dEQD/UddRU08ePN+1
2VRB0EwkLAdfMWW7TfiNZ+yhuoiL/AA=
=4mhY
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2025-10-01-19-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- "mm, swap: improve cluster scan strategy" from Kairui Song improves
performance and reduces the failure rate of swap cluster allocation
- "support large align and nid in Rust allocators" from Vitaly Wool
permits Rust allocators to set NUMA node and large alignment when
perforning slub and vmalloc reallocs
- "mm/damon/vaddr: support stat-purpose DAMOS" from Yueyang Pan extend
DAMOS_STAT's handling of the DAMON operations sets for virtual
address spaces for ops-level DAMOS filters
- "execute PROCMAP_QUERY ioctl under per-vma lock" from Suren
Baghdasaryan reduces mmap_lock contention during reads of
/proc/pid/maps
- "mm/mincore: minor clean up for swap cache checking" from Kairui Song
performs some cleanup in the swap code
- "mm: vm_normal_page*() improvements" from David Hildenbrand provides
code cleanup in the pagemap code
- "add persistent huge zero folio support" from Pankaj Raghav provides
a block layer speedup by optionalls making the
huge_zero_pagepersistent, instead of releasing it when its refcount
falls to zero
- "kho: fixes and cleanups" from Mike Rapoport adds a few touchups to
the recently added Kexec Handover feature
- "mm: make mm->flags a bitmap and 64-bit on all arches" from Lorenzo
Stoakes turns mm_struct.flags into a bitmap. To end the constant
struggle with space shortage on 32-bit conflicting with 64-bit's
needs
- "mm/swapfile.c and swap.h cleanup" from Chris Li cleans up some swap
code
- "selftests/mm: Fix false positives and skip unsupported tests" from
Donet Tom fixes a few things in our selftests code
- "prctl: extend PR_SET_THP_DISABLE to only provide THPs when advised"
from David Hildenbrand "allows individual processes to opt-out of
THP=always into THP=madvise, without affecting other workloads on the
system".
It's a long story - the [1/N] changelog spells out the considerations
- "Add and use memdesc_flags_t" from Matthew Wilcox gets us started on
the memdesc project. Please see
https://kernelnewbies.org/MatthewWilcox/Memdescs and
https://blogs.oracle.com/linux/post/introducing-memdesc
- "Tiny optimization for large read operations" from Chi Zhiling
improves the efficiency of the pagecache read path
- "Better split_huge_page_test result check" from Zi Yan improves our
folio splitting selftest code
- "test that rmap behaves as expected" from Wei Yang adds some rmap
selftests
- "remove write_cache_pages()" from Christoph Hellwig removes that
function and converts its two remaining callers
- "selftests/mm: uffd-stress fixes" from Dev Jain fixes some UFFD
selftests issues
- "introduce kernel file mapped folios" from Boris Burkov introduces
the concept of "kernel file pages". Using these permits btrfs to
account its metadata pages to the root cgroup, rather than to the
cgroups of random inappropriate tasks
- "mm/pageblock: improve readability of some pageblock handling" from
Wei Yang provides some readability improvements to the page allocator
code
- "mm/damon: support ARM32 with LPAE" from SeongJae Park teaches DAMON
to understand arm32 highmem
- "tools: testing: Use existing atomic.h for vma/maple tests" from
Brendan Jackman performs some code cleanups and deduplication under
tools/testing/
- "maple_tree: Fix testing for 32bit compiles" from Liam Howlett fixes
a couple of 32-bit issues in tools/testing/radix-tree.c
- "kasan: unify kasan_enabled() and remove arch-specific
implementations" from Sabyrzhan Tasbolatov moves KASAN arch-specific
initialization code into a common arch-neutral implementation
- "mm: remove zpool" from Johannes Weiner removes zspool - an
indirection layer which now only redirects to a single thing
(zsmalloc)
- "mm: task_stack: Stack handling cleanups" from Pasha Tatashin makes a
couple of cleanups in the fork code
- "mm: remove nth_page()" from David Hildenbrand makes rather a lot of
adjustments at various nth_page() callsites, eventually permitting
the removal of that undesirable helper function
- "introduce kasan.write_only option in hw-tags" from Yeoreum Yun
creates a KASAN read-only mode for ARM, using that architecture's
memory tagging feature. It is felt that a read-only mode KASAN is
suitable for use in production systems rather than debug-only
- "mm: hugetlb: cleanup hugetlb folio allocation" from Kefeng Wang does
some tidying in the hugetlb folio allocation code
- "mm: establish const-correctness for pointer parameters" from Max
Kellermann makes quite a number of the MM API functions more accurate
about the constness of their arguments. This was getting in the way
of subsystems (in this case CEPH) when they attempt to improving
their own const/non-const accuracy
- "Cleanup free_pages() misuse" from Vishal Moola fixes a number of
code sites which were confused over when to use free_pages() vs
__free_pages()
- "Add Rust abstraction for Maple Trees" from Alice Ryhl makes the
mapletree code accessible to Rust. Required by nouveau and by its
forthcoming successor: the new Rust Nova driver
- "selftests/mm: split_huge_page_test: split_pte_mapped_thp
improvements" from David Hildenbrand adds a fix and some cleanups to
the thp selftesting code
- "mm, swap: introduce swap table as swap cache (phase I)" from Chris
Li and Kairui Song is the first step along the path to implementing
"swap tables" - a new approach to swap allocation and state tracking
which is expected to yield speed and space improvements. This
patchset itself yields a 5-20% performance benefit in some situations
- "Some ptdesc cleanups" from Matthew Wilcox utilizes the new memdesc
layer to clean up the ptdesc code a little
- "Fix va_high_addr_switch.sh test failure" from Chunyu Hu fixes some
issues in our 5-level pagetable selftesting code
- "Minor fixes for memory allocation profiling" from Suren Baghdasaryan
addresses a couple of minor issues in relatively new memory
allocation profiling feature
- "Small cleanups" from Matthew Wilcox has a few cleanups in
preparation for more memdesc work
- "mm/damon: add addr_unit for DAMON_LRU_SORT and DAMON_RECLAIM" from
Quanmin Yan makes some changes to DAMON in furtherance of supporting
arm highmem
- "selftests/mm: Add -Wunreachable-code and fix warnings" from Muhammad
Anjum adds that compiler check to selftests code and fixes the
fallout, by removing dead code
- "Improvements to Victim Process Thawing and OOM Reaper Traversal
Order" from zhongjinji makes a number of improvements in the OOM
killer: mainly thawing a more appropriate group of victim threads so
they can release resources
- "mm/damon: misc fixups and improvements for 6.18" from SeongJae Park
is a bunch of small and unrelated fixups for DAMON
- "mm/damon: define and use DAMON initialization check function" from
SeongJae Park implement reliability and maintainability improvements
to a recently-added bug fix
- "mm/damon/stat: expose auto-tuned intervals and non-idle ages" from
SeongJae Park provides additional transparency to userspace clients
of the DAMON_STAT information
- "Expand scope of khugepaged anonymous collapse" from Dev Jain removes
some constraints on khubepaged's collapsing of anon VMAs. It also
increases the success rate of MADV_COLLAPSE against an anon vma
- "mm: do not assume file == vma->vm_file in compat_vma_mmap_prepare()"
from Lorenzo Stoakes moves us further towards removal of
file_operations.mmap(). This patchset concentrates upon clearing up
the treatment of stacked filesystems
- "mm: Improve mlock tracking for large folios" from Kiryl Shutsemau
provides some fixes and improvements to mlock's tracking of large
folios. /proc/meminfo's "Mlocked" field became more accurate
- "mm/ksm: Fix incorrect accounting of KSM counters during fork" from
Donet Tom fixes several user-visible KSM stats inaccuracies across
forks and adds selftest code to verify these counters
- "mm_slot: fix the usage of mm_slot_entry" from Wei Yang addresses
some potential but presently benign issues in KSM's mm_slot handling
* tag 'mm-stable-2025-10-01-19-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (372 commits)
mm: swap: check for stable address space before operating on the VMA
mm: convert folio_page() back to a macro
mm/khugepaged: use start_addr/addr for improved readability
hugetlbfs: skip VMAs without shareable locks in hugetlb_vmdelete_list
alloc_tag: fix boot failure due to NULL pointer dereference
mm: silence data-race in update_hiwater_rss
mm/memory-failure: don't select MEMORY_ISOLATION
mm/khugepaged: remove definition of struct khugepaged_mm_slot
mm/ksm: get mm_slot by mm_slot_entry() when slot is !NULL
hugetlb: increase number of reserving hugepages via cmdline
selftests/mm: add fork inheritance test for ksm_merging_pages counter
mm/ksm: fix incorrect KSM counter handling in mm_struct during fork
drivers/base/node: fix double free in register_one_node()
mm: remove PMD alignment constraint in execmem_vmalloc()
mm/memory_hotplug: fix typo 'esecially' -> 'especially'
mm/rmap: improve mlock tracking for large folios
mm/filemap: map entire large folio faultaround
mm/fault: try to map the entire file folio in finish_fault()
mm/rmap: mlock large folios in try_to_unmap_one()
mm/rmap: fix a mlock race condition in folio_referenced_one()
...
- Auxiliary:
- Drop call to dev_pm_domain_detach() in auxiliary_bus_probe()
- Optimize logic of auxiliary_match_id()
- Rust:
- Auxiliary:
- Use primitive C types from prelude
- DebugFs:
- Add debugfs support for simple read/write files and custom callbacks
through a File-type-based and directory-scope-based API
- Sample driver code for the File-type-based API
- Sample module code for the directory-scope-based API
- I/O:
- Add io::poll module and implement Rust specific read_poll_timeout()
helper
- IRQ:
- Implement support for threaded and non-threaded device IRQs based on
(&Device<Bound>, IRQ number) tuples (IrqRequest)
- Provide &Device<Bound> cookie in IRQ handlers
- PCI:
- Support IRQ requests from IRQ vectors for a specific pci::Device<Bound>
- Implement accessors for subsystem IDs, revision, devid and resource start
- Provide dedicated pci::Vendor and pci::Class types for vendor and class
ID numbers
- Implement Display to print actual vendor and class names; Debug to print
the raw ID numbers
- Add pci::DeviceId::from_class_and_vendor() helper
- Use primitive C types from prelude
- Various minor inline and (safety) comment improvements
- Platform:
- Support IRQ requests from IRQ vectors for a specific
platform::Device<Bound>
- Nova:
- Use pci::DeviceId::from_class_and_vendor() to avoid probing
non-display/compute PCI functions
- Misc:
- Add helper for cpu_relax()
- Update ARef import from sync::aref
- sysfs:
- Remove bin_attrs_new field from struct attribute_group
- Remove read_new() and write_new() from struct bin_attribute
- Misc:
- Document potential race condition in get_dev_from_fwnode()
- Constify node_group argument in software node registration functions
- Fix order of kernel-doc parameters in various functions
- Set power.no_pm flag for faux devices
- Set power.no_callbacks flag along with the power.no_pm flag
- Constify the pmu_bus bus type
- Minor spelling fixes
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQS2q/xV6QjXAdC7k+1FlHeO1qrKLgUCaNmQGwAKCRBFlHeO1qrK
LmPzAP9msIvK8eFT4CEDK4buX1gd+VBOdy8mAjAeJ2F80FIo8wEAtOdddNaaqWVF
m4ac2/a2bSRKMGPX+wIM7d2HGyC7sgY=
=XbU+
-----END PGP SIGNATURE-----
Merge tag 'driver-core-6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core
Pull driver core updates from Danilo Krummrich:
"Auxiliary:
- Drop call to dev_pm_domain_detach() in auxiliary_bus_probe()
- Optimize logic of auxiliary_match_id()
Rust:
- Auxiliary:
- Use primitive C types from prelude
- DebugFs:
- Add debugfs support for simple read/write files and custom
callbacks through a File-type-based and directory-scope-based
API
- Sample driver code for the File-type-based API
- Sample module code for the directory-scope-based API
- I/O:
- Add io::poll module and implement Rust specific
read_poll_timeout() helper
- IRQ:
- Implement support for threaded and non-threaded device IRQs
based on (&Device<Bound>, IRQ number) tuples (IrqRequest)
- Provide &Device<Bound> cookie in IRQ handlers
- PCI:
- Support IRQ requests from IRQ vectors for a specific
pci::Device<Bound>
- Implement accessors for subsystem IDs, revision, devid and
resource start
- Provide dedicated pci::Vendor and pci::Class types for vendor
and class ID numbers
- Implement Display to print actual vendor and class names; Debug
to print the raw ID numbers
- Add pci::DeviceId::from_class_and_vendor() helper
- Use primitive C types from prelude
- Various minor inline and (safety) comment improvements
- Platform:
- Support IRQ requests from IRQ vectors for a specific
platform::Device<Bound>
- Nova:
- Use pci::DeviceId::from_class_and_vendor() to avoid probing
non-display/compute PCI functions
- Misc:
- Add helper for cpu_relax()
- Update ARef import from sync::aref
sysfs:
- Remove bin_attrs_new field from struct attribute_group
- Remove read_new() and write_new() from struct bin_attribute
Misc:
- Document potential race condition in get_dev_from_fwnode()
- Constify node_group argument in software node registration
functions
- Fix order of kernel-doc parameters in various functions
- Set power.no_pm flag for faux devices
- Set power.no_callbacks flag along with the power.no_pm flag
- Constify the pmu_bus bus type
- Minor spelling fixes"
* tag 'driver-core-6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core: (43 commits)
rust: pci: display symbolic PCI vendor names
rust: pci: display symbolic PCI class names
rust: pci: fix incorrect platform reference in PCI driver probe doc comment
rust: pci: fix incorrect platform reference in PCI driver unbind doc comment
perf: make pmu_bus const
samples: rust: Add scoped debugfs sample driver
rust: debugfs: Add support for scoped directories
samples: rust: Add debugfs sample driver
rust: debugfs: Add support for callback-based files
rust: debugfs: Add support for writable files
rust: debugfs: Add support for read-only files
rust: debugfs: Add initial support for directories
driver core: auxiliary bus: Optimize logic of auxiliary_match_id()
driver core: auxiliary bus: Drop dev_pm_domain_detach() call
driver core: Fix order of the kernel-doc parameters
driver core: get_dev_from_fwnode(): document potential race
drivers: base: fix "publically"->"publicly"
driver core/PM: Set power.no_callbacks along with power.no_pm
driver core: faux: Set power.no_pm for faux devices
rust: pci: inline several tiny functions
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmjZH40ACgkQ6rmadz2v
bTrG7w//X/5CyDoKIYJCqynYRdMtfqYuCe8Jhud4p5++iBVqkDyS6Y8EFLqZVyg/
UHTqaSE4Nz8/pma0WSjhUYn6Chs1AeH+Rw/g109SovE/YGkek2KNwY3o2hDrtPMX
+oD0my8qF2HLKgEyteXXyZ5Ju+AaF92JFiGko4/wNTX8O99F9nyz2pTkrctS9Vl9
VwuTxrEXpmhqrhP3WCxkfNfcbs9HP+AALpgOXZKdMI6T4KI0N1gnJ0ZWJbiXZ8oT
tug0MTPkNRidYMl0wHY2LZ6ZG8Q3a7Sgc+M0xFzaHGvGlJbBg1HjsDMtT6j34CrG
TIVJ/O8F6EJzAnQ5Hio0FJk8IIgMRgvng5Kd5GXidU+mE6zokTyHIHOXitYkBQNH
Hk+lGA7+E2cYqUqKvB5PFoyo+jlucuIH7YwrQlyGfqz+98n65xCgZKcmdVXr0hdB
9v3WmwJFtVIoPErUvBC3KRANQYhFk4eVk1eiGV/20+eIVyUuNbX6wqSWSA9uEXLy
n5fm/vlk4RjZmrPZHxcJ0dsl9LTF1VvQQHkgoC1Sz/Cc+jA6k4I+ECVHAqEbk36p
1TUF52yPOD2ViaJKkj+962JaaaXlUn6+Dq7f1GMP6VuyHjz4gsI3mOo4XarqNdWd
c7TnYmlGO/cGwqd4DdbmWiF1DDsrBcBzdbC8+FgffxQHLPXGzUg=
=LeQi
-----END PGP SIGNATURE-----
Merge tag 'bpf-next-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Pull bpf updates from Alexei Starovoitov:
- Support pulling non-linear xdp data with bpf_xdp_pull_data() kfunc
(Amery Hung)
Applied as a stable branch in bpf-next and net-next trees.
- Support reading skb metadata via bpf_dynptr (Jakub Sitnicki)
Also a stable branch in bpf-next and net-next trees.
- Enforce expected_attach_type for tailcall compatibility (Daniel
Borkmann)
- Replace path-sensitive with path-insensitive live stack analysis in
the verifier (Eduard Zingerman)
This is a significant change in the verification logic. More details,
motivation, long term plans are in the cover letter/merge commit.
- Support signed BPF programs (KP Singh)
This is another major feature that took years to materialize.
Algorithm details are in the cover letter/marge commit
- Add support for may_goto instruction to s390 JIT (Ilya Leoshkevich)
- Add support for may_goto instruction to arm64 JIT (Puranjay Mohan)
- Fix USDT SIB argument handling in libbpf (Jiawei Zhao)
- Allow uprobe-bpf program to change context registers (Jiri Olsa)
- Support signed loads from BPF arena (Kumar Kartikeya Dwivedi and
Puranjay Mohan)
- Allow access to union arguments in tracing programs (Leon Hwang)
- Optimize rcu_read_lock() + migrate_disable() combination where it's
used in BPF subsystem (Menglong Dong)
- Introduce bpf_task_work_schedule*() kfuncs to schedule deferred
execution of BPF callback in the context of a specific task using the
kernel’s task_work infrastructure (Mykyta Yatsenko)
- Enforce RCU protection for KF_RCU_PROTECTED kfuncs (Kumar Kartikeya
Dwivedi)
- Add stress test for rqspinlock in NMI (Kumar Kartikeya Dwivedi)
- Improve the precision of tnum multiplier verifier operation
(Nandakumar Edamana)
- Use tnums to improve is_branch_taken() logic (Paul Chaignon)
- Add support for atomic operations in arena in riscv JIT (Pu Lehui)
- Report arena faults to BPF error stream (Puranjay Mohan)
- Search for tracefs at /sys/kernel/tracing first in bpftool (Quentin
Monnet)
- Add bpf_strcasecmp() kfunc (Rong Tao)
- Support lookup_and_delete_elem command in BPF_MAP_STACK_TRACE (Tao
Chen)
* tag 'bpf-next-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (197 commits)
libbpf: Replace AF_ALG with open coded SHA-256
selftests/bpf: Add stress test for rqspinlock in NMI
selftests/bpf: Add test case for different expected_attach_type
bpf: Enforce expected_attach_type for tailcall compatibility
bpftool: Remove duplicate string.h header
bpf: Remove duplicate crypto/sha2.h header
libbpf: Fix error when st-prefix_ops and ops from differ btf
selftests/bpf: Test changing packet data from kfunc
selftests/bpf: Add stacktrace map lookup_and_delete_elem test case
selftests/bpf: Refactor stacktrace_map case with skeleton
bpf: Add lookup_and_delete_elem for BPF_MAP_STACK_TRACE
selftests/bpf: Fix flaky bpf_cookie selftest
selftests/bpf: Test changing packet data from global functions with a kfunc
bpf: Emit struct bpf_xdp_sock type in vmlinux BTF
selftests/bpf: Task_work selftest cleanup fixes
MAINTAINERS: Delete inactive maintainers from AF_XDP
bpf: Mark kfuncs as __noclone
selftests/bpf: Add kprobe multi write ctx attach test
selftests/bpf: Add kprobe write ctx attach test
selftests/bpf: Add uprobe context ip register change test
...
Core perf code updates:
- Convert mmap() related reference counts to refcount_t. This
is in reaction to the recently fixed refcount bugs, which
could have been detected earlier and could have mitigated
the bug somewhat. (Thomas Gleixner, Peter Zijlstra)
- Clean up and simplify the callchain code, in preparation
for sframes. (Steven Rostedt, Josh Poimboeuf)
Uprobes updates:
- Add support to optimize usdt probes on x86-64, which
gives a substantial speedup. (Jiri Olsa)
- Cleanups and fixes on x86 (Peter Zijlstra)
PMU driver updates:
- Various optimizations and fixes to the Intel PMU driver
(Dapeng Mi)
Misc cleanups and fixes:
- Remove redundant __GFP_NOWARN (Qianfeng Rong)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmjWpGIRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1iHvxAAvO8qWbbhUdF3EZaFU0Wx6oh5KBhImU49
VZ107xe9llA0Szy3hIl1YpdOQA2NAHtma6We/ebonrPVTTkcSCGq8absc+GahA3I
CHIomx2hjD0OQ01aHvTqgHJUdFUQQ0yzE3+FY6Tsn05JsNZvDmqpAMIoMQT0LuuG
7VvVRLBuDXtuMtNmGaGCvfDGKTZkGGxD6iZS1iWHuixvVAz4IECK0vYqSyh31UGA
w9Jwa0thwjKm2EZTmcSKaHSM2zw3N8QXJ3SNPPThuMrtO6QDz2+3Da9kO+vhGcRP
Jls9KnWC2wxNxqIs3dr80Mzn4qMplc67Ekx2tUqX4tYEGGtJQxW6tm3JOKKIgFMI
g/KF9/WJPXp0rVI9mtoQkgndzyswR/ZJBAwfEQu+nAqlp3gmmQR9+MeYPCyNnyhB
2g22PTMbXkihJmRPAVeH+WhwFy1YY3nsRhh61ha3/N0ULXTHUh0E+hWwUVMifYSV
SwXqQx4srlo6RJJNTji1d6R3muNjXCQNEsJ0lCOX6ajVoxWZsPH2x7/W1A8LKmY+
FLYQUi6X9ogQbOO3WxCjUhzp5nMTNA2vvo87MUzDlZOCLPqYZmqcjntHuXwdjPyO
lPcfTzc2nK1Ud26bG3+p2Bk3fjqkX9XcTMFniOvjKfffEfwpAq4xRPBQ3uRlzn0V
pf9067JYF+c=
=sVXH
-----END PGP SIGNATURE-----
Merge tag 'perf-core-2025-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull performance events updates from Ingo Molnar:
"Core perf code updates:
- Convert mmap() related reference counts to refcount_t. This is in
reaction to the recently fixed refcount bugs, which could have been
detected earlier and could have mitigated the bug somewhat (Thomas
Gleixner, Peter Zijlstra)
- Clean up and simplify the callchain code, in preparation for
sframes (Steven Rostedt, Josh Poimboeuf)
Uprobes updates:
- Add support to optimize usdt probes on x86-64, which gives a
substantial speedup (Jiri Olsa)
- Cleanups and fixes on x86 (Peter Zijlstra)
PMU driver updates:
- Various optimizations and fixes to the Intel PMU driver (Dapeng Mi)
Misc cleanups and fixes:
- Remove redundant __GFP_NOWARN (Qianfeng Rong)"
* tag 'perf-core-2025-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (57 commits)
selftests/bpf: Fix uprobe_sigill test for uprobe syscall error value
uprobes/x86: Return error from uprobe syscall when not called from trampoline
perf: Skip user unwind if the task is a kernel thread
perf: Simplify get_perf_callchain() user logic
perf: Use current->flags & PF_KTHREAD|PF_USER_WORKER instead of current->mm == NULL
perf: Have get_perf_callchain() return NULL if crosstask and user are set
perf: Remove get_perf_callchain() init_nr argument
perf/x86: Print PMU counters bitmap in x86_pmu_show_pmu_cap()
perf/x86/intel: Add ICL_FIXED_0_ADAPTIVE bit into INTEL_FIXED_BITS_MASK
perf/x86/intel: Change macro GLOBAL_CTRL_EN_PERF_METRICS to BIT_ULL(48)
perf/x86: Add PERF_CAP_PEBS_TIMING_INFO flag
perf/x86/intel: Fix IA32_PMC_x_CFG_B MSRs access error
perf/x86/intel: Use early_initcall() to hook bts_init()
uprobes: Remove redundant __GFP_NOWARN
selftests/seccomp: validate uprobe syscall passes through seccomp
seccomp: passthrough uprobe systemcall without filtering
selftests/bpf: Fix uprobe syscall shadow stack test
selftests/bpf: Change test_uretprobe_regs_change for uprobe and uretprobe
selftests/bpf: Add uprobe_regs_equal test
selftests/bpf: Add optimized usdt variant for basic usdt test
...
Confidential computing:
- Add support for accepting secrets from firmware (e.g. ACPI CCEL)
and mapping them with appropriate attributes.
CPU features:
- Advertise atomic floating-point instructions to userspace.
- Extend Spectre workarounds to cover additional Arm CPU variants.
- Extend list of CPUs that support break-before-make level 2 and
guarantee not to generate TLB conflict aborts for changes of mapping
granularity (BBML2_NOABORT).
- Add GCS support to our uprobes implementation.
Documentation:
- Remove bogus SME documentation concerning register state when
entering/exiting streaming mode.
Entry code:
- Switch over to the generic IRQ entry code (GENERIC_IRQ_ENTRY).
- Micro-optimise syscall entry path with a compiler branch hint.
Memory management:
- Enable huge mappings in vmalloc space even when kernel page-table
dumping is enabled.
- Tidy up the types used in our early MMU setup code.
- Rework rodata= for closer parity with the behaviour on x86.
- For CPUs implementing BBML2_NOABORT, utilise block mappings in the
linear map even when rodata= applies to virtual aliases.
- Don't re-allocate the virtual region between '_text' and '_stext',
as doing so confused tools parsing /proc/vmcore.
Miscellaneous:
- Clean-up Kconfig menuconfig text for architecture features.
- Avoid redundant bitmap_empty() during determination of supported
SME vector lengths.
- Re-enable warnings when building the 32-bit vDSO object.
- Avoid breaking our eggs at the wrong end.
Perf and PMUs:
- Support for v3 of the Hisilicon L3C PMU.
- Support for Hisilicon's MN and NoC PMUs.
- Support for Fujitsu's Uncore PMU.
- Support for SPE's extended event filtering feature.
- Preparatory work to enable data source filtering in SPE.
- Support for multiple lanes in the DWC PCIe PMU.
- Support for i.MX94 in the IMX DDR PMU driver.
- MAINTAINERS update (Thank you, Yicong).
- Minor driver fixes (PERF_IDX2OFF() overflow, CMN register offsets).
Selftests:
- Add basic LSFE check to the existing hwcaps test.
- Support nolibc in GCS tests.
- Extend SVE ptrace test to pass unsupported regsets and invalid vector
lengths.
- Minor cleanups (typos, cosmetic changes).
System registers:
- Fix ID_PFR1_EL1 definition.
- Fix incorrect signedness of some fields in ID_AA64MMFR4_EL1.
- Sync TCR_EL1 definition with the latest Arm ARM (L.b).
- Be stricter about the input fed into our AWK sysreg generator script.
- Typo fixes and removal of redundant definitions.
ACPI, EFI and PSCI:
- Decouple Arm's "Software Delegated Exception Interface" (SDEI)
support from the ACPI GHES code so that it can be used by platforms
booted with device-tree.
- Remove unnecessary per-CPU tracking of the FPSIMD state across EFI
runtime calls.
- Fix a node refcount imbalance in the PSCI device-tree code.
CPU Features:
- Ensure register sanitisation is applied to fields in ID_AA64MMFR4.
- Expose AIDR_EL1 to userspace via sysfs, primarily so that KVM guests
can reliably query the underlying CPU types from the VMM.
- Re-enabling of SME support (CONFIG_ARM64_SME) as a result of fixes
to our context-switching, signal handling and ptrace code.
-----BEGIN PGP SIGNATURE-----
iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAmjWlMIQHHdpbGxAa2Vy
bmVsLm9yZwAKCRC3rHDchMFjNIVYB/sHaFYmToEaK4ldMfTn4ZQ8yr9ZuTMLr/ji
zOm7wULP08wKIcp6t+1N6e7A8Wb3YSErxTywc4b9MqctIel1WvBMewjJd38xb2qO
hFCuRuWemfyt6Rw/EGMCP54yueLzKbnN4q+/Aks8jedWtS+AXY6McGCJOzMjuECL
zKb68Hqqb09YQ2b2BPSAiU9g42rotiIYppLaLbEdxUnw/eqfEN3wSrl92/LuBlqH
r2wZDnJwA5q8Iy9HWlnhg4Jax0Cs86jSJPIQLB6v3WWhc3HR49pm5cqYzYkUht9L
nA6eaWJiBl/+k3S+ftPKNzU8n04r15lqyNZE2FYfRk+0BPSacJOf
=kO15
-----END PGP SIGNATURE-----
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Will Deacon:
"There's good stuff across the board, including some nice mm
improvements for CPUs with the 'noabort' BBML2 feature and a clever
patch to allow ptdump to play nicely with block mappings in the
vmalloc area.
Confidential computing:
- Add support for accepting secrets from firmware (e.g. ACPI CCEL)
and mapping them with appropriate attributes.
CPU features:
- Advertise atomic floating-point instructions to userspace
- Extend Spectre workarounds to cover additional Arm CPU variants
- Extend list of CPUs that support break-before-make level 2 and
guarantee not to generate TLB conflict aborts for changes of
mapping granularity (BBML2_NOABORT)
- Add GCS support to our uprobes implementation.
Documentation:
- Remove bogus SME documentation concerning register state when
entering/exiting streaming mode.
Entry code:
- Switch over to the generic IRQ entry code (GENERIC_IRQ_ENTRY)
- Micro-optimise syscall entry path with a compiler branch hint.
Memory management:
- Enable huge mappings in vmalloc space even when kernel page-table
dumping is enabled
- Tidy up the types used in our early MMU setup code
- Rework rodata= for closer parity with the behaviour on x86
- For CPUs implementing BBML2_NOABORT, utilise block mappings in the
linear map even when rodata= applies to virtual aliases
- Don't re-allocate the virtual region between '_text' and '_stext',
as doing so confused tools parsing /proc/vmcore.
Miscellaneous:
- Clean-up Kconfig menuconfig text for architecture features
- Avoid redundant bitmap_empty() during determination of supported
SME vector lengths
- Re-enable warnings when building the 32-bit vDSO object
- Avoid breaking our eggs at the wrong end.
Perf and PMUs:
- Support for v3 of the Hisilicon L3C PMU
- Support for Hisilicon's MN and NoC PMUs
- Support for Fujitsu's Uncore PMU
- Support for SPE's extended event filtering feature
- Preparatory work to enable data source filtering in SPE
- Support for multiple lanes in the DWC PCIe PMU
- Support for i.MX94 in the IMX DDR PMU driver
- MAINTAINERS update (Thank you, Yicong)
- Minor driver fixes (PERF_IDX2OFF() overflow, CMN register offsets).
Selftests:
- Add basic LSFE check to the existing hwcaps test
- Support nolibc in GCS tests
- Extend SVE ptrace test to pass unsupported regsets and invalid
vector lengths
- Minor cleanups (typos, cosmetic changes).
System registers:
- Fix ID_PFR1_EL1 definition
- Fix incorrect signedness of some fields in ID_AA64MMFR4_EL1
- Sync TCR_EL1 definition with the latest Arm ARM (L.b)
- Be stricter about the input fed into our AWK sysreg generator
script
- Typo fixes and removal of redundant definitions.
ACPI, EFI and PSCI:
- Decouple Arm's "Software Delegated Exception Interface" (SDEI)
support from the ACPI GHES code so that it can be used by platforms
booted with device-tree
- Remove unnecessary per-CPU tracking of the FPSIMD state across EFI
runtime calls
- Fix a node refcount imbalance in the PSCI device-tree code.
CPU Features:
- Ensure register sanitisation is applied to fields in ID_AA64MMFR4
- Expose AIDR_EL1 to userspace via sysfs, primarily so that KVM
guests can reliably query the underlying CPU types from the VMM
- Re-enabling of SME support (CONFIG_ARM64_SME) as a result of fixes
to our context-switching, signal handling and ptrace code"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (93 commits)
arm64: cpufeature: Remove duplicate asm/mmu.h header
arm64: Kconfig: Make CPU_BIG_ENDIAN depend on BROKEN
perf/dwc_pcie: Fix use of uninitialized variable
arm/syscalls: mark syscall invocation as likely in invoke_syscall
Documentation: hisi-pmu: Add introduction to HiSilicon V3 PMU
Documentation: hisi-pmu: Fix of minor format error
drivers/perf: hisi: Add support for L3C PMU v3
drivers/perf: hisi: Refactor the event configuration of L3C PMU
drivers/perf: hisi: Extend the field of tt_core
drivers/perf: hisi: Extract the event filter check of L3C PMU
drivers/perf: hisi: Simplify the probe process of each L3C PMU version
drivers/perf: hisi: Export hisi_uncore_pmu_isr()
drivers/perf: hisi: Relax the event ID check in the framework
perf: Fujitsu: Add the Uncore PMU driver
arm64: map [_text, _stext) virtual address range non-executable+read-only
arm64/sysreg: Update TCR_EL1 register
arm64: Enable vmalloc-huge with ptdump
arm64: cpufeature: add Neoverse-V3AE to BBML2 allow list
arm64: errata: Apply workarounds for Neoverse-V3AE
arm64: cputype: Add Neoverse-V3AE definitions
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaNZgMQAKCRCRxhvAZXjc
ornXAP954dZjz+OJw6lJLCf0j9TXJOczGHvK3oW5ZD9KnqtTdwEA7p1A6WMOKJyl
8VtTgCS0yNt8QlznUnsSDfVm0jXVGAY=
=tUXG
-----END PGP SIGNATURE-----
Merge tag 'kernel-6.18-rc1.clone3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull copy_process updates from Christian Brauner:
"This contains the changes to enable support for clone3() on nios2
which apparently is still a thing.
The more exciting part of this is that it cleans up the inconsistency
in how the 64-bit flag argument is passed from copy_process() into the
various other copy_*() helpers"
[ Fixed up rv ltl_monitor 32-bit support as per Sasha Levin in the merge ]
* tag 'kernel-6.18-rc1.clone3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
nios2: implement architecture-specific portion of sys_clone3
arch: copy_thread: pass clone_flags as u64
copy_process: pass clone_flags as u64 across calltree
copy_sighand: Handle architectures where sizeof(unsigned long) < sizeof(u64)
If uprobe handler changes instruction pointer we still execute single
step) or emulate the original instruction and increment the (new) ip
with its length.
This makes the new instruction pointer bogus and application will
likely crash on illegal instruction execution.
If user decided to take execution elsewhere, it makes little sense
to execute the original instruction, so let's skip it.
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20250916215301.664963-3-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Currently uprobe (BPF_PROG_TYPE_KPROBE) program can't write to the
context registers data. While this makes sense for kprobe attachments,
for uprobe attachment it might make sense to be able to change user
space registers to alter application execution.
Since uprobe and kprobe programs share the same type (BPF_PROG_TYPE_KPROBE),
we can't deny write access to context during the program load. We need
to check on it during program attachment to see if it's going to be
kprobe or uprobe.
Storing the program's write attempt to context and checking on it
during the attachment.
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20250916215301.664963-2-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
uprobe_warn() is passed a task structure, yet its using current. For
the most part this shouldn't matter, but since a task structure is
provided, lets use it.
Fixes: 248d3a7b2f ("uprobes: Change uprobe_copy_process() to dup return_instances")
Signed-off-by: Jeremy Linton <jeremy.linton@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Will Deacon <will@kernel.org>
Now that the driver core can properly handle constant struct bus_type,
move the pmu_bus variable to be a constant structure as well,
placing it into read-only memory which can not be modified at runtime.
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: "Ricardo B. Marliere" <ricardo@marliere.net>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20240204-bus_cleanup-events-v1-1-c779d1639c3a@marliere.net
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
As part of the effort to move to mm->flags becoming a bitmap field,
convert existing users to making use of the mm_flags_*() accessors which
will, when the conversion is complete, be the only means of accessing
mm_struct flags.
No functional change intended.
Link: https://lkml.kernel.org/r/1d4fe5963904cc0c707da1f53fbfe6471d3eff10.1755012943.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Marc Rutland <mark.rutland@arm.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Namhyung kim <namhyung@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: xu xin <xu.xin16@zte.com.cn>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The event_limit can be set by the PERF_EVENT_IOC_REFRESH to limit the
number of events. When the event_limit reaches 0, the POLL_HUP signal
should be sent. But it's missed.
The corresponding counter should be stopped when the event_limit reaches
0. It was implemented in the ARCH-specific code. However, since the
commit 9734e25fbf ("perf: Fix the throttle logic for a group"), all
the ARCH-specific code has been moved to the generic code. The code to
handle the event_limit was lost.
Add the event->pmu->stop(event, 0); back.
Fixes: 9734e25fbf ("perf: Fix the throttle logic for a group")
Closes: https://lore.kernel.org/lkml/aICYAqM5EQUlTqtX@li-2b55cdcc-350b-11b2-a85c-a78bff51fc11.ibm.com/
Reported-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Link: https://lkml.kernel.org/r/20250811182644.1305952-1-kan.liang@linux.intel.com
With the introduction of clone3 in commit 7f192e3cd3 ("fork: add
clone3") the effective bit width of clone_flags on all architectures was
increased from 32-bit to 64-bit, with a new type of u64 for the flags.
However, for most consumers of clone_flags the interface was not
changed from the previous type of unsigned long.
While this works fine as long as none of the new 64-bit flag bits
(CLONE_CLEAR_SIGHAND and CLONE_INTO_CGROUP) are evaluated, this is still
undesirable in terms of the principle of least surprise.
Thus, this commit fixes all relevant interfaces of callees to
sys_clone3/copy_process (excluding the architecture-specific
copy_thread) to consistently pass clone_flags as u64, so that
no truncation to 32-bit integers occurs on 32-bit architectures.
Signed-off-by: Simon Schuster <schuster.simon@siemens-energy.com>
Link: https://lore.kernel.org/20250901-nios2-implement-clone3-v2-2-53fcf5577d57@siemens-energy.com
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
If the task is not a user thread, there's no user stack to unwind.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250820180428.930791978@kernel.org
Simplify the get_perf_callchain() user logic a bit. task_pt_regs()
should never be NULL.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20250820180428.760066227@kernel.org
To determine if a task is a kernel thread or not, it is more reliable to
use (current->flags & (PF_KTHREAD|PF_USER_WORKERi)) than to rely on
current->mm being NULL. That is because some kernel tasks (io_uring
helpers) may have a mm field.
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250820180428.592367294@kernel.org
get_perf_callchain() doesn't support cross-task unwinding for user space
stacks, have it return NULL if both the crosstask and user arguments are
set.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250820180428.426423415@kernel.org
The 'init_nr' argument has double duty: it's used to initialize both the
number of contexts and the number of stack entries. That's confusing
and the callers always pass zero anyway. Hard code the zero.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Namhyung Kim <Namhyung@kernel.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/r/20250820180428.259565081@kernel.org
Commit 16f5dfbc85 ("gfp: include __GFP_NOWARN in GFP_NOWAIT")
made GFP_NOWAIT implicitly include __GFP_NOWARN.
Therefore, explicit __GFP_NOWARN combined with GFP_NOWAIT
(e.g., `GFP_NOWAIT | __GFP_NOWARN`) is now redundant. Let's clean
up these redundant flags across subsystems.
No functional changes.
Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20250805025000.346647-1-rongqianfeng@vivo.com
Putting together all the previously added pieces to support optimized
uprobes on top of 5-byte nop instruction.
The current uprobe execution goes through following:
- installs breakpoint instruction over original instruction
- exception handler hit and calls related uprobe consumers
- and either simulates original instruction or does out of line single step
execution of it
- returns to user space
The optimized uprobe path does following:
- checks the original instruction is 5-byte nop (plus other checks)
- adds (or uses existing) user space trampoline with uprobe syscall
- overwrites original instruction (5-byte nop) with call to user space
trampoline
- the user space trampoline executes uprobe syscall that calls related uprobe
consumers
- trampoline returns back to next instruction
This approach won't speed up all uprobes as it's limited to using nop5 as
original instruction, but we plan to use nop5 as USDT probe instruction
(which currently uses single byte nop) and speed up the USDT probes.
The arch_uprobe_optimize triggers the uprobe optimization and is called after
first uprobe hit. I originally had it called on uprobe installation but then
it clashed with elf loader, because the user space trampoline was added in a
place where loader might need to put elf segments, so I decided to do it after
first uprobe hit when loading is done.
The uprobe is un-optimized in arch specific set_orig_insn call.
The instruction overwrite is x86 arch specific and needs to go through 3 updates:
(on top of nop5 instruction)
- write int3 into 1st byte
- write last 4 bytes of the call instruction
- update the call instruction opcode
And cleanup goes though similar reverse stages:
- overwrite call opcode with breakpoint (int3)
- write last 4 bytes of the nop5 instruction
- write the nop5 first instruction byte
We do not unmap and release uprobe trampoline when it's no longer needed,
because there's no easy way to make sure none of the threads is still
inside the trampoline. But we do not waste memory, because there's just
single page for all the uprobe trampoline mappings.
We do waste frame on page mapping for every 4GB by keeping the uprobe
trampoline page mapped, but that seems ok.
We take the benefit from the fact that set_swbp and set_orig_insn are
called under mmap_write_lock(mm), so we can use the current instruction
as the state the uprobe is in - nop5/breakpoint/call trampoline -
and decide the needed action (optimize/un-optimize) based on that.
Attaching the speed up from benchs/run_bench_uprobes.sh script:
current:
usermode-count : 152.604 ± 0.044M/s
syscall-count : 13.359 ± 0.042M/s
--> uprobe-nop : 3.229 ± 0.002M/s
uprobe-push : 3.086 ± 0.004M/s
uprobe-ret : 1.114 ± 0.004M/s
uprobe-nop5 : 1.121 ± 0.005M/s
uretprobe-nop : 2.145 ± 0.002M/s
uretprobe-push : 2.070 ± 0.001M/s
uretprobe-ret : 0.931 ± 0.001M/s
uretprobe-nop5 : 0.957 ± 0.001M/s
after the change:
usermode-count : 152.448 ± 0.244M/s
syscall-count : 14.321 ± 0.059M/s
uprobe-nop : 3.148 ± 0.007M/s
uprobe-push : 2.976 ± 0.004M/s
uprobe-ret : 1.068 ± 0.003M/s
--> uprobe-nop5 : 7.038 ± 0.007M/s
uretprobe-nop : 2.109 ± 0.004M/s
uretprobe-push : 2.035 ± 0.001M/s
uretprobe-ret : 0.908 ± 0.001M/s
uretprobe-nop5 : 3.377 ± 0.009M/s
I see bit more speed up on Intel (above) compared to AMD. The big nop5
speed up is partly due to emulating nop5 and partly due to optimization.
The key speed up we do this for is the USDT switch from nop to nop5:
uprobe-nop : 3.148 ± 0.007M/s
uprobe-nop5 : 7.038 ± 0.007M/s
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20250720112133.244369-11-jolsa@kernel.org
Adding new uprobe syscall that calls uprobe handlers for given
'breakpoint' address.
The idea is that the 'breakpoint' address calls the user space
trampoline which executes the uprobe syscall.
The syscall handler reads the return address of the initial call
to retrieve the original 'breakpoint' address. With this address
we find the related uprobe object and call its consumers.
Adding the arch_uprobe_trampoline_mapping function that provides
uprobe trampoline mapping. This mapping is backed with one global
page initialized at __init time and shared by the all the mapping
instances.
We do not allow to execute uprobe syscall if the caller is not
from uprobe trampoline mapping.
The uprobe syscall ensures the consumer (bpf program) sees registers
values in the state before the trampoline was called.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20250720112133.244369-10-jolsa@kernel.org
Adding support to add special mapping for user space trampoline with
following functions:
uprobe_trampoline_get - find or add uprobe_trampoline
uprobe_trampoline_put - remove or destroy uprobe_trampoline
The user space trampoline is exported as arch specific user space special
mapping through tramp_mapping, which is initialized in following changes
with new uprobe syscall.
The uprobe trampoline needs to be callable/reachable from the probed address,
so while searching for available address we use is_reachable_by_call function
to decide if the uprobe trampoline is callable from the probe address.
All uprobe_trampoline objects are stored in uprobes_state object and are
cleaned up when the process mm_struct goes down. Adding new arch hooks
for that, because this change is x86_64 specific.
Locking is provided by callers in following changes.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20250720112133.244369-9-jolsa@kernel.org
Making update_ref_ctr call in uprobe_write conditional based
on do_ref_ctr argument. This way we can use uprobe_write for
instruction update without doing ref_ctr_offset update.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20250720112133.244369-8-jolsa@kernel.org
The uprobe_write has special path to restore the original page when we
write original instruction back. This happens when uprobe_write detects
that we want to write anything else but breakpoint instruction.
Moving the detection away and passing it to uprobe_write as argument,
so it's possible to write different instructions (other than just
breakpoint and rest).
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20250720112133.244369-7-jolsa@kernel.org
Adding nbytes argument to uprobe_write and related functions as
preparation for writing whole instructions in following changes.
Also renaming opcode arguments to insn, which seems to fit better.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20250720112133.244369-6-jolsa@kernel.org
Adding uprobe_write function that does what uprobe_write_opcode did
so far, but allows to pass verify callback function that checks the
memory location before writing the opcode.
It will be used in following changes to implement specific checking
logic for instruction update.
The uprobe_write_opcode now calls uprobe_write with verify_opcode as
the verify callback.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20250720112133.244369-5-jolsa@kernel.org
Making copy_from_page global and adding uprobe prefix.
Adding the uprobe prefix to copy_to_page as well for symmetry.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20250720112133.244369-4-jolsa@kernel.org
We are about to add uprobe trampoline, so cleaning up the namespace.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20250720112133.244369-3-jolsa@kernel.org
Currently unapply_uprobe takes mmap_read_lock, but it might call
remove_breakpoint which eventually changes user pages.
Current code writes either breakpoint or original instruction, so it can
go away with read lock as explained in here [1]. But with the upcoming
change that writes multiple instructions on the probed address we need
to ensure that any update to mm's pages is exclusive.
[1] https://lore.kernel.org/all/20240710140045.GA1084@redhat.com/
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20250720112133.244369-2-jolsa@kernel.org
The recently fixed reference count leaks could have been detected by using
refcount_t and refcount_t would have mitigated the potential overflow at
least.
Now that the code is properly structured, convert the mmap() related
mmap_count variants over to refcount_t.
No functional change intended.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://lore.kernel.org/r/20250812104020.071507932@infradead.org
Needed because refcount_inc() doesn't allow the 0->1 transition.
Specifically, this is the case where we've created the RB, this means
there was no RB, and as such there could not have been an mmap.
Additionally we hold mmap_mutex to serialize everything.
This must be the first.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250812104019.956479989@infradead.org
Mostly just re-indent noise.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://lore.kernel.org/r/20250812104019.838047976@infradead.org
Move the RB buffer allocation branch into its own function.
Originally-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://lore.kernel.org/r/20250812104019.722214699@infradead.org
Move the AUX buffer allocation branch into its own function.
Originally-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://lore.kernel.org/r/20250812104019.494205648@infradead.org
After duplicating the common code into the rb/aux branches is it
possible to use a simple guard() for the aux_mutex. Making the aux
branch self-contained.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://lore.kernel.org/r/20250812104019.246250452@infradead.org
unlock and aux_unlock are now identical, remove the aux_unlock one.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://lore.kernel.org/r/20250812104019.131293512@infradead.org
if (cond) {
A;
} else {
B;
}
C;
into
if (cond) {
A;
C;
} else {
B;
C;
}
Notably C has a success branch and both A and B have two places for
success. For A (rb case), duplicate the success case because later
patches will result in them no longer being identical. For B (aux
case), share using goto (cleaned up later).
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://lore.kernel.org/r/20250812104019.016252852@infradead.org
if (cond) {
A;
} else {
B;
}
if (cond) {
C;
} else {
D;
}
into:
if (cond) {
A;
C;
} else {
B;
D;
}
Notably the conditions are not identical in form, but are equivalent.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://lore.kernel.org/r/20250812104018.900078502@infradead.org
Similarly to the mlock limit calculation the VM accounting is required for
both the ringbuffer and the AUX buffer allocations.
To prepare for splitting them out into separate functions, move the
accounting into a helper function.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://lore.kernel.org/r/20250812104018.660347811@infradead.org
To prepare for splitting the buffer allocation out into separate functions
for the ring buffer and the AUX buffer, split out mlock limit handling into
a helper function, which can be called from both.
No functional change intended.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://lore.kernel.org/r/20250812104018.541975109@infradead.org
It is already checked whether the VMA size is the same as
nr_pages * PAGE_SIZE, so later checking both:
aux_size == vma_size && aux_size == nr_pages * PAGE_SIZE
is redundant. Remove the vma_size check as nr_pages is what is actually
used in the allocation function. That prepares for splitting out the buffer
allocation into separate functions, so that only nr_pages needs to be
handed in.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://lore.kernel.org/r/20250812104018.424519320@infradead.org
Calling pmu->start()/stop() on perf events in PERF_EVENT_STATE_OFF can
leave event->hw.idx at -1. When PMU drivers later attempt to use this
negative index as a shift exponent in bitwise operations, it leads to UBSAN
shift-out-of-bounds reports.
The issue is a logical flaw in how event groups handle throttling when some
members are intentionally disabled. Based on the analysis and the
reproducer provided by Mark Rutland (this issue on both arm64 and x86-64).
The scenario unfolds as follows:
1. A group leader event is configured with a very aggressive sampling
period (e.g., sample_period = 1). This causes frequent interrupts and
triggers the throttling mechanism.
2. A child event in the same group is created in a disabled state
(.disabled = 1). This event remains in PERF_EVENT_STATE_OFF.
Since it hasn't been scheduled onto the PMU, its event->hw.idx remains
initialized at -1.
3. When throttling occurs, perf_event_throttle_group() and later
perf_event_unthrottle_group() iterate through all siblings, including
the disabled child event.
4. perf_event_throttle()/unthrottle() are called on this inactive child
event, which then call event->pmu->start()/stop().
5. The PMU driver receives the event with hw.idx == -1 and attempts to
use it as a shift exponent. e.g., in macros like PMCNTENSET(idx),
leading to the UBSAN report.
The throttling mechanism attempts to start/stop events that are not
actively scheduled on the hardware.
Move the state check into perf_event_throttle()/perf_event_unthrottle() so
that inactive events are skipped entirely. This ensures only active events
with a valid hw.idx are processed, preventing undefined behavior and
silencing UBSAN warnings. The corrected check ensures true before
proceeding with PMU operations.
The problem can be reproduced with the syzkaller reproducer:
Fixes: 9734e25fbf ("perf: Fix the throttle logic for a group")
Signed-off-by: Yunseong Kim <ysk@kzalloc.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Link: https://lore.kernel.org/r/20250812181046.292382-2-ysk@kzalloc.com
The perf mmap code is careful about mmap()'ing the user page with the
ringbuffer and additionally the auxiliary buffer, when the event supports
it. Once the first mapping is established, subsequent mapping have to use
the same offset and the same size in both cases. The reference counting for
the ringbuffer and the auxiliary buffer depends on this being correct.
Though perf does not prevent that a related mapping is split via mmap(2),
munmap(2) or mremap(2). A split of a VMA results in perf_mmap_open() calls,
which take reference counts, but then the subsequent perf_mmap_close()
calls are not longer fulfilling the offset and size checks. This leads to
reference count leaks.
As perf already has the requirement for subsequent mappings to match the
initial mapping, the obvious consequence is that VMA splits, caused by
resizing of a mapping or partial unmapping, have to be prevented.
Implement the vm_operations_struct::may_split() callback and return
unconditionally -EINVAL.
That ensures that the mapping offsets and sizes cannot be changed after the
fact. Remapping to a different fixed address with the same size is still
possible as it takes the references for the new mapping and drops those of
the old mapping.
Fixes: 45bfb2e504 ("perf: Add AUX area to ring buffer for raw data streams")
Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-27504
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: stable@vger.kernel.org
After successful allocation of a buffer or a successful attachment to an
existing buffer perf_mmap() tries to map the buffer read only into the page
table. If that fails, the already set up page table entries are zapped, but
the other perf specific side effects of that failure are not handled. The
calling code just cleans up the VMA and does not invoke perf_mmap_close().
This leaks reference counts, corrupts user->vm accounting and also results
in an unbalanced invocation of event::event_mapped().
Cure this by moving the event::event_mapped() invocation before the
map_range() call so that on map_range() failure perf_mmap_close() can be
invoked without causing an unbalanced event::event_unmapped() call.
perf_mmap_close() undoes the reference counts and eventually frees buffers.
Fixes: b709eb872e ("perf: map pages in advance")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: stable@vger.kernel.org
When perf_mmap() fails to allocate a buffer, it still invokes the
event_mapped() callback of the related event. On X86 this might increase
the perf_rdpmc_allowed reference counter. But nothing undoes this as
perf_mmap_close() is never called in this case, which causes another
reference count leak.
Return early on failure to prevent that.
Fixes: 1e0fb9ec67 ("perf: Add pmu callbacks to track event mapping and unmapping")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: stable@vger.kernel.org
Failure of the AUX buffer allocation leaks the reference count.
Set the reference count to 1 only when the allocation succeeds.
Fixes: 45bfb2e504 ("perf: Add AUX area to ring buffer for raw data streams")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: stable@vger.kernel.org