- The 3 patch series "mm, swap: improve cluster scan strategy" from
Kairui Song improves performance and reduces the failure rate of swap
cluster allocation.
- The 4 patch series "support large align and nid in Rust allocators"
from Vitaly Wool permits Rust allocators to set NUMA node and large
alignment when perforning slub and vmalloc reallocs.
- The 2 patch series "mm/damon/vaddr: support stat-purpose DAMOS" from
Yueyang Pan extend DAMOS_STAT's handling of the DAMON operations sets
for virtual address spaces for ops-level DAMOS filters.
- The 3 patch series "execute PROCMAP_QUERY ioctl under per-vma lock"
from Suren Baghdasaryan reduces mmap_lock contention during reads of
/proc/pid/maps.
- The 2 patch series "mm/mincore: minor clean up for swap cache
checking" from Kairui Song performs some cleanup in the swap code.
- The 11 patch series "mm: vm_normal_page*() improvements" from David
Hildenbrand provides code cleanup in the pagemap code.
- The 5 patch series "add persistent huge zero folio support" from
Pankaj Raghav provides a block layer speedup by optionalls making the
huge_zero_pagepersistent, instead of releasing it when its refcount
falls to zero.
- The 3 patch series "kho: fixes and cleanups" from Mike Rapoport adds a
few touchups to the recently added Kexec Handover feature.
- The 10 patch series "mm: make mm->flags a bitmap and 64-bit on all
arches" from Lorenzo Stoakes turns mm_struct.flags into a bitmap. To
end the constant struggle with space shortage on 32-bit conflicting with
64-bit's needs.
- The 2 patch series "mm/swapfile.c and swap.h cleanup" from Chris Li
cleans up some swap code.
- The 7 patch series "selftests/mm: Fix false positives and skip
unsupported tests" from Donet Tom fixes a few things in our selftests
code.
- The 7 patch series "prctl: extend PR_SET_THP_DISABLE to only provide
THPs when advised" from David Hildenbrand "allows individual processes
to opt-out of THP=always into THP=madvise, without affecting other
workloads on the system".
It's a long story - the [1/N] changelog spells out the considerations.
- The 11 patch series "Add and use memdesc_flags_t" from Matthew Wilcox
gets us started on the memdesc project. Please see
https://kernelnewbies.org/MatthewWilcox/Memdescs and
https://blogs.oracle.com/linux/post/introducing-memdesc.
- The 3 patch series "Tiny optimization for large read operations" from
Chi Zhiling improves the efficiency of the pagecache read path.
- The 5 patch series "Better split_huge_page_test result check" from Zi
Yan improves our folio splitting selftest code.
- The 2 patch series "test that rmap behaves as expected" from Wei Yang
adds some rmap selftests.
- The 3 patch series "remove write_cache_pages()" from Christoph Hellwig
removes that function and converts its two remaining callers.
- The 2 patch series "selftests/mm: uffd-stress fixes" from Dev Jain
fixes some UFFD selftests issues.
- The 3 patch series "introduce kernel file mapped folios" from Boris
Burkov introduces the concept of "kernel file pages". Using these
permits btrfs to account its metadata pages to the root cgroup, rather
than to the cgroups of random inappropriate tasks.
- The 2 patch series "mm/pageblock: improve readability of some
pageblock handling" from Wei Yang provides some readability improvements
to the page allocator code.
- The 11 patch series "mm/damon: support ARM32 with LPAE" from SeongJae
Park teaches DAMON to understand arm32 highmem.
- The 4 patch series "tools: testing: Use existing atomic.h for
vma/maple tests" from Brendan Jackman performs some code cleanups and
deduplication under tools/testing/.
- The 2 patch series "maple_tree: Fix testing for 32bit compiles" from
Liam Howlett fixes a couple of 32-bit issues in
tools/testing/radix-tree.c.
- The 2 patch series "kasan: unify kasan_enabled() and remove
arch-specific implementations" from Sabyrzhan Tasbolatov moves KASAN
arch-specific initialization code into a common arch-neutral
implementation.
- The 3 patch series "mm: remove zpool" from Johannes Weiner removes
zspool - an indirection layer which now only redirects to a single thing
(zsmalloc).
- The 2 patch series "mm: task_stack: Stack handling cleanups" from
Pasha Tatashin makes a couple of cleanups in the fork code.
- The 37 patch series "mm: remove nth_page()" from David Hildenbrand
makes rather a lot of adjustments at various nth_page() callsites,
eventually permitting the removal of that undesirable helper function.
- The 2 patch series "introduce kasan.write_only option in hw-tags" from
Yeoreum Yun creates a KASAN read-only mode for ARM, using that
architecture's memory tagging feature. It is felt that a read-only mode
KASAN is suitable for use in production systems rather than debug-only.
- The 3 patch series "mm: hugetlb: cleanup hugetlb folio allocation"
from Kefeng Wang does some tidying in the hugetlb folio allocation code.
- The 12 patch series "mm: establish const-correctness for pointer
parameters" from Max Kellermann makes quite a number of the MM API
functions more accurate about the constness of their arguments. This
was getting in the way of subsystems (in this case CEPH) when they
attempt to improving their own const/non-const accuracy.
- The 7 patch series "Cleanup free_pages() misuse" from Vishal Moola
fixes a number of code sites which were confused over when to use
free_pages() vs __free_pages().
- The 3 patch series "Add Rust abstraction for Maple Trees" from Alice
Ryhl makes the mapletree code accessible to Rust. Required by nouveau
and by its forthcoming successor: the new Rust Nova driver.
- The 2 patch series "selftests/mm: split_huge_page_test:
split_pte_mapped_thp improvements" from David Hildenbrand adds a fix and
some cleanups to the thp selftesting code.
- The 14 patch series "mm, swap: introduce swap table as swap cache
(phase I)" from Chris Li and Kairui Song is the first step along the
path to implementing "swap tables" - a new approach to swap allocation
and state tracking which is expected to yield speed and space
improvements. This patchset itself yields a 5-20% performance benefit
in some situations.
- The 3 patch series "Some ptdesc cleanups" from Matthew Wilcox utilizes
the new memdesc layer to clean up the ptdesc code a little.
- The 3 patch series "Fix va_high_addr_switch.sh test failure" from
Chunyu Hu fixes some issues in our 5-level pagetable selftesting code.
- The 2 patch series "Minor fixes for memory allocation profiling" from
Suren Baghdasaryan addresses a couple of minor issues in relatively new
memory allocation profiling feature.
- The 3 patch series "Small cleanups" from Matthew Wilcox has a few
cleanups in preparation for more memdesc work.
- The 2 patch series "mm/damon: add addr_unit for DAMON_LRU_SORT and
DAMON_RECLAIM" from Quanmin Yan makes some changes to DAMON in
furtherance of supporting arm highmem.
- The 2 patch series "selftests/mm: Add -Wunreachable-code and fix
warnings" from Muhammad Anjum adds that compiler check to selftests code
and fixes the fallout, by removing dead code.
- The 10 patch series "Improvements to Victim Process Thawing and OOM
Reaper Traversal Order" from zhongjinji makes a number of improvements
in the OOM killer: mainly thawing a more appropriate group of victim
threads so they can release resources.
- The 5 patch series "mm/damon: misc fixups and improvements for 6.18"
from SeongJae Park is a bunch of small and unrelated fixups for DAMON.
- The 7 patch series "mm/damon: define and use DAMON initialization
check function" from SeongJae Park implement reliability and
maintainability improvements to a recently-added bug fix.
- The 2 patch series "mm/damon/stat: expose auto-tuned intervals and
non-idle ages" from SeongJae Park provides additional transparency to
userspace clients of the DAMON_STAT information.
- The 2 patch series "Expand scope of khugepaged anonymous collapse"
from Dev Jain removes some constraints on khubepaged's collapsing of
anon VMAs. It also increases the success rate of MADV_COLLAPSE against
an anon vma.
- The 2 patch series "mm: do not assume file == vma->vm_file in
compat_vma_mmap_prepare()" from Lorenzo Stoakes moves us further towards
removal of file_operations.mmap(). This patchset concentrates upon
clearing up the treatment of stacked filesystems.
- The 6 patch series "mm: Improve mlock tracking for large folios" from
Kiryl Shutsemau provides some fixes and improvements to mlock's tracking
of large folios. /proc/meminfo's "Mlocked" field became more accurate.
- The 2 patch series "mm/ksm: Fix incorrect accounting of KSM counters
during fork" from Donet Tom fixes several user-visible KSM stats
inaccuracies across forks and adds selftest code to verify these
counters.
- The 2 patch series "mm_slot: fix the usage of mm_slot_entry" from Wei
Yang addresses some potential but presently benign issues in KSM's
mm_slot handling.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaN3cywAKCRDdBJ7gKXxA
jtaPAQDmIuIu7+XnVUK5V11hsQ/5QtsUeLHV3OsAn4yW5/3dEQD/UddRU08ePN+1
2VRB0EwkLAdfMWW7TfiNZ+yhuoiL/AA=
=4mhY
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2025-10-01-19-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- "mm, swap: improve cluster scan strategy" from Kairui Song improves
performance and reduces the failure rate of swap cluster allocation
- "support large align and nid in Rust allocators" from Vitaly Wool
permits Rust allocators to set NUMA node and large alignment when
perforning slub and vmalloc reallocs
- "mm/damon/vaddr: support stat-purpose DAMOS" from Yueyang Pan extend
DAMOS_STAT's handling of the DAMON operations sets for virtual
address spaces for ops-level DAMOS filters
- "execute PROCMAP_QUERY ioctl under per-vma lock" from Suren
Baghdasaryan reduces mmap_lock contention during reads of
/proc/pid/maps
- "mm/mincore: minor clean up for swap cache checking" from Kairui Song
performs some cleanup in the swap code
- "mm: vm_normal_page*() improvements" from David Hildenbrand provides
code cleanup in the pagemap code
- "add persistent huge zero folio support" from Pankaj Raghav provides
a block layer speedup by optionalls making the
huge_zero_pagepersistent, instead of releasing it when its refcount
falls to zero
- "kho: fixes and cleanups" from Mike Rapoport adds a few touchups to
the recently added Kexec Handover feature
- "mm: make mm->flags a bitmap and 64-bit on all arches" from Lorenzo
Stoakes turns mm_struct.flags into a bitmap. To end the constant
struggle with space shortage on 32-bit conflicting with 64-bit's
needs
- "mm/swapfile.c and swap.h cleanup" from Chris Li cleans up some swap
code
- "selftests/mm: Fix false positives and skip unsupported tests" from
Donet Tom fixes a few things in our selftests code
- "prctl: extend PR_SET_THP_DISABLE to only provide THPs when advised"
from David Hildenbrand "allows individual processes to opt-out of
THP=always into THP=madvise, without affecting other workloads on the
system".
It's a long story - the [1/N] changelog spells out the considerations
- "Add and use memdesc_flags_t" from Matthew Wilcox gets us started on
the memdesc project. Please see
https://kernelnewbies.org/MatthewWilcox/Memdescs and
https://blogs.oracle.com/linux/post/introducing-memdesc
- "Tiny optimization for large read operations" from Chi Zhiling
improves the efficiency of the pagecache read path
- "Better split_huge_page_test result check" from Zi Yan improves our
folio splitting selftest code
- "test that rmap behaves as expected" from Wei Yang adds some rmap
selftests
- "remove write_cache_pages()" from Christoph Hellwig removes that
function and converts its two remaining callers
- "selftests/mm: uffd-stress fixes" from Dev Jain fixes some UFFD
selftests issues
- "introduce kernel file mapped folios" from Boris Burkov introduces
the concept of "kernel file pages". Using these permits btrfs to
account its metadata pages to the root cgroup, rather than to the
cgroups of random inappropriate tasks
- "mm/pageblock: improve readability of some pageblock handling" from
Wei Yang provides some readability improvements to the page allocator
code
- "mm/damon: support ARM32 with LPAE" from SeongJae Park teaches DAMON
to understand arm32 highmem
- "tools: testing: Use existing atomic.h for vma/maple tests" from
Brendan Jackman performs some code cleanups and deduplication under
tools/testing/
- "maple_tree: Fix testing for 32bit compiles" from Liam Howlett fixes
a couple of 32-bit issues in tools/testing/radix-tree.c
- "kasan: unify kasan_enabled() and remove arch-specific
implementations" from Sabyrzhan Tasbolatov moves KASAN arch-specific
initialization code into a common arch-neutral implementation
- "mm: remove zpool" from Johannes Weiner removes zspool - an
indirection layer which now only redirects to a single thing
(zsmalloc)
- "mm: task_stack: Stack handling cleanups" from Pasha Tatashin makes a
couple of cleanups in the fork code
- "mm: remove nth_page()" from David Hildenbrand makes rather a lot of
adjustments at various nth_page() callsites, eventually permitting
the removal of that undesirable helper function
- "introduce kasan.write_only option in hw-tags" from Yeoreum Yun
creates a KASAN read-only mode for ARM, using that architecture's
memory tagging feature. It is felt that a read-only mode KASAN is
suitable for use in production systems rather than debug-only
- "mm: hugetlb: cleanup hugetlb folio allocation" from Kefeng Wang does
some tidying in the hugetlb folio allocation code
- "mm: establish const-correctness for pointer parameters" from Max
Kellermann makes quite a number of the MM API functions more accurate
about the constness of their arguments. This was getting in the way
of subsystems (in this case CEPH) when they attempt to improving
their own const/non-const accuracy
- "Cleanup free_pages() misuse" from Vishal Moola fixes a number of
code sites which were confused over when to use free_pages() vs
__free_pages()
- "Add Rust abstraction for Maple Trees" from Alice Ryhl makes the
mapletree code accessible to Rust. Required by nouveau and by its
forthcoming successor: the new Rust Nova driver
- "selftests/mm: split_huge_page_test: split_pte_mapped_thp
improvements" from David Hildenbrand adds a fix and some cleanups to
the thp selftesting code
- "mm, swap: introduce swap table as swap cache (phase I)" from Chris
Li and Kairui Song is the first step along the path to implementing
"swap tables" - a new approach to swap allocation and state tracking
which is expected to yield speed and space improvements. This
patchset itself yields a 5-20% performance benefit in some situations
- "Some ptdesc cleanups" from Matthew Wilcox utilizes the new memdesc
layer to clean up the ptdesc code a little
- "Fix va_high_addr_switch.sh test failure" from Chunyu Hu fixes some
issues in our 5-level pagetable selftesting code
- "Minor fixes for memory allocation profiling" from Suren Baghdasaryan
addresses a couple of minor issues in relatively new memory
allocation profiling feature
- "Small cleanups" from Matthew Wilcox has a few cleanups in
preparation for more memdesc work
- "mm/damon: add addr_unit for DAMON_LRU_SORT and DAMON_RECLAIM" from
Quanmin Yan makes some changes to DAMON in furtherance of supporting
arm highmem
- "selftests/mm: Add -Wunreachable-code and fix warnings" from Muhammad
Anjum adds that compiler check to selftests code and fixes the
fallout, by removing dead code
- "Improvements to Victim Process Thawing and OOM Reaper Traversal
Order" from zhongjinji makes a number of improvements in the OOM
killer: mainly thawing a more appropriate group of victim threads so
they can release resources
- "mm/damon: misc fixups and improvements for 6.18" from SeongJae Park
is a bunch of small and unrelated fixups for DAMON
- "mm/damon: define and use DAMON initialization check function" from
SeongJae Park implement reliability and maintainability improvements
to a recently-added bug fix
- "mm/damon/stat: expose auto-tuned intervals and non-idle ages" from
SeongJae Park provides additional transparency to userspace clients
of the DAMON_STAT information
- "Expand scope of khugepaged anonymous collapse" from Dev Jain removes
some constraints on khubepaged's collapsing of anon VMAs. It also
increases the success rate of MADV_COLLAPSE against an anon vma
- "mm: do not assume file == vma->vm_file in compat_vma_mmap_prepare()"
from Lorenzo Stoakes moves us further towards removal of
file_operations.mmap(). This patchset concentrates upon clearing up
the treatment of stacked filesystems
- "mm: Improve mlock tracking for large folios" from Kiryl Shutsemau
provides some fixes and improvements to mlock's tracking of large
folios. /proc/meminfo's "Mlocked" field became more accurate
- "mm/ksm: Fix incorrect accounting of KSM counters during fork" from
Donet Tom fixes several user-visible KSM stats inaccuracies across
forks and adds selftest code to verify these counters
- "mm_slot: fix the usage of mm_slot_entry" from Wei Yang addresses
some potential but presently benign issues in KSM's mm_slot handling
* tag 'mm-stable-2025-10-01-19-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (372 commits)
mm: swap: check for stable address space before operating on the VMA
mm: convert folio_page() back to a macro
mm/khugepaged: use start_addr/addr for improved readability
hugetlbfs: skip VMAs without shareable locks in hugetlb_vmdelete_list
alloc_tag: fix boot failure due to NULL pointer dereference
mm: silence data-race in update_hiwater_rss
mm/memory-failure: don't select MEMORY_ISOLATION
mm/khugepaged: remove definition of struct khugepaged_mm_slot
mm/ksm: get mm_slot by mm_slot_entry() when slot is !NULL
hugetlb: increase number of reserving hugepages via cmdline
selftests/mm: add fork inheritance test for ksm_merging_pages counter
mm/ksm: fix incorrect KSM counter handling in mm_struct during fork
drivers/base/node: fix double free in register_one_node()
mm: remove PMD alignment constraint in execmem_vmalloc()
mm/memory_hotplug: fix typo 'esecially' -> 'especially'
mm/rmap: improve mlock tracking for large folios
mm/filemap: map entire large folio faultaround
mm/fault: try to map the entire file folio in finish_fault()
mm/rmap: mlock large folios in try_to_unmap_one()
mm/rmap: fix a mlock race condition in folio_referenced_one()
...
-----BEGIN PGP SIGNATURE-----
iQFPBAABCAA5FiEEe7vIQRWZI0iWSE3xu+CwddJFiJoFAmja74IbFIAAAAAABAAO
bWFudTIsMi41KzEuMTEsMiwyAAoJELvgsHXSRYiacR4H/04aBsr7LZnTJVeZLQwK
HKoOwXBqiQyqPdjKXGKnp7Mh9gRp2W3V11VsYTuDJNUS+Vz5YXW0z8cRnUfZ3SYs
l+GZC3vZeAy2EVJE1U6Mb673hU8vziI80IO2q/tGzaj9a+wC3L0lemc+YFQTwG+u
pMtt8zU2vHRjgkx8TNNqJBBOLLDV+RzIl8pqXVnh4eju6x6ZdreGnjXaePYMdjG0
fXLf9XwIeWREqbfeOCEOB50Ts71kkdiOeskwnJyfCTDT8WTu3zC/dICqfh66e3Gg
8hQKvMsuKpm/FwbtgdB0WvaDjENH6PmY+ubLYVxwvNpcsTSqfe0IYGm+HpUP+TPf
m+Y=
=w+JL
-----END PGP SIGNATURE-----
Merge tag 'slab-for-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab
Pull slab updates from Vlastimil Babka:
- A new layer for caching objects for allocation and free via percpu
arrays called sheaves.
The aim is to combine the good parts of SLAB (lower-overhead and
simpler percpu caching, compared to SLUB) without the past issues
with arrays for freeing remote NUMA node objects and their flushing.
It also allows more efficient kfree_rcu(), and cheaper object
preallocations for cases where the exact number of objects is
unknown, but an upper bound is.
Currently VMAs and maple nodes are using this new caching, with a
plan to enable it for all caches and remove the complex SLUB fastpath
based on cpu (partial) slabs and this_cpu_cmpxchg_double().
(Vlastimil Babka, with Liam Howlett and Pedro Falcato for the maple
tree changes)
- Re-entrant kmalloc_nolock(), which allows opportunistic allocations
from NMI and tracing/kprobe contexts.
Building on prior page allocator and memcg changes, it will result in
removing BPF-specific caches on top of slab (Alexei Starovoitov)
- Various fixes and cleanups. (Kuan-Wei Chiu, Matthew Wilcox, Suren
Baghdasaryan, Ye Liu)
* tag 'slab-for-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: (40 commits)
slab: Introduce kmalloc_nolock() and kfree_nolock().
slab: Reuse first bit for OBJEXTS_ALLOC_FAIL
slab: Make slub local_(try)lock more precise for LOCKDEP
mm: Introduce alloc_frozen_pages_nolock()
mm: Allow GFP_ACCOUNT to be used in alloc_pages_nolock().
locking/local_lock: Introduce local_lock_is_locked().
maple_tree: Convert forking to use the sheaf interface
maple_tree: Add single node allocation support to maple state
maple_tree: Prefilled sheaf conversion and testing
tools/testing: Add support for prefilled slab sheafs
maple_tree: Replace mt_free_one() with kfree()
maple_tree: Use kfree_rcu in ma_free_rcu
testing/radix-tree/maple: Hack around kfree_rcu not existing
tools/testing: include maple-shim.c in maple.c
maple_tree: use percpu sheaves for maple_node_cache
mm, vma: use percpu sheaves for vm_area_struct cache
tools/testing: Add support for changes to slab for sheaves
slab: allow NUMA restricted allocations to use percpu sheaves
tools/testing/vma: Implement vm_refcnt reset
slab: skip percpu sheaves for remote object freeing
...
Core & protocols
----------------
- Improve drop account scalability on NUMA hosts for RAW and UDP sockets
and the backlog, almost doubling the Pps capacity under DoS.
- Optimize the UDP RX performance under stress, reducing contention,
revisiting the binary layout of the involved data structs and
implementing NUMA-aware locking. This improves UDP RX performance by
an additional 50%, even more under extreme conditions.
- Add support for PSP encryption of TCP connections; this mechanism has
some similarities with IPsec and TLS, but offers superior HW offloads
capabilities.
- Ongoing work to support Accurate ECN for TCP. AccECN allows more than
one congestion notification signal per RTT and is a building block for
Low Latency, Low Loss, and Scalable Throughput (L4S).
- Reorganize the TCP socket binary layout for data locality, reducing
the number of touched cachelines in the fastpath.
- Refactor skb deferral free to better scale on large multi-NUMA hosts,
this improves TCP and UDP RX performances significantly on such HW.
- Increase the default socket memory buffer limits from 256K to 4M to
better fit modern link speeds.
- Improve handling of setups with a large number of nexthop, making dump
operating scaling linearly and avoiding unneeded synchronize_rcu() on
delete.
- Improve bridge handling of VLAN FDB, storing a single entry per bridge
instead of one entry per port; this makes the dump order of magnitude
faster on large switches.
- Restore IP ID correctly for encapsulated packets at GSO segmentation
time, allowing GRO to merge packets in more scenarios.
- Improve netfilter matching performance on large sets.
- Improve MPTCP receive path performance by leveraging recently
introduced core infrastructure (skb deferral free) and adopting recent
TCP autotuning changes.
- Allow bridges to redirect to a backup port when the bridge port is
administratively down.
- Introduce MPTCP 'laminar' endpoint that con be used only once per
connection and simplify common MPTCP setups.
- Add RCU safety to dst->dev, closing a lot of possible races.
- A significant crypto library API for SCTP, MPTCP and IPv6 SR, reducing
code duplication.
- Supports pulling data from an skb frag into the linear area of an XDP
buffer.
Things we sprinkled into general kernel code
--------------------------------------------
- Generate netlink documentation from YAML using an integrated
YAML parser.
Driver API
----------
- Support using IPv6 Flow Label in Rx hash computation and RSS queue
selection.
- Introduce API for fetching the DMA device for a given queue, allowing
TCP zerocopy RX on more H/W setups.
- Make XDP helpers compatible with unreadable memory, allowing more
easily building DevMem-enabled drivers with a unified XDP/skbs
datapath.
- Add a new dedicated ethtool callback enabling drivers to provide the
number of RX rings directly, improving efficiency and clarity in RX
ring queries and RSS configuration.
- Introduce a burst period for the health reporter, allowing better
handling of multiple errors due to the same root cause.
- Support for DPLL phase offset exponential moving average, controlling
the average smoothing factor.
Device drivers
--------------
- Add a new Huawei driver for 3rd gen NIC (hinic3).
- Add a new SpacemiT driver for K1 ethernet MAC.
- Add a generic abstraction for shared memory communication devices
(dibps)
- Ethernet high-speed NICs:
- nVidia/Mellanox:
- Use multiple per-queue doorbell, to avoid MMIO contention issues
- support adjacent functions, allowing them to delegate their
SR-IOV VFs to sibling PFs
- support RSS for IPSec offload
- support exposing raw cycle counters in PTP and mlx5
- support for disabling host PFs.
- Intel (100G, ice, idpf):
- ice: support for SRIOV VFs over an Active-Active link aggregate
- ice: support for firmware logging via debugfs
- ice: support for Earliest TxTime First (ETF) hardware offload
- idpf: support basic XDP functionalities and XSk
- Broadcom (bnxt):
- support Hyper-V VF ID
- dynamic SRIOV resource allocations for RoCE
- Meta (fbnic):
- support queue API, zero-copy Rx and Tx
- support basic XDP functionalities
- devlink health support for FW crashes and OTP mem corruptions
- expand hardware stats coverage to FEC, PHY, and Pause
- Wangxun:
- support ethtool coalesce options
- support for multiple RSS contexts
- Ethernet virtual:
- Macsec:
- replace custom netlink attribute checks with policy-level checks
- Bonding:
- support aggregator selection based on port priority
- Microsoft vNIC:
- use page pool fragments for RX buffers instead of full pages to
improve memory efficiency
- Ethernet NICs consumer, and embedded:
- Qualcomm: support Ethernet function for IPQ9574 SoC
- Airoha: implement wlan offloading via NPU
- Freescale
- enetc: add NETC timer PTP driver and add PTP support
- fec: enable the Jumbo frame support for i.MX8QM
- Renesas (R-Car S4): support HW offloading for layer 2 switching
- support for RZ/{T2H, N2H} SoCs
- Cadence (macb): support TAPRIO traffic scheduling
- TI:
- support for Gigabit ICSS ethernet SoC (icssm-prueth)
- Synopsys (stmmac): a lot of cleanups
- Ethernet PHYs:
- Support 10g-qxgmi phy-mode for AQR412C, Felix DSA and Lynx PCS
driver
- Support bcm63268 GPHY power control
- Support for Micrel lan8842 PHY and PTP
- Support for Aquantia AQR412 and AQR115
- CAN:
- a large CAN-XL preparation work
- reorganize raw_sock and uniqframe struct to minimize memory usage
- rcar_canfd: update the CAN-FD handling
- WiFi:
- extended Neighbor Awareness Networking (NAN) support
- S1G channel representation cleanup
- improve S1G support
- WiFi drivers:
- Intel (iwlwifi):
- major refactor and cleanup
- Broadcom (brcm80211):
- support for AP isolation
- RealTek (rtw88/89) rtw88/89:
- preparation work for RTL8922DE support
- MediaTek (mt76):
- HW restart improvements
- MLO support
- Qualcomm/Atheros (ath10k_
- GTK rekey fixes
- Bluetooth drivers:
- btusb: support for several new IDs for MT7925
- btintel: support for BlazarIW core
- btintel_pcie: support for _suspend() / _resume()
- btintel_pcie: support for Scorpious, Panther Lake-H484 IDs
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmjdBdsSHHBhYmVuaUBy
ZWRoYXQuY29tAAoJECkkeY3MjxOkWnAP/2Jz0ExnYMDxLxkkKqMte+3JNeF/KNjB
zVIslYN/5ekkd1TMXyDLFlWHqUOUMSF9+cK5ZRMHG6/lzOueSzFuuuDFsWs6Kf2f
q7Q9MzlXhR9++YCsdES1uS5x3PPjMInzo2ZivMFa6fUDLLFSzeAOKL9k+RS0EggU
VlXv2Wkm73R0O6KAssgDsHke9cnNz+F0DzhQ0S3qkyZF9tS5NrDeUl7fZ47XZgwb
ZuXdEzKmTTepo2XvZGxJoF2D7nekWFlHhLjEPpDJtET19nwhukCry41/FplJwlKR
dWsYkqLkrOEQKFQ+q++/5c3BgFOG+IrQLbP5bGQF1tZRMl4Dqp9PDxK5fKJfccbS
0VY3Y2qWOSYejT2Ji2kEHR5E5rPyFm3Y5A4Rnpz0yEHj14vL2v4zf7CZRkCyyDfD
doIZXFGkM0+N7QeXLEN833cje9zjaXuP9GAE+2bb+wAWAZAqof4KX8JgHh+y5Rwm
pvUtvFxmEtntlMwNBap8aT3FquGtfZncU8pzAE5kvWjuMvyF0NVWiE5rB2kSQM/X
NLmUdvDyjwwJqthXAJG0fl+a0mNJ/kOAqSOKJDJrfKDGWa+ovwY0iY06SpK0wIbO
Wz7tpMk5MSlYXW8xWMlbyhvvU6T9xuoQ2KV4QTdMxc6Ir3sNX6YkQr+gjQjxB0gx
ST2QF6lZeWFh
=w2Kz
-----END PGP SIGNATURE-----
Merge tag 'net-next-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Paolo Abeni:
"Core & protocols:
- Improve drop account scalability on NUMA hosts for RAW and UDP
sockets and the backlog, almost doubling the Pps capacity under DoS
- Optimize the UDP RX performance under stress, reducing contention,
revisiting the binary layout of the involved data structs and
implementing NUMA-aware locking. This improves UDP RX performance
by an additional 50%, even more under extreme conditions
- Add support for PSP encryption of TCP connections; this mechanism
has some similarities with IPsec and TLS, but offers superior HW
offloads capabilities
- Ongoing work to support Accurate ECN for TCP. AccECN allows more
than one congestion notification signal per RTT and is a building
block for Low Latency, Low Loss, and Scalable Throughput (L4S)
- Reorganize the TCP socket binary layout for data locality, reducing
the number of touched cachelines in the fastpath
- Refactor skb deferral free to better scale on large multi-NUMA
hosts, this improves TCP and UDP RX performances significantly on
such HW
- Increase the default socket memory buffer limits from 256K to 4M to
better fit modern link speeds
- Improve handling of setups with a large number of nexthop, making
dump operating scaling linearly and avoiding unneeded
synchronize_rcu() on delete
- Improve bridge handling of VLAN FDB, storing a single entry per
bridge instead of one entry per port; this makes the dump order of
magnitude faster on large switches
- Restore IP ID correctly for encapsulated packets at GSO
segmentation time, allowing GRO to merge packets in more scenarios
- Improve netfilter matching performance on large sets
- Improve MPTCP receive path performance by leveraging recently
introduced core infrastructure (skb deferral free) and adopting
recent TCP autotuning changes
- Allow bridges to redirect to a backup port when the bridge port is
administratively down
- Introduce MPTCP 'laminar' endpoint that con be used only once per
connection and simplify common MPTCP setups
- Add RCU safety to dst->dev, closing a lot of possible races
- A significant crypto library API for SCTP, MPTCP and IPv6 SR,
reducing code duplication
- Supports pulling data from an skb frag into the linear area of an
XDP buffer
Things we sprinkled into general kernel code:
- Generate netlink documentation from YAML using an integrated YAML
parser
Driver API:
- Support using IPv6 Flow Label in Rx hash computation and RSS queue
selection
- Introduce API for fetching the DMA device for a given queue,
allowing TCP zerocopy RX on more H/W setups
- Make XDP helpers compatible with unreadable memory, allowing more
easily building DevMem-enabled drivers with a unified XDP/skbs
datapath
- Add a new dedicated ethtool callback enabling drivers to provide
the number of RX rings directly, improving efficiency and clarity
in RX ring queries and RSS configuration
- Introduce a burst period for the health reporter, allowing better
handling of multiple errors due to the same root cause
- Support for DPLL phase offset exponential moving average,
controlling the average smoothing factor
Device drivers:
- Add a new Huawei driver for 3rd gen NIC (hinic3)
- Add a new SpacemiT driver for K1 ethernet MAC
- Add a generic abstraction for shared memory communication
devices (dibps)
- Ethernet high-speed NICs:
- nVidia/Mellanox:
- Use multiple per-queue doorbell, to avoid MMIO contention
issues
- support adjacent functions, allowing them to delegate their
SR-IOV VFs to sibling PFs
- support RSS for IPSec offload
- support exposing raw cycle counters in PTP and mlx5
- support for disabling host PFs.
- Intel (100G, ice, idpf):
- ice: support for SRIOV VFs over an Active-Active link
aggregate
- ice: support for firmware logging via debugfs
- ice: support for Earliest TxTime First (ETF) hardware offload
- idpf: support basic XDP functionalities and XSk
- Broadcom (bnxt):
- support Hyper-V VF ID
- dynamic SRIOV resource allocations for RoCE
- Meta (fbnic):
- support queue API, zero-copy Rx and Tx
- support basic XDP functionalities
- devlink health support for FW crashes and OTP mem corruptions
- expand hardware stats coverage to FEC, PHY, and Pause
- Wangxun:
- support ethtool coalesce options
- support for multiple RSS contexts
- Ethernet virtual:
- Macsec:
- replace custom netlink attribute checks with policy-level
checks
- Bonding:
- support aggregator selection based on port priority
- Microsoft vNIC:
- use page pool fragments for RX buffers instead of full pages
to improve memory efficiency
- Ethernet NICs consumer, and embedded:
- Qualcomm: support Ethernet function for IPQ9574 SoC
- Airoha: implement wlan offloading via NPU
- Freescale
- enetc: add NETC timer PTP driver and add PTP support
- fec: enable the Jumbo frame support for i.MX8QM
- Renesas (R-Car S4):
- support HW offloading for layer 2 switching
- support for RZ/{T2H, N2H} SoCs
- Cadence (macb): support TAPRIO traffic scheduling
- TI:
- support for Gigabit ICSS ethernet SoC (icssm-prueth)
- Synopsys (stmmac): a lot of cleanups
- Ethernet PHYs:
- Support 10g-qxgmi phy-mode for AQR412C, Felix DSA and Lynx PCS
driver
- Support bcm63268 GPHY power control
- Support for Micrel lan8842 PHY and PTP
- Support for Aquantia AQR412 and AQR115
- CAN:
- a large CAN-XL preparation work
- reorganize raw_sock and uniqframe struct to minimize memory
usage
- rcar_canfd: update the CAN-FD handling
- WiFi:
- extended Neighbor Awareness Networking (NAN) support
- S1G channel representation cleanup
- improve S1G support
- WiFi drivers:
- Intel (iwlwifi):
- major refactor and cleanup
- Broadcom (brcm80211):
- support for AP isolation
- RealTek (rtw88/89) rtw88/89:
- preparation work for RTL8922DE support
- MediaTek (mt76):
- HW restart improvements
- MLO support
- Qualcomm/Atheros (ath10k):
- GTK rekey fixes
- Bluetooth drivers:
- btusb: support for several new IDs for MT7925
- btintel: support for BlazarIW core
- btintel_pcie: support for _suspend() / _resume()
- btintel_pcie: support for Scorpious, Panther Lake-H484 IDs"
* tag 'net-next-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1536 commits)
net: stmmac: Add support for Allwinner A523 GMAC200
dt-bindings: net: sun8i-emac: Add A523 GMAC200 compatible
Revert "Documentation: net: add flow control guide and document ethtool API"
octeontx2-pf: fix bitmap leak
octeontx2-vf: fix bitmap leak
net/mlx5e: Use extack in set rxfh callback
net/mlx5e: Introduce mlx5e_rss_params for RSS configuration
net/mlx5e: Introduce mlx5e_rss_init_params
net/mlx5e: Remove unused mdev param from RSS indir init
net/mlx5: Improve QoS error messages with actual depth values
net/mlx5e: Prevent entering switchdev mode with inconsistent netns
net/mlx5: HWS, Generalize complex matchers
net/mlx5: Improve write-combining test reliability for ARM64 Grace CPUs
selftests/net: add tcp_port_share to .gitignore
Revert "net/mlx5e: Update and set Xon/Xoff upon MTU set"
net: add NUMA awareness to skb_attempt_defer_free()
net: use llist for sd->defer_list
net: make softnet_data.defer_count an atomic
selftests: drv-net: psp: add tests for destroying devices
selftests: drv-net: psp: add test for auto-adjusting TCP MSS
...
Confidential computing:
- Add support for accepting secrets from firmware (e.g. ACPI CCEL)
and mapping them with appropriate attributes.
CPU features:
- Advertise atomic floating-point instructions to userspace.
- Extend Spectre workarounds to cover additional Arm CPU variants.
- Extend list of CPUs that support break-before-make level 2 and
guarantee not to generate TLB conflict aborts for changes of mapping
granularity (BBML2_NOABORT).
- Add GCS support to our uprobes implementation.
Documentation:
- Remove bogus SME documentation concerning register state when
entering/exiting streaming mode.
Entry code:
- Switch over to the generic IRQ entry code (GENERIC_IRQ_ENTRY).
- Micro-optimise syscall entry path with a compiler branch hint.
Memory management:
- Enable huge mappings in vmalloc space even when kernel page-table
dumping is enabled.
- Tidy up the types used in our early MMU setup code.
- Rework rodata= for closer parity with the behaviour on x86.
- For CPUs implementing BBML2_NOABORT, utilise block mappings in the
linear map even when rodata= applies to virtual aliases.
- Don't re-allocate the virtual region between '_text' and '_stext',
as doing so confused tools parsing /proc/vmcore.
Miscellaneous:
- Clean-up Kconfig menuconfig text for architecture features.
- Avoid redundant bitmap_empty() during determination of supported
SME vector lengths.
- Re-enable warnings when building the 32-bit vDSO object.
- Avoid breaking our eggs at the wrong end.
Perf and PMUs:
- Support for v3 of the Hisilicon L3C PMU.
- Support for Hisilicon's MN and NoC PMUs.
- Support for Fujitsu's Uncore PMU.
- Support for SPE's extended event filtering feature.
- Preparatory work to enable data source filtering in SPE.
- Support for multiple lanes in the DWC PCIe PMU.
- Support for i.MX94 in the IMX DDR PMU driver.
- MAINTAINERS update (Thank you, Yicong).
- Minor driver fixes (PERF_IDX2OFF() overflow, CMN register offsets).
Selftests:
- Add basic LSFE check to the existing hwcaps test.
- Support nolibc in GCS tests.
- Extend SVE ptrace test to pass unsupported regsets and invalid vector
lengths.
- Minor cleanups (typos, cosmetic changes).
System registers:
- Fix ID_PFR1_EL1 definition.
- Fix incorrect signedness of some fields in ID_AA64MMFR4_EL1.
- Sync TCR_EL1 definition with the latest Arm ARM (L.b).
- Be stricter about the input fed into our AWK sysreg generator script.
- Typo fixes and removal of redundant definitions.
ACPI, EFI and PSCI:
- Decouple Arm's "Software Delegated Exception Interface" (SDEI)
support from the ACPI GHES code so that it can be used by platforms
booted with device-tree.
- Remove unnecessary per-CPU tracking of the FPSIMD state across EFI
runtime calls.
- Fix a node refcount imbalance in the PSCI device-tree code.
CPU Features:
- Ensure register sanitisation is applied to fields in ID_AA64MMFR4.
- Expose AIDR_EL1 to userspace via sysfs, primarily so that KVM guests
can reliably query the underlying CPU types from the VMM.
- Re-enabling of SME support (CONFIG_ARM64_SME) as a result of fixes
to our context-switching, signal handling and ptrace code.
-----BEGIN PGP SIGNATURE-----
iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAmjWlMIQHHdpbGxAa2Vy
bmVsLm9yZwAKCRC3rHDchMFjNIVYB/sHaFYmToEaK4ldMfTn4ZQ8yr9ZuTMLr/ji
zOm7wULP08wKIcp6t+1N6e7A8Wb3YSErxTywc4b9MqctIel1WvBMewjJd38xb2qO
hFCuRuWemfyt6Rw/EGMCP54yueLzKbnN4q+/Aks8jedWtS+AXY6McGCJOzMjuECL
zKb68Hqqb09YQ2b2BPSAiU9g42rotiIYppLaLbEdxUnw/eqfEN3wSrl92/LuBlqH
r2wZDnJwA5q8Iy9HWlnhg4Jax0Cs86jSJPIQLB6v3WWhc3HR49pm5cqYzYkUht9L
nA6eaWJiBl/+k3S+ftPKNzU8n04r15lqyNZE2FYfRk+0BPSacJOf
=kO15
-----END PGP SIGNATURE-----
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Will Deacon:
"There's good stuff across the board, including some nice mm
improvements for CPUs with the 'noabort' BBML2 feature and a clever
patch to allow ptdump to play nicely with block mappings in the
vmalloc area.
Confidential computing:
- Add support for accepting secrets from firmware (e.g. ACPI CCEL)
and mapping them with appropriate attributes.
CPU features:
- Advertise atomic floating-point instructions to userspace
- Extend Spectre workarounds to cover additional Arm CPU variants
- Extend list of CPUs that support break-before-make level 2 and
guarantee not to generate TLB conflict aborts for changes of
mapping granularity (BBML2_NOABORT)
- Add GCS support to our uprobes implementation.
Documentation:
- Remove bogus SME documentation concerning register state when
entering/exiting streaming mode.
Entry code:
- Switch over to the generic IRQ entry code (GENERIC_IRQ_ENTRY)
- Micro-optimise syscall entry path with a compiler branch hint.
Memory management:
- Enable huge mappings in vmalloc space even when kernel page-table
dumping is enabled
- Tidy up the types used in our early MMU setup code
- Rework rodata= for closer parity with the behaviour on x86
- For CPUs implementing BBML2_NOABORT, utilise block mappings in the
linear map even when rodata= applies to virtual aliases
- Don't re-allocate the virtual region between '_text' and '_stext',
as doing so confused tools parsing /proc/vmcore.
Miscellaneous:
- Clean-up Kconfig menuconfig text for architecture features
- Avoid redundant bitmap_empty() during determination of supported
SME vector lengths
- Re-enable warnings when building the 32-bit vDSO object
- Avoid breaking our eggs at the wrong end.
Perf and PMUs:
- Support for v3 of the Hisilicon L3C PMU
- Support for Hisilicon's MN and NoC PMUs
- Support for Fujitsu's Uncore PMU
- Support for SPE's extended event filtering feature
- Preparatory work to enable data source filtering in SPE
- Support for multiple lanes in the DWC PCIe PMU
- Support for i.MX94 in the IMX DDR PMU driver
- MAINTAINERS update (Thank you, Yicong)
- Minor driver fixes (PERF_IDX2OFF() overflow, CMN register offsets).
Selftests:
- Add basic LSFE check to the existing hwcaps test
- Support nolibc in GCS tests
- Extend SVE ptrace test to pass unsupported regsets and invalid
vector lengths
- Minor cleanups (typos, cosmetic changes).
System registers:
- Fix ID_PFR1_EL1 definition
- Fix incorrect signedness of some fields in ID_AA64MMFR4_EL1
- Sync TCR_EL1 definition with the latest Arm ARM (L.b)
- Be stricter about the input fed into our AWK sysreg generator
script
- Typo fixes and removal of redundant definitions.
ACPI, EFI and PSCI:
- Decouple Arm's "Software Delegated Exception Interface" (SDEI)
support from the ACPI GHES code so that it can be used by platforms
booted with device-tree
- Remove unnecessary per-CPU tracking of the FPSIMD state across EFI
runtime calls
- Fix a node refcount imbalance in the PSCI device-tree code.
CPU Features:
- Ensure register sanitisation is applied to fields in ID_AA64MMFR4
- Expose AIDR_EL1 to userspace via sysfs, primarily so that KVM
guests can reliably query the underlying CPU types from the VMM
- Re-enabling of SME support (CONFIG_ARM64_SME) as a result of fixes
to our context-switching, signal handling and ptrace code"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (93 commits)
arm64: cpufeature: Remove duplicate asm/mmu.h header
arm64: Kconfig: Make CPU_BIG_ENDIAN depend on BROKEN
perf/dwc_pcie: Fix use of uninitialized variable
arm/syscalls: mark syscall invocation as likely in invoke_syscall
Documentation: hisi-pmu: Add introduction to HiSilicon V3 PMU
Documentation: hisi-pmu: Fix of minor format error
drivers/perf: hisi: Add support for L3C PMU v3
drivers/perf: hisi: Refactor the event configuration of L3C PMU
drivers/perf: hisi: Extend the field of tt_core
drivers/perf: hisi: Extract the event filter check of L3C PMU
drivers/perf: hisi: Simplify the probe process of each L3C PMU version
drivers/perf: hisi: Export hisi_uncore_pmu_isr()
drivers/perf: hisi: Relax the event ID check in the framework
perf: Fujitsu: Add the Uncore PMU driver
arm64: map [_text, _stext) virtual address range non-executable+read-only
arm64/sysreg: Update TCR_EL1 register
arm64: Enable vmalloc-huge with ptdump
arm64: cpufeature: add Neoverse-V3AE to BBML2 allow list
arm64: errata: Apply workarounds for Neoverse-V3AE
arm64: cputype: Add Neoverse-V3AE definitions
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaNZQigAKCRCRxhvAZXjc
ojqsAQCI9BOJV953LWPr9fmmFVz78wA4kTxkFFDAxwH/0iYtmQD8DORngPfJERI1
FwTY/gk2N744vUpbbcSM/KOnM3ss8wE=
=B3rD
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.18-rc1.writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs writeback updates from Christian Brauner:
"This contains work adressing lockups reported by users when a systemd
unit reading lots of files from a filesystem mounted with the lazytime
mount option exits.
With the lazytime mount option enabled we can be switching many dirty
inodes on cgroup exit to the parent cgroup. The numbers observed in
practice when systemd slice of a large cron job exits can easily reach
hundreds of thousands or millions.
The logic in inode_do_switch_wbs() which sorts the inode into
appropriate place in b_dirty list of the target wb however has linear
complexity in the number of dirty inodes thus overall time complexity
of switching all the inodes is quadratic leading to workers being
pegged for hours consuming 100% of the CPU and switching inodes to the
parent wb.
Simple reproducer of the issue:
FILES=10000
# Filesystem mounted with lazytime mount option
MNT=/mnt/
echo "Creating files and switching timestamps"
for (( j = 0; j < 50; j ++ )); do
mkdir $MNT/dir$j
for (( i = 0; i < $FILES; i++ )); do
echo "foo" >$MNT/dir$j/file$i
done
touch -a -t 202501010000 $MNT/dir$j/file*
done
wait
echo "Syncing and flushing"
sync
echo 3 >/proc/sys/vm/drop_caches
echo "Reading all files from a cgroup"
mkdir /sys/fs/cgroup/unified/mycg1 || exit
echo $$ >/sys/fs/cgroup/unified/mycg1/cgroup.procs || exit
for (( j = 0; j < 50; j ++ )); do
cat /mnt/dir$j/file* >/dev/null &
done
wait
echo "Switching wbs"
# Now rmdir the cgroup after the script exits
This can be solved by:
- Avoiding contention on the wb->list_lock when switching inodes by
running a single work item per wb and managing a queue of items
switching to the wb
- Allowing rescheduling when switching inodes over to a different
cgroup to avoid softlockups
- Maintaining the b_dirty list ordering instead of sorting it"
* tag 'vfs-6.18-rc1.writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
writeback: Add tracepoint to track pending inode switches
writeback: Avoid excessively long inode switching times
writeback: Avoid softlockup when switching many inodes
writeback: Avoid contention on wb->list_lock when switching inodes
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaNZQMQAKCRCRxhvAZXjc
omNLAQCgrwzd9sa1JTlixweu3OAxQlSEbLuMpEv7Ztm+B7Wz0AD9HtwPC44Kev03
GbMcB2DCFLC4evqYECj6IG7NBmoKsAs=
=1ICf
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.18-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"This contains the usual selections of misc updates for this cycle.
Features:
- Add "initramfs_options" parameter to set initramfs mount options.
This allows to add specific mount options to the rootfs to e.g.,
limit the memory size
- Add RWF_NOSIGNAL flag for pwritev2()
Add RWF_NOSIGNAL flag for pwritev2. This flag prevents the SIGPIPE
signal from being raised when writing on disconnected pipes or
sockets. The flag is handled directly by the pipe filesystem and
converted to the existing MSG_NOSIGNAL flag for sockets
- Allow to pass pid namespace as procfs mount option
Ever since the introduction of pid namespaces, procfs has had very
implicit behaviour surrounding them (the pidns used by a procfs
mount is auto-selected based on the mounting process's active
pidns, and the pidns itself is basically hidden once the mount has
been constructed)
This implicit behaviour has historically meant that userspace was
required to do some special dances in order to configure the pidns
of a procfs mount as desired. Examples include:
* In order to bypass the mnt_too_revealing() check, Kubernetes
creates a procfs mount from an empty pidns so that user
namespaced containers can be nested (without this, the nested
containers would fail to mount procfs)
But this requires forking off a helper process because you cannot
just one-shot this using mount(2)
* Container runtimes in general need to fork into a container
before configuring its mounts, which can lead to security issues
in the case of shared-pidns containers (a privileged process in
the pidns can interact with your container runtime process)
While SUID_DUMP_DISABLE and user namespaces make this less of an
issue, the strict need for this due to a minor uAPI wart is kind
of unfortunate
Things would be much easier if there was a way for userspace to
just specify the pidns they want. So this pull request contains
changes to implement a new "pidns" argument which can be set
using fsconfig(2):
fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd);
fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0);
or classic mount(2) / mount(8):
// mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc
mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid");
Cleanups:
- Remove the last references to EXPORT_OP_ASYNC_LOCK
- Make file_remove_privs_flags() static
- Remove redundant __GFP_NOWARN when GFP_NOWAIT is used
- Use try_cmpxchg() in start_dir_add()
- Use try_cmpxchg() in sb_init_done_wq()
- Replace offsetof() with struct_size() in ioctl_file_dedupe_range()
- Remove vfs_ioctl() export
- Replace rwlock() with spinlock in epoll code as rwlock causes
priority inversion on preempt rt kernels
- Make ns_entries in fs/proc/namespaces const
- Use a switch() statement() in init_special_inode() just like we do
in may_open()
- Use struct_size() in dir_add() in the initramfs code
- Use str_plural() in rd_load_image()
- Replace strcpy() with strscpy() in find_link()
- Rename generic_delete_inode() to inode_just_drop() and
generic_drop_inode() to inode_generic_drop()
- Remove unused arguments from fcntl_{g,s}et_rw_hint()
Fixes:
- Document @name parameter for name_contains_dotdot() helper
- Fix spelling mistake
- Always return zero from replace_fd() instead of the file descriptor
number
- Limit the size for copy_file_range() in compat mode to prevent a
signed overflow
- Fix debugfs mount options not being applied
- Verify the inode mode when loading it from disk in minixfs
- Verify the inode mode when loading it from disk in cramfs
- Don't trigger automounts with RESOLVE_NO_XDEV
If openat2() was called with RESOLVE_NO_XDEV it didn't traverse
through automounts, but could still trigger them
- Add FL_RECLAIM flag to show_fl_flags() macro so it appears in
tracepoints
- Fix unused variable warning in rd_load_image() on s390
- Make INITRAMFS_PRESERVE_MTIME depend on BLK_DEV_INITRD
- Use ns_capable_noaudit() when determining net sysctl permissions
- Don't call path_put() under namespace semaphore in listmount() and
statmount()"
* tag 'vfs-6.18-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (38 commits)
fcntl: trim arguments
listmount: don't call path_put() under namespace semaphore
statmount: don't call path_put() under namespace semaphore
pid: use ns_capable_noaudit() when determining net sysctl permissions
fs: rename generic_delete_inode() and generic_drop_inode()
init: INITRAMFS_PRESERVE_MTIME should depend on BLK_DEV_INITRD
initramfs: Replace strcpy() with strscpy() in find_link()
initrd: Use str_plural() in rd_load_image()
initramfs: Use struct_size() helper to improve dir_add()
initrd: Fix unused variable warning in rd_load_image() on s390
fs: use the switch statement in init_special_inode()
fs/proc/namespaces: make ns_entries const
filelock: add FL_RECLAIM to show_fl_flags() macro
eventpoll: Replace rwlock with spinlock
selftests/proc: add tests for new pidns APIs
procfs: add "pidns" mount option
pidns: move is-ancestor logic to helper
openat2: don't trigger automounts with RESOLVE_NO_XDEV
namei: move cross-device check to __traverse_mounts
namei: remove LOOKUP_NO_XDEV check from handle_mounts
...
From the cover letter [1]:
This patch set introduces kmalloc_nolock() which is the next logical
step towards any context allocation necessary to remove bpf_mem_alloc
and get rid of preallocation requirement in BPF infrastructure.
In production BPF maps grew to gigabytes in size. Preallocation wastes
memory. Alloc from any context addresses this issue for BPF and other
subsystems that are forced to preallocate too.
This long task started with introduction of alloc_pages_nolock(), then
memcg and objcg were converted to operate from any context including
NMI, this set completes the task with kmalloc_nolock() that builds on
top of alloc_pages_nolock() and memcg changes.
After that BPF subsystem will gradually adopt it everywhere.
Link: https://lore.kernel.org/all/20250909010007.1660-1-alexei.starovoitov@gmail.com/ [1]
This series adds an opt-in percpu array-based caching layer to SLUB.
It has evolved to a state where kmem caches with sheaves are compatible
with all SLUB features (slub_debug, SLUB_TINY, NUMA locality
considerations). The plan is therefore that it will be later enabled for
all kmem caches and replace the complicated cpu (partial) slabs code.
Note the name "sheaf" was invented by Matthew Wilcox so we don't call
the arrays magazines like the original Bonwick paper. The per-NUMA-node
cache of sheaves is thus called "barn".
This caching may seem similar to the arrays we had in SLAB, but there
are some important differences:
- deals differently with NUMA locality of freed objects, thus there are
no per-node "shared" arrays (with possible lock contention) and no
"alien" arrays that would need periodical flushing
- instead, freeing remote objects (which is rare) bypasses the sheaves
- percpu sheaves thus contain only local objects (modulo rare races
and local node exhaustion)
- NUMA restricted allocations and strict_numa mode is still honoured
- improves kfree_rcu() handling by reusing whole sheaves
- there is an API for obtaining a preallocated sheaf that can be used
for guaranteed and efficient allocations in a restricted context, when
the upper bound for needed objects is known but rarely reached
- opt-in, not used for every cache (for now)
The motivation comes mainly from the ongoing work related to VMA locking
scalability and the related maple tree operations. This is why VMA and
maple nodes caches are sheaf-enabled in the patchset.
A sheaf-enabled cache has the following expected advantages:
- Cheaper fast paths. For allocations, instead of local double cmpxchg,
thanks to local_trylock() it becomes a preempt_disable() and no atomic
operations. Same for freeing, which is otherwise a local double cmpxchg
only for short term allocations (so the same slab is still active on the
same cpu when freeing the object) and a more costly locked double
cmpxchg otherwise.
- kfree_rcu() batching and recycling. kfree_rcu() will put objects to a
separate percpu sheaf and only submit the whole sheaf to call_rcu()
when full. After the grace period, the sheaf can be used for
allocations, which is more efficient than freeing and reallocating
individual slab objects (even with the batching done by kfree_rcu()
implementation itself). In case only some cpus are allowed to handle rcu
callbacks, the sheaf can still be made available to other cpus on the
same node via the shared barn. The maple_node cache uses kfree_rcu() and
thus can benefit from this.
Note: this path is currently limited to !PREEMPT_RT
- Preallocation support. A prefilled sheaf can be privately borrowed to
perform a short term operation that is not allowed to block in the
middle and may need to allocate some objects. If an upper bound (worst
case) for the number of allocations is known, but only much fewer
allocations actually needed on average, borrowing and returning a sheaf
is much more efficient then a bulk allocation for the worst case
followed by a bulk free of the many unused objects. Maple tree write
operations should benefit from this.
- Compatibility with slub_debug. When slub_debug is enabled for a cache,
we simply don't create the percpu sheaves so that the debugging hooks
(at the node partial list slowpaths) are reached as before. The same
thing is done for CONFIG_SLUB_TINY. Sheaf preallocation still works by
reusing the (ineffective) paths for requests exceeding the cache's
sheaf_capacity. This is in line with the existing approach where
debugging bypasses the fast paths and SLUB_TINY preferes memory
savings over performance.
The above is adapted from the cover letter [1], which contains also
in-kernel microbenchmark results showing the lower overhead of sheaves.
Results from Suren Baghdasaryan [2] using a mmap/munmap microbenchmark
also show improvements.
Results from Sudarsan Mahendran [3] using will-it-scale show both
benefits and regressions, probably due to overall noisiness of those
tests.
Link: https://lore.kernel.org/all/20250910-slub-percpu-caches-v8-0-ca3099d8352c@suse.cz/ [1]
Link: https://lore.kernel.org/all/CAJuCfpEQ%3DRUgcAvRzE5jRrhhFpkm8E2PpBK9e9GhK26ZaJQt%3DQ@mail.gmail.com/ [2]
Link: https://lore.kernel.org/all/20250913000935.1021068-1-sudarsanm@google.com/ [3]
kmalloc_nolock() relies on ability of local_trylock_t to detect
the situation when per-cpu kmem_cache is locked.
In !PREEMPT_RT local_(try)lock_irqsave(&s->cpu_slab->lock, flags)
disables IRQs and marks s->cpu_slab->lock as acquired.
local_lock_is_locked(&s->cpu_slab->lock) returns true when
slab is in the middle of manipulating per-cpu cache
of that specific kmem_cache.
kmalloc_nolock() can be called from any context and can re-enter
into ___slab_alloc():
kmalloc() -> ___slab_alloc(cache_A) -> irqsave -> NMI -> bpf ->
kmalloc_nolock() -> ___slab_alloc(cache_B)
or
kmalloc() -> ___slab_alloc(cache_A) -> irqsave -> tracepoint/kprobe -> bpf ->
kmalloc_nolock() -> ___slab_alloc(cache_B)
Hence the caller of ___slab_alloc() checks if &s->cpu_slab->lock
can be acquired without a deadlock before invoking the function.
If that specific per-cpu kmem_cache is busy the kmalloc_nolock()
retries in a different kmalloc bucket. The second attempt will
likely succeed, since this cpu locked different kmem_cache.
Similarly, in PREEMPT_RT local_lock_is_locked() returns true when
per-cpu rt_spin_lock is locked by current _task_. In this case
re-entrance into the same kmalloc bucket is unsafe, and
kmalloc_nolock() tries a different bucket that is most likely is
not locked by the current task. Though it may be locked by a
different task it's safe to rt_spin_lock() and sleep on it.
Similar to alloc_pages_nolock() the kmalloc_nolock() returns NULL
immediately if called from hard irq or NMI in PREEMPT_RT.
kfree_nolock() defers freeing to irq_work when local_lock_is_locked()
and (in_nmi() or in PREEMPT_RT).
SLUB_TINY config doesn't use local_lock_is_locked() and relies on
spin_trylock_irqsave(&n->list_lock) to allocate,
while kfree_nolock() always defers to irq_work.
Note, kfree_nolock() must be called _only_ for objects allocated
with kmalloc_nolock(). Debug checks (like kmemleak and kfence)
were skipped on allocation, hence obj = kmalloc(); kfree_nolock(obj);
will miss kmemleak/kfence book keeping and will cause false positives.
large_kmalloc is not supported by either kmalloc_nolock()
or kfree_nolock().
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Since the combination of valid upper bits in slab->obj_exts with
OBJEXTS_ALLOC_FAIL bit can never happen,
use OBJEXTS_ALLOC_FAIL == (1ull << 0) as a magic sentinel
instead of (1ull << 2) to free up bit 2.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
kmalloc_nolock() can be called from any context
the ___slab_alloc() can acquire local_trylock_t (which is rt_spin_lock
in PREEMPT_RT) and attempt to acquire a different local_trylock_t
while in the same task context.
The calling sequence might look like:
kmalloc() -> tracepoint -> bpf -> kmalloc_nolock()
or more precisely:
__lock_acquire+0x12ad/0x2590
lock_acquire+0x133/0x2d0
rt_spin_lock+0x6f/0x250
___slab_alloc+0xb7/0xec0
kmalloc_nolock_noprof+0x15a/0x430
my_debug_callback+0x20e/0x390 [testmod]
___slab_alloc+0x256/0xec0
__kmalloc_cache_noprof+0xd6/0x3b0
Make LOCKDEP understand that local_trylock_t-s protect
different kmem_caches. In order to do that add lock_class_key
for each kmem_cache and use that key in local_trylock_t.
This stack trace is possible on both PREEMPT_RT and !PREEMPT_RT,
but teach lockdep about it only for PREEMPT_RT, since
in !PREEMPT_RT the ___slab_alloc() code is using
local_trylock_irqsave() when lockdep is on.
Note, this patch applies this logic to local_lock_t
while the next one converts it to local_trylock_t.
Both are mapped to rt_spin_lock in PREEMPT_RT.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Split alloc_pages_nolock() and introduce alloc_frozen_pages_nolock()
to be used by alloc_slab_page().
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Change alloc_pages_nolock() to default to __GFP_COMP when allocating
pages, since upcoming reentrant alloc_slab_page() needs __GFP_COMP.
Also allow __GFP_ACCOUNT flag to be specified,
since most of BPF infra needs __GFP_ACCOUNT except BPF streams.
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Create the vm_area_struct cache with percpu sheaves of size 32 to
improve its performance.
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Currently allocations asking for a specific node explicitly or via
mempolicy in strict_numa node bypass percpu sheaves. Since sheaves
contain mostly local objects, we can try allocating from them if the
local node happens to be the requested node or allowed by the mempolicy.
If we find the object from percpu sheaves is not from the expected node,
we skip the sheaves - this should be rare.
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Since we don't control the NUMA locality of objects in percpu sheaves,
allocations with node restrictions bypass them. Allocations without
restrictions may however still expect to get local objects with high
probability, and the introduction of sheaves can decrease it due to
freed object from a remote node ending up in percpu sheaves.
The fraction of such remote frees seems low (5% on an 8-node machine)
but it can be expected that some cache or workload specific corner cases
exist. We can either conclude that this is not a problem due to the low
fraction, or we can make remote frees bypass percpu sheaves and go
directly to their slabs. This will make the remote frees more expensive,
but if it's only a small fraction, most frees will still benefit from
the lower overhead of percpu sheaves.
This patch thus makes remote object freeing bypass percpu sheaves,
including bulk freeing, and kfree_rcu() via the rcu_free sheaf. However
it's not intended to be 100% guarantee that percpu sheaves will only
contain local objects. The refill from slabs does not provide that
guarantee in the first place, and there might be cpu migrations
happening when we need to unlock the local_lock. Avoiding all that could
be possible but complicated so we can leave it for later investigation
whether it would be worth it. It can be expected that the more selective
freeing will itself prevent accumulation of remote objects in percpu
sheaves so any such violations would have only short-term effects.
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
The possibility of many barn operations is determined by the current
number of full or empty sheaves. Taking the barn->lock just to find out
that e.g. there are no empty sheaves results in unnecessary overhead and
lock contention. Thus perform these checks outside of the lock with a
data_race() annotated variable read and fail quickly without taking the
lock.
Checks for sheaf availability that racily succeed have to be obviously
repeated under the lock for correctness, but we can skip repeating
checks if there are too many sheaves on the given list as the limits
don't need to be strict.
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Add functions for efficient guaranteed allocations e.g. in a critical
section that cannot sleep, when the exact number of allocations is not
known beforehand, but an upper limit can be calculated.
kmem_cache_prefill_sheaf() returns a sheaf containing at least given
number of objects.
kmem_cache_alloc_from_sheaf() will allocate an object from the sheaf
and is guaranteed not to fail until depleted.
kmem_cache_return_sheaf() is for giving the sheaf back to the slab
allocator after the critical section. This will also attempt to refill
it to cache's sheaf capacity for better efficiency of sheaves handling,
but it's not stricly necessary to succeed.
kmem_cache_refill_sheaf() can be used to refill a previously obtained
sheaf to requested size. If the current size is sufficient, it does
nothing. If the requested size exceeds cache's sheaf_capacity and the
sheaf's current capacity, the sheaf will be replaced with a new one,
hence the indirect pointer parameter.
kmem_cache_sheaf_size() can be used to query the current size.
The implementation supports requesting sizes that exceed cache's
sheaf_capacity, but it is not efficient - such "oversize" sheaves are
allocated fresh in kmem_cache_prefill_sheaf() and flushed and freed
immediately by kmem_cache_return_sheaf(). kmem_cache_refill_sheaf()
might be especially ineffective when replacing a sheaf with a new one of
a larger capacity. It is therefore better to size cache's
sheaf_capacity accordingly to make oversize sheaves exceptional.
CONFIG_SLUB_STATS counters are added for sheaf prefill and return
operations. A prefill or return is considered _fast when it is able to
grab or return a percpu spare sheaf (even if the sheaf needs a refill to
satisfy the request, as those should amortize over time), and _slow
otherwise (when the barn or even sheaf allocation/freeing has to be
involved). sheaf_prefill_oversize is provided to determine how many
prefills were oversize (counter for oversize returns is not necessary as
all oversize refills result in oversize returns).
When slub_debug is enabled for a cache with sheaves, no percpu sheaves
exist for it, but the prefill functionality is still provided simply by
all prefilled sheaves becoming oversize. If percpu sheaves are not
created for a cache due to not passing the sheaf_capacity argument on
cache creation, the prefills also work through oversize sheaves, but
there's a WARN_ON_ONCE() to indicate the omission.
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Extend the sheaf infrastructure for more efficient kfree_rcu() handling.
For caches with sheaves, on each cpu maintain a rcu_free sheaf in
addition to main and spare sheaves.
kfree_rcu() operations will try to put objects on this sheaf. Once full,
the sheaf is detached and submitted to call_rcu() with a handler that
will try to put it in the barn, or flush to slab pages using bulk free,
when the barn is full. Then a new empty sheaf must be obtained to put
more objects there.
It's possible that no free sheaves are available to use for a new
rcu_free sheaf, and the allocation in kfree_rcu() context can only use
GFP_NOWAIT and thus may fail. In that case, fall back to the existing
kfree_rcu() implementation.
Expected advantages:
- batching the kfree_rcu() operations, that could eventually replace the
existing batching
- sheaves can be reused for allocations via barn instead of being
flushed to slabs, which is more efficient
- this includes cases where only some cpus are allowed to process rcu
callbacks (CONFIG_RCU_NOCB_CPU)
Possible disadvantage:
- objects might be waiting for more than their grace period (it is
determined by the last object freed into the sheaf), increasing memory
usage - but the existing batching does that too.
Only implement this for CONFIG_KVFREE_RCU_BATCHED as the tiny
implementation favors smaller memory footprint over performance.
Also for now skip the usage of rcu sheaf for CONFIG_PREEMPT_RT as the
contexts where kfree_rcu() is called might not be compatible with taking
a barn spinlock or a GFP_NOWAIT allocation of a new sheaf taking a
spinlock - the current kfree_rcu() implementation avoids doing that.
Teach kvfree_rcu_barrier() to flush all rcu_free sheaves from all caches
that have them. This is not a cheap operation, but the barrier usage is
rare - currently kmem_cache_destroy() or on module unload.
Add CONFIG_SLUB_STATS counters free_rcu_sheaf and free_rcu_sheaf_fail to
count how many kfree_rcu() used the rcu_free sheaf successfully and how
many had to fall back to the existing implementation.
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
It is possible to hit a zero entry while traversing the vmas in unuse_mm()
called from swapoff path and accessing it causes the OOPS:
Unable to handle kernel NULL pointer dereference at virtual address
0000000000000446--> Loading the memory from offset 0x40 on the
XA_ZERO_ENTRY as address.
Mem abort info:
ESR = 0x0000000096000005
EC = 0x25: DABT (current EL), IL = 32 bits
SET = 0, FnV = 0
EA = 0, S1PTW = 0
FSC = 0x05: level 1 translation fault
The issue is manifested from the below race between the fork() on a
process and swapoff:
fork(dup_mmap()) swapoff(unuse_mm)
--------------- -----------------
1) Identical mtree is built using
__mt_dup().
2) copy_pte_range()-->
copy_nonpresent_pte():
The dst mm is added into the
mmlist to be visible to the
swapoff operation.
3) Fatal signal is sent to the parent
process(which is the current during the
fork) thus skip the duplication of the
vmas and mark the vma range with
XA_ZERO_ENTRY as a marker for this process
that helps during exit_mmap().
4) swapoff is tried on the
'mm' added to the 'mmlist' as
part of the 2.
5) unuse_mm(), that iterates
through the vma's of this 'mm'
will hit the non-NULL zero entry
and operating on this zero entry
as a vma is resulting into the
oops.
The proper fix would be around not exposing this partially-valid tree to
others when droping the mmap lock, which is being solved with [1]. A
simpler solution would be checking for MMF_UNSTABLE, as it is set if
mm_struct is not fully initialized in dup_mmap().
Thanks to Liam/Lorenzo/David for all the suggestions in fixing this
issue.
Link: https://lkml.kernel.org/r/20250924181138.1762750-1-charan.kalla@oss.qualcomm.com
Link: https://lore.kernel.org/all/20250815191031.3769540-1-Liam.Howlett@oracle.com/ [1]
Fixes: d240629148 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()")
Signed-off-by: Charan Teja Kalla <charan.kalla@oss.qualcomm.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chris Li <chrisl@kernel.org>
Cc: Kairui Song <kasong@tencent.com>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When collapsing a pmd, there are two address in use:
* address points to the start of pmd
* address points to each individual page
Current naming makes it difficult to distinguish these two and is hence
error prone.
Considering the plan to collapse mTHP, name the first one `start_addr' and
the second `addr' for better readability and consistency.
Link: https://lkml.kernel.org/r/20250922140938.27343-1-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Nico Pache <npache@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
There is a boot failure when both CONFIG_DEBUG_KMEMLEAK and
CONFIG_MEM_ALLOC_PROFILING are enabled.
BUG: kernel NULL pointer dereference, address: 0000000000000000
RIP: 0010:__alloc_tagging_slab_alloc_hook+0x181/0x2f0
Call Trace:
kmem_cache_alloc_noprof+0x1c8/0x5c0
__alloc_object+0x2f/0x290
__create_object+0x22/0x80
kmemleak_init+0x122/0x190
mm_core_init+0xb6/0x160
start_kernel+0x39f/0x920
x86_64_start_reservations+0x18/0x30
x86_64_start_kernel+0x104/0x120
common_startup_64+0x12c/0x138
In kmemleak, mem_pool_alloc() directly calls kmem_cache_alloc_noprof(), as
a result, current->alloc_tag is NULL, leading to a null pointer
dereference.
Move the checks for SLAB_NO_OBJ_EXT, SLAB_NOLEAKTRACE, and
__GFP_NO_OBJ_EXT to the parent function __alloc_tagging_slab_alloc_hook()
to fix this.
Also this distinguishes the SLAB_NOLEAKTRACE case between the actual
memory allocation failures case, make CODETAG_FLAG_INACCURATE more
accurate.
Link: https://lkml.kernel.org/r/20250926080659.741991-1-ranxiaokai627@163.com
Fixes: b9e2f58ffb ("alloc_tag: mark inaccurate allocation counters in /proc/allocinfo output")
Signed-off-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christoph Lameter (Ampere) <cl@gentwo.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Usama Arif <usamaarif642@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We added that "select MEMORY_ISOLATION" in commit ee6f509c32 ("mm:
factor out memory isolate functions"). However, in commit add05cecef
("mm: soft-offline: don't free target page in successful page migration")
we remove the need for it, where we removed the calls to
set_migratetype_isolate() etc.
What CONFIG_MEMORY_FAILURE soft-offline support wants is migrate_pages()
support. But that comes with CONFIG_MIGRATION. And
isolate_folio_to_list() has nothing to do with CONFIG_MEMORY_ISOLATION.
Therefore, we can remove "select MEMORY_ISOLATION" of MEMORY_FAILURE.
Link: https://lkml.kernel.org/r/20250922143618.48640-1-xieyuanbin1@huawei.com
Signed-off-by: Xie Yuanbin <xieyuanbin1@huawei.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Current code is not correct to get struct khugepaged_mm_slot by
mm_slot_entry() without checking mm_slot is !NULL. There is no problem
reported since slot is the first element of struct khugepaged_mm_slot.
While struct khugepaged_mm_slot is just a wrapper of struct mm_slot, there
is no need to define it.
Remove the definition of struct khugepaged_mm_slot, so there is not chance
to miss use mm_slot_entry().
[richard.weiyang@gmail.com: fix use-after-free crash]
Link: https://lkml.kernel.org/r/20250922002834.vz6ntj36e75ehkyp@master
Link: https://lkml.kernel.org/r/20250919071244.17020-3-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Kiryl Shutsemau <kirill@shutemov.name>
Cc: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm_slot: fix the usage of mm_slot_entry", v2.
When using mm_slot in ksm, there is code like:
slot = mm_slot_lookup(mm_slots_hash, mm);
mm_slot = mm_slot_entry(slot, struct ksm_mm_slot, slot);
if (mm_slot && ..) {
}
The mm_slot_entry() won't return a valid value if slot is NULL generally.
But currently it works since slot is the first element of struct
ksm_mm_slot.
To reduce the ambiguity and make it robust, access mm_slot_entry() when
slot is !NULL.
Link: https://lkml.kernel.org/r/20250919071244.17020-1-richard.weiyang@gmail.com
Link: https://lkml.kernel.org/r/20250919071244.17020-2-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Cc: Kiryl Shutsemau <kirill@shutemov.name>
Cc: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Commit 79359d6d24 ("hugetlb: perform vmemmap optimization on a list of
pages") batches the submission of HugeTLB vmemmap optimization (HVO)
during hugepage reservation. With HVO enabled, hugepages obtained from
the buddy allocator are not submitted for optimization and their
struct-page memory is therefore not released—until the entire
reservation request has been satisfied. As a result, any struct-page
memory freed in the course of the allocation cannot be reused for the
ongoing reservation, artificially limiting the number of huge pages that
can ultimately be provided.
As commit b1222550fb ("mm/hugetlb: do pre-HVO for bootmem allocated
pages") already applies early HVO to bootmem-allocated huge pages, this
patch extends the same benefit to non-bootmem pages by incrementally
submitting them for HVO as they are allocated, thereby returning
struct-page memory to the buddy allocator in real time. The change raises
the maximum 2 MiB hugepage reservation from just under 376 GB to more than
381 GB on a 384 GB x86 VM.
Link: https://lkml.kernel.org/r/20250919092353.41671-1-lizhe.67@bytedance.com
Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When using vmalloc with VM_ALLOW_HUGE_VMAP flag, it will set the alignment
to PMD_SIZE internally, if it deems huge mappings to be eligible.
Therefore, setting the alignment in execmem_vmalloc is redundant. Apart
from this, it also reduces the probability of allocation in case vmalloc
fails to allocate hugepages - in the fallback case, vmalloc tries to use
the original alignment and allocate basepages, which unfortunately will
again be PMD_SIZE passed over from execmem_vmalloc, thus constraining the
search for a free space in vmalloc region.
Therefore, remove this constraint.
Link: https://lkml.kernel.org/r/20250918093453.75676-1-dev.jain@arm.com
Signed-off-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The kernel currently does not mlock large folios when adding them to rmap,
stating that it is difficult to confirm that the folio is fully mapped and
safe to mlock it.
This leads to a significant undercount of Mlocked in /proc/meminfo,
causing problems in production where the stat was used to estimate system
utilization and determine if load shedding is required.
However, nowadays the caller passes a number of pages of the folio that
are getting mapped, making it easy to check if the entire folio is mapped
to the VMA.
mlock the folio on rmap if it is fully mapped to the VMA.
Mlocked in /proc/meminfo can still undercount, but the value is closer the
truth and is useful for userspace.
Link: https://lkml.kernel.org/r/20250923110711.690639-7-kirill@shutemov.name
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently, kernel only maps part of large folio that fits into
start_pgoff/end_pgoff range.
Map entire folio where possible. It will match finish_fault() behaviour
that user hits on cold page cache.
Mapping large folios at once will allow the rmap code to mlock it on add,
as it will recognize that it is fully mapped and mlocking is safe.
Link: https://lkml.kernel.org/r/20250923110711.690639-6-kirill@shutemov.name
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
finish_fault() uses per-page fault for file folios. This only occurs for
file folios smaller than PMD_SIZE.
The comment suggests that this approach prevents RSS inflation. However,
it only prevents RSS accounting. The folio is still mapped to the
process, and the fact that it is mapped by a single PTE does not affect
memory pressure. Additionally, the kernel's ability to map large folios
as PMD if they are large enough does not support this argument.
When possible, map large folios in one shot. This reduces the number of
minor page faults and allows for TLB coalescing.
Mapping large folios at once will allow the rmap code to mlock it on add,
as it will recognize that it is fully mapped and mlocking is safe.
Link: https://lkml.kernel.org/r/20250923110711.690639-5-kirill@shutemov.name
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently, try_to_unmap_once() only tries to mlock small folios.
Use logic similar to folio_referenced_one() to mlock large folios: only do
this for fully mapped folios and under page table lock that protects all
page table entries.
[akpm@linux-foundation.org: s/CROSSSED/CROSSED/]
Link: https://lkml.kernel.org/r/20250923110711.690639-4-kirill@shutemov.name
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The mlock_vma_folio() function requires the page table lock to be held in
order to safely mlock the folio. However, folio_referenced_one() mlocks a
large folios outside of the page_vma_mapped_walk() loop where the page
table lock has already been dropped.
Rework the mlock logic to use the same code path inside the loop for both
large and small folios.
Use PVMW_PGTABLE_CROSSED to detect when the folio is mapped across a page
table boundary.
[akpm@linux-foundation.org: s/CROSSSED/CROSSED/]
Link: https://lkml.kernel.org/r/20250923110711.690639-3-kirill@shutemov.name
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm: Improve mlock tracking for large folios", v3.
The patchset includes several fixes and improvements related to mlock
tracking of large folios.
The main objective is to reduce the undercount of Mlocked memory in
/proc/meminfo and improve the accuracy of the statistics.
Patches 1-2:
These patches address a minor race condition in folio_referenced_one()
related to mlock_vma_folio().
Currently, mlock_vma_folio() is called on large folio without the page
table lock, which can result in a race condition with unmap (i.e.
MADV_DONTNEED). This can lead to partially mapped folios on the
unevictable LRU list.
While not a significant issue, I do not believe backporting is necessary.
Patch 3:
This patch adds mlocking logic similar to folio_referenced_one() to
try_to_unmap_one(), allowing for mlocking of large folios where possible.
Patch 4-5:
These patches modifies finish_fault() and faultaround to map in the entire
folio when possible, enabling efficient mlocking upon addition to the
rmap.
Patch 6:
This patch makes rmap mlock large folios if they are fully mapped,
addressing the primary source of mlock undercount for large folios.
This patch (of 6):
Add a PVMW_PGTABLE_CROSSSED flag that page_vma_mapped_walk() will set if
the page is mapped across page table boundary. Unlike other PVMW_* flags,
this one is result of page_vma_mapped_walk() and not set by the caller.
folio_referenced_one() will use it to detect if it safe to mlock the
folio.
[akpm@linux-foundation.org: s/CROSSSED/CROSSED/]
Link: https://lkml.kernel.org/r/20250923110711.690639-1-kirill@shutemov.name
Link: https://lkml.kernel.org/r/20250923110711.690639-2-kirill@shutemov.name
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Commit 56ae0bb349 ("mm: compaction: convert to use a folio in
isolate_migratepages_block()") converts api from page to folio. But the
low_pfn advance for hugetlb page seems wrong when low_pfn doesn't point to
head page.
Originally, if page is a hugetlb tail page, compound_nr() return 1, which
means low_pfn only advance one in next iteration. After the change,
low_pfn would advance more than the hugetlb range, since folio_nr_pages()
always return total number of the large page. This results in skipping
some range to isolate and then to migrate.
The worst case for alloc_contig is it does all the isolation and
migration, but finally find some range is still not isolated. And then
undo all the work and try a new range.
Advance low_pfn to the end of hugetlb.
Link: https://lkml.kernel.org/r/20250910092240.3981-1-richard.weiyang@gmail.com
Fixes: 56ae0bb349 ("mm: compaction: convert to use a folio in isolate_migratepages_block()")
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Cc: "Vishal Moola (Oracle)" <vishal.moola@gmail.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
aren't considered necessary for -stable kernels. 6 of these fixes are for
MM.
All singletons, please see the changelogs for details.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaNjJPAAKCRDdBJ7gKXxA
jhQiAP0TkBGRBu/IbHLDh39SHANINiM6y+FvOAPTR+Jyp9maSwD/RZlLZ0TnEeoL
htD9hxxVCRYxW5LrrQLxagrTnnOC6gA=
=B6P4
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2025-09-27-22-35' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"7 hotfixes. 4 are cc:stable and the remainder address post-6.16 issues
or aren't considered necessary for -stable kernels. 6 of these fixes
are for MM.
All singletons, please see the changelogs for details"
* tag 'mm-hotfixes-stable-2025-09-27-22-35' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
include/linux/pgtable.h: convert arch_enter_lazy_mmu_mode() and friends to static inlines
mm/damon/sysfs: do not ignore callback's return value in damon_sysfs_damon_call()
mailmap: add entry for Bence Csókás
fs/proc/task_mmu: check p->vec_buf for NULL
kmsan: fix out-of-bounds access to shadow memory
mm/hugetlb: fix copy_hugetlb_page_range() to use ->pt_share_count
mm/hugetlb: fix folio is still mapped when deleted
Specifying a non-zero value for a new struct kmem_cache_args field
sheaf_capacity will setup a caching layer of percpu arrays called
sheaves of given capacity for the created cache.
Allocations from the cache will allocate via the percpu sheaves (main or
spare) as long as they have no NUMA node preference. Frees will also
put the object back into one of the sheaves.
When both percpu sheaves are found empty during an allocation, an empty
sheaf may be replaced with a full one from the per-node barn. If none
are available and the allocation is allowed to block, an empty sheaf is
refilled from slab(s) by an internal bulk alloc operation. When both
percpu sheaves are full during freeing, the barn can replace a full one
with an empty one, unless over a full sheaves limit. In that case a
sheaf is flushed to slab(s) by an internal bulk free operation. Flushing
sheaves and barns is also wired to the existing cpu flushing and cache
shrinking operations.
The sheaves do not distinguish NUMA locality of the cached objects. If
an allocation is requested with kmem_cache_alloc_node() (or a mempolicy
with strict_numa mode enabled) with a specific node (not NUMA_NO_NODE),
the sheaves are bypassed.
The bulk operations exposed to slab users also try to utilize the
sheaves as long as the necessary (full or empty) sheaves are available
on the cpu or in the barn. Once depleted, they will fallback to bulk
alloc/free to slabs directly to avoid double copying.
The sheaf_capacity value is exported in sysfs for observability.
Sysfs CONFIG_SLUB_STATS counters alloc_cpu_sheaf and free_cpu_sheaf
count objects allocated or freed using the sheaves (and thus not
counting towards the other alloc/free path counters). Counters
sheaf_refill and sheaf_flush count objects filled or flushed from or to
slab pages, and can be used to assess how effective the caching is. The
refill and flush operations will also count towards the usual
alloc_fastpath/slowpath, free_fastpath/slowpath and other counters for
the backing slabs. For barn operations, barn_get and barn_put count how
many full sheaves were get from or put to the barn, the _fail variants
count how many such requests could not be satisfied mainly because the
barn was either empty or full. While the barn also holds empty sheaves
to make some operations easier, these are not as critical to mandate own
counters. Finally, there are sheaf_alloc/sheaf_free counters.
Access to the percpu sheaves is protected by local_trylock() when
potential callers include irq context, and local_lock() otherwise (such
as when we already know the gfp flags allow blocking). The trylock
failures should be rare and we can easily fallback. Each per-NUMA-node
barn has a spin_lock.
When slub_debug is enabled for a cache with sheaf_capacity also
specified, the latter is ignored so that allocations and frees reach the
slow path where debugging hooks are processed. Similarly, we ignore it
with CONFIG_SLUB_TINY which prefers low memory usage to performance.
[boot failure: https://lore.kernel.org/all/583eacf5-c971-451a-9f76-fed0e341b815@linux.ibm.com/ ]
Reported-and-tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
We don't need to call free_kmem_cache_nodes() immediately when failing
to allocate a kmem_cache_node, because when we return 0,
do_kmem_cache_create() calls __kmem_cache_release() which also performs
free_kmem_cache_nodes().
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
The callback return value is ignored in damon_sysfs_damon_call(), which
means that it is not possible to detect invalid user input when writing
commands such as 'commit' to
/sys/kernel/mm/damon/admin/kdamonds/<K>/state. Fix it.
Link: https://lkml.kernel.org/r/20250920132546.5822-1-akinobu.mita@gmail.com
Fixes: f64539dcdb ("mm/damon/sysfs: use damon_call() for update_schemes_stats")
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org> [6.14+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Running sha224_kunit on a KMSAN-enabled kernel results in a crash in
kmsan_internal_set_shadow_origin():
BUG: unable to handle page fault for address: ffffbc3840291000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 1810067 P4D 1810067 PUD 192d067 PMD 3c17067 PTE 0
Oops: 0000 [#1] SMP NOPTI
CPU: 0 UID: 0 PID: 81 Comm: kunit_try_catch Tainted: G N 6.17.0-rc3 #10 PREEMPT(voluntary)
Tainted: [N]=TEST
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014
RIP: 0010:kmsan_internal_set_shadow_origin+0x91/0x100
[...]
Call Trace:
<TASK>
__msan_memset+0xee/0x1a0
sha224_final+0x9e/0x350
test_hash_buffer_overruns+0x46f/0x5f0
? kmsan_get_shadow_origin_ptr+0x46/0xa0
? __pfx_test_hash_buffer_overruns+0x10/0x10
kunit_try_run_case+0x198/0xa00
This occurs when memset() is called on a buffer that is not 4-byte aligned
and extends to the end of a guard page, i.e. the next page is unmapped.
The bug is that the loop at the end of kmsan_internal_set_shadow_origin()
accesses the wrong shadow memory bytes when the address is not 4-byte
aligned. Since each 4 bytes are associated with an origin, it rounds the
address and size so that it can access all the origins that contain the
buffer. However, when it checks the corresponding shadow bytes for a
particular origin, it incorrectly uses the original unrounded shadow
address. This results in reads from shadow memory beyond the end of the
buffer's shadow memory, which crashes when that memory is not mapped.
To fix this, correctly align the shadow address before accessing the 4
shadow bytes corresponding to each origin.
Link: https://lkml.kernel.org/r/20250911195858.394235-1-ebiggers@kernel.org
Fixes: 2ef3cec44c ("kmsan: do not wipe out origin when doing partial unpoisoning")
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Tested-by: Alexander Potapenko <glider@google.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: Marco Elver <elver@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
damon_ctx->addr_unit is respected only for physical address space
monitoring use case. Meanwhile, damon_ctx->min_sz_region is used by the
core layer for aligning regions, regardless of whether it is set for
physical address space monitoring or virtual address spaces monitoring.
And it is set as 'DAMON_MIN_REGION / damon_ctx->addr_unit'. Hence, if
user sets ->addr_unit on virtual address spaces monitoring mode, regions
can be unexpectedly aligned in <PAGE_SIZE granularity. It shouldn't cause
crash-like issues but make monitoring and DAMOS behavior difficult to
understand.
Fix the unexpected behavior by setting ->min_sz_region only when it is
configured for physical address space monitoring.
The issue was found from a result of Chris' experiments that thankfully
shared with me off-list.
Link: https://lkml.kernel.org/r/20250917160041.53187-1-sj@kernel.org
Fixes: d8f867fa08 ("mm/damon: add damon_ctx->min_sz_region")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Chris Mason <clm@fb.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: ze zuo <zuoze1@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently vm_area_alloc_pages() contains two cond_resched() points.
However, the page allocator already has its own in slow path so an extra
resched is not optimal because it delays the loops.
The place where CPU time can be consumed is in the VA-space search in
alloc_vmap_area(), especially if the space is really fragmented using
synthetic stress tests, after a fast path falls back to a slow one.
Move a single cond_resched() there, after dropping free_vmap_area_lock in
a slow path. This keeps fairness where it matters while removing
redundant yields from the page-allocation path.
[akpm@linux-foundation.org: tweak comment grammar]
Link: https://lkml.kernel.org/r/20250917185906.1595454-1-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This removes the last call to page_stable_node(), so delete the wrapper.
It also removes a call to trylock_page() and saves a call to
compound_head(), as well as removing a reference to folio->page.
Link: https://lkml.kernel.org/r/20250916181219.2400258-1-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Longlong Xia <xialonglong@kylinos.cn>
Cc: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
On NUMA systems without bindings, allocations check all nodes for free
space, then wake up the kswapds on all nodes and retry. This ensures
all available space is evenly used before reclaim begins. However,
when one process or certain allocations have node restrictions, they
can cause kswapds on only a subset of nodes to be woken up.
Since kswapd hysteresis targets watermarks that are *higher* than
needed for allocation, even *unrestricted* allocations can now get
suckered onto such nodes that are already pressured. This ends up
concentrating all allocations on them, even when there are idle nodes
available for the unrestricted requests.
This was observed with two numa nodes, where node0 is normal and node1
is ZONE_MOVABLE to facilitate hotplugging: a kernel allocation wakes
kswapd on node0 only (since node1 is not eligible); once kswapd0 is
active, the watermarks hover between low and high, and then even the
movable allocations end up on node0, only to be kicked out again;
meanwhile node1 is empty and idle.
Similar behavior is possible when a process with NUMA bindings is
causing selective kswapd wakeups.
To fix this, on NUMA systems augment the (misleading) watermark test
with a check for whether kswapd is already active during the first
iteration through the zonelist. If this fails to place the request,
kswapd must be running everywhere already, and the watermark test is
good enough to decide placement.
With this patch, unrestricted requests successfully make use of node1,
even while kswapd is reclaiming node0 for restricted allocations.
[gourry@gourry.net: don't retry if no kswapds were active]
Link: https://lkml.kernel.org/r/20250919162134.1098208-1-hannes@cmpxchg.org
Signed-off-by: Gregory Price <gourry@gourry.net>
Tested-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Zi Yan <ziy@nvidia.com>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
MADV_COLLAPSE on a file mapping behaves inconsistently depending on if PMD
page table is installed or not.
Consider following example:
p = mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE,
MAP_SHARED, fd, 0);
err = madvise(p, 2UL << 20, MADV_COLLAPSE);
fd is a populated tmpfs file.
The result depends on the address that the kernel returns on mmap(). If
it is located in an existing PMD table, the madvise() will succeed.
However, if the table does not exist, it will fail with -EINVAL.
This occurs because find_pmd_or_thp_or_none() returns SCAN_PMD_NULL when a
page table is missing, which causes collapse_pte_mapped_thp() to fail.
SCAN_PMD_NULL and SCAN_PMD_NONE should be treated the same in
collapse_pte_mapped_thp(): install the PMD leaf entry and allocate page
tables as needed.
Link: https://lkml.kernel.org/r/v5ivpub6z2n2uyemlnxgbilzs52ep4lrary7lm7o6axxoneb75@yfacfl5rkzeh
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Zach O'Keefe <zokeefe@google.com>
Cc: Barry Song <baohua@kernel.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In commit bb666b7c27 ("mm: add mmap_prepare() compatibility layer for
nested file systems") we introduced the ability for stacked drivers and
file systems to correctly invoke the f_op->mmap_prepare() handler from an
f_op->mmap() handler via a compatibility layer implemented in
compat_vma_mmap_prepare().
This populates vm_area_desc fields according to those found in the (not
yet fully initialised) VMA passed to f_op->mmap().
However this function implicitly assumes that the struct file which we are
operating upon is equal to vma->vm_file. This is not a safe assumption in
all cases.
The only really sane situation in which this matters would be something
like e.g. i915_gem_dmabuf_mmap() which invokes vfs_mmap() against
obj->base.filp:
ret = vfs_mmap(obj->base.filp, vma);
if (ret)
return ret;
And then sets the VMA's file to this, should the mmap operation succeed:
vma_set_file(vma, obj->base.filp);
That is - it is the file that is intended to back the VMA mapping.
This is not an issue currently, as so far we have only implemented
f_op->mmap_prepare() handlers for some file systems and internal mm uses,
and the only stacked f_op->mmap() operations that can be performed upon
these are those in backing_file_mmap() and coda_file_mmap(), both of which
use vma->vm_file.
However, moving forward, as we convert drivers to using
f_op->mmap_prepare(), this will become a problem.
Resolve this issue by explicitly setting desc->file to the provided file
parameter and update callers accordingly.
Callers are expected to read desc->file and update desc->vm_file - the
former will be the file provided by the caller (if stacked, this may
differ from vma->vm_file).
If the caller needs to differentiate between the two they therefore now
can.
While we are here, also provide a variant of compat_vma_mmap_prepare()
that operates against a pointer to any file_operations struct and does not
assume that the file_operations struct we are interested in is file->f_op.
This function is __compat_vma_mmap_prepare() and we invoke it from
compat_vma_mmap_prepare() so that we share code between the two functions.
This is important, because some drivers provide hooks in a separate
struct, for instance struct drm_device provides an fops field for this
purpose.
Also update the VMA selftests accordingly.
Link: https://lkml.kernel.org/r/dd0c72df8a33e8ffaa243eeb9b01010b670610e9.1756920635.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm: do not assume file == vma->vm_file in
compat_vma_mmap_prepare()", v2.
As part of the efforts to eliminate the problematic f_op->mmap callback, a
new callback - f_op->mmap_prepare was provided.
While we are converting these callbacks, we must deal with 'stacked'
filesystems and drivers - those which in their own f_op->mmap callback
invoke an inner f_op->mmap callback.
To accomodate for this, a compatibility layer is provided that, via
vfs_mmap(), detects if f_op->mmap_prepare is provided and if so, generates
a vm_area_desc containing the VMA's metadata and invokes the call.
So far, we have provided desc->file equal to vma->vm_file. However this
is not necessarily valid, especially in the case of stacked drivers which
wish to assign a new file after the inner hook is invoked.
To account for this, we adjust vm_area_desc to have both file and vm_file
fields. The .vm_file field is strictly set to vma->vm_file (or in the
case of a new mapping, what will become vma->vm_file).
However, .file is set to whichever file vfs_mmap() is invoked with when
using the compatibilty layer.
Therefore, if the VMA's file needs to be updated in .mmap_prepare,
desc->vm_file should be assigned, whilst desc->file should be read.
No current f_op->mmap_prepare users assign desc->file so this is safe to
do.
This makes the .mmap_prepare callback in the context of a stacked
filesystem or driver completely consistent with the existing .mmap
implementations.
While we're here, we do a few small cleanups, and ensure that we const-ify
things correctly in the vm_area_desc struct to avoid hooks accidentally
trying to assign fields they should not.
This patch (of 2):
Stacked filesystems and drivers may invoke mmap hooks with a struct file
pointer that differs from the overlying file. We will make this
functionality possible in a subsequent patch.
In order to prepare for this, let's update vm_area_struct to separately
provide desc->file and desc->vm_file parameters.
The desc->file parameter is the file that the hook is expected to operate
upon, and is not assignable (though the hok may wish to e.g. update the
file's accessed time for instance).
The desc->vm_file defaults to what will become vma->vm_file and is what
the hook must reassign should it wish to change the VMA"s vma->vm_file.
For now we keep desc->file, vm_file the same to remain consistent.
No f_op->mmap_prepare() callback sets a new vma->vm_file currently, so
this is safe to change.
While we're here, make the mm_struct desc->mm pointers at immutable as
well as the desc->mm field itself.
As part of this change, also update the single hook which this would
otherwise break - mlock_future_ok(), invoked by secretmem_mmap_prepare()).
We additionally update set_vma_from_desc() to compare fields in a more
logical fashion, checking the (possibly) user-modified fields as the first
operand against the existing value as the second one.
Additionally, update VMA tests to accommodate changes.
Link: https://lkml.kernel.org/r/cover.1756920635.git.lorenzo.stoakes@oracle.com
Link: https://lkml.kernel.org/r/3fa15a861bb7419f033d22970598aa61850ea267.1756920635.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Pedro Falcato <pfalcato@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Now that all actionable outcomes from checking pte_write() are gone, drop
the related references.
Link: https://lkml.kernel.org/r/20250908075028.38431-3-dev.jain@arm.com
Signed-off-by: Dev Jain <dev.jain@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Kiryl Shutsemau <kas@kernel.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Zach O'Keefe <zokeefe@google.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>