Commit Graph

25545 Commits (master)

Author SHA1 Message Date
Linus Torvalds 40fbbd64bb a couple of shmem rename fixes - recent regression from tree-in-dcache
series and older breakage from stable directory offsets stuff.
 
 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCaUD1egAKCRBZ7Krx/gZQ
 693YAQDWMzqUs8bJx95frxidF4fJ658K/bZWHuG9eLDvhF2CxAEAoIWt5EbJ1dE0
 NEIg/+kdVpKpk1DH7SZTzc+Mgbe4aAE=
 =/LtU
 -----END PGP SIGNATURE-----

Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull shmem rename fixes from Al Viro:
 "A couple of shmem rename fixes - recent regression from tree-in-dcache
  series and older breakage from stable directory offsets stuff"

* tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  shmem: fix recovery on rename failures
  shmem_whiteout(): fix regression from tree-in-dcache series
2025-12-16 19:44:36 +12:00
Al Viro e1b4c6a583 shmem: fix recovery on rename failures
maple_tree insertions can fail if we are seriously short on memory;
simple_offset_rename() does not recover well if it runs into that.
The same goes for simple_offset_rename_exchange().

Moreover, shmem_whiteout() expects that if it succeeds, the caller will
progress to d_move(), i.e. that shmem_rename2() won't fail past the
successful call of shmem_whiteout().

Not hard to fix, fortunately - mtree_store() can't fail if the index we
are trying to store into is already present in the tree as a singleton.

For simple_offset_rename_exchange() that's enough - we just need to be
careful about the order of operations.

For simple_offset_rename() solution is to preinsert the target into the
tree for new_dir; the rest can be done without any potentially failing
operations.

That preinsertion has to be done in shmem_rename2() rather than in
simple_offset_rename() itself - otherwise we'd need to deal with the
possibility of failure after successful shmem_whiteout().

Fixes: a2e459555c ("shmem: stable directory offsets")
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-12-16 00:57:29 -05:00
Al Viro 3010f06c52 shmem_whiteout(): fix regression from tree-in-dcache series
Now that shmem_mknod() hashes the new dentry, d_rehash() in
shmem_whiteout() should be removed.

X-paperbag: brown
Reported-by: Hugh Dickins <hughd@google.com>
Acked-by: Hugh Dickins <hughd@google.com>
Tested-by: Hugh Dickins <hughd@google.com>
Fixes: 2313598222 ("convert ramfs and tmpfs")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-12-13 22:28:53 -05:00
Linus Torvalds 9d9c1cfec0 There are no significant series in this small merge. Please see the
individual changelogs for details.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaTsgBwAKCRDdBJ7gKXxA
 jkSIAP4jD66nEC2QyKTiv9XvXi8rpKz6RGAHNZSam0ucI5WKswEAoflmlqsoD/Kk
 sN3YLwLztzCJIYU7tT3IRaPK0irwDAo=
 =nkux
 -----END PGP SIGNATURE-----

Merge tag 'mm-nonmm-stable-2025-12-11-11-47' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc updates from Andrew Morton:
 "There are no significant series in this small merge. Please see the
  individual changelogs for details"

[ Editor's note: it's mainly ocfs2 and a couple of random fixes ]

* tag 'mm-nonmm-stable-2025-12-11-11-47' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  mm: memfd_luo: add CONFIG_SHMEM dependency
  mm: shmem: avoid build warning for CONFIG_SHMEM=n
  ocfs2: fix memory leak in ocfs2_merge_rec_left()
  ocfs2: invalidate inode if i_mode is zero after block read
  ocfs2: avoid -Wflex-array-member-not-at-end warning
  ocfs2: convert remaining read-only checks to ocfs2_emergency_state
  ocfs2: add ocfs2_emergency_state helper and apply to setattr
  checkpatch: add uninitialized pointer with __free attribute check
  args: fix documentation to reflect the correct numbers
  ocfs2: fix kernel BUG in ocfs2_find_victim_chain
  liveupdate: luo_core: fix redundant bound check in luo_ioctl()
  ocfs2: validate inline xattr size and entry count in ocfs2_xattr_ibody_list
  fs/fat: remove unnecessary wrapper fat_max_cache()
  ocfs2: replace deprecated strcpy with strscpy
  ocfs2: check tl_used after reading it from trancate log inode
  liveupdate: luo_file: don't use invalid list iterator
2025-12-13 20:55:12 +12:00
Linus Torvalds 2516a87153 Patch series in this pull request:
- The 2 patch series "powerpc/pseries/cmm: two smaller fixes" from David
   Hildenbrand fixes a couple of minor things in ppc land.
 
 - The 4 patch series "Improve folio split related functions" from Zi Yan
   provides some cleanups and minorish fixes in the folio splitting code.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaTseCwAKCRDdBJ7gKXxA
 jkCLAP9ttvtG7zhDf+tqvjCJFnybPEo6Z2B4Qx9g8i7s27gTZgD8DzYbIl+YdxrN
 /tS0tdgUfrJIXR2PzzipkydNrs+lxgg=
 =T65o
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2025-12-11-11-39' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull more MM updates from Andrew Morton:

 - "powerpc/pseries/cmm: two smaller fixes" (David Hildenbrand)
   fixes a couple of minor things in ppc land

 - "Improve folio split related functions" (Zi Yan)
   some cleanups and minorish fixes in the folio splitting code

* tag 'mm-stable-2025-12-11-11-39' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  mm/damon/tests/core-kunit: avoid damos_test_commit stack warning
  mm: vmscan: correct nr_requested tracing in scan_folios
  MAINTAINERS: add idr core-api doc file to XARRAY
  mm/hugetlb: fix incorrect error return from hugetlb_reserve_pages()
  mm: fix CONFIG_STACK_GROWSUP typo in mm.h
  mm/huge_memory: fix folio split stats counting
  mm/huge_memory: make min_order_for_split() always return an order
  mm/huge_memory: replace can_split_folio() with direct refcount calculation
  mm/huge_memory: change folio_split_supported() to folio_check_splittable()
  mm/sparse: fix sparse_vmemmap_init_nid_early definition without CONFIG_SPARSEMEM
  powerpc/pseries/cmm: adjust BALLOON_MIGRATE when migrating pages
  powerpc/pseries/cmm: call balloon_devinfo_init() also without CONFIG_BALLOON_COMPACTION
2025-12-13 20:35:41 +12:00
Arnd Bergmann 402736a591 mm: shmem: avoid build warning for CONFIG_SHMEM=n
The newly added 'flags' variable is unused and causes a warning if
CONFIG_SHMEM is disabled, since the shmem_acct_size() macro it is passed
into does nothing:

mm/shmem.c: In function '__shmem_file_setup':
mm/shmem.c:5816:23: error: unused variable 'flags' [-Werror=unused-variable]
 5816 |         unsigned long flags = (vm_flags & VM_NORESERVE) ? SHMEM_F_NORESERVE : 0;
      |                       ^~~~~

Replace the two macros with equivalent inline functions to get the
argument checking.

Link: https://lkml.kernel.org/r/20251204102905.1048000-1-arnd@kernel.org
Fixes: 6ff1610ced ("mm: shmem: use SHMEM_F_* flags instead of VM_* flags")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: guoweikang <guoweikang.kernel@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kairui Song <kasong@tencent.com>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-12-10 16:07:43 -08:00
Linus Torvalds 1de741159b slab fix for 6.19-rc1
-----BEGIN PGP SIGNATURE-----
 
 iQFPBAABCAA5FiEEe7vIQRWZI0iWSE3xu+CwddJFiJoFAmk5RPgbFIAAAAAABAAO
 bWFudTIsMi41KzEuMTEsMiwyAAoJELvgsHXSRYiaqKoH/3VgscOpeq15HnX5yxt9
 Es2OGMmAQOxuYTKJPL7QQEt2Mtz64RkouwzFUmh3kdKtvMBbs6j/OxPs8nfk6yxf
 g0RwAk+55qsW07k4hGT4XLTSjuz5r3Wk8mkuXzzNZjbVIDN0ssrKdLx3d9OLhpmC
 iG8CGz3zuWbJ5cVQEmwE11rErwlMiXQjxE7KMZXGMCgTZw6vBZ40oVnGrCl/Ez8/
 aEW7jQpnaJr4y7kYrhNdvGTZtQrN+QR5ey/HTfCbpj76H8YHabjzCZarsznMgH/t
 rUNs+o7utJF5k2EnCyHpNWUk+ky3ixdgbusdh9i8hchuDR9dbrOz8yCnriUhy9mT
 5g4=
 =AJKw
 -----END PGP SIGNATURE-----

Merge tag 'slab-for-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab

Pull slab fix from Vlastimil Babka:

 - A stable fix for performance regression in tests that perform
   kmem_cache_destroy() a lot, due to unnecessarily wide scope of
   kvfree_rcu_barrier() (Harry Yoo)

* tag 'slab-for-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
  mm/slab: introduce kvfree_rcu_barrier_on_cache() for cache destruction
2025-12-11 08:54:08 +09:00
Arnd Bergmann dafdba0964 mm/damon/tests/core-kunit: avoid damos_test_commit stack warning
The newly added damos_test_commit() constructs multiple large structures
on the stack, which exceeds the warning limit in some cases:

In file included from mm/damon/core.c:2941:
mm/damon/tests/core-kunit.h: In function 'damos_test_commit':
mm/damon/tests/core-kunit.h:965:1: error: the frame size of 1520 bytes is larger than 1280 bytes [-Werror=frame-larger-than=]

Split this function up into two separate ones that are called
sequentially, so they can occupy the same stack slots.

Link: https://lkml.kernel.org/r/20251204100403.1034980-1-arnd@kernel.org
Fixes: 299a88f6ec ("mm/damon/tests/core-kunit: add damos_commit() test")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: Quanmin Yan <yanquanmin1@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-12-09 11:25:34 -08:00
Chen Ridong 49d921b471 mm: vmscan: correct nr_requested tracing in scan_folios
When enabling vmscan tracing, it is observed that nr_requested is always
4096, which is confusing.

        mm_vmscan_lru_isolate: classzone=3 order=0 nr_requested=4096 ...
        mm_vmscan_lru_isolate: classzone=3 order=0 nr_requested=4096 ...
        mm_vmscan_lru_isolate: classzone=3 order=0 nr_requested=4096 ...
        mm_vmscan_lru_isolate: classzone=3 order=0 nr_requested=4096 ...
        mm_vmscan_lru_isolate: classzone=3 order=0 nr_requested=4096 ...
        mm_vmscan_lru_isolate: classzone=3 order=0 nr_requested=4096 ...
        mm_vmscan_lru_isolate: classzone=3 order=0 nr_requested=4096 ...

This is because it prints MAX_LRU_BATCH, which is meaningless as it's a
constant.  To fix this, modify it to print capped valued.

Link: https://lkml.kernel.org/r/20251204122355.1822919-1-chenridong@huaweicloud.com
Fixes: 8c2214fc9a ("mm: multi-gen LRU: reuse some legacy trace events")
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Jaewon Kim <jaewon31.kim@samsung.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Lu Jialin <lujialin4@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Wei Xu <weixugc@google.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-12-09 11:25:34 -08:00
Shameer Kolothum 9ee5d1766c mm/hugetlb: fix incorrect error return from hugetlb_reserve_pages()
The function hugetlb_reserve_pages() returns the number of pages added
to the reservation map on success and a negative error code on failure
(e.g. -EINVAL, -ENOMEM). However, in some error paths, it may return -1
directly.

For example, a failure at:

    if (hugetlb_acct_memory(h, gbl_reserve) < 0)
        goto out_put_pages;

results in returning -1 (since add = -1), which may be misinterpreted
in userspace as -EPERM.

Fix this by explicitly capturing and propagating the return values from
helper functions, and using -EINVAL for all other failure cases.

Link: https://lkml.kernel.org/r/20251125171350.86441-1-skolothumtho@nvidia.com
Fixes: 986f5f2b4b ("mm/hugetlb: make hugetlb_reserve_pages() return nr of entries updated")
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
Reviewed-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: Oscar Salvador <osalvador@suse.de>
Cc: Matthew R. Ochs <mochs@nvidia.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nicolin Chen <nicolinc@nvidia.com>
Cc: Vivek Kasireddy <vivek.kasireddy@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-12-09 11:25:33 -08:00
Zi Yan 9dcdc0c207 mm/huge_memory: fix folio split stats counting
The "return <error code>" statements for error checks at the beginning of
__folio_split() skip necessary count_vm_event() and count_mthp_stat() at
the end of the function.  Fix these by replacing them with "ret = <error
code>; goto out;".

Link: https://lkml.kernel.org/r/20251126210618.1971206-5-ziy@nvidia.com
Signed-off-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-12-09 11:25:33 -08:00
Zi Yan 2f78910659 mm/huge_memory: make min_order_for_split() always return an order
min_order_for_split() returns -EBUSY when the folio is truncated and
cannot be split.  In commit 77008e1b2e ("mm/huge_memory: do not change
split_huge_page*() target order silently"), memory_failure() does not
handle it and pass -EBUSY to try_to_split_thp_page() directly. 
try_to_split_thp_page() returns -EINVAL since -EBUSY becomes 0xfffffff0 as
new_order is unsigned int in __folio_split() and this large new_order is
rejected as an invalid input.  The code does not cause a bug. 
soft_offline_in_use_page() also uses min_order_for_split() but it always
passes 0 as new_order for split.

Fix it by making min_order_for_split() always return an order.  When the
given folio is truncated, namely folio->mapping == NULL, return 0 and let
a subsequent split function handle the situation and return -EBUSY.

Add kernel-doc to min_order_for_split() to clarify its use.

Link: https://lkml.kernel.org/r/20251126210618.1971206-4-ziy@nvidia.com
Signed-off-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-12-09 11:25:33 -08:00
Zi Yan 5842bcbfc3 mm/huge_memory: replace can_split_folio() with direct refcount calculation
can_split_folio() is just a refcount comparison, making sure only the
split caller holds an extra pin.  Open code it with
folio_expected_ref_count() != folio_ref_count() - 1.  For the extra_pins
used by folio_ref_freeze(), add folio_cache_ref_count() to calculate it. 
Also replace folio_expected_ref_count() with folio_cache_ref_count() used
by folio_ref_unfreeze(), since they are returning the same values when a
folio is frozen and folio_cache_ref_count() does not have unnecessary
folio_mapcount() in its implementation.

Link: https://lkml.kernel.org/r/20251126210618.1971206-3-ziy@nvidia.com
Signed-off-by: Zi Yan <ziy@nvidia.com>
Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-12-09 11:25:32 -08:00
Zi Yan bdd0d69a32 mm/huge_memory: change folio_split_supported() to folio_check_splittable()
Patch series "Improve folio split related functions", v4.

This patchset improves several folio split related functions to avoid
future misuse.  The changes are:

1. Consolidated folio splittable checks by moving truncated folio check,
   huge zero folio check, and writeback folio check into
   folio_split_supported(). Changed the function return type. Renamed it
   to folio_check_splittable() for clarification.

2. Replaced can_split_folio() with open coded folio_expected_ref_count()
   and folio_ref_count() and introduced folio_cache_ref_count().

3. Changed min_order_for_split() to always return an order.

4. Fixed folio split stats counting.

Motivation
==========
This is based on Wei's observation[1] and solves several potential
issues:
1. Dereferencing NULL folio->mapping in try_folio_split_to_order() if it
   is called on truncated folios.
2. Not handling of negative return value of min_order_for_split() in
   mm/memory-failure.c

There is no bug in the current code.


This patch (of 4):

folio_split_supported() used in try_folio_split_to_order() requires
folio->mapping to be non NULL, but current try_folio_split_to_order() does
not check it.  There is no issue in the current code, since
try_folio_split_to_order() is only used in truncate_inode_partial_folio(),
where folio->mapping is not NULL.

To prevent future misuse, move folio->mapping NULL check (i.e., folio is
truncated) into folio_split_supported().  Since folio->mapping NULL check
returns -EBUSY and folio_split_supported() == false means -EINVAL, change
folio_split_supported() return type from bool to int and return error
numbers accordingly.  Rename folio_split_supported() to
folio_check_splittable() to match the return type change.

While at it, move is_huge_zero_folio() check and folio_test_writeback()
check into folio_check_splittable() and add kernel-doc.

Remove all warnings inside folio_check_splittable() and give warnings
in __folio_split() instead, so that bool warns parameter can be removed.

Link: https://lkml.kernel.org/r/20251126210618.1971206-1-ziy@nvidia.com
Link: https://lkml.kernel.org/r/20251126210618.1971206-2-ziy@nvidia.com
Signed-off-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Acked-by: Balbir Singh <balbirs@nvidia.com>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-12-09 11:25:32 -08:00
Harry Yoo 0f35040de5 mm/slab: introduce kvfree_rcu_barrier_on_cache() for cache destruction
Currently, kvfree_rcu_barrier() flushes RCU sheaves across all slab
caches when a cache is destroyed. This is unnecessary; only the RCU
sheaves belonging to the cache being destroyed need to be flushed.

As suggested by Vlastimil Babka, introduce a weaker form of
kvfree_rcu_barrier() that operates on a specific slab cache.

Factor out flush_rcu_sheaves_on_cache() from flush_all_rcu_sheaves() and
call it from flush_all_rcu_sheaves() and kvfree_rcu_barrier_on_cache().

Call kvfree_rcu_barrier_on_cache() instead of kvfree_rcu_barrier() on
cache destruction.

The performance benefit is evaluated on a 12 core 24 threads AMD Ryzen
5900X machine (1 socket), by loading slub_kunit module.

Before:
  Total calls: 19
  Average latency (us): 18127
  Total time (us): 344414

After:
  Total calls: 19
  Average latency (us): 10066
  Total time (us): 191264

Two performance regression have been reported:
  - stress module loader test's runtime increases by 50-60% (Daniel)
  - internal graphics test's runtime on Tegra234 increases by 35% (Jon)

They are fixed by this change.

Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Fixes: ec66e0d599 ("slab: add sheaf support for batching kfree_rcu() operations")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/linux-mm/1bda09da-93be-4737-aef0-d47f8c5c9301@suse.cz
Reported-and-tested-by: Daniel Gomez <da.gomez@samsung.com>
Closes: https://lore.kernel.org/linux-mm/0406562e-2066-4cf8-9902-b2b0616dd742@kernel.org
Reported-and-tested-by: Jon Hunter <jonathanh@nvidia.com>
Closes: https://lore.kernel.org/linux-mm/e988eff6-1287-425e-a06c-805af5bbf262@nvidia.com
Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
Link: https://patch.msgid.link/20251207154148.117723-1-harry.yoo@oracle.com
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2025-12-07 18:09:54 +01:00
Linus Torvalds 67a454e6b1 memblock: introduce check_pages boot parameter
... to decouple simple checks for page state on allocation and free from
 CONFIG_DEBUG_VM.
 
 check_pages parameter allows enabling page checking without building kernel
 with CONFIG_DEBUG_VM or forcing init_on_{alloc, free} or other heavier
 mechanisms.
 -----BEGIN PGP SIGNATURE-----
 
 iQFEBAABCgAuFiEEeOVYVaWZL5900a/pOQOGJssO/ZEFAmk1p20QHHJwcHRAa2Vy
 bmVsLm9yZwAKCRA5A4Ymyw79kZT4B/4wGhbJjAe6RbBsco9jR6p5veLPOj1DFGgf
 nk8MWqgij1clDvOKzXfJPDiXWzsfbt1/z1nak0DMOx5jjqpsZe9RmIfVZA48nMQS
 WJdPC44NQU/xCwnxZ2xRvf6tzc9ehuvELDXAfT+Qg+4YqGtre2HxLy7hSDduMuJG
 p5/9Crt5sf4d2S9J2ctlZYIBf6GB8I6OGQ419eGN7tn5bvQfUNxgpcIYO6CDXw2A
 mNVqYrlHvzeOkQ78cmnj/gBDLyo8KUvYd2ZavSK9SiIJK3XIV2+Goq5KfTpM0yvS
 XC0x1IP/upOO3fYKfpc1tA3vGDc9PPtCK9RqApj1U2oG7X4H/VJu
 =7raZ
 -----END PGP SIGNATURE-----

Merge tag 'memblock-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock

Pull memblock update from Mike Rapoport:
 "Introduce a 'check_pages' boot parameter to decouple simple checks for
  page state on allocation and free from CONFIG_DEBUG_VM.

  This allows enabling page checking without building kernel with
  CONFIG_DEBUG_VM or forcing init_on_{alloc, free} or other heavier
  mechanisms"

* tag 'memblock-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock:
  mm/mm_init: Introduce a boot parameter for check_pages
2025-12-07 08:56:10 -08:00
Linus Torvalds 509d3f4584 Significant patch series in this pull request:
- The 6 patch series "panic: sys_info: Refactor and fix a potential
   issue" from Andy Shevchenko fixes a build issue and does some cleanup in
   ib/sys_info.c.
 
 - The 9 patch series "Implement mul_u64_u64_div_u64_roundup()" from
   David Laight enhances the 64-bit math code on behalf of a PWM driver and
   beefs up the test module for these library functions.
 
 - The 2 patch series "scripts/gdb/symbols: make BPF debug info available
   to GDB" from Ilya Leoshkevich makes BPF symbol names, sizes, and line
   numbers available to the GDB debugger.
 
 - The 4 patch series "Enable hung_task and lockup cases to dump system
   info on demand" from Feng Tang adds a sysctl which can be used to cause
   additional info dumping when the hung-task and lockup detectors fire.
 
 - The 6 patch series "lib/base64: add generic encoder/decoder, migrate
   users" from Kuan-Wei Chiu adds a general base64 encoder/decoder to lib/
   and migrates several users away from their private implementations.
 
 - The 2 patch series "rbree: inline rb_first() and rb_last()" from Eric
   Dumazet makes TCP a little faster.
 
 - The 9 patch series "liveupdate: Rework KHO for in-kernel users" from
   Pasha Tatashin reworks the KEXEC Handover interfaces in preparation for
   Live Update Orchestrator (LUO), and possibly for other future clients.
 
 - The 13 patch series "kho: simplify state machine and enable dynamic
   updates" from Pasha Tatashin increases the flexibility of KEXEC
   Handover.  Also preparation for LUO.
 
 - The 18 patch series "Live Update Orchestrator" from Pasha Tatashin is
   a major new feature targeted at cloud environments.  Quoting the [0/N]:
 
     This series introduces the Live Update Orchestrator, a kernel subsystem
     designed to facilitate live kernel updates using a kexec-based reboot.
     This capability is critical for cloud environments, allowing hypervisors
     to be updated with minimal downtime for running virtual machines.  LUO
     achieves this by preserving the state of selected resources, such as
     memory, devices and their dependencies, across the kernel transition.
 
     As a key feature, this series includes support for preserving memfd file
     descriptors, which allows critical in-memory data, such as guest RAM or
     any other large memory region, to be maintained in RAM across the kexec
     reboot.
 
   Mike Rappaport merits a mention here, for his extensive review and
   testing work.
 
 - The 3 patch series "kexec: reorganize kexec and kdump sysfs" from
   Sourabh Jain moves the kexec and kdump sysfs entries from /sys/kernel/
   to /sys/kernel/kexec/ and adds back-compatibility symlinks which can
   hopefully be removed one day.
 
 - The 2 patch series "kho: fixes for vmalloc restoration" from Mike
   Rapoport fixes a BUG which was being hit during KHO restoration of
   vmalloc() regions.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaTSAkQAKCRDdBJ7gKXxA
 jrkiAP9QKfsRv46XZaM5raScjY1ayjP+gqb2rgt6BQ/gZvb2+wD/cPAYOR6BiX52
 n0pVpQmG5P/KyOmpLztn96ejL4heKwQ=
 =JY96
 -----END PGP SIGNATURE-----

Merge tag 'mm-nonmm-stable-2025-12-06-11-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull non-MM updates from Andrew Morton:

 - "panic: sys_info: Refactor and fix a potential issue" (Andy Shevchenko)
   fixes a build issue and does some cleanup in ib/sys_info.c

 - "Implement mul_u64_u64_div_u64_roundup()" (David Laight)
   enhances the 64-bit math code on behalf of a PWM driver and beefs up
   the test module for these library functions

 - "scripts/gdb/symbols: make BPF debug info available to GDB" (Ilya Leoshkevich)
   makes BPF symbol names, sizes, and line numbers available to the GDB
   debugger

 - "Enable hung_task and lockup cases to dump system info on demand" (Feng Tang)
   adds a sysctl which can be used to cause additional info dumping when
   the hung-task and lockup detectors fire

 - "lib/base64: add generic encoder/decoder, migrate users" (Kuan-Wei Chiu)
   adds a general base64 encoder/decoder to lib/ and migrates several
   users away from their private implementations

 - "rbree: inline rb_first() and rb_last()" (Eric Dumazet)
   makes TCP a little faster

 - "liveupdate: Rework KHO for in-kernel users" (Pasha Tatashin)
   reworks the KEXEC Handover interfaces in preparation for Live Update
   Orchestrator (LUO), and possibly for other future clients

 - "kho: simplify state machine and enable dynamic updates" (Pasha Tatashin)
   increases the flexibility of KEXEC Handover. Also preparation for LUO

 - "Live Update Orchestrator" (Pasha Tatashin)
   is a major new feature targeted at cloud environments. Quoting the
   cover letter:

      This series introduces the Live Update Orchestrator, a kernel
      subsystem designed to facilitate live kernel updates using a
      kexec-based reboot. This capability is critical for cloud
      environments, allowing hypervisors to be updated with minimal
      downtime for running virtual machines. LUO achieves this by
      preserving the state of selected resources, such as memory,
      devices and their dependencies, across the kernel transition.

      As a key feature, this series includes support for preserving
      memfd file descriptors, which allows critical in-memory data, such
      as guest RAM or any other large memory region, to be maintained in
      RAM across the kexec reboot.

   Mike Rappaport merits a mention here, for his extensive review and
   testing work.

 - "kexec: reorganize kexec and kdump sysfs" (Sourabh Jain)
   moves the kexec and kdump sysfs entries from /sys/kernel/ to
   /sys/kernel/kexec/ and adds back-compatibility symlinks which can
   hopefully be removed one day

 - "kho: fixes for vmalloc restoration" (Mike Rapoport)
   fixes a BUG which was being hit during KHO restoration of vmalloc()
   regions

* tag 'mm-nonmm-stable-2025-12-06-11-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (139 commits)
  calibrate: update header inclusion
  Reinstate "resource: avoid unnecessary lookups in find_next_iomem_res()"
  vmcoreinfo: track and log recoverable hardware errors
  kho: fix restoring of contiguous ranges of order-0 pages
  kho: kho_restore_vmalloc: fix initialization of pages array
  MAINTAINERS: TPM DEVICE DRIVER: update the W-tag
  init: replace simple_strtoul with kstrtoul to improve lpj_setup
  KHO: fix boot failure due to kmemleak access to non-PRESENT pages
  Documentation/ABI: new kexec and kdump sysfs interface
  Documentation/ABI: mark old kexec sysfs deprecated
  kexec: move sysfs entries to /sys/kernel/kexec
  test_kho: always print restore status
  kho: free chunks using free_page() instead of kfree()
  selftests/liveupdate: add kexec test for multiple and empty sessions
  selftests/liveupdate: add simple kexec-based selftest for LUO
  selftests/liveupdate: add userspace API selftests
  docs: add documentation for memfd preservation via LUO
  mm: memfd_luo: allow preserving memfd
  liveupdate: luo_file: add private argument to store runtime state
  mm: shmem: export some functions to internal.h
  ...
2025-12-06 14:01:20 -08:00
Linus Torvalds 51d90a15fe ARM:
- Support for userspace handling of synchronous external aborts (SEAs),
   allowing the VMM to potentially handle the abort in a non-fatal
   manner.
 
 - Large rework of the VGIC's list register handling with the goal of
   supporting more active/pending IRQs than available list registers in
   hardware. In addition, the VGIC now supports EOImode==1 style
   deactivations for IRQs which may occur on a separate vCPU than the
   one that acked the IRQ.
 
 - Support for FEAT_XNX (user / privileged execute permissions) and
   FEAT_HAF (hardware update to the Access Flag) in the software page
   table walkers and shadow MMU.
 
 - Allow page table destruction to reschedule, fixing long need_resched
   latencies observed when destroying a large VM.
 
 - Minor fixes to KVM and selftests
 
 Loongarch:
 
 - Get VM PMU capability from HW GCFG register.
 
 - Add AVEC basic support.
 
 - Use 64-bit register definition for EIOINTC.
 
 - Add KVM timer test cases for tools/selftests.
 
 RISC/V:
 
 - SBI message passing (MPXY) support for KVM guest
 
 - Give a new, more specific error subcode for the case when in-kernel
   AIA virtualization fails to allocate IMSIC VS-file
 
 - Support KVM_DIRTY_LOG_INITIALLY_SET, enabling dirty log gradually
   in small chunks
 
 - Fix guest page fault within HLV* instructions
 
 - Flush VS-stage TLB after VCPU migration for Andes cores
 
 s390:
 
 - Always allocate ESCA (Extended System Control Area), instead of
   starting with the basic SCA and converting to ESCA with the
   addition of the 65th vCPU.  The price is increased number of
   exits (and worse performance) on z10 and earlier processor;
   ESCA was introduced by z114/z196 in 2010.
 
 - VIRT_XFER_TO_GUEST_WORK support
 
 - Operation exception forwarding support
 
 - Cleanups
 
 x86:
 
 - Skip the costly "zap all SPTEs" on an MMIO generation wrap if MMIO SPTE
   caching is disabled, as there can't be any relevant SPTEs to zap.
 
 - Relocate a misplaced export.
 
 - Fix an async #PF bug where KVM would clear the completion queue when the
   guest transitioned in and out of paging mode, e.g. when handling an SMI and
   then returning to paged mode via RSM.
 
 - Leave KVM's user-return notifier registered even when disabling
   virtualization, as long as kvm.ko is loaded.  On reboot/shutdown, keeping
   the notifier registered is ok; the kernel does not use the MSRs and the
   callback will run cleanly and restore host MSRs if the CPU manages to
   return to userspace before the system goes down.
 
 - Use the checked version of {get,put}_user().
 
 - Fix a long-lurking bug where KVM's lack of catch-up logic for periodic APIC
   timers can result in a hard lockup in the host.
 
 - Revert the periodic kvmclock sync logic now that KVM doesn't use a
   clocksource that's subject to NTP corrections.
 
 - Clean up KVM's handling of MMIO Stale Data and L1TF, and bury the latter
   behind CONFIG_CPU_MITIGATIONS.
 
 - Context switch XCR0, XSS, and PKRU outside of the entry/exit fast path;
   the only reason they were handled in the fast path was to paper of a bug
   in the core #MC code, and that has long since been fixed.
 
 - Add emulator support for AVX MOV instructions, to play nice with emulated
   devices whose guest drivers like to access PCI BARs with large multi-byte
   instructions.
 
 x86 (AMD):
 
 - Fix a few missing "VMCB dirty" bugs.
 
 - Fix the worst of KVM's lack of EFER.LMSLE emulation.
 
 - Add AVIC support for addressing 4k vCPUs in x2AVIC mode.
 
 - Fix incorrect handling of selective CR0 writes when checking intercepts
   during emulation of L2 instructions.
 
 - Fix a currently-benign bug where KVM would clobber SPEC_CTRL[63:32] on
   VMRUN and #VMEXIT.
 
 - Fix a bug where KVM corrupt the guest code stream when re-injecting a soft
   interrupt if the guest patched the underlying code after the VM-Exit, e.g.
   when Linux patches code with a temporary INT3.
 
 - Add KVM_X86_SNP_POLICY_BITS to advertise supported SNP policy bits to
   userspace, and extend KVM "support" to all policy bits that don't require
   any actual support from KVM.
 
 x86 (Intel):
 
 - Use the root role from kvm_mmu_page to construct EPTPs instead of the
   current vCPU state, partly as worthwhile cleanup, but mostly to pave the
   way for tracking per-root TLB flushes, and elide EPT flushes on pCPU
   migration if the root is clean from a previous flush.
 
 - Add a few missing nested consistency checks.
 
 - Rip out support for doing "early" consistency checks via hardware as the
   functionality hasn't been used in years and is no longer useful in general;
   replace it with an off-by-default module param to WARN if hardware fails
   a check that KVM does not perform.
 
 - Fix a currently-benign bug where KVM would drop the guest's SPEC_CTRL[63:32]
   on VM-Enter.
 
 - Misc cleanups.
 
 - Overhaul the TDX code to address systemic races where KVM (acting on behalf
   of userspace) could inadvertantly trigger lock contention in the TDX-Module;
   KVM was either working around these in weird, ugly ways, or was simply
   oblivious to them (though even Yan's devilish selftests could only break
   individual VMs, not the host kernel)
 
 - Fix a bug where KVM could corrupt a vCPU's cpu_list when freeing a TDX vCPU,
   if creating said vCPU failed partway through.
 
 - Fix a few sparse warnings (bad annotation, 0 != NULL).
 
 - Use struct_size() to simplify copying TDX capabilities to userspace.
 
 - Fix a bug where TDX would effectively corrupt user-return MSR values if the
   TDX Module rejects VP.ENTER and thus doesn't clobber host MSRs as expected.
 
 Selftests:
 
 - Fix a math goof in mmu_stress_test when running on a single-CPU system/VM.
 
 - Forcefully override ARCH from x86_64 to x86 to play nice with specifying
   ARCH=x86_64 on the command line.
 
 - Extend a bunch of nested VMX to validate nested SVM as well.
 
 - Add support for LA57 in the core VM_MODE_xxx macro, and add a test to
   verify KVM can save/restore nested VMX state when L1 is using 5-level
   paging, but L2 is not.
 
 - Clean up the guest paging code in anticipation of sharing the core logic for
   nested EPT and nested NPT.
 
 guest_memfd:
 
 - Add NUMA mempolicy support for guest_memfd, and clean up a variety of
   rough edges in guest_memfd along the way.
 
 - Define a CLASS to automatically handle get+put when grabbing a guest_memfd
   from a memslot to make it harder to leak references.
 
 - Enhance KVM selftests to make it easer to develop and debug selftests like
   those added for guest_memfd NUMA support, e.g. where test and/or KVM bugs
   often result in hard-to-debug SIGBUS errors.
 
 - Misc cleanups.
 
 Generic:
 
 - Use the recently-added WQ_PERCPU when creating the per-CPU workqueue for
   irqfd cleanup.
 
 - Fix a goof in the dirty ring documentation.
 
 - Fix choice of target for directed yield across different calls to
   kvm_vcpu_on_spin(); the function was always starting from the first
   vCPU instead of continuing the round-robin search.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCgAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmkvMa8UHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroMlFwf+Ow7zOYUuELSQ+Jn+hOYXiCNrdBDx
 ZamvMU8kLPr7XX0Zog6HgcMm//qyA6k5nSfqCjfsQZrIhRA/gWJ61jz1OX/Jxq18
 pJ9Vz6epnEPYiOtBwz+v8OS8MqDqVNzj2i6W1/cLPQE50c1Hhw64HWS5CSxDQiHW
 A7PVfl5YU12lW1vG3uE0sNESDt4Eh/spNM17iddXdF4ZUOGublserjDGjbc17E7H
 8BX3DkC2plqkJKwtjg0ae62hREkITZZc7RqsnftUkEhn0N0H9+rb6NKUyzIVh9NZ
 bCtCjtrKN9zfZ0Mujnms3ugBOVqNIputu/DtPnnFKXtXWSrHrgGSNv5ewA==
 =PEcw
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "ARM:

   - Support for userspace handling of synchronous external aborts
     (SEAs), allowing the VMM to potentially handle the abort in a
     non-fatal manner

   - Large rework of the VGIC's list register handling with the goal of
     supporting more active/pending IRQs than available list registers
     in hardware. In addition, the VGIC now supports EOImode==1 style
     deactivations for IRQs which may occur on a separate vCPU than the
     one that acked the IRQ

   - Support for FEAT_XNX (user / privileged execute permissions) and
     FEAT_HAF (hardware update to the Access Flag) in the software page
     table walkers and shadow MMU

   - Allow page table destruction to reschedule, fixing long
     need_resched latencies observed when destroying a large VM

   - Minor fixes to KVM and selftests

  Loongarch:

   - Get VM PMU capability from HW GCFG register

   - Add AVEC basic support

   - Use 64-bit register definition for EIOINTC

   - Add KVM timer test cases for tools/selftests

  RISC/V:

   - SBI message passing (MPXY) support for KVM guest

   - Give a new, more specific error subcode for the case when in-kernel
     AIA virtualization fails to allocate IMSIC VS-file

   - Support KVM_DIRTY_LOG_INITIALLY_SET, enabling dirty log gradually
     in small chunks

   - Fix guest page fault within HLV* instructions

   - Flush VS-stage TLB after VCPU migration for Andes cores

  s390:

   - Always allocate ESCA (Extended System Control Area), instead of
     starting with the basic SCA and converting to ESCA with the
     addition of the 65th vCPU. The price is increased number of exits
     (and worse performance) on z10 and earlier processor; ESCA was
     introduced by z114/z196 in 2010

   - VIRT_XFER_TO_GUEST_WORK support

   - Operation exception forwarding support

   - Cleanups

  x86:

   - Skip the costly "zap all SPTEs" on an MMIO generation wrap if MMIO
     SPTE caching is disabled, as there can't be any relevant SPTEs to
     zap

   - Relocate a misplaced export

   - Fix an async #PF bug where KVM would clear the completion queue
     when the guest transitioned in and out of paging mode, e.g. when
     handling an SMI and then returning to paged mode via RSM

   - Leave KVM's user-return notifier registered even when disabling
     virtualization, as long as kvm.ko is loaded. On reboot/shutdown,
     keeping the notifier registered is ok; the kernel does not use the
     MSRs and the callback will run cleanly and restore host MSRs if the
     CPU manages to return to userspace before the system goes down

   - Use the checked version of {get,put}_user()

   - Fix a long-lurking bug where KVM's lack of catch-up logic for
     periodic APIC timers can result in a hard lockup in the host

   - Revert the periodic kvmclock sync logic now that KVM doesn't use a
     clocksource that's subject to NTP corrections

   - Clean up KVM's handling of MMIO Stale Data and L1TF, and bury the
     latter behind CONFIG_CPU_MITIGATIONS

   - Context switch XCR0, XSS, and PKRU outside of the entry/exit fast
     path; the only reason they were handled in the fast path was to
     paper of a bug in the core #MC code, and that has long since been
     fixed

   - Add emulator support for AVX MOV instructions, to play nice with
     emulated devices whose guest drivers like to access PCI BARs with
     large multi-byte instructions

  x86 (AMD):

   - Fix a few missing "VMCB dirty" bugs

   - Fix the worst of KVM's lack of EFER.LMSLE emulation

   - Add AVIC support for addressing 4k vCPUs in x2AVIC mode

   - Fix incorrect handling of selective CR0 writes when checking
     intercepts during emulation of L2 instructions

   - Fix a currently-benign bug where KVM would clobber SPEC_CTRL[63:32]
     on VMRUN and #VMEXIT

   - Fix a bug where KVM corrupt the guest code stream when re-injecting
     a soft interrupt if the guest patched the underlying code after the
     VM-Exit, e.g. when Linux patches code with a temporary INT3

   - Add KVM_X86_SNP_POLICY_BITS to advertise supported SNP policy bits
     to userspace, and extend KVM "support" to all policy bits that
     don't require any actual support from KVM

  x86 (Intel):

   - Use the root role from kvm_mmu_page to construct EPTPs instead of
     the current vCPU state, partly as worthwhile cleanup, but mostly to
     pave the way for tracking per-root TLB flushes, and elide EPT
     flushes on pCPU migration if the root is clean from a previous
     flush

   - Add a few missing nested consistency checks

   - Rip out support for doing "early" consistency checks via hardware
     as the functionality hasn't been used in years and is no longer
     useful in general; replace it with an off-by-default module param
     to WARN if hardware fails a check that KVM does not perform

   - Fix a currently-benign bug where KVM would drop the guest's
     SPEC_CTRL[63:32] on VM-Enter

   - Misc cleanups

   - Overhaul the TDX code to address systemic races where KVM (acting
     on behalf of userspace) could inadvertantly trigger lock contention
     in the TDX-Module; KVM was either working around these in weird,
     ugly ways, or was simply oblivious to them (though even Yan's
     devilish selftests could only break individual VMs, not the host
     kernel)

   - Fix a bug where KVM could corrupt a vCPU's cpu_list when freeing a
     TDX vCPU, if creating said vCPU failed partway through

   - Fix a few sparse warnings (bad annotation, 0 != NULL)

   - Use struct_size() to simplify copying TDX capabilities to userspace

   - Fix a bug where TDX would effectively corrupt user-return MSR
     values if the TDX Module rejects VP.ENTER and thus doesn't clobber
     host MSRs as expected

  Selftests:

   - Fix a math goof in mmu_stress_test when running on a single-CPU
     system/VM

   - Forcefully override ARCH from x86_64 to x86 to play nice with
     specifying ARCH=x86_64 on the command line

   - Extend a bunch of nested VMX to validate nested SVM as well

   - Add support for LA57 in the core VM_MODE_xxx macro, and add a test
     to verify KVM can save/restore nested VMX state when L1 is using
     5-level paging, but L2 is not

   - Clean up the guest paging code in anticipation of sharing the core
     logic for nested EPT and nested NPT

  guest_memfd:

   - Add NUMA mempolicy support for guest_memfd, and clean up a variety
     of rough edges in guest_memfd along the way

   - Define a CLASS to automatically handle get+put when grabbing a
     guest_memfd from a memslot to make it harder to leak references

   - Enhance KVM selftests to make it easer to develop and debug
     selftests like those added for guest_memfd NUMA support, e.g. where
     test and/or KVM bugs often result in hard-to-debug SIGBUS errors

   - Misc cleanups

  Generic:

   - Use the recently-added WQ_PERCPU when creating the per-CPU
     workqueue for irqfd cleanup

   - Fix a goof in the dirty ring documentation

   - Fix choice of target for directed yield across different calls to
     kvm_vcpu_on_spin(); the function was always starting from the first
     vCPU instead of continuing the round-robin search"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (260 commits)
  KVM: arm64: at: Update AF on software walk only if VM has FEAT_HAFDBS
  KVM: arm64: at: Use correct HA bit in TCR_EL2 when regime is EL2
  KVM: arm64: Document KVM_PGTABLE_PROT_{UX,PX}
  KVM: arm64: Fix spelling mistake "Unexpeced" -> "Unexpected"
  KVM: arm64: Add break to default case in kvm_pgtable_stage2_pte_prot()
  KVM: arm64: Add endian casting to kvm_swap_s[12]_desc()
  KVM: arm64: Fix compilation when CONFIG_ARM64_USE_LSE_ATOMICS=n
  KVM: arm64: selftests: Add test for AT emulation
  KVM: arm64: nv: Expose hardware access flag management to NV guests
  KVM: arm64: nv: Implement HW access flag management in stage-2 SW PTW
  KVM: arm64: Implement HW access flag management in stage-1 SW PTW
  KVM: arm64: Propagate PTW errors up to AT emulation
  KVM: arm64: Add helper for swapping guest descriptor
  KVM: arm64: nv: Use pgtable definitions in stage-2 walk
  KVM: arm64: Handle endianness in read helper for emulated PTW
  KVM: arm64: nv: Stop passing vCPU through void ptr in S2 PTW
  KVM: arm64: Call helper for reading descriptors directly
  KVM: arm64: nv: Advertise support for FEAT_XNX
  KVM: arm64: Teach ptdump about FEAT_XNX permissions
  KVM: s390: Use generic VIRT_XFER_TO_GUEST_WORK functions
  ...
2025-12-05 17:01:20 -08:00
Linus Torvalds 7cd122b552 Some filesystems use a kinda-sorta controlled dentry refcount leak to pin
dentries of created objects in dcache (and undo it when removing those).
 Reference is grabbed and not released, but it's not actually _stored_
 anywhere.  That works, but it's hard to follow and verify; among other
 things, we have no way to tell _which_ of the increments is intended
 to be an unpaired one.  Worse, on removal we need to decide whether
 the reference had already been dropped, which can be non-trivial if
 that removal is on umount and we need to figure out if this dentry is
 pinned due to e.g. unlink() not done.  Usually that is handled by using
 kill_litter_super() as ->kill_sb(), but there are open-coded special
 cases of the same (consider e.g. /proc/self).
 
 Things get simpler if we introduce a new dentry flag (DCACHE_PERSISTENT)
 marking those "leaked" dentries.  Having it set claims responsibility
 for +1 in refcount.
 
 The end result this series is aiming for:
 
 * get these unbalanced dget() and dput() replaced with new primitives that
   would, in addition to adjusting refcount, set and clear persistency flag.
 * instead of having kill_litter_super() mess with removing the remaining
   "leaked" references (e.g. for all tmpfs files that hadn't been removed
   prior to umount), have the regular shrink_dcache_for_umount() strip
   DCACHE_PERSISTENT of all dentries, dropping the corresponding
   reference if it had been set.  After that kill_litter_super() becomes
   an equivalent of kill_anon_super().
 
 Doing that in a single step is not feasible - it would affect too many places
 in too many filesystems.  It has to be split into a series.
 
 This work has really started early in 2024; quite a few preliminary pieces
 have already gone into mainline.  This chunk is finally getting to the
 meat of that stuff - infrastructure and most of the conversions to it.
 
 Some pieces are still sitting in the local branches, but the bulk of
 that stuff is here.
 
 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCaTEq1wAKCRBZ7Krx/gZQ
 643uAQC1rRslhw5l7OjxEpIYbGG4M+QaadN4Nf5Sr2SuTRaPJQD/W4oj/u4C2eCw
 Dd3q071tqyvm/PXNgN2EEnIaxlFUlwc=
 =rKq+
 -----END PGP SIGNATURE-----

Merge tag 'pull-persistency' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull persistent dentry infrastructure and conversion from Al Viro:
 "Some filesystems use a kinda-sorta controlled dentry refcount leak to
  pin dentries of created objects in dcache (and undo it when removing
  those). A reference is grabbed and not released, but it's not actually
  _stored_ anywhere.

  That works, but it's hard to follow and verify; among other things, we
  have no way to tell _which_ of the increments is intended to be an
  unpaired one. Worse, on removal we need to decide whether the
  reference had already been dropped, which can be non-trivial if that
  removal is on umount and we need to figure out if this dentry is
  pinned due to e.g. unlink() not done. Usually that is handled by using
  kill_litter_super() as ->kill_sb(), but there are open-coded special
  cases of the same (consider e.g. /proc/self).

  Things get simpler if we introduce a new dentry flag
  (DCACHE_PERSISTENT) marking those "leaked" dentries. Having it set
  claims responsibility for +1 in refcount.

  The end result this series is aiming for:

   - get these unbalanced dget() and dput() replaced with new primitives
     that would, in addition to adjusting refcount, set and clear
     persistency flag.

   - instead of having kill_litter_super() mess with removing the
     remaining "leaked" references (e.g. for all tmpfs files that hadn't
     been removed prior to umount), have the regular
     shrink_dcache_for_umount() strip DCACHE_PERSISTENT of all dentries,
     dropping the corresponding reference if it had been set. After that
     kill_litter_super() becomes an equivalent of kill_anon_super().

  Doing that in a single step is not feasible - it would affect too many
  places in too many filesystems. It has to be split into a series.

  This work has really started early in 2024; quite a few preliminary
  pieces have already gone into mainline. This chunk is finally getting
  to the meat of that stuff - infrastructure and most of the conversions
  to it.

  Some pieces are still sitting in the local branches, but the bulk of
  that stuff is here"

* tag 'pull-persistency' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (54 commits)
  d_make_discardable(): warn if given a non-persistent dentry
  kill securityfs_recursive_remove()
  convert securityfs
  get rid of kill_litter_super()
  convert rust_binderfs
  convert nfsctl
  convert rpc_pipefs
  convert hypfs
  hypfs: swich hypfs_create_u64() to returning int
  hypfs: switch hypfs_create_str() to returning int
  hypfs: don't pin dentries twice
  convert gadgetfs
  gadgetfs: switch to simple_remove_by_name()
  convert functionfs
  functionfs: switch to simple_remove_by_name()
  functionfs: fix the open/removal races
  functionfs: need to cancel ->reset_work in ->kill_sb()
  functionfs: don't bother with ffs->ref in ffs_data_{opened,closed}()
  functionfs: don't abuse ffs_data_closed() on fs shutdown
  convert selinuxfs
  ...
2025-12-05 14:36:21 -08:00
Linus Torvalds 7203ca412f Significant patch series in this merge are as follows:
- The 10 patch series "__vmalloc()/kvmalloc() and no-block support" from
   Uladzislau Rezki reworks the vmalloc() code to support non-blocking
   allocations (GFP_ATOIC, GFP_NOWAIT).
 
 - The 2 patch series "ksm: fix exec/fork inheritance" from xu xin fixes
   a rare case where the KSM MMF_VM_MERGE_ANY prctl state is not inherited
   across fork/exec.
 
 - The 4 patch series "mm/zswap: misc cleanup of code and documentations"
   from SeongJae Park does some light maintenance work on the zswap code.
 
 - The 5 patch series "mm/page_owner: add debugfs files 'show_handles'
   and 'show_stacks_handles'" from Mauricio Faria de Oliveira enhances the
   /sys/kernel/debug/page_owner debug feature.  It adds unique identifiers
   to differentiate the various stack traces so that userspace monitoring
   tools can better match stack traces over time.
 
 - The 2 patch series "mm/page_alloc: pcp->batch cleanups" from Joshua
   Hahn makes some minor alterations to the page allocator's per-cpu-pages
   feature.
 
 - The 2 patch series "Improve UFFDIO_MOVE scalability by removing
   anon_vma lock" from Lokesh Gidra addresses a scalability issue in
   userfaultfd's UFFDIO_MOVE operation.
 
 - The 2 patch series "kasan: cleanups for kasan_enabled() checks" from
   Sabyrzhan Tasbolatov performs some cleanup in the KASAN code.
 
 - The 2 patch series "drivers/base/node: fold node register and
   unregister functions" from Donet Tom cleans up the NUMA node handling
   code a little.
 
 - The 4 patch series "mm: some optimizations for prot numa" from Kefeng
   Wang provides some cleanups and small optimizations to the NUMA
   allocation hinting code.
 
 - The 5 patch series "mm/page_alloc: Batch callers of
   free_pcppages_bulk" from Joshua Hahn addresses long lock hold times at
   boot on large machines.  These were causing (harmless) softlockup
   warnings.
 
 - The 2 patch series "optimize the logic for handling dirty file folios
   during reclaim" from Baolin Wang removes some now-unnecessary work from
   page reclaim.
 
 - The 10 patch series "mm/damon: allow DAMOS auto-tuned for per-memcg
   per-node memory usage" from SeongJae Park enhances the DAMOS auto-tuning
   feature.
 
 - The 2 patch series "mm/damon: fixes for address alignment issues in
   DAMON_LRU_SORT and DAMON_RECLAIM" from Quanmin Yan fixes DAMON_LRU_SORT
   and DAMON_RECLAIM with certain userspace configuration.
 
 - The 15 patch series "expand mmap_prepare functionality, port more
   users" from Lorenzo Stoakes enhances the new(ish)
   file_operations.mmap_prepare() method and ports additional callsites
   from the old ->mmap() over to ->mmap_prepare().
 
 - The 8 patch series "Fix stale IOTLB entries for kernel address space"
   from Lu Baolu fixes a bug (and possible security issue on non-x86) in
   the IOMMU code.  In some situations the IOMMU could be left hanging onto
   a stale kernel pagetable entry.
 
 - The 4 patch series "mm/huge_memory: cleanup __split_unmapped_folio()"
   from Wei Yang cleans up and optimizes the folio splitting code.
 
 - The 5 patch series "mm, swap: misc cleanup and bugfix" from Kairui
   Song implements some cleanups and a minor fix in the swap discard code.
 
 - The 8 patch series "mm/damon: misc documentation fixups" from SeongJae
   Park does as advertised.
 
 - The 9 patch series "mm/damon: support pin-point targets removal" from
   SeongJae Park permits userspace to remove a specific monitoring target
   in the middle of the current targets list.
 
 - The 2 patch series "mm: MISC follow-up patches for linux/pgalloc.h"
   from Harry Yoo implements a couple of cleanups related to mm header file
   inclusion.
 
 - The 2 patch series "mm/swapfile.c: select swap devices of default
   priority round robin" from Baoquan He improves the selection of swap
   devices for NUMA machines.
 
 - The 3 patch series "mm: Convert memory block states (MEM_*) macros to
   enums" from Israel Batista changes the memory block labels from macros
   to enums so they will appear in kernel debug info.
 
 - The 3 patch series "ksm: perform a range-walk to jump over holes in
   break_ksm" from Pedro Demarchi Gomes addresses an inefficiency when KSM
   unmerges an address range.
 
 - The 22 patch series "mm/damon/tests: fix memory bugs in kunit tests"
   from SeongJae Park fixes leaks and unhandled malloc() failures in DAMON
   userspace unit tests.
 
 - The 2 patch series "some cleanups for pageout()" from Baolin Wang
   cleans up a couple of minor things in the page scanner's
   writeback-for-eviction code.
 
 - The 2 patch series "mm/hugetlb: refactor sysfs/sysctl interfaces" from
   Hui Zhu moves hugetlb's sysfs/sysctl handling code into a new file.
 
 - The 9 patch series "introduce VM_MAYBE_GUARD and make it sticky" from
   Lorenzo Stoakes makes the VMA guard regions available in /proc/pid/smaps
   and improves the mergeability of guarded VMAs.
 
 - The 2 patch series "mm: perform guard region install/remove under VMA
   lock" from Lorenzo Stoakes reduces mmap lock contention for callers
   performing VMA guard region operations.
 
 - The 2 patch series "vma_start_write_killable" from Matthew Wilcox
   starts work in permitting applications to be killed when they are
   waiting on a read_lock on the VMA lock.
 
 - The 11 patch series "mm/damon/tests: add more tests for online
   parameters commit" from SeongJae Park adds additional userspace testing
   of DAMON's "commit" feature.
 
 - The 9 patch series "mm/damon: misc cleanups" from SeongJae Park does
   that.
 
 - The 2 patch series "make VM_SOFTDIRTY a sticky VMA flag" from Lorenzo
   Stoakes addresses the possible loss of a VMA's VM_SOFTDIRTY flag when
   that VMA is merged with another.
 
 - The 16 patch series "mm: support device-private THP" from Balbir Singh
   introduces support for Transparent Huge Page (THP) migration in zone
   device-private memory.
 
 - The 3 patch series "Optimize folio split in memory failure" from Zi
   Yan optimizes folio split operations in the memory failure code.
 
 - The 2 patch series "mm/huge_memory: Define split_type and consolidate
   split support checks" from Wei Yang provides some more cleanups in the
   folio splitting code.
 
 - The 16 patch series "mm: remove is_swap_[pte, pmd]() + non-swap
   entries, introduce leaf entries" from Lorenzo Stoakes cleans up our
   handling of pagetable leaf entries by introducing the concept of
   'software leaf entries', of type softleaf_t.
 
 - The 4 patch series "reparent the THP split queue" from Muchun Song
   reparents the THP split queue to its parent memcg.  This is in
   preparation for addressing the long-standing "dying memcg" problem,
   wherein dead memcg's linger for too long, consuming memory resources.
 
 - The 3 patch series "unify PMD scan results and remove redundant
   cleanup" from Wei Yang does a little cleanup in the hugepage collapse
   code.
 
 - The 6 patch series "zram: introduce writeback bio batching" from
   Sergey Senozhatsky improves zram writeback efficiency by introducing
   batched bio writeback support.
 
 - The 4 patch series "memcg: cleanup the memcg stats interfaces" from
   Shakeel Butt cleans up our handling of the interrupt safety of some
   memcg stats.
 
 - The 4 patch series "make vmalloc gfp flags usage more apparent" from
   Vishal Moola cleans up vmalloc's handling of incoming GFP flags.
 
 - The 6 patch series "mm: Add soft-dirty and uffd-wp support for RISC-V"
   from Chunyan Zhang teches soft dirty and userfaultfd write protect
   tracking to use RISC-V's Svrsw60t59b extension.
 
 - The 5 patch series "mm: swap: small fixes and comment cleanups" from
   Youngjun Park fixes a small bug and cleans up some of the swap code.
 
 - The 4 patch series "initial work on making VMA flags a bitmap" from
   Lorenzo Stoakes starts work on converting the vma struct's flags to a
   bitmap, so we stop running out of them, especially on 32-bit.
 
 - The 2 patch series "mm/swapfile: fix and cleanup swap list iterations"
   from Youngjun Park addresses a possible bug in the swap discard code and
   cleans things up a little.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaTEb0wAKCRDdBJ7gKXxA
 jjfIAP94W4EkCCwNOupnChoG+YWw/JW21anXt5NN+i5svn1yugEAwzvv6A+cAFng
 o+ug/fyrfPZG7PLp2R8WFyGIP0YoBA4=
 =IUzS
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2025-12-03-21-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

  "__vmalloc()/kvmalloc() and no-block support" (Uladzislau Rezki)
     Rework the vmalloc() code to support non-blocking allocations
     (GFP_ATOIC, GFP_NOWAIT)

  "ksm: fix exec/fork inheritance" (xu xin)
     Fix a rare case where the KSM MMF_VM_MERGE_ANY prctl state is not
     inherited across fork/exec

  "mm/zswap: misc cleanup of code and documentations" (SeongJae Park)
     Some light maintenance work on the zswap code

  "mm/page_owner: add debugfs files 'show_handles' and 'show_stacks_handles'" (Mauricio Faria de Oliveira)
     Enhance the /sys/kernel/debug/page_owner debug feature by adding
     unique identifiers to differentiate the various stack traces so
     that userspace monitoring tools can better match stack traces over
     time

  "mm/page_alloc: pcp->batch cleanups" (Joshua Hahn)
     Minor alterations to the page allocator's per-cpu-pages feature

  "Improve UFFDIO_MOVE scalability by removing anon_vma lock" (Lokesh Gidra)
     Address a scalability issue in userfaultfd's UFFDIO_MOVE operation

  "kasan: cleanups for kasan_enabled() checks" (Sabyrzhan Tasbolatov)

  "drivers/base/node: fold node register and unregister functions" (Donet Tom)
     Clean up the NUMA node handling code a little

  "mm: some optimizations for prot numa" (Kefeng Wang)
     Cleanups and small optimizations to the NUMA allocation hinting
     code

  "mm/page_alloc: Batch callers of free_pcppages_bulk" (Joshua Hahn)
     Address long lock hold times at boot on large machines. These were
     causing (harmless) softlockup warnings

  "optimize the logic for handling dirty file folios during reclaim" (Baolin Wang)
     Remove some now-unnecessary work from page reclaim

  "mm/damon: allow DAMOS auto-tuned for per-memcg per-node memory usage" (SeongJae Park)
     Enhance the DAMOS auto-tuning feature

  "mm/damon: fixes for address alignment issues in DAMON_LRU_SORT and DAMON_RECLAIM" (Quanmin Yan)
     Fix DAMON_LRU_SORT and DAMON_RECLAIM with certain userspace
     configuration

  "expand mmap_prepare functionality, port more users" (Lorenzo Stoakes)
     Enhance the new(ish) file_operations.mmap_prepare() method and port
     additional callsites from the old ->mmap() over to ->mmap_prepare()

  "Fix stale IOTLB entries for kernel address space" (Lu Baolu)
     Fix a bug (and possible security issue on non-x86) in the IOMMU
     code. In some situations the IOMMU could be left hanging onto a
     stale kernel pagetable entry

  "mm/huge_memory: cleanup __split_unmapped_folio()" (Wei Yang)
     Clean up and optimize the folio splitting code

  "mm, swap: misc cleanup and bugfix" (Kairui Song)
     Some cleanups and a minor fix in the swap discard code

  "mm/damon: misc documentation fixups" (SeongJae Park)

  "mm/damon: support pin-point targets removal" (SeongJae Park)
     Permit userspace to remove a specific monitoring target in the
     middle of the current targets list

  "mm: MISC follow-up patches for linux/pgalloc.h" (Harry Yoo)
     A couple of cleanups related to mm header file inclusion

  "mm/swapfile.c: select swap devices of default priority round robin" (Baoquan He)
     improve the selection of swap devices for NUMA machines

  "mm: Convert memory block states (MEM_*) macros to enums" (Israel Batista)
     Change the memory block labels from macros to enums so they will
     appear in kernel debug info

  "ksm: perform a range-walk to jump over holes in break_ksm" (Pedro Demarchi Gomes)
     Address an inefficiency when KSM unmerges an address range

  "mm/damon/tests: fix memory bugs in kunit tests" (SeongJae Park)
     Fix leaks and unhandled malloc() failures in DAMON userspace unit
     tests

  "some cleanups for pageout()" (Baolin Wang)
     Clean up a couple of minor things in the page scanner's
     writeback-for-eviction code

  "mm/hugetlb: refactor sysfs/sysctl interfaces" (Hui Zhu)
     Move hugetlb's sysfs/sysctl handling code into a new file

  "introduce VM_MAYBE_GUARD and make it sticky" (Lorenzo Stoakes)
     Make the VMA guard regions available in /proc/pid/smaps and
     improves the mergeability of guarded VMAs

  "mm: perform guard region install/remove under VMA lock" (Lorenzo Stoakes)
     Reduce mmap lock contention for callers performing VMA guard region
     operations

  "vma_start_write_killable" (Matthew Wilcox)
     Start work on permitting applications to be killed when they are
     waiting on a read_lock on the VMA lock

  "mm/damon/tests: add more tests for online parameters commit" (SeongJae Park)
     Add additional userspace testing of DAMON's "commit" feature

  "mm/damon: misc cleanups" (SeongJae Park)

  "make VM_SOFTDIRTY a sticky VMA flag" (Lorenzo Stoakes)
     Address the possible loss of a VMA's VM_SOFTDIRTY flag when that
     VMA is merged with another

  "mm: support device-private THP" (Balbir Singh)
     Introduce support for Transparent Huge Page (THP) migration in zone
     device-private memory

  "Optimize folio split in memory failure" (Zi Yan)

  "mm/huge_memory: Define split_type and consolidate split support checks" (Wei Yang)
     Some more cleanups in the folio splitting code

  "mm: remove is_swap_[pte, pmd]() + non-swap entries, introduce leaf entries" (Lorenzo Stoakes)
     Clean up our handling of pagetable leaf entries by introducing the
     concept of 'software leaf entries', of type softleaf_t

  "reparent the THP split queue" (Muchun Song)
     Reparent the THP split queue to its parent memcg. This is in
     preparation for addressing the long-standing "dying memcg" problem,
     wherein dead memcg's linger for too long, consuming memory
     resources

  "unify PMD scan results and remove redundant cleanup" (Wei Yang)
     A little cleanup in the hugepage collapse code

  "zram: introduce writeback bio batching" (Sergey Senozhatsky)
     Improve zram writeback efficiency by introducing batched bio
     writeback support

  "memcg: cleanup the memcg stats interfaces" (Shakeel Butt)
     Clean up our handling of the interrupt safety of some memcg stats

  "make vmalloc gfp flags usage more apparent" (Vishal Moola)
     Clean up vmalloc's handling of incoming GFP flags

  "mm: Add soft-dirty and uffd-wp support for RISC-V" (Chunyan Zhang)
     Teach soft dirty and userfaultfd write protect tracking to use
     RISC-V's Svrsw60t59b extension

  "mm: swap: small fixes and comment cleanups" (Youngjun Park)
     Fix a small bug and clean up some of the swap code

  "initial work on making VMA flags a bitmap" (Lorenzo Stoakes)
     Start work on converting the vma struct's flags to a bitmap, so we
     stop running out of them, especially on 32-bit

  "mm/swapfile: fix and cleanup swap list iterations" (Youngjun Park)
     Address a possible bug in the swap discard code and clean things
     up a little

[ This merge also reverts commit ebb9aeb980 ("vfio/nvgrace-gpu:
  register device memory for poison handling") because it looks
  broken to me, I've asked for clarification   - Linus ]

* tag 'mm-stable-2025-12-03-21-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits)
  mm: fix vma_start_write_killable() signal handling
  mm/swapfile: use plist_for_each_entry in __folio_throttle_swaprate
  mm/swapfile: fix list iteration when next node is removed during discard
  fs/proc/task_mmu.c: fix make_uffd_wp_huge_pte() huge pte handling
  mm/kfence: add reboot notifier to disable KFENCE on shutdown
  memcg: remove inc/dec_lruvec_kmem_state helpers
  selftests/mm/uffd: initialize char variable to Null
  mm: fix DEBUG_RODATA_TEST indentation in Kconfig
  mm: introduce VMA flags bitmap type
  tools/testing/vma: eliminate dependency on vma->__vm_flags
  mm: simplify and rename mm flags function for clarity
  mm: declare VMA flags by bit
  zram: fix a spelling mistake
  mm/page_alloc: optimize lowmem_reserve max lookup using its semantic monotonicity
  mm/vmscan: skip increasing kswapd_failures when reclaim was boosted
  pagemap: update BUDDY flag documentation
  mm: swap: remove scan_swap_map_slots() references from comments
  mm: swap: change swap_alloc_slow() to void
  mm, swap: remove redundant comment for read_swap_cache_async
  mm, swap: use SWP_SOLIDSTATE to determine if swap is rotational
  ...
2025-12-05 13:52:43 -08:00
Linus Torvalds a3ebb59eee VFIO updates for v6.19-rc1
- Move libvfio selftest artifacts in preparation of more tightly
    coupled integration with KVM selftests. (David Matlack)
 
  - Fix comment typo in mtty driver. (Chu Guangqing)
 
  - Support for new hardware revision in the hisi_acc vfio-pci variant
    driver where the migration registers can now be accessed via the PF.
    When enabled for this support, the full BAR can be exposed to the
    user. (Longfang Liu)
 
  - Fix vfio cdev support for VF token passing, using the correct size
    for the kernel structure, thereby actually allowing userspace to
    provide a non-zero UUID token.  Also set the match token callback for
    the hisi_acc, fixing VF token support for this this vfio-pci variant
    driver. (Raghavendra Rao Ananta)
 
  - Introduce internal callbacks on vfio devices to simplify and
    consolidate duplicate code for generating VFIO_DEVICE_GET_REGION_INFO
    data, removing various ioctl intercepts with a more structured
    solution. (Jason Gunthorpe)
 
  - Introduce dma-buf support for vfio-pci devices, allowing MMIO regions
    to be exposed through dma-buf objects with lifecycle managed through
    move operations.  This enables low-level interactions such as a
    vfio-pci based SPDK drivers interacting directly with dma-buf capable
    RDMA devices to enable peer-to-peer operations.  IOMMUFD is also now
    able to build upon this support to fill a long standing feature gap
    versus the legacy vfio type1 IOMMU backend with an implementation of
    P2P support for VM use cases that better manages the lifecycle of the
    P2P mapping. (Leon Romanovsky, Jason Gunthorpe, Vivek Kasireddy)
 
  - Convert eventfd triggering for error and request signals to use RCU
    mechanisms in order to avoid a 3-way lockdep reported deadlock issue.
    (Alex Williamson)
 
  - Fix a 32-bit overflow introduced via dma-buf support manifesting with
    large DMA buffers. (Alex Mastro)
 
  - Convert nvgrace-gpu vfio-pci variant driver to insert mappings on
    fault rather than at mmap time.  This conversion serves both to make
    use of huge PFNMAPs but also to both avoid corrected RAS events
    during reset by now being subject to vfio-pci-core's use of
    unmap_mapping_range(), and to enable a device readiness test after
    reset. (Ankit Agrawal)
 
  - Refactoring of vfio selftests to support multi-device tests and split
    code to provide better separation between IOMMU and device objects.
    This work also enables a new test suite addition to measure parallel
    device initialization latency. (David Matlack)
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEQvbATlQL0amee4qQI5ubbjuwiyIFAmkvV3IRHGFsZXhAc2hh
 emJvdC5vcmcACgkQI5ubbjuwiyIpIQ/9GwpjLH5Vdv0v2d9mkHmZIWFpG/tr3zJa
 +spQqOjO0etASc67PtIJArT9pWib+s6O8OaG7iFrdNR65HCSsXSZbIGbMThPODfy
 DdDj1ipAqMVwcaCZT8un2N8Sktu9YpFQMvc5IoXWWYhw88vili7bBx+OTrEFV2T0
 6qQijSBdhw1TXVFHG6BGSmqmisyMepIebA6GmPWdfYu6BfoWBYMdcMjDwd1J61Q5
 DDwFRzn/Dz2Tvb1jbXiiRMRuFIuegFQii+wtd30S/cRPFZhZLWzc+drimC6oOFiQ
 qL19vQQsBPnLtGvch40HsET/AbY5w0pLCkYX5qacxP3sq27+N+KuotzCvbnVMN+H
 e2BqOCujyoce8z1Br6BzV71Lr2yzPDcc5pXTuEuuBT+J/ptOY8hfEikOj85s5Wzj
 aKsTrdDRGMrn/o11NkGSzYwFcMs9MxCX9mo98U6OkWDr0+cmPLf4LGZgpJudWg4E
 POUlzPpnzJrTlX5d+OqCdKJG0a1hTlTa2udzRa5hCDANHaZWLoAssfgSEKfV9xt1
 PzOMf0UIJmPJmFcw/OpMO72/5xp8O4WslJS0ulSm6vrAJDtutLApHZ7bJ44KniNd
 4vte+gOjyZY8ibTDKRULhXVlCDxkEnZjRBbApgI9HJD61IElOzjqohRuRx77J09B
 7c8OSLI8d1U=
 =tpee
 -----END PGP SIGNATURE-----

Merge tag 'vfio-v6.19-rc1' of https://github.com/awilliam/linux-vfio

Pull VFIO updates from Alex Williamson:

 - Move libvfio selftest artifacts in preparation of more tightly
   coupled integration with KVM selftests (David Matlack)

 - Fix comment typo in mtty driver (Chu Guangqing)

 - Support for new hardware revision in the hisi_acc vfio-pci variant
   driver where the migration registers can now be accessed via the PF.
   When enabled for this support, the full BAR can be exposed to the
   user (Longfang Liu)

 - Fix vfio cdev support for VF token passing, using the correct size
   for the kernel structure, thereby actually allowing userspace to
   provide a non-zero UUID token. Also set the match token callback for
   the hisi_acc, fixing VF token support for this this vfio-pci variant
   driver (Raghavendra Rao Ananta)

 - Introduce internal callbacks on vfio devices to simplify and
   consolidate duplicate code for generating VFIO_DEVICE_GET_REGION_INFO
   data, removing various ioctl intercepts with a more structured
   solution (Jason Gunthorpe)

 - Introduce dma-buf support for vfio-pci devices, allowing MMIO regions
   to be exposed through dma-buf objects with lifecycle managed through
   move operations. This enables low-level interactions such as a
   vfio-pci based SPDK drivers interacting directly with dma-buf capable
   RDMA devices to enable peer-to-peer operations. IOMMUFD is also now
   able to build upon this support to fill a long standing feature gap
   versus the legacy vfio type1 IOMMU backend with an implementation of
   P2P support for VM use cases that better manages the lifecycle of the
   P2P mapping (Leon Romanovsky, Jason Gunthorpe, Vivek Kasireddy)

 - Convert eventfd triggering for error and request signals to use RCU
   mechanisms in order to avoid a 3-way lockdep reported deadlock issue
   (Alex Williamson)

 - Fix a 32-bit overflow introduced via dma-buf support manifesting with
   large DMA buffers (Alex Mastro)

 - Convert nvgrace-gpu vfio-pci variant driver to insert mappings on
   fault rather than at mmap time. This conversion serves both to make
   use of huge PFNMAPs but also to both avoid corrected RAS events
   during reset by now being subject to vfio-pci-core's use of
   unmap_mapping_range(), and to enable a device readiness test after
   reset (Ankit Agrawal)

 - Refactoring of vfio selftests to support multi-device tests and split
   code to provide better separation between IOMMU and device objects.
   This work also enables a new test suite addition to measure parallel
   device initialization latency (David Matlack)

* tag 'vfio-v6.19-rc1' of https://github.com/awilliam/linux-vfio: (65 commits)
  vfio: selftests: Add vfio_pci_device_init_perf_test
  vfio: selftests: Eliminate INVALID_IOVA
  vfio: selftests: Split libvfio.h into separate header files
  vfio: selftests: Move vfio_selftests_*() helpers into libvfio.c
  vfio: selftests: Rename vfio_util.h to libvfio.h
  vfio: selftests: Stop passing device for IOMMU operations
  vfio: selftests: Move IOVA allocator into iova_allocator.c
  vfio: selftests: Move IOMMU library code into iommu.c
  vfio: selftests: Rename struct vfio_dma_region to dma_region
  vfio: selftests: Upgrade driver logging to dev_err()
  vfio: selftests: Prefix logs with device BDF where relevant
  vfio: selftests: Eliminate overly chatty logging
  vfio: selftests: Support multiple devices in the same container/iommufd
  vfio: selftests: Introduce struct iommu
  vfio: selftests: Rename struct vfio_iommu_mode to iommu_mode
  vfio: selftests: Allow passing multiple BDFs on the command line
  vfio: selftests: Split run.sh into separate scripts
  vfio: selftests: Move run.sh into scripts directory
  vfio/nvgrace-gpu: wait for the GPU mem to be ready
  vfio/nvgrace-gpu: Inform devmem unmapped after reset
  ...
2025-12-04 18:42:48 -08:00
Joshua Hahn 83c8f7b5e1 mm/mm_init: Introduce a boot parameter for check_pages
Use-after-free and double-free bugs can be very difficult to track down.
The kernel is good at tracking these and preventing bad pages from being
used/created through simple checks gated behind "check_pages_enabled".

Currently, the only ways to enable this flag is by building with
CONFIG_DEBUG_VM, or as a side effect of other checks such as
init_on_{alloc, free}, page_poisoning, or debug_pagealloc among others.
These solutions are powerful, but may often be too coarse in balancing
the performance vs. safety that a user may want, particularly in
latency-sensitive production environments.

Introduce a new boot parameter "check_pages", which enables page checking
with no other side effects. It takes kstrbool-able inputs as an argument
(i.e. 0/1, true/false, on/off, ...). This patch is backwards-compatible;
setting CONFIG_DEBUG_VM still enables page checking.

Acked-by: SeongJae Park <sj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Link: https://patch.msgid.link/20251201180739.2330474-1-joshua.hahnjy@gmail.com
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
2025-12-04 19:40:25 +02:00
Linus Torvalds b687034b1a slab updates for 6.19
-----BEGIN PGP SIGNATURE-----
 
 iQFPBAABCAA5FiEEe7vIQRWZI0iWSE3xu+CwddJFiJoFAmksibgbFIAAAAAABAAO
 bWFudTIsMi41KzEuMTEsMiwyAAoJELvgsHXSRYiavR8H/jTNKlb8jZtre1Q2xIGJ
 PgU8+fc4PGX8C6XuKRgb4KYL+zn3VSnTyxLUc3ObKIRTrGOJOBw3YT8R0LvrMOJx
 Ibx/6o0o+vjnDxmq6QGcuYdytDdL/rL6Gh8PR1dyWAqPz6jGtraP0nCJu7Y9jRZ0
 JHbyRTfpC8I6fTZv/WHocTsUDUu/+M4jQx3kMAMgSSTc7IAF+El5GqhpwEaWv7u/
 6D0px1lXI3rGimzmHeLy+CEjW041MTkxPH3GNzkiZwi2WUwI+ZEteMcs29KHcCOe
 /sdqmlzn2CPxzqG3TkJ4LbJE3XThYkqxe56LmBVJnhHFe+vCF8urEX9UUTtMn1dh
 3zs=
 =iQ4N
 -----END PGP SIGNATURE-----

Merge tag 'slab-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab

Pull slab updates from Vlastimil Babka:

 - mempool_alloc_bulk() support for upcoming users in the block layer
   that need to allocate multiple objects at once with the mempool's
   guaranteed progress semantics, which is not achievable with an
   allocation single objects in a loop. Along with refactoring and
   various improvements (Christoph Hellwig)

 - Preparations for the upcoming separation of struct slab from struct
   page, mostly by removing the struct folio layer, as the purpose of
   struct folio has shifted since it became used in slab code (Matthew
   Wilcox)

 - Modernisation of slab's boot param API usage, which removes some
   unexpected parsing corner cases (Petr Tesarik)

 - Refactoring of freelist_aba_t (now struct freelist_counters) and
   associated functions for double cmpxchg, enabled by -fms-extensions
   (Vlastimil Babka)

 - Cleanups and improvements related to sheaves caching layer, that were
   part of the full conversion to sheaves, which is planned for the next
   release (Vlastimil Babka)

* tag 'slab-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: (42 commits)
  slab: Remove unnecessary call to compound_head() in alloc_from_pcs()
  mempool: clarify behavior of mempool_alloc_preallocated()
  mempool: drop the file name in the top of file comment
  mempool: de-typedef
  mempool: remove mempool_{init,create}_kvmalloc_pool
  mempool: legitimize the io_schedule_timeout in mempool_alloc_from_pool
  mempool: add mempool_{alloc,free}_bulk
  mempool: factor out a mempool_alloc_from_pool helper
  slab: Remove references to folios from virt_to_slab()
  kasan: Remove references to folio in __kasan_mempool_poison_object()
  memcg: Convert mem_cgroup_from_obj_folio() to mem_cgroup_from_obj_slab()
  mempool: factor out a mempool_adjust_gfp helper
  mempool: add error injection support
  mempool: improve kerneldoc comments
  mm: improve kerneldoc comments for __alloc_pages_bulk
  fault-inject: make enum fault_flags available unconditionally
  usercopy: Remove folio references from check_heap_object()
  slab: Remove folio references from kfree_nolock()
  slab: Remove folio references from kfree_rcu_sheaf()
  slab: Remove folio references from build_detached_freelist()
  ...
2025-12-03 11:53:47 -08:00
Linus Torvalds 51e3b98d73 selinux/stable-6.19 PR 20251201
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCgAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmkuAKEUHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXPKeA/8DSW+sTkQ9BMGGnyuH1uU/r84qtVh
 Ft6pnIPzrogE/GKcQeFgFA9D7gQbB8J39PSxZLS3lp0UiuPCuq+D09L+uzDKzDCD
 Avfe84dwsI5OiplPKyHiG3bF9W2+A1zkwH2j+5uC6yF8v9J9vglo4u5vAYeE2wxA
 X4b2r9jMm7WJ/KFNiSiiLGEhOSjVVUrJULcmWMRPPruplPDC4dLnqYTWTbkrfF8h
 /oXv/+ssqbj6FqfL4WaRnjN8GgZcwaWy1qu9LVlZ40iphpbVAyPBJPLJS6Q4hhOl
 mMHUbYkxALPyW7riQxoXAegQjJyGgKn8Bli9U6bkiKFA2yeIhJFX+OyV1SlOAs/J
 g6s5XfeCzqY0Tw3eqvT1YRhp10GcA7EtBYvhAe5ARq7PkMoqxmiI587piVX9hbos
 a0AH9CDNoOw+8QXx27sOoD1YIaiYD9fikXKymrzRRaW/GX6i43XIKiELBMuKoIVZ
 iwualvQiGBLLczzm5rdqPcLgp09Agn4AHfvFWXKFgS4+IJGKjeeXNOjsp9oFEivq
 RnXmDpa+nBud5zeTSeSpOY2L0pvuIG5N25N6U9bTsDe+4Y6p0qIAUy8e4sQ0PA8P
 xyp9/fcNr9jwHeLTjDbxZqZ+MU3GLIIVPdl0zq4z2J8nhkW3wD3pQX6B4qPIuXLx
 YP3nwhAT9T+hU7w=
 =IvVa
 -----END PGP SIGNATURE-----

Merge tag 'selinux-pr-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux

Pull selinux updates from Paul Moore:

 - Improve the granularity of SELinux labeling for memfd files

   Currently when creating a memfd file, SELinux treats it the same as
   any other tmpfs, or hugetlbfs, file. While simple, the drawback is
   that it is not possible to differentiate between memfd and tmpfs
   files.

   This adds a call to the security_inode_init_security_anon() LSM hook
   and wires up SELinux to provide a set of memfd specific access
   controls, including the ability to control the execution of memfds.

   As usual, the commit message has more information.

 - Improve the SELinux AVC lookup performance

   Adopt MurmurHash3 for the SELinux AVC hash function instead of the
   custom hash function currently used. MurmurHash3 is already used for
   the SELinux access vector table so the impact to the code is minimal,
   and performance tests have shown improvements in both hash
   distribution and latency.

   See the commit message for the performance measurments.

 - Introduce a Kconfig option for the SELinux AVC bucket/slot size

   While we have the ability to grow the number of AVC hash buckets
   today, the size of the buckets (slot size) is fixed at 512. This pull
   request makes that slot size configurable at build time through a new
   Kconfig knob, CONFIG_SECURITY_SELINUX_AVC_HASH_BITS.

* tag 'selinux-pr-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
  selinux: improve bucket distribution uniformity of avc_hash()
  selinux: Move avtab_hash() to a shared location for future reuse
  selinux: Introduce a new config to make avc cache slot size adjustable
  memfd,selinux: call security_inode_init_security_anon()
2025-12-03 10:45:47 -08:00
Linus Torvalds 44fc84337b arm64 updates for 6.19:
Core features:
 
  - Basic Arm MPAM (Memory system resource Partitioning And Monitoring)
    driver under drivers/resctrl/ which makes use of the fs/rectrl/ API
 
 Perf and PMU:
 
  - Avoid cycle counter on multi-threaded CPUs
 
  - Extend CSPMU device probing and add additional filtering support for
    NVIDIA implementations
 
  - Add support for the PMUs on the NoC S3 interconnect
 
  - Add additional compatible strings for new Cortex and C1 CPUs
 
  - Add support for data source filtering to the SPE driver
 
  - Add support for i.MX8QM and "DB" PMU in the imx PMU driver
 
 Memory managemennt:
 
  - Avoid broadcast TLBI if page reused in write fault
 
  - Elide TLB invalidation if the old PTE was not valid
 
  - Drop redundant cpu_set_*_tcr_t0sz() macros
 
  - Propagate pgtable_alloc() errors outside of __create_pgd_mapping()
 
  - Propagate return value from __change_memory_common()
 
 ACPI and EFI:
 
  - Call EFI runtime services without disabling preemption
 
  - Remove unused ACPI function
 
 Miscellaneous:
 
  - ptrace support to disable streaming on SME-only systems
 
  - Improve sysreg generation to include a 'Prefix' descriptor
 
  - Replace __ASSEMBLY__ with __ASSEMBLER__
 
  - Align register dumps in the kselftest zt-test
 
  - Remove some no longer used macros/functions
 
  - Various spelling corrections
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE5RElWfyWxS+3PLO2a9axLQDIXvEFAmkvMjkACgkQa9axLQDI
 XvGaGg//dtT/ZAqrWa6Yniv1LOlh837C07YdxAYTTuJ+I87DnrxIqjwbW+ye+bF+
 61RTkioeCUm3PH+ncO9gPVNi4ASZ1db3/Rc8Fb6rr1TYOI1sMIeBsbbVdRJgsbX6
 zu9197jOBHscTAeDceB6jZBDyW8iSLINPZ7LN6lGxXsZM/Vn5zfE0heKEEio6Fsx
 +AzO2vos0XcwBR9vFGXtiCDx57T+/cXUtrWfA0Cjz4nvHSgD8+ghS+Jwv+kHMt1L
 zrarqbeQfj+Iixm9PVHiazv+8THo9QdNl1yGLxDmJ4LEVPewjW5jBs8+5e8e3/Gj
 p5JEvmSyWvKTTbFoM5vhxC72A7yuT1QwAk2iCyFIxMbQ25PndHboKVp/569DzOkT
 +6CjI88sVSP6D7bVlN6pFlzc/Fa07YagnDMnMCSfk4LBjUfE3jYb+usaFydyv/rl
 jwZbJrnSF/H+uQlyoJFgOEXSoQdDsll3dv6yEsUCwbd8RqXbAe3svbguOUHSdvIj
 sCViezGZQ7Rkn6D21AfF9j6e7ceaSDaf5DWMxPI3dAxFKG8TJbCBsToR59NnoSj+
 bNEozbZ1mCxmwH8i43wZ6P0RkClvJnoXcvRA+TJj02fSZACO39d3XDNswfXWL41r
 KiWGUJZyn2lPKtiAWVX6pSBtDJ+5rFhuoFgADLX6trkxDe9/EMQ=
 =4Sb6
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Catalin Marinas:
 "These are the arm64 updates for 6.19.

  The biggest part is the Arm MPAM driver under drivers/resctrl/.
  There's a patch touching mm/ to handle spurious faults for huge pmd
  (similar to the pte version). The corresponding arm64 part allows us
  to avoid the TLB maintenance if a (huge) page is reused after a write
  fault. There's EFI refactoring to allow runtime services with
  preemption enabled and the rest is the usual perf/PMU updates and
  several cleanups/typos.

  Summary:

  Core features:

   - Basic Arm MPAM (Memory system resource Partitioning And Monitoring)
     driver under drivers/resctrl/ which makes use of the fs/rectrl/ API

  Perf and PMU:

   - Avoid cycle counter on multi-threaded CPUs

   - Extend CSPMU device probing and add additional filtering support
     for NVIDIA implementations

   - Add support for the PMUs on the NoC S3 interconnect

   - Add additional compatible strings for new Cortex and C1 CPUs

   - Add support for data source filtering to the SPE driver

   - Add support for i.MX8QM and "DB" PMU in the imx PMU driver

  Memory managemennt:

   - Avoid broadcast TLBI if page reused in write fault

   - Elide TLB invalidation if the old PTE was not valid

   - Drop redundant cpu_set_*_tcr_t0sz() macros

   - Propagate pgtable_alloc() errors outside of __create_pgd_mapping()

   - Propagate return value from __change_memory_common()

  ACPI and EFI:

   - Call EFI runtime services without disabling preemption

   - Remove unused ACPI function

  Miscellaneous:

   - ptrace support to disable streaming on SME-only systems

   - Improve sysreg generation to include a 'Prefix' descriptor

   - Replace __ASSEMBLY__ with __ASSEMBLER__

   - Align register dumps in the kselftest zt-test

   - Remove some no longer used macros/functions

   - Various spelling corrections"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (94 commits)
  arm64/mm: Document why linear map split failure upon vm_reset_perms is not problematic
  arm64/pageattr: Propagate return value from __change_memory_common
  arm64/sysreg: Remove unused define ARM64_FEATURE_FIELD_BITS
  KVM: arm64: selftests: Consider all 7 possible levels of cache
  KVM: arm64: selftests: Remove ARM64_FEATURE_FIELD_BITS and its last user
  arm64: atomics: lse: Remove unused parameters from ATOMIC_FETCH_OP_AND macros
  Documentation/arm64: Fix the typo of register names
  ACPI: GTDT: Get rid of acpi_arch_timer_mem_init()
  perf: arm_spe: Add support for filtering on data source
  perf: Add perf_event_attr::config4
  perf/imx_ddr: Add support for PMU in DB (system interconnects)
  perf/imx_ddr: Get and enable optional clks
  perf/imx_ddr: Move ida_alloc() from ddr_perf_init() to ddr_perf_probe()
  dt-bindings: perf: fsl-imx-ddr: Add compatible string for i.MX8QM, i.MX8QXP and i.MX8DXL
  arm64: remove duplicate ARCH_HAS_MEM_ENCRYPT
  arm64: mm: use untagged address to calculate page index
  MAINTAINERS: new entry for MPAM Driver
  arm_mpam: Add kunit tests for props_mismatch()
  arm_mpam: Add kunit test for bitmap reset
  arm_mpam: Add helper to reset saved mbwu state
  ...
2025-12-02 17:03:55 -08:00
Linus Torvalds 2547f79b0b s390 updates for 6.19 merge window
- Provide a new interface for dynamic configuration and deconfiguration of
   hotplug memory, allowing with and without memmap_on_memory support. This
   makes the way memory hotplug is handled on s390 much more similar to
   other architectures
 
 - Remove compat support. There shouldn't be any compat user space around
   anymore, therefore get rid of a lot of code which also doesn't need to be
   tested anymore
 
 - Add stackprotector support. GCC 16 will get new compiler options, which
   allow to generate code required for kernel stackprotector support
 
 - Merge pai_crypto and pai_ext PMU drivers into a new driver. This removes
   a lot of duplicated code. The new driver is also extendable and allows
   to support new PMUs
 
 - Add driver override support for AP queues
 
 - Rework and extend zcrypt and AP trace events to allow for tracing of
   crypto requests
 
 - Support block sizes larger than 65535 bytes for CCW tape devices
 
 - Since the rework of the virtual kernel address space the module area and
   the kernel image are within the same 4GB area. This eliminates the need
   of weak per cpu variables. Get rid of ARCH_MODULE_NEEDS_WEAK_PER_CPU
 
 - Various other small improvements and fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEECMNfWEw3SLnmiLkZIg7DeRspbsIFAmktZioACgkQIg7DeRsp
 bsK4Rw//VzkvHyzOtGKZ8Hb4S+Sh/PFlaZQXNhj+Xt5gWoOhP1uPmmhBe6LxjYaB
 J9Ns3hpONQ1dTHV7VVkds8FvM/SBcGe8m5RpefmChC/bjm5UEOV/MppKtA0aLnEH
 hJmdubIrrRAXKggxlHEfRLzBsFvV/rJ9Xf16FhRxGDc4pgmgkI1NPQ41/dyCHklQ
 dB3YrFVPIETywVYYVB/G3h11JgF5Z6CKtjYCdSx72Fkbj65+6JPfcPgLKMpcJuPd
 UxUXtCo1FCXlP70jsz8JQI8cdieG0KDQTtnZP4P/pqjQ3wirOqvMewNa9t9xmQ2e
 p6Rc1Vx5DESkq9bRWtQEaprTVVzK7DDLH3RuZwB+uLrcLGD8JvVS6/m9n9CgzBMT
 BnJXG2sLZH+gdQy+DSD/fVDD7OvIk8TGrH+OFwVIKhrT/J3B2E7ZSYyZZCNIS7VG
 yiuypoDGYg3ZpYjH9+qOXWB3nc0vQWrlFzb1bsQu1omJGmunLv4jtTjAKGN82C33
 auBsIYAlQW20X7DV0vZa59PwqwtBqtdQQcTidwtSztzKogRXAdK8KKHtN60JM4S2
 7sWFOFCQaTChAeDNw6MF5EtULb551nwH2RtJ9x3CrJj+OGK6clbQNcxIA7Oy0veR
 Sl9v1lMfeKOgDrPdDy3ArQBJ8WLlF9qX9wLKbiaNyIKmkz2ymkg=
 =CNrb
 -----END PGP SIGNATURE-----

Merge tag 's390-6.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux

Pull s390 updates from Heiko Carstens:

 - Provide a new interface for dynamic configuration and deconfiguration
   of hotplug memory, allowing with and without memmap_on_memory
   support. This makes the way memory hotplug is handled on s390 much
   more similar to other architectures

 - Remove compat support. There shouldn't be any compat user space
   around anymore, therefore get rid of a lot of code which also doesn't
   need to be tested anymore

 - Add stackprotector support. GCC 16 will get new compiler options,
   which allow to generate code required for kernel stackprotector
   support

 - Merge pai_crypto and pai_ext PMU drivers into a new driver. This
   removes a lot of duplicated code. The new driver is also extendable
   and allows to support new PMUs

 - Add driver override support for AP queues

 - Rework and extend zcrypt and AP trace events to allow for tracing of
   crypto requests

 - Support block sizes larger than 65535 bytes for CCW tape devices

 - Since the rework of the virtual kernel address space the module area
   and the kernel image are within the same 4GB area. This eliminates
   the need of weak per cpu variables. Get rid of
   ARCH_MODULE_NEEDS_WEAK_PER_CPU

 - Various other small improvements and fixes

* tag 's390-6.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (92 commits)
  watchdog: diag288_wdt: Remove KMSG_COMPONENT macro
  s390/entry: Use lay instead of aghik
  s390/vdso: Get rid of -m64 flag handling
  s390/vdso: Rename vdso64 to vdso
  s390: Rename head64.S to head.S
  s390/vdso: Use common STABS_DEBUG and DWARF_DEBUG macros
  s390: Add stackprotector support
  s390/modules: Simplify module_finalize() slightly
  s390: Remove KMSG_COMPONENT macro
  s390/percpu: Get rid of ARCH_MODULE_NEEDS_WEAK_PER_CPU
  s390/ap: Restrict driver_override versus apmask and aqmask use
  s390/ap: Rename mutex ap_perms_mutex to ap_attr_mutex
  s390/ap: Support driver_override for AP queue devices
  s390/ap: Use all-bits-one apmask/aqmask for vfio in_use() checks
  s390/debug: Update description of resize operation
  s390/syscalls: Switch to generic system call table generation
  s390/syscalls: Remove system call table pointer from thread_struct
  s390/uapi: Remove 31 bit support from uapi header files
  s390: Remove compat support
  tools: Remove s390 compat support
  ...
2025-12-02 16:37:00 -08:00
Linus Torvalds 1b5dd29869 vfs-6.19-rc1.fd_prepare.fs
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaSmOZwAKCRCRxhvAZXjc
 op0AAP4oNVJkFyvgKoPos5K2EGFB8M7merGhpYtsOoeg8UK6OwD/UySQErHsXQDR
 sUDDa5uFOhfrkcfM8REtAN4wF8p5qAc=
 =QgFD
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.19-rc1.fd_prepare.fs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull fd prepare updates from Christian Brauner:
 "This adds the FD_ADD() and FD_PREPARE() primitive. They simplify the
  common pattern of get_unused_fd_flags() + create file + fd_install()
  that is used extensively throughout the kernel and currently requires
  cumbersome cleanup paths.

  FD_ADD() - For simple cases where a file is installed immediately:

      fd = FD_ADD(O_CLOEXEC, vfio_device_open_file(device));
      if (fd < 0)
          vfio_device_put_registration(device);
      return fd;

  FD_PREPARE() - For cases requiring access to the fd or file, or
  additional work before publishing:

      FD_PREPARE(fdf, O_CLOEXEC, sync_file->file);
      if (fdf.err) {
          fput(sync_file->file);
          return fdf.err;
      }

      data.fence = fd_prepare_fd(fdf);
      if (copy_to_user((void __user *)arg, &data, sizeof(data)))
          return -EFAULT;

      return fd_publish(fdf);

  The primitives are centered around struct fd_prepare. FD_PREPARE()
  encapsulates all allocation and cleanup logic and must be followed by
  a call to fd_publish() which associates the fd with the file and
  installs it into the caller's fdtable. If fd_publish() isn't called,
  both are deallocated automatically. FD_ADD() is a shorthand that does
  fd_publish() immediately and never exposes the struct to the caller.

  I've implemented this in a way that it's compatible with the cleanup
  infrastructure while also being usable separately. IOW, it's centered
  around struct fd_prepare which is aliased to class_fd_prepare_t and so
  we can make use of all the basica guard infrastructure"

* tag 'vfs-6.19-rc1.fd_prepare.fs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (42 commits)
  io_uring: convert io_create_mock_file() to FD_PREPARE()
  file: convert replace_fd() to FD_PREPARE()
  vfio: convert vfio_group_ioctl_get_device_fd() to FD_ADD()
  tty: convert ptm_open_peer() to FD_ADD()
  ntsync: convert ntsync_obj_get_fd() to FD_PREPARE()
  media: convert media_request_alloc() to FD_PREPARE()
  hv: convert mshv_ioctl_create_partition() to FD_ADD()
  gpio: convert linehandle_create() to FD_PREPARE()
  pseries: port papr_rtas_setup_file_interface() to FD_ADD()
  pseries: convert papr_platform_dump_create_handle() to FD_ADD()
  spufs: convert spufs_gang_open() to FD_PREPARE()
  papr-hvpipe: convert papr_hvpipe_dev_create_handle() to FD_PREPARE()
  spufs: convert spufs_context_open() to FD_PREPARE()
  net/socket: convert __sys_accept4_file() to FD_ADD()
  net/socket: convert sock_map_fd() to FD_ADD()
  net/kcm: convert kcm_ioctl() to FD_PREPARE()
  net/handshake: convert handshake_nl_accept_doit() to FD_PREPARE()
  secretmem: convert memfd_secret() to FD_ADD()
  memfd: convert memfd_create() to FD_ADD()
  bpf: convert bpf_token_create() to FD_PREPARE()
  ...
2025-12-01 17:32:07 -08:00
Linus Torvalds f2e74ecfba vfs-6.19-rc1.folio
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaSmOZQAKCRCRxhvAZXjc
 onGBAQDtqeO0jZzS7q9UxlJ84Wj/H9w+9INpO4jMxtWK4svhUAEAghG4qVxRvkE2
 Qh+wrpTPIC7OCQ78k8psDRmkj9cn8QA=
 =FCVN
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.19-rc1.folio' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull folio updates from Christian Brauner:
 "Add a new folio_next_pos() helper function that returns the file
  position of the first byte after the current folio. This is a common
  operation in filesystems when needing to know the end of the current
  folio.

  The helper is lifted from btrfs which already had its own version, and
  is now used across multiple filesystems and subsystems:
   - btrfs
   - buffer
   - ext4
   - f2fs
   - gfs2
   - iomap
   - netfs
   - xfs
   - mm

  This fixes a long-standing bug in ocfs2 on 32-bit systems with files
  larger than 2GiB. Presumably this is not a common configuration, but
  the fix is backported anyway. The other filesystems did not have bugs,
  they were just mildly inefficient.

  This also introduce uoff_t as the unsigned version of loff_t. A recent
  commit inadvertently changed a comparison from being unsigned (on
  64-bit systems) to being signed (which it had always been on 32-bit
  systems), leading to sporadic fstests failures.

  Generally file sizes are restricted to being a signed integer, but in
  places where -1 is passed to indicate "up to the end of the file", it
  is convenient to have an unsigned type to ensure comparisons are
  always unsigned regardless of architecture"

* tag 'vfs-6.19-rc1.folio' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fs: Add uoff_t
  mm: Use folio_next_pos()
  xfs: Use folio_next_pos()
  netfs: Use folio_next_pos()
  iomap: Use folio_next_pos()
  gfs2: Use folio_next_pos()
  f2fs: Use folio_next_pos()
  ext4: Use folio_next_pos()
  buffer: Use folio_next_pos()
  btrfs: Use folio_next_pos()
  filemap: Add folio_next_pos()
2025-12-01 10:26:38 -08:00
Linus Torvalds ebaeabfa5a vfs-6.19-rc1.writeback
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaSmOZQAKCRCRxhvAZXjc
 or4UAP9FbpFsZd0DpsYnKuv7kFepl291PuR0x2dKmseJ/wcf8AEAzI8FR5wd/fey
 25ZNdExoUojAOj5wVn+jUep3u54jBws=
 =/toi
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.19-rc1.writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull writeback updates from Christian Brauner:
 "Features:

   - Allow file systems to increase the minimum writeback chunk size.

     The relatively low minimal writeback size of 4MiB means that
     written back inodes on rotational media are switched a lot. Besides
     introducing additional seeks, this also can lead to extreme file
     fragmentation on zoned devices when a lot of files are cached
     relative to the available writeback bandwidth.

     This adds a superblock field that allows the file system to
     override the default size, and sets it to the zone size for zoned
     XFS.

   - Add logging for slow writeback when it exceeds
     sysctl_hung_task_timeout_secs. This helps identify tasks waiting
     for a long time and pinpoint potential issues. Recording the
     starting jiffies is also useful when debugging a crashed vmcore.

   - Wake up waiting tasks when finishing the writeback of a chunk

  Cleanups:

   - filemap_* writeback interface cleanups.

     Adding filemap_fdatawrite_wbc ended up being a mistake, as all but
     the original btrfs caller should be using better high level
     interfaces instead.

     This series removes all these low-level interfaces, switches btrfs
     to a more specific interface, and cleans up other too low-level
     interfaces. With this the writeback_control that is passed to the
     writeback code is only initialized in three places.

   - Remove __filemap_fdatawrite, __filemap_fdatawrite_range, and
     filemap_fdatawrite_wbc

   - Add filemap_flush_nr helper for btrfs

   - Push struct writeback_control into start_delalloc_inodes in btrfs

   - Rename filemap_fdatawrite_range_kick to filemap_flush_range

   - Stop opencoding filemap_fdatawrite_range in 9p, ocfs2, and mm

   - Make wbc_to_tag() inline and use it in fs"

* tag 'vfs-6.19-rc1.writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fs: Make wbc_to_tag() inline and use it in fs.
  xfs: set s_min_writeback_pages for zoned file systems
  writeback: allow the file system to override MIN_WRITEBACK_PAGES
  writeback: cleanup writeback_chunk_size
  mm: rename filemap_fdatawrite_range_kick to filemap_flush_range
  mm: remove __filemap_fdatawrite_range
  mm: remove filemap_fdatawrite_wbc
  mm: remove __filemap_fdatawrite
  mm,btrfs: add a filemap_flush_nr helper
  btrfs: push struct writeback_control into start_delalloc_inodes
  btrfs: use the local tmp_inode variable in start_delalloc_inodes
  ocfs2: don't opencode filemap_fdatawrite_range in ocfs2_journal_submit_inode_data_buffers
  9p: don't opencode filemap_fdatawrite_range in v9fs_mmap_vm_close
  mm: don't opencode filemap_fdatawrite_range in filemap_invalidate_inode
  writeback: Add logging for slow writeback (exceeds sysctl_hung_task_timeout_secs)
  writeback: Wake up waiting tasks when finishing the writeback of a chunk.
2025-12-01 09:20:51 -08:00
Linus Torvalds 9368f0f941 vfs-6.19-rc1.inode
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaSmOZAAKCRCRxhvAZXjc
 omMSAP9GLhavxyWQ24Q+49CNWWRQWDY1wTOiUK2BwtIvZ0YEcAD8D1dAiMckL5pC
 RwEAVA5p+y+qi+bZP0KXCBxQddoTIQM=
 =zo/J
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.19-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs inode updates from Christian Brauner:
 "Features:

   - Hide inode->i_state behind accessors. Open-coded accesses prevent
     asserting they are done correctly. One obvious aspect is locking,
     but significantly more can be checked. For example it can be
     detected when the code is clearing flags which are already missing,
     or is setting flags when it is illegal (e.g., I_FREEING when
     ->i_count > 0)

   - Provide accessors for ->i_state, converts all filesystems using
     coccinelle and manual conversions (btrfs, ceph, smb, f2fs, gfs2,
     overlayfs, nilfs2, xfs), and makes plain ->i_state access fail to
     compile

   - Rework I_NEW handling to operate without fences, simplifying the
     code after the accessor infrastructure is in place

  Cleanups:

   - Move wait_on_inode() from writeback.h to fs.h

   - Spell out fenced ->i_state accesses with explicit smp_wmb/smp_rmb
     for clarity

   - Cosmetic fixes to LRU handling

   - Push list presence check into inode_io_list_del()

   - Touch up predicts in __d_lookup_rcu()

   - ocfs2: retire ocfs2_drop_inode() and I_WILL_FREE usage

   - Assert on ->i_count in iput_final()

   - Assert ->i_lock held in __iget()

  Fixes:

   - Add missing fences to I_NEW handling"

* tag 'vfs-6.19-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (22 commits)
  dcache: touch up predicts in __d_lookup_rcu()
  fs: push list presence check into inode_io_list_del()
  fs: cosmetic fixes to lru handling
  fs: rework I_NEW handling to operate without fences
  fs: make plain ->i_state access fail to compile
  xfs: use the new ->i_state accessors
  nilfs2: use the new ->i_state accessors
  overlayfs: use the new ->i_state accessors
  gfs2: use the new ->i_state accessors
  f2fs: use the new ->i_state accessors
  smb: use the new ->i_state accessors
  ceph: use the new ->i_state accessors
  btrfs: use the new ->i_state accessors
  Manual conversion to use ->i_state accessors of all places not covered by coccinelle
  Coccinelle-based conversion to use ->i_state accessors
  fs: provide accessors for ->i_state
  fs: spell out fenced ->i_state accesses with explicit smp_wmb/smp_rmb
  fs: move wait_on_inode() from writeback.h to fs.h
  fs: add missing fences to I_NEW handling
  ocfs2: retire ocfs2_drop_inode() and I_WILL_FREE usage
  ...
2025-12-01 09:02:34 -08:00
Linus Torvalds 1885cdbfbb vfs-6.19-rc1.iomap
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaSmOZAAKCRCRxhvAZXjc
 ooCXAQCwzX2GS/55QHV6JXBBoNxguuSQ5dCj91ZmTfHzij0xNAEAhKEBw7iMGX72
 c2/x+xYf+Pc6mAfxdus5RLMggqBFPAk=
 =jInB
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.19-rc1.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull iomap updates from Christian Brauner:
 "FUSE iomap Support for Buffered Reads:

    This adds iomap support for FUSE buffered reads and readahead. This
    enables granular uptodate tracking with large folios so only
    non-uptodate portions need to be read. Also fixes a race condition
    with large folios + writeback cache that could cause data corruption
    on partial writes followed by reads.

     - Refactored iomap read/readahead bio logic into helpers
     - Added caller-provided callbacks for read operations
     - Moved buffered IO bio logic into new file
     - FUSE now uses iomap for read_folio and readahead

  Zero Range Folio Batch Support:

    Add folio batch support for iomap_zero_range() to handle dirty
    folios over unwritten mappings. Fix raciness issues where dirty data
    could be lost during zero range operations.

     - filemap_get_folios_tag_range() helper for dirty folio lookup
     - Optional zero range dirty folio processing
     - XFS fills dirty folios on zero range of unwritten mappings
     - Removed old partial EOF zeroing optimization

  DIO Write Completions from Interrupt Context:

    Restore pre-iomap behavior where pure overwrite completions run
    inline rather than being deferred to workqueue. Reduces context
    switches for high-performance workloads like ScyllaDB.

     - Removed unused IOCB_DIO_CALLER_COMP code
     - Error completions always run in user context (fixes zonefs)
     - Reworked REQ_FUA selection logic
     - Inverted IOMAP_DIO_INLINE_COMP to IOMAP_DIO_OFFLOAD_COMP

  Buffered IO Cleanups:

    Some performance and code clarity improvements:

     - Replace manual bitmap scanning with find_next_bit()
     - Simplify read skip logic for writes
     - Optimize pending async writeback accounting
     - Better variable naming
     - Documentation for iomap_finish_folio_write() requirements

  Misaligned Vectors for Zoned XFS:

    Enables sub-block aligned vectors in XFS always-COW mode for zoned
    devices via new IOMAP_DIO_FSBLOCK_ALIGNED flag.

  Bug Fixes:

     - Allocate s_dio_done_wq for async reads (fixes syzbot report after
       error completion changes)
     - Fix iomap_read_end() for already uptodate folios (regression fix)"

* tag 'vfs-6.19-rc1.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (40 commits)
  iomap: allocate s_dio_done_wq for async reads as well
  iomap: fix iomap_read_end() for already uptodate folios
  iomap: invert the polarity of IOMAP_DIO_INLINE_COMP
  iomap: support write completions from interrupt context
  iomap: rework REQ_FUA selection
  iomap: always run error completions in user context
  fs, iomap: remove IOCB_DIO_CALLER_COMP
  iomap: use find_next_bit() for uptodate bitmap scanning
  iomap: use find_next_bit() for dirty bitmap scanning
  iomap: simplify when reads can be skipped for writes
  iomap: simplify ->read_folio_range() error handling for reads
  iomap: optimize pending async writeback accounting
  docs: document iomap writeback's iomap_finish_folio_write() requirement
  iomap: account for unaligned end offsets when truncating read range
  iomap: rename bytes_pending/bytes_accounted to bytes_submitted/bytes_not_submitted
  xfs: support sub-block aligned vectors in always COW mode
  iomap: add IOMAP_DIO_FSBLOCK_ALIGNED flag
  xfs: error tag to force zeroing on debug kernels
  iomap: remove old partial eof zeroing optimization
  xfs: fill dirty folios on zero range of unwritten mappings
  ...
2025-12-01 08:14:00 -08:00
Matthew Wilcox (Oracle) faf3c92352 mm: fix vma_start_write_killable() signal handling
If we get a signal, we need to restore the vm_refcnt.  We don't think that
the refcount can actually be decremented to zero here as it requires the
VMA to be detached, and the vma_mark_detached() uses TASK_UNINTERRUPTIBLE.
However, that's a bit subtle, so handle it as if the refcount was zero at
the start of this function.

Link: https://lkml.kernel.org/r/20251128040100.3022561-1-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Reported-by: syzbot+5b19bad23ac7f44bf8b8@syzkaller.appspotmail.com
Fixes: 2197bb60f8 ("mm: add vma_start_write_killable()")
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-29 10:41:11 -08:00
Youngjun Park b60a3ef784 mm/swapfile: use plist_for_each_entry in __folio_throttle_swaprate
The loop breaks immediately after finding the first swap device and
never modifies the list. Replace plist_for_each_entry_safe() with
plist_for_each_entry() and remove the unused next variable.

Link: https://lkml.kernel.org/r/20251127100303.783198-3-youngjun.park@lge.com
Signed-off-by: Youngjun Park <youngjun.park@lge.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Acked-by: Kairui Song <kasong@tencent.com>
Acked-by: Chris Li <chrisl@kernel.org>
Cc: Barry Song <baohua@kernel.org>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-29 10:41:11 -08:00
Youngjun Park f9e82f99b3 mm/swapfile: fix list iteration when next node is removed during discard
Patch series "mm/swapfile: fix and cleanup swap list iterations", v2.

This series fixes a potential list iteration issue in swap_sync_discard()
when devices are removed, and includes a cleanup for
__folio_throttle_swaprate().


This patch (of 2):

When the next node is removed from the plist (e.g.  by swapoff),
plist_del() makes the node point to itself, causing the iteration to loop
on the same entry indefinitely.

Add a plist_node_empty() check to detect this case and restart iteration,
allowing swap_sync_discard() to continue processing remaining swap devices
that still have pending discard entries.

Additionally, switch from swap_avail_lock/swap_avail_head to
swap_lock/swap_active_head so that iteration is only affected by swapoff
operations rather than frequent availability changes, reducing exceptional
condition checks and lock contention.

Link: https://lkml.kernel.org/r/20251127100303.783198-1-youngjun.park@lge.com
Link: https://lkml.kernel.org/r/20251127100303.783198-2-youngjun.park@lge.com
Fixes: 686ea517f471 ("mm, swap: do not perform synchronous discard during allocation")
Signed-off-by: Youngjun Park <youngjun.park@lge.com>
Suggested-by: Kairui Song <kasong@tencent.com>
Acked-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chris Li <chrisl@kernel.org>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-29 10:41:10 -08:00
Breno Leitao ce2bba8956 mm/kfence: add reboot notifier to disable KFENCE on shutdown
During system shutdown, KFENCE can cause IPI synchronization issues if it
remains active through the reboot process.  To prevent this, register a
reboot notifier that disables KFENCE and cancels any pending timer work
early in the shutdown sequence.

This is only necessary when CONFIG_KFENCE_STATIC_KEYS is enabled, as this
configuration sends IPIs that can interfere with shutdown.  Without static
keys, no IPIs are generated and KFENCE can safely remain active.

The notifier uses maximum priority (INT_MAX) to ensure KFENCE shuts down
before other subsystems that might still depend on stable memory
allocation behavior.

This fixes a late kexec CSD lockup[1] when kfence is trying to IPI a CPU
that is busy in a IRQ-disabled context printing characters to the console.

Link: https://lkml.kernel.org/r/20251127-kfence-v2-1-daeccb5ef9aa@debian.org
Link: https://lkml.kernel.org/r/20251126-kfence-v1-1-5a6e1d7c681c@debian.org
Link: https://lore.kernel.org/all/sqwajvt7utnt463tzxgwu2yctyn5m6bjwrslsnupfexeml6hkd@v6sqmpbu3vvu/ [1]
Fixes: 0ce20dd840 ("mm: add Kernel Electric-Fence infrastructure")
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-29 10:41:10 -08:00
Chen Ridong f3b566d726 memcg: remove inc/dec_lruvec_kmem_state helpers
The dec_lruvec_kmem_state helper is unused by any caller and can be safely
removed.  Meanwhile, the inc_lruvec_kmem_state helper is only referenced
by shadow_lru_isolate, retaining these two helpers is unnecessary.  This
patch removes both helper functions to eliminate redundant code.

Link: https://lkml.kernel.org/r/20251126020435.1511637-1-chenridong@huaweicloud.com
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Lu Jialin <lujialin4@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Wei Xu <weixugc@google.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-29 10:41:10 -08:00
Geert Uytterhoeven f65372cd7a mm: fix DEBUG_RODATA_TEST indentation in Kconfig
Most of the DEBUG_RODATA_TEST section is indented by four spaces instead
of the customary single TAB.

Link: https://lkml.kernel.org/r/74f39b1bffc6ed802088cb3e7d17b4c82330e8b3.1764058676.git.geert@linux-m68k.org
Fixes: 2959a5f726 ("mm: add arch-independent testcases for RODATA")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Jinbum Park <jinb.park7@gmail.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-29 10:41:09 -08:00
Lorenzo Stoakes 2b6a3f061f mm: declare VMA flags by bit
Patch series "initial work on making VMA flags a bitmap", v3.

We are in the rather silly situation that we are running out of VMA flags
as they are currently limited to a system word in size.

This leads to absurd situations where we limit features to 64-bit
architectures only because we simply do not have the ability to add a flag
for 32-bit ones.

This is very constraining and leads to hacks or, in the worst case, simply
an inability to implement features we want for entirely arbitrary reasons.

This also of course gives us something of a Y2K type situation in mm where
we might eventually exhaust all of the VMA flags even on 64-bit systems.

This series lays the groundwork for getting away from this limitation by
establishing VMA flags as a bitmap whose size we can increase in future
beyond 64 bits if required.

This is necessarily a highly iterative process given the extensive use of
VMA flags throughout the kernel, so we start by performing basic steps.

Firstly, we declare VMA flags by bit number rather than by value,
retaining the VM_xxx fields but in terms of these newly introduced
VMA_xxx_BIT fields.

While we are here, we use sparse annotations to ensure that, when dealing
with VMA bit number parameters, we cannot be passed values which are not
declared as such - providing some useful type safety.

We then introduce an opaque VMA flag type, much like the opaque mm_struct
flag type introduced in commit bb6525f2f8 ("mm: add bitmap mm->flags
field"), which we establish in union with vma->vm_flags (but still set at
system word size meaning there is no functional or data type size change).

We update the vm_flags_xxx() helpers to use this new bitmap, introducing
sensible helpers to do so.

This series lays the foundation for further work to expand the use of
bitmap VMA flags and eventually eliminate these arbitrary restrictions.


This patch (of 4):

In order to lay the groundwork for VMA flags being a bitmap rather than a
system word in size, we need to be able to consistently refer to VMA flags
by bit number rather than value.

Take this opportunity to do so in an enum which we which is additionally
useful for tooling to extract metadata from.

This additionally makes it very clear which bits are being used for what
at a glance.

We use the VMA_ prefix for the bit values as it is logical to do so since
these reference VMAs.  We consistently suffix with _BIT to make it clear
what the values refer to.

We declare bit values even when the flags that use them would not be
enabled by config options as this is simply clearer and clearly defines
what bit numbers are used for what, at no additional cost.

We declare a sparse-bitwise type vma_flag_t which ensures that users can't
pass around invalid VMA flags by accident and prepares for future work
towards VMA flags being a bitmap where we want to ensure bit values are
type safe.

To make life easier, we declare some macro helpers - DECLARE_VMA_BIT()
allows us to avoid duplication in the enum bit number declarations (and
maintaining the sparse __bitwise attribute), and INIT_VM_FLAG() is used to
assist with declaration of flags.

Unfortunately we can't declare both in the enum, as we run into issue with
logic in the kernel requiring that flags are preprocessor definitions, and
additionally we cannot have a macro which declares another macro so we
must define each flag macro directly.

Additionally, update the VMA userland testing vma_internal.h header to
include these changes.

We also have to fix the parameters to the vma_flag_*_atomic() functions
since VMA_MAYBE_GUARD_BIT is now of type vma_flag_t and sparse will
complain otherwise.

We have to update some rather silly if-deffery found in mm/task_mmu.c
which would otherwise break.

Finally, we update the rust binding helper as now it cannot auto-detect
the flags at all.

Link: https://lkml.kernel.org/r/cover.1764064556.git.lorenzo.stoakes@oracle.com
Link: https://lkml.kernel.org/r/3a35e5a0bcfa00e84af24cbafc0653e74deda64a.1764064556.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Acked-by: Alice Ryhl <aliceryhl@google.com>	[rust]
Cc: Alex Gaynor <alex.gaynor@gmail.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andreas Hindborg <a.hindborg@kernel.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Björn Roy Baron <bjorn3_gh@protonmail.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Chris Li <chrisl@kernel.org>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Gary Guo <gary@garyguo.net>
Cc: Gregory Price <gourry@gourry.net>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Kairui Song <kasong@tencent.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Mathew Brost <matthew.brost@intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Trevor Gross <tmgross@umich.edu>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Wei Xu <weixugc@google.com>
Cc: xu xin <xu.xin16@zte.com.cn>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-29 10:41:08 -08:00
fujunjie a493c7a650 mm/page_alloc: optimize lowmem_reserve max lookup using its semantic monotonicity
calculate_totalreserve_pages() currently finds the maximum
lowmem_reserve[j] for a zone by scanning the full forward range [j =
zone_idx ..  MAX_NR_ZONES).  However, for a given zone i, the
lowmem_reserve[j] array (for j > i) is naturally expected to form a
monotonically non-decreasing sequence in j, not as an implementation
detail, but as a consequence that naturally arises from the semantics of
lowmem_reserve[].

For zone "i", lowmem_reserve[j] expresses how many pages in zone i must
effectively be kept in reserve when deciding whether an allocation class
that may allocate from zones up to j is allowed to fall back into i.  It
protects less flexible allocation classes (which cannot use higher zones)
from being starved by more flexible ones.

Viewed from this semantics, it is natural to expect a partial ordering in
j: as j increases, the allocation class gains access to a strictly larger
set of fallback zones.  Therefore lowmem_reserve[j] is expected to be
monotonically non-decreasing in j: more flexible allocation classes must
not be allowed to deplete low zones more aggressively than less flexible
ones.

In other words, if lowmem_reserve[j] were ever observed to *decrease* as j
grows, that would be unexpected from the reserve semantics' point of view
and would likely indicate a semantic change or a misconfiguration.

The current implementation in setup_per_zone_lowmem_reserve() reflects
this policy by accumulating managed pages from higher zones and applying
the configured ratio, which results in a non-decreasing sequence.  This
patch makes calculate_totalreserve_pages() rely on that monotonicity
explicitly and finds the maximum reserve value by scanning backward and
stopping at the first non-zero entry.  This avoids unnecessary iteration
and reflects the conceptual model more directly.  No functional behavior
changes.

To maintain this assumption explicitly, a comment is added next to
setup_per_zone_lowmem_reserve() documenting the monotonicity expectation
and noting that calculate_totalreserve_pages() relies on it.

Link: https://lkml.kernel.org/r/tencent_EB0FED91B01B1F8B6DAEE96719C5F5797F07@qq.com
Signed-off-by: fujunjie <fujunjie1@qq.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-29 10:41:08 -08:00
Jiayuan Chen 3cf41edc20 mm/vmscan: skip increasing kswapd_failures when reclaim was boosted
We have a colocation cluster used for deploying both offline and online
services simultaneously.  In this environment, we encountered a
scenario where direct memory reclamation was triggered due to kswapd
not running.

1. When applications start up, rapidly consume memory, or experience
   network traffic bursts, the kernel reaches steal_suitable_fallback(),
   which sets watermark_boost and subsequently wakes kswapd.

2. In the core logic of kswapd thread (balance_pgdat()), when reclaim is
   triggered by watermark_boost, the maximum priority is 10. Higher
   priority values mean less aggressive LRU scanning, which can result in
   no pages being reclaimed during a single scan cycle:

   if (nr_boost_reclaim && sc.priority == DEF_PRIORITY - 2)
       raise_priority = false;

3. Additionally, many of our pods are configured with memory.low, which
   prevents memory reclamation in certain cgroups, further increasing the
   chance of failing to reclaim memory.

4. This eventually causes pgdat->kswapd_failures to continuously
   accumulate, exceeding MAX_RECLAIM_RETRIES, and consequently kswapd
   stops working.  At this point, the system's available memory is still
   significantly above the high watermark -- it's inappropriate for kswapd
   to stop under these conditions.

The final observable issue is that a brief period of rapid memory
allocation causes kswapd to stop running, ultimately triggering direct
reclaim and making the applications unresponsive.

This problem leading to direct memory reclamation has been a
long-standing issue in our production environment.  We initially held
the simple assumption that it was caused by applications allocating
memory too rapidly for kswapd to keep up with reclamation.  However,
after we began monitoring kswapd's runtime behavior, we discovered a
different pattern:

kswapd initially exhibits very aggressive activity even when there is
still considerable free memory, but it subsequently stops running
entirely, even as memory levels approach the low watermark.

In summary, both boosted watermarks and memory.low increase the
probability of kswapd operation failures.

This patch specifically addresses the scenario involving boosted
watermarks by not incrementing kswapd_failures when reclamation fails. 
A more general solution, potentially addressing memory.low or other
cases, requires further discussion.

Link: https://lkml.kernel.org/r/53de0b3ee0b822418e909db29bfa6513faff9d36@linux.dev
Link: https://lkml.kernel.org/r/20251024022711.382238-1-jiayuan.chen@linux.dev
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-29 10:41:07 -08:00
Christian Brauner 910c361f9a
secretmem: convert memfd_secret() to FD_ADD()
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-26-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-28 12:42:34 +01:00
Christian Brauner 1afcbbe5d6
memfd: convert memfd_create() to FD_ADD()
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-25-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-28 12:42:34 +01:00
Pratyush Yadav b3749f174d mm: memfd_luo: allow preserving memfd
The ability to preserve a memfd allows userspace to use KHO and LUO to
transfer its memory contents to the next kernel.  This is useful in many
ways.  For one, it can be used with IOMMUFD as the backing store for IOMMU
page tables.  Preserving IOMMUFD is essential for performing a hypervisor
live update with passthrough devices.  memfd support provides the first
building block for making that possible.

For another, applications with a large amount of memory that takes time to
reconstruct, reboots to consume kernel upgrades can be very expensive. 
memfd with LUO gives those applications reboot-persistent memory that they
can use to quickly save and reconstruct that state.

While memfd is backed by either hugetlbfs or shmem, currently only support
on shmem is added.  To be more precise, support for anonymous shmem files
is added.

The handover to the next kernel is not transparent.  All the properties of
the file are not preserved; only its memory contents, position, and size. 
The recreated file gets the UID and GID of the task doing the restore, and
the task's cgroup gets charged with the memory.

Once preserved, the file cannot grow or shrink, and all its pages are
pinned to avoid migrations and swapping.  The file can still be read from
or written to.

Use vmalloc to get the buffer to hold the folios, and preserve it using
kho_preserve_vmalloc().  This doesn't have the size limit.

Link: https://lkml.kernel.org/r/20251125165850.3389713-15-pasha.tatashin@soleen.com
Signed-off-by: Pratyush Yadav <ptyadav@amazon.de>
Co-developed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Tested-by: David Matlack <dmatlack@google.com>
Cc: Aleksander Lobakin <aleksander.lobakin@intel.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Andriy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: anish kumar <yesanishhere@gmail.com>
Cc: Anna Schumaker <anna.schumaker@oracle.com>
Cc: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Chanwoo Choi <cw00.choi@samsung.com>
Cc: Chen Ridong <chenridong@huawei.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Daniel Wagner <wagi@kernel.org>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Jeffery <djeffery@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guixin Liu <kanie@linux.alibaba.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joanthan Cameron <Jonathan.Cameron@huawei.com>
Cc: Joel Granados <joel.granados@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Leon Romanovsky <leonro@nvidia.com>
Cc: Lukas Wunner <lukas@wunner.de>
Cc: Marc Rutland <mark.rutland@arm.com>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Matthew Maurer <mmaurer@google.com>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Myugnjoo Ham <myungjoo.ham@samsung.com>
Cc: Parav Pandit <parav@nvidia.com>
Cc: Pratyush Yadav <pratyush@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Saeed Mahameed <saeedm@nvidia.com>
Cc: Samiullah Khawaja <skhawaja@google.com>
Cc: Song Liu <song@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Stuart Hayes <stuart.w.hayes@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Thomas Weißschuh <linux@weissschuh.net>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: William Tu <witu@nvidia.com>
Cc: Yoann Congal <yoann.congal@smile.fr>
Cc: Zhu Yanjun <yanjun.zhu@linux.dev>
Cc: Zijun Hu <quic_zijuhu@quicinc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-27 14:24:41 -08:00
Pratyush Yadav ed6f45f81b mm: shmem: export some functions to internal.h
shmem_inode_acct_blocks(), shmem_recalc_inode(), and
shmem_add_to_page_cache() are used by shmem_alloc_and_add_folio().  This
functionality will be used by memfd LUO integration.

Link: https://lkml.kernel.org/r/20251125165850.3389713-13-pasha.tatashin@soleen.com
Signed-off-by: Pratyush Yadav <ptyadav@amazon.de>
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Tested-by: David Matlack <dmatlack@google.com>
Cc: Aleksander Lobakin <aleksander.lobakin@intel.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Andriy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: anish kumar <yesanishhere@gmail.com>
Cc: Anna Schumaker <anna.schumaker@oracle.com>
Cc: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Chanwoo Choi <cw00.choi@samsung.com>
Cc: Chen Ridong <chenridong@huawei.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Daniel Wagner <wagi@kernel.org>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Jeffery <djeffery@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guixin Liu <kanie@linux.alibaba.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joanthan Cameron <Jonathan.Cameron@huawei.com>
Cc: Joel Granados <joel.granados@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Leon Romanovsky <leonro@nvidia.com>
Cc: Lukas Wunner <lukas@wunner.de>
Cc: Marc Rutland <mark.rutland@arm.com>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Matthew Maurer <mmaurer@google.com>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Myugnjoo Ham <myungjoo.ham@samsung.com>
Cc: Parav Pandit <parav@nvidia.com>
Cc: Pratyush Yadav <pratyush@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Saeed Mahameed <saeedm@nvidia.com>
Cc: Samiullah Khawaja <skhawaja@google.com>
Cc: Song Liu <song@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Stuart Hayes <stuart.w.hayes@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Thomas Weißschuh <linux@weissschuh.net>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: William Tu <witu@nvidia.com>
Cc: Yoann Congal <yoann.congal@smile.fr>
Cc: Zhu Yanjun <yanjun.zhu@linux.dev>
Cc: Zijun Hu <quic_zijuhu@quicinc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-27 14:24:40 -08:00
Pratyush Yadav e165e2a257 mm: shmem: allow freezing inode mapping
To prepare a shmem inode for live update, its index -> folio mappings must
be serialized.  Once the mappings are serialized, they cannot change since
it would cause the serialized data to become inconsistent.  This can be
done by pinning the folios to avoid migration, and by making sure no
folios can be added to or removed from the inode.

While mechanisms to pin folios already exist, the only way to stop folios
being added or removed are the grow and shrink file seals.  But file seals
come with their own semantics, one of which is that they can't be removed.
This doesn't work with liveupdate since it can be cancelled or error out,
which would need the seals to be removed and the file's normal
functionality to be restored.

Introduce SHMEM_F_MAPPING_FROZEN to indicate this instead.  It is internal
to shmem and is not directly exposed to userspace.  It functions similar
to F_SEAL_GROW | F_SEAL_SHRINK, but additionally disallows hole punching,
and can be removed.

Link: https://lkml.kernel.org/r/20251125165850.3389713-12-pasha.tatashin@soleen.com
Signed-off-by: Pratyush Yadav <ptyadav@amazon.de>
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Tested-by: David Matlack <dmatlack@google.com>
Cc: Aleksander Lobakin <aleksander.lobakin@intel.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Andriy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: anish kumar <yesanishhere@gmail.com>
Cc: Anna Schumaker <anna.schumaker@oracle.com>
Cc: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Chanwoo Choi <cw00.choi@samsung.com>
Cc: Chen Ridong <chenridong@huawei.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Daniel Wagner <wagi@kernel.org>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Jeffery <djeffery@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guixin Liu <kanie@linux.alibaba.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joanthan Cameron <Jonathan.Cameron@huawei.com>
Cc: Joel Granados <joel.granados@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Leon Romanovsky <leonro@nvidia.com>
Cc: Lukas Wunner <lukas@wunner.de>
Cc: Marc Rutland <mark.rutland@arm.com>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Matthew Maurer <mmaurer@google.com>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Myugnjoo Ham <myungjoo.ham@samsung.com>
Cc: Parav Pandit <parav@nvidia.com>
Cc: Pratyush Yadav <pratyush@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Saeed Mahameed <saeedm@nvidia.com>
Cc: Samiullah Khawaja <skhawaja@google.com>
Cc: Song Liu <song@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Stuart Hayes <stuart.w.hayes@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Thomas Weißschuh <linux@weissschuh.net>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: William Tu <witu@nvidia.com>
Cc: Yoann Congal <yoann.congal@smile.fr>
Cc: Zhu Yanjun <yanjun.zhu@linux.dev>
Cc: Zijun Hu <quic_zijuhu@quicinc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-27 14:24:40 -08:00
Pratyush Yadav 6ff1610ced mm: shmem: use SHMEM_F_* flags instead of VM_* flags
shmem_inode_info::flags can have the VM flags VM_NORESERVE and VM_LOCKED. 
These are used to suppress pre-accounting or to lock the pages in the
inode respectively.  Using the VM flags directly makes it difficult to add
shmem-specific flags that are unrelated to VM behavior since one would
need to find a VM flag not used by shmem and re-purpose it.

Introduce SHMEM_F_NORESERVE and SHMEM_F_LOCKED which represent the same
information, but their bits are independent of the VM flags.  Callers can
still pass VM_NORESERVE to shmem_get_inode(), but it gets transformed to
the shmem-specific flag internally.

No functional changes intended.

Link: https://lkml.kernel.org/r/20251125165850.3389713-11-pasha.tatashin@soleen.com
Signed-off-by: Pratyush Yadav <ptyadav@amazon.de>
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Tested-by: David Matlack <dmatlack@google.com>
Cc: Aleksander Lobakin <aleksander.lobakin@intel.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Andriy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: anish kumar <yesanishhere@gmail.com>
Cc: Anna Schumaker <anna.schumaker@oracle.com>
Cc: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Chanwoo Choi <cw00.choi@samsung.com>
Cc: Chen Ridong <chenridong@huawei.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Daniel Wagner <wagi@kernel.org>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Jeffery <djeffery@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guixin Liu <kanie@linux.alibaba.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joanthan Cameron <Jonathan.Cameron@huawei.com>
Cc: Joel Granados <joel.granados@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Leon Romanovsky <leonro@nvidia.com>
Cc: Lukas Wunner <lukas@wunner.de>
Cc: Marc Rutland <mark.rutland@arm.com>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Matthew Maurer <mmaurer@google.com>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Myugnjoo Ham <myungjoo.ham@samsung.com>
Cc: Parav Pandit <parav@nvidia.com>
Cc: Pratyush Yadav <pratyush@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Saeed Mahameed <saeedm@nvidia.com>
Cc: Samiullah Khawaja <skhawaja@google.com>
Cc: Song Liu <song@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Stuart Hayes <stuart.w.hayes@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Thomas Weißschuh <linux@weissschuh.net>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: William Tu <witu@nvidia.com>
Cc: Yoann Congal <yoann.congal@smile.fr>
Cc: Zhu Yanjun <yanjun.zhu@linux.dev>
Cc: Zijun Hu <quic_zijuhu@quicinc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-27 14:24:39 -08:00
Pasha Tatashin f5bfd4793a memblock: unpreserve memory in case of error
If there is an error half way through KHO memory preservation, we should
rollback and unpreserve everything that is partially preserved.

[akpm@linux-foundation.org: s/err_no_fdt_page/err_report/ in prepare_kho_fdt(), per Mike]
Link: https://lkml.kernel.org/r/20251101142325.1326536-5-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Suggested-by: Pratyush Yadav <pratyush@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Changyuan Lyu <changyuanl@google.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Simon Horman <horms@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-27 14:24:32 -08:00
Mike Rapoport (Microsoft) 70f9133096 kho: drop notifiers
The KHO framework uses a notifier chain as the mechanism for clients to
participate in the finalization process.  While this works for a single,
central state machine, it is too restrictive for kernel-internal
components like pstore/reserve_mem or IMA.  These components need a
simpler, direct way to register their state for preservation (e.g., during
their initcall) without being part of a complex, shutdown-time notifier
sequence.  The notifier model forces all participants into a single
finalization flow and makes direct preservation from an arbitrary context
difficult.  This patch refactors the client participation model by
removing the notifier chain and introducing a direct API for managing FDT
subtrees.

The core kho_finalize() and kho_abort() state machine remains, but clients
now register their data with KHO beforehand.

Link: https://lkml.kernel.org/r/20251101142325.1326536-3-pasha.tatashin@soleen.com
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Co-developed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: Changyuan Lyu <changyuanl@google.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Pratyush Yadav <pratyush@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Simon Horman <horms@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-27 14:24:32 -08:00
Linus Torvalds 9eb220eddd 8 hotfixes. 4 are cc:stable, 7 are against mm/.
All are singletons - please see the respective changelogs for details.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaSdaWwAKCRDdBJ7gKXxA
 jlRmAP92O8ez4IeVq1VVhVfGVf9Xbjd8zFTykQXuGM/3yDRoTgEA+kLDFTTaJ5Wb
 MGlgQAeFXqlfMZfBljhyzbV8Ubz1BAc=
 =4UwS
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2025-11-26-11-51' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "8 hotfixes.  4 are cc:stable, 7 are against mm/.

  All are singletons - please see the respective changelogs for details"

* tag 'mm-hotfixes-stable-2025-11-26-11-51' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  mm/filemap: fix logic around SIGBUS in filemap_map_pages()
  mm/huge_memory: fix NULL pointer deference when splitting folio
  MAINTAINERS: add test_kho to KHO's entry
  mailmap: add entry for Sam Protsenko
  selftests/mm: fix division-by-zero in uffd-unit-tests
  mm/mmap_lock: reset maple state on lock_vma_under_rcu() retry
  mm/memfd: fix information leak in hugetlb folios
  mm: swap: remove duplicate nr_swap_pages decrement in get_swap_page_of_type()
2025-11-26 12:38:05 -08:00
Paolo Bonzini 236831743c KVM guest_memfd changes for 6.19:
- Add NUMA mempolicy support for guest_memfd, and clean up a variety of
    rough edges in guest_memfd along the way.
 
  - Define a CLASS to automatically handle get+put when grabbing a guest_memfd
    from a memslot to make it harder to leak references.
 
  - Enhance KVM selftests to make it easer to develop and debug selftests like
    those added for guest_memfd NUMA support, e.g. where test and/or KVM bugs
    often result in hard-to-debug SIGBUS errors.
 
  - Misc cleanups.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmkmSgcACgkQOlYIJqCj
 N/0MBQ/+KzI/6q6AR2m9jYCbz8APcJpmAEJ4Ma0+Q4XrJfkymAmt9P4M46bRZl5G
 Aznaqq9vak1weIzlaFsOzYqEWvSN54P/EfZKkqh0kCDKXGzl1HkSCC7FeThyqZz0
 +uuWUiRtWI/dyNFEpIXB/G06DwqhMIKAlk421Zt84iBI/wz2oZeAgWoFCjPWca4a
 /L/ClmpzM6LnP/Hg2DyoZtfwAXIy65pb9h0IhKbvGcgSrS4sesZPiSV20KKvKVSp
 4+WVLHuNbjk9vWKkmV8IZH+BXAO2J2+y2JYckbx4DvUKQcauXUJjFjp8+wZ4gMrC
 SK3SuWTTc3oe7fhaZZ98KY3BO9dq68iCbxpmdkhYrEbNqcDbQA5GMhIUZ2TgqDX7
 KQ58s9zyHZvsX4cnL3XY9igsdl6tvqnFqVhGyBaxUrWNJDqAs3NPJr8Td3cnrMKg
 MRB2AaJN8xX6DK7JAxKQ4LC0Nb7w3hxF0t0XL2XzBw+6nsu67NjrM7EflIhG+mHJ
 YWhrnNh83Yut95cym5mpUU0kFqfHNB5SxRGGokDcQ1NAXBwwE2Tq+YopR9OKRfiZ
 52/hkC195D7Xaeo9raf6iKN+YfiZ0uj3TvxuxIHs/EIqmmjhsc8VSwkfFW5OIeYf
 Tulmh/ohkq1675By/dapyHvW97PSgd0IqbDzjDrlxmbZbuuS1F8=
 =j3On
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-gmem-6.19' of https://github.com/kvm-x86/linux into HEAD

KVM guest_memfd changes for 6.19:

 - Add NUMA mempolicy support for guest_memfd, and clean up a variety of
   rough edges in guest_memfd along the way.

 - Define a CLASS to automatically handle get+put when grabbing a guest_memfd
   from a memslot to make it harder to leak references.

 - Enhance KVM selftests to make it easer to develop and debug selftests like
   those added for guest_memfd NUMA support, e.g. where test and/or KVM bugs
   often result in hard-to-debug SIGBUS errors.

 - Misc cleanups.
2025-11-26 09:32:44 +01:00