Commit Graph

26407 Commits (ba3e43a9e601636f5edb54e259a74f96ca3b8fd8)

Author SHA1 Message Date
Linus Torvalds 4a694a77c3 13 hotfixes. All are for MM. 10 are cc:stable and the remaining 3
address post-7.1 issues or aren't considered suitable for backporting.
 
 There's a 3 patch series "userfaultfd: verify VMA state across UFFDIO_COPY
 retry" from Mike Rapoport which fixes a few uffd things.  The rest are
 singletons - please see the individual changelogs for details.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCah5U6wAKCRDdBJ7gKXxA
 jmzcAP9N+WQo4qNZYYjSURqJof48Q4nght5C0ZHtsVk5itNJEQEAiecouCreqDSE
 VUY9mQHyEawIfORkPTUijnkjV8b+lwc=
 =m8J3
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2026-06-01-20-58' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM fixes from Andrew Morton:
 "13 hotfixes. All are for MM. 10 are cc:stable and the remaining 3
  address post-7.1 issues or aren't considered suitable for backporting.

  There's a three-patch series "userfaultfd: verify VMA state across
  UFFDIO_COPY retry" from Mike Rapoport which fixes a few uffd things.
  The rest are singletons - please see the individual changelogs for
  details"

* tag 'mm-hotfixes-stable-2026-06-01-20-58' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  userfaultfd: remove redundant check in vm_uffd_ops()
  userfaultfd: refuse to __mfill_atomic_pte() for unsupported VMAs
  userfaultfd: verify VMA state across UFFDIO_COPY retry
  mm/huge_memory: update file PMD counter before folio_put()
  mm/huge_memory: update file PUD counter before folio_put()
  mm/hugetlb_vmemmap: fix incorrect vmemmap restore in rollback
  mm/damon/ops-common: call folio_test_lru() after folio_get()
  mm/cma: fix reserved page leak on activation failure
  mm/memory-failure: fix hugetlb_lock AA deadlock in get_huge_page_for_hwpoison
  mm/hugetlb: restore reservation on error in hugetlb folio copy paths
  mm/cma_debug: fix invalid accesses for inactive CMA areas
  memcg: use round-robin victim selection in refill_stock
  mm/hugetlb: avoid false positive lockdep assertion
2026-06-02 08:59:35 -07:00
Mike Rapoport (Microsoft) 9d7bea186b userfaultfd: remove redundant check in vm_uffd_ops()
Lorenzo says:

  static const struct vm_uffd_ops *vma_uffd_ops(struct vm_area_struct *vma)
  {
          if (vma_is_anonymous(vma))
                  return &anon_uffd_ops;
          return vma->vm_ops ? vma->vm_ops->uffd_ops : NULL;
  }

  This is doing a redundant check _and_ making life confusing, as if
  !vma->vm_ops is a condition that can be reached there, it can't, as
  vma_is_anonymous() is literally a !vma->vm_ops check :)

Remove the redundant check.

Link: https://lore.kernel.org/20260527184751.4147364-4-rppt@kernel.org
Fixes: 0f48947c42 ("userfaultfd: introduce vm_uffd_ops")
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Suggested-by: Lorenzo Stoakes <ljs@kernel.org>
Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: David Carlier <devnexen@gmail.com>
Cc: Michael Bommarito <michael.bommarito@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-31 21:50:26 -07:00
Mike Rapoport (Microsoft) df3ee3b3bb userfaultfd: refuse to __mfill_atomic_pte() for unsupported VMAs
__mfill_atomic_pte() unconditionally dereferences ops because there is an
assumption that VMAs that can undergo mfill_* operations are vetted on
registration and must have valid vm_uffd_ops.

Add a guard against potential bugs and make sure __mfill_atomic_pte()
bails out if ops is NULL.

Link: https://lore.kernel.org/20260527184751.4147364-3-rppt@kernel.org
Fixes: ad9ac30813 ("userfaultfd: introduce vm_uffd_ops->alloc_folio()")
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Suggested-by: Lorenzo Stoakes <ljs@kernel.org>
Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
Reviewed-by: David CARLIER <devnexen@gmail.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Michael Bommarito <michael.bommarito@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-31 21:50:25 -07:00
Mike Rapoport (Microsoft) 85668fda93 userfaultfd: verify VMA state across UFFDIO_COPY retry
Patch series "userfaultfd: verify VMA state across UFFDIO_COPY retry", v2.

... and two more small fixes.


This patch (of 3):

mfill_copy_folio_retry() drops the VMA lock for copy_from_user() and
reacquires it afterwards.  The destination VMA can be replaced during that
window.

The existing check compares vma_uffd_ops() before and after the retry, but
if a shmem VMA with MAP_SHARED is replaced with a shmem VMA with
MAP_PRIVATE (or vice versa) the replacement goes undetected.

The change from MAP_PRIVATE to MAP_SHARED will treat the folio allocated
with shmem_alloc_folio() as anonymous and this will cause BUG() when
mfill_atomic_install_pte() will try to folio_add_new_anon_rmap().

The change from MAP_SHARED to MAP_PRIVATE allows injection of folios into
the page cache of the original VMA.

There is no need to change for hugetlb because it never uses
mfill_copy_folio_retry().

Introduce helpers for more comprehensive comparison of VMA state:
- mfill_retry_state_save() to save the relevant VMA state into a struct
  mfill_retry_state (original uffd_ops, relevant VMA flags, vm_file and
  pgoff) before dropping the lock
- mfill_retry_state_changed() to compare the saved state with the state
  of the VMA acquired after retaking the locks
- mfill_retry_state_put() to release vm_file pinning.

Use DEFINE_FREE() cleanup to wrap mfill_retry_state_put() to avoid
complicating error handling paths in mfill_copy_folio_retry().

Link: https://lore.kernel.org/20260527184751.4147364-1-rppt@kernel.org
Link: https://lore.kernel.org/20260527184751.4147364-2-rppt@kernel.org
Fixes: 292411fda2 ("mm/userfaultfd: detect VMA type change after copy retry in mfill_copy_folio_retry()")
Fixes: 6ab703034f ("userfaultfd: mfill_atomic(): remove retry logic")
Co-developed-by: Michael Bommarito <michael.bommarito@gmail.com>
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Suggested-by: Peter Xu <peterx@redhat.com>
Co-developed-by: David Carlier <devnexen@gmail.com>
Signed-off-by: David Carlier <devnexen@gmail.com>
Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Liam R. Howlett <liam@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-31 21:50:25 -07:00
Yin Tirui 8d87805992 mm/huge_memory: update file PMD counter before folio_put()
__split_huge_pmd_locked() updates the file/shmem RSS counter after
dropping the PMD mapping's folio reference.  If folio_put() drops the last
reference, mm_counter_file() can later read freed folio state via
folio_test_swapbacked().

Move the counter update before folio_put().

Link: https://lore.kernel.org/20260526101337.1984081-1-yintirui@huawei.com
Fixes: fadae29530 ("thp: use mm_file_counter to determine update which rss counter")
Signed-off-by: Yin Tirui <yintirui@huawei.com>
Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
Acked-by: David Hildenbrand (arm) <david@kernel.org>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chen Jun <chenjun102@huawei.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-31 21:50:25 -07:00
Yin Tirui 40990c87a2 mm/huge_memory: update file PUD counter before folio_put()
__split_huge_pud_locked() updates the file/shmem RSS counter after
dropping the PUD mapping's folio reference.  If folio_put() drops the last
reference, mm_counter_file() can later read freed folio state via
folio_test_swapbacked().

Move the counter update before folio_put().

Link: https://lore.kernel.org/20260526101355.1984244-1-yintirui@huawei.com
Fixes: dbe5415329 ("mm/huge_memory: add vmf_insert_folio_pud()")
Signed-off-by: Yin Tirui <yintirui@huawei.com>
Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
Acked-by: David Hildenbrand (arm) <david@kernel.org>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chen Jun <chenjun102@huawei.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-31 21:50:25 -07:00
Muchun Song c7bde43f6d mm/hugetlb_vmemmap: fix incorrect vmemmap restore in rollback
vmemmap_restore_pte() rebuilds restored vmemmap pages from a tail-page
template derived from compound_head().  This is wrong when the current PTE
already maps a page whose contents are not tail-page metadata.

In the rollback path of vmemmap_remap_free(), the first restored PTE is
backed by vmemmap_head and contains head-page metadata.  Reconstructing
that page from a tail-page template overwrites the head-page state and
corrupts the restored vmemmap page.

Fix this by copying the full page from the page currently mapped by the
PTE.  Also pass vmemmap_tail to the rollback walk so only PTEs backed by
the shared tail page are restored, while the head PTE remains mapped to
vmemmap_head.  Add VM_WARN_ON_ONCE() checks for unexpected cases.

Link: https://lore.kernel.org/20260525025213.2229628-1-songmuchun@bytedance.com
Fixes: c0b495b91a ("mm/hugetlb: refactor code around vmemmap_walk")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Kiryl Shutsemau <kas@kernel.org>
Acked-by: Oscar Salvador (SUSE) <osalvador@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-31 21:50:24 -07:00
SeongJae Park d6b8b02a27 mm/damon/ops-common: call folio_test_lru() after folio_get()
damon_get_folio() speculatively calls folio_test_lru() before
folio_try_get().  The folio can get freed and reallocated to a tail page. 
In the case, VM_BUG_ON_PGFLAGS() in const_folio_flags() can be triggered. 
Remove the speculative call.

Also mark folio_test_lru() check right after folio_try_get() success as no
more unlikely.

The race should be rare.  Also the problem can happen only if the kernel
has enabled CONFIG_DEBUG_VM_PGFLAGS.  No real world report of this issue
has been made so far.  This fix is based on only theoretical analysis. 
That said, a bug is a bug.  A similar issue was also fixed via commit
3203b3ab0f ("mm/filemap: don't call folio_test_locked() without a
reference in next_uptodate_folio()").  I don't expect this change will
make a meaningful impact to DAMON performance in the real world, though I
will be happy to be corrected from the real world reports.

The issue was discovered [1] by Sashiko.


Link: https://lore.kernel.org/20260525162256.8317-1-sj@kernel.org
Link: https://lore.kernel.org/20260517234112.89245-1-sj@kernel.org [1]
Fixes: 3f49584b26 ("mm/damon: implement primitives for the virtual memory address spaces")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Fernand Sieber <sieberf@amazon.com>
Cc: Leonard Foerster <foersleo@amazon.de>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: <stable@vger.kernel.org> # 5.15.x
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-31 21:50:24 -07:00
Linus Torvalds a29c0b0caf memblock: fix regression from memblock_free_late() refactoring
After refactoring of memblock_free_late() and free_init_pages() it became
 possible to call memblock_free() after memblock init data was discarded.
 
 Make sure memblock_free() does not touch memblock.reserved unless it is
 called early enough or when ARCH_KEEP_MEMBLOCK is enabled.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCgAdFiEEeOVYVaWZL5900a/pOQOGJssO/ZEFAmobERsACgkQOQOGJssO
 /ZGiqQgArOfAxLDWWinyherNcejWY+GKsNdWNYoGOv8UEw2oTTBmjyrOqHrGcevQ
 YlLjBxc2v9LzW9wCRnW0ngoq6/28ABLwyLpB+sMyHU8KaJDyYnAhfe7xt59aqE2N
 JuQaSRY8irZG8g2Yks2ZWIPbDoIXJVvGI342L96OYLO63eehV9u5e7kbBebOZpH1
 JlnbsaMGjhh2RgLrWWEy4EW1NZ5bYHer6fmCVIlUWtz9X67OjKD5na8bdi9ADEay
 Wu2CjYwZFdScM4FQ8r2l9UHxjnU8EQKRxVOm2+hO8wRiE3efzezDCp5Laacvf9o6
 KLrkVONfYUpScis3qLpNZly3IjAeVw==
 =H1VN
 -----END PGP SIGNATURE-----

Merge tag 'fixes-2026-05-30' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock

Pull memblock fix from Mike Rapoport:
 "Fix regression from memblock_free_late() refactoring

  After refactoring of memblock_free_late() and free_init_pages() it
  became possible to call memblock_free() after memblock init data was
  discarded.

  Make sure memblock_free() does not touch memblock.reserved unless it
  is called early enough or when ARCH_KEEP_MEMBLOCK is enabled"

* tag 'fixes-2026-05-30' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock:
  memblock: don't touch memblock arrays when memblock_free() is called late
2026-05-30 15:37:05 -07:00
Muchun Song 00739e4dd4 mm/cma: fix reserved page leak on activation failure
If cma_activate_area() fails after allocating only part of the range
bitmaps, the cleanup path still has to release the reserved pages when
CMA_RESERVE_PAGES_ON_ERROR is clear.

That is still worth doing even in this __init path.  A bitmap_zalloc()
failure does not necessarily mean the system cannot make further progress:
freeing the reserved CMA pages can return a substantial amount of memory
to the buddy allocator and may relieve the temporary memory shortage that
caused the allocation failure in the first place.

However, the cleanup path currently uses the bitmap-freeing bound for page
release as well.  That is only correct for ranges whose bitmap allocation
already succeeded.  The failed range and all later ranges still keep their
reserved pages, so a partial bitmap allocation failure can permanently
leak them.

Fix this by releasing reserved pages for all ranges.  Use the saved
early_pfn[] value for ranges whose bitmap allocation already succeeded and
for the failed range, and use cmr->early_pfn for later ranges whose bitmap
allocation was never attempted.

Link: https://lore.kernel.org/20260523060123.2207992-1-songmuchun@bytedance.com
Fixes: c009da4258 ("mm, cma: support multiple contiguous ranges, if requested")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Oscar Salvador (SUSE) <osalvador@kernel.org>
Acked-by: Usama Arif <usama.arif@linux.dev>
Cc: David Hildenbrand <david@kernel.org>
Cc: Frank van der Linden <fvdl@google.com>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28 20:50:34 -07:00
Wupeng Ma 3c2d42b8ee mm/memory-failure: fix hugetlb_lock AA deadlock in get_huge_page_for_hwpoison
Two concurrent madvise(MADV_HWPOISON) calls on the same hugetlb page can
trigger a recursive spinlock self-deadlock (AA deadlock) on hugetlb_lock
when racing with a concurrent unmap:

  thread#0                              thread#1
  --------                              --------
  madvise(folio, MADV_HWPOISON)
    -> poisons the folio successfully
  madvise(folio, MADV_HWPOISON)         unmap(folio)
    try_memory_failure_hugetlb
      get_huge_page_for_hwpoison
        spin_lock_irq(&hugetlb_lock)    <- held
        __get_huge_page_for_hwpoison
          hugetlb_update_hwpoison()
            -> MF_HUGETLB_FOLIO_PRE_POISONED
          goto out:
            folio_put()
              refcount: 1 -> 0
              free_huge_folio()
                spin_lock_irqsave(&hugetlb_lock)
                  -> AA DEADLOCK!

The out: path in __get_huge_page_for_hwpoison() calls folio_put() to drop
the GUP reference while the hugetlb_lock is still held by the hugetlb.c
wrapper get_huge_page_for_hwpoison().  If concurrent unmap has released
the page table mapping reference, folio_put() drops the folio refcount to
zero, triggering free_huge_folio() which attempts to re-acquire the
non-recursive hugetlb_lock.

Fix this by moving hugetlb_lock acquisition from the hugetlb.c wrapper
into get_huge_page_for_hwpoison().  Place spin_unlock_irq() before the
folio_put() at the out: label so the folio is always released outside the
lock.

[akpm@linux-foundation.org: fix race, rename label per Miaohe]
  Link: https://sashiko.dev/#/patchset/20260522010305.4099834-1-mawupeng1@huawei.com
  Link: https://lore.kernel.org/f39f405e-4b4b-8f79-70fe-a2b5b62114eb@huawei.com
Link: https://lore.kernel.org/20260522010305.4099834-1-mawupeng1@huawei.com
Fixes: 405ce05123 ("mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()")
Signed-off-by: Wupeng Ma <mawupeng1@huawei.com>
Acked-by: Oscar Salvador (SUSE) <osalvador@kernel.org>
Acked-by: Muchun Song <muchun.song@linux.dev>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28 20:50:34 -07:00
David Carlier 40c81856e6 mm/hugetlb: restore reservation on error in hugetlb folio copy paths
Two sites in mm/hugetlb.c allocate a hugetlb folio via
alloc_hugetlb_folio() (consuming a VMA reservation) and then call
copy_user_large_folio(), which became int-returning in commit 1cb9dc4b47
("mm: hwpoison: support recovery from HugePage copy-on-write faults") and
can now fail (e.g.  -EHWPOISON on a hwpoisoned source page).  On the
failure path, folio_put() restores the global hugetlb pool count through
free_huge_folio(), but the per-VMA reservation map entry is left marked
consumed:

  - hugetlb_mfill_atomic_pte() resubmission path (UFFDIO_COPY)
  - copy_hugetlb_page_range() fork-time CoW path when
    hugetlb_try_dup_anon_rmap() fails (rare: pinned hugetlb anon
    folio under fork)

User-visible effect: on UFFDIO_COPY into a private hugetlb VMA where the
resubmission copy fails, the reservation for that address is leaked from
the VMA's reserve map.  A subsequent fault at the same address takes the
no-reservation path, and under hugetlb pool pressure the task is SIGBUSed
at an address it had previously reserved.  The fork-time CoW path leaks
the same way in the child VMA's reserve map, though it requires the much
rarer combination of pinned hugetlb anon page + hwpoisoned source.

Add the missing restore_reserve_on_error() call before folio_put() on both
error paths.

Link: https://lore.kernel.org/20260520044912.6751-1-devnexen@gmail.com
Fixes: 1cb9dc4b47 ("mm: hwpoison: support recovery from HugePage copy-on-write faults")
Signed-off-by: David Carlier <devnexen@gmail.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
Cc: David Hildenbrand <david@kernel.org>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: yuehaibing <yuehaibing@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28 20:50:33 -07:00
Muchun Song c0ca59beb5 mm/cma_debug: fix invalid accesses for inactive CMA areas
cma_activate_area() can fail after allocating range bitmaps.  Its cleanup
path frees those bitmaps, but only clears cma->count and
cma->available_count.  It leaves cma->nranges and each range's count in
place, so cma_debugfs_init() can still register debugfs files for an area
that never activated successfully.

That exposes two problems.  Reading the bitmap file can make debugfs walk
a freed range bitmap and trigger an invalid memory access.  Reading
maxchunk can also take cma->lock even though that lock is initialized only
on the successful activation path.

Fix this by creating debugfs entries only for CMA areas that reached
CMA_ACTIVATED.

c009da4258 introduced the invalid access to bitmap file.  2e32b94760
introduced the invalid access to cma->lock.  This change applies to both
issues.  So I added two Fixes tags.

Link: https://lore.kernel.org/20260520061025.3971821-1-songmuchun@bytedance.com
Fixes: c009da4258 ("mm, cma: support multiple contiguous ranges, if requested")
Fixes: 2e32b94760 ("mm: cma: add functions to get region pages counters")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: Oscar Salvador (SUSE) <osalvador@kernel.org>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: Frank van der Linden <fvdl@google.com>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Stefan Strogin <stefan.strogin@gmail.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28 20:50:33 -07:00
Shakeel Butt c0cafe24d3 memcg: use round-robin victim selection in refill_stock
Harry Yoo reported that get_random_u32_below() is not safe to call in the
nmi context and memcg charge draining can happen in nmi context.

More specifically get_random_u32_below() is neither reentrant- nor
NMI-safe: it acquires a per-cpu local_lock via local_lock_irqsave() on the
batched_entropy_u32 state.  An NMI that lands on a CPU mid-update of the
ChaCha batch state and recurses into the random subsystem would corrupt
that state.  The memcg_stock local_trylock prevents re-entry on the percpu
stock itself, but cannot protect an unrelated subsystem's per-cpu lock.

Replace the random pick with a per-cpu round-robin counter stored in
memcg_stock_pcp and serialized by the same local_trylock that already
guards cached[] and nr_pages[].  No atomics, no random calls, no extra
locks needed.

Link: https://lore.kernel.org/20260521223751.3794625-1-shakeel.butt@linux.dev
Fixes: f735eebe55 ("memcg: multi-memcg percpu charge cache")
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Reported-by: Harry Yoo <harry@kernel.org>
Closes: https://lore.kernel.org/4e20f643-6983-4b6e-b12d-c6c4eb20ae0c@kernel.org/
Acked-by: Harry Yoo (Oracle) <harry@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28 20:50:33 -07:00
Lorenzo Stoakes b4aea43cd3 mm/hugetlb: avoid false positive lockdep assertion
Commit 081056dc00 ("mm/hugetlb: unshare page tables during VMA split,
not before") changed the locking model around hugetlbfs PMD unsharing on
VMA split, but did not update the function which asserts the locks,
hugetlb_vma_assert_locked().

This function asserts that either the hugetlb VMA lock is held (if a
shared mapping) or that the reservation map lock is held (if private).

If you get an unfortunate race between something which results in one of
these locks being released and a hugetlb VMA split and you have
CONFIG_LOCKDEP enabled, you can therefore see a false positive assertion
arise when there is in fact no issue.

Since this change introduced a new take_locks parameter to
hugetlb_unshare_pmds(), which, when set to false, indicates that locking
is sufficient, simply pass this to the unsharing logic and predicate the
lock assertions on this.

This is safe, as we already asserted the file rmap lock and the VMA write
lock prior to this (implying exclusive mmap write lock), so we cannot be
raced by either rmap or page fault page table walkers which the asserted
locks are intended to protect against (we don't mind GUP-fast).

Separate out huge_pmd_unshare() into __huge_pmd_unshare() to add a
check_locks parameter, and update hugetlb_unshare_pmds() to pass this
parameter to it.

This leaves all other callers of huge_pmd_unshare() still correctly
asserting the locks.

The below reproducer will trigger the assert in a kernel with
CONFIG_LOCKDEP enabled by racing process teardown (which will release the
hugetlb lock) against a hugetlb split.

void execute_one(void)
{
	void *ptr;
	pid_t pid;

	/*
	 * Create a hugetlb mapping spanning a PUD entry.
	 *
	 * We force the hugetlb page allocation with populate and
	 * noreserve.
	 *
	 * |---------------------|
	 * |                     |
	 * |---------------------|
	 * 0                 PUD boundary
	 */
	ptr = mmap(0, PUD_SIZE, PROT_READ | PROT_WRITE,
		   MAP_FIXED | MAP_SHARED | MAP_ANON |
		   MAP_NORESERVE | MAP_HUGETLB | MAP_POPULATE,
		   -1, 0);
	if (ptr == MAP_FAILED) {
		perror("mmap");
		exit(EXIT_FAILURE);
	}

	/*
	 * Fork but with a bogus stack pointer so we try to execute code in
	 * a non-VM_EXEC VMA, causing segfault + teardown via exit_mmap().
	 *
	 * The clone will cause PMD page table sharing between the
	 * processes first via:
	 * copy_process() -> ... -> huge_pte_alloc() -> huge_pmd_share()
	 *
	 * Then tear down and release the hugetlb 'VMA' lock via:
	 * exit_mmap() -> ... -> vma_close() -> hugetlb_vma_lock_free()
	 */
	pid = syscall(__NR_clone, 0, 2 * PMD_SIZE, 0, 0, 0);
	if (pid < 0) {
		perror("clone");
		exit(EXIT_FAILURE);
	} if (pid == 0) {
		/* Pop stack... */
		return;
	}

	/*
	 * We are the parent process.
	 *
	 * Race the child process's teardown with a PMD unshare.
	 *
	 * We do this by triggering:
	 *
	 * __split_vma() -> hugetlb_split() -> hugetlb_unshare_pmds()
	 *
	 * Which, importantly, doesn't hold the hugetlb VMA lock (nor can
	 * it), meaning we assert in hugetlb_vma_assert_locked().
	 *
	 *            .
	 * |----------.----------|
	 * |          .          |
	 * |----------.----------|
	 * 0          .     PUD boundary
	 */
	mmap(0, PUD_SIZE / 2, PROT_READ | PROT_WRITE,
	     MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
}

int main(void)
{
	int i;

	/* Kick off fork children. */
	for (i = 0; i < NUM_FORKS; i++) {
		pid_t pid = fork();

		if (pid < 0) {
			perror("fork");
			exit(EXIT_FAILURE);
		}

		/* Fork children do their work and exit. */
		if (!pid) {
			int j;

			for (j = 0; j < NUM_ITERS; j++)
				execute_one();
			return EXIT_SUCCESS;
		}
	}

	/* If we succeeded, wait on children. */
	for (i = 0; i < NUM_FORKS; i++)
		wait(NULL);

	return EXIT_SUCCESS;
}

[ljs@kernel.org: account for the !CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING case]
  Link: https://lore.kernel.org/agWZsPGYid08uU6O@lucifer
Link: https://lore.kernel.org/20260513085658.45264-1-ljs@kernel.org
Fixes: 081056dc00 ("mm/hugetlb: unshare page tables during VMA split, not before")
Signed-off-by: Lorenzo Stoakes <ljs@kernel.org>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Acked-by: Oscar Salvador <osalvador@suse.de>
Cc: Jann Horn <jannh@google.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28 20:50:33 -07:00
Linus Torvalds d60ec36cab 13 hotfixes. 9 are for MM. 9 are cc:stable and the remaining 4 address
post-7.1 issues or aren't considered suitable for backporting.
 
 All patches are singletons - please see the individual changelogs for
 details.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCahTZ2QAKCRDdBJ7gKXxA
 ju+UAQDUga+l95O1iOnrraKFWvT1ghQKTgbNxGMwefHjVLLFBQD+Ln2wPfz73Ks7
 H8WK0k5D0g+6lKs6tFGAALdQnTU0BAU=
 =MYsv
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2026-05-25-16-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "13 hotfixes. 9 are for MM. 9 are cc:stable and the remaining 4 address
  post-7.1 issues or aren't considered suitable for backporting.

  All patches are singletons - please see the individual changelogs for
  details"

* tag 'mm-hotfixes-stable-2026-05-25-16-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  Revert "mm: introduce a new page type for page pool in page type"
  mm/vmalloc: do not trigger BUG() on BH disabled context
  MAINTAINERS, mailmap: change email for Eugen Hristev
  mm/migrate_device: fix pgtable leak in migrate_vma_insert_huge_pmd_page
  kernel/fork: validate exit_signal in kernel_clone()
  mm: memcontrol: propagate NMI slab stats to memcg vmstats
  mm/damon/sysfs-schemes: delete tried region in regions_rmdirs()
  mm/rmap: initialize nr_pages to 1 at loop start in try_to_unmap_one
  zram: fix use-after-free in zram_writeback_endio
  memfd: deny writeable mappings when implying SEAL_WRITE
  ipc: limit next_id allocation to the valid ID range
  Revert "mm/hugetlbfs: update hugetlbfs to use mmap_prepare"
  MAINTAINERS: .mailmap: update after GEHC spin-off
2026-05-26 08:23:19 -07:00
Mike Rapoport (Microsoft) 7c2eee9c13 memblock: don't touch memblock arrays when memblock_free() is called late
When memblock_free() is called after memblock_discard() on architectures
that don't select ARCH_KEEP_MEMBLOCK, it tries to update memblock.reserved
that was already discarded and it causes use-after-free, for example

[    8.514775] BUG: KASAN: use-after-free in memblock_isolate_range+0x4ac/0x650
[    8.514775] Read of size 8 at addr ffff88a07fe6a000 by task swapper/0/1
[    8.514775] Call Trace:
[    8.514775]  <TASK>
[    8.514775]  kasan_report+0xb2/0x1b0
[    8.514775]  memblock_isolate_range+0x4ac/0x650
[    8.514775]  memblock_phys_free+0xc4/0x190
[    8.514775]  housekeeping_late_init+0x257/0x280
[    8.514775]  do_one_initcall+0xaa/0x470
[    8.514775]  do_initcalls+0x1b4/0x1f0
[    8.514775]  kernel_init_freeable+0x4b5/0x550
[    8.514775]  kernel_init+0x1c/0x150
[    8.514775]  ret_from_fork+0x5dc/0x8e0
[    8.514775]  ret_from_fork_asm+0x1a/0x30
[    8.514775]  </TASK>

Make sure memblock_free() updates memblock.reserved only when called early
enough or when ARCH_KEEP_MEMBLOCK is enabled.

Reported-by: Waiman Long <longman@redhat.com>
Reported-by: Breno Leitao <leitao@debian.org>
Closes: https://lore.kernel.org/all/20260505051821.1107133-1-longman@redhat.com
Tested-by: Waiman Long <longman@redhat.com>
Tested-by: Breno Leitao <leitao@debian.org>
Fixes: 87ce9e83ab ("memblock, treewide: make memblock_free() handle late freeing")
Link: https://patch.msgid.link/20260513105122.502506-1-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
2026-05-25 20:48:14 +03:00
Linus Torvalds de37e502a3 cgroup: Fixes for v7.1-rc4
Two rstat fixes:
 
 - Out-of-bounds access in the css_rstat_updated() BPF kfunc when called
   with an unchecked user-supplied cpu.
 
 - Over-strict NMI guard after the recent switch to try_cmpxchg left
   sparc and ppc64 unable to queue rstat updates from NMI.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCahCHDA4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGTk2AP9Me+BV0h17oEuaqAii7uzMom6zCYUO6KY6ADAe
 zr+zcgEA0B72FxH+GyPwe7lhropwg9WR6jagsCFN/tlMPHwQrwc=
 =Sos7
 -----END PGP SIGNATURE-----

Merge tag 'cgroup-for-7.1-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup fixes from Tejun Heo:
 "Two rstat fixes:

   - Out-of-bounds access in the css_rstat_updated() BPF kfunc when
     called with an unchecked user-supplied cpu

   - Over-strict NMI guard after the recent switch to try_cmpxchg left
     sparc and ppc64 unable to queue rstat updates from NMI"

* tag 'cgroup-for-7.1-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: rstat: relax NMI guard after switch to try_cmpxchg
  cgroup/rstat: validate cpu before css_rstat_cpu() access
2026-05-22 16:28:47 -07:00
Linus Torvalds 46de4087d3 slab fixes for 7.1-rc4
-----BEGIN PGP SIGNATURE-----
 
 iQFPBAABCAA5FiEEe7vIQRWZI0iWSE3xu+CwddJFiJoFAmoQJvUbFIAAAAAABAAO
 bWFudTIsMi41KzEuMTIsMiwyAAoJELvgsHXSRYiavTcH/iKK1G+ygWOCqKrlZxir
 /Ga7vPPQSXgiZEtEQ9ZIC/1UQeBp/eQc4uCzgy/aAMqksFjdwsFF/qs6mUr1ffHB
 ne9RBk40zoP6KQfKkP8GW22ERmDOLnJ1t+QiCibrre8roM7vguDx0Sr5VXGoDmU5
 p9mtomLzxJkuDV1gY7LGsT26Es+3PIpma8DTJwlhsUw2igAsp9XnfXJEjBu4eEXP
 vfGkxK+go5+u4iH5/qKKw6MpfmV7z+O5WJsHaGaKyp8chc6lMSkqtFe1g15h6j6D
 bL4D6jvmUuwTGsDVaufUnnKdM64+BtcCYJAkgzfKHhge8KG3eBlmsiLkK4uvNRo4
 8NI=
 =E0Wr
 -----END PGP SIGNATURE-----

Merge tag 'slab-for-7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab

Pull slab fix from Vlastimil Babka:

 - Stable fix for a missing cpus_read_lock in one of the cpu sheaves
   flushing paths (Qing Wang)

* tag 'slab-for-7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
  mm/slub: hold cpus_read_lock around flush_rcu_sheaves_on_cache()
2026-05-22 06:23:56 -07:00
Byungchul Park 54cf41c969 Revert "mm: introduce a new page type for page pool in page type"
This reverts commit db359fccf2 ("mm: introduce a new page type for page
pool in page type") and a part of 735a309b4b ("net: add net_iov_init()
and use it to initialize ->page_type").

Netpp page_type'ed pages might be used in mapping so as to use @_mapcount.
However, since @page_type and @_mapcount are union'ed in struct page,
these two can't be used at the same time.  Revert the commit introducing
page_type for Netpp for now.

The patch will be retried once @page_type and @_mapcount get allowed to be
used at the same time.

The revert also includes removal of @page_type initialization part
introduced by commit 735a309b4b ("net: add net_iov_init() and use it
to initialize ->page_type"), which will be restored on the retry.

Link: https://lore.kernel.org/20260515034701.17027-1-byungchul@sk.com
Fixes: db359fccf2 ("mm: introduce a new page type for page pool in page type")
Signed-off-by: Byungchul Park <byungchul@sk.com>
Reported-by: Dragos Tatulea <dtatulea@nvidia.com>
Closes: https://lore.kernel.org/all/982b9bc1-0a0a-4fc5-8e3a-3672db2b29a1@nvidia.com
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Acked-by: Harry Yoo (Oracle) <harry@kernel.org>
Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Cc: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Mark Bloch <mbloch@nvidia.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Cc: Saeed Mahameed <saeedm@nvidia.com>
Cc: Simon Horman <horms@kernel.org>
Cc: Stanislav Fomichev <sdf@fomichev.me>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tariq Toukan <tariqt@nvidia.com>
Cc: Toke Hoiland-Jorgensen <toke@redhat.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-21 19:06:13 -07:00
Uladzislau Rezki (Sony) 04aa71da5f mm/vmalloc: do not trigger BUG() on BH disabled context
__get_vm_area_node() currently triggers a BUG() if in_interrupt() returns
true.  However, in_interrupt() also reports true when BH are disabled.

The bridge code can call rhashtable_lookup_insert_fast() with bottom
halves disabled:

__vlan_add()
 -> br_fdb_add_local()
  spin_lock_bh(&br->hash_lock); <-- Disable BH
   -> fdb_add_local()
    -> fdb_create()
     -> rhashtable_lookup_insert_fast()
      -> kvmalloc()
       -> vmalloc()
        -> __get_vm_area_node()
         -> BUG_ON(in_interrupt())
  spin_unlock_bh(&br->hash_lock)

this triggers the BUG() despite the caller not being in NMI or
hard IRQ context.

Replace the in_interrupt() check with in_nmi() || in_hardirq().

Link: https://lore.kernel.org/20260515153009.2296191-1-urezki@gmail.com
Fixes: c6307674ed ("mm: kvmalloc: add non-blocking support for vmalloc")
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Ido Schimmel <idosch@nvidia.com>
Reported-by: syzbot+8b12fc6e0fb139765b58@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/69ff8c7c.050a0220.1036b8.000b.GAE@google.com/
Reviewed-by: Baoquan He <baoquan.he@linux.dev>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-21 19:06:13 -07:00
Sunny Patel 2c6f81d587 mm/migrate_device: fix pgtable leak in migrate_vma_insert_huge_pmd_page
When migrate_vma_insert_huge_pmd_page() jumps to unlock_abort due
to a PMD check failure, the pgtable allocated earlier via
pte_alloc_one() is never freed, causing a memory leak.

Added free_abort label to release the pgtable in error path.

Link: https://lore.kernel.org/20260501115122.23288-1-nueralspacetech@gmail.com
Fixes: a30b48bf1b ("mm/migrate_device: implement THP migration of zone device pages")
Signed-off-by: Sunny Patel <nueralspacetech@gmail.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Huang Ying <ying.huang@linux.alibaba.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-21 19:06:13 -07:00
Alexandre Ghiti e16f17a9c5 mm: memcontrol: propagate NMI slab stats to memcg vmstats
flush_nmi_stats() drains per-node NMI slab atomics into the per-node
lruvec_stats, but does not propagate them to the memcg-level vmstats.

For non NMI case, account_slab_nmi_safe() calls mod_memcg_lruvec_state()
which updates both per-node lruvec_stats and memcg-level vmstats, so
flush_nmi_stats() needs to flush to per-node lruvec_stats as well as
memcg-level vmstats.

So fix this by flushing to the memcg-level vmstats for NMI too.

Link: https://lore.kernel.org/20260518082830.599102-1-alex@ghiti.fr
Fixes: 940b01fc8d ("memcg: nmi safe memcg stats for specific archs")
Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-21 19:06:12 -07:00
SeongJae Park 441f92f7d3 mm/damon/sysfs-schemes: delete tried region in regions_rmdirs()
DAMON sysfs maintains the DAMOS tried region directory objects via a
linked list.  When the user requests refresh of the directories, DAMON
sysfs removes all the region directories first, and then generate updated
regions directory on the empty space.  The removal function
(damon_sysfs_scheme_regions_rm_dirs()) only puts the kobj objects. 
Deletion of the container region object from the linked list is done
inside the kobj release callback function.

If somehow the callback invocation is delayed, the list will contain
regions list that gonna be freed.  If the updated region directories
creation is started in this situation, the list can be corrupted and
use-after-free can happen.

Because the kobj objects are managed by only DAMON sysfs, the issue cannot
happen in normal situation.  But, such delays can be made on kernels that
built with CONFIG_DEBUG_KOBJECT_RELEASE.  On the kernel, the issue can
indeed be reproduced like below.

    # damo start --damos_action stat
    # cd /sys/kernel/mm/damon/admin/kdamonds/0/
    # for i in {1..10}; do echo update_schemes_tried_regions > state; done
    # dmesg | grep underflow
    [   89.296152] refcount_t: underflow; use-after-free.

Fix the issue by removing the region object from the list when
decrementing the reference count.

Also update damos_sysfs_populate_region_dir() to add the region object to
the list only after the kobject_init_and_add() is success, so that fail of
kobject_init_and_add() is not leaving the deallocated object on the list.

The issue was discovered [1] by Sashiko.

Link: https://lore.kernel.org/20260518152559.93038-1-sj@kernel.org
Link: https://lore.kernel.org/20260513011920.119183-1-sj@kernel.org [1]
Fixes: 9277d0367b ("mm/damon/sysfs-schemes: implement scheme region directory")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org> # 6.2.x
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-21 19:06:12 -07:00
Dev Jain 3f8968e9cb mm/rmap: initialize nr_pages to 1 at loop start in try_to_unmap_one
Initialize nr_pages to 1 at the start of each loop iteration, like
folio_referenced_one() does.

Without this, nr_pages computed by a previous folio_unmap_pte_batch() call
can be reused on a later iteration that does not run
folio_unmap_pte_batch() again.

mmap a 64K large folio with MAP_ANONYMOUS | MAP_DROPPABLE, then call
madvise(MADV_FREE), then make the last page device-exclusive via
HMM_DMIRROR_EXCLUSIVE.

Trigger node reclaim through sysfs.  Now, in try_to_unmap_one(), we will
first clear the first 15 out of 16 entries mapping the lazyfree folio. 
This will set nr_pages to 15.  In the next pvmw walk, this nr_pages gets
reused on a device-exclusive pte, thus potentially corrupting folio
refcount/mapcount.

At the moment, I have a userspace program which can make the kernel spit
out a trace, but the blow up is in folio_referenced_one(), because there
are existing bugs in the interaction between device-private and rmap
(which too I am investigating).  I did a one liner kernel change to avoid
going into folio_referenced_one(), and the kernel blows up at
folio_remove_rmap_ptes in try_to_unmap_one which is what I wanted.

Note that the bug is there not since file folio batching but lazyfree
folio batching, since device-exclusive only works for anonymous folios.

Userspace visible effect is simply kernel crashing somewhere due to
refcount/mapcount corruption.

Link: https://lore.kernel.org/20260518063656.3721056-1-dev.jain@arm.com
Fixes: 354dffd295 ("mm: support batched unmap for lazyfree large folios during reclamation")
Signed-off-by: Dev Jain <dev.jain@arm.com>
Acked-by: Barry Song <baohua@kernel.org>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Harry Yoo <harry@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-21 19:06:12 -07:00
Pratyush Yadav (Google) 3b041514cb memfd: deny writeable mappings when implying SEAL_WRITE
When SEAL_EXEC is added, SEAL_WRITE is implied to make W^X.  But the
implied seal is set after the check that makes sure the memfd can not have
any writable mappings.  This means one can use SEAL_EXEC to apply
SEAL_WRITE while having writeable mappings.

This breaks the contract that SEAL_WRITE provides and can be used by an
attacker to pass a memfd that appears to be write sealed but can still be
modified arbitrarily.

Fix this by adding the implied seals before the call for
mapping_deny_writable() is done.

Link: https://lore.kernel.org/20260505133922.797635-1-pratyush@kernel.org
Fixes: c4f75bc8bd ("mm/memfd: add write seals when apply SEAL_EXEC to executable memfd")
Signed-off-by: Pratyush Yadav (Google) <pratyush@kernel.org>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> 
Acked-by: Jeff Xu <jeffxu@google.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kees Cook <kees@kernel.org>
Cc: "David Hildenbrand (Arm)" <david@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-21 19:06:11 -07:00
Lorenzo Stoakes 83f9efcce9 Revert "mm/hugetlbfs: update hugetlbfs to use mmap_prepare"
This reverts commit ea52cb24cd ("mm/hugetlbfs: update hugetlbfs to use
mmap_prepare") with conflict resolution to account for changes in commit
ea52cb24cd ("mm/hugetlbfs: update hugetlbfs to use mmap_prepare").

The patch incorrectly handled hugetlb VMA lock allocation at the
mmap_prepare stage, where a failed allocation occurring after mmap_prepare
is called might result in the lock leaking.

There is no risk of a merge causing a similar issues, as
VMA_DONTEXPAND_BIT is set for hugetlb mappings.

As a first step in addressing this issue, simply revert the change so we
can rework how we do this having corrected the underlying issues.

We maintain the VMA flags changes as best we can, accounting for the fact
that we were working with a VMA descriptor previously and propagating
like-for-like changes for this.

Note that we invoke vma_set_flags() and do not call vma_start_write() as
vm_flags_set() does.  This is OK as it's being done in an .mmap hook where
the VMA is not yet linked into the tree so nobody else can be accessing
it.

Link: https://lore.kernel.org/20260512160643.266960-1-ljs@kernel.org
Fixes: ea52cb24cd ("mm/hugetlbfs: update hugetlbfs to use mmap_prepare")
Signed-off-by: Lorenzo Stoakes <ljs@kernel.org>
Reported-by: Mingyu Wang <25181214217@stu.xidian.edu.cn>
Closes: https://lore.kernel.org/linux-mm/20260425070700.562229-1-25181214217@stu.xidian.edu.cn/
Acked-by: Muchun Song <muchun.song@linux.dev>
Acked-by: Oscar Salvador <osalvador@suse.de>
Cc: David Hildenbrand <david@kernel.org>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-21 19:06:11 -07:00
Linus Torvalds c6e99c10fd 14 hotfixes. 9 are for MM. 10 are cc:stable and the remainder are for
post-7.1 issues or aren't deemed suitable for backporting.
 
 There's a 2 patch MAINTAINERS series from Mike Rapoport which updates us
 for the new KEXEC/KDUMP/crash/LUO/etc arrangements.  And a 2 patch series
 from Muchun Song to fix a couple of memory-hotplug issues.  Otherwise
 singletons, please see the changelogs for details.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCagviKQAKCRDdBJ7gKXxA
 jlbsAP9SEShHxXEYcRMVQtXb+8/iJDe7J3KwVDP4e0VOlQKTPAD/c+C2bx4nllOG
 77wl9Qkr++KqTSmoPbzA7Q02gJC2ngQ=
 =2qN3
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2026-05-18-21-07' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "14 hotfixes. 9 are for MM. 10 are cc:stable and the remainder are for
  post-7.1 issues or aren't deemed suitable for backporting.

  There's a two-patch MAINTAINERS series from Mike Rapoport which
  updates us for the new KEXEC/KDUMP/crash/LUO/etc arrangements. And
  another two-patch series from Muchun Song to fix a couple of
  memory-hotplug issues. Otherwise singletons, please see the changelogs
  for details"

* tag 'mm-hotfixes-stable-2026-05-18-21-07' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  mm/memory: fix spurious warning when unmapping device-private/exclusive pages
  mm: fix __vm_normal_page() to handle missing support for pmd_special()/pud_special()
  drivers/base/memory: fix memory block reference leak in poison accounting
  mm/memory_hotplug: fix memory block reference leak on remove
  lib: kunit_iov_iter: fix test fail on powerpc
  mm/page_alloc: fix initialization of tags of the huge zero folio with init_on_free
  MAINTAINERS: add kexec@ list to LIVE UPDATE ENTRY
  MAINTAINERS: add tree for KDUMP and KEXEC
  selftests/mm: run_vmtests.sh: fix destructive tests invocation
  scripts/gdb: slab: update field names of struct kmem_cache
  scripts/gdb: mm: cast untyped symbols in x86_page_ops
  mm/damon: fix damos_stat tracepoint format for sz_applied
  mm/damon/sysfs-schemes: call missing mem_cgroup_iter_break()
  mm/migrate_device: fix spinlock leak in migrate_vma_insert_huge_pmd_page
2026-05-19 07:49:33 -07:00
Qing Ming 8817005efb cgroup/rstat: validate cpu before css_rstat_cpu() access
css_rstat_updated() is exposed as a BPF kfunc and accepts a
caller-provided cpu argument. The function uses cpu for per-cpu rstat
lookups without checking whether it refers to a valid possible CPU.

A BPF iter/cgroup program with CAP_BPF and CAP_PERFMON can pass an
invalid cpu value. On an unfixed UBSCAN_BOUNDS test kernel, cpu ==
0x7fffffff triggers:

  UBSAN: array-index-out-of-bounds in kernel/cgroup/rstat.c:31:9
  index 2147483647 is out of range for type 'long unsigned int [64]'
  Call Trace:
    css_rstat_updated
    bpf_iter_run_prog
    cgroup_iter_seq_show
    bpf_seq_read

Add cpu validation to the BPF-facing css_rstat_updated() kfunc and
move the common implementation to __css_rstat_updated() for in-kernel
callers.

Fixes: a319185be9 ("cgroup: bpf: enable bpf programs to integrate with rstat")
Signed-off-by: Qing Ming <a0yami@mailbox.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-18 09:31:52 -10:00
Linus Torvalds fd6b566156 This push contains the following changes:
- Fix potential dead-lock in rhashtable when used by xattr.
 - Avoid calling kvfree on atomic path in rhashtable.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEn51F/lCuNhUwmDeSxycdCkmxi6cFAmoG6NgACgkQxycdCkmx
 i6dpPA/9FUJGxLosr5LKlo71af5RpHuOAAVq9/3tjev9FLVR0IgqUuMpPqcKqaBC
 w4HL2drDcXUm3tlFLAgRgnCcIbz1lIVrZMTSHg1ivx9ZGNWm75o7n2gnYp4NnFzD
 5A8AKHGUfnLf0y/nnrehvgGpnuyEGuMVL2+lTnkfJi1gYGQfJszC2pvC8a7gK2m1
 ZXPdTEuMe58nfqHK0Jkhfwwjh8glpopvwfrEp5mByRqr3EELKznNGnTlPNpyjkPd
 vV0W8CM5qPJCn5Ts7OTKnZFktelSfgjKeHhxK3QFfUPV/fFQtwn2dNp099w00xMY
 GC3okWgDMGoXqnQ6nzNt9IGPXIvh1KgH76qY35R4gAjZ5JX8YF7ObJGylAthRYKz
 +AGx9Pl9u2GIE0EzkHDG5kxel2tCajWOgbdm78xYUtEn9ii3j8LInX63lB1bLLaq
 KIxMG/IuxP6mMaSiOs79/B9qmT/gYEfJvf4x8EXV1HXvzQh7SD5Zcep+krhVj49+
 rsP5SBcTZfiZapU+AlicquLO/1kCYucUHz15KoKkeXiwj5ruWuUNx/wNblzft94d
 9H4zfqL5hE41wiSOPdOT5fbGOI/a1PlMj4js3w+CGNh4BBOjAU6fZj86tO9gwI27
 aRyR62ziteLck474PsYQbKx7J2qk2EARxO8QL9vHZhAsq3MBOck=
 =dSj/
 -----END PGP SIGNATURE-----

Merge tag 'v7.1-p4' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6

Pull crypto fixes from Herbert Xu:

 - Fix potential dead-lock in rhashtable when used by xattr

 - Avoid calling kvfree on atomic path in rhashtable

* tag 'v7.1-p4' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  rhashtable: Add bucket_table_free_atomic() helper
  mm/slab: Add kvfree_atomic() helper
  rhashtable: drop ht->mutex in rhashtable_free_and_destroy()
2026-05-15 10:38:37 -07:00
Qing Wang 67ea9d353d mm/slub: hold cpus_read_lock around flush_rcu_sheaves_on_cache()
flush_rcu_sheaves_on_cache() calls queue_work_on() in a
for_each_online_cpu() loop, which requires the cpu to stay online.
But cpus_read_lock() is not held in kvfree_rcu_barrier_on_cache() and the
set of "online cpus" is subject to change.

There are two paths that call flush_rcu_sheaves_on_cache():

  // has cpus_read_lock()
  flush_all_rcu_sheaves()
    -> flush_rcu_sheaves_on_cache()

  // no cpus_read_lock()
  kvfree_rcu_barrier_on_cache()
    -> flush_rcu_sheaves_on_cache()

Fix this by holding cpus_read_lock() in kvfree_rcu_barrier_on_cache().

Why not move cpus_read_lock() from flush_all_rcu_sheaves() into
flush_rcu_sheaves_on_cache()? The reason is it would introduce a new lock
order (slab_mutex -> cpu_hotplug_lock). The reverse order
(cpu_hotplug_lock -> slab_mutex) is established by

- cpuhp_setup_state_nocalls(..., slub_cpu_setup, ...)
- kmem_cache_destroy()

The two orders together would form an AB-BA deadlock.

Finally, add lockdep_assert_cpus_held() in flush_rcu_sheaves_on_cache()
to catch the same problem in the future.

Fixes: 0f35040de5 ("mm/slab: introduce kvfree_rcu_barrier_on_cache() for cache destruction")
Cc: <stable@vger.kernel.org>
Signed-off-by: Qing Wang <wangqing7171@gmail.com>
Link: https://patch.msgid.link/20260512035035.762317-1-wangqing7171@gmail.com
Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
2026-05-14 14:56:58 +02:00
Alistair Popple be3f38d05c mm/memory: fix spurious warning when unmapping device-private/exclusive pages
Device private and exclusive entries are only supported for anonymous
folios.  This condition is tested in __migrate_device_pages() and
make_device_exclusive() using folio_test_anon().  However the unmap path
tests this assumption using vma_is_anonymous().

This is wrong because whilst anonymous VMAs can only contain folios where
folio_test_anon() is true the opposite relation does not hold.  A folio
for which folio_test_anon() is true does not imply vma_is_anonymous() is
true.  Such a condition can occur if for example a folio is part of a
private filebacked mapping.

In this case vma_is_anonymous() is false as the mapping is filebacked, but
folio_test_anon() may be true, thus permitting devices to migrate the
folio to device private memory.  This can lead to the following spurious
warnings during process teardown:

[  772.737706] ------------[ cut here ]------------
[  772.739201] WARNING: mm/memory.c:1754 at unmap_page_range.cold+0x26/0x18a, CPU#17: hmm-tests/2041
[  772.742050] Modules linked in: test_hmm nvidia_uvm(O) nvidia(O)
[  772.743959] CPU: 17 UID: 0 PID: 2041 Comm: hmm-tests Tainted: G        W  O        7.0.0+ #387 PREEMPT(full)
[  772.747104] Tainted: [W]=WARN, [O]=OOT_MODULE
[  772.748509] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014
[  772.752117] RIP: 0010:unmap_page_range.cold+0x26/0x18a
[  772.753780] Code: 7e fe ff ff 48 89 4c 24 78 4c 89 44 24 38 e8 f2 ff b1 00 48 8b 4c 24 78 4c 8b 44 24 38 48 8b 44 24 18 48 83 78 48 00 74 04 90 <0f> 0b 90 48 89 ca b8 ff ff 37 00 48 c1 ea 03 48 c1 e0 2a 80 3c 02
[  772.759602] RSP: 0018:ffff888112607550 EFLAGS: 00010286
[  772.761310] RAX: ffff88811bbf4dc0 RBX: dffffc0000000000 RCX: ffffea03e9bfffd8
[  772.763583] RDX: 1ffff1102377e9c1 RSI: 0000000000000008 RDI: ffff88811bbf4e08
[  772.765914] RBP: 0000000000000006 R08: ffff8881059f7448 R09: ffffed10224c0e68
[  772.768184] R10: ffff888112607347 R11: 0000000000000001 R12: 0000000000000001
[  772.770461] R13: ffffea03e9bfffc0 R14: ffff888112607908 R15: ffffea03e9bfffc0
[  772.772782] FS:  00007f327caa2780(0000) GS:ffff888427b7d000(0000) knlGS:0000000000000000
[  772.775328] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  772.777187] CR2: 00007f327ca89000 CR3: 00000001994d5000 CR4: 00000000000006f0
[  772.779135] Call Trace:
[  772.779792]  <TASK>
[  772.780317]  ? dmirror_interval_invalidate+0x1a3/0x290 [test_hmm]
[  772.781873]  ? vm_normal_page_pud+0x2b0/0x2b0
[  772.782992]  ? __rwlock_init+0x150/0x150
[  772.784006]  ? lock_release+0x216/0x2b0
[  772.785008]  ? __mmu_notifier_invalidate_range_start+0x505/0x6e0
[  772.786522]  ? lock_release+0x216/0x2b0
[  772.787498]  ? unmap_single_vma+0xb6/0x210
[  772.788573]  unmap_vmas+0x27d/0x520
[  772.789506]  ? unmap_single_vma+0x210/0x210
[  772.790607]  ? mas_update_gap.part.0+0x620/0x620
[  772.791834]  unmap_region+0x19e/0x350
[  772.792769]  ? remove_vma+0x130/0x130
[  772.793684]  ? mas_alloc_nodes+0x1f2/0x300
[  772.794730]  vms_complete_munmap_vmas+0x8c1/0xe20
[  772.795926]  ? unmap_region+0x350/0x350
[  772.796917]  do_vmi_align_munmap+0x36a/0x4e0
[  772.798018]  ? lock_release+0x216/0x2b0
[  772.799024]  ? vma_shrink+0x620/0x620
[  772.799983]  do_vmi_munmap+0x150/0x2c0
[  772.800939]  __vm_munmap+0x161/0x2c0
[  772.801872]  ? expand_downwards+0xd60/0xd60
[  772.802948]  ? clockevents_program_event+0x1ef/0x540
[  772.804217]  ? lock_release+0x216/0x2b0
[  772.805158]  __x64_sys_munmap+0x59/0x80
[  772.805776]  do_syscall_64+0xfc/0x670
[  772.806336]  ? irqentry_exit+0xda/0x580
[  772.806976]  entry_SYSCALL_64_after_hwframe+0x4b/0x53
[  772.807772] RIP: 0033:0x7f327cbb2717
[  772.808323] Code: 73 01 c3 48 8b 0d f9 76 0d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 0b 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c9 76 0d 00 f7 d8 64 89 01 48
[  772.811337] RSP: 002b:00007ffde7f57d38 EFLAGS: 00000202 ORIG_RAX: 000000000000000b
[  772.812564] RAX: ffffffffffffffda RBX: 00007f327cc9c000 RCX: 00007f327cbb2717
[  772.813733] RDX: 0000000000000000 RSI: 0000000000400000 RDI: 00007f327c289000
[  772.814867] RBP: 0000000000421360 R08: 000000000000001a R09: 0000000000000000
[  772.815991] R10: 0000000000000003 R11: 0000000000000202 R12: 00007ffde7f57d74
[  772.817121] R13: 00007f327c689010 R14: 0000000000100000 R15: 00007f327c289000
[  772.818272]  </TASK>
[  772.818614] irq event stamp: 0
[  772.819159] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
[  772.820174] hardirqs last disabled at (0): [<ffffffff82a57ab3>] copy_process+0x19f3/0x6440
[  772.821511] softirqs last  enabled at (0): [<ffffffff82a57b00>] copy_process+0x1a40/0x6440
[  772.822869] softirqs last disabled at (0): [<0000000000000000>] 0x0
[  772.823871] ---[ end trace 0000000000000000 ]---

Fix this by using the same check for folio_test_anon() in
zap_nonpresent_ptes(). Also add a hmm-test case for this.

Link: https://lore.kernel.org/20260501065116.2057242-1-apopple@nvidia.com
Fixes: 999dad824c ("mm/shmem: persist uffd-wp bit across zapping for file-backed")
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reported-by: Arsen Arsenović <aarsenovic@baylibre.com>
Reviewed-by: Balbir Singh <balbirs@nvidia.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-13 17:40:03 -07:00
David Hildenbrand (Arm) c0c6ccd982 mm: fix __vm_normal_page() to handle missing support for pmd_special()/pud_special()
On x86 32-bit with THP enabled, zap_huge_pmd() is seen to generate a
"WARNING: mm/memory.c:735 at __vm_normal_page+0x6a/0x7d", from the
VM_WARN_ON_ONCE(is_zero_pfn(pfn) || is_huge_zero_pfn(pfn)); followed by
"BUG: Bad rss-counter state"s, then later "BUG: Bad page state"s when
reclaim gets to call shrink_huge_zero_folio_scan().

It's as if the _PAGE_SPECIAL bit never got set in the huge_zero pmd: and
indeed, whereas pte_special() and pte_mkspecial() are subject to a
dedicated CONFIG_ARCH_HAS_PTE_SPECIAL, pmd_special() and pmd_mkspecial()
are subject to CONFIG_ARCH_SUPPORTS_PMD_PFNMAP, which is never enabled on
any 32-bit architecture.

While the problem was exposed through commit d80a9cb1a6
("mm/huge_memory: add and use normal_or_softleaf_folio_pmd()"), it was an
oversight in commit af38538801 ("mm/memory: factor out common code from
vm_normal_page_*()") and would result in other problems:
* huge zero folio accounted in smaps, pagemap (PAGE_IS_FILE) and
  numamaps as file-backed THP
* folio_walk_start() returning the folio even without FW_ZEROPAGE set.
  Callers seem to tolerate that, though.

... and triggering the VM_WARN_ON_ONE(), although never reported so far.

To fix it, teach vm_normal_page_pmd()/vm_normal_page_pud() to consider
whether pmd_special/pud_special is actually implemented.

Link: https://lore.kernel.org/20260430-pmd_special-v1-1-dbcbcfd72c20@kernel.org
Fixes: af38538801 ("mm/memory: factor out common code from vm_normal_page_*()")
Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
Reported-by: Hugh Dickins <hughd@google.com>
Closes: https://lore.kernel.org/r/74a75b59-2e13-3985-ee99-d5521f39df2a@google.com
Reported-by: Bibo Mao <maobibo@loongson.cn>
Closes: https://lore.kernel.org/r/20260430041121.2839350-1-maobibo@loongson.cn
Debugged-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Tested-by: Bibo Mao <maobibo@loongson.cn>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-13 17:40:03 -07:00
Muchun Song 93866f55f7 mm/memory_hotplug: fix memory block reference leak on remove
Patch series "mm: Fix memory block leaks and locking", v2.

This series fixes two memory block device reference leaks and one locking
issue around the per-memory_block hwpoison counter.


This patch (of 2):

remove_memory_blocks_and_altmaps() looks up each memory block with
find_memory_block(), which acquires a reference to the memory block
device.

That reference is never dropped on this path, resulting in a leaked device
reference when removing memory blocks and their altmaps.  Drop the
reference after retrieving mem->altmap and clearing mem->altmap, before
removing the memory block device.

Link: https://lore.kernel.org/20260428085219.1316047-1-songmuchun@bytedance.com
Link: https://lore.kernel.org/20260428085219.1316047-2-songmuchun@bytedance.com
Fixes: 6b8f0798b8 ("mm/memory_hotplug: split memmap_on_memory requests across memblocks")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Oscar Salvador <osalvador@suse.de>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Huang, Ying" <huang.ying.caritas@gmail.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-13 17:40:03 -07:00
David Hildenbrand (Arm) 6a288a4ddb mm/page_alloc: fix initialization of tags of the huge zero folio with init_on_free
__GFP_ZEROTAGS semantics are currently a bit weird, but effectively this
flag is only ever set alongside __GFP_ZERO and __GFP_SKIP_KASAN.

If we run with init_on_free, we will zero out pages during
__free_pages_prepare(), to skip zeroing on the allocation path.

However, when allocating with __GFP_ZEROTAG set, post_alloc_hook() will
consequently not only skip clearing page content, but also skip clearing
tag memory.

Not clearing tags through __GFP_ZEROTAGS is irrelevant for most pages that
will get mapped to user space through set_pte_at() later: set_pte_at() and
friends will detect that the tags have not been initialized yet
(PG_mte_tagged not set), and initialize them.

However, for the huge zero folio, which will be mapped through a PMD
marked as special, this initialization will not be performed, ending up
exposing whatever tags were still set for the pages.

The docs (Documentation/arch/arm64/memory-tagging-extension.rst) state
that allocation tags are set to 0 when a page is first mapped to user
space.  That no longer holds with the huge zero folio when init_on_free is
enabled.

Fix it by decoupling __GFP_ZEROTAGS from __GFP_ZERO, passing to
tag_clear_highpages() whether we want to also clear page content.

Invert the meaning of the tag_clear_highpages() return value to have
clearer semantics.

Reproduced with the huge zero folio by modifying the check_buffer_fill
arm64/mte selftest to use a 2 MiB area, after making sure that pages have
a non-0 tag set when freeing (note that, during boot, we will not actually
initialize tags, but only set KASAN_TAG_KERNEL in the page flags).

	$ ./check_buffer_fill
	1..20
	...
	not ok 17 Check initial tags with private mapping, sync error mode and mmap memory
	not ok 18 Check initial tags with private mapping, sync error mode and mmap/mprotect memory
	...

This code needs more cleanups; we'll tackle that next, like
decoupling __GFP_ZEROTAGS from __GFP_SKIP_KASAN.

[akpm@linux-foundation.org: s/__GPF_ZERO/__GFP_ZERO/, per David]
Link: https://lore.kernel.org/20260421-zerotags-v2-1-05cb1035482e@kernel.org
Fixes: adfb6609c6 ("mm/huge_memory: initialise the tags of the huge zero folio")
Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Tested-by: Lance Yang <lance.yang@linux.dev>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Liam Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-13 17:40:02 -07:00
SeongJae Park d4e7b5c4cc mm/damon/sysfs-schemes: call missing mem_cgroup_iter_break()
damon_sysfs_memcg_path_to_id() breaks mem_cgroup_iter() loop without
calling mem_cgroup_iter_break().  This leaks the cgroup reference.  Fix
the issue by calling mem_cgroup_iter_break() before the break.

The issue was discovered [1] by Sashiko.

Link: https://lore.kernel.org/20260426173625.86521-1-sj@kernel.org
Link: https://lore.kernel.org/20260423004148.74722-1-sj@kernel.org [1]
Fixes: 29cbb9a13f ("mm/damon/sysfs-schemes: implement scheme filters")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org> # 6.3.x
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-13 17:40:00 -07:00
Sunny Patel 63451de16e mm/migrate_device: fix spinlock leak in migrate_vma_insert_huge_pmd_page
When check_stable_address_space() fails after the PMD spinlock has
been acquired via pmd_lock(), the code jumps directly to the abort
label, bypassing the spin_unlock() call in unlock_abort. This causes
the PMD spinlock to be permanently held, leading to a deadlock.

Change the goto target from abort to unlock_abort to ensure the
spinlock is always released on this error path.

Link: https://lore.kernel.org/20260425133537.17463-1-nueralspacetech@gmail.com
Fixes: a30b48bf1b ("mm/migrate_device: implement THP migration of zone device pages")
Signed-off-by: Sunny Patel <nueralspacetech@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Zi Yan <ziy@nvidia.com>
Acked-by: Balbir Singh <balbirs@nvidia.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-13 17:40:00 -07:00
Linus Torvalds 1f63dd8ca0 liveupdate fixes for v7.1-rc4
A few fixes for kexec handover and liveupdate:
 
 * make sure KHO is skipped for crash kernel
 * fix error reporting in memfd preservation if it fails mid-loop
 * don't allow preserving memfds whose page count exceeds UINT_MAX
 * fix documentation of memfd seals preservation to match the code
 -----BEGIN PGP SIGNATURE-----
 
 iQFEBAABCgAuFiEEeOVYVaWZL5900a/pOQOGJssO/ZEFAmoET5gQHHJwcHRAa2Vy
 bmVsLm9yZwAKCRA5A4Ymyw79kVfdB/99gLJy40MO9ZCHSxRQD9TE7Fbuv71flVuD
 wmDz43UOyDIEp+qCB0VcNQPG3v+UINygUMGHkhOG4fgKLm0bEORXIJHNr8sTXYYk
 LuxN8g+Xv1P/qkucEIXy1oB38okg9cORhlfrCOiwpWBjNt5/AqZYKWttDshuZiIM
 kjIKEDtTZ/nDLXjkWAa4Qs4MtBjqTVCrG3glSNHT0yiFDEkAejXbr4RZ/Ght/9pz
 FwHzTfdIOnecvOCD2OHVQx9TJluaP57mlxTkOXJV6OApg0wiHjohl0Xcerh+JfB4
 HAdF7xpr5Sk/BQVc3ygsDKwfTVfB/eYMfCoyUkXg9AVhcoBXmhD0
 =luI7
 -----END PGP SIGNATURE-----

Merge tag 'fixes-2026-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/liveupdate/linux

Pull liveupdate fixes from Mike Rapoport:
 "A few fixes for kexec handover and liveupdate:

   - make sure KHO is skipped for crash kernel

   - fix error reporting in memfd preservation if it fails mid-loop

   - don't allow preserving memfds whose page count exceeds UINT_MAX

   - fix documentation of memfd seals preservation to match the code"

* tag 'fixes-2026-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/liveupdate/linux:
  mm/memfd_luo: document preservation of file seals
  mm/memfd_luo: reject memfds whose page count exceeds UINT_MAX
  mm/memfd_luo: report error when restoring a folio fails mid-loop
  kho: skip KHO for crash kernel
2026-05-13 08:24:50 -07:00
Uladzislau Rezki (Sony) dad0d91cc2 mm/slab: Add kvfree_atomic() helper
kvmalloc() now supports non-sleeping GFP flags, including
the vmalloc fallback path. This means it may return vmalloc
memory even for GFP_ATOMIC and GFP_NOWAIT allocations.

Freeing such memory with kvfree() may then end up calling
vfree(), which is not safe for non-sleeping contexts.

Introduce kvfree_atomic() helper for such cases. It mirrors
kvfree(), but uses vfree_atomic() for vmalloced memory.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Acked-by: Harry Yoo (Oracle) <harry@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2026-05-05 16:12:07 +08:00
David Carlier 7b0b68b2b9
mm/memfd_luo: document preservation of file seals
Commit 8a552d68a8 ("mm: memfd_luo: preserve file seals") started
preserving file seals across live update and restoring them via
memfd_add_seals() on retrieve, but the DOC header was not updated and
still listed seals under "Non-Preserved Properties" as being unsealed
on restore.

Move the Seals entry to the "Preserved Properties" section and describe
the actual behavior, including the MEMFD_LUO_ALL_SEALS restriction that
both preserve and retrieve enforce.

Fixes: 8a552d68a8 ("mm: memfd_luo: preserve file seals")
Signed-off-by: David Carlier <devnexen@gmail.com>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Link: https://patch.msgid.link/20260423125648.152113-2-devnexen@gmail.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
2026-05-04 14:03:16 +00:00
David Carlier d581fc99d3
mm/memfd_luo: reject memfds whose page count exceeds UINT_MAX
memfd_luo_preserve_folios() declares max_folios as unsigned int and
computes it from the inode size, then passes it to memfd_pin_folios()
which itself caps max_folios at unsigned int.  For files whose base-page
count exceeds UINT_MAX (larger than 16 TiB with 4 KiB pages), the
assignment truncates silently: only a prefix of the file gets pinned and
preserved, while memfd_luo_preserve() still records the full inode size
in ser->size.  On retrieve the inode is restored to the full size but
only the preserved prefix repopulates the page cache, so the tail comes
back as holes and user data is silently lost across the live update.

Reject such files at preserve time with -EFBIG rather than chunk the
pin loop, which would also require enlarging the preserved folios array
well beyond what is practical.

Fixes: b3749f174d ("mm: memfd_luo: allow preserving memfd")
Signed-off-by: David Carlier <devnexen@gmail.com>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Link: https://patch.msgid.link/20260423125648.152113-1-devnexen@gmail.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
2026-05-04 14:03:15 +00:00
Linus Torvalds 811129272d slab fixes for 7.1-rc1
-----BEGIN PGP SIGNATURE-----
 
 iQFPBAABCAA5FiEEe7vIQRWZI0iWSE3xu+CwddJFiJoFAmn3BZgbFIAAAAAABAAO
 bWFudTIsMi41KzEuMTIsMiwyAAoJELvgsHXSRYiabzcIAIDyPybWZ/bRup/KfWGE
 GknYLaUA3nw9ZpZQdQ0cJ+jGO6pfoUqacO0cZ/5ppdcXblKz22AFItUnpsd44M9H
 92QjwNJoT6vjgzaSWCDE/6TeE0y27HjyhBlIYjs9mTRTb8sIH+mzfDS58FZponb/
 RZMGmcJvS4La+VzaetAxzyx+cX7J9W/7zDfvz/qPMpm8tysGe4z8H7/54c5x/U+c
 NvuEXaxPaKEa62x5e+IFG6G67NszbJnfRrLGn/UMwp+SS5WUId6D9kmjtT70yJlg
 hbdU6zw6kI+DIPxRidnpI/iJqcA08r9UuC/TrZwcjaM6sythev5UvOxZXMRVp19w
 qXU=
 =yAe1
 -----END PGP SIGNATURE-----

Merge tag 'slab-for-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab

Pull slab fixes from Vlastimil Babka:

 - Stable fixes for CONFIG_SMP=n where _nolock() allocations in NMI both
   at kmalloc and page allocator levels are not properly protected by
   the spin_trylock() semantics on !SMP (Harry Yoo)

* tag 'slab-for-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
  mm/slab: return NULL early from kmalloc_nolock() in NMI on UP
  mm/page_alloc: return NULL early from alloc_frozen_pages_nolock() in NMI on UP
2026-05-03 08:19:57 -07:00
Qi Zheng 99ebc509ee mm: memcontrol: fix rcu unbalance in get_non_dying_memcg_end()
Currently, get_non_dying_memcg_start() and get_non_dying_memcg_end() both
evaluate cgroup_subsys_on_dfl(memory_cgrp_subsys) independently to
determine whether to acquire or release the RCU read lock.

However, the result of cgroup_subsys_on_dfl() can change dynamically at
runtime due to cgroup hierarchy rebinding (e.g., when the memory
controller is moved between cgroup v1 and v2 hierarchies).  This can cause
the following warning:

 =====================================
 WARNING: bad unlock balance detected!
 7.0.0-next-20260420+ #83 Tainted: G        W
 -------------------------------------
 memcg-repro/270 is trying to release lock (rcu_read_lock) at:
 [<ffffffff815f57f7>] rcu_read_unlock+0x17/0x60
 but there are no more locks to release!

 other info that might help us debug this:
 1 lock held by memcg-repro/270:
  #0: ffff888102fa2088 (vm_lock){++++}-{0:0}, at: do_user_addr_fault+0x285/0x880

 stack backtrace:
 CPU: 0 UID: 0 PID: 270 Comm: memcg-repro Tainted: G        W           7.0.0-next-20260420+ #
 Tainted: [W]=WARN
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
 Call Trace:
  <TASK>
  ? rcu_read_unlock+0x17/0x60
  dump_stack_lvl+0x77/0xb0
  print_unlock_imbalance_bug+0xe0/0xf0
  ? rcu_read_unlock+0x17/0x60
  lock_release+0x21d/0x2a0
  rcu_read_unlock+0x1c/0x60
  do_pte_missing+0x233/0xb40
  __handle_mm_fault+0x80e/0xcd0
  handle_mm_fault+0x146/0x310
  do_user_addr_fault+0x303/0x880
  exc_page_fault+0x9b/0x270
  asm_exc_page_fault+0x26/0x30
 RIP: 0033:0x5590e4eb41ea
 Code: 61 cc 66 0f 6f e0 66 0f 61 c2 66 0f db cd 66 0f 69 e2 66 0f 6f d0 66 0f 69 d4 66 0f 61 0
 RSP: 002b:00007ffcad25f030 EFLAGS: 00010202
 RAX: 00005590e4eb8010 RBX: 00007ffcad260f7d RCX: 00007f73c474d44d
 RDX: 00005590e4eb80a0 RSI: 00005590e4eb503c RDI: 000000000000000f
 RBP: 00005590e4eb70a0 R08: 0000000000000000 R09: 00007f73c483a680
 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
 R13: 00007ffcad25f180 R14: 00005590e4eb6dd8 R15: 00007f73c4869020
  </TASK>
 ------------[ cut here ]------------

Fix this by explicitly tracking the RCU lock state, ensuring that
rcu_read_unlock() in get_non_dying_memcg_end() is strictly paired with the
lock acquisition, regardless of any runtime rebinding events.

Link: https://lore.kernel.org/20260429073105.44472-1-qi.zheng@linux.dev
Fixes: 8285917d6f ("mm: memcontrol: prepare for reparenting non-hierarchical stats")
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-30 06:13:20 -07:00
David Carlier 0fb1daf0b7 mm/memfd_luo: report error when restoring a folio fails mid-loop
memfd_luo_retrieve_folios() initialises err to -EIO, but the per-iteration
calls to mem_cgroup_charge(), shmem_add_to_page_cache() and
shmem_inode_acct_blocks() reuse and overwrite err.  Once any iteration
completes successfully, err becomes zero.

If a later iteration's kho_restore_folio() returns NULL, the failure path
jumps to put_folios without resetting err, so the function returns 0.
The caller memfd_luo_retrieve() then takes the success path, sets
args->file and reports the restore as successful, leaving userspace with
a partially populated memfd and no indication that anything went wrong.

Set err to -EIO in the kho_restore_folio() failure branch so the error
is propagated to the caller.

Signed-off-by: David Carlier <devnexen@gmail.com>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Fixes: b3749f174d ("mm: memfd_luo: allow preserving memfd")
Link: https://patch.msgid.link/20260415052300.362539-1-devnexen@gmail.com
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
2026-04-28 16:16:16 +03:00
David Carlier 292411fda2 mm/userfaultfd: detect VMA type change after copy retry in mfill_copy_folio_retry()
mfill_copy_folio_retry() drops mmap_lock for the copy_from_user() call.
During this window, the VMA can be replaced with a different type (e.g.
hugetlb), making the caller's ops pointer stale.  Subsequent use of the
stale ops would dispatch into the wrong per-vma handlers.

Capture the VMA's ops via vma_uffd_ops() before dropping the lock and
compare against the current vma_uffd_ops() after re-acquiring it.
Return -EAGAIN if they differ so the operation can be retried.  This
avoids comparing against the caller's ops which may have been
overridden to anon_uffd_ops for MAP_PRIVATE file-backed mappings.

Link: https://lore.kernel.org/20260424183638.196227-1-devnexen@gmail.com
Fixes: 6ab703034f ("userfaultfd: mfill_atomic(): remove retry logic")
Reported-by: Usama Arif <usama.arif@linux.dev>
Closes: https://lore.kernel.org/all/20260410114809.3592720-1-usama.arif@linux.dev/
Signed-off-by: David Carlier <devnexen@gmail.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-27 05:54:27 -07:00
SeongJae Park f98590bc08 mm/damon/stat: detect and use fresh enabled value
DAMON_STAT updates 'enabled' parameter value, which represents the running
status of its kdamond, when the user explicitly requests start/stop of the
kdamond.  The kdamond can, however, be stopped even if the user explicitly
requested the stop, if ctx->regions_score_histogram allocation failure at
beginning of the execution of the kdamond.  Hence, if the kdamond is
stopped by the allocation failure, the value of the parameter can be
stale.

Users could show the stale value and be confused.  The problem will only
rarely happen in real and common setups because the allocation is arguably
too small to fail.  Also, unlike the similar bugs that are now fixed in
DAMON_RECLAIM and DAMON_LRU_SORT, kdamond can be restarted in this case,
because DAMON_STAT force-updates the enabled parameter value for user
inputs.  The bug is a bug, though.

The issue stems from the fact that there are multiple events that can
change the status, and following all the events is challenging. 
Dynamically detect and use the fresh status for the parameters when those
are requested.

The issue was dicovered [1] by Sashiko.

Link: https://lore.kernel.org/20260419161003.79176-4-sj@kernel.org
Link: https://lore.kernel.org/20260416040602.88665-1-sj@kernel.org [1]
Fixes: 369c415e60 ("mm/damon: introduce DAMON_STAT module")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Liew Rui Yan <aethernet65535@gmail.com>
Cc: <stable@vger.kernel.org> # 6.17.x
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-27 05:54:27 -07:00
SeongJae Park b98b7ff602 mm/damon/lru_sort: detect and use fresh enabled and kdamond_pid values
DAMON_LRU_SORT updates 'enabled' and 'kdamond_pid' parameter values, which
represents the running status of its kdamond, when the user explicitly
requests start/stop of the kdamond.  The kdamond can, however, be stopped
in events other than the explicit user request in the following three
events.

1. ctx->regions_score_histogram allocation failure at beginning of the
   execution,
2. damon_commit_ctx() failure due to invalid user input, and
3. damon_commit_ctx() failure due to its internal allocation failures.

Hence, if the kdamond is stopped by the above three events, the values of
the status parameters can be stale.  Users could show the stale values and
be confused.  This is already bad, but the real consequence is worse. 
DAMON_LRU_SORT avoids unnecessary damon_start() and damon_stop() calls
based on the 'enabled' parameter value.  And the update of 'enabled'
parameter value depends on the damon_start() and damon_stop() call
results.  Hence, once the kdamond has stopped by the unintentional events,
the user cannot restart the kdamond before the system reboot.  For
example, the issue can be reproduced via below steps.

    # cd /sys/module/damon_lru_sort/parameters
    #
    # # start DAMON_LRU_SORT
    # echo Y > enabled
    # ps -ef | grep kdamond
    root         806       2  0 17:53 ?        00:00:00 [kdamond.0]
    root         808     803  0 17:53 pts/4    00:00:00 grep kdamond
    #
    # # commit wrong input to stop kdamond withou explicit stop request
    # echo 3 > addr_unit
    # echo Y > commit_inputs
    bash: echo: write error: Invalid argument
    #
    # # confirm kdamond is stopped
    # ps -ef | grep kdamond
    root         811     803  0 17:53 pts/4    00:00:00 grep kdamond
    #
    # # users casn now show stable status
    # cat enabled
    Y
    # cat kdamond_pid
    806
    #
    # # even after fixing the wrong parameter,
    # # kdamond cannot be restarted.
    # echo 1 > addr_unit
    # echo Y > enabled
    # ps -ef | grep kdamond
    root         815     803  0 17:54 pts/4    00:00:00 grep kdamond

The problem will only rarely happen in real and common setups for the
following reasons.  The allocation failures are unlikely in such setups
since those allocations are arguably too small to fail.  Also sane users
on real production environments may not commit wrong input parameters. 
But once it happens, the consequence is quite bad.  And the bug is a bug.

The issue stems from the fact that there are multiple events that can
change the status, and following all the events is challenging. 
Dynamically detect and use the fresh status for the parameters when those
are requested.

Link: https://lore.kernel.org/20260419161003.79176-3-sj@kernel.org
Fixes: 40e983cca9 ("mm/damon: introduce DAMON-based LRU-lists Sorting")
Co-developed-by: Liew Rui Yan <aethernet65535@gmail.com>
Signed-off-by: Liew Rui Yan <aethernet65535@gmail.com>
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org> # 6.0.x
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-27 05:54:27 -07:00
SeongJae Park 64a140afa5 mm/damon/reclaim: detect and use fresh enabled and kdamond_pid values
Patch series "mm/damon/modules: detect and use fresh status", v3.

DAMON modules including DAMON_RECLAIM, DAMON_LRU_SORT and DAMON_STAT
commonly expose the kdamond running status via their parameters.  Under
certain scenarios including wrong user inputs and memory allocation
failures, those parameter values can be stale.  It can confuse users.  For
DAMON_RECLAIM and DAMON_LRU_SORT, it even makes the kdamond unable to be
restarted before the system reboot.

The problem comes from the fact that there are multiple events for the
status changes and it is difficult to follow up all the scenarios.  Fix
the issue by detecting and using the status on demand, instead of using a
cached status that is difficult to be updated.

Patches 1-3 fix the bugs in DAMON_RECLAIM, DAMON_LRU_SORT and DAMON_STAT
in the order.


This patch (of 3):

DAMON_RECLAIM updates 'enabled' and 'kdamond_pid' parameter values, which
represents the running status of its kdamond, when the user explicitly
requests start/stop of the kdamond.  The kdamond can, however, be stopped
in events other than the explicit user request in the following three
events.

1. ctx->regions_score_histogram allocation failure at beginning of the
   execution,
2. damon_commit_ctx() failure due to invalid user input, and
3. damon_commit_ctx() failure due to its internal allocation failures.

Hence, if the kdamond is stopped by the above three events, the values of
the status parameters can be stale.  Users could show the stale values and
be confused.  This is already bad, but the real consequence is worse. 
DAMON_RECLAIM avoids unnecessary damon_start() and damon_stop() calls
based on the 'enabled' parameter value.  And the update of 'enabled'
parameter value depends on the damon_start() and damon_stop() call
results.  Hence, once the kdamond has stopped by the unintentional events,
the user cannot restart the kdamond before the system reboot.  For
example, the issue can be reproduced via below steps.

    # cd /sys/module/damon_reclaim/parameters
    #
    # # start DAMON_RECLAIM
    # echo Y > enabled
    # ps -ef | grep kdamond
    root         806       2  0 17:53 ?        00:00:00 [kdamond.0]
    root         808     803  0 17:53 pts/4    00:00:00 grep kdamond
    #
    # # commit wrong input to stop kdamond withou explicit stop request
    # echo 3 > addr_unit
    # echo Y > commit_inputs
    bash: echo: write error: Invalid argument
    #
    # # confirm kdamond is stopped
    # ps -ef | grep kdamond
    root         811     803  0 17:53 pts/4    00:00:00 grep kdamond
    #
    # # users casn now show stable status
    # cat enabled
    Y
    # cat kdamond_pid
    806
    #
    # # even after fixing the wrong parameter,
    # # kdamond cannot be restarted.
    # echo 1 > addr_unit
    # echo Y > enabled
    # ps -ef | grep kdamond
    root         815     803  0 17:54 pts/4    00:00:00 grep kdamond

The problem will only rarely happen in real and common setups for the
following reasons.  The allocation failures are unlikely in such setups
since those allocations are arguably too small to fail.  Also sane users
on real production environments may not commit wrong input parameters. 
But once it happens, the consequence is quite bad.  And the bug is a bug.

The issue stems from the fact that there are multiple events that can
change the status, and following all the events is challenging. 
Dynamically detect and use the fresh status for the parameters when those
are requested.

Link: https://lore.kernel.org/20260419161003.79176-1-sj@kernel.org
Link: https://lore.kernel.org/20260419161003.79176-2-sj@kernel.org
Fixes: e035c280f6 ("mm/damon/reclaim: support online inputs update")
Co-developed-by: Liew Rui Yan <aethernet65535@gmail.com>
Signed-off-by: Liew Rui Yan <aethernet65535@gmail.com>
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org> # 5.19.x
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-27 05:54:26 -07:00
SeongJae Park cf3b71421c mm/damon/sysfs-schemes: protect path kfree() with damon_sysfs_lock
damon_sysfs_quot_goal->path can be read and written by users, via DAMON
sysfs 'path' file.  It can also be indirectly read, for the parameters
{on,off}line committing to DAMON.  The reads for parameters committing are
protected by damon_sysfs_lock to avoid the sysfs files being destroyed
while any of the parameters are being read.  But the user-driven direct
reads and writes are not protected by any lock, while the write is
deallocating the path-pointing buffer.  As a result, the readers could
read the already freed buffer (user-after-free).  Note that the user-reads
don't race when the same open file is used by the writer, due to kernfs's
open file locking.  Nonetheless, doing the reads and writes with separate
open files would be common.  Fix it by protecting both the user-direct
reads and writes with damon_sysfs_lock.

Link: https://lore.kernel.org/20260423150253.111520-3-sj@kernel.org
Fixes: c41e253a41 ("mm/damon/sysfs-schemes: implement path file under quota goal directory")
Co-developed-by: Junxi Qian <qjx1298677004@gmail.com>
Signed-off-by: Junxi Qian <qjx1298677004@gmail.com>
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org> # 6.19.x
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-27 05:54:26 -07:00
SeongJae Park 1e68eb96e8 mm/damon/sysfs-schemes: protect memcg_path kfree() with damon_sysfs_lock
Patch series "mm/damon/sysfs-schemes: fix use-after-free for [memcg_]path".

Reads of 'memcg_path' and 'path' files in DAMON sysfs interface could race
with their writes, results in use-after-free.  Fix those.


This patch (of 2):

damon_sysfs_scheme_filter->mmecg_path can be read and written by users,
via DAMON sysfs memcg_path file.  It can also be indirectly read, for the
parameters {on,off}line committing to DAMON.  The reads for parameters
committing are protected by damon_sysfs_lock to avoid the sysfs files
being destroyed while any of the parameters are being read.  But the
user-driven direct reads and writes are not protected by any lock, while
the write is deallocating the memcg_path-pointing buffer.  As a result,
the readers could read the already freed buffer (user-after-free).  Note
that the user-reads don't race when the same open file is used by the
writer, due to kernfs's open file locking.  Nonetheless, doing the reads
and writes with separate open files would be common.  Fix it by protecting
both the user-direct reads and writes with damon_sysfs_lock.

Link: https://lore.kernel.org/20260423150253.111520-1-sj@kernel.org
Link: https://lore.kernel.org/20260423150253.111520-2-sj@kernel.org
Fixes: 4f489fe6af ("mm/damon/sysfs-schemes: free old damon_sysfs_scheme_filter->memcg_path on write")
Co-developed-by: Junxi Qian <qjx1298677004@gmail.com>
Signed-off-by: Junxi Qian <qjx1298677004@gmail.com>
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org> # 6.16.x
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-27 05:54:26 -07:00