mirror-linux/mm
Ankit Agrawal 2ec4196718 mm: handle poisoning of pfn without struct pages
Poison (or ECC) errors can be very common on a large cluster.  The
kernel MM currently does not handle ECC errors / poison on a memory region
that is not backed by struct pages.  If a memory region is mapped using
remap_pfn_range(), for example, but not added to the kernel, MM will not
have struct pages associated with it.  Add a new mechanism to handle
memory failure on such memory.

Make kernel MM expose a function that allows modules managing device
memory to register the device memory SPA and the address space associated
with it.  MM maintains this information as an interval tree.  On poison,
MM can search for the range that the poisoned PFN belongs to and use the
address_space to determine the mapping VMA.
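
The register-then-look-up flow described above can be modelled in a few
lines of userspace C.  This is only an illustrative sketch: the names
pfn_range, pfn_range_register() and pfn_range_lookup() are hypothetical and
are not the interface added by this commit, a fixed array with a linear scan
stands in for the kernel's interval tree, and rejecting overlapping ranges
is a choice made for the model rather than documented kernel behaviour.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model; not the kernel's data structures. */
#define MAX_RANGES 16

struct pfn_range {
	unsigned long start;	/* first PFN of the range (inclusive) */
	unsigned long last;	/* last PFN of the range (inclusive) */
	const void *mapping;	/* stand-in for the address_space pointer */
};

static struct pfn_range ranges[MAX_RANGES];
static size_t nr_ranges;

/* Register a device memory range.  Returns 0 on success, -1 on error. */
static int pfn_range_register(unsigned long start, unsigned long nr_pages,
			      const void *mapping)
{
	unsigned long last;

	assert(nr_ranges <= MAX_RANGES);
	if (nr_pages == 0 || nr_ranges == MAX_RANGES)
		return -1;

	last = start + nr_pages - 1;
	if (last < start)		/* the range wrapped around */
		return -1;

	/* Model-only policy: refuse ranges that overlap an existing one. */
	for (size_t i = 0; i < nr_ranges; i++) {
		if (start <= ranges[i].last && ranges[i].start <= last)
			return -1;
	}

	ranges[nr_ranges].start = start;
	ranges[nr_ranges].last = last;
	ranges[nr_ranges].mapping = mapping;
	nr_ranges++;
	return 0;
}

/* Find the registered range containing pfn, or NULL if there is none. */
static const struct pfn_range *pfn_range_lookup(unsigned long pfn)
{
	for (size_t i = 0; i < nr_ranges; i++) {
		if (pfn >= ranges[i].start && pfn <= ranges[i].last)
			return &ranges[i];
	}
	return NULL;
}
```

A real interval tree gives the same answer as pfn_range_lookup() in
logarithmic rather than linear time, which matters once many ranges are
registered.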

In this implementation, kernel MM follows a sequence that is largely
similar to the memory_failure() handler for struct page backed memory:

1. memory_failure() is triggered on reception of a poison error.  An
   absence of struct page is detected and consequently
   memory_failure_pfn() is executed.

2. memory_failure_pfn() collects the processes mapped to the PFN.

3. memory_failure_pfn() sends SIGBUS to all the processes mapping the
   faulty PFN using kill_procs().
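
The three steps above can be sketched as a small userspace model.  All
names here (model_mapping, collect_procs_model(), memory_failure_model())
are hypothetical stand-ins, not the kernel functions; the model only
returns the list of pids that would be signalled instead of sending a real
SIGBUS, and it assumes the caller already knows whether the PFN has a
struct page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one process mapping a PFN range (inclusive). */
struct model_mapping {
	int pid;
	unsigned long first_pfn;
	unsigned long last_pfn;
};

/* Step 2: gather the pids whose mapping covers the poisoned PFN. */
static size_t collect_procs_model(const struct model_mapping *maps,
				  size_t nr_maps, unsigned long pfn,
				  int *victims, size_t max_victims)
{
	size_t n = 0;

	for (size_t i = 0; i < nr_maps && n < max_victims; i++) {
		if (pfn >= maps[i].first_pfn && pfn <= maps[i].last_pfn)
			victims[n++] = maps[i].pid;
	}
	return n;
}

/*
 * Steps 1 and 3: take the PFN-only path when there is no struct page and
 * report which processes would receive SIGBUS.  Returns the number of
 * pids written to victims[], or -1 when the PFN has a struct page and
 * belongs to the regular path, which this model does not cover.
 */
static int memory_failure_model(bool has_struct_page,
				const struct model_mapping *maps,
				size_t nr_maps, unsigned long pfn,
				int *victims, size_t max_victims)
{
	assert(maps != NULL && victims != NULL);

	if (has_struct_page)
		return -1;

	/* Deliberately no unmap step: the mapping is left in place. */
	return (int)collect_procs_model(maps, nr_maps, pfn,
					victims, max_victims);
}
```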

Note that there is one primary difference versus the handling of poison
on struct pages: the faulty PFN is not unmapped.  This is done to handle
the huge PFNMAP support added recently [1] that enables VM_PFNMAP vmas to
map at PMD or PUD level.  Poison on a PFN mapped in such a way would
require breaking the PMD/PUD mapping into PTEs that will get mirrored
into the S2.  This can greatly increase the cost of table walks and have
a major performance impact.
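
To give a sense of scale for that cost, the following sketch counts how
many base-page entries replace a single block mapping once it is split.
The helper name ptes_per_block() is hypothetical, and the shifts used in
the usage note assume a configuration with 4 KiB base pages (shift 12),
2 MiB PMD mappings (shift 21) and 1 GiB PUD mappings (shift 30); other
page sizes give different numbers.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of base-page PTEs needed to cover one block mapping, given the
 * log2 sizes of the block and of a base page.
 */
static uint64_t ptes_per_block(unsigned int block_shift,
			       unsigned int page_shift)
{
	assert(block_shift >= page_shift && block_shift - page_shift < 64);
	return UINT64_C(1) << (block_shift - page_shift);
}
```

Under those assumptions, ptes_per_block(21, 12) is 512 and
ptes_per_block(30, 12) is 262144, so unmapping one poisoned PFN out of a
PUD-level mapping would replace a single entry with hundreds of thousands.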

Link: https://lore.kernel.org/all/20240826204353.2228736-1-peterx@redhat.com/ [1]
Link: https://lkml.kernel.org/r/20251102184434.2406-3-ankita@nvidia.com
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Cc: Aniket Agashe <aniketa@nvidia.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hanjun Guo <guohanjun@huawei.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Kirti Wankhede <kwankhede@nvidia.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Matthew R. Ochs <mochs@nvidia.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Neo Jia <cjia@nvidia.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Shuai Xue <xueshuai@linux.alibaba.com>
Cc: Smita Koralahalli Channabasappa <smita.koralahallichannabasappa@amd.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tarun Gupta <targupta@nvidia.com>
Cc: Uwe Kleine-König <u.kleine-koenig@baylibre.com>
Cc: Vikram Sethi <vsethi@nvidia.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-16 17:28:29 -08:00
..
damon mm/damon/sysfs: implement obsolete_target file 2025-11-16 17:28:23 -08:00
kasan kasan: cleanup of kasan_enabled() checks 2025-11-16 17:28:01 -08:00
kfence kfence: drop nth_page() usage 2025-09-21 14:22:09 -07:00
kmsan kmsan: remove hard-coded GFP_KERNEL flags 2025-11-16 17:27:54 -08:00
Kconfig mm: handle poisoning of pfn without struct pages 2025-11-16 17:28:29 -08:00
Kconfig.debug
Makefile mm: remove unused zpool layer 2025-09-21 14:21:59 -07:00
backing-dev.c fuse update for 6.18 2025-10-03 12:48:18 -07:00
balloon_compaction.c mm/migrate: fix NULL movable_ops if CONFIG_ZSMALLOC=m 2025-08-19 16:35:57 -07:00
bootmem_info.c
cma.c mm/cma: refuse handing out non-contiguous page ranges 2025-09-21 14:22:06 -07:00
cma.h mm: cma: set early_pfn and bitmap as a union in cma_memrange 2025-05-22 14:55:36 -07:00
cma_debug.c mm: cma: simplify cma_maxchunk_get() 2025-07-24 19:12:36 -07:00
cma_sysfs.c
compaction.c mm/compaction: fix low_pfn advance on isolating hugetlb 2025-09-28 11:51:29 -07:00
debug.c mm/debug: fix missing space in case statement 2025-11-16 17:28:29 -08:00
debug_page_alloc.c mm/debug_page_alloc: improve error message for invalid guardpage minorder 2025-05-12 23:50:38 -07:00
debug_page_ref.c
debug_vm_pgtable.c treewide: include linux/pgalloc.h instead of asm/pgalloc.h 2025-11-16 17:28:25 -08:00
dmapool.c docs: dma-api: replace consistent with coherent 2025-07-01 13:25:36 -06:00
dmapool_test.c
early_ioremap.c
execmem.c mm: remove PMD alignment constraint in execmem_vmalloc() 2025-09-28 11:51:31 -07:00
fadvise.c
fail_page_alloc.c
failslab.c
filemap.c treewide: include linux/pgalloc.h instead of asm/pgalloc.h 2025-11-16 17:28:25 -08:00
folio-compat.c
gup.c mm: replace READ_ONCE() with standard page table accessors 2025-11-16 17:27:56 -08:00
gup_test.c
gup_test.h
highmem.c mm: constify highmem related functions for improved const-correctness 2025-09-21 14:22:15 -07:00
hmm.c mm: replace READ_ONCE() with standard page table accessors 2025-11-16 17:27:56 -08:00
huge_memory.c treewide: include linux/pgalloc.h instead of asm/pgalloc.h 2025-11-16 17:28:25 -08:00
hugetlb.c treewide: include linux/pgalloc.h instead of asm/pgalloc.h 2025-11-16 17:28:25 -08:00
hugetlb_cgroup.c
hugetlb_cma.c mm: hugetlb: directly pass order when allocate a hugetlb folio 2025-09-21 14:22:11 -07:00
hugetlb_cma.h mm: hugetlb: directly pass order when allocate a hugetlb folio 2025-09-21 14:22:11 -07:00
hugetlb_vmemmap.c treewide: include linux/pgalloc.h instead of asm/pgalloc.h 2025-11-16 17:28:25 -08:00
hugetlb_vmemmap.h
hwpoison-inject.c mm/hwpoison: decouple hwpoison_filter from mm/memory-failure.c 2025-09-21 14:22:21 -07:00
init-mm.c
internal.h mm: introduce io_remap_pfn_range_[prepare, complete]() 2025-11-16 17:28:12 -08:00
interval_tree.c
ioremap.c
khugepaged.c treewide: include linux/pgalloc.h instead of asm/pgalloc.h 2025-11-16 17:28:25 -08:00
kmemleak.c mm: fix possible deadlock in kmemleak 2025-09-01 17:11:37 -07:00
ksm.c ksm: replace function unmerge_ksm_pages with break_ksm 2025-11-16 17:28:28 -08:00
list_lru.c mm, list_lru: refactor the locking code 2025-07-09 22:41:56 -07:00
maccess.c mm: unexport globally copy_to_kernel_nofault 2025-07-09 22:42:22 -07:00
madvise.c mm: clean up is_guard_pte_marker() 2025-10-03 16:42:43 -07:00
mapping_dirty_helpers.c mm/dirty: replace READ_ONCE() with pudp_get() 2025-11-16 17:27:58 -08:00
memblock.c kho: replace kho_preserve_phys() with kho_preserve_pages() 2025-10-07 13:48:55 -07:00
memcontrol-v1.c mm/memcg: v1: account event registrations and drop world-writable cgroup.event_control 2025-09-21 14:22:26 -07:00
memcontrol-v1.h memcg: move do_memsw_account() to CONFIG_MEMCG_V1 2025-03-21 22:03:11 -07:00
memcontrol.c memcg: manually uninline __memcg_memory_event 2025-11-16 17:28:16 -08:00
memfd.c mm/memfd: remove redundant casts 2025-09-21 14:22:00 -07:00
memory-failure.c mm: handle poisoning of pfn without struct pages 2025-11-16 17:28:29 -08:00
memory-tiers.c mm: fix some typos in mm module 2025-11-16 17:27:52 -08:00
memory.c treewide: include linux/pgalloc.h instead of asm/pgalloc.h 2025-11-16 17:28:25 -08:00
memory_hotplug.c drivers/base/node: fold unregister_node() into unregister_one_node() 2025-11-16 17:28:03 -08:00
mempolicy.c mm: mprotect: convert to folio_can_map_prot_numa() 2025-11-16 17:28:03 -08:00
mempool.c mm: mempool: fix crash in mempool_free() for zero-minimum pools 2025-08-02 12:06:13 -07:00
memremap.c mm/memremap: remove unused get_dev_pagemap() parameter 2025-09-21 14:22:21 -07:00
memtest.c
migrate.c mm/migrate, swap: drop usage of folio_index 2025-11-16 17:28:20 -08:00
migrate_device.c treewide: remove MIGRATEPAGE_SUCCESS 2025-09-13 16:54:50 -07:00
mincore.c mm, swap: use unified helper for swap cache look up 2025-09-21 14:22:22 -07:00
mlock.c mm: folio_may_be_lru_cached() unless folio_test_large() 2025-09-13 13:05:36 -07:00
mm_init.c drivers/base/node: fold register_node() into register_one_node() 2025-11-16 17:28:02 -08:00
mm_slot.h
mmap.c mm: consistently use current->mm in mm_get_unmapped_area() 2025-11-16 17:27:57 -08:00
mmap_lock.c mm: change vma_start_read() to drop RCU lock on failure 2025-09-13 16:54:43 -07:00
mmu_gather.c treewide: include linux/pgalloc.h instead of asm/pgalloc.h 2025-11-16 17:28:25 -08:00
mmu_notifier.c Update Christoph's Email address and make it consistent 2025-05-12 23:50:31 -07:00
mmzone.c mm: introduce memdesc_flags_t 2025-09-13 16:55:07 -07:00
mprotect.c mm: mprotect: convert to folio_can_map_prot_numa() 2025-11-16 17:28:03 -08:00
mremap.c treewide: include linux/pgalloc.h instead of asm/pgalloc.h 2025-11-16 17:28:25 -08:00
mseal.c mm/mseal: rework mseal apply logic 2025-08-02 12:06:09 -07:00
msync.c
nommu.c mm/nommu: convert kobjsize() to folios 2025-09-13 16:54:46 -07:00
numa.c mm/numa: remove unnecessary local variable in alloc_node_data() 2025-05-12 23:50:38 -07:00
numa_emulation.c mm: numa,memblock: Use SZ_1M macro to denote bytes to MB conversion 2025-08-20 16:31:23 +03:00
numa_memblks.c mm: numa,memblock: Use SZ_1M macro to denote bytes to MB conversion 2025-08-20 16:31:23 +03:00
oom_kill.c mm/oom_kill.c: fix inverted check 2025-09-23 14:14:16 -07:00
page-writeback.c fuse update for 6.18 2025-10-03 12:48:18 -07:00
page_alloc.c mm/page_alloc: don't warn about large allocations with __GFP_NOFAIL 2025-11-16 17:28:29 -08:00
page_counter.c
page_ext.c mm,page_ext: derive the node from the pfn 2025-07-13 16:38:16 -07:00
page_frag_cache.c
page_idle.c mm: always call rmap_walk() on locked folios 2025-11-16 17:28:00 -08:00
page_io.c mm, swap: tidy up swap device and cluster info helpers 2025-09-21 14:22:23 -07:00
page_isolation.c mm/page_isolation: drop __folio_test_movable() check for large folios 2025-07-13 16:38:29 -07:00
page_owner.c mm/page_owner: simplify zone iteration logic in init_early_allocated_pages() 2025-11-16 17:28:01 -08:00
page_poison.c
page_reporting.c
page_reporting.h
page_table_check.c mm/page_table_check: Batch-check pmds/puds just like ptes 2025-05-09 13:43:07 +01:00
page_vma_mapped.c mm/page_vma_mapped: track if the page is mapped across page table boundary 2025-09-28 11:51:29 -07:00
pagewalk.c Summary of significant series in this pull request: 2025-10-02 18:18:33 -07:00
percpu-internal.h
percpu-km.c mm/percpu-km: drop nth_page() usage within single allocation 2025-09-21 14:22:04 -07:00
percpu-stats.c mm: remove outdated filename comment in percpu-stats.c 2025-07-13 16:38:23 -07:00
percpu-vm.c kmsan: remove hard-coded GFP_KERNEL flags 2025-11-16 17:27:54 -08:00
percpu.c percpu: fix race on alloc failed warning limit 2025-09-08 23:45:10 -07:00
pgalloc-track.h
pgtable-generic.c treewide: include linux/pgalloc.h instead of asm/pgalloc.h 2025-11-16 17:28:25 -08:00
process_vm_access.c
pt_reclaim.c treewide: include linux/pgalloc.h instead of asm/pgalloc.h 2025-11-16 17:28:25 -08:00
ptdump.c mm/ptdump: replace READ_ONCE() with standard page table accessors 2025-11-16 17:27:52 -08:00
readahead.c readahead: add trace points 2025-09-21 14:22:28 -07:00
rmap.c mm: always call rmap_walk() on locked folios 2025-11-16 17:28:00 -08:00
rodata_test.c
secretmem.c mm: add vma_desc_size(), vma_desc_pages() helpers 2025-11-16 17:28:11 -08:00
shmem.c mm: shmem/tmpfs hugepage defaults config choice 2025-11-16 17:28:23 -08:00
shmem_quota.c
show_mem.c mm: re-enable kswapd when memory pressure subsides or demotion is toggled 2025-09-21 14:22:29 -07:00
shrinker.c
shrinker_debug.c
shuffle.c
shuffle.h
slab.h Summary of significant series in this pull request: 2025-10-02 18:18:33 -07:00
slab_common.c mm: fix some typos in mm module 2025-11-16 17:27:52 -08:00
slub.c mm: remove reference to destructor in comment in calculate_sizes() 2025-11-16 17:28:15 -08:00
sparse-vmemmap.c mm: replace READ_ONCE() with standard page table accessors 2025-11-16 17:27:56 -08:00
sparse.c mm: introduce memdesc_nid() 2025-09-13 16:55:07 -07:00
swap.c mm: lru_add_drain_all() do local lru_add_drain() first 2025-09-21 14:22:32 -07:00
swap.h mm/migrate, swap: drop usage of folio_index 2025-11-16 17:28:20 -08:00
swap_cgroup.c mm: swap_cgroup: remove double initialization of locals 2025-03-17 22:06:58 -07:00
swap_state.c mm, swap: fix potential UAF issue for VMA readahead 2025-11-15 10:52:02 -08:00
swap_table.h mm, swap: use a single page for swap table when the size fits 2025-09-21 14:22:25 -07:00
swapfile.c mm/swap: select swap device with default priority round robin 2025-11-16 17:28:27 -08:00
truncate.c mm/truncate: unmap large folio on split failure 2025-11-09 21:19:43 -08:00
usercopy.c
userfaultfd.c mm/userfaultfd: don't lock anon_vma when performing UFFDIO_MOVE 2025-11-16 17:28:00 -08:00
util.c mm: add ability to take further action in vm_area_desc 2025-11-16 17:28:12 -08:00
vma.c mm: add ability to take further action in vm_area_desc 2025-11-16 17:28:12 -08:00
vma.h mm/vma: remove unused function, make internal functions static 2025-11-16 17:28:10 -08:00
vma_exec.c mm/vma: use vmg->target to specify target VMA for new VMA merge 2025-07-09 22:42:11 -07:00
vma_init.c Summary of significant series in this pull request: 2025-10-02 18:18:33 -07:00
vma_internal.h
vmalloc.c mm/vmalloc: request large order pages from buddy allocator 2025-11-16 17:28:15 -08:00
vmpressure.c memcg: convert memcg->socket_pressure to u64 2025-07-24 19:12:32 -07:00
vmscan.c mm, swap: cleanup swap entry allocation parameter 2025-11-16 17:28:20 -08:00
vmstat.c mm: vmstat: output reserved_highatomic and free_highatomic in zoneinfo 2025-11-16 17:28:26 -08:00
workingset.c mm: introduce memdesc_flags_t 2025-09-13 16:55:07 -07:00
zpdesc.h mm: zpdesc: minor naming and comment corrections 2025-09-21 14:21:59 -07:00
zsmalloc.c mm: remove unused zpool layer 2025-09-21 14:21:59 -07:00
zswap.c mm/zswap: s/red-black tree/xarray/ 2025-11-16 17:27:57 -08:00