mirror-linux

History

David Hildenbrand (Red Hat) 8ce720d5bd mm/hugetlb: fix excessive IPI broadcasts when unsharing PMD tables using mmu_gather

As reported, ever since commit `1013af4f58` ("mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race") we can end up in some situations where we perform so many IPI broadcasts when unsharing hugetlb PMD page tables that it severely regresses some workloads.

In particular, when we fork()+exit(), or when we munmap() a large area backed by many shared PMD tables, we perform one IPI broadcast per unshared PMD table.

There are two optimizations to be had:

(1) When we process (unshare) multiple such PMD tables, such as during exit(), it is sufficient to send a single IPI broadcast (as long as we respect locking rules) instead of one per PMD table.

Locking prevents any of these PMD tables from getting reused before we drop the lock.

(2) When we are not the last sharer (> 2 users including us), there is no need to send the IPI broadcast. The shared PMD tables cannot become exclusive (fully unshared) before an IPI is broadcast by the last sharer.

Concurrent GUP-fast could walk into a PMD table just before we unshared it. It could then succeed in grabbing a page from the shared page table even after munmap() etc. succeeded (and suppressed an IPI). But there is no difference compared to GUP-fast just sleeping for a while after grabbing the page and re-enabling IRQs.

Most importantly, GUP-fast will never walk into page tables that are no longer shared, because the last sharer will issue an IPI broadcast.

(If ever required, checking whether the PUD changed in GUP-fast after grabbing the page, as we do in the PTE case, could handle this.)

So let's rework PMD sharing TLB flushing + IPI sync to use the mmu_gather infrastructure so we can implement these optimizations and demystify the code at least a bit. Extend the mmu_gather infrastructure to be able to deal with our special hugetlb PMD table sharing implementation.
To make initialization of the mmu_gather easier when working on a single VMA (in particular, when dealing with hugetlb), provide tlb_gather_mmu_vma().

We'll consolidate the handling for (full) unsharing of PMD tables in tlb_unshare_pmd_ptdesc() and tlb_flush_unshared_tables(), and track in "struct mmu_gather" whether we had (full) unsharing of PMD tables.

Because locking is very special (concurrent unsharing+reuse must be prevented), we disallow deferring flushing to tlb_finish_mmu() and instead require an explicit earlier call to tlb_flush_unshared_tables().

From hugetlb code, we call huge_pmd_unshare_flush() where we make sure that the expected lock protecting us from concurrent unsharing+reuse is still held.

Check with a VM_WARN_ON_ONCE() in tlb_finish_mmu() that tlb_flush_unshared_tables() was properly called earlier.

Document it all properly.

Notes about tlb_remove_table_sync_one() interaction with unsharing:

There are two fairly tricky things:

(1) tlb_remove_table_sync_one() is a NOP on architectures without CONFIG_MMU_GATHER_RCU_TABLE_FREE.

Here, the assumption is that the previous TLB flush would send an IPI to all relevant CPUs. Careful: some architectures like x86 only send IPIs to all relevant CPUs when tlb->freed_tables is set.

The relevant architectures should be selecting MMU_GATHER_RCU_TABLE_FREE, but x86 might not do that in stable kernels and it might have been problematic before this patch.

Also, the arch flushing behavior (independent of IPIs) is different when tlb->freed_tables is set. Do we have to enlighten them to also take care of tlb->unshared_tables? So far we didn't care, so hopefully we are fine. Of course, we could be setting tlb->freed_tables as well, but that might then unnecessarily flush too much, because the semantics of tlb->freed_tables are a bit fuzzy.

This patch changes nothing in this regard.

(2) tlb_remove_table_sync_one() is not a NOP on architectures with CONFIG_MMU_GATHER_RCU_TABLE_FREE that actually don't need a sync.
Take x86 as an example: in the common case (!pv, !X86_FEATURE_INVLPGB) we still issue IPIs during TLB flushes and don't actually need the second tlb_remove_table_sync_one().

This optimization can be implemented on top of this, by checking e.g., in tlb_remove_table_sync_one() whether we really need IPIs. But as described in (1), it really must honor tlb->freed_tables then to send IPIs to all relevant CPUs.

Notes on TLB flushing changes:

(1) Flushing for non-shared PMD tables

We're converting from flush_hugetlb_tlb_range() to tlb_remove_huge_tlb_entry(). Given that we properly initialize the MMU gather in tlb_gather_mmu_vma() to be hugetlb aware, similar to __unmap_hugepage_range(), that should be fine.

(2) Flushing for shared PMD tables

We're converting from various things (flush_hugetlb_tlb_range(), tlb_flush_pmd_range(), flush_tlb_range()) to tlb_flush_pmd_range().

tlb_flush_pmd_range() achieves the same thing that tlb_remove_huge_tlb_entry() would achieve in these scenarios. Note that tlb_remove_huge_tlb_entry() also calls __tlb_remove_tlb_entry(), however that is only implemented on powerpc, which does not support PMD table sharing.

Similar to (1), tlb_gather_mmu_vma() should make sure that TLB flushing keeps on working as expected.

Further, note that the ptdesc_pmd_pts_dec() in huge_pmd_share() is not a concern, as we are holding the i_mmap_lock the whole time, preventing concurrent unsharing. That ptdesc_pmd_pts_dec() usage will be removed separately as a cleanup later.

There are plenty more cleanups to be had, but they have to wait until this is fixed.
[david@kernel.org: fix kerneldoc]
Link: https://lkml.kernel.org/r/f223dd74-331c-412d-93fc-69e360a5006c@kernel.org
Link: https://lkml.kernel.org/r/20251223214037.580860-5-david@kernel.org
Fixes: `1013af4f58` ("mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race")
Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org>
Reported-by: "Uschakow, Stanislav" <suschako@amazon.de>
Closes: https://lore.kernel.org/all/4d3878531c76479d9f8ca9789dc6485d@amazon.de/
Tested-by: Laurence Oberman <loberman@redhat.com>
Acked-by: Harry Yoo <harry.yoo@oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Liu Shixin <liushixin2@huawei.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>		2026-01-20 09:34:26 -08:00
..
bitops	bitops: Add __attribute_const__ to generic ffs()-family implementations	2025-09-08 14:58:50 -07:00
vdso	vdso: Drop Kconfig GENERIC_VDSO_DATA_STORE	2025-09-04 11:23:50 +02:00
Kbuild	unwind_user: Add user space unwinding API with frame pointer support	2025-07-29 14:46:07 -04:00
access_ok.h	uaccess: remove CONFIG_SET_FS	2022-02-25 09:36:06 +01:00
agp.h	char/agp: introduce asm-generic/agp.h	2023-02-13 22:13:29 +01:00
archrandom.h	random: handle archrandom with multiple longs	2022-07-25 13:26:14 +02:00
asm-offsets.h	…
asm-prototypes.h	…
atomic.h	locking/atomic: make atomic*_{cmp,}xchg optional	2023-06-05 09:57:14 +02:00
atomic64.h	…
audit_change_attr.h	fs/xattr: add *at family syscalls	2024-11-06 12:59:44 -05:00
audit_dir_write.h	…
audit_read.h	…
audit_signal.h	…
audit_write.h	…
barrier.h	sched: Add missing memory barrier in switch_mm_cid	2024-04-16 13:59:45 +02:00
bitops.h	include: move find.h from asm_generic to linux	2022-01-15 08:47:31 -08:00
bitsperlong.h	…
bug.h	x86/bug: Implement WARN_ONCE()	2025-11-24 20:23:25 +01:00
cache.h	…
cacheflush.h	mm: Introduce flush_cache_vmap_early()	2023-12-14 00:23:17 -08:00
cfi.h	cfi: Flip headers	2023-12-15 16:25:55 -08:00
checksum.h	asm-generic: Improve csum_fold	2024-01-17 17:52:29 -08:00
cmpxchg-local.h	asm-generic: Fix 32 bit __generic_cmpxchg_local	2024-01-05 23:19:14 +01:00
cmpxchg.h	asm-generic: avoid __generic_cmpxchg_local warnings	2023-04-04 17:58:11 +02:00
codetag.lds.h	codetag: avoid unused alloc_tags sections/symbols	2025-07-09 22:42:14 -07:00
compat.h	asm-generic: compat: fix compat_arg_u64() and compat_arg_u64_dual()	2022-11-01 10:20:11 +11:00
current.h	asm-generic: current: Don't include thread-info.h if building asm	2023-08-26 22:38:49 +02:00
delay.h	delay: Fix ndelay() spuriously treated as udelay()	2024-11-29 11:40:22 +01:00
device.h	…
div64.h	__arch_xprod64(): make __always_inline when optimizing for performance	2024-10-28 21:44:28 +00:00
dma-mapping.h	dma-mapping: no need to pass a bus_type into get_arch_dma_ops()	2023-02-15 12:35:20 +01:00
dma.h	…
early_ioremap.h	mm/early_ioremap: add null pointer checks to prevent NULL-pointer dereference	2025-01-13 22:40:59 -08:00
emergency-restart.h	…
error-injection.h	docs: fault-injection: add requirements of error injectable functions	2023-02-02 22:50:00 -08:00
exec.h	…
extable.h	…
fixmap.h	fixmap: Remove unused set_fixmap_offset_io()	2024-07-11 17:41:23 +02:00
flat.h	…
fprobe.h	fprobe: Add fprobe_header encoding feature	2024-12-26 10:50:05 -05:00
ftrace.h	…
futex.h	futex: Fix additional regressions	2021-12-11 23:31:51 +01:00
getorder.h	…
hardirq.h	…
hugetlb.h	mm: correctly handle UFFD PTE markers	2025-11-24 15:08:50 -08:00
hw_irq.h	…
int-ll64.h	…
io.h	asm-generic/io.h: Skip trace helpers if rwmmio events are disabled	2025-09-24 16:21:13 +02:00
ioctl.h	…
iomap.h	asm-generic/io.h: rework split ioread64/iowrite64 helpers	2025-03-01 21:00:22 +01:00
irq.h	…
irq_regs.h	…
irq_work.h	…
irqflags.h	…
kdebug.h	…
kmap_size.h	…
kprobes.h	…
kvm_para.h	…
kvm_types.h	…
linkage.h	…
local.h	locking/generic: Wire up local{,64}_try_cmpxchg()	2023-04-29 09:09:09 +02:00
local64.h	locking/generic: Wire up local{,64}_try_cmpxchg()	2023-04-29 09:09:09 +02:00
logic_io.h	logic_io instance of iounmap() needs volatile on argument	2021-12-21 21:31:08 +01:00
mcs_spinlock.h	locking: Move MCS struct definition to public header	2025-03-18 10:28:21 -07:00
memory_model.h	mm: convert page_to_section() to memdesc_section()	2025-09-13 16:55:07 -07:00
mm_hooks.h	mm: remove arch_unmap()	2024-09-01 20:26:13 -07:00
mmiowb.h	…
mmiowb_types.h	…
mmu.h	…
mmu_context.h	…
mmzone.h	arch, mm: move definition of node_data to generic code	2024-09-03 21:15:28 -07:00
module.h	asm-generic: Always define Elf_Rel and Elf_Rela	2025-03-26 15:56:43 -07:00
module.lds.h	…
mshyperv.h	arch/x86: mshyperv: Trap on access for some synthetic MSRs	2025-11-15 06:18:14 +00:00
msi.h	irqchip/gic-v5: Add GICv5 IWB support	2025-07-08 18:35:52 +01:00
nommu_context.h	…
numa.h	arch_numa: switch over to numa_memblks	2024-09-03 21:15:32 -07:00
param.h	alpha: regularize the situation with asm/param.h	2025-06-24 22:02:05 -04:00
parport.h	…
pci.h	asm-generic: Add new pci.h and use it	2022-07-22 17:34:57 -05:00
pci_iomap.h	PCI: Stub __pci_ioport_map() for arches that don't support it at all	2022-07-29 12:01:00 -05:00
percpu.h	asm-generic: percpu: Add assembly guard	2025-10-27 16:41:53 +01:00
pgalloc.h	mm: remove unnecessary __GFP_HIGHMEM in __pd_alloc_one_()	2025-11-20 13:43:59 -08:00
pgtable-nop4d.h	…
pgtable-nopmd.h	mm: recover pud_leaf() definitions in nopmd case	2024-03-13 12:12:21 -07:00
pgtable-nopud.h	…
pgtable_uffd.h	mm: userfaultfd: add pgtable_supports_uffd_wp()	2025-11-24 15:08:54 -08:00
preempt.h	riscv: support PREEMPT_DYNAMIC with static keys	2023-08-31 00:18:34 -07:00
qrwlock.h	asm-generic changes for 5.19	2022-05-26 10:50:30 -07:00
qrwlock_types.h	locking/qrwlock: Change "queue rwlock" to "queued rwlock"	2022-05-11 16:27:04 +02:00
qspinlock.h	riscv: Add qspinlock support	2024-11-11 07:33:20 -08:00
qspinlock_types.h	…
resource.h	…
rqspinlock.h	rqspinlock: Enclose lock/unlock within lock entry acquisitions	2025-11-29 09:35:35 -08:00
runtime-const.h	runtime constants: add default dummy infrastructure	2024-06-19 12:34:34 -07:00
rwonce.h	rwonce: fix crash by removing READ_ONCE() for unaligned read	2025-03-26 22:16:50 +01:00
seccomp.h	…
sections.h	percpu: Remove __per_cpu_load	2025-02-18 10:16:00 +01:00
serial.h	…
set_memory.h	…
shmparam.h	…
signal.h	asm-generic: Remove empty #ifdef SA_RESTORER	2022-09-10 09:56:53 +02:00
simd.h	asm-generic: Add sched.h inclusion in simd.h	2025-05-30 20:56:48 +08:00
softirq_stack.h	asm-generic: Conditionally enable do_softirq_own_stack() via Kconfig.	2022-09-05 17:20:55 +02:00
spinlock.h	asm-generic: ticket-lock: Add separate ticket-lock.h	2024-11-11 07:33:17 -08:00
spinlock_types.h	asm-generic: ticket-lock: Reuse arch_spinlock_t of qspinlock	2024-11-11 07:33:16 -08:00
statfs.h	…
string.h	…
switch_to.h	…
syscall.h	syscall.h: introduce syscall_set_nr()	2025-05-11 17:48:15 -07:00
syscalls.h	syscalls: mmap(): use unsigned offset type consistently	2024-06-25 15:57:38 +02:00
text-patching.h	asm-generic: introduce text-patching.h	2024-11-07 14:25:15 -08:00
thread_info_tif.h	rseq: Switch to TIF_RSEQ if supported	2025-11-04 08:35:37 +01:00
ticket_spinlock.h	riscv: Add qspinlock support	2024-11-11 07:33:20 -08:00
timex.h	…
tlb.h	mm/hugetlb: fix excessive IPI broadcasts when unsharing PMD tables using mmu_gather	2026-01-20 09:34:26 -08:00
tlbflush.h	…
topology.h	…
trace_clock.h	…
uaccess.h	move asm/unaligned.h to linux/unaligned.h	2024-10-02 17:23:23 -04:00
unwind_user.h	unwind_user: Add user space unwinding API with frame pointer support	2025-07-29 14:46:07 -04:00
user.h	…
vermagic.h	…
vga.h	empty include/asm-generic/vga.h	2024-11-11 21:51:42 +01:00
video.h	arch: Rename fbdev header and source files	2024-05-03 17:07:50 +02:00
vmlinux.lds.h	Detect unused tracepoints for v6.19:	2025-12-05 09:37:41 -08:00
word-at-a-time.h	kernel.h: removed REPEAT_BYTE from kernel.h	2024-02-01 09:47:59 -08:00
xor.h	lib/xor: make xor prototypes more friendly to compiler vectorization	2022-02-11 20:39:39 +11:00