mirror-linux

Commit Graph

Author	SHA1	Message	Date
Linus Torvalds	14df4eb7e7	io_uring-6.19-20251211 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmk7RvYQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpr7UD/4qqOmmEq92nPJ9v9zeebrTCtO5Hah1HY11 D1Cc7KRqt2F6Jy1oN+LzQc1jXguLnQCxMuIRp1f3waTC/j4v4qfjjba+no/CZnLp jwC0UWEeXx8wVx3RkAg3qJJzixGEvJ4pWPYQCYp29LkCkzLL47y7VTBoVPxHiSog MpRGhTWjdFOLX4C/czm6kwpZM/2LkaG6dDpjZFIZm3inHTbP6QGBWdS+Q3VRzAJn YodteJifva4cefcEhP8zZeWLmzHMJLz0cQgnrhYG9michaCNoqM1y3TThvf6h3Ls 8ivwD+0PxzoQ32yImUfSOsqlmn+EI0kevaUVvN74vPT4d5RUxSCOR27z9fPxwI60 JE2xpPN5i6CICc8c1tJJDYDHvElM1C7cueKfr/PTusOvmiI8W+n3VTMl8rVuo25d 2k3ATGq3/E9yfx+/yu3jsjPVYPV52byPsDJ0vs3ixipxS4CdLlvwa3rIvyyrOatl Z6gFhbtTmB5LExSGfI/v+D6Lv7Pe1j6AHfo1RY64D0HR7FmnPMn/95WgTAXr28to vUnOWQ+XjFE4DzLaOZvHV170QajHh2oCBICOlkEdRWc1ubQJ96zTRpco74xGNcw6 Z4+3BxwotR205vbrQwiEIKaGfHDWK72hLdObX2wuPuiPZltchSHb4zZAdmavlUBV e2QzsEzaJQ== =ZPJt -----END PGP SIGNATURE----- Merge tag 'io_uring-6.19-20251211' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull io_uring fix from Jens Axboe: "Single fix for io_uring headed to stable, fixing an issue introduced with the min_wait support earlier this year, where SQPOLL didn't get correctly woken if an event arrived once the event waiting has finished the min_wait portion. As we already have regression tests for this added and people reporting new failures there, let's get this one flushed out so it can bubble back down to stable as well" * tag 'io_uring-6.19-20251211' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: io_uring: fix min_wait wakeups for SQPOLL	2025-12-12 22:01:32 +12:00
Jens Axboe	e15cb2200b	io_uring: fix min_wait wakeups for SQPOLL Using min_wait, two timeouts are given: 1) The min_wait timeout, within which up to 'wait_nr' events are waited for. 2) The overall long timeout, which is entered if no events are generated in the min_wait window. If the min_wait has expired, any event being posted must wake the task. For SQPOLL, that isn't the case, as it won't trigger the io_has_work() condition, as it will have already processed the task_work that happened when an event was posted. This causes any event to trigger post the min_wait to not always cause the waiting application to wakeup, and instead it will wait until the overall timeout has expired. This can be shown in a test case that has a 1 second min_wait, with a 5 second overall wait, even if an event triggers after 1.5 seconds: axboe@m2max-kvm /d/iouring-mre (master)> zig-out/bin/iouring info: MIN_TIMEOUT supported: true, features: 0x3ffff info: Testing: min_wait=1000ms, timeout=5s, wait_nr=4 info: 1 cqes in 5000.2ms where the expected result should be: axboe@m2max-kvm /d/iouring-mre (master)> zig-out/bin/iouring info: MIN_TIMEOUT supported: true, features: 0x3ffff info: Testing: min_wait=1000ms, timeout=5s, wait_nr=4 info: 1 cqes in 1500.3ms When the min_wait timeout triggers, reset the number of completions needed to wake the task. This should ensure that any future events will wake the task, regardless of how many events it originally wanted to wait for. Reported-by: Tip ten Brink <tip@tenbrinkmeijs.com> Cc: stable@vger.kernel.org Fixes: `1100c4a265` ("io_uring: add support for batch wait timeout") Link: https://github.com/axboe/liburing/issues/1477 Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-12-09 16:54:12 -07:00
Linus Torvalds	cfd4039213	io_uring-6.19-20251208 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmk3KXIQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpo91EACGlORRzg4FJXox8DcItdOQsZGFIqCXts9p SVtbxV6sPdsHwRB/xGTzHWP2iWUjA4+i5l3n4mt8vzGAmQU50gtdaIsJEMq7SOfB nJW0wNi905qcLihOfTpQ/2xpE5Am/iWPavFkAqOF7qo6GlS7aN47TIaHCPmAm3Nx Kla2XMDnneFhl8xCdnJHaLrzyD94xlArywG5UPjkgFGCmLEu2ZE6T9ivq86DHQZJ Ujy3ueMO/7SErfoDbY4I/gPs4ONxBaaieKycuyljQQB3n6sj15EBNB0TMDPA/Rwx Aq4WD/MC48titpxV2BT9RKCjYvJ4wsBww4uFLkCTKDlFCRH0pqclzgtd2iB46kge tj9KfTS9tkLBp9steMcw45FStu0iiHBwqqTcqUr1q/wzIPbPAQ/L/Mu6AlUOheW/ MmedhtPP22IShpkKYWSv923P2Qp2HhKa6LtoKJzxOK9rb6yoYvHl0zEQlKbWtPgq lpGzjbBoCtjqwlQKTpcH8diwaZ/fafrIP4h80Hg1pRiQEwzBgDpA3/N0EcfigkmU 2IgyH3k6F9v/IgyVPkpzNh4w6hrr9RnxVA8yaf2ItkfWKwajWJAtPLUBuING8qqa 3xg1MZ27NS6gUKEdCEy/mAaz8Vt2SGRUc3szHYrZHy7OFEW94WoiKAYK9qsZXGzX ms2VldIiQA== =Mbok -----END PGP SIGNATURE----- Merge tag 'io_uring-6.19-20251208' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull io_uring updates from Jens Axboe: "Followup set of fixes for io_uring for this merge window. These are either later fixes, or cleanups that don't make sense to defer. This pull request contains: - Fix for a recent regression in io-wq worker creation - Tracing cleanup - Use READ_ONCE/WRITE_ONCE consistently for ring mapped kbufs. Mostly for documentation purposes, indicating that they are shared with userspace - Fix for POLL_ADD losing a completion, if the request is updated and now is triggerable - eg, if POLLIN is set with the updated, and the polled file is readable - In conjunction with the above fix, also unify how poll wait queue entries are deleted with the head update. We had 3 different spots doing both the list deletion and head write, with one of them nicely documented. Abstract that into a helper and use it consistently - Small series from Joanne fixing an issue with buffer cloning, and cleaning up the arg validation" * tag 'io_uring-6.19-20251208' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: io_uring/poll: unify poll waitqueue entry and list removal io_uring/kbuf: use WRITE_ONCE() for userspace-shared buffer ring fields io_uring/kbuf: use READ_ONCE() for userspace-mapped memory io_uring/rsrc: fix lost entries after cloned range io_uring/rsrc: rename misleading src_node variable in io_clone_buffers() io_uring/rsrc: clean up buffer cloning arg validation io_uring/trace: rename io_uring_queue_async_work event "rw" field io_uring/io-wq: always retry worker create on ERESTART* io_uring/poll: correctly handle io_poll_add() return value on update	2025-12-09 09:07:28 +09:00
Linus Torvalds	4482ebb297	block-6.19-20251208 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmk3KZsQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpkNYD/91yqAJeehx2Heq3dWj9L8hDuETQelj/g9j gtZCiriAPy+bb1/BmWjK+BmvjtBt+g3a4Cwi6tVj4F1zoE46IPeLhO+2iJTEBiBq AhRtEf/MFXFK3qUnTpEnS8w3CtsXejOTB81VQ+6BysSu+B708m/1AQHv2HocZ37R jivrzfCsEdBr+ISwYw/EG5KcDBVTFo/JdXIhs7k4Z8bBfa3P5ye4EhKjORtgbFNU 5nXb78SZoWNCZF143YV++9MpZc3M2jzkzrk1CTLsUHhOxWg4T/6wTXfPGZc/W4m8 UBhs03u/gMJnKHhlZd4kpZWDito1TQZTdY2f5sBsysRQqeT7bwDK/1xiQ1nllZiP oYbeD6t65yMAlELwNFXo7y/DNcS2VLBMvChIX6p1gweEzyf23YneoHYyN5agEQlN 9C4EdcYzZRt0DwtHlIRtKvDk2LZzkJAcLau3D6ahU/DPLOawyWZKmvGiU+sSyJjF bEIO5c/+MLqkAgLAGaFgA4twFF1aYH9ssmJerDxprarkf1jtlOBLvUQ391Gtb5Hd B1yugmIgEwLbCFzhk9FlCtv2nQcWRCElnaeqv+Lv+xCBVPGCLm2qIHoTqmvHZPCd GbN/h0XLdgUboYPCFWVAX72/4K/cv+fQQcb+a7tiq6vMKcgJ/2I1szFGpFqz7azB hyiK0v3x2g== =r1xa -----END PGP SIGNATURE----- Merge tag 'block-6.19-20251208' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull block updates from Jens Axboe: "Followup set of fixes and updates for block for the 6.19 merge window. NVMe had some late minute debates which lead to dropping some patches from that tree, which is why the initial PR didn't have NVMe included. It's here now. This pull request contains: - NVMe pull request via Keith: - Subsystem usage cleanups (Max) - Endpoint device fixes (Shin'ichiro) - Debug statements (Gerd) - FC fabrics cleanups and fixes (Daniel) - Consistent alloc API usages (Israel) - Code comment updates (Chu) - Authentication retry fix (Justin) - Fix a memory leak in the discard ioctl code, if the task is being interrupted by a signal at just the wrong time - Zoned write plugging fixes - Add ioctls for for persistent reservations - Enable per-cpu bio caching by default - Various little fixes and tweaks" * tag 'block-6.19-20251208' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (27 commits) nvme-fabrics: add ENOKEY to no retry criteria for authentication failures nvme-auth: use kvfree() for memory allocated with kvcalloc() nvmet-tcp: use kvcalloc for commands array nvmet-rdma: use kvcalloc for commands and responses arrays nvme: fix typo error in nvme target nvmet-fc: use pr_* print macros instead of dev_* nvmet-fcloop: remove unused lsdir member. nvmet-fcloop: check all request and response have been processed nvme-fc: check all request and response have been processed block: fix memory leak in __blkdev_issue_zero_pages block: fix comment for op_is_zone_mgmt() to include RESET_ALL block: Clear BLK_ZONE_WPLUG_PLUGGED when aborting plugged BIOs blk-mq: Abort suspend when wakeup events are pending blk-mq: add blk_rq_nr_bvec() helper block: add IOC_PR_READ_RESERVATION ioctl block: add IOC_PR_READ_KEYS ioctl nvme: reject invalid pr_read_keys() num_keys values scsi: sd: reject invalid pr_read_keys() num_keys values block: enable per-cpu bio cache by default block: use bio_alloc_bioset for passthru IO by default ...	2025-12-09 08:53:24 +09:00
Linus Torvalds	7203ca412f	Significant patch series in this merge are as follows: - The 10 patch series "__vmalloc()/kvmalloc() and no-block support" from Uladzislau Rezki reworks the vmalloc() code to support non-blocking allocations (GFP_ATOIC, GFP_NOWAIT). - The 2 patch series "ksm: fix exec/fork inheritance" from xu xin fixes a rare case where the KSM MMF_VM_MERGE_ANY prctl state is not inherited across fork/exec. - The 4 patch series "mm/zswap: misc cleanup of code and documentations" from SeongJae Park does some light maintenance work on the zswap code. - The 5 patch series "mm/page_owner: add debugfs files 'show_handles' and 'show_stacks_handles'" from Mauricio Faria de Oliveira enhances the /sys/kernel/debug/page_owner debug feature. It adds unique identifiers to differentiate the various stack traces so that userspace monitoring tools can better match stack traces over time. - The 2 patch series "mm/page_alloc: pcp->batch cleanups" from Joshua Hahn makes some minor alterations to the page allocator's per-cpu-pages feature. - The 2 patch series "Improve UFFDIO_MOVE scalability by removing anon_vma lock" from Lokesh Gidra addresses a scalability issue in userfaultfd's UFFDIO_MOVE operation. - The 2 patch series "kasan: cleanups for kasan_enabled() checks" from Sabyrzhan Tasbolatov performs some cleanup in the KASAN code. - The 2 patch series "drivers/base/node: fold node register and unregister functions" from Donet Tom cleans up the NUMA node handling code a little. - The 4 patch series "mm: some optimizations for prot numa" from Kefeng Wang provides some cleanups and small optimizations to the NUMA allocation hinting code. - The 5 patch series "mm/page_alloc: Batch callers of free_pcppages_bulk" from Joshua Hahn addresses long lock hold times at boot on large machines. These were causing (harmless) softlockup warnings. - The 2 patch series "optimize the logic for handling dirty file folios during reclaim" from Baolin Wang removes some now-unnecessary work from page reclaim. - The 10 patch series "mm/damon: allow DAMOS auto-tuned for per-memcg per-node memory usage" from SeongJae Park enhances the DAMOS auto-tuning feature. - The 2 patch series "mm/damon: fixes for address alignment issues in DAMON_LRU_SORT and DAMON_RECLAIM" from Quanmin Yan fixes DAMON_LRU_SORT and DAMON_RECLAIM with certain userspace configuration. - The 15 patch series "expand mmap_prepare functionality, port more users" from Lorenzo Stoakes enhances the new(ish) file_operations.mmap_prepare() method and ports additional callsites from the old ->mmap() over to ->mmap_prepare(). - The 8 patch series "Fix stale IOTLB entries for kernel address space" from Lu Baolu fixes a bug (and possible security issue on non-x86) in the IOMMU code. In some situations the IOMMU could be left hanging onto a stale kernel pagetable entry. - The 4 patch series "mm/huge_memory: cleanup __split_unmapped_folio()" from Wei Yang cleans up and optimizes the folio splitting code. - The 5 patch series "mm, swap: misc cleanup and bugfix" from Kairui Song implements some cleanups and a minor fix in the swap discard code. - The 8 patch series "mm/damon: misc documentation fixups" from SeongJae Park does as advertised. - The 9 patch series "mm/damon: support pin-point targets removal" from SeongJae Park permits userspace to remove a specific monitoring target in the middle of the current targets list. - The 2 patch series "mm: MISC follow-up patches for linux/pgalloc.h" from Harry Yoo implements a couple of cleanups related to mm header file inclusion. - The 2 patch series "mm/swapfile.c: select swap devices of default priority round robin" from Baoquan He improves the selection of swap devices for NUMA machines. - The 3 patch series "mm: Convert memory block states (MEM_) macros to enums" from Israel Batista changes the memory block labels from macros to enums so they will appear in kernel debug info. - The 3 patch series "ksm: perform a range-walk to jump over holes in break_ksm" from Pedro Demarchi Gomes addresses an inefficiency when KSM unmerges an address range. - The 22 patch series "mm/damon/tests: fix memory bugs in kunit tests" from SeongJae Park fixes leaks and unhandled malloc() failures in DAMON userspace unit tests. - The 2 patch series "some cleanups for pageout()" from Baolin Wang cleans up a couple of minor things in the page scanner's writeback-for-eviction code. - The 2 patch series "mm/hugetlb: refactor sysfs/sysctl interfaces" from Hui Zhu moves hugetlb's sysfs/sysctl handling code into a new file. - The 9 patch series "introduce VM_MAYBE_GUARD and make it sticky" from Lorenzo Stoakes makes the VMA guard regions available in /proc/pid/smaps and improves the mergeability of guarded VMAs. - The 2 patch series "mm: perform guard region install/remove under VMA lock" from Lorenzo Stoakes reduces mmap lock contention for callers performing VMA guard region operations. - The 2 patch series "vma_start_write_killable" from Matthew Wilcox starts work in permitting applications to be killed when they are waiting on a read_lock on the VMA lock. - The 11 patch series "mm/damon/tests: add more tests for online parameters commit" from SeongJae Park adds additional userspace testing of DAMON's "commit" feature. - The 9 patch series "mm/damon: misc cleanups" from SeongJae Park does that. - The 2 patch series "make VM_SOFTDIRTY a sticky VMA flag" from Lorenzo Stoakes addresses the possible loss of a VMA's VM_SOFTDIRTY flag when that VMA is merged with another. - The 16 patch series "mm: support device-private THP" from Balbir Singh introduces support for Transparent Huge Page (THP) migration in zone device-private memory. - The 3 patch series "Optimize folio split in memory failure" from Zi Yan optimizes folio split operations in the memory failure code. - The 2 patch series "mm/huge_memory: Define split_type and consolidate split support checks" from Wei Yang provides some more cleanups in the folio splitting code. - The 16 patch series "mm: remove is_swap_[pte, pmd]() + non-swap entries, introduce leaf entries" from Lorenzo Stoakes cleans up our handling of pagetable leaf entries by introducing the concept of 'software leaf entries', of type softleaf_t. - The 4 patch series "reparent the THP split queue" from Muchun Song reparents the THP split queue to its parent memcg. This is in preparation for addressing the long-standing "dying memcg" problem, wherein dead memcg's linger for too long, consuming memory resources. - The 3 patch series "unify PMD scan results and remove redundant cleanup" from Wei Yang does a little cleanup in the hugepage collapse code. - The 6 patch series "zram: introduce writeback bio batching" from Sergey Senozhatsky improves zram writeback efficiency by introducing batched bio writeback support. - The 4 patch series "memcg: cleanup the memcg stats interfaces" from Shakeel Butt cleans up our handling of the interrupt safety of some memcg stats. - The 4 patch series "make vmalloc gfp flags usage more apparent" from Vishal Moola cleans up vmalloc's handling of incoming GFP flags. - The 6 patch series "mm: Add soft-dirty and uffd-wp support for RISC-V" from Chunyan Zhang teches soft dirty and userfaultfd write protect tracking to use RISC-V's Svrsw60t59b extension. - The 5 patch series "mm: swap: small fixes and comment cleanups" from Youngjun Park fixes a small bug and cleans up some of the swap code. - The 4 patch series "initial work on making VMA flags a bitmap" from Lorenzo Stoakes starts work on converting the vma struct's flags to a bitmap, so we stop running out of them, especially on 32-bit. - The 2 patch series "mm/swapfile: fix and cleanup swap list iterations" from Youngjun Park addresses a possible bug in the swap discard code and cleans things up a little. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaTEb0wAKCRDdBJ7gKXxA jjfIAP94W4EkCCwNOupnChoG+YWw/JW21anXt5NN+i5svn1yugEAwzvv6A+cAFng o+ug/fyrfPZG7PLp2R8WFyGIP0YoBA4= =IUzS -----END PGP SIGNATURE----- Merge tag 'mm-stable-2025-12-03-21-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "__vmalloc()/kvmalloc() and no-block support" (Uladzislau Rezki) Rework the vmalloc() code to support non-blocking allocations (GFP_ATOIC, GFP_NOWAIT) "ksm: fix exec/fork inheritance" (xu xin) Fix a rare case where the KSM MMF_VM_MERGE_ANY prctl state is not inherited across fork/exec "mm/zswap: misc cleanup of code and documentations" (SeongJae Park) Some light maintenance work on the zswap code "mm/page_owner: add debugfs files 'show_handles' and 'show_stacks_handles'" (Mauricio Faria de Oliveira) Enhance the /sys/kernel/debug/page_owner debug feature by adding unique identifiers to differentiate the various stack traces so that userspace monitoring tools can better match stack traces over time "mm/page_alloc: pcp->batch cleanups" (Joshua Hahn) Minor alterations to the page allocator's per-cpu-pages feature "Improve UFFDIO_MOVE scalability by removing anon_vma lock" (Lokesh Gidra) Address a scalability issue in userfaultfd's UFFDIO_MOVE operation "kasan: cleanups for kasan_enabled() checks" (Sabyrzhan Tasbolatov) "drivers/base/node: fold node register and unregister functions" (Donet Tom) Clean up the NUMA node handling code a little "mm: some optimizations for prot numa" (Kefeng Wang) Cleanups and small optimizations to the NUMA allocation hinting code "mm/page_alloc: Batch callers of free_pcppages_bulk" (Joshua Hahn) Address long lock hold times at boot on large machines. These were causing (harmless) softlockup warnings "optimize the logic for handling dirty file folios during reclaim" (Baolin Wang) Remove some now-unnecessary work from page reclaim "mm/damon: allow DAMOS auto-tuned for per-memcg per-node memory usage" (SeongJae Park) Enhance the DAMOS auto-tuning feature "mm/damon: fixes for address alignment issues in DAMON_LRU_SORT and DAMON_RECLAIM" (Quanmin Yan) Fix DAMON_LRU_SORT and DAMON_RECLAIM with certain userspace configuration "expand mmap_prepare functionality, port more users" (Lorenzo Stoakes) Enhance the new(ish) file_operations.mmap_prepare() method and port additional callsites from the old ->mmap() over to ->mmap_prepare() "Fix stale IOTLB entries for kernel address space" (Lu Baolu) Fix a bug (and possible security issue on non-x86) in the IOMMU code. In some situations the IOMMU could be left hanging onto a stale kernel pagetable entry "mm/huge_memory: cleanup __split_unmapped_folio()" (Wei Yang) Clean up and optimize the folio splitting code "mm, swap: misc cleanup and bugfix" (Kairui Song) Some cleanups and a minor fix in the swap discard code "mm/damon: misc documentation fixups" (SeongJae Park) "mm/damon: support pin-point targets removal" (SeongJae Park) Permit userspace to remove a specific monitoring target in the middle of the current targets list "mm: MISC follow-up patches for linux/pgalloc.h" (Harry Yoo) A couple of cleanups related to mm header file inclusion "mm/swapfile.c: select swap devices of default priority round robin" (Baoquan He) improve the selection of swap devices for NUMA machines "mm: Convert memory block states (MEM_) macros to enums" (Israel Batista) Change the memory block labels from macros to enums so they will appear in kernel debug info "ksm: perform a range-walk to jump over holes in break_ksm" (Pedro Demarchi Gomes) Address an inefficiency when KSM unmerges an address range "mm/damon/tests: fix memory bugs in kunit tests" (SeongJae Park) Fix leaks and unhandled malloc() failures in DAMON userspace unit tests "some cleanups for pageout()" (Baolin Wang) Clean up a couple of minor things in the page scanner's writeback-for-eviction code "mm/hugetlb: refactor sysfs/sysctl interfaces" (Hui Zhu) Move hugetlb's sysfs/sysctl handling code into a new file "introduce VM_MAYBE_GUARD and make it sticky" (Lorenzo Stoakes) Make the VMA guard regions available in /proc/pid/smaps and improves the mergeability of guarded VMAs "mm: perform guard region install/remove under VMA lock" (Lorenzo Stoakes) Reduce mmap lock contention for callers performing VMA guard region operations "vma_start_write_killable" (Matthew Wilcox) Start work on permitting applications to be killed when they are waiting on a read_lock on the VMA lock "mm/damon/tests: add more tests for online parameters commit" (SeongJae Park) Add additional userspace testing of DAMON's "commit" feature "mm/damon: misc cleanups" (SeongJae Park) "make VM_SOFTDIRTY a sticky VMA flag" (Lorenzo Stoakes) Address the possible loss of a VMA's VM_SOFTDIRTY flag when that VMA is merged with another "mm: support device-private THP" (Balbir Singh) Introduce support for Transparent Huge Page (THP) migration in zone device-private memory "Optimize folio split in memory failure" (Zi Yan) "mm/huge_memory: Define split_type and consolidate split support checks" (Wei Yang) Some more cleanups in the folio splitting code "mm: remove is_swap_[pte, pmd]() + non-swap entries, introduce leaf entries" (Lorenzo Stoakes) Clean up our handling of pagetable leaf entries by introducing the concept of 'software leaf entries', of type softleaf_t "reparent the THP split queue" (Muchun Song) Reparent the THP split queue to its parent memcg. This is in preparation for addressing the long-standing "dying memcg" problem, wherein dead memcg's linger for too long, consuming memory resources "unify PMD scan results and remove redundant cleanup" (Wei Yang) A little cleanup in the hugepage collapse code "zram: introduce writeback bio batching" (Sergey Senozhatsky) Improve zram writeback efficiency by introducing batched bio writeback support "memcg: cleanup the memcg stats interfaces" (Shakeel Butt) Clean up our handling of the interrupt safety of some memcg stats "make vmalloc gfp flags usage more apparent" (Vishal Moola) Clean up vmalloc's handling of incoming GFP flags "mm: Add soft-dirty and uffd-wp support for RISC-V" (Chunyan Zhang) Teach soft dirty and userfaultfd write protect tracking to use RISC-V's Svrsw60t59b extension "mm: swap: small fixes and comment cleanups" (Youngjun Park) Fix a small bug and clean up some of the swap code "initial work on making VMA flags a bitmap" (Lorenzo Stoakes) Start work on converting the vma struct's flags to a bitmap, so we stop running out of them, especially on 32-bit "mm/swapfile: fix and cleanup swap list iterations" (Youngjun Park) Address a possible bug in the swap discard code and clean things up a little [ This merge also reverts commit `ebb9aeb980` ("vfio/nvgrace-gpu: register device memory for poison handling") because it looks broken to me, I've asked for clarification - Linus ] * tag 'mm-stable-2025-12-03-21-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits) mm: fix vma_start_write_killable() signal handling mm/swapfile: use plist_for_each_entry in __folio_throttle_swaprate mm/swapfile: fix list iteration when next node is removed during discard fs/proc/task_mmu.c: fix make_uffd_wp_huge_pte() huge pte handling mm/kfence: add reboot notifier to disable KFENCE on shutdown memcg: remove inc/dec_lruvec_kmem_state helpers selftests/mm/uffd: initialize char variable to Null mm: fix DEBUG_RODATA_TEST indentation in Kconfig mm: introduce VMA flags bitmap type tools/testing/vma: eliminate dependency on vma->__vm_flags mm: simplify and rename mm flags function for clarity mm: declare VMA flags by bit zram: fix a spelling mistake mm/page_alloc: optimize lowmem_reserve max lookup using its semantic monotonicity mm/vmscan: skip increasing kswapd_failures when reclaim was boosted pagemap: update BUDDY flag documentation mm: swap: remove scan_swap_map_slots() references from comments mm: swap: change swap_alloc_slow() to void mm, swap: remove redundant comment for read_swap_cache_async mm, swap: use SWP_SOLIDSTATE to determine if swap is rotational ...	2025-12-05 13:52:43 -08:00
Jens Axboe	55d57b3bcc	io_uring/poll: unify poll waitqueue entry and list removal For some cases, the order in which the waitq entry list and head writing happens is important, for others it doesn't really matter. But it's somewhat confusing to have them spread out over the file. Abstract out the nicely documented code in io_pollfree_wake() and move it into a helper, and use that helper consistently rather than having other call sites manually do the same thing. While at it, correct a comment function name as well. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-12-05 10:23:28 -07:00
Joanne Koong	a4c694bfc2	io_uring/kbuf: use WRITE_ONCE() for userspace-shared buffer ring fields buf->addr and buf->len reside in memory shared with userspace. They should be written with WRITE_ONCE() to guarantee atomic stores and prevent tearing or other unsafe compiler optimizations. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Cc: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-12-05 09:52:02 -07:00
Caleb Sander Mateos	78385c7299	io_uring/kbuf: use READ_ONCE() for userspace-mapped memory The struct io_uring_buf elements in a buffer ring are in a memory region accessible from userspace. A malicious/buggy userspace program could therefore write to them at any time, so they should be accessed with READ_ONCE() in the kernel. Commit `98b6fa62c8` ("io_uring/kbuf: always use READ_ONCE() to read ring provided buffer lengths") already switched the reads of the len field to READ_ONCE(). Do the same for bid and addr. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Fixes: `c7fb19428d` ("io_uring: add support for ring mapped supplied buffers") Cc: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-12-04 15:46:23 -07:00
Joanne Koong	525916ce49	io_uring/rsrc: fix lost entries after cloned range When cloning with node replacements (IORING_REGISTER_DST_REPLACE), destination entries after the cloned range are not copied over. Add logic to copy them over to the new destination table. Fixes: `c1329532d5` ("io_uring/rsrc: allow cloning with node replacements") Cc: stable@vger.kernel.org Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-12-04 15:46:13 -07:00
Joanne Koong	e29af2aba2	io_uring/rsrc: rename misleading src_node variable in io_clone_buffers() The variable holds nodes from the destination ring's existing buffer table. In io_clone_buffers(), the term "src" is used to refer to the source ring. Rename to node for clarity. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-12-04 15:46:13 -07:00
Joanne Koong	b8201b50e4	io_uring/rsrc: clean up buffer cloning arg validation Get rid of some redundant checks and move the src arg validation to before the buffer table allocation, which simplifies error handling. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-12-04 15:46:13 -07:00
Fengnan Chang	48f22f8093	block: enable per-cpu bio cache by default Since after commit `12e4e8c7ab` ("io_uring/rw: enable bio caches for IRQ rw"), bio_put is safe for task and irq context, bio_alloc_bioset is safe for task context and no one calls in irq context, so we can enable per cpu bio cache by default. Benchmarked with t/io_uring and ext4+nvme: taskset -c 6 /root/fio/t/io_uring -p0 -d128 -b4096 -s1 -c1 -F1 -B1 -R1 -X1 -n1 -P1 /mnt/testfile base IOPS is 562K, patch IOPS is 574K. The CPU usage of bio_alloc_bioset decrease from 1.42% to 1.22%. The worst case is allocate bio in CPU A but free in CPU B, still use t/io_uring and ext4+nvme: base IOPS is 648K, patch IOPS is 647K. Also use fio test ext4/xfs with libaio/sync/io_uring on null_blk and nvme, no obvious performance regression. Signed-off-by: Fengnan Chang <changfengnan@bytedance.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-12-04 07:19:24 -07:00
Caleb Sander Mateos	34c78b8610	io_uring/io-wq: always retry worker create on ERESTART* If a task has a pending signal when create_io_thread() is called, copy_process() will return -ERESTARTNOINTR. io_should_retry_thread() will request a retry of create_io_thread() up to WORKER_INIT_LIMIT = 3 times. If all retries fail, the io_uring request will fail with ECANCELED. Commit 3918315c5dc ("io-wq: backoff when retrying worker creation") added a linear backoff to allow the thread to handle its signal before the retry. However, a thread receiving frequent signals may get unlucky and have a signal pending at every retry. Since the userspace task doesn't control when it receives signals, there's no easy way for it to prevent the create_io_thread() failure due to pending signals. The task may also lack the information necessary to regenerate the canceled SQE. So always retry the create_io_thread() on the ERESTART* errors, analogous to what a fork() syscall would do. EAGAIN can occur due to various persistent conditions such as exceeding RLIMIT_NPROC, so respect the WORKER_INIT_LIMIT retry limit for EAGAIN errors. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-12-04 07:18:02 -07:00
Jens Axboe	84230ad2d2	io_uring/poll: correctly handle io_poll_add() return value on update When the core of io_uring was updated to handle completions consistently and with fixed return codes, the POLL_REMOVE opcode with updates got slightly broken. If a POLL_ADD is pending and then POLL_REMOVE is used to update the events of that request, if that update causes the POLL_ADD to now trigger, then that completion is lost and a CQE is never posted. Additionally, ensure that if an update does cause an existing POLL_ADD to complete, that the completion value isn't always overwritten with -ECANCELED. For that case, whatever io_poll_add() set the value to should just be retained. Cc: stable@vger.kernel.org Fixes: `97b388d70b` ("io_uring: handle completions in the core") Reported-by: syzbot+641eec6b7af1f62f2b99@syzkaller.appspotmail.com Tested-by: syzbot+641eec6b7af1f62f2b99@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-12-04 07:18:01 -07:00
Linus Torvalds	0abcfd8983	for-6.19/io_uring-20251201 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmktsm0QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpiLvD/0dptgeJyLHKchOtRHzi/UvtM/EuNFKJrvI LBWCyIMjygxsVfPR41Lave9SE3UpcavF8Mg/EddasTci8VlMcDF8zPxWLb289Lz2 tkp/wOVuyYmDhNXKmKNW59NOPTd0NosEJFTZI4VhMudwx+UtAHELJGfBWW5hRyQB Md+UwZ2+J9HbYd19mToaDFxz7jpIPLEE4BYUGtljveRUdpnxhyFGGUS2+CQXZt/5 lnRvJmmEv4nSGH9ZRksix1xnV6KvJM0UwYQhrWvXhgwyiKu47zG7ONpd39KqoaRw Fw+6zZd0t7nyyuZkk15cKNnBLnjilnsCzmdcPq0Cuvkmbf6y1hlhEQQTGWXTKfJx zCZxEZcnCC4wL0CBQjZjS38AEMfH2p76M/36+NTWtlYCibY7qUtd9ndpUr49BYGo o4qfT0HMpI1PHuUvpZwpMcf4OX5qvtLmavT9vt78uqmtM+Aryzzuy3bI3S2SGjNe if/cNHnZc8Z06hUqdEit5NW+lYzj642AoF/j7qH9ADDH+VXRWaCdK/iI8tPaEpDV Rw6j442eVugS5tDPoTjdO8jsJ9+OCNNV1t/Jxy+Or+zrGdq7lfg4mnzEia1/izy5 8MnSubRy6LEd+I5PnK/9y9mPIwFMIFgULi+mUjucAhJjRj5beiG74eR6+jBAdyp1 GhFvN6fwdw== =4g/f -----END PGP SIGNATURE----- Merge tag 'for-6.19/io_uring-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull io_uring updates from Jens Axboe: - Unify how task_work cancelations are detected, placing it in the task_work running state rather than needing to check the task state - Series cleaning up and moving the cancelation code to where it belongs, in cancel.c - Cleanup of waitid and futex argument handling - Add support for mixed sized SQEs. 6.18 added support for mixed sized CQEs, improving flexibility and efficiency of workloads that need big CQEs. This adds similar support for SQEs, where the occasional need for a 128b SQE doesn't necessitate having all SQEs be 128b in size - Introduce zcrx and SQ/CQ layout queries. The former returns what zcrx features are available. And both return the ring size information to help with allocation size calculation for user provided rings like IORING_SETUP_NO_MMAP and IORING_MEM_REGION_TYPE_USER - Zcrx updates for 6.19. It includes a bunch of small patches, IORING_REGISTER_ZCRX_CTRL and RQ flushing and David's work on sharing zcrx b/w multiple io_uring instances - Series cleaning up ring initializations, notable deduplicating ring size and offset calculations. It also moves most of the checking before doing any allocations, making the code simpler - Add support for getsockname and getpeername, which is mostly a trivial hookup after a bit of refactoring on the networking side - Various fixes and cleanups * tag 'for-6.19/io_uring-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (68 commits) io_uring: Introduce getsockname io_uring cmd socket: Split out a getsockname helper for io_uring socket: Unify getsockname and getpeername implementation io_uring/query: drop unused io_handle_query_entry() ctx arg io_uring/kbuf: remove obsolete buf_nr_pages and update comments io_uring/register: use correct location for io_rings_layout io_uring/zcrx: share an ifq between rings io_uring/zcrx: add io_fill_zcrx_offsets() io_uring/zcrx: export zcrx via a file io_uring/zcrx: move io_zcrx_scrub() and dependencies up io_uring/zcrx: count zcrx users io_uring/zcrx: add sync refill queue flushing io_uring/zcrx: introduce IORING_REGISTER_ZCRX_CTRL io_uring/zcrx: elide passing msg flags io_uring/zcrx: use folio_nr_pages() instead of shift operation io_uring/zcrx: convert to use netmem_desc io_uring/query: introduce rings info query io_uring/query: introduce zcrx query io_uring: move cq/sq user offset init around io_uring: pre-calculate scq layout ...	2025-12-03 18:58:57 -08:00
Linus Torvalds	1b5dd29869	vfs-6.19-rc1.fd_prepare.fs -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaSmOZwAKCRCRxhvAZXjc op0AAP4oNVJkFyvgKoPos5K2EGFB8M7merGhpYtsOoeg8UK6OwD/UySQErHsXQDR sUDDa5uFOhfrkcfM8REtAN4wF8p5qAc= =QgFD -----END PGP SIGNATURE----- Merge tag 'vfs-6.19-rc1.fd_prepare.fs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull fd prepare updates from Christian Brauner: "This adds the FD_ADD() and FD_PREPARE() primitive. They simplify the common pattern of get_unused_fd_flags() + create file + fd_install() that is used extensively throughout the kernel and currently requires cumbersome cleanup paths. FD_ADD() - For simple cases where a file is installed immediately: fd = FD_ADD(O_CLOEXEC, vfio_device_open_file(device)); if (fd < 0) vfio_device_put_registration(device); return fd; FD_PREPARE() - For cases requiring access to the fd or file, or additional work before publishing: FD_PREPARE(fdf, O_CLOEXEC, sync_file->file); if (fdf.err) { fput(sync_file->file); return fdf.err; } data.fence = fd_prepare_fd(fdf); if (copy_to_user((void __user )arg, &data, sizeof(data))) return -EFAULT; return fd_publish(fdf); The primitives are centered around struct fd_prepare. FD_PREPARE() encapsulates all allocation and cleanup logic and must be followed by a call to fd_publish() which associates the fd with the file and installs it into the caller's fdtable. If fd_publish() isn't called, both are deallocated automatically. FD_ADD() is a shorthand that does fd_publish() immediately and never exposes the struct to the caller. I've implemented this in a way that it's compatible with the cleanup infrastructure while also being usable separately. IOW, it's centered around struct fd_prepare which is aliased to class_fd_prepare_t and so we can make use of all the basica guard infrastructure" tag 'vfs-6.19-rc1.fd_prepare.fs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (42 commits) io_uring: convert io_create_mock_file() to FD_PREPARE() file: convert replace_fd() to FD_PREPARE() vfio: convert vfio_group_ioctl_get_device_fd() to FD_ADD() tty: convert ptm_open_peer() to FD_ADD() ntsync: convert ntsync_obj_get_fd() to FD_PREPARE() media: convert media_request_alloc() to FD_PREPARE() hv: convert mshv_ioctl_create_partition() to FD_ADD() gpio: convert linehandle_create() to FD_PREPARE() pseries: port papr_rtas_setup_file_interface() to FD_ADD() pseries: convert papr_platform_dump_create_handle() to FD_ADD() spufs: convert spufs_gang_open() to FD_PREPARE() papr-hvpipe: convert papr_hvpipe_dev_create_handle() to FD_PREPARE() spufs: convert spufs_context_open() to FD_PREPARE() net/socket: convert __sys_accept4_file() to FD_ADD() net/socket: convert sock_map_fd() to FD_ADD() net/kcm: convert kcm_ioctl() to FD_PREPARE() net/handshake: convert handshake_nl_accept_doit() to FD_PREPARE() secretmem: convert memfd_secret() to FD_ADD() memfd: convert memfd_create() to FD_ADD() bpf: convert bpf_token_create() to FD_PREPARE() ...	2025-12-01 17:32:07 -08:00
Linus Torvalds	1885cdbfbb	vfs-6.19-rc1.iomap -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaSmOZAAKCRCRxhvAZXjc ooCXAQCwzX2GS/55QHV6JXBBoNxguuSQ5dCj91ZmTfHzij0xNAEAhKEBw7iMGX72 c2/x+xYf+Pc6mAfxdus5RLMggqBFPAk= =jInB -----END PGP SIGNATURE----- Merge tag 'vfs-6.19-rc1.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull iomap updates from Christian Brauner: "FUSE iomap Support for Buffered Reads: This adds iomap support for FUSE buffered reads and readahead. This enables granular uptodate tracking with large folios so only non-uptodate portions need to be read. Also fixes a race condition with large folios + writeback cache that could cause data corruption on partial writes followed by reads. - Refactored iomap read/readahead bio logic into helpers - Added caller-provided callbacks for read operations - Moved buffered IO bio logic into new file - FUSE now uses iomap for read_folio and readahead Zero Range Folio Batch Support: Add folio batch support for iomap_zero_range() to handle dirty folios over unwritten mappings. Fix raciness issues where dirty data could be lost during zero range operations. - filemap_get_folios_tag_range() helper for dirty folio lookup - Optional zero range dirty folio processing - XFS fills dirty folios on zero range of unwritten mappings - Removed old partial EOF zeroing optimization DIO Write Completions from Interrupt Context: Restore pre-iomap behavior where pure overwrite completions run inline rather than being deferred to workqueue. Reduces context switches for high-performance workloads like ScyllaDB. - Removed unused IOCB_DIO_CALLER_COMP code - Error completions always run in user context (fixes zonefs) - Reworked REQ_FUA selection logic - Inverted IOMAP_DIO_INLINE_COMP to IOMAP_DIO_OFFLOAD_COMP Buffered IO Cleanups: Some performance and code clarity improvements: - Replace manual bitmap scanning with find_next_bit() - Simplify read skip logic for writes - Optimize pending async writeback accounting - Better variable naming - Documentation for iomap_finish_folio_write() requirements Misaligned Vectors for Zoned XFS: Enables sub-block aligned vectors in XFS always-COW mode for zoned devices via new IOMAP_DIO_FSBLOCK_ALIGNED flag. Bug Fixes: - Allocate s_dio_done_wq for async reads (fixes syzbot report after error completion changes) - Fix iomap_read_end() for already uptodate folios (regression fix)" * tag 'vfs-6.19-rc1.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (40 commits) iomap: allocate s_dio_done_wq for async reads as well iomap: fix iomap_read_end() for already uptodate folios iomap: invert the polarity of IOMAP_DIO_INLINE_COMP iomap: support write completions from interrupt context iomap: rework REQ_FUA selection iomap: always run error completions in user context fs, iomap: remove IOCB_DIO_CALLER_COMP iomap: use find_next_bit() for uptodate bitmap scanning iomap: use find_next_bit() for dirty bitmap scanning iomap: simplify when reads can be skipped for writes iomap: simplify ->read_folio_range() error handling for reads iomap: optimize pending async writeback accounting docs: document iomap writeback's iomap_finish_folio_write() requirement iomap: account for unaligned end offsets when truncating read range iomap: rename bytes_pending/bytes_accounted to bytes_submitted/bytes_not_submitted xfs: support sub-block aligned vectors in always COW mode iomap: add IOMAP_DIO_FSBLOCK_ALIGNED flag xfs: error tag to force zeroing on debug kernels iomap: remove old partial eof zeroing optimization xfs: fill dirty folios on zero range of unwritten mappings ...	2025-12-01 08:14:00 -08:00
Christian Brauner	6fb1022918	io_uring: convert io_create_mock_file() to FD_PREPARE() Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-45-b6efa1706cfd@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-11-28 12:42:36 +01:00
Gabriel Krisman Bertazi	5d24321e4c	io_uring: Introduce getsockname io_uring cmd Introduce a socket-specific io_uring_cmd to support getsockname/getpeername via io_uring. I made this an io_uring_cmd instead of a new operation to avoid polluting the command namespace with what is exclusively a socket operation. In addition, since we don't need to conform to existing interfaces, this merges the getsockname/getpeername in a single operation, since the implementation is pretty much the same. This has been frequently requested, for instance at [1] and more recently in the project Discord channel. The main use-case is to support fixed socket file descriptors. [1] https://github.com/axboe/liburing/issues/1356 Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-26 13:45:23 -07:00
Caleb Sander Mateos	1e93de9205	io_uring/query: drop unused io_handle_query_entry() ctx arg io_handle_query_entry() doesn't use its struct io_ring_ctx *ctx argument. So remove it from the function and its callers. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-26 09:37:10 -07:00
Pavel Begunkov	f6dc5a3619	io_uring: fix mixed cqe overflow handling I started to see zcrx data corruptions. That turned out to be due to CQ tail pointing to a stale entry which happened to be from a zcrx request. I.e. the tail is incremented without the CQE memory being changed. The culprit is __io_cqring_overflow_flush() passing "cqe32=true" to io_get_cqe_overflow() for non-mixed CQE32 setups, which only expects it to be set for mixed 32B CQEs and not for SETUP_CQE32. The fix is slightly hacky, long term it's better to unify mixed and CQE32 handling. Fixes: `e26dca67fd` ("io_uring: add support for IORING_SETUP_CQE_MIXED") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-25 07:03:45 -07:00
Christoph Hellwig	f9f8514999	fs, iomap: remove IOCB_DIO_CALLER_COMP This was added by commit `099ada2c87` ("io_uring/rw: add write support for IOCB_DIO_CALLER_COMP") and disabled a little later by commit `838b35bb6a` ("io_uring/rw: disable IOCB_DIO_CALLER_COMP") because it didn't work. Remove all the related code that sat unused for 2 years. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20251113170633.1453259-2-hch@lst.de Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-11-25 10:22:18 +01:00
Jens Axboe	f6041803a8	io_uring/net: ensure vectored buffer node import is tied to notification When support for vectored registered buffers was added, the import itself is using 'req' rather than the notification io_kiocb, sr->notif. For non-vectored imports, sr->notif is correctly used. This is important as the lifetime of the two may be different. Use the correct io_kiocb for the vectored buffer import. Cc: stable@vger.kernel.org Fixes: `23371eac7d` ("io_uring/net: implement vectored reg bufs for zctx") Reported-by: Google Big Sleep <big-sleep-vuln-reports+bigsleep-463332873@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-24 10:59:02 -07:00
Joanne Koong	84692a1519	io_uring/kbuf: remove obsolete buf_nr_pages and update comments The buf_nr_pages field in io_buffer_list was previously used to determine whether the buffer list uses ring-provided buffers or classic provided buffers. This is now determined by checking the IOBL_BUF_RING flag. Remove the buf_nr_pages field and update related comments. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-20 13:23:05 -07:00
Jens Axboe	46447367a5	io_uring/cmd_net: fix wrong argument types for skb_queue_splice() If timestamp retriving needs to be retried and the local list of SKB's already has entries, then it's spliced back into the socket queue. However, the arguments for the splice helper are transposed, causing exactly the wrong direction of splicing into the on-stack list. Fix that up. Cc: stable@vger.kernel.org Reported-by: Google Big Sleep <big-sleep-vuln-reports+bigsleep-462435176@google.com> Fixes: `9e4ed359b8` ("io_uring/netcmd: add tx timestamping cmd support") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-20 11:40:15 -07:00
Jens Axboe	f779ac0b87	io_uring/register: use correct location for io_rings_layout A previous consolidated the ring size etc calculations into io_prepare_config(), but missed updating io_register_resize_rings() correctly to use the calculated values. As a result, it ended up using on-stack uninitialized values, and hence either failed validating the size correctly, or just failed resizing because the sizes were random. This caused failures in the liburing regression tests: [...] Running test resize-rings.t resize=-7 test_basic 3000 failed Test resize-rings.t failed with ret 1 Running test resize-rings.t /dev/sda resize=-7 test_basic 3000 failed Test resize-rings.t failed with ret 1 Running test resize-rings.t /dev/nvme1n1 resize=-7 test_basic 3000 failed Test resize-rings.t failed with ret 1 Running test resize-rings.t /dev/dm-0 resize=-7 test_basic 3000 failed Test resize-rings.t failed with ret 1 because io_create_region() would return -E2BIG because of unintialized reg->size values. Adjust the struct io_rings_layout rl pointer to point to the correct location, and remove the (now dead) __rl on stack struct. Fixes: `eb76ff6a68` ("io_uring: pre-calculate scq layout") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-18 19:30:45 -07:00
Ryan Roberts	9ac09bb9fe	mm: consistently use current->mm in mm_get_unmapped_area() mm_get_unmapped_area() is a wrapper around arch_get_unmapped_area() / arch_get_unmapped_area_topdown(), both of which search current->mm for some free space. Neither take an mm_struct - they implicitly operate on current->mm. But the wrapper takes an mm_struct and uses it to decide whether to search bottom up or top down. All callers pass in current->mm for this, so everything is working consistently. But it feels like an accident waiting to happen; eventually someone will call that function with a different mm, expecting to find free space in it, but what gets returned is free space in the current mm. So let's simplify by removing the parameter and have the wrapper use current->mm to decide which end to start at. Now everything is consistent and self-documenting. Link: https://lkml.kernel.org/r/20251003155306.2147572-1-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-11-16 17:27:57 -08:00
David Wei	00d9148127	io_uring/zcrx: share an ifq between rings Add a way to share an ifq from a src ring that is real (i.e. bound to a HW RX queue) with other rings. This is done by passing a new flag IORING_ZCRX_IFQ_REG_IMPORT in the registration struct io_uring_zcrx_ifq_reg, alongside the fd of an exported zcrx ifq. Signed-off-by: David Wei <dw@davidwei.uk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-13 11:19:37 -07:00
David Wei	0926f94ab3	io_uring/zcrx: add io_fill_zcrx_offsets() Add a helper io_fill_zcrx_offsets() that sets the constant offsets in struct io_uring_zcrx_offsets returned to userspace. Signed-off-by: David Wei <dw@davidwei.uk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-13 11:19:37 -07:00
Pavel Begunkov	d7af80b213	io_uring/zcrx: export zcrx via a file Add an option to wrap a zcrx instance into a file and expose it to the user space. Currently, users can't do anything meaningful with the file, but it'll be used in a next patch to import it into another io_uring instance. It's implemented as a new op called ZCRX_CTRL_EXPORT for the IORING_REGISTER_ZCRX_CTRL registration opcode. Signed-off-by: David Wei <dw@davidwei.uk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-13 11:19:37 -07:00
David Wei	742cb2e14e	io_uring/zcrx: move io_zcrx_scrub() and dependencies up In preparation for adding zcrx ifq exporting and importing, move io_zcrx_scrub() and its dependencies up the file to be closer to io_close_queue(). Signed-off-by: David Wei <dw@davidwei.uk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-13 11:19:37 -07:00
Pavel Begunkov	39c9676f78	io_uring/zcrx: count zcrx users zcrx tries to detach ifq / terminate page pools when the io_uring ctx owning it is being destroyed. There will be multiple io_uring instances attached to it in the future, so add a separate counter to track the users. Note, refs can't be reused for this purpose as it only used to prevent zcrx and rings destruction, and also used by page pools to keep it alive. Signed-off-by: David Wei <dw@davidwei.uk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-13 11:19:37 -07:00
Pavel Begunkov	475eb39b00	io_uring/zcrx: add sync refill queue flushing Add an zcrx interface via IORING_REGISTER_ZCRX_CTRL that forces the kernel to flush / consume entries from the refill queue. Just as with the IORING_REGISTER_ZCRX_REFILL attempt, the motivation is to address cases where the refill queue becomes full, and the user can't return buffers and needs to stash them. It's still a slow path, and the user should size refill queue appropriately, but it should be helpful for handling temporary traffic spikes and other unpredictable conditions. The interface is simpler comparing to ZCRX_REFILL as it doesn't need temporary refill entry arrays and gives natural batching, whereas ZCRX_REFILL requires even more user logic to be somewhat efficient. Also, add a structure for the operation. It's not currently used but can serve for future improvements like limiting the number of buffers to process, etc. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-13 11:19:37 -07:00
Pavel Begunkov	d663976dad	io_uring/zcrx: introduce IORING_REGISTER_ZCRX_CTRL It'll be annoying and take enough of boilerplate code to implement new zcrx features as separate io_uring register opcode. Introduce IORING_REGISTER_ZCRX_CTRL that will multiplex such calls to zcrx. Note, there are no real users of the opcode in this patch. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-13 11:19:37 -07:00
Pavel Begunkov	1b8b5d0316	io_uring/zcrx: elide passing msg flags zcrx sqe->msg_flags has never been defined and checked to be zero. It doesn't need to be a MSG_* bitmask. Keep them undefined, don't mix with MSG_DONTWAIT, and don't pass into io_zcrx_recv() as it's ignored anyway. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-13 11:19:37 -07:00
Pedro Demarchi Gomes	a0169c3a62	io_uring/zcrx: use folio_nr_pages() instead of shift operation folio_nr_pages() is a faster helper function to get the number of pages when NR_PAGES_IN_LARGE_FOLIO is enabled. Signed-off-by: Pedro Demarchi Gomes <pedrodemargomes@gmail.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-13 11:19:37 -07:00
Pavel Begunkov	f0243d2b86	io_uring/zcrx: convert to use netmem_desc Convert zcrx to struct netmem_desc, and use struct net_iov::desc to access its fields instead of struct net_iov inner union alises. zcrx only directly reads niov->pp, so with this patch it doesn't depend on the union anymore. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Byungchul Park <byungchul@sk.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-13 11:19:37 -07:00
Pavel Begunkov	4aaa9bc4d5	io_uring/query: introduce rings info query Same problem as with zcrx in the previous patch, the user needs to know SQ/CQ header sizes to allocated memory before setup to use it for user provided rings, i.e. IORING_SETUP_NO_MMAP, however that information is only returned after registration, hence the user is guessing kernel implementation details. Return the header size and alignment, which is split with the same motivation, to allow the user to know the real structure size without alignment in case there will be more flexible placement schemes in the future. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-13 11:17:36 -07:00
Pavel Begunkov	2647e2ecc0	io_uring/query: introduce zcrx query Add a new query type IO_URING_QUERY_ZCRX returning the user some basic information about the interface, which includes allowed flags for areas and registration and supported IORING_REGISTER_ZCRX_CTRL subcodes. There is also a chicken-egg problem with user provided refill queue memory, where offsets and size information is returned after registration, but to properly allocate memory you need to know it beforehand, which is why the userspace currently has to guess the RQ headers size and severely overestimates it. Return the size information. It's split into "size" and "alignment" fields because for default placement modes the user is interested in the aligned size, however if it gets support for more flexible placement, it'll need to only know the actual header size. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-13 11:17:36 -07:00
Pavel Begunkov	d741c62555	io_uring: move cq/sq user offset init around Move user SQ/CQ offset initialisation at the end of io_prepare_config() where it already calculated all information to set it properly. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-13 07:27:35 -07:00
Pavel Begunkov	eb76ff6a68	io_uring: pre-calculate scq layout Move ring layouts calculations into io_prepare_config(), so that more misconfiguration checking can be done earlier before creating a ctx. It also deduplicates some code with ring resizing. And as a bonus, now it initialises params->sq_off.array, which is closer to all other user offset init, and also applies it to ring resizing, which was previously missing it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-13 07:27:34 -07:00
Pavel Begunkov	001b76b7e7	io_uring: keep ring laoyut in a structure Add a structure keeping SQ/CQ sizes and offsets. For now it only records data previously returned from rings_size and the SQ size. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-13 07:27:34 -07:00
Pavel Begunkov	0f4b537363	io_uring: introduce struct io_ctx_config There will be more information needed during ctx setup, and instead of passing a handful of pointers around, wrap them all into a new structure. Add a helper for encapsulating all configuration checks and preparation, that's also reused for ring resizing. Note, it indirectly adds a io_uring_sanitise_params() check to ring resizing, which is a good thing. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-13 07:27:34 -07:00
Pavel Begunkov	929dbbb699	io_uring: convert params to pointer in ring reisze The parameters in io_register_resize_rings() will be moved into another structure in a later patch. In preparation to that, convert the params variable it to a pointer, but still store the data on stack. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-13 07:27:34 -07:00
Pavel Begunkov	94cd832916	io_uring: use size_add helpers for ring offsets Use size_add / size_mul set of functions for rings_size() calculations. It's more consistent with struct_size(), and errors are preserved across a series of calculations, so intermediate result checks can be omitted. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-13 07:27:34 -07:00
Pavel Begunkov	e279bb4b4c	io_uring: refactor rings_size nosqarray handling A preparation patch inversing the IORING_SETUP_NO_SQARRAY check, this way there is only one successful return path from the function, which will be helpful later. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-13 07:27:34 -07:00
Jens Axboe	ecb8490b2f	Merge branch 'io_uring-6.18' into for-6.19/io_uring Merge 6.18-rc io_uring fixes, as certain coming changes depend on some of these. * io_uring-6.18: io_uring/rsrc: don't use blk_rq_nr_phys_segments() as number of bvecs io_uring/query: return number of available queries io_uring/rw: ensure allocated iovec gets cleared for early failure io_uring: fix regbuf vector size truncation io_uring: fix types for region size calulation io_uring/zcrx: remove sync refill uapi io_uring: fix buffer auto-commit for multishot uring_cmd io_uring: correct __must_hold annotation in io_install_fixed_file io_uring zcrx: add MAINTAINERS entry io_uring: Fix code indentation error io_uring/sqpoll: be smarter on when to update the stime usage io_uring/sqpoll: switch away from getrusage() for CPU accounting io_uring: fix incorrect unlikely() usage in io_waitid_prep() Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-13 07:26:37 -07:00
Caleb Sander Mateos	2d0e88f3fd	io_uring/rsrc: don't use blk_rq_nr_phys_segments() as number of bvecs io_buffer_register_bvec() currently uses blk_rq_nr_phys_segments() as the number of bvecs in the request. However, bvecs may be split into multiple segments depending on the queue limits. Thus, the number of segments may overestimate the number of bvecs. For ublk devices, the only current users of io_buffer_register_bvec(), virt_boundary_mask, seg_boundary_mask, max_segments, and max_segment_size can all be set arbitrarily by the ublk server process. Set imu->nr_bvecs based on the number of bvecs the rq_for_each_bvec() loop actually yields. However, continue using blk_rq_nr_phys_segments() as an upper bound on the number of bvecs when allocating imu to avoid needing to iterate the bvecs a second time. Link: https://lore.kernel.org/io-uring/20251111191530.1268875-1-csander@purestorage.com/ Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Fixes: `27cb27b6d5` ("io_uring: add support for kernel registered bvecs") Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-12 08:25:33 -07:00
Pavel Begunkov	712fbe97c3	io_uring: move flags check to io_uring_sanitise_params io_uring_sanitise_params() sanitises most of the setup flags invariants, move the IORING_SETUP_FLAGS check from io_uring_setup() into it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-11 07:53:33 -07:00
Pavel Begunkov	01405895c1	io_uring: use mem_is_zero to check ring params mem_is_zero() does the job without hand rolled loops, use that to verify reserved fields of ring params. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-11 07:53:33 -07:00

1 2 3 4 5 ...

1914 Commits (master)