The negation operator is incorrectly placed outside the unlikely()
macro:
if (!unlikely(iwa))
This inverts the compiler branch prediction hint, marking the NULL case
as likely instead of unlikely. The intent is to indicate that allocation
failures are rare, consistent with common kernel patterns.
Moving the negation inside unlikely():
if (unlikely(!iwa))
Fixes: 2b4fc4cd43 ("io_uring/waitid: setup async data in the prep handler")
Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Both the read and write side use kiocb_done() to finish a request, and
kiocb_done() will call io_put_kbuf() in case a provided buffer was used
for the request. Provided buffers are not supported for writes, hence
NULL is being passed in. This normally works fine, as io_put_kbuf()
won't actually use the value unless REQ_F_BUFFER_RING or
REQ_F_BUFFER_SELECTED is set in the request flags. But depending on the
compiler (or whether or not CONFIG_CC_OPTIMIZE_FOR_SIZE is set), the
load may still be performed even though the value is never used. This
will then cause a NULL pointer dereference.
Make it a bit more obvious and check for a NULL io_br_sel, and don't
even bother calling io_put_kbuf() for that case.
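A minimal sketch of the shape of the fix (the io_br_sel field and the
io_put_kbuf() arguments here are assumptions, not the exact code in
io_uring/rw.c):

    /* sketch: only put a buffer when a selection was actually passed */
    if (sel)
        cflags = io_put_kbuf(req, res, sel->buf_list);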
Fixes: 5fda512554 ("io_uring/kbuf: switch to storing struct io_buffer_list locally")
Reported-by: David Howells <dhowells@redhat.com>
Tested-by: David Howells <dhowells@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There might be many reasons why a user is resizing a ring, e.g. moving
to huge pages or doing some memory compaction using IORING_SETUP_NO_MMAP.
Don't bypass resizing; the user will definitely be surprised to see 0
returned while the rings weren't actually moved to a new place.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_create_region_mmap_safe() protects publishing of a region against
concurrent mmap calls, however we should also protect against them when
removing a region. There is a gap in io_register_mem_region() where it
safely publishes a region, but if copy_to_user() then fails, it
unsafely frees the region.
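The problematic sequence looks roughly like this (a sketch; everything
apart from io_create_region_mmap_safe() and copy_to_user() is
illustrative):

    /* sketch of the gap in io_register_mem_region() */
    ret = io_create_region_mmap_safe(ctx, region, &reg); /* published */
    if (copy_to_user(uarg, &reg, sizeof(reg))) {
        io_free_region(ctx, region); /* unsafe: a concurrent mmap can race */
        return -EFAULT;
    }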
Cc: stable@vger.kernel.org
Fixes: 087f997870 ("io_uring/memmap: implement mmap for regions")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This reverts commit 90bfb28d5f.
Kevin reports that this commit causes an issue for him with LVM
snapshots, most likely because of turning off NOWAIT support while a
snapshot is being created. This makes -EOPNOTSUPP bubble back through
the completion handler, where io_uring read/write handling should just
retry it.
Reinstate the previous check removed by the referenced commit.
Cc: stable@vger.kernel.org
Fixes: 90bfb28d5f ("io_uring/rw: drop -EOPNOTSUPP check in __io_complete_rw_common()")
Reported-by: Salvatore Bonaccorso <carnil@debian.org>
Reported-by: Kevin Lumik <kevin@xf.ee>
Link: https://lore.kernel.org/io-uring/cceb723c-051b-4de2-9a4c-4aa82e1619ee@kernel.dk/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Don't forget to adjust the source offset in io_copy_page(), otherwise
it'll be copying from the same source location in some cases on highmem
setups.
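Roughly, the per-page copy loop has to advance both offsets (a sketch
with illustrative variable names, not the exact kernel code):

    /* sketch: before the fix, only the destination offset advanced */
    while (len) {
        size_t n = min(len, PAGE_SIZE - src_off);

        memcpy(dst + dst_off, src + src_off, n);
        dst_off += n;
        src_off += n; /* the missing source-offset adjustment */
        len -= n;
    }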
Fixes: e67645bb7f ("io_uring/zcrx: prepare fallback for larger pages")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It's reported that sometimes a zcrx request can receive more than was
requested. It's caused by io_zcrx_recv_skb() adjusting desc->count for
all received buffers including frag lists, but then doing recursive
calls to process frag list skbs, which leads to desc->count double
accounting and underflow.
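Conceptually the fix is to account desc->count at the top level only (a
sketch; the real function carries more arguments and state):

    /* sketch: recursive frag-list calls must not touch the limit again */
    if (!frag_list_call)
        desc->count -= copied;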
Reported-and-tested-by: Matthias Jasny <matthiasjasny@gmail.com>
Fixes: 6699ec9a23 ("io_uring/zcrx: add a read limit to recvzc requests")
Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Change the liburing git URL to point to the git.kernel.org servers,
rather than my private git.kernel.dk server. Due to continued AI
scraping of cgit etc, it's becoming quite the chore to maintain a
private git server.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge tag 'mm-stable-2025-10-01-19-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- "mm, swap: improve cluster scan strategy" from Kairui Song improves
performance and reduces the failure rate of swap cluster allocation
- "support large align and nid in Rust allocators" from Vitaly Wool
permits Rust allocators to set NUMA node and large alignment when
performing slub and vmalloc reallocs
- "mm/damon/vaddr: support stat-purpose DAMOS" from Yueyang Pan extend
DAMOS_STAT's handling of the DAMON operations sets for virtual
address spaces for ops-level DAMOS filters
- "execute PROCMAP_QUERY ioctl under per-vma lock" from Suren
Baghdasaryan reduces mmap_lock contention during reads of
/proc/pid/maps
- "mm/mincore: minor clean up for swap cache checking" from Kairui Song
performs some cleanup in the swap code
- "mm: vm_normal_page*() improvements" from David Hildenbrand provides
code cleanup in the pagemap code
- "add persistent huge zero folio support" from Pankaj Raghav provides
a block layer speedup by optionalls making the
huge_zero_pagepersistent, instead of releasing it when its refcount
falls to zero
- "kho: fixes and cleanups" from Mike Rapoport adds a few touchups to
the recently added Kexec Handover feature
- "mm: make mm->flags a bitmap and 64-bit on all arches" from Lorenzo
Stoakes turns mm_struct.flags into a bitmap. To end the constant
struggle with space shortage on 32-bit conflicting with 64-bit's
needs
- "mm/swapfile.c and swap.h cleanup" from Chris Li cleans up some swap
code
- "selftests/mm: Fix false positives and skip unsupported tests" from
Donet Tom fixes a few things in our selftests code
- "prctl: extend PR_SET_THP_DISABLE to only provide THPs when advised"
from David Hildenbrand "allows individual processes to opt-out of
THP=always into THP=madvise, without affecting other workloads on the
system".
It's a long story - the [1/N] changelog spells out the considerations
- "Add and use memdesc_flags_t" from Matthew Wilcox gets us started on
the memdesc project. Please see
https://kernelnewbies.org/MatthewWilcox/Memdescs and
https://blogs.oracle.com/linux/post/introducing-memdesc
- "Tiny optimization for large read operations" from Chi Zhiling
improves the efficiency of the pagecache read path
- "Better split_huge_page_test result check" from Zi Yan improves our
folio splitting selftest code
- "test that rmap behaves as expected" from Wei Yang adds some rmap
selftests
- "remove write_cache_pages()" from Christoph Hellwig removes that
function and converts its two remaining callers
- "selftests/mm: uffd-stress fixes" from Dev Jain fixes some UFFD
selftests issues
- "introduce kernel file mapped folios" from Boris Burkov introduces
the concept of "kernel file pages". Using these permits btrfs to
account its metadata pages to the root cgroup, rather than to the
cgroups of random inappropriate tasks
- "mm/pageblock: improve readability of some pageblock handling" from
Wei Yang provides some readability improvements to the page allocator
code
- "mm/damon: support ARM32 with LPAE" from SeongJae Park teaches DAMON
to understand arm32 highmem
- "tools: testing: Use existing atomic.h for vma/maple tests" from
Brendan Jackman performs some code cleanups and deduplication under
tools/testing/
- "maple_tree: Fix testing for 32bit compiles" from Liam Howlett fixes
a couple of 32-bit issues in tools/testing/radix-tree.c
- "kasan: unify kasan_enabled() and remove arch-specific
implementations" from Sabyrzhan Tasbolatov moves KASAN arch-specific
initialization code into a common arch-neutral implementation
- "mm: remove zpool" from Johannes Weiner removes zspool - an
indirection layer which now only redirects to a single thing
(zsmalloc)
- "mm: task_stack: Stack handling cleanups" from Pasha Tatashin makes a
couple of cleanups in the fork code
- "mm: remove nth_page()" from David Hildenbrand makes rather a lot of
adjustments at various nth_page() callsites, eventually permitting
the removal of that undesirable helper function
- "introduce kasan.write_only option in hw-tags" from Yeoreum Yun
creates a KASAN read-only mode for ARM, using that architecture's
memory tagging feature. It is felt that a read-only mode KASAN is
suitable for use in production systems rather than debug-only
- "mm: hugetlb: cleanup hugetlb folio allocation" from Kefeng Wang does
some tidying in the hugetlb folio allocation code
- "mm: establish const-correctness for pointer parameters" from Max
Kellermann makes quite a number of the MM API functions more accurate
about the constness of their arguments. This was getting in the way
of subsystems (in this case CEPH) when they attempted to improve
their own const/non-const accuracy
- "Cleanup free_pages() misuse" from Vishal Moola fixes a number of
code sites which were confused over when to use free_pages() vs
__free_pages()
- "Add Rust abstraction for Maple Trees" from Alice Ryhl makes the
mapletree code accessible to Rust. Required by nouveau and by its
forthcoming successor: the new Rust Nova driver
- "selftests/mm: split_huge_page_test: split_pte_mapped_thp
improvements" from David Hildenbrand adds a fix and some cleanups to
the thp selftesting code
- "mm, swap: introduce swap table as swap cache (phase I)" from Chris
Li and Kairui Song is the first step along the path to implementing
"swap tables" - a new approach to swap allocation and state tracking
which is expected to yield speed and space improvements. This
patchset itself yields a 5-20% performance benefit in some situations
- "Some ptdesc cleanups" from Matthew Wilcox utilizes the new memdesc
layer to clean up the ptdesc code a little
- "Fix va_high_addr_switch.sh test failure" from Chunyu Hu fixes some
issues in our 5-level pagetable selftesting code
- "Minor fixes for memory allocation profiling" from Suren Baghdasaryan
addresses a couple of minor issues in relatively new memory
allocation profiling feature
- "Small cleanups" from Matthew Wilcox has a few cleanups in
preparation for more memdesc work
- "mm/damon: add addr_unit for DAMON_LRU_SORT and DAMON_RECLAIM" from
Quanmin Yan makes some changes to DAMON in furtherance of supporting
arm highmem
- "selftests/mm: Add -Wunreachable-code and fix warnings" from Muhammad
Anjum adds that compiler check to selftests code and fixes the
fallout, by removing dead code
- "Improvements to Victim Process Thawing and OOM Reaper Traversal
Order" from zhongjinji makes a number of improvements in the OOM
killer: mainly thawing a more appropriate group of victim threads so
they can release resources
- "mm/damon: misc fixups and improvements for 6.18" from SeongJae Park
is a bunch of small and unrelated fixups for DAMON
- "mm/damon: define and use DAMON initialization check function" from
SeongJae Park implements reliability and maintainability improvements
to a recently-added bug fix
- "mm/damon/stat: expose auto-tuned intervals and non-idle ages" from
SeongJae Park provides additional transparency to userspace clients
of the DAMON_STAT information
- "Expand scope of khugepaged anonymous collapse" from Dev Jain removes
some constraints on khugepaged's collapsing of anon VMAs. It also
increases the success rate of MADV_COLLAPSE against an anon vma
- "mm: do not assume file == vma->vm_file in compat_vma_mmap_prepare()"
from Lorenzo Stoakes moves us further towards removal of
file_operations.mmap(). This patchset concentrates upon clearing up
the treatment of stacked filesystems
- "mm: Improve mlock tracking for large folios" from Kiryl Shutsemau
provides some fixes and improvements to mlock's tracking of large
folios. /proc/meminfo's "Mlocked" field became more accurate
- "mm/ksm: Fix incorrect accounting of KSM counters during fork" from
Donet Tom fixes several user-visible KSM stats inaccuracies across
forks and adds selftest code to verify these counters
- "mm_slot: fix the usage of mm_slot_entry" from Wei Yang addresses
some potential but presently benign issues in KSM's mm_slot handling
* tag 'mm-stable-2025-10-01-19-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (372 commits)
mm: swap: check for stable address space before operating on the VMA
mm: convert folio_page() back to a macro
mm/khugepaged: use start_addr/addr for improved readability
hugetlbfs: skip VMAs without shareable locks in hugetlb_vmdelete_list
alloc_tag: fix boot failure due to NULL pointer dereference
mm: silence data-race in update_hiwater_rss
mm/memory-failure: don't select MEMORY_ISOLATION
mm/khugepaged: remove definition of struct khugepaged_mm_slot
mm/ksm: get mm_slot by mm_slot_entry() when slot is !NULL
hugetlb: increase number of reserving hugepages via cmdline
selftests/mm: add fork inheritance test for ksm_merging_pages counter
mm/ksm: fix incorrect KSM counter handling in mm_struct during fork
drivers/base/node: fix double free in register_one_node()
mm: remove PMD alignment constraint in execmem_vmalloc()
mm/memory_hotplug: fix typo 'esecially' -> 'especially'
mm/rmap: improve mlock tracking for large folios
mm/filemap: map entire large folio faultaround
mm/fault: try to map the entire file folio in finish_fault()
mm/rmap: mlock large folios in try_to_unmap_one()
mm/rmap: fix a mlock race condition in folio_referenced_one()
...
Merge tag 'net-next-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Paolo Abeni:
"Core & protocols:
- Improve drop account scalability on NUMA hosts for RAW and UDP
sockets and the backlog, almost doubling the Pps capacity under DoS
- Optimize the UDP RX performance under stress, reducing contention,
revisiting the binary layout of the involved data structs and
implementing NUMA-aware locking. This improves UDP RX performance
by an additional 50%, even more under extreme conditions
- Add support for PSP encryption of TCP connections; this mechanism
has some similarities with IPsec and TLS, but offers superior HW
offloads capabilities
- Ongoing work to support Accurate ECN for TCP. AccECN allows more
than one congestion notification signal per RTT and is a building
block for Low Latency, Low Loss, and Scalable Throughput (L4S)
- Reorganize the TCP socket binary layout for data locality, reducing
the number of touched cachelines in the fastpath
- Refactor skb deferral free to better scale on large multi-NUMA
hosts, this improves TCP and UDP RX performances significantly on
such HW
- Increase the default socket memory buffer limits from 256K to 4M to
better fit modern link speeds
- Improve handling of setups with a large number of nexthops, making
dump operations scale linearly and avoiding unneeded
synchronize_rcu() on delete
- Improve bridge handling of VLAN FDB, storing a single entry per
bridge instead of one entry per port; this makes the dump an order
of magnitude faster on large switches
- Restore IP ID correctly for encapsulated packets at GSO
segmentation time, allowing GRO to merge packets in more scenarios
- Improve netfilter matching performance on large sets
- Improve MPTCP receive path performance by leveraging recently
introduced core infrastructure (skb deferral free) and adopting
recent TCP autotuning changes
- Allow bridges to redirect to a backup port when the bridge port is
administratively down
- Introduce the MPTCP 'laminar' endpoint that can be used only once
per connection, simplifying common MPTCP setups
- Add RCU safety to dst->dev, closing a lot of possible races
- Significant adoption of the crypto library API for SCTP, MPTCP and
IPv6 SR, reducing code duplication
- Support pulling data from an skb frag into the linear area of an
XDP buffer
Things we sprinkled into general kernel code:
- Generate netlink documentation from YAML using an integrated YAML
parser
Driver API:
- Support using IPv6 Flow Label in Rx hash computation and RSS queue
selection
- Introduce API for fetching the DMA device for a given queue,
allowing TCP zerocopy RX on more H/W setups
- Make XDP helpers compatible with unreadable memory, allowing more
easily building DevMem-enabled drivers with a unified XDP/skbs
datapath
- Add a new dedicated ethtool callback enabling drivers to provide
the number of RX rings directly, improving efficiency and clarity
in RX ring queries and RSS configuration
- Introduce a burst period for the health reporter, allowing better
handling of multiple errors due to the same root cause
- Support for DPLL phase offset exponential moving average,
controlling the average smoothing factor
Device drivers:
- Add a new Huawei driver for 3rd gen NIC (hinic3)
- Add a new SpacemiT driver for K1 ethernet MAC
- Add a generic abstraction for shared memory communication
devices (dibps)
- Ethernet high-speed NICs:
- nVidia/Mellanox:
- Use multiple per-queue doorbell, to avoid MMIO contention
issues
- support adjacent functions, allowing them to delegate their
SR-IOV VFs to sibling PFs
- support RSS for IPSec offload
- support exposing raw cycle counters in PTP and mlx5
- support for disabling host PFs.
- Intel (100G, ice, idpf):
- ice: support for SRIOV VFs over an Active-Active link
aggregate
- ice: support for firmware logging via debugfs
- ice: support for Earliest TxTime First (ETF) hardware offload
- idpf: support basic XDP functionalities and XSk
- Broadcom (bnxt):
- support Hyper-V VF ID
- dynamic SRIOV resource allocations for RoCE
- Meta (fbnic):
- support queue API, zero-copy Rx and Tx
- support basic XDP functionalities
- devlink health support for FW crashes and OTP mem corruptions
- expand hardware stats coverage to FEC, PHY, and Pause
- Wangxun:
- support ethtool coalesce options
- support for multiple RSS contexts
- Ethernet virtual:
- Macsec:
- replace custom netlink attribute checks with policy-level
checks
- Bonding:
- support aggregator selection based on port priority
- Microsoft vNIC:
- use page pool fragments for RX buffers instead of full pages
to improve memory efficiency
- Ethernet NICs consumer, and embedded:
- Qualcomm: support Ethernet function for IPQ9574 SoC
- Airoha: implement wlan offloading via NPU
- Freescale
- enetc: add NETC timer PTP driver and add PTP support
- fec: enable the Jumbo frame support for i.MX8QM
- Renesas (R-Car S4):
- support HW offloading for layer 2 switching
- support for RZ/{T2H, N2H} SoCs
- Cadence (macb): support TAPRIO traffic scheduling
- TI:
- support for Gigabit ICSS ethernet SoC (icssm-prueth)
- Synopsys (stmmac): a lot of cleanups
- Ethernet PHYs:
- Support 10g-qxgmi phy-mode for AQR412C, Felix DSA and Lynx PCS
driver
- Support bcm63268 GPHY power control
- Support for Micrel lan8842 PHY and PTP
- Support for Aquantia AQR412 and AQR115
- CAN:
- a large CAN-XL preparation work
- reorganize raw_sock and uniqframe struct to minimize memory
usage
- rcar_canfd: update the CAN-FD handling
- WiFi:
- extended Neighbor Awareness Networking (NAN) support
- S1G channel representation cleanup
- improve S1G support
- WiFi drivers:
- Intel (iwlwifi):
- major refactor and cleanup
- Broadcom (brcm80211):
- support for AP isolation
- RealTek (rtw88/89):
- preparation work for RTL8922DE support
- MediaTek (mt76):
- HW restart improvements
- MLO support
- Qualcomm/Atheros (ath10k):
- GTK rekey fixes
- Bluetooth drivers:
- btusb: support for several new IDs for MT7925
- btintel: support for BlazarIW core
- btintel_pcie: support for _suspend() / _resume()
- btintel_pcie: support for Scorpious, Panther Lake-H484 IDs"
* tag 'net-next-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1536 commits)
net: stmmac: Add support for Allwinner A523 GMAC200
dt-bindings: net: sun8i-emac: Add A523 GMAC200 compatible
Revert "Documentation: net: add flow control guide and document ethtool API"
octeontx2-pf: fix bitmap leak
octeontx2-vf: fix bitmap leak
net/mlx5e: Use extack in set rxfh callback
net/mlx5e: Introduce mlx5e_rss_params for RSS configuration
net/mlx5e: Introduce mlx5e_rss_init_params
net/mlx5e: Remove unused mdev param from RSS indir init
net/mlx5: Improve QoS error messages with actual depth values
net/mlx5e: Prevent entering switchdev mode with inconsistent netns
net/mlx5: HWS, Generalize complex matchers
net/mlx5: Improve write-combining test reliability for ARM64 Grace CPUs
selftests/net: add tcp_port_share to .gitignore
Revert "net/mlx5e: Update and set Xon/Xoff upon MTU set"
net: add NUMA awareness to skb_attempt_defer_free()
net: use llist for sd->defer_list
net: make softnet_data.defer_count an atomic
selftests: drv-net: psp: add tests for destroying devices
selftests: drv-net: psp: add test for auto-adjusting TCP MSS
...
Merge tag 'for-6.18/io_uring-20250929' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull io_uring updates from Jens Axboe:
- Store ring provided buffers locally for the users, rather than stuff
them into struct io_kiocb.
These types of buffers must always be fully consumed or recycled in
the current context, and leaving them in struct io_kiocb is hence not
a good idea, as that struct has a vastly different lifetime.
Basically just an architecture cleanup that can help prevent issues
with ring provided buffers in the future.
- Support for mixed CQE sizes in the same ring.
Before this change, a CQ ring either used the default 16b CQEs, or it
was set up with 32b CQEs using IORING_SETUP_CQE32. For use cases where
a few 32b CQEs were needed, this caused everything else to use big
CQEs. This is wasteful both in terms of memory usage and memory
bandwidth for the posted CQEs.
With IORING_SETUP_CQE_MIXED, applications may use request types that
post both normal 16b and big 32b CQEs on the same ring.
- Add helpers for async data management, to make it harder for opcode
handlers to mess it up.
- Add support for multishot for uring_cmd, which ublk can use. This
helps improve efficiency, by providing a persistent request type that
can trigger multiple CQEs.
- Add initial support for ring feature querying.
We had basic support for probe operations, but the API isn't great.
Rather than expand that, add support for QUERY which is easily
expandable and can cover a lot more cases than the existing probe
support. This will help applications get a better idea of what
operations are supported on a given host.
- zcrx improvements from Pavel:
- Improve refill entry alignment for better caching
- Various cleanups, especially around deduplicating normal
memory vs dmabuf setup.
- Generalisation of the niov size (Patch 12). It's still hard
coded to PAGE_SIZE on init, but will let the user specify
the rx buffer length on setup.
- Syscall / synchronous buffer return. It'll be used as a slow
fallback path for returning buffers when the refill queue is
full. Useful for tolerating slight queue size misconfiguration
or with inconsistent load.
- Accounting more memory to cgroups.
- Additional independent cleanups that will also be useful for
multi-area support.
- Various fixes and cleanups
* tag 'for-6.18/io_uring-20250929' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (68 commits)
io_uring/cmd: drop unused res2 param from io_uring_cmd_done()
io_uring: fix nvme's 32b cqes on mixed cq
io_uring/query: cap number of queries
io_uring/query: prevent infinite loops
io_uring/zcrx: account niov arrays to cgroup
io_uring/zcrx: allow synchronous buffer return
io_uring/zcrx: introduce io_parse_rqe()
io_uring/zcrx: don't adjust free cache space
io_uring/zcrx: use guards for the refill lock
io_uring/zcrx: reduce netmem scope in refill
io_uring/zcrx: protect netdev with pp_lock
io_uring/zcrx: rename dma lock
io_uring/zcrx: make niov size variable
io_uring/zcrx: set sgt for umem area
io_uring/zcrx: remove dmabuf_offset
io_uring/zcrx: deduplicate area mapping
io_uring/zcrx: pass ifq to io_zcrx_alloc_fallback()
io_uring/zcrx: check all niovs filled with dma addresses
io_uring/zcrx: move area reg checks into io_import_area
io_uring/zcrx: don't pass slot to io_zcrx_create_area
...
Within a folio/compound page, nth_page() is no longer required. Given
that we call folio_test_partial_kmap()+kmap_local_page(), the code would
already be problematic if the pages spanned multiple folios.
So let's just assume that all src pages belong to a single folio/compound
page and can be iterated ordinarily. The dst page is currently always a
single page, so we're not actually iterating anything.
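With that assumption, plain pointer arithmetic is enough (illustrative
names, not the exact io_uring code):

    /* sketch: pages of one folio/compound page are contiguous */
    struct page *p = first_page + i; /* instead of nth_page(first_page, i) */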
Link: https://lkml.kernel.org/r/20250901150359.867252-21-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The nvme uring_cmd only uses 32b CQEs. If the ring uses a mixed CQ, then
we need to make sure we flag the completion as a 32b CQE.
On the other hand, if nvme uring_cmd was using a dedicated 32b CQE, the
posting was missing the extra memcpy, because it only applied to big
CQEs on a mixed CQ.
Fixes: e26dca67fd ("io_uring: add support for IORING_SETUP_CQE_MIXED")
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge tag 'io_uring-6.17-20250919' of git://git.kernel.dk/linux
Pull io_uring fixes from Jens Axboe:
- Fix for a regression introduced in the io-wq worker creation logic.
- Remove the allocation cache for the msg_ring io_kiocb allocations. I
have a suspicion that there's a bug there, and since we just fixed
one in that area, let's just yank the use of that cache entirely.
It's not that important, and it kills some code.
- Treat a closed ring like task exiting, in that any requests that
trigger post that condition should just get canceled. Doesn't fix any
real issues, outside of letting tasks rely on that guarantee.
- Fix for a bug in the network zero-copy notification mechanism, where
a comparison for matching tctx/ctx for notifications was buggy in
that it didn't correctly compare with the previous notification.
* tag 'io_uring-6.17-20250919' of git://git.kernel.dk/linux:
io_uring: fix incorrect io_kiocb reference in io_link_skb
io_uring/msg_ring: kill alloc_cache for io_kiocb allocations
io_uring: include dying ring in task_work "should cancel" state
io_uring/io-wq: fix `max_workers` breakage and `nr_workers` underflow
If a query chain forms a cycle, it'll be looping in the kernel until the
process is killed. It might be fine as any such mistake can be easily
uncovered during testing, but it's still nicer to let it break out of
the syscall if it executed too many queries.
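A sketch of the cap (the constant name and errno are illustrative, not
necessarily what the patch uses):

    /* sketch: bound the number of chained queries per syscall */
    if (++nr_queries > IO_MAX_QUERIES)
        return -ELOOP;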
Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If the query chain forms a cycle, the interface will loop indefinitely.
Make sure it handles fatal signals, so the user can kill the process and
hence break out of the infinite loop.
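The check is the usual kernel pattern (a sketch of its placement inside
the query loop):

    /* sketch: allow a fatal signal to break a cyclic query chain */
    if (fatal_signal_pending(current))
        return -EINTR;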
Fixes: c265ae75f9 ("io_uring: introduce io_uring querying")
Reported-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In the io_link_skb() function, there is a bug where prev_notif is
incorrectly assigned using 'nd' instead of 'prev_nd'. This causes the context
validation check to compare the current notification with itself instead
of comparing it with the previous notification.
Fix by using the correct prev_nd parameter when obtaining prev_notif.
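Roughly, the one-line fix (surrounding code omitted; the helper name is
an assumption based on the notif code):

    -	prev_notif = cmd_to_io_kiocb(nd);
    +	prev_notif = cmd_to_io_kiocb(prev_nd);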
Signed-off-by: Yang Xiuwei <yangxiuwei@kylinos.cn>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Fixes: 6fe4220912 ("io_uring/notif: implement notification stacking")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A recent commit:
fc582cd26e ("io_uring/msg_ring: ensure io_kiocb freeing is deferred for RCU")
fixed an issue with not deferring freeing of io_kiocb structs that
msg_ring allocates to after the current RCU grace period. But this only
covers requests that don't end up in the allocation cache. If a request
goes into the alloc cache, it can get reused before it is sane to do so.
A recent syzbot report would seem to indicate that there's something
there, however it may very well just be because of the KASAN poisoning
that the alloc_cache handles manually.
Rather than attempt to make the alloc_cache sane for that use case, just
drop the usage of the alloc_cache for msg_ring request payload data.
Fixes: 50cf5f3842 ("io_uring/msg_ring: add an alloc cache for io_kiocb entries")
Link: https://lore.kernel.org/io-uring/68cc2687.050a0220.139b6.0005.GAE@google.com/
Reported-by: syzbot+baa2e0f4e02df602583e@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When running task_work for an exiting task, rather than perform the
issue retry attempt, the task_work is canceled. However, this isn't
done for a ring that has been closed. This can lead to requests being
successfully completed post the ring being closed, which is somewhat
confusing and surprising to an application.
Rather than just check the task exit state, also include the ring
ref state in deciding whether or not to terminate a given request when
run from task_work.
Cc: stable@vger.kernel.org # 6.1+
Link: https://github.com/axboe/liburing/discussions/1459
Reported-by: Benedek Thaler <thaler@thaler.hu>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
net_iov / freelist / etc. arrays can be quite long, make sure they're
accounted.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Returning buffers via a ring is performant and convenient, but it
becomes a problem when/if the user misconfigured the ring size and it
becomes full. Add a synchronous way to return buffers back to the page
pool via a new register opcode. It's supposed to be a reliable slow
path for refilling.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a helper for verifying a rqe and extracting a niov out of it. It'll
be reused in following patches.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The cache should be empty when io_pp_zc_alloc_netmems() is called,
that's promised by page pool and further checked, so there is no need to
recalculate the available space in io_zcrx_ring_refill().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use guards for rq_lock in io_zcrx_ring_refill(), makes it a tad simpler.
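For reference, the cleanup.h guard pattern this refers to (the lock
flavor is assumed here):

    /* sketch: scoped lock, dropped automatically at end of scope */
    guard(spinlock_bh)(&ifq->rq_lock);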
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reduce the scope of a local var netmem in io_zcrx_ring_refill.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Remove ifq->lock and reuse pp_lock to protect the netdev pointer.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In preparation for reusing the lock for other purposes, rename it to
"pp_lock". As before, it can be taken deeper inside the networking stack
by page pool, and so the io_uring syscall path must avoid holding it while
doing queue reconfiguration or anything that can result in immediate pp
init/destruction.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Instead of using PAGE_SIZE for the niov size add a niov_shift field to
ifq, and patch up all important places. Copy fallback still assumes
PAGE_SIZE, so it'll be wasting some memory for now.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Set struct io_zcrx_mem::sgt for umem areas as well to simplify looking
up the current sg table.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It was removed from uapi, so now it's always 0 and can be removed
together with offset handling in io_populate_area_dma().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With a common type for storing dma addresses and io_populate_area_dma(),
type-specific area mapping helpers are trivial, so open code them and
deduplicate the call to io_populate_area_dma().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_zcrx_copy_chunk() doesn't and shouldn't care from which area the
buffer is allocated, don't try to resolve the area in it but pass the
ifq to io_zcrx_alloc_fallback() and let it handle it. Also rename it for
more clarity.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a warning if io_populate_area_dma() can't fill in all net_iovs, it
should never happen.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_import_area() is responsible for importing memory and parsing
io_uring_zcrx_area_reg, so move all area reg structure checks into the
function.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Don't pass a pointer to a pointer where an area should be stored to
io_zcrx_create_area(), and let it handle finding the right place for a
new area. It's more straightforward and will be needed to support
multiple areas.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
page_pool_unref_and_test() tries to better follow usual refcount
semantics; use it instead of page_pool_unref_netmem().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
memchr_inv() is more ambiguous than mem_is_zero(), so use the latter
for zero checks.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Refill queue entries are 16B structures, but because of the ring header
placement, they're 8B aligned but not naturally / 16B aligned, which
means some of them span across 2 cache lines. Push rqes to a new cache
line.
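A sketch of the idea (the ring header struct name is an assumption; the
real layout code may differ):

    /* sketch: start the rqe array on a fresh cache line */
    rqes_off = ALIGN(sizeof(struct io_uring), L1_CACHE_BYTES);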
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
ifq->if_rxq has not been assigned yet at this point and is still -1;
the correct value is in reg.if_rxq.
Fixes: 59b8b32ac8 ("io_uring/zcrx: add support for custom DMA devices")
Signed-off-by: Feng Zhou <zhoufeng.zf@bytedance.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Link: https://patch.msgid.link/20250912140133.97741-1-zhoufeng.zf@bytedance.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit 88e6c42e40de ("io_uring/io-wq: add check free worker before
create new worker") reused the variable `do_create` for something
else, abusing it for the free worker check.
This caused the value to effectively always be `true` at the time
`nr_workers < max_workers` was checked, but it should really be
`false`. This means the `max_workers` setting was ignored, and worse:
if the limit had already been reached, incrementing `nr_workers` was
skipped even though another worker would be created.
When later lots of workers exit, the `nr_workers` field could easily
underflow, making the problem worse because more and more workers
would be created without incrementing `nr_workers`.
The simple solution is to use a different variable for the free worker
check instead of using one variable for two different things.
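A sketch of the disentangled logic (the helper name is illustrative;
nr_workers/max_workers are the fields named above):

    /* sketch: the free-worker check gets its own variable */
    bool has_free = io_wq_has_free_worker(acct);

    if (!has_free && acct->nr_workers < acct->max_workers) {
        acct->nr_workers++;
        do_create = true;
    }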
Cc: stable@vger.kernel.org
Fixes: 88e6c42e40de ("io_uring/io-wq: add check free worker before create new worker")
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Reviewed-by: Fengnan Chang <changfengnan@bytedance.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Cross-merge networking fixes after downstream PR (net-6.17-rc6).
Conflicts:
net/netfilter/nft_set_pipapo.c
net/netfilter/nft_set_pipapo_avx2.c
c4eaca2e10 ("netfilter: nft_set_pipapo: don't check genbit from packetpath lookups")
84c1da7b38 ("netfilter: nft_set_pipapo: use avx2 algorithm for insertions too")
Only trivial adjacent changes (in a doc and a Makefile).
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If a 32b CQE is required, don't double the size of the overflow struct,
just add the size of the io_uring_cqe addition that is needed. This
avoids allocating too much memory, as the io_overflow_cqe size includes
the list member required to queue them too.
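The allocation then looks roughly like this (a sketch following the
commit text; the GFP flags are illustrative):

    /* sketch: add one extra CQE's worth of space, don't double the struct */
    size_t ocq_size = sizeof(struct io_overflow_cqe);

    if (is_cqe32)
        ocq_size += sizeof(struct io_uring_cqe);
    ocqe = kmalloc(ocq_size, GFP_KERNEL_ACCOUNT);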
Fixes: e26dca67fd ("io_uring: add support for IORING_SETUP_CQE_MIXED")
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently if a user enqueues a work item using schedule_delayed_work(),
the wq used is "system_wq" (a per-cpu wq), while queue_delayed_work()
uses WORK_CPU_UNBOUND (used when a cpu is not specified). The same
applies to schedule_work(), which uses system_wq, and queue_work(),
which again makes use of WORK_CPU_UNBOUND.
This lack of consistency cannot be addressed without refactoring the API.
system_unbound_wq should be the default workqueue, so as not to enforce
locality constraints for random work whenever it's not required.
Add system_dfl_wq to encourage its use whenever unbound work is wanted.
queue_work() / queue_delayed_work() / mod_delayed_work() will now use the
new unbound wq: if the user still uses the old wq, a warning will be
printed along with a redirect to the new one.
The old system_unbound_wq will be kept for a few release cycles.
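Usage after this change (a sketch; my_work and my_dwork are
illustrative):

    /* explicitly queue unbound work on the new default wq */
    queue_work(system_dfl_wq, &my_work);
    queue_delayed_work(system_dfl_wq, &my_dwork, HZ);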
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently if a user enqueues a work item using schedule_delayed_work(),
the wq used is "system_wq" (a per-cpu wq), while queue_delayed_work()
uses WORK_CPU_UNBOUND (used when a cpu is not specified). The same
applies to schedule_work(), which uses system_wq, and queue_work(),
which again makes use of WORK_CPU_UNBOUND.
This lack of consistency cannot be addressed without refactoring the API.
system_wq is a per-CPU workqueue, yet nothing in its name tells about
that CPU affinity constraint, which is very often not required by users.
Make it clear by adding a system_percpu_wq.
queue_work() / queue_delayed_work() / mod_delayed_work() will now use the
new per-cpu wq: if the user still sticks to the old name, a warning will
be printed along with a redirect to the new one.
This patch adds the new system_percpu_wq, except for the mm, fs and net
subsystems, which are handled in separate patches.
The old wq will be kept for a few release cycles.
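Usage after this change (a sketch; my_work is illustrative):

    /* callers that really want per-CPU execution now say so by name */
    queue_work(system_percpu_wq, &my_work);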
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_ring_ctx's enabled with IORING_SETUP_SINGLE_ISSUER are only allowed
a single task submitting to the ctx. Although the documentation only
mentions this restriction applying to io_uring_enter() syscalls,
commit d7cce96c44 ("io_uring: limit registration w/ SINGLE_ISSUER")
extends it to io_uring_register(). Ensuring only one task interacts
with the io_ring_ctx will be important to allow this task to avoid
taking the uring_lock.
There is, however, one gap in these checks: io_register_clone_buffers()
may take the uring_lock on a second (source) io_ring_ctx, but
__io_uring_register() only checks the current thread against the
*destination* io_ring_ctx's submitter_task. Fail the
IORING_REGISTER_CLONE_BUFFERS with -EEXIST if the source io_ring_ctx has
a registered submitter_task other than the current task.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_uring/io_uring.h doesn't use anything declared in
io_uring/filetable.h, so drop the unnecessary #include. Add filetable.h
includes in .c files previously relying on the transitive include from
io_uring.h.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge tag 'vfs-6.17-rc6.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
"fuse:
- Prevent opening of non-regular backing files.
Fuse doesn't support non-regular files anyway.
- Check whether copy_file_range() returns a larger size than
requested.
- Prevent overflow in copy_file_range() as fuse currently only
supports 32-bit sized copies.
- Cache the blocksize value if the server returned a new value as
inode->i_blkbits isn't modified directly anymore.
- Fix i_blkbits handling for iomap partial writes.
By default i_blkbits is set to PAGE_SIZE which causes iomap to mark
the whole folio as uptodate even on a partial write. But fuseblk
filesystems support choosing a blocksize smaller than PAGE_SIZE
risking data corruption. Simply enforce PAGE_SIZE as blocksize for
fuseblk's internal inode for now.
- Prevent out-of-bounds access in fuse_dev_write() when the number of
bytes to be retrieved is truncated to the fc->max_pages limit.
virtiofs:
- Fix page faults for DAX page addresses.
Misc:
- Tighten file handle decoding from userns.
Check that the decoded dentry itself has a valid idmapping in the
user namespace.
- Fix mount-notify selftests.
- Fix some indentation errors.
- Add an FMODE_ flag to indicate IOCB_HAS_METADATA availability.
This will be moved to an FOP_* flag later; the rework needed for
that is not suitable for a fix.
- Don't silently ignore metadata for sync read/write.
- Don't pointlessly log warning when reading coredump sysctls"
* tag 'vfs-6.17-rc6.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fuse: virtio_fs: fix page fault for DAX page address
selftests/fs/mount-notify: Fix compilation failure.
fhandle: use more consistent rules for decoding file handle from userns
fuse: Block access to folio overlimit
fuse: fix fuseblk i_blkbits for iomap partial writes
fuse: reflect cached blocksize if blocksize was changed
fuse: prevent overflow in copy_file_range return value
fuse: check if copy_file_range() returns larger than requested size
fuse: do not allow mapping a non-regular backing file
coredump: don't pointlessly check and spew warnings
fs: fix indentation style
block: don't silently ignore metadata for sync read/write
fs: add a FMODE_ flag to indicate IOCB_HAS_METADATA availability
Replace kzalloc() followed by copy_from_user() with memdup_user() to
improve and simplify io_probe().
No functional changes intended.
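A minimal sketch of the pattern (pointer and argument names illustrative):

    p = memdup_user(arg, size);   /* allocate and copy in one call */
    if (IS_ERR(p))
        return PTR_ERR(p);        /* ERR_PTR encodes the failure */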
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There are many parameters users might want to query about io_uring like
available request types or the ring sizes. This patch introduces an
interface for such slow path queries.
It was written with several requirements in mind:
- Can be used with or without an io_uring instance. Asking for supported
setup flags before creating an instance as well as querying info about
an already created ring are valid use cases.
- Should be moderately fast. For example, users might use it to
periodically retrieve ring attributes at runtime. As a consequence,
it should be able to query multiple attributes in a single syscall.
- Backward and forward compatible.
- Should be reasonably easy to use.
- Reduce the kernel code size for introducing new query types.
It's implemented as a new registration opcode IORING_REGISTER_QUERY.
The user passes one or more query structures linked together, each
represented by struct io_uring_query_hdr. The header stores common
control fields needed for processing and points to query type specific
information.
The header contains
- The query type
- The result field, which on return contains the error code for the query
- Pointer to the query type specific information
- The size of the query structure. The kernel will only populate up to
the size, which helps with backward compatibility. The kernel can also
reduce the size, so if the current kernel is older than the interface
the user tries to use, it'll get only the supported bits.
- next_entry field is used to chain multiple queries.
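A hypothetical sketch of such a header, with field names and exact layout chosen for illustration (the authoritative definition lives in the UAPI header):

    struct io_uring_query_hdr {
        __u64 next_entry;  /* user pointer to the next header, 0 if last */
        __u64 query_data;  /* user pointer to the type-specific struct */
        __u32 query_op;    /* the query type */
        __u32 size;        /* size of the type-specific struct */
        __s64 result;      /* per-query error code, set by the kernel */
        __u64 resv[3];
    };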
Apart from common registration syscall failures, it can only immediately
return an error code in case the headers are incorrect or any
other addresses are invalid. That usually means that the userspace
doesn't use the API right and should be corrected. All query type
specific errors are returned in the header's result field.
As an example, the patch adds a single query type for now, i.e.
IO_URING_QUERY_OPCODES, which tells what register / request / etc.
opcodes are supported, but there are concrete plans to extend it.
Note: there is a request probing interface via IORING_REGISTER_PROBE,
but it's a mess. It requires the user to create a ring first, it only
works for requests, and requires dynamic allocations.
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add constants for supported setup / request / feature flags as well as
the feature mask. They'll be used in the next patch.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move handling of IORING_REGISTER_SEND_MSG_RING into a separate function
in preparation for growing io_uring_register_blind().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There's no need to use WRITE_ONCE() to set ctx->submitter_task in
io_uring_create() since no other task can access the io_ring_ctx until a
file descriptor is associated with it. So use a normal assignment
instead of WRITE_ONCE().
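A sketch of the resulting plain store, roughly as done for IORING_SETUP_SINGLE_ISSUER at ring creation time:

    /* ctx isn't reachable by any other task yet, so no store tearing
     * needs to be guarded against */
    ctx->submitter_task = get_task_struct(current);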
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250904161223.2600435-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_uring_cmd_done() takes the result code for the CQE as a ssize_t ret
argument. However, the CQE res field is a s32 value, as is the argument
to io_req_set_res(). To clarify that only s32 values can be faithfully
represented without truncation, change io_uring_cmd_done()'s ret
argument type to s32.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250902012609.1513123-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Introduce a function pointer type alias io_uring_cmd_tw_t for the
uring_cmd task work callback. This avoids repeating the signature in
several places. Also name both arguments to the callback to clarify what
they represent.
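A sketch of the alias, mirroring the callback signature described above:

    typedef void (*io_uring_cmd_tw_t)(struct io_uring_cmd *cmd,
                                      unsigned int issue_flags);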
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20250902160657.1726828-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For IORING_SETUP_SINGLE_ISSUER io_ring_ctx's, io_register_resize_rings()
checks that the current task is the ctx's submitter_task. However, its
caller __io_uring_register() already checks this. Drop the redundant
check in io_register_resize_rings().
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250902215108.1925105-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The kernel test robot reports that after a recent change, the signedness
of a min_not_zero() compare is now incorrect. Fix that up and cast to
the right type.
Fixes: 429884ff35 ("io_uring/kbuf: use struct io_br_sel for multiple buffers picking")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202509020426.WJtrdwOU-lkp@intel.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Cross-merge networking fixes after downstream PR (net-6.17-rc4).
No conflicts.
Adjacent changes:
drivers/net/ethernet/intel/idpf/idpf_txrx.c
02614eee26 ("idpf: do not linearize big TSO packets")
6c4e684802 ("idpf: remove obsolete stashing code")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use the new API for getting a DMA device for a specific netdev queue.
This patch will allow io_uring zero-copy rx to work with devices
where the DMA device is not stored in the parent device. mlx5 SFs
are an example of such a device.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/20250827144017.1529208-4-dtatulea@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since the buffers are mapped from userspace, it is prudent to use
READ_ONCE() to read the value into a local variable, and use that for
any other actions taken. Having a stable read of the buffer length
avoids worrying about it changing after checking, or being read multiple
times.
Similarly, the buffer may well change in between it being picked and
being committed. Ensure the looping for incremental ring buffer commit
stops if it hits a zero sized buffer, as no further progress can be made
at that point.
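A minimal sketch of the pattern, assuming a userspace-mapped ring buffer entry inside the incremental commit loop:

    __u32 len = READ_ONCE(buf->len);  /* one stable snapshot */

    if (!len)
        break;  /* zero-sized buffer: no further progress possible */
    /* only use the local 'len' from here on */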
Fixes: ae98dbf43d ("io_uring/kbuf: add support for incremental buffer consumption")
Link: https://lore.kernel.org/io-uring/tencent_000C02641F6250C856D0C26228DE29A3D30A@qq.com/
Reported-by: Qingyue Zhang <chunzhennn@qq.com>
Reported-by: Suoxing Zhang <aftern00n@qq.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Futex recently had an issue where it mishandled how ->async_data and
REQ_F_ASYNC_DATA are managed. To avoid future issues like that, add a set
of helpers that either clear or clear-and-free the async data assigned
to a struct io_kiocb.
Convert existing manual handling of that to use the helpers. No intended
functional changes in this patch.
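A hypothetical shape for such helpers (names and exact placement assumed):

    static inline void io_req_async_data_clear(struct io_kiocb *req)
    {
        req->async_data = NULL;
        req->flags &= ~REQ_F_ASYNC_DATA;
    }

    static inline void io_req_async_data_free(struct io_kiocb *req)
    {
        kfree(req->async_data);
        io_req_async_data_clear(req);
    }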
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
zcrx currently requires the ring to be set up with fixed 32b CQEs,
allow it to use IORING_SETUP_CQE_MIXED as well.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Certain users of uring_cmd currently require fixed 32b CQE support,
which is propagated through IO_URING_F_CQE32. Allow
IORING_SETUP_CQE_MIXED to cover that case as well, so not all CQEs
posted need to be 32b in size.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This adds support for setting IORING_NOP_CQE32 as a flag for a NOP
command, in which case a 32b CQE will be posted rather than a regular
one. This is the default if the ring has been setup with
IORING_SETUP_CQE32. If the ring has been setup with
IORING_SETUP_CQE_MIXED, then 16b CQEs will be posted without this flag
set, and 32b CQEs if this flag is set. For the latter case, sqe->off is
what will be posted as cqe->big_cqe[0] and sqe->addr is what will be
posted as cqe->big_cqe[1].
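A hypothetical liburing-style example (the nop_flags field placement is an assumption):

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

    io_uring_prep_nop(sqe);
    sqe->nop_flags = IORING_NOP_CQE32;   /* ask for a 32b CQE */
    sqe->off  = 1;                       /* posted as cqe->big_cqe[0] */
    sqe->addr = 2;                       /* posted as cqe->big_cqe[1] */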
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Normal rings support 16b CQEs for posting completions, while certain
features require the ring to be configured with IORING_SETUP_CQE32, as
they need to convey more information per completion. This, in turn,
makes ALL the CQEs be 32b in size. This is somewhat wasteful and
inefficient, particularly when only certain CQEs need to be of the
bigger variant.
This adds support for setting up a ring with mixed CQE sizes, using
IORING_SETUP_CQE_MIXED. When setup in this mode, CQEs posted to the ring
may be either 16b or 32b in size. If a CQE is 32b in size, then
IORING_CQE_F_32 is set in the CQE flags to indicate that this is the
case. If this flag isn't set, the CQE is the normal 16b variant.
CQEs on these types of mixed rings may also have IORING_CQE_F_SKIP set.
This can happen if the ring is one (small) CQE entry away from wrapping,
and an attempt is made to post a 32b CQE. As CQEs must be contiguous in
the CQ ring, a 32b CQE cannot wrap the ring. For this case, a single
dummy CQE is posted with the SKIP flag set. The application should
simply ignore those.
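A rough sketch of the wrap handling (the helpers are hypothetical):

    /* a 32b CQE must not wrap; if only one 16b slot remains before the
     * end of the CQ ring, fill it with a dummy skip entry first */
    if (posting_cqe32 && cq_slots_until_wrap(ctx) == 1) {
        struct io_uring_cqe *skip = cq_tail_entry(ctx);

        skip->user_data = 0;
        skip->res = 0;
        skip->flags = IORING_CQE_F_SKIP;
        ctx->cached_cq_tail++;
    }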
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When importing and using buffers, buf->len is considered unsigned.
However, buf->len is converted to signed int when committing. This can
lead to unexpected behavior if the buffer is large enough to be
interpreted as a negative value. Make the min_t calculation unsigned.
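A sketch of the fix, assuming 'len' and 'buf->len' as described:

    /* compare as unsigned so a large buf->len can't flip negative */
    this_len = min_t(__u32, len, buf->len);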
Fixes: ae98dbf43d ("io_uring/kbuf: add support for incremental buffer consumption")
Co-developed-by: Suoxing Zhang <aftern00n@qq.com>
Signed-off-by: Suoxing Zhang <aftern00n@qq.com>
Signed-off-by: Qingyue Zhang <chunzhennn@qq.com>
Link: https://lore.kernel.org/r/tencent_4DBB3674C0419BEC2C0C525949DA410CA307@qq.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ensure that the CQ ring iteration handles differently sized CQEs, not
just a fixed 16b or 32b size per ring. These CQEs aren't possible just
yet, but prepare the fdinfo CQ ring dumping for handling them.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_uring_cmd_prep() checks that REQ_F_BUFFER_SELECT is set in the
io_kiocb's flags iff IORING_URING_CMD_MULTISHOT is set in the SQE's
uring_cmd_flags. Consolidate the IORING_URING_CMD_MULTISHOT and
!IORING_URING_CMD_MULTISHOT branches into a single check that the
IORING_URING_CMD_MULTISHOT flag matches the REQ_F_BUFFER_SELECT flag.
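A sketch of the consolidated check:

    /* multishot requires buffer select, and vice versa */
    if (!!(ioucmd->flags & IORING_URING_CMD_MULTISHOT) !=
        !!(req->flags & REQ_F_BUFFER_SELECT))
        return -EINVAL;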
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250821163308.977915-4-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_uring_cmd_prep() currently has two checks for whether
IORING_URING_CMD_FIXED and IORING_URING_CMD_MULTISHOT are both set in
uring_cmd_flags. Remove the second check.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250821163308.977915-3-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add UAPI flag IORING_URING_CMD_MULTISHOT for supporting multishot
uring_cmd operations with provided buffer.
This enables drivers to post multiple completion events from a single
uring_cmd submission, which is useful for:
- Notifying userspace of device events (e.g., interrupt handling)
- Supporting devices with multiple event sources (e.g., multi-queue devices)
- Avoiding the need for device poll() support when events originate
from multiple sources device-wide
The implementation adds two new APIs:
- io_uring_cmd_select_buffer(): selects a buffer from the provided
buffer group for multishot uring_cmd
- io_uring_mshot_cmd_post_cqe(): posts a CQE after event data is
pushed to the provided buffer
Multishot uring_cmd must be used with buffer select (IOSQE_BUFFER_SELECT)
and is mutually exclusive with IORING_URING_CMD_FIXED for now.
The ublk driver will be the first user of this functionality:
https://github.com/ming1/linux/commits/ublk-devel/
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250821040210.1152145-3-ming.lei@redhat.com
[axboe: fold in fix for !CONFIG_IO_URING]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently the buffer list is stored in struct io_kiocb. The buffer list
can be of two types:
1) Classic/legacy buffer list. These don't need to get referenced after
a buffer pick, and hence storing them in struct io_kiocb is perfectly
fine.
2) Ring provided buffer lists. These DO need to be referenced after the
initial buffer pick, as they need to get consumed later on. This can
be either just incrementing the head of the ring, or it can be
consuming parts of a buffer if incremental buffer consumption has
been configured.
For case 2, io_uring needs to be careful not to access the buffer list
after the initial pick-and-execute context. The core does recycling of
these, but it's easy to make a mistake, because it's stored in the
io_kiocb which does persist across multiple execution contexts. Either
because it's a multishot request, or simply because it needed some kind
of async trigger (eg poll) for retry purposes.
Add a struct io_buffer_list to struct io_br_sel, which is always on
stack for the various users of it. This prevents the buffer list from
leaking outside of that execution context, and additionally it enables
kbuf to not even pass back the struct io_buffer_list if the given
context isn't appropriately locked already.
This doesn't fix any bugs, it's simply a defensive measure to prevent
any issues with reuse of a buffer list.
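An illustrative shape for the on-stack selection state (exact layout assumed from the description above):

    struct io_br_sel {
        struct io_buffer_list *buf_list; /* must not outlive this issue context */
        union {
            void __user *addr;  /* the selected buffer address */
            ssize_t val;        /* or a result/error value */
        };
    };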
Link: https://lore.kernel.org/r/20250821020750.598432-12-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently a pointer is passed in to the 'ret' in the send mshot handler,
but since we already have a value field in io_br_sel, just use that.
This is also in preparation for needing to pass in struct io_br_sel
to io_send_finish() anyway.
Link: https://lore.kernel.org/r/20250821020750.598432-11-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently a pointer is passed in to the 'ret' in the receive handlers,
but since we already have a value field in io_br_sel, just use that.
This is also in preparation for needing to pass in struct io_br_sel
to io_recv_finish() anyway.
Link: https://lore.kernel.org/r/20250821020750.598432-10-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The mshot side of reads already does this, but the regular read path
does not. This leads to needing recycling checks sprinkled in various
spots in the "go async" path, like arming poll. In preparation for
getting rid of those, ensure that read recycles appropriately.
Link: https://lore.kernel.org/r/20250821020750.598432-8-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Rather than return addresses directly from buffer selection, add a
struct around it. No functional changes in this patch, it's in
preparation for storing more buffer related information locally, rather
than in struct io_kiocb.
Link: https://lore.kernel.org/r/20250821020750.598432-7-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Rather than have this implied being in the io_kiocb, pass it in directly
so it's immediately obvious where these users of ->buf_list are coming
from.
Link: https://lore.kernel.org/r/20250821020750.598432-6-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A previous commit used io_net_kbuf_recyle() for any network helper that
did IO and needed partial retry. However, that's only needed if the
opcode does buffer selection, which isn't supported for sendzc, sendmsg_zc,
or sendmsg. Just remove them - they don't do any harm, but it is a bit
confusing when reading the code.
Link: https://lore.kernel.org/r/20250821020750.598432-4-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Picking multiple buffers always requires the ring lock to be held across
the operation, so there's no need to pass in the issue_flags to
io_put_kbufs(). On the single buffer side, if the initial picking of a
ring buffer was unlocked, then it will have been committed already. For
legacy buffers, no locking is required, as they will simply be freed.
Link: https://lore.kernel.org/r/20250821020750.598432-3-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Poison various request fields on free. __io_req_caches_free() is a slow
path, so can be done unconditionally, but gate it on kasan for
io_req_add_to_cache(). Note that some fields are logically retained
between cache allocations and can't be poisoned in
io_req_add_to_cache().
Ideally, it'd be replaced with KASAN'ed caches, but that can't be
enabled because of some synchronisation nuances.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7a78e8a7f5be434313c400650b862e36c211b312.1755459452.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Opcode handlers like POLL_ADD will use ->async_data as the pointer for
double poll handling, which is a bit different than the usual case
where it's strictly gated by the REQ_F_ASYNC_DATA flag. Be a bit more
proactive in handling ->async_data, and clear it to NULL as part of
regular init. Init is touching that cacheline anyway, so might as well
clear it.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The io_futex_data is allocated upfront and assigned to the io_kiocb
async_data field, but the request isn't marked with REQ_F_ASYNC_DATA
at that point. Those two should always go together, as the flag tells
io_uring whether the field is valid or not.
Additionally, on failure cleanup, the futex handler frees the data but
does not clear ->async_data. Clear the data and the flag in the error
path as well.
Thanks to Trend Micro Zero Day Initiative and particularly ReDress for
reporting this.
Cc: stable@vger.kernel.org
Fixes: 194bb58c60 ("io_uring: add support for futex wake and wait")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently the kernel will happily route io_uring requests with metadata
to file operations that don't support it. Add a FMODE_ flag to guard
that.
Fixes: 4de2ce04c8 ("fs: introduce IOCB_HAS_METADATA for metadata")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/20250819082517.2038819-2-hch@lst.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
After commit 0b2b066f8a ("io_uring/io-wq: only create a new worker
if it can make progress"), in our production environment we still
observe that some io_worker threads keep being created and destroyed.
After analysis, it was confirmed that this was due to a more complex
scenario involving a large number of fsync operations, which can be
abstracted as frequent write + fsync operations on multiple files in
a single uring instance. Since write is hashed work while fsync
is not, and fsync is likely to be suspended during execution, the
hash-value check in io_wq_dec_running cannot handle such scenarios.
Similarly, if hash-based work and non-hash-based work are sent at the
same time, similar issues are likely to occur.
Returning to the starting point of the issue, when a new work
arrives, io_wq_enqueue may wake up free worker A, while
io_wq_dec_running may create worker B. Ultimately, only one of A and
B can obtain and process the task, leaving the other in an idle
state. In the end, the issue is caused by inconsistent logic in the
checks performed by io_wq_enqueue and io_wq_dec_running.
Therefore, the problem can be resolved by checking for available
workers in io_wq_dec_running.
Signed-off-by: Fengnan Chang <changfengnan@bytedance.com>
Reviewed-by: Diangang Li <lidiangang@bytedance.com>
Link: https://lore.kernel.org/r/20250813120214.18729-1-changfengnan@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ring provided buffers are potentially only valid within the single
execution context in which they were acquired. io_uring deals with this
and invalidates them on retry. But on the networking side, if
MSG_WAITALL is set, or if the socket is of the streaming type and too
little was processed, then it will hang on to the buffer rather than
recycle or commit it. This is problematic for two reasons:
1) If someone unregisters the provided buffer ring before a later retry,
then the req->buf_list will no longer be valid.
2) If multiple sockets are using the same buffer group, then multiple
receives can consume the same memory. This can cause data corruption
in the application, as either receive could land in the same
userspace buffer.
Fix this by disallowing partial retries from pinning a provided buffer
across multiple executions, if ring provided buffers are used.
Cc: stable@vger.kernel.org
Reported-by: pt x <superman.xpt@gmail.com>
Fixes: c56e022c0a ("io_uring: add support for user mapped provided buffer ring")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If the allocated size exceeds UINT_MAX, then it's necessary to cast
the mr->nr_pages value to size_t to prevent it from overflowing. In
practice this isn't much of a concern as the required memory size will
have been validated upfront, and accounted to the user. And > 4GB sizes
will be necessary to make the lack of a cast a problem, which greatly
exceeds normal user locked_vm settings that are generally in the kb to
mb range. However, if root is used, then accounting isn't done, and
then it's possible to hit this issue.
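A sketch of the overflow-safe computation:

    /* promote before shifting so > 4GB sizes don't wrap a 32-bit int */
    size_t size = (size_t)mr->nr_pages << PAGE_SHIFT;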
Link: https://lore.kernel.org/all/6895b298.050a0220.7f033.0059.GAE@google.com/
Cc: stable@vger.kernel.org
Reported-by: syzbot+23727438116feb13df15@syzkaller.appspotmail.com
Fixes: 087f997870 ("io_uring/memmap: implement mmap for regions")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
At the moment you have to use sendmsg for vectorized send. While this
works, it's suboptimal as it also means you need to allocate a struct
msghdr that must be kept alive until submission happens. We can remove
this limitation by allowing send to be used directly.
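Hypothetical liburing-style usage; the flag name and its placement in sqe->ioprio are assumptions based on this series:

    struct iovec iovs[2] = { { buf1, len1 }, { buf2, len2 } };

    io_uring_prep_send(sqe, sockfd, iovs, 2, 0);
    sqe->ioprio |= IORING_SEND_VECTORIZED; /* addr/len describe an iovec array */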
Signed-off-by: Norman Maurer <norman_maurer@apple.com>
Link: https://lore.kernel.org/r/20250729065952.26646-1-norman_maurer@apple.com
[axboe: remove -EINVAL return for SENDMSG and SEND_VECTORIZED]
[axboe: allow send_zc to set SEND_VECTORIZED too]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge tag 'for-6.17/io_uring-20250728' of git://git.kernel.dk/linux
Pull io_uring updates from Jens Axboe:
- Optimization to avoid reference counts on non-cloned registered
buffers. This is how these buffers were handled prior to having
cloning support, and we can still use that approach as long as the
buffers haven't been cloned to another ring.
- Cleanup and improvement for uring_cmd, where btrfs was the only user
of storing allocated data for the lifetime of the uring_cmd. Clean
that up so we can get rid of the need to do that.
- Avoid unnecessary memory copies in uring_cmd usage. This is
particularly important as a lot of uring_cmd usage necessitates the
use of 128b SQEs.
- A few updates for recv multishot, where it's now possible to add
fairness limits for limiting how much is transferred for each retry
loop. Additionally, recv multishot now supports an overall cap as
well, where once reached the multishot recv will terminate. The
latter is useful for buffer management and juggling many recv streams
at the same time.
- Add support for returning the TX timestamps via a new socket command.
This feature can work in either singleshot or multishot mode, where
the latter triggers a completion whenever new timestamps are
available. This is an alternative to using the existing error queue.
- Add support for an io_uring "mock" file, which is the start of being
able to do 100% targeted testing in terms of exercising io_uring
request handling. The idea is to have a file type that can be
anything the tester would like, and behave exactly how you want it to
behave in terms of hitting the code paths you want.
- Improve zcrx by using sgtables to de-duplicate and improve dma
address handling.
- Prep work for supporting larger pages for zcrx.
- Various little improvements and fixes.
* tag 'for-6.17/io_uring-20250728' of git://git.kernel.dk/linux: (42 commits)
io_uring/zcrx: fix leaking pages on sg init fail
io_uring/zcrx: don't leak pages on account failure
io_uring/zcrx: fix null ifq on area destruction
io_uring: fix breakage in EXPERT menu
io_uring/cmd: remove struct io_uring_cmd_data
btrfs/ioctl: store btrfs_uring_encoded_data in io_btrfs_cmd
io_uring/cmd: introduce IORING_URING_CMD_REISSUE flag
io_uring/zcrx: account area memory
io_uring: export io_[un]account_mem
io_uring/net: Support multishot receive len cap
io_uring: deduplicate wakeup handling
io_uring/net: cast min_not_zero() type
io_uring/poll: cleanup apoll freeing
io_uring/net: allow multishot receive per-invocation cap
io_uring/net: move io_sr_msg->retry_flags to io_sr_msg->flags
io_uring/net: use passed in 'len' in io_recv_buf_select()
io_uring/zcrx: prepare fallback for larger pages
io_uring/zcrx: assert area type in io_zcrx_iov_page
io_uring/zcrx: allocate sgtable for umem areas
io_uring/zcrx: introduce io_populate_area_dma
...
Merge tag 'vfs-6.17-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc VFS updates from Christian Brauner:
"This contains the usual selections of misc updates for this cycle.
Features:
- Add ext4 IOCB_DONTCACHE support
This refactors the address_space_operations write_begin() and
write_end() callbacks to take const struct kiocb * as their first
argument, allowing IOCB flags such as IOCB_DONTCACHE to propagate
to the filesystem's buffered I/O path.
Ext4 is updated to implement handling of the IOCB_DONTCACHE flag
and advertises support via the FOP_DONTCACHE file operation flag.
Additionally, the i915 driver's shmem write paths are updated to
bypass the legacy write_begin/write_end interface in favor of
directly calling write_iter() with a constructed synchronous kiocb.
Another i915 change replaces a manual write loop with
kernel_write() during GEM shmem object creation.
Cleanups:
- don't duplicate vfs_open() in kernel_file_open()
- proc_fd_getattr(): don't bother with S_ISDIR() check
- fs/ecryptfs: replace snprintf with sysfs_emit in show function
- vfs: Remove unnecessary list_for_each_entry_safe() from
evict_inodes()
- filelock: add new locks_wake_up_waiter() helper
- fs: Remove three arguments from block_write_end()
- VFS: change old_dir and new_dir in struct renamedata to dentrys
- netfs: Remove unused declaration netfs_queue_write_request()
Fixes:
- eventpoll: Fix semi-unbounded recursion
- eventpoll: fix sphinx documentation build warning
- fs/read_write: Fix spelling typo
- fs: annotate data race between poll_schedule_timeout() and
pollwake()
- fs/pipe: set FMODE_NOWAIT in create_pipe_files()
- docs/vfs: update references to i_mutex to i_rwsem
- fs/buffer: remove comment about hard sectorsize
- fs/buffer: remove the min and max limit checks in __getblk_slow()
- fs/libfs: don't assume blocksize <= PAGE_SIZE in
generic_check_addressable
- fs_context: fix parameter name in infofc() macro
- fs: Prevent file descriptor table allocations exceeding INT_MAX"
* tag 'vfs-6.17-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (24 commits)
netfs: Remove unused declaration netfs_queue_write_request()
eventpoll: fix sphinx documentation build warning
ext4: support uncached buffered I/O
mm/pagemap: add write_begin_get_folio() helper function
fs: change write_begin/write_end interface to take struct kiocb *
drm/i915: Refactor shmem_pwrite() to use kiocb and write_iter
drm/i915: Use kernel_write() in shmem object create
eventpoll: Fix semi-unbounded recursion
vfs: Remove unnecessary list_for_each_entry_safe() from evict_inodes()
fs/libfs: don't assume blocksize <= PAGE_SIZE in generic_check_addressable
fs/buffer: remove the min and max limit checks in __getblk_slow()
fs: Prevent file descriptor table allocations exceeding INT_MAX
fs: Remove three arguments from block_write_end()
fs/ecryptfs: replace snprintf with sysfs_emit in show function
fs: annotate suspected data race between poll_schedule_timeout() and pollwake()
docs/vfs: update references to i_mutex to i_rwsem
fs/buffer: remove comment about hard sectorsize
fs_context: fix parameter name in infofc() macro
VFS: change old_dir and new_dir in struct renamedata to dentrys
proc_fd_getattr(): don't bother with S_ISDIR() check
...
There are no more users of struct io_uring_cmd_data and its op_data
field. Remove it to shave 8 bytes from struct io_async_cmd and eliminate
a store and load for every uring_cmd.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Acked-by: David Sterba <dsterba@suse.com>
Link: https://lore.kernel.org/r/20250708202212.2851548-5-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a flag IORING_URING_CMD_REISSUE that ->uring_cmd() implementations
can use to tell whether this is the first or subsequent issue of the
uring_cmd. This will allow ->uring_cmd() implementations to store
information in the io_uring_cmd's pdu across issues.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Acked-by: David Sterba <dsterba@suse.com>
Link: https://lore.kernel.org/r/20250708202212.2851548-3-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8c8492ca64 ("io_uring/net: don't retry connect operation on EPOLLERR")
is a little dirty hack that
1) wrongfully assumes that POLLERR equals a failed request, which
breaks all POLLERR users, e.g. all error queue recv interfaces.
2) deviates the connection request behaviour from connect(2), and
3) is racy and solved at the wrong level.
Nothing can be done about 2) now, and 3) is beyond the scope of the
patch. At least solve 1) by moving the hack out of generic poll handling
into io_connect().
Cc: stable@vger.kernel.org
Fixes: 8c8492ca64 ("io_uring/net: don't retry connect operation on EPOLLERR")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3dc89036388d602ebd84c28e5042e457bdfc952b.1752682444.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
At the moment it's very hard to do fine-grained backpressure when using
multishot, as the kernel might produce a lot of completions before the
user has a chance to cancel a previously submitted multishot recv.
This change adds support for issuing a multishot recv that is capped by
a length, which means the kernel will only re-arm until that amount of
data has been received. When the limit is reached, the completion
signals to the user that a manual re-arm is needed, by not setting the
IORING_CQE_F_MORE flag.
Signed-off-by: Norman Maurer <norman_maurer@apple.com>
Link: https://lore.kernel.org/r/20250715140249.31186-1-norman_maurer@apple.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Both io_poll_wq_wake() and io_cqring_wake() contain the exact same code,
and most of the comment in the latter applies equally to both.
Move the test and wakeup handling into a basic helper that they can both
use, and move part of the comment that applies generically to this new
helper.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
kernel test robot reports that xtensa complains about different
signedness for a min_not_zero() comparison. Cast the int part to size_t
to avoid this issue.
Fixes: e227c8cdb4 ("io_uring/net: use passed in 'len' in io_recv_buf_select()")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202507150504.zO5FsCPm-lkp@intel.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
zcrx shouldn't be so frivolous about cutting a dmabuf sgtable and taking
a subrange into it; the dmabuf layer might not be expecting that. It
shouldn't be a problem for now, but since the zcrx dmabuf support is new
and there shouldn't be any real users, let's play safe and reject user
provided ranges into dmabufs. Also, it shouldn't be needed as userspace
should size them appropriately.
Fixes: a5c98e9424 ("io_uring/zcrx: dmabuf backed zerocopy receive")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/be899f1afed32053eb2e2079d0da241514674aca.1752443579.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
No point having REQ_F_POLLED in both IO_REQ_CLEAN_FLAGS and in
IO_REQ_CLEAN_SLOW_FLAGS, and having both io_free_batch_list() and then
io_clean_op() check for it and clean it.
Move REQ_F_POLLED to IO_REQ_CLEAN_SLOW_FLAGS and drop it from
IO_REQ_CLEAN_FLAGS, and have only io_free_batch_list() do the check and
freeing.
Link: https://lore.kernel.org/io-uring/20250712000344.1579663-2-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge tag 'io_uring-6.16-20250710' of git://git.kernel.dk/linux
Pull io_uring fixes from Jens Axboe:
- Remove a pointless warning in the zcrx code
- Fix for MSG_RING commands, where the allocated io_kiocb
needs to be freed under RCU as well
- Revert the work-around we had in place for the anon inodes
pretending to be regular files. Since that got reworked
upstream, the work-around is no longer needed
* tag 'io_uring-6.16-20250710' of git://git.kernel.dk/linux:
Revert "io_uring: gate REQ_F_ISREG on !S_ANON_INODE as well"
io_uring/msg_ring: ensure io_kiocb freeing is deferred for RCU
io_uring/zcrx: fix pp destruction warnings
If an application is handling multiple receive streams using recv
multishot, then the amount of retries and buffer peeking for multishot
and bundles can process too much per socket before moving on. This isn't
directly controllable by the application. By default, io_uring will
retry a recv MULTISHOT_MAX_RETRY (32) times, if the socket keeps having
data to receive. And if using bundles, then each bundle peek will
potentially map up to PEEK_MAX_IMPORT (256) iovecs of data. Once these
limits are hit, then a requeue operation will be done, where the request
will get retried after other pending requests have had a time to get
executed.
Add support for capping the per-invocation receive length, before a
requeue condition is considered for each receive. This is done by setting
sqe->mshot_len to the byte value. For example, if this is set to 1024,
then the receive will be requeued once 1024 bytes have been received.
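Hypothetical usage, with the field name taken from this description:

    io_uring_prep_recv_multishot(sqe, sockfd, NULL, 0, 0);
    sqe->mshot_len = 1024;  /* requeue after 1024 bytes per invocation */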
Link: https://lore.kernel.org/io-uring/20250709203420.1321689-4-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There's plenty of space left, as sr->flags is a 16-bit type. The UAPI
bits are the lower 8 bits, as that's all that sqe->ioprio can carry in
the SQE anyway. Use a few of the upper 8 bits for internal uses, rather
than have two separate flags entries.
Link: https://lore.kernel.org/io-uring/20250709203420.1321689-2-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
len is a pointer to the desired len, use that rather than grab it from
sr->len again. No functional changes as of this patch, but it does
prepare io_recv_buf_select() for getting passed in a value that differs
from sr->len.
Link: https://lore.kernel.org/io-uring/20250709203420.1321689-3-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_zcrx_copy_chunk() processes one page at a time, which won't be
sufficient when the net_iov size grows. Introduce a structure keeping
the target niov page and other parameters, it's more convenient and can
be reused later. Add a helper function that can efficiently copy
buffers of arbitrary length. For 64-bit archs the loop inside should
be compiled out.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e84bc705a4e1edeb9aefff470d96558d8232388f.1751466461.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, dma addresses for umem areas are stored directly in niovs.
It's memory efficient but inconvenient. I need a better format 1) to
share code with dmabuf areas, and 2) for disentangling page, folio and
niov sizes. dmabuf already provides sg_table, create one for user memory
as well.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: David Wei <dw@davidwei.uk>
Link: https://lore.kernel.org/r/f3c15081827c1bf5427d3a2e693bc526476b87ee.1751466461.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_zcrx_map_area_*() helpers return the number of processed niovs, which
we use to unroll some of the mappings for user memory areas. It's
unwieldy, and dmabuf doesn't care about it. Return an error code instead
and move failure partial unmapping into io_zcrx_map_area_umem().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: David Wei <dw@davidwei.uk>
Link: https://lore.kernel.org/r/42668e82be3a84b07ee8fc76d1d6d5ac0f137fe5.1751466461.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_zcrx_copy_chunk() currently takes either a page or virtual address.
Unify the parameters, make it take pages and resolve the linear part
into a page the same way general networking code does that.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: David Wei <dw@davidwei.uk>
Link: https://lore.kernel.org/r/b8f9f4bac027f5f44a9ccf85350912d1db41ceb8.1751466461.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This reverts commit 6f11adcc6f.
The problematic commit was fixed in mainline, so the work-around in
io_uring can be removed at this point. Anonymous inodes no longer
pretend to be regular files after:
1e7ab6f678 ("anon_inode: rework assertions")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
syzbot reports that defer/local task_work adding via msg_ring can hit
a request that has been freed:
CPU: 1 UID: 0 PID: 19356 Comm: iou-wrk-19354 Not tainted 6.16.0-rc4-syzkaller-00108-g17bbde2e1716 #0 PREEMPT(full)
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025
Call Trace:
<TASK>
dump_stack_lvl+0x189/0x250 lib/dump_stack.c:120
print_address_description mm/kasan/report.c:408 [inline]
print_report+0xd2/0x2b0 mm/kasan/report.c:521
kasan_report+0x118/0x150 mm/kasan/report.c:634
io_req_local_work_add io_uring/io_uring.c:1184 [inline]
__io_req_task_work_add+0x589/0x950 io_uring/io_uring.c:1252
io_msg_remote_post io_uring/msg_ring.c:103 [inline]
io_msg_data_remote io_uring/msg_ring.c:133 [inline]
__io_msg_ring_data+0x820/0xaa0 io_uring/msg_ring.c:151
io_msg_ring_data io_uring/msg_ring.c:173 [inline]
io_msg_ring+0x134/0xa00 io_uring/msg_ring.c:314
__io_issue_sqe+0x17e/0x4b0 io_uring/io_uring.c:1739
io_issue_sqe+0x165/0xfd0 io_uring/io_uring.c:1762
io_wq_submit_work+0x6e9/0xb90 io_uring/io_uring.c:1874
io_worker_handle_work+0x7cd/0x1180 io_uring/io-wq.c:642
io_wq_worker+0x42f/0xeb0 io_uring/io-wq.c:696
ret_from_fork+0x3fc/0x770 arch/x86/kernel/process.c:148
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
</TASK>
which is supposed to be safe with how requests are allocated. But msg
ring requests alloc and free on their own, and hence must defer freeing
to a sane time.
Add an rcu_head and use kfree_rcu() in both spots where requests are
freed. Only the one in io_msg_tw_complete() is strictly required as it
has been visible on the other ring, but use it consistently in the other
spot as well.
This should not cause any other issues outside of KASAN rightfully
complaining about it.
Link: https://lore.kernel.org/io-uring/686cd2ea.a00a0220.338033.0007.GAE@google.com/
Reported-by: syzbot+54cbbfb4db9145d26fc2@syzkaller.appspotmail.com
Cc: stable@vger.kernel.org
Fixes: 0617bb500b ("io_uring/msg_ring: improve handling of target CQE posting")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
kernel test robot reports that a recent change of the sqe->rw_flags
field throws a sparse warning on 32-bit archs:
>> io_uring/rw.c:291:19: sparse: sparse: incorrect type in assignment (different base types) @@ expected restricted __kernel_rwf_t [usertype] flags @@ got unsigned int @@
io_uring/rw.c:291:19: sparse: expected restricted __kernel_rwf_t [usertype] flags
io_uring/rw.c:291:19: sparse: got unsigned int
Force cast it to rwf_t to silence that new sparse warning.
Fixes: cf73d9970e ("io_uring: don't use int for ABI")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202507032211.PwSNPNSP-lkp@intel.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With multiple page pools and in some other cases we can have allocated
niovs on page pool destruction. Remove a misplaced warning checking that
all niovs are returned to zcrx on io_pp_zc_destroy(). It was reported
before but apparently got lost.
Reported-by: Pedro Tammela <pctammela@mojatatu.com>
Fixes: 34a3e60821 ("io_uring/zcrx: implement zerocopy receive pp memory provider")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b9e6d919d2964bc48ddbf8eb52fc9f5d118e9bc1.1751878185.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge in 6.16 io_uring fixes, to avoid clashes with pending net and
settings changes.
* io_uring-6.16:
io_uring: gate REQ_F_ISREG on !S_ANON_INODE as well
io_uring/kbuf: flag partial buffer mappings
io_uring/net: mark iov as dynamically allocated even for single segments
io_uring: fix resource leak in io_import_dmabuf()
io_uring: don't assume uaddr alignment in io_vec_fill_bvec
io_uring/rsrc: don't rely on user vaddr alignment
io_uring/rsrc: fix folio unpinning
io_uring: make fallocate be hashed work
io_buffer_unmap() performs an atomic decrement of the io_mapped_ubuf's
reference count in case it has been cloned into another io_ring_ctx's
registered buffer table. This is an expensive operation and unnecessary
in the common case that the io_mapped_ubuf is only registered once.
Load the reference count first and check whether it's 1. In that case,
skip the atomic decrement and immediately free the io_mapped_ubuf.
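A minimal sketch of the idea (not the exact kernel code; the free helper is hypothetical):

    /* an io_mapped_ubuf that was never cloned has a refcount of exactly
     * 1 and no concurrent owners, so skip the atomic RMW in that case */
    if (atomic_read(&imu->refs) > 1 && !atomic_dec_and_test(&imu->refs))
        return;             /* another ring still holds a reference */
    io_free_imu(ctx, imu);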
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250619143435.3474028-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_uring commands provide an ioctl style interface for files to
implement file specific operations. io_uring provides many features and
advanced api to commands, and it's getting hard to test as it requires
specific files/devices.
Add basic infrastructure for creating special mock files that will
implement the cmd API and use the various io_uring features we want to
test. It'll also be useful to test some more obscure read/write/polling
edge cases in the future.
Suggested-by: chase xd <sl1589472800@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/93f21b0af58c1367a2b22635d5a7d694ad0272fc.1750599274.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge tag 'io_uring-6.16-20250630' of git://git.kernel.dk/linux
Pull io_uring fix from Jens Axboe:
"Now that anonymous inodes set S_IFREG, this breaks the io_uring
read/write retries for short reads/writes. As things like timerfd and
eventfd are anon inodes, applications that previously did:
unsigned long event_data[2];
io_uring_prep_read(sqe, evfd, event_data, sizeof(event_data), 0);
and just got a short read when 1 event was posted, will now wait for
the full amount before posting a completion.
This caused issues for the ghostty application, making it basically
unusable due to excessive buffering"
* tag 'io_uring-6.16-20250630' of git://git.kernel.dk/linux:
io_uring: gate REQ_F_ISREG on !S_ANON_INODE as well
io_uring marks a request as dealing with a regular file on S_ISREG. This
drives things like retries on short reads or writes, which is generally
not expected on a regular file (or bdev). Applications tend to not
expect that, so io_uring tries hard to ensure it doesn't deliver short
IO on regular files.
However, a recent commit added S_IFREG to anonymous inodes. When
io_uring is used to read from various things that are backed by anon
inodes, like eventfd, timerfd, etc, then it'll now all of a sudden wait
for more data rather than deliver what was read or written in a
single operation. This breaks applications that issue reads on anon
inodes, if they ask for more data than a single read delivers.
Add a check for !S_ANON_INODE as well before setting REQ_F_ISREG to
prevent that.
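A sketch of the gate (the exact predicate may differ):

    /* anon inodes now set S_IFREG, but must not count as regular files
     * for short IO retry purposes */
    if (S_ISREG(inode->i_mode) && !(inode->i_flags & S_ANON_INODE))
        req->flags |= REQ_F_ISREG;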
Cc: Christian Brauner <brauner@kernel.org>
Cc: stable@vger.kernel.org
Link: https://github.com/ghostty-org/ghostty/discussions/7720
Fixes: cfd86ef7e8 ("anon_inode: use a proper mode internally")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge tag 'io_uring-6.16-20250626' of git://git.kernel.dk/linux
Pull io_uring fixes from Jens Axboe:
- Two tweaks for a recent fix: fixing a memory leak if multiple iovecs
were initially mapped but only the first was used and hence turned
into a UBUF rather than an IOVEC iterator, and catching a case where
a retry would be done even if the previous segment wasn't full
- Small series fixing an issue making the vm unhappy if debugging is
turned on, hitting a VM_BUG_ON_PAGE()
- Fix a resource leak in io_import_dmabuf() in the error handling case,
which is a regression in this merge window
- Mark fallocate as needing to be write serialized, as is already done
for truncate and buffered writes
* tag 'io_uring-6.16-20250626' of git://git.kernel.dk/linux:
io_uring/kbuf: flag partial buffer mappings
io_uring/net: mark iov as dynamically allocated even for single segments
io_uring: fix resource leak in io_import_dmabuf()
io_uring: don't assume uaddr alignment in io_vec_fill_bvec
io_uring/rsrc: don't rely on user vaddr alignment
io_uring/rsrc: fix folio unpinning
io_uring: make fallocate be hashed work
A previous commit aborted mapping more for a non-incremental ring for
bundle peeking, but depending on where in the process this peeking
happened, it would not necessarily prevent a retry by the user. That can
create gaps in the received/read data.
Add struct buf_sel_arg->partial_map, which can pass this information
back. The networking side can then map that to internal state and use it
to gate retry as well.
Since this necessitates a new flag, change io_sr_msg->retry to a
retry_flags member, and store both the retry and partial map condition
in there.
Cc: stable@vger.kernel.org
Fixes: 26ec15e4b0 ("io_uring/kbuf: don't truncate end buffer for multiple buffer peeks")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A bigger array of vecs could've been allocated, but
io_ring_buffers_peek() still decided to cap the mapped range depending
on how much data was available. Hence don't rely on the segment count
to know if the request should be marked as needing cleanup, always
check upfront if the iov array is different than the fast_iov array.
Fixes: 26ec15e4b0 ("io_uring/kbuf: don't truncate end buffer for multiple buffer peeks")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Replace the return statement with setting ret = -EINVAL and jumping to
the err label to ensure resources are released via io_release_dmabuf.
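A sketch of the control-flow change:

    /* before: return -EINVAL;  (leaked the attached dmabuf) */
    ret = -EINVAL;
    goto err;  /* the err label releases via io_release_dmabuf() */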
Fixes: a5c98e9424 ("io_uring/zcrx: dmabuf backed zerocopy receive")
Signed-off-by: Penglei Jiang <superman.xpt@gmail.com>
Link: https://lore.kernel.org/r/20250625102703.68336-1-superman.xpt@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a new socket command which returns tx time stamps to the user. It
provides an alternative to the existing error queue recvmsg interface.
The command works in a polled multishot mode, which means io_uring will
poll the socket and keep posting timestamps until the request is
cancelled or fails in any other way (e.g. with no space in the CQ). It
reuses the net infra and grabs timestamps from the socket's error queue.
The command requires IORING_SETUP_CQE32. All non-final CQEs (marked with
IORING_CQE_F_MORE) have cqe->res set to the tskey, and the upper 16 bits
of cqe->flags keep the tstype (i.e. offset by IORING_CQE_BUFFER_SHIFT). The
time value is stored in the upper part of the extended CQE. The final
completion won't have IORING_CQE_F_MORE and will have cqe->res storing
0/error.
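Hypothetical consumer-side decoding, following the layout described above:

    if (cqe->flags & IORING_CQE_F_MORE) {
        __u32 tskey  = cqe->res;
        __u32 tstype = cqe->flags >> IORING_CQE_BUFFER_SHIFT;
        /* the timestamp itself lives in the extended cqe->big_cqe[] part */
    } else {
        /* final completion: cqe->res holds 0 or an error */
    }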
Suggested-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/92ee66e6b33b8de062a977843d825f58f21ecd37.1750065793.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a helper for posting 32 byte CQEs in a multishot mode and add a cmd
helper on top. As it specifically works with requests, the helper ignores
the passed in cqe->user_data and sets it to the one stored in the
request.
The command helper is only valid with multishot requests.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c29d7720c16e1f981cfaa903df187138baa3946b.1750065793.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Some commands like timestamping in the next patch can make use of
multishot polling, i.e. REQ_F_APOLL_MULTISHOT. Add support for that,
which is condensed in a single helper called io_cmd_poll_multishot().
The user who wants to continue with a request in a multishot mode must
call the function, and only if it returns 0 is the user free to proceed.
Apart from normal terminal errors, it can also end up with -EIOCBQUEUED,
in which case the user must forward it to the core io_uring. It's
forbidden to use task work while the request is executing in a multishot
mode.
The API is not foolproof, hence it's not exported to modules nor exposed
in public headers.
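A sketch of the calling convention described above (the signature is assumed):

    ret = io_cmd_poll_multishot(cmd, issue_flags, POLLIN);
    if (ret == -EIOCBQUEUED)
        return ret;  /* forward to the core; armed for multishot poll */
    if (ret)
        return ret;  /* terminal error */
    /* ret == 0: proceed with the request in multishot mode */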
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/bcf97c31659662c72b69fc8fcdf2a88cfc16e430.1750065793.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In preparation to allowing commands to do file polling, add a helper
that takes the desired poll event mask and arms it for polling. We won't
be able to use io_arm_poll_handler() with IORING_OP_URING_CMD as it
tries to infer the mask from the opcode data, and we can't unify it
across all commands.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7ee5633f2dc45fd15243f1a60965f7e30e1c48e8.1750065793.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
To test and profile the overhead of io_uring task_work and the various
types of it, add IORING_NOP_TW which tells nop to signal completions
through task_work rather than complete them inline.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
uring_cmd currently copies the full SQE at prep time, just in case it
needs it to be stable. However, for inline completions or requests that
get queued up on the device side, there's no need to ever copy the SQE.
This is particularly important, as various use cases of uring_cmd will
be using 128b sized SQEs.
Opt in to using ->sqe_copy() to let the core of io_uring decide when to
copy SQEs. This callback will only be called if it is safe to do so.
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>