Commit Graph

1914 Commits (master)

Author SHA1 Message Date
Pavel Begunkov 720df2310b io_uring/zcrx: fix null ifq on area destruction
Dan reports that ifq can be null when inferring arguments for
io_unaccount_mem() from io_zcrx_free_area(). Fix it by always setting a
correct ifq.

Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202507180628.gBxrOgqr-lkp@intel.com/
Fixes: 262ab20518 ("io_uring/zcrx: account area memory")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20670d163bb90dba2a81a4150f1125603cefb101.1753091564.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-21 06:47:45 -06:00
Linus Torvalds e347810e84 io_uring-6.16-20250718
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmh6ZXQQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpnaaEACbZ2g4GOk6tGqeatiIuB+HKbf72zkI4JP4
 ffyirNnvXpL2du5pUskhIgqSpLb1S5qq2TlQSNpAAtcmbQuxhAbwrBP+6e3GkdZo
 zb2x2ztRVV1PdgUdE7dYX6NPJ2GzF5tMDH5sDDm8HeaQTHX6Zdv6sXahDYHdhH3a
 lo9yjh3GZp/2fa0N067ECMDPUWWVvhkolxulB4Ek+l11RhEtamIXzU4x8cBVvzVS
 7nmaqES9xnVqwXFbF2/IEi5NfUYxlcxsySgYgUmfFBp8baOFC/dVUFre0wRNiCf2
 BoRCBKG/by6mEccE+M7JTuH/itA2vPvDzYmgXuMk8BgDSwXdcoHm7QUx9PJh8bbU
 3SXw/125VI1DKG5xZReDrRHI6GNSwwhEJLR7oAgBCCxfbiuG99cRKdM3C1HOeN4q
 jP4L1jQnSgD9VYaxrQD0se5191WtR2S6h+tO27W4UiYlOvSMU2hWHE9eZQBohLJb
 2p+jltziAOL0riE8wwY9h0dnPAWSrvs+sf9LrfBtNe3lRDttszhFoJXeJ4SpGs0P
 0xv+gt1Gq1pl1lKCyX+HlXt6wVj5ReYj78iQqLmHIEtFKOfrhCYJVIDyepRG+pKE
 7LnncbfLoYHXcj3eIRj6zsiZh+k43tbGmoKBYmyqhUnu4ySNQpdUBQWP4ZVVjjlx
 zKY3j7fpcw==
 =N6xk
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-6.16-20250718' of git://git.kernel.dk/linux

Pull io_uring fixes from Jens Axboe:

 - dmabuf offset fix for zcrx

 - Fix for the POLLERR connect workaround handling

* tag 'io_uring-6.16-20250718' of git://git.kernel.dk/linux:
  io_uring/poll: fix POLLERR handling
  io_uring/zcrx: disallow user selected dmabuf offset and size
2025-07-18 12:21:26 -07:00
Caleb Sander Mateos 2e6dbb25ea io_uring/cmd: remove struct io_uring_cmd_data
There are no more users of struct io_uring_cmd_data and its op_data
field. Remove it to shave 8 bytes from struct io_async_cmd and eliminate
a store and load for every uring_cmd.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Acked-by: David Sterba <dsterba@suse.com>
Link: https://lore.kernel.org/r/20250708202212.2851548-5-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-18 12:34:56 -06:00
Caleb Sander Mateos 733c43f1df io_uring/cmd: introduce IORING_URING_CMD_REISSUE flag
Add a flag IORING_URING_CMD_REISSUE that ->uring_cmd() implementations
can use to tell whether this is the first or subsequent issue of the
uring_cmd. This will allow ->uring_cmd() implementations to store
information in the io_uring_cmd's pdu across issues.
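
A minimal sketch of how a ->uring_cmd() handler might consume the flag
(the pdu struct and the driver logic are illustrative, not from this
patch):

    struct my_cmd_state {
        u32 bytes_done;
    };

    static int my_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
    {
        struct my_cmd_state *st = io_uring_cmd_to_pdu(cmd, struct my_cmd_state);

        if (!(cmd->flags & IORING_URING_CMD_REISSUE)) {
            /* first issue: initialise per-command state in the pdu */
            st->bytes_done = 0;
        }
        /* on a reissue, st->bytes_done still holds progress from the
         * previous attempt */
        return my_cmd_issue(cmd, st, issue_flags);  /* hypothetical */
    }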

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Acked-by: David Sterba <dsterba@suse.com>
Link: https://lore.kernel.org/r/20250708202212.2851548-3-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-18 12:34:56 -06:00
Pavel Begunkov 262ab20518 io_uring/zcrx: account area memory
zcrx areas can be quite large and need to be accounted and checked
against RLIMIT_MEMLOCK. In practice it shouldn't be a big issue as
the interface already requires cap_net_admin.

Cc: stable@vger.kernel.org
Fixes: cf96310c5f ("io_uring/zcrx: add io_zcrx_area")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4b53f0c575bd062f63d12bec6cac98037fc66aeb.1752699568.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-16 16:23:28 -06:00
Pavel Begunkov 11fbada718 io_uring: export io_[un]account_mem
Export the pinned memory accounting helpers; they'll be used by zcrx
shortly.
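
A sketch of the intended pairing, assuming the helpers keep their
existing rsrc.c signatures (the callsite is illustrative):

    unsigned long nr_pages = area_size >> PAGE_SHIFT;
    int ret;

    /* at area creation: charge the pages against RLIMIT_MEMLOCK */
    ret = io_account_mem(ctx, nr_pages);
    if (ret)
        return ret;

    /* at area destruction: give the accounting back */
    io_unaccount_mem(ctx, nr_pages);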

Cc: stable@vger.kernel.org
Fixes: cf96310c5f ("io_uring/zcrx: add io_zcrx_area")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9a61e54bd89289b39570ae02fe620e12487439e4.1752699568.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-16 16:23:28 -06:00
Pavel Begunkov c7cafd5b81 io_uring/poll: fix POLLERR handling
8c8492ca64 ("io_uring/net: don't retry connect operation on EPOLLERR")
is a little dirty hack that
1) wrongfully assumes that POLLERR equals a failed request, which
breaks all POLLERR users, e.g. all error queue recv interfaces,
2) deviates the connect request's behaviour from connect(2), and
3) is racy and solved at the wrong level.

Nothing can be done with 2) now, and 3) is beyond the scope of the
patch. At least solve 1) by moving the hack out of generic poll handling
into io_connect().

Cc: stable@vger.kernel.org
Fixes: 8c8492ca64 ("io_uring/net: don't retry connect operation on EPOLLERR")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3dc89036388d602ebd84c28e5042e457bdfc952b.1752682444.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-16 10:28:28 -06:00
Norman Maurer 0ebc9a7ecf io_uring/net: Support multishot receive len cap
At the moment it's very hard to do fine-grained backpressure when using
multishot, as the kernel might produce a lot of completions before the
user has a chance to cancel a previously submitted multishot recv.

This change adds support for issuing a multishot recv that is capped by
a length, which means the kernel will only keep rearming until that much
data has been received. When the limit is reached, the completion
signals that a re-arm needs to happen manually by not setting the
IORING_CQE_F_MORE flag.
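
In userspace terms, the consumer re-arms when a completion arrives
without IORING_CQE_F_MORE (a liburing-flavoured sketch; the buffer-ring
setup and the consume() handler are assumed):

    struct io_uring_cqe *cqe;

    io_uring_wait_cqe(&ring, &cqe);
    if (cqe->res > 0)
        consume(cqe);
    if (!(cqe->flags & IORING_CQE_F_MORE)) {
        /* cap reached (or recv terminated): manually re-arm */
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

        io_uring_prep_recv_multishot(sqe, sockfd, NULL, 0, 0);
        sqe->flags |= IOSQE_BUFFER_SELECT;
        sqe->buf_group = bgid;
        io_uring_submit(&ring);
    }
    io_uring_cqe_seen(&ring, cqe);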

Signed-off-by: Norman Maurer <norman_maurer@apple.com>
Link: https://lore.kernel.org/r/20250715140249.31186-1-norman_maurer@apple.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-16 08:28:11 -06:00
Jens Axboe 8723c146ad io_uring: deduplicate wakeup handling
Both io_poll_wq_wake() and io_cqring_wake() contain the exact same code,
and most of the comment in the latter applies equally to both.

Move the test and wakeup handling into a basic helper that they can both
use, and move part of the comment that applies generically to this new
helper.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-15 12:20:06 -06:00
Jens Axboe b1915b18e1 io_uring/net: cast min_not_zero() type
kernel test robot reports that xtensa complains about different
signedness for a min_not_zero() comparison. Cast the int part to size_t
to avoid this issue.

Fixes: e227c8cdb4 ("io_uring/net: use passed in 'len' in io_recv_buf_select()")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202507150504.zO5FsCPm-lkp@intel.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-14 16:36:08 -06:00
Pavel Begunkov 08ca1409c4 io_uring/zcrx: disallow user selected dmabuf offset and size
zcrx shouldn't be so frivolous about cutting a dmabuf sgtable and taking
a subrange of it; the dmabuf layer might not be expecting that. It
shouldn't be a problem for now, but since the zcrx dmabuf support is new
and there shouldn't be any real users, let's play safe and reject user
provided ranges into dmabufs. Also, it shouldn't be needed as userspace
should size them appropriately.

Fixes: a5c98e9424 ("io_uring/zcrx: dmabuf backed zerocopy receive")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/be899f1afed32053eb2e2079d0da241514674aca.1752443579.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-14 08:54:43 -06:00
Jens Axboe 6e4098382b io_uring/poll: cleanup apoll freeing
No point having REQ_F_POLLED in both IO_REQ_CLEAN_FLAGS and in
IO_REQ_CLEAN_SLOW_FLAGS, and having both io_free_batch_list() and then
io_clean_op() check for it and clean it.

Move REQ_F_POLLED to IO_REQ_CLEAN_SLOW_FLAGS and drop it from
IO_REQ_CLEAN_FLAGS, and have only io_free_batch_list() do the check and
freeing.

Link: https://lore.kernel.org/io-uring/20250712000344.1579663-2-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-12 15:03:05 -06:00
Linus Torvalds cb3002e0e9 io_uring-6.16-20250710
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmhwb9gQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpiIAD/0Vs8uJRMTx/tB4xCRDoPrW5mdWK+d6FiPg
 e0/1Kn5J1vPEbM6uYpn6wZe0BwHS76OQGhJ5OrsFSjtxX5VA4rMYZxJZYxVLW88W
 U+Y4dGuU1ZoLPQYwGrKVSXz+9tKQzXJOsYIF/LCMvxgjFPJuvzwSsp0DeXT7vzBT
 9UsEcnCfjK31X4OBNa9F8RvgfguodknVL3k6B/98wx3+DODM9xaSv7tgDhULFl4Q
 U+eZYtKr0dd0jUhaWiMgJrmGZ/bElRn36ILsOhJ0wgcZws3l+zLHkCC202Nx+J8/
 VvljSeke1hUoY4YMoVAmJ72XlvSW+C2EqTO56P2xEyzpz0/Xhm00qVsiKZQHR0Ia
 r6xos6scvnni5myVgpkcLpbFRjHlrtSjX+kh3ozqFdya83/Mjd7Midizn7mjaiFS
 4r5KK4ov3fXLY29rYeuREkZys31Fn8XCERd3N7RPLAN/hzEC4fXm9/S0lkOqwqTP
 3OtvUu+AwepyUyJ0KlakYUDvu6X+vP6WFkQFLFIBFcN/OglWRZe5r3fuQoX0iw6/
 Ln+DB+W6XtBi7rzIjuYYzAMgC7iiZc57e64iXlzSyPEsjOUkTngKRH4zQY8MyjFb
 1Fnn7TWxIqHzlfpvu5g/e6dxbTduQxnLTDNQwocXw7hc8/D49wbyUUy4KGc+yyUP
 PYAYQSwtmw==
 =4o0B
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-6.16-20250710' of git://git.kernel.dk/linux

Pull io_uring fixes from Jens Axboe:

 - Remove a pointless warning in the zcrx code

 - Fix for MSG_RING commands, where the allocated io_kiocb
   needs to be freed under RCU as well

 - Revert the work-around we had in place for the anon inodes
   pretending to be regular files. Since that got reworked
   upstream, the work-around is no longer needed

* tag 'io_uring-6.16-20250710' of git://git.kernel.dk/linux:
  Revert "io_uring: gate REQ_F_ISREG on !S_ANON_INODE as well"
  io_uring/msg_ring: ensure io_kiocb freeing is deferred for RCU
  io_uring/zcrx: fix pp destruction warnings
2025-07-11 10:29:30 -07:00
Jens Axboe 6a8afb9fff io_uring/net: allow multishot receive per-invocation cap
If an application is handling multiple receive streams using recv
multishot, then the amount of retries and buffer peeking for multishot
and bundles can process too much per socket before moving on. This isn't
directly controllable by the application. By default, io_uring will
retry a recv MULTISHOT_MAX_RETRY (32) times, if the socket keeps having
data to receive. And if using bundles, then each bundle peek will
potentially map up to PEEK_MAX_IMPORT (256) iovecs of data. Once these
limits are hit, then a requeue operation will be done, where the request
will get retried after other pending requests have had a time to get
executed.

Add support for capping the per-invocation receive length, before a
requeue condition is considered for each receive. This is done by setting
sqe->mshot_len to the byte value. For example, if this is set to 1024,
then each receive will be requeued once 1024 bytes have been received.
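
Arming such a capped request might look like this (liburing-flavoured
sketch; sqe->mshot_len is the field named above, everything else is
standard multishot recv setup):

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

    io_uring_prep_recv_multishot(sqe, sockfd, NULL, 0, 0);
    sqe->flags |= IOSQE_BUFFER_SELECT;
    sqe->buf_group = bgid;
    /* requeue rather than keep retrying once ~1024 bytes have arrived */
    sqe->mshot_len = 1024;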

Link: https://lore.kernel.org/io-uring/20250709203420.1321689-4-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-10 11:54:33 -06:00
Jens Axboe 3919b69593 io_uring/net: move io_sr_msg->retry_flags to io_sr_msg->flags
There's plenty of space left, as sr->flags is a 16-bit type. The UAPI
bits are the lower 8 bits, as that's all that sqe->ioprio can carry in
the SQE anyway. Use a few of the upper 8 bits for internal uses, rather
than have two separate flags entries.

Link: https://lore.kernel.org/io-uring/20250709203420.1321689-2-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-10 11:54:22 -06:00
Jens Axboe e227c8cdb4 io_uring/net: use passed in 'len' in io_recv_buf_select()
len is a pointer to the desired len; use that rather than grabbing it from
sr->len again. No functional changes as of this patch, but it does
prepare io_recv_buf_select() for getting passed in a value that differs
from sr->len.

Link: https://lore.kernel.org/io-uring/20250709203420.1321689-3-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-10 11:54:08 -06:00
Pavel Begunkov e67645bb7f io_uring/zcrx: prepare fallback for larger pages
io_zcrx_copy_chunk() processes one page at a time, which won't be
sufficient when the net_iov size grows. Introduce a structure keeping
the target niov page and other parameters; it's more convenient and can
be reused later. Also add a helper function that can efficiently copy
buffers of arbitrary length. For 64-bit archs the loop inside should
be compiled out.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e84bc705a4e1edeb9aefff470d96558d8232388f.1751466461.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-08 11:59:56 -06:00
Pavel Begunkov 1b4dc1ff0a io_uring/zcrx: assert area type in io_zcrx_iov_page
Add a simple debug assertion to io_zcrx_iov_page() making sure it's not
trying to return pages for a dmabuf area.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c3c30a926a18436a399a1768f3cc86c76cd17fa7.1751466461.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-08 11:59:56 -06:00
Pavel Begunkov b84621d96e io_uring/zcrx: allocate sgtable for umem areas
Currently, dma addresses for umem areas are stored directly in niovs.
It's memory efficient but inconvenient. I need a better format 1) to
share code with dmabuf areas, and 2) for disentangling page, folio and
niov sizes. dmabuf already provides sg_table, create one for user memory
as well.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: David Wei <dw@davidwei.uk>
Link: https://lore.kernel.org/r/f3c15081827c1bf5427d3a2e693bc526476b87ee.1751466461.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-08 11:59:56 -06:00
Pavel Begunkov 54e89a93ef io_uring/zcrx: introduce io_populate_area_dma
Add a helper that initialises page-pool dma addresses from a sg table.
It'll be reused in following patches.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: David Wei <dw@davidwei.uk>
Link: https://lore.kernel.org/r/a8972a77be9b5675abc585d6e2e6e30f9c7dbd85.1751466461.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-08 11:59:56 -06:00
Pavel Begunkov 06897ddfc5 io_uring/zcrx: return error from io_zcrx_map_area_*
io_zcrx_map_area_*() helpers return the number of processed niovs, which
we use to unroll some of the mappings for user memory areas. It's
unwieldy, and dmabuf doesn't care about it. Return an error code instead
and move the partial unmapping on failure into io_zcrx_map_area_umem().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: David Wei <dw@davidwei.uk>
Link: https://lore.kernel.org/r/42668e82be3a84b07ee8fc76d1d6d5ac0f137fe5.1751466461.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-08 11:59:56 -06:00
Pavel Begunkov e9a9ddb15b io_uring/zcrx: always pass page to io_zcrx_copy_chunk
io_zcrx_copy_chunk() currently takes either a page or virtual address.
Unify the parameters, make it take pages, and resolve the linear part
into a page the same way the general networking code does.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: David Wei <dw@davidwei.uk>
Link: https://lore.kernel.org/r/b8f9f4bac027f5f44a9ccf85350912d1db41ceb8.1751466461.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-08 11:59:56 -06:00
Jens Axboe 9dff55ebae Revert "io_uring: gate REQ_F_ISREG on !S_ANON_INODE as well"
This reverts commit 6f11adcc6f.

The problematic commit was fixed in mainline, so the work-around in
io_uring can be removed at this point. Anonymous inodes no longer
pretend to be regular files after:

1e7ab6f678 ("anon_inode: rework assertions")

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-08 11:09:01 -06:00
Jens Axboe fc582cd26e io_uring/msg_ring: ensure io_kiocb freeing is deferred for RCU
syzbot reports that defer/local task_work adding via msg_ring can hit
a request that has been freed:

CPU: 1 UID: 0 PID: 19356 Comm: iou-wrk-19354 Not tainted 6.16.0-rc4-syzkaller-00108-g17bbde2e1716 #0 PREEMPT(full)
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025
Call Trace:
 <TASK>
 dump_stack_lvl+0x189/0x250 lib/dump_stack.c:120
 print_address_description mm/kasan/report.c:408 [inline]
 print_report+0xd2/0x2b0 mm/kasan/report.c:521
 kasan_report+0x118/0x150 mm/kasan/report.c:634
 io_req_local_work_add io_uring/io_uring.c:1184 [inline]
 __io_req_task_work_add+0x589/0x950 io_uring/io_uring.c:1252
 io_msg_remote_post io_uring/msg_ring.c:103 [inline]
 io_msg_data_remote io_uring/msg_ring.c:133 [inline]
 __io_msg_ring_data+0x820/0xaa0 io_uring/msg_ring.c:151
 io_msg_ring_data io_uring/msg_ring.c:173 [inline]
 io_msg_ring+0x134/0xa00 io_uring/msg_ring.c:314
 __io_issue_sqe+0x17e/0x4b0 io_uring/io_uring.c:1739
 io_issue_sqe+0x165/0xfd0 io_uring/io_uring.c:1762
 io_wq_submit_work+0x6e9/0xb90 io_uring/io_uring.c:1874
 io_worker_handle_work+0x7cd/0x1180 io_uring/io-wq.c:642
 io_wq_worker+0x42f/0xeb0 io_uring/io-wq.c:696
 ret_from_fork+0x3fc/0x770 arch/x86/kernel/process.c:148
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
 </TASK>

which is supposed to be safe with how requests are allocated. But msg
ring requests alloc and free on their own, and hence must defer freeing
to a sane time.

Add an rcu_head and use kfree_rcu() in both spots where requests are
freed. Only the one in io_msg_tw_complete() is strictly required as it
has been visible on the other ring, but use it consistently in the other
spot as well.

This should not cause any other issues outside of KASAN rightfully
complaining about it.
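
A minimal sketch of the pattern (the exact placement of the rcu_head in
io_kiocb is assumed):

    struct io_kiocb {
        /* ... existing fields ... */
        struct rcu_head rcu_head;   /* for deferred freeing */
    };

    /* instead of kfree(req): defer the free past an RCU grace period */
    kfree_rcu(req, rcu_head);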

Link: https://lore.kernel.org/io-uring/686cd2ea.a00a0220.338033.0007.GAE@google.com/
Reported-by: syzbot+54cbbfb4db9145d26fc2@syzkaller.appspotmail.com
Cc: stable@vger.kernel.org
Fixes: 0617bb500b ("io_uring/msg_ring: improve handling of target CQE posting")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-08 11:08:31 -06:00
Jens Axboe 825aea662b io_uring/rw: cast rw->flags assignment to rwf_t
kernel test robot reports that a recent change of the sqe->rw_flags
field throws a sparse warning on 32-bit archs:

>> io_uring/rw.c:291:19: sparse: sparse: incorrect type in assignment (different base types) @@     expected restricted __kernel_rwf_t [usertype] flags @@     got unsigned int @@
   io_uring/rw.c:291:19: sparse:     expected restricted __kernel_rwf_t [usertype] flags
   io_uring/rw.c:291:19: sparse:     got unsigned int

Force cast it to rwf_t to silence that new sparse warning.
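
The fix is essentially a one-line annotated cast (a sketch of the
assignment the warning points at):

    /* __kernel_rwf_t is a sparse bitwise type; annotate the conversion */
    rw->flags = (__force rwf_t) READ_ONCE(sqe->rw_flags);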

Fixes: cf73d9970e ("io_uring: don't use int for ABI")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202507032211.PwSNPNSP-lkp@intel.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-07 16:46:30 -06:00
Pavel Begunkov 203817de26 io_uring/zcrx: fix pp destruction warnings
With multiple page pools and in some other cases we can have allocated
niovs on page pool destruction. Remove a misplaced warning checking that
all niovs are returned to zcrx on io_pp_zc_destroy(). It was reported
before but apparently got lost.

Reported-by: Pedro Tammela <pctammela@mojatatu.com>
Fixes: 34a3e60821 ("io_uring/zcrx: implement zerocopy receive pp memory provider")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b9e6d919d2964bc48ddbf8eb52fc9f5d118e9bc1.1751878185.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-07 06:53:54 -06:00
Jens Axboe 1bc8890264 Merge branch 'io_uring-6.16' into for-6.17/io_uring
Merge in 6.16 io_uring fixes, to avoid clashes with pending net and
settings changes.

* io_uring-6.16:
  io_uring: gate REQ_F_ISREG on !S_ANON_INODE as well
  io_uring/kbuf: flag partial buffer mappings
  io_uring/net: mark iov as dynamically allocated even for single segments
  io_uring: fix resource leak in io_import_dmabuf()
  io_uring: don't assume uaddr alignment in io_vec_fill_bvec
  io_uring/rsrc: don't rely on user vaddr alignment
  io_uring/rsrc: fix folio unpinning
  io_uring: make fallocate be hashed work
2025-07-06 16:42:23 -06:00
Caleb Sander Mateos daa01d954b io_uring/rsrc: skip atomic refcount for uncloned buffers
io_buffer_unmap() performs an atomic decrement of the io_mapped_ubuf's
reference count in case it has been cloned into another io_ring_ctx's
registered buffer table. This is an expensive operation and unnecessary
in the common case that the io_mapped_ubuf is only registered once.
Load the reference count first and check whether it's 1. In that case,
skip the atomic decrement and immediately free the io_mapped_ubuf.
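
A sketch of the resulting unmap path (the free helper's name is
assumed):

    static void io_buffer_unmap(struct io_ring_ctx *ctx,
                                struct io_mapped_ubuf *imu)
    {
        /* uncloned buffers hold the only reference: skip the atomic RMW */
        if (refcount_read(&imu->refs) > 1 &&
            !refcount_dec_and_test(&imu->refs))
            return;
        io_free_imu(ctx, imu);  /* assumed free helper */
    }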

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250619143435.3474028-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-02 17:11:23 -06:00
Pavel Begunkov e448d57826 io_uring/mock: add trivial poll handler
Add a flag that enables polling on the mock file. For now it trivially
reports that there is always data available; it'll be extended in the
future.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f16de043ec4876d65fae294fc99ade57415fba0c.1750599274.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-02 08:10:26 -06:00
Pavel Begunkov 0c98a44329 io_uring/mock: support for async read/write
Let the user specify a delay for read/write requests. io_uring will
start a timer, return -EIOCBQUEUED, and complete the request
asynchronously after the delay passes.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/38f9d2e143fda8522c90a724b74630e68f9bbd16.1750599274.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-02 08:10:26 -06:00
Pavel Begunkov 2f71d2386f io_uring/mock: allow to choose FMODE_NOWAIT
Add an option to choose whether the file supports FMODE_NOWAIT, which
changes the execution path an io_uring request takes.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1e532565b05a05b23589d237c24ee1a3d90c2fd9.1750599274.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-02 08:10:26 -06:00
Pavel Begunkov d1aa034657 io_uring/mock: add sync read/write
Add support for synchronous zero read/write for mock files.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/571f3c9fe688e918256a06a722d3db6ced9ca3d5.1750599274.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-02 08:10:26 -06:00
Pavel Begunkov 4aac001f78 io_uring/mock: add cmd using vectored regbufs
There is a command API that allows importing vectored registered
buffers. Add a new mock command that uses the feature and simply copies
the specified registered buffer into user space, or vice versa.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/229a113fd7de6b27dbef9567f7c0bf4475c9017d.1750599274.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-02 08:10:26 -06:00
Pavel Begunkov 3a0ae385f6 io_uring/mock: add basic infra for test mock files
io_uring commands provide an ioctl-style interface for files to
implement file-specific operations. io_uring provides many features and
advanced APIs to commands, and it's getting hard to test as it requires
specific files/devices.

Add basic infrastructure for creating special mock files that implement
the cmd API and use the various io_uring features we want to test. It'll
also be useful for testing some more obscure read/write/polling edge
cases in the future.

Suggested-by: chase xd <sl1589472800@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/93f21b0af58c1367a2b22635d5a7d694ad0272fc.1750599274.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-02 08:10:26 -06:00
Linus Torvalds 66701750d5 io_uring-6.16-20250630
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmhi/gwQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpq3FEACU0f8KTDoKE/kacZq0feu0869ycNU8RGLo
 h3ehaQn0yqieoFjmLyh2y2u6SYdyacPHFDtfXmfkdy1NZ5ORzLmDJHvuqrFgBdFj
 G+azBmi55nmSX+GwrMaX+6KwpFqtAFeHhf/2XrvTTgBprhif/3eYrtHPaQGx9Lcl
 +sMG0tUTfyL6yOAaDabcY3KoN6Yy6CBvDLknnigLBuWPFoJk7p+srywuyJat8YQv
 TOk572L1Uq11+rrQOhb9I9+sGKR3PmWyaGhmLnNa//FCO8oiONX/ncB0L222+yH6
 xpwvHLlP8esoE4dZxrveub0lVg7N3hAjjehwxZBPVTFyKXRoe10wbaEAlySskMHu
 I2MZVo82BWaN/k9IoLA9SwvpAztLJFaU30AZ8UMdIYNjR5ZFZOyXRhbCLBQua9Fd
 k9+nPD0WPWMCvRgbvInyYRdJyMNktA0fPrhspSDdv/QPKI8O3pdmNw0zd9dXE2ZD
 CeG+P5K2Mftaq1Fky1ZgtldT3vqdLgh676MH0RTcfHKDraBDojN7RhLngzrGDalp
 Z6P4EeO7km3OiSUMyDc6nrzcgv939wqi+Zg5+6SjBxhx+gAj6HJHPatkKU5OCjsW
 UOBxFp4wUPTdFnHE5Tsan3pwBxU22ZE4WEeD6QwnxfuICJVWlsEmftk/2AoCusXW
 aLfjD6KRXA==
 =F1L2
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-6.16-20250630' of git://git.kernel.dk/linux

Pull io_uring fix from Jens Axboe:
 "Now that anonymous inodes set S_IFREG, this breaks the io_uring
  read/write retries for short reads/writes. As things like timerfd and
  eventfd are anon inodes, applications that previously did:

    unsigned long event_data[2];

    io_uring_prep_read(sqe, evfd, event_data, sizeof(event_data), 0);

  and just got a short read when 1 event was posted, will now wait for
  the full amount before posting a completion.

  This caused issues for the ghostty application, making it basically
  unusable due to excessive buffering"

* tag 'io_uring-6.16-20250630' of git://git.kernel.dk/linux:
  io_uring: gate REQ_F_ISREG on !S_ANON_INODE as well
2025-06-30 16:32:43 -07:00
Jens Axboe 6f11adcc6f io_uring: gate REQ_F_ISREG on !S_ANON_INODE as well
io_uring marks a request as dealing with a regular file on S_ISREG. This
drives things like retries on short reads or writes, which is generally
not expected on a regular file (or bdev). Applications tend to not
expect that, so io_uring tries hard to ensure it doesn't deliver short
IO on regular files.

However, a recent commit added S_IFREG to anonymous inodes. When
io_uring is used to read from various things that are backed by anon
inodes, like eventfd, timerfd, etc, then it'll now all of a sudden wait
for more data rather than deliver what was read or written in a
single operation. This breaks applications that issue reads on anon
inodes, if they ask for more data than a single read delivers.

Add a check for !S_ANON_INODE as well before setting REQ_F_ISREG to
prevent that.

Cc: Christian Brauner <brauner@kernel.org>
Cc: stable@vger.kernel.org
Link: https://github.com/ghostty-org/ghostty/discussions/7720
Fixes: cfd86ef7e8 ("anon_inode: use a proper mode internally")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-29 16:52:34 -06:00
Linus Torvalds 0a47e02d8a io_uring-6.16-20250626
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmhd4xwQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgph6UEACRarJ4kKnkjN3EtYWZRzoYRXHAPd3U5s/9
 ySh02Gzr/C2mb6N6eMVKlYKiJ+wj1zajfGtu7ng/i3mIZ2NNF4RDAlkvFntwUiZN
 CDAnRmLIroHnGsYRPGHGbBCqVPC1/LAbWef4B/QQy/lmzvZlXKe68FsAbreXosXJ
 zyP6lH/Ce9MbZdXfnr4OSZxkc/RhF1gRgt0X2AaHnFLGRYqG/W2GXLnbby1E5clQ
 Y7JdZ2W4qdpECYr3Qe7XJXkWYSNdFw8JJA89IfRM/iK0xlexOH4HJ78Rk9Czf8k2
 WHHYBZMLDWwmaa31Zr3tsgvDfD2g8kZJp+ZTf8bB6720rzRNAqPLA7u5mW5lAwKi
 MK+iYnlVRQU4zzhGZPkpgUvgd5s6b+7lRCvIOEXTj6ETqfll6NRXWWTwusNzZDul
 Z2ffFWFHtSMgeO5A5dm4BK3t1qFbBC+odQzQyccLdH0H7DLPx34js8hqDwN6yWPL
 Z/VRFBJo2k4SrJiVbuLLDQoJ20J2NOm6VMuFWSIg5yi/V9kIxnI8ge8cbC0GoVUz
 DzsInjGb5Jjb3KKdPfhUmtu+FmNKcv4JlguUG7MyfYWyHr9kqWsGel1AkKDsWdra
 d6sqEum4F3UzdAjbCJPG71q/9avtio98IuaDXXKAAmd8gL1DCZ0QZEYV8c/A8n4M
 fXmdKG3Xqw==
 =BUYz
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-6.16-20250626' of git://git.kernel.dk/linux

Pull io_uring fixes from Jens Axboe:

 - Two tweaks for a recent fix: fixing a memory leak if multiple iovecs
   were initially mapped but only the first was used and hence turned
   into a UBUF rather than an IOVEC iterator, and catching a case where
   a retry would be done even if the previous segment wasn't full

 - Small series fixing an issue making the vm unhappy if debugging is
   turned on, hitting a VM_BUG_ON_PAGE()

 - Fix a resource leak in io_import_dmabuf() in the error handling case,
   which is a regression in this merge window

 - Mark fallocate as needing to be write serialized, as is already done
   for truncate and buffered writes

* tag 'io_uring-6.16-20250626' of git://git.kernel.dk/linux:
  io_uring/kbuf: flag partial buffer mappings
  io_uring/net: mark iov as dynamically allocated even for single segments
  io_uring: fix resource leak in io_import_dmabuf()
  io_uring: don't assume uaddr alignment in io_vec_fill_bvec
  io_uring/rsrc: don't rely on user vaddr alignment
  io_uring/rsrc: fix folio unpinning
  io_uring: make fallocate be hashed work
2025-06-27 08:55:57 -07:00
Jens Axboe 178b8ff66f io_uring/kbuf: flag partial buffer mappings
A previous commit aborted mapping more buffers for a non-incremental
ring during bundle peeking, but depending on where this peeking
happened, it would not necessarily prevent a retry by the user. That can
create gaps in the received/read data.

Add struct buf_sel_arg->partial_map, which can pass this information
back. The networking side can then map that to internal state and use it
to gate retry as well.

Since this necessitates a new flag, change io_sr_msg->retry to a
retry_flags member, and store both the retry and partial map condition
in there.
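
A sketch of the new plumbing (field type assumed):

    struct buf_sel_arg {
        /* ... existing fields ... */
        unsigned short partial_map;  /* last buffer only partially mapped */
    };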

Cc: stable@vger.kernel.org
Fixes: 26ec15e4b0 ("io_uring/kbuf: don't truncate end buffer for multiple buffer peeks")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-26 12:17:48 -06:00
Jens Axboe 9a709b7e98 io_uring/net: mark iov as dynamically allocated even for single segments
A bigger array of vecs could've been allocated, but
io_ring_buffers_peek() still decided to cap the mapped range depending
on how much data was available. Hence don't rely on the segment count
to know if the request should be marked as needing cleanup, always
check upfront if the iov array is different than the fast_iov array.

Fixes: 26ec15e4b0 ("io_uring/kbuf: don't truncate end buffer for multiple buffer peeks")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-25 10:17:06 -06:00
Penglei Jiang 7cac633a42 io_uring: fix resource leak in io_import_dmabuf()
Replace the return statement with setting ret = -EINVAL and jumping to
the err label to ensure resources are released via io_release_dmabuf.

Fixes: a5c98e9424 ("io_uring/zcrx: dmabuf backed zerocopy receive")
Signed-off-by: Penglei Jiang <superman.xpt@gmail.com>
Link: https://lore.kernel.org/r/20250625102703.68336-1-superman.xpt@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-25 08:14:14 -06:00
Pavel Begunkov e1d7727b73 io_uring: don't assume uaddr alignment in io_vec_fill_bvec
There is no guaranteed alignment for user pointers. Don't use mask
trickery and adjust the offset by bv_offset.
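
A sketch of the corrected per-segment setup (the surrounding iteration
is omitted):

    /* no alignment guarantees: take the in-page offset from the user
     * address itself instead of relying on mask arithmetic */
    bvec_set_page(bvec, page, seg_len, offset_in_page(uaddr));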

Cc: stable@vger.kernel.org
Reported-by: David Hildenbrand <david@redhat.com>
Fixes: 9ef4cbbcb4 ("io_uring: add infra for importing vectored reg buffers")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/io-uring/19530391f5c361a026ac9b401ff8e123bde55d98.1750771718.git.asml.silence@gmail.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-24 20:51:08 -06:00
Pavel Begunkov 3a3c6d6157 io_uring/rsrc: don't rely on user vaddr alignment
There is no guaranteed alignment for user pointers; however, the
calculation of the offset of the first page into a folio after
coalescing uses some weird bit mask logic. Get rid of it.

Cc: stable@vger.kernel.org
Reported-by: David Hildenbrand <david@redhat.com>
Fixes: a8edbb424b ("io_uring/rsrc: enable multi-hugepage buffer coalescing")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/io-uring/e387b4c78b33f231105a601d84eefd8301f57954.1750771718.git.asml.silence@gmail.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-24 20:50:59 -06:00
Pavel Begunkov 5afb4bf9fc io_uring/rsrc: fix folio unpinning
syzbot complains about an unmapping failure:

[  108.070381][   T14] kernel BUG at mm/gup.c:71!
[  108.070502][   T14] Internal error: Oops - BUG: 00000000f2000800 [#1]  SMP
[  108.123672][   T14] Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20250221-8.fc42 02/21/2025
[  108.127458][   T14] Workqueue: iou_exit io_ring_exit_work
[  108.174205][   T14] Call trace:
[  108.175649][   T14]  sanity_check_pinned_pages+0x7cc/0x7d0 (P)
[  108.178138][   T14]  unpin_user_page+0x80/0x10c
[  108.180189][   T14]  io_release_ubuf+0x84/0xf8
[  108.182196][   T14]  io_free_rsrc_node+0x250/0x57c
[  108.184345][   T14]  io_rsrc_data_free+0x148/0x298
[  108.186493][   T14]  io_sqe_buffers_unregister+0x84/0xa0
[  108.188991][   T14]  io_ring_ctx_free+0x48/0x480
[  108.191057][   T14]  io_ring_exit_work+0x764/0x7d8
[  108.193207][   T14]  process_one_work+0x7e8/0x155c
[  108.195431][   T14]  worker_thread+0x958/0xed8
[  108.197561][   T14]  kthread+0x5fc/0x75c
[  108.199362][   T14]  ret_from_fork+0x10/0x20

We can pin a tail page of a folio, but then io_uring will try to unpin
the head page of the folio. While it should be fine in terms of keeping
the page actually alive, mm folks say it's wrong and triggers a debug
warning. Use unpin_user_folio() instead of unpin_user_page*.
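
A sketch of the replacement call (loop context omitted):

    /* release one pin on the folio, regardless of which of its pages
     * was originally recorded */
    unpin_user_folio(folio, 1);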

Cc: stable@vger.kernel.org
Debugged-by: David Hildenbrand <david@redhat.com>
Reported-by: syzbot+1d335893772467199ab6@syzkaller.appspotmail.com
Closes: https://lkml.kernel.org/r/683f1551.050a0220.55ceb.0017.GAE@google.com
Fixes: a8edbb424b ("io_uring/rsrc: enable multi-hugepage buffer coalescing")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/io-uring/a28b0f87339ac2acf14a645dad1e95bbcbf18acd.1750771718.git.asml.silence@gmail.com/
[axboe: adapt to current tree, massage commit message]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-24 20:49:39 -06:00
Pavel Begunkov 9e4ed359b8 io_uring/netcmd: add tx timestamping cmd support
Add a new socket command which returns tx time stamps to the user. It
provides an alternative to the existing error queue recvmsg interface.
The command works in a polled multishot mode, which means io_uring will
poll the socket and keep posting timestamps until the request is
cancelled or fails in any other way (e.g. with no space in the CQ). It
reuses the net infra and grabs timestamps from the socket's error queue.

The command requires IORING_SETUP_CQE32. All non-final CQEs (marked with
IORING_CQE_F_MORE) have cqe->res set to the tskey, and the upper 16 bits
of cqe->flags keep tstype (i.e. offset by IORING_CQE_BUFFER_SHIFT). The
time value is stored in the upper part of the extended CQE. The final
completion won't have IORING_CQE_F_MORE and will have cqe->res storing
0/error.
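
A userspace-side sketch of consuming these CQEs, following the layout
described above (the big_cqe[0] indexing for the time value is an
assumption):

    /* cqe from a ring created with IORING_SETUP_CQE32 */
    if (cqe->flags & IORING_CQE_F_MORE) {
        __u32 tskey  = cqe->res;
        __u32 tstype = cqe->flags >> IORING_CQE_BUFFER_SHIFT;
        __u64 tstamp = cqe->big_cqe[0];  /* upper part of the extended CQE */

        handle_timestamp(tskey, tstype, tstamp);  /* hypothetical */
    } else {
        /* final completion: cqe->res holds 0 or a negative error */
    }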

Suggested-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/92ee66e6b33b8de062a977843d825f58f21ecd37.1750065793.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-23 09:00:12 -06:00
Pavel Begunkov ac479eac22 io_uring: add mshot helper for posting CQE32
Add a helper for posting 32 byte CQEs in a multishot mode and add a cmd
helper on top. As it specifically works with requests, the helper ignores
the passed-in cqe->user_data and sets it to the one stored in the
request.

The command helper is only valid with multishot requests.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c29d7720c16e1f981cfaa903df187138baa3946b.1750065793.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-23 09:00:12 -06:00
Pavel Begunkov b955754959 io_uring/cmd: allow multishot polled commands
Some commands like timestamping in the next patch can make use of
multishot polling, i.e. REQ_F_APOLL_MULTISHOT. Add support for that,
which is condensed in a single helper called io_cmd_poll_multishot().

The user who wants to continue with a request in a multishot mode must
call the function, and only if it returns 0 is the user free to proceed.
Apart from normal terminal errors, it can also end up with -EIOCBQUEUED,
in which case the user must forward it to the core io_uring. It's
forbidden to use task work while the request is executing in a multishot
mode.

The API is not foolproof, hence it's not exported to modules nor exposed
in public headers.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/bcf97c31659662c72b69fc8fcdf2a88cfc16e430.1750065793.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-23 09:00:12 -06:00
Pavel Begunkov 1621518892 io_uring/poll: introduce io_arm_apoll()
In preparation for allowing commands to do file polling, add a helper
that takes the desired poll event mask and arms it for polling. We won't
be able to use io_arm_poll_handler() with IORING_OP_URING_CMD as it
tries to infer the mask from the opcode data, and we can't unify it
across all commands.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7ee5633f2dc45fd15243f1a60965f7e30e1c48e8.1750065793.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-23 09:00:12 -06:00
Jens Axboe cb9ccfb404 io_uring/nop: add IORING_NOP_TW completion flag
To test and profile the overhead of io_uring task_work and the various
types of it, add IORING_NOP_TW which tells nop to signal completions
through task_work rather than complete them inline.
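
A sketch of opting in from userspace (liburing-flavoured; the nop_flags
field name is assumed from the existing IORING_NOP_* UAPI):

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

    io_uring_prep_nop(sqe);
    /* complete via task_work instead of inline */
    sqe->nop_flags |= IORING_NOP_TW;
    io_uring_submit(&ring);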

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-23 08:59:13 -06:00
Jens Axboe ecf47d452c io_uring/uring_cmd: implement ->sqe_copy() to avoid unnecessary copies
uring_cmd currently copies the full SQE at prep time, just in case it
needs it to be stable. However, for inline completions or requests that
get queued up on the device side, there's no need to ever copy the SQE.
This is particularly important, as various use cases of uring_cmd will
be using 128b sized SQEs.

Opt in to using ->sqe_copy() to let the core of io_uring decide when to
copy SQEs. This callback will only be called if it is safe to do so.
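
A sketch of what the opt-in looks like in the cold defs table (the
handler name is assumed):

    [IORING_OP_URING_CMD] = {
        .name     = "URING_CMD",
        .sqe_copy = io_uring_cmd_sqe_copy,  /* copies the SQE only when
                                             * issue goes async */
    },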

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-23 08:59:13 -06:00
Jens Axboe ead21053bf io_uring/uring_cmd: get rid of io_uring_cmd_prep_setup()
It's a pretty pointless helper, just allocates and copies data. Fold it
into io_uring_cmd_prep().

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-23 08:59:13 -06:00
Jens Axboe af19388a97 io_uring: add struct io_cold_def->sqe_copy() method
Will be called by the core of io_uring, if inline issue is not going
to be tried for a request. Opcodes can define this handler to defer
copying of SQE data that should remain stable.

Only called if IO_URING_F_INLINE is set. If it isn't set, then there's a
bug in the core handling of this, and -EFAULT will be returned instead
to terminate the request. This will trigger a WARN_ON_ONCE(). Don't
expect this to ever trigger, and down the line this can be removed.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-23 08:59:13 -06:00
Jens Axboe 4d811e395b io_uring: add IO_URING_F_INLINE issue flag
Set when the execution of the request is done inline from the system
call itself. Any deferred issue will never have this flag set.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-23 08:59:13 -06:00
Fengnan Chang 88a80066af io_uring: make fallocate be hashed work
Like ftruncate and write, fallocate operations on the same file cannot
be executed in parallel, so it is better to make fallocate be hashed
work.
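
A sketch of the implied opdef change (flag name assumed, mirroring how
write and ftruncate are serialized):

    [IORING_OP_FALLOCATE] = {
        .needs_file    = 1,
        .hash_reg_file = 1,  /* io-wq hashes work on the inode, so
                              * same-file fallocates run serially */
    },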

Signed-off-by: Fengnan Chang <changfengnan@bytedance.com>
Link: https://lore.kernel.org/r/20250623110218.61490-1-changfengnan@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-23 08:58:44 -06:00
Linus Torvalds f7301f856d io_uring-6.16-20250621
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmhWq6oQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpqTdEACmwjU6TCbV++/5Vk1FdtIXdpLhF/aoNeea
 gqR4qqBoaHiN1SBKBLjZLOmOyS++3YAmYVoMk/9Xe4cAI/WhPj4e5Zy2mfuPfj+5
 EIThW2y2YyUfy8HyqjjZmdkEEkYtpbNQ3dxsy2j7uOxrmcEvXhbSSUrCUObdsmz7
 JzTCNfif0PmU2kOL9OlkKefD0nNut7t9bQCVdD0dRIROOgX0OmosYWBBgyA9Q+eG
 KNlikmgxj1fUXm8WCvKhZu98Dm9+8WUVIapOEUyyTIcw6eFNFHIzg292AzKOMwWq
 sC8OBjIs9MzyXPbwcO8XxJBV5pNEZSYkxnZ0sYwknM3YZLla31/67YHYg1sSF+gS
 HeQceQ27hGiaJzVfVrM7/x8O5uGHXm3dmvBMnZADiXMQOWLCUrPum0cB7iwpd6G0
 P6VcGAVIGFzaWBpziIoJjhy8RKN+1Uekr2h/RawtfrBt2a2RkbFE0j0hcfY6YqR+
 VBQpLe+z6X0lf3jnmKRwWMLHqaUC06hAR6mj4YfD7Kw8SmcSsNVo3WCM2nt/vxrS
 UEIbmN3M1I/EhMxWtc2HdVoqXGHFibmWNxX04rAyavFU17T49RDNwyE9HFEleVto
 FzYlGvQJp4894bRtTkQFcWbBDDTFi0/yZoIGJ8vUflt+vTt0Xhrp3r7aZC1wJyQy
 EXHOgZyJJA==
 =G200
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-6.16-20250621' of git://git.kernel.dk/linux

Pull io_uring fix from Jens Axboe:
 "A single fix to hopefully wrap up the saga of receive bundles"

* tag 'io_uring-6.16-20250621' of git://git.kernel.dk/linux:
  io_uring/net: always use current transfer count for buffer put
2025-06-21 08:40:45 -07:00
Jens Axboe 51a4598ad5 io_uring/net: always use current transfer count for buffer put
A previous fix corrected the retry condition for when to continue a
current bundle, but it missed that the current (not the total) transfer
count also applies to the buffer put. If not, then for incrementally
consumed buffer rings repeated completions on the same request may end
up over consuming.

Reported-by: Roy Tang (ErgoniaTrading) <royonia@ergonia.io>
Cc: stable@vger.kernel.org
Fixes: 3a08988123 ("io_uring/net: only retry recv bundle for a full transfer")
Link: https://github.com/axboe/liburing/issues/1423
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-20 08:33:45 -06:00
Linus Torvalds 255da9b8d7 io_uring-6.16-20250619
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmhUFYEQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgplMoD/4u7Om+qQMQDBmOTbLeYZmst3NkZShEo0oj
 fNTlLnYmgeT44tM0LX9dYaLAwWPjCZJQObaxWdWEnlxk4AD1Zgairv54zGRlkJQK
 Sn81QDnDV6B1mW+VuqS24veK4fM9m5h9XQynnVOXRoKvIUPH5AEhRQj/aj1MCdD9
 mLQeE2EeR0mJxZDvLOL1G1RFDJa9EJ7m+eDFy/k3+1RCR3T+XH+YDz3EH2h1+mPd
 5gf3NHv5bCF8CGAzA9UixVN+2VBog0tgY1lL+DRKU9oNtqWXExD4i8dMt17Odg+p
 0DcVW3c1Wts/vOKGXCS0TSBLFM8aGraEb2tf0QIKrNNJQyRfsZpeLgn/hhEb8nTV
 ipJOsTbdgxjqPGvMqrLnhW8oC/RFLFyJbpW9nEkqSQVjncM5KJWxx5sxP08i/Ee/
 6HYkDUM4Z67Eh4bXBqiiqtsuxehM1PxA4i9Cb2obXKyR6IL/D6vhSsBqTexy3a0p
 j8jKrWXDUlQiTAU1A59+aeYKFNyk9Tin8VuQ+hxmWswu2T3VkhQwJTcOK3qNijo1
 ji71zNz/SVVzHwz+AF8dT08c448A8N/Xaaj8aXi+2xhpHCYj6/g2mDOEYa/tysrf
 wVAKUxeRYKTtHQM6LVB8q+VvXOW587D7gjsSETaVTeW0Bgmtbh2IntadxJnWRd5U
 ZiKpERpU6A==
 =cAEE
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-6.16-20250619' of git://git.kernel.dk/linux

Pull io_uring fixes from Jens Axboe:

 - Two fixes for error injection failures. One fixes a task leak issue
   introduced in this merge window, the other an older issue with
   handling allocation of a mapped buffer.

 - Fix for a syzbot issue that triggers a kmalloc warning on attempting
   an allocation that's too large

 - Fix for an error injection failure causing a double put of a task,
   introduced in this merge window

* tag 'io_uring-6.16-20250619' of git://git.kernel.dk/linux:
  io_uring: fix potential page leak in io_sqe_buffer_register()
  io_uring/sqpoll: don't put task_struct on tctx setup failure
  io_uring: remove duplicate io_uring_alloc_task_context() definition
  io_uring: fix task leak issue in io_wq_create()
  io_uring/rsrc: validate buffer count with offset for cloning
2025-06-19 23:25:28 -07:00
Penglei Jiang e1c75831f6 io_uring: fix potential page leak in io_sqe_buffer_register()
If allocation of the 'imu' fails, then the existing pages aren't
unpinned in the error path. This is mostly a theoretical issue,
requiring fault injection to hit.

Move unpin_user_pages() to unified error handling to fix the page leak
issue.
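
A sketch of the unified unwind shape (labels illustrative):

    imu = kvmalloc(struct_size(imu, bvec, nr_pages), GFP_KERNEL);
    if (!imu) {
        ret = -ENOMEM;
        goto done;
    }
    /* ... */
    done:
    if (ret && pages)
        unpin_user_pages(pages, nr_pages);  /* previously skipped when
                                             * the imu allocation failed */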

Fixes: d8c2237d0a ("io_uring: add io_pin_pages() helper")
Signed-off-by: Penglei Jiang <superman.xpt@gmail.com>
Link: https://lore.kernel.org/r/20250617165644.79165-1-superman.xpt@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-18 05:09:46 -06:00
Jens Axboe f2320f1dd6 io_uring/sqpoll: don't put task_struct on tctx setup failure
A recent commit moved the error handling of sqpoll thread and tctx
failures into the thread itself, as part of fixing an issue. However, it
missed that tctx allocation may also fail, and that
io_sq_offload_create() does its own error handling for the task_struct
in that case.

Remove the manual task putting in io_sq_offload_create(), as
io_sq_thread() will notice that the tctx did not get setup and hence it
should put itself and exit.

Reported-by: syzbot+763e12bbf004fb1062e4@syzkaller.appspotmail.com
Fixes: ac0b8b327a ("io_uring: fix use-after-free of sq->thread in __io_uring_show_fdinfo()")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-17 06:43:18 -06:00
Jens Axboe 91a7703a03 io_uring: remove duplicate io_uring_alloc_task_context() definition
This function exists in both tctx.h (where it belongs) and in io_uring.h
as a remnant of before the tctx handling code got split out. Remove the
io_uring.h definition and ensure that sqpoll.c includes the tctx.h
header to get the definition.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-17 06:41:48 -06:00
Penglei Jiang 89465d923b io_uring: fix task leak issue in io_wq_create()
Add the missing put_task_struct() in the error path.

Cc: stable@vger.kernel.org
Fixes: 0f8baa3c98 ("io-wq: fully initialize wqe before calling cpuhp_state_add_instance_nocalls()")
Signed-off-by: Penglei Jiang <superman.xpt@gmail.com>
Link: https://lore.kernel.org/r/20250615163906.2367-1-superman.xpt@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-15 12:58:39 -06:00
Jens Axboe 1d27f11bf0 io_uring/rsrc: validate buffer count with offset for cloning
syzbot reports that it can trigger a WARN_ON() for kmalloc() attempt
that's too big:

WARNING: CPU: 0 PID: 6488 at mm/slub.c:5024 __kvmalloc_node_noprof+0x520/0x640 mm/slub.c:5024
Modules linked in:
CPU: 0 UID: 0 PID: 6488 Comm: syz-executor312 Not tainted 6.15.0-rc7-syzkaller-gd7fa1af5b33e #0 PREEMPT
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025
pstate: 20400005 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : __kvmalloc_node_noprof+0x520/0x640 mm/slub.c:5024
lr : __do_kmalloc_node mm/slub.c:-1 [inline]
lr : __kvmalloc_node_noprof+0x3b4/0x640 mm/slub.c:5012
sp : ffff80009cfd7a90
x29: ffff80009cfd7ac0 x28: ffff0000dd52a120 x27: 0000000000412dc0
x26: 0000000000000178 x25: ffff7000139faf70 x24: 0000000000000000
x23: ffff800082f4cea8 x22: 00000000ffffffff x21: 000000010cd004a8
x20: ffff0000d75816c0 x19: ffff0000dd52a000 x18: 00000000ffffffff
x17: ffff800092f39000 x16: ffff80008adbe9e4 x15: 0000000000000005
x14: 1ffff000139faf1c x13: 0000000000000000 x12: 0000000000000000
x11: ffff7000139faf21 x10: 0000000000000003 x9 : ffff80008f27b938
x8 : 0000000000000002 x7 : 0000000000000000 x6 : 0000000000000000
x5 : 00000000ffffffff x4 : 0000000000400dc0 x3 : 0000000200000000
x2 : 000000010cd004a8 x1 : ffff80008b3ebc40 x0 : 0000000000000001
Call trace:
 __kvmalloc_node_noprof+0x520/0x640 mm/slub.c:5024 (P)
 kvmalloc_array_node_noprof include/linux/slab.h:1065 [inline]
 io_rsrc_data_alloc io_uring/rsrc.c:206 [inline]
 io_clone_buffers io_uring/rsrc.c:1178 [inline]
 io_register_clone_buffers+0x484/0xa14 io_uring/rsrc.c:1287
 __io_uring_register io_uring/register.c:815 [inline]
 __do_sys_io_uring_register io_uring/register.c:926 [inline]
 __se_sys_io_uring_register io_uring/register.c:903 [inline]
 __arm64_sys_io_uring_register+0x42c/0xea8 io_uring/register.c:903
 __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
 invoke_syscall+0x98/0x2b8 arch/arm64/kernel/syscall.c:49
 el0_svc_common+0x130/0x23c arch/arm64/kernel/syscall.c:132
 do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:151
 el0_svc+0x58/0x17c arch/arm64/kernel/entry-common.c:767
 el0t_64_sync_handler+0x78/0x108 arch/arm64/kernel/entry-common.c:786
 el0t_64_sync+0x198/0x19c arch/arm64/kernel/entry.S:600

which is due to offset + buffer_count being too large. The registration
code checks only the total count of buffers, but given that the indexing
is an array, it should also check offset + count. That can't exceed
IORING_MAX_REG_BUFFERS either, as there's no way to reach buffers beyond
that limit.

There's no issue with registering a table this large, outside of the
fact that it's pointless to register buffers that cannot be reached, and
that it can trigger this kmalloc() warning for attempting an allocation
that is too large.

Cc: stable@vger.kernel.org
Fixes: b16e920a19 ("io_uring/rsrc: allow cloning at an offset")
Reported-by: syzbot+cb4bf3cb653be0d25de8@syzkaller.appspotmail.com
Link: https://lore.kernel.org/io-uring/684e77bd.a00a0220.279073.0029.GAE@google.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-15 08:09:14 -06:00
Linus Torvalds 6d13760ea3 io_uring-6.16-20250614
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmhNaVcQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpt2QEACK7lbIuv6R3AIAeYrBliLzYUC8kIZ8DGbT
 4fKN343AqnuCCFrBR7ew4tHaKmQ3KBaMpoO4FkoV++//vWqJfs3cUOIkntb82UY2
 tolnc743QzZFBOmjMP8XhtJ2o/KmQYYCMjteLcVCnqT3IoUfJ2cd0lf37Rd4x1BK
 XJacW231lCgmeBa/s336MEm6HphiokmGISrji0bGBiQqQYmWHiQ/0FnrWAlEpZD8
 mGEA75u4L3JFbJvetfsgvN+ifF+l/l9F92S689gkPkUDaq3b81sUzk0+cBpFGva+
 tIWGM3PzRoDwM+MuUg/R9mCFWP3LIbZCB6Um2I8Ek7AGgPND/3ocQn4RLzR9gTez
 /Q/Z7WBL/xrWkOy5fNgfy7pDqkBgHxmPztlXMUWnd29d2i50Hh6lP+eyg+wNhqVG
 GAO1f4Oholr/KNI5rJzEX0hrC3X9wI551ryCdftTXZqJKKlaEChWBVUEwAvm7ZxE
 Oi7Ni8WC6ZVnij6Gb3thgIbXv1z3XwtcvwHTTH0w3Rf+3Iy+i9dYq//QN46XXmd5
 hglOvHOUpcQE/dtbgW5Uuo0QvBxyljbwmOJNDog69wX5DC4n5wfdr/E5TGzWnwTq
 FrYID16p3gK/F0PbkIwJwMOqxnzMMvtAhZbkAgrrnoI9rLOSt3qZKefajJxYVMid
 FCchOkj21A==
 =79aU
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-6.16-20250614' of git://git.kernel.dk/linux

Pull io_uring fixes from Jens Axboe:

 - Fix for a race between SQPOLL exit and fdinfo reading.

   It's slim and I was only able to reproduce this with an artificial
   delay in the kernel. Followup sparse fix as well to unify the access
   to ->thread.

 - Fix for multiple buffer peeking, avoiding truncation if possible.

 - Run local task_work for IOPOLL reaping when the ring is exiting.

   This currently isn't done due to an assumption that polled IO will
   never need task_work, but a fix on the block side is going to change
   that.

* tag 'io_uring-6.16-20250614' of git://git.kernel.dk/linux:
  io_uring: run local task_work from ring exit IOPOLL reaping
  io_uring/kbuf: don't truncate end buffer for multiple buffer peeks
  io_uring: consistently use rcu semantics with sqpoll thread
  io_uring: fix use-after-free of sq->thread in __io_uring_show_fdinfo()
2025-06-14 08:44:54 -07:00
Jens Axboe b62e0efd8a io_uring: run local task_work from ring exit IOPOLL reaping
In preparation for needing to shift NVMe passthrough to always use
task_work for polled IO completions, ensure that those are suitably
run at exit time. See commit:

9ce6c9875f ("nvme: always punt polled uring_cmd end_io work to task_work")

for details on why that is necessary.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-13 15:26:17 -06:00
Jens Axboe 26ec15e4b0 io_uring/kbuf: don't truncate end buffer for multiple buffer peeks
If peeking a bunch of buffers, normally io_ring_buffers_peek() will
truncate the end buffer. This isn't optimal as presumably more data will
be arriving later, and hence it's better to stop with the last full
buffer rather than truncate the end buffer.

Cc: stable@vger.kernel.org
Fixes: 35c8711c8f ("io_uring/kbuf: add helpers for getting/peeking multiple buffers")
Reported-by: Christian Mazakas <christian.mazakas@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-13 11:01:49 -06:00
Keith Busch c538f400fa io_uring: consistently use rcu semantics with sqpoll thread
The sqpoll thread is dereferenced with rcu read protection in one place,
so it needs to be annotated as an __rcu type, and should consistently
use rcu helpers for access and assignment to make sparse happy.

Since most of the accesses occur under the sqd->lock, we can use
rcu_dereference_protected() without declaring an rcu read section.
Provide a simple helper to get the thread from a locked context.
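
A sketch of such a helper (name assumed):

    static inline struct task_struct *sqpoll_task_locked(struct io_sq_data *sqd)
    {
        /* no rcu_read_lock() needed: writers serialize on sqd->lock */
        return rcu_dereference_protected(sqd->thread,
                                         lockdep_is_held(&sqd->lock));
    }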

Fixes: ac0b8b327a ("io_uring: fix use-after-free of sq->thread in __io_uring_show_fdinfo()")
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20250611205343.1821117-1-kbusch@meta.com
[axboe: fold in fix for register.c]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-12 08:17:09 -06:00
Penglei Jiang ac0b8b327a io_uring: fix use-after-free of sq->thread in __io_uring_show_fdinfo()
syzbot reports:

BUG: KASAN: slab-use-after-free in getrusage+0x1109/0x1a60
Read of size 8 at addr ffff88810de2d2c8 by task a.out/304

CPU: 0 UID: 0 PID: 304 Comm: a.out Not tainted 6.16.0-rc1 #1 PREEMPT(voluntary)
Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0x53/0x70
 print_report+0xd0/0x670
 ? __pfx__raw_spin_lock_irqsave+0x10/0x10
 ? getrusage+0x1109/0x1a60
 kasan_report+0xce/0x100
 ? getrusage+0x1109/0x1a60
 getrusage+0x1109/0x1a60
 ? __pfx_getrusage+0x10/0x10
 __io_uring_show_fdinfo+0x9fe/0x1790
 ? ksys_read+0xf7/0x1c0
 ? do_syscall_64+0xa4/0x260
 ? vsnprintf+0x591/0x1100
 ? __pfx___io_uring_show_fdinfo+0x10/0x10
 ? __pfx_vsnprintf+0x10/0x10
 ? mutex_trylock+0xcf/0x130
 ? __pfx_mutex_trylock+0x10/0x10
 ? __pfx_show_fd_locks+0x10/0x10
 ? io_uring_show_fdinfo+0x57/0x80
 io_uring_show_fdinfo+0x57/0x80
 seq_show+0x38c/0x690
 seq_read_iter+0x3f7/0x1180
 ? inode_set_ctime_current+0x160/0x4b0
 seq_read+0x271/0x3e0
 ? __pfx_seq_read+0x10/0x10
 ? __pfx__raw_spin_lock+0x10/0x10
 ? __mark_inode_dirty+0x402/0x810
 ? selinux_file_permission+0x368/0x500
 ? file_update_time+0x10f/0x160
 vfs_read+0x177/0xa40
 ? __pfx___handle_mm_fault+0x10/0x10
 ? __pfx_vfs_read+0x10/0x10
 ? mutex_lock+0x81/0xe0
 ? __pfx_mutex_lock+0x10/0x10
 ? fdget_pos+0x24d/0x4b0
 ksys_read+0xf7/0x1c0
 ? __pfx_ksys_read+0x10/0x10
 ? do_user_addr_fault+0x43b/0x9c0
 do_syscall_64+0xa4/0x260
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f0f74170fc9
Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 8
RSP: 002b:00007fffece049e8 EFLAGS: 00000206 ORIG_RAX: 0000000000000000
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f0f74170fc9
RDX: 0000000000001000 RSI: 00007fffece049f0 RDI: 0000000000000004
RBP: 00007fffece05ad0 R08: 0000000000000000 R09: 00007fffece04d90
R10: 0000000000000000 R11: 0000000000000206 R12: 00005651720a1100
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
 </TASK>

Allocated by task 298:
 kasan_save_stack+0x33/0x60
 kasan_save_track+0x14/0x30
 __kasan_slab_alloc+0x6e/0x70
 kmem_cache_alloc_node_noprof+0xe8/0x330
 copy_process+0x376/0x5e00
 create_io_thread+0xab/0xf0
 io_sq_offload_create+0x9ed/0xf20
 io_uring_setup+0x12b0/0x1cc0
 do_syscall_64+0xa4/0x260
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

Freed by task 22:
 kasan_save_stack+0x33/0x60
 kasan_save_track+0x14/0x30
 kasan_save_free_info+0x3b/0x60
 __kasan_slab_free+0x37/0x50
 kmem_cache_free+0xc4/0x360
 rcu_core+0x5ff/0x19f0
 handle_softirqs+0x18c/0x530
 run_ksoftirqd+0x20/0x30
 smpboot_thread_fn+0x287/0x6c0
 kthread+0x30d/0x630
 ret_from_fork+0xef/0x1a0
 ret_from_fork_asm+0x1a/0x30

Last potentially related work creation:
 kasan_save_stack+0x33/0x60
 kasan_record_aux_stack+0x8c/0xa0
 __call_rcu_common.constprop.0+0x68/0x940
 __schedule+0xff2/0x2930
 __cond_resched+0x4c/0x80
 mutex_lock+0x5c/0xe0
 io_uring_del_tctx_node+0xe1/0x2b0
 io_uring_clean_tctx+0xb7/0x160
 io_uring_cancel_generic+0x34e/0x760
 do_exit+0x240/0x2350
 do_group_exit+0xab/0x220
 __x64_sys_exit_group+0x39/0x40
 x64_sys_call+0x1243/0x1840
 do_syscall_64+0xa4/0x260
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

The buggy address belongs to the object at ffff88810de2cb00
 which belongs to the cache task_struct of size 3712
The buggy address is located 1992 bytes inside of
 freed 3712-byte region [ffff88810de2cb00, ffff88810de2d980)

which is caused by the task_struct pointed to by sq->thread being
released while it is being used in the function
__io_uring_show_fdinfo(). Holding ctx->uring_lock does not prevent the
release or exit of sq->thread.

Fix this by assigning and looking up ->thread under RCU, and grabbing a
reference to the task_struct. This ensures that it cannot get released
while fdinfo is using it.

Reported-by: syzbot+531502bbbe51d2f769f4@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/682b06a5.a70a0220.3849cf.00b3.GAE@google.com
Fixes: 3fcb9d1720 ("io_uring/sqpoll: statistics of the true utilization of sq threads")
Signed-off-by: Penglei Jiang <superman.xpt@gmail.com>
Link: https://lore.kernel.org/r/20250610171801.70960-1-superman.xpt@gmail.com
[axboe: massage commit message]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-10 11:20:04 -06:00
Jens Axboe dd765ba872
fs/pipe: set FMODE_NOWAIT in create_pipe_files()
Rather than have the caller set the FMODE_NOWAIT flags for both output
files, move it to create_pipe_files() where other f_mode flags are set
anyway with stream_open(). With that, both __do_pipe_flags() and
io_pipe() can remove the manual setting of the NOWAIT flags.

No intended functional changes, just a code cleanup.
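
Roughly, the cleanup amounts to the following (sketch;
create_pipe_files() fills res[0]/res[1] with the read and write ends):

    /* in create_pipe_files(), once both files are set up: */
    res[0]->f_mode |= FMODE_NOWAIT;    /* read side */
    res[1]->f_mode |= FMODE_NOWAIT;    /* write side */

with __do_pipe_flags() and io_pipe() dropping their manual setting of
the same flags.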

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/1f0473f8-69f3-4eb1-aa77-3334c6a71d24@kernel.dk
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-10 13:16:19 +02:00
Linus Torvalds 794a549207 io_uring-6.16-20250606
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmhC8AkQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgprl3D/9i9py58aRdxBx8NthlWSFo7eP3F8IuYSmY
 Vt+Cfo9i2KuwQKfAeCtIYZIiFcNbqgeeMnJcTFPJv8c781b5bgVV9/J2HkQ0XykD
 ex8NdEqXQoaKqFLQWiv0lyEx8E0pMjlbqpSZMHzNacBcADGsWF3RHBYrcMyBsPTC
 UDuDe++Tph5jP2Mkqh73eWAWw2n+hzJ5frm5fuF2ShbNBexp5cq/9opBYeG7DWtz
 +6Ok9zUXdUkVXoNL9iiQa+T5dRSMIBh0tDQ7MT+yvHQ913MtyT9gXNFhfPlthG3U
 o3RiS/MQ8R0k4y2JmLoYwKP1dShmIkVkKXFaFpFTEdcZ2SBslolXCraVZDXx2L1i
 Wb6A1x5HNMgmYgwLmS7I9i/QuEgQqVsKM2ESoNke5tFlyizgzJOUn0czRR2dTMZg
 w10E0eeXIqD4c+x4t0wUQMcm0p6qEW47E8KB/QrmeSCufT7R6BuCUE4RFQKXgcLZ
 qUanG4UyeeBBxS/+vTDlSPuyi41YEmQz8yTIxvwgCtVbAtd10otldEVdYalUXEHO
 JH7K2Dw7lkwtsv/HM72pzvACFOjF37yv0mG1gK8FXWleG8gZQH1P0y7VqIHP+HcN
 HFESAbaaud9TNrCF36QdxiPRKH+jfAbFeAsa97b0pXOTQZ0OUxA7RVzYoN8Z+hh3
 Q/2sdDPjAw==
 =2eUI
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-6.16-20250606' of git://git.kernel.dk/linux

Pull io_uring fixes from Jens Axboe:

 - Fix for a regression introduced in this merge window, where the 'id'
   passed to xa_find() for ifq lookup is uninitialized

 - Fix for zcrx release on registration failure. From 6.15, going to
   stable

 - Tweak for recv bundles, where msg_inq should be > 1 before being used
   to gate a retry event

 - Pavel doesn't want to be a maintainer anymore; remove him from the
   MAINTAINERS entry

 - Limit legacy kbuf registrations to 64k, which matches the range of
   the buffer ID field anyway. It's nonsensical to support more than
   that, and the only purpose doing so serves is to let syzbot trigger
   long exit delays on heavily configured debug kernels

 - Fix for the io_uring futex handling, which got broken for
   FUTEX2_PRIVATE by a generic futex commit adding private hashes

* tag 'io_uring-6.16-20250606' of git://git.kernel.dk/linux:
  io_uring/futex: mark wait requests as inflight
  io_uring/futex: get rid of struct io_futex addr union
  io_uring/kbuf: limit legacy provided buffer lists to USHRT_MAX
  MAINTAINERS: remove myself from io_uring
  io_uring/net: only consider msg_inq if larger than 1
  io_uring/zcrx: fix area release on registration failure
  io_uring/zcrx: init id for xa_find
2025-06-06 13:09:03 -07:00
Jens Axboe 079afb081c io_uring/futex: mark wait requests as inflight
Inflight marking is used so that do_exit() -> io_uring_files_cancel()
will find requests with files that reference an io_uring instance,
so they can get appropriately canceled before the files go away.
However, it's also called before the mm goes away.

Mark futex/futexv wait requests as being inflight, so that
io_uring_files_cancel() will prune them. This ensures that the mm stays
alive, which is important as an exiting mm will also free the futex
private hash buckets. An io_uring futex request with FUTEX2_PRIVATE
set relies on those being alive until the request has completed. A
recent commit added these futex private hashes, which get killed when
the mm goes away.
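
Conceptually the fix is a one-line addition in prep, sketched below
(REQ_F_INFLIGHT is the existing inflight flag; exact placement may
differ):

    int io_futex_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
    {
        /* ... existing field checks ... */

        /* have io_uring_files_cancel() cancel this request at exit,
         * before the mm and its futex private hash go away */
        req->flags |= REQ_F_INFLIGHT;
        return 0;
    }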

Fixes: 80367ad01d ("futex: Add basic infrastructure for local task local hash")
Link: https://lore.kernel.org/io-uring/38053.1749045482@localhost/
Reported-by: Robert Morris <rtm@csail.mit.edu>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-04 10:50:14 -06:00
Jens Axboe 6a8118a77e io_uring/futex: get rid of struct io_futex addr union
Rather than use a union of a u32 and struct futex_waitv user address,
consolidate it into a single void __user pointer instead. This also
simplifies prep, as the store happens directly to the member that will
be used.

No functional changes in this patch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-04 10:31:31 -06:00
Jens Axboe 607d09d1a0 io_uring/kbuf: limit legacy provided buffer lists to USHRT_MAX
The buffer ID for a provided buffer is an unsigned short, and hence
only 64k buffers can be added to any given buffer list before BIDs
start duplicating. Cap the legacy provided buffers at 64k in the list.
This is mostly to prevent silly stall reports from syzbot, which
likes to dump tons of buffers into a list and then have kernels with
lockdep and kasan churning through them and hitting long wait times
for buffer pruning at ring exit time.
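
The cap reduces to a bounds check in the legacy provide-buffers path,
sketched here (the nbufs counter is an assumed field name):

    /* bid is a u16, so a list can never hold more than USHRT_MAX
     * distinct buffer IDs; refuse additions beyond that */
    if (bl->nbufs >= USHRT_MAX)
        return -EOVERFLOW;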

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-03 15:32:44 -06:00
Linus Torvalds 1b98f357da Networking changes for 6.16.
Core
 ----
 
  - Implement the Device Memory TCP transmit path, allowing zero-copy
    data transmission on top of TCP from e.g. GPU memory to the wire.
 
  - Move all the IPv6 routing tables management outside the RTNL scope,
    under its own lock and RCU. The route control path is now 3x
    faster.
 
  - Convert queue related netlink ops to instance lock, reducing
    again the scope of the RTNL lock. This improves the control plane
    scalability.
 
  - Refactor the software crc32c implementation, removing unneeded
    abstraction layers and improving significantly the related
    micro-benchmarks.
 
  - Optimize the GRO engine for UDP-tunneled traffic, for a 10%
    performance improvement in related stream tests.
 
  - Cover more per-CPU storage with local nested BH locking; this is a
    prep work to remove the current per-CPU lock in local_bh_disable()
    on PREEMPT_RT.
 
  - Introduce and use nlmsg_payload helper, combining buffer bounds
    verification with accessing payload carried by netlink messages.
 
 Netfilter
 ---------
 
  - Rewrite the procfs conntrack table implementation, improving
    considerably the dump performance. A lot of user-space tools
    still use this interface.
 
  - Implement support for wildcard netdevice in netdev basechain
    and flowtables.
 
  - Integrate conntrack information into nft trace infrastructure.
 
  - Export set count and backend name to userspace, for better
    introspection.
 
 BPF
 ---
 
  - BPF qdisc support: BPF-qdisc can be implemented with BPF struct_ops
    programs and can be controlled in similar way to traditional qdiscs
    using the "tc qdisc" command.
 
  - Refactor the UDP socket iterator, addressing long standing issues
    WRT duplicate hits or missed sockets.
 
 Protocols
 ---------
 
  - Improve TCP receive buffer auto-tuning and increase the default
    upper bound for the receive buffer; overall this improves the single
    flow maximum throughput on a 200Gbps link by over 60%.
 
  - Add AFS GSSAPI security class to AF_RXRPC; it provides transport
    security for connections to the AFS fileserver and VL server.
 
  - Improve TCP multipath routing, so that the source address always
    matches the nexthop device.
 
  - Introduce SO_PASSRIGHTS for AF_UNIX, to allow disabling SCM_RIGHTS,
    and thus preventing DoS caused by passing around problematic FDs.
 
  - Retire DCCP socket. DCCP only receives updates for bugs, and major
    distros disable it by default. Its removal allows for better
    organisation of TCP fields to reduce the number of cache lines hit
    in the fast path.
 
  - Extend TCP drop-reason support to cover PAWS checks.
 
 Driver API
 ----------
 
  - Reorganize PTP ioctl flag support to require an explicit opt-in for
    the drivers, avoiding the problem of drivers not rejecting new
    unsupported flags.
 
  - Converted several device drivers to timestamping APIs.
 
  - Introduce per-PHY ethtool dump helpers, improving the support for
    dump operations targeting PHYs.
 
 Tests and tooling
 -----------------
 
  - Add support for classic netlink in user space C codegen, so that
    ynl-c can now read, create and modify links, routes addresses and
    qdisc layer configuration.
 
  - Add ynl sub-types for binary attributes, allowing ynl-c to output
    known struct instead of raw binary data, clarifying the classic
    netlink output.
 
  - Extend MPTCP selftests to improve the code-coverage.
 
  - Add tests for XDP tail adjustment in AF_XDP.
 
 New hardware / drivers
 ----------------------
 
  - OpenVPN virtual driver: offload OpenVPN data channels processing
    to the kernel-space, increasing the data transfer throughput WRT
    the user-space implementation.
 
  - Renesas glue driver for the gigabit ethernet RZ/V2H(P) SoC.
 
  - Broadcom asp-v3.0 ethernet driver.
 
  - AMD Renoir ethernet device.
 
  - RealTek MT9888 2.5G ethernet PHY driver.
 
  - Aeonsemi 10G C45 PHYs driver.
 
 Drivers
 -------
 
  - Ethernet high-speed NICs:
    - nVidia/Mellanox (mlx5):
      - refactor the steering table handling to significantly reduce
        the amount of memory used
      - add support for complex matches in H/W flow steering
      - improve flow steering error handling
      - convert to netdev instance locking
    - Intel (100G, ice, igb, ixgbe, idpf):
      - ice: add switchdev support for LLDP traffic over VF
      - ixgbe: add firmware manipulation and regions devlink support
      - igb: introduce support for frame transmission preemption
      - igb: add persistent NAPI configuration
      - idpf: introduce RDMA support
      - idpf: add initial PTP support
    - Meta (fbnic):
      - extend hardware stats coverage
      - add devlink dev flash support
    - Broadcom (bnxt):
      - add support for RX-side device memory TCP
    - Wangxun (txgbe):
      - implement support for udp tunnel offload
      - complete PTP and SRIOV support for AML 25G/10G devices
 
  - Ethernet NICs embedded and virtual:
    - Google (gve):
      - add device memory TCP TX support
    - Amazon (ena):
      - support persistent per-NAPI config
    - Airoha:
      - add H/W support for L2 traffic offload
      - add per flow stats for flow offloading
    - RealTek (rtl8211): add support for WoL magic packet
    - Synopsys (stmmac):
      - dwmac-socfpga 1000BaseX support
      - add Loongson-2K3000 support
      - introduce support for hardware-accelerated VLAN stripping
    - Broadcom (bcmgenet):
      - expose more H/W stats
    - Freescale (enetc, dpaa2-eth):
      - enetc: add MAC filter, VLAN filter RSS and loopback support
      - dpaa2-eth: convert to H/W timestamping APIs
    - vxlan: convert FDB table to rhashtable, for better scalability
    - veth: apply qdisc backpressure on full ring to reduce TX drops
 
  - Ethernet switches:
    - Microchip (KSZ88x3): add ETS scheduler support
 
  - Ethernet PHYs:
    - RealTek (rtl8211):
      - add support for WoL magic packet
      - add support for PHY LEDs
 
  - CAN:
    - Add RZ/G3E CANFD support to the rcar_canfd driver.
    - Preparatory work for CAN-XL support.
    - Add self-tests framework with support for CAN physical interfaces.
 
  - WiFi:
    - mac80211:
      - scan improvements with multi-link operation (MLO)
    - Qualcomm (ath12k):
      - enable AHB support for IPQ5332
      - add monitor interface support to QCN9274
      - add multi-link operation support to WCN7850
      - add 802.11d scan offload support to WCN7850
      - monitor mode for WCN7850, better 6 GHz regulatory
    - Qualcomm (ath11k):
      - restore hibernation support
    - MediaTek (mt76):
      - WiFi-7 improvements
      - implement support for mt7990
    - Intel (iwlwifi):
      - enhanced multi-link single-radio (EMLSR) support on 5 GHz links
      - rework device configuration
    - RealTek (rtw88):
      - improve throughput for RTL8814AU
    - RealTek (rtw89):
      - add multi-link operation support
      - STA/P2P concurrency improvements
      - support different SAR configs by antenna
 
  - Bluetooth:
    - introduce HCI Driver protocol
    - btintel_pcie: do not generate coredump for diagnostic events
    - btusb: add HCI Drv commands for configuring altsetting
    - btusb: add RTL8851BE device 0x0bda:0xb850
    - btusb: add new VID/PID 13d3/3584 for MT7922
    - btusb: add new VID/PID 13d3/3630 and 13d3/3613 for MT7925
    - btnxpuart: implement host-wakeup feature
 
 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmg3D64SHHBhYmVuaUBy
 ZWRoYXQuY29tAAoJECkkeY3MjxOkcIsQAK2eEc+BxQer975wzvtMg6gF9eoex4a+
 rZ7jxfDzDtNvTauoQsrpehDZp0FnySaVGCU36lHGB2OvDnhCpPc5hXzKDWQpOuqQ
 SHrGG3/6FTbdTG/HfHUcbNyrUzIf53SADSObiQ3qg4gyEQ3sCpcOKtVtMcU8rvsY
 /HqMnsJWFaROUMjMtCcnUSgjmeY9kBvha3sTXUqgeRugEOCvZD7z4rpqFIcQqHw7
 e2Fi8dwIXEYNxqPp6MRq2qdyUTewCRruE8ZIMAFuhtfYeMElUZMPlqlMENX3AzTQ
 cr0EgwcFOUxRA7oZRxhoBNBsVXavtSpQr4ZDoWplxP4aQ37n5tc1E9Q72axpB/Og
 FbJRl6GvWYnCd8071BczgmfHlKaTAigPvt2Z4r6JjM5I/Bij/IZ3k+On1OTuOAj/
 EqfFkdZ0a5cfKrwUMP+oSGtSAywkMVUtnIKJlZeRbjSj2432sCfe2jVAlS8ELM43
 3LUgXYrAKtA87g171LlsRu5EEpI5QmqPb+i5LpPlEXe2TJEgPisyfecJ3NafF/2+
 j575lm+TFNm9NTNhGGjDPEvw0djI5wSGGMe9J4gC74eWi6s5t6C4cuUf84TKWdwR
 x+9H0IB7rfFncAwXHJuUUtzd+fPHaYzs5dDGbSgMQOXr1cr1wlubCK8mQ1r/Wt/a
 3GjFIOQKW2Q5
 =t/Tz
 -----END PGP SIGNATURE-----

Merge tag 'net-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Paolo Abeni:
 "Core:

   - Implement the Device Memory TCP transmit path, allowing zero-copy
     data transmission on top of TCP from e.g. GPU memory to the wire.

   - Move all the IPv6 routing tables management outside the RTNL scope,
     under its own lock and RCU. The route control path is now 3x
     faster.

   - Convert queue related netlink ops to instance lock, reducing again
     the scope of the RTNL lock. This improves the control plane
     scalability.

   - Refactor the software crc32c implementation, removing unneeded
     abstraction layers and improving significantly the related
     micro-benchmarks.

   - Optimize the GRO engine for UDP-tunneled traffic, for a 10%
     performance improvement in related stream tests.

   - Cover more per-CPU storage with local nested BH locking; this is a
     prep work to remove the current per-CPU lock in local_bh_disable()
     on PREEMPT_RT.

   - Introduce and use nlmsg_payload helper, combining buffer bounds
     verification with accessing payload carried by netlink messages.

  Netfilter:

   - Rewrite the procfs conntrack table implementation, improving
     considerably the dump performance. A lot of user-space tools still
     use this interface.

   - Implement support for wildcard netdevice in netdev basechain and
     flowtables.

   - Integrate conntrack information into nft trace infrastructure.

   - Export set count and backend name to userspace, for better
     introspection.

  BPF:

   - BPF qdisc support: BPF-qdisc can be implemented with BPF struct_ops
     programs and can be controlled in similar way to traditional qdiscs
     using the "tc qdisc" command.

   - Refactor the UDP socket iterator, addressing long standing issues
     WRT duplicate hits or missed sockets.

  Protocols:

   - Improve TCP receive buffer auto-tuning and increase the default
     upper bound for the receive buffer; overall this improves the
     single flow maximum throughput on a 200Gbps link by over 60%.

   - Add AFS GSSAPI security class to AF_RXRPC; it provides transport
     security for connections to the AFS fileserver and VL server.

   - Improve TCP multipath routing, so that the source address always
     matches the nexthop device.

   - Introduce SO_PASSRIGHTS for AF_UNIX, to allow disabling SCM_RIGHTS,
     and thus preventing DoS caused by passing around problematic FDs.

   - Retire DCCP socket. DCCP only receives updates for bugs, and major
     distros disable it by default. Its removal allows for better
     organisation of TCP fields to reduce the number of cache lines hit
     in the fast path.

   - Extend TCP drop-reason support to cover PAWS checks.

  Driver API:

   - Reorganize PTP ioctl flag support to require an explicit opt-in for
     the drivers, avoiding the problem of drivers not rejecting new
     unsupported flags.

   - Converted several device drivers to timestamping APIs.

   - Introduce per-PHY ethtool dump helpers, improving the support for
     dump operations targeting PHYs.

  Tests and tooling:

   - Add support for classic netlink in user space C codegen, so that
     ynl-c can now read, create and modify links, routes addresses and
     qdisc layer configuration.

   - Add ynl sub-types for binary attributes, allowing ynl-c to output
     known struct instead of raw binary data, clarifying the classic
     netlink output.

   - Extend MPTCP selftests to improve the code-coverage.

   - Add tests for XDP tail adjustment in AF_XDP.

  New hardware / drivers:

   - OpenVPN virtual driver: offload OpenVPN data channels processing to
     the kernel-space, increasing the data transfer throughput WRT the
     user-space implementation.

   - Renesas glue driver for the gigabit ethernet RZ/V2H(P) SoC.

   - Broadcom asp-v3.0 ethernet driver.

   - AMD Renoir ethernet device.

   - RealTek MT9888 2.5G ethernet PHY driver.

   - Aeonsemi 10G C45 PHYs driver.

  Drivers:

   - Ethernet high-speed NICs:
       - nVidia/Mellanox (mlx5):
           - refactor the steering table handling to significantly
             reduce the amount of memory used
           - add support for complex matches in H/W flow steering
           - improve flow steering error handling
           - convert to netdev instance locking
       - Intel (100G, ice, igb, ixgbe, idpf):
           - ice: add switchdev support for LLDP traffic over VF
           - ixgbe: add firmware manipulation and regions devlink support
           - igb: introduce support for frame transmission preemption
           - igb: add persistent NAPI configuration
           - idpf: introduce RDMA support
           - idpf: add initial PTP support
       - Meta (fbnic):
           - extend hardware stats coverage
           - add devlink dev flash support
       - Broadcom (bnxt):
           - add support for RX-side device memory TCP
       - Wangxun (txgbe):
           - implement support for udp tunnel offload
           - complete PTP and SRIOV support for AML 25G/10G devices

   - Ethernet NICs embedded and virtual:
       - Google (gve):
           - add device memory TCP TX support
       - Amazon (ena):
           - support persistent per-NAPI config
       - Airoha:
           - add H/W support for L2 traffic offload
           - add per flow stats for flow offloading
       - RealTek (rtl8211): add support for WoL magic packet
       - Synopsys (stmmac):
           - dwmac-socfpga 1000BaseX support
           - add Loongson-2K3000 support
           - introduce support for hardware-accelerated VLAN stripping
       - Broadcom (bcmgenet):
           - expose more H/W stats
       - Freescale (enetc, dpaa2-eth):
           - enetc: add MAC filter, VLAN filter RSS and loopback support
           - dpaa2-eth: convert to H/W timestamping APIs
       - vxlan: convert FDB table to rhashtable, for better scalability
       - veth: apply qdisc backpressure on full ring to reduce TX drops

   - Ethernet switches:
       - Microchip (KSZ88x3): add ETS scheduler support

   - Ethernet PHYs:
       - RealTek (rtl8211):
           - add support for WoL magic packet
           - add support for PHY LEDs

   - CAN:
       - Add RZ/G3E CANFD support to the rcar_canfd driver.
       - Preparatory work for CAN-XL support.
       - Add self-tests framework with support for CAN physical interfaces.

   - WiFi:
       - mac80211:
           - scan improvements with multi-link operation (MLO)
       - Qualcomm (ath12k):
           - enable AHB support for IPQ5332
           - add monitor interface support to QCN9274
           - add multi-link operation support to WCN7850
           - add 802.11d scan offload support to WCN7850
           - monitor mode for WCN7850, better 6 GHz regulatory
       - Qualcomm (ath11k):
           - restore hibernation support
       - MediaTek (mt76):
           - WiFi-7 improvements
           - implement support for mt7990
       - Intel (iwlwifi):
           - enhanced multi-link single-radio (EMLSR) support on 5 GHz links
           - rework device configuration
       - RealTek (rtw88):
           - improve throughput for RTL8814AU
       - RealTek (rtw89):
           - add multi-link operation support
           - STA/P2P concurrency improvements
           - support different SAR configs by antenna

   - Bluetooth:
       - introduce HCI Driver protocol
       - btintel_pcie: do not generate coredump for diagnostic events
       - btusb: add HCI Drv commands for configuring altsetting
       - btusb: add RTL8851BE device 0x0bda:0xb850
       - btusb: add new VID/PID 13d3/3584 for MT7922
       - btusb: add new VID/PID 13d3/3630 and 13d3/3613 for MT7925
       - btnxpuart: implement host-wakeup feature"

* tag 'net-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1611 commits)
  selftests/bpf: Fix bpf selftest build warning
  selftests: netfilter: Fix skip of wildcard interface test
  net: phy: mscc: Stop clearing the UDPv4 checksum for L2 frames
  net: openvswitch: Fix the dead loop of MPLS parse
  calipso: Don't call calipso functions for AF_INET sk.
  selftests/tc-testing: Add a test for HFSC eltree double add with reentrant enqueue behaviour on netem
  net_sched: hfsc: Address reentrant enqueue adding class to eltree twice
  octeontx2-pf: QOS: Refactor TC_HTB_LEAF_DEL_LAST callback
  octeontx2-pf: QOS: Perform cache sync on send queue teardown
  net: mana: Add support for Multi Vports on Bare metal
  net: devmem: ncdevmem: remove unused variable
  net: devmem: ksft: upgrade rx test to send 1K data
  net: devmem: ksft: add 5 tuple FS support
  net: devmem: ksft: add exit_wait to make rx test pass
  net: devmem: ksft: add ipv4 support
  net: devmem: preserve sockc_err
  page_pool: fix ugly page_pool formatting
  net: devmem: move list_add to net_devmem_bind_dmabuf.
  selftests: netfilter: nft_queue.sh: include file transfer duration in log message
  net: phy: mscc: Fix memory leak when using one step timestamping
  ...
2025-05-28 15:24:36 -07:00
Jens Axboe 2c7f023219 io_uring/net: only consider msg_inq if larger than 1
Currently retry and general validity of msg_inq is gated on it being
larger than zero, but it's entirely possible for this to be slightly
inaccurate. In particular, if FIN is received, it'll return 1.

Just use larger than 1 as the check. This covers both the FIN case, and
at the same time, it doesn't make much sense to retry a recv immediately
if there's just a single byte of valid data in the socket.

Still flag SOCK_NONEMPTY when msg_inq is larger than 0, as an app may
use that for the final receive.
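
The gating change, sketched (surrounding retry logic elided; the retry
marker name here is hypothetical):

    /* was: msg_inq > 0 -- but 1 can simply mean a pending FIN, and
     * retrying for a single byte isn't worthwhile anyway */
    if (kmsg->msg_inq > 1)
        sr->flags |= IORING_RECV_RETRY;   /* hypothetical flag name */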

Cc: stable@vger.kernel.org
Reported-by: Christian Mazakas <christian.mazakas@gmail.com>
Fixes: 7c71a0af81 ("io_uring/net: improve recv bundles")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-28 13:50:19 -06:00
Pavel Begunkov 0ec33c81d9 io_uring/zcrx: fix area release on registration failure
On area registration failure there might be no ifq set and it's not safe
to access area->ifq in the release path without checking it first.
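
A sketch of the defensive release path (helper names follow the zcrx
code but are assumptions here):

    static void io_zcrx_free_area(struct io_zcrx_area *area)
    {
        /* registration may fail before area->ifq is assigned */
        if (area->ifq)
            io_zcrx_unmap_area(area->ifq, area);
        /* ... free pages, niovs, and the area itself ... */
    }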

Cc: stable@vger.kernel.org
Fixes: f12ecf5e1c ("io_uring/zcrx: fix late dma unmap for a dead dev")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/bc02878678a5fec28bc77d33355cdba735418484.1748365640.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-27 12:56:16 -06:00
Pavel Begunkov eda4623cf9 io_uring/zcrx: init id for xa_find
xa_find() interprets id as the lower bound and thus expects it initialised.
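
A minimal sketch of the fix (xarray and variable names assumed):

    unsigned long id = 0;    /* was: uninitialised stack garbage */
    struct io_zcrx_ifq *ifq;

    /* xa_find() searches upward starting at *id */
    ifq = xa_find(&ctx->zcrx_ctxs, &id, ULONG_MAX, XA_PRESENT);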

Reported-by: syzbot+c3ff04150c30d3df0f57@syzkaller.appspotmail.com
Fixes: 76f1cc98b2 ("io_uring/zcrx: add support for multiple ifqs")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/faea44ef63131e6968f635e1b6b7ca6056f1f533.1748359655.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-27 10:09:15 -06:00
Linus Torvalds b3570b00dc Locking changes for v6.16:
Futexes:
 
    - Add support for task local hash maps (Sebastian Andrzej Siewior,
      Peter Zijlstra)
 
    - Implement the FUTEX2_NUMA ABI, which extends the futex
      interface to be NUMA-aware. On NUMA-aware futexes a second u32
      word containing the NUMA node is added after the u32 futex value
      word. (Peter Zijlstra)
 
    - Implement the FUTEX2_MPOL ABI, which extends the futex
      interface to be mempolicy-aware as well, to further refine futex
      node mappings and lookups. (Peter Zijlstra)
 
   Locking primitives:
 
    - Misc cleanups (Andy Shevchenko, Borislav Petkov, Colin Ian King,
                     Ingo Molnar, Nam Cao, Peter Zijlstra)
 
   Lockdep:
 
    - Prevent abuse of lockdep subclasses (Waiman Long)
    - Add number of dynamic keys to /proc/lockdep_stats (Waiman Long)
 
 Plus misc cleanups and fixes.
 
 Note that the tree includes the following dependent out-of-subsystem
 changes as well:
 
  - rcuref: Provide rcuref_is_dead()
  - mm: Add vmalloc_huge_node()
  - mm: Add the mmap_read_lock guard to <linux/mmap_lock.h>
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmgy3E8RHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1isNw/9FS6+ZReiV3NLHvhwIw8+6U2vV733wLY+
 mFzDk2CRwv2d6xg+QUrhLNI93i2fZnwNvK1f6LcRZMa1pNmwCcEghKgm0G+fRgbv
 skiGrlkUCoEqsDUxRW++/aTBcMo0vqG3NOObnUOrddG2W9tfrR8jq/EwlzB99dO7
 q8qaBNl9W1vLT3gh9/RPP5uKt0NKIf8ObvsyhWCGaywg81h2lC4AHf0Xlj3ZD95T
 TO5jhUhl/muhYtaqxeYPK0gDtCrgFz8NwZdjKx1nyP7Gbko6+L50AvOVXog0SIAU
 nncftvutGJg2ki7dbSYPDoHQrHO0JsF1vUfVZRjaKFebWpFo2yYdNMbITOeXVhSC
 QSpbH2qvyn21nT/YSj9dottHWBoNYBEgrcSf6DO4g0d8A0Jh7egXjQdA852RpeQ0
 LWGYx4rfiKhnjiXlKKQHrURZkcxxa40o+ls3RfFl2/kWA+7aUybvw6nAeDEkV0oL
 s2U0vZxsY37EPWDm40rTe9r4YpPqcB65i9YIesPzhtbcHJVmN0gts0o5l+x53GhR
 CeftFiiUi2nm6JaT+1wGvBDT3hQ8+NZ8GkPSeA6pEJWE3i4KquZlcBZLOSLZ3k/B
 df58zQi99Yun33is5f1kqDNspqvJOg/1nxUK68PgNSdCMKeuZkJYrcmh/rKNnXSC
 f7M1XHoWFb0=
 =La/x
 -----END PGP SIGNATURE-----

Merge tag 'locking-core-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking updates from Ingo Molnar:
 "Futexes:

   - Add support for task local hash maps (Sebastian Andrzej Siewior,
     Peter Zijlstra)

   - Implement the FUTEX2_NUMA ABI, which extends the futex
     interface to be NUMA-aware. On NUMA-aware futexes a second u32 word
     containing the NUMA node is added after the u32 futex value word
     (Peter Zijlstra)

   - Implement the FUTEX2_MPOL ABI, which extends the futex
     interface to be mempolicy-aware as well, to further refine futex
     node mappings and lookups (Peter Zijlstra)

  Locking primitives:

   - Misc cleanups (Andy Shevchenko, Borislav Petkov, Colin Ian King,
     Ingo Molnar, Nam Cao, Peter Zijlstra)

  Lockdep:

   - Prevent abuse of lockdep subclasses (Waiman Long)

   - Add number of dynamic keys to /proc/lockdep_stats (Waiman Long)

  Plus misc cleanups and fixes"

* tag 'locking-core-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (35 commits)
  selftests/futex: Fix spelling mistake "unitiliazed" -> "uninitialized"
  futex: Correct the kernedoc return value for futex_wait_setup().
  tools headers: Synchronize prctl.h ABI header
  futex: Use RCU_INIT_POINTER() in futex_mm_init().
  selftests/futex: Use TAP output in futex_numa_mpol
  selftests/futex: Use TAP output in futex_priv_hash
  futex: Fix kernel-doc comments
  futex: Relax the rcu_assign_pointer() assignment of mm->futex_phash in futex_mm_init()
  futex: Fix outdated comment in struct restart_block
  locking/lockdep: Add number of dynamic keys to /proc/lockdep_stats
  locking/lockdep: Prevent abuse of lockdep subclass
  locking/lockdep: Move hlock_equal() to the respective #ifdeffery
  futex,selftests: Add another FUTEX2_NUMA selftest
  selftests/futex: Add futex_numa_mpol
  selftests/futex: Add futex_priv_hash
  selftests/futex: Build without headers nonsense
  tools/perf: Allow to select the number of hash buckets
  tools headers: Synchronize prctl.h ABI header
  futex: Implement FUTEX2_MPOL
  futex: Implement FUTEX2_NUMA
  ...
2025-05-26 14:42:07 -07:00
Linus Torvalds 49fffac983 for-6.16/io_uring-20250523
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmgwnDgQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpgHZEADA1ym0ihHRjU2kTlXXOdkLLOl+o1RCHUjr
 KNf6sELGgyDC5FL/hAWdsjonInY4MLbJW0eNHEuuK8iFcn3wSHuHPXhRJXx/4cOs
 GGVLTd+Jm8ih4UL/GeLrBe3ehW9UUOtz1TCYzho0bdXHQWjruCFTqB5OzPQFMGQW
 R/lwXVNfjgGno5JhBnsrwz3ZnAfAnJhxqmc0GFHaa/nVF1OREYW/HS75EPFNiFgp
 Aevilw5QyrA2gDlZ+zCUwaGKAEl32yZCI6LZpI4kMtPK1reEbgFTrzIaCZ/OZCYM
 DVdBVEeuOmcBYIKbitD/+fcLNXHMrSJSWvUSXR4GuRNVkCTIAcEMKM2bX8VY7gmJ
 7ZQIo0EL2mSwmewHIYnvf9w/qrNYR0NyUt2v4U4rA2wj6e5w1EYMriP94wKdBGvD
 RNxja429N3fg3aBIkdQ6iYSVJRgE7DCo7dnKrEqglZPb32LOiNoOoou9shI5tb25
 8X7u0HzbpwKY/XByXZ2IaX7PYK2iFqkJjFYlGehtF97W85LGEvkDFU6fcBdjBO8r
 umgeE5O+lR+cf68JTJ6P34A7bBg71AXO3ytIuWunG56/0yu/FHDCjhBWE5ZjEhGR
 u2YhAGPRDQsJlSlxx8TXoKyYWP55NqdeyxYrmku/fZLn5WNVXOFeRlUDAZsF7mU7
 nuiOt9j4WA==
 =k8SF
 -----END PGP SIGNATURE-----

Merge tag 'for-6.16/io_uring-20250523' of git://git.kernel.dk/linux

Pull io_uring updates from Jens Axboe:

 - Avoid indirect function calls in io-wq for executing and freeing
   work.

   The design of io-wq is such that it can be a generic mechanism, but
   as it's just used by io_uring now, we may as well avoid these
   indirect calls

 - Clean up registered buffers for networking

 - Add support for IORING_OP_PIPE. Pretty straightforward, allows
   creating pipes with io_uring, particularly useful for having these be
   instantiated as direct descriptors

 - Clean up the coalescing support for registered buffers

 - Add support for multiple interface queues for zero-copy rx
   networking. As this feature was merged for 6.15 it supported just a
   single ifq per ring

 - Clean up the eventfd support

 - Add dma-buf support to zero-copy rx

 - Clean up and improve the request draining support

 - Clean up provided buffer support, most notably with an eye toward
   making the legacy support less intrusive

 - Minor fdinfo cleanups, dropping support for dumping what credentials
   are registered

 - Improve support for overflow CQE handling, getting rid of GFP_ATOMIC
   for allocating overflow entries where possible

 - Improve detection of cases where io-wq doesn't need to spawn a new
   worker unnecessarily

 - Various little cleanups

* tag 'for-6.16/io_uring-20250523' of git://git.kernel.dk/linux: (59 commits)
  io_uring/cmd: warn on reg buf imports by ineligible cmds
  io_uring/io-wq: only create a new worker if it can make progress
  io_uring/io-wq: ignore non-busy worker going to sleep
  io_uring/io-wq: move hash helpers to the top
  trace/io_uring: fix io_uring_local_work_run ctx documentation
  io_uring: finish IOU_OK -> IOU_COMPLETE transition
  io_uring: add new helpers for posting overflows
  io_uring: pass in struct io_big_cqe to io_alloc_ocqe()
  io_uring: make io_alloc_ocqe() take a struct io_cqe pointer
  io_uring: split alloc and add of overflow
  io_uring: open code io_req_cqe_overflow()
  io_uring/fdinfo: get rid of dumping credentials
  io_uring/fdinfo: only compile if CONFIG_PROC_FS is set
  io_uring/kbuf: unify legacy buf provision and removal
  io_uring/kbuf: refactor __io_remove_buffers
  io_uring/kbuf: don't compute size twice on prep
  io_uring/kbuf: drop extra vars in io_register_pbuf_ring
  io_uring/kbuf: use mem_is_zero()
  io_uring/kbuf: account ring io_buffer_list memory
  io_uring: drain based on allocated reqs
  ...
2025-05-26 12:13:22 -07:00
Linus Torvalds 6f59de9bc0 for-6.16/block-20250523
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmgwnGYQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpq9aD/4iqOts77xhWWLrOJWkkhOcV5rREeyppq8X
 MKYul9S4cc4Uin9Xou9a+nab31QBQEk3nsN3kX9o3yAXvkh6yUm36HD8qYNW/46q
 IUkwRQQJ0COyTnexMZQNTbZPQDIYcenXmQxOcrEJ5jC1Jcz0sOKHsgekL+ab3kCy
 fLnuz2ozvjGDMala/NmE8fN5qSlj4qQABHgbamwlwfo4aWu07cwfqn5G/FCYJgDO
 xUvsnTVclom2g4G+7eSSvGQI1QyAxl5QpviPnj/TEgfFBFnhbCSoBTEY6ecqhlfW
 6u59MF/Uw8E+weiuGY4L87kDtBhjQs3UMSLxCuwH7MxXb25ff7qB4AIkcFD0kKFH
 3V5NtwqlU7aQT0xOjGxaHhfPwjLD+FVss4ARmuHS09/Kn8egOW9yROPyetnuH84R
 Oz0Ctnt1IPLFjvGeg3+rt9fjjS9jWOXLITb9Q6nX9gnCt7orCwIYke8YCpmnJyhn
 i+fV4CWYIQBBRKxIT0E/GhJxZOmL0JKpomnbpP2dH8npemnsTCuvtfdrK9gfhH2X
 chBVqCPY8MNU5zKfzdEiavPqcm9392lMzOoOXW2pSC1eAKqnAQ86ZT3r7rLntqE8
 75LxHcvaQIsnpyG+YuJVHvoiJ83TbqZNpyHwNaQTYhDmdYpp2d/wTtTQywX4DuXb
 Y6NDJw5+kQ==
 =1PNK
 -----END PGP SIGNATURE-----

Merge tag 'for-6.16/block-20250523' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

 - ublk updates:
      - Add support for updating the size of a ublk instance
      - Zero-copy improvements
      - Auto-registering of buffers for zero-copy
      - Series simplifying and improving GET_DATA and request lookup
      - Series adding quiesce support
      - Lots of selftests additions
      - Various cleanups

 - NVMe updates via Christoph:
      - add per-node DMA pools and use them for PRP/SGL allocations
        (Caleb Sander Mateos, Keith Busch)
      - nvme-fcloop refcounting fixes (Daniel Wagner)
      - support delayed removal of the multipath node and optionally
        support the multipath node for private namespaces (Nilay Shroff)
      - support shared CQs in the PCI endpoint target code (Wilfred
        Mallawa)
      - support admin-queue only authentication (Hannes Reinecke)
      - use the crc32c library instead of the crypto API (Eric Biggers)
      - misc cleanups (Christoph Hellwig, Marcelo Moreira, Hannes
        Reinecke, Leon Romanovsky, Gustavo A. R. Silva)

 - MD updates via Yu:
      - Fix normal IO being starved by sync IO, found by mkfs on
        newly created large raid5, with some clean up patches for bdev
        inflight counters

 - Clean up brd, getting rid of atomic kmaps and bvec poking

 - Add loop driver specifically for zoned IO testing

 - Eliminate blk-rq-qos calls with a static key, if not enabled

 - Improve hctx locking for when a plug has IO for multiple queues
   pending

 - Remove block layer bouncing support, which in turn means we can
   remove the per-node bounce stat as well

 - Improve blk-throttle support

 - Improve delay support for blk-throttle

 - Improve brd discard support

 - Unify IO scheduler switching. This should also fix a bunch of lockdep
   warnings we've been seeing, after enabling lockdep support for queue
   freezing/unfreezing

 - Add support for block write streams via FDP (flexible data placement)
   on NVMe

 - Add a bunch of block helpers, facilitating the removal of a bunch of
   duplicated boilerplate code

 - Remove obsolete BLK_MQ pci and virtio Kconfig options

 - Add atomic/untorn write support to blktrace

 - Various little cleanups and fixes

* tag 'for-6.16/block-20250523' of git://git.kernel.dk/linux: (186 commits)
  selftests: ublk: add test for UBLK_F_QUIESCE
  ublk: add feature UBLK_F_QUIESCE
  selftests: ublk: add test case for UBLK_U_CMD_UPDATE_SIZE
  traceevent/block: Add REQ_ATOMIC flag to block trace events
  ublk: run auto buf unregistering in same io_ring_ctx with registering
  io_uring: add helper io_uring_cmd_ctx_handle()
  ublk: remove io argument from ublk_auto_buf_reg_fallback()
  ublk: handle ublk_set_auto_buf_reg() failure correctly in ublk_fetch()
  selftests: ublk: add test for covering UBLK_AUTO_BUF_REG_FALLBACK
  selftests: ublk: support UBLK_F_AUTO_BUF_REG
  ublk: support UBLK_AUTO_BUF_REG_FALLBACK
  ublk: register buffer to local io_uring with provided buf index via UBLK_F_AUTO_BUF_REG
  ublk: prepare for supporting to register request buffer automatically
  ublk: convert to refcount_t
  selftests: ublk: make IO & device removal test more stressful
  nvme: rename nvme_mpath_shutdown_disk to nvme_mpath_remove_disk
  nvme: introduce multipath_always_on module param
  nvme-multipath: introduce delayed removal of the multipath head node
  nvme-pci: derive and better document max segments limits
  nvme-pci: use struct_size for allocation struct nvme_dev
  ...
2025-05-26 11:39:36 -07:00
Ingo Molnar 94ec70880f Merge branch 'locking/futex' into locking/core, to pick up pending futex changes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-05-25 10:10:08 +02:00
Pavel Begunkov 6faaf6e0fa io_uring/cmd: warn on reg buf imports by ineligible cmds
For IORING_URING_CMD_FIXED-less commands io_uring doesn't pull buf_index
from the sqe, so imports might succeed if the index coincide, e.g. when
it's 0, but otherwise it's error prone. Warn if someone tries to import
without the flag.
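
The added guard, sketched (surrounding import context assumed):

    /* buf_index is only pulled from the sqe for IORING_URING_CMD_FIXED,
     * so imports without the flag only ever worked by coincidence */
    if (WARN_ON_ONCE(!(ioucmd->flags & IORING_URING_CMD_FIXED)))
        return -EINVAL;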

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Link: https://lore.kernel.org/r/a1c2c88e53c3fe96978f23d50c6bc66c2c79c337.1747991070.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-23 06:31:06 -06:00
Jens Axboe 0b2b066f8a io_uring/io-wq: only create a new worker if it can make progress
Hashed work is serialized by io-wq, intended to be used for cases like
serializing buffered writes to a regular file, where the file system
will serialize the workers anyway with a mutex or similar. Since they
would be forcibly serialized and blocked, it's more efficient for io-wq
to handle these individually rather than issue them in parallel.

If a worker is currently handling a hashed work item and gets blocked,
don't create a new worker if the next work item is also hashed and
mapped to the same bucket. That new worker would not be able to make any
progress anyway.
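
The idea, sketched with hypothetical helpers (the real code works off
the io-wq hash bits):

    /* the blocked worker holds this hash bucket; a new worker could
     * not run 'next' anyway, so don't bother creating one */
    if (io_wq_is_hashed(next) &&
        work_hash(next) == work_hash(worker->cur_work))
        return false;   /* skip worker creation */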

Reported-by: Fengnan Chang <changfengnan@bytedance.com>
Reported-by: Diangang Li <lidiangang@bytedance.com>
Link: https://lore.kernel.org/io-uring/20250522090909.73212-1-changfengnan@bytedance.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-23 06:14:10 -06:00
Jens Axboe 8343cae362 io_uring/io-wq: ignore non-busy worker going to sleep
When an io-wq worker goes to sleep, it checks if there's work to do.
If there is, it'll create a new worker. But if this worker is currently
idle, it'll either get woken right back up immediately, or someone
else has already created the necessary worker to handle this work.

Only go through the worker creation logic if the current worker is
currently handling a work item. That means it's being scheduled out as
part of handling that work, not just going to sleep on its own.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-23 06:14:07 -06:00
Jens Axboe e37dfc0530 io_uring/io-wq: move hash helpers to the top
Just in preparation for using them higher up in the function, move
these generic helpers.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-23 06:14:04 -06:00
Jakub Kicinski 33e1b1b399 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.15-rc8).

Conflicts:
  80f2ab46c2 ("irdma: free iwdev->rf after removing MSI-X")
  4bcc063939 ("ice, irdma: fix an off by one in error handling code")
  c24a65b6a2 ("iidc/ice/irdma: Update IDC to support multiple consumers")
https://lore.kernel.org/20250513130630.280ee6c5@canb.auug.org.au

No extra adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-22 09:42:41 -07:00
Jens Axboe 3a08988123 io_uring/net: only retry recv bundle for a full transfer
If a shorter than assumed transfer was seen, a partial buffer will have
been filled. For that case it isn't sane to attempt to fill more into
the bundle before posting a completion, as that will cause a gap in
the received data.

Check if the iterator has hit zero, and only allow a bundle operation
to continue if that is the case.

Also ensure that for putting finished buffers, only the current transfer
is accounted. Otherwise too many buffers may be put for a short transfer.
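
The core check, sketched (msghdr location per io_async_msghdr;
surrounding logic elided):

    /* anything left in the iterator means the last recv was short;
     * completing now avoids a gap in the bundled data */
    if (iov_iter_count(&kmsg->msg.msg_iter))
        return true;    /* finish the request, don't continue bundle */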

Link: https://github.com/axboe/liburing/issues/1409
Cc: stable@vger.kernel.org
Fixes: 7c71a0af81 ("io_uring/net: improve recv bundles")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-21 19:24:18 -06:00
Jens Axboe 8bb9d6ccd3 io_uring: finish IOU_OK -> IOU_COMPLETE transition
IOU_COMPLETE is more descriptive, in that it explicitly says that the
return value means "please post a completion for this request". This
patch completes the transition from IOU_OK to IOU_COMPLETE, replacing
existing IOU_OK users.

This is a purely mechanical change.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-21 08:41:16 -06:00
Pavel Begunkov a7d755ed9c io_uring: fix overflow resched cqe reordering
Leaving the CQ critical section in the middle of an overflow flush can
cause cqe reordering, since the cached cq pointers are reset and any
new cqe emitters that might get called in between are not going to be
forced into io_cqe_cache_refill().

Fixes: eac2ca2d68 ("io_uring: check if we need to reschedule during overflow flush")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/90ba817f1a458f091f355f407de1c911d2b93bbf.1747483784.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-21 07:01:54 -06:00
Caleb Sander Mateos f1774d9d4e io_uring/cmd: axe duplicate io_uring_cmd_import_fixed_vec() declaration
io_uring_cmd_import_fixed_vec() is declared in both
include/linux/io_uring/cmd.h and io_uring/uring_cmd.h. The declarations
are identical (if redundant) for CONFIG_IO_URING=y. But if
CONFIG_IO_URING=N, include/linux/io_uring/cmd.h declares the function as
static inline while io_uring/uring_cmd.h declares it as extern. This
causes linker errors if the declaration in io_uring/uring_cmd.h is used.

Remove the declaration in io_uring/uring_cmd.h to avoid linker errors
and prevent the declarations getting out of sync.
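
Reduced to a toy, the conflicting pattern looks like this (simplified
signatures, illustration only):

    /* include/linux/io_uring/cmd.h stub when CONFIG_IO_URING=N */
    static inline int io_uring_cmd_import_fixed_vec(void)
    {
        return -EOPNOTSUPP;
    }

    /* io_uring/uring_cmd.h, unconditional */
    int io_uring_cmd_import_fixed_vec(void);

A translation unit that sees both gets a linkage conflict, and one that
only sees the extern declaration fails to link when the function was
never built.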

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Fixes: ef49027529 ("io_uring/cmd: introduce io_uring_cmd_import_fixed_vec")
Link: https://lore.kernel.org/r/20250520193337.1374509-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-20 14:36:41 -06:00
Jens Axboe f660fd2ca1 io_uring: add new helpers for posting overflows
Add two helpers, one for posting overflows for lockless_cq rings, and
one for non-lockless_cq rings. The former can allocate sanely with
GFP_KERNEL, but needs to grab the completion lock for posting, while the
latter must do non-sleeping allocs as it already holds the completion
lock.

While at it, mark the overflow handling functions as __cold as well, as
they should not generally be called during normal operations of the
ring.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-17 18:47:19 -06:00
Jens Axboe c80bdb1c55 io_uring: pass in struct io_big_cqe to io_alloc_ocqe()
Rather than pass extra1/extra2 separately, just pass in the (now) named
io_big_cqe struct instead. The callers that don't use/support CQE32 will
now just pass a single NULL, rather than two separate mystery zero
values.

Move the clearing of the big_cqe elements into io_alloc_ocqe() as well,
so it can get moved out of the generic code.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-17 18:47:18 -06:00
Jens Axboe 072d37b52c io_uring: make io_alloc_ocqe() take a struct io_cqe pointer
The number of arguments to io_alloc_ocqe() is a bit unwieldy. Make it
take a struct io_cqe pointer rather than three separate CQE args. One
path already has that readily available, add an io_init_cqe() helper for
the remaining two.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-17 18:46:53 -06:00
Jens Axboe 10f466abc4 io_uring: split alloc and add of overflow
Add a new helper, io_alloc_ocqe(), that simply allocates and fills an
overflow entry. Then it can get done outside of the locking section,
and hence use more appropriate gfp_t allocation flags rather than always
default to GFP_ATOMIC.

Inspired by a previous series from Pavel:

https://lore.kernel.org/io-uring/cover.1747209332.git.asml.silence@gmail.com/

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-17 18:46:50 -06:00
Pavel Begunkov 5288b9e28f io_uring: open code io_req_cqe_overflow()
A preparation patch, just open code io_req_cqe_overflow().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-16 12:38:36 -06:00
Jens Axboe 16256648cd io_uring/fdinfo: get rid of dumping credentials
It's a fairly obscure feature, and registered credentials are mostly a
static thing anyway. Don't bother including code to dump the
personality indices.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-16 12:33:28 -06:00
Jens Axboe 9a10926627 io_uring/fdinfo: only compile if CONFIG_PROC_FS is set
Rather than wrap fdinfo.c in one big if, handle it on the Makefile
side instead. io_uring.c already conditionally sets fops->fdinfo()
anyway.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-16 12:33:02 -06:00
Jens Axboe 3de7361f7c Merge branch 'io_uring-6.15' into for-6.16/io_uring
Merge in 6.15 io_uring fixes, mostly so that the fdinfo changes can
get easily extended without causing merge conflicts.

* io_uring-6.15:
  io_uring/fdinfo: grab ctx->uring_lock around io_uring_show_fdinfo()
  io_uring/memmap: don't use page_address() on a highmem page
  io_uring/uring_cmd: fix hybrid polling initialization issue
  io_uring/sqpoll: Increase task_work submission batch size
  io_uring: ensure deferred completions are flushed for multishot
  io_uring: always arm linked timeouts prior to issue
  io_uring/fdinfo: annotate racy sq/cq head/tail reads
  io_uring: fix 'sync' handling of io_fallback_tw()
  io_uring: don't duplicate flushing in io_req_post_cqe
2025-05-16 12:31:19 -06:00
Jakub Kicinski bebd7b2626 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.15-rc7).

Conflicts:

tools/testing/selftests/drivers/net/hw/ncdevmem.c
  97c4e094a4 ("tests/ncdevmem: Fix double-free of queue array")
  2f1a805f32 ("selftests: ncdevmem: Implement devmem TCP TX")
https://lore.kernel.org/20250514122900.1e77d62d@canb.auug.org.au

Adjacent changes:

net/core/devmem.c
net/core/devmem.h
  0afc44d8cd ("net: devmem: fix kernel panic when netlink socket close after module unload")
  bd61848900 ("net: devmem: Implement TX path")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-15 11:28:30 -07:00
Jens Axboe d871198ee4 io_uring/fdinfo: grab ctx->uring_lock around io_uring_show_fdinfo()
Not everything requires locking in there, which is why the 'has_lock'
variable exists. But enough does that it's a bit unwieldy to manage.
Wrap the whole thing in a ->uring_lock trylock, and just return
with no output if we fail to grab it. The existing trylock() will
already have greatly diminished utility/output for the failure case.

This fixes an issue with reading the SQE fields, if the ring is being
actively resized at the same time.
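
The wrapper then reduces to something like this (sketch; argument order
of the inner helper assumed):

    void io_uring_show_fdinfo(struct seq_file *m, struct file *file)
    {
        struct io_ring_ctx *ctx = file->private_data;

        /* fdinfo is best-effort: if the ring is busy, e.g. being
         * resized, emit nothing rather than race with it */
        if (!mutex_trylock(&ctx->uring_lock))
            return;
        __io_uring_show_fdinfo(ctx, m);
        mutex_unlock(&ctx->uring_lock);
    }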

Reported-by: Jann Horn <jannh@google.com>
Fixes: 79cfe9e59c ("io_uring/register: add IORING_REGISTER_RESIZE_RINGS")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-14 07:15:28 -06:00
Pavel Begunkov 2b61bb1d9a io_uring/kbuf: unify legacy buf provision and removal
Combine IORING_OP_PROVIDE_BUFFERS and IORING_OP_REMOVE_BUFFERS
->issue(), so that we can deduplicate ring locking and list lookups.
This way we further reduce code for legacy provided buffers. Locking is
also separated from buffer related handling, which makes it a bit
simpler with label jumps.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f61af131622ad4337c2fb9f7c453d5b0102c7b90.1747150490.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-13 14:45:55 -06:00
Pavel Begunkov c724e80123 io_uring/kbuf: refactor __io_remove_buffers
__io_remove_buffers is used for two purposes. The first is removing
buffers for non ring based lists, which implies that it can be called
multiple times for the same list. The second is destroying lists, which
is not perfectly reentrant for ring based lists.

It's confusing, so just have a helper for the legacy pbuf buffer
removal, make sure it's not called for ring pbuf, and open code all ring
pbuf destruction into io_put_bl().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0ae416b099d311ad23f285cea02f2c94c8ae9a6c.1747150490.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-13 14:45:55 -06:00
Pavel Begunkov 52a05d0cf8 io_uring/kbuf: don't compute size twice on prep
The size in prep is calculated by io_provide_buffers_prep(), so remove
the recomputation a few lines after.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7c97206561b74fce245cb22449c6082d2e066844.1747150490.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-13 14:45:55 -06:00
Pavel Begunkov 4e9fda29d6 io_uring/kbuf: drop extra vars in io_register_pbuf_ring
The bl and free_bl variables in io_register_pbuf_ring() always point to
the same list since we started reallocating the pre-existing list. Drop
free_bl.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d45c3342d74c9030f99376c777a4b3d59089074d.1747150490.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-13 14:45:55 -06:00
Pavel Begunkov 1724849072 io_uring/kbuf: use mem_is_zero()
Make use of mem_is_zero() for reserved fields checking.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/11fe27b7a831329bcdb4ea087317ef123ba7c171.1747150490.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-13 14:45:55 -06:00
Pavel Begunkov 475a8d3037 io_uring/kbuf: account ring io_buffer_list memory
Follow the non-ringed pbuf struct io_buffer_list allocations and account
them against the memcg. There is a low chance of that being an actual
problem, as ring provided buffers should either pin user memory or
allocate it, which is already accounted.
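
In practice this is a gfp flags change, e.g.:

    /* was GFP_KERNEL; GFP_KERNEL_ACCOUNT (__GFP_ACCOUNT) charges the
     * allocation to the owning task's memcg */
    bl = kzalloc(sizeof(*bl), GFP_KERNEL_ACCOUNT);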

Cc: stable@vger.kernel.org # 6.1
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3985218b50d341273cafff7234e1a7e6d0db9808.1747150490.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-13 14:45:47 -06:00
Mina Almasry bd61848900 net: devmem: Implement TX path
Augment dmabuf binding to be able to handle TX. In addition to all the
RX binding state, we also create the tx_vec needed for the TX path.

Provide API for sendmsg to be able to send dmabufs bound to this device:

- Provide a new dmabuf_tx_cmsg which includes the dmabuf to send from.
- MSG_ZEROCOPY with SCM_DEVMEM_DMABUF cmsg indicates send from dma-buf.

Devmem is uncopyable, so piggyback off the existing MSG_ZEROCOPY
implementation, while disabling instances where MSG_ZEROCOPY falls back
to copying.

We additionally pipe the binding down to the new
zerocopy_fill_skb_from_devmem which fills a TX skb with net_iov netmems
instead of the traditional page netmems.

We also special case skb_frag_dma_map to return the dma-address of these
dmabuf net_iovs instead of attempting to map pages.

The TX path may release the dmabuf in a context where we cannot wait.
This happens when the user unbinds a TX dmabuf while there are still
references to its netmems in the TX path. In that case, the netmems will
be put_netmem'd from a context where we can't unmap the dmabuf. Resolve
this by deferring __net_devmem_dmabuf_binding_free() to a workqueue via
schedule_work().

Based on work by Stanislav Fomichev <sdf@fomichev.me>. A lot of the meat
of the implementation came from devmem TCP RFC v1[1], which included the
TX path, but Stan did all the rebasing on top of netmem/net_iov.

Cc: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250508004830.4100853-5-almasrymina@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-13 11:12:48 +02:00
Mina Almasry 03e96b8c11 netmem: add niov->type attribute to distinguish different net_iov types
Later patches in the series add TX net_iovs where there is no pp
associated, so we can't rely on niov->pp->mp_ops to tell what the type
of the net_iov is.

Add a type enum to the net_iov which tells us the net_iov type.

Signed-off-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20250508004830.4100853-2-almasrymina@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-13 11:12:48 +02:00
Jens Axboe f446c6311e io_uring/memmap: don't use page_address() on a highmem page
For older/32-bit systems with highmem, don't assume that the pages in
a mapped region are always going to be mapped. If io_region_init_ptr()
finds that the pages are coalescable, also check if the first page is
a HighMem page or not. If it is, fall through to the usual vmap()
mapping rather than attempt to get the unmapped page address.

Cc: stable@vger.kernel.org
Fixes: c4d0ac1c15 ("io_uring/memmap: optimise single folio regions")
Link: https://lore.kernel.org/all/681fe2fb.050a0220.f2294.001a.GAE@google.com/
Reported-by: syzbot+5b8c4abafcb1d791ccfc@syzkaller.appspotmail.com
Link: https://lore.kernel.org/all/681fed0a.050a0220.f2294.001c.GAE@google.com/
Reported-by: syzbot+6456a99dfdc2e78c4feb@syzkaller.appspotmail.com
Tested-by: syzbot+6456a99dfdc2e78c4feb@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-12 09:27:41 -06:00
Pavel Begunkov 8fb7aee055 io_uring: drain based on allocated reqs
Don't rely on CQ sequence numbers for draining, as it has become messy
and needs cq_extra adjustments. Instead, base it on the number of
allocated requests and only allow flushing when all requests are in the
drain list.

As a result, cq_extra is gone, no overhead for its accounting in aux cqe
posting, less bloating as it was inlined before, and it's in general
simpler than trying to track where we should bump it and where it should
be put back like in cases of overflow. Also, it'll likely help with
cleaning and unifying some of the CQ posting helpers.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/46ece1e34320b046c06fee2498d6b4cd12a700f2.1746788718.git.asml.silence@gmail.com
Link: https://lore.kernel.org/r/24497b04b004bceada496033d3c9d09ff8e81ae9.1746944903.git.asml.silence@gmail.com
[axboe: fold in fix from link2]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-12 07:52:52 -06:00
hexue 63166b815d io_uring/uring_cmd: fix hybrid polling initialization issue
Modify the check for whether the timer is initialized during IO transfer
when passthrough is used with hybrid polling, to ensure that it's always
set up correctly.

Cc: stable@vger.kernel.org
Fixes: 01ee194d1a ("io_uring: add support for hybrid IOPOLL")
Signed-off-by: hexue <xue01.he@samsung.com>
Link: https://lore.kernel.org/r/20250512052025.293031-1-xue01.he@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-12 07:17:02 -06:00
Pavel Begunkov 63de899cb6 io_uring: count allocated requests
Keep track of the number of requests a ring currently has allocated
(and not freed); it'll be needed in the next patch.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c8f8308294dc2a1cb8925d984d937d4fc14ab5d4.1746788718.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-09 08:01:02 -06:00
Pavel Begunkov b0c8a6401f io_uring: open code io_account_cq_overflow()
io_account_cq_overflow() doesn't help explain what's going on in
there, and it'll become even smaller with the following patches, so
open code it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e4333fa0d371f519e52a71148ebdffed4b8d3aa9.1746788718.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-09 08:01:02 -06:00
Pavel Begunkov 19a94da447 io_uring: consolidate drain seq checking
We check sequences when queuing drained requests as well as when
flushing them. Instead, always queue and immediately try to flush, so
that all seq handling can be kept contained in the flushing code.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d4651f742e671af5b3216581e539ea5d31bc7125.1746788718.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-09 08:01:02 -06:00
Pavel Begunkov e91e4f692f io_uring: remove drain prealloc checks
Currently io_drain_req() has two steps. The first is a fast path
checking sequence numbers. The second is allocations, rechecking and
actual queuing. Further simplify it by removing the first step.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4d06e89ed07611993d7bf89182de2300858379bd.1746788718.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-09 08:01:01 -06:00
Pavel Begunkov 05b334110f io_uring: simplify drain ret passing
"ret" in io_drain_req() is only used in one place, remove it and pass
-ENOMEM directly.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ece724b77e66e6caabcc215e0032ee7ff140f289.1746788718.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-09 08:01:01 -06:00
Pavel Begunkov fde04c7e27 io_uring: fix spurious drain flushing
io_queue_deferred() is not tolerant of spurious calls arriving before
some requests complete. You can have an inflight drain-marked request
and another request that came after it and got queued into the drain
list. Now, if io_queue_deferred() is called before the first request
completes, it'll check the 2nd req with req_need_defer(), find that
there is no drain flag set, and queue it for execution.

To make io_queue_deferred() work, it should at least check sequences for
the first request, and then we also need to check whether there is
another drain request creating another bubble.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/972bde11b7d4ef25b3f5e3fd34f80e4d2aa345b8.1746788718.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-09 08:01:01 -06:00
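
A sketch of the stricter flush loop this implies, with illustrative
types and helpers (only REQ_F_IO_DRAIN and the list API are real); the
point is that once a drain-marked request has been seen, nothing behind
it may be released without passing its own sequence check:

    #include <linux/list.h>

    static void queue_deferred_sketch(struct ring_ctx_sketch *ctx)
    {
            bool drain_seen = false, first = true;

            while (!list_empty(&ctx->defer_list)) {
                    struct defer_entry_sketch *de =
                            list_first_entry(&ctx->defer_list,
                                             struct defer_entry_sketch, list);

                    drain_seen |= de->req->flags & REQ_F_IO_DRAIN;
                    /* the head entry, and anything behind a drain
                     * request, must pass its sequence check first */
                    if ((drain_seen || first) && !seq_check_sketch(ctx, de->req))
                            break;

                    list_del_init(&de->list);
                    run_deferred_sketch(de->req);
                    first = false;
            }
    }
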
Pavel Begunkov f979c20547 io_uring: account drain memory to cgroup
Account drain allocations against memcg. It's not a big problem as each
such allocation is paired with a request, which is accounted, but it's
nicer to follow the limits more closely.

Cc: stable@vger.kernel.org # 6.1
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f8dfdbd755c41fd9c75d12b858af07dfba5bbb68.1746788718.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-09 08:01:01 -06:00
Pavel Begunkov 81a22c86ec io_uring: add lockdep asserts to io_add_aux_cqe
io_add_aux_cqe() can only be called for rings with uring_lock protected
completion queues; add a couple of assertions to that effect.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c010eab7b94a187c00a9d46d8b67bf7fcad18af4.1746788592.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-09 07:58:55 -06:00
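
lockdep_assert_held() is the standard way to encode that contract; the
added assertions presumably look something like this sketch:

    #include <linux/lockdep.h>

    static void add_aux_cqe_sketch(struct io_ring_ctx *ctx)
    {
            /* only valid on rings whose completion queue is protected
             * by uring_lock; fires a lockdep splat if violated */
            lockdep_assert_held(&ctx->uring_lock);

            /* ... post the auxiliary CQE ... */
    }
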
Pavel Begunkov 28b8cd864d io_uring/net: move CONFIG_NET guards to Makefile
Instruct Makefile to never try to compile net.c without CONFIG_NET and
kill ifdefs in the file.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f466400e20c3f536191bfd559b1f3cd2a2ab5a1e.1746788579.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-09 07:58:47 -06:00
Long Li 6ae4308116 io_uring: update parameter name in io_pin_pages function declaration
Rename first parameter in io_pin_pages from ubuf to uaddr for consistency
between declaration and implementation.

Signed-off-by: Long Li <leo.lilong@huawei.com>
Link: https://lore.kernel.org/r/20250509063015.3799255-1-leo.lilong@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-09 07:58:22 -06:00
Gabriel Krisman Bertazi 92835cebab io_uring/sqpoll: Increase task_work submission batch size
Our QA team reported a 10%-23% throughput reduction on an io_uring
sqpoll testcase doing IO to a null_blk, which I traced back to a
reduction of the device submission queue depth utilization. It turns out
that, after commit af5d68f889 ("io_uring/sqpoll: manage task_work
privately"), we capped the number of task_work entries that can be
completed from a single spin of sqpoll to only 8 entries, before the
sqpoll goes around to (potentially) sleep.  While this cap doesn't drive
the submission side directly, it impacts the completion behavior, which
affects the number of IOs queued by fio per sqpoll cycle on the
submission side, and io_uring ends up seeing fewer IOs per sqpoll cycle.
As a result, block layer plugging is less effective, we see more time
spent inside the block layer in profiling charts, and fio measures
increased submission latency.

There are other places that have increased overhead once sqpoll sleeps
more often, such as the sqpoll utilization calculation.  But, in this
microbenchmark, those were not representative enough in perf charts, and
their removal didn't yield measurable changes in throughput.  The major
overhead comes from the fact we plug less, and less often, when submitting
to the block layer.

My benchmark is:

fio --ioengine=io_uring --direct=1 --iodepth=128 --runtime=300 --bs=4k \
    --invalidate=1 --time_based  --ramp_time=10 --group_reporting=1 \
    --filename=/dev/nullb0 --name=RandomReads-direct-nullb-sqpoll-4k-1 \
    --rw=randread --numjobs=1 --sqthread_poll

In one machine, tested on top of Linux 6.15-rc1, we have the following
baseline:
  READ: bw=4994MiB/s (5236MB/s), 4994MiB/s-4994MiB/s (5236MB/s-5236MB/s), io=439GiB (471GB), run=90001-90001msec

With this patch:
  READ: bw=5762MiB/s (6042MB/s), 5762MiB/s-5762MiB/s (6042MB/s-6042MB/s), io=506GiB (544GB), run=90001-90001msec

which is a 15% improvement in measured bandwidth.  The average
submission latency is noticeably lowered too.  As measured by
fio:

Baseline:
   lat (usec): min=20, max=241, avg=99.81, stdev=3.38
Patched:
   lat (usec): min=26, max=226, avg=86.48, stdev=4.82

If we look at blktrace, we can also see the plugging behavior is
improved. In the baseline, we end up limited to plugging 8 requests in
the block layer regardless of the device queue depth, while after
patching we can drive more IO, and we manage to utilize the full device
queue.

In the baseline, after a stabilization phase, an ordinary submission
looks like:
  254,0    1    49942     0.016028795  5977  U   N [iou-sqp-5976] 7

After patching, I see consistently more requests per unplug.
  254,0    1     4996     0.001432872  3145  U   N [iou-sqp-3144] 32

Ideally, the cap would at least be deep enough to fill the device
queue, but we can't predict that behavior, or assume all IO goes to a
single device, and thus can't guess the ideal batch size.  We also
don't want to let the tw run unbounded, though I'm not sure it would
really be a problem.  Instead, let's just give it a more sensible value
that will allow for more efficient batching.  I've tested with different
cap values, and initially proposed to increase the cap to 1024.  Jens
argued that's too big of a bump, and I observed that, with 32, I'm no
longer able to reproduce this bottleneck on any of my machines.

Fixes: af5d68f889 ("io_uring/sqpoll: manage task_work privately")
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20250508181203.3785544-1-krisman@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-09 07:56:53 -06:00
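
The change itself boils down to raising a single cap; sketched with a
hypothetical constant name (the real definition lives in io_uring's
sqpoll code and may be spelled differently):

    /* Task_work entries processed per sqpoll iteration. A cap of 8
     * starved deep device queues and defeated block-layer plugging;
     * 32 was the smallest tested value that removed the bottleneck
     * while keeping task_work processing bounded.
     */
    #define SQPOLL_TW_CAP_ENTRIES_SKETCH    32      /* was 8 */
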
Jens Axboe 687b2bae0e io_uring: ensure deferred completions are flushed for multishot
Multishot normally uses io_req_post_cqe() to post completions, but when
stopping it, it may finish up with a deferred completion. This is fine,
except if another multishot event triggers before the deferred completions
get flushed. If this occurs, then CQEs may get reordered in the CQ ring,
as new multishot completions get posted before the deferred ones are
flushed. This can cause confusion on the application side, if strict
ordering is required for the use case.

When multishot posting via io_req_post_cqe(), flush any pending deferred
completions first, if any.

Cc: stable@vger.kernel.org # 6.1+
Reported-by: Norman Maurer <norman_maurer@apple.com>
Reported-by: Christian Mazakas <christian.mazakas@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07 07:55:15 -06:00
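
In pseudo-C, the ordering fix reads roughly as below; the helper names
are illustrative, only submit_state.cq_flush is taken from the code:

    static bool post_multishot_cqe_sketch(struct io_ring_ctx *ctx,
                                          u64 user_data, s32 res, u32 cflags)
    {
            /* a stopped multishot may have parked its final completion
             * in the deferred batch; flush that first so a new CQE
             * can't jump ahead of it in the CQ ring */
            if (ctx->submit_state.cq_flush)
                    flush_completions_sketch(ctx);

            return fill_cqe_sketch(ctx, user_data, res, cflags);
    }
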
Pavel Begunkov 35adea1d01 io_uring: move io_req_put_rsrc_nodes()
It'd be nice to hide the details of how rsrc nodes are used by a request
from rsrc.c, specifically which request fields store them, and which
bits signify whether there is a node in a request. It rather belongs to
generic request handling, so move the helper to io_uring.c. While doing
so, remove the clearing of ->buf_node, as it's controlled by
REQ_F_BUF_NODE and doesn't require zeroing.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/bb73fb42baf825edb39344365aff48cdfdd4c692.1746533789.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06 10:11:23 -06:00
Pavel Begunkov 9c2ff3f9b5 io_uring: remove io_preinit_req()
Apart from setting ->ctx, io_preinit_req() zeroes a bunch of fields of a
request, of which only ->file_node is mandatory. Remove the function and
zero the entire request on first allocation. With that, we also need to
initialise ->ctx every time, which might be a good thing for performance
as now we're likely overwriting the entire cache line, so it can be
write-combined and avoid RMW.

Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ba5485dc913f1e275862ce88f5169d4ac4a33836.1746533807.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06 10:11:23 -06:00
Pavel Begunkov 78967aabf6 io_uring/timeout: don't export link t-out disarm helper
[__]io_disarm_linked_timeout() are only used inside timeout.c, so
confine them to the file.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1eb200911255e643bf252a8e65fb2c787340cf18.1746533800.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06 10:11:23 -06:00
Pavel Begunkov a5c98e9424 io_uring/zcrx: dmabuf backed zerocopy receive
Add support for dmabuf backed zcrx areas. To use it, the user should
pass IORING_ZCRX_AREA_DMABUF in the struct io_uring_zcrx_area_reg flags
field and pass a dmabuf fd in the dmabuf_fd field.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20bb1890e60a82ec945ab36370d1fd54be414ab6.1746097431.git.asml.silence@gmail.com
Link: https://lore.kernel.org/io-uring/6e37db97303212bbd8955f9501cf99b579f8aece.1746547722.git.asml.silence@gmail.com
[axboe: fold in fixup]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06 10:11:00 -06:00
Keith Busch 02040353f4 io_uring: enable per-io write streams
Allow userspace to pass a per-I/O write stream in the SQE:

      __u8 write_stream;

The __u8 type matches the size the filesystems and block layer support.

Applications can query the supported values from the block device's
max_write_streams sysfs attribute. Unsupported values are ignored by
file operations that do not support write streams, or rejected with an
error by those that do.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20250506121732.8211-7-joshi.k@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06 07:46:43 -06:00
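
From userspace the feature would be driven roughly like this, assuming a
liburing recent enough to expose the new SQE field:

    #include <liburing.h>

    /* Queue a write tagged with a per-IO write stream hint. Values
     * above the device's max_write_streams are ignored or rejected,
     * as described above.
     */
    static void queue_stream_write(struct io_uring *ring, int fd,
                                   const void *buf, unsigned len,
                                   __u64 offset, __u8 stream)
    {
            struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

            io_uring_prep_write(sqe, fd, buf, len, offset);
            sqe->write_stream = stream;
    }
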
Jens Axboe b53e523261 io_uring: always arm linked timeouts prior to issue
There are a few spots where linked timeouts are armed, and not all of
them adhere to the pre-arm, attempt issue, post-arm pattern. This can
be problematic if the linked request returns that it will trigger a
callback later, and does so before the linked timeout is fully armed.

Consolidate all the linked timeout handling into __io_issue_sqe(),
rather than have it spread throughout the various issue entry points.

Cc: stable@vger.kernel.org
Link: https://github.com/axboe/liburing/issues/1390
Reported-by: Chase Hiltz <chase@path.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-04 09:15:58 -06:00
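
The consolidated pattern, sketched with illustrative helpers around the
real request type: the timeout is fully prepared before issue, so a fast
asynchronous completion cannot race with a half-armed timer:

    static int issue_sqe_sketch(struct io_kiocb *req, unsigned issue_flags)
    {
            /* pre-arm before the request can go anywhere */
            struct io_kiocb *link = prep_linked_timeout_sketch(req);
            int ret = do_issue_sketch(req, issue_flags);

            /* post-arm: start the timer once issue has been attempted */
            if (link)
                    arm_linked_timeout_sketch(link);
            return ret;
    }
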
Peter Zijlstra 93f1b6d79a futex: Move futex_queue() into futex_wait_setup()
futex_wait_setup() has a weird calling convention in order to return
hb to use as an argument to futex_queue().

Mostly such that requeue can have an extra test in between.

Reorder code a little to get rid of this and keep the hb usage inside
futex_wait_setup().

[bigeasy: fixes]

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-4-bigeasy@linutronix.de
2025-05-03 12:02:05 +02:00
Pavel Begunkov 8a62804248 io_uring/zcrx: split common area map/unmap parts
Extract the area-type-dependent parts of io_zcrx_[un]map_area from the
generic path. It'll be helpful once there are more area memory types,
and not only user pages.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/50f6e893e2d20f937e628196cbf528d15f81c289.1746097431.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-02 09:24:42 -06:00
Pavel Begunkov 782dfa329a io_uring/zcrx: split out memory holders from area
In the data path, users of struct io_zcrx_area don't need to know what
kind of memory it's backed by. Only keep the generic bits in there, and
split out the memory-type-dependent fields into a new structure. It also
logically separates the step that actually imports the memory, e.g.
pinning user pages, from the generic area initialisation.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b60fc09c76921bf69e77eb17e07eb4decedb3bf4.1746097431.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-02 09:24:42 -06:00
Pavel Begunkov 6c9589aa08 io_uring/zcrx: resolve netdev before area creation
Some area types will require a valid struct device to be created, so
resolve netdev and struct device before creating an area.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ac8c1482be22acfe9ca788d2c3ce31b7451ce488.1746097431.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-02 09:24:42 -06:00
Pavel Begunkov d760d3f59f io_uring/zcrx: improve area validation
dmabuf backed area will be taking an offset instead of addresses, and
io_buffer_validate() is not flexible enough to facilitate it. It also
takes an iovec, which may truncate the u64 length zcrx takes. Add a new
helper function for validation.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0b3b735391a0a8f8971bf0121c19765131fddd3b.1746097431.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-02 09:24:42 -06:00
Jens Axboe f024d3a8de io_uring/fdinfo: annotate racy sq/cq head/tail reads
syzbot complains about the cached sq head read, and it's totally right.
But we don't need to care, it's just reading fdinfo, and reading the
CQ or SQ tail/head entries are known racy in that they are just a view
into that very instant and may of course be outdated by the time they
are reported.

Annotate both the SQ head and CQ tail read with data_race() to avoid
this syzbot complaint.

Link: https://lore.kernel.org/io-uring/6811f6dc.050a0220.39e3a1.0d0e.GAE@google.com/
Reported-by: syzbot+3e77fd302e99f5af9394@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-30 07:17:17 -06:00
Pavel Begunkov 91db6edc57 io_uring/cmd: move net cmd into a separate file
We keep socket io_uring command implementation in io_uring/uring_cmd.c.
Separate it from generic command code into a separate file.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/747d0519a2255bd055ae76b691d38d2b4c311001.1745843119.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-28 11:51:31 -06:00
Pavel Begunkov 27d2fed790 io_uring: delete misleading comment in io_fill_cqe_aux()
io_fill_cqe_aux() doesn't overflow completions; however, it might fail
them and let the caller handle it. Remove the comment, which doesn't
make any sense.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/021aa8c1d8f20ef2b66da6aeabb6b511938fd2c5.1745843119.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-28 11:51:31 -06:00
Jens Axboe edd43f4d6f io_uring: fix 'sync' handling of io_fallback_tw()
A previous commit added a 'sync' parameter to io_fallback_tw(), which,
if true, means the caller wants to wait on the fallback thread handling
it. But the logic is somewhat messed up; ensure that ctxs are swapped
and flushed appropriately.

Cc: stable@vger.kernel.org
Fixes: dfbe5561ae ("io_uring: flush offloaded and delayed task_work on exit")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-24 10:32:43 -06:00
Pavel Begunkov f6da4fee69 io_uring/eventfd: open code io_eventfd_grab()
io_eventfd_grab() doesn't help with understanding the path; it'll be
simpler to keep the helper open coded.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/5cb53ce3876c2819db9e8055cf41dca4398521db.1745493845.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-24 08:33:54 -06:00
Pavel Begunkov da01f60f8a io_uring/eventfd: clean up rcu locking
Conditional locking is never welcome if there are better options. Move
rcu locking into io_eventfd_signal(), make it unconditional and use
guards. It also helps with sparse warnings.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/91a925e708ca8a5aa7fee61f96d29b24ea9adeaf.1745493845.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-24 08:33:54 -06:00
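
Scope-based guards keep the now-unconditional RCU section terse; a
sketch, with the eventfd details elided:

    #include <linux/cleanup.h>
    #include <linux/rcupdate.h>

    static void eventfd_signal_sketch(struct io_ring_ctx *ctx)
    {
            /* unconditional RCU read section; the guard drops it on
             * every return path, which is also what quiets sparse
             * compared to conditional locking */
            guard(rcu)();

            /* ... dereference the RCU-protected eventfd and signal it ... */
    }
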
Pavel Begunkov 62f666df76 io_uring/eventfd: dedup signalling helpers
Consolidate io_eventfd_flush_signal() and io_eventfd_signal(). Not much
of a difference for now, but it prepares it for following changes.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/5beecd4da65d8d2d83df499196f84b329387f6a2.1745493845.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-24 08:33:54 -06:00
Pavel Begunkov 5e16f1a68d io_uring: don't duplicate flushing in io_req_post_cqe
io_req_post_cqe() sets submit_state.cq_flush so that
*flush_completions() can take care of batch committing CQEs. Don't
commit it twice by using __io_cq_unlock_post().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/41c416660c509cee676b6cad96081274bcb459f3.1745493861.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-24 06:28:43 -06:00
Pavel Begunkov 76f1cc98b2 io_uring/zcrx: add support for multiple ifqs
Allow the user to register multiple ifqs / zcrx contexts. With that we
can use multiple interfaces / interface queues in a single io_uring
instance.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/668b03bee03b5216564482edcfefbc2ee337dd30.1745141261.git.asml.silence@gmail.com
[axboe: fold in fix]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-23 07:13:14 -06:00
Pavel Begunkov 632b318672 io_uring/zcrx: move zcrx region to struct io_zcrx_ifq
Refill queue region is a part of zcrx and should stay in struct
io_zcrx_ifq. We can't have multiple queues without it, so move it there.

As a result there is no context-global zcrx region anymore, and the
region is looked up together with its ifq. To protect a concurrent mmap
from seeing an inconsistent region, we were protecting changes to
->zcrx_region with mmap_lock, but now it protects the publishing of the
ifq.

Reviewed-by: David Wei <dw@davidwei.uk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/24f1a728fc03d0166f16d099575457e10d9d90f2.1745141261.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-21 05:12:31 -06:00
Pavel Begunkov 77231d4e46 io_uring/zcrx: let zcrx choose region for mmaping
In preparation for adding multiple ifqs, add a helper returning a region
for mmaping zcrx refill queue. For now it's trivial and returns the same
ctx global ->zcrx_region.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d9006e2ef8cd5e5b337c2ba2228ede8409eb60f2.1745141261.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-21 05:11:40 -06:00
Pavel Begunkov 59bc1ab922 io_uring/zcrx: remove sqe->file_index check
sqe->file_index and sqe->zcrx_ifq_idx are aliased, so even though
io_recvzc_prep() has all the necessary handling and checking for
->zcrx_ifq_idx, it's already rejected a couple of lines above.

It's not a real problem, however, as we can only have one ifq at the
moment, whose index is always zero, and it returns correct error codes
to the user.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b51acedd326d1a87bf53e33303e6fc87661a3e3f.1745141261.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-21 05:11:40 -06:00
Pavel Begunkov a79154ae5d io_uring/zcrx: move io_zcrx_iov_page
We'll need io_zcrx_iov_page at the top to keep offset calculations
closer together, move it there.

Reviewed-by: David Wei <dw@davidwei.uk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/575617033a8b84a5985c7eb760b7121efdbe7e56.1745141261.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-21 05:11:40 -06:00
Pavel Begunkov 37d26edd6b io_uring/zcrx: remove duplicated freelist init
Several lines below we already initialise the freelist, don't do it
twice.

Reviewed-by: David Wei <dw@davidwei.uk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/71b07c4e00d1db65899d1ed8d603d961fe7d1c28.1745141261.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-21 05:11:40 -06:00
Pavel Begunkov be6bad57b2 io_uring/rsrc: remove null check on import
WARN_ON_ONCE() checking imu for NULL in io_import_fixed() is in the hot
path and overly protective; it's time to get rid of it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3782c01e311dbd474bca45aefaf79a2f2822fafb.1745083025.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-21 05:10:04 -06:00
Pavel Begunkov 9cebcf7b0c io_uring/rsrc: clean up io_coalesce_buffer()
We don't need special handling for the first page in
io_coalesce_buffer(), move it inside the loop.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Link: https://lore.kernel.org/r/ad698cddc1eadb3d92a7515e95bb13f79420323d.1745083025.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-21 05:10:04 -06:00
Pavel Begunkov ea76925614 io_uring/rsrc: use unpin_user_folio
We want a full folio to be left pinned but with only one reference; for
that we "unpin" all but the first page with unpin_user_pages(), which
can be confusing. There is a new helper to achieve that called
unpin_user_folio(), so use that.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Link: https://lore.kernel.org/r/e0b2be8f9ea68f6b351ec3bb046f04f437f68491.1745083025.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-21 05:10:04 -06:00
Jens Axboe 8a2dacd49f io_uring/rsrc: remove node assignment helpers
There are two helpers here, one assigns and increments the node ref
count, and the other is simply a wrapper around that for the buffer node
handling.

The buffer node assignment benefits from checking and setting
REQ_F_BUF_NODE together, otherwise stalls have been observed on setting
that flag later in the process. Hence re-do it so that it's set when
checked, and cleared in case of (unlikely) failure. With that, the
buffer node helper can go, and then drop the generic
io_req_assign_rsrc_node() helper as well as there's only a single user
of it left at that point.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-21 05:06:58 -06:00
Jens Axboe 53db8a71ec io_uring: add support for IORING_OP_PIPE
This works just like pipe2(2), except it also supports fixed file
descriptors. Used in a similar fashion as for other fd instantiating
opcodes (like accept, socket, open, etc), where sqe->file_slot is set
appropriately if two direct descriptors are desired rather than a set
of normal file descriptors.

sqe->addr must be set to a pointer to an array of 2 integers, which
is where the fixed/normal file descriptors are copied to.

sqe->pipe_flags contains flags, same as what is allowed for pipe2(2).

Future expansion of per-op private flags can go in sqe->ioprio,
like we do for other opcodes that take both a "syscall" flag set and
an io_uring opcode specific flag set.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-21 05:06:58 -06:00
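
A hedged userspace sketch of the opcode as described above (raw SQE
setup via liburing; the exact uapi field names may differ):

    #include <liburing.h>

    /* Create a pipe through io_uring. On completion, the two ints at
     * 'fds' hold the read/write ends, or the fixed slots if direct
     * descriptors were requested via sqe->file_slot.
     */
    static void queue_pipe_sketch(struct io_uring *ring, int fds[2],
                                  unsigned int pipe_flags)
    {
            struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

            io_uring_prep_rw(IORING_OP_PIPE, sqe, -1, fds, 0, 0);
            sqe->pipe_flags = pipe_flags;   /* same as pipe2(2) */
    }
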
Pavel Begunkov bd32923e5f io_uring: don't store bgid in req->buf_index
Pass buffer group id into the rest of helpers via struct buf_sel_arg
and remove all reassignments of req->buf_index back to bgid. Now, it
only stores buffer indexes, and the group is provided by callers.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3ea9fa08113ecb4d9224b943e7806e80a324bdf9.1743437358.git.asml.silence@gmail.com
Link: https://lore.kernel.org/io-uring/0c01d76ff12986c2f48614db8610caff8f78c869.1743500909.git.asml.silence@gmail.com/
[axboe: fold in patch from second link]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-21 05:06:58 -06:00
Pavel Begunkov c0e9650521 io_uring/kbuf: pass bgid to io_buffer_select()
The current situation with buffer group id juggling is not ideal.
req->buf_index first stores the bgid, then it's overwritten by a buffer
id, and then it can get restored back on recycling, etc. It's not so
easy to control, and it's not handled consistently across request types,
with receive requests saving and restoring the bgid by hand.

It's a prep patch that adds a buffer group id argument to
io_buffer_select(). The caller will be responsible for stashing a copy
somewhere and passing it into the function.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a210d6427cc3f4f42271a6853274cd5a50e56820.1743437358.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-21 05:06:58 -06:00
Pavel Begunkov e6f74fd67d io_uring: set IMPORT_BUFFER in generic send setup
Move REQ_F_IMPORT_BUFFER to the common send setup. Currently, the only
user is send zc, but we'll want for normal sends to support that in the
future.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/18b74edfc61d8a1b6c9fed3b78a9276fe80f8ced.1743437358.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-21 05:06:58 -06:00
Pavel Begunkov e9ff9ae103 io_uring/net: don't use io_do_buffer_select at prep
Prep code is interested whether it's a selected buffer request, not
whether a buffer has already been selected like what
io_do_buffer_select() returns. Check for REQ_F_BUFFER_SELECT directly.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4488a029ac698554bebf732263fe19d7734affa6.1743437358.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-21 05:06:58 -06:00
Caleb Sander Mateos 9fe99eed91 io_uring/wq: avoid indirect do_work/free_work calls
struct io_wq stores do_work and free_work function pointers which are
called on each work item. But these function pointers are always set to
io_wq_submit_work and io_wq_free_work, respectively. So remove these
function pointers and just call the functions directly.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250329161527.3281314-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-21 05:06:58 -06:00
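
The devirtualisation is mechanical; a sketch of the before/after shape
(the struct is illustrative, the two function names are from the
commit):

    /* before: per-wq pointers that only ever held one value each */
    struct io_wq_hooks_sketch {
            void (*do_work)(struct io_wq_work *);   /* io_wq_submit_work */
            void (*free_work)(struct io_wq_work *); /* io_wq_free_work */
    };

    /* after: call the sole implementations directly, avoiding the
     * indirect branches on every work item */
    static void run_work_sketch(struct io_wq_work *work)
    {
            io_wq_submit_work(work);        /* was wq->do_work(work) */
            io_wq_free_work(work);          /* was wq->free_work(work) */
    }
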
Pavel Begunkov f12ecf5e1c io_uring/zcrx: fix late dma unmap for a dead dev
There is a problem with page pools not dma-unmapping immediately when
the device is going down, and delaying it until the page pool is
destroyed, which is not allowed (see links). That just got fixed for
normal page pools, and we need to address memory providers as well.

Unmap pages in the memory provider uninstall callback, and protect it
with a new lock. There is also a gap between when a dma mapping is
created and the mp is installed, so if the device is killed in between,
io_uring would be holding on to dma mappings to a dead device with no
one to call ->uninstall. Move it to page pool init and rely on
->is_mapped to make sure it's only done once.

Link: https://lore.kernel.org/lkml/8067f204-1380-4d37-8ffd-007fc6f26738@kernel.org/T/
Link: https://lore.kernel.org/all/20250409-page-pool-track-dma-v9-0-6a9ef2e0cba8@redhat.com/
Fixes: 34a3e60821 ("io_uring/zcrx: implement zerocopy receive pp memory provider")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ef9b7db249b14f6e0b570a1bb77ff177389f881c.1744965853.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-18 06:12:10 -06:00
Jens Axboe b419bed4f0 io_uring/rsrc: ensure segments counts are correct on kbuf buffers
kbuf imports have the front offset adjusted and segments removed, but
the tail segments are still included in the segment count that gets
passed in the iov_iter. As the segments aren't necessarily all the
same size, move importing to a separate helper and iterate the
mapped length to get an exact count.

Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-17 11:59:12 -06:00
Nitesh Shetty 80c7378f94 io_uring/rsrc: send exact nr_segs for fixed buffer
Sending the exact nr_segs avoids the bio split check and processing in
the block layer, which takes around 5% [1] of overall CPU utilization.

In our setup, we see an overall improvement of IOPS from 7.15M to 7.65M
[2] and 5% less CPU utilization.

[1]
     3.52%  io_uring         [kernel.kallsyms]     [k] bio_split_rw_at
     1.42%  io_uring         [kernel.kallsyms]     [k] bio_split_rw
     0.62%  io_uring         [kernel.kallsyms]     [k] bio_submit_split

[2]
sudo taskset -c 0,1 ./t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n2
-r4 /dev/nvme0n1 /dev/nvme1n1

Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
[Pavel: fixed for kbuf, rebased and reworked on top of cleanups]
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7a1a49a8d053bd617c244291d63dbfbc07afde36.1744882081.git.asml.silence@gmail.com
[axboe: fold in fix factoring in buf reg offset]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-17 07:42:14 -06:00
Pavel Begunkov 59852ebad9 io_uring/rsrc: refactor io_import_fixed
io_import_fixed is a mess. Even though we know the final len of the
iterator, we still assign offset + len and do some magic afterwards to
correct for that.

Do offset calculation first and finalise it with iov_iter_bvec at the
end.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2d5107fed24f8b23245ef2ede9a5a7f7c426df61.1744882081.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-17 06:21:13 -06:00
Pavel Begunkov 50169d0754 io_uring/rsrc: separate kbuf offset adjustments
Kernel registered buffers are special because segments are not uniform
in size, and we have a bunch of optimisations based on that uniformity
for normal buffers. Handle kbuf separately, it'll be cleaner this way.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4e9e5990b0ab5aee723c0be5cd9b5bcf810375f9.1744882081.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-17 06:21:13 -06:00
Pavel Begunkov 1ac5712888 io_uring/rsrc: don't skip offset calculation
Don't optimise for requests with offset=0. Large registered buffers are
the preference and hence the user is likely to pass an offset, and the
adjustments are not expensive and will be made even cheaper in following
patches.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1c2beb20470ee3c886a363d4d8340d3790db19f3.1744882081.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-17 06:21:13 -06:00
Pavel Begunkov 70e4f9bfc1 io_uring/zcrx: add pp to ifq conversion helper
Page pools will likely change how they store memory providers, so in
preparation for that, keep accesses in one place in io_uring by
introducing a helper.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3522eb8fa9b4e21bcf32e7e9ae656c616b282210.1744722526.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-15 07:38:05 -06:00
Pavel Begunkov 25744f8495 io_uring/zcrx: return ifq id to the user
IORING_OP_RECV_ZC requests take a zcrx object id via sqe::zcrx_ifq_idx,
which binds it to the corresponding interface / queue. However, we don't
return that id back to the user. That's fine for now, as there can be
only one zcrx, and the user assumes its id should be 0; but as we'll
need multiple zcrx objects in the future, let's explicitly pass it back
on registration.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8714667d370651962f7d1a169032e5f02682a73e.1744722517.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-15 07:37:49 -06:00
Jens Axboe cf960726eb io_uring/kbuf: reject zero sized provided buffers
This isn't fixing a real issue, but there's also zero point in going
through group and buffer setup when the buffers are going to be rejected
once an attempt is made to use them.

Cc: stable@vger.kernel.org
Reported-by: syzbot+58928048fd1416f1457c@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-07 07:51:23 -06:00
Pavel Begunkov 5a17131a5d io_uring/zcrx: separate niov number from pages
A preparation patch that separates the number of pages / folios from
the number of niovs. They will not match in the future to support huge
pages, improved dma mapping and/or larger chunk sizes.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0780ac966ee84200385737f45bb0f2ada052392b.1743848231.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-07 07:36:52 -06:00
Pavel Begunkov 9b58440a5b io_uring/zcrx: put refill data into separate cache line
The refill queue lock and other bits are only used from the allocation
path on the rx softirq side, but they share a cache line with other
fields like ctx that are also used in the "syscall" path, which causes
cache bouncing when softirq runs on a different CPU.

Separate them into different cache lines. The first one now contains
constant fields used by both contexts, followed by a line responsible
for refill queue data.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6d1f598e27d623c07fc49d6baee13089a9b1216c.1743848241.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-07 07:36:52 -06:00
Pavel Begunkov ab6005f391 io_uring: don't post tag CQEs on file/buffer registration failure
Buffer / file table registration is all or nothing: if it fails, all
resources we might have partially registered are dropped and the table
is killed. If that happens, it doesn't make sense to post any rsrc tag
CQEs. That would be confusing to the application, which should not need
to handle that case.

Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Fixes: 7029acd8a9 ("io_uring/rsrc: get rid of per-ring io_rsrc_node list")
Link: https://lore.kernel.org/r/c514446a8dcb0197cddd5d4ba8f6511da081cf1f.1743777957.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-04 09:40:44 -06:00
Pavel Begunkov 390513642e io_uring: always do atomic put from iowq
io_uring always switches requests to atomic refcounting for iowq
execution before there is any parallelism, by setting REQ_F_REFCOUNT,
and the flag is not cleared until the request completes. That should be
fine as long as the compiler doesn't make up a non-existing value for
the flags; however, KCSAN still complains when the request owner changes
other flag bits:

BUG: KCSAN: data-race in io_req_task_cancel / io_wq_free_work
...
read to 0xffff888117207448 of 8 bytes by task 3871 on cpu 0:
 req_ref_put_and_test io_uring/refs.h:22 [inline]

Skip REQ_F_REFCOUNT checks for iowq, we know it's set.

Reported-by: syzbot+903a2ad71fb3f1e47cf5@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d880bc27fb8c3209b54641be4ff6ac02b0e5789a.1743679736.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-03 08:31:57 -06:00
Ming Lei 1045afae4b io_uring: support vectored kernel fixed buffer
io_uring has supported fixed kernel buffers via
io_buffer_register_bvec() and io_buffer_unregister_bvec().

Vectored fixed buffers are now ready, so it is natural to also support
vectored fixed kernel buffers; one use case is ublk.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250325135155.935398-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-02 07:06:59 -06:00
Ming Lei 8622b20f23 io_uring: add validate_fixed_range() for validating fixed buffers
Add a validate_fixed_range() helper for validating a fixed buffer
range.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250325135155.935398-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-02 07:06:43 -06:00
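
A plausible shape for such a helper, written defensively; the name
matches the commit but the exact semantics here are an assumption:

    static int validate_fixed_range_sketch(u64 buf_addr, size_t len,
                                           u64 buf_start, u64 buf_end)
    {
            u64 buf_last = buf_addr + len;

            /* reject u64 wrap-around of addr + len */
            if (unlikely(buf_last < buf_addr))
                    return -EFAULT;
            /* reject any part of the range escaping the buffer */
            if (unlikely(buf_addr < buf_start || buf_last > buf_end))
                    return -EFAULT;
            return 0;
    }
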
David Wei fcfd94d696 io_uring/zcrx: return early from io_zcrx_recv_skb if readlen is 0
When readlen is set for a recvzc request, tcp_read_sock() will call
io_zcrx_recv_skb() one final time with len == desc->count == 0. This is
caused by the !desc->count check happening too late. The offset + 1 !=
skb->len check happens earlier and causes the while loop to continue.

Fix this in io_zcrx_recv_skb() instead of tcp_read_sock(). Return early
if len is 0 i.e. the read is done.

Fixes: 6699ec9a23 ("io_uring/zcrx: add a read limit to recvzc requests")
Signed-off-by: David Wei <dw@davidwei.uk>
Link: https://lore.kernel.org/r/20250401195355.1613813-1-dw@davidwei.uk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-01 14:00:46 -06:00
Pavel Begunkov 81ed18015d io_uring/net: avoid import_ubuf for regvec send
With registered buffers we set up iterators in helpers like
io_import_fixed(), and there is no need for an import_ubuf() before
that. It was fine while we used real pointers for offset calculation,
but that's not the case anymore since the introduction of ublk kernel
buffers.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9b2de1a50844f848f62c8de609b494971033a6b9.1743437358.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-31 12:41:49 -06:00
Pavel Begunkov a1fbe0a121 io_uring/rsrc: check size when importing reg buffer
We're relying on callers to verify the IO size, do it inside of
io_import_fixed() instead. It's safer, easier to deal with, and more
consistent as now it's done close to the iter init site.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f9c2c75ec4d356a0c61289073f68d98e8a9db190.1743446271.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-31 12:41:49 -06:00
Pavel Begunkov ed344511c5 io_uring: cleanup {g,s}etsockopt sqe reading
Add a local variable for the sqe pointer to avoid repetition.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8dbac0f9acda2d3842534eeb7ce10d9276b021ae.1743357108.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-31 07:08:46 -06:00
Pavel Begunkov 296e169618 io_uring: hide cached sqes from drivers
There is now an io_uring private part of cmd async_data, move saved sqe
into it. Drivers are accessing it via struct io_uring_cmd::cmd.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ecbe078dd57acefdbc4366d083327086c0879378.1743357121.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-31 07:08:34 -06:00
Pavel Begunkov 487a0710f8 io_uring: make zcrx depend on CONFIG_IO_URING
Reflect in the kconfig that zcrx requires io_uring to be compiled in.

Fixes: 6f377873cb ("io_uring/zcrx: add interface queue and refill queue")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8047135a344e79dbd04ee36a7a69cc260aabc2ca.1743356260.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-31 07:07:44 -06:00
Pavel Begunkov 697b2876ac io_uring: add req flag invariant build assertion
We're caching some of the file-related request flags in a tricky way;
add a build check to make sure the flags don't get reshuffled.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9877577b83c25dd78224a8274f799187e7ec7639.1743407551.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-31 07:07:34 -06:00
Pavel Begunkov ea9106786e io_uring: don't pass ctx to tw add remote helper
Unlike earlier versions, io_msg_remote_post() creates a valid request
with a proper context, so don't pass a context to
io_req_task_work_add_remote() explicitly but derive it from the request.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/721f51cf34996d98b48f0bfd24ad40aa2730167e.1743190078.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-28 17:14:01 -06:00
Pavel Begunkov 9cc0bbdaba io_uring/msg: initialise msg request opcode
It's risky to have the msg request opcode set to garbage, so at least
initialise it to nop. Later we might want to add a user-inaccessible
opcode for such cases.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9afe650fcb348414a4529d89f52eb8969ba06efd.1743190078.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-28 17:13:09 -06:00
Pavel Begunkov b0e9570a6b io_uring/msg: rename io_double_lock_ctx()
io_double_lock_ctx() doesn't lock both rings. Rename it to prevent any
future confusion.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9e5defa000efd9b0f5e169cbb6bad4994d46ec5c.1743190078.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-28 17:13:09 -06:00
Pavel Begunkov fbe1a30c5d io_uring/net: import zc ubuf earlier
io_send_setup() already sets up the iterator for IORING_OP_SEND_ZC; we
don't need to repeat that at issue time. Move it all together with mem
accounting at prep time, which is more consistent with how the non-zc
version does it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/eb54f007c493ad9f4ca89aa8e715baf30d83fb88.1743202294.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-28 17:11:20 -06:00
Pavel Begunkov ad3f6cc400 io_uring/net: set sg_from_iter in advance
In preparation to the next patch, set ->sg_from_iter callback at request
prep time.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/5fe2972701df3bacdb3d760bce195fa640bee201.1743202294.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-28 17:11:20 -06:00
Pavel Begunkov 49dbce5602 io_uring/net: clusterise send vs msghdr branches
We have multiple branches at prep for send vs sendmsg handling, put them
together so that the variant handling is more localised.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/33abf666d9ded74cba4da2f0d9fe58e88520dffe.1743202294.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-28 17:11:20 -06:00
Pavel Begunkov 63b16e4f0b io_uring/net: unify sendmsg setup with zc
io_sendmsg_zc_setup() duplicates parts of io_sendmsg_setup(), and the
only difference between them is that the former supports vectored
registered buffers, with nothing zerocopy specific. Merge them together;
we want regular sendmsg to eventually support fixed buffers either way.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7e5ec40f9dc93355399dc6fa0cbc8b31f0b20ac5.1743202294.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-28 17:11:20 -06:00
Pavel Begunkov c55e2845df io_uring/net: combine sendzc flags writes
Save an instruction / trip to the cache and assign some of sendzc flags
together.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c519d6f406776c3be3ef988a8339a88e45d1ffd9.1743202294.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-28 17:11:20 -06:00
Pavel Begunkov 5f364117db io_uring/net: open code io_net_vec_assign()
Get rid of io_net_vec_assign() by open coding it into its only caller.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/19191c34b5cfe1161f7eeefa6e785418ea9ad56d.1743202294.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-28 17:11:20 -06:00
Pavel Begunkov a20b8631c8 io_uring/net: open code io_sendmsg_copy_hdr()
io_sendmsg_setup() is trivial and io_sendmsg_copy_hdr() doesn't add
any good abstraction; open code one into the other.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/565318ce585665e88053663eeee5178d2c15692f.1743202294.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-28 17:11:20 -06:00
Pavel Begunkov 04491732fc io_uring/net: account memory for zc sendmsg
Account pinned pages for IORING_OP_SENDMSG_ZC, just as we do for
IORING_OP_SEND_ZC and as net/ does for MSG_ZEROCOPY.

Fixes: 493108d95f ("io_uring/net: zerocopy sendmsg")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4f00f67ca6ac8e8ed62343ae92b5816b1e0c9c4b.1743086313.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-28 16:15:42 -06:00
Linus Torvalds eff5f16bfd for-6.15/io_uring-reg-vec-20250327
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmflYcAQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpmvJD/4tKQlr0yRhln/JzPiONS41mUAuNRI4MdqJ
 ykpQkMx3NcQANbNyOxI0PV5I7y1Jdlg/UP9gy11BrIaBk4Kqoluc6iAzgr5q9pBC
 8pXhPIe80R/q/LOKEz9n5gqOMPNyUtd7IaBayJPBJre/YZXQu+49IL2Uyy3hss8d
 neqAbWErd2FoUfTY14XB3ImLM6a76Z6CjE3pJYvVDM5uRBuH0IGqehJJuNpsViBf
 M9XAW/HZt8ISsVt1tJbCQVWx4b63L/omHI8u5K2M0isTPV+QPk1O2Vgkn7dBrDeT
 JvThWrM1uE++DYGcQ3DXHfb3gBIFEjTrNb2nddstyEU2ZaEXUkuOV2O0b7WPuphj
 zp0oFaLl/ivHT8NoJzzZzK24zt99Qz43GWUaFCQeR0R8oTix/M1q0unguER45Iv7
 Po/b3h6+RAi+87KOlM5WWo05ScswS8AwcSUsP5xMR5BjjD+GQYO5PmVVyo8w0rid
 8F9U9DpN2CTA5YVjI+ax1cxWMOfmAXPK5ONjzZpyJoWb0THgj97esEwc2un7SBi7
 TJJz7Gc9/xOqfRKaPDoH9t8+b6ruWHMqCYDw6exSAUKeDxQ+7z0zNMudHkuR5VrX
 x+Taaj95ONLVNZYz0mbFcvmJC0UBOqkE94omXk7TU2Cn7SBzAW//XDep6CPpX/sa
 LcmOK4UXdg==
 =vOm1
 -----END PGP SIGNATURE-----

Merge tag 'for-6.15/io_uring-reg-vec-20250327' of git://git.kernel.dk/linux

Pull more io_uring updates from Jens Axboe:
 "Final separate updates for io_uring.

  This started out as a series of cleanups and improvements for
  registered buffers, but as the last series of the io_uring changes for
  6.15, it also collected a few fixes for the other branches on top:

   - Add support for vectored fixed/registered buffers.

     Previously only single segments have been supported for commands,
     now vectored variants are supported as well. This series includes
     networking and file read/write support.

   - Small series unifying return codes across multi and single shot.

   - Small series cleaning up registered buffer importing.

   - Adding support for vectored registered buffers for uring_cmd.

   - Fix for io-wq handling of command reissue.

   - Various little fixes and tweaks"

* tag 'for-6.15/io_uring-reg-vec-20250327' of git://git.kernel.dk/linux: (25 commits)
  io_uring/net: fix io_req_post_cqe abuse by send bundle
  io_uring/net: use REQ_F_IMPORT_BUFFER for send_zc
  io_uring: move min_events sanitisation
  io_uring: rename "min" arg in io_iopoll_check()
  io_uring: open code __io_post_aux_cqe()
  io_uring: defer iowq cqe overflow via task_work
  io_uring: fix retry handling off iowq
  io_uring/net: only import send_zc buffer once
  io_uring/cmd: introduce io_uring_cmd_import_fixed_vec
  io_uring/cmd: add iovec cache for commands
  io_uring/cmd: don't expose entire cmd async data
  io_uring: rename the data cmd cache
  io_uring: rely on io_prep_reg_vec for iovec placement
  io_uring: introduce io_prep_reg_iovec()
  io_uring: unify STOP_MULTISHOT with IOU_OK
  io_uring: return -EAGAIN to continue multishot
  io_uring: cap cached iovec/bvec size
  io_uring/net: implement vectored reg bufs for zctx
  io_uring/net: convert to struct iou_vec
  io_uring/net: pull vec alloc out of msghdr import
  ...
2025-03-28 15:07:04 -07:00
Linus Torvalds 6df9d086ff for-6.15/io_uring-epoll-wait-20250325
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmfjTRYQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpvfMD/4596ZTo0bzeimMMzFloabhaqcMC9eO21iW
 oNb6KdSKQzAOdQl3+YfhSrP/ZbeoJl8y0YoPGIA2RBd4ELl/4YPVTNeahNhz70P2
 iawXcYmMyrM+6W6RXof9uj+k8mE17G7N+M7DuMYjfMnC/OhDKg/3DbbydwesMW47
 /1naMsk/rJx6CbeVKAl3V2es3NPUrZlh/dsurm5muGqXHAUmv459w46togKNvUhM
 tAGSq54zyPLnvnuA4Cm1QBIYtN5dHmjDjMmEXFiEF1qg7c5WIfVpkopc/SjvvlQs
 fVwN9Mn9Mh1VXuLysSqa5RsXINtXbJdpyXnh8pG/QqhntftNRKkwMxlN0gEYfBzB
 4iuvzUpGmPAwPhYJ60Ylci4KhWY31mwbU+ns9+/G4bIdDxdEMJGvy4D/oz/fUr9p
 1gwAIkJgfixb6fsUrL6REFoSEmRQQ7hQkyKRFyvXSXsTtqXz2BR7O75ujK381J9d
 Sa3zexDBRaJWlDWspRe7Olcwf/hqLmY3cWwQ8Z+a2vozU2dAtYpKr5Lhr7SjuftI
 Uw4pCkrXUG5EhdMSNuQnzVZv3xWPGAAUaAIT0MhDcWakhNT1dyyjbuXWuFzDmKQI
 0Q71+UPBW300eJx8bFupYQ1mKp10LbjewCxiXOkBFZxaG7n5UwGsv4FoE3CJ8QEP
 V0ou6ZpBFA==
 =AZkA
 -----END PGP SIGNATURE-----

Merge tag 'for-6.15/io_uring-epoll-wait-20250325' of git://git.kernel.dk/linux

Pull io_uring epoll support from Jens Axboe:
 "This adds support for reading epoll events via io_uring.

  While this may seem counter-intuitive (and/or counterproductive), the
  reasoning here is that quite a few existing epoll event loops can
  easily do a partial conversion to a completion based model, but are
  still stuck with one (or a few) event types that remain readiness
  based.

  For that case, they then need to add the io_uring fd to the epoll
  context, and continue to rely on epoll_wait(2) for waiting on events.
  This misses out on the finer grained waiting that io_uring can do, to
  reduce context switches and wait for multiple events in one batch
  reliably.

  With adding support for reaping epoll events via io_uring, the whole
  legacy readiness based event types can still be reaped via epoll, with
  the overall waiting in the loop be driven by io_uring"

* tag 'for-6.15/io_uring-epoll-wait-20250325' of git://git.kernel.dk/linux:
  io_uring/epoll: add support for IORING_OP_EPOLL_WAIT
  io_uring/epoll: remove CONFIG_EPOLL guards
2025-03-28 14:55:32 -07:00
Linus Torvalds ca0b04ba0b for-6.15/io_uring-rx-zc-20250325
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmfjTP8QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpm6oEACnpGL52FAKTVj14GDqFo6Pq0Jmnh07x8qj
 mpHFPwxfWAzRiuNyji2iS9ecS2cnlkixNyMWZipXRi4KJAUjJH6YDd7IofUI3Glf
 6v7b6srFSvsWJIJ8LdkJHLHAJuzYnJvFZ8apwgQczEDqgHq7BAunM1sVQ+mydjYk
 EXT4kN6DSBOPzwr9GAay52f8nXhbqdHfT+YTGHPHg+QToojL6gD7vvW57w/QqD/x
 91hJef1z01cSIsDZOxA0EUeD+9bBAHpoamr/e3IOOCVYCN6hy0dGa9g0QGbbpVyE
 AeU4FGZLV9J8OOfvHVraDt5Wn3IXxYaMu22dSn1S6tVinwjXhJR2LAA+t4fGHAkt
 i38LjOsIbopSQn/cNhzwC8UZcHLqnVsdDolHlHzsVFVfcpck2/4JFpUeP8QhWgrk
 f9tY12QUf/oEaWm0/sUCHZNFxpIGeFA5FFXf0Z92clnzBuiuWoesBNvxqY/2DeZn
 IDNXiv+Trxr6kFEjTpzPeuxbWrn4PJ7afQSAFcEmOCguk5riM+zJZNIKg0TxUHSS
 tt6sfxmTP1DhgDKad5kT3MLyzOcx47Kbjf4dj6KmRnD+3DGwwN2F7X7R1GJylPSp
 RLOzJ+Ouuy9UmBN6JMsT4BmR9+FJTVirADU926d/ZqCTtRV8Tnq/6HHmKmmr4CR0
 THJ0PJqQjg==
 =MOve
 -----END PGP SIGNATURE-----

Merge tag 'for-6.15/io_uring-rx-zc-20250325' of git://git.kernel.dk/linux

Pull io_uring zero-copy receive support from Jens Axboe:
 "This adds support for zero-copy receive with io_uring, enabling fast
  bulk receive of data directly into application memory, rather than
  needing to copy the data out of kernel memory.

  While this version only supports host memory as that was the initial
  target, other memory types are planned as well, with notably GPU
  memory coming next.

  This work depends on some networking components which were queued up
  on the networking side, but have now landed in your tree.

  This is the work of Pavel Begunkov and David Wei. From the v14 posting:

    'We configure a page pool that a driver uses to fill a hw rx queue
     to hand out user pages instead of kernel pages. Any data that ends
     up hitting this hw rx queue will thus be dma'd into userspace
     memory directly, without needing to be bounced through kernel
     memory. 'Reading' data out of a socket instead becomes a
     _notification_ mechanism, where the kernel tells userspace where
     the data is. The overall approach is similar to the devmem TCP
     proposal.

     This relies on hw header/data split, flow steering and RSS to
     ensure packet headers remain in kernel memory and only desired
     flows hit a hw rx queue configured for zero copy. Configuring this
     is outside of the scope of this patchset.

     We share netdev core infra with devmem TCP. The main difference is
     that io_uring is used for the uAPI and the lifetime of all objects
     are bound to an io_uring instance. Data is 'read' using a new
     io_uring request type. When done, data is returned via a new shared
     refill queue. A zero copy page pool refills a hw rx queue from this
     refill queue directly. Of course, the lifetime of these data
     buffers are managed by io_uring rather than the networking stack,
     with different refcounting rules.

     This patchset is the first step adding basic zero copy support. We
     will extend this iteratively with new features e.g. dynamically
     allocated zero copy areas, THP support, dmabuf support, improved
     copy fallback, general optimisations and more'

  In a local setup, I was able to saturate a 200G link with a single CPU
  core, and at netdev conf 0x19 earlier this month, Jamal reported
  188Gbit of bandwidth using a single core (no HT, including soft-irq).

  Safe to say the efficiency is there, as bigger links would be needed
  to find the per-core limit, and it's considerably more efficient and
  faster than the existing devmem solution"

* tag 'for-6.15/io_uring-rx-zc-20250325' of git://git.kernel.dk/linux:
  io_uring/zcrx: add selftest case for recvzc with read limit
  io_uring/zcrx: add a read limit to recvzc requests
  io_uring: add missing IORING_MAP_OFF_ZCRX_REGION in io_uring_mmap
  io_uring: Rename KConfig to Kconfig
  io_uring/zcrx: fix leaks on failed registration
  io_uring/zcrx: recheck ifq on shutdown
  io_uring/zcrx: add selftest
  net: add documentation for io_uring zcrx
  io_uring/zcrx: add copy fallback
  io_uring/zcrx: throttle receive requests
  io_uring/zcrx: set pp memory provider for an rx queue
  io_uring/zcrx: add io_recvzc request
  io_uring/zcrx: dma-map area for the device
  io_uring/zcrx: implement zerocopy receive pp memory provider
  io_uring/zcrx: grab a net device
  io_uring/zcrx: add io_zcrx_area
  io_uring/zcrx: add interface queue and refill queue
2025-03-28 13:45:52 -07:00
Pavel Begunkov 6889ae1b4d io_uring/net: fix io_req_post_cqe abuse by send bundle
[  114.987980][ T5313] WARNING: CPU: 6 PID: 5313 at io_uring/io_uring.c:872 io_req_post_cqe+0x12e/0x4f0
[  114.991597][ T5313] RIP: 0010:io_req_post_cqe+0x12e/0x4f0
[  115.001880][ T5313] Call Trace:
[  115.002222][ T5313]  <TASK>
[  115.007813][ T5313]  io_send+0x4fe/0x10f0
[  115.009317][ T5313]  io_issue_sqe+0x1a6/0x1740
[  115.012094][ T5313]  io_wq_submit_work+0x38b/0xed0
[  115.013223][ T5313]  io_worker_handle_work+0x62a/0x1600
[  115.013876][ T5313]  io_wq_worker+0x34f/0xdf0

As the comment states, io_req_post_cqe() should only be used by
multishot requests, i.e. REQ_F_APOLL_MULTISHOT, which bundled sends are
not. Add a flag signifying whether a request wants to post multiple
CQEs. Eventually REQ_F_APOLL_MULTISHOT should imply the new flag, but
that's left out for simplicity.

Cc: stable@vger.kernel.org
Fixes: a05d1f625c ("io_uring/net: support bundles for send")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8b611dbb54d1cd47a88681f5d38c84d0c02bc563.1743067183.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-27 05:48:32 -06:00
Linus Torvalds 1a9239bb42 Networking changes for 6.15.
Core & protocols
 ----------------
 
  - Continue Netlink conversions to per-namespace RTNL lock
    (IPv4 routing, routing rules, routing next hops, ARP ioctls).
 
  - Continue extending the use of netdev instance locks. As a driver
    opt-in, protect queue operations and (in due course) ethtool
    operations with the instance lock and not RTNL lock.
 
  - Support collecting TCP timestamps (data submitted, sent, acked)
    in BPF, allowing for transparent (to the application) and lower
    overhead tracking of TCP RPC performance.
 
  - Tweak existing networking Rx zero-copy infra to support zero-copy
    Rx via io_uring.
 
  - Optimize MPTCP performance in single subflow mode by 29%.
 
  - Enable GRO on packets which went through XDP CPU redirect (were
    queued for processing on a different CPU), improving TCP stream
    performance up to 2x.
 
  - Improve performance of contended connect() by 200% by searching
    for an available 4-tuple under RCU rather than a spin lock.
    Bring an additional 229% improvement by tweaking hash distribution.
 
  - Avoid unconditionally touching sk_tsflags on RX, improving
    performance under UDP flood by as much as 10%.
 
  - Avoid skb_clone() dance in ping_rcv() to improve performance under
    ping flood.
 
  - Avoid FIB lookup in netfilter if socket is available, 20% perf win.
 
  - Rework network device creation (in-kernel) API to more clearly
    identify network namespaces and their roles.
    There are up to 4 namespace roles but we used to have just 2 netns
    pointer arguments, interpreted differently based on context.
 
  - Use sysfs_break_active_protection() instead of trylock to avoid
    deadlocks between unregistering objects and sysfs access.
 
  - Add a new sysctl and sockopt for capping max retransmit timeout
    in TCP.
 
  - Support masking port and DSCP in routing rule matches.
 
  - Support dumping IPv4 multicast addresses with RTM_GETMULTICAST.
 
  - Support specifying at what time packet should be sent on AF_XDP
    sockets.
 
  - Expose TCP ULP diagnostic info (for TLS and MPTCP) to non-admin users.
 
  - Add Netlink YAML spec for WiFi (nl80211) and conntrack.
 
  - Introduce EXPORT_IPV6_MOD() and EXPORT_IPV6_MOD_GPL() for symbols
    which only need to be exported when IPv6 support is built as a module.
 
  - Age FDB entries based on Rx not Tx traffic in VxLAN, similar
    to normal bridging.
 
  - Allow users to specify source port range for GENEVE tunnels.
 
  - netconsole: allow attaching kernel release, CPU ID and task name
    to messages as metadata
 
 Driver API
 ----------
 
  - Continue rework / fixing of Energy Efficient Ethernet (EEE) across
    the SW layers. Delegate the responsibilities to phylink where possible.
    Improve its handling in phylib.
 
  - Support symmetric OR-XOR RSS hashing algorithm.
 
  - Support tracking and preserving IRQ affinity by NAPI itself.
 
  - Support loopback mode speed selection for interface selftests.
 
 Device drivers
 --------------
 
  - Remove the IBM LCS driver for s390.
 
  - Remove the sb1000 cable modem driver.
 
  - Add support for SFP module access over SMBus.
 
  - Add MCTP transport driver for MCTP-over-USB.
 
  - Enable XDP metadata support in multiple drivers.
 
  - Ethernet high-speed NICs:
    - Broadcom (bnxt):
      - add PCIe TLP Processing Hints (TPH) support for new AMD platforms
      - support dumping RoCE queue state for debug
      - opt into instance locking
    - Intel (100G, ice, idpf):
      - ice: rework MSI-X IRQ management and distribution
      - ice: support for E830 devices
      - iavf: add support for Rx timestamping
      - iavf: opt into instance locking
    - nVidia/Mellanox:
      - mlx4: use page pool memory allocator for Rx
      - mlx5: support for one PTP device per hardware clock
      - mlx5: support for 200Gbps per-lane link modes
      - mlx5: move IPSec policy check after decryption
    - AMD/Solarflare:
      - support FW flashing via devlink
    - Cisco (enic):
      - use page pool memory allocator for Rx
      - enable 32, 64 byte CQEs
      - get max rx/tx ring size from the device
    - Meta (fbnic):
      - support flow steering and RSS configuration
      - report queue stats
      - support TCP segmentation
      - support IRQ coalescing
      - support ring size configuration
    - Marvell/Cavium:
      - support AF_XDP
    - Wangxun:
      - support for PTP clock and timestamping
    - Huawei (hibmcge):
      - checksum offload
      - add more statistics
 
  - Ethernet virtual:
    - VirtIO net:
      - aggressively suppress Tx completions, improve perf by 96% with
        1 CPU and 55% with 2 CPUs
      - expose NAPI to IRQ mapping and persist NAPI settings
    - Google (gve):
      - support XDP in DQO RDA Queue Format
      - opt into instance locking
    - Microsoft vNIC:
      - support BIG TCP
 
  - Ethernet NICs consumer, and embedded:
    - Synopsys (stmmac):
      - cleanup Tx and Tx clock setting and other link-focused cleanups
      - enable SGMII and 2500BASEX mode switching for Intel platforms
      - support Sophgo SG2044
    - Broadcom switches (b53):
      - support for BCM53101
    - TI:
      - iep: add perout configuration support
      - icssg: support XDP
    - Cadence (macb):
      - implement BQL
    - Xilinx (axienet):
      - support dynamic IRQ moderation and changing coalescing at runtime
      - implement BQL
      - report standard stats
    - MediaTek:
      - support phylink managed EEE
    - Intel:
      - igc: don't restart the interface on every XDP program change
    - RealTek (r8169):
      - support reading registers of internal PHYs directly
      - increase max jumbo packet size on RTL8125/RTL8126
    - Airoha:
      - support for RISC-V NPU packet processing unit
      - enable scatter-gather and support MTU up to 9kB
    - Tehuti (tn40xx):
      - support cards with TN4010 MAC and an Aquantia AQR105 PHY
 
  - Ethernet PHYs:
    - support for TJA1102S, TJA1121
    - dp83tg720: add randomized polling intervals for link detection
    - dp83822: support changing the transmit amplitude voltage
    - support for LEDs on 88q2xxx
 
  - CAN:
    - canxl: support Remote Request Substitution bit access
    - flexcan: add S32G2/S32G3 SoC
 
  - WiFi:
    - remove cooked monitor support
    - strict mode for better AP testing
    - basic EPCS support
    - OMI RX bandwidth reduction support
    - batman-adv: add support for jumbo frames
 
  - WiFi drivers:
    - RealTek (rtw88):
      - support RTL8814AE and RTL8814AU
    - RealTek (rtw89):
      - switch to using wiphy_lock and wiphy_work
      - add BB context to manipulate two PHYs in preparation for MLO
      - improve BT-coexistence mechanism to play A2DP smoothly
    - Intel (iwlwifi):
      - add new iwlmld sub-driver for latest HW/FW combinations
    - MediaTek (mt76):
      - preparation for mt7996 Multi-Link Operation (MLO) support
    - Qualcomm/Atheros (ath12k):
      - continued work on MLO
    - Silabs (wfx):
      - Wake-on-WLAN support
 
  - Bluetooth:
    - add support for skb TX SND/COMPLETION timestamping
    - hci_core: enable buffer flow control for SCO/eSCO
    - coredump: log devcd dumps into the monitor
 
  - Bluetooth drivers:
    - intel: add support to configure TX power
    - nxp: handle bootloader error during cmd5 and cmd7
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmfkLC8ACgkQMUZtbf5S
 Irsb5g/+L7oKOf0ALbaV9kxFsoz8AymZfAW9i/27F07omGJGpks8oX6j6rQLgIRO
 OQOFcp7XEdDh1+jh82gHVuPrw2/6lchLtW8ARtzdiQKFr5DRjrsbtua6GRc8iBqA
 DIRCBFoV2HuMkF39Vr09HMa9AZAT7QR2RLsRGpSq8E8Z8xxKz0X7oujs10PFpMTE
 IVKhTrVrk+NDot/IU2hzVpnpup+0ld+T2/ZaBklJGcU8uDffImsqNepHRyCG5UC3
 xz74Ju23MAj24Gct+og0yFUooF+lUltKyVm0FYCDCY3bASTwgY01NR3kEH/0NQvM
 cywLzd/ngHm/SMD2ggVAHkjZUieiIVHdaZ53dgjDeBOQoVP6p0dgUK7EumXX8Mx4
 8ReR2UiGoYRPaq9c4o+IjG4K027MwVK2p+mF1a6MLa+20XcyMbev8FIRbbHtC/V4
 z5/FsOAxcuICWkA1hU9bODrrGzIqemmdRgKG8sGuTJCt/kYGAn72/TCATGNSaCJ0
 00n2jN1aepa7wtywHJ5MhVzxN9iQX7+geUHXz0BI+lK4e1Pmk+vjGksymb9ai2fk
 eQAUV9ekub6q68/J16scD7XeOUM37bTLiMBQeIF8UtZBOJscKiS71zn9QP9Twwxv
 P2pm01RDZUI+z5ZX3hc12Pm1vjRHaAh9S1JpAw/pTOVlQ+mAJEM=
 =XY0S
 -----END PGP SIGNATURE-----

Merge tag 'net-next-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Jakub Kicinski:
 "Core & protocols:

   - Continue Netlink conversions to per-namespace RTNL lock
     (IPv4 routing, routing rules, routing next hops, ARP ioctls)

   - Continue extending the use of netdev instance locks. As a driver
     opt-in, protect queue operations and (in due course) ethtool
     operations with the instance lock and not RTNL lock.

   - Support collecting TCP timestamps (data submitted, sent, acked) in
     BPF, allowing for transparent (to the application) and lower
     overhead tracking of TCP RPC performance.

   - Tweak existing networking Rx zero-copy infra to support zero-copy
     Rx via io_uring.

   - Optimize MPTCP performance in single subflow mode by 29%.

   - Enable GRO on packets which went through XDP CPU redirect (were
     queued for processing on a different CPU), improving TCP stream
     performance up to 2x.

   - Improve performance of contended connect() by 200% by searching for
     an available 4-tuple under RCU rather than a spin lock. Bring an
     additional 229% improvement by tweaking hash distribution.

   - Avoid unconditionally touching sk_tsflags on RX, improving
     performance under UDP flood by as much as 10%.

   - Avoid skb_clone() dance in ping_rcv() to improve performance under
     ping flood.

   - Avoid FIB lookup in netfilter if socket is available, 20% perf win.

   - Rework network device creation (in-kernel) API to more clearly
     identify network namespaces and their roles. There are up to 4
     namespace roles but we used to have just 2 netns pointer arguments,
     interpreted differently based on context.

   - Use sysfs_break_active_protection() instead of trylock to avoid
     deadlocks between unregistering objects and sysfs access.

   - Add a new sysctl and sockopt for capping max retransmit timeout in
     TCP.

   - Support masking port and DSCP in routing rule matches.

   - Support dumping IPv4 multicast addresses with RTM_GETMULTICAST.

   - Support specifying at what time packet should be sent on AF_XDP
     sockets.

   - Expose TCP ULP diagnostic info (for TLS and MPTCP) to non-admin
     users.

   - Add Netlink YAML spec for WiFi (nl80211) and conntrack.

   - Introduce EXPORT_IPV6_MOD() and EXPORT_IPV6_MOD_GPL() for symbols
     which only need to be exported when IPv6 support is built as a
     module.

   - Age FDB entries based on Rx not Tx traffic in VxLAN, similar to
     normal bridging.

   - Allow users to specify source port range for GENEVE tunnels.

   - netconsole: allow attaching kernel release, CPU ID and task name to
     messages as metadata

  Driver API:

   - Continue rework / fixing of Energy Efficient Ethernet (EEE) across
     the SW layers. Delegate the responsibilities to phylink where
     possible. Improve its handling in phylib.

   - Support symmetric OR-XOR RSS hashing algorithm.

   - Support tracking and preserving IRQ affinity by NAPI itself.

   - Support loopback mode speed selection for interface selftests.

  Device drivers:

   - Remove the IBM LCS driver for s390

   - Remove the sb1000 cable modem driver

   - Add support for SFP module access over SMBus

   - Add MCTP transport driver for MCTP-over-USB

   - Enable XDP metadata support in multiple drivers

   - Ethernet high-speed NICs:
      - Broadcom (bnxt):
         - add PCIe TLP Processing Hints (TPH) support for new AMD
           platforms
         - support dumping RoCE queue state for debug
         - opt into instance locking
      - Intel (100G, ice, idpf):
         - ice: rework MSI-X IRQ management and distribution
         - ice: support for E830 devices
         - iavf: add support for Rx timestamping
         - iavf: opt into instance locking
      - nVidia/Mellanox:
         - mlx4: use page pool memory allocator for Rx
         - mlx5: support for one PTP device per hardware clock
         - mlx5: support for 200Gbps per-lane link modes
         - mlx5: move IPSec policy check after decryption
      - AMD/Solarflare:
         - support FW flashing via devlink
      - Cisco (enic):
         - use page pool memory allocator for Rx
         - enable 32, 64 byte CQEs
         - get max rx/tx ring size from the device
      - Meta (fbnic):
         - support flow steering and RSS configuration
         - report queue stats
         - support TCP segmentation
         - support IRQ coalescing
         - support ring size configuration
      - Marvell/Cavium:
         - support AF_XDP
      - Wangxun:
         - support for PTP clock and timestamping
      - Huawei (hibmcge):
         - checksum offload
         - add more statistics

   - Ethernet virtual:
      - VirtIO net:
         - aggressively suppress Tx completions, improve perf by 96%
           with 1 CPU and 55% with 2 CPUs
         - expose NAPI to IRQ mapping and persist NAPI settings
      - Google (gve):
         - support XDP in DQO RDA Queue Format
         - opt into instance locking
      - Microsoft vNIC:
         - support BIG TCP

   - Ethernet NICs consumer, and embedded:
      - Synopsys (stmmac):
         - cleanup Tx and Tx clock setting and other link-focused
           cleanups
         - enable SGMII and 2500BASEX mode switching for Intel platforms
         - support Sophgo SG2044
      - Broadcom switches (b53):
         - support for BCM53101
      - TI:
         - iep: add perout configuration support
         - icssg: support XDP
      - Cadence (macb):
         - implement BQL
      - Xilinx (axienet):
         - support dynamic IRQ moderation and changing coalescing at
           runtime
         - implement BQL
         - report standard stats
      - MediaTek:
         - support phylink managed EEE
      - Intel:
         - igc: don't restart the interface on every XDP program change
      - RealTek (r8169):
         - support reading registers of internal PHYs directly
         - increase max jumbo packet size on RTL8125/RTL8126
      - Airoha:
         - support for RISC-V NPU packet processing unit
         - enable scatter-gather and support MTU up to 9kB
      - Tehuti (tn40xx):
         - support cards with TN4010 MAC and an Aquantia AQR105 PHY

   - Ethernet PHYs:
      - support for TJA1102S, TJA1121
      - dp83tg720: add randomized polling intervals for link detection
      - dp83822: support changing the transmit amplitude voltage
      - support for LEDs on 88q2xxx

   - CAN:
      - canxl: support Remote Request Substitution bit access
      - flexcan: add S32G2/S32G3 SoC

   - WiFi:
      - remove cooked monitor support
      - strict mode for better AP testing
      - basic EPCS support
      - OMI RX bandwidth reduction support
      - batman-adv: add support for jumbo frames

   - WiFi drivers:
      - RealTek (rtw88):
         - support RTL8814AE and RTL8814AU
      - RealTek (rtw89):
         - switch to using wiphy_lock and wiphy_work
         - add BB context to manipulate two PHYs in preparation for MLO
         - improve BT-coexistence mechanism to play A2DP smoothly
      - Intel (iwlwifi):
         - add new iwlmld sub-driver for latest HW/FW combinations
      - MediaTek (mt76):
         - preparation for mt7996 Multi-Link Operation (MLO) support
      - Qualcomm/Atheros (ath12k):
         - continued work on MLO
      - Silabs (wfx):
         - Wake-on-WLAN support

   - Bluetooth:
      - add support for skb TX SND/COMPLETION timestamping
      - hci_core: enable buffer flow control for SCO/eSCO
      - coredump: log devcd dumps into the monitor

   - Bluetooth drivers:
      - intel: add support to configure TX power
      - nxp: handle bootloader error during cmd5 and cmd7"

* tag 'net-next-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1681 commits)
  unix: fix up for "apparmor: add fine grained af_unix mediation"
  mctp: Fix incorrect tx flow invalidation condition in mctp-i2c
  net: usb: asix: ax88772: Increase phy_name size
  net: phy: Introduce PHY_ID_SIZE — minimum size for PHY ID string
  net: libwx: fix Tx L4 checksum
  net: libwx: fix Tx descriptor content for some tunnel packets
  atm: Fix NULL pointer dereference
  net: tn40xx: add pci-id of the aqr105-based Tehuti TN4010 cards
  net: tn40xx: prepare tn40xx driver to find phy of the TN9510 card
  net: tn40xx: create swnode for mdio and aqr105 phy and add to mdiobus
  net: phy: aquantia: add essential functions to aqr105 driver
  net: phy: aquantia: search for firmware-name in fwnode
  net: phy: aquantia: add probe function to aqr105 for firmware loading
  net: phy: Add swnode support to mdiobus_scan
  gve: add XDP DROP and PASS support for DQ
  gve: update XDP allocation path support RX buffer posting
  gve: merge packet buffer size fields
  gve: update GQ RX to use buf_size
  gve: introduce config-based allocation for XDP
  gve: remove xdp_xsk_done and xdp_xsk_wakeup statistics
  ...
2025-03-26 21:48:21 -07:00
Linus Torvalds 91928e0d3c for-6.15/io_uring-20250322
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmfe8DcQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgprzNEADPOOfi05F0f4KOMLYOiRlB+D9yGtnbcu8k
 foPATU++L5fFDUpjAGobi38AXqHtF/Bx0CjCkPCvy4GjygcszYDY81FQmRlSFdeH
 CnYV/bm5myyupc8xANcWGJYpxf/ZAw7ugOAgcR3Xxk/qXdD5hYJ1jG4477o8v4IJ
 jyzE4vQCnWBogIoqpWTQyY3FSP5fdjeQlYdFGIFCx68rFTO9zEpuHhFr2OMfJtSu
 y1dLkeqbu1BZ0fhgRog9k9ZWk0uRQJI7d7EUXf9iwXOfkUyzCaV+nHvNGRdhJ/XO
 zj/qePw4lFyCgc+kuIejjIaPf/g84paCzshz/PIxOZTsvo6X78zj72RPM0Makhga
 amgtEqegWHtKSya1zew57oDwqwaIe6b+56PtNSP7K8IrwcJLheszoPKJIwiiynIL
 ZqvWseqYQXs58S9qZzSZlyEpwLVFcvyNt7Z/q7b56vhvXHabxOccPkUHt+UscVoQ
 qpG0R9+zmIiDfzY6Vfu7JyH2Uspdq3WjGYiaoVUko1rJXY+HFPd3be9pIbckyaA8
 ZQHIxB+h8IlxhpQ8HJKQpZhtYiluxEu6wDdvtpwC0KiGp5g4Q5eg9SkVyvETIEN8
 yW2tHqMAJJtJjDAPwJOV+GIyWQMIexzLmMP7Fq8R5VIoAox+Xn3flgNNEnMZdl/r
 kTc/hRjHvQ==
 =VEQr
 -----END PGP SIGNATURE-----

Merge tag 'for-6.15/io_uring-20250322' of git://git.kernel.dk/linux

Pull io_uring updates from Jens Axboe:
 "This is the first of the io_uring pull requests for the 6.15 merge
  window, there will be others once the net tree has gone in. This
  contains:

   - Cleanup and unification of cancelation handling across various
     request types.

   - Improvement for bundles, supporting them both for incrementally
     consumed buffers, and for non-multishot requests.

   - Enable toggling of iowait usage while waiting on io_uring events.
     Unfortunately this is still tied with CPU frequency boosting
     on short waits, as the scheduler side has not been very receptive
     to splitting the (useless) iowait stat from the cpufreq implied
     boost.

   - Add support for kbuf nodes, enabling zero-copy support for the ublk
     block driver.

   - Various cleanups for resource node handling.

   - Series greatly cleaning up the legacy provided (non-ring based)
     buffers. For years, we've been pushing the ring provided buffers as
     the way to go, and that is what people have been using. Reduce the
     complexity and code associated with legacy provided buffers.

   - Series cleaning up the compat handling.

   - Series improving and cleaning up the recvmsg/sendmsg iovec and msg
     handling.

   - Series of cleanups for io-wq.

   - Start adding a bunch of selftests. The liburing repository
     generally carries feature and regression tests for everything, but
     at least for ublk initially, we'll try and go the route of having
     it in selftests as well. We'll see how this goes, might decide to
     migrate more tests this way in the future.

   - Various little cleanups and fixes"

* tag 'for-6.15/io_uring-20250322' of git://git.kernel.dk/linux: (108 commits)
  selftests: ublk: add stripe target
  selftests: ublk: simplify loop io completion
  selftests: ublk: enable zero copy for null target
  selftests: ublk: prepare for supporting stripe target
  selftests: ublk: move common code into common.c
  selftests: ublk: increase max buffer size to 1MB
  selftests: ublk: add single sqe allocator helper
  selftests: ublk: add generic_01 for verifying sequential IO order
  selftests: ublk: fix starting ublk device
  io_uring: enable toggle of iowait usage when waiting on CQEs
  selftests: ublk: fix write cache implementation
  selftests: ublk: add variable for user to not show test result
  selftests: ublk: don't show `modprobe` failure
  selftests: ublk: add one dependency header
  io_uring/kbuf: enable bundles for incrementally consumed buffers
  Revert "io_uring/rsrc: simplify the bvec iter count calculation"
  selftests: ublk: improve test usability
  selftests: ublk: add stress test for covering IO vs. killing ublk server
  selftests: ublk: add one stress test for covering IO vs. removing device
  selftests: ublk: load/unload ublk_drv when preparing & cleaning up tests
  ...
2025-03-26 17:56:00 -07:00
Caleb Sander Mateos 73b6dacb1c io_uring/net: use REQ_F_IMPORT_BUFFER for send_zc
Instead of a bool field in struct io_sr_msg, use REQ_F_IMPORT_BUFFER to
track whether io_send_zc() has already imported the buffer. This flag
already serves a similar purpose for sendmsg_zc and {read,write}v_fixed.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20250325143943.1226467-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-26 04:26:45 -06:00
Linus Torvalds 054570267d lsm/stable-6.15 PR 20250323
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmfgWgMUHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXNW5RAAvCDq5gBtY0aTNlULe637EVLSh+t8
 PkSzHzu/NlzU6BfjtwSm2fuML8welTGxSwUPxUzMCI91gPdkGeFktefavT3xa+QI
 BHWROn7fEJ/KmRZvngPeIkgLr5xhF5nBJmc/Jw71qem20zRzNgJnpzMX16d10Phx
 dxd2xOO1qM3bv6Z9RcIssZRGaN+PHngpWWg+0B69XuaBUso87S6NDyKNn1XPmvoz
 as96k+Wk/xAZGVEeCbs/+H5rBx6DLg+FfTRa06Oh4BFsqedpkDPxLrTgCJGJkA0H
 dsK6O/993zvjx0Jn4ZPoJ9n35S82BmkCsz4bGq1xVl6FYUiMcm3/8yO41wllS+w4
 j+RlTU/RIdB7n8EKyMMl1hj1stTvt3Bi9F5Cbf7ZEv0snfR00K4KVpi17jnFjUHv
 kpOiEtXZb/NGQip7UAuUq0PisfqbiO4jJurYHRetDgv1WCy6+C8ufM5t6I+cnvmG
 VG+dlxcW+rDIn6bLRVuGi9TJRsQ6eox9ipa+qEKNNiOXgftELcgT7m74nAS5m0uv
 n5rDa221nPXecEB0X7d6YUFk711lly90dbelNeLrmv1w6jl8L1PpS1oBaW+UzGu9
 46eGBd6pzu9otvK9WVyDEdotDOCrgH0sd7pTetqDhLJZ7KrGwyyqO2gD/JroUKcC
 lnxBQwPnat86iI8=
 =oxfV
 -----END PGP SIGNATURE-----

Merge tag 'lsm-pr-20250323' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm

Pull lsm updates from Paul Moore:

 - Various minor updates to the LSM Rust bindings

   Changes include marking trivial Rust bindings as inlines and comment
   tweaks to better reflect the LSM hooks.

 - Add LSM/SELinux access controls to io_uring_allowed()

   Similar to the io_uring_disabled sysctl, add an LSM hook to
   io_uring_allowed() to give LSMs a simple way to enforce security
   policy on the use of io_uring. This pull request includes SELinux
   support for this new control using the io_uring/allowed permission.

 - Remove an unused parameter from the security_perf_event_open() hook

   The perf_event_attr struct parameter was not used by any currently
   supported LSMs, so remove it from the hook.

 - Add an explicit MAINTAINERS entry for the credentials code

   We've seen problems in the past where patches to the credentials code
   sent by non-maintainers would often languish on the lists for
   multiple months as there was no one explicitly tasked with the
   responsibility of reviewing and/or merging credentials related code.

   Considering that most of the code under security/ has a vested
   interest in ensuring that the credentials code is well maintained,
   I'm volunteering to look after the credentials code and Serge Hallyn
   has also volunteered to step up as an official reviewer. I posted the
   MAINTAINERS update as an RFC to LKML in hopes that someone else would
   jump up with an "I'll do it!", but beyond Serge it was all crickets.

 - Update Stephen Smalley's old email address to prevent confusion

   This includes a corresponding update to the mailmap file.

* tag 'lsm-pr-20250323' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm:
  mailmap: map Stephen Smalley's old email addresses
  lsm: remove old email address for Stephen Smalley
  MAINTAINERS: add Serge Hallyn as a credentials reviewer
  MAINTAINERS: add an explicit credentials entry
  cred,rust: mark Credential methods inline
  lsm,rust: reword "destroy" -> "release" in SecurityCtx
  lsm,rust: mark SecurityCtx methods inline
  perf: Remove unnecessary parameter of security check
  lsm: fix a missing security_uring_allowed() prototype
  io_uring,lsm,selinux: add LSM hooks for io_uring_setup()
  io_uring: refactor io_uring_allowed()
2025-03-25 15:44:19 -07:00
Pavel Begunkov 816619782b io_uring: move min_events sanitisation
iopoll and normal waiting already duplicate min_completion truncation,
so move them inside the corresponding routines.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/254adb289cc04638f25d746a7499260fa89a179e.1742829388.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-25 12:38:01 -06:00
Pavel Begunkov d73acd7af3 io_uring: rename "min" arg in io_iopoll_check()
Don't name arguments "min"; it shadows the namesake function.
min_events is also more consistent.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f52ce9d88d3bca5732a218b0da14924aa6968909.1742829388.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-25 12:37:57 -06:00
Pavel Begunkov 4c76de42cb io_uring: open code __io_post_aux_cqe()
There is no reason to keep __io_post_aux_cqe() separately from
io_post_aux_cqe().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2c4c1f68d694deea25a212fc09bbb11f330cd82e.1742829388.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-25 12:37:52 -06:00
Pavel Begunkov 3afcb3b2e3 io_uring: defer iowq cqe overflow via task_work
Don't handle CQE overflows in io_req_complete_post(); defer them to
flush_completions. It cuts some duplication, and I also want to limit
the number of places directly overflowing completions.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9046410ac27e18f2baa6f7cdb363ec921cbc3b79.1742829388.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-25 12:37:47 -06:00
Pavel Begunkov 3f0cb8de56 io_uring: fix retry handling off iowq
io_req_complete_post() doesn't handle reissue, and if called with a
REQ_F_REISSUE request it might post extra unexpected completions. Fix it
by pushing the request into flush_completions via task work.

Fixes: d803d12394 ("io_uring/rw: handle -EAGAIN retry at IO completion time")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/badb3d7e462881e7edbfcc2be6301090b07dbe53.1742829388.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-25 12:37:43 -06:00
Linus Torvalds a50b4fe095 A treewide hrtimer timer cleanup
hrtimers are initialized with hrtimer_init() and a subsequent store to
   the callback pointer. This turned out to be suboptimal for the upcoming
   Rust integration and is obviously a silly implementation to begin with.
 
   This cleanup replaces the hrtimer_init(T); T->function = cb; sequence
   with hrtimer_setup(T, cb);
 
   The conversion was done with Coccinelle and a few manual fixups.
 
   Once the conversion has completely landed in mainline, hrtimer_init()
   will be removed and the hrtimer::function becomes a private member.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmff5jQTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoVvRD/wKtuwmiA66NJFgXC0qVq82A6fO3bY8
 GBdbfysDJIbqGu5PTcULTbJ8qkqv3jeLUv6CcXvS4sZ7y/uJQl2lzf8yrD/0bbwc
 rLI6sHiPSZmK93kNVN4X5H7kvt7cE/DYC9nnEOgK3BY5FgKc4n9887d4aVBhL8Lv
 ODwVXvZ+xi351YCj7qRyPU24zt/p4tkkT1o2k4a0HBluqLI0D+V20fke9IERUL8r
 d1uWKlcn0TqYDesE8HXKIhbst3gx52rMJrXBJDHwFmG6v8Pj1fkTXCVpPo8QcBz8
 OTVkpomN9f/Tx4+GZwhZOF86LhLL3OhxD6pT7JhFCXdmSGv+Ez8uyk1YZysM/XpV
 Juy/1yAcBpDIDkmhMFGdAAn48Nn9Fotty0r4je60zSEp1d/4QMXcFme29qr2JTUE
 iWnQ/HD6DxUjVHqy7CYvvo26Xegg1C7qgyOVt4PYZwAM1VKF5P3kzYTb4SAdxtop
 Tpji1sfW9QV08jqMNo6XntD32DSP9S2HqjO9LwBw700jnx2jjJ35fcJs6iodMOUn
 gckIZLMn3L0OoglPdyA5O7SNTbKE7aFiRKdnT/cJtR3Fa39Qu27CwC5gfiyuie9I
 Q+LG8GLuYSBHXAR+PBK4GWlzJ7Dn8k3eqmbnLeKpRMsU6ZzcttgA64xhaviN2wN0
 iJbvLJeisXr3GA==
 =bYAX
 -----END PGP SIGNATURE-----

Merge tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer cleanups from Thomas Gleixner:
 "A treewide hrtimer timer cleanup

  hrtimers are initialized with hrtimer_init() and a subsequent store to
  the callback pointer. This turned out to be suboptimal for the
  upcoming Rust integration and is obviously a silly implementation to
  begin with.

  This cleanup replaces the hrtimer_init(T); T->function = cb; sequence
  with hrtimer_setup(T, cb);

  The conversion was done with Coccinelle and a few manual fixups.

  Once the conversion has completely landed in mainline, hrtimer_init()
  will be removed and the hrtimer::function becomes a private member"
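
As a hedged before/after sketch of the conversion (the timer struct and
callback names are invented for the example; this assumes hrtimer_setup()
keeps hrtimer_init()'s clockid/mode arguments and adds the callback):

    /* before */
    hrtimer_init(&t->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
    t->timer.function = my_timer_cb;

    /* after */
    hrtimer_setup(&t->timer, my_timer_cb, CLOCK_MONOTONIC, HRTIMER_MODE_REL);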

* tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (100 commits)
  wifi: rt2x00: Switch to use hrtimer_update_function()
  io_uring: Use helper function hrtimer_update_function()
  serial: xilinx_uartps: Use helper function hrtimer_update_function()
  ASoC: fsl: imx-pcm-fiq: Switch to use hrtimer_setup()
  RDMA: Switch to use hrtimer_setup()
  virtio: mem: Switch to use hrtimer_setup()
  drm/vmwgfx: Switch to use hrtimer_setup()
  drm/xe/oa: Switch to use hrtimer_setup()
  drm/vkms: Switch to use hrtimer_setup()
  drm/msm: Switch to use hrtimer_setup()
  drm/i915/request: Switch to use hrtimer_setup()
  drm/i915/uncore: Switch to use hrtimer_setup()
  drm/i915/pmu: Switch to use hrtimer_setup()
  drm/i915/perf: Switch to use hrtimer_setup()
  drm/i915/gvt: Switch to use hrtimer_setup()
  drm/i915/huc: Switch to use hrtimer_setup()
  drm/amdgpu: Switch to use hrtimer_setup()
  stm class: heartbeat: Switch to use hrtimer_setup()
  i2c: Switch to use hrtimer_setup()
  iio: Switch to use hrtimer_setup()
  ...
2025-03-25 10:54:15 -07:00
Linus Torvalds bb18645ac1 io_uring-6.14-20250322
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmfe7xsQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpnDHD/4rKlhIMGsWjBMaO1+ogNMsh9pJGRQdQZWB
 dCcIz1zRZbzY9WE1CQmkAWpM6tEbkUPMw3RrW1wduNO9W1OCQ4wW/wv9gShzi5HU
 O+yOF9s+wd2733n3Ghxr6zTFCyHeaHb5mFVQsct1eOsbdprxysQLQ6QwB4FqOcKa
 vZJyDg9Zxsd9wNblzgd7+QSMnPgsCrBhgC86mayDEsQ2e68CXFkxv8W6zBl28rx1
 auGCIF33Zic0FUWKqc2N2e2xZ0RUZKeKqf09ZpzoEVZ2Zti2IUd1RAZFaRo5SBTm
 ZhU71Eip4OqWruNSHmE8KtEgLscV5rwNZ2IH29Ywif71JTYMqEGE3Jr9O1w6REti
 bNH/ELSXEC/rpscmG904g8UQ3Nv8LcdETq0B3lxl1dnKVVy5jFe3mSvvvWIckJAJ
 t2KZf1HwX4Q0MXdi8HUnN4uswvu4xPYfjTNaEjqaK0H9U7BTY1CBkfTRjtWatPNH
 WCfeDK6Bj0T51ke/kmu3EpP6l19H1iivhC+Wz6bCgJ7mrsSlsm44ibjrsNgHeqSP
 x3jhZTM7RqIdC4UIv+LgHz/IZJ07iMVkpXGTAKBZnW+SrzTsVRfK6/RawqDmxL5T
 h63Z8rly8kTS9GmWaw82wieHqQpzJrYu2BXik6r5L5ON5DwvYeegcnMTczgldQKL
 pxQxiBO7yQ==
 =+zW+
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-6.14-20250322' of git://git.kernel.dk/linux

Pull io_uring fix from Jens Axboe:
 "Just a single fix for the commit that went into your tree yesterday,
  which exposed an issue with not always clearing notifications. That
  could cause them to be used more than once"

* tag 'io_uring-6.14-20250322' of git://git.kernel.dk/linux:
  io_uring/net: fix sendzc double notif flush
2025-03-22 10:45:44 -07:00
Pavel Begunkov 67c007d6c1 io_uring/net: fix sendzc double notif flush
refcount_t: underflow; use-after-free.
WARNING: CPU: 0 PID: 5823 at lib/refcount.c:28 refcount_warn_saturate+0x15a/0x1d0 lib/refcount.c:28
RIP: 0010:refcount_warn_saturate+0x15a/0x1d0 lib/refcount.c:28
Call Trace:
 <TASK>
 io_notif_flush io_uring/notif.h:40 [inline]
 io_send_zc_cleanup+0x121/0x170 io_uring/net.c:1222
 io_clean_op+0x58c/0x9a0 io_uring/io_uring.c:406
 io_free_batch_list io_uring/io_uring.c:1429 [inline]
 __io_submit_flush_completions+0xc16/0xd20 io_uring/io_uring.c:1470
 io_submit_flush_completions io_uring/io_uring.h:159 [inline]

Before the blamed commit, sendzc relied on io_req_msg_cleanup() to clear
REQ_F_NEED_CLEANUP, so after the following snippet the request will
never hit the core io_uring cleanup path.

io_notif_flush();
io_req_msg_cleanup();

The easiest fix is to null the notification. io_send_zc_cleanup() can
still be called afterwards, but that's tolerated.

Reported-by: syzbot+cf285a028ffba71b2ef5@syzkaller.appspotmail.com
Tested-by: syzbot+cf285a028ffba71b2ef5@syzkaller.appspotmail.com
Fixes: cc34d8330e ("io_uring/net: don't clear REQ_F_NEED_CLEANUP unconditionally")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e1306007458b8891c88c4f20c966a17595f766b0.1742643795.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-22 08:14:36 -06:00
Caleb Sander Mateos 8e3100fcc5 io_uring/net: only import send_zc buffer once
io_send_zc() guards its call to io_send_zc_import() with if (!done_io)
in an attempt to avoid calling it redundantly on the same req. However,
if the initial non-blocking issue returns -EAGAIN, done_io will stay 0.
This causes the subsequent issue to unnecessarily re-import the buffer.

Add an explicit flag "imported" to io_sr_msg to track if its buffer has
already been imported. Clear the flag in io_send_zc_prep(). Call
io_send_zc_import() and set the flag in io_send_zc() if it is unset.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Fixes: 54cdcca05a ("io_uring/net: switch io_send() and io_send_zc() to using io_async_msghdr")
Link: https://lore.kernel.org/r/20250321184819.3847386-2-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-21 12:53:23 -06:00
Pavel Begunkov ef49027529 io_uring/cmd: introduce io_uring_cmd_import_fixed_vec
io_uring_cmd_import_fixed_vec() is a cmd helper around vectored
registered buffer import functions, which caches the memory under
the hood. The lifetime of the vector, and hence the iterator, is bound to
the request. Furthermore, the user is not allowed to call it multiple
times for a single request.
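
A hedged driver-side sketch (the argument order is an assumption; uvec
and nr_segs would come from the SQE of the driver's ->uring_cmd()):

    struct iov_iter iter;
    int ret;

    /* import the user iovec into the request-bound cached vector */
    ret = io_uring_cmd_import_fixed_vec(ioucmd, uvec, nr_segs,
                                        ITER_DEST, &iter, issue_flags);
    if (ret)
            return ret;
    /* &iter is now valid for the lifetime of this request */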

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/97487a80dec3fb8cf8aeedf1f9026ef6d503fe4b.1742579999.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-21 12:52:15 -06:00
Pavel Begunkov 3a4689ac10 io_uring/cmd: add iovec cache for commands
Add iou_vec to commands and wire caching for it, but don't expose it to
users just yet. We need the vec cleared on initial alloc, but since
we can't place it at the beginning at the moment, zero the entire
async_data. It's cached, and the performance effects only the initial
allocation, and it might be not a bad idea since we're exposing those
bits to outside drivers.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c0f2145b75791bc6106eb4e72add2cf6a2c72a7a.1742579999.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-21 12:52:15 -06:00
Linus Torvalds d07de43e3f io_uring-6.14-20250321
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmfdXP0QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpsB1EACD4LL93u/GUfCDI3gHMYdGv1Yy79j+qcB9
 +ikQYy+pM9iPHvFjXDZ8ZNzveaUic5OAXvrIQcfexcpbbJS3LxkmppBWZ7/gQxhA
 +nNXXcxYvhmzstnGTjsSlkUOosqUKWyo1M3JsFhZHL3H+Te+FGiSg82wwDdbNS69
 eezOzha4njxpkeIgcWI4Qzryg5//bYZKZBdf078m+y/Qa4xL9276VY3Y8bg+R51l
 wVYuDv5EjJT/zItXYr0uU+NRNFraMP1B23Ew+6EkdV/t4pzaH45H6+WfHv4tV5hS
 JbGZbvt/Zbvv9citAWsxqUkrGLDuveugKBEH/dKvffBW/tk0RmoiQrs3ZpO+g0/+
 Kdlfv/ggALWYC+QYUVyTzmb/xGk5YipSzN06M6t1+ELbedpaZyFWj5vhNSEsqc8l
 4edOoyA4ZPafGaipTNfsE4kSNk7UL1AiWqPxJWq3O9WYkVj2Zt9sD3dbhuddIIwm
 bEAbOvOEZXSQh8EBp0x2epkKc5y8ma75OdZui3cOlQEAKq+JMkPSYxPJLqkgAmrr
 SAFUoRYtM6ShKX6AGdwzk6w4htp1L3G8tlGssM4eqC2e9wQJ9B3nCep1Xvs48rjW
 MjpZfxfFRrfV14WHGun+63F9lbdGW4GJqnMbzpvdTrTmGfdOE88frCdcw0EEHRxB
 EmAoCbJ+aw==
 =TEDX
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-6.14-20250321' of git://git.kernel.dk/linux

Pull io_uring fix from Jens Axboe:
 "Single fix heading to stable, fixing an issue with io_req_msg_cleanup()
  sometimes too eagerly clearing cleanup flags"

* tag 'io_uring-6.14-20250321' of git://git.kernel.dk/linux:
  io_uring/net: don't clear REQ_F_NEED_CLEANUP unconditionally
2025-03-21 10:30:15 -07:00
Jens Axboe 07754bfd9a io_uring: enable toggle of iowait usage when waiting on CQEs
By default, io_uring marks a waiting task as being in iowait, if it's
sleeping waiting on events and there are pending requests. This isn't
necessarily always useful, and may be confusing on non-storage setups
where iowait isn't expected. It can also cause extra power usage, by
preventing the CPU from entering lower sleep states.

This adds a new enter flag, IORING_ENTER_NO_IOWAIT. If set, then
io_uring will not account the sleeping task as being in iowait. If the
kernel supports this feature, then it will be marked by having the
IORING_FEAT_NO_IOWAIT feature flag set.

As the kernel currently does not support separating the iowait
accounting and CPU frequency boosting, the IORING_ENTER_NO_IOWAIT
controls both of these at the same time. In the future, if those do end
up being split, then it'd be possible to control them separately.
However, it seems more likely that the kernel will decouple iowait and
CPU frequency boosting anyway.
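
A minimal userspace sketch of the intended use, probing the feature flag
at setup time and passing the enter flag on waits (using liburing's raw
io_uring_setup()/io_uring_enter() syscall wrappers; error handling
omitted):

    struct io_uring_params p = { };
    int ring_fd = io_uring_setup(8, &p);
    unsigned flags = IORING_ENTER_GETEVENTS;

    if (p.features & IORING_FEAT_NO_IOWAIT)
            flags |= IORING_ENTER_NO_IOWAIT;

    /* wait for one CQE without being accounted as iowait */
    io_uring_enter(ring_fd, 0, 1, flags, NULL);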

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-20 20:01:03 -06:00
Jens Axboe cc34d8330e io_uring/net: don't clear REQ_F_NEED_CLEANUP unconditionally
io_req_msg_cleanup() relies on the fact that io_netmsg_recycle() will
always fully recycle, but that may not be the case if the msg cache
was already full. To ensure that normal cleanup always gets run,
let io_netmsg_recycle() deal with clearing the relevant cleanup flags,
as it knows exactly when that should be done.

Cc: stable@vger.kernel.org
Reported-by: David Wei <dw@davidwei.uk>
Fixes: 7519134178 ("io_uring/net: add iovec recycling")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-20 12:27:27 -06:00
Pavel Begunkov 5f14404bfa io_uring/cmd: don't expose entire cmd async data
io_uring needs private bits in cmd's ->async_data, and they should never
be exposed to drivers as it'd certainly be abused. Leave struct
io_uring_cmd_data for the drivers but wrap it into a structure. It's a
prep patch and doesn't do anything useful yet.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20250319061251.21452-3-sidong.yang@furiosa.ai
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-19 09:25:55 -06:00
Pavel Begunkov 575e7b0629 io_uring: rename the data cmd cache
Pick a more descriptive name for the cmd async data cache.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20250319061251.21452-2-sidong.yang@furiosa.ai
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-19 09:25:55 -06:00
Paolo Abeni 941defcea7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.14-rc6).

Conflicts:

tools/testing/selftests/drivers/net/ping.py
  75cc19c8ff ("selftests: drv-net: add xdp cases for ping.py")
  de94e86974 ("selftests: drv-net: store addresses in dict indexed by ipver")
https://lore.kernel.org/netdev/20250311115758.17a1d414@canb.auug.org.au/

net/core/devmem.c
  a70f891e0f ("net: devmem: do not WARN conditionally after netdev_rx_queue_restart()")
  1d22d3060b ("net: drop rtnl_lock for queue_mgmt operations")
https://lore.kernel.org/netdev/20250313114929.43744df1@canb.auug.org.au/

Adjacent changes:

tools/testing/selftests/net/Makefile
  6f50175cca ("selftests: Add IPv6 link-local address generation tests for GRE devices.")
  2e5584e0f9 ("selftests/net: expand cmsg_ipv6.sh with ipv4")

drivers/net/ethernet/broadcom/bnxt/bnxt.c
  661958552e ("eth: bnxt: do not use BNXT_VNIC_NTUPLE unconditionally in queue restart logic")
  fe96d717d3 ("bnxt_en: Extend queue stop/start for TX rings")

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-13 23:08:11 +01:00
Jens Axboe cf9536e550 io_uring/kbuf: enable bundles for incrementally consumed buffers
The original support for incrementally consumed buffers didn't allow it
to be used with bundles, with the assumption being that incremental
buffers are generally larger, and hence there's less of a need to
support it.

But that assumption may not be correct - it's perfectly viable to use
smaller buffers with incremental consumption, and there may be valid
reasons for an application or framework to do so.

As there's really no need to explicitly disable bundles with
incrementally consumed buffers, allow it. This actually makes the peek
side cheaper and simpler, with the completion side basically the same,
just needing to iterate for the consumed length.
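
A hedged sketch of combining the two features with liburing (assuming
the incremental-consumption flag is IOU_PBUF_RING_INC and that bundles
are requested via IORING_RECVSEND_BUNDLE in sqe->ioprio, as for regular
bundled recv; BGID is an arbitrary buffer group id):

    /* provide an incrementally-consumed buffer ring */
    br = io_uring_setup_buf_ring(&ring, 8, BGID, IOU_PBUF_RING_INC, &err);

    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_recv(sqe, sock_fd, NULL, 0, 0);
    sqe->flags |= IOSQE_BUFFER_SELECT;
    sqe->buf_group = BGID;
    sqe->ioprio |= IORING_RECVSEND_BUNDLE;  /* bundle multiple buffers */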

Reported-by: Norman Maurer <norman_maurer@apple.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-10 16:24:43 -06:00
Keith Busch 334f795ff8 Revert "io_uring/rsrc: simplify the bvec iter count calculation"
This reverts commit 2a51c327d4.

The kernel registered bvecs do use the iov_iter_advance() API, so we
can't rely on this simplification anymore.

Fixes: 27cb27b6d5 ("io_uring: add support for kernel registered bvecs")
Reported-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250310184825.569371-1-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-10 16:24:43 -06:00
Pavel Begunkov 146acfd0f6 io_uring: rely on io_prep_reg_vec for iovec placement
All vectored reg buffer users should use io_import_reg_vec() for iovec
imports. Since iovec placement is the function's responsibility and
callers shouldn't know much about it, drop the offset parameter from
io_prep_reg_vec() and calculate it inside.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/08ed87ca4bbc06724373b6ce06f36b703fe60c4e.1741457480.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-10 07:14:27 -06:00
Pavel Begunkov d291fb6520 io_uring: introduce io_prep_reg_iovec()
iovecs that are turned into registered buffers are imported in a special
way with an offset, so that later we can do an in-place translation. Add
a helper function taking care of it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7de2ecb9ed5efc3c5cf320232236966da5ad4ccc.1741457480.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-10 07:14:27 -06:00
Pavel Begunkov 5027d02452 io_uring: unify STOP_MULTISHOT with IOU_OK
IOU_OK means that the request ownership is now handed back to core
io_uring and it has to complete it using the result provided in
req->cqe. The same is true for multishot and IOU_STOP_MULTISHOT.

Rename it to IOU_COMPLETE to avoid confusion and use it for both modes.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e6a5b2edb0eb9558acb1c8f1db38ac45fee95491.1741453534.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-10 07:14:18 -06:00
Pavel Begunkov 7a9dcb05f5 io_uring: return -EAGAIN to continue multishot
Multishot errors can be mapped 1:1 to normal errors, but they are not
identical. This leads to a peculiar situation where all multishot requests
have to check in what context they're run and return different codes.

Unify them, starting with the EAGAIN / IOU_ISSUE_SKIP_COMPLETE(EIOCBQUEUED)
pair, which means that core io_uring still owns the request and it should
be retried. In the multishot case it naturally just continues to poll;
otherwise it might poll, use iowq or do any other kind of allowed
blocking. Introduce IOU_RETRY, aliased to -EAGAIN, for that.

Apart from obvious upsides, multishot can now also check for misuse of
IOU_ISSUE_SKIP_COMPLETE.
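
A hedged sketch of the unified contract at a call site (the issue
function and surrounding logic are illustrative, not the actual core
code):

    ret = io_issue_sqe(req, issue_flags);
    if (ret == IOU_RETRY) {
            /* core io_uring still owns the request: multishot keeps
             * polling; otherwise poll, punt to io-wq, or block,
             * whatever this context allows */
    } else if (ret == IOU_ISSUE_SKIP_COMPLETE) {
            /* ownership was handed off, don't complete req here */
    }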

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/da117b79ce72ecc3ab488c744e29fae9ba54e23b.1741453534.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-10 07:14:18 -06:00
Linus Torvalds d53276d292 io_uring-6.14-20250306
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmfKQpsQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgphWWD/0Qi6/RyM71So/U3kB6HkwgDd3w7JdeWcmB
 EZN9qaALS3r/3Q/qNQTT6/2dLNvjw/MWAlLV7R511x7f+9lMJp6KxTyTkmxXbkd6
 yTMAm5E19vE1zeQZvHP9SzRKU5o2mh7yVnkNxHqqVbAzePtRGFGPLG/gQ4zCDq7I
 iqThGN4Hf5hZYM6QH6R25b/iljYYyr6kF8l96TLiY/QKp4ZTPR+dKjV4xfXFnVeF
 BjC/SzvQQqyHFDtpoXWeUemcxAE2tzPhAZ6w2p/56S2rkbfpxsWnwF+Ut4tteb0R
 pSBehvxfzkD9hBlADDRja0KK+WM95RBNJ0/Pf7dNBhqC3jhxNFH3FcBf66ot43fH
 8Jjshq9Wi5Zfdr176VzZCgpSSVihluctNRTpgDCR81289WsdBnpkhPXE0NSfw5su
 iulG3kjKJwQmIO/k+S5dBvT9/X+FzjQCuV4abuVyeDVJo2IXKbzGbWqp3ueUNuic
 kprQwEhcyPYtiVXpwsssnEMYjVx0pUc9FIM72ed+5hoaH+BejuGFHNAkKUOhCSvi
 S3SM4gVHZE4soDU8wevRuU4Kgag+pW/z1EYltA7DKcTJMGx9gO+uFlUJTSXoOFzO
 3kD4oXwnMc/CfHxAGWOWeCiANGaWnEkJYmwGG/0R80oG9Usji43ucErvs3p6zRTE
 iNFs/IfwPw==
 =v5sq
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-6.14-20250306' of git://git.kernel.dk/linux

Pull io_uring fix from Jens Axboe:
 "A single fix for a regression introduced in the 6.14 merge window,
  causing stalls/hangs with IOPOLL reads or writes"

* tag 'io_uring-6.14-20250306' of git://git.kernel.dk/linux:
  io_uring/rw: ensure reissue path is correctly handled for IOPOLL
2025-03-07 11:09:33 -10:00
Yue Haibing 30c970354c io_uring: Remove unused declaration io_alloc_async_data()
Commit ef623a647f ("io_uring: Move old async data allocation helper
to header") leave behind this unused declaration.

Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Link: https://lore.kernel.org/r/20250305013454.3635021-1-yuehaibing@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-07 14:09:16 -07:00
Pavel Begunkov 0396ad3766 io_uring: cap cached iovec/bvec size
Bvecs can be large, so put an arbitrary limit on the max vector size
that can be cached.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/823055fa6628daa24bbc9cd77c2da87e9a1e1e32.1741362889.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-07 13:41:08 -07:00
Pavel Begunkov 23371eac7d io_uring/net: implement vectored reg bufs for zctx
Add support for vectored registered buffers for send zc.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e484052875f862d2dca99f0f8c04407c1d51a1c1.1741362889.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-07 13:41:08 -07:00
Pavel Begunkov be7052a4b5 io_uring/net: convert to struct iou_vec
Convert net.c to use struct iou_vec.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6437b57dabed44eca708c02e390529c7ed211c78.1741362889.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-07 13:41:08 -07:00
Pavel Begunkov 9fcb349f5a io_uring/net: pull vec alloc out of msghdr import
I'll need more control over iovec management, so move
io_net_import_vec() out of io_msg_copy_hdr().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9600ea6300f620e65d39da481c22605ddc898850.1741362889.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-07 13:41:08 -07:00
Pavel Begunkov 17523a821d io_uring/net: combine msghdr copy
Call the compat version from inside of io_msg_copy_hdr() and don't
duplicate it in callers.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/25795660f7b31f9273911c99f495d9c2b169ecda.1741362889.git.asml.silence@gmail.com
[axboe: fixup msg pointer vs variable braino in io_msg_copy_hdr()]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-07 13:40:39 -07:00
Pavel Begunkov 835c4bdf95 io_uring/rw: defer reg buf vec import
Import registered buffers for vectored reads and writes later at issue
time as we now do for other fixed ops.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e8491c976e4ab83a4e3dc428e9fe7555e59583b8.1741362889.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-07 09:07:29 -07:00
Pavel Begunkov bdabba04bb io_uring/rw: implement vectored registered rw
Implement registered buffer vectored reads with new opcodes
IORING_OP_WRITEV_FIXED and IORING_OP_READV_FIXED.
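
A hedged sketch of issuing the new opcode with a raw SQE, assuming it
mirrors IORING_OP_READV with sqe->buf_index selecting the registered
buffer that backs the iovecs:

    struct iovec iov[2] = {
            { .iov_base = buf,        .iov_len = 4096 },
            { .iov_base = buf + 8192, .iov_len = 4096 },
    };
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

    io_uring_prep_rw(IORING_OP_READV_FIXED, sqe, fd, iov, 2, file_off);
    sqe->buf_index = 0;     /* registered buffer backing the vectors */
    io_uring_submit(&ring);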

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d7c89eb481e870f598edc91cc66ff4d1e4ae3788.1741362889.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-07 09:07:29 -07:00
Pavel Begunkov 9ef4cbbcb4 io_uring: add infra for importing vectored reg buffers
Add io_import_reg_vec(), which will be responsible for importing
vectored registered buffers. The function might reallocate the vector,
but it'd try to do the conversion in place first, which is why it's
required of the user to pad the iovec to the right border of the cache.

Overlapping also depends on struct iovec being larger than bvec, which
is not the case on e.g. 32 bit architectures. Don't try to complicate
this case and make sure vectors never overlap; it'll be improved later.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/60bd246b1249476a6996407c1dbc38ef6febad14.1741362889.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-07 09:07:29 -07:00
Pavel Begunkov e1d4995909 io_uring: introduce struct iou_vec
I need a convenient way to pass around and work with an iovec+size
pair, so put them into a structure and make use of it in rw.c.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d39fadafc9e9047b0a292e5be6db3cf2f48bb1f7.1741362889.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-07 09:07:29 -07:00
Jens Axboe 6e3da40ed6 Merge branch 'for-6.15/io_uring-epoll-wait' into for-6.15/io_uring-reg-vec
* for-6.15/io_uring-epoll-wait:
  io_uring/epoll: add support for IORING_OP_EPOLL_WAIT
  io_uring/epoll: remove CONFIG_EPOLL guards
  eventpoll: add epoll_sendevents() helper
  eventpoll: abstract out ep_try_send_events() helper
  eventpoll: abstract out parameter sanity checking
2025-03-07 09:07:19 -07:00
Jens Axboe 78b6f6e9bf Merge branch 'for-6.15/io_uring-rx-zc' into for-6.15/io_uring-reg-vec
* for-6.15/io_uring-rx-zc: (80 commits)
  io_uring/zcrx: add selftest case for recvzc with read limit
  io_uring/zcrx: add a read limit to recvzc requests
  io_uring: add missing IORING_MAP_OFF_ZCRX_REGION in io_uring_mmap
  io_uring: Rename KConfig to Kconfig
  io_uring/zcrx: fix leaks on failed registration
  io_uring/zcrx: recheck ifq on shutdown
  io_uring/zcrx: add selftest
  net: add documentation for io_uring zcrx
  io_uring/zcrx: add copy fallback
  io_uring/zcrx: throttle receive requests
  io_uring/zcrx: set pp memory provider for an rx queue
  io_uring/zcrx: add io_recvzc request
  io_uring/zcrx: dma-map area for the device
  io_uring/zcrx: implement zerocopy receive pp memory provider
  io_uring/zcrx: grab a net device
  io_uring/zcrx: add io_zcrx_area
  io_uring/zcrx: add interface queue and refill queue
  net: add helpers for setting a memory provider on an rx queue
  net: page_pool: add memory provider helpers
  net: prepare for non devmem TCP memory providers
  ...
2025-03-07 09:07:11 -07:00
Jakub Kicinski 2525e16a2b Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.14-rc6).

Conflicts:

net/ethtool/cabletest.c
  2bcf4772e4 ("net: ethtool: try to protect all callback with netdev instance lock")
  637399bf7e ("net: ethtool: netlink: Allow NULL nlattrs when getting a phy_device")

No adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-06 13:03:35 -08:00
Jens Axboe bcb0fda3c2 io_uring/rw: ensure reissue path is correctly handled for IOPOLL
The IOPOLL path posts CQEs when the io_kiocb is marked as completed,
so it cannot rely on the usual retry that non-IOPOLL requests do for
read/write requests.

If -EAGAIN is received and the request should be retried, go through
the normal completion path and let the normal flush logic catch it and
reissue it, like what is done for !IOPOLL reads or writes.

Fixes: d803d12394 ("io_uring/rw: handle -EAGAIN retry at IO completion time")
Reported-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/io-uring/2b43ccfa-644d-4a09-8f8f-39ad71810f41@oracle.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-05 14:03:34 -07:00
Caleb Sander Mateos 0d83b8a9f1 io_uring: introduce io_cache_free() helper
Add a helper function io_cache_free() that returns an allocation to a
io_alloc_cache, falling back on kfree() if the io_alloc_cache is full.
This is the inverse of io_cache_alloc(), which takes an allocation from
an io_alloc_cache and falls back on kmalloc() if the cache is empty.

Convert 4 callers to use the helper.
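
The shape of the helper as described, as a sketch (consistent with the
text above; the in-tree version may differ in detail):

    static inline void io_cache_free(struct io_alloc_cache *cache, void *obj)
    {
            /* try to return the object to the cache... */
            if (!io_alloc_cache_put(cache, obj))
                    /* ...fall back to kfree() if the cache is full */
                    kfree(obj);
    }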

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Suggested-by: Li Zetao <lizetao1@huawei.com>
Link: https://lore.kernel.org/r/20250304194814.2346705-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-05 07:38:55 -07:00
Caleb Sander Mateos fe21a4532e io_uring/rsrc: skip NULL file/buffer checks in io_free_rsrc_node()
io_rsrc_nodes of type IORING_RSRC_FILE always have a file attached
immediately after they are allocated. IORING_RSRC_BUFFER nodes won't be
returned from io_sqe_buffer_register()/io_buffer_register_bvec() until
they have a io_mapped_ubuf attached.

So remove the checks for a NULL file/buffer in io_free_rsrc_node().

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250228235916.670437-5-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-04 07:17:15 -07:00
Caleb Sander Mateos 6e5d321a08 io_uring/rsrc: avoid NULL node check on io_sqe_buffer_register() failure
The done: label is only reachable if node is non-NULL. So don't bother
checking; just call io_free_node().

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250228235916.670437-4-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-04 07:17:15 -07:00
Caleb Sander Mateos 13f7f9686e io_uring/rsrc: call io_free_node() on io_sqe_buffer_register() failure
io_sqe_buffer_register() currently calls io_put_rsrc_node() if it fails
to fully set up the io_rsrc_node. io_put_rsrc_node() is more involved
than necessary, since we already know the reference count will reach 0
and no io_mapped_ubuf has been attached to the node yet.

So just call io_free_node() to release the node's memory. This also
avoids the need to temporarily set the node's buf pointer to NULL.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250228235916.670437-3-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-04 07:17:15 -07:00
Caleb Sander Mateos a387b96d2a io_uring/rsrc: free io_rsrc_node using kfree()
io_rsrc_node_alloc() calls io_cache_alloc(), which uses kmalloc() to
allocate the node. So it can be freed with kfree() instead of kvfree().

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250228235916.670437-2-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-04 07:17:15 -07:00
Caleb Sander Mateos 6a53541829 io_uring/rsrc: split out io_free_node() helper
Split the freeing of the io_rsrc_node from io_free_rsrc_node(), for use
with nodes that haven't been fully initialized.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250228235916.670437-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-04 07:17:15 -07:00
Caleb Sander Mateos a1967280a1 io_uring/rsrc: include io_uring_types.h in rsrc.h
io_uring/rsrc.h uses several types from include/linux/io_uring_types.h.
Include io_uring_types.h explicitly in rsrc.h to avoid depending on
users of rsrc.h including io_uring_types.h first.
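
That is:

    /* io_uring/rsrc.h */
    #include <linux/io_uring_types.h>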

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Li Zetao <lizetao1@huawei.com>
Link: https://lore.kernel.org/r/20250301183612.937529-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-04 07:16:07 -07:00
Caleb Sander Mateos 6e83a442fb io_uring/nop: use io_find_buf_node()
Call io_find_buf_node() to avoid duplicating its lookup logic in io_nop().

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250301001610.678223-2-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-28 19:35:37 -07:00
Caleb Sander Mateos bf931be52e io_uring/rsrc: declare io_find_buf_node() in header file
Declare io_find_buf_node() in io_uring/rsrc.h so it can be called from
other files.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250301001610.678223-1-csander@purestorage.com
[axboe: keep the inline for local hot path usage]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-28 19:35:22 -07:00
Caleb Sander Mateos e6ea7ec494 io_uring/ublk: report error when unregister operation fails
Report an error code to userspace applications when a
UBLK_IO_UNREGISTER_IO_BUF command specifies an invalid buffer index.
Return -EINVAL if no buffer is registered with the given index, and
-EBUSY if the registered buffer is not a kernel bvec.
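
A sketch of the two checks (lookup helper and flag names assumed for
illustration):

    node = io_rsrc_node_lookup(&data->buf_table, index);
    if (!node)
            return -EINVAL; /* nothing registered at this index */
    if (!node->buf->is_kbuf)
            return -EBUSY;  /* registered, but not a kernel bvec */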

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250228231432.642417-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-28 19:15:05 -07:00
Caleb Sander Mateos 0c542a69cb io_uring/uring_cmd: specify io_uring_cmd_import_fixed() pointer type
Callers pass io_uring_cmd_import_fixed() a struct io_uring_cmd *, but
its ioucmd parameter is declared void *. Make the pointer type explicit
so the compiler can type-check it.
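
In sketch form (the surrounding parameter list is reproduced from
memory and may not match the tree exactly):

    /* before: the command pointer is untyped */
    int io_uring_cmd_import_fixed(u64 ubuf, unsigned long len, int rw,
                                  struct iov_iter *iter, void *ioucmd);

    /* after: the compiler can now catch mismatched arguments */
    int io_uring_cmd_import_fixed(u64 ubuf, unsigned long len, int rw,
                                  struct iov_iter *iter,
                                  struct io_uring_cmd *ioucmd);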

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250228221514.604350-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-28 19:14:43 -07:00
Caleb Sander Mateos 2fced37638 io_uring/rsrc: use rq_data_dir() to compute bvec dir
The macro rq_data_dir() already computes a request's data direction.
Use it in place of the if-else to set imu->dir.
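
rq_data_dir() evaluates to READ or WRITE, so the assignment collapses
to a one-liner; the sketch below assumes the direction flags are laid
out as 1 << READ and 1 << WRITE:

    imu->dir = 1 << rq_data_dir(rq); /* replaces the op_is_write() if-else */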

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250228223057.615284-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-28 19:11:55 -07:00
Linus Torvalds 3e5d15dd83 io_uring-6.14-20250228
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmfBwzgQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgphH+D/9+jEgD+Jk+fWv4EUWEbmDTlGNOHgLowPo7
 TUV+AHOsI2I0ipzzQ8Kf75trmYmMNQIsItkGOW5QFQiZzWaUEvEjRH8miwtkkQk0
 WTLsbHwPUCl0qEAyH7F2gRWHBIeykcQOZR4PQEjvI1BobCrJMW3/6sS2atuGqQ75
 rQtYWnsBMpPfnu7+xNatlpebGE1UjzPGClo2s12caTTGg+2TjeCp82CGat6+wiS5
 KLEAg0jwpsSYUvfRjK0VUwiA3ky7b/a3F3SaAoJCq866eUm5oQKSZtsWhyeKswPn
 5jqXdvBlI7QB4jlMuT025MGfXdiOc7n6TDVhEFh/HKpvvYVVT1xkSI58qG6CNqnp
 y/GLDGLcDWw1U7L+36DtmwroprrLMGWkAgEgzvXiL94mN/dfGPCZEnI+21M1k1LZ
 LAnUP6Gcf9dittyGCMHUcFl67wy0ZqcOdgGAVBi0qLav6dmQ47IRlJAHwazXOH3q
 AfGFs8WzQNeLYZ1xOab442jnF2Wz11YsA6E0kU4fuv/azjIW00vLBRtCLA7G49eU
 bTblbNaERC/XPOeUzlXMC2b+acWpDnrmk5V3ecL2Z/NB5cpn3h7wWwzVTnK0jsF7
 H2CXwWIgFf1iboWuXMVqsjYtWvGZcSLwA58aHdQDCoJFTjdSVqGvqijlSOWw/359
 B/cmz/RY3g==
 =J9H9
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-6.14-20250228' of git://git.kernel.dk/linux

Pull io_uring fix from Jens Axboe:
 "Just a single fix headed for stable, ensuring that msg_control is
  properly saved in compat mode as well"

* tag 'io_uring-6.14-20250228' of git://git.kernel.dk/linux:
  io_uring/net: save msg_control for compat
2025-02-28 09:11:15 -08:00
Keith Busch ed9f3112a8 io_uring: cache nodes and mapped buffers
Frequent alloc/free cycles on these are pretty costly. Use an io cache
to reuse these buffers more efficiently.
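
The pattern, in sketch form (cache field names assumed):

    /* pop a recycled node if one is cached, else fall back to kmalloc() */
    node = io_cache_alloc(&ctx->node_cache, GFP_KERNEL);

    /* on free, stash the node for reuse; kfree() only if the cache is full */
    io_cache_free(&ctx->node_cache, node);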

Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20250227223916.143006-7-kbusch@meta.com
[axboe: fix imu leak]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-28 07:05:46 -07:00
Keith Busch 27cb27b6d5 io_uring: add support for kernel registered bvecs
Provide an interface for the kernel to leverage io_uring's existing
pre-registered buffer support. User space can reference these buffers
later to achieve zero-copy IO.

User space must register an empty fixed buffer table with io_uring in
order for the kernel to make use of it.
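
For the userspace half, a minimal liburing sketch (queue depth and
table size are arbitrary here):

    #include <liburing.h>

    int main(void)
    {
            struct io_uring ring;

            if (io_uring_queue_init(8, &ring, 0))
                    return 1;
            /* sparse (empty) fixed buffer table: 16 slots the kernel can
             * later fill with bvec-backed buffers */
            if (io_uring_register_buffers_sparse(&ring, 16))
                    return 1;
            io_uring_queue_exit(&ring);
            return 0;
    }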

Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20250227223916.143006-5-kbusch@meta.com
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-28 07:05:36 -07:00