mirror-linux

Commit Graph

Author	SHA1	Message	Date
Jens Axboe	c5e9f6a96b	io_uring: unify getting ctx from passed in file descriptor io_uring_enter() and io_uring_register() end up having duplicated code for getting a ctx from a passed in file descriptor, for either a registered ring descriptor or a normal file descriptor. Move the io_uring_register_get_file() into io_uring.c and name it a bit more generically, and use it from both callsites rather than have that logic and handling duplicated. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-08 13:21:35 -06:00
Jens Axboe	b4d893d636	io_uring/register: don't get a reference to the registered ring fd This isn't necessary and was only done because the register path isn't a hot path and hence the extra ref/put doesn't matter, and to have the exit path be able to unconditionally put whatever file was gotten regardless of the type. In preparation for sharing this code with the main io_uring_enter(2) syscall, drop the reference and have the caller conditionally put the file if it was a normal file descriptor. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-08 13:21:35 -06:00
Jens Axboe	7880174e1e	io_uring/tctx: clean up __io_uring_add_tctx_node() error handling Refactor __io_uring_add_tctx_node() so that on error it never leaves current->io_uring pointing at a half-setup tctx. This moves the assignment of current->io_uring to the end of the function post any failure points. Separate out the node installation into io_tctx_install_node() to further clean this up. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-08 13:21:34 -06:00
Jens Axboe	2c453a4281	io_uring/tctx: have io_uring_alloc_task_context() return tctx Instead of having io_uring_alloc_task_context() return an int and assign tsk->io_uring, just have it return the task context directly. This enables cleaner error handling in callers, which may have failure points post calling io_uring_alloc_task_context(). Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-08 13:21:30 -06:00
Yang Xiuwei	f847bf6d29	io_uring/timeout: use 'ctx' consistently There's already a local ctx variable, yet cq_timeouts accounting uses req->ctx. Use ctx consistently. Signed-off-by: Yang Xiuwei <yangxiuwei@kylinos.cn> Link: https://patch.msgid.link/20260402014952.260414-1-yangxiuwei@kylinos.cn Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-02 07:08:40 -06:00
Joanne Koong	c7f3aaf3e8	io_uring/rw: clean up __io_read() obsolete comment and early returns After commit `a9165b83c1` ("io_uring/rw: always setup io_async_rw for read/write requests") which moved the iovec allocation into the prep path and stores it in req->async_data where it now gets freed as part of the request lifecycle, this comment is now outdated. Remove it and clean up the goto as well. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Link: https://patch.msgid.link/20260401173511.4052303-1-joannelkoong@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-02 06:55:50 -06:00
Pavel Begunkov	4c6f93951b	io_uring/zcrx: use correct mmap off constants zcrx was using IORING_OFF_PBUF_SHIFT during first iterations, but there is now a separate constant it should use. Both are 16 so it doesn't change anything, but improve it for the future. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/fe16ebe9ba4048a7e12f9b3b50880bd175b1ce03.1774780198.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-02 06:55:48 -06:00
Pavel Begunkov	7120b87bed	io_uring/zcrx: use dma_len for chunk size calculation Buffers are now dma-mapped earlier and we can sg_dma_len(), otherwise, since it's walking with for_each_sgtable_dma_sg(), it might wrongfully reject some configurations. As a bonus, it'd now be able to use larger chunks if dma addresses are coalesced e.g by iommu. Fixes: 8c0cab0b7bf7 ("io_uring/zcrx: always dma map in advance") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/03b219af3f6cfdd1cf64679b8bab7461e47cc123.1774780198.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-02 06:55:47 -06:00
Pavel Begunkov	52dcd1776b	io_uring/zcrx: don't clear not allocated niovs Now that area->is_mapped is set earlier before niovs array is allocated, io_zcrx_free_area -> io_zcrx_unmap_area in an error path can try to clear dma addresses for unallocated niovs, fix it. Fixes: 8c0cab0b7bf7 ("io_uring/zcrx: always dma map in advance") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/cbcb7749b5a001ecd4d1c303515ce9403215640c.1774780198.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-02 06:55:36 -06:00
Pavel Begunkov	8ae2837d5a	io_uring/zcrx: don't use mark0 for allocating xarray XA_MARK_0 is not compatible with xarray allocating entries, use XA_MARK_1. Fixes: fda90d43f4fac ("io_uring/zcrx: return back two step unregistration") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/f232cfd3c466047d333b474dd2bddd246b6ebb82.1774780198.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-01 10:21:13 -06:00
Anas Iqbal	77d8c8d0f1	io_uring: cast id to u64 before shifting in io_allocate_rbuf_ring() Smatch warns: io_uring/zcrx.c:393 io_allocate_rbuf_ring() warn: should 'id << 16' be a 64 bit type? The expression 'id << IORING_OFF_PBUF_SHIFT' is evaluated using 32-bit arithmetic because id is a u32. This may overflow before being promoted to the 64-bit mmap_offset. Cast id to u64 before shifting to ensure the shift is performed in 64-bit arithmetic. Signed-off-by: Anas Iqbal <mohd.abd.6602@gmail.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/52400e1b343691416bef3ed3ae287fb1a88d407f.1774780198.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-01 10:21:13 -06:00
Pavel Begunkov	a9d008489f	io_uring/zcrx: reject REG_NODEV with large rx_buf_size The copy fallback path doesn't care about the actual niov size and only uses first PAGE_SIZE bytes, and any additional space will be wasted. Since ZCRX_REG_NODEV solely relies on the copy path, it doesn't make sense to support non-standard rx_buf_len. Reject it for now, and re-enable once improved. Fixes: c11728021d5cd ("io_uring/zcrx: implement device-less mode for zcrx") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/3e7652d9c27f8ac5d2b141e3af47971f2771fb05.1774780198.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-01 10:21:13 -06:00
Amir Mohammad Jahangirzad	85a58309c0	io_uring/cancel: validate opcode for IORING_ASYNC_CANCEL_OP io_async_cancel_prep() reads the opcode selector from sqe->len and stores it in cancel->opcode, which is an 8-bit field. Since sqe->len is a 32-bit value, values larger than U8_MAX are implicitly truncated. This can cause unintended opcode matches when the truncated value corresponds to a valid io_uring opcode. For example, submitting a value such as 0x10b will be truncated to 0x0b (IORING_OP_TIMEOUT), allowing a cancel request to match operations it did not intend to target. Validate the opcode value before assigning it to the 8-bit field and reject values outside the valid io_uring opcode range. Signed-off-by: Amir Mohammad Jahangirzad <a.jahangirzad@gmail.com> Link: https://patch.msgid.link/20260331232113.615972-1-a.jahangirzad@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-01 10:21:13 -06:00
Jackie Liu	19a8cc6cda	io_uring/rsrc: use io_cache_free() to free node Replace kfree(node) with io_cache_free() in io_buffer_register_bvec() to match all other error paths that free nodes allocated via io_rsrc_node_alloc(). The node is allocated through io_cache_alloc() internally, so it should be returned to the cache via io_cache_free() for proper object reuse. Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> Link: https://patch.msgid.link/20260331104509.7055-1-liu.yun@linux.dev [axboe: remove fixes tag, it's not a fix, it's a cleanup] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-01 10:21:13 -06:00
Pavel Begunkov	7c713dd007	io_uring/zcrx: rename zcrx [un]register functions Drop "ifqs" from function names, as it refers to an interface queue and there might be none once a device-less mode is introduced. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/657874acd117ec30fa6f45d9d844471c753b5a0f.1774261953.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-01 10:21:13 -06:00
Pavel Begunkov	de6ed1b323	io_uring/zcrx: check ctrl op payload struct sizes Add a build check that ctrl payloads are of the same size and don't grow struct zcrx_ctrl. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/af66caf9776d18e9ff880ab828eb159a6a03caf5.1774261953.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-01 10:21:13 -06:00
Pavel Begunkov	5c727ce042	io_uring/zcrx: cache fallback availability in zcrx ctx Store a flag in struct io_zcrx_ifq telling if the backing memory is normal page or dmabuf based. It was looking it up from the area, however it logically allocates from the zcrx ctx and not a particular area, and once we add more than one area it'll become a mess. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/65e75408a7758fe7e60fae89b7a8d5ae4857f515.1774261953.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-01 10:21:13 -06:00
Pavel Begunkov	f0b92207a0	io_uring/zcrx: warn on a repeated area append We only support a single area, no path should be able to call io_zcrx_append_area() twice. Warn if that happens instead of just returning an error. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/28eb67fb8c48445584d7c247a36e1ad8800f0c8b.1774261953.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-01 10:21:13 -06:00
Pavel Begunkov	61cfadaae6	io_uring/zcrx: consolidate dma syncing Split refilling into two steps, first allocate niovs, and then do DMA sync for them. This way dma synchronisation code can be better optimised. E.g. we don't need to call dma_dev_need_sync() for each every niov, and maybe we can coalesce sync for adjacent netmems in the future as well. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/19f2d50baa62ff2e0c6cd56dd7c394cab728c567.1774261953.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-01 10:21:12 -06:00
Pavel Begunkov	c0989138c0	io_uring/zcrx: netmem array as refiling format Instead of peeking into page pool allocation cache directly or via net_mp_netmem_place_in_cache(), pass a netmem array around. It's a better intermediate format, e.g. you can have it on stack and reuse the refilling code and decouples it from page pools a bit more. It still points into the page pool directly, there will be no additional copies. As the next step, we can change the callback prototype to take the netmem array from page pool. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/9d8549adb7ef6672daf2d8a52858ce5926279a82.1774261953.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-01 10:21:12 -06:00
Pavel Begunkov	48f253d65d	io_uring/zcrx: warn on alloc with non-empty pp cache Page pool ensures the cache is empty before asking to refill it. Warn if the assumption is violated. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/9c9792d6e65f3780d57ff83b6334d341ed9a5f29.1774261953.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-01 10:21:12 -06:00
Pavel Begunkov	7df542a665	io_uring/zcrx: move count check into zcrx_get_free_niov Instead of relying on the caller of __io_zcrx_get_free_niov() to check that there are free niovs available (i.e. free_count > 0), move the check into the function and return NULL if can't allocate. It consolidates the free count checks, and it'll be easier to extend the niov free list allocator in the future. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/6df04a6b3a6170f86d4345da9864f238311163f9.1774261953.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-01 10:21:12 -06:00
Pavel Begunkov	898ad80d12	io_uring/zcrx: use guards for locking Convert last several places using manual locking to guards to simplify the code. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/eb4667cfaf88c559700f6399da9e434889f5b04a.1774261953.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-01 10:21:12 -06:00
Pavel Begunkov	6a55a0a7eb	io_uring/zcrx: add a struct for refill queue Add a new structure that keeps the refill queue state. It's cleaner and will be useful once we introduce multiple refill queues. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/4ce200da1ff0309c377293b949200f95f80be9ae.1774261953.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-01 10:21:12 -06:00
Pavel Begunkov	ebae09bce4	io_uring/zcrx: use better name for RQ region Rename "region" to "rq_region" to highlight that it's a refill queue region. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/ac815790d2477a15826aecaa3d94f2a94ef507e6.1774261953.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-01 10:21:12 -06:00
Pavel Begunkov	825f276491	io_uring/zcrx: implement device-less mode for zcrx Allow creating a zcrx instance without attaching it to a net device. All data will be copied through the fallback path. The user is also expected to use ZCRX_CTRL_FLUSH_RQ to handle overflows as it normally should even with a netdev, but it becomes even more relevant as there will likely be no one to automatically pick up buffers. Apart from that, it follows the zcrx uapi for the I/O path, and is useful for testing, experimentation, and potentially for the copy receive path in the future if improved. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/674f8ad679c5a0bc79d538352b3042cf0999596e.1774261953.git.asml.silence@gmail.com [axboe: fix spelling error in uapi header and commit message] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-01 10:21:12 -06:00
Pavel Begunkov	06fc3b6d38	io_uring/zcrx: extract netdev+area init into a helper In preparation to following patches, add a function that is responsibly for looking up a netdev, creating an area, DMA mapping it and opening a queue. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/88cb6f746ecb496a9030756125419df273d0b003.1774261953.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-01 10:21:12 -06:00
Pavel Begunkov	b8d6eb6c1c	io_uring/zcrx: always dma map in advance zcrx was originally establisihing dma mappings at a late stage when it was being bound to a page pool. Dma-buf couldn't work this way, so it's initialised during area creation. It's messy having them do it at different spots, just move everything to the area creation time. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/334092a2cbdd4aabd7c025050aa99f05ace89bb5.1774261953.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-01 10:21:12 -06:00
Pavel Begunkov	41041562a7	io_uring/zcrx: fully clean area on error in io_import_umem() When accounting fails, io_import_umem() sets the page array, etc. and returns an error expecting that the error handling code will take care of the rest. To make the next patch simpler, only return a fully initialised areas from the function. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/3a602b7fb347dbd4da6797ac49b52ea5dedb856d.1774261953.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-01 10:21:12 -06:00
Pavel Begunkov	e5361d25e2	io_uring/zcrx: return back two step unregistration There are reports where io_uring instance removal takes too long and an ifq reallocation by another zcrx instance fails. Split zcrx destruction into two steps similarly how it was before, first close the queue early but maintain zcrx alive, and then when all inflight requests are completed, drop the main zcrx reference. For extra protection, mark terminated zcrx instances in xarray and warn if we double put them. Cc: stable@vger.kernel.org # 6.19+ Link: https://github.com/axboe/liburing/issues/1550 Reported-by: Youngmin Choi <youngminchoi94@gmail.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/0ce21f0565ab4358668922a28a8a36922dfebf76.1774261953.git.asml.silence@gmail.com [axboe: NULL ifq before break inside scoped guard] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-04-01 10:21:12 -06:00
Jens Axboe	f41b075492	io_uring: avoid req->ctx reload in io_req_put_rsrc_nodes() Cache 'ctx' to avoid it needing to get potentially reloaded. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-03-17 14:35:00 -06:00
Jens Axboe	3e97c2582f	io_uring/rw: use cached file rather than req->file In io_rw_init_file(), req->file is cached in file, yet the former is still being used when checking for O_DIRECT. As this is post setting the kiocb flags, the compiler has to reload req->file. Just use the locally cached file instead. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-03-17 14:35:00 -06:00
Jens Axboe	0a6b9ae1f3	io_uring/net: use 'ctx' consistently There's already a local ctx variable, use it for the io_is_compat() check as well. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-03-17 14:35:00 -06:00
Jens Axboe	74dbc0bab5	io_uring/poll: cache req->apoll_events Avoid a potential reload of ->apoll_events post vfs_poll() by caching it in a local variable. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-03-17 14:34:51 -06:00
Jens Axboe	49c21d9a5f	io_uring/kbuf: use 'ctx' consistently There's already a local ctx variable, yet the ring lock and unlock helpers use req->ctx. use ctx consistently. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-03-17 14:03:54 -06:00
Pavel Begunkov	98f37634b1	io_uring/bpf-ops: implement bpf ops registration Implement BPF struct ops registration. It's registered off the BPF path, and can be removed by BPF as well as io_uring. To protect it, introduce a global lock synchronising registration. ctx->uring_lock can be nested under it. ctx->bpf_ops is write protected by both locks and so it's safe to read it under either of them. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/1f46bffd76008de49cbafa2ad77d348810a4f69e.1772109579.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-03-16 16:15:00 -06:00
Pavel Begunkov	890819248a	io_uring/bpf-ops: add kfunc helpers Add two kfuncs that should cover most of the needs: 1. bpf_io_uring_submit_sqes(), which allows to submit io_uring requests. It mirrors the normal user space submission path and follows all related io_uring_enter(2) rules. i.e. SQEs are taken from the SQ according to head/tail values. In case of IORING_SETUP_SQ_REWIND, it'll submit first N entries. 2. bpf_io_uring_get_region() returns a pointer to the specified region, where io_uring regions are kernel-userspace shared chunks of memory. It takes the size as an argument, which should be a load time constant. There are 3 types of regions: - IOU_REGION_SQ returns the submission queue. - IOU_REGION_CQ stores the CQ, SQ/CQ headers and the sqarray. In other words, it gives same memory that would normally be mmap'ed with IORING_FEAT_SINGLE_MMAP enabled IORING_OFF_SQ_RING. - IOU_REGION_MEM represents the memory / parameter region. It can be used to store request indirect parameters and for kernel - user communication. It intentionally provides a thin but flexible API and expects BPF programs to implement CQ/SQ header parsing, CQ walking, etc. That mirrors how the normal user space works with rings and should help to minimise kernel / kfunc helpers changes while introducing new generic io_uring features. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/967bcc10e94c796eb273998621551b2a21848cde.1772109579.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-03-16 16:15:00 -06:00
Pavel Begunkov	d0e437b76b	io_uring/bpf-ops: implement loop_step with BPF struct_ops Introduce io_uring BPF struct ops implementing the loop_step callback, which will allow BPF to overwrite the default io_uring event loop logic. The callback takes an io_uring context, the main role of which is to be passed to io_uring kfuncs. The other argument is a struct iou_loop_params, which BPF can use to request CQ waiting and communicate other parameters. See the event loop description in the previous patch for more details. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/98db437651ce64e9cbeb611c60bf5887259db09f.1772109579.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-03-16 16:15:00 -06:00
Pavel Begunkov	033af2b3eb	io_uring: introduce callback driven main loop The io_uring_enter() has a fixed order of execution: it submits requests, waits for completions, and returns to the user. Allow to optionally replace it with a custom loop driven by a callback called loop_step. The basic requirements to the callback is that it should be able to submit requests, wait for completions, parse them and repeat. Most of the communication including parameter passing can be implemented via shared memory. The callback should return IOU_LOOP_CONTINUE to continue execution or IOU_LOOP_STOP to return to the user space. Note that the kernel may decide to prematurely terminate it as well, e.g. in case the process was signalled or killed. The hook takes a structure with parameters. It can be used to ask the kernel to wait for CQEs by setting cq_wait_idx to the CQE index it wants to wait for. Spurious wake ups are possible and even likely, the callback is expected to handle it. There will be more parameters in the future like timeout. It can be used with kernel callbacks, for example, as a slow path deprecation mechanism overwiting SQEs and emulating the wanted behaviour, however it's more useful together with BPF programs implemented in following patches. Note that keeping it separately from the normal io_uring wait loop makes things much simpler and cleaner. It keeps it in one place instead of spreading a bunch of checks in different places including disabling the submission path. It holds the lock by default, which is a better fit for BPF synchronisation and the loop execution model. It nicely avoids existing quirks like forced wake ups on timeout request completion. And it should be easier to implement new features. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/a2d369aa1c9dd23ad7edac9220cffc563abcaed6.1772109579.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-03-16 16:15:00 -06:00
Caleb Sander Mateos	f144dbac4b	nvme: remove nvme_dev_uring_cmd() IO_URING_F_IOPOLL check nvme_dev_uring_cmd() is part of struct file_operations nvme_dev_fops, which doesn't implement ->uring_cmd_iopoll(). So it won't be called with issue_flags that include IO_URING_F_IOPOLL. Drop the unnecessary IO_URING_F_IOPOLL check in nvme_dev_uring_cmd(). Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Link: https://patch.msgid.link/20260302172914.2488599-6-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-03-16 16:14:14 -06:00
Caleb Sander Mateos	23475637b0	io_uring/uring_cmd: allow non-iopoll cmds with IORING_SETUP_IOPOLL Currently, creating an io_uring with IORING_SETUP_IOPOLL requires all requests issued to it to support iopoll. This prevents, for example, using ublk zero-copy together with IORING_SETUP_IOPOLL, as ublk zero-copy buffer registrations are performed using a uring_cmd. There's no technical reason why these non-iopoll uring_cmds can't be supported. They will either complete synchronously or via an external mechanism that calls io_uring_cmd_done(), io_uring_cmd_post_mshot_cqe32(), or io_uring_mshot_cmd_post_cqe(), so they don't need to be polled. Allow uring_cmd requests to be issued to IORING_SETUP_IOPOLL io_urings even if their files don't implement ->uring_cmd_iopoll(). For these uring_cmd requests, skip initializing struct io_kiocb's iopoll fields, don't set REQ_F_IOPOLL, and don't set IO_URING_F_IOPOLL in issue_flags. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Link: https://patch.msgid.link/20260302172914.2488599-5-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-03-16 16:14:14 -06:00
Caleb Sander Mateos	3a5e96d47f	io_uring: count CQEs in io_iopoll_check() A subsequent commit will allow uring_cmds that don't use iopoll on IORING_SETUP_IOPOLL io_urings. As a result, CQEs can be posted without setting the iopoll_completed flag for a request in iopoll_list or going through task work. For example, a UBLK_U_IO_FETCH_IO_CMDS command could call io_uring_mshot_cmd_post_cqe() to directly post a CQE. The io_iopoll_check() loop currently only counts completions posted in io_do_iopoll() when determining whether the min_events threshold has been met. It also exits early if there are any existing CQEs before polling, or if any CQEs are posted while running task work. CQEs posted via io_uring_mshot_cmd_post_cqe() or other mechanisms won't be counted against min_events. Explicitly check the available CQEs in each io_iopoll_check() loop iteration to account for CQEs posted in any fashion. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Link: https://patch.msgid.link/20260302172914.2488599-4-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-03-16 16:14:14 -06:00
Caleb Sander Mateos	7995be40de	io_uring: remove iopoll_queue from struct io_issue_def The opcode iopoll_queue flag is now redundant with REQ_F_IOPOLL. Only io_{read,write}{,_fixed}() and io_uring_cmd() set the REQ_F_IOPOLL flag, and the opcodes with these ->issue() implementations are precisely the ones that set iopoll_queue. So don't bother checking the iopoll_queue flag in io_issue_sqe(). Remove the unused flag from struct io_issue_def. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Link: https://patch.msgid.link/20260302172914.2488599-3-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-03-16 16:14:14 -06:00
Caleb Sander Mateos	9165dc4fa9	io_uring: add REQ_F_IOPOLL A subsequent commit will allow uring_cmds to files that don't implement ->uring_cmd_iopoll() to be issued to IORING_SETUP_IOPOLL io_urings. This means the ctx's IORING_SETUP_IOPOLL flag isn't sufficient to determine whether a given request needs to be iopolled. Introduce a request flag REQ_F_IOPOLL set in ->issue() if a request needs to be iopolled to completion. Set the flag in io_rw_init_file() and io_uring_cmd() for requests issued to IORING_SETUP_IOPOLL ctxs. Use the request flag instead of IORING_SETUP_IOPOLL in places dealing with a specific request. A future possibility would be to add an option to enable/disable iopoll in the io_uring SQE instead of determining it from IORING_SETUP_IOPOLL. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Link: https://patch.msgid.link/20260302172914.2488599-2-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-03-16 16:14:14 -06:00
Jens Axboe	8c55744919	io_uring: mark known and harmless racy ctx->int_flags uses There are a few of these, where flags are read outside of the uring_lock, yet it's harmless to race on them. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-03-16 15:33:10 -06:00
Jens Axboe	f1a424e21c	io_uring: switch struct io_ring_ctx internal bitfields to flags Bitfields cannot be set and checked atomically, and this makes it more clear that these are indeed in shared storage and must be checked and set in a sane fashion. This is in preparation for annotating a few of the known racy, but harmless, flags checking. No intended functional changes in this patch. Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-03-16 15:32:59 -06:00
Jens Axboe	0e46cb553f	Merge branch 'io_uring-7.0' into for-7.1/io_uring Merge upstream io_uring fixes to avoid conflicts in later patches. * io_uring-7.0: io_uring/kbuf: check if target buffer list is still legacy on recycle io_uring: fix physical SQE bounds check for SQE_MIXED 128-byte ops io_uring/eventfd: use ctx->rings_rcu for flags checking io_uring: ensure ctx->rings is stable for task work flags manipulation io_uring/bpf_filter: use bpf_prog_run_pin_on_cpu() to prevent migration io_uring/register: fix comment about task_no_new_privs	2026-03-14 08:57:15 -06:00
Jens Axboe	c2c185be5c	io_uring/kbuf: check if target buffer list is still legacy on recycle There's a gap between when the buffer was grabbed and when it potentially gets recycled, where if the list is empty, someone could've upgraded it to a ring provided type. This can happen if the request is forced via io-wq. The legacy recycling is missing checking if the buffer_list still exists, and if it's of the correct type. Add those checks. Cc: stable@vger.kernel.org Fixes: `c7fb19428d` ("io_uring: add support for ring mapped supplied buffers") Reported-by: Keenan Dong <keenanat2000@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-03-12 08:59:25 -06:00
Tom Ryan	6f02c6b196	io_uring: fix physical SQE bounds check for SQE_MIXED 128-byte ops When IORING_SETUP_SQE_MIXED is used without IORING_SETUP_NO_SQARRAY, the boundary check for 128-byte SQE operations in io_init_req() validated the logical SQ head position rather than the physical SQE index. The existing check: !(ctx->cached_sq_head & (ctx->sq_entries - 1)) ensures the logical position isn't at the end of the ring, which is correct for NO_SQARRAY rings where physical == logical. However, when sq_array is present, an unprivileged user can remap any logical position to an arbitrary physical index via sq_array. Setting sq_array[N] = sq_entries - 1 places a 128-byte operation at the last physical SQE slot, causing the 128-byte memcpy in io_uring_cmd_sqe_copy() to read 64 bytes past the end of the SQE array. Replace the cached_sq_head alignment check with a direct validation of the physical SQE index, which correctly handles both sq_array and NO_SQARRAY cases. Fixes: `1cba30bf9f` ("io_uring: add support for IORING_SETUP_SQE_MIXED") Signed-off-by: Tom Ryan <ryan36005@gmail.com> Link: https://patch.msgid.link/20260310052003.72871-1-ryan36005@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-03-11 14:35:19 -06:00
Jens Axboe	177c694321	io_uring/eventfd: use ctx->rings_rcu for flags checking Similarly to what commit e78f7b70e837 did for local task work additions, use ->rings_rcu under RCU rather than dereference ->rings directly. See that commit for more details. Cc: stable@vger.kernel.org Fixes: `79cfe9e59c` ("io_uring/register: add IORING_REGISTER_RESIZE_RINGS") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-03-11 14:35:19 -06:00
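
Two of the commits listed above (85a58309c0 and 77d8c8d0f1) describe integer-width problems: a 32-bit value silently narrowed into an 8-bit field, and a u32 shifted in 32-bit arithmetic before being widened to 64 bits. The following is a minimal user-space sketch of both patterns, not the kernel code itself. All names here (`DEMO_OP_LAST`, `DEMO_OFF_SHIFT`, and the four helper functions) are hypothetical; only the example value 0x10b, its truncation to 0x0b, and the shift of 16 come from the commit messages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical upper bound on valid opcodes; the kernel's real limit differs. */
#define DEMO_OP_LAST   64u
/* Shift of 16, as stated in the zcrx commit messages. */
#define DEMO_OFF_SHIFT 16

/* Problem pattern: a 32-bit selector is narrowed to 8 bits without a check. */
uint8_t opcode_truncating(uint32_t len)
{
    return (uint8_t)len;            /* upper 24 bits are discarded */
}

/* Safer pattern: validate the full 32-bit value before narrowing it. */
bool opcode_checked(uint32_t len, uint8_t *out)
{
    if (len >= DEMO_OP_LAST)
        return false;               /* reject out-of-range selectors */
    *out = (uint8_t)len;
    return true;
}

/* Problem pattern: the shift happens in 32 bits, then the result is widened. */
uint64_t offset_shift32(uint32_t id)
{
    return id << DEMO_OFF_SHIFT;    /* high bits are lost before widening */
}

/* Safer pattern: widen first, so the shift is done in 64-bit arithmetic. */
uint64_t offset_shift64(uint32_t id)
{
    return (uint64_t)id << DEMO_OFF_SHIFT;
}

/* Small self-check showing how the two variants of each pattern differ. */
void demo_self_check(void)
{
    uint8_t op = 0;

    assert(opcode_truncating(0x10b) == 0x0b);   /* aliases a smaller opcode */
    assert(!opcode_checked(0x10b, &op));        /* rejected instead */
    assert(offset_shift32(0x10000u) == 0);      /* wrapped to zero */
    assert(offset_shift64(0x10000u) == 0x100000000ULL);
}
```

Calling `demo_self_check()` from a test harness exercises both pairs. In each pair the second function differs from the first only by checking or widening the value before it is used.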

1427933 Commits (c5e9f6a96bf7379da87df1b852b90527e242b56f)
