Commit Graph

1336238 Commits (0d2cdc35e805eb50f4a33b7287995ef590f1b916)

Author SHA1 Message Date
Geert Uytterhoeven 0d2cdc35e8 io_uring: Rename KConfig to Kconfig
Kconfig files are traditionally named "Kconfig".

Fixes: 6f377873cb ("io_uring/zcrx: add interface queue and refill queue")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Link: https://lore.kernel.org/r/5ae387c1465f54768b51a5a1ca87be7934c4b2ad.1739976387.git.geert+renesas@glider.be
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-19 14:53:27 -07:00
Pavel Begunkov 95e65f2d0b io_uring/zcrx: fix leaks on failed registration
If we try to register a device-less interface like veth,
io_register_zcrx_ifq() will leak struct io_zcrx_ifq with a bunch of
resources attached to it. Fix that.

Fixes: 035af94b39 ("io_uring/zcrx: grab a net device")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202502190532.W7NnmyiP-lkp@intel.com/
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/fbf16279dd73fa4c6df048168728355636ba5f53.1739959771.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-19 14:52:58 -07:00
Pavel Begunkov bc674a04c4 io_uring/zcrx: recheck ifq on shutdown
io_ring_exit_work() checks ifq before shutting it down and guarantees
that the pointer is stable, but instead of relying on rather complicated
synchronisation recheck the ifq pointer inside.

Reported-by: Kees Bakker <kees@ijzerbout.nl>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/905e55c47235ab26377a735294f939f31d00ae53.1739934175.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-19 08:07:31 -07:00
David Wei 71082faa2c io_uring/zcrx: add selftest
Add a selftest for io_uring zero copy Rx. This test cannot run locally
and requires a remote host to be configured in net.config. The remote
host must have hardware support for zero copy Rx as listed in the
documentation page. The test will restore the NIC config back to before
the test and is idempotent.

liburing is required to compile the test and be installed on the remote
host running the test.

Signed-off-by: David Wei <dw@davidwei.uk>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20250215000947.789731-12-dw@davidwei.uk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17 05:41:09 -07:00
David Wei d9ac1d5fc9 net: add documentation for io_uring zcrx
Add documentation for io_uring zero copy Rx that explains requirements
and the user API.

Signed-off-by: David Wei <dw@davidwei.uk>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20250215000947.789731-11-dw@davidwei.uk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17 05:41:09 -07:00
Pavel Begunkov bc57c7d36c io_uring/zcrx: add copy fallback
There are scenarios in which the zerocopy path can get a kernel buffer
instead of a net_iov and needs to copy it to the user, whether it is
because of mis-steering or simply getting an skb with the linear part.
In this case, grab a net_iov, copy into it and return it to the user as
normally.

At the moment the user doesn't get any indication whether there was a
copy or not, which is left for follow up work.

Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: David Wei <dw@davidwei.uk>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20250215000947.789731-10-dw@davidwei.uk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17 05:41:09 -07:00
Pavel Begunkov 931dfae190 io_uring/zcrx: throttle receive requests
io_zc_rx_tcp_recvmsg() continues until it fails or there is nothing to
receive. If the other side sends fast enough, we might get stuck in
io_zc_rx_tcp_recvmsg() producing more and more CQEs but not letting the
user to handle them leading to unbound latencies.

Break out of it based on an arbitrarily chosen limit, the upper layer
will either return to userspace or requeue the request.

Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: David Wei <dw@davidwei.uk>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20250215000947.789731-9-dw@davidwei.uk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17 05:41:09 -07:00
David Wei e0793de24a io_uring/zcrx: set pp memory provider for an rx queue
Set the page pool memory provider for the rx queue configured for zero
copy to io_uring. Then the rx queue is reset using
netdev_rx_queue_restart() and netdev core + page pool will take care of
filling the rx queue from the io_uring zero copy memory provider.

For now, there is only one ifq so its destruction happens implicitly
during io_uring cleanup.

Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: David Wei <dw@davidwei.uk>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20250215000947.789731-8-dw@davidwei.uk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17 05:41:09 -07:00
David Wei 11ed914bbf io_uring/zcrx: add io_recvzc request
Add io_uring opcode OP_RECV_ZC for doing zero copy reads out of a
socket. Only the connection should be land on the specific rx queue set
up for zero copy, and the socket must be handled by the io_uring
instance that the rx queue was registered for zero copy with. That's
because neither net_iovs / buffers from our queue can be read by outside
applications, nor zero copy is possible if traffic for the zero copy
connection goes to another queue. This coordination is outside of the
scope of this patch series. Also, any traffic directed to the zero copy
enabled queue is immediately visible to the application, which is why
CAP_NET_ADMIN is required at the registration step.

Of course, no data is actually read out of the socket, it has already
been copied by the netdev into userspace memory via DMA. OP_RECV_ZC
reads skbs out of the socket and checks that its frags are indeed
net_iovs that belong to io_uring. A cqe is queued for each one of these
frags.

Recall that each cqe is a big cqe, with the top half being an
io_uring_zcrx_cqe. The cqe res field contains the len or error. The
lower IORING_ZCRX_AREA_SHIFT bits of the struct io_uring_zcrx_cqe::off
field contain the offset relative to the start of the zero copy area.
The upper part of the off field is trivially zero, and will be used
to carry the area id.

For now, there is no limit as to how much work each OP_RECV_ZC request
does. It will attempt to drain a socket of all available data. This
request always operates in multishot mode.

Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: David Wei <dw@davidwei.uk>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20250215000947.789731-7-dw@davidwei.uk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17 05:41:09 -07:00
Pavel Begunkov db070446f5 io_uring/zcrx: dma-map area for the device
Setup DMA mappings for the area into which we intend to receive data
later on. We know the device we want to attach to even before we get a
page pool and can pre-map in advance. All net_iov are synchronised for
device when allocated, see page_pool_mp_return_in_cache().

Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: David Wei <dw@davidwei.uk>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20250215000947.789731-6-dw@davidwei.uk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17 05:41:09 -07:00
Pavel Begunkov 34a3e60821 io_uring/zcrx: implement zerocopy receive pp memory provider
Implement a page pool memory provider for io_uring to receieve in a
zero copy fashion. For that, the provider allocates user pages wrapped
around into struct net_iovs, that are stored in a previously registered
struct net_iov_area.

Unlike the traditional receive, that frees pages and returns them back
to the page pool right after data was copied to the user, e.g. inside
recv(2), we extend the lifetime until the user space confirms that it's
done processing the data. That's done by taking a net_iov reference.
When the user is done with the buffer, it must return it back to the
kernel by posting an entry into the refill ring, which is usually polled
off the io_uring memory provider callback in the page pool's netmem
allocation path.

There is also a separate set of per net_iov "user" references accounting
whether a buffer is currently given to the user (including possible
fragmentation).

Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: David Wei <dw@davidwei.uk>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20250215000947.789731-5-dw@davidwei.uk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17 05:41:09 -07:00
Pavel Begunkov 035af94b39 io_uring/zcrx: grab a net device
Zerocopy receive needs a net device to bind to its rx queue and dma map
buffers. As a preparation to following patches, resolve a net device
from the if_idx parameter with no functional changes otherwise.

Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: David Wei <dw@davidwei.uk>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20250215000947.789731-4-dw@davidwei.uk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17 05:41:09 -07:00
David Wei cf96310c5f io_uring/zcrx: add io_zcrx_area
Add io_zcrx_area that represents a region of userspace memory that is
used for zero copy. During ifq registration, userspace passes in the
uaddr and len of userspace memory, which is then pinned by the kernel.
Each net_iov is mapped to one of these pages.

The freelist is a spinlock protected list that keeps track of all the
net_iovs/pages that aren't used.

For now, there is only one area per ifq and area registration happens
implicitly as part of ifq registration. There is no API for
adding/removing areas yet. The struct for area registration is there for
future extensibility once we support multiple areas and TCP devmem.

Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: David Wei <dw@davidwei.uk>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20250215000947.789731-3-dw@davidwei.uk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17 05:41:09 -07:00
David Wei 6f377873cb io_uring/zcrx: add interface queue and refill queue
Add a new object called an interface queue (ifq) that represents a net
rx queue that has been configured for zero copy. Each ifq is registered
using a new registration opcode IORING_REGISTER_ZCRX_IFQ.

The refill queue is allocated by the kernel and mapped by userspace
using a new offset IORING_OFF_RQ_RING, in a similar fashion to the main
SQ/CQ. It is used by userspace to return buffers that it is done with,
which will then be re-used by the netdev again.

The main CQ ring is used to notify userspace of received data by using
the upper 16 bytes of a big CQE as a new struct io_uring_zcrx_cqe. Each
entry contains the offset + len to the data.

For now, each io_uring instance only has a single ifq.

Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: David Wei <dw@davidwei.uk>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20250215000947.789731-2-dw@davidwei.uk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17 05:41:03 -07:00
Jens Axboe 5c496ff11d Merge commit '71f0dd5a3293d75d26d405ffbaedfdda4836af32' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next into for-6.15/io_uring-rx-zc
Merge networking zerocopy receive tree, to get the prep patches for
the io_uring rx zc support.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (63 commits)
  net: add helpers for setting a memory provider on an rx queue
  net: page_pool: add memory provider helpers
  net: prepare for non devmem TCP memory providers
  net: page_pool: add a mp hook to unregister_netdevice*
  net: page_pool: add callback for mp info printing
  netdev: add io_uring memory provider info
  net: page_pool: create hooks for custom memory providers
  net: generalise net_iov chunk owners
  net: prefix devmem specific helpers
  net: page_pool: don't cast mp param to devmem
  tools: ynl: add all headers to makefile deps
  eth: fbnic: set IFF_UNICAST_FLT to avoid enabling promiscuous mode when adding unicast addrs
  eth: fbnic: add MAC address TCAM to debugfs
  tools: ynl-gen: support limits using definitions
  tools: ynl-gen: don't output external constants
  net/mlx5e: Avoid WARN_ON when configuring MQPRIO with HTB offload enabled
  net/mlx5e: Remove unused mlx5e_tc_flow_action struct
  net/mlx5: Remove stray semicolon in LAG port selection table creation
  net/mlx5e: Support FEC settings for 200G per lane link modes
  net/mlx5: Add support for 200Gbps per lane link modes
  ...
2025-02-17 05:38:28 -07:00
Caleb Sander Mateos 94a4274bb6 io_uring: pass struct io_tw_state by value
8e5b3b89ec ("io_uring: remove struct io_tw_state::locked") removed the
only field of io_tw_state but kept it as a task work callback argument
to "forc[e] users not to invoke them carelessly out of a wrong context".
Passing the struct io_tw_state * argument adds a few instructions to all
callers that can't inline the functions and see the argument is unused.

So pass struct io_tw_state by value instead. Since it's a 0-sized value,
it can be passed without any instructions needed to initialize it.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250217022511.1150145-2-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17 05:34:50 -07:00
Caleb Sander Mateos bcf8a0293a io_uring: introduce type alias for io_tw_state
In preparation for changing how io_tw_state is passed, introduce a type
alias io_tw_token_t for struct io_tw_state *. This allows for changing
the representation in one place, without having to update the many
functions that just forward their struct io_tw_state * argument.

Also add a comment to struct io_tw_state to explain its purpose.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250217022511.1150145-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17 05:34:50 -07:00
Caleb Sander Mateos 496f56bf9f io_uring/rsrc: avoid NULL check in io_put_rsrc_node()
Most callers of io_put_rsrc_node() already check that node is non-NULL:
- io_rsrc_data_free()
- io_sqe_buffer_register()
- io_reset_rsrc_node()
- io_req_put_rsrc_nodes() (REQ_F_BUF_NODE indicates non-NULL buf_node)

Only io_splice_cleanup() can call io_put_rsrc_node() with a NULL node.
So move the NULL check there.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250216225900.1075446-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17 05:34:46 -07:00
Caleb Sander Mateos 60e6ce746b io_uring: pass ctx instead of req to io_init_req_drain()
io_init_req_drain() takes a struct io_kiocb *req argument but only uses
it to get struct io_ring_ctx *ctx. The caller already knows the ctx, so
pass it instead.

Drop "req" from the function name since it operates on the ctx rather
than a specific req.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250212164807.3681036-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17 05:34:46 -07:00
Caleb Sander Mateos 0e8934724f io_uring: use IO_REQ_LINK_FLAGS more
Replace the 2 instances of REQ_F_LINK | REQ_F_HARDLINK with
the more commonly used IO_REQ_LINK_FLAGS.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250211202002.3316324-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17 05:34:46 -07:00
Jens Axboe 7c71a0af81 io_uring/net: improve recv bundles
Current recv bundles are only supported for multishot receives, and
additionally they also always post at least 2 CQEs if more data is
available than what a buffer will hold. This happens because the initial
bundle recv will do a single buffer, and then do the rest of what is in
the socket as a followup receive. As shown in a test program, if 1k
buffers are available and 32k is available to receive in the socket,
you'd get the following completions:

bundle=1, mshot=0
cqe res 1024
cqe res 1024
[...]
cqe res 1024

bundle=1, mshot=1
cqe res 1024
cqe res 31744

where bundle=1 && mshot=0 will post 32 1k completions, and bundle=1 &&
mshot=1 will post a 1k completion and then a 31k completion.

To support bundle recv without multishot, it's possible to simply retry
the recv immediately and post a single completion, rather than split it
into two completions. With the below patch, the same test looks as
follows:

bundle=1, mshot=0
cqe res 32768

bundle=1, mshot=1
cqe res 32768

where mshot=0 works fine for bundles, and both of them post just a
single 32k completion rather than split it into separate completions.
Posting fewer completions is always a nice win, and not needing
multishot for proper bundle efficiency is nice for cases that can't
necessarily use multishot.

Reported-by: Norman Maurer <norman_maurer@apple.com>
Link: https://lore.kernel.org/r/184f9f92-a682-4205-a15d-89e18f664502@kernel.dk
Fixes: 2f9c9515bd ("io_uring/net: support bundles for recv")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17 05:34:46 -07:00
Jens Axboe 932de5e35f io_uring/waitid: use generic io_cancel_remove() helper
Don't implement our own loop rolling and checking, just use the generic
helper to find and cancel requests.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17 05:34:45 -07:00
Jens Axboe 2eaa2fac47 io_uring/futex: use generic io_cancel_remove() helper
Don't implement our own loop rolling and checking, just use the generic
helper to find and cancel requests.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17 05:34:45 -07:00
Jens Axboe 8fa374f90b io_uring/cancel: add generic cancel helper
Any opcode that is cancelable ends up defining its own cancel helper
for finding and canceling a specific request. Add a generic helper that
can be used for this purpose.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17 05:34:45 -07:00
Jens Axboe 7d9944f506 io_uring/waitid: convert to io_cancel_remove_all()
Use the generic helper for cancelations.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17 05:34:45 -07:00
Jens Axboe e855b91384 io_uring/futex: convert to io_cancel_remove_all()
Use the generic helper for cancelations.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17 05:34:45 -07:00
Jens Axboe 1533376b13 io_uring/cancel: add generic remove_all helper
Any opcode that is cancelable ends up defining its own remove all
helper, which iterates the pending list and cancels matches. Add a
generic helper for it, which can be used by them.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17 05:34:45 -07:00
Pavel Begunkov 5d3e51240d io_uring/kbuf: uninline __io_put_kbufs
__io_put_kbufs() and other helper functions are too large to be inlined,
compilers would normally refuse to do so. Uninline it and move together
with io_kbuf_commit into kbuf.c.

io_kbuf_commitSigned-off-by: Pavel Begunkov <asml.silence@gmail.com>

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3dade7f55ad590e811aff83b1ec55c9c04e17b2b.1738724373.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17 05:34:45 -07:00
Pavel Begunkov 54e00d9a61 io_uring/kbuf: introduce io_kbuf_drop_legacy()
io_kbuf_drop() is only used for legacy provided buffers, and so
__io_put_kbuf_list() is never called for REQ_F_BUFFER_RING. Remove the
dead branch out of __io_put_kbuf_list(), rename it into
io_kbuf_drop_legacy() and use it directly instead of io_kbuf_drop().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c8cc73e2272f09a86ecbdad9ebdd8304f8e583c0.1738724373.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17 05:34:45 -07:00
Pavel Begunkov e150e70fce io_uring/kbuf: open code __io_put_kbuf()
__io_put_kbuf() is a trivial wrapper, open code it into
__io_put_kbufs().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9dc17380272b48d56c95992c6f9eaacd5546e1d3.1738724373.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17 05:34:45 -07:00
Pavel Begunkov 13ee854e7c io_uring/kbuf: remove legacy kbuf caching
Remove all struct io_buffer caches. It makes it a fair bit simpler.
Apart from from killing a bunch of lines and juggling between lists,
__io_put_kbuf_list() doesn't need ->completion_lock locking now.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/18287217466ee2576ea0b1e72daccf7b22c7e856.1738724373.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17 05:34:45 -07:00
Pavel Begunkov dc39fb1093 io_uring/kbuf: simplify __io_put_kbuf
As a preparation step remove an optimisation from __io_put_kbuf() trying
to use the locked cache. With that __io_put_kbuf_list() is only used
with ->io_buffers_comp, and we remove the explicit list argument.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1b7f1394ec4afc7f96b35a61f5992e27c49fd067.1738724373.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17 05:34:45 -07:00
Pavel Begunkov dd4fbb11e7 io_uring/kbuf: move locking into io_kbuf_drop()
Move the burden of locking out of the caller into io_kbuf_drop(), that
will help with furher refactoring.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/530f0cf1f06963029399f819a9a58b1a34bebef3.1738724373.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17 05:34:45 -07:00
Pavel Begunkov 9afe6847cf io_uring/kbuf: remove legacy kbuf kmem cache
Remove the kmem cache used by legacy provided buffers.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8195c207d8524d94e972c0c82de99282289f7f5c.1738724373.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17 05:34:45 -07:00
Pavel Begunkov 7919292a96 io_uring/kbuf: remove legacy kbuf bulk allocation
Legacy provided buffers are slow and discouraged in favour of the ring
variant. Remove the bulk allocation to keep it simpler as we don't care
about performance.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a064d70370e590efed8076e9501ae4cfc20fe0ca.1738724373.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17 05:34:45 -07:00
Pavel Begunkov 92a3bac9a5 io_uring: sanitise ring params earlier
Do all struct io_uring_params validation early on before allocating the
context. That makes initialisation easier, especially by having fewer
places where we need to care about partial de-initialisation.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/363ba90b83ff78eefdc88b60e1b2c4a39d182247.1738344646.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17 05:34:45 -07:00
Pavel Begunkov 7215469659 io_uring: check for iowq alloc_workqueue failure
alloc_workqueue() can fail even during init in io_uring_init(), check
the result and panic if anything went wrong.

Fixes: 73eaa2b583 ("io_uring: use private workqueue for exit work")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3a046063902f888f66151f89fa42f84063b9727b.1738343083.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17 05:34:45 -07:00
Pavel Begunkov 40b991837f io_uring: deduplicate caches deallocation
Add a function that frees all ring caches since we already have two
spots repeating the same thing and it's easy to miss it and change only
one of them.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b6b0125677c58bdff99eda91ab320137406e8562.1738342562.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17 05:34:45 -07:00
Max Kellermann 7d568502ef io_uring/io-wq: pass io_wq to io_get_next_work()
The only caller has already determined this pointer, so let's skip
the redundant dereference.

Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Link: https://lore.kernel.org/r/20250128133927.3989681-7-max.kellermann@ionos.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17 05:34:45 -07:00
Max Kellermann 486ba4d84d io_uring/io-wq: do not use bogus hash value
Previously, the `hash` variable was initialized with `-1` and only
updated by io_get_next_work() if the current work was hashed.  Commit
60cf46ae60 ("io-wq: hash dependent work") changed this to always
call io_get_work_hash() even if the work was not hashed.  This caused
the `hash != -1U` check to always be true, adding some overhead for
the `hash->wait` code.

This patch fixes the regression by checking the `IO_WQ_WORK_HASHED`
flag.

Perf diff for a flood of `IORING_OP_NOP` with `IOSQE_ASYNC`:

    38.55%     -1.57%  [kernel.kallsyms]  [k] queued_spin_lock_slowpath
     6.86%     -0.72%  [kernel.kallsyms]  [k] io_worker_handle_work
     0.10%     +0.67%  [kernel.kallsyms]  [k] put_prev_entity
     1.96%     +0.59%  [kernel.kallsyms]  [k] io_nop_prep
     3.31%     -0.51%  [kernel.kallsyms]  [k] try_to_wake_up
     7.18%     -0.47%  [kernel.kallsyms]  [k] io_wq_free_work

Fixes: 60cf46ae60 ("io-wq: hash dependent work")
Cc: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Link: https://lore.kernel.org/r/20250128133927.3989681-6-max.kellermann@ionos.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17 05:34:45 -07:00
Max Kellermann 6ee78354ea io_uring/io-wq: cache work->flags in variable
This eliminates several redundant atomic reads and therefore reduces
the duration the surrounding spinlocks are held.

In several io_uring benchmarks, this reduced the CPU time spent in
queued_spin_lock_slowpath() considerably:

io_uring benchmark with a flood of `IORING_OP_NOP` and `IOSQE_ASYNC`:

    38.86%     -1.49%  [kernel.kallsyms]  [k] queued_spin_lock_slowpath
     6.75%     +0.36%  [kernel.kallsyms]  [k] io_worker_handle_work
     2.60%     +0.19%  [kernel.kallsyms]  [k] io_nop
     3.92%     +0.18%  [kernel.kallsyms]  [k] io_req_task_complete
     6.34%     -0.18%  [kernel.kallsyms]  [k] io_wq_submit_work

HTTP server, static file:

    42.79%     -2.77%  [kernel.kallsyms]     [k] queued_spin_lock_slowpath
     2.08%     +0.23%  [kernel.kallsyms]     [k] io_wq_submit_work
     1.19%     +0.20%  [kernel.kallsyms]     [k] amd_iommu_iotlb_sync_map
     1.46%     +0.15%  [kernel.kallsyms]     [k] ep_poll_callback
     1.80%     +0.15%  [kernel.kallsyms]     [k] io_worker_handle_work

HTTP server, PHP:

    35.03%     -1.80%  [kernel.kallsyms]     [k] queued_spin_lock_slowpath
     0.84%     +0.21%  [kernel.kallsyms]     [k] amd_iommu_iotlb_sync_map
     1.39%     +0.12%  [kernel.kallsyms]     [k] _copy_to_iter
     0.21%     +0.10%  [kernel.kallsyms]     [k] update_sd_lb_stats

Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Link: https://lore.kernel.org/r/20250128133927.3989681-5-max.kellermann@ionos.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17 05:34:45 -07:00
Max Kellermann 751eedc4b4 io_uring/io-wq: move worker lists to struct io_wq_acct
Have separate linked lists for bounded and unbounded workers.  This
way, io_acct_activate_free_worker() sees only workers relevant to it
and doesn't need to skip irrelevant ones.  This speeds up the
linked list traversal (under acct->lock).

The `io_wq.lock` field is moved to `io_wq_acct.workers_lock`.  It did
not actually protect "access to elements below", that is, not all of
them; it only protected access to the worker lists.  By having two
locks instead of one, contention on this lock is reduced.

Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Link: https://lore.kernel.org/r/20250128133927.3989681-4-max.kellermann@ionos.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17 05:34:45 -07:00
Max Kellermann 3d3bafd35f io_uring/io-wq: add io_worker.acct pointer
This replaces the `IO_WORKER_F_BOUND` flag.  All code that checks this
flag is not interested in knowing whether this is a "bound" worker;
all it does with this flag is determine the `io_wq_acct` pointer.  At
the cost of an extra pointer field, we can eliminate some fragile
pointer arithmetic.  In turn, the `create_index` and `index` fields
are not needed anymore.

Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Link: https://lore.kernel.org/r/20250128133927.3989681-3-max.kellermann@ionos.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17 05:34:45 -07:00
Max Kellermann 3c75635f8e io_uring/io-wq: eliminate redundant io_work_get_acct() calls
Instead of calling io_work_get_acct() again, pass acct to
io_wq_insert_work() and io_wq_remove_pending().

This atomic access in io_work_get_acct() was done under the
`acct->lock`, and optimizing it away reduces lock contention a bit.

Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Link: https://lore.kernel.org/r/20250128133927.3989681-2-max.kellermann@ionos.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-17 05:34:45 -07:00
Linus Torvalds 0ad2507d5d Linux 6.14-rc3 2025-02-16 14:02:44 -08:00
Linus Torvalds 224e745110 Kbuild fixes for v6.14 (2nd)
- Fix annoying logs when building tools in parallel
 
  - Fix the Debian linux-headers package build again
 
  - Fix the target triple detection for userspace programs on Clang
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAmeyExkVHG1hc2FoaXJv
 eUBrZXJuZWwub3JnAAoJED2LAQed4NsGtEIQAJNiuG2KvXawT8BdDE1RMM25bJiE
 3mkqexYAWionInz44WzAwMuzXG3yLFxeuLynh926qw8eE832s2ngMzJluFQehl53
 QP/KVpzFliPyO+nRWPcsRi2Re8wL3TD3lnJLt58yNJlpBmF7pc9TYFUrMuRqFE/1
 1h2BlCOYYuq24mMY4F7ULESQ/4Bpb+4dGkN/WXtPttYFmNl+sPy9vn3lG3DxzLKM
 UjwAmRhta/sFV3UF6nWcb1jsiti7IkHf/WIZPm8MoMjBNetnSkfcesPeLhaI7waf
 +TVa9JqwoFWgscGXlCpK/Pw/gOQr1Lf0ZuSMNxtJe9vU76rEoLmdUYaHNbnA6CvQ
 wj6HA7E4mownOzKu3gqZ76dqoaHYdmeHTSoyDCJqVFzicI7ZDYT8MaggFGH/F10J
 KZ7daYOV6HYJxGVTYuhD1e2RvOZcXUHzxafTyl0Hyi1z3z+pWOKLE6W/uDvRkW32
 qCubjZwgepHuI/TUos9VIkkIiW2429Sb4kgqQTRBtzRdKdM+qGjTEzLV/3KPuaWj
 0qCiF6sf7shzgk4UstrUQ7JN+V5oLJUgY/TKZuPyy69qE6xYqUWvRmnGkYxav8aM
 5lTGMXWViksO+EebeE0sNH+R3XHXAxVTI95s4kX57SPcAHItTSMGPPVe2bRErwWJ
 hiLEu6gRezp/7IDT
 =56TH
 -----END PGP SIGNATURE-----

Merge tag 'kbuild-fixes-v6.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild fixes from Masahiro Yamada:

 - Fix annoying logs when building tools in parallel

 - Fix the Debian linux-headers package build again

 - Fix the target triple detection for userspace programs on Clang

* tag 'kbuild-fixes-v6.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  modpost: Fix a few typos in a comment
  kbuild: userprogs: fix bitsize and target detection on clang
  kbuild: fix linux-headers package build when $(CC) cannot link userspace
  tools: fix annoying "mkdir -p ..." logs when building tools in parallel
2025-02-16 12:58:51 -08:00
Linus Torvalds ae5fa8ce7e Driver core api addition for 6.14-rc3
Here is a driver core new api for 6.14-rc3 that is being added to allow
 platform devices from stop being abused.  It adds a new "faux_device"
 structure and bus and api to allow almost a straight or simpler
 conversion from platform devices that were not really a platform device.
 It also comes with a binding for rust, with an example driver in rust
 showing how it's used.
 
 I'm adding this now so that the patches that convert the different
 drivers and subsystems can all start flowing into linux-next now through
 their different development trees, in time for 6.15-rc1.  We have a
 number that are already reviewed and tested, but adding those
 conversions now doesn't seem right.  For now, no one is using this, and
 it passes all build tests from 0-day and linux-next, so all should be
 good.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZ7H+sQ8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+yljfwCfdP8AvZeIdx89cqS0djspBSFLw1MAoIpq7Pbi
 6BY+VOuDSZNdBKXFLR/x
 =2qRL
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-6.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core api addition from Greg KH:
 "Here is a driver core new api for 6.14-rc3 that is being added to
  allow platform devices from stop being abused.

  It adds a new 'faux_device' structure and bus and api to allow almost
  a straight or simpler conversion from platform devices that were not
  really a platform device. It also comes with a binding for rust, with
  an example driver in rust showing how it's used.

  I'm adding this now so that the patches that convert the different
  drivers and subsystems can all start flowing into linux-next now
  through their different development trees, in time for 6.15-rc1.

  We have a number that are already reviewed and tested, but adding
  those conversions now doesn't seem right. For now, no one is using
  this, and it passes all build tests from 0-day and linux-next, so all
  should be good"

* tag 'driver-core-6.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  rust/kernel: Add faux device bindings
  driver core: add a faux bus for use when a simple device/bus is needed
2025-02-16 12:54:42 -08:00
Linus Torvalds 56400391b1 Serial driver fixes for 6.14-rc3
Here are some small serial driver fixes for some reported problems for
 6.14-rc3.  Nothing major, just:
   - sc16is7xx irq check fix
   - 8250 fifo underflow fix
   - serial_port and 8250 iotype fixes
 
 Most of these have been in linux-next already, and all have passed 0-day
 testing.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZ7H/YQ8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ykxAwCbBpzMC3xtuqSw/PCrFGbBNeMVuQcAoKtoQYWr
 Zro9O1hxTOfbFNeHEVtj
 =XvcE
 -----END PGP SIGNATURE-----

Merge tag 'tty-6.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty

Pull serial driver fixes from Greg KH:
 "Here are some small serial driver fixes for some reported problems.
  Nothing major, just:

   - sc16is7xx irq check fix

   - 8250 fifo underflow fix

   - serial_port and 8250 iotype fixes

  Most of these have been in linux-next already, and all have passed
  0-day testing"

* tag 'tty-6.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
  serial: 8250: Fix fifo underflow on flush
  serial: 8250_pnp: Remove unneeded ->iotype assignment
  serial: 8250_platform: Remove unneeded ->iotype assignment
  serial: 8250_of: Remove unneeded ->iotype assignment
  serial: port: Make ->iotype validation global in __uart_read_properties()
  serial: port: Always update ->iotype in __uart_read_properties()
  serial: port: Assign ->iotype correctly when ->iobase is set
  serial: sc16is7xx: Fix IRQ number check behavior
2025-02-16 12:50:44 -08:00
Linus Torvalds 6bfcc5fb2f USB fixes for 6.14-rc3
Here are some small USB driver fixes, and new device ids, for 6.14-rc3.
 Lots of tiny stuff for reported problems, including:
   - new device ids and quirks
   - usb hub crash fix found by syzbot
   - dwc2 driver fix
   - dwc3 driver fixes
   - uvc gadget driver fix
   - cdc-acm driver fixes for a variety of different issues
   - other tiny bugfixes
 
 Almost all of these have been in linux-next this week, and all have
 passed 0-day testing.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZ7H/9Q8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ym9YgCeIG0YNJa3Bb8gVQOnNwbOXzSpt84AniHDxqW9
 EMqZy36ZYzee04t1wR0b
 =La/t
 -----END PGP SIGNATURE-----

Merge tag 'usb-6.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb

Pull USB fixes from Greg KH:
 "Here are some small USB driver fixes, and new device ids, for
  6.14-rc3. Lots of tiny stuff for reported problems, including:

   - new device ids and quirks

   - usb hub crash fix found by syzbot

   - dwc2 driver fix

   - dwc3 driver fixes

   - uvc gadget driver fix

   - cdc-acm driver fixes for a variety of different issues

   - other tiny bugfixes

  Almost all of these have been in linux-next this week, and all have
  passed 0-day testing"

* tag 'usb-6.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (25 commits)
  usb: typec: tcpm: PSSourceOffTimer timeout in PR_Swap enters ERROR_RECOVERY
  usb: roles: set switch registered flag early on
  usb: gadget: uvc: Fix unstarted kthread worker
  USB: quirks: add USB_QUIRK_NO_LPM quirk for Teclast dist
  usb: gadget: core: flush gadget workqueue after device removal
  USB: gadget: f_midi: f_midi_complete to call queue_work
  usb: core: fix pipe creation for get_bMaxPacketSize0
  usb: dwc3: Fix timeout issue during controller enter/exit from halt state
  USB: Add USB_QUIRK_NO_LPM quirk for sony xperia xz1 smartphone
  USB: cdc-acm: Fill in Renesas R-Car D3 USB Download mode quirk
  usb: cdc-acm: Fix handling of oversized fragments
  usb: cdc-acm: Check control transfer buffer size before access
  usb: xhci: Restore xhci_pci support for Renesas HCs
  USB: pci-quirks: Fix HCCPARAMS register error for LS7A EHCI
  USB: serial: option: drop MeiG Smart defines
  USB: serial: option: fix Telit Cinterion FN990A name
  USB: serial: option: add Telit Cinterion FN990B compositions
  USB: serial: option: add MeiG Smart SLM828
  usb: gadget: f_midi: fix MIDI Streaming descriptor lengths
  usb: dwc2: gadget: remove of_node reference upon udc_stop
  ...
2025-02-16 11:15:50 -08:00
Linus Torvalds ba643b6d84 - Remove an unused config item GENERIC_PENDING_IRQ_CHIPFLAGS
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmextVEACgkQEsHwGGHe
 VUpEew//bv6xE+9sUxtKpD4yQFSZmOOAYbSg6TNdxlmlIMXn2JHeOe166TiQHDIv
 w4QVZuzmmHx11hPC3YVgdMtzU0S/dFWd2LhoguHVALbZ39sxap2tuWIZacDvShyP
 WKrGAtvb/BOb0lL0yQY3dGwQ+YO7ye0EU+gJLCaWqNC7n0vWfP4gin17k08iNgQg
 qghwGUNCsYsijfpvimcJB9eqg00pOqnOVnWiE2xGSz5LKy+Yigtd8aNR50mS+Mgt
 Sl2Dzmesio0B464X+HSj34dhi4GswdjWmqKyuixmOuECp/1rCBakpuq760J0ijOg
 dliiBHZplavsLcis9Mh/vKzN4Pd5xjXMPsADHpTX/2fpF8yXj9sWpG+xp7UgsJi0
 LZASk8LTovstOGiDSF7ff+4iCPlkZTvE77fawfdcrpe4kgUc85VpRWyv7xalbGwj
 0Y/6dIkXg9dOwv+1z57i15ws2KDG35d2faF5UH1qfr/wwzp/TloIk5oX1C5d2puh
 Y5Fohu3S+D+wFgJB0xwkGTXzLTGVn/ElQMF08CvHHvqX3TrA9HNfJ7EahROx/T3i
 HNjj10YyKXbGwnk7OAQENOKG3SztlxrPk4SEEY4fZ1RvehYYu6qXXHA9vShjDQcN
 C2Elj42sgoTXr+DBGeVPHE1bD279kxacWZYD2xPx15t74Mu0zis=
 =tXlN
 -----END PGP SIGNATURE-----

Merge tag 'irq_urgent_for_v6.14_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq Kconfig cleanup from Borislav Petkov:

 - Remove an unused config item GENERIC_PENDING_IRQ_CHIPFLAGS

* tag 'irq_urgent_for_v6.14_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq: Remove unused CONFIG_GENERIC_PENDING_IRQ_CHIPFLAGS
2025-02-16 10:55:17 -08:00