mirror-linux

Commit Graph

Author	SHA1	Message	Date
Ernestas Kulik	864ba40c80	llc: Return -EINPROGRESS from llc_ui_connect() Given a zero sk_sndtimeo, llc_ui_connect() skips waiting for state change and returns 0, confusing userspace applications that will assume the socket is connected, making e.g. getpeername() calls error out. More specifically, the issue was discovered in libcoap, where newly-added AF_LLC socket support was behaving differently from AF_INET connections due to EINPROGRESS handling being skipped. Set rc to -EINPROGRESS if connect() would not block, akin to AF_INET sockets. Signed-off-by: Ernestas Kulik <ernestas.k@iconn-networks.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260421060304.285419-1-ernestas.k@iconn-networks.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-23 11:40:39 -07:00
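For readers unfamiliar with the -EINPROGRESS convention this commit aligns AF_LLC with, the following is a minimal userspace sketch of how a caller typically completes a non-blocking connect(). It assumes POSIX poll() and SO_ERROR semantics; `wait_connected` is a hypothetical helper name and is not taken from libcoap or the kernel.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>

/*
 * Wait for a non-blocking connect() that reported EINPROGRESS to finish.
 * Returns 0 once connected, -1 with errno set on failure or timeout.
 * (Hypothetical helper, for illustration only.)
 */
static int wait_connected(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLOUT };
	int err = 0;
	socklen_t len = sizeof(err);
	int n;

	assert(fd >= 0);

	n = poll(&pfd, 1, timeout_ms);
	if (n < 0)
		return -1;		/* errno already set by poll() */
	if (n == 0) {
		errno = ETIMEDOUT;	/* still connecting */
		return -1;
	}

	/* Socket is writable or in error: SO_ERROR holds the final status. */
	if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0)
		return -1;
	if (err) {
		errno = err;
		return -1;
	}
	return 0;
}
```

A caller would invoke this only after connect() returned -1 with errno set to EINPROGRESS. A return value of 0 from connect() is read as "already connected", which is why the old AF_LLC behaviour confused applications.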
Ruide Cao	67bf002a2d	ipv4: icmp: validate reply type before using icmp_pointers Extended echo replies use ICMP_EXT_ECHOREPLY as the outbound reply type. That value is outside the range covered by icmp_pointers[], which only describes the traditional ICMP types up to NR_ICMP_TYPES. Avoid consulting icmp_pointers[] for reply types outside that range, and use array_index_nospec() for the remaining in-range lookup. Normal ICMP replies keep their existing behavior unchanged. Fixes: `d329ea5bd8` ("icmp: add response to RFC 8335 PROBE messages") Cc: stable@kernel.org Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Signed-off-by: Ruide Cao <caoruide123@gmail.com> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/0dace90c01a5978e829ca741ef684dbd7304ce62.1776628519.git.caoruide123@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-23 11:40:08 -07:00
Jakub Kicinski	7ebc650474	Merge branch 'tcp-symmetric-challenge-ack-for-seg-ack-snd-nxt' Jiayuan Chen says: ==================== tcp: symmetric challenge ACK for SEG.ACK > SND.NXT Commit `354e4aa391` ("tcp: RFC 5961 5.2 Blind Data Injection Attack Mitigation") quotes RFC 5961 Section 5.2 in full, which requires that any incoming segment whose ACK value falls outside [SND.UNA - MAX.SND.WND, SND.NXT] MUST be discarded and an ACK sent back. Linux currently sends that challenge ACK only on the lower edge (SEG.ACK < SND.UNA - MAX.SND.WND); on the symmetric upper edge (SEG.ACK > SND.NXT) the segment is silently dropped with SKB_DROP_REASON_TCP_ACK_UNSENT_DATA. Patch 1 completes the mitigation by emitting a rate-limited challenge ACK on that branch, reusing tcp_send_challenge_ack() and honouring FLAG_NO_CHALLENGE_ACK for consistency with the lower-edge case. It also updates the existing tcp_ts_recent_invalid_ack.pkt selftest, which drives this exact path, to consume the new challenge ACK so bisect stays clean. Patch 2 adds a new packetdrill selftest that exercises RFC 5961 Section 5.2 on both edges of the acceptable window, filling a gap in the selftests tree (neither edge had dedicated coverage before). ==================== Link: https://patch.msgid.link/20260422123605.320000-1-jiayuan.chen@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-23 11:04:05 -07:00
Jiayuan Chen	cf94b3c0f0	selftests/net: packetdrill: cover RFC 5961 5.2 challenge ACK on both edges RFC 5961 Section 5.2 / RFC 793 Section 3.9 require a challenge ACK whenever an incoming SEG.ACK falls outside [SND.UNA - MAX.SND.WND, SND.NXT]. There is currently no packetdrill coverage for either edge. Add tcp_rfc5961_ack-out-of-window.pkt, which in a single passive-open connection exercises: - Upper edge (SEG.ACK > SND.NXT): peer ACKs data that was never sent before the server has transmitted anything. - Lower edge (SEG.ACK < SND.UNA - MAX.SND.WND): after the server has sent 2000 bytes (the peer-advertised rwnd forces two 1000-byte segments, both acknowledged), peer sends an ACK that is older than the acceptable window. Both cases must elicit a challenge ACK <SEQ = SND.NXT, ACK = RCV.NXT, CTL = ACK>. The per-socket RFC 5961 Section 7 rate limit is disabled for the duration of the test so that both challenge ACKs can fire back-to-back. Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260422123605.320000-3-jiayuan.chen@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-23 11:04:01 -07:00
Jiayuan Chen	42726ec644	tcp: send a challenge ACK on SEG.ACK > SND.NXT RFC 5961 Section 5.2 validates an incoming segment's ACK value against the range [SND.UNA - MAX.SND.WND, SND.NXT] and states: "All incoming segments whose ACK value doesn't satisfy the above condition MUST be discarded and an ACK sent back." Commit `354e4aa391` ("tcp: RFC 5961 5.2 Blind Data Injection Attack Mitigation") opted Linux into this mitigation and implements the challenge ACK on the lower side (SEG.ACK < SND.UNA - MAX.SND.WND), but the symmetric upper side (SEG.ACK > SND.NXT) still takes the pre-RFC-5961 path and silently returns SKB_DROP_REASON_TCP_ACK_UNSENT_DATA, even though RFC 793 Section 3.9 (now RFC 9293 Section 3.10.7.4) has always required: "If the ACK acknowledges something not yet sent (SEG.ACK > SND.NXT) then send an ACK, drop the segment, and return." Complete the mitigation by sending a challenge ACK on that branch, reusing the existing tcp_send_challenge_ack() path which already enforces the per-socket RFC 5961 Section 7 rate limit via __tcp_oow_rate_limited(). FLAG_NO_CHALLENGE_ACK is honoured for symmetry with the lower-edge case. Update the existing tcp_ts_recent_invalid_ack.pkt selftest, which drives this exact path, to consume the new challenge ACK. Fixes: `354e4aa391` ("tcp: RFC 5961 5.2 Blind Data Injection Attack Mitigation") Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260422123605.320000-2-jiayuan.chen@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-23 11:04:00 -07:00
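The ACK acceptability window discussed in this commit can be sketched in plain C. This is an illustration only, assuming 32-bit wrapping sequence arithmetic; `classify_ack`, `seq_before` and the `ack_verdict` values are hypothetical names, not the kernel's, and rate limiting and FLAG_NO_CHALLENGE_ACK handling are deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical verdicts for an incoming ACK value. */
enum ack_verdict {
	ACK_ACCEPTABLE,		/* inside [SND.UNA - MAX.SND.WND, SND.NXT] */
	ACK_TOO_OLD,		/* below the lower edge: challenge ACK */
	ACK_UNSENT_DATA,	/* above SND.NXT: challenge ACK after this commit */
};

/* True if a precedes b in 32-bit wrapping sequence space. */
static int seq_before(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

static enum ack_verdict classify_ack(uint32_t seg_ack, uint32_t snd_una,
				     uint32_t snd_nxt, uint32_t max_snd_wnd)
{
	/* Sender invariant assumed by this sketch: SND.UNA <= SND.NXT. */
	assert(!seq_before(snd_nxt, snd_una));

	if (seq_before(snd_nxt, seg_ack))
		return ACK_UNSENT_DATA;
	if (seq_before(seg_ack, snd_una - max_snd_wnd))
		return ACK_TOO_OLD;
	return ACK_ACCEPTABLE;
}
```

Both non-acceptable verdicts correspond to "discard the segment and send an ACK back" in RFC 5961 Section 5.2. Before this commit, only the ACK_TOO_OLD case produced a challenge ACK.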
Alexey Kodanev	4078c5611d	nfp: fix swapped arguments in nfp_encode_basic_qdr() calls There is a mismatch between the passed arguments and the actual nfp_encode_basic_qdr() function parameter names: static int nfp_encode_basic_qdr(u64 addr, int dest_island, int cpp_tgt, int mode, bool addr40, int isld1, int isld0) { ... But "dest_island" and "cpp_tgt" are swapped at every call-site. For example: return nfp_encode_basic_qdr(addr, cpp_tgt, dest_island, mode, addr40, isld1, isld0); As a result, nfp_encode_basic_qdr() receives "dest_island" as CPP target type, which is always NFP_CPP_TARGET_QDR(2) for these calls, and "cpp_tgt" as the destination island ID, which can accidentally match or be outside the valid NFP_CPP_TARGET_ types (e.g. '-1' for any destination). Since code already worked for years, also add extra pr_warn() to error paths in nfp_encode_basic_qdr() to help identify any potential address verification failures. Detected using the static analysis tool - Svace. Fixes: `4cb584e0ee` ("nfp: add CPP access core") Signed-off-by: Alexey Kodanev <aleksei.kodanev@bell-sw.com> Link: https://patch.msgid.link/20260422160536.61855-1-aleksei.kodanev@bell-sw.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-23 11:01:20 -07:00
Ruijie Li	5a8db80f72	net/smc: avoid early lgr access in smc_clc_wait_msg A CLC decline can be received while the handshake is still in an early stage, before the connection has been associated with a link group. The decline handling in smc_clc_wait_msg() updates link-group level sync state for first-contact declines, but that state only exists after link group setup has completed. Guard the link-group update accordingly and keep the per-socket peer diagnosis handling unchanged. This preserves the existing sync_err handling for established link-group contexts and avoids touching link-group state before it is available. Fixes: `0cfdd8f92c` ("smc: connection and link group creation") Cc: stable@kernel.org Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Signed-off-by: Ruijie Li <ruijieli51@gmail.com> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> Link: https://patch.msgid.link/08c68a5c817acf198cce63d22517e232e8d60718.1776850759.git.ruijieli51@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-23 11:00:57 -07:00
Dexuan Cui	3d1f20727a	hv_sock: Return -EIO for malformed/short packets Commit `f631529589` fixes a regression, however it fails to report an error for malformed/short packets -- normally we should never see such packets, but let's report an error for them just in case. Fixes: `f631529589` ("hv_sock: Report EOF instead of -EIO for FIN") Cc: stable@vger.kernel.org Signed-off-by: Dexuan Cui <decui@microsoft.com> Acked-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://patch.msgid.link/20260423064811.1371749-1-decui@microsoft.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-23 10:53:16 -07:00
Brett Creeley	3bc06da858	virtio_net: sync rss_trailer.max_tx_vq on queue_pairs change via VQ_PAIRS_SET When netif_is_rxfh_configured() is true (i.e., the user has explicitly configured the RSS indirection table), virtnet_set_queues() skips the RSS update path and falls through to the VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET command to change the number of queue pairs. However, it does not update vi->rss_trailer.max_tx_vq to reflect the new queue_pairs value. This causes a mismatch between vi->curr_queue_pairs and vi->rss_trailer.max_tx_vq. Any subsequent RSS reconfiguration (e.g., via ethtool -X) calls virtnet_commit_rss_command(), which sends the stale max_tx_vq to the device, silently reverting the queue count. Reproduction: 1. User configured RSS ethtool -X eth0 equal 8 2. VQ_PAIRS_SET path; max_tx_vq stays 16 ethtool -L eth0 combined 12 3. RSS commit uses max_tx_vq=16 instead of 12 ethtool -X eth0 equal 4 Fix this by updating vi->rss_trailer.max_tx_vq after a successful VQ_PAIRS_SET command when RSS is enabled, keeping it in sync with curr_queue_pairs. Fixes: `50bfcaedd7` ("virtio_net: Update rss when set queue") Signed-off-by: Brett Creeley <brett.creeley@amd.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Link: https://patch.msgid.link/20260416212121.29073-1-brett.creeley@amd.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-23 09:35:53 -07:00
Paolo Abeni	d40831b016	Merge branch 'mptcp-sync-the-msk-sndbuf-at-accept-time' Matthieu Baerts says: ==================== mptcp: sync the msk->sndbuf at accept() time On passive MPTCP connections, the MPTCP socket send buffer doesn't have the expected size at accept() time. Patch 1 fixes the regression introduced in v6.7, while the following one validates the fix in the selftests. Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> ==================== Link: https://patch.msgid.link/20260420-net-mptcp-sync-sndbuf-accept-v1-0-e3523e3aeb44@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-04-23 13:20:25 +02:00
Gang Yan	d0576eb850	selftests: mptcp: add a check for sndbuf of S/C Add a new chk_sndbuf() helper to diag.sh that extracts the sndbuf (the 'tb' field from 'ss -m' skmem output) for both server and client MPTCP sockets, and verifies they are equal. Without the previous patch, it will fail: ''' 07 ....chk sndbuf server/client [FAIL] sndbuf S=20480 != C=2630656 ''' Signed-off-by: Gang Yan <yangang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260420-net-mptcp-sync-sndbuf-accept-v1-2-e3523e3aeb44@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-04-23 13:20:17 +02:00
Gang Yan	fcf04b1433	mptcp: sync the msk->sndbuf at accept() time On passive MPTCP connections, the msk sndbuf is not updated correctly. The root cause is an ordering issue in the accept path: - tcp_check_req() -> subflow_syn_recv_sock() -> mptcp_sk_clone_init() calls __mptcp_propagate_sndbuf() to copy the ssk sndbuf into msk - Later, tcp_child_process() -> tcp_init_transfer() -> tcp_sndbuf_expand() grows the ssk sndbuf. So __mptcp_propagate_sndbuf() runs before the ssk sndbuf has been expanded and the msk ends up with a much smaller sndbuf than the subflow: MPTCP: msk->sndbuf:20480, msk->first->sndbuf:2626560 Fix this by moving the __mptcp_propagate_sndbuf() call out of mptcp_sk_clone_init(), where the ssk sndbuf is not yet finalized, and invoking it at accept() time, when the ssk sndbuf has been fully expanded by tcp_sndbuf_expand(). Fixes: `8005184fd1` ("mptcp: refactor sndbuf auto-tuning") Cc: stable@vger.kernel.org Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/602 Signed-off-by: Gang Yan <yangang@kylinos.cn> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260420-net-mptcp-sync-sndbuf-accept-v1-1-e3523e3aeb44@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-04-23 13:20:17 +02:00
Stefano Garzarella	1cb36e2522	vsock/virtio: fix MSG_ZEROCOPY pinned-pages accounting virtio_transport_init_zcopy_skb() uses iter->count as the size argument for msg_zerocopy_realloc(), which in turn passes it to mm_account_pinned_pages() for RLIMIT_MEMLOCK accounting. However, this function is called after virtio_transport_fill_skb() has already consumed the iterator via __zerocopy_sg_from_iter(), so on the last skb, iter->count will be 0, skipping the RLIMIT_MEMLOCK enforcement. Pass pkt_len (the total bytes being sent) as an explicit parameter to virtio_transport_init_zcopy_skb() instead of reading the already-consumed iter->count. This matches TCP and UDP, which both call msg_zerocopy_realloc() with the original message size. Fixes: `581512a6dc` ("vsock/virtio: MSG_ZEROCOPY flag support") Reported-by: Yiming Qian <yimingqian591@gmail.com> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com> Link: https://patch.msgid.link/20260420132051.217589-1-sgarzare@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-04-23 13:03:21 +02:00
Paolo Abeni	42ea37b077	Merge branch 'net-mana-fix-probe-remove-error-path-bugs' Erni Sri Satya Vennela says: ==================== net: mana: Fix probe/remove error path bugs Fix five bugs in mana_probe()/mana_remove() error handling that can cause warnings on uninitialized work structs, NULL pointer dereferences, masked errors, and resource leaks when early probe steps fail. Patches 1-2 move work struct initialization (link_change_work and gf_stats_work) to before any error path that could trigger mana_remove(), preventing WARN_ON in __flush_work() or debug object warnings when sync cancellation runs on uninitialized work structs. Patch 3 guards mana_remove() against double invocation. If PM resume fails, mana_probe() calls mana_remove() which sets gdma_context and driver_data to NULL. A failed resume does not unbind the driver, so when the device is eventually unbound, mana_remove() is called again and dereferences NULL, causing a kernel panic. An early return on NULL gdma_context or driver_data makes the second call harmless. Patch 4 prevents add_adev() from overwriting a port probe error, which could leave the driver in a broken state with NULL ports while reporting success. Patch 5 changes 'goto out' to 'break' in mana_remove()'s port loop so that mana_destroy_eq() is always reached, preventing EQ leaks when a NULL port is encountered. ==================== Link: https://patch.msgid.link/20260420124741.1056179-1-ernis@linux.microsoft.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-04-23 12:49:16 +02:00
Erni Sri Satya Vennela	65267c9c4f	net: mana: Fix EQ leak in mana_remove on NULL port In mana_remove(), when a NULL port is encountered in the port iteration loop, 'goto out' skips the mana_destroy_eq(ac) call, leaking the event queues allocated earlier by mana_create_eq(). This can happen when mana_probe_port() fails for port 0, leaving ac->ports[0] as NULL. On driver unload or error cleanup, mana_remove() hits the NULL entry and jumps past mana_destroy_eq(). Change 'goto out' to 'break' so the for-loop exits normally and mana_destroy_eq() is always reached. Remove the now-unreferenced out: label. Fixes: `1e2d0824a9` ("net: mana: Add support for EQ sharing") Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com> Link: https://patch.msgid.link/20260420124741.1056179-6-ernis@linux.microsoft.com Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-04-23 12:49:13 +02:00
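The control-flow change in this commit can be illustrated with a toy model. The sketch below is not driver code: `fake_ctx` and `fake_remove` are hypothetical stand-ins, and the per-port teardown and event-queue release are reduced to simple field updates.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PORTS 4

/* Hypothetical, heavily simplified stand-in for the driver context. */
struct fake_ctx {
	int *ports[MAX_PORTS];
	int num_ports;
	int eq_allocated;
};

static void fake_remove(struct fake_ctx *ac)
{
	int i;

	assert(ac != NULL && ac->num_ports <= MAX_PORTS);

	for (i = 0; i < ac->num_ports; i++) {
		if (!ac->ports[i])
			break;	/* a jump past the cleanup below would leak the EQs */
		ac->ports[i] = NULL;	/* stands in for per-port teardown */
	}

	ac->eq_allocated = 0;	/* stands in for mana_destroy_eq(ac) */
}
```

With `break`, the shared cleanup after the loop runs even when the first port entry is NULL, which is the property the fix relies on.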
Erni Sri Satya Vennela	a7fdaf069b	net: mana: Don't overwrite port probe error with add_adev result In mana_probe(), if mana_probe_port() fails for any port, the error is stored in 'err' and the loop breaks. However, the subsequent unconditional 'err = add_adev(gd, "eth")' overwrites this error. If add_adev() succeeds, mana_probe() returns success despite ports being left in a partially initialized state (ac->ports[i] == NULL). Only call add_adev() when there is no prior error, so the probe correctly fails and triggers mana_remove() cleanup. Fixes: `a69839d432` ("net: mana: Add support for auxiliary device") Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com> Link: https://patch.msgid.link/20260420124741.1056179-5-ernis@linux.microsoft.com Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-04-23 12:49:13 +02:00
Erni Sri Satya Vennela	50271d7ec9	net: mana: Guard mana_remove against double invocation If PM resume fails (e.g., mana_attach() returns an error), mana_probe() calls mana_remove(), which tears down the device and sets gd->gdma_context = NULL and gd->driver_data = NULL. However, a failed resume callback does not automatically unbind the driver. When the device is eventually unbound, mana_remove() is invoked a second time. Without a NULL check, it dereferences gc->dev with gc == NULL, causing a kernel panic. Add an early return if gdma_context or driver_data is NULL so the second invocation is harmless. Move the dev = gc->dev assignment after the guard so it cannot dereference NULL. Fixes: `635096a86e` ("net: mana: Support hibernation and kexec") Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com> Link: https://patch.msgid.link/20260420124741.1056179-4-ernis@linux.microsoft.com Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-04-23 12:49:13 +02:00
Erni Sri Satya Vennela	6e8bc03349	net: mana: Init gf_stats_work before potential error paths in probe Move INIT_DELAYED_WORK(gf_stats_work) to before mana_create_eq(), while keeping schedule_delayed_work() at its original location. Previously, if any function between mana_create_eq() and the INIT_DELAYED_WORK call failed, mana_probe() would call mana_remove(), which unconditionally calls cancel_delayed_work_sync(gf_stats_work) on the still-uninitialized work struct. This can trigger a WARN_ON in __flush_work() or debug object warnings with CONFIG_DEBUG_OBJECTS_WORK enabled. Fixes: `be4f1d67ec` ("net: mana: Add standard counter rx_missed_errors") Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com> Link: https://patch.msgid.link/20260420124741.1056179-3-ernis@linux.microsoft.com Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-04-23 12:49:13 +02:00
Erni Sri Satya Vennela	cb4a90744b	net: mana: Init link_change_work before potential error paths in probe Move INIT_WORK(link_change_work) to right after the mana_context allocation, before any error path that could reach mana_remove(). Previously, if mana_create_eq() or mana_query_device_cfg() failed, mana_probe() would jump to the error path which calls mana_remove(). mana_remove() unconditionally calls disable_work_sync(link_change_work), but the work struct had not been initialized yet. This can trigger a WARN_ON in __flush_work() or debug object warnings with CONFIG_DEBUG_OBJECTS_WORK enabled. Fixes: `54133f9b4b` ("net: mana: Support HW link state events") Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com> Link: https://patch.msgid.link/20260420124741.1056179-2-ernis@linux.microsoft.com Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-04-23 12:49:13 +02:00
Breno Leitao	7079c8c13f	netconsole: avoid out-of-bounds access on empty string in trim_newline() trim_newline() unconditionally dereferences s[len - 1] after computing len = strnlen(s, maxlen). When the string is empty, len is 0 and the expression underflows to s[(size_t)-1], reading (and potentially writing) one byte before the buffer. The two callers feed trim_newline() with the result of strscpy() from configfs store callbacks (dev_name_store, userdatum_value_store). configfs guarantees count >= 1 reaches the callback, but the byte itself can be NUL: a userspace write(fd, "\0", 1) leaves the destination empty after strscpy() and triggers the underflow. The OOB write only fires if the adjacent byte happens to be '\n', so this is not a security issue, but the access is undefined behaviour either way. This pattern is commonly flagged by LLM-based code reviewers. While it is not a security fix, the underlying access is undefined behaviour and the change is small and self-contained, so it is a reasonable candidate for the stable trees. Guard the dereference on a non-zero length. Fixes: `ae001dc679` ("net: netconsole: move newline trimming to function") Cc: stable@vger.kernel.org Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Gustavo Luiz Duarte <gustavold@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260420-netcons_trim_newline-v1-1-dc35889aeedf@debian.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-04-23 12:45:02 +02:00
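The guarded dereference described in this commit can be sketched in userspace C as follows. This is an illustration under the assumption that the helper strips a single trailing newline; it mirrors the commit text rather than quoting the kernel source, and the `assert()` is added only for the sketch.

```c
#define _POSIX_C_SOURCE 200809L	/* for strnlen() */
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Userspace sketch of the guarded helper: strip one trailing newline. */
static void trim_newline(char *s, size_t maxlen)
{
	size_t len;

	assert(s != NULL);

	len = strnlen(s, maxlen);
	/* Only index s[len - 1] when the string is non-empty. */
	if (len && s[len - 1] == '\n')
		s[len - 1] = '\0';
}
```

Without the `len &&` guard, an empty string makes `len - 1` wrap around to `(size_t)-1`, so the code would inspect, and possibly overwrite, the byte just before the buffer.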
Paolo Abeni	063571ab9f	Merge branch 'net-airoha-fix-null-pointer-derefrences-in-airoha_qdma_cleanup' Lorenzo Bianconi says: ==================== net: airoha: Fix NULL pointer dereferences in airoha_qdma_cleanup() Fix two possible NULL pointer dereferences in the airoha_qdma_cleanup() routine if airoha_qdma_init() fails. v1: https://lore.kernel.org/r/20260417-airoha_qdma_init_rx_queue-fix-v1-0-db9fa5e468e5@kernel.org ==================== Link: https://patch.msgid.link/20260420-airoha_qdma_init_rx_queue-fix-v2-0-d99347e5c18d@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-04-23 12:21:12 +02:00
Lorenzo Bianconi	4b91cb6578	net: airoha: Add size check for TX NAPIs in airoha_qdma_cleanup() If airoha_qdma_init() fails before airoha_qdma_tx_irq_init() has run successfully for all TX NAPIs, airoha_qdma_cleanup() unconditionally runs netif_napi_del() on the TX NAPIs, triggering a NULL pointer dereference. Fix the issue by relying on the q_tx_irq size value to check whether the TX NAPIs are properly initialized in airoha_qdma_cleanup(). Moreover, run netif_napi_add_tx() only if the irq_q queue is properly allocated. Fixes: `23020f0493` ("net: airoha: Introduce ethernet support for EN7581 SoC") Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://patch.msgid.link/20260420-airoha_qdma_init_rx_queue-fix-v2-2-d99347e5c18d@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-04-23 12:17:35 +02:00
Lorenzo Bianconi	379050947a	net: airoha: Move ndesc initialization at end of airoha_qdma_init_rx_queue() If queue entry or DMA descriptor list allocation fails in airoha_qdma_init_rx_queue(), airoha_qdma_cleanup() will trigger a NULL pointer dereference when running netif_napi_del() on the RX queue NAPIs, since netif_napi_add() has never been executed for this particular RX NAPI. The issue is due to the early ndesc initialization in airoha_qdma_init_rx_queue(), since airoha_qdma_cleanup() relies on the ndesc value to check whether the queue is properly initialized. Fix the issue by moving the ndesc initialization to the end of airoha_qdma_init_rx_queue(). Move the page_pool allocation after the descriptor list allocation in order to avoid memory leaks if desc allocation fails. Fixes: `23020f0493` ("net: airoha: Introduce ethernet support for EN7581 SoC") Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://patch.msgid.link/20260420-airoha_qdma_init_rx_queue-fix-v2-1-d99347e5c18d@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-04-23 12:17:35 +02:00
Longxuan Yu	7dddc74af3	8021q: delete cleared egress QoS mappings vlan_dev_set_egress_priority() currently keeps cleared egress priority mappings in the hash as tombstones. Repeated set/clear cycles with distinct skb priorities therefore accumulate mapping nodes until device teardown and leak memory. Delete mappings when vlan_prio is cleared instead of keeping tombstones. Now that the egress mapping lists are RCU protected, the node can be unlinked safely and freed after a grace period. Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Cc: stable@kernel.org Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Co-developed-by: Yuan Tan <yuantan098@gmail.com> Signed-off-by: Yuan Tan <yuantan098@gmail.com> Signed-off-by: Longxuan Yu <ylong030@ucr.edu> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn> Link: https://patch.msgid.link/ecfa6f6ce2467a42647ff4c5221238ae85b79a59.1776647968.git.yuantan098@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-04-23 12:13:57 +02:00
Longxuan Yu	fc69decc81	8021q: use RCU for egress QoS mappings The TX fast path and reporting paths walk egress QoS mappings without RTNL. Convert the mapping lists to RCU-protected pointers, use RCU reader annotations in readers, and defer freeing mapping nodes with an embedded rcu_head. This prepares the egress QoS mapping code for safe removal of mapping nodes in a follow-up change while preserving the current behavior. Co-developed-by: Yuan Tan <yuantan098@gmail.com> Signed-off-by: Yuan Tan <yuantan098@gmail.com> Signed-off-by: Longxuan Yu <ylong030@ucr.edu> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn> Link: https://patch.msgid.link/9136768189f8c6d3f824f476c62d2fa1111688e8.1776647968.git.yuantan098@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-04-23 12:13:57 +02:00
Paolo Abeni	5a5db99c34	netfilter pull request 26-04-20 Merge tag 'nf-26-04-20' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Pablo Neira Ayuso says: ==================== Netfilter/IPVS fixes for net The following batch contains Netfilter/IPVS fixes for net: 1) nft_osf actually only supports IPv4, restrict it. 2) Address possible division by zero in nfnetlink_osf, from Xiang Mei. 3) Remove unsafe use of sprintf to fix possible buffer overflow in the SIP NAT helper, from Florian Westphal. 4) Restrict xt_mac, xt_owner and xt_physdev to inet families only; xt_realm is only for ipv4, otherwise null-pointer-deref is possible. 5) Use kfree_rcu() in nat core to release hooks, this can be an issue once nfnetlink_hook gets support to dump NAT hook information, not currently a real issue but better fix it now. From Florian Westphal. 6) Fix MTU checks in IPVS, from Yingnan Zhang. 7) Fix possible out-of-bounds when matching TCP options in nfnetlink_osf, from Fernando Fernandez Mancera. 8) Fix potential nul-ptr-deref in ttl check in nfnetlink_osf, remove useless loop to fix this, also from Fernando.
This is a smaller batch, there are more patches pending in the queue to arm another pull request as soon as this is considered good enough. AI might complain again about one more issue regarding osf and big-endian arches in osf but this batch is targeting crash fixes for osf at this stage. netfilter pull request 26-04-20 * tag 'nf-26-04-20' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: nfnetlink_osf: fix potential NULL dereference in ttl check netfilter: nfnetlink_osf: fix out-of-bounds read on option matching ipvs: fix MTU check for GSO packets in tunnel mode netfilter: nat: use kfree_rcu to release ops netfilter: xtables: restrict several matches to inet family netfilter: conntrack: remove sprintf usage netfilter: nfnetlink_osf: fix divide-by-zero in OSF_WSS_MODULO netfilter: nft_osf: restrict it to ipv4 ==================== Link: https://patch.msgid.link/20260420220215.111510-1-pablo@netfilter.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-04-23 11:20:38 +02:00
Mieczyslaw Nalewaj	0c078021d3	net: dsa: realtek: rtl8365mb: fix mode mask calculation The RTL8365MB_DIGITAL_INTERFACE_SELECT_MODE_MASK macro was shifting the 4-bit mask (0xF) by only (_extint % 2) bits instead of (_extint % 2) * 4. This caused the mask to overlap with the adjacent nibble when configuring odd-numbered external interfaces, selecting the wrong bits entirely. Align the shift calculation with the existing ...MODE_OFFSET macro. Fixes: `4af2950c50` ("net: dsa: realtek-smi: add rtl8365mb subdriver for RTL8365MB-VC") Signed-off-by: Abdulkader Alrezej <alrazj.abdulkader@gmail.com> Signed-off-by: Mieczyslaw Nalewaj <namiltd@yahoo.com> Reviewed-by: Luiz Angelo Daros de Luca <luizluca@gmail.com> Link: https://patch.msgid.link/400a6387-a444-4576-af6d-26be5410bce3@yahoo.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-04-23 10:50:33 +02:00
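The arithmetic behind this fix is easy to check by hand. The sketch below uses hypothetical macro names (`MODE_OFFSET`, `MODE_MASK_BUGGY`, `MODE_MASK_FIXED`) as stand-ins for the driver's macros, and assumes two 4-bit mode fields packed into one register, as the commit message describes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the driver macros: two 4-bit fields per register. */
#define MODE_OFFSET(extint)	(((extint) % 2) * 4)
#define MODE_MASK_BUGGY(extint)	(0xFu << ((extint) % 2))	/* shift by 0 or 1 bit */
#define MODE_MASK_FIXED(extint)	(0xFu << MODE_OFFSET(extint))	/* shift by 0 or 4 bits */

/* Place a 4-bit mode value into the field selected by extint. */
static uint32_t set_mode(uint32_t reg, unsigned int extint, uint32_t mode)
{
	assert(mode <= 0xFu);

	reg &= ~MODE_MASK_FIXED(extint);
	reg |= mode << MODE_OFFSET(extint);
	return reg;
}
```

For an odd interface the buggy mask evaluates to 0x1E, which straddles both nibbles, while the fixed mask is 0xF0 and lines up with the offset used to place the value.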
Paolo Abeni	084a39af97	Merge branch 'net-airoha-fix-airoha_qdma_cleanup_tx_queue-processing' Lorenzo Bianconi says: ==================== net: airoha: Fix airoha_qdma_cleanup_tx_queue() processing Add missing bits in airoha_qdma_cleanup_tx_queue routine. Fix airoha_qdma_cleanup_tx_queue processing errors introduced in commit '3f47e67dff1f7 ("net: airoha: Add the capability to consume out-of-order DMA tx descriptors")'. v3: https://lore.kernel.org/r/20260416-airoha_qdma_cleanup_tx_queue-fix-net-v3-0-2b69f5788580@kernel.org v2: https://lore.kernel.org/r/20260414-airoha_qdma_cleanup_tx_queue-fix-net-v2-1-875de57cc022@kernel.org v1: https://lore.kernel.org/r/20260410-airoha_qdma_cleanup_tx_queue-fix-net-v1-1-b7171c8f1e78@kernel.org ==================== Link: https://patch.msgid.link/20260417-airoha_qdma_cleanup_tx_queue-fix-net-v4-0-e04bcc2c9642@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-04-23 09:08:00 +02:00
Lorenzo Bianconi	3309965fe4	net: airoha: Add missing bits in airoha_qdma_cleanup_tx_queue() Similar to airoha_qdma_cleanup_rx_queue(), reset DMA TX descriptors in airoha_qdma_cleanup_tx_queue routine. Moreover, reset TX_DMA_IDX to TX_CPU_IDX to notify the NIC the QDMA TX ring is empty. Fixes: `23020f0493` ("net: airoha: Introduce ethernet support for EN7581 SoC") Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://patch.msgid.link/20260417-airoha_qdma_cleanup_tx_queue-fix-net-v4-2-e04bcc2c9642@kernel.org Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-04-23 09:07:57 +02:00
Lorenzo Bianconi	f329924bb4	net: airoha: Move ndesc initialization at end of airoha_qdma_init_tx() If queue entry list allocation fails in airoha_qdma_init_tx_queue(), airoha_qdma_cleanup_tx_queue() will trigger a NULL pointer dereference when accessing the queue entry array. The issue is due to the early ndesc initialization in airoha_qdma_init_tx_queue(). Fix the issue by moving the ndesc initialization to the end of airoha_qdma_init_tx_queue(). Fixes: `3f47e67dff` ("net: airoha: Add the capability to consume out-of-order DMA tx descriptors") Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://patch.msgid.link/20260417-airoha_qdma_cleanup_tx_queue-fix-net-v4-1-e04bcc2c9642@kernel.org Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-04-23 09:07:57 +02:00
Eric Dumazet	1ada03fdef	net/sched: sch_sfb: annotate data-races in sfb_dump_stats() sfb_dump_stats() only runs with RTNL held, reading fields that can be changed in qdisc fast path. Add READ_ONCE()/WRITE_ONCE() annotations. Alternative would be to acquire the qdisc spinlock, but our long-term goal is to make qdisc dump operations lockless as much as we can. tc_sfb_xstats fields don't need to be latched atomically, otherwise this bug would have been caught earlier. Fixes: `edb09eb17e` ("net: sched: do not acquire qdisc spinlock in qdisc/class stats dump") Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260421141655.3953721-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-22 21:12:59 -07:00
Eric Dumazet	a8f5192809	net/sched: sch_red: annotate data-races in red_dump_stats() red_dump_stats() only runs with RTNL held, reading fields that can be changed in qdisc fast path. Add READ_ONCE()/WRITE_ONCE() annotations. Alternative would be to acquire the qdisc spinlock, but our long-term goal is to make qdisc dump operations lockless as much as we can. tc_red_xstats fields don't need to be latched atomically, otherwise this bug would have been caught earlier. Fixes: `edb09eb17e` ("net: sched: do not acquire qdisc spinlock in qdisc/class stats dump") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20260421142309.3964322-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-22 21:12:54 -07:00
Eric Dumazet	bbfaa73ea6	net/sched: sch_fq_codel: remove data-races from fq_codel_dump_stats() fq_codel_dump_stats() acquires the qdisc spinlock a bit too late. Move this acquisition before we fill st.qdisc_stats with live data. Fixes: `edb09eb17e` ("net: sched: do not acquire qdisc spinlock in qdisc/class stats dump") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20260421142509.3967231-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-22 21:12:50 -07:00
Eric Dumazet	5154561d9b	net/sched: sch_pie: annotate data-races in pie_dump_stats() pie_dump_stats() only runs with RTNL held, reading fields that can be changed in qdisc fast path. Add READ_ONCE()/WRITE_ONCE() annotations. Alternative would be to acquire the qdisc spinlock, but our long-term goal is to make qdisc dump operations lockless as much as we can. tc_pie_xstats fields don't need to be latched atomically, otherwise this bug would have been caught earlier. Fixes: `edb09eb17e` ("net: sched: do not acquire qdisc spinlock in qdisc/class stats dump") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20260421142944.4009941-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-22 21:12:47 -07:00
Eric Dumazet	a6edf2cd41	net_sched: sch_hhf: annotate data-races in hhf_dump_stats() hhf_dump_stats() only runs with RTNL held, reading fields that can be changed in qdisc fast path. Add READ_ONCE()/WRITE_ONCE() annotations. Fixes: `edb09eb17e` ("net: sched: do not acquire qdisc spinlock in qdisc/class stats dump") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20260421143349.4052215-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-22 21:12:40 -07:00
Jakub Kicinski	9d146a5d10	Merge branch 'intel-wired-lan-driver-updates-2026-04-20-ice' Jacob Keller says: ==================== Intel Wired LAN Driver Updates 2026-04-20 (ice) Since this is a set of related fixes for just the ice driver, Jake provides the following description for the series: We recently ran into a nasty corner case issue with a customer operating E825C cards seeing some strange behavior with missing Tx timestamps. During the course of debugging. This series contains a few fixes found during this debugging process. The primary issue discovered in the investigation is a misconfiguration of the E825C PHY timestamp interrupt register, PHY_REG_TS_INT_CONFIG. This register is responsible for programming the Tx timestamp behavior of a PHY port. The driver programs two values here: a threshold for when to interrupt and whether the interrupt is enabled. The threshold value is used by hardware to determine when to trigger a Tx timestamp interrupt. The interrupt cause for the port is raised when the number of outstanding timestamps in the PHY port timestamp memory meets the threshold. The interrupt cause is not cleared until the number of outstanding timestamps drops below the threshold. It is considered a misconfiguration if the threshold is programmed to 0. If the interrupt is enabled while the threshold is zero, hardware will raise the interrupt cause at the next time it checks. Once raised, the interrupt cause for the port will never lower, since you cannot have fewer than zero outstanding timestamps. Worse, the timestamp status for the port will remain high even if the PHY_REG_TS_INT_CONFIG is reprogrammed with a new threshold. The PHY is a separate hardware block from the MAC, and thus the interrupt status for the port will remain high even if you reset the device MAC with a PF reset, CORE reset, or GLOBAL reset. PHY ports are connected together into quads. 
Each quad muxes the PHY interrupt status for the 4 ports on the quad together before connecting that to the MACs miscellaneous interrupt vector. As a result, if a single PHY port in the quad is stuck, no timestamp interrupts will be generated for any timestamp on any port on that quad. The ice driver never directly writes a value of 0 for the threshold. Indeed, the desired behavior is to set the threshold to 1, so that interrupts are generated as soon as a single timestamp is captured. Unfortunately, it turns out that for the E825C PHY, programming the threshold and enable bit in the same write may cause a race in the PHY timestamp block. The PHY may "see" the interrupt as enabled first before it sees the threshold value. If the previous threshold value is zero (such as when the register is initialized to zero at a cold power on), the hardware may race with programming the threshold and set the PHY interrupt status to high as described above. The first patch in this series corrects that programming order, ensuring that the threshold is always written first in a separate transaction from enabling the interrupt bit. Additionally, an explicit check against writing a 0 is added to make it clear to future readers that writing 0 to the threshold while enabling the interrupt is not safe. The PHY timestamp block does not reset with the MAC, and seems to only reset during cold power on. This makes recovery from the faulty configuration difficult. To address this, perform an explicit reset of the PHY PTP block during initialization. This is achieved by writing the PHY_REG_GLOBAL register. This performs a PHY soft reset, which completely resets the timestamp block. This includes clearing the timestamp memory, the PHY timestamp interrupt status, and the PHY PTP counter. A soft reset of all ports on the device is done as part of ice_ptp_init_phc() during early initialization of the PTP functionality by the PTP clock owner, prior to programming each PHY. 
The ice_ptp_init_phc() function is called at driver init and during reinitialization after all forms of device reset. This ensures that the driver begins operation at a clean slate, rather than carrying over the stale and potentially buggy configuration of a previous driver. While attempting to root cause the issue with the PHY timestamp interrupt, we also discovered that the driver incorrectly assumes that it is operating on E822 hardware when reading the PHY timestamp memory status registers in a few places. This includes the check at the end of the interrupt handler, as well as the check done inside the PTP auxiliary function. This prevented the driver from detecting waiting timestamps on ports other than the first two. Finally, the ice_ptp_read_tx_hwstamp_status_eth56g() function was discovered to only read the timestamp interrupt status value from the first quad due to mistaking the port index for a PHY quad index. This resulted in reporting the timestamp status for the second quad as identical to the first quad instead of properly reporting its value. This is a minor fix since the function currently is only used for diagnostic purposes and does not impact driver decision logic. ==================== Link: https://patch.msgid.link/20260420-jk-iwl-net-2026-04-20-ptp-e825c-phy-interrupt-fixes-v1-0-bc2240f42251@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-22 21:10:12 -07:00
Jacob Keller	1f75dbc53f	ice: fix ice_ptp_read_tx_hwtstamp_status_eth56g The ice_ptp_read_tx_hwtstamp_status_eth56g function calls ice_read_phy_eth56g with a PHY index. However the function actually expects a port index. This causes the function to read the wrong PHY_PTP_INT_STATUS registers, and effectively makes the status wrong for the second set of ports from 4 to 7. The ice_read_phy_eth56g function uses the provided port index to determine which PHY device to read. We could refactor the entire chain to take a PHY index, but this would impact many code sites. Instead, multiply the PHY index by the number of ports, so that we read from the first port of each PHY. Fixes: `7cab44f1c3` ("ice: Introduce ETH56G PHY model for E825C products") Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Petr Oros <poros@redhat.com> Tested-by: Sunitha Mekala <sunithax.d.mekala@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260420-jk-iwl-net-2026-04-20-ptp-e825c-phy-interrupt-fixes-v1-4-bc2240f42251@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-22 21:10:10 -07:00
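The index mix-up described in the commit above can be illustrated with a small userspace model. This is only a sketch: `PORTS_PER_PHY = 4` is an assumption inferred from the "ports from 4 to 7" wording, and `read_phy_reg`, `status_phy_buggy` and `status_phy_fixed` are hypothetical names, not driver functions.

```python
# Hypothetical model of a register reader that selects the PHY from a
# *port* index. PORTS_PER_PHY = 4 is an assumption, not taken from the driver.
PORTS_PER_PHY = 4


def read_phy_reg(port: int) -> int:
    """Return the index of the PHY that an access for `port` lands on."""
    return port // PORTS_PER_PHY


def status_phy_buggy(phy: int) -> int:
    # The PHY index is passed where a port index is expected.
    return read_phy_reg(phy)


def status_phy_fixed(phy: int) -> int:
    # Multiply by the number of ports so the first port of each PHY is read.
    return read_phy_reg(phy * PORTS_PER_PHY)
```

In this model, asking for the status of PHY 1 with the buggy variant still lands on PHY 0, which matches the symptom of the second set of ports reporting the first set's status.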
Jacob Keller	359dc1d413	ice: fix ready bitmap check for non-E822 devices The E800 hardware (apart from E810) has a ready bitmap for the PHY indicating which timestamp slots currently have an outstanding timestamp waiting to be read by software. This bitmap is checked in multiple places using the ice_get_phy_tx_tstamp_ready(): * ice_ptp_process_tx_tstamp() calls it to determine which timestamps to attempt reading from the PHY * ice_ptp_tx_tstamps_pending() calls it in a loop at the end of the miscellaneous IRQ to check if new timestamps came in while the interrupt handler was executing. * ice_ptp_maybe_trigger_tx_interrupt() calls it in the auxiliary work task to trigger a software interrupt in the event that the hardware logic gets stuck. For E82X devices, multiple PHYs share the same block, and the parameter passed to the ready bitmap is a block number associated with the given port. For E825-C devices, the PHYs have their own independent blocks and do not share, so the parameter passed needs to be the port number. For E810 devices, the ice_get_phy_tx_tstamp_ready() always returns all 1s regardless of what port, since this hardware does not have a ready bitmap. Finally, for E830 devices, each PF has its own ready bitmap accessible via register, and the block parameter is unused. The first call correctly uses the Tx timestamp tracker block parameter to check the appropriate timestamp block. This works because the tracker is setup correctly for each timestamp device type. The second two callers behave incorrectly for all device types other than the older E822 devices. They both iterate in a loop using ICE_GET_QUAD_NUM() which is a macro only used by E822 devices. This logic is incorrect for devices other than the E822 devices. For E810 the calls would always return true, causing E810 devices to always attempt to trigger a software interrupt even when they have no reason to. 
For E830, this results in duplicate work as the ready bitmap is checked once per number of quads. Finally, for E825-C, this results in the pending checks failing to detect timestamps on ports other than the first two. Fix this by introducing a new hardware API function to ice_ptp_hw.c, ice_check_phy_tx_tstamp_ready(). This function will check if any timestamps are available and returns a positive value if any timestamps are pending. For E810, the function always returns false, so that the re-trigger checks never happen. For E830, check the ready bitmap just once. For E82x hardware, check each quad. Finally, for E825-C, check every port. The interface function returns an integer to enable reporting of error code if the driver is unable read the ready bitmap. This enables callers to handle this case properly. The previous implementation assumed that timestamps are available if they failed to read the bitmap. This is problematic as it could lead to continuous software IRQ triggering if the PHY timestamp registers somehow become inaccessible. This change is especially important for E825-C devices, as the missing checks could leave a window open where a new timestamp could arrive while the existing timestamps aren't completed. As a result, the hardware threshold logic would not trigger a new interrupt. Without the check, the timestamp is left unhandled, and new timestamps will not cause an interrupt again until the timestamp is handled. Since both the interrupt check and the backup check in the auxiliary task do not function properly, the device may have Tx timestamps permanently stuck failing on a given port. The faulty checks originate from commit `d938a8cca8` ("ice: Auxbus devices & driver for E822 TS") and commit `712e876371` ("ice: periodically kick Tx timestamp interrupt"), however at the time of the original coding, both functions only operated on E822 hardware. 
This is no longer the case, and hasn't been since the introduction of the ETH56G PHY model in commit `7cab44f1c3` ("ice: Introduce ETH56G PHY model for E825C products") Fixes: `7cab44f1c3` ("ice: Introduce ETH56G PHY model for E825C products") Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Petr Oros <poros@redhat.com> Tested-by: Sunitha Mekala <sunithax.d.mekala@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260420-jk-iwl-net-2026-04-20-ptp-e825c-phy-interrupt-fixes-v1-3-bc2240f42251@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-22 21:10:10 -07:00
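The per-device behaviour described for the new ready check can be sketched as a userspace model. Everything here is illustrative: the `Dev` enum, the `read_bitmap` callback, the `-1` error value and the function name `check_tx_tstamp_ready` are assumptions for the sketch, not the driver's actual interface.

```python
from enum import Enum, auto
from typing import Callable


class Dev(Enum):
    E810 = auto()
    E82X = auto()
    E825C = auto()
    E830 = auto()


def check_tx_tstamp_ready(dev: Dev, read_bitmap: Callable[[int], int],
                          num_quads: int = 0, num_ports: int = 0) -> int:
    """Return >0 if any timestamp is pending, 0 if none, <0 on read failure."""
    if dev is Dev.E810:
        return 0                      # no ready bitmap: never re-trigger
    if dev is Dev.E830:
        blocks = range(1)             # one bitmap per PF, checked once
    elif dev is Dev.E82X:
        blocks = range(num_quads)     # one shared block per quad
    else:
        blocks = range(num_ports)     # E825-C: one block per port
    for block in blocks:
        try:
            bitmap = read_bitmap(block)
        except OSError:
            return -1                 # hypothetical error code, reported to caller
        if bitmap:
            return 1
    return 0
```

Returning an error instead of assuming "pending" lets the caller decide what to do when the bitmap is unreadable, rather than retriggering the software interrupt indefinitely.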
Grzegorz Nitka	3ec46e157c	ice: perform PHY soft reset for E825C ports at initialization In some cases the PHY timestamp block of the E825C can become stuck. This is known to occur if the software writes 0 to the Tx timestamp threshold, and with older versions of the ice driver the threshold configuration is buggy and can race in such that hardware briefly operates with a zero threshold enabled. There are no other known ways to trigger this behavior, but once it occurs, the hardware is not recovered by normal reset, a driver reload, or even a warm power cycle of the system. A cold power cycle is sufficient to recover hardware, but this is extremely invasive and can result in significant downtime on customer deployments. The PHY for each port has a timestamping block which has its own reset functionality accessible by programming the PHY_REG_GLOBAL register. Writing to the PHY_REG_GLOBAL_SOFT_RESET_BIT triggers the hardware to perform a complete reset of the timestamping block of the PHY. This includes clearing the timestamp status for the port, clearing all outstanding timestamps in the memory bank, and resetting the PHY timer. The new ice_ptp_phy_soft_reset_eth56g() function toggles the PHY_REG_GLOBAL soft reset bit with the required delays, ensuring the PHY is properly reinitialized without requiring a full device reset. The sequence clears the reset bit, asserts it, then clears it again, with short waits between transitions to allow hardware stabilization. Call this function in the new ice_ptp_init_phc_e825c(), implementing the E825C device specific variant of the ice_ptp_init_phc(). Note that if ice_ptp_init_phc() fails, PTP functionality may be disabled, but the driver will still load to allow basic functionality to continue. This causes the clock owning PF driver to perform a PHY soft reset for every port during initialization. This ensures the driver begins life in a known functional state regardless of how it was previously programmed. 
This ensures that we properly reconfigure the hardware after a device reset or when loading the driver, even if it was previously misconfigured with an out-of-date or modified driver. Fixes: `7cab44f1c3` ("ice: Introduce ETH56G PHY model for E825C products") Signed-off-by: Timothy Miskell <timothy.miskell@intel.com> Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Petr Oros <poros@redhat.com> Tested-by: Sunitha Mekala <sunithax.d.mekala@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260420-jk-iwl-net-2026-04-20-ptp-e825c-phy-interrupt-fixes-v1-2-bc2240f42251@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-22 21:10:10 -07:00
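The clear, assert, clear sequence described above can be modelled in a few lines. This is a sketch under stated assumptions: the bit position, the `FakePhyPort` class and the rule that the block becomes clean when the bit is released are all hypothetical, chosen only to make the ordering visible.

```python
SOFT_RESET_BIT = 1 << 0  # hypothetical bit position


class FakePhyPort:
    """Toy stand-in for one PHY port's timestamp block."""

    def __init__(self):
        self.global_reg = 0
        self.stuck = True            # pretend the block is wedged
        self.pending = [0x1234]      # stale timestamp in the memory bank
        self.writes = []

    def write_global(self, value: int) -> None:
        was_set = bool(self.global_reg & SOFT_RESET_BIT)
        self.global_reg = value
        self.writes.append(value)
        # Assumption of this model: the block comes back clean when the
        # reset bit is released after having been asserted.
        if was_set and not (value & SOFT_RESET_BIT):
            self.stuck = False
            self.pending.clear()


def phy_soft_reset(port: FakePhyPort, wait=lambda: None) -> None:
    """Clear the reset bit, assert it, clear it again, waiting in between."""
    for bit in (0, SOFT_RESET_BIT, 0):
        port.write_global((port.global_reg & ~SOFT_RESET_BIT) | bit)
        wait()
```

The `wait` callback stands in for the short delays between transitions; a real implementation would use whatever delay the hardware requires.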
Grzegorz Nitka	c0a575a801	ice: fix timestamp interrupt configuration for E825C The E825C ice_phy_cfg_intr_eth56g() function is responsible for programming the PHY interrupt for a given port. This function writes to the PHY_REG_TS_INT_CONFIG register of the port. The register is responsible for configuring whether the port interrupt logic is enabled, as well as programming the threshold of waiting timestamps that will trigger an interrupt from this port. This threshold value must not be programmed to zero while the interrupt is enabled. Doing so puts the port in a misconfigured state where the PHY timestamp interrupt for the quad of connected ports will become stuck. This occurs, because a threshold of zero results in the timestamp interrupt status for the port becoming stuck high. The four ports in the connected quad have their timestamp status indicators muxed together. A new interrupt cannot be generated until the timestamp status indicators return low for all four ports. Normally, the timestamp status for a port will clear once there are fewer timestamps in that ports timestamp memory bank than the threshold. A threshold of zero makes this impossible, so the timestamp status for the port does not clear. The ice driver never intentionally programs the threshold to zero, indeed the driver always programs it to a value of 1, intending to get an interrupt immediately as soon as even a single packet is waiting for a timestamp. However, there is a subtle flaw in the programming logic in the ice_phy_cfg_intr_eth56g() function. Due to the way that the hardware handles enabling the PHY interrupt. If the threshold value is modified at the same time as the interrupt is enabled, the HW PHY state machine might enable the interrupt before the new threshold value is actually updated. 
This leaves a potential race condition caused by the hardware logic where a PHY timestamp interrupt might be triggered before the non-zero threshold is written, resulting in the PHY timestamp logic becoming stuck. Once the PHY timestamp status is stuck high, it will remain stuck even after attempting to reprogram the PHY block by changing its threshold or disabling the interrupt. Even a typical PF or CORE reset will not reset the particular block of the PHY that becomes stuck. Even a warm power cycle is not guaranteed to cause the PHY block to reset, and a cold power cycle is required. Prevent this by always writing the PHY_REG_TS_INT_CONFIG in two stages. First write the threshold value with the interrupt disabled, and only write the enable bit after the threshold has been programmed. When disabling the interrupt, leave the threshold unchanged. Additionally, re-read the register after writing it to guarantee that the write to the PHY has been flushed upon exit of the function. While we're modifying this function implementation, explicitly reject programming a threshold of 0 when enabling the interrupt. No caller does this today, but the consequences of doing so are significant. An explicit rejection in the code makes this clear. Fixes: `7cab44f1c3` ("ice: Introduce ETH56G PHY model for E825C products") Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Petr Oros <poros@redhat.com> Tested-by: Sunitha Mekala <sunithax.d.mekala@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260420-jk-iwl-net-2026-04-20-ptp-e825c-phy-interrupt-fixes-v1-1-bc2240f42251@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-22 21:10:10 -07:00
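The race and the two-stage fix can be made concrete with a toy register model. This is a sketch, not the driver code: `FakeTsIntConfig` models the worst case in which hardware observes the enable bit before the new threshold, and the function names are hypothetical.

```python
class FakeTsIntConfig:
    """Toy model of the interrupt config register (worst-case ordering)."""

    def __init__(self):
        self.threshold = 0      # value after a cold power-on
        self.enabled = False
        self.stuck = False

    def write(self, threshold: int, enable: bool) -> None:
        # Modelled race: the enable bit takes effect before the threshold.
        self.enabled = enable
        if self.enabled and self.threshold == 0:
            self.stuck = True   # status goes high and never clears
        self.threshold = threshold

    def read(self):
        return (self.threshold, self.enabled)


def cfg_intr_single_write(reg, enable: bool, threshold: int) -> None:
    reg.write(threshold, enable)          # threshold and enable together


def cfg_intr_two_stage(reg, enable: bool, threshold: int) -> None:
    if enable and threshold == 0:
        raise ValueError("threshold 0 is not allowed with interrupt enabled")
    if not enable:
        reg.write(reg.read()[0], False)   # keep the existing threshold
    else:
        reg.write(threshold, False)       # stage 1: threshold only
        reg.write(threshold, True)        # stage 2: enable
    reg.read()                            # read back to flush the write
```

With the single combined write the model ends up stuck from a cold start, while the two-stage variant never enables the interrupt with a zero threshold in place.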
Michael Bommarito	c88eb7e8d8	net/rds: zero per-item info buffer before handing it to visitors rds_for_each_conn_info() and rds_walk_conn_path_info() both hand a caller-allocated on-stack u64 buffer to a per-connection visitor and then copy the full item_len bytes back to user space via rds_info_copy() regardless of how much of the buffer the visitor actually wrote. rds_ib_conn_info_visitor() and rds6_ib_conn_info_visitor() only write a subset of their output struct when the underlying rds_connection is not in state RDS_CONN_UP (src/dst addr, tos, sl and the two GIDs via explicit memsets). Several u32 fields (max_send_wr, max_recv_wr, max_send_sge, rdma_mr_max, rdma_mr_size, cache_allocs) and the 2-byte alignment hole between sl and cache_allocs remain as whatever stack contents preceded the visitor call and are then memcpy_to_user()'d out to user space. struct rds_info_rdma_connection and struct rds6_info_rdma_connection are the only rds_info_* structs in include/uapi/linux/rds.h that are not marked __attribute__((packed)), so they have a real alignment hole. The other info visitors (rds_conn_info_visitor, rds6_conn_info_visitor, rds_tcp_tc_info, ...) write all fields of their packed output struct today and are not known to be vulnerable, but a future visitor that adds a conditional write-path would have the same bug. Reproduction on a kernel built without CONFIG_INIT_STACK_ALL_ZERO=y: a local unprivileged user opens AF_RDS, sets SO_RDS_TRANSPORT=IB, binds to a local address on an RDMA-capable netdev (rxe soft-RoCE on any netdev is sufficient), sendto()'s any peer on the same subnet (fails cleanly but installs an rds_connection in the global hash in RDS_CONN_CONNECTING), then calls getsockopt(SOL_RDS, RDS_INFO_IB_CONNECTIONS). The returned 68-byte item contains 26 bytes of stack garbage including kernel text/data pointers: 0..7 0a 63 00 01 0a 63 00 02 src=10.99.0.1 dst=10.99.0.2 8..39 00 ... 
gids (memset-zeroed) 40..47 e0 92 a3 81 ff ff ff ff kernel pointer (max_send_wr) 48..55 7f 37 b5 81 ff ff ff ff kernel pointer (rdma_mr_max) 56..59 01 00 08 00 rdma_mr_size (garbage) 60..61 00 00 tos, sl 62..63 00 00 alignment padding 64..67 18 00 00 00 cache_allocs (garbage) Fix by zeroing the per-item buffer in both rds_for_each_conn_info() and rds_walk_conn_path_info() before invoking the visitor. This covers the IPv4/IPv6 IB visitors and hardens all current and future visitors against the same class of bug. No functional change for visitors that fully populate their output. Changes in v2: - retarget at the net tree (subject prefix "[PATCH net v2]", net/rds: prefix in the title) - pick up Reviewed-by tags from Sharath Srinivasan and Allison Henderson Fixes: `ec16227e14` ("RDS/IB: Infiniband transport") Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com> Reviewed-by: Sharath Srinivasan <sharath.srinivasan@oracle.com> Reviewed-by: Allison Henderson <achender@kernel.org> Assisted-by: Claude:claude-opus-4-7 Link: https://patch.msgid.link/20260418141047.3398203-1-michael.bommarito@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-22 21:05:16 -07:00
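The leak pattern and the fix can be reproduced in miniature. The following is a userspace sketch only: the 16-byte item, the `partial_visitor` function and the `walk` helper are hypothetical stand-ins for the on-stack buffer, an info visitor that skips fields when the connection is not up, and the walker that copies the whole item out.

```python
ITEM_LEN = 16
ADDR = b"\x0a\x63\x00\x01"


def partial_visitor(buf: bytearray, conn_up: bool) -> None:
    """Always writes the address; writes the rest only when the conn is up."""
    buf[0:4] = ADDR
    if conn_up:
        buf[4:ITEM_LEN] = bytes(range(1, ITEM_LEN - 3))


def walk(stack_garbage: bytes, conn_up: bool, zero_first: bool) -> bytes:
    buf = bytearray(stack_garbage)        # whatever was on the stack before
    if zero_first:
        buf[:] = bytes(ITEM_LEN)          # the fix: clear before the visitor
    partial_visitor(buf, conn_up)
    return bytes(buf)                     # full item_len is copied out
```

Zeroing in the walker rather than in each visitor means a visitor that later gains a conditional write path is covered as well.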
Andrea Mayer	ade67d5f58	seg6: fix seg6 lwtunnel output redirect for L2 reduced encap mode When SEG6_IPTUN_MODE_L2ENCAP_RED (L2ENCAP_RED) was introduced, the condition in seg6_build_state() that excludes L2 encap modes from setting LWTUNNEL_STATE_OUTPUT_REDIRECT was not updated to account for the new mode. As a consequence, L2ENCAP_RED routes incorrectly trigger seg6_output() on the output path, where the packet is silently dropped because skb_mac_header_was_set() fails on L3 packets. Extend the check to also exclude L2ENCAP_RED, consistent with L2ENCAP. Fixes: `13f0296be8` ("seg6: add support for SRv6 H.L2Encaps.Red behavior") Cc: stable@vger.kernel.org Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it> Reviewed-by: Justin Iurman <justin.iurman@gmail.com> Link: https://patch.msgid.link/20260418162838.31979-1-andrea.mayer@uniroma2.it Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-22 20:32:38 -07:00
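The condition change can be summarised as a small predicate. This is a sketch: the enum mirrors the mode names from the commit message, its numeric values are illustrative, and both function names are hypothetical.

```python
from enum import IntEnum


class Mode(IntEnum):
    # Names follow the SEG6_IPTUN_MODE_* modes; values are illustrative.
    INLINE = 0
    ENCAP = 1
    L2ENCAP = 2
    ENCAP_RED = 3
    L2ENCAP_RED = 4


def sets_output_redirect_old(mode: Mode) -> bool:
    # Only the original L2 mode was excluded.
    return mode != Mode.L2ENCAP


def sets_output_redirect_fixed(mode: Mode) -> bool:
    # Both L2 encap modes are excluded from the output redirect.
    return mode not in (Mode.L2ENCAP, Mode.L2ENCAP_RED)
```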
Xin Long	7c9b012d63	sctp: fix sockets_allocated imbalance after sk_clone() sk_clone() increments sockets_allocated and sets the socket refcount to 2. SCTP performs additional accounting in sctp_clone_sock(), so the clone-time increment must be undone to avoid double counting. Note we cannot simply remove the SCTP-side increment, because the SCTP destroy path in sctp_destroy_sock() only decrements sockets_allocated when sp->ep is set, which may not be true for all failure paths in sctp_clone_sock(). Fixes: `16942cf4d3` ("sctp: Use sk_clone() in sctp_accept().") Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/af8d66f928dec3e9fcbee8d4a85b7d5a6b86f515.1776460180.git.lucien.xin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-22 20:31:24 -07:00
Jakub Kicinski	0db1688072	Merge branch 'bnge-fixes' Vikas Gupta says: ==================== bnge fixes Patch-1: Due to wrong HWRM sequence, driver do not get the correct information regarding resources and capabilities. The patch fixes the initial HWRM sequence. Patch-2: Remove the unsupported backing store type initialization, which is not supported in Thor Ultra devices. ==================== Link: https://patch.msgid.link/20260418023438.1597876-1-vikas.gupta@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-22 20:30:50 -07:00
Vikas Gupta	c6b34add67	bnge: remove unsupported backing store type The backing store type, BNGE_CTX_MRAV, is not applicable in Thor Ultra devices. Remove it from the backing store configuration, as the firmware will not populate entities in this backing store type, due to which the driver load fails. Fixes: `29c5b358f3` ("bng_en: Add backing store support") Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com> Reviewed-by: Dharmender Garg <dharmender.garg@broadcom.com> Link: https://patch.msgid.link/20260418023438.1597876-3-vikas.gupta@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-22 20:30:46 -07:00
Vikas Gupta	70d7c905a0	bnge: fix initial HWRM sequence Firmware may not advertize correct resources if backing store is not enabled before resource information is queried. Fix the initial sequence of HWRMs so that driver gets capabilities and resource information correctly. Fixes: `3fa9e977a0` ("bng_en: Initialize default configuration") Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com> Reviewed-by: Rahul Gupta <rahul-rg.gupta@broadcom.com> Link: https://patch.msgid.link/20260418023438.1597876-2-vikas.gupta@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-22 20:30:46 -07:00
Ariful Islam Shoikot	645d044d7e	docs: maintainer-netdev: fix typo in "targeting" Fix spelling mistake "targgeting" -> "targeting" in maintainer-netdev.rst No functional change. Signed-off-by: Ariful Islam Shoikot <islamarifulshoikat@gmail.com> Reviewed-by: Breno Leitao <leitao@debian.org> Link: https://patch.msgid.link/20260420114554.1026-1-islamarifulshoikat@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-22 20:16:58 -07:00
Bingquan Chen	2c054e17d9	net/packet: fix TOCTOU race on mmap'd vnet_hdr in tpacket_snd() In tpacket_snd(), when PACKET_VNET_HDR is enabled, vnet_hdr points directly into the mmap'd TX ring buffer shared with userspace. The kernel validates the header via __packet_snd_vnet_parse() but then re-reads all fields later in virtio_net_hdr_to_skb(). A concurrent userspace thread can modify the vnet_hdr fields between validation and use, bypassing all safety checks. The non-TPACKET path (packet_snd()) already correctly copies vnet_hdr to a stack-local variable. All other vnet_hdr consumers in the kernel (tun.c, tap.c, virtio_net.c) also use stack copies. The TPACKET TX path is the only caller of virtio_net_hdr_to_skb() that reads directly from user-controlled shared memory. Fix this by copying vnet_hdr from the mmap'd ring buffer to a stack-local variable before validation and use, consistent with the approach used in packet_snd() and all other callers. Fixes: `1d036d25e5` ("packet: tpacket_snd gso and checksum offload") Signed-off-by: Bingquan Chen <patzilla007@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20260418112006.78823-1-patzilla007@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-22 20:16:34 -07:00
Kohei Enju	3bfcf39608	net: validate skb->napi_id in RX tracepoints Since commit `2bd82484bb` ("xps: fix xps for stacked devices"), skb->napi_id shares storage with sender_cpu. RX tracepoints using net_dev_rx_verbose_template read skb->napi_id directly and can therefore report sender_cpu values as if they were NAPI IDs. For example, on the loopback path this can report 1 as napi_id, where 1 comes from raw_smp_processor_id() + 1 in the XPS path: # bpftrace -e 'tracepoint:net:netif_rx_entry{ print(args->napi_id); }' # taskset -c 0 ping -c 1 ::1 Report only valid NAPI IDs in these tracepoints and use 0 otherwise. Fixes: `2bd82484bb` ("xps: fix xps for stacked devices") Signed-off-by: Kohei Enju <kohei@enjuk.jp> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev> Link: https://patch.msgid.link/20260420105427.162816-1-kohei@enjuk.jp Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-04-22 20:15:50 -07:00
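The filtering idea can be sketched as follows. The assumption here, not stated in the commit message, is that valid NAPI IDs start above the range used for `sender_cpu` (CPU number plus one); `NR_CPUS = 8`, `MIN_NAPI_ID` as defined below and `traced_napi_id` are illustrative.

```python
NR_CPUS = 8                   # hypothetical build-time CPU limit
MIN_NAPI_ID = NR_CPUS + 1     # assumption: NAPI IDs start above sender_cpu values


def traced_napi_id(raw: int) -> int:
    """Report the field only when it can be a NAPI ID, else 0."""
    return raw if raw >= MIN_NAPI_ID else 0
```

In this model the loopback case from the commit message, where the shared field holds 1 (CPU 0 plus one), is reported as 0 rather than as NAPI ID 1.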
Chia-Yu Chang	478ed6b7d2	net/sched: sch_dualpi2: drain both C-queue and L-queue in dualpi2_change() Fix dualpi2_change() to correctly enforce updated limit and memlimit values after a configuration change of the dualpi2 qdisc. Before this patch, dualpi2_change() always attempted to dequeue packets via the root qdisc (C-queue) when reducing backlog or memory usage, and unconditionally assumed that a valid skb will be returned. When traffic classification results in packets being queued in the L-queue while the C-queue is empty, this leads to a NULL skb dereference during limit or memlimit enforcement. This is fixed by first dequeuing from the C-queue path if it is non-empty. Once the C-queue is empty, packets are dequeued directly from the L-queue. Return values from qdisc_dequeue_internal() are checked for both queues. When dequeuing from the L-queue, the parent qdisc qlen and backlog counters are updated explicitly to keep overall qdisc statistics consistent. Fixes: `320d031ad6` ("sched: Struct definition and parsing of dualpi2 qdisc") Reported-by: "Kito Xu (veritas501)" <hxzene@gmail.com> Closes: https://lore.kernel.org/netdev/20260413075740.2234828-1-hxzene@gmail.com/ Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Link: https://patch.msgid.link/20260417152551.71648-1-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-04-21 15:00:39 +02:00
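The drain order described above (C-queue first, then L-queue, never assuming a packet was returned) can be sketched with two deques. This is a userspace model; `enforce_limit` and the queue contents are hypothetical, and counter updates are reduced to the returned list.

```python
from collections import deque


def enforce_limit(c_queue: deque, l_queue: deque, limit: int) -> list:
    """Drop packets until the combined length fits within `limit`."""
    dropped = []
    while len(c_queue) + len(l_queue) > limit:
        if c_queue:
            skb = c_queue.popleft()     # C-queue path first
        elif l_queue:
            skb = l_queue.popleft()     # then directly from the L-queue
        else:
            break                       # nothing left: never use a missing skb
        dropped.append(skb)
    return dropped
```

With an empty C-queue and a populated L-queue, the loop still makes progress instead of dereferencing a packet that was never dequeued.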


1434441 Commits (864ba40c80edae2b98f47d46f2c39399126aa3d6)
