When a pkt pointer acquires AT_PKT_END or BEYOND_PKT_END range from
a comparison, and then, known-constant arithmetic is performed,
adjust_ptr_min_max_vals() copies the stale range via dst_reg->raw =
ptr_reg->raw without clearing the negative reg->range sentinel values.
This lets is_pkt_ptr_branch_taken() choose one branch direction and
skip going through the other. Fix this by clearing negative pkt range
values (that is, AT_PKT_END and BEYOND_PKT_END) after arithmetic on
pkt pointers. This ensures is_pkt_ptr_branch_taken() returns unknown
and both branches are properly verified.
Fixes: 6d94e741a8 ("bpf: Support for pointers beyond pkt_end.")
Reported-by: STAR Labs SG <info@starlabs.sg>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20260409155016.536608-1-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The local subprog pointer in create_jt() and visit_abnormal_return_insn()
was declared static.
It is unconditionally assigned via bpf_find_containing_subprog() before
every use. Thus, the static qualifier serves no purpose and rather creates
confusion. Just remove it.
Fixes: e40f5a6bf8 ("bpf: correct stack liveness for tail calls")
Fixes: 493d9e0d60 ("bpf, x86: add support for indirect jumps")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Anton Protopopov <a.s.protopopov@gmail.com>
Link: https://lore.kernel.org/r/20260408191242.526279-3-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Usage of ld_{abs,ind} instructions got extended into subprogs some time
ago via commit 09b28d76ea ("bpf: Add abnormal return checks."). These
are only allowed in subprograms when the latter are BTF annotated and
have scalar return types.
The code generator in bpf_gen_ld_abs() has an abnormal exit path (r0=0 +
exit) from legacy cBPF times. While the enforcement is on scalar return
types, the verifier must also simulate the path of abnormal exit if the
packet data load via ld_{abs,ind} failed.
This is currently not the case. Fix it by having the verifier simulate
both success and failure paths, and extend it in similar ways as we do
for tail calls. The success path (r0=unknown, continue to next insn) is
pushed onto stack for later validation and the r0=0 and return to the
caller is done on the fall-through side.
Fixes: 09b28d76ea ("bpf: Add abnormal return checks.")
Reported-by: STAR Labs SG <info@starlabs.sg>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20260408191242.526279-2-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Move find_linfo() as bpf_find_linfo() into core.c to allow for its use
in the verifier in subsequent patches.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Mykyta Yatsenko <yatsenko@meta.com>
Link: https://lore.kernel.org/r/20260408021359.3786905-4-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Extract bpf_get_linfo_file_line as its own function so that the logic to
obtain the file, line, and line number for a given program can be shared
in subsequent patches.
Reviewed-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260408021359.3786905-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The verifier currently does not allow overwriting a referenced dynptr's
stack slot to prevent resource leak. This is because referenced dynptr
holds additional resources that requires calling specific helpers to
release. This limitation can be relaxed when there are multiple copies
of the same dynptr. Whether it is the orignial dynptr or one of its
clones, as long as there exists at least one other dynptr with the same
ref_obj_id (to be used to release the reference), its stack slot should
be allowed to be overwritten.
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260406150548.1354271-2-ameryhung@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
When a non-{add,sub} alu op such as xor is performed on a scalar
register that previously had a BPF_ADD_CONST delta, the else path
in adjust_reg_min_max_vals() only clears dst_reg->id but leaves
dst_reg->delta unchanged.
This stale delta can propagate via assign_scalar_id_before_mov()
when the register is later used in a mov. It gets a fresh id but
keeps the stale delta from the old (now-cleared) BPF_ADD_CONST.
This stale delta can later propagate leading to a verifier-vs-
runtime value mismatch.
The clear_id label already correctly clears both delta and id.
Make the else path consistent by also zeroing the delta when id
is cleared. More generally, this introduces a helper clear_scalar_id()
which internally takes care of zeroing. There are various other
locations in the verifier where only the id is cleared. By using
the helper we catch all current and future locations.
Fixes: 98d7ca374b ("bpf: Track delta between "linked" registers.")
Reported-by: STAR Labs SG <info@starlabs.sg>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20260407192421.508817-2-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Consider the case of rX += rX where src_reg and dst_reg are pointers to
the same bpf_reg_state in adjust_reg_min_max_vals(). The latter first
modifies the dst_reg in-place, and later in the delta tracking, the
subsequent is_reg_const(src_reg)/reg_const_value(src_reg) reads the
post-{add,sub} value instead of the original source.
This is problematic since it sets an incorrect delta, which sync_linked_regs()
then propagates to linked registers, thus creating a verifier-vs-runtime
mismatch. Fix it by just skipping this corner case.
Fixes: 98d7ca374b ("bpf: Track delta between "linked" registers.")
Reported-by: STAR Labs SG <info@starlabs.sg>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20260407192421.508817-1-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
When an unqualified kprobe target exists in both vmlinux and a loaded
module, number_of_same_symbols() returns a count greater than 1,
causing kprobe attachment to fail with -EADDRNOTAVAIL even though the
vmlinux symbol is unambiguous.
When no module qualifier is given and the symbol is found in vmlinux,
return the vmlinux-only count without scanning loaded modules. This
preserves the existing behavior for all other cases:
- Symbol only in a module: vmlinux count is 0, falls through to module
scan as before.
- Symbol qualified with MOD:SYM: mod != NULL, unchanged path.
- Symbol ambiguous within vmlinux itself: count > 1 is returned as-is.
Fixes: 926fe783c8 ("tracing/kprobes: Fix symbol counting logic by looking at modules as well")
Fixes: 9d8616034f ("tracing/kprobes: Add symbol counting check when module loads")
Suggested-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@crowdstrike.com>
Link: https://lore.kernel.org/r/20260407203912.1787502-2-andrey.grodzovsky@crowdstrike.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
RCU Tasks Trace grace period implies RCU grace period, and this
guarantee is expected to remain in the future. Only BPF is the user of
this predicate, hence retire the API and clean up all in-tree users.
RCU Tasks Trace is now implemented on SRCU-fast and its grace period
mechanism always has at least one call to synchronize_rcu() as it is
required for SRCU-fast's correctness (it replaces the smp_mb() that
SRCU-fast readers skip). So, RCU-tt GP will always imply RCU GP.
Reviewed-by: Puranjay Mohan <puranjay@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260407162234.785270-1-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
bpf_lsm_task_to_inode() is called under rcu_read_lock() and
bpf_lsm_inet_conn_established() is called from softirq context, so
neither hook can be used by sleepable LSM programs.
Fixes: 423f16108c ("bpf: Augment the set of sleepable LSM hooks")
Reported-by: Quan Sun <2022090917019@std.uestc.edu.cn>
Reported-by: Yinhao Hu <dddddd@hust.edu.cn>
Reported-by: Kaiyan Mei <M202472210@hust.edu.cn>
Reported-by: Dongliang Mu <dzm91@hust.edu.cn>
Closes: https://lore.kernel.org/bpf/3ab69731-24d1-431a-a351-452aafaaf2a5@std.uestc.edu.cn/T/#u
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Link: https://lore.kernel.org/r/20260407122334.344072-1-jiayuan.chen@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
When a pointer to PTR_TO_INSN is dereferenced, the offset field
of the BPF_LDX_MEM instruction can be nonzero. Patch the verifier
to not ignore this field.
Reported-by: Jiyong Yang <ksur673@gmail.com>
Fixes: 493d9e0d60 ("bpf, x86: add support for indirect jumps")
Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Link: https://lore.kernel.org/r/20260406160141.36943-2-a.s.protopopov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Apparently, struct bpf_empty_prog_array exists entirely to populate a
single element of "items" in a global variable. "null_prog" is only
used during the initializer.
None of this is needed; globals will be correctly sized with an array
initializer of a flexible-array member.
So, remove struct bpf_empty_prog_array and adjust the rest of the code,
accordingly.
With these changes, fix the following warnings:
./include/linux/bpf.h:2369:31: warning: structure containing a flexible
array member is not at the end of another structure [-Wflex-array-member-not-at-end]
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Acked-by: Mykyta Yatsenko <yatsenko@meta.com>
Link: https://lore.kernel.org/r/acr7Whmn0br3xeBP@kspp
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Don't reject usage of fixed unaligned offsets for syscall ctx. Tests
will be added in later commits. Unaligned offsets already work for
variable offsets.
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260406194403.1649608-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Allow accessing PTR_TO_CTX with variable offsets in syscall programs.
Fixed offsets are already enabled for all program types that do not
convert their ctx accesses, since the changes we made in the commit
de6c7d99f8 ("bpf: Relax fixed offset check for PTR_TO_CTX"). Note
that we also lift the restriction on passing syscall context into
helpers, which was not permitted before, and passing modified syscall
context into kfuncs.
The structure of check_mem_access can be mostly shared and preserved,
but we must use check_mem_region_access to correctly verify access with
variable offsets.
The check made in check_helper_mem_access is hardened to only allow
PTR_TO_CTX for syscall programs to be passed in as helper memory. This
was the original intention of the existing code anyway, and it makes
little sense for other program types' context to be utilized as a memory
buffer. In case a convincing example presents itself in the future, this
check can be relaxed further.
We also no longer use the last-byte access to simulate helper memory
access, but instead go through check_mem_region_access. Since this no
longer updates our max_ctx_offset, we must do so manually, to keep track
of the maximum offset at which the program ctx may be accessed.
Take care to ensure that when arg_type is ARG_PTR_TO_CTX, we do not
relax any fixed or variable offset constraints around PTR_TO_CTX even in
syscall programs, and require them to be passed unmodified. There are
several reasons why this is necessary. First, if we pass a modified ctx,
then the global subprog's accesses will not update the max_ctx_offset to
its true maximum offset, and can lead to out of bounds accesses. Second,
tail called program (or extension program replacing global subprog) where
their max_ctx_offset exceeds the program they are being called from can
also cause issues. For the latter, unmodified PTR_TO_CTX is the first
requirement for the fix, the second is ensuring max_ctx_offset >= the
program they are being called from, which has to be a separate change
not made in this commit.
All in all, we can hint using arg_type when we expect ARG_PTR_TO_CTX and
make our relaxation around offsets conditional on it.
Drop coverage of syscall tests from verifier_ctx.c temporarily for
negative cases until they are updated in subsequent commits.
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Acked-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260406194403.1649608-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
When a dev-bound-only BPF program (BPF_F_XDP_DEV_BOUND_ONLY) undergoes
JIT compilation with constant blinding enabled (bpf_jit_harden >= 2),
bpf_jit_blind_constants() clones the program. The original prog is then
freed in bpf_jit_prog_release_other(), which updates aux->prog to point
to the surviving clone, but fails to update offload->prog.
This leaves offload->prog pointing to the freed original program. When
the network namespace is subsequently destroyed, cleanup_net() triggers
bpf_dev_bound_netdev_unregister(), which iterates ondev->progs and calls
__bpf_prog_offload_destroy(offload->prog). Accessing the freed prog
causes a page fault:
BUG: unable to handle page fault for address: ffffc900085f1038
Workqueue: netns cleanup_net
RIP: 0010:__bpf_prog_offload_destroy+0xc/0x80
Call Trace:
__bpf_offload_dev_netdev_unregister+0x257/0x350
bpf_dev_bound_netdev_unregister+0x4a/0x90
unregister_netdevice_many_notify+0x2a2/0x660
...
cleanup_net+0x21a/0x320
The test sequence that triggers this reliably is:
1. Set net.core.bpf_jit_harden=2 (echo 2 > /proc/sys/net/core/bpf_jit_harden)
2. Run xdp_metadata selftest, which creates a dev-bound-only XDP
program on a veth inside a netns (./test_progs -t xdp_metadata)
3. cleanup_net -> page fault in __bpf_prog_offload_destroy
Dev-bound-only programs are unique in that they have an offload structure
but go through the normal JIT path instead of bpf_prog_offload_compile().
This means they are subject to constant blinding's prog clone-and-replace,
while also having offload->prog that must stay in sync.
Fix this by updating offload->prog in bpf_jit_prog_release_other(),
alongside the existing aux->prog update. Both are back-pointers to
the prog that must be kept in sync when the prog is replaced.
Fixes: 2b3486bc2d ("bpf: Introduce device-bound XDP programs")
Signed-off-by: MingTao Huang <mintaohuang@tencent.com>
Link: https://lore.kernel.org/r/tencent_BCF692F45859CCE6C22B7B0B64827947D406@qq.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
list_next_entry() never returns NULL -- when the current element is the
last entry it wraps to the list head via container_of(). The subsequent
NULL check is therefore dead code and get_next_key() never returns
-ENOENT for the last element, instead reading storage->key from a bogus
pointer that aliases internal map fields and copying the result to
userspace.
Replace it with list_entry_is_head() so the function correctly returns
-ENOENT when there are no more entries.
Fixes: de9cbbaadb ("bpf: introduce cgroup storage maps")
Reported-by: Xiang Mei <xmei5@asu.edu>
Signed-off-by: Weiming Shi <bestswngs@gmail.com>
Reviewed-by: Sun Jian <sun.jian.kdev@gmail.com>
Acked-by: Paul Chaignon <paul.chaignon@gmail.com>
Link: https://lore.kernel.org/r/20260403132951.43533-2-bestswngs@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
When a BPF_F_LOCK update races with a concurrent delete, the freed
element can be immediately recycled by alloc_htab_elem(). The fast path
in htab_map_update_elem() performs a lockless lookup and then calls
copy_map_value_locked() under the element's spin_lock. If
alloc_htab_elem() recycles the same memory, it overwrites the value
with plain copy_map_value(), without taking the spin_lock, causing
torn writes.
Use copy_map_value_locked() when BPF_F_LOCK is set so the new element's
value is written under the embedded spin_lock, serializing against any
stale lock holders.
Fixes: 96049f3afd ("bpf: introduce BPF_F_LOCK flag")
Reported-by: Aaron Esau <aaron1esau@gmail.com>
Closes: https://lore.kernel.org/all/CADucPGRvSRpkneb94dPP08YkOHgNgBnskTK6myUag_Mkjimihg@mail.gmail.com/
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Link: https://lore.kernel.org/r/20260401-bpf_map_torn_writes-v1-1-782d071c55e7@meta.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The static stack liveness analysis needs to know how many bytes a
helper or kfunc accesses through a stack pointer argument, so it can
precisely mark the affected stack slots as stack 'def' or 'use'.
Add bpf_helper_stack_access_bytes() and bpf_kfunc_stack_access_bytes()
which resolve the access size for a given call argument.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260403024422.87231-7-alexei.starovoitov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add two passes before the main verifier pass:
bpf_compute_const_regs() is a forward dataflow analysis that tracks
register values in R0-R9 across the program using fixed-point
iteration in reverse postorder. Each register is tracked with
a six-state lattice:
UNVISITED -> CONST(val) / MAP_PTR(map_index) /
MAP_VALUE(map_index, offset) / SUBPROG(num) -> UNKNOWN
At merge points, if two paths produce the same state and value for
a register, it stays; otherwise it becomes UNKNOWN.
The analysis handles:
- MOV, ADD, SUB, AND with immediate or register operands
- LD_IMM64 for plain constants, map FDs, map values, and subprogs
- LDX from read-only maps: constant-folds the load by reading the
map value directly via bpf_map_direct_read()
Results that fit in 32 bits are stored per-instruction in
insn_aux_data and bitmasks.
bpf_prune_dead_branches() uses the computed constants to evaluate
conditional branches. When both operands of a conditional jump are
known constants, the branch outcome is determined statically and the
instruction is rewritten to an unconditional jump.
The CFG postorder is then recomputed to reflect new control flow.
This eliminates dead edges so that subsequent liveness analysis
doesn't propagate through dead code.
Also add runtime sanity check to validate that precomputed
constants match the verifier's tracked state.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260403024422.87231-5-alexei.starovoitov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add a pass that sorts subprogs in topological order so that iterating
subprog_topo_order[] walks leaf subprogs first, then their callers.
This is computed as a DFS post-order traversal of the CFG.
The pass runs after check_cfg() to ensure the CFG has been validated
before traversing and after postorder has been computed to avoid
walking dead code.
Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260403024422.87231-3-alexei.starovoitov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmnPGdMACgkQ6rmadz2v
bTrNxw/9Hcn2V/Jqp/cEagmKIKqSAUFgEE+AwRbQU5YL2Yem/6Q15rnOk8pOSDT5
jqk7VbuchVmWa+a9DVy7d3XVWohk332QbvQRHfqV8P0ZpnfJa0YqdZlKg2/4/8P/
yVhLzVrGIGcvvz9CfhIynRhq/fvr7iYbSSv9JT3nig4qCYpUf7kPbXSLtxyElNWN
xX36KfTxQO4xI2+iezsNwklXF25Tv59V1fNuKF2lshxS+DwaroAzAJLd3MGvTHRj
8y5kU1UDb+HeJh9DpEFjppQp4qUQjIKAiNVvXGUOe7TI/i9VTIiMfesniWKNwzYv
Alo2G8fLb4nJhzNL2ol4R0I5BCYmMT55tBFvSNJQ+9Esy6azkbExmKuE1hXsUXo1
jY0TbNt58zSZEmyz9SYoFKlg4lOW4ZIMl0RtnSBRoDwtK3ThGV7QFlnKq3uPZ6ce
RcpMk7cOnERLzwPnpSiACrQmzhMk+j5HG1u+Eb3rXKxYCQO6bAhpQyPDKsiXNgkL
uezq2zqAnNho0/CInHGlRj7E1JnvRoHCcLBT4zzyIY/jruI8fzK0aMqGMvk/qOby
BWDnJ9GG3VmGSUc/FOp3IchKCnxXhkYqsjBCP03cbIZgr1MuixZeom81OsPNmSX8
Ke+FeGNsU5zOUJ1iG2BZjdya/DAgP8hd85WVtaXyX60KKhuu45c=
=w0RY
-----END PGP SIGNATURE-----
Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Pull bpf fixes from Alexei Starovoitov:
- Fix register equivalence for pointers to packet (Alexei Starovoitov)
- Fix incorrect pruning due to atomic fetch precision tracking (Daniel
Borkmann)
- Fix grace period wait for bpf_link-ed tracepoints (Kumar Kartikeya
Dwivedi)
- Fix use-after-free of sockmap's sk->sk_socket (Kuniyuki Iwashima)
- Reject direct access to nullable PTR_TO_BUF pointers (Qi Tang)
- Reject sleepable kprobe_multi programs at attach time (Varun R
Mallya)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
selftests/bpf: Add more precision tracking tests for atomics
bpf: Fix incorrect pruning due to atomic fetch precision tracking
bpf: Reject sleepable kprobe_multi programs at attach time
bpf: reject direct access to nullable PTR_TO_BUF pointers
bpf: sockmap: Fix use-after-free of sk->sk_socket in sk_psock_verdict_data_ready().
bpf: Fix grace period wait for tracepoint bpf_link
bpf: Fix regsafe() for pointers to packet
This patch fixes the invariant violations that can happen after we
refine ranges & tnum based on an incorrectly-detected branch condition.
For example, the branch is always true, but we miss it in
is_branch_taken; we then refine based on the branch being false and end
up with incoherent ranges (e.g. umax < umin).
To avoid this, we can simulate the refinement on both branches. More
specifically, this patch simulates both branches taken using
regs_refine_cond_op and reg_bounds_sync. If the resulting register
states are ill-formed on one of the branches, is_branch_taken can mark
that branch as "never taken".
On a more formal note, we can deduce a branch is not taken when
regs_refine_cond_op or reg_bounds_sync returns an ill-formed state
because the branch operators are sound (verified with Agni [1]).
Soundness means that the verifier is guaranteed to produce sound
outputs on the taken branches. On the non-taken branch (explored
because of imprecision in the bounds), the verifier is free to produce
any output. We use ill-formedness as a signal that the branch is dead
and prune that branch.
This patch moves the refinement logic for both branches from
reg_set_min_max to their own function, simulate_both_branches_taken,
which is called from is_scalar_branch_taken. As a result,
reg_set_min_max now only runs sanity checks and has been renamed to
reg_bounds_sanity_check_branches to reflect that.
We have had five patches fixing specific cases of invariant violations
in the past, all added with selftests:
- commit fbc7aef517 ("bpf: Fix u32/s32 bounds when ranges cross
min/max boundary")
- commit efc11a6678 ("bpf: Improve bounds when tnum has a single
possible value")
- commit f41345f47f ("bpf: Use tnums for JEQ/JNE is_branch_taken
logic")
- commit 00bf8d0c6c ("bpf: Improve bounds when s64 crosses sign
boundary")
- commit 6279846b9b ("bpf: Forget ranges when refining tnum after
JSET")
To confirm that this patch addresses all invariant violations, we have
also reverted those five commits and verified that their related
selftests don't cause any invariant violation warnings anymore. Those
selftests still fail but only because of misdetected branches or
less-precise bounds than expected. This demonstrates that the current
patch is enough to avoid the invariant violation warning AND that the
previous five patches are still useful to improve branch detection.
In addition to the selftests, this change was also tested with the
Cilium complexity test suite: all programs were successfully loaded and
it didn't change the number of processed instructions.
Link: https://github.com/bpfverif/agni [1]
Reported-by: syzbot+c950cc277150935cc0b5@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=c950cc277150935cc0b5
Co-developed-by: Paul Chaignon <paul.chaignon@gmail.com>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Co-developed-by: Srinivas Narayana <srinivas.narayana@rutgers.edu>
Signed-off-by: Srinivas Narayana <srinivas.narayana@rutgers.edu>
Co-developed-by: Santosh Nagarakatte <santosh.nagarakatte@rutgers.edu>
Signed-off-by: Santosh Nagarakatte <santosh.nagarakatte@rutgers.edu>
Signed-off-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/a166b54a3cbbbdbcdf8a87f53045f1097176218b.1775142354.git.paul.chaignon@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
In the subsequent commit, to prune dead branches we will rely on
detecting ill-formed ranges using range_bounds_violations()
(e.g., umin > umax) after refining register bounds using
regs_refine_cond_op().
However, reg_bounds_sync() can sometimes "repair" ill-formed bounds,
potentially masking a violation that was produced by
regs_refine_cond_op().
This commit modifies reg_bounds_sync() to exit early if an invariant
violation is already present in the input.
This ensures ill-formed reg_states remain ill-formed after
reg_bounds_sync(), allowing simulate_both_branches_taken() to correctly
identify dead branches with a single check to range_bounds_violation().
Suggested-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/73127d628841c59cb7423d6bdcd204bf90bcdc80.1775142354.git.paul.chaignon@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
In a subsequent patch, the regs_refine_cond_op and reg_bounds_sync
functions will be called in is_branch_taken instead of reg_set_min_max,
to simulate each branch's outcome. Since they will run before we branch
out, these two functions will need to work on temporary registers for
the two branches.
This refactoring patch prepares for that change, by introducing the
temporary registers on bpf_verifier_env and using them in
reg_set_min_max.
This change also allows us to save one fake_reg slot as we don't need to
allocate an additional temporary buffer in case of a BPF_K condition.
Finally, you may notice that this patch removes the check for
"false_reg1 == false_reg2" in reg_set_min_max. That check was introduced
in commit d43ad9da80 ("bpf: Skip bounds adjustment for conditional
jumps on same scalar register") to avoid an invariant violation. Given
that "env->false_reg1 == env->false_reg2" doesn't make sense and
invariant violations are addressed in a subsequent commit, this patch
just removes the check.
Suggested-by: Eduard Zingerman <eddyz87@gmail.com>
Co-developed-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com>
Signed-off-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Acked-by: Mykyta Yatsenko <yatsenko@meta.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/260b0270052944a420e1c56e6a92df4d43cadf03.1775142354.git.paul.chaignon@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This commit refactors reg_bounds_sanity_check to factor out the logic
that performs the sanity check from the logic that does the reporting.
Signed-off-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/198ec3e69343e2c46dc9cbe2b1bc9be9ae2df5bd.1775142354.git.paul.chaignon@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
When backtrack_insn encounters a BPF_STX instruction with BPF_ATOMIC
and BPF_FETCH, the src register (or r0 for BPF_CMPXCHG) also acts as
a destination, thus receiving the old value from the memory location.
The current backtracking logic does not account for this. It treats
atomic fetch operations the same as regular stores where the src
register is only an input. This leads the backtrack_insn to fail to
propagate precision to the stack location, which is then not marked
as precise!
Later, the verifier's path pruning can incorrectly consider two states
equivalent when they differ in terms of stack state. Meaning, two
branches can be treated as equivalent and thus get pruned when they
should not be seen as such.
Fix it as follows: Extend the BPF_LDX handling in backtrack_insn to
also cover atomic fetch operations via is_atomic_fetch_insn() helper.
When the fetch dst register is being tracked for precision, clear it,
and propagate precision over to the stack slot. For non-stack memory,
the precision walk stops at the atomic instruction, same as regular
BPF_LDX. This covers all fetch variants.
Before:
0: (b7) r1 = 8 ; R1=8
1: (7b) *(u64 *)(r10 -8) = r1 ; R1=8 R10=fp0 fp-8=8
2: (b7) r2 = 0 ; R2=0
3: (db) r2 = atomic64_fetch_add((u64 *)(r10 -8), r2) ; R2=8 R10=fp0 fp-8=mmmmmmmm
4: (bf) r3 = r10 ; R3=fp0 R10=fp0
5: (0f) r3 += r2
mark_precise: frame0: last_idx 5 first_idx 0 subseq_idx -1
mark_precise: frame0: regs=r2 stack= before 4: (bf) r3 = r10
mark_precise: frame0: regs=r2 stack= before 3: (db) r2 = atomic64_fetch_add((u64 *)(r10 -8), r2)
mark_precise: frame0: regs=r2 stack= before 2: (b7) r2 = 0
6: R2=8 R3=fp8
6: (b7) r0 = 0 ; R0=0
7: (95) exit
After:
0: (b7) r1 = 8 ; R1=8
1: (7b) *(u64 *)(r10 -8) = r1 ; R1=8 R10=fp0 fp-8=8
2: (b7) r2 = 0 ; R2=0
3: (db) r2 = atomic64_fetch_add((u64 *)(r10 -8), r2) ; R2=8 R10=fp0 fp-8=mmmmmmmm
4: (bf) r3 = r10 ; R3=fp0 R10=fp0
5: (0f) r3 += r2
mark_precise: frame0: last_idx 5 first_idx 0 subseq_idx -1
mark_precise: frame0: regs=r2 stack= before 4: (bf) r3 = r10
mark_precise: frame0: regs=r2 stack= before 3: (db) r2 = atomic64_fetch_add((u64 *)(r10 -8), r2)
mark_precise: frame0: regs= stack=-8 before 2: (b7) r2 = 0
mark_precise: frame0: regs= stack=-8 before 1: (7b) *(u64 *)(r10 -8) = r1
mark_precise: frame0: regs=r1 stack= before 0: (b7) r1 = 8
6: R2=8 R3=fp8
6: (b7) r0 = 0 ; R0=0
7: (95) exit
Fixes: 5ffa25502b ("bpf: Add instructions for atomic_[cmp]xchg")
Fixes: 5ca419f286 ("bpf: Add BPF_FETCH field / create atomic_fetch_add instruction")
Reported-by: STAR Labs SG <info@starlabs.sg>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20260331222020.401848-1-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
kprobe.multi programs run in atomic/RCU context and cannot sleep.
However, bpf_kprobe_multi_link_attach() did not validate whether the
program being attached had the sleepable flag set, allowing sleepable
helpers such as bpf_copy_from_user() to be invoked from a non-sleepable
context.
This causes a "sleeping function called from invalid context" splat:
BUG: sleeping function called from invalid context at ./include/linux/uaccess.h:169
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1787, name: sudo
preempt_count: 1, expected: 0
RCU nest depth: 2, expected: 0
Fix this by rejecting sleepable programs early in
bpf_kprobe_multi_link_attach(), before any further processing.
Fixes: 0dcac27254 ("bpf: Add multi kprobe link")
Signed-off-by: Varun R Mallya <varunrmallya@gmail.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Leon Hwang <leon.hwang@linux.dev>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20260401191126.440683-1-varunrmallya@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
check_mem_access() matches PTR_TO_BUF via base_type() which strips
PTR_MAYBE_NULL, allowing direct dereference without a null check.
Map iterator ctx->key and ctx->value are PTR_TO_BUF | PTR_MAYBE_NULL.
On stop callbacks these are NULL, causing a kernel NULL dereference.
Add a type_may_be_null() guard to the PTR_TO_BUF branch, matching the
existing PTR_TO_BTF_ID pattern.
Fixes: 20b2aff4bc ("bpf: Introduce MEM_RDONLY flag")
Signed-off-by: Qi Tang <tpluszz77@gmail.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260402092923.38357-2-tpluszz77@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Replace bpf_mem_alloc/bpf_mem_free with kmalloc_nolock/kfree_nolock for
bpf_dynptr_file_impl, continuing the migration away from bpf_mem_alloc
now that kmalloc can be used from NMI context.
freader_cleanup() runs before kfree_nolock() while the dynptr still
holds exclusive access, so plain kfree_nolock() is safe — no concurrent
readers can access the object.
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260330-kmalloc_special-v2-2-c90403f92ff0@meta.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Replace bpf_mem_alloc/bpf_mem_free with
kmalloc_nolock/kfree_rcu for bpf_task_work_ctx.
Replace guard(rcu_tasks_trace)() with guard(rcu)() in
bpf_task_work_irq(). The function only accesses ctx struct members
(not map values), so tasks trace protection is not needed - regular
RCU is sufficient since ctx is freed via kfree_rcu. The guard in
bpf_task_work_callback() remains as tasks trace since it accesses map
values from process context.
Sleepable BPF programs hold rcu_read_lock_trace but not
regular rcu_read_lock. Since kfree_rcu
waits for a regular RCU grace period, the ctx memory can be freed
while a sleepable program is still running. Add scoped_guard(rcu)
around the pointer read and refcount tryget in
bpf_task_work_acquire_ctx to close this race window.
Since kfree_rcu uses call_rcu internally which is not safe from
NMI context, defer destruction via irq_work when irqs are disabled.
For the lost-cmpxchg path the ctx was never published, so
kfree_nolock is safe.
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260330-kmalloc_special-v2-1-c90403f92ff0@meta.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
uprobe programs are allowed to modify struct pt_regs.
Since the actual program type of uprobe is KPROBE, it can be abused to
modify struct pt_regs via kprobe+freplace when the kprobe attaches to
kernel functions.
For example,
SEC("?kprobe")
int kprobe(struct pt_regs *regs)
{
return 0;
}
SEC("?freplace")
int freplace_kprobe(struct pt_regs *regs)
{
regs->di = 0;
return 0;
}
freplace_kprobe prog will attach to kprobe prog.
kprobe prog will attach to a kernel function.
Without this patch, when the kernel function runs, its first arg will
always be set as 0 via the freplace_kprobe prog.
To fix the abuse of kprobe_write_ctx=true via kprobe+freplace, disallow
attaching freplace programs on kprobe programs with different
kprobe_write_ctx values.
Fixes: 7384893d97 ("bpf: Allow uprobe program to change context registers")
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/20260331145353.87606-2-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Recently, tracepoints were switched from using disabled preemption
(which acts as RCU read section) to SRCU-fast when they are not
faultable. This means that to do a proper grace period wait for programs
running in such tracepoints, we must use SRCU's grace period wait.
This is only for non-faultable tracepoints, faultable ones continue
using RCU Tasks Trace.
However, bpf_link_free() currently does call_rcu() for all cases when
the link is non-sleepable (hence, for tracepoints, non-faultable). Fix
this by doing a call_srcu() grace period wait.
As far RCU Tasks Trace gp -> RCU gp chaining is concerned, it is deemed
unnecessary for tracepoint programs. The link and program are either
accessed under RCU Tasks Trace protection, or SRCU-fast protection now.
The earlier logic of chaining both RCU Tasks Trace and RCU gp waits was
to generalize the logic, even if it conceded an extra RCU gp wait,
however that is unnecessary for tracepoints even before this change.
In practice no cost was paid since rcu_trace_implies_rcu_gp() was always
true. Hence we need not chaining any RCU gp after the SRCU gp.
For instance, in the non-faultable raw tracepoint, the RCU read section
of the program in __bpf_trace_run() is enclosed in the SRCU gp, likewise
for faultable raw tracepoint, the program is under the RCU Tasks Trace
protection. Hence, the outermost scope can be waited upon to ensure
correctness.
Also, sleepable programs cannot be attached to non-faultable
tracepoints, so whenever program or link is sleepable, only RCU Tasks
Trace protection is being used for the link and prog.
Fixes: a46023d561 ("tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast")
Reviewed-by: Sun Jian <sun.jian.kdev@gmail.com>
Reviewed-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20260331211021.1632902-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
In case rold->reg->range == BEYOND_PKT_END && rcur->reg->range == N
regsafe() may return true which may lead to current state with
valid packet range not being explored. Fix the bug.
Fixes: 6d94e741a8 ("bpf: Support for pointers beyond pkt_end.")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Amery Hung <ameryhung@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20260331204228.26726-1-alexei.starovoitov@gmail.com
- Fix SCX_KICK_WAIT deadlock where multiple CPUs waiting for each other in
hardirq context form a cycle. Move the wait to a balance callback which
can drop the rq lock and process IPIs.
- Fix inconsistent NUMA node lookup in scx_select_cpu_dfl() where the
waker_node used cpu_to_node() while prev_cpu used
scx_cpu_node_if_enabled(), leading to undefined behavior when per-node
idle tracking is disabled.
-----BEGIN PGP SIGNATURE-----
iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCacwiiQ4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGVILAP44s30JBpNyJ9JhAiCoTYzxzOXqqGbotnpQckMF
+7WoJAD/Z9dJO/Sw/AH0fX6WVJDmO0QsQvFXLXJBxWy7A5XVAA0=
=2DW5
-----END PGP SIGNATURE-----
Merge tag 'sched_ext-for-7.0-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:
- Fix SCX_KICK_WAIT deadlock where multiple CPUs waiting for each other
in hardirq context form a cycle. Move the wait to a balance callback
which can drop the rq lock and process IPIs.
- Fix inconsistent NUMA node lookup in scx_select_cpu_dfl() where
the waker_node used cpu_to_node() while prev_cpu used
scx_cpu_node_if_enabled(), leading to undefined behavior when
per-node idle tracking is disabled.
* tag 'sched_ext-for-7.0-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
selftests/sched_ext: Add cyclic SCX_KICK_WAIT stress test
sched_ext: Fix SCX_KICK_WAIT deadlock by deferring wait to balance callback
sched_ext: Fix inconsistent NUMA node lookup in scx_select_cpu_dfl()
- Fix false positive stall reports on weakly ordered architectures where
the lockless worklist/timestamp check in the watchdog can observe stale
values due to memory reordering. Recheck under pool->lock to confirm.
-----BEGIN PGP SIGNATURE-----
iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCacwiew4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGX66AQCT5ZNUV46mQdQU8TXV4qMiAW1gdwoUT+rvcQ9Q
u2jA/AD/S1uHycAWKqA+TTL8NfNvhgyhgq1AVbYUlTRXCz3iwgU=
=BlI5
-----END PGP SIGNATURE-----
Merge tag 'wq-for-7.0-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue fix from Tejun Heo:
- Fix false positive stall reports on weakly ordered architectures
where the lockless worklist/timestamp check in the watchdog can
observe stale values due to memory reordering.
Recheck under pool->lock to confirm.
* tag 'wq-for-7.0-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: Better describe stall check
workqueue: Fix false positive stall reports
- Fix cgroup rmdir racing with dying tasks. Deferred task cgroup unlink
introduced a window where cgroup.procs is empty but the cgroup is still
populated, causing rmdir to fail with -EBUSY and selftest failures. Make
rmdir wait for dying tasks to fully leave and fix selftests to not depend
on synchronous populated updates.
- Fix cpuset v1 task migration failure from empty cpusets under strict
security policies. When CPU hotplug removes the last CPU from a v1
cpuset, tasks must be migrated to an ancestor without a
security_task_setscheduler() check that would block the migration.
-----BEGIN PGP SIGNATURE-----
iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCacwibg4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGXHEAP98nVEKyl7c7+sXYtwOPn8KEhdHkdpHyPZwhpS2
1wLhaQEAm8yO49s7IgvGPWSz0s/gQdmF5/x8RAee0sJsZALvGQg=
=bUUt
-----END PGP SIGNATURE-----
Merge tag 'cgroup-for-7.0-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
- Fix cgroup rmdir racing with dying tasks.
Deferred task cgroup unlink introduced a window where cgroup.procs
is empty but the cgroup is still populated, causing rmdir to fail
with -EBUSY and selftest failures.
Make rmdir wait for dying tasks to fully leave and fix selftests to
not depend on synchronous populated updates.
- Fix cpuset v1 task migration failure from empty cpusets under strict
security policies.
When CPU hotplug removes the last CPU from a v1 cpuset, tasks must be
migrated to an ancestor without a security_task_setscheduler() check
that would block the migration.
* tag 'cgroup-for-7.0-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup/cpuset: Skip security check for hotplug induced v1 task migration
cgroup/cpuset: Simplify setsched decision check in task iteration loop of cpuset_can_attach()
cgroup: Fix cgroup_drain_dying() testing the wrong condition
selftests/cgroup: Don't require synchronous populated update on task exit
cgroup: Wait for dying tasks to leave on rmdir
When a CPU hot removal causes a v1 cpuset to lose all its CPUs, the
cpuset hotplug handler will schedule a work function to migrate tasks
in that cpuset with no CPU to its ancestor to enable those tasks to
continue running.
If a strict security policy is in place, however, the task migration
may fail when security_task_setscheduler() call in cpuset_can_attach()
returns a -EACCES error. That will mean that those tasks will have
no CPU to run on. The system administrators will have to explicitly
intervene to either add CPUs to that cpuset or move the tasks elsewhere
if they are aware of it.
This problem was found by a reported test failure in the LTP's
cpuset_hotplug_test.sh. Fix this problem by treating this special case as
an exception to skip the setsched security check in cpuset_can_attach()
when a v1 cpuset with tasks have no CPU left.
With that patch applied, the cpuset_hotplug_test.sh test can be run
successfully without failure.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Centralize the check required to run security_task_setscheduler() in
the task iteration loop of cpuset_can_attach() outside of the loop as
it has no dependency on the characteristics of the tasks themselves.
There is no functional change.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
SCX_KICK_WAIT busy-waits in kick_cpus_irq_workfn() using
smp_cond_load_acquire() until the target CPU's kick_sync advances. Because
the irq_work runs in hardirq context, the waiting CPU cannot reschedule and
its own kick_sync never advances. If multiple CPUs form a wait cycle, all
CPUs deadlock.
Replace the busy-wait in kick_cpus_irq_workfn() with resched_curr() to
force the CPU through do_pick_task_scx(), which queues a balance callback
to perform the wait. The balance callback drops the rq lock and enables
IRQs following the sched_core_balance() pattern, so the CPU can process
IPIs while waiting. The local CPU's kick_sync is advanced on entry to
do_pick_task_scx() and continuously during the wait, ensuring any CPU that
starts waiting for us sees the advancement and cannot form cyclic
dependencies.
Fixes: 90e55164da ("sched_ext: Implement SCX_KICK_WAIT")
Cc: stable@vger.kernel.org # v6.12+
Reported-by: Christian Loehle <christian.loehle@arm.com>
Link: https://lore.kernel.org/r/20260316100249.1651641-1-christian.loehle@arm.com
Signed-off-by: Tejun Heo <tj@kernel.org>
Tested-by: Christian Loehle <christian.loehle@arm.com>
which may cause missed expirations or incorrect overrun accounting.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmnIt8IRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1hogw/+J8ekCjKdzZobEyGAM4fTc0jlOmPiXog3
yCKEZHnRpTmWskMKhDIqt3TYr9ihzHSWLwgtUpcx5bXO2qyhqVlQtfmSjA0CXeQ9
DHADtLAsZIg8qpkqur6zGd4icLX5ZoVfiwZdPX3QZMgwAmM0JsqOLOIy8TGvsB2Q
PWnq4aiTqs3lqA00ehMDR2gSlPfgJKUNgHVyJa0/udHVU81A67vPMbu12sX+Iz25
ITxOehF32JuEvJUxa/sipjswizYc/bt5qEywn8Xoob99xrkv5vmTQ31sehqovwq/
xVMaWs+oRj0Ec20E1LWuawnV83b0XR7j8Z2XCVNJHHfys80Hj6Egw7HVctKQq1Tf
X3bzQMbyaYHWn+Cc7DClexposL3xi76ZPW5VaWcsLs5DA15hGmgY7r2DrlJby7QZ
isjqsW06NGLM1DFrTEYlwTDKm7vLukFnWX4TXokMHitkSQPcyLa24sgMPFIXY1Pe
QoY+f9tmIZ5GmA8JA4dgORSQvQEiG8ElToNDfM2Dmk8b69eO+JwdehQfwwGhhemF
S8VMUMylrk0WOoCxM6peuA0JUMPKsLtg5dWWxsWDOxNUUxs9ujoQBAMYXuRloPUv
kGV8entjyBZ4Nz1dxxKM1r/J2c7a8IImbQ18AsdPFu4NaRIG6TvYkGP3MGZtMHLe
iHW5kgvgB5c=
=hDiV
-----END PGP SIGNATURE-----
Merge tag 'timers-urgent-2026-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Ingo Molnar:
"Fix an argument order bug in the alarm timer forwarding logic, which
may cause missed expirations or incorrect overrun accounting"
* tag 'timers-urgent-2026-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
alarmtimer: Fix argument order in alarm_timer_forward()
- Tighten up the sys_futex_requeue() ABI a bit, to disallow
dissimilar futex flags and potential UaF access. (Peter Zijlstra)
- Fix UaF between futex_key_to_node_opt() and vma_replace_policy()
(Hao-Yu Yang)
- Clear stale exiting pointer in futex_lock_pi() retry path,
which bug triggered a warning (and potential misbehavior)
in stress-testing. (Davidlohr Bueso)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmnItYsRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1h0Ow/+O4JG+XMdmTn2OLZQBFVqlt40RG+UAudW
tLyop+xovbLXk0FbZZen233yrYLnD8Qea9+LSIi1KPOKdklT4fimgDrVhSdZVMcY
mPJxkZ1NTRDpl3qpjgaT3yXslgpQQlzDLjAkyyC6SWQh7RHlZ9JB4UICei9sOeE8
WgwBrCfDMJXgjG5sAvkIhQFJaNEb/5HBHJu2Nxldlkl0of+le1e0mGUdY7H0ntxx
kJ4jCsrpCw+tzhZw7RmGe8ouEDkbt2f7EBT10pjZ1cEP22Rogn1AzgdD7pkbhcU3
WxbBGjSfm3ziPADELf0IhVYE/s+UXKTK9LfKDt4Q5W7paTV+o4Fllr+sraBmOoxp
Ova5RZzSdq5NCGAcmFJWg/rxEyZf+uC7GjwPx/80huzgJvBthdjOagdVKm/3BNv8
4d4H9KX67zWXE/XQHMt3g/7KYlVD9tzcp+v+h/rs8fiwkCkElwbljLISEr1StyOv
CDaCouu6FQE8MoVTNPPbH1DfEyIgNKHdXfkQbt/OkGVB9/dX71zFmxEXRCYAGMb1
wZXJ0j/oZZYcapmcb/xk5YjcZEfEU9QnMHHtCdR+dDL56xbsT5cDIzrb09LF6H/9
WiMxhx2V3RNBNUwbOKGX6vdqa2kXG+Pjp6sq9Af7VB9oy+nbWpfcVFqWbWwKAP1N
7e/EmnoxPTs=
=85AU
-----END PGP SIGNATURE-----
Merge tag 'locking-urgent-2026-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull futex fixes from Ingo Molnar:
- Tighten up the sys_futex_requeue() ABI a bit, to disallow dissimilar
futex flags and potential UaF access (Peter Zijlstra)
- Fix UaF between futex_key_to_node_opt() and vma_replace_policy()
(Hao-Yu Yang)
- Clear stale exiting pointer in futex_lock_pi() retry path, which
triggered a warning (and potential misbehavior) in stress-testing
(Davidlohr Bueso)
* tag 'locking-urgent-2026-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
futex: Clear stale exiting pointer in futex_lock_pi() retry path
futex: Fix UaF between futex_key_to_node_opt() and vma_replace_policy()
futex: Require sys_futex_requeue() to have identical flags
The following kfuncs currently accept void *meta__ign argument:
* bpf_obj_new_impl
* bpf_obj_drop_impl
* bpf_percpu_obj_new_impl
* bpf_percpu_obj_drop_impl
* bpf_refcount_acquire_impl
* bpf_list_push_back_impl
* bpf_list_push_front_impl
* bpf_rbtree_add_impl
The __ign suffix is an indicator for the verifier to skip the argument
in check_kfunc_args(). Then, in fixup_kfunc_call() the verifier may
set the value of this argument to struct btf_struct_meta *
kptr_struct_meta from insn_aux_data.
BPF programs must pass a dummy NULL value when calling these kfuncs.
Additionally, the list and rbtree _impl kfuncs also accept an implicit
u64 argument, which doesn't require __ign suffix because it's a
scalar, and BPF programs explicitly pass 0.
Add new kfuncs with KF_IMPLICIT_ARGS [1], that correspond to each
_impl kfunc accepting meta__ign. The existing _impl kfuncs remain
unchanged for backwards compatibility.
To support this, add "btf_struct_meta" to the list of recognized
implicit argument types in resolve_btfids.
Implement is_kfunc_arg_implicit() in the verifier, that determines
implicit args by inspecting both a non-_impl BTF prototype of the
kfunc.
Update the special_kfunc_list in the verifier and relevant checks to
support both the old _impl and the new KF_IMPLICIT_ARGS variants of
btf_struct_meta users.
[1] https://lore.kernel.org/bpf/20260120222638.3976562-1-ihor.solodrai@linux.dev/
Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20260327203241.3365046-1-ihor.solodrai@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
- Fix potential deadlock in osnoise and hotplug
The interface_lock can be called by a osnoise thread and the CPU shutdown
logic of osnoise can wait for this thread to finish. But cpus_read_lock()
can also be taken while holding the interface_lock. This produces a
circular lock dependency and can cause a deadlock.
Swap the ordering of cpus_read_lock() and the interface_lock to have
interface_lock taken within the cpus_read_lock() context to prevent this
circular dependency.
- Fix freeing of event triggers in early boot up
If the same trigger is added on the kernel command line, the second
one will fail to be applied and the trigger created will be freed.
This calls into the deferred logic and creates a kernel thread to do
the freeing. But the command line logic is called before kernel
threads can be created and this leads to a NULL pointer dereference.
Delay freeing event triggers until late init.
-----BEGIN PGP SIGNATURE-----
iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCacf2GBQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qooZAQDQgsnrkwkUHc2WjhAAsZunjWhqRQOp
SIdOO95pDvQwewEAjjP6kG5UEppL8eNF9WmV8EMz9pjYo7i2/Wy59H6xnwQ=
=6Rs0
-----END PGP SIGNATURE-----
Merge tag 'trace-v7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix potential deadlock in osnoise and hotplug
The interface_lock can be called by a osnoise thread and the CPU
shutdown logic of osnoise can wait for this thread to finish. But
cpus_read_lock() can also be taken while holding the interface_lock.
This produces a circular lock dependency and can cause a deadlock.
Swap the ordering of cpus_read_lock() and the interface_lock to have
interface_lock taken within the cpus_read_lock() context to prevent
this circular dependency.
- Fix freeing of event triggers in early boot up
If the same trigger is added on the kernel command line, the second
one will fail to be applied and the trigger created will be freed.
This calls into the deferred logic and creates a kernel thread to do
the freeing. But the command line logic is called before kernel
threads can be created and this leads to a NULL pointer dereference.
Delay freeing event triggers until late init.
* tag 'trace-v7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Drain deferred trigger frees if kthread creation fails
tracing: Fix potential deadlock in cpu hotplug with osnoise
Fuzzying/stressing futexes triggered:
WARNING: kernel/futex/core.c:825 at wait_for_owner_exiting+0x7a/0x80, CPU#11: futex_lock_pi_s/524
When futex_lock_pi_atomic() sees the owner is exiting, it returns -EBUSY
and stores a refcounted task pointer in 'exiting'.
After wait_for_owner_exiting() consumes that reference, the local pointer
is never reset to nil. Upon a retry, if futex_lock_pi_atomic() returns a
different error, the bogus pointer is passed to wait_for_owner_exiting().
CPU0 CPU1 CPU2
futex_lock_pi(uaddr)
// acquires the PI futex
exit()
futex_cleanup_begin()
futex_state = EXITING;
futex_lock_pi(uaddr)
futex_lock_pi_atomic()
attach_to_pi_owner()
// observes EXITING
*exiting = owner; // takes ref
return -EBUSY
wait_for_owner_exiting(-EBUSY, owner)
put_task_struct(); // drops ref
// exiting still points to owner
goto retry;
futex_lock_pi_atomic()
lock_pi_update_atomic()
cmpxchg(uaddr)
*uaddr ^= WAITERS // whatever
// value changed
return -EAGAIN;
wait_for_owner_exiting(-EAGAIN, exiting) // stale
WARN_ON_ONCE(exiting)
Fix this by resetting upon retry, essentially aligning it with requeue_pi.
Fixes: 3ef240eaff ("futex: Prevent exit livelock")
Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260326001759.4129680-1-dave@stgolabs.net