-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmjZH40ACgkQ6rmadz2v
bTrG7w//X/5CyDoKIYJCqynYRdMtfqYuCe8Jhud4p5++iBVqkDyS6Y8EFLqZVyg/
UHTqaSE4Nz8/pma0WSjhUYn6Chs1AeH+Rw/g109SovE/YGkek2KNwY3o2hDrtPMX
+oD0my8qF2HLKgEyteXXyZ5Ju+AaF92JFiGko4/wNTX8O99F9nyz2pTkrctS9Vl9
VwuTxrEXpmhqrhP3WCxkfNfcbs9HP+AALpgOXZKdMI6T4KI0N1gnJ0ZWJbiXZ8oT
tug0MTPkNRidYMl0wHY2LZ6ZG8Q3a7Sgc+M0xFzaHGvGlJbBg1HjsDMtT6j34CrG
TIVJ/O8F6EJzAnQ5Hio0FJk8IIgMRgvng5Kd5GXidU+mE6zokTyHIHOXitYkBQNH
Hk+lGA7+E2cYqUqKvB5PFoyo+jlucuIH7YwrQlyGfqz+98n65xCgZKcmdVXr0hdB
9v3WmwJFtVIoPErUvBC3KRANQYhFk4eVk1eiGV/20+eIVyUuNbX6wqSWSA9uEXLy
n5fm/vlk4RjZmrPZHxcJ0dsl9LTF1VvQQHkgoC1Sz/Cc+jA6k4I+ECVHAqEbk36p
1TUF52yPOD2ViaJKkj+962JaaaXlUn6+Dq7f1GMP6VuyHjz4gsI3mOo4XarqNdWd
c7TnYmlGO/cGwqd4DdbmWiF1DDsrBcBzdbC8+FgffxQHLPXGzUg=
=LeQi
-----END PGP SIGNATURE-----
Merge tag 'bpf-next-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Pull bpf updates from Alexei Starovoitov:
- Support pulling non-linear xdp data with bpf_xdp_pull_data() kfunc
(Amery Hung)
Applied as a stable branch in bpf-next and net-next trees.
- Support reading skb metadata via bpf_dynptr (Jakub Sitnicki)
Also a stable branch in bpf-next and net-next trees.
- Enforce expected_attach_type for tailcall compatibility (Daniel
Borkmann)
- Replace path-sensitive with path-insensitive live stack analysis in
the verifier (Eduard Zingerman)
This is a significant change in the verification logic. More details,
motivation, long term plans are in the cover letter/merge commit.
- Support signed BPF programs (KP Singh)
This is another major feature that took years to materialize.
Algorithm details are in the cover letter/marge commit
- Add support for may_goto instruction to s390 JIT (Ilya Leoshkevich)
- Add support for may_goto instruction to arm64 JIT (Puranjay Mohan)
- Fix USDT SIB argument handling in libbpf (Jiawei Zhao)
- Allow uprobe-bpf program to change context registers (Jiri Olsa)
- Support signed loads from BPF arena (Kumar Kartikeya Dwivedi and
Puranjay Mohan)
- Allow access to union arguments in tracing programs (Leon Hwang)
- Optimize rcu_read_lock() + migrate_disable() combination where it's
used in BPF subsystem (Menglong Dong)
- Introduce bpf_task_work_schedule*() kfuncs to schedule deferred
execution of BPF callback in the context of a specific task using the
kernel’s task_work infrastructure (Mykyta Yatsenko)
- Enforce RCU protection for KF_RCU_PROTECTED kfuncs (Kumar Kartikeya
Dwivedi)
- Add stress test for rqspinlock in NMI (Kumar Kartikeya Dwivedi)
- Improve the precision of tnum multiplier verifier operation
(Nandakumar Edamana)
- Use tnums to improve is_branch_taken() logic (Paul Chaignon)
- Add support for atomic operations in arena in riscv JIT (Pu Lehui)
- Report arena faults to BPF error stream (Puranjay Mohan)
- Search for tracefs at /sys/kernel/tracing first in bpftool (Quentin
Monnet)
- Add bpf_strcasecmp() kfunc (Rong Tao)
- Support lookup_and_delete_elem command in BPF_MAP_STACK_TRACE (Tao
Chen)
* tag 'bpf-next-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (197 commits)
libbpf: Replace AF_ALG with open coded SHA-256
selftests/bpf: Add stress test for rqspinlock in NMI
selftests/bpf: Add test case for different expected_attach_type
bpf: Enforce expected_attach_type for tailcall compatibility
bpftool: Remove duplicate string.h header
bpf: Remove duplicate crypto/sha2.h header
libbpf: Fix error when st-prefix_ops and ops from differ btf
selftests/bpf: Test changing packet data from kfunc
selftests/bpf: Add stacktrace map lookup_and_delete_elem test case
selftests/bpf: Refactor stacktrace_map case with skeleton
bpf: Add lookup_and_delete_elem for BPF_MAP_STACK_TRACE
selftests/bpf: Fix flaky bpf_cookie selftest
selftests/bpf: Test changing packet data from global functions with a kfunc
bpf: Emit struct bpf_xdp_sock type in vmlinux BTF
selftests/bpf: Task_work selftest cleanup fixes
MAINTAINERS: Delete inactive maintainers from AF_XDP
bpf: Mark kfuncs as __noclone
selftests/bpf: Add kprobe multi write ctx attach test
selftests/bpf: Add kprobe write ctx attach test
selftests/bpf: Add uprobe context ip register change test
...
Core scheduler changes:
- Make migrate_{en,dis}able() inline, to improve performance
(Menglong Dong)
- Move STDL_INIT() functions out-of-line (Peter Zijlstra)
- Unify the SCHED_{SMT,CLUSTER,MC} Kconfig (Peter Zijlstra)
Fair scheduling:
- Defer throttling when tasks exit to user-space, to reduce the
chance & impact of throttle-preemption with held locks and
other resources. (Aaron Lu, Valentin Schneider)
- Get rid of sched_domains_curr_level hack for tl->cpumask(),
as the warning was getting triggered on certain topologies.
(Peter Zijlstra)
Misc cleanups & fixes:
- Header cleanups (Menglong Dong)
- Fix race in push_dl_task() (Harshit Agarwal)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmjWn0IRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1i9ng/+KJYEsNilPA6nGzZLurU3HXm6S5NrNBPc
xJU43eo3QdFUymKqYaCoXCwaMah3m8fFW+DC8ZoEvWC5wHnlIw6+u/ZEtsWQ4R8N
8+sjQddQeb9Gez+nkC7pXX8h1nqlhHyGWU88+9hOyEEMk21KgH7tJ5gOuGQdPhiA
v24D8dkS1sT6CAeqZAycYIkos+kfNNV+wAEQCXAxfvSSslpzJVZcj9kdQInxF4O4
9+WrvFZFVYJWdJgmgfbpBMw7N8a4Gpv+Lr6U0ZVXHNeLjNdn60I+5Me+9BW9GRcO
pPL50RfGgGgVIqyEHvBhhz8p0tlKUJo9NkRJ9hmNB5SKT3P442SU789ppTYgI1il
5P4g3TiuomqPVuo99/mYHYOpT6NkYC1fkwMP9Hfk4NRUX0EA6lKXelVnMXaC3Z7R
U+0xVYtK4BBLRFBo4HIEM1HChSIcOnutuufFX4lPPt+ewnAmJwSt619w+lBF0v+n
cpiLIlQ74LrYlrULw3bznnwzh5dV8XHm4CT3XAShm6qB97/SaLr0rI5W7njdSvte
ym9z4yFWidawowvLWjFX1CykSnX/T5MV+uzuIH64MEIA39B4VKJWCAC3l3AMJn/5
Ckhf93B5uG0q+wUtE66x/JrN7wtbz9PMpwh01fXNWs/eWc1VKkVrvq+PmHODngmz
drSUoI1P570=
=8ZTZ
-----END PGP SIGNATURE-----
Merge tag 'sched-core-2025-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"Core scheduler changes:
- Make migrate_{en,dis}able() inline, to improve performance
(Menglong Dong)
- Move STDL_INIT() functions out-of-line (Peter Zijlstra)
- Unify the SCHED_{SMT,CLUSTER,MC} Kconfig (Peter Zijlstra)
Fair scheduling:
- Defer throttling to when tasks exit to user-space, to reduce the
chance & impact of throttle-preemption with held locks and other
resources (Aaron Lu, Valentin Schneider)
- Get rid of sched_domains_curr_level hack for tl->cpumask(), as the
warning was getting triggered on certain topologies (Peter
Zijlstra)
Misc cleanups & fixes:
- Header cleanups (Menglong Dong)
- Fix race in push_dl_task() (Harshit Agarwal)"
* tag 'sched-core-2025-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Fix some typos in include/linux/preempt.h
sched: Make migrate_{en,dis}able() inline
rcu: Replace preempt.h with sched.h in include/linux/rcupdate.h
arch: Add the macro COMPILE_OFFSETS to all the asm-offsets.c
sched/fair: Do not balance task to a throttled cfs_rq
sched/fair: Do not special case tasks in throttled hierarchy
sched/fair: update_cfs_group() for throttled cfs_rqs
sched/fair: Propagate load for throttled cfs_rq
sched/fair: Get rid of throttled_lb_pair()
sched/fair: Task based throttle time accounting
sched/fair: Switch to task based throttle model
sched/fair: Implement throttle task work and related helpers
sched/fair: Add related data structure for task based throttle
sched: Unify the SCHED_{SMT,CLUSTER,MC} Kconfig
sched: Move STDL_INIT() functions out-of-line
sched/fair: Get rid of sched_domains_curr_level hack for tl->cpumask()
sched/deadline: Fix race in push_dl_task()
For now, migrate_enable and migrate_disable are global, which makes them
become hotspots in some case. Take BPF for example, the function calling
to migrate_enable and migrate_disable in BPF trampoline can introduce
significant overhead, and following is the 'perf top' of FENTRY's
benchmark (./tools/testing/selftests/bpf/bench trig-fentry):
54.63% bpf_prog_2dcccf652aac1793_bench_trigger_fentry [k]
bpf_prog_2dcccf652aac1793_bench_trigger_fentry
10.43% [kernel] [k] migrate_enable
10.07% bpf_trampoline_6442517037 [k] bpf_trampoline_6442517037
8.06% [kernel] [k] __bpf_prog_exit_recur
4.11% libc.so.6 [.] syscall
2.15% [kernel] [k] entry_SYSCALL_64
1.48% [kernel] [k] memchr_inv
1.32% [kernel] [k] fput
1.16% [kernel] [k] _copy_to_user
0.73% [kernel] [k] bpf_prog_test_run_raw_tp
So in this commit, we make migrate_enable/migrate_disable inline to obtain
better performance. The struct rq is defined internally in
kernel/sched/sched.h, and the field "nr_pinned" is accessed in
migrate_enable/migrate_disable, which makes it hard to make them inline.
Alexei Starovoitov suggests to generate the offset of "nr_pinned" in [1],
so we can define the migrate_enable/migrate_disable in
include/linux/sched.h and access "this_rq()->nr_pinned" with
"(void *)this_rq() + RQ_nr_pinned".
The offset of "nr_pinned" is generated in include/generated/rq-offsets.h
by kernel/sched/rq-offsets.c.
Generally speaking, we move the definition of migrate_enable and
migrate_disable to include/linux/sched.h from kernel/sched/core.c. The
calling to __set_cpus_allowed_ptr() is leaved in ___migrate_enable().
The "struct rq" is not available in include/linux/sched.h, so we can't
access the "runqueues" with this_cpu_ptr(), as the compilation will fail
in this_cpu_ptr() -> raw_cpu_ptr() -> __verify_pcpu_ptr():
typeof((ptr) + 0)
So we introduce the this_rq_raw() and access the runqueues with
arch_raw_cpu_ptr/PERCPU_PTR directly.
The variable "runqueues" is not visible in the kernel modules, and export
it is not a good idea. As Peter Zijlstra advised in [2], we define and
export migrate_enable/migrate_disable in kernel/sched/core.c too, and use
them for the modules.
Before this patch, the performance of BPF FENTRY is:
fentry : 113.030 ± 0.149M/s
fentry : 112.501 ± 0.187M/s
fentry : 112.828 ± 0.267M/s
fentry : 115.287 ± 0.241M/s
After this patch, the performance of BPF FENTRY increases to:
fentry : 143.644 ± 0.670M/s
fentry : 149.764 ± 0.362M/s
fentry : 149.642 ± 0.156M/s
fentry : 145.263 ± 0.221M/s
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/bpf/CAADnVQ+5sEDKHdsJY5ZsfGDO_1SEhhQWHrt2SMBG5SYyQ+jt7w@mail.gmail.com/ [1]
Link: https://lore.kernel.org/all/20250819123214.GH4067720@noisy.programming.kicks-ass.net/ [2]
bpf_xdp_pull_data() may change packet data and therefore packet pointers
need to be invalidated. Add bpf_xdp_pull_data() to the special kfunc
list instead of introducing a new KF_ flag until there are more kfuncs
changing packet data.
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20250922233356.3356453-5-ameryhung@gmail.com
Currently, signed load instructions into arena memory are unsupported.
The compiler is free to generate these, and on GCC-14 we see a
corresponding error when it happens. The hurdle in supporting them is
deciding which unused opcode to use to mark them for the JIT's own
consumption. After much thinking, it appears 0xc0 / BPF_NOSPEC can be
combined with load instructions to identify signed arena loads. Use
this to recognize and JIT them appropriately, and remove the verifier
side limitation on the program if the JIT supports them.
Co-developed-by: Puranjay Mohan <puranjay@kernel.org>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20250923110157.18326-2-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch adds necessary plumbing in verifier, syscall and maps to
support handling new kfunc bpf_task_work_schedule and kernel structure
bpf_task_work. The idea is similar to how we already handle bpf_wq and
bpf_timer.
verifier changes validate calls to bpf_task_work_schedule to make sure
it is safe and expected invariants hold.
btf part is required to detect bpf_task_work structure inside map value
and store its offset, which will be used in the next patch to calculate
key and value addresses.
arraymap and hashtab changes are needed to handle freeing of the
bpf_task_work: run code needed to deinitialize it, for example cancel
task_work callback if possible.
The use of bpf_task_work and proper implementation for kfuncs are
introduced in the next patch.
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250923112404.668720-6-mykyta.yatsenko5@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The verifier currently enforces a zero return value for all async
callbacks—a constraint originally introduced for bpf_timer. That
restriction is too narrow for other async use cases.
Relax the rule by allowing non-zero return codes from async callbacks in
general, while preserving the zero-return requirement for bpf_timer to
maintain its existing semantics.
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250923112404.668720-5-mykyta.yatsenko5@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Refactor the verifier by pulling the common logic from
process_timer_func() into a dedicated helper. This allows reusing
process_async_func() helper for verifying bpf_task_work struct in the
next patch.
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Tested-by: syzbot@syzkaller.appspotmail.com
Link: https://lore.kernel.org/r/20250923112404.668720-3-mykyta.yatsenko5@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Converting bpf_insn_successors() to use lookup table makes it ~1.5
times faster.
Also remove unnecessary conditionals:
- `idx + 1 < prog->len` is unnecessary because after check_cfg() all
jump targets are guaranteed to be within a program;
- `i == 0 || succ[0] != dst` is unnecessary because any client of
bpf_insn_successors() can handle duplicate edges:
- compute_live_registers()
- compute_scc()
Moving bpf_insn_successors() to liveness.c allows its inlining in
liveness.c:__update_stack_liveness().
Such inlining speeds up __update_stack_liveness() by ~40%.
bpf_insn_successors() is used in both verifier.c and liveness.c.
perf shows such move does not negatively impact users in verifier.c,
as these are executed only once before main varification pass.
Unlike __update_stack_liveness() which can be triggered multiple
times.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250918-callchain-sensitive-liveness-v3-10-c3cd27bacc60@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Remove register chain based liveness tracking:
- struct bpf_reg_state->{parent,live} fields are no longer needed;
- REG_LIVE_WRITTEN marks are superseded by bpf_mark_stack_write()
calls;
- mark_reg_read() calls are superseded by bpf_mark_stack_read();
- log.c:print_liveness() is superseded by logging in liveness.c;
- propagate_liveness() is superseded by bpf_update_live_stack();
- no need to establish register chains in is_state_visited() anymore;
- fix a bunch of tests expecting "_w" suffixes in verifier log
messages.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250918-callchain-sensitive-liveness-v3-9-c3cd27bacc60@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Unlike the new algorithm, register chain based liveness tracking is
fully path sensitive, and thus should be strictly more accurate.
Validate the new algorithm by signaling an error whenever it considers
a stack slot dead while the old algorithm considers it alive.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250918-callchain-sensitive-liveness-v3-8-c3cd27bacc60@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Allocate analysis instance:
- Add bpf_stack_liveness_{init,free}() calls to bpf_check().
Notify the instance about any stack reads and writes:
- Add bpf_mark_stack_write() call at every location where
REG_LIVE_WRITTEN is recorded for a stack slot.
- Add bpf_mark_stack_read() call at every location mark_reg_read() is
called.
- Both bpf_mark_stack_{read,write}() rely on
env->liveness->cur_instance callchain being in sync with
env->cur_state. It is possible to update env->liveness->cur_instance
every time a mark read/write is called, but that costs a hash table
lookup and is noticeable in the performance profile. Hence, manually
reset env->liveness->cur_instance whenever the verifier changes
env->cur_state call stack:
- call bpf_reset_live_stack_callchain() when the verifier enters a
subprogram;
- call bpf_update_live_stack() when the verifier exits a subprogram
(it implies the reset).
Make sure bpf_update_live_stack() is called for a callchain before
issuing liveness queries. And make sure that bpf_update_live_stack()
is called for any callee callchain first:
- Add bpf_update_live_stack() call at every location that processes
BPF_EXIT:
- exit from a subprogram;
- before pop_stack() call.
This makes sure that bpf_update_live_stack() is called for callee
callchains before caller callchains.
Make sure must_write marks are set to zero for instructions that
do not always access the stack:
- Wrap do_check_insn() with bpf_reset_stack_write_marks() /
bpf_commit_stack_write_marks() calls.
Any calls to bpf_mark_stack_write() are accumulated between this
pair of calls. If no bpf_mark_stack_write() calls were made
it means that the instruction does not access stack (at-least
on the current verification path) and it is important to record
this fact.
Finally, use bpf_live_stack_query_init() / bpf_stack_slot_alive()
to query stack liveness info.
The manual tracking of the correct order for callee/caller
bpf_update_live_stack() calls is a bit convoluted and may warrant some
automation in future revisions.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250918-callchain-sensitive-liveness-v3-7-c3cd27bacc60@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The next patch would require doing postorder traversal of individual
subprograms. Facilitate this by moving env->cfg.insn_postorder
computation from check_cfg() to a separate pass, as check_cfg()
descends into called subprograms (and it needs to, because of
merge_callee_effects() logic).
env->cfg.insn_postorder is used only by compute_live_registers(),
this function does not track cross subprogram dependencies,
thus the change does not affect it's operation.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250918-callchain-sensitive-liveness-v3-5-c3cd27bacc60@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
stacksafe() is called in exact == NOT_EXACT mode only for states that
had been porcessed by clean_verifier_states(). The latter replaces
dead stack spills with a series of STACK_INVALID masks. Such masks are
already handled by stacksafe().
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250918-callchain-sensitive-liveness-v3-3-c3cd27bacc60@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Prepare for bpf_reg_state->live field removal by leveraging
insn_aux_data->live_regs_before instead of bpf_reg_state->live in
compute_live_registers(). This is similar to logic in
func_states_equal(). No changes in verification performance for
selftests or sched_ext.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250918-callchain-sensitive-liveness-v3-2-c3cd27bacc60@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Prepare for bpf_reg_state->live field removal by introducing a
separate flag to track if clean_verifier_state() had been applied to
the state. No functional changes.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250918-callchain-sensitive-liveness-v3-1-c3cd27bacc60@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Exclusive maps allow maps to only be accessed by program with a
program with a matching hash which is specified in the excl_prog_hash
attr.
For the signing use-case, this allows the trusted loader program
to load the map and verify the integrity
Signed-off-by: KP Singh <kpsingh@kernel.org>
Link: https://lore.kernel.org/r/20250914215141.15144-3-kpsingh@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Currently, KF_RCU_PROTECTED only applies to iterator APIs and that too
in a convoluted fashion: the presence of this flag on the kfunc is used
to set MEM_RCU in iterator type, and the lack of RCU protection results
in an error only later, once next() or destroy() methods are invoked on
the iterator. While there is no bug, this is certainly a bit
unintuitive, and makes the enforcement of the flag iterator specific.
In the interest of making this flag useful for other upcoming kfuncs,
e.g. scx_bpf_cpu_curr() [0][1], add enforcement for invoking the kfunc
in an RCU critical section in general.
This would also mean that iterator APIs using KF_RCU_PROTECTED will
error out earlier, instead of throwing an error for lack of RCU CS
protection when next() or destroy() methods are invoked.
In addition to this, if the kfuncs tagged KF_RCU_PROTECTED return a
pointer value, ensure that this pointer value is only usable in an RCU
critical section. There might be edge cases where the return value is
special and doesn't need to imply MEM_RCU semantics, but in general, the
assumption should hold for the majority of kfuncs, and we can revisit
things if necessary later.
[0]: https://lore.kernel.org/all/20250903212311.369697-3-christian.loehle@arm.com
[1]: https://lore.kernel.org/all/20250909195709.92669-1-arighi@nvidia.com
Tested-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250917032755.4068726-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Syzbot generated a program that triggers a verifier_bug() call in
maybe_exit_scc(). maybe_exit_scc() assumes that, when called for a
state with insn_idx in some SCC, there should be an instance of struct
bpf_scc_visit allocated for that SCC. Turns out the assumption does
not hold for speculative execution paths. See example in the next
patch.
maybe_scc_exit() is called from update_branch_counts() for states that
reach branch count of zero, meaning that path exploration for a
particular path is finished. Path exploration can finish in one of
three ways:
a. Verification error is found. In this case, update_branch_counts()
is called only for non-speculative paths.
b. Top level BPF_EXIT is reached. Such instructions are never a part of
an SCC, so compute_scc_callchain() in maybe_scc_exit() will return
false, and maybe_scc_exit() will return early.
c. A checkpoint is reached and matched. Checkpoints are created by
is_state_visited(), which calls maybe_enter_scc(), which allocates
bpf_scc_visit instances for checkpoints within SCCs.
Hence, for non-speculative symbolic execution paths, the assumption
still holds: if maybe_scc_exit() is called for a state within an SCC,
bpf_scc_visit instance must exist.
This patch removes the verifier_bug() call for speculative paths.
Fixes: c9e31900b5 ("bpf: propagate read/precision marks over state graph backedges")
Reported-by: syzbot+3afc814e8df1af64b653@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/bpf/68c85acd.050a0220.2ff435.03a4.GAE@google.com/
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250916212251.3490455-1-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Function bpf_patch_insn_data() has the following structure:
static struct bpf_prog *bpf_patch_insn_data(... env ...)
{
struct bpf_prog *new_prog;
struct bpf_insn_aux_data *new_data = NULL;
if (len > 1) {
new_data = vrealloc(...); // <--------- (1)
if (!new_data)
return NULL;
env->insn_aux_data = new_data; // <---- (2)
}
new_prog = bpf_patch_insn_single(env->prog, off, patch, len);
if (IS_ERR(new_prog)) {
...
vfree(new_data); // <----------------- (3)
return NULL;
}
... happy path ...
}
In case if bpf_patch_insn_single() returns an error the `new_data`
allocated at (1) will be freed at (3). However, at (2) this pointer
is stored in `env->insn_aux_data`. Which is freed unconditionally
by verifier.c:bpf_check() on both happy and error paths.
Thus, leading to double-free.
Fix this by removing vfree() call at (3), ownership over `new_data` is
already passed to `env->insn_aux_data` at this point.
Fixes: 77620d1267 ("bpf: use realloc in bpf_patch_insn_data")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250912-patch-insn-data-double-free-v1-1-af05bd85a21a@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
BPF streams are only valid for the main programs, to make it easier to
access streams from subprogs, introduce main_prog_aux in struct
bpf_prog_aux.
prog->aux->main_prog_aux = prog->aux, for main programs and
prog->aux->main_prog_aux = main_prog->aux, for subprograms.
Make bpf_prog_find_from_stack() use the added main_prog_aux to return
the mainprog when a subprog is found on the stack.
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250911145808.58042-3-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
When enable CONFIG_PREEMPT_RT, the kernel will warn when run timer
selftests by './test_progs -t timer':
BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
In order to avoid such warning, reject bpf_timer in verifier when
PREEMPT_RT is enabled.
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/20250910125740.52172-2-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
In the following toy program (reg states minimized for readability), R0
and R1 always have different values at instruction 6. This is obvious
when reading the program but cannot be guessed from ranges alone as
they overlap (R0 in [0; 0xc0000000], R1 in [1024; 0xc0000400]).
0: call bpf_get_prandom_u32#7 ; R0_w=scalar()
1: w0 = w0 ; R0_w=scalar(var_off=(0x0; 0xffffffff))
2: r0 >>= 30 ; R0_w=scalar(var_off=(0x0; 0x3))
3: r0 <<= 30 ; R0_w=scalar(var_off=(0x0; 0xc0000000))
4: r1 = r0 ; R1_w=scalar(var_off=(0x0; 0xc0000000))
5: r1 += 1024 ; R1_w=scalar(var_off=(0x400; 0xc0000000))
6: if r1 != r0 goto pc+1
Looking at tnums however, we can deduce that R1 is always different from
R0 because their tnums don't agree on known bits. This patch uses this
logic to improve is_scalar_branch_taken in case of BPF_JEQ and BPF_JNE.
This change has a tiny impact on complexity, which was measured with
the Cilium complexity CI test. That test covers 72 programs with
various build and load time configurations for a total of 970 test
cases. For 80% of test cases, the patch has no impact. On the other
test cases, the patch decreases complexity by only 0.08% on average. In
the best case, the verifier needs to walk 3% less instructions and, in
the worst case, 1.5% more. Overall, the patch has a small positive
impact, especially for our largest programs.
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/be3ee70b6e489c49881cb1646114b1d861b5c334.1755694147.git.paul.chaignon@gmail.com
Add a dynptr type, similar to skb dynptr, but for the skb metadata access.
The dynptr provides an alternative to __sk_buff->data_meta for accessing
the custom metadata area allocated using the bpf_xdp_adjust_meta() helper.
More importantly, it abstracts away the fact where the storage for the
custom metadata lives, which opens up the way to persist the metadata by
relocating it as the skb travels through the network stack layers.
Writes to skb metadata invalidate any existing skb payload and metadata
slices. While this is more restrictive that needed at the moment, it leaves
the door open to reallocating the metadata on writes, and should be only a
minor inconvenience to the users.
Only the program types which can access __sk_buff->data_meta today are
allowed to create a dynptr for skb metadata at the moment. We need to
modify the network stack to persist the metadata across layers before
opening up access to other BPF hooks.
Once more BPF hooks gain access to skb_meta dynptr, we will also need to
add a read-only variant of the helper similar to
bpf_dynptr_from_skb_rdonly.
skb_meta dynptr ops are stubbed out and implemented by subsequent changes.
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Jesse Brandeburg <jbrandeburg@cloudflare.com>
Link: https://patch.msgid.link/20250814-skb-metadata-thru-dynptr-v7-1-8a39e636e0fb@cloudflare.com
When a BPF program which is being loaded reaches the map limit
(MAX_USED_MAPS) or the BTF limit (MAX_USED_BTFS) the -E2BIG is
returned. However, in the former case there is an accompanying
verifier verbose message, and in the latter case there is not.
Add a verbose message to make the behaviour symmetrical.
Reported-by: Kevin Sheldrake <kevin.sheldrake@isovalent.com>
Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20250816151554.902995-1-a.s.protopopov@gmail.com
The 'backedge' pointer is allocated with kzalloc(), which returns
physically contiguous memory. Using kvfree() to deallocate such
memory is functionally safe but semantically incorrect.
Replace kvfree() with kfree() to avoid unnecessary is_vmalloc_addr()
check in kvfree().
Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20250811123949.552885-1-rongqianfeng@vivo.com
Avoid excessive vzalloc/vfree calls when patching instructions in
do_misc_fixups(). bpf_patch_insn_data() uses vzalloc to allocate new
memory for env->insn_aux_data for each patch as follows:
struct bpf_prog *bpf_patch_insn_data(env, ...)
{
...
new_data = vzalloc(... O(program size) ...);
...
adjust_insn_aux_data(env, new_data, ...);
...
}
void adjust_insn_aux_data(env, new_data, ...)
{
...
memcpy(new_data, env->insn_aux_data);
vfree(env->insn_aux_data);
env->insn_aux_data = new_data;
...
}
The vzalloc/vfree pair is hot in perf report collected for e.g.
pyperf180 test case. It can be replaced with a call to vrealloc in
order to reduce the number of actual memory allocations.
This is a stop-gap solution, as bpf_patch_insn_data is still hot in
the profile. More comprehansive solutions had been discussed before
e.g. as in [1].
[1] https://lore.kernel.org/bpf/CAEf4BzY_E8MSL4mD0UPuuiDcbJhh9e2xQo2=5w+ppRWWiYSGvQ@mail.gmail.com/
Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Tested-by: Anton Protopopov <a.s.protopopov@gmail.com>
Link: https://lore.kernel.org/r/20250807010205.3210608-3-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Parameter 'env' is not used by is_reg64() and insn_has_def32()
functions. Remove the parameter to make it clear that neither function
depends on 'env' state, e.g. env->insn_aux_data.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250807010205.3210608-2-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
env->scc_info array contains references to bpf_scc_info objects
allocated lazily in verifier.c:scc_visit_alloc().
env->scc_cnt was supposed to track env->scc_info array size
in order to free referenced objects in verifier.c:free_states().
Fix initialization of env->scc_cnt that was omitted in
verifier.c:compute_scc().
To reproduce the bug:
- build with CONFIG_DEBUG_KMEMLEAK
- boot and load bpf program with loops, e.g.:
./veristat -q pyperf180.bpf.o
- initiate memleak scan and check results:
echo scan > /sys/kernel/debug/kmemleak
cat /sys/kernel/debug/kmemleak
Fixes: c9e31900b5 ("bpf: propagate read/precision marks over state graph backedges")
Reported-by: Jens Axboe <axboe@kernel.dk>
Closes: https://lore.kernel.org/bpf/CAADnVQKXUWg9uRCPD5ebRXwN4dmBCRUFFM7kN=GxymYz3zU25A@mail.gmail.com/T/
Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Tested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250801232330.1800436-1-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
We've already had two "error during ctx access conversion" warnings
triggered by syzkaller. Let's improve the error message by dumping the
cnt variable so that we can more easily differentiate between the
different error cases.
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/cc94316c30dd76fae4a75a664b61a2dbfe68e205.1754039605.git.paul.chaignon@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Show the rejected function name when attaching tracing programs to
functions in deny list.
With this change, we know why tracing programs can't attach to functions
like __rcu_read_lock() from log.
$ ./fentry
libbpf: prog '__rcu_read_lock': BPF program load failed: -EINVAL
libbpf: prog '__rcu_read_lock': -- BEGIN PROG LOAD LOG --
Attaching tracing programs to function '__rcu_read_lock' is rejected.
Suggested-by: Leon Hwang <leon.hwang@linux.dev>
Signed-off-by: KaFai Wan <kafai.wan@linux.dev>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20250724151454.499040-3-kafai.wan@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
With this change, we know the precise rejected function name when
attaching fexit/fmod_ret to __noreturn functions from log.
$ ./fexit
libbpf: prog 'fexit': BPF program load failed: -EINVAL
libbpf: prog 'fexit': -- BEGIN PROG LOAD LOG --
Attaching fexit/fmod_ret to __noreturn function 'do_exit' is rejected.
Suggested-by: Leon Hwang <leon.hwang@linux.dev>
Signed-off-by: KaFai Wan <kafai.wan@linux.dev>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20250724151454.499040-2-kafai.wan@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Commit d7f0087381 ("bpf: try harder to deduce register bounds from
different numeric domains") added a second call to __reg_deduce_bounds
in reg_bounds_sync because a single call wasn't enough to converge to a
fixed point in terms of register bounds.
With patch "bpf: Improve bounds when s64 crosses sign boundary" from
this series, Eduard noticed that calling __reg_deduce_bounds twice isn't
enough anymore to converge. The first selftest added in "selftests/bpf:
Test cross-sign 64bits range refinement" highlights the need for a third
call to __reg_deduce_bounds. After instruction 7, reg_bounds_sync
performs the following bounds deduction:
reg_bounds_sync entry: scalar(smin=-655,smax=0xeffffeee,smin32=-783,smax32=-146)
__update_reg_bounds: scalar(smin=-655,smax=0xeffffeee,smin32=-783,smax32=-146)
__reg_deduce_bounds:
__reg32_deduce_bounds: scalar(smin=-655,smax=0xeffffeee,smin32=-783,smax32=-146,umin32=0xfffffcf1,umax32=0xffffff6e)
__reg64_deduce_bounds: scalar(smin=-655,smax=0xeffffeee,smin32=-783,smax32=-146,umin32=0xfffffcf1,umax32=0xffffff6e)
__reg_deduce_mixed_bounds: scalar(smin=-655,smax=0xeffffeee,umin=umin32=0xfffffcf1,umax=0xffffffffffffff6e,smin32=-783,smax32=-146,umax32=0xffffff6e)
__reg_deduce_bounds:
__reg32_deduce_bounds: scalar(smin=-655,smax=0xeffffeee,umin=umin32=0xfffffcf1,umax=0xffffffffffffff6e,smin32=-783,smax32=-146,umax32=0xffffff6e)
__reg64_deduce_bounds: scalar(smin=-655,smax=smax32=-146,umin=0xfffffffffffffd71,umax=0xffffffffffffff6e,smin32=-783,umin32=0xfffffcf1,umax32=0xffffff6e)
__reg_deduce_mixed_bounds: scalar(smin=-655,smax=smax32=-146,umin=0xfffffffffffffd71,umax=0xffffffffffffff6e,smin32=-783,umin32=0xfffffcf1,umax32=0xffffff6e)
__reg_bound_offset: scalar(smin=-655,smax=smax32=-146,umin=0xfffffffffffffd71,umax=0xffffffffffffff6e,smin32=-783,umin32=0xfffffcf1,umax32=0xffffff6e,var_off=(0xfffffffffffffc00; 0x3ff))
__update_reg_bounds: scalar(smin=-655,smax=smax32=-146,umin=0xfffffffffffffd71,umax=0xffffffffffffff6e,smin32=-783,umin32=0xfffffcf1,umax32=0xffffff6e,var_off=(0xfffffffffffffc00; 0x3ff))
In particular, notice how:
1. In the first call to __reg_deduce_bounds, __reg32_deduce_bounds
learns new u32 bounds.
2. __reg64_deduce_bounds is unable to improve bounds at this point.
3. __reg_deduce_mixed_bounds derives new u64 bounds from the u32 bounds.
4. In the second call to __reg_deduce_bounds, __reg64_deduce_bounds
improves the smax and umin bounds thanks to patch "bpf: Improve
bounds when s64 crosses sign boundary" from this series.
5. Subsequent functions are unable to improve the ranges further (only
tnums). Yet, a better smin32 bound could be learned from the smin
bound.
__reg32_deduce_bounds is able to improve smin32 from smin, but for that
we need a third call to __reg_deduce_bounds.
As discussed in [1], there may be a better way to organize the deduction
rules to learn the same information with less calls to the same
functions. Such an optimization requires further analysis and is
orthogonal to the present patchset.
Link: https://lore.kernel.org/bpf/aIKtSK9LjQXB8FLY@mail.gmail.com/ [1]
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Co-developed-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Link: https://lore.kernel.org/r/79619d3b42e5525e0e174ed534b75879a5ba15de.1753695655.git.paul.chaignon@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
__reg64_deduce_bounds currently improves the s64 range using the u64
range and vice versa, but only if it doesn't cross the sign boundary.
This patch improves __reg64_deduce_bounds to cover the case where the
s64 range crosses the sign boundary but overlaps with the u64 range on
only one end. In that case, we can improve both ranges. Consider the
following example, with the s64 range crossing the sign boundary:
0 U64_MAX
| [xxxxxxxxxxxxxx u64 range xxxxxxxxxxxxxx] |
|----------------------------|----------------------------|
|xxxxx s64 range xxxxxxxxx] [xxxxxxx|
0 S64_MAX S64_MIN -1
The u64 range overlaps only with positive portion of the s64 range. We
can thus derive the following new s64 and u64 ranges.
0 U64_MAX
| [xxxxxx u64 range xxxxx] |
|----------------------------|----------------------------|
| [xxxxxx s64 range xxxxx] |
0 S64_MAX S64_MIN -1
The same logic can probably apply to the s32/u32 ranges, but this patch
doesn't implement that change.
In addition to the selftests, the __reg64_deduce_bounds change was
also tested with Agni, the formal verification tool for the range
analysis [1].
Link: https://github.com/bpfverif/agni [1]
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Link: https://lore.kernel.org/r/933bd9ce1f36ded5559f92fdc09e5dbc823fa245.1753695655.git.paul.chaignon@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
During the bounds refinement, we improve the precision of various ranges
by looking at other ranges. Among others, we improve the following in
this order (other things happen between 1 and 2):
1. Improve u32 from s32 in __reg32_deduce_bounds.
2. Improve s/u64 from u32 in __reg_deduce_mixed_bounds.
3. Improve s/u64 from s32 in __reg_deduce_mixed_bounds.
In particular, if the s32 range forms a valid u32 range, we will use it
to improve the u32 range in __reg32_deduce_bounds. In
__reg_deduce_mixed_bounds, under the same condition, we will use the s32
range to improve the s/u64 ranges.
If at (1) we were able to learn from s32 to improve u32, we'll then be
able to use that in (2) to improve s/u64. Hence, as (3) happens under
the same precondition as (1), it won't improve s/u64 ranges further than
(1)+(2) did. Thus, we can get rid of (3).
In addition to the extensive suite of selftests for bounds refinement,
this patch was also tested with the Agni formal verification tool [1].
Additionally, Eduard mentioned:
The argument appears to be as follows:
Under precondition `(u32)reg->s32_min <= (u32)reg->s32_max`
__reg32_deduce_bounds produces:
reg->u32_min = max_t(u32, reg->s32_min, reg->u32_min);
reg->u32_max = min_t(u32, reg->s32_max, reg->u32_max);
And then first part of __reg_deduce_mixed_bounds assigns:
a. reg->umin umax= (reg->umin & ~0xffffffffULL) | max_t(u32, reg->s32_min, reg->u32_min);
b. reg->umax umin= (reg->umax & ~0xffffffffULL) | min_t(u32, reg->s32_max, reg->u32_max);
And then second part of __reg_deduce_mixed_bounds assigns:
c. reg->umin umax= (reg->umin & ~0xffffffffULL) | (u32)reg->s32_min;
d. reg->umax umin= (reg->umax & ~0xffffffffULL) | (u32)reg->s32_max;
But assignment (c) is a noop because:
max_t(u32, reg->s32_min, reg->u32_min) >= (u32)reg->s32_min
Hence RHS(a) >= RHS(c) and umin= does nothing.
Also assignment (d) is a noop because:
min_t(u32, reg->s32_max, reg->u32_max) <= (u32)reg->s32_max
Hence RHS(b) <= RHS(d) and umin= does nothing.
Plus the same reasoning for the part dealing with reg->s{min,max}_value:
e. reg->smin_value smax= (reg->smin_value & ~0xffffffffULL) | max_t(u32, reg->s32_min_value, reg->u32_min_value);
f. reg->smax_value smin= (reg->smax_value & ~0xffffffffULL) | min_t(u32, reg->s32_max_value, reg->u32_max_value);
vs
g. reg->smin_value smax= (reg->smin_value & ~0xffffffffULL) | (u32)reg->s32_min_value;
h. reg->smax_value smin= (reg->smax_value & ~0xffffffffULL) | (u32)reg->s32_max_value;
RHS(e) >= RHS(g) and RHS(f) <= RHS(h), hence smax=,smin= do nothing.
This appears to be correct.
Also, Shung-Hsi:
Beside going through the reasoning, I also played with CBMC a bit to
double check that as far as a single run of __reg_deduce_bounds() is
concerned (and that the register state matches certain handwavy
expectations), the change indeed still preserve the original behavior.
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://github.com/bpfverif/agni [1]
Link: https://lore.kernel.org/bpf/aIJwnFnFyUjNsCNa@mail.gmail.com
Syzbot reported a kernel warning due to a range invariant violation on
the following BPF program.
0: call bpf_get_netns_cookie
1: if r0 == 0 goto <exit>
2: if r0 & Oxffffffff goto <exit>
The issue is on the path where we fall through both jumps.
That path is unreachable at runtime: after insn 1, we know r0 != 0, but
with the sign extension on the jset, we would only fallthrough insn 2
if r0 == 0. Unfortunately, is_branch_taken() isn't currently able to
figure this out, so the verifier walks all branches. The verifier then
refines the register bounds using the second condition and we end
up with inconsistent bounds on this unreachable path:
1: if r0 == 0 goto <exit>
r0: u64=[0x1, 0xffffffffffffffff] var_off=(0, 0xffffffffffffffff)
2: if r0 & 0xffffffff goto <exit>
r0 before reg_bounds_sync: u64=[0x1, 0xffffffffffffffff] var_off=(0, 0)
r0 after reg_bounds_sync: u64=[0x1, 0] var_off=(0, 0)
Improving the range refinement for JSET to cover all cases is tricky. We
also don't expect many users to rely on JSET given LLVM doesn't generate
those instructions. So instead of improving the range refinement for
JSETs, Eduard suggested we forget the ranges whenever we're narrowing
tnums after a JSET. This patch implements that approach.
Reported-by: syzbot+c711ce17dd78e5d4fdcf@syzkaller.appspotmail.com
Suggested-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Link: https://lore.kernel.org/r/9d4fd6432a095d281f815770608fdcd16028ce0b.1752171365.git.paul.chaignon@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
We must terminate the speculative analysis if the just-analyzed insn had
nospec_result set. Using cur_aux() here is wrong because insn_idx might
have been incremented by do_check_insn(). Therefore, introduce and use
insn_aux variable.
Also change cur_aux(env)->nospec in case do_check_insn() ever manages to
increment insn_idx but still fail.
Change the warning to check the insn class (which prevents it from
triggering for ldimm64, for which nospec_result would not be
problematic) and use verifier_bug_if().
In line with Eduard's suggestion, do not introduce prev_aux() because
that requires one to understand that after do_check_insn() call what was
current became previous. This would at-least require a comment.
Fixes: d6f1c85f22 ("bpf: Fall back to nospec for Spectre v1")
Reported-by: Paul Chaignon <paul.chaignon@gmail.com>
Reported-by: Eduard Zingerman <eddyz87@gmail.com>
Reported-by: syzbot+dc27c5fb8388e38d2d37@syzkaller.appspotmail.com
Link: https://lore.kernel.org/bpf/685b3c1b.050a0220.2303ee.0010.GAE@google.com/
Link: https://lore.kernel.org/bpf/4266fd5de04092aa4971cbef14f1b4b96961f432.camel@gmail.com/
Suggested-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Luis Gerhorst <luis.gerhorst@fau.de>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250705190908.1756862-2-luis.gerhorst@fau.de
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Allow specifying __arg_untrusted for void */char */int */long *
parameters. Treat such parameters as
PTR_TO_MEM|MEM_RDONLY|PTR_UNTRUSTED of size zero.
Intended usage is as follows:
int memcmp(char *a __arg_untrusted, char *b __arg_untrusted, size_t n) {
bpf_for(i, 0, n) {
if (a[i] - b[i]) // load at any offset is allowed
return a[i] - b[i];
}
return 0;
}
Allocate register id for ARG_PTR_TO_MEM parameters only when
PTR_MAYBE_NULL is set. Register id for PTR_TO_MEM is used only to
propagate non-null status after conditionals.
Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250704230354.1323244-8-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add support for PTR_TO_BTF_ID | PTR_UNTRUSTED global function
parameters. Anything is allowed to pass to such parameters, as these
are read-only and probe read instructions would protect against
invalid memory access.
Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250704230354.1323244-5-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
When processing a load from a PTR_TO_BTF_ID, the verifier calculates
the type of the loaded structure field based on the load offset.
For example, given the following types:
struct foo {
struct foo *a;
int *b;
} *p;
The verifier would calculate the type of `p->a` as a pointer to
`struct foo`. However, the type of `p->b` is currently calculated as a
SCALAR_VALUE.
This commit updates the logic for processing PTR_TO_BTF_ID to instead
calculate the type of p->b as PTR_TO_MEM|MEM_RDONLY|PTR_UNTRUSTED.
This change allows further dereferencing of such pointers (using probe
memory instructions).
Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250704230354.1323244-3-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Non-functional change:
mark_btf_ld_reg() expects 'reg_type' parameter to be either
SCALAR_VALUE or PTR_TO_BTF_ID. Next commit expands this set, so update
this function to fail if unexpected type is passed. Also update
callers to propagate the error.
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250704230354.1323244-2-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add a 'struct bpf_scc_callchain callchain_buf' field in bpf_verifier_env.
This way, the previous bpf_scc_callchain local variables can be
replaced by taking address of env->callchain_buf. This can reduce stack
usage and fix the following error:
kernel/bpf/verifier.c:19921:12: error: stack frame size (1368) exceeds limit (1280) in 'do_check'
[-Werror,-Wframe-larger-than]
Reported-by: Arnd Bergmann <arnd@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20250703141117.1485108-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>