Commit Graph

1571 Commits (09cfd3c52ea76f43b3cb15e570aeddf633d65e80)

Author SHA1 Message Date
Linus Torvalds ae28ed4578 bpf-next-6.18
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmjZH40ACgkQ6rmadz2v
 bTrG7w//X/5CyDoKIYJCqynYRdMtfqYuCe8Jhud4p5++iBVqkDyS6Y8EFLqZVyg/
 UHTqaSE4Nz8/pma0WSjhUYn6Chs1AeH+Rw/g109SovE/YGkek2KNwY3o2hDrtPMX
 +oD0my8qF2HLKgEyteXXyZ5Ju+AaF92JFiGko4/wNTX8O99F9nyz2pTkrctS9Vl9
 VwuTxrEXpmhqrhP3WCxkfNfcbs9HP+AALpgOXZKdMI6T4KI0N1gnJ0ZWJbiXZ8oT
 tug0MTPkNRidYMl0wHY2LZ6ZG8Q3a7Sgc+M0xFzaHGvGlJbBg1HjsDMtT6j34CrG
 TIVJ/O8F6EJzAnQ5Hio0FJk8IIgMRgvng5Kd5GXidU+mE6zokTyHIHOXitYkBQNH
 Hk+lGA7+E2cYqUqKvB5PFoyo+jlucuIH7YwrQlyGfqz+98n65xCgZKcmdVXr0hdB
 9v3WmwJFtVIoPErUvBC3KRANQYhFk4eVk1eiGV/20+eIVyUuNbX6wqSWSA9uEXLy
 n5fm/vlk4RjZmrPZHxcJ0dsl9LTF1VvQQHkgoC1Sz/Cc+jA6k4I+ECVHAqEbk36p
 1TUF52yPOD2ViaJKkj+962JaaaXlUn6+Dq7f1GMP6VuyHjz4gsI3mOo4XarqNdWd
 c7TnYmlGO/cGwqd4DdbmWiF1DDsrBcBzdbC8+FgffxQHLPXGzUg=
 =LeQi
 -----END PGP SIGNATURE-----

Merge tag 'bpf-next-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Pull bpf updates from Alexei Starovoitov:

 - Support pulling non-linear xdp data with bpf_xdp_pull_data() kfunc
   (Amery Hung)

   Applied as a stable branch in bpf-next and net-next trees.

 - Support reading skb metadata via bpf_dynptr (Jakub Sitnicki)

   Also a stable branch in bpf-next and net-next trees.

 - Enforce expected_attach_type for tailcall compatibility (Daniel
   Borkmann)

 - Replace path-sensitive with path-insensitive live stack analysis in
   the verifier (Eduard Zingerman)

   This is a significant change in the verification logic. More details,
   motivation, long term plans are in the cover letter/merge commit.

 - Support signed BPF programs (KP Singh)

   This is another major feature that took years to materialize.

   Algorithm details are in the cover letter/marge commit

 - Add support for may_goto instruction to s390 JIT (Ilya Leoshkevich)

 - Add support for may_goto instruction to arm64 JIT (Puranjay Mohan)

 - Fix USDT SIB argument handling in libbpf (Jiawei Zhao)

 - Allow uprobe-bpf program to change context registers (Jiri Olsa)

 - Support signed loads from BPF arena (Kumar Kartikeya Dwivedi and
   Puranjay Mohan)

 - Allow access to union arguments in tracing programs (Leon Hwang)

 - Optimize rcu_read_lock() + migrate_disable() combination where it's
   used in BPF subsystem (Menglong Dong)

 - Introduce bpf_task_work_schedule*() kfuncs to schedule deferred
   execution of BPF callback in the context of a specific task using the
   kernel’s task_work infrastructure (Mykyta Yatsenko)

 - Enforce RCU protection for KF_RCU_PROTECTED kfuncs (Kumar Kartikeya
   Dwivedi)

 - Add stress test for rqspinlock in NMI (Kumar Kartikeya Dwivedi)

 - Improve the precision of tnum multiplier verifier operation
   (Nandakumar Edamana)

 - Use tnums to improve is_branch_taken() logic (Paul Chaignon)

 - Add support for atomic operations in arena in riscv JIT (Pu Lehui)

 - Report arena faults to BPF error stream (Puranjay Mohan)

 - Search for tracefs at /sys/kernel/tracing first in bpftool (Quentin
   Monnet)

 - Add bpf_strcasecmp() kfunc (Rong Tao)

 - Support lookup_and_delete_elem command in BPF_MAP_STACK_TRACE (Tao
   Chen)

* tag 'bpf-next-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (197 commits)
  libbpf: Replace AF_ALG with open coded SHA-256
  selftests/bpf: Add stress test for rqspinlock in NMI
  selftests/bpf: Add test case for different expected_attach_type
  bpf: Enforce expected_attach_type for tailcall compatibility
  bpftool: Remove duplicate string.h header
  bpf: Remove duplicate crypto/sha2.h header
  libbpf: Fix error when st-prefix_ops and ops from differ btf
  selftests/bpf: Test changing packet data from kfunc
  selftests/bpf: Add stacktrace map lookup_and_delete_elem test case
  selftests/bpf: Refactor stacktrace_map case with skeleton
  bpf: Add lookup_and_delete_elem for BPF_MAP_STACK_TRACE
  selftests/bpf: Fix flaky bpf_cookie selftest
  selftests/bpf: Test changing packet data from global functions with a kfunc
  bpf: Emit struct bpf_xdp_sock type in vmlinux BTF
  selftests/bpf: Task_work selftest cleanup fixes
  MAINTAINERS: Delete inactive maintainers from AF_XDP
  bpf: Mark kfuncs as __noclone
  selftests/bpf: Add kprobe multi write ctx attach test
  selftests/bpf: Add kprobe write ctx attach test
  selftests/bpf: Add uprobe context ip register change test
  ...
2025-09-30 17:58:11 -07:00
Linus Torvalds 6c7340a7a8 Scheduler updates for v6.18:
Core scheduler changes:
 
  - Make migrate_{en,dis}able() inline, to improve performance
    (Menglong Dong)
 
  - Move STDL_INIT() functions out-of-line (Peter Zijlstra)
 
  - Unify the SCHED_{SMT,CLUSTER,MC} Kconfig (Peter Zijlstra)
 
 Fair scheduling:
 
  - Defer throttling when tasks exit to user-space, to reduce the
    chance & impact of throttle-preemption with held locks and
    other resources. (Aaron Lu, Valentin Schneider)
 
  - Get rid of sched_domains_curr_level hack for tl->cpumask(),
    as the warning was getting triggered on certain topologies.
    (Peter Zijlstra)
 
 Misc cleanups & fixes:
 
  - Header cleanups (Menglong Dong)
 
  - Fix race in push_dl_task() (Harshit Agarwal)
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmjWn0IRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1i9ng/+KJYEsNilPA6nGzZLurU3HXm6S5NrNBPc
 xJU43eo3QdFUymKqYaCoXCwaMah3m8fFW+DC8ZoEvWC5wHnlIw6+u/ZEtsWQ4R8N
 8+sjQddQeb9Gez+nkC7pXX8h1nqlhHyGWU88+9hOyEEMk21KgH7tJ5gOuGQdPhiA
 v24D8dkS1sT6CAeqZAycYIkos+kfNNV+wAEQCXAxfvSSslpzJVZcj9kdQInxF4O4
 9+WrvFZFVYJWdJgmgfbpBMw7N8a4Gpv+Lr6U0ZVXHNeLjNdn60I+5Me+9BW9GRcO
 pPL50RfGgGgVIqyEHvBhhz8p0tlKUJo9NkRJ9hmNB5SKT3P442SU789ppTYgI1il
 5P4g3TiuomqPVuo99/mYHYOpT6NkYC1fkwMP9Hfk4NRUX0EA6lKXelVnMXaC3Z7R
 U+0xVYtK4BBLRFBo4HIEM1HChSIcOnutuufFX4lPPt+ewnAmJwSt619w+lBF0v+n
 cpiLIlQ74LrYlrULw3bznnwzh5dV8XHm4CT3XAShm6qB97/SaLr0rI5W7njdSvte
 ym9z4yFWidawowvLWjFX1CykSnX/T5MV+uzuIH64MEIA39B4VKJWCAC3l3AMJn/5
 Ckhf93B5uG0q+wUtE66x/JrN7wtbz9PMpwh01fXNWs/eWc1VKkVrvq+PmHODngmz
 drSUoI1P570=
 =8ZTZ
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2025-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:
 "Core scheduler changes:

   - Make migrate_{en,dis}able() inline, to improve performance
     (Menglong Dong)

   - Move STDL_INIT() functions out-of-line (Peter Zijlstra)

   - Unify the SCHED_{SMT,CLUSTER,MC} Kconfig (Peter Zijlstra)

  Fair scheduling:

   - Defer throttling to when tasks exit to user-space, to reduce the
     chance & impact of throttle-preemption with held locks and other
     resources (Aaron Lu, Valentin Schneider)

   - Get rid of sched_domains_curr_level hack for tl->cpumask(), as the
     warning was getting triggered on certain topologies (Peter
     Zijlstra)

  Misc cleanups & fixes:

   - Header cleanups (Menglong Dong)

   - Fix race in push_dl_task() (Harshit Agarwal)"

* tag 'sched-core-2025-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Fix some typos in include/linux/preempt.h
  sched: Make migrate_{en,dis}able() inline
  rcu: Replace preempt.h with sched.h in include/linux/rcupdate.h
  arch: Add the macro COMPILE_OFFSETS to all the asm-offsets.c
  sched/fair: Do not balance task to a throttled cfs_rq
  sched/fair: Do not special case tasks in throttled hierarchy
  sched/fair: update_cfs_group() for throttled cfs_rqs
  sched/fair: Propagate load for throttled cfs_rq
  sched/fair: Get rid of throttled_lb_pair()
  sched/fair: Task based throttle time accounting
  sched/fair: Switch to task based throttle model
  sched/fair: Implement throttle task work and related helpers
  sched/fair: Add related data structure for task based throttle
  sched: Unify the SCHED_{SMT,CLUSTER,MC} Kconfig
  sched: Move STDL_INIT() functions out-of-line
  sched/fair: Get rid of sched_domains_curr_level hack for tl->cpumask()
  sched/deadline: Fix race in push_dl_task()
2025-09-30 10:35:11 -07:00
Menglong Dong 378b770819 sched: Make migrate_{en,dis}able() inline
For now, migrate_enable and migrate_disable are global, which makes them
become hotspots in some case. Take BPF for example, the function calling
to migrate_enable and migrate_disable in BPF trampoline can introduce
significant overhead, and following is the 'perf top' of FENTRY's
benchmark (./tools/testing/selftests/bpf/bench trig-fentry):

  54.63% bpf_prog_2dcccf652aac1793_bench_trigger_fentry [k]
                 bpf_prog_2dcccf652aac1793_bench_trigger_fentry
  10.43% [kernel] [k] migrate_enable
  10.07% bpf_trampoline_6442517037 [k] bpf_trampoline_6442517037
  8.06% [kernel] [k] __bpf_prog_exit_recur
  4.11% libc.so.6 [.] syscall
  2.15% [kernel] [k] entry_SYSCALL_64
  1.48% [kernel] [k] memchr_inv
  1.32% [kernel] [k] fput
  1.16% [kernel] [k] _copy_to_user
  0.73% [kernel] [k] bpf_prog_test_run_raw_tp

So in this commit, we make migrate_enable/migrate_disable inline to obtain
better performance. The struct rq is defined internally in
kernel/sched/sched.h, and the field "nr_pinned" is accessed in
migrate_enable/migrate_disable, which makes it hard to make them inline.

Alexei Starovoitov suggests to generate the offset of "nr_pinned" in [1],
so we can define the migrate_enable/migrate_disable in
include/linux/sched.h and access "this_rq()->nr_pinned" with
"(void *)this_rq() + RQ_nr_pinned".

The offset of "nr_pinned" is generated in include/generated/rq-offsets.h
by kernel/sched/rq-offsets.c.

Generally speaking, we move the definition of migrate_enable and
migrate_disable to include/linux/sched.h from kernel/sched/core.c. The
calling to __set_cpus_allowed_ptr() is leaved in ___migrate_enable().

The "struct rq" is not available in include/linux/sched.h, so we can't
access the "runqueues" with this_cpu_ptr(), as the compilation will fail
in this_cpu_ptr() -> raw_cpu_ptr() -> __verify_pcpu_ptr():
  typeof((ptr) + 0)

So we introduce the this_rq_raw() and access the runqueues with
arch_raw_cpu_ptr/PERCPU_PTR directly.

The variable "runqueues" is not visible in the kernel modules, and export
it is not a good idea. As Peter Zijlstra advised in [2], we define and
export migrate_enable/migrate_disable in kernel/sched/core.c too, and use
them for the modules.

Before this patch, the performance of BPF FENTRY is:

  fentry         :  113.030 ± 0.149M/s
  fentry         :  112.501 ± 0.187M/s
  fentry         :  112.828 ± 0.267M/s
  fentry         :  115.287 ± 0.241M/s

After this patch, the performance of BPF FENTRY increases to:

  fentry         :  143.644 ± 0.670M/s
  fentry         :  149.764 ± 0.362M/s
  fentry         :  149.642 ± 0.156M/s
  fentry         :  145.263 ± 0.221M/s

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/bpf/CAADnVQ+5sEDKHdsJY5ZsfGDO_1SEhhQWHrt2SMBG5SYyQ+jt7w@mail.gmail.com/ [1]
Link: https://lore.kernel.org/all/20250819123214.GH4067720@noisy.programming.kicks-ass.net/ [2]
2025-09-25 09:57:16 +02:00
Martin KaFai Lau 34f033a6c9 Merge branch 'bpf-next/xdp_pull_data' into 'bpf-next/master'
Merge the xdp_pull_data stable branch into the master branch. No conflict.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2025-09-23 16:23:58 -07:00
Amery Hung 0e7a733ab3 bpf: Clear packet pointers after changing packet data in kfuncs
bpf_xdp_pull_data() may change packet data and therefore packet pointers
need to be invalidated. Add bpf_xdp_pull_data() to the special kfunc
list instead of introducing a new KF_ flag until there are more kfuncs
changing packet data.

Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20250922233356.3356453-5-ameryhung@gmail.com
2025-09-23 13:35:12 -07:00
Kumar Kartikeya Dwivedi a91ae3c893 bpf, x86: Add support for signed arena loads
Currently, signed load instructions into arena memory are unsupported.
The compiler is free to generate these, and on GCC-14 we see a
corresponding error when it happens. The hurdle in supporting them is
deciding which unused opcode to use to mark them for the JIT's own
consumption. After much thinking, it appears 0xc0 / BPF_NOSPEC can be
combined with load instructions to identify signed arena loads. Use
this to recognize and JIT them appropriately, and remove the verifier
side limitation on the program if the JIT supports them.

Co-developed-by: Puranjay Mohan <puranjay@kernel.org>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20250923110157.18326-2-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-23 12:00:22 -07:00
Mykyta Yatsenko 5c8fd7e2b5 bpf: bpf task work plumbing
This patch adds necessary plumbing in verifier, syscall and maps to
support handling new kfunc bpf_task_work_schedule and kernel structure
bpf_task_work. The idea is similar to how we already handle bpf_wq and
bpf_timer.
verifier changes validate calls to bpf_task_work_schedule to make sure
it is safe and expected invariants hold.
btf part is required to detect bpf_task_work structure inside map value
and store its offset, which will be used in the next patch to calculate
key and value addresses.
arraymap and hashtab changes are needed to handle freeing of the
bpf_task_work: run code needed to deinitialize it, for example cancel
task_work callback if possible.
The use of bpf_task_work and proper implementation for kfuncs are
introduced in the next patch.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250923112404.668720-6-mykyta.yatsenko5@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-23 07:34:38 -07:00
Mykyta Yatsenko d2699bdb6e bpf: verifier: permit non-zero returns from async callbacks
The verifier currently enforces a zero return value for all async
callbacks—a constraint originally introduced for bpf_timer. That
restriction is too narrow for other async use cases.

Relax the rule by allowing non-zero return codes from async callbacks in
general, while preserving the zero-return requirement for bpf_timer to
maintain its existing semantics.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250923112404.668720-5-mykyta.yatsenko5@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-23 07:34:38 -07:00
Mykyta Yatsenko 5eab266b80 bpf: extract generic helper from process_timer_func()
Refactor the verifier by pulling the common logic from
process_timer_func() into a dedicated helper. This allows reusing
process_async_func() helper for verifying bpf_task_work struct in the
next patch.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Tested-by: syzbot@syzkaller.appspotmail.com
Link: https://lore.kernel.org/r/20250923112404.668720-3-mykyta.yatsenko5@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-23 07:34:38 -07:00
Eduard Zingerman 79f047c7d9 bpf: table based bpf_insn_successors()
Converting bpf_insn_successors() to use lookup table makes it ~1.5
times faster.

Also remove unnecessary conditionals:
- `idx + 1 < prog->len` is unnecessary because after check_cfg() all
  jump targets are guaranteed to be within a program;
- `i == 0 || succ[0] != dst` is unnecessary because any client of
  bpf_insn_successors() can handle duplicate edges:
  - compute_live_registers()
  - compute_scc()

Moving bpf_insn_successors() to liveness.c allows its inlining in
liveness.c:__update_stack_liveness().
Such inlining speeds up __update_stack_liveness() by ~40%.
bpf_insn_successors() is used in both verifier.c and liveness.c.
perf shows such move does not negatively impact users in verifier.c,
as these are executed only once before main varification pass.
Unlike __update_stack_liveness() which can be triggered multiple
times.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250918-callchain-sensitive-liveness-v3-10-c3cd27bacc60@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-19 09:27:23 -07:00
Eduard Zingerman 107e169799 bpf: disable and remove registers chain based liveness
Remove register chain based liveness tracking:
- struct bpf_reg_state->{parent,live} fields are no longer needed;
- REG_LIVE_WRITTEN marks are superseded by bpf_mark_stack_write()
  calls;
- mark_reg_read() calls are superseded by bpf_mark_stack_read();
- log.c:print_liveness() is superseded by logging in liveness.c;
- propagate_liveness() is superseded by bpf_update_live_stack();
- no need to establish register chains in is_state_visited() anymore;
- fix a bunch of tests expecting "_w" suffixes in verifier log
  messages.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250918-callchain-sensitive-liveness-v3-9-c3cd27bacc60@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-19 09:27:23 -07:00
Eduard Zingerman ccf25a67c7 bpf: signal error if old liveness is more conservative than new
Unlike the new algorithm, register chain based liveness tracking is
fully path sensitive, and thus should be strictly more accurate.
Validate the new algorithm by signaling an error whenever it considers
a stack slot dead while the old algorithm considers it alive.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250918-callchain-sensitive-liveness-v3-8-c3cd27bacc60@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-19 09:27:23 -07:00
Eduard Zingerman e41c237953 bpf: enable callchain sensitive stack liveness tracking
Allocate analysis instance:
- Add bpf_stack_liveness_{init,free}() calls to bpf_check().

Notify the instance about any stack reads and writes:
- Add bpf_mark_stack_write() call at every location where
  REG_LIVE_WRITTEN is recorded for a stack slot.
- Add bpf_mark_stack_read() call at every location mark_reg_read() is
  called.
- Both bpf_mark_stack_{read,write}() rely on
  env->liveness->cur_instance callchain being in sync with
  env->cur_state. It is possible to update env->liveness->cur_instance
  every time a mark read/write is called, but that costs a hash table
  lookup and is noticeable in the performance profile. Hence, manually
  reset env->liveness->cur_instance whenever the verifier changes
  env->cur_state call stack:
  - call bpf_reset_live_stack_callchain() when the verifier enters a
    subprogram;
  - call bpf_update_live_stack() when the verifier exits a subprogram
    (it implies the reset).

Make sure bpf_update_live_stack() is called for a callchain before
issuing liveness queries. And make sure that bpf_update_live_stack()
is called for any callee callchain first:
- Add bpf_update_live_stack() call at every location that processes
  BPF_EXIT:
  - exit from a subprogram;
  - before pop_stack() call.
  This makes sure that bpf_update_live_stack() is called for callee
  callchains before caller callchains.

Make sure must_write marks are set to zero for instructions that
do not always access the stack:
- Wrap do_check_insn() with bpf_reset_stack_write_marks() /
  bpf_commit_stack_write_marks() calls.
  Any calls to bpf_mark_stack_write() are accumulated between this
  pair of calls. If no bpf_mark_stack_write() calls were made
  it means that the instruction does not access stack (at-least
  on the current verification path) and it is important to record
  this fact.

Finally, use bpf_live_stack_query_init() / bpf_stack_slot_alive()
to query stack liveness info.

The manual tracking of the correct order for callee/caller
bpf_update_live_stack() calls is a bit convoluted and may warrant some
automation in future revisions.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250918-callchain-sensitive-liveness-v3-7-c3cd27bacc60@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-19 09:27:23 -07:00
Eduard Zingerman efcda22aa5 bpf: compute instructions postorder per subprogram
The next patch would require doing postorder traversal of individual
subprograms. Facilitate this by moving env->cfg.insn_postorder
computation from check_cfg() to a separate pass, as check_cfg()
descends into called subprograms (and it needs to, because of
merge_callee_effects() logic).

env->cfg.insn_postorder is used only by compute_live_registers(),
this function does not track cross subprogram dependencies,
thus the change does not affect it's operation.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250918-callchain-sensitive-liveness-v3-5-c3cd27bacc60@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-19 09:27:23 -07:00
Eduard Zingerman 3b20d3c120 bpf: declare a few utility functions as internal api
Namely, rename the following functions and add prototypes to
bpf_verifier.h:
- find_containing_subprog -> bpf_find_containing_subprog
- insn_successors         -> bpf_insn_successors
- calls_callback          -> bpf_calls_callback
- fmt_stack_mask          -> bpf_fmt_stack_mask

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250918-callchain-sensitive-liveness-v3-4-c3cd27bacc60@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-19 09:27:22 -07:00
Eduard Zingerman 12a23f93a5 bpf: remove redundant REG_LIVE_READ check in stacksafe()
stacksafe() is called in exact == NOT_EXACT mode only for states that
had been porcessed by clean_verifier_states(). The latter replaces
dead stack spills with a series of STACK_INVALID masks. Such masks are
already handled by stacksafe().

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250918-callchain-sensitive-liveness-v3-3-c3cd27bacc60@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-19 09:27:22 -07:00
Eduard Zingerman 6cd21eb9ad bpf: use compute_live_registers() info in clean_func_state
Prepare for bpf_reg_state->live field removal by leveraging
insn_aux_data->live_regs_before instead of bpf_reg_state->live in
compute_live_registers(). This is similar to logic in
func_states_equal(). No changes in verification performance for
selftests or sched_ext.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250918-callchain-sensitive-liveness-v3-2-c3cd27bacc60@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-19 09:27:22 -07:00
Eduard Zingerman daf4c2929f bpf: bpf_verifier_state->cleaned flag instead of REG_LIVE_DONE
Prepare for bpf_reg_state->live field removal by introducing a
separate flag to track if clean_verifier_state() had been applied to
the state. No functional changes.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250918-callchain-sensitive-liveness-v3-1-c3cd27bacc60@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-19 09:27:22 -07:00
KP Singh baefdbdf68 bpf: Implement exclusive map creation
Exclusive maps allow maps to only be accessed by program with a
program with a matching hash which is specified in the excl_prog_hash
attr.

For the signing use-case, this allows the trusted loader program
to load the map and verify the integrity

Signed-off-by: KP Singh <kpsingh@kernel.org>
Link: https://lore.kernel.org/r/20250914215141.15144-3-kpsingh@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-18 19:11:42 -07:00
Kumar Kartikeya Dwivedi 1512231b6c bpf: Enforce RCU protection for KF_RCU_PROTECTED
Currently, KF_RCU_PROTECTED only applies to iterator APIs and that too
in a convoluted fashion: the presence of this flag on the kfunc is used
to set MEM_RCU in iterator type, and the lack of RCU protection results
in an error only later, once next() or destroy() methods are invoked on
the iterator. While there is no bug, this is certainly a bit
unintuitive, and makes the enforcement of the flag iterator specific.

In the interest of making this flag useful for other upcoming kfuncs,
e.g. scx_bpf_cpu_curr() [0][1], add enforcement for invoking the kfunc
in an RCU critical section in general.

This would also mean that iterator APIs using KF_RCU_PROTECTED will
error out earlier, instead of throwing an error for lack of RCU CS
protection when next() or destroy() methods are invoked.

In addition to this, if the kfuncs tagged KF_RCU_PROTECTED return a
pointer value, ensure that this pointer value is only usable in an RCU
critical section. There might be edge cases where the return value is
special and doesn't need to imply MEM_RCU semantics, but in general, the
assumption should hold for the majority of kfuncs, and we can revisit
things if necessary later.

  [0]: https://lore.kernel.org/all/20250903212311.369697-3-christian.loehle@arm.com
  [1]: https://lore.kernel.org/all/20250909195709.92669-1-arighi@nvidia.com

Tested-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250917032755.4068726-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-18 15:36:17 -07:00
Eduard Zingerman a3c73d629e bpf: dont report verifier bug for missing bpf_scc_visit on speculative path
Syzbot generated a program that triggers a verifier_bug() call in
maybe_exit_scc(). maybe_exit_scc() assumes that, when called for a
state with insn_idx in some SCC, there should be an instance of struct
bpf_scc_visit allocated for that SCC. Turns out the assumption does
not hold for speculative execution paths. See example in the next
patch.

maybe_scc_exit() is called from update_branch_counts() for states that
reach branch count of zero, meaning that path exploration for a
particular path is finished. Path exploration can finish in one of
three ways:
a. Verification error is found. In this case, update_branch_counts()
   is called only for non-speculative paths.
b. Top level BPF_EXIT is reached. Such instructions are never a part of
   an SCC, so compute_scc_callchain() in maybe_scc_exit() will return
   false, and maybe_scc_exit() will return early.
c. A checkpoint is reached and matched. Checkpoints are created by
   is_state_visited(), which calls maybe_enter_scc(), which allocates
   bpf_scc_visit instances for checkpoints within SCCs.

Hence, for non-speculative symbolic execution paths, the assumption
still holds: if maybe_scc_exit() is called for a state within an SCC,
bpf_scc_visit instance must exist.

This patch removes the verifier_bug() call for speculative paths.

Fixes: c9e31900b5 ("bpf: propagate read/precision marks over state graph backedges")
Reported-by: syzbot+3afc814e8df1af64b653@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/bpf/68c85acd.050a0220.2ff435.03a4.GAE@google.com/
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250916212251.3490455-1-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-17 11:19:58 -07:00
Eduard Zingerman b13448dd64 bpf: potential double-free of env->insn_aux_data
Function bpf_patch_insn_data() has the following structure:

  static struct bpf_prog *bpf_patch_insn_data(... env ...)
  {
        struct bpf_prog *new_prog;
        struct bpf_insn_aux_data *new_data = NULL;

        if (len > 1) {
                new_data = vrealloc(...);  // <--------- (1)
                if (!new_data)
                        return NULL;

                env->insn_aux_data = new_data;  // <---- (2)
        }

        new_prog = bpf_patch_insn_single(env->prog, off, patch, len);
        if (IS_ERR(new_prog)) {
                ...
                vfree(new_data);   // <----------------- (3)
                return NULL;
        }
        ... happy path ...
  }

In case if bpf_patch_insn_single() returns an error the `new_data`
allocated at (1) will be freed at (3). However, at (2) this pointer
is stored in `env->insn_aux_data`. Which is freed unconditionally
by verifier.c:bpf_check() on both happy and error paths.
Thus, leading to double-free.

Fix this by removing vfree() call at (3), ownership over `new_data` is
already passed to `env->insn_aux_data` at this point.

Fixes: 77620d1267 ("bpf: use realloc in bpf_patch_insn_data")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250912-patch-insn-data-double-free-v1-1-af05bd85a21a@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-15 13:04:21 -07:00
Puranjay Mohan 70f23546d2 bpf: core: introduce main_prog_aux for stream access
BPF streams are only valid for the main programs, to make it easier to
access streams from subprogs, introduce main_prog_aux in struct
bpf_prog_aux.

prog->aux->main_prog_aux = prog->aux, for main programs and
prog->aux->main_prog_aux = main_prog->aux, for subprograms.

Make bpf_prog_find_from_stack() use the added main_prog_aux to return
the mainprog when a subprog is found on the stack.

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250911145808.58042-3-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-11 13:00:43 -07:00
Alexei Starovoitov 5d87e96a49 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf after rc5
Cross-merge BPF and other fixes after downstream PR.

No conflicts.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-11 09:34:37 -07:00
Leon Hwang e25ddfb388 bpf: Reject bpf_timer for PREEMPT_RT
When enable CONFIG_PREEMPT_RT, the kernel will warn when run timer
selftests by './test_progs -t timer':

BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48

In order to avoid such warning, reject bpf_timer in verifier when
PREEMPT_RT is enabled.

Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/20250910125740.52172-2-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-10 12:34:09 -07:00
Feng Yang e4980fa646 bpf: Replace kvfree with kfree for kzalloc memory
These pointers are allocated by kzalloc. Therefore, replace kvfree() with
kfree() to avoid unnecessary is_vmalloc_addr() check in kvfree(). This is
the remaining unmodified part from [1].

Signed-off-by: Feng Yang <yangfeng@kylinos.cn>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20250811123949.552885-1-rongqianfeng@vivo.com [1]
Link: https://lore.kernel.org/bpf/20250827032812.498216-1-yangfeng59949@163.com
2025-09-02 17:29:52 +02:00
Paul Chaignon f41345f47f bpf: Use tnums for JEQ/JNE is_branch_taken logic
In the following toy program (reg states minimized for readability), R0
and R1 always have different values at instruction 6. This is obvious
when reading the program but cannot be guessed from ranges alone as
they overlap (R0 in [0; 0xc0000000], R1 in [1024; 0xc0000400]).

  0: call bpf_get_prandom_u32#7  ; R0_w=scalar()
  1: w0 = w0                     ; R0_w=scalar(var_off=(0x0; 0xffffffff))
  2: r0 >>= 30                   ; R0_w=scalar(var_off=(0x0; 0x3))
  3: r0 <<= 30                   ; R0_w=scalar(var_off=(0x0; 0xc0000000))
  4: r1 = r0                     ; R1_w=scalar(var_off=(0x0; 0xc0000000))
  5: r1 += 1024                  ; R1_w=scalar(var_off=(0x400; 0xc0000000))
  6: if r1 != r0 goto pc+1

Looking at tnums however, we can deduce that R1 is always different from
R0 because their tnums don't agree on known bits. This patch uses this
logic to improve is_scalar_branch_taken in case of BPF_JEQ and BPF_JNE.

This change has a tiny impact on complexity, which was measured with
the Cilium complexity CI test. That test covers 72 programs with
various build and load time configurations for a total of 970 test
cases. For 80% of test cases, the patch has no impact. On the other
test cases, the patch decreases complexity by only 0.08% on average. In
the best case, the verifier needs to walk 3% less instructions and, in
the worst case, 1.5% more. Overall, the patch has a small positive
impact, especially for our largest programs.

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/be3ee70b6e489c49881cb1646114b1d861b5c334.1755694147.git.paul.chaignon@gmail.com
2025-08-22 18:12:24 +02:00
Martin KaFai Lau 5c42715e63 Merge branch 'bpf-next/skb-meta-dynptr' into 'bpf-next/master'
Merge 'skb-meta-dynptr' branch into 'master' branch. No conflict.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2025-08-18 17:59:26 -07:00
Jakub Sitnicki 89d912e494 bpf: Add dynptr type for skb metadata
Add a dynptr type, similar to skb dynptr, but for the skb metadata access.

The dynptr provides an alternative to __sk_buff->data_meta for accessing
the custom metadata area allocated using the bpf_xdp_adjust_meta() helper.

More importantly, it abstracts away the fact where the storage for the
custom metadata lives, which opens up the way to persist the metadata by
relocating it as the skb travels through the network stack layers.

Writes to skb metadata invalidate any existing skb payload and metadata
slices. While this is more restrictive that needed at the moment, it leaves
the door open to reallocating the metadata on writes, and should be only a
minor inconvenience to the users.

Only the program types which can access __sk_buff->data_meta today are
allowed to create a dynptr for skb metadata at the moment. We need to
modify the network stack to persist the metadata across layers before
opening up access to other BPF hooks.

Once more BPF hooks gain access to skb_meta dynptr, we will also need to
add a read-only variant of the helper similar to
bpf_dynptr_from_skb_rdonly.

skb_meta dynptr ops are stubbed out and implemented by subsequent changes.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Jesse Brandeburg <jbrandeburg@cloudflare.com>
Link: https://patch.msgid.link/20250814-skb-metadata-thru-dynptr-v7-1-8a39e636e0fb@cloudflare.com
2025-08-18 10:29:42 -07:00
Anton Protopopov dbe99ea541 bpf: Add a verbose message when the BTF limit is reached
When a BPF program which is being loaded reaches the map limit
(MAX_USED_MAPS) or the BTF limit (MAX_USED_BTFS) the -E2BIG is
returned. However, in the former case there is an accompanying
verifier verbose message, and in the latter case there is not.
Add a verbose message to make the behaviour symmetrical.

Reported-by: Kevin Sheldrake <kevin.sheldrake@isovalent.com>
Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20250816151554.902995-1-a.s.protopopov@gmail.com
2025-08-18 17:27:01 +02:00
Jiri Olsa e4414b01c1 bpf: Check the helper function is valid in get_helper_proto
kernel test robot reported verifier bug [1] where the helper func
pointer could be NULL due to disabled config option.

As Alexei suggested we could check on that in get_helper_proto
directly. Marking tail_call helper func with BPF_PTR_POISON,
because it is unused by design.

  [1] https://lore.kernel.org/oe-lkp/202507160818.68358831-lkp@intel.com

Reported-by: kernel test robot <oliver.sang@intel.com>
Reported-by: syzbot+a9ed3d9132939852d0df@syzkaller.appspotmail.com
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Paul Chaignon <paul.chaignon@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20250814200655.945632-1-jolsa@kernel.org
Closes: https://lore.kernel.org/oe-lkp/202507160818.68358831-lkp@intel.com
2025-08-15 11:16:56 +02:00
Qianfeng Rong bf0c2a84df bpf: Replace kvfree with kfree for kzalloc memory
The 'backedge' pointer is allocated with kzalloc(), which returns
physically contiguous memory. Using kvfree() to deallocate such
memory is functionally safe but semantically incorrect.

Replace kvfree() with kfree() to avoid unnecessary is_vmalloc_addr()
check in kvfree().

Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20250811123949.552885-1-rongqianfeng@vivo.com
2025-08-12 15:55:01 -07:00
Martin KaFai Lau 9e293d47bf Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Cross merge bpf/master after 6.17-rc1.

No conflict.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2025-08-12 12:14:02 -07:00
Eduard Zingerman 77620d1267 bpf: use realloc in bpf_patch_insn_data
Avoid excessive vzalloc/vfree calls when patching instructions in
do_misc_fixups(). bpf_patch_insn_data() uses vzalloc to allocate new
memory for env->insn_aux_data for each patch as follows:

  struct bpf_prog *bpf_patch_insn_data(env, ...)
  {
    ...
    new_data = vzalloc(... O(program size) ...);
    ...
    adjust_insn_aux_data(env, new_data, ...);
    ...
  }

  void adjust_insn_aux_data(env, new_data, ...)
  {
    ...
    memcpy(new_data, env->insn_aux_data);
    vfree(env->insn_aux_data);
    env->insn_aux_data = new_data;
    ...
  }

The vzalloc/vfree pair is hot in perf report collected for e.g.
pyperf180 test case. It can be replaced with a call to vrealloc in
order to reduce the number of actual memory allocations.

This is a stop-gap solution, as bpf_patch_insn_data is still hot in
the profile. More comprehansive solutions had been discussed before
e.g. as in [1].

[1] https://lore.kernel.org/bpf/CAEf4BzY_E8MSL4mD0UPuuiDcbJhh9e2xQo2=5w+ppRWWiYSGvQ@mail.gmail.com/

Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Tested-by: Anton Protopopov <a.s.protopopov@gmail.com>
Link: https://lore.kernel.org/r/20250807010205.3210608-3-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-08-07 09:17:02 -07:00
Eduard Zingerman cb070a8156 bpf: removed unused 'env' parameter from is_reg64 and insn_has_def32
Parameter 'env' is not used by is_reg64() and insn_has_def32()
functions. Remove the parameter to make it clear that neither function
depends on 'env' state, e.g. env->insn_aux_data.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250807010205.3210608-2-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-08-07 09:17:02 -07:00
Eduard Zingerman 1b30d44417 bpf: Fix memory leak of bpf_scc_info objects
env->scc_info array contains references to bpf_scc_info objects
allocated lazily in verifier.c:scc_visit_alloc().
env->scc_cnt was supposed to track env->scc_info array size
in order to free referenced objects in verifier.c:free_states().
Fix initialization of env->scc_cnt that was omitted in
verifier.c:compute_scc().

To reproduce the bug:
- build with CONFIG_DEBUG_KMEMLEAK
- boot and load bpf program with loops, e.g.:
  ./veristat -q pyperf180.bpf.o
- initiate memleak scan and check results:
  echo scan > /sys/kernel/debug/kmemleak
  cat /sys/kernel/debug/kmemleak

Fixes: c9e31900b5 ("bpf: propagate read/precision marks over state graph backedges")
Reported-by: Jens Axboe <axboe@kernel.dk>
Closes: https://lore.kernel.org/bpf/CAADnVQKXUWg9uRCPD5ebRXwN4dmBCRUFFM7kN=GxymYz3zU25A@mail.gmail.com/T/
Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Tested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250801232330.1800436-1-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-08-02 09:04:57 -07:00
Paul Chaignon f914876eec bpf: Improve ctx access verifier error message
We've already had two "error during ctx access conversion" warnings
triggered by syzkaller. Let's improve the error message by dumping the
cnt variable so that we can more easily differentiate between the
different error cases.

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/cc94316c30dd76fae4a75a664b61a2dbfe68e205.1754039605.git.paul.chaignon@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-08-01 09:22:44 -07:00
KaFai Wan 863aab3d4d bpf: Add log for attaching tracing programs to functions in deny list
Show the rejected function name when attaching tracing programs to
functions in deny list.

With this change, we know why tracing programs can't attach to functions
like __rcu_read_lock() from log.

$ ./fentry
libbpf: prog '__rcu_read_lock': BPF program load failed: -EINVAL
libbpf: prog '__rcu_read_lock': -- BEGIN PROG LOAD LOG --
Attaching tracing programs to function '__rcu_read_lock' is rejected.

Suggested-by: Leon Hwang <leon.hwang@linux.dev>
Signed-off-by: KaFai Wan <kafai.wan@linux.dev>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20250724151454.499040-3-kafai.wan@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-28 19:39:29 -07:00
KaFai Wan a5a6b29a70 bpf: Show precise rejected function when attaching fexit/fmod_ret to __noreturn functions
With this change, we know the precise rejected function name when
attaching fexit/fmod_ret to __noreturn functions from log.

$ ./fexit
libbpf: prog 'fexit': BPF program load failed: -EINVAL
libbpf: prog 'fexit': -- BEGIN PROG LOAD LOG --
Attaching fexit/fmod_ret to __noreturn function 'do_exit' is rejected.

Suggested-by: Leon Hwang <leon.hwang@linux.dev>
Signed-off-by: KaFai Wan <kafai.wan@linux.dev>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20250724151454.499040-2-kafai.wan@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-28 19:39:29 -07:00
Suchit Karunakaran 5b4c54ac49 bpf: Fix various typos in verifier.c comments
This patch fixes several minor typos in comments within the BPF verifier.
No changes in functionality.

Signed-off-by: Suchit Karunakaran <suchitkarunakaran@gmail.com>
Link: https://lore.kernel.org/r/20250727081754.15986-1-suchitkarunakaran@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-28 10:02:57 -07:00
Paul Chaignon 5dbb19b16a bpf: Add third round of bounds deduction
Commit d7f0087381 ("bpf: try harder to deduce register bounds from
different numeric domains") added a second call to __reg_deduce_bounds
in reg_bounds_sync because a single call wasn't enough to converge to a
fixed point in terms of register bounds.

With patch "bpf: Improve bounds when s64 crosses sign boundary" from
this series, Eduard noticed that calling __reg_deduce_bounds twice isn't
enough anymore to converge. The first selftest added in "selftests/bpf:
Test cross-sign 64bits range refinement" highlights the need for a third
call to __reg_deduce_bounds. After instruction 7, reg_bounds_sync
performs the following bounds deduction:

  reg_bounds_sync entry:          scalar(smin=-655,smax=0xeffffeee,smin32=-783,smax32=-146)
  __update_reg_bounds:            scalar(smin=-655,smax=0xeffffeee,smin32=-783,smax32=-146)
  __reg_deduce_bounds:
      __reg32_deduce_bounds:      scalar(smin=-655,smax=0xeffffeee,smin32=-783,smax32=-146,umin32=0xfffffcf1,umax32=0xffffff6e)
      __reg64_deduce_bounds:      scalar(smin=-655,smax=0xeffffeee,smin32=-783,smax32=-146,umin32=0xfffffcf1,umax32=0xffffff6e)
      __reg_deduce_mixed_bounds:  scalar(smin=-655,smax=0xeffffeee,umin=umin32=0xfffffcf1,umax=0xffffffffffffff6e,smin32=-783,smax32=-146,umax32=0xffffff6e)
  __reg_deduce_bounds:
      __reg32_deduce_bounds:      scalar(smin=-655,smax=0xeffffeee,umin=umin32=0xfffffcf1,umax=0xffffffffffffff6e,smin32=-783,smax32=-146,umax32=0xffffff6e)
      __reg64_deduce_bounds:      scalar(smin=-655,smax=smax32=-146,umin=0xfffffffffffffd71,umax=0xffffffffffffff6e,smin32=-783,umin32=0xfffffcf1,umax32=0xffffff6e)
      __reg_deduce_mixed_bounds:  scalar(smin=-655,smax=smax32=-146,umin=0xfffffffffffffd71,umax=0xffffffffffffff6e,smin32=-783,umin32=0xfffffcf1,umax32=0xffffff6e)
  __reg_bound_offset:             scalar(smin=-655,smax=smax32=-146,umin=0xfffffffffffffd71,umax=0xffffffffffffff6e,smin32=-783,umin32=0xfffffcf1,umax32=0xffffff6e,var_off=(0xfffffffffffffc00; 0x3ff))
  __update_reg_bounds:            scalar(smin=-655,smax=smax32=-146,umin=0xfffffffffffffd71,umax=0xffffffffffffff6e,smin32=-783,umin32=0xfffffcf1,umax32=0xffffff6e,var_off=(0xfffffffffffffc00; 0x3ff))

In particular, notice how:
1. In the first call to __reg_deduce_bounds, __reg32_deduce_bounds
   learns new u32 bounds.
2. __reg64_deduce_bounds is unable to improve bounds at this point.
3. __reg_deduce_mixed_bounds derives new u64 bounds from the u32 bounds.
4. In the second call to __reg_deduce_bounds, __reg64_deduce_bounds
   improves the smax and umin bounds thanks to patch "bpf: Improve
   bounds when s64 crosses sign boundary" from this series.
5. Subsequent functions are unable to improve the ranges further (only
   tnums). Yet, a better smin32 bound could be learned from the smin
   bound.

__reg32_deduce_bounds is able to improve smin32 from smin, but for that
we need a third call to __reg_deduce_bounds.

As discussed in [1], there may be a better way to organize the deduction
rules to learn the same information with less calls to the same
functions. Such an optimization requires further analysis and is
orthogonal to the present patchset.

Link: https://lore.kernel.org/bpf/aIKtSK9LjQXB8FLY@mail.gmail.com/ [1]
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Co-developed-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Link: https://lore.kernel.org/r/79619d3b42e5525e0e174ed534b75879a5ba15de.1753695655.git.paul.chaignon@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-28 10:02:13 -07:00
Paul Chaignon 00bf8d0c6c bpf: Improve bounds when s64 crosses sign boundary
__reg64_deduce_bounds currently improves the s64 range using the u64
range and vice versa, but only if it doesn't cross the sign boundary.

This patch improves __reg64_deduce_bounds to cover the case where the
s64 range crosses the sign boundary but overlaps with the u64 range on
only one end. In that case, we can improve both ranges. Consider the
following example, with the s64 range crossing the sign boundary:

    0                                                   U64_MAX
    |  [xxxxxxxxxxxxxx u64 range xxxxxxxxxxxxxx]              |
    |----------------------------|----------------------------|
    |xxxxx s64 range xxxxxxxxx]                       [xxxxxxx|
    0                     S64_MAX S64_MIN                    -1

The u64 range overlaps only with positive portion of the s64 range. We
can thus derive the following new s64 and u64 ranges.

    0                                                   U64_MAX
    |  [xxxxxx u64 range xxxxx]                               |
    |----------------------------|----------------------------|
    |  [xxxxxx s64 range xxxxx]                               |
    0                     S64_MAX S64_MIN                    -1

The same logic can probably apply to the s32/u32 ranges, but this patch
doesn't implement that change.

In addition to the selftests, the __reg64_deduce_bounds change was
also tested with Agni, the formal verification tool for the range
analysis [1].

Link: https://github.com/bpfverif/agni [1]
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Link: https://lore.kernel.org/r/933bd9ce1f36ded5559f92fdc09e5dbc823fa245.1753695655.git.paul.chaignon@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-28 10:02:12 -07:00
Paul Chaignon 5345e64760 bpf: Simplify bounds refinement from s32
During the bounds refinement, we improve the precision of various ranges
by looking at other ranges. Among others, we improve the following in
this order (other things happen between 1 and 2):

  1. Improve u32 from s32 in __reg32_deduce_bounds.
  2. Improve s/u64 from u32 in __reg_deduce_mixed_bounds.
  3. Improve s/u64 from s32 in __reg_deduce_mixed_bounds.

In particular, if the s32 range forms a valid u32 range, we will use it
to improve the u32 range in __reg32_deduce_bounds. In
__reg_deduce_mixed_bounds, under the same condition, we will use the s32
range to improve the s/u64 ranges.

If at (1) we were able to learn from s32 to improve u32, we'll then be
able to use that in (2) to improve s/u64. Hence, as (3) happens under
the same precondition as (1), it won't improve s/u64 ranges further than
(1)+(2) did. Thus, we can get rid of (3).

In addition to the extensive suite of selftests for bounds refinement,
this patch was also tested with the Agni formal verification tool [1].

Additionally, Eduard mentioned:

  The argument appears to be as follows:

  Under precondition `(u32)reg->s32_min <= (u32)reg->s32_max`
  __reg32_deduce_bounds produces:

    reg->u32_min = max_t(u32, reg->s32_min, reg->u32_min);
    reg->u32_max = min_t(u32, reg->s32_max, reg->u32_max);

  And then first part of __reg_deduce_mixed_bounds assigns:

    a. reg->umin umax= (reg->umin & ~0xffffffffULL) | max_t(u32, reg->s32_min, reg->u32_min);
    b. reg->umax umin= (reg->umax & ~0xffffffffULL) | min_t(u32, reg->s32_max, reg->u32_max);

  And then second part of __reg_deduce_mixed_bounds assigns:

    c. reg->umin umax= (reg->umin & ~0xffffffffULL) | (u32)reg->s32_min;
    d. reg->umax umin= (reg->umax & ~0xffffffffULL) | (u32)reg->s32_max;

  But assignment (c) is a noop because:

     max_t(u32, reg->s32_min, reg->u32_min) >= (u32)reg->s32_min

  Hence RHS(a) >= RHS(c) and umin= does nothing.

  Also assignment (d) is a noop because:

    min_t(u32, reg->s32_max, reg->u32_max) <= (u32)reg->s32_max

  Hence RHS(b) <= RHS(d) and umin= does nothing.

  Plus the same reasoning for the part dealing with reg->s{min,max}_value:

    e. reg->smin_value smax= (reg->smin_value & ~0xffffffffULL) | max_t(u32, reg->s32_min_value, reg->u32_min_value);
    f. reg->smax_value smin= (reg->smax_value & ~0xffffffffULL) | min_t(u32, reg->s32_max_value, reg->u32_max_value);

      vs

    g. reg->smin_value smax= (reg->smin_value & ~0xffffffffULL) | (u32)reg->s32_min_value;
    h. reg->smax_value smin= (reg->smax_value & ~0xffffffffULL) | (u32)reg->s32_max_value;

      RHS(e) >= RHS(g) and RHS(f) <= RHS(h), hence smax=,smin= do nothing.

  This appears to be correct.

Also, Shung-Hsi:

  Beside going through the reasoning, I also played with CBMC a bit to
  double check that as far as a single run of __reg_deduce_bounds() is
  concerned (and that the register state matches certain handwavy
  expectations), the change indeed still preserve the original behavior.

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://github.com/bpfverif/agni [1]
Link: https://lore.kernel.org/bpf/aIJwnFnFyUjNsCNa@mail.gmail.com
2025-07-27 19:23:29 +02:00
Paul Chaignon 6279846b9b bpf: Forget ranges when refining tnum after JSET
Syzbot reported a kernel warning due to a range invariant violation on
the following BPF program.

  0: call bpf_get_netns_cookie
  1: if r0 == 0 goto <exit>
  2: if r0 & Oxffffffff goto <exit>

The issue is on the path where we fall through both jumps.

That path is unreachable at runtime: after insn 1, we know r0 != 0, but
with the sign extension on the jset, we would only fallthrough insn 2
if r0 == 0. Unfortunately, is_branch_taken() isn't currently able to
figure this out, so the verifier walks all branches. The verifier then
refines the register bounds using the second condition and we end
up with inconsistent bounds on this unreachable path:

  1: if r0 == 0 goto <exit>
    r0: u64=[0x1, 0xffffffffffffffff] var_off=(0, 0xffffffffffffffff)
  2: if r0 & 0xffffffff goto <exit>
    r0 before reg_bounds_sync: u64=[0x1, 0xffffffffffffffff] var_off=(0, 0)
    r0 after reg_bounds_sync:  u64=[0x1, 0] var_off=(0, 0)

Improving the range refinement for JSET to cover all cases is tricky. We
also don't expect many users to rely on JSET given LLVM doesn't generate
those instructions. So instead of improving the range refinement for
JSETs, Eduard suggested we forget the ranges whenever we're narrowing
tnums after a JSET. This patch implements that approach.

Reported-by: syzbot+c711ce17dd78e5d4fdcf@syzkaller.appspotmail.com
Suggested-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Link: https://lore.kernel.org/r/9d4fd6432a095d281f815770608fdcd16028ce0b.1752171365.git.paul.chaignon@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-11 10:45:25 -07:00
Luis Gerhorst dadb59104c bpf: Fix aux usage after do_check_insn()
We must terminate the speculative analysis if the just-analyzed insn had
nospec_result set. Using cur_aux() here is wrong because insn_idx might
have been incremented by do_check_insn(). Therefore, introduce and use
insn_aux variable.

Also change cur_aux(env)->nospec in case do_check_insn() ever manages to
increment insn_idx but still fail.

Change the warning to check the insn class (which prevents it from
triggering for ldimm64, for which nospec_result would not be
problematic) and use verifier_bug_if().

In line with Eduard's suggestion, do not introduce prev_aux() because
that requires one to understand that after do_check_insn() call what was
current became previous. This would at-least require a comment.

Fixes: d6f1c85f22 ("bpf: Fall back to nospec for Spectre v1")
Reported-by: Paul Chaignon <paul.chaignon@gmail.com>
Reported-by: Eduard Zingerman <eddyz87@gmail.com>
Reported-by: syzbot+dc27c5fb8388e38d2d37@syzkaller.appspotmail.com
Link: https://lore.kernel.org/bpf/685b3c1b.050a0220.2303ee.0010.GAE@google.com/
Link: https://lore.kernel.org/bpf/4266fd5de04092aa4971cbef14f1b4b96961f432.camel@gmail.com/
Suggested-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Luis Gerhorst <luis.gerhorst@fau.de>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250705190908.1756862-2-luis.gerhorst@fau.de
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-07 08:32:34 -07:00
Eduard Zingerman c4aa454c64 bpf: support for void/primitive __arg_untrusted global func params
Allow specifying __arg_untrusted for void */char */int */long *
parameters. Treat such parameters as
PTR_TO_MEM|MEM_RDONLY|PTR_UNTRUSTED of size zero.
Intended usage is as follows:

  int memcmp(char *a __arg_untrusted, char *b __arg_untrusted, size_t n) {
    bpf_for(i, 0, n) {
      if (a[i] - b[i])      // load at any offset is allowed
        return a[i] - b[i];
    }
    return 0;
  }

Allocate register id for ARG_PTR_TO_MEM parameters only when
PTR_MAYBE_NULL is set. Register id for PTR_TO_MEM is used only to
propagate non-null status after conditionals.

Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250704230354.1323244-8-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-07 08:25:07 -07:00
Eduard Zingerman 182f7df704 bpf: attribute __arg_untrusted for global function parameters
Add support for PTR_TO_BTF_ID | PTR_UNTRUSTED global function
parameters. Anything is allowed to pass to such parameters, as these
are read-only and probe read instructions would protect against
invalid memory access.

Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250704230354.1323244-5-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-07 08:25:06 -07:00
Eduard Zingerman 2d5c91e1cc bpf: rdonly_untrusted_mem for btf id walk pointer leafs
When processing a load from a PTR_TO_BTF_ID, the verifier calculates
the type of the loaded structure field based on the load offset.
For example, given the following types:

  struct foo {
    struct foo *a;
    int *b;
  } *p;

The verifier would calculate the type of `p->a` as a pointer to
`struct foo`. However, the type of `p->b` is currently calculated as a
SCALAR_VALUE.

This commit updates the logic for processing PTR_TO_BTF_ID to instead
calculate the type of p->b as PTR_TO_MEM|MEM_RDONLY|PTR_UNTRUSTED.
This change allows further dereferencing of such pointers (using probe
memory instructions).

Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250704230354.1323244-3-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-07 08:25:06 -07:00
Eduard Zingerman b9d44bc9fd bpf: make makr_btf_ld_reg return error for unexpected reg types
Non-functional change:
mark_btf_ld_reg() expects 'reg_type' parameter to be either
SCALAR_VALUE or PTR_TO_BTF_ID. Next commit expands this set, so update
this function to fail if unexpected type is passed. Also update
callers to propagate the error.

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250704230354.1323244-2-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-07 08:25:06 -07:00
Yonghong Song 82bc4abf28 bpf: Avoid putting struct bpf_scc_callchain variables on the stack
Add a 'struct bpf_scc_callchain callchain_buf' field in bpf_verifier_env.
This way, the previous bpf_scc_callchain local variables can be
replaced by taking address of env->callchain_buf. This can reduce stack
usage and fix the following error:
    kernel/bpf/verifier.c:19921:12: error: stack frame size (1368) exceeds limit (1280) in 'do_check'
        [-Werror,-Wframe-larger-than]

Reported-by: Arnd Bergmann <arnd@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20250703141117.1485108-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-03 19:31:30 -07:00