Commit Graph

43332 Commits (2012a6abc87657c6c8171bb5ff13dd9bafb241bf)

Author SHA1 Message Date
Jakub Kicinski 753c8608f3 bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZWiCPAAKCRDbK58LschI
 g4djAQC1FdqCRIFkhbiIRNHTgHjnfQShELQbd9ofJqzylLqmmgD+JI1E7D9SXagm
 pIXQ26EGmq8/VcCT3VLncA8EsC76Gg4=
 =Xowm
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2023-11-30

We've added 30 non-merge commits during the last 7 day(s) which contain
a total of 58 files changed, 1598 insertions(+), 154 deletions(-).

The main changes are:

1) Add initial TX metadata implementation for AF_XDP with support in mlx5
   and stmmac drivers. Two types of offloads are supported right now, that
   is, TX timestamp and TX checksum offload, from Stanislav Fomichev with
   stmmac implementation from Song Yoong Siang.

2) Change BPF verifier logic to validate global subprograms lazily instead
   of unconditionally before the main program, so they can be guarded using
   BPF CO-RE techniques, from Andrii Nakryiko.

3) Add BPF link_info support for uprobe multi link along with bpftool
   integration for the latter, from Jiri Olsa.

4) Use pkg-config in BPF selftests to determine ld flags which is
   in particular needed for linking statically, from Akihiko Odaki.

5) Fix a few BPF selftest failures to adapt to the upcoming LLVM18,
   from Yonghong Song.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (30 commits)
  bpf/tests: Remove duplicate JSGT tests
  selftests/bpf: Add TX side to xdp_hw_metadata
  selftests/bpf: Convert xdp_hw_metadata to XDP_USE_NEED_WAKEUP
  selftests/bpf: Add TX side to xdp_metadata
  selftests/bpf: Add csum helpers
  selftests/xsk: Support tx_metadata_len
  xsk: Add option to calculate TX checksum in SW
  xsk: Validate xsk_tx_metadata flags
  xsk: Document tx_metadata_len layout
  net: stmmac: Add Tx HWTS support to XDP ZC
  net/mlx5e: Implement AF_XDP TX timestamp and checksum offload
  tools: ynl: Print xsk-features from the sample
  xsk: Add TX timestamp and TX checksum offload support
  xsk: Support tx_metadata_len
  selftests/bpf: Use pkg-config for libelf
  selftests/bpf: Override PKG_CONFIG for static builds
  selftests/bpf: Choose pkg-config for the target
  bpftool: Add support to display uprobe_multi links
  selftests/bpf: Add link_info test for uprobe_multi link
  selftests/bpf: Use bpf_link__destroy in fill_link_info tests
  ...
====================

Conflicts:

Documentation/netlink/specs/netdev.yaml:
  839ff60df3 ("net: page_pool: add nlspec for basic access to page pools")
  48eb03dd26 ("xsk: Add TX timestamp and TX checksum offload support")
https://lore.kernel.org/all/20231201094705.1ee3cab8@canb.auug.org.au/

While at it also regen, tree is dirty after:
  48eb03dd26 ("xsk: Add TX timestamp and TX checksum offload support")
looks like code wasn't re-rendered after "render-max" was removed.

Link: https://lore.kernel.org/r/20231130145708.32573-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-11-30 16:58:42 -08:00
Jakub Kicinski 975f2d73a9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

No conflicts.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-11-30 16:11:19 -08:00
Linus Torvalds 6172a5180f Including fixes from bpf and wifi.
Current release - regressions:
 
   - neighbour: fix __randomize_layout crash in struct neighbour
 
   - r8169: fix deadlock on RTL8125 in jumbo mtu mode
 
 Previous releases - regressions:
 
   - wifi:
     - mac80211: fix warning at station removal time
     - cfg80211: fix CQM for non-range use
 
   - tools: ynl-gen: fix unexpected response handling
 
   - octeontx2-af: fix possible buffer overflow
 
   - dpaa2: recycle the RX buffer only after all processing done
 
   - rswitch: fix missing dev_kfree_skb_any() in error path
 
 Previous releases - always broken:
 
   - ipv4: fix uaf issue when receiving igmp query packet
 
   - wifi: mac80211: fix debugfs deadlock at device removal time
 
   - bpf:
     - sockmap: af_unix stream sockets need to hold ref for pair sock
     - netdevsim: don't accept device bound programs
 
   - selftests: fix a char signedness issue
 
   - dsa: mv88e6xxx: fix marvell 6350 probe crash
 
   - octeontx2-pf: restore TC ingress police rules when interface is up
 
   - wangxun: fix memory leak on msix entry
 
   - ravb: keep reverse order of operations in ravb_remove()
 
 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmVobzISHHBhYmVuaUBy
 ZWRoYXQuY29tAAoJECkkeY3MjxOk4rwP/2qaUstOJVpkO8cG+bRYi3idH9uO/8Yu
 dYgFI4LM826YgbVNVzuiu9Sh7t78dbep/fWQ2quDuZUinWtPmv6RV3UKbDyNWLRr
 iV7sZvXElGsUefixxGANYDUPuCrlr3O230Y8zCN0R65BMurppljs9Pp8FwIqaD+v
 pVs2alb/PeX7g+hPACKPr4Knu8QeZYmzdHoyYeLoMG3PqIgJVU3/8OHHfmnYCdxT
 VSss2LB5FKFCOgetEPGy83KQP7QVaK22GDphZJ4xh7aSewRVP92ORfauiI8To4vQ
 0VnLNcQ+1pXnYzgGdv8oF02e4EP5b0jvrTpqCw1U0QU2s2PARJarzajCXBkwa308
 gXELRpVRpY4+7WEBSX4RGUigurwGGEh/IP/puVtPDr9KU3lFgaTI8wM624Y3Ob/e
 /LVI7a5kUSJysq9/H/QrHjoiuTtV7nCmzBgDqEFSN5hQinSHYKyD0XsUPcLlMJmn
 p6CyQDGHv2ibbg+8TStig0xfmC83N8KfDfcCekSrYxquDMTRtfa2VXofzQiQKDnr
 XNyIURmZAAUVPR6enxlg5Iqzc0mQGumYif7wzsO1bzVzmVZgIDCVxU95hkoRrutU
 qnWXuUGUdieUvXA9HltntTzy2BgJVtg7L/p8YEbd97dxtgK80sbdnjfDswFvEeE4
 nTvE+IDKdCmb
 =QiQp
 -----END PGP SIGNATURE-----

Merge tag 'net-6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
 "Including fixes from bpf and wifi.

  Current release - regressions:

   - neighbour: fix __randomize_layout crash in struct neighbour

   - r8169: fix deadlock on RTL8125 in jumbo mtu mode

  Previous releases - regressions:

   - wifi:
       - mac80211: fix warning at station removal time
       - cfg80211: fix CQM for non-range use

   - tools: ynl-gen: fix unexpected response handling

   - octeontx2-af: fix possible buffer overflow

   - dpaa2: recycle the RX buffer only after all processing done

   - rswitch: fix missing dev_kfree_skb_any() in error path

  Previous releases - always broken:

   - ipv4: fix uaf issue when receiving igmp query packet

   - wifi: mac80211: fix debugfs deadlock at device removal time

   - bpf:
       - sockmap: af_unix stream sockets need to hold ref for pair sock
       - netdevsim: don't accept device bound programs

   - selftests: fix a char signedness issue

   - dsa: mv88e6xxx: fix marvell 6350 probe crash

   - octeontx2-pf: restore TC ingress police rules when interface is up

   - wangxun: fix memory leak on msix entry

   - ravb: keep reverse order of operations in ravb_remove()"

* tag 'net-6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (51 commits)
  net: ravb: Keep reverse order of operations in ravb_remove()
  net: ravb: Stop DMA in case of failures on ravb_open()
  net: ravb: Start TX queues after HW initialization succeeded
  net: ravb: Make write access to CXR35 first before accessing other EMAC registers
  net: ravb: Use pm_runtime_resume_and_get()
  net: ravb: Check return value of reset_control_deassert()
  net: libwx: fix memory leak on msix entry
  ice: Fix VF Reset paths when interface in a failed over aggregate
  bpf, sockmap: Add af_unix test with both sockets in map
  bpf, sockmap: af_unix stream sockets need to hold ref for pair sock
  tools: ynl-gen: always construct struct ynl_req_state
  ethtool: don't propagate EOPNOTSUPP from dumps
  ravb: Fix races between ravb_tx_timeout_work() and net related ops
  r8169: prevent potential deadlock in rtl8169_close
  r8169: fix deadlock on RTL8125 in jumbo mtu mode
  neighbour: Fix __randomize_layout crash in struct neighbour
  octeontx2-pf: Restore TC ingress police rules when interface is up
  octeontx2-pf: Fix adding mbox work queue entry when num_vfs > 64
  net: stmmac: xgmac: Disable FPE MMC interrupts
  octeontx2-af: Fix possible buffer overflow
  ...
2023-12-01 08:24:46 +09:00
Jiri Olsa e56fdbfb06 bpf: Add link_info support for uprobe multi link
Adding support to get uprobe_link details through bpf_link_info
interface.

Adding new struct uprobe_multi to struct bpf_link_info to carry
the uprobe_multi link details.

The uprobe_multi.count is passed from user space to denote size
of array fields (offsets/ref_ctr_offsets/cookies). The actual
array size is stored back to uprobe_multi.count (allowing user
to find out the actual array size) and array fields are populated
up to the user passed size.

All the non-array fields (path/count/flags/pid) are always set.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/bpf/20231125193130.834322-4-jolsa@kernel.org
2023-11-28 21:50:09 -08:00
Jiri Olsa 4930b7f53a bpf: Store ref_ctr_offsets values in bpf_uprobe array
We will need to return ref_ctr_offsets values through link_info
interface in following change, so we need to keep them around.

Storing ref_ctr_offsets values directly into bpf_uprobe array.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/bpf/20231125193130.834322-3-jolsa@kernel.org
2023-11-28 21:50:09 -08:00
Hou Tao 75a442581d bpf: Add missed allocation hint for bpf_mem_cache_alloc_flags()
bpf_mem_cache_alloc_flags() may call __alloc() directly when there is no
free object in free list, but it doesn't initialize the allocation hint
for the returned pointer. It may lead to bad memory dereference when
freeing the pointer, so fix it by initializing the allocation hint.

Fixes: 822fb26bdb ("bpf: Add a hint to allocated objects.")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231111043821.2258513-1-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-26 18:00:26 -08:00
Linus Torvalds 1d0dbc3d16 Fix lockdep block chain corruption resulting in KASAN warnings.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmVjEa0RHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1jGMRAAvlW/mmlwp4lRv/+aIRBo3iAzDS9vkPds
 uuS7jOweKkFJZJTR0Fr/OppRB05JObSUVXQSH71hGc0YUC29NEQyqa03Qy6MDdDx
 TuvDzIUildQqcUVJLRV2d8PmNRfFQftuQnvQcFpk+T0jrElBq6ADTe0SAwbSYLVU
 8onXjYrGRsxOaZP7zQ99o4BkWyX7DHMv8lMhq5QdEHotg8/4BkcYDU4F99Zs0tu9
 txi2RPDCvR8JmvK37qMXumexu/IMBcE8OQadmlQjK1uPiXIBj+7iHdrqDegUIayk
 XyttXmvODb8SgXL/o5thbmHI9ZGsTSK0RpwQMO5CHrF0LmlI/z2bNClz9bGMh/7A
 Sa6misq4at0o50RQmpus3zo8q8hZ1P37bhyhIBgsfbzLJCVWU5LAltV3A6OrDygy
 YR4j29qSsnZvRZ1kvlfDROS5t4QicPN1IwfYxdDJypnlapIeQbmt1nLQFH1zaCN4
 EwYeVTfJ9dJpXozZTPftD/uiPhj7NZUNUhkVI9mngP46XMCC1GWjF1CcPYLuv8Iz
 Qw0Gj4YzDWFwuG98r3hrntXaTz2BKy4GVAQTQcpswhdPFJ/BPxY4AJPeTznm7fQX
 Lu2bIBLYlUROvuL45TgAPArh17iC8O1pfxwTfEOxlQvi9+xNzN9hPNsWRSiJnYlV
 R4q3G7Ejelo=
 =8C4B
 -----END PGP SIGNATURE-----

Merge tag 'locking-urgent-2023-11-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking fix from Ingo Molnar:
 "Fix lockdep block chain corruption resulting in KASAN warnings"

* tag 'locking-urgent-2023-11-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  lockdep: Fix block chain corruption
2023-11-26 08:30:11 -08:00
Peter Zijlstra bca4104b00 lockdep: Fix block chain corruption
Kent reported an occasional KASAN splat in lockdep. Mark then noted:

> I suspect the dodgy access is to chain_block_buckets[-1], which hits the last 4
> bytes of the redzone and gets (incorrectly/misleadingly) attributed to
> nr_large_chain_blocks.

That would mean @size == 0, at which point size_to_bucket() returns -1
and the above happens.

alloc_chain_hlocks() has 'size - req', for the first with the
precondition 'size >= rq', which allows the 0.

This code is trying to split a block, del_chain_block() takes what we
need, and add_chain_block() puts back the remainder, except in the
above case the remainder is 0 sized and things go sideways.

Fixes: 810507fe6f ("locking/lockdep: Reuse freed chain_hlocks entries")
Reported-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Kent Overstreet <kent.overstreet@linux.dev>
Link: https://lkml.kernel.org/r/20231121114126.GH8262@noisy.programming.kicks-ass.net
2023-11-24 11:04:54 +01:00
Andrii Nakryiko 2afae08c9d bpf: Validate global subprogs lazily
Slightly change BPF verifier logic around eagerness and order of global
subprog validation. Instead of going over every global subprog eagerly
and validating it before main (entry) BPF program is verified, turn it
around. Validate main program first, mark subprogs that were called from
main program for later verification, but otherwise assume it is valid.
Afterwards, go over marked global subprogs and validate those,
potentially marking some more global functions as being called. Continue
this process until all (transitively) callable global subprogs are
validated. It's a BFS traversal at its heart and will always converge.

This is an important change because it allows to feature-gate some
subprograms that might not be verifiable on some older kernel, depending
on supported set of features.

E.g., at some point, global functions were allowed to accept a pointer
to memory, which size is identified by user-provided type.
Unfortunately, older kernels don't support this feature. With BPF CO-RE
approach, the natural way would be to still compile BPF object file once
and guard calls to this global subprog with some CO-RE check or using
.rodata variables. That's what people do to guard usage of new helpers
or kfuncs, and any other new BPF-side feature that might be missing on
old kernels.

That's currently impossible to do with global subprogs, unfortunately,
because they are eagerly and unconditionally validated. This patch set
aims to change this, so that in the future when global funcs gain new
features, those can be guarded using BPF CO-RE techniques in the same
fashion as any other new kernel feature.

Two selftests had to be adjusted in sync with these changes.

test_global_func12 relied on eager global subprog validation failing
before main program failure is detected (unknown return value). Fix by
making sure that main program is always valid.

verifier_subprog_precision's parent_stack_slot_precise subtest relied on
verifier checkpointing heuristic to do a checkpoint at instruction #5,
but that's no longer true because we don't have enough jumps validated
before reaching insn #5 due to global subprogs being validated later.

Other than that, no changes, as one would expect.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20231124035937.403208-3-andrii@kernel.org
2023-11-24 10:40:06 +01:00
Andrii Nakryiko 491dd8edec bpf: Emit global subprog name in verifier logs
We have the name, instead of emitting just func#N to identify global
subprog, augment verifier log messages with actual function name to make
it more user-friendly.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20231124035937.403208-2-andrii@kernel.org
2023-11-24 10:40:06 +01:00
Jakub Kicinski 45c226dde7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

drivers/net/ethernet/intel/ice/ice_main.c
  c9663f79cd ("ice: adjust switchdev rebuild path")
  7758017911 ("ice: restore timestamp configuration after device reset")
https://lore.kernel.org/all/20231121211259.3348630-1-anthony.l.nguyen@intel.com/

Adjacent changes:

kernel/bpf/verifier.c
  bb124da69c ("bpf: keep track of max number of bpf_loop callback iterations")
  5f99f312bd ("bpf: add register bounds sanity checks and sanitization")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-11-23 12:20:58 -08:00
Linus Torvalds d3fa86b1a7 Including fixes from bpf.
Current release - regressions:
 
  - Revert "net: r8169: Disable multicast filter for RTL8168H
    and RTL8107E"
 
  - kselftest: rtnetlink: fix ip route command typo
 
 Current release - new code bugs:
 
  - s390/ism: make sure ism driver implies smc protocol in kconfig
 
  - two build fixes for tools/net
 
 Previous releases - regressions:
 
  - rxrpc: couple of ACK/PING/RTT handling fixes
 
 Previous releases - always broken:
 
  - bpf: verify bpf_loop() callbacks as if they are called unknown
    number of times
 
  - improve stability of auto-bonding with Hyper-V
 
  - account BPF-neigh-redirected traffic in interface statistics
 
 Misc:
 
  - net: fill in some more MODULE_DESCRIPTION()s
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmVfiBoACgkQMUZtbf5S
 IrukFhAAiY5XyqiVyEBsm10AgYSpl0BbnxywfK27nR2SbxSTvSxyuXseV2EyEynU
 iNn6CksHe2rG1/DXbKoQohsIYey/YjY5c6omT5JzuxIT2h69J4iYKMIj/Ptk5joZ
 MQsDK5J9aCvxXBazYF2CuOCgVcwmqNFbCHf1FAFhk0RuqI3RoC5/OGbLM0tmTMQw
 BceNUHBn8iPcSkRbnntwLLHVxMrX9bay6F+pcy5/b40VWBATM3uBkw/2zBqPZ+n1
 Z0SNWvLtnO6T66Y07vaE1sRiqN4KHtS4WWelVYcmYX2rY1HkXx/TUjvrJ7R/uQQQ
 /5yUB6G294NmFv/2X+Yjt5AB8PjnFzjm/BqCBrjXcnnMPOiB0YZg554s59RhRrSr
 cmZ4bveUgCQV/jJWMxwGYvZSAqtle8uN+8DhxdjbCpVJocbrseDHKyWJ6bOy85BN
 zbJuUOUeFDs53nhV+Z9fiuUFuxhIwDCKHHFmEp7R5VotX0ZURiDnqlj9WEIxKZrC
 UidWRXv/VP+bV4BB2GVIFncEWMuhrnWOQY8DR6eC33uq7JkwTZD3R8IGR8up/+tm
 CtVyPvefAYZB8/IVU/mOSVrx04ERupNVvBkXzOMQe7UqRq3okPsQFPW8HmSrmnQG
 KrJWyBIqG3jnJCuqoXwlt0rKP3MmgCjowhTbZ3uDjeVf9UJTu2U=
 =2sG4
 -----END PGP SIGNATURE-----

Merge tag 'net-6.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from bpf.

  Current release - regressions:

   - Revert "net: r8169: Disable multicast filter for RTL8168H and
     RTL8107E"

   - kselftest: rtnetlink: fix ip route command typo

  Current release - new code bugs:

   - s390/ism: make sure ism driver implies smc protocol in kconfig

   - two build fixes for tools/net

  Previous releases - regressions:

   - rxrpc: couple of ACK/PING/RTT handling fixes

  Previous releases - always broken:

   - bpf: verify bpf_loop() callbacks as if they are called unknown
     number of times

   - improve stability of auto-bonding with Hyper-V

   - account BPF-neigh-redirected traffic in interface statistics

  Misc:

   - net: fill in some more MODULE_DESCRIPTION()s"

* tag 'net-6.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (58 commits)
  tools: ynl: fix duplicate op name in devlink
  tools: ynl: fix header path for nfsd
  net: ipa: fix one GSI register field width
  tls: fix NULL deref on tls_sw_splice_eof() with empty record
  net: axienet: Fix check for partial TX checksum
  vsock/test: fix SEQPACKET message bounds test
  i40e: Fix adding unsupported cloud filters
  ice: restore timestamp configuration after device reset
  ice: unify logic for programming PFINT_TSYN_MSK
  ice: remove ptp_tx ring parameter flag
  amd-xgbe: propagate the correct speed and duplex status
  amd-xgbe: handle the corner-case during tx completion
  amd-xgbe: handle corner-case during sfp hotplug
  net: veth: fix ethtool stats reporting
  octeontx2-pf: Fix ntuple rule creation to direct packet to VF with higher Rx queue than its PF
  net: usb: qmi_wwan: claim interface 4 for ZTE MF290
  Revert "net: r8169: Disable multicast filter for RTL8168H and RTL8107E"
  net/smc: avoid data corruption caused by decline
  nfc: virtual_ncidev: Add variable to check if ndev is running
  dpll: Fix potential msg memleak when genlmsg_put_reply failed
  ...
2023-11-23 10:40:13 -08:00
Jakub Kicinski 53475287da bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZV0kjgAKCRDbK58LschI
 gy0EAP9XwncW2OhO72DpITluFzvWPgB0N97OANKBXjzKJrRAlQD/aUe9nlvBQuad
 WsbMKLeC4wvI2X/4PEIR4ukbuZ3ypAA=
 =LMVg
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2023-11-21

We've added 85 non-merge commits during the last 12 day(s) which contain
a total of 63 files changed, 4464 insertions(+), 1484 deletions(-).

The main changes are:

1) Huge batch of verifier changes to improve BPF register bounds logic
   and range support along with a large test suite, and verifier log
   improvements, all from Andrii Nakryiko.

2) Add a new kfunc which acquires the associated cgroup of a task within
   a specific cgroup v1 hierarchy where the latter is identified by its id,
   from Yafang Shao.

3) Extend verifier to allow bpf_refcount_acquire() of a map value field
   obtained via direct load which is a use-case needed in sched_ext,
   from Dave Marchevsky.

4) Fix bpf_get_task_stack() helper to add the correct crosstask check
   for the get_perf_callchain(), from Jordan Rome.

5) Fix BPF task_iter internals where lockless usage of next_thread()
   was wrong. The rework also simplifies the code, from Oleg Nesterov.

6) Fix uninitialized tail padding via LIBBPF_OPTS_RESET, and another
   fix for certain BPF UAPI structs to fix verifier failures seen
   in bpf_dynptr usage, from Yonghong Song.

7) Add BPF selftest fixes for map_percpu_stats flakes due to per-CPU BPF
   memory allocator not being able to allocate per-CPU pointer successfully,
   from Hou Tao.

8) Add prep work around dynptr and string handling for kfuncs which
   is later going to be used by file verification via BPF LSM and fsverity,
   from Song Liu.

9) Improve BPF selftests to update multiple prog_tests to use ASSERT_*
   macros, from Yuran Pereira.

10) Optimize LPM trie lookup to check prefixlen before walking the trie,
    from Florian Lehner.

11) Consolidate virtio/9p configs from BPF selftests in config.vm file
    given they are needed consistently across archs, from Manu Bretelle.

12) Small BPF verifier refactor to remove register_is_const(),
    from Shung-Hsi Yu.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (85 commits)
  selftests/bpf: Replaces the usage of CHECK calls for ASSERTs in vmlinux
  selftests/bpf: Replaces the usage of CHECK calls for ASSERTs in bpf_obj_id
  selftests/bpf: Replaces the usage of CHECK calls for ASSERTs in bind_perm
  selftests/bpf: Replaces the usage of CHECK calls for ASSERTs in bpf_tcp_ca
  selftests/bpf: reduce verboseness of reg_bounds selftest logs
  bpf: bpf_iter_task_next: use next_task(kit->task) rather than next_task(kit->pos)
  bpf: bpf_iter_task_next: use __next_thread() rather than next_thread()
  bpf: task_group_seq_get_next: use __next_thread() rather than next_thread()
  bpf: emit frameno for PTR_TO_STACK regs if it differs from current one
  bpf: smarter verifier log number printing logic
  bpf: omit default off=0 and imm=0 in register state log
  bpf: emit map name in register state if applicable and available
  bpf: print spilled register state in stack slot
  bpf: extract register state printing
  bpf: move verifier state printing code to kernel/bpf/log.c
  bpf: move verbose_linfo() into kernel/bpf/log.c
  bpf: rename BPF_F_TEST_SANITY_STRICT to BPF_F_TEST_REG_INVARIANTS
  bpf: Remove test for MOVSX32 with offset=32
  selftests/bpf: add iter test requiring range x range logic
  veristat: add ability to set BPF_F_TEST_SANITY_STRICT flag with -r flag
  ...
====================

Link: https://lore.kernel.org/r/20231122000500.28126-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-11-21 17:53:20 -08:00
Eduard Zingerman bb124da69c bpf: keep track of max number of bpf_loop callback iterations
In some cases verifier can't infer convergence of the bpf_loop()
iteration. E.g. for the following program:

    static int cb(__u32 idx, struct num_context* ctx)
    {
        ctx->i++;
        return 0;
    }

    SEC("?raw_tp")
    int prog(void *_)
    {
        struct num_context ctx = { .i = 0 };
        __u8 choice_arr[2] = { 0, 1 };

        bpf_loop(2, cb, &ctx, 0);
        return choice_arr[ctx.i];
    }

Each 'cb' simulation would eventually return to 'prog' and reach
'return choice_arr[ctx.i]' statement. At which point ctx.i would be
marked precise, thus forcing verifier to track multitude of separate
states with {.i=0}, {.i=1}, ... at bpf_loop() callback entry.

This commit allows "brute force" handling for such cases by limiting
number of callback body simulations using 'umax' value of the first
bpf_loop() parameter.

For this, extend bpf_func_state with 'callback_depth' field.
Increment this field when callback visiting state is pushed to states
traversal stack. For frame #N it's 'callback_depth' field counts how
many times callback with frame depth N+1 had been executed.
Use bpf_func_state specifically to allow independent tracking of
callback depths when multiple nested bpf_loop() calls are present.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231121020701.26440-11-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-20 18:36:40 -08:00
Eduard Zingerman cafe2c2150 bpf: widening for callback iterators
Callbacks are similar to open coded iterators, so add imprecise
widening logic for callback body processing. This makes callback based
loops behave identically to open coded iterators, e.g. allowing to
verify programs like below:

  struct ctx { u32 i; };
  int cb(u32 idx, struct ctx* ctx)
  {
          ++ctx->i;
          return 0;
  }
  ...
  struct ctx ctx = { .i = 0 };
  bpf_loop(100, cb, &ctx, 0);
  ...

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231121020701.26440-9-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-20 18:36:40 -08:00
Eduard Zingerman ab5cfac139 bpf: verify callbacks as if they are called unknown number of times
Prior to this patch callbacks were handled as regular function calls,
execution of callback body was modeled exactly once.
This patch updates callbacks handling logic as follows:
- introduces a function push_callback_call() that schedules callback
  body verification in env->head stack;
- updates prepare_func_exit() to reschedule callback body verification
  upon BPF_EXIT;
- as calls to bpf_*_iter_next(), calls to callback invoking functions
  are marked as checkpoints;
- is_state_visited() is updated to stop callback based iteration when
  some identical parent state is found.

Paths with callback function invoked zero times are now verified first,
which leads to necessity to modify some selftests:
- the following negative tests required adding release/unlock/drop
  calls to avoid previously masked unrelated error reports:
  - cb_refs.c:underflow_prog
  - exceptions_fail.c:reject_rbtree_add_throw
  - exceptions_fail.c:reject_with_cp_reference
- the following precision tracking selftests needed change in expected
  log trace:
  - verifier_subprog_precision.c:callback_result_precise
    (note: r0 precision is no longer propagated inside callback and
           I think this is a correct behavior)
  - verifier_subprog_precision.c:parent_callee_saved_reg_precise_with_callback
  - verifier_subprog_precision.c:parent_stack_slot_precise_with_callback

Reported-by: Andrew Werner <awerner32@gmail.com>
Closes: https://lore.kernel.org/bpf/CA+vRuzPChFNXmouzGG+wsy=6eMcfr1mFG0F3g7rbg-sedGKW3w@mail.gmail.com/
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231121020701.26440-7-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-20 18:35:44 -08:00
Eduard Zingerman 58124a98cb bpf: extract setup_func_entry() utility function
Move code for simulated stack frame creation to a separate utility
function. This function would be used in the follow-up change for
callbacks handling.

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231121020701.26440-6-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-20 18:33:35 -08:00
Eduard Zingerman 683b96f960 bpf: extract __check_reg_arg() utility function
Split check_reg_arg() into two utility functions:
- check_reg_arg() operating on registers from current verifier state;
- __check_reg_arg() operating on a specific set of registers passed as
  a parameter;

The __check_reg_arg() function would be used by a follow-up change for
callbacks handling.

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231121020701.26440-5-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-20 18:33:35 -08:00
Linus Torvalds b0014556a2 - Do the push of pending hrtimers away from a CPU which is being
offlined earlier in the offlining process in order to prevent
   a deadlock
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmVaB28ACgkQEsHwGGHe
 VUr3ZBAAwOLL5vimHB3Y59cTRLPN+zGKhzyVLMLnbkKs4sGJ+9srP4HLX4Q9PoAb
 kR9Hzq90+48YuyLe+S/R2pvm1x88K33spS+4w4fl3x6EeToqvUlop2GPuMS2yzXY
 yECdqCLEd3Q6DeI8hN35lv899qyfGSD+6WxezLCT+uwx6AMHljMAsDy2249UtMZv
 1bqZnYCtN2zv3MQuV1uli/AVxTDv4vXcumza17inuw0IpEA26Wz2kWruxeyZnUXU
 /sWZudUdhiErfg428ok3oTL1BOwPzyiIWjhN2MzqlKFmyp463DwV7KoAc3SxYINE
 8qbODN93CFdnU6h29+VQoRxO9vcmWL6w7A/Swc9ar/0/Qnt7H9JdzUKtJ4+EaTCY
 /IpRWcNcX4WI6BKuHHl6kOBvX3YW77PKaIsxj8JDNZTMk6rq6lMGi+tIaVsAki92
 3MQZ9+Lkm0baykIZAWz4jajbA98KvJMeJ60qZQI6sWWdpyrncEqG9pH/ulkLY4aZ
 gT94LiRpdwT0LWrX0J6xPMTNi9NYWjdB/uyo6Drer42SB9J7ol4rAbOxs50srG8i
 z46VGDtgWz6C5MSkonhQqrpGzc/HF9xCWVVSF1UENT4K+2W55JhJrDZBs5XCPJiz
 Bj8T3Maz7wcVkA41DA7C5xlVed+ST1ID8/4y5cWImnrmWOdG5Zw=
 =Tekh
 -----END PGP SIGNATURE-----

Merge tag 'timers_urgent_for_v6.7_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fix from Borislav Petkov:

 - Do the push of pending hrtimers away from a CPU which is being
   offlined earlier in the offlining process in order to prevent a
   deadlock

* tag 'timers_urgent_for_v6.7_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  hrtimers: Push pending hrtimers away from outgoing CPU earlier
2023-11-19 13:35:07 -08:00
Linus Torvalds 2a0adc4954 - Fix virtual runtime calculation when recomputing a sched entity's
weights
 
 - Fix wrongly rejected unprivileged poll requests to the cgroup psi
   pressure files
 
 - Make sure the load balancing is done by only one CPU
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmVaBe0ACgkQEsHwGGHe
 VUqX4hAAmrlp7bcloMRto6j4yC+pjDIQlFym7opa7kaEPeY3icOydfpSGEDnEwbv
 HxOOmveb2sC8DBE+Rkum4bHb2I46SD/5LlM/MZHvSguEGNgAJEYCcPfGZJDgGlW1
 MgALG78ThA/mVKr5i3/Q1U6U71+vuNHJOpCY1s4o+jgF/sG3AYIdK1sqaVI++ygz
 q0WK31jGo+YelPpNDKnXpVGIuOcUlh9v/Hu2zGBBJD9pf4kfTelseiV7rc+rk0yI
 YHSKpw2jCnuJaGS748Q4aIG+8kLRBz+HqUKDWQPlq3pRWjJWTBbH+i8TZef7keZQ
 gAk/uJpdQ9z4Z7suwY3vcEBVRo4e6AoD99XDG1eUX07C+f1d9p54EVDkgFIZMIle
 pT2yd5GT/zl0UfcZ8B96y2lJHoa6pHnv83uZKtRZhBMiN5F4iN88lhQFVpZDoCBg
 xZ+NPfpXcZxm4HpKFjfsGyxQJxIkC6NDdf6Rfhtc3sV1rx4AT1Qii4fDnBHOkaBs
 iFgpFOCeb+K9UUXB0ONJ5PWZVnc8OGPtm/22TwtZ9rBzVqrmtVJb+VDg2YWpwFwU
 xhy0hMWxwZFsn0VjjsBbgfm1/+WGjCKjbPa1SvS3oH3+H9EbWiBjxe1zwkS46PUf
 HjC0RCMPxfnYG4+h9JHEaFioGvUqQQ6Ub3K8epd8MPUtD9DCnro=
 =hJzS
 -----END PGP SIGNATURE-----

Merge tag 'sched_urgent_for_v6.7_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Borislav Petkov:

 - Fix virtual runtime calculation when recomputing a sched entity's
   weights

 - Fix wrongly rejected unprivileged poll requests to the cgroup psi
   pressure files

 - Make sure the load balancing is done by only one CPU

* tag 'sched_urgent_for_v6.7_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Fix the decision for load balance
  sched: psi: fix unprivileged polling against cgroups
  sched/eevdf: Fix vruntime adjustment on reweight
2023-11-19 13:32:00 -08:00
Linus Torvalds 2f84f8232e - Fix a hardcoded futex flags case which lead to one robust futex test
failure
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmVaA2sACgkQEsHwGGHe
 VUoOdBAAmvDdbMNVi0p33kqLhSQLwzxsqkrGyNkAfSbuuaGNsH8mQ87VA0dMQYpe
 bXzJzvoccHYxJYnFyExv0d7PtN3xquh2q32D1pL6gzaA974oEmQiyQag9++gkJGh
 +/NYQwo0Y2ucEsvgeMN+knE0q0OdelUAiKNPF9nE9LG0d9TLFC45jwLH+9pa5jAF
 jtLBxrexeU49UBBDnoPR2CNrDi9TlNYRas2V5xlQnUXl5kZlVNcQLMo1Rcd7+dTF
 6I414ZVXiS6u02Vs7wcrKC50BdBIa4h2WaOX+Nb2j9ibJ5uY14B1nwewAztmaQY7
 szpaI2EtSMk0+Ap0QHTaxZvi/UREWed5n0AykqTy97f0vsvkK9zCiPk3LMJsoupu
 vfEApclAIMzDi6qnB/zGhHkHLMBHsiXrANGCe6SbjphD9ic0ClKwAyqJ9kpB43JE
 pnqdrTcrYLuTCV+fE9r/WfXt5Z09xmlF+usmOS4T7y35gzrl4+BPVzu2V80SlZSj
 CtDSvMG7z7LLK5o8XsvQk1VlAYCXEPfOldkoRaisD82yKw0r38YqXf+cigE4noyq
 55ChMwNmlqtetvPNK/6SsPtj8F/502Lqo/xAJjSRo/vO1KYpNa3sfXUZpZ5J+xuc
 zVGXzcBGsNgteVin2I0jhdOvRd7apA7rKiXd0duTtiSj2N++b5U=
 =T4AK
 -----END PGP SIGNATURE-----

Merge tag 'locking_urgent_for_v6.7_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking fix from Borislav Petkov:

 - Fix a hardcoded futex flags case which lead to one robust futex test
   failure

* tag 'locking_urgent_for_v6.7_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  futex: Fix hardcoded flags
2023-11-19 13:30:21 -08:00
Linus Torvalds c8b3443cbd - Make sure the context refcount is transferred too when migrating perf
events
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmVaAW4ACgkQEsHwGGHe
 VUq4bQ/5ATpnsIYxn/rjshZa2hfDrOfS9FRegfMCZyz7w/HDmRlwjIbXbQE7ttS7
 vU55QgftybwS38N//LRzx9etGIhbcmeNgNibskslxsuVFCxBJLUVT3UJ9T9AtWrM
 v9i0W+D8SuJsIKoKo77K+OkTvKHZgpvxyABQivyHMjeD2zmPHUslJ+OTWYNooM7g
 kPmTu/M6sJme/q/PgtezOVKhV3tdaHB2oBKUEv9y4nVBNNVRnX/Fo+Zd/FEgs45F
 JhxnvwqP5cU7mSmn9itcNRKY8nW5lv2OdYNtgyqWbpYY39JPZxy/EKF4GUPv1Jqb
 d/eQov0B8K844j982bBa0lgecmiGYq7DNevVHlUB/JowQpNsfxVJAVphVnWZr5iy
 hzZQU5EfvmkRE/ja/g1wUn15mdxvAQ3QRmOGBJwCQi2QLC9zCiki4esnGc4U9Zp7
 ZVQ22PVliuBnLRloOSHsp1G65i3K+VPlKyBOkp0YmEVz9EHahOceBPa39WkGCHOh
 1EkhPIzemTYUzAE5KuzbLp7C6KzMvaP8Zc9fOLy9SEesb2HjwoIQjkRKmJ38rbSu
 KEY5SbdxIkVCsks4kajOVSysS89rafbH7ykd/Di3cXb1rOJtRSyWghnaap1trghd
 uf0EAVb9y8er2xvdOHHwHIdNILK3Y1F2TcOg+jAyywudrfmnIls=
 =lbBn
 -----END PGP SIGNATURE-----

Merge tag 'perf_urgent_for_v6.7_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf fix from Borislav Petkov:

 - Make sure the context refcount is transferred too when migrating perf
   events

* tag 'perf_urgent_for_v6.7_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/core: Fix cpuctx refcounting
2023-11-19 13:26:42 -08:00
Oleg Nesterov ac8148d957 bpf: bpf_iter_task_next: use next_task(kit->task) rather than next_task(kit->pos)
This looks more clear and simplifies the code. While at it, remove the
unnecessary initialization of pos/task at the start of bpf_iter_task_new().

Note that we can even kill kit->task, we can just use pos->group_leader,
but I don't understand the BUILD_BUG_ON() checks in bpf_iter_task_new().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231114163239.GA903@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-19 11:43:44 -08:00
Oleg Nesterov 5a34f9dabd bpf: bpf_iter_task_next: use __next_thread() rather than next_thread()
Lockless use of next_thread() should be avoided, kernel/bpf/task_iter.c
is the last user and the usage is wrong.

bpf_iter_task_next() can loop forever, "kit->pos == kit->task" can never
happen if kit->pos execs. Change this code to use __next_thread().

With or without this change the usage of kit->pos/task and next_task()
doesn't look nice, see the next patch.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231114163237.GA897@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-19 11:43:44 -08:00
Oleg Nesterov 2d1618054f bpf: task_group_seq_get_next: use __next_thread() rather than next_thread()
Lockless use of next_thread() should be avoided, kernel/bpf/task_iter.c
is the last user and the usage is wrong.

task_group_seq_get_next() can return the group leader twice if it races
with mt-thread exec which changes the group->leader's pid.

Change the main loop to use __next_thread(), kill "next_tid == common->pid"
check.

__next_thread() can't loop forever, we can also change this code to retry
if next_tid == 0.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231114163234.GA890@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-19 11:43:44 -08:00
Linus Torvalds 2254005ef1 parisc architecture fixes for kernel v6.7-rc2:
- Fix power soft-off on qemu
 - Disable prctl(PR_SET_MDWE) since parisc sometimes still needs
   writeable stacks
 - Use strscpy instead of strlcpy in show_cpuinfo()
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQS86RI+GtKfB8BJu973ErUQojoPXwUCZVkHjgAKCRD3ErUQojoP
 X196AP9I9w/4Go3HfvFNgEGUpVSbQq8679um13mlMdlFC6z3NAD+J32vmvU1keL1
 0f4C7IltOr2ntU4QIXJUCLAPWO7NWgQ=
 =r7N6
 -----END PGP SIGNATURE-----

Merge tag 'parisc-for-6.7-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux

Pull parisc fixes from Helge Deller:
 "On parisc we still sometimes need writeable stacks, e.g. if programs
  aren't compiled with gcc-14. To avoid issues with the upcoming
  systemd-254 we therefore have to disable prctl(PR_SET_MDWE) for now
  (for parisc only).

  The other two patches are minor: a bugfix for the soft power-off on
  qemu with 64-bit kernel and prefer strscpy() over strlcpy():

   - Fix power soft-off on qemu

   - Disable prctl(PR_SET_MDWE) since parisc sometimes still needs
     writeable stacks

   - Use strscpy instead of strlcpy in show_cpuinfo()"

* tag 'parisc-for-6.7-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
  prctl: Disable prctl(PR_SET_MDWE) on parisc
  parisc/power: Fix power soft-off when running on qemu
  parisc: Replace strlcpy() with strscpy()
2023-11-18 15:13:10 -08:00
Andrii Nakryiko 46862ee854 bpf: emit frameno for PTR_TO_STACK regs if it differs from current one
It's possible to pass a pointer to parent's stack to child subprogs. In
such case verifier state output is ambiguous not showing whether
register container a pointer to "current" stack, belonging to current
subprog (frame), or it's actually a pointer to one of parent frames.

So emit this information if frame number differs between the state which
register is part of. E.g., if current state is in frame 2 and it has
a register pointing to stack in grand parent state (frame #0), we'll see
something like 'R1=fp[0]-16', while "local stack pointer" will be just
'R2=fp-16'.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231118034623.3320920-9-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-18 11:39:59 -08:00
Andrii Nakryiko 0f8dbdbc64 bpf: smarter verifier log number printing logic
Instead of always printing numbers as either decimals (and in some
cases, like for "imm=%llx", in hexadecimals), decide the form based on
actual values. For numbers in a reasonably small range (currently,
[0, U16_MAX] for unsigned values, and [S16_MIN, S16_MAX] for signed ones),
emit them as decimals. In all other cases, even for signed values,
emit them in hexadecimals.

For large values hex form is often times way more useful: it's easier to
see an exact difference between 0xffffffff80000000 and 0xffffffff7fffffff,
than between 18446744071562067966 and 18446744071562067967, as one
particular example.

Small values representing small pointer offsets or application
constants, on the other hand, are way more useful to be represented in
decimal notation.

Adjust reg_bounds register state parsing logic to take into account this
change.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231118034623.3320920-8-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-18 11:39:59 -08:00
Andrii Nakryiko 1db747d75b bpf: omit default off=0 and imm=0 in register state log
Simplify BPF verifier log further by omitting default (and frequently
irrelevant) off=0 and imm=0 parts for non-SCALAR_VALUE registers. As can
be seen from fixed tests, this is often a visual noise for PTR_TO_CTX
register and even for PTR_TO_PACKET registers.

Omitting default values follows the rest of register state logic: we
omit default values to keep verifier log succinct and to highlight
interesting state that deviates from default one. E.g., we do the same
for var_off, when it's unknown, which gives no additional information.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231118034623.3320920-7-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-18 11:39:59 -08:00
Andrii Nakryiko 0c95c9fdb6 bpf: emit map name in register state if applicable and available
In complicated real-world applications, whenever debugging some
verification error through verifier log, it often would be very useful
to see map name for PTR_TO_MAP_VALUE register. Usually this needs to be
inferred from key/value sizes and maybe trying to guess C code location,
but it's not always clear.

Given verifier has the name, and it's never too long, let's just emit it
for ptr_to_map_key, ptr_to_map_value, and const_ptr_to_map registers. We
reshuffle the order a bit, so that map name, key size, and value size
appear before offset and immediate values, which seems like a more
logical order.

Current output:

  R1_w=map_ptr(map=array_map,ks=4,vs=8,off=0,imm=0)

But we'll get rid of useless off=0 and imm=0 parts in the next patch.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231118034623.3320920-6-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-18 11:39:59 -08:00
Andrii Nakryiko 67d43dfbb4 bpf: print spilled register state in stack slot
Print the same register state representation when printing stack state,
as we do for normal registers. Note that if stack slot contains
subregister spill (1, 2, or 4 byte long), we'll still emit "m0?" mask
for those bytes that are not part of spilled register.

While means we can get something like fp-8=0000scalar() for a 4-byte
spill with other 4 bytes still being STACK_ZERO.

Some example before and after, taken from the log of
pyperf_subprogs.bpf.o:

49: (7b) *(u64 *)(r10 -256) = r1      ; frame1: R1_w=ctx(off=0,imm=0) R10=fp0 fp-256_w=ctx
49: (7b) *(u64 *)(r10 -256) = r1      ; frame1: R1_w=ctx(off=0,imm=0) R10=fp0 fp-256_w=ctx(off=0,imm=0)

150: (7b) *(u64 *)(r10 -264) = r0     ; frame1: R0_w=map_value_or_null(id=6,off=0,ks=192,vs=4,imm=0) R10=fp0 fp-264_w=map_value_or_null
150: (7b) *(u64 *)(r10 -264) = r0     ; frame1: R0_w=map_value_or_null(id=6,off=0,ks=192,vs=4,imm=0) R10=fp0 fp-264_w=map_value_or_null(id=6,off=0,ks=192,vs=4,imm=0)

5192: (61) r1 = *(u32 *)(r10 -272)    ; frame1: R1_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=15,var_off=(0x0; 0xf)) R10=fp0 fp-272=
5192: (61) r1 = *(u32 *)(r10 -272)    ; frame1: R1_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=15,var_off=(0x0; 0xf)) R10=fp0 fp-272=????scalar(smin=smin32=0,smax=umax=smax32=umax32=15,var_off=(0x0; 0xf))

While at it, do a few other simple clean ups:
  - skip slot if it's not scratched before detecting whether it's valid;
  - move taking spilled_reg pointer outside of switch (only DYNPTR has
    to adjust that to get to the "main" slot);
  - don't recalculate types_buf second time for MISC/ZERO/default case.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231118034623.3320920-5-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-18 11:39:59 -08:00
Andrii Nakryiko 009f5465be bpf: extract register state printing
Extract printing register state representation logic into a separate
helper, as we are going to reuse it for spilled register state printing
in the next patch. This also nicely reduces code nestedness.

No functional changes.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231118034623.3320920-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-18 11:39:59 -08:00
Andrii Nakryiko 42feb6620a bpf: move verifier state printing code to kernel/bpf/log.c
Move a good chunk of code from verifier.c to log.c: verifier state
verbose printing logic. This is an important and very much
logging/debugging oriented code. It fits the overlall log.c's focus on
verifier logging, and moving it allows to keep growing it without
unnecessarily adding to verifier.c code that otherwise contains a core
verification logic.

There are not many shared dependencies between this code and the rest of
verifier.c code, except a few single-line helpers for various register
type checks and a bit of state "scratching" helpers. We move all such
trivial helpers into include/bpf/bpf_verifier.h as static inlines.

No functional changes in this patch.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231118034623.3320920-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-18 11:39:59 -08:00
Andrii Nakryiko db840d389b bpf: move verbose_linfo() into kernel/bpf/log.c
verifier.c is huge. Let's try to move out parts that are logging-related
into log.c, as we previously did with bpf_log() and other related stuff.
This patch moves line info verbose output routines: it's pretty
self-contained and isolated code, so there is no problem with this.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231118034623.3320920-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-18 11:39:58 -08:00
Helge Deller 793838138c prctl: Disable prctl(PR_SET_MDWE) on parisc
systemd-254 tries to use prctl(PR_SET_MDWE) for it's MemoryDenyWriteExecute
functionality, but fails on parisc which still needs executable stacks in
certain combinations of gcc/glibc/kernel.

Disable prctl(PR_SET_MDWE) by returning -EINVAL for now on parisc, until
userspace has catched up.

Signed-off-by: Helge Deller <deller@gmx.de>
Co-developed-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Sam James <sam@gentoo.org>
Closes: https://github.com/systemd/systemd/issues/29775
Tested-by: Sam James <sam@gentoo.org>
Link: https://lore.kernel.org/all/875y2jro9a.fsf@gentoo.org/
Cc: <stable@vger.kernel.org> # v6.3+
2023-11-18 19:35:31 +01:00
Andrii Nakryiko ff8867af01 bpf: rename BPF_F_TEST_SANITY_STRICT to BPF_F_TEST_REG_INVARIANTS
Rename verifier internal flag BPF_F_TEST_SANITY_STRICT to more neutral
BPF_F_TEST_REG_INVARIANTS. This is a follow up to [0].

A few selftests and veristat need to be adjusted in the same patch as
well.

  [0] https://patchwork.kernel.org/project/netdevbpf/patch/20231112010609.848406-5-andrii@kernel.org/

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231117171404.225508-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-17 10:30:02 -08:00
Linus Torvalds bf786e2a78 audit/stable-6.7 PR 20231116
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmVWX8cUHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXPl8hAA2D4DAmbnM4wLGk5FX1ruFpACmabx
 7iNPonV7loDiGZInvlTgvQxTQ6hafvs6aFqu69ZplLuCaBLiSn6U3J/bOXneQxzn
 nRjLQEfJLcSmTd39M82QxpaihCtVltDRT4jPfq4AGN+6nV0TB4KyFjrIvOw7udfX
 fJF096Lt9rqxbYyKk2Lgy8LZZdVqFN9pbstpH7Vas8LOi4bnvogRljhFA3vipn45
 0tzMrFR9b/myOPFm1ktvAUSUdWIzNGmxsYkrxHkQ2TemhuFEiNl3n86juWzeXCzN
 wjaGPLIUqJQW+C+kXRmEZo/SytiqKS5Wo97mMVDPKpYFwp6IbgjSg01LPNdmLoVY
 2i1jxOFTDnANLZgXa31kjzTO2Ceu61GFVqLZGuOh2lB7rjj3+JAkL0U/YLQWWBMO
 RG8MbmQnHOGZlHdqiPRKJKo/qHPW7vBkgSPJ/K0tRNXMFoZtGAfcHjxJJQNystPU
 BoRd2Tdw0jMrrS5cLNXfkxhHKwNHGFny4TRyqOJo9G7/jK56JWU+3ZXNoWH9OKFJ
 Ln2wH7NT16CLMnb/kZ2CSh8UQXIJpkBL1OuG6IrOQuBoNun7AzGnmXW7vqywV5bo
 dqOgxtkBYhrfjUEXmRzEii2oOoc/esr1vZYnmj5K4RpWksIXTJ4BZjC9kH/fHEpg
 2ZO03UwyZ7ZiE2U=
 =00KW
 -----END PGP SIGNATURE-----

Merge tag 'audit-pr-20231116' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit

Pull audit fix from Paul Moore:
 "One small audit patch to convert a WARN_ON_ONCE() into a normal
  conditional to avoid scary looking console warnings when eBPF code
  generates audit records from unexpected places"

* tag 'audit-pr-20231116' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
  audit: don't WARN_ON_ONCE(!current->mm) in audit_exe_compare()
2023-11-17 08:42:05 -05:00
Linus Torvalds 7475e51b87 Including fixes from BPF and netfilter.
Current release - regressions:
 
  - core: fix undefined behavior in netdev name allocation
 
  - bpf: do not allocate percpu memory at init stage
 
  - netfilter: nf_tables: split async and sync catchall in two functions
 
  - mptcp: fix possible NULL pointer dereference on close
 
 Current release - new code bugs:
 
  - eth: ice: dpll: fix initial lock status of dpll
 
 Previous releases - regressions:
 
  - bpf: fix precision backtracking instruction iteration
 
  - af_unix: fix use-after-free in unix_stream_read_actor()
 
  - tipc: fix kernel-infoleak due to uninitialized TLV value
 
  - eth: bonding: stop the device in bond_setup_by_slave()
 
  - eth: mlx5:
    - fix double free of encap_header
    - avoid referencing skb after free-ing in drop path
 
  - eth: hns3: fix VF reset
 
  - eth: mvneta: fix calls to page_pool_get_stats
 
 Previous releases - always broken:
 
  - core: set SOCK_RCU_FREE before inserting socket into hashtable
 
  - bpf: fix control-flow graph checking in privileged mode
 
  - eth: ppp: limit MRU to 64K
 
  - eth: stmmac: avoid rx queue overrun
 
  - eth: icssg-prueth: fix error cleanup on failing initialization
 
  - eth: hns3: fix out-of-bounds access may occur when coalesce info is
  	      read via debugfs
 
  - eth: cortina: handle large frames
 
 Misc:
 
  - selftests: gso: support CONFIG_MAX_SKB_FRAGS up to 45
 
 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmVV9akSHHBhYmVuaUBy
 ZWRoYXQuY29tAAoJECkkeY3MjxOkICMP/1+QHUaD4JG1mW9oYc2zINPfQl3dqQt3
 2CGSE2yrtbQvyQl39BDa0WFzV5X6So6/U50twhTNM+UAJsCaOvxCUDvUP9eY9Dcm
 z2H4oITZimyP4CEb3l7JpL2PImvfImL7D/fCPPMUZVzNY6dkEFznaQrnawbJz4gg
 mZXDnjwIXq7OchoJy3dHzyOn4ZQj2Df5VcfBzkVMdMcwV55Sd5JezbhwJ6NOmnKA
 uoXlq4pFYj3ahAhEQfLWUwXmF3e6esHs/WUCMe5FR9YkanJlu4oHUmY3RLzfcdQA
 PPIPDRxOzthcXyymqvqs7gnZ3ruMUll4B7tGTVFpJch8ts+DwGdUyBIIoDd/1BUT
 gmjipP5HPia3Qdtk3Jc4vMkcf5AwoGo0hXku7YYJ1K7+4+t8ep3/hDbQc0PLWX6J
 afiQgqpnNXHSTqBO5zl91vSwhGr/AAtAkDlPnsQL/RDAxY4teIwxHuoMvwPWaHZJ
 sMo5ZcHXvNnBbGhpozFtmrnbf1nduUrQmW5LkJViCLf25Sj6pDYbo8WnhMuOKSnZ
 7an2YqniCgBtrX4MEVn2jsWgavI+SxndVIQR04u0uwqmP+dn8s9LUfjKKDtPWHsK
 +zMFtk+Op03TW5ur9w3+dgrGH0cLogPO3BJkho7xXKBfZ6/tN/pOef3/nV9xY6g8
 JjnBUdpZRTWI
 =VjWw
 -----END PGP SIGNATURE-----

Merge tag 'net-6.7-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
 "Including fixes from BPF and netfilter.

  Current release - regressions:

   - core: fix undefined behavior in netdev name allocation

   - bpf: do not allocate percpu memory at init stage

   - netfilter: nf_tables: split async and sync catchall in two
     functions

   - mptcp: fix possible NULL pointer dereference on close

  Current release - new code bugs:

   - eth: ice: dpll: fix initial lock status of dpll

  Previous releases - regressions:

   - bpf: fix precision backtracking instruction iteration

   - af_unix: fix use-after-free in unix_stream_read_actor()

   - tipc: fix kernel-infoleak due to uninitialized TLV value

   - eth: bonding: stop the device in bond_setup_by_slave()

   - eth: mlx5:
      - fix double free of encap_header
      - avoid referencing skb after free-ing in drop path

   - eth: hns3: fix VF reset

   - eth: mvneta: fix calls to page_pool_get_stats

  Previous releases - always broken:

   - core: set SOCK_RCU_FREE before inserting socket into hashtable

   - bpf: fix control-flow graph checking in privileged mode

   - eth: ppp: limit MRU to 64K

   - eth: stmmac: avoid rx queue overrun

   - eth: icssg-prueth: fix error cleanup on failing initialization

   - eth: hns3: fix out-of-bounds access may occur when coalesce info is
     read via debugfs

   - eth: cortina: handle large frames

  Misc:

   - selftests: gso: support CONFIG_MAX_SKB_FRAGS up to 45"

* tag 'net-6.7-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (78 commits)
  macvlan: Don't propagate promisc change to lower dev in passthru
  net: sched: do not offload flows with a helper in act_ct
  net/mlx5e: Check return value of snprintf writing to fw_version buffer for representors
  net/mlx5e: Check return value of snprintf writing to fw_version buffer
  net/mlx5e: Reduce the size of icosq_str
  net/mlx5: Increase size of irq name buffer
  net/mlx5e: Update doorbell for port timestamping CQ before the software counter
  net/mlx5e: Track xmit submission to PTP WQ after populating metadata map
  net/mlx5e: Avoid referencing skb after free-ing in drop path of mlx5e_sq_xmit_wqe
  net/mlx5e: Don't modify the peer sent-to-vport rules for IPSec offload
  net/mlx5e: Fix pedit endianness
  net/mlx5e: fix double free of encap_header in update funcs
  net/mlx5e: fix double free of encap_header
  net/mlx5: Decouple PHC .adjtime and .adjphase implementations
  net/mlx5: DR, Allow old devices to use multi destination FTE
  net/mlx5: Free used cpus mask when an IRQ is released
  Revert "net/mlx5: DR, Supporting inline WQE when possible"
  bpf: Do not allocate percpu memory at init stage
  net: Fix undefined behavior in netdev name allocation
  dt-bindings: net: ethernet-controller: Fix formatting error
  ...
2023-11-16 07:51:26 -05:00
Andrii Nakryiko cf5fe3c71c bpf: make __reg{32,64}_deduce_bounds logic more robust
This change doesn't seem to have any effect on selftests and production
BPF object files, but we preemptively try to make it more robust.

First, "learn sign from signed bounds" comment is misleading, as we are
learning not just sign, but also values.

Second, we simplify the check for determining whether entire range is
positive or negative similarly to other checks added earlier, using
appropriate u32/u64 cast and single comparisons. As explain in comments
in __reg64_deduce_bounds(), the checks are equivalent.

Last but not least, smin/smax and s32_min/s32_max reassignment based on
min/max of both umin/umax and smin/smax (and 32-bit equivalents) is hard
to explain and justify. We are updating unsigned bounds from signed
bounds, why would we update signed bounds at the same time? This might
be correct, but it's far from obvious why and the code or comments don't
try to justify this. Given we've added a separate deduction of signed
bounds from unsigned bounds earlier, this seems at least redundant, if
not just wrong.

In short, we remove doubtful pieces, and streamline the rest to follow
the logic and approach of the rest of reg_bounds_sync() checks.

Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231112010609.848406-7-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-15 12:03:42 -08:00
Andrii Nakryiko 3cf98cf594 bpf: remove redundant s{32,64} -> u{32,64} deduction logic
Equivalent checks were recently added in more succinct and, arguably,
safer form in:
  - f188765f23a5 ("bpf: derive smin32/smax32 from umin32/umax32 bounds");
  - 2e74aef782d3 ("bpf: derive smin/smax from umin/max bounds").

The checks we are removing in this patch set do similar checks to detect
if entire u32/u64 range has signed bit set or not set, but does it with
two separate checks.

Further, we forcefully overwrite either smin or smax (and 32-bit equvalents)
without applying normal min/max intersection logic. It's not clear why
that would be correct in all cases and seems to work by accident. This
logic is also "gated" by previous signed -> unsigned derivation, which
returns early.

All this is quite confusing and seems error-prone, while we already have
at least equivalent checks happening earlier. So remove this duplicate
and error-prone logic to simplify things a bit.

Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231112010609.848406-6-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-15 12:03:42 -08:00
Andrii Nakryiko 5f99f312bd bpf: add register bounds sanity checks and sanitization
Add simple sanity checks that validate well-formed ranges (min <= max)
across u64, s64, u32, and s32 ranges. Also for cases when the value is
constant (either 64-bit or 32-bit), we validate that ranges and tnums
are in agreement.

These bounds checks are performed at the end of BPF_ALU/BPF_ALU64
operations, on conditional jumps, and for LDX instructions (where subreg
zero/sign extension is probably the most important to check). This
covers most of the interesting cases.

Also, we validate the sanity of the return register when manually
adjusting it for some special helpers.

By default, sanity violation will trigger a warning in verifier log and
resetting register bounds to "unbounded" ones. But to aid development
and debugging, BPF_F_TEST_SANITY_STRICT flag is added, which will
trigger hard failure of verification with -EFAULT on register bounds
violations. This allows selftests to catch such issues. veristat will
also gain a CLI option to enable this behavior.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Link: https://lore.kernel.org/r/20231112010609.848406-5-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-15 12:03:42 -08:00
Andrii Nakryiko be41a203bb bpf: enhance BPF_JEQ/BPF_JNE is_branch_taken logic
Use 32-bit subranges to prune some 64-bit BPF_JEQ/BPF_JNE conditions
that otherwise would be "inconclusive" (i.e., is_branch_taken() would
return -1). This can happen, for example, when registers are initialized
as 64-bit u64/s64, then compared for inequality as 32-bit subregisters,
and then followed by 64-bit equality/inequality check. That 32-bit
inequality can establish some pattern for lower 32 bits of a register
(e.g., s< 0 condition determines whether the bit #31 is zero or not),
while overall 64-bit value could be anything (according to a value range
representation).

This is not a fancy quirky special case, but actually a handling that's
necessary to prevent correctness issue with BPF verifier's range
tracking: set_range_min_max() assumes that register ranges are
non-overlapping, and if that condition is not guaranteed by
is_branch_taken() we can end up with invalid ranges, where min > max.

  [0] https://lore.kernel.org/bpf/CACkBjsY2q1_fUohD7hRmKGqv1MV=eP2f6XK8kjkYNw7BaiF8iQ@mail.gmail.com/

Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231112010609.848406-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-15 12:03:42 -08:00
Andrii Nakryiko 96381879a3 bpf: generalize is_scalar_branch_taken() logic
Generalize is_branch_taken logic for SCALAR_VALUE register to handle
cases when both registers are not constants. Previously supported
<range> vs <scalar> cases are a natural subset of more generic <range>
vs <range> set of cases.

Generalized logic relies on straightforward segment intersection checks.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Link: https://lore.kernel.org/r/20231112010609.848406-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-15 12:03:41 -08:00
Andrii Nakryiko 67420501e8 bpf: generalize reg_set_min_max() to handle non-const register comparisons
Generalize bounds adjustment logic of reg_set_min_max() to handle not
just register vs constant case, but in general any register vs any
register cases. For most of the operations it's trivial extension based
on range vs range comparison logic, we just need to properly pick
min/max of a range to compare against min/max of the other range.

For BPF_JSET we keep the original capabilities, just make sure JSET is
integrated in the common framework. This is manifested in the
internal-only BPF_JSET + BPF_X "opcode" to allow for simpler and more
uniform rev_opcode() handling. See the code for details. This allows to
reuse the same code exactly both for TRUE and FALSE branches without
explicitly handling both conditions with custom code.

Note also that now we don't need a special handling of BPF_JEQ/BPF_JNE
case none of the registers are constants. This is now just a normal
generic case handled by reg_set_min_max().

To make tnum handling cleaner, tnum_with_subreg() helper is added, as
that's a common operator when dealing with 32-bit subregister bounds.
This keeps the overall logic much less noisy when it comes to tnums.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Link: https://lore.kernel.org/r/20231112010609.848406-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-15 12:03:41 -08:00
Yonghong Song 1fda5bb66a bpf: Do not allocate percpu memory at init stage
Kirill Shutemov reported significant percpu memory consumption increase after
booting in 288-cpu VM ([1]) due to commit 41a5db8d81 ("bpf: Add support for
non-fix-size percpu mem allocation"). The percpu memory consumption is
increased from 111MB to 969MB. The number is from /proc/meminfo.

I tried to reproduce the issue with my local VM which at most supports upto
255 cpus. With 252 cpus, without the above commit, the percpu memory
consumption immediately after boot is 57MB while with the above commit the
percpu memory consumption is 231MB.

This is not good since so far percpu memory from bpf memory allocator is not
widely used yet. Let us change pre-allocation in init stage to on-demand
allocation when verifier detects there is a need of percpu memory for bpf
program. With this change, percpu memory consumption after boot can be reduced
signicantly.

  [1] https://lore.kernel.org/lkml/20231109154934.4saimljtqx625l3v@box.shutemov.name/

Fixes: 41a5db8d81 ("bpf: Add support for non-fix-size percpu mem allocation")
Reported-and-tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231111013928.948838-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-15 07:51:06 -08:00
Peter Zijlstra 889c58b315 perf/core: Fix cpuctx refcounting
Audit of the refcounting turned up that perf_pmu_migrate_context()
fails to migrate the ctx refcount.

Fixes: bd27568117 ("perf: Rewrite core context handling")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20230612093539.085862001@infradead.org
Cc: <stable@vger.kernel.org>
2023-11-15 04:18:31 +01:00
Peter Zijlstra c9bd1568d5 futex: Fix hardcoded flags
Xi reported that commit 5694289ce1 ("futex: Flag conversion") broke
glibc's robust futex tests.

This was narrowed down to the change of FLAGS_SHARED from 0x01 to
0x10, at which point Florian noted that handle_futex_death() has a
hardcoded flags argument of 1.

Change this to: FLAGS_SIZE_32 | FLAGS_SHARED, matching how
futex_to_flags() unconditionally sets FLAGS_SIZE_32 for all legacy
futex ops.

Reported-by: Xi Ruoyao <xry111@xry111.site>
Reported-by: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20231114201402.GA25315@noisy.programming.kicks-ass.net
Fixes: 5694289ce1 ("futex: Flag conversion")
Cc: <stable@vger.kernel.org>
2023-11-15 04:02:25 +01:00
Paul Moore 969d90ec21 audit: don't WARN_ON_ONCE(!current->mm) in audit_exe_compare()
eBPF can end up calling into the audit code from some odd places, and
some of these places don't have @current set properly so we end up
tripping the `WARN_ON_ONCE(!current->mm)` near the top of
`audit_exe_compare()`.  While the basic `!current->mm` check is good,
the `WARN_ON_ONCE()` results in some scary console messages so let's
drop that and just do the regular `!current->mm` check to avoid
problems.

Cc: <stable@vger.kernel.org>
Fixes: 47846d5134 ("audit: don't take task_lock() in audit_exe_compare() code path")
Reported-by: Artem Savkov <asavkov@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2023-11-14 17:34:27 -05:00
Keisuke Nishimura 6d7e4782bc sched/fair: Fix the decision for load balance
should_we_balance is called for the decision to do load-balancing.
When sched ticks invoke this function, only one CPU should return
true. However, in the current code, two CPUs can return true. The
following situation, where b means busy and i means idle, is an
example, because CPU 0 and CPU 2 return true.

        [0, 1] [2, 3]
         b  b   i  b

This fix checks if there exists an idle CPU with busy sibling(s)
after looking for a CPU on an idle core. If some idle CPUs with busy
siblings are found, just the first one should do load-balancing.

Fixes: b1bfeab9b0 ("sched/fair: Consider the idle state of the whole core for load balance")
Signed-off-by: Keisuke Nishimura <keisuke.nishimura@inria.fr>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
Reviewed-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20231031133821.1570861-1-keisuke.nishimura@inria.fr
2023-11-14 22:27:01 +01:00
Johannes Weiner 8b39d20ece sched: psi: fix unprivileged polling against cgroups
519fabc7aa ("psi: remove 500ms min window size limitation for
triggers") breaks unprivileged psi polling on cgroups.

Historically, we had a privilege check for polling in the open() of a
pressure file in /proc, but were erroneously missing it for the open()
of cgroup pressure files.

When unprivileged polling was introduced in d82caa2735 ("sched/psi:
Allow unprivileged polling of N*2s period"), it needed to filter
privileges depending on the exact polling parameters, and as such
moved the CAP_SYS_RESOURCE check from the proc open() callback to
psi_trigger_create(). Both the proc files as well as cgroup files go
through this during write(). This implicitly added the missing check
for privileges required for HT polling for cgroups.

When 519fabc7aa ("psi: remove 500ms min window size limitation for
triggers") followed right after to remove further restrictions on the
RT polling window, it incorrectly assumed the cgroup privilege check
was still missing and added it to the cgroup open(), mirroring what we
used to do for proc files in the past.

As a result, unprivileged poll requests that would be supported now
get rejected when opening the cgroup pressure file for writing.

Remove the cgroup open() check. psi_trigger_create() handles it.

Fixes: 519fabc7aa ("psi: remove 500ms min window size limitation for triggers")
Reported-by: Luca Boccassi <bluca@debian.org>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Luca Boccassi <bluca@debian.org>
Acked-by: Suren Baghdasaryan <surenb@google.com>
Cc: stable@vger.kernel.org # 6.5+
Link: https://lore.kernel.org/r/20231026164114.2488682-1-hannes@cmpxchg.org
2023-11-14 22:27:00 +01:00