Show the rejected function name when attaching tracing programs to
functions in deny list.
With this change, we know why tracing programs can't attach to functions
like __rcu_read_lock() from log.
$ ./fentry
libbpf: prog '__rcu_read_lock': BPF program load failed: -EINVAL
libbpf: prog '__rcu_read_lock': -- BEGIN PROG LOAD LOG --
Attaching tracing programs to function '__rcu_read_lock' is rejected.
Suggested-by: Leon Hwang <leon.hwang@linux.dev>
Signed-off-by: KaFai Wan <kafai.wan@linux.dev>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20250724151454.499040-3-kafai.wan@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
With this change, we know the precise rejected function name when
attaching fexit/fmod_ret to __noreturn functions from log.
$ ./fexit
libbpf: prog 'fexit': BPF program load failed: -EINVAL
libbpf: prog 'fexit': -- BEGIN PROG LOAD LOG --
Attaching fexit/fmod_ret to __noreturn function 'do_exit' is rejected.
Suggested-by: Leon Hwang <leon.hwang@linux.dev>
Signed-off-by: KaFai Wan <kafai.wan@linux.dev>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20250724151454.499040-2-kafai.wan@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Commit d7f0087381 ("bpf: try harder to deduce register bounds from
different numeric domains") added a second call to __reg_deduce_bounds
in reg_bounds_sync because a single call wasn't enough to converge to a
fixed point in terms of register bounds.
With patch "bpf: Improve bounds when s64 crosses sign boundary" from
this series, Eduard noticed that calling __reg_deduce_bounds twice isn't
enough anymore to converge. The first selftest added in "selftests/bpf:
Test cross-sign 64bits range refinement" highlights the need for a third
call to __reg_deduce_bounds. After instruction 7, reg_bounds_sync
performs the following bounds deduction:
reg_bounds_sync entry: scalar(smin=-655,smax=0xeffffeee,smin32=-783,smax32=-146)
__update_reg_bounds: scalar(smin=-655,smax=0xeffffeee,smin32=-783,smax32=-146)
__reg_deduce_bounds:
__reg32_deduce_bounds: scalar(smin=-655,smax=0xeffffeee,smin32=-783,smax32=-146,umin32=0xfffffcf1,umax32=0xffffff6e)
__reg64_deduce_bounds: scalar(smin=-655,smax=0xeffffeee,smin32=-783,smax32=-146,umin32=0xfffffcf1,umax32=0xffffff6e)
__reg_deduce_mixed_bounds: scalar(smin=-655,smax=0xeffffeee,umin=umin32=0xfffffcf1,umax=0xffffffffffffff6e,smin32=-783,smax32=-146,umax32=0xffffff6e)
__reg_deduce_bounds:
__reg32_deduce_bounds: scalar(smin=-655,smax=0xeffffeee,umin=umin32=0xfffffcf1,umax=0xffffffffffffff6e,smin32=-783,smax32=-146,umax32=0xffffff6e)
__reg64_deduce_bounds: scalar(smin=-655,smax=smax32=-146,umin=0xfffffffffffffd71,umax=0xffffffffffffff6e,smin32=-783,umin32=0xfffffcf1,umax32=0xffffff6e)
__reg_deduce_mixed_bounds: scalar(smin=-655,smax=smax32=-146,umin=0xfffffffffffffd71,umax=0xffffffffffffff6e,smin32=-783,umin32=0xfffffcf1,umax32=0xffffff6e)
__reg_bound_offset: scalar(smin=-655,smax=smax32=-146,umin=0xfffffffffffffd71,umax=0xffffffffffffff6e,smin32=-783,umin32=0xfffffcf1,umax32=0xffffff6e,var_off=(0xfffffffffffffc00; 0x3ff))
__update_reg_bounds: scalar(smin=-655,smax=smax32=-146,umin=0xfffffffffffffd71,umax=0xffffffffffffff6e,smin32=-783,umin32=0xfffffcf1,umax32=0xffffff6e,var_off=(0xfffffffffffffc00; 0x3ff))
In particular, notice how:
1. In the first call to __reg_deduce_bounds, __reg32_deduce_bounds
learns new u32 bounds.
2. __reg64_deduce_bounds is unable to improve bounds at this point.
3. __reg_deduce_mixed_bounds derives new u64 bounds from the u32 bounds.
4. In the second call to __reg_deduce_bounds, __reg64_deduce_bounds
improves the smax and umin bounds thanks to patch "bpf: Improve
bounds when s64 crosses sign boundary" from this series.
5. Subsequent functions are unable to improve the ranges further (only
tnums). Yet, a better smin32 bound could be learned from the smin
bound.
__reg32_deduce_bounds is able to improve smin32 from smin, but for that
we need a third call to __reg_deduce_bounds.
As discussed in [1], there may be a better way to organize the deduction
rules to learn the same information with less calls to the same
functions. Such an optimization requires further analysis and is
orthogonal to the present patchset.
Link: https://lore.kernel.org/bpf/aIKtSK9LjQXB8FLY@mail.gmail.com/ [1]
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Co-developed-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Link: https://lore.kernel.org/r/79619d3b42e5525e0e174ed534b75879a5ba15de.1753695655.git.paul.chaignon@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
__reg64_deduce_bounds currently improves the s64 range using the u64
range and vice versa, but only if it doesn't cross the sign boundary.
This patch improves __reg64_deduce_bounds to cover the case where the
s64 range crosses the sign boundary but overlaps with the u64 range on
only one end. In that case, we can improve both ranges. Consider the
following example, with the s64 range crossing the sign boundary:
0 U64_MAX
| [xxxxxxxxxxxxxx u64 range xxxxxxxxxxxxxx] |
|----------------------------|----------------------------|
|xxxxx s64 range xxxxxxxxx] [xxxxxxx|
0 S64_MAX S64_MIN -1
The u64 range overlaps only with positive portion of the s64 range. We
can thus derive the following new s64 and u64 ranges.
0 U64_MAX
| [xxxxxx u64 range xxxxx] |
|----------------------------|----------------------------|
| [xxxxxx s64 range xxxxx] |
0 S64_MAX S64_MIN -1
The same logic can probably apply to the s32/u32 ranges, but this patch
doesn't implement that change.
In addition to the selftests, the __reg64_deduce_bounds change was
also tested with Agni, the formal verification tool for the range
analysis [1].
Link: https://github.com/bpfverif/agni [1]
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Link: https://lore.kernel.org/r/933bd9ce1f36ded5559f92fdc09e5dbc823fa245.1753695655.git.paul.chaignon@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
During the bounds refinement, we improve the precision of various ranges
by looking at other ranges. Among others, we improve the following in
this order (other things happen between 1 and 2):
1. Improve u32 from s32 in __reg32_deduce_bounds.
2. Improve s/u64 from u32 in __reg_deduce_mixed_bounds.
3. Improve s/u64 from s32 in __reg_deduce_mixed_bounds.
In particular, if the s32 range forms a valid u32 range, we will use it
to improve the u32 range in __reg32_deduce_bounds. In
__reg_deduce_mixed_bounds, under the same condition, we will use the s32
range to improve the s/u64 ranges.
If at (1) we were able to learn from s32 to improve u32, we'll then be
able to use that in (2) to improve s/u64. Hence, as (3) happens under
the same precondition as (1), it won't improve s/u64 ranges further than
(1)+(2) did. Thus, we can get rid of (3).
In addition to the extensive suite of selftests for bounds refinement,
this patch was also tested with the Agni formal verification tool [1].
Additionally, Eduard mentioned:
The argument appears to be as follows:
Under precondition `(u32)reg->s32_min <= (u32)reg->s32_max`
__reg32_deduce_bounds produces:
reg->u32_min = max_t(u32, reg->s32_min, reg->u32_min);
reg->u32_max = min_t(u32, reg->s32_max, reg->u32_max);
And then first part of __reg_deduce_mixed_bounds assigns:
a. reg->umin umax= (reg->umin & ~0xffffffffULL) | max_t(u32, reg->s32_min, reg->u32_min);
b. reg->umax umin= (reg->umax & ~0xffffffffULL) | min_t(u32, reg->s32_max, reg->u32_max);
And then second part of __reg_deduce_mixed_bounds assigns:
c. reg->umin umax= (reg->umin & ~0xffffffffULL) | (u32)reg->s32_min;
d. reg->umax umin= (reg->umax & ~0xffffffffULL) | (u32)reg->s32_max;
But assignment (c) is a noop because:
max_t(u32, reg->s32_min, reg->u32_min) >= (u32)reg->s32_min
Hence RHS(a) >= RHS(c) and umin= does nothing.
Also assignment (d) is a noop because:
min_t(u32, reg->s32_max, reg->u32_max) <= (u32)reg->s32_max
Hence RHS(b) <= RHS(d) and umin= does nothing.
Plus the same reasoning for the part dealing with reg->s{min,max}_value:
e. reg->smin_value smax= (reg->smin_value & ~0xffffffffULL) | max_t(u32, reg->s32_min_value, reg->u32_min_value);
f. reg->smax_value smin= (reg->smax_value & ~0xffffffffULL) | min_t(u32, reg->s32_max_value, reg->u32_max_value);
vs
g. reg->smin_value smax= (reg->smin_value & ~0xffffffffULL) | (u32)reg->s32_min_value;
h. reg->smax_value smin= (reg->smax_value & ~0xffffffffULL) | (u32)reg->s32_max_value;
RHS(e) >= RHS(g) and RHS(f) <= RHS(h), hence smax=,smin= do nothing.
This appears to be correct.
Also, Shung-Hsi:
Beside going through the reasoning, I also played with CBMC a bit to
double check that as far as a single run of __reg_deduce_bounds() is
concerned (and that the register state matches certain handwavy
expectations), the change indeed still preserve the original behavior.
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://github.com/bpfverif/agni [1]
Link: https://lore.kernel.org/bpf/aIJwnFnFyUjNsCNa@mail.gmail.com
The rework of this mechanism added a 'use_nsec' member to struct
system_counterval. get_device_system_crosststamp() instantiates that
struct on the stack and hands a pointer to the driver callback.
Only the drivers which set use_nsec to true, initialize that field, but
all others ignore it. As get_device_system_crosststamp() does not
initialize the struct, the use_nsec field contains random stack content
in those cases. That causes a miscalulation usually resulting in a
failing range check in the best case.
Initialize the structure before handing it to the drivers to cure that.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmiGFA4THHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoRcsEACvQI0LmKTOigzSZvBT1CZnGcwpeqYi
Ez0v/w+tpyfbwQgf9kxR+ZbjNdwqCYFnR8PZPFKFuvsanWRTIcYaTkIQWvDhcEX/
U4AFI3VkdZUFckCEY/fv7j3/jkp7pbLVHMq001Z9xaMMcE+ox1AlHpEW0Khd3gqL
VFLXU5S7Q9H6J6ujjFAXAMuhgjk6WOz8q+ew3hnc3dxwyuEBAz83jOScH/be3dTl
10ydzoxFEa+ZlacAHX+SqZ7nhS7ExxNlwlUuTYj/EkBCQ8UIoS93YLA5bYMcWCao
W5rs6vFJmMO6NR6lkqwfKmKyjovx79jHMVNKoxydZGvkqcNMtfc/eUfByxAkyCDP
gmTCFwgKVGdjGsYwkGqafejmJt5OFrD1hMyWfBhGWQ/Z8CXuuJNEa/8trSyUK/CS
DFD1InOLltbYuw7rY5gRxb+xmgBTxUMj8gF/hXYs7wNzJqNJXXNae/2Sue+Xi+mV
iieEF8UonmpMe9k9w3+fFGGDWYa4lYnT5O3VMQ0nEjj6dt5RVQqRvjTa+GtQJzUs
h4fUs+BIKyCkh6DgRKyIsDruzryOSnZ+vqMcGMm0gvPttc3cGYksLiQVlWYjQhxs
pTFrHNGOSXMT5WBQ7KWKzGypHlf3WYhVWk1+dmPJedrdyr23AfgKAGM6zva1Oqjc
81w9DvppBL0sOA==
=j6Po
-----END PGP SIGNATURE-----
Merge tag 'timers-urgent-2025-07-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Thomas Gleixner:
"A single fix for the PTP systemcounter mechanism:
The rework of this mechanism added a 'use_nsec' member to struct
system_counterval. get_device_system_crosststamp() instantiates that
struct on the stack and hands a pointer to the driver callback.
Only the drivers which set use_nsec to true, initialize that field,
but all others ignore it. As get_device_system_crosststamp() does not
initialize the struct, the use_nsec field contains random stack
content in those cases. That causes a miscalulation usually resulting
in a failing range check in the best case.
Initialize the structure before handing it to the drivers to cure
that"
* tag 'timers-urgent-2025-07-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timekeeping: Zero initialize system_counterval when querying time from phc drivers
bpf_jit_get_prog_name() will be used by all JITs when enabling support
for private stack. This function is currently implemented in the x86
JIT.
Move the function to core.c so that other JITs can easily use it in
their implementation of private stack.
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/bpf/20250724120257.7299-2-puranjay@kernel.org
The code is unused since 98e20e5e13 ("bpfilter: remove bpfilter"),
therefore remove it.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Link: https://lore.kernel.org/bpf/20250721-remove-usermode-driver-v1-2-0d0083334382@linutronix.de
The usermode driver framework is not used anymore by the BPF
preload code.
Fixes: cb80ddc671 ("bpf: Convert bpf_preload.ko to use light skeleton.")
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/bpf/20250721-remove-usermode-driver-v1-1-0d0083334382@linutronix.de
or aren't considered necessary for -stable kernels.
7 are for MM.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaILYBgAKCRDdBJ7gKXxA
jo0uAQDvTlAjH6TcgRW/cbqHRIeiRoZ9Bwh/RUlJXM9neDR2LgEA41B+ohTsxUmZ
OhM3Ce94tiGrHnVlW3SsmVaO+1TjGAU=
=KUR9
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2025-07-24-18-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"11 hotfixes. 9 are cc:stable and the remainder address post-6.15
issues or aren't considered necessary for -stable kernels.
7 are for MM"
* tag 'mm-hotfixes-stable-2025-07-24-18-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
sprintf.h requires stdarg.h
resource: fix false warning in __request_region()
mm/damon/core: commit damos_quota_goal->nid
kasan: use vmalloc_dump_obj() for vmalloc error reports
mm/ksm: fix -Wsometimes-uninitialized from clang-21 in advisor_mode_show()
mm: update MAINTAINERS entry for HMM
nilfs2: reject invalid file types when reading inodes
selftests/mm: fix split_huge_page_test for folio_split() tests
mailmap: add entry for Senozhatsky
mm/zsmalloc: do not pass __GFP_MOVABLE if CONFIG_COMPACTION=n
mm/vmscan: fix hwpoisoned large folio handling in shrink_folio_list
A warning is raised when __request_region() detects a conflict with a
resource whose resource.desc is IORES_DESC_DEVICE_PRIVATE_MEMORY.
But this warning is only valid for iomem_resources.
The hmem device resource uses resource.desc as the numa node id, which can
cause spurious warnings.
This warning appeared on a machine with multiple cxl memory expanders.
One of the NUMA node id is 6, which is the same as the value of
IORES_DESC_DEVICE_PRIVATE_MEMORY.
In this environment it was just a spurious warning, but when I saw the
warning I suspected a real problem so it's better to fix it.
This change fixes this by restricting the warning to only iomem_resource.
This also adds a missing new line to the warning message.
Link: https://lkml.kernel.org/r/20250719112604.25500-1-akinobu.mita@gmail.com
Fixes: 7dab174e2e ("dax/hmem: Move hmem device registration to dax_hmem.ko")
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Most drivers only populate the fields cycles and cs_id of system_counterval
in their get_time_fn() callback for get_device_system_crosststamp(), unless
they explicitly provide nanosecond values.
When the use_nsecs field was added to struct system_counterval, most
drivers did not care. Clock sources other than CSID_GENERIC could then get
converted in convert_base_to_cs() based on an uninitialized use_nsecs field,
which usually results in -EINVAL during the following range check.
Pass in a fully zero initialized system_counterval_t to cure that.
Fixes: 6b2e299775 ("timekeeping: Provide infrastructure for converting to/from a base clock")
Signed-off-by: Markus Blöchl <markus@blochl.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20250720-timekeeping_uninit_crossts-v2-1-f513c885b7c2@blochl.de
Intel linux test robot reported a warning that ERR_CAST can be used
for error pointer casting instead of more-complicated/rarely-used
ERR_PTR(PTR_ERR(...)) style.
There is no functionality change, but still let us replace two such
instances as it improves consistency and readability.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202507201048.bceHy8zX-lkp@intel.com/
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://patch.msgid.link/20250720164754.3999140-1-yonghong.song@linux.dev
- Fix timerlat with use of FORTIFY_SOURCE
FORTIFY_SOURCE was added to the stack tracer where it compares the
entry->caller array to having entry->size elements.
timerlat has the following:
memcpy(&entry->caller, fstack->calls, size);
entry->size = size;
Which triggers FORTIFY_SOURCE as the caller is populated before the
entry->size is initialized.
Swap the order to satisfy FORTIFY_SOURCE logic.
- Add down_write(trace_event_sem) when adding trace events in modules
Trace events being added to the ftrace_events array are protected by
the trace_event_sem semaphore. But when loading modules that have
trace events, the addition of the events are not protected by the
semaphore and loading two modules that have events at the same time
can corrupt the list.
Also add a lockdep_assert_held(trace_event_sem) to
_trace_add_event_dirs() to confirm its held when iterating the list.
-----BEGIN PGP SIGNATURE-----
iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaH06gBQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qoJsAP0a+/E0f+5g7O/OtYPVEDSCREv1vj9c
3dr0iWopqaOC7gEAw8Vc5iWIHKcB/JuJ+GqALoutL+lihruG26MWkFFsOgU=
=zH5J
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.16-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix timerlat with use of FORTIFY_SOURCE
FORTIFY_SOURCE was added to the stack tracer where it compares the
entry->caller array to having entry->size elements.
timerlat has the following:
memcpy(&entry->caller, fstack->calls, size);
entry->size = size;
Which triggers FORTIFY_SOURCE as the caller is populated before the
entry->size is initialized.
Swap the order to satisfy FORTIFY_SOURCE logic.
- Add down_write(trace_event_sem) when adding trace events in modules
Trace events being added to the ftrace_events array are protected by
the trace_event_sem semaphore. But when loading modules that have
trace events, the addition of the events are not protected by the
semaphore and loading two modules that have events at the same time
can corrupt the list.
Also add a lockdep_assert_held(trace_event_sem) to
_trace_add_event_dirs() to confirm it is held when iterating the
list.
* tag 'trace-v6.16-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Add down_write(trace_event_sem) when adding trace event
tracing/osnoise: Fix crash in timerlat_dump_stack()
counter nr_uninterruptible to an unsigned int. Due to the fact that the
counters are not updated on migration of a uninterruptble task to a
different CPU, these counters can exceed INT_MAX. The counter is cast to
long in the load average calculation, which means that the cast expands
into negative space resulting in bogus load average values. Convert it back
to unsigned long to fix this.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmh82RYTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoXNlEACyQxl/OCXSmKpcxrIvMjnalXh3Ibs2
KqbZ3MA/hY1ZWhSKXBDMv8hkxE00PY8WXsPDtLeRz0n6GegbUx4zVsUQpzn24gRe
uEl+qIuANaz5uMu2qlmQwSuxVQ0SDqLVFObrNgKQ554jJckjXKcgvrBjwczTw9lJ
u6yTVLrknf0IsbQ79yxToaNf6jD3HcSIGX06Hs4EYucPzYd54VScr6LGGwux3rmI
CN0RSA9RduUzcf/aKhe9/tmM6oss4tdByNuVSdx/yZ5QvVw5oOrWdI8sU0BlZsrC
IjvqLvvLFKRPZre24UeFnIVxhtv+cwPXxrA6tR/SM2eUbHKzU8jHz2TlPu8xwOwb
sZMMBjPx1Y+fpVQoGxEC1cEWbCUPSyQ7NGN2uS3Quk4XoQLRz6t//B+63rT65bfV
wMAZpsgZ9/c23yDuPG11XxBMQYVbS0sFyLDymBC93Co8vh8PowDIf2xmgPFLxM78
HdOCCsHsBPqZutQGQ7/qw//e7T/9fv0MDP8D4fuyLrhUdvBAvGHZp85T4YmOUcjp
bTTE28Rw1sBgdAnTNjGWfwRN7Hwc1DzaLh40Um+nnjMA+yG9uq7gjTYDx4w6W/EA
V70tx9tfJOM11g/qfIURkJDpf8PezJ3KHg1ZOiPn9e+I08LoPVMeIqbFEnJ9VAc7
VStWpQd+SG99AQ==
=Aqcr
-----END PGP SIGNATURE-----
Merge tag 'sched-urgent-2025-07-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Thomas Gleixner:
"A single fix for the scheduler.
A recent commit changed the runqueue counter nr_uninterruptible to an
unsigned int. Due to the fact that the counters are not updated on
migration of a uninterruptble task to a different CPU, these counters
can exceed INT_MAX.
The counter is cast to long in the load average calculation, which
means that the cast expands into negative space resulting in bogus
load average values.
Convert it back to unsigned long to fix this.
* tag 'sched-urgent-2025-07-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Change nr_uninterruptible type to unsigned long
When a module is loaded, it adds trace events defined by the module. It
may also need to modify the modules trace printk formats to replace enum
names with their values.
If two modules are loaded at the same time, the adding of the event to the
ftrace_events list can corrupt the walking of the list in the code that is
modifying the printk format strings and crash the kernel.
The addition of the event should take the trace_event_sem for write while
it adds the new event.
Also add a lockdep_assert_held() on that semaphore in
__trace_add_event_dirs() as it iterates the list.
Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://lore.kernel.org/20250718223158.799bfc0c@batman.local.home
Reported-by: Fusheng Huang(黄富生) <Fusheng.Huang@luxshare-ict.com>
Closes: https://lore.kernel.org/all/20250717105007.46ccd18f@batman.local.home/
Fixes: 110bf2b764 ("tracing: add protection around module events unload")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
- Fix handling of migration disabled tasks in default idle selection.
- update_locked_rq() called __this_cpu_write() spuriously with NULL when @rq
was not locked. As the writes were spurious, it didn't break anything
directly. However, the function could be called in a preemptible leading
to a context warning in __this_cpu_write(). Skip the spurious NULL writes.
- Selftest fix on UP.
-----BEGIN PGP SIGNATURE-----
iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaHvPZw4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGabMAP4jSAr4gYWEBOUaD9btwnPxZwlSiAEQtqBDBVRb
/UunFAD/WBwUPk/u7BchLHjuH3sYW5gQb40kbtUnmNvB+RNUUgc=
=3WAD
-----END PGP SIGNATURE-----
Merge tag 'sched_ext-for-6.16-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:
- Fix handling of migration disabled tasks in default idle selection
- update_locked_rq() called __this_cpu_write() spuriously with NULL
when @rq was not locked. As the writes were spurious, it didn't break
anything directly. However, the function could be called in a
preemptible leading to a context warning in __this_cpu_write(). Skip
the spurious NULL writes.
- Selftest fix on UP
* tag 'sched_ext-for-6.16-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: idle: Handle migration-disabled tasks in idle selection
sched/ext: Prevent update_locked_rq() calls with NULL rq
selftests/sched_ext: Fix exit selftest hang on UP
An earlier commit to suppress a warning introduced a race condition where
tasks can escape cgroup1 freezer. Revert the commit and simply remove the
warning which was spurious to begin with.
-----BEGIN PGP SIGNATURE-----
iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaHvMvw4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGadfAP0cT4QXwtw0VXyiNr5PMqxQ74rYsngJ+NevRbod
fK6hIwD/T+owQc/ivYp5/N/XUgpT+Ixp7YRj2RIzQbL6SPjzOwE=
=IlrN
-----END PGP SIGNATURE-----
Merge tag 'cgroup-for-6.16-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
"An earlier commit to suppress a warning introduced a race condition
where tasks can escape cgroup1 freezer. Revert the commit and simply
remove the warning which was spurious to begin with"
* tag 'cgroup-for-6.16-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
Revert "cgroup_freezer: cgroup_freezing: Check if not frozen"
sched,freezer: Remove unnecessary warning in __thaw_task
We have observed kernel panics when using timerlat with stack saving,
with the following dmesg output:
memcpy: detected buffer overflow: 88 byte write of buffer size 0
WARNING: CPU: 2 PID: 8153 at lib/string_helpers.c:1032 __fortify_report+0x55/0xa0
CPU: 2 UID: 0 PID: 8153 Comm: timerlatu/2 Kdump: loaded Not tainted 6.15.3-200.fc42.x86_64 #1 PREEMPT(lazy)
Call Trace:
<TASK>
? trace_buffer_lock_reserve+0x2a/0x60
__fortify_panic+0xd/0xf
__timerlat_dump_stack.cold+0xd/0xd
timerlat_dump_stack.part.0+0x47/0x80
timerlat_fd_read+0x36d/0x390
vfs_read+0xe2/0x390
? syscall_exit_to_user_mode+0x1d5/0x210
ksys_read+0x73/0xe0
do_syscall_64+0x7b/0x160
? exc_page_fault+0x7e/0x1a0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
__timerlat_dump_stack() constructs the ftrace stack entry like this:
struct stack_entry *entry;
...
memcpy(&entry->caller, fstack->calls, size);
entry->size = fstack->nr_entries;
Since commit e7186af7fb ("tracing: Add back FORTIFY_SOURCE logic to
kernel_stack event structure"), struct stack_entry marks its caller
field with __counted_by(size). At the time of the memcpy, entry->size
contains garbage from the ringbuffer, which under some circumstances is
zero, triggering a kernel panic by buffer overflow.
Populate the size field before the memcpy so that the out-of-bounds
check knows the correct size. This is analogous to
__ftrace_trace_stack().
Cc: stable@vger.kernel.org
Cc: John Kacur <jkacur@redhat.com>
Cc: Luis Goncalves <lgoncalv@redhat.com>
Cc: Attila Fazekas <afazekas@redhat.com>
Link: https://lore.kernel.org/20250716143601.7313-1-tglozar@redhat.com
Fixes: e7186af7fb ("tracing: Add back FORTIFY_SOURCE logic to kernel_stack event structure")
Signed-off-by: Tomas Glozar <tglozar@redhat.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmh5r+QACgkQ6rmadz2v
bTqrFA/+OW+z+vh8k43C9TVZttqC5poGcFqF5zRlYTArQT3AB+QuhG/CRjVQFbGL
2YfVbq+5pxNo0I/FtCoWVui2OMf1UsRKKvM0pSn50yn3ytRfotZjQ/AWACm/9t5y
fyRLBIS3ArjashQ9/S71tAIfG6l/B+FGX81wOVa1uL50ab15+4NrplhZHY421o9a
lH2E2wnpy/BnrB9F/FO4iQbelixvBfMwj8epruhCVbipfx6BOKPMzKVtcm61FVT1
hDsQZ0bIpVKgpRNBlTUHjVyzYo8oeXzqVhhY7hsmpHxJSiol7KLWyHEJD5ExS9Qg
XVPK34b9IPgAfS8f/DgGAkWsAht7BMLsR0GUWyVIiacHHqTinRPVfWbzqWa5yjdD
+8Vp4RVrcUONx69upx+IDrb4uMfQYktdpcvQtSl0SSinsG/INXurT1Vyz8aBPfkv
WbiBeXhW/dCD9NuL5D9gnyZWaPXIAmbK7+pXJOSIpfKC24WRXTONDXhGP1b6ef31
zHQu3r98ekYnHr3hbsvdHOWB7LKkJ1bcg2+OsmtYUUmnCiQTM1H8ILTwbSQ4EfXJ
6iRxYeFp+VJOPScRzmNU/A3ibQWfV+foiO4S6hmazJOy3mmHX6hgPZoj2fjV8Ejf
xZeOpQbCaZQCzbxxOdtjykwfe+zPWGnyRPnQpVIVdi7Abk1EZaE=
=eUL7
-----END PGP SIGNATURE-----
Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Pull bpf fixes from Alexei Starovoitov:
- Fix handling of BPF arena relocations (Andrii Nakryiko)
- Fix race in bpf_arch_text_poke() on s390 (Ilya Leoshkevich)
- Fix use of virt_to_phys() on arm64 when mmapping BTF (Lorenz Bauer)
- Reject %p% format string in bprintf-like BPF helpers (Paul Chaignon)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
libbpf: Fix handling of BPF arena relocations
btf: Fix virt_to_phys() on arm64 when mmapping BTF
selftests/bpf: Stress test attaching a BPF prog to another BPF prog
s390/bpf: Fix bpf_arch_text_poke() with new_addr == NULL again
selftests/bpf: Add negative test cases for snprintf
bpf: Reject %p% format string in bprintf-like helpers
When SCX_OPS_ENQ_MIGRATION_DISABLED is enabled, migration-disabled tasks
are also routed to ops.enqueue(). A scheduler may attempt to dispatch
such tasks directly to an idle CPU using the default idle selection
policy via scx_bpf_select_cpu_and() or scx_bpf_select_cpu_dfl().
This scenario must be properly handled by the built-in idle policy to
avoid returning an idle CPU where the target task isn't allowed to run.
Otherwise, it can lead to errors such as:
EXIT: runtime error (SCX_DSQ_LOCAL[_ON] cannot move migration disabled Chrome_ChildIOT[291646] from CPU 3 to 14)
Prevent this by explicitly handling migration-disabled tasks in the
built-in idle selection logic, maintaining their CPU affinity.
Fixes: a730e3f7a4 ("sched_ext: idle: Consolidate default idle CPU selection kfuncs")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
This reverts commit cff5f49d43.
Commit cff5f49d43 ("cgroup_freezer: cgroup_freezing: Check if not
frozen") modified the cgroup_freezing() logic to verify that the FROZEN
flag is not set, affecting the return value of the freezing() function,
in order to address a warning in __thaw_task.
A race condition exists that may allow tasks to escape being frozen. The
following scenario demonstrates this issue:
CPU 0 (get_signal path) CPU 1 (freezer.state reader)
try_to_freeze read freezer.state
__refrigerator freezer_read
update_if_frozen
WRITE_ONCE(current->__state, TASK_FROZEN);
...
/* Task is now marked frozen */
/* frozen(task) == true */
/* Assuming other tasks are frozen */
freezer->state |= CGROUP_FROZEN;
/* freezing(current) returns false */
/* because cgroup is frozen (not freezing) */
break out
__set_current_state(TASK_RUNNING);
/* Bug: Task resumes running when it should remain frozen */
The existing !frozen(p) check in __thaw_task makes the
WARN_ON_ONCE(freezing(p)) warning redundant. Removing this warning enables
reverting the commit cff5f49d43 ("cgroup_freezer: cgroup_freezing: Check
if not frozen") to resolve the issue.
The warning has been removed in the previous patch. This patch revert the
commit cff5f49d43 ("cgroup_freezer: cgroup_freezing: Check if not
frozen") to complete the fix.
Fixes: cff5f49d43 ("cgroup_freezer: cgroup_freezing: Check if not frozen")
Reported-by: Zhong Jiawei<zhongjiawei1@huawei.com>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Commit cff5f49d43 ("cgroup_freezer: cgroup_freezing: Check if not
frozen") modified the cgroup_freezing() logic to verify that the FROZEN
flag is not set, affecting the return value of the freezing() function,
in order to address a warning in __thaw_task.
A race condition exists that may allow tasks to escape being frozen. The
following scenario demonstrates this issue:
CPU 0 (get_signal path) CPU 1 (freezer.state reader)
try_to_freeze read freezer.state
__refrigerator freezer_read
update_if_frozen
WRITE_ONCE(current->__state, TASK_FROZEN);
...
/* Task is now marked frozen */
/* frozen(task) == true */
/* Assuming other tasks are frozen */
freezer->state |= CGROUP_FROZEN;
/* freezing(current) returns false */
/* because cgroup is frozen (not freezing) */
break out
__set_current_state(TASK_RUNNING);
/* Bug: Task resumes running when it should remain frozen */
The existing !frozen(p) check in __thaw_task makes the
WARN_ON_ONCE(freezing(p)) warning redundant. Removing this warning enables
reverting commit cff5f49d43 ("cgroup_freezer: cgroup_freezing: Check if
not frozen") to resolve the issue.
This patch removes the warning from __thaw_task. A subsequent patch will
revert commit cff5f49d43 ("cgroup_freezer: cgroup_freezing: Check if
not frozen") to complete the fix.
Reported-by: Zhong Jiawei<zhongjiawei1@huawei.com>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
- Fix a deadlock that may occur on asynchronous device suspend
failures due to missing completion updates in error paths (Rafael
Wysocki).
- Drop a misplaced pm_restore_gfp_mask() call, which may cause
swap to be accessed too early if system suspend fails, from
suspend_devices_and_enter() (Rafael Wysocki).
- Remove duplicate filesystems_freeze/thaw() calls, which sometimes
cause systems to be unable to resume, from enter_state() (Zihuan
Zhang).
-----BEGIN PGP SIGNATURE-----
iQFGBAABCAAwFiEEcM8Aw/RY0dgsiRUR7l+9nS/U47UFAmh5IE4SHHJqd0Byand5
c29ja2kubmV0AAoJEO5fvZ0v1OO12LYH/3CULHOIoshuWu+G9nIKokqO0oNYmxh1
qgkh+o9sBz9uTyfCSd1qDT9j1LjzUnOJUe67IzHJFuZcHbnWU4k9VYWV+H8TKyNp
CcQ+9g5gCqOzxWH7G7C2ekciSnnBlObwJ7ZsDlUOeuJ16GVCjqrFPZbJ6No0A+Hz
8Ed7R4o1MKrURLU9IZWpqV1a54Z9ySv2yrx9T4G0c8WV2VRJZJ76e1hAGcOr4owj
kM1+MPnsfU/RvBUUEKjUEm70ZBXGbXT+D9p/L/AuoYyhI94kvoImK1/2An5noHCO
czK5nDB867z6hu5jTVPt/RoIK/49H/a2CDNYl3ZiZnVVZIoPN/wt3C8=
=wkHb
-----END PGP SIGNATURE-----
Merge tag 'pm-6.16-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These address three issues introduced during the current development
cycle and related to system suspend and hibernation, one triggering
when asynchronous suspend of devices fails, one possibly affecting
memory management in the core suspend code error path, and one due to
duplicate filesystems freezing during system suspend:
- Fix a deadlock that may occur on asynchronous device suspend
failures due to missing completion updates in error paths (Rafael
Wysocki)
- Drop a misplaced pm_restore_gfp_mask() call, which may cause swap
to be accessed too early if system suspend fails, from
suspend_devices_and_enter() (Rafael Wysocki)
- Remove duplicate filesystems_freeze/thaw() calls, which sometimes
cause systems to be unable to resume, from enter_state() (Zihuan
Zhang)"
* tag 'pm-6.16-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM: sleep: Update power.completion for all devices on errors
PM: suspend: clean up redundant filesystems_freeze/thaw() handling
PM: suspend: Drop a misplaced pm_restore_gfp_mask() call
The 'commit 35f96de041 ("bpf: Introduce BPF token object")' added
BPF token as a new kind of BPF kernel object. And BPF_OBJ_GET_INFO_BY_FD
already used to get BPF object info, so we can also get token info with
this cmd.
One usage scenario, when program runs failed with token, because of
the permission failure, we can report what BPF token is allowing with
this API for debugging.
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Tao Chen <chen.dylane@linux.dev>
Link: https://lore.kernel.org/r/20250716134654.1162635-1-chen.dylane@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The last iterators update (commit 515ee52b22 ("bpf: make preloaded
map iterators to display map elements count")) missed the big-endian
skeleton. Update it by running "make big" with Debian clang version
21.0.0 (++20250706105601+01c97b4953e8-1~exp1~20250706225612.1558).
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20250710100907.45880-1-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Avoid invoking update_locked_rq() when the runqueue (rq) pointer is NULL
in the SCX_CALL_OP and SCX_CALL_OP_RET macros.
Previously, calling update_locked_rq(NULL) with preemption enabled could
trigger the following warning:
BUG: using __this_cpu_write() in preemptible [00000000]
This happens because __this_cpu_write() is unsafe to use in preemptible
context.
rq is NULL when an ops invoked from an unlocked context. In such cases, we
don't need to store any rq, since the value should already be NULL
(unlocked). Ensure that update_locked_rq() is only called when rq is
non-NULL, preventing calling __this_cpu_write() on preemptible context.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Fixes: 18853ba782 ("sched_ext: Track currently locked rq")
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org # v6.15
After a recent change in clang to strengthen uninitialized warnings [1],
it points out that in one of the error paths in parse_btf_arg(), params
is used uninitialized:
kernel/trace/trace_probe.c:660:19: warning: variable 'params' is uninitialized when used here [-Wuninitialized]
660 | return PTR_ERR(params);
| ^~~~~~
Match many other NO_BTF_ENTRY error cases and return -ENOENT, clearing
up the warning.
Link: https://lore.kernel.org/all/20250715-trace_probe-fix-const-uninit-warning-v1-1-98960f91dd04@kernel.org/
Cc: stable@vger.kernel.org
Closes: https://github.com/ClangBuiltLinux/linux/issues/2110
Fixes: d157d76944 ("tracing/probes: Support BTF field access from $retval")
Link: 2464313eef [1]
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
The recently introduced support for freezing filesystems during system
suspend included calls to filesystems_freeze() in both suspend_prepare()
and enter_state(), as well as calls to filesystems_thaw() in both
suspend_finish() and the Unlock path in enter_state(). These are
redundant.
Moreover, calling filesystems_freeze() twice, from both suspend_prepare()
and enter_state(), leads to a black screen and makes the system unable
to resume in some cases.
Address this as follows:
- filesystems_freeze() is already called in suspend_prepare(), which
is the proper and consistent place to handle pre-suspend operations.
The second call in enter_state() is unnecessary and so remove it.
- filesystems_thaw() is invoked in suspend_finish(), which covers
successful suspend/resume paths. In the failure case, add a call
to filesystems_thaw() only when needed, avoiding the duplicate call
in the general Unlock path.
This change simplifies the suspend code and avoids repeated freeze/thaw
calls, while preserving correct ordering and behavior.
Fixes: eacfbf7419 ("power: freeze filesystems during suspend/resume")
Signed-off-by: Zihuan Zhang <zhangzihuan@kylinos.cn>
Link: https://patch.msgid.link/20250712030824.81474-1-zhangzihuan@kylinos.cn
[ rjw: Changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The pm_restore_gfp_mask() call added by commit 12ffc3b151 ("PM:
Restrict swap use to later in the suspend sequence") to
suspend_devices_and_enter() is done too early because it takes
place before calling dpm_resume() in dpm_resume_end() and some
swap-backing devices may not be ready at that point. Moreover,
dpm_resume_end() called subsequently in the same code path invokes
pm_restore_gfp_mask() again and calling it twice in a row is
pointless.
Drop the misplaced pm_restore_gfp_mask() call from
suspend_devices_and_enter() to address this issue.
Fixes: 12ffc3b151 ("PM: Restrict swap use to later in the suspend sequence")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Link: https://patch.msgid.link/2810409.mvXUDI8C0e@rjwysocki.net
[ rjw: Changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The commit e6fe3f422b ("sched: Make multiple runqueue task counters
32-bit") changed nr_uninterruptible to an unsigned int. But the
nr_uninterruptible values for each of the CPU runqueues can grow to
large numbers, sometimes exceeding INT_MAX. This is valid, if, over
time, a large number of tasks are migrated off of one CPU after going
into an uninterruptible state. Only the sum of all nr_interruptible
values across all CPUs yields the correct result, as explained in a
comment in kernel/sched/loadavg.c.
Change the type of nr_uninterruptible back to unsigned long to prevent
overflows, and thus the miscalculation of load average.
Fixes: e6fe3f422b ("sched: Make multiple runqueue task counters 32-bit")
Signed-off-by: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250709173328.606794-1-aruna.ramakrishna@oracle.com
about it
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmhzhRwACgkQEsHwGGHe
VUpz3xAAwLM/anyGvUXMP9Wx3X3kXM0NMw0NfkSucE22p7R//DfVWwLgz26hE8VX
haUp8Idnmp2EV6BsA/6SmslSbzYeBNKSIWwQiRI67goIACK0MDqzYkTlTnS170BU
bw4Lvd5PzA9DZiHTpqNCmg2m9ouGCQHOsxfKzs6DUMgpca6y8imFYW3gMZwGyA0m
Zkanw+BqPAhYSbTjZ1ZgtYmtPuvunXu/K/meEbonW2prRXfhZ9tbU31q/iRtHy3c
v9nRiK83pbqIJHUfYraTlfvmkMz975gvuN7mtmj5H2v+gc1S9adp1VrcI3tYFZDB
d5FVDsQpqWNBwewHG6VyYww6wCfm0IRg6Ys+rujhYl/ICjmthDUQJg9tZnZ03yxz
sv/b9E84Nq5BiJeLB88Ue0vRhH4ct9MTUr3Z9zhSKGN3IArpMPOm2wlmUmPoGerZ
g1jR80oG6Ikwl660MBpEX9yqyG8P5b7jnLGMJC0QJX9c0TLSY8yQgnnWKDVUvLVx
SrCjACQ2PnxqS9v2RlB3omoBYeEZfv7coy/vKIXZygXg7aNaDMemUktveihjUxLg
MLdOTH3Ygq5lASIGhhjqJenZ/844XIhZ9rgUcz3HLzlkXLGt8NmjOT+rW1IHAZHo
VpKbYPedr4uTqcxSDp4Qr/53m9xm4OTPe+3XEkXmeN5WQXbEh4A=
=vP1/
-----END PGP SIGNATURE-----
Merge tag 'perf_urgent_for_v6.16_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fix from Borislav Petkov:
- Prevent perf_sigtrap() from observing an exiting task and warning
about it
* tag 'perf_urgent_for_v6.16_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/core: Fix WARN in perf_sigtrap()
post-6.15 issues or aren't considered necessary for -stable kernels.
14 are for MM. Three gdb-script fixes and a kallsyms build fix.
-----BEGIN PGP SIGNATURE-----
iHQEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaHGbTgAKCRDdBJ7gKXxA
jowqAPiCWBFfcFaX20BxVaMU1PjC3Lh9llDXqQwBhBNdcadSAP44SGQ8nrfV+piB
OcNz2AEwBBfS354G0Etlh4k08YoAAw==
=IDDc
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2025-07-11-16-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"19 hotfixes. A whopping 16 are cc:stable and the remainder address
post-6.15 issues or aren't considered necessary for -stable kernels.
14 are for MM. Three gdb-script fixes and a kallsyms build fix"
* tag 'mm-hotfixes-stable-2025-07-11-16-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
Revert "sched/numa: add statistics of numa balance task"
mm: fix the inaccurate memory statistics issue for users
mm/damon: fix divide by zero in damon_get_intervals_score()
samples/damon: fix damon sample mtier for start failure
samples/damon: fix damon sample wsse for start failure
samples/damon: fix damon sample prcl for start failure
kasan: remove kasan_find_vm_area() to prevent possible deadlock
scripts: gdb: vfs: support external dentry names
mm/migrate: fix do_pages_stat in compat mode
mm/damon/core: handle damon_call_control as normal under kdmond deactivation
mm/rmap: fix potential out-of-bounds page table access during batched unmap
mm/hugetlb: don't crash when allocating a folio if there are no resv
scripts/gdb: de-reference per-CPU MCE interrupts
scripts/gdb: fix interrupts.py after maple tree conversion
maple_tree: fix mt_destroy_walk() on root leaf node
mm/vmalloc: leave lazy MMU mode on PTE mapping error
scripts/gdb: fix interrupts display after MCP on x86
lib/alloc_tag: do not acquire non-existent lock in alloc_tag_top_users()
kallsyms: fix build without execinfo
Use attach_type in bpf_link, and remove it in bpf_netns_link.
And move netns_type field to the end to fill the byte hole.
Signed-off-by: Tao Chen <chen.dylane@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20250710032038.888700-6-chen.dylane@linux.dev
Use attach_type in bpf_link to replace the location filed, and
remove location field in tcx_link.
Signed-off-by: Tao Chen <chen.dylane@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20250710032038.888700-5-chen.dylane@linux.dev
Use attach_type in bpf_link, and remove it in bpf_cgroup_link.
Signed-off-by: Tao Chen <chen.dylane@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20250710032038.888700-3-chen.dylane@linux.dev
Attach_type will be set when a link is created by user. It is better to
record attach_type in bpf_link generically and have it available
universally for all link types. So add the attach_type field in bpf_link
and move the sleepable field to avoid unnecessary gap padding.
Signed-off-by: Tao Chen <chen.dylane@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20250710032038.888700-2-chen.dylane@linux.dev
Syzbot reported a kernel warning due to a range invariant violation on
the following BPF program.
0: call bpf_get_netns_cookie
1: if r0 == 0 goto <exit>
2: if r0 & Oxffffffff goto <exit>
The issue is on the path where we fall through both jumps.
That path is unreachable at runtime: after insn 1, we know r0 != 0, but
with the sign extension on the jset, we would only fallthrough insn 2
if r0 == 0. Unfortunately, is_branch_taken() isn't currently able to
figure this out, so the verifier walks all branches. The verifier then
refines the register bounds using the second condition and we end
up with inconsistent bounds on this unreachable path:
1: if r0 == 0 goto <exit>
r0: u64=[0x1, 0xffffffffffffffff] var_off=(0, 0xffffffffffffffff)
2: if r0 & 0xffffffff goto <exit>
r0 before reg_bounds_sync: u64=[0x1, 0xffffffffffffffff] var_off=(0, 0)
r0 after reg_bounds_sync: u64=[0x1, 0] var_off=(0, 0)
Improving the range refinement for JSET to cover all cases is tricky. We
also don't expect many users to rely on JSET given LLVM doesn't generate
those instructions. So instead of improving the range refinement for
JSETs, Eduard suggested we forget the ranges whenever we're narrowing
tnums after a JSET. This patch implements that approach.
Reported-by: syzbot+c711ce17dd78e5d4fdcf@syzkaller.appspotmail.com
Suggested-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Link: https://lore.kernel.org/r/9d4fd6432a095d281f815770608fdcd16028ce0b.1752171365.git.paul.chaignon@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add a new BPF arena kfunc for reserving a range of arena virtual
addresses without backing them with pages. This prevents the range from
being populated using bpf_arena_alloc_pages().
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250709191312.29840-2-emil@etsalapatis.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
- small fix relevant to arm64 server and custom CMA configuration
(Feng Tang)
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSrngzkoBtlA8uaaJ+Jp1EFxbsSRAUCaHCzdQAKCRCJp1EFxbsS
RMrMAQDghOwKZqYuC27kJt5T7lgG47YCNE5em1v8WsTSvwQAugEA4AlWIpqQ34eI
Es6ObfMt8Q9gArubFZ0ZDFtmZq9NpA0=
=+z0i
-----END PGP SIGNATURE-----
Merge tag 'dma-mapping-6.16-2025-07-11' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux
Pull dma-mapping fix from Marek Szyprowski:
- small fix relevant to arm64 server and custom CMA configuration (Feng
Tang)
* tag 'dma-mapping-6.16-2025-07-11' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
dma-contiguous: hornor the cma address limit setup by user
This reverts commit ad6b26b6a0.
This commit introduces per-memcg/task NUMA balance statistics, but
unfortunately it introduced a NULL pointer exception due to the following
race condition: After a swap task candidate was chosen, its mm_struct
pointer was set to NULL due to task exit. Later, when performing the
actual task swapping, the p->mm caused the problem.
CPU0 CPU1
:
...
task_numa_migrate
task_numa_find_cpu
task_numa_compare
# a normal task p is chosen
env->best_task = p
# p exit:
exit_signals(p);
p->flags |= PF_EXITING
exit_mm
p->mm = NULL;
migrate_swap_stop
__migrate_swap_task((arg->src_task, arg->dst_cpu)
count_memcg_event_mm(p->mm, NUMA_TASK_SWAP)# p->mm is NULL
task_lock() should be held and the PF_EXITING flag needs to be checked to
prevent this from happening. After discussion, the conclusion was that
adding a lock is not worthwhile for some statistics calculations. Revert
the change and rely on the tracepoint for this purpose.
Link: https://lkml.kernel.org/r/20250704135620.685752-1-yu.c.chen@intel.com
Link: https://lkml.kernel.org/r/20250708064917.BBD13C4CEED@smtp.kernel.org
Fixes: ad6b26b6a0 ("sched/numa: add statistics of numa balance task")
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Reported-by: Jirka Hladky <jhladky@redhat.com>
Closes: https://lore.kernel.org/all/CAE4VaGBLJxpd=NeRJXpSCuw=REhC5LWJpC29kDy-Zh2ZDyzQZA@mail.gmail.com/
Reported-by: Srikanth Aithal <Srikanth.Aithal@amd.com>
Reported-by: Suneeth D <Suneeth.D@amd.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Hladky <jhladky@redhat.com>
Cc: Libo Chen <libo.chen@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Since exit_task_work() runs after perf_event_exit_task_context() updated
ctx->task to TASK_TOMBSTONE, perf_sigtrap() from perf_pending_task() might
observe event->ctx->task == TASK_TOMBSTONE.
Swap the early exit tests in order not to hit WARN_ON_ONCE().
Closes: https://syzkaller.appspot.com/bug?extid=2fe61cb2a86066be6985
Reported-by: syzbot <syzbot+2fe61cb2a86066be6985@syzkaller.appspotmail.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/b1c224bd-97f9-462c-a3e3-125d5e19c983@I-love.SAKURA.ne.jp
The per-CPU data section is handled differently than the other sections.
The memory allocations requires a special __percpu pointer and then the
section is copied into the view of each CPU. Therefore the SHF_ALLOC
flag is removed to ensure move_module() skips it.
Later, relocations are applied and apply_relocations() skips sections
without SHF_ALLOC because they have not been copied. This also skips the
per-CPU data section.
The missing relocations result in a NULL pointer on x86-64 and very
small values on x86-32. This results in a crash because it is not
skipped like NULL pointer would and can't be dereferenced.
Such an assignment happens during static per-CPU lock initialisation
with lockdep enabled.
Allow relocation processing for the per-CPU section even if SHF_ALLOC is
missing.
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202506041623.e45e4f7d-lkp@intel.com
Fixes: 1a6100caae425 ("Don't relocate non-allocated regions in modules.") #v2.6.1-rc3
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Petr Pavlu <petr.pavlu@suse.com>
Link: https://lore.kernel.org/r/20250610163328.URcsSUC1@linutronix.de
Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
Message-ID: <20250610163328.URcsSUC1@linutronix.de>
All error conditions in move_module() set the return value by updating the
ret variable. Therefore, it is not necessary to the initialize the variable
when declaring it.
Remove the unnecessary initialization.
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Daniel Gomez <da.gomez@samsung.com>
Link: https://lore.kernel.org/r/20250618122730.51324-3-petr.pavlu@suse.com
Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
Message-ID: <20250618122730.51324-3-petr.pavlu@suse.com>