scx_watchdog_timeout is written with WRITE_ONCE() in scx_enable():
WRITE_ONCE(scx_watchdog_timeout, timeout);
However, three read-side accesses use plain reads without the matching
READ_ONCE():
/* check_rq_for_timeouts() - L2824 */
last_runnable + scx_watchdog_timeout
/* scx_watchdog_workfn() - L2852 */
scx_watchdog_timeout / 2
/* scx_enable() - L5179 */
scx_watchdog_timeout / 2
The KCSAN documentation requires that if one accessor uses WRITE_ONCE()
to annotate lock-free access, all other accesses must also use the
appropriate accessor. Plain reads alongside WRITE_ONCE() leave the pair
incomplete and can trigger KCSAN warnings.
Note that scx_tick() already uses the correct READ_ONCE() annotation:
last_check + READ_ONCE(scx_watchdog_timeout)
Fix the three remaining plain reads to match, making all accesses to
scx_watchdog_timeout consistently annotated and KCSAN-clean.
Signed-off-by: zhidao su <suzhidao@xiaomi.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
scx_attr_ops_show() and scx_uevent() access scx_root->ops.name directly.
This is problematic for two reasons:
1. The file-level comment explicitly identifies naked scx_root
dereferences as a temporary measure that needs to be replaced
with proper per-instance access.
2. scx_attr_events_show(), the neighboring sysfs show function in
the same group, already uses the correct pattern:
struct scx_sched *sch = container_of(kobj, struct scx_sched, kobj);
Having inconsistent access patterns in the same sysfs/uevent
group is error-prone.
The kobject embedded in struct scx_sched is initialized as:
kobject_init_and_add(&sch->kobj, &scx_ktype, NULL, "root");
so container_of(kobj, struct scx_sched, kobj) correctly retrieves
the owning scx_sched instance in both callbacks.
Replace the naked scx_root dereferences with container_of()-based
access, consistent with scx_attr_events_show() and in preparation
for proper multi-instance scx_sched support.
Signed-off-by: zhidao su <suzhidao@xiaomi.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
scx_bpf_dsq_nr_queued() reads dsq->nr via READ_ONCE() without holding
any lock, making dsq->nr a lock-free concurrently accessed variable.
However, dsq_mod_nr(), the sole writer of dsq->nr, only uses
WRITE_ONCE() on the write side without the matching READ_ONCE() on the
read side:
WRITE_ONCE(dsq->nr, dsq->nr + delta);
^^^^^^^
plain read -- KCSAN data race
The KCSAN documentation requires that if one accessor uses READ_ONCE()
or WRITE_ONCE() on a variable to annotate lock-free access, all other
accesses must also use the appropriate accessor. A plain read on the
right-hand side of WRITE_ONCE() leaves the pair incomplete and will
trigger KCSAN warnings.
Fix by using READ_ONCE() for the read side of the update:
WRITE_ONCE(dsq->nr, READ_ONCE(dsq->nr) + delta);
This is consistent with scx_bpf_dsq_nr_queued() and makes the
concurrent access annotation complete and KCSAN-clean.
Signed-off-by: zhidao su <suzhidao@xiaomi.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
for the common HZ=100, 250 or 1000 cases, only inlining them
for odd HZ values like HZ=300.
The inlining overhead showed up in performance tests of the TCP code.
(Marked as an RFC pull request, as it's not a regression.)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmmkA94RHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1ggmhAAqvNgH1vG+Ad1/r2aZ0iQNOJ3lLL9DKVU
V0Q2iXiRPKvzjXaHB+C2B2GkGgBBxyvdl1wGPpknI/OkoC8aBzK8jSchmfLFfWdx
C1MC0xoZRtaWDQaMYYQ83ZGQqdKHct4ZrwfsPu+g95NGvNg3m/W8p3cZjruknvH3
bKZmbN2wQiSx6+PB7/FkEjV7eAlaskYYdKLNqZzd/62oQ6is4ppAjEAp+X/FidJv
0lUVr7ILx6lZHjWnavUjex2iSvvZ8XqiaYVKlj5EE9cCaPhXDTpkaO8l/ELcjpHL
ZBBV6rpbDo7+Mo3UnUiz+CaYi6XAAOwgj6wZBiBXhVQUAW0PIlMaefDjccWFlDw1
oAcRCg5i3KF8cvrfQYUs4519W52eWOThqQ1fs5ql6P6ycHZ0KTsmaRAbNih811RN
A2VMTkiyX25bXOUQ9e5Y7cYOvDMGGCWiocT6C7Is9gZQMfkj92NDCKURLwRYzzBr
2XDjg46YekGXwy8OamMwXMRcdyUC5fAIaWOaq7IL1K3cgbS2qbZx55Y85+AOfngF
DvFWDIfjslYMZqzQQ+4+MaJitRQ2V6CqdOP3kQbJ3Z6DmIcwi7DkOzqH1hEglb7O
IjXCxjxosZv4iofpxr0FKJPx7KBVSzzxezjMLzeijM+zDdbF4GWFpqRD1ONh/4vm
/1tfaC6TU/E=
=duTI
-----END PGP SIGNATURE-----
Merge tag 'timers-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Ingo Molnar:
"Improve the inlining of jiffies_to_msecs() and jiffies_to_usecs(), for
the common HZ=100, 250 or 1000 cases. Only use a function call for odd
HZ values like HZ=300 that generate more code.
The function call overhead showed up in performance tests of the TCP
code"
* tag 'timers-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
time/jiffies: Inline jiffies_to_msecs() and jiffies_to_usecs()
- Fix zero_vruntime tracking when there's a single task running
- Fix slice protection logic
- Fix the ->vprot logic for reniced tasks
- Fix lag clamping in mixed slice workloads
- Fix objtool uaccess warning (and bug) in the
!CONFIG_RSEQ_SLICE_EXTENSION case caused by unexpected
un-inlining, which triggers with older compilers
- Fix a comment in the rseq registration rseq_size bound check code
- Fix a legacy RSEQ ABI quirk that handled 32-byte area sizes
differently, which special size we now reached naturally and
want to avoid. The visible ugliness of the new reserved field
will be avoided the next time the RSEQ area is extended.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmmkAl0RHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1hdsg/+IdQpmtEXsugE1FqEuuptm0ld6hcFI9WC
mlEXhid5Gq3a3KMBv0CLd73o+k8Ju/BDEdbfLMzY8A9h8OxfnuUL1T6Jt4q7dF1h
76ja1R+i+GNFcXWmSG8z6FUns4bRBJeNWFs3dzCFE9N2qOCCj1xBr/9BqgKvNVfZ
cbcaiMvmi3z/vPUmT8hdMdEcA0Zo2gVcKDmny4Tca9sigyLZD8FqtW1FhqL1HX8H
Cx8fZ2lkD2z6gKtOAbC3QuWVmP88tvZldaMsGHTAQIa14PP5h2xhyuLxBF1Zjnwy
aWl4iYr6ILu3LRi54CmQOiESdEf3Srdbl8JxDcvU9vh8ecqXvDGPUB2xCszlPvOx
R+scskNgNyd1WtUF2VYFLTNkj0B7Xe6eTYfIu2d5r8GrRt0YjRzsK/JQallAkV6V
KORDm4/Xyl5Ss6tNtfZP7lpHD2qykscRGxgr0HjjJCyjA1ZNtGc1A+JKZ8D8q9Nq
rxEbaa65KfAtYJ4i5j9goFPQwNeHXm/emToVzEfyKwZHs3ns0LwffDGSFFOYSm/p
FVVmi9iSoxRvRFHBflvBIwFaCnIyBLTJZlB/Bp8MVaFnv+6OzdE/nfcKNaYqcVaT
mzCpY2DFTx5KISmJR7DAWsPntoRV6WPcxVApWicTaT5G3C2TLvvTAEq8g2WIYDFB
j6oNyEkX/Xw=
=Nxqx
-----END PGP SIGNATURE-----
Merge tag 'sched-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
- Fix zero_vruntime tracking when there's a single task running
- Fix slice protection logic
- Fix the ->vprot logic for reniced tasks
- Fix lag clamping in mixed slice workloads
- Fix objtool uaccess warning (and bug) in the
!CONFIG_RSEQ_SLICE_EXTENSION case caused by unexpected un-inlining,
which triggers with older compilers
- Fix a comment in the rseq registration rseq_size bound check code
- Fix a legacy RSEQ ABI quirk that handled 32-byte area sizes
differently, which special size we now reached naturally and want to
avoid. The visible ugliness of the new reserved field will be avoided
the next time the RSEQ area is extended.
* tag 'sched-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
rseq: slice ext: Ensure rseq feature size differs from original rseq size
rseq: Clarify rseq registration rseq_size bound check comment
sched/core: Fix wakeup_preempt's next_class tracking
rseq: Mark rseq_arm_slice_extension_timer() __always_inline
sched/fair: Fix lag clamp
sched/eevdf: Update se->vprot in reweight_entity()
sched/fair: Only set slice protection at pick time
sched/fair: Fix zero_vruntime tracking
- Fix lock ordering bug found by lockdep in perf_event_wakeup()
- Fix uncore counter enumeration on Granite Rapids and Sierra Forest
- Fix perf_mmap() refcount bug found by Syzkaller
- Fix __perf_event_overflow() vs. perf_remove_from_context() race
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmmj/7MRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1gXPg//V/Qrbnc9jYyyA9ZT9hGg4Oz36HSLLuRe
zcpb0Fndmyjt6Fq0vwqN59UcRM2coJjZ6V3TUyhQJjstnzkmMBsPE3frx+VjUqA6
rfUukgSr0mhlT/OtlBx0hUKBaiPvNe9khnKXLo1mO5aEVkIHryPbPcU/VLG45jE1
sF+dFP1cFVgyNqac8Ai4oNLsoRNQqWAvD0UrYHijJpFE6GqW8rBm2ReASbk/RMjv
s5CqpdLiRFmOoQ1Vwu9iG/wej0OMVJWUEpbmcqysT7UAMdoYtEm79HYON1Ez+ZEx
x6JEV8y3bv01MZc+HmP4mvKDgo5w1zxzNk3Smsx2sscUsZYVcv4zvG9C7UlkSsJ4
uWI6wwAPc1euBAmduTMDEyQr5CkjS3Rdb83s9+I2LtZXCP73+FPEhekbMx9mIhJi
Qw+H6QFeacpFso74vjfK4nGEEz0GbjWaT+VLBSJkwhOd/+/fkWyHsQoU8DPfC8nH
ETMaYGXpW80XRB5ttz/MoJfmXi2ovJsVpyvd06zJE0JzKdiydJsC0d6xGHY9JBCg
07bg0ux8/hX8grNVDWusvw2S15rso3RUOq9uajsTzlr728+hbCVZba87UYlgUNHL
+uA7IyX1WrY9DApXKmOWi9MTRkvdAQz6r43QMk+xkDd6b8JrOOMAJFqYuF8xD5Da
mXy3HKkIKag=
=0JyK
-----END PGP SIGNATURE-----
Merge tag 'perf-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf events fixes from Ingo Molnar:
- Fix lock ordering bug found by lockdep in perf_event_wakeup()
- Fix uncore counter enumeration on Granite Rapids and Sierra Forest
- Fix perf_mmap() refcount bug found by Syzkaller
- Fix __perf_event_overflow() vs perf_remove_from_context() race
* tag 'perf-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Fix __perf_event_overflow() vs perf_remove_from_context() race
perf/core: Fix refcount bug and potential UAF in perf_mmap
perf/x86/intel/uncore: Add per-scheduler IMC CAS count events
perf/core: Fix invalid wait context in ctx_sched_in()
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmmjmDIACgkQ6rmadz2v
bTq3gg//QQLOT/FxP2/dDurliDTXvQRr1tUxmIw6s3P6hnz9j/LLEVKpLRVkqd8t
XEwbubPd1TXDRsJ4f26Ew01YUtf9xi6ZQoMe/BL1okxi0ZwQGGRVMkiKOQgRT+rj
qYSN5JMfPzA2AuM6FjBF/hhw24yVRdgKRYBam6D7XLfFf3s8TOhHHjJ925PqEo0t
uJOy4ddDYB9BcGmfoeyiFgUtpPqcYrKIUCLBdwFvT2fnPJvrFFoCF3t7NS9UJu/O
wd6ZPuGWSOl9A7vSheldP6cJUDX8L/5WEGO4/LjN7plkySF0HNv8uq/b1T3kKqoY
Y3unXerLGJUAA9D5wpYAekx9YmvRTPQ/o39oTbquEB4SSJVU/SPUpvFw7m2Moq10
51yuyXLcPlI3xtk0Bd8c/CESSmkRenjWzsuZQhDGhsR0I9mIaALrhf9LaatHtXI5
f5ct73e+beK7Fc0Ze+b0JxDeFvzA3CKfAF0/fvGt0r9VZjBaMD+a3NnscBlyKztW
UCXazcfndMhNfUUWanktbT5YhYPmY7hzVQEOl7HAMGn4yG6XbXXmzzY6BqEXIucM
etueW2msZJHGBHQGe2RK3lxtmiB7/FglJHd86xebkIU2gCzqt8fGUha8AIuJ4rLS
7wxC33DycCofRGWdseVu7PsTasdhSGsHKbXz2fOFOFESOczYRw8=
=fj3P
-----END PGP SIGNATURE-----
Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Pull bpf fixes from Alexei Starovoitov:
- Fix alignment of arm64 JIT buffer to prevent atomic tearing (Fuad
Tabba)
- Fix invariant violation for single value tnums in the verifier
(Harishankar Vishwanathan, Paul Chaignon)
- Fix a bunch of issues found by ASAN in selftests/bpf (Ihor Solodrai)
- Fix race in devmpa and cpumap on PREEMPT_RT (Jiayuan Chen)
- Fix show_fdinfo of kprobe_multi when cookies are not present (Jiri
Olsa)
- Fix race in freeing special fields in BPF maps to prevent memory
leaks (Kumar Kartikeya Dwivedi)
- Fix OOB read in dmabuf_collector (T.J. Mercier)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: (36 commits)
selftests/bpf: Avoid simplification of crafted bounds test
selftests/bpf: Test refinement of single-value tnum
bpf: Improve bounds when tnum has a single possible value
bpf: Introduce tnum_step to step through tnum's members
bpf: Fix race in devmap on PREEMPT_RT
bpf: Fix race in cpumap on PREEMPT_RT
selftests/bpf: Add tests for special fields races
bpf: Retire rcu_trace_implies_rcu_gp() from local storage
bpf: Delay freeing fields in local storage
bpf: Lose const-ness of map in map_check_btf()
bpf: Register dtor for freeing special fields
selftests/bpf: Fix OOB read in dmabuf_collector
selftests/bpf: Fix a memory leak in xdp_flowtable test
bpf: Fix stack-out-of-bounds write in devmap
bpf: Fix kprobe_multi cookies access in show_fdinfo callback
bpf, arm64: Force 8-byte alignment for JIT buffer to prevent atomic tearing
selftests/bpf: Don't override SIGSEGV handler with ASAN
selftests/bpf: Check BPFTOOL env var in detect_bpftool_path()
selftests/bpf: Fix out-of-bounds array access bugs reported by ASAN
selftests/bpf: Fix array bounds warning in jit_disasm_helpers
...
We're hitting an invariant violation in Cilium that sometimes leads to
BPF programs being rejected and Cilium failing to start [1]. The
following extract from verifier logs shows what's happening:
from 201 to 236: R1=0 R6=ctx() R7=1 R9=scalar(smin=umin=smin32=umin32=3584,smax=umax=smax32=umax32=3840,var_off=(0xe00; 0x100)) R10=fp0
236: R1=0 R6=ctx() R7=1 R9=scalar(smin=umin=smin32=umin32=3584,smax=umax=smax32=umax32=3840,var_off=(0xe00; 0x100)) R10=fp0
; if (magic == MARK_MAGIC_HOST || magic == MARK_MAGIC_OVERLAY || magic == MARK_MAGIC_ENCRYPT) @ bpf_host.c:1337
236: (16) if w9 == 0xe00 goto pc+45 ; R9=scalar(smin=umin=smin32=umin32=3585,smax=umax=smax32=umax32=3840,var_off=(0xe00; 0x100))
237: (16) if w9 == 0xf00 goto pc+1
verifier bug: REG INVARIANTS VIOLATION (false_reg1): range bounds violation u64=[0xe01, 0xe00] s64=[0xe01, 0xe00] u32=[0xe01, 0xe00] s32=[0xe01, 0xe00] var_off=(0xe00, 0x0)
We reach instruction 236 with two possible values for R9, 0xe00 and
0xf00. This is perfectly reflected in the tnum, but of course the ranges
are less accurate and cover [0xe00; 0xf00]. Taking the fallthrough path
at instruction 236 allows the verifier to reduce the range to
[0xe01; 0xf00]. The tnum is however not updated.
With these ranges, at instruction 237, the verifier is not able to
deduce that R9 is always equal to 0xf00. Hence the fallthrough pass is
explored first, the verifier refines the bounds using the assumption
that R9 != 0xf00, and ends up with an invariant violation.
This pattern of impossible branch + bounds refinement is common to all
invariant violations seen so far. The long-term solution is likely to
rely on the refinement + invariant violation check to detect dead
branches, as started by Eduard. To fix the current issue, we need
something with less refactoring that we can backport.
This patch uses the tnum_step helper introduced in the previous patch to
detect the above situation. In particular, three cases are now detected
in the bounds refinement:
1. The u64 range and the tnum only overlap in umin.
u64: ---[xxxxxx]-----
tnum: --xx----------x-
2. The u64 range and the tnum only overlap in the maximum value
represented by the tnum, called tmax.
u64: ---[xxxxxx]-----
tnum: xx-----x--------
3. The u64 range and the tnum only overlap in between umin (excluded)
and umax.
u64: ---[xxxxxx]-----
tnum: xx----x-------x-
To detect these three cases, we call tnum_step(tnum, umin), which
returns the smallest member of the tnum greater than umin, called
tnum_next here. We're in case (1) if umin is part of the tnum and
tnum_next is greater than umax. We're in case (2) if umin is not part of
the tnum and tnum_next is equal to tmax. Finally, we're in case (3) if
umin is not part of the tnum, tnum_next is inferior or equal to umax,
and calling tnum_step a second time gives us a value past umax.
This change implements these three cases. With it, the above bytecode
looks as follows:
0: (85) call bpf_get_prandom_u32#7 ; R0=scalar()
1: (47) r0 |= 3584 ; R0=scalar(smin=0x8000000000000e00,umin=umin32=3584,smin32=0x80000e00,var_off=(0xe00; 0xfffffffffffff1ff))
2: (57) r0 &= 3840 ; R0=scalar(smin=umin=smin32=umin32=3584,smax=umax=smax32=umax32=3840,var_off=(0xe00; 0x100))
3: (15) if r0 == 0xe00 goto pc+2 ; R0=3840
4: (15) if r0 == 0xf00 goto pc+1
4: R0=3840
6: (95) exit
In addition to the new selftests, this change was also verified with
Agni [3]. For the record, the raw SMT is available at [4]. The property
it verifies is that: If a concrete value x is contained in all input
abstract values, after __update_reg_bounds, it will continue to be
contained in all output abstract values.
Link: https://github.com/cilium/cilium/issues/44216 [1]
Link: https://pchaigno.github.io/test-verifier-complexity.html [2]
Link: https://github.com/bpfverif/agni [3]
Link: https://pastebin.com/raw/naCfaqNx [4]
Fixes: 0df1a55afa ("bpf: Warn on internal verifier errors")
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Tested-by: Marco Schirrmeister <mschirrmeister@gmail.com>
Co-developed-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com>
Signed-off-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Link: https://lore.kernel.org/r/ef254c4f68be19bd393d450188946821c588565d.1772225741.git.paul.chaignon@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This commit introduces tnum_step(), a function that, when given t, and a
number z returns the smallest member of t larger than z. The number z
must be greater or equal to the smallest member of t and less than the
largest member of t.
The first step is to compute j, a number that keeps all of t's known
bits, and matches all unknown bits to z's bits. Since j is a member of
the t, it is already a candidate for result. However, we want our result
to be (minimally) greater than z.
There are only two possible cases:
(1) Case j <= z. In this case, we want to increase the value of j and
make it > z.
(2) Case j > z. In this case, we want to decrease the value of j while
keeping it > z.
(Case 1) j <= z
t = xx11x0x0
z = 10111101 (189)
j = 10111000 (184)
^
k
(Case 1.1) Let's first consider the case where j < z. We will address j
== z later.
Since z > j, there had to be a bit position that was 1 in z and a 0 in
j, beyond which all positions of higher significance are equal in j and
z. Further, this position could not have been unknown in a, because the
unknown positions of a match z. This position had to be a 1 in z and
known 0 in t.
Let k be position of the most significant 1-to-0 flip. In our example, k
= 3 (starting the count at 1 at the least significant bit). Setting (to
1) the unknown bits of t in positions of significance smaller than
k will not produce a result > z. Hence, we must set/unset the unknown
bits at positions of significance higher than k. Specifically, we look
for the next larger combination of 1s and 0s to place in those
positions, relative to the combination that exists in z. We can achieve
this by concatenating bits at unknown positions of t into an integer,
adding 1, and writing the bits of that result back into the
corresponding bit positions previously extracted from z.
>From our example, considering only positions of significance greater
than k:
t = xx..x
z = 10..1
+ 1
-----
11..0
This is the exact combination 1s and 0s we need at the unknown bits of t
in positions of significance greater than k. Further, our result must
only increase the value minimally above z. Hence, unknown bits in
positions of significance smaller than k should remain 0. We finally
have,
result = 11110000 (240)
(Case 1.2) Now consider the case when j = z, for example
t = 1x1x0xxx
z = 10110100 (180)
j = 10110100 (180)
Matching the unknown bits of the t to the bits of z yielded exactly z.
To produce a number greater than z, we must set/unset the unknown bits
in t, and *all* the unknown bits of t candidates for being set/unset. We
can do this similar to Case 1.1, by adding 1 to the bits extracted from
the masked bit positions of z. Essentially, this case is equivalent to
Case 1.1, with k = 0.
t = 1x1x0xxx
z = .0.1.100
+ 1
---------
.0.1.101
This is the exact combination of bits needed in the unknown positions of
t. After recalling the known positions of t, we get
result = 10110101 (181)
(Case 2) j > z
t = x00010x1
z = 10000010 (130)
j = 10001011 (139)
^
k
Since j > z, there had to be a bit position which was 0 in z, and a 1 in
j, beyond which all positions of higher significance are equal in j and
z. This position had to be a 0 in z and known 1 in t. Let k be the
position of the most significant 0-to-1 flip. In our example, k = 4.
Because of the 0-to-1 flip at position k, a member of t can become
greater than z if the bits in positions greater than k are themselves >=
to z. To make that member *minimally* greater than z, the bits in
positions greater than k must be exactly = z. Hence, we simply match all
of t's unknown bits in positions more significant than k to z's bits. In
positions less significant than k, we set all t's unknown bits to 0
to retain minimality.
In our example, in positions of greater significance than k (=4),
t=x000. These positions are matched with z (1000) to produce 1000. In
positions of lower significance than k, t=10x1. All unknown bits are set
to 0 to produce 1001. The final result is:
result = 10001001 (137)
This concludes the computation for a result > z that is a member of t.
The procedure for tnum_step() in this commit implements the idea
described above. As a proof of correctness, we verified the algorithm
against a logical specification of tnum_step. The specification asserts
the following about the inputs t, z and output res that:
1. res is a member of t, and
2. res is strictly greater than z, and
3. there does not exist another value res2 such that
3a. res2 is also a member of t, and
3b. res2 is greater than z
3c. res2 is smaller than res
We checked the implementation against this logical specification using
an SMT solver. The verification formula in SMTLIB format is available
at [1]. The verification returned an "unsat": indicating that no input
assignment exists for which the implementation and the specification
produce different outputs.
In addition, we also automatically generated the logical encoding of the
C implementation using Agni [2] and verified it against the same
specification. This verification also returned an "unsat", confirming
that the implementation is equivalent to the specification. The formula
for this check is also available at [3].
Link: https://pastebin.com/raw/2eRWbiit [1]
Link: https://github.com/bpfverif/agni [2]
Link: https://pastebin.com/raw/EztVbBJ2 [3]
Co-developed-by: Srinivas Narayana <srinivas.narayana@rutgers.edu>
Signed-off-by: Srinivas Narayana <srinivas.narayana@rutgers.edu>
Co-developed-by: Santosh Nagarakatte <santosh.nagarakatte@rutgers.edu>
Signed-off-by: Santosh Nagarakatte <santosh.nagarakatte@rutgers.edu>
Signed-off-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com>
Link: https://lore.kernel.org/r/93fdf71910411c0f19e282ba6d03b4c65f9c5d73.1772225741.git.paul.chaignon@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
On PREEMPT_RT kernels, the per-CPU xdp_dev_bulk_queue (bq) can be
accessed concurrently by multiple preemptible tasks on the same CPU.
The original code assumes bq_enqueue() and __dev_flush() run atomically
with respect to each other on the same CPU, relying on
local_bh_disable() to prevent preemption. However, on PREEMPT_RT,
local_bh_disable() only calls migrate_disable() (when
PREEMPT_RT_NEEDS_BH_LOCK is not set) and does not disable
preemption, which allows CFS scheduling to preempt a task during
bq_xmit_all(), enabling another task on the same CPU to enter
bq_enqueue() and operate on the same per-CPU bq concurrently.
This leads to several races:
1. Double-free / use-after-free on bq->q[]: bq_xmit_all() snapshots
cnt = bq->count, then iterates bq->q[0..cnt-1] to transmit frames.
If preempted after the snapshot, a second task can call bq_enqueue()
-> bq_xmit_all() on the same bq, transmitting (and freeing) the
same frames. When the first task resumes, it operates on stale
pointers in bq->q[], causing use-after-free.
2. bq->count and bq->q[] corruption: concurrent bq_enqueue() modifying
bq->count and bq->q[] while bq_xmit_all() is reading them.
3. dev_rx/xdp_prog teardown race: __dev_flush() clears bq->dev_rx and
bq->xdp_prog after bq_xmit_all(). If preempted between
bq_xmit_all() return and bq->dev_rx = NULL, a preempting
bq_enqueue() sees dev_rx still set (non-NULL), skips adding bq to
the flush_list, and enqueues a frame. When __dev_flush() resumes,
it clears dev_rx and removes bq from the flush_list, orphaning the
newly enqueued frame.
4. __list_del_clearprev() on flush_node: similar to the cpumap race,
both tasks can call __list_del_clearprev() on the same flush_node,
the second dereferences the prev pointer already set to NULL.
The race between task A (__dev_flush -> bq_xmit_all) and task B
(bq_enqueue -> bq_xmit_all) on the same CPU:
Task A (xdp_do_flush) Task B (ndo_xdp_xmit redirect)
---------------------- --------------------------------
__dev_flush(flush_list)
bq_xmit_all(bq)
cnt = bq->count /* e.g. 16 */
/* start iterating bq->q[] */
<-- CFS preempts Task A -->
bq_enqueue(dev, xdpf)
bq->count == DEV_MAP_BULK_SIZE
bq_xmit_all(bq, 0)
cnt = bq->count /* same 16! */
ndo_xdp_xmit(bq->q[])
/* frames freed by driver */
bq->count = 0
<-- Task A resumes -->
ndo_xdp_xmit(bq->q[])
/* use-after-free: frames already freed! */
Fix this by adding a local_lock_t to xdp_dev_bulk_queue and acquiring
it in bq_enqueue() and __dev_flush(). These paths already run under
local_bh_disable(), so use local_lock_nested_bh() which on non-RT is
a pure annotation with no overhead, and on PREEMPT_RT provides a
per-CPU sleeping lock that serializes access to the bq.
Fixes: 3253cb49cb ("softirq: Allow to drop the softirq-BKL lock on PREEMPT_RT")
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Link: https://lore.kernel.org/r/20260225121459.183121-3-jiayuan.chen@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
On PREEMPT_RT kernels, the per-CPU xdp_bulk_queue (bq) can be accessed
concurrently by multiple preemptible tasks on the same CPU.
The original code assumes bq_enqueue() and __cpu_map_flush() run
atomically with respect to each other on the same CPU, relying on
local_bh_disable() to prevent preemption. However, on PREEMPT_RT,
local_bh_disable() only calls migrate_disable() (when
PREEMPT_RT_NEEDS_BH_LOCK is not set) and does not disable
preemption, which allows CFS scheduling to preempt a task during
bq_flush_to_queue(), enabling another task on the same CPU to enter
bq_enqueue() and operate on the same per-CPU bq concurrently.
This leads to several races:
1. Double __list_del_clearprev(): after bq->count is reset in
bq_flush_to_queue(), a preempting task can call bq_enqueue() ->
bq_flush_to_queue() on the same bq when bq->count reaches
CPU_MAP_BULK_SIZE. Both tasks then call __list_del_clearprev()
on the same bq->flush_node, the second call dereferences the
prev pointer that was already set to NULL by the first.
2. bq->count and bq->q[] races: concurrent bq_enqueue() can corrupt
the packet queue while bq_flush_to_queue() is processing it.
The race between task A (__cpu_map_flush -> bq_flush_to_queue) and
task B (bq_enqueue -> bq_flush_to_queue) on the same CPU:
Task A (xdp_do_flush) Task B (cpu_map_enqueue)
---------------------- ------------------------
bq_flush_to_queue(bq)
spin_lock(&q->producer_lock)
/* flush bq->q[] to ptr_ring */
bq->count = 0
spin_unlock(&q->producer_lock)
bq_enqueue(rcpu, xdpf)
<-- CFS preempts Task A --> bq->q[bq->count++] = xdpf
/* ... more enqueues until full ... */
bq_flush_to_queue(bq)
spin_lock(&q->producer_lock)
/* flush to ptr_ring */
spin_unlock(&q->producer_lock)
__list_del_clearprev(flush_node)
/* sets flush_node.prev = NULL */
<-- Task A resumes -->
__list_del_clearprev(flush_node)
flush_node.prev->next = ...
/* prev is NULL -> kernel oops */
Fix this by adding a local_lock_t to xdp_bulk_queue and acquiring it
in bq_enqueue() and __cpu_map_flush(). These paths already run under
local_bh_disable(), so use local_lock_nested_bh() which on non-RT is
a pure annotation with no overhead, and on PREEMPT_RT provides a
per-CPU sleeping lock that serializes access to the bq.
To reproduce, insert an mdelay(100) between bq->count = 0 and
__list_del_clearprev() in bq_flush_to_queue(), then run reproducer
provided by syzkaller.
Fixes: 3253cb49cb ("softirq: Allow to drop the softirq-BKL lock on PREEMPT_RT")
Reported-by: syzbot+2b3391f44313b3983e91@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/69369331.a70a0220.38f243.009d.GAE@google.com/T/
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Link: https://lore.kernel.org/r/20260225121459.183121-2-jiayuan.chen@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This assumption will always hold going forward, hence just remove the
various checks and assume it is true with a comment for the uninformed
reader.
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260227224806.646888-5-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Currently, when use_kmalloc_nolock is false, the freeing of fields for a
local storage selem is done eagerly before waiting for the RCU or RCU
tasks trace grace period to elapse. This opens up a window where the
program which has access to the selem can recreate the fields after the
freeing of fields is done eagerly, causing memory leaks when the element
is finally freed and returned to the kernel.
Make a few changes to address this. First, delay the freeing of fields
until after the grace periods have expired using a __bpf_selem_free_rcu
wrapper which is eventually invoked after transitioning through the
necessary number of grace period waits. Replace usage of the kfree_rcu
with call_rcu to be able to take a custom callback. Finally, care needs
to be taken to extend the rcu barriers for all cases, and not just when
use_kmalloc_nolock is true, as RCU and RCU tasks trace callbacks can be
in flight for either case and access the smap field, which is used to
obtain the BTF record to walk over special fields in the map value.
While we're at it, drop migrate_disable() from bpf_selem_free_rcu, since
migration should be disabled for RCU callbacks already.
Fixes: 9bac675e63 ("bpf: Postpone bpf_obj_free_fields to the rcu callback")
Reviewed-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260227224806.646888-4-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
BPF hash map may now use the map_check_btf() callback to decide whether
to set a dtor on its bpf_mem_alloc or not. Unlike C++ where members can
opt out of const-ness using mutable, we must lose the const qualifier on
the callback such that we can avoid the ugly cast. Make the change and
adjust all existing users, and lose the comment in hashtab.c.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260227224806.646888-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
There is a race window where BPF hash map elements can leak special
fields if the program with access to the map value recreates these
special fields between the check_and_free_fields done on the map value
and its eventual return to the memory allocator.
Several ways were explored prior to this patch, most notably [0] tried
to use a poison value to reject attempts to recreate special fields for
map values that have been logically deleted but still accessible to BPF
programs (either while sitting in the free list or when reused). While
this approach works well for task work, timers, wq, etc., it is harder
to apply the idea to kptrs, which have a similar race and failure mode.
Instead, we change bpf_mem_alloc to allow registering destructor for
allocated elements, such that when they are returned to the allocator,
any special fields created while they were accessible to programs in the
mean time will be freed. If these values get reused, we do not free the
fields again before handing the element back. The special fields thus
may remain initialized while the map value sits in a free list.
When bpf_mem_alloc is retired in the future, a similar concept can be
introduced to kmalloc_nolock-backed kmem_cache, paired with the existing
idea of a constructor.
Note that the destructor registration happens in map_check_btf, after
the BTF record is populated and (at that point) avaiable for inspection
and duplication. Duplication is necessary since the freeing of embedded
bpf_mem_alloc can be decoupled from actual map lifetime due to logic
introduced to reduce the cost of rcu_barrier()s in mem alloc free path in
9f2c6e96c6 ("bpf: Optimize rcu_barrier usage between hash map and bpf_mem_alloc.").
As such, once all callbacks are done, we must also free the duplicated
record. To remove dependency on the bpf_map itself, also stash the key
size of the map to obtain value from htab_elem long after the map is
gone.
[0]: https://lore.kernel.org/bpf/20260216131341.1285427-1-mykyta.yatsenko5@gmail.com
Fixes: 14a324f6a6 ("bpf: Wire up freeing of referenced kptr")
Fixes: 1bfbc267ec ("bpf: Enable bpf_timer and bpf_wq in any context")
Reported-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: syzbot@syzkaller.appspotmail.com
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260227224806.646888-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Even privileged services should not necessarily be able to see other
privileged service's namespaces so they can't leak information to each
other. Use may_see_all_namespaces() helper that centralizes this policy
until the nstree adapts.
Link: https://patch.msgid.link/20260226-work-visibility-fixes-v1-3-d2c2853313bd@kernel.org
Fixes: 76b6f5dfb3 ("nstree: add listns()")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Cc: stable@kernel.org # v6.19+
Signed-off-by: Christian Brauner <brauner@kernel.org>
Even privileged services should not necessarily be able to see other
privileged service's namespaces so they can't leak information to each
other. Use may_see_all_namespaces() helper that centralizes this policy
until the nstree adapts.
Link: https://patch.msgid.link/20260226-work-visibility-fixes-v1-1-d2c2853313bd@kernel.org
Fixes: a1d220d9da ("nsfs: iterate through mount namespaces")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Cc: stable@kernel.org # v6.12+
Signed-off-by: Christian Brauner <brauner@kernel.org>
All are singletons - please see the changelogs for details.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaaDF4AAKCRDdBJ7gKXxA
jhv5AQDv+B9rPkFJ0dlSS/hXqsDGqy3dGj/grJM0dw7LhkPHzgEAi/bV6D1jx0k3
k0hcP3JUxE54+a7liLadPDLIObOMLgo=
=R1ap
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2026-02-26-14-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"12 hotfixes. 7 are cc:stable. 8 are for MM.
All are singletons - please see the changelogs for details"
* tag 'mm-hotfixes-stable-2026-02-26-14-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
MAINTAINERS: update Yosry Ahmed's email address
mailmap: add entry for Daniele Alessandrelli
mm: fix NULL NODE_DATA dereference for memoryless nodes on boot
mm/tracing: rss_stat: ensure curr is false from kthread context
mm/kfence: fix KASAN hardware tag faults during late enablement
mm/damon/core: disallow non-power of two min_region_sz
Squashfs: check metadata block offset is within range
MAINTAINERS, mailmap: update e-mail address for Vlastimil Babka
liveupdate: luo_file: remember retrieve() status
mm: thp: deny THP for files on anonymous inodes
mm: change vma_alloc_folio_noprof() macro to inline function
mm/kfence: disable KFENCE upon KASAN HW tags enablement
A set of two fixes for DMA-mapping subsystem for the recently merged API
rework (Jiri Pirko and Stian Halseth).
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSrngzkoBtlA8uaaJ+Jp1EFxbsSRAUCaaDJnAAKCRCJp1EFxbsS
RBBTAP0WhFXOPeheDasVySviHbXxEdP2uc9/XgbnW40/HwWGRQD+KLkYSJDgGegv
v+fbLYNkZPgtOKAztp70imOOBM0fGQY=
=Q/dC
-----END PGP SIGNATURE-----
Merge tag 'dma-mapping-7.0-2026-02-26' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux
Pull dma-mapping fixes from Marek Szyprowski:
"Two DMA-mapping fixes for the recently merged API rework (Jiri Pirko
and Stian Halseth)"
* tag 'dma-mapping-7.0-2026-02-26' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
sparc: Fix page alignment in dma mapping
dma-mapping: avoid random addr value print out on error path
SCX_EFLAG_INITIALIZED is the sole member of enum scx_exit_flags with no
explicit value, so the compiler assigns it 0. This makes the bitwise OR
in scx_ops_init() a no-op:
sch->exit_info->flags |= SCX_EFLAG_INITIALIZED; /* |= 0 */
As a result, BPF schedulers cannot distinguish whether ops.init()
completed successfully by inspecting exit_info->flags.
Assign the value 1LLU << 0 so the flag is actually set.
Fixes: f3aec2adce ("sched_ext: Add SCX_EFLAG_INITIALIZED to indicate successful ops.init()")
Signed-off-by: David Carlier <devnexen@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
get_upper_ifindexes() iterates over all upper devices and writes their
indices into an array without checking bounds.
Also the callers assume that the max number of upper devices is
MAX_NEST_DEV and allocate excluded_devices[1+MAX_NEST_DEV] on the stack,
but that assumption is not correct and the number of upper devices could
be larger than MAX_NEST_DEV (e.g., many macvlans), causing a
stack-out-of-bounds write.
Add a max parameter to get_upper_ifindexes() to avoid the issue.
When there are too many upper devices, return -EOVERFLOW and abort the
redirect.
To reproduce, create more than MAX_NEST_DEV(8) macvlans on a device with
an XDP program attached using BPF_F_BROADCAST | BPF_F_EXCLUDE_INGRESS.
Then send a packet to the device to trigger the XDP redirect path.
Reported-by: syzbot+10cc7f13760b31bd2e61@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/698c4ce3.050a0220.340abe.000b.GAE@google.com/T/
Fixes: aeea1b86f9 ("bpf, devmap: Exclude XDP broadcast to master device")
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Kohei Enju <kohei@enjuk.jp>
Link: https://lore.kernel.org/r/20260225053506.4738-1-kohei@enjuk.jp
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
- Fix pointer-to-array allocation types for ubd and kcsan
- Force size overflow helpers to __always_inline
- Bump __builtin_counted_by_ref to Clang 22.1 from 22.0 (Nathan Chancellor)
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRSPkdeREjth1dHnSE2KwveOeQkuwUCaaCJawAKCRA2KwveOeQk
u2LdAQD/wZYe1YCCrUR+jM6EvAoI2CzUEQUQMyzLtjIyeSR6VAD8CSdZu0htnwwy
Ca76IzSF8Aj0D7Hytxkk8HDAD4WuxA0=
=QucI
-----END PGP SIGNATURE-----
Merge tag 'kmalloc_obj-v7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull kmalloc_obj fixes from Kees Cook:
- Fix pointer-to-array allocation types for ubd and kcsan
- Force size overflow helpers to __always_inline
- Bump __builtin_counted_by_ref to Clang 22.1 from 22.0 (Nathan Chancellor)
* tag 'kmalloc_obj-v7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
kcsan: test: Adjust "expect" allocation type for kmalloc_obj
overflow: Make sure size helpers are always inlined
init/Kconfig: Adjust fixed clang version for __builtin_counted_by_ref
ubd: Use pointer-to-pointers for io_thread_req arrays
The call to kmalloc_obj(observed.lines) returns "char (*)[3][512]",
a pointer to the whole 2D array. But "expect" wants to be "char (*)[512]",
the decayed pointer type, as if it were observed.lines itself (though
without the "3" bounds). This produces the following build error:
../kernel/kcsan/kcsan_test.c: In function '__report_matches':
../kernel/kcsan/kcsan_test.c:171:16: error: assignment to 'char (*)[512]' from incompatible pointer type 'char (*)[3][512]'
[-Wincompatible-pointer-types]
171 | expect = kmalloc_obj(observed.lines);
| ^
Instead of changing the "expect" type to "char (*)[3][512]" and
requiring a dereference at each use (e.g. "(expect*)[0]"), just
explicitly cast the return to the desired type.
Note that I'm intentionally not switching back to byte-based "kmalloc"
here because I cannot find a way for the Coccinelle script (which will
be used going forward to catch future conversions) to exclude this case.
Tested with:
$ ./tools/testing/kunit/kunit.py run \
--kconfig_add CONFIG_DEBUG_KERNEL=y \
--kconfig_add CONFIG_KCSAN=y \
--kconfig_add CONFIG_KCSAN_KUNIT_TEST=y \
--arch=x86_64 --qemu_args '-smp 2' kcsan
Reported-by: Nathan Chancellor <nathan@kernel.org>
Fixes: 69050f8d6d ("treewide: Replace kmalloc with kmalloc_obj for non-scalar types")
Signed-off-by: Kees Cook <kees@kernel.org>
Guillaume reported crashes via corrupted RCU callback function pointers
during KUnit testing. The crash was traced back to the pidfs rhashtable
conversion which replaced the 24-byte rb_node with an 8-byte rhash_head
in struct pid, shrinking it from 160 to 144 bytes.
struct kthread (without CONFIG_BLK_CGROUP) is also 144 bytes. With
CONFIG_SLAB_MERGE_DEFAULT and SLAB_HWCACHE_ALIGN both round up to
192 bytes and share the same slab cache. struct pid.rcu.func and
struct kthread.affinity_node both sit at offset 0x78.
When a kthread exits via make_task_dead() it bypasses kthread_exit() and
misses the affinity_node cleanup. free_kthread_struct() frees the memory
while the node is still linked into the global kthread_affinity_list. A
subsequent list_del() by another kthread writes through dangling list
pointers into the freed and reused memory, corrupting the pid's
rcu.func pointer.
Instead of patching free_kthread_struct() to handle the missed cleanup,
consolidate all kthread exit paths. Turn kthread_exit() into a macro
that calls do_exit() and add kthread_do_exit() which is called from
do_exit() for any task with PF_KTHREAD set. This guarantees that
kthread-specific cleanup always happens regardless of the exit path -
make_task_dead(), direct do_exit(), or kthread_exit().
Replace __to_kthread() with a new tsk_is_kthread() accessor in the
public header. Export do_exit() since module code using the
kthread_exit() macro now needs it directly.
Reported-by: Guillaume Tucker <gtucker@gtucker.io>
Tested-by: Guillaume Tucker <gtucker@gtucker.io>
Tested-by: Mark Brown <broonie@kernel.org>
Tested-by: David Gow <davidgow@google.com>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/all/20260224-mittlerweile-besessen-2738831ae7f6@brauner
Co-developed-by: Linus Torvalds <torvalds@linux-foundation.org>
Fixes: 4d13f4304f ("kthread: Implement preferred affinity")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
scx_idle_node_masks is allocated with num_possible_nodes() elements but
indexed by NUMA node IDs via for_each_node(). On systems with
non-contiguous NUMA node numbering (e.g. nodes 0 and 4), node IDs can
exceed the array size, causing out-of-bounds memory corruption.
Use nr_node_ids instead, which represents the maximum node ID range and
is the correct size for arrays indexed by node ID.
Fixes: 7c60329e3521 ("sched_ext: Add NUMA-awareness to the default idle selection policy")
Signed-off-by: David Carlier <devnexen@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Please consider pulling these changes from the signed vfs-7.0-rc2.fixes tag.
Thanks!
Christian
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaZ7xWAAKCRCRxhvAZXjc
onpeAP4qOrTURIAX9M/NGCHywvjI91ZJt20J6vm0X6KbVV/ebQD/eoJ21xzPhG9M
gN7oRcZ9SW3e/AdtdnlqB0PEP+cyGwM=
=9Ji+
-----END PGP SIGNATURE-----
Merge tag 'vfs-7.0-rc2.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
- Fix an uninitialized variable in file_getattr().
The flags_valid field wasn't initialized before calling
vfs_fileattr_get(), triggering KMSAN uninit-value reports in fuse
- Fix writeback wakeup and logging timeouts when DETECT_HUNG_TASK is
not enabled.
sysctl_hung_task_timeout_secs is 0 in that case causing spurious
"waiting for writeback completion for more than 1 seconds" warnings
- Fix a null-ptr-deref in do_statmount() when the mount is internal
- Add missing kernel-doc description for the @private parameter in
iomap_readahead()
- Fix mount namespace creation to hold namespace_sem across the mount
copy in create_new_namespace().
The previous drop-and-reacquire pattern was fragile and failed to
clean up mount propagation links if the real rootfs was a shared or
dependent mount
- Fix /proc mount iteration where m->index wasn't updated when
m->show() overflows, causing a restart to repeatedly show the same
mount entry in a rapidly expanding mount table
- Return EFSCORRUPTED instead of ENOSPC in minix_new_inode() when the
inode number is out of range
- Fix unshare(2) when CLONE_NEWNS is set and current->fs isn't shared.
copy_mnt_ns() received the live fs_struct so if a subsequent
namespace creation failed the rollback would leave pwd and root
pointing to detached mounts. Always allocate a new fs_struct when
CLONE_NEWNS is requested
- fserror bug fixes:
- Remove the unused fsnotify_sb_error() helper now that all callers
have been converted to fserror_report_metadata
- Fix a lockdep splat in fserror_report() where igrab() takes
inode::i_lock which can be held in IRQ context.
Replace igrab() with a direct i_count bump since filesystems
should not report inodes that are about to be freed or not yet
exposed
- Handle error pointer in procfs for try_lookup_noperm()
- Fix an integer overflow in ep_loop_check_proc() where recursive calls
returning INT_MAX would overflow when +1 is added, breaking the
recursion depth check
- Fix a misleading break in pidfs
* tag 'vfs-7.0-rc2.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
pidfs: avoid misleading break
eventpoll: Fix integer overflow in ep_loop_check_proc()
proc: Fix pointer error dereference
fserror: fix lockdep complaint when igrabbing inode
fsnotify: drop unused helper
unshare: fix unshare_fs() handling
minix: Correct errno in minix_new_inode
namespace: fix proc mount iteration
mount: hold namespace_sem across copy in create_new_namespace()
iomap: Describe @private in iomap_readahead()
statmount: Fix the null-ptr-deref in do_statmount()
writeback: Fix wakeup and logging timeouts for !DETECT_HUNG_TASK
fs: init flags_valid before calling vfs_fileattr_get
Make sure that __perf_event_overflow() runs with IRQs disabled for all
possible callchains. Specifically the software events can end up running
it with only preemption disabled.
This opens up a race vs perf_event_exit_event() and friends that will go
and free various things the overflow path expects to be present, like
the BPF program.
Fixes: 592903cdcb ("perf_counter: add an event_list")
Reported-by: Simond Hu <cmdhh1767@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Simond Hu <cmdhh1767@gmail.com>
Link: https://patch.msgid.link/20260224122909.GV1395416@noisy.programming.kicks-ass.net
scx_claim_exit() atomically sets exit_kind, which prevents scx_error() from
triggering further error handling. After claiming exit, the caller must kick
the helper kthread work which initiates bypass mode and teardown.
If the calling task gets preempted between claiming exit and kicking the
helper work, and the BPF scheduler fails to schedule it back (since error
handling is now disabled), the helper work is never queued, bypass mode
never activates, tasks stop being dispatched, and the system wedges.
Disable preemption across scx_claim_exit() and the subsequent work kicking
in all callers - scx_disable() and scx_vexit(). Add
lockdep_assert_preemption_disabled() to scx_claim_exit() to enforce the
requirement.
Fixes: f0e1a0643a ("sched_ext: Implement BPF extensible scheduler class")
Cc: stable@vger.kernel.org # v6.12+
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
LUO keeps track of successful retrieve attempts on a LUO file. It does so
to avoid multiple retrievals of the same file. Multiple retrievals cause
problems because once the file is retrieved, the serialized data
structures are likely freed and the file is likely in a very different
state from what the code expects.
The retrieve boolean in struct luo_file keeps track of this, and is passed
to the finish callback so it knows what work was already done and what it
has left to do.
All this works well when retrieve succeeds. When it fails,
luo_retrieve_file() returns the error immediately, without ever storing
anywhere that a retrieve was attempted or what its error code was. This
results in an errored LIVEUPDATE_SESSION_RETRIEVE_FD ioctl to userspace,
but nothing prevents it from trying this again.
The retry is problematic for much of the same reasons listed above. The
file is likely in a very different state than what the retrieve logic
normally expects, and it might even have freed some serialization data
structures. Attempting to access them or free them again is going to
break things.
For example, if memfd managed to restore 8 of its 10 folios, but fails on
the 9th, a subsequent retrieve attempt will try to call
kho_restore_folio() on the first folio again, and that will fail with a
warning since it is an invalid operation.
Apart from the retry, finish() also breaks. Since on failure the
retrieved bool in luo_file is never touched, the finish() call on session
close will tell the file handler that retrieve was never attempted, and it
will try to access or free the data structures that might not exist, much
in the same way as the retry attempt.
There is no sane way of attempting the retrieve again. Remember the error
retrieve returned and directly return it on a retry. Also pass this
status code to finish() so it can make the right decision on the work it
needs to do.
This is done by changing the bool to an integer. A value of 0 means
retrieve was never attempted, a positive value means it succeeded, and a
negative value means it failed and the error code is the value.
Link: https://lkml.kernel.org/r/20260216132221.987987-1-pratyush@kernel.org
Fixes: 7c722a7f44 ("liveupdate: luo_file: implement file systems callbacks")
Signed-off-by: Pratyush Yadav (Google) <pratyush@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The current cpuset partition code is able to dynamically update
the sched domains of a running system and the corresponding
HK_TYPE_DOMAIN housekeeping cpumask to perform what is essentially the
"isolcpus=domain,..." boot command line feature at run time.
The housekeeping cpumask update requires flushing a number of different
workqueues which may not be safe with cpus_read_lock() held as the
workqueue flushing code may acquire cpus_read_lock() or acquiring locks
which have locking dependency with cpus_read_lock() down the chain. Below
is an example of such circular locking problem.
======================================================
WARNING: possible circular locking dependency detected
6.18.0-test+ #2 Tainted: G S
------------------------------------------------------
test_cpuset_prs/10971 is trying to acquire lock:
ffff888112ba4958 ((wq_completion)sync_wq){+.+.}-{0:0}, at: touch_wq_lockdep_map+0x7a/0x180
but task is already holding lock:
ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #4 (cpuset_mutex){+.+.}-{4:4}:
-> #3 (cpu_hotplug_lock){++++}-{0:0}:
-> #2 (rtnl_mutex){+.+.}-{4:4}:
-> #1 ((work_completion)(&arg.work)){+.+.}-{0:0}:
-> #0 ((wq_completion)sync_wq){+.+.}-{0:0}:
Chain exists of:
(wq_completion)sync_wq --> cpu_hotplug_lock --> cpuset_mutex
5 locks held by test_cpuset_prs/10971:
#0: ffff88816810e440 (sb_writers#7){.+.+}-{0:0}, at: ksys_write+0xf9/0x1d0
#1: ffff8891ab620890 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x260/0x5f0
#2: ffff8890a78b83e8 (kn->active#187){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x2b6/0x5f0
#3: ffffffffadf32900 (cpu_hotplug_lock){++++}-{0:0}, at: cpuset_partition_write+0x77/0x130
#4: ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130
Call Trace:
<TASK>
:
touch_wq_lockdep_map+0x93/0x180
__flush_workqueue+0x111/0x10b0
housekeeping_update+0x12d/0x2d0
update_parent_effective_cpumask+0x595/0x2440
update_prstate+0x89d/0xce0
cpuset_partition_write+0xc5/0x130
cgroup_file_write+0x1a5/0x680
kernfs_fop_write_iter+0x3df/0x5f0
vfs_write+0x525/0xfd0
ksys_write+0xf9/0x1d0
do_syscall_64+0x95/0x520
entry_SYSCALL_64_after_hwframe+0x76/0x7e
To avoid such a circular locking dependency problem, we have to
call housekeeping_update() without holding the cpus_read_lock() and
cpuset_mutex. The current set of wq's flushed by housekeeping_update()
may not have work functions that call cpus_read_lock() directly,
but we are likely to extend the list of wq's that are flushed in the
future. Moreover, the current set of work functions may hold locks that
may have cpu_hotplug_lock down the dependency chain.
So housekeeping_update() is now called after releasing cpus_read_lock
and cpuset_mutex at the end of a cpuset operation. These two locks are
then re-acquired later before calling rebuild_sched_domains_locked().
To enable mutual exclusion between the housekeeping_update() call and
other cpuset control file write actions, a new top level cpuset_top_mutex
is introduced. This new mutex will be acquired first to allow sharing
variables used by both code paths. However, cpuset update from CPU
hotplug can still happen in parallel with the housekeeping_update()
call, though that should be rare in production environment.
As cpus_read_lock() is now no longer held when
tmigr_isolated_exclude_cpumask() is called, it needs to acquire it
directly.
The lockdep_is_cpuset_held() is also updated to return true if either
cpuset_top_mutex or cpuset_mutex is held.
Fixes: 03ff735101 ("cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The cpuset_handle_hotplug() may need to invoke housekeeping_update(),
for instance, when an isolated partition is invalidated because its
last active CPU has been put offline.
As we are going to enable dynamic update to the nozh_full housekeeping
cpumask (HK_TYPE_KERNEL_NOISE) soon with the help of CPU hotplug,
allowing the CPU hotplug path to call into housekeeping_update() directly
from update_isolation_cpumasks() will likely cause deadlock. So we
have to defer any call to housekeeping_update() after the CPU hotplug
operation has finished. This is now done via the workqueue where
the update_hk_sched_domains() function will be invoked via the
hk_sd_workfn().
An concurrent cpuset control file write may have executed the required
update_hk_sched_domains() function before the work function is called. So
the work function call may become a no-op when it is invoked.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
With the latest changes in sched/isolation.c, rebuild_sched_domains*()
requires the HK_TYPE_DOMAIN housekeeping cpumask to be properly
updated first, if needed, before the sched domains can be
rebuilt. So the two naturally fit together. Do that by creating a new
update_hk_sched_domains() helper to house both actions.
The name of the isolated_cpus_updating flag to control the
call to housekeeping_update() is now outdated. So change it to
update_housekeeping to better reflect its purpose. Also move the call
to update_hk_sched_domains() to the end of cpuset and hotplug operations
before releasing the cpuset_mutex.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
As cpuset is updating HK_TYPE_DOMAIN housekeeping mask when there is
a change in the set of isolated CPUs, making this change is now more
costly than before. Right now, the isolated_cpus_updating flag can be
set even if there is no real change in isolated_cpus. Put in additional
checks to make sure that isolated_cpus_updating is set only if there
is a real change in isolated_cpus.
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Clarify the locking rules associated with file level internal variables
inside the cpuset code. There is no functional change.
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Commit e2ffe502ba ("cgroup/cpuset: Add cpuset.cpus.exclusive for v2")
incorrectly changed the 2nd parameter of cpuset_update_tasks_cpumask()
from tmp->new_cpus to cp->effective_cpus. This second parameter is just
a temporary cpumask for internal use. The cpuset_update_tasks_cpumask()
function was originally called update_tasks_cpumask() before commit
381b53c3b5 ("cgroup/cpuset: rename functions shared between v1
and v2").
This mistake can incorrectly change the effective_cpus of the
cpuset when it is the top_cpuset or in arm64 architecture where
task_cpu_possible_mask() may differ from cpu_possible_mask. So far
top_cpuset hasn't been passed to update_cpumasks_hier() yet, but arm64
arch can still be impacted. Fix it by reverting the incorrect change.
Fixes: e2ffe502ba ("cgroup/cpuset: Add cpuset.cpus.exclusive for v2")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The effective_xcpus of a cpuset can contain offline CPUs. In
partition_xcpus_del(), the xcpus parameter is incorrectly used as
a temporary cpumask to mask out offline CPUs. As xcpus can be the
effective_xcpus of a cpuset, this can result in unexpected changes
in that cpumask. Fix this problem by not making any changes to the
xcpus parameter.
Fixes: 11e5f407b6 ("cgroup/cpuset: Keep track of CPUs in isolated partitions")
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The module loader doesn't check for bounds of the ELF section index in
simplify_symbols():
for (i = 1; i < symsec->sh_size / sizeof(Elf_Sym); i++) {
const char *name = info->strtab + sym[i].st_name;
switch (sym[i].st_shndx) {
case SHN_COMMON:
[...]
default:
/* Divert to percpu allocation if a percpu var. */
if (sym[i].st_shndx == info->index.pcpu)
secbase = (unsigned long)mod_percpu(mod);
else
/** HERE --> **/ secbase = info->sechdrs[sym[i].st_shndx].sh_addr;
sym[i].st_value += secbase;
break;
}
}
A symbol with an out-of-bounds st_shndx value, for example 0xffff
(known as SHN_XINDEX or SHN_HIRESERVE), may cause a kernel panic:
BUG: unable to handle page fault for address: ...
RIP: 0010:simplify_symbols+0x2b2/0x480
...
Kernel panic - not syncing: Fatal exception
This can happen when module ELF is legitimately using SHN_XINDEX or
when it is corrupted.
Add a bounds check in simplify_symbols() to validate that st_shndx is
within the valid range before using it.
This issue was discovered due to a bug in llvm-objcopy, see relevant
discussion for details [1].
[1] https://lore.kernel.org/linux-modules/20251224005752.201911-1-ihor.solodrai@linux.dev/
Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Reviewed-by: Daniel Gomez <da.gomez@samsung.com>
Reviewed-by: Petr Pavlu <petr.pavlu@suse.com>
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
This config option goes way back - it used to be an internal debug
option to random.c (at that point called DEBUG_RANDOM_BOOT), then was
renamed and exposed as a config option as CONFIG_WARN_UNSEEDED_RANDOM,
and then further renamed to the current CONFIG_WARN_ALL_UNSEEDED_RANDOM.
It was all done with the best of intentions: the more limited
rate-limited reports were reporting some cases, but if you wanted to see
all the gory details, you'd enable this "ALL" option.
However, it turns out - perhaps not surprisingly - that when people
don't care about and fix the first rate-limited cases, they most
certainly don't care about any others either, and so warning about all
of them isn't actually helping anything.
And the non-ratelimited reporting causes problems, where well-meaning
people enable debug options, but the excessive flood of messages that
nobody cares about will hide actual real information when things go
wrong.
I just got a kernel bug report (which had nothing to do with randomness)
where two thirds of the the truncated dmesg was just variations of
random: get_random_u32 called from __get_random_u32_below+0x10/0x70 with crng_init=0
and in the process early boot messages had been lost (in addition to
making the messages that _hadn't_ been lost harder to read).
The proper way to find these things for the hypothetical developer that
cares - if such a person exists - is almost certainly with boot time
tracing. That gives you the option to get call graphs etc too, which is
likely a requirement for fixing any problems anyway.
See Documentation/trace/boottime-trace.rst for that option.
And if we for some reason do want to re-introduce actual printing of
these things, it will need to have some uniqueness filtering rather than
this "just print it all" model.
Fixes: cc1e127bfa ("random: remove ratelimiting for in-kernel unseeded randomness")
Acked-by: Jason Donenfeld <Jason@zx2c4.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The module Kconfig file contains a set of options related to "Module
versioning support" (depends on MODVERSIONS) and "Module signature
verification" (depends on MODULE_SIG). The Kconfig tool automatically
creates submenus when an entry for a symbol is followed by consecutive
items that all depend on the symbol. However, this functionality doesn't
work for the mentioned module options. The MODVERSIONS options are
interleaved with ASM_MODVERSIONS, which has no 'depends on MODVERSIONS' but
instead uses 'default HAVE_ASM_MODVERSIONS && MODVERSIONS'. Similarly, the
MODULE_SIG options are interleaved by a comment warning not to forget
signing modules with scripts/sign-file, which uses the condition 'depends
on MODULE_SIG_FORCE && !MODULE_SIG_ALL'.
The result is that the options are confusingly shown when using
a menuconfig tool, as follows:
[*] Module versioning support
Module versioning implementation (genksyms (from source code)) --->
[ ] Extended Module Versioning Support
[*] Basic Module Versioning Support
[*] Source checksum for all modules
[*] Module signature verification
[ ] Require modules to be validly signed
[ ] Automatically sign all modules
Hash algorithm to sign modules (SHA-256) --->
Fix the issue by using if/endif to group related options together in
kernel/module/Kconfig, similarly to how the MODULE_DEBUG options are
already grouped. Note that the signing-related options depend on
'MODULE_SIG || IMA_APPRAISE_MODSIG', with the exception of
MODULE_SIG_FORCE, which is valid only for MODULE_SIG and is therefore kept
separately. For consistency, do the same for the MODULE_COMPRESS entries.
The options are then properly placed into submenus, as follows:
[*] Module versioning support
Module versioning implementation (genksyms (from source code)) --->
[ ] Extended Module Versioning Support
[*] Basic Module Versioning Support
[*] Source checksum for all modules
[*] Module signature verification
[ ] Require modules to be validly signed
[ ] Automatically sign all modules
Hash algorithm to sign modules (SHA-256) --->
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Reviewed-by: Daniel Gomez <da.gomez@samsung.com>
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
In the error path of load_module(), under the free_module label, the
code calls lockdep_free_key_range() to release lock classes associated
with the MOD_DATA, MOD_RODATA and MOD_RO_AFTER_INIT module regions, and
subsequently invokes module_deallocate().
Since commit ac3b432839 ("module: replace module_layout with
module_memory"), the module_deallocate() function calls free_mod_mem(),
which releases the lock classes as well and considers all module
regions.
Attempting to free these classes twice is unnecessary. Remove the
redundant code in load_module().
Fixes: ac3b432839 ("module: replace module_layout with module_memory")
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Reviewed-by: Daniel Gomez <da.gomez@samsung.com>
Reviewed-by: Aaron Tomlin <atomlin@atomlin.com>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Syzkaller reported a refcount_t: addition on 0; use-after-free warning
in perf_mmap.
The issue is caused by a race condition between a failing mmap() setup
and a concurrent mmap() on a dependent event (e.g., using output
redirection).
In perf_mmap(), the ring_buffer (rb) is allocated and assigned to
event->rb with the mmap_mutex held. The mutex is then released to
perform map_range().
If map_range() fails, perf_mmap_close() is called to clean up.
However, since the mutex was dropped, another thread attaching to
this event (via inherited events or output redirection) can acquire
the mutex, observe the valid event->rb pointer, and attempt to
increment its reference count. If the cleanup path has already
dropped the reference count to zero, this results in a
use-after-free or refcount saturation warning.
Fix this by extending the scope of mmap_mutex to cover the
map_range() call. This ensures that the ring buffer initialization
and mapping (or cleanup on failure) happens atomically effectively,
preventing other threads from accessing a half-initialized or
dying ring buffer.
Closes: https://lore.kernel.org/oe-kbuild-all/202602020208.m7KIjdzW-lkp@intel.com/
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Haocheng Yu <yuhaocheng035@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260202162057.7237-1-yuhaocheng035@gmail.com
Lockdep found a bug in the event scheduling when a pinned event was
failed and wakes up the threads in the ring buffer like below.
It seems it should not grab a wait-queue lock under perf-context lock.
Let's do it with irq_work.
[ 39.913691] =============================
[ 39.914157] [ BUG: Invalid wait context ]
[ 39.914623] 6.15.0-next-20250530-next-2025053 #1 Not tainted
[ 39.915271] -----------------------------
[ 39.915731] repro/837 is trying to lock:
[ 39.916191] ffff88801acfabd8 (&event->waitq){....}-{3:3}, at: __wake_up+0x26/0x60
[ 39.917182] other info that might help us debug this:
[ 39.917761] context-{5:5}
[ 39.918079] 4 locks held by repro/837:
[ 39.918530] #0: ffffffff8725cd00 (rcu_read_lock){....}-{1:3}, at: __perf_event_task_sched_in+0xd1/0xbc0
[ 39.919612] #1: ffff88806ca3c6f8 (&cpuctx_lock){....}-{2:2}, at: __perf_event_task_sched_in+0x1a7/0xbc0
[ 39.920748] #2: ffff88800d91fc18 (&ctx->lock){....}-{2:2}, at: __perf_event_task_sched_in+0x1f9/0xbc0
[ 39.921819] #3: ffffffff8725cd00 (rcu_read_lock){....}-{1:3}, at: perf_event_wakeup+0x6c/0x470
Fixes: f4b07fd62d ("perf/core: Use POLLHUP for a pinned event in error")
Closes: https://lore.kernel.org/lkml/aD2w50VDvGIH95Pf@ly-workstation
Reported-by: "Lai, Yi" <yi1.lai@linux.intel.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: "Lai, Yi" <yi1.lai@linux.intel.com>
Link: https://patch.msgid.link/20250603045105.1731451-1-namhyung@kernel.org
Before rseq became extensible, its original size was 32 bytes even
though the active rseq area was only 20 bytes. This had the following
impact in terms of userspace ecosystem evolution:
* The GNU libc between 2.35 and 2.39 expose a __rseq_size symbol set
to 32, even though the size of the active rseq area is really 20.
* The GNU libc 2.40 changes this __rseq_size to 20, thus making it
express the active rseq area.
* Starting from glibc 2.41, __rseq_size corresponds to the
AT_RSEQ_FEATURE_SIZE from getauxval(3).
This means that users of __rseq_size can always expect it to
correspond to the active rseq area, except for the value 32, for
which the active rseq area is 20 bytes.
Exposing a 32 bytes feature size would make life needlessly painful
for userspace. Therefore, add a reserved field at the end of the
rseq area to bump the feature size to 33 bytes. This reserved field
is expected to be replaced with whatever field will come next,
expecting that this field will be larger than 1 byte.
The effect of this change is to increase the size from 32 to 64 bytes
before we actually have fields using that memory.
Clarify the allocation size and alignment requirements in the struct
rseq uapi comment.
Change the value returned by getauxval(AT_RSEQ_ALIGN) to return the
value of the active rseq area size rounded up to next power of 2, which
guarantees that the rseq structure will always be aligned on the nearest
power of two large enough to contain it, even as it grows. Change the
alignment check in the rseq registration accordingly.
This will minimize the amount of ABI corner-cases we need to document
and require userspace to play games with. The rule stays simple when
__rseq_size != 32:
#define rseq_field_available(field) (__rseq_size >= offsetofend(struct rseq_abi, field))
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260220200642.1317826-3-mathieu.desnoyers@efficios.com
The rseq registration validates that the rseq_size argument is greater
or equal to 32 (the original rseq size), but the comment associated with
this check does not clearly state this.
Clarify the comment to that effect.
Fixes: ee3e3ac05c ("rseq: Introduce extensible rseq ABI")
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260220200642.1317826-2-mathieu.desnoyers@efficios.com
Kernel test robot reported that
tools/testing/selftests/kvm/hardware_disable_test was failing due to
commit 704069649b ("sched/core: Rework sched_class::wakeup_preempt()
and rq_modified_*()")
It turns out there were two related problems that could lead to a
missed preemption:
- when hitting newidle balance from the idle thread, it would elevate
rb->next_class from &idle_sched_class to &fair_sched_class, causing
later wakeup_preempt() calls to not hit the sched_class_above()
case, and not issue resched_curr().
Notably, this modification pattern should only lower the
next_class, and never raise it. Create two new helper functions to
wrap this.
- when doing schedule_idle(), it was possible to miss (re)setting
rq->next_class to &idle_sched_class, leading to the very same
problem.
Cc: Sean Christopherson <seanjc@google.com>
Fixes: 704069649b ("sched/core: Rework sched_class::wakeup_preempt() and rq_modified_*()")
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202602122157.4e861298-lkp@intel.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260218163329.GQ1395416@noisy.programming.kicks-ass.net
Vincent reported that he was seeing undue lag clamping in a mixed
slice workload. Implement the max_slice tracking as per the todo
comment.
Fixes: 147f3efaa2 ("sched/fair: Implement an EEVDF-like scheduling policy")
Reported-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Shubhang Kaushik <shubhang@os.amperecomputing.com>
Link: https://patch.msgid.link/20250422101628.GA33555@noisy.programming.kicks-ass.net
In the EEVDF framework with Run-to-Parity protection, `se->vprot` is an
independent variable defining the virtual protection timestamp.
When `reweight_entity()` is called (e.g., via nice/renice), it performs
the following actions to preserve Lag consistency:
1. Scales `se->vlag` based on the new weight.
2. Calls `place_entity()`, which recalculates `se->vruntime` based on
the new weight and scaled lag.
However, the current implementation fails to update `se->vprot`, leading
to mismatches between the task's actual runtime and its expected duration.
Fixes: 63304558ba ("sched/eevdf: Curb wakeup-preemption")
Suggested-by: Zhang Qiao <zhangqiao22@huawei.com>
Signed-off-by: Wang Tao <wangtao554@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Shubhang Kaushik <shubhang@os.amperecomputing.com>
Link: https://patch.msgid.link/20260120123113.3518950-1-wangtao554@huawei.com
We should not (re)set slice protection in the sched_change pattern
which calls put_prev_task() / set_next_task().
Fixes: 63304558ba ("sched/eevdf: Curb wakeup-preemption")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Shubhang Kaushik <shubhang@os.amperecomputing.com>
Link: https://patch.msgid.link/20260219080624.561421378%40infradead.org
It turns out that zero_vruntime tracking is broken when there is but a single
task running. Current update paths are through __{en,de}queue_entity(), and
when there is but a single task, pick_next_task() will always return that one
task, and put_prev_set_next_task() will end up in neither function.
This can cause entity_key() to grow indefinitely large and cause overflows,
leading to much pain and suffering.
Furtermore, doing update_zero_vruntime() from __{de,en}queue_entity(), which
are called from {set_next,put_prev}_entity() has problems because:
- set_next_entity() calls __dequeue_entity() before it does cfs_rq->curr = se.
This means the avg_vruntime() will see the removal but not current, missing
the entity for accounting.
- put_prev_entity() calls __enqueue_entity() before it does cfs_rq->curr =
NULL. This means the avg_vruntime() will see the addition *and* current,
leading to double accounting.
Both cases are incorrect/inconsistent.
Noting that avg_vruntime is already called on each {en,de}queue, remove the
explicit avg_vruntime() calls (which removes an extra 64bit division for each
{en,de}queue) and have avg_vruntime() update zero_vruntime itself.
Additionally, have the tick call avg_vruntime() -- discarding the result, but
for the side-effect of updating zero_vruntime.
While there, optimize avg_vruntime() by noting that the average of one value is
rather trivial to compute.
Test case:
# taskset -c -p 1 $$
# taskset -c 2 bash -c 'while :; do :; done&'
# cat /sys/kernel/debug/sched/debug | awk '/^cpu#/ {P=0} /^cpu#2,/ {P=1} {if (P) print $0}' | grep -e zero_vruntime -e "^>"
PRE:
.zero_vruntime : 31316.407903
>R bash 487 50787.345112 E 50789.145972 2.800000 50780.298364 16 120 0.000000 0.000000 0.000000 /
.zero_vruntime : 382548.253179
>R bash 487 427275.204288 E 427276.003584 2.800000 427268.157540 23 120 0.000000 0.000000 0.000000 /
POST:
.zero_vruntime : 17259.709467
>R bash 526 17259.709467 E 17262.509467 2.800000 16915.031624 9 120 0.000000 0.000000 0.000000 /
.zero_vruntime : 18702.723356
>R bash 526 18702.723356 E 18705.523356 2.800000 18358.045513 9 120 0.000000 0.000000 0.000000 /
Fixes: 79f3f9bedd ("sched/eevdf: Fix min_vruntime vs avg_vruntime")
Reported-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Shubhang Kaushik <shubhang@os.amperecomputing.com>
Link: https://patch.msgid.link/20260219080624.438854780%40infradead.org
dma_addr is unitialized in dma_direct_map_phys() when swiotlb is forced
and DMA_ATTR_MMIO is set which leads to random value print out in
warning. Fix that by just returning DMA_MAPPING_ERROR.
Fixes: e53d29f957 ("dma-mapping: convert dma_direct_*map_page to be phys_addr_t based")
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/20260209153809.250835-2-jiri@resnulli.us
This converts some of the visually simpler cases that have been split
over multiple lines. I only did the ones that are easy to verify the
resulting diff by having just that final GFP_KERNEL argument on the next
line.
Somebody should probably do a proper coccinelle script for this, but for
me the trivial script actually resulted in an assertion failure in the
middle of the script. I probably had made it a bit _too_ trivial.
So after fighting that far a while I decided to just do some of the
syntactically simpler cases with variations of the previous 'sed'
scripts.
The more syntactically complex multi-line cases would mostly really want
whitespace cleanup anyway.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is the exact same thing as the 'alloc_obj()' version, only much
smaller because there are a lot fewer users of the *alloc_flex()
interface.
As with alloc_obj() version, this was done entirely with mindless brute
force, using the same script, except using 'flex' in the pattern rather
than 'objs*'.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This was done entirely with mindless brute force, using
git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'
to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.
Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.
For the same reason the 'flex' versions will be done as a separate
conversion.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:
Single allocations: kmalloc(sizeof(TYPE), ...)
are replaced with: kmalloc_obj(TYPE, ...)
Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with: kmalloc_objs(TYPE, COUNT, ...)
Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...)
(where TYPE may also be *VAR)
The resulting allocations no longer return "void *", instead returning
"TYPE *".
Signed-off-by: Kees Cook <kees@kernel.org>
- Fix possible dereference of uninitialized pointer
When validating the persistent ring buffer on boot up, if the first
validation fails, a reference to "head_page" is performed in the
error path, but it skips over the initialization of that variable.
Move the initialization before the first validation check.
- Fix use of event length in validation of persistent ring buffer
On boot up, the persistent ring buffer is checked to see if it is
valid by several methods. One being to walk all the events in the
memory location to make sure they are all valid. The length of the
event is used to move to the next event. This length is determined
by the data in the buffer. If that length is corrupted, it could
possibly make the next event to check located at a bad memory location.
Validate the length field of the event when doing the event walk.
- Fix function graph on archs that do not support use of ftrace_ops
When an architecture defines HAVE_DYNAMIC_FTRACE_WITH_ARGS, it means
that its function graph tracer uses the ftrace_ops of the function
tracer to call its callbacks. This allows a single registered callback
to be called directly instead of checking the callback's meta data's
hash entries against the function being traced.
For architectures that do not support this feature, it must always
call the loop function that tests each registered callback (even if
there's only one). The loop function tests each callback's meta data
against its hash of functions and will call its callback if the
function being traced is in its hash map.
The issue was that there was no check against this and the direct
function was being called even if the architecture didn't support it.
This meant that if function tracing was enabled at the same time
as a callback was registered with the function graph tracer, its
callback would be called for every function that the function tracer
also traced, even if the callback's meta data only wanted to be
called back for a small subset of functions.
Prevent the direct calling for those architectures that do not support
it.
- Fix references to trace_event_file for hist files
The hist files used event_file_data() to get a reference to the
associated trace_event_file the histogram was attached to. This
would return a pointer even if the trace_event_file is about to
be freed (via RCU). Instead it should use the event_file_file()
helper that returns NULL if the trace_event_file is marked to be
freed so that no new references are added to it.
- Wake up hist poll readers when an event is being freed
When polling on a hist file, the task is only awoken when a hist
trigger is triggered. This means that if an event is being freed
while there's a task waiting on its hist file, it will need to wait
until the hist trigger occurs to wake it up and allow the freeing
to happen. Note, the event will not be completely freed until all
references are removed, and a hist poller keeps a reference. But
it should still be woken when the event is being freed.
-----BEGIN PGP SIGNATURE-----
iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaZd4ExQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qqX5AP4powfnNnRfLSKH9idhTp+ltmJ+9roy
L7kWTr/z20S2VQEAk331PNZ32uZu+/ZUETYpgtEx4SbRGZFehTBv1ddjfw4=
=TWj5
-----END PGP SIGNATURE-----
Merge tag 'trace-v7.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix possible dereference of uninitialized pointer
When validating the persistent ring buffer on boot up, if the first
validation fails, a reference to "head_page" is performed in the
error path, but it skips over the initialization of that variable.
Move the initialization before the first validation check.
- Fix use of event length in validation of persistent ring buffer
On boot up, the persistent ring buffer is checked to see if it is
valid by several methods. One being to walk all the events in the
memory location to make sure they are all valid. The length of the
event is used to move to the next event. This length is determined by
the data in the buffer. If that length is corrupted, it could
possibly make the next event to check located at a bad memory
location.
Validate the length field of the event when doing the event walk.
- Fix function graph on archs that do not support use of ftrace_ops
When an architecture defines HAVE_DYNAMIC_FTRACE_WITH_ARGS, it means
that its function graph tracer uses the ftrace_ops of the function
tracer to call its callbacks. This allows a single registered
callback to be called directly instead of checking the callback's
meta data's hash entries against the function being traced.
For architectures that do not support this feature, it must always
call the loop function that tests each registered callback (even if
there's only one). The loop function tests each callback's meta data
against its hash of functions and will call its callback if the
function being traced is in its hash map.
The issue was that there was no check against this and the direct
function was being called even if the architecture didn't support it.
This meant that if function tracing was enabled at the same time as a
callback was registered with the function graph tracer, its callback
would be called for every function that the function tracer also
traced, even if the callback's meta data only wanted to be called
back for a small subset of functions.
Prevent the direct calling for those architectures that do not
support it.
- Fix references to trace_event_file for hist files
The hist files used event_file_data() to get a reference to the
associated trace_event_file the histogram was attached to. This would
return a pointer even if the trace_event_file is about to be freed
(via RCU). Instead it should use the event_file_file() helper that
returns NULL if the trace_event_file is marked to be freed so that no
new references are added to it.
- Wake up hist poll readers when an event is being freed
When polling on a hist file, the task is only awoken when a hist
trigger is triggered. This means that if an event is being freed
while there's a task waiting on its hist file, it will need to wait
until the hist trigger occurs to wake it up and allow the freeing to
happen. Note, the event will not be completely freed until all
references are removed, and a hist poller keeps a reference. But it
should still be woken when the event is being freed.
* tag 'trace-v7.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Wake up poll waiters for hist files when removing an event
tracing: Fix checking of freed trace_event_file for hist files
fgraph: Do not call handlers direct when not using ftrace_ops
tracing: ring-buffer: Fix to check event length before using
ring-buffer: Fix possible dereference of uninitialized pointer
The event_hist_poll() function attempts to verify whether an event file is
being removed, but this check may not occur or could be unnecessarily
delayed. This happens because hist_poll_wakeup() is currently invoked only
from event_hist_trigger() when a hist command is triggered. If the event
file is being removed, no associated hist command will be triggered and a
waiter will be woken up only after an unrelated hist command is triggered.
Fix the issue by adding a call to hist_poll_wakeup() in
remove_event_file_dir() after setting the EVENT_FILE_FL_FREED flag. This
ensures that a task polling on a hist file is woken up and receives
EPOLLERR.
Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://patch.msgid.link/20260219162737.314231-3-petr.pavlu@suse.com
Fixes: 1bd13edbbe ("tracing/hist: Add poll(POLLIN) support on hist file")
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The event_hist_open() and event_hist_poll() functions currently retrieve
a trace_event_file pointer from a file struct by invoking
event_file_data(), which simply returns file->f_inode->i_private. The
functions then check if the pointer is NULL to determine whether the event
is still valid. This approach is flawed because i_private is assigned when
an eventfs inode is allocated and remains set throughout its lifetime.
Instead, the code should call event_file_file(), which checks for
EVENT_FILE_FL_FREED. Using the incorrect access function may result in the
code potentially opening a hist file for an event that is being removed or
becoming stuck while polling on this file.
Correct the access method to event_file_file() in both functions.
Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Link: https://patch.msgid.link/20260219162737.314231-2-petr.pavlu@suse.com
Fixes: 1bd13edbbe ("tracing/hist: Add poll(POLLIN) support on hist file")
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The function graph tracer was modified to us the ftrace_ops of the
function tracer. This simplified the code as well as allowed more features
of the function graph tracer.
Not all architectures were converted over as it required the
implementation of HAVE_DYNAMIC_FTRACE_WITH_ARGS to implement. For those
architectures, it still did it the old way where the function graph tracer
handle was called by the function tracer trampoline. The handler then had
to check the hash to see if the registered handlers wanted to be called by
that function or not.
In order to speed up the function graph tracer that used ftrace_ops, if
only one callback was registered with function graph, it would call its
function directly via a static call.
Now, if the architecture does not support the use of using ftrace_ops and
still has the ftrace function trampoline calling the function graph
handler, then by doing a direct call it removes the check against the
handler's hash (list of functions it wants callbacks to), and it may call
that handler for functions that the handler did not request calls for.
On 32bit x86, which does not support the ftrace_ops use with function
graph tracer, it shows the issue:
~# trace-cmd start -p function -l schedule
~# trace-cmd show
# tracer: function_graph
#
# CPU DURATION FUNCTION CALLS
# | | | | | | |
2) * 11898.94 us | schedule();
3) # 1783.041 us | schedule();
1) | schedule() {
------------------------------------------
1) bash-8369 => kworker-7669
------------------------------------------
1) | schedule() {
------------------------------------------
1) kworker-7669 => bash-8369
------------------------------------------
1) + 97.004 us | }
1) | schedule() {
[..]
Now by starting the function tracer is another instance:
~# trace-cmd start -B foo -p function
This causes the function graph tracer to trace all functions (because the
function trace calls the function graph tracer for each on, and the
function graph trace is doing a direct call):
~# trace-cmd show
# tracer: function_graph
#
# CPU DURATION FUNCTION CALLS
# | | | | | | |
1) 1.669 us | } /* preempt_count_sub */
1) + 10.443 us | } /* _raw_spin_unlock_irqrestore */
1) | tick_program_event() {
1) | clockevents_program_event() {
1) 1.044 us | ktime_get();
1) 6.481 us | lapic_next_event();
1) + 10.114 us | }
1) + 11.790 us | }
1) ! 181.223 us | } /* hrtimer_interrupt */
1) ! 184.624 us | } /* __sysvec_apic_timer_interrupt */
1) | irq_exit_rcu() {
1) 0.678 us | preempt_count_sub();
When it should still only be tracing the schedule() function.
To fix this, add a macro FGRAPH_NO_DIRECT to be set to 0 when the
architecture does not support function graph use of ftrace_ops, and set to
1 otherwise. Then use this macro to know to allow function graph tracer to
call the handlers directly or not.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Link: https://patch.msgid.link/20260218104244.5f14dade@gandalf.local.home
Fixes: cc60ee813b ("function_graph: Use static_call and branch to optimize entry function")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Check the event length before adding it for accessing next index in
rb_read_data_buffer(). Since this function is used for validating
possibly broken ring buffers, the length of the event could be broken.
In that case, the new event (e + len) can point a wrong address.
To avoid invalid memory access at boot, check whether the length of
each event is in the possible range before using it.
Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: 5f3b6e839f ("ring-buffer: Validate boot range memory events")
Link: https://patch.msgid.link/177123421541.142205.9414352170164678966.stgit@devnote2
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
There is a pointer head_page in rb_meta_validate_events() which is not
initialized at the beginning of a function. This pointer can be dereferenced
if there is a failure during reader page validation. In this case the control
is passed to "invalid" label where the pointer is dereferenced in a loop.
To fix the issue initialize orig_head and head_page before calling
rb_validate_buffer.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Cc: stable@vger.kernel.org
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://patch.msgid.link/20260213100130.2013839-1-d.dulov@aladdin.ru
Closes: https://lore.kernel.org/r/202406130130.JtTGRf7W-lkp@intel.com/
Fixes: 5f3b6e839f ("ring-buffer: Validate boot range memory events")
Signed-off-by: Daniil Dulov <d.dulov@aladdin.ru>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmmXR6QACgkQ6rmadz2v
bTqVjg/+PZPMKGBMfF5uWk74LWYQIt01ePfHH2QeA4DsOwNK+9Q1+jmCLRPa/diL
Ds//ZIEMatmtdd1eO5aHyGXE1sBsJ02LfKOhsPukQyzD/FtZ4BmQpzpG2mK5o1M5
NAH6wxY+6Tr8UlXQtoTF1FFXSa6Y0vQmkyXofOoUgSBAxTPMGVQnWw4bq7mUAX9A
G6/TnPDgGbNLejPCmu8mERCkqRjIGAgjBUItVeiHbdxymtzjHcrH7nwxuP59djR9
1AhMrJnyV+s7iEMkAKGkE6NOID73R/YQEqmvD1eX0AWvqdR8+4lOHT0KPU039JqT
RQV5JgXSfeEkdUtyvqQJZdiJinjFLOwp4CGcX+DKcvUpAKmLx8q3ihPiuWk8+JOV
fnosXQIeQ7B9EuTvoNoNTfvU/MuV8vWd3/1kQc+KGXhzk944Ypb29zywGEoGZarU
eb7YRtUIsXBo7H2K1juqTvj72jyhG83cbZxE5+pR2gv87yGgUvt0r0u0FvzkQf2c
Fq671n6UaOd+0ZYey7YG9bIc+WMTbsjqVY0f7+3Rcl3Q68we0HHdqiPoF637/a1r
lS6TZLRDJmykDPp03db97UcQml6RKJiyTnTNQVUF4Uk+VF1KTreXK/D8fD401gxZ
GWuLt/bjq8l/EVJOhzpaO0JDejmZVRaOQUX8t9DwstS+FFbSyBs=
=mb+y
-----END PGP SIGNATURE-----
Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Pull bpf fixes from Alexei Starovoitov:
- Fix invalid write loop logic in libbpf's bpf_linker__add_buf() (Amery
Hung)
- Fix a potential use-after-free of BTF object (Anton Protopopov)
- Add feature detection to libbpf and avoid moving arena global
variables on older kernels (Emil Tsalapatis)
- Remove extern declaration of bpf_stream_vprintk() from libbpf headers
(Ihor Solodrai)
- Fix truncated netlink dumps in bpftool (Jakub Kicinski)
- Fix map_kptr grace period wait in bpf selftests (Kumar Kartikeya
Dwivedi)
- Remove hexdump dependency while building bpf selftests (Matthieu
Baerts)
- Complete fsession support in BPF trampolines on riscv (Menglong Dong)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
selftests/bpf: Remove hexdump dependency
libbpf: Remove extern declaration of bpf_stream_vprintk()
selftests/bpf: Use vmlinux.h in test_xdp_meta
bpftool: Fix truncated netlink dumps
libbpf: Delay feature gate check until object prepare time
libbpf: Do not use PROG_TYPE_TRACEPOINT program for feature gating
bpf: Add a map/btf from a fd array more consistently
selftests/bpf: Fix map_kptr grace period wait
selftests/bpf: enable fsession_test on riscv64
selftests/bpf: Adjust selftest due to function rename
bpf, riscv: add fsession support for trampolines
bpf: Fix a potential use-after-free of BTF object
bpf, riscv: introduce emit_store_stack_imm64() for trampoline
libbpf: Fix invalid write loop logic in bpf_linker__add_buf()
libbpf: Add gating for arena globals relocation feature
Total patches: 7
Reviews/patch: 0.57
Reviewed rate: 42%
- The 2 patch series "two fixes in kho_populate()" from Ran Xiaokai
fixes a couple of not-major issues in the kexec handover code.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaZaKBAAKCRDdBJ7gKXxA
jpB1AP9UpNzT63aGDnB6G8pgekSdK/I2gypZI3cS7MpBPorRUgEAhcClc2//zWGK
0Wz1rxh3sWIE/pzd/yOEsv+7oQHeDQA=
=oUp2
-----END PGP SIGNATURE-----
Merge tag 'mm-nonmm-stable-2026-02-18-19-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull more non-MM updates from Andrew Morton:
- "two fixes in kho_populate()" fixes a couple of not-major issues in
the kexec handover code (Ran Xiaokai)
- misc singletons
* tag 'mm-nonmm-stable-2026-02-18-19-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
lib/group_cpus: handle const qualifier from clusters allocation type
kho: remove unnecessary WARN_ON(err) in kho_populate()
kho: fix missing early_memunmap() call in kho_populate()
scripts/gdb: implement x86_page_ops in mm.py
objpool: fix the overestimation of object pooling metadata size
selftests/memfd: use IPC semaphore instead of SIGSTOP/SIGCONT
delayacct: fix build regression on accounting tool
Total patches: 36
Reviews/patch: 1.77
Reviewed rate: 83%
- The 2 patch series "mm/vmscan: fix demotion targets checks in
reclaim/demotion" from Bing Jiao fixes a couple of issues in the
demotion code - pages were failed demotion and were finding themselves
demoted into disallowed nodes.
- The 11 patch series "Remove XA_ZERO from error recovery of dup_mmap()"
from Liam Howlett fixes a rare mapledtree race and performs a number of
cleanups.
- The 13 patch series "mm: add bitmap VMA flag helpers and convert all
mmap_prepare to use them" from Lorenzo Stoakes implements a lot of
cleanups following on from the conversion of the VMA flags into a
bitmap.
- The 5 patch series "support batch checking of references and unmapping
for large folios" from Baolin Wang implements batching to greatly
improve the performance of reclaiming clean file-backed large folios.
- The 3 patch series "selftests/mm: add memory failure selftests" from
Miaohe Lin does as claimed.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaZaIEQAKCRDdBJ7gKXxA
jj73AQCQDwLoipDiQRGyjB5BDYydymWuDoiB1tlDPHfYAP3b/QD/UQtVlOEXqwM3
naOKs3NQ1pwnfhDaQMirGw2eAnJ1SQY=
=6Iif
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2026-02-18-19-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull more MM updates from Andrew Morton:
- "mm/vmscan: fix demotion targets checks in reclaim/demotion" fixes a
couple of issues in the demotion code - pages were failed demotion
and were finding themselves demoted into disallowed nodes (Bing Jiao)
- "Remove XA_ZERO from error recovery of dup_mmap()" fixes a rare
mapledtree race and performs a number of cleanups (Liam Howlett)
- "mm: add bitmap VMA flag helpers and convert all mmap_prepare to use
them" implements a lot of cleanups following on from the conversion
of the VMA flags into a bitmap (Lorenzo Stoakes)
- "support batch checking of references and unmapping for large folios"
implements batching to greatly improve the performance of reclaiming
clean file-backed large folios (Baolin Wang)
- "selftests/mm: add memory failure selftests" does as claimed (Miaohe
Lin)
* tag 'mm-stable-2026-02-18-19-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (36 commits)
mm/page_alloc: clear page->private in free_pages_prepare()
selftests/mm: add memory failure dirty pagecache test
selftests/mm: add memory failure clean pagecache test
selftests/mm: add memory failure anonymous page test
mm: rmap: support batched unmapping for file large folios
arm64: mm: implement the architecture-specific clear_flush_young_ptes()
arm64: mm: support batch clearing of the young flag for large folios
arm64: mm: factor out the address and ptep alignment into a new helper
mm: rmap: support batched checks of the references for large folios
tools/testing/vma: add VMA userland tests for VMA flag functions
tools/testing/vma: separate out vma_internal.h into logical headers
tools/testing/vma: separate VMA userland tests into separate files
mm: make vm_area_desc utilise vma_flags_t only
mm: update all remaining mmap_prepare users to use vma_flags_t
mm: update shmem_[kernel]_file_*() functions to use vma_flags_t
mm: update secretmem to use VMA flags on mmap_prepare
mm: update hugetlbfs to use VMA flags on mmap_prepare
mm: add basic VMA flag operation helper functions
tools: bitmap: add missing bitmap_[subset(), andnot()]
mm: add mk_vma_flags() bitmap flag macro helper
...
* Removed macros from proc handler converters
Replace the proc converter macros with "regular" functions. Though it is more
verbose than the macro version, it helps when debugging and better aligns with
coding-style.rst.
* General cleanup
Remove superfluous ctl_table forward declarations. Const qualify the
memory_allocation_profiling_sysctl and loadpin_sysctl_table arrays. Add
missing kernel doc to proc_dointvec_conv.
* Testing
This series was run through sysctl selftests/kunit test suite in
x86_64. And went into linux-next after rc4, giving it a good 3 weeks of
testing
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEErkcJVyXmMSXOyyeQupfNUreWQU8FAmmUabYACgkQupfNUreW
QU8y2Qv/d2y35uQPRDh0HKWKWXJy41C2RJzd/rFCWJPCwo150whTSHIHkWYnu76g
10QblBXQmXi9TVqFnJ7Il7PWgqkMPjzA13tfT9eXNWU8j2OB/mcVKNl9X4wm/jWi
QxtGmBsIQ/nxb2pUzMCykzgfc5mLi2NQ8qhZ5bOnq7UW3zdYmzEqx+tRdvIacyIk
adComi5v8xUDqyEbVFaBovuX2WHQkPyBMnD64nwWG93JpNG/+9PxGzv/DNUXY11Y
epVOfSoKdJbSLjYoHEPEhT0aHjSydq3QHru7uF6wzKOFTfHej/XkXXbUnFXPO2Pn
c5J0u/HziYG5eN2QTqGfrhECZYuCFPemtUozltbcgGebkl1wKH+k9K5vsCaz/mhk
ihUC3mui++W/n9B9HJRYh1XeEpk6C1pWERCOx27XFZ25fSek2YO6ZWkT0q+gceC0
t4+eIFSGJ3OzheJgHNK9XhTMWiQPmHyA6brXYGx4WeRvJFLpVddPF7k3Z89zIAu/
Fut7FGTH
=0Z+I
-----END PGP SIGNATURE-----
Merge tag 'sysctl-7.00-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl
Pull sysctl updates from Joel Granados:
- Remove macros from proc handler converters
Replace the proc converter macros with "regular" functions. Though it
is more verbose than the macro version, it helps when debugging and
better aligns with coding-style.rst.
- General cleanup
Remove superfluous ctl_table forward declarations. Const qualify the
memory_allocation_profiling_sysctl and loadpin_sysctl_table arrays.
Add missing kernel doc to proc_dointvec_conv.
- Testing
This series was run through sysctl selftests/kunit test suite in
x86_64. And went into linux-next after rc4, giving it a good 3 weeks
of testing
* tag 'sysctl-7.00-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl:
sysctl: replace SYSCTL_INT_CONV_CUSTOM macro with functions
sysctl: Replace unidirectional INT converter macros with functions
sysctl: Add kernel doc to proc_douintvec_conv
sysctl: Replace UINT converter macros with functions
sysctl: Add CONFIG_PROC_SYSCTL guards for converter macros
sysctl: clarify proc_douintvec_minmax doc
sysctl: Return -ENOSYS from proc_douintvec_conv when CONFIG_PROC_SYSCTL=n
sysctl: Remove unused ctl_table forward declarations
loadpin: Implement custom proc_handler for enforce
alloc_tag: move memory_allocation_profiling_sysctls into .rodata
sysctl: Add missing kernel-doc for proc_dointvec_conv
There's an unpleasant corner case in unshare(2), when we have a
CLONE_NEWNS in flags and current->fs hadn't been shared at all; in that
case copy_mnt_ns() gets passed current->fs instead of a private copy,
which causes interesting warts in proof of correctness]
> I guess if private means fs->users == 1, the condition could still be true.
Unfortunately, it's worse than just a convoluted proof of correctness.
Consider the case when we have CLONE_NEWCGROUP in addition to CLONE_NEWNS
(and current->fs->users == 1).
We pass current->fs to copy_mnt_ns(), all right. Suppose it succeeds and
flips current->fs->{pwd,root} to corresponding locations in the new namespace.
Now we proceed to copy_cgroup_ns(), which fails (e.g. with -ENOMEM).
We call put_mnt_ns() on the namespace created by copy_mnt_ns(), it's
destroyed and its mount tree is dissolved, but... current->fs->root and
current->fs->pwd are both left pointing to now detached mounts.
They are pinning those, so it's not a UAF, but it leaves the calling
process with unshare(2) failing with -ENOMEM _and_ leaving it with
pwd and root on detached isolated mounts. The last part is clearly a bug.
There is other fun related to that mess (races with pivot_root(), including
the one between pivot_root() and fork(), of all things), but this one
is easy to isolate and fix - treat CLONE_NEWNS as "allocate a new
fs_struct even if it hadn't been shared in the first place". Sure, we could
go for something like "if both CLONE_NEWNS *and* one of the things that might
end up failing after copy_mnt_ns() call in create_new_namespaces() are set,
force allocation of new fs_struct", but let's keep it simple - the cost
of copy_fs_struct() is trivial.
Another benefit is that copy_mnt_ns() with CLONE_NEWNS *always* gets
a freshly allocated fs_struct, yet to be attached to anything. That
seriously simplifies the analysis...
FWIW, that bug had been there since the introduction of unshare(2) ;-/
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://patch.msgid.link/20260207082524.GE3183987@ZenIV
Tested-by: Waiman Long <longman@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Here are two small changes that add some missing SPDX license lines to
some core kernel files. These are:
- adding SPDX license lines to kdb files
- adding SPDX license lines to the remaining kernel/ files
Both of these have been in linux-next for a while with no reported
issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCaZR2VA8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+yne8ACfYf6Mo8A+c4ZVduiKhAywn7wtX3YAn3Z29idk
Cq3AiQM/bjJvM0/KPOwW
=k232
-----END PGP SIGNATURE-----
Merge tag 'spdx-7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx
Pull SPDX updates from Greg KH:
"Here are two small changes that add some missing SPDX license lines to
some core kernel files. These are:
- adding SPDX license lines to kdb files
- adding SPDX license lines to the remaining kernel/ files
Both of these have been in linux-next for a while with no reported
issues"
* tag 'spdx-7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx:
kernel: debug: Add SPDX license ids to kdb files
kernel: add SPDX-License-Identifier lines
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmmTrNEQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpjsOEACpUk78nFmLbEgJ5UH8+Z6daDzgoasb5YRT
Mj4g+cM2J9Xc9JxgX8QR3F2EfolweTo/H6xlhnlPDcnpB+b3qj4WHuijR/wghphj
MBKKqNXTEC+j0ra9uk8h3RmIKaK79xcUup7XfTcuWdYpSsMyYE/m/rck3thw6yNL
OAjmWLTP4IwYzXip2AB+J7JbDDOV/qWK0aOYdWHCdbn9X8bBel/HDOITWPdybnSR
DNKBeoi/Yv8KwA+axogqP213ifc3Xr6ejRDkqDOf1bgXsKkELkIxcfog6MhfHhxq
3Cqlj1pBuIBxGVU7wmBTDqL+aHrVb983tcA5x1NGZIzJao64b026o5DUhNPprwrZ
HveU1MZ2jarAjAz85gE3S4oUY+6d47ooytfvO548Zp/1LY+fOxnjYqq5ksh8BBLk
WyjfkJScgr17Z4SVOK8a9GboWO2WKiQJRg+hZ/TWX5fyvu5g9sbRasdwxnp1sl52
EayzkhYFq/Rdd8slwTIaccVUPl/xeEDeRG+jTJ+4Fj54TihKiJzXVsxDkSWKf46V
CWmzDx+n6MlGPm9mShSERZ7HJh3VcSp4No/HAjf93u9/UXwubK/SKiV71nhpgJMf
9bWS2G3wPx/5LoME95YkF+CSgs0e/ROUusfGd8X6nIz9EBGzeabCG/mjqd5adC09
OZahOuqrIg==
=PVoY
-----END PGP SIGNATURE-----
Merge tag 'block-7.0-20260216' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull more block updates from Jens Axboe:
- Fix partial IOVA mapping cleanup in error handling
- Minor prep series ignoring discard return value, as
the inline value is always known
- Ensure BLK_FEAT_STABLE_WRITES is set for drbd
- Fix leak of folio in bio_iov_iter_bounce_read()
- Allow IOC_PR_READ_* for read-only open
- Another debugfs deadlock fix
- A few doc updates
* tag 'block-7.0-20260216' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
blk-mq: use NOIO context to prevent deadlock during debugfs creation
blk-stat: convert struct blk_stat_callback to kernel-doc
block: fix enum descriptions kernel-doc
block: update docs for bio and bvec_iter
block: change return type to void
nvmet: ignore discard return value
md: ignore discard return value
block: fix partial IOVA mapping cleanup in blk_rq_dma_map_iova
block: fix folio leak in bio_iov_iter_bounce_read()
block: allow IOC_PR_READ_* ioctls with BLK_OPEN_READ
drbd: always set BLK_FEAT_STABLE_WRITES
Please consider pulling these changes from the signed kernel-7.0-rc1.misc tag.
Thanks!
Christian
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaZL+JwAKCRCRxhvAZXjc
ovU/AP4xgVxEegnNYrXZ+TpdCXbCtQZ54JqowFX73MBtaBHY1QD/YkDaIzl6K70v
d9P2Fe8Y6wOnIHxcjE4MIdMansphjAM=
=TN3q
-----END PGP SIGNATURE-----
Merge tag 'kernel-7.0-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull pidfs updates from Christian Brauner:
- pid: introduce task_ppid_vnr() helper
- pidfs: convert rb-tree to rhashtable
Mateusz reported performance penalties during task creation because
pidfs uses pidmap_lock to add elements into the rbtree. Switch to an
rhashtable to have separate fine-grained locking and to decouple from
pidmap_lock moving all heavy manipulations outside of it
Also move inode allocation outside of pidmap_lock. With this there's
nothing happening for pidfs under pidmap_lock
- pid: reorder fields in pid_namespace to reduce false sharing
- Revert "pid: make __task_pid_nr_ns(ns => NULL) safe for zombie
callers"
- ipc: Add SPDX license id to mqueue.c
* tag 'kernel-7.0-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
pid: introduce task_ppid_vnr() helper
pidfs: implement ino allocation without the pidmap lock
Revert "pid: make __task_pid_nr_ns(ns => NULL) safe for zombie callers"
pid: reorder fields in pid_namespace to reduce false sharing
pidfs: convert rb-tree to rhashtable
ipc: Add SPDX license id to mqueue.c
Creating debugfs entries can trigger fs reclaim, which can enter back
into the block layer request_queue. This can cause deadlock if the
queue is frozen.
Previously, a WARN_ON_ONCE check was used in debugfs_create_files()
to detect this condition, but it was racy since the queue can be frozen
from another context at any time.
Introduce blk_debugfs_lock()/blk_debugfs_unlock() helpers that combine
the debugfs_mutex with memalloc_noio_save()/restore() to prevent fs
reclaim from triggering block I/O. Also add blk_debugfs_lock_nomemsave()
and blk_debugfs_unlock_nomemrestore() variants for callers that don't
need NOIO protection (e.g., debugfs removal or read-only operations).
Replace all raw debugfs_mutex lock/unlock pairs with these helpers,
using the _nomemsave/_nomemrestore variants where appropriate.
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Closes: https://lore.kernel.org/all/CAHj4cs9gNKEYAPagD9JADfO5UH+OiCr4P7OO2wjpfOYeM-RV=A@mail.gmail.com/
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Closes: https://lore.kernel.org/all/aYWQR7CtYdk3K39g@shinmob/
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
- kprobes: Use a dedicated kernel thread to optimize the kprobes
instead of using workqueue thread. Since the kprobe optimizer waits
a long time for synchronize_rcu_task(), it can block other workers
in the same queue if it uses a workqueue.
- kprobe-events: Returns immediately if no new probe events are
specified on the kernel command line at boot time. This shorten the
kernel boot time.
- kprobes: When a kprobe is fully removed from the kernel code,
retry optimizing another kprobe which is blocked by that kprobe.
-----BEGIN PGP SIGNATURE-----
iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmmPtX4bHG1hc2FtaS5o
aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8bgoUH/0oZ9/c+wkn9AKxFlOAw
AIbtTfKiQHXPirhamYuEbJJWTUH2O/Wr0SQrIkbwJzToou5CRTuT5UQBpmOk2GKZ
uZ5bO63nOMn/1ECbDpc+lFGV5J4Dn5Pw2GFjUTR6EPwms+CN/d6f2LlvuAVfJX7Y
klL2X9/QLwV5QT8La5zNEka2zBfrZlZ9a5DGoLtgdwX2zosuef9aDnisS2jkLenH
wugZLryx0Qvlj9DXimP7oVZpYDHr0hZEI3CNQPge0gMQDj1XajGawYfkhrfaeFMX
4eXjFCN+NQG6fBUpDugcOThOypWmjlJhoYxxROJu0eFcLQjsuWjp6jLixI9Nr7P+
tQk=
=syAj
-----END PGP SIGNATURE-----
Merge tag 'probes-v7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull kprobes updates from Masami Hiramatsu:
- Use a dedicated kernel thread to optimize the kprobes instead of
using workqueue thread. Since the kprobe optimizer waits a long time
for synchronize_rcu_task(), it can block other workers in the same
queue if it uses a workqueue.
- kprobe-events: return immediately if no new probe events are
specified on the kernel command line at boot time. This shortens
the kernel boot time.
- When a kprobe is fully removed from the kernel code, retry optimizing
another kprobe which is blocked by that kprobe.
* tag 'probes-v7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
kprobes: Use dedicated kthread for kprobe optimizer
tracing: kprobe-event: Return directly when trace kprobes is empty
kprobes: retry blocked optprobe in do_free_cleaned_kprobes
do_con_write(), fbcon_redraw.*() invoke console_conditional_schedule()
which is a conditional scheduling point based on printk's internal
variables console_may_schedule. It may only be used if the console lock
is acquired for instance via console_lock() or console_trylock().
Prinkt sets the internal variable to 1 (and allows to schedule)
if the console lock has been acquired via console_lock(). The trylock
does not allow it.
The console_conditional_schedule() invocation in do_con_write() is
invoked shortly before console_unlock().
The console_conditional_schedule() invocation in fbcon_redraw.*()
original from fbcon_scroll() / vt's con_scroll() which originate from a
line feed.
In console_unlock() the variable is set to 0 (forbids to schedule) and
it tries to schedule while making progress printing. This is brand new
compared to when console_conditional_schedule() was added in v2.4.9.11.
In v2.6.38-rc3, console_unlock() (started its existence) iterated over
all consoles and flushed them with disabled interrupts. A scheduling
attempt here was not possible, it relied that a long print scheduled
before console_unlock().
Since commit 8d91f8b153 ("printk: do cond_resched() between lines
while outputting to consoles"), which appeared in v4.5-rc1,
console_unlock() attempts to schedule if it was allowed to schedule
while during console_lock(). Each record is idealy one line so after
every line feed.
This console_conditional_schedule() is also only relevant on
PREEMPT_NONE and PREEMPT_VOLUNTARY builds. In other configurations
cond_resched() becomes a nop and has no impact.
I'm bringing this all up just proof that it is not required anymore. It
becomes a problem on a PREEMPT_RT build with debug code enabled because
that might_sleep() in cond_resched() remains and triggers a warnings.
This is due to
legacy_kthread_func-> console_flush_one_record -> vt_console_print-> lf
-> con_scroll -> fbcon_scroll
and vt_console_print() acquires a spinlock_t which does not allow a
voluntary schedule. There is no need to fb_scroll() to schedule since
console_flush_one_record() attempts to schedule after each line.
!PREEMPT_RT is not affected because the legacy printing thread is only
enabled on PREEMPT_RT builds.
Therefore I suggest to remove console_conditional_schedule().
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Helge Deller <deller@gmx.de>
Cc: linux-fbdev@vger.kernel.org
Cc: dri-devel@lists.freedesktop.org
Fixes: 5f53ca3ff8 ("printk: Implement legacy printer kthread for PREEMPT_RT")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Petr Mladek <pmladek@suse.com> # from printk() POV
Signed-off-by: Helge Deller <deller@gmx.de>
User visible changes:
- Add an entry into MAINTAINERS file for RUST versions of code
There's now RUST code for tracing and static branches. To differentiate
that code from the C code, add entries in for the RUST version (with "[RUST]"
around it) so that the right maintainers get notified on changes.
- New bitmask-list option added to tracefs
When this is set, bitmasks in trace event are not displayed as hex
numbers, but instead as lists: e.g. 0-5,7,9 instead of 0000015f
- New show_event_filters file in tracefs
Instead of having to search all events/*/*/filter for any active filters
enabled in the trace instance, the file show_event_filters will list them
so that there's only one file that needs to be examined to see if any
filters are active.
- New show_event_triggers file in tracefs
Instead of having to search all events/*/*/trigger for any active triggers
enabled in the trace instance, the file show_event_triggers will list them
so that there's only one file that needs to be examined to see if any
triggers are active.
- Have traceoff_on_warning disable trace pintk buffer too
Recently recording of trace_printk() could go to other trace instances
instead of the top level instance. But if traceoff_on_warning triggers, it
doesn't stop the buffer with trace_printk() and that data can easily be
lost by being overwritten. Have traceoff_on_warning also disable the
instance that has trace_printk() being written to it.
- Update the hist_debug file to show what function the field uses
When CONFIG_HIST_TRIGGERS_DEBUG is enabled, a hist_debug file exists for
every event. This displays the internal data of any histogram enabled for
that event. But it is lacking the function that is called to process one
of its fields. This is very useful information that was missing when
debugging histograms.
- Up the histogram stack size from 16 to 31
Stack traces can be used as keys for event histograms. Currently the size
of the stack that is stored is limited to just 16 entries. But the storage
space in the histogram is 256 bytes, meaning that it can store up to 31
entries (plus one for the count of entries). Instead of letting that space
go to waste, up the limit from 16 to 31. This makes the keys much more
useful.
- Fix permissions of per CPU file buffer_size_kb
The per CPU file of buffer_size_kb was incorrectly set to read only in a
previous cleanup. It should be writable.
- Reset "last_boot_info" if the persistent buffer is cleared
The last_boot_info shows address information of a persistent ring buffer
if it contains data from a previous boot. It is cleared when recording
starts again, but it is not cleared when the buffer is reset. The data is
useless after a reset so clear it on reset too.
Internal changes:
- A change was made to allow tracepoint callbacks to have preemption
enabled, and instead be protected by SRCU. This required some updates to
the callbacks for perf and BPF.
perf needed to disable preemption directly in its callback because it
expects preemption disabled in the later code.
BPF needed to disable migration, as its code expects to run completely on
the same CPU.
- Have irq_work wake up other CPU if current CPU is "isolated"
When there's a waiter waiting on ring buffer data and a new event happens,
an irq work is triggered to wake up that waiter. This is noisy on isolated
CPUs (running NO_HZ_FULL). Trigger an IPI to a house keeping CPU instead.
- Use proper free of trigger_data instead of open coding it in.
- Remove redundant call of event_trigger_reset_filter()
It was called immediately in a function that was called right after it.
- Workqueue cleanups
- Report errors if tracing_update_buffers() were to fail.
- Make the enum update workqueue generic for other parts of tracing
On boot up, a work queue is created to convert enum names into their
numbers in the trace event format files. This work queue can also be used
for other aspects of tracing that takes some time and shouldn't be called
by the init call code.
The blk_trace initialization takes a bit of time. Have the initialization
code moved to the new tracing generic work queue function.
- Skip kprobe boot event creation call if there's no kprobes defined on cmdline
The kprobe initialization to set up kprobes if they are defined on the
cmdline requires taking the event_mutex lock. This can be held by other
tracing code doing initialization for a long time. Since kprobes added to
the kernel command line need to be setup immediately, as they may be
tracing early initialization code, they cannot be postponed in a work
queue and must be setup in the initcall code.
If there's no kprobe on the kernel cmdline, there's no reason to take the
mutex and slow down the boot up code waiting to get the lock only to find
out there's nothing to do. Simply exit out early if there's no kprobes on
the kernel cmdline.
If there are kprobes on the cmdline, then someone cares more about tracing
over the speed of boot up.
- Clean up the trigger code a bit
- Move code out of trace.c and into their own files
trace.c is now over 11,000 lines of code and has become more difficult to
maintain. Start splitting it up so that related code is in their own
files.
Move all the trace_printk() related code into trace_printk.c.
Move the __always_inline stack functions into trace.h.
Move the pid filtering code into a new trace_pid.c file.
- Better define the max latency and snapshot code
The latency tracers have a "max latency" buffer that is a copy of the main
buffer and gets swapped with it when a new high latency is detected. This
keeps the trace up to the highest latency around where this max_latency
buffer is never written to. It is only used to save the last max latency
trace.
A while ago a snapshot feature was added to tracefs to allow user space to
perform the same logic. It could also enable events to trigger a
"snapshot" if one of their fields hit a new high. This was built on top of
the latency max_latency buffer logic.
Because snapshots came later, they were dependent on the latency tracers
to be enabled. In reality, the latency tracers depend on the snapshot code
and not the other way around. It was just that they came first.
Restructure the code and the kconfigs to have the latency tracers depend
on snapshot code instead. This actually simplifies the logic a bit and
allows to disable more when the latency tracers are not defined and the
snapshot code is.
- Fix a "false sharing" in the hwlat tracer code
The loop to search for latency in hardware was using a variable that could
be changed by user space for each sample. If the user change this
variable, it could cause a bus contention, and reading that variable can
show up as a large latency in the trace causing a false positive. Read
this variable at the start of the sample with a READ_ONCE() into a local
variable and keep the code from sharing cache lines with readers.
- Fix function graph tracer static branch optimization code
When only one tracer is defined for function graph tracing, it uses a
static branch to call that tracer directly. When another tracer is added,
it goes into loop logic to call all the registered callbacks.
The code was incorrect when going back to one tracer and never re-enabled
the static branch again to do the optimization code.
- And other small fixes and cleanups.
-----BEGIN PGP SIGNATURE-----
iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaY9P3BQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qou3AQCCrzdrIglLNABGPyny9sqWLDz6vyyw
nWAK9xg1VFxwRQD+LyJvVMWbpGeRBS/PsAK19RgldbgkQFWNv8gNhRKRgw0=
=U/kg
-----END PGP SIGNATURE-----
Merge tag 'trace-v7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing updates from Steven Rostedt:
"User visible changes:
- Add an entry into MAINTAINERS file for RUST versions of code
There's now RUST code for tracing and static branches. To
differentiate that code from the C code, add entries in for the
RUST version (with "[RUST]" around it) so that the right
maintainers get notified on changes.
- New bitmask-list option added to tracefs
When this is set, bitmasks in trace event are not displayed as hex
numbers, but instead as lists: e.g. 0-5,7,9 instead of 0000015f
- New show_event_filters file in tracefs
Instead of having to search all events/*/*/filter for any active
filters enabled in the trace instance, the file show_event_filters
will list them so that there's only one file that needs to be
examined to see if any filters are active.
- New show_event_triggers file in tracefs
Instead of having to search all events/*/*/trigger for any active
triggers enabled in the trace instance, the file
show_event_triggers will list them so that there's only one file
that needs to be examined to see if any triggers are active.
- Have traceoff_on_warning disable trace pintk buffer too
Recently recording of trace_printk() could go to other trace
instances instead of the top level instance. But if
traceoff_on_warning triggers, it doesn't stop the buffer with
trace_printk() and that data can easily be lost by being
overwritten. Have traceoff_on_warning also disable the instance
that has trace_printk() being written to it.
- Update the hist_debug file to show what function the field uses
When CONFIG_HIST_TRIGGERS_DEBUG is enabled, a hist_debug file
exists for every event. This displays the internal data of any
histogram enabled for that event. But it is lacking the function
that is called to process one of its fields. This is very useful
information that was missing when debugging histograms.
- Up the histogram stack size from 16 to 31
Stack traces can be used as keys for event histograms. Currently
the size of the stack that is stored is limited to just 16 entries.
But the storage space in the histogram is 256 bytes, meaning that
it can store up to 31 entries (plus one for the count of entries).
Instead of letting that space go to waste, up the limit from 16 to
31. This makes the keys much more useful.
- Fix permissions of per CPU file buffer_size_kb
The per CPU file of buffer_size_kb was incorrectly set to read only
in a previous cleanup. It should be writable.
- Reset "last_boot_info" if the persistent buffer is cleared
The last_boot_info shows address information of a persistent ring
buffer if it contains data from a previous boot. It is cleared when
recording starts again, but it is not cleared when the buffer is
reset. The data is useless after a reset so clear it on reset too.
Internal changes:
- A change was made to allow tracepoint callbacks to have preemption
enabled, and instead be protected by SRCU. This required some
updates to the callbacks for perf and BPF.
perf needed to disable preemption directly in its callback because
it expects preemption disabled in the later code.
BPF needed to disable migration, as its code expects to run
completely on the same CPU.
- Have irq_work wake up other CPU if current CPU is "isolated"
When there's a waiter waiting on ring buffer data and a new event
happens, an irq work is triggered to wake up that waiter. This is
noisy on isolated CPUs (running NO_HZ_FULL). Trigger an IPI to a
house keeping CPU instead.
- Use proper free of trigger_data instead of open coding it in.
- Remove redundant call of event_trigger_reset_filter()
It was called immediately in a function that was called right after
it.
- Workqueue cleanups
- Report errors if tracing_update_buffers() were to fail.
- Make the enum update workqueue generic for other parts of tracing
On boot up, a work queue is created to convert enum names into
their numbers in the trace event format files. This work queue can
also be used for other aspects of tracing that takes some time and
shouldn't be called by the init call code.
The blk_trace initialization takes a bit of time. Have the
initialization code moved to the new tracing generic work queue
function.
- Skip kprobe boot event creation call if there's no kprobes defined
on cmdline
The kprobe initialization to set up kprobes if they are defined on
the cmdline requires taking the event_mutex lock. This can be held
by other tracing code doing initialization for a long time. Since
kprobes added to the kernel command line need to be setup
immediately, as they may be tracing early initialization code, they
cannot be postponed in a work queue and must be setup in the
initcall code.
If there's no kprobe on the kernel cmdline, there's no reason to
take the mutex and slow down the boot up code waiting to get the
lock only to find out there's nothing to do. Simply exit out early
if there's no kprobes on the kernel cmdline.
If there are kprobes on the cmdline, then someone cares more about
tracing over the speed of boot up.
- Clean up the trigger code a bit
- Move code out of trace.c and into their own files
trace.c is now over 11,000 lines of code and has become more
difficult to maintain. Start splitting it up so that related code
is in their own files.
Move all the trace_printk() related code into trace_printk.c.
Move the __always_inline stack functions into trace.h.
Move the pid filtering code into a new trace_pid.c file.
- Better define the max latency and snapshot code
The latency tracers have a "max latency" buffer that is a copy of
the main buffer and gets swapped with it when a new high latency is
detected. This keeps the trace up to the highest latency around
where this max_latency buffer is never written to. It is only used
to save the last max latency trace.
A while ago a snapshot feature was added to tracefs to allow user
space to perform the same logic. It could also enable events to
trigger a "snapshot" if one of their fields hit a new high. This
was built on top of the latency max_latency buffer logic.
Because snapshots came later, they were dependent on the latency
tracers to be enabled. In reality, the latency tracers depend on
the snapshot code and not the other way around. It was just that
they came first.
Restructure the code and the kconfigs to have the latency tracers
depend on snapshot code instead. This actually simplifies the logic
a bit and allows to disable more when the latency tracers are not
defined and the snapshot code is.
- Fix a "false sharing" in the hwlat tracer code
The loop to search for latency in hardware was using a variable
that could be changed by user space for each sample. If the user
change this variable, it could cause a bus contention, and reading
that variable can show up as a large latency in the trace causing a
false positive. Read this variable at the start of the sample with
a READ_ONCE() into a local variable and keep the code from sharing
cache lines with readers.
- Fix function graph tracer static branch optimization code
When only one tracer is defined for function graph tracing, it uses
a static branch to call that tracer directly. When another tracer
is added, it goes into loop logic to call all the registered
callbacks.
The code was incorrect when going back to one tracer and never
re-enabled the static branch again to do the optimization code.
- And other small fixes and cleanups"
* tag 'trace-v7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (46 commits)
function_graph: Restore direct mode when callbacks drop to one
tracing: Fix indentation of return statement in print_trace_fmt()
tracing: Reset last_boot_info if ring buffer is reset
tracing: Fix to set write permission to per-cpu buffer_size_kb
tracing: Fix false sharing in hwlat get_sample()
tracing: Move d_max_latency out of CONFIG_FSNOTIFY protection
tracing: Better separate SNAPSHOT and MAX_TRACE options
tracing: Add tracer_uses_snapshot() helper to remove #ifdefs
tracing: Rename trace_array field max_buffer to snapshot_buffer
tracing: Move pid filtering into trace_pid.c
tracing: Move trace_printk functions out of trace.c and into trace_printk.c
tracing: Use system_state in trace_printk_init_buffers()
tracing: Have trace_printk functions use flags instead of using global_trace
tracing: Make tracing_update_buffers() take NULL for global_trace
tracing: Make printk_trace global for tracing system
tracing: Move ftrace_trace_stack() out of trace.c and into trace.h
tracing: Move __trace_buffer_{un}lock_*() functions to trace.h
tracing: Make tracing_selftest_running global to the tracing subsystem
tracing: Make tracing_disabled global for tracing system
tracing: Clean up use of trace_create_maxlat_file()
...
A small code cleanup for DMA-mapping subsystem:
- removal of unused hooks (Robin Murphy)
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSrngzkoBtlA8uaaJ+Jp1EFxbsSRAUCaY80tQAKCRCJp1EFxbsS
RHQUAP9yrD7PczbUfOCcYoxvmyqYZ76gQLwnx9GB4Gi0pctxGAEA1hxsu+LisN5x
uMc3XW/zluzQj/gn+jWCD5v5DzdEWA4=
=uf46
-----END PGP SIGNATURE-----
Merge tag 'dma-mapping-7.0-2026-02-13' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux
Pull dma-mapping update from Marek Szyprowski:
"A small code cleanup for the DMA-mapping subsystem: removal of unused
hooks (Robin Murphy)"
* tag 'dma-mapping-7.0-2026-02-13' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
dma-mapping: Remove dma_mark_clean (again)
The add_fd_from_fd_array() function takes a file descriptor as a
parameter and tries to add either map or btf to the corresponding
list of used objects. As was reported by Dan Carpenter, since the
commit c81e4322acf0 ("bpf: Fix a potential use-after-free of BTF
object"), the fdget() is called twice on the file descriptor, and
thus userspace, potentially, can replace the file pointed to by the
file descriptor in between the two calls. On practice, this shouldn't
break anything on the kernel side, but for consistency fix the code
such that only one fdget() is executed.
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/aY689z7gHNv8rgVO@stanley.mountain/
Fixes: ccd2d799ed ("bpf: Fix a potential use-after-free of BTF object")
Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Link: https://lore.kernel.org/r/20260213212949.759321-1-a.s.protopopov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
- in order support in virtio core
- multiple address space support in vduse
- fixes, cleanups all over the place, notably
- dma alignment fixes for non cache coherent systems
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQFDBAABCgAtFiEEXQn9CHHI+FuUyooNKB8NuNKNVGkFAmmO9rYPHG1zdEByZWRo
YXQuY29tAAoJECgfDbjSjVRpBzYH/2wUPo3T8/CKGFjF7QSPzgL/UI2NhnP8iSm4
btg1zVnrWmJK6vVIwnf5UsG8dFKsMcp/BEGCewTmIddNM2wEeSul0kKDXtIzrK/U
jdA9bJrUKLMeU7IFKne1Fip/yE+5nkWJttWXXyVRJtOJrYxZlkWfqSns3qYcPvsG
g7HXvF6tmici5uoKdRCLqHtQCWsvpnvTD5A7qoZAlEUjlQCDKKmuukpN9oK5UYLl
9uUOgPQAJaxIwx1C4uP7L+AwbLUcN/+MtrvQRNz+sFpP3sN9oXeDJKBpNQp109NB
JGk1sUsINL+54Cmdd5RwZ9T1vBJyRDrdWRDy1yHj95LildaPfh0=
=pnob
-----END PGP SIGNATURE-----
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio updates from Michael Tsirkin:
- in-order support in virtio core
- multiple address space support in vduse
- fixes, cleanups all over the place, notably dma alignment fixes for
non-cache-coherent systems
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (59 commits)
vduse: avoid adding implicit padding
vhost: fix caching attributes of MMIO regions by setting them explicitly
vdpa/mlx5: update MAC address handling in mlx5_vdpa_set_attr()
vdpa/mlx5: reuse common function for MAC address updates
vdpa/mlx5: update mlx_features with driver state check
crypto: virtio: Replace package id with numa node id
crypto: virtio: Remove duplicated virtqueue_kick in virtio_crypto_skcipher_crypt_req
crypto: virtio: Add spinlock protection with virtqueue notification
Documentation: Add documentation for VDUSE Address Space IDs
vduse: bump version number
vduse: add vq group asid support
vduse: merge tree search logic of IOTLB_GET_FD and IOTLB_GET_INFO ioctls
vduse: take out allocations from vduse_dev_alloc_coherent
vduse: remove unused vaddr parameter of vduse_domain_free_coherent
vduse: refactor vdpa_dev_add for goto err handling
vhost: forbid change vq groups ASID if DRIVER_OK is set
vdpa: document set_group_asid thread safety
vduse: return internal vq group struct as map token
vduse: add vq group support
vduse: add v1 API definition
...
When registering a second fgraph callback, direct path is disabled and
array loop is used instead. When ftrace_graph_active falls back to one,
we try to re-enable direct mode via ftrace_graph_enable_direct(true, ...).
But ftrace_graph_enable_direct() incorrectly disables the static key
rather than enabling it. This leaves fgraph_do_direct permanently off
after first multi-callback transition, so direct fast mode is never
restored.
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260213142932519cuWSpEXeS4-UnCvNXnK2P@zte.com.cn
Fixes: cc60ee813b ("function_graph: Use static_call and branch to optimize entry function")
Signed-off-by: Shengming Hu <hu.shengming@zte.com.cn>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
- Add support for control flow integrity for userspace processes.
This is based on the standard RISC-V ISA extensions Zicfiss and
Zicfilp
- Improve ptrace behavior regarding vector registers, and add some selftests
- Optimize our strlen() assembly
- Enable the ISO-8859-1 code page as built-in, similar to ARM64, for EFI
volume mounting
- Clean up some code slightly, including defining copy_user_page() as
copy_page() rather than memcpy(), aligning us with other
architectures; and using max3() to slightly simplify an expression
in riscv_iommu_init_check()
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEElRDoIDdEz9/svf2Kx4+xDQu9KksFAmmOYpYACgkQx4+xDQu9
KkvzOQ/9Fq8ZxWgYofhTPtw9/vps3avheOHlEoRrBWYfn1VkTRPAcbUULL4PGXwg
dnVFEl3AcrpOFikIthbukklLeLoOnUshZJBU25zY5h0My1jb63V1//gEwJR6I0dg
+V+GJmfzc4+YVaHK6UFdn7j3GgKUbTC7xXRMuGEriAzKPnm3AXAjh94wMNx6depv
Li3IXRoZT/HvqIAyfeAoM9STwOzJtE3Sc6fXABkzsIbNTjjdgIqoRSsQsKY10178
z6ox/sVStnLmVaMbOd/ZVN0J70JRDsvK0TC0/13K1ESUbnVia9a3bPIxLRmSapKC
wXnwAuSeevtFshGGyd5LZO0QQGxzG1H63Gky2GRoh8bTQbd2tQcfQzANdnPkBAQS
j2aOiSsiUQeNZqfZAfEBwRd27GXRYlKb/MpgCZKUH+ZO9VG6QaD3VGvg17/Caghy
nVdbBQ81ZV9tkz9EMN0vt2VJHmEqARh88w619laHjg+ioPTG4/UIDPzskt1I+Fgm
Y6NQLeFyfaO3RKKDYWGPcY7fmWQI9V8MECHOvyVI4xJcgqAbqnfsgytjuiFbrfRo
fTvpuB7kvltBZ180QSB79xj0sWGFTWR02MeWy3uOaLZz2eIm2ZTZbMUSgNYR0ldG
L3y7CEkTkoVF1ijYgAfuMgptk3Yf0dpa66D9HUo947wWkNrW5ds=
=4fTk
-----END PGP SIGNATURE-----
Merge tag 'riscv-for-linus-7.0-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V updates from Paul Walmsley:
- Add support for control flow integrity for userspace processes.
This is based on the standard RISC-V ISA extensions Zicfiss and
Zicfilp
- Improve ptrace behavior regarding vector registers, and add some
selftests
- Optimize our strlen() assembly
- Enable the ISO-8859-1 code page as built-in, similar to ARM64, for
EFI volume mounting
- Clean up some code slightly, including defining copy_user_page() as
copy_page() rather than memcpy(), aligning us with other
architectures; and using max3() to slightly simplify an expression
in riscv_iommu_init_check()
* tag 'riscv-for-linus-7.0-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: (42 commits)
riscv: lib: optimize strlen loop efficiency
selftests: riscv: vstate_exec_nolibc: Use the regular prctl() function
selftests: riscv: verify ptrace accepts valid vector csr values
selftests: riscv: verify ptrace rejects invalid vector csr inputs
selftests: riscv: verify syscalls discard vector context
selftests: riscv: verify initial vector state with ptrace
selftests: riscv: test ptrace vector interface
riscv: ptrace: validate input vector csr registers
riscv: csr: define vtype register elements
riscv: vector: init vector context with proper vlenb
riscv: ptrace: return ENODATA for inactive vector extension
kselftest/riscv: add kselftest for user mode CFI
riscv: add documentation for shadow stack
riscv: add documentation for landing pad / indirect branch tracking
riscv: create a Kconfig fragment for shadow stack and landing pad support
arch/riscv: add dual vdso creation logic and select vdso based on hw
arch/riscv: compile vdso with landing pad and shadow stack note
riscv: enable kernel access to shadow stack memory via the FWFT SBI call
riscv: add kernel command line option to opt out of user CFI
riscv/hwprobe: add zicfilp / zicfiss enumeration in hwprobe
...
- A set of commits that introduces cxl_memdev_attach and pave way for
soft reserved handling, type2 accelerator enabling, and LSA 2.0
enabling. All these series require the endpoint driver to settle
before continuing the memdev driver probe.
dax/hmem, e820, resource: Defer Soft Reserved insertion until hmem is ready
cxl/mem: Introduce cxl_memdev_attach for CXL-dependent operation
cxl/mem: Drop @host argument to devm_cxl_add_memdev()
cxl/mem: Convert devm_cxl_add_memdev() to scope-based-cleanup
cxl/port: Arrange for always synchronous endpoint attach
cxl/mem: Arrange for always-synchronous memdev attach
cxl/mem: Fix devm_cxl_memdev_edac_release() confusion
- A set to address CXL port error protocol handling and reporting. The
large patch series was split into 3 parts. Part 1 and 2 are included
here with part 3 coming later. Part 1 consists of a series of code
refactoring to PCI AER sub-system that addresses CXL and also CXL
RAS code to prepare for port error handling. Part 2 refactors the
CXL code to move management of component registers to cxl_port
objects to allow all CXL AER errors to be handled through the
cxl_port hierarchy.
Part 2:
cxl/port: Move endpoint component register management to cxl_port
cxl/port: Map Port RAS registers
cxl/port: Move dport RAS setup to dport add time
cxl/port: Move dport probe operations to a driver event
cxl/port: Move decoder setup before dport creation
cxl/port: Cleanup dport removal with a devres group
cxl/port: Reduce number of @dport variables in cxl_port_add_dport()
cxl/port: Cleanup handling of the nr_dports 0 -> 1 transition
Part 1:
cxl: Update RAS handler interfaces to also support CXL Ports
cxl/mem: Clarify @host for devm_cxl_add_nvdimm()
PCI/AER: Update struct aer_err_info with kernel-doc formatting
PCI/AER: Report CXL or PCIe bus type in AER trace logging
PCI/AER: Use guard() in cxl_rch_handle_error_iter()
PCI/AER: Move CXL RCH error handling to aer_cxl_rch.c
PCI/AER: Update is_internal_error() to be non-static is_aer_internal_error()
PCI/AER: Export pci_aer_unmask_internal_errors()
cxl/pci: Move CXL driver's RCH error handling into core/ras_rch.c
PCI/AER: Replace PCIEAER_CXL symbol with CXL_RAS
cxl/pci: Remove CXL VH handling in CONFIG_PCIEAER_CXL conditional blocks from core/pci.c
PCI: Replace cxl_error_is_native() with pcie_aer_is_native()
cxl/pci: Remove unnecessary CXL RCH handling helper functions
cxl/pci: Remove unnecessary CXL Endpoint handling helper functions
PCI: Introduce pcie_is_cxl()
PCI: Update CXL DVSEC definitions
PCI: Move CXL DVSEC definitions into uapi/linux/pci_regs.h
- A set of patches to provide AMD Zen5 platform address translation for
CXL using ACPI PRMT. Set includes a conventions document to explain
why this is needed and how it's implemented.
cxl: Disable HPA/SPA translation handlers for Normalized Addressing
cxl/region: Factor out code into cxl_region_setup_poison()
cxl/atl: Lock decoders that need address translation
cxl: Enable AMD Zen5 address translation using ACPI PRMT
cxl/acpi: Prepare use of EFI runtime services
cxl: Introduce callback for HPA address ranges translation
cxl/region: Use region data to get the root decoder
cxl/region: Add @hpa_range argument to function cxl_calc_interleave_pos()
cxl/region: Separate region parameter setup and region construction
cxl: Simplify cxl_root_ops allocation and handling
cxl/region: Store HPA range in struct cxl_region
cxl/region: Store root decoder in struct cxl_region
cxl/region: Rename misleading variable name @hpa to @hpa_range
Documentation/driver-api/cxl: ACPI PRM Address Translation Support and AMD Zen5 enablement
cxl, doc: Moving conventions in separate files
cxl, doc: Remove isonum.txt inclusion
- A set of misc CXL patches of fixes, cleanups, and updates. Including
CXL address translation for unaligned MOD3 regions.
cxl: Fix premature commit_end increment on decoder commit failure
cxl/region: Use do_div() for 64-bit modulo operation
cxl/region: Translate HPA to DPA and memdev in unaligned regions
cxl/region: Translate DPA->HPA in unaligned MOD3 regions
cxl/core: Fix cxl_dport debugfs EINJ entries
cxl/acpi: Remove cxl_acpi_set_cache_size()
cxl/hdm: Fix newline character in dev_err() messages
cxl/pci: Remove outdated FIXME comment and BUILD_BUG_ON
Documentation/driver-api/cxl: device hotplug section
Documentation/driver-api/cxl: BIOS/EFI expectation update
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE5DAy15EJMCV1R6v9YGjFFmlTOEoFAmmOFXcACgkQYGjFFmlT
OEojaxAApQJFLyX1MkPbhtm6j6GRzzEAEWTBX2XsmliZf1JhfahsNMWI69kO33rm
LddF+nyZNEl/foyHgUaxVzlQwqWuihyp7Qk2djXnMzLsuCAsWhPbB9j0RgJUN8h5
N4U76AmOdmhLlXH4CCqoW2jNy0OjxNdgp1FtTHv7VO7RxgRE9MFJRkLulKxB03wy
t6lRZXPofEFcHen40DlYRtW26vy1BYUO0dng2f16DxWrb1ztdACH/zVqCJJtdoFc
FAT5EaQCeRYZ9Yz4dONw3DcUjYlG6NcRN9FWNiptBn1Pb7pUX55Le8lfD3qZg0an
m3lWRs1T/lGz7pWmz4GPUKDwGFCEqLqd4oSz5v+dFR3JJxjJpRzKa19y5TfqK/LF
diqNZsDD9gCXE1HXzNr1YcbllpU2cPRPf58gWG9bLmG5xUUmScib8LoTMfgcCJW5
SlC6kf7BFLkJfDTcFaILc/UANeZaLGhrV0vyJntfGyT5EqKOcfjQEvrZvofA8mef
bdxt0IRDW4D+7kkcuR33OipTVUFG3ban8yYq4zXD64dmeHF76gwdJm3nyXsqdtpc
IYIIhz0W6pbTKjJ2fy1rZcTac1ZaALstyaF4bYWIjyF3NylPM8tDi48DFr+DGgeX
xkFs2B9p5vY5Cq73gCmSWsi3PBPTjWzeRp7YZrV6VoBd9uqewUs=
=blFQ
-----END PGP SIGNATURE-----
Merge tag 'cxl-for-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl
Pull CXL updates from Dave Jiang:
- Introduce cxl_memdev_attach and pave way for soft reserved handling,
type2 accelerator enabling, and LSA 2.0 enabling. All these series
require the endpoint driver to settle before continuing the memdev
driver probe.
- Address CXL port error protocol handling and reporting.
The large patch series was split into three parts. The first two
parts are included here with the final part coming later.
The first part consists of a series of code refactoring to PCI AER
sub-system that addresses CXL and also CXL RAS code to prepare for
port error handling.
The second part refactors the CXL code to move management of
component registers to cxl_port objects to allow all CXL AER errors
to be handled through the cxl_port hierarchy.
- Provide AMD Zen5 platform address translation for CXL using ACPI
PRMT. This includes a conventions document to explain why this is
needed and how it's implemented.
- Misc CXL patches of fixes, cleanups, and updates. Including CXL
address translation for unaligned MOD3 regions.
[ TLA service: CXL is "Compute Express Link" ]
* tag 'cxl-for-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: (59 commits)
cxl: Disable HPA/SPA translation handlers for Normalized Addressing
cxl/region: Factor out code into cxl_region_setup_poison()
cxl/atl: Lock decoders that need address translation
cxl: Enable AMD Zen5 address translation using ACPI PRMT
cxl/acpi: Prepare use of EFI runtime services
cxl: Introduce callback for HPA address ranges translation
cxl/region: Use region data to get the root decoder
cxl/region: Add @hpa_range argument to function cxl_calc_interleave_pos()
cxl/region: Separate region parameter setup and region construction
cxl: Simplify cxl_root_ops allocation and handling
cxl/region: Store HPA range in struct cxl_region
cxl/region: Store root decoder in struct cxl_region
cxl/region: Rename misleading variable name @hpa to @hpa_range
Documentation/driver-api/cxl: ACPI PRM Address Translation Support and AMD Zen5 enablement
cxl, doc: Moving conventions in separate files
cxl, doc: Remove isonum.txt inclusion
cxl/port: Unify endpoint and switch port lookup
cxl/port: Move endpoint component register management to cxl_port
cxl/port: Map Port RAS registers
cxl/port: Move dport RAS setup to dport add time
...
The following pr_warn() provides detailed error and location information,
WARN_ON(err) adds no additional debugging value, so remove the redundant
WARN_ON() call.
Link: https://lkml.kernel.org/r/20260212111146.210086-3-ranxiaokai627@163.com
Signed-off-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "two fixes in kho_populate()", v3.
This patch (of 2):
kho_populate() returns without calling early_memunmap() on success path,
this will cause early ioremap virtual address space leak.
Link: https://lkml.kernel.org/r/20260212111146.210086-1-ranxiaokai627@163.com
Link: https://lkml.kernel.org/r/20260212111146.210086-2-ranxiaokai627@163.com
Fixes: b50634c5e8 ("kho: cleanup error handling in kho_populate()")
Signed-off-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We will be shortly removing the vm_flags_t field from vm_area_desc so we
need to update all mmap_prepare users to only use the dessc->vma_flags
field.
This patch achieves that and makes all ancillary changes required to make
this possible.
This lays the groundwork for future work to eliminate the use of
vm_flags_t in vm_area_desc altogether and more broadly throughout the
kernel.
While we're here, we take the opportunity to replace VM_REMAP_FLAGS with
VMA_REMAP_FLAGS, the vma_flags_t equivalent.
No functional changes intended.
Link: https://lkml.kernel.org/r/fb1f55323799f09fe6a36865b31550c9ec67c225.1769097829.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Damien Le Moal <dlemoal@kernel.org> [zonefs]
Acked-by: "Darrick J. Wong" <djwong@kernel.org>
Acked-by: Pedro Falcato <pfalcato@suse.de>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Yury Norov <ynorov@nvidia.com>
Cc: Chris Mason <clm@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm/vmscan: fix demotion targets checks in reclaim/demotion",
v9.
This patch series addresses two issues in demote_folio_list(),
can_demote(), and next_demotion_node() in reclaim/demotion.
1. demote_folio_list() and can_demote() do not correctly check
demotion target against cpuset.mems_effective, which will cause (a)
pages to be demoted to not-allowed nodes and (b) pages fail demotion
even if the system still has allowed demotion nodes.
Patch 1 fixes this bug by updating cpuset_node_allowed() and
mem_cgroup_node_allowed() to return effective_mems, allowing directly
logic-and operation against demotion targets.
2. next_demotion_node() returns a preferred demotion target, but it
does not check the node against allowed nodes.
Patch 2 ensures that next_demotion_node() filters against the allowed
node mask and selects the closest demotion target to the source node.
This patch (of 2):
Fix two bugs in demote_folio_list() and can_demote() due to incorrect
demotion target checks against cpuset.mems_effective in reclaim/demotion.
Commit 7d709f49ba ("vmscan,cgroup: apply mems_effective to reclaim")
introduces the cpuset.mems_effective check and applies it to can_demote().
However:
1. It does not apply this check in demote_folio_list(), which leads
to situations where pages are demoted to nodes that are
explicitly excluded from the task's cpuset.mems.
2. It checks only the nodes in the immediate next demotion hierarchy
and does not check all allowed demotion targets in can_demote().
This can cause pages to never be demoted if the nodes in the next
demotion hierarchy are not set in mems_effective.
These bugs break resource isolation provided by cpuset.mems. This is
visible from userspace because pages can either fail to be demoted
entirely or are demoted to nodes that are not allowed in multi-tier memory
systems.
To address these bugs, update cpuset_node_allowed() and
mem_cgroup_node_allowed() to return effective_mems, allowing directly
logic-and operation against demotion targets. Also update can_demote()
and demote_folio_list() accordingly.
Bug 1 reproduction:
Assume a system with 4 nodes, where nodes 0-1 are top-tier and
nodes 2-3 are far-tier memory. All nodes have equal capacity.
Test script:
echo 1 > /sys/kernel/mm/numa/demotion_enabled
mkdir /sys/fs/cgroup/test
echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
echo "0-2" > /sys/fs/cgroup/test/cpuset.mems
echo $$ > /sys/fs/cgroup/test/cgroup.procs
swapoff -a
# Expectation: Should respect node 0-2 limit.
# Observation: Node 3 shows significant allocation (MemFree drops)
stress-ng --oomable --vm 1 --vm-bytes 150% --mbind 0,1
Bug 2 reproduction:
Assume a system with 6 nodes, where nodes 0-2 are top-tier,
node 3 is a far-tier node, and nodes 4-5 are the farthest-tier nodes.
All nodes have equal capacity.
Test script:
echo 1 > /sys/kernel/mm/numa/demotion_enabled
mkdir /sys/fs/cgroup/test
echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
echo "0-2,4-5" > /sys/fs/cgroup/test/cpuset.mems
echo $$ > /sys/fs/cgroup/test/cgroup.procs
swapoff -a
# Expectation: Pages are demoted to Nodes 4-5
# Observation: No pages are demoted before oom.
stress-ng --oomable --vm 1 --vm-bytes 150% --mbind 0,1,2
Link: https://lkml.kernel.org/r/20260114205305.2869796-1-bingjiao@google.com
Link: https://lkml.kernel.org/r/20260114205305.2869796-2-bingjiao@google.com
Fixes: 7d709f49ba ("vmscan,cgroup: apply mems_effective to reclaim")
Signed-off-by: Bing Jiao <bingjiao@google.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Gregory Price <gourry@gourry.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
- Refactor da_monitor to minimise macros
Complete refactor of da_monitor.h to reduce reliance on macros
generating functions. Use generic static functions and uses the
preprocessor only when strictly necessary (e.g. for tracepoint
handlers).
The change essentially relies on functions with generic names (e.g.
da_handle) instead of monitor-specific as well adding the need to
define constant (e.g. MONITOR_NAME, MONITOR_TYPE) before including the
header rather than calling macros that would define functions.
Also adapt monitors and documentation accordingly.
- Cleanup DA code generation scripts
Clean up functions in dot2c removing reimplementations of trivial
library functions (__buff_to_string) and removing some other unused
intermediate steps.
- Annotate functions with types in the rvgen python scripts
- Remove superfluous assignments and cleanup generated code
The rvgen scripts generate a superfluous assignment to 0 for enum
variables and don't add commas to the last elements, which is against
the kernel coding standards. Change the generation process for a
better compliance and slightly simpler logic.
- Remove superfluous declarations from generated code
The monitor container source files contained a declaration and a
definition for the rv_monitor variable. The former is superfluous and
was removed.
- Fix reference to outdated documentation
s/da_monitor_synthesis.rst/monitor_synthesis.rst in comment in
da_monitor.h
-----BEGIN PGP SIGNATURE-----
iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaY3j9xQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qup5AP0afGb2gr+SOIOu0dIAXjGLg3cbz0C4
J4bDx6H0cdgXywD+Maassm0nKIgDnOBLwtuwL0kEUAxllqn7RCZ2+ARwbAU=
=dfCl
-----END PGP SIGNATURE-----
Merge tag 'trace-rv-v7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull runtime verifier updates from Steven Rostedt:
- Refactor da_monitor to minimize macros
Complete refactor of da_monitor.h to reduce reliance on macros
generating functions. Use generic static functions and uses the
preprocessor only when strictly necessary (e.g. for tracepoint
handlers).
The change essentially relies on functions with generic names (e.g.
da_handle) instead of monitor-specific as well adding the need to
define constant (e.g. MONITOR_NAME, MONITOR_TYPE) before including
the header rather than calling macros that would define functions.
Also adapt monitors and documentation accordingly.
- Cleanup DA code generation scripts
Clean up functions in dot2c removing reimplementations of trivial
library functions (__buff_to_string) and removing some other unused
intermediate steps.
- Annotate functions with types in the rvgen python scripts
- Remove superfluous assignments and cleanup generated code
The rvgen scripts generate a superfluous assignment to 0 for enum
variables and don't add commas to the last elements, which is against
the kernel coding standards. Change the generation process for a
better compliance and slightly simpler logic.
- Remove superfluous declarations from generated code
The monitor container source files contained a declaration and a
definition for the rv_monitor variable. The former is superfluous and
was removed.
- Fix reference to outdated documentation
s/da_monitor_synthesis.rst/monitor_synthesis.rst in comment in
da_monitor.h
* tag 'trace-rv-v7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
rv: Fix documentation reference in da_monitor.h
verification/rvgen: Remove unused variable declaration from containers
verification/dot2c: Remove superfluous enum assignment and add last comma
verification/dot2c: Remove __buff_to_string() and cleanup
verification/rvgen: Annotate DA functions with types
verification/rvgen: Adapt dot2k and templates after refactoring da_monitor.h
Documentation/rv: Adapt documentation after da_monitor refactoring
rv: Cleanup da_monitor after refactor
rv: Refactor da_monitor to minimise macros
Total patches: 107
Reviews/patch: 1.07
Reviewed rate: 67%
- The 2 patch series "ocfs2: give ocfs2 the ability to reclaim
suballocator free bg" from Heming Zhao saves disk space by teaching
ocfs2 to reclaim suballocator block group space.
- The 4 patch series "Add ARRAY_END(), and use it to fix off-by-one
bugs" from Alejandro Colomar adds the ARRAY_END() macro and uses it in
various places.
- The 2 patch series "vmcoreinfo: support VMCOREINFO_BYTES larger than
PAGE_SIZE" from Pnina Feder makes the vmcore code future-safe, if
VMCOREINFO_BYTES ever exceeds the page size.
- The 7 patch series "kallsyms: Prevent invalid access when showing
module buildid" from Petr Mladek cleans up kallsyms code related to
module buildid and fixes an invalid access crash when printing
backtraces.
- The 3 patch series "Address page fault in
ima_restore_measurement_list()" from Harshit Mogalapalli fixes a
kexec-related crash that can occur when booting the second-stage kernel
on x86.
- The 6 patch series "kho: ABI headers and Documentation updates" from
Mike Rapoport updates the kexec handover ABI documentation.
- The 4 patch series "Align atomic storage" from Finn Thain adds the
__aligned attribute to atomic_t and atomic64_t definitions to get
natural alignment of both types on csky, m68k, microblaze, nios2,
openrisc and sh.
- The 2 patch series "kho: clean up page initialization logic" from
Pratyush Yadav simplifies the page initialization logic in
kho_restore_page().
- The 6 patch series "Unload linux/kernel.h" from Yury Norov moves
several things out of kernel.h and into more appropriate places.
- The 7 patch series "don't abuse task_struct.group_leader" from Oleg
Nesterov removes the usage of ->group_leader when it is "obviously
unnecessary".
- The 5 patch series "list private v2 & luo flb" from Pasha Tatashin
adds some infrastructure improvements to the live update orchestrator.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaY4giAAKCRDdBJ7gKXxA
jgusAQDnKkP8UWTqXPC1jI+OrDJGU5ciAx8lzLeBVqMKzoYk9AD/TlhT2Nlx+Ef6
0HCUHUD0FMvAw/7/Dfc6ZKxwBEIxyww=
=mmsH
-----END PGP SIGNATURE-----
Merge tag 'mm-nonmm-stable-2026-02-12-10-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
- "ocfs2: give ocfs2 the ability to reclaim suballocator free bg" saves
disk space by teaching ocfs2 to reclaim suballocator block group
space (Heming Zhao)
- "Add ARRAY_END(), and use it to fix off-by-one bugs" adds the
ARRAY_END() macro and uses it in various places (Alejandro Colomar)
- "vmcoreinfo: support VMCOREINFO_BYTES larger than PAGE_SIZE" makes
the vmcore code future-safe, if VMCOREINFO_BYTES ever exceeds the
page size (Pnina Feder)
- "kallsyms: Prevent invalid access when showing module buildid" cleans
up kallsyms code related to module buildid and fixes an invalid
access crash when printing backtraces (Petr Mladek)
- "Address page fault in ima_restore_measurement_list()" fixes a
kexec-related crash that can occur when booting the second-stage
kernel on x86 (Harshit Mogalapalli)
- "kho: ABI headers and Documentation updates" updates the kexec
handover ABI documentation (Mike Rapoport)
- "Align atomic storage" adds the __aligned attribute to atomic_t and
atomic64_t definitions to get natural alignment of both types on
csky, m68k, microblaze, nios2, openrisc and sh (Finn Thain)
- "kho: clean up page initialization logic" simplifies the page
initialization logic in kho_restore_page() (Pratyush Yadav)
- "Unload linux/kernel.h" moves several things out of kernel.h and into
more appropriate places (Yury Norov)
- "don't abuse task_struct.group_leader" removes the usage of
->group_leader when it is "obviously unnecessary" (Oleg Nesterov)
- "list private v2 & luo flb" adds some infrastructure improvements to
the live update orchestrator (Pasha Tatashin)
* tag 'mm-nonmm-stable-2026-02-12-10-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (107 commits)
watchdog/hardlockup: simplify perf event probe and remove per-cpu dependency
procfs: fix missing RCU protection when reading real_parent in do_task_stat()
watchdog/softlockup: fix sample ring index wrap in need_counting_irqs()
kcsan, compiler_types: avoid duplicate type issues in BPF Type Format
kho: fix doc for kho_restore_pages()
tests/liveupdate: add in-kernel liveupdate test
liveupdate: luo_flb: introduce File-Lifecycle-Bound global state
liveupdate: luo_file: Use private list
list: add kunit test for private list primitives
list: add primitives for private list manipulations
delayacct: fix uapi timespec64 definition
panic: add panic_force_cpu= parameter to redirect panic to a specific CPU
netclassid: use thread_group_leader(p) in update_classid_task()
RDMA/umem: don't abuse current->group_leader
drm/pan*: don't abuse current->group_leader
drm/amd: kill the outdated "Only the pthreads threading model is supported" checks
drm/amdgpu: don't abuse current->group_leader
android/binder: use same_thread_group(proc->tsk, current) in binder_mmap()
android/binder: don't abuse current->group_leader
kho: skip memoryless NUMA nodes when reserving scratch areas
...
Everything:
Total patches: 325
Reviews/patch: 1.39
Reviewed rate: 72%
Excluding DAMON:
Total patches: 262
Reviews/patch: 1.63
Reviewed rate: 82%
Excluding DAMON and zram:
Total patches: 248
Reviews/patch: 1.72
Reviewed rate: 86%
- The 14 patch series "powerpc/64s: do not re-activate batched TLB
flush" from Alexander Gordeev makes arch_{enter|leave}_lazy_mmu_mode()
nest properly.
It adds a generic enter/leave layer and switches architectures to use
it. Various hacks were removed in the process.
- The 7 patch series "zram: introduce compressed data writeback" from
Richard Chang and Sergey Senozhatsky implements data compression for
zram writeback.
- The 8 patch series "mm: folio_zero_user: clear page ranges" from David
Hildenbrand adds clearing of contiguous page ranges for hugepages.
Large improvements during demand faulting are demonstrated.
- The 2 patch series "memcg cleanups" from Chen Ridong tideis up some
memcg code.
- The 12 patch series "mm/damon: introduce {,max_}nr_snapshots and
tracepoint for damos stats" from SeongJae Park improves DAMOS stat's
provided information, deterministic control, and readability.
- The 3 patch series "selftests/mm: hugetlb cgroup charging: robustness
fixes" from Li Wang fixes a few issues in the hugetlb cgroup charging
selftests.
- The 5 patch series "Fix va_high_addr_switch.sh test failure - again"
from Chunyu Hu addresses several issues in the va_high_addr_switch test.
- The 5 patch series "mm/damon/tests/core-kunit: extend existing test
scenarios" from Shu Anzai improves the KUnit test coverage for DAMON.
- The 2 patch series "mm/khugepaged: fix dirty page handling for
MADV_COLLAPSE" from Shivank Garg fixes a glitch in khugepaged which was
causing madvise(MADV_COLLAPSE) to transiently return -EAGAIN.
- The 29 patch series "arch, mm: consolidate hugetlb early reservation"
from Mike Rapoport reworks and consolidates a pile of straggly code
related to reservation of hugetlb memory from bootmem and creation of
CMA areas for hugetlb.
- The 9 patch series "mm: clean up anon_vma implementation" from Lorenzo
Stoakes cleans up the anon_vma implementation in various ways.
- The 3 patch series "tweaks for __alloc_pages_slowpath()" from
Vlastimil Babka does a little streamlining of the page allocator's
slowpath code.
- The 8 patch series "memcg: separate private and public ID namespaces"
from Shakeel Butt cleans up the memcg ID code and prevents the
internal-only private IDs from being exposed to userspace.
- The 6 patch series "mm: hugetlb: allocate frozen gigantic folio" from
Kefeng Wang cleans up the allocation of frozen folios and avoids some
atomic refcount operations.
- The 11 patch series "mm/damon: advance DAMOS-based LRU sorting" from
SeongJae Park improves DAMOS's movement of memory betewwn the active and
inactive LRUs and adds auto-tuning of the ratio-based quotas and of
monitoring intervals.
- The 18 patch series "Support page table check on PowerPC" from Andrew
Donnellan makes CONFIG_PAGE_TABLE_CHECK_ENFORCED work on powerpc.
- The 3 patch series "nodemask: align nodes_and{,not} with underlying
bitmap ops" from Yury Norov makes nodes_and() and nodes_andnot()
propagate the return values from the underlying bit operations, enabling
some cleanup in calling code.
- The 5 patch series "mm/damon: hide kdamond and kdamond_lock from API
callers" from SeongJae Park cleans up some DAMON internal interfaces.
- The 4 patch series "mm/khugepaged: cleanups and scan limit fix" from
Shivank Garg does some cleanup work in khupaged and fixes a scan limit
accounting issue.
- The 24 patch series "mm: balloon infrastructure cleanups" from David
Hildenbrand goes to town on the balloon infrastructure and its page
migration function. Mainly cleanups, also some locking simplification.
- The 2 patch series "mm/vmscan: add tracepoint and reason for
kswapd_failures reset" from Jiayuan Chen adds additional tracepoints to
the page reclaim code.
- The 3 patch series "Replace wq users and add WQ_PERCPU to
alloc_workqueue() users" from Marco Crivellari is part of Marco's
kernel-wide migration from the legacy workqueue APIs over to the
preferred unbound workqueues.
- The 9 patch series "Various mm kselftests improvements/fixes" from
Kevin Brodsky provides various unrelated improvements/fixes for the mm
kselftests.
- The 5 patch series "mm: accelerate gigantic folio allocation" from
Kefeng Wang greatly speeds up gigantic folio allocation, mainly by
avoiding unnecessary work in pfn_range_valid_contig().
- The 5 patch series "selftests/damon: improve leak detection and wss
estimation reliability" from SeongJae Park improves the reliability of
two of the DAMON selftests.
- The 8 patch series "mm/damon: cleanup kdamond, damon_call(), damos
filter and DAMON_MIN_REGION" from SeongJae Park does some cleanup work
in the core DAMON code.
- The 8 patch series "Docs/mm/damon: update intro, modules, maintainer
profile, and misc" from SeongJae Park performs maintenance work on the
DAMON documentation.
- The 10 patch series "mm: add and use vma_assert_stabilised() helper"
from Lorenzo Stoakes refactors and cleans up the core VMA code. The
main aim here is to be able to use the mmap write lock's lockdep state
to perform various assertions regarding the locking which the VMA code
requires.
- The 19 patch series "mm, swap: swap table phase II: unify swapin use"
from Kairui Song removes some old swap code (swap cache bypassing and
swap synchronization) which wasn't working very well. Various other
cleanups and simplifications were made. The end result is a 20% speedup
in one benchmark.
- The 8 patch series "enable PT_RECLAIM on more 64-bit architectures"
from Qi Zheng makes PT_RECLAIM available on 64-bit alpha, loongarch,
mips, parisc, um, Various cleanups were performed along the way.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaY1HfAAKCRDdBJ7gKXxA
jqhZAP9H8ZlKKqCEgnr6U5XXmJ63Ep2FDQpl8p35yr9yVuU9+gEAgfyWiJ43l1fP
rT0yjsUW3KQFBi/SEA3R6aYarmoIBgI=
=+HLt
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2026-02-11-19-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- "powerpc/64s: do not re-activate batched TLB flush" makes
arch_{enter|leave}_lazy_mmu_mode() nest properly (Alexander Gordeev)
It adds a generic enter/leave layer and switches architectures to use
it. Various hacks were removed in the process.
- "zram: introduce compressed data writeback" implements data
compression for zram writeback (Richard Chang and Sergey Senozhatsky)
- "mm: folio_zero_user: clear page ranges" adds clearing of contiguous
page ranges for hugepages. Large improvements during demand faulting
are demonstrated (David Hildenbrand)
- "memcg cleanups" tidies up some memcg code (Chen Ridong)
- "mm/damon: introduce {,max_}nr_snapshots and tracepoint for damos
stats" improves DAMOS stat's provided information, deterministic
control, and readability (SeongJae Park)
- "selftests/mm: hugetlb cgroup charging: robustness fixes" fixes a few
issues in the hugetlb cgroup charging selftests (Li Wang)
- "Fix va_high_addr_switch.sh test failure - again" addresses several
issues in the va_high_addr_switch test (Chunyu Hu)
- "mm/damon/tests/core-kunit: extend existing test scenarios" improves
the KUnit test coverage for DAMON (Shu Anzai)
- "mm/khugepaged: fix dirty page handling for MADV_COLLAPSE" fixes a
glitch in khugepaged which was causing madvise(MADV_COLLAPSE) to
transiently return -EAGAIN (Shivank Garg)
- "arch, mm: consolidate hugetlb early reservation" reworks and
consolidates a pile of straggly code related to reservation of
hugetlb memory from bootmem and creation of CMA areas for hugetlb
(Mike Rapoport)
- "mm: clean up anon_vma implementation" cleans up the anon_vma
implementation in various ways (Lorenzo Stoakes)
- "tweaks for __alloc_pages_slowpath()" does a little streamlining of
the page allocator's slowpath code (Vlastimil Babka)
- "memcg: separate private and public ID namespaces" cleans up the
memcg ID code and prevents the internal-only private IDs from being
exposed to userspace (Shakeel Butt)
- "mm: hugetlb: allocate frozen gigantic folio" cleans up the
allocation of frozen folios and avoids some atomic refcount
operations (Kefeng Wang)
- "mm/damon: advance DAMOS-based LRU sorting" improves DAMOS's movement
of memory betewwn the active and inactive LRUs and adds auto-tuning
of the ratio-based quotas and of monitoring intervals (SeongJae Park)
- "Support page table check on PowerPC" makes
CONFIG_PAGE_TABLE_CHECK_ENFORCED work on powerpc (Andrew Donnellan)
- "nodemask: align nodes_and{,not} with underlying bitmap ops" makes
nodes_and() and nodes_andnot() propagate the return values from the
underlying bit operations, enabling some cleanup in calling code
(Yury Norov)
- "mm/damon: hide kdamond and kdamond_lock from API callers" cleans up
some DAMON internal interfaces (SeongJae Park)
- "mm/khugepaged: cleanups and scan limit fix" does some cleanup work
in khupaged and fixes a scan limit accounting issue (Shivank Garg)
- "mm: balloon infrastructure cleanups" goes to town on the balloon
infrastructure and its page migration function. Mainly cleanups, also
some locking simplification (David Hildenbrand)
- "mm/vmscan: add tracepoint and reason for kswapd_failures reset" adds
additional tracepoints to the page reclaim code (Jiayuan Chen)
- "Replace wq users and add WQ_PERCPU to alloc_workqueue() users" is
part of Marco's kernel-wide migration from the legacy workqueue APIs
over to the preferred unbound workqueues (Marco Crivellari)
- "Various mm kselftests improvements/fixes" provides various unrelated
improvements/fixes for the mm kselftests (Kevin Brodsky)
- "mm: accelerate gigantic folio allocation" greatly speeds up gigantic
folio allocation, mainly by avoiding unnecessary work in
pfn_range_valid_contig() (Kefeng Wang)
- "selftests/damon: improve leak detection and wss estimation
reliability" improves the reliability of two of the DAMON selftests
(SeongJae Park)
- "mm/damon: cleanup kdamond, damon_call(), damos filter and
DAMON_MIN_REGION" does some cleanup work in the core DAMON code
(SeongJae Park)
- "Docs/mm/damon: update intro, modules, maintainer profile, and misc"
performs maintenance work on the DAMON documentation (SeongJae Park)
- "mm: add and use vma_assert_stabilised() helper" refactors and cleans
up the core VMA code. The main aim here is to be able to use the mmap
write lock's lockdep state to perform various assertions regarding
the locking which the VMA code requires (Lorenzo Stoakes)
- "mm, swap: swap table phase II: unify swapin use" removes some old
swap code (swap cache bypassing and swap synchronization) which
wasn't working very well. Various other cleanups and simplifications
were made. The end result is a 20% speedup in one benchmark (Kairui
Song)
- "enable PT_RECLAIM on more 64-bit architectures" makes PT_RECLAIM
available on 64-bit alpha, loongarch, mips, parisc, and um. Various
cleanups were performed along the way (Qi Zheng)
* tag 'mm-stable-2026-02-11-19-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (325 commits)
mm/memory: handle non-split locks correctly in zap_empty_pte_table()
mm: move pte table reclaim code to memory.c
mm: make PT_RECLAIM depends on MMU_GATHER_RCU_TABLE_FREE
mm: convert __HAVE_ARCH_TLB_REMOVE_TABLE to CONFIG_HAVE_ARCH_TLB_REMOVE_TABLE config
um: mm: enable MMU_GATHER_RCU_TABLE_FREE
parisc: mm: enable MMU_GATHER_RCU_TABLE_FREE
mips: mm: enable MMU_GATHER_RCU_TABLE_FREE
LoongArch: mm: enable MMU_GATHER_RCU_TABLE_FREE
alpha: mm: enable MMU_GATHER_RCU_TABLE_FREE
mm: change mm/pt_reclaim.c to use asm/tlb.h instead of asm-generic/tlb.h
mm/damon/stat: remove __read_mostly from memory_idle_ms_percentiles
zsmalloc: make common caches global
mm: add SPDX id lines to some mm source files
mm/zswap: use %pe to print error pointers
mm/vmscan: use %pe to print error pointers
mm/readahead: fix typo in comment
mm: khugepaged: fix NR_FILE_PAGES and NR_SHMEM in collapse_file()
mm: refactor vma_map_pages to use vm_insert_pages
mm/damon: unify address range representation with damon_addr_range
mm/cma: replace snprintf with strscpy in cma_new_area
...
When a task is migrated out of a css_set, cgroup_migrate_add_task()
first moves it from cset->tasks to cset->mg_tasks via:
list_move_tail(&task->cg_list, &cset->mg_tasks);
If a css_task_iter currently has it->task_pos pointing to this task,
css_set_move_task() calls css_task_iter_skip() to keep the iterator
valid. However, since the task has already been moved to ->mg_tasks,
the iterator is advanced relative to the mg_tasks list instead of the
original tasks list. As a result, remaining tasks on cset->tasks, as
well as tasks queued on cset->mg_tasks, can be skipped by iteration.
Fix this by calling css_set_skip_task_iters() before unlinking
task->cg_list from cset->tasks. This advances all active iterators to
the next task on cset->tasks, so iteration continues correctly even
when a task is concurrently being migrated.
This race is hard to hit in practice without instrumentation, but it
can be reproduced by artificially slowing down cgroup_procs_show().
For example, on an Android device a temporary
/sys/kernel/cgroup/cgroup_test knob can be added to inject a delay
into cgroup_procs_show(), and then:
1) Spawn three long-running tasks (PIDs 101, 102, 103).
2) Create a test cgroup and move the tasks into it.
3) Enable a large delay via /sys/kernel/cgroup/cgroup_test.
4) In one shell, read cgroup.procs from the test cgroup.
5) Within the delay window, in another shell migrate PID 102 by
writing it to a different cgroup.procs file.
Under this setup, cgroup.procs can intermittently show only PID 101
while skipping PID 103. Once the migration completes, reading the
file again shows all tasks as expected.
Note that this change does not allow removing the existing
css_set_skip_task_iters() call in css_set_move_task(). The new call
in cgroup_migrate_add_task() only handles iterators that are racing
with migration while the task is still on cset->tasks. Iterators may
also start after the task has been moved to cset->mg_tasks. If we
dropped css_set_skip_task_iters() from css_set_move_task(), such
iterators could keep task_pos pointing to a migrating task, causing
css_task_iter_advance() to malfunction on the destination css_set,
up to and including crashes or infinite loops.
The race window between migration and iteration is very small, and
css_task_iter is not on a hot path. In the worst case, when an
iterator is positioned on the first thread of the migrating process,
cgroup_migrate_add_task() may have to skip multiple tasks via
css_set_skip_task_iters(). However, this only happens when migration
and iteration actually race, so the performance impact is negligible
compared to the correctness fix provided here.
Fixes: b636fd38dc ("cgroup: Implement css_task_iter_skip()")
Cc: stable@vger.kernel.org # v5.2+
Signed-off-by: Qingye Zhao <zhaoqingye@honor.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Core & protocols
----------------
- A significant effort all around the stack to guide the compiler to
make the right choice when inlining code, to avoid unneeded calls for
small helper and stack canary overhead in the fast-path. This
generates better and faster code with very small or no text size
increases, as in many cases the call generated more code than the
actual inlined helper.
- Extend AccECN implementation so that is now functionally complete,
also allow the user-space enabling it on a per network namespace
basis.
- Add support for memory providers with large (above 4K) rx buffer.
Paired with hw-gro, larger rx buffer sizes reduce the number of
buffers traversing the stack, dincreasing single stream CPU usage by
up to ~30%.
- Do not add HBH header to Big TCP GSO packets. This simplifies the RX
path, the TX path and the NIC drivers, and is possible because
user-space taps can now interpret correctly such packets without the
HBH hint.
- Allow IPv6 routes to be configured with a gateway address that is
resolved out of a different interface than the one specified, aligning
IPv6 to IPv4 behavior.
- Multi-queue aware sch_cake. This makes it possible to scale the rate
shaper of sch_cake across multiple CPUs, while still enforcing a
single global rate on the interface.
- Add support for the nbcon (new buffer console) infrastructure to
netconsole, enabling lock-free, priority-based console operations that
are safer in crash scenarios.
- Improve the TCP ipv6 output path to cache the flow information, saving
cpu cycles, reducing cache line misses and stack use.
- Improve netfilter packet tracker to resolve clashes for most protocols,
avoiding unneeded drops on rare occasions.
- Add IP6IP6 tunneling acceleration to the flowtable infrastructure.
- Reduce tcp socket size by one cache line.
- Notify neighbour changes atomically, avoiding inconsistencies between
the notification sequence and the actual states sequence.
- Add vsock namespace support, allowing complete isolation of vsocks
across different network namespaces.
- Improve xsk generic performances with cache-alignment-oriented
optimizations.
- Support netconsole automatic target recovery, allowing netconsole
to reestablish targets when underlying low-level interface comes back
online.
Driver API
----------
- Support for switching the working mode (automatic vs manual) of a DPLL
device via netlink.
- Introduce PHY ports representation to expose multiple front-facing
media ports over a single MAC.
- Introduce "rx-polarity" and "tx-polarity" device tree properties, to
generalize polarity inversion requirements for differential signaling.
- Add helper to create, prepare and enable managed clocks.
Device drivers
--------------
- Add Huawei hinic3 PF etherner driver.
- Add DWMAC glue driver for Motorcomm YT6801 PCIe ethernet controller.
- Add ethernet driver for MaxLinear MxL862xx switches
- Remove parallel-port Ethernet driver.
- Convert existing driver timestamp configuration reporting to
hwtstamp_get and remove legacy ioctl().
- Convert existing drivers to .get_rx_ring_count(), simplifing the RX
ring count retrieval. Also remove the legacy fallback path.
- Ethernet high-speed NICs:
- Broadcom (bnxt, bng):
- bnxt: add FW interface update to support FEC stats histogram and
NVRAM defragmentation
- bng: add TSO and H/W GRO support
- nVidia/Mellanox (mlx5):
- improve latency of channel restart operations, reducing the used
H/W resources
- add TSO support for UDP over GRE over VLAN
- add flow counters support for hardware steering (HWS) rules
- use a static memory area to store headers for H/W GRO, leading to
12% RX tput improvement
- Intel (100G, ice, idpf):
- ice: reorganizes layout of Tx and Rx rings for cacheline
locality and utilizes __cacheline_group* macros on the new layouts
- ice: introduces Synchronous Ethernet (SyncE) support
- Meta (fbnic):
- adds debugfs for firmware mailbox and tx/rx rings vectors
- Ethernet virtual:
- geneve: introduce GRO/GSO support for double UDP encapsulation
- Ethernet NICs consumer, and embedded:
- Synopsys (stmmac):
- some code refactoring and cleanups
- RealTek (r8169):
- add support for RTL8127ATF (10G Fiber SFP)
- add dash and LTR support
- Airoha:
- AN8811HB 2.5 Gbps phy support
- Freescale (fec):
- add XDP zero-copy support
- Thunderbolt:
- add get link setting support to allow bonding
- Renesas:
- add support for RZ/G3L GBETH SoC
- Ethernet switches:
- Maxlinear:
- support R(G)MII slow rate configuration
- add support for Intel GSW150
- Motorcomm (yt921x):
- add DCB/QoS support
- TI:
- icssm-prueth: support bridging (STP/RSTP) via the switchdev
framework
- Ethernet PHYs:
- Realtek:
- enable SGMII and 2500Base-X in-band auto-negotiation
- simplify and reunify C22/C45 drivers
- Micrel: convert bindings to DT schema
- CAN:
- move skb headroom content into skb extensions, making CAN metadata
access more robust
- CAN drivers:
- rcar_canfd:
- add support for FD-only mode
- add support for the RZ/T2H SoC
- sja1000: cleanup the CAN state handling
- WiFi:
- implement EPPKE/802.1X over auth frames support
- split up drop reasons better, removing generic RX_DROP
- additional FTM capabilities: 6 GHz support, supported number of
spatial streams and supported number of LTF repetitions
- better mac80211 iterators to enumerate resources
- initial UHR (Wi-Fi 8) support for cfg80211/mac80211
- WiFi drivers:
- Qualcomm/Atheros:
- ath11k: support for Channel Frequency Response measurement
- ath12k: a significant driver refactor to support
multi-wiphy devices and and pave the way for future device support
in the same driver (rather than splitting to ath13k)
- ath12k: support for the QCC2072 chipset
- Intel:
- iwlwifi: partial Neighbor Awareness Networking (NAN) support
- iwlwifi: initial support for U-NII-9 and IEEE 802.11bn
- RealTek (rtw89):
- preparations for RTL8922DE support
- Bluetooth:
- implement setsockopt(BT_PHY) to set the connection packet type/PHY
- set link_policy on incoming ACL connections
- Bluetooth drivers:
- btusb: add support for MediaTek7920, Realtek RTL8761BU and 8851BE
- btqca: add WCN6855 firmware priority selection feature
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmmMum4SHHBhYmVuaUBy
ZWRoYXQuY29tAAoJECkkeY3MjxOkDnMP/3bpHAGj+gylTid3Xsj0TjJ8AkPsQs+W
uvSMiCB1TvGCTD9kK36Vr+qPoIgJY10UxYMMjt5Gs0A9TvGDDfYnUOVoUIkfkWCH
grqSdp6dVkyaJVfyLEcuOVQQG2HwEnhC4c3ZOhOxaKNAnsLCP142lYsMR9ktGRuA
4vDGtz1+y7t8qBk/lyfXDM71KRrtq0HWJZIhmhz8QXTBsgPDfSejbTPNxXQOJoeO
sKeArsHr/Cmvf89ZtLZ63vbfr4BKDm4PeXqPYR3PrQs2Yu6I1EK4lehygTY2yE2O
I3MEPlvpa/tiVLxqXNNwEFbYIkMPY6FXS9x05hTxNZM65A6aB3vvdkqPVnVmAlXE
f+4PYg9paI13lbzZOeQbGfZ5HgPpzQvnginaaX6s9Fp12K3Ll1FkwWdUznFWhzVn
5LSrGyecR00CdKJByTIw9JGg/1ptz5a57pa8OQmcKRx3WhQ1XeV5TIJQF4QcPgHw
ApyjmeGDTQMQMzha1fsaVr+i6BK2zgZvKK9uGDTX90xn2JUw/M75tyOlsTtGlnuM
sZgj0KVGQlG2wLwBB/+D4S9Oi9YlPG00rkCs0E4jk5C/G4NBmMgpEPQg6azkb57h
Uiy0paohxfwcZ3qbGA9In091ClGqIwOiCBaq+uXRq1ro88Neo6PWkjz5ItNrsD8t
Ttgd5AVAQyPT
=O31Y
-----END PGP SIGNATURE-----
Merge tag 'net-next-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Paolo Abeni:
"Core & protocols:
- A significant effort all around the stack to guide the compiler to
make the right choice when inlining code, to avoid unneeded calls
for small helper and stack canary overhead in the fast-path.
This generates better and faster code with very small or no text
size increases, as in many cases the call generated more code than
the actual inlined helper.
- Extend AccECN implementation so that is now functionally complete,
also allow the user-space enabling it on a per network namespace
basis.
- Add support for memory providers with large (above 4K) rx buffer.
Paired with hw-gro, larger rx buffer sizes reduce the number of
buffers traversing the stack, dincreasing single stream CPU usage
by up to ~30%.
- Do not add HBH header to Big TCP GSO packets. This simplifies the
RX path, the TX path and the NIC drivers, and is possible because
user-space taps can now interpret correctly such packets without
the HBH hint.
- Allow IPv6 routes to be configured with a gateway address that is
resolved out of a different interface than the one specified,
aligning IPv6 to IPv4 behavior.
- Multi-queue aware sch_cake. This makes it possible to scale the
rate shaper of sch_cake across multiple CPUs, while still enforcing
a single global rate on the interface.
- Add support for the nbcon (new buffer console) infrastructure to
netconsole, enabling lock-free, priority-based console operations
that are safer in crash scenarios.
- Improve the TCP ipv6 output path to cache the flow information,
saving cpu cycles, reducing cache line misses and stack use.
- Improve netfilter packet tracker to resolve clashes for most
protocols, avoiding unneeded drops on rare occasions.
- Add IP6IP6 tunneling acceleration to the flowtable infrastructure.
- Reduce tcp socket size by one cache line.
- Notify neighbour changes atomically, avoiding inconsistencies
between the notification sequence and the actual states sequence.
- Add vsock namespace support, allowing complete isolation of vsocks
across different network namespaces.
- Improve xsk generic performances with cache-alignment-oriented
optimizations.
- Support netconsole automatic target recovery, allowing netconsole
to reestablish targets when underlying low-level interface comes
back online.
Driver API:
- Support for switching the working mode (automatic vs manual) of a
DPLL device via netlink.
- Introduce PHY ports representation to expose multiple front-facing
media ports over a single MAC.
- Introduce "rx-polarity" and "tx-polarity" device tree properties,
to generalize polarity inversion requirements for differential
signaling.
- Add helper to create, prepare and enable managed clocks.
Device drivers:
- Add Huawei hinic3 PF etherner driver.
- Add DWMAC glue driver for Motorcomm YT6801 PCIe ethernet
controller.
- Add ethernet driver for MaxLinear MxL862xx switches
- Remove parallel-port Ethernet driver.
- Convert existing driver timestamp configuration reporting to
hwtstamp_get and remove legacy ioctl().
- Convert existing drivers to .get_rx_ring_count(), simplifing the RX
ring count retrieval. Also remove the legacy fallback path.
- Ethernet high-speed NICs:
- Broadcom (bnxt, bng):
- bnxt: add FW interface update to support FEC stats histogram
and NVRAM defragmentation
- bng: add TSO and H/W GRO support
- nVidia/Mellanox (mlx5):
- improve latency of channel restart operations, reducing the
used H/W resources
- add TSO support for UDP over GRE over VLAN
- add flow counters support for hardware steering (HWS) rules
- use a static memory area to store headers for H/W GRO,
leading to 12% RX tput improvement
- Intel (100G, ice, idpf):
- ice: reorganizes layout of Tx and Rx rings for cacheline
locality and utilizes __cacheline_group* macros on the new
layouts
- ice: introduces Synchronous Ethernet (SyncE) support
- Meta (fbnic):
- adds debugfs for firmware mailbox and tx/rx rings vectors
- Ethernet virtual:
- geneve: introduce GRO/GSO support for double UDP encapsulation
- Ethernet NICs consumer, and embedded:
- Synopsys (stmmac):
- some code refactoring and cleanups
- RealTek (r8169):
- add support for RTL8127ATF (10G Fiber SFP)
- add dash and LTR support
- Airoha:
- AN8811HB 2.5 Gbps phy support
- Freescale (fec):
- add XDP zero-copy support
- Thunderbolt:
- add get link setting support to allow bonding
- Renesas:
- add support for RZ/G3L GBETH SoC
- Ethernet switches:
- Maxlinear:
- support R(G)MII slow rate configuration
- add support for Intel GSW150
- Motorcomm (yt921x):
- add DCB/QoS support
- TI:
- icssm-prueth: support bridging (STP/RSTP) via the switchdev
framework
- Ethernet PHYs:
- Realtek:
- enable SGMII and 2500Base-X in-band auto-negotiation
- simplify and reunify C22/C45 drivers
- Micrel: convert bindings to DT schema
- CAN:
- move skb headroom content into skb extensions, making CAN
metadata access more robust
- CAN drivers:
- rcar_canfd:
- add support for FD-only mode
- add support for the RZ/T2H SoC
- sja1000: cleanup the CAN state handling
- WiFi:
- implement EPPKE/802.1X over auth frames support
- split up drop reasons better, removing generic RX_DROP
- additional FTM capabilities: 6 GHz support, supported number of
spatial streams and supported number of LTF repetitions
- better mac80211 iterators to enumerate resources
- initial UHR (Wi-Fi 8) support for cfg80211/mac80211
- WiFi drivers:
- Qualcomm/Atheros:
- ath11k: support for Channel Frequency Response measurement
- ath12k: a significant driver refactor to support multi-wiphy
devices and and pave the way for future device support in the
same driver (rather than splitting to ath13k)
- ath12k: support for the QCC2072 chipset
- Intel:
- iwlwifi: partial Neighbor Awareness Networking (NAN) support
- iwlwifi: initial support for U-NII-9 and IEEE 802.11bn
- RealTek (rtw89):
- preparations for RTL8922DE support
- Bluetooth:
- implement setsockopt(BT_PHY) to set the connection packet type/PHY
- set link_policy on incoming ACL connections
- Bluetooth drivers:
- btusb: add support for MediaTek7920, Realtek RTL8761BU and 8851BE
- btqca: add WCN6855 firmware priority selection feature"
* tag 'net-next-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1254 commits)
bnge/bng_re: Add a new HSI
net: macb: Fix tx/rx malfunction after phy link down and up
af_unix: Fix memleak of newsk in unix_stream_connect().
net: ti: icssg-prueth: Add optional dependency on HSR
net: dsa: add basic initial driver for MxL862xx switches
net: mdio: add unlocked mdiodev C45 bus accessors
net: dsa: add tag format for MxL862xx switches
dt-bindings: net: dsa: add MaxLinear MxL862xx
selftests: drivers: net: hw: Modify toeplitz.c to poll for packets
octeontx2-pf: Unregister devlink on probe failure
net: renesas: rswitch: fix forwarding offload statemachine
ionic: Rate limit unknown xcvr type messages
tcp: inet6_csk_xmit() optimization
tcp: populate inet->cork.fl.u.ip6 in tcp_v6_syn_recv_sock()
tcp: populate inet->cork.fl.u.ip6 in tcp_v6_connect()
ipv6: inet6_csk_xmit() and inet6_csk_update_pmtu() use inet->cork.fl.u.ip6
ipv6: use inet->cork.fl.u.ip6 and np->final in ip6_datagram_dst_update()
ipv6: use np->final in inet6_sk_rebuild_header()
ipv6: add daddr/final storage in struct ipv6_pinfo
net: stmmac: qcom-ethqos: fix qcom_ethqos_serdes_powerup()
...
The return statement inside the nested if block in print_trace_fmt()
is not properly indented, making the code structure unclear. This was
flagged by smatch as a warning.
Add proper indentation to the return statement to match the kernel
coding style and improve readability.
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260210153903.8041-1-tttturtleruss@gmail.com
Signed-off-by: Haoyang LIU <tttturtleruss@gmail.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
-----BEGIN PGP SIGNATURE-----
iQJIBAABCgAyFiEEgMe7l+5h9hnxdsnuWYigwDrT+vwFAmmKJO4UHGJoZWxnYWFz
QGdvb2dsZS5jb20ACgkQWYigwDrT+vzotA/+OGSOPOs9hWd+OwNF5Dm2WA81yG/3
K3Jx5uMuPoSjduMbPhVcib02Mr6YDJTa6WlYNVa76ADs2G6HxcVMFHutlYudSVcl
umSF48FnyeH1LTba88dRoVj4DB47Cue+BfhYY2L0ZtxmjQq/NRuDFAaGBh54uNeF
Gcdgr52QlM01n1X6yKvl7vE9gPdcPH80L256ssHAm6oSOHI1SPc6gqEKUUD02f8G
FtzfTUAq/cWYjlY3VoS5GKtdHxFYuXqC5WfbURhJ11o/nVJY9k1Zx8n4eI1tmAtN
7q692xjWSQJZlzepOBBEyjFUpIiy80tZ43z2ptRRBeI/n/qMmGPAov/g4MzegBWG
IAEHTAp/xx1Wra1ynr7RNvYVcPpXm2TEim8gIGah9DkHbNgbu7ing+OO7DnQuyfD
2h4hGD2622o6uikqkwzVd4mYuIcFu7SA6yROZhFn83BRnz0QOQienDrDlvOB8XCV
EodLAOMc2KClvOmmriFMy11PH7MFFoXexV6KS83VfDJHi4+XzBsy0w6TXTohcA9s
JTPIkSWqf/u6SrdLjXlFGyyJ2/KCgRiXFIBhhtYBMhDuuO7nG+mcSVzMa1PT0s6C
PF+QoT7sJof/5VMJ4o3BgPrPkD3CQICrlt8XIt5I8ngsy6RZRQ5rt+pUix7Shcn8
DgcunuINYfQtkfw=
=LIjp
-----END PGP SIGNATURE-----
Merge tag 'pci-v7.0-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci
Pull PCI updates from Bjorn Helgaas:
"Enumeration:
- Don't try to enable Extended Tags on VFs since that bit is Reserved
and causes misleading log messages (Håkon Bugge)
- Initialize Endpoint Read Completion Boundary to match Root Port,
regardless of ACPI _HPX (Håkon Bugge)
- Apply _HPX PCIe Setting Record only to AER configuration, and only
when OS owns PCIe hotplug but not AER, to avoid clobbering Extended
Tag and Relaxed Ordering settings (Håkon Bugge)
Resource management:
- Move CardBus code to setup-cardbus.c and only build it when
CONFIG_CARDBUS is set (Ilpo Järvinen)
- Fix bridge window alignment with optional resources, where
additional alignment requirement was previously lost (Ilpo
Järvinen)
- Stop over-estimating bridge window size since they are now assigned
without any gaps between them (Ilpo Järvinen)
- Increase resource MAX_IORES_LEVEL to avoid /proc/iomem flattening
for nested bridges and endpoints (Ilpo Järvinen)
- Add pbus_mem_size_optional() to handle sizes of optional resources
(SR-IOV VF BARs, expansion ROMs, bridge windows) (Ilpo Järvinen)
- Don't claim disabled bridge windows to avoid spurious claim
failures (Ilpo Järvinen)
Driver binding:
- Fix device reference leak in pcie_port_remove_service() (Uwe
Kleine-König)
- Move pcie_port_bus_match() and pcie_port_bus_type to PCIe-specific
portdrv.c (Uwe Kleine-König)
- Convert portdrv to use pcie_port_bus_type.probe() and .remove()
callbacks so .probe() and .remove() can eventually be removed from
struct device_driver (Uwe Kleine-König)
Error handling:
- Clear stale errors on reporting agents upon probe so they don't
look like recent errors (Lukas Wunner)
- Add generic RAS tracepoint for hotplug events (Shuai Xue)
- Add RAS tracepoint for link speed changes (Shuai Xue)
Power management:
- Avoid redundant delay on transition from D3hot to D3cold if the
device was already in D3hot (Brian Norris)
- Prevent runtime suspend until devices are fully initialized to
avoid saving incompletely configured device state (Brian Norris)
Power control:
- Add power_on/off callbacks with generic signature to pwrseq,
tc9563, and slot drivers so they can be used by pwrctrl core
(Manivannan Sadhasivam)
- Add PCIe M.2 connector support to the slot pwrctrl driver
(Manivannan Sadhasivam)
- Switch to pwrctrl interfaces to create, destroy, and power on/off
devices, calling them from host controller drivers instead of the
PCI core (Manivannan Sadhasivam)
- Drop qcom .assert_perst() callbacks since this is now done by the
controller driver instead of the pwrctrl driver (Manivannan
Sadhasivam)
Virtualization:
- Remove an incorrect unlock in pci_slot_trylock() error handling
(Jinhui Guo)
- Lock the bridge device for slot reset (Keith Busch)
- Enable ACS after IOMMU configuration on OF platforms so ACS is
enabled an all devices; previously the first device enumerated
(typically a Root Port) didn't have ACS enabled (Manivannan
Sadhasivam)
- Disable ACS Source Validation for IDT 0x80b5 and 0x8090 switches to
work around hardware erratum; previously ACS SV was only
temporarily disabled, which worked for enumeration but not after
reset (Manivannan Sadhasivam)
Peer-to-peer DMA:
- Release per-CPU pgmap ref when vm_insert_page() fails to avoid hang
when removing the PCI device (Hou Tao)
- Remove incorrect p2pmem_alloc_mmap() warning about page refcount
(Hou Tao)
Endpoint framework:
- Add configfs sub-groups synchronously to avoid NULL pointer
dereference when racing with removal (Liu Song)
- Fix swapped parameters in pci_{primary/secondary}_epc_epf_unlink()
functions (Manikanta Maddireddy)
ASPEED PCIe controller driver:
- Add ASPEED Root Complex DT binding and driver (Jacky Chou)
Freescale i.MX6 PCIe controller driver:
- Add DT binding and driver support for an optional external refclock
in addition to the refclock from the internal PLL (Richard Zhu)
- Fix CLKREQ# control so host asserts it during enumeration and
Endpoints can use it afterwards to exit the L1.2 link state
(Richard Zhu)
NVIDIA Tegra PCIe controller driver:
- Export irq_domain_free_irqs() to allow PCI/MSI drivers that tear
down MSI domains to be built as modules (Aaron Kling)
- Allow pci-tegra to be built as a module (Aaron Kling)
NVIDIA Tegra194 PCIe controller driver:
- Relax Kconfig so tegra194 can be built for platforms beyond
Tegra194 (Vidya Sagar)
Qualcomm PCIe controller driver:
- Merge SC8180x DT binding into SM8150 (Krzysztof Kozlowski)
- Move SDX55, SDM845, QCS404, IPQ5018, IPQ6018, IPQ8074 Gen3,
IPQ8074, IPQ4019, IPQ9574, APQ8064, MSM8996, APQ8084 to dedicated
schema (Krzysztof Kozlowski)
- Add DT binding and driver support for SA8255p Endpoint being
configured by firmware (Mrinmay Sarkar)
- Parse PERST# from all PCIe bridge nodes for future platforms that
will have PERST# in Switch Downstream Ports as well as in Root
Ports (Manivannan Sadhasivam)
Renesas RZ/G3S PCIe controller driver:
- Use pci_generic_config_write() since the writability provided by
the custom wrapper is unnecessary (Claudiu Beznea)
SOPHGO PCIe controller driver:
- Disable ASPM L0s and L1 on Sophgo 2044 PCIe Root Ports (Inochi
Amaoto)
Synopsys DesignWare PCIe controller driver:
- Extend PCI_FIND_NEXT_CAP() and PCI_FIND_NEXT_EXT_CAP() to return a
pointer to the preceding Capability, to allow removal of
Capabilities that are advertised but not fully implemented (Qiang
Yu)
- Remove MSI and MSI-X Capabilities in platforms that can't support
them, so the PCI core automatically falls back to INTx (Qiang Yu)
- Add ASPM L1.1 and L1.2 Substates context to debugfs ltssm_status
for drivers that support this (Shawn Lin)
- Skip PME_Turn_Off broadcast and L2/L3 transition during suspend if
link is not up to avoid an unnecessary timeout (Manivannan
Sadhasivam)
- Revert dw-rockchip, qcom, and DWC core changes that used link-up
IRQs to trigger enumeration instead of waiting for link to be up
because the PCI core doesn't allocate bus number space for
hierarchies that might be attached (Niklas Cassel)
- Make endpoint iATU entry for MSI permanent instead of programming
it dynamically, which is slow and racy with respect to other
concurrent traffic, e.g., eDMA (Koichiro Den)
- Use iMSI-RX MSI target address when possible to fix endpoints using
32-bit MSI (Shawn Lin)
- Allow DWC host controller driver probe to continue if device is not
found or found but inactive; only fail when there's an error with
the link (Manivannan Sadhasivam)
- For controllers like NXP i.MX6QP and i.MX7D, where LTSSM registers
are not accessible after PME_Turn_Off, simply wait 10ms instead of
polling for L2/L3 Ready (Richard Zhu)
- Use multiple iATU entries to map large bridge windows and DMA
ranges when necessary instead of failing (Samuel Holland)
- Add EPC dynamic_inbound_mapping feature bit for Endpoint
Controllers that can update BAR inbound address translation without
requiring EPF driver to clear/reset the BAR first, and advertise it
for DWC-based Endpoints (Koichiro Den)
- Add EPC subrange_mapping feature bit for Endpoint Controllers that
can map multiple independent inbound regions in a single BAR,
implement subrange mapping, advertise it for DWC-based Endpoints,
and add Endpoint selftests for it (Koichiro Den)
- Make resizable BARs work for Endpoint multi-PF configurations;
previously it only worked for PF 0 (Aksh Garg)
- Fix Endpoint non-PF 0 support for BAR configuration, ATU mappings,
and Address Match Mode (Aksh Garg)
- Set up iATU when ECAM is enabled; previously IO and MEM outbound
windows weren't programmed, and ECAM-related iATU entries weren't
restored after suspend/resume, so config accesses failed (Krishna
Chaitanya Chundru)
Miscellaneous:
- Use system_percpu_wq and WQ_PERCPU to explicitly request per-CPU
work so WQ_UNBOUND can eventually be removed (Marco Crivellari)"
* tag 'pci-v7.0-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci: (176 commits)
PCI/bwctrl: Disable BW controller on Intel P45 using a quirk
PCI: Disable ACS SV for IDT 0x8090 switch
PCI: Disable ACS SV for IDT 0x80b5 switch
PCI: Cache ACS Capabilities register
PCI: Enable ACS after configuring IOMMU for OF platforms
PCI: Add ACS quirk for Pericom PI7C9X2G404 switches [12d8:b404]
PCI: Add ACS quirk for Qualcomm Hamoa & Glymur
PCI: Use device_lock_assert() to verify device lock is held
PCI: Use lockdep_assert_held(pci_bus_sem) to verify lock is held
PCI: Fix pci_slot_lock () device locking
PCI: Fix pci_slot_trylock() error handling
PCI: Mark Nvidia GB10 to avoid bus reset
PCI: Mark ASM1164 SATA controller to avoid bus reset
PCI: host-generic: Avoid reporting incorrect 'missing reg property' error
PCI/PME: Replace RMW of Root Status register with direct write
PCI/AER: Clear stale errors on reporting agents upon probe
PCI: Don't claim disabled bridge windows
PCI: rzg3s-host: Fix device node reference leak in rzg3s_pcie_host_parse_port()
PCI: dwc: Fix missing iATU setup when ECAM is enabled
PCI: dwc: Clean up iATU index usage in dw_pcie_iatu_setup()
...
Kbuild changes
==============
* Drop '*_probe' pattern from modpost section check allowlist, which hid
legitimate warnings (Johan Hovold)
* Disable -Wtype-limits altogether, instead of enabling at W=2 (Vincent
Mailhol)
* Improve UAPI testing to skip testing headers that require a libc when
CONFIG_CC_CAN_LINK is not set, opening up testing of headers with no
libc dependencies to more environments (Thomas Weißschuh)
* Update gendwarfksyms documentation with required dependencies (Jihan
LIN)
* Reject invalid LLVM= values to avoid unintentionally falling back to
system toolchain (Thomas Weißschuh)
* Add a script to help run the kernel build process in a container for
consistent environments and testing (Guillaume Tucker)
* Simplify kallsyms by getting rid of the relative base (Ard Biesheuvel)
* Performance and usability improvements to scripts/make_fit.py (Simon
Glass)
* Minor various clean ups and fixes
Kconfig changes
===============
* Move XPM icons to individual files, clearing up GTK deprecation
warnings (Rostislav Krasny)
* Support
depends on FOO if BAR
as syntactic sugar for
depends on FOO || !BAR' (Nicolas Pitre, Graham Roff)
* Refactor merge_config.sh to use awk over shell/sed/grep, dramatically
speeding up processing large number of config fragments (Anders
Roxell, Mikko Rapeli)
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQR74yXHMTGczQHYypIdayaRccAalgUCaYpgQwAKCRAdayaRccAa
liOGAQCqMI42YMLqljFcPu3B/3f43xhDBCXAhquPBIMhbgt+aAEAmmo3uMLHKSRV
XZDKkq13HMMV3Zlmrn5Xk/tzk+hkwwk=
=WYl4
-----END PGP SIGNATURE-----
Merge tag 'kbuild-7.0-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kbuild/linux
Pull Kbuild/Kconfig updates from Nathan Chancellor:
"Kbuild:
- Drop '*_probe' pattern from modpost section check allowlist, which
hid legitimate warnings (Johan Hovold)
- Disable -Wtype-limits altogether, instead of enabling at W=2
(Vincent Mailhol)
- Improve UAPI testing to skip testing headers that require a libc
when CONFIG_CC_CAN_LINK is not set, opening up testing of headers
with no libc dependencies to more environments (Thomas Weißschuh)
- Update gendwarfksyms documentation with required dependencies
(Jihan LIN)
- Reject invalid LLVM= values to avoid unintentionally falling back
to system toolchain (Thomas Weißschuh)
- Add a script to help run the kernel build process in a container
for consistent environments and testing (Guillaume Tucker)
- Simplify kallsyms by getting rid of the relative base (Ard
Biesheuvel)
- Performance and usability improvements to scripts/make_fit.py
(Simon Glass)
- Minor various clean ups and fixes
Kconfig:
- Move XPM icons to individual files, clearing up GTK deprecation
warnings (Rostislav Krasny)
- Support
depends on FOO if BAR
as syntactic sugar for
depends on FOO || !BAR
(Nicolas Pitre, Graham Roff)
- Refactor merge_config.sh to use awk over shell/sed/grep,
dramatically speeding up processing large number of config
fragments (Anders Roxell, Mikko Rapeli)"
* tag 'kbuild-7.0-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kbuild/linux: (39 commits)
kbuild: remove dependency of run-command on config
scripts/make_fit: Compress dtbs in parallel
scripts/make_fit: Support a few more parallel compressors
kbuild: Support a FIT_EXTRA_ARGS environment variable
scripts/make_fit: Move dtb processing into a function
scripts/make_fit: Support an initial ramdisk
scripts/make_fit: Speed up operation
rust: kconfig: Don't require RUST_IS_AVAILABLE for rustc-option
MAINTAINERS: Add scripts/install.sh into Kbuild entry
modpost: Amend ppc64 save/restfpr symnames for -Os build
MIPS: tools: relocs: Ship a definition of R_MIPS_PC32
streamline_config.pl: remove superfluous exclamation mark
kbuild: dummy-tools: Add python3
scripts: kconfig: merge_config.sh: warn on duplicate input files
scripts: kconfig: merge_config.sh: use awk in checks too
scripts: kconfig: merge_config.sh: refactor from shell/sed/grep to awk
kallsyms: Get rid of kallsyms relative base
mips: Add support for PC32 relocations in vmlinux
Documentation: dev-tools: add container.rst page
scripts: add tool to run containerized builds
...
- Move C example schedulers back from the external scx repo to
tools/sched_ext as the authoritative source. scx_userland and scx_pair
are returning while scx_sdt (BPF arena-based task data management) is
new. These schedulers will be dropped from the external repo.
- Improve error reporting by adding scx_bpf_error() calls when DSQ
creation fails across all in-tree schedulers.
- Avoid redundant irq_work_queue() calls in destroy_dsq() by only
queueing when llist_add() indicates an empty list.
- Fix flaky init_enable_count selftest by properly synchronizing
pre-forked children using a pipe instead of sleep().
-----BEGIN PGP SIGNATURE-----
iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaYo1pQ4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGa5VAP9udiGksQ4bBFQrUD+0yhuvjsSuXzssfdfxHgT6
Hj66wgEAjgbnSyxfcGrB+w7DWUxNLaZlXepibVMPcfvAaSieSgU=
=UvLY
-----END PGP SIGNATURE-----
Merge tag 'sched_ext-for-6.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext updates from Tejun Heo:
- Move C example schedulers back from the external scx repo to
tools/sched_ext as the authoritative source. scx_userland and
scx_pair are returning while scx_sdt (BPF arena-based task data
management) is new. These schedulers will be dropped from the
external repo.
- Improve error reporting by adding scx_bpf_error() calls when DSQ
creation fails across all in-tree schedulers
- Avoid redundant irq_work_queue() calls in destroy_dsq() by only
queueing when llist_add() indicates an empty list
- Fix flaky init_enable_count selftest by properly synchronizing
pre-forked children using a pipe instead of sleep()
* tag 'sched_ext-for-6.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
selftests/sched_ext: Fix init_enable_count flakiness
tools/sched_ext: Fix data header access during free in scx_sdt
tools/sched_ext: Add error logging for dsq creation failures in remaining schedulers
tools/sched_ext: add arena based scheduler
tools/sched_ext: add scx_pair scheduler
tools/sched_ext: add scx_userland scheduler
sched_ext: Add error logging for dsq creation failures
sched_ext: Avoid multiple irq_work_queue() calls in destroy_dsq()
- cpuset changes:
- Continue separating v1 and v2 implementations by moving more
v1-specific logic into cpuset-v1.c.
- Improve partition handling. Sibling partitions are no longer
invalidated on cpuset.cpus conflict, cpuset.cpus changes no longer
fail in v2, and effective_xcpus computation is made consistent.
- Fix partition effective CPUs overlap that caused a warning on cpuset
removal when sibling partitions shared CPUs.
- Increase the maximum cgroup subsystem count from 16 to 32 to
accommodate future subsystem additions.
- Misc cleanups and selftest improvements including switching to
css_is_online() helper, removing dead code and stale documentation
references, using lockdep_assert_cpuset_lock_held() consistently,
and adding polling helpers for asynchronously updated cgroup
statistics.
-----BEGIN PGP SIGNATURE-----
iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaYozIw4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGZQKAQD51KJQz4M79wf2yBhIBLOnM4aakMalhSwZNL4O
JiGutwD+Ir33VzNX8aXBuDin9p4wI15O54PhqSenJbelKRQ3Dws=
=gR7L
-----END PGP SIGNATURE-----
Merge tag 'cgroup-for-6.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
- cpuset changes:
- Continue separating v1 and v2 implementations by moving more
v1-specific logic into cpuset-v1.c
- Improve partition handling. Sibling partitions are no longer
invalidated on cpuset.cpus conflict, cpuset.cpus changes no longer
fail in v2, and effective_xcpus computation is made consistent
- Fix partition effective CPUs overlap that caused a warning on
cpuset removal when sibling partitions shared CPUs
- Increase the maximum cgroup subsystem count from 16 to 32 to
accommodate future subsystem additions
- Misc cleanups and selftest improvements including switching to
css_is_online() helper, removing dead code and stale documentation
references, using lockdep_assert_cpuset_lock_held() consistently, and
adding polling helpers for asynchronously updated cgroup statistics
* tag 'cgroup-for-6.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits)
cpuset: fix overlap of partition effective CPUs
cgroup: increase maximum subsystem count from 16 to 32
cgroup: Remove stale cpu.rt.max reference from documentation
cpuset: replace direct lockdep_assert_held() with lockdep_assert_cpuset_lock_held()
cgroup/cpuset: Move the v1 empty cpus/mems check to cpuset1_validate_change()
cgroup/cpuset: Don't invalidate sibling partitions on cpuset.cpus conflict
cgroup/cpuset: Don't fail cpuset.cpus change in v2
cgroup/cpuset: Consistently compute effective_xcpus in update_cpumasks_hier()
cgroup/cpuset: Streamline rm_siblings_excl_cpus()
cpuset: remove dead code in cpuset-v1.c
cpuset: remove v1-specific code from generate_sched_domains
cpuset: separate generate_sched_domains for v1 and v2
cpuset: move update_domain_attr_tree to cpuset_v1.c
cpuset: add cpuset1_init helper for v1 initialization
cpuset: add cpuset1_online_css helper for v1-specific operations
cpuset: add lockdep_assert_cpuset_lock_held helper
cpuset: Remove unnecessary checks in rebuild_sched_domains_locked
cgroup: switch to css_is_online() helper
selftests: cgroup: Replace sleep with cg_read_key_long_poll() for waiting on nr_dying_descendants
selftests: cgroup: make test_memcg_sock robust against delayed sock stats
...
- Rework the rescuer to process work items one-by-one instead of
slurping all pending work items in a single pass. As there is only
one rescuer per workqueue, a single long-blocking work item could
cause high latency for all tasks queued behind it, even after memory
pressure is relieved and regular kworkers become available to service
them.
- Add CONFIG_BOOTPARAM_WQ_STALL_PANIC build-time option and
workqueue.panic_on_stall_time parameter for time-based stall panic,
giving systems more control over workqueue stall handling.
- Replace BUG_ON() with panic() in the stall panic path for clearer
intent and more informative output.
-----BEGIN PGP SIGNATURE-----
iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaYov/A4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGWXnAQCfyELl+evz3RdFhyTiVCM1TiOnC1TsBjgkm3SJ
orMhwgEAkgg40jino34wgeZRfdIAThxQ1O6bsvTpooWKjlCYcQY=
=zZCF
-----END PGP SIGNATURE-----
Merge tag 'wq-for-6.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue updates from Tejun Heo:
- Rework the rescuer to process work items one-by-one instead of
slurping all pending work items in a single pass.
As there is only one rescuer per workqueue, a single long-blocking
work item could cause high latency for all tasks queued behind it,
even after memory pressure is relieved and regular kworkers become
available to service them.
- Add CONFIG_BOOTPARAM_WQ_STALL_PANIC build-time option and
workqueue.panic_on_stall_time parameter for time-based stall panic,
giving systems more control over workqueue stall handling.
- Replace BUG_ON() with panic() in the stall panic path for clearer
intent and more informative output.
* tag 'wq-for-6.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: replace BUG_ON with panic in panic_on_wq_watchdog
workqueue: add time-based panic for stalls
workqueue: add CONFIG_BOOTPARAM_WQ_STALL_PANIC option
workqueue: Process extra works in rescuer on memory pressure
workqueue: Process rescuer work items one-by-one using a cursor
workqueue: Make send_mayday() take a PWQ argument directly
Shinichiro reported a KASAN UAF, which is actually an out of bounds access
in the MMCID management code.
CPU0 CPU1
T1 runs in userspace
T0: fork(T4) -> Switch to per CPU CID mode
fixup() set MM_CID_TRANSIT on T1/CPU1
T4 exit()
T3 exit()
T2 exit()
T1 exit() switch to per task mode
---> Out of bounds access.
As T1 has not scheduled after T0 set the TRANSIT bit, it exits with the
TRANSIT bit set. sched_mm_cid_remove_user() clears the TRANSIT bit in
the task and drops the CID, but it does not touch the per CPU storage.
That's functionally correct because a CID is only owned by the CPU when
the ONCPU bit is set, which is mutually exclusive with the TRANSIT flag.
Now sched_mm_cid_exit() assumes that the CID is CPU owned because the
prior mode was per CPU. It invokes mm_drop_cid_on_cpu() which clears the
not set ONCPU bit and then invokes clear_bit() with an insanely large
bit number because TRANSIT is set (bit 29).
Prevent that by actually validating that the CID is CPU owned in
mm_drop_cid_on_cpu().
Fixes: 007d84287c ("sched/mmcid: Drop per CPU CID immediately when switching to per task mode")
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Cc: stable@vger.kernel.org
Closes: https://lore.kernel.org/aYsZrixn9b6s_2zL@shinmob
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 32dc004252 ("tracing: Reset last-boot buffers when reading
out all cpu buffers") resets the last_boot_info when user read out
all data via trace_pipe* files. But it is not reset when user
resets the buffer from other files. (e.g. write `trace` file)
Reset it when the corresponding ring buffer is reset too.
Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/177071302364.2293046.17895165659153977720.stgit@mhiramat.tok.corp.google.com
Fixes: 32dc004252 ("tracing: Reset last-boot buffers when reading out all cpu buffers")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Since the per-cpu buffer_size_kb file is writable for changing
per-cpu ring buffer size, the file should have the write access
permission.
Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/177071301597.2293046.11683339475076917920.stgit@mhiramat.tok.corp.google.com
Fixes: 21ccc9cd72 ("tracing: Disable "other" permission bits in the tracefs files")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
For common cases (HZ=100, 250 or 1000), these helpers are at most one
multiply, so there is no point calling a tiny function.
Keep them out of line for HZ=300 and others.
This saves cycles in TCP fast path, among other things.
$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/8 grow/shrink: 25/89 up/down: 530/-3474 (-2944)
...
nla_put_msecs 193 - -193
message_stats_print 2131 920 -1211
Total: Before=25365208, After=25362264, chg -0.01%
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260210170226.57209-1-edumazet@google.com
- Implement masked user access
- Add support for internal only per-CPU instructions and inline the bpf_get_smp_processor_id() and bpf_get_current_task()
- Fix pSeries MSI-X allocation failure when quota is exceeded
- Fix recursive pci_lock_rescan_remove locking in EEH event handling
- Support tailcalls with subprogs & BPF exceptions on 64bit
- Extend "trusted" keys to support the PowerVM Key Wrapping Module (PKWM)
Thanks to: Abhishek Dubey, Christophe Leroy, Gaurav Batra, Guangshuo Li, Jarkko
Sakkinen, Mahesh Salgaonkar, Mimi Zohar, Miquel Sabaté Solà, Nam Cao, Narayana
Murty N, Nayna Jain, Nilay Shroff, Puranjay Mohan, Saket Kumar Bhaskar, Sourabh
Jain, Srish Srinivasan, Venkat Rao Bagalkote,
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEqX2DNAOgU8sBX3pRpnEsdPSHZJQFAmmL7S0ACgkQpnEsdPSH
ZJR2xA/9G+tZp9+2TidbLSyPT5o063uC5H5+j5QcvvHWi/ImGUNtixlDm5HcZLCR
yKywE1piKYBU8HoISjzAt0+JCVd3ZjJ8chTpKgCHLXPRSBTgdR1MG+SXQysDJSWb
yA4pwDikmwoLlfi+pf500F0nX2DRCRdU2Yi28ZFeaF/khJ7ebwj41QJ7LjN22+Q1
G8Kq543obZluzSoVvfG4xUK4ByWER+Zdd2F6699iMP68yw5PJ8PPc0SUGt8nuD4i
FUs0Lw7XV7i/K3+zm/ZgH5+Cvn7wOIcMNkXgFlxJXkFit97KXUDijifYPoXQyKLL
ksD7SPFdV0++Sc+3mWcgW4j+hQZC0Pn864unmh8C6ug3SagQ+3pE1JYWKwCmoyXd
49ROH0y+npArJ4NAc79eweunhafGcRYTSG+zV7swQvpRocMujEqa4CDz4uk1ll5W
1yAac08AN6PnfcXl2VMrcDboziTlQVFcnNQbK/ieYMC7KpgA+udw1hd2rOWNZCPd
u0byXxR1ak5YaAEuyMztd/39hrExx8306Jtkh5FIRZKWGAO+3np5bi3vxk11rDni
c9BGh2JIMtuPUGys3wcFPGMRTKwF2bDFW/pB+5hMHeLUdlkni9WGCX8eLe2klYsy
T7fBVb4d99IVrJGYv3J1lwELgjrgXvv35XOaUiyJyZhcbng15cc=
=zJoL
-----END PGP SIGNATURE-----
Merge tag 'powerpc-7.0-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates for 7.0
- Implement masked user access
- Add bpf support for internal only per-CPU instructions and inline the
bpf_get_smp_processor_id() and bpf_get_current_task() functions
- Fix pSeries MSI-X allocation failure when quota is exceeded
- Fix recursive pci_lock_rescan_remove locking in EEH event handling
- Support tailcalls with subprogs & BPF exceptions on 64bit
- Extend "trusted" keys to support the PowerVM Key Wrapping Module
(PKWM)
Thanks to Abhishek Dubey, Christophe Leroy, Gaurav Batra, Guangshuo Li,
Jarkko Sakkinen, Mahesh Salgaonkar, Mimi Zohar, Miquel Sabaté Solà, Nam
Cao, Narayana Murty N, Nayna Jain, Nilay Shroff, Puranjay Mohan, Saket
Kumar Bhaskar, Sourabh Jain, Srish Srinivasan, and Venkat Rao Bagalkote.
* tag 'powerpc-7.0-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (27 commits)
powerpc/pseries: plpks: export plpks_wrapping_is_supported
docs: trusted-encryped: add PKWM as a new trust source
keys/trusted_keys: establish PKWM as a trusted source
pseries/plpks: add HCALLs for PowerVM Key Wrapping Module
pseries/plpks: expose PowerVM wrapping features via the sysfs
powerpc/pseries: move the PLPKS config inside its own sysfs directory
pseries/plpks: fix kernel-doc comment inconsistencies
powerpc/smp: Add check for kcalloc() failure in parse_thread_groups()
powerpc: kgdb: Remove OUTBUFMAX constant
powerpc64/bpf: Additional NVR handling for bpf_throw
powerpc64/bpf: Support exceptions
powerpc64/bpf: Add arch_bpf_stack_walk() for BPF JIT
powerpc64/bpf: Avoid tailcall restore from trampoline
powerpc64/bpf: Support tailcalls with subprogs
powerpc64/bpf: Moving tail_call_cnt to bottom of frame
powerpc/eeh: fix recursive pci_lock_rescan_remove locking in EEH event handling
powerpc/pseries: Fix MSI-X allocation failure when quota is exceeded
powerpc/iommu: bypass DMA APIs for coherent allocations for pre-mapped memory
powerpc64/bpf: Inline bpf_get_smp_processor_id() and bpf_get_current_task/_btf()
powerpc64/bpf: Support internal-only MOV instruction to resolve per-CPU addrs
...
Extend struct printk_info to include the task name, pid, and CPU
number where printk messages originate. This information is captured
at vprintk_store() time and propagated through printk_message to
nbcon_write_context, making it available to nbcon console drivers.
This is useful for consoles like netconsole that want to include
execution context in their output, allowing correlation of messages
with specific tasks and CPUs regardless of where the console driver
actually runs.
The feature is controlled by CONFIG_PRINTK_EXECUTION_CTX, which is
automatically selected by CONFIG_NETCONSOLE_DYNAMIC. When disabled,
the helper functions compile to no-ops with no overhead.
Suggested-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Link: https://patch.msgid.link/20260206-nbcon-v7-1-62bda69b1b41@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
clock interface, taming the include hell by splitting the pv_ops structure
and removing of a bunch of obsolete code. Work by Juergen Gross.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmmLKHAACgkQEsHwGGHe
VUrURg//Ucf+3EAIkLCmFkH0WwYmQl2JjRYww8bPAw3iJMIVxy4dMnaBbsUiAtUp
kYza+pgEtvyAwwd8RIEs85c9VhZn0DKoaWV8goBH3zFH6YvIRiLwb0w2QvjkF+70
FNU+4zlvt/I3FD+tWNElAgVtkFL3Gmzm44qyLLsPtlYaJ71xFl2XB7V+TlqXMHzE
m8BMenP9/CrbTlBBdNJGzAkAbWi1uAP+IydvuFNolH/F2lqVM2z5Ta3gUWWCIk/q
jWrPLDZCHr2WlBZNUGamKVVH9NEh+7YNwBAGUrSNYGZFoaFjqeX6lN3djzS+wXIj
0nDoW35jN0QNKz239MdXZDf1mfpb6ZQd/iOhFjo4dAvbm+J8WPAMr98ac8wR3Dyb
2LF/BxkoKWRabxQApXSCrLPXEuqT6Qc1+lDA0bNHg51zBoqP5vRNVZRwArnzGB+O
LxDKx+o4VYOf+UCaB6oQHjylbSgFvIedZ9p822hBe3QG9act8indRE8LWip7Utld
peoJGgvlQ0xtClh6FjVHpvmVfAvk7Zki5ywj2GwmB/TZ0yywuGStAjE3UqY168/M
gb7MSajh+HHZNj1/2+b/se4CUYlAgIPDQ+SwHJPm5TqyopvnOVi/2XWmjbx8I5jT
jS0nxaxD+SbESSZ6IMAsppnAAxAYbvRHGIS+6mtNCXVkaV1pMbA=
=AeFt
-----END PGP SIGNATURE-----
Merge tag 'x86_paravirt_for_v7.0_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 paravirt updates from Borislav Petkov:
- A nice cleanup to the paravirt code containing a unification of the
paravirt clock interface, taming the include hell by splitting the
pv_ops structure and removing of a bunch of obsolete code (Juergen
Gross)
* tag 'x86_paravirt_for_v7.0_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
x86/paravirt: Use XOR r32,r32 to clear register in pv_vcpu_is_preempted()
x86/paravirt: Remove trailing semicolons from alternative asm templates
x86/pvlocks: Move paravirt spinlock functions into own header
x86/paravirt: Specify pv_ops array in paravirt macros
x86/paravirt: Allow pv-calls outside paravirt.h
objtool: Allow multiple pv_ops arrays
x86/xen: Drop xen_mmu_ops
x86/xen: Drop xen_cpu_ops
x86/xen: Drop xen_irq_ops
x86/paravirt: Move pv_native_*() prototypes to paravirt.c
x86/paravirt: Introduce new paravirt-base.h header
x86/paravirt: Move paravirt_sched_clock() related code into tsc.c
x86/paravirt: Use common code for paravirt_steal_clock()
riscv/paravirt: Use common code for paravirt_steal_clock()
loongarch/paravirt: Use common code for paravirt_steal_clock()
arm64/paravirt: Use common code for paravirt_steal_clock()
arm/paravirt: Use common code for paravirt_steal_clock()
sched: Move clock related paravirt code to kernel/sched
paravirt: Remove asm/paravirt_api_clock.h
x86/paravirt: Move thunk macros to paravirt_types.h
...
- Provide the missing 64-bit variant of clock_getres()
This allows the extension of CONFIG_COMPAT_32BIT_TIME to the vDSO and
finally the removal of 32-bit time types from the kernel and UAPI.
- Remove the useless and broken getcpu_cache from the VDSO
The intention was to provide a trivial way to retrieve the CPU number from
the VDSO, but as the VDSO data is per process there is no way to make it
work.
- Switch get/put_unaligned() from packed struct to memcpy()
The packed struct violates strict aliasing rules which requires to pass
-fno-strict-aliasing to the compiler. As this are scalar values
__builtin_memcpy() turns them into simple loads and stores
- Use __typeof_unqual__() for __unqual_scalar_typeof()
The get/put_unaligned() changes triggered a new sparse warning when __beNN
types are used with get/put_unaligned() as sparse builds add a special
'bitwise' attribute to them which prevents sparse to evaluate the Generic
in __unqual_scalar_typeof().
Newer sparse versions support __typeof_unqual__() which avoids the problem,
but requires a recent sparse install. So this adds a sanity check to sparse
builds, which validates that sparse is available and capable of handling it.
- Force inline __cvdso_clock_getres_common()
Compilers sometimes un-inline agressively, which results in function call
overhead and problems with automatic stack variable initialization.
Interestingly enough the force inlining results in smaller code than the
un-inlined variant produced by GCC when optimizing for size.
-----BEGIN PGP SIGNATURE-----
iQJEBAABCgAuFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmmJ2eAQHHRnbHhAa2Vy
bmVsLm9yZwAKCRCmGPVMDXSYoXhQD/4mjneVlRBbQ6Nt9LxxhzPlYXvED6u8b4SX
R//DQ4qqqagh2fSpE57tr3f56HqhUmXptGJSgEvr6tVKoyugsLqP6sN9J9J3o181
jnPQR0VBP9dihQZ5X93pKo9NZWxLIn0uLD0RsdCbE9NTx4ciw4VIPFYQqkC4Rw6b
jiRrDR2l8EhV8cmxB6puW5WaQ932M6Awabw9RumzwH3MzIIlbc5Ero51S9eS64LL
byU5XeWUe295W1Gxze5RHHJWyNQEyx1eUCFfe3LWvfpz7FMzc2AQsKnIJDzW3GiO
UGu5MGptbLpG+ccvhVEs6/Ls5pWXcoCw4WuDNAunCCOmqda98oDniKf2LwRRbLT0
nAfLNatMnhXdTPk2zbS45z9uipUQAGKmVAE3/LVqB+ekcutmIGMyqHgR75QX0b4l
CQPkC9rBsV6gGsScWTnhRydhqioNO/uhhrQv0vEXnKZa0ysTbgZKt3JDZbgUEL2B
uDxXKyrqjpnqDZKlMMaoLtwd+l+T80ya4/NhHd4ZNGUpTUrHVw2H47lgE7ahCxEk
/SvXTZSU4Jp8sVQIQ+J6y5z2AQ/xGy++zvNKiZMyP9fQuPqhiDqYkLpmOp61bARx
wqyVsfGXZYSB1l16AiSC/CyUDcMqqsFohGXQ/Yf0SOiVtXu2WUyfofY1N0IXIGu4
WOVV9mH1yg==
=0ogX
-----END PGP SIGNATURE-----
Merge tag 'timers-vdso-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull VDSO updates from Thomas Gleixner:
- Provide the missing 64-bit variant of clock_getres()
This allows the extension of CONFIG_COMPAT_32BIT_TIME to the vDSO and
finally the removal of 32-bit time types from the kernel and UAPI.
- Remove the useless and broken getcpu_cache from the VDSO
The intention was to provide a trivial way to retrieve the CPU number
from the VDSO, but as the VDSO data is per process there is no way to
make it work.
- Switch get/put_unaligned() from packed struct to memcpy()
The packed struct violates strict aliasing rules which requires to
pass -fno-strict-aliasing to the compiler. As this are scalar values
__builtin_memcpy() turns them into simple loads and stores
- Use __typeof_unqual__() for __unqual_scalar_typeof()
The get/put_unaligned() changes triggered a new sparse warning when
__beNN types are used with get/put_unaligned() as sparse builds add a
special 'bitwise' attribute to them which prevents sparse to evaluate
the Generic in __unqual_scalar_typeof().
Newer sparse versions support __typeof_unqual__() which avoids the
problem, but requires a recent sparse install. So this adds a sanity
check to sparse builds, which validates that sparse is available and
capable of handling it.
- Force inline __cvdso_clock_getres_common()
Compilers sometimes un-inline agressively, which results in function
call overhead and problems with automatic stack variable
initialization.
Interestingly enough the force inlining results in smaller code than
the un-inlined variant produced by GCC when optimizing for size.
* tag 'timers-vdso-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
vdso/gettimeofday: Force inlining of __cvdso_clock_getres_common()
x86/percpu: Make CONFIG_USE_X86_SEG_SUPPORT work with sparse
compiler: Use __typeof_unqual__() for __unqual_scalar_typeof()
powerpc/vdso: Provide clock_getres_time64()
tools headers: Remove unneeded ignoring of warnings in unaligned.h
tools headers: Update the linux/unaligned.h copy with the kernel sources
vdso: Switch get/put_unaligned() from packed struct to memcpy()
parisc: Inline a type punning version of get_unaligned_le32()
vdso: Remove struct getcpu_cache
MIPS: vdso: Provide getres_time64() for 32-bit ABIs
arm64: vdso32: Provide clock_getres_time64()
ARM: VDSO: Provide clock_getres_time64()
ARM: VDSO: Patch out __vdso_clock_getres() if unavailable
x86/vdso: Provide clock_getres_time64() for x86-32
selftests: vDSO: vdso_test_abi: Add test for clock_getres_time64()
selftests: vDSO: vdso_test_abi: Use UAPI system call numbers
selftests: vDSO: vdso_config: Add configurations for clock_getres_time64()
vdso: Add prototype for __vdso_clock_getres_time64()
- Inline timecounter_cyc2time() as that is now used in the networking
hotpath. Inlining it significantly improves performance.
- Optimize the tick dependency check in case that the tracepoint is disabled,
which improves the hotpath performance in the tick management code, which
is a hotpath on transitions in and out of idle.
- The usual cleanups and improvements
-----BEGIN PGP SIGNATURE-----
iQJEBAABCgAuFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmmJ1WUQHHRnbHhAa2Vy
bmVsLm9yZwAKCRCmGPVMDXSYoabjD/9cc06W408fLGfqEGlriV8BnMJSGiWO/efe
SJPA4IziIu6DS7aQugdmk1vTqV9PxQWZ+L0tgSe1BNzUY+9x+z+I7X+asY6pjPmp
4zQSGSuotvFpRsWa4MsEgOjXk0QV/1Lm1Ba0eaBBqfY1wacoxJgjxHd5qjZBycUp
H1nLAshXbDBsKYdTLqI4KQq9ZbHKTPz47UhJiKwPHEIRSGNzuAdNkD7KOv34GEDU
ZHlTfWv8dMoH0lp/N/2+NmojYOzVXjH9UeeG3MYyfrcrzxU8Ifg/21zz9KgcyrF/
rrDnUCdnJUk2z8AltgSyQfqw3awUrTDTnHQDFe9oS4t5lexXd7fkmbjqsZBpP8OD
QQJd6yyI2j5GsKKI/fcYz/3NEHZR5HxjgT/Dv3qdHh/7L49FIXFI68kcOqdHFGSf
21CY25cBRr7uPnyAs/G3LtSb/WyBVgTzMaPo798gIqpoDkSIuSlR0x252yaI6kW9
fMXul0ZYwU5j8xGPt39IW9/Xm0HAMBUdvMusKn8CKBPfYCAx2EtH75NXilCuRgjG
UJvzEv20W3f3IXgvrVqTznKUh0Airsfo8blNtX/LqhCZraeJIRV/UGUrhL8QpOp+
Ri8FJ1Dr57oRJDDqM7XVckJuyOGkRvYtvJyPhbyEXiWZpDUqQnrivm+3j7DecHN2
6+ThtJ6P4A==
=Zh3T
-----END PGP SIGNATURE-----
Merge tag 'timers-core-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer core updates from Thomas Gleixner:
- Inline timecounter_cyc2time() as that is now used in the networking
hotpath. Inlining it significantly improves performance.
- Optimize the tick dependency check in case that the tracepoint is
disabled, which improves the hotpath performance in the tick
management code, which is a hotpath on transitions in and out of
idle.
- The usual cleanups and improvements
* tag 'timers-core-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
time/kunit: Document handling of negative years of is_leap()
tick/nohz: Optimize check_tick_dependency() with early return
time/sched_clock: Use ACCESS_PRIVATE() to evaluate hrtimer::function
hrtimer: Drop _tv64() helpers
hrtimer: Remove public definition of HIGH_RES_NSEC
hrtimer: Remove unused resolution constants
time/timecounter: Inline timecounter_cyc2time()
- Add interrupt redirection infrastructure
Some PCI controllers use a single demultiplexing interrupt for the MSI
interrupts of subordinate devices.
This prevents setting the interrupt affinity of device interrupts, which
causes device interrupts to be delivered to a single CPU. That obviously is
counterproductive for multi-queue devices and interrupt balancing.
To work around this limitation the new infrastructure installs a dummy
irq_set_affinity() callback which captures the affinity mask and picks a
redirection target CPU out of the mask.
When the PCI controller demultiplexes the interrupts it invokes a new
handling function in the core, which either runs the interrupt handler in
the context of the target CPU or delegates it to irq_work on the target CPU.
- Utilize the interrupt redirection mechanism in the PCI DWC host controller
driver.
This allows affinity control for the subordinate device MSI interrupts
instead of being randomly executed on the CPU which runs the demultiplex
handler.
- Replace the binary 64-bit MSI flag with a DMA mask
Some PCI devices have PCI_MSI_FLAGS_64BIT in the MSI capability, but
implement less than 64 address bits. This breaks on platforms where such a
device is assigned an MSI address higher than what's supported.
With the binary 64-bit flag there is no other choice than disabling 64-bit
MSI support which leaves the device disfunctional.
By using a DMA mask the address limit of a device can be described
correctly which provides support for the above scenario.
- Make use of the DMA mask based address limit in the hda/intel and radeon
drivers to enable them on affected platforms.
- The usual small cleanups and improvements
-----BEGIN PGP SIGNATURE-----
iQJEBAABCgAuFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmmJyPsQHHRnbHhAa2Vy
bmVsLm9yZwAKCRCmGPVMDXSYocekEADAsS5FlUkFuBy6kODhl5J7b9/oqlL3IEnR
3CdOrFO716dce+Gej+Wp3T93dJ3XsfD7nCZuy99+LwUkTubmaBJXfjY9S+Ket0ID
Wc3ltiD6f3GEFB14rXN+fFG/u+OOLkaXdpbQpiTnqL4JAti9qF80D4uon28+FC/o
wc1MhqVBPbOHU9iM196ngkZuXCNVPLcnZN6PNBgIn0sxx06LcK+daY0bNGxfn5Ua
LY9SD8hN7tYlkDi42nB/ZXMrexqT9cxSqHObmPX+G/QLfXCRBtD+gyVbs+KVzpRL
hmFERTlUh9tUdcQFrjgiZP/r4N5ilzsu6w5ZpSOEsGuahFUPZWJWFFC1D8rmq/Ay
X9HKge1jqXJtbCf0pJM/kdbJKSH5S6aLP3iF37y+PqITIEIX8jIT3oVcvL9hI0BW
HFxpuJfhAVg63kMegZCO/iROTusLHUZr8iwYOM7pEiCE6fP46jPijsPffVIWvrlJ
2LVOv/A5wy9q8FW8sF9/M6CW7cdeYQF06Ce3qAyMxjZjEyR3KFBJCVWjhqyMxZJP
3zFl1XXKXgRO+CDrYKVTPIaXR5D76k/l6MnECQpq81CQyQKm2h6A9PyY+n70FfbZ
BimakUlBGCd92ZbSxzC9pAOiHo0ZoKtc5BhnsRhKVyBCmEKDazEplDuf49/OSZUE
p2kaf/PuOw==
=SCSQ
-----END PGP SIGNATURE-----
Merge tag 'irq-msi-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull MSI updates from Thomas Gleixner:
"Updates for the [PCI] MSI subsystem:
- Add interrupt redirection infrastructure
Some PCI controllers use a single demultiplexing interrupt for the
MSI interrupts of subordinate devices.
This prevents setting the interrupt affinity of device interrupts,
which causes device interrupts to be delivered to a single CPU.
That obviously is counterproductive for multi-queue devices and
interrupt balancing.
To work around this limitation the new infrastructure installs a
dummy irq_set_affinity() callback which captures the affinity mask
and picks a redirection target CPU out of the mask.
When the PCI controller demultiplexes the interrupts it invokes a
new handling function in the core, which either runs the interrupt
handler in the context of the target CPU or delegates it to
irq_work on the target CPU.
- Utilize the interrupt redirection mechanism in the PCI DWC host
controller driver.
This allows affinity control for the subordinate device MSI
interrupts instead of being randomly executed on the CPU which runs
the demultiplex handler.
- Replace the binary 64-bit MSI flag with a DMA mask
Some PCI devices have PCI_MSI_FLAGS_64BIT in the MSI capability,
but implement less than 64 address bits. This breaks on platforms
where such a device is assigned an MSI address higher than what's
supported.
With the binary 64-bit flag there is no other choice than disabling
64-bit MSI support which leaves the device disfunctional.
By using a DMA mask the address limit of a device can be described
correctly which provides support for the above scenario.
- Make use of the DMA mask based address limit in the hda/intel and
radeon drivers to enable them on affected platforms
- The usual small cleanups and improvements"
* tag 'irq-msi-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
ALSA: hda/intel: Make MSI address limit based on the device DMA limit
drm/radeon: Make MSI address limit based on the device DMA limit
PCI/MSI: Check the device specific address mask in msi_verify_entries()
PCI/MSI: Convert the boolean no_64bit_msi flag to a DMA address mask
genirq/redirect: Prevent writing MSI message on affinity change
PCI/MSI: Unmap MSI-X region on error
genirq: Update effective affinity for redirected interrupts
PCI: dwc: Enable MSI affinity support
PCI: dwc: Code cleanup
genirq: Add interrupt redirection infrastructure
genirq/msi: Correct kernel-doc in <linux/msi.h>
- Remove the interrupt timing infrastructure
This was added seven years ago to be used for power management purposes, but
that integration never happened.
- Clean up the remaining setup_percpu_irq() users
The memory allocator is available when interrupts can be requested so there
is not need for static irq_action. Move the remaining users to
request_percpu_irq() and delete the historical cruft.
- Warn when interrupt flag inconsistencies are detected in request*_irq().
Inconsistent flags can lead to hard to diagnose malfunction. The fallout of
this new warning has been addressed in next and the fixes are coming in via
the maintainer trees and the tip irq/cleanup pull requests.
- Invoke affinity notifier when CPU hotplug breaks affinity
Otherwise the code using the notifier misses the affinity change and
operates on stale information.
- The usual cleanups and improvements
-----BEGIN PGP SIGNATURE-----
iQJEBAABCgAuFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmmJwSAQHHRnbHhAa2Vy
bmVsLm9yZwAKCRCmGPVMDXSYoV1rD/9lxepiLVhIvaaauyr69HkHnLDJyyyIuNQk
OZ5Rx8JEnmtJIN49DJ/Ps/VjDZA6GYV2RvtKfqCNtpA3AdpHcLCDw2/mZKetHH3v
0RlTNeu0TKt76Mo+d7vadJZqdcKbBPYEBucaQed4XLGFq9Vb7nuVMFLzkDZe+C5U
zG8zhtIaPQyY+MUxFGpWJdeKcddnIC/lKRFBmhglmuflQ8ZdrydRlJZzFBxfcnPy
dcZVnOVZYD4aaW+uryd9eVWg0de5q97C03OobVdhwt077zcmTZpGanRCBM/+7Kop
UDB/p04Z5FFFKFbd+j0f76gl1l6yQ06K/2Pby8qrLnodsqOEgjEgLs0qbHhkNsVw
g5W6ry9boLWAA1XhGFOnqZ4Dc/Qb1G7DpnRUzpmAlyL04MTManjocIHNPrGhjlNG
+jT3iPuraiRUoNZ9K2PBYX9SikGVnKJXFmOwlffkPpOtaJk/oYVUAdMq2W71C4Du
TZeAbd2RyrSPMYaSh7pnBhM7KBmsYIg+hHU1+RaAi/rFSOT0Se5sUWwmUUUzSfOn
KvCAzMGX6gIg8rkSmTp7OvLQhXEfJ2kEVrVZACS6MBxS7+PNVBkxYuOqxdhdoVj0
sZrfB8Qnm4mAcbq7YJtrD6az3+LH5CkrdNyWIAL62leMmqWDoTOgLRMFh2/yEMTl
BeAOgPO2pw==
=3hJ2
-----END PGP SIGNATURE-----
Merge tag 'irq-core-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq core updates from Thomas Gleixner:
"Updates for the interrupt core subsystem:
- Remove the interrupt timing infrastructure
This was added seven years ago to be used for power management
purposes, but that integration never happened.
- Clean up the remaining setup_percpu_irq() users
The memory allocator is available when interrupts can be requested
so there is not need for static irq_action. Move the remaining
users to request_percpu_irq() and delete the historical cruft.
- Warn when interrupt flag inconsistencies are detected in
request*_irq().
Inconsistent flags can lead to hard to diagnose malfunction. The
fallout of this new warning has been addressed in next and the
fixes are coming in via the maintainer trees and the tip
irq/cleanup pull requests.
- Invoke affinity notifier when CPU hotplug breaks affinity
Otherwise the code using the notifier misses the affinity change
and operates on stale information.
- The usual cleanups and improvements"
* tag 'irq-core-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq/proc: Replace snprintf with strscpy in register_handler_proc
genirq/cpuhotplug: Notify about affinity changes breaking the affinity mask
genirq: Move clear of kstat_irqs to free_desc()
genirq: Warn about using IRQF_ONESHOT without a threaded handler
irqdomain: Fix up const problem in irq_domain_set_name()
genirq: Remove setup_percpu_irq()
clocksource/drivers/mips-gic-timer: Move GIC timer to request_percpu_irq()
MIPS: Move IP27 timer to request_percpu_irq()
MIPS: Move IP30 timer to request_percpu_irq()
genirq: Remove __request_percpu_irq() helper
genirq: Remove IRQ timing tracking infrastructure
Scheduler Kconfig space updates:
- Further consolidate configurable preemption modes: reduce
the number of architectures that are allowed to offer
PREEMPT_NONE and PREEMPT_VOLUNTARY, reducing the number
of preemption models from four to just two: 'full' and 'lazy'
on up-to-date architectures (arm64, loongarch, powerpc,
riscv, s390, x86).
None and voluntary are only available as legacy features
on platforms that don't implement lazy preemption yet,
or which don't even support preemption.
The goal is to eventually remove cond_resched() and
voluntary preemption altogether.
(Peter Zijlstra)
RSEQ based 'scheduler time slice extension' support:
This allows a thread to request a time slice extension when it
enters a critical section to avoid contention on a resource when
the thread is scheduled out inside of the critical section.
- Add fields and constants for time slice extension
- Provide static branch for time slice extensions
- Add statistics for time slice extensions
- Add prctl() to enable time slice extensions
- Implement sys_rseq_slice_yield()
- Implement syscall entry work for time slice extensions
- Implement time slice extension enforcement timer
- Reset slice extension when scheduled
- Implement rseq_grant_slice_extension()
- entry: Hook up rseq time slice extension
- selftests: Implement time slice extension test
(Thomas Gleixner)
- Allow registering RSEQ with slice extension
- Move slice_ext_nsec to debugfs
- Lower default slice extension
- selftests/rseq: Add rseq slice histogram script
(Peter Zijlstra)
Scheduler performance/scalability improvements:
- Update rq->avg_idle when a task is moved to an idle CPU,
which improves the scalability of various workloads.
(Shubhang Kaushik)
- Reorder fields in 'struct rq' for better caching
(Blake Jones)
- Fair scheduler SMP NOHZ balancing code speedups:
- Move checking for nohz cpus after time check
- Change likelyhood of nohz.nr_cpus
- Remove nohz.nr_cpus and use weight of cpumask instead
(Shrikanth Hegde)
- Avoid false sharing for sched_clock_irqtime (Wangyang Guo)
- Drop useless cpumask_empty() in find_energy_efficient_cpu()
- Simplify task_numa_find_cpu()
- Use cpumask_weight_and() in sched_balance_find_dst_group()
(Yury Norov)
DL scheduler updates:
- Add a deadline server for sched_ext tasks (by Andrea Righi and
Joel Fernandes, with fixes by Peter Zijlstra)
RT scheduler updates:
- Skip currently executing CPU in rto_next_cpu() (Chen Jinghuang)
Entry code updates and performance improvements, which is part of the
scheduler tree in this cycle due to interdependencies with the RSEQ
based time slice extension work:
- Remove unused syscall argument from syscall_trace_enter()
- Rework syscall_exit_to_user_mode_work() for architecture reuse
- Add arch_ptrace_report_syscall_entry/exit()
- Inline syscall_exit_work() and syscall_trace_enter()
(Jinjie Ruan)
Scheduler core updates:
- Rework sched_class::wakeup_preempt() and rq_modified_*()
- Avoid rq->lock bouncing in sched_balance_newidle()
- Rename rcu_dereference_check_sched_domain() =>
rcu_dereference_sched_domain()
- <linux/compiler_types.h>: Add the __signed_scalar_typeof() helper
(Peter Zijlstra)
Fair scheduler updates/refactoring:
- Fold the sched_avg update
- Change rcu_dereference_check_sched_domain() to rcu-sched
- Switch to rcu_dereference_all()
- Remove superfluous rcu_read_lock()
- Limit hrtick work
(Peter Zijlstra)
- Join two #ifdef CONFIG_FAIR_GROUP_SCHED blocks
- Clean up comments in 'struct cfs_rq'
- Separate se->vlag from se->vprot
- Rename cfs_rq::avg_load to cfs_rq::sum_weight
- Rename cfs_rq::avg_vruntime to ::sum_w_vruntime & helper functions
- Introduce and use the vruntime_cmp() and vruntime_op() wrappers
for wrapped-signed aritmetics
- Sort out 'blocked_load*' namespace noise
(Ingo Molnar)
Scheduler debugging code updates:
- Export hidden tracepoints to modules (Gabriele Monaco)
- Convert copy_from_user() + kstrtouint() to kstrtouint_from_user()
(Fushuai Wang)
- Add assertions to QUEUE_CLASS (Peter Zijlstra)
- hrtimer: Fix tracing oddity (Thomas Gleixner)
Misc fixes and cleanups:
- Re-evaluate scheduling when migrating queued tasks out of
throttled cgroups (Zicheng Qu)
- Remove task_struct->faults_disabled_mapping (Christoph Hellwig)
- Fix math notation errors in avg_vruntime comment (Zhan Xusheng)
- sched/cpufreq: Use %pe format for PTR_ERR() printing (zenghongling)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmmJj+IRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1grtQ//WyXYGVE/WicdqslfaCY2Mr0uJnL0tLSM
CJp+0LROdkmy+ChJmftO8RgjCUSsjhC4/xcBhUQXApf/ffQi3b2jH6nkTp/Z64Ms
p2IXLkBiZjwdcO6fGbB0JE2G1J4hGRC5BlqfgkZzWidMf3kIbmrHg99mVWGzODLY
N/cPW4d0WGf9TScl1FgEiOqgF3czMLlqvTDJqaFMpsTzSUcRBnrG4xushb4W/bBx
573eqxgZJ6urNSGu8niY9PAl9F7gskXW3YxI3k8SH7VmJKSevWlwI9vMEhcRDzud
E0XxD7J8iPOKtr7ypXm7anMBv4jWVUdAnPbYi4TDsyDDU/HguqMqT1McTGn8wQ+F
jmdhmMC9/TEIzq93SNLbCYieibqDsmJoNVFFi0FWfPLMtYbcZd5a884SIz532vx4
DegdlDXdazUwhxzDiQR3sq1CsHXpxNS2YdrpadAtF/r2gU86DQjsEew8yBvXi7bb
Wrkzpax70sU1AFI23wJQkEb/OnnXyehAHAhhQN6GVvuiGr9P7C02WLEGLlmSmJrx
zl2F750P76yhTfGcvTfJ/5LTfSB+yRozGvcdXnIkyzWotY6a2D1MKNusAfVax+IR
kyfAWqVdxBhlKnqYbu92lTogvnPh3Lymd6G4TZZRkSH2jixyGd2oS7nZaDBAeBEM
NHQtr9R+KyU=
=Xj2f
-----END PGP SIGNATURE-----
Merge tag 'sched-core-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"Scheduler Kconfig space updates:
- Further consolidate configurable preemption modes (Peter Zijlstra)
Reduce the number of architectures that are allowed to offer
PREEMPT_NONE and PREEMPT_VOLUNTARY, reducing the number of
preemption models from four to just two: 'full' and 'lazy' on
up-to-date architectures (arm64, loongarch, powerpc, riscv, s390,
x86).
None and voluntary are only available as legacy features on
platforms that don't implement lazy preemption yet, or which don't
even support preemption.
The goal is to eventually remove cond_resched() and voluntary
preemption altogether.
RSEQ based 'scheduler time slice extension' support (Thomas Gleixner
and Peter Zijlstra):
This allows a thread to request a time slice extension when it enters
a critical section to avoid contention on a resource when the thread
is scheduled out inside of the critical section.
- Add fields and constants for time slice extension
- Provide static branch for time slice extensions
- Add statistics for time slice extensions
- Add prctl() to enable time slice extensions
- Implement sys_rseq_slice_yield()
- Implement syscall entry work for time slice extensions
- Implement time slice extension enforcement timer
- Reset slice extension when scheduled
- Implement rseq_grant_slice_extension()
- entry: Hook up rseq time slice extension
- selftests: Implement time slice extension test
- Allow registering RSEQ with slice extension
- Move slice_ext_nsec to debugfs
- Lower default slice extension
- selftests/rseq: Add rseq slice histogram script
Scheduler performance/scalability improvements:
- Update rq->avg_idle when a task is moved to an idle CPU, which
improves the scalability of various workloads (Shubhang Kaushik)
- Reorder fields in 'struct rq' for better caching (Blake Jones)
- Fair scheduler SMP NOHZ balancing code speedups (Shrikanth Hegde):
- Move checking for nohz cpus after time check
- Change likelyhood of nohz.nr_cpus
- Remove nohz.nr_cpus and use weight of cpumask instead
- Avoid false sharing for sched_clock_irqtime (Wangyang Guo)
- Cleanups (Yury Norov):
- Drop useless cpumask_empty() in find_energy_efficient_cpu()
- Simplify task_numa_find_cpu()
- Use cpumask_weight_and() in sched_balance_find_dst_group()
DL scheduler updates:
- Add a deadline server for sched_ext tasks (by Andrea Righi and Joel
Fernandes, with fixes by Peter Zijlstra)
RT scheduler updates:
- Skip currently executing CPU in rto_next_cpu() (Chen Jinghuang)
Entry code updates and performance improvements (Jinjie Ruan)
This is part of the scheduler tree in this cycle due to inter-
dependencies with the RSEQ based time slice extension work:
- Remove unused syscall argument from syscall_trace_enter()
- Rework syscall_exit_to_user_mode_work() for architecture reuse
- Add arch_ptrace_report_syscall_entry/exit()
- Inline syscall_exit_work() and syscall_trace_enter()
Scheduler core updates (Peter Zijlstra):
- Rework sched_class::wakeup_preempt() and rq_modified_*()
- Avoid rq->lock bouncing in sched_balance_newidle()
- Rename rcu_dereference_check_sched_domain() =>
rcu_dereference_sched_domain()
- <linux/compiler_types.h>: Add the __signed_scalar_typeof() helper
Fair scheduler updates/refactoring (Peter Zijlstra and Ingo Molnar):
- Fold the sched_avg update
- Change rcu_dereference_check_sched_domain() to rcu-sched
- Switch to rcu_dereference_all()
- Remove superfluous rcu_read_lock()
- Limit hrtick work
- Join two #ifdef CONFIG_FAIR_GROUP_SCHED blocks
- Clean up comments in 'struct cfs_rq'
- Separate se->vlag from se->vprot
- Rename cfs_rq::avg_load to cfs_rq::sum_weight
- Rename cfs_rq::avg_vruntime to ::sum_w_vruntime & helper functions
- Introduce and use the vruntime_cmp() and vruntime_op() wrappers for
wrapped-signed aritmetics
- Sort out 'blocked_load*' namespace noise
Scheduler debugging code updates:
- Export hidden tracepoints to modules (Gabriele Monaco)
- Convert copy_from_user() + kstrtouint() to kstrtouint_from_user()
(Fushuai Wang)
- Add assertions to QUEUE_CLASS (Peter Zijlstra)
- hrtimer: Fix tracing oddity (Thomas Gleixner)
Misc fixes and cleanups:
- Re-evaluate scheduling when migrating queued tasks out of throttled
cgroups (Zicheng Qu)
- Remove task_struct->faults_disabled_mapping (Christoph Hellwig)
- Fix math notation errors in avg_vruntime comment (Zhan Xusheng)
- sched/cpufreq: Use %pe format for PTR_ERR() printing
(zenghongling)"
* tag 'sched-core-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (64 commits)
sched: Re-evaluate scheduling when migrating queued tasks out of throttled cgroups
sched/cpufreq: Use %pe format for PTR_ERR() printing
sched/rt: Skip currently executing CPU in rto_next_cpu()
sched/clock: Avoid false sharing for sched_clock_irqtime
selftests/sched_ext: Add test for DL server total_bw consistency
selftests/sched_ext: Add test for sched_ext dl_server
sched/debug: Fix dl_server (re)start conditions
sched/debug: Add support to change sched_ext server params
sched_ext: Add a DL server for sched_ext tasks
sched/debug: Stop and start server based on if it was active
sched/debug: Fix updating of ppos on server write ops
sched/deadline: Clear the defer params
entry: Inline syscall_exit_work() and syscall_trace_enter()
entry: Add arch_ptrace_report_syscall_entry/exit()
entry: Rework syscall_exit_to_user_mode_work() for architecture reuse
entry: Remove unused syscall argument from syscall_trace_enter()
sched: remove task_struct->faults_disabled_mapping
sched: Update rq->avg_idle when a task is moved to an idle CPU
selftests/rseq: Add rseq slice histogram script
hrtimer: Fix trace oddity
...
Lock debugging:
- Implement compiler-driven static analysis locking context
checking, using the upcoming Clang 22 compiler's context
analysis features. (Marco Elver)
We removed Sparse context analysis support, because prior to
removal even a defconfig kernel produced 1,700+ context
tracking Sparse warnings, the overwhelming majority of which
are false positives. On an allmodconfig kernel the number of
false positive context tracking Sparse warnings grows to
over 5,200... On the plus side of the balance actual locking
bugs found by Sparse context analysis is also rather ... sparse:
I found only 3 such commits in the last 3 years. So the
rate of false positives and the maintenance overhead is
rather high and there appears to be no active policy in
place to achieve a zero-warnings baseline to move the
annotations & fixers to developers who introduce new code.
Clang context analysis is more complete and more aggressive
in trying to find bugs, at least in principle. Plus it has
a different model to enabling it: it's enabled subsystem by
subsystem, which results in zero warnings on all relevant
kernel builds (as far as our testing managed to cover it).
Which allowed us to enable it by default, similar to other
compiler warnings, with the expectation that there are no
warnings going forward. This enforces a zero-warnings baseline
on clang-22+ builds. (Which are still limited in distribution,
admittedly.)
Hopefully the Clang approach can lead to a more maintainable
zero-warnings status quo and policy, with more and more
subsystems and drivers enabling the feature. Context tracking
can be enabled for all kernel code via WARN_CONTEXT_ANALYSIS_ALL=y
(default disabled), but this will generate a lot of false positives.
( Having said that, Sparse support could still be added back,
if anyone is interested - the removal patch is still
relatively straightforward to revert at this stage. )
Rust integration updates: (Alice Ryhl, Fujita Tomonori, Boqun Feng)
- Add support for Atomic<i8/i16/bool> and replace most Rust native
AtomicBool usages with Atomic<bool>
- Clean up LockClassKey and improve its documentation
- Add missing Send and Sync trait implementation for SetOnce
- Make ARef Unpin as it is supposed to be
- Add __rust_helper to a few Rust helpers as a preparation for
helper LTO
- Inline various lock related functions to avoid additional
function calls.
WW mutexes:
- Extend ww_mutex tests and other test-ww_mutex updates (John Stultz)
Misc fixes and cleanups:
- rcu: Mark lockdep_assert_rcu_helper() __always_inline
(Arnd Bergmann)
- locking/local_lock: Include more missing headers (Peter Zijlstra)
- seqlock: fix scoped_seqlock_read kernel-doc (Randy Dunlap)
- rust: sync: Replace `kernel::c_str!` with C-Strings
(Tamir Duberstein)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmmIXiURHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1gH+A/9GX5UmU6+HuDfDrCtXm9GDve6wkwahvcW
jLDxOYjs764I2BhyjZnjKjyF5zw60hbykem7Wcf5EV2YH30nM4XRgEWVJfkr1UAI
Pra415X4DdOzZ6qYQIpO8Udt1LtR7BMSaXITVLJaLicxEoOVtq3SKxjqyhCFs7UW
MfJdqleB+RMLqq3LlzgB4l43eKk1xyeHh+oQwI0RSxuIpVZme3p4TObnCKjIWnK7
Ihd+dkgC852WBjANgNL7F/sd5UsF5QX3wjtOrLhMKvkIgTPdXln0g398pivjN/G/
Kpnw18SFeb159JfJu8eMotsYvVnQ0D5aOcTBfL4qvOHCImhpcu2s6ik9BcXqt2yT
8IiuWk9xEM3Ok+I/I4ClT5cf5GYpyigV2QsXxn+IjDX5Na8v4zlHh0r8SElP8fOt
7dpQx7iw8UghAib3AzA3suN78Oh39m8l5BNobj7LAjnqOQcVvoPo4o7/48ntuH7A
38EucFrXfxQBMfGbMwvxEmgYuX7MyVfQLaPE06MHy1BkZkffT8Um38TB0iNtZmtf
WUx01yLKWYspehlwFi319uVI4/Zp7FnTfqa5uKv1oSXVdL9vZojSXUzrgDV7FVqT
Z4xAAw/kwNHpUG7y0zNOqd6PukovG1t+CjbLvK+eHPwc5c0vEGG2oTRAfEvvP1z/
kesYDmCyJnk=
=N1gA
-----END PGP SIGNATURE-----
Merge tag 'locking-core-2026-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
"Lock debugging:
- Implement compiler-driven static analysis locking context checking,
using the upcoming Clang 22 compiler's context analysis features
(Marco Elver)
We removed Sparse context analysis support, because prior to
removal even a defconfig kernel produced 1,700+ context tracking
Sparse warnings, the overwhelming majority of which are false
positives. On an allmodconfig kernel the number of false positive
context tracking Sparse warnings grows to over 5,200... On the plus
side of the balance actual locking bugs found by Sparse context
analysis is also rather ... sparse: I found only 3 such commits in
the last 3 years. So the rate of false positives and the
maintenance overhead is rather high and there appears to be no
active policy in place to achieve a zero-warnings baseline to move
the annotations & fixers to developers who introduce new code.
Clang context analysis is more complete and more aggressive in
trying to find bugs, at least in principle. Plus it has a different
model to enabling it: it's enabled subsystem by subsystem, which
results in zero warnings on all relevant kernel builds (as far as
our testing managed to cover it). Which allowed us to enable it by
default, similar to other compiler warnings, with the expectation
that there are no warnings going forward. This enforces a
zero-warnings baseline on clang-22+ builds (Which are still limited
in distribution, admittedly)
Hopefully the Clang approach can lead to a more maintainable
zero-warnings status quo and policy, with more and more subsystems
and drivers enabling the feature. Context tracking can be enabled
for all kernel code via WARN_CONTEXT_ANALYSIS_ALL=y (default
disabled), but this will generate a lot of false positives.
( Having said that, Sparse support could still be added back,
if anyone is interested - the removal patch is still
relatively straightforward to revert at this stage. )
Rust integration updates: (Alice Ryhl, Fujita Tomonori, Boqun Feng)
- Add support for Atomic<i8/i16/bool> and replace most Rust native
AtomicBool usages with Atomic<bool>
- Clean up LockClassKey and improve its documentation
- Add missing Send and Sync trait implementation for SetOnce
- Make ARef Unpin as it is supposed to be
- Add __rust_helper to a few Rust helpers as a preparation for
helper LTO
- Inline various lock related functions to avoid additional function
calls
WW mutexes:
- Extend ww_mutex tests and other test-ww_mutex updates (John
Stultz)
Misc fixes and cleanups:
- rcu: Mark lockdep_assert_rcu_helper() __always_inline (Arnd
Bergmann)
- locking/local_lock: Include more missing headers (Peter Zijlstra)
- seqlock: fix scoped_seqlock_read kernel-doc (Randy Dunlap)
- rust: sync: Replace `kernel::c_str!` with C-Strings (Tamir
Duberstein)"
* tag 'locking-core-2026-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (90 commits)
locking/rwlock: Fix write_trylock_irqsave() with CONFIG_INLINE_WRITE_TRYLOCK
rcu: Mark lockdep_assert_rcu_helper() __always_inline
compiler-context-analysis: Remove __assume_ctx_lock from initializers
tomoyo: Use scoped init guard
crypto: Use scoped init guard
kcov: Use scoped init guard
compiler-context-analysis: Introduce scoped init guards
cleanup: Make __DEFINE_LOCK_GUARD handle commas in initializers
seqlock: fix scoped_seqlock_read kernel-doc
tools: Update context analysis macros in compiler_types.h
rust: sync: Replace `kernel::c_str!` with C-Strings
rust: sync: Inline various lock related methods
rust: helpers: Move #define __rust_helper out of atomic.c
rust: wait: Add __rust_helper to helpers
rust: time: Add __rust_helper to helpers
rust: task: Add __rust_helper to helpers
rust: sync: Add __rust_helper to helpers
rust: refcount: Add __rust_helper to helpers
rust: rcu: Add __rust_helper to helpers
rust: processor: Add __rust_helper to helpers
...
x86 PMU driver updates:
- Add support for the core PMU for Intel Diamond Rapids (DMR) CPUs.
Compared to previous iterations of the Intel PMU code, there's
been a lot of changes, which center around three main areas:
- Introduce the OFF-MODULE RESPONSE (OMR) facility to
replace the Off-Core Response (OCR) facility
- New PEBS data source encoding layout
- Support the new "RDPMC user disable" feature
(Dapeng Mi)
- Likewise, a large series adds uncore PMU support for
Intel Diamond Rapids (DMR) CPUs, which center around these
four main areas:
- DMR may have two Integrated I/O and Memory Hub (IMH) dies,
separate from the compute tile (CBB) dies. Each CBB and
each IMH die has its own discovery domain.
- Unlike prior CPUs that retrieve the global discovery table
portal exclusively via PCI or MSR, DMR uses PCI for IMH PMON
discovery and MSR for CBB PMON discovery.
- DMR introduces several new PMON types: SCA, HAMVF, D2D_ULA,
UBR, PCIE4, CRS, CPC, ITC, OTC, CMS, and PCIE6.
- IIO free-running counters in DMR are MMIO-based, unlike SPR.
(Zide Chen)
- Also add support for Add missing PMON units for Intel Panther Lake,
and support Nova Lake (NVL), which largely maps to Panther Lake.
(Zide Chen)
- KVM integration: Add support for mediated vPMUs (by Kan Liang
and Sean Christopherson, with fixes and cleanups by Peter Zijlstra,
Sandipan Das and Mingwei Zhang)
- Add Intel cstate driver to support for Wildcat Lake (WCL)
CPUs, which are a low-power variant of Panther Lake.
(Zide Chen)
- Add core, cstate and MSR PMU support for the Airmont NP Intel CPU
(aka MaxLinear Lightning Mountain), which maps to the existing
Airmont code. (Martin Schiller)
Performance enhancements:
- core: Speed up kexec shutdown by avoiding unnecessary
cross CPU calls. (Jan H. Schönherr)
- core: Fix slow perf_event_task_exit() with LBR callstacks
(Namhyung Kim)
User-space stack unwinding support:
- Various cleanups and refactorings in preparation to generalize
the unwinding code for other architectures. (Jens Remus)
Uprobes updates:
- Transition from kmap_atomic to kmap_local_page (Keke Ming)
- Fix incorrect lockdep condition in filter_chain() (Breno Leitao)
- Fix XOL allocation failure for 32-bit tasks (Oleg Nesterov)
Misc fixes and cleanups:
- s390: Remove kvm_types.h from Kbuild (Randy Dunlap)
- x86/intel/uncore: Convert comma to semicolon (Chen Ni)
- x86/uncore: Clean up const mismatch (Greg Kroah-Hartman)
- x86/ibs: Fix typo in dc_l2tlb_miss comment (Xiang-Bin Shi)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmmJhTURHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1i/qw/9F/sjMqbxH8d3kPXB8wk2eUuSkynQ2aNw
Zec9qtfCC5N1U9b2D7ywGJRscTmWYnX/3BKTyzFuyA6SDz6buAgDDIGPlHi+9Fww
+RUUS3lQ7N3pVWZ4Ifu3kbh3Vz4lkQuOXhfcjiyIMS6QIxfrcSLFoKHK+2V6PeU+
x0k+THHz/Ymg+DIpqSjqil1yrKaUmU9xRrbnyy6zJB1duREQrkYBhIWL1+bcd7SA
89RVAGXQ+sWzVQMPaKrMkZj6GavOCB7zseigiiwjBRLznukS2OulDDe8zR6pCJZp
wbdc7TR/nCm+QtNfkHlOmTQvsPAXiXNyXe5Vi8aFjGc0uMGhHaeiL9ah/bwsKA5m
Bm5Y7oVSmBlCJbcr/CTrGYkb+WwvLIgPCwVkn4FYPlsWv+U92qTOx9q7qKmIDFaj
1oUXCwoHbYrYnoZqZqPp2h689m0Lh/lsGhy0QRt8aGnKu0SDjMaqRGbYWH0UI0kA
aDnZstVHG76RnVi0143q8HcvvjNZb82NL0cS749tY/YcwH4kUEGj2XuSK2Ar887T
H0oDJHXijMlGXWqO5bK3WMoCQajR7nRyqBo7/rKYj20OjXmwoXS3vel77/s8WFo2
fUUC469MacDzyxdBNutnkJvvcvsUio3r4MWFsEEWQk2nUE58PtN8YM8j/FdNYpql
zAZ4Jx/A0RM=
=4SRB
-----END PGP SIGNATURE-----
Merge tag 'perf-core-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull performance event updates from Ingo Molnar:
"x86 PMU driver updates:
- Add support for the core PMU for Intel Diamond Rapids (DMR) CPUs
(Dapeng Mi)
Compared to previous iterations of the Intel PMU code, there's been
a lot of changes, which center around three main areas:
- Introduce the OFF-MODULE RESPONSE (OMR) facility to replace the
Off-Core Response (OCR) facility
- New PEBS data source encoding layout
- Support the new "RDPMC user disable" feature
- Likewise, a large series adds uncore PMU support for Intel Diamond
Rapids (DMR) CPUs (Zide Chen)
This centers around these four main areas:
- DMR may have two Integrated I/O and Memory Hub (IMH) dies,
separate from the compute tile (CBB) dies. Each CBB and each IMH
die has its own discovery domain.
- Unlike prior CPUs that retrieve the global discovery table
portal exclusively via PCI or MSR, DMR uses PCI for IMH PMON
discovery and MSR for CBB PMON discovery.
- DMR introduces several new PMON types: SCA, HAMVF, D2D_ULA, UBR,
PCIE4, CRS, CPC, ITC, OTC, CMS, and PCIE6.
- IIO free-running counters in DMR are MMIO-based, unlike SPR.
- Also add support for Add missing PMON units for Intel Panther Lake,
and support Nova Lake (NVL), which largely maps to Panther Lake.
(Zide Chen)
- KVM integration: Add support for mediated vPMUs (by Kan Liang and
Sean Christopherson, with fixes and cleanups by Peter Zijlstra,
Sandipan Das and Mingwei Zhang)
- Add Intel cstate driver to support for Wildcat Lake (WCL) CPUs,
which are a low-power variant of Panther Lake (Zide Chen)
- Add core, cstate and MSR PMU support for the Airmont NP Intel CPU
(aka MaxLinear Lightning Mountain), which maps to the existing
Airmont code (Martin Schiller)
Performance enhancements:
- Speed up kexec shutdown by avoiding unnecessary cross CPU calls
(Jan H. Schönherr)
- Fix slow perf_event_task_exit() with LBR callstacks (Namhyung Kim)
User-space stack unwinding support:
- Various cleanups and refactorings in preparation to generalize the
unwinding code for other architectures (Jens Remus)
Uprobes updates:
- Transition from kmap_atomic to kmap_local_page (Keke Ming)
- Fix incorrect lockdep condition in filter_chain() (Breno Leitao)
- Fix XOL allocation failure for 32-bit tasks (Oleg Nesterov)
Misc fixes and cleanups:
- s390: Remove kvm_types.h from Kbuild (Randy Dunlap)
- x86/intel/uncore: Convert comma to semicolon (Chen Ni)
- x86/uncore: Clean up const mismatch (Greg Kroah-Hartman)
- x86/ibs: Fix typo in dc_l2tlb_miss comment (Xiang-Bin Shi)"
* tag 'perf-core-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (58 commits)
s390: remove kvm_types.h from Kbuild
uprobes: Fix incorrect lockdep condition in filter_chain()
x86/ibs: Fix typo in dc_l2tlb_miss comment
x86/uprobes: Fix XOL allocation failure for 32-bit tasks
perf/x86/intel/uncore: Convert comma to semicolon
perf/x86/intel: Add support for rdpmc user disable feature
perf/x86: Use macros to replace magic numbers in attr_rdpmc
perf/x86/intel: Add core PMU support for Novalake
perf/x86/intel: Add support for PEBS memory auxiliary info field in NVL
perf/x86/intel: Add core PMU support for DMR
perf/x86/intel: Add support for PEBS memory auxiliary info field in DMR
perf/x86/intel: Support the 4 new OMR MSRs introduced in DMR and NVL
perf/core: Fix slow perf_event_task_exit() with LBR callstacks
perf/core: Speed up kexec shutdown by avoiding unnecessary cross CPU calls
uprobes: use kmap_local_page() for temporary page mappings
arm/uprobes: use kmap_local_page() in arch_uprobe_copy_ixol()
mips/uprobes: use kmap_local_page() in arch_uprobe_copy_ixol()
arm64/uprobes: use kmap_local_page() in arch_uprobe_copy_ixol()
riscv/uprobes: use kmap_local_page() in arch_uprobe_copy_ixol()
perf/x86/intel/uncore: Add Nova Lake support
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmmGmrgACgkQ6rmadz2v
bTq6NxAAkCHosxzGn9GYYBV8xhrBJoJJDCyEbQ4nR0XNY+zaWnuykmiPP9w1aOAM
zm/po3mQB2pZjetvlrPrgG5RLgBCAUHzqVGy0r+phUvD3vbohKlmSlMm2kiXOb9N
T01BgLWsyqN2ZcNFvORdSsftqIJUHcXxU6RdupGD60sO5XM9ty5cwyewLX8GBOas
UN2bOhbK2DpqYWUvtv+3Q3ykxoStMSkXZvDRurwLKl4RHeLjXZXPo8NjnfBlk/F2
vdFo/F4NO4TmhOave6UPXvKb4yo9IlBRmiPAl0RmNKBxenY8j9XuV/xZxU6YgzDn
+SQfDK+CKQ4IYIygE+fqd4e5CaQrnjmPPcIw12AB2CF0LimY9Xxyyk6FSAhMN7wm
GTVh5K2C3Dk3OiRQk4G58EvQ5QcxzX98IeeCpcckMUkPsFWHRvF402WMUcv9SWpD
DsxxPkfENY/6N67EvH0qcSe/ikdUorQKFl4QjXKwsMCd5WhToeP4Z7Ck1gVSNkAh
9CX++mLzg333Lpsc4SSIuk9bEPpFa5cUIKUY7GCsCiuOXciPeMDP3cGSd5LioqxN
qWljs4Z88QDM2LJpAh8g4m3sA7bMhES3nPmdlI5CfgBcVyLW8D8CqQq4GEZ1McwL
Ky084+lEosugoVjRejrdMMEOsqAfcbkTr2b8jpuAZdwJKm6p/bw=
=cBdK
-----END PGP SIGNATURE-----
Merge tag 'bpf-next-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Pull bpf updates from Alexei Starovoitov:
- Support associating BPF program with struct_ops (Amery Hung)
- Switch BPF local storage to rqspinlock and remove recursion detection
counters which were causing false positives (Amery Hung)
- Fix live registers marking for indirect jumps (Anton Protopopov)
- Introduce execution context detection BPF helpers (Changwoo Min)
- Improve verifier precision for 32bit sign extension pattern
(Cupertino Miranda)
- Optimize BTF type lookup by sorting vmlinux BTF and doing binary
search (Donglin Peng)
- Allow states pruning for misc/invalid slots in iterator loops (Eduard
Zingerman)
- In preparation for ASAN support in BPF arenas teach libbpf to move
global BPF variables to the end of the region and enable arena kfuncs
while holding locks (Emil Tsalapatis)
- Introduce support for implicit arguments in kfuncs and migrate a
number of them to new API. This is a prerequisite for cgroup
sub-schedulers in sched-ext (Ihor Solodrai)
- Fix incorrect copied_seq calculation in sockmap (Jiayuan Chen)
- Fix ORC stack unwind from kprobe_multi (Jiri Olsa)
- Speed up fentry attach by using single ftrace direct ops in BPF
trampolines (Jiri Olsa)
- Require frozen map for calculating map hash (KP Singh)
- Fix lock entry creation in TAS fallback in rqspinlock (Kumar
Kartikeya Dwivedi)
- Allow user space to select cpu in lookup/update operations on per-cpu
array and hash maps (Leon Hwang)
- Make kfuncs return trusted pointers by default (Matt Bobrowski)
- Introduce "fsession" support where single BPF program is executed
upon entry and exit from traced kernel function (Menglong Dong)
- Allow bpf_timer and bpf_wq use in all programs types (Mykyta
Yatsenko, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Alexei
Starovoitov)
- Make KF_TRUSTED_ARGS the default for all kfuncs and clean up their
definition across the tree (Puranjay Mohan)
- Allow BPF arena calls from non-sleepable context (Puranjay Mohan)
- Improve register id comparison logic in the verifier and extend
linked registers with negative offsets (Puranjay Mohan)
- In preparation for BPF-OOM introduce kfuncs to access memcg events
(Roman Gushchin)
- Use CFI compatible destructor kfunc type (Sami Tolvanen)
- Add bitwise tracking for BPF_END in the verifier (Tianci Cao)
- Add range tracking for BPF_DIV and BPF_MOD in the verifier (Yazhou
Tang)
- Make BPF selftests work with 64k page size (Yonghong Song)
* tag 'bpf-next-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (268 commits)
selftests/bpf: Fix outdated test on storage->smap
selftests/bpf: Choose another percpu variable in bpf for btf_dump test
selftests/bpf: Remove test_task_storage_map_stress_lookup
selftests/bpf: Update task_local_storage/task_storage_nodeadlock test
selftests/bpf: Update task_local_storage/recursion test
selftests/bpf: Update sk_storage_omem_uncharge test
bpf: Switch to bpf_selem_unlink_nofail in bpf_local_storage_{map_free, destroy}
bpf: Support lockless unlink when freeing map or local storage
bpf: Prepare for bpf_selem_unlink_nofail()
bpf: Remove unused percpu counter from bpf_local_storage_map_free
bpf: Remove cgroup local storage percpu counter
bpf: Remove task local storage percpu counter
bpf: Change local_storage->lock and b->lock to rqspinlock
bpf: Convert bpf_selem_unlink to failable
bpf: Convert bpf_selem_link_map to failable
bpf: Convert bpf_selem_unlink_map to failable
bpf: Select bpf_local_storage_map_bucket based on bpf_local_storage
selftests/xsk: fix number of Tx frags in invalid packet
selftests/xsk: properly handle batch ending in the middle of a packet
bpf: Prevent reentrance into call_rcu_tasks_trace()
...
Module signing:
- Remove SHA-1 support for signing modules. SHA-1 is no longer
considered secure for signatures due to vulnerabilities that can
lead to hash collisions. None of the major distributions use
SHA-1 anymore, and the kernel has defaulted to SHA-512 since
v6.11. Note that loading SHA-1 signed modules is still supported.
- Update scripts/sign-file to use only the OpenSSL CMS API for
signing. As SHA-1 support is gone, we can drop the legacy PKCS#7
API which was limited to SHA-1. This also cleans up support for
legacy OpenSSL versions.
Cleanups and fixes:
- Use system_dfl_wq instead of the per-cpu system_wq following the
ongoing workqueue API refactoring.
- Avoid open-coded kvrealloc() in module decompression logic by
using the standard helper.
- Improve section annotations by replacing the custom __modinit
with __init_or_module and removing several unused __INIT*_OR_MODULE
macros.
- Fix kernel-doc warnings in include/linux/moduleparam.h.
- Ensure set_module_sig_enforced is only declared when module
signing is enabled.
- Fix gendwarfksyms build failures on 32-bit hosts.
MAINTAINERS:
- Update the module subsystem entry to reflect the maintainer
rotation and update the git repository link.
The changes have been soaking in linux-next since -rc2.
Note that like Daniel mentioned in the previous pull request [1], we
rotate maintainership every 6 months, and I will be handling the module
subsystem pull requests for the first half of this year.
Link: https://lore.kernel.org/r/20251203234840.3720-1-da.gomez@kernel.org [1]
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQSE9au1u/dCZerzchhaByWrOaGnegUCaYYeeAAKCRBaByWrOaGn
epIxAQDU/VSAC491S9/5dAUeGbOis9/p6QJKQlNgEqU4oTlOsgEA0p8BZ9Spkwzd
v9BfIl3j9qVt7wUdlLdbHfdvPgtUVgc=
=at6w
-----END PGP SIGNATURE-----
Merge tag 'modules-7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux
Pull module updates from Sami Tolvanen:
"Module signing:
- Remove SHA-1 support for signing modules.
SHA-1 is no longer considered secure for signatures due to
vulnerabilities that can lead to hash collisions. None of the major
distributions use SHA-1 anymore, and the kernel has defaulted to
SHA-512 since v6.11.
Note that loading SHA-1 signed modules is still supported.
- Update scripts/sign-file to use only the OpenSSL CMS API for
signing.
As SHA-1 support is gone, we can drop the legacy PKCS#7 API which
was limited to SHA-1. This also cleans up support for legacy
OpenSSL versions.
Cleanups and fixes:
- Use system_dfl_wq instead of the per-cpu system_wq following the
ongoing workqueue API refactoring.
- Avoid open-coded kvrealloc() in module decompression logic by using
the standard helper.
- Improve section annotations by replacing the custom __modinit with
__init_or_module and removing several unused __INIT*_OR_MODULE
macros.
- Fix kernel-doc warnings in include/linux/moduleparam.h.
- Ensure set_module_sig_enforced is only declared when module signing
is enabled.
- Fix gendwarfksyms build failures on 32-bit hosts.
MAINTAINERS:
- Update the module subsystem entry to reflect the maintainer
rotation and update the git repository link"
* tag 'modules-7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux:
modules: moduleparam.h: fix kernel-doc comments
module: Only declare set_module_sig_enforced when CONFIG_MODULE_SIG=y
module/decompress: Avoid open-coded kvrealloc()
gendwarfksyms: Fix build on 32-bit hosts
sign-file: Use only the OpenSSL CMS API for signing
module: Remove SHA-1 support for module signing
module: replace use of system_wq with system_dfl_wq
params: Replace __modinit with __init_or_module
module: Remove unused __INIT*_OR_MODULE macros
MAINTAINERS: Update module subsystem maintainers and repository
API:
- Fix race condition in hwrng core by using RCU.
Algorithms:
- Allow authenc(sha224,rfc3686) in fips mode.
- Add test vectors for authenc(hmac(sha384),cbc(aes)).
- Add test vectors for authenc(hmac(sha224),cbc(aes)).
- Add test vectors for authenc(hmac(md5),cbc(des3_ede)).
- Add lz4 support in hisi_zip.
- Only allow clear key use during self-test in s390/{phmac,paes}.
Drivers:
- Set rng quality to 900 in airoha.
- Add gcm(aes) support for AMD/Xilinx Versal device.
- Allow tfms to share device in hisilicon/trng.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEn51F/lCuNhUwmDeSxycdCkmxi6cFAmmJlNEACgkQxycdCkmx
i6dfYw//fLKHita7B7k6Rnfv7aTX7ZaF7bwMb1w2OtNu7061ZK1+Ou127ZjFKFxC
qJtI71qmTnhTOXnqeLHDio81QLZ3D9cUwSITv4YS4SCIZlbpKmKNFNfmNd5qweNG
xHRQnD4jiM2Qk8GFx6CmXKWEooev9Z9vvjWtPSbuHSXVUd5WPGkJfLv6s9Oy3W6u
7/Z+KcPtMNx3mAhNy7ZwzttKLCPfLp8YhEP99sOFmrUhehjC2e5z59xcQmef5gfJ
cCTBUJkySLChF2bd8eHWilr8y7jow/pyldu2Ksxv2/o0l01xMqrQoIOXwCeEuEq0
uxpKMCR0wM9jBlA1C59zBfiL5Dacb+Dbc7jcRRAa49MuYclVMRoPmnAutUMiz38G
mk/gpc1BQJIez1rAoTyXiNsXiSeZnu/fR9tOq28pTfNXOt2CXsR6kM1AuuP2QyuP
QC0+UM5UsTE+QIibYklop3HfSCFIaV5LkDI/RIvPzrUjcYkJYgMnG3AoIlqkOl1s
mzcs20cH9PoZG3v5W4SkKJMib6qSx1qfa1YZ7GucYT1nUk04Plcb8tuYabPP4x6y
ow/vfikRjnzuMesJShifJUwplaZqP64RBXMvIfgdoOCXfeQ1tKCKz0yssPfgmSs6
K5mmnmtMvgB6k14luCD3E2zFHO6W+PHZQbSanEvhnlikPo86Dbk=
=n4fL
-----END PGP SIGNATURE-----
Merge tag 'v7.0-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto update from Herbert Xu:
"API:
- Fix race condition in hwrng core by using RCU
Algorithms:
- Allow authenc(sha224,rfc3686) in fips mode
- Add test vectors for authenc(hmac(sha384),cbc(aes))
- Add test vectors for authenc(hmac(sha224),cbc(aes))
- Add test vectors for authenc(hmac(md5),cbc(des3_ede))
- Add lz4 support in hisi_zip
- Only allow clear key use during self-test in s390/{phmac,paes}
Drivers:
- Set rng quality to 900 in airoha
- Add gcm(aes) support for AMD/Xilinx Versal device
- Allow tfms to share device in hisilicon/trng"
* tag 'v7.0-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (100 commits)
crypto: img-hash - Use unregister_ahashes in img_{un}register_algs
crypto: testmgr - Add test vectors for authenc(hmac(md5),cbc(des3_ede))
crypto: cesa - Simplify return statement in mv_cesa_dequeue_req_locked
crypto: testmgr - Add test vectors for authenc(hmac(sha224),cbc(aes))
crypto: testmgr - Add test vectors for authenc(hmac(sha384),cbc(aes))
hwrng: core - use RCU and work_struct to fix race condition
crypto: starfive - Fix memory leak in starfive_aes_aead_do_one_req()
crypto: xilinx - Fix inconsistant indentation
crypto: rng - Use unregister_rngs in register_rngs
crypto: atmel - Use unregister_{aeads,ahashes,skciphers}
hwrng: optee - simplify OP-TEE context match
crypto: ccp - Add sysfs attribute for boot integrity
dt-bindings: crypto: atmel,at91sam9g46-sha: add microchip,lan9691-sha
dt-bindings: crypto: atmel,at91sam9g46-aes: add microchip,lan9691-aes
dt-bindings: crypto: qcom,inline-crypto-engine: document the Milos ICE
crypto: caam - fix netdev memory leak in dpaa2_caam_probe
crypto: hisilicon/qm - increase wait time for mailbox
crypto: hisilicon/qm - obtain the mailbox configuration at one time
crypto: hisilicon/qm - remove unnecessary code in qm_mb_write()
crypto: hisilicon/qm - move the barrier before writing to the mailbox register
...
This reverts commit abdfd4948e.
The changelog in this commit explains why it is not easy to avoid ns == NULL
when the caller is exiting, but pid_vnr() is equally unsafe in this case.
However, commit 006568ab4c ("pid: Add a judgment for ns null in pid_nr_ns")
already added the ns != NULL check in pid_nr_ns(), so we can remove the same
check from __task_pid_nr_ns().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Link: https://patch.msgid.link/20251015123613.GA9456@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
This paves the way for scalable PID allocation later.
The 32 bit variant merely takes a spinlock for simplicity, the 64 bit
variant uses a scalable scheme.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20260120184539.1480930-1-mjguzik@gmail.com
Co-developed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Mateusz reported performance penalties [1] during task creation because
pidfs uses pidmap_lock to add elements into the rbtree. Switch to an
rhashtable to have separate fine-grained locking and to decouple from
pidmap_lock moving all heavy manipulations outside of it.
Convert the pidfs inode-to-pid mapping from an rb-tree with seqcount
protection to an rhashtable. This removes the global pidmap_lock
contention from pidfs_ino_get_pid() lookups and allows the hashtable
insert to happen outside the pidmap_lock.
pidfs_add_pid() is split. pidfs_prepare_pid() allocates inode number and
initializes pid fields and is called inside pidmap_lock. pidfs_add_pid()
inserts pid into rhashtable and is called outside pidmap_lock. Insertion
into the rhashtable can fail and memory allocation may happen so we need
to drop the spinlock.
To guard against accidently opening an already reaped task
pidfs_ino_get_pid() uses additional checks beyond pid_vnr(). If
pid->attr is PIDFS_PID_DEAD or NULL the pid either never had a pidfd or
it already went through pidfs_exit() aka the process as already reaped.
If pid->attr is valid check PIDFS_ATTR_BIT_EXIT to figure out whether
the task has exited.
This slightly changes visibility semantics: pidfd creation is denied
after pidfs_exit() runs, which is just before the pid number is removed
from the via free_pid(). That should not be an issue though.
Link: https://lore.kernel.org/20251206131955.780557-1-mjguzik@gmail.com [1]
Link: https://patch.msgid.link/20260120-work-pidfs-rhashtable-v2-1-d593c4d0f576@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
The get_sample() function in the hwlat tracer assumes the caller holds
hwlat_data.lock, but this is not actually happening. The result is
unprotected data access to hwlat_data, and in per-cpu mode can result in
false sharing which may show up as false positive latency events.
The specific case of false sharing observed was primarily between
hwlat_data.sample_width and hwlat_data.count. These are separated by
just 8B and are therefore likely to share a cache line. When one thread
modifies count, the cache line is in a modified state so when other
threads read sample_width in the main latency detection loop, they fetch
the modified cache line. On some systems, the fetch itself may be slow
enough to count as a latency event, which could set up a self
reinforcing cycle of latency events as each event increments count which
then causes more latency events, continuing the cycle.
The other result of the unprotected data access is that hwlat_data.count
can end up with duplicate or missed values, which was observed on some
systems in testing.
Convert hwlat_data.count to atomic64_t so it can be safely modified
without locking, and prevent false sharing by pulling sample_width into
a local variable.
One system this was tested on was a dual socket server with 32 CPUs on
each numa node. With settings of 1us threshold, 1000us width, and
2000us window, this change reduced the number of latency events from
500 per second down to approximately 1 event per minute. Some machines
tested did not exhibit measurable latency from the false sharing.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260210074810.6328-1-clord@mykolab.com
Signed-off-by: Colin Lord <clord@mykolab.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
affinity of unbound kthreads (node or custom cpumask) against
housekeeping (CPU isolation) constraints and CPU hotplug events.
One crucial missing piece is the handling of cpuset: when an isolated
partition is created, deleted, or its CPUs updated, all the unbound
kthreads in the top cpuset become indifferently affine to _all_ the
non-isolated CPUs, possibly breaking their preferred affinity along
the way.
Solve this with performing the kthreads affinity update from cpuset to
the kthreads consolidated relevant code instead so that preferred
affinities are honoured and applied against the updated cpuset isolated
partitions.
The dispatch of the new isolated cpumasks to timers, workqueues and
kthreads is performed by housekeeping, as per the nice Tejun's
suggestion.
As a welcome side effect, HK_TYPE_DOMAIN then integrates both the set
from boot defined domain isolation (through isolcpus=) and cpuset
isolated partitions. Housekeeping cpumasks are now modifiable with a
specific RCU based synchronization. A big step toward making nohz_full=
also mutable through cpuset in the future.
-----BEGIN PGP SIGNATURE-----
iQJPBAABCAA5FiEEd76+gtGM8MbftQlOhSRUR1COjHcFAmmE0mYbFIAAAAAABAAO
bWFudTIsMi41KzEuMTEsMiwyAAoJEIUkVEdQjox36eMP/0Ls/ArfYVi/MNAXWlpy
rAt6m9Y/X9GBcDM/VI9BXq1ZX4qEr2XjJ8UUb8cM08uHEAt0ErlmpRxREwJFrKbI
H4jzg5EwO0D0c6MnvgQJEAwkHxQVIjsxG9DovRIjxyW4ycx3aSsRg/f2VKyWoLvY
7ZT7CbLFE+I/MQh2ZgUu/9pnCDQVR2anss2WYIej5mmgFL5pyEv3YvYgKYVyK08z
sXyNxpP976g2d9ECJ9OtFJV9we6mlqxlG0MVCiv/Uxh7DBjxWWPsLvlmLAXggQ03
+0GW+nnutDaKz83pgS7Z4zum/+Oa+I1dTLIN27pARUNcMCYip7njM2KNpJwPdov3
+fAIODH2JVX1xewT+U1cCq6gdI55ejbwdQYGFV075dKBUxKQeIyrghvfC3Ga6aKQ
Gw3y68jdrXOw6iyfHR5k/0Mnu2/FDKUW2fZxLKm55PvNZP5jQFmSlz9wyiwwyb3m
UUSgThj6Ozodxks8hDX41rGVezCcm1ni+qNSiNIs8HPaaZQrwbnvKHQFBBJHQzJP
rJ39VWBx3Hq/ly71BOR6pCzoZsfS1f85YKhJ4vsfjLO6BfhI16nBat89eROSRKcz
XptyWqW0PgAD0teDuMCTPNuUym/viBHALXHKuSO12CIizacvftiGcmaQNPlLiiFZ
/Dr2+aOhwYw3UD6djn3u94M9
=nWGh
-----END PGP SIGNATURE-----
Merge tag 'kthread-for-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks
Pull kthread updates from Frederic Weisbecker:
"The kthread code provides an infrastructure which manages the
preferred affinity of unbound kthreads (node or custom cpumask)
against housekeeping (CPU isolation) constraints and CPU hotplug
events.
One crucial missing piece is the handling of cpuset: when an isolated
partition is created, deleted, or its CPUs updated, all the unbound
kthreads in the top cpuset become indifferently affine to _all_ the
non-isolated CPUs, possibly breaking their preferred affinity along
the way.
Solve this with performing the kthreads affinity update from cpuset to
the kthreads consolidated relevant code instead so that preferred
affinities are honoured and applied against the updated cpuset
isolated partitions.
The dispatch of the new isolated cpumasks to timers, workqueues and
kthreads is performed by housekeeping, as per the nice Tejun's
suggestion.
As a welcome side effect, HK_TYPE_DOMAIN then integrates both the set
from boot defined domain isolation (through isolcpus=) and cpuset
isolated partitions. Housekeeping cpumasks are now modifiable with a
specific RCU based synchronization. A big step toward making
nohz_full= also mutable through cpuset in the future"
* tag 'kthread-for-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks: (33 commits)
doc: Add housekeeping documentation
kthread: Document kthread_affine_preferred()
kthread: Comment on the purpose and placement of kthread_affine_node() call
kthread: Honour kthreads preferred affinity after cpuset changes
sched/arm64: Move fallback task cpumask to HK_TYPE_DOMAIN
sched: Switch the fallback task allowed cpumask to HK_TYPE_DOMAIN
kthread: Rely on HK_TYPE_DOMAIN for preferred affinity management
kthread: Include kthreadd to the managed affinity list
kthread: Include unbound kthreads in the managed affinity list
kthread: Refine naming of affinity related fields
PCI: Remove superfluous HK_TYPE_WQ check
sched/isolation: Remove HK_TYPE_TICK test from cpu_is_isolated()
cpuset: Remove cpuset_cpu_is_isolated()
timers/migration: Remove superfluous cpuset isolation test
cpuset: Propagate cpuset isolation update to timers through housekeeping
cpuset: Propagate cpuset isolation update to workqueue through housekeeping
PCI: Flush PCI probe workqueue on cpuset isolated partition change
sched/isolation: Flush vmstat workqueues on cpuset isolated partition change
sched/isolation: Flush memcg workqueues on cpuset isolated partition change
cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset
...
- Remove the unused omap-cpufreq driver (Andreas Kemnade)
- Optimize error handling code in cpufreq_boost_trigger_state() and
make cpufreq_boost_trigger_state() return -EOPNOTSUPP if no policy
supports boost (Lifeng Zheng)
- Update cpufreq-dt-platdev list for tegra, qcom, TI (Aaron Kling,
Dhruva Gole, and Konrad Dybcio)
- Minor improvements to the cpufreq and cpumask rust implementation
(Alexandre Courbot, Alice Ryhl, Tamir Duberstein, and Yilin Chen)
- Add support for AM62L3 SoC to the ti-cpufreq driver (Dhruva Gole)
- Update arch_freq_scale in the CPPC cpufreq driver's frequency
invariance engine (FIE) in scheduler ticks if the related CPPC
registers are not in PCC (Jie Zhan)
- Assorted minor cleanups and improvements in ARM cpufreq drivers (Juan
Martinez, Felix Gu, Luca Weiss, and Sergey Shtylyov)
- Add generic helpers for sysfs show/store to cppc_cpufreq (Sumit
Gupta)
- Make the scaling_setspeed cpufreq sysfs attribute return the actual
requested frequency to avoid confusion (Pengjie Zhang)
- Simplify the idle CPU time granularity test in the ondemand cpufreq
governor (Frederic Weisbecker)
- Enable asym capacity in intel_pstate only when CPU SMT is not
possible (Yaxiong Tian)
- Update the description of rate_limit_us default value in cpufreq
documentation (Yaxiong Tian)
- Add a command line option to adjust the C-states table in the
intel_idle driver, remove the 'preferred_cstates' module parameter
from it, add C-states validation to it and clean it up (Artem
Bityutskiy)
- Make the menu cpuidle governor always check the time till the closest
timer event when the scheduler tick has been stopped to prevent it
from mistakenly selecting the deepest available idle state (Rafael
Wysocki)
- Update the teo cpuidle governor to avoid making suboptimal decisions
in certain corner cases and generally improve idle state selection
accuracy (Rafael Wysocki)
- Remove an unlikely() annotation on the early-return condition in
menu_select() that leads to branch misprediction 100% of the time
on systems with only 1 idle state enabled, like ARM64 servers (Breno
Leitao)
- Add Christian Loehle to MAINTAINERS as a cpuidle reviewer (Christian
Loehle)
- Stop flagging the PM runtime workqueue as freezable to avoid system
suspend and resume deadlocks in subsystems that assume asynchronous
runtime PM to work during system-wide PM transitions (Rafael Wysocki)
- Drop redundant NULL pointer checks before acomp_request_free() from
the hibernation code handling image saving (Rafael Wysocki)
- Update wakeup_sources_walk_start() to handle empty lists of wakeup
sources as appropriate (Samuel Wu)
- Make dev_pm_clear_wake_irq() check the power.wakeirq value under
power.lock to avoid race conditions (Gui-Dong Han)
- Avoid bit field races related to power.work_in_progress in the core
device suspend code (Xuewen Yan)
- Make several drivers discard pm_runtime_put() return value in
preparation for converting that function to a void one (Rafael
Wysocki)
- Add PL4 support for Ice Lake to the Intel RAPL power capping
driver (Daniel Tang)
- Replace sprintf() with sysfs_emit() in power capping sysfs show
functions (Sumeet Pawnikar)
- Make dev_pm_opp_get_level() return value match the documentation
after a previous update of the latter (Aleks Todorov)
- Use scoped for each OF child loop in the OPP code (Krzysztof
Kozlowski)
- Fix a bug in an example code snippet and correct typos in the energy
model management documentation (Patrick Little)
- Fix miscellaneous problems in cpupower (Kaushlendra Kumar):
* idle_monitor: Fix incorrect value logged after stop
* Fix inverted APERF capability check
* Use strcspn() to strip trailing newline
* Reset errno before strtoull()
* Show C0 in idle-info dump
- Improve cpupower installation procedure by making the systemd step
optional and allowing users to disable the installation of systemd's
unit file (João Marcos Costa)
-----BEGIN PGP SIGNATURE-----
iQFGBAABCAAwFiEEcM8Aw/RY0dgsiRUR7l+9nS/U47UFAmmDr5ASHHJqd0Byand5
c29ja2kubmV0AAoJEO5fvZ0v1OO1Q8oH/0KRqdidHzesIQl6gd5WSS/sWdxODRUt
R9dEGQQ6LXCY0z05RAq29HZQf618fYuRFX4PSrtyCvrcRJK7MJKuzK55MRq0MC3c
c/2pL1PdpHexjLXUP9pcoxrYjetsr7SnD6Y0M3JfOPg1E/bG8sp1DlnE8cdqrL0W
lrdB2cEGewT2SVkNhCIQ2n6bwfQwmLlfQl1vXTM8BA7xCjoslePUJlRphAFVAt/J
5fQxSOH0eSxK5PYQFUDM2D2J3uMAN0pFb6eIjwVYYqjABqV//BPl99Rv2W3ElJq7
K/SICRWlvzyINCgF15QAUtQHWdINxSb0GzovECVxODHOv0N4mKHdpNU=
=QlVe
-----END PGP SIGNATURE-----
Merge tag 'pm-6.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"By the number of commits, cpufreq is the leading party (again) and the
most visible change there is the removal of the omap-cpufreq driver
that has not been used for a long time (good riddance). There are also
quite a few changes in the cppc_cpufreq driver, mostly related to
fixing its frequency invariance engine in the case when the CPPC
registers used by it are not in PCC. In addition to that, support for
AM62L3 is added to the ti-cpufreq driver and the cpufreq-dt-platdev
list is updated for some platforms. The remaining cpufreq changes are
assorted fixes and cleanups.
Next up is cpuidle and the changes there are dominated by intel_idle
driver updates, mostly related to the new command line facility
allowing users to adjust the list of C-states used by the driver.
There are also a few updates of cpuidle governors, including two menu
governor fixes and some refinements of the teo governor, and a
MAINTAINERS update adding Christian Loehle as a cpuidle reviewer.
[Thanks for stepping up Christian!]
The most significant update related to system suspend and hibernation
is the one to stop freezing the PM runtime workqueue during system PM
transitions which allows some deadlocks to be avoided. There is also a
fix for possible concurrent bit field updates in the core device
suspend code and a few other minor fixes.
Apart from the above, several drivers are updated to discard the
return value of pm_runtime_put() which is going to be converted to a
void function as soon as everybody stops using its return value, PL4
support for Ice Lake is added to the Intel RAPL power capping driver,
and there are assorted cleanups, documentation fixes, and some
cpupower utility improvements.
Specifics:
- Remove the unused omap-cpufreq driver (Andreas Kemnade)
- Optimize error handling code in cpufreq_boost_trigger_state() and
make cpufreq_boost_trigger_state() return -EOPNOTSUPP if no policy
supports boost (Lifeng Zheng)
- Update cpufreq-dt-platdev list for tegra, qcom, TI (Aaron Kling,
Dhruva Gole, and Konrad Dybcio)
- Minor improvements to the cpufreq and cpumask rust implementation
(Alexandre Courbot, Alice Ryhl, Tamir Duberstein, and Yilin Chen)
- Add support for AM62L3 SoC to the ti-cpufreq driver (Dhruva Gole)
- Update arch_freq_scale in the CPPC cpufreq driver's frequency
invariance engine (FIE) in scheduler ticks if the related CPPC
registers are not in PCC (Jie Zhan)
- Assorted minor cleanups and improvements in ARM cpufreq drivers
(Juan Martinez, Felix Gu, Luca Weiss, and Sergey Shtylyov)
- Add generic helpers for sysfs show/store to cppc_cpufreq (Sumit
Gupta)
- Make the scaling_setspeed cpufreq sysfs attribute return the actual
requested frequency to avoid confusion (Pengjie Zhang)
- Simplify the idle CPU time granularity test in the ondemand cpufreq
governor (Frederic Weisbecker)
- Enable asym capacity in intel_pstate only when CPU SMT is not
possible (Yaxiong Tian)
- Update the description of rate_limit_us default value in cpufreq
documentation (Yaxiong Tian)
- Add a command line option to adjust the C-states table in the
intel_idle driver, remove the 'preferred_cstates' module parameter
from it, add C-states validation to it and clean it up (Artem
Bityutskiy)
- Make the menu cpuidle governor always check the time till the
closest timer event when the scheduler tick has been stopped to
prevent it from mistakenly selecting the deepest available idle
state (Rafael Wysocki)
- Update the teo cpuidle governor to avoid making suboptimal
decisions in certain corner cases and generally improve idle state
selection accuracy (Rafael Wysocki)
- Remove an unlikely() annotation on the early-return condition in
menu_select() that leads to branch misprediction 100% of the time
on systems with only 1 idle state enabled, like ARM64 servers
(Breno Leitao)
- Add Christian Loehle to MAINTAINERS as a cpuidle reviewer
(Christian Loehle)
- Stop flagging the PM runtime workqueue as freezable to avoid system
suspend and resume deadlocks in subsystems that assume asynchronous
runtime PM to work during system-wide PM transitions (Rafael
Wysocki)
- Drop redundant NULL pointer checks before acomp_request_free() from
the hibernation code handling image saving (Rafael Wysocki)
- Update wakeup_sources_walk_start() to handle empty lists of wakeup
sources as appropriate (Samuel Wu)
- Make dev_pm_clear_wake_irq() check the power.wakeirq value under
power.lock to avoid race conditions (Gui-Dong Han)
- Avoid bit field races related to power.work_in_progress in the core
device suspend code (Xuewen Yan)
- Make several drivers discard pm_runtime_put() return value in
preparation for converting that function to a void one (Rafael
Wysocki)
- Add PL4 support for Ice Lake to the Intel RAPL power capping driver
(Daniel Tang)
- Replace sprintf() with sysfs_emit() in power capping sysfs show
functions (Sumeet Pawnikar)
- Make dev_pm_opp_get_level() return value match the documentation
after a previous update of the latter (Aleks Todorov)
- Use scoped for each OF child loop in the OPP code (Krzysztof
Kozlowski)
- Fix a bug in an example code snippet and correct typos in the
energy model management documentation (Patrick Little)
- Fix miscellaneous problems in cpupower (Kaushlendra Kumar):
* idle_monitor: Fix incorrect value logged after stop
* Fix inverted APERF capability check
* Use strcspn() to strip trailing newline
* Reset errno before strtoull()
* Show C0 in idle-info dump
- Improve cpupower installation procedure by making the systemd step
optional and allowing users to disable the installation of
systemd's unit file (João Marcos Costa)"
* tag 'pm-6.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (65 commits)
PM: sleep: core: Avoid bit field races related to work_in_progress
PM: sleep: wakeirq: harden dev_pm_clear_wake_irq() against races
cpufreq: Documentation: Update description of rate_limit_us default value
cpufreq: intel_pstate: Enable asym capacity only when CPU SMT is not possible
PM: wakeup: Handle empty list in wakeup_sources_walk_start()
PM: EM: Documentation: Fix bug in example code snippet
Documentation: Fix typos in energy model documentation
cpuidle: governors: teo: Refine intercepts-based idle state lookup
cpuidle: governors: teo: Adjust the classification of wakeup events
cpufreq: ondemand: Simplify idle cputime granularity test
cpufreq: userspace: make scaling_setspeed return the actual requested frequency
PM: hibernate: Drop NULL pointer checks before acomp_request_free()
cpufreq: CPPC: Add generic helpers for sysfs show/store
cpufreq: scmi: Fix device_node reference leak in scmi_cpu_domain_id()
cpufreq: ti-cpufreq: add support for AM62L3 SoC
cpufreq: dt-platdev: Add ti,am62l3 to blocklist
cpufreq/amd-pstate: Add comment explaining nominal_perf usage for performance policy
cpufreq: scmi: correct SCMI explanation
cpufreq: dt-platdev: Block the driver from probing on more QC platforms
rust: cpumask: rename methods of Cpumask for clarity and consistency
...
- Update the ACPICA code in the kernel to upstream version 20251212
which includes the following changes:
* Add support for new ACPI table DTPR (Michal Camacho Romero)
* Release objects with acpi_ut_delete_object_desc() (Zilin Guan)
* Add UUIDs for Microsoft fan extensions and UUIDs associated with
TPM 2.0 devices (Armin Wolf)
* Fix NULL pointer dereference in acpi_ev_address_space_dispatch()
(Alexey Simakov)
* Add KEYP ACPI table definition (Dave Jiang)
* Add support for the Microsoft display mux _OSI string (Armin Wolf)
* Add definitions for the IOVT ACPI table (Xianglai Li)
* Abort AML bytecode execution on AML_FATAL_OP (Armin Wolf)
* Include all fields in subtable type1 for PPTT (Ben Horgan)
* Add GICv5 MADT structures and Arm IORT IWB node definitions (Jose
Marinho)
* Update Parameter Block structure for RAS2 and add a new flag in
Memory Affinity Structure for SRAT (Pawel Chmielewski)
* Add _VDM (Voltage Domain) object (Pawel Chmielewski)
- Add support for GICv5 ACPI probing on ARM which is based on the
GICv5 MADT structures and ARM IORT IWB node definitions recently
added to ACPICA (Lorenzo Pieralisi)
- Rework ACPI PM notification setup for PCI root buses and modify the
ACPI PM setup for devices to register wakeup source objects under
physical (that is, PCI, platform, etc.) devices instead of doing that
under their ACPI companions (Rafael Wysocki)
- Adjust debug messages regarding postponed ACPI PM printed during
system resume to be more accurate (Rafael Wysocki)
- Remove dead code from lps0_device_attach() (Gergo Koteles)
- Start to invoke Microsoft Function 9 (Turn On Display) of the Low-
Power S0 Idle (LPS0) _DSM in the suspend-to-idle resume flow on
systems with ACPI LPS0 support to address a functional issue on
Lenovo Yoga Slim 7i Aura (15ILL9), where system fans and keyboard
backlights fail to resume after suspend (Jakob Riemenschneider)
- Add sysfs attribute cid for exposing _CID lists under ACPI device
objects (Rafael Wysocki)
- Replace sprintf() with sysfs_emit() in all of the core ACPI sysfs
interface code (Sumeet Pawnikar)
- Use acpi_get_local_u64_address() in the code implementing ACPI
support for PCI to evaluate _ADR instead of evaluating that object
directly (Andy Shevchenko)
- Add JWIPC JVC9100 to irq1_level_low_skip_override[] to unbreak
serial IRQs on that system (Ai Chao)
- Fix handling of _OSC errors in acpi_run_osc() to avoid failures on
systems where _OSC error bits are set even though the _OSC return
buffer contains acknowledged feature bits (Rafael Wysocki)
- Clean up and rearrange \_SB._OSC handling for general platform
features and USB4 features to avoid code duplication and unnecessary
memory management overhead (Rafael Wysocki)
- Make the ACPI core device enumeration code handle PNP0C01 and PNP0C02
("system resource") device objects directly instead of letting the
legacy PNP system driver handle them to avoid device enumeration
issues on systems where PNP0C02 is present in the _CID list under
ACPI device objects with a _HID matching a proper device driver in
Linux (Rafael Wysocki)
- Drop workarounds for the known device enumeration issues related to
_CID lists containing PNP0C02 (Rafael Wysocki)
- Drop outdated comment regarding removed function in the ACPI-based
device enumeration code (Julia Lawall)
- Make PRP0001 device matching work as expected for ACPI device objects
using it as a _HID for board development and similar purposes (Kartik
Rajput)
- Use async schedule function in acpi_scan_clear_dep_fn() to avoid
races with user space initialization on some systems (Yicong Yang)
- Add a piece of documentation explaining why binding drivers directly
to ACPI device objects is not a good idea in general and why it is
desirable to convert drivers doing so into proper platform drivers
that use struct platform_driver for device binding (Rafael Wysocki)
- Convert multiple "core ACPI" drivers, including the NFIT ACPI device
driver, the generic ACPI button drivers, the generic ACPI thermal
zone driver, the ACPI hardware event device (HED) driver, the ACPI EC
driver, the ACPI SMBUS HC driver, the ACPI Smart Battery Subsystem
(SBS) driver, and the ACPI backlight (video) driver to proper platform
drivers that use struct platform_driver for device binding (Rafael
Wysocki)
- Use acpi_get_local_u64_address() in the ACPI backlight (video) driver
to evaluate _ADR instead of evaluating that object directly (Andy
Shevchenko)
- Convert the generic ACPI battery driver to a proper platform driver
using struct platform_driver for device binding (Rafael Wysocki)
- Fix incorrect charging status when current is zero in the generic
ACPI battery driver (Ata İlhan Köktürk)
- Use LIST_HEAD() for initializing a stack-allocated list in the
generic ACPI watchdog device driver (Can Peng)
- Rework the ACPI idle driver initialization to register it directly
from the common initialization code instead of doing that from a
CPU hotplug "online" callback and clean it up (Huisong Li, Rafael
Wysocki)
- Fix a possible NULL pointer dereference in
acpi_processor_errata_piix4() (Tuo Li)
- Make read-only array non_mmio_desc[] static const (Colin Ian King)
- Prevent the APEI GHES support code on ARM from accessing memory out
of bounds or going past the ARM processor CPER record buffer (Mauro
Carvalho Chehab)
- Prevent cper_print_fw_err() from dumping the entire memory on systems
with defective firmware (Mauro Carvalho Chehab)
- Improve ghes_notify_nmi() status check to avoid unnecessary overhead
in the NMI handler by carrying out all of the requisite preparations
and the NMI registration time (Tony Luck)
- Refactor the GHES driver by extracting common functionality into
reusable helper functions to reduce code duplication and improve
the ghes_notify_sea() status check in analogy with the previous
ghes_notify_nmi() status check improvement (Shuai Xue)
- Make ELOG and GHES log and trace consistently and support the CPER
CXL protocol analogously (Fabio De Francesco)
- Disable KASAN instrumentation in the APEI GHES driver when compile
testing with clang < 18 (Nathan Chancellor)
- Let ghes_edac be the preferred driver to load on __ZX__ and _BYO_
systems by extending the platform detection list in the APEI GHES
driver (Tony W Wang-oc)
- Clean up cppc_perf_caps and cppc_perf_ctrls structs and rename EPP
constants for clarity in the ACPI CPPC library (Sumit Gupta)
-----BEGIN PGP SIGNATURE-----
iQFGBAABCAAwFiEEcM8Aw/RY0dgsiRUR7l+9nS/U47UFAmmErOoSHHJqd0Byand5
c29ja2kubmV0AAoJEO5fvZ0v1OO1wCoH/iWkTc5/kTTmzi5Urh9lsUsOL8Oc9hrx
KlH4XdLf9TUUSGPJtPRrrtXiOotNUSaZpg2ULktJAZHxDWXnA83DO0FjwT3+2WbH
/5YGVEK24Sae+DI38ukEvOI6bXrxxqNPLnFRtvV7tclUJD2OEAdKcUd8Y98s9qNf
64bf2gKoQtbZj0eM33KHiotULYZUNoZ2SWKFAWXzKQr7EOkcQGKcuupeStgQ04Id
uumWr+lIgW5vYPOcPFCufE3+LM0MzDvruxBlMJQEJJFVkY4IEFwwJNGW/lo8hgE3
74pFKqOPhYsIFPM8rtKXS+JmjP22iCI49QjnHIsQdDG03GEKvKFFeCw=
=CG0j
-----END PGP SIGNATURE-----
Merge tag 'acpi-6.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI updates from Rafael Wysocki:
"This one is significantly larger than previous ACPI support pull
requests because several significant updates have coincided in it.
First, there is a routine ACPICA code update, to upstream version
20251212, but this time it covers new ACPI 6.6 material that has not
been covered yet. Among other things, it includes definitions of a few
new ACPI tables and updates of some others, like the GICv5 MADT
structures and ARM IORT IWB node definitions that are used for adding
GICv5 ACPI probing on ARM (that technically is IRQ subsystem material,
but it depends on the ACPICA changes, so it is included here). The
latter alone adds a few hundred lines of new code.
Second, there is an update of ACPI _OSC handling including a fix that
prevents failures from occurring in some corner cases due to careless
handling of _OSC error bits.
On top of that, the "system resource" ACPI device objects with the
PNP0C01 and PNP0C02 are now going to be handled by the ACPI core
device enumeration code instead of handing them over to the legacy PNP
system driver which causes device enumeration issues to occur. Some of
those issues have been worked around in device drivers and elsewhere
and those workarounds should not be necessary any more, so they are
going away.
Moreover, the time has come to convert all "core ACPI" device drivers
that were still using struct acpi_driver objects for device binding
into proper platform drivers that use struct platform_driver for this
purpose. These updates are accompanied by some requisite core ACPI
device enumeration code changes.
Next, there are ACPI APEI updates, including changes to avoid excess
overhead in the NMI handler and in SEA on the ARM side, changes to
unify ACPI-based HW error tracing and logging, and changes to prevent
APEI code from reaching out of its allocated memory.
There are also some ACPI power management updates, mostly related to
the ACPI cpuidle support in the processor driver, suspend-to-idle
handling on systems with ACPI support and to ACPI PM of devices.
In addition to the above, bugs are fixed and the code is cleaned up in
assorted places all over.
Specifics:
- Update the ACPICA code in the kernel to upstream version 20251212
which includes the following changes:
* Add support for new ACPI table DTPR (Michal Camacho Romero)
* Release objects with acpi_ut_delete_object_desc() (Zilin Guan)
* Add UUIDs for Microsoft fan extensions and UUIDs associated with
TPM 2.0 devices (Armin Wolf)
* Fix NULL pointer dereference in acpi_ev_address_space_dispatch()
(Alexey Simakov)
* Add KEYP ACPI table definition (Dave Jiang)
* Add support for the Microsoft display mux _OSI string (Armin
Wolf)
* Add definitions for the IOVT ACPI table (Xianglai Li)
* Abort AML bytecode execution on AML_FATAL_OP (Armin Wolf)
* Include all fields in subtable type1 for PPTT (Ben Horgan)
* Add GICv5 MADT structures and Arm IORT IWB node definitions
(Jose Marinho)
* Update Parameter Block structure for RAS2 and add a new flag in
Memory Affinity Structure for SRAT (Pawel Chmielewski)
* Add _VDM (Voltage Domain) object (Pawel Chmielewski)
- Add support for GICv5 ACPI probing on ARM which is based on the
GICv5 MADT structures and ARM IORT IWB node definitions recently
added to ACPICA (Lorenzo Pieralisi)
- Rework ACPI PM notification setup for PCI root buses and modify the
ACPI PM setup for devices to register wakeup source objects under
physical (that is, PCI, platform, etc.) devices instead of doing
that under their ACPI companions (Rafael Wysocki)
- Adjust debug messages regarding postponed ACPI PM printed during
system resume to be more accurate (Rafael Wysocki)
- Remove dead code from lps0_device_attach() (Gergo Koteles)
- Start to invoke Microsoft Function 9 (Turn On Display) of the Low-
Power S0 Idle (LPS0) _DSM in the suspend-to-idle resume flow on
systems with ACPI LPS0 support to address a functional issue on
Lenovo Yoga Slim 7i Aura (15ILL9), where system fans and keyboard
backlights fail to resume after suspend (Jakob Riemenschneider)
- Add sysfs attribute cid for exposing _CID lists under ACPI device
objects (Rafael Wysocki)
- Replace sprintf() with sysfs_emit() in all of the core ACPI sysfs
interface code (Sumeet Pawnikar)
- Use acpi_get_local_u64_address() in the code implementing ACPI
support for PCI to evaluate _ADR instead of evaluating that object
directly (Andy Shevchenko)
- Add JWIPC JVC9100 to irq1_level_low_skip_override[] to unbreak
serial IRQs on that system (Ai Chao)
- Fix handling of _OSC errors in acpi_run_osc() to avoid failures on
systems where _OSC error bits are set even though the _OSC return
buffer contains acknowledged feature bits (Rafael Wysocki)
- Clean up and rearrange \_SB._OSC handling for general platform
features and USB4 features to avoid code duplication and
unnecessary memory management overhead (Rafael Wysocki)
- Make the ACPI core device enumeration code handle PNP0C01 and
PNP0C02 ("system resource") device objects directly instead of
letting the legacy PNP system driver handle them to avoid device
enumeration issues on systems where PNP0C02 is present in the _CID
list under ACPI device objects with a _HID matching a proper device
driver in Linux (Rafael Wysocki)
- Drop workarounds for the known device enumeration issues related to
_CID lists containing PNP0C02 (Rafael Wysocki)
- Drop outdated comment regarding removed function in the ACPI-based
device enumeration code (Julia Lawall)
- Make PRP0001 device matching work as expected for ACPI device
objects using it as a _HID for board development and similar
purposes (Kartik Rajput)
- Use async schedule function in acpi_scan_clear_dep_fn() to avoid
races with user space initialization on some systems (Yicong Yang)
- Add a piece of documentation explaining why binding drivers
directly to ACPI device objects is not a good idea in general and
why it is desirable to convert drivers doing so into proper
platform drivers that use struct platform_driver for device binding
(Rafael Wysocki)
- Convert multiple "core ACPI" drivers, including the NFIT ACPI
device driver, the generic ACPI button drivers, the generic ACPI
thermal zone driver, the ACPI hardware event device (HED) driver,
the ACPI EC driver, the ACPI SMBUS HC driver, the ACPI Smart
Battery Subsystem (SBS) driver, and the ACPI backlight (video)
driver to proper platform drivers that use struct platform_driver
for device binding (Rafael Wysocki)
- Use acpi_get_local_u64_address() in the ACPI backlight (video)
driver to evaluate _ADR instead of evaluating that object directly
(Andy Shevchenko)
- Convert the generic ACPI battery driver to a proper platform driver
using struct platform_driver for device binding (Rafael Wysocki)
- Fix incorrect charging status when current is zero in the generic
ACPI battery driver (Ata İlhan Köktürk)
- Use LIST_HEAD() for initializing a stack-allocated list in the
generic ACPI watchdog device driver (Can Peng)
- Rework the ACPI idle driver initialization to register it directly
from the common initialization code instead of doing that from a
CPU hotplug "online" callback and clean it up (Huisong Li, Rafael
Wysocki)
- Fix a possible NULL pointer dereference in
acpi_processor_errata_piix4() (Tuo Li)
- Make read-only array non_mmio_desc[] static const (Colin Ian King)
- Prevent the APEI GHES support code on ARM from accessing memory out
of bounds or going past the ARM processor CPER record buffer (Mauro
Carvalho Chehab)
- Prevent cper_print_fw_err() from dumping the entire memory on
systems with defective firmware (Mauro Carvalho Chehab)
- Improve ghes_notify_nmi() status check to avoid unnecessary
overhead in the NMI handler by carrying out all of the requisite
preparations and the NMI registration time (Tony Luck)
- Refactor the GHES driver by extracting common functionality into
reusable helper functions to reduce code duplication and improve
the ghes_notify_sea() status check in analogy with the previous
ghes_notify_nmi() status check improvement (Shuai Xue)
- Make ELOG and GHES log and trace consistently and support the CPER
CXL protocol analogously (Fabio De Francesco)
- Disable KASAN instrumentation in the APEI GHES driver when compile
testing with clang < 18 (Nathan Chancellor)
- Let ghes_edac be the preferred driver to load on __ZX__ and _BYO_
systems by extending the platform detection list in the APEI GHES
driver (Tony W Wang-oc)
- Clean up cppc_perf_caps and cppc_perf_ctrls structs and rename EPP
constants for clarity in the ACPI CPPC library (Sumit Gupta)"
* tag 'acpi-6.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (117 commits)
ACPI: battery: fix incorrect charging status when current is zero
ACPI: scan: Use async schedule function in acpi_scan_clear_dep_fn()
ACPI: x86: s2idle: Invoke Microsoft _DSM Function 9 (Turn On Display)
ACPI: APEI: GHES: Add ghes_edac support for __ZX__ and _BYO_ systems
ACPI: APEI: GHES: Disable KASAN instrumentation when compile testing with clang < 18
ACPI: sysfs: Replace sprintf() with sysfs_emit()
ACPI: CPPC: Rename EPP constants for clarity
ACPI: CPPC: Clean up cppc_perf_caps and cppc_perf_ctrls structs
ACPI: processor: idle: Rework the handling of acpi_processor_ffh_lpi_probe()
ACPI: processor: idle: Convert acpi_processor_setup_cpuidle_dev() to void
ACPI: processor: idle: Convert acpi_processor_setup_cpuidle_states() to void
irqchip/gic-v5: Add ACPI IWB probing
irqchip/gic-v5: Add ACPI ITS probing
irqchip/gic-v5: Add ACPI IRS probing
irqchip/gic-v5: Split IRS probing into OF and generic portions
PCI/MSI: Make the pci_msi_map_rid_ctlr_node() interface firmware agnostic
irqdomain: Add parent field to struct irqchip_fwid
ACPI: PCI: simplify code with acpi_get_local_u64_address()
ACPI: video: simplify code with acpi_get_local_u64_address()
ACPI: PM: Adjust messages regarding postponed ACPI PM
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmmGLwcQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpv+TD/48S2HTnMhmW6AtFYWErQ+sEKXpHrxbYe7S
+qR8/g/T+QSfhfqPwZEuagndFKtIP3LJfaXGSP1Lk1RfP9NLQy91v33Ibe4DjHkp
etWSfnMHA9MUAoWKmg8EvncB2G+ZQFiYCpjazj5tKHD9S2+psGMuL8kq6qzMJE83
uhpb8WutUl4aSIXbMSfyGlwBhI1MjjRbbWlIBmg4yC8BWt1sH8Qn2L2GNVylEIcX
U8At3KLgPGn0axSg4yGMAwTqtGhL/jwdDyeczbmRlXuAr4iVL9UX/yADCYkazt6U
ttQ2/H+cxCwfES84COx9EteAatlbZxo6wjGvZ3xOMiMJVTjYe1x6Gkcckq+LrZX6
tjofi2KK78qkrMXk1mZMkZjpyUWgRtCswhDllbQyqFs0SwzQtno2//Rk8HU9dhbt
pkpryDbGFki9X3upcNyEYp5TYflpW6YhAzShYgmE6KXim2fV8SeFLviy0erKOAl+
fwjTE6KQ5QoQv0s3WxkWa4lREm34O6IHrCUmbiPm5CruJnQDhqAN2QZIDgYC4WAf
0gu9cR/O4Vxu7TQXrumPs5q+gCyDU0u0B8C3mG2s+rIo+PI5cVZKs2OIZ8HiPo0F
x73kR/pX3DMe35ZQkQX22ymMuowV+aQouDLY9DTwakP5acdcg7h7GZKABk6VLB06
gUIsnxURiQ==
=jNzW
-----END PGP SIGNATURE-----
Merge tag 'for-7.0/block-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull block updates from Jens Axboe:
- Support for batch request processing for ublk, improving the
efficiency of the kernel/ublk server communication. This can yield
nice 7-12% performance improvements
- Support for integrity data for ublk
- Various other ublk improvements and additions, including a ton of
selftests additions and updated
- Move the handling of blk-crypto software fallback from below the
block layer to above it. This reduces the complexity of dealing with
bio splitting
- Series fixing a number of potential deadlocks in blk-mq related to
the queue usage counter and writeback throttling and rq-qos debugfs
handling
- Add an async_depth queue attribute, to resolve a performance
regression that's been around for a qhilw related to the scheduler
depth handling
- Only use task_work for IOPOLL completions on NVMe, if it is necessary
to do so. An earlier fix for an issue resulted in all these
completions being punted to task_work, to guarantee that completions
were only run for a given io_uring ring when it was local to that
ring. With the new changes, we can detect if it's necessary to use
task_work or not, and avoid it if possible.
- rnbd fixes:
- Fix refcount underflow in device unmap path
- Handle PREFLUSH and NOUNMAP flags properly in protocol
- Fix server-side bi_size for special IOs
- Zero response buffer before use
- Fix trace format for flags
- Add .release to rnbd_dev_ktype
- MD pull requests via Yu Kuai
- Fix raid5_run() to return error when log_init() fails
- Fix IO hang with degraded array with llbitmap
- Fix percpu_ref not resurrected on suspend timeout in llbitmap
- Fix GPF in write_page caused by resize race
- Fix NULL pointer dereference in process_metadata_update
- Fix hang when stopping arrays with metadata through dm-raid
- Fix any_working flag handling in raid10_sync_request
- Refactor sync/recovery code path, improve error handling for
badblocks, and remove unused recovery_disabled field
- Consolidate mddev boolean fields into mddev_flags
- Use mempool to allocate stripe_request_ctx and make sure
max_sectors is not less than io_opt in raid5
- Fix return value of mddev_trylock
- Fix memory leak in raid1_run()
- Add Li Nan as mdraid reviewer
- Move phys_vec definitions to the kernel types, mostly in preparation
for some VFIO and RDMA changes
- Improve the speed for secure erase for some devices
- Various little rust updates
- Various other minor fixes, improvements, and cleanups
* tag 'for-7.0/block-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (162 commits)
blk-mq: ABI/sysfs-block: fix docs build warnings
selftests: ublk: organize test directories by test ID
block: decouple secure erase size limit from discard size limit
block: remove redundant kill_bdev() call in set_blocksize()
blk-mq: add documentation for new queue attribute async_dpeth
block, bfq: convert to use request_queue->async_depth
mq-deadline: covert to use request_queue->async_depth
kyber: covert to use request_queue->async_depth
blk-mq: add a new queue sysfs attribute async_depth
blk-mq: factor out a helper blk_mq_limit_depth()
blk-mq-sched: unify elevators checking for async requests
block: convert nr_requests to unsigned int
block: don't use strcpy to copy blockdev name
blk-mq-debugfs: warn about possible deadlock
blk-mq-debugfs: add missing debugfs_mutex in blk_mq_debugfs_register_hctxs()
blk-mq-debugfs: remove blk_mq_debugfs_unregister_rqos()
blk-mq-debugfs: make blk_mq_debugfs_register_rqos() static
blk-rq-qos: fix possible debugfs_mutex deadlock
blk-mq-debugfs: factor out a helper to register debugfs for all rq_qos
blk-wbt: fix possible deadlock to nest pcpu_alloc_mutex under q_usage_counter
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmmGJ1kQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpky8EAChIL3uJ5Vmv+oQTxT4EVb1wpc8U/XzXWU5
Q5F9IpZZCGO7+i015Y7iTTqDRixjblRaWpWzZZP8vflWDUS8LESNZLQdcoEnxaiv
P367KNPUGwxejcKsu8PvZvfnX6JWSQoNstcDmrwkCF0ND2UUfvvMZyn3uKhkbBRY
h5Ehcqkvqc1OJDAWC7+yPzYAmB01uRPQ6sc9/GeujznHPlfbvie4u6gBvvfXeirT
592zbVftINMrm6Twd6zl4n+HNAn+CUoyVMppeeddv5IcyFPm9uz/dLOZBXTz6552
jFYNmB0U4g+SxGXMyqp37YISTALnuY+57y5eXmEAtgkEeE3HrF+F/ZdxQHwXSpo3
T2Lb9IOqFyHtSvq678HZ37JB6aIYbBE/mZdNf8FFFpnPJGb5Ey7d50qPp/ywVq0H
p9CahbpkzGUBMsZ+koew0YHiFdWV9tww+/Bnk5dTtn2197uyaHsLdmbf4C36GWke
Bk5cwNgU+3DMFAfTiL9m+AIXYsJkBayRJn+hViTrF5AL7gcGiBryGF43FOSKoYuq
f0mniDnGSwvn86VZPuZQ6wBRHZPEMR3OlaUXn6XrUU6cYyvMg0pBZV+QHF7zlsSP
2sdfUbPL5TxexF3G8dsxlDIypz9Z6TCoUCfU0WiiUETnCrVNkXfIY846A+w08p0b
ejBjzrwRtQ==
=CqJq
-----END PGP SIGNATURE-----
Merge tag 'io_uring-bpf-restrictions.4-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull io_uring bpf filters from Jens Axboe:
"This adds support for both cBPF filters for io_uring, as well as task
inherited restrictions and filters.
seccomp and io_uring don't play along nicely, as most of the
interesting data to filter on resides somewhat out-of-band, in the
submission queue ring.
As a result, things like containers and systemd that apply seccomp
filters, can't filter io_uring operations.
That leaves them with just one choice if filtering is critical -
filter the actual io_uring_setup(2) system call to simply disallow
io_uring. That's rather unfortunate, and has limited us because of it.
io_uring already has some filtering support. It requires the ring to
be setup in a disabled state, and then a filter set can be applied.
This filter set is completely bi-modal - an opcode is either enabled
or it's not. Once a filter set is registered, the ring can be enabled.
This is very restrictive, and it's not useful at all to systemd or
containers which really want both broader and more specific control.
This first adds support for cBPF filters for opcodes, which enables
tighter control over what exactly a specific opcode may do. As
examples, specific support is added for IORING_OP_OPENAT/OPENAT2,
allowing filtering on resolve flags. And another example is added for
IORING_OP_SOCKET, allowing filtering on domain/type/protocol. These
are both common use cases. cBPF was chosen rather than eBPF, because
the latter is often restricted in containers as well.
These filters are run post the init phase of the request, which allows
filters to even dip into data that is being passed in struct in user
memory, as the init side of requests make that data stable by bringing
it into the kernel. This allows filtering without needing to copy this
data twice, or have filters etc know about the exact layout of the
user data. The filters get the already copied and sanitized data
passed.
On top of that support is added for per-task filters, meaning that any
ring created with a task that has a per-task filter will get those
filters applied when it's created. These filters are inherited across
fork as well. Once a filter has been registered, any further added
filters may only further restrict what operations are permitted.
Filters cannot change the return value of an operation, they can only
permit or deny it based on the contents"
* tag 'io_uring-bpf-restrictions.4-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
io_uring: allow registration of per-task restrictions
io_uring: add task fork hook
io_uring/bpf_filter: add ref counts to struct io_bpf_filter
io_uring/bpf_filter: cache lookup table in ctx->bpf_filters
io_uring/bpf_filter: allow filtering on contents of struct open_how
io_uring/net: allow filtering on IORING_OP_SOCKET data
io_uring: add support for BPF filtering for opcode restrictions
[mostly] sanitize struct filename hanling
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCaYlcJgAKCRBZ7Krx/gZQ
6xlKAP9c9J13sJ/mcobsj1Ov7nSHISNbnYqvRRCu09Wq3UQvJgEApNQYOEdLtpff
zUnWOAQ0nOKY7w9VMLkRRustXpuGjAc=
=Fld4
-----END PGP SIGNATURE-----
Merge tag 'pull-filename' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs 'struct filename' updates from Al Viro:
"[Mostly] sanitize struct filename handling"
* tag 'pull-filename' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (68 commits)
sysfs(2): fs_index() argument is _not_ a pathname
alpha: switch osf_mount() to strndup_user()
ksmbd: use CLASS(filename_kernel)
mqueue: switch to CLASS(filename)
user_statfs(): switch to CLASS(filename)
statx: switch to CLASS(filename_maybe_null)
quotactl_block(): switch to CLASS(filename)
chroot(2): switch to CLASS(filename)
move_mount(2): switch to CLASS(filename_maybe_null)
namei.c: switch user pathname imports to CLASS(filename{,_flags})
namei.c: convert getname_kernel() callers to CLASS(filename_kernel)
do_f{chmod,chown,access}at(): use CLASS(filename_uflags)
do_readlinkat(): switch to CLASS(filename_flags)
do_sys_truncate(): switch to CLASS(filename)
do_utimes_path(): switch to CLASS(filename_uflags)
chdir(2): unspaghettify a bit...
do_fchownat(): unspaghettify a bit...
fspick(2): use CLASS(filename_flags)
name_to_handle_at(): use CLASS(filename_uflags)
vfs_open_tree(): use CLASS(filename_uflags)
...
The tracing_max_latency shouldn't be limited if CONFIG_FSNOTIFY is defined
or not and it was moved out of that protection to be always available with
CONFIG_TRACER_MAX_TRACE. All was moved out except the dentry descriptor
for it (d_max_latency) and it failed to build on some configs.
Move that out of the CONFIG_FSNOTIFY protection too.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260209194631.788bfc85@fedora
Fixes: ba73713da5 ("tracing: Clean up use of trace_create_maxlat_file()")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202602092133.fTdojd95-lkp@intel.com/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Please consider pulling these changes from the signed vfs-7.0-rc1.misc tag.
Thanks!
Christian
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaYX49QAKCRCRxhvAZXjc
ojrZAQD1VJzY46r5FnAVf4jlEHyjIbDnZCP/n+c4x6XnqpU6EQEAgB0yAtAGP6+u
SBuytElqHoTT5VtmEXTAabCNQ9Ks8wo=
=JwZz
-----END PGP SIGNATURE-----
Merge tag 'vfs-7.0-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"This contains a mix of VFS cleanups, performance improvements, API
fixes, documentation, and a deprecation notice.
Scalability and performance:
- Rework pid allocation to only take pidmap_lock once instead of
twice during alloc_pid(), improving thread creation/teardown
throughput by 10-16% depending on false-sharing luck. Pad the
namespace refcount to reduce false-sharing
- Track file lock presence via a flag in ->i_opflags instead of
reading ->i_flctx, avoiding false-sharing with ->i_readcount on
open/close hot paths. Measured 4-16% improvement on 24-core
open-in-a-loop benchmarks
- Use a consume fence in locks_inode_context() to match the
store-release/load-consume idiom, eliminating a hardware fence on
some architectures
- Annotate cdev_lock with __cacheline_aligned_in_smp to prevent
false-sharing
- Remove a redundant DCACHE_MANAGED_DENTRY check in
__follow_mount_rcu() that never fires since the caller already
verifies it, eliminating a 100% mispredicted branch
- Fix a 100% mispredicted likely() in devcgroup_inode_permission()
that became wrong after a prior code reorder
Bug fixes and correctness:
- Make insert_inode_locked() wait for inode destruction instead of
skipping, fixing a corner case where two matching inodes could
exist in the hash
- Move f_mode initialization before file_ref_init() in alloc_file()
to respect the SLAB_TYPESAFE_BY_RCU ordering contract
- Add a WARN_ON_ONCE guard in try_to_free_buffers() for folios with
no buffers attached, preventing a null pointer dereference when
AS_RELEASE_ALWAYS is set but no release_folio op exists
- Fix select restart_block to store end_time as timespec64, avoiding
truncation of tv_sec on 32-bit architectures
- Make dump_inode() use get_kernel_nofault() to safely access inode
and superblock fields, matching the dump_mapping() pattern
API modernization:
- Make posix_acl_to_xattr() allocate the buffer internally since
every single caller was doing it anyway. Reduces boilerplate and
unnecessary error checking across ~15 filesystems
- Replace deprecated simple_strtoul() with kstrtoul() for the
ihash_entries, dhash_entries, mhash_entries, and mphash_entries
boot parameters, adding proper error handling
- Convert chardev code to use guard(mutex) and __free(kfree) cleanup
patterns
- Replace min_t() with min() or umin() in VFS code to avoid silently
truncating unsigned long to unsigned int
- Gate LOOKUP_RCU assertions behind CONFIG_DEBUG_VFS since callers
already check the flag
Deprecation:
- Begin deprecating legacy BSD process accounting (acct(2)). The
interface has numerous footguns and better alternatives exist
(eBPF)
Documentation:
- Fix and complete kernel-doc for struct export_operations, removing
duplicated documentation between ReST and source
- Fix kernel-doc warnings for __start_dirop() and ilookup5_nowait()
Testing:
- Add a kunit test for initramfs cpio handling of entries with
filesize > PATH_MAX
Misc:
- Add missing <linux/init_task.h> include in fs_struct.c"
* tag 'vfs-7.0-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (28 commits)
posix_acl: make posix_acl_to_xattr() alloc the buffer
fs: make insert_inode_locked() wait for inode destruction
initramfs_test: kunit test for cpio.filesize > PATH_MAX
fs: improve dump_inode() to safely access inode fields
fs: add <linux/init_task.h> for 'init_fs'
docs: exportfs: Use source code struct documentation
fs: move initializing f_mode before file_ref_init()
exportfs: Complete kernel-doc for struct export_operations
exportfs: Mark struct export_operations functions at kernel-doc
exportfs: Fix kernel-doc output for get_name()
acct(2): begin the deprecation of legacy BSD process accounting
device_cgroup: remove branch hint after code refactor
VFS: fix __start_dirop() kernel-doc warnings
fs: Describe @isnew parameter in ilookup5_nowait()
fs/namei: Remove redundant DCACHE_MANAGED_DENTRY check in __follow_mount_rcu
fs: only assert on LOOKUP_RCU when built with CONFIG_DEBUG_VFS
select: store end_time as timespec64 in restart block
chardev: Switch to guard(mutex) and __free(kfree)
namespace: Replace simple_strtoul with kstrtoul to parse boot params
dcache: Replace simple_strtoul with kstrtoul in set_dhash_entries
...
-----BEGIN PGP SIGNATURE-----
iQJIBAABCgAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmmCurkUHHBhdWxAcGF1
bC1tb29yZS5jb20ACgkQ6iDy2pc3iXNDWA//RZxjjyY1I0GRDepJXJ8UFEVt4Fdr
VsnSKL3o7sf0SAsQj2HCJsJPiwD5fHm2C2gdxh9rFC0bPpMbTVAkwUL7WhP+nkAt
LA+UZKYurrk1XF6OctILoY3JcXmynb1Oe3lg6uVcWX5b1uEriqRgGKNcMYLb5fmr
D1vZ9LMuZe8WwGTScprQID9FMrZ0TDbdI/vqG7si1W/PCFH7630MPJkmzmjPWvnV
xJISKLOG+qbyWoNGLr+VaNjkmA+jPfsXAKWbfNXUGfikP8g/OHpFd70nIzJs8p7J
dxZD7w6/kqSGhauQjcX8ov0zKxn83Z2Xt0+4Ldl5vOCWI3r4T3Y8WdarmULbq65n
jIN8djDgmCJPqa5zuPmik+womaPk2GmSy1viEJdT4W0iHggTC1snOz1J+BbD+nkh
uEZkmcCZbaeEQmfefxIyHDirrFsJvrunWupGrkfxvfFr+QU8H1xNLfMd6CQzvtI4
P5p/KrnP2e58tJqvPxSY315ewUMy73kZU5DUl+Rq6Y4ai415R7vtwwEEkSKWnyja
LMdEumc9IrsiBMcLmsj8QwobCr7XJtdCQV5ohR8CPxxcsI/G0pR99e1pckD7l7Qm
OG461BKHntU3SFWSiZw+rNWlJuyPcSy5nmUxQvxQHP9pShZPu8rTfYX+CBzrHJk2
OFjAwNJn1N/NfYI=
=cCyp
-----END PGP SIGNATURE-----
Merge tag 'lsm-pr-20260203' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm
Pull lsm updates from Paul Moore:
- Unify the security_inode_listsecurity() calls in NFSv4
While looking at security_inode_listsecurity() with an eye towards
improving the interface, we realized that the NFSv4 code was making
multiple calls to the LSM hook that could be consolidated into one.
- Mark the LSM static branch keys as static - this helps resolve some
sparse warnings
- Add __rust_helper annotations to the LSM and cred wrapper functions
- Remove the unsused set_security_override_from_ctx() function
- Minor fixes to some of the LSM kdoc comment blocks
* tag 'lsm-pr-20260203' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm:
lsm: make keys for static branch static
cred: remove unused set_security_override_from_ctx()
rust: security: add __rust_helper to helpers
rust: cred: add __rust_helper to helpers
nfs: unify security_inode_listsecurity() calls
lsm: fix kernel-doc struct member names
-----BEGIN PGP SIGNATURE-----
iQJIBAABCgAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmmCuoQUHHBhdWxAcGF1
bC1tb29yZS5jb20ACgkQ6iDy2pc3iXMRFhAAntv4vmqRciFI4oEqxi5X8wmYmzc9
BUQV2XXcfO63IOHdGrXmYHByx3+mZZddAPpYMTqrzA0p2NCqi4svCVwspUHwUcTY
btl+xlppBJpBtUL5pmLiP6Q4u+zURYCwuA/OKfxuKa5Frm8D3kbkd5MpxJS15Mev
qqEhLT0aj6/rjQpYVwOFGMwehKfE7iuyc8XTBaetvUKHW38sj18ANSpLnN5bmiuE
3lz252kCjyDoOsu+vO0Saa8Rv8lVDjlSMn6mYr4L2fVygYwFDg2Gj7+bmB6LGYy9
YyIm6P+b23E8GOltEObpvrz8ItPR7nvKNiDMEeP1eqGzQ/Mc5OqqljVaNMNPmP+s
XN/jZt02XePKXlje+C08620mDVeIYp35TK1bY2/HrYMqySE0wwO1iSyBI4ftPFtu
CteM8XA8oH49pspFWbEKCHmtFFGxDVjfVM7YrHeDc+qw2tJjZ7R1GRk5hadP1Ou7
emxGLb6jfejT6NMNU8rM2RVmQNs1jcFh+8lHvDgqqQmaCXJd3AEgr+Om5w9kZ6fJ
FyEkh0f9HuZdcEn8tqWaIwCAZXTzECOThj6hhxZGiG9xXFXza1eIXxq7VtIE6fdO
ATAJ6cpcj0LIQmE7QWntS1NloPkWD1OSiWLNis0AgCiN97oRTk0oFJ5q7fD0bLUa
HGAQqd4OJKmNciw=
=pFeD
-----END PGP SIGNATURE-----
Merge tag 'audit-pr-20260203' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit
Pull audit updates from Paul Moore:
- Improve the NETFILTER_PKT audit records
Add source and destination ports to the NETFILTER_PKT audit records
while also consolidating a lot of the code into a new, singular
audit_log_nf_skb() function. This new approach to structuring the
NETFILTER_PKT record generation should eliminate some unnecessary
overhead when audit is not built into the kernel.
- Update the audit syscall classifier code
Add the listxattrat(), getxattrat(), and fchmodat2() syscall to the
audit code which classifies syscalls into categories of operations,
e.g. "read" or "change attributes".
- Move the syscall classifier declarations into audit_arch.h
Shuffle around some header file declarations to resolve some sparse
warnings.
* tag 'audit-pr-20260203' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
audit: move the compat_xxx_class[] extern declarations to audit_arch.h
audit: add missing syscalls to read class
audit: include source and destination ports to NETFILTER_PKT
audit: add audit_log_nf_skb helper function
audit: add fchmodat2() to change attributes class
RCU Tasks Trace:
Re-implement RCU tasks trace in term of SRCU-fast, not only more than 500 lines
of code are saved because of the reimplementation, a new set of API,
rcu_read_{,un}lock_tasks_trace(), becomes possible as well. Compared to the
previous rcu_read_{,un}lock_trace(), the new API avoid the task_struct accesses
thanks to the SRCU-fast semantics. As a result, the old
rcu_read{,un}lock_trace() API is now deprecated.
RCU Torture Test:
- Multiple improvements on kvm-series.sh (parallel run and progress showing
metrics)
- Add context checks to rcu_torture_timer().
- Make config2csv.sh properly handle comments in .boot files.
- Include commit discription in testid.txt.
Miscellaneous RCU changes:
- Reduce synchronize_rcu() latency by reporting GP kthread's CPU QS early.
- Use suitable gfp_flags for the init_srcu_struct_nodes().
- Fix rcu_read_unlock() deadloop due to softirq.
- Correctly compute probability to invoke ->exp_current() in rcutorture.
- Make expedited RCU CPU stall warnings detect stall-end races.
RCU nocb:
- Remove unnecessary WakeOvfIsDeferred wake path and callback overload
handling.
- Extract nocb_defer_wakeup_cancel() helper.
-----BEGIN PGP SIGNATURE-----
iQFFBAABCAAvFiEEj5IosQTPz8XU1wRHSXnow7UH+rgFAmmARZERHGJvcXVuQGtl
cm5lbC5vcmcACgkQSXnow7UH+rh8SAf+PDIBWAkdbGgs32EfgpFY42RB4CWygH47
YRup/M3+nU0JcBzNnona1srpHBXRySBJQbvRbsOdlM45VoNQ2wPjig/3vFVRUKYx
uqj9Tze00DS74IIGESoTGp0amZde9SS9JakNRoEfTr+Zpj8N6LFERQw0ywUwjR5b
RR6bz7q05TAl3u2BYUAgNdnf3VWWTmj4WYwArlQ+qRFAyGN+TVj8Ezra6+K5TJ7u
SQYrf7WmRGOhHbVVolvVEOVdACccI8dFl3ebJVE2Ky0gp1o3BLPkcDLJ6gBdTCoE
rRrbnkeqs5V7tOkPFDBeUhLPrm1QxrdEDxUQFWjSApbv161sx7AOZA==
=O9QQ
-----END PGP SIGNATURE-----
Merge tag 'rcu.release.v7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux
Pull RCU updates from Boqun Feng:
- RCU Tasks Trace:
Re-implement RCU tasks trace in term of SRCU-fast, not only more than
500 lines of code are saved because of the reimplementation, a new
set of API, rcu_read_{,un}lock_tasks_trace(), becomes possible as
well. Compared to the previous rcu_read_{,un}lock_trace(), the new
API avoid the task_struct accesses thanks to the SRCU-fast semantics.
As a result, the old rcu_read{,un}lock_trace() API is now deprecated.
- RCU Torture Test:
- Multiple improvements on kvm-series.sh (parallel run and
progress showing metrics)
- Add context checks to rcu_torture_timer()
- Make config2csv.sh properly handle comments in .boot files
- Include commit discription in testid.txt
- Miscellaneous RCU changes:
- Reduce synchronize_rcu() latency by reporting GP kthread's
CPU QS early
- Use suitable gfp_flags for the init_srcu_struct_nodes()
- Fix rcu_read_unlock() deadloop due to softirq
- Correctly compute probability to invoke ->exp_current()
in rcutorture
- Make expedited RCU CPU stall warnings detect stall-end races
- RCU nocb:
- Remove unnecessary WakeOvfIsDeferred wake path and callback
overload handling
- Extract nocb_defer_wakeup_cancel() helper
* tag 'rcu.release.v7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux: (25 commits)
rcu/nocb: Extract nocb_defer_wakeup_cancel() helper
rcu/nocb: Remove dead callback overload handling
rcu/nocb: Remove unnecessary WakeOvfIsDeferred wake path
rcu: Reduce synchronize_rcu() latency by reporting GP kthread's CPU QS early
srcu: Use suitable gfp_flags for the init_srcu_struct_nodes()
rcu: Fix rcu_read_unlock() deadloop due to softirq
rcutorture: Correctly compute probability to invoke ->exp_current()
rcu: Make expedited RCU CPU stall warnings detect stall-end races
rcutorture: Add --kill-previous option to terminate previous kvm.sh runs
rcutorture: Prevent concurrent kvm.sh runs on same source tree
torture: Include commit discription in testid.txt
torture: Make config2csv.sh properly handle comments in .boot files
torture: Make kvm-series.sh give run numbers and totals
torture: Make kvm-series.sh give build numbers and totals
torture: Parallelize kvm-series.sh guest-OS execution
rcutorture: Add context checks to rcu_torture_timer()
rcutorture: Test rcu_tasks_trace_expedite_current()
srcu: Create an rcu_tasks_trace_expedite_current() function
checkpatch: Deprecate rcu_read_{,un}lock_trace()
rcu: Update Requirements.rst for RCU Tasks Trace
...
The latency tracers (scheduler, irqsoff, etc) were created when tracing
was first added. These tracers required a "snapshot" buffer that was the
same size as the ring buffer being written to. When a new max latency was
hit, the main ring buffer would swap with the snapshot buffer so that the
trace leading up to the latency would be saved in the snapshot buffer (The
snapshot buffer is never written to directly and the data within it can be
viewed without fear of being overwritten).
Later, a new feature was added to allow snapshots to be taken by user
space or even event triggers. This created a "snapshot" file that allowed
users to trigger a snapshot from user space to save the current trace.
The config for this new feature (CONFIG_TRACER_SNAPSHOT) would select the
latency tracer config (CONFIG_TRACER_MAX_LATENCY) as it would need all the
functionality from it as it already existed. But this was incorrect. As
the snapshot feature is really what the latency tracers need and not the
other way around.
Have CONFIG_TRACER_MAX_TRACE select CONFIG_TRACER_SNAPSHOT where the
tracers that needs the max latency buffer selects the TRACE_MAX_TRACE
which will then select TRACER_SNAPSHOT.
Also, go through trace.c and trace.h and make the code that only needs the
TRACER_MAX_TRACE protected by that and the code that always requires the
snapshot to be protected by TRACER_SNAPSHOT.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20260208183856.767870992@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Instead of having #ifdef CONFIG_TRACER_MAX_TRACE around every access to
the struct tracer's use_max_tr field, add a helper function for that
access and if CONFIG_TRACER_MAX_TRACE is not configured it just returns
false.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20260208183856.599390238@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
When tracing was first added, there were latency tracers that would take a
snapshot of the current trace when a new max latency was hit. This
snapshot buffer was called "max_buffer". Since then, a snapshot feature
was added that allowed user space or event triggers to trigger a snapshot
of the current buffer using the same max_buffer of the trace_array.
As this snapshot buffer now has a more generic use case, calling it
"max_buffer" is confusing. Rename it to snapshot_buffer.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20260208183856.428446729@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The trace.c file was a dumping ground for most tracing code. Start
organizing it better by moving various functions out into their own files.
Move the PID filtering functions from trace.c into its own trace_pid.c
file.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20260208032450.998330662@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The file trace.c has become a catchall for most things tracing. Start
making it smaller by breaking out various aspects into their own files.
Move the functions associated to the trace_printk operations out of trace.c and
into trace_printk.c.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20260208032450.828744197@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The function trace_printk_init_buffers() is used to expand tha
trace_printk buffers when trace_printk() is used within the kernel or in
modules. On kernel boot up, it holds off from starting the sched switch
cmdline recorder, but will start it immediately when it is added by a
module.
Currently it uses a trick to see if the global_trace buffer has been
allocated or not to know if it was called by module load or not. But this
is more of a hack, and can not be used when this code is moved out of
trace.c. Instead simply look at the system_state and if it is running then
it is know that it could only be called by module load.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20260208032450.660237094@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The trace.c file has become a dumping ground for all tracing code and has
become quite large. In order to move the trace_printk functions out of it
these functions can not access global_trace directly, as that is something
that needs to stay static in trace.c.
Instead of testing the trace_array tr pointer to &global_trace, test the
tr->flags to see if TRACE_ARRAY_FL_GLOBAL set.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20260208032450.491116245@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The trace.c file has become a dumping ground for all tracing code and has
become quite large. In order to move the trace_printk functions out of it
these functions can not access global_trace directly, as that is something
that needs to stay static in trace.c.
Have tracing_update_buffers() take NULL for its trace_array to denote it
should work on the global_trace top level trace_array allows that function
to be used outside of trace.c and still update the global_trace
trace_array.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20260208032450.318864210@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The printk_trace is used to determine which trace_array trace_printk()
writes to. By making it a global variable among the tracing subsystem it
will allow the trace_printk functions to be moved out of trace.c and still
have direct access to that variable.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20260208032450.144525891@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The file trace.c has become a catchall for most things tracing. Start
making it smaller by breaking out various aspects into their own files.
Make ftrace_trace_stack() into a static inline that tests if stack tracing
is enabled and if so to call __ftrace_trace_stack() to do the stack trace.
This keeps the test inlined in the fast paths and only does the function
call if stack tracing is enabled.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20260208032449.974218132@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The file trace.c has become a catchall for most things tracing. Start
making it smaller by breaking out various aspects into their own files.
Move the __always_inline functions __trace_buffer_lock_reserve(),
__trace_buffer_unlock_commit() and trace_event_setup() into trace.h.
The trace.c file will be split up and these functions will be used in more
than one of these files. As they are already __always_inline they can
easily be moved into the trace.h header file.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20260208032449.813550600@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The file trace.c has become a catchall for most things tracing. Start
making it smaller by breaking out various aspects into their own files.
Make the variable tracing_selftest_running global so that it can be used
by other files in the tracing subsystem and trace.c can be split up.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20260208032449.648932796@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The tracing_disabled variable is set to one on boot up to prevent some
parts of tracing to access the tracing infrastructure before it is set up.
It also can be set after boot if an anomaly is discovered.
It is currently a static variable in trace.c and can be accessed via a
function call trace_is_disabled(). There's really no reason to use a
function call as the tracing subsystem should be able to access it
directly.
By making the variable accessed directly, code can be moved out of trace.c
without adding overhead of a function call to see if tracing is disabled
or not.
Make tracing_disabled global and remove the tracing_is_disabled() helper
function. Also add some "unlikely()"s around tracing_disabled where it's
checked in hot paths.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20260208032449.483690153@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
In trace.c, the function trace_create_maxlat_file() is defined behind the
#ifdef CONFIG_TRACER_MAX_TRACE block. The #else part defines it as:
#define trace_create_maxlat_file(tr, d_tracer) \
trace_create_file("tracing_max_latency", TRACE_MODE_WRITE, \
d_tracer, tr, &tracing_max_lat_fops)
But the one place that it it used has:
#ifdef CONFIG_TRACER_MAX_TRACE
trace_create_maxlat_file(tr, d_tracer);
#endif
Which is pointless and also wrong!
It only gets created when both CONFIG_TRACE_MAX_TRACE and CONFIG_FS_NOTIFY
is defined, but the file itself should not be dependent on
CONFIG_FS_NOTIFY. Always create that file when TRACE_MAX_TRACE is defined
regardless if FS_NOTIFY is or is not.
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://patch.msgid.link/20260207191101.0e014abd@robin
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The function tracing_set_filter_buffering() is only used in
trace_events_hist.c. Move it to that file and make it static.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20260206195936.617080218@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
When the triggers were first created, they may not have had a file
parameter passed to them and things needed to be done generically.
But today, all triggers have a file parameter passed to them. Remove the
generic code and add a "if (WARN_ON_ONCE(!file))" to each trigger.
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Link: https://patch.msgid.link/20260206101351.609d8906@gandalf.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Simplify the hardlockup detector's probe path and remove its implicit
dependency on pinned per-cpu execution.
Refactor hardlockup_detector_event_create() to be stateless. Return the
created perf_event pointer to the caller instead of directly modifying the
per-cpu 'watchdog_ev' variable. This allows the probe path to safely
manage a temporary event without the risk of leaving stale pointers should
task migration occur.
Link: https://lkml.kernel.org/r/20260129022629.2201331-1-realwujing@gmail.com
Signed-off-by: Shouxin Sun <sunshx@chinatelecom.cn>
Signed-off-by: Junnan Zhang <zhangjn11@chinatelecom.cn>
Signed-off-by: Qiliang Yuan <yuanql9@chinatelecom.cn>
Signed-off-by: Qiliang Yuan <realwujing@gmail.com>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Cc: Jinchao Wang <wangjinchao600@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Huafei <lihuafei1@huawei.com>
Cc: Song Liu <song@kernel.org>
Cc: Thorsten Blum <thorsten.blum@linux.dev>
Cc: Wang Jinchao <wangjinchao600@gmail.com>
Cc: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>