Commit Graph

49442 Commits (09cfd3c52ea76f43b3cb15e570aeddf633d65e80)

Author SHA1 Message Date
KaFai Wan df0cb5cb50 bpf: Allow fall back to interpreter for programs with stack size <= 512
OpenWRT users reported a regression on ARMv6 devices after updating to the
latest HEAD, where the tcpdump filter:

tcpdump "not ether host 3c37121a2b3c and not ether host 184ecbca2a3a \
and not ether host 14130b4d3f47 and not ether host f0f61cf440b7 \
and not ether host a84b4dedf471 and not ether host d022be17e1d7 \
and not ether host 5c497967208b and not ether host 706655784d5b"

fails with warning: "Kernel filter failed: No error information"
when using config:
 # CONFIG_BPF_JIT_ALWAYS_ON is not set
 CONFIG_BPF_JIT_DEFAULT_ON=y

The issue arises because commits:
1. "bpf: Fix array bounds error with may_goto" changed default runtime to
   __bpf_prog_ret0_warn when jit_requested = 1
2. "bpf: Avoid __bpf_prog_ret0_warn when jit fails" returns error when
   jit_requested = 1 but jit fails

This change restores interpreter fallback capability for BPF programs with
stack size <= 512 bytes when jit fails.
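
A minimal sketch of the resulting selection logic (simplified and
hypothetical; bpf_select_runtime_sketch and the interpreter assignment are
illustrative, not the exact kernel code):

  /* Sketch: pick a runtime after the JIT attempt. */
  static void bpf_select_runtime_sketch(struct bpf_prog *prog)
  {
          prog = bpf_int_jit_compile(prog);       /* try the JIT first */
          if (!prog->jited && prog->aux->stack_depth <= MAX_BPF_STACK) {
                  /* JIT failed, but the stack fits the interpreter's
                   * 512-byte limit (MAX_BPF_STACK): fall back to the
                   * interpreter instead of failing the program load. */
                  prog->bpf_func = bpf_interpreter_func; /* illustrative */
          }
  }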

Reported-by: Felix Fietkau <nbd@nbd.name>
Closes: https://lore.kernel.org/bpf/2e267b4b-0540-45d8-9310-e127bf95fc63@nbd.name/
Fixes: 6ebc5030e0 ("bpf: Fix array bounds error with may_goto")
Signed-off-by: KaFai Wan <kafai.wan@linux.dev>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250909144614.2991253-1-kafai.wan@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-09 15:12:16 -07:00
Kumar Kartikeya Dwivedi 0d80e7f951 rqspinlock: Choose trylock fallback for NMI waiters
Currently, out of all 3 types of waiters in the rqspinlock slow path
(i.e., pending bit waiter, wait queue head waiter, and wait queue
non-head waiter), only the pending bit waiter and wait queue head
waiters apply deadlock checks and a timeout on their waiting loop. The
assumption here was that the wait queue head's forward progress would be
sufficient to identify cases where the lock owner or pending bit waiter
is stuck, and non-head waiters relying on the head waiter would prove to
be sufficient for their own forward progress.

However, the head waiter itself can be preempted by a non-head waiter
for the same lock (AA) or a different lock (ABBA) in a manner that
impedes its forward progress. In such a case, non-head waiters not
performing deadlock and timeout checks becomes insufficient, and the
system can enter a state of lockup.

This is typically not a concern with non-NMI lock acquisitions, as lock
holders which run in different contexts (IRQ, non-IRQ) use "irqsave"
variants of the lock APIs, which naturally excludes such lock holders
from preempting one another on the same CPU.

It might seem likely that a similar case may occur for rqspinlock when
programs are attached to contention tracepoints (begin, end), however,
these tracepoints either precede the enqueue into the wait queue, or
succeed it, and therefore cannot be used to preempt a head waiter's waiting
loop.

We must still be careful against nested kprobe and fentry programs that
may attach to the middle of the head's waiting loop to stall forward
progress and invoke another rqspinlock acquisition that proceeds as a
non-head waiter. To this end, drop CC_FLAGS_FTRACE from the rqspinlock.o
object file.

For now, this issue is resolved by falling back to a repeated trylock on
the lock word from NMI context, while performing the deadlock checks to
break out early in case forward progress is impossible, and use the
timeout as a final fallback.
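
A hedged sketch of the fallback shape described above (deadlock_detected(),
timeout_expired() and trylock() are illustrative helper names, not the
kernel's):

  /* Illustrative only: repeated trylock from NMI context. */
  if (in_nmi()) {
          while (!trylock(lock)) {
                  if (deadlock_detected(lock))    /* AA/ABBA checks */
                          return -EDEADLK;        /* break out early */
                  if (timeout_expired(&ts))       /* final fallback */
                          return -ETIMEDOUT;
                  cpu_relax();
          }
          return 0;
  }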

A more involved fix to terminate the queue when such a condition occurs
will be made as a follow up. A selftest to stress this aspect of nested
NMI/non-NMI locking attempts will be added in a subsequent patch to the
bpf-next tree when this fix lands and trees are synchronized.

Reported-by: Josef Bacik <josef@toxicpanda.com>
Fixes: 164c246571 ("rqspinlock: Protect waiters in queue from stalls")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250909184959.3509085-1-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-09 15:10:28 -07:00
Rong Tao 7edfc02470 bpf: Fix bpf_strnstr() to handle suffix match cases better
bpf_strnstr() should not treat the ending '\0' of s2 as a matching character
if the parameter 'len' equals the length of s2, for example:

    1. bpf_strnstr("openat", "open", 4) = -ENOENT
    2. bpf_strnstr("openat", "open", 5) = 0

This patch makes (1) return 0, fixing the `len == strlen(s2)` case.

It also fixes the more general case where s2 is a suffix of the first
len characters of s1.
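
A self-contained sketch of the intended semantics (illustrative, not the
kernel implementation):

  #include <errno.h>
  #include <string.h>

  /* Find s2 within the first len characters of s1; the trailing '\0'
   * of s2 is not treated as a matching character. */
  static long strnstr_sketch(const char *s1, const char *s2, size_t len)
  {
          size_t i, j, n = strlen(s2);

          for (i = 0; i + n <= len && s1[i]; i++) {
                  for (j = 0; j < n && s1[i + j] == s2[j]; j++)
                          ;
                  if (j == n)
                          return i;       /* full match of s2 */
          }
          return -ENOENT;
  }

  /* strnstr_sketch("openat", "open", 4) == 0, i.e. case (1) above now
   * succeeds. */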

Fixes: e91370550f ("bpf: Add kfuncs for read-only string operations")
Signed-off-by: Rong Tao <rongtao@cestc.cn>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/tencent_17DC57B9D16BC443837021BEACE84B7C1507@qq.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-09 15:07:58 -07:00
Daniel Borkmann f9bb6ffa7f bpf: Fix out-of-bounds dynptr write in bpf_crypto_crypt
Stanislav reported that in bpf_crypto_crypt() the destination dynptr's
size is not validated to be at least as large as the source dynptr's
size before calling into the crypto backend with 'len = src_len'. This
can result in an OOB write when the destination is smaller than the
source.

Concretely, in the mentioned function, psrc and pdst are both linear
buffers fetched from each dynptr:

  psrc = __bpf_dynptr_data(src, src_len);
  [...]
  pdst = __bpf_dynptr_data_rw(dst, dst_len);
  [...]
  err = decrypt ?
        ctx->type->decrypt(ctx->tfm, psrc, pdst, src_len, piv) :
        ctx->type->encrypt(ctx->tfm, psrc, pdst, src_len, piv);

The crypto backend expects pdst to be large enough with a src_len length
that can be written. Add an additional src_len > dst_len check and bail
out if it's the case. Note that these kfuncs are accessible under root
privileges only.
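
A minimal sketch of the added bounds check (the error label is
illustrative):

  /* Bail out before the backend writes src_len bytes into pdst. */
  if (src_len > dst_len) {
          err = -EINVAL;
          goto out;       /* illustrative error path */
  }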

Fixes: 3e1c6f3540 ("bpf: make common crypto API for TC/XDP programs")
Reported-by: Stanislav Fort <disclosure@aisle.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://lore.kernel.org/r/20250829143657.318524-1-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-09 15:07:57 -07:00
Linus Torvalds 9dd1835ecd dma-mapping fix for Linux 6.17
- one more fix for DMA API debugging infrastructure (Baochen Qiang)
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSrngzkoBtlA8uaaJ+Jp1EFxbsSRAUCaMAGzwAKCRCJp1EFxbsS
 RJcvAPwO4NrxrUQdjaXOy//DhFNG4fhMmdn6kYqD+TjTJq9QnwEA/dMGxU1fZqvM
 Kqusn1iiQsHjFLRzEOExqRbL6MFDSgM=
 =ViHe
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-6.17-2025-09-09' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux

Pull dma-mapping fix from Marek Szyprowski:

 - one more fix for DMA API debugging infrastructure (Baochen Qiang)

* tag 'dma-mapping-6.17-2025-09-09' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
  dma-debug: don't enforce dma mapping check on noncoherent allocations
2025-09-09 11:03:04 -07:00
Jiri Wiesner b9aa93aa51 clocksource: Print durations for sync check unconditionally
A typical set of messages that gets printed as a result of the clocksource
watchdog finding the TSC unstable usually does not contain messages
indicating CPUs being ahead of or behind the CPU from which the check is
carried out. That fact suggests that the TSC does not experience time skew
between CPUs (if the clocksource.verify_n_cpus parameter is set to a
negative value) but quantitative information is missing.

The cs_nsec_max value printed by the "CPU %d check durations" message
actually provides a worst case estimate of the time skew. If all CPUs have
been checked, the cs_nsec_max value multiplied by 2 is the maximum
possible time skew between the TSCs of any two CPUs on the system. The
worst case estimate is derived from two boundary cases:

1. No time is consumed to execute instructions between csnow_begin and
csnow_mid while all the cs_nsec_max time is consumed by the code between
csnow_mid and csnow_end. In this case, the maximum undetectable time skew
of a CPU being ahead would be cs_nsec_max.

2. All the cs_nsec_max time is consumed to execute instructions between
csnow_begin and csnow_mid while no time is consumed by the code between
csnow_mid and csnow_end. In this case, the maximum undetectable time skew
of a CPU being behind would be cs_nsec_max.

The worst case estimate assumes a system experiencing a corner case
consisting of the two boundary cases.
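
A worked example (numbers illustrative): if the message reports
cs_nsec_max = 900 ns after all CPUs have been checked, then by the two
boundary cases above any two TSCs on the system could disagree by at most
2 * 900 ns = 1800 ns without the sync check being able to detect it.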

Always print the "CPU %d check durations" message so that the maximum
possible time skew measured by the TSC sync check can be compared to the
time skew measured by the clocksource watchdog.

Signed-off-by: Jiri Wiesner <jwiesner@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/all/aIuXXfdITXdI0lLp@incl
2025-09-09 14:08:19 +02:00
Xiongfeng Wang e895f8e291 hrtimers: Unconditionally update target CPU base after offline timer migration
When testing softirq based hrtimers on an ARM32 board, with high resolution
mode and NOHZ inactive, softirq based hrtimers fail to expire after being
moved away from an offline CPU:

CPU0				CPU1
				hrtimer_start(..., HRTIMER_MODE_SOFT);
cpu_down(CPU1)			...
				hrtimers_cpu_dying()
				  // Migrate timers to CPU0
				  smp_call_function_single(CPU0, retrigger_next_event);
  retrigger_next_event()
    if (!highres && !nohz)
        return;

As retrigger_next_event() is a NOOP when both high resolution timers and
NOHZ are inactive, CPU0's hrtimer_cpu_base::softirq_expires_next is not
updated and the migrated softirq timers never expire unless there is a
softirq based hrtimer queued on CPU0 later.

Fix this by removing the hrtimer_hres_active() and tick_nohz_active() check
in retrigger_next_event(), which enforces a full update of the CPU base.
As this is not a fast path, the extra cost does not matter.
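
A hedged sketch of the resulting function (abridged; the locking and
reprogramming details follow the hrtimer code only loosely):

  static void retrigger_next_event(void *arg)
  {
          struct hrtimer_cpu_base *base = this_cpu_ptr(&hrtimer_bases);

          /* The old early exit
           *   if (!hrtimer_hres_active(base) && !tick_nohz_active)
           *           return;
           * is gone, so softirq_expires_next is refreshed even when
           * neither highres nor NOHZ is active. */
          raw_spin_lock(&base->lock);
          hrtimer_update_base(base);
          hrtimer_force_reprogram(base, 0);
          raw_spin_unlock(&base->lock);
  }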

[ tglx: Massaged change log ]

Fixes: 5c0930ccaa ("hrtimers: Push pending hrtimers away from outgoing CPU earlier")
Co-developed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250805081025.54235-1-wangxiongfeng2@huawei.com
2025-09-09 14:05:16 +02:00
Bibo Mao fe2a449a45 tick: Do not set device to detached state in tick_shutdown()
tick_shutdown() sets the state of the clockevent device to detached
first and then invokes clockevents_exchange_device(), which in turn
invokes clockevents_switch_state().

But clockevents_switch_state() returns without invoking the device shutdown
callback as the device is already in detached state. As a consequence the
timer device is not shutdown when a CPU goes offline.

tick_shutdown() does this because it was originally invoked on an online
CPU and not on the outgoing CPU. It therefore could not access the
clockevent device of the already offlined CPU and just set the state.

Since commit 3b1596a21f, tick_shutdown() is called on the outgoing CPU, so
the hardware device can be accessed.

Remove the state set before calling clockevents_exchange_device(), so that
the subsequent clockevents_switch_state() handles the state transition and
invokes the shutdown callback of the clockevent device.
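
A hedged sketch of the resulting flow (simplified):

  void tick_shutdown(unsigned int cpu)
  {
          struct clock_event_device *dev =
                  per_cpu(tick_cpu_device, cpu).evtdev;

          /* No explicit detached-state write here anymore:
           * clockevents_exchange_device() -> clockevents_switch_state()
           * now performs the transition and invokes the device's
           * shutdown callback. */
          clockevents_exchange_device(dev, NULL);
  }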

[ tglx: Massaged change log ]

Fixes: 3b1596a21f ("clockevents: Shutdown and unregister current clockevents at CPUHP_AP_TICK_DYING")
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20250906064952.3749122-2-maobibo@loongson.cn
2025-09-09 13:39:00 +02:00
Thomas Weißschuh 3c3af563b3 hrtimer: Reorder branches in hrtimer_clockid_to_base()
Align the ordering to the one used for hrtimer_bases.

Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250821-hrtimer-cleanup-get_time-v2-9-3ae822e5bfbd@linutronix.de
2025-09-09 12:27:18 +02:00
Thomas Weißschuh 009eb5da29 hrtimer: Remove hrtimer_clock_base::get_time
The get_time() callbacks always need to match the base's clockid.
Instead of maintaining that association twice in hrtimer_bases,
use a helper.
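
A hedged sketch of such a helper (name and shape illustrative):

  static inline ktime_t hrtimer_clockid_get_time(clockid_t clockid)
  {
          switch (clockid) {
          case CLOCK_MONOTONIC:   return ktime_get();
          case CLOCK_REALTIME:    return ktime_get_real();
          case CLOCK_BOOTTIME:    return ktime_get_boottime();
          case CLOCK_TAI:         return ktime_get_clocktai();
          default:
                  WARN_ON_ONCE(1);
                  return ktime_get();
          }
  }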

Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20250821-hrtimer-cleanup-get_time-v2-8-3ae822e5bfbd@linutronix.de
2025-09-09 12:27:18 +02:00
Thomas Weißschuh b68b7f3e9b sched/core: Avoid direct access to hrtimer clockbase
The field timer->base->get_time is a private implementation detail and
should not be accessed outside of the hrtimer core.

Switch to the equivalent helper.

Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20250821-hrtimer-cleanup-get_time-v2-3-3ae822e5bfbd@linutronix.de
2025-09-09 12:27:18 +02:00
Thomas Weißschuh 5f531fe9cb timers/itimer: Avoid direct access to hrtimer clockbase
The field timer->base->get_time is a private implementation detail and
should not be accessed outside of the hrtimer core.

Switch to the equivalent helper.

Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20250821-hrtimer-cleanup-get_time-v2-2-3ae822e5bfbd@linutronix.de
2025-09-09 12:27:17 +02:00
Thomas Weißschuh 24fb08dcc4 posix-timers: Avoid direct access to hrtimer clockbase
The field timer->base->get_time is a private implementation detail and
should not be accessed outside of the hrtimer core.

Switch to the equivalent helpers.

Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20250821-hrtimer-cleanup-get_time-v2-1-3ae822e5bfbd@linutronix.de
2025-09-09 12:27:17 +02:00
Pu Lehui cd4453c5e9 tracing: Silence warning when chunk allocation fails in trace_pid_write
Syzkaller triggered a fault injection warning:

WARNING: CPU: 1 PID: 12326 at tracepoint_add_func+0xbfc/0xeb0
Modules linked in:
CPU: 1 UID: 0 PID: 12326 Comm: syz.6.10325 Tainted: G U 6.14.0-rc5-syzkaller #0
Tainted: [U]=USER
Hardware name: Google Compute Engine/Google Compute Engine
RIP: 0010:tracepoint_add_func+0xbfc/0xeb0 kernel/tracepoint.c:294
Code: 09 fe ff 90 0f 0b 90 0f b6 74 24 43 31 ff 41 bc ea ff ff ff
RSP: 0018:ffffc9000414fb48 EFLAGS: 00010283
RAX: 00000000000012a1 RBX: ffffffff8e240ae0 RCX: ffffc90014b78000
RDX: 0000000000080000 RSI: ffffffff81bbd78b RDI: 0000000000000001
RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000001 R12: ffffffffffffffef
R13: 0000000000000000 R14: dffffc0000000000 R15: ffffffff81c264f0
FS:  00007f27217f66c0(0000) GS:ffff8880b8700000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001b2e80dff8 CR3: 00000000268f8000 CR4: 00000000003526f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 tracepoint_probe_register_prio+0xc0/0x110 kernel/tracepoint.c:464
 register_trace_prio_sched_switch include/trace/events/sched.h:222 [inline]
 register_pid_events kernel/trace/trace_events.c:2354 [inline]
 event_pid_write.isra.0+0x439/0x7a0 kernel/trace/trace_events.c:2425
 vfs_write+0x24c/0x1150 fs/read_write.c:677
 ksys_write+0x12b/0x250 fs/read_write.c:731
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0xcd/0x250 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

We can reproduce the warning by following the steps below:
1. echo 8 >> set_event_notrace_pid. Let tr->filtered_pids own one pid
   and register the sched_switch tracepoint.
2. echo ' ' >> set_event_pid, and perform fault injection during chunk
   allocation in trace_pid_list_alloc. This leaves pid_list with no pid,
   and it is assigned to tr->filtered_pids.
3. echo ' ' >> set_event_pid. This makes pid_list NULL, and it is
   assigned to tr->filtered_pids.
4. echo 9 >> set_event_pid. This triggers the double-register
   sched_switch tracepoint warning.

The reason is that syzkaller injects a fault into the chunk allocation
in trace_pid_list_alloc, causing a failure in trace_pid_list_set, which
may trigger a double registration of the same tracepoint. This only
occurs when the system is about to crash, but to suppress this warning,
let's add failure handling logic for trace_pid_list_set.
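
A hedged sketch of that failure handling in the write path (simplified;
the surrounding loop in trace_pid_write() is elided):

  /* Stop on allocation failure instead of continuing with a
   * partially populated list. */
  ret = trace_pid_list_set(pid_list, pid);
  if (ret < 0)
          break;          /* propagated as the write error */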

Link: https://lore.kernel.org/20250908024658.2390398-1-pulehui@huaweicloud.com
Fixes: 8d6e90983a ("tracing: Create a sparse bitmask for pid filtering")
Reported-by: syzbot+161412ccaeff20ce4dde@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/67cb890e.050a0220.d8275.022e.GAE@google.com
Signed-off-by: Pu Lehui <pulehui@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-09-08 14:56:43 -04:00
Marco Crivellari a857210b10 bpf: WQ_PERCPU added to alloc_workqueue users
Currently, if a user enqueues a work item using schedule_delayed_work(), the
wq used is "system_wq" (per-cpu wq), while queue_delayed_work() uses
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work(), which uses system_wq, and queue_work(), which again
makes use of WORK_CPU_UNBOUND.
This lack of consistency cannot be addressed without refactoring the API.

alloc_workqueue() treats all queues as per-CPU by default, while unbound
workqueues must opt-in via WQ_UNBOUND.

This default is suboptimal: most workloads benefit from unbound queues,
allowing the scheduler to place worker threads where they’re needed and
reducing noise when CPUs are isolated.

This patch adds a new WQ_PERCPU flag to explicitly request the use of
the per-CPU behavior. Both flags coexist for one release cycle to allow
callers to transition their calls.

Once migration is complete, WQ_UNBOUND can be removed and unbound will
become the implicit default.

With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU.

All existing users have been updated accordingly.
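
An illustrative conversion for one such caller ("my_wq" is a
placeholder):

  /* Before: relies on the implicit per-CPU default. */
  wq = alloc_workqueue("my_wq", WQ_MEM_RECLAIM, 0);

  /* After: the per-CPU behavior is requested explicitly. */
  wq = alloc_workqueue("my_wq", WQ_PERCPU | WQ_MEM_RECLAIM, 0);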

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://lore.kernel.org/r/20250905085309.94596-4-marco.crivellari@suse.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-08 10:04:37 -07:00
Marco Crivellari 0409819a00 bpf: replace use of system_unbound_wq with system_dfl_wq
Currently, if a user enqueues a work item using schedule_delayed_work(), the
wq used is "system_wq" (per-cpu wq), while queue_delayed_work() uses
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work(), which uses system_wq, and queue_work(), which again
makes use of WORK_CPU_UNBOUND.

This lack of consistency cannot be addressed without refactoring the API.

system_unbound_wq should be the default workqueue so as not to enforce
locality constraints for random work whenever it's not required.

Add system_dfl_wq to encourage its use when unbound work should be used.

queue_work() / queue_delayed_work() / mod_delayed_work() will now use the
new unbound wq: if the user still uses the old wq, a warning will be
printed along with a redirect to the new one.

The old system_unbound_wq will be kept for a few release cycles.
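
An illustrative migration ("my_work" is a placeholder):

  /* Before: */
  queue_work(system_unbound_wq, &my_work);

  /* After: */
  queue_work(system_dfl_wq, &my_work);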

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://lore.kernel.org/r/20250905085309.94596-3-marco.crivellari@suse.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-08 10:04:37 -07:00
Marco Crivellari 34f86083a4 bpf: replace use of system_wq with system_percpu_wq
Currently, if a user enqueues a work item using schedule_delayed_work(), the
wq used is "system_wq" (per-cpu wq), while queue_delayed_work() uses
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work(), which uses system_wq, and queue_work(), which again
makes use of WORK_CPU_UNBOUND.

This lack of consistency cannot be addressed without refactoring the API.

system_wq is a per-CPU workqueue, yet nothing in its name tells about that
CPU affinity constraint, which is very often not required by users. Make
it clear by adding a system_percpu_wq.

queue_work() / queue_delayed_work() / mod_delayed_work() will now use the
new per-cpu wq: if the user still sticks to the old name, a warning will
be printed along with a redirect to the new one.

This patch adds the new system_percpu_wq, except for the mm, fs and net
subsystems, which are handled in separate patches.

The old wq will be kept for a few release cycles.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://lore.kernel.org/r/20250905085309.94596-2-marco.crivellari@suse.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-08 10:04:37 -07:00
Linus Torvalds 6ab41fca2e Fix a severe slowdown regression in the timer vDSO code related
to the while() loop in __iter_div_u64_rem(), when the AUX-clock
 is enabled.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmi9XZURHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1gVzg/+N3T4qL8Vi3HwOzN862EqmVdaEauAVkQU
 vlIxFbUX26KHmwoSYpsVlaswhONMa6tkNEz3zeOfECrK6SAIg/9SGuk/1bRqERCc
 4KX3g0S0b7qfC7VJmjsmswXmxM9MrXLY5dx/XzhbACpGVoTJAMLlXL0OVwS361fP
 VOH4zSRCPUl+IbTJ6CkVQKZYWZAXxDuxBaSAsMaUz2QMZck6ikGOKP7CT9se0Fk5
 Rve7bC0Wy2vuUfchCIagSE9U28rf7HDawqJVEOygtRQm/Hl7yKSUxaHO5LdSI7c/
 k7mCoZkdyqj3Pdbe26CFSyvGdNJ+gHIhOSzaO43kHTf+bpgW/WBWm6DLjGWl8eKP
 EAW7QMkFoQaNWhSUbPwOR68R7fR3CsvmU1SOXufIRVKjuDy+246gu6y1nvJFr3RS
 BQyBPUh3qAJUoJj00nvjxtLB9p0pI2C+0ml7GTWEX3Ao8Y7rPPKh8yR2CvIhdmI4
 Oo4/vRtcs9NGYKzSu7gknYSkxITsjfEInCkmdsDVQ503QuUAgL653z8rjUIfTIti
 72fJ3pqVUyJuv9xyRkzQcavyZtU/l8uSfMNzgmHen8zmPjKA6rXg+BiuhP4uQ10W
 Wxw2s0cP3RgvuL3+y9oQHq3VHJ4JfdNVOKVssd97zCvmwFKCOXxi+0eKCy9ELQPq
 nNpuOC/9API=
 =5NRx
 -----END PGP SIGNATURE-----

Merge tag 'timers-urgent-2025-09-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fix from Ingo Molnar:
 "Fix a severe slowdown regression in the timer vDSO code related to the
  while() loop in __iter_div_u64_rem(), when the AUX-clock is enabled"

* tag 'timers-urgent-2025-09-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  vdso/vsyscall: Avoid slow division loop in auxiliary clock update
2025-09-07 08:29:44 -07:00
Linus Torvalds b7369eb731 Fix an 'allocation from atomic context' regression in the futex vmalloc
variant.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmi9WmkRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1h4ww/9FnZ2elqxUG2kfzmDMwMCEA+KDyVIfqLI
 DzNqB0oFOgHO1xcWx/DLWItWIycH/+XhwcQUpCRcBlrG1pzaRqrbD0ioYw0SZRFo
 JbNRIQVjPBKa72vcPUd5et1cJB0iB7irUEhEPDPxkQ5BeIQOwayLF/DSz4IsKvde
 os9W9D2QcXk/YlOvbT+eRMmkDCSrAeAufz7RzcoPgzBvtGAWZEyCaBBjWXiuE2Au
 T05aXf6y35/QPB+pq7+aLxaqP97Auj2ebmCSB7iKGq1XCYXqCFmo4vQzAU6zZgkK
 hriO3+YL1wgN524bWaGXPRm87zcCmLVOXFJXe0wbTm/9w3CWmMUE+LEhfEAM+9n9
 IZpoAEL540P3/ijYFxpie4GAz+ZjttUwLvh8f5xaM/2jPUq8bUXakR6dt5U5h3YA
 n4FG2/K215R3s9+wYV2KpNQ0gUqZn7EmqfMAkQ1N09rsW571ExUKWdrEr61TUOnx
 BTwC0nOti/M97sOHjajG/ylBmRaswc6mwiLVlVy3Gvi/kzAqjAVe+cRJnNKSd7hd
 kIUKUiPWrI4eSWi+0gmW0Xo5RcyEbMNf4iVGvlZ6aM5zoKjZo72TGPcqapxbaW6m
 hRRsFJdlXmBBOxKhB+Ne0EZIqQzk9ZK8tknN5voXp1UsupanubwS7H3YYppljCxg
 VdlccxGizuI=
 =rQLp
 -----END PGP SIGNATURE-----

Merge tag 'locking-urgent-2025-09-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking fix from Ingo Molnar:
 "Fix an 'allocation from atomic context' regression in the futex
  vmalloc variant"

* tag 'locking-urgent-2025-09-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  futex: Move futex_hash_free() back to __mmput()
2025-09-07 08:26:28 -07:00
Linus Torvalds 6a8a34a56a Fix regression where PERF_EVENT_IOC_REFRESH counters
miss a PMU-stop.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmi9T3QRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1gkjhAAljiQ9y7DBlxWonRPstqKy1AYY7ndn+Jl
 CMgvflxE5NzI2IKqu7k642KCeb1UcL3Jpuqx+ywZCn5TZGgQ1haGhU7B8aufbMPc
 tcUx1LyrANhnRc0baBo5mPly9a1muV+bKNeA6Qa7fOQlI6R8dkFw2NQ+veN0bXCl
 2tarzNcnwCjUu0zat8xlgxTo94P49reW/A/SgTlXBBFmHCezzDh348FYgWzFJ9iw
 3Ki0BDd53z6p9uvTVSIxvPSVjhKAGVCX+zYJNcIOlBHnEe2yV0FYuGJGlEM7Vr2z
 jCd7MuUQ+kxtUmRLQGMZntSHjweseuOyJvx04rpVvX4I7tC7BgAZqeepw6pYV9ay
 OBIx1w+mYgI9gKHQUbYagETUBBPkofEABotQ46QD3Ih5zT/lwUbHy3JPj7bXt0nm
 qrpEognULUlJkwULzUcnBRs03nf72U1yZYNKL/gYw3rEM+EKek5UburmBXcOJa8S
 33FK9vyEWnkZaSJZ//KoMIirjZqwAK1s/nxuxTp6ZjN0vh3VQld1GLmOKgtYJNEK
 g8/yUa0EXGEjsRZccPoJkRqoX4tHdEj6cOBFDuTvT+Ps6Wj4S30QTkfWyDltcyWT
 ziaP9fWxPkXS68DSHlx3Em0yIVU76f2dhZDZnJ2sODkuSLi+c0yBcy1lXeYJrkOj
 m9XlTj72KkY=
 =p8YX
 -----END PGP SIGNATURE-----

Merge tag 'perf-urgent-2025-09-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf event fix from Ingo Molnar:
 "Fix regression where PERF_EVENT_IOC_REFRESH counters miss a PMU-stop"

* tag 'perf-urgent-2025-09-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf: Fix the POLL_HUP delivery breakage
2025-09-07 08:24:20 -07:00
Wang Liang c1628c00c4 tracing/osnoise: Fix null-ptr-deref in bitmap_parselist()
A crash was observed with the following output:

BUG: kernel NULL pointer dereference, address: 0000000000000010
Oops: Oops: 0000 [#1] SMP NOPTI
CPU: 2 UID: 0 PID: 92 Comm: osnoise_cpus Not tainted 6.17.0-rc4-00201-gd69eb204c255 #138 PREEMPT(voluntary)
RIP: 0010:bitmap_parselist+0x53/0x3e0
Call Trace:
 <TASK>
 osnoise_cpus_write+0x7a/0x190
 vfs_write+0xf8/0x410
 ? do_sys_openat2+0x88/0xd0
 ksys_write+0x60/0xd0
 do_syscall_64+0xa4/0x260
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
 </TASK>

This issue can be reproduced by the code below:

fd=open("/sys/kernel/debug/tracing/osnoise/cpus", O_WRONLY);
write(fd, "0-2", 0);

When the user passes 'count=0' to osnoise_cpus_write(), kmalloc() will return
ZERO_SIZE_PTR (16) and cpulist_parse() treats it as a normal value, which
triggers the null pointer dereference. Add a check for the parameter 'count'.
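
A minimal sketch of the added check at the top of osnoise_cpus_write()
(illustrative):

  if (count < 1)
          return 0;       /* nothing to parse; keeps the ZERO_SIZE_PTR
                           * kmalloc() result away from cpulist_parse() */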

Cc: <mhiramat@kernel.org>
Cc: <mathieu.desnoyers@efficios.com>
Cc: <tglozar@redhat.com>
Link: https://lore.kernel.org/20250906035610.3880282-1-wangliang74@huawei.com
Fixes: 17f89102fe ("tracing/osnoise: Allow arbitrarily long CPU string")
Signed-off-by: Wang Liang <wangliang74@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-09-06 12:12:38 -04:00
Guenter Roeck ab1396af75 trace/fgraph: Fix error handling
Commit edede7a6dc ("trace/fgraph: Fix the warning caused by missing
unregister notifier") added a call to unregister the PM notifier if
register_ftrace_graph() failed. It does so unconditionally. However,
the PM notifier is only registered with the first call to
register_ftrace_graph(). If the first registration was successful and
a subsequent registration failed, the notifier is now unregistered even
if ftrace graphs are still registered.

Fix the problem by only unregistering the PM notifier during error handling
if there are no active fgraph registrations.
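
A hedged sketch of the corrected error path (simplified; identifiers as in
kernel/trace/fgraph.c):

  ret = register_ftrace_graph(&gops);
  if (ret) {
          /* The PM notifier is shared by all registrations: only
           * unregister it if this failure leaves no active users. */
          if (!ftrace_graph_active)
                  unregister_pm_notifier(&ftrace_suspend_notifier);
  }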

Fixes: edede7a6dc ("trace/fgraph: Fix the warning caused by missing unregister notifier")
Closes: https://lore.kernel.org/all/63b0ba5a-a928-438e-84f9-93028dd72e54@roeck-us.net/
Cc: Ye Weihua <yeweihua4@huawei.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250906050618.2634078-1-linux@roeck-us.net
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-09-06 12:12:38 -04:00
Linus Torvalds 730c1451fb audit/stable-6.17 PR 20250905
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCgAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmi7AksUHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXMmaA/8ChtRoaZFLTFI7UTWmVo0aY7NDl77
 fqjwY1ZrPsqezLYt317LCJareQOfn1/NEc1Xxb7Caz6Z35eQPUDeUnd96zEarzDL
 4iYcZA1MJVO2jjnyj4lqVoSgkftQPN6qvga5osA9mMOQ24mNPAO5yZIVdzL/sekv
 ewosheC6spwFD26+/uWE00sdVRhtOnUPkLmelf4wyN0WX+lStTXozLpSW1Pr0FB1
 tZI/mjLyVmZ+YJtTMCMLhpM/hFIMymlSr7BZ7pO/G1w48OGzVBdUlS45bS5rMUIQ
 D/MdTB2YlkMOYFyRCgoSgxaFHGHf7F6MQO0J7y/hVu2QSvGoxneUaTIszpaj7rcg
 edaJKogA6DlXvauw32UQhjUTQetuppyFqAHQXsec/JVYGUJqAZWNdsThP86MGlim
 INwspptaBwOULg2OTYt/+jHblbhl8BI9ayC/S4lN89MCwszLYoFmjIo+mCf9qWfE
 uFsiyvvqoXaMKH/e4NkcUjT0AxzwAHF0DwU7Vh+apThOpdr7Qudr6meWEgJ+iRpO
 hhWrpyPAP1pann7VebwAqWs9iG97cuwARca4EWyuSy+i11qLQ19LwC/i5vfb58TJ
 Ozh9g01A65qWgA5/G14XWmN1oLYdWK+KqwmdTXtDgWXZ5hX3oMrlypRxbPxKIyVG
 PsEKU7GXjze/J1s=
 =g/UF
 -----END PGP SIGNATURE-----

Merge tag 'audit-pr-20250905' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit

Pull audit fix from Paul Moore:
 "A single small audit patch to fix a potential out-of-bounds read
  caused by a negative array index when comparing paths"

* tag 'audit-pr-20250905' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
  audit: fix out-of-bounds read in audit_compare_dname_path()
2025-09-05 12:35:25 -07:00
Marco Crivellari a2be943b46 workqueue: replace use of system_wq with system_percpu_wq
Currently, if a user enqueues a work item using schedule_delayed_work(), the
wq used is "system_wq" (per-cpu wq), while queue_delayed_work() uses
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work(), which uses system_wq, and queue_work(), which again
makes use of WORK_CPU_UNBOUND.

This lack of consistency cannot be addressed without refactoring the API.

system_wq is a per-CPU workqueue, yet nothing in its name tells about that
CPU affinity constraint, which is very often not required by users. Make
it clear by adding a system_percpu_wq.

queue_work() / queue_delayed_work() / mod_delayed_work() will now use the
new per-cpu wq: if the user still sticks to the old name, a warning will
be printed along with a redirect to the new one.

This patch adds the new system_percpu_wq, except for the mm, fs and net
subsystems, which are handled in separate patches.

The old wq will be kept for a few release cycles.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-05 07:20:00 -10:00
Marco Crivellari f6cfa602d2 workqueue: replace use of system_unbound_wq with system_dfl_wq
Currently, if a user enqueues a work item using schedule_delayed_work(), the
wq used is "system_wq" (per-cpu wq), while queue_delayed_work() uses
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work(), which uses system_wq, and queue_work(), which again
makes use of WORK_CPU_UNBOUND.

This lack of consistency cannot be addressed without refactoring the API.

system_unbound_wq should be the default workqueue so as not to enforce
locality constraints for random work whenever it's not required.

Add system_dfl_wq to encourage its use when unbound work should be used.

queue_work() / queue_delayed_work() / mod_delayed_work() will now use the
new unbound wq: if the user still uses the old wq, a warning will be
printed along with a redirect to the new one.

The old system_unbound_wq will be kept for a few release cycles.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-05 07:19:09 -10:00
Tejun Heo 4a3e62dfa7 cgroup: Merge branch 'for-6.17-fixes' into for-6.18
Pull for-6.17-fixes to receive 79f919a89c ("cgroup: split
cgroup_destroy_wq into 3 workqueues") to resolve its conflict with
7fa33aa3b0 ("cgroup: WQ_PERCPU added to alloc_workqueue users"). The
latter adds WQ_PERCPU when creating cgroup_destroy_wq and the former splits
the workqueue into three. Resolve by applying WQ_PERCPU to the three split
workqueues.

Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-05 07:08:26 -10:00
Marco Crivellari 7fa33aa3b0 cgroup: WQ_PERCPU added to alloc_workqueue users
Currently, if a user enqueues a work item using schedule_delayed_work(), the
wq used is "system_wq" (per-cpu wq), while queue_delayed_work() uses
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work(), which uses system_wq, and queue_work(), which again
makes use of WORK_CPU_UNBOUND.
This lack of consistency cannot be addressed without refactoring the API.

alloc_workqueue() treats all queues as per-CPU by default, while unbound
workqueues must opt-in via WQ_UNBOUND.

This default is suboptimal: most workloads benefit from unbound queues,
allowing the scheduler to place worker threads where they’re needed and
reducing noise when CPUs are isolated.

This patch adds a new WQ_PERCPU flag to explicitly request the use of
the per-CPU behavior. Both flags coexist for one release cycle to allow
callers to transition their calls.

Once migration is complete, WQ_UNBOUND can be removed and unbound will
become the implicit default.

With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU.

All existing users have been updated accordingly.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-05 06:40:25 -10:00
Marco Crivellari d6256771d1 cgroup: replace use of system_wq with system_percpu_wq
Currently, if a user enqueues a work item using schedule_delayed_work(), the
wq used is "system_wq" (per-cpu wq), while queue_delayed_work() uses
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work(), which uses system_wq, and queue_work(), which again
makes use of WORK_CPU_UNBOUND.

This lack of consistency cannot be addressed without refactoring the API.

system_wq is a per-CPU workqueue, yet nothing in its name tells about that
CPU affinity constraint, which is very often not required by users. Make
it clear by adding a system_percpu_wq.

queue_work() / queue_delayed_work() / mod_delayed_work() will now use the
new per-cpu wq: if the user still sticks to the old name, a warning will
be printed along with a redirect to the new one.

This patch adds the new system_percpu_wq, except for the mm, fs and net
subsystems, which are handled in separate patches.

The old wq will be kept for a few release cycles.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-05 06:40:12 -10:00
Tejun Heo 222f83d5ab cgroup: Remove unused local variables from cgroup_procs_write_finish()
d8b269e009 ("cgroup: Remove unused cgroup_subsys::post_attach") made $ss
and $ssid unused but didn't drop them leading to compilation warnings. Drop
them.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Chuyi Zhou <zhouchuyi@bytedance.com>
2025-09-04 11:23:43 -10:00
Jakub Kicinski 5ef04a7b06 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.17-rc5).

No conflicts.

Adjacent changes:

include/net/sock.h
  c51613fa27 ("net: add sk->sk_drop_counters")
  5d6b58c932 ("net: lockless sock_i_ino()")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-04 13:33:00 -07:00
Andrea Righi 47d9f82128 sched_ext: Fix NULL dereference in scx_bpf_cpu_rq() warning
When printing the deprecation warning for scx_bpf_cpu_rq(), we may hit a
NULL pointer dereference if the kfunc is called before a BPF scheduler
is fully attached, for example, when invoked from a BPF timer or during
ops.init():

 [   50.752775] BUG: kernel NULL pointer dereference, address: 0000000000000331
 ...
 [   50.764205] RIP: 0010:scx_bpf_cpu_rq+0x30/0xa0
 ...
 [   50.787661] Call Trace:
 [   50.788398]  <TASK>
 [   50.789061]  bpf_prog_08f7fd2dcb187aaf_wakeup_timerfn+0x75/0x1a8
 [   50.792477]  bpf_timer_cb+0x7e/0x140
 [   50.796003]  hrtimer_run_softirq+0x91/0xe0
 [   50.796952]  handle_softirqs+0xce/0x3c0
 [   50.799087]  run_ksoftirqd+0x3e/0x70
 [   50.800197]  smpboot_thread_fn+0x133/0x290
 [   50.802320]  kthread+0x115/0x220
 [   50.804984]  ret_from_fork+0x17a/0x1d0
 [   50.806920]  ret_from_fork_asm+0x1a/0x30
 [   50.807799]  </TASK>

Fix this by only printing the warning once the scheduler is fully
registered.

Fixes: 5c48d88fe0 ("sched_ext: deprecation warn for scx_bpf_cpu_rq()")
Cc: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-04 10:27:16 -10:00
Samuel Wu 56a232d93c PM: sleep: Make pm_wakeup_clear() call more clear
Move pm_wakeup_clear() to the same location as other functions that do
bookkeeping prior to suspend_prepare().

Since calling pm_wakeup_clear() is a prerequisite to setting up for
suspend and enabling functionalities of suspend (like aborting during
suspend), moving pm_wakeup_clear() higher up the call stack makes its
intent more clear and obvious that it is called prior to
suspend_prepare().

After this change, there is a slightly larger window when abort events
can be registered, but otherwise suspend functionality is the same.

Suggested-by: Saravana Kannan <saravanak@google.com>
Signed-off-by: Samuel Wu <wusamuel@google.com>
Link: https://patch.msgid.link/20250821004237.2712312-2-wusamuel@google.com
Reviewed-by: Saravana Kannan <saravanak@google.com>
[ rjw: Subject and changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-09-04 21:05:14 +02:00
Sebastian Andrzej Siewior ad7c7f4b9c workqueue: Provide a handshake for canceling BH workers
While a BH work item is being canceled, the core code spins until it
determines that the item has completed. On PREEMPT_RT the spinning relies on
a lock in local_bh_disable() to avoid a live lock if the canceling
thread has higher priority than the BH-worker and preempts it. This lock
ensures that the BH-worker makes progress by PI-boosting it.

This lock in local_bh_disable() is a central per-CPU BKL and about to be
removed.

To provide the required synchronisation, add a per-pool lock. The lock is
acquired by the bh_worker at the beginning and held while the individual
callbacks are invoked. To enforce progress in case of interruption,
__flush_work() needs to acquire the lock.
This will flush all BH-work items assigned to that pool.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-04 07:28:33 -10:00
Chuyi Zhou d8b269e009 cgroup: Remove unused cgroup_subsys::post_attach
cgroup_subsys::post_attach callback was introduced in commit 5cf1cacb49
("cgroup, cpuset: replace cpuset_post_attach_flush() with
cgroup_subsys->post_attach callback") and only cpuset would use this
callback to wait for the mm migration to complete at the end of
__cgroup_procs_write(). Since the previous patch defers the flush operation
until returning to userspace, no one uses this callback now. Remove this
callback from cgroup_subsys.

Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-04 07:25:20 -10:00
Chuyi Zhou 3514309e03 cpuset: Defer flushing of the cpuset_migrate_mm_wq to task_work
Now in cpuset_attach(), we need to synchronously wait for
flush_workqueue to complete. The execution time of flushing
cpuset_migrate_mm_wq depends on the amount of mm migration initiated by
cpusets at that time. When the cpuset.mems of a cgroup occupying a large
amount of memory is modified, it may trigger extensive mm migration,
causing cpuset_attach() to block on flush_workqueue for an extended period.
This could be dangerous because cpuset_attach() is within the critical
section of cgroup_mutex, which may ultimately cause all cgroup-related
operations in the system to be blocked.

This patch attempts to defer the flush_workqueue() operation until
returning to userspace using task_work, as originally proposed by
Tejun [1], so that the flush happens after cgroup_mutex is dropped. That
way we maintain the operation's synchronicity while avoiding bothering
anyone else.

[1]: https://lore.kernel.org/cgroups/ZgMFPMjZRZCsq9Q-@slm.duckdns.org/T/#m117f606fa24f66f0823a60f211b36f24bd9e1883

Originally-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-04 07:22:38 -10:00
Chuyi Zhou c0fb16ef88 cpuset: Don't always flush cpuset_migrate_mm_wq in cpuset_write_resmask
It is unnecessary to always wait for the flush operation of
cpuset_migrate_mm_wq to complete in cpuset_write_resmask(), as modifying
cpuset.cpus or cpuset.exclusive does not trigger mm migrations. The
flush_workqueue() needs to be executed only when cpuset.mems is modified.

Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-04 07:15:30 -10:00
Zqiang cda2b2d647 workqueue: Remove rcu_read_lock/unlock() in wq_watchdog_timer_fn()
wq_watchdog_timer_fn() is executed in softirq context, which is
already an RCU read-side critical section. This commit therefore
removes rcu_read_lock/unlock() from wq_watchdog_timer_fn().

Signed-off-by: Zqiang <qiang.zhang@linux.dev>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-04 06:18:00 -10:00
Zqiang fd5081f4ef workqueue: Remove redundant rcu_read_lock/unlock() in workqueue_congested()
preempt_disable/enable() already forms an RCU read-side critical
section. This commit therefore removes rcu_read_lock/unlock() from
workqueue_congested().

Signed-off-by: Zqiang <qiang.zhang@linux.dev>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-04 06:17:52 -10:00
Rong Tao 19559e8441 bpf: add bpf_strcasecmp kfunc
The bpf_strcasecmp() function performs the same comparison as bpf_strcmp(),
except that it ignores the case of the characters.
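
An illustrative use from a BPF program (the comm buffer is a
placeholder):

  char comm[16];

  bpf_get_current_comm(comm, sizeof(comm));
  if (bpf_strcasecmp(comm, "BASH") == 0) {
          /* matches "bash", "Bash", "BASH", ... */
  }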

Signed-off-by: Rong Tao <rongtao@cestc.cn>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Viktor Malik <vmalik@redhat.com>
Link: https://lore.kernel.org/r/tencent_292BD3682A628581AA904996D8E59F4ACD06@qq.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-04 09:00:57 -07:00
Eric Dumazet 2aef21a6a6 audit: init ab->skb_list earlier in audit_buffer_alloc()
syzbot found a bug in audit_buffer_alloc() if nlmsg_new() returns NULL.

We need to initialize ab->skb_list before calling audit_buffer_free()
which will use both the skb_list spinlock and list pointers.

Fixes: eb59d494ee ("audit: add record for multiple task security contexts")
Reported-by: syzbot+bb185b018a51f8d91fd2@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/lkml/68b93e3c.a00a0220.eb3d.0000.GAE@google.com/T/#u
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Eric Paris <eparis@redhat.com>
Cc: audit@vger.kernel.org
Signed-off-by: Paul Moore <paul@paul-moore.com>
2025-09-04 11:06:33 -04:00
Thomas Weißschuh ea1a1fa919 time: Build generic update_vsyscall() only with generic time vDSO
The generic vDSO can be used without the time-related functionality.
In that case the generic update_vsyscall() from kernel/time/vsyscall.c
should not be built.

Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250826-vdso-cleanups-v1-5-d9b65750e49f@linutronix.de
2025-09-04 11:23:50 +02:00
Gatien Chevallier 96c88268b7 time: export timespec64_add_safe() symbol
Export the timespec64_add_safe() symbol so that this function can be used
in modules where time-related computation is done.

Signed-off-by: Gatien Chevallier <gatien.chevallier@foss.st.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/20250901-relative_flex_pps-v4-1-b874971dfe85@foss.st.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-03 16:51:08 -07:00
Christian Loehle 5c48d88fe0 sched_ext: deprecation warn for scx_bpf_cpu_rq()
scx_bpf_cpu_rq() works on an unlocked rq which generally isn't safe.
For the common use-cases scx_bpf_locked_rq() and
scx_bpf_cpu_curr() work, so add a deprecation warning
to scx_bpf_cpu_rq() so it can eventually be removed.

Signed-off-by: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-03 11:51:57 -10:00
Christian Loehle 20b158094a sched_ext: Introduce scx_bpf_cpu_curr()
Provide scx_bpf_cpu_curr() as a way for scx schedulers to check the curr
task of a remote rq without assuming its lock is held.

Many scx schedulers make use of scx_bpf_cpu_rq() to check a remote curr
(e.g. to see if it should be preempted). This is problematic because
scx_bpf_cpu_rq() provides access to all fields of struct rq, most of
which aren't safe to use without holding the associated rq lock.

Signed-off-by: Christian Loehle <christian.loehle@arm.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-03 11:50:42 -10:00
Christian Loehle e0ca169638 sched_ext: Introduce scx_bpf_locked_rq()
Most fields of the rq returned by scx_bpf_cpu_rq() assume that its
rq_lock is held, and they become meaningless without it. Make a safer
version of scx_bpf_cpu_rq() that only returns an rq if we hold that
rq's lock.

Also mark the new scx_bpf_locked_rq() as possibly returning NULL, as
scx_bpf_cpu_rq() should have been.

Signed-off-by: Christian Loehle <christian.loehle@arm.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-03 11:50:36 -10:00
Tejun Heo a5bd6ba30b sched_ext: Use cgroup_lock/unlock() to synchronize against cgroup operations
SCX hooks into CPU cgroup controller operations and read-locks
scx_cgroup_rwsem to exclude them while enabling and disable schedulers.
While this works, it's unnecessarily complicated given that
cgroup_[un]lock() are available and thus the cgroup operations can be locked
out that way.

Drop scx_cgroup_rwsem locking from the tg on/offline and cgroup [can_]attach
operations. Instead, grab cgroup_lock() from scx_cgroup_lock(). Drop
scx_cgroup_finish_attach() which is no longer necessary. Drop the now
unnecessary rcu locking and css ref bumping in scx_cgroup_init() and
scx_cgroup_exit().

As scx_cgroup_set_weight/bandwidth() paths aren't protected by
cgroup_lock(), rename scx_cgroup_rwsem to scx_cgroup_ops_rwsem and retain
the locking there.

This is overall simpler and will also allow enable/disable paths to
synchronize against cgroup changes independent of the CPU controller.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-09-03 11:36:07 -10:00
Tejun Heo bcb7c23056 sched_ext: Put event_stats_cpu in struct scx_sched_pcpu
scx_sched.event_stats_cpu is the percpu counters that are used to track
stats. Introduce struct scx_sched_pcpu and move the counters inside. This
will ease adding more per-cpu fields. No functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-09-03 11:33:28 -10:00
Tejun Heo 0c2b8356e4 sched_ext: Move internal type and accessor definitions to ext_internal.h
There currently isn't a place to place SCX-internal types and accessors to
be shared between ext.c and ext_idle.c. Create kernel/sched/ext_internal.h
and move internal type and accessor definitions there. This trims ext.c a
bit and makes future additions easier. Pure code reorganization. No
functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-09-03 11:33:28 -10:00
Tejun Heo 4a1d9d73aa sched_ext: Keep bypass on between enable failure and scx_disable_workfn()
scx_enable() turns on the bypass mode while enable is in progress. If
enabling fails, it turns off the bypass mode and then triggers scx_error().
scx_error() will trigger scx_disable_workfn() which will turn on the bypass
mode again and unload the failed scheduler.

This moves the system out of bypass mode between the enable error path and
the disable path, which is unnecessary and can be brittle - e.g. the thread
running scx_enable() may already be on the failed scheduler and can be
switched out before it triggers scx_error() leading to a stall. The watchdog
would eventually kick in, so the situation isn't critical but is still
suboptimal.

There is nothing to be gained by turning off the bypass mode between
scx_enable() failure and scx_disable_workfn(). Keep bypass on.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-09-03 11:33:28 -10:00
Tejun Heo b7975c4869 sched_ext: Make explicit scx_task_iter_relock() calls unnecessary
During tasks iteration, the locks can be dropped using
scx_task_iter_unlock() to perform e.g. sleepable allocations. Afterwards,
scx_task_iter_relock() has to be called prior to other iteration operations,
which is error-prone. This can be easily automated by tracking whether
scx_tasks_lock is held in scx_task_iter and re-acquiring when necessary. It
already tracks whether the task's rq is locked after all.

- Add scx_task_iter->list_locked which remembers whether scx_tasks_lock is
  held.

- Rename scx_task_iter->locked to scx_task_iter->locked_task to better
  distinguish it from ->list_locked.

- Replace scx_task_iter_relock() with __scx_task_iter_maybe_relock() which
  is automatically called by scx_task_iter_next() and scx_task_iter_stop().

- Drop explicit scx_task_iter_relock() calls.

The resulting behavior should be equivalent.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-09-03 11:33:28 -10:00
Stanislav Fort 4540f1d23e audit: fix out-of-bounds read in audit_compare_dname_path()
When a watch on dir=/ is combined with an fsnotify event for a
single-character name directly under / (e.g., creating /a), an
out-of-bounds read can occur in audit_compare_dname_path().

The helper parent_len() returns 1 for "/". In audit_compare_dname_path(),
when parentlen equals the full path length (1), the code sets p = path + 1
and pathlen = 1 - 1 = 0. The subsequent loop then dereferences
p[pathlen - 1] (i.e., p[-1]), causing an out-of-bounds read.

Fix this by adding a pathlen > 0 check to the while loop condition
to prevent the out-of-bounds access.
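
A minimal sketch of the corrected loop shape (illustrative):

  /* Strip trailing slashes, but never step before the start of p. */
  while (pathlen > 0 && p[pathlen - 1] == '/')
          pathlen--;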

Cc: stable@vger.kernel.org
Fixes: e92eebb0d6 ("audit: fix suffixed '/' filename matching")
Reported-by: Stanislav Fort <disclosure@aisle.com>
Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Signed-off-by: Stanislav Fort <stanislav.fort@aisle.com>
[PM: subject tweak, sign-off email fixes]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2025-09-03 16:46:23 -04:00
Waiman Long e117ff1129 cgroup/cpuset: Prevent NULL pointer access in free_tmpmasks()
Commit 5806b3d051 ("cpuset: decouple tmpmasks and cpumasks freeing in
cgroup") separates out the freeing of tmpmasks into a new free_tmpmask()
helper but removes the NULL pointer check in the process. Unfortunately a
NULL pointer can be passed to free_tmpmasks() in cpuset_handle_hotplug()
if cpuset v1 is active. This can cause a segmentation fault and crash
the kernel.

Fix that by adding the NULL pointer check to free_tmpmasks().
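
A hedged sketch of the restored guard (field names as in
kernel/cgroup/cpuset.c):

  static inline void free_tmpmasks(struct tmpmasks *tmp)
  {
          if (!tmp)
                  return; /* cpuset_handle_hotplug() may pass NULL on v1 */

          free_cpumask_var(tmp->new_cpus);
          free_cpumask_var(tmp->addmask);
          free_cpumask_var(tmp->delmask);
  }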

Fixes: 5806b3d051 ("cpuset: decouple tmpmasks and cpumasks freeing in cgroup")
Reported-by: Ashay Jaiswal <quic_ashayj@quicinc.com>
Closes: https://lore.kernel.org/lkml/20250902-cpuset-free-on-condition-v1-1-f46ffab53eac@quicinc.com/
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-03 08:40:11 -10:00
Christian Loehle 5ebf512f33 sched: Fix sched_numa_find_nth_cpu() if mask offline
sched_numa_find_nth_cpu() uses a bsearch to look for the 'closest'
CPU in sched_domains_numa_masks and given cpus mask. However they
might not intersect if all CPUs in the cpus mask are offline. bsearch
will return NULL in that case, bail out instead of dereferencing a
bogus pointer.
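
A hedged sketch of the bail-out (variable names illustrative):

  /* bsearch() returns NULL when the masks and the cpus mask don't
   * intersect, e.g. when every CPU in the mask is offline. */
  found = bsearch(&key, masks, num_masks, sizeof(*masks), cmp);
  if (!found)
          return nr_cpu_ids;      /* callers treat cpu >= nr_cpu_ids as
                                   * -ENXIO, see the note below */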

The previous behaviour led to this bug when using maxcpus=4 on an
rk3399 (LLLLbb) (i.e. booting with all big CPUs offline):

[    1.422922] Unable to handle kernel paging request at virtual address ffffff8000000000
[    1.423635] Mem abort info:
[    1.423889]   ESR = 0x0000000096000006
[    1.424227]   EC = 0x25: DABT (current EL), IL = 32 bits
[    1.424715]   SET = 0, FnV = 0
[    1.424995]   EA = 0, S1PTW = 0
[    1.425279]   FSC = 0x06: level 2 translation fault
[    1.425735] Data abort info:
[    1.425998]   ISV = 0, ISS = 0x00000006, ISS2 = 0x00000000
[    1.426499]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[    1.426952]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[    1.427428] swapper pgtable: 4k pages, 39-bit VAs, pgdp=0000000004a9f000
[    1.428038] [ffffff8000000000] pgd=18000000f7fff403, p4d=18000000f7fff403, pud=18000000f7fff403, pmd=0000000000000000
[    1.429014] Internal error: Oops: 0000000096000006 [#1]  SMP
[    1.429525] Modules linked in:
[    1.429813] CPU: 3 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.17.0-rc4-dirty #343 PREEMPT
[    1.430559] Hardware name: Pine64 RockPro64 v2.1 (DT)
[    1.431012] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[    1.431634] pc : sched_numa_find_nth_cpu+0x2a0/0x488
[    1.432094] lr : sched_numa_find_nth_cpu+0x284/0x488
[    1.432543] sp : ffffffc084e1b960
[    1.432843] x29: ffffffc084e1b960 x28: ffffff80078a8800 x27: ffffffc0846eb1d0
[    1.433495] x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000
[    1.434144] x23: 0000000000000000 x22: fffffffffff7f093 x21: ffffffc081de6378
[    1.434792] x20: 0000000000000000 x19: 0000000ffff7f093 x18: 00000000ffffffff
[    1.435441] x17: 3030303866666666 x16: 66663d736b73616d x15: ffffffc104e1b5b7
[    1.436091] x14: 0000000000000000 x13: ffffffc084712860 x12: 0000000000000372
[    1.436739] x11: 0000000000000126 x10: ffffffc08476a860 x9 : ffffffc084712860
[    1.437389] x8 : 00000000ffffefff x7 : ffffffc08476a860 x6 : 0000000000000000
[    1.438036] x5 : 000000000000bff4 x4 : 0000000000000000 x3 : 0000000000000000
[    1.438683] x2 : 0000000000000000 x1 : ffffffc0846eb000 x0 : ffffff8000407b68
[    1.439332] Call trace:
[    1.439559]  sched_numa_find_nth_cpu+0x2a0/0x488 (P)
[    1.440016]  smp_call_function_any+0xc8/0xd0
[    1.440416]  armv8_pmu_init+0x58/0x27c
[    1.440770]  armv8_cortex_a72_pmu_init+0x20/0x2c
[    1.441199]  arm_pmu_device_probe+0x1e4/0x5e8
[    1.441603]  armv8_pmu_device_probe+0x1c/0x28
[    1.442007]  platform_probe+0x5c/0xac
[    1.442347]  really_probe+0xbc/0x298
[    1.442683]  __driver_probe_device+0x78/0x12c
[    1.443087]  driver_probe_device+0xdc/0x160
[    1.443475]  __driver_attach+0x94/0x19c
[    1.443833]  bus_for_each_dev+0x74/0xd4
[    1.444190]  driver_attach+0x24/0x30
[    1.444525]  bus_add_driver+0xe4/0x208
[    1.444874]  driver_register+0x60/0x128
[    1.445233]  __platform_driver_register+0x24/0x30
[    1.445662]  armv8_pmu_driver_init+0x28/0x4c
[    1.446059]  do_one_initcall+0x44/0x25c
[    1.446416]  kernel_init_freeable+0x1dc/0x3bc
[    1.446820]  kernel_init+0x20/0x1d8
[    1.447151]  ret_from_fork+0x10/0x20
[    1.447493] Code: 90022e21 f000e5f5 910de2b5 2a1703e2 (f8767803)
[    1.448040] ---[ end trace 0000000000000000 ]---
[    1.448483] note: swapper/0[1] exited with preempt_count 1
[    1.449047] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[    1.449741] SMP: stopping secondary CPUs
[    1.450105] Kernel Offset: disabled
[    1.450419] CPU features: 0x000000,00080000,20002001,0400421b
[    1.450935] Memory Limit: none
[    1.451217] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]---

Yury: with the fix, the function returns cpu == nr_cpu_ids, and later in

	smp_call_function_any ->
	  smp_call_function_single ->
	     generic_exec_single

we test the cpu for '>= nr_cpu_ids' and return -ENXIO. So everything is
handled correctly.

Fixes: cd7f55359c ("sched: add sched_numa_find_nth_cpu()")
Cc: stable@vger.kernel.org
Signed-off-by: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Yury Norov (NVIDIA) <yury.norov@gmail.com>
2025-09-03 12:20:06 -04:00
Brian Norris 8ad25ebfa7 genirq/test: Ensure CPU 1 is online for hotplug test
It's possible to run these tests on platforms that think they have a
hotpluggable CPU1, but for whatever reason, CPU1 is not online and can't be
brought online:

    # irq_cpuhotplug_test: EXPECTATION FAILED at kernel/irq/irq_test.c:210
    Expected remove_cpu(1) == 0, but
        remove_cpu(1) == 1 (0x1)
CPU1: failed to boot: -38
    # irq_cpuhotplug_test: EXPECTATION FAILED at kernel/irq/irq_test.c:214
    Expected add_cpu(1) == 0, but
        add_cpu(1) == -38 (0xffffffffffffffda)

Check that CPU1 is actually online before trying to run the test.
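
A minimal sketch of such a guard (assuming the KUnit context is 'test'; not
necessarily the literal patch):

    if (!cpu_online(1))
        kunit_skip(test, "CPU1 is not online, cannot test hotplug");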

Fixes: 66067c3c8a ("genirq: Add kunit tests for depth counts")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Brian Norris <briannorris@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: David Gow <davidgow@google.com>
Link: https://lore.kernel.org/all/20250822190140.2154646-7-briannorris@chromium.org
2025-09-03 17:04:52 +02:00
Brian Norris add03fdb9d genirq/test: Drop CONFIG_GENERIC_IRQ_MIGRATION assumptions
Not all platforms use the generic IRQ migration code, even if they select
GENERIC_IRQ_MIGRATION. (See, for example, powerpc / pseries_cpu_disable().)

If such platforms don't perform managed shutdown the same way, the interrupt
may not actually shut down, and these tests fail:

[    4.357022][  T101]     # irq_cpuhotplug_test: EXPECTATION FAILED at kernel/irq/irq_test.c:211
[    4.357022][  T101]     Expected irqd_is_activated(data) to be false, but is true
[    4.358128][  T101]     # irq_cpuhotplug_test: EXPECTATION FAILED at kernel/irq/irq_test.c:212
[    4.358128][  T101]     Expected irqd_is_started(data) to be false, but is true
[    4.375558][  T101]     # irq_cpuhotplug_test: EXPECTATION FAILED at kernel/irq/irq_test.c:216
[    4.375558][  T101]     Expected irqd_is_activated(data) to be false, but is true
[    4.376088][  T101]     # irq_cpuhotplug_test: EXPECTATION FAILED at kernel/irq/irq_test.c:217
[    4.376088][  T101]     Expected irqd_is_started(data) to be false, but is true
[    4.377851][    T1]     # irq_cpuhotplug_test: pass:0 fail:1 skip:0 total:1
[    4.377901][    T1]     not ok 4 irq_cpuhotplug_test
[    4.378073][    T1] # irq_test_cases: pass:3 fail:1 skip:0 total:4

Rather than test that PowerPC performs migration the same way as the
interrupt core, just drop the state checks. The point of the test was to
ensure that the code keeps |depth| balanced, which can still be tested.

Fixes: 66067c3c8a ("genirq: Add kunit tests for depth counts")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Brian Norris <briannorris@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: David Gow <davidgow@google.com>
Link: https://lore.kernel.org/all/20250822190140.2154646-6-briannorris@chromium.org
2025-09-03 17:04:52 +02:00
Brian Norris 0c888bc86d genirq/test: Depend on SPARSE_IRQ
Some architectures have a static interrupt layout, with a limited number of
interrupts. Without SPARSE_IRQ, the test may not be able to allocate any
fake interrupts, and the test will fail. (This occurs on ARCH=m68k, for
example.)

Additionally, managed-affinity is only supported with CONFIG_SPARSE_IRQ=y,
so irq_shutdown_depth_test() and irq_cpuhotplug_test() would fail without
it.

Add a 'SPARSE_IRQ' dependency to avoid these problems.
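
A hedged sketch of the Kconfig change (the symbol name here is illustrative):

    config IRQ_KUNIT_TEST
            bool "KUnit tests for IRQ management APIs" if !KUNIT_ALL_TESTS
            depends on KUNIT && SPARSE_IRQ
            default KUNIT_ALL_TESTS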

Many architectures 'select SPARSE_IRQ', so this is easy to miss.

Notably, this also excludes ARCH=um from running any of these tests, even
though some of them might work.

Fixes: 66067c3c8a ("genirq: Add kunit tests for depth counts")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Brian Norris <briannorris@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: David Gow <davidgow@google.com>
Link: https://lore.kernel.org/all/20250822190140.2154646-5-briannorris@chromium.org
2025-09-03 17:04:52 +02:00
Brian Norris 988f45467f genirq/test: Fail early if interrupt request fails
Requesting an interrupt is part of the basic test setup. If it fails, most
of the subsequent tests are likely to fail, and the output gets noisy.

Use "assert" to fail early.

Signed-off-by: Brian Norris <briannorris@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: David Gow <davidgow@google.com>
Link: https://lore.kernel.org/all/20250822190140.2154646-4-briannorris@chromium.org
2025-09-03 17:04:52 +02:00
Brian Norris 59405c248a genirq/test: Factor out fake-virq setup
A few setup steps need to be repeated across tests. Factor out the creation
of fake interrupts.

Signed-off-by: Brian Norris <briannorris@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: David Gow <davidgow@google.com>
Link: https://lore.kernel.org/all/20250822190140.2154646-3-briannorris@chromium.org
2025-09-03 17:04:52 +02:00
Brian Norris f8a44f9bab genirq/test: Select IRQ_DOMAIN
These tests use irq_domain_alloc_descs() and so require CONFIG_IRQ_DOMAIN.

Fixes: 66067c3c8a ("genirq: Add kunit tests for depth counts")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Brian Norris <briannorris@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: David Gow <davidgow@google.com>
Link: https://lore.kernel.org/all/20250822190140.2154646-2-briannorris@chromium.org
Closes: https://lore.kernel.org/lkml/ded44edf-eeb7-420c-b8a8-d6543b955e6e@roeck-us.net/
2025-09-03 17:04:52 +02:00
David Gow c9163915a9 genirq/test: Fix depth tests on architectures with NOREQUEST by default.
The new irq KUnit tests fail on some architectures (notably PowerPC and
32-bit ARM), as the request_irq() call fails due to the ARCH_IRQ_INIT_FLAGS
containing IRQ_NOREQUEST, yielding the following errors:

[10:17:45]     # irq_free_disabled_test: EXPECTATION FAILED at kernel/irq/irq_test.c:88
[10:17:45]     Expected ret == 0, but
[10:17:45]         ret == -22 (0xffffffffffffffea)
[10:17:45]     # irq_free_disabled_test: EXPECTATION FAILED at kernel/irq/irq_test.c:90
[10:17:45]     Expected desc->depth == 0, but
[10:17:45]         desc->depth == 1 (0x1)
[10:17:45]     # irq_free_disabled_test: EXPECTATION FAILED at kernel/irq/irq_test.c:93
[10:17:45]     Expected desc->depth == 1, but
[10:17:45]         desc->depth == 2 (0x2)

By clearing IRQ_NOREQUEST from the interrupt descriptor, these tests now
pass on ARM and PowerPC.
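
A sketch of the idea using the existing helper, with 'virq' being the test
interrupt:

    /* allow request_irq() on architectures that default to IRQ_NOREQUEST */
    irq_clear_status_flags(virq, IRQ_NOREQUEST);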

Fixes: 66067c3c8a ("genirq: Add kunit tests for depth counts")
Signed-off-by: David Gow <davidgow@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Tested-by: Brian Norris <briannorris@chromium.org>
Reviewed-by: Brian Norris <briannorris@chromium.org>
Link: https://lore.kernel.org/all/20250816094528.3560222-2-davidgow@google.com
2025-09-03 17:04:51 +02:00
Wladislav Wiebe 673f1244b3 genirq: Add support for warning on long-running interrupt handlers
Introduce a mechanism to detect and warn about prolonged interrupt handlers.
With a new command-line parameter (irqhandler.duration_warn_us=), users can
configure the duration threshold in microseconds when a warning in such
format should be emitted:

"[CPU14] long duration of IRQ[159:bad_irq_handler [long_irq]], took: 1330 us"

The implementation uses local_clock() to measure the execution duration of the
generic IRQ per-CPU event handler.
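
Roughly, the measurement wraps the handler invocation like this (variable
and threshold names are illustrative, not the exact patch):

    u64 ts, duration;
    irqreturn_t res;

    ts = local_clock();
    res = action->handler(irq, dev_id);
    duration = local_clock() - ts;

    if (unlikely(duration > duration_warn_ns))
            pr_warn_ratelimited("[CPU%d] long duration of IRQ[%d:%ps], took: %llu us\n",
                                smp_processor_id(), irq, action->handler,
                                div_u64(duration, NSEC_PER_USEC));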

Signed-off-by: Wladislav Wiebe <wladislav.wiebe@nokia.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jiri Slaby <jirislaby@kernel.org>
Link: https://lore.kernel.org/all/20250804093525.851-1-wladislav.wiebe@nokia.com
2025-09-03 16:10:40 +02:00
Thomas Weißschuh 762af5a2aa vdso/vsyscall: Avoid slow division loop in auxiliary clock update
The call to __iter_div_u64_rem() in vdso_time_update_aux() is a wrapper
around subtraction. It cannot be used to divide large numbers, as that
introduces long, computationally expensive delays.  A regular u64 division
is also not possible in the timekeeper update path as it can be too slow.

Instead of splitting the ktime_t offset into second and subsecond
components during the timekeeper update fast-path, do it together with the
adjustment of tk->offs_aux in the slow-path, equivalent to the handling of
offs_boot and monotonic_to_boot.

Reuse the storage of monotonic_to_boot for the new field, as it is not used
by auxiliary timekeepers.
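
Conceptually, the one-time split in the slow path can look like (field and
variable names are illustrative):

    /* timekeeper update slow path: a real division is acceptable here */
    u32 nsec;
    u64 sec = div_u64_rem(ktime_to_ns(tk->offs_aux), NSEC_PER_SEC, &nsec);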

Fixes: 380b84e168 ("vdso/vsyscall: Update auxiliary clock data in the datapage")
Reported-by: Miroslav Lichvar <mlichvar@redhat.com>
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250825-vdso-auxclock-division-v1-1-a1d32a16a313@linutronix.de
Closes: https://lore.kernel.org/lkml/aKwsNNWsHJg8IKzj@localhost/
2025-09-03 11:55:11 +02:00
Kan Liang 18dbcbfabf perf: Fix the POLL_HUP delivery breakage
The event_limit can be set by the PERF_EVENT_IOC_REFRESH to limit the
number of events. When the event_limit reaches 0, the POLL_HUP signal
should be sent. But it's missed.

The corresponding counter should be stopped when the event_limit reaches
0. It was implemented in the ARCH-specific code. However, since the
commit 9734e25fbf ("perf: Fix the throttle logic for a group"), all
the ARCH-specific code has been moved to the generic code. The code to
handle the event_limit was lost.

Add the event->pmu->stop(event, 0); back.
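
In __perf_event_overflow() terms, the restored logic looks roughly like
(a sketch based on the description above):

    int events = atomic_read(&event->event_limit);

    if (events && atomic_dec_and_test(&event->event_limit)) {
            ret = 1;                        /* no more samples */
            event->pending_kill = POLL_HUP; /* signal delivery requested */
            event->pmu->stop(event, 0);     /* the line that was lost */
    }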

Fixes: 9734e25fbf ("perf: Fix the throttle logic for a group")
Closes: https://lore.kernel.org/lkml/aICYAqM5EQUlTqtX@li-2b55cdcc-350b-11b2-a85c-a78bff51fc11.ibm.com/
Reported-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Link: https://lkml.kernel.org/r/20250811182644.1305952-1-kan.liang@linux.intel.com
2025-09-03 10:10:59 +02:00
Aaron Lu 5b726e9bf9 sched/fair: Get rid of throttled_lb_pair()
Now that throttled tasks are dequeued and cannot stay on the rq's cfs_tasks
list, there is no need to take special care of these throttled tasks
during load balancing anymore.

Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Matteo Martelli <matteo.martelli@codethink.co.uk>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20250829081120.806-6-ziqianlu@bytedance.com
2025-09-03 10:03:14 +02:00
Aaron Lu eb962f251f sched/fair: Task based throttle time accounting
With the task based throttle model, the previous way of checking cfs_rq's
nr_queued to decide whether throttled time should be accounted doesn't work
as expected. For example, when a cfs_rq that has a single task is throttled,
that task could later block in kernel mode instead of being dequeued onto
the limbo list, and accounting this as throttled time is not accurate.

Rework throttle time accounting for a cfs_rq as follows:
- start accounting when the first task gets throttled in its hierarchy;
- stop accounting on unthrottle.

Note that there will be a time gap between when a cfs_rq is throttled
and when a task in its hierarchy is actually throttled. This accounting
mechanism only starts accounting in the latter case.

Suggested-by: Chengming Zhou <chengming.zhou@linux.dev> # accounting mechanism
Co-developed-by: K Prateek Nayak <kprateek.nayak@amd.com> # simplify implementation
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Matteo Martelli <matteo.martelli@codethink.co.uk>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20250829081120.806-5-ziqianlu@bytedance.com
2025-09-03 10:03:14 +02:00
Valentin Schneider e1fad12dcb sched/fair: Switch to task based throttle model
In the current throttle model, when a cfs_rq is throttled, its entity is
dequeued from the cpu's rq, making tasks attached to it unable to run,
thus achieving the throttle target.

This has a drawback though: assume a task is a reader of percpu_rwsem
and is waiting. When it gets woken, it cannot run till its task group's
next period comes, which can be a relatively long time. The waiting
writer has to wait longer because of this, further readers build up, and
eventually a task hang is triggered.

To improve this situation, change the throttle model to be task based,
i.e. when a cfs_rq is throttled, record its throttled status but do not
remove it from the cpu's rq. Instead, for tasks that belong to this
cfs_rq, add a task work to them when they get picked, so that they are
dequeued when they return to user space. This way, throttled tasks do
not hold any kernel resources, and on unthrottle they are enqueued back
so they can continue to run.

A throttled cfs_rq's PELT clock is handled differently now: previously
the cfs_rq's PELT clock was stopped once it entered the throttled state,
but since tasks (in kernel mode) can now continue to run, change the
behaviour to stop the PELT clock only when the throttled cfs_rq has no
tasks left.
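
As an illustration of the pick-time tagging (the helper and the task_struct
field here are assumptions, not the literal patch):

    /* on pick: defer the actual dequeue to the return-to-user path */
    if (cfs_rq_throttled(cfs_rq_of(&p->se)))
            task_work_add(p, &p->sched_throttle_work, TWA_RESUME);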

Suggested-by: Chengming Zhou <chengming.zhou@linux.dev> # tag on pick
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Chen Yu <yu.c.chen@intel.com>
Tested-by: Matteo Martelli <matteo.martelli@codethink.co.uk>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20250829081120.806-4-ziqianlu@bytedance.com
2025-09-03 10:03:14 +02:00
Valentin Schneider 7fc2d14392 sched/fair: Implement throttle task work and related helpers
Implement throttle_cfs_rq_work() task work which gets executed on task's
ret2user path where the task is dequeued and marked as throttled.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Matteo Martelli <matteo.martelli@codethink.co.uk>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20250829081120.806-3-ziqianlu@bytedance.com
2025-09-03 10:03:13 +02:00
Valentin Schneider 2cd571245b sched/fair: Add related data structure for task based throttle
Add related data structures for this new throttle functionality.

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Matteo Martelli <matteo.martelli@codethink.co.uk>
Link: https://lore.kernel.org/r/20250829081120.806-2-ziqianlu@bytedance.com
2025-09-03 10:03:13 +02:00
Peter Zijlstra 91c614f09a sched: Move SDTL_INIT() functions out-of-line
Since all these functions are address-taken in SDTL_INIT() and called
indirectly, it doesn't really make sense for them to be inline.

Suggested-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2025-09-03 10:03:13 +02:00
Peter Zijlstra 661f951e37 sched/fair: Get rid of sched_domains_curr_level hack for tl->cpumask()
Leon [1] and Vinicius [2] noted a topology_span_sane() warning during
their testing starting from v6.16-rc1. Debug that followed pointed to
the tl->mask() for the NODE domain being incorrectly resolved to that of
the highest NUMA domain.

tl->mask() for NODE is set to the sd_numa_mask() which depends on the
global "sched_domains_curr_level" hack. "sched_domains_curr_level" is
set to the "tl->numa_level" during tl traversal in build_sched_domains()
calling sd_init() but was not reset before topology_span_sane().

Since "tl->numa_level" still reflected the old value from
build_sched_domains(), topology_span_sane() for the NODE domain trips
when the span of the last NUMA domain overlaps.

Instead of replicating the "sched_domains_curr_level" hack, get rid of
it entirely and instead, pass the entire "sched_domain_topology_level"
object to tl->cpumask() function to prevent such mishap in the future.

sd_numa_mask() now directly references "tl->numa_level" instead of
relying on the global "sched_domains_curr_level" hack to index into
sched_domains_numa_masks[].
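
The shape of the change, roughly (a sketch):

    -typedef const struct cpumask *(*sched_domain_mask_f)(int cpu);
    +typedef const struct cpumask *(*sched_domain_mask_f)(struct sched_domain_topology_level *tl, int cpu);

     static const struct cpumask *sd_numa_mask(struct sched_domain_topology_level *tl, int cpu)
     {
    -        return sched_domains_numa_masks[sched_domains_curr_level][cpu_to_node(cpu)];
    +        return sched_domains_numa_masks[tl->numa_level][cpu_to_node(cpu)];
     }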

The original warning was reproducible on the following NUMA topology
reported by Leon:

    $ sudo numactl -H
    available: 5 nodes (0-4)
    node 0 cpus: 0 1
    node 0 size: 2927 MB
    node 0 free: 1603 MB
    node 1 cpus: 2 3
    node 1 size: 3023 MB
    node 1 free: 3008 MB
    node 2 cpus: 4 5
    node 2 size: 3023 MB
    node 2 free: 3007 MB
    node 3 cpus: 6 7
    node 3 size: 3023 MB
    node 3 free: 3002 MB
    node 4 cpus: 8 9
    node 4 size: 3022 MB
    node 4 free: 2718 MB
    node distances:
    node   0   1   2   3   4
      0:  10  39  38  37  36
      1:  39  10  38  37  36
      2:  38  38  10  37  36
      3:  37  37  37  10  36
      4:  36  36  36  36  10

The above topology can be mimicked using the following QEMU cmd that was
used to reproduce the warning and test the fix:

     sudo qemu-system-x86_64 -enable-kvm -cpu host \
     -m 20G -smp cpus=10,sockets=10 -machine q35 \
     -object memory-backend-ram,size=4G,id=m0 \
     -object memory-backend-ram,size=4G,id=m1 \
     -object memory-backend-ram,size=4G,id=m2 \
     -object memory-backend-ram,size=4G,id=m3 \
     -object memory-backend-ram,size=4G,id=m4 \
     -numa node,cpus=0-1,memdev=m0,nodeid=0 \
     -numa node,cpus=2-3,memdev=m1,nodeid=1 \
     -numa node,cpus=4-5,memdev=m2,nodeid=2 \
     -numa node,cpus=6-7,memdev=m3,nodeid=3 \
     -numa node,cpus=8-9,memdev=m4,nodeid=4 \
     -numa dist,src=0,dst=1,val=39 \
     -numa dist,src=0,dst=2,val=38 \
     -numa dist,src=0,dst=3,val=37 \
     -numa dist,src=0,dst=4,val=36 \
     -numa dist,src=1,dst=0,val=39 \
     -numa dist,src=1,dst=2,val=38 \
     -numa dist,src=1,dst=3,val=37 \
     -numa dist,src=1,dst=4,val=36 \
     -numa dist,src=2,dst=0,val=38 \
     -numa dist,src=2,dst=1,val=38 \
     -numa dist,src=2,dst=3,val=37 \
     -numa dist,src=2,dst=4,val=36 \
     -numa dist,src=3,dst=0,val=37 \
     -numa dist,src=3,dst=1,val=37 \
     -numa dist,src=3,dst=2,val=37 \
     -numa dist,src=3,dst=4,val=36 \
     -numa dist,src=4,dst=0,val=36 \
     -numa dist,src=4,dst=1,val=36 \
     -numa dist,src=4,dst=2,val=36 \
     -numa dist,src=4,dst=3,val=36 \
     ...

  [ prateek: Moved common functions to include/linux/sched/topology.h,
    reuse the common bits for s390 and ppc, commit message ]

Closes: https://lore.kernel.org/lkml/20250610110701.GA256154@unreal/ [1]
Fixes: ccf74128d6 ("sched/topology: Assert non-NUMA topology masks don't (partially) overlap") # ce29a7da84, f55dac1daf
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reported-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Tested-by: Valentin Schneider <vschneid@redhat.com> # x86
Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com> # powerpc
Link: https://lore.kernel.org/lkml/a3de98387abad28592e6ab591f3ff6107fe01dc1.1755893468.git.tim.c.chen@linux.intel.com/ [2]
2025-09-03 10:03:12 +02:00
Harshit Agarwal 8fd5485fb4 sched/deadline: Fix race in push_dl_task()
When a CPU chooses to call push_dl_task and picks a task to push to
another CPU's runqueue, it calls the find_lock_later_rq method, which
takes a double lock on both CPUs' runqueues. If one of the locks isn't
readily available, it may lead to dropping the current runqueue lock and
reacquiring both locks at once. During this window it is possible that
the task has already migrated and is running on some other CPU. These
cases are already handled. However, if the task has migrated, has
already been executed, and another CPU is now trying to wake it up
(ttwu) such that it is queued again on the runqueue (on_rq is 1), and if
it was last run by the same CPU, then the current checks will pass even
though the task was migrated out and is no longer in the pushable tasks
list.
Please go through the original rt change for more details on the issue.

To fix this, after the lock is obtained inside find_lock_later_rq,
ensure that the task is still at the head of the pushable tasks list.
Also remove some checks that are no longer needed with the addition of
this new check. However, the new check of the pushable tasks list only
applies when find_lock_later_rq is called by push_dl_task. For the other
caller, i.e. dl_task_offline_migration, the existing checks are used.
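
The new head-of-list check can be sketched as (pick_next_pushable_dl_task()
assumed from deadline.c; not the literal patch):

    /* in find_lock_later_rq(), after the double lock was (re)taken */
    if (unlikely(task != pick_next_pushable_dl_task(rq))) {
            double_unlock_balance(rq, later_rq);
            later_rq = NULL;        /* task moved under us, give up */
    }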

Signed-off-by: Harshit Agarwal <harshit@nutanix.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250408045021.3283624-1-harshit@nutanix.com
2025-09-03 10:03:12 +02:00
Luo Gengkun 3d62ab32df tracing: Fix tracing_marker may trigger page fault during preempt_disable
Both tracing_mark_write and tracing_mark_raw_write call
__copy_from_user_inatomic while preemption is disabled. But in some cases
__copy_from_user_inatomic may trigger a page fault, which subtly calls
schedule(). If the task is then migrated to another cpu, the following
warning will be triggered:
        if (RB_WARN_ON(cpu_buffer,
                       !local_read(&cpu_buffer->committing)))

An example can illustrate this issue:

process flow						CPU
---------------------------------------------------------------------

tracing_mark_raw_write():				cpu:0
   ...
   ring_buffer_lock_reserve():				cpu:0
      ...
      cpu = raw_smp_processor_id()			cpu:0
      cpu_buffer = buffer->buffers[cpu]			cpu:0
      ...
   ...
   __copy_from_user_inatomic():				cpu:0
      ...
      # page fault
      do_mem_abort():					cpu:0
         ...
         # Call schedule
         schedule()					cpu:0
	 ...
   # the task schedule to cpu1
   __buffer_unlock_commit():				cpu:1
      ...
      ring_buffer_unlock_commit():			cpu:1
	 ...
	 cpu = raw_smp_processor_id()			cpu:1
	 cpu_buffer = buffer->buffers[cpu]		cpu:1

As shown above, the process acquires the cpu id twice and the return
values are not the same.

To fix this problem, use copy_from_user_nofault instead of
__copy_from_user_inatomic, as the former performs an 'access_ok' check
before copying.
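
The substance of the fix is a one-line swap, roughly (buffer names for
illustration):

    -   len = __copy_from_user_inatomic(&entry->id, ubuf, cnt);
    +   len = copy_from_user_nofault(&entry->id, ubuf, cnt);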

Link: https://lore.kernel.org/20250819105152.2766363-1-luogengkun@huaweicloud.com
Fixes: 656c7f0d2d ("tracing: Replace kmap with copy_from_user() in trace_marker writing")
Signed-off-by: Luo Gengkun <luogengkun@huaweicloud.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-09-02 12:02:42 -04:00
Qianfeng Rong 81ac63321e trace: Remove redundant __GFP_NOWARN
Commit 16f5dfbc85 ("gfp: include __GFP_NOWARN in GFP_NOWAIT")
made GFP_NOWAIT implicitly include __GFP_NOWARN.

Therefore, explicit __GFP_NOWARN combined with GFP_NOWAIT
(e.g., `GFP_NOWAIT | __GFP_NOWARN`) is now redundant. Let's clean
up these redundant flags across subsystems.

No functional changes.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250805023630.335719-1-rongqianfeng@vivo.com
Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-09-02 11:48:16 -04:00
Feng Yang e4980fa646 bpf: Replace kvfree with kfree for kzalloc memory
These pointers are allocated by kzalloc. Therefore, replace kvfree() with
kfree() to avoid unnecessary is_vmalloc_addr() check in kvfree(). This is
the remaining unmodified part from [1].

Signed-off-by: Feng Yang <yangfeng@kylinos.cn>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20250811123949.552885-1-rongqianfeng@vivo.com [1]
Link: https://lore.kernel.org/bpf/20250827032812.498216-1-yangfeng59949@163.com
2025-09-02 17:29:52 +02:00
Aleksa Sarai 7df8782012 pidns: move is-ancestor logic to helper
This check will be needed in later patches, and there's no point
open-coding it each time.
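
A plausible shape for such a helper (a sketch; name and placement may
differ):

    static bool pidns_is_ancestor(struct pid_namespace *ancestor,
                                  struct pid_namespace *child)
    {
            struct pid_namespace *ns;

            for (ns = child; ns; ns = ns->parent)
                    if (ns == ancestor)
                            return true;
            return false;
    }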

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Link: https://lore.kernel.org/20250805-procfs-pidns-api-v4-1-705f984940e7@cyphar.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-02 11:37:24 +02:00
Baochen Qiang 7e2368a217 dma-debug: don't enforce dma mapping check on noncoherent allocations
As discussed in [1], there is no need to enforce the dma mapping check on
noncoherent allocations; a simple test on the returned CPU address is
good enough.

Add a new pair of debug helpers and use them for noncoherent alloc/free
to fix this issue.

Fixes: efa70f2fdc ("dma-mapping: add a new dma_alloc_pages API")
Link: https://lore.kernel.org/all/ff6c1fe6-820f-4e58-8395-df06aa91706c@oss.qualcomm.com # 1
Signed-off-by: Baochen Qiang <baochen.qiang@oss.qualcomm.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/20250828-dma-debug-fix-noncoherent-dma-check-v1-1-76e9be0dd7fc@oss.qualcomm.com
2025-09-02 10:18:16 +02:00
Simon Schuster edd3cb05c0 copy_process: pass clone_flags as u64 across calltree
With the introduction of clone3 in commit 7f192e3cd3 ("fork: add
clone3") the effective bit width of clone_flags on all architectures was
increased from 32-bit to 64-bit, with a new type of u64 for the flags.
However, for most consumers of clone_flags the interface was not
changed from the previous type of unsigned long.

While this works fine as long as none of the new 64-bit flag bits
(CLONE_CLEAR_SIGHAND and CLONE_INTO_CGROUP) are evaluated, this is still
undesirable in terms of the principle of least surprise.

Thus, this commit fixes all relevant interfaces of callees to
sys_clone3/copy_process (excluding the architecture-specific
copy_thread) to consistently pass clone_flags as u64, so that
no truncation to 32-bit integers occurs on 32-bit architectures.
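
The pattern repeated across the call tree is simply, e.g.:

    -static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
    +static int copy_mm(u64 clone_flags, struct task_struct *tsk)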

Signed-off-by: Simon Schuster <schuster.simon@siemens-energy.com>
Link: https://lore.kernel.org/20250901-nios2-implement-clone3-v2-2-53fcf5577d57@siemens-energy.com
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-01 15:31:34 +02:00
Simon Schuster 04ff48239f copy_sighand: Handle architectures where sizeof(unsigned long) < sizeof(u64)
With the introduction of clone3 in commit 7f192e3cd3 ("fork: add
clone3") the effective bit width of clone_flags on all architectures was
increased from 32-bit to 64-bit. However, the signature of the copy_*
helper functions (e.g., copy_sighand) used by copy_process was not
adapted.

As such, they truncate the flags on any 32-bit architecture that
supports clone3 (arc, arm, csky, m68k, microblaze, mips32, openrisc,
parisc32, powerpc32, riscv32, x86-32 and xtensa).

For copy_sighand with CLONE_CLEAR_SIGHAND being an actual u64
constant, this triggers an observable bug in kernel selftest
clone3_clear_sighand:

        if (clone_flags & CLONE_CLEAR_SIGHAND)

in function copy_sighand within fork.c will always fail given:

        unsigned long /* == uint32_t */ clone_flags
        #define CLONE_CLEAR_SIGHAND 0x100000000ULL

This commit fixes the bug by always passing clone_flags to copy_sighand
via their declared u64 type, invariant of architecture-dependent integer
sizes.
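
A worked example of the truncation on a 32-bit architecture:

    u64 flags64 = CLONE_CLEAR_SIGHAND;      /* 0x100000000ULL */
    unsigned long flags32 = flags64;        /* truncated to 0x0, bit 32 is lost */

    /* (flags32 & CLONE_CLEAR_SIGHAND) is therefore always 0 */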

Fixes: b612e5df45 ("clone3: add CLONE_CLEAR_SIGHAND")
Cc: stable@vger.kernel.org # linux-5.5+
Signed-off-by: Simon Schuster <schuster.simon@siemens-energy.com>
Link: https://lore.kernel.org/20250901-nios2-implement-clone3-v2-1-53fcf5577d57@siemens-energy.com
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-01 15:31:33 +02:00
Li Jun 98da8a4aec PM: hibernate: Fix typo in memory bitmaps description comment
Correct 'leave' to 'leaf' in memory bitmaps description comment.

Signed-off-by: Li Jun <lijun01@kylinos.cn>
Link: https://patch.msgid.link/20250819104038.1596952-1-lijun01@kylinos.cn
[ rjw: Subject and changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-09-01 11:55:53 +02:00
Qianfeng Rong 5545d56fd1 PM: hibernate: Use vmalloc_array() and vcalloc() to improve code
Remove array_size() calls and replace vmalloc() and vzalloc() with
vmalloc_array() and vcalloc() respectively to simplify the code in
save_compressed_image() and load_compressed_image().

vmalloc_array() is also optimized better, resulting in fewer
instructions being used, and its handling overhead is lower [1].
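
The transformation is mechanical, e.g. (variable names for illustration):

    -   data = vmalloc(array_size(nr_threads, sizeof(*data)));
    +   data = vmalloc_array(nr_threads, sizeof(*data));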

Link: https://lore.kernel.org/lkml/abc66ec5-85a4-47e1-9759-2f60ab111971@vivo.com/ [1]
Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com>
Link: https://patch.msgid.link/20250817083636.53872-1-rongqianfeng@vivo.com
[ rjw: Changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-09-01 11:55:53 +02:00
Linus Torvalds fe3ad7a58b - Fix a stall on the CPU offline path due to mis-counting a deadline server
task twice as part of the runqueue's running tasks count
 
 - Fix a realtime task starvation case where failure to enqueue a timer whose
   expiration time is already in the past would cause repeated attempts to
   re-enqueue a deadline server task which leads to starving the former,
   realtime one
 
 - Prevent a delayed deadline server task stop from breaking the per-runqueue
   bandwidth tracking
 
 - Have a function checking whether the deadline server task has stopped,
   return the correct value
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmi0GaYACgkQEsHwGGHe
 VUphpRAAuI0Ra8fDwy+VRLoHYVseGHw/SZuffgd6WVu1J5Vag9TSJzl4+mDBsLVU
 42+b5+KSWbnX5zMDKgOhx7MU6AOgR3UCDlQEAMtKYI1CFD7ADe+Jwr45jG32Z50s
 bAl4LHhFeTHJx+jLP5Ez5tTwCTc2/Q7UbhadpGgTLQOhvrFPDwsDjrlMgClXgYU4
 DNEF6s6m9X31UJ/jnZNJQ7VeXa6SdqNo2fBZU+SoY1J8GYzGgcUDlCNLD0SnQwMe
 CgGCnYyzOjl9oNdKV2Z14ruCZwkfhv3hlVt0qHwlRKiP8OOdNKWN0FMIAtvyD0QM
 IQXITVIsc+T/diihZEGNR7wHRd0vhZ/cZPLoPjUQ7mNYB4IuCtWJrbLZf4W17CbG
 0clZ/OxG0EmOTKSuxBOxjg5tUtWI9ZqBHPFvBXFFl+6AhHTb1QK0hriAqbaqe0t6
 rOmohWKqg55yQxuhr0VXUgHy4Oq4u4WBZCF1OH02wtk6w87EHawuWPrULp5jR2iM
 BUXazn8CiTc13IBm+NhO9X45GfH1wIHC0Uhul+gWhylzG6gFRWN0CNixqPjd/7M7
 GS5gpH7xVs6Qe5DmAG9WHIXGLHPhda8OkyvzYK5MMtwJ7zpdPvqH3LCbO/uXspMy
 qYJWG+z3ni09SBX6EnZGLjenzOApsRuYL85NvFK8lOv6LG/SSO0=
 =NXg1
 -----END PGP SIGNATURE-----

Merge tag 'sched_urgent_for_v6.17_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Borislav Petkov:

 - Fix a stall on the CPU offline path due to mis-counting a deadline
   server task twice as part of the runqueue's running tasks count

 - Fix a realtime task starvation case where failure to enqueue a timer
   whose expiration time is already in the past would cause repeated
   attempts to re-enqueue a deadline server task which leads to starving
   the former, realtime one

 - Prevent a delayed deadline server task stop from breaking the
   per-runqueue bandwidth tracking

 - Have a function checking whether the deadline server task has
   stopped, return the correct value

* tag 'sched_urgent_for_v6.17_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/deadline: Don't count nr_running for dl_server proxy tasks
  sched/deadline: Fix RT task potential starvation when expiry time passed
  sched/deadline: Always stop dl-server before changing parameters
  sched/deadline: Fix dl_server_stopped()
2025-08-31 09:13:00 -07:00
Sebastian Andrzej Siewior d9b05321e2 futex: Move futex_hash_free() back to __mmput()
To avoid a memory leak via mm_alloc() + mmdrop(), the futex cleanup code
has been moved to __mmdrop(). This resulted in a warning if the futex
hash table had been allocated via vmalloc() and mmdrop() was invoked
from atomic context. The free path must stay in __mmput() to ensure it
is invoked from preemptible context.

In order to avoid the memory leak, delay the allocation of
mm_struct::mm->futex_ref to futex_hash_allocate(). This works because
neither the per-CPU counter nor the private hash has been allocated and
therefore
- futex_private_hash() callers (such as exit_pi_state_list()) don't
  acquire reference if there is no private hash yet. There is also no
  reference put.

- Regular callers (futex_hash()) fallback to global hash. No reference
  counting here.

The futex_ref member can be allocated in futex_hash_allocate() before
the private hash itself is allocated. This happens either while the
first thread is created or on request. In both cases the process has
just a single thread, so there can be either a futex operation in
progress or the request to create a private hash, but not both.

Move futex_hash_free() back to __mmput();
Move the allocation of mm_struct::futex_ref to futex_hash_allocate().

  [ bp: Fold a follow-up fix to prevent a use-after-free:
    https://lore.kernel.org/r/20250830213806.sEKuuGSm@linutronix.de ]

Fixes: e703b7e247 ("futex: Move futex cleanup to __mmdrop()")
Closes: https://lore.kernel.org/all/20250821102721.6deae493@kernel.org/
Reported-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lkml.kernel.org/r/20250822141238.PfnkTjFb@linutronix.de
2025-08-31 11:48:19 +02:00
Casey Schaufler 0ffbc876d0 audit: add record for multiple object contexts
Create a new audit record AUDIT_MAC_OBJ_CONTEXTS.
An example of the MAC_OBJ_CONTEXTS record is:

    type=MAC_OBJ_CONTEXTS
      msg=audit(1601152467.009:1050):
      obj_selinux=unconfined_u:object_r:user_home_t:s0

When an audit event includes an AUDIT_MAC_OBJ_CONTEXTS record
the "obj=" field in other records in the event will be "obj=?".
An AUDIT_MAC_OBJ_CONTEXTS record is supplied when the system has
multiple security modules that may make access decisions based
on an object security context.

Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
[PM: subj tweak, audit example readability indents]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2025-08-30 10:15:30 -04:00
Casey Schaufler eb59d494ee audit: add record for multiple task security contexts
Replace the single skb pointer in an audit_buffer with a list of
skb pointers. Add the audit_stamp information to the audit_buffer as
there's no guarantee that there will be an audit_context containing
the stamp associated with the event. At audit_log_end() time, create
the auxiliary records that have been added to the list. Functions are
created to manage the skb list in the audit_buffer.

Create a new audit record AUDIT_MAC_TASK_CONTEXTS.
An example of the MAC_TASK_CONTEXTS record is:

    type=MAC_TASK_CONTEXTS
      msg=audit(1600880931.832:113)
      subj_apparmor=unconfined
      subj_smack=_

When an audit event includes an AUDIT_MAC_TASK_CONTEXTS record the
"subj=" field in other records in the event will be "subj=?".
An AUDIT_MAC_TASK_CONTEXTS record is supplied when the system has
multiple security modules that may make access decisions based on a
subject security context.

Refactor audit_log_task_context(), creating a new audit_log_subj_ctx().
This is used in netlabel auditing to provide multiple subject security
contexts as necessary.

Suggested-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
[PM: subj tweak, audit example readability indents]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2025-08-30 10:15:30 -04:00
Casey Schaufler a59076f266 lsm: security_lsmblob_to_secctx module selection
Add a parameter lsmid to security_lsmblob_to_secctx() to identify which
of the security modules that may be active should provide the security
context. If the value of lsmid is LSM_ID_UNDEF the first LSM providing
a hook is used. security_secid_to_secctx() is unchanged, and will
always report the first LSM providing a hook.

Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
[PM: subj tweak]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2025-08-30 10:15:29 -04:00
Casey Schaufler 0a561e3904 audit: create audit_stamp structure
Replace the timestamp and serial number pair used in audit records
with a structure containing the two elements.

Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
[PM: subj tweak]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2025-08-30 10:15:28 -04:00
Jakub Kicinski d23ad54de7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.17-rc4).

No conflicts.

Adjacent changes:

drivers/net/ethernet/intel/idpf/idpf_txrx.c
  02614eee26 ("idpf: do not linearize big TSO packets")
  6c4e684802 ("idpf: remove obsolete stashing code")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-29 11:48:01 -07:00
Linus Torvalds 4d28e28098 dma-mapping fixes for Linux 6.17
- another small fix relevant to arm64 systems with memory encryption
 (Shanker Donthineni)
 - fix relevant to arm32 systems with non-standard CMA configuration
 (Oreoluwa Babatunde)
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSrngzkoBtlA8uaaJ+Jp1EFxbsSRAUCaLAK7gAKCRCJp1EFxbsS
 REDxAQC+5hLiyzc/1rR5EQb1D6Xr1f/0VN3IFz3creHp3juFBAEApi1iFMdmahO7
 0YKG4KkzHpcNkGrxaXKP0VNtQsDLwww=
 =fVrB
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-6.17-2025-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux

Pull dma-mapping fixes from Marek Szyprowski:

 - another small fix for arm64 systems with memory encryption (Shanker
   Donthineni)

 - fix for arm32 systems with non-standard CMA configuration (Oreoluwa
   Babatunde)

* tag 'dma-mapping-6.17-2025-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
  dma/pool: Ensure DMA_DIRECT_REMAP allocations are decrypted
  of: reserved_mem: Restructure call site for dma_contiguous_early_fixup()
2025-08-28 16:04:14 -07:00
Nandakumar Edamana 1df7dad4d5 bpf: Improve the general precision of tnum_mul
Drop the value-mask decomposition technique and adopt straightforward
long-multiplication with a twist: when LSB(a) is uncertain, find the
two partial products (for LSB(a) = known 0 and LSB(a) = known 1) and
take a union.

Experiments show that applying this technique in long multiplication
improves the precision in a significant number of cases, at the cost
of losing precision in a relatively small number of cases.
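
A hedged sketch of the approach (TNUM() and tnum_union() are assumed
internal helpers; not necessarily the exact code):

    struct tnum tnum_mul(struct tnum a, struct tnum b)
    {
            struct tnum acc = TNUM(0, 0);

            while (a.value || a.mask) {
                    if (a.value & 1)        /* LSB(a) known 1: add partial product b */
                            acc = tnum_add(acc, b);
                    else if (a.mask & 1)    /* LSB(a) unknown: union of both cases */
                            acc = tnum_union(acc, tnum_add(acc, b));
                    /* LSB(a) known 0: contributes nothing */
                    a = tnum_rshift(a, 1);
                    b = tnum_lshift(b, 1);
            }
            return acc;
    }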

Signed-off-by: Nandakumar Edamana <nandakumar@nandakumar.co.in>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Tested-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com>
Reviewed-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20250826034524.2159515-1-nandakumar@nandakumar.co.in
2025-08-27 15:00:26 -07:00
Marie Zhussupova b9a214b5f6 kunit: Pass parameterized test context to generate_params()
To enable more complex parameterized testing scenarios, the
generate_params() function needs additional context beyond just
the previously generated parameter. This patch modifies the
generate_params() function signature to include an extra
`struct kunit *test` argument, giving test users access to the
parameterized test context when generating parameters.

The `struct kunit *test` argument was added as the first parameter
to the function signature as it aligns with the convention of other
KUnit functions that accept `struct kunit *test` first. This also
mirrors the "this" or "self" reference found in object-oriented
programming languages.

This patch also modifies xe_pci_live_device_gen_param() in xe_pci.c
and nthreads_gen_params() in kcsan_test.c to reflect this signature
change.
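
In diff form, the signature change looks roughly like:

    -static const void *nthreads_gen_params(const void *prev, char *desc)
    +static const void *nthreads_gen_params(struct kunit *test, const void *prev, char *desc)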

Link: https://lore.kernel.org/r/20250826091341.1427123-4-davidgow@google.com
Reviewed-by: David Gow <davidgow@google.com>
Reviewed-by: Rae Moar <rmoar@google.com>
Acked-by: Marco Elver <elver@google.com>
Acked-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Marie Zhussupova <marievic@google.com>
[Catch some additional gen_params signatures in drm/xe/tests --David]
Signed-off-by: David Gow <davidgow@google.com>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2025-08-26 23:36:03 -06:00
Yicong Yang 52d15521eb sched/deadline: Don't count nr_running for dl_server proxy tasks
On CPU offline the kernel stalled with below call trace:

  INFO: task kworker/0:1:11 blocked for more than 120 seconds.

cpuhp held the cpu hotplug lock endlessly and stalled vmstat_shepherd.
This is because nr_running is counted twice when cpuhp is enqueued,
which fails the wait condition of cpuhp:

  enqueue_task_fair() // pick cpuhp from idle, rq->nr_running = 0
    dl_server_start()
      [...]
      add_nr_running() // rq->nr_running = 1
    add_nr_running() // rq->nr_running = 2
  [switch to cpuhp, waiting on balance_hotplug_wait()]
  rcuwait_wait_event(rq->nr_running == 1 && ...) // failed, rq->nr_running=2
    schedule() // wait again

It doesn't make sense to count the dl_server towards runnable tasks,
since it runs other tasks.

Fixes: 63ba8422f8 ("sched/deadline: Introduce deadline servers")
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250627035420.37712-1-yangyicong@huawei.com
2025-08-26 10:46:01 +02:00
kuyo chang 421fc59cf5 sched/deadline: Fix RT task potential starvation when expiry time passed
[Symptom]
The fair server mechanism is intended to prevent fair-task starvation
when higher-priority tasks monopolize the CPU. Due to this bug, however,
RT tasks on the runqueue may not be scheduled as expected.

[Analysis]
The log "sched: DL replenish lagged too much" triggered.

By memory dump of dl_server:
    curr = 0xFFFFFF80D6A0AC00 (
      dl_server = 0xFFFFFF83CD5B1470(
        dl_runtime = 0x02FAF080,
        dl_deadline = 0x3B9ACA00,
        dl_period = 0x3B9ACA00,
        dl_bw = 0xCCCC,
        dl_density = 0xCCCC,
        runtime = 0x02FAF080,
        deadline = 0x0000082031EB0E80,
        flags = 0x0,
        dl_throttled = 0x0,
        dl_yielded = 0x0,
        dl_non_contending = 0x0,
        dl_overrun = 0x0,
        dl_server = 0x1,
        dl_server_active = 0x1,
        dl_defer = 0x1,
        dl_defer_armed = 0x0,
        dl_defer_running = 0x1,
        dl_timer = (
          node = (
            expires = 0x000008199756E700),
          _softexpires = 0x000008199756E700,
          function = 0xFFFFFFDB9AF44D30 = dl_task_timer,
          base = 0xFFFFFF83CD5A12C0,
          state = 0x0,
          is_rel = 0x0,
          is_soft = 0x0,
    clock_update_flags = 0x4,
    clock = 0x000008204A496900,

 - The timer expiration time (rq->curr->dl_server->dl_timer->expires)
   is already in the past, indicating the timer has expired.
 - The timer state (rq->curr->dl_server->dl_timer->state) is 0.

[Suspected Root Cause]
The relevant code flow in the throttle path of
update_curr_dl_se() as follows:

  dequeue_dl_entity(dl_se, 0);                // the DL entity is dequeued

  if (unlikely(is_dl_boosted(dl_se) || !start_dl_timer(dl_se))) {
      if (dl_server(dl_se))                   // timer registration fails
          enqueue_dl_entity(dl_se, ENQUEUE_REPLENISH);//enqueue immediately
      ...
  }

The failure of `start_dl_timer` is caused by attempting to register a
timer with an expiration time that is already in the past. When this
situation persists, the code repeatedly re-enqueues the DL entity
without properly replenishing it or restarting the timer, so RT tasks
may not be scheduled as expected.

[Proposed Solution]:
Instead of immediately re-enqueuing the DL entity on timer registration
failure, this change ensures the DL entity is properly replenished and
the timer is restarted, preventing potential RT starvation.
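
Sketched against the flow quoted above (replenish_dl_new_period() is
assumed from deadline.c; this is the idea, not the literal patch):

    if (unlikely(is_dl_boosted(dl_se) || !start_dl_timer(dl_se))) {
            if (dl_server(dl_se)) {
                    replenish_dl_new_period(dl_se, rq); /* fresh runtime and deadline */
                    start_dl_timer(dl_se);              /* expiry is now in the future */
            }
    }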

Fixes: 63ba8422f8 ("sched/deadline: Introduce deadline servers")
Signed-off-by: kuyo chang <kuyo.chang@mediatek.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Closes: https://lore.kernel.org/CAMuHMdXn4z1pioTtBGMfQM0jsLviqS2jwysaWXpoLxWYoGa82w@mail.gmail.com
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Jiri Slaby <jirislaby@kernel.org>
Tested-by: Diederik de Haas <didi.debian@cknow.org>
Link: https://lkml.kernel.org/r/20250615131129.954975-1-kuyo.chang@mediatek.com
2025-08-26 10:46:01 +02:00
Juri Lelli bb4700adc3 sched/deadline: Always stop dl-server before changing parameters
Commit cccb45d7c4 ("sched/deadline: Less agressive dl_server
handling") reduced dl-server overhead by delaying disabling servers only
after there are no fair task around for a whole period, which means that
deadline entities are not dequeued right away on a server stop event.
However, the delay opens up a window in which a request for changing
server parameters can break per-runqueue running_bw tracking, as
reported by Yuri.

Close the problematic window by unconditionally calling dl_server_stop()
before applying the new parameters (ensuring deadline entities go
through an actual dequeue).

Fixes: cccb45d7c4 ("sched/deadline: Less agressive dl_server handling")
Reported-by: Yuri Andriaccio <yurand2000@gmail.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20250721-upstream-fix-dlserver-lessaggressive-b4-v1-1-4ebc10c87e40@redhat.com
2025-08-26 10:46:00 +02:00
Huacai Chen 4717432dfd sched/deadline: Fix dl_server_stopped()
Commit cccb45d7c4 ("sched/deadline: Less agressive dl_server handling")
introduces dl_server_stopped(). But it is obvious that dl_server_stopped()
should return true if dl_se->dl_server_active is 0.

Fixes: cccb45d7c4 ("sched/deadline: Less agressive dl_server handling")
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250809130419.1980742-1-chenhuacai@loongson.cn
2025-08-26 10:46:00 +02:00
Josh Poimboeuf 16ed389227 perf: Skip user unwind if the task is a kernel thread
If the task is not a user thread, there's no user stack to unwind.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250820180428.930791978@kernel.org
2025-08-26 09:51:13 +02:00
Josh Poimboeuf d77e3319e3 perf: Simplify get_perf_callchain() user logic
Simplify the get_perf_callchain() user logic a bit.  task_pt_regs()
should never be NULL.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20250820180428.760066227@kernel.org
2025-08-26 09:51:13 +02:00
Steven Rostedt 90942f9fac perf: Use current->flags & PF_KTHREAD|PF_USER_WORKER instead of current->mm == NULL
To determine whether a task is a kernel thread, it is more reliable to
use (current->flags & (PF_KTHREAD|PF_USER_WORKER)) than to rely on
current->mm being NULL. That is because some kernel tasks (e.g. io_uring
helpers) may have a non-NULL mm field.

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250820180428.592367294@kernel.org
2025-08-26 09:51:13 +02:00
Josh Poimboeuf 153f9e74de perf: Have get_perf_callchain() return NULL if crosstask and user are set
get_perf_callchain() doesn't support cross-task unwinding for user space
stacks, have it return NULL if both the crosstask and user arguments are
set.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250820180428.426423415@kernel.org
2025-08-26 09:51:12 +02:00
Josh Poimboeuf e649bcda25 perf: Remove get_perf_callchain() init_nr argument
The 'init_nr' argument has double duty: it's used to initialize both the
number of contexts and the number of stack entries.  That's confusing
and the callers always pass zero anyway.  Hard code the zero.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Namhyung Kim <Namhyung@kernel.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/r/20250820180428.259565081@kernel.org
2025-08-26 09:51:12 +02:00
Menglong Dong 8e4f0b1ebc bpf: use rcu_read_lock_dont_migrate() for trampoline.c
Use rcu_read_lock_dont_migrate() and rcu_read_unlock_migrate() in
trampoline.c to obtain better performance when PREEMPT_RCU is not enabled.
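
The transformation at these call sites is uniform, roughly:

    -   rcu_read_lock();
    -   migrate_disable();
    +   rcu_read_lock_dont_migrate();

    -   migrate_enable();
    -   rcu_read_unlock();
    +   rcu_read_unlock_migrate();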

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Link: https://lore.kernel.org/r/20250821090609.42508-8-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-08-25 18:52:16 -07:00
Menglong Dong 427a36bb55 bpf: use rcu_read_lock_dont_migrate() for bpf_prog_run_array_cg()
Use rcu_read_lock_dont_migrate() and rcu_read_unlock_migrate() in
bpf_prog_run_array_cg to obtain better performance when PREEMPT_RCU is
not enabled.

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Link: https://lore.kernel.org/r/20250821090609.42508-7-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-08-25 18:52:16 -07:00
Menglong Dong cf4303b70d bpf: use rcu_read_lock_dont_migrate() for bpf_task_storage_free()
Use rcu_read_lock_dont_migrate() and rcu_read_unlock_migrate() in
bpf_task_storage_free to obtain better performance when PREEMPT_RCU is
not enabled.

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Link: https://lore.kernel.org/r/20250821090609.42508-6-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-08-25 18:52:16 -07:00
Menglong Dong 68748f0397 bpf: use rcu_read_lock_dont_migrate() for bpf_iter_run_prog()
Use rcu_read_lock_dont_migrate() and rcu_read_unlock_migrate() in
bpf_iter_run_prog to obtain better performance when PREEMPT_RCU is
not enabled.

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Link: https://lore.kernel.org/r/20250821090609.42508-5-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-08-25 18:52:16 -07:00
Menglong Dong f2fa9b9069 bpf: use rcu_read_lock_dont_migrate() for bpf_inode_storage_free()
Use rcu_read_lock_dont_migrate() and rcu_read_unlock_migrate() in
bpf_inode_storage_free to obtain better performance when PREEMPT_RCU is
not enabled.

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Link: https://lore.kernel.org/r/20250821090609.42508-4-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-08-25 18:52:16 -07:00
Menglong Dong 8c0afc7c9c bpf: use rcu_read_lock_dont_migrate() for bpf_cgrp_storage_free()
Use rcu_read_lock_dont_migrate() and rcu_read_unlock_migrate() in
bpf_cgrp_storage_free to obtain better performance when PREEMPT_RCU is
not enabled.

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Link: https://lore.kernel.org/r/20250821090609.42508-3-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-08-25 18:52:16 -07:00
Chen Ridong 2c98144fc8 cpuset: add helpers for cpus read and cpuset_mutex locks
Replace repetitive locking patterns with new helpers:
- cpuset_full_lock()
- cpuset_full_unlock()

This makes the code cleaner and ensures consistent lock ordering.
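
Presumably the helpers pair the two locks in a fixed order, along these
lines (a sketch):

    void cpuset_full_lock(void)
    {
            cpus_read_lock();
            mutex_lock(&cpuset_mutex);
    }

    void cpuset_full_unlock(void)
    {
            mutex_unlock(&cpuset_mutex);
            cpus_read_unlock();
    }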

Signed-off-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-08-25 08:20:22 -10:00
Chen Ridong ada00d5162 cpuset: separate tmpmasks and cpuset allocation logic
The original alloc_cpumasks() served dual purposes: allocating cpumasks
for both temporary masks (tmpmasks) and cpuset structures. This patch:

1. Decouples these allocation paths for better code clarity
2. Introduces dedicated alloc_tmpmasks() and dup_or_alloc_cpuset()
   functions
3. Maintains symmetric pairing:
   - alloc_tmpmasks() ↔ free_tmpmasks()
   - dup_or_alloc_cpuset() ↔ free_cpuset()

Signed-off-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-08-25 08:19:57 -10:00
Chen Ridong 5806b3d051 cpuset: decouple tmpmasks and cpumasks freeing in cgroup
Currently, free_cpumasks() can free both tmpmasks and cpumasks of a cpuset
(cs). However, these two operations are not logically coupled. To improve
code clarity:
1. Move cpumask freeing to free_cpuset()
2. Rename free_cpumasks() to free_tmpmasks()

This change enforces the single responsibility principle.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-08-25 08:19:24 -10:00
Tiffany Yang 8d2a755895 cgroup: Fix 64-bit division in cgroup.stat.local
Fix the following build error for 32-bit systems:
   arm-linux-gnueabi-ld: kernel/cgroup/cgroup.o: in function `cgroup_core_local_stat_show':
>> kernel/cgroup/cgroup.c:3781:(.text+0x28f4): undefined reference to `__aeabi_uldivmod'
   arm-linux-gnueabi-ld: (__aeabi_uldivmod): Unknown destination type (ARM/Thumb) in kernel/cgroup/cgroup.o
>> kernel/cgroup/cgroup.c:3781:(.text+0x28f4): dangerous relocation: unsupported relocation
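
A typical fix pattern for this class of error is to route the division
through the 64-bit division helpers (names here are illustrative, not the
exact patch):

    /* u64 / u64 division that also links on 32-bit targets */
    seq_printf(seq, "frozen_usec %llu\n",
               div64_u64(frozen_nsec, NSEC_PER_USEC));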

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202508230604.KyvqOy81-lkp@intel.com/
Signed-off-by: Tiffany Yang <ynaffit@google.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-08-25 08:16:55 -10:00
Linus Torvalds 69fd6b99b8 - Fix a case where the events throttling logic operates on inactive events
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmiq4usACgkQEsHwGGHe
 VUqytg/9GEznbpmRT2Byto488q6rBaxp+hpNL1xFubuZt9Xv0xbUKLGUInHwTx4I
 wcep6QXsS5k55vz7xV1Fw0gQMAeBlSTiyT3fq9cC97T5z+bT+tBXANEr6GzTXHGW
 2KesAWQo9QVyNs/jCxWPBoWJnzBAG8NTuMNsnXVFnFuK9nFvmYOfVz5fJBIYw0cZ
 JXCtzIJGsXUDSa5wYIfbpORhVyJbwA30OdKc/rXvOa3eYuOazArR/OkH8IYXEPD9
 kJsVAcOMNo5hJsOUrGle0O1AYtw1bqWCHOtdWkcDEL9zrodNXqeEfrJW4wymHo70
 oXPSn74ZXI23ukbpYp1KvE66LvBsOsRS0rLk9avpGuNAKDSa1hNYC2cOF/ABD1Ca
 I8irVQnEslSdkW0SFnJ1cjHNZpQJWXJRlvvTnrE9FxGJWM8MPOunke34Z7O1q6qE
 kxupGb/ApfVlfRM6mAsU4G5Ya4wWed+gbMATPhcvRLqIJjS4E+Xzp/2Xw5Jg4Lln
 fkSOv0a3/OI4rk4xPrrvr85awP7qfxdRIfefuYae2RQ8twzrhlX9MyRTgX92T3Lu
 WgajYbWN7GhABdrfvAUIu9YiXYGKFiox5vT10O8IB7NXH6w1TlGEHdUkC/UUrjwa
 NL+LHl/zVh68JlMHaGfgpQeb4fWUvM7D3+KaLvN6dTm9LWCnKUc=
 =AsFO
 -----END PGP SIGNATURE-----

Merge tag 'perf_urgent_for_v6.17_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf fix from Borislav Petkov:

 - Fix a case where the events throttling logic operates on inactive
   events

* tag 'perf_urgent_for_v6.17_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf: Avoid undefined behavior from stopping/starting inactive events
2025-08-24 10:13:05 -04:00
Linus Torvalds 14f84cd318 Modules fixes for 6.17-rc3
This includes a fix, part of the KSPP (Kernel Self Protection Project), to
 replace the deprecated and unsafe strcpy() calls in the kernel parameter
 string handler and sysfs parameters for built-in modules. Single commit, no
 functional changes.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE73Ua4R8Pc+G5xjxTQJ6jxB8ZUfsFAmiqshwACgkQQJ6jxB8Z
 UfuF8Q//UVMLtvEREUM8m18u1SVDhnO9/+Dpl/bkTof+w03KviIbAN/VXm6qn3C7
 ZYtK5lDWU4twg+bOpC6EC/LdNVNyJx6cCpqkZFri21i9Yhf5ak0Sp833lWTZdqH/
 DDNCCAOuFH1EaihlJwQ/T4ox1CUOBTC49HyCBnVvdWiCCevUaPbxc8Cgsuzp/gsf
 vqXoOJCYKZD2ZdeRKgW7EgETWUljIpvjfXnb3DtMHztj92wzHPaCR50d0iBJbpZi
 3JEmrZ6FQ+sb+Qgp4VrW7ZEIa8UFGusqKVZBzJfZR61OU+iVz97gg2WptczsTZCa
 GoV0sM5MbNwaBtaMEoM40OiQWCAtYyfsIFmOH142Djcmzgs2hGFTGMKgZReDRs8B
 XiPPTq0IW5czYLoNyzJKvtoRX1qBC7wV0rxN9MY8AQieCPmhV3fQsjUgnPKwUOlV
 4U5EvzI2Qy4LL5oEUp3rEcymio/rP1wrd0dFxx/D+bMMj//PRF+9rr51deZ/tqtz
 Y0Q8rI7CYYlhg0I6XH8t2sAe2TypbU8gNGhOi23Z9vBZwtOv1e3fGTQXymopvVT4
 m50c541senApTKFHDbbck2KXwNyasdtIWCQrtChaP99Y0Lk2KRxkgYtVuCJEPNsv
 rFjyK3KnfkhhqNJEl4I8EUJwfQ1z9tUHJyfx1p0q74cShV7fRg8=
 =8/92
 -----END PGP SIGNATURE-----

Merge tag 'modules-6.17-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux

Pull modules fix from Daniel Gomez:
 "This includes a fix part of the KSPP (Kernel Self Protection Project)
  to replace the deprecated and unsafe strcpy() calls in the kernel
  parameter string handler and sysfs parameters for built-in modules.
  Single commit, no functional changes"

* tag 'modules-6.17-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux:
  params: Replace deprecated strcpy() with strscpy() and memcpy()
2025-08-24 09:43:50 -04:00
Pan Chuang 55b48e23f5 genirq/devres: Add error handling in devm_request_*_irq()
devm_request_threaded_irq() and devm_request_any_context_irq() currently
don't print any error message when interrupt registration fails.

This forces each driver to implement redundant error logging - over 2,000
lines of error messages exist across drivers. Additionally, when
upper-layer functions propagate these errors without logging, critical
debugging information is lost.

Add a devm_request_result() helper to unify error reporting via
dev_err_probe(), and use it in devm_request_threaded_irq() and
devm_request_any_context_irq() so that the device name, IRQ number, handler
functions, and error code are printed automatically on failure.
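
A minimal sketch of such a helper; the exact signature is an assumption:

  static int devm_request_result(struct device *dev, int ret, unsigned int irq,
                                 irq_handler_t handler, irq_handler_t thread_fn)
  {
          if (ret)
                  return dev_err_probe(dev, ret,
                                       "Failed to request IRQ %u (%ps/%ps)\n",
                                       irq, handler, thread_fn);
          return 0;
  }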

Co-developed-by: Yangtao Li <frank.li@vivo.com>
Signed-off-by: Yangtao Li <frank.li@vivo.com>
Signed-off-by: Pan Chuang <panchuang@vivo.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250805092922.135500-2-panchuang@vivo.com
2025-08-24 13:00:45 +02:00
Inochi Amaoto 7a721a2fee genirq: Add irq_chip_(startup/shutdown)_parent()
As the MSI controller on SG2044 uses the PLIC as the underlying interrupt
controller, it needs to call irq_enable() and irq_disable() to start up and
shut down interrupts. Otherwise, the MSI interrupt cannot be started up
correctly and will not respond to any incoming interrupt.

Introduce irq_chip_startup_parent() and irq_chip_shutdown_parent() to allow
the interrupt controller to call the irq_startup()/irq_shutdown() callbacks
of the parent interrupt chip.

In case the irq_startup()/irq_shutdown() callbacks are not implemented for
the parent interrupt chip, this falls back to irq_chip_enable_parent() or
irq_chip_disable_parent().
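
A sketch of the fallback logic, assuming the hierarchical irq_data layout:

  unsigned int irq_chip_startup_parent(struct irq_data *data)
  {
          struct irq_data *parent = data->parent_data;

          if (parent->chip->irq_startup)
                  return parent->chip->irq_startup(parent);

          /* parent chip has no irq_startup(): fall back to plain enable */
          irq_chip_enable_parent(data);
          return 0;
  }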

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Inochi Amaoto <inochiama@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Chen Wang <unicorn_wang@outlook.com> # Pioneerbox
Reviewed-by: Chen Wang <unicorn_wang@outlook.com>
Link: https://lore.kernel.org/all/20250813232835.43458-2-inochiama@gmail.com
Link: https://lore.kernel.org/lkml/20250722224513.22125-1-inochiama@gmail.com/
2025-08-23 21:20:25 +02:00
Sebastian Andrzej Siewior 3c71648793 genirq: Remove GENERIC_IRQ_LEGACY
IA64 is gone and with it the last GENERIC_IRQ_LEGACY user.

Remove GENERIC_IRQ_LEGACY.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250814165949.hvtP03r4@linutronix.de
2025-08-23 19:46:04 +02:00
Linus Torvalds e1d8f9ccb2 tracing fixes for v6.17-rc2:

Merge tag 'trace-v6.17-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Fix rtla and latency tooling pkg-config errors

   If libtraceevent and libtracefs is installed, but their corresponding
   '.pc' files are not installed, it reports that the libraries are
   missing and confuses the developer. Instead, report that the
   pkg-config files are missing and should be installed.

 - Fix overflow bug of the parser in trace_get_user()

   trace_get_user() uses the parsing functions to parse the user space
   strings. If the parser fails due to incorrect processing, it doesn't
   terminate the buffer with a nul byte. Add a "failed" flag to the
   parser that gets set when parsing fails and is used to know if the
   buffer is fine to use or not.

 - Remove a semicolon that was at an end of a comment line

 - Fix register_ftrace_graph() to unregister the pm notifier on error

   The register_ftrace_graph() registers a pm notifier but there's an
   error path that can exit the function without unregistering it. Since
   the function returns an error, it will never be unregistered.

 - Allocate and copy ftrace hash for reader of ftrace filter files

   When the set_ftrace_filter or set_ftrace_notrace files are open for
   read, an iterator is created and sets its hash pointer to the
   associated hash that represents filtering or notrace filtering to it.
   The issue is that the hash it points to can change while the
   iteration is happening. All the locking used to access the tracer's
   hashes are released which means those hashes can change or even be
   freed. Using the hash pointed to by the iterator can cause UAF bugs
   or similar.

   Have the read of these files allocate and copy the corresponding
   hashes and use that as that will keep them the same while the
   iterator is open. This also simplifies the code as opening it for
   write already does an allocate and copy, and now that the read is
   doing the same, there's no need to check which way it was opened on
   the release of the file, and the iterator hash can always be freed.

 - Fix function graph to copy args into temp storage

   The output of the function graph tracer shows both the entry and the
   exit of a function. When the exit is right after the entry, it
   combines the two events into one with the output of "function();",
   instead of showing:

     function() {
     }

   In order to do this, the iterator descriptor that reads the events
   includes storage that saves the entry event while it peeks at the next
   next event in the ring buffer. The peek can free the entry event so
   the iterator must store the information to use it after the peek.

   With the addition of function graph tracer recording the args, where
   the args are a dynamic array in the entry event, the temp storage
   does not save them. This causes the args to be corrupted or even
   cause a read of unsafe memory.

   Add space to save the args in the temp storage of the iterator.

 - Fix race between ftrace_dump and reading trace_pipe

   ftrace_dump() is used when a crash occurs where the ftrace buffer
   will be printed to the console. But it can also be triggered by
   sysrq-z. If a sysrq-z is triggered while a task is reading trace_pipe
   it can cause a race in the ftrace_dump() where it checks if the
   buffer has content, then it checks if the next event is available,
   and then prints the output (regardless if the next event was
   available or not). Reading trace_pipe at the same time can cause it
   to not be available, and this triggers a WARN_ON in the print. Move the
   printing into the block that checks whether the next event exists.

* tag 'trace-v6.17-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  ftrace: Also allocate and copy hash for reading of filter files
  ftrace: Fix potential warning in trace_printk_seq during ftrace_dump
  fgraph: Copy args in intermediate storage with entry
  trace/fgraph: Fix the warning caused by missing unregister notifier
  ring-buffer: Remove redundant semicolons
  tracing: Limit access to parser->buffer when trace_get_user failed
  rtla: Check pkg-config install
  tools/latency-collector: Check pkg-config install
2025-08-23 10:11:34 -04:00
Steven Rostedt bfb336cf97 ftrace: Also allocate and copy hash for reading of filter files
Currently the reader of set_ftrace_filter and set_ftrace_notrace just adds
the pointer to the global tracer hash to its iterator. Unlike the writer
that allocates a copy of the hash, the reader keeps the pointer to the
filter hashes. This is problematic because this pointer is static across
function calls that release the locks that can update the global tracer
hashes. This can cause UAF and similar bugs.

Allocate and copy the hash for reading the filter files like it is done
for the writers. This not only fixes UAF bugs, but also makes the code a
bit simpler as it doesn't have to differentiate when to free the
iterator's hash between writers and readers.
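
A sketch of the reader-side open path, assuming it reuses the internal
alloc_and_copy_ftrace_hash() helper that the write path already uses:

  /* take a private snapshot instead of pointing at the live hash */
  iter->hash = alloc_and_copy_ftrace_hash(hash->size_bits, hash);
  if (!iter->hash)
          ret = -ENOMEM;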

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/20250822183606.12962cc3@batman.local.home
Fixes: c20489dad1 ("ftrace: Assign iter->hash to filter or notrace hashes on seq read")
Closes: https://lore.kernel.org/all/20250813023044.2121943-1-wutengda@huaweicloud.com/
Closes: https://lore.kernel.org/all/20250822192437.GA458494@ax162/
Reported-by: Tengda Wu <wutengda@huaweicloud.com>
Tested-by: Tengda Wu <wutengda@huaweicloud.com>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-08-22 19:58:35 -04:00
Tengda Wu 4013aef2ce ftrace: Fix potential warning in trace_printk_seq during ftrace_dump
When calling ftrace_dump_one() concurrently with reading trace_pipe,
a WARN_ON_ONCE() in trace_printk_seq() can be triggered due to a race
condition.

The issue occurs because:

CPU0 (ftrace_dump)                              CPU1 (reader)
echo z > /proc/sysrq-trigger

!trace_empty(&iter)
trace_iterator_reset(&iter) <- len = size = 0
                                                cat /sys/kernel/tracing/trace_pipe
trace_find_next_entry_inc(&iter)
  __find_next_entry
    ring_buffer_empty_cpu <- all empty
  return NULL

trace_printk_seq(&iter.seq)
  WARN_ON_ONCE(s->seq.len >= s->seq.size)

In the context between trace_empty() and trace_find_next_entry_inc()
during ftrace_dump, the ring buffer data was consumed by other readers.
This caused trace_find_next_entry_inc to return NULL, failing to populate
`iter.seq`. At this point, due to the prior trace_iterator_reset, both
`iter.seq.len` and `iter.seq.size` were set to 0. Since they are equal,
the WARN_ON_ONCE condition is triggered.

Move the trace_printk_seq() into the if block that checks to make sure the
return value of trace_find_next_entry_inc() is non-NULL in
ftrace_dump_one(), ensuring the 'iter.seq' is properly populated before
subsequent operations.
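
A sketch of the reordered loop, simplified from ftrace_dump_one():

  while (!trace_empty(&iter)) {
          if (trace_find_next_entry_inc(&iter) != NULL) {
                  int ret = print_trace_line(&iter);

                  if (ret != TRACE_TYPE_NO_CONSUME)
                          trace_consume(&iter);
                  /* moved here: only print once iter.seq was populated */
                  trace_printk_seq(&iter.seq);
          }
          touch_nmi_watchdog();
  }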

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Ingo Molnar <mingo@elte.hu>
Link: https://lore.kernel.org/20250822033343.3000289-1-wutengda@huaweicloud.com
Fixes: d769041f86 ("ring_buffer: implement new locking")
Signed-off-by: Tengda Wu <wutengda@huaweicloud.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-08-22 17:32:36 -04:00
Steven Rostedt e3d01979e4 fgraph: Copy args in intermediate storage with entry
The output of the function graph tracer has two ways to display its
entries. One way for leaf functions with no events recorded within them,
and the other is for functions with events recorded inside it. As function
graph has an entry and exit event, to simplify the output of leaf
functions it combines the two, where as non leaf functions are separate:

 2)               |              invoke_rcu_core() {
 2)               |                raise_softirq() {
 2)   0.391 us    |                  __raise_softirq_irqoff();
 2)   1.191 us    |                }
 2)   2.086 us    |              }

The __raise_softirq_irqoff() function above is really two events that were
merged into one. Otherwise it would have looked like:

 2)               |              invoke_rcu_core() {
 2)               |                raise_softirq() {
 2)               |                  __raise_softirq_irqoff() {
 2)   0.391 us    |                  }
 2)   1.191 us    |                }
 2)   2.086 us    |              }

In order to do this merge, the reading of the trace output file needs to
look at the next event before printing. But since the pointer to the event
is on the ring buffer, it needs to save the entry event before it looks at
the next event as the next event goes out of focus as soon as a new event
is read from the ring buffer. After it reads the next event, it will print
the entry event with either the '{' (non leaf) or ';' and timestamps (leaf).

The iterator used to read the trace file has storage for this event. The
problem happens when the function graph tracer has arguments attached to
the entry event as the entry now has a variable length "args" field. This
field only gets set when funcargs option is used. But the args are not
recorded in this temp data and garbage could be printed. The entry field
is copied via:

  data->ent = *curr;

Where "curr" is the entry field. But this method only saves the non
variable length fields from the structure.

Add a helper structure to the iterator data that adds the max args size to
the data storage in the iterator. Then simply copy the entire entry into
this storage (with size protection).
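
A sketch of the helper structure and the bounded copy; the args bound
FTRACE_REGS_MAX_ARGS is an assumption:

  struct fgraph_ent_args {
          struct ftrace_graph_ent_entry   ent;
          /* reserve room for the variable-length args behind the entry */
          unsigned long                   args[FTRACE_REGS_MAX_ARGS];
  };

  /* copy the whole event, clamped to the storage size */
  memcpy(&data->ent, curr, min(size, sizeof(data->ent)));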

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/20250820195522.51d4a268@gandalf.local.home
Reported-by: Sasha Levin <sashal@kernel.org>
Tested-by: Sasha Levin <sashal@kernel.org>
Closes: https://lore.kernel.org/all/aJaxRVKverIjF4a6@lappy/
Fixes: ff5c9c576e ("ftrace: Add support for function argument to graph tracer")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-08-22 17:32:35 -04:00
Tao Chen 4223bf833c bpf: Remove preempt_disable in bpf_try_get_buffers
BPF programs now run with migration disabled, so it is safe to access
this_cpu_inc_return(bpf_bprintf_nest_level) without explicitly disabling
preemption.

Fixes: d9c9e4db18 ("bpf: Factorize bpf_trace_printk and bpf_seq_printf")
Signed-off-by: Tao Chen <chen.dylane@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250819125638.2544715-1-chen.dylane@linux.dev
2025-08-22 11:44:09 -07:00
Eric Biggers d47cc4dea1 bpf: Use sha1() instead of sha1_transform() in bpf_prog_calc_tag()
Now that there's a proper SHA-1 library API, just use that instead of
the low-level SHA-1 compression function.  This eliminates the need for
bpf_prog_calc_tag() to implement the SHA-1 padding itself.  No
functional change; the computed tags remain the same.
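
A sketch of the simplified call; the buffer names are placeholders:

  #include <crypto/sha1.h>

  u8 digest[SHA1_DIGEST_SIZE];

  sha1(raw, raw_size, digest);            /* padding handled by the library */
  memcpy(prog->tag, digest, sizeof(prog->tag));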

Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20250811201615.564461-1-ebiggers@kernel.org
2025-08-22 11:40:05 -07:00
Tiffany Yang afa3701c0e cgroup: cgroup.stat.local time accounting
There isn't yet a clear way to identify a set of "lost" time that
everyone (or at least a wider group of users) cares about. However,
users can perform some delay accounting by iterating over components of
interest. This patch allows cgroup v2 freezing time to be one of those
components.

Track the cumulative time that each v2 cgroup spends freezing and expose
it to userland via a new local stat file in cgroupfs. Thank you to
Michal, who provided the ASCII art in the updated documentation.

To access this value:
  $ mkdir /sys/fs/cgroup/test
  $ cat /sys/fs/cgroup/test/cgroup.stat.local
  freeze_time_total 0

Ensure consistent freeze time reads with freeze_seq, a per-cgroup
sequence counter. Writes are serialized using the css_set_lock.
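
A sketch of the read side; the freezer field names below are assumptions:

  static u64 cgroup_freeze_time(struct cgroup *cgrp)
  {
          unsigned int seq;
          u64 t;

          do {
                  seq = read_seqcount_begin(&cgrp->freezer.freeze_seq);
                  t = cgrp->freezer.frozen_nsec;
          } while (read_seqcount_retry(&cgrp->freezer.freeze_seq, seq));

          return t;
  }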

Signed-off-by: Tiffany Yang <ynaffit@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-08-22 07:50:43 -10:00
Chen Ridong 94a4acfec1 cgroup/psi: Set of->priv to NULL upon file release
Setting of->priv to NULL when the file is released enables earlier bug
detection. This allows potential bugs to manifest as NULL pointer
dereferences rather than use-after-free errors[1], which are generally more
difficult to diagnose.

[1] https://lore.kernel.org/cgroups/38ef3ff9-b380-44f0-9315-8b3714b0948d@huaweicloud.com/T/#m8a3b3f88f0ff3da5925d342e90043394f8b2091b
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-08-22 07:47:43 -10:00
Chen Ridong 79f919a89c cgroup: split cgroup_destroy_wq into 3 workqueues
A hung task can occur during LTP cgroup testing [1] when repeatedly
mounting/unmounting perf_event and net_prio controllers with
systemd.unified_cgroup_hierarchy=1. The hang manifests in
cgroup_lock_and_drain_offline() during root destruction.

Related case:
cgroup_fj_function_perf_event cgroup_fj_function.sh perf_event
cgroup_fj_function_net_prio cgroup_fj_function.sh net_prio

Call Trace:
	cgroup_lock_and_drain_offline+0x14c/0x1e8
	cgroup_destroy_root+0x3c/0x2c0
	css_free_rwork_fn+0x248/0x338
	process_one_work+0x16c/0x3b8
	worker_thread+0x22c/0x3b0
	kthread+0xec/0x100
	ret_from_fork+0x10/0x20

Root Cause:

CPU0                            CPU1
mount perf_event                umount net_prio
cgroup1_get_tree                cgroup_kill_sb
rebind_subsystems               // root destruction enqueues
				// cgroup_destroy_wq
// kill all perf_event css
                                // one perf_event css A is dying
                                // css A offline enqueues cgroup_destroy_wq
                                // root destruction will be executed first
                                css_free_rwork_fn
                                cgroup_destroy_root
                                cgroup_lock_and_drain_offline
                                // some perf descendants are dying
                                // cgroup_destroy_wq max_active = 1
                                // waiting for css A to die

Problem scenario:
1. CPU0 mounts perf_event (rebind_subsystems)
2. CPU1 unmounts net_prio (cgroup_kill_sb), queuing root destruction work
3. A dying perf_event CSS gets queued for offline after root destruction
4. Root destruction waits for offline completion, but offline work is
   blocked behind root destruction in cgroup_destroy_wq (max_active=1)

Solution:
Split cgroup_destroy_wq into three dedicated workqueues:
cgroup_offline_wq – Handles CSS offline operations
cgroup_release_wq – Manages resource release
cgroup_free_wq – Performs final memory deallocation

This separation eliminates blocking in the CSS free path while waiting for
offline operations to complete.
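
A minimal sketch of the split, using the names above (the allocation flags
and init function are assumptions):

  static struct workqueue_struct *cgroup_offline_wq;
  static struct workqueue_struct *cgroup_release_wq;
  static struct workqueue_struct *cgroup_free_wq;

  static int __init cgroup_destroy_wq_init(void)
  {
          cgroup_offline_wq = alloc_workqueue("cgroup_offline", 0, 1);
          cgroup_release_wq = alloc_workqueue("cgroup_release", 0, 1);
          cgroup_free_wq    = alloc_workqueue("cgroup_free", 0, 1);
          BUG_ON(!cgroup_offline_wq || !cgroup_release_wq || !cgroup_free_wq);
          return 0;
  }

Each queue keeps max_active = 1, so work items of the same class stay
ordered while the classes no longer block one another.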

[1] https://github.com/linux-test-project/ltp/blob/master/runtest/controllers
Fixes: 334c3679ec ("cgroup: reimplement rebind_subsystems() using cgroup_apply_control() and friends")
Reported-by: Gao Yingjie <gaoyingjie@uniontech.com>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-08-22 07:44:11 -10:00
Paul Chaignon f41345f47f bpf: Use tnums for JEQ/JNE is_branch_taken logic
In the following toy program (reg states minimized for readability), R0
and R1 always have different values at instruction 6. This is obvious
when reading the program but cannot be guessed from ranges alone as
they overlap (R0 in [0; 0xc0000000], R1 in [1024; 0xc0000400]).

  0: call bpf_get_prandom_u32#7  ; R0_w=scalar()
  1: w0 = w0                     ; R0_w=scalar(var_off=(0x0; 0xffffffff))
  2: r0 >>= 30                   ; R0_w=scalar(var_off=(0x0; 0x3))
  3: r0 <<= 30                   ; R0_w=scalar(var_off=(0x0; 0xc0000000))
  4: r1 = r0                     ; R1_w=scalar(var_off=(0x0; 0xc0000000))
  5: r1 += 1024                  ; R1_w=scalar(var_off=(0x400; 0xc0000000))
  6: if r1 != r0 goto pc+1

Looking at tnums however, we can deduce that R1 is always different from
R0 because their tnums don't agree on known bits. This patch uses this
logic to improve is_scalar_branch_taken in case of BPF_JEQ and BPF_JNE.
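
In tnum terms (a tnum carries a value and a mask of unknown bits), the new
check can be sketched as follows; the helper name is illustrative, not the
verifier's actual function:

  static bool tnum_known_bits_disagree(struct tnum a, struct tnum b)
  {
          u64 known_both = ~a.mask & ~b.mask;     /* bits known in both */

          return (a.value ^ b.value) & known_both;
  }

In the example above, bit 10 (0x400) is known in both R0 and R1 and differs,
so the JNE branch at instruction 6 is always taken.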

This change has a tiny impact on complexity, which was measured with
the Cilium complexity CI test. That test covers 72 programs with
various build and load time configurations for a total of 970 test
cases. For 80% of test cases, the patch has no impact. On the other
test cases, the patch decreases complexity by only 0.08% on average. In
the best case, the verifier needs to walk 3% less instructions and, in
the worst case, 1.5% more. Overall, the patch has a small positive
impact, especially for our largest programs.

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/be3ee70b6e489c49881cb1646114b1d861b5c334.1755694147.git.paul.chaignon@gmail.com
2025-08-22 18:12:24 +02:00
Linus Torvalds 6eba757ce9 20 hotfixes. 10 are cc:stable and the remainder address post-6.16 issues
or aren't considered necessary for -stable kernels.  17 of these fixes are
 for MM.
 
 As usual, singletons all over the place, apart from a three-patch series
 of KHO followup work from Pasha which is actually also a bunch of
 singletons.

Merge tag 'mm-hotfixes-stable-2025-08-21-18-17' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "20 hotfixes. 10 are cc:stable and the remainder address post-6.16
  issues or aren't considered necessary for -stable kernels. 17 of these
  fixes are for MM.

  As usual, singletons all over the place, apart from a three-patch
  series of KHO followup work from Pasha which is actually also a bunch
  of singletons"

* tag 'mm-hotfixes-stable-2025-08-21-18-17' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  mm/mremap: fix WARN with uffd that has remap events disabled
  mm/damon/sysfs-schemes: put damos dests dir after removing its files
  mm/migrate: fix NULL movable_ops if CONFIG_ZSMALLOC=m
  mm/damon/core: fix damos_commit_filter not changing allow
  mm/memory-failure: fix infinite UCE for VM_PFNMAP pfn
  MAINTAINERS: mark MGLRU as maintained
  mm: rust: add page.rs to MEMORY MANAGEMENT - RUST
  iov_iter: iterate_folioq: fix handling of offset >= folio size
  selftests/damon: fix selftests by installing drgn related script
  .mailmap: add entry for Easwar Hariharan
  selftests/mm: add test for invalid multi VMA operations
  mm/mremap: catch invalid multi VMA moves earlier
  mm/mremap: allow multi-VMA move when filesystem uses thp_get_unmapped_area
  mm/damon/core: fix commit_ops_filters by using correct nth function
  tools/testing: add linux/args.h header and fix radix, VMA tests
  mm/debug_vm_pgtable: clear page table entries at destroy_args()
  squashfs: fix memory leak in squashfs_fill_super
  kho: warn if KHO is disabled due to an error
  kho: mm: don't allow deferred struct page with KHO
  kho: init new_physxa->phys_bits to fix lockdep
2025-08-22 08:54:34 -04:00
Jakub Kicinski 4dba4a936f bpf-next-for-netdev

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Martin KaFai Lau says:

====================
pull-request: bpf-next 2025-08-21

We've added 9 non-merge commits during the last 3 day(s) which contain
a total of 13 files changed, 1027 insertions(+), 27 deletions(-).

The main changes are:

1) Added bpf dynptr support for accessing the metadata of a skb,
   from Jakub Sitnicki.
   The patches are merged from a stable branch bpf-next/skb-meta-dynptr.
   The same patches have also been merged into bpf-next/master.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
  selftests/bpf: Cover metadata access from a modified skb clone
  selftests/bpf: Cover read/write to skb metadata at an offset
  selftests/bpf: Cover write access to skb metadata via dynptr
  selftests/bpf: Cover read access to skb metadata via dynptr
  selftests/bpf: Parametrize test_xdp_context_tuntap
  selftests/bpf: Pass just bpf_map to xdp_context_test helper
  selftests/bpf: Cover verifier checks for skb_meta dynptr type
  bpf: Enable read/write access to skb metadata through a dynptr
  bpf: Add dynptr type for skb metadata
====================

Link: https://patch.msgid.link/20250821191827.2099022-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-21 15:37:16 -07:00
Linus Torvalds 3957a57201 cgroup: Fixes for v6.17-rc2

Merge tag 'cgroup-for-6.17-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup fixes from Tejun Heo:

 - Fix NULL de-ref in css_rstat_exit() which could happen after
   allocation failure

 - Fix a cpuset partition handling bug and a couple other misc issues

 - Doc spelling fix

* tag 'cgroup-for-6.17-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  docs: cgroup: fixed spelling mistakes in documentation
  cgroup: avoid null de-ref in css_rstat_exit()
  cgroup/cpuset: Remove the unnecessary css_get/put() in cpuset_partition_write()
  cgroup/cpuset: Fix a partition error with CPU hotplug
  cgroup/cpuset: Use static_branch_enable_cpuslocked() on cpusets_insane_config_key
2025-08-21 16:31:27 -04:00
Linus Torvalds d72052ac09 sched_ext: Fixes for v6.17-rc2

Merge tag 'sched_ext-for-6.17-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext fixes from Tejun Heo:

 - Fix a subtle bug during SCX enabling where a dead task skips init
   but doesn't skip sched class switch leading to invalid task state
   transition warning

 - Cosmetic fix in selftests

* tag 'sched_ext-for-6.17-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  selftests/sched_ext: Remove duplicate sched.h header
  sched/ext: Fix invalid task state transitions on class switch
2025-08-21 16:02:35 -04:00
Qianfeng Rong e173287b5d uprobes: Remove redundant __GFP_NOWARN
Commit 16f5dfbc85 ("gfp: include __GFP_NOWARN in GFP_NOWAIT")
made GFP_NOWAIT implicitly include __GFP_NOWARN.

Therefore, explicit __GFP_NOWARN combined with GFP_NOWAIT
(e.g., `GFP_NOWAIT | __GFP_NOWARN`) is now redundant. Let's clean
up these redundant flags across subsystems.

No functional changes.

Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20250805025000.346647-1-rongqianfeng@vivo.com
2025-08-21 20:09:26 +02:00
Jiri Olsa 89d1d8434d seccomp: passthrough uprobe systemcall without filtering
Adding uprobe as another exception to the seccomp filter, alongside the
uretprobe syscall.

Same as uretprobe, the uprobe syscall is installed by the kernel as a
replacement for the breakpoint exception. It is limited to the x86_64
arch and isn't expected to ever be supported on i386.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kees Cook <kees@kernel.org>
Link: https://lore.kernel.org/r/20250720112133.244369-21-jolsa@kernel.org
2025-08-21 20:09:26 +02:00
Jiri Olsa ba2bfc97b4 uprobes/x86: Add support to optimize uprobes
Putting together all the previously added pieces to support optimized
uprobes on top of 5-byte nop instruction.

The current uprobe execution goes through following:

  - installs breakpoint instruction over original instruction
  - exception handler hit and calls related uprobe consumers
  - and either simulates original instruction or does out of line single step
    execution of it
  - returns to user space

The optimized uprobe path does following:

  - checks the original instruction is 5-byte nop (plus other checks)
  - adds (or uses existing) user space trampoline with uprobe syscall
  - overwrites original instruction (5-byte nop) with call to user space
    trampoline
  - the user space trampoline executes uprobe syscall that calls related uprobe
    consumers
  - trampoline returns back to next instruction

This approach won't speed up all uprobes as it's limited to using nop5 as
original instruction, but we plan to use nop5 as USDT probe instruction
(which currently uses single byte nop) and speed up the USDT probes.

The arch_uprobe_optimize triggers the uprobe optimization and is called after
first uprobe hit. I originally had it called on uprobe installation but then
it clashed with elf loader, because the user space trampoline was added in a
place where loader might need to put elf segments, so I decided to do it after
first uprobe hit when loading is done.

The uprobe is un-optimized in arch specific set_orig_insn call.

The instruction overwrite is x86 arch specific and needs to go through 3 updates:
(on top of nop5 instruction)

  - write int3 into 1st byte
  - write last 4 bytes of the call instruction
  - update the call instruction opcode

And cleanup goes though similar reverse stages:

  - overwrite call opcode with breakpoint (int3)
  - write last 4 bytes of the nop5 instruction
  - write the nop5 first instruction byte
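
A rough sketch of the install direction; write_insn() is a hypothetical
stand-in for the synchronized write primitive, while 0xcc and 0xe8 are the
real int3 and call-rel32 opcodes:

  u8 call[5] = { 0xe8 };                    /* call rel32 */
  *(s32 *)&call[1] = tramp - (vaddr + 5);   /* rel32 = target - next insn */

  write_insn(vaddr, "\xcc", 1);             /* 1) trap concurrent executors */
  write_insn(vaddr + 1, &call[1], 4);       /* 2) install the rel32 target */
  write_insn(vaddr, &call[0], 1);           /* 3) flip int3 into the call */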

We do not unmap and release the uprobe trampoline when it's no longer
needed, because there's no easy way to make sure none of the threads is
still inside the trampoline. But we do not waste memory, because there's
just a single page backing all the uprobe trampoline mappings.

We do waste a page frame on the mapping for every 4GB region by keeping
the uprobe trampoline page mapped, but that seems ok.

We take the benefit from the fact that set_swbp and set_orig_insn are
called under mmap_write_lock(mm), so we can use the current instruction
as the state the uprobe is in - nop5/breakpoint/call trampoline -
and decide the needed action (optimize/un-optimize) based on that.

Attaching the speed up from benchs/run_bench_uprobes.sh script:

current:
        usermode-count :  152.604 ± 0.044M/s
        syscall-count  :   13.359 ± 0.042M/s
-->     uprobe-nop     :    3.229 ± 0.002M/s
        uprobe-push    :    3.086 ± 0.004M/s
        uprobe-ret     :    1.114 ± 0.004M/s
        uprobe-nop5    :    1.121 ± 0.005M/s
        uretprobe-nop  :    2.145 ± 0.002M/s
        uretprobe-push :    2.070 ± 0.001M/s
        uretprobe-ret  :    0.931 ± 0.001M/s
        uretprobe-nop5 :    0.957 ± 0.001M/s

after the change:
        usermode-count :  152.448 ± 0.244M/s
        syscall-count  :   14.321 ± 0.059M/s
        uprobe-nop     :    3.148 ± 0.007M/s
        uprobe-push    :    2.976 ± 0.004M/s
        uprobe-ret     :    1.068 ± 0.003M/s
-->     uprobe-nop5    :    7.038 ± 0.007M/s
        uretprobe-nop  :    2.109 ± 0.004M/s
        uretprobe-push :    2.035 ± 0.001M/s
        uretprobe-ret  :    0.908 ± 0.001M/s
        uretprobe-nop5 :    3.377 ± 0.009M/s

I see a bit more speed up on Intel (above) compared to AMD. The big nop5
speed up is partly due to emulating nop5 and partly due to the optimization.

The key speed up we do this for is the USDT switch from nop to nop5:
        uprobe-nop     :    3.148 ± 0.007M/s
        uprobe-nop5    :    7.038 ± 0.007M/s

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20250720112133.244369-11-jolsa@kernel.org
2025-08-21 20:09:21 +02:00
Jiri Olsa 56101b69c9 uprobes/x86: Add uprobe syscall to speed up uprobe
Adding new uprobe syscall that calls uprobe handlers for given
'breakpoint' address.

The idea is that the 'breakpoint' address calls the user space
trampoline which executes the uprobe syscall.

The syscall handler reads the return address of the initial call
to retrieve the original 'breakpoint' address. With this address
we find the related uprobe object and call its consumers.

Adding the arch_uprobe_trampoline_mapping function that provides the
uprobe trampoline mapping. This mapping is backed with one global page
initialized at __init time and shared by all the mapping instances.

We do not allow the uprobe syscall to execute if the caller is not from
the uprobe trampoline mapping.

The uprobe syscall ensures the consumer (bpf program) sees registers
values in the state before the trampoline was called.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20250720112133.244369-10-jolsa@kernel.org
2025-08-21 20:09:20 +02:00
Jiri Olsa 91440ff4ca uprobes/x86: Add mapping for optimized uprobe trampolines
Adding support to add a special mapping for the user space trampoline with
the following functions:

  uprobe_trampoline_get - find or add uprobe_trampoline
  uprobe_trampoline_put - remove or destroy uprobe_trampoline

The user space trampoline is exported as an arch specific user space special
mapping through tramp_mapping, which is initialized in the following changes
with the new uprobe syscall.

The uprobe trampoline needs to be callable/reachable from the probed address,
so while searching for an available address we use the is_reachable_by_call
function to decide if the uprobe trampoline is callable from the probe address.
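
A sketch of that check: a call rel32 reaches roughly +/- 2GB from the end
of the 5-byte instruction (the exact helper body is an assumption):

  static bool is_reachable_by_call(unsigned long vtramp, unsigned long vaddr)
  {
          long delta = (long)vtramp - (long)(vaddr + 5);

          return delta >= INT_MIN && delta <= INT_MAX;
  }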

All uprobe_trampoline objects are stored in uprobes_state object and are
cleaned up when the process mm_struct goes down. Adding new arch hooks
for that, because this change is x86_64 specific.

Locking is provided by callers in following changes.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20250720112133.244369-9-jolsa@kernel.org
2025-08-21 20:09:20 +02:00
Jiri Olsa 18a111256a uprobes: Add do_ref_ctr argument to uprobe_write function
Making the update_ref_ctr() call in uprobe_write() conditional on the
do_ref_ctr argument. This way we can use uprobe_write() to update an
instruction without also updating ref_ctr_offset.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20250720112133.244369-8-jolsa@kernel.org
2025-08-21 20:09:20 +02:00
Jiri Olsa ec46350fe1 uprobes: Add is_register argument to uprobe_write and uprobe_write_opcode
uprobe_write has a special path to restore the original page when we write
the original instruction back. This happens when uprobe_write detects that
we want to write anything other than the breakpoint instruction.

Moving the detection out and passing it to uprobe_write as an argument, so
it's possible to write different instructions (not just the breakpoint and
the restored original).

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20250720112133.244369-7-jolsa@kernel.org
2025-08-21 20:09:19 +02:00
Jiri Olsa f8b7c528b4 uprobes: Add nbytes argument to uprobe_write
Adding nbytes argument to uprobe_write and related functions as
preparation for writing whole instructions in following changes.

Also renaming opcode arguments to insn, which seems to fit better.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20250720112133.244369-6-jolsa@kernel.org
2025-08-21 20:09:19 +02:00
Jiri Olsa 33d7b2beaf uprobes: Add uprobe_write function
Adding uprobe_write function that does what uprobe_write_opcode did so far,
but allows passing a verify callback function that checks the memory
location before writing the opcode.

It will be used in following changes to implement specific checking
logic for instruction update.

The uprobe_write_opcode now calls uprobe_write with verify_opcode as
the verify callback.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20250720112133.244369-5-jolsa@kernel.org
2025-08-21 20:09:19 +02:00
Jiri Olsa 82afdd05a1 uprobes: Make copy_from_page global
Making copy_from_page global and adding uprobe prefix.
Adding the uprobe prefix to copy_to_page as well for symmetry.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20250720112133.244369-4-jolsa@kernel.org
2025-08-21 20:09:18 +02:00
Jiri Olsa 0f07b7919d uprobes: Rename arch_uretprobe_trampoline function
We are about to add uprobe trampoline, so cleaning up the namespace.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20250720112133.244369-3-jolsa@kernel.org
2025-08-21 20:09:18 +02:00
Jiri Olsa 7769cb177b uprobes: Remove breakpoint in unapply_uprobe under mmap_write_lock
Currently unapply_uprobe takes mmap_read_lock, but it might call
remove_breakpoint which eventually changes user pages.

Current code writes either the breakpoint or the original instruction, so it
can get away with the read lock, as explained in [1]. But with the upcoming
change that writes multiple instructions on the probed address, we need to
ensure that any update to mm's pages is exclusive.

[1] https://lore.kernel.org/all/20240710140045.GA1084@redhat.com/

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20250720112133.244369-2-jolsa@kernel.org
2025-08-21 20:09:18 +02:00
Linus Torvalds 068a56e56f Probes fixes for v6.17-rc2:

Merge tag 'probes-fixes-v6.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull probes fix from Masami Hiramatsu:
 "Sanitize wildcard for fprobe event name

  Fprobe event accepts wildcards for the target functions, but unless
  the user specifies its event name, it makes an event with the
  wildcards. Replace the wildcard '*' with the underscore '_'"

* tag 'probes-fixes-v6.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: fprobe-event: Sanitize wildcard for fprobe event name
2025-08-20 16:29:30 -07:00
Masami Hiramatsu (Google) ec879e1a0b tracing: fprobe-event: Sanitize wildcard for fprobe event name
Fprobe event accepts wildcards for the target functions, but unless the
user specifies an event name, it creates an event named with the wildcards.

  /sys/kernel/tracing # echo 'f mutex*' >> dynamic_events
  /sys/kernel/tracing # cat dynamic_events
  f:fprobes/mutex*__entry mutex*
  /sys/kernel/tracing # ls events/fprobes/
  enable         filter         mutex*__entry

To fix this, replace the wildcard ('*') with an underscore.
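
A minimal sketch of the sanitizer; the helper name and placement are
assumptions:

  static void sanitize_event_name(char *name)
  {
          while ((name = strchr(name, '*')))
                  *name = '_';
  }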

Link: https://lore.kernel.org/all/175535345114.282990.12294108192847938710.stgit@devnote2/

Fixes: 334e5519c3 ("tracing/probes: Add fprobe events for tracing function entry and exit.")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: stable@vger.kernel.org
2025-08-20 23:41:58 +09:00
Ye Weihua edede7a6dc trace/fgraph: Fix the warning caused by missing unregister notifier
This warning was triggered during testing on v6.16:

notifier callback ftrace_suspend_notifier_call already registered
WARNING: CPU: 2 PID: 86 at kernel/notifier.c:23 notifier_chain_register+0x44/0xb0
...
Call Trace:
 <TASK>
 blocking_notifier_chain_register+0x34/0x60
 register_ftrace_graph+0x330/0x410
 ftrace_profile_write+0x1e9/0x340
 vfs_write+0xf8/0x420
 ? filp_flush+0x8a/0xa0
 ? filp_close+0x1f/0x30
 ? do_dup2+0xaf/0x160
 ksys_write+0x65/0xe0
 do_syscall_64+0xa4/0x260
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

When writing to the function_profile_enabled interface, the notifier was
not unregistered after start_graph_tracing failed, causing a warning the
next time function_profile_enabled was written.

Fixed by adding unregister_pm_notifier() in the error path.
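
A sketch of the fixed error path in register_ftrace_graph():

  ret = start_graph_tracing();
  if (ret) {
          /* was missing: balance the earlier register_pm_notifier() */
          unregister_pm_notifier(&ftrace_suspend_notifier);
          goto out;
  }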

Link: https://lore.kernel.org/20250818073332.3890629-1-yeweihua4@huawei.com
Fixes: 4a2b8dda3f ("tracing/function-graph-tracer: fix a regression while suspend to disk")
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Ye Weihua <yeweihua4@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-08-20 09:21:03 -04:00
Liao Yuanhong cd6e4faba9 ring-buffer: Remove redundant semicolons
Remove unnecessary semicolons.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250813095114.559530-1-liaoyuanhong@vivo.com
Signed-off-by: Liao Yuanhong <liaoyuanhong@vivo.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-08-20 09:20:30 -04:00
Pu Lehui 6a909ea83f tracing: Limit access to parser->buffer when trace_get_user failed
When the length of the string written to set_ftrace_filter exceeds
FTRACE_BUFF_MAX, the following KASAN alarm will be triggered:

BUG: KASAN: slab-out-of-bounds in strsep+0x18c/0x1b0
Read of size 1 at addr ffff0000d00bd5ba by task ash/165

CPU: 1 UID: 0 PID: 165 Comm: ash Not tainted 6.16.0-g6bcdbd62bd56-dirty
Hardware name: linux,dummy-virt (DT)
Call trace:
 show_stack+0x34/0x50 (C)
 dump_stack_lvl+0xa0/0x158
 print_address_description.constprop.0+0x88/0x398
 print_report+0xb0/0x280
 kasan_report+0xa4/0xf0
 __asan_report_load1_noabort+0x20/0x30
 strsep+0x18c/0x1b0
 ftrace_process_regex.isra.0+0x100/0x2d8
 ftrace_regex_release+0x484/0x618
 __fput+0x364/0xa58
 ____fput+0x28/0x40
 task_work_run+0x154/0x278
 do_notify_resume+0x1f0/0x220
 el0_svc+0xec/0xf0
 el0t_64_sync_handler+0xa0/0xe8
 el0t_64_sync+0x1ac/0x1b0

The reason is that trace_get_user() fails when processing a string longer
than FTRACE_BUFF_MAX, but does not terminate parser->buffer with a nul
byte. An OOB access is then triggered in ftrace_regex_release->
ftrace_process_regex->strsep->strpbrk. We can solve this problem by
limiting access to parser->buffer when trace_get_user() has failed.
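
A sketch of the guard, assuming a 'failed' flag added to struct
trace_parser and set in trace_get_user()'s error path:

  static inline bool trace_parser_loaded(struct trace_parser *parser)
  {
          /* refuse the buffer if trace_get_user() bailed out mid-parse */
          return !parser->failed && parser->idx != 0;
  }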

Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/20250813040232.1344527-1-pulehui@huaweicloud.com
Fixes: 8c9af478c0 ("ftrace: Handle commands when closing set_ftrace_filter file")
Signed-off-by: Pu Lehui <pulehui@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-08-20 09:20:30 -04:00
Pasha Tatashin 44958f2025 kho: warn if KHO is disabled due to an error
During boot, the scratch area is allocated based on command line parameters
or is auto-calculated. However, the scratch allocation may fail, and in that
case KHO is disabled. Currently, no warning is printed that KHO is disabled,
which makes it confusing for the end user to figure out why KHO is not
available. Add the missing warning message.

Link: https://lkml.kernel.org/r/20250808201804.772010-4-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Baoquan He <bhe@redhat.com>
Cc: Changyuan Lyu <changyuanl@google.com>
Cc: Coiby Xu <coxu@redhat.com>
Cc: Dave Vasilevsky <dave@vasilevsky.ca>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Kees Cook <kees@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-19 16:35:53 -07:00
Pasha Tatashin 8b66ed2c3f kho: mm: don't allow deferred struct page with KHO
KHO uses struct pages for the preserved memory early in boot; however, with
deferred struct page initialization, only a small portion of memory has
properly initialized struct pages at that point.

The problem was detected where the vmemmap is poisoned, and illegal flag
combinations are detected.

Don't allow the two features to be enabled together; later we will have to
teach KHO to work properly with the deferred struct page init feature.

Link: https://lkml.kernel.org/r/20250808201804.772010-3-pasha.tatashin@soleen.com
Fixes: 4e1d010e3b ("kexec: add config option for KHO")
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Baoquan He <bhe@redhat.com>
Cc: Changyuan Lyu <changyuanl@google.com>
Cc: Coiby Xu <coxu@redhat.com>
Cc: Dave Vasilevsky <dave@vasilevsky.ca>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Kees Cook <kees@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-19 16:35:53 -07:00
Pasha Tatashin 63b17b653d kho: init new_physxa->phys_bits to fix lockdep
Patch series "Several KHO Hotfixes".

Three unrelated fixes for Kexec Handover.


This patch (of 3):

Lockdep shows the following warning:

INFO: trying to register non-static key.  The code is fine but needs
lockdep annotation, or maybe you didn't initialize this object before use?
turning off the locking correctness validator.

[<ffffffff810133a6>] dump_stack_lvl+0x66/0xa0
[<ffffffff8136012c>] assign_lock_key+0x10c/0x120
[<ffffffff81358bb4>] register_lock_class+0xf4/0x2f0
[<ffffffff813597ff>] __lock_acquire+0x7f/0x2c40
[<ffffffff81360cb0>] ? __pfx_hlock_conflict+0x10/0x10
[<ffffffff811707be>] ? native_flush_tlb_global+0x8e/0xa0
[<ffffffff8117096e>] ? __flush_tlb_all+0x4e/0xa0
[<ffffffff81172fc2>] ? __kernel_map_pages+0x112/0x140
[<ffffffff813ec327>] ? xa_load_or_alloc+0x67/0xe0
[<ffffffff81359556>] lock_acquire+0xe6/0x280
[<ffffffff813ec327>] ? xa_load_or_alloc+0x67/0xe0
[<ffffffff8100b9e0>] _raw_spin_lock+0x30/0x40
[<ffffffff813ec327>] ? xa_load_or_alloc+0x67/0xe0
[<ffffffff813ec327>] xa_load_or_alloc+0x67/0xe0
[<ffffffff813eb4c0>] kho_preserve_folio+0x90/0x100
[<ffffffff813ebb7f>] __kho_finalize+0xcf/0x400
[<ffffffff813ebef4>] kho_finalize+0x34/0x70

This is because the xarray has its own lock, which is not initialized in
xa_load_or_alloc().

Modify __kho_preserve_order() to properly call
xa_init(&new_physxa->phys_bits);
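
A sketch of the fix:

  new_physxa = kzalloc(sizeof(*new_physxa), GFP_KERNEL);
  if (!new_physxa)
          return -ENOMEM;

  /* initialize the embedded xarray, including the lock lockdep tracks */
  xa_init(&new_physxa->phys_bits);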

Link: https://lkml.kernel.org/r/20250808201804.772010-2-pasha.tatashin@soleen.com
Fixes: fc33e4b44b ("kexec: enable KHO support for memory preservation")
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Baoquan He <bhe@redhat.com>
Cc: Changyuan Lyu <changyuanl@google.com>
Cc: Coiby Xu <coxu@redhat.com>
Cc: Dave Vasilevsky <dave@vasilevsky.ca>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Pratyush Yadav <pratyush@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-19 16:35:53 -07:00
Linus Torvalds 055f213075 vfs-6.17-rc3.fixes

Merge tag 'vfs-6.17-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs fixes from Christian Brauner:

 - Fix two memory leaks in pidfs

 - Prevent changing the idmapping of an already idmapped mount without
   OPEN_TREE_CLONE through open_tree_attr()

 - Don't fail listing extended attributes in kernfs when no extended
   attributes are set

 - Fix the return value in coredump_parse()

 - Fix the error handling for unbuffered writes in netfs

 - Fix broken data integrity guarantees for O_SYNC writes via iomap

 - Fix UAF in __mark_inode_dirty()

 - Keep inode->i_blkbits constant in fuse

 - Fix coredump selftests

 - Fix get_unused_fd_flags() usage in do_handle_open()

 - Rename EXPORT_SYMBOL_GPL_FOR_MODULES to EXPORT_SYMBOL_FOR_MODULES

 - Fix use-after-free in bh_read()

 - Fix incorrect lflags value in the move_mount() syscall

* tag 'vfs-6.17-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  signal: Fix memory leak for PIDFD_SELF* sentinels
  kernfs: don't fail listing extended attributes
  coredump: Fix return value in coredump_parse()
  fs/buffer: fix use-after-free when call bh_read() helper
  pidfs: Fix memory leak in pidfd_info()
  netfs: Fix unbuffered write error handling
  fhandle: do_handle_open() should get FD with user flags
  module: Rename EXPORT_SYMBOL_GPL_FOR_MODULES to EXPORT_SYMBOL_FOR_MODULES
  fs: fix incorrect lflags value in the move_mount syscall
  selftests/coredump: Remove the read() that fails the test
  fuse: keep inode->i_blkbits constant
  iomap: Fix broken data integrity guarantees for O_SYNC writes
  selftests/mount_setattr: add smoke tests for open_tree_attr(2) bug
  open_tree_attr: do not allow id-mapping changes without OPEN_TREE_CLONE
  fs: writeback: fix use-after-free in __mark_inode_dirty()
2025-08-19 09:54:47 -07:00
Adrian Huang (Lenovo) a2c1f82618 signal: Fix memory leak for PIDFD_SELF* sentinels
Commit f08d0c3a71 ("pidfd: add PIDFD_SELF* sentinels to refer to own
thread/process") introduced a leak by acquiring a pid reference through
get_task_pid(), which increments pid->count but never drops it with
put_pid().

As a result, kmemleak reports unreferenced pid objects after running
tools/testing/selftests/pidfd/pidfd_test, for example:

  unreferenced object 0xff1100206757a940 (size 160):
    comm "pidfd_test", pid 16965, jiffies 4294853028
    hex dump (first 32 bytes):
      01 00 00 00 00 00 00 00 00 00 00 00 fd 57 50 04  .............WP.
      5e 44 00 00 00 00 00 00 18 de 34 17 01 00 11 ff  ^D........4.....
    backtrace (crc cd8844d4):
      kmem_cache_alloc_noprof+0x2f4/0x3f0
      alloc_pid+0x54/0x3d0
      copy_process+0xd58/0x1740
      kernel_clone+0x99/0x3b0
      __do_sys_clone3+0xbe/0x100
      do_syscall_64+0x7b/0x2c0
      entry_SYSCALL_64_after_hwframe+0x76/0x7e

Fix this by calling put_pid() after do_pidfd_send_signal() returns.
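
A sketch of the balanced reference; the do_pidfd_send_signal() parameters
are assumptions:

  struct pid *pid = get_task_pid(current, PIDTYPE_TGID);

  ret = do_pidfd_send_signal(pid, sig, info, flags);
  put_pid(pid);   /* drop the reference taken by get_task_pid() */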

Fixes: f08d0c3a71 ("pidfd: add PIDFD_SELF* sentinels to refer to own thread/process")
Signed-off-by: Adrian Huang (Lenovo) <adrianhuang0701@gmail.com>
Link: https://lore.kernel.org/20250818134310.12273-1-adrianhuang0701@gmail.com
Tested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-08-19 13:51:28 +02:00
Oleg Nesterov b1afcaddd6
pid: change bacct_add_tsk() to use task_ppid_nr_ns()
to simplify the code.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/20250810173615.GA20000@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-08-19 13:38:20 +02:00
Oleg Nesterov abdfd4948e
pid: make __task_pid_nr_ns(ns => NULL) safe for zombie callers
task_pid_vnr(another_task) will crash if the caller was already reaped.
The pid_alive(current) check can't really help: the parent/debugger can
call release_task() right after this check.

This also means that even task_ppid_nr_ns(current, NULL) is not safe;
pid_alive() only ensures that it is safe to dereference ->real_parent.

Change __task_pid_nr_ns() to ensure ns != NULL.

Originally-by: 高翔 <gaoxiang17@xiaomi.com>
Link: https://lore.kernel.org/all/20250802022123.3536934-1-gxxa03070307@gmail.com/
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/20250810173604.GA19991@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-08-19 13:38:20 +02:00
gaoxiang17 006568ab4c
pid: Add a judgment for ns null in pid_nr_ns
__task_pid_nr_ns
        ns = task_active_pid_ns(current);
        pid_nr_ns(rcu_dereference(*task_pid_ptr(task, type)), ns);
                if (pid && ns->level <= pid->level) {

Sometimes task_active_pid_ns() returns NULL, which then triggers a kernel panic in pid_nr_ns().

For example:
	Unable to handle kernel NULL pointer dereference at virtual address 0000000000000058
	Mem abort info:
	ESR = 0x0000000096000007
	EC = 0x25: DABT (current EL), IL = 32 bits
	SET = 0, FnV = 0
	EA = 0, S1PTW = 0
	FSC = 0x07: level 3 translation fault
	Data abort info:
	ISV = 0, ISS = 0x00000007, ISS2 = 0x00000000
	CM = 0, WnR = 0, TnD = 0, TagAccess = 0
	GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
	user pgtable: 4k pages, 39-bit VAs, pgdp=00000002175aa000
	[0000000000000058] pgd=08000002175ab003, p4d=08000002175ab003, pud=08000002175ab003, pmd=08000002175be003, pte=0000000000000000
	pstate: 834000c5 (Nzcv daIF +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
	pc : __task_pid_nr_ns+0x74/0xd0
	lr : __task_pid_nr_ns+0x24/0xd0
	sp : ffffffc08001bd10
	x29: ffffffc08001bd10 x28: ffffffd4422b2000 x27: 0000000000000001
	x26: ffffffd442821168 x25: ffffffd442821000 x24: 00000f89492eab31
	x23: 00000000000000c0 x22: ffffff806f5693c0 x21: ffffff806f5693c0
	x20: 0000000000000001 x19: 0000000000000000 x18: 0000000000000000
	x17: 00000000529c6ef0 x16: 00000000529c6ef0 x15: 00000000023a1adc
	x14: 0000000000000003 x13: 00000000007ef6d8 x12: 001167c391c78800
	x11: 00ffffffffffffff x10: 0000000000000000 x9 : 0000000000000001
	x8 : ffffff80816fa3c0 x7 : 0000000000000000 x6 : 49534d702d535449
	x5 : ffffffc080c4c2c0 x4 : ffffffd43ee128c8 x3 : ffffffd43ee124dc
	x2 : 0000000000000000 x1 : 0000000000000001 x0 : ffffff806f5693c0
	Call trace:
	__task_pid_nr_ns+0x74/0xd0
	...
	__handle_irq_event_percpu+0xd4/0x284
	handle_irq_event+0x48/0xb0
	handle_fasteoi_irq+0x160/0x2d8
	generic_handle_domain_irq+0x44/0x60
	gic_handle_irq+0x4c/0x114
	call_on_irq_stack+0x3c/0x74
	do_interrupt_handler+0x4c/0x84
	el1_interrupt+0x34/0x58
	el1h_64_irq_handler+0x18/0x24
	el1h_64_irq+0x68/0x6c
	account_kernel_stack+0x60/0x144
	exit_task_stack_account+0x1c/0x80
	do_exit+0x7e4/0xaf8
	...
	get_signal+0x7bc/0x8d8
	do_notify_resume+0x128/0x828
	el0_svc+0x6c/0x70
	el0t_64_sync_handler+0x68/0xbc
	el0t_64_sync+0x1a8/0x1ac
	Code: 35fffe54 911a02a8 f9400108 b4000128 (b9405a69)
	---[ end trace 0000000000000000 ]---
	Kernel panic - not syncing: Oops: Fatal exception in interrupt
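
A hedged sketch of the guarded check, based on the upstream shape of
pid_nr_ns() (the surrounding function body is reconstructed for
illustration, not quoted from the patch):

  pid_t pid_nr_ns(struct pid *pid, struct pid_namespace *ns)
  {
          struct upid *upid;
          pid_t nr = 0;

          /* the added "ns &&" judgment guards the ns->level dereference */
          if (pid && ns && ns->level <= pid->level) {
                  upid = &pid->numbers[ns->level];
                  if (upid->ns == ns)
                          nr = upid->nr;
          }
          return nr;
  }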

Signed-off-by: gaoxiang17 <gaoxiang17@xiaomi.com>
Link: https://lore.kernel.org/20250802022123.3536934-1-gxxa03070307@gmail.com
Reviewed-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-08-19 13:38:20 +02:00
Thorsten Blum 800348aa34 kcsan: test: Replace deprecated strcpy() with strscpy()
strcpy() is deprecated; use strscpy() instead.

Link: https://github.com/KSPP/linux/issues/88
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: Justin Stitt <justinstitt@google.com>
Signed-off-by: Marco Elver <elver@google.com>
2025-08-19 12:52:12 +02:00
Martin KaFai Lau 5c42715e63 Merge branch 'bpf-next/skb-meta-dynptr' into 'bpf-next/master'
Merge 'skb-meta-dynptr' branch into 'master' branch. No conflict.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2025-08-18 17:59:26 -07:00
Martin KaFai Lau 7e1371023a Merge branch 'bpf-next/skb-meta-dynptr' into 'bpf-next/net'
Merge 'skb-meta-dynptr' branch into 'net' branch. No conflict.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2025-08-18 17:58:21 -07:00
Jakub Sitnicki 6877cd392b bpf: Enable read/write access to skb metadata through a dynptr
Now that we can create a dynptr to skb metadata, make reads to the metadata
area possible with bpf_dynptr_read() or through a bpf_dynptr_slice(), and
make writes to the metadata area possible with bpf_dynptr_write() or
through a bpf_dynptr_slice_rdwr().

Note that for cloned skbs which share data with the original, we limit the
skb metadata dynptr to be read-only since we don't unclone on a
bpf_dynptr_write to metadata.
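
A minimal sketch of the resulting access pattern from a TC program (the
bpf_dynptr_from_skb_meta() kfunc name and its extern declaration are
assumptions based on this series):

  SEC("tc")
  int meta_reader(struct __sk_buff *skb)
  {
          struct bpf_dynptr meta;
          __u32 tag = 0;

          if (bpf_dynptr_from_skb_meta(skb, 0, &meta))    /* assumed kfunc */
                  return 0;                               /* TC_ACT_OK */
          /* bpf_dynptr_read(dst, len, src, offset, flags) */
          if (bpf_dynptr_read(&tag, sizeof(tag), &meta, 0, 0))
                  return 0;
          return tag ? 2 /* TC_ACT_SHOT */ : 0;
  }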

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20250814-skb-metadata-thru-dynptr-v7-2-8a39e636e0fb@cloudflare.com
2025-08-18 10:29:42 -07:00
Jakub Sitnicki 89d912e494 bpf: Add dynptr type for skb metadata
Add a dynptr type, similar to skb dynptr, but for the skb metadata access.

The dynptr provides an alternative to __sk_buff->data_meta for accessing
the custom metadata area allocated using the bpf_xdp_adjust_meta() helper.

More importantly, it abstracts away the fact where the storage for the
custom metadata lives, which opens up the way to persist the metadata by
relocating it as the skb travels through the network stack layers.

Writes to skb metadata invalidate any existing skb payload and metadata
slices. While this is more restrictive than needed at the moment, it leaves
the door open to reallocating the metadata on writes, and should be only a
minor inconvenience to the users.

Only the program types which can access __sk_buff->data_meta today are
allowed to create a dynptr for skb metadata at the moment. We need to
modify the network stack to persist the metadata across layers before
opening up access to other BPF hooks.

Once more BPF hooks gain access to skb_meta dynptr, we will also need to
add a read-only variant of the helper similar to
bpf_dynptr_from_skb_rdonly.

skb_meta dynptr ops are stubbed out and implemented by subsequent changes.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Jesse Brandeburg <jbrandeburg@cloudflare.com>
Link: https://patch.msgid.link/20250814-skb-metadata-thru-dynptr-v7-1-8a39e636e0fb@cloudflare.com
2025-08-18 10:29:42 -07:00
Anton Protopopov dbe99ea541 bpf: Add a verbose message when the BTF limit is reached
When a BPF program being loaded reaches the map limit
(MAX_USED_MAPS) or the BTF limit (MAX_USED_BTFS), -E2BIG is
returned. However, in the former case there is an accompanying
verifier verbose message, and in the latter case there is not.
Add a verbose message to make the behaviour symmetrical.
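
The shape of such a check, as a hedged sketch (the message text is
illustrative, not the exact patch wording):

  if (env->used_btf_cnt >= MAX_USED_BTFS) {
          verbose(env, "The total number of BTFs per program has reached the limit of %u\n",
                  MAX_USED_BTFS);
          return -E2BIG;
  }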

Reported-by: Kevin Sheldrake <kevin.sheldrake@isovalent.com>
Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20250816151554.902995-1-a.s.protopopov@gmail.com
2025-08-18 17:27:01 +02:00
Fushuai Wang d87fdb1f27 bpf: Replace get_next_cpu() with cpumask_next_wrap()
The get_next_cpu() function was only used in one place to find
the next possible CPU, which can be replaced by cpumask_next_wrap().
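
Roughly, the change collapses an open-coded wrap-around search into a
single call (sketch; the two-argument cpumask_next_wrap() form and the
mask used are assumptions):

  /* before: open-coded wrap-around search */
  cpu = cpumask_next(cpu, cpu_possible_mask);
  if (cpu >= nr_cpu_ids)
          cpu = cpumask_first(cpu_possible_mask);

  /* after */
  cpu = cpumask_next_wrap(cpu, cpu_possible_mask);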

Signed-off-by: Fushuai Wang <wangfushuai@baidu.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20250818032344.23229-1-wangfushuai@baidu.com
2025-08-18 15:11:02 +02:00
Linus Torvalds 0a9ee9ce49 - Make sure sanity checks down in the mutex lock path happen on the correct
type of task so that they don't trigger falsely
 
 - Use the write unsafe user access pairs when writing a futex value to prevent
   an error on PowerPC which does user read and write accesses differently
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmihlaAACgkQEsHwGGHe
 VUrV7g/+Kx4n1TQuGnk4kd5h5q0uls8mgeFddYjv6BgVxcaWq7Tzv7XMJ5hvWEqp
 P/+Zt43Sv9sd7i+PhoFD2Lr+EDYx8c0Lp08/LH0zsgKIA2Ai8ntJHJcb3se3Kxr5
 yV23d0tkrCijcB58OL1xncm96Lp3XoXyTz8b0ahKDNG9mS8F/9XK9GgmG9OqKXDg
 7T8Vx5NKt0YrvAWwvsQQlTUTcQ4a4O/UMwJmgEbvqHn0WwQISRxx/TE6wYuIwWAj
 pbrN5kzDsZ6tA07h48NWnkFOEeqsQgbbKDkWvYYRYBrVzEATQBpBWfSQ0HsaqPmc
 1Mk5Zs+J5UFhHx7Yw348JVqw5Fl4VDT4Oi4AoIzjBym3c73nrNfZzESRsf4dES5Q
 DBsgTb0tjEZcR7MrWWErYu1LXw1qP5Ib39qLDVIvQQ4HomctSUuXVTIRL9qJvaCK
 aCPt2Ivkhj3wItZSeTfzLTXbWE9lP4AhuBpJ4ALHRbOaCRLNfK9ZzLfOUyKePUvx
 s3j7iPubfS5/lw192z5weLzEE4e8E7wIxSkIQNKLQFI/kr5YwKfwEO5Zm+UMfH5j
 m+Hl7YKS0nT2IlFbRel2cSkw4MDaEJjgahMzbp+D0p+xV2H4KjY4nLoarwsuoP8D
 GxLAOmRW1nzqj3QHIWsBF9iBxkO89lshWOgGxhUbhtywNqxSV6M=
 =q1tX
 -----END PGP SIGNATURE-----

Merge tag 'locking_urgent_for_v6.17_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking fixes from Borislav Petkov:

 - Make sure sanity checks down in the mutex lock path happen on the
   correct type of task so that they don't trigger falsely

 - Use the write unsafe user access pairs when writing a futex value to
   prevent an error on PowerPC which does user read and write accesses
   differently

* tag 'locking_urgent_for_v6.17_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking: Fix __clear_task_blocked_on() warning from __ww_mutex_wound() path
  futex: Use user_write_access_begin/_end() in futex_put_value()
2025-08-17 05:57:47 -07:00
Thorsten Blum 5eb4b9a4cd
params: Replace deprecated strcpy() with strscpy() and memcpy()
strcpy() is deprecated; use strscpy() and memcpy() instead.

In param_set_copystring(), we can safely use memcpy() because we already
know the length of the source string 'val' and that it is guaranteed to
be NUL-terminated within the first 'kps->maxlen' bytes.
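
A minimal sketch of the pattern described above (simplified from
param_set_copystring()):

  if (strnlen(val, kps->maxlen) == kps->maxlen)
          return -ENOSPC;                 /* would not fit, reject */
  /* length validated above, so a plain bounded copy is safe */
  memcpy(kps->string, val, strlen(val) + 1);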

Link: https://github.com/KSPP/linux/issues/88
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: Daniel Gomez <da.gomez@samsung.com>
Reviewed-by: Petr Pavlu <petr.pavlu@suse.com>
Link: https://lore.kernel.org/r/20250813132200.184064-2-thorsten.blum@linux.dev
Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
2025-08-16 21:47:25 +02:00
Tao Chen abdaf49be5 bpf: Remove migrate_disable in kprobe_multi_link_prog_run
The graph tracer framework ensures we won't migrate: kprobe_multi_link_prog_run
is called all the way from the graph tracer, which disables preemption in
function_graph_enter_regs. As Jiri and Yonghong suggested, there is no need
to use migrate_disable, so some overhead can be avoided. Also add a
cant_sleep check for __this_cpu_inc_return.

Fixes: 0dcac27254 ("bpf: Add multi kprobe link")
Signed-off-by: Tao Chen <chen.dylane@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250814121430.2347454-1-chen.dylane@linux.dev
2025-08-15 16:49:31 -07:00
Thomas Gleixner 448f97fba9 perf: Convert mmap() refcounts to refcount_t
The recently fixed reference count leaks could have been detected by using
refcount_t and refcount_t would have mitigated the potential overflow at
least.

Now that the code is properly structured, convert the mmap() related
mmap_count variants over to refcount_t.

No functional change intended.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://lore.kernel.org/r/20250812104020.071507932@infradead.org
2025-08-15 13:13:02 +02:00
Peter Zijlstra 59741451b4 perf: Identify the 0->1 transition for event::mmap_count
Needed because refcount_inc() doesn't allow the 0->1 transition.

Specifically, this is the case where we've created the RB, this means
there was no RB, and as such there could not have been an mmap.
Additionally we hold mmap_mutex to serialize everything.

This must therefore be the first mmap.
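
In refcount_t terms the distinction looks like this (hedged sketch; the
surrounding locking and field names follow the message above):

  if (!event->rb) {
          /* we just created the RB under mmap_mutex: first mmap */
          refcount_set(&event->mmap_count, 1);    /* 0 -> 1 must be a set */
  } else {
          refcount_inc(&event->mmap_count);       /* n -> n+1, n > 0 */
  }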

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250812104019.956479989@infradead.org
2025-08-15 13:13:02 +02:00
Peter Zijlstra d23a6dbc0a perf: Use scoped_guard() for mmap_mutex in perf_mmap()
Mostly just re-indent noise.
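
For reference, the cleanup.h pattern being applied (sketch):

  scoped_guard(mutex, &event->mmap_mutex) {
          /* runs with mmap_mutex held; the lock is dropped
           * automatically when the scope is left, on any path */
  }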

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://lore.kernel.org/r/20250812104019.838047976@infradead.org
2025-08-15 13:13:01 +02:00
Peter Zijlstra 5d299897f1 perf: Split out the RB allocation
Move the RB buffer allocation branch into its own function.

Originally-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://lore.kernel.org/r/20250812104019.722214699@infradead.org
2025-08-15 13:13:01 +02:00
Peter Zijlstra 191759e5ea perf: Make RB allocation branch self sufficient
Ensure @rb usage doesn't extend out of the branch block.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://lore.kernel.org/r/20250812104019.605285302@infradead.org
2025-08-15 13:13:01 +02:00
Peter Zijlstra 2aee376823 perf: Split out the AUX buffer allocation
Move the AUX buffer allocation branch into its own function.

Originally-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://lore.kernel.org/r/20250812104019.494205648@infradead.org
2025-08-15 13:13:00 +02:00
Peter Zijlstra 8558dca9fb perf: Reflow to get rid of aux_success label
Mostly re-indent noise needed to get rid of that label.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://lore.kernel.org/r/20250812104019.362581570@infradead.org
2025-08-15 13:13:00 +02:00
Peter Zijlstra b33a51564e perf: Use guard() for aux_mutex in perf_mmap()
After duplicating the common code into the rb/aux branches, it is
possible to use a simple guard() for the aux_mutex, making the aux
branch self-contained.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://lore.kernel.org/r/20250812104019.246250452@infradead.org
2025-08-15 13:13:00 +02:00
Peter Zijlstra 41b80e1d74 perf: Remove redundant aux_unlock label
unlock and aux_unlock are now identical, remove the aux_unlock one.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://lore.kernel.org/r/20250812104019.131293512@infradead.org
2025-08-15 13:13:00 +02:00
Peter Zijlstra 4118994b33 perf: Move common code into both rb and aux branches
if (cond) {
    A;
  } else {
    B;
  }
  C;

into

  if (cond) {
    A;
    C;
  } else {
    B;
    C;
  }

Notably C has a success branch and both A and B have two places for
success. For A (rb case), duplicate the success case because later
patches will result in them no longer being identical. For B (aux
case), share using goto (cleaned up later).

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://lore.kernel.org/r/20250812104019.016252852@infradead.org
2025-08-15 13:12:59 +02:00
Peter Zijlstra 3821f25868 perf: Merge consecutive conditionals in perf_mmap()
if (cond) {
    A;
  } else {
    B;
  }

  if (cond) {
    C;
  } else {
    D;
  }

into:

  if (cond) {
    A;
    C;
  } else {
    B;
    D;
  }

Notably the conditions are not identical in form, but are equivalent.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://lore.kernel.org/r/20250812104018.900078502@infradead.org
2025-08-15 13:12:59 +02:00
Peter Zijlstra 86a0a7c598 perf: Move perf_mmap_calc_limits() into both rb and aux branches
if (cond) {
    A;
  } else {
    B;
  }
  C;

into

  if (cond) {
    A;
    C;
  } else {
    B;
    C;
  }

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://lore.kernel.org/r/20250812104018.781244099@infradead.org
2025-08-15 13:12:59 +02:00
Thomas Gleixner 1ea3e3b0da perf: Split out VM accounting
Similarly to the mlock limit calculation the VM accounting is required for
both the ringbuffer and the AUX buffer allocations.

To prepare for splitting them out into separate functions, move the
accounting into a helper function.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://lore.kernel.org/r/20250812104018.660347811@infradead.org
2025-08-15 13:12:59 +02:00
Thomas Gleixner 81e026ca47 perf: Split out mlock limit handling
To prepare for splitting the buffer allocation out into separate functions
for the ring buffer and the AUX buffer, split out mlock limit handling into
a helper function, which can be called from both.

No functional change intended.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://lore.kernel.org/r/20250812104018.541975109@infradead.org
2025-08-15 13:12:58 +02:00
Thomas Gleixner e8c4f6ee8e perf: Remove redundant condition for AUX buffer size
It is already checked whether the VMA size is the same as
nr_pages * PAGE_SIZE, so later checking both:

      aux_size == vma_size && aux_size == nr_pages * PAGE_SIZE

is redundant. Remove the vma_size check as nr_pages is what is actually
used in the allocation function. That prepares for splitting out the buffer
allocation into separate functions, so that only nr_pages needs to be
handed in.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://lore.kernel.org/r/20250812104018.424519320@infradead.org
2025-08-15 13:12:58 +02:00
Yunseong Kim b64fdd422a perf: Avoid undefined behavior from stopping/starting inactive events
Calling pmu->start()/stop() on perf events in PERF_EVENT_STATE_OFF can
leave event->hw.idx at -1. When PMU drivers later attempt to use this
negative index as a shift exponent in bitwise operations, it leads to UBSAN
shift-out-of-bounds reports.

The issue is a logical flaw in how event groups handle throttling when some
members are intentionally disabled, based on the analysis and the
reproducer provided by Mark Rutland (the issue reproduces on both arm64 and
x86-64).

The scenario unfolds as follows:

 1. A group leader event is configured with a very aggressive sampling
    period (e.g., sample_period = 1). This causes frequent interrupts and
    triggers the throttling mechanism.
 2. A child event in the same group is created in a disabled state
    (.disabled = 1). This event remains in PERF_EVENT_STATE_OFF.
    Since it hasn't been scheduled onto the PMU, its event->hw.idx remains
    initialized at -1.
 3. When throttling occurs, perf_event_throttle_group() and later
    perf_event_unthrottle_group() iterate through all siblings, including
    the disabled child event.
 4. perf_event_throttle()/unthrottle() are called on this inactive child
    event, which then call event->pmu->start()/stop().
 5. The PMU driver receives the event with hw.idx == -1 and attempts to
    use it as a shift exponent. e.g., in macros like PMCNTENSET(idx),
    leading to the UBSAN report.

The throttling mechanism attempts to start/stop events that are not
actively scheduled on the hardware.

Move the state check into perf_event_throttle()/perf_event_unthrottle() so
that inactive events are skipped entirely. This ensures only active events
with a valid hw.idx are processed, preventing undefined behavior and
silencing UBSAN warnings. The corrected check ensures the event is actually
active before proceeding with PMU operations.
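
A hedged sketch of where the check lands (details simplified):

  static void perf_event_throttle(struct perf_event *event)
  {
          /* inactive events were never scheduled in; hw.idx may be -1 */
          if (event->state != PERF_EVENT_STATE_ACTIVE)
                  return;
          event->pmu->stop(event, 0);
  }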

The problem can be reproduced with the syzkaller reproducer.

Fixes: 9734e25fbf ("perf: Fix the throttle logic for a group")
Signed-off-by: Yunseong Kim <ysk@kzalloc.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Link: https://lore.kernel.org/r/20250812181046.292382-2-ysk@kzalloc.com
2025-08-15 13:12:56 +02:00
Jiri Olsa e4414b01c1 bpf: Check the helper function is valid in get_helper_proto
kernel test robot reported verifier bug [1] where the helper func
pointer could be NULL due to disabled config option.

As Alexei suggested we could check on that in get_helper_proto
directly. Marking tail_call helper func with BPF_PTR_POISON,
because it is unused by design.

  [1] https://lore.kernel.org/oe-lkp/202507160818.68358831-lkp@intel.com

Reported-by: kernel test robot <oliver.sang@intel.com>
Reported-by: syzbot+a9ed3d9132939852d0df@syzkaller.appspotmail.com
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Paul Chaignon <paul.chaignon@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20250814200655.945632-1-jolsa@kernel.org
Closes: https://lore.kernel.org/oe-lkp/202507160818.68358831-lkp@intel.com
2025-08-15 11:16:56 +02:00
Jesper Dangaard Brouer 2b986b9e91 bpf, cpumap: Disable page_pool direct xdp_return need larger scope
When running an XDP bpf_prog on the remote CPU in the cpumap code,
we must disable the direct return optimization that xdp_return can
perform for mem_type page_pool. This optimization assumes code is
still executing under RX-NAPI of the original receiving CPU, which
isn't true on this remote CPU.

The cpumap code already disabled this via helpers
xdp_set_return_frame_no_direct() and xdp_clear_return_frame_no_direct(),
but the scope didn't include xdp_do_flush().

When doing XDP_REDIRECT towards e.g. devmap, this causes the
function bq_xmit_all() to run with the direct return optimization
enabled. This can lead to hard-to-find bugs. The issue only happens
when bq_xmit_all() cannot ndo_xdp_xmit() all frames and then frees
them via xdp_return_frame_rx_napi().

Fix by expanding scope to include xdp_do_flush(). This was found
by Dragos Tatulea.
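
Schematically, the fix widens the no-direct window (sketch; the
cpu_map_bpf_prog_run() call and its arguments are illustrative):

  xdp_set_return_frame_no_direct();
  nframes = cpu_map_bpf_prog_run(rcpu, frames, n);  /* may XDP_REDIRECT */
  /* flush inside the no-direct scope, so bq_xmit_all()'s
   * xdp_return_frame_rx_napi() calls cannot recycle directly */
  xdp_do_flush();
  xdp_clear_return_frame_no_direct();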

Fixes: 11941f8a85 ("bpf: cpumap: Implement generic cpumap")
Reported-by: Dragos Tatulea <dtatulea@nvidia.com>
Reported-by: Chris Arges <carges@cloudflare.com>
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Chris Arges <carges@cloudflare.com>
Link: https://patch.msgid.link/175519587755.3008742.1088294435150406835.stgit@firesoul
2025-08-15 11:08:08 +02:00
Linus Torvalds 63467137ec Including fixes from Netfilter and IPsec.
Current release - regressions:
 
   - netfilter: nft_set_pipapo:
     - don't return bogus extension pointer
     - fix null deref for empty set
 
 Current release - new code bugs:
 
   - core: prevent deadlocks when enabling NAPIs with mixed kthread config
 
   - eth: netdevsim: Fix wild pointer access in nsim_queue_free().
 
 Previous releases - regressions:
 
   - page_pool: allow enabling recycling late, fix false positive warning
 
   - sched: ets: use old 'nbands' while purging unused classes
 
   - xfrm:
     - restore GSO for SW crypto
     - bring back device check in validate_xmit_xfrm
 
   - tls: handle data disappearing from under the TLS ULP
 
   - ptp: prevent possible ABBA deadlock in ptp_clock_freerun()
 
   - eth: bnxt: fill data page pool with frags if PAGE_SIZE > BNXT_RX_PAGE_SIZE
 
   - eth: hv_netvsc: fix panic during namespace deletion with VF
 
 Previous releases - always broken:
 
   - netfilter: fix refcount leak on table dump
 
   - vsock: do not allow binding to VMADDR_PORT_ANY
 
   - sctp: linearize cloned gso packets in sctp_rcv
 
   - eth: hibmcge: fix the division by zero issue
 
   - eth: microchip: fix KSZ8863 reset problem
 
 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmidxhgSHHBhYmVuaUBy
 ZWRoYXQuY29tAAoJECkkeY3MjxOkPckP/AhJ0kARPgo72OhElbW6KvkHhXVNUTJb
 6m15j9Z8ybQPBupSorxUEBx7pxczv/GBloN1xoJqUY/p7B6kLO2HDOpaReNFjZvF
 J9hPl6ZF6CWzwgfcUfwI3UB1zhALKyHDClclfd8FBoKjAYXCrZXuPv/AV4oqYsA1
 y7g24zpI76Cu+M+Nf5YhrlIVSmQ1/DXX8gifdcHFYnAKmCn7KxNv2lwvm2/yE2lL
 9/Xl/D1cG/BiAaCUUXR4BP8RU5gdW72+lM3qmC/QFO/7jcBaoE89Y9Anona8p0PQ
 dqDerHd0GDUH9QA6bht3asCS+IW+Zo2gf25o53OzlYIMAxDmEZLUBCwetJhvNJBq
 DUQ6agzfNRxsCnlOc4JhMOqNr7rdU7d+9c9KuZWA/m8KRWdlvTYGJd/qzSlTWOhq
 s9228dl+4oTb9Mnq8Bqafi42+TImeOyFRW9ZgF8ptjlF0l/lyv6moIrRCmVXppRZ
 awABNDdG+i004XmAOAeOhjbUT7clLkLr+KEnsfH16qCa2o3dM6rlhvWYp2sucVJf
 SyRvMdz5VqMLgruefpQS/DuK52UklpRawgvgngzU6UDYQUaxQKToeusMjRU7xUnW
 hVI1y7/oNH6+r7Zr/iLTLKRR007B+RVC7VSbeMpxmAW+n6puMb+z7RnrJlnFapGM
 qqqtk2/jItuK
 =Ydk1
 -----END PGP SIGNATURE-----

Merge tag 'net-6.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
 "Including fixes from Netfilter and IPsec.

  Current release - regressions:

   - netfilter: nft_set_pipapo:
      - don't return bogus extension pointer
      - fix null deref for empty set

  Current release - new code bugs:

   - core: prevent deadlocks when enabling NAPIs with mixed kthread
     config

   - eth: netdevsim: Fix wild pointer access in nsim_queue_free().

  Previous releases - regressions:

   - page_pool: allow enabling recycling late, fix false positive
     warning

   - sched: ets: use old 'nbands' while purging unused classes

   - xfrm:
      - restore GSO for SW crypto
      - bring back device check in validate_xmit_xfrm

   - tls: handle data disappearing from under the TLS ULP

   - ptp: prevent possible ABBA deadlock in ptp_clock_freerun()

   - eth:
      - bnxt: fill data page pool with frags if PAGE_SIZE > BNXT_RX_PAGE_SIZE
      - hv_netvsc: fix panic during namespace deletion with VF

  Previous releases - always broken:

   - netfilter: fix refcount leak on table dump

   - vsock: do not allow binding to VMADDR_PORT_ANY

   - sctp: linearize cloned gso packets in sctp_rcv

   - eth:
      - hibmcge: fix the division by zero issue
      - microchip: fix KSZ8863 reset problem"

* tag 'net-6.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (54 commits)
  net: usb: asix_devices: add phy_mask for ax88772 mdio bus
  net: kcm: Fix race condition in kcm_unattach()
  selftests: net/forwarding: test purge of active DWRR classes
  net/sched: ets: use old 'nbands' while purging unused classes
  bnxt: fill data page pool with frags if PAGE_SIZE > BNXT_RX_PAGE_SIZE
  netdevsim: Fix wild pointer access in nsim_queue_free().
  net: mctp: Fix bad kfree_skb in bind lookup test
  netfilter: nf_tables: reject duplicate device on updates
  ipvs: Fix estimator kthreads preferred affinity
  netfilter: nft_set_pipapo: fix null deref for empty set
  selftests: tls: test TCP stealing data from under the TLS socket
  tls: handle data disappearing from under the TLS ULP
  ptp: prevent possible ABBA deadlock in ptp_clock_freerun()
  ixgbe: prevent from unwanted interface name changes
  devlink: let driver opt out of automatic phys_port_name generation
  net: prevent deadlocks when enabling NAPIs with mixed kthread config
  net: update NAPI threaded config even for disabled NAPIs
  selftests: drv-net: don't assume device has only 2 queues
  docs: Fix name for net.ipv4.udp_child_hash_entries
  riscv: dts: thead: Add APB clocks for TH1520 GMACs
  ...
2025-08-14 07:14:30 -07:00
Chen Ridong 4c70fb2624 cpuset: remove redundant CS_ONLINE flag
The CS_ONLINE flag was introduced prior to the CSS_ONLINE flag in the
cpuset subsystem. Currently, the flag setting sequence is as follows:

1. cpuset_css_online() sets CS_ONLINE
2. css->flags gets CSS_ONLINE set
...
3. cgroup->kill_css sets CSS_DYING
4. cpuset_css_offline() clears CS_ONLINE
5. css->flags clears CSS_ONLINE

The is_cpuset_online() check currently occurs between steps 1 and 3.
However, it would be equally safe to perform this check between steps 2
and 3, as CSS_ONLINE provides the same synchronization guarantee as
CS_ONLINE.

Since CS_ONLINE is redundant with CSS_ONLINE and provides no additional
synchronization benefits, we can safely remove it to simplify the code.
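
After the change, the online test can be expressed purely in terms of
css state, along these lines (hedged sketch):

  static inline bool is_cpuset_online(struct cpuset *cs)
  {
          return likely(cs->css.flags & CSS_ONLINE) &&
                 !css_is_dying(&cs->css);
  }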

Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-08-13 08:14:20 -10:00
Dan Carpenter 70d0085864 audit: add a missing tab
Someone got a bit carried away deleting tabs.  Add it back.

Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2025-08-13 11:51:43 -04:00
Shanker Donthineni 89a2d212bd dma/pool: Ensure DMA_DIRECT_REMAP allocations are decrypted
When CONFIG_DMA_DIRECT_REMAP is enabled, atomic pool pages are
remapped via dma_common_contiguous_remap() using the supplied
pgprot. Currently, the mapping uses
pgprot_dmacoherent(PAGE_KERNEL), which leaves the memory encrypted
on systems with memory encryption enabled (e.g., ARM CCA Realms).

This can cause the DMA layer to fail or crash when accessing the
memory, as the underlying physical pages are not configured as
expected.

Fix this by requesting a decrypted mapping in the vmap() call:
pgprot_decrypted(pgprot_dmacoherent(PAGE_KERNEL))

This ensures that atomic pool memory is consistently mapped
unencrypted.
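
The change amounts to a single pgprot adjustment at the remap site
(sketch):

  addr = dma_common_contiguous_remap(page, pool_size,
                  pgprot_decrypted(pgprot_dmacoherent(PAGE_KERNEL)),
                  __builtin_return_address(0));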

Cc: stable@vger.kernel.org
Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/20250811181759.998805-1-sdonthineni@nvidia.com
2025-08-13 11:02:10 +02:00
John Stultz 21924af67d locking: Fix __clear_task_blocked_on() warning from __ww_mutex_wound() path
The __clear_task_blocked_on() helper added a number of sanity
checks ensuring we hold the mutex wait lock and that the task
we are clearing blocked_on pointer (if set) matches the mutex.

However, there is an edge case in the _ww_mutex_wound() logic
where we need to clear the blocked_on pointer for the task that
owns the mutex, not the task that is waiting on the mutex.

For this case the sanity checks aren't valid, so handle this
by allowing a NULL lock to skip the additional checks.

K Prateek Nayak and Maarten Lankhorst also pointed out that in
this case where we don't hold the owner's mutex wait_lock, we
need to be a bit more careful using READ_ONCE/WRITE_ONCE in both
the __clear_task_blocked_on() and __set_task_blocked_on()
implementations to avoid accidentally tripping WARN_ONs if two
instances race. So do that here as well.

This issue was easier to miss, I realized, as the test-ww_mutex
driver only exercises the wait-die class of ww_mutexes. I've
sent a patch[1] to address this so the logic will be easier to
test.

[1]: https://lore.kernel.org/lkml/20250801023358.562525-2-jstultz@google.com/

Fixes: a4f0b6fef4 ("locking/mutex: Add p->blocked_on wrappers for correctness checks")
Closes: https://lore.kernel.org/lkml/68894443.a00a0220.26d0e1.0015.GAE@google.com/
Reported-by: syzbot+602c4720aed62576cd79@syzkaller.appspotmail.com
Reported-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Acked-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20250805001026.2247040-1-jstultz@google.com
2025-08-13 10:34:54 +02:00
Frederic Weisbecker c0a23bbc98 ipvs: Fix estimator kthreads preferred affinity
The estimator kthreads' affinity is defined by sysctl-overridden
preferences and applied through a plain call to the scheduler's affinity
API.

However, since the introduction of managed kthread preferred affinity,
such a practice shortcuts the kthread core code, which eventually
overwrites the target with the default unbound affinity.

Fix this by using the appropriate kthread API.

Fixes: d1a8919758 ("kthread: Default affine kthread to its preferred NUMA node")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Florian Westphal <fw@strlen.de>
2025-08-13 08:34:33 +02:00
Qianfeng Rong bf0c2a84df bpf: Replace kvfree with kfree for kzalloc memory
The 'backedge' pointer is allocated with kzalloc(), which returns
physically contiguous memory. Using kvfree() to deallocate such
memory is functionally safe but semantically incorrect.

Replace kvfree() with kfree() to avoid unnecessary is_vmalloc_addr()
check in kvfree().
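
The pairing rule in short (sketch; the variable's type is illustrative):

  backedge = kzalloc(sizeof(*backedge), GFP_KERNEL);
  if (!backedge)
          return -ENOMEM;
  /* kzalloc() memory is physically contiguous: pair with kfree().
   * kvfree() also works, but pays an extra is_vmalloc_addr() check. */
  kfree(backedge);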

Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20250811123949.552885-1-rongqianfeng@vivo.com
2025-08-12 15:55:01 -07:00
Qianfeng Rong 3e2b799008 bpf: Remove redundant __GFP_NOWARN
Commit 16f5dfbc85 ("gfp: include __GFP_NOWARN in GFP_NOWAIT")
made GFP_NOWAIT implicitly include __GFP_NOWARN.

Therefore, explicit __GFP_NOWARN combined with GFP_NOWAIT
(e.g., `GFP_NOWAIT | __GFP_NOWARN`) is now redundant. Let's clean
up these redundant flags across subsystems.

No functional changes.
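
The cleanup pattern, in diff form (illustrative):

  -	ptr = kzalloc(size, GFP_NOWAIT | __GFP_NOWARN);
  +	ptr = kzalloc(size, GFP_NOWAIT);	/* __GFP_NOWARN is implied */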

Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20250804122731.460158-1-rongqianfeng@vivo.com
2025-08-12 14:56:04 -07:00
Martin KaFai Lau 9e293d47bf Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Cross merge bpf/master after 6.17-rc1.

No conflict.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2025-08-12 12:14:02 -07:00
Thorsten Blum 8a013ec9cb cgroup: Replace deprecated strcpy() with strscpy()
strcpy() is deprecated; use strscpy() instead.

Link: https://github.com/KSPP/linux/issues/88
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-08-12 08:57:03 -10:00
Andrea Righi ddf7233fca sched/ext: Fix invalid task state transitions on class switch
When enabling a sched_ext scheduler, we may trigger invalid task state
transitions, resulting in warnings like the following (which can be
easily reproduced by running the hotplug selftest in a loop):

 sched_ext: Invalid task state transition 0 -> 3 for fish[770]
 WARNING: CPU: 18 PID: 787 at kernel/sched/ext.c:3862 scx_set_task_state+0x7c/0xc0
 ...
 RIP: 0010:scx_set_task_state+0x7c/0xc0
 ...
 Call Trace:
  <TASK>
  scx_enable_task+0x11f/0x2e0
  switching_to_scx+0x24/0x110
  scx_enable.isra.0+0xd14/0x13d0
  bpf_struct_ops_link_create+0x136/0x1a0
  __sys_bpf+0x1edd/0x2c30
  __x64_sys_bpf+0x21/0x30
  do_syscall_64+0xbb/0x370
  entry_SYSCALL_64_after_hwframe+0x77/0x7f

This happens because we skip initialization for tasks that are already
dead (with their usage counter set to zero), but we don't exclude them
during the scheduling class transition phase.

Fix this by also skipping dead tasks during class switching, preventing
invalid task state transitions.

Fixes: a8532fac7b ("sched_ext: TASK_DEAD tasks must be switched into SCX on ops_enable")
Cc: stable@vger.kernel.org # v6.12+
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-08-11 06:56:37 -10:00
Waiman Long dfb36e4a8d futex: Use user_write_access_begin/_end() in futex_put_value()
Commit cec199c5e3 ("futex: Implement FUTEX2_NUMA") introduced the
futex_put_value() helper to write a value to the given user
address.

However, it uses user_read_access_begin() before the write. For
architectures that differentiate between read and write accesses, like
PowerPC, futex_put_value() fails with -EFAULT.

Fix that by using the user_write_access_begin/user_write_access_end() pair
instead.
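
A hedged sketch of the corrected helper (masked-access fast paths
omitted):

  static int futex_put_value(u32 val, u32 __user *to)
  {
          if (!user_write_access_begin(to, sizeof(*to)))
                  return -EFAULT;
          unsafe_put_user(val, to, Efault);
          user_write_access_end();
          return 0;
  Efault:
          user_write_access_end();
          return -EFAULT;
  }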

Fixes: cec199c5e3 ("futex: Implement FUTEX2_NUMA")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20250811141147.322261-1-longman@redhat.com
2025-08-11 17:53:21 +02:00
Kieran Moy df1145b56c audit: fix typo in auditfilter.c comment
Correct the misspelling of "searching" (was "serarching")
in the function documentation for audit_update_lsm_rules.

Found via code inspection, no functional impact.

Signed-off-by: Kieran Moy <kfatyuip@gmail.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2025-08-11 11:44:57 -04:00
Thorsten Blum d8c09d7b55 audit: Replace deprecated strcpy() with strscpy()
strcpy() is deprecated; use strscpy() instead.

Link: https://github.com/KSPP/linux/issues/88
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2025-08-11 11:44:55 -04:00
Casey Schaufler c5055d0c8e audit: fix indentation in audit_log_exit()
Fix two indentation errors in audit_log_exit().

Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
[PM: subject tweak]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2025-08-11 11:44:50 -04:00
Oreoluwa Babatunde 2c223f7239 of: reserved_mem: Restructure call site for dma_contiguous_early_fixup()
Restructure the call site for dma_contiguous_early_fixup() to
where the reserved_mem nodes are being parsed from the DT so that
dma_mmu_remap[] is populated before dma_contiguous_remap() is called.

Fixes: 8a6e02d0c0 ("of: reserved_mem: Restructure how the reserved memory regions are processed")
Signed-off-by: Oreoluwa Babatunde <oreoluwa.babatunde@oss.qualcomm.com>
Tested-by: William Zhang <william.zhang@broadcom.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/20250806172421.2748302-1-oreoluwa.babatunde@oss.qualcomm.com
2025-08-11 13:05:38 +02:00
Frederic Weisbecker 61399e0c54 rcu: Fix racy re-initialization of irq_work causing hangs
RCU re-initializes the deferred QS irq work every time before attempting
to queue it. However, there are situations where an attempt is made to
queue the irq work even though it is already queued. In that case,
re-initializing corrupts the irq work queue that is about to be
handled.

The chances for that to happen are higher when the architecture doesn't
support self-IPIs and all irq work is then lazy, such as with the
following sequence:

1) rcu_read_unlock() is called when IRQs are disabled and there is a
   grace period involving blocked tasks on the node. The irq work
   is then initialized and queued.

2) The related tasks are unblocked and the CPU quiescent state
   is reported. rdp->defer_qs_iw_pending is reset to DEFER_QS_IDLE,
   allowing the irq work to be requeued in the future (note the previous
   one hasn't fired yet).

3) A new grace period starts and the node has blocked tasks.

4) rcu_read_unlock() is called when IRQs are disabled again. The irq work
   is re-initialized (but it's queued! and its node is cleared) and
   requeued. Which means it's requeued to itself.

5) The irq work finally fires with the tick. But since it was requeued
   to itself, it loops and hangs.

Fix this by initializing the irq work only once, before the CPU boots.
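
Sketch of the intent (the DEFER_QS_PENDING name and the exact init site
are assumptions; the handler name follows the Fixes commit):

  /* once, at CPU bring-up: */
  rdp->defer_qs_iw = IRQ_WORK_INIT_HARD(rcu_preempt_deferred_qs_handler);

  /* at queue time, no re-initialization: */
  if (rdp->defer_qs_iw_pending == DEFER_QS_IDLE) {
          rdp->defer_qs_iw_pending = DEFER_QS_PENDING;
          irq_work_queue_on(&rdp->defer_qs_iw, rdp->cpu);
  }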

Fixes: b41642c877 ("rcu: Fix rcu_read_unlock() deadloop due to IRQ work")
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202508071303.c1134cce-lkp@intel.com
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
2025-08-11 08:43:49 +05:30
Linus Torvalds b96ddbc5c8 - Remove an obsolete comment and fix spelling
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmiXmjUACgkQEsHwGGHe
 VUohIg/9FA8vG1+KBjC6njb8M4tmKiIpS2/OLfdurdp0MFCnxgsAifHbB4Lfh0Aq
 RQO/d3kQ7MrdWS8vuR7AHBTq6ceGuKwIpcikJOvloohlTcHslxpxXDedKdTeSx4L
 CgV88uZbiYppZwFPafC/KOEkMKb+gd/WhnzHQmhIzMk/Cqtv55qaATTdk4Qmai02
 rPcLoS1Zhcs+t/4z9OleELZE9dcNTK2ZpOhHe+ygWZjWQizXAA8hJCyr8zc8UPhu
 +JLj5y2QofFa9XqADk54HzGidR/KKM9or8dsg2X1i4vO8UL26cKHry4NOJOUCblV
 AJptZzFZ3BU0kOyr0RVgT/dCSupl/ujGyiB9B3IuvxYqPAAt11KTbRrY0SoDd2w1
 7XGkm4PQiUikA0VFBKjsoeaJg+portp/TJ4O9r7etRgy+qgguLzZZzaZw1zlGEon
 juPlcTpdEwZND7XtnjtG9hIdJlih8NuJwKWSlmRuoWtzyCR5eGEnxOaD8pgTflr8
 MR5mQlkOgoXQjzjn6D+bmlB4qdst8aP3A8OY3uuj52xfGoaNp7FsP4RL//nGpdqg
 p/sdAOhGGyhs6c3/XrhRT/8YXog6cEJXelXXL1jzIjuiQX3yfR1//nWl7+ftKDvB
 EOR30na1iZ6MDGK7xJiS5B2wyc1my7AQj/mi/+5KopP8APMmqhg=
 =lrRr
 -----END PGP SIGNATURE-----

Merge tag 'smp_urgent_for_v6.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull smp fixes from Borislav Petkov:

 - Remove an obsolete comment and fix spelling

* tag 'smp_urgent_for_v6.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  cpu: Remove obsolete comment from takedown_cpu()
  smp: Fix spelling in on_each_cpu_cond_mask()'s doc-comment
2025-08-10 08:51:37 +03:00
Linus Torvalds 7d2fed1f3c - Fix a wrong ioremap size in mvebu-gicp
- Remove yet another compile-test case for a driver which needs an
   additional dependency
 
 - Fix a lock inversion scenario in the IRQ unit test suite
 
 - Remove an impossible flag situation in gic-v5
 
 - Do not iounmap resources in gic-v5 which are managed by devm
 
 - Make sure stale, left-over interrupts in mvebu-gicp are cleared on
   driver init
 
 - Fix a reference counting mishap in msi-lib
 
 - Fix a dereference-before-null-ptr-check case in the riscv-imsic
   irqchip driver
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmiXkZYACgkQEsHwGGHe
 VUoiaw/7B/0wjf6CZdrzFW+gSBgXMA4VghVcEUySQTeE66SL5dhOM5Pw2w8z5ow5
 tn2/cMZg7aAyQUt9XlPzWhHwU5Pu06Jtd32LtDX0sqf54byYybQHtoUBnNcPY7FR
 Anf7jr0WnXNHe+qQcccfVDtEYzF9R/HIkmp63f348wP3aS5RTbFPWk2cOPdAXOAY
 TeWordoDXjtzix+Ro8zk2WaD0h9oDLdHgg4eww3FUVNBvKiXIhkV7bb70t85f+gA
 f0eF6LJ0318e6UHwVhU+OgYzdD9uXMNTrZDKeZq/xYYIytVlCfwvKDMWFZC11ltC
 89BMyghLSRkI/oV/2/gXGICRmGp7OEn6HzD/vpPv30Zfeyj0e8O/rat2uZCifrbL
 9RJ4sXMJCOJUoHD3t/e7i1TDqsmVF9CdTbgwqQt6ANtypJrkVkIBqO4QvcNu8qQ5
 c6lt5Y7ob+owpIhUoBmxCUaZz19wZAwRcOIkAZwoWXTrvfYjD28AveQOqHpOBvvQ
 WQY3pvGkvgY9vmWbIeshWhZzb+kX5Wn5WI4C7Ul5cng2WUfo1pkI6U8u9dmv0D7y
 LBVjnj/rXTWR0G9OyI3R9WqrGmrCnKOPMpv98lsETctBZxTrSStbOWe4dOBTe1Zh
 Jq1KRWZ4UE5SauOr8R/59y21E4HulVVH2WK7TUhM8paPkeYbw3o=
 =VUGL
 -----END PGP SIGNATURE-----

Merge tag 'irq_urgent_for_v6.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq fixes from Borislav Petkov:

 - Fix a wrong ioremap size in mvebu-gicp

 - Remove yet another compile-test case for a driver which needs an
   additional dependency

 - Fix a lock inversion scenario in the IRQ unit test suite

 - Remove an impossible flag situation in gic-v5

 - Do not iounmap resources in gic-v5 which are managed by devm

 - Make sure stale, left-over interrupts in mvebu-gicp are cleared on
   driver init

 - Fix a reference counting mishap in msi-lib

 - Fix a dereference-before-null-ptr-check case in the riscv-imsic
   irqchip driver

* tag 'irq_urgent_for_v6.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/mvebu-gicp: Use resource_size() for ioremap()
  irqchip: Build IMX_MU_MSI only on ARM
  genirq/test: Resolve irq lock inversion warnings
  irqchip/gic-v5: Remove IRQD_RESEND_WHEN_IN_PROGRESS for ITS IRQs
  irqchip/gic-v5: iwb: Fix iounmap probe failure path
  irqchip/mvebu-gicp: Clear pending interrupts on init
  irqchip/msi-lib: Fix fwnode refcount in msi_lib_irq_domain_select()
  irqchip/riscv-imsic: Don't dereference before NULL pointer check
2025-08-10 08:46:47 +03:00
Linus Torvalds 8e8f6b635f - Prevent a futex hash leak due to different mm lifetimes
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmiXjOUACgkQEsHwGGHe
 VUpEog//eQ60cSh6pzFJ6yypmPmp1/Tk7XHQH9s9V4JzrsbFoTCwqm2h3NE27Pfu
 INNfdiZ76Paf5fRkl/pITGwW11Svn0w42xWwM8BDeZMv/yAq8dXa/QKaABVJa+Hd
 PujrWUno3H/qck0O50Fq9Y6nbE0lHBczHxKsaGdrARxra91JpAezsgwkN7jVO9Kk
 6/2Gb9Hk2buWvG+eLmM4JwNIvaxgbMttw93tfA7DthPyQCI0dCPINmJ22fXhVZLI
 tVmkid9MqGOjz4789z0AN+pF+VfEcejGSy29zzCk5NrrgfgK0QSoZ0JpvUP5vtUh
 Opoez017+3sKe3REk+0j+PGdttmE48Zhl7WDgkAIZqOOEwWiVBoqbk9gCIGiJKan
 x9BRjcP3p1TH1RsS6OsHA+tbf+ZlGhOKQNRNeWmisteiOcDiuRYY8NE7F5Q3/mBQ
 N0KnlzAo2m2uTwJ4r5uvEOAIcCvB+EtNn2SYBkCxMpTCRzT65/WEQjgqmLDHR6cP
 LSFOfo91E210TwU/ZospjXxT3NhntoWRQVbvbbO5QS4gr3Sq6MCIofmIrjfJNqq6
 AoVnrM+8QAOp+pOaoPwSIcwp68uhI4L6SXAZtP0+xScwv6UUUy1KUv9TMMNbZ4/4
 lh9JYYIdfh3rtOlZmdK4+KoGBQ19YZ/qc9tXB8/oqadrQWFBOic=
 =Xpyt
 -----END PGP SIGNATURE-----

Merge tag 'locking_urgent_for_v6.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking fix from Borislav Petkov:

 - Prevent a futex hash leak due to different mm lifetimes

* tag 'locking_urgent_for_v6.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  futex: Move futex cleanup to __mmdrop()
2025-08-10 08:11:39 +03:00
JP Kobryn eea51c6e3f cgroup: avoid null de-ref in css_rstat_exit()
css_rstat_exit() may be called asynchronously in scenarios where preceding
calls to css_rstat_init() have not completed. One such example is this
sequence below:

css_create(...)
{
	...
	init_and_link_css(css, ...);

	err = percpu_ref_init(...);
	if (err)
		goto err_free_css;
	err = cgroup_idr_alloc(...);
	if (err)
		goto err_free_css;
	err = css_rstat_init(css, ...);
	if (err)
		goto err_free_css;
	...
err_free_css:
	INIT_RCU_WORK(&css->destroy_rwork, css_free_rwork_fn);
	queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork);
	return ERR_PTR(err);
}

If any of the three goto jumps is taken, async cleanup will begin and
css_rstat_exit() will be invoked on an uninitialized css->rstat_cpu.

Avoid accessing the uninitialized field by returning early in
css_rstat_exit() if this is the case.
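
The guard is a single early return (sketch; field name per the message
above):

  void css_rstat_exit(struct cgroup_subsys_state *css)
  {
          /* css_rstat_init() never ran or failed: nothing to tear down */
          if (!css->rstat_cpu)
                  return;
          /* normal teardown continues here */
  }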

Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
Suggested-by: Michal Koutný <mkoutny@suse.com>
Fixes: 5da3bfa029 ("cgroup: use separate rstat trees for each subsystem")
Cc: stable@vger.kernel.org # v6.16
Reported-by: syzbot+8d052e8b99e40bc625ed@syzkaller.appspotmail.com
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-08-09 08:46:32 -10:00
Waiman Long 87eba5bc5a cgroup/cpuset: Remove the unnecessary css_get/put() in cpuset_partition_write()
The css_get/put() calls in cpuset_partition_write() are unnecessary as
an active reference of the kernfs node will be taken which will prevent
its removal and guarantee the existence of the css. Only the online
check is needed.

Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-08-09 08:42:47 -10:00
Waiman Long 150e298ae0 cgroup/cpuset: Fix a partition error with CPU hotplug
It was found during testing that an invalid leaf partition with an
empty effective exclusive CPU list can become a valid empty partition
with no CPU after an offline/online operation of an unrelated CPU. An
empty partition root is allowed in the special case that it has no
task in its cgroup and has distributed out all its CPUs to its child
partitions. That is certainly not the case here.

The problem is in the cpumask_subset() test in the hotplug case
(update with no new mask) of update_parent_effective_cpumask(), as it
also returns true if the effective exclusive CPU list is empty. Fix that
by adding a cpumask_empty() test to root out this exception case.
Also add a cpumask_empty() test in cpuset_hotplug_update_tasks()
to avoid calling update_parent_effective_cpumask() for this special case.

Fixes: 0c7f293efc ("cgroup/cpuset: Add cpuset.cpus.exclusive.effective for v2")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-08-09 08:42:28 -10:00
Waiman Long 65f97cc81b cgroup/cpuset: Use static_branch_enable_cpuslocked() on cpusets_insane_config_key
The following lockdep splat was observed.

[  812.359086] ============================================
[  812.359089] WARNING: possible recursive locking detected
[  812.359097] --------------------------------------------
[  812.359100] runtest.sh/30042 is trying to acquire lock:
[  812.359105] ffffffffa7f27420 (cpu_hotplug_lock){++++}-{0:0}, at: static_key_enable+0xe/0x20
[  812.359131]
[  812.359131] but task is already holding lock:
[  812.359134] ffffffffa7f27420 (cpu_hotplug_lock){++++}-{0:0}, at: cpuset_write_resmask+0x98/0xa70
     :
[  812.359267] Call Trace:
[  812.359272]  <TASK>
[  812.359367]  cpus_read_lock+0x3c/0xe0
[  812.359382]  static_key_enable+0xe/0x20
[  812.359389]  check_insane_mems_config.part.0+0x11/0x30
[  812.359398]  cpuset_write_resmask+0x9f2/0xa70
[  812.359411]  cgroup_file_write+0x1c7/0x660
[  812.359467]  kernfs_fop_write_iter+0x358/0x530
[  812.359479]  vfs_write+0xabe/0x1250
[  812.359529]  ksys_write+0xf9/0x1d0
[  812.359558]  do_syscall_64+0x5f/0xe0

Since commit d74b27d63a ("cgroup/cpuset: Change cpuset_rwsem
and hotplug lock order"), the ordering of cpu hotplug lock
and cpuset_mutex had been reversed. That patch correctly
used the cpuslocked version of the static branch API to enable
cpusets_pre_enable_key and cpusets_enabled_key, but it didn't do the
same for cpusets_insane_config_key.

The cpusets_insane_config_key can be enabled in the
check_insane_mems_config() which is called from update_nodemask()
or cpuset_hotplug_update_tasks() with both cpu hotplug lock and
cpuset_mutex held. Deadlock can happen with a pending hotplug event that
tries to acquire the cpu hotplug write lock which will block further
cpus_read_lock() attempt from check_insane_mems_config(). Fix that by
switching to use static_branch_enable_cpuslocked().

Fixes: d74b27d63a ("cgroup/cpuset: Change cpuset_rwsem and hotplug lock order")
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-08-09 08:41:39 -10:00
Linus Torvalds c30a13538d bpf-fixes
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmiWKasACgkQ6rmadz2v
 bToTMhAAmAvsOqfaoYR0LfmuwPZj9F0bqg9voRRgHNnf3j1Iz0L2po65pdbJGQPs
 mf+KivbeWsPic8AD0mnbgPLgPXNYsunxsQ+k5LfShbtECZ+0QqUmT27EAIhsHWTr
 ghWmMZ6E3kq7bfsTaNAtJNjiovHCr+swqpZoJpAlwz1CxNhESDIjdDzGtarPNxHw
 s3ioI9gvPuH4x5vq7SblyfgeYJWNk0ZELISmdNEud1nBrdR91Ji2Iqxwkky0qhKI
 4kQzN9MmjsIl4dIa1DMlQsMpKkC/sHtoFW2VEUkuHUwP+wa/YIn7iAvD7xc2k3nS
 2fLzlpfZnSIdu+2piu9A7RoU4VI/XyC+tud6WmDYGN/xKgA0N7BmoBQjhgptj+uH
 BitpTGmFPoDuRiQDKHiGPTP6Wc4djt0Ipp6hFr89Q2ywCUCrylRafuJDpOn6xwzG
 VZH6yK1cMk2kIa8jArGjMtvcWiMbaMn6GwxjQaPC+Syhy6dmAWVmcyEKYXki8GAc
 N+gl0bBGHEGhcCCuyzxFAOLKx81CDp5C+gfoCzt3lDAXztrd1PeaZV+n9c6jtgVT
 VPO9TahYqnsfuLPBVDTCGSvVsTP2Fh2fIcdsEJvcF1EISBiwYBw6FGpDi4WlbOfm
 pisRchdT0YV6vg0V+f2U6mL5UbZr7v5w5PAfh7zXCqvfppoPD8U=
 =Qt6J
 -----END PGP SIGNATURE-----

Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Pull bpf fixes from Alexei Starovoitov:

 - Fix memory leak of bpf_scc_info objects (Eduard Zingerman)

 - Fix a regression in the 'perf' tool caused by moving UID filtering to
   BPF (Ilya Leoshkevich)

* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  perf bpf-filter: Enable events manually
  libbpf: Add the ability to suppress perf event enablement
  bpf: Fix memory leak of bpf_scc_info objects
2025-08-09 09:03:21 +03:00
Eduard Zingerman 77620d1267 bpf: use realloc in bpf_patch_insn_data
Avoid excessive vzalloc/vfree calls when patching instructions in
do_misc_fixups(). bpf_patch_insn_data() uses vzalloc to allocate new
memory for env->insn_aux_data for each patch as follows:

  struct bpf_prog *bpf_patch_insn_data(env, ...)
  {
    ...
    new_data = vzalloc(... O(program size) ...);
    ...
    adjust_insn_aux_data(env, new_data, ...);
    ...
  }

  void adjust_insn_aux_data(env, new_data, ...)
  {
    ...
    memcpy(new_data, env->insn_aux_data);
    vfree(env->insn_aux_data);
    env->insn_aux_data = new_data;
    ...
  }

The vzalloc/vfree pair is hot in the perf report collected for e.g. the
pyperf180 test case. It can be replaced with a call to vrealloc in
order to reduce the number of actual memory allocations.
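
The replacement, roughly (sketch; new_size stands for the grown
insn_aux_data array size):

  new_data = vrealloc(env->insn_aux_data, new_size,
                      GFP_KERNEL | __GFP_ZERO);
  if (!new_data)
          return NULL;
  env->insn_aux_data = new_data;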

This is a stop-gap solution, as bpf_patch_insn_data is still hot in
the profile. More comprehensive solutions have been discussed before,
e.g. in [1].

[1] https://lore.kernel.org/bpf/CAEf4BzY_E8MSL4mD0UPuuiDcbJhh9e2xQo2=5w+ppRWWiYSGvQ@mail.gmail.com/

Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Tested-by: Anton Protopopov <a.s.protopopov@gmail.com>
Link: https://lore.kernel.org/r/20250807010205.3210608-3-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-08-07 09:17:02 -07:00
Eduard Zingerman cb070a8156 bpf: removed unused 'env' parameter from is_reg64 and insn_has_def32
Parameter 'env' is not used by is_reg64() and insn_has_def32()
functions. Remove the parameter to make it clear that neither function
depends on 'env' state, e.g. env->insn_aux_data.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250807010205.3210608-2-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-08-07 09:17:02 -07:00
Waiman Long da274853fe cpu: Remove obsolete comment from takedown_cpu()
takedown_cpu() has a comment about "all preempt/rcu users must observe
!cpu_active()" which is kind of meaningless in this function. This
comment was originally introduced by commit 6acce3ef84 ("sched: Remove
get_online_cpus() usage") when _cpu_down() was setting cpu_active_mask
and synchronize_rcu()/synchronize_sched() were added after that.

Later commit 40190a78f8 ("sched/hotplug: Convert cpu_[in]active
notifiers to state machine") added a new CPUHP_AP_ACTIVE hotplug
state to set/clear cpu_active_mask. The following commit b2454caa89
("sched/hotplug: Move sync_rcu to be with set_cpu_active(false)")
moved the synchronize_*() calls to sched_cpu_deactivate() associated
with the new hotplug state, but left the comment behind.

Remove this comment as it is no longer relevant in takedown_cpu().

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250729191232.664931-1-longman@redhat.com
2025-08-06 22:48:12 +02:00
Amery Hung d87a513d09 bpf: Allow struct_ops to get map id by kdata
Add bpf_struct_ops_id() to enable struct_ops implementors to use
struct_ops map id as the unique id of a struct_ops in their subsystem.
A subsystem that wishes to create a mapping between id and struct_ops
instance pointer can update the mapping accordingly during
bpf_struct_ops::reg(), unreg(), and update().

Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20250806162540.681679-2-ameryhung@gmail.com
2025-08-06 13:39:58 -07:00
Brian Norris 5b65258229 genirq/test: Resolve irq lock inversion warnings
irq_shutdown_and_deactivate() is normally called with the descriptor lock
held, and interrupts disabled. Nested a few levels down, it grabs the
global irq_resend_lock. Lockdep rightfully complains when interrupts are
not disabled:

       CPU0                    CPU1
       ----                    ----
  lock(irq_resend_lock);
                               local_irq_disable();
                               lock(&irq_desc_lock_class);
                               lock(irq_resend_lock);
  <Interrupt>
    lock(&irq_desc_lock_class);

...
   _raw_spin_lock+0x2b/0x40
   clear_irq_resend+0x14/0x70
   irq_shutdown_and_deactivate+0x29/0x80
   irq_shutdown_depth_test+0x1ce/0x600
   kunit_try_run_case+0x90/0x120

Grab the descriptor lock and disable interrupts, to resolve the
problem.
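
Sketch of the fix in the test:

  unsigned long flags;

  raw_spin_lock_irqsave(&desc->lock, flags);
  irq_shutdown_and_deactivate(desc);
  raw_spin_unlock_irqrestore(&desc->lock, flags);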

Fixes: 66067c3c8a ("genirq: Add kunit tests for depth counts")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Brian Norris <briannorris@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Link: https://lore.kernel.org/all/aJJONEIoIiTSDMqc@google.com
Closes: https://lore.kernel.org/lkml/31a761e4-8f81-40cf-aaf5-d220ba11911c@roeck-us.net/
2025-08-06 10:29:48 +02:00
Linus Torvalds a530a36bb5 Kbuild updates for v6.17
- Fix a shortcut key issue in menuconfig
 
  - Fix missing rebuild of kheaders
 
  - Sort the symbol dump generated by gendwarfsyms
 
  - Support zboot extraction in scripts/extract-vmlinux
 
  - Migrate gconfig to GTK 3
 
  - Add TAR variable to allow overriding the default tar command
 
  - Hand over Kbuild maintainership
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAmiSr38VHG1hc2FoaXJv
 eUBrZXJuZWwub3JnAAoJED2LAQed4NsGcTIP/RCVr/OEJgVXg8dOBNhNuhGoidnM
 2uqRcaza68tOSegpFGcfd9bhO1TCR/O5SL117TS8UEx4f9ge7gk/+XVZC8i1268m
 9+V6eJd8QI34nB/EezMnrhvFmn2kC0UMuSldZYQJ2cReLjtapBP2xBtWnxi+Zyyw
 +FjdHwQln7E8UaB/gMqh9KVVOytX4NIUUZEA/78nd4eJaJbLxJ/5ztAxGLB//bXI
 Rr6bjAeOmIfRWS9QWnGzNzHmzp4SSmU+/gdLXyaWlmoVjeut8O+BJXvQRNfswk8K
 JXmk6uZUx6CNheCca2RaM0i6XAArkpOQc/7v7Ul/rSriTxdxAVUghjk0fNrXJGvQ
 kBjewOTUXg8f4xhuPAL3nkWmCh0s0tV3Q0EneAaJuUck5yRkW0bxxKa74h6ji2q8
 8RsNS5Mq0yYiR1gmCqhEmTN6/wDamSpGkHT0k6ukipixjCr5DP2QFP/xT3d7qSwc
 W6msliXgBmY/ZrGoJXy4zjmu5vxKfAes+7JLBDxUKjdItC818qwSPf+nbvVIdJqb
 K4/hcNDuYPKd/8N8YapPjbGfTBk9Xqp74ez4xg5XNIBPS+cE5k/mePletbCzkzC5
 vzjoNgVUmpPGPdDaBk+S4jSEzWUi575aQx590OfdoXsBt4CQVHHk4PEM1Qh5cWnZ
 m6tx1oqfruovU6Gx
 =MtyJ
 -----END PGP SIGNATURE-----

Merge tag 'kbuild-v6.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild updates from Masahiro Yamada:
 "This is the last pull request from me.

  I'm grateful to have been able to continue as a maintainer for eight
  years. From the next cycle, Nathan and Nicolas will maintain Kbuild.

   - Fix a shortcut key issue in menuconfig

   - Fix missing rebuild of kheaders

   - Sort the symbol dump generated by gendwarfsyms

   - Support zboot extraction in scripts/extract-vmlinux

   - Migrate gconfig to GTK 3

   - Add TAR variable to allow overriding the default tar command

   - Hand over Kbuild maintainership"

* tag 'kbuild-v6.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (92 commits)
  MAINTAINERS: hand over Kbuild maintenance
  kheaders: make it possible to override TAR
  kbuild: userprogs: use correct linker when mixing clang and GNU ld
  kconfig: lxdialog: replace strcpy() with strncpy() in inputbox.c
  kconfig: lxdialog: replace strcpy with snprintf in print_autowrap
  kconfig: gconf: refactor text_insert_help()
  kconfig: gconf: remove unneeded variable in text_insert_msg
  kconfig: gconf: use hyphens in signals
  kconfig: gconf: replace GtkImageMenuItem with GtkMenuItem
  kconfig: gconf: Fix Back button behavior
  kconfig: gconf: fix single view to display dependent symbols correctly
  scripts: add zboot support to extract-vmlinux
  gendwarfksyms: order -T symtypes output by name
  gendwarfksyms: use preferred form of sizeof for allocation
  kconfig: qconf: confine {begin,end}Group to constructor and destructor
  kconfig: qconf: fix ConfigList::updateListAllforAll()
  kconfig: add a function to dump all menu entries in a tree-like format
  kconfig: gconf: show GTK version in About dialog
  kconfig: gconf: replace GtkHPaned and GtkVPaned with GtkPaned
  kconfig: gconf: replace GdkColor with GdkRGBA
  ...
2025-08-06 07:32:52 +03:00
Linus Torvalds adf12a394c Perf fixes for perf_mmap() reference counting to prevent potential
reference count leaks which are caused by:
 
  - VMA splits, which change the offset or size of a mapping, which causes
    perf_mmap_close() to ignore the unmap or unmap the wrong buffer.
 
  - Several internal issues of perf_mmap(), which can cause reference count
    leaks in the perf mmap, corrupt accounting or cause leaks in perf
    drivers.
 
 The main fix is to prevent VMA splits by implementing the [may_]split()
 callback for vm operations. The other issues are addressed by rearranging
 code, early returns on failure and invocation of cleanups.
 
 Also provide a selftest to validate the fixes.
 
 The reference counting should be converted to refcount_t, but that requires
 larger refactoring of the code and will be done once these fixes are
 upstream.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmiSd0gTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoWCbD/9niCHArXIWfdIp1K2ZwM2+tsCU7Ntl
 XnkfnaPoCVpQQXcAN11WIEK6DlwaiY2sfOco1cpEqr5px0Hv5qzM+OGm0r3KQBpr
 9d1Ox8+tqUOIxw5zsKFXpw6/WX4zzdxGHjzU/T0H+fPP3jB+hj/Q3hOk4u10+f3v
 7f3Q4sOfkmOauQez2HEaDwUip6lLZFEaf8IK0tYEkOJxcStwsC2TnLvmlEmOA0Yx
 PnAXOicrpbe9d8KNq6VxU0OtV6XAT+YJtf9T5cTNR1NhIkqyaMwbdzkuh9RZgxAE
 oRblaAHubAUMmv2DgYOTUGoYivsXY13XOtjfXdLmxt19HmkSOyaCFO8nJgjAPOL7
 gxGXS7zKxhNac7bfVgBANPUHOOWtV30H5CqYOzxaPlQs8gzOsl8l+NDZuwVlP4P6
 CMdN3rz3eMpnMpuzy0mmUJhowytKDA8N81yamCP5L9hWWZVfp4boZfIXMMLtJdQa
 nv/T2HxLL8HweFrI6Wd7YDhXMKhsNDAqJvtSv0z+5U+PWWd9rcOFsgS9sUHIiJuB
 pLvNwLxPntzF6qw4qIp1W1AHfLz2VF/tR8WyINpEZe4oafP1TccI+aLQdIJ/vVqp
 gQ0bCTiZb16IGsHruu4L9C0fe40TdSuiwEK5X9Opk4aP11oagsqQ+GxzssvQZnZc
 Jx2XqouabWBBvQ==
 =B9L/
 -----END PGP SIGNATURE-----

Merge tag 'perf-fixes-27504' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git

Pull perf fixes from Thomas Gleixner:
 "Perf fixes for perf_mmap() reference counting to prevent potential
  reference count leaks which are caused by:

   - VMA splits, which change the offset or size of a mapping, which
     causes perf_mmap_close() to ignore the unmap or unmap the wrong
     buffer.

   - Several internal issues of perf_mmap(), which can cause reference
     count leaks in the perf mmap, corrupt accounting or cause leaks in
     perf drivers.

  The main fix is to prevent VMA splits by implementing the
  [may_]split() callback for vm operations.

  The other issues are addressed by rearranging code, early returns on
  failure and invocation of cleanups.

  Also provide a selftest to validate the fixes.

  The reference counting should be converted to refcount_t, but that
  requires larger refactoring of the code and will be done once these
  fixes are upstream"

* tag 'perf-fixes-27504' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git:
  selftests/perf_events: Add a mmap() correctness test
  perf/core: Prevent VMA split of buffer mappings
  perf/core: Handle buffer mapping fail correctly in perf_mmap()
  perf/core: Exit early on perf_mmap() fail
  perf/core: Don't leak AUX buffer refcount on allocation failure
  perf/core: Preserve AUX buffer allocation failure result
2025-08-06 04:41:21 +03:00
Michał Górny 73d210e9fa kheaders: make it possible to override TAR
Commit 86cdd2fdc4 ("kheaders: make headers archive reproducible")
introduced a number of options specific to GNU tar to the `tar`
invocation in `gen_kheaders.sh` script. This causes the script to fail
to work on systems where `tar` is not GNU tar. This can occur e.g.
on recent Gentoo Linux installations that support using bsdtar from
libarchive instead.

Add a `TAR` make variable to make it possible to override the tar
executable used, e.g. by specifying:

  make TAR=gtar

Link: https://bugs.gentoo.org/884061
Reported-by: Sam James <sam@gentoo.org>
Tested-by: Sam James <sam@gentoo.org>
Co-developed-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Michał Górny <mgorny@gentoo.org>
Signed-off-by: Sam James <sam@gentoo.org>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2025-08-06 10:23:36 +09:00
Thomas Gleixner b024d7b56c perf/core: Prevent VMA split of buffer mappings
The perf mmap code is careful about mmap()'ing the user page with the
ringbuffer and additionally the auxiliary buffer, when the event supports
it. Once the first mapping is established, subsequent mapping have to use
the same offset and the same size in both cases. The reference counting for
the ringbuffer and the auxiliary buffer depends on this being correct.

Though perf does not prevent that a related mapping is split via mmap(2),
munmap(2) or mremap(2). A split of a VMA results in perf_mmap_open() calls,
which take reference counts, but then the subsequent perf_mmap_close()
calls are not longer fulfilling the offset and size checks. This leads to
reference count leaks.

As perf already has the requirement for subsequent mappings to match the
initial mapping, the obvious consequence is that VMA splits, caused by
resizing of a mapping or partial unmapping, have to be prevented.

Implement the vm_operations_struct::may_split() callback and return
unconditionally -EINVAL.

That ensures that the mapping offsets and sizes cannot be changed after the
fact. Remapping to a different fixed address with the same size is still
possible as it takes the references for the new mapping and drops those of
the old mapping.
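
A minimal sketch of such a callback (structure contents abbreviated; the
callback type matches vm_operations_struct::may_split()):

  static int perf_mmap_may_split(struct vm_area_struct *vma, unsigned long addr)
  {
          /* Splitting the VMA would break the mapping's refcounting. */
          return -EINVAL;
  }

  static const struct vm_operations_struct perf_mmap_vmops = {
          /* ... open, close, fault ... */
          .may_split = perf_mmap_may_split,
  };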

Fixes: 45bfb2e504 ("perf: Add AUX area to ring buffer for raw data streams")
Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-27504
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: stable@vger.kernel.org
2025-08-05 21:55:29 +02:00
Thomas Gleixner f74b9f4ba6 perf/core: Handle buffer mapping fail correctly in perf_mmap()
After successful allocation of a buffer or a successful attachment to an
existing buffer, perf_mmap() tries to map the buffer read-only into the page
table. If that fails, the already set up page table entries are zapped, but
the other perf specific side effects of that failure are not handled.  The
calling code just cleans up the VMA and does not invoke perf_mmap_close().

This leaks reference counts, corrupts user->vm accounting and also results
in an unbalanced invocation of event::event_mapped().

Cure this by moving the event::event_mapped() invocation before the
map_range() call so that on map_range() failure perf_mmap_close() can be
invoked without causing an unbalanced event::event_unmapped() call.

perf_mmap_close() undoes the reference counts and eventually frees buffers.
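
A simplified sketch of the reordered flow (names taken from the message; the
surrounding error handling is assumed):

  if (event->pmu->event_mapped)
          event->pmu->event_mapped(event, vma->vm_mm);

  ret = map_range(rb, vma);
  if (ret)
          perf_mmap_close(vma);   /* balanced event_unmapped(), refs dropped */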

Fixes: b709eb872e ("perf: map pages in advance")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: stable@vger.kernel.org
2025-08-05 21:55:29 +02:00
Thomas Gleixner 07091aade3 perf/core: Exit early on perf_mmap() fail
When perf_mmap() fails to allocate a buffer, it still invokes the
event_mapped() callback of the related event. On X86 this might increase
the perf_rdpmc_allowed reference counter. But nothing undoes this as
perf_mmap_close() is never called in this case, which causes another
reference count leak.

Return early on failure to prevent that.

Fixes: 1e0fb9ec67 ("perf: Add pmu callbacks to track event mapping and unmapping")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: stable@vger.kernel.org
2025-08-05 21:55:29 +02:00
Thomas Gleixner 5468c0fbcc perf/core: Don't leak AUX buffer refcount on allocation failure
Failure of the AUX buffer allocation leaks the reference count.

Set the reference count to 1 only when the allocation succeeds.

Fixes: 45bfb2e504 ("perf: Add AUX area to ring buffer for raw data streams")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: stable@vger.kernel.org
2025-08-05 21:55:29 +02:00
Thomas Gleixner 54473e0ef8 perf/core: Preserve AUX buffer allocation failure result
A recent overhaul sets the return value to 0 unconditionally after the
allocations, which causes reference count leaks and corrupts the user->vm
accounting.

Preserve the AUX buffer allocation failure return value, so that the
subsequent code works correctly.

Fixes: 0983593f32 ("perf/core: Lift event->mmap_mutex in perf_mmap()")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: stable@vger.kernel.org
2025-08-05 21:55:28 +02:00
Linus Torvalds da23ea194d Significant patch series in this pull request:
- The 4 patch series "mseal cleanups" from Lorenzo Stoakes erforms some
   mseal cleaning with no intended functional change.
 
 - The 3 patch series "Optimizations for khugepaged" from David
   Hildenbrand improves khugepaged throughput by batching PTE operations
   for large folios.  This gain is mainly for arm64.
 
 - The 8 patch series "x86: enable EXECMEM_ROX_CACHE for ftrace and
   kprobes" from Mike Rapoport provides a bugfix, additional debug code and
   cleanups to the execmem code.
 
 - The 7 patch series "mm/shmem, swap: bugfix and improvement of mTHP
   swap in" from Kairui Song provides bugfixes, cleanups and performance
   improvements to the mTHP swapin code.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaI+6HQAKCRDdBJ7gKXxA
 jv7lAQCAKE5dUhdZ0pOYbhBKTlDapQh2KqHrlV3QFcxXgknEoQD/c3gG01rY3fLh
 Cnf5l9+cdyfKxFniO48sUPx6IpriRg8=
 =HT5/
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2025-08-03-12-35' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull more MM updates from Andrew Morton:
 "Significant patch series in this pull request:

   - "mseal cleanups" (Lorenzo Stoakes)

     Some mseal cleaning with no intended functional change.

   - "Optimizations for khugepaged" (David Hildenbrand)

     Improve khugepaged throughput by batching PTE operations for large
     folios. This gain is mainly for arm64.

   - "x86: enable EXECMEM_ROX_CACHE for ftrace and kprobes" (Mike Rapoport)

     A bugfix, additional debug code and cleanups to the execmem code.

   - "mm/shmem, swap: bugfix and improvement of mTHP swap in" (Kairui Song)

      Bugfixes, cleanups and performance improvements to the mTHP swapin
     code"

* tag 'mm-stable-2025-08-03-12-35' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (38 commits)
  mm: mempool: fix crash in mempool_free() for zero-minimum pools
  mm: correct type for vmalloc vm_flags fields
  mm/shmem, swap: fix major fault counting
  mm/shmem, swap: rework swap entry and index calculation for large swapin
  mm/shmem, swap: simplify swapin path and result handling
  mm/shmem, swap: never use swap cache and readahead for SWP_SYNCHRONOUS_IO
  mm/shmem, swap: tidy up swap entry splitting
  mm/shmem, swap: tidy up THP swapin checks
  mm/shmem, swap: avoid redundant Xarray lookup during swapin
  x86/ftrace: enable EXECMEM_ROX_CACHE for ftrace allocations
  x86/kprobes: enable EXECMEM_ROX_CACHE for kprobes allocations
  execmem: drop writable parameter from execmem_fill_trapping_insns()
  execmem: add fallback for failures in vmalloc(VM_ALLOW_HUGE_VMAP)
  execmem: move execmem_force_rw() and execmem_restore_rox() before use
  execmem: rework execmem_cache_free()
  execmem: introduce execmem_alloc_rw()
  execmem: drop unused execmem_update_copy()
  mm: fix a UAF when vma->mm is freed after vma->vm_refcnt got dropped
  mm/rmap: add anon_vma lifetime debug check
  mm: remove mm/io-mapping.c
  ...
2025-08-05 16:02:07 +03:00
Linus Torvalds 35a813e010 printk changes for 6.17
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAmiQpykACgkQUqAMR0iA
 lPJcrg/9Hez6+zO7LECCn5VkuK5oJWR5CyCfwx14ki8UF38djQGU2frckI5837rE
 MnVoEBexZunK5SXy4MAy7bTCitzR+lMqNtP5uq9J2ovlSPtNlfuJRDr7uGQLDtSS
 M5KZ1qsZnhgwLYeNhfVVToHgp+OwIQb2GcgYmYc8k03fUI1NQpdxIM46DzoTj+06
 x6qgrNsmmJbm8E73VWBByJAEFoq9ugjny8Rt+tYMi/CmhgZpp0ZyF1r5dYfYX/KS
 VS8UQY//aZOFhNsQUAXwP7Ym00CYRgTg7Na+MHivYLXmYGH2gF6tWQhX/eEgHKcJ
 RTmUbLFx70fdBbjJMxv2k8vyMk2sy6sTfJHPqM/NS/Fb0tSPBXQJG/EexzfoqiBc
 wcjgOPkeALIosVdFdTqXxjoIGOP8rqsU4t6Y6WFjJlWK04SBVjxBUofytRdQSxkG
 5Sb0rFVGKrKIkXaVkt4byPa1/BDpfNhfKMYPtQ56pv2VNUgzfye4prUAZHE5pLnK
 8nixeeMtKDFFCBpn6rG5wZW7k2mK5FrWGZUfdfxdK1gWQ1y0kqGy5wa3lNZLcxlH
 l3AtOYoDeWM2DjDVO6WCj8ambEWkbjbGg7tC9TI3F0NvRJSYytTb6npMqb3Gwhcb
 U4NgT+Ho0GJ/5BLUye8HMfhvrGoCfRCeptHtEFXAK7pzKyjc0+c=
 =Mocd
 -----END PGP SIGNATURE-----

Merge tag 'printk-for-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux

Pull printk updates from Petr Mladek:

 - Add new "hash_pointers=[auto|always|never]" boot parameter to force
   the hashing even with "slab_debug" enabled

 - Allow a CPU to be stopped after losing nbcon console ownership during
   panic(), even without a proper NMI

 - Allow the printk kthread to be used immediately, even for the first
   registered nbcon

 - Compiler warning removal

* tag 'printk-for-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
  printk: nbcon: Allow reacquire during panic
  printk: Allow to use the printk kthread immediately even for 1st nbcon
  slab: Decouple slab_debug and no_hash_pointers
  vsprintf: Use __diag macros to disable '-Wsuggest-attribute=format'
  compiler-gcc.h: Introduce __diag_GCC_all
2025-08-04 10:54:36 -07:00
Peter Zijlstra 99b773d720 sched/psi: Fix psi_seq initialization
With the seqcount moved out of the group into a global psi_seq,
re-initializing the seqcount on group creation is causing seqcount
corruption.
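
A sketch of the shape of the fix, assuming the global seqcount is now
initialized at definition time rather than per group:

  /* initialized once, at definition */
  static seqcount_t psi_seq = SEQCNT_ZERO(psi_seq);

  static void group_init(struct psi_group *group)
  {
          /* per-group setup only; re-initializing the live global
           * seqcount here would corrupt concurrent readers */
  }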

Fixes: 570c8efd5e ("sched/psi: Optimize psi_group_change() cpu_clock() usage")
Reported-by: Chris Mason <clm@meta.com>
Suggested-by: Beata Michalska <beata.michalska@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-08-04 10:51:22 -07:00
Petr Mladek 3db488c8ed Merge branch 'rework/fixes' into for-linus 2025-08-04 14:18:01 +02:00
Linus Torvalds e991acf1bc Significant patch series in this pull request:
- The 2 patch series "squashfs: Remove page->mapping references" from
   Matthew Wilcox gets us closer to being able to remove page->mapping.
 
 - The 5 patch series "relayfs: misc changes" from Jason Xing does some
   maintenance and minor feature addition work in relayfs.
 
 - The 5 patch series "kdump: crashkernel reservation from CMA" from Jiri
   Bohac switches us from static preallocation of the kdump crashkernel's
   working memory over to dynamic allocation.  So the difficulty of
   a-priori estimation of the second kernel's needs is removed and the
   first kernel obtains extra memory.
 
 - The 5 patch series "generalize panic_print's dump function to be used
   by other kernel parts" from Feng Tang implements some consolidation and
   rationalization of the various ways in which a failing kernel splats
   information at the operator.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaI+82gAKCRDdBJ7gKXxA
 jj4JAP9xb+w9DrBY6sa+7KTPIb+aTqQ7Zw3o9O2m+riKQJv6jAEA6aEwRnDA0451
 fDT5IqVlCWGvnVikdZHSnvhdD7TGsQ0=
 =rT71
 -----END PGP SIGNATURE-----

Merge tag 'mm-nonmm-stable-2025-08-03-12-47' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull non-MM updates from Andrew Morton:
 "Significant patch series in this pull request:

   - "squashfs: Remove page->mapping references" (Matthew Wilcox) gets
     us closer to being able to remove page->mapping

   - "relayfs: misc changes" (Jason Xing) does some maintenance and
     minor feature addition work in relayfs

   - "kdump: crashkernel reservation from CMA" (Jiri Bohac) switches
     us from static preallocation of the kdump crashkernel's working
     memory over to dynamic allocation. So the difficulty of a-priori
     estimation of the second kernel's needs is removed and the first
     kernel obtains extra memory

   - "generalize panic_print's dump function to be used by other
     kernel parts" (Feng Tang) implements some consolidation and
     rationalization of the various ways in which a failing kernel
     splats information at the operator

* tag 'mm-nonmm-stable-2025-08-03-12-47' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (80 commits)
  tools/getdelays: add backward compatibility for taskstats version
  kho: add test for kexec handover
  delaytop: enhance error logging and add PSI feature description
  samples: Kconfig: fix spelling mistake "instancess" -> "instances"
  fat: fix too many log in fat_chain_add()
  scripts/spelling.txt: add notifer||notifier to spelling.txt
  xen/xenbus: fix typo "notifer"
  net: mvneta: fix typo "notifer"
  drm/xe: fix typo "notifer"
  cxl: mce: fix typo "notifer"
  KVM: x86: fix typo "notifer"
  MAINTAINERS: add maintainers for delaytop
  ucount: use atomic_long_try_cmpxchg() in atomic_long_inc_below()
  ucount: fix atomic_long_inc_below() argument type
  kexec: enable CMA based contiguous allocation
  stackdepot: make max number of pools boot-time configurable
  lib/xxhash: remove unused functions
  init/Kconfig: restore CONFIG_BROKEN help text
  lib/raid6: update recov_rvv.c zero page usage
  docs: update docs after introducing delaytop
  ...
2025-08-03 16:23:09 -07:00
Linus Torvalds 3c4a063b1f tracing cleanups for v6.17:
- Remove unneeded goto out statements
 
   Over time, the logic was restructured but left a "goto out" where the
   out label simply did a "return ret;". Instead of jumping to this out
   label, simply return immediately and remove the out label.
 
 - Add guard(ring_buffer_nest)
 
   Some calls to the tracing ring buffer can happen when the ring buffer is
   already being written to by the same context (for example, a
   trace_printk() in between a ring_buffer_lock_reserve() and a
   ring_buffer_unlock_commit()).
 
   In order to not trigger the recursion detection, these functions use
   ring_buffer_nest_start() and ring_buffer_nest_end(). Create a guard() for
   these functions so that their use cases can be simplified and not need to
   use goto for the release.
 
 - Clean up the tracing code with guard() and __free() logic
 
   There were several locations that were prime candidates for using guard()
   and __free() helpers. Switch them over to use them.
 
 - Fix output of function argument traces for unsigned int values
 
   The function tracer with "func-args" option set will record up to 6 argument
   registers and then use BTF to format them for human consumption when the
   trace file is read. There are several arguments that are "unsigned long" and
   even "unsigned int" that are either an address or a mask. They are easier to
   understand when printed in hexadecimal instead of decimal.
   The old method just printed all non-pointer values as signed integers,
   which made it even worse for unsigned integers.
 
   For instance, instead of:
 
     __local_bh_disable_ip(ip=-2127311112, cnt=256) <-handle_softirqs
 
   Show:
 
    __local_bh_disable_ip(ip=0xffffffff8133cef8, cnt=0x100) <-handle_softirqs
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaI9pOBQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qkhoAQD+moa8M+WWUS9T9utwREytolfyNKEO
 dW0dPVzquX3L6gEAnc7zNla4QZJsdU1bHyhpDTn/Zhu11aMrzoxcBcdrSwI=
 =x79z
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull more tracing updates from Steven Rostedt:

 - Remove unneeded goto out statements

   Over time, the logic was restructured but left a "goto out" where the
   out label simply did a "return ret;". Instead of jumping to this out
   label, simply return immediately and remove the out label.

 - Add guard(ring_buffer_nest)

   Some calls to the tracing ring buffer can happen when the ring buffer
   is already being written to by the same context (for example, a
   trace_printk() in between a ring_buffer_lock_reserve() and a
   ring_buffer_unlock_commit()).

   In order to not trigger the recursion detection, these functions use
   ring_buffer_nest_start() and ring_buffer_nest_end(). Create a guard()
   for these functions so that their use cases can be simplified and not
   need to use goto for the release.

 - Clean up the tracing code with guard() and __free() logic

   There were several locations that were prime candidates for using
   guard() and __free() helpers. Switch them over to use them.

 - Fix output of function argument traces for unsigned int values

   The function tracer with "func-args" option set will record up to 6
   argument registers and then use BTF to format them for human
   consumption when the trace file is read. There are several arguments
   that are "unsigned long" and even "unsigned int" that are either and
   address or a mask. It is easier to understand if they were printed
   using hexadecimal instead of decimal. The old method just printed all
   non-pointer values as signed integers, which made it even worse for
   unsigned integers.

   For instance, instead of:

     __local_bh_disable_ip(ip=-2127311112, cnt=256) <-handle_softirqs

   show:

     __local_bh_disable_ip(ip=0xffffffff8133cef8, cnt=0x100) <-handle_softirqs"

* tag 'trace-v6.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Have unsigned int function args displayed as hexadecimal
  ring-buffer: Convert ring_buffer_write() to use guard(preempt_notrace)
  tracing: Use __free(kfree) in trace.c to remove gotos
  tracing: Add guard() around locks and mutexes in trace.c
  tracing: Add guard(ring_buffer_nest)
  tracing: Remove unneeded goto out logic
2025-08-03 15:03:04 -07:00
Linus Torvalds 8877fcb70f This is a small set of changes for modules, primarily to extend module users
to use the module data structures in combination with the already no-op stub
 module functions, even when support for modules is disabled in the kernel
 configuration. This change follows the kernel's coding style for conditional
 compilation and allows kunit code to drop all CONFIG_MODULES ifdefs, which is
 also part of the changes. This should allow other parts of the kernel to do the
 same cleanup.
 
 Note that this had a conflict with sysctl changes [1] but should be fixed now as I
 rebased on top.
 
 The remaining changes include a fix for module name length handling which could
 potentially lead to the removal of an incorrect module, and various cleanups.
 
 The module name fix and related cleanup has been in linux-next since Thursday
 (July 31) while the rest of the changes for a bit more than 3 weeks.
 
 Note that this currently has conflicts in next with kbuild's tree [2].
 
 Link: https://lore.kernel.org/all/20250714175916.774e6d79@canb.auug.org.au/ [1]
 Link: https://lore.kernel.org/all/20250801132941.6815d93d@canb.auug.org.au/ [2]
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE73Ua4R8Pc+G5xjxTQJ6jxB8ZUfsFAmiPQgkACgkQQJ6jxB8Z
 UfuPTA//XrRguJFBhQh6cUWqVleTNQJuhjiPsOSO5S52aVET4wsrnRNeM2eM5oqw
 0+6ELvhIJINQ1LjpOP8D67d8P5Ds1/qM1pbQIkQsoKiEj6E7Q4dXH5N0uyf/BzO3
 HaosLG9cpqcomlSorYEiYoPjqy9EChQzsi+YAYWAB+fW6bvU/AdUHTRH88m3ppBJ
 Y22BTTPOKKyj5/QgfY+kwH8TTnrzCzY8aoOqW7uimLI5h4c9dFQ2PigRJnoMfDG1
 11w5VshOTzZJvNFrUk5GVSirwlxdJDbW6dKfG0DD5+eNWK5dfIEc+/EcuhaGoPvO
 Euwv8VQubdxHTAG6kzHI0MtxAVQUM1gyz8zHiu18eW++GTtnTFs6m8E6H9AC176G
 nDkUh3qSxJN2HHgxtS9VUExEEZpYqtWeB9Zts8K3oSWvTaQenHWpVHPADkxzS4JU
 Jvkjq8SiKo+RqHxaOKfyf1RfOtYe5tjMCLrP7zX39d1+cwGxuc6mip/omY9HFDgn
 op132fYdt24JSHoioJDzRz9mTfvj3nICEmgX4D4WDQx5lP27CUcLugPnBNHPp0fu
 5hL+ajy8M8nq4zm/42Y+F7VS74TIA6mSnJKs9dMCknUWueD6HrDEU9xHi1YMpUMZ
 cBUSpU+P94dCIScwEzkp926vDnHyxCHLbpF1Jsq5qNNdj7AelHk=
 =4bGB
 -----END PGP SIGNATURE-----

Merge tag 'modules-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux

Pull module updates from Daniel Gomez:
 "This is a small set of changes for modules, primarily to extend module
  users to use the module data structures in combination with the
  already no-op stub module functions, even when support for modules is
  disabled in the kernel configuration. This change follows the kernel's
  coding style for conditional compilation and allows kunit code to drop
  all CONFIG_MODULES ifdefs, which is also part of the changes. This
   should allow other parts of the kernel to do the same cleanup.

  The remaining changes include a fix for module name length handling
  which could potentially lead to the removal of an incorrect module,
  and various cleanups"

* tag 'modules-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux:
  module: Rename MAX_PARAM_PREFIX_LEN to __MODULE_NAME_LEN
  tracing: Replace MAX_PARAM_PREFIX_LEN with MODULE_NAME_LEN
  module: Restore the moduleparam prefix length check
  module: Remove unnecessary +1 from last_unloaded_module::name size
  module: Prevent silent truncation of module name in delete_module(2)
  kunit: test: Drop CONFIG_MODULE ifdeffery
  module: make structure definitions always visible
  module: move 'struct module_use' to internal.h
2025-08-03 14:16:52 -07:00
Mike Rapoport (Microsoft) 838955f64a execmem: introduce execmem_alloc_rw()
Some callers of execmem_alloc() require the memory to be temporarily
writable even when it is allocated from ROX cache.  These callers use
execmem_make_temp_rw() right after the call to execmem_alloc().

Wrap this sequence in a new execmem_alloc_rw() API.
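
A hedged sketch of what such a wrapper could look like, assuming
execmem_make_temp_rw() returns 0 on success (the merged error handling may
differ):

  void *execmem_alloc_rw(enum execmem_type type, size_t size)
  {
          void *p = execmem_alloc(type, size);

          if (p && execmem_make_temp_rw(p, size)) {
                  execmem_free(p);
                  return NULL;
          }
          return p;
  }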

Link: https://lkml.kernel.org/r/20250713071730.4117334-3-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Daniel Gomez <da.gomez@samsung.com>
Reviewed-by: Petr Pavlu <petr.pavlu@suse.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-02 12:06:11 -07:00
Xuanye Liu 881388f343 mm: add process info to bad rss-counter warning
Enhance the debugging information in check_mm() by including the process
name and PID when reporting bad rss-counter states.  This helps identify
which process is associated with the memory accounting issue.

Link: https://lkml.kernel.org/r/20250723100901.1909683-1-liuqiye2025@163.com
Signed-off-by: Xuanye Liu <liuqiye2025@163.com>
Acked-by: SeongJae Park <sj@kernel.org>
Cc: Ben Segall <bsegall@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-02 12:06:08 -07:00
Uros Bizjak 58b4fba81a ucount: use atomic_long_try_cmpxchg() in atomic_long_inc_below()
Use atomic_long_try_cmpxchg() instead of
atomic_long_cmpxchg(*ptr, old, new) == old in atomic_long_inc_below().
x86 CMPXCHG instruction returns success in ZF flag, so this change saves
a compare after cmpxchg (and related move instruction in front of cmpxchg).

Also, atomic_long_try_cmpxchg implicitly assigns old *ptr value to "old"
when cmpxchg fails, enabling further code simplifications.

No functional change intended.
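
A sketch of the resulting pattern, with the function shape reconstructed
from the message:

  static bool atomic_long_inc_below(atomic_long_t *v, long u)
  {
          long c = atomic_long_read(v);

          do {
                  if (c >= u)
                          return false;
                  /* on failure, 'c' is updated to the current value */
          } while (!atomic_long_try_cmpxchg(v, &c, c + 1));

          return true;
  }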

Link: https://lkml.kernel.org/r/20250721174610.28361-2-ubizjak@gmail.com
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Reviewed-by: Alexey Gladkov <legion@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Alexey Gladkov <legion@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: MengEn Sun <mengensun@tencent.com>
Cc: "Thomas Weißschuh" <linux@weissschuh.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-02 12:01:38 -07:00
Uros Bizjak f8cd9193b6 ucount: fix atomic_long_inc_below() argument type
The type of the 'u' argument of atomic_long_inc_below() should be long to avoid
unwanted truncation to int.

The patch fixes the wrong argument type of an internal function to
prevent unwanted argument truncation.  It fixes an internal locking
primitive; it should not have any direct effect on userspace.

Mark said

: AFAICT there's no problem in practice because atomic_long_inc_below()
: is only used by inc_ucount(), and it looks like the value is
: constrained between 0 and INT_MAX.
: 
: In inc_ucount() the limit value is taken from
: user_namespace::ucount_max[], and AFAICT that's only written by
: sysctls, to the table setup by setup_userns_sysctls(), where
: UCOUNT_ENTRY() limits the value between 0 and INT_MAX.
: 
: This is certainly a cleanup, but there might be no functional issue in
: practice as above.

Link: https://lkml.kernel.org/r/20250721174610.28361-1-ubizjak@gmail.com
Fixes: f9c82a4ea8 ("Increase size of ucounts to atomic_long_t")
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Alexey Gladkov <legion@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: MengEn Sun <mengensun@tencent.com>
Cc: "Thomas Weißschuh" <linux@weissschuh.net>
Cc: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-02 12:01:38 -07:00
Alexander Graf 07d2490297 kexec: enable CMA based contiguous allocation
When booting a new kernel with kexec_file, the kernel picks a target
location that the kernel should live at, then allocates random pages,
checks whether any of those pages magically happens to coincide with a
target address range and if so, uses them for that range.

For every page allocated this way, it then creates a page list that the
relocation code - code that executes while all CPUs are off and we are
just about to jump into the new kernel - copies to their final memory
location.  We can not put them there before, because chances are pretty
good that at least some page in the target range is already in use by the
currently running Linux environment.  Copying is happening from a single
CPU at RAM rate, which takes around 4-50 ms per 100 MiB.

All of this is inefficient and error prone.

To successfully kexec, we need to quiesce all devices of the outgoing
kernel so they don't scribble over the new kernel's memory.  We have seen
cases where that does not happen properly (*cough* GIC *cough*) and hence
the new kernel was corrupted.  This started a month-long journey to root
cause failing kexecs, which eventually turned out to be memory corruption:
the new kernel was corrupted severely enough that it could not even emit
output to tell us that it was corrupted.  By allocating memory for the
next kernel from a memory range that is guaranteed scribbling free, we can
boot the next kernel up to a point where it is at least able to detect
corruption and maybe even stop it before it becomes severe.  This
increases the chance for successful kexecs.

Since kexec got introduced, Linux has gained the CMA framework which can
perform physically contiguous memory mappings, while keeping that memory
available for movable memory when it is not needed for contiguous
allocations.  The default CMA allocator is for DMA allocations.

This patch adds logic to the kexec file loader to attempt to place the
target payload at a location allocated from CMA.  If successful, it uses
that memory range directly instead of creating copy instructions during
the hot phase.  To ensure that there is a safety net in case anything goes
wrong with the CMA allocation, it also adds a flag for user space to force
disable CMA allocations.

Using CMA allocations has two advantages:

  1) Faster by 4-50 ms per 100 MiB. There is no more need to copy in the
     hot phase.
  2) More robust. Even if by accident some page is still in use for DMA,
     the new kernel image will be safe from that access because it resides
     in a memory region that the old kernel considers allocated, and the new
     kernel gets a chance to reinitialize the offending component.
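
The user-space opt-out mentioned above might look roughly like this (the
flag name is an assumption here):

  /* force the classic scattered-page path instead of CMA placement */
  syscall(__NR_kexec_file_load, kernel_fd, initrd_fd,
          cmdline_len, cmdline, KEXEC_FILE_NO_CMA);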

Link: https://lkml.kernel.org/r/20250610085327.51817-1-graf@amazon.com
Signed-off-by: Alexander Graf <graf@amazon.com>
Acked-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Zhongkun He <hezhongkun.hzk@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-02 12:01:38 -07:00
Eduard Zingerman 1b30d44417 bpf: Fix memory leak of bpf_scc_info objects
env->scc_info array contains references to bpf_scc_info objects
allocated lazily in verifier.c:scc_visit_alloc().
env->scc_cnt was supposed to track env->scc_info array size
in order to free referenced objects in verifier.c:free_states().
Fix initialization of env->scc_cnt that was omitted in
verifier.c:compute_scc().

To reproduce the bug:
- build with CONFIG_DEBUG_KMEMLEAK
- boot and load bpf program with loops, e.g.:
  ./veristat -q pyperf180.bpf.o
- initiate memleak scan and check results:
  echo scan > /sys/kernel/debug/kmemleak
  cat /sys/kernel/debug/kmemleak

Fixes: c9e31900b5 ("bpf: propagate read/precision marks over state graph backedges")
Reported-by: Jens Axboe <axboe@kernel.dk>
Closes: https://lore.kernel.org/bpf/CAADnVQKXUWg9uRCPD5ebRXwN4dmBCRUFFM7kN=GxymYz3zU25A@mail.gmail.com/T/
Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Tested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250801232330.1800436-1-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-08-02 09:04:57 -07:00
Thomas Gleixner e703b7e247 futex: Move futex cleanup to __mmdrop()
Futex hash allocations are done in mm_init() and the cleanup happens in
__mmput(). That works most of the time, but there are mm instances which
are instantiated via mm_alloc() and freed via mmdrop(), which causes the
futex hash to be leaked.

Move the cleanup to __mmdrop().
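
A sketch of the shape of the fix, assuming the existing futex hash teardown
helper is simply invoked from the other teardown path:

  void __mmdrop(struct mm_struct *mm)
  {
          /* ... existing teardown ... */
          futex_hash_free(mm);    /* moved here from __mmput() */
  }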

Fixes: 56180dd20c ("futex: Use RCU-based per-CPU reference counting instead of rcuref_t")
Reported-by: André Draszik <andre.draszik@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: André Draszik <andre.draszik@linaro.org>
Link: https://lore.kernel.org/all/87ldo5ihu0.ffs@tglx
Closes: https://lore.kernel.org/all/0c8cc83bb73abf080faf584f319008b67d0931db.camel@linaro.org
2025-08-02 15:11:52 +02:00
Roman Kisel 83e6384374 smp: Fix spelling in on_each_cpu_cond_mask()'s doc-comment
"boolean" is spelt as "blooean". Fix that.

Signed-off-by: Roman Kisel <romank@linux.microsoft.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250722161818.6139-1-romank@linux.microsoft.com
2025-08-02 14:24:50 +02:00
Linus Torvalds a6923c06a3 bpf-fixes
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmiNNksACgkQ6rmadz2v
 bTrKRhAAnju4bbFRHU88Y68p6Meq/jxgjxHZAkTqZA0Nvbu2cItPRL7XHAAhTWE7
 OBEIm3UKCH4gs4fY8rDHiIgnnaQavXUmvXZblOIOjxnqRKJpU3px+wwJvGFq5Enq
 WP6UZV8tj+O2tNfNNYS+mgQvvIpUISHGpKimvx7ede3e1U3cJBkppbT3gooMHYuc
 5s1QtYHWaPY/1DpkHgqJ2UPGcbT9/HSPGMHRNaHKjQTcNcLcrj7RRjchgXqcc7Vs
 hVijvVrLiuK0MyU42ritmaqvjjgD6hKPZguRQe2/hAtrOo0Alf+4mXkMgam7simN
 iHfGc7nhw1xAFTPj4WXahja89G00FdDN5NR37Rgurm/i2fY7BuXAkMjiMiwGB3C3
 jk2wG3RSifYeC2rxhkYJdqcx8Cz6m+pjgyJ2o9Jy5dn426VXg/kzkUXpl6u5jaPZ
 SmKoo9Xu1r7xqTaUc9kk8pJI5Xt9vD5oQjF2KQuPZXxNidiwW6k2OGbW+wF26nEi
 Q6pfDu3pvHAd/UE6cD5yFe97o3Cc2XfGwI/Sv2k99UVPvNcvfAvVo9fsItHBhCPn
 zHkihW2S0zmbBlhcrB+PrLclNgLleP9JukFN+5scc0a9lbQxIm6v2TNKGlBfDQtO
 I+Kn266oqT4BEgnQGlCQquINnQAdmS8VMnnunGOu6+rwPUtkI7E=
 =XLHS
 -----END PGP SIGNATURE-----

Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Pull bpf fixes from Alexei Starovoitov:

 - Fix kCFI failures in JITed BPF code on arm64 (Sami Tolvanen, Puranjay
   Mohan, Mark Rutland, Maxwell Bland)

 - Disallow tail calls between BPF programs that use different cgroup
   local storage maps to prevent out-of-bounds access (Daniel Borkmann)

 - Fix unaligned access in flow_dissector and netfilter BPF programs
   (Paul Chaignon)

 - Avoid possible use of uninitialized mod_len in libbpf (Achill
   Gilgenast)

* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  selftests/bpf: Test for unaligned flow_dissector ctx access
  bpf: Improve ctx access verifier error message
  bpf: Check netfilter ctx accesses are aligned
  bpf: Check flow_dissector ctx accesses are aligned
  arm64/cfi,bpf: Support kCFI + BPF on arm64
  cfi: Move BPF CFI types and helpers to generic code
  cfi: add C CFI type macro
  libbpf: Avoid possible use of uninitialized mod_len
  bpf: Fix oob access in cgroup local storage
  bpf: Move cgroup iterator helpers to bpf.h
  bpf: Move bpf map owner out of common struct
  bpf: Add cookie object to bpf maps
2025-08-01 17:13:26 -07:00
Steven Rostedt 3ca824369b tracing: Have unsigned int function args displayed as hexadecimal
Most function arguments that are passed in as unsigned int or unsigned
long are better displayed as hexadecimal than normal integer. For example,
the functions:

static void __create_object(unsigned long ptr, size_t size,
				int min_count, gfp_t gfp, unsigned int objflags);

static bool stack_access_ok(struct unwind_state *state, unsigned long _addr,
			    size_t len);

void __local_bh_disable_ip(unsigned long ip, unsigned int cnt);

Show up in the trace as:

    __create_object(ptr=-131387050520576, size=4096, min_count=1, gfp=3264, objflags=0) <-kmem_cache_alloc_noprof
    stack_access_ok(state=0xffffc9000233fc98, _addr=-60473102566256, len=8) <-unwind_next_frame
    __local_bh_disable_ip(ip=-2127311112, cnt=256) <-handle_softirqs

Instead, by displaying unsigned as hexadecimal, they look more like this:

    __create_object(ptr=0xffff8881028d2080, size=0x280, min_count=1, gfp=0x82820, objflags=0x0) <-kmem_cache_alloc_node_noprof
    stack_access_ok(state=0xffffc90000003938, _addr=0xffffc90000003930, len=0x8) <-unwind_next_frame
    __local_bh_disable_ip(ip=0xffffffff8133cef8, cnt=0x100) <-handle_softirqs

Which is much easier to understand as most unsigned longs are usually just
pointers. Even the "unsigned int cnt" in __local_bh_disable_ip() looks
better as hexadecimal as a lot of flags are passed as unsigned.

Changes since v2: https://lore.kernel.org/20250801111453.01502861@gandalf.local.home

- Use btf_int_encoding() instead of open coding it (Martin KaFai Lau)

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Douglas Raillard <douglas.raillard@arm.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Link: https://lore.kernel.org/20250801165601.7770d65c@gandalf.local.home
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-08-01 19:14:51 -04:00
Linus Torvalds 821c9e515d virtio, vhost: features, fixes
vhost can now support legacy threading
 	if enabled in Kconfig
 vsock memory allocation strategies for
 	large buffers have been improved,
 	reducing pressure on kmalloc
 vhost now supports the in-order feature
 	guest bits missed the merge window
 
 fixes, cleanups all over the place
 
 Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQFDBAABCgAtFiEEXQn9CHHI+FuUyooNKB8NuNKNVGkFAmiMvQEPHG1zdEByZWRo
 YXQuY29tAAoJECgfDbjSjVRpgr8IAKUrIjqqTYXLkbCWn6tK8T+LxZ6LkMkyHA1v
 AJ+y5fKDeLsT5QpusD1XRjXJVqXBwQEsTN0pNVuhWHlcCpUeOFEHuJaf/QMncbc3
 deFlUfMa3ihniUxBuyhojlWURsf94uTC906lCFXlIsfSKH2CW6/SjKvqR0SH5PhN
 5WaqRYiSFFwDlyG2Ul4e5temP/er2KuZfYyvcYCU8VdSEp6bjvqCHd9ztFIVuByp
 fFWsrHce6IqR8ixOOzavEjzfd8WAN3LGzXntj5KEaX3fZ6HxCZCMv+rNVqvJmLps
 cSrTgIUo60nCiZb8klUCS1YTEEvmdmJg3UmmddIpIhcsCYJSbOU=
 =2dxm
 -----END PGP SIGNATURE-----

Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost

Pull virtio updates from Michael Tsirkin:

 - vhost can now support legacy threading if enabled in Kconfig

 - vsock memory allocation strategies for large buffers have been
   improved, reducing pressure on kmalloc

 - vhost now supports the in-order feature. guest bits missed the merge
   window.

 - fixes, cleanups all over the place

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (30 commits)
  vsock/virtio: Allocate nonlinear SKBs for handling large transmit buffers
  vsock/virtio: Rename virtio_vsock_skb_rx_put()
  vhost/vsock: Allocate nonlinear SKBs for handling large receive buffers
  vsock/virtio: Move SKB allocation lower-bound check to callers
  vsock/virtio: Rename virtio_vsock_alloc_skb()
  vsock/virtio: Resize receive buffers so that each SKB fits in a 4K page
  vsock/virtio: Move length check to callers of virtio_vsock_skb_rx_put()
  vsock/virtio: Validate length in packet header before skb_put()
  vhost/vsock: Avoid allocating arbitrarily-sized SKBs
  vhost_net: basic in_order support
  vhost: basic in order support
  vhost: fail early when __vhost_add_used() fails
  vhost: Reintroduce kthread API and add mode selection
  vdpa: Fix IDR memory leak in VDUSE module exit
  vdpa/mlx5: Fix release of uninitialized resources on error path
  vhost-scsi: Fix check for inline_sg_cnt exceeding preallocated limit
  virtio: virtio_dma_buf: fix missing parameter documentation
  vhost: Fix typos
  vhost: vringh: Remove unused functions
  vhost: vringh: Remove unused iotlb functions
  ...
2025-08-01 14:17:48 -07:00
Linus Torvalds 0bd0a41a51 pci-v6.17-changes
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCgAyFiEEgMe7l+5h9hnxdsnuWYigwDrT+vwFAmiL3OkUHGJoZWxnYWFz
 QGdvb2dsZS5jb20ACgkQWYigwDrT+vz9bhAAqiD9REYlNUgGX/bEBgCVPFdtjjTz
 FpSLzG23vWd2J0FEy04qtQWH9j71IXnM+yMybzsMe9SsPt2HhczzSCIMpPj0FZNN
 ccOf3gA/KqPux7FORrS3mpM8OO4ICt3XZhCji3nNg5iW5XlH+NrQKPVxRlvBB0rP
 +7RxSjDClUdZ97QSSmp1uZ7Qh1qyV0Ht0qjPMwecrnB2kApt4ZaMphAaKPEjX/4f
 RgZPFqbIpRWt9e87Z8ADr5c2jokZAzIV0zauQ2fhbjBkTcXIXL3yOzUbR+ngBWDD
 oq21rXJBUCQheA7J6j2SKabgF9AZaI5NI9ERld5vJ1inXSZCyuyKopN1AzuKZquG
 N+jyYJqZC99ePvMLbTWs/spU58J03A6TOwaJNE3ISRgbnxFkhvLl7h68XuTDonZm
 hYGloXXUj+i+rh7/eJIDDWa9MTpEvl2p1zc6EDIZ/umlnHwg9rGlGQVARMCs6Ist
 EiJQEtjMMlXiBJMkFhpxesOdyonGkxAL9WtT6MoEOFF7dqgsTqSKiDUPa+6MHV+I
 tsTB630J3ROsWGfQD1uJI2BrCm+op4j6faamH6UMqCrUU0TUZMHiRR3qVWbM6qgU
 /WL1gZ96uy5I7UoE0+gH+wMhMClO2BnsxffocToDE5wOYpGDd5BwPEoY8ej8U2lu
 CBMCkMor1jDtS8Y=
 =ipv3
 -----END PGP SIGNATURE-----

Merge tag 'pci-v6.17-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci

Pull PCI updates from Bjorn Helgaas:
 "Enumeration:

   - Allow built-in drivers, not just modular drivers, to use async
     initial probing (Lukas Wunner)

   - Support Immediate Readiness even on devices with no PM Capability
     (Sean Christopherson)

   - Consolidate definition of PCIE_RESET_CONFIG_WAIT_MS (100ms), the
     required delay between a reset and sending config requests to a
     device (Niklas Cassel)

   - Add pci_is_display() to check for "Display" base class and use it
     in ALSA hda, vfio, vga_switcheroo, vt-d (Mario Limonciello)

   - Allow 'isolated PCI functions' (multi-function devices without a
     function 0) for LoongArch, similar to s390 and jailhouse (Huacai
     Chen)

  Power control:

   - Add ability to enable optional slot clock for cases where the PCIe
     host controller and the slot are supplied by different clocks
     (Marek Vasut)

  PCIe native device hotplug:

   - Fix runtime PM ref imbalance on Hot-Plug Capable ports caused by
     misinterpreting a config read failure after a device has been
     removed (Lukas Wunner)

   - Avoid creating a useless PCIe port service device for pciehp if the
     slot is handled by the ACPI hotplug driver (Lukas Wunner)

   - Ignore ACPI hotplug slots when calculating depth of pciehp hotplug
     ports (Lukas Wunner)

  Virtualization:

   - Save VF resizable BAR state and restore it after reset (Michał
     Winiarski)

   - Allow IOV resources (VF BARs) to be resized (Michał Winiarski)

   - Add pci_iov_vf_bar_set_size() so drivers can control VF BAR size
     (Michał Winiarski)

  Endpoint framework:

   - Add RC-to-EP doorbell support using platform MSI controller,
     including a test case (Frank Li)

   - Allow BAR assignment via configfs so platforms have flexibility in
     determining BAR usage (Jerome Brunet)

  Native PCIe controller drivers:

   - Convert amazon,al-alpine-v[23]-pcie, apm,xgene-pcie,
     axis,artpec6-pcie, marvell,armada-3700-pcie, st,spear1340-pcie to
     DT schema format (Rob Herring)

   - Use dev_fwnode() instead of of_fwnode_handle() to remove OF
     dependency in altera (fixes an unused variable), designware-host,
     mediatek, mediatek-gen3, mobiveil, plda, xilinx, xilinx-dma,
     xilinx-nwl (Jiri Slaby, Arnd Bergmann)

   - Convert aardvark, altera, brcmstb, designware-host, iproc,
     mediatek, mediatek-gen3, mobiveil, plda, rcar-host, vmd, xilinx,
     xilinx-dma, xilinx-nwl from using pci_msi_create_irq_domain() to
     using msi_create_parent_irq_domain() instead; this makes the
     interrupt controller per-PCI device, allows dynamic allocation of
     vectors after initialization, and allows support of IMS (Nam Cao)

  APM X-Gene PCIe controller driver:

   - Rewrite MSI handling to fix MSI CPU affinity, drop useless CPU hotplug
     bits, use device-managed memory allocations, and clean things up
     (Marc Zyngier)

   - Probe xgene-msi as a standard platform driver rather than a
     subsys_initcall (Marc Zyngier)

  Broadcom STB PCIe controller driver:

   - Add optional DT 'num-lanes' property and if present, use it to
     override the Maximum Link Width advertised in Link Capabilities
     (Jim Quinlan)

  Cadence PCIe controller driver:

   - Use PCIe Message routing types from the PCI core rather than
     defining private ones (Hans Zhang)

  Freescale i.MX6 PCIe controller driver:

   - Add IMX8MQ_EP third 64-bit BAR in epc_features (Richard Zhu)

   - Add IMX8MM_EP and IMX8MP_EP fixed 256-byte BAR 4 in epc_features
     (Richard Zhu)

   - Configure LUT for MSI/IOMMU in Endpoint mode so Root Complex can
     trigger doorbell on Endpoint (Frank Li)

   - Remove apps_reset (LTSSM_EN) from
     imx_pcie_{assert,deassert}_core_reset(), which fixes a hotplug
     regression on i.MX8MM (Richard Zhu)

   - Delay Endpoint link start until configfs 'start' written (Richard
     Zhu)

  Intel VMD host bridge driver:

   - Add Intel Panther Lake (PTL)-H/P/U Vendor ID (George D Sworo)

  Qualcomm PCIe controller driver:

   - Add DT binding and driver support for SA8255p, which supports ECAM
     for Configuration Space access (Mayank Rana)

   - Update DT binding and driver to describe PHYs and per-Root Port
     resets in a Root Port stanza and deprecate describing them in the
     host bridge; this makes it possible to support multiple Root Ports
     in the future (Krishna Chaitanya Chundru)

   - Add Qualcomm QCS615 to SM8150 DT binding (Ziyue Zhang)

   - Add Qualcomm QCS8300 to SA8775p DT binding (Ziyue Zhang)

   - Drop TBU and ref clocks from Qualcomm SM8150 and SC8180x DT
     bindings (Konrad Dybcio)

   - Document 'link_down' reset in Qualcomm SA8775P DT binding (Ziyue
     Zhang)

   - Add required PCIE_RESET_CONFIG_WAIT_MS delay after Link up IRQ
     (Niklas Cassel)

  Rockchip PCIe controller driver:

   - Drop unused PCIe Message routing and code definitions (Hans Zhang)

   - Remove several unused header includes (Hans Zhang)

   - Use standard PCIe config register definitions instead of
     rockchip-specific redefinitions (Geraldo Nascimento)

   - Set Target Link Speed to 5.0 GT/s before retraining so we have a
     chance to train at a higher speed (Geraldo Nascimento)

  Rockchip DesignWare PCIe controller driver:

   - Prevent race between link training and register update via DBI by
     inhibiting link training after hot reset and link down (Wilfred
     Mallawa)

   - Add required PCIE_RESET_CONFIG_WAIT_MS delay after Link up IRQ
     (Niklas Cassel)

  Sophgo PCIe controller driver:

   - Add DT binding and driver for Sophgo SG2044 PCIe controller driver
     in Root Complex mode (Inochi Amaoto)

  Synopsys DesignWare PCIe controller driver:

   - Add required PCIE_RESET_CONFIG_WAIT_MS after waiting for Link up on
     Ports that support > 5.0 GT/s. Slower Ports still rely on the
     not-quite-correct PCIE_LINK_WAIT_SLEEP_MS 90ms default delay while
     waiting for the Link (Niklas Cassel)"

* tag 'pci-v6.17-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci: (116 commits)
  dt-bindings: PCI: qcom,pcie-sa8775p: Document 'link_down' reset
  dt-bindings: PCI: Remove 83xx-512x-pci.txt
  dt-bindings: PCI: Convert amazon,al-alpine-v[23]-pcie to DT schema
  dt-bindings: PCI: Convert marvell,armada-3700-pcie to DT schema
  dt-bindings: PCI: Convert apm,xgene-pcie to DT schema
  dt-bindings: PCI: Convert axis,artpec6-pcie to DT schema
  dt-bindings: PCI: Convert st,spear1340-pcie to DT schema
  PCI: Move is_pciehp check out of pciehp_is_native()
  PCI: pciehp: Use is_pciehp instead of is_hotplug_bridge
  PCI/portdrv: Use is_pciehp instead of is_hotplug_bridge
  PCI/ACPI: Fix runtime PM ref imbalance on Hot-Plug Capable ports
  selftests: pci_endpoint: Add doorbell test case
  misc: pci_endpoint_test: Add doorbell test case
  PCI: endpoint: pci-epf-test: Add doorbell test support
  PCI: endpoint: Add pci_epf_align_inbound_addr() helper for inbound address alignment
  PCI: endpoint: pci-ep-msi: Add checks for MSI parent and mutability
  PCI: endpoint: Add RC-to-EP doorbell support using platform MSI controller
  PCI: dwc: Add Sophgo SG2044 PCIe controller driver in Root Complex mode
  PCI: vmd: Switch to msi_create_parent_irq_domain()
  PCI: vmd: Convert to lock guards
  ...
2025-08-01 13:59:07 -07:00
Steven Rostedt db5f0c3e3e ring-buffer: Convert ring_buffer_write() to use guard(preempt_notrace)
The function ring_buffer_write() has a goto out whose only purpose is to do
a preempt_enable_notrace(). This can be replaced by a guard.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250801203858.205479143@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-08-01 16:49:15 -04:00
Steven Rostedt 12d5189615 tracing: Use __free(kfree) in trace.c to remove gotos
There are a couple of locations that have a goto out in trace.c for the only
purpose of freeing a variable that was allocated. These can be replaced
with __free(kfree).
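
A minimal sketch of the pattern (from <linux/cleanup.h>; setup_fails() is
illustrative):

  char *buf __free(kfree) = kmalloc(size, GFP_KERNEL);

  if (!buf)
          return -ENOMEM;
  if (setup_fails())
          return -EINVAL;         /* buf is freed automatically */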

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250801203858.040892777@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-08-01 16:49:15 -04:00
Steven Rostedt debe57fbe1 tracing: Add guard() around locks and mutexes in trace.c
There are several locations in trace.c that can be simplified by using
guards around raw_spin_lock_irqsave, mutexes and preempt disabling.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250801203857.879085376@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-08-01 16:49:15 -04:00
Steven Rostedt 788fa4b47c tracing: Add guard(ring_buffer_nest)
Some calls to the tracing ring buffer can happen when the ring buffer is
already being written to by the same context (for example, a
trace_printk() in between a ring_buffer_lock_reserve() and a
ring_buffer_unlock_commit()).

In order to not trigger the recursion detection, these functions use
ring_buffer_nest_start() and ring_buffer_nest_end(). Create a guard() for
these functions so that their use cases can be simplified and not need to
use goto for the release.
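
A sketch of the new guard and a caller (the DEFINE_LOCK_GUARD_1() wiring is
an assumption about how such a guard is typically defined):

  DEFINE_LOCK_GUARD_1(ring_buffer_nest, struct trace_buffer,
                      ring_buffer_nest_start(_T->lock),
                      ring_buffer_nest_end(_T->lock));

  /* caller side: the nesting is released on scope exit, no goto needed */
  guard(ring_buffer_nest)(buffer);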

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250801203857.710501021@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-08-01 16:49:15 -04:00
Steven Rostedt c89504a703 tracing: Remove unneeded goto out logic
In several places in trace.c there is a goto out where the out label is
simply a return. There's no reason to jump to the out label if it does no
further logic and simply returns from the function.

Replace the goto outs with a return and remove the out labels.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250801203857.538726745@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-08-01 16:49:14 -04:00
Linus Torvalds d6f38c1239 tracing changes for 6.17
- Deprecate auto-mounting tracefs to /sys/kernel/debug/tracing
 
   When tracefs was first introduced back in 2014, the directory
   /sys/kernel/tracing was added and is the designated location to mount
   tracefs. To keep backward compatibility, tracefs was auto-mounted in
   /sys/kernel/debug/tracing as well.
 
   All distros now mount tracefs on /sys/kernel/tracing. Having it seen in two
   different locations has led to various issues and inconsistencies.
 
   The VFS folks have to also maintain debugfs_create_automount() for this
   single user.
 
   It's been over 10 years. Tooling and scripts should start replacing the
   debugfs location with the tracefs one. The reason tracefs was created in the
   first place was to allow access to the tracing facilities without the need
   to configure debugfs into the kernel. Using tracefs should now be more
   robust.
 
   A new config is created: CONFIG_TRACEFS_AUTOMOUNT_DEPRECATED
   which is default y, so that the kernel is still built with the automount.
   This config allows those that want to remove the automount from debugfs to
   do so.
 
   When tracefs is accessed from /sys/kernel/debug/tracing, the following
   printk is triggered:
 
    pr_warn("NOTICE: Automounting of tracing to debugfs is deprecated and will be removed in 2030\n");
 
   This gives users another 5 years to fix their scripts.
 
 - Use queue_rcu_work() instead of call_rcu() for freeing event filters
 
   The number of filters to be freed can be large, depending on the number of
   events within an event system. Freeing them from softirq context can
   potentially cause undesired latency. Use the RCU workqueue to free them
   instead.
 
 - Remove pointless memory barriers in latency code
 
   Memory barriers were added to some of the latency code a long time ago with
   the idea of "making them visible", but that's not what memory barriers are
   for. They are to synchronize access between different variables. There was
   no synchronization here, making them pointless.
 
 - Remove "__attribute__()" from the type field of event format
 
   When LLVM is used to compile the kernel with CONFIG_DEBUG_INFO_BTF=y and
   PAHOLE_HAS_BTF_TAG=y, some of the format fields get expanded with the
   following:
 
     field:const char * filename;      offset:24;      size:8; signed:0;
 
   Turns into:
 
     field:const char __attribute__((btf_type_tag("user"))) * filename;      offset:24;      size:8; signed:0;
 
   This confuses parsers. Add code to strip these tags from the strings.
 
 - Add eprobe config option CONFIG_EPROBE_EVENTS
 
   Eprobes were added back in 5.15 but were only enabled when another probe was
   enabled (kprobe, fprobe, uprobe, etc). The eprobes had no config option
   of their own. Add one as they should be a separate entity.
 
   It's default y to keep with the old kernels but still has dependencies on
   TRACING and HAVE_REGS_AND_STACK_ACCESS_API.
 
 - Add eprobe documentation
 
   When eprobes were added back in 5.15 no documentation was added to describe
   them. This needs to be rectified.
 
 - Replace open coded cpumask_next_wrap() in move_to_next_cpu()
 
 - Have preemptirq_delay_run() use off-stack CPU mask
 
 - Remove obsolete comment about pelt_cfs event
 
   DECLARE_TRACE() appends "_tp" to trace events now, but the comment above
   pelt_cfs still mentioned appending it manually.
 
 - Remove EVENT_FILE_FL_SOFT_MODE flag
 
   The SOFT_MODE flag was required when the soft enabling and disabling of
   trace events was first introduced. But there was a bug with this approach
   as it only worked for a single instance. When multiple users required soft
   enabling and disabling, the code was changed to have a ref count. The
   SOFT_MODE flag is now set iff the ref count is nonzero. This is redundant
   and just reading the ref count is good enough.
 
 - Fix typo in comment
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaIt5ZRQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qvriAPsEbOEgMrPF1Tdj1mHLVajYTxI8ft5J
 aX5bfM2cDDRVcgEA57JHOXp4d05dj555/hgAUuCWuFp/E0Anp45EnFTedgQ=
 =wKZW
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing updates from Steven Rostedt:

 - Deprecate auto-mounting tracefs to /sys/kernel/debug/tracing

   When tracefs was first introduced back in 2014, the directory
   /sys/kernel/tracing was added and is the designated location to mount
   tracefs. To keep backward compatibility, tracefs was auto-mounted in
   /sys/kernel/debug/tracing as well.

   All distros now mount tracefs on /sys/kernel/tracing. Having it seen
   in two different locations has led to various issues and
   inconsistencies.

   The VFS folks have to also maintain debugfs_create_automount() for
   this single user.

   It's been over 10 years. Tooling and scripts should start replacing
   the debugfs location with the tracefs one. The reason tracefs was
   created in the first place was to allow access to the tracing
   facilities without the need to configure debugfs into the kernel.
   Using tracefs should now be more robust.

   A new config is created: CONFIG_TRACEFS_AUTOMOUNT_DEPRECATED which is
   default y, so that the kernel is still built with the automount. This
   config allows those that want to remove the automount from debugfs to
   do so.

   When tracefs is accessed from /sys/kernel/debug/tracing, the
   following printk is triggered:

     pr_warn("NOTICE: Automounting of tracing to debugfs is deprecated and will be removed in 2030\n");

   This gives users another 5 years to fix their scripts.

 - Use queue_rcu_work() instead of call_rcu() for freeing event filters

   The number of filters to be freed can be large, depending on the number
   of events within an event system. Freeing them from softirq context
   can potentially cause undesired latency. Use the RCU workqueue to
   free them instead.

 - Remove pointless memory barriers in latency code

   Memory barriers were added to some of the latency code a long time
   ago with the idea of "making them visible", but that's not what
   memory barriers are for. They are to synchronize access between
   different variables. There was no synchronization here, making them
   pointless.

 - Remove "__attribute__()" from the type field of event format

   When LLVM is used to compile the kernel with CONFIG_DEBUG_INFO_BTF=y
   and PAHOLE_HAS_BTF_TAG=y, some of the format fields get expanded with
   the following:

     field:const char * filename;      offset:24;      size:8; signed:0;

   Turns into:

     field:const char __attribute__((btf_type_tag("user"))) * filename;      offset:24;      size:8; signed:0;

   This confuses parsers. Add code to strip these tags from the strings.

 - Add eprobe config option CONFIG_EPROBE_EVENTS

   Eprobes were added back in 5.15 but were only enabled when another
   probe was enabled (kprobe, fprobe, uprobe, etc). The eprobes had no
   config option of their own. Add one as they should be a separate
   entity.

   It's default y to keep with the old kernels but still has
   dependencies on TRACING and HAVE_REGS_AND_STACK_ACCESS_API.

 - Add eprobe documentation

   When eprobes were added back in 5.15 no documentation was added to
   describe them. This needs to be rectified.

 - Replace open coded cpumask_next_wrap() in move_to_next_cpu()

 - Have preemptirq_delay_run() use off-stack CPU mask

 - Remove obsolete comment about pelt_cfs event

   DECLARE_TRACE() appends "_tp" to trace events now, but the comment
   above pelt_cfs still mentioned appending it manually.

 - Remove EVENT_FILE_FL_SOFT_MODE flag

   The SOFT_MODE flag was required when the soft enabling and disabling
   of trace events was first introduced. But there was a bug with this
   approach as it only worked for a single instance. When multiple users
   required soft enabling and disabling, the code was changed to have a
   ref count. The SOFT_MODE flag is now set iff the ref count is
   nonzero. This is redundant and just reading the ref count is good
   enough.

 - Fix typo in comment

* tag 'trace-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  Documentation: tracing: Add documentation about eprobes
  tracing: Have eprobes have their own config option
  tracing: Remove "__attribute__()" from the type field of event format
  tracing: Deprecate auto-mounting tracefs in debugfs
  tracing: Fix comment in trace_module_remove_events()
  tracing: Remove EVENT_FILE_FL_SOFT_MODE flag
  tracing: Remove pointless memory barriers
  tracing/sched: Remove obsolete comment on suffixes
  kernel: trace: preemptirq_delay_test: use offstack cpu mask
  tracing: Use queue_rcu_work() to free filters
  tracing: Replace opencoded cpumask_next_wrap() in move_to_next_cpu()
2025-08-01 10:29:36 -07:00
Linus Torvalds c6439bfaab Deferred unwind changes for 6.17
This is the core infrastructure for the deferred unwinder that is required
 for sframes[1]. Several other patch series are based on this work although
 those patch series are not dependent on each other. In order to simplify the
 development, having this core series upstream will allow the other series to
 be worked on in parallel. The other series are:
 
 - The two patches to implement x86:
   https://lore.kernel.org/linux-trace-kernel/20250717004958.260781923@kernel.org/
   https://lore.kernel.org/linux-trace-kernel/20250717004958.432327787@kernel.org/
 
 - The s390 work:
   https://lore.kernel.org/linux-trace-kernel/20250710163522.3195293-1-jremus@linux.ibm.com/
 
 - The perf work:
   https://lore.kernel.org/linux-trace-kernel/20250718164119.089692174@kernel.org/
 
 - The ftrace work:
   https://lore.kernel.org/linux-trace-kernel/20250424192612.505622711@goodmis.org/
 
 - The sframe work:
   https://lore.kernel.org/linux-trace-kernel/20250717012848.927473176@kernel.org/
 
 And more is on the way.
 
 The core infrastructure adds the following in kernel APIs:
 
 - int unwind_user_faultable(struct unwind_stacktrace *trace);
 
     Performs a user space stack trace that may fault user pages in.
 
 - int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func);
 
     Allows a tracer to register with the unwind deferred infrastructure.
 
 - int unwind_deferred_request(struct unwind_work *work, u64 *cookie);
 
      Used when a tracer requests a deferred trace. Can be called from interrupt
     or NMI context.
 
 - void unwind_deferred_cancel(struct unwind_work *work);
 
     Called by a tracer to unregister from the deferred unwind infrastructure.
 
 - void unwind_deferred_task_exit(struct task_struct *task);
 
     Called by task exit code to flush any pending unwind requests.
 
 - void unwind_task_init(struct task_struct *task);
 
     Called by do_fork() to initialize the task struct for the deferred
     unwinder.
 
 - void unwind_task_free(struct task_struct *task);
 
     Called by do_exit() to free up any resources used by the deferred
     unwinder.
 
 None of the above is actually compiled unless an architecture enables it,
 which none currently do.
 
 [1] https://sourceware.org/binutils/wiki/sframe
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaIt9IhQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qqqzAQCMT/6qmSq7O746JF0MuGC6fTZnSbAc
 XGz4JigEqLTRewEA2kaJmD7PBsSRzFdiK2gvyKn95l+PZyWtE9MjTsqeSAc=
 =Lsbm
 -----END PGP SIGNATURE-----

Merge tag 'trace-deferred-unwind-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull initial deferred unwind infrastructure from Steven Rostedt:
 "This is the core infrastructure for the deferred unwinder that is
  required for sframes[1]. Several other patch series are based on this
  work although those patch series are not dependent on each other. In
  order to simplify the development, having this core series upstream
  will allow the other series to be worked on in parallel. The other
  series are:

    - The two patches to implement x86 support [2] [3]

    - The s390 work [4]

    - The perf work [5]

    - The ftrace work [6]

    - The sframe work [7]

  And more is on the way.

  The core infrastructure adds the following in kernel APIs:

    - int unwind_user_faultable(struct unwind_stacktrace *trace);

        Performs a user space stack trace that may fault user pages in.

    - int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func);

        Allows a tracer to register with the unwind deferred
        infrastructure.

    - int unwind_deferred_request(struct unwind_work *work, u64 *cookie);

         Used when a tracer requests a deferred trace. Can be called from
        interrupt or NMI context.

    - void unwind_deferred_cancel(struct unwind_work *work);

        Called by a tracer to unregister from the deferred unwind
        infrastructure.

    - void unwind_deferred_task_exit(struct task_struct *task);

        Called by task exit code to flush any pending unwind requests.

    - void unwind_task_init(struct task_struct *task);

        Called by do_fork() to initialize the task struct for the
        deferred unwinder.

    - void unwind_task_free(struct task_struct *task);

        Called by do_exit() to free up any resources used by the
        deferred unwinder.

    None of the above is actually compiled unless an architecture enables it,
    which none currently do"

Link: https://sourceware.org/binutils/wiki/sframe [1]
Link: https://lore.kernel.org/linux-trace-kernel/20250717004958.260781923@kernel.org/ [2]
Link: https://lore.kernel.org/linux-trace-kernel/20250717004958.432327787@kernel.org/ [3]
Link: https://lore.kernel.org/linux-trace-kernel/20250710163522.3195293-1-jremus@linux.ibm.com/ [4]
Link: https://lore.kernel.org/linux-trace-kernel/20250718164119.089692174@kernel.org/ [5]
Link: https://lore.kernel.org/linux-trace-kernel/20250424192612.505622711@goodmis.org/ [6]
Link: https://lore.kernel.org/linux-trace-kernel/20250717012848.927473176@kernel.org/ [7]

* tag 'trace-deferred-unwind-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  unwind: Finish up unwind when a task exits
  unwind deferred: Use SRCU unwind_deferred_task_work()
  unwind: Add USED bit to only have one conditional on way back to user space
  unwind deferred: Add unwind_completed mask to stop spurious callbacks
  unwind deferred: Use bitmask to determine which callbacks to call
  unwind_user/deferred: Make unwind deferral requests NMI-safe
  unwind_user/deferred: Add deferred unwinding interface
  unwind_user/deferred: Add unwind cache
  unwind_user/deferred: Add unwind_user_faultable()
  unwind_user: Add user space unwinding API with frame pointer support
2025-08-01 09:46:24 -07:00
Paul Chaignon f914876eec bpf: Improve ctx access verifier error message
We've already had two "error during ctx access conversion" warnings
triggered by syzkaller. Let's improve the error message by dumping the
cnt variable so that we can more easily differentiate between the
different error cases.

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/cc94316c30dd76fae4a75a664b61a2dbfe68e205.1754039605.git.paul.chaignon@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-08-01 09:22:44 -07:00
Pei Xiao 32d89a405a vhost: Use ERR_CAST inlined function instead of ERR_PTR(PTR_ERR(...))
cocci warning:
./kernel/vhost_task.c:148:9-16: WARNING: ERR_CAST can be used with tsk

Use ERR_CAST inlined function instead of ERR_PTR(PTR_ERR(...)).
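
The resulting one-line change (sketch of the vhost_task.c hunk):

  /* before: round-trips the error pointer through a long */
  return ERR_PTR(PTR_ERR(tsk));

  /* after: same error pointer, one call */
  return ERR_CAST(tsk);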

Signed-off-by: Pei Xiao <xiaopei01@kylinos.cn>
Message-Id: <1a8499a5da53e4f72cf21aca044ae4b26db8b2ad.1749020055.git.xiaopei01@kylinos.cn>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-08-01 09:11:08 -04:00
Sami Tolvanen f1befc82ad cfi: Move BPF CFI types and helpers to generic code
Instead of duplicating the same code for each architecture, move
the CFI type hash variables for BPF function types and related
helper functions to generic CFI code, and allow architectures to
override the function definitions if needed.

Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Link: https://lore.kernel.org/r/20250801001004.1859976-7-samitolvanen@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-31 18:23:53 -07:00
Linus Torvalds f2d282e1df bitmap-for-6.17
 Bits-related patches for 6.17:
  - find_random_bit() series (Yury);
  - GENMASK() consolidation (Vincent);
  - random cleanups (Shaopeng, Ben, Yury)
 -----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEEi8GdvG6xMhdgpu/4sUSA/TofvsgFAmiLkz0ACgkQsUSA/Tof
 vsj7MgwAvRSyevYSm9cm1Y99098M/7gWeJUeLAIy0GJdFaBQIcMkXkRGXJ9A0ZHb
 RoCFG4eiukHIDHRzJjncXUNTk0zVCbEifUF43BdnJrhjTePlou5SNVh6xhJfQ1Ai
 ENB4Q+nAZyIm43cUnoDhR24ne3pgJcY+oe6e7sQTRFF/6+nB4RDHmjIAMVsYgH30
 w8iPBxNXXULAZNDgOPA3J5bACEnZPfOAhtoiNBC9s4MsE4o+Q8E9FVhReI2tiIhk
 t98kVZu7TFyrGcCdLz8EgbcG4KPFBmwOwOv8S1Mzgy46MwS//dd7MZA7y3MqTvJ/
 VEMoTMAK14/VrgDxu/vdBsUJt/T1wPc+ZbUt/rNb530oSDkvjIo+4ihg1nfswqhn
 u+fj65wAHRW7CSkgpHn3bM/wvxmtIaE6AoY6jWwyuZ1zGIEV+5iPBo56kkmpJlYj
 GlnbiTHkNR/jGa1GwB3PDG2kzoqXVLz6EeFdZncUX53MGa90g0+5/k0ld+oBJTDh
 7QbkZlW1
 =uj9U
 -----END PGP SIGNATURE-----

Merge tag 'bitmap-for-6.17' of https://github.com/norov/linux

Pull bitmap updates from Yury Norov:

 - find_random_bit() series (Yury)

 - GENMASK() consolidation (Vincent)

 - random cleanups (Shaopeng, Ben, Yury)

* tag 'bitmap-for-6.17' of https://github.com/norov/linux:
  bitfield: Ensure the return values of helper functions are checked
  test_bits: add tests for __GENMASK() and __GENMASK_ULL()
  bits: unify the non-asm GENMASK*()
  bits: split the definition of the asm and non-asm GENMASK*()
  cpumask: Remove unnecessary cpumask_nth_andnot()
  watchdog: fix opencoded cpumask_next_wrap() in watchdog_next_cpu()
  clocksource: Improve randomness in clocksource_verify_choose_cpus()
  cpumask: introduce cpumask_random()
  bitmap: generalize node_random()
2025-07-31 16:52:32 -07:00
Linus Torvalds 6a68cec16b sched_ext: Changes for v6.17
- Add support for cgroup "cpu.max" interface.
 
 - Code organization cleanup so that ext_idle.c doesn't depend on the
   source-file-inclusion build method of sched/.
 
 - Drop UP paths in accordance with sched core changes.
 
 - Documentation and other misc changes.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaIqnxg4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGUh5AQC6YM7ggRPYRmy28m5B0nubpKtCHqPOAHSd/QbY
 MCiThgD+JuE9ewg3wYO/jvJx3NyIRB1McMnAaG59hf6R0Plh5Qo=
 =TeLF
 -----END PGP SIGNATURE-----

Merge tag 'sched_ext-for-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext updates from Tejun Heo:

 - Add support for cgroup "cpu.max" interface

 - Code organization cleanup so that ext_idle.c doesn't depend on the
   source-file-inclusion build method of sched/

 - Drop UP paths in accordance with sched core changes

 - Documentation and other misc changes

* tag 'sched_ext-for-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: Fix scx_bpf_reenqueue_local() reference
  sched_ext: Drop kfuncs marked for removal in 6.15
  sched_ext, rcu: Eject BPF scheduler on RCU CPU stall panic
  kernel/sched/ext.c: fix typo "occured" -> "occurred" in comments
  sched_ext: Add support for cgroup bandwidth control interface
  sched_ext, sched/core: Factor out struct scx_task_group
  sched_ext: Return NULL in llc_span
  sched_ext: Always use SMP versions in kernel/sched/ext_idle.h
  sched_ext: Always use SMP versions in kernel/sched/ext_idle.c
  sched_ext: Always use SMP versions in kernel/sched/ext.h
  sched_ext: Always use SMP versions in kernel/sched/ext.c
  sched_ext: Documentation: Clarify time slice handling in task lifecycle
  sched_ext: Make scx_locked_rq() inline
  sched_ext: Make scx_rq_bypassing() inline
  sched_ext: idle: Make local functions static in ext_idle.c
  sched_ext: idle: Remove unnecessary ifdef in scx_bpf_cpu_node()
2025-07-31 16:29:46 -07:00
Linus Torvalds 6aee5aed2e cgroup: Changes for v6.17
- Allow css_rstat_updated() in NMI context to enable memory accounting for
   allocations in NMI context.
 
 - /proc/cgroups doesn't contain useful information for cgroup2 and was
   updated to only show v1 controllers. This unfortunately broke something in
   the wild. Add an option to bring back the old behavior to ease transition.
 
 - selftest updates and other cleanups.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaIqlxQ4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGcTMAQDUlGf50ATWB9hDU7zUG4lVn8s8n8/+x8QFGHn4
 e4NERQD9FpU/jLN+cwGgspKo+L9qpu/1g+t36cJLcOuEKKoaQwI=
 =FLwx
 -----END PGP SIGNATURE-----

Merge tag 'cgroup-for-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup updates from Tejun Heo:

 - Allow css_rstat_updated() in NMI context to enable memory accounting
   for allocations in NMI context.

 - /proc/cgroups doesn't contain useful information for cgroup2 and was
   updated to only show v1 controllers. This unfortunately broke
   something in the wild. Add an option to bring back the old behavior
   to ease transition.

 - selftest updates and other cleanups.

* tag 'cgroup-for-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: Add compatibility option for content of /proc/cgroups
  selftests/cgroup: fix cpu.max tests
  cgroup: llist: avoid memory tears for llist_node
  selftests: cgroup: Fix missing newline in test_zswap_writeback_one
  selftests: cgroup: Allow longer timeout for kmem_dead_cgroups cleanup
  memcg: cgroup: call css_rstat_updated irrespective of in_nmi()
  cgroup: remove per-cpu per-subsystem locks
  cgroup: make css_rstat_updated nmi safe
  cgroup: support to enable nmi-safe css_rstat_updated
  selftests: cgroup: Fix compilation on pre-cgroupns kernels
  selftests: cgroup: Optionally set up v1 environment
  selftests: cgroup: Add support for named v1 hierarchies in test_core
  selftests: cgroup_util: Add helpers for testing named v1 hierarchies
  Documentation: cgroup: add section explaining controller availability
  cgroup: Drop sock_cgroup_classid() dummy implementation
2025-07-31 16:04:19 -07:00
Linus Torvalds af5b2619a8 workqueue: Changes for v6.17
- Prepare for defaulting to unbound workqueue. A separate branch was created
   to ease pulling in from other trees but none of the conversions have
   landed yet.
 
 - Memory allocation profiling support added.
 
 - Misc changes.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaIqiqg4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGa5uAP90MhiDmUxrIXK9A80f0+S6ujIpGm6tYQAOHHsZ
 s6gH3gD+PIsupQ6wF107+Z71ZFtMC2vkrKuTSGE88x5r3aWq+gw=
 =j/gv
 -----END PGP SIGNATURE-----

Merge tag 'wq-for-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq

Pull workqueue updates from Tejun Heo:

 - Prepare for defaulting to unbound workqueue. A separate branch was
   created to ease pulling in from other trees but none of the
   conversions have landed yet

 - Memory allocation profiling support added

 - Misc changes

* tag 'wq-for-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: Use atomic_try_cmpxchg_relaxed() in tryinc_node_nr_active()
  workqueue: Remove unused work_on_cpu_safe
  workqueue: Add new WQ_PERCPU flag
  workqueue: Add system_percpu_wq and system_dfl_wq
  workqueue: Basic memory allocation profiling support
  workqueue: fix opencoded cpumask_next_and_wrap() in wq_select_unbound_cpu()
2025-07-31 15:40:22 -07:00
Linus Torvalds beace86e61 Summary of significant series in this pull request:
- The 4 patch series "mm: ksm: prevent KSM from breaking merging of new
   VMAs" from Lorenzo Stoakes addresses an issue with KSM's
   PR_SET_MEMORY_MERGE mode: newly mapped VMAs were not eligible for
   merging with existing adjacent VMAs.
 
 - The 4 patch series "mm/damon: introduce DAMON_STAT for simple and
   practical access monitoring" from SeongJae Park adds a new kernel module
   which simplifies the setup and usage of DAMON in production
   environments.
 
 - The 6 patch series "stop passing a writeback_control to swap/shmem
   writeout" from Christoph Hellwig is a cleanup to the writeback code
   which removes a couple of pointers from struct writeback_control.
 
 - The 7 patch series "drivers/base/node.c: optimization and cleanups"
   from Donet Tom contains largely uncorrelated cleanups to the NUMA node
   setup and management code.
 
 - The 4 patch series "mm: userfaultfd: assorted fixes and cleanups" from
   Tal Zussman does some maintenance work on the userfaultfd code.
 
 - The 5 patch series "Readahead tweaks for larger folios" from Ryan
   Roberts implements some tuneups for pagecache readahead when it is
   reading into order>0 folios.
 
 - The 4 patch series "selftests/mm: Tweaks to the cow test" from Mark
   Brown provides some cleanups and consistency improvements to the
   selftests code.
 
 - The 4 patch series "Optimize mremap() for large folios" from Dev Jain
   does that.  A 37% reduction in execution time was measured in a
   memset+mremap+munmap microbenchmark.
 
 - The 5 patch series "Remove zero_user()" from Matthew Wilcox expunges
   zero_user() in favor of the more modern memzero_page().
 
 - The 3 patch series "mm/huge_memory: vmf_insert_folio_*() and
   vmf_insert_pfn_pud() fixes" from David Hildenbrand addresses some warts
   which David noticed in the huge page code.  These were not known to be
   causing any issues at this time.
 
 - The 3 patch series "mm/damon: use alloc_migrate_target() for
   DAMOS_MIGRATE_{HOT,COLD" from SeongJae Park provides some cleanup and
   consolidation work in DAMON.
 
 - The 3 patch series "use vm_flags_t consistently" from Lorenzo Stoakes
   uses vm_flags_t in places where we were inappropriately using other
   types.
 
 - The 3 patch series "mm/memfd: Reserve hugetlb folios before
   allocation" from Vivek Kasireddy increases the reliability of large page
   allocation in the memfd code.
 
 - The 14 patch series "mm: Remove pXX_devmap page table bit and pfn_t
   type" from Alistair Popple removes several now-unneeded PFN_* flags.
 
 - The 5 patch series "mm/damon: decouple sysfs from core" from SeongJae
    Park implements some cleanup and maintainability work in the DAMON
   sysfs layer.
 
 - The 5 patch series "madvise cleanup" from Lorenzo Stoakes does quite a
   lot of cleanup/maintenance work in the madvise() code.
 
 - The 4 patch series "madvise anon_name cleanups" from Vlastimil Babka
    provides additional cleanups on top of Lorenzo's effort.
 
 - The 11 patch series "Implement numa node notifier" from Oscar Salvador
   creates a standalone notifier for NUMA node memory state changes.
   Previously these were lumped under the more general memory on/offline
   notifier.
 
 - The 6 patch series "Make MIGRATE_ISOLATE a standalone bit" from Zi Yan
   cleans up the pageblock isolation code and fixes a potential issue which
   doesn't seem to cause any problems in practice.
 
 - The 5 patch series "selftests/damon: add python and drgn based DAMON
   sysfs functionality tests" from SeongJae Park adds additional drgn- and
   python-based DAMON selftests which are more comprehensive than the
   existing selftest suite.
 
 - The 5 patch series "Misc rework on hugetlb faulting path" from Oscar
   Salvador fixes a rather obscure deadlock in the hugetlb fault code and
   follows that fix with a series of cleanups.
 
 - The 3 patch series "cma: factor out allocation logic from
   __cma_declare_contiguous_nid" from Mike Rapoport rationalizes and cleans
   up the highmem-specific code in the CMA allocator.
 
 - The 28 patch series "mm/migration: rework movable_ops page migration
   (part 1)" from David Hildenbrand provides cleanups and
   future-preparedness to the migration code.
 
 - The 2 patch series "mm/damon: add trace events for auto-tuned
   monitoring intervals and DAMOS quota" from SeongJae Park adds some
   tracepoints to some DAMON auto-tuning code.
 
 - The 6 patch series "mm/damon: fix misc bugs in DAMON modules" from
   SeongJae Park does that.
 
 - The 6 patch series "mm/damon: misc cleanups" from SeongJae Park also
   does what it claims.
 
 - The 4 patch series "mm: folio_pte_batch() improvements" from David
   Hildenbrand cleans up the large folio PTE batching code.
 
 - The 13 patch series "mm/damon/vaddr: Allow interleaving in
   migrate_{hot,cold} actions" from SeongJae Park facilitates dynamic
   alteration of DAMON's inter-node allocation policy.
 
 - The 3 patch series "Remove unmap_and_put_page()" from Vishal Moola
   provides a couple of page->folio conversions.
 
 - The 4 patch series "mm: per-node proactive reclaim" from Davidlohr
   Bueso implements a per-node control of proactive reclaim - beyond the
   current memcg-based implementation.
 
 - The 14 patch series "mm/damon: remove damon_callback" from SeongJae
   Park replaces the damon_callback interface with a more general and
   powerful damon_call()+damos_walk() interface.
 
 - The 10 patch series "mm/mremap: permit mremap() move of multiple VMAs"
   from Lorenzo Stoakes implements a number of mremap cleanups (of course)
   in preparation for adding new mremap() functionality: newly permit the
   remapping of multiple VMAs when the user is specifying MREMAP_FIXED.  It
   still excludes some specialized situations where this cannot be
   performed reliably.
 
 - The 3 patch series "drop hugetlb_free_pgd_range()" from Anthony Yznaga
   switches some sparc hugetlb code over to the generic version and removes
   the thus-unneeded hugetlb_free_pgd_range().
 
 - The 4 patch series "mm/damon/sysfs: support periodic and automated
   stats update" from SeongJae Park augments the present
   userspace-requested update of DAMON sysfs monitoring files.  Automatic
   update is now provided, along with a tunable to control the update
   interval.
 
 - The 4 patch series "Some randome fixes and cleanups to swapfile" from
    Kemeng Shi does what it claims.
 
 - The 4 patch series "mm: introduce snapshot_page" from Luiz Capitulino
   and David Hildenbrand provides (and uses) a means by which debug-style
   functions can grab a copy of a pageframe and inspect it locklessly
   without tripping over the races inherent in operating on the live
   pageframe directly.
 
 - The 6 patch series "use per-vma locks for /proc/pid/maps reads" from
   Suren Baghdasaryan addresses the large contention issues which can be
   triggered by reads from that procfs file.  Latencies are reduced by more
   than half in some situations.  The series also introduces several new
   selftests for the /proc/pid/maps interface.
 
 - The 6 patch series "__folio_split() clean up" from Zi Yan cleans up
   __folio_split()!
 
 - The 7 patch series "Optimize mprotect() for large folios" from Dev
   Jain provides some quite large (>3x) speedups to mprotect() when dealing
   with large folios.
 
 - The 2 patch series "selftests/mm: reuse FORCE_READ to replace "asm
   volatile("" : "+r" (XXX));" and some cleanup" from wang lian does some
   cleanup work in the selftests code.
 
 - The 3 patch series "tools/testing: expand mremap testing" from Lorenzo
   Stoakes extends the mremap() selftest in several ways, including adding
   more checking of Lorenzo's recently added "permit mremap() move of
   multiple VMAs" feature.
 
 - The 22 patch series "selftests/damon/sysfs.py: test all parameters"
   from SeongJae Park extends the DAMON sysfs interface selftest so that it
   tests all possible user-requested parameters.  Rather than the present
   minimal subset.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaIqcCgAKCRDdBJ7gKXxA
 jkVBAQCCn9DR1QP0CRk961ot0cKzOgioSc0aA03DPb2KXRt2kQEAzDAz0ARurFhL
 8BzbvI0c+4tntHLXvIlrC33n9KWAOQM=
 =XsFy
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2025-07-30-15-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:
 "As usual, many cleanups. The below blurbiage describes 42 patchsets.
  21 of those are partially or fully cleanup work. "cleans up",
  "cleanup", "maintainability", "rationalizes", etc.

  I never knew the MM code was so dirty.

  "mm: ksm: prevent KSM from breaking merging of new VMAs" (Lorenzo Stoakes)
     addresses an issue with KSM's PR_SET_MEMORY_MERGE mode: newly
     mapped VMAs were not eligible for merging with existing adjacent
     VMAs.

  "mm/damon: introduce DAMON_STAT for simple and practical access monitoring" (SeongJae Park)
     adds a new kernel module which simplifies the setup and usage of
     DAMON in production environments.

  "stop passing a writeback_control to swap/shmem writeout" (Christoph Hellwig)
     is a cleanup to the writeback code which removes a couple of
     pointers from struct writeback_control.

  "drivers/base/node.c: optimization and cleanups" (Donet Tom)
     contains largely uncorrelated cleanups to the NUMA node setup and
     management code.

  "mm: userfaultfd: assorted fixes and cleanups" (Tal Zussman)
     does some maintenance work on the userfaultfd code.

  "Readahead tweaks for larger folios" (Ryan Roberts)
     implements some tuneups for pagecache readahead when it is reading
     into order>0 folios.

  "selftests/mm: Tweaks to the cow test" (Mark Brown)
     provides some cleanups and consistency improvements to the
     selftests code.

  "Optimize mremap() for large folios" (Dev Jain)
     does that. A 37% reduction in execution time was measured in a
     memset+mremap+munmap microbenchmark.

  "Remove zero_user()" (Matthew Wilcox)
     expunges zero_user() in favor of the more modern memzero_page().

  "mm/huge_memory: vmf_insert_folio_*() and vmf_insert_pfn_pud() fixes" (David Hildenbrand)
     addresses some warts which David noticed in the huge page code.
     These were not known to be causing any issues at this time.

  "mm/damon: use alloc_migrate_target() for DAMOS_MIGRATE_{HOT,COLD" (SeongJae Park)
     provides some cleanup and consolidation work in DAMON.

  "use vm_flags_t consistently" (Lorenzo Stoakes)
     uses vm_flags_t in places where we were inappropriately using other
     types.

  "mm/memfd: Reserve hugetlb folios before allocation" (Vivek Kasireddy)
     increases the reliability of large page allocation in the memfd
     code.

  "mm: Remove pXX_devmap page table bit and pfn_t type" (Alistair Popple)
     removes several now-unneeded PFN_* flags.

  "mm/damon: decouple sysfs from core" (SeongJae Park)
     implements some cleanup and maintainability work in the DAMON
     sysfs layer.

  "madvise cleanup" (Lorenzo Stoakes)
     does quite a lot of cleanup/maintenance work in the madvise() code.

  "madvise anon_name cleanups" (Vlastimil Babka)
     provides additional cleanups on top of Lorenzo's effort.

  "Implement numa node notifier" (Oscar Salvador)
     creates a standalone notifier for NUMA node memory state changes.
     Previously these were lumped under the more general memory
     on/offline notifier.

  "Make MIGRATE_ISOLATE a standalone bit" (Zi Yan)
     cleans up the pageblock isolation code and fixes a potential issue
     which doesn't seem to cause any problems in practice.

  "selftests/damon: add python and drgn based DAMON sysfs functionality tests" (SeongJae Park)
     adds additional drgn- and python-based DAMON selftests which are
     more comprehensive than the existing selftest suite.

  "Misc rework on hugetlb faulting path" (Oscar Salvador)
     fixes a rather obscure deadlock in the hugetlb fault code and
     follows that fix with a series of cleanups.

  "cma: factor out allocation logic from __cma_declare_contiguous_nid" (Mike Rapoport)
     rationalizes and cleans up the highmem-specific code in the CMA
     allocator.

  "mm/migration: rework movable_ops page migration (part 1)" (David Hildenbrand)
     provides cleanups and future-preparedness to the migration code.

  "mm/damon: add trace events for auto-tuned monitoring intervals and DAMOS quota" (SeongJae Park)
     adds some tracepoints to some DAMON auto-tuning code.

  "mm/damon: fix misc bugs in DAMON modules" (SeongJae Park)
     does that.

  "mm/damon: misc cleanups" (SeongJae Park)
     also does what it claims.

  "mm: folio_pte_batch() improvements" (David Hildenbrand)
     cleans up the large folio PTE batching code.

  "mm/damon/vaddr: Allow interleaving in migrate_{hot,cold} actions" (SeongJae Park)
     facilitates dynamic alteration of DAMON's inter-node allocation
     policy.

  "Remove unmap_and_put_page()" (Vishal Moola)
     provides a couple of page->folio conversions.

  "mm: per-node proactive reclaim" (Davidlohr Bueso)
     implements a per-node control of proactive reclaim - beyond the
     current memcg-based implementation.

  "mm/damon: remove damon_callback" (SeongJae Park)
     replaces the damon_callback interface with a more general and
     powerful damon_call()+damos_walk() interface.

  "mm/mremap: permit mremap() move of multiple VMAs" (Lorenzo Stoakes)
     implements a number of mremap cleanups (of course) in preparation
     for adding new mremap() functionality: newly permit the remapping
     of multiple VMAs when the user is specifying MREMAP_FIXED. It still
     excludes some specialized situations where this cannot be performed
     reliably.

  "drop hugetlb_free_pgd_range()" (Anthony Yznaga)
     switches some sparc hugetlb code over to the generic version and
     removes the thus-unneeded hugetlb_free_pgd_range().

  "mm/damon/sysfs: support periodic and automated stats update" (SeongJae Park)
     augments the present userspace-requested update of DAMON sysfs
     monitoring files. Automatic update is now provided, along with a
     tunable to control the update interval.

  "Some randome fixes and cleanups to swapfile" (Kemeng Shi)
     does what it claims.

  "mm: introduce snapshot_page" (Luiz Capitulino and David Hildenbrand)
     provides (and uses) a means by which debug-style functions can grab
     a copy of a pageframe and inspect it locklessly without tripping
     over the races inherent in operating on the live pageframe
     directly.

  "use per-vma locks for /proc/pid/maps reads" (Suren Baghdasaryan)
     addresses the large contention issues which can be triggered by
     reads from that procfs file. Latencies are reduced by more than
     half in some situations. The series also introduces several new
     selftests for the /proc/pid/maps interface.

  "__folio_split() clean up" (Zi Yan)
     cleans up __folio_split()!

  "Optimize mprotect() for large folios" (Dev Jain)
     provides some quite large (>3x) speedups to mprotect() when dealing
     with large folios.

  "selftests/mm: reuse FORCE_READ to replace "asm volatile("" : "+r" (XXX));" and some cleanup" (wang lian)
     does some cleanup work in the selftests code.

  "tools/testing: expand mremap testing" (Lorenzo Stoakes)
     extends the mremap() selftest in several ways, including adding
     more checking of Lorenzo's recently added "permit mremap() move of
     multiple VMAs" feature.

  "selftests/damon/sysfs.py: test all parameters" (SeongJae Park)
     extends the DAMON sysfs interface selftest so that it tests all
     possible user-requested parameters. Rather than the present minimal
     subset"

* tag 'mm-stable-2025-07-30-15-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (370 commits)
  MAINTAINERS: add missing headers to mempory policy & migration section
  MAINTAINERS: add missing file to cgroup section
  MAINTAINERS: add MM MISC section, add missing files to MISC and CORE
  MAINTAINERS: add missing zsmalloc file
  MAINTAINERS: add missing files to page alloc section
  MAINTAINERS: add missing shrinker files
  MAINTAINERS: move memremap.[ch] to hotplug section
  MAINTAINERS: add missing mm_slot.h file THP section
  MAINTAINERS: add missing interval_tree.c to memory mapping section
  MAINTAINERS: add missing percpu-internal.h file to per-cpu section
  mm/page_alloc: remove trace_mm_alloc_contig_migrate_range_info()
  selftests/damon: introduce _common.sh to host shared function
  selftests/damon/sysfs.py: test runtime reduction of DAMON parameters
  selftests/damon/sysfs.py: test non-default parameters runtime commit
  selftests/damon/sysfs.py: generalize DAMON context commit assertion
  selftests/damon/sysfs.py: generalize monitoring attributes commit assertion
  selftests/damon/sysfs.py: generalize DAMOS schemes commit assertion
  selftests/damon/sysfs.py: test DAMOS filters commitment
  selftests/damon/sysfs.py: generalize DAMOS scheme commit assertion
  selftests/damon/sysfs.py: test DAMOS destinations commitment
  ...
2025-07-31 14:57:54 -07:00
Daniel Borkmann abad3d0bad bpf: Fix oob access in cgroup local storage
Lonial reported that an out-of-bounds access in cgroup local storage
can be crafted via tail calls: given two programs, each utilizing a
cgroup local storage with a different value size, and one program
doing a tail call into the other, the verifier will validate each of
the individual programs just fine. However, in the runtime context
the bpf_cg_run_ctx holds a bpf_prog_array_item which contains the
BPF program as well as any cgroup local storage flavor the program
uses. Helpers such as bpf_get_local_storage() pick this up from the
runtime context:

  ctx = container_of(current->bpf_ctx, struct bpf_cg_run_ctx, run_ctx);
  storage = ctx->prog_item->cgroup_storage[stype];

  if (stype == BPF_CGROUP_STORAGE_SHARED)
    ptr = &READ_ONCE(storage->buf)->data[0];
  else
    ptr = this_cpu_ptr(storage->percpu_buf);

For the second program which was called from the originally attached
one, this means bpf_get_local_storage() will pick up the former
program's map, not its own. With mismatching sizes, this can result
in an unintended out-of-bounds access.

To fix this issue, we need to extend bpf_map_owner with an array of
storage_cookie[] to match on i) the exact maps from the original
program if the second program was using bpf_get_local_storage(), or
ii) allow the tail call combination if the second program was not
using any of the cgroup local storage maps.
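
A conceptual sketch of the resulting check when validating the tail call
pairing; the storage_cookie[] and cookie fields follow this series, but the
exact hunk may differ:

  enum bpf_cgroup_storage_type stype;

  for_each_cgroup_storage_type(stype) {
    u64 cookie = aux->cgroup_storage[stype] ?
           aux->cgroup_storage[stype]->cookie : 0;

    if (map->owner->storage_cookie[stype] != cookie)
      return false;  /* mismatched storage, reject the pairing */
  }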

Fixes: 7d9c342789 ("bpf: Make cgroup storages shared between programs on the same cgroup")
Reported-by: Lonial Con <kongln9170@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20250730234733.530041-4-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-31 11:30:05 -07:00
Daniel Borkmann fd1c98f0ef bpf: Move bpf map owner out of common struct
Given this is only relevant for BPF tail call maps, it is adding up space
and penalizing other map types. We also need to extend this with further
objects to track / compare to. Therefore, let's move this out into a separate
structure and dynamically allocate it only for BPF tail call maps.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20250730234733.530041-2-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-31 11:30:05 -07:00
Daniel Borkmann 12df58ad29 bpf: Add cookie object to bpf maps
Add a cookie to BPF maps to uniquely identify BPF maps for the timespan
when the node is up. This is different to comparing a pointer or BPF map
id which could get rolled over and reused.
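
A minimal sketch of such an allocation-time assignment (the counter name is
hypothetical):

  static atomic64_t bpf_map_cookie_gen;  /* hypothetical: monotonic, never reused while up */

  map->cookie = atomic64_inc_return(&bpf_map_cookie_gen);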

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20250730234733.530041-1-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-31 11:30:05 -07:00
Linus Torvalds 44a8c96edd This update includes the following changes:
API:
 
 - Allow hash drivers without fallbacks (e.g., hardware key).
 
 Algorithms:
 
 - Add hmac hardware key support (phmac) on s390.
 - Re-enable sha384 in FIPS mode.
 - Disable sha1 in FIPS mode.
 - Convert zstd to acomp.
 
 Drivers:
 
 - Lower priority of qat skcipher and aead.
 - Convert aspeed to partial block API.
 - Add iMX8QXP support in caam.
 - Add rate limiting support for GEN6 devices in qat.
 - Enable telemetry for GEN6 devices in qat.
 - Implement full backlog mode for hisilicon/sec2.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEn51F/lCuNhUwmDeSxycdCkmxi6cFAmiHQXwACgkQxycdCkmx
 i6f49A//dQtMg/nvlqForj3BTYKPtjpfZhGxOhda1Y01ts4nFLwM39HtNXGCa6no
 e5L/taHdGd4loZoFa0H7Jz8Qn+I8F3YJLE1gDmN1zogzM6hG7KwFpJLy+PrusS3H
 IwjUehPKNTK2XWmJCdxpsulmwBD+Y//DG3wpwGlkr+MMvlzoMkesvBSCwmXKh/rh
 dn8efrHqL+3LBM6F4nM5zTwcKpLvp7V9arwAE6Zat95WN1X2puEk9L8vYG96hU9/
 YmG79E6WIb4UBILJlYdfba+3tK0bZaU3iDHMLQVlAPgM8JvzF9THyFRlLRa586/P
 rHo2xrgD1vPlMFXKhNI9p+D65zF/5Z0EKTfn1Z99z1kVzz3L71GOYlAvcAw1S9/j
 dRAcfrs/7xEW1SI9j+jVYqZn5g/ClGF8MwEL2VYHzyxN3VPY7ELys4rk6Il29NQp
 EVH8VfZS3XmdF1oiH51/ZDT4mfvQjn3v33ssdNpAFsZX2XIBj0d48JtTN/ynDfUB
 SPS2pTa5FBJCOpRR/Pbct+eloyrVP4Lcy8/gwlKAEY0ZffBBPmd2wCujQf/SKcUH
 e4b6hXAWe0gns/4VSnaker3YdG6o4uPWotZKvIiyKlkKGmJXHuSRK32odRO66+Bg
 tlaUYOmRghmxgU9Sc6h9M6vkm5rBLMw4ccykmhGSvvudm9rLh6A=
 =E8nj
 -----END PGP SIGNATURE-----

Merge tag 'v6.17-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6

Pull crypto update from Herbert Xu:
 "API:
   - Allow hash drivers without fallbacks (e.g., hardware key)

  Algorithms:
   - Add hmac hardware key support (phmac) on s390
   - Re-enable sha384 in FIPS mode
   - Disable sha1 in FIPS mode
   - Convert zstd to acomp

  Drivers:
   - Lower priority of qat skcipher and aead
   - Convert aspeed to partial block API
   - Add iMX8QXP support in caam
   - Add rate limiting support for GEN6 devices in qat
   - Enable telemetry for GEN6 devices in qat
   - Implement full backlog mode for hisilicon/sec2"

* tag 'v6.17-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (116 commits)
  crypto: keembay - Use min() to simplify ocs_create_linked_list_from_sg()
  crypto: hisilicon/hpre - fix dma unmap sequence
  crypto: qat - make adf_dev_autoreset() static
  crypto: ccp - reduce stack usage in ccp_run_aes_gcm_cmd
  crypto: qat - refactor ring-related debug functions
  crypto: qat - fix seq_file position update in adf_ring_next()
  crypto: qat - fix DMA direction for compression on GEN2 devices
  crypto: jitter - replace ARRAY_SIZE definition with header include
  crypto: engine - remove {prepare,unprepare}_crypt_hardware callbacks
  crypto: engine - remove request batching support
  crypto: qat - flush misc workqueue during device shutdown
  crypto: qat - enable rate limiting feature for GEN6 devices
  crypto: qat - add compression slice count for rate limiting
  crypto: qat - add get_svc_slice_cnt() in device data structure
  crypto: qat - add adf_rl_get_num_svc_aes() in rate limiting
  crypto: qat - relocate service related functions
  crypto: qat - consolidate service enums
  crypto: qat - add decompression service for rate limiting
  crypto: qat - validate service in rate limiting sysfs api
  crypto: hisilicon/sec2 - implement full backlog mode for sec
  ...
2025-07-31 09:45:28 -07:00
Yury Norov [NVIDIA] f49a4af3fa watchdog: fix opencoded cpumask_next_wrap() in watchdog_next_cpu()
The dedicated helper is less verbose and more efficient compared to
cpumask_next() followed by cpumask_first().
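
The shape of the replacement (mask name as in watchdog_buddy.c, assumed;
cpumask_next_wrap() in its current two-argument form):

  /* opencoded wrap-around */
  next_cpu = cpumask_next(cpu, &watchdog_cpus);
  if (next_cpu >= nr_cpu_ids)
    next_cpu = cpumask_first(&watchdog_cpus);

  /* dedicated helper */
  next_cpu = cpumask_next_wrap(cpu, &watchdog_cpus);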

Signed-off-by: "Yury Norov [NVIDIA]" <yury.norov@gmail.com>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
2025-07-31 11:28:03 -04:00
Yury Norov [NVIDIA] 8557c8628c clocksource: Improve randomness in clocksource_verify_choose_cpus()
The current algorithm of picking a random CPU works OK for a dense online
cpumask, but if the cpumask is non-dense, the distribution of picked CPUs
is skewed.

For example, on an 8-CPU board with CPUs 4-7 offlined, the probability of
selecting CPU 0 is 5/8. Accordingly, CPUs 1, 2 and 3 are chosen with
probability 1/8 each. The proper algorithm should pick each online CPU
with probability 1/4.

Switch it to cpumask_random(), which has better statistical
characteristics.
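
The pick then becomes a single call (signature per the companion
cpumask_random() patch in this series, assumed here):

  cpu = cpumask_random(cpu_online_mask);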

CC: Andrew Morton <akpm@linux-foundation.org>
Acked-by: John Stultz <jstultz@google.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: "Yury Norov [NVIDIA]" <yury.norov@gmail.com>
2025-07-31 11:27:48 -04:00
Steven Rostedt b3b9cb11aa unwind: Finish up unwind when a task exits
In do_exit(), when a task is exiting, if an unwind is requested and the
user stacktrace is deferred via the task_work, the task_work
callback is called after exit_mm(). This means that
the user stack trace will not be retrieved and an empty stack is created.

Instead, add a function unwind_deferred_task_exit() and call it just
before exit_mm() so that the unwinder can call the requested callbacks
with the user space stack.
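
A sketch of the resulting ordering in do_exit(), heavily elided:

  void __noreturn do_exit(long code)
  {
    struct task_struct *tsk = current;

    /* flush pending unwind requests while the mm is still alive */
    unwind_deferred_task_exit(tsk);
    exit_mm();
  }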

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Indu Bhagat <indu.bhagat@oracle.com>
Cc: "Jose E. Marchesi" <jemarch@gnu.org>
Cc: Beau Belgrave <beaub@linux.microsoft.com>
Cc: Jens Remus <jremus@linux.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Sam James <sam@gentoo.org>
Link: https://lore.kernel.org/20250729182406.504259474@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-31 10:20:11 -04:00
Steven Rostedt 357eda2d74 unwind deferred: Use SRCU unwind_deferred_task_work()
Instead of using the callback_mutex to protect the linked list of callbacks
in unwind_deferred_task_work(), use SRCU. This gets called every
time a task that has to record a requested stack trace exits.
This can happen for many tasks on several CPUs at the same time. A mutex
is a bottleneck and can cause a bit of contention and slow down performance.

As the callbacks themselves are allowed to sleep, regular RCU cannot be
used to protect the list. Instead use SRCU, as that still allows the
callbacks to sleep and the list can be read without needing to hold the
callback_mutex.
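
A sketch of the SRCU-protected walk (the srcu_struct and list names are
assumed):

  int idx;

  idx = srcu_read_lock(&unwind_srcu);
  list_for_each_entry_srcu(work, &callbacks, list,
         srcu_read_lock_held(&unwind_srcu))
    work->func(work, &trace, cookie);
  srcu_read_unlock(&unwind_srcu, idx);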

Link: https://lore.kernel.org/all/ca9bd83a-6c80-4ee0-a83c-224b9d60b755@efficios.com/

Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Indu Bhagat <indu.bhagat@oracle.com>
Cc: "Jose E. Marchesi" <jemarch@gnu.org>
Cc: Beau Belgrave <beaub@linux.microsoft.com>
Cc: Jens Remus <jremus@linux.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Sam James <sam@gentoo.org>
Link: https://lore.kernel.org/20250729182406.331548065@kernel.org
Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-31 10:20:11 -04:00
Steven Rostedt 858fa8a3b0 unwind: Add USED bit to only have one conditional on way back to user space
On the way back to user space, the function unwind_reset_info() is called
unconditionally (but always inlined). It currently has two conditionals.
One checks the unwind_mask, which is set whenever a deferred trace is
requested and is used to know that the mask needs to be cleared. The other
checks if the cache has been allocated, and if so, it resets the
nr_entries so that the unwinder knows it needs to do the work to get a new
user space stack trace again (it only does it once per entry into the
kernel).

Use one of the bits in the unwind mask as a "USED" bit that gets set
whenever a trace is created. This will make it possible to only check the
unwind_mask in the unwind_reset_info() to know if it needs to do work or
not and eliminates a conditional that happens every time the task goes
back to user space.
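
A conceptual sketch of the single-test fast path (field and bit names
assumed):

  if (unlikely(current->unwind_mask)) {
    if (current->unwind_mask & UNWIND_USED)
      current->unwind_cache->nr_entries = 0;
    current->unwind_mask = 0;
  }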

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Indu Bhagat <indu.bhagat@oracle.com>
Cc: "Jose E. Marchesi" <jemarch@gnu.org>
Cc: Beau Belgrave <beaub@linux.microsoft.com>
Cc: Jens Remus <jremus@linux.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Sam James <sam@gentoo.org>
Link: https://lore.kernel.org/20250729182406.155422551@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-31 10:20:11 -04:00
Steven Rostedt 4c75133e74 unwind deferred: Add unwind_completed mask to stop spurious callbacks
If there's more than one registered tracer to the unwind deferred
infrastructure, it is currently possible that one tracer could cause extra
callbacks to happen for another tracer if the former requests a deferred
stacktrace after the latter's callback was executed and before the task
went back to user space.

Here's an example of how this could occur:

  [Task enters kernel]
    tracer 1 request -> add cookie to its buffer
    tracer 1 request -> add cookie to its buffer
    <..>
    [ task work executes ]
    tracer 1 callback -> add trace + cookie to its buffer

    [tracer 2 requests and triggers the task work again]
    [ task work executes again ]
    tracer 1 callback -> add trace + cookie to its buffer
    tracer 2 callback -> add trace + cookie to its buffer
 [Task exits back to user space]

This is because the bit for tracer 1 gets set in the task's unwind_mask
when it did its request and does not get cleared until the task returns
back to user space. But if another tracer were to request another deferred
stacktrace, then the next task work will execute all tracers' callbacks
that have their bits set in the task's unwind_mask.

To fix this issue, add another mask called unwind_completed and place it
into the task's info->cache structure. The cache structure is allocated
on the first occurrence of a deferred stacktrace and this unwind_completed
mask is not needed until then. It's better to have it in the cache than to
permanently waste space in the task_struct.

After a tracer's callback is executed, its bit gets set in this
unwind_completed mask. When the task_work runs, it will AND the task's
unwind_mask with the inverse of the unwind_completed mask, which will
eliminate any work that already had its callback executed since the task
entered the kernel.

When the task leaves the kernel, it will reset this unwind_completed mask
just like it resets the other values as it enters user space.
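
A conceptual sketch of the filtering in the task_work (names assumed):

  unsigned long bits;
  int bit;

  bits = current->unwind_mask & ~info->cache->unwind_completed;
  for_each_set_bit(bit, &bits, BITS_PER_LONG) {
    struct unwind_work *work = works[bit];  /* hypothetical bit -> tracer lookup */

    work->func(work, &trace, cookie);
    set_bit(bit, &info->cache->unwind_completed);
  }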

Link: https://lore.kernel.org/all/20250716142609.47f0e4a5@batman.local.home/

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Indu Bhagat <indu.bhagat@oracle.com>
Cc: "Jose E. Marchesi" <jemarch@gnu.org>
Cc: Beau Belgrave <beaub@linux.microsoft.com>
Cc: Jens Remus <jremus@linux.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Sam James <sam@gentoo.org>
Link: https://lore.kernel.org/20250729182405.989222722@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-31 10:20:10 -04:00
Steven Rostedt be3d526a5b unwind deferred: Use bitmask to determine which callbacks to call
In order to know which registered callback requested a stacktrace for when
the task goes back to user space, add a bitmask to keep track of all
registered tracers. The bitmask is the size of a long, which means that on
a 32-bit machine it can have at most 32 registered tracers, and on 64-bit,
at most 64 registered tracers. This should not be an issue as
there should not be more than 10 (unless BPF can abuse this?).

When a tracer registers with unwind_deferred_init() it will get a bit
number assigned to it. When a tracer requests a stacktrace, it will have
its bit set within the task_struct. When the task returns back to user
space, it will call the callbacks for all the registered tracers where
their bits are set in the task's mask.
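
Conceptually (field names assumed; the real code iterates the registered
unwind_work entries):

  /* registration: hand the tracer a bit */
  work->bit = next_free_bit();  /* hypothetical allocator */

  /* request: mark that bit on the requesting task */
  set_bit(work->bit, &current->unwind_mask);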

When a tracer is removed by unwind_deferred_cancel(), all current tasks
will clear the associated bit, just in case another tracer gets registered
immediately afterward and then gets its callback called unexpectedly.

To prevent live locks where an event that fires between the task_work and
the return to user space re-triggers the deferred unwind, have the
unwind_mask get cleared on exit to user space and not after the callback
is made.

Move the pending bit from a value on the task_struct to bit zero of the
unwind_mask (saves space on the task_struct). This will allow modifying
the pending bit along with the work bits atomically.

Instead of clearing a work's bit after its callback is called, it is
delayed until exit. If the work is requested again, the task_work is not
queued again, and the requester is notified that the callback has already
been called by returning a positive number (the same as if it were
already pending).

The pending bit is cleared before calling the callback functions, but the
current work bits remain. If one of the called works requests again, it
will not trigger a task_work if its bit is still present in the task's
unwind_mask.

If a new work requests a deferred unwind, then it will set both the
pending bit and its own bit. Note this will also cause any work that was
previously queued and had its callback already executed to be executed
again. Future work will remove these spurious callbacks.

The use of atomic_long bit operations was suggested by Peter Zijlstra:
Link: https://lore.kernel.org/all/20250715102912.GQ1613200@noisy.programming.kicks-ass.net/
The unwind_mask could not be converted to atomic_long_t due to atomic_long
not having all the bit operations needed by unwind_mask. Instead it
follows other use cases in the kernel and just typecasts the unwind_mask
to atomic_long_t when using the two atomic_long functions.
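
The request/exit flow maps onto a few atomic bit operations on the mask.
A minimal userspace sketch using C11 atomics where the kernel casts its
plain unsigned long to atomic_long_t (bit numbers are illustrative):

  #include <stdatomic.h>
  #include <stdio.h>

  #define UNWIND_PENDING (1UL << 0)   /* bit zero is the pending bit */

  int main(void)
  {
          _Atomic unsigned long unwind_mask = 0;
          unsigned long tracer_bit = 1UL << 3; /* bit assigned at registration */
          unsigned long old;

          /* request: set the pending bit and the work bit in one operation */
          old = atomic_fetch_or(&unwind_mask, UNWIND_PENDING | tracer_bit);
          if (!(old & UNWIND_PENDING))
                  printf("first request: queue the task_work\n");

          /* a second request while pending: no new task_work is queued */
          old = atomic_fetch_or(&unwind_mask, UNWIND_PENDING | tracer_bit);
          if (old & UNWIND_PENDING)
                  printf("already pending: return a positive number\n");

          /* task_work: clear only the pending bit, the work bits remain */
          atomic_fetch_and(&unwind_mask, ~UNWIND_PENDING);
          printf("mask after callbacks: %#lx\n", (unsigned long)unwind_mask);
          return 0;
  }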

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Indu Bhagat <indu.bhagat@oracle.com>
Cc: "Jose E. Marchesi" <jemarch@gnu.org>
Cc: Beau Belgrave <beaub@linux.microsoft.com>
Cc: Jens Remus <jremus@linux.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Sam James <sam@gentoo.org>
Link: https://lore.kernel.org/20250729182405.822789300@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-31 10:20:10 -04:00
Steven Rostedt 055c7060e7 unwind_user/deferred: Make unwind deferral requests NMI-safe
Make unwind_deferred_request() NMI-safe so tracers in NMI context can
call it and safely request a user space stacktrace for when the task
exits the kernel.

Note, this is only allowed for architectures that implement a safe
cmpxchg. If a deferred stacktrace is requested from NMI context on an
architecture that does not support a safe NMI cmpxchg, the request will
fail with -EINVAL and trigger a warning. Those architectures would need
another method (perhaps an irqwork) to request a deferred user space
stack trace. That can be dealt with later if one of these architectures
requires this feature.
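
A minimal userspace sketch of why one compare-and-exchange makes the
request NMI-safe (names are illustrative; the kernel operates on per-task
unwind state): the first context to succeed queues the task work, and any
other context, including an NMI that raced with it, reuses the installed
cookie.

  #include <stdatomic.h>
  #include <stdbool.h>
  #include <stdio.h>

  static _Atomic unsigned long task_cookie;

  /* Returns true if this caller claimed the cookie and must queue work */
  static bool claim_cookie(unsigned long new_cookie, unsigned long *out)
  {
          unsigned long expected = 0;

          if (atomic_compare_exchange_strong(&task_cookie, &expected, new_cookie)) {
                  *out = new_cookie;      /* won the race */
                  return true;
          }
          *out = expected;                /* lost: reuse the installed cookie */
          return false;
  }

  int main(void)
  {
          unsigned long c;
          bool queued;

          queued = claim_cookie(0x1001, &c);
          printf("queued=%d cookie=%#lx\n", queued, c);
          queued = claim_cookie(0x2002, &c);
          printf("queued=%d cookie=%#lx\n", queued, c);
          return 0;
  }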

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Indu Bhagat <indu.bhagat@oracle.com>
Cc: "Jose E. Marchesi" <jemarch@gnu.org>
Cc: Beau Belgrave <beaub@linux.microsoft.com>
Cc: Jens Remus <jremus@linux.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Sam James <sam@gentoo.org>
Link: https://lore.kernel.org/20250729182405.657072238@kernel.org
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-31 10:20:10 -04:00
Josh Poimboeuf 2dffa355f6 unwind_user/deferred: Add deferred unwinding interface
Add an interface for scheduling task work to unwind the user space stack
before returning to user space. This solves several problems for its
callers:

  - Ensure the unwind happens in task context even if the caller may be
    running in interrupt context.

  - Avoid duplicate unwinds, whether called multiple times by the same
    caller or by different callers.

  - Create a "context cookie" which allows trace post-processing to
    correlate kernel unwinds/traces with the user unwind.

A concept of a "cookie" is created to detect when the stacktrace is the
same. A cookie is generated the first time a user space stacktrace is
requested after the task enters the kernel. As the stacktrace is saved on
the task_struct while the task is in the kernel, if another request comes
in while the cookie is still the same, the saved stacktrace is used
instead of regenerating one.

The cookie is passed to the caller on request, and when the stacktrace is
generated upon returning to user space, the requester's callback is
called with the cookie as well as the stacktrace. The cookie is cleared
when the task goes back to user space. Note, this currently adds another
conditional to the unwind_reset_info() path that is always called when
returning to user space, but future changes will put this back to a
single conditional.

A global list, protected by a global mutex, holds the tracers that
register with the unwind infrastructure. The number of registered tracers
will be limited in future changes. Each perf program or ftrace instance
will register its own descriptor to use for deferred unwind stack traces.

Note, the function unwind_deferred_task_work() that gets called when
returning to user space uses a global mutex for synchronization, which
will cause a big bottleneck. This will be replaced by SRCU, but that
change adds some complex synchronization that deserves its own commit.
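
A sketch of how a tracer is expected to consume this interface, using the
functions named above; the header path, callback signature, and struct
fields are taken from this series and abridged, so treat them as
illustrative rather than definitive:

  #include <linux/unwind_deferred.h>

  static struct unwind_work my_unwind_work;

  /* Runs in task context just before the task returns to user space */
  static void my_unwind_callback(struct unwind_work *work,
                                 struct unwind_stacktrace *trace, u64 cookie)
  {
          /* emit trace->entries[0 .. trace->nr - 1] tagged with @cookie */
  }

  static int my_tracer_init(void)
  {
          return unwind_deferred_init(&my_unwind_work, my_unwind_callback);
  }

  /* Called from a trace event, possibly in interrupt context */
  static void my_trace_event(void)
  {
          u64 cookie;

          /*
           * Queues the task_work on the first request of this kernel
           * entry; @cookie lets post-processing match this event with
           * the user stacktrace delivered to the callback later.
           */
          unwind_deferred_request(&my_unwind_work, &cookie);
  }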

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Indu Bhagat <indu.bhagat@oracle.com>
Cc: "Jose E. Marchesi" <jemarch@gnu.org>
Cc: Beau Belgrave <beaub@linux.microsoft.com>
Cc: Jens Remus <jremus@linux.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Sam James <sam@gentoo.org>
Link: https://lore.kernel.org/20250729182405.488066537@kernel.org
Co-developed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-31 10:20:10 -04:00
Josh Poimboeuf b9c7352410 unwind_user/deferred: Add unwind cache
Cache the results of the unwind to ensure the unwind is only performed
once, even when called by multiple tracers.

The cache nr_entries gets cleared every time the task exits the kernel.
When a stacktrace is requested, nr_entries gets set to the number of
entries in the stacktrace. If another stacktrace is requested and
nr_entries is not zero, then the cache already contains the stacktrace
that would be retrieved, so it is not processed again and the cached
entries are given to the caller.
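
A minimal userspace sketch of that rule, with nr_entries doubling as the
"already unwound" flag (layout and names are illustrative, not the
kernel's exact structures):

  #include <stdio.h>

  struct unwind_cache {
          unsigned int nr_entries;        /* 0 means no cached stacktrace */
          unsigned long entries[64];
  };

  static unsigned int get_stacktrace(struct unwind_cache *cache)
  {
          if (cache->nr_entries)          /* later callers reuse the trace */
                  return cache->nr_entries;

          /* first caller since entering the kernel: do the real unwind */
          cache->entries[0] = 0xdeadbeef; /* stand-in for a real entry */
          cache->nr_entries = 1;
          return cache->nr_entries;
  }

  int main(void)
  {
          struct unwind_cache cache = { 0 };

          printf("tracer A sees %u entry\n", get_stacktrace(&cache));
          printf("tracer B sees %u entry (cached)\n", get_stacktrace(&cache));
          cache.nr_entries = 0;           /* cleared on exit to user space */
          return 0;
  }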

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Indu Bhagat <indu.bhagat@oracle.com>
Cc: "Jose E. Marchesi" <jemarch@gnu.org>
Cc: Beau Belgrave <beaub@linux.microsoft.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Sam James <sam@gentoo.org>
Link: https://lore.kernel.org/20250729182405.319691167@kernel.org
Reviewed-by: Jens Remus <jremus@linux.ibm.com>
Reviewed-By: Indu Bhagat <indu.bhagat@oracle.com>
Co-developed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-31 10:20:04 -04:00
Petr Pavlu a7c54b2b41
tracing: Replace MAX_PARAM_PREFIX_LEN with MODULE_NAME_LEN
Use the MODULE_NAME_LEN definition in module_exists() to obtain the maximum
size of a module name, instead of using MAX_PARAM_PREFIX_LEN. The values
are the same but MODULE_NAME_LEN is more appropriate in this context.
MAX_PARAM_PREFIX_LEN was added in commit 730b69d225 ("module: check
kernel param length at compile time, not runtime") only to break a circular
dependency between module.h and moduleparam.h, and should mostly be limited
to use in moduleparam.h.
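
The values match because, at the time of writing, the constants are
aliases in the headers (abridged):

  /* include/linux/moduleparam.h */
  #define MAX_PARAM_PREFIX_LEN (64 - sizeof(unsigned long))

  /* include/linux/module.h */
  #define MODULE_NAME_LEN MAX_PARAM_PREFIX_LEN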

Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Daniel Gomez <da.gomez@samsung.com>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20250630143535.267745-5-petr.pavlu@suse.com
Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
2025-07-31 13:57:44 +02:00
Petr Pavlu 6c171b2ccf
module: Remove unnecessary +1 from last_unloaded_module::name size
The variable last_unloaded_module::name tracks the name of the last
unloaded module. It is a string copy of module::name, which is
MODULE_NAME_LEN bytes in size and includes the NUL terminator. Therefore,
the size of last_unloaded_module::name can also be just MODULE_NAME_LEN,
without the need for an extra byte.
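
In other words, the change is a one-byte trim (sketch, with the
surrounding structure abridged):

  /* Before: a spare byte beyond what a NUL-terminated copy of
   * module::name can ever use.
   */
  char name[MODULE_NAME_LEN + 1];

  /* After: MODULE_NAME_LEN already includes the NUL terminator */
  char name[MODULE_NAME_LEN];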

Fixes: e14af7eeb4 ("debug: track and print last unloaded module in the oops trace")
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Reviewed-by: Daniel Gomez <da.gomez@samsung.com>
Link: https://lore.kernel.org/r/20250630143535.267745-3-petr.pavlu@suse.com
Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
2025-07-31 13:57:32 +02:00
Petr Pavlu a6323bd4e6
module: Prevent silent truncation of module name in delete_module(2)
Passing a module name longer than MODULE_NAME_LEN to the delete_module
syscall results in its silent truncation. This really isn't much of
a problem in practice, but it could theoretically lead to the removal of an
incorrect module. It is more sensible to return ENAMETOOLONG or ENOENT in
such a case.

Update the syscall to return ENOENT, as documented in the delete_module(2)
man page to mean "No module by that name exists." This is appropriate
because a module with a name longer than MODULE_NAME_LEN cannot be loaded
in the first place.
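
A sketch of the updated entry check (abridged; strncpy_from_user()
returns the string length on success, or the full count when the user
string does not fit in the buffer):

  char name[MODULE_NAME_LEN];
  long ret;

  ret = strncpy_from_user(name, name_user, MODULE_NAME_LEN);
  if (ret < 0)
          return -EFAULT;
  if (ret == MODULE_NAME_LEN)
          return -ENOENT; /* longer than any loadable module's name */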

Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Reviewed-by: Daniel Gomez <da.gomez@samsung.com>
Link: https://lore.kernel.org/r/20250630143535.267745-2-petr.pavlu@suse.com
Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
2025-07-31 13:57:21 +02:00
Thomas Weißschuh 199d9ffb31 module: move 'struct module_use' to internal.h
The struct was moved to the public header file in commit c8e21ced08
("module: fix kdb's illicit use of struct module_use.").
Back then the structure was used outside of the module core.
This is no longer the case, so the structure can be made internal.

Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Reviewed-by: Daniel Gomez <da.gomez@samsung.com>
Reviewed-by: Petr Pavlu <petr.pavlu@suse.com>
Link: https://lore.kernel.org/r/20250711-kunit-ifdef-modules-v2-1-39443decb1f8@linutronix.de
Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
2025-07-31 13:40:46 +02:00
Linus Torvalds 260f6f4fda drm for 6.17-rc1
non-drm:
 rust:
 - make ETIMEDOUT available
 - add size constants up to SZ_2G
 - add DMA coherent allocation bindings
 mtd:
 - driver for Intel GPU non-volatile storage
 i2c:
 - designware quirk for Intel xe
 
 core:
 - atomic helpers: tune enable/disable sequences
 - add task info to wedge API
 - refactor EDID quirks
 - connector: move HDR sink to drm_display_info
 - fourcc: half-float and 32-bit float formats
 - mode_config: pass format info to simplify
 
 dma-buf:
 - heaps: Give CMA heap a stable name
 
 ci:
 - add device tree validation and kunit
 
 displayport:
 - change AUX DPCD access probe address
 - add quirk for DPCD probe
 - add panel replay definitions
 - backlight control helpers
 
 fbdev:
 - make CONFIG_FIRMWARE_EDID available on all arches
 
 fence:
 - fix UAF issues
 
 format-helper:
 - improve tests
 
 gpusvm:
 - introduce devmem only flag for allocation
 - add timeslicing support to GPU SVM
 
 ttm:
 - improve eviction
 
 sched:
 - tracing improvements
 - kunit improvements
 - memory leak fixes
 - reset handling improvements
 
 color mgmt:
 - add hardware gamma LUT handling helpers
 
 bridge:
 - add destroy hook
 - switch to reference counted drm_bridge allocations
 - tc358767: convert to devm_drm_bridge_alloc
 - improve CEC handling
 
 panel:
 - switch to reference counter drm_panel allocations
 - fwnode panel lookup
 - Huiling hl055fhv028c support
 - Raspberry Pi 7" 720x1280 support
 - edp: KDC KD116N3730A05, N160JCE-ELL CMN, N116BCJ-EAK
 - simple: AUO P238HAN01
 - st7701: Winstar wf40eswaa6mnn0
 - visionox: rm69299-shift
 - Renesas R61307, Renesas R69328 support
 - DJN HX83112B
 
 hdmi:
 - add CEC handling
 - YUV420 output support
 
 xe:
 - WildCat Lake support
 - Enable Panther Lake by default
 - mark BMG as SRIOV capable
 - update firmware recommendations
 - Expose media OA units
 - aux-bus support for non-volatile memory
 - MTD intel-dg driver for non-volatile memory
 - Expose fan control and voltage regulator in sysfs
 - restructure migration for multi-device
 - Restore GuC submit UAF fix
 - make GEM shrinker drm managed
 - SRIOV VF Post-migration recovery of GGTT nodes
 - W/A additions/reworks
 - Prefetch support for svm ranges
 - Don't allocate managed BO for each policy change
 - HWMON fixes for BMG
 - Create LRC BO without VM
 - PCI ID updates
 - make SLPC debugfs files optional
 - rework eviction rejection of bound external BOs
 - consolidate PAT programming logic for pre/post Xe2
 - init changes for flicker-free boot
 - Enable GuC Dynamic Inhibit Context switch
 
 i915:
 - drm_panic support for i915/xe
 - initial flip queue off by default for LNL/PNL
 - Wildcat Lake Display support
 - Support for DSC fractional link bpp
 - Support for simultaneous Panel Replay and Adaptive sync
 - Support for PTL+ double buffer LUT
 - initial PIPEDMC event handling
 - drm_panel_follower support
 - DPLL interface renames
 - allocate struct intel_display dynamically
 - flip queue preparation
 - abstract DRAM detection better
 - avoid GuC scheduling stalls
 - remove DG1 force probe requirement
 - fix MEI interrupt handler on RT kernels
 - use backlight control helpers for eDP
 - more shared display code refactoring
 
 amdgpu:
 - add userq slot to INFO ioctl
 - SR-IOV hibernation support
 - Suspend improvements
 - Backlight improvements
 - Use scaling for non-native eDP modes
 - cleaner shader updates for GC 9.x
 - Remove fence slab
 - SDMA fw checks for userq support
 - RAS updates
 - DMCUB updates
 - DP tunneling fixes
 - Display idle D3 support
 - Per queue reset improvements
 - initial smartmux support
 
 amdkfd:
 - enable KFD on loongarch
 - mtype fix for ext coherent system memory
 
 radeon:
 - CS validation additional GL extensions
 - drop console lock during suspend/resume
 - bump driver version
 
 msm:
 - VM BIND support
 - CI: infrastructure updates
 - UBWC single source of truth
 - decouple GPU and KMS support
 - DP: rework I/O accessors
 - DPU: SM8750 support
 - DSI: SM8750 support
 - GPU: X1-45 support and speedbin support for X1-85
 - MDSS: SM8750 support
 
 nova:
 - register! macro improvements
 - DMA object abstraction
 - VBIOS parser + fwsec lookup
 - sysmem flush page support
 - falcon: generic falcon boot code and HAL
 - FWSEC-FRTS: fb setup and load/execute
 
 ivpu:
 - Add Wildcat Lake support
 - Add turbo flag
 
 ast:
 - improve hardware generations implementation
 
 imx:
 - IMX8qxq Display Controller support
 
 lima:
 - Rockchip RK3528 GPU support
 
 nouveau:
 - fence handling cleanup
 
 panfrost:
 - MT8370 support
 - bo labeling
 - 64-bit register access
 
 qaic:
 - add RAS support
 
 rockchip:
 - convert inno_hdmi to a bridge
 
 rz-du:
 - add RZ/V2H(P) support
 - MIPI-DSI DCS support
 
 sitronix:
 - ST7567 support
 
 sun4i:
 - add H616 support
 
 tidss:
 - add TI AM62L support
 - AM65x OLDI bridge support
 
 bochs:
 - drm panic support
 
 vkms:
 - YUV and R* format support
 - use faux device
 
 vmwgfx:
 - fence improvements
 
 hyperv:
 - move out of simple
 - add drm_panic support
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEEKbZHaGwW9KfbeusDHTzWXnEhr4FAmiJM/0ACgkQDHTzWXnE
 hr6MpA/+JJKGdSdrE95QkaMcOZh/3e3areGXZ0V/RrrJXdB4/DoAfQSHhF0H7m7y
 MhBGVLGNMXq7KHrz28p1MjLHrE1mwmvJ6hZ4J076ed4u9naoCD0m6k5w5wiue+KL
 HyPR54ADxN0BYmgV0l/B0wj42KsHyTO4x4hdqPJu02V9Dtmx6FCh2ujkOF3p9nbK
 GMwWDttl4KEKljD0IvQ9YIYJ66crYGx/XmZi7JoWRrS104K/h1u8qZuXBp5jVKTy
 OZRAVyLdmJqdTOLH7l599MBBcEd/bNV37/LVwF4T5iFunEKOAiyN0QY0OR+IeRVh
 ZfOv2/gp4UNyIfyahQ7LKLgEilNPGHoPitvDJPvBZxW2UjwXVNvA1QfdK5DAlVRS
 D5NoFRjlFFCz8/c2hQwlKJ9o7eVgH3/pK0mwR7SPGQTuqzLFCrAfCuzUvg/gV++6
 JFqmGKMHeCoxO2o4GMrwjFttStP41usxtV/D+grcbPteNO9UyKJS4C38n4eamJXM
 a9Sy9APuAb6F0w5+yMItEF7TQifgmhIbm5AZHlxE1KoDQV6TdiIf1Gou5LeDGoL6
 OACbXHJPL52tUnfCRpbfI4tE/IVyYsfL01JnvZ5cZZWItXfcIz76ykJri+E0G60g
 yRl/zkimHKO4B0l/HSzal5xROXr+3VzeWehEiz/ot1VriP5OesA=
 =n9MO
 -----END PGP SIGNATURE-----

Merge tag 'drm-next-2025-07-30' of https://gitlab.freedesktop.org/drm/kernel

Pull drm updates from Dave Airlie:
 "Highlights:

   - Intel xe enables Panther Lake, started adding WildCat Lake

   - amdgpu has a bunch of reset improvements along with the usual IP
     updates

   - msm got VM_BIND support which is important for vulkan sparse memory

   - more drm_panic users

   - gpusvm common code to handle a bunch of core SVM work outside
     drivers.

  Detail summary:

  Changes outside drm subdirectory:
   - 'shrink_shmem_memory()' for better shmem/hibernate interaction
   - Rust support infrastructure:
      - make ETIMEDOUT available
      - add size constants up to SZ_2G
      - add DMA coherent allocation bindings
   - mtd driver for Intel GPU non-volatile storage
   - i2c designware quirk for Intel xe

  core:
   - atomic helpers: tune enable/disable sequences
   - add task info to wedge API
   - refactor EDID quirks
   - connector: move HDR sink to drm_display_info
   - fourcc: half-float and 32-bit float formats
   - mode_config: pass format info to simplify

  dma-buf:
   - heaps: Give CMA heap a stable name

  ci:
   - add device tree validation and kunit

  displayport:
   - change AUX DPCD access probe address
   - add quirk for DPCD probe
   - add panel replay definitions
   - backlight control helpers

  fbdev:
   - make CONFIG_FIRMWARE_EDID available on all arches

  fence:
   - fix UAF issues

  format-helper:
   - improve tests

  gpusvm:
   - introduce devmem only flag for allocation
   - add timeslicing support to GPU SVM

  ttm:
   - improve eviction

  sched:
   - tracing improvements
   - kunit improvements
   - memory leak fixes
   - reset handling improvements

  color mgmt:
   - add hardware gamma LUT handling helpers

  bridge:
   - add destroy hook
   - switch to reference counted drm_bridge allocations
   - tc358767: convert to devm_drm_bridge_alloc
   - improve CEC handling

  panel:
   - switch to reference counter drm_panel allocations
   - fwnode panel lookup
   - Huiling hl055fhv028c support
   - Raspberry Pi 7" 720x1280 support
   - edp: KDC KD116N3730A05, N160JCE-ELL CMN, N116BCJ-EAK
   - simple: AUO P238HAN01
   - st7701: Winstar wf40eswaa6mnn0
   - visionox: rm69299-shift
   - Renesas R61307, Renesas R69328 support
   - DJN HX83112B

  hdmi:
   - add CEC handling
   - YUV420 output support

  xe:
   - WildCat Lake support
   - Enable Panther Lake by default
   - mark BMG as SRIOV capable
   - update firmware recommendations
   - Expose media OA units
   - aux-bus support for non-volatile memory
   - MTD intel-dg driver for non-volatile memory
   - Expose fan control and voltage regulator in sysfs
   - restructure migration for multi-device
   - Restore GuC submit UAF fix
   - make GEM shrinker drm managed
   - SRIOV VF Post-migration recovery of GGTT nodes
   - W/A additions/reworks
   - Prefetch support for svm ranges
   - Don't allocate managed BO for each policy change
   - HWMON fixes for BMG
   - Create LRC BO without VM
   - PCI ID updates
   - make SLPC debugfs files optional
   - rework eviction rejection of bound external BOs
   - consolidate PAT programming logic for pre/post Xe2
   - init changes for flicker-free boot
   - Enable GuC Dynamic Inhibit Context switch

  i915:
   - drm_panic support for i915/xe
   - initial flip queue off by default for LNL/PNL
   - Wildcat Lake Display support
   - Support for DSC fractional link bpp
   - Support for simultaneous Panel Replay and Adaptive sync
   - Support for PTL+ double buffer LUT
   - initial PIPEDMC event handling
   - drm_panel_follower support
   - DPLL interface renames
   - allocate struct intel_display dynamically
   - flip queue preparation
   - abstract DRAM detection better
   - avoid GuC scheduling stalls
   - remove DG1 force probe requirement
   - fix MEI interrupt handler on RT kernels
   - use backlight control helpers for eDP
   - more shared display code refactoring

  amdgpu:
   - add userq slot to INFO ioctl
   - SR-IOV hibernation support
   - Suspend improvements
   - Backlight improvements
   - Use scaling for non-native eDP modes
   - cleaner shader updates for GC 9.x
   - Remove fence slab
   - SDMA fw checks for userq support
   - RAS updates
   - DMCUB updates
   - DP tunneling fixes
   - Display idle D3 support
   - Per queue reset improvements
   - initial smartmux support

  amdkfd:
   - enable KFD on loongarch
   - mtype fix for ext coherent system memory

  radeon:
   - CS validation additional GL extensions
   - drop console lock during suspend/resume
   - bump driver version

  msm:
   - VM BIND support
   - CI: infrastructure updates
   - UBWC single source of truth
   - decouple GPU and KMS support
   - DP: rework I/O accessors
   - DPU: SM8750 support
   - DSI: SM8750 support
   - GPU: X1-45 support and speedbin support for X1-85
   - MDSS: SM8750 support

  nova:
   - register! macro improvements
   - DMA object abstraction
   - VBIOS parser + fwsec lookup
   - sysmem flush page support
   - falcon: generic falcon boot code and HAL
   - FWSEC-FRTS: fb setup and load/execute

  ivpu:
   - Add Wildcat Lake support
   - Add turbo flag

  ast:
   - improve hardware generations implementation

  imx:
   - IMX8qxq Display Controller support

  lima:
   - Rockchip RK3528 GPU support

  nouveau:
   - fence handling cleanup

  panfrost:
   - MT8370 support
   - bo labeling
   - 64-bit register access

  qaic:
   - add RAS support

  rockchip:
   - convert inno_hdmi to a bridge

  rz-du:
   - add RZ/V2H(P) support
   - MIPI-DSI DCS support

  sitronix:
   - ST7567 support

  sun4i:
   - add H616 support

  tidss:
   - add TI AM62L support
   - AM65x OLDI bridge support

  bochs:
   - drm panic support

  vkms:
   - YUV and R* format support
   - use faux device

  vmwgfx:
   - fence improvements

  hyperv:
   - move out of simple
   - add drm_panic support"

* tag 'drm-next-2025-07-30' of https://gitlab.freedesktop.org/drm/kernel: (1479 commits)
  drm/tidss: oldi: convert to devm_drm_bridge_alloc() API
  drm/tidss: encoder: convert to devm_drm_bridge_alloc()
  drm/amdgpu: move reset support type checks into the caller
  drm/amdgpu/sdma7: re-emit unprocessed state on ring reset
  drm/amdgpu/sdma6: re-emit unprocessed state on ring reset
  drm/amdgpu/sdma5.2: re-emit unprocessed state on ring reset
  drm/amdgpu/sdma5: re-emit unprocessed state on ring reset
  drm/amdgpu/gfx12: re-emit unprocessed state on ring reset
  drm/amdgpu/gfx11: re-emit unprocessed state on ring reset
  drm/amdgpu/gfx10: re-emit unprocessed state on ring reset
  drm/amdgpu/gfx9.4.3: re-emit unprocessed state on kcq reset
  drm/amdgpu/gfx9: re-emit unprocessed state on kcq reset
  drm/amdgpu: Add WARN_ON to the resource clear function
  drm/amd/pm: Use cached metrics data on SMUv13.0.6
  drm/amd/pm: Use cached data for min/max clocks
  gpu: nova-core: fix bounds check in PmuLookupTableEntry::new
  drm/amdgpu: Replace HQD terminology with slots naming
  drm/amdgpu: Add user queue instance count in HW IP info
  drm/amd/amdgpu: Add helper functions for isp buffers
  drm/amd/amdgpu: Initialize swnode for ISP MFD device
  ...
2025-07-30 19:26:49 -07:00
Linus Torvalds 63eb28bb14 ARM:
- Host driver for GICv5, the next generation interrupt controller for
   arm64, including support for interrupt routing, MSIs, interrupt
   translation and wired interrupts.
 
 - Use FEAT_GCIE_LEGACY on GICv5 systems to virtualize GICv3 VMs on
   GICv5 hardware, leveraging the legacy VGIC interface.
 
 - Userspace control of the 'nASSGIcap' GICv3 feature, allowing
   userspace to disable support for SGIs w/o an active state on hardware
   that previously advertised it unconditionally.
 
 - Map supporting endpoints with cacheable memory attributes on systems
   with FEAT_S2FWB and DIC where KVM no longer needs to perform cache
   maintenance on the address range.
 
 - Nested support for FEAT_RAS and FEAT_DoubleFault2, allowing the guest
   hypervisor to inject external aborts into an L2 VM and take traps of
   masked external aborts to the hypervisor.
 
 - Convert more system register sanitization to the config-driven
   implementation.
 
 - Fixes to the visibility of EL2 registers, namely making VGICv3 system
   registers accessible through the VGIC device instead of the ONE_REG
   vCPU ioctls.
 
 - Various cleanups and minor fixes.
 
 LoongArch:
 
 - Add stat information for in-kernel irqchip
 
 - Add tracepoints for CPUCFG and CSR emulation exits
 
 - Enhance in-kernel irqchip emulation
 
 - Various cleanups.
 
 RISC-V:
 
 - Enable ring-based dirty memory tracking
 
 - Improve perf kvm stat to report interrupt events
 
 - Delegate illegal instruction trap to VS-mode
 
 - MMU improvements related to upcoming nested virtualization
 
 s390x
 
 - Fixes
 
 x86:
 
 - Add CONFIG_KVM_IOAPIC for x86 to allow disabling support for I/O APIC,
   PIC, and PIT emulation at compile time.
 
 - Share device posted IRQ code between SVM and VMX and
   harden it against bugs and runtime errors.
 
 - Use vcpu_idx, not vcpu_id, for GA log tag/metadata, to make lookups O(1)
   instead of O(n).
 
 - For MMIO stale data mitigation, track whether or not a vCPU has access to
   (host) MMIO based on whether the page tables have MMIO pfns mapped; using
   VFIO is prone to false negatives
 
 - Rework the MSR interception code so that the SVM and VMX APIs are more or
   less identical.
 
 - Recalculate all MSR intercepts from scratch on MSR filter changes,
   instead of maintaining shadow bitmaps.
 
 - Advertise support for LKGS (Load Kernel GS base), a new instruction
   that's loosely related to FRED, but is supported and enumerated
   independently.
 
 - Fix a user-triggerable WARN that syzkaller found by setting the vCPU
   in INIT_RECEIVED state (aka wait-for-SIPI), and then putting the vCPU
   into VMX Root Mode (post-VMXON).  Trying to detect every possible path
   leading to architecturally forbidden states is hard and even risks
   breaking userspace (if it goes from valid to valid state but passes
   through invalid states), so just wait until KVM_RUN to detect that
   the vCPU state isn't allowed.
 
 - Add KVM_X86_DISABLE_EXITS_APERFMPERF to allow disabling interception of
   APERF/MPERF reads, so that a "properly" configured VM can access
   APERF/MPERF.  This has many caveats (APERF/MPERF cannot be zeroed
   on vCPU creation or saved/restored on suspend and resume, or preserved
   over thread migration let alone VM migration) but can be useful whenever
   you're interested in letting Linux guests see the effective physical CPU
   frequency in /proc/cpuinfo.
 
 - Reject KVM_SET_TSC_KHZ for vm file descriptors if vCPUs have been
   created, as there's no known use case for changing the default
   frequency for other VM types and it goes counter to the very reason
   why the ioctl was added to the vm file descriptor.  And also, there
   would be no way to make it work for confidential VMs with a "secure"
   TSC, so kill two birds with one stone.
 
 - Dynamically allocate the shadow MMU's hashed page list, and defer
   allocating the hashed list until it's actually needed (the TDP MMU
   doesn't use the list).
 
 - Extract many of KVM's helpers for accessing architectural local APIC
   state to common x86 so that they can be shared by guest-side code for
   Secure AVIC.
 
 - Various cleanups and fixes.
 
 x86 (Intel):
 
 - Preserve the host's DEBUGCTL.FREEZE_IN_SMM when running the guest.
   Failure to honor FREEZE_IN_SMM can leak host state into guests.
 
 - Explicitly check vmcs12.GUEST_DEBUGCTL on nested VM-Enter to prevent
   L1 from running L2 with features that KVM doesn't support, e.g. BTF.
 
 x86 (AMD):
 
 - WARN and reject loading kvm-amd.ko instead of panicking the kernel if the
   nested SVM MSRPM offsets tracker can't handle an MSR (which is pretty
   much a static condition and therefore should never happen, but still).
 
 - Fix a variety of flaws and bugs in the AVIC device posted IRQ code.
 
 - Inhibit AVIC if a vCPU's ID is too big (relative to what hardware
   supports) instead of rejecting vCPU creation.
 
 - Extend enable_ipiv module param support to SVM, by simply leaving
   IsRunning clear in the vCPU's physical ID table entry.
 
 - Disable IPI virtualization, via enable_ipiv, if the CPU is affected by
   erratum #1235, to allow (safely) enabling AVIC on such CPUs.
 
 - Request GA Log interrupts if and only if the target vCPU is blocking,
   i.e. only if KVM needs a notification in order to wake the vCPU.
 
 - Intercept SPEC_CTRL on AMD if the MSR shouldn't exist according to the
   vCPU's CPUID model.
 
 - Accept any SNP policy that is accepted by the firmware with respect to
   SMT and single-socket restrictions.  An incompatible policy doesn't put
   the kernel at risk in any way, so there's no reason for KVM to care.
 
 - Drop a superfluous WBINVD (on all CPUs!) when destroying a VM and
   use WBNOINVD instead of WBINVD when possible for SEV cache maintenance.
 
 - When reclaiming memory from an SEV guest, only do cache flushes on CPUs
   that have ever run a vCPU for the guest, i.e. don't flush the caches for
   CPUs that can't possibly have cache lines with dirty, encrypted data.
 
 Generic:
 
 - Rework irqbypass to track/match producers and consumers via an xarray
   instead of a linked list.  Using a linked list leads to O(n^2) insertion
   times, which is hugely problematic for use cases that create large
   numbers of VMs.  Such use cases typically don't actually use irqbypass,
   but eliminating the pointless registration is a future problem to
   solve as it likely requires new uAPI.
 
 - Track irqbypass's "token" as "struct eventfd_ctx *" instead of a "void *",
   to avoid making a simple concept unnecessarily difficult to understand.
 
 - Decouple device posted IRQs from VFIO device assignment, as binding a VM
   to a VFIO group is not a requirement for enabling device posted IRQs.
 
 - Clean up and document/comment the irqfd assignment code.
 
 - Disallow binding multiple irqfds to an eventfd with a priority waiter,
   i.e.  ensure an eventfd is bound to at most one irqfd through the entire
   host, and add a selftest to verify eventfd:irqfd bindings are globally
   unique.
 
 - Add a tracepoint for KVM_SET_MEMORY_ATTRIBUTES to help debug issues
   related to private <=> shared memory conversions.
 
 - Drop guest_memfd's .getattr() implementation as the VFS layer will call
   generic_fillattr() if inode_operations.getattr is NULL.
 
 - Fix issues with dirty ring harvesting where KVM doesn't bound the
   processing of entries in any way, which allows userspace to keep KVM
   in a tight loop indefinitely.
 
 - Kill off kvm_arch_{start,end}_assignment() and x86's associated tracking,
   now that KVM no longer uses assigned_device_count as a heuristic for
   either irqbypass usage or MDS mitigation.
 
 Selftests:
 
 - Fix a comment typo.
 
 - Verify KVM is loaded when getting any KVM module param so that attempting
   to run a selftest without kvm.ko loaded results in a SKIP message about
   KVM not being loaded/enabled (versus some random parameter not existing).
 
 - Skip tests that hit EACCES when attempting to access a file, and print
   a "Root required?" help message.  In most cases, the test just needs to
   be run with elevated permissions.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmiKXMgUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroMhMQf/QDhC/CP1aGXph2whuyeD2NMqPKiU
 9KdnDNST+ftPwjg9QxZ9mTaa8zeVz/wly6XlxD9OQHy+opM1wcys3k0GZAFFEEQm
 YrThgURdzEZ3nwJZgb+m0t4wjJQtpiFIBwAf7qq6z1VrqQBEmHXJ/8QxGuqO+BNC
 j5q/X+q6KZwehKI6lgFBrrOKWFaxqhnRAYfW6rGBxRXxzTJuna37fvDpodQnNceN
 zOiq+avfriUMArTXTqOteJNKU0229HjiPSnjILLnFQ+B3akBlwNG0jk7TMaAKR6q
 IZWG1EIS9q1BAkGXaw6DE1y6d/YwtXCR5qgAIkiGwaPt5yj9Oj6kRN2Ytw==
 =j2At
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:
 "ARM:

   - Host driver for GICv5, the next generation interrupt controller for
     arm64, including support for interrupt routing, MSIs, interrupt
     translation and wired interrupts

   - Use FEAT_GCIE_LEGACY on GICv5 systems to virtualize GICv3 VMs on
     GICv5 hardware, leveraging the legacy VGIC interface

   - Userspace control of the 'nASSGIcap' GICv3 feature, allowing
     userspace to disable support for SGIs w/o an active state on
     hardware that previously advertised it unconditionally

   - Map supporting endpoints with cacheable memory attributes on
     systems with FEAT_S2FWB and DIC where KVM no longer needs to
     perform cache maintenance on the address range

   - Nested support for FEAT_RAS and FEAT_DoubleFault2, allowing the
     guest hypervisor to inject external aborts into an L2 VM and take
     traps of masked external aborts to the hypervisor

   - Convert more system register sanitization to the config-driven
     implementation

   - Fixes to the visibility of EL2 registers, namely making VGICv3
     system registers accessible through the VGIC device instead of the
     ONE_REG vCPU ioctls

   - Various cleanups and minor fixes

  LoongArch:

   - Add stat information for in-kernel irqchip

   - Add tracepoints for CPUCFG and CSR emulation exits

   - Enhance in-kernel irqchip emulation

   - Various cleanups

  RISC-V:

   - Enable ring-based dirty memory tracking

   - Improve perf kvm stat to report interrupt events

   - Delegate illegal instruction trap to VS-mode

   - MMU improvements related to upcoming nested virtualization

  s390x

   - Fixes

  x86:

   - Add CONFIG_KVM_IOAPIC for x86 to allow disabling support for I/O
     APIC, PIC, and PIT emulation at compile time

   - Share device posted IRQ code between SVM and VMX and harden it
     against bugs and runtime errors

   - Use vcpu_idx, not vcpu_id, for GA log tag/metadata, to make lookups
     O(1) instead of O(n)

   - For MMIO stale data mitigation, track whether or not a vCPU has
     access to (host) MMIO based on whether the page tables have MMIO
     pfns mapped; using VFIO is prone to false negatives

   - Rework the MSR interception code so that the SVM and VMX APIs are
     more or less identical

   - Recalculate all MSR intercepts from scratch on MSR filter changes,
     instead of maintaining shadow bitmaps

   - Advertise support for LKGS (Load Kernel GS base), a new instruction
     that's loosely related to FRED, but is supported and enumerated
     independently

   - Fix a user-triggerable WARN that syzkaller found by setting the
     vCPU in INIT_RECEIVED state (aka wait-for-SIPI), and then putting
     the vCPU into VMX Root Mode (post-VMXON). Trying to detect every
     possible path leading to architecturally forbidden states is hard
     and even risks breaking userspace (if it goes from valid to valid
     state but passes through invalid states), so just wait until
     KVM_RUN to detect that the vCPU state isn't allowed

   - Add KVM_X86_DISABLE_EXITS_APERFMPERF to allow disabling
     interception of APERF/MPERF reads, so that a "properly" configured
     VM can access APERF/MPERF. This has many caveats (APERF/MPERF
     cannot be zeroed on vCPU creation or saved/restored on suspend and
     resume, or preserved over thread migration let alone VM migration)
     but can be useful whenever you're interested in letting Linux
     guests see the effective physical CPU frequency in /proc/cpuinfo

   - Reject KVM_SET_TSC_KHZ for vm file descriptors if vCPUs have been
     created, as there's no known use case for changing the default
     frequency for other VM types and it goes counter to the very reason
     why the ioctl was added to the vm file descriptor. And also, there
     would be no way to make it work for confidential VMs with a
     "secure" TSC, so kill two birds with one stone

   - Dynamically allocate the shadow MMU's hashed page list, and defer
     allocating the hashed list until it's actually needed (the TDP MMU
     doesn't use the list)

   - Extract many of KVM's helpers for accessing architectural local
     APIC state to common x86 so that they can be shared by guest-side
     code for Secure AVIC

   - Various cleanups and fixes

  x86 (Intel):

   - Preserve the host's DEBUGCTL.FREEZE_IN_SMM when running the guest.
     Failure to honor FREEZE_IN_SMM can leak host state into guests

   - Explicitly check vmcs12.GUEST_DEBUGCTL on nested VM-Enter to
     prevent L1 from running L2 with features that KVM doesn't support,
     e.g. BTF

  x86 (AMD):

   - WARN and reject loading kvm-amd.ko instead of panicking the kernel
     if the nested SVM MSRPM offsets tracker can't handle an MSR (which
     is pretty much a static condition and therefore should never
     happen, but still)

   - Fix a variety of flaws and bugs in the AVIC device posted IRQ code

   - Inhibit AVIC if a vCPU's ID is too big (relative to what hardware
     supports) instead of rejecting vCPU creation

   - Extend enable_ipiv module param support to SVM, by simply leaving
     IsRunning clear in the vCPU's physical ID table entry

   - Disable IPI virtualization, via enable_ipiv, if the CPU is affected
     by erratum #1235, to allow (safely) enabling AVIC on such CPUs

   - Request GA Log interrupts if and only if the target vCPU is
     blocking, i.e. only if KVM needs a notification in order to wake
     the vCPU

   - Intercept SPEC_CTRL on AMD if the MSR shouldn't exist according to
     the vCPU's CPUID model

   - Accept any SNP policy that is accepted by the firmware with respect
     to SMT and single-socket restrictions. An incompatible policy
     doesn't put the kernel at risk in any way, so there's no reason for
     KVM to care

   - Drop a superfluous WBINVD (on all CPUs!) when destroying a VM and
     use WBNOINVD instead of WBINVD when possible for SEV cache
     maintenance

   - When reclaiming memory from an SEV guest, only do cache flushes on
     CPUs that have ever run a vCPU for the guest, i.e. don't flush the
     caches for CPUs that can't possibly have cache lines with dirty,
     encrypted data

  Generic:

   - Rework irqbypass to track/match producers and consumers via an
     xarray instead of a linked list. Using a linked list leads to
     O(n^2) insertion times, which is hugely problematic for use cases
     that create large numbers of VMs. Such use cases typically don't
     actually use irqbypass, but eliminating the pointless registration
     is a future problem to solve as it likely requires new uAPI

   - Track irqbypass's "token" as "struct eventfd_ctx *" instead of a
     "void *", to avoid making a simple concept unnecessarily difficult
     to understand

   - Decouple device posted IRQs from VFIO device assignment, as binding
     a VM to a VFIO group is not a requirement for enabling device
     posted IRQs

   - Clean up and document/comment the irqfd assignment code

   - Disallow binding multiple irqfds to an eventfd with a priority
     waiter, i.e. ensure an eventfd is bound to at most one irqfd
     through the entire host, and add a selftest to verify eventfd:irqfd
     bindings are globally unique

   - Add a tracepoint for KVM_SET_MEMORY_ATTRIBUTES to help debug issues
     related to private <=> shared memory conversions

   - Drop guest_memfd's .getattr() implementation as the VFS layer will
     call generic_fillattr() if inode_operations.getattr is NULL

   - Fix issues with dirty ring harvesting where KVM doesn't bound the
     processing of entries in any way, which allows userspace to keep
     KVM in a tight loop indefinitely

   - Kill off kvm_arch_{start,end}_assignment() and x86's associated
     tracking, now that KVM no longer uses assigned_device_count as a
     heuristic for either irqbypass usage or MDS mitigation

  Selftests:

   - Fix a comment typo

   - Verify KVM is loaded when getting any KVM module param so that
     attempting to run a selftest without kvm.ko loaded results in a
     SKIP message about KVM not being loaded/enabled (versus some random
     parameter not existing)

   - Skip tests that hit EACCES when attempting to access a file, and
     print a "Root required?" help message. In most cases, the test just
     needs to be run with elevated permissions"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (340 commits)
  Documentation: KVM: Use unordered list for pre-init VGIC registers
  RISC-V: KVM: Avoid re-acquiring memslot in kvm_riscv_gstage_map()
  RISC-V: KVM: Use find_vma_intersection() to search for intersecting VMAs
  RISC-V: perf/kvm: Add reporting of interrupt events
  RISC-V: KVM: Enable ring-based dirty memory tracking
  RISC-V: KVM: Fix inclusion of Smnpm in the guest ISA bitmap
  RISC-V: KVM: Delegate illegal instruction fault to VS mode
  RISC-V: KVM: Pass VMID as parameter to kvm_riscv_hfence_xyz() APIs
  RISC-V: KVM: Factor-out g-stage page table management
  RISC-V: KVM: Add vmid field to struct kvm_riscv_hfence
  RISC-V: KVM: Introduce struct kvm_gstage_mapping
  RISC-V: KVM: Factor-out MMU related declarations into separate headers
  RISC-V: KVM: Use ncsr_xyz() in kvm_riscv_vcpu_trap_redirect()
  RISC-V: KVM: Implement kvm_arch_flush_remote_tlbs_range()
  RISC-V: KVM: Don't flush TLB when PTE is unchanged
  RISC-V: KVM: Replace KVM_REQ_HFENCE_GVMA_VMID_ALL with KVM_REQ_TLB_FLUSH
  RISC-V: KVM: Rename and move kvm_riscv_local_tlb_sanitize()
  RISC-V: KVM: Drop the return value of kvm_riscv_vcpu_aia_init()
  RISC-V: KVM: Check kvm_riscv_vcpu_alloc_vector_context() return value
  KVM: arm64: selftests: Add FEAT_RAS EL2 registers to get-reg-list
  ...
2025-07-30 17:14:01 -07:00
Linus Torvalds 2be6a7503d Remove or hide unused tracepoints
Tracepoints take up memory (around 5K per tracepoint) even when they are
 unused. Changes are being made to detect when a tracepoint is defined but
 unused, and to show a warning at build time. But those changes are not
 yet ready for inclusion.
 
 - Fix some of the unused tracepoints that it detected
 
   Some tracepoints were removed and others were hidden by config settings
   to match the config settings of where they are instantiated. Some
   tracepoints were moved into architecture specific code as only one
   architecture used them.
 
 - Call the ftrace_test_filter tracepoint in an unreachable if statement
 
   The ftrace_test_filter tracepoint is defined when the ftrace selftests
   are configured and is used to test the filter logic, but it is never
   actually called. It is put into an unreachable if statement so that it
   does not get compiled out, but also does not warn for being unused.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaIlYqxQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qisrAQD+pu2en9LAXLcgbFxQOwhbACpxOpmT
 3LiE2+MvDR3ckQD/Vyi31XebdRmj3leJ7ENf28oa155y1pyK/onrPgDHyQ4=
 =nFfn
 -----END PGP SIGNATURE-----

Merge tag 'trace-unused-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracepoint cleanup from Steven Rostedt:
 "Remove or hide unused tracepoints

  Tracepoints take up memory (around 5K per tracepoint) even when they
   are unused. Changes are being made to detect when a tracepoint is
   defined but unused, and to show a warning at build time. But those
   changes are not yet ready for inclusion.

   - Fix some of the unused tracepoints that it detected

     Some tracepoints were removed and others were hidden by config
     settings to match the config settings of where they are
     instantiated. Some tracepoints were moved into architecture
     specific code as only one architecture used them.

   - Call the ftrace_test_filter tracepoint in an unreachable if
     statement

     The ftrace_test_filter tracepoint is defined when the ftrace
     selftests are configured and is used to test the filter logic, but
     it is never actually called. It is put into an unreachable if
     statement so that it does not get compiled out, but also does not
     warn for being unused"

* tag 'trace-unused-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: sched: Hide numa events under CONFIG_NUMA_BALANCING
  powerpc/thp: tracing: Hide hugepage events under CONFIG_PPC_BOOK3S_64
  tracing: Call trace_ftrace_test_filter() for the event
  tracing: arm: arm64: Hide trace events ipi_raise, ipi_entry and ipi_exit
  binder: Remove unused binder lock events
  PM: tracing: Hide power_domain_target event under ARCH_OMAP2PLUS
  PM: tracing: Hide device_pm_callback events under PM_SLEEP
  PM: tracing: Hide psci_domain_idle events under ARM_PSCI_CPUIDLE
  PM: cpufreq: powernv/tracing: Move powernv_throttle trace event
  alarmtimer: Hide alarmtimer_suspend event when RTC_CLASS is not configured
  tracing, AER: Hide PCIe AER event when PCIEAER is not configured
2025-07-30 16:41:58 -07:00
Linus Torvalds 4ff261e725 Runtime verification changes for 6.17
- Added Linear temporal logic monitors for RT application
 
   Real-time applications may have design flaws causing them to have
   unexpected latency. For example, the applications may raise page faults, or
   may be blocked trying to take a mutex without priority inheritance.
 
   However, while attempting to implement DA monitors for these real-time
   rules, a deterministic automaton was found to be inappropriate as the
   specification language. The automaton is complicated, hard to understand,
   and error-prone.
 
   For these cases, linear temporal logic was found to be more suitable, as
   LTL is more concise and intuitive.
 
 - Make printk_deferred() public
 
   The new monitors needed access to printk_deferred(). Make them visible for
   the entire kernel.
 
 - Add a vpanic() to allow a va_list to be passed to panic().
 
 - Add rtapp container monitor.
 
   A collection of monitors that check for common problems with real-time
   applications that cause unexpected latency.
 
 - Add page fault tracepoints to risc-v
 
   These tracepoints are necessary for the RV monitor to run on risc-v.
 
 - Fix the behaviour of the rv tool with -s and idle tasks.
 
 - Allow the rv tool to gracefully terminate with SIGTERM
 
 - Adjusts dot2c not to create lines over 100 columns
 
 - Properly order nested monitors in the RV Kconfig file
 
 - Return the registration error in all DA monitor instead of 0
 
 - Update and add new sched collection monitors
 
   Replace tss and sncid monitors with more complete sts:
   Not only prove that switches occur in scheduling context and that
   scheduling needs interrupts disabled, but also that each call to the
   scheduler disables interrupts to (optionally) switch.
 
   New monitor: nrp
    Preemption requires need resched which is cleared by any switch
    (includes a non optimal workaround for /nested/ preemptions)
 
   New monitor: sssw
    suspension requires setting the task to sleepable and, after the
    switch occurs, the task requires a wakeup to come back to runnable
 
   New monitor: opid
    waking and need-resched operations occur with interrupts and
    preemption disabled or in IRQ without explicitly disabling preemption
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaIk8cBQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qi3DAQCFu6DM7uPSh94oggWlH2LukOYVGk2b
 CvGrqMFuefae7QD/aK9nCMfzaBehixMOMQHLHELEh527Hd+RwQCrlnLALQU=
 =r5HZ
 -----END PGP SIGNATURE-----

Merge tag 'trace-rv-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull runtime verification updates from Steven Rostedt:

 - Added Linear temporal logic monitors for RT application

   Real-time applications may have design flaws causing them to have
   unexpected latency. For example, the applications may raise page
   faults, or may be blocked trying to take a mutex without priority
   inheritance.

   However, while attempting to implement DA monitors for these
   real-time rules, a deterministic automaton was found to be
   inappropriate as the specification language. The automaton is
   complicated, hard to understand, and error-prone.

   For these cases, linear temporal logic was found to be more suitable,
   as LTL is more concise and intuitive.

 - Make printk_deferred() public

   The new monitors needed access to printk_deferred(). Make them
   visible for the entire kernel.

 - Add a vpanic() to allow a va_list to be passed to panic().

 - Add rtapp container monitor.

   A collection of monitors that check for common problems with
   real-time applications that cause unexpected latency.

 - Add page fault tracepoints to risc-v

   These tracepoints are necessary for the RV monitor to run on
   risc-v.

 - Fix the behaviour of the rv tool with -s and idle tasks.

 - Allow the rv tool to gracefully terminate with SIGTERM

 - Adjusts dot2c not to create lines over 100 columns

 - Properly order nested monitors in the RV Kconfig file

 - Return the registration error in all DA monitor instead of 0

 - Update and add new sched collection monitors

   Replace tss and sncid monitors with more complete sts:

   Not only prove that switches occur in scheduling context and that
   scheduling needs interrupts disabled, but also that each call to the
   scheduler disables interrupts to (optionally) switch.

   New monitor: nrp
     Preemption requires need resched which is cleared by any switch
     (includes a non optimal workaround for /nested/ preemptions)

   New monitor: sssw
     suspension requires setting the task to sleepable and, after the
     switch occurs, the task requires a wakeup to come back to runnable

   New monitor: opid
      waking and need-resched operations occur with interrupts and
      preemption disabled or in IRQ without explicitly disabling
      preemption"

* tag 'trace-rv-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (48 commits)
  rv: Add opid per-cpu monitor
  rv: Add nrp and sssw per-task monitors
  rv: Replace tss and sncid monitors with more complete sts
  sched: Adapt sched tracepoints for RV task model
  rv: Retry when da monitor detects race conditions
  rv: Adjust monitor dependencies
  rv: Use strings in da monitors tracepoints
  rv: Remove trailing whitespace from tracepoint string
  rv: Add da_handle_start_run_event_ to per-task monitors
  rv: Fix wrong type cast in reactors_show() and monitor_reactor_show()
  rv: Fix wrong type cast in monitors_show()
  rv: Remove struct rv_monitor::reacting
  rv: Remove rv_reactor's reference counter
  rv: Merge struct rv_reactor_def into struct rv_reactor
  rv: Merge struct rv_monitor_def into struct rv_monitor
  rv: Remove unused field in struct rv_monitor_def
  rv: Return init error when registering monitors
  verification/rvgen: Organise Kconfig entries for nested monitors
  tools/dot2c: Fix generated files going over 100 column limit
  tools/rv: Stop gracefully also on SIGTERM
  ...
2025-07-30 16:23:12 -07:00
Linus Torvalds d50b07d05c ring-buffer changes for v6.17:
- Rewind persistent ring buffer on boot
 
   When the persistent ring buffer is being used for live kernel tracing and
   the system crashes, the tool that is reading the trace may not have recorded
   the data when the system crashed. Although the persistent ring buffer still
   has that data, when reading it after a reboot, it will start where it left
   off. That is, what was read will not be accessible.
 
   Instead, on reboot, have the persistent ring buffer restart where the data
   starts and this will allow the tooling to recover what was lost when the
   crash occurred.
 
 - Remove the ring_buffer_read_prepare_sync() logic
 
   Reading the trace file required stopping writing to the ring buffer as the
   trace file is only an iterator and does not consume what it read. It was
   originally not safe to read the ring buffer in this mode and required
   disabling writing. The ring_buffer_read_prepare_sync() logic was used to
   stop each per_cpu ring buffer, call synchronize_rcu() and then start the
   iterator. This was used instead of calling synchronize_rcu() for each
   per_cpu buffer.
 
   Today, the iterator has been updated so that it is safe to read the trace
   file while writing to the ring buffer is still occurring. There is no more
   need for this synchronization, and it causes large delays on machines with
   many CPUs. Remove it.
 
 - Make static string array a constant in show_irq_str()
 
   Making the string array into a constant has been shown to decrease code
   text/data size.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaIkfURQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qnx4AQCNXOuKYJXDvXrkwf449agwrn0lCVyI
 vV0L65nyIrakpAD8COV/lw8DhlCpb/Lijlzzo5L0n9QpEElNpq5uEntNwgE=
 =1YIy
 -----END PGP SIGNATURE-----

Merge tag 'trace-ringbuffer-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull ring-buffer updates from Steven Rostedt:

 - Rewind persistent ring buffer on boot

   When the persistent ring buffer is being used for live kernel tracing
   and the system crashes, the tool that is reading the trace may not
   have recorded the data when the system crashed.

   Although the persistent ring buffer still has that data, when reading
   it after a reboot, it will start where it left off. That is, what was
   read will not be accessible.

   Instead, on reboot, have the persistent ring buffer restart where the
   data starts and this will allow the tooling to recover what was lost
   when the crash occurred.

 - Remove the ring_buffer_read_prepare_sync() logic

   Reading the trace file required stopping writing to the ring buffer
   as the trace file is only an iterator and does not consume what it
   read. It was originally not safe to read the ring buffer in this mode
   and required disabling writing. The ring_buffer_read_prepare_sync()
   logic was used to stop each per_cpu ring buffer, call
   synchronize_rcu() and then start the iterator. This was used instead
   of calling synchronize_rcu() for each per_cpu buffer.

   Today, the iterator has been updated so that it is safe to read the
   trace file while writing to the ring buffer is still occurring. There
   is no more need for this synchronization, and it causes large delays
   on machines with many CPUs. Remove it.

 - Make static string array a constant in show_irq_str()

   Making the string array into a constant has been shown to decrease code
   text/data size.

* tag 'trace-ringbuffer-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  ring-buffer: Make the const read-only 'type' static
  ring-buffer: Remove ring_buffer_read_prepare_sync()
  tracing: ring_buffer: Rewind persistent ring buffer on reboot
2025-07-30 16:16:58 -07:00
Linus Torvalds 90a871f74b ftrace changes for v6.17:
- Keep track of when fgraph_ops are registered or not
 
   Keep accounting of when fgraph_ops are registered as if a fgraph_ops is
   registered twice it can mess up the accounting and it will not work as
   expected later. Trigger a warning if something registers it twice as to
   catch bugs before they are found by things just not working as expected.
 
 - Make DYNAMIC_FTRACE always enabled for architectures that support it
 
   As static ftrace (where all functions are always traced) is very expensive
   and only exists to help architectures support ftrace, do not make it an
   option. As soon as an architecture supports DYNAMIC_FTRACE make it use it.
   This simplifies the code.
 
 - Remove redundant config HAVE_FTRACE_MCOUNT_RECORD
 
   The CONFIG_HAVE_FTRACE_MCOUNT_RECORD option was added to help simplify the
   DYNAMIC_FTRACE work, but now every architecture that implements
   DYNAMIC_FTRACE also has HAVE_FTRACE_MCOUNT_RECORD set too, making it
   redundant with HAVE_DYNAMIC_FTRACE.
 
 - Make pid_str string size match the comment

   In print_graph_proc() the pid_str string is of size 11, but the comment
   says /* sign + log10(MAX_INT) + '\0' */, which works out to 12.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaIkVkRQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qmdxAPsGcyT/gnyX/wf70cI63QoODrlRAd7M
 tg3R0J0H41U05QD/apttbA9GSdZ8bDLLSFAXTJgr8f4GvYvbUsmu2sMBBA8=
 =gd9V
 -----END PGP SIGNATURE-----

Merge tag 'ftrace-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull ftrace updates from Steven Rostedt:

 - Keep track of when fgraph_ops are registered or not

   Keep accounting of when fgraph_ops are registered as if a fgraph_ops
   is registered twice it can mess up the accounting and it will not
   work as expected later. Trigger a warning if something registers it
   twice as to catch bugs before they are found by things just not
   working as expected.

 - Make DYNAMIC_FTRACE always enabled for architectures that support it

   As static ftrace (where all functions are always traced) is very
   expensive and only exists to help architectures support ftrace, do
   not make it an option. As soon as an architecture supports
   DYNAMIC_FTRACE make it use it. This simplifies the code.

 - Remove redundant config HAVE_FTRACE_MCOUNT_RECORD

   The CONFIG_HAVE_FTRACE_MCOUNT_RECORD option was added to help simplify
   the DYNAMIC_FTRACE work, but now every architecture that implements
   DYNAMIC_FTRACE also has HAVE_FTRACE_MCOUNT_RECORD set too, making it
   redundant with HAVE_DYNAMIC_FTRACE.

 - Make pid_str string size match the comment

   In print_graph_proc() the pid_str string is of size 11, but the
   comment says /* sign + log10(MAX_INT) + '\0' */, which works out to
   12, as checked below.
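
   A worked check of that arithmetic (illustrative, not the kernel's
   code): INT_MAX is 2147483647, i.e. 10 digits, so holding any int as
   a string takes 1 (sign) + 10 (digits) + 1 ('\0') = 12 bytes.

	#include <limits.h>
	#include <stdio.h>

	int main(void)
	{
		char pid_str[12];	/* sign + log10(MAX_INT) + '\0' */

		/* worst case: INT_MIN = -2147483648 is 11 chars + NUL */
		int n = snprintf(pid_str, sizeof(pid_str), "%d", INT_MIN);

		printf("\"%s\" used %d chars plus the terminator\n", pid_str, n);
		return 0;
	}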

* tag 'ftrace-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Remove redundant config HAVE_FTRACE_MCOUNT_RECORD
  ftrace: Make DYNAMIC_FTRACE always enabled for architectures that support it
  fgraph: Keep track of when fgraph_ops are registered or not
  fgraph: Make pid_str size match the comment
2025-07-30 16:04:10 -07:00
Linus Torvalds b7dbc2e813 Probes updates for v6.17:
- Stack usage reduction for probe events:
    - Allocate string buffers from the heap for uprobe, eprobe, kprobe,
      and fprobe events to avoid stack overflow.
    - Allocate traceprobe_parse_context from the heap to prevent
      potential stack overflow.
    - Fix a typo in the above commit.
 
  - New features for eprobe and tprobe events:
    - Add support for arrays in eprobes.
    - Support multiple tprobes on the same tracepoint.
 
  - Improve efficiency:
    - Register fprobe-events only when it is enabled to reduce overhead.
    - Register tracepoints for tprobe events only when enabled to
      resolve a lock dependency.
 
  - Code Cleanup:
    - Add kerneldoc for traceprobe_parse_event_name() and __get_insn_slot().
    - Sort #include alphabetically in the probes code.
    - Remove the unused 'mod' field from the tprobe-event.
    - Clean up the entry-arg storing code in probe-events.
 
  - Selftest update
    - Enable fprobe events before checking enable_functions in selftests.
 -----BEGIN PGP SIGNATURE-----
 
 iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmiJ2DQbHG1hc2FtaS5o
 aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8bSfkH/06Zn5I55rU85FKSBQll
 FN4hipmef/9Nd13skDwpEuFyzLPNS4P1up/UBUuyDQUTlO74+t2zSFO2dpcNrWmu
 sPTenQ+6h82H3K591WTIC23VzF54syIbFLXEj8iMBALT3wyU4Nn0bs4DCbnTo5HX
 R3NVo77rk6wxNJoKYOtT6ALf/lHonuNlGF+KTUGWP8UbWsIY3fIp0RWWy572M0bt
 +YBE8D8RIVrw+ZY+vNKn1LdZdWlR1ton518XDf1gV9isTCfKErcd/6HJKwuj5q2v
 qMgwiaKK+Gne/ylAKmWLEg2oNDo7kpyfW+612oiECitgZkqxOXhyYYfWgRt1lFNp
 Wb8=
 =E+Z6
 -----END PGP SIGNATURE-----

Merge tag 'probes-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull probes updates from Masami Hiramatsu:
 "Stack usage reduction for probe events:
   - Allocate string buffers from the heap for uprobe, eprobe, kprobe,
     and fprobe events to avoid stack overflow (see the sketch after
     this summary)
   - Allocate traceprobe_parse_context from the heap to prevent
     potential stack overflow
   - Fix a typo in the above commit

  New features for eprobe and tprobe events:
   - Add support for arrays in eprobes
   - Support multiple tprobes on the same tracepoint

  Improve efficiency:
   - Register fprobe-events only when it is enabled to reduce overhead
   - Register tracepoints for tprobe events only when enabled to resolve
     a lock dependency

  Code Cleanup:
   - Add kerneldoc for traceprobe_parse_event_name() and
     __get_insn_slot()
   - Sort #include alphabetically in the probes code
   - Remove the unused 'mod' field from the tprobe-event
   - Clean up the entry-arg storing code in probe-events

  Selftest update
   - Enable fprobe events before checking enable_functions in selftests"
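
  A minimal sketch of the stack-usage fix from the first group above
  (assumed shape; the kernel code allocates with kmalloc(GFP_KERNEL) and
  does real parsing): a large on-stack scratch buffer becomes a heap
  allocation that is freed on every return path.

	#include <stdlib.h>
	#include <string.h>

	#define MAX_EVENT_NAME_LEN 256	/* illustrative size */

	int parse_event_name(const char *arg)
	{
		/* before: "char buf[MAX_EVENT_NAME_LEN];" cost 256 bytes
		 * of stack for every (possibly nested) invocation */
		char *buf = malloc(MAX_EVENT_NAME_LEN);

		if (!buf)
			return -1;

		strncpy(buf, arg, MAX_EVENT_NAME_LEN - 1);
		buf[MAX_EVENT_NAME_LEN - 1] = '\0';

		/* ... validate and parse buf here ... */

		free(buf);
		return 0;
	}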

* tag 'probes-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: trace_fprobe: Fix typo of the semicolon
  tracing: Have eprobes handle arrays
  tracing: probes: Add a kerneldoc for traceprobe_parse_event_name()
  tracing: uprobe-event: Allocate string buffers from heap
  tracing: eprobe-event: Allocate string buffers from heap
  tracing: kprobe-event: Allocate string buffers from heap
  tracing: fprobe-event: Allocate string buffers from heap
  tracing: probe: Allocate traceprobe_parse_context from heap
  tracing: probes: Sort #include alphabetically
  kprobes: Add missing kerneldoc for __get_insn_slot
  tracing: tprobe-events: Register tracepoint when enable tprobe event
  selftests: tracing: Enable fprobe events before checking enable_functions
  tracing: fprobe-events: Register fprobe-events only when it is enabled
  tracing: tprobe-events: Support multiple tprobes on the same tracepoint
  tracing: tprobe-events: Remove mod field from tprobe-event
  tracing: probe-events: Cleanup entry-arg storing code
2025-07-30 15:38:01 -07:00
Linus Torvalds a03eec7420 Probes fixes for v6.16:
- Fix a potential infinite recursion in fprobe by using preempt_*_notrace().
 -----BEGIN PGP SIGNATURE-----
 
 iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmiIdp4bHG1hc2FtaS5o
 aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8boY8IAMQGLspd1mATbCfnrQKY
 2X86OVygOJx7Iq1RCmOV6fhroe5EoNVR/b1RXmZJf2gIoN176zdBdYrBIFC97lYO
 J1XaU/Ns1McBuKrOjc3TSYYioVPHJrKLiZ1vAoCicTkUsS34MQJXbbAlfdn424pb
 J1wUeIDJF0WrFH9yVJ4mEs1dH81oCQ3iSG0CYx5/qLggcoubUFrVl4QessJwAuI6
 VM+cKDsqMCltBovXFw/fAgWfiQp79z/uq9umOFLdZGsesqutMYTMgJXBS6slKl3a
 qE2EQ57Op39A2zpk2hUoVoyv5Ey/XkfEjLU7WIMfqjLOL201IGQEKuyvR/mS54Kc
 HDw=
 =EeVm
 -----END PGP SIGNATURE-----

Merge tag 'probes-fixes-v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull probes fix from Masami Hiramatsu:

 - Fix a potential infinite recursion in fprobe by using preempt_*_notrace()

* tag 'probes-fixes-v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: fprobe: Fix infinite recursion using preempt_*_notrace()
2025-07-30 15:35:57 -07:00
Linus Torvalds 2db4df0c09 RCU pull request for v6.17
This pull request contains the following branches:
 
 rcu-exp.23.07.2025
 
   - Protect against early RCU exp quiescent state reporting during exp
     grace period initialization.
 
   - Remove superfluous barrier in task unblock path.
 
   - Remove the CPU online quiescent state report optimization, which is
     error prone for certain scenarios.
 
   - Add warning for unexpected pending requested expedited quiescent
     state on dying CPU.
 
 rcu.22.07.2025
 
   - Robustify rcu_is_cpu_rrupt_from_idle() by using more accurate
     indicators of the actual context tracking state of a CPU.
 
   - Handle ->defer_qs_iw_pending field data race.
 
   - Enable rcu_normal_wake_from_gp by default on systems with <= 16 CPUs.
 
   - Fix lockup in rcu_read_unlock() due to recursive irq_exit() calls.
 
   - Refactor expedited handling condition in rcu_read_unlock_special().
 
   - Documentation updates for hotplug and GP init scan ordering,
     separation of rcu_state and rnp's gp_seq states, quiescent state
     reporting for offline CPUs.
 
 torture-scripts.16.07.2025
 
   - Cleanup and improve scripts: remove superfluous warnings for disabled
     tests; better handling of kvm.sh --kconfig arg; suppress some confusing
     diagnostics; tolerate bad kvm.sh args; add new diagnostic for build
     output; fail allmodconfig testing on warnings.
 
   - Include RCU_TORTURE_TEST_CHK_RDR_STATE config for KCSAN kernels.
 
   - Disable default RCU-tasks and clocksource-wdog testing on arm64.
 
   - Add EXPERT Kconfig option for arm64 KCSAN runs.
 
   - Remove SRCU-lite testing.
 
 rcutorture.16.07.2025
 
   - Start torture writer thread creation after reader threads to handle a
     race in the SRCU-P scenario.
 
   - Add SRCU down_read()/up_read() test.
 
   - Add diagnostics for delayed SRCU up_read() and unmatched up_read();
     print the number of up/down readers and the number of such readers
     which migrated to another CPU.
 
   - Ignore certain unsupported configurations for trivial RCU test.
 
   - Fix splats in RT kernels due to inaccurate checks for BH-disabled
     context.
 
   - Enable checks and logs to capture intentionally exercised unexpected
     scenarios (too short readers) for BUSTED test.
 
   - Remove SRCU-lite testing.
 
 srcu.19.07.2025
 
   - Expedite SRCU-fast grace periods.
 
   - Remove SRCU-lite implementation.
 
   - Add guards for SRCU-fast readers.
 
 rcu.nocb.18.07.2025
 
   - Dump NOCB group leader state on stall detection.
 
   - Robustify nocb_cb_kthread pointer accesses.
 
   - Fix delayed execution of hurry callbacks when LAZY_RCU is enabled.
 
 refscale.07.07.2025
 
   - Fix multiplication overflow in "loops" and "nreaders" calculations.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSi2tPIQIc2VEtjarIAHS7/6Z0wpQUCaINnRwAKCRAAHS7/6Z0w
 pRYJAQC97ZDW2wBegDbQPsg5ECLX9Lyd6+IC65sdi38IENl+TQEA4/oMzUUceIH+
 CDCnxv3fAMhPncJfvIukOLzMJpKw0go=
 =8t4O
 -----END PGP SIGNATURE-----

Merge tag 'rcu.release.v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux

Pull RCU updates from Neeraj Upadhyay:
 "Expedited grace period updates:

   - Protect against early RCU exp quiescent state reporting during exp
     grace period initialization

   - Remove superfluous barrier in task unblock path

   - Remove the CPU online quiescent state report optimization, which is
     error prone for certain scenarios

   - Add warning for unexpected pending requested expedited quiescent
     state on dying CPU

  Core:

   - Robustify rcu_is_cpu_rrupt_from_idle() by using more accurate
     indicators of the actual context tracking state of a CPU

   - Handle ->defer_qs_iw_pending field data race

   - Enable rcu_normal_wake_from_gp by default on systems with <= 16
     CPUs

   - Fix lockup in rcu_read_unlock() due to recursive irq_exit() calls

   - Refactor expedited handling condition in rcu_read_unlock_special()

   - Documentation updates for hotplug and GP init scan ordering,
     separation of rcu_state and rnp's gp_seq states, quiescent state
     reporting for offline CPUs

  torture-scripts:

    - Cleanup and improve scripts: remove superfluous warnings for
      disabled tests; better handling of kvm.sh --kconfig arg; suppress
      some confusing diagnostics; tolerate bad kvm.sh args; add new
      diagnostic for build output; fail allmodconfig testing on warnings

   - Include RCU_TORTURE_TEST_CHK_RDR_STATE config for KCSAN kernels

   - Disable default RCU-tasks and clocksource-wdog testing on arm64

   - Add EXPERT Kconfig option for arm64 KCSAN runs

   - Remove SRCU-lite testing

  rcutorture:

    - Start torture writer thread creation after reader threads to
      handle a race in the SRCU-P scenario

   - Add SRCU down_read()/up_read() test

    - Add diagnostics for delayed SRCU up_read() and unmatched up_read();
      print the number of up/down readers and the number of such readers
      which migrated to another CPU

   - Ignore certain unsupported configurations for trivial RCU test

   - Fix splats in RT kernels due to inaccurate checks for BH-disabled
     context

   - Enable checks and logs to capture intentionally exercised
     unexpected scenarios (too short readers) for BUSTED test

   - Remove SRCU-lite testing

  srcu:

   - Expedite SRCU-fast grace periods

   - Remove SRCU-lite implementation

   - Add guards for SRCU-fast readers

  rcu nocb:

   - Dump NOCB group leader state on stall detection

   - Robustify nocb_cb_kthread pointer accesses

   - Fix delayed execution of hurry callbacks when LAZY_RCU is enabled

  refscale:

   - Fix multiplication overflow in "loops" and "nreaders" calculations"

* tag 'rcu.release.v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux: (49 commits)
  rcu: Document concurrent quiescent state reporting for offline CPUs
  rcu: Document separation of rcu_state and rnp's gp_seq
  rcu: Document GP init vs hotplug-scan ordering requirements
  srcu: Add guards for SRCU-fast readers
  rcu: Fix delayed execution of hurry callbacks
  rcu: Refactor expedited handling check in rcu_read_unlock_special()
  checkpatch: Remove SRCU-lite deprecation
  srcu: Remove SRCU-lite implementation
  srcu: Expedite SRCU-fast grace periods
  rcutorture: Remove support for SRCU-lite
  rcutorture: Remove SRCU-lite scenarios
  torture: Remove support for SRCU-lite
  torture: Make torture.sh --allmodconfig testing fail on warnings
  torture: Add "ERROR" diagnostic for testing kernel-build output
  torture: Make torture.sh tolerate runs having bad kvm.sh arguments
  torture: Add testid.txt file to --do-allmodconfig and --do-rcu-rust runs
  torture: Extract testid.txt generation to separate script
  torture: Suppress "find" diagnostics from torture.sh --do-none run
  torture: Provide EXPERT Kconfig option for arm64 KCSAN torture.sh runs
  rcu: Fix rcu_read_unlock() deadloop due to IRQ work
  ...
2025-07-30 11:01:41 -07:00
Linus Torvalds 7dff275c66 Kernel Concurrency Sanitizer (KCSAN) updates for v6.17
- A single fix to silence an uninitialized variable warning
 
 This change has had a few days of linux-next exposure.
 -----BEGIN PGP SIGNATURE-----
 
 iIcEABYKAC8WIQR7t4b/75lzOR3l5rcxsLN3bbyLnwUCaIdstREcZWx2ZXJAZ29v
 Z2xlLmNvbQAKCRAxsLN3bbyLnxfiAQCHSdHCyTOTP6YghSkd2ZIqfgQ8O9Y8iKGf
 EBfa6nvDVQD/bvUioqMpn/IgD6sbp76wbSOjmaJN19AGH8sfQIB13gI=
 =yGgW
 -----END PGP SIGNATURE-----

Merge tag 'kcsan-20250728-v6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/melver/linux

Pull Kernel Concurrency Sanitizer (KCSAN) update from Marco Elver:

 - A single fix to silence an uninitialized variable warning

* tag 'kcsan-20250728-v6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/melver/linux:
  kcsan: test: Initialize dummy variable
2025-07-30 11:00:28 -07:00
Paolo Bonzini 196d9e72c4 RCU wakeup fix for KVM s390 guest entry
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEwGNS88vfc9+v45Yq41TmuOI4ufgFAmiJzvQACgkQ41TmuOI4
 ufhSCw/7BTDYOZD24NkBEa8449tJRl4NV4nBrNQdj4E7jhvbs7+Q/zR5opZKwU/g
 vksnH9vW+YPuA01rplBWjdDk863q1oqnLvJ9lgh5KJVIvDHWf0EPyMqOnm5Y9KdP
 oALh/prtjek6B6rWA4PsC/OKXtx/w0zn4HulWr9LliUvJsmmsLkOTDvXB1bJeld9
 6Yi1AZ4MtqsxzLnKZVKFDfJKPWyJArzcIU0xyV5Rr62FtIIU/0WVGyTdwMj+DtIw
 +XaI4KSgyymyxChNn5dtV4JlNA9gi5oTggDSSglMWKw8oHeSgdvFtFD/05+txKLr
 4Veo1LAtug6iwmRBNPuPiPn1z5LBXpNMp0prwnpFsUuOaib+C1J3lkD+aCsmGPAG
 3f0Q9B6dM+m1MgabgzQHjeJVeVL6xMLxMHfVjXEVt2xq9lR4rskjvCr78aaL7UvF
 6l+wrpcdiayXkSkQKawdJUcBYS2TorQrc0Kn5XL/pD5qv5gu0BU26kvXRhe+xyyJ
 7WP8R/4uZZTLcIEZJWO9QQU3KWvnT+bKOOaPRX34SNXCxtJIhIHYw1ON6IEk/gfs
 P5WG34TmM4RtQuSLSpQpvG/K+hNU6gFl+5s+YUL4OhxDRCaDdLY/UlYwx6tXuGJ6
 D5R4F8HxLb5d32p/EOSSuJbVtYt6/hHT1m8wWACGb5SCe/gcNs4=
 =+mwu
 -----END PGP SIGNATURE-----

Merge tag 'kvm-s390-next-6.17-1' of https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD

RCU wakeup fix for KVM s390 guest entry
2025-07-30 13:56:09 -04:00
Linus Torvalds d9104cec3e bpf-next-6.17
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmiINnEACgkQ6rmadz2v
 bToBnA/9F+A3R6rTwGk4HK3xpfc/nm2Tanl3oRN7S2ub/mskDOtWSIyG6cVFZ0UG
 1fK6IkByyRIpAF/5qhdlw8drRXHkQtGLA0lP2L9llm4X1mHLofB18y9OeLrDE1WN
 KwNP06+IGX9W802lCGSIXOY+VmRscVfXSMokyQt2ilHplKjOnDqJcYkWupi3T2rC
 mz79FY9aEl2YrIcpj9RXz+8nwP49pZBuW2P0IM5PAIj4BJBXShrUp8T1nz94okNe
 NFsnAyRxjWpUT0McEgtA9WvpD9lZqujYD8Qp0KlGZWmI3vNpV5d9S1+dBcEb1n7q
 dyNMkTF3oRrJhhg4VqoHc6fVpzSEoZ9ZxV5Hx4cs+ganH75D4YbdGqx/7mR3DUgH
 MZh6rHF1pGnK7TAm7h5gl3ZRAOkZOaahbe1i01NKo9CEe5fSh3AqMyzJYoyGHRKi
 xDN39eQdWBNA+hm1VkbK2Bv93Rbjrka2Kj+D3sSSO9Bo/u3ntcknr7LW39idKz62
 Q8dkKHcCEtun7gjk0YXPF013y81nEohj1C+52BmJ2l5JitM57xfr6YOaQpu7DPDE
 AJbHx6ASxKdyEETecd0b+cXUPQ349zmRXy0+CDMAGKpBicC0H0mHhL14cwOY1Hfu
 EIpIjmIJGI3JNF6T5kybcQGSBOYebdV0FFgwSllzPvuYt7YsHCs=
 =/O3j
 -----END PGP SIGNATURE-----

Merge tag 'bpf-next-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Pull bpf updates from Alexei Starovoitov:

 - Remove usermode driver (UMD) framework (Thomas Weißschuh)

 - Introduce Strongly Connected Component (SCC) in the verifier to
   detect loops and refine register liveness (Eduard Zingerman)

 - Allow 'void *' cast using bpf_rdonly_cast() and corresponding
   '__arg_untrusted' for global function parameters (Eduard Zingerman)

 - Improve precision for BPF_ADD and BPF_SUB operations in the verifier
   (Harishankar Vishwanathan)

 - Teach the verifier that a constant pointer to a map cannot be NULL
   (Ihor Solodrai; see the sketch after this list)

 - Introduce BPF streams for error reporting of various conditions
   detected by BPF runtime (Kumar Kartikeya Dwivedi)

 - Teach the verifier to insert runtime speculation barrier (lfence on
   x86) to mitigate speculative execution instead of rejecting the
   programs (Luis Gerhorst)

 - Various improvements for 'veristat' (Mykyta Yatsenko)

 - For CONFIG_DEBUG_KERNEL config warn on internal verifier errors to
   improve bug detection by syzbot (Paul Chaignon)

 - Support BPF private stack on arm64 (Puranjay Mohan)

 - Introduce bpf_cgroup_read_xattr() kfunc to read xattr of cgroup's
   node (Song Liu)

 - Introduce kfuncs for read-only string operations (Viktor Malik)

 - Implement show_fdinfo() for bpf_links (Tao Chen)

 - Reduce verifier's stack consumption (Yonghong Song)

 - Implement mprog API for cgroup-bpf programs (Yonghong Song)
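
   A hedged BPF C sketch of the "constant pointer to a map cannot be
   NULL" item (illustrative program using standard libbpf conventions):
   a map's address is a load-time constant, so the verifier can now
   treat the NULL check below as dead code instead of walking both
   branches.

	#include <linux/bpf.h>
	#include <bpf/bpf_helpers.h>

	struct {
		__uint(type, BPF_MAP_TYPE_ARRAY);
		__uint(max_entries, 1);
		__type(key, __u32);
		__type(value, __u64);
	} counters SEC(".maps");

	SEC("tracepoint/syscalls/sys_enter_getpid")
	int count_getpid(void *ctx)
	{
		void *map = &counters;	/* constant, known non-NULL */
		__u32 key = 0;
		__u64 *val;

		if (!map)		/* provably dead to the verifier */
			return 0;

		val = bpf_map_lookup_elem(map, &key);
		if (val)
			__sync_fetch_and_add(val, 1);
		return 0;
	}

	char _license[] SEC("license") = "GPL";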

* tag 'bpf-next-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (192 commits)
  selftests/bpf: Migrate fexit_noreturns case into tracing_failure test suite
  selftests/bpf: Add selftest for attaching tracing programs to functions in deny list
  bpf: Add log for attaching tracing programs to functions in deny list
  bpf: Show precise rejected function when attaching fexit/fmod_ret to __noreturn functions
  bpf: Fix various typos in verifier.c comments
  bpf: Add third round of bounds deduction
  selftests/bpf: Test invariants on JSLT crossing sign
  selftests/bpf: Test cross-sign 64bits range refinement
  selftests/bpf: Update reg_bound range refinement logic
  bpf: Improve bounds when s64 crosses sign boundary
  bpf: Simplify bounds refinement from s32
  selftests/bpf: Enable private stack tests for arm64
  bpf, arm64: JIT support for private stack
  bpf: Move bpf_jit_get_prog_name() to core.c
  bpf, arm64: Fix fp initialization for exception boundary
  umd: Remove usermode driver framework
  bpf/preload: Don't select USERMODE_DRIVER
  selftests/bpf: Fix test dynptr/test_dynptr_memset_xdp_chunks failure
  selftests/bpf: Fix test dynptr/test_dynptr_copy_xdp failure
  selftests/bpf: Increase xdp data size for arm64 64K page size
  ...
2025-07-30 09:58:50 -07:00
Linus Torvalds 8be4d31cb8 Networking changes for 6.17.
Core & protocols
 ----------------
 
  - Wrap datapath globals into net_aligned_data, to avoid false sharing.
 
  - Preserve MSG_ZEROCOPY in forwarding (e.g. out of a container).
 
  - Add SO_INQ and SCM_INQ support to AF_UNIX.
 
  - Add SIOCINQ support to AF_VSOCK.
 
  - Add TCP_MAXSEG sockopt to MPTCP.
 
  - Add IPv6 force_forwarding sysctl to enable forwarding per interface.
 
  - Make TCP validation of whether a packet fully fits in the receive
    window and the rcv_buf more strict. With increased use of HW
    aggregation, a single "packet" can be multiple 100s of kB.
 
  - Add MSG_MORE flag to optimize large TCP transmissions via sockmap;
    improves latency by up to 33% for sockmap users.

  - Convert TCP send queue handling from tasklet to BH workqueue.
 
  - Improve BPF iteration over TCP sockets to see each socket exactly once.
 
  - Remove obsolete and unused TCP RFC3517/RFC6675 loss recovery code.
 
  - Support enabling kernel threads for NAPI processing on a per-NAPI
    instance basis rather than for a whole device. Fully stop the kernel
    NAPI thread when threaded NAPI gets disabled. Previously the thread
    would stick around until ifdown due to tricky synchronization.
 
  - Allow multicast routing to take effect on locally-generated packets.
 
  - Add output interface argument for End.X in segment routing.
 
  - MCTP: add support for gateway routing, improve bind() handling.
 
  - Don't require rtnl_lock when fetching an IPv6 neighbor over Netlink.
 
  - Add a new neighbor flag ("extern_valid"), which cedes refresh
    responsibilities to userspace. This is needed for EVPN multi-homing
    where a neighbor entry for a multi-homed host needs to be synced
    across all the VTEPs among which the host is multi-homed.
 
  - Support NUD_PERMANENT for proxy neighbor entries.
 
  - Add a new queuing discipline for IETF RFC9332 DualQ Coupled AQM.
 
  - Add sequence numbers to netconsole messages. Unregister netconsole's
    console when all net targets are removed. Code refactoring.
    Add a number of selftests.
 
  - Align IPSec inbound SA lookup to RFC 4301. Only SPI and protocol
    should be used for an inbound SA lookup.
 
  - Support inspecting ref_tracker state via DebugFS.
 
  - Don't force bonding advertisement frames tx to ~333 ms boundaries.
    Add broadcast_neighbor option to send ARP/ND on all bonded links.
 
  - Allow providing upcall pid for the 'execute' command in openvswitch.
 
  - Remove DCCP support from Netfilter's conntrack.
 
  - Disallow multiple packet duplications in the queuing layer.
 
  - Prevent use of deprecated iptables code on PREEMPT_RT.
 
 Driver API
 ----------
 
  - Support RSS and hashing configuration over ethtool Netlink.
 
  - Add dedicated ethtool callbacks for getting and setting hashing fields.
 
  - Add support for power budget evaluation strategy in PSE /
    Power-over-Ethernet. Generate Netlink events for overcurrent etc.
 
  - Support DPLL phase offset monitoring across all device inputs.
    Support providing clock reference and SYNC over separate DPLL
    inputs.
 
  - Support traffic classes in devlink rate API for bandwidth management.
 
  - Remove rtnl_lock dependency from UDP tunnel port configuration.
 
 Device drivers
 --------------
 
  - Add a new Broadcom driver for 800G Ethernet (bnge).
 
  - Add a standalone driver for Microchip ZL3073x DPLL.
 
  - Remove IBM's NETIUCV device driver.
 
  - Ethernet high-speed NICs:
    - Broadcom (bnxt):
     - support zero-copy Tx of DMABUF memory
     - take page size into account for page pool recycling rings
    - Intel (100G, ice, idpf):
      - idpf: XDP and AF_XDP support preparations
      - idpf: add flow steering
      - add link_down_events statistic
      - clean up the TSPLL code
      - preparations for live VM migration
    - nVidia/Mellanox:
     - support zero-copy Rx/Tx interfaces (DMABUF and io_uring)
     - optimize context memory usage for matchers
     - expose serial numbers in devlink info
     - support PCIe congestion metrics
    - Meta (fbnic):
      - add 25G, 50G, and 100G link modes to phylink
      - support dumping FW logs
    - Marvell/Cavium:
      - support for CN20K generation of the Octeon chips
    - Amazon:
      - add HW clock (without timestamping, just hypervisor time access)
 
  - Ethernet virtual:
    - VirtIO net:
      - support segmentation of UDP-tunnel-encapsulated packets
    - Google (gve):
      - support packet timestamping and clock synchronization
    - Microsoft vNIC:
      - add handler for device-originated servicing events
      - allow dynamic MSI-X vector allocation
      - support Tx bandwidth clamping
 
  - Ethernet NICs consumer, and embedded:
    - AMD:
      - amd-xgbe: hardware timestamping and PTP clock support
    - Broadcom integrated MACs (bcmgenet, bcmasp):
      - use napi_complete_done() return value to support NAPI polling
      - add support for re-starting auto-negotiation
    - Broadcom switches (b53):
      - support BCM5325 switches
      - add bcm63xx EPHY power control
    - Synopsys (stmmac):
      - lots of code refactoring and cleanups
    - TI:
      - icssg-prueth: read firmware-names from device tree
      - icssg: PRP offload support
    - Microchip:
      - lan78xx: convert to PHYLINK for improved PHY and MAC management
      - ksz: add KSZ8463 switch support
    - Intel:
      - support similar queue priority scheme in multi-queue and
        time-sensitive networking (taprio)
      - support packet pre-emption in both
    - RealTek (r8169):
      - enable EEE at 5Gbps on RTL8126
    - Airoha:
      - add PPPoE offload support
      - MDIO bus controller for Airoha AN7583
 
  - Ethernet PHYs:
    - support for the IPQ5018 internal GE PHY
    - micrel KSZ9477 switch-integrated PHYs:
      - add MDI/MDI-X control support
      - add RX error counters
      - add cable test support
      - add Signal Quality Indicator (SQI) reporting
    - dp83tg720: improve reset handling and reduce link recovery time
    - support bcm54811 (and its MII-Lite interface type)
    - air_en8811h: support resume/suspend
    - support PHY counters for QCA807x and QCA808x
    - support WoL for QCA807x
 
  - CAN drivers:
    - rcar_canfd: support for Transceiver Delay Compensation
    - kvaser: report FW versions via devlink dev info
 
  - WiFi:
    - extended regulatory info support (6 GHz)
    - add statistics and beacon monitor for Multi-Link Operation (MLO)
    - support S1G aggregation, improve S1G support
    - add Radio Measurement action fields
    - support per-radio RTS threshold
    - some work around how FIPS affects wifi, which was wrong (RC4 is used
      by TKIP, not only WEP)
    - improvements for unsolicited probe response handling
 
  - WiFi drivers:
    - RealTek (rtw88):
      - IBSS mode for SDIO devices
    - RealTek (rtw89):
      - BT coexistence for MLO/WiFi7
      - concurrent station + P2P support
      - support for USB devices RTL8851BU/RTL8852BU
    - Intel (iwlwifi):
      - use embedded PNVM in (to be released) FW images to fix
        compatibility issues
      - many cleanups (unused FW APIs, PCIe code, WoWLAN)
      - some FIPS interoperability
    - MediaTek (mt76):
      - firmware recovery improvements
      - more MLO work
    - Qualcomm/Atheros (ath12k):
      - fix scan on multi-radio devices
      - more EHT/Wi-Fi 7 features
      - encapsulation/decapsulation offload
    - Broadcom (brcm80211):
      - support SDIO 43751 device
 
  - Bluetooth:
    - hci_event: add support for handling LE BIG Sync Lost event
    - ISO: add socket option to report packet seqnum via CMSG
    - ISO: support SCM_TIMESTAMPING for ISO TS
 
  - Bluetooth drivers:
    - intel_pcie: support Function Level Reset
    - nxpuart: add support for 4M baudrate
    - nxpuart: implement powerup sequence, reset, FW dump, and FW loading
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmiFgLgACgkQMUZtbf5S
 IrvafxAAnQRwYBoIG+piCILx6z5pRvBGHkmEQ4AQgSCFuq2eO3ubwMFIqEybfma1
 5+QFjUZAV3OgGgKRBS2KGWxtSzdiF+/JGV1VOIN67sX3Mm0a2QgjA4n5CgKL0FPr
 o6BEzjX5XwG1zvGcBNQ5BZ19xUUKjoZQgTtnea8sZ57Fsp5RtRgmYRqoewNvNk/n
 uImh0NFsDVb0UeOpSzC34VD9l1dJvLGdui4zJAjno/vpvmT1DkXjoK419J/r52SS
 X+5WgsfJ6DkjHqVN1tIhhK34yWqBOcwGFZJgEnWHMkFIl2FqRfFKMHyqtfLlVnLA
 mnIpSyz8Sq2AHtx0TlgZ3At/Ri8p5+yYJgHOXcDKyABa8y8Zf4wrycmr6cV9JLuL
 z54nLEVnJuvfDVDVJjsLYdJXyhMpZFq6+uAItdxKaw8Ugp/QqG4QtoRj+XIHz4ZW
 z6OohkCiCzTwEISFK+pSTxPS30eOxq43kCspcvuLiwCCStJBRkRb5GdZA4dm7LA+
 1Od4ADAkHjyrFtBqTyyC2scX8UJ33DlAIpAYyIeS6w9Cj9EXxtp1z33IAAAZ03MW
 jJwIaJuc8bK2fWKMmiG7ucIXjPo4t//KiWlpkwwqLhPbjZgfDAcxq1AC2TLoqHBL
 y4EOgKpHDCMAghSyiFIAn2JprGcEt8dp+11B0JRXIn4Pm/eYDH8=
 =lqbe
 -----END PGP SIGNATURE-----

Merge tag 'net-next-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Jakub Kicinski:
 "Core & protocols:

   - Wrap datapath globals into net_aligned_data, to avoid false sharing

   - Preserve MSG_ZEROCOPY in forwarding (e.g. out of a container)

   - Add SO_INQ and SCM_INQ support to AF_UNIX

    - Add SIOCINQ support to AF_VSOCK (see the sketch after this summary)

   - Add TCP_MAXSEG sockopt to MPTCP

   - Add IPv6 force_forwarding sysctl to enable forwarding per interface

    - Make TCP validation of whether a packet fully fits in the receive
      window and the rcv_buf more strict. With increased use of HW
      aggregation, a single "packet" can be multiple 100s of kB

    - Add MSG_MORE flag to optimize large TCP transmissions via sockmap;
      improves latency by up to 33% for sockmap users

    - Convert TCP send queue handling from tasklet to BH workqueue

   - Improve BPF iteration over TCP sockets to see each socket exactly
     once

   - Remove obsolete and unused TCP RFC3517/RFC6675 loss recovery code

    - Support enabling kernel threads for NAPI processing on a per-NAPI
      instance basis rather than for a whole device. Fully stop the
      kernel NAPI thread when threaded NAPI gets disabled. Previously
      the thread would stick around until ifdown due to tricky
      synchronization

   - Allow multicast routing to take effect on locally-generated packets

   - Add output interface argument for End.X in segment routing

   - MCTP: add support for gateway routing, improve bind() handling

   - Don't require rtnl_lock when fetching an IPv6 neighbor over Netlink

   - Add a new neighbor flag ("extern_valid"), which cedes refresh
     responsibilities to userspace. This is needed for EVPN multi-homing
     where a neighbor entry for a multi-homed host needs to be synced
     across all the VTEPs among which the host is multi-homed

   - Support NUD_PERMANENT for proxy neighbor entries

   - Add a new queuing discipline for IETF RFC9332 DualQ Coupled AQM

   - Add sequence numbers to netconsole messages. Unregister
     netconsole's console when all net targets are removed. Code
     refactoring. Add a number of selftests

   - Align IPSec inbound SA lookup to RFC 4301. Only SPI and protocol
     should be used for an inbound SA lookup

   - Support inspecting ref_tracker state via DebugFS

   - Don't force bonding advertisement frames tx to ~333 ms boundaries.
     Add broadcast_neighbor option to send ARP/ND on all bonded links

   - Allow providing upcall pid for the 'execute' command in openvswitch

   - Remove DCCP support from Netfilter's conntrack

   - Disallow multiple packet duplications in the queuing layer

   - Prevent use of deprecated iptables code on PREEMPT_RT

  Driver API:

   - Support RSS and hashing configuration over ethtool Netlink

   - Add dedicated ethtool callbacks for getting and setting hashing
     fields

   - Add support for power budget evaluation strategy in PSE /
     Power-over-Ethernet. Generate Netlink events for overcurrent etc

   - Support DPLL phase offset monitoring across all device inputs.
     Support providing clock reference and SYNC over separate DPLL
     inputs

   - Support traffic classes in devlink rate API for bandwidth
     management

   - Remove rtnl_lock dependency from UDP tunnel port configuration

  Device drivers:

   - Add a new Broadcom driver for 800G Ethernet (bnge)

   - Add a standalone driver for Microchip ZL3073x DPLL

   - Remove IBM's NETIUCV device driver

   - Ethernet high-speed NICs:
      - Broadcom (bnxt):
         - support zero-copy Tx of DMABUF memory
         - take page size into account for page pool recycling rings
      - Intel (100G, ice, idpf):
         - idpf: XDP and AF_XDP support preparations
         - idpf: add flow steering
         - add link_down_events statistic
         - clean up the TSPLL code
         - preparations for live VM migration
      - nVidia/Mellanox:
         - support zero-copy Rx/Tx interfaces (DMABUF and io_uring)
         - optimize context memory usage for matchers
         - expose serial numbers in devlink info
         - support PCIe congestion metrics
      - Meta (fbnic):
         - add 25G, 50G, and 100G link modes to phylink
         - support dumping FW logs
      - Marvell/Cavium:
         - support for CN20K generation of the Octeon chips
      - Amazon:
         - add HW clock (without timestamping, just hypervisor time access)

   - Ethernet virtual:
      - VirtIO net:
         - support segmentation of UDP-tunnel-encapsulated packets
      - Google (gve):
         - support packet timestamping and clock synchronization
      - Microsoft vNIC:
         - add handler for device-originated servicing events
         - allow dynamic MSI-X vector allocation
         - support Tx bandwidth clamping

   - Ethernet NICs consumer, and embedded:
      - AMD:
         - amd-xgbe: hardware timestamping and PTP clock support
      - Broadcom integrated MACs (bcmgenet, bcmasp):
         - use napi_complete_done() return value to support NAPI polling
         - add support for re-starting auto-negotiation
      - Broadcom switches (b53):
         - support BCM5325 switches
         - add bcm63xx EPHY power control
      - Synopsys (stmmac):
         - lots of code refactoring and cleanups
      - TI:
         - icssg-prueth: read firmware-names from device tree
         - icssg: PRP offload support
      - Microchip:
         - lan78xx: convert to PHYLINK for improved PHY and MAC management
         - ksz: add KSZ8463 switch support
      - Intel:
         - support similar queue priority scheme in multi-queue and
           time-sensitive networking (taprio)
         - support packet pre-emption in both
      - RealTek (r8169):
         - enable EEE at 5Gbps on RTL8126
      - Airoha:
         - add PPPoE offload support
         - MDIO bus controller for Airoha AN7583

   - Ethernet PHYs:
      - support for the IPQ5018 internal GE PHY
      - micrel KSZ9477 switch-integrated PHYs:
         - add MDI/MDI-X control support
         - add RX error counters
         - add cable test support
         - add Signal Quality Indicator (SQI) reporting
      - dp83tg720: improve reset handling and reduce link recovery time
      - support bcm54811 (and its MII-Lite interface type)
      - air_en8811h: support resume/suspend
      - support PHY counters for QCA807x and QCA808x
      - support WoL for QCA807x

   - CAN drivers:
      - rcar_canfd: support for Transceiver Delay Compensation
      - kvaser: report FW versions via devlink dev info

   - WiFi:
      - extended regulatory info support (6 GHz)
      - add statistics and beacon monitor for Multi-Link Operation (MLO)
      - support S1G aggregation, improve S1G support
      - add Radio Measurement action fields
      - support per-radio RTS threshold
      - some work around how FIPS affects wifi, which was wrong (RC4 is
        used by TKIP, not only WEP)
      - improvements for unsolicited probe response handling

   - WiFi drivers:
      - RealTek (rtw88):
         - IBSS mode for SDIO devices
      - RealTek (rtw89):
         - BT coexistence for MLO/WiFi7
         - concurrent station + P2P support
         - support for USB devices RTL8851BU/RTL8852BU
      - Intel (iwlwifi):
         - use embedded PNVM in (to be released) FW images to fix
           compatibility issues
         - many cleanups (unused FW APIs, PCIe code, WoWLAN)
         - some FIPS interoperability
      - MediaTek (mt76):
         - firmware recovery improvements
         - more MLO work
      - Qualcomm/Atheros (ath12k):
         - fix scan on multi-radio devices
         - more EHT/Wi-Fi 7 features
         - encapsulation/decapsulation offload
      - Broadcom (brcm80211):
         - support SDIO 43751 device

   - Bluetooth:
      - hci_event: add support for handling LE BIG Sync Lost event
      - ISO: add socket option to report packet seqnum via CMSG
      - ISO: support SCM_TIMESTAMPING for ISO TS

   - Bluetooth drivers:
      - intel_pcie: support Function Level Reset
      - nxpuart: add support for 4M baudrate
      - nxpuart: implement powerup sequence, reset, FW dump, and FW loading"
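
 A small userspace sketch of the AF_VSOCK SIOCINQ support noted above
 (illustrative; connection setup and error handling trimmed). SIOCINQ
 reports the number of unread bytes sitting in a socket's receive queue.

	#include <stdio.h>
	#include <sys/ioctl.h>
	#include <sys/socket.h>
	#include <linux/sockios.h>	/* SIOCINQ */

	int main(void)
	{
		int fd = socket(AF_VSOCK, SOCK_STREAM, 0);
		int unread = 0;

		/* ... connect() and receive some data here ... */

		if (fd >= 0 && ioctl(fd, SIOCINQ, &unread) == 0)
			printf("%d bytes waiting in the receive queue\n",
			       unread);
		return 0;
	}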

* tag 'net-next-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1742 commits)
  dpll: zl3073x: Fix build failure
  selftests: bpf: fix legacy netfilter options
  ipv6: annotate data-races around rt->fib6_nsiblings
  ipv6: fix possible infinite loop in fib6_info_uses_dev()
  ipv6: prevent infinite loop in rt6_nlmsg_size()
  ipv6: add a retry logic in net6_rt_notify()
  vrf: Drop existing dst reference in vrf_ip6_input_dst
  net/sched: taprio: align entry index attr validation with mqprio
  net: fsl_pq_mdio: use dev_err_probe
  selftests: rtnetlink.sh: remove esp4_offload after test
  vsock: remove unnecessary null check in vsock_getname()
  igb: xsk: solve negative overflow of nb_pkts in zerocopy mode
  stmmac: xsk: fix negative overflow of budget in zerocopy mode
  dt-bindings: ieee802154: Convert at86rf230.txt yaml format
  net: dsa: microchip: Disable PTP function of KSZ8463
  net: dsa: microchip: Setup fiber ports for KSZ8463
  net: dsa: microchip: Write switch MAC address differently for KSZ8463
  net: dsa: microchip: Use different registers for KSZ8463
  net: dsa: microchip: Add KSZ8463 switch support to KSZ DSA driver
  dt-bindings: net: dsa: microchip: Add KSZ8463 switch support
  ...
2025-07-30 08:58:55 -07:00
Steven Rostedt 0dd1274a05 tracing: Have eprobes have their own config option
Eprobes were added in 5.15 and were selected whenever any of the other
probe events were selected. If kprobe events were enabled (which they are
by default when kprobes are enabled), eprobe events were enabled as well.
The same was true for uprobes and fprobes.

Have eprobes have their own config option, enabled by default when tracing
is enabled.

Link: https://lore.kernel.org/all/20250729102636.b7cce553e7cc263722b12365@kernel.org/

Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Link: https://lore.kernel.org/20250730140945.360286733@kernel.org
Suggested-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-30 10:38:43 -04:00
Linus Torvalds 4b290aae78 Summary
* Move sysctls out of the kern_table array
 
   This is the final move of ctl_tables into their respective subsystems. Only 5
   (out of the original 50) will remain in kernel/sysctl.c file; these handle
   either sysctl or common arch variables.
 
   By decentralizing sysctl registrations, subsystem maintainers regain control
   over their sysctl interfaces, improving maintainability and reducing the
   likelihood of merge conflicts.
 
 * docs: Remove false positives from check-sysctl-docs
 
   Stopped falsely identifying sysctls as undocumented or unimplemented in the
   check-sysctl-docs script. This script can now be used to automatically
   identify if documentation is missing.
 
 * Testing
 
   All these have been in linux-next since rc3, giving them a solid 3 to 4
   weeks' worth of testing. Additionally, the sysctl selftests and kunit
   tests were also run locally on my x86_64 machine.
 -----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEErkcJVyXmMSXOyyeQupfNUreWQU8FAmiAvd8ACgkQupfNUreW
 QU+9nAv/dtxaKoL4BXJSzsA2+49bbo9QfiK5Vjz1wSRYRQTb+jhGr9QdS5hG+NeX
 uN2ilvcNQqW7ENdiblU10lvcbPjIn2hw4lbMcpv/+QXnrudtGYlBFXlkWqW5nv7X
 AVvHU8y3uzfs6JbRIpROUA7Cn2cDOlfP2mMtwxCXR3iP+orS1ziuVEi1JRoirIyG
 iq5I/1rJMJBU3FjqqDTq6yljspLx8AlXO1yc5xUxAM67IcY4ew3ZTxqiZr6M9AhV
 DUbR2lu/88wcFNERt8DJmuQ50dSGGqOEpK3FURTmkwtMFxzNLmenFDQeBKKahz3Q
 2ntXSDfp2y+ppZNmcOP8tZZkra03Xpy1DQyoOgQ2r9uGekPxyr+wmKXwYPOeJIPO
 YWTNBm8omX9qr49zVzaZ1f2foRGfgStHL6aa6xLIf34zzScSDEPtO3og2+5Hw/30
 gnp+7v9E19uKpoE6oiGE0PtiFzAi/I6nFxzG2RRqrlMLFXyKVccTKygzY6tCnI3P
 6144s/Bt
 =R369
 -----END PGP SIGNATURE-----

Merge tag 'sysctl-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl

Pull sysctl updates from Joel Granados:

 - Move sysctls out of the kern_table array

   This is the final move of ctl_tables into their respective
   subsystems. Only 5 (out of the original 50) will remain in
   kernel/sysctl.c file; these handle either sysctl or common arch
   variables.

    By decentralizing sysctl registrations, subsystem maintainers regain
    control over their sysctl interfaces, improving maintainability and
    reducing the likelihood of merge conflicts (see the sketch after
    this list).

 - docs: Remove false positives from check-sysctl-docs

   Stopped falsely identifying sysctls as undocumented or unimplemented
   in the check-sysctl-docs script. This script can now be used to
   automatically identify if documentation is missing.
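
 A hedged sketch of the decentralized pattern (hypothetical knob, real
 register_sysctl() API): the subsystem declares and registers its own
 ctl_table from its own file instead of adding an entry to kernel/sysctl.c.

	#include <linux/init.h>
	#include <linux/sysctl.h>

	static int demo_value;	/* hypothetical subsystem knob */

	static const struct ctl_table demo_table[] = {
		{
			.procname	= "demo_value",
			.data		= &demo_value,
			.maxlen		= sizeof(int),
			.mode		= 0644,
			.proc_handler	= proc_dointvec,
		},
	};

	static int __init demo_sysctl_init(void)
	{
		/* shows up as /proc/sys/kernel/demo_value */
		register_sysctl("kernel", demo_table);
		return 0;
	}
	late_initcall(demo_sysctl_init);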

* tag 'sysctl-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl: (23 commits)
  docs: Downgrade arm64 & riscv from titles to comment
  docs: Replace spaces with tabs in check-sysctl-docs
  docs: Remove colon from ctltable title in vm.rst
  docs: Add awk section for ucount sysctl entries
  docs: Use skiplist when checking sysctl admin-guide
  docs: nixify check-sysctl-docs
  sysctl: rename kern_table -> sysctl_subsys_table
  kernel/sys.c: Move overflow{uid,gid} sysctl into kernel/sys.c
  uevent: mv uevent_helper into kobject_uevent.c
  sysctl: Removed unused variable
  sysctl: Nixify sysctl.sh
  sysctl: Remove superfluous includes from kernel/sysctl.c
  sysctl: Remove (very) old file changelog
  sysctl: Move sysctl_panic_on_stackoverflow to kernel/panic.c
  sysctl: move cad_pid into kernel/pid.c
  sysctl: Move tainted ctl_table into kernel/panic.c
  Input: sysrq: mv sysrq into drivers/tty/sysrq.c
  fork: mv threads-max into kernel/fork.c
  parisc/power: Move soft-power into power.c
  mm: move randomize_va_space into memory.c
  ...
2025-07-29 21:43:08 -07:00
Linus Torvalds 6fb44438a5 arm64 updates for 6.17:
Perf and PMU updates:
 
  - Add support for new (v3) Hisilicon SLLC and DDRC PMUs
 
  - Add support for Arm-NI PMU integrations that share interrupts between
    clock domains within a given instance
 
  - Allow SPE to be configured with a lower sample period than the
    minimum recommendation advertised by PMSIDR_EL1.Interval
 
  - Add support for Arm's "Branch Record Buffer Extension" (BRBE)
 
  - Adjust the perf watchdog period according to cpu frequency changes
 
  - Minor driver fixes and cleanups
 
 Hardware features:
 
  - Support for MTE store-only checking (FEAT_MTE_STORE_ONLY)
 
  - Support for reporting the non-address bits during a synchronous MTE
    tag check fault (FEAT_MTE_TAGGED_FAR)
 
  - Optimise the TLBI when folding/unfolding contiguous PTEs on hardware
    with FEAT_BBM (break-before-make) level 2 and no TLB conflict aborts
 
 Software features:
 
  - Enable HAVE_LIVEPATCH after implementing arch_stack_walk_reliable()
    and using the text-poke API for late module relocations
 
  - Force VMAP_STACK always on and change arm64_efi_rt_init() to use
    arch_alloc_vmap_stack() in order to avoid KASAN false positives
 
 ACPI:
 
  - Improve SPCR handling and messaging on systems lacking an SPCR table
 
 Debug:
 
  - Simplify the debug exception entry path
 
  - Drop redundant DBG_MDSCR_* macros
 
 Kselftests:
 
  - Cleanups and improvements for SME, SVE and FPSIMD tests
 
 Miscellaneous:
 
  - Optimise loop to reduce redundant operations in contpte_ptep_get()
 
  - Remove ISB when resetting POR_EL0 during signal handling
 
  - Mark the kernel as tainted on SEA and SError panic
 
  - Remove redundant gcs_free() call
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE5RElWfyWxS+3PLO2a9axLQDIXvEFAmiDkgoACgkQa9axLQDI
 XvFucQ//bYugRP5/Sdlrq5eDKWBGi1HufYzwfDEBLc4S75Eu8mGL/tuThfu9yFn+
 qCowtt4U84HdWsZDTSVo6lym6v2vJUpGOMgXzepvJaFBRnqGv9X9NxH6RQO1LTnu
 Pm7rO+7I9tNpfuc7Zu9pHDggsJEw+WzVfmEF6WPSFlT9mUNv6NbSx4rbLQKU86Dm
 ouTqXaePEQZ5oiRXVasxyT0otGtiACD20WpgOtNjYGzsfUVwCf/C83V/2DLwwbhr
 9cW9lCtFxA/yFdQcA9ThRzWZ9Eo5LAHqjGIq00+zOjuzgDbBtcTT79gpChkhovIR
 FBIsWHd9j9i3nYxzf4V4eRKQnyqS3NQWv7g7uKFwNgARif1Zk0VJ77QIlAYk5xLI
 ENTRjLKz5WNGGnhdkeCvDlVyxX+OktgcVTp3vqRxAKCRahMMUqBrwxiM8RzVF37e
 yzkEQayL8F7uZqy9H7Sjn48UpHZux6frJ1bBQw1oEvR9QmAoAdqavPMSAYIOT3Zr
 ze4WIljq/cFr3kBPIFP5pK1e0qYMHXZpSKIm8MAv6y/7KmQuVbMjZthpuPbLSIw0
 Q7C0KalB8lToPIbO7qMni/he0dCN4K2+E1YHFTR+pzfcoLuW4rjSg7i8tqMLKMJ8
 H+SeGLyPtM5A6bdAPTTpqefcgUUe7064ENUqrGUpDEynGXA7boE=
 =5h1C
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Catalin Marinas:
 "A quick summary: perf support for Branch Record Buffer Extensions
  (BRBE), typical PMU hardware updates, small additions to MTE for
  store-only tag checking and exposing non-address bits to signal
  handlers, HAVE_LIVEPATCH enabled on arm64, VMAP_STACK forced on.

  There is also a TLBI optimisation on hardware that does not require
  break-before-make when changing the user PTEs between contiguous and
  non-contiguous.

  More details:

  Perf and PMU updates:

   - Add support for new (v3) Hisilicon SLLC and DDRC PMUs

   - Add support for Arm-NI PMU integrations that share interrupts
     between clock domains within a given instance

   - Allow SPE to be configured with a lower sample period than the
     minimum recommendation advertised by PMSIDR_EL1.Interval

   - Add support for Arm's "Branch Record Buffer Extension" (BRBE)

   - Adjust the perf watchdog period according to cpu frequency changes

   - Minor driver fixes and cleanups

  Hardware features:

   - Support for MTE store-only checking (FEAT_MTE_STORE_ONLY)

   - Support for reporting the non-address bits during a synchronous MTE
     tag check fault (FEAT_MTE_TAGGED_FAR)

   - Optimise the TLBI when folding/unfolding contiguous PTEs on
     hardware with FEAT_BBM (break-before-make) level 2 and no TLB
     conflict aborts

  Software features:

   - Enable HAVE_LIVEPATCH after implementing arch_stack_walk_reliable()
     and using the text-poke API for late module relocations

   - Force VMAP_STACK always on and change arm64_efi_rt_init() to use
     arch_alloc_vmap_stack() in order to avoid KASAN false positives

  ACPI:

   - Improve SPCR handling and messaging on systems lacking an SPCR
     table

  Debug:

   - Simplify the debug exception entry path

   - Drop redundant DBG_MDSCR_* macros

  Kselftests:

   - Cleanups and improvements for SME, SVE and FPSIMD tests

  Miscellaneous:

   - Optimise loop to reduce redundant operations in contpte_ptep_get()

   - Remove ISB when resetting POR_EL0 during signal handling

   - Mark the kernel as tainted on SEA and SError panic

   - Remove redundant gcs_free() call"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (93 commits)
  arm64/gcs: task_gcs_el0_enable() should use passed task
  arm64: Kconfig: Keep selects somewhat alphabetically ordered
  arm64: signal: Remove ISB when resetting POR_EL0
  kselftest/arm64: Handle attempts to disable SM on SME only systems
  kselftest/arm64: Fix SVE write data generation for SME only systems
  kselftest/arm64: Test SME on SME only systems in fp-ptrace
  kselftest/arm64: Test FPSIMD format data writes via NT_ARM_SVE in fp-ptrace
  kselftest/arm64: Allow sve-ptrace to run on SME only systems
  arm64/mm: Drop redundant addr increment in set_huge_pte_at()
  kselftest/arm4: Provide local defines for AT_HWCAP3
  arm64: Mark kernel as tainted on SAE and SError panic
  arm64/gcs: Don't call gcs_free() when releasing task_struct
  drivers/perf: hisi: Support PMUs with no interrupt
  drivers/perf: hisi: Relax the event number check of v2 PMUs
  drivers/perf: hisi: Add support for HiSilicon SLLC v3 PMU driver
  drivers/perf: hisi: Use ACPI driver_data to retrieve SLLC PMU information
  drivers/perf: hisi: Add support for HiSilicon DDRC v3 PMU driver
  drivers/perf: hisi: Simplify the probe process for each DDRC version
  perf/arm-ni: Support sharing IRQs within an NI instance
  perf/arm-ni: Consolidate CPU affinity handling
  ...
2025-07-29 20:21:54 -07:00
Linus Torvalds 72b8944f14 Locking updates for v6.17:
Locking primitives:
 
   - Mark devm_mutex_init() as __must_check and fix drivers
     that didn't check the return code. (Thomas Weißschuh)
 
   - Reorganize <linux/local_lock.h> to better expose the
     internal APIs to local variables. (Sebastian Andrzej Siewior)
 
   - Remove OWNER_SPINNABLE in rwsem (Jinliang Zheng)
 
   - Remove redundant #ifdefs in the mutex code (Ran Xiaokai)
 
 Lockdep:
 
   - Avoid returning struct in lock_stats() (Arnd Bergmann)
 
   - Change `static const` into enum for LOCKF_*_IRQ_*
     (Arnd Bergmann)
 
   - Temporarily use synchronize_rcu_expedited() in
     lockdep_unregister_key() to speed things up.
     (Breno Leitao)
 
 Rust runtime:
 
   - Add #[must_use] to Lock::try_lock() (Jason Devers)
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmiIbzURHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1gxlRAAsnrbMN1yUbGbOh2fr7eQh69nn4VLZhvQ
 n/Q2+ZpvgBQiPhUnYub4n0B03pO6lQO+taiAQ9WTK6VHi7kzIKIx0MPWP5KV9FCY
 NQKQCmRccese0mmWVYccLPjyk6GW8l5gIhRK1vuEYANtLf/XLBYB/ygvE6a8ywNz
 dmt7IzYOIknCuEtapDzcJLBZFHG9mVTT8Kk2A5aqn+XCrxNnKrYyVOH0qw395uBw
 ulVKPJT7FGQ4qLkxfYguNWH5V1ZneN53tJouwqcM7Xpc+ookQFAZel0xlfWpVu+A
 Q2WF3W8GOrS7ER9RzjG0SQF4qYBq60yKPZr3przmjCJFRgFdvEkMEIDvbirl0Gfv
 Y04hMIcovsnh8x0iLTYxkrRxlZB/7jm5uLVJ1B6E19iYBXq1HCPkM51XugDQFxwz
 fDSLblpRZLf9OoWT9NPiiQXpoSLigwOiFdiGimIMQHRbPKCujF2T9w4XpKLLECN4
 UbYGMx/yAGdkTXelSStyru0ZLYhvxP2XMAaUJoMBrjI1ReL2e58Vmp2MqQcuhiuU
 PV5NEt0qhBAjilUrP+vuM/27UihPxcBrVgvriT+wDVrrPiy1t5iJVOKxFcrkbMto
 B+XHFA7z1EglkwGD7HCdoOFU8V3PM6+GNDMqvs5Ey3tifqampEssmYcP3YA6QYBt
 eO7imScWtII=
 =RExf
 -----END PGP SIGNATURE-----

Merge tag 'locking-core-2025-07-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking updates from Ingo Molnar:
 "Locking primitives:

   - Mark devm_mutex_init() as __must_check and fix drivers that didn't
     check the return code (Thomas Weißschuh); see the sketch after this
     summary

   - Reorganize <linux/local_lock.h> to better expose the internal APIs
     to local variables (Sebastian Andrzej Siewior)

   - Remove OWNER_SPINNABLE in rwsem (Jinliang Zheng)

   - Remove redundant #ifdefs in the mutex code (Ran Xiaokai)

  Lockdep:

   - Avoid returning struct in lock_stats() (Arnd Bergmann)

   - Change `static const` into enum for LOCKF_*_IRQ_* (Arnd Bergmann)

   - Temporarily use synchronize_rcu_expedited() in
     lockdep_unregister_key() to speed things up. (Breno Leitao)

  Rust runtime:

   - Add #[must_use] to Lock::try_lock() (Jason Devers)"
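
 A hedged driver-side sketch of the __must_check change (hypothetical
 probe function, real devm_mutex_init() API): silently ignoring the
 return value now triggers a compiler warning, so callers must handle
 failure.

	#include <linux/device.h>
	#include <linux/mutex.h>

	struct demo_priv {
		struct mutex lock;
	};

	static int demo_probe(struct device *dev)
	{
		struct demo_priv *priv;
		int ret;

		priv = devm_kzalloc(dev, sizeof(*priv), GFP_KERNEL);
		if (!priv)
			return -ENOMEM;

		ret = devm_mutex_init(dev, &priv->lock);
		if (ret)	/* must be checked now */
			return ret;

		dev_set_drvdata(dev, priv);
		return 0;
	}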

* tag 'locking-core-2025-07-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization
  locking/mutex: Remove redundant #ifdefs
  locking/lockdep: Change 'static const' variables to enum values
  locking/lockdep: Avoid struct return in lock_stats()
  locking/rwsem: Use OWNER_NONSPINNABLE directly instead of OWNER_SPINNABLE
  rust: sync: Add #[must_use] to Lock::try_lock()
  locking/mutex: Mark devm_mutex_init() as __must_check
  leds: lp8860: Check return value of devm_mutex_init()
  spi: spi-nxp-fspi: Check return value of devm_mutex_init()
  local_lock: Move this_cpu_ptr() notation from internal to main header
2025-07-29 18:11:32 -07:00
Linus Torvalds bf76f23aa1 Scheduler updates for v6.17:
Core scheduler changes:
 
  - Better tracking of maximum lag of tasks in presence of different
    slices duration, for better handling of lag in the fair
    scheduler. (Vincent Guittot)
 
  - Clean up and standardize #if/#else/#endif markers throughout
    the entire scheduler code base (Ingo Molnar)
 
  - Make SMP unconditional: build the SMP scheduler's
    data structures and logic on UP kernel too, even though
    they are not used, to simplify the scheduler and remove
    around 200 #ifdef/[#else]/#endif blocks from the
    scheduler. (Ingo Molnar)
 
  - Reorganize cgroup bandwidth control interface handling
    for better interfacing with sched_ext (Tejun Heo)
 
 Balancing:
 
  - Bump sd->max_newidle_lb_cost when newidle balance fails (Chris Mason)
  - Remove sched_domain_topology_level::flags to simplify the code (Prateek Nayak)
  - Simplify and clean up build_sched_topology() (Li Chen)
  - Optimize build_sched_topology() on large machines (Li Chen)
 
 Real-time scheduling:
 
  - Add initial version of proxy execution: a mechanism for mutex-owning
    tasks to inherit the scheduling context of higher priority waiters.
    Currently limited to a single runqueue and conditional on CONFIG_EXPERT,
    and other limitations. (John Stultz, Peter Zijlstra, Valentin Schneider)
 
  - Deadline scheduler (Juri Lelli):
 
    - Fix dl_servers initialization order (Juri Lelli)
    - Fix DL scheduler's root domain reinitialization logic (Juri Lelli)
    - Fix accounting bugs after global limits change (Juri Lelli)
    - Fix scalability regression by implementing less aggressive dl_server
      handling (Peter Zijlstra)
 
 PSI:
 
  - Improve scalability by optimizing psi_group_change() cpu_clock() usage
    (Peter Zijlstra)
 
 Rust changes:
 
  - Make Task, CondVar and PollCondVar methods inline to avoid unnecessary
    function calls (Kunwu Chan, Panagiotis Foliadis)
 
  - Add might_sleep() support for Rust code: Rust's "#[track_caller]"
    mechanism is used so that Rust's might_sleep() doesn't need to be
    defined as a macro (Fujita Tomonori)
 
  - Introduce file_from_location() (Boqun Feng)
 
 Debugging & instrumentation:
 
  - Make clangd usable with scheduler source code files again (Peter Zijlstra)
 
  - tools: Add root_domains_dump.py which dumps root domains info (Juri Lelli)
 
  - tools: Add dl_bw_dump.py for printing bandwidth accounting info (Juri Lelli)
 
 Misc cleanups & fixes:
 
  - Remove play_idle() (Feng Lee)
 
  - Fix check_preemption_disabled() (Sebastian Andrzej Siewior)
 
  - Do not call __put_task_struct() on RT if pi_blocked_on is set
    (Luis Claudio R. Goncalves)
 
  - Correct the comment in place_entity() (wang wei)
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmiHHNIRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1g7DhAAg9aMW33PuC24A4hCS1XQay6j3rgmR5qC
 AOqDofj/CY4Q374HQtOl4m5CYZB/G5csRv6TZliWQKhAy9vr6VWddoyOMJYOAlAx
 XRurl1Z3MriOMD6DPgNvtHd5PrR5Un8ygALgT+32d0PRz27KNXORW5TyvEf2Bv4r
 BX4/GazlOlK0PdGUdZl0q/3dtkU4Wr5IifQzT8KbarOSBbNwZwVcg+83hLW5gJMx
 LgMGLaAATmiN7VuvJWNDATDfEOmOvQOu8veoS8TuP1AOVeJPfPT2JVh9Jen5V1/5
 3w1RUOkUI2mQX+cujWDW3koniSxjsA1OegXfHnFkF5BXp4q5e54k6D5sSh1xPFDX
 iDhkU5jsbKkkJS2ulD6Vi4bIAct3apMl4IrbJn/OYOLcUVI8WuunHs4UPPEuESAS
 TuQExKSdj4Ntrzo3pWEy8kX3/Z9VGa+WDzwsPUuBSvllB5Ir/jjKgvkxPA6zGsiY
 rbkmZT8qyI01IZ/GXqfI2AQYCGvgp+SOvFPi755ZlELTQS6sUkGZH2/2M5XnKA9t
 Z1wB2iwttoS1VQInx0HgiiAGrXrFkr7IzSIN2T+CfWIqilnL7+nTxzwlJtC206P4
 DB97bF6azDtJ6yh1LetRZ1ZMX/Gr56Cy0Z6USNoOu+a12PLqlPk9+fPBBpkuGcdy
 BRk8KgysEuk=
 =8T0v
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2025-07-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:
 "Core scheduler changes:

   - Better tracking of maximum lag of tasks in presence of different
     slices duration, for better handling of lag in the fair scheduler
     (Vincent Guittot)

   - Clean up and standardize #if/#else/#endif markers throughout the
     entire scheduler code base (Ingo Molnar)

   - Make SMP unconditional: build the SMP scheduler's data structures
     and logic on UP kernel too, even though they are not used, to
     simplify the scheduler and remove around 200 #ifdef/[#else]/#endif
     blocks from the scheduler (Ingo Molnar)

   - Reorganize cgroup bandwidth control interface handling for better
     interfacing with sched_ext (Tejun Heo)

  Balancing:

   - Bump sd->max_newidle_lb_cost when newidle balance fails (Chris
     Mason)

   - Remove sched_domain_topology_level::flags to simplify the code
     (Prateek Nayak)

   - Simplify and clean up build_sched_topology() (Li Chen)

   - Optimize build_sched_topology() on large machines (Li Chen)

  Real-time scheduling:

   - Add initial version of proxy execution: a mechanism for
     mutex-owning tasks to inherit the scheduling context of higher
     priority waiters.

     Currently limited to a single runqueue, conditional on
     CONFIG_EXPERT, and subject to other limitations (John Stultz, Peter
     Zijlstra, Valentin Schneider)

   - Deadline scheduler (Juri Lelli):
      - Fix dl_servers initialization order (Juri Lelli)
      - Fix DL scheduler's root domain reinitialization logic (Juri
        Lelli)
      - Fix accounting bugs after global limits change (Juri Lelli)
      - Fix scalability regression by implementing less aggressive
        dl_server handling (Peter Zijlstra)

  PSI:

   - Improve scalability by optimizing psi_group_change() cpu_clock()
     usage (Peter Zijlstra)

  Rust changes:

   - Make Task, CondVar and PollCondVar methods inline to avoid
     unnecessary function calls (Kunwu Chan, Panagiotis Foliadis)

   - Add might_sleep() support for Rust code: Rust's "#[track_caller]"
     mechanism is used so that Rust's might_sleep() doesn't need to be
     defined as a macro (Fujita Tomonori)

   - Introduce file_from_location() (Boqun Feng)

  Debugging & instrumentation:

   - Make clangd usable with scheduler source code files again (Peter
     Zijlstra)

   - tools: Add root_domains_dump.py which dumps root domains info (Juri
     Lelli)

   - tools: Add dl_bw_dump.py for printing bandwidth accounting info
     (Juri Lelli)

  Misc cleanups & fixes:

   - Remove play_idle() (Feng Lee)

   - Fix check_preemption_disabled() (Sebastian Andrzej Siewior)

   - Do not call __put_task_struct() on RT if pi_blocked_on is set (Luis
     Claudio R. Goncalves)

   - Correct the comment in place_entity() (wang wei)"

* tag 'sched-core-2025-07-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (84 commits)
  sched/idle: Remove play_idle()
  sched: Do not call __put_task_struct() on rt if pi_blocked_on is set
  sched: Start blocked_on chain processing in find_proxy_task()
  sched: Fix proxy/current (push,pull)ability
  sched: Add an initial sketch of the find_proxy_task() function
  sched: Fix runtime accounting w/ split exec & sched contexts
  sched: Move update_curr_task logic into update_curr_se
  locking/mutex: Add p->blocked_on wrappers for correctness checks
  locking/mutex: Rework task_struct::blocked_on
  sched: Add CONFIG_SCHED_PROXY_EXEC & boot argument to enable/disable
  sched/topology: Remove sched_domain_topology_level::flags
  x86/smpboot: avoid SMT domain attach/destroy if SMT is not enabled
  x86/smpboot: moves x86_topology to static initialize and truncate
  x86/smpboot: remove redundant CONFIG_SCHED_SMT
  smpboot: introduce SDTL_INIT() helper to tidy sched topology setup
  tools/sched: Add dl_bw_dump.py for printing bandwidth accounting info
  tools/sched: Add root_domains_dump.py which dumps root domains info
  sched/deadline: Fix accounting after global limits change
  sched/deadline: Reset extra_bw to max_bw when clearing root domains
  sched/deadline: Initialize dl_servers after SMP
  ...
2025-07-29 17:42:52 -07:00
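
To make the "#if/#else/#endif marker" standardization mentioned above concrete, the enforced pattern annotates every #else and #endif with the guard it belongs to. This is an illustrative sketch (the helper name is made up, not taken from the actual patches):

    #ifdef CONFIG_SMP
    static void sched_update_domains(void) { /* SMP-only logic */ }
    #else /* !CONFIG_SMP */
    static void sched_update_domains(void) { }
    #endif /* CONFIG_SMP */
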
Linus Torvalds 04d29e3609 - Untangle the Retbleed from the ITS mitigation on Intel. Allow for ITS
to enable stuffing independently from Retbleed, do some cleanups to
   simplify and streamline the code
 
 - Simplify SRSO and make mitigation types selection more versatile
   depending on the Retbleed mitigation selection. Simplify code some
 
 - Add the second part of the attack vector controls which provide a much
   friendlier user interface to the speculation mitigations than
   selecting each one by one as it is now.
 
   Instead, the selection of whole attack vectors which are relevant to
   the system in use can be done and protection against only those
   vectors is enabled, thus giving back some performance to the users
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmiHh6cACgkQEsHwGGHe
 VUqprw//QMpqtWGVbo4bJ176sLtwn8cdKxOJwx9rWyFH/f3Zcn5hK1x+Zifm22hj
 NNo7YMLTvEg6BicxIDKp89tfXM5cwLS3pcUabWy7IS7Xzs7yLyRajNQ3hOFhQd9g
 UAUg8xx33xspCatlXzl4HcbOR0xyxb/qR4vd5H89Gir9GIuiO5+uz+3SdqEzzl8w
 2UfPDY5B9cXO8VoGsvJMtLTO1ULUHHZPgRdPaH8rSr9QkGlVFefpgUaw6Budic84
 kjNpE4tyJEvVLceZr8UtZWmVBwBS4z9oNRdqHbCFnrpPdYXnzYXA6pKMm1vP3zCz
 atRuWxmn0U6o9wZfxcBF7ZI2o3k049U8zxLWlz9mX4pXbMuqSX6MsR4kw82ta/Hp
 IzM9LckPO2STYHvJJlcEOivYbKTKttwYZd0rjfaFtJ0z+vVar4EyPyTbfGAdiH50
 T2UUmC9SpffVVhnOcaTUGtT/4SFCVA8ZNsoPm27auGVzZRnLOFSV63iv5fl41o3X
 pELyVfLzR3XtXFNXrzXY09lEKh5HIiy33Qe+syCNEoF56zTN+IREu37M7dKiWBmx
 xRJE9U9ZgxZjbEuMV0jKEMPOMzMf1ONQw5HSpfIgoT5OLwKXhP5HptHkKS3rwppG
 5Glo2kfvxKzFl/THHv7EPoIvVVL/tezcvO3H7z4owRl/jgw0CvA=
 =zO6b
 -----END PGP SIGNATURE-----

Merge tag 'x86_bugs_for_v6.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 CPU mitigation updates from Borislav Petkov:

 - Untangle the Retbleed from the ITS mitigation on Intel. Allow for ITS
   to enable stuffing independently from Retbleed, do some cleanups to
   simplify and streamline the code

 - Simplify SRSO and make mitigation types selection more versatile
   depending on the Retbleed mitigation selection. Simplify code some

 - Add the second part of the attack vector controls which provide a much
   friendlier user interface to the speculation mitigations than
   selecting each one by one as it is now.

   Instead, the selection of whole attack vectors which are relevant to
   the system in use can be done and protection against only those
   vectors is enabled, thus giving back some performance to the users

* tag 'x86_bugs_for_v6.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (31 commits)
  x86/bugs: Print enabled attack vectors
  x86/bugs: Add attack vector controls for TSA
  x86/pti: Add attack vector controls for PTI
  x86/bugs: Add attack vector controls for ITS
  x86/bugs: Add attack vector controls for SRSO
  x86/bugs: Add attack vector controls for L1TF
  x86/bugs: Add attack vector controls for spectre_v2
  x86/bugs: Add attack vector controls for BHI
  x86/bugs: Add attack vector controls for spectre_v2_user
  x86/bugs: Add attack vector controls for retbleed
  x86/bugs: Add attack vector controls for spectre_v1
  x86/bugs: Add attack vector controls for GDS
  x86/bugs: Add attack vector controls for SRBDS
  x86/bugs: Add attack vector controls for RFDS
  x86/bugs: Add attack vector controls for MMIO
  x86/bugs: Add attack vector controls for TAA
  x86/bugs: Add attack vector controls for MDS
  x86/bugs: Define attack vectors relevant for each bug
  x86/Kconfig: Add arch attack vector support
  cpu: Define attack vectors
  ...
2025-07-29 16:34:45 -07:00
Linus Torvalds 909d2bb07d stop-machine: Improve documentation
Changes
 -------
 
 * Improve kernel-doc function-header comments
 * Document preemption and stop_machine() mutual exclusion (Joel Fernandes)
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmiBbxgTHHBhdWxtY2tA
 a2VybmVsLm9yZwAKCRCevxLzctn7jCgqD/4zu5NL6k5JmzPnYoLKPmjyAhQCZHDf
 saXTuH7mmZPLj0KB/LoiwHTSLO+ldFzqwlggw2YuyqfdHMQzc91NVBU2ZZMAFncD
 tEdcIQjDvtBHQmi7PU0b/knNJuVEHSCq6OtpmcnvSjQsZXfPeswrfd1ofuFrgc37
 jpmyYhplFMAbM4V0IbkjzX8afLYKQRG0Eb51S+yAmbz0ZdiEr4OuSRs3zJvQfCRE
 /PYfMM8oczUW6MaVJRRIi67KIAiGBfhN+X4Uc39yir/gAu/HQMEv7/+3ImemUu5d
 x/ZIca/LtJh7Ga4EHNdc7I9AOtDifzXti07izFiILAHrsMlpALMqIRbftHQAknwf
 CQQUpoB/C0qH1PxZuYyeOByo+MXYejKAsum8990AY6z9mBjTGtdE4cr/ikeYVFft
 DZLuDPKoy48C5xu0ycj5Ir9TF8LkTBAUdRsUmXiF+SnQWbOwnmcr79XoPp+7JH2A
 /70EClttaaTvCjpzDZRTNaHjPZ7bdiARUYwCvR/Vgd8KbqYaUoXnbSZjcLGbnSNC
 W/GCPiDJCEhxbKnli6Vdc2cEI7aA9mUUZqSMo6XryjixeXoPUEfmv91vZ3GkRrGv
 QMM+MxBuYLWG//G2rBbyIiGE81ewAn6TmDVU54UfqKbB3AmOioyDN0jI/kJCAYfx
 UqqLyz7+LMbx+A==
 =tjJh
 -----END PGP SIGNATURE-----

Merge tag 'stop-machine.2025.07.23a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu

Pull stop-machine documentation updates from Paul McKenney:

 - Improve kernel-doc function-header comments

 - Document preemption and stop_machine() mutual exclusion (Joel
   Fernandes)

* tag 'stop-machine.2025.07.23a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
  smp: Document preemption and stop_machine() mutual exclusion
  stop_machine: Improve kernel-doc function-header comments
2025-07-29 16:14:07 -07:00
Linus Torvalds 78bb43e51b Updates for the generic entry code:
- Split the code into syscall and exception/interrupt parts to ease the
     conversion of ARM[64] to the generic entry infrastructure
 
   - Extend syscall user dispatching to support a single intercepted range
     instead of the default single non-intercepted range. That allows
     monitoring/analysis of a specific executable range, e.g. a library, and
     also provides flexibility for sandboxing scenarios.
 
   - Cleanup and extend the user dispatch selftest
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmiIg6UTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoWn7EACTvQpu7tGd1rN9hCjiB1W5po7nvlCd
 gKghjS9Kp0KttTDQPLVcmnH06BhDHWNNn1HXZ1ORea4bpLywiKHtVgqUAsJDsBsv
 ETeTHYNphk0sktvAqp3XusA6HF4T0s1KXJQj3W1ACrYZWRkK/VystCLYwBRGpc3r
 cj7jAFmJyNpU236R5XYJ7ooHfPYpzZ8VAHBO8ykK7muHDfyBRXEIlmkGep++ctSv
 v0uZXAy6LONljKg87YJTien0UA7ze9lFgPTuV1y/qfaLbYNekUaJSDjfuhOpZZUw
 TzSh9OYoIvKpd0ylHwB1qMLd5CaXNicaeLfTW3xbX06KaXa7WNAonS35sK0EjhtZ
 0bBA9g6bRhphyh0tzR4saF9bczNvJydNCn7/QFo9dKbQUEL/FRXtJiIeusVx/0fJ
 +ZqWRTcEdDw2Rsyv52hKgyEJi7F3nL9ovabUN9P1/0aPcTdM3WekMpSOJm1U6wVF
 e6oSyeoeNdjcdxgWbQrgRNbmq5CPEV3ig5J+G418r5DTF3ifqZX+WscijUtKTu5K
 V5GpLc0PL9eoigQ37LmGkwK/4xoB9SAPTQuzUs9qgh9NidwT0cCfoNxpeGh6GeHX
 GLHPGU61vZaefxpwuAuv+SQSgxXSKk2/H/ijPzSjrX/PkUp7MoX9XoOQAh4FxZjO
 ok5YEUGXzSJfXQ==
 =yaCQ
 -----END PGP SIGNATURE-----

Merge tag 'core-entry-2025-07-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull generic entry code updates from Thomas Gleixner:

 - Split the code into syscall and exception/interrupt parts to ease the
   conversion of ARM[64] to the generic entry infrastructure

 - Extend syscall user dispatching to support a single intercepted range
   instead of the default single non-intercepted range. That allows
   monitoring/analysis of a specific executable range, e.g. a library,
   and also provides flexibility for sandboxing scenarios

 - Cleanup and extend the user dispatch selftest

* tag 'core-entry-2025-07-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  entry: Split generic entry into generic exception and syscall entry
  selftests: Add tests for PR_SYS_DISPATCH_INCLUSIVE_ON
  syscall_user_dispatch: Add PR_SYS_DISPATCH_INCLUSIVE_ON
  selftests: Fix errno checking in syscall_user_dispatch test
2025-07-29 15:14:29 -07:00
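
To illustrate the new inclusive dispatch mode from the entry above, here is a minimal userspace sketch. PR_SET_SYSCALL_USER_DISPATCH and the SIGSYS delivery are the existing prctl(2) interface; the fallback value for PR_SYS_DISPATCH_INCLUSIVE_ON is an assumption for headers that predate this pull:

    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/prctl.h>
    #include <unistd.h>

    #ifndef PR_SYS_DISPATCH_INCLUSIVE_ON
    #define PR_SYS_DISPATCH_INCLUSIVE_ON 2  /* assumption: header predates feature */
    #endif

    static void sigsys_handler(int sig, siginfo_t *info, void *ctx)
    {
            (void)sig; (void)info; (void)ctx;
            /* A syscall was issued from inside the intercepted range. */
    }

    int main(void)
    {
            struct sigaction sa = { 0 };

            sa.sa_sigaction = sigsys_handler;
            sa.sa_flags = SA_SIGINFO;
            sigaction(SIGSYS, &sa, NULL);

            /* Stand-in for the executable range to monitor (e.g. a library
               mapping); a dummy page keeps the sketch self-contained. */
            void *range = mmap(NULL, 4096, PROT_READ | PROT_EXEC,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            /* Inclusive mode: only syscalls issued from [range, range+4096)
               are dispatched to SIGSYS; the classic mode expresses the
               single NON-intercepted range. NULL selector: range-only. */
            if (prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_INCLUSIVE_ON,
                      (unsigned long)range, 4096, NULL))
                    perror("prctl");

            write(1, "not intercepted\n", 16); /* issued from libc, outside */
            return 0;
    }
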
Linus Torvalds f38b1f243e Update for the futex subsystem:
- Switch the reference counting to a RCU based per-CPU reference to
      address a performance bottleneck vs. the single instance rcuref
      variant.
 
    - Make the futex selftest build on 32-bit architectures which only
      support 64-bit time_t, e.g. RISCV-32.
 
    - Cleanups and improvements in selftests and futex bench
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmiIiDITHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoblTD/0eV9w21tFVmn6ICrhgQgsrejJ0BANs
 mm5mE/0d29MZHEhnJO2CSccGXBDfykuk/gxHXHsUZ9tiVSOgjz9dDl1bcrZ8Je9V
 YNWMXiHASQrLctmrKLPSdjlcxQnPIxCm+K4lajoa+CyvReHE24sUDgCN8GC3P9pH
 VxTmQ7UjGrzvIRlfd4AL9GJBF1IGKNnpPHCeSwjn/cmlDxu4RxEdjRWTbW8Tbz9N
 1ay/T8vEE1SykI2qZOXIP16sYZw2dP9FOgARO90Ahb6hwAwbI72MvC69GpZe3lh5
 1B1ZgpEiUMa4IT5jJ43Wkm3k8BF6meW+rIUjUBt+y8yjNgaR4degvgnDx44YPZ94
 5Ek3cJgpTpVnWbfRxn2b2vRL8rZkRBIq9ezswp0/8KLgC7Gd+zPuQKPvoo2m+n3S
 UMufGGT2h5oJbx0qGry5rxZz03eGE6oWAm3H/WRl2wIw5D/kvU5ol6AYKJ5eGTyj
 JdPJVzzPBH319iCMZ1olqo/h5er148aYL16ga7w6w9pqhPuxGud30BFf8SHQ8F1R
 NIZiu6O3L2ge0RLb/8wxukFkDz3R1gZBWeTLxLEymTJG3TaA3uIByOI6UO03zgW/
 QBbNLr7ndkIcm8E31hAWamGQy+EAXj1/e5GYREvhhHOwUV+y/E1FTrrdwtT4GA0S
 tBYACfeCbOojsA==
 =WqFq
 -----END PGP SIGNATURE-----

Merge tag 'locking-futex-2025-07-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull futex updates from Thomas Gleixner:

 - Switch the reference counting to a RCU based per-CPU reference to
   address a performance bottleneck vs the single instance rcuref
   variant

 - Make the futex selftest build on 32-bit architectures which only
   support 64-bit time_t, e.g. RISCV-32

 - Cleanups and improvements in selftests and futex bench

* tag 'locking-futex-2025-07-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  selftests/futex: Fix spelling mistake "Succeffuly" -> "Successfully"
  selftests/futex: Define SYS_futex on 32-bit architectures with 64-bit time_t
  perf bench futex: Remove support for IMMUTABLE
  selftests/futex: Remove support for IMMUTABLE
  futex: Remove support for IMMUTABLE
  futex: Make futex_private_hash_get() static
  futex: Use RCU-based per-CPU reference counting instead of rcuref_t
  selftests/futex: Adapt the private hash test to RCU related changes
2025-07-29 14:39:42 -07:00
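
The 32-bit time_t build fix in the list above boils down to a fallback define for SYS_futex; here is a sketch of that define together with a bare-bones raw wrapper (the wrapper is illustrative, not the selftest's code):

    #include <linux/futex.h>
    #include <stdint.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* 32-bit architectures with only 64-bit time_t (e.g. riscv32) never
       had the classic futex syscall, only the time64 variant. */
    #if !defined(SYS_futex) && defined(SYS_futex_time64)
    #define SYS_futex SYS_futex_time64
    #endif

    static long futex(uint32_t *uaddr, int op, uint32_t val)
    {
            return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
    }

    int main(void)
    {
            uint32_t word = 1;

            /* Returns immediately with EAGAIN because *uaddr != val. */
            futex(&word, FUTEX_WAIT_PRIVATE, 0);
            return 0;
    }
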
Linus Torvalds 02dc9d15d7 Updates for the timekeeping and VDSO code:
- Introduce support for auxiliary timekeepers
 
       PTP clocks can be disconnected from the universal CLOCK_TAI reality
       for various reasons including regulatory requirements for
       functional safety redundancy.
 
       The kernel so far only supports a single notion of time, which means
       that all clocks are correlated in frequency and only differ by
       offset to each other.
 
       Access to non-correlated PTP clocks has been available so far only
       through the file descriptor based "POSIX clock IDs", which are
       subject to locking and have to go all the way out to the hardware.
 
       The access is not only horribly slow, as it has to go all the way out
       to the NIC/PTP hardware, but that also prevents the kernel from reading
       the time of such clocks e.g. from the network stack, where it is
       required for TSN networking both on the transmit and receive side
       unless the hardware provides offloading.
 
       The auxiliary clocks provide a mechanism to support arbitrary clocks
       which are not correlated to the system clock. This is not restricted
       to the PTP use case on purpose as there is no kernel side association
       of these clocks to a particular PTP device because that's a pure user
       space configuration decision. Having them independent allows
       utilizing them for other purposes and also enables them to be tested
       without hardware dependencies.
 
       To avoid pointless overhead these clocks have to be enabled
       individually via a new sysfs interface to reduce the overhead to a
       single compare in the hotpath if they are enabled at the Kconfig
       level at all.
 
       These clocks utilize the existing timekeeping/NTP infrastructures,
       which has been made possible over the recent releases by incrementally
       converting these infrastructures over from a single static instance
       to a multi-instance pointer based implementation without any
       performance regression reported.
 
       The auxiliary clocks provide the same "emulation" of a "correct"
       clock as the existing CLOCK_* variants do with an independent
       instance of data and provide the same steering mechanism through the
       existing sys_clock_adjtime() interface, which has been confirmed to
       work by the chronyd(8) maintainer.
 
       That allows providing lockless kernel-internal and VDSO support so
       that applications and kernel internal functionalities can access
       these clocks without restrictions and at the same performance as the
       existing system clocks.
 
   - Avoid double notifications in the adjtimex() syscall. Not a big issue,
    but a trivially avoidable latency source.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmiGo/MTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoWTIEACy/HC2OD7IbAzECgwQUvo59xvmw6ak
 p1wRNYpTjOLUgWZjcl/jQfh7jYe60/0PrIzYgAeltXSGwOVtqQDrUIWrTKrAHOUa
 wqKUCEfCucTUJRKLQ1Ktnjy/2Pp0Ojpf32Av0v/wgLUMxQk9Av39UdQwMOGyoHOa
 07//lrVzfYfqe5Ne7cmuZSbVcHlKyWpXtSvPiVhyk+tHZea4646Pz17sBeVsefps
 41mxZBRk7VNiE8yWtRWYKcaXxE/0nYkptjhXOqgmNRTGB/WfyKavDYVLWe31XPrI
 G3/QcAAJHBEYZgoGMHRn76L+NWNqnuxeFPtSOaGceBks7HwCKdfvrAwaJzJ1Mr22
 IxeoPm0Fzdrtjy2L2PM1txGdEI2iErprCNtQolM4BCQzUslsukP/Fts+SOujgpnC
 SJ9rbIIeKJzt2dW6kQai+xGw6WIpQyus7Lbt0sEcyBdWi5Bqvh9g1ZXUn8SHY2xx
 /aaEBe2J1RGGhNhHD6bSLYMAXoKDPcpZIrwO+2N96Z18uYee0FU5g3JVKGfuuTdo
 wYfpK79xsFmhaBDj8pYAJoU3y/v+WycE2pP2oFgBhN49Xcxo1yYIm5ECb9dYesfD
 4nW/Av9bI/NR7J4MilXvw2hSmfapgNgcUGb6sgEuSD7M8/UXkfFbcbqp+RcUBiQj
 0XGiNzqLnwgzBQ==
 =TTjt
 -----END PGP SIGNATURE-----

Merge tag 'timers-ptp-2025-07-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timekeeping and VDSO updates from Thomas Gleixner:

 - Introduce support for auxiliary timekeepers

   PTP clocks can be disconnected from the universal CLOCK_TAI reality
   for various reasons including regulatory requirements for
   functional safety redundancy.

   The kernel so far only supports a single notion of time, which means
   that all clocks are correlated in frequency and only differ by offset
   to each other.

   Access to non-correlated PTP clocks has been available so far only
   through the file descriptor based "POSIX clock IDs", which are
   subject to locking and have to go all the way out to the hardware.

   The access is not only horribly slow, as it has to go all the way out
   to the NIC/PTP hardware, but that also prevents the kernel from reading
   the time of such clocks e.g. from the network stack, where it is
   required for TSN networking both on the transmit and receive side
   unless the hardware provides offloading.

   The auxiliary clocks provide a mechanism to support arbitrary clocks
   which are not correlated to the system clock. This is not restricted
   to the PTP use case on purpose as there is no kernel side association
   of these clocks to a particular PTP device because that's a pure user
   space configuration decision. Having them independent allows
   utilizing them for other purposes and also enables them to be tested
   without hardware dependencies.

   To avoid pointless overhead these clocks have to be enabled
   individually via a new sysfs interface to reduce the overhead to a
   single compare in the hotpath if they are enabled at the Kconfig
   level at all.

   These clocks utilize the existing timekeeping/NTP infrastructures,
   which has been made possible over the recent releases by incrementally
   converting these infrastructures over from a single static instance
   to a multi-instance pointer based implementation without any
   performance regression reported.

   The auxiliary clocks provide the same "emulation" of a "correct"
   clock as the existing CLOCK_* variants do with an independent
   instance of data and provide the same steering mechanism through the
   existing sys_clock_adjtime() interface, which has been confirmed to
   work by the chronyd(8) maintainer.

   That allows providing lockless kernel-internal and VDSO support so
   that applications and kernel internal functionalities can access
   these clocks without restrictions and at the same performance as the
   existing system clocks.

 - Avoid double notifications in the adjtimex() syscall. Not a big
   issue, but a trivially avoidable latency source.

* tag 'timers-ptp-2025-07-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (39 commits)
  vdso/gettimeofday: Add support for auxiliary clocks
  vdso/vsyscall: Update auxiliary clock data in the datapage
  vdso: Introduce aux_clock_resolution_ns()
  vdso/gettimeofday: Introduce vdso_get_timestamp()
  vdso/gettimeofday: Introduce vdso_set_timespec()
  vdso/gettimeofday: Introduce vdso_clockid_valid()
  vdso/gettimeofday: Return bool from clock_gettime() helpers
  vdso/gettimeofday: Return bool from clock_getres() helpers
  vdso/helpers: Add helpers for seqlocks of single vdso_clock
  vdso/vsyscall: Split up __arch_update_vsyscall() into __arch_update_vdso_clock()
  vdso/vsyscall: Introduce a helper to fill clock configurations
  timekeeping: Remove the temporary CLOCK_AUX workaround
  timekeeping: Provide ktime_get_clock_ts64()
  timekeeping: Provide interface to control auxiliary clocks
  timekeeping: Provide update for auxiliary timekeepers
  timekeeping: Provide adjtimex() for auxiliary clocks
  timekeeping: Prepare do_adtimex() for auxiliary clocks
  timekeeping: Make do_adjtimex() reusable
  timekeeping: Add auxiliary clock support to __timekeeping_inject_offset()
  timekeeping: Make timekeeping_inject_offset() reusable
  ...
2025-07-29 14:12:52 -07:00
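
A hedged sketch of reading one of the new auxiliary clocks from userspace, assuming the clock has been enabled via the new sysfs interface (the path is not named in the log above, so it is not shown) and that the installed uapi headers already define CLOCK_AUX:

    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
    #ifdef CLOCK_AUX
            struct timespec ts;

            /* Aux clocks ride the same lockless vDSO fast path as the
               regular system clocks: an ordinary clock_gettime() call. */
            if (clock_gettime(CLOCK_AUX, &ts) == 0)
                    printf("aux clock 0: %lld.%09ld\n",
                           (long long)ts.tv_sec, ts.tv_nsec);
            else
                    perror("clock_gettime(CLOCK_AUX)"); /* e.g. not enabled */
    #else
            fprintf(stderr, "CLOCK_AUX not defined in these headers\n");
    #endif
            return 0;
    }

Steering such a clock goes through the existing clock_adjtime(2) path, as the log notes.
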
Linus Torvalds d614399b28 Updates for the timer core:
- Simplify the logic in the timer migration code
 
  - Simplify the clocksource code by utilizing the more modern cpumask_*()
    interfaces
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmiGlSYTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoajDEACHGVGjbnOumTHEAp1yrUZe0qQh8FUw
 wB9E6DeYS3H5Ajh5zP8BzsTi7USnJYDTv5OWz2hTl8ji3ORcsXjRzMuRa6VD8rsp
 drtxROiXA8i8H+pjKTehCZWy96vWybxIW+s0gkgK91dQKP6CAEehk4u7hELO6gU0
 ZPIcwZ3Kx5za4SJagqnYY8PGzRl52Oy1mt/jLgiQUlVcR1bel8xEd8Q+tbWI0yC+
 NA51niSsPZrx+n/AKOdKiaARtgZnzZJYfmv4bJfbBfV4QEiM13hGCPLq64FJZDPD
 ZkqMzdj9X4g77t4agMoRyGEki9Sfk9+qMJrLSpstBd3GtQU2pSTr9l24sRPcT5oa
 A0o2M+tPINgwZlnS1OLqzq8e4RWFhh6WpvDQTlMJ87nB46mYto8rTNsgjuT9e0hY
 bvnOdmce/p4BCSS5JDJj2Ix0+knr2VKkKD470U2SvoZa3s0KMvwCU2iPfhLQKFoP
 xx8LJABK0FrSbsdXV8SjeLUbVGN/HX4A1Zz8xx88fCUADgUr3b6y8Bd4btatQOK9
 hTGmkLQfgkQ6Dm5Qg1pmFhKiQqkWRCnW3uX5+9JmhCeu55ZU+EMUlzcklCh9lqGa
 njimFS//6/ikp/hSeYS19Z59gYHNdTavRkdJZJFPN4R67YNvEliKPYDEqTNRPzEx
 K7hjrpnMe6Ti6A==
 =A19q
 -----END PGP SIGNATURE-----

Merge tag 'timers-core-2025-07-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer core updates from Thomas Gleixner:

 - Simplify the logic in the timer migration code

 - Simplify the clocksource code by utilizing the more modern
   cpumask_*() interfaces

* tag 'timers-core-2025-07-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  clocksource: Use cpumask_next_wrap() in clocksource_watchdog()
  clocksource: Use cpumask_any_but() in clocksource_verify_choose_cpus()
  timers/migration: Clean up the loop in tmigr_quick_check()
2025-07-29 14:08:21 -07:00
Linus Torvalds 99e731bcb8 A treewide cleanup of struct cyclecounter const annotations:
 The initial idea of making them const was correct as they were separate
  instances. When they got embedded into larger data structures, which are
  even modified by the callback, this became moot. The only reason why this went
  unnoticed is that the required container_of() casts the const attribute
  forcefully away.
 
  Stop pretending that it is const.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmiGkt0THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYodb/D/47tAOcL6qyTxh+E5fDj4h5M2aVQYp+
 oLiBY7ejroObCAiMo6v/ajaGNEXvLXEtratFBQv1o7n12rxsKnv/XNdIbAzvMwKm
 3vNx9anQ0THFnWiUKCAf/mZBc/Z9uXnjBXY0TeaFLuj4DuUMqGGs/ga32UIvEwW+
 VXS592NIHYoK87VA/AIOK8AGX9VesYoxUqZmfmYn5NuJwsJDSgGaQpoH2fBVhU6p
 B7NRnVLYtFYpeAvL0ihFuOP03HZes5hM2wmqNtoCJJF9e39Eg9DXhsaUw8AIa1fu
 UCxm5sfNpMR+QN2AizbI5deHZJbFwYPy/cFrZ0AiwCfZt6NATjO7bQjvu/23yPFn
 klby8GizV0Q7qmiLi45EdPdid5oiBm/lNy6ICP9fN/XlHZFxAsqE/OGx626P+LPs
 xQK1MpHNjbkCK//X49hxQX6a0BCqSEAG4GOuEEqRxTeD9SZVzIQbcQ6CKOuBqPqc
 hc5o5N3suzdmq5sujQD0IiL9R960WfzsSgbmdQm+njhRz+LjAx6T419MlyePiu/5
 hf8pZjl/SHn1O6RpPfJ501U+tbo1auEnUjXMGcg12dx7KE4w9s7fboFQLXruuVRr
 UNeVXeMRxZqXSAB1HyKQMI8lZ7nHF6FZUA+xLMusCwIYi7YU0h7I3BxDzxpZV62l
 zJgslDp7ZqSopw==
 =C0ua
 -----END PGP SIGNATURE-----

Merge tag 'timers-cleanups-2025-07-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer cleanups from Thomas Gleixner:
 "A treewide cleanup of struct cycle_counter const annotations.

  The initial idea of making them const was correct as they were
   separate instances. When they got embedded into larger data
   structures, which are even modified by the callback, this became moot. The
  only reason why this went unnoticed is that the required
  container_of() casts the const attribute forcefully away.

  Stop pretending that it is const"

* tag 'timers-cleanups-2025-07-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  time/timecounter: Fix the lie that struct cyclecounter is const
2025-07-29 14:02:53 -07:00
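
A standalone userspace illustration of the point above, with a simplified container_of() (struct names are illustrative): the pointer cast silently discards const, which is why a const embedded cyclecounter never caused a peep while its container was being modified.

    #include <stddef.h>
    #include <stdio.h>

    #define container_of(ptr, type, member) \
            ((type *)((char *)(ptr) - offsetof(type, member)))

    struct cyclecounter { unsigned long long mask; };

    struct driver_state {
            int reads;                      /* mutated by the callback */
            struct cyclecounter cc;         /* embedded, handed out as const */
    };

    static unsigned long long read_cb(const struct cyclecounter *cc)
    {
            /* The const is a lie: the cast reaches the mutable container. */
            struct driver_state *st = container_of(cc, struct driver_state, cc);

            st->reads++;
            return 0;
    }

    int main(void)
    {
            struct driver_state st = { .reads = 0 };

            read_cb(&st.cc);
            printf("reads = %d\n", st.reads); /* 1: the "const" state changed */
            return 0;
    }
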
Johannes Nixdorf cce436aafc seccomp: Fix a race with WAIT_KILLABLE_RECV if the tracer replies too fast
Normally the tracee starts in SECCOMP_NOTIFY_INIT, sends an
event to the tracer, and starts to wait interruptibly. With
SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV, if the tracer receives the
message (SECCOMP_NOTIFY_SENT is reached) while the tracee was waiting
and is subsequently interrupted, the tracee begins to wait again
uninterruptibly (but killable).

This fails if SECCOMP_NOTIFY_REPLIED is reached before the tracee
is interrupted, as the check only considered SECCOMP_NOTIFY_SENT as a
condition to begin waiting again. In this case the tracee is interrupted
even though the tracer already acted on its behalf. This breaks the
assumption SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV wanted to ensure,
namely that the tracer can be sure the syscall is not interrupted or
restarted on the tracee after it is received on the tracer. Fix this
by also considering SECCOMP_NOTIFY_REPLIED when evaluating whether to
switch to uninterruptible waiting.

With the condition changed the loop in seccomp_do_user_notification()
would exit immediately after deciding that noninterruptible waiting
is required if the operation already reached SECCOMP_NOTIFY_REPLIED,
skipping the code that processes pending addfd commands first. Prevent
this by executing the remaining loop body one last time in this case.

Fixes: c2aa2dfef2 ("seccomp: Add wait_killable semantic to seccomp user notifier")
Reported-by: Ali Polatel <alip@chesswob.org>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220291
Signed-off-by: Johannes Nixdorf <johannes@nixdorf.dev>
Link: https://lore.kernel.org/r/20250725-seccomp-races-v2-1-cf8b9d139596@nixdorf.dev
Signed-off-by: Kees Cook <kees@kernel.org>
2025-07-29 13:33:01 -07:00
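
A compilable model of the corrected condition (a simplification, not the kernel's wait logic verbatim): once the notification has progressed past SECCOMP_NOTIFY_INIT, an interrupted tracee must return to killable-only waiting, and REPLIED now counts as well:

    #include <stdbool.h>

    enum notify_state {
            SECCOMP_NOTIFY_INIT,
            SECCOMP_NOTIFY_SENT,
            SECCOMP_NOTIFY_REPLIED,
    };

    static bool wait_killable_only(bool wait_killable_recv, enum notify_state s)
    {
            /* The pre-fix check was "s == SECCOMP_NOTIFY_SENT": a tracer
               that replied before the tracee was interrupted (REPLIED)
               left the tracee interruptible, so the syscall could still
               be restarted behind the tracer's back. */
            return wait_killable_recv && s >= SECCOMP_NOTIFY_SENT;
    }

As the message explains, when the state is already REPLIED the wait loop must still execute its body one last time so that pending addfd requests get processed.
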
Linus Torvalds 0b29600a30 Updates for interrupt chip drivers:
- Add support for forced affinity setting on not yet online CPUs for the
    MIPS-GIC to ensure that the affinity of per CPU interrupts can be set
    during the early bringup phase of a secondary CPU in the hotplug code
    before the CPU is set online and interrupts are enabled.
 
  - Add support for the MIPS (RISC-V !?!?) P8700 SoC in the ACLINT_SSWI
    interrupt chip
 
  - Make the interrupt routing to RISC-V harts specification compliant so it
    supports arbitrary hart indices
 
  - Add a command line parameter and related handling to disable the generic
    RISCV IMSIC mechanism on platforms which use a trap-emulated IMSIC.
    Unfortunately this is required because there is no mechanism available to
    discover this programmatically.
 
  - Enable wakeup sources on the Renesas RZV2H driver
 
  - Convert interrupt chip drivers, which use an open coded variant of
    msi_create_parent_irq_domain() to use the new functionality
 
  - Convert interrupt chip drivers, which use the old style two level
    implementation of MSI support over to the MSI parent mechanism to
    prepare for removing at least one of the three PCI/MSI backend variants.
 
  - The usual cleanups and improvements all over the place
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmiGj60THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoSh6D/9wY0G2dGz+EJeiDsldzB1n5jmf5I0k
 3XsI3o5j0Ma/Yy+nu9Re3fZq0+qzPFZZErxkBp5igCJbSoaIGheqOyXQDuQu/8tm
 s2t8Wx9k6er7Cywg9rU9pWKzJ6AFXFvcKOEvGG2q2+lFbJbIoGdAM93qPrOJhqeo
 a3NyhQv6kNl7xAjOVyEZmOlCZgCYotFwC0+K1TVQgGDGbwHWH3wad64gLTyyQQlK
 RvtbUBKfCBqqwLJ7Mww7Xclezjk/Hpgm/OppxBAglv5WyRd0e15u2dpdTdM9r4BC
 4wX5Old3ZgbqBQjdHGNlljthu4lO2S0PXU6j0EC8W2NiQjN+hPaKK/EPeerlAJCz
 UmxY0/E3HFNUk8ZHkiif3PGiOSvAbn0JwWi3+D6HuK1rlVTXNs07NIgUBk727Ty5
 S7r5JEuUA5s9dGta4pszxHGn/0Dqg/WvnMZGcbPNaV6POH47wNnPlO2mj14I1HLk
 SfG+deohJM34pVVq7fiqgGukLVPm6PfiJkXx90MK6l+BfE58uo7Oue9mm9pqT2dy
 b6K1gdNPRsZzG7AoAqkx3UrjQuD7maWIpDGb4VZeUW/34bthLygIDUY4OZhpdrUZ
 m33T8zv0PrmNuvnMdFt0RyoDTu8PC9rYS0XVvsIMqsMxJDE/URVGH2tCi5CVMiEg
 PbRWL56yGyT1NA==
 =5LGj
 -----END PGP SIGNATURE-----

Merge tag 'irq-drivers-2025-07-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull interrupt chip driver updates from Thomas Gleixner:

 - Add support for forced affinity setting on not yet online CPUs for the
   MIPS-GIC to ensure that the affinity of per CPU interrupts can be set
   during the early bringup phase of a secondary CPU in the hotplug code
   before the CPU is set online and interrupts are enabled

 - Add support for the MIPS (RISC-V !?!?) P8700 SoC in the ACLINT_SSWI
   interrupt chip

 - Make the interrupt routing to RISC-V harts specification compliant so
   it supports arbitrary hart indices

 - Add a command line parameter and related handling to disable the
   generic RISCV IMSIC mechanism on platforms which use a trap-emulated
   IMSIC. Unfortunately this is required because there is no mechanism
   available to discover this programmatically.

 - Enable wakeup sources on the Renesas RZV2H driver

 - Convert interrupt chip drivers, which use an open coded variant of
   msi_create_parent_irq_domain() to use the new functionality

 - Convert interrupt chip drivers, which use the old style two level
   implementation of MSI support over to the MSI parent mechanism to
   prepare for removing at least one of the three PCI/MSI backend
   variants.

 - The usual cleanups and improvements all over the place

* tag 'irq-drivers-2025-07-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (28 commits)
  irqchip/renesas-irqc: Convert to DEFINE_SIMPLE_DEV_PM_OPS()
  irqchip/renesas-intc-irqpin: Convert to DEFINE_SIMPLE_DEV_PM_OPS()
  irqchip/riscv-imsic: Add kernel parameter to disable IPIs
  irqchip/gic-v3: Fix GICD_CTLR register naming
  irqchip/ls-scfg-msi: Fix NULL dereference in error handling
  irqchip/ls-scfg-msi: Switch to use msi_create_parent_irq_domain()
  irqchip/armada-370-xp: Switch to msi_create_parent_irq_domain()
  irqchip/alpine-msi: Switch to msi_create_parent_irq_domain()
  irqchip/alpine-msi: Convert to __free
  irqchip/alpine-msi: Convert to lock guards
  irqchip/alpine-msi: Clean up whitespace style
  irqchip/sg2042-msi: Switch to msi_create_parent_irq_domain()
  irqchip/loongson-pch-msi.c: Switch to msi_create_parent_irq_domain()
  irqchip/imx-mu-msi: Convert to msi_create_parent_irq_domain() helper
  irqchip/riscv-imsic: Convert to msi_create_parent_irq_domain() helper
  irqchip/bcm2712-mip: Switch to msi_create_parent_irq_domain()
  irqdomain: Add device pointer to irq_domain_info and msi_domain_info
  irqchip/renesas-rzv2h: Remove unneeded includes
  irqchip/renesas-rzv2h: Enable SKIP_SET_WAKE and MASK_ON_SUSPEND
  irqchip/aclint-sswi: Resolve hart index
  ...
2025-07-29 13:26:05 -07:00
Linus Torvalds b34111a89f A set of updates for SMP function calls:
- Improve locality of smp_call_function_any() by utilizing
     sched_numa_find_nth_cpu() instead of picking a random CPU
 
   - Wait for work completion in smp_call_function_many_cond() only when
     there was actually work enqueued
 
   - Simplify functions by utilizing the appropriate cpumask_*()
     interfaces
 
   - Trivial cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmiGkfQTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoVFFD/9OyKVhAlk3fP4PJG3VBZs/8IDp52Wo
 vXHZPAyjRm0mtgonmRKQfNh9Xow6/ISiSxoE6yy98aEXRnzPgygHpwZfVwpEP5Q+
 Ys0Y6DpaDW2Uw+a9qfBvpnEawmWK+b5N58ApLSMbabv6MdZhElI2SEjZKtqTda0j
 161nRGADXPYm6uIw2kbAGseHpTslKCqTLdMHvvCnSx2Qa6Otw3VMWlYBpsOoqf7n
 9+OA7rwpSArjgjGHJJKgwtdRfvobIYReEWUXOP6QF7Vgm4H5i9kgvD7NuFCa9Ykv
 2kZnknuIplp9V+AvSsFjMu+RdxpktlL348Pnl6tZdjYrHQrgCWjhb11aD8gi8pb5
 sdqAupJ2+N7woqfwuKFuzcEBjnjSbV0Jeks8GDQzuWOiniMn4BCj3qWPtIszZ80z
 YddgGXf4RNJjytWjMyohh472YBQ+O3rlvVDmR011GnNdIphl8ovrtI9r+Ra6FwVg
 eHmjr8yGjzmntay6KjbP+iQVjzqCFz6Lz7kTQBXGP3MPcd7du9R7KBGY6rm1+FJ5
 3D4yIxIgK9sWg5GEr//1fdoi9wIrxsAfvgIsqpliwpHZ7wScyG98Iq74QsPGoimP
 LgTHkHsxcMnsaHM8lLTo4mArbunQJTFtx/lYRk++lj1jfqxlLNUXmH6mQmKC+fla
 Jz6duXcmFOoI3A==
 =dFmz
 -----END PGP SIGNATURE-----

Merge tag 'smp-core-2025-07-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull smp updates from Thomas Gleixner:
 "A set of updates for SMP function calls:

   - Improve locality of smp_call_function_any() by utilizing
     sched_numa_find_nth_cpu() instead of picking a random CPU

   - Wait for work completion in smp_call_function_many_cond() only when
     there was actually work enqueued

    - Simplify functions by utilizing the appropriate cpumask_*()
     interfaces

   - Trivial cleanups"

* tag 'smp-core-2025-07-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  smp: Wait only if work was enqueued
  smp: Defer check for local execution in smp_call_function_many_cond()
  smp: Use cpumask_any_but() in smp_call_function_many_cond()
  smp: Improve locality in smp_call_function_any()
  smp: Fix typo in comment for raw_smp_processor_id()
2025-07-29 13:00:20 -07:00
Linus Torvalds dba3ec9f2a Updates for the interrupt core subsystem:
- Prevent an interrupt migration related live lock in handle_edge_irq()
 
     If the interrupt affinity is moved to a new target CPU and the
     interrupt is currently handled on the previous target CPU for edge type
     interrupts the handler might get stuck on the previous target for a
     long time, which causes both involved CPUs to waste cycles and
     eventually run into a soft-lockup situation.
 
     Solve this by checking whether the interrupt is redirected to a new
     target CPU and if the interrupt is handled on that new target CPU, busy
     wait for completion instead of masking it and resending it, which
     would cause the old CPU to re-run the handler and in the worst case
     repeat this exercise for a long time.
 
     This only works on architectures which use single CPU interrupt
     targets, but those are so far the only ones where this behaviour has been
     observed.
 
   - Add a kunit test for interrupt disable depth counts
 
     The nested interrupt disable depth has been an issue in the past
     especially vs. free_irq(), interrupt shutdown and CPU hotplug and their
     interactions. The test exercises the combinations of these scenarios
     and checks for correctness.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmiGjEUTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoWemD/0XUR2xjN5pdlJSDqSehKfUbI2psqij
 TPWyNmttyw3vLAkzQZjlsrTANfJq8XnEvAYZSK/Qu+4ogQvTYlw3FPoesWLDScfD
 yUN0vKnKM8xJ04YRSKBOWcbLve/MjwbAp77PTPalOTF8wpcj2InoawwjD7pTcWwL
 8PG0cW26Rx6iuIFVfFhv+QMUZC+oS/66CfQTQ/Uf+02mAOhprs7I5pgINb5F+drQ
 XR4TyMnaePmZ1i8ULa4SPnw7AzfPRhJM/m+jhA2NZJwbpTGnKGueE7SLJ//wX08p
 woT3cf7DQCHeMPIZQAOLJ181LWbHz7EJFxJ+7Ba/FWP/X/Ep0EG9BwuK5Gg6/Auj
 73xEnria3+Nw1hM6hlfRVX0+osizO35X91q63QUIejYDtM9Y9ByOjmi1o334sVjl
 iFLq2jAw8H0phYia67kDtoPmc94GsvqLJwRt5JtxN6MRF5L1CspqEveQOJuoO/Tg
 ZlToL+U16bkf7PI25e6d1t9OjqeK5GhOL3/Ygt8HwO9m4jIE7JCIIlMnQXpxl54I
 s5H467lJwVqeiOqiU4UWhPM++hvxCE0/FuOZHAgbC4/fYPR/+WkZdEfyA/Rhz88g
 B4yhq5+SBE8cMr0kh2e7c6mC505KICCEeyAU3iadfKlu3bhd7sVfFrgjRLoiCAym
 XC+Vi+Q7lu0J7w==
 =IgyG
 -----END PGP SIGNATURE-----

Merge tag 'irq-core-2025-07-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq updates from Thomas Gleixner:

 - Prevent an interrupt migration related live lock in handle_edge_irq()

   If the interrupt affinity is moved to a new target CPU and the
   interrupt is currently handled on the previous target CPU for edge
   type interrupts the handler might get stuck on the previous target
   for a long time, which causes both involved CPUs to waste cycles and
   eventually run into a soft-lockup situation.

   Solve this by checking whether the interrupt is redirected to a new
   target CPU and if the interrupt is handled on that new target CPU,
   busy wait for completion instead of masking it and resending it,
   which would cause the old CPU to re-run the handler and in the worst
   case repeat this exercise for a long time.

   This only works on architectures which use single CPU interrupt
   targets, but those are so far the only ones where this behaviour has
   been observed.

 - Add a kunit test for interrupt disable depth counts

   The nested interrupt disable depth has been an issue in the past
   especially vs. free_irq(), interrupt shutdown and CPU hotplug and
   their interactions. The test exercises the combinations of these
   scenarios and checks for correctness.

* tag 'irq-core-2025-07-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq: Prevent migration live lock in handle_edge_irq()
  genirq: Split up irq_pm_check_wakeup()
  genirq: Move irq_wait_for_poll() to call site
  genirq: Remove pointless local variable
  genirq: Add kunit tests for depth counts
2025-07-29 12:55:12 -07:00
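
A standalone model of the handle_edge_irq() change described above (flag and helper names are made up; the real logic lives in the genirq core): on single-target architectures, the CPU the interrupt migrated to now spins until the old target leaves the handler, instead of masking the interrupt and resending it.

    #include <stdatomic.h>

    struct model_irq {
            atomic_bool inprogress;         /* handler running on old CPU */
            atomic_bool redirected;         /* affinity moved to this CPU */
    };

    static void model_edge_irq(struct model_irq *irq)
    {
            if (atomic_load(&irq->inprogress)) {
                    if (!atomic_load(&irq->redirected))
                            return;         /* old path: mask + mark pending */

                    /* New target CPU: busy wait for the old target to
                       finish rather than bouncing a pending/resend cycle
                       between the two CPUs, which could live lock both. */
                    while (atomic_load(&irq->inprogress))
                            ;
            }
            /* ... run the handler ... */
    }
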