The pattern of checking nocb_defer_wakeup and deleting the timer is
duplicated in __wake_nocb_gp() and nocb_gp_wait(). Extract this into a
common helper function nocb_defer_wakeup_cancel().
This removes code duplication and makes it easier to maintain.
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
During callback overload (exceeding qhimark), the NOCB code attempts
opportunistic advancement via rcu_advance_cbs_nowake(). Analysis shows
this code path is practically unreachable and serves no useful purpose.
Testing with 300,000 callback floods showed:
- 30 overload conditions triggered
- 0 advancements actually occurred
While a theoretical window exists where this code could execute (e.g.,
vCPU preemption between gp_seq update and rcu_nocb_gp_cleanup()), even
if it did, the advancement would be redundant. The rcuog kthread must
still run to wake the rcuoc callback thread - we would just be
duplicating work that rcuog will perform when it finally gets to run.
Since this path provides no meaningful benefit and extensive testing
confirms it is never useful, remove it entirely.
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
The WakeOvfIsDeferred code path in __call_rcu_nocb_wake() attempts to
wake rcuog when the callback count exceeds qhimark and callbacks aren't
done with their GP (newly queued or awaiting GP). However, a lot of
testing proves this wake is always redundant or useless.
In the flooding case, rcuog is always waiting for a GP to finish. So
waking up the rcuog thread is pointless. The timer wakeup adds overhead,
rcuog simply wakes up and goes back to sleep achieving nothing.
This path also adds a full memory barrier, and additional timer expiry
modifications unnecessarily.
The root cause is that WakeOvfIsDeferred fires when
!rcu_segcblist_ready_cbs() (GP not complete), but waking rcuog cannot
accelerate GP completion.
This commit therefore removes this path.
Tested with rcutorture scenarios: TREE01, TREE05, TREE08 (all NOCB
configurations) - all pass. Also stress tested using a kernel module
that floods call_rcu() to trigger the overload conditions and made the
observations confirming the findings.
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
When funcgraph-args and funcgraph-retaddr are both enabled, many kernel
functions display invalid parameters in trace logs.
The issue occurs because print_graph_retval() passes a mismatched args
pointer to print_function_args(). Fix this by retrieving the correct
args pointer using the FGRAPH_ENTRY_ARGS() macro.
Link: https://patch.msgid.link/20260112021601.1300479-1-dolinux.peng@gmail.com
Fixes: f83ac7544f ("function_graph: Enable funcgraph-args and funcgraph-retaddr to work simultaneously")
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Donglin Peng <pengdonglin@xiaomi.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
64-bit truncation to 32-bit can result in the sign of the truncated
value changing. The cmp_mod_entry is used in bsearch and so the
truncation could result in an invalid search order. This would only
happen were the addresses more than 2GB apart and so unlikely, but
let's fix the potentially broken compare anyway.
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260108002625.333331-1-irogers@google.com
Signed-off-by: Ian Rogers <irogers@google.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
When creating a synthetic event based on an existing synthetic event that
had a stacktrace field and the new synthetic event used that field a
kernel crash occurred:
~# cd /sys/kernel/tracing
~# echo 's:stack unsigned long stack[];' > dynamic_events
~# echo 'hist:keys=prev_pid:s0=common_stacktrace if prev_state & 3' >> events/sched/sched_switch/trigger
~# echo 'hist:keys=next_pid:s1=$s0:onmatch(sched.sched_switch).trace(stack,$s1)' >> events/sched/sched_switch/trigger
The above creates a synthetic event that takes a stacktrace when a task
schedules out in a non-running state and passes that stacktrace to the
sched_switch event when that task schedules back in. It triggers the
"stack" synthetic event that has a stacktrace as its field (called "stack").
~# echo 's:syscall_stack s64 id; unsigned long stack[];' >> dynamic_events
~# echo 'hist:keys=common_pid:s2=stack' >> events/synthetic/stack/trigger
~# echo 'hist:keys=common_pid:s3=$s2,i0=id:onmatch(synthetic.stack).trace(syscall_stack,$i0,$s3)' >> events/raw_syscalls/sys_exit/trigger
The above makes another synthetic event called "syscall_stack" that
attaches the first synthetic event (stack) to the sys_exit trace event and
records the stacktrace from the stack event with the id of the system call
that is exiting.
When enabling this event (or using it in a historgram):
~# echo 1 > events/synthetic/syscall_stack/enable
Produces a kernel crash!
BUG: unable to handle page fault for address: 0000000000400010
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: Oops: 0000 [#1] SMP PTI
CPU: 6 UID: 0 PID: 1257 Comm: bash Not tainted 6.16.3+deb14-amd64 #1 PREEMPT(lazy) Debian 6.16.3-1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.17.0-debian-1.17.0-1 04/01/2014
RIP: 0010:trace_event_raw_event_synth+0x90/0x380
Code: c5 00 00 00 00 85 d2 0f 84 e1 00 00 00 31 db eb 34 0f 1f 00 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 2e 0f 1f 84 00 00 00 00 00 <49> 8b 04 24 48 83 c3 01 8d 0c c5 08 00 00 00 01 cd 41 3b 5d 40 0f
RSP: 0018:ffffd2670388f958 EFLAGS: 00010202
RAX: ffff8ba1065cc100 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: fffff266ffda7b90 RDI: ffffd2670388f9b0
RBP: 0000000000000010 R08: ffff8ba104e76000 R09: ffffd2670388fa50
R10: ffff8ba102dd42e0 R11: ffffffff9a908970 R12: 0000000000400010
R13: ffff8ba10a246400 R14: ffff8ba10a710220 R15: fffff266ffda7b90
FS: 00007fa3bc63f740(0000) GS:ffff8ba2e0f48000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000400010 CR3: 0000000107f9e003 CR4: 0000000000172ef0
Call Trace:
<TASK>
? __tracing_map_insert+0x208/0x3a0
action_trace+0x67/0x70
event_hist_trigger+0x633/0x6d0
event_triggers_call+0x82/0x130
trace_event_buffer_commit+0x19d/0x250
trace_event_raw_event_sys_exit+0x62/0xb0
syscall_exit_work+0x9d/0x140
do_syscall_64+0x20a/0x2f0
? trace_event_raw_event_sched_switch+0x12b/0x170
? save_fpregs_to_fpstate+0x3e/0x90
? _raw_spin_unlock+0xe/0x30
? finish_task_switch.isra.0+0x97/0x2c0
? __rseq_handle_notify_resume+0xad/0x4c0
? __schedule+0x4b8/0xd00
? restore_fpregs_from_fpstate+0x3c/0x90
? switch_fpu_return+0x5b/0xe0
? do_syscall_64+0x1ef/0x2f0
? do_fault+0x2e9/0x540
? __handle_mm_fault+0x7d1/0xf70
? count_memcg_events+0x167/0x1d0
? handle_mm_fault+0x1d7/0x2e0
? do_user_addr_fault+0x2c3/0x7f0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
The reason is that the stacktrace field is not labeled as such, and is
treated as a normal field and not as a dynamic event that it is.
In trace_event_raw_event_synth() the event is field is still treated as a
dynamic array, but the retrieval of the data is considered a normal field,
and the reference is just the meta data:
// Meta data is retrieved instead of a dynamic array
str_val = (char *)(long)var_ref_vals[val_idx];
// Then when it tries to process it:
len = *((unsigned long *)str_val) + 1;
It triggers a kernel page fault.
To fix this, first when defining the fields of the first synthetic event,
set the filter type to FILTER_STACKTRACE. This is used later by the second
synthetic event to know that this field is a stacktrace. When creating
the field of the new synthetic event, have it use this FILTER_STACKTRACE
to know to create a stacktrace field to copy the stacktrace into.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Link: https://patch.msgid.link/20260122194824.6905a38e@gandalf.local.home
Fixes: 00cf3d672a ("tracing: Allow synthetic events to pass around stacktraces")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The TAS fallback can be invoked directly when queued spin locks are
disabled, and through the slow path when paravirt is enabled for queued
spin locks. In the latter case, the res_spin_lock macro will attempt the
fast path and already hold the entry when entering the slow path. This
will lead to creation of extraneous entries that are not released, which
may cause false positives for deadlock detection.
Fix this by always preceding invocation of the TAS fallback in every
case with the grabbing of the held lock entry, and add a comment to make
note of this.
Fixes: c9102a68c0 ("rqspinlock: Add a test-and-set fallback")
Reported-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Tested-by: Amery Hung <ameryhung@gmail.com>
Link: https://lore.kernel.org/r/20260122115911.3668985-1-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This agressively bypasses run_to_parity and slice protection with the
assumpiton that this is what waker wants but there is no garantee that
the wakee will be the next to run. It is a better choice to use
yield_to_task or WF_SYNC in such case.
This increases the number of resched and preemption because a task becomes
quickly "ineligible" when it runs; We update the task vruntime periodically
and before the task exhausted its slice or at least quantum.
Example:
2 tasks A and B wake up simultaneously with lag = 0. Both are
eligible. Task A runs 1st and wakes up task C. Scheduler updates task
A's vruntime which becomes greater than average runtime as all others
have a lag == 0 and didn't run yet. Now task A is ineligible because
it received more runtime than the other task but it has not yet
exhausted its slice nor a min quantum. We force preemption, disable
protection but Task B will run 1st not task C.
Sidenote, DELAY_ZERO increases this effect by clearing positive lag at
wake up.
Fixes: e837456fdc ("sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals")
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260123102858.52428-1-vincent.guittot@linaro.org
NEXT_BUDDY was disabled with the introduction of EEVDF and enabled again
after NEXT_BUDDY was rewritten for EEVDF by commit e837456fdc ("sched/fair:
Reimplement NEXT_BUDDY to align with EEVDF goals"). It was not expected
that this would be a universal win without a crystal ball instruction
but the reported regressions are a concern [1][2] even if gains were
also reported. Specifically;
o mysql with client/server running on different servers regresses
o specjbb reports lower peak metrics
o daytrader regresses
The mysql is realistic and a concern. It needs to be confirmed if
specjbb is simply shifting the point where peak performance is measured
but still a concern. daytrader is considered to be representative of a
real workload.
Access to test machines is currently problematic for verifying any fix to
this problem. Disable NEXT_BUDDY for now by default until the root causes
are addressed.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Link: https://lore.kernel.org/lkml/4b96909a-f1ac-49eb-b814-97b8adda6229@arm.com [1]
Link: https://lore.kernel.org/lkml/ec3ea66f-3a0d-4b5a-ab36-ce778f159b5b@linux.ibm.com [2]
Link: https://patch.msgid.link/fyqsk63pkoxpeaclyqsm5nwtz3dyejplr7rg6p74xwemfzdzuu@7m7xhs5aqpqw
These structs are never modified.
To prevent malicious or accidental modifications due to bugs,
mark them as const.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
When the kallsyms relative base was introduced, per-CPU variable
references on x86_64 SMP were implemented as offsets into the respective
per-CPU region, rather than offsets relative to the location of the
variable's template in the kernel image, which is how other
architectures implement it.
This required kallsyms to reason about the difference between the two,
and the sign of the value in the kallsyms_offsets[] array was used to
distinguish them. This meant that negative offsets were not permitted
for ordinary variables, and so it was crucial that the relative base was
chosen such that all offsets were positive numbers.
This is no longer needed: instead, the offsets can simply be encoded as
values in the range -/+ 2 GiB, which is precisely what PC32 relocations
provide on most architectures. So it is possible to simplify the logic,
and just use _text as the anchor directly, and let the linker calculate
the final value based on the location of the entry itself.
Some architectures (nios2, extensa) do not support place-relative
relocations at all, but these are all 32-bit and non-relocatable, and so
there is no need for place-relative relocations in the first place, and
the actual symbol values can just be stored directly.
This makes all entries in the kallsyms_offsets[] array visible as
place-relative references in the ELF metadata, which will be important
when implementing ELF-based fg-kaslr.
Reviewed-by: Kees Cook <kees@kernel.org>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://patch.msgid.link/20260116093359.2442297-6-ardb+git@google.com
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
1) The commit:
2b8272ff4a ("cpu/hotplug: Prevent self deadlock on CPU hot-unplug")
was added to fix an issue where the hotplug control task (BP) was
throttled between CPUHP_AP_IDLE_DEAD and CPUHP_HRTIMERS_PREPARE waiting
in the hrtimer blindspot for the bandwidth callback queued in the dead
CPU.
2) Later on, the commit:
38685e2a04 ("cpu/hotplug: Don't offline the last non-isolated CPU")
plugged on the target selection for the workqueue offloaded CPU down
process to prevent from destroying the last CPU domain.
3) Finally:
5c0930ccaa ("hrtimers: Push pending hrtimers away from outgoing CPU earlier")
removed entirely the conditions for the race exposed and partially fixed
in 1). The offloading of the CPU down process to a workqueue on another
CPU then becomes unnecessary. But the last CPU belonging to scheduler
domains must still remain online.
Therefore revert the now obsolete commit
2b8272ff4a and move the housekeeping check
under the cpu_hotplug_lock write held. Since HK_TYPE_DOMAIN will include
both isolcpus and cpuset isolated partition, the hotplug lock will
synchronize against concurrent cpuset partition updates.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Currently, rq->idle_stamp is only used to calculate avg_idle during
wakeups. This means other paths that move a task to an idle CPU such as
fork/clone, execve, or migrations, do not end the CPU's idle status in
the scheduler's eyes, leading to an inaccurate avg_idle.
This patch introduces update_rq_avg_idle() to provide a more accurate
measurement of CPU idle duration. By invoking this helper in
put_prev_task_idle(), we ensure avg_idle is updated whenever a CPU
stops being idle, regardless of how the new task arrived.
Testing on an 80-core Ampere Altra (ARMv8) with 6.19-rc5 baseline:
- Hackbench : +7.2% performance gain at 16 threads.
- Schbench: Reduced p99.9 tail latencies at high concurrency.
Signed-off-by: Shubhang Kaushik <shubhang@os.amperecomputing.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Shubhang Kaushik <shubhang@os.amperecomputing.com>
Link: https://patch.msgid.link/20260121-v8-patch-series-v8-1-b7f1cbee5055@os.amperecomputing.com
It turns out that __run_hrtimer() will trace like:
<idle>-0 [032] d.h2. 20705.474563: hrtimer_cancel: hrtimer=0xff2db8f77f8226e8
<idle>-0 [032] d.h1. 20705.474563: hrtimer_expire_entry: hrtimer=0xff2db8f77f8226e8 now=20699452001850 function=tick_nohz_handler/0x0
Which is a bit nonsensical, the timer doesn't get canceled on
expiration. The cause is the use of the incorrect debug helper.
Fixes: c6a2a17702 ("hrtimer: Add tracepoint for hrtimers")
Reported-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260121143208.219595606@infradead.org
Change the minimum slice extension to 5 usec.
Since slice_test selftest reaches a staggering ~350 nsec extension:
Task: slice_test Mean: 350.266 ns
Latency (us) | Count
------------------------------
EXPIRED | 238
0 us | 143189
1 us | 167
2 us | 26
3 us | 11
4 us | 28
5 us | 31
6 us | 22
7 us | 23
8 us | 32
9 us | 16
10 us | 35
Lower the minimal (and default) value to 5 usecs -- which is still massive.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260121143208.073200729@infradead.org
Since glibc cares about the number of syscalls required to initialize a new
thread, allow initializing rseq with slice extension on. This avoids having to
do another prctl().
Requested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260121143207.814193010@infradead.org
If a time slice extension is granted and the reschedule delayed, the kernel
has to ensure that user space cannot abuse the extension and exceed the
maximum granted time.
It was suggested to implement this via the existing hrtick() timer in the
scheduler, but that turned out to be problematic for several reasons:
1) It creates a dependency on CONFIG_SCHED_HRTICK, which can be disabled
independently of CONFIG_HIGHRES_TIMERS
2) HRTICK usage in the scheduler can be runtime disabled or is only used
for certain aspects of scheduling.
3) The function is calling into the scheduler code and that might have
unexpected consequences when this is invoked due to a time slice
enforcement expiry. Especially when the task managed to clear the
grant via sched_yield(0).
It would be possible to address #2 and #3 by storing state in the
scheduler, but that is extra complexity and fragility for no value.
Implement a dedicated per CPU hrtimer instead, which is solely used for the
purpose of time slice enforcement.
The timer is armed when an extension was granted right before actually
returning to user mode in rseq_exit_to_user_mode_restart().
It is disarmed, when the task relinquishes the CPU. This is expensive as
the timer is probably the first expiring timer on the CPU, which means it
has to reprogram the hardware. But that's less expensive than going through
a full hrtimer interrupt cycle for nothing.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251215155709.068329497@linutronix.de
The kernel sets SYSCALL_WORK_RSEQ_SLICE when it grants a time slice
extension. This allows to handle the rseq_slice_yield() syscall, which is
used by user space to relinquish the CPU after finishing the critical
section for which it requested an extension.
In case the kernel state is still GRANTED, the kernel resets both kernel
and user space state with a set of sanity checks. If the kernel state is
already cleared, then this raced against the timer or some other interrupt
and just clears the work bit.
Doing it in syscall entry work allows to catch misbehaving user space,
which issues an arbitrary syscall, i.e. not rseq_slice_yield(), from the
critical section. Contrary to the initial strict requirement to use
rseq_slice_yield() arbitrary syscalls are not considered a violation of the
ABI contract anymore to allow onion architecture applications, which cannot
control the code inside a critical section, to utilize this as well.
If the code detects inconsistent user space that result in a SIGSEGV for
the application.
If the grant was still active and the task was not preempted yet, the work
code reschedules immediately before continuing through the syscall.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251215155709.005777059@linutronix.de
Provide a new syscall which has the only purpose to yield the CPU after the
kernel granted a time slice extension.
sched_yield() is not suitable for that because it unconditionally
schedules, but the end of the time slice extension is not required to
schedule when the task was already preempted. This also allows to have a
strict check for termination to catch user space invoking random syscalls
including sched_yield() from a time slice extension region.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Link: https://patch.msgid.link/20251215155708.929634896@linutronix.de
Implement a prctl() so that tasks can enable the time slice extension
mechanism. This fails, when time slice extensions are disabled at compile
time or on the kernel command line and when no rseq pointer is registered
in the kernel.
That allows to implement a single trivial check in the exit to user mode
hotpath, to decide whether the whole mechanism needs to be invoked.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251215155708.858717691@linutronix.de
Guard the time slice extension functionality with a static key, which can
be disabled on the kernel command line.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251215155708.733429292@linutronix.de
Aside of a Kconfig knob add the following items:
- Two flag bits for the rseq user space ABI, which allow user space to
query the availability and enablement without a syscall.
- A new member to the user space ABI struct rseq, which is going to be
used to communicate request and grant between kernel and user space.
- A rseq state struct to hold the kernel state of this
- Documentation of the new mechanism
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251215155708.669472597@linutronix.de
Using kstrtouint_from_user() instead of copy_from_user() + kstrtouint()
makes the code simpler and less error-prone.
Suggested-by: Yury Norov <ynorov@nvidia.com>
Signed-off-by: Fushuai Wang <wangfushuai@baidu.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Yury Norov <ynorov@nvidia.com>
Link: https://patch.msgid.link/20260117145615.53455-2-fushuai.wang@linux.dev
bpf_strncasecmp() function performs same like bpf_strcasecmp() except
limiting the comparison to a specific length.
Signed-off-by: Yuzuki Ishiyama <ishiyama@hpc.is.uec.ac.jp>
Acked-by: Viktor Malik <vmalik@redhat.com>
Acked-by: Mykyta Yatsenko <mykyta.yatsenko5@gmail.com>
Link: https://lore.kernel.org/r/20260121033328.1850010-2-ishiyama@hpc.is.uec.ac.jp
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
For now, bpf_get_func_arg() and bpf_get_func_arg_cnt() is not supported by
the BPF_TRACE_RAW_TP, which is not convenient to get the argument of the
tracepoint, especially for the case that the position of the arguments in
a tracepoint can change.
The target tracepoint BTF type id is specified during loading time,
therefore we can get the function argument count from the function
prototype instead of the stack.
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20260121044348.113201-2-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
When calling refcount_inc(&event->mmap_count) inside perf_mmap_rb(), the
following warning is triggered:
refcount_t: addition on 0; use-after-free.
WARNING: lib/refcount.c:25
PoC:
struct perf_event_attr attr = {0};
int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
mmap(NULL, 0x3000, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
int victim = syscall(__NR_perf_event_open, &attr, 0, -1, fd,
PERF_FLAG_FD_OUTPUT);
mmap(NULL, 0x3000, PROT_READ | PROT_WRITE, MAP_SHARED, victim, 0);
This occurs when creating a group member event with the flag
PERF_FLAG_FD_OUTPUT. The group leader should be mmap-ed and then mmap-ing
the event triggers the warning.
Since the event has copied the output_event in perf_event_set_output(),
event->rb is set. As a result, perf_mmap_rb() calls
refcount_inc(&event->mmap_count) when event->mmap_count = 0.
Disallow the case when event->mmap_count = 0. This also prevents two
events from updating the same user_page.
Fixes: 448f97fba9 ("perf: Convert mmap() refcounts to refcount_t")
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Will Rosenberg <whrosenb@asu.edu>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260119184956.801238-1-whrosenb@asu.edu
The "valid" readout delay between the two reads of the watchdog is larger
than the valid delta between the resulting watchdog and clocksource
intervals, which results in false positive watchdog results.
Assume TSC is the clocksource and HPET is the watchdog and both have a
uncertainty margin of 250us (default). The watchdog readout does:
1) wdnow = read(HPET);
2) csnow = read(TSC);
3) wdend = read(HPET);
The valid window for the delta between #1 and #3 is calculated by the
uncertainty margins of the watchdog and the clocksource:
m = 2 * watchdog.uncertainty_margin + cs.uncertainty margin;
which results in 750us for the TSC/HPET case.
The actual interval comparison uses a smaller margin:
m = watchdog.uncertainty_margin + cs.uncertainty margin;
which results in 500us for the TSC/HPET case.
That means the following scenario will trigger the watchdog:
Watchdog cycle N:
1) wdnow[N] = read(HPET);
2) csnow[N] = read(TSC);
3) wdend[N] = read(HPET);
Assume the delay between #1 and #2 is 100us and the delay between #1 and
Watchdog cycle N + 1:
4) wdnow[N + 1] = read(HPET);
5) csnow[N + 1] = read(TSC);
6) wdend[N + 1] = read(HPET);
If the delay between #4 and #6 is within the 750us margin then any delay
between #4 and #5 which is larger than 600us will fail the interval check
and mark the TSC unstable because the intervals are calculated against the
previous value:
wd_int = wdnow[N + 1] - wdnow[N];
cs_int = csnow[N + 1] - csnow[N];
Putting the above delays in place this results in:
cs_int = (wdnow[N + 1] + 610us) - (wdnow[N] + 100us);
-> cs_int = wd_int + 510us;
which is obviously larger than the allowed 500us margin and results in
marking TSC unstable.
Fix this by using the same margin as the interval comparison. If the delay
between two watchdog reads is larger than that, then the readout was either
disturbed by interconnect congestion, NMIs or SMIs.
Fixes: 4ac1dd3245 ("clocksource: Set cs_watchdog_read() checks based on .uncertainty_margin")
Reported-by: Daniel J Blueman <daniel@quora.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/lkml/20250602223251.496591-1-daniel@quora.org/
Link: https://patch.msgid.link/87bjjxc9dq.ffs@tglx
The current comment "Clear TID on mm_release()?" ends with a question
mark, implying uncertainty about whether the TID is actually cleared in
mm_release().
However, the code flow is deterministic. When a task exits, mm_release()
explicitly checks 'tsk->clear_child_tid' and clears.
Since this behavior is unambiguous, remove the confusing question mark and
rephrase the comment to clearly state that TID is cleared in mm_release().
Link: https://lkml.kernel.org/r/20251125000407.24470-1-s9430939@naver.com
Signed-off-by: Minu Jin <s9430939@naver.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mel Gorman <mgorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
kallsyms_lookup_buildid() copies the symbol name into the given buffer so
that it can be safely read anytime later. But it just copies pointers to
mod->name and mod->build_id which might get reused after the related
struct module gets removed.
The lifetime of struct module is synchronized using RCU. Take the rcu
read lock for the entire __sprint_symbol().
Link: https://lkml.kernel.org/r/20251128135920.217303-8-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Aaron Tomlin <atomlin@atomlin.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkman <daniel@iogearbox.net>
Cc: Daniel Gomez <da.gomez@samsung.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Luis Chamberalin <mcgrof@kernel.org>
Cc: Marc Rutland <mark.rutland@arm.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Petr Pavlu <petr.pavlu@suse.com>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
__sprint_symbol() might access an invalid pointer when
kallsyms_lookup_buildid() returns a symbol found by
ftrace_mod_address_lookup().
The ftrace lookup function must set both @modname and @modbuildid the same
way as module_address_lookup().
Link: https://lkml.kernel.org/r/20251128135920.217303-7-pmladek@suse.com
Fixes: 9294523e37 ("module: add printk formats to add module build ID to stacktraces")
Signed-off-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Aaron Tomlin <atomlin@atomlin.com>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkman <daniel@iogearbox.net>
Cc: Daniel Gomez <da.gomez@samsung.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Luis Chamberalin <mcgrof@kernel.org>
Cc: Marc Rutland <mark.rutland@arm.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Petr Pavlu <petr.pavlu@suse.com>
Cc: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
bpf_address_lookup() has been used only in kallsyms_lookup_buildid(). It
was supposed to set @modname and @modbuildid when the symbol was in a
module.
But it always just cleared @modname because BPF symbols were never in a
module. And it did not clear @modbuildid because the pointer was not
passed.
The wrapper is no longer needed. Both @modname and @modbuildid are now
always initialized to NULL in kallsyms_lookup_buildid().
Remove the wrapper and rename __bpf_address_lookup() to
bpf_address_lookup() because this variant is used everywhere.
[akpm@linux-foundation.org: fix loongarch]
Link: https://lkml.kernel.org/r/20251128135920.217303-6-pmladek@suse.com
Fixes: 9294523e37 ("module: add printk formats to add module build ID to stacktraces")
Signed-off-by: Petr Mladek <pmladek@suse.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Aaron Tomlin <atomlin@atomlin.com>
Cc: Daniel Borkman <daniel@iogearbox.net>
Cc: Daniel Gomez <da.gomez@samsung.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Luis Chamberalin <mcgrof@kernel.org>
Cc: Marc Rutland <mark.rutland@arm.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Petr Pavlu <petr.pavlu@suse.com>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Put the code for appending the optional "buildid" into a helper function,
It makes __sprint_symbol() better readable.
Also print a warning when the "modname" is set and the "buildid" isn't.
It might catch a situation when some lookup function in
kallsyms_lookup_buildid() does not handle the "buildid".
Use pr_*_once() to avoid an infinite recursion when the function is called
from printk(). The recursion is rather theoretical but better be on the
safe side.
Link: https://lkml.kernel.org/r/20251128135920.217303-5-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Cc: Aaron Tomlin <atomlin@atomlin.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkman <daniel@iogearbox.net>
Cc: Daniel Gomez <da.gomez@samsung.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Luis Chamberalin <mcgrof@kernel.org>
Cc: Marc Rutland <mark.rutland@arm.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Petr Pavlu <petr.pavlu@suse.com>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Add a helper function for reading the optional "build_id" member of struct
module. It is going to be used also in ftrace_mod_address_lookup().
Use "#ifdef" instead of "#if IS_ENABLED()" to match the declaration of the
optional field in struct module.
Link: https://lkml.kernel.org/r/20251128135920.217303-4-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Daniel Gomez <da.gomez@samsung.com>
Reviewed-by: Petr Pavlu <petr.pavlu@suse.com>
Cc: Aaron Tomlin <atomlin@atomlin.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkman <daniel@iogearbox.net>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Luis Chamberalin <mcgrof@kernel.org>
Cc: Marc Rutland <mark.rutland@arm.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The @modname and @modbuildid optional return parameters are set only when
the symbol is in a module.
Always initialize them so that they do not need to be cleared when the
module is not in a module. It simplifies the logic and makes the code
even slightly more safe.
Note that bpf_address_lookup() function will get updated in a separate
patch.
Link: https://lkml.kernel.org/r/20251128135920.217303-3-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Cc: Aaron Tomlin <atomlin@atomlin.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkman <daniel@iogearbox.net>
Cc: Daniel Gomez <da.gomez@samsung.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Luis Chamberalin <mcgrof@kernel.org>
Cc: Marc Rutland <mark.rutland@arm.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Petr Pavlu <petr.pavlu@suse.com>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "kallsyms: Prevent invalid access when showing module
buildid", v3.
We have seen nested crashes in __sprint_symbol(), see below. They seem to
be caused by an invalid pointer to "buildid". This patchset cleans up
kallsyms code related to module buildid and fixes this invalid access when
printing backtraces.
I made an audit of __sprint_symbol() and found several situations
when the buildid might be wrong:
+ bpf_address_lookup() does not set @modbuildid
+ ftrace_mod_address_lookup() does not set @modbuildid
+ __sprint_symbol() does not take rcu_read_lock and
the related struct module might get removed before
mod->build_id is printed.
This patchset solves these problems:
+ 1st, 2nd patches are preparatory
+ 3rd, 4th, 6th patches fix the above problems
+ 5th patch cleans up a suspicious initialization code.
This is the backtrace, we have seen. But it is not really important.
The problems fixed by the patchset are obvious:
crash64> bt [62/2029]
PID: 136151 TASK: ffff9f6c981d4000 CPU: 367 COMMAND: "btrfs"
#0 [ffffbdb687635c28] machine_kexec at ffffffffb4c845b3
#1 [ffffbdb687635c80] __crash_kexec at ffffffffb4d86a6a
#2 [ffffbdb687635d08] hex_string at ffffffffb51b3b61
#3 [ffffbdb687635d40] crash_kexec at ffffffffb4d87964
#4 [ffffbdb687635d50] oops_end at ffffffffb4c41fc8
#5 [ffffbdb687635d70] do_trap at ffffffffb4c3e49a
#6 [ffffbdb687635db8] do_error_trap at ffffffffb4c3e6a4
#7 [ffffbdb687635df8] exc_stack_segment at ffffffffb5666b33
#8 [ffffbdb687635e20] asm_exc_stack_segment at ffffffffb5800cf9
...
This patch (of 7)
The function kallsyms_lookup_buildid() initializes the given @namebuf by
clearing the first and the last byte. It is not clear why.
The 1st byte makes sense because some callers ignore the return code and
expect that the buffer contains a valid string, for example:
- function_stat_show()
- kallsyms_lookup()
- kallsyms_lookup_buildid()
The initialization of the last byte does not make much sense because it
can later be overwritten. Fortunately, it seems that all called functions
behave correctly:
- kallsyms_expand_symbol() explicitly adds the trailing '\0'
at the end of the function.
- All *__address_lookup() functions either use the safe strscpy()
or they do not touch the buffer at all.
Document the reason for clearing the first byte. And remove the useless
initialization of the last byte.
Link: https://lkml.kernel.org/r/20251128135920.217303-2-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Aaron Tomlin <atomlin@atomlin.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkman <daniel@iogearbox.net>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Luis Chamberalin <mcgrof@kernel.org>
Cc: Marc Rutland <mark.rutland@arm.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Petr Pavlu <petr.pavlu@suse.com>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Daniel Gomez <da.gomez@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The softlockup_panic sysctl is currently a binary option: panic
immediately or never panic on soft lockups.
Panicking on any soft lockup, regardless of duration, can be overly
aggressive for brief stalls that may be caused by legitimate operations.
Conversely, never panicking may allow severe system hangs to persist
undetected.
Extend softlockup_panic to accept an integer threshold, allowing the
kernel to panic only when the normalized lockup duration exceeds N
watchdog threshold periods. This provides finer-grained control to
distinguish between transient delays and persistent system failures.
The accepted values are:
- 0: Don't panic (unchanged)
- 1: Panic when duration >= 1 * threshold (20s default, original behavior)
- N > 1: Panic when duration >= N * threshold (e.g., 2 = 40s, 3 = 60s.)
The original behavior is preserved for values 0 and 1, maintaining full
backward compatibility while allowing systems to tolerate brief lockups
while still catching severe, persistent hangs.
[lirongqing@baidu.com: v2]
Link: https://lkml.kernel.org/r/20251218074300.4080-1-lirongqing@baidu.com
Link: https://lkml.kernel.org/r/20251216074521.2796-1-lirongqing@baidu.com
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Cc: Eduard Zingerman <eddyz87@gmail.com>
Cc: Hao Luo <haoluo@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Song Liu <song@kernel.org>
Cc: Stanislav Fomichev <sdf@fomichev.me>
Cc: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
kimage_crash_copy_vmcoreinfo() currently assumes vmcoreinfo fits in a
single page. This breaks if VMCOREINFO_BYTES exceeds PAGE_SIZE.
Allocate the required order of control pages and vmap all pages needed to
safely copy vmcoreinfo into the crash kernel image.
Link: https://lkml.kernel.org/r/20251216132801.807260-3-pnina.feder@mobileye.com
Signed-off-by: Pnina Feder <pnina.feder@mobileye.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "vmcoreinfo: support VMCOREINFO_BYTES larger than PAGE_SIZE".
VMCOREINFO_BYTES is defined as a configurable size, but multiple
code paths implicitly assume it always fits into a single page.
This series removes that assumption by allocating and mapping
vmcoreinfo based on its actual size.
Patch 1 updates vmcoreinfo allocation to use get_order(VMCOREINFO_BYTES).
Patch 2 updates crash kernel handling to correctly allocate and map
multiple pages when copying vmcoreinfo.
This makes vmcoreinfo size consistent across the kernel and avoids
future breakage if VMCOREINFO_BYTES grows.
(No functional change when VMCOREINFO_BYTES == PAGE_SIZE.)
This patch (of 2):
VMCOREINFO_BYTES defines the size of vmcoreinfo data, but the current
implementation assumes a single page allocation.
Allocate vmcoreinfo_data using get_order(VMCOREINFO_BYTES) so that
vmcoreinfo can safely grow beyond PAGE_SIZE.
This avoids hidden assumptions and keeps vmcoreinfo size consistent across
the kernel.
Link: https://lkml.kernel.org/r/20251216132801.807260-1-pnina.feder@mobileye.com
Link: https://lkml.kernel.org/r/20251216132801.807260-2-pnina.feder@mobileye.com
Signed-off-by: Pnina Feder <pnina.feder@mobileye.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We were wasting a byte due to an off-by-one bug. s[c]nprintf() doesn't
write more than $2 bytes including the null byte, so trying to pass
'size-1' there is wasting one byte.
This is essentially the same as the previous commit, in a different
file.
Link: https://lkml.kernel.org/r/b4a945a4d40b7104364244f616eb9fb9f1fa691f.1765449750.git.alx@kernel.org
Signed-off-by: Alejandro Colomar <alx@kernel.org>
Cc: Marco Elver <elver@google.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Christopher Bazley <chris.bazley.wg14@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Marco Elver <elver@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Maciej W. Rozycki <macro@orcam.me.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Remove <linux/hex.h> from <linux/kernel.h> and update all users/callers of
hex.h interfaces to directly #include <linux/hex.h> as part of the process
of putting kernel.h on a diet.
Removing hex.h from kernel.h means that 36K C source files don't have to
pay the price of parsing hex.h for the roughly 120 C source files that
need it.
This change has been build-tested with allmodconfig on most ARCHes. Also,
all users/callers of <linux/hex.h> in the entire source tree have been
updated if needed (if not already #included).
Link: https://lkml.kernel.org/r/20251215005206.2362276-1-rdunlap@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Yury Norov (NVIDIA) <yury.norov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
'struct configfs_item_operations' and 'configfs_group_operations' are not
modified in this driver.
Constifying these structures moves some data to a read-only section, so
increases overall security, especially when the structure holds some
function pointers.
On a x86_64, with allmodconfig, as an example:
Before:
======
text data bss dec hex filename
16339 11001 384 27724 6c4c kernel/crash_dump_dm_crypt.o
After:
=====
text data bss dec hex filename
16499 10841 384 27724 6c4c kernel/crash_dump_dm_crypt.o
Link: https://lkml.kernel.org/r/d046ee5666d2f6b1a48ca1a222dfbd2f7c44462f.1765735035.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Coiby Xu <coxu@redhat.com>
Tested-by: Coiby Xu <coxu@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Remove lock from the bpf_timer_cancel() helper. The lock does not
protect from concurrent modification of the bpf_async_cb data fields as
those are modified in the callback without locking.
Use guard(rcu)() instead of pair of explicit lock()/unlock().
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Link: https://lore.kernel.org/r/20260120-timer_nolock-v6-4-670ffdd787b4@meta.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Introduce bpf_async_update_prog_callback(): lock-free update of cb->prog
and cb->callback_fn. This function allows updating prog and callback_fn
fields of the struct bpf_async_cb without holding lock.
For now use it under the lock from __bpf_async_set_callback(), in the
next patches that lock will be removed.
Lock-free algorithm:
* Acquire a guard reference on prog to prevent it from being freed
during the retry loop.
* Retry loop:
1. Each iteration acquires a new prog reference and stores it
in cb->prog via xchg. The previous prog is released.
2. The loop condition checks if both cb->prog and cb->callback_fn
match what we just wrote. If either differs, a concurrent writer
overwrote our value, and we must retry.
3. When we retry, our previously-stored prog was already released by
the concurrent writer or will be released by us after
overwriting.
* Release guard reference.
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Link: https://lore.kernel.org/r/20260120-timer_nolock-v6-3-670ffdd787b4@meta.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Move the timer deletion logic into a dedicated bpf_timer_delete()
helper so it can be reused by later patches.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Link: https://lore.kernel.org/r/20260120-timer_nolock-v6-1-670ffdd787b4@meta.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add check to ensure that ARG_PTR_TO_MEM is used with either MEM_WRITE or
MEM_RDONLY.
Using ARG_PTR_TO_MEM alone without flags does not make sense because:
- If the helper does not change the argument, missing MEM_RDONLY causes the
verifier to incorrectly reject a read-only buffer.
- If the helper does change the argument, missing MEM_WRITE causes the
verifier to incorrectly assume the memory is unchanged, leading to errors
in code optimization.
Co-developed-by: Shuran Liu <electronlsr@gmail.com>
Signed-off-by: Shuran Liu <electronlsr@gmail.com>
Co-developed-by: Peili Gao <gplhust955@gmail.com>
Signed-off-by: Peili Gao <gplhust955@gmail.com>
Co-developed-by: Haoran Ni <haoran.ni.cs@gmail.com>
Signed-off-by: Haoran Ni <haoran.ni.cs@gmail.com>
Signed-off-by: Zesen Liu <ftyghome@gmail.com>
Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260120-helper_proto-v3-2-27b0180b4e77@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
After commit 37cce22dbd ("bpf: verifier: Refactor helper access type tracking"),
the verifier started relying on the access type flags in helper
function prototypes to perform memory access optimizations.
Currently, several helper functions utilizing ARG_PTR_TO_MEM lack the
corresponding MEM_RDONLY or MEM_WRITE flags. This omission causes the
verifier to incorrectly assume that the buffer contents are unchanged
across the helper call. Consequently, the verifier may optimize away
subsequent reads based on this wrong assumption, leading to correctness
issues.
For bpf_get_stack_proto_raw_tp, the original MEM_RDONLY was incorrect
since the helper writes to the buffer. Change it to ARG_PTR_TO_UNINIT_MEM
which correctly indicates write access to potentially uninitialized memory.
Similar issues were recently addressed for specific helpers in commit
ac44dcc788 ("bpf: Fix verifier assumptions of bpf_d_path's output buffer")
and commit 2eb7648558 ("bpf: Specify access type of bpf_sysctl_get_name args").
Fix these prototypes by adding the correct memory access flags.
Fixes: 37cce22dbd ("bpf: verifier: Refactor helper access type tracking")
Co-developed-by: Shuran Liu <electronlsr@gmail.com>
Signed-off-by: Shuran Liu <electronlsr@gmail.com>
Co-developed-by: Peili Gao <gplhust955@gmail.com>
Signed-off-by: Peili Gao <gplhust955@gmail.com>
Co-developed-by: Haoran Ni <haoran.ni.cs@gmail.com>
Signed-off-by: Haoran Ni <haoran.ni.cs@gmail.com>
Signed-off-by: Zesen Liu <ftyghome@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260120-helper_proto-v3-1-27b0180b4e77@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch implements range tracking (interval analysis) for BPF_DIV and
BPF_MOD operations when the divisor is a constant, covering both signed
and unsigned variants.
While LLVM typically optimizes integer division and modulo by constants
into multiplication and shift sequences, this optimization is less
effective for the BPF target when dealing with 64-bit arithmetic.
Currently, the verifier does not track bounds for scalar division or
modulo, treating the result as "unbounded". This leads to false positive
rejections for safe code patterns.
For example, the following code (compiled with -O2):
```c
int test(struct pt_regs *ctx) {
char buffer[6] = {1};
__u64 x = bpf_ktime_get_ns();
__u64 res = x % sizeof(buffer);
char value = buffer[res];
bpf_printk("res = %llu, val = %d", res, value);
return 0;
}
```
Generates a raw `BPF_MOD64` instruction:
```asm
; __u64 res = x % sizeof(buffer);
1: 97 00 00 00 06 00 00 00 r0 %= 0x6
; char value = buffer[res];
2: 18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r1 = 0x0 ll
4: 0f 01 00 00 00 00 00 00 r1 += r0
5: 91 14 00 00 00 00 00 00 r4 = *(s8 *)(r1 + 0x0)
```
Without this patch, the verifier fails with "math between map_value
pointer and register with unbounded min value is not allowed" because
it cannot deduce that `r0` is within [0, 5].
According to the BPF instruction set[1], the instruction's offset field
(`insn->off`) is used to distinguish between signed (`off == 1`) and
unsigned division (`off == 0`). Moreover, we also follow the BPF division
and modulo runtime behavior (semantics) to handle special cases, such as
division by zero and signed division overflow.
- UDIV: dst = (src != 0) ? (dst / src) : 0
- SDIV: dst = (src == 0) ? 0 : ((src == -1 && dst == LLONG_MIN) ? LLONG_MIN : (dst / src))
- UMOD: dst = (src != 0) ? (dst % src) : dst
- SMOD: dst = (src == 0) ? dst : ((src == -1 && dst == LLONG_MIN) ? 0: (dst s% src))
Here is the overview of the changes made in this patch (See the code comments
for more details and examples):
1. For BPF_DIV: Firstly check whether the divisor is zero. If so, set the
destination register to zero (matching runtime behavior).
For non-zero constant divisors: goto `scalar(32)?_min_max_(u|s)div` functions.
- General cases: compute the new range by dividing max_dividend and
min_dividend by the constant divisor.
- Overflow case (SIGNED_MIN / -1) in signed division: mark the result
as unbounded if the dividend is not a single number.
2. For BPF_MOD: Firstly check whether the divisor is zero. If so, leave the
destination register unchanged (matching runtime behavior).
For non-zero constant divisors: goto `scalar(32)?_min_max_(u|s)mod` functions.
- General case: For signed modulo, the result's sign matches the
dividend's sign. And the result's absolute value is strictly bounded
by `min(abs(dividend), abs(divisor) - 1)`.
- Special care is taken when the divisor is SIGNED_MIN. By casting
to unsigned before negation and subtracting 1, we avoid signed
overflow and correctly calculate the maximum possible magnitude
(`res_max_abs` in the code).
- "Small dividend" case: If the dividend is already within the possible
result range (e.g., [-2, 5] % 10), the operation is an identity
function, and the destination register remains unchanged.
3. In `scalar(32)?_min_max_(u|s)(div|mod)` functions: After updating current
range, reset other ranges and tnum to unbounded/unknown.
e.g., in `scalar_min_max_sdiv`, signed 64-bit range is updated. Then reset
unsigned 64-bit range and 32-bit range to unbounded, and tnum to unknown.
Exception: in BPF_MOD's "small dividend" case, since the result remains
unchanged, we do not reset other ranges/tnum.
4. Also updated existing selftests based on the expected BPF_DIV and
BPF_MOD behavior.
[1] https://www.kernel.org/doc/Documentation/bpf/standardization/instruction-set.rst
Co-developed-by: Shenghao Yuan <shenghaoyuan0928@163.com>
Signed-off-by: Shenghao Yuan <shenghaoyuan0928@163.com>
Co-developed-by: Tianci Cao <ziye@zju.edu.cn>
Signed-off-by: Tianci Cao <ziye@zju.edu.cn>
Signed-off-by: Yazhou Tang <tangyazhou518@outlook.com>
Tested-by: syzbot@syzkaller.appspotmail.com
Link: https://lore.kernel.org/r/20260119085458.182221-2-tangyazhou@zju.edu.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Implement bpf_stream_vprintk with an implicit bpf_prog_aux argument,
and remote bpf_stream_vprintk_impl from the kernel.
Update the selftests to use the new API with implicit argument.
bpf_stream_vprintk macro is changed to use the new bpf_stream_vprintk
kfunc, and the extern definition of bpf_stream_vprintk_impl is
replaced accordingly.
Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Link: https://lore.kernel.org/r/20260120222638.3976562-11-ihor.solodrai@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Implement bpf_task_work_schedule_* with an implicit bpf_prog_aux
argument, and remove corresponding _impl funcs from the kernel.
Update special kfunc checks in the verifier accordingly.
Update the selftests to use the new API with implicit argument.
Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Link: https://lore.kernel.org/r/20260120222638.3976562-10-ihor.solodrai@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Implement bpf_wq_set_callback() with an implicit bpf_prog_aux
argument, and remove bpf_wq_set_callback_impl().
Update special kfunc checks in the verifier accordingly.
Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Link: https://lore.kernel.org/r/20260120222638.3976562-8-ihor.solodrai@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
A kernel function bpf_foo marked with KF_IMPLICIT_ARGS flag is
expected to have two associated types in BTF:
* `bpf_foo` with a function prototype that omits implicit arguments
* `bpf_foo_impl` with a function prototype that matches the kernel
declaration of `bpf_foo`, but doesn't have a ksym associated with
its name
In order to support kfuncs with implicit arguments, the verifier has
to know how to resolve a call of `bpf_foo` to the correct BTF function
prototype and address.
To implement this, in add_kfunc_call() kfunc flags are checked for
KF_IMPLICIT_ARGS. For such kfuncs a BTF func prototype is adjusted to
the one found for `bpf_foo_impl` (func_name + "_impl" suffix, by
convention) function in BTF.
This effectively changes the signature of the `bpf_foo` kfunc in the
context of verification: from one without implicit args to the one
with full argument list.
The values of implicit arguments by design are provided by the
verifier, and so they can only be of particular types. In this patch
the only allowed implicit arg type is a pointer to struct
bpf_prog_aux.
In order for the verifier to correctly set an implicit bpf_prog_aux
arg value at runtime, is_kfunc_arg_prog() is extended to check for the
arg type. At a point when prog arg is determined in check_kfunc_args()
the kfunc with implicit args already has a prototype with full
argument list, so the existing value patch mechanism just works.
If a new kfunc with KF_IMPLICIT_ARG is declared for an existing kfunc
that uses a __prog argument (a legacy case), the prototype
substitution works in exactly the same way, assuming the kfunc follows
the _impl naming convention. The difference is only in how _impl
prototype is added to the BTF, which is not the verifier's
concern. See a subsequent resolve_btfids patch for details.
__prog suffix is still supported at this point, but will be removed in
a subsequent patch, after current users are moved to KF_IMPLICIT_ARGS.
Introduction of KF_IMPLICIT_ARGS revealed an issue with zero-extension
tracking, because an explicit rX = 0 in place of the verifier-supplied
argument is now absent if the arg is implicit (the BPF prog doesn't
pass a dummy NULL anymore). To mitigate this, reset the subreg_def of
all caller saved registers in check_kfunc_call() [1].
[1] https://lore.kernel.org/bpf/b4a760ef828d40dac7ea6074d39452bb0dc82caa.camel@gmail.com/
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Link: https://lore.kernel.org/r/20260120222638.3976562-4-ihor.solodrai@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
There is code duplication between add_kfunc_call() and
fetch_kfunc_meta() collecting information about a kfunc from BTF.
Introduce struct bpf_kfunc_meta to hold common kfunc BTF data and
implement fetch_kfunc_meta() to fill it in, instead of struct
bpf_kfunc_call_arg_meta directly.
Then use these in add_kfunc_call() and (new) fetch_kfunc_arg_meta()
functions, and fixup previous usages of fetch_kfunc_meta() to
fetch_kfunc_arg_meta().
Besides the code dedup, this change enables add_kfunc_call() to access
kfunc->flags.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Link: https://lore.kernel.org/r/20260120222638.3976562-3-ihor.solodrai@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
btf_kfunc_id_set_contains() is called by fetch_kfunc_meta() in the BPF
verifier to get the kfunc flags stored in the .BTF_ids ELF section.
If it returns NULL instead of a valid pointer, it's interpreted as an
illegal kfunc usage failing the verification.
There are two potential reasons for btf_kfunc_id_set_contains() to
return NULL:
1. Provided kfunc BTF id is not present in relevant kfunc id sets.
2. The kfunc is not allowed, as determined by the program type
specific filter [1].
The filter functions accept a pointer to `struct bpf_prog`, so they
might implicitly depend on earlier stages of verification, when
bpf_prog members are set.
For example, bpf_qdisc_kfunc_filter() in linux/net/sched/bpf_qdisc.c
inspects prog->aux->st_ops [2], which is initialized in:
check_attach_btf_id() -> check_struct_ops_btf_id()
So far this hasn't been an issue, because fetch_kfunc_meta() is the
only caller of btf_kfunc_id_set_contains().
However in subsequent patches of this series it is necessary to
inspect kfunc flags earlier in BPF verifier, in the add_kfunc_call().
To resolve this, refactor btf_kfunc_id_set_contains() into two
interface functions:
* btf_kfunc_flags() that simply returns pointer to kfunc_flags
without applying the filters
* btf_kfunc_is_allowed() that both checks for kfunc_flags existence
(which is a requirement for a kfunc to be allowed) and applies the
prog filters
See [3] for the previous version of this patch.
[1] https://lore.kernel.org/all/20230519225157.760788-7-aditi.ghag@isovalent.com/
[2] https://lore.kernel.org/all/20250409214606.2000194-4-ameryhung@gmail.com/
[3] https://lore.kernel.org/bpf/20251029190113.3323406-3-ihor.solodrai@linux.dev/
Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Link: https://lore.kernel.org/r/20260120222638.3976562-2-ihor.solodrai@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
- A 4 patch series from David Hildenbrand which fixes a few things
realted to hugetlb PMD sharing
- The remainder are singletons, please see their changelogs for details.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaW/vEQAKCRDdBJ7gKXxA
jtMcAQCK1tKnINRmVjY3UJCqZMAaXvOdOoUIgHDaTXD/DWKm9AD9HRwWzYB4+TNr
k/Te8F33d418LcMBTW9CLhrplQpaIAI=
=+d1A
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2026-01-20-13-09' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
- A patch series from David Hildenbrand which fixes a few things
related to hugetlb PMD sharing
- The remainder are singletons, please see their changelogs for details
* tag 'mm-hotfixes-stable-2026-01-20-13-09' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm: restore per-memcg proactive reclaim with !CONFIG_NUMA
mm/kfence: fix potential deadlock in reboot notifier
Docs/mm/allocation-profiling: describe sysctrl limitations in debug mode
mm: do not copy page tables unnecessarily for VM_UFFD_WP
mm/hugetlb: fix excessive IPI broadcasts when unsharing PMD tables using mmu_gather
mm/rmap: fix two comments related to huge_pmd_unshare()
mm/hugetlb: fix two comments related to huge_pmd_unshare()
mm/hugetlb: fix hugetlb_pmd_shared()
mm: remove unnecessary and incorrect mmap lock assert
x86/kfence: avoid writing L1TF-vulnerable PTEs
mm/vma: do not leak memory when .mmap_prepare swaps the file
migrate: correct lock ordering for hugetlb file folios
panic: only warn about deprecated panic_print on write access
fs/writeback: skip AS_NO_DATA_INTEGRITY mappings in wait_sb_inodes()
mm: take into account mm_cid size for mm_struct static definitions
mm: rename cpu_bitmap field to flexible_array
mm: add missing static initializer for init_mm::mm_cid.lock
Currently, reset_idmap_scratch() performs a 4.7KB memset() in every
states_equal() call. Optimize this by using a counter to track used
ID mappings, replacing the O(N) memset() with an O(1) reset and
bounding the search loop in check_ids().
Signed-off-by: Qiliang Yuan <realwujing@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20260120023234.77673-1-realwujing@gmail.com
After commit 4fa8d68aa5 ("bpf: Convert hashtab.c to rqspinlock")
we no longer use HASHTAB_MAP_LOCK_{COUNT,MASK} as the per-CPU
map_locked[HASHTAB_MAP_LOCK_COUNT] array got removed from struct
bpf_htab. Right now it is still accounted for in htab_map_mem_usage.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/09703eb6bb249f12b1d5253b5a50a0c4fa239d27.1768913513.git.daniel@iogearbox.net
sync_linked_regs() is called after a conditional jump to propagate new
bounds of a register to all its liked registers. But the verifier log
only prints the state of the register that is part of the conditional
jump.
Make sync_linked_regs() scratch the registers whose bounds have been
updated by propagation from a known register.
Before:
0: (85) call bpf_get_prandom_u32#7 ; R0=scalar()
1: (57) r0 &= 255 ; R0=scalar(smin=smin32=0,smax=umax=smax32=umax32=255,var_off=(0x0; 0xff))
2: (bf) r1 = r0 ; R0=scalar(id=1,smin=smin32=0,smax=umax=smax32=umax32=255,var_off=(0x0; 0xff)) R1=scalar(id=1,smin=smin32=0,smax=umax=smax32=umax32=255,var_off=(0x0; 0xff))
3: (07) r1 += 4 ; R1=scalar(id=1+4,smin=umin=smin32=umin32=4,smax=umax=smax32=umax32=259,var_off=(0x0; 0x1ff))
4: (a5) if r1 < 0xa goto pc+2 ; R1=scalar(id=1+4,smin=umin=smin32=umin32=10,smax=umax=smax32=umax32=259,var_off=(0x0; 0x1ff))
5: (35) if r0 >= 0x6 goto pc+1
After:
0: (85) call bpf_get_prandom_u32#7 ; R0=scalar()
1: (57) r0 &= 255 ; R0=scalar(smin=smin32=0,smax=umax=smax32=umax32=255,var_off=(0x0; 0xff))
2: (bf) r1 = r0 ; R0=scalar(id=1,smin=smin32=0,smax=umax=smax32=umax32=255,var_off=(0x0; 0xff)) R1=scalar(id=1,smin=smin32=0,smax=umax=smax32=umax32=255,var_off=(0x0; 0xff))
3: (07) r1 += 4 ; R1=scalar(id=1+4,smin=umin=smin32=umin32=4,smax=umax=smax32=umax32=259,var_off=(0x0; 0x1ff))
4: (a5) if r1 < 0xa goto pc+2 ; R0=scalar(id=1+0,smin=umin=smin32=umin32=6,smax=umax=smax32=umax32=255) R1=scalar(id=1+4,smin=umin=smin32=umin32=10,smax=umax=smax32=umax32=259,var_off=(0x0; 0x1ff))
5: (35) if r0 >= 0x6 goto pc+1
The conditional jump in 4 updates the bound of R1 and the new bounds are
propogated to R0 as it is linked with the same id, before this change,
verifier only printed the state for R1 but after it prints for both R0
and R1.
Suggested-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/bpf/20260116141436.3715322-1-puranjay@kernel.org
- minor fixes for the corner cases of the SWIOTLB pool management (Robin Murphy)
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSrngzkoBtlA8uaaJ+Jp1EFxbsSRAUCaW+NbgAKCRCJp1EFxbsS
RKU3AQCIpNwIYN6evmnTXO/Y+AFRjsrb1pE1cUyHn5QTY4/o4wEA6BiSgMCrZEbv
HTEs1ZSHgLOwBwZre1Z11icGg3BkRgY=
=6fBC
-----END PGP SIGNATURE-----
Merge tag 'dma-mapping-6.19-2026-01-20' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux
Pull dma-mapping fixes from Marek Szyprowski:
- minor fixes for the corner cases of the SWIOTLB pool management
(Robin Murphy)
* tag 'dma-mapping-6.19-2026-01-20' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
dma/pool: Avoid allocating redundant pools
mm_zone: Generalise has_managed_dma()
dma/pool: Improve pool lookup
Once the aggrprobe is fully reverted in do_free_cleaned_kprobes(), retry
optimize_kprobe() on that sibling so it can return to OPTIMIZED.
Also remove the stale comment in __disarm_kprobe().
Link: https://lore.kernel.org/all/349359900266B25F+20260115023804.3951960-2-hongao@uniontech.com/
Signed-off-by: hongao <hongao@uniontech.com>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
When __do_ajdtimex() was introduced to handle adjtimex for any
timekeeper, this reference to tk_core was not updated. When called on an
auxiliary timekeeper, the core timekeeper would be updated incorrectly.
This gets caught by the lock debugging diagnostics because the
timekeepers sequence lock gets written to without holding its
associated spinlock:
WARNING: include/linux/seqlock.h:226 at __do_adjtimex+0x394/0x3b0, CPU#2: test/125
aux_clock_adj (kernel/time/timekeeping.c:2979)
__do_sys_clock_adjtime (kernel/time/posix-timers.c:1161 kernel/time/posix-timers.c:1173)
do_syscall_64 (arch/x86/entry/syscall_64.c:63 (discriminator 1) arch/x86/entry/syscall_64.c:94 (discriminator 1))
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:131)
Update the correct auxiliary timekeeper.
Fixes: 775f71ebed ("timekeeping: Make do_adjtimex() reusable")
Fixes: ecf3e70304 ("timekeeping: Provide adjtimex() for auxiliary clocks")
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260120-timekeeper-auxclock-leapstate-v1-1-5b358c6b3cfd@linutronix.de
The panic_print_deprecated() warning is being triggered on both read and
write operations to the panic_print parameter.
This causes spurious warnings when users run 'sysctl -a' to list all
sysctl values, since that command reads /proc/sys/kernel/panic_print and
triggers the deprecation notice.
Modify the handlers to only emit the deprecation warning when the
parameter is actually being set:
- sysctl_panic_print_handler(): check 'write' flag before warning.
- panic_print_get(): remove the deprecation call entirely.
This way, users are only warned when they actively try to use the
deprecated parameter, not when passively querying system state.
Link: https://lkml.kernel.org/r/20260106163321.83586-1-gal@nvidia.com
Fixes: ee13240cd7 ("panic: add note that panic_print sysctl interface is deprecated")
Fixes: 2683df6539 ("panic: add note that 'panic_print' parameter is deprecated")
Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Nimrod Oren <noren@nvidia.com>
Cc: Feng Tang <feng.tang@linux.alibaba.com>
Cc: Joel Granados <joel.granados@kernel.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
- Add Chen Ridong as cpuset reviewer.
- Add SPDX license identifiers to cgroup files that were missing them.
-----BEGIN PGP SIGNATURE-----
iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaW0ToA4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGel7AQD3eTbiR/OU0i5yJj6YZsIdVYcteqqsWBkDFnZ3
9pZZugEA+KCwi5eQz+cryKZ7y5fU5g5O6f4OP/lfLEcSmFDpfQU=
=QHwy
-----END PGP SIGNATURE-----
Merge tag 'cgroup-for-6.19-rc5-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
- Add Chen Ridong as cpuset reviewer
- Add SPDX license identifiers to cgroup files that were missing them
* tag 'cgroup-for-6.19-rc5-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
kernel: cgroup: Add LGPL-2.1 SPDX license ID to legacy_freezer.c
kernel: cgroup: Add SPDX-License-Identifier lines
MAINTAINERS: Add Chen Ridong as cpuset reviewer
may result in incorrect skipping of hrtimer IPIs.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmlsr+oRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1iLPw/+OuA8hPlFY8nxrpVKASoiP7u+mGQ82Qmf
xdf1Kimpmc5ydpGsxpRLKY19l3fLuGSJuOa2Url0Ro52JHxU+yPPOkN2ONjsPxLI
mFMdhyXiS1PIgYt307OmNmKFadxdg5cGJTKY0wIS9JCcPFxI5Y3NFhWZ4ybPlGHK
/bxtsjUctekUTDVahqP65lRIYMvjhO1I2LQfoPQyg/C+2n6Owy7TAX8xZovDZLHz
a9RkkX6FPc8aR8Kj1VYm1yABNlUhGvFvTE6wfcfOMjiHHeDv2qYdwNHF2tK2eoKC
QX609zBTCT/n0ioZlikLrKDZitj6uRgS2blBc+hodHEs4cY5gE2umWSU4IirT7HH
GDjvG/vns/BLRr97gPdMaJ+N/sjdCSQ5tsDRrvHwCGC27VXPu3O3DYU128eH4eXW
+KNJGQHJvR7slBKElQ4wxJaLhbLBB5rw/AqqdXa4PU8LqvMBBNcLZjHnrYlbD5Ei
ovdX5VtRYOjrVl7pCNrXXdzFfBpXK+sOPHQMoJxnLHKXO+cdsZz8zIB7AHZEVe20
CIOMRcfgw2nvpS+0MqaHzre0ixyq+U2XSIu7MRIlq8GZ+zdjiVMqx3pPa19QRdyr
hA9JupeRC9eZdymlms2MiSzqIVmWSVvgcyhIFSYw9yLe55vXcGyArF27RKGtBHRA
uF+K5aT8gqA=
=CAdu
-----END PGP SIGNATURE-----
Merge tag 'timers-urgent-2026-01-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Ingo Molnar:
"Fix the update_needs_ipi() check in the hrtimer code that may result
in incorrect skipping of hrtimer IPIs"
* tag 'timers-urgent-2026-01-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
hrtimer: Fix softirq base check in update_needs_ipi()
discovered and fixed recently:
- Fix a race condition in the DL server
- Fix a DL server bug which can result in incorrectly going
idle when there's work available
- Fix DL server bug which triggers a WARN() due to broken
get_prio_dl() logic and subsequent misbehavior
- Fix double update_rq_clock() calls
- Fix setscheduler() assumption about static priorities
- Make sure balancing callbacks are always called
- Plus a handful of preparatory commits for the fixes.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmlsrfIRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1hgUA/9HQi51sxF/TlhKeoj9Z0zSO/L5Fbd7d5O
ucsr+exoU7B+X95AnYC64pV3YlTNibD5y6HR4zpPzhlmEdFw8K9hTmDeSOF2cB0T
dPdEY2KtvNoQwVwytPYKMsXtBTcJtZZqOec9BgGBzSIpjPm6Zhxr1J2/lKFy/udA
00aqNuJwfQW94dxvyPf+jPXbm0JW9HHZGNuuOXkkmZpkH+HIzBDrTWcAjHUaZt9H
doPyEfO08dg3Uu4OxNBtxx94X7bQsn5RU1plSEc+xF7zynqwEIwDbelFftpHiM+z
UqfFgkzCV8A8o/72BfZVhQgaspMZ+M4XSsaGBbouOTg3Rgx8gRKC23s8/xhDJSpy
ayMWhuR5bsI04GD1xZVOwLvUi8oDIgAFIAKRI6gcW5eQELlejii0bAvs9Jw9/OjI
QhhrX54sJCPPKMZpKO2n+AzQvv9a6+/p7ikiA2U5dR4wBEERFUw8ffRguHwSm+BU
U1M+t+2tw+UiC7yML0zvIC3HdyfBhj33ZeW7PWyvFU0vjGm78U6NjqliAxbqW21l
GdtQUpi9aXCtijXqHhLFvBtWGmqowj9UQiHroC0+nq1BsTpBDe575vwJNYF2K9he
ht0hZqreUIf7xTsJHXlLFbtFxPQw0xAKzWJWUswaaqO8eS2JcWJbmHUAsHQuxmJB
MbkJAlU7j/w=
=DgiW
-----END PGP SIGNATURE-----
Merge tag 'sched-urgent-2026-01-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
"Misc deadline scheduler fixes, mainly for a new category of bugs that
were discovered and fixed recently:
- Fix a race condition in the DL server
- Fix a DL server bug which can result in incorrectly going idle when
there's work available
- Fix DL server bug which triggers a WARN() due to broken
get_prio_dl() logic and subsequent misbehavior
- Fix double update_rq_clock() calls
- Fix setscheduler() assumption about static priorities
- Make sure balancing callbacks are always called
- Plus a handful of preparatory commits for the fixes"
* tag 'sched-urgent-2026-01-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/deadline: Use ENQUEUE_MOVE to allow priority change
sched: Deadline has dynamic priority
sched: Audit MOVE vs balance_callbacks
sched: Fold rq-pin swizzle into __balance_callbacks()
sched/deadline: Avoid double update_rq_clock()
sched/deadline: Ensure get_prio_dl() is up-to-date
sched/deadline: Fix server stopping with runnable tasks
sched: Provide idle_rq() helper
sched/deadline: Fix potential race in dl_add_task_root_domain()
sched/deadline: Remove unnecessary comment in dl_add_task_root_domain()
Add GPL-2.0 SPDX-License-Identifier lines to some files,
and remove a reference to COPYING, and boilerplate warranty
text, from offload.c.
Signed-off-by: Tim Bird <tim.bird@sony.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20260115013129.598705-1-tim.bird@sony.com
Add __force annotations to casts that convert between __user and kernel
address spaces. These casts are intentional:
- In bpf_send_signal_common(), the value is stored in si_value.sival_ptr
which is typed as void __user *, but the value comes from a BPF
program parameter.
- In the bpf_*_dynptr() kfuncs, user pointers are cast to const void *
before being passed to copy helper functions that correctly handle
the user address space through copy_from_user variants.
Without __force, sparse reports:
warning: cast removes address space '__user' of expression
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20260115184509.3585759-1-mykyta.yatsenko5@gmail.com
Closes: https://lore.kernel.org/oe-kbuild-all/202601131740.6C3BdBaB-lkp@intel.com/
- Fix a memory leak in em_create_pd() error path (Malaya Kumar Rout)
- Fix stale description of the cost field in struct em_perf_state to
reflect the current code (Yaxiong Tian)
- Fix and revamp the energy model YNL specification added recently
along with the energy model netlink interface (Changwoo Min)
-----BEGIN PGP SIGNATURE-----
iQFGBAABCAAwFiEEcM8Aw/RY0dgsiRUR7l+9nS/U47UFAmlqWS8SHHJqd0Byand5
c29ja2kubmV0AAoJEO5fvZ0v1OO1BhAH/iSA/9YP9/xIW+rIZj2irsncGVc/hJ+V
8PcM2tar966AEQBlmDPc7ug2MXT0Mb/5WRtQo/IvZ1tdAALB+bC70FHjdu3y7SAX
IzFGyHKJ9OVJHGxYq0TCBbRXgYubUZiqvAnaBTBZ/ZIFlt3B4KEyatfFfxJMb1Pe
H9zQIyUVBN1tjPfgNVbc22XGfkwfmmla72le6GmHseKZRqpwutIwzxm1PoMNVeLb
fQv625LsWrDD9NU6j5uGq3WEq3xPltubtHN545FTFi4aGwBYJaG1FuIi+kPiQuCR
VZmxTWVEiVwjttiZf/xWbOkPfwQqdgeXlTk5j+iBlF1jkrrW75IuLYQ=
=eJ2p
-----END PGP SIGNATURE-----
Merge tag 'pm-6.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These fix an error path memory leak in the energy model management
code, fix a kerneldoc comment in it, and fix and revamp the energy
model YNL specification added recently along with the new energy model
management netlink interface (that received feedback after being
added):
- Fix a memory leak in em_create_pd() error path (Malaya Kumar Rout)
- Fix stale description of the cost field in struct em_perf_state to
reflect the current code (Yaxiong Tian)
- Fix and revamp the energy model YNL specification added recently
along with the energy model netlink interface (Changwoo Min)"
* tag 'pm-6.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM: EM: Add dump to get-perf-domains in the EM YNL spec
PM: EM: Change cpus' type from string to u64 array in the EM YNL spec
PM: EM: Rename em.yaml to dev-energymodel.yaml
PM: EM: Fix yamllint warnings in the EM YNL spec
PM: EM: Fix memory leak in em_create_pd() error path
PM: EM: Fix incorrect description of the cost field in struct em_perf_state
The irq_chip_pm_put() return value is only used in __irq_do_set_handler()
to trigger a WARN_ON() if it is negative, but doing so is not useful
because irq_chip_pm_put() simply passes the pm_runtime_put() return value
to its callers.
Returning an error code from pm_runtime_put() merely means that it has
not queued up a work item to check whether or not the device can be
suspended and there are many perfectly valid situations in which that
can happen, like after writing "on" to the devices' runtime PM "control"
attribute in sysfs for one example.
For this reason, modify irq_chip_pm_put() to discard the pm_runtime_put()
return value, change its return type to void, and drop the WARN_ON()
around the irq_chip_pm_put() invocation from __irq_do_set_handler().
Also update the irq_chip_pm_put() kerneldoc comment to be more accurate.
This will facilitate a planned change of the pm_runtime_put() return
type to void in the future.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/5075294.31r3eYUQgx@rafael.j.wysocki
sync_linked_regs() copies the id of known_reg to reg when propagating
bounds of known_reg to reg using the off of known_reg, but when
known_reg was linked to reg like:
known_reg = reg ; both known_reg and reg get same id
known_reg += 4 ; known_reg gets off = 4, and its id gets BPF_ADD_CONST
now when a call to sync_linked_regs() happens, let's say with the following:
if known_reg >= 10 goto pc+2
known_reg's new bounds are propagated to reg but now reg gets
BPF_ADD_CONST from the copy.
This means if another link to reg is created like:
another_reg = reg ; another_reg should get the id of reg but
assign_scalar_id_before_mov() sees
BPF_ADD_CONST on reg and assigns a new id to it.
As reg has a new id now, known_reg's link to reg is broken. If we find
new bounds for known_reg, they will not be propagated to reg.
This can be seen in the selftest added in the next commit:
0: (85) call bpf_get_prandom_u32#7 ; R0=scalar()
1: (57) r0 &= 255 ; R0=scalar(smin=smin32=0,smax=umax=smax32=umax32=255,var_off=(0x0; 0xff))
2: (bf) r1 = r0 ; R0=scalar(id=1,smin=smin32=0,smax=umax=smax32=umax32=255,var_off=(0x0; 0xff)) R1=scalar(id=1,smin=smin32=0,smax=umax=smax32=umax32=255,var_off=(0x0; 0xff))
3: (07) r1 += 4 ; R1=scalar(id=1+4,smin=umin=smin32=umin32=4,smax=umax=smax32=umax32=259,var_off=(0x0; 0x1ff))
4: (a5) if r1 < 0xa goto pc+4 ; R1=scalar(id=1+4,smin=umin=smin32=umin32=10,smax=umax=smax32=umax32=259,var_off=(0x0; 0x1ff))
5: (bf) r2 = r0 ; R0=scalar(id=2,smin=umin=smin32=umin32=6,smax=umax=smax32=umax32=255) R2=scalar(id=2,smin=umin=smin32=umin32=6,smax=umax=smax32=umax32=255)
6: (a5) if r1 < 0xe goto pc+2 ; R1=scalar(id=1+4,smin=umin=smin32=umin32=14,smax=umax=smax32=umax32=259,var_off=(0x0; 0x1ff))
7: (35) if r0 >= 0xa goto pc+1 ; R0=scalar(id=2,smin=umin=smin32=umin32=6,smax=umax=smax32=umax32=9,var_off=(0x0; 0xf))
8: (37) r0 /= 0
div by zero
When 4 is verified, r1's bounds are propagated to r0 but r0 also gets
BPF_ADD_CONST (bug).
When 5 is verified, r0 gets a new id (2) and its link with r1 is broken.
After 6 we know r1 has bounds [14, 259] and therefore r0 should have
bounds [10, 255], therefore the branch at 7 is always taken. But because
r0's id was changed to 2, r1's new bounds are not propagated to r0.
The verifier still thinks r0 has bounds [6, 255] before 7 and execution
can reach div by zero.
Fix this by preserving id in sync_linked_regs() like off and subreg_def.
Fixes: 98d7ca374b ("bpf: Track delta between "linked" registers.")
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260115151143.1344724-2-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJPBAABCAA5FiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAmlqGMYbFIAAAAAABAAO
bWFudTIsMi41KzEuMTEsMiwyAAoJEFKgDEdIgJTyT0UP/3Wn4tm0h0n3XeyujQEA
IPNisJCczF6aWIAuwccRigR8hQFriPNvkZ5AhMGtotZJrY3uUoe1/8aF+XIdqnl/
Yb1434AGwVNIpSaap+vtEclHrLDmzZH+Z75/FTWQwldM/hPkPWJI8fsEuRLqBZsn
v520NBFtrQVcOZKKNy0npBnHsC0DsAmqoZuOvLTx0mx5AyE029CfPbDMZuVnSNix
KjZ4U5KL0qDs2LIMpdB/mqprydGkHdogdIbrPK3WtzStVgNbi9VmnV19ZwbUlXJM
rYPbtbQg3htwuspgR+yM6O21qsthRf2qZF5+2/a929IzOBsD/qAXQbbxQWVpF7Qb
ELYXNV4N5hqm9EW8WeOOpLKUUG7k0fRPf81X/07uGVafPMQKQJ8kFNgLBkBFR4ya
RAMNxTPHbHQvVaLcRujxZXoC4Wh3ZTunQXpIouy0p9dKOzbsCAj0ZqeqDa09UsaW
rCEm50p/Pd1csML8a9A/2nNoWjQzuSVmML7F6obGCOWaW6p21GhSKHzqqDIjBab9
3wxhpllVeYRYS2yhkKjOPJkQKXo3idIdpieLpW8IVbJvp/gQgmeKjWdhnvNvUan9
hyCzfI8OZXVz0vItBfWsoX44+6UtpLHd4o16aDYjyDItflJOPuQbWzky+icT2R3x
8B5xPC08tmEGitf3miv2EGq9
=d/xV
-----END PGP SIGNATURE-----
Merge tag 'printk-for-6.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux
Pull printk fix from Petr Mladek:
- Prevent softlockup by restoring IRQs in atomic flush after each
record
* tag 'printk-for-6.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
printk/nbcon: Restore IRQ in atomic flush after each emitted record
Insert Soft Reserved memory into a dedicated soft_reserve_resource tree
instead of the iomem_resource tree at boot. Delay publishing these ranges
into the iomem hierarchy until ownership is resolved and the HMEM path
is ready to consume them.
Publishing Soft Reserved ranges into iomem too early conflicts with CXL
hotplug and prevents region assembly when those ranges overlap CXL
windows.
Follow up patches will reinsert Soft Reserved ranges into iomem after CXL
window publication is complete and HMEM is ready to claim the memory. This
provides a cleaner handoff between EFI-defined memory ranges and CXL
resource management without trimming or deleting resources later.
In the meantime "Soft Reserved" resources will no longer appear in
/proc/iomem, only their results. I.e. with "memmap=4G%4G+0xefffffff"
Before:
100000000-1ffffffff : Soft Reserved
100000000-1ffffffff : dax1.0
100000000-1ffffffff : System RAM (kmem)
After:
100000000-1ffffffff : dax1.0
100000000-1ffffffff : System RAM (kmem)
The expectation is that this does not lead to a user visible regression
because the dax1.0 device is created in both instances.
Co-developed-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
[Smita: incorporate feedback from x86 maintainer review]
Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
Link: https://patch.msgid.link/20251120031925.87762-2-Smita.KoralahalliChannabasappa@amd.com
[djbw: cleanups and clarifications]
Link: https://lore.kernel.org/69443f707b025_1cee10022@dwillia2-mobl4.notmuch
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
This script
#!/usr/bin/bash
echo 0 > /proc/sys/kernel/randomize_va_space
echo 'void main(void) {}' > TEST.c
# -fcf-protection to ensure that the 1st endbr32 insn can't be emulated
gcc -m32 -fcf-protection=branch TEST.c -o test
bpftrace -e 'uprobe:./test:main {}' -c ./test
"hangs", the probed ./test task enters an endless loop.
The problem is that with randomize_va_space == 0
get_unmapped_area(TASK_SIZE - PAGE_SIZE) called by xol_add_vma() can not
just return the "addr == TASK_SIZE - PAGE_SIZE" hint, this addr is used
by the stack vma.
arch_get_unmapped_area_topdown() doesn't take TIF_ADDR32 into account and
in_32bit_syscall() is false, this leads to info.high_limit > TASK_SIZE.
vm_unmapped_area() happily returns the high address > TASK_SIZE and then
get_unmapped_area() returns -ENOMEM after the "if (addr > TASK_SIZE - len)"
check.
handle_swbp() doesn't report this failure (probably it should) and silently
restarts the probed insn. Endless loop.
I think that the right fix should change the x86 get_unmapped_area() paths
to rely on TIF_ADDR32 rather than in_32bit_syscall(). Note also that if
CONFIG_X86_X32_ABI=y, in_x32_syscall() falsely returns true in this case
because ->orig_ax = -1.
But we need a simple fix for -stable, so this patch just sets TS_COMPAT if
the probed task is 32-bit to make in_ia32_syscall() true.
Fixes: 1b028f784e ("x86/mm: Introduce mmap_compat_base() for 32-bit mmap()")
Reported-by: Paulo Andrade <pandrade@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/aV5uldEvV7pb4RA8@redhat.com/
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/aWO7Fdxn39piQnxu@redhat.com
Merge fixes related to the energy model management for 6.19-rc6:
- Fix a memory leak in em_create_pd() error path (Malaya Kumar Rout)
- Fix stale description of the cost field in struct em_perf_state to
reflect the current code (Yaxiong Tian)
- Fix and revamp the energy model YNL specification added recently
along with the energy model netlink interface (Changwoo Min)
* pm-em:
PM: EM: Add dump to get-perf-domains in the EM YNL spec
PM: EM: Change cpus' type from string to u64 array in the EM YNL spec
PM: EM: Rename em.yaml to dev-energymodel.yaml
PM: EM: Fix yamllint warnings in the EM YNL spec
PM: EM: Fix memory leak in em_create_pd() error path
PM: EM: Fix incorrect description of the cost field in struct em_perf_state
Add SPDX-License-Identifier lines to some old kernel
files.
Signed-off-by: Tim Bird <tim.bird@sony.com>
Acked-by: Karim Yaghmour <karim.yaghmour@opersys.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Add an appropriate SPDX-License-Identifier line to the file,
and remove the GNU boilerplate text.
Signed-off-by: Tim Bird <tim.bird@sony.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Add GPL-2.0 SPDX license id lines to a few old
files, replacing the reference to the COPYING file.
The COPYING file at the time of creation of these files
(2007 and 2005) was GPL-v2.0, with an additional clause
indicating that only v2 applied.
Signed-off-by: Tim Bird <tim.bird@sony.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Add a GPL-2.0 license identifier line for this file.
kmod.c was originally introduced in the kernel in February
of 1998 by Linus Torvalds - who was familiar with kernel
licensing at the time this was introduced.
Signed-off-by: Tim Bird <tim.bird@sony.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mahe reported issue with bpf_override_return helper not working when
executed from kprobe.multi bpf program on arm.
The problem is that on arm we use alternate storage for pt_regs object
that is passed to bpf_prog_run and if any register is changed (which
is the case of bpf_override_return) it's not propagated back to actual
pt_regs object.
Fixing this by introducing and calling ftrace_partial_regs_update function
to propagate the values of changed registers (ip and stack).
Reported-by: Mahe Tardy <mahe.tardy@gmail.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/bpf/20260112121157.854473-1-jolsa@kernel.org
- Fix allocation accounting on boot up
The ftrace records for each function that ftrace can attach to is
done in a group of pages. At boot up, the number of pages are
calculated and allocated. After that, the pages are filled with data.
It may allocate more than needed due to some functions not being
recorded (because they are unused weak functions), this too is
recorded.
After the data is filled in, a check is made to make sure the right
number of pages were allocated. But this was off due to the
assumption that the same number of entries fit per every page.
Because the size of an entry does not evenly divide into PAGE_SIZE,
there is a rounding error when a large number of pages is allocated
to hold the events. This causes the check to fail and triggers a
warning.
Fix the accounting by finding out how many pages are actually
allocated from the functions that allocate them and use that to see
if all the pages allocated were used and the ones not used are
properly freed.
-----BEGIN PGP SIGNATURE-----
iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaWlyoxQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qoZoAQDuPOaly/RM9cS70zngDo/UsZmi2lQL
RxuMR7sPVMwkrAEAp/gnpJenSRxCzaO4EdQYUYI28/ihkdNGp3KG8Q2MNw0=
=v899
-----END PGP SIGNATURE-----
Merge tag 'ftrace-v6.19-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull ftrace fix from Steven Rostedt:
- Fix allocation accounting on boot up
The ftrace records for each function that ftrace can attach to is
done in a group of pages. At boot up, the number of pages are
calculated and allocated. After that, the pages are filled with data.
It may allocate more than needed due to some functions not being
recorded (because they are unused weak functions), this too is
recorded.
After the data is filled in, a check is made to make sure the right
number of pages were allocated. But this was off due to the
assumption that the same number of entries fit per every page.
Because the size of an entry does not evenly divide into PAGE_SIZE,
there is a rounding error when a large number of pages is allocated
to hold the events. This causes the check to fail and triggers a
warning.
Fix the accounting by finding out how many pages are actually
allocated from the functions that allocate them and use that to see
if all the pages allocated were used and the ones not used are
properly freed.
* tag 'ftrace-v6.19-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
ftrace: Do not over-allocate ftrace memory
nohz.nr_cpus was observed as contended cacheline when running
enterprise workload on large systems.
Fundamental scalability challenge with nohz.idle_cpus_mask
and nohz.nr_cpus is the following:
(1) nohz_balancer_kick() observes (reads) nohz.nr_cpus
(or nohz.idle_cpu_mask) and nohz.has_blocked to see whether there's
any nohz balancing work to do, in every scheduler tick.
(2) nohz_balance_enter_idle() and nohz_balance_exit_idle()
(through nohz_balancer_kick() via sched_tick()) modify (write)
nohz.nr_cpus (and/or nohz.idle_cpu_mask) and nohz.has_blocked.
The characteristic frequencies are the following:
(1) nohz_balancer_kick() happens at scheduler (busy)tick frequency
on CPU(which has not gone idle). This is a relatively constant
frequency in the ~1 kHz range or lower.
(2) happens at idle enter/exit frequency on every CPU that goes to idle.
This is workload dependent, but can easily be hundreds of kHz for
IO-bound loads and high CPU counts. Ie. can be orders of magnitude
higher than (1), in which case a cachemiss at every invocation of (1)
is almost inevitable. idle exit will trigger (1) on the CPU
which is coming out of idle.
There's two types of costs from these functions:
(A) scheduler tick cost via (1): this happens on busy CPUs too, and is
thus a primary scalability cost. But the rate here is constant and
typically much lower than (B), hence the absolute benefit to workload
scalability will be lower as well.
(B) idle cost via (2): going-to-idle and coming-from-idle costs are
secondary concerns, because they impact power efficiency more than
they impact scalability. But in terms of absolute cost this scales
up with nr_cpus as well, and a much faster rate, and thus may also
approach and negatively impact system limits like
memory bus/fabric bandwidth.
Note that nohz.idle_cpus_mask and nohz.nr_cpus may appear to reside in the
same cacheline, however under CONFIG_CPUMASK_OFFSTACK=y the backing storage
for nohz.idle_cpus_mask will be elsewhere. With CPUMASK_OFFSTACK=n,
the nohz.idle_cpus_mask and rest of nohz fields are in different cachelines
under typical NR_CPUS=512/2048. This implies two separate cachelines
being dirtied upon idle entry / exit.
nohz.nr_cpus can be derived from the mask itself. Its usage doesn't warrant
a functionally correct value. This means one less cacheline being dirtied in
idle entry/exit path which helps to save some bus bandwidth w.r.t to those
nohz functions(approx 50%). This in turn helps to improve enterprise
workload throughput.
On system with 480 CPUs, running "hackbench 40 process 10000 loops"
(Avg of 3 runs)
baseline:
0.81% hackbench [k] nohz_balance_exit_idle
0.21% hackbench [k] nohz_balancer_kick
0.09% swapper [k] nohz_run_idle_balance
With patch:
0.35% hackbench [k] nohz_balance_exit_idle
0.09% hackbench [k] nohz_balancer_kick
0.07% swapper [k] nohz_run_idle_balance
[Ingo Molnar: scalability analysis changlog]
Reviewed-and-tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://patch.msgid.link/20260115073524.376643-4-sshegde@linux.ibm.com
These days most of the system have multi cores. The likelyhood of
at least one or more CPUs in nohz (idle state) is higher.
Give accurate hint to the branch predictor.
Reviewed-and-tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://patch.msgid.link/20260115073524.376643-3-sshegde@linux.ibm.com
Current code does.
- Read nohz.nr_cpus
- Check if the time has passed to do NOHZ idle balance
Instead do this.
- Check if the time has passed to do NOHZ idle balance
- Read nohz.nr_cpus
This will skip the read most of the time in normal system usage.
i.e when there are nohz.nr_cpus (system is not 100% busy).
Note that when there are no idle CPUs(100% busy), even if the flag gets
set to NOHZ_STATS_KICK | NOHZ_NEXT_KICK, find_new_ilb will fail and
there will be no NOHZ idle balance. In such cases there will be a very
narrow window where, kick_ilb will be called un-necessarily.
However current functionality is still retained.
Note: This patch doesn't solve any cacheline overheads. No improvement
in performance apart from saving a few cycles of reading nohz.nr_cpus
Reviewed-and-tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://patch.msgid.link/20260115073524.376643-2-sshegde@linux.ibm.com
The avg_vruntime comment contains a couple of mathematical notation
issues:
- The summation over w_i * (V - v_i) is written in an ambiguous form
- The delta term refers to v instead of v0, which is inconsistent
with the code and preceding explanation
Fix these to make the comment mathematically correct and consistent
with the implementation.
Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260114090035.19033-1-zhanxusheng@xiaomi.com
Commit adcc3bfa88 ("sched: Adapt sched tracepoints for RV task model")
added a tracepoint to the need_resched action that can be triggered also
by set_tsk_need_resched.
This function was previously accessible from out-of-tree modules but
it's no longer available because the __trace_set_need_resched() symbol
is not exported (together with the tracepoint itself, which was exported
in a separate patch) and building such modules fails.
Export __trace_set_need_resched to modules to fix those build issues.
Fixes: adcc3bfa88 ("sched: Adapt sched tracepoints for RV task model")
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Link: https://patch.msgid.link/20260112140413.362202-1-gmonaco@redhat.com
Pierre reported hitting balance callback warnings for deadline tasks
after commit 6455ad5346 ("sched: Move sched_class::prio_changed()
into the change pattern").
It turns out that DEQUEUE_SAVE+ENQUEUE_RESTORE does not preserve DL
priority and subsequently trips a balance pass -- where one was not
expected.
From discussion with Juri and Luca, the purpose of this clause was to
deal with tasks new to DL and all those sites will have MOVE set (as
well as CLASS, but MOVE is move conservative at this point).
Per the previous patches MOVE is audited to always run the balance
callbacks, so switch enqueue_dl_entity() to use MOVE for this case.
Fixes: 6455ad5346 ("sched: Move sched_class::prio_changed() into the change pattern")
Reported-by: Pierre Gondois <pierre.gondois@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Pierre Gondois <pierre.gondois@arm.com>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://patch.msgid.link/20260114130528.GB831285@noisy.programming.kicks-ass.net
While FIFO/RR have static priority, DEADLINE is a dynamic priority
scheme. Notably it has static priority -1. Do not assume the priority
doesn't change for deadline tasks just because the static priority
doesn't change.
This ensures DL always sees {DE,EN}QUEUE_MOVE where appropriate.
Fixes: ff77e46853 ("sched/rt: Fix PI handling vs. sched_setscheduler()")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Pierre Gondois <pierre.gondois@arm.com>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://patch.msgid.link/20260114130528.GB831285@noisy.programming.kicks-ass.net
The {DE,EN}QUEUE_MOVE flag indicates a task is allowed to change
priority, which means there could be balance callbacks queued.
Therefore audit all MOVE users and make sure they do run balance
callbacks before dropping rq-lock.
Fixes: 6455ad5346 ("sched: Move sched_class::prio_changed() into the change pattern")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Pierre Gondois <pierre.gondois@arm.com>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://patch.msgid.link/20260114130528.GB831285@noisy.programming.kicks-ass.net
When setup_new_dl_entity() is called from enqueue_task_dl() ->
enqueue_dl_entity(), the rq-clock should already be updated, and
calling update_rq_clock() again is not right.
Move the update_rq_clock() to the one other caller of
setup_new_dl_entity(): sched_init_dl_server().
Fixes: 9f239df555 ("sched/deadline: Initialize dl_servers after SMP")
Reported-by: Pierre Gondois <pierre.gondois@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Pierre Gondois <pierre.gondois@arm.com>
Link: https://patch.msgid.link/20260113115622.GA831285@noisy.programming.kicks-ass.net
Pratheek tripped a WARN and noted the following issue:
> Inspecting the set of events that led to the warning being triggered
> showed the following:
>
> systemd-1 [008] dN.31 ...: do_set_cpus_allowed: set_cpus_allowed begin!
>
> systemd-1 [008] dN.31 ...: sched_change_begin: Begin!
> systemd-1 [008] dN.31 ...: sched_change_begin: Before dequeue_task()!
> systemd-1 [008] dN.31 ...: update_curr_dl_se: update_curr_dl_se: ENQUEUE_REPLENISH
> systemd-1 [008] dN.31 ...: enqueue_dl_entity: enqueue_dl_entity: ENQUEUE_REPLENISH
> systemd-1 [008] dN.31 ...: replenish_dl_entity: Replenish before: 14815760217
> systemd-1 [008] dN.31 ...: replenish_dl_entity: Replenish after: 14816960047
> systemd-1 [008] dN.31 ...: sched_change_begin: Before put_prev_task()!
>
> systemd-1 [008] dN.31 ...: sched_change_end: Before enqueue_task()!
> systemd-1 [008] dN.31 ...: sched_change_end: Before put_prev_task()!
> systemd-1 [008] dN.31 ...: prio_changed_dl: Queuing pull task on prio change: 14815760217 -> 14816960047
> systemd-1 [008] dN.31 ...: prio_changed_dl: Queuing balance callback!
> systemd-1 [008] dN.31 ...: sched_change_end: End!
>
> systemd-1 [008] dN.31 ...: do_set_cpus_allowed: set_cpus_allowed end!
> systemd-1 [008] dN.21 ...: __schedule: Woops! Balance callback found!
>
> 1. sched_change_begin() from guard(sched_change) in
> do_set_cpus_allowed() stashes the priority, which for the deadline
> task, is "p->dl.deadline".
> 2. The dequeue of the deadline task replenishes the deadline.
> 3. The task is enqueued back after guard's scope ends and since there is
> no *_CLASS flags set, sched_change_end() calls
> dl_sched_class->prio_changed() which compares the deadline.
> 4. Since deadline was moved on dequeue, prio_changed_dl() sees the value
> differ from the stashed value and queues a balance pull callback.
> 5. do_set_cpus_allowed() finishes and drops the rq_lock without doing a
> do_balance_callbacks().
> 6. Grabbing the rq_lock() at subsequent __schedule() triggers the
> warning since the balance pull callback was never executed before
> dropping the lock.
Meaning get_prio_dl() ought to update current and return an up-to-date
value.
Fixes: 6455ad5346 ("sched: Move sched_class::prio_changed() into the change pattern")
Reported-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://patch.msgid.link/20260106104113.GX3707891@noisy.programming.kicks-ass.net
- four kerneldoc fixes from Bagas Sanjaya
- four DAMON fixes from SeongJae
- four mremap VMA-related fixes from Lorenzo
- various singletons - please see the changelogs for details
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaWkP3gAKCRDdBJ7gKXxA
jnmTAQCBiCyw7Pvvak5OM8Of0qQ8m/jQmNMBPRrd5M3tAAEOlwD9F7E6RjcmyItU
k+sboDgNKTvgdRHTmRzj1t96aXJrSQQ=
=OxTh
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2026-01-15-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
- kerneldoc fixes from Bagas Sanjaya
- DAMON fixes from SeongJae
- mremap VMA-related fixes from Lorenzo
- various singletons - please see the changelogs for details
* tag 'mm-hotfixes-stable-2026-01-15-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (30 commits)
drivers/dax: add some missing kerneldoc comment fields for struct dev_dax
mm: numa,memblock: include <asm/numa.h> for 'numa_nodes_parsed'
mailmap: add entry for Daniel Thompson
tools/testing/selftests: fix gup_longterm for unknown fs
mm/page_alloc: prevent pcp corruption with SMP=n
iommu/sva: include mmu_notifier.h header
mm: kmsan: fix poisoning of high-order non-compound pages
tools/testing/selftests: add forked (un)/faulted VMA merge tests
mm/vma: enforce VMA fork limit on unfaulted,faulted mremap merge too
tools/testing/selftests: add tests for !tgt, src mremap() merges
mm/vma: fix anon_vma UAF on mremap() faulted, unfaulted merge
mm/zswap: fix error pointer free in zswap_cpu_comp_prepare()
mm/damon/sysfs-scheme: cleanup access_pattern subdirs on scheme dir setup failure
mm/damon/sysfs-scheme: cleanup quotas subdirs on scheme dir setup failure
mm/damon/sysfs: cleanup attrs subdirs on context dir setup failure
mm/damon/sysfs: cleanup intervals subdirs on attrs dir setup failure
mm/damon/core: remove call_control in inactive contexts
powerpc/watchdog: add support for hardlockup_sys_info sysctl
mips: fix HIGHMEM initialization
mm/hugetlb: ignore hugepage kernel args if hugepages are unsupported
...
The pg_remaining calculation in ftrace_process_locs() assumes that
ENTRIES_PER_PAGE multiplied by 2^order equals the actual capacity of the
allocated page group. However, ENTRIES_PER_PAGE is PAGE_SIZE / ENTRY_SIZE
(integer division). When PAGE_SIZE is not a multiple of ENTRY_SIZE (e.g.
4096 / 24 = 170 with remainder 16), high-order allocations (like 256 pages)
have significantly more capacity than 256 * 170. This leads to pg_remaining
being underestimated, which in turn makes skip (derived from skipped -
pg_remaining) larger than expected, causing the WARN(skip != remaining)
to trigger.
Extra allocated pages for ftrace: 2 with 654 skipped
WARNING: CPU: 0 PID: 0 at kernel/trace/ftrace.c:7295 ftrace_process_locs+0x5bf/0x5e0
A similar problem in ftrace_allocate_records() can result in allocating
too many pages. This can trigger the second warning in
ftrace_process_locs().
Extra allocated pages for ftrace
WARNING: CPU: 0 PID: 0 at kernel/trace/ftrace.c:7276 ftrace_process_locs+0x548/0x580
Use the actual capacity of a page group to determine the number of pages
to allocate. Have ftrace_allocate_pages() return the number of allocated
pages to avoid having to calculate it. Use the actual page group capacity
when validating the number of unused pages due to skipped entries.
Drop the definition of ENTRIES_PER_PAGE since it is no longer used.
Cc: stable@vger.kernel.org
Fixes: 4a3efc6baf ("ftrace: Update the mcount_loc check of skipped entries")
Link: https://patch.msgid.link/20260113152243.3557219-1-linux@roeck-us.net
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
I got a report that a task is stuck in perf_event_exit_task() waiting
for global_ctx_data_rwsem. On large systems with lots threads, it'd
have performance issues when it grabs the lock to iterate all threads
in the system to allocate the context data.
And it'd block task exit path which is problematic especially under
memory pressure.
perf_event_open
perf_event_alloc
attach_perf_ctx_data
attach_global_ctx_data
percpu_down_write (global_ctx_data_rwsem)
for_each_process_thread
alloc_task_ctx_data
do_exit
perf_event_exit_task
percpu_down_read (global_ctx_data_rwsem)
It should not hold the global_ctx_data_rwsem on the exit path. Let's
skip allocation for exiting tasks and free the data carefully.
Reported-by: Rosalie Fang <rosaliefang@google.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260112165157.1919624-1-namhyung@kernel.org
Commit a9af76a787 ("watchdog: add sys_info sysctls to dump sys info on
system lockup") adds 'hardlock_sys_info' systcl knob for general kernel
watchdog to control what kinds of system debug info to be dumped on
hardlockup.
Add similar support in powerpc watchdog code to make the sysctl knob more
general, which also fixes a compiling warning in general watchdog code
reported by 0day bot.
Link: https://lkml.kernel.org/r/20251231080309.39642-1-feng.tang@linux.alibaba.com
Fixes: a9af76a787 ("watchdog: add sys_info sysctls to dump sys info on system lockup")
Signed-off-by: Feng Tang <feng.tang@linux.alibaba.com>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202512030920.NFKtekA7-lkp@intel.com/
Suggested-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
If the previous kernel enabled KHO but did not call kho_finalize() (e.g.,
CONFIG_LIVEUPDATE=n or userspace skipped the finalization step), the
'preserved-memory-map' property in the FDT remains empty/zero.
Previously, kho_populate() would succeed regardless of the memory map's
state, reserving the incoming scratch regions in memblock. However,
kho_memory_init() would later fail to deserialize the empty map. By that
time, the scratch regions were already registered, leading to partial
initialization and subsequent list corruption (freeing scratch area twice)
during kho_init().
Move the validation of the preserved memory map earlier into
kho_populate(). If the memory map is empty/NULL:
1. Abort kho_populate() immediately with -ENOENT.
2. Do not register or reserve the incoming scratch memory, allowing the new
kernel to reclaim those pages as standard free memory.
3. Leave the global 'kho_in' state uninitialized.
Consequently, kho_memory_init() sees no active KHO context
(kho_in.mem_chunks_phys is 0) and falls back to kho_reserve_scratch(),
allocating fresh scratch memory as if it were a standard cold boot.
Link: https://lkml.kernel.org/r/20251223140140.2090337-1-pasha.tatashin@soleen.com
Fixes: de51999e68 ("kho: allow memory preservation state updates after finalization")
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reported-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Closes: https://lore.kernel.org/all/20251218215613.GA17304@ranerica-svr.sc.intel.com
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Tested-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
For a `gotox rX` instruction the rX register should be marked as used
in the compute_insn_live_regs() function. Fix this.
Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Link: https://lore.kernel.org/r/20260114162544.83253-2-a.s.protopopov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cross-merge BPF and other fixes after downstream PR.
No conflicts.
Adjacent:
Auto-merging MAINTAINERS
Auto-merging Makefile
Auto-merging kernel/bpf/verifier.c
Auto-merging kernel/sched/ext.c
Auto-merging mm/memcontrol.c
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
On smaller systems, e.g. embedded arm64, it is common for all memory
to end up in ZONE_DMA32 or even ZONE_DMA. In such cases it is redundant
to allocate a nominal pool for an empty higher zone that just ends up
coming from a lower zone that should already have its own pool anyway.
We already have logic to skip allocating a ZONE_DMA pool when that is
empty, so generalise that to save memory in the case of other zones too.
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Tested-by: Vladimir Kondratiev <vladimir.kondratiev@mobileye.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/8ab8d8a620dee0109f33f5cb63d6bfeed35aac37.1768230104.git.robin.murphy@arm.com
If CONFIG_ZONE_DMA32 is enabled, but we have not allocated the
corresponding atomic_pool_dma32, dma_guess_pool() may return the NULL
value of that and fail a GFP_DMA32 allocation without trying to fall
back to other pools which may exist. Furthermore, if no GFP_DMA pool
exists, it is preferable to try GFP_DMA32 rather than immediately fall
back to GFP_KERNEL with even less chance of success. Improve matters
by encoding an explicit order of pool preference for each flag.
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Tested-by: Vladimir Kondratiev <vladimir.kondratiev@mobileye.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/c846b1a2f43295cac926c7af2ce907f62baec518.1768230104.git.robin.murphy@arm.com
The cache parameter of getcpu() is useless nowadays for various reasons.
* It is never passed by userspace for either the vDSO or syscalls.
* It is never used by the kernel.
* It could not be made to work on the current vDSO architecture.
* The structure definition is not part of the UAPI headers.
* vdso_getcpu() is superseded by restartable sequences in any case.
Remove the struct and its header.
As a side-effect this gets rid of an unwanted inclusion of the linux/
header namespace from vDSO code.
[ tglx: Adapt to s390 upstream changes */
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Heiko Carstens <hca@linux.ibm.com> # s390
Link: https://patch.msgid.link/20251230-getcpu_cache-v3-1-fb9c5f880ebe@linutronix.de
The insn_array_map_direct_value_addr() function currently returns
-EINVAL when the offset within the map is invalid. Change this to
return -EACCES, so that it is consistent with similar boundary access
checks in the verifier.
Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Link: https://lore.kernel.org/r/20260111153047.8388-3-a.s.protopopov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The map_direct_value_addr() function of the instruction
array map incorrectly adds offset to the resulting address.
This is a bug, because later the resolve_pseudo_ldimm64()
function adds the offset. Fix it. Corresponding selftests
are added in a consequent commit.
Fixes: 493d9e0d60 ("bpf, x86: add support for indirect jumps")
Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Link: https://lore.kernel.org/r/20260111153047.8388-2-a.s.protopopov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Teach the BPF verifier to treat pointers to struct types returned from
BPF kfuncs as implicitly trusted (PTR_TO_BTF_ID | PTR_TRUSTED) by
default. Returning untrusted pointers to struct types from BPF kfuncs
should be considered an exception only, and certainly not the norm.
Update existing selftests to reflect the change in register type
printing (e.g. `ptr_` becoming `trusted_ptr_` in verifier error
messages).
Link: https://lore.kernel.org/bpf/aV4nbCaMfIoM0awM@google.com/
Signed-off-by: Matt Bobrowski <mattbobrowski@google.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260113083949.2502978-1-mattbobrowski@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Currently, vmlinux BTF is unconditionally sorted during
the build phase. The function btf_find_by_name_kind
executes the binary search branch, so find_bpffs_btf_enums
can be optimized by using btf_find_by_name_kind.
Signed-off-by: Donglin Peng <pengdonglin@xiaomi.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20260109130003.3313716-10-dolinux.peng@gmail.com
Currently, vmlinux and kernel module BTFs are unconditionally
sorted during the build phase, with named types placed at the
end. Thus, anonymous types should be skipped when starting the
search. In my vmlinux BTF, the number of anonymous types is
61,747, which means the loop count can be reduced by 61,747.
Signed-off-by: Donglin Peng <pengdonglin@xiaomi.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20260109130003.3313716-9-dolinux.peng@gmail.com
This patch checks whether the BTF is sorted by name in ascending order.
If sorted, binary search will be used when looking up types.
Specifically, vmlinux and kernel module BTFs are always sorted during
the build phase with anonymous types placed before named types, so we
only need to identify the starting ID of named types.
Signed-off-by: Donglin Peng <pengdonglin@xiaomi.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20260109130003.3313716-8-dolinux.peng@gmail.com
Improve btf_find_by_name_kind() performance by adding binary search
support for sorted types. Falls back to linear search for compatibility.
Signed-off-by: Donglin Peng <pengdonglin@xiaomi.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20260109130003.3313716-7-dolinux.peng@gmail.com
There are typically a lot of PMUs registered, but in many cases only few
of them have an event registered (like the "cpu" PMU in the presence of
the watchdog). As the mutex is already held, it's safe to just check for
existing events before doing the cross CPU call.
This change saves tens of milliseconds from kexec time (perceived as
steal time during a hypervisor host update), with <2ms remaining for
this step in the shutdown. There might be additional potential for
parallelization or we could just disable performance monitoring during
the actual shutdown and be less graceful about it.
Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
During CPU offlining the interrupts affined to that CPU are moved to other
online CPUs, which might break the original affinity mask if the outgoing
CPU was the last online CPU in that mask. This change is not propagated to
irq_desc::affinity_notify(), which leaves users of the affinity notifier
mechanism with stale information.
Avoid this by scheduling affinity change notification work for interrupts
that were affined to the CPU being offlined, if the new target CPU is not
part of the original affinity mask.
Since irq_set_affinity_locked() uses the same logic to schedule affinity
change notification work, split out this logic into a dedicated function
and use that at both places.
[ tglx: Removed the EXPORT(), removed the !SMP stub, moved the prototype,
added a lockdep assert instead of a comment, fixed up coding style
and name space. Polished and clarified the change log ]
Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260113143727.1041265-1-imran.f.khan@oracle.com
It accepts ERR_PTR() for name and does the right thing in that case.
That allows to simplify the logics in callers, making them trivial
to switch to CLASS(filename).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
... or visible outside of audit, really. Note that references
held in delayed_filename always have refcount 1, and from the
moment of complete_getname() or equivalent point in getname...()
there won't be any references to struct filename instance left
in places visible to other threads.
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Originally we tried to avoid multiple insertions into audit names array
during retry loop by a cute hack - memorize the userland pointer and
if there already is a match, just grab an extra reference to it.
Cute as it had been, it had problems - two identical pointers had
audit aux entries merged, two identical strings did not. Having
different behaviour for syscalls that differ only by addresses of
otherwise identical string arguments is obviously wrong - if nothing
else, compiler can decide to merge identical string literals.
Besides, this hack does nothing for non-audited processes - they get
a fresh copy for retry. It's not time-critical, but having behaviour
subtly differ that way is bogus.
These days we have very few places that import filename more than once
(9 functions total) and it's easy to massage them so we get rid of all
re-imports. With that done, we don't need audit_reusename() anymore.
There's no need to memorize userland pointer either.
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
cilium bpf_wiregard.bpf.c when compiled with -O1 fails to load
with the following verifier log:
192: (79) r2 = *(u64 *)(r10 -304) ; R2=pkt(r=40) R10=fp0 fp-304=pkt(r=40)
...
227: (85) call bpf_skb_store_bytes#9 ; R0=scalar()
228: (bc) w2 = w0 ; R0=scalar() R2=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff))
229: (c4) w2 s>>= 31 ; R2=scalar(smin=0,smax=umax=0xffffffff,smin32=-1,smax32=0,var_off=(0x0; 0xffffffff))
230: (54) w2 &= -134 ; R2=scalar(smin=0,smax=umax=umax32=0xffffff7a,smax32=0x7fffff7a,var_off=(0x0; 0xffffff7a))
...
232: (66) if w2 s> 0xffffffff goto pc+125 ; R2=scalar(smin=umin=umin32=0x80000000,smax=umax=umax32=0xffffff7a,smax32=-134,var_off=(0x80000000; 0x7fffff7a))
...
238: (79) r4 = *(u64 *)(r10 -304) ; R4=scalar() R10=fp0 fp-304=scalar()
239: (56) if w2 != 0xffffff78 goto pc+210 ; R2=0xffffff78 // -136
...
258: (71) r1 = *(u8 *)(r4 +0)
R4 invalid mem access 'scalar'
The error might confuse most bpf authors, since fp-304 slot had 'pkt'
pointer at insn 192 and became 'scalar' at 238. That happened because
bpf_skb_store_bytes() clears all packet pointers including those in
the stack. On the first glance it might look like a bug in the source
code, since ctx->data pointer should have been reloaded after the call
to bpf_skb_store_bytes().
The relevant part of cilium source code looks like this:
// bpf/lib/nodeport.h
int dsr_set_ipip6()
{
if (ctx_adjust_hroom(...))
return DROP_INVALID; // -134
if (ctx_store_bytes(...))
return DROP_WRITE_ERROR; // -141
return 0;
}
bool dsr_fail_needs_reply(int code)
{
if (code == DROP_FRAG_NEEDED) // -136
return true;
return false;
}
tail_nodeport_ipv6_dsr()
{
ret = dsr_set_ipip6(...);
if (!IS_ERR(ret)) {
...
} else {
if (dsr_fail_needs_reply(ret))
return dsr_reply_icmp6(...);
}
}
The code doesn't have arithmetic shift by 31 and it reloads ctx->data
every time it needs to access it. So it's not a bug in the source code.
The reason is DAGCombiner::foldSelectCCToShiftAnd() LLVM transformation:
// If this is a select where the false operand is zero and the compare is a
// check of the sign bit, see if we can perform the "gzip trick":
// select_cc setlt X, 0, A, 0 -> and (sra X, size(X)-1), A
// select_cc setgt X, 0, A, 0 -> and (not (sra X, size(X)-1)), A
The conditional branch in dsr_set_ipip6() and its return values
are optimized into BPF_ARSH plus BPF_AND:
227: (85) call bpf_skb_store_bytes#9
228: (bc) w2 = w0
229: (c4) w2 s>>= 31 ; R2=scalar(smin=0,smax=umax=0xffffffff,smin32=-1,smax32=0,var_off=(0x0; 0xffffffff))
230: (54) w2 &= -134 ; R2=scalar(smin=0,smax=umax=umax32=0xffffff7a,smax32=0x7fffff7a,var_off=(0x0; 0xffffff7a))
after insn 230 the register w2 can only be 0 or -134,
but the verifier approximates it, since there is no way to
represent two scalars in bpf_reg_state.
After fallthough at insn 232 the w2 can only be -134,
hence the branch at insn
239: (56) if w2 != -136 goto pc+210
should be always taken, and trapping insn 258 should never execute.
LLVM generated correct code, but the verifier follows impossible
path and rejects valid program. To fix this issue recognize this
special LLVM optimization and fork the verifier state.
So after insn 229: (c4) w2 s>>= 31
the verifier has two states to explore:
one with w2 = 0 and another with w2 = 0xffffffff
which makes the verifier accept bpf_wiregard.c
A similar pattern exists were OR operation is used in place of the AND
operation, the verifier detects that pattern as well by forking the
state before the OR operation with a scalar in range [-1,0].
Note there are 20+ such patterns in bpf_wiregard.o compiled
with -O1 and -O2, but they're rarely seen in other production
bpf programs, so push_stack() approach is not a concern.
Reported-by: Hao Sun <sunhao.th@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Co-developed-by: Puranjay Mohan <puranjay@kernel.org>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20260112201424.816836-2-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Replace the pattern of declaring a local regs array from cur_regs()
and then indexing into it with the more concise reg_state() helper.
This simplifies the code by eliminating intermediate variables and
makes register access more consistent throughout the verifier.
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20260113134826.2214860-1-mykyta.yatsenko5@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The tracepoints sched_entry, sched_exit and sched_set_need_resched
are not exported to tracefs as trace events, this allows only kernel
code to access them. Helper modules like [1] can be used to still have
the tracepoints available to ftrace for debugging purposes, but they do
rely on the tracepoints being exported.
Export the 3 not exported tracepoints.
Note that sched_set_state is already exported as the macro is called
from modules.
[1] - https://github.com/qais-yousef/sched_tp.git
Fixes: adcc3bfa88 ("sched: Adapt sched tracepoints for RV task model")
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Link: https://patch.msgid.link/20251205131621.135513-9-gmonaco@redhat.com
The deadline server can currently stop due to idle although fair tasks
are runnable. This happens essentially when:
* the server is set to idle, a task wakes up, the server stops
* a task wakes up, the server sets itself to idle and stops right away
Address both cases by clearing the server idle flag whenever a fair task
wakes up and accounting also for pending tasks in the definition of idle.
Fixes: f5a538c07d ("sched/deadline: Fix dl_server stop condition")
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260113085159.114226-3-gmonaco@redhat.com
A fix for the dl_server 'requires' idle_cpu() usage, which made me
note that it and available_idle_cpu() are extern function calls.
And while idle_cpu() is used outside of kernel/sched/,
available_idle_cpu() is not.
This makes it hard to make idle_cpu() an inline helper, so provide
idle_rq() and implement idle_cpu() and available_idle_cpu() using
that.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
The access rule for local_cpu_mask_dl requires it to be called on the
local CPU with preemption disabled. However, dl_add_task_root_domain()
currently violates this rule.
Without preemption disabled, the following race can occur:
1. ThreadA calls dl_add_task_root_domain() on CPU 0
2. Gets pointer to CPU 0's local_cpu_mask_dl
3. ThreadA is preempted and migrated to CPU 1
4. ThreadA continues using CPU 0's local_cpu_mask_dl
5. Meanwhile, the scheduler on CPU 0 calls find_later_rq() which also
uses local_cpu_mask_dl (with preemption properly disabled)
6. Both contexts now corrupt the same per-CPU buffer concurrently
Fix this by moving the local_cpu_mask_dl access to the preemption
disabled section.
Closes: https://lore.kernel.org/lkml/aSBjm3mN_uIy64nz@jlelli-thinkpadt14gen4.remote.csb
Fixes: 318e18ed22 ("sched/deadline: Walk up cpuset hierarchy to decide root domain when hot-unplug")
Reported-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Pingfan Liu <piliu@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Waiman Long <longman@redhat.com>
Link: https://patch.msgid.link/20251125032630.8746-3-piliu@redhat.com
The comments above dl_get_task_effective_cpus() and
dl_add_task_root_domain() already explain how to fetch a valid
root domain and protect against races. There's no need to repeat
this inside dl_add_task_root_domain(). Remove the redundant comment
to keep the code clean.
No functional change.
Signed-off-by: Pingfan Liu <piliu@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Waiman Long <longman@redhat.com>
Link: https://patch.msgid.link/20251125032630.8746-2-piliu@redhat.com
Since ktime_t has become an alias to s64, these helpers are unnecessary.
Migrate the few remaining users to the regular helpers and remove the
now dead code.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260107-hrtimer-header-cleanup-v1-3-1a698ef0ddae@linutronix.de
This constant is only used in a single place and is has a very generic
name polluting the global namespace.
Move the constant closer to its only user.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260107-hrtimer-header-cleanup-v1-2-1a698ef0ddae@linutronix.de
desc_set_defaults() has a loop to clear the per-cpu counters kstats_irq.
This is only needed in free_desc(), which is used with non-sparse IRQs so
that the interrupt descriptor can be recycled. For newly allocated
descriptors, the memory comes from alloc_percpu() and is already zeroed
out.
Move the loop to free_desc() to avoid wasting time unnecessarily.
Signed-off-by: Luigi Rizzo <lrizzo@google.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260112083234.2665832-1-lrizzo@google.com
For redirected interrupts, irq_chip_redirect_set_affinity() does not
update the effective affinity mask, which then triggers the warning in
irq_validate_effective_affinity(). Also, because the effective affinity
mask is empty, the cpumask_test_cpu(smp_processor_id(), m) condition in
demux_redirect_remote() is always false, and the interrupt is always
redirected, even if it's already running on the target CPU.
Set the effective affinity mask to be the same as the requested affinity
mask. It's worth noting that irq_do_set_affinity() filters out offline
CPUs before calling chip->irq_set_affinity() (unless `force` is set), so
the mask passed to irq_chip_redirect_set_affinity() is already filtered.
The solution is not ideal because it may lie about the effective
affinity of the demultiplexed ("child") interrupt. If the requested
affinity mask includes multiple CPUs, the effective affinity, in
reality, is the intersection between the requested mask and the
demultiplexing ("parent") interrupt's effective affinity mask, plus
the first CPU in the requested mask.
Accurately describing the effective affinity of the demultiplexed
interrupt is not trivial because it requires keeping track of the
demultiplexing interrupt's effective affinity. That is tricky in the
context of CPU hot(un)plugging, where interrupt migration ordering is
not guaranteed. The solution in the initial version of the fixed patch,
which stored the first CPU of the demultiplexing interrupt's effective
affinity in the `target_cpu` field, has its own drawbacks and
limitations.
Fixes: fcc1d0dabd ("genirq: Add interrupt redirection infrastructure")
Reported-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Radu Rendec <rrendec@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: Jon Hunter <jonathanh@nvidia.com>
Link: https://patch.msgid.link/20260112211402.2927336-1-rrendec@redhat.com
Closes: https://lore.kernel.org/all/44509520-f29b-4b8a-8986-5eae3e022eb7@nvidia.com/
IRQF_ONESHOT disables the interrupt source until after the threaded
handler completed its work. This is needed to allow the threaded handler
to run - otherwise the CPU will get back to the interrupt handler
because the interrupt source remains active and the threaded handler
will not able to do its work.
Specifying IRQF_ONESHOT without a threaded handler does not make sense.
It could be a leftover if the handler _was_ threaded and changed back to
primary and the flag was not removed. This can be problematic in the
`threadirqs' case because the handler is exempt from forced-threading.
This in turn can become a problem on a PREEMPT_RT system if the handler
attempts to acquire sleeping locks.
Warn about missing threaded handlers with the IRQF_ONESHOT flag.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Reviewed-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Link: https://patch.msgid.link/20260112134013.eQWyReHR@linutronix.de
Ensure that registered destructor kfuncs have the same type
as btf_dtor_kfunc_t to avoid a kernel panic on systems with
CONFIG_CFI enabled.
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20260110082548.113748-10-samitolvanen@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
With CONFIG_CFI enabled, the kernel strictly enforces that indirect
function calls use a function pointer type that matches the target
function. I ran into the following type mismatch when running BPF
self-tests:
CFI failure at bpf_obj_free_fields+0x190/0x238 (target:
bpf_crypto_ctx_release+0x0/0x94; expected type: 0xa488ebfc)
Internal error: Oops - CFI: 00000000f2008228 [#1] SMP
...
As bpf_crypto_ctx_release() is also used in BPF programs and using
a void pointer as the argument would make the verifier unhappy, add
a simple stub function with the correct type and register it as the
destructor kfunc instead.
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Tested-by: Viktor Malik <vmalik@redhat.com>
Link: https://lore.kernel.org/r/20260110082548.113748-7-samitolvanen@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
We already added lockdep_assert_cpuset_lock_held(), use this new function
to keep consistency.
Signed-off-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn>
Signed-off-by: Tejun Heo <tj@kernel.org>
As stated in commit 1c09b195d3 ("cpuset: fix a regression in validating
config change"), it is not allowed to clear masks of a cpuset if
there're tasks in it. This is specific to v1 since empty "cpuset.cpus"
or "cpuset.mems" will cause the v2 cpuset to inherit the effective CPUs
or memory nodes from its parent. So it is OK to have empty cpus or mems
even if there are tasks in the cpuset.
Move this empty cpus/mems check in validate_change() to
cpuset1_validate_change() to allow more flexibility in setting
cpus or mems in v2. cpuset_is_populated() needs to be moved into
cpuset-internal.h as it is needed by the empty cpus/mems checking code.
Also add a test case to test_cpuset_prs.sh to verify that.
Reported-by: Chen Ridong <chenridong@huaweicloud.com>
Closes: https://lore.kernel.org/lkml/7a3ec392-2e86-4693-aa9f-1e668a668b9c@huaweicloud.com/
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Currently, when setting a cpuset's cpuset.cpus to a value that conflicts
with the cpuset.cpus/cpuset.cpus.exclusive of a sibling partition,
the sibling's partition state becomes invalid. This is overly harsh and
is probably not necessary.
The cpuset.cpus.exclusive control file, if set, will override the
cpuset.cpus of the same cpuset when creating a cpuset partition.
So cpuset.cpus has less priority than cpuset.cpus.exclusive in setting up
a partition. However, it cannot override a conflicting cpuset.cpus file
in a sibling cpuset and the partition creation process will fail. This
is inconsistent. That will also make using cpuset.cpus.exclusive less
valuable as a tool to set up cpuset partitions as the users have to
check if such a cpuset.cpus conflict exists or not.
Fix these problems by making sure that once a cpuset.cpus.exclusive
is set without failure, it will always be allowed to form a valid
partition as long as at least one CPU can be granted from its parent
irrespective of the state of the siblings' cpuset.cpus values. Of
course, setting cpuset.cpus.exclusive will fail if it conflicts with
the cpuset.cpus.exclusive or the cpuset.cpus.exclusive.effective value
of a sibling.
Partition can still be created by setting only cpuset.cpus without
setting cpuset.cpus.exclusive. However, any conflicting CPUs in sibling's
cpuset.cpus.exclusive.effective and cpuset.cpus.exclusive values will
be removed from its cpuset.cpus.exclusive.effective as long as there
is still one or more CPUs left and can be granted from its parent. This
CPU stripping is currently done in rm_siblings_excl_cpus().
The new code will now try its best to enable the creation of new
partitions with only cpuset.cpus set without invalidating existing ones.
However it is not guaranteed that all the CPUs requested in cpuset.cpus
will be used in the new partition even when all these CPUs can be
granted from the parent.
This is similar to the fact that cpuset.cpus.effective may not be
able to include all the CPUs requested in cpuset.cpus. In this case,
the parent may not able to grant all the exclusive CPUs requested in
cpuset.cpus to cpuset.cpus.exclusive.effective if some of them have
already been granted to other partitions earlier.
With the creation of multiple sibling partitions by setting
only cpuset.cpus, this does have the side effect that their exact
cpuset.cpus.exclusive.effective settings will depend on the order of
partition creation if there are conflicts. Due to the exclusive nature
of the CPUs in a partition, it is not easy to make it fair other than
the old behavior of invalidating all the conflicting partitions.
For example,
# echo "0-2" > A1/cpuset.cpus
# echo "root" > A1/cpuset.cpus.partition
# cat A1/cpuset.cpus.partition
root
# cat A1/cpuset.cpus.exclusive.effective
0-2
# echo "2-4" > B1/cpuset.cpus
# echo "root" > B1/cpuset.cpus.partition
# cat B1/cpuset.cpus.partition
root
# cat B1/cpuset.cpus.exclusive.effective
3-4
# cat B1/cpuset.cpus.effective
3-4
For users who want to be sure that they can get most of the CPUs they
want, cpuset.cpus.exclusive should be used instead if they can set
it successfully without failure. Setting cpuset.cpus.exclusive will
guarantee that sibling conflicts from then onward is no longer possible.
To make this change, we have to separate out the is_cpu_exclusive()
check in cpus_excl_conflict() into a cgroup v1 only
cpuset1_cpus_excl_conflict() helper. The cpus_allowed_validate_change()
helper is now no longer needed and can be removed.
Some existing tests in test_cpuset_prs.sh are updated and new ones are
added to reflect the new behavior. The cgroup-v2.rst doc file is also
updated the clarify what exclusive CPUs will be used when a partition
is created.
Reported-by: Sun Shaojie <sunshaojie@kylinos.cn>
Closes: https://lore.kernel.org/lkml/20251117015708.977585-1-sunshaojie@kylinos.cn/
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Commit fe8cd2736e ("cgroup/cpuset: Delay setting of CS_CPU_EXCLUSIVE
until valid partition") introduced a new check to disallow the setting
of a new cpuset.cpus.exclusive value that is a superset of a sibling's
cpuset.cpus value so that there will at least be one CPU left in the
sibling in case the cpuset becomes a valid partition root. This new
check does have the side effect of failing a cpuset.cpus change that
make it a subset of a sibling's cpuset.cpus.exclusive value.
With v2, users are supposed to be allowed to set whatever value they
want in cpuset.cpus without failure. To maintain this rule, the check
is now restricted to only when cpuset.cpus.exclusive is being changed
not when cpuset.cpus is changed.
The cgroup-v2.rst doc file is also updated to reflect this change.
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Since commit f62a5d3936 ("cgroup/cpuset: Remove remote_partition_check()
& make update_cpumasks_hier() handle remote partition"), the
compute_effective_exclusive_cpumask() helper was extended to
strip exclusive CPUs from siblings when computing effective_xcpus
(cpuset.cpus.exclusive.effective). This helper was later renamed to
compute_excpus() in commit 86bbbd1f33 ("cpuset: Refactor exclusive
CPU mask computation logic").
This helper is supposed to be used consistently to compute
effective_xcpus. However, there is an exception within the callback
critical section in update_cpumasks_hier() when exclusive_cpus of a
valid partition root is empty. This can cause effective_xcpus value to
differ depending on where exactly it is last computed. Fix this by using
compute_excpus() in this case to give a consistent result.
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
If exclusive_cpus is set, effective_xcpus must be a subset of
exclusive_cpus. Currently, rm_siblings_excl_cpus() checks both
exclusive_cpus and effective_xcpus consecutively. It is simpler
to check only exclusive_cpus if non-empty or just effective_xcpus
otherwise.
No functional change is expected.
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Paravirt clock related functions are available in multiple archs.
In order to share the common parts, move the common static keys
to kernel/sched/ and remove them from the arch specific files.
Make a common paravirt_steal_clock() implementation available in
kernel/sched/cputime.c, guarding it with a new config option
CONFIG_HAVE_PV_STEAL_CLOCK_GEN, which can be selected by an arch
in case it wants to use that common variant.
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260105110520.21356-7-jgross@suse.com
All architectures supporting CONFIG_PARAVIRT share the same contents
of asm/paravirt_api_clock.h:
#include <asm/paravirt.h>
So remove all incarnations of asm/paravirt_api_clock.h and remove the
only place where it is included, as there asm/paravirt.h is included
anyway.
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com> # powerpc, scheduler bits
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260105110520.21356-6-jgross@suse.com
The monitor container source files contained a declaration and a
definition for the rv_monitor variable. The former is superfluous and
can be removed.
Remove the variable declaration from the template as well as the
existing monitor containers.
Reviewed-by: Nam Cao <namcao@linutronix.de>
Link: https://lore.kernel.org/r/20251126104241.291258-9-gmonaco@redhat.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
The header files generated by dot2c currently create enums for states
and events assigning the first element to 0. This is superfluous as it
happens automatically if no value is specified.
Also it doesn't add a comma to the last enum elements, which slightly
complicates the diff if states or events are added.
Remove the assignment to 0 and add a comma to last elements, this
simplifies the logic for the code generator.
Reviewed-by: Nam Cao <namcao@linutronix.de>
Link: https://lore.kernel.org/r/20251126104241.291258-8-gmonaco@redhat.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
The da_monitor helper functions are generated from macros of the type:
DECLARE_DA_FUNCTION(name, type) \
static void da_func_x_##name(type arg) {} \
static void da_func_y_##name(type arg) {} \
This is good to minimise code duplication but the long macros made of
skipped end of lines is rather hard to parse. Since functions are
static, the advantage of naming them differently for each monitor is
minimal.
Refactor the da_monitor.h file to minimise macros, instead of declaring
functions from macros, we simply declare them with the same name for all
monitors (e.g. da_func_x) and for any remaining reference to the monitor
name (e.g. tracepoints, enums, global variables) we use the CONCATENATE
macro.
In this way the file is much easier to maintain while keeping the same
generality.
Functions depending on the monitor types are now conditionally compiled
according to the value of RV_MON_TYPE, which must be defined in the
monitor source.
The monitor type can be specified as in the original implementation,
although it's best to keep the default implementation (unsigned char) as
not all parts of code support larger data types, and likely there's no
need.
We keep the empty macro definitions to ease review of this change with
diff tools, but cleanup is required.
Also adapt existing monitors to keep the build working.
Reviewed-by: Nam Cao <namcao@linutronix.de>
Link: https://lore.kernel.org/r/20251126104241.291258-2-gmonaco@redhat.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
In a vain attempt to consolidate the email zoo switch everything to the
kernel.org account.
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* rcu-misc.20260111a:
rcu: Reduce synchronize_rcu() latency by reporting GP kthread's CPU QS early
srcu: Use suitable gfp_flags for the init_srcu_struct_nodes()
rcu: Fix rcu_read_unlock() deadloop due to softirq
rcutorture: Correctly compute probability to invoke ->exp_current()
rcu: Make expedited RCU CPU stall warnings detect stall-end races
The RCU grace period mechanism uses a two-phase FQS (Force Quiescent
State) design where the first FQS saves dyntick-idle snapshots and
the second FQS compares them. This results in long and unnecessary latency
for synchronize_rcu() on idle systems (two FQS waits of ~3ms each with
1000HZ) whenever one FQS wait sufficed.
Some investigations showed that the GP kthread's CPU is the holdout CPU
a lot of times after the first FQS as - it cannot be detected as "idle"
because it's actively running the FQS scan in the GP kthread.
Therefore, at the end of rcu_gp_init(), immediately report a quiescent
state for the GP kthread's CPU using rcu_qs() + rcu_report_qs_rdp(). The
GP kthread cannot be in an RCU read-side critical section while running
GP initialization, so this is safe and results in significant latency
improvements.
The following tests were performed:
(1) synchronize_rcu() benchmarking
100 synchronize_rcu() calls with 32 CPUs, 10 runs each (default fqs
jiffies settings):
Baseline (without fix):
| Run | Mean | Min | Max |
|-----|-----------|----------|-----------|
| 1 | 10.088 ms | 9.989 ms | 18.848 ms |
| 2 | 10.064 ms | 9.982 ms | 16.470 ms |
| 3 | 10.051 ms | 9.988 ms | 15.113 ms |
| 4 | 10.125 ms | 9.929 ms | 22.411 ms |
| 5 | 8.695 ms | 5.996 ms | 15.471 ms |
| 6 | 10.157 ms | 9.977 ms | 25.723 ms |
| 7 | 10.102 ms | 9.990 ms | 20.224 ms |
| 8 | 8.050 ms | 5.985 ms | 10.007 ms |
| 9 | 10.059 ms | 9.978 ms | 15.934 ms |
| 10 | 10.077 ms | 9.984 ms | 17.703 ms |
With fix:
| Run | Mean | Min | Max |
|-----|----------|----------|-----------|
| 1 | 6.027 ms | 5.915 ms | 8.589 ms |
| 2 | 6.032 ms | 5.984 ms | 9.241 ms |
| 3 | 6.010 ms | 5.986 ms | 7.004 ms |
| 4 | 6.076 ms | 5.993 ms | 10.001 ms |
| 5 | 6.084 ms | 5.893 ms | 10.250 ms |
| 6 | 6.034 ms | 5.908 ms | 9.456 ms |
| 7 | 6.051 ms | 5.993 ms | 10.000 ms |
| 8 | 6.057 ms | 5.941 ms | 10.001 ms |
| 9 | 6.016 ms | 5.927 ms | 7.540 ms |
| 10 | 6.036 ms | 5.993 ms | 9.579 ms |
Summary:
- Mean latency: 9.75 ms -> 6.04 ms (38% improvement)
- Max latency: 25.72 ms -> 10.25 ms (60% improvement)
(2) Bridge setup/teardown latency (Uladzislau Rezki)
x86_64 with 64 CPUs, 100 iterations of bridge add/configure/delete:
real time
1 - default: 24.221s
2 - this patch: 20.754s (14% faster)
3 - this patch + wake_from_gp: 15.895s (34% faster)
4 - wake_from_gp only: 18.947s (22% faster)
Per-synchronize_rcu() latency (in usec):
1 2 3 4
median: 37249.5 31540.5 15765 22480
min: 7881 7918 9803 7857
max: 63651 55639 31861 32040
This patch combined with rcu_normal_wake_from_gp reduces bridge
setup/teardown time from 24 seconds to 16 seconds.
(3) CPU overhead verification (Uladzislau Rezki)
System CPU time across 5 runs showed no measurable increase:
default: 1.698s - 1.937s
this patch: 1.667s - 1.930s
Conclusion: variations are within noise, no CPU overhead regression.
(4) rcutorture
Tested TREE and SRCU configurations - no regressions.
Reviewed-by: "Paul E. McKenney" <paulmck@kernel.org>
Tested-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Samir M <samir@linux.ibm.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Add dump to get-perf-domains, so that a user can fetch either information
about a specific performance domain with do or information about all
performance domains with dump. Share the reply format of do and dump using
perf-domain-attrs, so remove perf-domains. The YNL spec, autogenerated
files, and the do implementation are updated, and the dump implementation
is added.
Suggested-by: Donald Hunter <donald.hunter@gmail.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Link: https://patch.msgid.link/20260108053212.642478-5-changwoo@igalia.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Previously, the cpus attribute was a string format which was a "%*pb"
stringification of a bitmap. That is not very consumable for a UAPI,
so let’s change it to an u64 array of CPU ids.
Suggested-by: Donald Hunter <donald.hunter@gmail.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Link: https://patch.msgid.link/20260108053212.642478-4-changwoo@igalia.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The EM YNL specification used many acronyms, including ‘em’, ‘pd’,
‘ps’, etc. While the acronyms are short and convenient, they could be
confusing. So, let’s spell them out to be more specific. The following
changes were made in the spec. Note that the protocol name cannot exceed
GENL_NAMSIZ (16).
em -> dev-energymodel
pds -> perf-domains
pd -> perf-domain
pd-id -> perf-domain-id
pd-table -> perf-table
ps -> perf-state
get-pds -> get-perf-domains
get-pd-table -> get-perf-table
pd-created -> perf-domain-created
pd-updated -> perf-domain-updated
pd-deleted -> perf-domain-deleted
In addition. doc strings were added to the spec. based on the comments in
energy_model.h. Two flag attributes (perf-state-flags and
perf-domain-flags) were added for easily interpreting the bit flags.
Finally, the autogenerated files and em_netlink.c were updated accordingly
to reflect the name changes.
Suggested-by: Donald Hunter <donald.hunter@gmail.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Link: https://patch.msgid.link/20260108053212.642478-3-changwoo@igalia.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Fix a crash in the hibernation image saving code that can be triggered
when the given compression algorithm is unavailable (Malaya Kumar Rout)
-----BEGIN PGP SIGNATURE-----
iQFGBAABCAAwFiEEcM8Aw/RY0dgsiRUR7l+9nS/U47UFAmlhGicSHHJqd0Byand5
c29ja2kubmV0AAoJEO5fvZ0v1OO1q1QH/A7sxNcitVM3I6bdoYyVDYpVnKiDQqTk
ofzJ+eEEh2+5iD7iBoqQBVNLDWL3iOtSCk3u8VzhIoJMwvsmaxSQYqGaOMIHNSKx
Lfdl+RsMv5LKK05uN9uxLCSIBQJkRhnKI3+AC2TEbSpmLwi0+W15XYHecj1JmyY0
upsrRUq+XGBr2LQDoWiUmRmPay8+lp8zjoHaAorl7Jm4yr7pHV8kzrztrCOttuUx
nUtJiCks+Srnm09uIf0ghVgzBTwnmAGYft/xOf+fRqJDy9t91Nf/oJsev28mFCsn
qRVn/O4i8xAgba9mKiKWmHvzoqqW8omixDFShweeTdr7Lw4hADxMwn8=
=8zHM
-----END PGP SIGNATURE-----
Merge tag 'pm-6.19-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fix from Rafael Wysocki:
"This fixes a crash in the hibernation image saving code that can be
triggered when the given compression algorithm is unavailable (Malaya
Kumar Rout)"
* tag 'pm-6.19-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM: hibernate: Fix crash when freeing invalid crypto compressor
sched_mm_cid_after_execve() is called in bprm_execve()'s cleanup path even
when exec_binprm() fails. For the init task's first execve(), this causes a
problem:
1. current->mm is NULL (kernel threads don't have an mm)
2. sched_mm_cid_before_execve() exits early because mm is NULL
3. exec_binprm() fails (e.g., ENOENT for missing script interpreter)
4. sched_mm_cid_after_execve() is called with mm still NULL
5. sched_mm_cid_fork() is called unconditionally, triggering WARN_ON
This is easily reproduced by booting with an init that is a shell script
(#!/bin/sh) where the interpreter doesn't exist in the initramfs.
Fix this by checking if t->mm is NULL before calling sched_mm_cid_fork(),
matching the behavior of sched_mm_cid_before_execve() which already
handles this case via sched_mm_cid_exit()'s early return.
Fixes: b0c3d51b54 ("sched/mmcid: Provide precomputed maximal value")
Signed-off-by: Cong Wang <cwang@multikernel.io>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Will Deacon <will@kernel.org>
Link: https://patch.msgid.link/20251223215113.639686-1-xiyou.wangcong@gmail.com
When ida_alloc() fails in em_create_pd(), the function returns without
freeing the previously allocated 'pd' structure, leading to a memory leak.
The 'pd' pointer is allocated either at line 436 (for CPU devices with
cpumask) or line 442 (for other devices) using kzalloc().
Additionally, the function incorrectly returns -ENOMEM when ida_alloc()
fails, ignoring the actual error code returned by ida_alloc(), which can
fail for reasons other than memory exhaustion.
Fix both issues by:
1. Freeing the 'pd' structure with kfree() when ida_alloc() fails
2. Returning the actual error code from ida_alloc() instead of -ENOMEM
This ensures proper cleanup on the error path and accurate error reporting.
Fixes: cbe5aeedec ("PM: EM: Assign a unique ID when creating a performance domain")
Signed-off-by: Malaya Kumar Rout <mrout@redhat.com>
Reviewed-by: Changwoo Min <changwoo@igalia.com>
Link: https://patch.msgid.link/20260105103730.65626-1-mrout@redhat.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The introduction of PREEMPT_LAZY was for multiple reasons:
- PREEMPT_RT suffered from over-scheduling, hurting performance compared to
!PREEMPT_RT.
- the introduction of (more) features that rely on preemption; like
folio_zero_user() which can do large memset() without preemption checks.
(Xen already had a horrible hack to deal with long running hypercalls)
- the endless and uncontrolled sprinkling of cond_resched() -- mostly cargo
cult or in response to poor to replicate workloads.
By moving to a model that is fundamentally preemptable these things become
managable and avoid needing to introduce more horrible hacks.
Since this is a requirement; limit PREEMPT_NONE to architectures that do not
support preemption at all. Further limit PREEMPT_VOLUNTARY to those
architectures that do not yet have PREEMPT_LAZY support (with the eventual goal
to make this the empty set and completely remove voluntary preemption and
cond_resched() -- notably VOLUNTARY is already limited to !ARCH_NO_PREEMPT.)
This leaves up-to-date architectures (arm64, loongarch, powerpc, riscv, s390,
x86) with only two preemption models: full and lazy.
While Lazy has been the recommended setting for a while, not all distributions
have managed to make the switch yet. Force things along. Keep the patch minimal
in case of hard to address regressions that might pop up.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://patch.msgid.link/20251219101502.GB1132199@noisy.programming.kicks-ass.net
This colocates some hot fields in "struct rq" to be on the same cache line
as others that are often accessed at the same time or in similar ways.
Using data from a Google-internal fleet-scale profiler, I found three
distinct groups of hot fields in struct rq:
- (1) The runqueue lock: __lock.
- (2) Those accessed from hot code in pick_next_task_fair():
nr_running, nr_numa_running, nr_preferred_running,
ttwu_pending, cpu_capacity, curr, idle.
- (3) Those accessed from some other hot codepaths, e.g.
update_curr(), update_rq_clock(), and scheduler_tick():
clock_task, clock_pelt, clock, lost_idle_time,
clock_update_flags, clock_pelt_idle, clock_idle.
The cycles spent on accessing these different groups of fields broke down
roughly as follows:
- 50% on group (1) (the runqueue lock, always read-write)
- 39% on group (2) (load:store ratio around 38:1)
- 8% on group (3) (load:store ratio around 5:1)
- 3% on all the other fields
Most of the fields in group (3) are already in a cache line grouping; this
patch just adds "clock" and "clock_update_flags" to that group. The fields
in group (2) are scattered across several cache lines; the main effect of
this patch is to group them together, on a single line at the beginning of
the structure. A few other less performance-critical fields (nr_switches,
numa_migrate_on, has_blocked_load, nohz_csd, last_blocked_load_update_tick)
were also reordered to reduce holes in the data structure.
Since the runqueue lock is acquired from so many different contexts, and is
basically always accessed using an atomic operation, putting it on either
of the cache lines for groups (2) or (3) would slow down accesses to those
fields dramatically, since those groups are read-mostly accesses.
To test this, I wrote a focused load test that would put load on the
pick_next_task_fair() path. A parent process would fork many child
processes, and each child would nanosleep() for 1 msec many times in a
loop. The load test was monitored with "perf", and I looked at the amount
of cycles that were spent with sched_balance_rq() on the stack. The test
was reliably spending ~5% of all of its cycles there. I ran it 100 times
on a pair of 2-socket Intel Haswell machines (72 vCPUs per machine) - one
running the tip of sched/core, the other running this change - using 360
child processes and 8192 1-msec sleeps per child. The mean cycle count
dropped from 5.14B to 4.91B, or a *4.6% decrease* in relevant scheduler
cycles.
Given that this change reduces cache misses in a very hot kernel codepath,
there's likely to be additional application performance improvement due to
reduced cache conflicts from kernel data structures.
On a Power11 system with 128-byte cache lines, my test showed a ~5%
decrease in relevant scheduler cycles, along with a slight increase in user
time - both positive indicators. This data comes from
https://lore.kernel.org/lkml/affdc6b1-9980-44d1-89db-d90730c1e384@linux.ibm.com/
This is the case even though the additional "____cacheline_aligned" that
puts the runqueue lock on the next cache line adds an additional 64 bytes
of padding on those machines. This patch does not change the size of
"struct rq" on machines with 64-byte cache lines.
I also ran "hackbench" to try to test this change, but it didn't show
conclusive results. Looking at a CPU cycle profile of the hackbench run,
it was spending 95% of its cycles inside __alloc_skb(), __kfree_skb(), or
kmem_cache_free() - almost all of which was spent updating memcg counters
or contending on the list_lock in kmem_cache_node. And it spent less than
0.5% of its cycles inside either schedule() or try_to_wake_up(). So it's
not surprising that it didn't show useful results here.
The "__no_randomize_layout" was added to reflect the fact that performance
of code that references this data structure is unusually sensitive to
placement of its members.
Signed-off-by: Blake Jones <blakejones@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Reviewed-by: Josh Don <joshdon@google.com>
Tested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Link: https://patch.msgid.link/20251202023743.1524247-1-blakejones@google.com
In the group_has_spare case, the function creates a temporary cpumask
to just calculate weight of (p->cpus_ptr & sched_group_span(local)).
We've got a dedicated helper for it.
Signed-off-by: Yury Norov (NVIDIA) <yury.norov@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reviewed-by: Fernand Sieber <sieberf@amazon.com>
Link: https://patch.msgid.link/20251207034247.402926-1-yury.norov@gmail.com
Use for_each_cpu_and() and drop some housekeeping code.
Signed-off-by: Yury Norov (NVIDIA) <yury.norov@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reviewed-by: Phil Auld <pauld@redhat.com>
Link: https://patch.msgid.link/20251207033037.399608-1-yury.norov@gmail.com
cpumask_empty() call is O(N) and useless because the previous
cpumask_and() returns false for empty 'cpus'. Drop it.
Signed-off-by: Yury Norov (NVIDIA) <yury.norov@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://patch.msgid.link/20251207040543.407695-1-yury.norov@gmail.com
BPF_MAP_TYPE_INSN_ARRAY maps store instruction pointers in their
ips array, not string data. The map_direct_value_addr callback for
this map type returns the address of the ips array, which is not
suitable for use as a constant string argument.
When a BPF program passes a pointer to an insn_array map value as
ARG_PTR_TO_CONST_STR (e.g., to bpf_snprintf), the verifier's
null-termination check in check_reg_const_str() operates on the
wrong memory region, and at runtime bpf_bprintf_prepare() can read
out of bounds searching for a null terminator.
Reject BPF_MAP_TYPE_INSN_ARRAY in check_reg_const_str() since this
map type is not designed to hold string data.
Reported-by: syzbot+2c29addf92581b410079@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=2c29addf92581b410079
Tested-by: syzbot+2c29addf92581b410079@syzkaller.appspotmail.com
Fixes: 493d9e0d60 ("bpf, x86: add support for indirect jumps")
Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com>
Acked-by: Anton Protopopov <a.s.protopopov@gmail.com>
Link: https://lore.kernel.org/r/20260107021037.289644-1-kartikey406@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The cgrp_ancestor_storage has two drawbacks:
- it's not guaranteed that the member immediately follows struct cgrp in
cgroup_root (root cgroup's ancestors[0] might thus point to a padding
and not in cgrp_ancestor_storage proper),
- this idiom raises warnings with -Wflex-array-member-not-at-end.
Instead of relying on the auxiliary member in cgroup_root, define the
0-th level ancestor inside struct cgroup (needed for static allocation
of cgrp_dfl_root), deeper cgroups would allocate flexible
_low_ancestors[]. Unionized alias through ancestors[] will
transparently join the two ranges.
The above change would still leave the flexible array at the end of
struct cgroup inside cgroup_root, so move cgrp also towards the end of
cgroup_root to resolve the -Wflex-array-member-not-at-end.
Link: https://lore.kernel.org/r/5fb74444-2fbb-476e-b1bf-3f3e279d0ced@embeddedor.com/
Reported-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Closes: https://lore.kernel.org/r/b3eb050d-9451-4b60-b06c-ace7dab57497@embeddedor.com/
Cc: David Laight <david.laight.linux@gmail.com>
Acked-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The ftrace_dump_on_oops string is not used outside of trace.c so
make it static to avoid the export warning from sparse:
kernel/trace/trace.c:141:6: warning: symbol 'ftrace_dump_on_oops' was not declared. Should it be static?
Fixes: dd293df639 ("tracing: Move trace sysctls into trace.c")
Link: https://patch.msgid.link/20260106231054.84270-1-ben.dooks@codethink.co.uk
Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
A bug was reported about an infinite recursion caused by tracing the rcu
events with the kernel stack trace trigger enabled. The stack trace code
called back into RCU which then called the stack trace again.
Expand the ftrace recursion protection to add a set of bits to protect
events from recursion. Each bit represents the context that the event is
in (normal, softirq, interrupt and NMI).
Have the stack trace code use the interrupt context to protect against
recursion.
Note, the bug showed an issue in both the RCU code as well as the tracing
stacktrace code. This only handles the tracing stack trace side of the
bug. The RCU fix will be handled separately.
Link: https://lore.kernel.org/all/20260102122807.7025fc87@gandalf.local.home/
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Link: https://patch.msgid.link/20260105203141.515cd49f@gandalf.local.home
Reported-by: Yao Kai <yaokai34@huawei.com>
Tested-by: Yao Kai <yaokai34@huawei.com>
Fixes: 5f5fa7ea89 ("rcu: Don't use negative nesting depth in __rcu_read_unlock()")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
When user resize all trace ring buffer through file 'buffer_size_kb',
then in ring_buffer_resize(), kernel allocates buffer pages for each
cpu in a loop.
If the kernel preemption model is PREEMPT_NONE and there are many cpus
and there are many buffer pages to be freed, it may not give up cpu
for a long time and finally cause a softlockup.
To avoid it, call cond_resched() after each cpu buffer free as Commit
f6bd2c9248 ("ring-buffer: Avoid softlockup in ring_buffer_resize()")
does.
Detailed call trace as follow:
rcu: INFO: rcu_sched self-detected stall on CPU
rcu: 24-....: (14837 ticks this GP) idle=521c/1/0x4000000000000000 softirq=230597/230597 fqs=5329
rcu: (t=15004 jiffies g=26003221 q=211022 ncpus=96)
CPU: 24 UID: 0 PID: 11253 Comm: bash Kdump: loaded Tainted: G EL 6.18.2+ #278 NONE
pc : arch_local_irq_restore+0x8/0x20
arch_local_irq_restore+0x8/0x20 (P)
free_frozen_page_commit+0x28c/0x3b0
__free_frozen_pages+0x1c0/0x678
___free_pages+0xc0/0xe0
free_pages+0x3c/0x50
ring_buffer_resize.part.0+0x6a8/0x880
ring_buffer_resize+0x3c/0x58
__tracing_resize_ring_buffer.part.0+0x34/0xd8
tracing_resize_ring_buffer+0x8c/0xd0
tracing_entries_write+0x74/0xd8
vfs_write+0xcc/0x288
ksys_write+0x74/0x118
__arm64_sys_write+0x24/0x38
Cc: <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251228065008.2396573-1-mawupeng1@huawei.com
Signed-off-by: Wupeng Ma <mawupeng1@huawei.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
soft_mode is not read in the enable case, so drop the assignment.
Drop also the comment text that refers to the assignment and realign
the comment.
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: Gabriele Paoloni <gpaoloni@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251226110531.4129794-1-Julia.Lawall@inria.fr
Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
For use the init_srcu_struct*() to initialized srcu structure,
the srcu structure's->srcu_sup and sda use GFP_KERNEL flags to
allocate memory. similarly, if set SRCU_SIZING_INIT, the
srcu_sup's->node can still use GFP_KERNEL flags to allocate
memory, not need to use GFP_ATOMIC flags all the time.
Signed-off-by: Zqiang <qiang.zhang@linux.dev>
Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Lack of parentheses causes the ->exp_current() function, for example,
srcu_expedite_current(), to be called only once in four billion times
instead of the intended once in 256 times. This commit therefore adds
the needed parentheses.
Reported-by: Chris Mason <clm@meta.com>
Reported-by: Joel Fernandes <joelagnelf@nvidia.com>
Fixes: 950063c6e8 ("rcutorture: Test srcu_expedite_current()")
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
If an expedited RCU CPU stall ends just at the stall-warning timeout,
the current code will print an expedited stall-warning message, but one
that doesn't identify any CPUs or tasks causing the stall. This is most
likely to happen for short-timeout stalls, for example, the 20-millisecond
timeouts that are sometimes used for small embedded devices. Needless to
say, these semi-empty stall-warning messages can be rather confusing.
One option would be to suppress the stall-warning message entirely in
this case, but the near-miss information can be quite valuable.
Detect this race condition and emits a "INFO: Expedited stall ended
before state dump start" message to clarify matters.
[boqun: Apply feedback from Borislav]
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Introduce BPF_F_ALL_CPUS flag support for percpu_cgroup_storage maps to
allow updating values for all CPUs with a single value for update_elem
API.
Introduce BPF_F_CPU flag support for percpu_cgroup_storage maps to
allow:
* update value for specified CPU for update_elem API.
* lookup value for specified CPU for lookup_elem API.
The BPF_F_CPU flag is passed via map_flags along with embedded cpu info.
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/20260107022022.12843-6-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Copy map value using 'copy_map_value_long()'. It's to keep consistent
style with the way of other percpu maps.
No functional change intended.
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/20260107022022.12843-5-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Introduce BPF_F_ALL_CPUS flag support for percpu_hash and lru_percpu_hash
maps to allow updating values for all CPUs with a single value for both
update_elem and update_batch APIs.
Introduce BPF_F_CPU flag support for percpu_hash and lru_percpu_hash
maps to allow:
* update value for specified CPU for both update_elem and update_batch
APIs.
* lookup value for specified CPU for both lookup_elem and lookup_batch
APIs.
The BPF_F_CPU flag is passed via:
* map_flags along with embedded cpu info.
* elem_flags along with embedded cpu info.
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/20260107022022.12843-4-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Introduce support for the BPF_F_ALL_CPUS flag in percpu_array maps to
allow updating values for all CPUs with a single value for both
update_elem and update_batch APIs.
Introduce support for the BPF_F_CPU flag in percpu_array maps to allow:
* update value for specified CPU for both update_elem and update_batch
APIs.
* lookup value for specified CPU for both lookup_elem and lookup_batch
APIs.
The BPF_F_CPU flag is passed via:
* map_flags of lookup_elem and update_elem APIs along with embedded cpu
info.
* elem_flags of lookup_batch and update_batch APIs along with embedded
cpu info.
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/20260107022022.12843-3-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Introduce BPF_F_CPU and BPF_F_ALL_CPUS flags and check them for
following APIs:
* 'map_lookup_elem()'
* 'map_update_elem()'
* 'generic_map_lookup_batch()'
* 'generic_map_update_batch()'
And, get the correct value size for these APIs.
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/20260107022022.12843-2-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Leverage ARCH_HAS_DMA_MAP_DIRECT config option for coherent allocations as
well. This will bypass DMA ops for memory allocations that have been
pre-mapped.
Always set device bus_dma_limit when memory is pre-mapped. In some
architectures, like PowerPC, pmemory can be converted to regular memory via
daxctl command. This will gate the coherent allocations to pre-mapped RAM
only, by dma_coherent_ok().
Signed-off-by: Gaurav Batra <gbatra@linux.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20251107161105.85999-1-gbatra@linux.ibm.com
The function set_security_override_from_ctx() has no in-tree callers
since 6.14. Remove it.
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
[PM: subject tweak, merge fuzz]
Signed-off-by: Paul Moore <paul@paul-moore.com>
The bpf_arena_*_pages() kfuncs can be called from sleepable contexts,
but the verifier still prevents BPF programs from calling them while
holding a spinlock. Amend the verifier to allow for BPF programs
calling arena page management functions while holding a lock.
Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com>
Link: https://lore.kernel.org/r/20260106-arena-under-lock-v2-2-378e9eab3066@etsalapatis.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The in_sleepable_context() function is used to specialize the BPF code
in do_misc_fixups(). With the addition of nonsleepable arena kfuncs,
there are kfuncs whose specialization depends on whether we are
holding a lock. We should use the nonsleepable version while
holding a lock and the sleepable one when not.
Add a check for active_locks to account for locking when specializing
arena kfuncs.
Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com>
Link: https://lore.kernel.org/r/20260106-arena-under-lock-v2-1-378e9eab3066@etsalapatis.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Remove SYSCTL_INT_CONV_CUSTOM and replace it with proc_int_conv. This
converter function expects a negp argument as it can take on negative
values. Update all jiffies converters to use explicit function calls.
Remove SYSCTL_CONV_IDENTITY as it is no longer used.
Signed-off-by: Joel Granados <joel.granados@kernel.org>
Replace SYSCTL_USER_TO_KERN_INT_CONV and SYSCTL_KERN_TO_USER_INT_CONV
macros with function implementing the same logic.This makes debugging
easier and aligns with the functions preference described in
coding-style.rst. Update all jiffies converters to use explicit function
implementations instead of macro-generated versions.
Signed-off-by: Joel Granados <joel.granados@kernel.org>
When crypto_alloc_acomp() fails, it returns an ERR_PTR value, not NULL.
The cleanup code in save_compressed_image() and load_compressed_image()
unconditionally calls crypto_free_acomp() without checking for ERR_PTR,
which causes crypto_acomp_tfm() to dereference an invalid pointer and
crash the kernel.
This can be triggered when the compression algorithm is unavailable
(e.g., CONFIG_CRYPTO_LZO not enabled).
Fix by adding IS_ERR_OR_NULL() checks before calling crypto_free_acomp()
and acomp_request_free(), similar to the existing kthread_stop() check.
Fixes: b03d542c3c ("PM: hibernate: Use crypto_acomp interface")
Signed-off-by: Malaya Kumar Rout <mrout@redhat.com>
Cc: 6.15+ <stable@vger.kernel.org> # 6.15+
[ rjw: Added 2 empty code lines ]
Link: https://patch.msgid.link/20251230115613.64080-1-mrout@redhat.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
This demonstrates a larger conversion to use Clang's context
analysis. The benefit is additional static checking of locking rules,
along with better documentation.
Notably, kernel/sched contains sufficiently complex synchronization
patterns, and application to core.c & fair.c demonstrates that the
latest Clang version has become powerful enough to start applying this
to more complex subsystems (with some modest annotations and changes).
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251219154418.3592607-37-elver@google.com
With Sparse support gone, Clang is a bit more strict and warns:
./include/linux/console.h:492:50: error: use of undeclared identifier 'console_mutex'
492 | extern void console_list_unlock(void) __releases(console_mutex);
Since it does not make sense to make console_mutex itself global, move
the annotation to printk.c. Context analysis remains disabled for
printk.c.
This is needed to enable context analysis for modules that include
<linux/console.h>.
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251219154418.3592607-34-elver@google.com
As discussed in [1], removing __cond_lock() will improve the readability
of trylock code. Now that Sparse context tracking support has been
removed, we can also remove __cond_lock().
Change existing APIs to either drop __cond_lock() completely, or make
use of the __cond_acquires() function attribute instead.
In particular, spinlock and rwlock implementations required switching
over to inline helpers rather than statement-expressions for their
trylock_* variants.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20250207082832.GU7145@noisy.programming.kicks-ass.net/ [1]
Link: https://patch.msgid.link/20251219154418.3592607-25-elver@google.com
Replace the SYSCTL_USER_TO_KERN_UINT_CONV and SYSCTL_UINT_CONV_CUSTOM
macros with functions with the same logic. This makes debugging easier
and aligns with the functions preference described in coding-style.rst.
Update the only user of this API: pipe.c.
Signed-off-by: Joel Granados <joel.granados@kernel.org>
Ensure an error if prco_douintvec_conv is erroneously called in a system
with CONFIG_PROC_SYSCTL=n
Signed-off-by: Joel Granados <joel.granados@kernel.org>
Remove superfluous forward declarations of ctl_table from header files
where they are no longer needed. These declarations were left behind
after sysctl code refactoring and cleanup.
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Muchun Song <muchun.song@linux.dev>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
Add kernel-doc documentation for the proc_dointvec_conv function to
describe its parameters and return value.
Reviewed-by: Kees Cook <kees@kernel.org>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
With the change to hrtimer_try_to_cancel() in
perf_swevent_cancel_hrtimer() it appears possible for the hrtimer to
still be active by the time the event gets freed.
Make sure the event does a full hrtimer_cancel() on the free path by
installing a perf_event::destroy handler.
Fixes: eb3182ef04 ("perf/core: Fix system hang caused by cpu-clock usage")
Reported-by: CyberUnicorns <a101e_iotvul@163.com>
Tested-by: CyberUnicorns <a101e_iotvul@163.com>
Debugged-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
* rcu-torture.20260104a:
rcutorture: Add --kill-previous option to terminate previous kvm.sh runs
rcutorture: Prevent concurrent kvm.sh runs on same source tree
torture: Include commit discription in testid.txt
torture: Make config2csv.sh properly handle comments in .boot files
torture: Make kvm-series.sh give run numbers and totals
torture: Make kvm-series.sh give build numbers and totals
torture: Parallelize kvm-series.sh guest-OS execution
rcutorture: Add context checks to rcu_torture_timer()
The __opt annotation was originally introduced specifically for
buffer/size argument pairs in bpf_dynptr_slice() and
bpf_dynptr_slice_rdwr(), allowing the buffer pointer to be NULL while
still validating the size as a constant. The __nullable annotation
serves the same purpose but is more general and is already used
throughout the BPF subsystem for raw tracepoints, struct_ops, and other
kfuncs.
This patch unifies the two annotations by replacing __opt with
__nullable. The key change is in the verifier's
get_kfunc_ptr_arg_type() function, where mem/size pair detection is now
performed before the nullable check. This ensures that buffer/size
pairs are correctly classified as KF_ARG_PTR_TO_MEM_SIZE even when the
buffer is nullable, while adding an !arg_mem_size condition to the
nullable check prevents interference with mem/size pair handling.
When processing KF_ARG_PTR_TO_MEM_SIZE arguments, the verifier now uses
is_kfunc_arg_nullable() instead of the removed is_kfunc_arg_optional()
to determine whether to skip size validation for NULL buffers.
This is the first documentation added for the __nullable annotation,
which has been in use since it was introduced but was previously
undocumented.
No functional changes to verifier behavior - nullable buffer/size pairs
continue to work exactly as before.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20260102221513.1961781-1-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
When arena allocations were converted from bpf_map_alloc_pages() to
kmalloc_nolock() to support non-sleepable contexts, memcg accounting was
inadvertently lost. This commit restores proper memory accounting for
all arena-related allocations.
All arena related allocations are accounted into memcg of the process
that created bpf_arena.
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20260102200230.25168-3-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Introduce bpf_map_memcg_enter() and bpf_map_memcg_exit() helpers to
reduce code duplication in memcg context management.
bpf_map_memcg_enter() gets the memcg from the map, sets it as active,
and returns both the previous and the now active memcg.
bpf_map_memcg_exit() restores the previous active memcg and releases the
reference obtained during enter.
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20260102200230.25168-2-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Fix a recent regression that affects system suspend testing at
the "core" level (Rafael Wysocki)
-----BEGIN PGP SIGNATURE-----
iQFGBAABCAAwFiEEcM8Aw/RY0dgsiRUR7l+9nS/U47UFAmlYG/ISHHJqd0Byand5
c29ja2kubmV0AAoJEO5fvZ0v1OO1mK0IAIrCiY5dvp9+72DvEWqS2uHHFVs3sHKR
SOpJR3koYehZEn/PvnM2PgvWNCLtru4nU/Q3EnWFfFCFuFuAMQ6Zl5U7YyKkW1Uc
bcTMsnLOTJm/3AYu3O+4TGASq1VF1xqE+AB/ie5fNz5gDSlblGKrqh0se3m5m1Vu
PsLsm27wkLyEHCd3AdXRNSU54GssjTaABkVTQ/Unk4PznbBiKsckaThLjbjQaiqB
KzqU0B3Q3Zx9Qj1lVzXwXaYushehGbs3bqw8+q2DPrV/jwLVLYX/ofwEkCH+lQ47
tS+di//pFi/grWu/GtR4EQ0fCzgYPDaBfbQlOD2gA60EgplU4XY3804=
=CK7L
-----END PGP SIGNATURE-----
Merge tag 'pm-6.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fix from Rafael Wysocki:
"Fix a recent regression that affects system suspend testing
at the 'core' level (Rafael Wysocki)"
* tag 'pm-6.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM: sleep: Fix suspend_test() at the TEST_CORE level
Now that KF_TRUSTED_ARGS is the default for all kfuncs, remove the
explicit KF_TRUSTED_ARGS flag from all kfunc definitions and remove the
flag itself.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20260102180038.2708325-3-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Change the verifier to make trusted args the default requirement for
all kfuncs by removing is_kfunc_trusted_args() assuming it be to always
return true.
This works because:
1. Context pointers (xdp_md, __sk_buff, etc.) are handled through their
own KF_ARG_PTR_TO_CTX case label and bypass the trusted check
2. Struct_ops callback arguments are already marked as PTR_TRUSTED during
initialization and pass is_trusted_reg()
3. KF_RCU kfuncs are handled separately via is_kfunc_rcu() checks at
call sites (always checked with || alongside is_kfunc_trusted_args)
This simple change makes all kfuncs require trusted args by default
while maintaining correct behavior for all existing special cases.
Note: This change means kfuncs that previously accepted NULL pointers
without KF_TRUSTED_ARGS will now reject NULL at verification time.
Several netfilter kfuncs are affected: bpf_xdp_ct_lookup(),
bpf_skb_ct_lookup(), bpf_xdp_ct_alloc(), and bpf_skb_ct_alloc() all
accept NULL for their bpf_tuple and opts parameters internally (checked
in __bpf_nf_ct_lookup), but after this change the verifier rejects NULL
before the kfunc is even called. This is acceptable because these kfuncs
don't work with NULL parameters in their proper usage. Now they will be
rejected rather than returning an error, which shouldn't make a
difference to BPF programs that were using these kfuncs properly.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20260102180038.2708325-2-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
If a driver is buggy and has 2 overlapping mappings but only
sets cache clean flag on the 1st one of them, we warn.
But if it only does it for the 2nd one, we don't.
Fix by tracking cache clean flag in the entry.
Message-ID: <0ffb3513d18614539c108b4548cdfbc64274a7d1.1767601130.git.mst@redhat.com>
Reviewed-by: Petr Tesarik <ptesarik@suse.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
This commit adds irq, NMI, and softirq context checks to the
rcu_torture_timer() function. Just because you are paranoid does not
mean that they are not out to get you... ;-)
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
This commit adds a ->exp_current member to the tasks_tracing_ops structure
to test the rcu_tasks_trace_expedite_current() function.
[ paulmck: Apply kernel test robot feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: bpf@vger.kernel.org
Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
When expressing RCU Tasks Trace in terms of SRCU-fast, it was
necessary to keep a nesting count and per-CPU srcu_ctr structure
pointer in the task_struct structure, which is slow to access.
But an alternative is to instead make rcu_read_lock_tasks_trace() and
rcu_read_unlock_tasks_trace(), which match the underlying SRCU-fast
semantics, avoiding the task_struct accesses.
When all callers have switched to the new API, the previous
rcu_read_lock_trace() and rcu_read_unlock_trace() APIs will be removed.
The rcu_read_{,un}lock_{,tasks_}trace() functions need to use smp_mb()
only if invoked where RCU is not watching, that is, from locations where
a call to rcu_is_watching() would return false. In architectures that
define the ARCH_WANTS_NO_INSTR Kconfig option, use of noinstr and friends
ensures that tracing happens only where RCU is watching, so those
architectures can dispense entirely with the read-side calls to smp_mb().
Other architectures include these read-side calls by default, but in many
installations there might be either larger than average tolerance for
risk, prohibition of removing tracing on a running system, or careful
review and approval of removing of tracing. Such installations can
build their kernels with CONFIG_TASKS_TRACE_RCU_NO_MB=y to avoid those
read-side calls to smp_mb(), thus accepting responsibility for run-time
removal of tracing from code regions that RCU is not watching.
Those wishing to disable read-side memory barriers for an entire
architecture can select this TASKS_TRACE_RCU_NO_MB Kconfig option,
hence the polarity.
[ paulmck: Apply Peter Zijlstra feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: bpf@vger.kernel.org
Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Moving the rcu_tasks_trace_srcu_struct structure instance out
from under the CONFIG_TASKS_RCU_GENERIC Kconfig option permits
the CONFIG_TASKS_TRACE_RCU Kconfig option to stop enabling this
CONFIG_TASKS_RCU_GENERIC Kconfig option. This commit also therefore
makes it so.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: bpf@vger.kernel.org
Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Now that RCU Tasks Trace has been re-implemented in terms of SRCU-fast,
the ->trc_ipi_to_cpu, ->trc_blkd_cpu, ->trc_blkd_node, ->trc_holdout_list,
and ->trc_reader_special task_struct fields are no longer used.
In addition, the rcu_tasks_trace_qs(), rcu_tasks_trace_qs_blkd(),
exit_tasks_rcu_finish_trace(), and rcu_spawn_tasks_trace_kthread(),
show_rcu_tasks_trace_gp_kthread(), rcu_tasks_trace_get_gp_data(),
rcu_tasks_trace_torture_stats_print(), and get_rcu_tasks_trace_gp_kthread()
functions and all the other functions that they invoke are no longer used.
Also, the TRC_NEED_QS and TRC_NEED_QS_CHECKED CPP macros are no longer used.
Neither are the rcu_tasks_trace_lazy_ms and rcu_task_ipi_delay rcupdate
module parameters and the TASKS_TRACE_RCU_READ_MB Kconfig option.
This commit therefore removes all of them.
[ paulmck: Apply Alexei Starovoitov feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: bpf@vger.kernel.org
Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Because SRCU-fast does not use IPIs for its grace periods, there is
no need for real-time workloads to switch to an IPI-free mode, and
there is in turn no need for either rcu_task_trace_heavyweight_enter()
or rcu_task_trace_heavyweight_exit(). This commit therefore removes them.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: bpf@vger.kernel.org
Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
This commit saves more than 500 lines of RCU code by re-implementing
RCU Tasks Trace in terms of SRCU-fast. Follow-up work will remove
more code that does not cause problems by its presence, but that is no
longer required.
This variant places smp_mb() in rcu_read_{,un}lock_trace(), and in the
same place that srcu_read_{,un}lock() would put them. These smp_mb()
calls will be removed on common-case architectures in a later commit.
In the meantime, it serves to enforce ordering between the underlying
srcu_read_{,un}lock_fast() markers and the intervening critical section,
even on architectures that permit attaching tracepoints on regions of
code not watched by RCU. Such architectures defeat SRCU-fast's use of
implicit single-instruction, interrupts-disabled, and atomic-operation
RCU read-side critical sections, which have no effect when RCU is not
watching. The aforementioned later commit will insert these smp_mb()
calls only on architectures that have not used noinstr to prevent
attaching tracepoints to code where RCU is not watching.
[ paulmck: Apply kernel test robot, Boqun Feng, and Zqiang feedback. ]
[ paulmck: Split out Tiny SRCU fixes per Andrii Nakryiko feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: kernel test robot <oliver.sang@intel.com>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: bpf@vger.kernel.org
Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
When multiple small DMA_FROM_DEVICE or DMA_BIDIRECTIONAL buffers share a
cacheline, and DMA_API_DEBUG is enabled, we get this warning:
cacheline tracking EEXIST, overlapping mappings aren't supported.
This is because when one of the mappings is removed, while another one
is active, CPU might write into the buffer.
Add an attribute for the driver to promise not to do this, making the
overlapping safe, and suppressing the warning.
Message-ID: <2d5d091f9d84b68ea96abd545b365dd1d00bbf48.1767601130.git.mst@redhat.com>
Reviewed-by: Petr Tesarik <ptesarik@suse.com>
Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Within an iterator or callback based loop, it should be safe to prune
the current state if the old state stack slot is marked as
STACK_INVALID or STACK_MISC:
- either all branches of the old state lead to a program exit;
- or some branch of the old state leads the current state.
This is the same logic as applied in non-loop cases when
states_equal() is called in NOT_EXACT mode.
The test case that exercises stacksafe() and demonstrates the
difference in verification performance is included in the next patch.
I'm not sure if it is possible to prepare a test case that exercises
regsafe(); it appears that the compute_live_registers() pass makes
this impossible.
Nevertheless, for code readability reasons, I think that stacksafe()
and regsafe() should handle STACK_INVALID / NOT_INIT symmetrically.
Hence, this commit changes both functions.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20251230-loop-stack-misc-pruning-v1-1-585cfd6cec51@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Calls like bpf_loop() or bpf_for_each_map_elem() introduce loops that
are not explicitly present in the control-flow graph. The verifier
processes such calls by repeatedly interpreting the callback function
body within the same verification path (until the current state
converges with a previous state).
Such loops require a bpf_scc_visit instance in order to allow the
accumulation of the state graph backedges. Otherwise, certain
checkpoint states created within the bodies of such loops will have
incomplete precision marks.
See the next patch for an example of a program that leads to the
verifier accepting an unsafe program.
Fixes: 96c6aa4c63 ("bpf: compute SCCs in program control flow graph")
Fixes: c9e31900b5 ("bpf: propagate read/precision marks over state graph backedges")
Reported-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Tested-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20251229-scc-for-callbacks-v1-1-ceadfe679900@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
There's a three patch series from Jiayuan Chen which fixes some issues
with KASAN and vmalloc. Apart from that it's the usual shower of
singletons - please see the respective changelogs for details.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaVIWxwAKCRDdBJ7gKXxA
jt6HAP49s/mEIIbuZbqnX8hxDrvYdYffs+RsSLPEZoR+yKG/7gD/VqSZhMoPw53b
rMZ56djXNjWxsOAfiVbZit3SFuQfnQ4=
=5rGD
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2025-12-28-21-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"27 hotfixes. 12 are cc:stable, 18 are MM.
There's a patch series from Jiayuan Chen which fixes some
issues with KASAN and vmalloc. Apart from that it's the usual
shower of singletons - please see the respective changelogs
for details"
* tag 'mm-hotfixes-stable-2025-12-28-21-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (27 commits)
mm/ksm: fix pte_unmap_unlock of wrong address in break_ksm_pmd_entry
mm/page_owner: fix memory leak in page_owner_stack_fops->release()
mm/memremap: fix spurious large folio warning for FS-DAX
MAINTAINERS: notify the "Device Memory" community of memory hotplug changes
sparse: update MAINTAINERS info
mm/page_alloc: report 1 as zone_batchsize for !CONFIG_MMU
mm: consider non-anon swap cache folios in folio_expected_ref_count()
rust: maple_tree: rcu_read_lock() in destructor to silence lockdep
mm: memcg: fix unit conversion for K() macro in OOM log
mm: fixup pfnmap memory failure handling to use pgoff
tools/mm/page_owner_sort: fix timestamp comparison for stable sorting
selftests/mm: fix thread state check in uffd-unit-tests
kernel/kexec: fix IMA when allocation happens in CMA area
kernel/kexec: change the prototype of kimage_map_segment()
MAINTAINERS: add ABI headers to KHO and LIVE UPDATE
.mailmap: remove one of the entries for WangYuli
mm/damon/vaddr: fix missing pte_unmap_unlock in damos_va_migrate_pmd_entry()
MAINTAINERS: update one straggling entry for Bartosz Golaszewski
mm/page_alloc: change all pageblocks migrate type on coalescing
mm: leafops.h: correct kernel-doc function param. names
...
- Fix uninitialized @ret on alloc_percpu() failure leading to ERR_PTR(0).
- Fix PREEMPT_RT warning when bypass load balancer sends IPI to offline
CPU by using resched_cpu() instead of resched_curr().
- Fix comment referring to renamed function.
- Update scx_show_state.py for scx_root and scx_aborting changes.
-----BEGIN PGP SIGNATURE-----
iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaVHQAw4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGdJ2AP9nLMUa5Rw2hpcKCLvPjgkqe5fDpNteWrQB3ni9
bu28jQD/XGMwooJbATlDEgCtFqCH74QbddqUABJxBw4FE8qpUgw=
=hp6J
-----END PGP SIGNATURE-----
Merge tag 'sched_ext-for-6.19-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:
- Fix uninitialized @ret on alloc_percpu() failure leading to
ERR_PTR(0)
- Fix PREEMPT_RT warning when bypass load balancer sends IPI to offline
CPU by using resched_cpu() instead of resched_curr()
- Fix comment referring to renamed function
- Update scx_show_state.py for scx_root and scx_aborting changes
* tag 'sched_ext-for-6.19-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
tools/sched_ext: update scx_show_state.py for scx_aborting change
tools/sched_ext: fix scx_show_state.py for scx_root change
sched_ext: Use the resched_cpu() to replace resched_curr() in the bypass_lb_node()
sched_ext: Fix some comments in ext.c
sched_ext: fix uninitialized ret on alloc_percpu() failure
- cpuset: Fix spurious warning when disabling remote partition after CPU
hotplug leaves subpartitions_cpus empty. Guard the warning and invalidate
affected partitions.
-----BEGIN PGP SIGNATURE-----
iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaVHNBg4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGW13AQC7akGwBAjBbR5p0aoLXKl5vZ8UYiwH+7iyep2Z
INn7ggEA/huYF4smSRk3oR8q7eiISa85mZc85EuVEAjPmBQhpAQ=
=9UP0
-----END PGP SIGNATURE-----
Merge tag 'cgroup-for-6.19-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fix from Tejun Heo:
- Fix a spurious cpuset warning when disabling remote partition after
CPU hotplug leaves subpartitions_cpus empty. Guard the warning and
invalidate affected partitions.
* tag 'cgroup-for-6.19-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cpuset: fix warning when disabling remote partition
Commit a10ad1b104 ("PM: suspend: Make pm_test delay interruptible by
wakeup events") replaced mdelay() in suspend_test() with msleep() which
does not work at the TEST_CORE test level that calls suspend_test()
while running on one CPU with interrupts off.
Address this by making suspend_test() check if the test level is
suitable for using msleep() and use mdelay() otherwise.
Fixes: a10ad1b104 ("PM: suspend: Make pm_test delay interruptible by wakeup events")
Reported-by: Sebastian Reichel <sebastian.reichel@collabora.com>
Closes: https://lore.kernel.org/linux-pm/aUsAk0k1N9hw8IkY@venus/
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Sebastian Reichel <sebastian.reichel@collabora.com>
Link: https://patch.msgid.link/6251576.lOV4Wx5bFT@rafael.j.wysocki
A couple of fixes for EFI regressions introduced this cycle:
- Make EDID handling in the EFI stub mixed mode safe
- Ensure that efi_mm.user_ns has a sane value - this is needed now that
EFI runtime calls are preemptible on arm64
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQQQm/3uucuRGn1Dmh0wbglWLn0tXAUCaU694AAKCRAwbglWLn0t
XJHoAPsHkA4g+rRXXhqgysWvqQvGWTrWLiVHPeDOs3dAWTxHpQD+KTNVaW4H+BCz
x+5S3unu5q9PxWOwqeoCi2JvkKKjJwE=
=fb7r
-----END PGP SIGNATURE-----
Merge tag 'efi-fixes-for-v6.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi
Pull EFI fixes from Ard Biesheuvel:
"A couple of fixes for EFI regressions introduced this cycle:
- Make EDID handling in the EFI stub mixed mode safe
- Ensure that efi_mm.user_ns has a sane value - this is needed now
that EFI runtime calls are preemptible on arm64"
* tag 'efi-fixes-for-v6.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
kthread: Warn if mm_struct lacks user_ns in kthread_use_mm()
arm64: efi: Fix NULL pointer dereference by initializing user_ns
efi/libstub: gop: Fix EDID support in mixed-mode
Export irq_domain_free_irqs() to allow PCI/MSI drivers like pci-tegra to be
built as a module.
Signed-off-by: Aaron Kling <webgeek1234@gmail.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/20250731-pci-tegra-module-v7-1-cad4b088b8fb@gmail.com
Add a WARN_ON_ONCE() check to detect mm_struct instances that are
missing user_ns initialization when passed to kthread_use_mm().
When a kthread adopts an mm via kthread_use_mm(), LSM hooks and
capability checks may access current->mm->user_ns for credential
validation. If user_ns is NULL, this leads to a NULL pointer
dereference crash.
This was observed with efi_mm on arm64, where commit a5baf582f4
("arm64/efi: Call EFI runtime services without disabling preemption")
introduced kthread_use_mm(&efi_mm), but efi_mm lacked user_ns
initialization, causing crashes during /proc access.
Adding this warning helps catch similar bugs early during development
rather than waiting for hard-to-debug NULL pointer crashes in
production.
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Make arena related kfuncs any context safe by the following changes:
bpf_arena_alloc_pages() and bpf_arena_reserve_pages():
Replace the usage of the mutex with a rqspinlock for range tree and use
kmalloc_nolock() wherever needed. Use free_pages_nolock() to free pages
from any context.
apply_range_set/clear_cb() with apply_to_page_range() has already made
populating the vm_area in bpf_arena_alloc_pages() any context safe.
bpf_arena_free_pages(): defer the main logic to a workqueue if it is
called from a non-sleepable context.
specialize_kfunc() is used to replace the sleepable arena_free_pages()
with bpf_arena_free_pages_non_sleepable() when the verifier detects the
call is from a non-sleepable context.
In the non-sleepable case, arena_free_pages() queues the address and the
page count to be freed to a lock-less list of struct arena_free_spans
and raises an irq_work. The irq_work handler calls schedules_work() as
it is safe to be called from irq context. arena_free_worker() (the work
queue handler) iterates these spans and clears ptes, flushes tlb, zaps
pages, and calls __free_page().
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20251222195022.431211-4-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
To make arena_alloc_pages() safe to be called from any context, replace
kvcalloc() with kmalloc_nolock() so as it doesn't sleep or take any
locks. kmalloc_nolock() returns NULL for allocations larger than
KMALLOC_MAX_CACHE_SIZE, which is (PAGE_SIZE * 2) = 8KB on systems with
4KB pages. So, round down the allocation done by kmalloc_nolock to 1024
* 8 and reuse the array in a loop.
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20251222195022.431211-3-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
vm_area_map_pages() may allocate memory while inserting pages into bpf
arena's vm_area. In order to make bpf_arena_alloc_pages() kfunc
non-sleepable change bpf arena to populate pages without
allocating memory:
- at arena creation time populate all page table levels except
the last level
- when new pages need to be inserted call apply_to_page_range() again
with apply_range_set_cb() which will only set_pte_at() those pages and
will not allocate memory.
- when freeing pages call apply_to_existing_page_range with
apply_range_clear_cb() to clear the pte for the page to be removed. This
doesn't free intermediate page table levels.
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20251222195022.431211-2-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
*** Bug description ***
When I tested kexec with the latest kernel, I ran into the following warning:
[ 40.712410] ------------[ cut here ]------------
[ 40.712576] WARNING: CPU: 2 PID: 1562 at kernel/kexec_core.c:1001 kimage_map_segment+0x144/0x198
[...]
[ 40.816047] Call trace:
[ 40.818498] kimage_map_segment+0x144/0x198 (P)
[ 40.823221] ima_kexec_post_load+0x58/0xc0
[ 40.827246] __do_sys_kexec_file_load+0x29c/0x368
[...]
[ 40.855423] ---[ end trace 0000000000000000 ]---
*** How to reproduce ***
This bug is only triggered when the kexec target address is allocated in
the CMA area. If no CMA area is reserved in the kernel, use the "cma="
option in the kernel command line to reserve one.
*** Root cause ***
The commit 07d2490297 ("kexec: enable CMA based contiguous
allocation") allocates the kexec target address directly on the CMA area
to avoid copying during the jump. In this case, there is no IND_SOURCE
for the kexec segment. But the current implementation of
kimage_map_segment() assumes that IND_SOURCE pages exist and map them
into a contiguous virtual address by vmap().
*** Solution ***
If IMA segment is allocated in the CMA area, use its page_address()
directly.
Link: https://lkml.kernel.org/r/20251216014852.8737-2-piliu@redhat.com
Fixes: 07d2490297 ("kexec: enable CMA based contiguous allocation")
Signed-off-by: Pingfan Liu <piliu@redhat.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: Steven Chen <chenste@linux.microsoft.com>
Cc: Mimi Zohar <zohar@linux.ibm.com>
Cc: Roberto Sassu <roberto.sassu@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The kexec segment index will be required to extract the corresponding
information for that segment in kimage_map_segment(). Additionally,
kexec_segment already holds the kexec relocation destination address and
size. Therefore, the prototype of kimage_map_segment() can be changed.
Link: https://lkml.kernel.org/r/20251216014852.8737-1-piliu@redhat.com
Fixes: 07d2490297 ("kexec: enable CMA based contiguous allocation")
Signed-off-by: Pingfan Liu <piliu@redhat.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Mimi Zohar <zohar@linux.ibm.com>
Cc: Roberto Sassu <roberto.sassu@huawei.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: Steven Chen <chenste@linux.microsoft.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The -EEXIST error code is reserved by the module loading infrastructure
to indicate that a module is already loaded. When a module's init
function returns -EEXIST, userspace tools like kmod interpret this as
"module already loaded" and treat the operation as successful, returning
0 to the user even though the module initialization actually failed.
This follows the precedent set by commit 54416fd767 ("netfilter:
conntrack: helper: Replace -EEXIST by -EBUSY") which fixed the same
issue in nf_conntrack_helper_register().
This affects bpf_crypto_skcipher module. While the configuration
required to build it as a module is unlikely in practice, it is
technically possible, so fix it for correctness.
Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
Acked-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://lore.kernel.org/r/20251220-dev-module-init-eexists-bpf-v1-1-7f186663dbe7@samsung.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Associate raw tracepoint program type with the kfunc tracing hook. This
allows calling kfuncs from raw_tp programs.
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20251222133250.1890587-2-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
llist_add() returns true only when adding to an empty list, which indicates
that no IRQ work is currently queued or running. Therefore, we only need to
call irq_work_queue() when llist_add() returns true, to avoid unnecessarily
re-queueing IRQ work that is already pending or executing.
Signed-off-by: Zqiang <qiang.zhang@linux.dev>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
For the PREEMPT_RT kernels, the scx_bypass_lb_timerfn() running in the
preemptible per-CPU ktimer kthread context, this means that the following
scenarios will occur(for x86 platform):
cpu1 cpu2
ktimer kthread:
->scx_bypass_lb_timerfn
->bypass_lb_node
->for_each_cpu(cpu, resched_mask)
migration/1: by preempt by migration/2:
multi_cpu_stop() multi_cpu_stop()
->take_cpu_down()
->__cpu_disable()
->set cpu1 offline
->rq1 = cpu_rq(cpu1)
->resched_curr(rq1)
->smp_send_reschedule(cpu1)
->native_smp_send_reschedule(cpu1)
->if(unlikely(cpu_is_offline(cpu))) {
WARN(1, "sched: Unexpected
reschedule of offline CPU#%d!\n", cpu);
return;
}
This commit therefore use the resched_cpu() to replace resched_curr()
in the bypass_lb_node() to avoid send-ipi to offline CPUs.
Signed-off-by: Zqiang <qiang.zhang@linux.dev>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The commit 6e1d31ce49 ("cpuset: separate generate_sched_domains for v1
and v2") introduced dead code that was originally added for cpuset-v2
partition domain generation. Remove the redundant root_load_balance check.
Fixes: 6e1d31ce49 ("cpuset: separate generate_sched_domains for v1 and v2")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/cgroups/9a442808-ed53-4657-988b-882cc0014c0d@huaweicloud.com/T/
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
SHA-1 is considered deprecated and insecure due to vulnerabilities that can
lead to hash collisions. Most distributions have already been using SHA-2
for module signing because of this. The default was also changed last year
from SHA-1 to SHA-512 in commit f3b93547b9 ("module: sign with sha512
instead of sha1 by default"). This was not reported to cause any issues.
Therefore, it now seems to be a good time to remove SHA-1 support for
module signing.
Commit 16ab7cb582 ("crypto: pkcs7 - remove sha1 support") previously
removed support for reading PKCS#7/CMS signed with SHA-1, along with the
ability to use SHA-1 for module signing. This change broke iwd and was
subsequently completely reverted in commit 203a6763ab ("Revert "crypto:
pkcs7 - remove sha1 support""). However, dropping only the support for
using SHA-1 for module signing is unrelated and can still be done
separately.
Note that this change only removes support for new modules to be SHA-1
signed, but already signed modules can still be loaded.
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Reviewed-by: Aaron Tomlin <atomlin@atomlin.com>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Currently if a user enqueues a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use
again of WORK_CPU_UNBOUND.
This lack of consistency cannot be addressed without refactoring the API.
This continues the effort to refactor workqueue APIs, which began with
the introduction of new workqueues and a new alloc_workqueue flag in:
commit 128ea9f6cc ("workqueue: Add system_percpu_wq and system_dfl_wq")
commit 930c2ea566 ("workqueue: Add new WQ_PERCPU flag")
Switch to using system_dfl_wq, the new unbound workqueue, because the
users do not benefit from a per-cpu workqueue.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Reviewed-by: Petr Pavlu <petr.pavlu@suse.com>
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Remove the custom __modinit macro from kernel/params.c and instead use the
common __init_or_module macro from include/linux/module.h. Both provide the
same functionality.
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Reviewed-by: Aaron Tomlin <atomlin@atomlin.com>
Reviewed-by: Daniel Gomez <da.gomez@samsung.com>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
As reported in [0], anonymous memory mappings are not backed by a
struct file instance. Consequently, the struct file pointer passed to
the security_mmap_file() LSM hook is NULL in such cases.
The BPF verifier is currently unaware of this, allowing BPF LSM
programs to dereference this struct file pointer without needing to
perform an explicit NULL check. This leads to potential NULL pointer
dereference and a kernel crash.
Add a strong override for bpf_lsm_mmap_file() which annotates the
struct file pointer parameter with the __nullable suffix. This
explicitly informs the BPF verifier that this pointer (PTR_MAYBE_NULL)
can be NULL, forcing BPF LSM programs to perform a check on it before
dereferencing it.
[0] https://lore.kernel.org/bpf/5e460d3c.4c3e9.19adde547d8.Coremail.kaiyanm@hust.edu.cn/
Reported-by: Kaiyan Mei <M202472210@hust.edu.cn>
Reported-by: Yinhao Hu <dddddd@hust.edu.cn>
Reviewed-by: Dongliang Mu <dzm91@hust.edu.cn>
Closes: https://lore.kernel.org/bpf/5e460d3c.4c3e9.19adde547d8.Coremail.kaiyanm@hust.edu.cn/
Signed-off-by: Matt Bobrowski <mattbobrowski@google.com>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20251216133000.3690723-1-mattbobrowski@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
BPF programs detect recursion using a per-CPU 'active' flag in struct
bpf_prog. The trampoline currently sets/clears this flag with atomic
operations.
On some arm64 platforms (e.g., Neoverse V2 with LSE), per-CPU atomic
operations are relatively slow. Unlike x86_64 - where per-CPU updates
can avoid cross-core atomicity, arm64 LSE atomics are always atomic
across all cores, which is unnecessary overhead for strictly per-CPU
state.
This patch removes atomics from the recursion detection path on arm64 by
changing 'active' to a per-CPU array of four u8 counters, one per
context: {NMI, hard-irq, soft-irq, normal}. The running context uses a
non-atomic increment/decrement on its element. After increment,
recursion is detected by reading the array as a u32 and verifying that
only the expected element changed; any change in another element
indicates inter-context recursion, and a value > 1 in the same element
indicates same-context recursion.
For example, starting from {0,0,0,0}, a normal-context trigger changes
the array to {0,0,0,1}. If an NMI arrives on the same CPU and triggers
the program, the array becomes {1,0,0,1}. When the NMI context checks
the u32 against the expected mask for normal (0x00000001), it observes
0x01000001 and correctly reports recursion. Same-context recursion is
detected analogously.
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20251219184422.2899902-3-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
BPF programs detect recursion by doing atomic inc/dec on a per-cpu
active counter from the trampoline. Create two helpers for operations on
this active counter, this makes it easy to changes the recursion
detection logic in future.
This commit makes no functional changes.
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20251219184422.2899902-2-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This commit update balance_scx() in the comments to balance_one().
Signed-off-by: Zqiang <qiang.zhang@linux.dev>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
In irq_domain_set_name() a const pointer is passed in, and then the
const is "lost" when container_of() is called. Fix this up by properly
preserving the const pointer attribute when container_of() is used to
enforce the fact that this pointer should not have anything at it
changed.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/2025121731-facing-unhitched-63ae@gregkh
- Add Documentation/core-api/tracepoint.rst to TRACING in MAINTAINERS file
Updates to the tracepoint.rst document should be reviewed by the
tracing maintainers.
- Fix warning triggered by perf attaching to synthetic events
The synthetic events do not add a function to be registered when
perf attaches to them. This causes a warning when perf registers
a synthetic event and passes a NULL pointer to the tracepoint register
function. Ideally synthetic events should be updated to work with
perf, but as that's a feature and not a bug fix, simply now return
-ENODEV when perf tries to register an event that has a NULL pointer
for its function. This no longer causes a kernel warning and simply
causes the perf code to fail with an error message.
- Fix 32bit overflow in option flag test
The option's flags changed from 32 bits in size to 64 bits in size.
Fix one of the places that shift 1 by the option bit number to
to be 1ULL.
- Fix the output of printing the direct jmp functions
The enabled_functions that shows how functions are being attached by
ftrace wasn't updated to accommodate the new direct jmp trampolines
that set the LSB of the pointer, and outputs garbage. Update the
output to handle the direct jmp trampolines.
-----BEGIN PGP SIGNATURE-----
iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaURuVxQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qhU1AQCe3AwEiPg4PAooR7D3TDgj4ypJXUwS
J7PM0cvniKhJ2AD/VpSQCyj7sB8UQdVE04SK3hxduVwHIzrBhDe2voC8eQQ=
=C9zF
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Add Documentation/core-api/tracepoint.rst to TRACING in MAINTAINERS
file
Updates to the tracepoint.rst document should be reviewed by the
tracing maintainers.
- Fix warning triggered by perf attaching to synthetic events
The synthetic events do not add a function to be registered when perf
attaches to them. This causes a warning when perf registers a
synthetic event and passes a NULL pointer to the tracepoint register
function.
Ideally synthetic events should be updated to work with perf, but as
that's a feature and not a bug fix, simply now return -ENODEV when
perf tries to register an event that has a NULL pointer for its
function. This no longer causes a kernel warning and simply causes
the perf code to fail with an error message.
- Fix 32bit overflow in option flag test
The option's flags changed from 32 bits in size to 64 bits in size.
Fix one of the places that shift 1 by the option bit number to to be
1ULL.
- Fix the output of printing the direct jmp functions
The enabled_functions that shows how functions are being attached by
ftrace wasn't updated to accommodate the new direct jmp trampolines
that set the LSB of the pointer, and outputs garbage. Update the
output to handle the direct jmp trampolines.
* tag 'trace-v6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
ftrace: Fix address for jmp mode in t_show()
tracing: Fix UBSAN warning in __remove_instance()
tracing: Do not register unsupported perf events
MAINTAINERS: add tracepoint core-api doc files to TRACING
Current release - regressions:
- netfilter: nf_conncount: fix leaked ct in error paths
- sched: act_mirred: fix loop detection
- sctp: fix potential deadlock in sctp_clone_sock()
- can: fix build dependency
- eth: mlx5e: do not update BQL of old txqs during channel reconfiguration
Previous releases - regressions:
- sched: ets: always remove class from active list before deleting it
- inet: frags: flush pending skbs in fqdir_pre_exit()
- netfilter: nf_nat: remove bogus direction check
- mptcp:
- schedule rtx timer only after pushing data
- avoid deadlock on fallback while reinjecting
- can: gs_usb: fix error handling
- eth: mlx5e:
- avoid unregistering PSP twice
- fix double unregister of HCA_PORTS component
- eth: bnxt_en: fix XDP_TX path
- eth: mlxsw: fix use-after-free when updating multicast route stats
Previous releases - always broken:
- ethtool: avoid overflowing userspace buffer on stats query
- openvswitch: fix middle attribute validation in push_nsh() action
- eth: mlx5: fw_tracer, validate format string parameters
- eth: mlxsw: spectrum_router: fix neighbour use-after-free
- eth: ipvlan: ignore PACKET_LOOPBACK in handle_mode_l2()
Misc:
- Jozsef Kadlecsik retires from maintaining netfilter
- tools: ynl: fix build on systems with old kernel headers
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmlEPS0SHHBhYmVuaUBy
ZWRoYXQuY29tAAoJECkkeY3MjxOkHLEP/3TnVI9qLivOGsZ48bn5UUcIaIlX0wiw
i5gwpUhbtW3zcLSphO0Nh/CmProme6dFhaMOOkk48bKAdIxRpOMbCC20bfYDLyd0
ZJTUQqheKpuI23vOnhfs2TTOkcqz6cM7txUq681taQ8Nfo8Dbf0fgOO2HT5ljRD+
JXlxrdvicZDtve68sSdnsAj5M15EPQGKTPMyPkymBmCKnCMQK9SQKaeTEtaW6NIO
yM0KqDSNoulDc/6LYMhx8DTtyE7yTiTxwe2NixjdyYljXsk0KiRDirCzxnZuSgh8
oh+cFLdFDq4mwdYEKjgC5c3ifQyLEpZvwzlY5MKoobsVnT5SbigeSK53l5rEkO3V
sM84lITHfqTJvlid0AF/ixEc6iWwV7nGRBHh2FXNbfoIKt45eF77jPi9YFsq6Z95
vlCzYIbY0f2L1y3mPvZbzGQbh2Z12b5kyK8QA1j7SK+zNzxgXbf4+ZtYxHg7O3Ne
gecmIpTKXMWodaZyfsRQPjR/F6UIlMqsgl9Ci9bfUw+XwL8x+7bJxQAQz+yVjLla
ng14BItiYKBavcPZBjYlhGKqD1fzGhVZqQecrCkF0VTbMusRd9RcwytU1NG4QGDx
V5aL28ht85KtMednEWOBkrg+PeXnNyZHzLAf2Xtx3UkaGgiDC8G4IxUv3orlOLD1
sFPfZnSiGCof
=6pla
-----END PGP SIGNATURE-----
Merge tag 'net-6.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from netfilter and CAN.
Current release - regressions:
- netfilter: nf_conncount: fix leaked ct in error paths
- sched: act_mirred: fix loop detection
- sctp: fix potential deadlock in sctp_clone_sock()
- can: fix build dependency
- eth: mlx5e: do not update BQL of old txqs during channel
reconfiguration
Previous releases - regressions:
- sched: ets: always remove class from active list before deleting it
- inet: frags: flush pending skbs in fqdir_pre_exit()
- netfilter: nf_nat: remove bogus direction check
- mptcp:
- schedule rtx timer only after pushing data
- avoid deadlock on fallback while reinjecting
- can: gs_usb: fix error handling
- eth:
- mlx5e:
- avoid unregistering PSP twice
- fix double unregister of HCA_PORTS component
- bnxt_en: fix XDP_TX path
- mlxsw: fix use-after-free when updating multicast route stats
Previous releases - always broken:
- ethtool: avoid overflowing userspace buffer on stats query
- openvswitch: fix middle attribute validation in push_nsh() action
- eth:
- mlx5: fw_tracer, validate format string parameters
- mlxsw: spectrum_router: fix neighbour use-after-free
- ipvlan: ignore PACKET_LOOPBACK in handle_mode_l2()
Misc:
- Jozsef Kadlecsik retires from maintaining netfilter
- tools: ynl: fix build on systems with old kernel headers"
* tag 'net-6.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (83 commits)
net: hns3: add VLAN id validation before using
net: hns3: using the num_tqps to check whether tqp_index is out of range when vf get ring info from mbx
net: hns3: using the num_tqps in the vf driver to apply for resources
net: enetc: do not transmit redirected XDP frames when the link is down
selftests/tc-testing: Test case exercising potential mirred redirect deadlock
net/sched: act_mirred: fix loop detection
sctp: Clear inet_opt in sctp_v6_copy_ip_options().
sctp: Fetch inet6_sk() after setting ->pinet6 in sctp_clone_sock().
net/handshake: duplicate handshake cancellations leak socket
net/mlx5e: Don't include PSP in the hard MTU calculations
net/mlx5e: Do not update BQL of old txqs during channel reconfiguration
net/mlx5e: Trigger neighbor resolution for unresolved destinations
net/mlx5e: Use ip6_dst_lookup instead of ipv6_dst_lookup_flow for MAC init
net/mlx5: Serialize firmware reset with devlink
net/mlx5: fw_tracer, Handle escaped percent properly
net/mlx5: fw_tracer, Validate format string parameters
net/mlx5: Drain firmware reset in shutdown callback
net/mlx5: fw reset, clear reset requested on drain_fw_reset
net: dsa: mxl-gsw1xx: manually clear RANEG bit
net: dsa: mxl-gsw1xx: fix .shutdown driver operation
...
Following the introduction of cpuset1_generate_sched_domains() for v1
in the previous patch, v1-specific logic can now be removed from the
generic generate_sched_domains(). This patch cleans up the v1-only
code and ensures uf_node is only visible when CONFIG_CPUSETS_V1=y.
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The generate_sched_domains() function currently handles both v1 and v2
logic. However, the underlying mechanisms for building scheduler domains
differ significantly between the two versions. For cpuset v2, scheduler
domains are straightforwardly derived from valid partitions, whereas
cpuset v1 employs a more complex union-find algorithm to merge overlapping
cpusets. Co-locating these implementations complicates maintenance.
This patch, along with subsequent ones, aims to separate the v1 and v2
logic. For ease of review, this patch first copies the
generate_sched_domains() function into cpuset-v1.c as
cpuset1_generate_sched_domains() and removes v2-specific code. Common
helpers and top_cpuset are declared in cpuset-internal.h. When operating
in v1 mode, the code now calls cpuset1_generate_sched_domains().
Currently there is some code duplication, which will be largely eliminated
once v1-specific code is removed from v2 in the following patch.
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Since relax_domain_level is only applicable to v1, move
update_domain_attr_tree() to cpuset-v1.c, which solely updates
relax_domain_level,
Additionally, relax_domain_level is now initialized in cpuset1_inited.
Accordingly, the initialization of relax_domain_level in top_cpuset is
removed. The unnecessary remote_partition initialization in top_cpuset
is also cleaned up.
As a result, relax_domain_level can be defined in cpuset only when
CONFIG_CPUSETS_V1=y.
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
This patch introduces the cpuset1_init helper in cpuset_v1.c to initialize
v1-specific fields, including the fmeter and relax_domain_level members.
The relax_domain_level related code will be moved to cpuset_v1.c in a
subsequent patch. After this move, v1-specific members will only be
visible when CONFIG_CPUSETS_V1=y.
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
This commit introduces the cpuset1_online_css helper to centralize
v1-specific handling during cpuset online. It performs operations such as
updating the CS_SPREAD_PAGE, CS_SPREAD_SLAB, and CGRP_CPUSET_CLONE_CHILDREN
flags, which are unique to the cpuset v1 control group interface.
The helper is now placed in cpuset-v1.c to maintain clear separation
between v1 and v2 logic.
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Add lockdep_assert_cpuset_lock_held() to allow other subsystems to verify
that cpuset_mutex is held.
Suggested-by: Waiman Long <longman@redhat.com>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
A warning was triggered as follows:
WARNING: kernel/cgroup/cpuset.c:1651 at remote_partition_disable+0xf7/0x110
RIP: 0010:remote_partition_disable+0xf7/0x110
RSP: 0018:ffffc90001947d88 EFLAGS: 00000206
RAX: 0000000000007fff RBX: ffff888103b6e000 RCX: 0000000000006f40
RDX: 0000000000006f00 RSI: ffffc90001947da8 RDI: ffff888103b6e000
RBP: ffff888103b6e000 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000001 R11: ffff88810b2e2728 R12: ffffc90001947da8
R13: 0000000000000000 R14: ffffc90001947da8 R15: ffff8881081f1c00
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f55c8bbe0b2 CR3: 000000010b14c000 CR4: 00000000000006f0
Call Trace:
<TASK>
update_prstate+0x2d3/0x580
cpuset_partition_write+0x94/0xf0
kernfs_fop_write_iter+0x147/0x200
vfs_write+0x35d/0x500
ksys_write+0x66/0xe0
do_syscall_64+0x6b/0x390
entry_SYSCALL_64_after_hwframe+0x4b/0x53
RIP: 0033:0x7f55c8cd4887
Reproduction steps (on a 16-CPU machine):
# cd /sys/fs/cgroup/
# mkdir A1
# echo +cpuset > A1/cgroup.subtree_control
# echo "0-14" > A1/cpuset.cpus.exclusive
# mkdir A1/A2
# echo "0-14" > A1/A2/cpuset.cpus.exclusive
# echo "root" > A1/A2/cpuset.cpus.partition
# echo 0 > /sys/devices/system/cpu/cpu15/online
# echo member > A1/A2/cpuset.cpus.partition
When CPU 15 is offlined, subpartitions_cpus gets cleared because no CPUs
remain available for the top_cpuset, forcing partitions to share CPUs with
the top_cpuset. In this scenario, disabling the remote partition triggers
a warning stating that effective_xcpus is not a subset of
subpartitions_cpus. Partitions should be invalidated in this case to
inform users that the partition is now invalid(cpus are shared with
top_cpuset).
To fix this issue:
1. Only emit the warning only if subpartitions_cpus is not empty and the
effective_xcpus is not a subset of subpartitions_cpus.
2. During the CPU hotplug process, invalidate partitions if
subpartitions_cpus is empty.
Fixes: f62a5d3936 ("cgroup/cpuset: Remove remote_partition_check() & make update_cpumasks_hier() handle remote partition")
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
In cases where the ww_mutex test was occasionally tripping on
hard to find issues, leaving qemu in a reboot loop was my best
way to reproduce problems. These reboots however wasted time
when I just wanted to run the test-ww_mutex logic.
So tweak the test-ww_mutex test so that it can be re-triggered
via a sysfs file, so the test can be run repeatedly without
doing module loads or restarting.
This has been particularly valuable to stressing and finding
issues with the proxy-exec series.
To use, run as root:
echo 1 > /sys/kernel/test_ww_mutex/run_tests
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251205013515.759030-4-jstultz@google.com
The test-ww_mutex test already allocates its own workqueue
so be sure to use it for the mtx.work and abba.work rather
then the default system workqueue.
This resolves numerous messages of the sort:
"workqueue: test_abba_work hogged CPU... consider switching to WQ_UNBOUND"
"workqueue: test_mutex_work hogged CPU... consider switching to WQ_UNBOUND"
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251205013515.759030-3-jstultz@google.com
Currently the test-ww_mutex tool only utilizes the wait-die
class of ww_mutexes, and thus isn't very helpful in exercising
the wait-wound class of ww_mutexes.
So extend the test to exercise both classes of ww_mutexes for
all of the subtests.
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251205013515.759030-2-jstultz@google.com
The address from ftrace_find_rec_direct() is printed directly in t_show().
This can mislead symbol offsets if it has the "jmp" bit in the last bit.
Fix this by printing the address that returned by ftrace_jmp_get().
Link: https://patch.msgid.link/20251217030053.80343-1-dongml2@chinatelecom.cn
Fixes: 25e4e3565d ("ftrace: Introduce FTRACE_OPS_FL_JMP")
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Synthetic events currently do not have a function to register perf events.
This leads to calling the tracepoint register functions with a NULL
function pointer which triggers:
------------[ cut here ]------------
WARNING: kernel/tracepoint.c:175 at tracepoint_add_func+0x357/0x370, CPU#2: perf/2272
Modules linked in: kvm_intel kvm irqbypass
CPU: 2 UID: 0 PID: 2272 Comm: perf Not tainted 6.18.0-ftest-11964-ge022764176fc-dirty #323 PREEMPTLAZY
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.17.0-debian-1.17.0-1 04/01/2014
RIP: 0010:tracepoint_add_func+0x357/0x370
Code: 28 9c e8 4c 0b f5 ff eb 0f 4c 89 f7 48 c7 c6 80 4d 28 9c e8 ab 89 f4 ff 31 c0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 cc cc cc cc cc <0f> 0b 49 c7 c6 ea ff ff ff e9 ee fe ff ff 0f 0b e9 f9 fe ff ff 0f
RSP: 0018:ffffabc0c44d3c40 EFLAGS: 00010246
RAX: 0000000000000001 RBX: ffff9380aa9e4060 RCX: 0000000000000000
RDX: 000000000000000a RSI: ffffffff9e1d4a98 RDI: ffff937fcf5fd6c8
RBP: 0000000000000001 R08: 0000000000000007 R09: ffff937fcf5fc780
R10: 0000000000000003 R11: ffffffff9c193910 R12: 000000000000000a
R13: ffffffff9e1e5888 R14: 0000000000000000 R15: ffffabc0c44d3c78
FS: 00007f6202f5f340(0000) GS:ffff93819f00f000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055d3162281a8 CR3: 0000000106a56003 CR4: 0000000000172ef0
Call Trace:
<TASK>
tracepoint_probe_register+0x5d/0x90
synth_event_reg+0x3c/0x60
perf_trace_event_init+0x204/0x340
perf_trace_init+0x85/0xd0
perf_tp_event_init+0x2e/0x50
perf_try_init_event+0x6f/0x230
? perf_event_alloc+0x4bb/0xdc0
perf_event_alloc+0x65a/0xdc0
__se_sys_perf_event_open+0x290/0x9f0
do_syscall_64+0x93/0x7b0
? entry_SYSCALL_64_after_hwframe+0x76/0x7e
? trace_hardirqs_off+0x53/0xc0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Instead, have the code return -ENODEV, which doesn't warn and has perf
error out with:
# perf record -e synthetic:futex_wait
Error:
The sys_perf_event_open() syscall returned with 19 (No such device) for event (synthetic:futex_wait).
"dmesg | grep -i perf" may provide additional information.
Ideally perf should support synthetic events, but for now just fix the
warning. The support can come later.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: https://patch.msgid.link/20251216182440.147e4453@gandalf.local.home
Fixes: 4b147936fa ("tracing: Add support for 'synthetic' events")
Reported-by: Ian Rogers <irogers@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The mediated_pmu_account_event() and perf_create_mediated_pmu()
functions implement the exclusion between '!exclude_guest' counters
and mediated vPMUs. Their implementation is basically identical,
except mirrored in what they count/check.
Make sure the actual implementations reflect this similarity.
Notably:
- while perf_release_mediated_pmu() has an underflow check;
mediated_pmu_unaccount_event() did not.
- while perf_create_mediated_pmu() has an inc_not_zero() path;
mediated_pmu_account_event() did not.
Also, the inc_not_zero() path can be outsite of
perf_mediated_pmu_mutex. The mutex must guard the 0->1 (of either
nr_include_guest_events or nr_mediated_pmu_vms) transition, but once a
counter is already non-zero, it can safely be incremented further.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251208115156.GE3707891@noisy.programming.kicks-ass.net
This simplifies the code. unwind_user_next_fp() does not need to
return -EINVAL if config option HAVE_UNWIND_USER_FP is disabled, as
unwind_user_start() will then not select this unwind method and
unwind_user_next() will therefore not call it.
Provide (1) a dummy definition of ARCH_INIT_USER_FP_FRAME, if the unwind
user method HAVE_UNWIND_USER_FP is not enabled, (2) a common fallback
definition of unwind_user_at_function_start() which returns false, and
(3) a common dummy definition of ARCH_INIT_USER_FP_ENTRY_FRAME.
Note that enabling the config option HAVE_UNWIND_USER_FP without
defining ARCH_INIT_USER_FP_FRAME triggers a compile error, which is
helpful when implementing support for this unwind user method in an
architecture. Enabling the config option when providing an arch-
specific unwind_user_at_function_start() definition makes it necessary
to also provide an arch-specific ARCH_INIT_USER_FP_ENTRY_FRAME
definition.
Signed-off-by: Jens Remus <jremus@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251208160352.1363040-3-jremus@linux.ibm.com
Move the comment "Get the Canonical Frame Address (CFA)" to the top
of the sequence of statements that actually get the CFA. Reword the
comment "Find the Return Address (RA)" to "Get ...", as the statements
actually get the RA. Add a respective comment to the statements that
get the FP. This will be useful once future commits extend the logic
to get the RA and FP.
While at it align the comment on the "stack going in wrong direction"
check to the following one on the "address is word aligned" check.
Signed-off-by: Jens Remus <jremus@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251208160352.1363040-2-jremus@linux.ibm.com
Wire up system vector 0xf5 for handling PMIs (i.e. interrupts delivered
through the LVTPC) while running KVM guests with a mediated PMU. Perf
currently delivers all PMIs as NMIs, e.g. so that events that trigger while
IRQs are disabled aren't delayed and generate useless records, but due to
the multiplexing of NMIs throughout the system, correctly identifying NMIs
for a mediated PMU is practically infeasible.
To (greatly) simplify identifying guest mediated PMU PMIs, perf will
switch the CPU's LVTPC between PERF_GUEST_MEDIATED_PMI_VECTOR and NMI when
guest PMU context is loaded/put. I.e. PMIs that are generated by the CPU
while the guest is active will be identified purely based on the IRQ
vector.
Route the vector through perf, e.g. as opposed to letting KVM attach a
handler directly a la posted interrupt notification vectors, as perf owns
the LVTPC and thus is the rightful owner of PERF_GUEST_MEDIATED_PMI_VECTOR.
Functionally, having KVM directly own the vector would be fine (both KVM
and perf will be completely aware of when a mediated PMU is active), but
would lead to an undesirable split in ownership: perf would be responsible
for installing the vector, but not handling the resulting IRQs.
Add a new perf_guest_info_callbacks hook (and static call) to allow KVM to
register its handler with perf when running guests with mediated PMUs.
Note, because KVM always runs guests with host IRQs enabled, there is no
danger of a PMI being delayed from the guest's perspective due to using a
regular IRQ instead of an NMI.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://patch.msgid.link/20251206001720.468579-9-seanjc@google.com
Add exported APIs to load/put a guest mediated PMU context. KVM will
load the guest PMU shortly before VM-Enter, and put the guest PMU shortly
after VM-Exit.
On the perf side of things, schedule out all exclude_guest events when the
guest context is loaded, and schedule them back in when the guest context
is put. I.e. yield the hardware PMU resources to the guest, by way of KVM.
Note, perf is only responsible for managing host context. KVM is
responsible for loading/storing guest state to/from hardware.
[sean: shuffle patches around, write changelog]
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://patch.msgid.link/20251206001720.468579-8-seanjc@google.com
Current perf doesn't explicitly schedule out all exclude_guest events
while the guest is running. There is no problem with the current
emulated vPMU. Because perf owns all the PMU counters. It can mask the
counter which is assigned to an exclude_guest event when a guest is
running (Intel way), or set the corresponding HOSTONLY bit in evsentsel
(AMD way). The counter doesn't count when a guest is running.
However, either way doesn't work with the introduced mediated vPMU.
A guest owns all the PMU counters when it's running. The host should not
mask any counters. The counter may be used by the guest. The evsentsel
may be overwritten.
Perf should explicitly schedule out all exclude_guest events to release
the PMU resources when entering a guest, and resume the counting when
exiting the guest.
It's possible that an exclude_guest event is created when a guest is
running. The new event should not be scheduled in as well.
The ctx time is shared among different PMUs. The time cannot be stopped
when a guest is running. It is required to calculate the time for events
from other PMUs, e.g., uncore events. Add timeguest to track the guest
run time. For an exclude_guest event, the elapsed time equals
the ctx time - guest time.
Cgroup has dedicated times. Use the same method to deduct the guest time
from the cgroup time as well.
[sean: massage comments]
Co-developed-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://patch.msgid.link/20251206001720.468579-7-seanjc@google.com
The current perf tracks two timestamps for the normal ctx and cgroup.
The same type of variables and similar codes are used to track the
timestamps. In the following patch, the third timestamp to track the
guest time will be introduced.
To avoid the code duplication, add a new struct perf_time_ctx and factor
out a generic function update_perf_time_ctx().
No functional change.
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://patch.msgid.link/20251206001720.468579-6-seanjc@google.com
Currently, exposing PMU capabilities to a KVM guest is done by emulating
guest PMCs via host perf events, i.e. by having KVM be "just" another user
of perf. As a result, the guest and host are effectively competing for
resources, and emulating guest accesses to vPMU resources requires
expensive actions (expensive relative to the native instruction). The
overhead and resource competition results in degraded guest performance
and ultimately very poor vPMU accuracy.
To address the issues with the perf-emulated vPMU, introduce a "mediated
vPMU", where the data plane (PMCs and enable/disable knobs) is exposed
directly to the guest, but the control plane (event selectors and access
to fixed counters) is managed by KVM (via MSR interceptions). To allow
host perf usage of the PMU to (partially) co-exist with KVM/guest usage
of the PMU, KVM and perf will coordinate to a world switch between host
perf context and guest vPMU context near VM-Enter/VM-Exit.
Add two exported APIs, perf_{create,release}_mediated_pmu(), to allow KVM
to create and release a mediated PMU instance (per VM). Because host perf
context will be deactivated while the guest is running, mediated PMU usage
will be mutually exclusive with perf analysis of the guest, i.e. perf
events that do NOT exclude the guest will not behave as expected.
To avoid silent failure of !exclude_guest perf events, disallow creating a
mediated PMU if there are active !exclude_guest events, and on the perf
side, disallowing creating new !exclude_guest perf events while there is
at least one active mediated PMU.
Exempt PMU resources that do not support mediated PMU usage, i.e. that are
outside the scope/view of KVM's vPMU and will not be swapped out while the
guest is running.
Guard mediated PMU with a new kconfig to help readers identify code paths
that are unique to mediated PMU support, and to allow for adding arch-
specific hooks without stubs. KVM x86 is expected to be the only KVM
architecture to support a mediated PMU in the near future (e.g. arm64 is
trending toward a partitioned PMU implementation), and KVM x86 will select
PERF_GUEST_MEDIATED_PMU unconditionally, i.e. won't need stubs.
Immediately select PERF_GUEST_MEDIATED_PMU when KVM x86 is enabled so that
all paths are compile tested. Full KVM support is on its way...
[sean: add kconfig and WARNing, rewrite changelog, swizzle patch ordering]
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://patch.msgid.link/20251206001720.468579-5-seanjc@google.com
Move the freeing of any security state associated with a perf event from
_free_event() to __free_event(), i.e. invoke security_perf_event_free() in
the error paths for perf_event_alloc(). This will allow adding potential
error paths in perf_event_alloc() that can occur after allocating security
state.
Note, kfree() and thus security_perf_event_free() is a nop if
event->security is NULL, i.e. calling security_perf_event_free() even if
security_perf_event_alloc() fails or is never reached is functionality ok.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://patch.msgid.link/20251206001720.468579-4-seanjc@google.com
Only KVM knows the exact time when a guest is entering/exiting. Expose
two interfaces to KVM to switch the ownership of the PMU resources.
All the pinned events must be scheduled in first. Extend the
perf_event_sched_in() helper to support extra flag, e.g., EVENT_GUEST.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://patch.msgid.link/20251206001720.468579-3-seanjc@google.com
To optimize the cgroup context switch, the perf_event_pmu_context
iteration skips the PMUs without cgroup events. A bool cgroup was
introduced to indicate the case. It can work, but this way is hard to
extend for other cases, e.g. skipping non-mediated PMUs. It doesn't
make sense to keep adding bool variables.
Pass the event_type instead of the specific bool variable. Check both
the event_type and related pmu_ctx variables to decide whether skipping
a PMU.
Event flags, e.g., EVENT_CGROUP, should be cleard in the ctx->is_active.
Add EVENT_FLAGS to indicate such event flags.
No functional change.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://patch.msgid.link/20251206001720.468579-2-seanjc@google.com
Commit 47efe2ddcc ("sched/core: Add assertions to QUEUE_CLASS") added an
assert to sched_change_end() verifying that a class demotion would result in a
reschedule.
As it turns out; rt_mutex_setprio() does not force a resched on class
demontion. Furthermore, this is only relevant to running tasks.
Change the warning into a reschedule and make sure to only do so for running
tasks.
Fixes: 47efe2ddcc ("sched/core: Add assertions to QUEUE_CLASS")
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Tested-by: Linux Kernel Functional Testing <lkft@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251216141725.GW3707837@noisy.programming.kicks-ass.net
Change sched_class::wakeup_preempt() to also get called for
cross-class wakeups, specifically those where the woken task
is of a higher class than the previous highest class.
In order to do this, track the current highest class of the runqueue
in rq::next_class and have wakeup_preempt() track this upwards for
each new wakeup. Additionally have schedule() re-set the value on
pick.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://patch.msgid.link/20251127154725.901391274@infradead.org
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmlCBmwACgkQ6rmadz2v
bToUZA//ZY0IE1x1nCixEAqGF/nGpDzVX4YQQfjrUoXQOD4ykzt35yTNXl6B1IVA
dliVSI6kUtdoThUa7xJUxMSkDsVBsEMT/zYXQEXJG1zXvJANCB9wTzsC3OCBWbXt
BRczcEkq0OHC9/l5CrILR6ocwxKGDIMIysfeOSABgfqckSEhylWy3+EWZQCk08ka
gNpXlDJUG7dYpcZD/zhuC7e5Rg1uNvN7WiTv+Biig8xZCsEtYOq+qC5C/sOnsypI
nqfECfbx48cVl49SjatdgquuHn/INESdLRCHisshkurA2Mp5PQuCmrwlXbv4JG59
v9b7lsFQlkpvEXMdo9VYe6K2gjfkOPRdWsVPu2oXA1qISRmrDqX8cKOpapUIwRhL
p3ASruMOnz0KFqVaET8+5u2SwtALeW+c+1p1aHMfVGF/qbXuyG05qBkLoGFJR+Xr
WznXUXY80Z7pjD57SpA6U3DigAkGqKCBXUwdifaOq8HQonwsnQGqkW/3NngNULGP
IC4u0JXn61VgQsM/kAw+ucc4bdKI0g4oKJR56lT48elrj6Yxrjpde4oOqzZ0IQKu
VQ0YnzWqqT2tjh4YNMOwkNPbFR4ALd329zI6TUkWib/jByEBNcfjSj9BRANd1KSx
JgSHAE6agrbl6h3nOx584YCasX3Zq+nfv1Sj4Z/5GaHKKW3q/Vw=
=wHLt
-----END PGP SIGNATURE-----
Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Pull bpf fixes from Alexei Starovoitov:
- Fix BPF builds due to -fms-extensions. selftests (Alexei
Starovoitov), bpftool (Quentin Monnet).
- Fix build of net/smc when CONFIG_BPF_SYSCALL=y, but CONFIG_BPF_JIT=n
(Geert Uytterhoeven)
- Fix livepatch/BPF interaction and support reliable unwinding through
BPF stack frames (Josh Poimboeuf)
- Do not audit capability check in arm64 JIT (Ondrej Mosnacek)
- Fix truncated dmabuf BPF iterator reads (T.J. Mercier)
- Fix verifier assumptions of bpf_d_path's output buffer (Shuran Liu)
- Fix warnings in libbpf when built with -Wdiscarded-qualifiers under
C23 (Mikhail Gavrilov)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
selftests/bpf: add regression test for bpf_d_path()
bpf: Fix verifier assumptions of bpf_d_path's output buffer
selftests/bpf: Add test for truncated dmabuf_iter reads
bpf: Fix truncated dmabuf iterator reads
x86/unwind/orc: Support reliable unwinding through BPF stack frames
bpf: Add bpf_has_frame_pointer()
bpf, arm64: Do not audit capability check in do_jit()
libbpf: Fix -Wdiscarded-qualifiers under C23
bpftool: Fix build warnings due to MS extensions
net: smc: SMC_HS_CTRL_BPF should depend on BPF_JIT
selftests/bpf: Add -fms-extensions to bpf build flags
Smatch reported:
kernel/sched/ext.c:5332 scx_alloc_and_add_sched() warn: passing zero to 'ERR_PTR'
In scx_alloc_and_add_sched(), the alloc_percpu() failure path jumps to
err_free_gdsqs without initializing @ret. That can lead to returning
ERR_PTR(0), which violates the ERR_PTR() convention and confuses
callers.
Set @ret to -ENOMEM before jumping to the error path when
alloc_percpu() fails.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/r/202512141601.yAXDAeA9-lkp@intel.com/
Reported-by: Dan Carpenter <error27@gmail.com>
Fixes: c201ea1578 ("sched_ext: Move event_stats_cpu into scx_sched")
Signed-off-by: Liang Jie <liangjie@lixiang.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The verifier currently limits direct offsets into a map to 512MiB
to avoid overflow during pointer arithmetic. However, this prevents
arena maps from using direct addressing instructions to access data
at the end of > 512MiB arena maps. This is necessary when moving
arena globals to the end of the arena instead of the front.
Refactor the verifier code to remove the offset calculation during
direct value access calculations. This is possible because the only
two map types that implement .map_direct_value_addr() are arrays and
arenas, and they both do their own internal checks to ensure the
offset is within bounds.
Adjust selftests that expect the old error. These tests still fail
because the verifier identifies the access as out of bounds for the
map, so change them to expect an "invalid access to map value pointer"
error instead.
Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20251216173325.98465-3-emil@etsalapatis.com
NETFILTER_PKT records show both source and destination
addresses, in addition to the associated networking protocol.
However, it lacks the ports information, which is often
valuable for troubleshooting.
This patch adds both source and destination port numbers,
'sport' and 'dport' respectively, to TCP, UDP, UDP-Lite and
SCTP-related NETFILTER_PKT records.
$ TESTS="netfilter_pkt" make -e test &> /dev/null
$ ausearch -i -ts recent |grep NETFILTER_PKT
type=NETFILTER_PKT ... proto=icmp
type=NETFILTER_PKT ... proto=ipv6-icmp
type=NETFILTER_PKT ... proto=udp sport=46333 dport=42424
type=NETFILTER_PKT ... proto=udp sport=35953 dport=42424
type=NETFILTER_PKT ... proto=tcp sport=50314 dport=42424
type=NETFILTER_PKT ... proto=tcp sport=57346 dport=42424
Link: https://github.com/linux-audit/audit-kernel/issues/162
Signed-off-by: Ricardo Robaina <rrobaina@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Netfilter code (net/netfilter/nft_log.c and net/netfilter/xt_AUDIT.c)
have to be kept in sync. Both source files had duplicated versions of
audit_ip4() and audit_ip6() functions, which can result in lack of
consistency and/or duplicated work.
This patch adds a helper function in audit.c that can be called by
netfilter code commonly, aiming to improve maintainability and
consistency.
Suggested-by: Florian Westphal <fw@strlen.de>
Suggested-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Ricardo Robaina <rrobaina@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Paul Moore <paul@paul-moore.com>
- Fix memory leak when destroying helper kthread workers during scheduler
disable.
- Fix bypass depth accounting on scx_enable() failure which could leave
the system permanently in bypass mode.
- Fix missing preemption handling when moving tasks to local DSQs via
scx_bpf_dsq_move().
- Misc fixes including NULL check for put_prev_task(), flushing stdout in
selftests, and removing unused code.
-----BEGIN PGP SIGNATURE-----
iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaUA9aQ4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGXWKAP9VeREr2ceRFd3WK5vFsFzCGh1L8rRu+t352MGL
qWXkzQD/apLFSo5RQZ0tE8qORwKM9hNhS8QkVJhlsv5VZRWMBQI=
=ueoE
-----END PGP SIGNATURE-----
Merge tag 'sched_ext-for-6.19-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:
- Fix memory leak when destroying helper kthread workers during
scheduler disable
- Fix bypass depth accounting on scx_enable() failure which could leave
the system permanently in bypass mode
- Fix missing preemption handling when moving tasks to local DSQs via
scx_bpf_dsq_move()
- Misc fixes including NULL check for put_prev_task(), flushing stdout
in selftests, and removing unused code
* tag 'sched_ext-for-6.19-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: Remove unused code in the do_pick_task_scx()
selftests/sched_ext: flush stdout before test to avoid log spam
sched_ext: Fix missing post-enqueue handling in move_local_task_to_local_dsq()
sched_ext: Factor out local_dsq_post_enq() from dispatch_enqueue()
sched_ext: Fix bypass depth leak on scx_enable() failure
sched/ext: Avoid null ptr traversal when ->put_prev_task() is called with NULL next
sched_ext: Fix the memleak for sch->helper objects
- Fix a race condition in css_rstat_updated() where CMPXCHG without LOCK
prefix could cause lnode corruption when the flusher runs concurrently
on another CPU. The issue was introduced in 6.17 and causes memcg stats
to become corrupted in production.
-----BEGIN PGP SIGNATURE-----
iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaUA8mg4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGXhaAQDSk/ywLzJG807gzvhC5QLcY8kYzUUFHJ4TQAFp
08UqOAEA9yDgBHPqkp6ucjEJMG+2esE9A5NJB2LeEjG8ZZOCDQM=
=Kcfd
-----END PGP SIGNATURE-----
Merge tag 'cgroup-for-6.19-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fix from Tejun Heo:
- Fix a race condition in css_rstat_updated() where CMPXCHG without
LOCK prefix could cause lnode corruption when the flusher runs
concurrently on another CPU. The issue was introduced in 6.17 and
causes memcg stats to become corrupted in production.
* tag 'cgroup-for-6.19-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: rstat: use LOCK CMPXCHG in css_rstat_updated
Add infrastructure to redirect interrupt handler execution to a
different CPU when the current CPU is not part of the interrupt's CPU
affinity mask.
This is primarily aimed at (de)multiplexed interrupts, where the child
interrupt handler runs in the context of the parent interrupt handler,
and therefore CPU affinity control for the child interrupt is typically
not available.
With the new infrastructure, the child interrupt is allowed to freely
change its affinity setting, independently of the parent. If the
interrupt handler happens to be triggered on an "incompatible" CPU (a
CPU that's not part of the child interrupt's affinity mask), the handler
is redirected and runs in IRQ work context on a "compatible" CPU.
No functional change is being made to any existing irqchip driver, and
irqchip drivers must be explicitly modified to use the newly added
infrastructure to support interrupt redirection.
Originally-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Radu Rendec <rrendec@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/linux-pci/878qpg4o4t.ffs@tglx/
Link: https://patch.msgid.link/20251128212055.1409093-2-rrendec@redhat.com
setup_percpu_irq() was always a bad kludge, and should have never
been there the first place. Now that the last users are gone,
remove it for good.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/20251210082242.360936-7-maz@kernel.org
With the IRQ timing stuff being gone, there is no need to specify a flag
when requesting a percpu interrupt. Not only IRQF_TIMER was the only flag
(set of flags actually) allowed, but nobody ever passed it.
Get rid of __request_percpu_irq(), which was only getting 0 as flags, and
promote request_percpu_irq_affinity() as its replacement.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jinjie Ruan <ruanjinjie@huawei.com>
Link: https://patch.msgid.link/20251210082242.360936-3-maz@kernel.org
The IRQ timing tracking infrastructure was merged in 2019, but was never
plumbed in, is not selectable, and is therefore never used.
As Daniel agrees that there is little hope for this infrastructure to be
completed in the near term, drop it altogether.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jinjie Ruan <ruanjinjie@huawei.com>
Link: https://lore.kernel.org/r/87zf7vex6h.wl-maz@kernel.org
Link: https://patch.msgid.link/20251210082242.360936-2-maz@kernel.org
New network transport protocols want NIC drivers to get hardware timestamps
of all incoming packets, and possibly all outgoing packets.
One example is the upcoming 'Swift congestion control' which is used by TCP
transport and is the primary need for timecounter_cyc2time(). This means
timecounter_cyc2time() can be called more than 100 million times per second
on a busy server.
Inlining timecounter_cyc2time() brings a 12% improvement on a UDP receive
stress test on a 100Gbit NIC.
Note that FDO, LTO, PGO are unable to magically help for this case,
presumably because NIC drivers are almost exclusively shipped as modules.
Add an unlikely() around the cc_cyc2ns_backwards() case, even if FDO (when
used) is able to take care of this optimization.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://research.google/pubs/swift-delay-is-simple-and-effective-for-congestion-control-in-the-datacenter/
Link: https://patch.msgid.link/20251129095740.3338476-1-edumazet@google.com
The kick_idle variable is no longer used, this commit therefore remove
it and also remove associated code in the do_pick_task_scx().
Signed-off-by: Zqiang <qiang.zhang@linux.dev>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The commit d5d399efff ("printk/nbcon: Release nbcon consoles ownership
in atomic flush after each emitted record") prevented stall of a CPU
which lost nbcon console ownership because another CPU entered
an emergency flush.
But there is still the problem that the CPU doing the emergency flush
might cause a stall on its own.
Let's go even further and restore IRQ in the atomic flush after
each emitted record.
It is not a complete solution. The interrupts and/or scheduling might
still be blocked when the emergency atomic flush was called with
IRQs and/or scheduling disabled. But it should remove the following
lockup:
mlx5_core 0000:03:00.0: Shutdown was called
kvm: exiting hardware virtualization
arm-smmu-v3 arm-smmu-v3.10.auto: CMD_SYNC timeout at 0x00000103 [hwprod 0x00000104, hwcons 0x00000102]
smp: csd: Detected non-responsive CSD lock (#1) on CPU#4, waiting 5000000032 ns for CPU#00 do_nothing (kernel/smp.c:1057)
smp: csd: CSD lock (#1) unresponsive.
[...]
Call trace:
pl011_console_write_atomic (./arch/arm64/include/asm/vdso/processor.h:12 drivers/tty/serial/amba-pl011.c:2540) (P)
nbcon_emit_next_record (kernel/printk/nbcon.c:1049)
__nbcon_atomic_flush_pending_con (kernel/printk/nbcon.c:1517)
__nbcon_atomic_flush_pending.llvm.15488114865160659019 (./arch/arm64/include/asm/alternative-macros.h:254 ./arch/arm64/include/asm/cpufeature.h:808 ./arch/arm64/include/asm/irqflags.h:192 kernel/printk/nbcon.c:1562 kernel/printk/nbcon.c:1612)
nbcon_atomic_flush_pending (kernel/printk/nbcon.c:1629)
printk_kthreads_shutdown (kernel/printk/printk.c:?)
syscore_shutdown (drivers/base/syscore.c:120)
kernel_kexec (kernel/kexec_core.c:1045)
__arm64_sys_reboot (kernel/reboot.c:794 kernel/reboot.c:722 kernel/reboot.c:722)
invoke_syscall (arch/arm64/kernel/syscall.c:50)
el0_svc_common.llvm.14158405452757855239 (arch/arm64/kernel/syscall.c:?)
do_el0_svc (arch/arm64/kernel/syscall.c:152)
el0_svc (./arch/arm64/include/asm/alternative-macros.h:254 ./arch/arm64/include/asm/cpufeature.h:808 ./arch/arm64/include/asm/irqflags.h:73 arch/arm64/kernel/entry-common.c:169 arch/arm64/kernel/entry-common.c:182 arch/arm64/kernel/entry-common.c:749)
el0t_64_sync_handler (arch/arm64/kernel/entry-common.c:820)
el0t_64_sync (arch/arm64/kernel/entry.S:600)
In this case, nbcon_atomic_flush_pending() is called from
printk_kthreads_shutdown() with IRQs and scheduling enabled.
Note that __nbcon_atomic_flush_pending_con() is directly called also from
nbcon_device_release() where the disabled IRQs might break PREEMPT_RT
guarantees. But the atomic flush is called only in emergency or panic
situations where the latencies are irrelevant anyway.
An ultimate solution would be a touching of watchdogs. But it would hide
all problems. Let's do it later when anyone reports a stall which does
not have a better solution.
Closes: https://lore.kernel.org/r/sqwajvt7utnt463tzxgwu2yctyn5m6bjwrslsnupfexeml6hkd@v6sqmpbu3vvu
Tested-by: Breno Leitao <leitao@debian.org>
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Link: https://patch.msgid.link/20251212124520.244483-1-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
These callbacks are necessary to synchronize ->write_thread callback
against other operations using the same device.
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Link: https://patch.msgid.link/20251208-nbcon-device-cb-fix-v2-1-36be8d195123@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
When spawning and killing threads in separate processes in parallel the
primary bottleneck on the stock kernel is pidmap_lock, largely because
of a back-to-back acquire in the common case. This aspect is fixed with
the patch.
Performance improvement varies between reboots. When benchmarking with
20 processes creating and killing threads in a loop, the unpatched
baseline hovers around 465k ops/s, while patched is anything between
~510k ops/s and ~560k depending on false-sharing (which I only minimally
sanitized). So this is at least 10% if you are unlucky.
The change also facilitated some cosmetic fixes.
It has an unintentional side effect of no longer issuing spurious
idr_preload() around idr_replace().
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251203092851.287617-3-mjguzik@gmail.com
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Till now, the runtime PM workqueue has been flagged as freezable, so it
does not process work items during system-wide PM transitions like
system suspend and resume. The original reason to do that was to
reduce the likelihood of runtime PM getting in the way of system-wide
PM processing, but now it is mostly an optimization because (1) runtime
suspend of devices is prevented by bumping up their runtime PM usage
counters in device_prepare() and (2) device drivers are expected to
disable runtime PM for the devices handled by them before they embark
on system-wide PM activities that may change the state of the hardware
or otherwise interfere with runtime PM. However, it prevents
asynchronous runtime resume of devices from working during system-wide
PM transitions, which is confusing because synchronous runtime resume
is not prevented at the same time, and it also sometimes turns out to
be problematic.
For example, it has been reported that blk_queue_enter() may deadlock
during a system suspend transition because of the pm_request_resume()
usage in it [1]. It may also deadlock during a system resume transition
in a similar way. That happens because the asynchronous runtime resume
of the given device is not processed due to the freezing of the runtime
PM workqueue. While it may be better to address this particular issue
in the block layer, the very presence of it means that similar problems
may be expected to occur elsewhere.
For this reason, remove the WQ_FREEZABLE flag from the runtime PM
workqueue and make device_suspend_late() use the generic variant of
pm_runtime_disable() that will carry out runtime PM of the device
synchronously if there is pending resume work for it.
Also update the comment before the pm_runtime_disable() call in
device_suspend_late(), to document the fact that the runtime PM
should not be expected to work for the device until the end of
device_resume_early(), and update the related documentation.
This change may, even though it is not expected to, uncover some
latent issues related to queuing up asynchronous runtime resume
work items during system suspend or hibernation. However, they
should be limited to the interference between runtime resume and
system-wide PM callbacks in the cases when device drivers start
to handle system-wide PM before disabling runtime PM as described
above.
Link: https://lore.kernel.org/linux-pm/20251126101636.205505-2-yang.yang@vivo.com/
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Link: https://patch.msgid.link/12794222.O9o76ZdvQC@rafael.j.wysocki
There's three layers of logic in the scheduler that
deal with 'has_blocked' (load) handling of the NOHZ code:
(1) nohz.has_blocked,
(2) rq->has_blocked_load, deal with NOHZ idle balancing,
(3) and cfs_rq_has_blocked(), which is part of the layer
that is passing the SMP load-balancing signal to the
NOHZ layers.
The 'has_blocked' and 'has_blocked_load' names are used
in a mixed fashion, sometimes within the same function.
Standardize on 'has_blocked_load' to make it all easy
to read and easy to grep.
No change in functionality.
Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Link: https://patch.msgid.link/aS6yvxyc3JfMxxQW@gmail.com
We have to be careful with vruntime comparisons and subtraction,
due to the possibility of wrapping, so we have macros like:
#define vruntime_gt(field, lse, rse) ({ (s64)((lse)->field - (rse)->field) > 0; })
Which is used like this:
if (vruntime_gt(min_vruntime, se, rse))
se->min_vruntime = rse->min_vruntime;
Replace this with an easier to read pattern that uses the regular
arithmetics operators:
if (vruntime_cmp(se->min_vruntime, ">", rse->min_vruntime))
se->min_vruntime = rse->min_vruntime;
Also replace vruntime subtractions with vruntime_op():
- delta = (s64)(sea->vruntime - seb->vruntime) +
- (s64)(cfs_rqb->zero_vruntime_fi - cfs_rqa->zero_vruntime_fi);
+ delta = vruntime_op(sea->vruntime, "-", seb->vruntime) +
+ vruntime_op(cfs_rqb->zero_vruntime_fi, "-", cfs_rqa->zero_vruntime_fi);
In the vruntime_cmp() and vruntime_op() macros use Use __builtin_strcmp(),
because of __HAVE_ARCH_STRCMP might turn off the compiler optimizations
we rely on here to catch usage bugs.
No change in functionality.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The ::avg_vruntime field is a misnomer: it says it's an
'average vruntime', but in reality it's the momentary sum
of the weighted vruntimes of all queued tasks, which is
at least a division away from being an average.
This is clear from comments about the math of fair scheduling:
* \Sum (v_i - v0) * w_i := cfs_rq->avg_vruntime
This confusion is increased by the cfs_avg_vruntime() function,
which does perform the division and returns a true average.
The sum of all weighted vruntimes should be named thusly,
so rename the field to ::sum_w_vruntime. (As arguably
::sum_weighted_vruntime would be a bit of a mouthful.)
Understanding the scheduler is hard enough already, without
extra layers of obfuscated naming. ;-)
Also rename related helper functions:
sum_vruntime_add() => sum_w_vruntime_add()
sum_vruntime_sub() => sum_w_vruntime_sub()
sum_vruntime_update() => sum_w_vruntime_update()
With the notable exception of cfs_avg_vruntime(), which
was named accurately.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://patch.msgid.link/20251201064647.1851919-7-mingo@kernel.org
The ::avg_load field is a long-standing misnomer: it says it's an
'average load', but in reality it's the momentary sum of the load
of all currently runnable tasks. We'd have to also perform a
division by nr_running (or use time-decay) to arrive at any sort
of average value.
This is clear from comments about the math of fair scheduling:
* \Sum w_i := cfs_rq->avg_load
The sum of all weights is ... the sum of all weights, not
the average of all weights.
To make it doubly confusing, there's also an ::avg_load
in the load-balancing struct sg_lb_stats, which *is* a
true average.
The second part of the field's name is a minor misnomer
as well: it says 'load', and it is indeed a load_weight
structure as it shares code with the load-balancer - but
it's only in an SMP load-balancing context where
load = weight, in the fair scheduling context the primary
purpose is the weighting of different nice levels.
So rename the field to ::sum_weight instead, which makes
the terminology of the EEVDF math match up with our
implementation of it:
* \Sum w_i := cfs_rq->sum_weight
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://patch.msgid.link/20251201064647.1851919-6-mingo@kernel.org
Join two identical #ifdef blocks:
#ifdef CONFIG_FAIR_GROUP_SCHED
...
#endif
#ifdef CONFIG_FAIR_GROUP_SCHED
...
#endif
Also mark nested #ifdef blocks in the usual fashion, to make
it more apparent where in a nested hierarchy of #ifdefs we
are at a glance.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Link: https://patch.msgid.link/20251201064647.1851919-2-mingo@kernel.org
Add some checks to the sched_change pattern to validate assumptions
around changing classes.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://patch.msgid.link/20251127154725.771691954@infradead.org
The task_tick_fair() function does:
- update the hierarchical runtimes
- drive NUMA-balancing
- update load-balance statistics
- drive force-idle preemption
All but the very first can be limited to the periodic tick. Let hrtick
only update accounting and drive preemption, not load-balancing and
other bits.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://patch.msgid.link/20250918080205.563385766@infradead.org
With fair switched to rcu_dereference_all() validation, having IRQ or
preemption disabled is sufficient, remove the rcu_read_lock()
clutter.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://patch.msgid.link/20251127154725.647502625@infradead.org
With the {rcu,sched,bh} RCU flavours being unified, it doesn't really
make sense to check for just the rcu one. Switch to the _all family of
verification which includes all 3 of the listed flavours.
Notably, this will enable us to remove some superfluous
rcu_read_lock() regions when we know they are inside preempt/IRQ
disabled regions.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Remove check from the name for being surplus to requirements.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
While poking at this code recently I noted we do a pointless
unlock+lock cycle in sched_balance_newidle(). We drop the rq->lock (so
we can balance) but then instantly grab the same rq->lock again in
sched_balance_update_blocked_averages().
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://patch.msgid.link/20251127154725.532469061@infradead.org
Nine (and a half) instances of the same pattern is just silly, fold the lot.
Notably, the half instance in enqueue_load_avg() is right after setting
cfs_rq->avg.load_sum to cfs_rq->avg.load_avg * get_pelt_divider(&cfs_rq->avg).
Since get_pelt_divisor() >= PELT_MIN_DIVIDER, this ends up being a no-op
change.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://patch.msgid.link/20251127154725.413564507@infradead.org
Commit af65320948 ("bpf: Bump iter seq size to support BTF
representation of large data structures") increased the fixed buffer
size from PAGE_SIZE to PAGE_SIZE << 3, but the docs for the function
didn't get updated at the same time. Update them.
Signed-off-by: T.J. Mercier <tjmercier@google.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20251207091005.2829703-1-tjmercier@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmk74mgRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1jS2g/+NPDw1R7yAPHlTaZO6CduYSENXzTOjmG3
BxdWERuta3Z78CNJZ5mtbrL1mBccG29oN+EZFGrKbDuxrR6SoglWUr1x1pYJRDyp
h0DvBfzewXcVwOCiBQpqVzuQIzPqsF1IfdDP6eIGnkhnSAEi6ObIZeAbzBzZRovD
dccU5e0LAIffTmAyKFYIqD4Fu90bzwJ3dAE1b0tP4mnm+jFGitdd8vb8mowiftOk
Lij8hsy8bg2A5VThIlKow0ya4uc6Bb3ZGj6aiMHoGMPbB2pu3o8bNH/K9d1MsKIe
AdZ6Y8mFriYYk3GEI3aWaoRkEXCZFX2IcOUS++yqvrOb88iJpW7jujgRIRvX0Px4
5xgHomQwW9XQKJgNYU/xLgpQSvBH4IzGDX5Uf8oU+hqcjYJBJjy21Dk9JGJymjv/
/jTq11Bd9xqzyv9W2Ysg2CfOA29+jGe8XlZU9ucOKE1bz09D+8R+H8X/LZtldjMa
Xbr+7HG9AD13gLJW+PfRQbJR5CwoCvh9dPMzWtFLykxBYT/mQK5gs4OQDaG+JFsz
8ov+F28O1KBCtcK+v2M/2mQlo51y6RI/t5cw1sVSosgORjHPIVpPjbbTGi0r4MKw
PHBGdw0mb1dgsXMRNnCNwvfxZP/S1TnmbIkaT4THC0PqJHOloyrsVbXGD1u6F9In
Gaodcvn7MbA=
=jbkd
-----END PGP SIGNATURE-----
Merge tag 'smp-urgent-2025-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull CPU hotplug fix from Ingo Molnar:
- Fix CPU hotplug callbacks to disable interrupts on UP kernels
* tag 'smp-urgent-2025-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
cpu: Make atomic hotplug callbacks run with interrupts disabled on UP
individual changelogs for details.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaTsgBwAKCRDdBJ7gKXxA
jkSIAP4jD66nEC2QyKTiv9XvXi8rpKz6RGAHNZSam0ucI5WKswEAoflmlqsoD/Kk
sN3YLwLztzCJIYU7tT3IRaPK0irwDAo=
=nkux
-----END PGP SIGNATURE-----
Merge tag 'mm-nonmm-stable-2025-12-11-11-47' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc updates from Andrew Morton:
"There are no significant series in this small merge. Please see the
individual changelogs for details"
[ Editor's note: it's mainly ocfs2 and a couple of random fixes ]
* tag 'mm-nonmm-stable-2025-12-11-11-47' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm: memfd_luo: add CONFIG_SHMEM dependency
mm: shmem: avoid build warning for CONFIG_SHMEM=n
ocfs2: fix memory leak in ocfs2_merge_rec_left()
ocfs2: invalidate inode if i_mode is zero after block read
ocfs2: avoid -Wflex-array-member-not-at-end warning
ocfs2: convert remaining read-only checks to ocfs2_emergency_state
ocfs2: add ocfs2_emergency_state helper and apply to setattr
checkpatch: add uninitialized pointer with __free attribute check
args: fix documentation to reflect the correct numbers
ocfs2: fix kernel BUG in ocfs2_find_victim_chain
liveupdate: luo_core: fix redundant bound check in luo_ioctl()
ocfs2: validate inline xattr size and entry count in ocfs2_xattr_ibody_list
fs/fat: remove unnecessary wrapper fat_max_cache()
ocfs2: replace deprecated strcpy with strscpy
ocfs2: check tl_used after reading it from trancate log inode
liveupdate: luo_file: don't use invalid list iterator
Chris reported that the recent affinity management changes result in
overwriting the already initialized thread flags.
Use set_bit() to set the affinity bit instead of assigning the bit value to
the flags.
Fixes: 801afdfbfc ("genirq: Fix interrupt threads affinity vs. cpuset isolated partitions")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://patch.msgid.link/87ecp0e4cf.ffs@tglx
Closes: https://lore.kernel.org/all/20251212014848.3509622-1-clm@meta.com
move_local_task_to_local_dsq() is used when moving a task from a non-local
DSQ to a local DSQ on the same CPU. It directly manipulates the local DSQ
without going through dispatch_enqueue() and was missing the post-enqueue
handling that triggers preemption when SCX_ENQ_PREEMPT is set or the idle
task is running.
The function is used by move_task_between_dsqs() which backs
scx_bpf_dsq_move() and may be called while the CPU is busy.
Add local_dsq_post_enq() call to move_local_task_to_local_dsq(). As the
dispatch path doesn't need post-enqueue handling, add SCX_RQ_IN_BALANCE
early exit to keep consume_dispatch_q() behavior unchanged and avoid
triggering unnecessary resched when scx_bpf_dsq_move() is used from the
dispatch path.
Fixes: 4c30f5ce4f ("sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()")
Cc: stable@vger.kernel.org # v6.12+
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Factor out local_dsq_post_enq() which performs post-enqueue handling for
local DSQs - triggering resched_curr() if SCX_ENQ_PREEMPT is specified or if
the current CPU is idle. No functional change.
This will be used by the next patch to fix move_local_task_to_local_dsq().
Cc: stable@vger.kernel.org # v6.12+
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
scx_enable() calls scx_bypass(true) to initialize in bypass mode and then
scx_bypass(false) on success to exit. If scx_enable() fails during task
initialization - e.g. scx_cgroup_init() or scx_init_task() returns an error -
it jumps to err_disable while bypass is still active. scx_disable_workfn()
then calls scx_bypass(true/false) for its own bypass, leaving the bypass depth
at 1 instead of 0. This causes the system to remain permanently in bypass mode
after a failed scx_enable().
Failures after task initialization is complete - e.g. scx_tryset_enable_state()
at the end - already call scx_bypass(false) before reaching the error path and
are not affected. This only affects a subset of failure modes.
Fix it by tracking whether scx_enable() called scx_bypass(true) in a bool and
having scx_disable_workfn() call an extra scx_bypass(false) to clear it. This
is a temporary measure as the bypass depth will be moved into the sched
instance, which will make this tracking unnecessary.
Fixes: 8c2090c504 ("sched_ext: Initialize in bypass mode")
Cc: stable@vger.kernel.org # v6.12+
Reported-by: Chris Mason <clm@meta.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Link: https://lore.kernel.org/stable/286e6f7787a81239e1ce2989b52391ce%40kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
The new memfd code fails to link without SHMEM:
aarch64-linux-ld: mm/memfd_luo.o: in function `memfd_luo_retrieve_folios':
memfd_luo.c:(.text.memfd_luo_retrieve_folios+0xdc): undefined reference to `shmem_add_to_page_cache'
memfd_luo.c:(.text.memfd_luo_retrieve_folios+0x11c): undefined reference to `shmem_inode_acct_blocks'
memfd_luo.c:(.text.memfd_luo_retrieve_folios+0x134): undefined reference to `shmem_recalc_inode'
Add a Kconfig dependency to disallow that configuration.
Link: https://lkml.kernel.org/r/20251204100203.1034394-1-arnd@kernel.org
Fixes: b3749f174d ("mm: memfd_luo: allow preserving memfd")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The kernel test robot reported a Smatch warning:
kernel/liveupdate/luo_core.c:402 luo_ioctl() warn: unsigned 'nr' is
never less than zero.
This occurs because 'nr' is unsigned and LIVEUPDATE_CMD_BASE is currently
defined as 0, making the check (nr < LIVEUPDATE_CMD_BASE) always false.
Remove the explicit lower bound check. The logic remains correct because
'nr' is unsigned; if nr is less than LIVEUPDATE_CMD_BASE, the expression
(nr - LIVEUPDATE_CMD_BASE) will wrap around to a large positive value.
This will inevitably be larger than ARRAY_SIZE(luo_ioctl_ops) and be
caught by the upper bound check.
Link: https://lkml.kernel.org/r/20251130010919.1488230-1-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202511280300.6pvBmXUS-lkp@intel.com/
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: David Matlack <dmatlack@google.com>
Cc: Pratyush Yadav <pratyush@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
If we exit a list_for_each_entry() without hitting a break then the list
iterator points to an offset from the list_head. It's a non-NULL but
invalid pointer and dereferencing it isn't allowed.
Introduce a new "found" variable to test instead.
Link: https://lkml.kernel.org/r/aSlMc4SS09Re4_xn@stanley.mountain
Fixes: 3ee1d673194e ("liveupdate: luo_file: implement file systems callbacks")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/r/202511280420.y9O4fyhX-lkp@intel.com/
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Pratyush Yadav <pratyush@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
- Use the MSI parent domain API instead of the legacy API for setup and
teardown of PCI MSI IRQs
- Select POSIX_CPU_TIMERS_TASK_WORK now that VIRT_XFER_TO_GUEST_WORK has
been implemented for s390
- Fix a KVM bug which can lead to guest memory corruption
- Fix KASAN shadow memory mapping for hotplugged memory
- Minor bug fixes and improvements
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEECMNfWEw3SLnmiLkZIg7DeRspbsIFAmk5WtQACgkQIg7DeRsp
bsKOwRAAg7uNYB0nU5CYl/0GeExumHBxQT/B1Umxg8lCDf4rwAPwwSrV3Dpp04h5
8FpKH42x8sig1yKuqjrBgHTlRofmSlZOrtDI9xv/dtp8yJWkAaMY0Ec8sf1VwEEO
dgyEPH567ja+mWu5z74eFmxYSV5dsgcJ707DICTyDXB3DaIzmVH5D5f8yYbMs8eL
gJ4bAuRkzwV+hp1UXLaWWELDrTX2KpQ3EBPz/2SuS8e3KbmL7MuDRdMbTJ7oEb1+
fGGnm3qFhHszx5fGWUv/BrGsQidwZTe459JdQtKcQUsPkGqBjsbhmiq9ApomD2+7
MJFDGDVHt623iBNwDwygAPCrWrKoVUFlw5MaKfztvSiuKt19N6OrwN0R4VA8P46I
m9mVRSLeXTv4GzZcTmLsRb2cLJ3Z6co2HuDli4X1Jg3TTWBFzx0Jscp9dB1QKmoI
EXBX44LoL5bUZeh5cihqGC5YBuXeaYyMnnsP56eJhq56QK9r4WDMcvGezBbzbkHE
I3GjmFwEUr3U9hSx4xmUGh6pry5PLkuDch5So/vh1miAdHjZFFthXHK2s5UI+HUl
OGjNxI9q2WzX4bb1C090zVljvYAsP6YzKnk7Ks76+0mEBvykTsGx3Ms4oS1L0k5+
Z8R2J0BLfcBQU7pL0QoYE/fO2J5wo+atp/cBLwc2hd+G9YEvdTs=
=Nbs5
-----END PGP SIGNATURE-----
Merge tag 's390-6.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull more s390 updates from Heiko Carstens:
- Use the MSI parent domain API instead of the legacy API for setup and
teardown of PCI MSI IRQs
- Select POSIX_CPU_TIMERS_TASK_WORK now that VIRT_XFER_TO_GUEST_WORK
has been implemented for s390
- Fix a KVM bug which can lead to guest memory corruption
- Fix KASAN shadow memory mapping for hotplugged memory
- Minor bug fixes and improvements
* tag 's390-6.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/bug: Add missing alignment
s390/bug: Add missing CONFIG_BUG ifdef again
KVM: s390: Fix gmap_helper_zap_one_page() again
s390/pci: Migrate s390 IRQ logic to IRQ domain API
genirq: Change hwirq parameter to irq_hw_number_t
s390: Select POSIX_CPU_TIMERS_TASK_WORK
s390: Unmap early KASAN shadow on memory offlining
s390/vmem: Support 2G page splitting for KASAN shadow freeing
s390/boot: Use entire page for PTEs
s390/vmur: Use scnprintf() instead of sprintf()
- last minute fix for missing parenthesis in the recently merged code
(Hans de Goede) and removal of the excessive, non-fatal warnings (Dave
Kleikamp)
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSrngzkoBtlA8uaaJ+Jp1EFxbsSRAUCaTmllwAKCRCJp1EFxbsS
RNDxAQCZJ9/D9nJ+hbC2oVPyoenmhSfMqc9ZniiR8/z8UqBfqgEAsWWayVyJXiwp
aIWXvW6lnUGzgNZ64ZKeNWvKK3uJOgg=
=9lhd
-----END PGP SIGNATURE-----
Merge tag 'dma-mapping-6.19-2025-12-10' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux
Pull dma-mapping fixes from Marek Szyprowski:
- last minute fix for missing parenthesis in recently merged code (Hans
de Goede)
- removal of excessive, non-fatal warnings (Dave Kleikamp)
* tag 'dma-mapping-6.19-2025-12-10' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
dma-mapping: Fix DMA_BIT_MASK() macro being broken
dma/pool: eliminate alloc_pages warning in atomic_pool_expand
Commit 37cce22dbd ("bpf: verifier: Refactor helper access type
tracking") started distinguishing read vs write accesses performed by
helpers.
The second argument of bpf_d_path() is a pointer to a buffer that the
helper fills with the resulting path. However, its prototype currently
uses ARG_PTR_TO_MEM without MEM_WRITE.
Before 37cce22dbd, helper accesses were conservatively treated as
potential writes, so this mismatch did not cause issues. Since that
commit, the verifier may incorrectly assume that the buffer contents
are unchanged across the helper call and base its optimizations on this
wrong assumption. This can lead to misbehaviour in BPF programs that
read back the buffer, such as prefix comparisons on the returned path.
Fix this by marking the second argument of bpf_d_path() as
ARG_PTR_TO_MEM | MEM_WRITE so that the verifier correctly models the
write to the caller-provided buffer.
Fixes: 37cce22dbd ("bpf: verifier: Refactor helper access type tracking")
Co-developed-by: Zesen Liu <ftyg@live.com>
Signed-off-by: Zesen Liu <ftyg@live.com>
Co-developed-by: Peili Gao <gplhust955@gmail.com>
Signed-off-by: Peili Gao <gplhust955@gmail.com>
Co-developed-by: Haoran Ni <haoran.ni.cs@gmail.com>
Signed-off-by: Haoran Ni <haoran.ni.cs@gmail.com>
Signed-off-by: Shuran Liu <electronlsr@gmail.com>
Reviewed-by: Matt Bobrowski <mattbobrowski@google.com>
Link: https://lore.kernel.org/r/20251206141210.3148-2-electronlsr@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
- Standardize on ktime_t in restart_block::time as well
(Thomas Weißschuh)
- Futex selftests:
- Add robust list testcases (André Almeida)
- Formatting fixes/cleanups (Carlos Llamas)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmk5J9QRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1gimQ//VhPIVYqioY/opLXBWFAiyxWLF8RbzbsD
URC4ppVqMa0d8np4VaoaNGTCyPOvhC5m/5q14BNgNhGfbvHOrx1W5cJ122v2vroC
aNwBkapt655lHwgax4vLgM7jMPW1do9hoyqtjsUIQeKMDlyZZW22gY+ExRTPJZTm
lP6PreWR74QTTKo5pTjNCoWSaLEmuoD5rbo52SkiTYHwBLL+Tqubl92mVFFUbE+Q
9n/c/zDeRSaixKr64TvDAijGyjrH8XFhOoMyyOGCiiWyrXfjDPde0Cblt6OkBQeK
/4/9uFo2TX3vFxQ2Zhj21gqMwQ84cdPvC86GkPh3FoxRNd66gArIpC14ME7rRn0V
aDKmz+c8QhFdS9gp4+Gw2OkuzYylLl2RSoSjhz/ndrZh7OLn64cA9KjfhRscCfjD
mcFMVFcjyK8BlgjXyTltC00tLftf6pyO7bqzO4kaYYbpls687ztfZ0mUSE60gLS4
WuCnntD+Q2IADV0r5xU9ps1a37eaqiHfgtjON2wznJeyj46PbpkWxSpdTIogTRoS
5JdLv+w1LdDMHaUuEPOgYYd6L69X0HpDauHotyRlGY+SoSoCoX6DbSXTg7CSqlqY
VHeuLj2OKnELJpxLEo16pcW6qTd+BM/2jcG91H7VJxbb2sJnngT3XtgJagXE5KYR
IYjcKq6DWig=
=9KQQ
-----END PGP SIGNATURE-----
Merge tag 'locking-futex-2025-12-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull futex updates from Ingo Molnar:
- Standardize on ktime_t in restart_block::time as well (Thomas
Weißschuh)
- Futex selftests:
- Add robust list testcases (André Almeida)
- Formatting fixes/cleanups (Carlos Llamas)
* tag 'locking-futex-2025-12-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
futex: Store time as ktime_t in restart block
selftests/futex: Create test for robust list
selftests/futex: Skip tests if shmget unsupported
selftests/futex: Add newline to ksft_exit_fail_msg()
selftests/futex: Remove unused test_futex_mpol()
This patch improves the verifier to correctly compute bounds for
sign extension compiler pattern composed of left shift by 32bits
followed by a sign right shift by 32bits. Pattern in the verifier was
limitted to positive value bounds and would reset bound computation for
negative values. New code allows both positive and negative values for
sign extension without compromising bound computation and verifier to
pass.
This change is required by GCC which generate such pattern, and was
detected in the context of systemd, as described in the following GCC
bugzilla: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119731
Three new tests were added in verifier_subreg.c.
Signed-off-by: Cupertino Miranda <cupertino.miranda@oracle.com>
Signed-off-by: Andrew Pinski <andrew.pinski@oss.qualcomm.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Cc: David Faust <david.faust@oracle.com>
Cc: Jose Marchesi <jose.marchesi@oracle.com>
Cc: Elena Zannoni <elena.zannoni@oracle.com>
Link: https://lore.kernel.org/r/20251202180220.11128-2-cupertino.miranda@oracle.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
After commit 9216477449 ("bpf: cpumap: Add the possibility to attach
an eBPF program to cpumap"), __cpu_map_entry_alloc() may fail with
errors other than -ENOMEM, such as -EBADF or -EINVAL.
However, __cpu_map_entry_alloc() returns NULL on all failures, and
cpu_map_update_elem() unconditionally converts this NULL into -ENOMEM.
As a result, user space always receives -ENOMEM regardless of the actual
underlying error.
Examples of unexpected behavior:
- Nonexistent fd : -ENOMEM (should be -EBADF)
- Non-BPF fd : -ENOMEM (should be -EINVAL)
- Bad attach type : -ENOMEM (should be -EINVAL)
Change __cpu_map_entry_alloc() to return ERR_PTR(err) instead of NULL
and have cpu_map_update_elem() propagate this error.
Signed-off-by: Kohei Enju <enjuk@amazon.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20251208131449.73036-2-enjuk@amazon.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
If there is a large number (hundreds) of dmabufs allocated, the text
output generated from dmabuf_iter_seq_show can exceed common user buffer
sizes (e.g. PAGE_SIZE) necessitating multiple start/stop cycles to
iterate through all dmabufs. However the dmabuf iterator currently
returns NULL in dmabuf_iter_seq_start for all non-zero pos values, which
results in the truncation of the output before all dmabufs are handled.
After dma_buf_iter_begin / dma_buf_iter_next, the refcount of the buffer
is elevated so that the BPF iterator program can run without holding any
locks. When a stop occurs, instead of immediately dropping the reference
on the buffer, stash a pointer to the buffer in seq->priv until
either start is called or the iterator is released. This also enables
the resumption of iteration without first walking through the list of
dmabufs based on the pos value.
Fixes: 76ea955349 ("bpf: Add dmabuf iterator")
Signed-off-by: T.J. Mercier <tjmercier@google.com>
Link: https://lore.kernel.org/r/20251204000348.1413593-1-tjmercier@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Introduce a bpf_has_frame_pointer() helper that unwinders can call to
determine whether a given instruction pointer is within the valid frame
pointer region of a BPF JIT program or trampoline (i.e., after the
prologue, before the epilogue).
This will enable livepatch (with the ORC unwinder) to reliably unwind
through BPF JIT frames.
Acked-by: Song Liu <song@kernel.org>
Acked-and-tested-by: Andrey Grodzovsky <andrey.grodzovsky@crowdstrike.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/r/fd2bc5b4e261a680774b28f6100509fd5ebad2f0.1764818927.git.jpoimboe@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
On SMP systems the CPU hotplug callbacks in the "starting" range are
invoked while the CPU is brought up and interrupts are still
disabled. Callbacks which are added later are invoked via the
hotplug-thread on the target CPU and interrupts are explicitly disabled.
In the UP case callbacks which are added later are invoked directly without
the thread indirection. This is in principle okay since there is just one
CPU but those callbacks are invoked with interrupt disabled code. That's
incorrect as those callbacks assume interrupt disabled context.
Disable interrupts before invoking the callbacks on UP if the state is
atomic and interrupts are expected to be disabled. The "save" part is
required because this is also invoked early in the boot process while
interrupts are disabled and must not be enabled prematurely.
Fixes: 06ddd17521 ("sched/smp: Always define is_percpu_thread() and scheduler_ipi()")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/20251127144723.ev9DuXXR@linutronix.de
setup_percpu_irq() was forgotten when the percpu_devid infrastructure was
updated to deal with CPU affinities.
In order to keep ignoring users of this legacy API, provide sensible
defaults by setting the affinity to cpu_online_mask if none was provided by
the caller.
Fixes: bdf4e2ac29 ("genirq: Allow per-cpu interrupt sharing for non-overlapping affinities")
Reported-by: Daniel Thompson <danielt@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/20251205091814.3944205-1-maz@kernel.org
Closes: https://lore.kernel.org/r/aTFozefMQRg7lYxh@aspen.lan
For events with inherit_stat enabled, a "read" event will be generated
to collect per task event counts on task exit.
The call chain is as follows:
do_exit
-> perf_event_exit_task
-> perf_event_exit_task_context
-> perf_event_exit_event
-> perf_remove_from_context
-> perf_child_detach
-> sync_child_event
-> perf_event_read_event
However, the child event context detaches the task too early in
perf_event_exit_task_context, which causes sync_child_event to never
generate the read event in this case, since child_event->ctx->task is
always set to TASK_TOMBSTONE. Fix that by moving context lock section
backward to ensure ctx->task is not set to TASK_TOMBSTONE before
generating the read event.
Because perf_event_free_task calls perf_event_exit_task_context with
exit = false to tear down all child events from the context, and the
task never lived, accessing the task PID can lead to a use-after-free.
To fix that, let sync_child_event read task from argument and move the
call to the only place it should be triggered to avoid the effect of
setting ctx->task to TASK_TOMESTONE, and add a task parameter to
perf_event_exit_event to trigger the sync_child_event properly when
needed.
This bug can be reproduced by running "perf record -s" and attaching to
any program that generates perf events in its child tasks. If we check
the result with "perf report -T", the last line of the report will leave
an empty table like "# PID TID", which is expected to contain the
per-task event counts by design.
Fixes: ef54c1a476 ("perf: Rework perf_event_exit_event()")
Signed-off-by: Thaumy Cheng <thaumy.love@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Cc: James Clark <james.clark@linaro.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: linux-perf-users@vger.kernel.org
Link: https://patch.msgid.link/20251209041600.963586-1-thaumy.love@gmail.com
Recent commit 68e83f3472 ("tools: ynl-gen: add regeneration comment")
added a hint how to regenerate the code to the headers. Update
the new headers from this release cycle to also include it.
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20251207004740.1657799-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Make the rescuer process more work on the last pwq when there are no
more to rescue for the whole workqueue to help the regular workers in
case it is a temporary memory pressure relief and to reduce relapse.
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Previously, the rescuer scanned for all matching work items at once and
processed them within a single rescuer thread, which could cause one
blocking work item to stall all others.
Make the rescuer process work items one-by-one instead of slurping all
matches in a single pass.
Break the rescuer loop after finding and processing the first matching
work item, then restart the search to pick up the next. This gives
normal worker threads a chance to process other items which gives them
the opportunity to be processed instead of waiting on the rescuer's
queue and prevents a blocking work item from stalling the rest once
memory pressure is relieved.
Introduce a dummy cursor work item to avoid potentially O(N^2)
rescans of the work list. The marker records the resume position for
the next scan, eliminating redundant traversals.
Also introduce RESCUER_BATCH to control the maximum number of work items
the rescuer processes in each turn, and move on to other PWQs when the
limit is reached.
Cc: ying chen <yc1082463@gmail.com>
Reported-by: ying chen <yc1082463@gmail.com>
Fixes: e22bee782b ("workqueue: implement concurrency managed dynamic worker pool")
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Make send_mayday() operate on a PWQ directly instead of taking a work
item, so that rescuer_thread() now calls send_mayday(pwq) instead of
open-coding the mayday list manipulation.
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Commit 406100f3da ("cpuset: fix race between hotplug work and later CPU
offline") added a check for empty effective_cpus in partitions for cgroup
v2. However, this check did not account for remote partitions, which were
introduced later.
After commit 2125c0034c ("cgroup/cpuset: Make cpuset hotplug processing
synchronous"), cpuset hotplug handling is now synchronous. This eliminates
the race condition with subsequent CPU offline operations that the original
check aimed to fix.
Instead of extending the check to support remote partitions, this patch
removes all the redundant effective_cpus check. Additionally, it adds a
check and warning to verify that all generated sched domains consist of
active CPUs, preventing partition_sched_domains from being invoked with
offline CPUs.
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Use the new css_is_online() helper that has been introduced to check css
online state, instead of testing the CSS_ONLINE flag directly. This
improves readability and centralizes the state check logic.
No functional changes intended.
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Tejun Heo <tj@kernel.org>
On x86-64, this_cpu_cmpxchg() uses CMPXCHG without LOCK prefix which
means it is only safe for the local CPU and not for multiple CPUs.
Recently the commit 36df6e3dbd ("cgroup: make css_rstat_updated nmi
safe") make css_rstat_updated lockless and uses lockless list to allow
reentrancy. Since css_rstat_updated can invoked from process context,
IRQ and NMI, it uses this_cpu_cmpxchg() to select the winner which will
inset the lockless lnode into the global per-cpu lockless list.
However the commit missed one case where lockless node of a cgroup can
be accessed and modified by another CPU doing the flushing. Basically
llist_del_first_init() in css_process_update_tree().
On a cursory look, it can be questioned how css_process_update_tree()
can see a lockless node in global lockless list where the updater is at
this_cpu_cmpxchg() and before llist_add() call in css_rstat_updated().
This can indeed happen in the presence of IRQs/NMI.
Consider this scenario: Updater for cgroup stat C on CPU A in process
context is after llist_on_list() check and before this_cpu_cmpxchg() in
css_rstat_updated() where it get interrupted by IRQ/NMI. In the IRQ/NMI
context, a new updater calls css_rstat_updated() for same cgroup C and
successfully inserts rstatc_pcpu->lnode.
Now concurrently CPU B is running the flusher and it calls
llist_del_first_init() for CPU A and got rstatc_pcpu->lnode of cgroup C
which was added by the IRQ/NMI updater.
Now imagine CPU B calling init_llist_node() on cgroup C's
rstatc_pcpu->lnode of CPU A and on CPU A, the process context updater
calling this_cpu_cmpxchg(rstatc_pcpu->lnode) concurrently.
The CMPXCNG without LOCK on CPU A is not safe and thus we need LOCK
prefix.
In Meta's fleet running the kernel with the commit 36df6e3dbd, we are
observing on some machines the memcg stats are getting skewed by more
than the actual memory on the system. On close inspection, we noticed
that lockless node for a workload for specific CPU was in the bad state
and thus all the updates on that CPU for that cgroup was being lost.
To confirm if this skew was indeed due to this CMPXCHG without LOCK in
css_rstat_updated(), we created a repro (using AI) at [1] which shows
that CMPXCHG without LOCK creates almost the same lnode corruption as
seem in Meta's fleet and with LOCK CMPXCHG the issue does not
reproduces.
Link: http://lore.kernel.org/efiagdwmzfwpdzps74fvcwq3n4cs36q33ij7eebcpssactv3zu@se4hqiwxcfxq [1]
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: stable@vger.kernel.org # v6.17+
Fixes: 36df6e3dbd ("cgroup: make css_rstat_updated nmi safe")
Signed-off-by: Tejun Heo <tj@kernel.org>
Early when trying to get sched_ext and proxy-exe working together,
I kept tripping over NULL ptr in put_prev_task_scx() on the line:
if (sched_class_above(&ext_sched_class, next->sched_class)) {
Which was due to put_prev_task() passes a NULL next, calling:
prev->sched_class->put_prev_task(rq, prev, NULL);
put_prev_task_scx() already guards for a NULL next in the
switch_class case, but doesn't seem to have a guard for
sched_class_above() check.
I can't say I understand why this doesn't trip usually without
proxy-exec. And in newer kernels there are way fewer
put_prev_task(), and I can't easily reproduce the issue now
even with proxy-exec.
But we still have one put_prev_task() call left in core.c that
seems like it could trip this, so I wanted to send this out for
consideration.
tj: put_prev_task() can be called with NULL @next; however, when @p is
queued, that doesn't happen, so this condition shouldn't currently be
triggerable. The connection isn't straightforward or necessarily reliable,
so add the NULL check even if it can't currently be triggered.
Link: http://lkml.kernel.org/r/20251206022218.1541878-1-jstultz@google.com
Signed-off-by: John Stultz <jstultz@google.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
atomic_pool_expand iteratively tries the allocation while decrementing
the page order. There is no need to issue a warning if an attempted
allocation fails.
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Fixes: d7e673ec2c ("dma-pool: Only allocate from CMA when in same memory zone")
[mszyprow: fixed typo]
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/20251202152810.142370-1-dave.kleikamp@oracle.com
The irqdomain implementation internally represents hardware IRQs as
irq_hw_number_t, which is defined as unsigned long int. When providing
an irq_hw_number_t to the generic_handle_domain() functions that expect
and unsigned int hwirq, this can lead to a loss of information. Change
the hwirq parameter to irq_hw_number_t to support the full range of
hwirqs.
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
Reviewed-by: Farhan Ali <alifm@linux.ibm.com>
Signed-off-by: Tobias Schumacher <ts@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
- The 6 patch series "panic: sys_info: Refactor and fix a potential
issue" from Andy Shevchenko fixes a build issue and does some cleanup in
ib/sys_info.c.
- The 9 patch series "Implement mul_u64_u64_div_u64_roundup()" from
David Laight enhances the 64-bit math code on behalf of a PWM driver and
beefs up the test module for these library functions.
- The 2 patch series "scripts/gdb/symbols: make BPF debug info available
to GDB" from Ilya Leoshkevich makes BPF symbol names, sizes, and line
numbers available to the GDB debugger.
- The 4 patch series "Enable hung_task and lockup cases to dump system
info on demand" from Feng Tang adds a sysctl which can be used to cause
additional info dumping when the hung-task and lockup detectors fire.
- The 6 patch series "lib/base64: add generic encoder/decoder, migrate
users" from Kuan-Wei Chiu adds a general base64 encoder/decoder to lib/
and migrates several users away from their private implementations.
- The 2 patch series "rbree: inline rb_first() and rb_last()" from Eric
Dumazet makes TCP a little faster.
- The 9 patch series "liveupdate: Rework KHO for in-kernel users" from
Pasha Tatashin reworks the KEXEC Handover interfaces in preparation for
Live Update Orchestrator (LUO), and possibly for other future clients.
- The 13 patch series "kho: simplify state machine and enable dynamic
updates" from Pasha Tatashin increases the flexibility of KEXEC
Handover. Also preparation for LUO.
- The 18 patch series "Live Update Orchestrator" from Pasha Tatashin is
a major new feature targeted at cloud environments. Quoting the [0/N]:
This series introduces the Live Update Orchestrator, a kernel subsystem
designed to facilitate live kernel updates using a kexec-based reboot.
This capability is critical for cloud environments, allowing hypervisors
to be updated with minimal downtime for running virtual machines. LUO
achieves this by preserving the state of selected resources, such as
memory, devices and their dependencies, across the kernel transition.
As a key feature, this series includes support for preserving memfd file
descriptors, which allows critical in-memory data, such as guest RAM or
any other large memory region, to be maintained in RAM across the kexec
reboot.
Mike Rappaport merits a mention here, for his extensive review and
testing work.
- The 3 patch series "kexec: reorganize kexec and kdump sysfs" from
Sourabh Jain moves the kexec and kdump sysfs entries from /sys/kernel/
to /sys/kernel/kexec/ and adds back-compatibility symlinks which can
hopefully be removed one day.
- The 2 patch series "kho: fixes for vmalloc restoration" from Mike
Rapoport fixes a BUG which was being hit during KHO restoration of
vmalloc() regions.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaTSAkQAKCRDdBJ7gKXxA
jrkiAP9QKfsRv46XZaM5raScjY1ayjP+gqb2rgt6BQ/gZvb2+wD/cPAYOR6BiX52
n0pVpQmG5P/KyOmpLztn96ejL4heKwQ=
=JY96
-----END PGP SIGNATURE-----
Merge tag 'mm-nonmm-stable-2025-12-06-11-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
- "panic: sys_info: Refactor and fix a potential issue" (Andy Shevchenko)
fixes a build issue and does some cleanup in ib/sys_info.c
- "Implement mul_u64_u64_div_u64_roundup()" (David Laight)
enhances the 64-bit math code on behalf of a PWM driver and beefs up
the test module for these library functions
- "scripts/gdb/symbols: make BPF debug info available to GDB" (Ilya Leoshkevich)
makes BPF symbol names, sizes, and line numbers available to the GDB
debugger
- "Enable hung_task and lockup cases to dump system info on demand" (Feng Tang)
adds a sysctl which can be used to cause additional info dumping when
the hung-task and lockup detectors fire
- "lib/base64: add generic encoder/decoder, migrate users" (Kuan-Wei Chiu)
adds a general base64 encoder/decoder to lib/ and migrates several
users away from their private implementations
- "rbree: inline rb_first() and rb_last()" (Eric Dumazet)
makes TCP a little faster
- "liveupdate: Rework KHO for in-kernel users" (Pasha Tatashin)
reworks the KEXEC Handover interfaces in preparation for Live Update
Orchestrator (LUO), and possibly for other future clients
- "kho: simplify state machine and enable dynamic updates" (Pasha Tatashin)
increases the flexibility of KEXEC Handover. Also preparation for LUO
- "Live Update Orchestrator" (Pasha Tatashin)
is a major new feature targeted at cloud environments. Quoting the
cover letter:
This series introduces the Live Update Orchestrator, a kernel
subsystem designed to facilitate live kernel updates using a
kexec-based reboot. This capability is critical for cloud
environments, allowing hypervisors to be updated with minimal
downtime for running virtual machines. LUO achieves this by
preserving the state of selected resources, such as memory,
devices and their dependencies, across the kernel transition.
As a key feature, this series includes support for preserving
memfd file descriptors, which allows critical in-memory data, such
as guest RAM or any other large memory region, to be maintained in
RAM across the kexec reboot.
Mike Rappaport merits a mention here, for his extensive review and
testing work.
- "kexec: reorganize kexec and kdump sysfs" (Sourabh Jain)
moves the kexec and kdump sysfs entries from /sys/kernel/ to
/sys/kernel/kexec/ and adds back-compatibility symlinks which can
hopefully be removed one day
- "kho: fixes for vmalloc restoration" (Mike Rapoport)
fixes a BUG which was being hit during KHO restoration of vmalloc()
regions
* tag 'mm-nonmm-stable-2025-12-06-11-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (139 commits)
calibrate: update header inclusion
Reinstate "resource: avoid unnecessary lookups in find_next_iomem_res()"
vmcoreinfo: track and log recoverable hardware errors
kho: fix restoring of contiguous ranges of order-0 pages
kho: kho_restore_vmalloc: fix initialization of pages array
MAINTAINERS: TPM DEVICE DRIVER: update the W-tag
init: replace simple_strtoul with kstrtoul to improve lpj_setup
KHO: fix boot failure due to kmemleak access to non-PRESENT pages
Documentation/ABI: new kexec and kdump sysfs interface
Documentation/ABI: mark old kexec sysfs deprecated
kexec: move sysfs entries to /sys/kernel/kexec
test_kho: always print restore status
kho: free chunks using free_page() instead of kfree()
selftests/liveupdate: add kexec test for multiple and empty sessions
selftests/liveupdate: add simple kexec-based selftest for LUO
selftests/liveupdate: add userspace API selftests
docs: add documentation for memfd preservation via LUO
mm: memfd_luo: allow preserving memfd
liveupdate: luo_file: add private argument to store runtime state
mm: shmem: export some functions to internal.h
...
- Fix accounting of stop_count in file release
On opening the trace file, if "pause-on-trace" option is set, it will
increment the stop_count. On file release, it checks if stop_count is set,
and if so it decrements it. Since this code was originally written, the
stop_count can be incremented by other use cases. This makes just checking
the stop_count not enough to know if it should be decremented.
Add a new iterator flag called "PAUSE" and have it set if the open
disables tracing and only decrement the stop_count if that flag is set on
close.
- Remove length field in trace_seq_printf() of print_synth_event()
When printing the synthetic event that has a static length array field,
the vsprintf() of the trace_seq_printf() triggered a "(efault)" in the
output. That's because the print_fmt replaced the "%.*s" with "%s" causing
the arguments to be off.
- Fix a bunch of typos
-----BEGIN PGP SIGNATURE-----
iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaTRsYBQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qrYpAQC3Qc5QMOlPjqGHXls/4IR4SBEAvsUi
VZx3PdknfYCe3AD9HoYGOtrDDhSJ1tQbsWP5ud2jatHwL0zGAl3legNp7ww=
=jlNP
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix accounting of stop_count in file release
On opening the trace file, if "pause-on-trace" option is set, it will
increment the stop_count. On file release, it checks if stop_count is
set, and if so it decrements it. Since this code was originally
written, the stop_count can be incremented by other use cases. This
makes just checking the stop_count not enough to know if it should be
decremented.
Add a new iterator flag called "PAUSE" and have it set if the open
disables tracing and only decrement the stop_count if that flag is
set on close.
- Remove length field in trace_seq_printf() of print_synth_event()
When printing the synthetic event that has a static length array
field, the vsprintf() of the trace_seq_printf() triggered a
"(efault)" in the output. That's because the print_fmt replaced the
"%.*s" with "%s" causing the arguments to be off.
- Fix a bunch of typos
* tag 'trace-v6.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Fix typo in trace_seq.c
tracing: Fix typo in trace_probe.c
tracing: Fix multiple typos in trace_osnoise.c
tracing: Fix multiple typos in trace_events_user.c
tracing: Fix typo in trace_events_trigger.c
tracing: Fix typo in trace_events_hist.c
tracing: Fix typo in trace_events_filter.c
tracing: Fix multiple typos in trace_events.c
tracing: Fix multiple typos in trace.c
tracing: Fix typo in ring_buffer_benchmark.c
tracing: Fix multiple typos in ring_buffer.c
tracing: Fix typo in fprobe.c
tracing: Fix typo in fpgraph.c
tracing: Fix fixed array of synthetic event
tracing: Fix enabling of tracing on file release
- Fix psi_dequeue() for Proxy Execution
- Fix hrtick() vs. scheduling context bug
- Fix unfairness caused by stalled tg_load_avg_contrib when the last task migrates out
- Fix whitespace noise in headers
- Remove a preempt-disable section in rt_mutex_setprio()
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmk0FhoRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1holg/8DQWTB5UcOSK8r4VR6xsDkxmnEA4RNJYg
YWMAUCJbeAP1qciX9QbPH6T8aKAtrx5d7aLqGfDj7u0sRWTtkz32FcqWiny56vox
Rc7UIHCL9n5r/ZMwy5sNN7Dxfdr2Eqmxyg5yaAT3VU3AkBQWzVfj8wQbzYL0FPy3
aw4kwaRB8NMTANdcyVi3hTIDXbLeNb8WCvUKmH0YDfEpeKBikzLBn+yvlkGB/9Wb
LZ3CzPUtLVdKyJ/W1VdJBokZ1DSUhMkLFFWXxFqt/5TgaXu5wYyCpUr0mvHSQDEd
PITEMjhZ+NVMQLSSe3od1qa3vIpNe1W/t0FvlXPHXmZ/lmqC3UXwZuO1gzzCz9Jk
7NluFcoFdSNPLlDEjzA1qls6YAaKHZ8FZfrWRjr2LnZZKZj1r6pfeVJjCWl6Lw+U
7aOH+Z4TZexgdjZo9qC+S7VJUvt5+P6SipVE/F9a4xd85kEWxGUhKhJuDD7b0Ksq
pe0dQcva7mW7/FpAZYdSycYTJ98fU4SiUeKp7Er4xX0mRLO7KVFPja9M5ZHvPGH3
m+i0ONknEDtavgFmg0Z2L2+bpad+cv18hwR94Jhuw7q5ZciwV9vaO/4hbGb4AlAV
/eJ4rHqPWhOTfe4iHDl82nRYeOxT2V5tX0Uu5oDAKhpAk9xNFkByQsl53gqdJT+Q
3zVRbx2LIA4=
=kr0U
-----END PGP SIGNATURE-----
Merge tag 'sched-urgent-2025-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
"Miscellaneous scheduler fixes/cleanups:
- Fix psi_dequeue() for Proxy Execution
- Fix hrtick() vs. scheduling context bug
- Fix unfairness caused by stalled tg_load_avg_contrib when the last
task migrates out
- Fix whitespace noise in headers
- Remove a preempt-disable section in rt_mutex_setprio()"
* tag 'sched-urgent-2025-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/core: Fix psi_dequeue() for Proxy Execution
sched/fair: Fix unfairness caused by stalled tg_load_avg_contrib when the last task migrates out
sched/rt: Remove a preempt-disable section in rt_mutex_setprio()
sched/hrtick: Fix hrtick() vs. scheduling context
sched/headers: Remove whitespace noise from kernel/sched/sched.h
allmodconfig builds, plus improve the quality of alternatives
instructions generated code and disassembly.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmk0FVoRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1iClg//dHY58dvrsp5Fzo10XgU99/kwzNEgl2b5
SMrSEbliTrehdpG4vBvig9tAMZurxOVIf6yDBEtV45XfD6w3tw6EFYpO1was9wTE
R/80Ze6BAEeao782xN3sCpakU1Ogwbxhe4jYFZKE/WVbP9ZaeCI8qeBj3RAuOQ9y
PCJzjD5fl9c2cAGDqCJEswxIptpP7eXoBo/V3Txf46M8/ffFcXdJbHN3HRBlszVs
5I9Wb2/vFmwJ4Yi4EO8H7KfzwaXA8wW/MJSDcM24P2/+o5iTqSLNd+rADFMW3XF2
/8b3uAy/6A6tT3ek1teNoM7qB9hRpM1pmpFwgjjTkjl8yamEp6P/W99qUN+UmfV+
NTiW9sz7ShhVTMCdALIljyjmji318crKYQBDulAHuEACpodcBg/GUGfuUcrjSRB/
C7PLatOpfMCODPRGPH4+8Wg8nnBGvOEjjODZBjAq2yU5aJnBeLPmbK2mtcaJtKi+
R0T2LIsNgmnEa4wRZbH8i4jXsgcbe6gD45Tx3qZpss7D4d9IyRWPO8v6GegFUpvh
dw8qBqhgi1FzryZ/5uwh5IzkVq+iXHqkPBsV9w7CVSFF1Kc5w1/l7MXsEjkc7Xe3
qMjc43qsN0H/7ngoIA7yp4m7q87gqJMzReIfeIF4pGVtoULGQ+drN0jjQE/SHiKS
/EM8IAAk0pU=
=2DKc
-----END PGP SIGNATURE-----
Merge tag 'objtool-urgent-2025-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull objtool fixes from Ingo Molnar:
"Address various objtool scalability bugs/inefficiencies exposed by
allmodconfig builds, plus improve the quality of alternatives
instructions generated code and disassembly"
* tag 'objtool-urgent-2025-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
objtool: Simplify .annotate_insn code generation output some more
objtool: Add more robust signal error handling, detect and warn about stack overflows
objtool: Remove newlines and tabs from annotation macros
objtool: Consolidate annotation macros
x86/asm: Remove ANNOTATE_DATA_SPECIAL usage
x86/alternative: Remove ANNOTATE_DATA_SPECIAL usage
objtool: Fix stack overflow in validate_branch()
- next part of DMA mapping API refactoring to physical addresses as the primary
interface instead of page+offset parameters; this time dma_map_ops callbacks
are converted to physical addresses, what in turn results also in some
simplification of architecture specific code (Leon Romanovsky and Jason
Gunthorpe)
- clarify that dma_map_benchmark is not a kernel self-test, but standalone
tool (Qinxin Xia)
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSrngzkoBtlA8uaaJ+Jp1EFxbsSRAUCaTLpeQAKCRCJp1EFxbsS
RKlAAQCo/gVslheoJ+h4Hk5oUjLfVQaiamJOlzxw12EVHVAs3AEA7GILIAL1GbGn
EpkgwjIuz/J/4aGhNbYO6J+C1qMAHww=
=aM6P
-----END PGP SIGNATURE-----
Merge tag 'dma-mapping-6.19-2025-12-05' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux
Pull dma-mapping updates from Marek Szyprowski:
- More DMA mapping API refactoring to physical addresses as the primary
interface instead of page+offset parameters.
This time dma_map_ops callbacks are converted to physical addresses,
what in turn results also in some simplification of architecture
specific code (Leon Romanovsky and Jason Gunthorpe)
- Clarify that dma_map_benchmark is not a kernel self-test, but
standalone tool (Qinxin Xia)
* tag 'dma-mapping-6.19-2025-12-05' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
dma-mapping: remove unused map_page callback
xen: swiotlb: Convert mapping routine to rely on physical address
x86: Use physical address for DMA mapping
sparc: Use physical address DMA mapping
powerpc: Convert to physical address DMA mapping
parisc: Convert DMA map_page to map_phys interface
MIPS/jazzdma: Provide physical address directly
alpha: Convert mapping routine to rely on physical address
dma-mapping: remove unused mapping resource callbacks
xen: swiotlb: Switch to physical address mapping callbacks
ARM: dma-mapping: Switch to physical address mapping callbacks
ARM: dma-mapping: Reduce struct page exposure in arch_sync_dma*()
dma-mapping: convert dummy ops to physical address mapping
dma-mapping: prepare dma_map_ops to conversion to physical address
tools/dma: move dma_map_benchmark from selftests to tools/dma
Currently, if the sleep flag is set, psi_dequeue() doesn't
change any of the psi_flags.
This is because psi_task_switch() will clear TSK_ONCPU as well
as other potential flags (TSK_RUNNING), and the assumption is
that a voluntary sleep always consists of a task being dequeued
followed shortly there after with a psi_sched_switch() call.
Proxy Execution changes this expectation, as mutex-blocked tasks
that would normally sleep stay on the runqueue. But in the case
where the mutex-owning task goes to sleep, or the owner is on a
remote cpu, we will then deactivate the blocked task shortly
after.
In that situation, the mutex-blocked task will have had its
TSK_ONCPU cleared when it was switched off the cpu, but it will
stay TSK_RUNNING. Then if we later dequeue it (as currently done
if we hit a case find_proxy_task() can't yet handle, such as the
case of the owner being on another rq or a sleeping owner)
psi_dequeue() won't change any state (leaving it TSK_RUNNING),
as it incorrectly expects a psi_task_switch() call to
immediately follow.
Later on when the task get woken/re-enqueued, and psi_flags are
set for TSK_RUNNING, we hit an error as the task is already
TSK_RUNNING:
psi: inconsistent task state! task=188:kworker/28:0 cpu=28 psi_flags=4 clear=0 set=4
To resolve this, extend the logic in psi_dequeue() so that
if the sleep flag is set, we also check if psi_flags have
TSK_ONCPU set (meaning the psi_task_switch is imminent) before
we do the shortcut return.
If TSK_ONCPU is not set, that means we've already switched away,
and this psi_dequeue call needs to clear the flags.
Fixes: be41bde4c3 ("sched: Add an initial sketch of the find_proxy_task() function")
Reported-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Haiyue Wang <haiyuewa@163.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://patch.msgid.link/20251205012721.756394-1-jstultz@google.com
Closes: https://lore.kernel.org/lkml/20251117185550.365156-1-kprateek.nayak@amd.com/
When a task is migrated out, there is a probability that the tg->load_avg
value will become abnormal. The reason is as follows:
1. Due to the 1ms update period limitation in update_tg_load_avg(), there
is a possibility that the reduced load_avg is not updated to tg->load_avg
when a task migrates out.
2. Even though __update_blocked_fair() traverses the leaf_cfs_rq_list and
calls update_tg_load_avg() for cfs_rqs that are not fully decayed, the key
function cfs_rq_is_decayed() does not check whether
cfs->tg_load_avg_contrib is null. Consequently, in some cases,
__update_blocked_fair() removes cfs_rqs whose avg.load_avg has not been
updated to tg->load_avg.
Add a check of cfs_rq->tg_load_avg_contrib in cfs_rq_is_decayed(),
which fixes the case (2.) mentioned above.
Fixes: 1528c661c2 ("sched/fair: Ratelimit update to tg->load_avg")
Signed-off-by: xupengbo <xupengbo@oppo.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Aaron Lu <ziqianlu@bytedance.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Aaron Lu <ziqianlu@bytedance.com>
Link: https://patch.msgid.link/20250827022208.14487-1-xupengbo@oppo.com
rt_mutex_setprio() has only one caller: rt_mutex_adjust_prio(). It
expects that task_struct::pi_lock and rt_mutex_base::wait_lock are held.
Both locks are raw_spinlock_t and are acquired with disabled interrupts.
Nevertheless rt_mutex_setprio() disables preemption while invoking
__balance_callbacks() and raw_spin_rq_unlock(). Even if one of the
balance callbacks unlocks the rq then it must not enable interrupts
because rt_mutex_base::wait_lock is still locked.
Therefore interrupts should remain disabled and disabling preemption is
not needed.
Commit 4c9a4bc89a ("sched: Allow balance callbacks for check_class_changed()")
adds a preempt-disable section to rt_mutex_setprio() and
__sched_setscheduler(). In __sched_setscheduler() the preemption is
disabled before rq is unlocked and interrupts enabled but I don't see
why it makes a difference in rt_mutex_setprio().
Remove the preempt_disable() section from rt_mutex_setprio().
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://patch.msgid.link/20251127155529.t_sTatE4@linutronix.de
The sched_class::task_tick() method is called on the donor
sched_class, and sched_tick() hands it rq->donor as argument,
which is consistent.
However, while hrtick() uses the donor sched_class, it then passes
rq->curr, which is inconsistent. Fix it.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: John Stultz <jstultz@google.com>
Link: https://patch.msgid.link/20250918080205.442967033@infradead.org
This is the first half of the driver changes:
- A treewide interface change to the "syscore" operations for
power management, as a preparation for future Tegra specific
changes.
- Reset controller updates with added drivers for LAN969x, eic770
and RZ/G3S SoCs.
- Protection of system controller registers on Renesas and Google SoCs,
to prevent trivially triggering a system crash from e.g. debugfs
access.
- soc_device identification updates on Nvidia, Exynos and Mediatek
- debugfs support in the ST STM32 firewall driver
- Minor updates for SoC drivers on AMD/Xilinx, Renesas, Allwinner, TI
- Cleanups for memory controller support on Nvidia and Renesas
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEo6/YBQwIrVS28WGKmmx57+YAGNkFAmky/8gACgkQmmx57+YA
GNlqohAApPTLM6Q4gf1cIcsTVaP0uxx9CBgupCGuT5ORrOMKBghVWjTOTSxeEAab
UQF465QwYUUu602GH34UmRaY9CKW2bMIsfmkgmxNB4Y4Qd7yCgQNJ/h/TnN0rBH+
qTeEsRH/hax4miSNsh0oOZfVkZkg+23VF02d1VL0CcaX7y4oT45RPBQugrNx/gNS
fHfVwgIq8vJ8WyrmM1h2nv1i1vgSzEy50B3kY674BBw83FcJTafNLvD7N5DSgD1H
/I/2xeyEpb+oL1VfeHcXZaX/jf04O+cmvSzBi+MOH1tI3MpdxJib1vEYBdggoOWN
K/FFGgsOY+DNmJPpSnPTTu8UpzksS8SxGBP7M9Q8roKZwA2c9wLotxySvjki5yv8
2zvabRdzbrSaoYwsH9QnZdQ2hVkJ9W8MESu8PevD3yMNuFUzledPDWW0N1SbGm78
0ZdB6NPdaBZYHMNMRdFhN8P275/Mx5e0XWN9oYMQqjPooH7YkyT7hJWz6ao2PCJP
8mDmnW1RzL+LWf7mJ25ZEtS+YjmKA/PVmogRrGurKCadvdxXqCF09KNljICHhmmu
t0KB4dqw02OXLPvBk21qCi0zL56w1JDgqtS8suFvDYo9sCceeAbAcmpyoUOFj2N+
Upn976tb4iqFrr9mFswpmCJWPpqJkU+A+KnKsIRPU7N4kSrP35I=
=HvlN
-----END PGP SIGNATURE-----
Merge tag 'soc-drivers-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull SoC driver updates from Arnd Bergmann:
"This is the first half of the driver changes:
- A treewide interface change to the "syscore" operations for power
management, as a preparation for future Tegra specific changes
- Reset controller updates with added drivers for LAN969x, eic770 and
RZ/G3S SoCs
- Protection of system controller registers on Renesas and Google
SoCs, to prevent trivially triggering a system crash from e.g.
debugfs access
- soc_device identification updates on Nvidia, Exynos and Mediatek
- debugfs support in the ST STM32 firewall driver
- Minor updates for SoC drivers on AMD/Xilinx, Renesas, Allwinner, TI
- Cleanups for memory controller support on Nvidia and Renesas"
* tag 'soc-drivers-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (114 commits)
memory: tegra186-emc: Fix missing put_bpmp
Documentation: reset: Remove reset_controller_add_lookup()
reset: fix BIT macro reference
reset: rzg2l-usbphy-ctrl: Fix a NULL vs IS_ERR() bug in probe
reset: th1520: Support reset controllers in more subsystems
reset: th1520: Prepare for supporting multiple controllers
dt-bindings: reset: thead,th1520-reset: Add controllers for more subsys
dt-bindings: reset: thead,th1520-reset: Remove non-VO-subsystem resets
reset: remove legacy reset lookup code
clk: davinci: psc: drop unused reset lookup
reset: rzg2l-usbphy-ctrl: Add support for RZ/G3S SoC
reset: rzg2l-usbphy-ctrl: Add support for USB PWRRDY
dt-bindings: reset: renesas,rzg2l-usbphy-ctrl: Document RZ/G3S support
reset: eswin: Add eic7700 reset driver
dt-bindings: reset: eswin: Documentation for eic7700 SoC
reset: sparx5: add LAN969x support
dt-bindings: reset: microchip: Add LAN969x support
soc: rockchip: grf: Add select correct PWM implementation on RK3368
soc/tegra: pmc: Add USB wake events for Tegra234
amba: tegra-ahb: Fix device leak on SMMU enable
...
Add a new BPF command BPF_PROG_ASSOC_STRUCT_OPS to allow associating
a BPF program with a struct_ops map. This command takes a file
descriptor of a struct_ops map and a BPF program and set
prog->aux->st_ops_assoc to the kdata of the struct_ops map.
The command does not accept a struct_ops program nor a non-struct_ops
map. Programs of a struct_ops map is automatically associated with the
map during map update. If a program is shared between two struct_ops
maps, prog->aux->st_ops_assoc will be poisoned to indicate that the
associated struct_ops is ambiguous. The pointer, once poisoned, cannot
be reset since we have lost track of associated struct_ops. For other
program types, the associated struct_ops map, once set, cannot be
changed later. This restriction may be lifted in the future if there is
a use case.
A kernel helper bpf_prog_get_assoc_struct_ops() can be used to retrieve
the associated struct_ops pointer. The returned pointer, if not NULL, is
guaranteed to be valid and point to a fully updated struct_ops struct.
For struct_ops program reused in multiple struct_ops map, the return
will be NULL.
prog->aux->st_ops_assoc is protected by bumping the refcount for
non-struct_ops programs and RCU for struct_ops programs. Since it would
be inefficient to track programs associated with a struct_ops map, every
non-struct_ops program will bump the refcount of the map to make sure
st_ops_assoc stays valid. For a struct_ops program, it is protected by
RCU as map_free will wait for an RCU grace period before disassociating
the program with the map. The helper must be called in BPF program
context or RCU read-side critical section.
struct_ops implementers should note that the struct_ops returned may not
be initialized nor attached yet. The struct_ops implementer will be
responsible for tracking and checking the state of the associated
struct_ops map if the use case expects an initialized or attached
struct_ops.
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/bpf/20251203233748.668365-3-ameryhung@gmail.com
Allow verifier to fixup kfuncs in kernel module to support kfuncs with
__prog arguments. Currently, special kfuncs and kfuncs with __prog
arguments are kernel kfuncs. Allowing kernel module kfuncs should not
affect existing kfunc fixup as kernel module kfuncs have BTF IDs greater
than kernel kfuncs' BTF IDs.
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20251203233748.668365-2-ameryhung@gmail.com
dentries of created objects in dcache (and undo it when removing those).
Reference is grabbed and not released, but it's not actually _stored_
anywhere. That works, but it's hard to follow and verify; among other
things, we have no way to tell _which_ of the increments is intended
to be an unpaired one. Worse, on removal we need to decide whether
the reference had already been dropped, which can be non-trivial if
that removal is on umount and we need to figure out if this dentry is
pinned due to e.g. unlink() not done. Usually that is handled by using
kill_litter_super() as ->kill_sb(), but there are open-coded special
cases of the same (consider e.g. /proc/self).
Things get simpler if we introduce a new dentry flag (DCACHE_PERSISTENT)
marking those "leaked" dentries. Having it set claims responsibility
for +1 in refcount.
The end result this series is aiming for:
* get these unbalanced dget() and dput() replaced with new primitives that
would, in addition to adjusting refcount, set and clear persistency flag.
* instead of having kill_litter_super() mess with removing the remaining
"leaked" references (e.g. for all tmpfs files that hadn't been removed
prior to umount), have the regular shrink_dcache_for_umount() strip
DCACHE_PERSISTENT of all dentries, dropping the corresponding
reference if it had been set. After that kill_litter_super() becomes
an equivalent of kill_anon_super().
Doing that in a single step is not feasible - it would affect too many places
in too many filesystems. It has to be split into a series.
This work has really started early in 2024; quite a few preliminary pieces
have already gone into mainline. This chunk is finally getting to the
meat of that stuff - infrastructure and most of the conversions to it.
Some pieces are still sitting in the local branches, but the bulk of
that stuff is here.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCaTEq1wAKCRBZ7Krx/gZQ
643uAQC1rRslhw5l7OjxEpIYbGG4M+QaadN4Nf5Sr2SuTRaPJQD/W4oj/u4C2eCw
Dd3q071tqyvm/PXNgN2EEnIaxlFUlwc=
=rKq+
-----END PGP SIGNATURE-----
Merge tag 'pull-persistency' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull persistent dentry infrastructure and conversion from Al Viro:
"Some filesystems use a kinda-sorta controlled dentry refcount leak to
pin dentries of created objects in dcache (and undo it when removing
those). A reference is grabbed and not released, but it's not actually
_stored_ anywhere.
That works, but it's hard to follow and verify; among other things, we
have no way to tell _which_ of the increments is intended to be an
unpaired one. Worse, on removal we need to decide whether the
reference had already been dropped, which can be non-trivial if that
removal is on umount and we need to figure out if this dentry is
pinned due to e.g. unlink() not done. Usually that is handled by using
kill_litter_super() as ->kill_sb(), but there are open-coded special
cases of the same (consider e.g. /proc/self).
Things get simpler if we introduce a new dentry flag
(DCACHE_PERSISTENT) marking those "leaked" dentries. Having it set
claims responsibility for +1 in refcount.
The end result this series is aiming for:
- get these unbalanced dget() and dput() replaced with new primitives
that would, in addition to adjusting refcount, set and clear
persistency flag.
- instead of having kill_litter_super() mess with removing the
remaining "leaked" references (e.g. for all tmpfs files that hadn't
been removed prior to umount), have the regular
shrink_dcache_for_umount() strip DCACHE_PERSISTENT of all dentries,
dropping the corresponding reference if it had been set. After that
kill_litter_super() becomes an equivalent of kill_anon_super().
Doing that in a single step is not feasible - it would affect too many
places in too many filesystems. It has to be split into a series.
This work has really started early in 2024; quite a few preliminary
pieces have already gone into mainline. This chunk is finally getting
to the meat of that stuff - infrastructure and most of the conversions
to it.
Some pieces are still sitting in the local branches, but the bulk of
that stuff is here"
* tag 'pull-persistency' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (54 commits)
d_make_discardable(): warn if given a non-persistent dentry
kill securityfs_recursive_remove()
convert securityfs
get rid of kill_litter_super()
convert rust_binderfs
convert nfsctl
convert rpc_pipefs
convert hypfs
hypfs: swich hypfs_create_u64() to returning int
hypfs: switch hypfs_create_str() to returning int
hypfs: don't pin dentries twice
convert gadgetfs
gadgetfs: switch to simple_remove_by_name()
convert functionfs
functionfs: switch to simple_remove_by_name()
functionfs: fix the open/removal races
functionfs: need to cancel ->reset_work in ->kill_sb()
functionfs: don't bother with ffs->ref in ffs_data_{opened,closed}()
functionfs: don't abuse ffs_data_closed() on fs shutdown
convert selinuxfs
...
- The 10 patch series "__vmalloc()/kvmalloc() and no-block support" from
Uladzislau Rezki reworks the vmalloc() code to support non-blocking
allocations (GFP_ATOIC, GFP_NOWAIT).
- The 2 patch series "ksm: fix exec/fork inheritance" from xu xin fixes
a rare case where the KSM MMF_VM_MERGE_ANY prctl state is not inherited
across fork/exec.
- The 4 patch series "mm/zswap: misc cleanup of code and documentations"
from SeongJae Park does some light maintenance work on the zswap code.
- The 5 patch series "mm/page_owner: add debugfs files 'show_handles'
and 'show_stacks_handles'" from Mauricio Faria de Oliveira enhances the
/sys/kernel/debug/page_owner debug feature. It adds unique identifiers
to differentiate the various stack traces so that userspace monitoring
tools can better match stack traces over time.
- The 2 patch series "mm/page_alloc: pcp->batch cleanups" from Joshua
Hahn makes some minor alterations to the page allocator's per-cpu-pages
feature.
- The 2 patch series "Improve UFFDIO_MOVE scalability by removing
anon_vma lock" from Lokesh Gidra addresses a scalability issue in
userfaultfd's UFFDIO_MOVE operation.
- The 2 patch series "kasan: cleanups for kasan_enabled() checks" from
Sabyrzhan Tasbolatov performs some cleanup in the KASAN code.
- The 2 patch series "drivers/base/node: fold node register and
unregister functions" from Donet Tom cleans up the NUMA node handling
code a little.
- The 4 patch series "mm: some optimizations for prot numa" from Kefeng
Wang provides some cleanups and small optimizations to the NUMA
allocation hinting code.
- The 5 patch series "mm/page_alloc: Batch callers of
free_pcppages_bulk" from Joshua Hahn addresses long lock hold times at
boot on large machines. These were causing (harmless) softlockup
warnings.
- The 2 patch series "optimize the logic for handling dirty file folios
during reclaim" from Baolin Wang removes some now-unnecessary work from
page reclaim.
- The 10 patch series "mm/damon: allow DAMOS auto-tuned for per-memcg
per-node memory usage" from SeongJae Park enhances the DAMOS auto-tuning
feature.
- The 2 patch series "mm/damon: fixes for address alignment issues in
DAMON_LRU_SORT and DAMON_RECLAIM" from Quanmin Yan fixes DAMON_LRU_SORT
and DAMON_RECLAIM with certain userspace configuration.
- The 15 patch series "expand mmap_prepare functionality, port more
users" from Lorenzo Stoakes enhances the new(ish)
file_operations.mmap_prepare() method and ports additional callsites
from the old ->mmap() over to ->mmap_prepare().
- The 8 patch series "Fix stale IOTLB entries for kernel address space"
from Lu Baolu fixes a bug (and possible security issue on non-x86) in
the IOMMU code. In some situations the IOMMU could be left hanging onto
a stale kernel pagetable entry.
- The 4 patch series "mm/huge_memory: cleanup __split_unmapped_folio()"
from Wei Yang cleans up and optimizes the folio splitting code.
- The 5 patch series "mm, swap: misc cleanup and bugfix" from Kairui
Song implements some cleanups and a minor fix in the swap discard code.
- The 8 patch series "mm/damon: misc documentation fixups" from SeongJae
Park does as advertised.
- The 9 patch series "mm/damon: support pin-point targets removal" from
SeongJae Park permits userspace to remove a specific monitoring target
in the middle of the current targets list.
- The 2 patch series "mm: MISC follow-up patches for linux/pgalloc.h"
from Harry Yoo implements a couple of cleanups related to mm header file
inclusion.
- The 2 patch series "mm/swapfile.c: select swap devices of default
priority round robin" from Baoquan He improves the selection of swap
devices for NUMA machines.
- The 3 patch series "mm: Convert memory block states (MEM_*) macros to
enums" from Israel Batista changes the memory block labels from macros
to enums so they will appear in kernel debug info.
- The 3 patch series "ksm: perform a range-walk to jump over holes in
break_ksm" from Pedro Demarchi Gomes addresses an inefficiency when KSM
unmerges an address range.
- The 22 patch series "mm/damon/tests: fix memory bugs in kunit tests"
from SeongJae Park fixes leaks and unhandled malloc() failures in DAMON
userspace unit tests.
- The 2 patch series "some cleanups for pageout()" from Baolin Wang
cleans up a couple of minor things in the page scanner's
writeback-for-eviction code.
- The 2 patch series "mm/hugetlb: refactor sysfs/sysctl interfaces" from
Hui Zhu moves hugetlb's sysfs/sysctl handling code into a new file.
- The 9 patch series "introduce VM_MAYBE_GUARD and make it sticky" from
Lorenzo Stoakes makes the VMA guard regions available in /proc/pid/smaps
and improves the mergeability of guarded VMAs.
- The 2 patch series "mm: perform guard region install/remove under VMA
lock" from Lorenzo Stoakes reduces mmap lock contention for callers
performing VMA guard region operations.
- The 2 patch series "vma_start_write_killable" from Matthew Wilcox
starts work in permitting applications to be killed when they are
waiting on a read_lock on the VMA lock.
- The 11 patch series "mm/damon/tests: add more tests for online
parameters commit" from SeongJae Park adds additional userspace testing
of DAMON's "commit" feature.
- The 9 patch series "mm/damon: misc cleanups" from SeongJae Park does
that.
- The 2 patch series "make VM_SOFTDIRTY a sticky VMA flag" from Lorenzo
Stoakes addresses the possible loss of a VMA's VM_SOFTDIRTY flag when
that VMA is merged with another.
- The 16 patch series "mm: support device-private THP" from Balbir Singh
introduces support for Transparent Huge Page (THP) migration in zone
device-private memory.
- The 3 patch series "Optimize folio split in memory failure" from Zi
Yan optimizes folio split operations in the memory failure code.
- The 2 patch series "mm/huge_memory: Define split_type and consolidate
split support checks" from Wei Yang provides some more cleanups in the
folio splitting code.
- The 16 patch series "mm: remove is_swap_[pte, pmd]() + non-swap
entries, introduce leaf entries" from Lorenzo Stoakes cleans up our
handling of pagetable leaf entries by introducing the concept of
'software leaf entries', of type softleaf_t.
- The 4 patch series "reparent the THP split queue" from Muchun Song
reparents the THP split queue to its parent memcg. This is in
preparation for addressing the long-standing "dying memcg" problem,
wherein dead memcg's linger for too long, consuming memory resources.
- The 3 patch series "unify PMD scan results and remove redundant
cleanup" from Wei Yang does a little cleanup in the hugepage collapse
code.
- The 6 patch series "zram: introduce writeback bio batching" from
Sergey Senozhatsky improves zram writeback efficiency by introducing
batched bio writeback support.
- The 4 patch series "memcg: cleanup the memcg stats interfaces" from
Shakeel Butt cleans up our handling of the interrupt safety of some
memcg stats.
- The 4 patch series "make vmalloc gfp flags usage more apparent" from
Vishal Moola cleans up vmalloc's handling of incoming GFP flags.
- The 6 patch series "mm: Add soft-dirty and uffd-wp support for RISC-V"
from Chunyan Zhang teches soft dirty and userfaultfd write protect
tracking to use RISC-V's Svrsw60t59b extension.
- The 5 patch series "mm: swap: small fixes and comment cleanups" from
Youngjun Park fixes a small bug and cleans up some of the swap code.
- The 4 patch series "initial work on making VMA flags a bitmap" from
Lorenzo Stoakes starts work on converting the vma struct's flags to a
bitmap, so we stop running out of them, especially on 32-bit.
- The 2 patch series "mm/swapfile: fix and cleanup swap list iterations"
from Youngjun Park addresses a possible bug in the swap discard code and
cleans things up a little.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaTEb0wAKCRDdBJ7gKXxA
jjfIAP94W4EkCCwNOupnChoG+YWw/JW21anXt5NN+i5svn1yugEAwzvv6A+cAFng
o+ug/fyrfPZG7PLp2R8WFyGIP0YoBA4=
=IUzS
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2025-12-03-21-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"__vmalloc()/kvmalloc() and no-block support" (Uladzislau Rezki)
Rework the vmalloc() code to support non-blocking allocations
(GFP_ATOIC, GFP_NOWAIT)
"ksm: fix exec/fork inheritance" (xu xin)
Fix a rare case where the KSM MMF_VM_MERGE_ANY prctl state is not
inherited across fork/exec
"mm/zswap: misc cleanup of code and documentations" (SeongJae Park)
Some light maintenance work on the zswap code
"mm/page_owner: add debugfs files 'show_handles' and 'show_stacks_handles'" (Mauricio Faria de Oliveira)
Enhance the /sys/kernel/debug/page_owner debug feature by adding
unique identifiers to differentiate the various stack traces so
that userspace monitoring tools can better match stack traces over
time
"mm/page_alloc: pcp->batch cleanups" (Joshua Hahn)
Minor alterations to the page allocator's per-cpu-pages feature
"Improve UFFDIO_MOVE scalability by removing anon_vma lock" (Lokesh Gidra)
Address a scalability issue in userfaultfd's UFFDIO_MOVE operation
"kasan: cleanups for kasan_enabled() checks" (Sabyrzhan Tasbolatov)
"drivers/base/node: fold node register and unregister functions" (Donet Tom)
Clean up the NUMA node handling code a little
"mm: some optimizations for prot numa" (Kefeng Wang)
Cleanups and small optimizations to the NUMA allocation hinting
code
"mm/page_alloc: Batch callers of free_pcppages_bulk" (Joshua Hahn)
Address long lock hold times at boot on large machines. These were
causing (harmless) softlockup warnings
"optimize the logic for handling dirty file folios during reclaim" (Baolin Wang)
Remove some now-unnecessary work from page reclaim
"mm/damon: allow DAMOS auto-tuned for per-memcg per-node memory usage" (SeongJae Park)
Enhance the DAMOS auto-tuning feature
"mm/damon: fixes for address alignment issues in DAMON_LRU_SORT and DAMON_RECLAIM" (Quanmin Yan)
Fix DAMON_LRU_SORT and DAMON_RECLAIM with certain userspace
configuration
"expand mmap_prepare functionality, port more users" (Lorenzo Stoakes)
Enhance the new(ish) file_operations.mmap_prepare() method and port
additional callsites from the old ->mmap() over to ->mmap_prepare()
"Fix stale IOTLB entries for kernel address space" (Lu Baolu)
Fix a bug (and possible security issue on non-x86) in the IOMMU
code. In some situations the IOMMU could be left hanging onto a
stale kernel pagetable entry
"mm/huge_memory: cleanup __split_unmapped_folio()" (Wei Yang)
Clean up and optimize the folio splitting code
"mm, swap: misc cleanup and bugfix" (Kairui Song)
Some cleanups and a minor fix in the swap discard code
"mm/damon: misc documentation fixups" (SeongJae Park)
"mm/damon: support pin-point targets removal" (SeongJae Park)
Permit userspace to remove a specific monitoring target in the
middle of the current targets list
"mm: MISC follow-up patches for linux/pgalloc.h" (Harry Yoo)
A couple of cleanups related to mm header file inclusion
"mm/swapfile.c: select swap devices of default priority round robin" (Baoquan He)
improve the selection of swap devices for NUMA machines
"mm: Convert memory block states (MEM_*) macros to enums" (Israel Batista)
Change the memory block labels from macros to enums so they will
appear in kernel debug info
"ksm: perform a range-walk to jump over holes in break_ksm" (Pedro Demarchi Gomes)
Address an inefficiency when KSM unmerges an address range
"mm/damon/tests: fix memory bugs in kunit tests" (SeongJae Park)
Fix leaks and unhandled malloc() failures in DAMON userspace unit
tests
"some cleanups for pageout()" (Baolin Wang)
Clean up a couple of minor things in the page scanner's
writeback-for-eviction code
"mm/hugetlb: refactor sysfs/sysctl interfaces" (Hui Zhu)
Move hugetlb's sysfs/sysctl handling code into a new file
"introduce VM_MAYBE_GUARD and make it sticky" (Lorenzo Stoakes)
Make the VMA guard regions available in /proc/pid/smaps and
improves the mergeability of guarded VMAs
"mm: perform guard region install/remove under VMA lock" (Lorenzo Stoakes)
Reduce mmap lock contention for callers performing VMA guard region
operations
"vma_start_write_killable" (Matthew Wilcox)
Start work on permitting applications to be killed when they are
waiting on a read_lock on the VMA lock
"mm/damon/tests: add more tests for online parameters commit" (SeongJae Park)
Add additional userspace testing of DAMON's "commit" feature
"mm/damon: misc cleanups" (SeongJae Park)
"make VM_SOFTDIRTY a sticky VMA flag" (Lorenzo Stoakes)
Address the possible loss of a VMA's VM_SOFTDIRTY flag when that
VMA is merged with another
"mm: support device-private THP" (Balbir Singh)
Introduce support for Transparent Huge Page (THP) migration in zone
device-private memory
"Optimize folio split in memory failure" (Zi Yan)
"mm/huge_memory: Define split_type and consolidate split support checks" (Wei Yang)
Some more cleanups in the folio splitting code
"mm: remove is_swap_[pte, pmd]() + non-swap entries, introduce leaf entries" (Lorenzo Stoakes)
Clean up our handling of pagetable leaf entries by introducing the
concept of 'software leaf entries', of type softleaf_t
"reparent the THP split queue" (Muchun Song)
Reparent the THP split queue to its parent memcg. This is in
preparation for addressing the long-standing "dying memcg" problem,
wherein dead memcg's linger for too long, consuming memory
resources
"unify PMD scan results and remove redundant cleanup" (Wei Yang)
A little cleanup in the hugepage collapse code
"zram: introduce writeback bio batching" (Sergey Senozhatsky)
Improve zram writeback efficiency by introducing batched bio
writeback support
"memcg: cleanup the memcg stats interfaces" (Shakeel Butt)
Clean up our handling of the interrupt safety of some memcg stats
"make vmalloc gfp flags usage more apparent" (Vishal Moola)
Clean up vmalloc's handling of incoming GFP flags
"mm: Add soft-dirty and uffd-wp support for RISC-V" (Chunyan Zhang)
Teach soft dirty and userfaultfd write protect tracking to use
RISC-V's Svrsw60t59b extension
"mm: swap: small fixes and comment cleanups" (Youngjun Park)
Fix a small bug and clean up some of the swap code
"initial work on making VMA flags a bitmap" (Lorenzo Stoakes)
Start work on converting the vma struct's flags to a bitmap, so we
stop running out of them, especially on 32-bit
"mm/swapfile: fix and cleanup swap list iterations" (Youngjun Park)
Address a possible bug in the swap discard code and clean things
up a little
[ This merge also reverts commit ebb9aeb980 ("vfio/nvgrace-gpu:
register device memory for poison handling") because it looks
broken to me, I've asked for clarification - Linus ]
* tag 'mm-stable-2025-12-03-21-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits)
mm: fix vma_start_write_killable() signal handling
mm/swapfile: use plist_for_each_entry in __folio_throttle_swaprate
mm/swapfile: fix list iteration when next node is removed during discard
fs/proc/task_mmu.c: fix make_uffd_wp_huge_pte() huge pte handling
mm/kfence: add reboot notifier to disable KFENCE on shutdown
memcg: remove inc/dec_lruvec_kmem_state helpers
selftests/mm/uffd: initialize char variable to Null
mm: fix DEBUG_RODATA_TEST indentation in Kconfig
mm: introduce VMA flags bitmap type
tools/testing/vma: eliminate dependency on vma->__vm_flags
mm: simplify and rename mm flags function for clarity
mm: declare VMA flags by bit
zram: fix a spelling mistake
mm/page_alloc: optimize lowmem_reserve max lookup using its semantic monotonicity
mm/vmscan: skip increasing kswapd_failures when reclaim was boosted
pagemap: update BUDDY flag documentation
mm: swap: remove scan_swap_map_slots() references from comments
mm: swap: change swap_alloc_slow() to void
mm, swap: remove redundant comment for read_swap_cache_async
mm, swap: use SWP_SOLIDSTATE to determine if swap is rotational
...
The commit 4d38328eb4 ("tracing: Fix synth event printk format for str
fields") replaced "%.*s" with "%s" but missed removing the number size of
the dynamic and static strings. The commit e1a453a57b ("tracing: Do not
add length to print format in synthetic events") fixed the dynamic part
but did not fix the static part. That is, with the commands:
# echo 's:wake_lat char[] wakee; u64 delta;' >> /sys/kernel/tracing/dynamic_events
# echo 'hist:keys=pid:ts=common_timestamp.usecs if !(common_flags & 0x18)' > /sys/kernel/tracing/events/sched/sched_waking/trigger
# echo 'hist:keys=next_pid:delta=common_timestamp.usecs-$ts:onmatch(sched.sched_waking).trace(wake_lat,next_comm,$delta)' > /sys/kernel/tracing/events/sched/sched_switch/trigger
That caused the output of:
<idle>-0 [001] d..5. 193.428167: wake_lat: wakee=(efault)sshd-sessiondelta=155
sshd-session-879 [001] d..5. 193.811080: wake_lat: wakee=(efault)kworker/u34:5delta=58
<idle>-0 [002] d..5. 193.811198: wake_lat: wakee=(efault)bashdelta=91
The commit e1a453a57b fixed the part where the synthetic event had
"char[] wakee". But if one were to replace that with a static size string:
# echo 's:wake_lat char[16] wakee; u64 delta;' >> /sys/kernel/tracing/dynamic_events
Where "wakee" is defined as "char[16]" and not "char[]" making it a static
size, the code triggered the "(efaul)" again.
Remove the added STR_VAR_LEN_MAX size as the string is still going to be
nul terminated.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Douglas Raillard <douglas.raillard@arm.com>
Link: https://patch.msgid.link/20251204151935.5fa30355@gandalf.local.home
Fixes: e1a453a57b ("tracing: Do not add length to print format in synthetic events")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The trace file will pause tracing if the tracing instance has the
"pause-on-trace" option is set. This happens when the file is opened, and
it is unpaused when the file is closed. When this was first added, there
was only one user that paused tracing. On open, the check to pause was:
if (!iter->snapshot && (tr->trace_flags & TRACE_ITER(PAUSE_ON_TRACE)))
Where if it is not the snapshot tracer and the "pause-on-trace" option is
set, then it increments a "stop_count" of the trace instance.
On close, the check is:
if (!iter->snapshot && tr->stop_count)
That is, if it is not the snapshot buffer and it was stopped, it will
re-enable tracing.
Now there's more places that stop tracing. This means, if something else
stops tracing the tr->stop_count will be non-zero, and that means if the
trace file is closed, it will decrement the stop_count even though it
never incremented it. This causes a warning because when the user that
stopped tracing enables it again, the stop_count goes below zero.
Instead of relying on the stop_count being set to know if the close of
the trace file should enable tracing again, add a new flag to the trace
iterator. The trace iterator is unique per open of the trace file, and if
the open stops tracing set the trace iterator PAUSE flag. On close, if the
PAUSE flag is set, then re-enable it again.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251202161751.24abaaf1@gandalf.local.home
Fixes: 06e0a548ba ("tracing: Do not disable tracing when reading the trace file")
Reported-by: syzbot+ccdec3bfe0beec58a38d@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/692f44a5.a70a0220.2ea503.00c8.GAE@google.com/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
* Move jiffies converters out of kernel/sysctl.c
Moved the jiffies converters into kernel/time/jiffies.c and replaced
the pipe-max-size proc_handler converter with a macro based version.
This is all part of the effort to relocate non-sysctl logic out of
kernel/sysctl.c into more relevant subsystems. No functional changes.
* Generalize proc handler converter creation
Removed duplicated sysctl converter logic by consolidating it in
macros. These are used inside sysctl core as well as in pipe.c and
jiffies.c. Converter kernel and user space pointer args are now
automatically const qualified for the convenience of the caller. No
functional changes.
* Miscellaneous
Fixed kernel-doc format warnings, removed unnecessary __user
qualifiers, and moved the nmi_watchdog sysctl into .rodata.
* Testing
This series was run through sysctl selftests/kunit test suite in
x86_64. It went into linux-next after rc2, giving it a good 4/5 weeks
of testing.
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEErkcJVyXmMSXOyyeQupfNUreWQU8FAmktuJMACgkQupfNUreW
QU9l8Qv+Noh/wLTqBEmHCrQ8k19YCNlBHO6a10Q5bFiAiGdTAMCZ7oFzoAwAjv5y
pLtzS75G89zP0O6wgkxTsmoDNi4MRJenOCyjyEDFYvrK+qSTm0CWs0sZCsHqX4Dg
7M+7PVK/EbMO5509J2ae6cYS9pjfwg3EBQZ978b/FATkuhRjxOIJhIv3ZoaFjme4
0q/xqHw+oms5CUL035BfqtkoskIiRT19DAvM/DEjc2ByaHCTGURv00XLvSDHaRer
O0Z8nXaxOOCscLunZbC3UL+hC7tB0nPE+XSzm9ylBEM7bTxeZmtvx2G6ru0+873U
Sp+BwpFhe0RmzBFlclkd7UPtvGlFAY2QgAfpSaiLsodkoX0mctquTgpy99LhxKej
EEyjl9tPVrYoH4MG562bZPGrQHtV4vnR9DXYx56vYtY2Fyr1GZmWawQoMZsHe9AU
cVw5HrfeKeHBhk9hi3ZvT9z96ns3YBmIHnYNDeMy+mF/i+cVu///GwgGuFqUqKag
3eWcTaPh
=DXVD
-----END PGP SIGNATURE-----
Merge tag 'sysctl-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl
Pull sysctl updates from Joel Granados:
- Move jiffies converters out of kernel/sysctl.c
Move the jiffies converters into kernel/time/jiffies.c and replace
the pipe-max-size proc_handler converter with a macro based version.
This is all part of the effort to relocate non-sysctl logic out of
kernel/sysctl.c into more relevant subsystems. No functional changes.
- Generalize proc handler converter creation
Remove duplicated sysctl converter logic by consolidating it in
macros. These are used inside sysctl core as well as in pipe.c and
jiffies.c. Converter kernel and user space pointer args are now
automatically const qualified for the convenience of the caller. No
functional changes.
- Miscellaneous
Fix kernel-doc format warnings, remove unnecessary __user
qualifiers, and move the nmi_watchdog sysctl into .rodata.
- Testing
This series was run through sysctl selftests/kunit test suite in
x86_64. It went into linux-next after rc2, giving it a good 4/5 weeks
of testing.
* tag 'sysctl-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl: (21 commits)
sysctl: Wrap do_proc_douintvec with the public function proc_douintvec_conv
sysctl: Create pipe-max-size converter using sysctl UINT macros
sysctl: Move proc_doulongvec_ms_jiffies_minmax to kernel/time/jiffies.c
sysctl: Move jiffies converters to kernel/time/jiffies.c
sysctl: Move UINT converter macros to sysctl header
sysctl: Move INT converter macros to sysctl header
sysctl: Allow custom converters from outside sysctl
sysctl: remove __user qualifier from stack_erasing_sysctl buffer argument
sysctl: Create macro for user-to-kernel uint converter
sysctl: Add optional range checking to SYSCTL_UINT_CONV_CUSTOM
sysctl: Create unsigned int converter using new macro
sysctl: Add optional range checking to SYSCTL_INT_CONV_CUSTOM
sysctl: Create integer converters with one macro
sysctl: Create converter functions with two new macros
sysctl: Discriminate between kernel and user converter params
sysctl: Indicate the direction of operation with macro names
sysctl: Remove superfluous __do_proc_* indirection
sysctl: Remove superfluous tbl_data param from "dovec" functions
sysctl: Replace void pointer with const pointer to ctl_table
sysctl: fix kernel-doc format warning
...
- fprobe: Performance enhancement of the fprobe using rhltable
. fprobe: use rhltable for fprobe_ip_table. The fprobe IP table has
been converted to use an rhltable for improved performance when
dealing with a large number of probed functions.
. Fix a suspicious RCU usage warning of the above change in the
fprobe entry handler.
. Remove an unused local variable of the above change.
. Fix to initialize fprobe_ip_table in core_initcall().
- fprobe: Performance optimization of fprobe by ftrace
. fprobe: Use ftrace instead of fgraph for entry only probes. This
avoids the unneeded overhead of fgraph stack setup.
. Also update fprobe selftest for entry-only probe.
. fprobe: Use ftrace only if CONFIG_DYNAMIC_FTRACE_WITH_ARGS or
WITH_REGS is defined.
- probes: Cleanup probe event subsystems.
. uprobe/eprobe: Allocate traceprobe_parse_context per probe instead
of each probe argument parsing. This reduce memory allocation/free
of temporary working memory.
. uprobes: Cleanup code using __free().
. eprobes: Cleanup code using __free().
. probes: Cleanup code using __free(trace_probe_log_clear) to clear
error log automatically.
. probes: Replace strcpy() with memcpy() in __trace_probe_log_err().
-----BEGIN PGP SIGNATURE-----
iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmkvhSsbHG1hc2FtaS5o
aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8bWhEH/23XM5Msjy5vopB+ECZb
iCj8SkWrQzfiCBILUqxCkZdfJHFomGPHewxvxIOWdb7evtHuy0Ypne/Uw/TMAtAh
xvDQmu03IV2jO7h7GExsnEh0nX0upYg4IVmN0sCSSWSfgLLTWO9ICClavV9adcva
ZR+5TdZbK+W59n+ejxA9OMDt1G+nz1Ls9Qhx9ktf7odkJzBkQGPq/heZuPbF3+6k
Vj2IHTuqWobDDt+ekKOBRWNh9cS61ybxvsr/vmkT6s904ortP6mZa3zEYPRVOUNG
WJ/KGJwvExTcaG/Dy2g6q8tam1Bidx9/S6klyOGXQXxvaIT1VtBc66HzAUfso6jg
yIc=
=w6Kq
-----END PGP SIGNATURE-----
Merge tag 'probes-v6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull probes updates from Masami Hiramatsu:
"fprobe performance enhancement using rhltable:
- use rhltable for fprobe_ip_table. The fprobe IP table has been
converted to use an rhltable for improved performance when dealing
with a large number of probed functions
- Fix a suspicious RCU usage warning of the above change in the
fprobe entry handler
- Remove an unused local variable of the above change
- Fix to initialize fprobe_ip_table in core_initcall()
Performance optimization of fprobe by ftrace:
- Use ftrace instead of fgraph for entry only probes. This avoids the
unneeded overhead of fgraph stack setup
- Also update fprobe selftest for entry-only probe
- fprobe: Use ftrace only if CONFIG_DYNAMIC_FTRACE_WITH_ARGS or
WITH_REGS is defined
Cleanup probe event subsystems:
- Allocate traceprobe_parse_context per probe instead of each probe
argument parsing. This reduce memory allocation/free of temporary
working memory
- Cleanup code using __free()
- Replace strcpy() with memcpy() in __trace_probe_log_err()"
* tag 'probes-v6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: fprobe: use ftrace if CONFIG_DYNAMIC_FTRACE_WITH_ARGS
lib/test_fprobe: add testcase for mixed fprobe
tracing: fprobe: optimization for entry only case
tracing: fprobe: Fix to init fprobe_ip_table earlier
tracing: fprobe: Remove unused local variable
tracing: probes: Replace strcpy() with memcpy() in __trace_probe_log_err()
tracing: fprobe: fix suspicious rcu usage in fprobe_entry
tracing: uprobe: eprobes: Allocate traceprobe_parse_context per probe
tracing: uprobes: Cleanup __trace_uprobe_create() with __free()
tracing: eprobe: Cleanup eprobe event using __free()
tracing: probes: Use __free() for trace_probe_log
tracing: fprobe: use rhltable for fprobe_ip_table
The allocation of the per CPU buffer descriptor, the buffer page
descriptors and the buffer page data itself can be pretty ugly.
Add some helper macros and a function to have the code that allocates
buffer pages and such look a little cleaner.
-----BEGIN PGP SIGNATURE-----
iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaTL3JxQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qvDgAP9HFxPe2EqGspnY0RungWDs3yCxqlUp
Eqz7SaI9GCXdXgD/TKiz3YjNVxZveeDU6QHWsDl4svoBzjSAsaeTkXD+OQ8=
=siR0
-----END PGP SIGNATURE-----
Merge tag 'trace-ringbuffer-v6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull trace ring-buffer cleanup from Steven Rostedt:
- Add helper functions for allocations
The allocation of the per CPU buffer descriptor, the buffer page
descriptors and the buffer page data itself can be pretty ugly.
Add some helper macros and a function to have the code that allocates
buffer pages and such look a little cleaner.
* tag 'trace-ringbuffer-v6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
ring-buffer: Add helper functions for allocations
- Adapt the ftracetest script to be run from a different folder
This uses the already existing OPT_TEST_DIR but extends it further to run
independent tests, then add an --rv flag to allow using the script for
testing RV (mostly) independently on ftrace.
- Add basic RV selftests in selftests/verification for more validations
Add more validations for available/enabled monitors and reactors. This
could have caught the bug introducing kernel panic solved above. Tests use
ftracetest.
- Convert react() function in reactor to use va_list directly
Use a central helper to handle the variadic arguments. Clean up macros
and mark functions as static.
- Add lockdep annotations to reactors to have lockdep complain of errors
If the reactors are called from improper context. Useful to develop new
reactors. This highlights a warning in the panic reactor that is related
to the printk subsystem and not to RV.
- Convert core RV code to use lock guards and __free helpers
This completely removes goto statements.
- Fix compilation if !CONFIG_RV_REACTORS
Fix the warning by keeping LTL monitor variable as always static.
-----BEGIN PGP SIGNATURE-----
iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaTBoVxQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qtWpAQDxPQAJQvBZ41l9q9Cis7PqGGezT4Nv
g6Fh/ydMOlJCsQD/R0Xd5JxPmBI8FLCwCfqHo7wYKUhP8GfL/ORPEWhU2gI=
=EEot
-----END PGP SIGNATURE-----
Merge tag 'trace-rv-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull runtime verifier updates from Steven Rostedt:
- Adapt the ftracetest script to be run from a different folder
This uses the already existing OPT_TEST_DIR but extends it further to
run independent tests, then add an --rv flag to allow using the
script for testing RV (mostly) independently on ftrace.
- Add basic RV selftests in selftests/verification for more validations
Add more validations for available/enabled monitors and reactors.
This could have caught the bug introducing kernel panic solved above.
Tests use ftracetest.
- Convert react() function in reactor to use va_list directly
Use a central helper to handle the variadic arguments. Clean up
macros and mark functions as static.
- Add lockdep annotations to reactors to have lockdep complain of
errors
If the reactors are called from improper context. Useful to develop
new reactors. This highlights a warning in the panic reactor that is
related to the printk subsystem and not to RV.
- Convert core RV code to use lock guards and __free helpers
This completely removes goto statements.
- Fix compilation if !CONFIG_RV_REACTORS
Fix the warning by keeping LTL monitor variable as always static.
* tag 'trace-rv-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
rv: Fix compilation if !CONFIG_RV_REACTORS
rv: Convert to use __free
rv: Convert to use lock guard
rv: Add explicit lockdep context for reactors
rv: Make rv_reacting_on() static
rv: Pass va_list to reactors
selftests/verification: Add initial RV tests
selftest/ftrace: Generalise ftracetest to use with RV
- Fix regression of pid filtering of function graph tracer
When the function graph tracer allowed multiple instances of
graph tracing using subops, the filtering by pid broke.
The ftrace_ops->private that was used for pid filtering wasn't
updated on creation.
The wrong function entry callback was used when pid filtering was
enabled when the function graph tracer started, which meant that
the pid filtering wasn't happening.
- Remove no longer needed ftrace_trace_task()
With PID filtering working via ftrace_pids_enabled() and fgraph_pid_func(),
the coarse-grained ftrace_trace_task() check in graph_entry() is obsolete.
It was only a fallback for uninitialized op->private (now fixed), and its
removal ensures consistent PID filtering with standard function tracing.
-----BEGIN PGP SIGNATURE-----
iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaS90FhQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qrqMAQDbU53VhvZ6rE0pNvu0Tlk+LDCu3gxg
F2wisWr65389OgD/VFLTVRjCZh1iY7FFWjAPGRCMbetljmMgK5vpH6XSigA=
=VKaD
-----END PGP SIGNATURE-----
Merge tag 'ftrace-v6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull ftrace updates from Steven Rostedt:
- Fix regression of pid filtering of function graph tracer
When the function graph tracer allowed multiple instances of graph
tracing using subops, the filtering by pid broke.
The ftrace_ops->private that was used for pid filtering wasn't
updated on creation.
The wrong function entry callback was used when pid filtering was
enabled when the function graph tracer started, which meant that
the pid filtering wasn't happening.
- Remove no longer needed ftrace_trace_task()
With PID filtering working via ftrace_pids_enabled() and
fgraph_pid_func(), the coarse-grained ftrace_trace_task()
check in graph_entry() is obsolete.
It was only a fallback for uninitialized op->private (now fixed),
and its removal ensures consistent PID filtering with standard
function tracing.
* tag 'ftrace-v6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
fgraph: Remove coarse PID filtering from graph_entry()
fgraph: Check ftrace_pids_enabled on registration for early filtering
fgraph: Initialize ftrace_ops->private for function graph ops
- Merge branch shared with kprobes on extending trace options
The trace options were defined by a 32 bit variable. This limits the
tracing instances to have a total of 32 different options. As that limit
has been hit, and more options are being added, increase the option mask
to a 64 bit number, doubling the number of options available.
As this is required for the kprobe topic branches as well as the tracing
topic branch, a separate branch was created and merged into both.
- Make trace_user_fault_read() available for the rest of tracing
The function trace_user_fault_read() is used by trace_marker file read to
allow reading user space to be done fast and without locking or
allocations. Make this available so that the system call trace events can
use it too.
- Have system call trace events read user space values
Now that the system call trace events callbacks are called in a faultable
context, take advantage of this and read the user space buffers for
various system calls. For example, show the path name of the openat system
call instead of just showing the pointer to that path name in user space.
Also show the contents of the buffer of the write system call. Several
system call trace events are updated to make tracing into a light weight
strace tool for all applications in the system.
- Update perf system call tracing to do the same
- And a config and syscall_user_buf_size file to control the size of the buffer
Limit the amount of data that can be read from user space. The default
size is 63 bytes but that can be expanded to 165 bytes.
- Allow the persistent ring buffer to print system calls normally
The persistent ring buffer prints trace events by their type and ignores
the print_fmt. This is because the print_fmt may change from kernel to
kernel. As the system call output is fixed by the system call ABI itself,
there's no reason to limit that. This makes reading the system call events
in the persistent ring buffer much nicer and easier to understand.
- Add options to show text offset to function profiler
The function profiler that counts the number of times a function is hit
currently lists all functions by its name and offset. But this becomes
ambiguous when there are several functions with the same name. Add a
tracing option that changes the output to be that of _text+offset
instead. Now a user space tool can use this information to map the
_text+offset to the unique function it is counting.
- Report bad dynamic event command
If a bad command is passed to the dynamic_events file, report it properly
in the error log.
- Clean up tracer options
Clean up the tracer option code a bit, by removing some useless code and
also using switch statements instead of a series of if statements.
- Have tracing options be instance specific
Tracers can have their own options (function tracer, irqsoff tracer,
function graph tracer, etc). But now that the same tracer can be enabled
in multiple trace instances, their options are still global. The API is
per instance, thus changing one affects other instances. This isn't even
consistent, as the option take affect differently depending on when an
tracer started in an instance. Make the options for instances only affect
the instance it is changed under.
- Optimize pid_list lock contention
Whenever the pid_list is read, it uses a spin lock. This happens at every
sched switch. Taking the lock at sched switch can be removed by instead
using a seqlock counter.
- Clean up the trace trigger structures
The trigger code uses two different structures to implement a single
tigger. This was due to trying to reuse code for the two different types
of triggers (always on trigger, and count limited trigger). But by adding
a single field to one structure, the other structure could be absorbed
into the first structure making he code easier to understand.
- Create a bulk garbage collector for trace triggers
If user space has triggers for several hundreds of events and then removes
them, it can take several seconds to complete. This is because each
removal calls the slow tracepoint_synchronize_unregister() that can take
hundreds of milliseconds to complete. Instead, create a helper thread that
will do the clean up. When a trigger is removed, it will create the
kthread if it isn't already created, and then add the trigger to a llist.
The kthread will take the items off the llist, call
tracepoint_synchronize_unregister(), and then remove the items it took
off. It will then check if there's more items to free before sleeping.
This makes user space removing all these triggers to finish in less than a
second.
- Allow function tracing of some of the tracing infrastructure code
Because the tracing code can cause recursion issues if it is traced by the
function tracer the entire tracing directory disables function tracing.
But not all of tracing causes issues if it is traced. Namely, the event
tracing code. Add a config that enables some of the tracing code to be
traced to help in debugging it. Note, when this is enabled, it does add
noise to general function tracing, especially if events are enabled as
well (which is a common case).
- Add boot-time backup instance for persistent buffer
The persistent ring buffer is used mostly for kernel crash analysis in the
field. One issue is that if there's a crash, the data in the persistent
ring buffer must be read before tracing can begin using it. This slows
down the boot process. Once tracing starts in the persistent ring buffer,
the old data must be freed and the addresses no longer match and old
events can't be in the buffer with new events.
Create a way to create a backup buffer that copies the persistent ring
buffer at boot up. Then after a crash, the always on tracer can begin
immediately as well as the normal boot process while the crash analysis
tooling uses the backup buffer. After the backup buffer is finished being
read, it can be removed.
- Enable function graph args and return address options at the same time
Currently the when reading of arguments in the function graph tracer is
enabled, the option to record the parent function in the entry event can
not be enabled. Update the code so that it can.
- Add new struct_offset() helper macro
Add a new macro that takes a pointer to a structure and a name of one of
its members and it will return the offset of that member. This allows the
ring buffer code to simplify the following:
From: size = struct_size(entry, buf, cnt - sizeof(entry->id));
To: size = struct_offset(entry, id) + cnt;
There should be other simplifications that this macro can help out with as
well.
-----BEGIN PGP SIGNATURE-----
iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaS9xqxQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qj6tAQD4MR1lsE3XpH09asO4CDDfhbtRSQVD
o8bVKVihWx/j5gD/XezjqE2Q2+DO6dhnsQY6pbtNdXoKgaMuQJGA+dvPsQc=
=HilC
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing updates from Steven Rostedt:
- Extend tracing option mask to 64 bits
The trace options were defined by a 32 bit variable. This limits the
tracing instances to have a total of 32 different options. As that
limit has been hit, and more options are being added, increase the
option mask to a 64 bit number, doubling the number of options
available.
As this is required for the kprobe topic branches as well as the
tracing topic branch, a separate branch was created and merged into
both.
- Make trace_user_fault_read() available for the rest of tracing
The function trace_user_fault_read() is used by trace_marker file
read to allow reading user space to be done fast and without locking
or allocations. Make this available so that the system call trace
events can use it too.
- Have system call trace events read user space values
Now that the system call trace events callbacks are called in a
faultable context, take advantage of this and read the user space
buffers for various system calls. For example, show the path name of
the openat system call instead of just showing the pointer to that
path name in user space. Also show the contents of the buffer of the
write system call. Several system call trace events are updated to
make tracing into a light weight strace tool for all applications in
the system.
- Update perf system call tracing to do the same
- And a config and syscall_user_buf_size file to control the size of
the buffer
Limit the amount of data that can be read from user space. The
default size is 63 bytes but that can be expanded to 165 bytes.
- Allow the persistent ring buffer to print system calls normally
The persistent ring buffer prints trace events by their type and
ignores the print_fmt. This is because the print_fmt may change from
kernel to kernel. As the system call output is fixed by the system
call ABI itself, there's no reason to limit that. This makes reading
the system call events in the persistent ring buffer much nicer and
easier to understand.
- Add options to show text offset to function profiler
The function profiler that counts the number of times a function is
hit currently lists all functions by its name and offset. But this
becomes ambiguous when there are several functions with the same
name.
Add a tracing option that changes the output to be that of
'_text+offset' instead. Now a user space tool can use this
information to map the '_text+offset' to the unique function it is
counting.
- Report bad dynamic event command
If a bad command is passed to the dynamic_events file, report it
properly in the error log.
- Clean up tracer options
Clean up the tracer option code a bit, by removing some useless code
and also using switch statements instead of a series of if
statements.
- Have tracing options be instance specific
Tracers can have their own options (function tracer, irqsoff tracer,
function graph tracer, etc). But now that the same tracer can be
enabled in multiple trace instances, their options are still global.
The API is per instance, thus changing one affects other instances.
This isn't even consistent, as the option take affect differently
depending on when an tracer started in an instance. Make the options
for instances only affect the instance it is changed under.
- Optimize pid_list lock contention
Whenever the pid_list is read, it uses a spin lock. This happens at
every sched switch. Taking the lock at sched switch can be removed by
instead using a seqlock counter.
- Clean up the trace trigger structures
The trigger code uses two different structures to implement a single
tigger. This was due to trying to reuse code for the two different
types of triggers (always on trigger, and count limited trigger). But
by adding a single field to one structure, the other structure could
be absorbed into the first structure making he code easier to
understand.
- Create a bulk garbage collector for trace triggers
If user space has triggers for several hundreds of events and then
removes them, it can take several seconds to complete. This is
because each removal calls tracepoint_synchronize_unregister() that
can take hundreds of milliseconds to complete.
Instead, create a helper thread that will do the clean up. When a
trigger is removed, it will create the kthread if it isn't already
created, and then add the trigger to a llist. The kthread will take
the items off the llist, call tracepoint_synchronize_unregister(),
and then remove the items it took off. It will then check if there's
more items to free before sleeping.
This makes user space removing all these triggers to finish in less
than a second.
- Allow function tracing of some of the tracing infrastructure code
Because the tracing code can cause recursion issues if it is traced
by the function tracer the entire tracing directory disables function
tracing. But not all of tracing causes issues if it is traced.
Namely, the event tracing code. Add a config that enables some of the
tracing code to be traced to help in debugging it. Note, when this is
enabled, it does add noise to general function tracing, especially if
events are enabled as well (which is a common case).
- Add boot-time backup instance for persistent buffer
The persistent ring buffer is used mostly for kernel crash analysis
in the field. One issue is that if there's a crash, the data in the
persistent ring buffer must be read before tracing can begin using
it. This slows down the boot process. Once tracing starts in the
persistent ring buffer, the old data must be freed and the addresses
no longer match and old events can't be in the buffer with new
events.
Create a way to create a backup buffer that copies the persistent
ring buffer at boot up. Then after a crash, the always on tracer can
begin immediately as well as the normal boot process while the crash
analysis tooling uses the backup buffer. After the backup buffer is
finished being read, it can be removed.
- Enable function graph args and return address options at the same
time
Currently the when reading of arguments in the function graph tracer
is enabled, the option to record the parent function in the entry
event can not be enabled. Update the code so that it can.
- Add new struct_offset() helper macro
Add a new macro that takes a pointer to a structure and a name of one
of its members and it will return the offset of that member. This
allows the ring buffer code to simplify the following:
From: size = struct_size(entry, buf, cnt - sizeof(entry->id));
To: size = struct_offset(entry, id) + cnt;
There should be other simplifications that this macro can help out
with as well
* tag 'trace-v6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (42 commits)
overflow: Introduce struct_offset() to get offset of member
function_graph: Enable funcgraph-args and funcgraph-retaddr to work simultaneously
tracing: Add boot-time backup of persistent ring buffer
ftrace: Allow tracing of some of the tracing code
tracing: Use strim() in trigger_process_regex() instead of skip_spaces()
tracing: Add bulk garbage collection of freeing event_trigger_data
tracing: Remove unneeded event_mutex lock in event_trigger_regex_release()
tracing: Merge struct event_trigger_ops into struct event_command
tracing: Remove get_trigger_ops() and add count_func() from trigger ops
tracing: Show the tracer options in boot-time created instance
ftrace: Avoid redundant initialization in register_ftrace_direct
tracing: Remove unused variable in tracing_trace_options_show()
fgraph: Make fgraph_no_sleep_time signed
tracing: Convert function graph set_flags() to use a switch() statement
tracing: Have function graph tracer option sleep-time be per instance
tracing: Move graph-time out of function graph options
tracing: Have function graph tracer option funcgraph-irqs be per instance
trace/pid_list: optimize pid_list->lock contention
tracing: Have function graph tracer define options per instance
tracing: Have function tracer define options per instance
...
- Move libvfio selftest artifacts in preparation of more tightly
coupled integration with KVM selftests. (David Matlack)
- Fix comment typo in mtty driver. (Chu Guangqing)
- Support for new hardware revision in the hisi_acc vfio-pci variant
driver where the migration registers can now be accessed via the PF.
When enabled for this support, the full BAR can be exposed to the
user. (Longfang Liu)
- Fix vfio cdev support for VF token passing, using the correct size
for the kernel structure, thereby actually allowing userspace to
provide a non-zero UUID token. Also set the match token callback for
the hisi_acc, fixing VF token support for this this vfio-pci variant
driver. (Raghavendra Rao Ananta)
- Introduce internal callbacks on vfio devices to simplify and
consolidate duplicate code for generating VFIO_DEVICE_GET_REGION_INFO
data, removing various ioctl intercepts with a more structured
solution. (Jason Gunthorpe)
- Introduce dma-buf support for vfio-pci devices, allowing MMIO regions
to be exposed through dma-buf objects with lifecycle managed through
move operations. This enables low-level interactions such as a
vfio-pci based SPDK drivers interacting directly with dma-buf capable
RDMA devices to enable peer-to-peer operations. IOMMUFD is also now
able to build upon this support to fill a long standing feature gap
versus the legacy vfio type1 IOMMU backend with an implementation of
P2P support for VM use cases that better manages the lifecycle of the
P2P mapping. (Leon Romanovsky, Jason Gunthorpe, Vivek Kasireddy)
- Convert eventfd triggering for error and request signals to use RCU
mechanisms in order to avoid a 3-way lockdep reported deadlock issue.
(Alex Williamson)
- Fix a 32-bit overflow introduced via dma-buf support manifesting with
large DMA buffers. (Alex Mastro)
- Convert nvgrace-gpu vfio-pci variant driver to insert mappings on
fault rather than at mmap time. This conversion serves both to make
use of huge PFNMAPs but also to both avoid corrected RAS events
during reset by now being subject to vfio-pci-core's use of
unmap_mapping_range(), and to enable a device readiness test after
reset. (Ankit Agrawal)
- Refactoring of vfio selftests to support multi-device tests and split
code to provide better separation between IOMMU and device objects.
This work also enables a new test suite addition to measure parallel
device initialization latency. (David Matlack)
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEQvbATlQL0amee4qQI5ubbjuwiyIFAmkvV3IRHGFsZXhAc2hh
emJvdC5vcmcACgkQI5ubbjuwiyIpIQ/9GwpjLH5Vdv0v2d9mkHmZIWFpG/tr3zJa
+spQqOjO0etASc67PtIJArT9pWib+s6O8OaG7iFrdNR65HCSsXSZbIGbMThPODfy
DdDj1ipAqMVwcaCZT8un2N8Sktu9YpFQMvc5IoXWWYhw88vili7bBx+OTrEFV2T0
6qQijSBdhw1TXVFHG6BGSmqmisyMepIebA6GmPWdfYu6BfoWBYMdcMjDwd1J61Q5
DDwFRzn/Dz2Tvb1jbXiiRMRuFIuegFQii+wtd30S/cRPFZhZLWzc+drimC6oOFiQ
qL19vQQsBPnLtGvch40HsET/AbY5w0pLCkYX5qacxP3sq27+N+KuotzCvbnVMN+H
e2BqOCujyoce8z1Br6BzV71Lr2yzPDcc5pXTuEuuBT+J/ptOY8hfEikOj85s5Wzj
aKsTrdDRGMrn/o11NkGSzYwFcMs9MxCX9mo98U6OkWDr0+cmPLf4LGZgpJudWg4E
POUlzPpnzJrTlX5d+OqCdKJG0a1hTlTa2udzRa5hCDANHaZWLoAssfgSEKfV9xt1
PzOMf0UIJmPJmFcw/OpMO72/5xp8O4WslJS0ulSm6vrAJDtutLApHZ7bJ44KniNd
4vte+gOjyZY8ibTDKRULhXVlCDxkEnZjRBbApgI9HJD61IElOzjqohRuRx77J09B
7c8OSLI8d1U=
=tpee
-----END PGP SIGNATURE-----
Merge tag 'vfio-v6.19-rc1' of https://github.com/awilliam/linux-vfio
Pull VFIO updates from Alex Williamson:
- Move libvfio selftest artifacts in preparation of more tightly
coupled integration with KVM selftests (David Matlack)
- Fix comment typo in mtty driver (Chu Guangqing)
- Support for new hardware revision in the hisi_acc vfio-pci variant
driver where the migration registers can now be accessed via the PF.
When enabled for this support, the full BAR can be exposed to the
user (Longfang Liu)
- Fix vfio cdev support for VF token passing, using the correct size
for the kernel structure, thereby actually allowing userspace to
provide a non-zero UUID token. Also set the match token callback for
the hisi_acc, fixing VF token support for this this vfio-pci variant
driver (Raghavendra Rao Ananta)
- Introduce internal callbacks on vfio devices to simplify and
consolidate duplicate code for generating VFIO_DEVICE_GET_REGION_INFO
data, removing various ioctl intercepts with a more structured
solution (Jason Gunthorpe)
- Introduce dma-buf support for vfio-pci devices, allowing MMIO regions
to be exposed through dma-buf objects with lifecycle managed through
move operations. This enables low-level interactions such as a
vfio-pci based SPDK drivers interacting directly with dma-buf capable
RDMA devices to enable peer-to-peer operations. IOMMUFD is also now
able to build upon this support to fill a long standing feature gap
versus the legacy vfio type1 IOMMU backend with an implementation of
P2P support for VM use cases that better manages the lifecycle of the
P2P mapping (Leon Romanovsky, Jason Gunthorpe, Vivek Kasireddy)
- Convert eventfd triggering for error and request signals to use RCU
mechanisms in order to avoid a 3-way lockdep reported deadlock issue
(Alex Williamson)
- Fix a 32-bit overflow introduced via dma-buf support manifesting with
large DMA buffers (Alex Mastro)
- Convert nvgrace-gpu vfio-pci variant driver to insert mappings on
fault rather than at mmap time. This conversion serves both to make
use of huge PFNMAPs but also to both avoid corrected RAS events
during reset by now being subject to vfio-pci-core's use of
unmap_mapping_range(), and to enable a device readiness test after
reset (Ankit Agrawal)
- Refactoring of vfio selftests to support multi-device tests and split
code to provide better separation between IOMMU and device objects.
This work also enables a new test suite addition to measure parallel
device initialization latency (David Matlack)
* tag 'vfio-v6.19-rc1' of https://github.com/awilliam/linux-vfio: (65 commits)
vfio: selftests: Add vfio_pci_device_init_perf_test
vfio: selftests: Eliminate INVALID_IOVA
vfio: selftests: Split libvfio.h into separate header files
vfio: selftests: Move vfio_selftests_*() helpers into libvfio.c
vfio: selftests: Rename vfio_util.h to libvfio.h
vfio: selftests: Stop passing device for IOMMU operations
vfio: selftests: Move IOVA allocator into iova_allocator.c
vfio: selftests: Move IOMMU library code into iommu.c
vfio: selftests: Rename struct vfio_dma_region to dma_region
vfio: selftests: Upgrade driver logging to dev_err()
vfio: selftests: Prefix logs with device BDF where relevant
vfio: selftests: Eliminate overly chatty logging
vfio: selftests: Support multiple devices in the same container/iommufd
vfio: selftests: Introduce struct iommu
vfio: selftests: Rename struct vfio_iommu_mode to iommu_mode
vfio: selftests: Allow passing multiple BDFs on the command line
vfio: selftests: Split run.sh into separate scripts
vfio: selftests: Move run.sh into scripts directory
vfio/nvgrace-gpu: wait for the GPU mem to be ready
vfio/nvgrace-gpu: Inform devmem unmapped after reset
...
- Allow power-off for out-of-band wakeup-capable devices
- Drop the redundant call to dev_pm_domain_detach() for the amba bus
- Extend the genpd governor for CPUs to account for IPIs
pmdomain providers:
- bcm: Add support for BCM2712
- mediatek: Add support for MFlexGraphics power domains
- mediatek: Add support for MT8196 power domains
- qcom: Add RPMh power domain support for Kaanapali
- rockchip: Add support for RV1126B
pmdomain consumers:
- usb: dwc3: Enable out of band wakeup for i.MX95
- usb: chipidea: Enable out of band wakeup for i.MX95
-----BEGIN PGP SIGNATURE-----
iQJLBAABCgA1FiEEugLDXPmKSktSkQsV/iaEJXNYjCkFAmkt1HYXHHVsZi5oYW5z
c29uQGxpbmFyby5vcmcACgkQ/iaEJXNYjCk3PQ//W24ZdZ5cqXJASrw4YihTovbR
Pk/cVBacua32r3uBAf+GKdhAXrC7zaPbohp4tou5tpuY6I54NDJ4mpaTLUr+iZ6A
aiKGprBrNCNMU7Mu6Af23peQNo0t+0sdsxp3wjZAMcSK0/ioOnc2R+IZhjHI9dPh
Us7Ke/Pa1DRu78T0P2quRvIIRPv5iGep2vL8LFkQtPNS4tAU5ZHTWghVip+Q0m7x
660WwZ5a9Bkzii/mqxFlXnXMnFjaJwi7zEPEKEMKd88cTjZXZGqScq8Ojfxto4Jx
mdn9kPqwUR2QILTVKyBy2nJTeCB5nLvLhldi5/1nVxhqopN5ll4VwFx8CVsEPVTQ
l76IUMzdAmztnVqs1p7qZ8dyFKdYztNRSESapumpu10/DIIOgWW4eLNk5Sk+7d0K
1fZDL/hWV2j4gnY0wSjGht2JQ2jhDBsgkxInxvQY2a8Ik91gwyUsD7r3mCq/YwaA
97GJHk9YO75jolFPy6PniXttzVuoJXPEcvgpBR3br1iFRQhSoGIEfFcDnZEaH9Sw
m7nSc1Q99a6rsmq+WavB5JVPZBr2UkBhE6x+IU+MEBPwykjNj/1cZ7x4rtqzX666
eOn1FBDQ63CLEQl3hAPp1pxy3YB0nHcxKR9CPAZjECwZTfsYgJCrRNHyZb+yhPsU
qKKXXMaOyyowoLraGbQ=
=Fkue
-----END PGP SIGNATURE-----
Merge tag 'pmdomain-v6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/linux-pm
Pull pmdomain updates from Ulf Hansson:
"pmdomain core:
- Allow power-off for out-of-band wakeup-capable devices
- Drop the redundant call to dev_pm_domain_detach() for the amba bus
- Extend the genpd governor for CPUs to account for IPIs
pmdomain providers:
- bcm: Add support for BCM2712
- mediatek: Add support for MFlexGraphics power domains
- mediatek: Add support for MT8196 power domains
- qcom: Add RPMh power domain support for Kaanapali
- rockchip: Add support for RV1126B
pmdomain consumers:
- usb: dwc3: Enable out of band wakeup for i.MX95
- usb: chipidea: Enable out of band wakeup for i.MX95"
* tag 'pmdomain-v6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/linux-pm: (26 commits)
pmdomain: Extend the genpd governor for CPUs to account for IPIs
smp: Introduce a helper function to check for pending IPIs
pmdomain: mediatek: convert from clk round_rate() to determine_rate()
amba: bus: Drop dev_pm_domain_detach() call
pmdomain: bcm: bcm2835-power: Prepare to support BCM2712
pmdomain: mediatek: mtk-mfg: select MAILBOX in Kconfig
pmdomain: mediatek: Add support for MFlexGraphics
pmdomain: mediatek: Fix build-errors
cpuidle: psci: Replace deprecated strcpy in psci_idle_init_cpu
pmdomain: rockchip: Add support for RV1126B
pmdomain: mediatek: Add support for MT8196 HFRPSYS power domains
pmdomain: mediatek: Add support for MT8196 SCPSYS power domains
pmdomain: mediatek: Add support for secure HWCCF infra power on
pmdomain: mediatek: Add support for Hardware Voter power domains
pmdomain: qcom: rpmhpd: Add RPMh power domain support for Kaanapali
usb: dwc3: imx8mp: Set out of band wakeup for i.MX95
usb: chipidea: ci_hdrc_imx: Set out of band wakeup for i.MX95
usb: chipidea: core: detach power domain for ci_hdrc platform device
pmdomain: core: Allow power-off for out-of-band wakeup-capable devices
PM: wakeup: Add out-of-band system wakeup support for devices
...
new driver:
- Arm Ethos-U65/U85 accel driver
core:
- support the drm color pipeline in vkms/amdgfx
- add support for drm colorop pipeline
- add COLOR PIPELINE plane property
- add DRM_CLIENT_CAP_PLANE_COLOR_PIPELINE
- throttle dirty worker with vblank
- use drm_for_each_bridge_in_chain_scoped in drm's bridge code
- Ensure drm_client_modeset tests are enabled in UML
- add simulated vblank interrupt - use in drivers
- dumb buffer sizing helper
- move freeing of drm client memory to driver
- crtc sharpness strength property
- stop using system_wq in scheduler/drivers
- support emergency restore in drm-client
rust:
- make slice::as_flattened usable on all supported rustc
- add FromBytes::from_bytes_prefix() method
- remove redundant device ptr from Rust GEM object
- Change how AlwaysRefCounted is implemented for GEM objects
gpuvm:
- Add deferred vm_bo cleanup to GPUVM (for rust)
atomic:
- cleanup and improve state handling interfaces
buddy:
- optimize block management
dma-buf:
- heaps: Create heap per CMA reserved location
- improve userspace documentation
dp:
- add POST_LT_ADJ_REQ training sequence
- DPCD dSC quirk for synaptics panamera devices
- helpers to query branch DSC max throughput
ttm:
- Rename ttm_bo_put to ttm_bo_fini
- allow page protection flags on risc-v
- rework pipelined eviction fence handling
amdgpu:
- enable amdgpu by default for SI/CI dGPUs
- enable DC by default on SI
- refactor CIK/SI enablement
- add ABM KMS property
- Re-enable DM idle optimizations
- DC Analog encoders support
- Powerplay fixes for fiji/iceland
- Enable DC on bonaire by default
- HMM cleanup
- Add new RAS framework
- DML2.1 updates
- YCbCr420 fixes
- DC FP fixes
- DMUB fixes
- LTTPR fixes
- DTBCLK fixes
- DMU cursor offload handling
- Userq validation improvements
- Unify shutdown callback handling
- Suspend improvements
- Power limit code cleanup
- SR-IOV fixes
- AUX backlight fixes
- DCN 3.5 fixes
- HDMI compliance fixes
- DCN 4.0.1 cursor updates
- DCN interrupt fix
- DC KMS full update improvements
- Add additional HDCP traces
- DCN 3.2 fixes
- DP MST fixes
- Add support for new SR-IOV mailbox interface
- UQ reset support
- HDP flush rework
- VCE1 support
amdkfd:
- HMM cleanups
- Relax checks on save area overallocations
- Fix GPU mappings after prefetch
radeon:
- refactor CIK/SI enablement.
xe:
- Initial Xe3P support
- panic support on VRAM for display
- fix stolen size check
- Loosen used tracking restriction
- New SR-IOV debugfs structure and debugfs updates
- Hide the GPU madvise flag behind a VM_BIND flag
- Always expose VRAM provisioning data on discrete GPUs
- Allow VRAM mappings for userptr when used with SVM
- Allow pinning of p2p dma-buf
- Use per-tile debugfs where appropriate
- Add documentation for Execution Queues
- PF improvements
- VF migration recovery redesign work
- User / Kernel VRAM partitioning
- Update Tile-based messages
- Allow configfs to disable specific GT types
- VF provisioning and migration improvements
- use SVM range helpers in PT layer
- Initial CRI support
- access VF registers using dedicated MMIO view
- limit number of jobs per exec queue
- add sriov_admin sysfs tree
- more crescent island specific support
- debugfs residency counter
- SRIOV migration work
- runtime registers for GFX 35
i915:
- add initial Xe3p_LPD display version 35 support
- Enable LNL+ content adaptive sharpness filter
- Use optimized VRR guardband
- Enable Xe3p LT PHY
- enable FBC support for Xe3p_LPD display
- add display 30.02 firmware support
- refactor SKL+ watermark latency setup
- refactor fbdev handling
- call i915/xe runtime PM via function pointers
- refactor i915/xe stolen memory/display interfaces
- use display version instead of gfx version in display code
- extend i915_display_info with Type-C port details
- lots of display cleanups/refactorings
- set O_LARGEFILE in __create_shmem
- skuip guc communication warning on reset
- fix time conversions
- defeature DRRS on LNL+
- refactor intel_frontbuffer split between i915/xe/display
- convert inteL_rom interfaces to struct drm_device
- unify display register polling interfaces
- aovid lock inversion when pinning to GGTT on CHV/BXT+VTD
panel:
- Add KD116N3730A08/A12, chromebook mt8189
- JT101TM023, LQ079L1SX01,
- GLD070WX3-SL01 MIPI DSI
- Samsung LTL106AL0, Samsung LTL106AL01
- Raystar RFF500F-AWH-DNN
- Winstar WF70A8SYJHLNGA,
- Wanchanglong w552946aaa
- Samsung SOFEF00
- Lenovo X13s panel.
- ilitek-ili9881c : add rpi 5" support
- visionx-rm69299 - add backlight support
- edp - support AUI B116XAN02.0
bridge:
- improve ref counting
- ti-sn65dsi86 - add support for DP mode with HPD
- synopsis: support CEC, init timer with correct freq
- ASL CS5263 DP-to-HDMI bridge support
nova-core:
- introduce bitfield! macro
- introduce safe integer converters
- GSP inits to fully booted state on Ampere
- Use more future-proof register for GPU identification
nova-drm:
- select NOVA_CORE
- 64-bit only
nouveau:
- improve reclocking on tegra 186+
- add large page and compression support
msm:
- GPU:
- Gen8 support: A840 (Kaanapali) and X2-85 (Glymur)
- A612 support
- MDSS:
- Added support for Glymur and QCS8300 platforms
- DPU:
- Enabled Quad-Pipe support, unlocking higher resolutions support
- Added support for Glymur platform
- Documented DPU on QCS8300 platform as supported
- DisplayPort:
- Added support for Glymur platform
- Added support lame remapping inside DP block
- Documented DisplayPort controller on QCS8300 and SM6150/QCS615 as
supported
tegra:
- NVJPG driver
panfrost:
- display JM contexts over debugfs
- export JM contexts to userspace
- improve error and job handling
panthor:
- support custom ASN_HASH for mt8196
- support mali-G1 GPU
- flush shmem write before mapping buffers uncached
- make timeout per-queue instead of per-job
mediatek:
- MT8195/88 HDMIv2/DDCv2 support
rockchip:
- dsi: add support for RK3368
amdxdna:
- enhance runtime PM
- last hardware error reading uapi
- support firmware debug output
- add resource and telemetry data uapi
- preemption support
imx:
- add driver for HDMI TX Parallel audio interface
ivpu:
- add support for user-managed preemption buffer
- add userptr support
- update JSM firware API to 3.33.0
- add better alloc/free warnings
- fix page fault in unbind all bos
- rework bind/unbind of imported buffers
- enable MCA ECC signalling
- split fw runtime and global memory buffers
- add fdinfo memory statistics
tidss:
- convert to drm logging
- logging cleanup
ast:
- refactor generation init paths
- add per chip generation detect_tx_chip
- set quirks for each chip model
atmel-hlcdc:
- set LCDC_ATTRE register in plane disable
- set correct values for plane scaler
solomon:
- use drm helper for get_modes and move_valid
sitronix:
- fix output position when clearing screens
qaic:
- support dma-buf exports
- support new firmware's READ_DATA implementation
- sahara AIC200 image table update
- add sysfs support
- add coredump support
- add uevents support
- PM support
sun4i:
- layer refactors to decouple plane from output
- improve DE33 support
vc4:
- switch to generic CEC helpers
komeda:
- use drm_ logging functions
vkms:
- configfs support for display configuration
vgem:
- fix fence timer deadlock
etnaviv:
- add HWDB entry for GC8000 Nano Ultra VIP r6205
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEEKbZHaGwW9KfbeusDHTzWXnEhr4FAmkv4wAACgkQDHTzWXnE
hr7GNw/9HITakN5B4DXKZg+Rz38WjsAsSbw1HdgoQR2NMYRw4oVXx3iAVTVLGbvV
vvMCpgWtFE85IEtg96KWynpFjSHOx7xtXe/lckZ5s+VMi00K4c+wMmvFqKwv1Aul
9VqPTe+28xNAzk+gglOtNjd4DRVuXQIhPMQdTELork32jQCteJorod+sZnM+RNBG
0PiiYAKc0VwsPwklHQF7+dBt1yu+8r3ZH5Fc7SjHTuf8Q45Dpayd0HjatSgt9ycn
s8uxAAY41iDLHnj4LwyYtBFYoMH65VqOAzDeaeVdM9xA/3KbIl9of8QnBpvYWhaR
9dnh2Ceve3SjOufOBwK2p6H0RXu8Xo3OLnmw7+u0KwgvbmeBTILp8kgNQuk9beIp
wvkjwpFsDnfo1rKevFMz4R+3U9W3NXP1fjU2tvmnD1lfaby2o3f3AdboaYPbX/xK
YGh79KDsuAkKH4HdMJNg/FpJWIdSKaaWoFBzZqWBZgZBEPOquR8MHUAaLkVJYK/1
wBcJZceQpHRssldsg3bBZxGHK811J46QlvMWaID6TSGIoZ8GC6+1Zt0+bwzxMyUR
49jwRvOWbtYJmZEcR2JF+k1KteMXAp+bXIYqx9e+69c5Guz4w7JW7OLQFVvXszZW
aObzfrkObPpCGYDS1r85/6vT8yqwVUWtwfPvAeDH9130sHBP8Xw=
=gpnQ
-----END PGP SIGNATURE-----
Merge tag 'drm-next-2025-12-03' of https://gitlab.freedesktop.org/drm/kernel
Pull drm updates from Dave Airlie:
"There was a rather late merge of a new color pipeline feature, that
some userspace projects are blocked on, and has seen a lot of work in
amdgpu. This should have seen some time in -next. There is additional
support for this for Intel, that if it arrives in the next day or two
I'll pass it on in another pull request and you can decide if you want
to take it.
Highlights:
- Arm Ethos NPU accelerator driver
- new DRM color pipeline support
- amdgpu will now run discrete SI/CIK cards instead of radeon, which
enables vulkan support in userspace
- msm gets gen8 gpu support
- initial Xe3P support in xe
Full detail summary:
New driver:
- Arm Ethos-U65/U85 accel driver
Core:
- support the drm color pipeline in vkms/amdgfx
- add support for drm colorop pipeline
- add COLOR PIPELINE plane property
- add DRM_CLIENT_CAP_PLANE_COLOR_PIPELINE
- throttle dirty worker with vblank
- use drm_for_each_bridge_in_chain_scoped in drm's bridge code
- Ensure drm_client_modeset tests are enabled in UML
- add simulated vblank interrupt - use in drivers
- dumb buffer sizing helper
- move freeing of drm client memory to driver
- crtc sharpness strength property
- stop using system_wq in scheduler/drivers
- support emergency restore in drm-client
Rust:
- make slice::as_flattened usable on all supported rustc
- add FromBytes::from_bytes_prefix() method
- remove redundant device ptr from Rust GEM object
- Change how AlwaysRefCounted is implemented for GEM objects
gpuvm:
- Add deferred vm_bo cleanup to GPUVM (for rust)
atomic:
- cleanup and improve state handling interfaces
buddy:
- optimize block management
dma-buf:
- heaps: Create heap per CMA reserved location
- improve userspace documentation
dp:
- add POST_LT_ADJ_REQ training sequence
- DPCD dSC quirk for synaptics panamera devices
- helpers to query branch DSC max throughput
ttm:
- Rename ttm_bo_put to ttm_bo_fini
- allow page protection flags on risc-v
- rework pipelined eviction fence handling
amdgpu:
- enable amdgpu by default for SI/CI dGPUs
- enable DC by default on SI
- refactor CIK/SI enablement
- add ABM KMS property
- Re-enable DM idle optimizations
- DC Analog encoders support
- Powerplay fixes for fiji/iceland
- Enable DC on bonaire by default
- HMM cleanup
- Add new RAS framework
- DML2.1 updates
- YCbCr420 fixes
- DC FP fixes
- DMUB fixes
- LTTPR fixes
- DTBCLK fixes
- DMU cursor offload handling
- Userq validation improvements
- Unify shutdown callback handling
- Suspend improvements
- Power limit code cleanup
- SR-IOV fixes
- AUX backlight fixes
- DCN 3.5 fixes
- HDMI compliance fixes
- DCN 4.0.1 cursor updates
- DCN interrupt fix
- DC KMS full update improvements
- Add additional HDCP traces
- DCN 3.2 fixes
- DP MST fixes
- Add support for new SR-IOV mailbox interface
- UQ reset support
- HDP flush rework
- VCE1 support
amdkfd:
- HMM cleanups
- Relax checks on save area overallocations
- Fix GPU mappings after prefetch
radeon:
- refactor CIK/SI enablement
xe:
- Initial Xe3P support
- panic support on VRAM for display
- fix stolen size check
- Loosen used tracking restriction
- New SR-IOV debugfs structure and debugfs updates
- Hide the GPU madvise flag behind a VM_BIND flag
- Always expose VRAM provisioning data on discrete GPUs
- Allow VRAM mappings for userptr when used with SVM
- Allow pinning of p2p dma-buf
- Use per-tile debugfs where appropriate
- Add documentation for Execution Queues
- PF improvements
- VF migration recovery redesign work
- User / Kernel VRAM partitioning
- Update Tile-based messages
- Allow configfs to disable specific GT types
- VF provisioning and migration improvements
- use SVM range helpers in PT layer
- Initial CRI support
- access VF registers using dedicated MMIO view
- limit number of jobs per exec queue
- add sriov_admin sysfs tree
- more crescent island specific support
- debugfs residency counter
- SRIOV migration work
- runtime registers for GFX 35
i915:
- add initial Xe3p_LPD display version 35 support
- Enable LNL+ content adaptive sharpness filter
- Use optimized VRR guardband
- Enable Xe3p LT PHY
- enable FBC support for Xe3p_LPD display
- add display 30.02 firmware support
- refactor SKL+ watermark latency setup
- refactor fbdev handling
- call i915/xe runtime PM via function pointers
- refactor i915/xe stolen memory/display interfaces
- use display version instead of gfx version in display code
- extend i915_display_info with Type-C port details
- lots of display cleanups/refactorings
- set O_LARGEFILE in __create_shmem
- skuip guc communication warning on reset
- fix time conversions
- defeature DRRS on LNL+
- refactor intel_frontbuffer split between i915/xe/display
- convert inteL_rom interfaces to struct drm_device
- unify display register polling interfaces
- aovid lock inversion when pinning to GGTT on CHV/BXT+VTD
panel:
- Add KD116N3730A08/A12, chromebook mt8189
- JT101TM023, LQ079L1SX01,
- GLD070WX3-SL01 MIPI DSI
- Samsung LTL106AL0, Samsung LTL106AL01
- Raystar RFF500F-AWH-DNN
- Winstar WF70A8SYJHLNGA
- Wanchanglong w552946aaa
- Samsung SOFEF00
- Lenovo X13s panel
- ilitek-ili9881c - add rpi 5" support
- visionx-rm69299 - add backlight support
- edp - support AUI B116XAN02.0
bridge:
- improve ref counting
- ti-sn65dsi86 - add support for DP mode with HPD
- synopsis: support CEC, init timer with correct freq
- ASL CS5263 DP-to-HDMI bridge support
nova-core:
- introduce bitfield! macro
- introduce safe integer converters
- GSP inits to fully booted state on Ampere
- Use more future-proof register for GPU identification
nova-drm:
- select NOVA_CORE
- 64-bit only
nouveau:
- improve reclocking on tegra 186+
- add large page and compression support
msm:
- GPU:
- Gen8 support: A840 (Kaanapali) and X2-85 (Glymur)
- A612 support
- MDSS:
- Added support for Glymur and QCS8300 platforms
- DPU:
- Enabled Quad-Pipe support, unlocking higher resolutions support
- Added support for Glymur platform
- Documented DPU on QCS8300 platform as supported
- DisplayPort:
- Added support for Glymur platform
- Added support lame remapping inside DP block
- Documented DisplayPort controller on QCS8300 and SM6150/QCS615
as supported
tegra:
- NVJPG driver
panfrost:
- display JM contexts over debugfs
- export JM contexts to userspace
- improve error and job handling
panthor:
- support custom ASN_HASH for mt8196
- support mali-G1 GPU
- flush shmem write before mapping buffers uncached
- make timeout per-queue instead of per-job
mediatek:
- MT8195/88 HDMIv2/DDCv2 support
rockchip:
- dsi: add support for RK3368
amdxdna:
- enhance runtime PM
- last hardware error reading uapi
- support firmware debug output
- add resource and telemetry data uapi
- preemption support
imx:
- add driver for HDMI TX Parallel audio interface
ivpu:
- add support for user-managed preemption buffer
- add userptr support
- update JSM firware API to 3.33.0
- add better alloc/free warnings
- fix page fault in unbind all bos
- rework bind/unbind of imported buffers
- enable MCA ECC signalling
- split fw runtime and global memory buffers
- add fdinfo memory statistics
tidss:
- convert to drm logging
- logging cleanup
ast:
- refactor generation init paths
- add per chip generation detect_tx_chip
- set quirks for each chip model
atmel-hlcdc:
- set LCDC_ATTRE register in plane disable
- set correct values for plane scaler
solomon:
- use drm helper for get_modes and move_valid
sitronix:
- fix output position when clearing screens
qaic:
- support dma-buf exports
- support new firmware's READ_DATA implementation
- sahara AIC200 image table update
- add sysfs support
- add coredump support
- add uevents support
- PM support
sun4i:
- layer refactors to decouple plane from output
- improve DE33 support
vc4:
- switch to generic CEC helpers
komeda:
- use drm_ logging functions
vkms:
- configfs support for display configuration
vgem:
- fix fence timer deadlock
etnaviv:
- add HWDB entry for GC8000 Nano Ultra VIP r6205"
* tag 'drm-next-2025-12-03' of https://gitlab.freedesktop.org/drm/kernel: (1869 commits)
Revert "drm/amd: Skip power ungate during suspend for VPE"
drm/amdgpu: use common defines for HUB faults
drm/amdgpu/gmc12: add amdgpu_vm_handle_fault() handling
drm/amdgpu/gmc11: add amdgpu_vm_handle_fault() handling
drm/amdgpu: use static ids for ACP platform devs
drm/amdgpu/sdma6: Update SDMA 6.0.3 FW version to include UMQ protected-fence fix
drm/amdgpu: Forward VMID reservation errors
drm/amdgpu/gmc8: Delegate VM faults to soft IRQ handler ring
drm/amdgpu/gmc7: Delegate VM faults to soft IRQ handler ring
drm/amdgpu/gmc6: Delegate VM faults to soft IRQ handler ring
drm/amdgpu/gmc6: Cache VM fault info
drm/amdgpu/gmc6: Don't print MC client as it's unknown
drm/amdgpu/cz_ih: Enable soft IRQ handler ring
drm/amdgpu/tonga_ih: Enable soft IRQ handler ring
drm/amdgpu/iceland_ih: Enable soft IRQ handler ring
drm/amdgpu/cik_ih: Enable soft IRQ handler ring
drm/amdgpu/si_ih: Enable soft IRQ handler ring
drm/amd/display: fix typo in display_mode_core_structs.h
drm/amd/display: fix Smart Power OLED not working after S4
drm/amd/display: Move RGB-type check for audio sync to DCE HW sequence
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmktsoMQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpuiUD/92eivL+HmOh10o8trvxajB0yuyqfSjHHrL
g+xUbF4s9bgAg/v+Upx7sTY8jdrTcMjKov+G9T6uPvBMqVmeVdZckA1PSAKQaIX1
Zb7nS2LnO7F6JKbwpwVrrIaqVbcz8MfGIIMbN4yNNEOMCwdIVMp4fo7trPBknJNx
WddNSGUFlIF3NqSI8AflSS/pYnGm+McfBHXBpJAKipI3iquKKubHv+FX9kLp7Tn4
x27ZoCWOHglIBTJXU0mmXCVsLF8b5BA8DQcGtT62azb8+l0cRTkaHY0DFAv5BvhG
TqcjrKdmR0cGSNt+nEmFrujE3atBRl0G0kiHA80YgA1MTtYzdPaUVOUtM9k/rEem
gpiGMDpBypdxyJAyijPSaVJdfcg0psOlYbhIR4N2wbj/dq8268h+cWzXlF1spgVt
/7ygoaCmfMNbTy9rKThTjH+es787AVXUAXXaPHhIFsnCKUj8xQl4pT7XltmgYeWx
1/XD1NEJeLHHog5upAVlGX3H5tbvP1nIICxbZa9mDOJX1rwxxI7/s/RucPjbNXuY
AiaKPTfxtB9+Ihd2HrJ/76RVMkckcOBc4GIKoFfwuKDbcdLXQ5FcZCmVRoI1V9SV
KsH7JBgihLwR9XWKE1vp9+CBNe1Qlu3K4IjG/E7CNLeuDntIBu73ihqGP/DqV6Bq
RX1Dc0OyAQ==
=m22w
-----END PGP SIGNATURE-----
Merge tag 'for-6.19/block-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull block updates from Jens Axboe:
- Fix head insertion for mq-deadline, a regression from when priority
support was added
- Series simplifying and improving the ublk user copy code
- Various ublk related cleanups
- Fixup REQ_NOWAIT handling in loop/zloop, clearing NOWAIT when the
request is punted to a thread for handling
- Merge and then later revert loop dio nowait support, as it ended up
causing excessive stack usage for when the inline issue code needs to
dip back into the full file system code
- Improve auto integrity code, making it less deadlock prone
- Speedup polled IO handling, but manually managing the hctx lookups
- Fixes for blk-throttle for SSD devices
- Small series with fixes for the S390 dasd driver
- Add support for caching zones, avoiding unnecessary report zone
queries
- MD pull requests via Yu:
- fix null-ptr-dereference regression for dm-raid0
- fix IO hang for raid5 when array is broken with IO inflight
- remove legacy 1s delay to speed up system shutdown
- change maintainer's email address
- data can be lost if array is created with different lbs devices,
fix this problem and record lbs of the array in metadata
- fix rcu protection for md_thread
- fix mddev kobject lifetime regression
- enable atomic writes for md-linear
- some cleanups
- bcache updates via Coly
- remove useless discard and cache device code
- improve usage of per-cpu workqueues
- Reorganize the IO scheduler switching code, fixing some lockdep
reports as well
- Improve the block layer P2P DMA support
- Add support to the block tracing code for zoned devices
- Segment calculation improves, and memory alignment flexibility
improvements
- Set of prep and cleanups patches for ublk batching support. The
actual batching hasn't been added yet, but helps shrink down the
workload of getting that patchset ready for 6.20
- Fix for how the ps3 block driver handles segments offsets
- Improve how block plugging handles batch tag allocations
- nbd fixes for use-after-free of the configuration on device clear/put
- Set of improvements and fixes for zloop
- Add Damien as maintainer of the block zoned device code handling
- Various other fixes and cleanups
* tag 'for-6.19/block-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (162 commits)
block/rnbd: correct all kernel-doc complaints
blk-mq: use queue_hctx in blk_mq_map_queue_type
md: remove legacy 1s delay in md_notify_reboot
md/raid5: fix IO hang when array is broken with IO inflight
md: warn about updating super block failure
md/raid0: fix NULL pointer dereference in create_strip_zones() for dm-raid
sbitmap: fix all kernel-doc warnings
ublk: add helper of __ublk_fetch()
ublk: pass const pointer to ublk_queue_is_zoned()
ublk: refactor auto buffer register in ublk_dispatch_req()
ublk: add `union ublk_io_buf` with improved naming
ublk: add parameter `struct io_uring_cmd *` to ublk_prep_auto_buf_reg()
kfifo: add kfifo_alloc_node() helper for NUMA awareness
blk-mq: fix potential uaf for 'queue_hw_ctx'
blk-mq: use array manage hctx map instead of xarray
ublk: prevent invalid access with DEBUG
s390/dasd: Use scnprintf() instead of sprintf()
s390/dasd: Move device name formatting into separate function
s390/dasd: Remove unnecessary debugfs_create() return checks
s390/dasd: Fix gendisk parent after copy pair swap
...
Core & protocols
----------------
- Replace busylock at the Tx queuing layer with a lockless list. Resulting
in a 300% (4x) improvement on heavy TX workloads, sending twice the
number of packets per second, for half the cpu cycles.
- Allow constantly busy flows to migrate to a more suitable CPU/NIC
queue. Normally we perform queue re-selection when flow comes out
of idle, but under extreme circumstances the flows may be constantly
busy. Add sysctl to allow periodic rehashing even if it'd risk packet
reordering.
- Optimize the NAPI skb cache, make it larger, use it in more paths.
- Attempt returning Tx skbs to the originating CPU (like we already did
for Rx skbs).
- Various data structure layout and prefetch optimizations from Eric.
- Remove ktime_get() from the recvmsg() fast path, ktime_get() is sadly
quite expensive on recent AMD machines.
- Extend threaded NAPI polling to allow the kthread busy poll for packets.
- Make MPTCP use Rx backlog processing. This lowers the lock pressure,
improving the Rx performance.
- Support memcg accounting of MPTCP socket memory.
- Allow admin to opt sockets out of global protocol memory accounting
(using a sysctl or BPF-based policy). The global limits are a poor fit
for modern container workloads, where limits are imposed using cgroups.
- Improve heuristics for when to kick off AF_UNIX garbage collection.
- Allow users to control TCP SACK compression, and default to 33% of RTT.
- Add tcp_rcvbuf_low_rtt sysctl to let datacenter users avoid unnecessarily
aggressive rcvbuf growth and overshot when the connection RTT is low.
- Preserve skb metadata space across skb_push / skb_pull operations.
- Support for IPIP encapsulation in the nftables flowtable offload.
- Support appending IP interface information to ICMP messages (RFC 5837).
- Support setting max record size in TLS (RFC 8449).
- Remove taking rtnl_lock from RTM_GETNEIGHTBL and RTM_SETNEIGHTBL.
- Use a dedicated lock (and RCU) in MPLS, instead of rtnl_lock.
- Let users configure the number of write buffers in SMC.
- Add new struct sockaddr_unsized for sockaddr of unknown length,
from Kees.
- Some conversions away from the crypto_ahash API, from Eric Biggers.
- Some preparations for slimming down struct page.
- YAML Netlink protocol spec for WireGuard.
- Add a tool on top of YAML Netlink specs/lib for reporting commonly
computed derived statistics and summarized system state.
Driver API
----------
- Add CAN XL support to the CAN Netlink interface.
- Add uAPI for reporting PHY Mean Square Error (MSE) diagnostics,
as defined by the OPEN Alliance's "Advanced diagnostic features
for 100BASE-T1 automotive Ethernet PHYs" specification.
- Add DPLL phase-adjust-gran pin attribute (and implement it in zl3073x).
- Refactor xfrm_input lock to reduce contention when NIC offloads IPsec
and performs RSS.
- Add info to devlink params whether the current setting is the default
or a user override. Allow resetting back to default.
- Add standard device stats for PSP crypto offload.
- Leverage DSA frame broadcast to implement simple HSR frame duplication
for a lot of switches without dedicated HSR offload.
- Add uAPI defines for 1.6Tbps link modes.
Device drivers
--------------
- Add Motorcomm YT921x gigabit Ethernet switch support.
- Add MUCSE driver for N500/N210 1GbE NIC series.
- Convert drivers to support dedicated ops for timestamping control,
and away from the direct IOCTL handling. While at it support GET
operations for PHY timestamping.
- Add (and convert most drivers to) a dedicated ethtool callback
for reading the Rx ring count.
- Significant refactoring efforts in the STMMAC driver, which supports
Synopsys turn-key MAC IP integrated into a ton of SoCs.
- Ethernet high-speed NICs:
- Broadcom (bnxt):
- support PPS in/out on all pins
- Intel (100G, ice, idpf):
- ice: implement standard ethtool and timestamping stats
- i40e: support setting the max number of MAC addresses per VF
- iavf: support RSS of GTP tunnels for 5G and LTE deployments
- nVidia/Mellanox (mlx5):
- reduce downtime on interface reconfiguration
- disable being an XDP redirect target by default (same as other
drivers) to avoid wasting resources if feature is unused
- Meta (fbnic):
- add support for Linux-managed PCS on 25G, 50G, and 100G links
- Wangxun:
- support Rx descriptor merge, and Tx head writeback
- support Rx coalescing offload
- support 25G SPF and 40G QSFP modules
- Ethernet virtual:
- Google (gve):
- allow ethtool to configure rx_buf_len
- implement XDP HW RX Timestamping support for DQ descriptor format
- Microsoft vNIC (mana):
- support HW link state events
- handle hardware recovery events when probing the device
- Ethernet NICs consumer, and embedded:
- usbnet: add support for Byte Queue Limits (BQL)
- AMD (amd-xgbe):
- add device selftests
- NXP (enetc):
- add i.MX94 support
- Broadcom integrated MACs (bcmgenet, bcmasp):
- bcmasp: add support for PHY-based Wake-on-LAN
- Broadcom switches (b53):
- support port isolation
- support BCM5389/97/98 and BCM63XX ARL formats
- Lantiq/MaxLinear switches:
- support bridge FDB entries on the CPU port
- use regmap for register access
- allow user to enable/disable learning
- support Energy Efficient Ethernet
- support configuring RMII clock delays
- add tagging driver for MaxLinear GSW1xx switches
- Synopsys (stmmac):
- support using the HW clock in free running mode
- add Eswin EIC7700 support
- add Rockchip RK3506 support
- add Altera Agilex5 support
- Cadence (macb):
- cleanup and consolidate descriptor and DMA address handling
- add EyeQ5 support
- TI:
- icssg-prueth: support AF_XDP
- Airoha access points:
- add missing Ethernet stats and link state callback
- add AN7583 support
- support out-of-order Tx completion processing
- Power over Ethernet:
- pd692x0: preserve PSE configuration across reboots
- add support for TPS23881B devices
- Ethernet PHYs:
- Open Alliance OATC14 10BASE-T1S PHY cable diagnostic support
- Support 50G SerDes and 100G interfaces in Linux-managed PHYs
- micrel:
- support for non PTP SKUs of lan8814
- enable in-band auto-negotiation on lan8814
- realtek:
- cable testing support on RTL8224
- interrupt support on RTL8221B
- motorcomm: support for PHY LEDs on YT853
- microchip: support for LAN867X Rev.D0 PHYs w/ SQI and cable diag
- mscc: support for PHY LED control
- CAN drivers:
- m_can: add support for optional reset and system wake up
- remove can_change_mtu() obsoleted by core handling
- mcp251xfd: support GPIO controller functionality
- Bluetooth:
- add initial support for PASTa
- WiFi:
- split ieee80211.h file, it's way too big
- improvements in VHT radiotap reporting, S1G, Channel Switch
Announcement handling, rate tracking in mesh networks
- improve multi-radio monitor mode support, and add a cfg80211 debugfs
interface for it
- HT action frame handling on 6 GHz
- initial chanctx work towards NAN
- MU-MIMO sniffer improvements
- WiFi drivers:
- RealTek (rtw89):
- support USB devices RTL8852AU and RTL8852CU
- initial work for RTL8922DE
- improved injection support
- Intel:
- iwlwifi: new sniffer API support
- MediaTek (mt76):
- WED support for >32-bit DMA
- airoha NPU support
- regdomain improvements
- continued WiFi7/MLO work
- Qualcomm/Atheros:
- ath10k: factory test support
- ath11k: TX power insertion support
- ath12k: BSS color change support
- ath12k: statistics improvements
- brcmfmac: Acer A1 840 tablet quirk
- rtl8xxxu: 40 MHz connection fixes/support
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmkveRQACgkQMUZtbf5S
IrvY7A/+Nb0o4BxLHjPkAl1m3t3q2d0Y29B7SNkwnwEtxAV8EkNeZ3GWrdtDnTQY
MYhmc7LEzvz8/lihapr7UJkcokzSASUV54hbez5jDBKC8EEoyUk8FdWDPerwlcRI
zmCFNAVFyh9GX8i7wcrzKbDTHT5+GZLbSlGl9U5mhLsDdRlJgH7d8PJ7vWcmtLFY
XN0paDyaeHfCl8wReWNAYx4C/I0ODOvlscpO0tnAKhB0ngJbQCKY2t6tn3rOYdif
ZSQ5KwVRnJtQ4fYOFMOy9+FSCjVXtyrxF8KLxD+mqom2ZhmO00UpOMl09tqhq3uT
WnvwoHUVBt6F+iITHwg5kMgIDPUq1kpUvL4S4UbVSuUm9ZKD+4KRU2ZHRBYMx+MU
bsqmtY8/IULClUoRz+tZhltA8eb0NEqNZE2JPOFDiJHn1YiCCkFwxibhir893oM3
sB7x65D7LQI2ty2BBGVGYnwYDPtyaxOA/s3WTwPvLEi3+Y/TGNIIrS9lBLA4U+Yr
Gi93WQGVjttMmVyaHgXBUGmi3L52hvolm0AZ8zSRGrnIEpecjhly2KfYuaOzuxXC
IHEQ6AFLdRh6JzafXGb/mQwGCHNmhwsY8A49i94fakWQamaL/L6A+1dyPu4LXMqi
NwqCmlVb/LKGlfNG+V4wT27srJ+yBA2Vk3tpR1sZQQytFh0LKHI=
=UoDR
-----END PGP SIGNATURE-----
Merge tag 'net-next-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"Core & protocols:
- Replace busylock at the Tx queuing layer with a lockless list.
Resulting in a 300% (4x) improvement on heavy TX workloads, sending
twice the number of packets per second, for half the cpu cycles.
- Allow constantly busy flows to migrate to a more suitable CPU/NIC
queue.
Normally we perform queue re-selection when flow comes out of idle,
but under extreme circumstances the flows may be constantly busy.
Add sysctl to allow periodic rehashing even if it'd risk packet
reordering.
- Optimize the NAPI skb cache, make it larger, use it in more paths.
- Attempt returning Tx skbs to the originating CPU (like we already
did for Rx skbs).
- Various data structure layout and prefetch optimizations from Eric.
- Remove ktime_get() from the recvmsg() fast path, ktime_get() is
sadly quite expensive on recent AMD machines.
- Extend threaded NAPI polling to allow the kthread busy poll for
packets.
- Make MPTCP use Rx backlog processing. This lowers the lock
pressure, improving the Rx performance.
- Support memcg accounting of MPTCP socket memory.
- Allow admin to opt sockets out of global protocol memory accounting
(using a sysctl or BPF-based policy). The global limits are a poor
fit for modern container workloads, where limits are imposed using
cgroups.
- Improve heuristics for when to kick off AF_UNIX garbage collection.
- Allow users to control TCP SACK compression, and default to 33% of
RTT.
- Add tcp_rcvbuf_low_rtt sysctl to let datacenter users avoid
unnecessarily aggressive rcvbuf growth and overshot when the
connection RTT is low.
- Preserve skb metadata space across skb_push / skb_pull operations.
- Support for IPIP encapsulation in the nftables flowtable offload.
- Support appending IP interface information to ICMP messages (RFC
5837).
- Support setting max record size in TLS (RFC 8449).
- Remove taking rtnl_lock from RTM_GETNEIGHTBL and RTM_SETNEIGHTBL.
- Use a dedicated lock (and RCU) in MPLS, instead of rtnl_lock.
- Let users configure the number of write buffers in SMC.
- Add new struct sockaddr_unsized for sockaddr of unknown length,
from Kees.
- Some conversions away from the crypto_ahash API, from Eric Biggers.
- Some preparations for slimming down struct page.
- YAML Netlink protocol spec for WireGuard.
- Add a tool on top of YAML Netlink specs/lib for reporting commonly
computed derived statistics and summarized system state.
Driver API:
- Add CAN XL support to the CAN Netlink interface.
- Add uAPI for reporting PHY Mean Square Error (MSE) diagnostics, as
defined by the OPEN Alliance's "Advanced diagnostic features for
100BASE-T1 automotive Ethernet PHYs" specification.
- Add DPLL phase-adjust-gran pin attribute (and implement it in
zl3073x).
- Refactor xfrm_input lock to reduce contention when NIC offloads
IPsec and performs RSS.
- Add info to devlink params whether the current setting is the
default or a user override. Allow resetting back to default.
- Add standard device stats for PSP crypto offload.
- Leverage DSA frame broadcast to implement simple HSR frame
duplication for a lot of switches without dedicated HSR offload.
- Add uAPI defines for 1.6Tbps link modes.
Device drivers:
- Add Motorcomm YT921x gigabit Ethernet switch support.
- Add MUCSE driver for N500/N210 1GbE NIC series.
- Convert drivers to support dedicated ops for timestamping control,
and away from the direct IOCTL handling. While at it support GET
operations for PHY timestamping.
- Add (and convert most drivers to) a dedicated ethtool callback for
reading the Rx ring count.
- Significant refactoring efforts in the STMMAC driver, which
supports Synopsys turn-key MAC IP integrated into a ton of SoCs.
- Ethernet high-speed NICs:
- Broadcom (bnxt):
- support PPS in/out on all pins
- Intel (100G, ice, idpf):
- ice: implement standard ethtool and timestamping stats
- i40e: support setting the max number of MAC addresses per VF
- iavf: support RSS of GTP tunnels for 5G and LTE deployments
- nVidia/Mellanox (mlx5):
- reduce downtime on interface reconfiguration
- disable being an XDP redirect target by default (same as
other drivers) to avoid wasting resources if feature is
unused
- Meta (fbnic):
- add support for Linux-managed PCS on 25G, 50G, and 100G links
- Wangxun:
- support Rx descriptor merge, and Tx head writeback
- support Rx coalescing offload
- support 25G SPF and 40G QSFP modules
- Ethernet virtual:
- Google (gve):
- allow ethtool to configure rx_buf_len
- implement XDP HW RX Timestamping support for DQ descriptor
format
- Microsoft vNIC (mana):
- support HW link state events
- handle hardware recovery events when probing the device
- Ethernet NICs consumer, and embedded:
- usbnet: add support for Byte Queue Limits (BQL)
- AMD (amd-xgbe):
- add device selftests
- NXP (enetc):
- add i.MX94 support
- Broadcom integrated MACs (bcmgenet, bcmasp):
- bcmasp: add support for PHY-based Wake-on-LAN
- Broadcom switches (b53):
- support port isolation
- support BCM5389/97/98 and BCM63XX ARL formats
- Lantiq/MaxLinear switches:
- support bridge FDB entries on the CPU port
- use regmap for register access
- allow user to enable/disable learning
- support Energy Efficient Ethernet
- support configuring RMII clock delays
- add tagging driver for MaxLinear GSW1xx switches
- Synopsys (stmmac):
- support using the HW clock in free running mode
- add Eswin EIC7700 support
- add Rockchip RK3506 support
- add Altera Agilex5 support
- Cadence (macb):
- cleanup and consolidate descriptor and DMA address handling
- add EyeQ5 support
- TI:
- icssg-prueth: support AF_XDP
- Airoha access points:
- add missing Ethernet stats and link state callback
- add AN7583 support
- support out-of-order Tx completion processing
- Power over Ethernet:
- pd692x0: preserve PSE configuration across reboots
- add support for TPS23881B devices
- Ethernet PHYs:
- Open Alliance OATC14 10BASE-T1S PHY cable diagnostic support
- Support 50G SerDes and 100G interfaces in Linux-managed PHYs
- micrel:
- support for non PTP SKUs of lan8814
- enable in-band auto-negotiation on lan8814
- realtek:
- cable testing support on RTL8224
- interrupt support on RTL8221B
- motorcomm: support for PHY LEDs on YT853
- microchip: support for LAN867X Rev.D0 PHYs w/ SQI and cable diag
- mscc: support for PHY LED control
- CAN drivers:
- m_can: add support for optional reset and system wake up
- remove can_change_mtu() obsoleted by core handling
- mcp251xfd: support GPIO controller functionality
- Bluetooth:
- add initial support for PASTa
- WiFi:
- split ieee80211.h file, it's way too big
- improvements in VHT radiotap reporting, S1G, Channel Switch
Announcement handling, rate tracking in mesh networks
- improve multi-radio monitor mode support, and add a cfg80211
debugfs interface for it
- HT action frame handling on 6 GHz
- initial chanctx work towards NAN
- MU-MIMO sniffer improvements
- WiFi drivers:
- RealTek (rtw89):
- support USB devices RTL8852AU and RTL8852CU
- initial work for RTL8922DE
- improved injection support
- Intel:
- iwlwifi: new sniffer API support
- MediaTek (mt76):
- WED support for >32-bit DMA
- airoha NPU support
- regdomain improvements
- continued WiFi7/MLO work
- Qualcomm/Atheros:
- ath10k: factory test support
- ath11k: TX power insertion support
- ath12k: BSS color change support
- ath12k: statistics improvements
- brcmfmac: Acer A1 840 tablet quirk
- rtl8xxxu: 40 MHz connection fixes/support"
* tag 'net-next-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1381 commits)
net: page_pool: sanitise allocation order
net: page pool: xa init with destroy on pp init
net/mlx5e: Support XDP target xmit with dummy program
net/mlx5e: Update XDP features in switch channels
selftests/tc-testing: Test CAKE scheduler when enqueue drops packets
net/sched: sch_cake: Fix incorrect qlen reduction in cake_drop
wireguard: netlink: generate netlink code
wireguard: uapi: generate header with ynl-gen
wireguard: uapi: move flag enums
wireguard: uapi: move enum wg_cmd
wireguard: netlink: add YNL specification
selftests: drv-net: Fix tolerance calculation in devlink_rate_tc_bw.py
selftests: drv-net: Fix and clarify TC bandwidth split in devlink_rate_tc_bw.py
selftests: drv-net: Set shell=True for sysfs writes in devlink_rate_tc_bw.py
selftests: drv-net: Use Iperf3Runner in devlink_rate_tc_bw.py
selftests: drv-net: introduce Iperf3Runner for measurement use cases
selftests: drv-net: Add devlink_rate_tc_bw.py to TEST_PROGS
net: ps3_gelic_net: Use napi_alloc_skb() and napi_gro_receive()
Documentation: net: dsa: mention simple HSR offload helpers
Documentation: net: dsa: mention availability of RedBox
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmktzC4ACgkQ6rmadz2v
bTpA1w/+PZ45N3y6O+NQVIpBlpnHG7DEMK7Lw19On0xVLwH+XPHz6J5PEfzjyJR1
SCbsV30qkJ1YCtgRHHf+ZCuWPWm58hY8dXYwSDyjNavdQyVGOdf17aBu9pvH45NW
K20OhwQHpCHWIfDlijjPkDdiHnYf5S7Xy6ctt/3ztF0pMDHIaghGxJymG4wULcDT
iLKnT37kwO8b2ihmw/HbcZPQYMWfHRye7X009K+wCv0dnhJ6q/Ny1m+Pg4kF92e6
ON/RY26ep2dq7LpaNWa1rI1yOgFlI7uUlVojqrAuAb+xrg+64wUDBxeijvE37EN1
s/+PuEKAR6xwz1dbY2cWAI0D633saz24UdV6kCBW9HrjHKVRQ7ZSsBF9ENkS4DTK
nowx4wOe1ZHc/6YgTktZp9LEn/0YrmQtFxjqEAJiYUgD18FrBrSjmhHpBiL+HghP
sTqy41qDQGoKtg3bRu42Co9wmNeeLsnxT8NQExCmTYQ4ufpdA/VMQux9cBVX3GBq
EchJb465+AcvvCJUiKbnHLxDsHCQz1YYytz3RqyFLgGDFZnHOE0FjwPJmM8I5kkK
gvDB3ZYdO3Halm8BZfZZBnKv5uK7myuAWwqRLgMRanuZcRgmIV1oUP5EP88HdH75
fB20vZSVcfzB17SLyhiM20ivEWodJa9VCLEw9WDOmDoml+33Pks=
=kaCJ
-----END PGP SIGNATURE-----
Merge tag 'bpf-next-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Pull bpf updates from Alexei Starovoitov:
- Convert selftests/bpf/test_tc_edt and test_tc_tunnel from .sh to
test_progs runner (Alexis Lothoré)
- Convert selftests/bpf/test_xsk to test_progs runner (Bastien
Curutchet)
- Replace bpf memory allocator with kmalloc_nolock() in
bpf_local_storage (Amery Hung), and in bpf streams and range tree
(Puranjay Mohan)
- Introduce support for indirect jumps in BPF verifier and x86 JIT
(Anton Protopopov) and arm64 JIT (Puranjay Mohan)
- Remove runqslower bpf tool (Hoyeon Lee)
- Fix corner cases in the verifier to close several syzbot reports
(Eduard Zingerman, KaFai Wan)
- Several improvements in deadlock detection in rqspinlock (Kumar
Kartikeya Dwivedi)
- Implement "jmp" mode for BPF trampoline and corresponding
DYNAMIC_FTRACE_WITH_JMP. It improves "fexit" program type performance
from 80 M/s to 136 M/s. With Steven's Ack. (Menglong Dong)
- Add ability to test non-linear skbs in BPF_PROG_TEST_RUN (Paul
Chaignon)
- Do not let BPF_PROG_TEST_RUN emit invalid GSO types to stack (Daniel
Borkmann)
- Generalize buildid reader into bpf_dynptr (Mykyta Yatsenko)
- Optimize bpf_map_update_elem() for map-in-map types (Ritesh
Oedayrajsingh Varma)
- Introduce overwrite mode for BPF ring buffer (Xu Kuohai)
* tag 'bpf-next-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (169 commits)
bpf: optimize bpf_map_update_elem() for map-in-map types
bpf: make kprobe_multi_link_prog_run always_inline
selftests/bpf: do not hardcode target rate in test_tc_edt BPF program
selftests/bpf: remove test_tc_edt.sh
selftests/bpf: integrate test_tc_edt into test_progs
selftests/bpf: rename test_tc_edt.bpf.c section to expose program type
selftests/bpf: Add success stats to rqspinlock stress test
rqspinlock: Precede non-head waiter queueing with AA check
rqspinlock: Disable spinning for trylock fallback
rqspinlock: Use trylock fallback when per-CPU rqnode is busy
rqspinlock: Perform AA checks immediately
rqspinlock: Enclose lock/unlock within lock entry acquisitions
bpf: Remove runqslower tool
selftests/bpf: Remove usage of lsm/file_alloc_security in selftest
bpf: Disable file_alloc_security hook
bpf: check for insn arrays in check_ptr_alignment
bpf: force BPF_F_RDONLY_PROG on insn array creation
bpf: Fix exclusive map memory leak
selftests/bpf: Make CS length configurable for rqspinlock stress test
selftests/bpf: Add lock wait time stats to rqspinlock stress test
...
Toolchain and infrastructure:
- Add support for 'syn'.
Syn is a parsing library for parsing a stream of Rust tokens into a
syntax tree of Rust source code.
Currently this library is geared toward use in Rust procedural
macros, but contains some APIs that may be useful more generally.
'syn' allows us to greatly simplify writing complex macros such as
'pin-init' (Benno has already prepared the 'syn'-based version). We
will use it in the 'macros' crate too.
'syn' is the most downloaded Rust crate (according to crates.io), and
it is also used by the Rust compiler itself. While the amount of code
is substantial, there should not be many updates needed for these
crates, and even if there are, they should not be too big, e.g. +7k
-3k lines across the 3 crates in the last year.
'syn' requires two smaller dependencies: 'quote' and 'proc-macro2'.
I only modified their code to remove a third dependency
('unicode-ident') and to add the SPDX identifiers. The code can be
easily verified to exactly match upstream with the provided scripts.
They are all licensed under "Apache-2.0 OR MIT", like the other
vendored 'alloc' crate we had for a while.
Please see the merge commit with the cover letter for more context.
- Allow 'unreachable_pub' and 'clippy::disallowed_names' for doctests.
Examples (i.e. doctests) may want to do things like show public items
and use names such as 'foo'.
Nevertheless, we still try to keep examples as close to real code as
possible (this is part of why running Clippy on doctests is important
for us, e.g. for safety comments, which userspace Rust does not
support yet but we are stricter).
'kernel' crate:
- Replace our custom 'CStr' type with 'core::ffi::CStr'.
Using the standard library type reduces our custom code footprint,
and we retain needed custom functionality through an extension trait
and a new 'fmt!' macro which replaces the previous 'core' import.
This started in 6.17 and continued in 6.18, and we finally land the
replacement now. This required quite some stamina from Tamir, who
split the changes in steps to prepare for the flag day change here.
- Replace 'kernel::c_str!' with C string literals.
C string literals were added in Rust 1.77, which produce '&CStr's
(the 'core' one), so now we can write:
c"hi"
instead of:
c_str!("hi")
- Add 'num' module for numerical features.
It includes the 'Integer' trait, implemented for all primitive
integer types.
It also includes the 'Bounded' integer wrapping type: an integer
value that requires only the 'N' less significant bits of the wrapped
type to be encoded:
// An unsigned 8-bit integer, of which only the 4 LSBs are used.
let v = Bounded::<u8, 4>:🆕:<15>();
assert_eq!(v.get(), 15);
'Bounded' is useful to e.g. enforce guarantees when working with
bitfields that have an arbitrary number of bits.
Values can be constructed from simple non-constant expressions or,
for more complex ones, validated at runtime.
'Bounded' also comes with comparison and arithmetic operations (with
both their backing type and other 'Bounded's with a compatible
backing type), casts to change the backing type, extending/shrinking
and infallible/fallible conversions from/to primitives as applicable.
- 'rbtree' module: add immutable cursor ('Cursor').
It enables to use just an immutable tree reference where appropriate.
The existing fully-featured mutable cursor is renamed to 'CursorMut'.
kallsyms:
- Fix wrong "big" kernel symbol type read from procfs.
'pin-init' crate:
- A couple minor fixes (Benno asked me to pick these patches up for
him this cycle).
Documentation:
- Quick Start guide: add Debian 13 (Trixie).
Debian Stable is now able to build Linux, since Debian 13 (released
2025-08-09) packages Rust 1.85.0, which is recent enough.
We are planning to propose that the minimum supported Rust version in
Linux follows Debian Stable releases, with Debian 13 being the first
one we upgrade to, i.e. Rust 1.85.
MAINTAINERS:
- Add entry for the new 'num' module.
- Remove Alex as Rust maintainer: he hasn't had the time to contribute
for a few years now, so it is a no-op change in practice.
And a few other cleanups and improvements.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEPjU5OPd5QIZ9jqqOGXyLc2htIW0FAmks0WkACgkQGXyLc2ht
IW3QQg/+LpBmrz0ZSKH24kcX3x/hpfA2Erd4cmn+qjXev9RSAM1bt9xf3dsAhhFd
BStUpf0aglHOSEuWAvNHqEb5+yu+6qy6EFXXqH0ASexHK93t77Jbztjtf3SMlykT
N/lSJ+LWw2WiRT0NRWoTfKaEWzZQ8j9fi9Jb/IGNZGdNMryisVUYWqzLwNupPuK+
RMcEitHdO2NWjyodk2GGRyYQ7+XxQgbXZoxtgeubPSrrmGuGTXV42RlQKC2KHPx3
gz6CwcO3Xd0bGHHSgP32QDtGRJtniO8iXBKxiooT+ys+M83fTKbwNrIrW3tHdheY
765qsd/NvUmAkcgTCoLqj5biU6LCsepyimNg1vf4pYFohBoTaGeN+UqzbXBrSjy2
pmrgxwMRVHsYz+zoSKAVKJl7ASba5BXFdI4Whgfqwwc9So/X7uyujIYXGbRoznCV
W5vu7OUboBy26NvcsPrf6BqWcsJEpGV/M4z2UBRjAoJTRGQMcm/ckuo/GfYm3yW+
bUW62UmVCdY5crpo7XPH/G4ZGBR/k3p9dLVt8OJxEoTlfw4KDE5BszJoXmejZqdi
9LEMhzTWwoFp9NspQuEGdYdfGRlfG6XXqrwGZtQI+dlc4RvFEgBBu2Lxotq+Ods0
EfCVCJQjWmyCodVdJ/QqbCRFuXtOFLr/hPdWnvlrRxVkPtF2CDw=
=9nM+
-----END PGP SIGNATURE-----
Merge tag 'rust-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux
Pull Rust updates from Miguel Ojeda:
"Toolchain and infrastructure:
- Add support for 'syn'.
Syn is a parsing library for parsing a stream of Rust tokens into a
syntax tree of Rust source code.
Currently this library is geared toward use in Rust procedural
macros, but contains some APIs that may be useful more generally.
'syn' allows us to greatly simplify writing complex macros such as
'pin-init' (Benno has already prepared the 'syn'-based version). We
will use it in the 'macros' crate too.
'syn' is the most downloaded Rust crate (according to crates.io),
and it is also used by the Rust compiler itself. While the amount
of code is substantial, there should not be many updates needed for
these crates, and even if there are, they should not be too big,
e.g. +7k -3k lines across the 3 crates in the last year.
'syn' requires two smaller dependencies: 'quote' and 'proc-macro2'.
I only modified their code to remove a third dependency
('unicode-ident') and to add the SPDX identifiers. The code can be
easily verified to exactly match upstream with the provided
scripts.
They are all licensed under "Apache-2.0 OR MIT", like the other
vendored 'alloc' crate we had for a while.
Please see the merge commit with the cover letter for more context.
- Allow 'unreachable_pub' and 'clippy::disallowed_names' for
doctests.
Examples (i.e. doctests) may want to do things like show public
items and use names such as 'foo'.
Nevertheless, we still try to keep examples as close to real code
as possible (this is part of why running Clippy on doctests is
important for us, e.g. for safety comments, which userspace Rust
does not support yet but we are stricter).
'kernel' crate:
- Replace our custom 'CStr' type with 'core::ffi::CStr'.
Using the standard library type reduces our custom code footprint,
and we retain needed custom functionality through an extension
trait and a new 'fmt!' macro which replaces the previous 'core'
import.
This started in 6.17 and continued in 6.18, and we finally land the
replacement now. This required quite some stamina from Tamir, who
split the changes in steps to prepare for the flag day change here.
- Replace 'kernel::c_str!' with C string literals.
C string literals were added in Rust 1.77, which produce '&CStr's
(the 'core' one), so now we can write:
c"hi"
instead of:
c_str!("hi")
- Add 'num' module for numerical features.
It includes the 'Integer' trait, implemented for all primitive
integer types.
It also includes the 'Bounded' integer wrapping type: an integer
value that requires only the 'N' least significant bits of the
wrapped type to be encoded:
// An unsigned 8-bit integer, of which only the 4 LSBs are used.
let v = Bounded::<u8, 4>:🆕:<15>();
assert_eq!(v.get(), 15);
'Bounded' is useful to e.g. enforce guarantees when working with
bitfields that have an arbitrary number of bits.
Values can also be constructed from simple non-constant expressions
or, for more complex ones, validated at runtime.
'Bounded' also comes with comparison and arithmetic operations
(with both their backing type and other 'Bounded's with a
compatible backing type), casts to change the backing type,
extending/shrinking and infallible/fallible conversions from/to
primitives as applicable.
- 'rbtree' module: add immutable cursor ('Cursor').
It enables to use just an immutable tree reference where
appropriate. The existing fully-featured mutable cursor is renamed
to 'CursorMut'.
kallsyms:
- Fix wrong "big" kernel symbol type read from procfs.
'pin-init' crate:
- A couple minor fixes (Benno asked me to pick these patches up for
him this cycle).
Documentation:
- Quick Start guide: add Debian 13 (Trixie).
Debian Stable is now able to build Linux, since Debian 13 (released
2025-08-09) packages Rust 1.85.0, which is recent enough.
We are planning to propose that the minimum supported Rust version
in Linux follows Debian Stable releases, with Debian 13 being the
first one we upgrade to, i.e. Rust 1.85.
MAINTAINERS:
- Add entry for the new 'num' module.
- Remove Alex as Rust maintainer: he hasn't had the time to
contribute for a few years now, so it is a no-op change in
practice.
And a few other cleanups and improvements"
* tag 'rust-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux: (53 commits)
rust: macros: support `proc-macro2`, `quote` and `syn`
rust: syn: enable support in kbuild
rust: syn: add `README.md`
rust: syn: remove `unicode-ident` dependency
rust: syn: add SPDX License Identifiers
rust: syn: import crate
rust: quote: enable support in kbuild
rust: quote: add `README.md`
rust: quote: add SPDX License Identifiers
rust: quote: import crate
rust: proc-macro2: enable support in kbuild
rust: proc-macro2: add `README.md`
rust: proc-macro2: remove `unicode_ident` dependency
rust: proc-macro2: add SPDX License Identifiers
rust: proc-macro2: import crate
rust: kbuild: support using libraries in `rustc_procmacro`
rust: kbuild: support skipping flags in `rustc_test_library`
rust: kbuild: add proc macro library support
rust: kbuild: simplify `--cfg` handling
rust: kbuild: introduce `core-flags` and `core-skip_flags`
...
-----BEGIN PGP SIGNATURE-----
iQJPBAABCAA5FiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAmkumVobFIAAAAAABAAO
bWFudTIsMi41KzEuMTEsMiwyAAoJEFKgDEdIgJTypHMQALImoj39UZiLLVMn2d4P
5droOQw+qMCfsVVY5m9Rqw3vrCsSds+7eC5Qa2GaXR480l4ZeMGyoYCr+2gDWcM3
vwTWuj8UuUOxQlMXubIFyL9r/SUtYSiqfAPK5tGtzHuNNUEHWGed9r5RrOGKZOnj
SDe6ZL5G2Kekuua1mtSZhw8qpW/nK2IuFTRPxN6O243dUGtX0wkzWr1tVRpDDzoT
UNWZ2xoE1lzb/EAviyuSgSIc5YsT50bM+fMzOUHsY/E9u8fI42N1z5B+P3ZlTvGx
Z5nJ79ZMtbF3/4o6OaUqvR9Y4zZ73TtkMoV4sjrjYg28q+n4/tqVlCsjy0Hys/dP
MlaKbLSI54+RldTWx02Yu0jcn2Artua3N0bURyEenqkhDLW9RXVfx/h29QQYPpE9
7FSlpUAQnFzFbGUGJcB02oClkxUSUcOuzVqC0n30wfdo+bVzDn19rr+Vi2bHe8TD
MsASva96n2FhRVPkm8eBxnxpaYQj9Q6G1/7B1F30Et7PLzwD9qYUvXPXceyPATnd
r3hdUiKgkPvUaDhApJRHwv/ZGnNDKujGBDc/YgBQrl3cPfG0ShGzJzWz892JjX0+
NB+O+eabHVs6tFRFNvxS2fu/ROZxrGfBzZUKoaZjZh3JEgOhlLKVlL7aRqtctFvP
Dz0QYiOqlkjho7FykCIEKXxQ
=6o9X
-----END PGP SIGNATURE-----
Merge tag 'livepatching-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching
Pull livepatching updates from Petr Mladek:
- Support both paths where tracefs is typically mounted in selftests
- Make old_sympos 0 and 1 equal. They both are valid when there is only
one symbol with the given name.
* tag 'livepatching-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching:
selftests: livepatch: use canonical ftrace path
livepatch: Match old_sympos 0 and 1 in klp_find_func()
- Improve recovery from misbehaving BPF schedulers. When a scheduler puts many
tasks with varying affinity restrictions on a shared DSQ, CPUs scanning
through tasks they cannot run can overwhelm the system, causing lockups.
Bypass mode now uses per-CPU DSQs with a load balancer to avoid this, and
hooks into the hardlockup detector to attempt recovery. Add scx_cpu0 example
scheduler to demonstrate this scenario.
- Add lockless peek operation for DSQs to reduce lock contention for schedulers
that need to query queue state during load balancing.
- Allow scx_bpf_reenqueue_local() to be called from anywhere in preparation for
deprecating cpu_acquire/release() callbacks in favor of generic BPF hooks.
- Prepare for hierarchical scheduler support: add scx_bpf_task_set_slice() and
scx_bpf_task_set_dsq_vtime() kfuncs, make scx_bpf_dsq_insert*() return bool,
and wrap kfunc args in structs for future aux__prog parameter.
- Implement cgroup_set_idle() callback to notify BPF schedulers when a cgroup's
idle state changes.
- Fix migration tasks being incorrectly downgraded from stop_sched_class to
rt_sched_class across sched_ext enable/disable. Applied late as the fix is
low risk and the bug subtle but needs stable backporting.
- Various fixes and cleanups including cgroup exit ordering, SCX_KICK_WAIT
reliability, and backward compatibility improvements.
-----BEGIN PGP SIGNATURE-----
iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaS4h1A4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGe/MAP9EZ0pLiTpmMtt6mI/11Fmi+aWfL84j1zt13cz9
W4vb4gEA9eVEH6n9xyC4nhcOk9AQwSDuCWMOzLsnhW8TbEHVTww=
=8W/B
-----END PGP SIGNATURE-----
Merge tag 'sched_ext-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext updates from Tejun Heo:
- Improve recovery from misbehaving BPF schedulers.
When a scheduler puts many tasks with varying affinity restrictions
on a shared DSQ, CPUs scanning through tasks they cannot run can
overwhelm the system, causing lockups.
Bypass mode now uses per-CPU DSQs with a load balancer to avoid this,
and hooks into the hardlockup detector to attempt recovery.
Add scx_cpu0 example scheduler to demonstrate this scenario.
- Add lockless peek operation for DSQs to reduce lock contention for
schedulers that need to query queue state during load balancing.
- Allow scx_bpf_reenqueue_local() to be called from anywhere in
preparation for deprecating cpu_acquire/release() callbacks in favor
of generic BPF hooks.
- Prepare for hierarchical scheduler support: add
scx_bpf_task_set_slice() and scx_bpf_task_set_dsq_vtime() kfuncs,
make scx_bpf_dsq_insert*() return bool, and wrap kfunc args in
structs for future aux__prog parameter.
- Implement cgroup_set_idle() callback to notify BPF schedulers when a
cgroup's idle state changes.
- Fix migration tasks being incorrectly downgraded from
stop_sched_class to rt_sched_class across sched_ext enable/disable.
Applied late as the fix is low risk and the bug subtle but needs
stable backporting.
- Various fixes and cleanups including cgroup exit ordering,
SCX_KICK_WAIT reliability, and backward compatibility improvements.
* tag 'sched_ext-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (44 commits)
sched_ext: Fix incorrect sched_class settings for per-cpu migration tasks
sched_ext: tools: Removing duplicate targets during non-cross compilation
sched_ext: Use kvfree_rcu() to release per-cpu ksyncs object
sched_ext: Pass locked CPU parameter to scx_hardlockup() and add docs
sched_ext: Update comments replacing breather with aborting mechanism
sched_ext: Implement load balancer for bypass mode
sched_ext: Factor out abbreviated dispatch dequeue into dispatch_dequeue_locked()
sched_ext: Factor out scx_dsq_list_node cursor initialization into INIT_DSQ_LIST_CURSOR
sched_ext: Add scx_cpu0 example scheduler
sched_ext: Hook up hardlockup detector
sched_ext: Make handle_lockup() propagate scx_verror() result
sched_ext: Refactor lockup handlers into handle_lockup()
sched_ext: Make scx_exit() and scx_vexit() return bool
sched_ext: Exit dispatch and move operations immediately when aborting
sched_ext: Simplify breather mechanism with scx_aborting flag
sched_ext: Use per-CPU DSQs instead of per-node global DSQs in bypass mode
sched_ext: Refactor do_enqueue_task() local and global DSQ paths
sched_ext: Use shorter slice in bypass mode
sched_ext: Mark racy bitfields to prevent adding fields that can't tolerate races
sched_ext: Minor cleanups to scx_task_iter
...
- Defer task cgroup unlink until after the dying task's final context switch
so that controllers see the cgroup properly populated until the task is
truly gone.
- cpuset cleanups and simplifications. Enforce that domain isolated CPUs
stay in root or isolated partitions and fail if isolated+nohz_full would
leave no housekeeping CPU. Fix sched/deadline root domain handling during
CPU hot-unplug and race for tasks in attaching cpusets.
- Misc fixes including memory reclaim protection documentation and selftest
KTAP conformance.
-----BEGIN PGP SIGNATURE-----
iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaS3pEQ4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGYbrAP9H0kVyWH5tK9VhjSZyqidic8NuvtmNOyhIRrg0
8S8K0wD/YG9xlh2JUyRmS4B23ggc59+9y5xM2/sctrho51Pvsgg=
=0MB+
-----END PGP SIGNATURE-----
Merge tag 'cgroup-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
- Defer task cgroup unlink until after the dying task's final context
switch so that controllers see the cgroup properly populated until
the task is truly gone
- cpuset cleanups and simplifications.
Enforce that domain isolated CPUs stay in root or isolated partitions
and fail if isolated+nohz_full would leave no housekeeping CPU. Fix
sched/deadline root domain handling during CPU hot-unplug and race
for tasks in attaching cpusets
- Misc fixes including memory reclaim protection documentation and
selftest KTAP conformance
* tag 'cgroup-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits)
cpuset: Treat cpusets in attaching as populated
sched/deadline: Walk up cpuset hierarchy to decide root domain when hot-unplug
cgroup/cpuset: Introduce cpuset_cpus_allowed_locked()
docs: cgroup: No special handling of unpopulated memcgs
docs: cgroup: Note about sibling relative reclaim protection
docs: cgroup: Explain reclaim protection target
selftests/cgroup: conform test to KTAP format output
cpuset: remove need_rebuild_sched_domains
cpuset: remove global remote_children list
cpuset: simplify node setting on error
cgroup: include missing header for struct irq_work
cgroup: Fix sleeping from invalid context warning on PREEMPT_RT
cgroup/cpuset: Globally track isolated_cpus update
cgroup/cpuset: Ensure domain isolated CPUs stay in root or isolated partition
cgroup/cpuset: Move up prstate_housekeeping_conflict() helper
cgroup/cpuset: Fail if isolated and nohz_full don't leave any housekeeping
cgroup/cpuset: Rename update_unbound_workqueue_cpumask() to update_isolation_cpumasks()
cgroup: Defer task cgroup unlink until after the task is done switching out
cgroup: Move dying_tasks cleanup from cgroup_task_release() to cgroup_task_free()
cgroup: Rename cgroup lifecycle hooks to cgroup_task_*()
...
- Rescuer affinity management: Affinity now updated only when detached using
wq_unbound_cpumask consistently. DISASSOCIATED workers also follow unbound
cpumask changes to avoid breaking CPU isolation.
- Rescuer cleanups preparing for fetching work items one by one from pool list:
Work assignment factored out, optimized to skip pwqs no longer needing
rescue, and shutdown logic simplified.
- Unused assert_rcu_or_wq_mutex_or_pool_mutex() removed.
-----BEGIN PGP SIGNATURE-----
iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaS3kYg4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGWkwAQC6zcYAbGTcEkfiDsg3GM6dCsHDk3hIXVguE7Em
mOpnVQD/QxiYHZxt/SOY5LIIX9ZvTpBYeuW7MR5bCncSBdCLTw4=
=VkvX
-----END PGP SIGNATURE-----
Merge tag 'wq-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue updates from Tejun Heo:
- Rescuer affinity management: Affinity is now updated only when
detached using wq_unbound_cpumask consistently. DISASSOCIATED workers
also follow unbound cpumask changes to avoid breaking CPU isolation
- Rescuer cleanups preparing for fetching work items one by one from
pool list: Work assignment factored out, optimized to skip pwqs no
longer needing rescue, and shutdown logic simplified
- Unused assert_rcu_or_wq_mutex_or_pool_mutex() removed
* tag 'wq-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: Don't rely on wq->rescuer to stop rescuer
workqueue: Only assign rescuer work when really needed
workqueue: Factor out assign_rescuer_work()
workqueue: Init rescuer's affinities as wq_unbound_cpumask
workqueue: Let DISASSOCIATED workers follow unbound wq cpumask changes
workqueue: Update the rescuer's affinity only when it is detached
workqueue: Remove unused assert_rcu_or_wq_mutex_or_pool_mutex
-----BEGIN PGP SIGNATURE-----
iQJPBAABCAA5FiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAmktlbUbFIAAAAAABAAO
bWFudTIsMi41KzEuMTEsMiwyAAoJEFKgDEdIgJTyevsP/1z98/wfCaSCquIq4H8S
OTqFGybGgYQt1NmMj2cGPpbAE3LJNYORT0A4tcoqOTy1Z5xbQz63rO3clSI/e7Mf
n4ZZ7NvkE40i8et1BjqtZa9dSkAv4QLYH73KrtNeuTr5tqvHo1x8FakUH6gQnb1k
QOOebvbVXnOb+rh89j1GZShrLFcCil0psjp165WHAYE/3PyFBgYGLMCgwLqS+W3H
re5Q4sl/ySXpMFF/XN1Kww48FWxy/h+YQFCxZwuWlUcXtVjqZ+BN+keb7AqaFQ7R
dC2exV2W0RBoupEJR/FWHoXrm/bDDLhzqRaMvoggLJrMJ9L6V0WdIhaFA4qzoG63
paJGFjUfmDX3dpPsAddq7kKeevCz4a2/HwFKhiBqqq4tdHuely7wZgnoFO7ovgmu
DYDCXHtpJuWZR3WJ5I/V/sJ9i9KFXhhyWcKVf13QTAFiCaA09aeSAcUWNYNaaxbn
nu6IkUxdIVnWIEBgcYH6jz1DrPGreYLYuD4bVb2gdZoP0r3tnMpG6xfSNIUueSGd
VFAKW9PJYaj7Id+jgACH6V+gQ22L600xJDdL1bPjRbGE0LD7vlz2F1MZTq3BFJFn
hUxJeOZplHX+TPophdvH4MO9VLmydWLUyJiDBP1yA8M9XZms/5s7IJJ1RYXqUCcf
qEB4L7W1+Qy1R/lzf2PU9X4R
=FnfO
-----END PGP SIGNATURE-----
Merge tag 'printk-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux
Pull printk updates from Petr Mladek:
- Allow creaing nbcon console drivers with an unsafe write_atomic()
callback that can only be called by the final nbcon_atomic_flush_unsafe().
Otherwise, the driver would rely on the kthread.
It is going to be used as the-best-effort approach for an
experimental nbcon netconsole driver, see
https://lore.kernel.org/r/20251121-nbcon-v1-2-503d17b2b4af@debian.org
Note that a safe .write_atomic() callback is supposed to work in NMI
context. But some networking drivers are not safe even in IRQ
context:
https://lore.kernel.org/r/oc46gdpmmlly5o44obvmoatfqo5bhpgv7pabpvb6sjuqioymcg@gjsma3ghoz35
In an ideal world, all networking drivers would be fixed first and
the atomic flush would be blocked only in NMI context. But it brings
the question how reliable networking drivers are when the system is
in a bad state. They might block flushing more reliable serial
consoles which are more suitable for serious debugging anyway.
- Allow to use the last 4 bytes of the printk ring buffer.
- Prevent queuing IRQ work and block printk kthreads when consoles are
suspended. Otherwise, they create non-necessary churn or even block
the suspend.
- Release console_lock() between each record in the kthread used for
legacy consoles on RT. It might significantly speed up the boot.
- Release nbcon context between each record in the atomic flush. It
prevents stalls of the related printk kthread after it has lost the
ownership in the middle of a record
- Add support for NBCON consoles into KDB
- Add %ptsP modifier for printing struct timespec64 and use it where
possible
- Misc code clean up
* tag 'printk-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux: (48 commits)
printk: Use console_is_usable on console_unblank
arch: um: kmsg_dump: Use console_is_usable
drivers: serial: kgdboc: Drop checks for CON_ENABLED and CON_BOOT
lib/vsprintf: Unify FORMAT_STATE_NUM handlers
printk: Avoid irq_work for printk_deferred() on suspend
printk: Avoid scheduling irq_work on suspend
printk: Allow printk_trigger_flush() to flush all types
tracing: Switch to use %ptSp
scsi: snic: Switch to use %ptSp
scsi: fnic: Switch to use %ptSp
s390/dasd: Switch to use %ptSp
ptp: ocp: Switch to use %ptSp
pps: Switch to use %ptSp
PCI: epf-test: Switch to use %ptSp
net: dsa: sja1105: Switch to use %ptSp
mmc: mmc_test: Switch to use %ptSp
media: av7110: Switch to use %ptSp
ipmi: Switch to use %ptSp
igb: Switch to use %ptSp
e1000e: Switch to use %ptSp
...
SRCU
----
_ Properly handle SRCU readers within IRQ disabled sections in tiny SRCU
- Preparation to reimplement RCU Tasks Trace on top of SRCU fast:
- Introduce API to expedite a grace period and test it through
rcutorture.
- Split srcu-fast in two flavours: SRCU-fast and SRCU-fast-updown.
Both are still targeted toward faster readers (without full
barriers on LOCK and UNLOCK) at the expense of heavier write side
(using full RCU grace period ordering instead of simply full
ordering) as compared to "traditional" non-fast SRCU. But those
srcu-fast flavours are going to be optimized in two different
ways:
- SRCU-fast will become the reimplementation basis for
RCU-TASK-TRACE for consolidation. Since RCU-TASK-TRACE must
be NMI safe, SRCU-fast must be as well.
- SRCU-fast-updown will be needed for uretprobes code in order
to get rid of the read-side memory barriers while still
allowing entering the reader at task level while exiting it
in a timer handler. It is considered semaphore-like in that
it can have different owners between LOCK and UNLOCK.
However it is not NMI-safe.
The actual optimizations are work in progress for the next cycle.
Only the new interfaces are added for now, along with related
torture and scalability test code.
- Create/document/debug/torture new proper initializers for RCU fast:
DEFINE_SRCU_FAST() and init_srcu_struct_fast()
This allows for using right away the proper ordering on the write
side (either full ordering or full RCU grace period ordering) without
waiting for the read side to tell which to use. Also this optimizes
the read side altogether with moving flavour debug checks under debug
config and with removing a costly RmW operation on their first call.
- Make some diagnostic functions tracing safe.
REFSCALE
-------
Add performance testing for common context synchronizations
(Preemption, IRQ, Softirq) and per-cpu increments. Those are
relevant comparisons against SRCU-fast read side APIs, especially
as they are planned to synchronize further tracing fast-path code.
MISCELLANOUS
------
- In order to prepare the layout for nohz_full work deferral to
user exit, the context tracking state must shrink the counter
of transitions to/from RCU not watching. The only possible hazard
is to trigger wrap-around more easily, delaying a bit grace periods
when that happens. This should be a rare event though. Yet add
debugging and torture code to test that assumption.
- Fix memory leak on locktorture module
- Annotate accesses in rculist_nulls.h to prevent from KCSAN warnings.
On recent discussions, we also concluded that all those WRITE_ONCE()
and READ_ONCE() on list APIs deserve appropriate comments. Something
to be expected for the next cycle.
- Provide a script to apply several configs to several commits with torture.
- Allow torture to reuse a build directory in order to save needless
rebuild time.
- Various cleanups.
-----BEGIN PGP SIGNATURE-----
iQJPBAABCAA5FiEEd76+gtGM8MbftQlOhSRUR1COjHcFAmksubgbFIAAAAAABAAO
bWFudTIsMi41KzEuMTEsMiwyAAoJEIUkVEdQjox3MJ0P/Aq+AOPWuaE70B8SxVYY
VfucjjzZwSyI30Vw4GE8skNVMOnyY2wPIrNRgmckghi9LM5luureMcks9esrdFfv
csWj3AhnBu+/Lyd4EhP8AxMH2OPIX50FUrAQaY7LAcmc4nIng1MqRm35eT776XzR
tN07+nsXjNumPcdN8JAv/VzInUf5DdMvhy+Qw0Krpvan/sUpzzpFpNUWdHsyZ2Re
UCQr//AooQLfCIIIHUYlviThqigJETZlv0NKYrD0Zaxx5MCOwwWmz5otQzacSRk2
Zgn6Bayp98s0972sOghiMRXEoSvqBSJe5eVTWcj1oNLruO3flpyREgE5Wuo8T8Bg
BHvOiYaPNC5FW+btCWOQcKcSx1mxrgsX3/9OAth+kmtPqZdJSQEeY+zqWzdN7ZlV
q5+oipsFONvWXL6+U6uZf44MG1pscrzn6QvXajjg98iaLZgY0+mkAhvtpjcfuTwZ
2CXYQKacfwftpOXgM2QQGB4QE8CKXchlma381KbTHv0rX8zfK0BnWoe/oGPz22GL
3kldmhfIC/87i8t9HHngkU6SZ+Mkvf9vC28eclnuXB1PKCRsBHePGvi3I671kbLj
hycBwnwWriULEwKYTVQ8gmdBHSANB4IpmMYflADVqLuvp+7U6W5mNjWNTcXoSwFl
NoR7AViEUNzOEoa8u6WMKoec
=XnUt
-----END PGP SIGNATURE-----
Merge tag 'rcu.release.v6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux
Pull RCU updates from Frederic Weisbecker:
"SRCU:
- Properly handle SRCU readers within IRQ disabled sections in tiny
SRCU
- Preparation to reimplement RCU Tasks Trace on top of SRCU fast:
- Introduce API to expedite a grace period and test it through
rcutorture
- Split srcu-fast in two flavours: SRCU-fast and SRCU-fast-updown.
Both are still targeted toward faster readers (without full
barriers on LOCK and UNLOCK) at the expense of heavier write
side (using full RCU grace period ordering instead of simply
full ordering) as compared to "traditional" non-fast SRCU. But
those srcu-fast flavours are going to be optimized in two
different ways:
- SRCU-fast will become the reimplementation basis for
RCU-TASK-TRACE for consolidation. Since RCU-TASK-TRACE must
be NMI safe, SRCU-fast must be as well.
- SRCU-fast-updown will be needed for uretprobes code in order
to get rid of the read-side memory barriers while still
allowing entering the reader at task level while exiting it
in a timer handler. It is considered semaphore-like in that
it can have different owners between LOCK and UNLOCK.
However it is not NMI-safe.
The actual optimizations are work in progress for the next
cycle. Only the new interfaces are added for now, along with
related torture and scalability test code.
- Create/document/debug/torture new proper initializers for RCU fast:
DEFINE_SRCU_FAST() and init_srcu_struct_fast()
This allows for using right away the proper ordering on the write
side (either full ordering or full RCU grace period ordering)
without waiting for the read side to tell which to use.
This also optimizes the read side altogether with moving flavour
debug checks under debug config and with removing a costly RmW
operation on their first call.
- Make some diagnostic functions tracing safe
Refscale:
- Add performance testing for common context synchronizations
(Preemption, IRQ, Softirq) and per-cpu increments. Those are
relevant comparisons against SRCU-fast read side APIs, especially
as they are planned to synchronize further tracing fast-path code
Miscellanous:
- In order to prepare the layout for nohz_full work deferral to user
exit, the context tracking state must shrink the counter of
transitions to/from RCU not watching. The only possible hazard is
to trigger wrap-around more easily, delaying a bit grace periods
when that happens. This should be a rare event though. Yet add
debugging and torture code to test that assumption
- Fix memory leak on locktorture module
- Annotate accesses in rculist_nulls.h to prevent from KCSAN
warnings. On recent discussions, we also concluded that all those
WRITE_ONCE() and READ_ONCE() on list APIs deserve appropriate
comments. Something to be expected for the next cycle
- Provide a script to apply several configs to several commits with
torture
- Allow torture to reuse a build directory in order to save needless
rebuild time
- Various cleanups"
* tag 'rcu.release.v6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux: (29 commits)
refscale: Add SRCU-fast-updown readers
refscale: Exercise DEFINE_STATIC_SRCU_FAST() and init_srcu_struct_fast()
rcutorture: Make srcu{,d}_torture_init() announce the SRCU type
srcu: Create an SRCU-fast-updown API
refscale: Do not disable interrupts for tests involving local_bh_enable()
refscale: Add non-atomic per-CPU increment readers
refscale: Add this_cpu_inc() readers
refscale: Add preempt_disable() readers
refscale: Add local_bh_disable() readers
refscale: Add local_irq_disable() and local_irq_save() readers
torture: Permit negative kvm.sh --kconfig numberic arguments
srcu: Add SRCU_READ_FLAVOR_FAST_UPDOWN CPP macro
rcu: Mark diagnostic functions as notrace
rcutorture: Make TREE04 use CONFIG_RCU_DYNTICKS_TORTURE
rcutorture: Remove redundant rcutorture_one_extend() from rcu_torture_one_read()
rcutorture: Permit kvm-again.sh to re-use the build directory
torture: Add kvm-series.sh to test commit/scenario combination
rcu: use WRITE_ONCE() for ->next and ->pprev of hlist_nulls
locktorture: Fix memory leak in param_set_cpumask()
doc: Update for SRCU-fast definitions and initialization
...
API:
- Rewrite memcpy_sglist from scratch.
- Add on-stack AEAD request allocation.
- Fix partial block processing in ahash.
Algorithms:
- Remove ansi_cprng.
- Remove tcrypt tests for poly1305.
- Fix EINPROGRESS processing in authenc.
- Fix double-free in zstd.
Drivers:
- Use drbg ctr helper when reseeding xilinx-trng.
- Add support for PCI device 0x115A to ccp.
- Add support of paes in caam.
- Add support for aes-xts in dthev2.
Others:
- Use likely in rhashtable lookup.
- Fix lockdep false-positive in padata by removing a helper.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEn51F/lCuNhUwmDeSxycdCkmxi6cFAmktaHwACgkQxycdCkmx
i6duthAAl4ZjsuSgt0P9ZPJXWgSH+QbNT/6fL1QzLEuzLVGn8Mt99LTQpaYU8HRh
fced8+R7UpqA/FgZTYbRKopZJVJJqhmTf2zqjbe47CroRm2Wf5UO+6ZXBsiqbMwa
6fNLilhcrq5G3DrIHepCpIQ7NM2+ucTMnPRIWP3cvzLwX0JzPtYIpYUSiVPAtkjh
9g24oPz6LR/xZfyk+wPbHOSYeqz4sSXnGJkL+Vn33AtU5KJZLum9zMP4Lleim7HP
XaNnUL/S/PYCspycrvfrnq6+YMLPw2USguttuZe0Dg0qhq/jPMyzdEkTAjcTD5LG
NZavVUbQsf6BW+YjXgaE/ybcSs6WR3ySs8aza1Ev8QqsmpbJj9xdpF9fn4RsffGR
mbhc5plJCKWzfiaparea8yY9n5vHwbOK4zoyF9P6kI5ykkoA+GmwRwTW73M9KCfa
i1R6g97O+t4Yaq9JI9GG7dkm9bxJpY+XaKouW7rqv/MX0iND1ExDYaqdcA+Xa61c
TNfdlVcGyX7Dolm2xnpvRv8EqF9NzeK4Vw1QslrdCijXfe7eJymabNKhLBlV4li0
tVfmh4vyQFgruyiR7r7AkXIKzsLZbji030UoOsQqiMW7ualBUQ0dCDbBa8J6kUcX
/vjbSmxV3LKgVgYvUBRRGIi9CJbKfs29RkS6RFtdqcq/YT4KsJU=
=DHes
-----END PGP SIGNATURE-----
Merge tag 'v6.19-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto updates from Herbert Xu:
"API:
- Rewrite memcpy_sglist from scratch
- Add on-stack AEAD request allocation
- Fix partial block processing in ahash
Algorithms:
- Remove ansi_cprng
- Remove tcrypt tests for poly1305
- Fix EINPROGRESS processing in authenc
- Fix double-free in zstd
Drivers:
- Use drbg ctr helper when reseeding xilinx-trng
- Add support for PCI device 0x115A to ccp
- Add support of paes in caam
- Add support for aes-xts in dthev2
Others:
- Use likely in rhashtable lookup
- Fix lockdep false-positive in padata by removing a helper"
* tag 'v6.19-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (71 commits)
crypto: zstd - fix double-free in per-CPU stream cleanup
crypto: ahash - Zero positive err value in ahash_update_finish
crypto: ahash - Fix crypto_ahash_import with partial block data
crypto: lib/mpi - use min() instead of min_t()
crypto: ccp - use min() instead of min_t()
hwrng: core - use min3() instead of nested min_t()
crypto: aesni - ctr_crypt() use min() instead of min_t()
crypto: drbg - Delete unused ctx from struct sdesc
crypto: testmgr - Add missing DES weak and semi-weak key tests
Revert "crypto: scatterwalk - Move skcipher walk and use it for memcpy_sglist"
crypto: scatterwalk - Fix memcpy_sglist() to always succeed
crypto: iaa - Request to add Kanchana P Sridhar to Maintainers.
crypto: tcrypt - Remove unused poly1305 support
crypto: ansi_cprng - Remove unused ansi_cprng algorithm
crypto: asymmetric_keys - fix uninitialized pointers with free attribute
KEYS: Avoid -Wflex-array-member-not-at-end warning
crypto: ccree - Correctly handle return of sg_nents_for_len
crypto: starfive - Correctly handle return of sg_nents_for_len
crypto: iaa - Fix incorrect return value in save_iaa_wq()
crypto: zstd - Remove unnecessary size_t cast
...
-----BEGIN PGP SIGNATURE-----
iIoEABYKADIWIQQdXVVFGN5XqKr1Hj7LwZzRsCrn5QUCaS896BQcem9oYXJAbGlu
dXguaWJtLmNvbQAKCRDLwZzRsCrn5RDuAQDx4fmvctP8kc9PeRjd5X/UV1ip1pPD
beMKt8ghEThQiAEAzjFJbNGUDKhfR8yWODifAvYRurU5YQJZZI9wJ8skNw0=
=3Vc4
-----END PGP SIGNATURE-----
Merge tag 'integrity-v6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity
Pull integrity updates from Mimi Zohar:
"Bug fixes:
- defer credentials checking from the bprm_check_security hook to the
bprm_creds_from_file security hook
- properly ignore IMA policy rules based on undefined SELinux labels
IMA policy rule extensions:
- extend IMA to limit including file hashes in the audit logs
(dont_audit action)
- define a new filesystem subtype policy option (fs_subtype)
Misc:
- extend IMA to support in-kernel module decompression by deferring
the IMA signature verification in kernel_read_file() to after the
kernel module is decompressed"
* tag 'integrity-v6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity:
ima: Handle error code returned by ima_filter_rule_match()
ima: Access decompressed kernel module to verify appended signature
ima: add fs_subtype condition for distinguishing FUSE instances
ima: add dont_audit action to suppress audit actions
ima: Attach CREDS_CHECK IMA hook to bprm_creds_from_file LSM hook
-----BEGIN PGP SIGNATURE-----
iQJIBAABCgAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmkuAIcUHHBhdWxAcGF1
bC1tb29yZS5jb20ACgkQ6iDy2pc3iXMWlhAAgydMY1/oYPKbP3xdJTPJU6uc4GLk
o1lRaO59FkwogA2iupU6rltKiqhPYH7lIzebMcOuRLzaHpO0sGk0anyCMjNW2Agl
P/Zu+xkF+uGnzpO3gpaOLYFbCkpT98hNbyZu33l6ftxr37+S+DWHuThUhmOnxcgQ
z7Kguq0jaruiW8oc219HjNI/VCWW0F1W/+PVjFUSZogty2K2UttsabZQFMJ8MHg6
9C2jP/f+tN2KD55u7oEA5QiucC/8BdNdyLGke4TvjhG38FG4bh71Q59LknHa5yMa
6+NeftpE76+Inb8e+ze7iNv1InRccBXurm0p6lZ/lU5nYrjRT245CleQ7nq9ppD1
hyhuGQP/fvvYExKdTl1VWXA0zGLb6+1rIn6f//MpDSbXShGj/5vK82Qo/ug1ZEGH
QEAr6g2/S6xgudl2ui5OHSDb87nDWxzNo1t9TxGPBoQ6TG3ryPcm1ccTB0tb59+3
Poej26MOuZUrTpiQl3SLnfwjN0WnkiqGX5y/Cjh9tHFVk2OzOe/VqImZk9oeFgxt
O+IEB2cUO/U0D/3tdECqiRVS/RFe3jSn/qYCzP9fQaLhD0IpqlhfJ5TXbWiQam8Y
vGjNCx17a7mwvfxCIofEANw0wu3ooajq68UjRVhTpoKka0USKvktAoFirAB8r54I
oNSBfWKl4polguo=
=3hKh
-----END PGP SIGNATURE-----
Merge tag 'audit-pr-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit
Pull audit updates from Paul Moore:
- Consolidate the loops in __audit_inode_child() to improve performance
When logging a child inode in __audit_inode_child(), we first run
through the list of recorded inodes looking for the parent and then
we repeat the search looking for a matching child entry. This pull
request consolidates both searches into one pass through the recorded
inodes, resuling in approximately a 50% reduction in audit overhead.
See the commit description for the testing details.
- Combine kmalloc()/memset() into kzalloc() in audit_krule_to_data()
- Comment fixes
* tag 'audit-pr-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
audit: merge loops in __audit_inode_child()
audit: Use kzalloc() instead of kmalloc()/memset() in audit_krule_to_data()
audit: fix comment misindentation in audit.h
Instead of manually annotating each __ex_table entry, just make the
section mergeable and store the entry size in the ELF section header.
Either way works for objtool create_fake_symbols(), this way produces
cleaner code generation.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://patch.msgid.link/b858cb7891c1ba0080e22a9c32595e6c302435e2.1764694625.git.jpoimboe@kernel.org
- Introduce and document a QoS limit on CPU exit latency during wakeup
from suspend-to-idle (Ulf Hansson)
- Add support for building libcpupower statically (Zuo An)
- Add support for sending netlink notifications to user space on energy
model updates (Changwoo Mini, Peng Fan)
- Minor improvements to the Rust OPP interface (Tamir Duberstein)
- Fixes to scope-based pointers in the OPP library (Viresh Kumar)
- Use residency threshold in polling state override decisions in the
menu cpuidle governor (Aboorva Devarajan)
- Add sanity check for exit latency and target residency in the cpufreq
core (Rafael Wysocki)
- Use this_cpu_ptr() where possible in the teo governor (Christian
Loehle)
- Rework the handling of tick wakeups in the teo cpuidle governor to
increase the likelihood of stopping the scheduler tick in the cases
when tick wakeups can be counted as non-timer ones (Rafael Wysocki)
- Fix a reverse condition in the teo cpuidle governor and drop a
misguided target residency check from it (Rafael Wysocki)
- Clean up multiple minor defects in the teo cpuidle governor (Rafael
Wysocki)
- Update header inclusion to make it follow the Include What You Use
principle (Andy Shevchenko)
- Enable MSR-based RAPL PMU support in the intel_rapl power capping
driver and arrange for using it on the Panther Lake and Wildcat Lake
processors (Kuppuswamy Sathyanarayanan)
- Add support for Nova Lake and Wildcat Lake processors to the
intel_rapl power capping driver (Kaushlendra Kumar, Srinivas
Pandruvada)
- Add OPP and bandwidth support for Tegra186 (Aaron Kling)
- Optimizations for parameter array handling in the amd-pstate cpufreq
driver (Mario Limonciello)
- Fix for mode changes with offline CPUs in the amd-pstate cpufreq
driver (Gautham Shenoy)
- Preserve freq_table_sorted across suspend/hibernate in the cpufreq
core (Zihuan Zhang)
- Adjust energy model rules for Intel hybrid platforms in the
intel_pstate cpufreq driver and improve printing of debug messages
in it (Rafael Wysocki)
- Replace deprecated strcpy() in cpufreq_unregister_governor()
(Thorsten Blum)
- Fix duplicate hyperlink target errors in the intel_pstate cpufreq
driver documentation and use :ref: directive for internal linking in
it (Swaraj Gaikwad, Bagas Sanjaya)
- Add Diamond Rapids OOB mode support to the intel_pstate cpufreq
driver (Kuppuswamy Sathyanarayanan)
- Use mutex guard for driver locking in the intel_pstate driver and
eliminate some code duplication from it (Rafael Wysocki)
- Replace udelay() with usleep_range() in ACPI cpufreq (Kaushlendra
Kumar)
- Minor improvements to various cpufreq drivers (Christian Marangi, Hal
Feng, Jie Zhan, Marco Crivellari, Miaoqian Lin, and Shuhao Fu)
- Replace snprintf() with scnprintf() in show_trace_dev_match()
(Kaushlendra Kumar)
- Fix memory allocation error handling in pm_vt_switch_required()
(Malaya Kumar Rout)
- Introduce CALL_PM_OP() macro and use it to simplify code in
generic PM operations (Kaushlendra Kumar)
- Add module param to backtrace all CPUs in the device power management
watchdog (Sergey Senozhatsky)
- Rework message printing in swsusp_save() (Rafael Wysocki)
- Make it possible to change the number of hibernation compression
threads (Xueqin Luo)
- Clarify that only cgroup1 freezer uses PM freezer (Tejun Heo)
- Add document on debugging shutdown hangs to PM documentation and
correct a mistaken configuration option in it (Mario Limonciello)
- Shut down wakeup source timer before removing the wakeup source from
the list (Kaushlendra Kumar, Rafael Wysocki)
- Introduce new PMSG_POWEROFF event for system shutdown handling with
the help of PM device callbacks (Mario Limonciello)
- Make pm_test delay interruptible by wakeup events (Riwen Lu)
- Clean up kernel-doc comment style usage in the core hibernation
code and remove unuseful comments from it (Sunday Adelodun, Rafael
Wysocki)
- Add support for handling wakeup events and aborting the suspend
process while it is syncing file systems (Samuel Wu, Rafael Wysocki)
- Add WQ_UNBOUND to pm_wq workqueue (Marco Crivellari)
- Add runtime PM wrapper macros for ACQUIRE()/ACQUIRE_ERR() and use
them in the PCI core and the ACPI TAD driver (Rafael Wysocki)
- Improve runtime PM in the ACPI TAD driver (Rafael Wysocki)
- Update pm_runtime_allow/forbid() documentation (Rafael Wysocki)
- Fix typos in runtime.c comments (Malaya Kumar Rout)
- Move governor.h from devfreq under include/linux/ and rename to
devfreq-governor.h to allow devfreq governor definitions in out
of drivers/devfreq/ (Dmitry Baryshkov)
- Use min() to improve readability in tegra30-devfreq.c (Thorsten
Blum)
- Fix potential use-after-free issue of OPP handling in
hisi_uncore_freq.c (Pengjie Zhang)
- Fix typo in DFSO_DOWNDIFFERENTIAL macro name in
governor_simpleondemand.c in devfreq (Riwen Lu)
-----BEGIN PGP SIGNATURE-----
iQFGBAABCAAwFiEEcM8Aw/RY0dgsiRUR7l+9nS/U47UFAmkp0BYSHHJqd0Byand5
c29ja2kubmV0AAoJEO5fvZ0v1OO1Pc8H/2G5d0aD/ym1a8MDTpKqn7t3/rVMHa76
YGfxXMBr1oY++r5GTJTKBxZBHmF89VH71kdyvsMidTAtHjR+iZAS1ajd2Q5VYjOF
QNMld1qgPEzAZU8WSetDrBqMr89zls05Uubo4aCoNy6rFmgRaLHh3AmIKSS9aJuo
C1eH8dRONME5I/rafkOUpFs1+/Agq1vePwPZmwVnZX9A3qI+UOhMRdU9A37kYkx9
YwfQvR2fKTIPjZ6B9f/wGXPOvdrT37d4+dWT3EABOHMkxlpAPDMvmVzZsUaXSQMr
0d9NGEjPGo33qciKJJpHqNOdDOhi90606WBBf7aaMF+GMhDX3PznOK4=
=rzXO
-----END PGP SIGNATURE-----
Merge tag 'pm-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"There are quite a few interesting things here, including new hardware
support, new features, some bug fixes and documentation updates. In
addition, there are a usual bunch of minor fixes and cleanups all
over.
In the new hardware support category, there are intel_pstate and
intel_rapl driver updates to support new processors, Panther Lake,
Wildcat Lake, Noval Lake, and Diamond Rapids in the OOB mode, OPP and
bandwidth allocation support in the tegra186 cpufreq driver, and
JH7110S SOC support in dt-platdev cpufreq.
The new features are the PM QoS CPU latency limit for suspend-to-idle,
the netlink support for the energy model management, support for
terminating system suspend via a wakeup event during the sync of file
systems, configurable number of hibernation compression threads, the
runtime PM auto-cleanup macros, and the "poweroff" PM event that is
expected to be used during system shutdown.
Bugs are mostly fixed in cpuidle governors, but there are also fixes
elsewhere, like in the amd-pstate cpufreq driver.
Documentation updates include, but are not limited to, a new doc on
debugging shutdown hangs, cross-referencing fixes and cleanups in the
intel_pstate documentation, and updates of comments in the core
hibernation code.
Specifics:
- Introduce and document a QoS limit on CPU exit latency during
wakeup from suspend-to-idle (Ulf Hansson)
- Add support for building libcpupower statically (Zuo An)
- Add support for sending netlink notifications to user space on
energy model updates (Changwoo Mini, Peng Fan)
- Minor improvements to the Rust OPP interface (Tamir Duberstein)
- Fixes to scope-based pointers in the OPP library (Viresh Kumar)
- Use residency threshold in polling state override decisions in the
menu cpuidle governor (Aboorva Devarajan)
- Add sanity check for exit latency and target residency in the
cpufreq core (Rafael Wysocki)
- Use this_cpu_ptr() where possible in the teo governor (Christian
Loehle)
- Rework the handling of tick wakeups in the teo cpuidle governor to
increase the likelihood of stopping the scheduler tick in the cases
when tick wakeups can be counted as non-timer ones (Rafael Wysocki)
- Fix a reverse condition in the teo cpuidle governor and drop a
misguided target residency check from it (Rafael Wysocki)
- Clean up multiple minor defects in the teo cpuidle governor (Rafael
Wysocki)
- Update header inclusion to make it follow the Include What You Use
principle (Andy Shevchenko)
- Enable MSR-based RAPL PMU support in the intel_rapl power capping
driver and arrange for using it on the Panther Lake and Wildcat
Lake processors (Kuppuswamy Sathyanarayanan)
- Add support for Nova Lake and Wildcat Lake processors to the
intel_rapl power capping driver (Kaushlendra Kumar, Srinivas
Pandruvada)
- Add OPP and bandwidth support for Tegra186 (Aaron Kling)
- Optimizations for parameter array handling in the amd-pstate
cpufreq driver (Mario Limonciello)
- Fix for mode changes with offline CPUs in the amd-pstate cpufreq
driver (Gautham Shenoy)
- Preserve freq_table_sorted across suspend/hibernate in the cpufreq
core (Zihuan Zhang)
- Adjust energy model rules for Intel hybrid platforms in the
intel_pstate cpufreq driver and improve printing of debug messages
in it (Rafael Wysocki)
- Replace deprecated strcpy() in cpufreq_unregister_governor()
(Thorsten Blum)
- Fix duplicate hyperlink target errors in the intel_pstate cpufreq
driver documentation and use :ref: directive for internal linking
in it (Swaraj Gaikwad, Bagas Sanjaya)
- Add Diamond Rapids OOB mode support to the intel_pstate cpufreq
driver (Kuppuswamy Sathyanarayanan)
- Use mutex guard for driver locking in the intel_pstate driver and
eliminate some code duplication from it (Rafael Wysocki)
- Replace udelay() with usleep_range() in ACPI cpufreq (Kaushlendra
Kumar)
- Minor improvements to various cpufreq drivers (Christian Marangi,
Hal Feng, Jie Zhan, Marco Crivellari, Miaoqian Lin, and Shuhao Fu)
- Replace snprintf() with scnprintf() in show_trace_dev_match()
(Kaushlendra Kumar)
- Fix memory allocation error handling in pm_vt_switch_required()
(Malaya Kumar Rout)
- Introduce CALL_PM_OP() macro and use it to simplify code in generic
PM operations (Kaushlendra Kumar)
- Add module param to backtrace all CPUs in the device power
management watchdog (Sergey Senozhatsky)
- Rework message printing in swsusp_save() (Rafael Wysocki)
- Make it possible to change the number of hibernation compression
threads (Xueqin Luo)
- Clarify that only cgroup1 freezer uses PM freezer (Tejun Heo)
- Add document on debugging shutdown hangs to PM documentation and
correct a mistaken configuration option in it (Mario Limonciello)
- Shut down wakeup source timer before removing the wakeup source
from the list (Kaushlendra Kumar, Rafael Wysocki)
- Introduce new PMSG_POWEROFF event for system shutdown handling with
the help of PM device callbacks (Mario Limonciello)
- Make pm_test delay interruptible by wakeup events (Riwen Lu)
- Clean up kernel-doc comment style usage in the core hibernation
code and remove unuseful comments from it (Sunday Adelodun, Rafael
Wysocki)
- Add support for handling wakeup events and aborting the suspend
process while it is syncing file systems (Samuel Wu, Rafael
Wysocki)
- Add WQ_UNBOUND to pm_wq workqueue (Marco Crivellari)
- Add runtime PM wrapper macros for ACQUIRE()/ACQUIRE_ERR() and use
them in the PCI core and the ACPI TAD driver (Rafael Wysocki)
- Improve runtime PM in the ACPI TAD driver (Rafael Wysocki)
- Update pm_runtime_allow/forbid() documentation (Rafael Wysocki)
- Fix typos in runtime.c comments (Malaya Kumar Rout)
- Move governor.h from devfreq under include/linux/ and rename to
devfreq-governor.h to allow devfreq governor definitions in out of
drivers/devfreq/ (Dmitry Baryshkov)
- Use min() to improve readability in tegra30-devfreq.c (Thorsten
Blum)
- Fix potential use-after-free issue of OPP handling in
hisi_uncore_freq.c (Pengjie Zhang)
- Fix typo in DFSO_DOWNDIFFERENTIAL macro name in
governor_simpleondemand.c in devfreq (Riwen Lu)"
* tag 'pm-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (96 commits)
PM / devfreq: Fix typo in DFSO_DOWNDIFFERENTIAL macro name
cpuidle: Warn instead of bailing out if target residency check fails
cpuidle: Update header inclusion
Documentation: power/cpuidle: Document the CPU system wakeup latency QoS
cpuidle: Respect the CPU system wakeup QoS limit for cpuidle
sched: idle: Respect the CPU system wakeup QoS limit for s2idle
pmdomain: Respect the CPU system wakeup QoS limit for cpuidle
pmdomain: Respect the CPU system wakeup QoS limit for s2idle
PM: QoS: Introduce a CPU system wakeup QoS limit
cpuidle: governors: teo: Add missing space to the description
PM: hibernate: Extra cleanup of comments in swap handling code
PM / devfreq: tegra30: use min to simplify actmon_cpu_to_emc_rate
PM / devfreq: hisi: Fix potential UAF in OPP handling
PM / devfreq: Move governor.h to a public header location
powercap: intel_rapl: Enable MSR-based RAPL PMU support
powercap: intel_rapl: Prepare read_raw() interface for atomic-context callers
cpufreq: qcom-nvmem: fix compilation warning for qcom_cpufreq_ipq806x_match_list
PM: sleep: Call pm_sleep_fs_sync() instead of ksys_sync_helper()
PM: sleep: Add support for wakeup during filesystem sync
cpufreq: ACPI: Replace udelay() with usleep_range()
...
The allocation of the per CPU buffer descriptor, the buffer page
descriptors and the buffer page data itself can be pretty ugly:
kzalloc_node(ALIGN(sizeof(struct buffer_page), cache_line_size()),
GFP_KERNEL, cpu_to_node(cpu));
And the data pages:
page = alloc_pages_node(cpu_to_node(cpu),
GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_COMP | __GFP_ZERO, order);
if (!page)
return NULL;
bpage->page = page_address(page);
rb_init_page(bpage->page);
Add helper functions to make the code easier to read.
This does make all allocations of the data page (bpage->page) allocated
with the __GFP_RETRY_MAYFAIL flag (and not just the bulk allocator). Which
is actually better, as allocating the data page for the ring buffer tracing
should try hard but not trigger the OOM killer.
Link: https://lore.kernel.org/all/CAHk-=wjMMSAaqTjBSfYenfuzE1bMjLj+2DLtLWJuGt07UGCH_Q@mail.gmail.com/
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251125121153.35c07461@gandalf.local.home
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
by in_hardirq() and marked deprecated in 2020.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmkvDhUTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoW+XD/959KAIm2JpcEYUWuNBmlhEyuYWvPLw
ZyOiraLYBNyWmfCO/Yz4Ff8VZSR9gdWQoNfvBb8uxkbSXa0UOEUhCbzWsuoTnqR5
ObTIHCJ9QmPlRiFDvs4Sf5TGmy/4nXh6/PoH3JykNdlD3rZMTxiAz/k6QuO/S2iu
ykA+DNtNL7jDkQHzrWa3rf597BkBN1Z+hUD8zHRt8LYKRfmLYWjCMggjPLMnuqcn
240fnV/FubCLd9f5ZgNxHQMQCQH2qB7GYMk08YwXwCZQqIIXWqbNnhedkkNO3kWq
Sws4TEO6yg9pgTFqkuiDU5QgYEboRY4pDT45KSkdTHHGZl2OAAl3eVIGCto72UEI
Eyzn4k900hZ1iI/Rad5mx3D4XJZEXFgEbXhjph0odn6jVvmSj+Fmg3J67u1niO2a
obzB+xeaIkbGNQIgJFy8+A9SSnZckvuPlXdZdUxS2S95zH7f9+vBY8HWJMuyursa
3AJAKa82mN1i3A9FdSuMTdttQWkDmrwPKVzxvixs1mBu7kB70XaRIKsPjZj7LH6X
CiqP9Kt5FO0hVA7K+nKTeUA5DdjB4HzYzOgMqzFUhExY3hksVsj8rQEO6B0bCp9t
CfITA3BvU7GXxhXZHOq3dABQ21J/ZHgeuK3QdQSnOxSQOv2ElYIdKvYirJy2QdS1
tSM3O3GXb4zWDg==
=6LKf
-----END PGP SIGNATURE-----
Merge tag 'core-core-2025-12-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core irq cleanup from Thomas Gleixner:
"Tree wide cleanup of the remaining users of in_irq() which got
replaced by in_hardirq() and marked deprecated in 2020"
* tag 'core-core-2025-12-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
treewide: Remove in_irq()
- Prevent a thundering herd problem when the timekeeper CPU is delayed
and a large number of CPUs compete to acquire jiffies_lock to do the
update. Limit it to one CPU with a separate "uncontended" atomic
variable.
- A set of improvements for the timer migration mechanism:
- Support imbalanced NUMA trees correctly
- Support dynamic exclusion of CPUs from the migrator duty to allow the
cpuset/isolation mechanism to exclude them from handling timers of
remote idle CPUs.
- The usual small updates, cleanups and enhancements
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmks7doTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoaxrD/40nxx+8cEXsVbVLIkP2PQbd2Y8+7sk
YbNu/Cb7j7Bg7R8YIs4p5GHk+7Yt/hNsW77SmbAzRPUyYYG6L3bUYlBa3yQlvIuo
xRPbzGA+RJies9skIGHbQ8z6ig1zUASRJPcBYiuaVIAuQhCfLNc4Nii9cEWtjZ24
+5gfRwV+vy74ArWwRkwaGejDK1tav+gd62OkFQZC8WtjQ08ozGZ6VBJNg7nYq/gH
FYO1rH2tQ/ZyjlO/x5NF8gFcjYD8iv5PDp8oH35MPx+XTdDccf0G3QB7ug0ffVdV
b4gA6lZTAmpsu/NHb6ByN4i/kf3wf8la/i+EaAh/Ov7NW078gunvVKVA7jStcbBl
ZgG5SRHiKRvQF/WXLGVQAnilRDZwRuS0nmJlqfExa44v23l5o3768RwdRYwQlv8g
X5KSRl0jlVgVtZHgNBlZtgX9+rnQSr9sB5sVGBP2a6a1WhVXQV/2kp0wjdnU0mPw
jLCnSdsHqBlSf9V7O/na823WCnBFb7blrLBXUoSbHBnICqtVFzhE1kBXWw3S7Kqh
CiaWM+S4WfR0HRnUlWMTS8BZ82MgiDnd7nGUXWwXBbdqWmoj/9CoU6SZRjbMBkzi
EY1XvmoYf6eSzdxfydI1hFi0/bbb8K9umHQlrpW3HeN9uXnVc0/+TroVPLuaKUdi
53ClqXjzE+CpJg==
=lQKn
-----END PGP SIGNATURE-----
Merge tag 'timers-core-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer core updates from Thomas Gleixner:
- Prevent a thundering herd problem when the timekeeper CPU is delayed
and a large number of CPUs compete to acquire jiffies_lock to do the
update. Limit it to one CPU with a separate "uncontended" atomic
variable.
- A set of improvements for the timer migration mechanism:
- Support imbalanced NUMA trees correctly
- Support dynamic exclusion of CPUs from the migrator duty to allow
the cpuset/isolation mechanism to exclude them from handling
timers of remote idle CPUs
- The usual small updates, cleanups and enhancements
* tag 'timers-core-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timers/migration: Exclude isolated cpus from hierarchy
cpumask: Add initialiser to use cleanup helpers
sched/isolation: Force housekeeping if isolcpus and nohz_full don't leave any
cgroup/cpuset: Rename update_unbound_workqueue_cpumask() to update_isolation_cpumasks()
timers/migration: Use scoped_guard on available flag set/clear
timers/migration: Add mask for CPUs available in the hierarchy
timers/migration: Rename 'online' bit to 'available'
selftests/timers/nanosleep: Add tests for return of remaining time
selftests/timers: Clean up kernel version check in posix_timers
time: Fix a few typos in time[r] related code comments
time: tick-oneshot: Add missing Return and parameter descriptions to kernel-doc
hrtimer: Store time as ktime_t in restart block
timers/migration: Remove dead code handling idle CPU checking for remote timers
timers/migration: Remove unused "cpu" parameter from tmigr_get_group()
timers/migration: Assert that hotplug preparing CPU is part of stable active hierarchy
timers/migration: Fix imbalanced NUMA trees
timers/migration: Remove locking on group connection
timers/migration: Convert "while" loops to use "for"
tick/sched: Limit non-timekeeper CPUs calling jiffies update
- Remove one variant of PCI/MSI management as all users have been
converted to use per device domains. That reduces the variants to two:
The modern and the real archaic legacy variant, which keeps the usual
suspects in the museum category alive.
- Rework the platform MSI device ID detection mechanism in the ARM GIC
world to address resource leaks, duplicated code and other details. This
requires a corresponding preparatory step in the PCI/iproc driver.
- Trivial core code cleanups
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmkswn0THHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoYpuD/wKT7d6I6AqnJVF/RhiJ+/d6vuX/aFW
g6E7XAkMLKhmxunSNFfPzXsHy2a0oJroYKmDJH4C8GWGo/gXa+QvmDt2491k9rdV
zM+CBodBu3/bXWvTW+o1fbyAvG+p2C3+iSRW/gGqzPdcY8gQiRnNOZS1j7zusMjO
A6pz5SvLSPWQUnVl9PygJBuNX5TFHPnY3AySRpW11CvqB5/8gqGz+O6lT/Q+5hov
GUC57hskbQd1PsYhTNRaUR4z7VMolPHqscp8DYVCWjOMP/r5quC6dlsn91yxuATU
8D7oRiW8xkCaTJplY/rA6r/VxUthZ3EgIxzev3rGaWBdPxHcFfftf2oxyFFAf3lf
3rEdfGBcNgApx+MCcoT5/3mf3KJfn2/bE6bZhwv94+dtbTlHguztyMD3vnGTS73i
zPWQ5ae4M5sqc8kCNMRaBfU8yQEHEKs3gia67vStZyn5R/uUNVKRo67LBPZKVDcJ
2511Ylnm62yG6PtdPGIFHY1i75uPpxXuS7F0BJignzM3iPvVvwLPZLDORr3/pR4q
CmswZTA2obue6+nwz/LUacxzONsZ2Z8pzGY6rrT9sfj0Z4mk6xrfEPfjfmVoMpyk
Dk4B8lIVYwcR7d/Sw+FIwYst8iw+L77Yn7kN8yCbh4lAOxBUUvtS5KAP6uPGe3D1
30Q/DbBVlEvg/g==
=VAtQ
-----END PGP SIGNATURE-----
Merge tag 'irq-msi-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull MSI updates from Thomas Gleixner:
"Updates for [PCI] MSI related code:
- Remove one variant of PCI/MSI management as all users have been
converted to use per device domains. That reduces the variants to
two:
The modern and the real archaic legacy variant, which keeps the
usual suspects in the museum category alive.
- Rework the platform MSI device ID detection mechanism in the ARM
GIC world to address resource leaks, duplicated code and other
details. This requires a corresponding preparatory step in the
PCI/iproc driver.
- Trivial core code cleanups"
* tag 'irq-msi-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/gic-its: Rework platform MSI deviceID detection
PCI: iproc: Implement MSI controller node detection with of_msi_xlate()
genirq/msi: Slightly simplify msi_domain_alloc()
PCI/MSI: Delete pci_msi_create_irq_domain()
- Rework of the Per Processor Interrupt (PPI) management on ARM[64].
PPI support was built under the assumption that the systems are
homogenous so that the same CPU local device types are connected to
them. That's unfortunately wishful thinking and created horrible
workarounds.
This rework provides affinity management for PPIs so that they can be
individually configured in the firmware tables and mops up the related
drivers all over the place.
- Prevent CPUSET/isolation changes to arbitrarily affine interrupt
threads to random CPUs, which ignores user or driver settings.
- Plug a harmless race in the interrupt affinity proc interface, which
allows to see a half updated mask
- Adjust the priority of secondary interrupt threads on RT, so that the
combination of primary and secondary thread emulates the hardware
interrupt plus thread scenario. Having them at the same priority can
cause starvation issues in some drivers.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmksv3oTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoe5+D/wNnBaX9LRajuLOF+zaYw5WZxkzp6U7
X4AP3cLny8xynI1kM5V8M1ym3Fspk0hiqxNX2LLXrSZzBR+3O4uGCyCceBXeHKo2
vW4auUXG4MB+2sZyudQXaBpNK4A2YBubycTUcRECjkjDkBPAWgN7J+Oz2lXUSUcH
zlitlHNo48hnZQPAJr4PDpi5q9+rChn+8/s+K1d8NlEf9HOXC98qzyMuMq+jHdJE
AQ6tKoHkA5lHjHAUY3AbWptoHo1Wp+p5PSqsrFr6nbKuPlhUqRNEPXX0Z8q7aUTj
NgdkvIHJVJ0C+T40FIWCNzUYOUk4gTQXBSPvptwJSHAmf9ovp+Kg2ltVZBzyL2iI
R0EZSQAQU8iJcRrqjcAYqI36LkmwwVT6RD1zFa98xJT/AjsMpAt/U1pEMDtkoTKe
Lv7ZQ/hloc+4wV4xS4zEtoV/ukdUfA9aEdXsh5hNH/07tvatpKO2LgortsiI+lCK
76vAULcGvbMr5Jr63snjICgstahunpNMRn2HmnGAjmdZf4+g+TDvZR4DI6bswtuO
jp5G6OM30Z9zKheAr1VioV1XAKr6Y4jDKVjfFy/n1k5pDwYaSJopmZxSD35aas4e
VqWizAzc5dAVCYRlzr6S1lrMQ2JJRg0RpIn+sMS8dhf9SK7hs5ilGSOsgX1fgVat
1N3WXvYM8vSW+g==
=zrA1
-----END PGP SIGNATURE-----
Merge tag 'irq-core-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq core updates from Thomas Gleixner:
"Updates for the interrupt core and treewide cleanups:
- Rework of the Per Processor Interrupt (PPI) management on ARM[64]
PPI support was built under the assumption that the systems are
homogenous so that the same CPU local device types are connected to
them. That's unfortunately wishful thinking and created horrible
workarounds.
This rework provides affinity management for PPIs so that they can
be individually configured in the firmware tables and mops up the
related drivers all over the place.
- Prevent CPUSET/isolation changes to arbitrarily affine interrupt
threads to random CPUs, which ignores user or driver settings.
- Plug a harmless race in the interrupt affinity proc interface,
which allows to see a half updated mask
- Adjust the priority of secondary interrupt threads on RT, so that
the combination of primary and secondary thread emulates the
hardware interrupt plus thread scenario. Having them at the same
priority can cause starvation issues in some drivers"
* tag 'irq-core-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
genirq: Remove cpumask availability check on kthread affinity setting
genirq: Fix interrupt threads affinity vs. cpuset isolated partitions
genirq: Prevent early spurious wake-ups of interrupt threads
genirq: Use raw_spinlock_irq() in irq_set_affinity_notifier()
genirq/manage: Reduce priority of forced secondary interrupt handler
genirq/proc: Fix race in show_irq_affinity()
genirq: Fix percpu_devid irq affinity documentation
perf: arm_pmu: Kill last use of per-CPU cpu_armpmu pointer
irqdomain: Kill of_node_to_fwnode() helper
genirq: Kill irq_{g,s}et_percpu_devid_partition()
irqchip: Kill irq-partition-percpu
irqchip/apple-aic: Drop support for custom PMU irq partitions
irqchip/gic-v3: Drop support for custom PPI partitions
coresight: trbe: Request specific affinities for per CPU interrupts
perf: arm_spe_pmu: Request specific affinities for per CPU interrupts
perf: arm_pmu: Request specific affinities for per CPU NMIs/interrupts
genirq: Add request_percpu_irq_affinity() helper
genirq: Allow per-cpu interrupt sharing for non-overlapping affinities
genirq: Update request_percpu_nmi() to take an affinity
genirq: Add affinity to percpu_devid interrupt requests
...
The recent enablement of RSEQ in glibc resulted in regressions which are
caused by the related overhead. It turned out that the decision to invoke
the exit to user work was not really a decision. More or less each
context switch caused that. There is a long list of small issues which
sums up nicely and results in a 3-4% regression in I/O benchmarks.
The other detail which caused issues due to extra work in context switch
and task migration is the CID (memory context ID) management. It also
requires to use a task work to consolidate the CID space, which is
executed in the context of an arbitrary task and results in sporadic
uncontrolled exit latencies.
The rewrite addresses this by:
- Removing deprecated and long unsupported functionality
- Moving the related data into dedicated data structures which are
optimized for fast path processing.
- Caching values so actual decisions can be made
- Replacing the current implementation with a optimized inlined variant.
- Separating fast and slow path for architectures which use the generic
entry code, so that only fault and error handling goes into the
TIF_NOTIFY_RESUME handler.
- Rewriting the CID management so that it becomes mostly invisible in the
context switch path. That moves the work of switching modes into the
fork/exit path, which is a reasonable tradeoff. That work is only
required when a process creates more threads than the cpuset it is
allowed to run on or when enough threads exit after that. An artificial
thread pool benchmarks which triggers this did not degrade, it actually
improved significantly.
The main effect in migration heavy scenarios is that runqueue lock held
time and therefore contention goes down significantly.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmksaRYTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoencEADA5he8PAFPmSRRPo6+2G5mHzWe8kIU
5ZViQStWFNAA0qqy8VXryWiJ6qqrO6la9o7K4YOXASUtlkVjquRp1DF7PabqGwuy
zshbRCXNlT51J8uqanN8VrGVjlf+bMdHDbGoI1SLkUTxG8b+kDD5PXUQE1ARelPP
Slbg9u+EMrxj6D5MDTPbuW6TqryJEkPtiNScyOz43emp9ww9+WVxenOcRqU4D+Th
mjWmrGIzkroSf4XReMoD/wg9TPTpUjXnNCwl2viY9JvBpkMfYtU4tJAGK3aNFOWy
zsAN0O9CaFGrUEFne7qUmtwhNLdtnjx5HN5pe7yZd1EhdTuQKq4jPiiQnwwm8w72
c0o6m45FNPmPoSyfaZWCkLjbTEUXonT9JF61iN35JVxim8gBDDJjHFKnLxDmLrH3
X0eESE48ReY2EneDV6Y8RJRo6oG14Fccvc39aTf/2Rw3trpmtt2agvConQzupQIg
DzANw4jhUUzFRrHrMHACNsqKFXh9ratue/S9DM3xxTpGO/bKdeK7jGIgzNf8O34M
J0O6Hvk5jMdcWlIJTx21GoGzoSkkXnR49g/71aCcp+MwdY4x9zFz5SWi8LWQRmkx
xRo6tY27Bma8/SEwMJjPpAUXDTpq6v+j3cPisybL1yGsyt9lh+p8LX7VUtwcoEqe
6ZelC5Kgw/+/kg==
=n5KT
-----END PGP SIGNATURE-----
Merge tag 'core-rseq-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull rseq updates from Thomas Gleixner:
"A large overhaul of the restartable sequences and CID management:
The recent enablement of RSEQ in glibc resulted in regressions which
are caused by the related overhead. It turned out that the decision to
invoke the exit to user work was not really a decision. More or less
each context switch caused that. There is a long list of small issues
which sums up nicely and results in a 3-4% regression in I/O
benchmarks.
The other detail which caused issues due to extra work in context
switch and task migration is the CID (memory context ID) management.
It also requires to use a task work to consolidate the CID space,
which is executed in the context of an arbitrary task and results in
sporadic uncontrolled exit latencies.
The rewrite addresses this by:
- Removing deprecated and long unsupported functionality
- Moving the related data into dedicated data structures which are
optimized for fast path processing.
- Caching values so actual decisions can be made
- Replacing the current implementation with a optimized inlined
variant.
- Separating fast and slow path for architectures which use the
generic entry code, so that only fault and error handling goes into
the TIF_NOTIFY_RESUME handler.
- Rewriting the CID management so that it becomes mostly invisible in
the context switch path. That moves the work of switching modes
into the fork/exit path, which is a reasonable tradeoff. That work
is only required when a process creates more threads than the
cpuset it is allowed to run on or when enough threads exit after
that. An artificial thread pool benchmarks which triggers this did
not degrade, it actually improved significantly.
The main effect in migration heavy scenarios is that runqueue lock
held time and therefore contention goes down significantly"
* tag 'core-rseq-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits)
sched/mmcid: Switch over to the new mechanism
sched/mmcid: Implement deferred mode change
irqwork: Move data struct to a types header
sched/mmcid: Provide CID ownership mode fixup functions
sched/mmcid: Provide new scheduler CID mechanism
sched/mmcid: Introduce per task/CPU ownership infrastructure
sched/mmcid: Serialize sched_mm_cid_fork()/exit() with a mutex
sched/mmcid: Provide precomputed maximal value
sched/mmcid: Move initialization out of line
signal: Move MMCID exit out of sighand lock
sched/mmcid: Convert mm CID mask to a bitmap
cpumask: Cache num_possible_cpus()
sched/mmcid: Use cpumask_weighted_or()
cpumask: Introduce cpumask_weighted_or()
sched/mmcid: Prevent pointless work in mm_update_cpus_allowed()
sched/mmcid: Move scheduler code out of global header
sched: Fixup whitespace damage
sched/mmcid: Cacheline align MM CID storage
sched/mmcid: Use proper data structures
sched/mmcid: Revert the complex CID management
...
- Implement the missing u64 user access function on ARM when
CONFIG_CPU_SPECTRE=n. This makes it possible to access a 64bit value in
generic code with [unsafe_]get_user(). All other architectures and ARM
variants provide the relevant accessors already.
- Ensure that ASM GOTO jump label usage in the user mode access helpers
always goes through a local C scope label indirection inside the
helpers. This is required because compilers are not supporting that a
ASM GOTO target leaves a auto cleanup scope. GCC silently fails to emit
the cleanup invocation and CLANG fails the build.
This provides generic wrapper macros and the conversion of affected
architecture code to use them.
- Scoped user mode access with auto cleanup
Access to user mode memory can be required in hot code paths, but if it
has to be done with user controlled pointers, the access is shielded
with a speculation barrier, so that the CPU cannot speculate around the
address range check. Those speculation barriers impact performance quite
significantly. This can be avoided by "masking" the provided pointer so
it is guaranteed to be in the valid user memory access range and
otherwise to point to a guaranteed unpopulated address space. This has
to be done without branches so it creates an address dependency for the
access, which the CPU cannot speculate ahead.
This results in repeating and error prone programming patterns:
if (can_do_masked_user_access())
from = masked_user_read_access_begin((from));
else if (!user_read_access_begin(from, sizeof(*from)))
return -EFAULT;
unsafe_get_user(val, from, Efault);
user_read_access_end();
return 0;
Efault:
user_read_access_end();
return -EFAULT;
which can be replaced with scopes and automatic cleanup:
scoped_user_read_access(from, Efault)
unsafe_get_user(val, from, Efault);
return 0;
Efault:
return -EFAULT;
- Convert code which implements the above pattern over to
scope_user.*.access(). This also corrects a couple of imbalanced
masked_*_begin() instances which are harmless on most architectures, but
prevent PowerPC from implementing the masking optimization.
- Add a missing speculation barrier in copy_from_user_iter()
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmksRfITHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoVhBEACEySjWcyCrD1e0ZFMFAOJZFI2BShav
reotzCzmHYQdpVukDRxc64BgM2vN4yB04xnyMhi2o4hSTiIJhz1NzbKggsQJhVoA
psYz+xEI161HuLZnUBUBuF9RRko/HVsbGqO2JFCuOKor4GCycvjVgupR3EIN9h5T
HZEWGIgaTmN7MBj0QRrJgJkaaSTnPKOwWaNMV/F9pfk27zuB7vuV8WM9P3FaJYG+
JGa9td7VGaBpWavxgMJqfdvXWBCVDDfZ1dunWx8tPTnLxKZZZD6HlfQXhZTr2n1e
rtJpGgfVBx5Uqxn4RrhS0I7QeK1b9rrt3IU7EkFoaa3Z8LU5B7cHlm7KyicyoHhy
SzFFUszssznT/0OhA5fmgPRlqI295HynW2p1L4Xy9hC0EZ2vXJPG5rO6X3x6QwSR
asjRB7x/6JzWQUzE7/nhXd9KcB66wvQxhnjp7GqulF74aPBCtIdXXDD68YEDYkbi
dPC3NRBr0ePbsGVGWbYvYIPWcvo1u814C2io1zKwmVbiN6lCYURgQK861vfAZUP8
oP5D2a6ENgezDKoJo6eJ82inuDu64qZy7OOkU/aO3cbOuWGVyY9CjYD11x85Nr0k
UNabSOfvcmhmobtYUiAgLLrjX1grQUG3F74ZQTw513mwgMObuDAAoS11GPjY6HL6
b99WUJRv8jP66A==
=6no0
-----END PGP SIGNATURE-----
Merge tag 'core-uaccess-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scoped user access updates from Thomas Gleixner:
"Scoped user mode access and related changes:
- Implement the missing u64 user access function on ARM when
CONFIG_CPU_SPECTRE=n.
This makes it possible to access a 64bit value in generic code with
[unsafe_]get_user(). All other architectures and ARM variants
provide the relevant accessors already.
- Ensure that ASM GOTO jump label usage in the user mode access
helpers always goes through a local C scope label indirection
inside the helpers.
This is required because compilers are not supporting that a ASM
GOTO target leaves a auto cleanup scope. GCC silently fails to emit
the cleanup invocation and CLANG fails the build.
[ Editor's note: gcc-16 will have fixed the code generation issue
in commit f68fe3ddda4 ("eh: Invoke cleanups/destructors in asm
goto jumps [PR122835]"). But we obviously have to deal with clang
and older versions of gcc, so.. - Linus ]
This provides generic wrapper macros and the conversion of affected
architecture code to use them.
- Scoped user mode access with auto cleanup
Access to user mode memory can be required in hot code paths, but
if it has to be done with user controlled pointers, the access is
shielded with a speculation barrier, so that the CPU cannot
speculate around the address range check. Those speculation
barriers impact performance quite significantly.
This cost can be avoided by "masking" the provided pointer so it is
guaranteed to be in the valid user memory access range and
otherwise to point to a guaranteed unpopulated address space. This
has to be done without branches so it creates an address dependency
for the access, which the CPU cannot speculate ahead.
This results in repeating and error prone programming patterns:
if (can_do_masked_user_access())
from = masked_user_read_access_begin((from));
else if (!user_read_access_begin(from, sizeof(*from)))
return -EFAULT;
unsafe_get_user(val, from, Efault);
user_read_access_end();
return 0;
Efault:
user_read_access_end();
return -EFAULT;
which can be replaced with scopes and automatic cleanup:
scoped_user_read_access(from, Efault)
unsafe_get_user(val, from, Efault);
return 0;
Efault:
return -EFAULT;
- Convert code which implements the above pattern over to
scope_user.*.access(). This also corrects a couple of imbalanced
masked_*_begin() instances which are harmless on most
architectures, but prevent PowerPC from implementing the masking
optimization.
- Add a missing speculation barrier in copy_from_user_iter()"
* tag 'core-uaccess-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
lib/strn*,uaccess: Use masked_user_{read/write}_access_begin when required
scm: Convert put_cmsg() to scoped user access
iov_iter: Add missing speculation barrier to copy_from_user_iter()
iov_iter: Convert copy_from_user_iter() to masked user access
select: Convert to scoped user access
x86/futex: Convert to scoped user access
futex: Convert to get/put_user_inline()
uaccess: Provide put/get_user_inline()
uaccess: Provide scoped user access regions
arm64: uaccess: Use unsafe wrappers for ASM GOTO
s390/uaccess: Use unsafe wrappers for ASM GOTO
riscv/uaccess: Use unsafe wrappers for ASM GOTO
powerpc/uaccess: Use unsafe wrappers for ASM GOTO
x86/uaccess: Use unsafe wrappers for ASM GOTO
uaccess: Provide ASM GOTO safe wrappers for unsafe_*_user()
ARM: uaccess: Implement missing __get_user_asm_dword()
- Improve WARN(), which has vararg printf like arguments,
to work with the x86 #UD based WARN-optimizing infrastructure
by hiding the format in the bug_table and replacing this
first argument with the address of the bug-table entry,
while making the actual function that's called a UD1 instruction.
(Peter Zijlstra)
- Introduce the CONFIG_DEBUG_BUGVERBOSE_DETAILED Kconfig switch
(Ingo Molnar, s390 support by Heiko Carstens)
Fixes and cleanups:
- bugs/s390: Remove private WARN_ON() implementation (Heiko Carstens)
- <asm/bugs.h>: Make i386 use GENERIC_BUG_RELATIVE_POINTERS
(Peter Zijlstra)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmktffcRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1gLCQ//ZMHTpv6UF+xY4Kq3rdVFJWRR7Prs2hDn
4V/cFHu7Xb2lRwpJDBNWnbiEkLbGR5qlWD+0CpAMK4mGuL9VjnoeflEzXxOPP9W1
7SWw1HDc6NkjHM3BNnvkPQbzWF4RL4ZGs4Lbb1nv7pjoTSdBMXNrD5RL+iNmfHHd
QfOG9ZiRyD5A/b07bfyjIXNaeC/Hot+FeVXTMfD7/vCfc2ywhL2Sm5G/igY/shTY
Una7q8sbFDD/bFFtWSR2eFQeHQQfT6c/Pu39ZcAIdTlLk1FtZ+A2wcRtwYv/FdPV
6KDOAxZK7fLMHoCdNTswsg+LbazhABOb/V1J7TaHq2tF/PeyN+B+ucxVY2KFcxQJ
V5eG5crMrUIL22QO/UaT3dPWxGbJYkNlAWl416tAKdgXA52W4OsPd+k4DGjeP569
KogAy3SY0D/80v1QN2HcFEJMvr7W2SukxtErqdtA5OKt/ZevB49lGqZoBecqASDO
QjI1K0yLKnb+erbMIpCpNj+o/Fr1JQgWbYVpwipL2GON4an6vrowimGTsUG7qqxN
Hwb7IaTNnWQ/4iyCkVV44q6Ln1gyP2hz5Lnzo7QJINUdzp98UjJLAEK4NRG44T4L
p0t5NMxnvREpAv35rwy03xhmITYeQMOqzN9JEBBvegQyld5ghbThxI7g9BDzcXyL
Bd0mF7WOV9M=
=bMUL
-----END PGP SIGNATURE-----
Merge tag 'core-bugs-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull bug handling infrastructure updates from Ingo Molnar:
"Core updates:
- Improve WARN(), which has vararg printf like arguments, to work
with the x86 #UD based WARN-optimizing infrastructure by hiding the
format in the bug_table and replacing this first argument with the
address of the bug-table entry, while making the actual function
that's called a UD1 instruction (Peter Zijlstra)
- Introduce the CONFIG_DEBUG_BUGVERBOSE_DETAILED Kconfig switch (Ingo
Molnar, s390 support by Heiko Carstens)
Fixes and cleanups:
- bugs/s390: Remove private WARN_ON() implementation (Heiko Carstens)
- <asm/bugs.h>: Make i386 use GENERIC_BUG_RELATIVE_POINTERS (Peter
Zijlstra)"
* tag 'core-bugs-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (31 commits)
x86/bugs: Make i386 use GENERIC_BUG_RELATIVE_POINTERS
x86/bug: Fix BUG_FORMAT vs KASLR
x86_64/bug: Inline the UD1
x86/bug: Implement WARN_ONCE()
x86_64/bug: Implement __WARN_printf()
x86/bug: Use BUG_FORMAT for DEBUG_BUGVERBOSE_DETAILED
x86/bug: Add BUG_FORMAT basics
bug: Allow architectures to provide __WARN_printf()
bug: Implement WARN_ON() using __WARN_FLAGS()
bug: Add report_bug_entry()
bug: Add BUG_FORMAT_ARGS infrastructure
bug: Clean up CONFIG_GENERIC_BUG_RELATIVE_POINTERS
bug: Add BUG_FORMAT infrastructure
x86: Rework __bug_table helpers
bugs/s390: Remove private WARN_ON() implementation
bugs/core: Reorganize fields in the first line of WARNING output, add ->comm[] output
bugs/sh: Concatenate 'cond_str' with '__FILE__' in __WARN_FLAGS(), to extend WARN_ON/BUG_ON output
bugs/parisc: Concatenate 'cond_str' with '__FILE__' in __WARN_FLAGS(), to extend WARN_ON/BUG_ON output
bugs/riscv: Concatenate 'cond_str' with '__FILE__' in __BUG_FLAGS(), to extend WARN_ON/BUG_ON output
bugs/riscv: Pass in 'cond_str' to __BUG_FLAGS()
...
Callchain support:
- Add support for deferred user-space stack unwinding for
perf, enabled on x86. (Peter Zijlstra, Steven Rostedt)
- unwind_user/x86: Enable frame pointer unwinding on x86
(Josh Poimboeuf)
x86 PMU support and infrastructure:
- x86/insn: Simplify for_each_insn_prefix() (Peter Zijlstra)
- x86/insn,uprobes,alternative: Unify insn_is_nop()
(Peter Zijlstra)
Intel PMU driver:
- Large series to prepare for and implement architectural PEBS
support for Intel platforms such as Clearwater Forest (CWF)
and Panther Lake (PTL). (Dapeng Mi, Kan Liang)
- Check dynamic constraints (Kan Liang)
- Optimize PEBS extended config (Peter Zijlstra)
- cstates: Remove PC3 support from LunarLake (Zhang Rui)
- cstates: Add Pantherlake support (Zhang Rui)
- cstates: Clearwater Forest support (Zide Chen)
AMD PMU driver:
- x86/amd: Check event before enable to avoid GPF (George Kennedy)
Fixes and cleanups:
- task_work: Fix NMI race condition (Peter Zijlstra)
- perf/x86: Fix NULL event access and potential PEBS record loss
(Dapeng Mi)
- Misc other fixes and cleanups.
(Dapeng Mi, Ingo Molnar, Peter Zijlstra)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmktcU0RHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1gNKw//ThLmbkoGJ0/yLOdEcW8rA/7HB43Oz6j9
k0Vs7zwDBMRFP4zQg2XeF5SH7CWS9p/nI3eMhorgmH77oJCvXJxVtD5991zmlZhf
eafOar5ZMVaoMz+tK8WWiENZyuN0bt0mumZmz9svXR3KV1S/q18XZ8bCas0itwnq
D0T3Gqi/Z39gJIy7bHNgLoFY2zvI9b2EJNDKlzHk3NJ7UamA4GuMHN0cM2dIzKGK
2L+wXOe2BH9YYzYrz/cdKq7sBMjOvFsCQ/5jh23A2Yu6JI4nJbw0WmexZRK1OWCp
GAdMjBuqIShibLRxK746WRO9iut49uTsah4iSG80hXzhpwf7VaegOarost1nLaqm
zweIOr3iwJRf273r6IqRuaporVHpQYMj2w2H63z36sQtGtkKHNyxZ50b6bqpwwjU
LikLEJ9Bmh3mlvlXsOx2wX6dTb1fUk+cy2ezCDKUHqOLjqy4dM8V+jYhuRO4yxXz
mj9aHZKgyuREt8yo/3nLqAzF5Okj9cXp7H6F1hCKWuCoAhNXkrvYcvbg8h6aRxOX
2vGhMYjpElkl/DG6OWCSwuqCt9nVEC/dazW9fKQjh4S0CFOVopaMGSkGcS/xUPub
92J4XMDEJX4RJ6dfspeQr97+1fETXEIWNv4WbKnDjqJlAucU1gnOTprVnAYUjcWw
74320FjGN1E=
=/8GE
-----END PGP SIGNATURE-----
Merge tag 'perf-core-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull performance events updates from Ingo Molnar:
"Callchain support:
- Add support for deferred user-space stack unwinding for perf,
enabled on x86. (Peter Zijlstra, Steven Rostedt)
- unwind_user/x86: Enable frame pointer unwinding on x86 (Josh
Poimboeuf)
x86 PMU support and infrastructure:
- x86/insn: Simplify for_each_insn_prefix() (Peter Zijlstra)
- x86/insn,uprobes,alternative: Unify insn_is_nop() (Peter Zijlstra)
Intel PMU driver:
- Large series to prepare for and implement architectural PEBS
support for Intel platforms such as Clearwater Forest (CWF) and
Panther Lake (PTL). (Dapeng Mi, Kan Liang)
- Check dynamic constraints (Kan Liang)
- Optimize PEBS extended config (Peter Zijlstra)
- cstates:
- Remove PC3 support from LunarLake (Zhang Rui)
- Add Pantherlake support (Zhang Rui)
- Clearwater Forest support (Zide Chen)
AMD PMU driver:
- x86/amd: Check event before enable to avoid GPF (George Kennedy)
Fixes and cleanups:
- task_work: Fix NMI race condition (Peter Zijlstra)
- perf/x86: Fix NULL event access and potential PEBS record loss
(Dapeng Mi)
- Misc other fixes and cleanups (Dapeng Mi, Ingo Molnar, Peter
Zijlstra)"
* tag 'perf-core-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
perf/x86/intel: Fix and clean up intel_pmu_drain_arch_pebs() type use
perf/x86/intel: Optimize PEBS extended config
perf/x86/intel: Check PEBS dyn_constraints
perf/x86/intel: Add a check for dynamic constraints
perf/x86/intel: Add counter group support for arch-PEBS
perf/x86/intel: Setup PEBS data configuration and enable legacy groups
perf/x86/intel: Update dyn_constraint base on PEBS event precise level
perf/x86/intel: Allocate arch-PEBS buffer and initialize PEBS_BASE MSR
perf/x86/intel: Process arch-PEBS records or record fragments
perf/x86/intel/ds: Factor out PEBS group processing code to functions
perf/x86/intel/ds: Factor out PEBS record processing code to functions
perf/x86/intel: Initialize architectural PEBS
perf/x86/intel: Correct large PEBS flag check
perf/x86/intel: Replace x86_pmu.drain_pebs calling with static call
perf/x86: Fix NULL event access and potential PEBS record loss
perf/x86: Remove redundant is_x86_event() prototype
entry,unwind/deferred: Fix unwind_reset_info() placement
unwind_user/x86: Fix arch=um build
perf: Support deferred user unwind
unwind_user/x86: Teach FP unwind about start of function
...
- klp-build livepatch module generation (Josh Poimboeuf)
Introduce new objtool features and a klp-build
script to generate livepatch modules using a
source .patch as input.
This builds on concepts from the longstanding out-of-tree
kpatch project which began in 2012 and has been used for
many years to generate livepatch modules for production kernels.
However, this is a complete rewrite which incorporates
hard-earned lessons from 12+ years of maintaining kpatch.
Key improvements compared to kpatch-build:
- Integrated with objtool: Leverages objtool's existing control-flow
graph analysis to help detect changed functions.
- Works on vmlinux.o: Supports late-linked objects, making it
compatible with LTO, IBT, and similar.
- Simplified code base: ~3k fewer lines of code.
- Upstream: No more out-of-tree #ifdef hacks, far less cruft.
- Cleaner internals: Vastly simplified logic for symbol/section/reloc
inclusion and special section extraction.
- Robust __LINE__ macro handling: Avoids false positive binary diffs
caused by the __LINE__ macro by introducing a fix-patch-lines script
which injects #line directives into the source .patch to preserve
the original line numbers at compile time.
- Disassemble code with libopcodes instead of running objdump
(Alexandre Chartre)
- Disassemble support (-d option to objtool) by Alexandre Chartre,
which supports the decoding of various Linux kernel code generation
specials such as alternatives:
17ef: sched_balance_find_dst_group+0x62f mov 0x34(%r9),%edx
17f3: sched_balance_find_dst_group+0x633 | <alternative.17f3> | X86_FEATURE_POPCNT
17f3: sched_balance_find_dst_group+0x633 | call 0x17f8 <__sw_hweight64> | popcnt %rdi,%rax
17f8: sched_balance_find_dst_group+0x638 cmp %eax,%edx
... jump table alternatives:
1895: sched_use_asym_prio+0x5 test $0x8,%ch
1898: sched_use_asym_prio+0x8 je 0x18a9 <sched_use_asym_prio+0x19>
189a: sched_use_asym_prio+0xa | <jump_table.189a> | JUMP
189a: sched_use_asym_prio+0xa | jmp 0x18ae <sched_use_asym_prio+0x1e> | nop2
189c: sched_use_asym_prio+0xc mov $0x1,%eax
18a1: sched_use_asym_prio+0x11 and $0x80,%ecx
... exception table alternatives:
native_read_msr:
5b80: native_read_msr+0x0 mov %edi,%ecx
5b82: native_read_msr+0x2 | <ex_table.5b82> | EXCEPTION
5b82: native_read_msr+0x2 | rdmsr | resume at 0x5b84 <native_read_msr+0x4>
5b84: native_read_msr+0x4 shl $0x20,%rdx
.... x86 feature flag decoding (also see the X86_FEATURE_POPCNT
example in sched_balance_find_dst_group() above):
2faaf: start_thread_common.constprop.0+0x1f jne 0x2fba4 <start_thread_common.constprop.0+0x114>
2fab5: start_thread_common.constprop.0+0x25 | <alternative.2fab5> | X86_FEATURE_ALWAYS | X86_BUG_NULL_SEG
2fab5: start_thread_common.constprop.0+0x25 | jmp 0x2faba <.altinstr_aux+0x2f4> | jmp 0x4b0 <start_thread_common.constprop.0+0x3f> | nop5
2faba: start_thread_common.constprop.0+0x2a mov $0x2b,%eax
... NOP sequence shortening:
1048e2: snapshot_write_finalize+0xc2 je 0x104917 <snapshot_write_finalize+0xf7>
1048e4: snapshot_write_finalize+0xc4 nop6
1048ea: snapshot_write_finalize+0xca nop11
1048f5: snapshot_write_finalize+0xd5 nop11
104900: snapshot_write_finalize+0xe0 mov %rax,%rcx
104903: snapshot_write_finalize+0xe3 mov 0x10(%rdx),%rax
... and much more.
- Function validation tracing support (Alexandre Chartre)
- Various -ffunction-sections fixes (Josh Poimboeuf)
- Clang AutoFDO (Automated Feedback-Directed Optimizations) support (Josh Poimboeuf)
- Misc fixes and cleanups (Borislav Petkov, Chen Ni,
Dylan Hatch, Ingo Molnar, John Wang, Josh Poimboeuf,
Pankaj Raghav, Peter Zijlstra, Thorsten Blum)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmktavcRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1j3IhAAvc9tRV8SJcohim6DrkPGxCN/S80uzt5S
q8v1x5tBzMYmUxftfpoLsPCri6Ww0jprNuhnbRCvWAzXFuW79HWBNdVkEO7V/cym
OsCKQv3r0mWv5UXP3o8VM5K3tnU61wOAIx3yZCz5XKWeOg6NPXBJCSGWYpLuA7z0
1wUWAXuHgmj4RHMlHu5x0FZnSqGU3/TkUDGAqdxrY+myhdwm0Ul+dSwWGQdQjCgA
59Y/gDsWWEe5BVL56suwKZ1e+8UFnpbncbWkjELD6euJpYpDSNMOW/S6PYqOOz5M
rjMv06XIX5ma7QQbF5fMG/sXW64tZtc090UocDnx7hpDq9mLEyNNkXsqRQlmd8Wt
wG19IaeWo8aG9DTQkiv8OhtmssPKZHJsVjRUvXGnjktvxnsYSomgOT1lNme38dJD
X9jHgZCFMdPsQmG0dp00Y0oejfTChqIDef7qSpYwT96R7l9VQQF7K7AxfJwSeLGO
3hClZ0Gz/u9NiJTUUWTxUmR+YEy+1xIeaQSDq6t4JRtNJaMZlcevfVW+F2Lm04XH
9eSeF7bJS2XKrlLHVdPgWCGZOmee+ghdQ7svsyEGpzdzaAZ7UveTucHJ9CvW3Fft
Dcrl8rxX2NiD2PLz03HCHR/JVUDc3W3Exrer1TD8PD4LcZhFoBEGQbZ/gFlkBTxb
TOcemtJT03U=
=yPrS
-----END PGP SIGNATURE-----
Merge tag 'objtool-core-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull objtool updates from Ingo Molnar:
- klp-build livepatch module generation (Josh Poimboeuf)
Introduce new objtool features and a klp-build script to generate
livepatch modules using a source .patch as input.
This builds on concepts from the longstanding out-of-tree kpatch
project which began in 2012 and has been used for many years to
generate livepatch modules for production kernels. However, this is a
complete rewrite which incorporates hard-earned lessons from 12+
years of maintaining kpatch.
Key improvements compared to kpatch-build:
- Integrated with objtool: Leverages objtool's existing control-flow
graph analysis to help detect changed functions.
- Works on vmlinux.o: Supports late-linked objects, making it
compatible with LTO, IBT, and similar.
- Simplified code base: ~3k fewer lines of code.
- Upstream: No more out-of-tree #ifdef hacks, far less cruft.
- Cleaner internals: Vastly simplified logic for
symbol/section/reloc inclusion and special section extraction.
- Robust __LINE__ macro handling: Avoids false positive binary diffs
caused by the __LINE__ macro by introducing a fix-patch-lines
script which injects #line directives into the source .patch to
preserve the original line numbers at compile time.
- Disassemble code with libopcodes instead of running objdump
(Alexandre Chartre)
- Disassemble support (-d option to objtool) by Alexandre Chartre,
which supports the decoding of various Linux kernel code generation
specials such as alternatives:
17ef: sched_balance_find_dst_group+0x62f mov 0x34(%r9),%edx
17f3: sched_balance_find_dst_group+0x633 | <alternative.17f3> | X86_FEATURE_POPCNT
17f3: sched_balance_find_dst_group+0x633 | call 0x17f8 <__sw_hweight64> | popcnt %rdi,%rax
17f8: sched_balance_find_dst_group+0x638 cmp %eax,%edx
... jump table alternatives:
1895: sched_use_asym_prio+0x5 test $0x8,%ch
1898: sched_use_asym_prio+0x8 je 0x18a9 <sched_use_asym_prio+0x19>
189a: sched_use_asym_prio+0xa | <jump_table.189a> | JUMP
189a: sched_use_asym_prio+0xa | jmp 0x18ae <sched_use_asym_prio+0x1e> | nop2
189c: sched_use_asym_prio+0xc mov $0x1,%eax
18a1: sched_use_asym_prio+0x11 and $0x80,%ecx
... exception table alternatives:
native_read_msr:
5b80: native_read_msr+0x0 mov %edi,%ecx
5b82: native_read_msr+0x2 | <ex_table.5b82> | EXCEPTION
5b82: native_read_msr+0x2 | rdmsr | resume at 0x5b84 <native_read_msr+0x4>
5b84: native_read_msr+0x4 shl $0x20,%rdx
.... x86 feature flag decoding (also see the X86_FEATURE_POPCNT
example in sched_balance_find_dst_group() above):
2faaf: start_thread_common.constprop.0+0x1f jne 0x2fba4 <start_thread_common.constprop.0+0x114>
2fab5: start_thread_common.constprop.0+0x25 | <alternative.2fab5> | X86_FEATURE_ALWAYS | X86_BUG_NULL_SEG
2fab5: start_thread_common.constprop.0+0x25 | jmp 0x2faba <.altinstr_aux+0x2f4> | jmp 0x4b0 <start_thread_common.constprop.0+0x3f> | nop5
2faba: start_thread_common.constprop.0+0x2a mov $0x2b,%eax
... NOP sequence shortening:
1048e2: snapshot_write_finalize+0xc2 je 0x104917 <snapshot_write_finalize+0xf7>
1048e4: snapshot_write_finalize+0xc4 nop6
1048ea: snapshot_write_finalize+0xca nop11
1048f5: snapshot_write_finalize+0xd5 nop11
104900: snapshot_write_finalize+0xe0 mov %rax,%rcx
104903: snapshot_write_finalize+0xe3 mov 0x10(%rdx),%rax
... and much more.
- Function validation tracing support (Alexandre Chartre)
- Various -ffunction-sections fixes (Josh Poimboeuf)
- Clang AutoFDO (Automated Feedback-Directed Optimizations) support
(Josh Poimboeuf)
- Misc fixes and cleanups (Borislav Petkov, Chen Ni, Dylan Hatch, Ingo
Molnar, John Wang, Josh Poimboeuf, Pankaj Raghav, Peter Zijlstra,
Thorsten Blum)
* tag 'objtool-core-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (129 commits)
objtool: Fix segfault on unknown alternatives
objtool: Build with disassembly can fail when including bdf.h
objtool: Trim trailing NOPs in alternative
objtool: Add wide output for disassembly
objtool: Compact output for alternatives with one instruction
objtool: Improve naming of group alternatives
objtool: Add Function to get the name of a CPU feature
objtool: Provide access to feature and flags of group alternatives
objtool: Fix address references in alternatives
objtool: Disassemble jump table alternatives
objtool: Disassemble exception table alternatives
objtool: Print addresses with alternative instructions
objtool: Disassemble group alternatives
objtool: Print headers for alternatives
objtool: Preserve alternatives order
objtool: Add the --disas=<function-pattern> action
objtool: Do not validate IBT for .return_sites and .call_sites
objtool: Improve tracing of alternative instructions
objtool: Add functions to better name alternatives
objtool: Identify the different types of alternatives
...
Mutexes:
- Redo __mutex_init() to reduce generated code size
(Sebastian Andrzej Siewior)
Seqlocks:
- Introduce scoped_seqlock_read() (Peter Zijlstra)
- Change thread_group_cputime() to use scoped_seqlock_read()
(Oleg Nesterov)
- Change do_task_stat() to use scoped_seqlock_read()
(Oleg Nesterov)
- Change do_io_accounting() to use scoped_seqlock_read()
(Oleg Nesterov)
- Fix the incorrect documentation of read_seqbegin_or_lock() /
need_seqretry() (Oleg Nesterov)
- Allow KASAN to fail optimizing (Peter Zijlstra)
Local lock updates:
- Fix all kernel-doc warnings (Randy Dunlap)
- Add the <linux/local_lock*.h> headers to MAINTAINERS
(Sebastian Andrzej Siewior)
- Reduce the risk of shadowing via s/l/__l/ and s/tl/__tl/
(Vincent Mailhol)
Lock debugging:
- spinlock/debug: Fix data-race in do_raw_write_lock
(Alexander Sverdlin)
Atomic primitives infrastructure:
- atomic: Skip alignment check for try_cmpxchg() old arg
(Arnd Bergmann)
Rust runtime integration:
- sync: atomic: Enable generated Atomic<T> usage (Boqun Feng)
- sync: atomic: Implement Debug for Atomic<Debug> (Boqun Feng)
- debugfs: Remove Rust native atomics and replace them with
Linux versions (Boqun Feng)
- debugfs: Implement Reader for Mutex<T> only when T is Unpin
(Boqun Feng)
- lock: guard: Add T: Unpin bound to DerefMut (Daniel Almeida)
- lock: Pin the inner data (Daniel Almeida)
- lock: Add a Pin<&mut T> accessor (Daniel Almeida)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmktVmgRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1hDqw//e9BTs9Yx6yLxZVRDagS7KrDKy3V3OMg1
9h+tXzCosGeNz7XwVzt590wsACJvX7/QtqgTFQy/GPYBW56WeVOSYSpA6I4s43G1
sz+EAc4BYPBWEwBQxCfPDwHI/M0MD0Im6mjpATc6Uv1Ct0+KS8iFXRnGUALe4bis
8aKV8ZHo81Wnu6v1B8GroExHolL/AMORYfEYHABpWEe+GpwxqQcRdZjc/eiEUzOg
umwMC9bNc5FAiPlku9Mh6pcBPjkMd9bGjVEIG8deJhm/aD8f/b0mgaxyaKgoHx8J
ptauY3kLYBLuV793U37SXuQVw6o2LGHCYvN1fX+g1D0YTIuqIq9Pz7ObZs7w4xDd
6iIK4QYP4Gjkvn0ZA275eI3ItcBEjJ2FD3ZDbkXNj+O4vEOrmG/OX4h2H5WGq/AU
zr9YfmkRn0InPeHeLU1UM3NdbKgwc/Bd6MubSwX4v7G7ws4CGDtlvA2d3xg5q8Ls
MQoAV+9QtiZ9prQjtd8nukgmh/+okPmCcnuEVXhSOZHpPjqXXnyUCTPyKXVkltdF
1u4oUHiQY7Jydfn0wZgEV4nASDeV2gz5BwKoSAuKvYc5HGhXnXxvzyJyHJy3dL8R
afGGQ3XfQhA0hUKoMiQFUk7p7dvjdAiHxN1EcvxxJqWVsaE/Gpik1GOm+FRn7Oqs
UMvspgGrbI4=
=KPgY
-----END PGP SIGNATURE-----
Merge tag 'locking-core-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
"Mutexes:
- Redo __mutex_init() to reduce generated code size (Sebastian
Andrzej Siewior)
Seqlocks:
- Introduce scoped_seqlock_read() (Peter Zijlstra)
- Change thread_group_cputime() to use scoped_seqlock_read() (Oleg
Nesterov)
- Change do_task_stat() to use scoped_seqlock_read() (Oleg Nesterov)
- Change do_io_accounting() to use scoped_seqlock_read() (Oleg
Nesterov)
- Fix the incorrect documentation of read_seqbegin_or_lock() /
need_seqretry() (Oleg Nesterov)
- Allow KASAN to fail optimizing (Peter Zijlstra)
Local lock updates:
- Fix all kernel-doc warnings (Randy Dunlap)
- Add the <linux/local_lock*.h> headers to MAINTAINERS (Sebastian
Andrzej Siewior)
- Reduce the risk of shadowing via s/l/__l/ and s/tl/__tl/ (Vincent
Mailhol)
Lock debugging:
- spinlock/debug: Fix data-race in do_raw_write_lock (Alexander
Sverdlin)
Atomic primitives infrastructure:
- atomic: Skip alignment check for try_cmpxchg() old arg (Arnd
Bergmann)
Rust runtime integration:
- sync: atomic: Enable generated Atomic<T> usage (Boqun Feng)
- sync: atomic: Implement Debug for Atomic<Debug> (Boqun Feng)
- debugfs: Remove Rust native atomics and replace them with Linux
versions (Boqun Feng)
- debugfs: Implement Reader for Mutex<T> only when T is Unpin (Boqun
Feng)
- lock: guard: Add T: Unpin bound to DerefMut (Daniel Almeida)
- lock: Pin the inner data (Daniel Almeida)
- lock: Add a Pin<&mut T> accessor (Daniel Almeida)"
* tag 'locking-core-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/local_lock: Fix all kernel-doc warnings
locking/local_lock: s/l/__l/ and s/tl/__tl/ to reduce the risk of shadowing
locking/local_lock: Add the <linux/local_lock*.h> headers to MAINTAINERS
locking/mutex: Redo __mutex_init() to reduce generated code size
rust: debugfs: Replace the usage of Rust native atomics
rust: sync: atomic: Implement Debug for Atomic<Debug>
rust: sync: atomic: Make Atomic*Ops pub(crate)
seqlock: Allow KASAN to fail optimizing
rust: debugfs: Implement Reader for Mutex<T> only when T is Unpin
seqlock: Change do_io_accounting() to use scoped_seqlock_read()
seqlock: Change do_task_stat() to use scoped_seqlock_read()
seqlock: Change thread_group_cputime() to use scoped_seqlock_read()
seqlock: Introduce scoped_seqlock_read()
documentation: seqlock: fix the wrong documentation of read_seqbegin_or_lock/need_seqretry
atomic: Skip alignment check for try_cmpxchg() old arg
rust: lock: Add a Pin<&mut T> accessor
rust: lock: Pin the inner data
rust: lock: guard: Add T: Unpin bound to DerefMut
locking/spinlock/debug: Fix data-race in do_raw_write_lock
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaSmOZwAKCRCRxhvAZXjc
op0AAP4oNVJkFyvgKoPos5K2EGFB8M7merGhpYtsOoeg8UK6OwD/UySQErHsXQDR
sUDDa5uFOhfrkcfM8REtAN4wF8p5qAc=
=QgFD
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.19-rc1.fd_prepare.fs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull fd prepare updates from Christian Brauner:
"This adds the FD_ADD() and FD_PREPARE() primitive. They simplify the
common pattern of get_unused_fd_flags() + create file + fd_install()
that is used extensively throughout the kernel and currently requires
cumbersome cleanup paths.
FD_ADD() - For simple cases where a file is installed immediately:
fd = FD_ADD(O_CLOEXEC, vfio_device_open_file(device));
if (fd < 0)
vfio_device_put_registration(device);
return fd;
FD_PREPARE() - For cases requiring access to the fd or file, or
additional work before publishing:
FD_PREPARE(fdf, O_CLOEXEC, sync_file->file);
if (fdf.err) {
fput(sync_file->file);
return fdf.err;
}
data.fence = fd_prepare_fd(fdf);
if (copy_to_user((void __user *)arg, &data, sizeof(data)))
return -EFAULT;
return fd_publish(fdf);
The primitives are centered around struct fd_prepare. FD_PREPARE()
encapsulates all allocation and cleanup logic and must be followed by
a call to fd_publish() which associates the fd with the file and
installs it into the caller's fdtable. If fd_publish() isn't called,
both are deallocated automatically. FD_ADD() is a shorthand that does
fd_publish() immediately and never exposes the struct to the caller.
I've implemented this in a way that it's compatible with the cleanup
infrastructure while also being usable separately. IOW, it's centered
around struct fd_prepare which is aliased to class_fd_prepare_t and so
we can make use of all the basica guard infrastructure"
* tag 'vfs-6.19-rc1.fd_prepare.fs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (42 commits)
io_uring: convert io_create_mock_file() to FD_PREPARE()
file: convert replace_fd() to FD_PREPARE()
vfio: convert vfio_group_ioctl_get_device_fd() to FD_ADD()
tty: convert ptm_open_peer() to FD_ADD()
ntsync: convert ntsync_obj_get_fd() to FD_PREPARE()
media: convert media_request_alloc() to FD_PREPARE()
hv: convert mshv_ioctl_create_partition() to FD_ADD()
gpio: convert linehandle_create() to FD_PREPARE()
pseries: port papr_rtas_setup_file_interface() to FD_ADD()
pseries: convert papr_platform_dump_create_handle() to FD_ADD()
spufs: convert spufs_gang_open() to FD_PREPARE()
papr-hvpipe: convert papr_hvpipe_dev_create_handle() to FD_PREPARE()
spufs: convert spufs_context_open() to FD_PREPARE()
net/socket: convert __sys_accept4_file() to FD_ADD()
net/socket: convert sock_map_fd() to FD_ADD()
net/kcm: convert kcm_ioctl() to FD_PREPARE()
net/handshake: convert handshake_nl_accept_doit() to FD_PREPARE()
secretmem: convert memfd_secret() to FD_ADD()
memfd: convert memfd_create() to FD_ADD()
bpf: convert bpf_token_create() to FD_PREPARE()
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaSmOZQAKCRCRxhvAZXjc
orJLAP9UD+dX6cicJDkzFZowDakmoIQkR5ZSDwChSlmvLcmquwEAlSq4svVd9Bdl
7kOFUk71DqhVHrPAwO7ap0BxehokEAA=
=Cli6
-----END PGP SIGNATURE-----
Merge tag 'kernel-6.19-rc1.cred' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull cred guard updates from Christian Brauner:
"This contains substantial credential infrastructure improvements
adding guard-based credential management that simplifies code and
eliminates manual reference counting in many subsystems.
Features:
- Kernel Credential Guards
Add with_kernel_creds() and scoped_with_kernel_creds() guards that
allow using the kernel credentials without allocating and copying
them. This was requested by Linus after seeing repeated
prepare_kernel_creds() calls that duplicate the kernel credentials
only to drop them again later.
The new guards completely avoid the allocation and never expose the
temporary variable to hold the kernel credentials anywhere in
callers.
- Generic Credential Guards
Add scoped_with_creds() guards for the common override_creds() and
revert_creds() pattern. This builds on earlier work that made
override_creds()/revert_creds() completely reference count free.
- Prepare Credential Guards
Add prepare credential guards for the more complex pattern of
preparing a new set of credentials and overriding the current
credentials with them:
- prepare_creds()
- modify new creds
- override_creds()
- revert_creds()
- put_cred()
Cleanups:
- Make init_cred static since it should not be directly accessed
- Add kernel_cred() helper to properly access the kernel credentials
- Fix scoped_class() macro that was introduced two cycles ago
- coredump: split out do_coredump() from vfs_coredump() for cleaner
credential handling
- coredump: move revert_cred() before coredump_cleanup()
- coredump: mark struct mm_struct as const
- coredump: pass struct linux_binfmt as const
- sev-dev: use guard for path"
* tag 'kernel-6.19-rc1.cred' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (36 commits)
trace: use override credential guard
trace: use prepare credential guard
coredump: use override credential guard
coredump: use prepare credential guard
coredump: split out do_coredump() from vfs_coredump()
coredump: mark struct mm_struct as const
coredump: pass struct linux_binfmt as const
coredump: move revert_cred() before coredump_cleanup()
sev-dev: use override credential guards
sev-dev: use prepare credential guard
sev-dev: use guard for path
cred: add prepare credential guard
net/dns_resolver: use credential guards in dns_query()
cgroup: use credential guards in cgroup_attach_permissions()
act: use credential guards in acct_write_process()
smb: use credential guards in cifs_get_spnego_key()
nfs: use credential guards in nfs_idmap_get_key()
nfs: use credential guards in nfs_local_call_write()
nfs: use credential guards in nfs_local_call_read()
erofs: use credential guards
...
When loading the ebpf scheduler, the tasks in the scx_tasks list will
be traversed and invoke __setscheduler_class() to get new sched_class.
however, this would also incorrectly set the per-cpu migration
task's->sched_class to rt_sched_class, even after unload, the per-cpu
migration task's->sched_class remains sched_rt_class.
The log for this issue is as follows:
./scx_rustland --stats 1
[ 199.245639][ T630] sched_ext: "rustland" does not implement cgroup cpu.weight
[ 199.269213][ T630] sched_ext: BPF scheduler "rustland" enabled
04:25:09 [INFO] RustLand scheduler attached
bpftrace -e 'iter:task /strcontains(ctx->task->comm, "migration")/
{ printf("%s:%d->%pS\n", ctx->task->comm, ctx->task->pid, ctx->task->sched_class); }'
Attaching 1 probe...
migration/0:24->rt_sched_class+0x0/0xe0
migration/1:27->rt_sched_class+0x0/0xe0
migration/2:33->rt_sched_class+0x0/0xe0
migration/3:39->rt_sched_class+0x0/0xe0
migration/4:45->rt_sched_class+0x0/0xe0
migration/5:52->rt_sched_class+0x0/0xe0
migration/6:58->rt_sched_class+0x0/0xe0
migration/7:64->rt_sched_class+0x0/0xe0
sched_ext: BPF scheduler "rustland" disabled (unregistered from user space)
EXIT: unregistered from user space
04:25:21 [INFO] Unregister RustLand scheduler
bpftrace -e 'iter:task /strcontains(ctx->task->comm, "migration")/
{ printf("%s:%d->%pS\n", ctx->task->comm, ctx->task->pid, ctx->task->sched_class); }'
Attaching 1 probe...
migration/0:24->rt_sched_class+0x0/0xe0
migration/1:27->rt_sched_class+0x0/0xe0
migration/2:33->rt_sched_class+0x0/0xe0
migration/3:39->rt_sched_class+0x0/0xe0
migration/4:45->rt_sched_class+0x0/0xe0
migration/5:52->rt_sched_class+0x0/0xe0
migration/6:58->rt_sched_class+0x0/0xe0
migration/7:64->rt_sched_class+0x0/0xe0
This commit therefore generate a new scx_setscheduler_class() and
add check for stop_sched_class to replace __setscheduler_class().
Fixes: f0e1a0643a ("sched_ext: Implement BPF extensible scheduler class")
Cc: stable@vger.kernel.org # v6.12+
Signed-off-by: Zqiang <qiang.zhang@linux.dev>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaSmOZQAKCRCRxhvAZXjc
ooKwAP4kR5kMjHlthf8jHmmCjVU3nQFO9hUZsIQL9gFJLOIQMAD+LLoTaq1WJufl
oSgZpREXZVmI1TK61eR6EZMB1YikGAo=
=TExi
-----END PGP SIGNATURE-----
Merge tag 'namespace-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull namespace updates from Christian Brauner:
"This contains substantial namespace infrastructure changes including a new
system call, active reference counting, and extensive header cleanups.
The branch depends on the shared kbuild branch for -fms-extensions support.
Features:
- listns() system call
Add a new listns() system call that allows userspace to iterate
through namespaces in the system. This provides a programmatic
interface to discover and inspect namespaces, addressing
longstanding limitations:
Currently, there is no direct way for userspace to enumerate
namespaces. Applications must resort to scanning /proc/*/ns/ across
all processes, which is:
- Inefficient - requires iterating over all processes
- Incomplete - misses namespaces not attached to any running
process but kept alive by file descriptors, bind mounts, or
parent references
- Permission-heavy - requires access to /proc for many processes
- No ordering or ownership information
- No filtering per namespace type
The listns() system call solves these problems:
ssize_t listns(const struct ns_id_req *req, u64 *ns_ids,
size_t nr_ns_ids, unsigned int flags);
struct ns_id_req {
__u32 size;
__u32 spare;
__u64 ns_id;
struct /* listns */ {
__u32 ns_type;
__u32 spare2;
__u64 user_ns_id;
};
};
Features include:
- Pagination support for large namespace sets
- Filtering by namespace type (MNT_NS, NET_NS, USER_NS, etc.)
- Filtering by owning user namespace
- Permission checks respecting namespace isolation
- Active Reference Counting
Introduce an active reference count that tracks namespace
visibility to userspace. A namespace is visible in the following
cases:
- The namespace is in use by a task
- The namespace is persisted through a VFS object (namespace file
descriptor or bind-mount)
- The namespace is a hierarchical type and is the parent of child
namespaces
The active reference count does not regulate lifetime (that's still
done by the normal reference count) - it only regulates visibility
to namespace file handles and listns().
This prevents resurrection of namespaces that are pinned only for
internal kernel reasons (e.g., user namespaces held by
file->f_cred, lazy TLB references on idle CPUs, etc.) which should
not be accessible via (1)-(3).
- Unified Namespace Tree
Introduce a unified tree structure for all namespaces with:
- Fixed IDs assigned to initial namespaces
- Lookup based solely on inode number
- Maintained list of owned namespaces per user namespace
- Simplified rbtree comparison helpers
Cleanups
- Header Reorganization:
- Move namespace types into separate header (ns_common_types.h)
- Decouple nstree from ns_common header
- Move nstree types into separate header
- Switch to new ns_tree_{node,root} structures with helper functions
- Use guards for ns_tree_lock
- Initial Namespace Reference Count Optimization
- Make all reference counts on initial namespaces a nop to avoid
pointless cacheline ping-pong for namespaces that can never go
away
- Drop custom reference count initialization for initial namespaces
- Add NS_COMMON_INIT() macro and use it for all namespaces
- pid: rely on common reference count behavior
- Miscellaneous Cleanups
- Rename exit_task_namespaces() to exit_nsproxy_namespaces()
- Rename is_initial_namespace() and make argument const
- Use boolean to indicate anonymous mount namespace
- Simplify owner list iteration in nstree
- nsfs: raise SB_I_NODEV, SB_I_NOEXEC, and DCACHE_DONTCACHE explicitly
- nsfs: use inode_just_drop()
- pidfs: raise DCACHE_DONTCACHE explicitly
- pidfs: simplify PIDFD_GET__NAMESPACE ioctls
- libfs: allow to specify s_d_flags
- cgroup: add cgroup namespace to tree after owner is set
- nsproxy: fix free_nsproxy() and simplify create_new_namespaces()
Fixes:
- setns(pidfd, ...) race condition
Fix a subtle race when using pidfds with setns(). When the target
task exits after prepare_nsset() but before commit_nsset(), the
namespace's active reference count might have been dropped. If
setns() then installs the namespaces, it would bump the active
reference count from zero without taking the required reference on
the owner namespace, leading to underflow when later decremented.
The fix resurrects the ownership chain if necessary - if the caller
succeeded in grabbing passive references, the setns() should
succeed even if the target task exits or gets reaped.
- Return EFAULT on put_user() error instead of success
- Make sure references are dropped outside of RCU lock (some
namespaces like mount namespace sleep when putting the last
reference)
- Don't skip active reference count initialization for network
namespace
- Add asserts for active refcount underflow
- Add asserts for initial namespace reference counts (both passive
and active)
- ipc: enable is_ns_init_id() assertions
- Fix kernel-doc comments for internal nstree functions
- Selftests
- 15 active reference count tests
- 9 listns() functionality tests
- 7 listns() permission tests
- 12 inactive namespace resurrection tests
- 3 threaded active reference count tests
- commit_creds() active reference tests
- Pagination and stress tests
- EFAULT handling test
- nsid tests fixes"
* tag 'namespace-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (103 commits)
pidfs: simplify PIDFD_GET_<type>_NAMESPACE ioctls
nstree: fix kernel-doc comments for internal functions
nsproxy: fix free_nsproxy() and simplify create_new_namespaces()
selftests/namespaces: fix nsid tests
ns: drop custom reference count initialization for initial namespaces
pid: rely on common reference count behavior
ns: add asserts for initial namespace active reference counts
ns: add asserts for initial namespace reference counts
ns: make all reference counts on initial namespace a nop
ipc: enable is_ns_init_id() assertions
fs: use boolean to indicate anonymous mount namespace
ns: rename is_initial_namespace()
ns: make is_initial_namespace() argument const
nstree: use guards for ns_tree_lock
nstree: simplify owner list iteration
nstree: switch to new structures
nstree: add helper to operate on struct ns_tree_{node,root}
nstree: move nstree types into separate header
nstree: decouple from ns_common header
ns: move namespace types into separate header
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaSmOZAAKCRCRxhvAZXjc
onGCAQDEHKNEuZMhkyd3K5YsJtMzZlW/uXp4+Wddeob+5yQp0wEA09xN4CJNMwhP
J6Kjaa80hWfrFacqSvyMUwQHHw6mngs=
=5Mom
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.19-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"Features:
- Cheaper MAY_EXEC handling for path lookup. This elides MAY_WRITE
permission checks during path lookup and adds the
IOP_FASTPERM_MAY_EXEC flag so filesystems like btrfs can avoid
expensive permission work.
- Hide dentry_cache behind runtime const machinery.
- Add German Maglione as virtiofs co-maintainer.
Cleanups:
- Tidy up and inline step_into() and walk_component() for improved
code generation.
- Re-enable IOCB_NOWAIT writes to files. This refactors file
timestamp update logic, fixing a layering bypass in btrfs when
updating timestamps on device files and improving FMODE_NOCMTIME
handling in VFS now that nfsd started using it.
- Path lookup optimizations extracting slowpaths into dedicated
routines and adding branch prediction hints for mntput_no_expire(),
fd_install(), lookup_slow(), and various other hot paths.
- Enable clang's -fms-extensions flag, requiring a JFS rename to
avoid conflicts.
- Remove spurious exports in fs/file_attr.c.
- Stop duplicating union pipe_index declaration. This depends on the
shared kbuild branch that brings in -fms-extensions support which
is merged into this branch.
- Use MD5 library instead of crypto_shash in ecryptfs.
- Use largest_zero_folio() in iomap_dio_zero().
- Replace simple_strtol/strtoul with kstrtoint/kstrtouint in init and
initrd code.
- Various typo fixes.
Fixes:
- Fix emergency sync for btrfs. Btrfs requires an explicit sync_fs()
call with wait == 1 to commit super blocks. The emergency sync path
never passed this, leaving btrfs data uncommitted during emergency
sync.
- Use local kmap in watch_queue's post_one_notification().
- Add hint prints in sb_set_blocksize() for LBS dependency on THP"
* tag 'vfs-6.19-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (35 commits)
MAINTAINERS: add German Maglione as virtiofs co-maintainer
fs: inline step_into() and walk_component()
fs: tidy up step_into() & friends before inlining
orangefs: use inode_update_timestamps directly
btrfs: fix the comment on btrfs_update_time
btrfs: use vfs_utimes to update file timestamps
fs: export vfs_utimes
fs: lift the FMODE_NOCMTIME check into file_update_time_flags
fs: refactor file timestamp update logic
include/linux/fs.h: trivial fix: regualr -> regular
fs/splice.c: trivial fix: pipes -> pipe's
fs: mark lookup_slow() as noinline
fs: add predicts based on nd->depth
fs: move mntput_no_expire() slowpath into a dedicated routine
fs: remove spurious exports in fs/file_attr.c
watch_queue: Use local kmap in post_one_notification()
fs: touch up predicts in path lookup
fs: move fd_install() slowpath into a dedicated routine and provide commentary
fs: hide dentry_cache behind runtime const machinery
fs: touch predicts in do_dentry_open()
...
mutex_init() invokes __mutex_init() providing the name of the lock and
a pointer to a the lock class. With LOCKDEP enabled this information is
useful but without LOCKDEP it not used at all. Passing the pointer
information of the lock class might be considered negligible but the
name of the lock is passed as well and the string is stored. This
information is wasting storage.
Split __mutex_init() into a _genereic() variant doing the initialisation
of the lock and a _lockdep() version which does _genereic() plus the
lockdep bits. Restrict the lockdep version to lockdep enabled builds
allowing the compiler to remove the unused parameter.
This results in the following size reduction:
text data bss dec filename
| 30237599 8161430 1176624 39575653 vmlinux.defconfig
| 30233269 8149142 1176560 39558971 vmlinux.defconfig.patched
-4.2KiB -12KiB
| 32455099 8471098 12934684 53860881 vmlinux.defconfig.lockdep
| 32455100 8471098 12934684 53860882 vmlinux.defconfig.patched.lockdep
| 27152407 7191822 2068040 36412269 vmlinux.defconfig.preempt_rt
| 27145937 7183630 2067976 36397543 vmlinux.defconfig.patched.preempt_rt
-6.3KiB -8KiB
| 29382020 7505742 13784608 50672370 vmlinux.defconfig.preempt_rt.lockdep
| 29376229 7505742 13784544 50666515 vmlinux.defconfig.patched.preempt_rt.lockdep
-5.6KiB
[peterz: folded fix from boqun]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Waiman Long <longman@redhat.com>
Link: https://lkml.kernel.org/r/20251125145425.68319-1-boqun.feng@gmail.com
Link: https://patch.msgid.link/20251105142350.Tfeevs2N@linutronix.de
- In order to prepare the layout for nohz_full work deferral to
user exit, the context tracking state must shrink the counter
of transitions to/from RCU not watching. The only possible hazard
is to trigger wrap-around more easily, delaying a bit grace periods
when that happens. This should be a rare event though. Yet add
debugging and torture code to test that assumption.
- Fix memory leak on locktorture module
- Annotate accesses in rculist_nulls.h to prevent from KCSAN warnings.
On recent discussions, we also concluded that all those WRITE_ONCE()
and READ_ONCE() on list APIs deserve appropriate comments. Something
to be expected for the next cycle.
- Provide a script to apply several configs to several commits with torture.
- Allow torture to reuse a build directory in order to save needless
rebuild time.
- Various cleanups.
error code on failure instead of success
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmksJ6kACgkQEsHwGGHe
VUqxARAAqEqj33/ErgPpytnpRkuJwGCGeSO879MMAoLTFpmALmoQZlqhA+WF1xYw
T5QF7nGlBeW/GwFAVlm1riqI6iPk2iUZWczS9bq0wjtR0XNHReCqLGdRe+4DHEjb
PtyXMrrQgq5FuvHaIkuSUxBKtyHn+KMObldxp2azpO8nfUubDSCCgw9aQfm2osHi
JIldKOPdFKRv8uLnBBbI2A+6cFrqXr/ptAzx8C3hov7sN0jBODOitb7fNerEb+T4
U5awKzpK3WzcilyrlEZMCFXRV5e3naTgnUDd8/7t9m0SA10K5oSIM1zggVLRWN8Z
AA7XMpbRAl1S23GUVjSAM/vOMmaJJAuIMJ4ZXIKoTFVQJ3kenzY4GtDdObh05ID4
971y0hXrko74Qq42oME4UHSKexcR4vIBwXC0J9g04EbtKQ/sm6uhoUlb2Xb/UfxR
CPxzDqy1pFkjuMG49yu3JE2fUHOryiRcNUUgas2PBKjKG0KyZPcJWa/fRsk5LlLy
oeUXx/hXOq6Nw0ydyzI7iJgAM5SbM4A5p/hxlWmQnfXJ2EQdmkORTmMkloaUvB7u
vWaOd9I+D3iSzNaw4zP0kzgPRX70PCRn1qakT3zQ340c9vNT/lYQjRelK7gfYZF6
HOBID2Pl3+pqUTfli6qxbXsgqH4hbCWSomDbY9t2AFK6h6K2l8Q=
=CFYj
-----END PGP SIGNATURE-----
Merge tag 'timers_urgent_for_v6.18_rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Borislav Petkov:
- Have timekeeping aux clocks sysfs interface setup function return an
error code on failure instead of success
* tag 'timers_urgent_for_v6.18_rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timekeeping: Fix error code in tk_aux_sysfs_init()
The __mm_flags_set_word() function is slightly ambiguous - we use 'set' to
refer to setting individual bits (such as in mm_flags_set()) but here we
use it to refer to overwriting the value altogether.
Rename it to __mm_flags_overwrite_word() to eliminate this ambiguity.
We additionally simplify the functions, eliminating unnecessary
bitmap_xxx() operations (the compiler would have optimised these out but
it's worth being as clear as we can be here).
Link: https://lkml.kernel.org/r/8f0bc556e1b90eca8ea5eba41f8d5d3f9cd7c98a.1764064557.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Acked-by: Alice Ryhl <aliceryhl@google.com> [rust]
Cc: Alex Gaynor <alex.gaynor@gmail.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andreas Hindborg <a.hindborg@kernel.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Björn Roy Baron <bjorn3_gh@protonmail.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Chris Li <chrisl@kernel.org>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Gary Guo <gary@garyguo.net>
Cc: Gregory Price <gourry@gourry.net>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Kairui Song <kasong@tencent.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Mathew Brost <matthew.brost@intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Trevor Gross <tmgross@umich.edu>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Wei Xu <weixugc@google.com>
Cc: xu xin <xu.xin16@zte.com.cn>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Updating a BPF_MAP_TYPE_HASH_OF_MAPS or BPF_MAP_TYPE_ARRAY_OF_MAPS via
bpf_map_update_elem() is very expensive.
In one of our workloads, we're inserting ~1400 maps of type
BPF_MAP_TYPE_ARRAY into a BPF_MAP_TYPE_ARRAY_OF_MAPS. This takes ~21
seconds on a single thread, with an average of ~15ms per call:
Function Name: map_update_elem
Number of calls: 1369
Total time: 21s 182ms 966µs
Maximum: 47ms 937µs
Average: 15ms 473µs
Minimum: 7µs
Profiling shows that nearly all of this time is going to synchronize_rcu(),
via maybe_wait_bpf_programs() in map_update_elem().
The call to synchronize_rcu() is done to ensure that after
bpf_map_update_elem() returns, no BPF programs are still looking at the old
value of the map, per commit 1ae80cf319 ("bpf: wait for running BPF
programs when updating map-in-map").
As discussed on the bpf mailing list, replace synchronize_rcu() with
synchronize_rcu_expedited(). This is 175x faster: it now takes an average
of 88 microseconds per call, for a total of 127 milliseconds in the same
benchmark:
Function Name: map_update_elem
Number of calls: 1439
Total time: 127ms 626µs
Maximum: 445µs
Average: 88µs
Minimum: 10µs
Link: https://lore.kernel.org/bpf/CAH6OuBR=w2kybK6u7aH_35B=Bo1PCukeMZefR=7V4Z2tJNK--Q@mail.gmail.com/
Signed-off-by: Ritesh Oedayrajsingh Varma <ritesh@superluminal.eu>
Link: https://lore.kernel.org/r/20251128000422.20462-1-ritesh@superluminal.eu
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
While previous commits sufficiently address the deadlocks, there are
still scenarios where queueing of waiters in NMIs can exacerbate the
possibility of timeouts.
Consider the case below:
CPU 0
<NMI>
res_spin_lock(A) -> becomes non-head waiter
</NMI>
lock owner in CS or pending waiter spinning
CPU 1
res_spin_lock(A) -> head waiter spinning on owner/pending bits
In such a scenario, the non-head waiter in NMI on CPU 0 will not poll
for deadlocks or timeout since it will simply queue behind previous
waiter (head on CPU 1), and also not enter the trylock fallback since
no rqspinlock queue waiter is active on CPU 0. In such a scenario, the
transaction initiated by the head waiter on CPU 1 will timeout,
signalling the NMI and ending the cyclic dependency, but it will cost
250 ms of time.
Instead, the NMI on CPU 0 could simply check for the presence of an AA
deadlock and only proceed with queueing on success. Add such a check
right before any form of queueing is initiated.
The reason the AA deadlock check is not used in conjunction with
in_nmi() is that a similar case could occur due to a reentrant path
in the owner's critical section, and unconditionally checking for AA
before entering the queueing path avoids expensive timeouts. Non-NMI
reentrancy only happens at controlled points in the slow path (with
specific tracepoints which do not impede the forward progress of a
waiter loop), or in the owner CS, while NMIs can land anywhere.
While this check is only needed for non-head waiter queueing, checking
whether we are head or not is racy without xchg_tail, and after that
point, we are already queued, hence for simplicity we must invoke the
check unconditionally.
Note that a more contrived case could still be constructed by using two
locks, and interrupting the progress of the respective owners by
non-head waiters of the other lock, in an ABBA fashion, which would
still not be covered by the current set of checks and conditions. It
would still lead to a timeout though, and not a deadlock. An ABBA check
cannot happen optimistically before the queueing, since it can be racy,
and needs to be happen continuously during the waiting period, which
would then require an unlinking step for queued NMI/reentrant waiters.
This is beyond the scope of this patch.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20251128232802.1031906-6-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The original trylock fallback was inherited from qspinlock, and then
reused for the reentrant NMIs while the slow path is active. However,
under contention, it is very unlikely for the trylock to succeed in
taking the lock. In addition, a trylock also has no fairness guarantees,
and thus is prone to starvation issues under extreme scenarios.
The original qspinlock had no choice in terms of returning an error the
caller; if the node count was breached, it had to fall back to trylock
to attempt to take the lock. In case of rqspinlock, we do have the
option of returning to the user. Thus, simply attempt the trylock once,
and instead of spinning, return an error in case the lock cannot be
taken.
This ends up significantly reducing the time spent in the trylock
fallback, since we no longer wait for the timeout duration trying to
aimlessly acquire the lock when there's a high-probability that under
contention, it won't be available to us anyway.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20251128232802.1031906-5-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
In addition to deferring to the trylock fallback in NMIs, only do so
when an rqspinlock waiter is queued on the current CPU. This is detected
by noticing a non-zero node index. This allows NMI waiters to join the
waiter queue if it isn't interrupting an existing rqspinlock waiter, and
increase the chances of fairly obtaining the lock, performing deadlock
detection as the head, and not being starved while attempting the
trylock.
The trylock path in particular is unlikely to succeed under contention,
as it relies on the lock word becoming 0, which indicates no contention.
This means that the most likely result for NMIs attempting a trylock is
a timeout under contention if they don't hit an AA or ABBA case.
The core problem being addressed through the fixed commit was removing
the dependency edge between an NMI queue waiter and the queue waiter it
is interrupting. Whenever a circular dependency forms, and with no way
to break it (as non-head waiters don't poll for deadlocks or timeouts),
we would enter into a deadlock. A trylock either breaks such an edge by
probing for deadlocks, and finally terminating the waiting loop using a
timeout.
By excluding queueing on CPUs where the node index is non-zero for NMIs,
this sort of dependency is broken. The CPU enters the trylock path for
those cases, and falls back to deadlock checks and timeouts. However, in
other case where it doesn't interrupt the CPU in the slow path while its
queued on the lock, it can join the queue as a normal waiter, and avoid
trylock associated starvation and subsequent timeouts.
There are a few remaining cases here that matter: the NMI can still
preempt the owner in its critical section, and if it queues as a
non-head waiter, it can end up impeding the progress of the owner. While
this won't deadlock, since the head waiter will eventually signal the
NMI waiter to either stop (due to a timeout), it can still lead to long
timeouts. These gaps will be addressed in subsequent commits.
Note that while the node count detection approach is less conservative
than simply deferring NMIs to trylock, it is going to return errors
where attempts to lock B in NMI happen while waiters for lock A are in a
lower context on the same CPU. However, this only occurs when the lower
context is queued in the slow path, and the NMI attempt can proceed
without failure in all other cases. To continue to prevent AA deadlocks
(or ABBA in a similar NMI interrupting lower context pattern), we'd need
a more fleshed out algorithm to unlink NMI waiters after they queue and
detect such cases. However, all that complexity isn't appealing yet to
reduce the failure rate in the small window inside the slow path.
It is important to note that reentrancy in the slow path can also happen
through trace_contention_{begin,end}, but in those cases, unlike an NMI,
the forward progress of the head waiter (or the predecessor in general)
is not being blocked.
Fixes: 0d80e7f951 ("rqspinlock: Choose trylock fallback for NMI waiters")
Reported-by: Ritesh Oedayrajsingh Varma <ritesh@superluminal.eu>
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20251128232802.1031906-4-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Currently, while we enter the check_timeout call immediately due to the
way the ts.spin is initialized, we still invoke the AA and ABBA checks
in the second invocation, and only initialize the timestamp in the first
one. Since each iteration is at least done with a 1ms delay, this can
add delays in detection of AA deadlocks, up to a ms.
Rework check_timeout() to avoid this. First, call check_deadlock_AA()
while initializing the timestamps for the wait period. This also means
that we only do it once per waiting period, instead of every invocation.
Finally, drop check_deadlock() and call check_deadlock_ABBA() directly.
To save on unnecessary ktime_get_mono_fast_ns() in case of AA deadlock,
sample the time only if it returns 0.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20251128232802.1031906-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Ritesh reported that timeouts occurred frequently for rqspinlock despite
reentrancy on the same lock on the same CPU in [0]. This patch closes
one of the races leading to this behavior, and reduces the frequency of
timeouts.
We currently have a tiny window between the fast-path cmpxchg and the
grabbing of the lock entry where an NMI could land, attempt the same
lock that was just acquired, and end up timing out. This is not ideal.
Instead, move the lock entry acquisition from the fast path to before
the cmpxchg, and remove the grabbing of the lock entry in the slow path,
assuming it was already taken by the fast path. The TAS fallback is
invoked directly without being preceded by the typical fast path,
therefore we must continue to grab the deadlock detection entry in that
case.
Case on lock leading to missed AA:
cmpxchg lock A
<NMI>
... rqspinlock acquisition of A
... timeout
</NMI>
grab_held_lock_entry(A)
There is a similar case when unlocking the lock. If the NMI lands
between the WRITE_ONCE and smp_store_release, it is possible that we end
up in a situation where the NMI fails to diagnose the AA condition,
leading to a timeout.
Case on unlock leading to missed AA:
WRITE_ONCE(rqh->locks[rqh->cnt - 1], NULL)
<NMI>
... rqspinlock acquisition of A
... timeout
</NMI>
smp_store_release(A->locked, 0)
The patch changes the order on unlock to smp_store_release() succeeded
by WRITE_ONCE() of NULL. This avoids the missed AA detection described
above, but may lead to a false positive if the NMI lands between these
two statements, which is acceptable (and preferred over a timeout).
The original intention of the reverse order on unlock was to prevent the
following possible misdiagnosis of an ABBA scenario:
grab entry A
lock A
grab entry B
lock B
unlock B
smp_store_release(B->locked, 0)
grab entry B
lock B
grab entry A
lock A
! <detect ABBA>
WRITE_ONCE(rqh->locks[rqh->cnt - 1], NULL)
If the store release were is after the WRITE_ONCE, the other CPU would
not observe B in the table of the CPU unlocking the lock B. However,
since the threads are obviously participating in an ABBA deadlock, it
is no longer appealing to use the order above since it may lead to a
250 ms timeout due to missed AA detection.
[0]: https://lore.kernel.org/bpf/CAH6OuBTjG+N=+GGwcpOUbeDN563oz4iVcU3rbse68egp9wj9_A@mail.gmail.com
Fixes: 0d80e7f951 ("rqspinlock: Choose trylock fallback for NMI waiters")
Reported-by: Ritesh Oedayrajsingh Varma <ritesh@superluminal.eu>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20251128232802.1031906-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
A use-after-free bug may be triggered by calling bpf_inode_storage_get()
in a BPF LSM program hooked to file_alloc_security. Disable the hook to
prevent this from happening.
The cause of the bug is shown in the trace below. In alloc_file(), a
file struct is first allocated through kmem_cache_alloc(). Then,
file_alloc_security hook is invoked. Since the zero initialization or
assignment of f->f_inode happen after this LSM hook, a BPF program may
get a dangeld inode pointer by walking the file struct.
alloc_file()
-> alloc_empty_file()
-> f = kmem_cache_alloc()
-> init_file()
-> security_file_alloc() // f->f_inode not init-ed yet!
-> f->f_inode = NULL;
-> file_init_path()
-> f->f_inode = path->dentry->d_inode
Reported-by: Kaiyan Mei <M202472210@hust.edu.cn>
Reported-by: Yinhao Hu <dddddd@hust.edu.cn>
Reported-by: Dongliang Mu <dzm91@hust.edu.cn>
Closes: https://lore.kernel.org/bpf/1d2d1968.47cd3.19ab9528e94.Coremail.kaiyanm@hust.edu.cn/
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Link: https://lore.kernel.org/r/20251126202927.2584874-1-ameryhung@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Do not abuse the strict_alignment_once flag, and check if the map is
an instruction array inside the check_ptr_alignment() function.
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Link: https://lore.kernel.org/r/20251128063224.1305482-3-a.s.protopopov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The original implementation added a hack to check_mem_access()
to prevent programs from writing into insn arrays. To get rid
of this hack, enforce BPF_F_RDONLY_PROG on map creation.
Also fix the corresponding selftest, as the error message changes
with this patch.
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Link: https://lore.kernel.org/r/20251128063224.1305482-2-a.s.protopopov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add performance testing for common context synchronizations
(Preemption, IRQ, Softirq) and per-cpu increments. Those are
relevant comparisons against SRCU-fast read side APIs, especially
as they are planned to synchronize further tracing fast-path code.
Merge PM QoS updates and a cpupower utility update for 6.19-rc1:
- Introduce and document a QoS limit on CPU exit latency during wakeup
from suspend-to-idle (Ulf Hansson)
- Add support for building libcpupower statically (Zuo An)
* pm-qos:
Documentation: power/cpuidle: Document the CPU system wakeup latency QoS
cpuidle: Respect the CPU system wakeup QoS limit for cpuidle
sched: idle: Respect the CPU system wakeup QoS limit for s2idle
pmdomain: Respect the CPU system wakeup QoS limit for cpuidle
pmdomain: Respect the CPU system wakeup QoS limit for s2idle
PM: QoS: Introduce a CPU system wakeup QoS limit
* pm-tools:
tools/power/cpupower: Support building libcpupower statically
Merge energy model management updates and operating performance points
(OPP) library changes for 6.19-rc1:
- Add support for sending netlink notifications to user space on energy
model updates (Changwoo Mini, Peng Fan)
- Minor improvements to the Rust OPP interface (Tamir Duberstein)
- Fixes to scope-based pointers in the OPP library (Viresh Kumar)
* pm-em:
PM: EM: Add to em_pd_list only when no failure
PM: EM: Notify an event when the performance domain changes
PM: EM: Implement em_notify_pd_created/updated()
PM: EM: Implement em_notify_pd_deleted()
PM: EM: Implement em_nl_get_pd_table_doit()
PM: EM: Implement em_nl_get_pds_doit()
PM: EM: Add an iterator and accessor for the performance domain
PM: EM: Add a skeleton code for netlink notification
PM: EM: Add em.yaml and autogen files
PM: EM: Expose the ID of a performance domain via debugfs
PM: EM: Assign a unique ID when creating a performance domain
* pm-opp:
rust: opp: simplify callers of `to_c_str_array`
OPP: Initialize scope-based pointers inline
rust: opp: fix broken rustdoc link
Merge updates related to system suspend and hibernation for 6.19-rc1:
- Replace snprintf() with scnprintf() in show_trace_dev_match()
(Kaushlendra Kumar)
- Fix memory allocation error handling in pm_vt_switch_required()
(Malaya Kumar Rout)
- Introduce CALL_PM_OP() macro and use it to simplify code in
generic PM operations (Kaushlendra Kumar)
- Add module param to backtrace all CPUs in the device power management
watchdog (Sergey Senozhatsky)
- Rework message printing in swsusp_save() (Rafael Wysocki)
- Make it possible to change the number of hibernation compression
threads (Xueqin Luo)
- Clarify that only cgroup1 freezer uses PM freezer (Tejun Heo)
- Add document on debugging shutdown hangs to PM documentation and
correct a mistaken configuration option in it (Mario Limonciello)
- Shut down wakeup source timer before removing the wakeup source from
the list (Kaushlendra Kumar, Rafael Wysocki)
- Introduce new PMSG_POWEROFF event for system shutdown handling with
the help of PM device callbacks (Mario Limonciello)
- Make pm_test delay interruptible by wakeup events (Riwen Lu)
- Clean up kernel-doc comment style usage in the core hibernation
code and remove unuseful comments from it (Sunday Adelodun, Rafael
Wysocki)
- Add support for handling wakeup events and aborting the suspend
process while it is syncing file systems (Samuel Wu, Rafael Wysocki)
* pm-sleep: (21 commits)
PM: hibernate: Extra cleanup of comments in swap handling code
PM: sleep: Call pm_sleep_fs_sync() instead of ksys_sync_helper()
PM: sleep: Add support for wakeup during filesystem sync
PM: hibernate: Clean up kernel-doc comment style usage
PM: suspend: Make pm_test delay interruptible by wakeup events
usb: sl811-hcd: Add PM_EVENT_POWEROFF into suspend callbacks
scsi: Add PM_EVENT_POWEROFF into suspend callbacks
PM: Introduce new PMSG_POWEROFF event
PM: wakeup: Update after recent wakeup source removal ordering change
PM: wakeup: Delete timer before removing wakeup source from list
Documentation: power: Correct a mistaken configuration option
Documentation: power: Add document on debugging shutdown hangs
freezer: Clarify that only cgroup1 freezer uses PM freezer
PM: hibernate: add sysfs interface for hibernate_compression_threads
PM: hibernate: make compression threads configurable
PM: hibernate: dynamically allocate crc->unc_len/unc for configurable threads
PM: hibernate: Rework message printing in swsusp_save()
PM: dpm_watchdog: add module param to backtrace all CPUs
PM: sleep: Introduce CALL_PM_OP() macro to simplify code
PM: console: Fix memory allocation error handling in pm_vt_switch_required()
...
Merge a core power management update and runtime PM framework updates
for 6.19-rc1:
- Add WQ_UNBOUND to pm_wq workqueue (Marco Crivellari)
- Add runtime PM wrapper macros for ACQUIRE()/ACQUIRE_ERR() and use
them in the PCI core and the ACPI TAD driver (Rafael Wysocki)
- Improve runtime PM in the ACPI TAD driver (Rafael Wysocki)
- Update pm_runtime_allow/forbid() documentation (Rafael Wysocki)
- Fix typos in runtime.c comments (Malaya Kumar Rout)
* pm-core:
PM: WQ_UNBOUND added to pm_wq workqueue
* pm-runtime:
PCI/sysfs: Use PM_RUNTIME_ACQUIRE()/PM_RUNTIME_ACQUIRE_ERR()
ACPI: TAD: Use PM_RUNTIME_ACQUIRE()/PM_RUNTIME_ACQUIRE_ERR()
PM: runtime: Wrapper macros for ACQUIRE()/ACQUIRE_ERR()
PM: runtime: fix typos in runtime.c comments
ACPI: TAD: Improve runtime PM using guard macros
ACPI: TAD: Rearrange runtime PM operations in acpi_tad_remove()
PM: runtime: docs: Update pm_runtime_allow/forbid() documentation
This commit adds refscale readers based on srcu_read_lock_fast_updown()
and srcu_read_unlock_fast_updown()
("refscale.scale_type=srcu-fast-updown"). On my x86 laptop, these are
about 2.2ns per pair.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: <bpf@vger.kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
- two last minute fixes for the recently modified DMA API infrastructure:
a proper handling of DMA_ATTR_MMIO in dma_iova_unlink() function (me) and
a regression fix for the code refactoring related to P2PDMA (Pranjal
Shrivastava)
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSrngzkoBtlA8uaaJ+Jp1EFxbsSRAUCaSh6UAAKCRCJp1EFxbsS
RCYGAP94GdoTEBmQL1dXLooASbHHAyB5xrd8Di1H/Aqd7maCvAEAgT2+BRPyEtf+
wiz4mwKMG1RrF9Cj4XqPqJ2NDU1zdQ8=
=ngH/
-----END PGP SIGNATURE-----
Merge tag 'dma-mapping-6.18-2025-11-27' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux
Pull dma-mapping fixes from Marek Szyprowski:
"Two last minute fixes for the recently modified DMA API infrastructure:
- proper handling of DMA_ATTR_MMIO in dma_iova_unlink() function (me)
- regression fix for the code refactoring related to P2PDMA (Pranjal
Shrivastava)"
* tag 'dma-mapping-6.18-2025-11-27' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
dma-direct: Fix missing sg_dma_len assignment in P2PDMA bus mappings
iommu/dma: add missing support for DMA_ATTR_MMIO for dma_iova_unlink()
The trace_marker_raw file in tracefs takes a buffer from user space that
contains an id as well as a raw data string which is usually a binary
structure. The structure used has the following:
struct raw_data_entry {
struct trace_entry ent;
unsigned int id;
char buf[];
};
Since the passed in "cnt" variable is both the size of buf as well as the
size of id, the code to allocate the location on the ring buffer had:
size = struct_size(entry, buf, cnt - sizeof(entry->id));
Which is quite ugly and hard to understand. Instead, add a helper macro
called struct_offset() which then changes the above to a simple and easy
to understand:
size = struct_offset(entry, id) + cnt;
This will likely come in handy for other use cases too.
Link: https://lore.kernel.org/all/CAHk-=whYZVoEdfO1PmtbirPdBMTV9Nxt9f09CK0k6S+HJD3Zmg@mail.gmail.com/
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Gustavo A. R. Silva" <gustavoars@kernel.org>
Link: https://patch.msgid.link/20251126145249.05b1770a@gandalf.local.home
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Kees Cook <kees@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Commit 97523a4edb ("kernel/resource: remove first_lvl / siblings_only
logic") removed an optimization introduced by commit 756398750e
("resource: avoid unnecessary lookups in find_next_iomem_res()"). That
was not called out in the message of the first commit explicitly so it's
not entirely clear whether removing the optimization happened
inadvertently or not.
As the original commit message of the optimization explains there is no
point considering the children of a subtree in find_next_iomem_res() if
the top level range does not match.
Reinstating the optimization results in performance improvements in
systems where /proc/iomem is ~5k lines long. Calling mmap() on /dev/mem
in such platforms takes 700-1500μs without the optimisation and 10-50μs
with the optimisation.
Note that even though commit 97523a4edb removed the 'sibling_only'
parameter from next_resource(), newer kernels have basically reinstated it
under the name 'skip_children'.
Link: https://lore.kernel.org/all/20251124165349.3377826-1-ilstam@amazon.com/T/#u
Fixes: 97523a4edb ("kernel/resource: remove first_lvl / siblings_only logic")
Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Cc: Andriy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: "Huang, Ying" <huang.ying.caritas@gmail.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Introduce a generic infrastructure for tracking recoverable hardware
errors (HW errors that are visible to the OS but does not cause a panic)
and record them for vmcore consumption. This aids post-mortem crash
analysis tools by preserving a count and timestamp for the last occurrence
of such errors. On the other side, correctable errors, which the OS
typically remains unaware of because the underlying hardware handles them
transparently, are less relevant for crash dump and therefore are NOT
tracked in this infrastructure.
Add centralized logging for sources of recoverable hardware errors based
on the subsystem it has been notified.
hwerror_data is write-only at kernel runtime, and it is meant to be read
from vmcore using tools like crash/drgn. For example, this is how it
looks like when opening the crashdump from drgn.
>>> prog['hwerror_data']
(struct hwerror_info[1]){
{
.count = (int)844,
.timestamp = (time64_t)1752852018,
},
...
This helps fleet operators quickly triage whether a crash may be
influenced by hardware recoverable errors (which executes a uncommon code
path in the kernel), especially when recoverable errors occurred shortly
before a panic, such as the bug fixed by commit ee62ce7a1d ("page_pool:
Track DMA-mapped pages and unmap them when destroying the pool")
This is not intended to replace full hardware diagnostics but provides a
fast way to correlate hardware events with kernel panics quickly.
Rare machine check exceptions—like those indicated by mce_flags.p5 or
mce_flags.winchip—are not accounted for in this method, as they fall
outside the intended usage scope for this feature's user base.
[leitao@debian.org: add hw-recoverable-errors to toctree]
Link: https://lkml.kernel.org/r/20251127-vmcoreinfo_fix-v1-1-26f5b1c43da9@debian.org
Link: https://lkml.kernel.org/r/20251010-vmcore_hw_error-v5-1-636ede3efe44@debian.org
Signed-off-by: Breno Leitao <leitao@debian.org>
Suggested-by: Tony Luck <tony.luck@intel.com>
Suggested-by: Shuai Xue <xueshuai@linux.alibaba.com>
Reviewed-by: Shuai Xue <xueshuai@linux.alibaba.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com> [APEI]
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Bob Moore <robert.moore@intel.com>
Cc: Borislav Betkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morse <james.morse@arm.com>
Cc: Konrad Rzessutek Wilk <konrad.wilk@oracle.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Mahesh Salgaonkar <mahesh@linux.ibm.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: "Oliver O'Halloran" <oohall@gmail.com>
Cc: Omar Sandoval <osandov@osandov.com>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When contiguous ranges of order-0 pages are restored, kho_restore_page()
calls prep_compound_page() with the first page in the range and order as
parameters and then kho_restore_pages() calls split_page() to make sure
all pages in the range are order-0.
However, since split_page() is not intended to split compound pages and
with VM_DEBUG enabled it will trigger a VM_BUG_ON_PAGE().
Update kho_restore_page() so that it will use prep_compound_page() when it
restores a folio and make sure it properly sets page count for both large
folios and ranges of order-0 pages.
Link: https://lkml.kernel.org/r/20251125110917.843744-3-rppt@kernel.org
Fixes: a667300bd5 ("kho: add support for preserving vmalloc allocations")
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reported-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "kho: fixes for vmalloc restoration".
Pratyush reported off-list that when kho_restore_vmalloc() is used to
restore a vmalloc_huge() allocation it hits VM_BUG_ON() when we
reconstruct the struct pages in kho_restore_pages().
These patches fix the issue.
This patch (of 2):
In case a preserved vmalloc allocation was using huge pages, all pages in
the array of pages added to vm_struct during kho_restore_vmalloc() are
wrongly set to the same page.
Fix the indexing when assigning pages to that array.
Link: https://lkml.kernel.org/r/20251125110917.843744-1-rppt@kernel.org
Link: https://lkml.kernel.org/r/20251125110917.843744-2-rppt@kernel.org
Fixes: a667300bd5 ("kho: add support for preserving vmalloc allocations")
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When booting with debug_pagealloc=on while having:
CONFIG_KEXEC_HANDOVER_ENABLE_DEFAULT=y
CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF=n
the system fails to boot due to page faults during kmemleak scanning.
This occurs because:
With debug_pagealloc is enabled, __free_pages() invokes
debug_pagealloc_unmap_pages(), clearing the _PAGE_PRESENT bit for freed
pages in the kernel page table. KHO scratch areas are allocated from
memblock and noted by kmemleak. But these areas don't remain reserved but
released later to the page allocator using init_cma_reserved_pageblock().
This causes subsequent kmemleak scans access non-PRESENT pages, leading to
fatal page faults.
Mark scratch areas with kmemleak_ignore_phys() after they are allocated
from memblock to exclude them from kmemleak scanning before they are
released to buddy allocator to fix this.
[ran.xiaokai@zte.com.cn: add comment]
Link: https://lkml.kernel.org/r/20251127122700.103927-1-ranxiaokai627@163.com
Link: https://lkml.kernel.org/r/20251122182929.92634-1-ranxiaokai627@163.com
Signed-off-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Changyuan Lyu <changyuanl@google.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "kexec: reorganize kexec and kdump sysfs", v6.
All existing kexec and kdump sysfs entries are moved to a new location,
/sys/kernel/kexec, to keep /sys/kernel/ clean and better organized.
Symlinks are created at the old locations for backward compatibility and
can be removed in the future [01/03].
While doing this cleanup, the old kexec and kdump sysfs entries are
marked as deprecated in the existing ABI documentation [02/03]. This
makes it clear that these older interfaces should no longer be used.
New ABI documentation is added to describe the reorganized interfaces
[03/03], so users and tools can rely on the updated sysfs interfaces
going forward.
This patch (of 3):
Several kexec and kdump sysfs entries are currently placed directly under
/sys/kernel/, which clutters the directory and makes it harder to identify
unrelated entries. To improve organization and readability, these entries
are now moved under a dedicated directory, /sys/kernel/kexec.
The following sysfs moved under new kexec sysfs node
+------------------------------------+------------------+
| Old sysfs name | New sysfs name |
| (under /sys/kernel) | (under /sys/kernel/kexec) |
+---------------------------+---------------------------+
| kexec_loaded | loaded |
+---------------------------+---------------------------+
| kexec_crash_loaded | crash_loaded |
+---------------------------+---------------------------+
| kexec_crash_size | crash_size |
+---------------------------+---------------------------+
| crash_elfcorehdr_size | crash_elfcorehdr_size |
+---------------------------+---------------------------+
| kexec_crash_cma_ranges | crash_cma_ranges |
+---------------------------+---------------------------+
For backward compatibility, symlinks are created at the old locations so
that existing tools and scripts continue to work. These symlinks can be
removed in the future once users have switched to the new path.
While creating symlinks, entries are added in /sys/kernel/ that point to
their new locations under /sys/kernel/kexec/. If an error occurs while
adding a symlink, it is logged but does not stop initialization of the
remaining kexec sysfs symlinks.
The /sys/kernel/<crash_elfcorehdr_size | kexec/crash_elfcorehdr_size>
entry is now controlled by CONFIG_CRASH_DUMP instead of
CONFIG_VMCORE_INFO, as CONFIG_CRASH_DUMP also enables CONFIG_VMCORE_INFO.
Link: https://lkml.kernel.org/r/20251118114507.1769455-1-sourabhjain@linux.ibm.com
Link: https://lkml.kernel.org/r/20251118114507.1769455-2-sourabhjain@linux.ibm.com
Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Aditya Gupta <adityag@linux.ibm.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Hari Bathini <hbathini@linux.ibm.com>
Cc: Jiri Bohac <jbohac@suse.cz>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Mahesh J Salgaonkar <mahesh@linux.ibm.com>
Cc: Pingfan Liu <piliu@redhat.com>
Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Cc: Shivang Upadhyay <shivangu@linux.ibm.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Before commit fa759cd75b ("kho: allocate metadata directly from the
buddy allocator"), the chunks were allocated from the slab allocator using
kzalloc(). Those were rightly freed using kfree().
When the commit switched to using the buddy allocator directly, it missed
updating kho_mem_ser_free() to use free_page() instead of kfree().
Link: https://lkml.kernel.org/r/20251118182218.63044-1-pratyush@kernel.org
Fixes: fa759cd75b ("kho: allocate metadata directly from the buddy allocator")
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: David Matlack <dmatlack@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently file handlers only get the serialized_data field to store their
state. This field has a pointer to the serialized state of the file, and
it becomes a part of LUO file's serialized state.
File handlers can also need some runtime state to track information that
shouldn't make it in the serialized data.
One such example is a vmalloc pointer. While kho_preserve_vmalloc()
preserves the memory backing a vmalloc allocation, it does not store the
original vmap pointer, since that has no use being passed to the next
kernel. The pointer is needed to free the memory in case the file is
unpreserved.
Provide a private field in struct luo_file and pass it to all the
callbacks. The field's can be set by preserve, and must be freed by
unpreserve.
Link: https://lkml.kernel.org/r/20251125165850.3389713-14-pasha.tatashin@soleen.com
Signed-off-by: Pratyush Yadav <ptyadav@amazon.de>
Co-developed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Tested-by: David Matlack <dmatlack@google.com>
Cc: Aleksander Lobakin <aleksander.lobakin@intel.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Andriy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: anish kumar <yesanishhere@gmail.com>
Cc: Anna Schumaker <anna.schumaker@oracle.com>
Cc: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Chanwoo Choi <cw00.choi@samsung.com>
Cc: Chen Ridong <chenridong@huawei.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Daniel Wagner <wagi@kernel.org>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Jeffery <djeffery@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guixin Liu <kanie@linux.alibaba.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joanthan Cameron <Jonathan.Cameron@huawei.com>
Cc: Joel Granados <joel.granados@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Leon Romanovsky <leonro@nvidia.com>
Cc: Lukas Wunner <lukas@wunner.de>
Cc: Marc Rutland <mark.rutland@arm.com>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Matthew Maurer <mmaurer@google.com>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Myugnjoo Ham <myungjoo.ham@samsung.com>
Cc: Parav Pandit <parav@nvidia.com>
Cc: Pratyush Yadav <pratyush@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Saeed Mahameed <saeedm@nvidia.com>
Cc: Samiullah Khawaja <skhawaja@google.com>
Cc: Song Liu <song@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Stuart Hayes <stuart.w.hayes@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Thomas Weißschuh <linux@weissschuh.net>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: William Tu <witu@nvidia.com>
Cc: Yoann Congal <yoann.congal@smile.fr>
Cc: Zhu Yanjun <yanjun.zhu@linux.dev>
Cc: Zijun Hu <quic_zijuhu@quicinc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Introducing the userspace interface and internal logic required to manage
the lifecycle of file descriptors within a session. Previously, a session
was merely a container; this change makes it a functional management unit.
The following capabilities are added:
A new set of ioctl commands are added, which operate on the file
descriptor returned by CREATE_SESSION. This allows userspace to:
- LIVEUPDATE_SESSION_PRESERVE_FD: Add a file descriptor to a session
to be preserved across the live update.
- LIVEUPDATE_SESSION_RETRIEVE_FD: Retrieve a preserved file in the
new kernel using its unique token.
- LIVEUPDATE_SESSION_FINISH: finish session
The session's .release handler is enhanced to be state-aware. When a
session's file descriptor is closed, it correctly unpreserves the session
based on its current state before freeing all associated file resources.
Link: https://lkml.kernel.org/r/20251125165850.3389713-8-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Tested-by: David Matlack <dmatlack@google.com>
Cc: Aleksander Lobakin <aleksander.lobakin@intel.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Andriy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: anish kumar <yesanishhere@gmail.com>
Cc: Anna Schumaker <anna.schumaker@oracle.com>
Cc: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Chanwoo Choi <cw00.choi@samsung.com>
Cc: Chen Ridong <chenridong@huawei.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Daniel Wagner <wagi@kernel.org>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Jeffery <djeffery@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guixin Liu <kanie@linux.alibaba.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joanthan Cameron <Jonathan.Cameron@huawei.com>
Cc: Joel Granados <joel.granados@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Leon Romanovsky <leonro@nvidia.com>
Cc: Lukas Wunner <lukas@wunner.de>
Cc: Marc Rutland <mark.rutland@arm.com>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Matthew Maurer <mmaurer@google.com>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Myugnjoo Ham <myungjoo.ham@samsung.com>
Cc: Parav Pandit <parav@nvidia.com>
Cc: Pratyush Yadav <ptyadav@amazon.de>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Saeed Mahameed <saeedm@nvidia.com>
Cc: Samiullah Khawaja <skhawaja@google.com>
Cc: Song Liu <song@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Stuart Hayes <stuart.w.hayes@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Thomas Weißschuh <linux@weissschuh.net>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: William Tu <witu@nvidia.com>
Cc: Yoann Congal <yoann.congal@smile.fr>
Cc: Zhu Yanjun <yanjun.zhu@linux.dev>
Cc: Zijun Hu <quic_zijuhu@quicinc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This patch implements the core mechanism for managing preserved files
throughout the live update lifecycle. It provides the logic to invoke the
file handler callbacks (preserve, unpreserve, freeze, unfreeze, retrieve,
and finish) at the appropriate stages.
During the reboot phase, luo_file_freeze() serializes the final metadata
for each file (handler compatible string, token, and data handle) into a
memory region preserved by KHO. In the new kernel, luo_file_deserialize()
reconstructs the in-memory file list from this data, preparing the session
for retrieval.
Link: https://lkml.kernel.org/r/20251125165850.3389713-7-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Tested-by: David Matlack <dmatlack@google.com>
Cc: Aleksander Lobakin <aleksander.lobakin@intel.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Andriy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: anish kumar <yesanishhere@gmail.com>
Cc: Anna Schumaker <anna.schumaker@oracle.com>
Cc: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Chanwoo Choi <cw00.choi@samsung.com>
Cc: Chen Ridong <chenridong@huawei.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Daniel Wagner <wagi@kernel.org>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Jeffery <djeffery@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guixin Liu <kanie@linux.alibaba.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joanthan Cameron <Jonathan.Cameron@huawei.com>
Cc: Joel Granados <joel.granados@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Leon Romanovsky <leonro@nvidia.com>
Cc: Lukas Wunner <lukas@wunner.de>
Cc: Marc Rutland <mark.rutland@arm.com>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Matthew Maurer <mmaurer@google.com>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Myugnjoo Ham <myungjoo.ham@samsung.com>
Cc: Parav Pandit <parav@nvidia.com>
Cc: Pratyush Yadav <ptyadav@amazon.de>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Saeed Mahameed <saeedm@nvidia.com>
Cc: Samiullah Khawaja <skhawaja@google.com>
Cc: Song Liu <song@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Stuart Hayes <stuart.w.hayes@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Thomas Weißschuh <linux@weissschuh.net>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: William Tu <witu@nvidia.com>
Cc: Yoann Congal <yoann.congal@smile.fr>
Cc: Zhu Yanjun <yanjun.zhu@linux.dev>
Cc: Zijun Hu <quic_zijuhu@quicinc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Introduce the user-space interface for the Live Update Orchestrator via
ioctl commands, enabling external control over the live update process and
management of preserved resources.
The idea is that there is going to be a single userspace agent driving the
live update, therefore, only a single process can ever hold this device
opened at a time.
The following ioctl commands are introduced:
LIVEUPDATE_IOCTL_CREATE_SESSION
Provides a way for userspace to create a named session for grouping file
descriptors that need to be preserved. It returns a new file descriptor
representing the session.
LIVEUPDATE_IOCTL_RETRIEVE_SESSION
Allows the userspace agent in the new kernel to reclaim a preserved
session by its name, receiving a new file descriptor to manage the
restored resources.
Link: https://lkml.kernel.org/r/20251125165850.3389713-6-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Tested-by: David Matlack <dmatlack@google.com>
Cc: Aleksander Lobakin <aleksander.lobakin@intel.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Andriy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: anish kumar <yesanishhere@gmail.com>
Cc: Anna Schumaker <anna.schumaker@oracle.com>
Cc: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Chanwoo Choi <cw00.choi@samsung.com>
Cc: Chen Ridong <chenridong@huawei.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Daniel Wagner <wagi@kernel.org>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Jeffery <djeffery@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guixin Liu <kanie@linux.alibaba.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joanthan Cameron <Jonathan.Cameron@huawei.com>
Cc: Joel Granados <joel.granados@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Leon Romanovsky <leonro@nvidia.com>
Cc: Lukas Wunner <lukas@wunner.de>
Cc: Marc Rutland <mark.rutland@arm.com>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Matthew Maurer <mmaurer@google.com>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Myugnjoo Ham <myungjoo.ham@samsung.com>
Cc: Parav Pandit <parav@nvidia.com>
Cc: Pratyush Yadav <ptyadav@amazon.de>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Saeed Mahameed <saeedm@nvidia.com>
Cc: Samiullah Khawaja <skhawaja@google.com>
Cc: Song Liu <song@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Stuart Hayes <stuart.w.hayes@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Thomas Weißschuh <linux@weissschuh.net>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: William Tu <witu@nvidia.com>
Cc: Yoann Congal <yoann.congal@smile.fr>
Cc: Zhu Yanjun <yanjun.zhu@linux.dev>
Cc: Zijun Hu <quic_zijuhu@quicinc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Introduce concept of "Live Update Sessions" within the LUO framework. LUO
sessions provide a mechanism to group and manage `struct file *` instances
(representing file descriptors) that need to be preserved across a
kexec-based live update.
Each session is identified by a unique name and acts as a container for
file objects whose state is critical to a userspace workload, such as a
virtual machine or a high-performance database, aiming to maintain their
functionality across a kernel transition.
This groundwork establishes the framework for preserving file-backed state
across kernel updates, with the actual file data preservation mechanisms
to be implemented in subsequent patches.
[dan.carpenter@linaro.org: fix use after free in luo_session_deserialize()]
Link: https://lkml.kernel.org/r/c5dd637d7eed3a3be48c5e9fedb881596a3b1f5a.1764163896.git.dan.carpenter@linaro.org
Link: https://lkml.kernel.org/r/20251125165850.3389713-5-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Tested-by: David Matlack <dmatlack@google.com>
Cc: Aleksander Lobakin <aleksander.lobakin@intel.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Andriy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: anish kumar <yesanishhere@gmail.com>
Cc: Anna Schumaker <anna.schumaker@oracle.com>
Cc: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Chanwoo Choi <cw00.choi@samsung.com>
Cc: Chen Ridong <chenridong@huawei.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Daniel Wagner <wagi@kernel.org>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Jeffery <djeffery@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guixin Liu <kanie@linux.alibaba.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joanthan Cameron <Jonathan.Cameron@huawei.com>
Cc: Joel Granados <joel.granados@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Leon Romanovsky <leonro@nvidia.com>
Cc: Lukas Wunner <lukas@wunner.de>
Cc: Marc Rutland <mark.rutland@arm.com>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Matthew Maurer <mmaurer@google.com>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Myugnjoo Ham <myungjoo.ham@samsung.com>
Cc: Parav Pandit <parav@nvidia.com>
Cc: Pratyush Yadav <ptyadav@amazon.de>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Saeed Mahameed <saeedm@nvidia.com>
Cc: Samiullah Khawaja <skhawaja@google.com>
Cc: Song Liu <song@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Stuart Hayes <stuart.w.hayes@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Thomas Weißschuh <linux@weissschuh.net>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: William Tu <witu@nvidia.com>
Cc: Yoann Congal <yoann.congal@smile.fr>
Cc: Zhu Yanjun <yanjun.zhu@linux.dev>
Cc: Zijun Hu <quic_zijuhu@quicinc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Modify the kernel_kexec() to call liveupdate_reboot().
This ensures that the Live Update Orchestrator is notified just before the
kernel executes the kexec jump. The liveupdate_reboot() function triggers
the final freeze event, allowing participating FDs perform last-minute
check or state saving within the blackout window.
If liveupdate_reboot() returns an error (indicating a failure during LUO
finalization), the kexec operation is aborted to prevent proceeding with
an inconsistent state. An error is returned to user.
Link: https://lkml.kernel.org/r/20251125165850.3389713-4-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Tested-by: David Matlack <dmatlack@google.com>
Cc: Aleksander Lobakin <aleksander.lobakin@intel.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Andriy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: anish kumar <yesanishhere@gmail.com>
Cc: Anna Schumaker <anna.schumaker@oracle.com>
Cc: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Chanwoo Choi <cw00.choi@samsung.com>
Cc: Chen Ridong <chenridong@huawei.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Daniel Wagner <wagi@kernel.org>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Jeffery <djeffery@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guixin Liu <kanie@linux.alibaba.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joanthan Cameron <Jonathan.Cameron@huawei.com>
Cc: Joel Granados <joel.granados@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Leon Romanovsky <leonro@nvidia.com>
Cc: Lukas Wunner <lukas@wunner.de>
Cc: Marc Rutland <mark.rutland@arm.com>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Matthew Maurer <mmaurer@google.com>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Myugnjoo Ham <myungjoo.ham@samsung.com>
Cc: Parav Pandit <parav@nvidia.com>
Cc: Pratyush Yadav <ptyadav@amazon.de>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Saeed Mahameed <saeedm@nvidia.com>
Cc: Samiullah Khawaja <skhawaja@google.com>
Cc: Song Liu <song@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Stuart Hayes <stuart.w.hayes@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Thomas Weißschuh <linux@weissschuh.net>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: William Tu <witu@nvidia.com>
Cc: Yoann Congal <yoann.congal@smile.fr>
Cc: Zhu Yanjun <yanjun.zhu@linux.dev>
Cc: Zijun Hu <quic_zijuhu@quicinc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Integrate the LUO with the KHO framework to enable passing LUO state
across a kexec reboot.
This patch implements the lifecycle integration with KHO:
1. Incoming State: During early boot (`early_initcall`), LUO checks if
KHO is active. If so, it retrieves the "LUO" subtree, verifies the
"luo-v1" compatibility string, and reads the `liveupdate-number` to
track the update count.
2. Outgoing State: During late initialization (`late_initcall`), LUO
allocates a new FDT for the next kernel, populates it with the basic
header (compatible string and incremented update number), and
registers it with KHO (`kho_add_subtree`).
3. Finalization: The `liveupdate_reboot()` notifier is updated to invoke
`kho_finalize()`. This ensures that all memory segments marked for
preservation are properly serialized before the kexec jump.
Link: https://lkml.kernel.org/r/20251125165850.3389713-3-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Tested-by: David Matlack <dmatlack@google.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Aleksander Lobakin <aleksander.lobakin@intel.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Andriy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: anish kumar <yesanishhere@gmail.com>
Cc: Anna Schumaker <anna.schumaker@oracle.com>
Cc: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Chanwoo Choi <cw00.choi@samsung.com>
Cc: Chen Ridong <chenridong@huawei.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Daniel Wagner <wagi@kernel.org>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Jeffery <djeffery@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guixin Liu <kanie@linux.alibaba.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joanthan Cameron <Jonathan.Cameron@huawei.com>
Cc: Joel Granados <joel.granados@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Leon Romanovsky <leonro@nvidia.com>
Cc: Lukas Wunner <lukas@wunner.de>
Cc: Marc Rutland <mark.rutland@arm.com>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Matthew Maurer <mmaurer@google.com>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Myugnjoo Ham <myungjoo.ham@samsung.com>
Cc: Parav Pandit <parav@nvidia.com>
Cc: Pratyush Yadav <ptyadav@amazon.de>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Saeed Mahameed <saeedm@nvidia.com>
Cc: Samiullah Khawaja <skhawaja@google.com>
Cc: Song Liu <song@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Stuart Hayes <stuart.w.hayes@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Thomas Weißschuh <linux@weissschuh.net>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: William Tu <witu@nvidia.com>
Cc: Yoann Congal <yoann.congal@smile.fr>
Cc: Zhu Yanjun <yanjun.zhu@linux.dev>
Cc: Zijun Hu <quic_zijuhu@quicinc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "Live Update Orchestrator", v8.
This series introduces the Live Update Orchestrator, a kernel subsystem
designed to facilitate live kernel updates using a kexec-based reboot.
This capability is critical for cloud environments, allowing hypervisors
to be updated with minimal downtime for running virtual machines. LUO
achieves this by preserving the state of selected resources, such as
memory, devices and their dependencies, across the kernel transition.
As a key feature, this series includes support for preserving memfd file
descriptors, which allows critical in-memory data, such as guest RAM or
any other large memory region, to be maintained in RAM across the kexec
reboot.
The other series that use LUO, are VFIO [1], IOMMU [2], and PCI [3]
preservations.
Github repo of this series [4].
The core of LUO is a framework for managing the lifecycle of preserved
resources through a userspace-driven interface. Key features include:
- Session Management
Userspace agent (i.e. luod [5]) creates named sessions, each
represented by a file descriptor (via centralized agent that controls
/dev/liveupdate). The lifecycle of all preserved resources within a
session is tied to this FD, ensuring automatic kernel cleanup if the
controlling userspace agent crashes or exits unexpectedly.
- File Preservation
A handler-based framework allows specific file types (demonstrated
here with memfd) to be preserved. Handlers manage the serialization,
restoration, and lifecycle of their specific file types.
- File-Lifecycle-Bound State
A new mechanism for managing shared global state whose lifecycle is
tied to the preservation of one or more files. This is crucial for
subsystems like IOMMU or HugeTLB, where multiple file descriptors may
depend on a single, shared underlying resource that must be preserved
only once.
- KHO Integration
LUO drives the Kexec Handover framework programmatically to pass its
serialized metadata to the next kernel. The LUO state is finalized and
added to the kexec image just before the reboot is triggered. In the
future this step will also be removed once stateless KHO is
merged [6].
- Userspace Interface
Control is provided via ioctl commands on /dev/liveupdate for creating
and retrieving sessions, as well as on session file descriptors for
managing individual files.
- Testing
The series includes a set of selftests, including userspace API
validation, kexec-based lifecycle tests for various session and file
scenarios, and a new in-kernel test module to validate the FLB logic.
Introduce LUO, a mechanism intended to facilitate kernel updates while
keeping designated devices operational across the transition (e.g., via
kexec). The primary use case is updating hypervisors with minimal
disruption to running virtual machines. For userspace side of hypervisor
update we have copyless migration. LUO is for updating the kernel.
This initial patch lays the groundwork for the LUO subsystem.
Further functionality, including the implementation of state transition
logic, integration with KHO, and hooks for subsystems and file
descriptors, will be added in subsequent patches.
Create a character device at /dev/liveupdate.
A new uAPI header, <uapi/linux/liveupdate.h>, will define the necessary
structures. The magic number for IOCTL is registered in
Documentation/userspace-api/ioctl/ioctl-number.rst.
Link: https://lkml.kernel.org/r/20251125165850.3389713-1-pasha.tatashin@soleen.com
Link: https://lkml.kernel.org/r/20251125165850.3389713-2-pasha.tatashin@soleen.com
Link: https://lore.kernel.org/all/20251018000713.677779-1-vipinsh@google.com/ [1]
Link: https://lore.kernel.org/linux-iommu/20250928190624.3735830-1-skhawaja@google.com [2]
Link: https://lore.kernel.org/linux-pci/20250916-luo-pci-v2-0-c494053c3c08@kernel.org [3]
Link: https://github.com/googleprodkernel/linux-liveupdate/tree/luo/v8 [4]
Link: https://tinyurl.com/luoddesign [5]
Link: https://lore.kernel.org/all/20251020100306.2709352-1-jasonmiu@google.com [6]
Link: https://lore.kernel.org/all/20251115233409.768044-1-pasha.tatashin@soleen.com [7]
Link: https://github.com/soleen/linux/blob/luo/v8b03/diff.v7.v8 [8]
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Tested-by: David Matlack <dmatlack@google.com>
Cc: Aleksander Lobakin <aleksander.lobakin@intel.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Andriy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: anish kumar <yesanishhere@gmail.com>
Cc: Anna Schumaker <anna.schumaker@oracle.com>
Cc: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Chanwoo Choi <cw00.choi@samsung.com>
Cc: Chen Ridong <chenridong@huawei.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Daniel Wagner <wagi@kernel.org>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Jeffery <djeffery@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guixin Liu <kanie@linux.alibaba.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joanthan Cameron <Jonathan.Cameron@huawei.com>
Cc: Joel Granados <joel.granados@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Leon Romanovsky <leonro@nvidia.com>
Cc: Lukas Wunner <lukas@wunner.de>
Cc: Marc Rutland <mark.rutland@arm.com>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Matthew Maurer <mmaurer@google.com>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Myugnjoo Ham <myungjoo.ham@samsung.com>
Cc: Parav Pandit <parav@nvidia.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Saeed Mahameed <saeedm@nvidia.com>
Cc: Samiullah Khawaja <skhawaja@google.com>
Cc: Song Liu <song@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Stuart Hayes <stuart.w.hayes@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Thomas Weißschuh <linux@weissschuh.net>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: William Tu <witu@nvidia.com>
Cc: Yoann Congal <yoann.congal@smile.fr>
Cc: Zijun Hu <quic_zijuhu@quicinc.com>
Cc: Pratyush Yadav <ptyadav@amazon.de>
Cc: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently, Kexec Handover must be explicitly enabled via the kernel
command line parameter `kho=on`.
For workloads that rely on KHO as a foundational requirement (such as the
upcoming Live Update Orchestrator), requiring an explicit boot parameter
adds redundant configuration steps.
Introduce CONFIG_KEXEC_HANDOVER_ENABLE_DEFAULT. When selected, KHO
defaults to enabled. This is equivalent to passing kho=on at boot. The
behavior can still be disabled at runtime by passing kho=off.
Link: https://lkml.kernel.org/r/20251114190002.3311679-14-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Baoquan He <bhe@redhat.com>
Cc: Coiby Xu <coxu@redhat.com>
Cc: Dave Vasilevsky <dave@vasilevsky.ca>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Kees Cook <kees@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently, kho_preserve_* and kho_unpreserve_* return -EBUSY if KHO is
finalized. This enforces a rigid "freeze" on the KHO memory state.
With the introduction of re-entrant finalization, this restriction is no
longer necessary. Users should be allowed to modify the preservation set
(e.g., adding new pages or freeing old ones) even after an initial
finalization.
The intended workflow for updates is now:
1. Modify state (preserve/unpreserve).
2. Call kho_finalize() again to refresh the serialized metadata.
Remove the kho_out.finalized checks to enable this dynamic behavior.
This also allows to convert kho_unpreserve_* functions to void, as they do
not return any error anymore.
Link: https://lkml.kernel.org/r/20251114190002.3311679-13-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Baoquan He <bhe@redhat.com>
Cc: Coiby Xu <coxu@redhat.com>
Cc: Dave Vasilevsky <dave@vasilevsky.ca>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Kees Cook <kees@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently, kho_fill_kimage() checks kho_out.finalized and returns early if
KHO is not yet finalized. This enforces a strict ordering where userspace
must finalize KHO *before* loading the kexec image.
This is restrictive, as standard workflows often involve loading the
target kernel early in the lifecycle and finalizing the state (FDT) only
immediately before the reboot.
Since the KHO FDT resides at a physical address allocated during boot
(kho_init), its location is stable. We can attach this stable address to
the kimage regardless of whether the content has been finalized yet.
Relax the check to only require kho_enable, allowing kexec_file_load to
proceed at any time.
Link: https://lkml.kernel.org/r/20251114190002.3311679-12-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Baoquan He <bhe@redhat.com>
Cc: Coiby Xu <coxu@redhat.com>
Cc: Dave Vasilevsky <dave@vasilevsky.ca>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Kees Cook <kees@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently, sub-FDTs were tracked in a list (kho_out.sub_fdts) and the
final FDT is constructed entirely from scratch during kho_finalize().
We can maintain the FDT dynamically:
1. Initialize a valid, empty FDT in kho_init().
2. Use fdt_add_subnode and fdt_setprop in kho_add_subtree to
update the FDT immediately when a subsystem registers.
3. Use fdt_del_node in kho_remove_subtree to remove entries.
This removes the need for the intermediate sub_fdts list and the
reconstruction logic in kho_finalize(). kho_finalize() now only needs to
trigger memory map serialization.
Link: https://lkml.kernel.org/r/20251114190002.3311679-11-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Baoquan He <bhe@redhat.com>
Cc: Coiby Xu <coxu@redhat.com>
Cc: Dave Vasilevsky <dave@vasilevsky.ca>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Kees Cook <kees@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Previously, KHO required a dedicated kho_abort() function to clean up
state before kho_finalize() could be called again. This was necessary to
handle complex unwind paths when using notifiers.
With the shift to direct memory preservation, the explicit abort step is
no longer strictly necessary.
Remove kho_abort() and refactor kho_finalize() to handle re-entry. If
kho_finalize() is called while KHO is already finalized, it will now
automatically clean up the previous memory map and state before generating
a new one. This allows the KHO state to be updated/refreshed simply by
triggering finalize again.
Update debugfs to return -EINVAL if userspace attempts to write 0 to the
finalize attribute, as explicit abort is no longer supported.
Link: https://lkml.kernel.org/r/20251114190002.3311679-10-pasha.tatashin@soleen.com
Suggested-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Baoquan He <bhe@redhat.com>
Cc: Coiby Xu <coxu@redhat.com>
Cc: Dave Vasilevsky <dave@vasilevsky.ca>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Kees Cook <kees@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently, the serialized memory map is tracked via
kho_out.preserved_mem_map and copied to the FDT during finalization. This
double tracking is redundant.
Remove preserved_mem_map from kho_out. Instead, maintain the physical
address of the head chunk directly in the preserved-memory-map FDT
property.
Introduce kho_update_memory_map() to manage this property. This function
handles:
1. Retrieving and freeing any existing serialized map (handling the
abort/retry case).
2. Updating the FDT property with the new chunk address.
This establishes the FDT as the single source of truth for the handover
state.
Link: https://lkml.kernel.org/r/20251114190002.3311679-9-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Baoquan He <bhe@redhat.com>
Cc: Coiby Xu <coxu@redhat.com>
Cc: Dave Vasilevsky <dave@vasilevsky.ca>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Kees Cook <kees@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently, __kho_finalize() performs memory serialization in the middle of
FDT construction. If FDT construction fails later, the function must
manually clean up the serialized memory via __kho_abort().
Refactor __kho_finalize() to perform kho_mem_serialize() only after the
FDT has been successfully constructed and finished. This reordering has
two benefits:
1. It avoids expensive serialization work if FDT generation fails.
2. It removes the need for cleanup in the FDT error path.
As a result, the internal helper __kho_abort() is no longer needed for
internal error handling. Inline its remaining logic (cleanup of the
preserved memory map) directly into kho_abort() and remove the helper.
Link: https://lkml.kernel.org/r/20251114190002.3311679-8-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Baoquan He <bhe@redhat.com>
Cc: Coiby Xu <coxu@redhat.com>
Cc: Dave Vasilevsky <dave@vasilevsky.ca>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Kees Cook <kees@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently, the output FDT is added to debugfs only when KHO is finalized
and removed when aborted.
There is no need to hide the FDT based on the state. Always expose it
starting from initialization. This aids the transition toward removing
the explicit abort functionality and converting KHO to be fully stateless.
Link: https://lkml.kernel.org/r/20251114190002.3311679-7-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Baoquan He <bhe@redhat.com>
Cc: Coiby Xu <coxu@redhat.com>
Cc: Dave Vasilevsky <dave@vasilevsky.ca>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Kees Cook <kees@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
During boot, kho_restore_folio() relies on the memory map having been
successfully deserialized. If deserialization fails or no map is present,
attempting to restore the FDT folio is unsafe.
Update kho_mem_deserialize() to return a boolean indicating success. Use
this return value in kho_memory_init() to disable KHO if deserialization
fails. Also, the incoming FDT folio is never used, there is no reason to
restore it.
Additionally, use get_unaligned() to retrieve the memory map pointer from
the FDT. FDT properties are not guaranteed to be naturally aligned, and
accessing a 64-bit value via a pointer that is only 32-bit aligned can
cause faults.
Link: https://lkml.kernel.org/r/20251114190002.3311679-6-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Baoquan He <bhe@redhat.com>
Cc: Coiby Xu <coxu@redhat.com>
Cc: Dave Vasilevsky <dave@vasilevsky.ca>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Kees Cook <kees@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently, the FDT folio is preserved inside __kho_finalize(). If the
user performs multiple finalize/abort cycles, kho_preserve_folio() is
called repeatedly for the same FDT folio.
Since the FDT folio is allocated once during kho_init(), it should be
marked for preservation at the same time. Move the preservation call to
kho_init() to align the preservation state with the object's lifecycle and
simplify the finalize path.
Also, pre-zero the FDT tree so we do not expose random bits to the user
and to the next kernel by using the new kho_alloc_preserve() api.
Link: https://lkml.kernel.org/r/20251114190002.3311679-5-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Baoquan He <bhe@redhat.com>
Cc: Coiby Xu <coxu@redhat.com>
Cc: Dave Vasilevsky <dave@vasilevsky.ca>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Kees Cook <kees@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently, clients of KHO must manually allocate memory (e.g., via
alloc_pages), calculate the page order, and explicitly call
kho_preserve_folio(). Similarly, cleanup requires separate calls to
unpreserve and free the memory.
Introduce a high-level API to streamline this common pattern:
- kho_alloc_preserve(size): Allocates physically contiguous, zeroed
memory and immediately marks it for preservation.
- kho_unpreserve_free(ptr): Unpreserves and frees the memory
in the current kernel.
- kho_restore_free(ptr): Restores the struct page state of
preserved memory in the new kernel and immediately frees it to the
page allocator.
[pasha.tatashin@soleen.com: build fixes]
Link: https://lkml.kernel.org/r/CA+CK2bBgXDhrHwTVgxrw7YTQ-0=LgW0t66CwPCgG=C85ftz4zw@mail.gmail.com
Link: https://lkml.kernel.org/r/20251114190002.3311679-4-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Baoquan He <bhe@redhat.com>
Cc: Coiby Xu <coxu@redhat.com>
Cc: Dave Vasilevsky <dave@vasilevsky.ca>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Kees Cook <kees@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The internal helper __kho_abort() always returns 0 and has no failure
paths. Its return value is ignored by __kho_finalize and checked
needlessly by kho_abort.
Change the return type to void to reflect that this function cannot fail,
and simplify kho_abort by removing dead error handling code.
Link: https://lkml.kernel.org/r/20251114190002.3311679-3-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Baoquan He <bhe@redhat.com>
Cc: Coiby Xu <coxu@redhat.com>
Cc: Dave Vasilevsky <dave@vasilevsky.ca>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Kees Cook <kees@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "kho: simplify state machine and enable dynamic updates", v2.
This patch series refactors the Kexec Handover subsystem to transition
from a rigid, state-locked model to a dynamic, re-entrant architecture.
It also introduces usability improvements.
Motivation
Currently, KHO relies on a strict state machine where memory
preservation is locked upon finalization. If a change is required, the
user must explicitly "abort" to reset the state. Additionally, the kexec
image cannot be loaded until KHO is finalized, and the FDT is rebuilt
from scratch on every finalization.
This series simplifies this workflow to support "load early, finalize
late" scenarios.
Key Changes
State Machine Simplification:
- Removed kho_abort(). kho_finalize() is now re-entrant; calling it a
second time automatically flushes the previous serialized state and
generates a fresh one.
- Removed kho_out.finalized checks from preservation APIs, allowing
drivers to add/remove pages even after an initial finalization.
- Decoupled kexec_file_load from KHO finalization. The KHO FDT physical
address is now stable from boot, allowing the kexec image to be loaded
before the handover metadata is finalized.
FDT Management:
- The FDT is now updated in-place dynamically when subtrees are added or
removed, removing the need for complex reconstruction logic.
- The output FDT is always exposed in debugfs (initialized and zeroed at
boot), improving visibility and debugging capabilities throughout the
system lifecycle.
- Removed the redundant global preserved_mem_map pointer, establishing
the FDT property as the single source of truth.
New Features & API Enhancements:
- High-Level Allocators: Introduced kho_alloc_preserve() and friends to
reduce boilerplate for drivers that need to allocate, preserve, and
eventually restore simple memory buffers.
- Configuration: Added CONFIG_KEXEC_HANDOVER_ENABLE_DEFAULT to allow KHO
to be active by default without requiring the kho=on command line
parameter.
Fixes:
- Fixed potential alignment faults when accessing 64-bit FDT properties.
- Fixed the lifecycle of the FDT folio preservation (now preserved once
at init).
This patch (of 13):
The log message in kho_populate() currently states "Will skip init for
some devices". This implies that Kexec Handover always involves skipping
device initialization.
However, KHO is a generic mechanism used to preserve kernel memory across
reboot for various purposes, such as memfd, telemetry, or reserve_mem.
Skipping device initialization is a specific property of live update
drivers using KHO, not a property of the mechanism itself.
Remove the misleading suffix to accurately reflect the generic nature of
KHO discovery.
Link: https://lkml.kernel.org/r/20251114190002.3311679-2-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Baoquan He <bhe@redhat.com>
Cc: Coiby Xu <coxu@redhat.com>
Cc: Dave Vasilevsky <dave@vasilevsky.ca>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Kees Cook <kees@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Make pr_xxx() call to use the %pe format specifier instead of %d. The %pe
specifier prints a symbolic error string (e.g., -ENOMEM, -EINVAL) when
given an error pointer created with ERR_PTR(err).
This change enhances the clarity and diagnostic value of the error message
by showing a descriptive error name rather than a numeric error code.
Note, that some err are still printed by value, as those errors might come
from libfdt and not regular errnos.
Link: https://lkml.kernel.org/r/20251101142325.1326536-10-pasha.tatashin@soleen.com
Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Co-developed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Changyuan Lyu <changyuanl@google.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Move KHO to kernel/liveupdate/ in preparation of placing all Live Update
core kernel related files to the same place.
[pasha.tatashin@soleen.com: disable the menu when DEFERRED_STRUCT_PAGE_INIT]
Link: https://lkml.kernel.org/r/CA+CK2bAvh9Oa2SLfsbJ8zztpEjrgr_hr-uGgF1coy8yoibT39A@mail.gmail.com
Link: https://lkml.kernel.org/r/20251101142325.1326536-8-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Changyuan Lyu <changyuanl@google.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Pratyush Yadav <pratyush@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Simon Horman <horms@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
KHO allows clients to preserve memory regions at any point before the KHO
state is finalized. The finalization process itself involves KHO
performing its own actions, such as serializing the overall preserved
memory map.
If this finalization process is aborted, the current implementation
destroys KHO's internal memory tracking structures
(`kho_out.ser.track.orders`). This behavior effectively unpreserves all
memory from KHO's perspective, regardless of whether those preservations
were made by clients before the finalization attempt or by KHO itself
during finalization.
This premature unpreservation is incorrect. An abort of the finalization
process should only undo actions taken by KHO as part of that specific
finalization attempt. Individual memory regions preserved by clients
prior to finalization should remain preserved, as their lifecycle is
managed by the clients themselves. These clients might still need to call
kho_unpreserve_folio() or kho_unpreserve_phys() based on their own logic,
even after a KHO finalization attempt is aborted.
Link: https://lkml.kernel.org/r/20251101142325.1326536-7-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Changyuan Lyu <changyuanl@google.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Simon Horman <horms@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The KHO framework uses a notifier chain as the mechanism for clients to
participate in the finalization process. While this works for a single,
central state machine, it is too restrictive for kernel-internal
components like pstore/reserve_mem or IMA. These components need a
simpler, direct way to register their state for preservation (e.g., during
their initcall) without being part of a complex, shutdown-time notifier
sequence. The notifier model forces all participants into a single
finalization flow and makes direct preservation from an arbitrary context
difficult. This patch refactors the client participation model by
removing the notifier chain and introducing a direct API for managing FDT
subtrees.
The core kho_finalize() and kho_abort() state machine remains, but clients
now register their data with KHO beforehand.
Link: https://lkml.kernel.org/r/20251101142325.1326536-3-pasha.tatashin@soleen.com
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Co-developed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: Changyuan Lyu <changyuanl@google.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Pratyush Yadav <pratyush@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Simon Horman <horms@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "liveupdate: Rework KHO for in-kernel users", v9.
This series refactors the KHO framework to better support in-kernel users
like the upcoming LUO. The current design, which relies on a notifier
chain and debugfs for control, is too restrictive for direct programmatic
use.
The core of this rework is the removal of the notifier chain in favor of a
direct registration API. This decouples clients from the shutdown-time
finalization sequence, allowing them to manage their preserved state more
flexibly and at any time.
In support of this new model, this series also:
- Makes the debugfs interface optional.
- Introduces APIs to unpreserve memory and fixes a bug in the abort
path where client state was being incorrectly discarded. Note that
this is an interim step, as a more comprehensive fix is planned as
part of the stateless KHO work [1].
- Moves all KHO code into a new kernel/liveupdate/ directory to
consolidate live update components.
This patch (of 9):
Currently, KHO is controlled via debugfs interface, but once LUO is
introduced, it can control KHO, and the debug interface becomes optional.
Add a separate config CONFIG_KEXEC_HANDOVER_DEBUGFS that enables the
debugfs interface, and allows to inspect the tree.
Move all debugfs related code to a new file to keep the .c files clear of
ifdefs.
Link: https://lkml.kernel.org/r/20251101142325.1326536-1-pasha.tatashin@soleen.com
Link: https://lkml.kernel.org/r/20251101142325.1326536-2-pasha.tatashin@soleen.com
Link: https://lore.kernel.org/all/20251020100306.2709352-1-jasonmiu@google.com [1]
Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Changyuan Lyu <changyuanl@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Simon Horman <horms@kernel.org>
Cc: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
1. the numa parameter was straight up ignored.
2. nothing was done to check if the to-be-cached/allocated stack matches
the local node
The id remains ignored on free in case of memoryless nodes.
Note the current caching is already bad as the cache keeps overflowing
and a different solution is needed for the long run, to be worked
out(tm).
Stats collected over a kernel build with the patch with the following
topology:
NUMA node(s): 2
NUMA node0 CPU(s): 0-11
NUMA node1 CPU(s): 12-23
caller's node vs stack backing pages on free:
matching: 50083 (70%)
mismatched: 21492 (30%)
caching efficiency:
cached: 32651 (65.2%)
dropped: 17432 (34.8%)
Link: https://lkml.kernel.org/r/20251120054015.3019419-1-mjguzik@gmail.com
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Linus Waleij <linus.walleij@linaro.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The macro for_each_console_srcu iterates over all registered consoles. It's
implied that all registered consoles have CON_ENABLED flag set, making
the check for the flag unnecessary. Call console_is_usable function to
fully verify if the given console is usable before calling the ->unblank
callback.
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://patch.msgid.link/20251121-printk-cleanup-part2-v2-3-57b8b78647f4@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Make do_proc_douintvec static and export proc_douintvec_conv wrapper
function for external use. This is to keep with the design in sysctl.c.
Update fs/pipe.c to use the new public API.
Signed-off-by: Joel Granados <joel.granados@kernel.org>
Create a converter for the pipe-max-size proc_handler using the
SYSCTL_UINT_CONV_CUSTOM. Move SYSCTL_CONV_IDENTITY macro to the sysctl
header to make it available for pipe size validation. Keep returning
-EINVAL when (val == 0) by using a range checking converter and setting
the minimal valid value (extern1) to SYSCTL_ONE. Keep round_pipe_size by
passing it as the operation for SYSCTL_USER_TO_KERN_INT_CONV.
Signed-off-by: Joel Granados <joel.granados@kernel.org>
Move proc_doulongvec_ms_jiffies_minmax to kernel/time/jiffies.c. Create
a non static wrapper function proc_doulongvec_minmax_conv that
forwards the custom convmul and convdiv argument values to the internal
do_proc_doulongvec_minmax. Remove unused linux/times.h include from
kernel/sysctl.c.
Signed-off-by: Joel Granados <joel.granados@kernel.org>
Move integer jiffies converters (proc_dointvec{_,_ms_,_userhz_}jiffies
and proc_dointvec_ms_jiffies_minmax) to kernel/time/jiffies.c. Error
stubs for when CONFIG_PRCO_SYSCTL is not defined are not reproduced
because all the jiffies converters go through proc_dointvec_conv which
is already stubbed. This is part of the greater effort to move sysctl
logic out of kernel/sysctl.c thereby reducing merge conflicts in
kernel/sysctl.c.
Signed-off-by: Joel Granados <joel.granados@kernel.org>
Move SYSCTL_USER_TO_KERN_UINT_CONV and SYSCTL_UINT_CONV_CUSTOM macros to
include/linux/sysctl.h. No need to embed sysctl_kern_to_user_uint_conv
in a macro as it will not need a custom kernel pointer operation. This
is a preparation commit to enable jiffies converter creation outside
kernel/sysctl.c.
Signed-off-by: Joel Granados <joel.granados@kernel.org>
Move direction macros (SYSCTL_{USER_TO_KERN,KERN_TO_USER}) and the
integer converter macros (SYSCTL_{USER_TO_KERN,KERN_TO_USER}_INT_CONV,
SYSCTL_INT_CONV_CUSTOM) into include/linux/sysctl.h. This is a
preparation commit to enable jiffies converter creation outside
kernel/sysctl.c.
Signed-off-by: Joel Granados <joel.granados@kernel.org>
The new non-static proc_dointvec_conv forwards a custom converter
function to do_proc_dointvec from outside the sysctl scope. Rename the
do_proc_dointvec call points so any future changes to proc_dointvec_conv
are propagated in sysctl.c This is a preparation commit that allows the
integer jiffie converter functions to move out of kernel/sysctl.c.
Signed-off-by: Joel Granados <joel.granados@kernel.org>
The buffer arg in proc handler functions have been void* (no __user
qualifier) since commit 32927393dc ("sysctl: pass kernel pointers to
->proc_handler"). The __user qualifier was erroneously brought back in
commit 0df8bdd5e3 ("stackleak: move stack_erasing sysctl to
stackleak.c"). This fixes the error by removing the __user qualifier.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202510221719.3ggn070M-lkp@intel.com/
Signed-off-by: Joel Granados <joel.granados@kernel.org>
Replace sysctl_user_to_kern_uint_conv function with
SYSCTL_USER_TO_KERN_UINT_CONV macro that accepts u_ptr_op parameter for
value transformation. Replacing sysctl_kern_to_user_uint_conv is not
needed as it will only be used from within sysctl.c. This is a
preparation commit for creating a custom converter in fs/pipe.c. No
Functional changes are intended.
Signed-off-by: Joel Granados <joel.granados@kernel.org>
Add k_ptr_range_check parameter to SYSCTL_UINT_CONV_CUSTOM macro to
enable range validation using table->extra1/extra2. Replace
do_proc_douintvec_minmax_conv with do_proc_uint_conv_minmax generated
by the updated macro.
Signed-off-by: Joel Granados <joel.granados@kernel.org>
Pass sysctl_{user_to_kern,kern_to_user}_uint_conv (unsigned integer
uni-directional converters) to the new SYSCTL_UINT_CONV_CUSTOM macro
to create do_proc_douintvec_conv's replacement (do_proc_uint_conv).
This is a preparation commit to use the unsigned integer converter from
outside sysctl. No functional change is intended.
Signed-off-by: Joel Granados <joel.granados@kernel.org>
Extend the SYSCTL_INT_CONV_CUSTOM macro with a k_ptr_range_check
parameter to conditionally generate range validation code. When enabled,
validation is done against table->extra1 (min) and table->extra2 (max)
bounds before assignment. Add base minmax and ms_jiffies_minmax
converter instances that utilize the range checking functionality.
Signed-off-by: Joel Granados <joel.granados@kernel.org>
New SYSCTL_INT_CONV_CUSTOM macro creates "bi-directional" converters
from a user-to-kernel and a kernel-to-user functions. Replace integer
versions of do_proc_*_conv functions with the ones from the new macro.
Rename "_dointvec_" to just "_int_" as these converters are not applied
to vectors and the "do" is already in the name.
Move the USER_HZ validation directly into proc_dointvec_userhz_jiffies()
Signed-off-by: Joel Granados <joel.granados@kernel.org>
Eight converter functions are created using two new macros
(SYSCTL_USER_TO_KERN_INT_CONV & SYSCTL_KERN_TO_USER_INT_CONV); they are
called from four pre-existing converter functions: do_proc_dointvec_conv
and do_proc_dointvec{,_userhz,_ms}_jiffies_conv. The function names
generated by the macros are differentiated by a string suffix passed as
the first macro argument.
The SYSCTL_USER_TO_KERN_INT_CONV macro first executes the u_ptr_op
operation, then checks for overflow, assigns sign (-, +) and finally
writes to the kernel var with WRITE_ONCE; it always returns an -EINVAL
when an overflow is detected. The SYSCTL_KERN_TO_USER_INT_CONV uses
READ_ONCE, casts to unsigned long, then executes the k_ptr_op before
assigning the value to the user space buffer.
The overflow check is always done against MAX_INT after applying
{k,u}_ptr_op. This approach avoids rounding or precision errors that
might occur when using the inverse operations.
Signed-off-by: Joel Granados <joel.granados@kernel.org>