Commit Graph

51062 Commits (d78ddeb8938a366aabfabf60255c1a94de8d8ea1)

Author SHA1 Message Date
Marco Elver b7be9442a3 kcov: Use scoped init guard
Convert lock initialization to scoped guarded initialization where
lock-guarded members are initialized in the same scope.

This ensures the context analysis treats the context as active during
member initialization. This is required to avoid errors once implicit
context assertion is removed.

Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260119094029.1344361-4-elver@google.com
2026-01-28 20:45:24 +01:00
Jiri Olsa 424f6a3610 bpf,x86: Use single ftrace_ops for direct calls
Using single ftrace_ops for direct calls update instead of allocating
ftrace_ops object for each trampoline.

With single ftrace_ops object we can use update_ftrace_direct_* api
that allows multiple ip sites updates on single ftrace_ops object.

Adding HAVE_SINGLE_FTRACE_DIRECT_OPS config option to be enabled on
each arch that supports this.

At the moment we can enable this only on x86 arch, because arm relies
on ftrace_ops object representing just single trampoline image (stored
in ftrace_ops::direct_call). Archs that do not support this will continue
to use *_ftrace_direct api.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/bpf/20251230145010.103439-10-jolsa@kernel.org
2026-01-28 11:44:59 -08:00
Jiri Olsa 956747efd8 ftrace: Factor ftrace_ops ops_func interface
We are going to remove "ftrace_ops->private == bpf_trampoline" setup
in following changes.

Adding ip argument to ftrace_ops_func_t callback function, so we can
use it to look up the trampoline.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/bpf/20251230145010.103439-9-jolsa@kernel.org
2026-01-28 11:44:57 -08:00
Jiri Olsa 7d0452497c bpf: Add trampoline ip hash table
Following changes need to lookup trampoline based on its ip address,
adding hash table for that.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20251230145010.103439-8-jolsa@kernel.org
2026-01-28 11:44:57 -08:00
Jiri Olsa e93672f770 ftrace: Add update_ftrace_direct_mod function
Adding update_ftrace_direct_mod function that modifies all entries
(ip -> direct) provided in hash argument to direct ftrace ops and
updates its attachments.

The difference to current modify_ftrace_direct is:
- hash argument that allows to modify multiple ip -> direct
  entries at once

This change will allow us to have simple ftrace_ops for all bpf
direct interface users in following changes.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/bpf/20251230145010.103439-7-jolsa@kernel.org
2026-01-28 11:44:54 -08:00
Jiri Olsa 8d2c1233f3 ftrace: Add update_ftrace_direct_del function
Adding update_ftrace_direct_del function that removes all entries
(ip -> addr) provided in hash argument to direct ftrace ops and
updates its attachments.

The difference to current unregister_ftrace_direct is
 - hash argument that allows to unregister multiple ip -> direct
   entries at once
 - we can call update_ftrace_direct_del multiple times on the
   same ftrace_ops object, becase we do not need to unregister
   all entries at once, we can do it gradualy with the help of
   ftrace_update_ops function

This change will allow us to have simple ftrace_ops for all bpf
direct interface users in following changes.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/bpf/20251230145010.103439-6-jolsa@kernel.org
2026-01-28 11:44:51 -08:00
Jiri Olsa 05dc5e9c1f ftrace: Add update_ftrace_direct_add function
Adding update_ftrace_direct_add function that adds all entries
(ip -> addr) provided in hash argument to direct ftrace ops
and updates its attachments.

The difference to current register_ftrace_direct is
 - hash argument that allows to register multiple ip -> direct
   entries at once
 - we can call update_ftrace_direct_add multiple times on the
   same ftrace_ops object, becase after first registration with
   register_ftrace_function_nolock, it uses ftrace_update_ops to
   update the ftrace_ops object

This change will allow us to have simple ftrace_ops for all bpf
direct interface users in following changes.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/bpf/20251230145010.103439-5-jolsa@kernel.org
2026-01-28 11:44:48 -08:00
Jiri Olsa 0e860d07c2 ftrace: Export some of hash related functions
We are going to use these functions in following changes.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/bpf/20251230145010.103439-4-jolsa@kernel.org
2026-01-28 11:44:45 -08:00
Jiri Olsa 676bfeae7b ftrace: Make alloc_and_copy_ftrace_hash direct friendly
Make alloc_and_copy_ftrace_hash to copy also direct address
for each hash entry.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/bpf/20251230145010.103439-3-jolsa@kernel.org
2026-01-28 11:44:43 -08:00
Jiri Olsa 4be42c9222 ftrace,bpf: Remove FTRACE_OPS_FL_JMP ftrace_ops flag
At the moment the we allow the jmp attach only for ftrace_ops that
has FTRACE_OPS_FL_JMP set. This conflicts with following changes
where we use single ftrace_ops object for all direct call sites,
so all could be be attached via just call or jmp.

We already limit the jmp attach support with config option and bit
(LSB) set on the trampoline address. It turns out that's actually
enough to limit the jmp attach for architecture and only for chosen
addresses (with LSB bit set).

Each user of register_ftrace_direct or modify_ftrace_direct can set
the trampoline bit (LSB) to indicate it has to be attached by jmp.

The bpf trampoline generation code uses trampoline flags to generate
jmp-attach specific code and ftrace inner code uses the trampoline
bit (LSB) to handle return from jmp attachment, so there's no harm
to remove the FTRACE_OPS_FL_JMP bit.

The fexit/fmodret performance stays the same (did not drop),
current code:

  fentry         :   77.904 ± 0.546M/s
  fexit          :   62.430 ± 0.554M/s
  fmodret        :   66.503 ± 0.902M/s

with this change:

  fentry         :   80.472 ± 0.061M/s
  fexit          :   63.995 ± 0.127M/s
  fmodret        :   67.362 ± 0.175M/s

Fixes: 25e4e3565d ("ftrace: Introduce FTRACE_OPS_FL_JMP")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/bpf/20251230145010.103439-2-jolsa@kernel.org
2026-01-28 11:44:35 -08:00
Guillaume Gonnet ae23bc81dd bpf: Fix tcx/netkit detach permissions when prog fd isn't given
This commit fixes a security issue where BPF_PROG_DETACH on tcx or
netkit devices could be executed by any user when no program fd was
provided, bypassing permission checks. The fix adds a capability
check for CAP_NET_ADMIN or CAP_SYS_ADMIN in this case.

Fixes: e420bed025 ("bpf: Add fd-based tcx multi-prog infra with link support")
Signed-off-by: Guillaume Gonnet <ggonnet.linux@gmail.com>
Link: https://lore.kernel.org/r/20260127160200.10395-1-ggonnet.linux@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-27 18:39:58 -08:00
Ilpo Järvinen 4326ab1806 resource: Increase MAX_IORES_LEVEL to 8
While debugging a PCI resource allocation issue, the resources for many
nested bridges and endpoints got flattened in /proc/iomem by
MAX_IORES_LEVEL that is set to 5. This made the iomem output hard to
read as the visual hierarchy cues were lost.

Increase MAX_IORES_LEVEL to 8 to avoid flattening PCI topologies with
nested bridges so aggressively (the case in the Link has the deepest
resource at level 7 so 8 looks a reasonable limit).

Link: https://bugzilla.kernel.org/show_bug.cgi?id=220775
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Link: https://patch.msgid.link/20251219174036.16738-5-ilpo.jarvinen@linux.intel.com
2026-01-27 16:36:51 -06:00
Matt Bobrowski 752b807028 bpf: add new BPF_CGROUP_ITER_CHILDREN control option
Currently, the BPF cgroup iterator supports walking descendants in
either pre-order (BPF_CGROUP_ITER_DESCENDANTS_PRE) or post-order
(BPF_CGROUP_ITER_DESCENDANTS_POST). These modes perform an exhaustive
depth-first search (DFS) of the hierarchy. In scenarios where a BPF
program may need to inspect only the direct children of a given parent
cgroup, a full DFS is unnecessarily expensive.

This patch introduces a new BPF cgroup iterator control option,
BPF_CGROUP_ITER_CHILDREN. This control option restricts the traversal
to the immediate children of a specified parent cgroup, allowing for
more targeted and efficient iteration, particularly when exhaustive
depth-first search (DFS) traversal is not required.

Signed-off-by: Matt Bobrowski <mattbobrowski@google.com>
Link: https://lore.kernel.org/r/20260127085112.3608687-1-mattbobrowski@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-27 09:05:54 -08:00
Tim Bird c86d39d680 kernel: debug: Add SPDX license ids to kdb files
Add GPL-2.0 license id to some files related to kdb and kgdb,
replacing references to GPL or COPYING.

These files were introduced into the kernel in 2008 and 2010.

Signed-off-by: Tim Bird <tim.bird@sony.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-27 15:57:20 +01:00
Lorenzo Pieralisi 0323897a88 irqdomain: Add parent field to struct irqchip_fwid
The GICv5 driver IRQ domain hierarchy requires adding a parent field to
struct irqchip_fwid so that core code can reference a fwnode_handle parent
for a given fwnode.

Add a parent field to struct irqchip_fwid and update the related kernel API
functions to initialize and handle it.

Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Acked-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260115-gicv5-host-acpi-v3-1-c13a9a150388@kernel.org
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2026-01-27 15:31:41 +01:00
Yury Norov 291487b753 cgroup: use nodes_and() output where appropriate
Now that nodes_and() returns true if the result nodemask is not empty,
drop useless nodes_intersects() in guarantee_online_mems() and
nodes_empty() in update_nodemasks_hier(), which both are O(N).

Link: https://lkml.kernel.org/r/20260114172217.861204-4-ynorov@nvidia.com
Signed-off-by: Yury Norov <ynorov@nvidia.com>
Reviewed-by: Gregory Price <gourry@gourry.net>
Reviewed-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mathew Brost <matthew.brost@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Yury Norov (NVIDIA) <yury.norov@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-26 20:02:37 -08:00
Pratyush Yadav (Google) 6ca9de3600 kho: print which scratch buffer failed to be reserved
When scratch area fails to reserve, KHO prints a message indicating that. 
But it doesn't say which scratch failed to allocate.  This can be useful
information for debugging.  Even more so when the failure is hard to
reproduce.

Along with the current message, also print which exact scratch area failed
to be reserved.

Link: https://lkml.kernel.org/r/20260116165416.1262531-1-pratyush@kernel.org
Signed-off-by: Pratyush Yadav (Google) <pratyush@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: David Matlack <dmatlack@google.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Pratyush Yadav <pratyush@kernel.org>
Cc: Samiullah Khawaja <skhawaja@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-26 19:07:15 -08:00
Finn Thain 3bb83c9109 bpf: explicitly align bpf_res_spin_lock
Patch series "Align atomic storage", v7.

This series adds the __aligned attribute to atomic_t and atomic64_t
definitions in include/linux and include/asm-generic (respectively) to get
natural alignment of both types on csky, m68k, microblaze, nios2, openrisc
and sh.

This series also adds Kconfig options to enable a new run-time warning to
help reveal misaligned atomic accesses on platforms which don't trap that.

The performance impact is expected to vary across platforms and workloads.
The measurements I made on m68k show that some workloads run faster and
others slower.


This patch (of 4):

Align bpf_res_spin_lock to avoid a BUILD_BUG_ON() when the alignment
changes, as it will do on m68k when, in a subsequent patch, the minimum
alignment of the atomic_t member of struct rqspinlock gets increased from
2 to 4.  Drop the BUILD_BUG_ON() as it becomes redundant.

Link: https://lkml.kernel.org/r/cover.1768281748.git.fthain@linux-m68k.org
Link: https://lkml.kernel.org/r/8a83876b07d1feacc024521e44059ae89abbb1ea.1768281748.git.fthain@linux-m68k.org
Signed-off-by: Finn Thain <fthain@linux-m68k.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Borislav Petkov (AMD)" <bp@alien8.de>
Cc: Daniel Borkman <daniel@iogearbox.net>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Eduard Zingerman <eddyz87@gmail.com>
Cc: Gary Guo <gary@garyguo.net>
Cc: Guo Ren <guoren@kernel.org>
Cc: Hao Luo <haoluo@google.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Marc Rutland <mark.rutland@arm.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Sasha Levin (Microsoft) <sashal@kernel.org>
Cc: Song Liu <song@kernel.org>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Stanislav Fomichev <sdf@fomichev.me>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yonghong Song <yonghong.song@linux.dev>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-26 19:07:14 -08:00
Mathieu Desnoyers 5e65b5ca7d tsacct: skip all kernel threads
This patch is a preparation step for HPCC, for the OOM killer
improvements.  I suspect that this patch is useful on its own, because it
really makes no sense to sum up accounting statistics of use_mm within
kernel threads which are only temporarily using those mm.

When we hit acct_account_cputime within a irq handler over a kthread that
happens to use a userspace mm, we end up summing up the mm's RSS into the
tsk acct_rss_mem1, which eventually decays.

I don't see a good rationale behind tracking the mm's rss in that way when
a kthread use a userspace mm temporarily through use_mm.

It causes issues with init_mm and efi_mm which only partially initialize
their mm_struct when introducing the new hierarchical percpu counters to
replace RSS counters, which requires a pointer dereference when reading
the approximate counter sum.  The current percpu counters simply load a
zeroed atomic counter, which happen to work.

Skip all kernel threads in acct_account_cputime(), not just those that
happen to have a NULL mm.

This is a preparation step before introducing the hierarchical percpu
counters.

Link: https://lkml.kernel.org/r/20251224173810.648699-2-mathieu.desnoyers@efficios.com
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Aboorva Devarajan <aboorvad@linux.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Christan König <christian.koenig@amd.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Liam R . Howlett" <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Martin Liu <liumartin@google.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: SeongJae Park <sj@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-26 19:07:13 -08:00
Long Wei 25929dae28 kho: remove duplicate header file references
kexec_handover_internal.h is included twice in kexec_handover.c.  Remove
the redundant first inclusion to eliminate the duplication.

Link: https://lkml.kernel.org/r/20251216114400.2677311-1-longwei27@huawei.com
Signed-off-by: Long Wei <longwei27@huawei.com>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: hewenliang <hewenliang4@huawei.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Pratyush Yadav <pratyush@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-26 19:07:13 -08:00
mingzhu.wang(王明珠) 2bbd9e1d14 kernel/fork: update obsolete use_mm references to kthread_use_mm
The comment for get_task_mm() in kernel/fork.c incorrectly references the
deprecated function `use_mm()`, which has been renamed to
`kthread_use_mm()` in kernel/kthread.c.

This patch updates the documentation to reflect the current function
names, ensuring accuracy when developers refer to the kernel thread memory
context API.

No functional changes were introduced.

Link: https://lkml.kernel.org/r/KUZPR04MB8965F954108B4DD7E8FFDB2B8F84A@KUZPR04MB8965.apcprd04.prod.outlook.com
Signed-off-by: mingzhu.wang <mingzhu.wang@transsion.com>
Cc: Ben Segall <bsegall@google.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiazi Li <jqqlijiazi@gmail.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Kees Cook <kees@kernel.org>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-26 19:07:12 -08:00
Jason Miu ac2d8102c4 kho: relocate vmalloc preservation structure to KHO ABI header
The `struct kho_vmalloc` defines the in-memory layout for preserving
vmalloc regions across kexec.  This layout is a contract between kernels
and part of the KHO ABI.

To reflect this relationship, the related structs and helper macros are
relocated to the ABI header, `include/linux/kho/abi/kexec_handover.h`. 
This move places the structure's definition under the protection of the
KHO_FDT_COMPATIBLE version string.

The structure and its components are now also documented within the ABI
header to describe the contract and prevent ABI breaks.

[rppt@kernel.org: update comment, per Pratyush]
  Link: https://lkml.kernel.org/r/aW_Mqp6HcqLwQImS@kernel.org
Link: https://lkml.kernel.org/r/20260105165839.285270-6-rppt@kernel.org
Signed-off-by: Jason Miu <jasonmiu@google.com>
Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Pratyush Yadav <pratyush@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-26 19:07:12 -08:00
Jason Miu 5e1ea1e27b kho: introduce KHO FDT ABI header
Introduce the `include/linux/kho/abi/kexec_handover.h` header file, which
defines the stable ABI for the KHO mechanism.  This header specifies how
preserved data is passed between kernels using an FDT.

The ABI contract includes the FDT structure, node properties, and the
"kho-v1" compatible string.  By centralizing these definitions, this
header serves as the foundational agreement for inter-kernel communication
of preserved states, ensuring forward compatibility and preventing
misinterpretation of data across kexec transitions.

Since the ABI definitions are now centralized in the header files, the
YAML files that previously described the FDT interfaces are redundant. 
These redundant files have therefore been removed.

Link: https://lkml.kernel.org/r/20260105165839.285270-5-rppt@kernel.org
Signed-off-by: Jason Miu <jasonmiu@google.com>
Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-26 19:07:12 -08:00
Mike Rapoport (Microsoft) a6f4e56828 kho: docs: combine concepts and FDT documentation
Currently index.rst in KHO documentation looks empty and sad as it only
contains links to "Kexec Handover Concepts" and "KHO FDT" chapters.

Inline contents of these chapters into index.rst to provide a single
coherent chapter describing KHO.

While on it, drop parts of the KHO FDT description that will be superseded
by addition of KHO ABI documentation.

[rppt@kernel.org: fix Documentation/core-api/kho/index.rst]
  Link: https://lkml.kernel.org/r/aV4bnHlBXGpT_FMc@kernel.org
Link: https://lkml.kernel.org/r/20260105165839.285270-4-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Jason Miu <jasonmiu@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Pratyush Yadav <pratyush@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-26 19:07:11 -08:00
Pasha Tatashin 998be0a4db liveupdate: separate memfd support into LIVEUPDATE_MEMFD
Decouple memfd preservation support from the core Live Update Orchestrator
configuration.

Previously, enabling CONFIG_LIVEUPDATE forced a dependency on CONFIG_SHMEM
and unconditionally compiled memfd_luo.o.  However, Live Update may be
used for purposes that do not require memfd-backed memory preservation.

Introduce CONFIG_LIVEUPDATE_MEMFD to gate memfd_luo.o.  This moves the
SHMEM and MEMFD_CREATE dependencies to the specific feature that needs
them, allowing the base LIVEUPDATE option to be selected independently of
shared memory support.

Link: https://lkml.kernel.org/r/20251230161402.1542099-1-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-26 19:07:10 -08:00
Breno Leitao bd58782995 vmcoreinfo: make hwerr_data visible for debugging
If the kernel is compiled with LTO, hwerr_data symbol might be lost, and
vmcoreinfo doesn't have it dumped.  This is currently seen in some
production kernels with LTO enabled.

Remove the static qualifier from hwerr_data so that the information is
still preserved when the kernel is built with LTO.  Making hwerr_data a
global symbol ensures its debug info survives the LTO link process and
appears in kallsyms.  Also document it, so it doesn't get removed in
the future as suggested by akpm.

Link: https://lkml.kernel.org/r/20260122-fix_vmcoreinfo-v2-1-2d6311f9e36c@debian.org
Fixes: 3fa805c37d ("vmcoreinfo: track and log recoverable hardware errors")
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Omar Sandoval <osandov@osandov.com>
Cc: Shuai Xue <xueshuai@linux.alibaba.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Zhiquan Li <zhiquan1.li@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-26 19:03:49 -08:00
Andrew Morton 412a32f0e5 kho: kho_preserve_vmalloc(): don't return 0 when ENOMEM
kho_preserve_vmalloc() should return -ENOMEM when new_vmalloc_chunk()
fails.

Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202601211636.IRaejjdw-lkp@intel.com/
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-26 19:03:48 -08:00
Ran Xiaokai e86436ad0a kho: init alloc tags when restoring pages from reserved memory
Memblock pages (including reserved memory) should have their allocation
tags initialized to CODETAG_EMPTY via clear_page_tag_ref() before being
released to the page allocator.  When kho restores pages through
kho_restore_page(), missing this call causes mismatched
allocation/deallocation tracking and below warning message:

alloc_tag was not set
WARNING: include/linux/alloc_tag.h:164 at ___free_pages+0xb8/0x260, CPU#1: swapper/0/1
RIP: 0010:___free_pages+0xb8/0x260
 kho_restore_vmalloc+0x187/0x2e0
 kho_test_init+0x3c4/0xa30
 do_one_initcall+0x62/0x2b0
 kernel_init_freeable+0x25b/0x480
 kernel_init+0x1a/0x1c0
 ret_from_fork+0x2d1/0x360

Add missing clear_page_tag_ref() annotation in kho_restore_page() to
fix this.

Link: https://lkml.kernel.org/r/20260122132740.176468-1-ranxiaokai627@163.com
Fixes: fc33e4b44b ("kexec: enable KHO support for memory preservation")
Signed-off-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-26 19:03:47 -08:00
Steven Rostedt 6bdf07302f tracing: Disable trace_printk buffer on warning too
When /proc/sys/kernel/traceoff_on_warning is set to 1, the top level
tracing buffer is disabled when a warning happens. This is very useful
when debugging and want the tracing buffer to stop taking new data when a
warning triggers keeping the events that lead up to the warning from being
overwritten.

Now that there is also a persistent ring buffer and an option to have
trace_printk go to that buffer, the same holds true for that buffer. A
warning could happen just before a crash but still write enough events to
lose the events that lead up to the first warning that was the reason for
the crash.

When /proc/sys/kernel/traceoff_on_warning is set to 1 and a warning is
triggered, not only disable the top level tracing buffer, but also disable
the buffer that trace_printk()s are written to.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Link: https://patch.msgid.link/20260121093858.5c5d7e7b@gandalf.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-01-26 17:45:17 -05:00
Guenter Roeck a9e0c5897a ftrace: Introduce and use ENTRIES_PER_PAGE_GROUP macro
ENTRIES_PER_PAGE_GROUP() returns the number of dyn_ftrace entries in a page
group, identified by its order.

No functional change.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260113152243.3557219-2-linux@roeck-us.net
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-01-26 17:45:12 -05:00
Steven Rostedt 2d8b7f9bf8 tracing: Have show_event_trigger/filter format a bit more in columns
By doing:

 # trace-cmd sqlhist -e -n futex_wait select TIMESTAMP_DELTA_USECS as lat from sys_enter_futex as start join sys_exit_futex as end on start.common_pid = end.common_pid

and

 # trace-cmd start -e futex_wait -f 'lat > 100' -e page_pool_state_release -f 'pfn == 1'

The output of the show_event_trigger and show_event_filter files are well
aligned because of the inconsistent 'tab' spacing:

 ~# cat /sys/kernel/tracing/show_event_triggers
syscalls:sys_exit_futex	hist:keys=common_pid:vals=hitcount:__lat_12046_2=common_timestamp.usecs-$__arg_12046_1:sort=hitcount:size=2048:clock=global:onmatch(syscalls.sys_enter_futex).trace(futex_wait,$__lat_12046_2) [active]
syscalls:sys_enter_futex	hist:keys=common_pid:vals=hitcount:__arg_12046_1=common_timestamp.usecs:sort=hitcount:size=2048:clock=global [active]

 ~# cat /sys/kernel/tracing/show_event_filters
synthetic:futex_wait	(lat > 100)
page_pool:page_pool_state_release	(pfn == 1)

This makes it not so easy to read. Instead, force the spacing to be at
least 32 bytes from the beginning (one space if the system:event is longer
than 30 bytes):

 ~# cat /sys/kernel/tracing/show_event_triggers
syscalls:sys_exit_futex          hist:keys=common_pid:vals=hitcount:__lat_8125_2=common_timestamp.usecs-$__arg_8125_1:sort=hitcount:size=2048:clock=global:onmatch(syscalls.sys_enter_futex).trace(futex_wait,$__lat_8125_2) [active]
syscalls:sys_enter_futex         hist:keys=common_pid:vals=hitcount:__arg_8125_1=common_timestamp.usecs:sort=hitcount:size=2048:clock=global [active]

 ~# cat /sys/kernel/tracing/show_event_filters
synthetic:futex_wait             (lat > 100)
page_pool:page_pool_state_release (pfn == 1)

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260112153408.18373e73@gandalf.local.home
Reviewed-by: Aaron Tomlin <atomlin@atomlin.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-01-26 17:45:06 -05:00
Petr Tesarik 8aa76aa415 ring-buffer: Use a housekeeping CPU to wake up waiters
Avoid running the wakeup irq_work on an isolated CPU. Since the wakeup can
run on any CPU, let's pick a housekeeping CPU to do the job.

This change reduces additional noise when tracing isolated CPUs. For
example, the following ipi_send_cpu stack trace was captured with
nohz_full=2 on the isolated CPU:

          <idle>-0       [002] d.h4.  1255.379293: ipi_send_cpu: cpu=2 callsite=irq_work_queue+0x2d/0x50 callback=rb_wake_up_waiters+0x0/0x80
          <idle>-0       [002] d.h4.  1255.379329: <stack trace>
 => trace_event_raw_event_ipi_send_cpu
 => __irq_work_queue_local
 => irq_work_queue
 => ring_buffer_unlock_commit
 => trace_buffer_unlock_commit_regs
 => trace_event_buffer_commit
 => trace_event_raw_event_x86_irq_vector
 => __sysvec_apic_timer_interrupt
 => sysvec_apic_timer_interrupt
 => asm_sysvec_apic_timer_interrupt
 => pv_native_safe_halt
 => default_idle
 => default_idle_call
 => do_idle
 => cpu_startup_entry
 => start_secondary
 => common_startup_64

The IRQ work interrupt alone adds considerable noise, but the impact can
get even worse with PREEMPT_RT, because the IRQ work interrupt is then
handled by a separate kernel thread. This requires a task switch and makes
tracing useless for analyzing latency on an isolated CPU.

After applying the patch, the trace is similar, but ipi_send_cpu always
targets a non-isolated CPU.

Unfortunately, irq_work_queue_on() is not NMI-safe. When running in NMI
context, fall back to queuing the irq work on the local CPU.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Clark Williams <clrkwllms@kernel.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Link: https://patch.msgid.link/20260108132132.2473515-1-ptesarik@suse.com
Signed-off-by: Petr Tesarik <ptesarik@suse.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-01-26 17:44:53 -05:00
Steven Rostedt e4ef389e76 tracing: Check the return value of tracing_update_buffers()
In the very unlikely event that tracing_update_buffers() fails in
trace_printk_init_buffers(), report the failure so that it is known.

Link: https://lore.kernel.org/all/20220917020353.3836285-1-floridsleeves@gmail.com/

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260107161510.4dc98b15@gandalf.local.home
Suggested-by: Li Zhong <floridsleeves@gmail.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-01-26 17:44:40 -05:00
Aaron Tomlin 6a80838814 tracing: Add show_event_triggers to expose active event triggers
To audit active event triggers, userspace currently must traverse the
events/ directory and read each individual trigger file. This is
cumbersome for system-wide auditing or debugging.

Introduce "show_event_triggers" at the trace root directory. This file
displays all events that currently have one or more triggers applied,
alongside the trigger configuration, in a consolidated
system:event [tab] trigger format.

The implementation leverages the existing trace_event_file iterators
and uses the trigger's own print() operation to ensure output
consistency with the per-event trigger files.

Link: https://patch.msgid.link/20260105142939.2655342-3-atomlin@atomlin.com
Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-01-26 17:44:24 -05:00
Aaron Tomlin 729757b96a tracing: Add show_event_filters to expose active event filters
Currently, to audit active Ftrace event filters, userspace must
recursively traverse the events/ directory and read each individual
filter file. This is inefficient for monitoring tools and debugging.

Introduce "show_event_filters" at the trace root directory. This file
displays all events that currently have a filter applied, alongside the
actual filter string, in a consolidated system:event [tab] filter
format.

The implementation reuses the existing trace_event_file iterators to
ensure atomic traversal of the event list and utilises guard(rcu)() for
automatic, scope-based protection when accessing volatile filter
strings.

Link: https://patch.msgid.link/20260105142939.2655342-2-atomlin@atomlin.com
Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-01-26 17:44:15 -05:00
Marco Crivellari e5136678b1 tracing: Replace use of system_wq with system_dfl_wq
This patch continues the effort to refactor workqueue APIs, which has begun
with the changes introducing new workqueues and a new alloc_workqueue flag:

   commit 128ea9f6cc ("workqueue: Add system_percpu_wq and system_dfl_wq")
   commit 930c2ea566 ("workqueue: Add new WQ_PERCPU flag")

The point of the refactoring is to eventually alter the default behavior of
workqueues to become unbound by default so that their workload placement is
optimized by the scheduler.

Before that to happen after a careful review and conversion of each individual
case, workqueue users must be converted to the better named new workqueues with
no intended behaviour changes:

   system_wq -> system_percpu_wq
   system_unbound_wq -> system_dfl_wq

This specific workflow has no benefits being per-cpu, so instead of
system_percpu_wq the new unbound workqueue has been used (system_dfl_wq).

This way the old obsolete workqueues (system_wq, system_unbound_wq) can be
removed in the future.

Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251230142820.173712-1-marco.crivellari@suse.com
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-01-26 17:44:05 -05:00
Aaron Tomlin 2cddfc2e8f tracing: Add bitmask-list option for human-readable bitmask display
Add support for displaying bitmasks in human-readable list format (e.g.,
0,2-5,7) in addition to the default hexadecimal bitmap representation.
This is particularly useful when tracing CPU masks and other large
bitmasks where individual bit positions are more meaningful than their
hexadecimal encoding.

When the "bitmask-list" option is enabled, the printk "%*pbl" format
specifier is used to render bitmasks as comma-separated ranges, making
trace output easier to interpret for complex CPU configurations and
large bitmask values.

Link: https://patch.msgid.link/20251226160724.2246493-2-atomlin@atomlin.com
Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-01-26 17:00:50 -05:00
Steven Rostedt a4e0ea0e10 tracing: Remove redundant call to event_trigger_reset_filter() in event_hist_trigger_parse()
With the change to replace kfree() with trigger_data_free(), which starts
out doing the exact same thing as event_trigger_reset_filter(), there's no
reason to call event_trigger_reset_filter() before calling
trigger_data_free(). Remove the call to it.

Link: https://lore.kernel.org/linux-trace-kernel/20251211204520.0f3ba6d1@fedora/

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Miaoqian Lin <linmq006@gmail.com>
Link: https://patch.msgid.link/20260108174429.2d9ca51f@gandalf.local.home
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-01-26 17:00:50 -05:00
Miaoqian Lin 0550069cc2 tracing: Properly process error handling in event_hist_trigger_parse()
Memory allocated with trigger_data_alloc() requires trigger_data_free()
for proper cleanup.

Replace kfree() with trigger_data_free() to fix this.

Found via static analysis and code review.

This isn't a real bug due to the current code basically being an open
coded version of trigger_data_free() without the synchronization. The
synchronization isn't needed as this is the error path of creation and
there's nothing to synchronize against yet. Replace the kfree() to be
consistent with the allocation.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Link: https://patch.msgid.link/20251211100058.2381268-1-linmq006@gmail.com
Fixes: e1f187d09e ("tracing: Have existing event_command.parse() implementations use helpers")
Signed-off-by: Miaoqian Lin <linmq006@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-01-26 17:00:50 -05:00
Menglong Dong eeee4239db bpf: support fsession for bpf_session_cookie
Implement session cookie for fsession. The session cookies will be stored
in the stack, and the layout of the stack will look like this:
  return value	-> 8 bytes
  argN		-> 8 bytes
  ...
  arg1		-> 8 bytes
  nr_args	-> 8 bytes
  ip (optional)	-> 8 bytes
  cookie2	-> 8 bytes
  cookie1	-> 8 bytes

The offset of the cookie for the current bpf program, which is in 8-byte
units, is stored in the
"(((u64 *)ctx)[-1] >> BPF_TRAMP_COOKIE_INDEX_SHIFT) & 0xFF". Therefore, we
can get the session cookie with ((u64 *)ctx)[-offset].

Implement and inline the bpf_session_cookie() for the fsession in the
verifier.

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Link: https://lore.kernel.org/r/20260124062008.8657-6-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-24 18:49:36 -08:00
Menglong Dong 27d89baa6d bpf: support fsession for bpf_session_is_return
If fsession exists, we will use the bit (1 << BPF_TRAMP_IS_RETURN_SHIFT)
in ((u64 *)ctx)[-1] to store the "is_return" flag.

The logic of bpf_session_is_return() for fsession is implemented in the
verifier by inline following code:

  bool bpf_session_is_return(void *ctx)
  {
      return (((u64 *)ctx)[-1] >> BPF_TRAMP_IS_RETURN_SHIFT) & 1;
  }

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Co-developed-by: Leon Hwang <leon.hwang@linux.dev>
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/20260124062008.8657-5-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-24 18:49:36 -08:00
Menglong Dong 8fe4dc4f64 bpf: change prototype of bpf_session_{cookie,is_return}
Add the function argument of "void *ctx" to bpf_session_cookie() and
bpf_session_is_return(), which is a preparation of the next patch.

The two kfunc is seldom used now, so it will not introduce much effect
to change their function prototype.

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20260124062008.8657-4-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-24 18:49:35 -08:00
Menglong Dong f1b56b3cbd bpf: use the least significant byte for the nr_args in trampoline
For now, ((u64 *)ctx)[-1] is used to store the nr_args in the trampoline.
However, 1 byte is enough to store such information. Therefore, we use
only the least significant byte of ((u64 *)ctx)[-1] to store the nr_args,
and reserve the rest for other usages.

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Link: https://lore.kernel.org/r/20260124062008.8657-3-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-24 18:49:35 -08:00
Menglong Dong 2d419c4465 bpf: add fsession support
The fsession is something that similar to kprobe session. It allow to
attach a single BPF program to both the entry and the exit of the target
functions.

Introduce the struct bpf_fsession_link, which allows to add the link to
both the fentry and fexit progs_hlist of the trampoline.

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Co-developed-by: Leon Hwang <leon.hwang@linux.dev>
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/20260124062008.8657-2-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-24 18:49:35 -08:00
Linus Torvalds b83a8ff87a tracing fixes for v6.19:
- Fix a crash with passing a stacktrace between synthetic events
 
   A synthetic event is an event that combines two events into a single event
   that can display fields from both events as well as the time delta that
   took place between the events. It can also pass a stacktrace from the
   first event so that it can be displayed by the synthetic event (this is
   useful to get a stacktrace of a task scheduling out when blocked and
   recording the time it was blocked for).
 
   A synthetic event can also connect an existing synthetic event to another
   event. An issue was found that if the first synthetic event had a stacktrace
   as one of its fields, and that stacktrace field was passed to the new
   synthetic event to be displayed, it would crash the kernel. This was due to
   the stacktrace not being saved as a stacktrace but was still marked as one.
   When the stacktrace was read, it would try to read an array but instead read
   the integer metadata of the stacktrace and dereferenced a bad value.
 
   Fix this by saving the stacktrace field as a stracktrace.
 
 - Fix possible overflow in cmp_mod_entry() compare function
 
   A binary search is used to find a module address and if the addresses are
   greater than 2GB apart it could lead to truncation and cause a bad search
   result. Use normal compares instead of a subtraction between addresses to
   calculate the compare value.
 
 - Fix output of entry arguments in function graph tracer
 
   Depending on the configurations enabled, the entry can be two different
   types that hold the argument array. The macro FGRAPH_ENTRY_ARGS() is used
   to find the correct arguments from the given type. One location was missed
   and still referenced the arguments directly via entry->args and could
   produce the wrong value depending on how the kernel was configured.
 
 - Fix memory leak in scripts/tracepoint-update build tool
 
   If the array fails to allocate, the memory for the values needs to be
   freed and was not. Free the allocated values if the array failed to
   allocate.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaXUQLxQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qgsJAQDgtWH9DWUkJKgzXTkiOA0l8JArPOVf
 tCSMla2wWJA70QD/as2ptacYAFU9v1oxO5YIgsKOLFBF68ZUIhJtvXpqtAE=
 =JeC6
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Fix a crash with passing a stacktrace between synthetic events

   A synthetic event is an event that combines two events into a single
   event that can display fields from both events as well as the time
   delta that took place between the events. It can also pass a
   stacktrace from the first event so that it can be displayed by the
   synthetic event (this is useful to get a stacktrace of a task
   scheduling out when blocked and recording the time it was blocked
   for).

   A synthetic event can also connect an existing synthetic event to
   another event. An issue was found that if the first synthetic event
   had a stacktrace as one of its fields, and that stacktrace field was
   passed to the new synthetic event to be displayed, it would crash the
   kernel. This was due to the stacktrace not being saved as a
   stacktrace but was still marked as one. When the stacktrace was read,
   it would try to read an array but instead read the integer metadata
   of the stacktrace and dereferenced a bad value.

   Fix this by saving the stacktrace field as a stacktrace.

 - Fix possible overflow in cmp_mod_entry() compare function

   A binary search is used to find a module address and if the addresses
   are greater than 2GB apart it could lead to truncation and cause a
   bad search result. Use normal compares instead of a subtraction
   between addresses to calculate the compare value.

 - Fix output of entry arguments in function graph tracer

   Depending on the configurations enabled, the entry can be two
   different types that hold the argument array. The macro
   FGRAPH_ENTRY_ARGS() is used to find the correct arguments from the
   given type. One location was missed and still referenced the
   arguments directly via entry->args and could produce the wrong value
   depending on how the kernel was configured.

 - Fix memory leak in scripts/tracepoint-update build tool

   If the array fails to allocate, the memory for the values needs to be
   freed and was not. Free the allocated values if the array failed to
   allocate.

* tag 'trace-v6.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  scripts/tracepoint-update: Fix memory leak in add_string() on failure
  function_graph: Fix args pointer mismatch in print_graph_retval()
  tracing: Avoid possible signed 64-bit truncation
  tracing: Fix crash on synthetic stacktrace field usage
2026-01-24 17:18:57 -08:00
Linus Torvalds 12a0094839 Misc fixes:
- Fix auxiliary timekeeper update & locking bug
 
  - Reduce the sensitivity of the clocksource watchdog, to
    fix false positive measurements that marked the
    TSC clocksource unstable.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAml0ksERHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1gvtRAAvLWNMb9YPW/65Gn8dkyMQUPzXBjEaq8A
 yP4L1b8EujoM6fSeQC0Y367hpn1GKHhEGZyj9ksRcU4dsU5XWlzPZr9QXCETmMuh
 ffTCvrUGI6d95685S+R1VplmhoCQAkerQAFcPQDAQd0QgfoEJO+hf2AWHrilnicu
 gCcZGDE/+gLAPYjR7LaRu7vb0W6VtwqhXvz8xTCGALmMlU84BDT3deLzCmujxtbF
 PvNaShBAtppm468Ln6HY2mk4mN5kWthPnonNF4n0zVYy8uAHLEEUERr/LndZ60Ua
 KlFgKukfoPXyJoU0M0umNcX6oXaRw7DeyNcPtJovZUwtfXyjkTPWrcfZ4sD3r37K
 QWjFqmbTCtj70vlUFP2RiHusOmNkuzcWKww5KdpA+HoeXEI4zcjhZq7zObyjDPIZ
 t0Cs5sZoWWpL7o53ikMjsO2Fe/zSDRaocYyImCWh2U+DdBn3/fh8a0pboXQakujx
 kjmuDrHaLXFNMI9h7NvlP143IW8g7AHUpu0piDGLVFFkZoNcII/8g7qawemQw8T9
 ZCUmL3oq1Zu0z3aGq9GRFz31ysVLXwDZdtY8CCuHxgVTuZQQnRNrLiNiTjZn75E/
 PY63jtSgKNJsAOTHJZ5hnyvcGb8w05anU0T7M38kTJFtiX4R6JaaDJVmj3eFG3g8
 es9cQ4gJGmo=
 =O1ly
 -----END PGP SIGNATURE-----

Merge tag 'timers-urgent-2026-01-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fixes from Ingo Molnar:

 - Fix auxiliary timekeeper update & locking bug

 - Reduce the sensitivity of the clocksource watchdog,
   to fix false positive measurements that marked the
   TSC clocksource unstable

* tag 'timers-urgent-2026-01-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  clocksource: Reduce watchdog readout delay limit to prevent false positives
  timekeeping: Adjust the leap state for the correct auxiliary timekeeper
2026-01-24 09:36:03 -08:00
Linus Torvalds af5a3fae86 Miscellaneous scheduler fixes:
- Fix PELT clock synchronization bug when entering idle
 
  - Disable the NEXT_BUDDY feature, as during extensive testing
    Mel found that the negatives outweigh the positives.
 
  - Make wakeup preemption less aggressive, which resulted in
    an unreasonable increase in preemption frequency.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAml0kYYRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1j/Bw/+M5UK6EOPzLTs0Tj87X/kI7vXLN/uf9+K
 FpDtFNmoJnJKrQxPFfa8aT8OAz7zLfnjrGmRgxOG1fFkACMuB8s+TO4+i9lhzAsF
 gHOg6lDczi4K14mkTyiP/Bdaf3lZThghZitdRCBZFN4gjBpAh1rRI4ikQ0a3pwJH
 6y1NF8fX/xRBm6Bgprgbv/em+fKsO89i6NwunPtYaxHsOGA5U0FVPFSa3yghz+X2
 7Sl4U8LRdkU3Z62M+I8DKWuAMMb9nLwEXrSiNy+KJHIJ7FNFb1+U10gPrLq+z2Oi
 hvd25pzDzGUgOglZ7f1IRL5H48putMjCqv+A2wOZnzl7fHt4c02hpTBqMDle5hEP
 DYHyNNC1ZIRQ1zuy8j33LzB10ycP6pX9nawt+S1trEZcDVwaKwq9TbYsBjbJKtmZ
 V7C181fTUaNQ6M4wJY3gx+hD1ocrLv+Y0vhXJnK7sg+j4IhWJ7Yk6ne9hrpXbbUj
 cK8gzvo4OkShYtMlB9ut/RIiuyvFThyJd9rcNkeN4tTM9DVFURvOlIK2W6U70Yty
 OJDkhI6KBmkERPQoa43OsGvxZwF+g0KY8ISsLq3bJ2vLyATPdpJ2ns+zRg6I+B+q
 zHZmI86Fc8gr+O/y9oqi2VJhgio34RcoSId/QcAPC1Gl4TaIcmqFFTSzMhI+kLk9
 Ng7vHy/3KQg=
 =pGnx
 -----END PGP SIGNATURE-----

Merge tag 'sched-urgent-2026-01-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Ingo Molnar:

 - Fix PELT clock synchronization bug when entering idle

 - Disable the NEXT_BUDDY feature, as during extensive testing
   Mel found that the negatives outweigh the positives

 - Make wakeup preemption less aggressive, which resulted in
   an unreasonable increase in preemption frequency

* tag 'sched-urgent-2026-01-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Revert force wakeup preemption
  sched/fair: Disable scheduler feature NEXT_BUDDY
  sched/fair: Fix pelt clock sync when entering idle
2026-01-24 09:29:41 -08:00
Linus Torvalds ceaeaf66a2 Two perf events fixes:
- Fix mmap_count warning & bug when creating a group member event
    with the PERF_FLAG_FD_OUTPUT flag.
 
  - Disable the sample period == 1 branch events BTS optimization
    on guests, because BTS is not virtualized.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAml0kF8RHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1icZxAAhsYBY0uB3OzAgJYRIVlw/vzF7ubHh8+r
 PhQ+azap/uSk/dS8IXZ0FrP/vq9kgqEFpsYwWbB1EFRxB24nd8AAYarASyc22Dhb
 Qe6tKh4hzlDC9Cw0vJ111PgKJLiPDrmkxzS4C2HkPwYwriNql81hQjLW5nF6dHwg
 i8aP7+/uNB2h/xnPIRgkUFSUzBacRz1Snqgi/vSHmwEhk1GFI48rXoYywM9ItAnR
 CGN7tGxJ45YwDYeknZf9Ngsd3q+/eh38ihPfMEKoulf4oFvhIsiG4rJ1Vee0V9fD
 aD1NhukULtjw3lkKnMM75W+Jdb7fHDTOuxzUov9WpyIOIUTMqRkbvaKHgJzSD2SK
 01TbP3kQixhGhiRXx78GDQwYGjX8JQxngbqyJvL708GNvxbcKYrLhqtr5Ho00pWH
 ERxx/ajoDXB7Neo7XPhgYRHk/lrlnZsK1LoidhMzN1UX6C12VnhPcD+zUSqRxz5w
 yFuJp0+7wF+G74FpO7kv8jv5KDB/lery5mdzWu0kqAKMfYWdw5eGoaI6T48DVoBy
 IwNYU8bxDBRPzu64XudDE4xBuTuy4HpJmbvOUkxMUkBGMTZ+nysqHu8D7cJJqYtL
 2xYJOpR+P4Se9fRyR9xo5vtTZy27TQqTp+AjA5RlIoVOJhYK5T9mAFCJaEhdupbM
 2IA4xdYhbb8=
 =lNSW
 -----END PGP SIGNATURE-----

Merge tag 'perf-urgent-2026-01-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf events fixes from Ingo Molnar:

 - Fix mmap_count warning & bug when creating a group member event
   with the PERF_FLAG_FD_OUTPUT flag

 - Disable the sample period == 1 branch events BTS optimization
   on guests, because BTS is not virtualized

* tag 'perf-urgent-2026-01-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/intel: Do not enable BTS for guests
  perf: Fix refcount warning on event->mmap_count increment
2026-01-24 09:24:17 -08:00
Dave Jiang 3f7938b1ae Merge branch 'for-7.0/cxl-init' into cxl-for-next
Merge in patches to support several patch series such as Soft Reserve
handling, type2 accelerator enabling, and LSA 2.1 labeling support.
Mainly addition of cxl_memdev_attach() to allow the memdev probe
to make a decision of proceed/fail depending success of CXL topology
enumeration.

dax/hmem, e820, resource: Defer Soft Reserved insertion until hmem is ready
cxl/mem: Introduce cxl_memdev_attach for CXL-dependent operation
cxl/mem: Drop @host argument to devm_cxl_add_memdev()
cxl/mem: Convert devm_cxl_add_memdev() to scope-based-cleanup
cxl/port: Arrange for always synchronous endpoint attach
cxl/mem: Arrange for always-synchronous memdev attach
cxl/mem: Fix devm_cxl_memdev_edac_release() confusion
2026-01-23 14:13:16 -07:00
Boqun Feng ed062c41df Merge branch 'rcu-nocb.20260123a'
* rcu-nocb.20260123a:
  rcu/nocb: Extract nocb_defer_wakeup_cancel() helper
  rcu/nocb: Remove dead callback overload handling
  rcu/nocb: Remove unnecessary WakeOvfIsDeferred wake path
2026-01-23 11:15:36 -08:00
Joel Fernandes cc74050f13 rcu/nocb: Extract nocb_defer_wakeup_cancel() helper
The pattern of checking nocb_defer_wakeup and deleting the timer is
duplicated in __wake_nocb_gp() and nocb_gp_wait(). Extract this into a
common helper function nocb_defer_wakeup_cancel().

This removes code duplication and makes it easier to maintain.

Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2026-01-23 11:12:25 -08:00
Joel Fernandes b11c1efa7f rcu/nocb: Remove dead callback overload handling
During callback overload (exceeding qhimark), the NOCB code attempts
opportunistic advancement via rcu_advance_cbs_nowake(). Analysis shows
this code path is practically unreachable and serves no useful purpose.

Testing with 300,000 callback floods showed:
- 30 overload conditions triggered
- 0 advancements actually occurred

While a theoretical window exists where this code could execute (e.g.,
vCPU preemption between gp_seq update and rcu_nocb_gp_cleanup()), even
if it did, the advancement would be redundant. The rcuog kthread must
still run to wake the rcuoc callback thread - we would just be
duplicating work that rcuog will perform when it finally gets to run.

Since this path provides no meaningful benefit and extensive testing
confirms it is never useful, remove it entirely.

Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2026-01-23 11:12:25 -08:00
Joel Fernandes d92eca60fe rcu/nocb: Remove unnecessary WakeOvfIsDeferred wake path
The WakeOvfIsDeferred code path in __call_rcu_nocb_wake() attempts to
wake rcuog when the callback count exceeds qhimark and callbacks aren't
done with their GP (newly queued or awaiting GP). However, a lot of
testing proves this wake is always redundant or useless.

In the flooding case, rcuog is always waiting for a GP to finish. So
waking up the rcuog thread is pointless. The timer wakeup adds overhead,
rcuog simply wakes up and goes back to sleep achieving nothing.

This path also adds a full memory barrier, and additional timer expiry
modifications unnecessarily.

The root cause is that WakeOvfIsDeferred fires when
!rcu_segcblist_ready_cbs() (GP not complete), but waking rcuog cannot
accelerate GP completion.

This commit therefore removes this path.

Tested with rcutorture scenarios: TREE01, TREE05, TREE08 (all NOCB
configurations) - all pass. Also stress tested using a kernel module
that floods call_rcu() to trigger the overload conditions and made the
observations confirming the findings.

Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2026-01-23 11:12:25 -08:00
Donglin Peng c9703d17d2 function_graph: Fix args pointer mismatch in print_graph_retval()
When funcgraph-args and funcgraph-retaddr are both enabled, many kernel
functions display invalid parameters in trace logs.

The issue occurs because print_graph_retval() passes a mismatched args
pointer to print_function_args(). Fix this by retrieving the correct
args pointer using the FGRAPH_ENTRY_ARGS() macro.

Link: https://patch.msgid.link/20260112021601.1300479-1-dolinux.peng@gmail.com
Fixes: f83ac7544f ("function_graph: Enable funcgraph-args and funcgraph-retaddr to work simultaneously")
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Donglin Peng <pengdonglin@xiaomi.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-01-23 13:34:38 -05:00
Ian Rogers 00f13e28a9 tracing: Avoid possible signed 64-bit truncation
64-bit truncation to 32-bit can result in the sign of the truncated
value changing. The cmp_mod_entry is used in bsearch and so the
truncation could result in an invalid search order. This would only
happen were the addresses more than 2GB apart and so unlikely, but
let's fix the potentially broken compare anyway.

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260108002625.333331-1-irogers@google.com
Signed-off-by: Ian Rogers <irogers@google.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-01-23 13:34:30 -05:00
Steven Rostedt 90f9f5d64c tracing: Fix crash on synthetic stacktrace field usage
When creating a synthetic event based on an existing synthetic event that
had a stacktrace field and the new synthetic event used that field a
kernel crash occurred:

 ~# cd /sys/kernel/tracing
 ~# echo 's:stack unsigned long stack[];' > dynamic_events
 ~# echo 'hist:keys=prev_pid:s0=common_stacktrace if prev_state & 3' >> events/sched/sched_switch/trigger
 ~# echo 'hist:keys=next_pid:s1=$s0:onmatch(sched.sched_switch).trace(stack,$s1)' >> events/sched/sched_switch/trigger

The above creates a synthetic event that takes a stacktrace when a task
schedules out in a non-running state and passes that stacktrace to the
sched_switch event when that task schedules back in. It triggers the
"stack" synthetic event that has a stacktrace as its field (called "stack").

 ~# echo 's:syscall_stack s64 id; unsigned long stack[];' >> dynamic_events
 ~# echo 'hist:keys=common_pid:s2=stack' >> events/synthetic/stack/trigger
 ~# echo 'hist:keys=common_pid:s3=$s2,i0=id:onmatch(synthetic.stack).trace(syscall_stack,$i0,$s3)' >> events/raw_syscalls/sys_exit/trigger

The above makes another synthetic event called "syscall_stack" that
attaches the first synthetic event (stack) to the sys_exit trace event and
records the stacktrace from the stack event with the id of the system call
that is exiting.

When enabling this event (or using it in a historgram):

 ~# echo 1 > events/synthetic/syscall_stack/enable

Produces a kernel crash!

 BUG: unable to handle page fault for address: 0000000000400010
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 0 P4D 0
 Oops: Oops: 0000 [#1] SMP PTI
 CPU: 6 UID: 0 PID: 1257 Comm: bash Not tainted 6.16.3+deb14-amd64 #1 PREEMPT(lazy)  Debian 6.16.3-1
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.17.0-debian-1.17.0-1 04/01/2014
 RIP: 0010:trace_event_raw_event_synth+0x90/0x380
 Code: c5 00 00 00 00 85 d2 0f 84 e1 00 00 00 31 db eb 34 0f 1f 00 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 2e 0f 1f 84 00 00 00 00 00 <49> 8b 04 24 48 83 c3 01 8d 0c c5 08 00 00 00 01 cd 41 3b 5d 40 0f
 RSP: 0018:ffffd2670388f958 EFLAGS: 00010202
 RAX: ffff8ba1065cc100 RBX: 0000000000000000 RCX: 0000000000000000
 RDX: 0000000000000001 RSI: fffff266ffda7b90 RDI: ffffd2670388f9b0
 RBP: 0000000000000010 R08: ffff8ba104e76000 R09: ffffd2670388fa50
 R10: ffff8ba102dd42e0 R11: ffffffff9a908970 R12: 0000000000400010
 R13: ffff8ba10a246400 R14: ffff8ba10a710220 R15: fffff266ffda7b90
 FS:  00007fa3bc63f740(0000) GS:ffff8ba2e0f48000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000400010 CR3: 0000000107f9e003 CR4: 0000000000172ef0
 Call Trace:
  <TASK>
  ? __tracing_map_insert+0x208/0x3a0
  action_trace+0x67/0x70
  event_hist_trigger+0x633/0x6d0
  event_triggers_call+0x82/0x130
  trace_event_buffer_commit+0x19d/0x250
  trace_event_raw_event_sys_exit+0x62/0xb0
  syscall_exit_work+0x9d/0x140
  do_syscall_64+0x20a/0x2f0
  ? trace_event_raw_event_sched_switch+0x12b/0x170
  ? save_fpregs_to_fpstate+0x3e/0x90
  ? _raw_spin_unlock+0xe/0x30
  ? finish_task_switch.isra.0+0x97/0x2c0
  ? __rseq_handle_notify_resume+0xad/0x4c0
  ? __schedule+0x4b8/0xd00
  ? restore_fpregs_from_fpstate+0x3c/0x90
  ? switch_fpu_return+0x5b/0xe0
  ? do_syscall_64+0x1ef/0x2f0
  ? do_fault+0x2e9/0x540
  ? __handle_mm_fault+0x7d1/0xf70
  ? count_memcg_events+0x167/0x1d0
  ? handle_mm_fault+0x1d7/0x2e0
  ? do_user_addr_fault+0x2c3/0x7f0
  entry_SYSCALL_64_after_hwframe+0x76/0x7e

The reason is that the stacktrace field is not labeled as such, and is
treated as a normal field and not as a dynamic event that it is.

In trace_event_raw_event_synth() the event is field is still treated as a
dynamic array, but the retrieval of the data is considered a normal field,
and the reference is just the meta data:

// Meta data is retrieved instead of a dynamic array
  str_val = (char *)(long)var_ref_vals[val_idx];

// Then when it tries to process it:
  len = *((unsigned long *)str_val) + 1;

It triggers a kernel page fault.

To fix this, first when defining the fields of the first synthetic event,
set the filter type to FILTER_STACKTRACE. This is used later by the second
synthetic event to know that this field is a stacktrace. When creating
the field of the new synthetic event, have it use this FILTER_STACKTRACE
to know to create a stacktrace field to copy the stacktrace into.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Link: https://patch.msgid.link/20260122194824.6905a38e@gandalf.local.home
Fixes: 00cf3d672a ("tracing: Allow synthetic events to pass around stacktraces")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-01-23 13:34:21 -05:00
Kumar Kartikeya Dwivedi 82f3b142c9 rqspinlock: Fix TAS fallback lock entry creation
The TAS fallback can be invoked directly when queued spin locks are
disabled, and through the slow path when paravirt is enabled for queued
spin locks. In the latter case, the res_spin_lock macro will attempt the
fast path and already hold the entry when entering the slow path. This
will lead to creation of extraneous entries that are not released, which
may cause false positives for deadlock detection.

Fix this by always preceding invocation of the TAS fallback in every
case with the grabbing of the held lock entry, and add a comment to make
note of this.

Fixes: c9102a68c0 ("rqspinlock: Add a test-and-set fallback")
Reported-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Tested-by: Amery Hung <ameryhung@gmail.com>
Link: https://lore.kernel.org/r/20260122115911.3668985-1-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-23 10:03:49 -08:00
Vincent Guittot 15257cc2f9 sched/fair: Revert force wakeup preemption
This agressively bypasses run_to_parity and slice protection with the
assumpiton that this is what waker wants but there is no garantee that
the wakee will be the next to run. It is a better choice to use
yield_to_task or WF_SYNC in such case.

This increases the number of resched and preemption because a task becomes
quickly "ineligible" when it runs; We update the task vruntime periodically
and before the task exhausted its slice or at least quantum.

Example:
2 tasks A and B wake up simultaneously with lag = 0. Both are
eligible. Task A runs 1st and wakes up task C. Scheduler updates task
A's vruntime which becomes greater than average runtime as all others
have a lag == 0 and didn't run yet. Now task A is ineligible because
it received more runtime than the other task but it has not yet
exhausted its slice nor a min quantum. We force preemption, disable
protection but Task B will run 1st not task C.

Sidenote, DELAY_ZERO increases this effect by clearing positive lag at
wake up.

Fixes: e837456fdc ("sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals")
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260123102858.52428-1-vincent.guittot@linaro.org
2026-01-23 11:53:20 +01:00
Mel Gorman 4f70f106bc sched/fair: Disable scheduler feature NEXT_BUDDY
NEXT_BUDDY was disabled with the introduction of EEVDF and enabled again
after NEXT_BUDDY was rewritten for EEVDF by commit e837456fdc ("sched/fair:
Reimplement NEXT_BUDDY to align with EEVDF goals"). It was not expected
that this would be a universal win without a crystal ball instruction
but the reported regressions are a concern [1][2] even if gains were
also reported. Specifically;

o mysql with client/server running on different servers regresses
o specjbb reports lower peak metrics
o daytrader regresses

The mysql is realistic and a concern. It needs to be confirmed if
specjbb is simply shifting the point where peak performance is measured
but still a concern. daytrader is considered to be representative of a
real workload.

Access to test machines is currently problematic for verifying any fix to
this problem. Disable NEXT_BUDDY for now by default until the root causes
are addressed.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Link: https://lore.kernel.org/lkml/4b96909a-f1ac-49eb-b814-97b8adda6229@arm.com [1]
Link: https://lore.kernel.org/lkml/ec3ea66f-3a0d-4b5a-ab36-ce778f159b5b@linux.ibm.com [2]
Link: https://patch.msgid.link/fyqsk63pkoxpeaclyqsm5nwtz3dyejplr7rg6p74xwemfzdzuu@7m7xhs5aqpqw
2026-01-23 11:53:19 +01:00
Thomas Weißschuh c1b12cd933 padata: Constify padata_sysfs_entry structs
These structs are never modified.

To prevent malicious or accidental modifications due to bugs,
mark them as const.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2026-01-23 13:48:44 +08:00
Ard Biesheuvel a081b57892
kallsyms: Get rid of kallsyms relative base
When the kallsyms relative base was introduced, per-CPU variable
references on x86_64 SMP were implemented as offsets into the respective
per-CPU region, rather than offsets relative to the location of the
variable's template in the kernel image, which is how other
architectures implement it.

This required kallsyms to reason about the difference between the two,
and the sign of the value in the kallsyms_offsets[] array was used to
distinguish them. This meant that negative offsets were not permitted
for ordinary variables, and so it was crucial that the relative base was
chosen such that all offsets were positive numbers.

This is no longer needed: instead, the offsets can simply be encoded as
values in the range -/+ 2 GiB, which is precisely what PC32 relocations
provide on most architectures. So it is possible to simplify the logic,
and just use _text as the anchor directly, and let the linker calculate
the final value based on the location of the entry itself.

Some architectures (nios2, extensa) do not support place-relative
relocations at all, but these are all 32-bit and non-relocatable, and so
there is no need for place-relative relocations in the first place, and
the actual symbol values can just be stored directly.

This makes all entries in the kallsyms_offsets[] array visible as
place-relative references in the ELF metadata, which will be important
when implementing ELF-based fg-kaslr.

Reviewed-by: Kees Cook <kees@kernel.org>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://patch.msgid.link/20260116093359.2442297-6-ardb+git@google.com
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
2026-01-22 15:58:22 -07:00
Frederic Weisbecker de715325cc cpu: Revert "cpu/hotplug: Prevent self deadlock on CPU hot-unplug"
1) The commit:

	2b8272ff4a ("cpu/hotplug: Prevent self deadlock on CPU hot-unplug")

was added to fix an issue where the hotplug control task (BP) was
throttled between CPUHP_AP_IDLE_DEAD and CPUHP_HRTIMERS_PREPARE waiting
in the hrtimer blindspot for the bandwidth callback queued in the dead
CPU.

2) Later on, the commit:

	38685e2a04 ("cpu/hotplug: Don't offline the last non-isolated CPU")

plugged on the target selection for the workqueue offloaded CPU down
process to prevent from destroying the last CPU domain.

3) Finally:

	5c0930ccaa ("hrtimers: Push pending hrtimers away from outgoing CPU earlier")

removed entirely the conditions for the race exposed and partially fixed
in 1). The offloading of the CPU down process to a workqueue on another
CPU then becomes unnecessary. But the last CPU belonging to scheduler
domains must still remain online.

Therefore revert the now obsolete commit
2b8272ff4a and move the housekeeping check
under the cpu_hotplug_lock write held. Since HK_TYPE_DOMAIN will include
both isolcpus and cpuset isolated partition, the hotplug lock will
synchronize against concurrent cpuset partition updates.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
2026-01-22 18:32:41 +01:00
Shubhang Kaushik 4b603f1551 sched: Update rq->avg_idle when a task is moved to an idle CPU
Currently, rq->idle_stamp is only used to calculate avg_idle during
wakeups. This means other paths that move a task to an idle CPU such as
fork/clone, execve, or migrations, do not end the CPU's idle status in
the scheduler's eyes, leading to an inaccurate avg_idle.

This patch introduces update_rq_avg_idle() to provide a more accurate
measurement of CPU idle duration. By invoking this helper in
put_prev_task_idle(), we ensure avg_idle is updated whenever a CPU
stops being idle, regardless of how the new task arrived.

Testing on an 80-core Ampere Altra (ARMv8) with 6.19-rc5 baseline:
 - Hackbench : +7.2% performance gain at 16 threads.
 - Schbench: Reduced p99.9 tail latencies at high concurrency.

Signed-off-by: Shubhang Kaushik <shubhang@os.amperecomputing.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Shubhang Kaushik <shubhang@os.amperecomputing.com>
Link: https://patch.msgid.link/20260121-v8-patch-series-v8-1-b7f1cbee5055@os.amperecomputing.com
2026-01-22 11:11:21 +01:00
Thomas Gleixner 5d6446f409 hrtimer: Fix trace oddity
It turns out that __run_hrtimer() will trace like:

          <idle>-0     [032] d.h2. 20705.474563: hrtimer_cancel:       hrtimer=0xff2db8f77f8226e8
          <idle>-0     [032] d.h1. 20705.474563: hrtimer_expire_entry: hrtimer=0xff2db8f77f8226e8 now=20699452001850 function=tick_nohz_handler/0x0

Which is a bit nonsensical, the timer doesn't get canceled on
expiration. The cause is the use of the incorrect debug helper.

Fixes: c6a2a17702 ("hrtimer: Add tracepoint for hrtimers")
Reported-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260121143208.219595606@infradead.org
2026-01-22 11:11:20 +01:00
Peter Zijlstra 21c0e92d06 rseq: Lower default slice extension
Change the minimum slice extension to 5 usec.

Since slice_test selftest reaches a staggering ~350 nsec extension:

Task: slice_test    Mean: 350.266 ns
  Latency (us)    | Count
  ------------------------------
  EXPIRED         | 238
  0 us            | 143189
  1 us            | 167
  2 us            | 26
  3 us            | 11
  4 us            | 28
  5 us            | 31
  6 us            | 22
  7 us            | 23
  8 us            | 32
  9 us            | 16
  10 us           | 35

Lower the minimal (and default) value to 5 usecs -- which is still massive.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260121143208.073200729@infradead.org
2026-01-22 11:11:20 +01:00
Peter Zijlstra e1d7f54900 rseq: Move slice_ext_nsec to debugfs
Move changing the slice ext duration to debugfs, a sliglty less permanent
interface.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260121143207.923520192@infradead.org
2026-01-22 11:11:20 +01:00
Peter Zijlstra d6200245c7 rseq: Allow registering RSEQ with slice extension
Since glibc cares about the number of syscalls required to initialize a new
thread, allow initializing rseq with slice extension on. This avoids having to
do another prctl().

Requested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260121143207.814193010@infradead.org
2026-01-22 11:11:19 +01:00
Thomas Gleixner 3c78aaec19 entry: Hook up rseq time slice extension
Wire the grant decision function up in exit_to_user_mode_loop()

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251215155709.258157362@linutronix.de
2026-01-22 11:11:19 +01:00
Thomas Gleixner 0ac3b5c3dc rseq: Implement time slice extension enforcement timer
If a time slice extension is granted and the reschedule delayed, the kernel
has to ensure that user space cannot abuse the extension and exceed the
maximum granted time.

It was suggested to implement this via the existing hrtick() timer in the
scheduler, but that turned out to be problematic for several reasons:

   1) It creates a dependency on CONFIG_SCHED_HRTICK, which can be disabled
      independently of CONFIG_HIGHRES_TIMERS

   2) HRTICK usage in the scheduler can be runtime disabled or is only used
      for certain aspects of scheduling.

   3) The function is calling into the scheduler code and that might have
      unexpected consequences when this is invoked due to a time slice
      enforcement expiry. Especially when the task managed to clear the
      grant via sched_yield(0).

It would be possible to address #2 and #3 by storing state in the
scheduler, but that is extra complexity and fragility for no value.

Implement a dedicated per CPU hrtimer instead, which is solely used for the
purpose of time slice enforcement.

The timer is armed when an extension was granted right before actually
returning to user mode in rseq_exit_to_user_mode_restart().

It is disarmed, when the task relinquishes the CPU. This is expensive as
the timer is probably the first expiring timer on the CPU, which means it
has to reprogram the hardware. But that's less expensive than going through
a full hrtimer interrupt cycle for nothing.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251215155709.068329497@linutronix.de
2026-01-22 11:11:18 +01:00
Thomas Gleixner dd0a046069 rseq: Implement syscall entry work for time slice extensions
The kernel sets SYSCALL_WORK_RSEQ_SLICE when it grants a time slice
extension. This allows to handle the rseq_slice_yield() syscall, which is
used by user space to relinquish the CPU after finishing the critical
section for which it requested an extension.

In case the kernel state is still GRANTED, the kernel resets both kernel
and user space state with a set of sanity checks. If the kernel state is
already cleared, then this raced against the timer or some other interrupt
and just clears the work bit.

Doing it in syscall entry work allows to catch misbehaving user space,
which issues an arbitrary syscall, i.e. not rseq_slice_yield(), from the
critical section. Contrary to the initial strict requirement to use
rseq_slice_yield() arbitrary syscalls are not considered a violation of the
ABI contract anymore to allow onion architecture applications, which cannot
control the code inside a critical section, to utilize this as well.

If the code detects inconsistent user space that result in a SIGSEGV for
the application.

If the grant was still active and the task was not preempted yet, the work
code reschedules immediately before continuing through the syscall.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251215155709.005777059@linutronix.de
2026-01-22 11:11:18 +01:00
Thomas Gleixner 99d2592023 rseq: Implement sys_rseq_slice_yield()
Provide a new syscall which has the only purpose to yield the CPU after the
kernel granted a time slice extension.

sched_yield() is not suitable for that because it unconditionally
schedules, but the end of the time slice extension is not required to
schedule when the task was already preempted. This also allows to have a
strict check for termination to catch user space invoking random syscalls
including sched_yield() from a time slice extension region.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Link: https://patch.msgid.link/20251215155708.929634896@linutronix.de
2026-01-22 11:11:17 +01:00
Thomas Gleixner 28621ec2d4 rseq: Add prctl() to enable time slice extensions
Implement a prctl() so that tasks can enable the time slice extension
mechanism. This fails, when time slice extensions are disabled at compile
time or on the kernel command line and when no rseq pointer is registered
in the kernel.

That allows to implement a single trivial check in the exit to user mode
hotpath, to decide whether the whole mechanism needs to be invoked.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251215155708.858717691@linutronix.de
2026-01-22 11:11:17 +01:00
Thomas Gleixner b5b8282441 rseq: Add statistics for time slice extensions
Extend the quick statistics with time slice specific fields.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251215155708.795202254@linutronix.de
2026-01-22 11:11:17 +01:00
Thomas Gleixner f8380f9768 rseq: Provide static branch for time slice extensions
Guard the time slice extension functionality with a static key, which can
be disabled on the kernel command line.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251215155708.733429292@linutronix.de
2026-01-22 11:11:16 +01:00
Thomas Gleixner d7a5da7a0f rseq: Add fields and constants for time slice extension
Aside of a Kconfig knob add the following items:

   - Two flag bits for the rseq user space ABI, which allow user space to
     query the availability and enablement without a syscall.

   - A new member to the user space ABI struct rseq, which is going to be
     used to communicate request and grant between kernel and user space.

   - A rseq state struct to hold the kernel state of this

   - Documentation of the new mechanism

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251215155708.669472597@linutronix.de
2026-01-22 11:11:16 +01:00
Fushuai Wang 4fe82cf302 sched/debug: Convert copy_from_user() + kstrtouint() to kstrtouint_from_user()
Using kstrtouint_from_user() instead of copy_from_user() + kstrtouint()
makes the code simpler and less error-prone.

Suggested-by: Yury Norov <ynorov@nvidia.com>
Signed-off-by: Fushuai Wang <wangfushuai@baidu.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Yury Norov <ynorov@nvidia.com>
Link: https://patch.msgid.link/20260117145615.53455-2-fushuai.wang@linux.dev
2026-01-22 11:11:16 +01:00
Yuzuki Ishiyama 1dc6696467 bpf: add bpf_strncasecmp kfunc
bpf_strncasecmp() function performs same like bpf_strcasecmp() except
limiting the comparison to a specific length.

Signed-off-by: Yuzuki Ishiyama <ishiyama@hpc.is.uec.ac.jp>
Acked-by: Viktor Malik <vmalik@redhat.com>
Acked-by: Mykyta Yatsenko <mykyta.yatsenko5@gmail.com>
Link: https://lore.kernel.org/r/20260121033328.1850010-2-ishiyama@hpc.is.uec.ac.jp
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-21 09:42:53 -08:00
Menglong Dong 85c7f91471 bpf: support bpf_get_func_arg() for BPF_TRACE_RAW_TP
For now, bpf_get_func_arg() and bpf_get_func_arg_cnt() is not supported by
the BPF_TRACE_RAW_TP, which is not convenient to get the argument of the
tracepoint, especially for the case that the position of the arguments in
a tracepoint can change.

The target tracepoint BTF type id is specified during loading time,
therefore we can get the function argument count from the function
prototype instead of the stack.

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20260121044348.113201-2-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-21 09:31:35 -08:00
Vincent Guittot 98c88dc8a1 sched/fair: Fix pelt clock sync when entering idle
Samuel and Alex reported regressions of the util_avg of RT rq with
commit 17e3e88ed0 ("sched/fair: Fix pelt lost idle time detection").
It happens that fair is updating and syncing the pelt clock with task one
when pick_next_task_fair() fails to pick a task but before the prev
scheduling class got a chance to update its pelt signals.

Move update_idle_rq_clock_pelt() in set_next_task_idle() which is called
after prev class has been called.

Fixes: 17e3e88ed0 ("sched/fair: Fix pelt lost idle time detection")
Closes: https://lore.kernel.org/all/CAG2KctpO6VKS6GN4QWDji0t92_gNBJ7HjjXrE+6H+RwRXt=iLg@mail.gmail.com/
Closes: https://lore.kernel.org/all/8cf19bf0e0054dcfed70e9935029201694f1bb5a.camel@mediatek.com/
Reported-by: Samuel Wu <wusamuel@google.com>
Reported-by: Alex Hoh <Alex.Hoh@mediatek.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Samuel Wu <wusamuel@google.com>
Tested-by: Alex Hoh <Alex.Hoh@mediatek.com>
Link: https://patch.msgid.link/20260121163317.505635-1-vincent.guittot@linaro.org
2026-01-21 17:46:08 +01:00
Will Rosenberg d06bf78e55 perf: Fix refcount warning on event->mmap_count increment
When calling refcount_inc(&event->mmap_count) inside perf_mmap_rb(), the
following warning is triggered:

        refcount_t: addition on 0; use-after-free.
        WARNING: lib/refcount.c:25

PoC:

    struct perf_event_attr attr = {0};
    int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
    mmap(NULL, 0x3000, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    int victim = syscall(__NR_perf_event_open, &attr, 0, -1, fd,
                         PERF_FLAG_FD_OUTPUT);
    mmap(NULL, 0x3000, PROT_READ | PROT_WRITE, MAP_SHARED, victim, 0);

This occurs when creating a group member event with the flag
PERF_FLAG_FD_OUTPUT. The group leader should be mmap-ed and then mmap-ing
the event triggers the warning.

Since the event has copied the output_event in perf_event_set_output(),
event->rb is set. As a result, perf_mmap_rb() calls
refcount_inc(&event->mmap_count) when event->mmap_count = 0.

Disallow the case when event->mmap_count = 0. This also prevents two
events from updating the same user_page.

Fixes: 448f97fba9 ("perf: Convert mmap() refcounts to refcount_t")
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Will Rosenberg <whrosenb@asu.edu>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260119184956.801238-1-whrosenb@asu.edu
2026-01-21 16:28:58 +01:00
Thomas Gleixner c06343be0b clocksource: Reduce watchdog readout delay limit to prevent false positives
The "valid" readout delay between the two reads of the watchdog is larger
than the valid delta between the resulting watchdog and clocksource
intervals, which results in false positive watchdog results.

Assume TSC is the clocksource and HPET is the watchdog and both have a
uncertainty margin of 250us (default). The watchdog readout does:

  1) wdnow = read(HPET);
  2) csnow = read(TSC);
  3) wdend = read(HPET);

The valid window for the delta between #1 and #3 is calculated by the
uncertainty margins of the watchdog and the clocksource:

   m = 2 * watchdog.uncertainty_margin + cs.uncertainty margin;

which results in 750us for the TSC/HPET case.

The actual interval comparison uses a smaller margin:

   m = watchdog.uncertainty_margin + cs.uncertainty margin;

which results in 500us for the TSC/HPET case.

That means the following scenario will trigger the watchdog:

 Watchdog cycle N:

 1)       wdnow[N] = read(HPET);
 2)       csnow[N] = read(TSC);
 3)       wdend[N] = read(HPET);

Assume the delay between #1 and #2 is 100us and the delay between #1 and

 Watchdog cycle N + 1:

 4)       wdnow[N + 1] = read(HPET);
 5)       csnow[N + 1] = read(TSC);
 6)       wdend[N + 1] = read(HPET);

If the delay between #4 and #6 is within the 750us margin then any delay
between #4 and #5 which is larger than 600us will fail the interval check
and mark the TSC unstable because the intervals are calculated against the
previous value:

    wd_int = wdnow[N + 1] - wdnow[N];
    cs_int = csnow[N + 1] - csnow[N];

Putting the above delays in place this results in:

    cs_int = (wdnow[N + 1] + 610us) - (wdnow[N] + 100us);
 -> cs_int = wd_int + 510us;

which is obviously larger than the allowed 500us margin and results in
marking TSC unstable.

Fix this by using the same margin as the interval comparison. If the delay
between two watchdog reads is larger than that, then the readout was either
disturbed by interconnect congestion, NMIs or SMIs.

Fixes: 4ac1dd3245 ("clocksource: Set cs_watchdog_read() checks based on .uncertainty_margin")
Reported-by: Daniel J Blueman <daniel@quora.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/lkml/20250602223251.496591-1-daniel@quora.org/
Link: https://patch.msgid.link/87bjjxc9dq.ffs@tglx
2026-01-21 11:33:11 +01:00
Menglong Dong eaedea154e bpf, x86: inline bpf_get_current_task() for x86_64
Inline bpf_get_current_task() and bpf_get_current_task_btf() for x86_64
to obtain better performance.

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260120070555.233486-2-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-20 20:39:01 -08:00
Minu Jin f34e19c34e fork-comment-fix: remove ambiguous question mark in CLONE_CHILD_CLEARTID comment
The current comment "Clear TID on mm_release()?" ends with a question
mark, implying uncertainty about whether the TID is actually cleared in
mm_release().

However, the code flow is deterministic.  When a task exits, mm_release()
explicitly checks 'tsk->clear_child_tid' and clears.

Since this behavior is unambiguous, remove the confusing question mark and
rephrase the comment to clearly state that TID is cleared in mm_release().

Link: https://lkml.kernel.org/r/20251125000407.24470-1-s9430939@naver.com
Signed-off-by: Minu Jin <s9430939@naver.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mel Gorman <mgorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20 19:44:23 -08:00
Petr Mladek 3b07086444 kallsyms: prevent module removal when printing module name and buildid
kallsyms_lookup_buildid() copies the symbol name into the given buffer so
that it can be safely read anytime later.  But it just copies pointers to
mod->name and mod->build_id which might get reused after the related
struct module gets removed.

The lifetime of struct module is synchronized using RCU.  Take the rcu
read lock for the entire __sprint_symbol().

Link: https://lkml.kernel.org/r/20251128135920.217303-8-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Aaron Tomlin <atomlin@atomlin.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkman <daniel@iogearbox.net>
Cc: Daniel Gomez <da.gomez@samsung.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Luis Chamberalin <mcgrof@kernel.org>
Cc: Marc Rutland <mark.rutland@arm.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Petr Pavlu <petr.pavlu@suse.com>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20 19:44:23 -08:00
Petr Mladek e8a1e7eaa1 kallsyms/ftrace: set module buildid in ftrace_mod_address_lookup()
__sprint_symbol() might access an invalid pointer when
kallsyms_lookup_buildid() returns a symbol found by
ftrace_mod_address_lookup().

The ftrace lookup function must set both @modname and @modbuildid the same
way as module_address_lookup().

Link: https://lkml.kernel.org/r/20251128135920.217303-7-pmladek@suse.com
Fixes: 9294523e37 ("module: add printk formats to add module build ID to stacktraces")
Signed-off-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Aaron Tomlin <atomlin@atomlin.com>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkman <daniel@iogearbox.net>
Cc: Daniel Gomez <da.gomez@samsung.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Luis Chamberalin <mcgrof@kernel.org>
Cc: Marc Rutland <mark.rutland@arm.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Petr Pavlu <petr.pavlu@suse.com>
Cc: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20 19:44:22 -08:00
Petr Mladek cd6735896d kallsyms/bpf: rename __bpf_address_lookup() to bpf_address_lookup()
bpf_address_lookup() has been used only in kallsyms_lookup_buildid().  It
was supposed to set @modname and @modbuildid when the symbol was in a
module.

But it always just cleared @modname because BPF symbols were never in a
module.  And it did not clear @modbuildid because the pointer was not
passed.

The wrapper is no longer needed.  Both @modname and @modbuildid are now
always initialized to NULL in kallsyms_lookup_buildid().

Remove the wrapper and rename __bpf_address_lookup() to
bpf_address_lookup() because this variant is used everywhere.

[akpm@linux-foundation.org: fix loongarch]
Link: https://lkml.kernel.org/r/20251128135920.217303-6-pmladek@suse.com
Fixes: 9294523e37 ("module: add printk formats to add module build ID to stacktraces")
Signed-off-by: Petr Mladek <pmladek@suse.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Aaron Tomlin <atomlin@atomlin.com>
Cc: Daniel Borkman <daniel@iogearbox.net>
Cc: Daniel Gomez <da.gomez@samsung.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Luis Chamberalin <mcgrof@kernel.org>
Cc: Marc Rutland <mark.rutland@arm.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Petr Pavlu <petr.pavlu@suse.com>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20 19:44:22 -08:00
Petr Mladek 8e81dac4cd kallsyms: cleanup code for appending the module buildid
Put the code for appending the optional "buildid" into a helper function,
It makes __sprint_symbol() better readable.

Also print a warning when the "modname" is set and the "buildid" isn't. 
It might catch a situation when some lookup function in
kallsyms_lookup_buildid() does not handle the "buildid".

Use pr_*_once() to avoid an infinite recursion when the function is called
from printk().  The recursion is rather theoretical but better be on the
safe side.

Link: https://lkml.kernel.org/r/20251128135920.217303-5-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Cc: Aaron Tomlin <atomlin@atomlin.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkman <daniel@iogearbox.net>
Cc: Daniel Gomez <da.gomez@samsung.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Luis Chamberalin <mcgrof@kernel.org>
Cc: Marc Rutland <mark.rutland@arm.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Petr Pavlu <petr.pavlu@suse.com>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20 19:44:22 -08:00
Petr Mladek acfdbb4ab2 module: add helper function for reading module_buildid()
Add a helper function for reading the optional "build_id" member of struct
module.  It is going to be used also in ftrace_mod_address_lookup().

Use "#ifdef" instead of "#if IS_ENABLED()" to match the declaration of the
optional field in struct module.

Link: https://lkml.kernel.org/r/20251128135920.217303-4-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Daniel Gomez <da.gomez@samsung.com>
Reviewed-by: Petr Pavlu <petr.pavlu@suse.com>
Cc: Aaron Tomlin <atomlin@atomlin.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkman <daniel@iogearbox.net>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Luis Chamberalin <mcgrof@kernel.org>
Cc: Marc Rutland <mark.rutland@arm.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20 19:44:22 -08:00
Petr Mladek fda024fb64 kallsyms: clean up modname and modbuildid initialization in kallsyms_lookup_buildid()
The @modname and @modbuildid optional return parameters are set only when
the symbol is in a module.

Always initialize them so that they do not need to be cleared when the
module is not in a module.  It simplifies the logic and makes the code
even slightly more safe.

Note that bpf_address_lookup() function will get updated in a separate
patch.

Link: https://lkml.kernel.org/r/20251128135920.217303-3-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Cc: Aaron Tomlin <atomlin@atomlin.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkman <daniel@iogearbox.net>
Cc: Daniel Gomez <da.gomez@samsung.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Luis Chamberalin <mcgrof@kernel.org>
Cc: Marc Rutland <mark.rutland@arm.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Petr Pavlu <petr.pavlu@suse.com>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20 19:44:21 -08:00
Petr Mladek 426295ef18 kallsyms: clean up @namebuf initialization in kallsyms_lookup_buildid()
Patch series "kallsyms: Prevent invalid access when showing module
buildid", v3.

We have seen nested crashes in __sprint_symbol(), see below.  They seem to
be caused by an invalid pointer to "buildid".  This patchset cleans up
kallsyms code related to module buildid and fixes this invalid access when
printing backtraces.

I made an audit of __sprint_symbol() and found several situations
when the buildid might be wrong:

  + bpf_address_lookup() does not set @modbuildid

  + ftrace_mod_address_lookup() does not set @modbuildid

  + __sprint_symbol() does not take rcu_read_lock and
    the related struct module might get removed before
    mod->build_id is printed.

This patchset solves these problems:

  + 1st, 2nd patches are preparatory
  + 3rd, 4th, 6th patches fix the above problems
  + 5th patch cleans up a suspicious initialization code.

This is the backtrace, we have seen. But it is not really important.
The problems fixed by the patchset are obvious:

  crash64> bt [62/2029]
  PID: 136151 TASK: ffff9f6c981d4000 CPU: 367 COMMAND: "btrfs"
  #0 [ffffbdb687635c28] machine_kexec at ffffffffb4c845b3
  #1 [ffffbdb687635c80] __crash_kexec at ffffffffb4d86a6a
  #2 [ffffbdb687635d08] hex_string at ffffffffb51b3b61
  #3 [ffffbdb687635d40] crash_kexec at ffffffffb4d87964
  #4 [ffffbdb687635d50] oops_end at ffffffffb4c41fc8
  #5 [ffffbdb687635d70] do_trap at ffffffffb4c3e49a
  #6 [ffffbdb687635db8] do_error_trap at ffffffffb4c3e6a4
  #7 [ffffbdb687635df8] exc_stack_segment at ffffffffb5666b33
  #8 [ffffbdb687635e20] asm_exc_stack_segment at ffffffffb5800cf9
  ...


This patch (of 7)

The function kallsyms_lookup_buildid() initializes the given @namebuf by
clearing the first and the last byte.  It is not clear why.

The 1st byte makes sense because some callers ignore the return code and
expect that the buffer contains a valid string, for example:

  - function_stat_show()
    - kallsyms_lookup()
      - kallsyms_lookup_buildid()

The initialization of the last byte does not make much sense because it
can later be overwritten.  Fortunately, it seems that all called functions
behave correctly:

  -  kallsyms_expand_symbol() explicitly adds the trailing '\0'
     at the end of the function.

  - All *__address_lookup() functions either use the safe strscpy()
    or they do not touch the buffer at all.

Document the reason for clearing the first byte.  And remove the useless
initialization of the last byte.

Link: https://lkml.kernel.org/r/20251128135920.217303-2-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Aaron Tomlin <atomlin@atomlin.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkman <daniel@iogearbox.net>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Luis Chamberalin <mcgrof@kernel.org>
Cc: Marc Rutland <mark.rutland@arm.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Petr Pavlu <petr.pavlu@suse.com>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Daniel Gomez <da.gomez@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20 19:44:21 -08:00
Li RongQing e700f5d156 watchdog: softlockup: panic when lockup duration exceeds N thresholds
The softlockup_panic sysctl is currently a binary option: panic
immediately or never panic on soft lockups.

Panicking on any soft lockup, regardless of duration, can be overly
aggressive for brief stalls that may be caused by legitimate operations. 
Conversely, never panicking may allow severe system hangs to persist
undetected.

Extend softlockup_panic to accept an integer threshold, allowing the
kernel to panic only when the normalized lockup duration exceeds N
watchdog threshold periods.  This provides finer-grained control to
distinguish between transient delays and persistent system failures.

The accepted values are:
- 0: Don't panic (unchanged)
- 1: Panic when duration >= 1 * threshold (20s default, original behavior)
- N > 1: Panic when duration >= N * threshold (e.g., 2 = 40s, 3 = 60s.)

The original behavior is preserved for values 0 and 1, maintaining full
backward compatibility while allowing systems to tolerate brief lockups
while still catching severe, persistent hangs.

[lirongqing@baidu.com: v2]
  Link: https://lkml.kernel.org/r/20251218074300.4080-1-lirongqing@baidu.com
Link: https://lkml.kernel.org/r/20251216074521.2796-1-lirongqing@baidu.com
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Cc: Eduard Zingerman <eddyz87@gmail.com>
Cc: Hao Luo <haoluo@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Song Liu <song@kernel.org>
Cc: Stanislav Fomichev <sdf@fomichev.me>
Cc: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20 19:44:20 -08:00
Pnina Feder b5bfcc1ffe kernel/crash: handle multi-page vmcoreinfo in crash kernel copy
kimage_crash_copy_vmcoreinfo() currently assumes vmcoreinfo fits in a
single page.  This breaks if VMCOREINFO_BYTES exceeds PAGE_SIZE.

Allocate the required order of control pages and vmap all pages needed to
safely copy vmcoreinfo into the crash kernel image.

Link: https://lkml.kernel.org/r/20251216132801.807260-3-pnina.feder@mobileye.com
Signed-off-by: Pnina Feder <pnina.feder@mobileye.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20 19:44:20 -08:00
Pnina Feder 76103d1b26 kernel: vmcoreinfo: allocate vmcoreinfo_data based on VMCOREINFO_BYTES
Patch series "vmcoreinfo: support VMCOREINFO_BYTES larger than PAGE_SIZE".

VMCOREINFO_BYTES is defined as a configurable size, but multiple
code paths implicitly assume it always fits into a single page.

This series removes that assumption by allocating and mapping
vmcoreinfo based on its actual size.

Patch 1 updates vmcoreinfo allocation to use get_order(VMCOREINFO_BYTES).
Patch 2 updates crash kernel handling to correctly allocate and map
multiple pages when copying vmcoreinfo.

This makes vmcoreinfo size consistent across the kernel and avoids
future breakage if VMCOREINFO_BYTES grows.

(No functional change when VMCOREINFO_BYTES == PAGE_SIZE.)


This patch (of 2):

VMCOREINFO_BYTES defines the size of vmcoreinfo data, but the current
implementation assumes a single page allocation.

Allocate vmcoreinfo_data using get_order(VMCOREINFO_BYTES) so that
vmcoreinfo can safely grow beyond PAGE_SIZE.

This avoids hidden assumptions and keeps vmcoreinfo size consistent across
the kernel.

Link: https://lkml.kernel.org/r/20251216132801.807260-1-pnina.feder@mobileye.com
Link: https://lkml.kernel.org/r/20251216132801.807260-2-pnina.feder@mobileye.com
Signed-off-by: Pnina Feder <pnina.feder@mobileye.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20 19:44:20 -08:00
Alejandro Colomar a9e5620c9a kernel: fix off-by-one benign bugs
We were wasting a byte due to an off-by-one bug.  s[c]nprintf() doesn't
write more than $2 bytes including the null byte, so trying to pass
'size-1' there is wasting one byte.

This is essentially the same as the previous commit, in a different
file.

Link: https://lkml.kernel.org/r/b4a945a4d40b7104364244f616eb9fb9f1fa691f.1765449750.git.alx@kernel.org
Signed-off-by: Alejandro Colomar <alx@kernel.org>
Cc: Marco Elver <elver@google.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Christopher Bazley <chris.bazley.wg14@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Marco Elver <elver@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Maciej W. Rozycki <macro@orcam.me.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20 19:44:19 -08:00
Randy Dunlap 24c776355f kernel.h: drop hex.h and update all hex.h users
Remove <linux/hex.h> from <linux/kernel.h> and update all users/callers of
hex.h interfaces to directly #include <linux/hex.h> as part of the process
of putting kernel.h on a diet.

Removing hex.h from kernel.h means that 36K C source files don't have to
pay the price of parsing hex.h for the roughly 120 C source files that
need it.

This change has been build-tested with allmodconfig on most ARCHes.  Also,
all users/callers of <linux/hex.h> in the entire source tree have been
updated if needed (if not already #included).

Link: https://lkml.kernel.org/r/20251215005206.2362276-1-rdunlap@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Yury Norov (NVIDIA) <yury.norov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20 19:44:19 -08:00
Christophe JAILLET b11052be3e crash_dump: constify struct configfs_item_operations and configfs_group_operations
'struct configfs_item_operations' and 'configfs_group_operations' are not
modified in this driver.

Constifying these structures moves some data to a read-only section, so
increases overall security, especially when the structure holds some
function pointers.

On a x86_64, with allmodconfig, as an example:
Before:
======
   text	   data	    bss	    dec	    hex	filename
  16339	  11001	    384	  27724	   6c4c	kernel/crash_dump_dm_crypt.o

After:
=====
   text	   data	    bss	    dec	    hex	filename
  16499	  10841	    384	  27724	   6c4c	kernel/crash_dump_dm_crypt.o

Link: https://lkml.kernel.org/r/d046ee5666d2f6b1a48ca1a222dfbd2f7c44462f.1765735035.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Coiby Xu <coxu@redhat.com>
Tested-by: Coiby Xu <coxu@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20 19:44:15 -08:00
Mykyta Yatsenko 83c9030cdc bpf: Simplify bpf_timer_cancel()
Remove lock from the bpf_timer_cancel() helper. The lock does not
protect from concurrent modification of the bpf_async_cb data fields as
those are modified in the callback without locking.

Use guard(rcu)() instead of pair of explicit lock()/unlock().

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Link: https://lore.kernel.org/r/20260120-timer_nolock-v6-4-670ffdd787b4@meta.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-20 18:12:19 -08:00
Mykyta Yatsenko 8bb1e32b3f bpf: Introduce lock-free bpf_async_update_prog_callback()
Introduce bpf_async_update_prog_callback(): lock-free update of cb->prog
and cb->callback_fn. This function allows updating prog and callback_fn
fields of the struct bpf_async_cb without holding lock.
For now use it under the lock from __bpf_async_set_callback(), in the
next patches that lock will be removed.

Lock-free algorithm:
 * Acquire a guard reference on prog to prevent it from being freed
   during the retry loop.
 * Retry loop:
    1. Each iteration acquires a new prog reference and stores it
       in cb->prog via xchg. The previous prog is released.
    2. The loop condition checks if both cb->prog and cb->callback_fn
       match what we just wrote. If either differs, a concurrent writer
       overwrote our value, and we must retry.
    3. When we retry, our previously-stored prog was already released by
       the concurrent writer or will be released by us after
       overwriting.
 * Release guard reference.

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Link: https://lore.kernel.org/r/20260120-timer_nolock-v6-3-670ffdd787b4@meta.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-20 18:12:19 -08:00
Mykyta Yatsenko 57d31e72db bpf: Remove unnecessary arguments from bpf_async_set_callback()
Remove unused arguments from __bpf_async_set_callback().

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Link: https://lore.kernel.org/r/20260120-timer_nolock-v6-2-670ffdd787b4@meta.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-20 18:12:19 -08:00
Mykyta Yatsenko c1f2c449de bpf: Factor out timer deletion helper
Move the timer deletion logic into a dedicated bpf_timer_delete()
helper so it can be reused by later patches.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Link: https://lore.kernel.org/r/20260120-timer_nolock-v6-1-670ffdd787b4@meta.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-20 18:12:19 -08:00
Zesen Liu ed4724212f bpf: Require ARG_PTR_TO_MEM with memory flag
Add check to ensure that ARG_PTR_TO_MEM is used with either MEM_WRITE or
MEM_RDONLY.

Using ARG_PTR_TO_MEM alone without flags does not make sense because:

- If the helper does not change the argument, missing MEM_RDONLY causes the
verifier to incorrectly reject a read-only buffer.
- If the helper does change the argument, missing MEM_WRITE causes the
verifier to incorrectly assume the memory is unchanged, leading to errors
in code optimization.

Co-developed-by: Shuran Liu <electronlsr@gmail.com>
Signed-off-by: Shuran Liu <electronlsr@gmail.com>
Co-developed-by: Peili Gao <gplhust955@gmail.com>
Signed-off-by: Peili Gao <gplhust955@gmail.com>
Co-developed-by: Haoran Ni <haoran.ni.cs@gmail.com>
Signed-off-by: Haoran Ni <haoran.ni.cs@gmail.com>
Signed-off-by: Zesen Liu <ftyghome@gmail.com>
Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260120-helper_proto-v3-2-27b0180b4e77@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-20 16:59:25 -08:00
Zesen Liu 802eef5afb bpf: Fix memory access flags in helper prototypes
After commit 37cce22dbd ("bpf: verifier: Refactor helper access type tracking"),
the verifier started relying on the access type flags in helper
function prototypes to perform memory access optimizations.

Currently, several helper functions utilizing ARG_PTR_TO_MEM lack the
corresponding MEM_RDONLY or MEM_WRITE flags. This omission causes the
verifier to incorrectly assume that the buffer contents are unchanged
across the helper call. Consequently, the verifier may optimize away
subsequent reads based on this wrong assumption, leading to correctness
issues.

For bpf_get_stack_proto_raw_tp, the original MEM_RDONLY was incorrect
since the helper writes to the buffer. Change it to ARG_PTR_TO_UNINIT_MEM
which correctly indicates write access to potentially uninitialized memory.

Similar issues were recently addressed for specific helpers in commit
ac44dcc788 ("bpf: Fix verifier assumptions of bpf_d_path's output buffer")
and commit 2eb7648558 ("bpf: Specify access type of bpf_sysctl_get_name args").

Fix these prototypes by adding the correct memory access flags.

Fixes: 37cce22dbd ("bpf: verifier: Refactor helper access type tracking")
Co-developed-by: Shuran Liu <electronlsr@gmail.com>
Signed-off-by: Shuran Liu <electronlsr@gmail.com>
Co-developed-by: Peili Gao <gplhust955@gmail.com>
Signed-off-by: Peili Gao <gplhust955@gmail.com>
Co-developed-by: Haoran Ni <haoran.ni.cs@gmail.com>
Signed-off-by: Haoran Ni <haoran.ni.cs@gmail.com>
Signed-off-by: Zesen Liu <ftyghome@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260120-helper_proto-v3-1-27b0180b4e77@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-20 16:59:25 -08:00
Yazhou Tang 44fdd581d2 bpf: Add range tracking for BPF_DIV and BPF_MOD
This patch implements range tracking (interval analysis) for BPF_DIV and
BPF_MOD operations when the divisor is a constant, covering both signed
and unsigned variants.

While LLVM typically optimizes integer division and modulo by constants
into multiplication and shift sequences, this optimization is less
effective for the BPF target when dealing with 64-bit arithmetic.

Currently, the verifier does not track bounds for scalar division or
modulo, treating the result as "unbounded". This leads to false positive
rejections for safe code patterns.

For example, the following code (compiled with -O2):

```c
int test(struct pt_regs *ctx) {
    char buffer[6] = {1};
    __u64 x = bpf_ktime_get_ns();
    __u64 res = x % sizeof(buffer);
    char value = buffer[res];
    bpf_printk("res = %llu, val = %d", res, value);
    return 0;
}
```

Generates a raw `BPF_MOD64` instruction:

```asm
;     __u64 res = x % sizeof(buffer);
       1:	97 00 00 00 06 00 00 00	r0 %= 0x6
;     char value = buffer[res];
       2:	18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00	r1 = 0x0 ll
       4:	0f 01 00 00 00 00 00 00	r1 += r0
       5:	91 14 00 00 00 00 00 00	r4 = *(s8 *)(r1 + 0x0)
```

Without this patch, the verifier fails with "math between map_value
pointer and register with unbounded min value is not allowed" because
it cannot deduce that `r0` is within [0, 5].

According to the BPF instruction set[1], the instruction's offset field
(`insn->off`) is used to distinguish between signed (`off == 1`) and
unsigned division (`off == 0`). Moreover, we also follow the BPF division
and modulo runtime behavior (semantics) to handle special cases, such as
division by zero and signed division overflow.

- UDIV: dst = (src != 0) ? (dst / src) : 0
- SDIV: dst = (src == 0) ? 0 : ((src == -1 && dst == LLONG_MIN) ? LLONG_MIN : (dst / src))
- UMOD: dst = (src != 0) ? (dst % src) : dst
- SMOD: dst = (src == 0) ? dst : ((src == -1 && dst == LLONG_MIN) ? 0: (dst s% src))

Here is the overview of the changes made in this patch (See the code comments
for more details and examples):

1. For BPF_DIV: Firstly check whether the divisor is zero. If so, set the
   destination register to zero (matching runtime behavior).

   For non-zero constant divisors: goto `scalar(32)?_min_max_(u|s)div` functions.
   - General cases: compute the new range by dividing max_dividend and
     min_dividend by the constant divisor.
   - Overflow case (SIGNED_MIN / -1) in signed division: mark the result
     as unbounded if the dividend is not a single number.

2. For BPF_MOD: Firstly check whether the divisor is zero. If so, leave the
   destination register unchanged (matching runtime behavior).

   For non-zero constant divisors: goto `scalar(32)?_min_max_(u|s)mod` functions.
   - General case: For signed modulo, the result's sign matches the
     dividend's sign. And the result's absolute value is strictly bounded
     by `min(abs(dividend), abs(divisor) - 1)`.
     - Special care is taken when the divisor is SIGNED_MIN. By casting
       to unsigned before negation and subtracting 1, we avoid signed
       overflow and correctly calculate the maximum possible magnitude
       (`res_max_abs` in the code).
   - "Small dividend" case: If the dividend is already within the possible
     result range (e.g., [-2, 5] % 10), the operation is an identity
     function, and the destination register remains unchanged.

3. In `scalar(32)?_min_max_(u|s)(div|mod)` functions: After updating current
   range, reset other ranges and tnum to unbounded/unknown.

   e.g., in `scalar_min_max_sdiv`, signed 64-bit range is updated. Then reset
   unsigned 64-bit range and 32-bit range to unbounded, and tnum to unknown.

   Exception: in BPF_MOD's "small dividend" case, since the result remains
   unchanged, we do not reset other ranges/tnum.

4. Also updated existing selftests based on the expected BPF_DIV and
   BPF_MOD behavior.

[1] https://www.kernel.org/doc/Documentation/bpf/standardization/instruction-set.rst

Co-developed-by: Shenghao Yuan <shenghaoyuan0928@163.com>
Signed-off-by: Shenghao Yuan <shenghaoyuan0928@163.com>
Co-developed-by: Tianci Cao <ziye@zju.edu.cn>
Signed-off-by: Tianci Cao <ziye@zju.edu.cn>
Signed-off-by: Yazhou Tang <tangyazhou518@outlook.com>
Tested-by: syzbot@syzkaller.appspotmail.com
Link: https://lore.kernel.org/r/20260119085458.182221-2-tangyazhou@zju.edu.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-20 16:41:53 -08:00
Ihor Solodrai aed57a3638 bpf: Remove __prog kfunc arg annotation
Now that all the __prog suffix users in the kernel tree migrated to
KF_IMPLICIT_ARGS, remove it from the verifier.

See prior discussion for context [1].

[1] https://lore.kernel.org/bpf/CAEf4BzbgPfRm9BX=TsZm-TsHFAHcwhPY4vTt=9OT-uhWqf8tqw@mail.gmail.com/

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Link: https://lore.kernel.org/r/20260120222638.3976562-13-ihor.solodrai@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-20 16:22:38 -08:00
Ihor Solodrai d806f31012 bpf: Migrate bpf_stream_vprintk() to KF_IMPLICIT_ARGS
Implement bpf_stream_vprintk with an implicit bpf_prog_aux argument,
and remote bpf_stream_vprintk_impl from the kernel.

Update the selftests to use the new API with implicit argument.

bpf_stream_vprintk macro is changed to use the new bpf_stream_vprintk
kfunc, and the extern definition of bpf_stream_vprintk_impl is
replaced accordingly.

Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Link: https://lore.kernel.org/r/20260120222638.3976562-11-ihor.solodrai@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-20 16:22:38 -08:00
Ihor Solodrai 6e663ffdf7 bpf: Migrate bpf_task_work_schedule_* kfuncs to KF_IMPLICIT_ARGS
Implement bpf_task_work_schedule_* with an implicit bpf_prog_aux
argument, and remove corresponding _impl funcs from the kernel.

Update special kfunc checks in the verifier accordingly.

Update the selftests to use the new API with implicit argument.

Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Link: https://lore.kernel.org/r/20260120222638.3976562-10-ihor.solodrai@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-20 16:22:20 -08:00
Ihor Solodrai b97931a25a bpf: Migrate bpf_wq_set_callback_impl() to KF_IMPLICIT_ARGS
Implement bpf_wq_set_callback() with an implicit bpf_prog_aux
argument, and remove bpf_wq_set_callback_impl().

Update special kfunc checks in the verifier accordingly.

Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Link: https://lore.kernel.org/r/20260120222638.3976562-8-ihor.solodrai@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-20 16:15:57 -08:00
Ihor Solodrai 64e1360524 bpf: Verifier support for KF_IMPLICIT_ARGS
A kernel function bpf_foo marked with KF_IMPLICIT_ARGS flag is
expected to have two associated types in BTF:
  * `bpf_foo` with a function prototype that omits implicit arguments
  * `bpf_foo_impl` with a function prototype that matches the kernel
     declaration of `bpf_foo`, but doesn't have a ksym associated with
     its name

In order to support kfuncs with implicit arguments, the verifier has
to know how to resolve a call of `bpf_foo` to the correct BTF function
prototype and address.

To implement this, in add_kfunc_call() kfunc flags are checked for
KF_IMPLICIT_ARGS. For such kfuncs a BTF func prototype is adjusted to
the one found for `bpf_foo_impl` (func_name + "_impl" suffix, by
convention) function in BTF.

This effectively changes the signature of the `bpf_foo` kfunc in the
context of verification: from one without implicit args to the one
with full argument list.

The values of implicit arguments by design are provided by the
verifier, and so they can only be of particular types. In this patch
the only allowed implicit arg type is a pointer to struct
bpf_prog_aux.

In order for the verifier to correctly set an implicit bpf_prog_aux
arg value at runtime, is_kfunc_arg_prog() is extended to check for the
arg type. At a point when prog arg is determined in check_kfunc_args()
the kfunc with implicit args already has a prototype with full
argument list, so the existing value patch mechanism just works.

If a new kfunc with KF_IMPLICIT_ARG is declared for an existing kfunc
that uses a __prog argument (a legacy case), the prototype
substitution works in exactly the same way, assuming the kfunc follows
the _impl naming convention. The difference is only in how _impl
prototype is added to the BTF, which is not the verifier's
concern. See a subsequent resolve_btfids patch for details.

__prog suffix is still supported at this point, but will be removed in
a subsequent patch, after current users are moved to KF_IMPLICIT_ARGS.

Introduction of KF_IMPLICIT_ARGS revealed an issue with zero-extension
tracking, because an explicit rX = 0 in place of the verifier-supplied
argument is now absent if the arg is implicit (the BPF prog doesn't
pass a dummy NULL anymore). To mitigate this, reset the subreg_def of
all caller saved registers in check_kfunc_call() [1].

[1] https://lore.kernel.org/bpf/b4a760ef828d40dac7ea6074d39452bb0dc82caa.camel@gmail.com/

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Link: https://lore.kernel.org/r/20260120222638.3976562-4-ihor.solodrai@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-20 16:15:56 -08:00
Ihor Solodrai 08ca87d632 bpf: Introduce struct bpf_kfunc_meta
There is code duplication between add_kfunc_call() and
fetch_kfunc_meta() collecting information about a kfunc from BTF.

Introduce struct bpf_kfunc_meta to hold common kfunc BTF data and
implement fetch_kfunc_meta() to fill it in, instead of struct
bpf_kfunc_call_arg_meta directly.

Then use these in add_kfunc_call() and (new) fetch_kfunc_arg_meta()
functions, and fixup previous usages of fetch_kfunc_meta() to
fetch_kfunc_arg_meta().

Besides the code dedup, this change enables add_kfunc_call() to access
kfunc->flags.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Link: https://lore.kernel.org/r/20260120222638.3976562-3-ihor.solodrai@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-20 16:15:56 -08:00
Ihor Solodrai ea073d1818 bpf: Refactor btf_kfunc_id_set_contains
btf_kfunc_id_set_contains() is called by fetch_kfunc_meta() in the BPF
verifier to get the kfunc flags stored in the .BTF_ids ELF section.
If it returns NULL instead of a valid pointer, it's interpreted as an
illegal kfunc usage failing the verification.

There are two potential reasons for btf_kfunc_id_set_contains() to
return NULL:

  1. Provided kfunc BTF id is not present in relevant kfunc id sets.
  2. The kfunc is not allowed, as determined by the program type
     specific filter [1].

The filter functions accept a pointer to `struct bpf_prog`, so they
might implicitly depend on earlier stages of verification, when
bpf_prog members are set.

For example, bpf_qdisc_kfunc_filter() in linux/net/sched/bpf_qdisc.c
inspects prog->aux->st_ops [2], which is initialized in:

    check_attach_btf_id() -> check_struct_ops_btf_id()

So far this hasn't been an issue, because fetch_kfunc_meta() is the
only caller of btf_kfunc_id_set_contains().

However in subsequent patches of this series it is necessary to
inspect kfunc flags earlier in BPF verifier, in the add_kfunc_call().

To resolve this, refactor btf_kfunc_id_set_contains() into two
interface functions:
  * btf_kfunc_flags() that simply returns pointer to kfunc_flags
    without applying the filters
  * btf_kfunc_is_allowed() that both checks for kfunc_flags existence
    (which is a requirement for a kfunc to be allowed) and applies the
    prog filters

See [3] for the previous version of this patch.

[1] https://lore.kernel.org/all/20230519225157.760788-7-aditi.ghag@isovalent.com/
[2] https://lore.kernel.org/all/20250409214606.2000194-4-ameryhung@gmail.com/
[3] https://lore.kernel.org/bpf/20251029190113.3323406-3-ihor.solodrai@linux.dev/

Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Link: https://lore.kernel.org/r/20260120222638.3976562-2-ihor.solodrai@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-20 16:15:56 -08:00
Linus Torvalds c25f2fb1f4 17 hotfixes. 12 are cc:stable, 16 are for MM.
- A 4 patch series from David Hildenbrand which fixes a few things
   realted to hugetlb PMD sharing
 
 - The remainder are singletons, please see their changelogs for details.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaW/vEQAKCRDdBJ7gKXxA
 jtMcAQCK1tKnINRmVjY3UJCqZMAaXvOdOoUIgHDaTXD/DWKm9AD9HRwWzYB4+TNr
 k/Te8F33d418LcMBTW9CLhrplQpaIAI=
 =+d1A
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2026-01-20-13-09' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:

 - A patch series from David Hildenbrand which fixes a few things
   related to hugetlb PMD sharing

 - The remainder are singletons, please see their changelogs for details

* tag 'mm-hotfixes-stable-2026-01-20-13-09' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  mm: restore per-memcg proactive reclaim with !CONFIG_NUMA
  mm/kfence: fix potential deadlock in reboot notifier
  Docs/mm/allocation-profiling: describe sysctrl limitations in debug mode
  mm: do not copy page tables unnecessarily for VM_UFFD_WP
  mm/hugetlb: fix excessive IPI broadcasts when unsharing PMD tables using mmu_gather
  mm/rmap: fix two comments related to huge_pmd_unshare()
  mm/hugetlb: fix two comments related to huge_pmd_unshare()
  mm/hugetlb: fix hugetlb_pmd_shared()
  mm: remove unnecessary and incorrect mmap lock assert
  x86/kfence: avoid writing L1TF-vulnerable PTEs
  mm/vma: do not leak memory when .mmap_prepare swaps the file
  migrate: correct lock ordering for hugetlb file folios
  panic: only warn about deprecated panic_print on write access
  fs/writeback: skip AS_NO_DATA_INTEGRITY mappings in wait_sb_inodes()
  mm: take into account mm_cid size for mm_struct static definitions
  mm: rename cpu_bitmap field to flexible_array
  mm: add missing static initializer for init_mm::mm_cid.lock
2026-01-20 13:32:16 -08:00
Qiliang Yuan f81c07a6e9 bpf/verifier: Optimize ID mapping reset in states_equal
Currently, reset_idmap_scratch() performs a 4.7KB memset() in every
states_equal() call. Optimize this by using a counter to track used
ID mappings, replacing the O(N) memset() with an O(1) reset and
bounding the search loop in check_ids().

Signed-off-by: Qiliang Yuan <realwujing@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20260120023234.77673-1-realwujing@gmail.com
2026-01-20 11:32:28 -08:00
Daniel Borkmann 713edc7144 bpf: Remove leftover accounting in htab_map_mem_usage after rqspinlock
After commit 4fa8d68aa5 ("bpf: Convert hashtab.c to rqspinlock")
we no longer use HASHTAB_MAP_LOCK_{COUNT,MASK} as the per-CPU
map_locked[HASHTAB_MAP_LOCK_COUNT] array got removed from struct
bpf_htab. Right now it is still accounted for in htab_map_mem_usage.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/09703eb6bb249f12b1d5253b5a50a0c4fa239d27.1768913513.git.daniel@iogearbox.net
2026-01-20 11:28:02 -08:00
Puranjay Mohan ef7d4e42d1 bpf: verifier: Make sync_linked_regs() scratch registers
sync_linked_regs() is called after a conditional jump to propagate new
bounds of a register to all its liked registers. But the verifier log
only prints the state of the register that is part of the conditional
jump.

Make sync_linked_regs() scratch the registers whose bounds have been
updated by propagation from a known register.

Before:

0: (85) call bpf_get_prandom_u32#7    ; R0=scalar()
1: (57) r0 &= 255                     ; R0=scalar(smin=smin32=0,smax=umax=smax32=umax32=255,var_off=(0x0; 0xff))
2: (bf) r1 = r0                       ; R0=scalar(id=1,smin=smin32=0,smax=umax=smax32=umax32=255,var_off=(0x0; 0xff)) R1=scalar(id=1,smin=smin32=0,smax=umax=smax32=umax32=255,var_off=(0x0; 0xff))
3: (07) r1 += 4                       ; R1=scalar(id=1+4,smin=umin=smin32=umin32=4,smax=umax=smax32=umax32=259,var_off=(0x0; 0x1ff))
4: (a5) if r1 < 0xa goto pc+2         ; R1=scalar(id=1+4,smin=umin=smin32=umin32=10,smax=umax=smax32=umax32=259,var_off=(0x0; 0x1ff))
5: (35) if r0 >= 0x6 goto pc+1

After:

0: (85) call bpf_get_prandom_u32#7    ; R0=scalar()
1: (57) r0 &= 255                     ; R0=scalar(smin=smin32=0,smax=umax=smax32=umax32=255,var_off=(0x0; 0xff))
2: (bf) r1 = r0                       ; R0=scalar(id=1,smin=smin32=0,smax=umax=smax32=umax32=255,var_off=(0x0; 0xff)) R1=scalar(id=1,smin=smin32=0,smax=umax=smax32=umax32=255,var_off=(0x0; 0xff))
3: (07) r1 += 4                       ; R1=scalar(id=1+4,smin=umin=smin32=umin32=4,smax=umax=smax32=umax32=259,var_off=(0x0; 0x1ff))
4: (a5) if r1 < 0xa goto pc+2         ; R0=scalar(id=1+0,smin=umin=smin32=umin32=6,smax=umax=smax32=umax32=255) R1=scalar(id=1+4,smin=umin=smin32=umin32=10,smax=umax=smax32=umax32=259,var_off=(0x0; 0x1ff))
5: (35) if r0 >= 0x6 goto pc+1

The conditional jump in 4 updates the bound of R1 and the new bounds are
propogated to R0 as it is linked with the same id, before this change,
verifier only printed the state for R1 but after it prints for both R0
and R1.

Suggested-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/bpf/20260116141436.3715322-1-puranjay@kernel.org
2026-01-20 11:24:41 -08:00
Linus Torvalds c03e9c42ae dma-mapping fixes for Linux 6.19
- minor fixes for the corner cases of the SWIOTLB pool management (Robin Murphy)
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSrngzkoBtlA8uaaJ+Jp1EFxbsSRAUCaW+NbgAKCRCJp1EFxbsS
 RKU3AQCIpNwIYN6evmnTXO/Y+AFRjsrb1pE1cUyHn5QTY4/o4wEA6BiSgMCrZEbv
 HTEs1ZSHgLOwBwZre1Z11icGg3BkRgY=
 =6fBC
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-6.19-2026-01-20' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux

Pull dma-mapping fixes from Marek Szyprowski:

 - minor fixes for the corner cases of the SWIOTLB pool management
   (Robin Murphy)

* tag 'dma-mapping-6.19-2026-01-20' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
  dma/pool: Avoid allocating redundant pools
  mm_zone: Generalise has_managed_dma()
  dma/pool: Improve pool lookup
2026-01-20 10:16:18 -08:00
hongao f76d1c41b6 kprobes: retry blocked optprobe in do_free_cleaned_kprobes
Once the aggrprobe is fully reverted in do_free_cleaned_kprobes(), retry
optimize_kprobe() on that sibling so it can return to OPTIMIZED.

Also remove the stale comment in __disarm_kprobe().

Link: https://lore.kernel.org/all/349359900266B25F+20260115023804.3951960-2-hongao@uniontech.com/

Signed-off-by: hongao <hongao@uniontech.com>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2026-01-20 23:53:07 +09:00
Thomas Weißschuh e806f7dde8 timekeeping: Adjust the leap state for the correct auxiliary timekeeper
When __do_ajdtimex() was introduced to handle adjtimex for any
timekeeper, this reference to tk_core was not updated. When called on an
auxiliary timekeeper, the core timekeeper would be updated incorrectly.

This gets caught by the lock debugging diagnostics because the
timekeepers sequence lock gets written to without holding its
associated spinlock:

WARNING: include/linux/seqlock.h:226 at __do_adjtimex+0x394/0x3b0, CPU#2: test/125
aux_clock_adj (kernel/time/timekeeping.c:2979)
__do_sys_clock_adjtime (kernel/time/posix-timers.c:1161 kernel/time/posix-timers.c:1173)
do_syscall_64 (arch/x86/entry/syscall_64.c:63 (discriminator 1) arch/x86/entry/syscall_64.c:94 (discriminator 1))
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:131)

Update the correct auxiliary timekeeper.

Fixes: 775f71ebed ("timekeeping: Make do_adjtimex() reusable")
Fixes: ecf3e70304 ("timekeeping: Provide adjtimex() for auxiliary clocks")
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260120-timekeeper-auxclock-leapstate-v1-1-5b358c6b3cfd@linutronix.de
2026-01-20 10:18:53 +01:00
Gal Pressman 90f3c12324 panic: only warn about deprecated panic_print on write access
The panic_print_deprecated() warning is being triggered on both read and
write operations to the panic_print parameter.

This causes spurious warnings when users run 'sysctl -a' to list all
sysctl values, since that command reads /proc/sys/kernel/panic_print and
triggers the deprecation notice.

Modify the handlers to only emit the deprecation warning when the
parameter is actually being set:

 - sysctl_panic_print_handler(): check 'write' flag before warning.
 - panic_print_get(): remove the deprecation call entirely.

This way, users are only warned when they actively try to use the
deprecated parameter, not when passively querying system state.

Link: https://lkml.kernel.org/r/20260106163321.83586-1-gal@nvidia.com
Fixes: ee13240cd7 ("panic: add note that panic_print sysctl interface is deprecated")
Fixes: 2683df6539 ("panic: add note that 'panic_print' parameter is deprecated")
Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Nimrod Oren <noren@nvidia.com>
Cc: Feng Tang <feng.tang@linux.alibaba.com>
Cc: Joel Granados <joel.granados@kernel.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-19 12:30:01 -08:00
Linus Torvalds 6f32aa9161 cgroup: Another fix for v6.19-rc5
- Add Chen Ridong as cpuset reviewer.
 
 - Add SPDX license identifiers to cgroup files that were missing them.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaW0ToA4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGel7AQD3eTbiR/OU0i5yJj6YZsIdVYcteqqsWBkDFnZ3
 9pZZugEA+KCwi5eQz+cryKZ7y5fU5g5O6f4OP/lfLEcSmFDpfQU=
 =QHwy
 -----END PGP SIGNATURE-----

Merge tag 'cgroup-for-6.19-rc5-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup fixes from Tejun Heo:

 - Add Chen Ridong as cpuset reviewer

 - Add SPDX license identifiers to cgroup files that were missing them

* tag 'cgroup-for-6.19-rc5-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  kernel: cgroup: Add LGPL-2.1 SPDX license ID to legacy_freezer.c
  kernel: cgroup: Add SPDX-License-Identifier lines
  MAINTAINERS: Add Chen Ridong as cpuset reviewer
2026-01-18 14:30:27 -08:00
Linus Torvalds b671c1dad2 Fix the update_needs_ipi() check in the hrtimer code that
may result in incorrect skipping of hrtimer IPIs.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmlsr+oRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1iLPw/+OuA8hPlFY8nxrpVKASoiP7u+mGQ82Qmf
 xdf1Kimpmc5ydpGsxpRLKY19l3fLuGSJuOa2Url0Ro52JHxU+yPPOkN2ONjsPxLI
 mFMdhyXiS1PIgYt307OmNmKFadxdg5cGJTKY0wIS9JCcPFxI5Y3NFhWZ4ybPlGHK
 /bxtsjUctekUTDVahqP65lRIYMvjhO1I2LQfoPQyg/C+2n6Owy7TAX8xZovDZLHz
 a9RkkX6FPc8aR8Kj1VYm1yABNlUhGvFvTE6wfcfOMjiHHeDv2qYdwNHF2tK2eoKC
 QX609zBTCT/n0ioZlikLrKDZitj6uRgS2blBc+hodHEs4cY5gE2umWSU4IirT7HH
 GDjvG/vns/BLRr97gPdMaJ+N/sjdCSQ5tsDRrvHwCGC27VXPu3O3DYU128eH4eXW
 +KNJGQHJvR7slBKElQ4wxJaLhbLBB5rw/AqqdXa4PU8LqvMBBNcLZjHnrYlbD5Ei
 ovdX5VtRYOjrVl7pCNrXXdzFfBpXK+sOPHQMoJxnLHKXO+cdsZz8zIB7AHZEVe20
 CIOMRcfgw2nvpS+0MqaHzre0ixyq+U2XSIu7MRIlq8GZ+zdjiVMqx3pPa19QRdyr
 hA9JupeRC9eZdymlms2MiSzqIVmWSVvgcyhIFSYw9yLe55vXcGyArF27RKGtBHRA
 uF+K5aT8gqA=
 =CAdu
 -----END PGP SIGNATURE-----

Merge tag 'timers-urgent-2026-01-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fix from Ingo Molnar:
 "Fix the update_needs_ipi() check in the hrtimer code that may result
  in incorrect skipping of hrtimer IPIs"

* tag 'timers-urgent-2026-01-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  hrtimer: Fix softirq base check in update_needs_ipi()
2026-01-18 10:56:32 -08:00
Linus Torvalds 837c8180e3 Misc DL scheduler fixes, mainly for a new category of bugs that were
discovered and fixed recently:
 
  - Fix a race condition in the DL server
 
  - Fix a DL server bug which can result in incorrectly going
    idle when there's work available
 
  - Fix DL server bug which triggers a WARN() due to broken
    get_prio_dl() logic and subsequent misbehavior
 
  - Fix double update_rq_clock() calls
 
  - Fix setscheduler() assumption about static priorities
 
  - Make sure balancing callbacks are always called
 
  - Plus a handful of preparatory commits for the fixes.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmlsrfIRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1hgUA/9HQi51sxF/TlhKeoj9Z0zSO/L5Fbd7d5O
 ucsr+exoU7B+X95AnYC64pV3YlTNibD5y6HR4zpPzhlmEdFw8K9hTmDeSOF2cB0T
 dPdEY2KtvNoQwVwytPYKMsXtBTcJtZZqOec9BgGBzSIpjPm6Zhxr1J2/lKFy/udA
 00aqNuJwfQW94dxvyPf+jPXbm0JW9HHZGNuuOXkkmZpkH+HIzBDrTWcAjHUaZt9H
 doPyEfO08dg3Uu4OxNBtxx94X7bQsn5RU1plSEc+xF7zynqwEIwDbelFftpHiM+z
 UqfFgkzCV8A8o/72BfZVhQgaspMZ+M4XSsaGBbouOTg3Rgx8gRKC23s8/xhDJSpy
 ayMWhuR5bsI04GD1xZVOwLvUi8oDIgAFIAKRI6gcW5eQELlejii0bAvs9Jw9/OjI
 QhhrX54sJCPPKMZpKO2n+AzQvv9a6+/p7ikiA2U5dR4wBEERFUw8ffRguHwSm+BU
 U1M+t+2tw+UiC7yML0zvIC3HdyfBhj33ZeW7PWyvFU0vjGm78U6NjqliAxbqW21l
 GdtQUpi9aXCtijXqHhLFvBtWGmqowj9UQiHroC0+nq1BsTpBDe575vwJNYF2K9he
 ht0hZqreUIf7xTsJHXlLFbtFxPQw0xAKzWJWUswaaqO8eS2JcWJbmHUAsHQuxmJB
 MbkJAlU7j/w=
 =DgiW
 -----END PGP SIGNATURE-----

Merge tag 'sched-urgent-2026-01-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Ingo Molnar:
 "Misc deadline scheduler fixes, mainly for a new category of bugs that
  were discovered and fixed recently:

   - Fix a race condition in the DL server

   - Fix a DL server bug which can result in incorrectly going idle when
     there's work available

   - Fix DL server bug which triggers a WARN() due to broken
     get_prio_dl() logic and subsequent misbehavior

   - Fix double update_rq_clock() calls

   - Fix setscheduler() assumption about static priorities

   - Make sure balancing callbacks are always called

   - Plus a handful of preparatory commits for the fixes"

* tag 'sched-urgent-2026-01-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/deadline: Use ENQUEUE_MOVE to allow priority change
  sched: Deadline has dynamic priority
  sched: Audit MOVE vs balance_callbacks
  sched: Fold rq-pin swizzle into __balance_callbacks()
  sched/deadline: Avoid double update_rq_clock()
  sched/deadline: Ensure get_prio_dl() is up-to-date
  sched/deadline: Fix server stopping with runnable tasks
  sched: Provide idle_rq() helper
  sched/deadline: Fix potential race in dl_add_task_root_domain()
  sched/deadline: Remove unnecessary comment in dl_add_task_root_domain()
2026-01-18 10:17:40 -08:00
Tim Bird 4787eaf7c1 bpf: Add SPDX license identifiers to a few files
Add GPL-2.0 SPDX-License-Identifier lines to some files,
and remove a reference to COPYING, and boilerplate warranty
text, from offload.c.

Signed-off-by: Tim Bird <tim.bird@sony.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20260115013129.598705-1-tim.bird@sony.com
2026-01-16 14:50:00 -08:00
Mykyta Yatsenko 1700147697 bpf: Add __force annotations to silence sparse warnings
Add __force annotations to casts that convert between __user and kernel
address spaces. These casts are intentional:

- In bpf_send_signal_common(), the value is stored in si_value.sival_ptr
  which is typed as void __user *, but the value comes from a BPF
  program parameter.

- In the bpf_*_dynptr() kfuncs, user pointers are cast to const void *
  before being passed to copy helper functions that correctly handle
  the user address space through copy_from_user variants.

Without __force, sparse reports:
  warning: cast removes address space '__user' of expression

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20260115184509.3585759-1-mykyta.yatsenko5@gmail.com

Closes: https://lore.kernel.org/oe-kbuild-all/202601131740.6C3BdBaB-lkp@intel.com/
2026-01-16 14:21:11 -08:00
Linus Torvalds b62ce2547f Power management fixes for 6.19-rc6
- Fix a memory leak in em_create_pd() error path (Malaya Kumar Rout)
 
  - Fix stale description of the cost field in struct em_perf_state to
    reflect the current code (Yaxiong Tian)
 
  - Fix and revamp the energy model YNL specification added recently
    along with the energy model netlink interface (Changwoo Min)
 -----BEGIN PGP SIGNATURE-----
 
 iQFGBAABCAAwFiEEcM8Aw/RY0dgsiRUR7l+9nS/U47UFAmlqWS8SHHJqd0Byand5
 c29ja2kubmV0AAoJEO5fvZ0v1OO1BhAH/iSA/9YP9/xIW+rIZj2irsncGVc/hJ+V
 8PcM2tar966AEQBlmDPc7ug2MXT0Mb/5WRtQo/IvZ1tdAALB+bC70FHjdu3y7SAX
 IzFGyHKJ9OVJHGxYq0TCBbRXgYubUZiqvAnaBTBZ/ZIFlt3B4KEyatfFfxJMb1Pe
 H9zQIyUVBN1tjPfgNVbc22XGfkwfmmla72le6GmHseKZRqpwutIwzxm1PoMNVeLb
 fQv625LsWrDD9NU6j5uGq3WEq3xPltubtHN545FTFi4aGwBYJaG1FuIi+kPiQuCR
 VZmxTWVEiVwjttiZf/xWbOkPfwQqdgeXlTk5j+iBlF1jkrrW75IuLYQ=
 =eJ2p
 -----END PGP SIGNATURE-----

Merge tag 'pm-6.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
 "These fix an error path memory leak in the energy model management
  code, fix a kerneldoc comment in it, and fix and revamp the energy
  model YNL specification added recently along with the new energy model
  management netlink interface (that received feedback after being
  added):

   - Fix a memory leak in em_create_pd() error path (Malaya Kumar Rout)

   - Fix stale description of the cost field in struct em_perf_state to
     reflect the current code (Yaxiong Tian)

   - Fix and revamp the energy model YNL specification added recently
     along with the energy model netlink interface (Changwoo Min)"

* tag 'pm-6.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PM: EM: Add dump to get-perf-domains in the EM YNL spec
  PM: EM: Change cpus' type from string to u64 array in the EM YNL spec
  PM: EM: Rename em.yaml to dev-energymodel.yaml
  PM: EM: Fix yamllint warnings in the EM YNL spec
  PM: EM: Fix memory leak in em_create_pd() error path
  PM: EM: Fix incorrect description of the cost field in struct em_perf_state
2026-01-16 12:08:19 -08:00
Rafael J. Wysocki e9df6eba06 genirq/chip: Change irq_chip_pm_put() return type to void
The irq_chip_pm_put() return value is only used in __irq_do_set_handler()
to trigger a WARN_ON() if it is negative, but doing so is not useful
because irq_chip_pm_put() simply passes the pm_runtime_put() return value
to its callers.

Returning an error code from pm_runtime_put() merely means that it has
not queued up a work item to check whether or not the device can be
suspended and there are many perfectly valid situations in which that
can happen, like after writing "on" to the devices' runtime PM "control"
attribute in sysfs for one example.

For this reason, modify irq_chip_pm_put() to discard the pm_runtime_put()
return value, change its return type to void, and drop the WARN_ON()
around the irq_chip_pm_put() invocation from __irq_do_set_handler().
Also update the irq_chip_pm_put() kerneldoc comment to be more accurate.

This will facilitate a planned change of the pm_runtime_put() return
type to void in the future.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/5075294.31r3eYUQgx@rafael.j.wysocki
2026-01-16 20:28:05 +01:00
Puranjay Mohan af9e89d8dd bpf: Preserve id of register in sync_linked_regs()
sync_linked_regs() copies the id of known_reg to reg when propagating
bounds of known_reg to reg using the off of known_reg, but when
known_reg was linked to reg like:

known_reg = reg         ; both known_reg and reg get same id
known_reg += 4          ; known_reg gets off = 4, and its id gets BPF_ADD_CONST

now when a call to sync_linked_regs() happens, let's say with the following:

if known_reg >= 10 goto pc+2

known_reg's new bounds are propagated to reg but now reg gets
BPF_ADD_CONST from the copy.

This means if another link to reg is created like:

another_reg = reg       ; another_reg should get the id of reg but
                          assign_scalar_id_before_mov() sees
                          BPF_ADD_CONST on reg and assigns a new id to it.

As reg has a new id now, known_reg's link to reg is broken. If we find
new bounds for known_reg, they will not be propagated to reg.

This can be seen in the selftest added in the next commit:

0: (85) call bpf_get_prandom_u32#7    ; R0=scalar()
1: (57) r0 &= 255                     ; R0=scalar(smin=smin32=0,smax=umax=smax32=umax32=255,var_off=(0x0; 0xff))
2: (bf) r1 = r0                       ; R0=scalar(id=1,smin=smin32=0,smax=umax=smax32=umax32=255,var_off=(0x0; 0xff)) R1=scalar(id=1,smin=smin32=0,smax=umax=smax32=umax32=255,var_off=(0x0; 0xff))
3: (07) r1 += 4                       ; R1=scalar(id=1+4,smin=umin=smin32=umin32=4,smax=umax=smax32=umax32=259,var_off=(0x0; 0x1ff))
4: (a5) if r1 < 0xa goto pc+4         ; R1=scalar(id=1+4,smin=umin=smin32=umin32=10,smax=umax=smax32=umax32=259,var_off=(0x0; 0x1ff))
5: (bf) r2 = r0                       ; R0=scalar(id=2,smin=umin=smin32=umin32=6,smax=umax=smax32=umax32=255) R2=scalar(id=2,smin=umin=smin32=umin32=6,smax=umax=smax32=umax32=255)
6: (a5) if r1 < 0xe goto pc+2         ; R1=scalar(id=1+4,smin=umin=smin32=umin32=14,smax=umax=smax32=umax32=259,var_off=(0x0; 0x1ff))
7: (35) if r0 >= 0xa goto pc+1        ; R0=scalar(id=2,smin=umin=smin32=umin32=6,smax=umax=smax32=umax32=9,var_off=(0x0; 0xf))
8: (37) r0 /= 0
div by zero

When 4 is verified, r1's bounds are propagated to r0 but r0 also gets
BPF_ADD_CONST (bug).
When 5 is verified, r0 gets a new id (2) and its link with r1 is broken.

After 6 we know r1 has bounds [14, 259] and therefore r0 should have
bounds [10, 255], therefore the branch at 7 is always taken. But because
r0's id was changed to 2, r1's new bounds are not propagated to r0.
The verifier still thinks r0 has bounds [6, 255] before 7 and execution
can reach div by zero.

Fix this by preserving id in sync_linked_regs() like off and subreg_def.

Fixes: 98d7ca374b ("bpf: Track delta between "linked" registers.")
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260115151143.1344724-2-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-16 10:08:59 -08:00
Linus Torvalds 7a2c1b27cd printk fixup for 6.19 rc6
-----BEGIN PGP SIGNATURE-----
 
 iQJPBAABCAA5FiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAmlqGMYbFIAAAAAABAAO
 bWFudTIsMi41KzEuMTEsMiwyAAoJEFKgDEdIgJTyT0UP/3Wn4tm0h0n3XeyujQEA
 IPNisJCczF6aWIAuwccRigR8hQFriPNvkZ5AhMGtotZJrY3uUoe1/8aF+XIdqnl/
 Yb1434AGwVNIpSaap+vtEclHrLDmzZH+Z75/FTWQwldM/hPkPWJI8fsEuRLqBZsn
 v520NBFtrQVcOZKKNy0npBnHsC0DsAmqoZuOvLTx0mx5AyE029CfPbDMZuVnSNix
 KjZ4U5KL0qDs2LIMpdB/mqprydGkHdogdIbrPK3WtzStVgNbi9VmnV19ZwbUlXJM
 rYPbtbQg3htwuspgR+yM6O21qsthRf2qZF5+2/a929IzOBsD/qAXQbbxQWVpF7Qb
 ELYXNV4N5hqm9EW8WeOOpLKUUG7k0fRPf81X/07uGVafPMQKQJ8kFNgLBkBFR4ya
 RAMNxTPHbHQvVaLcRujxZXoC4Wh3ZTunQXpIouy0p9dKOzbsCAj0ZqeqDa09UsaW
 rCEm50p/Pd1csML8a9A/2nNoWjQzuSVmML7F6obGCOWaW6p21GhSKHzqqDIjBab9
 3wxhpllVeYRYS2yhkKjOPJkQKXo3idIdpieLpW8IVbJvp/gQgmeKjWdhnvNvUan9
 hyCzfI8OZXVz0vItBfWsoX44+6UtpLHd4o16aDYjyDItflJOPuQbWzky+icT2R3x
 8B5xPC08tmEGitf3miv2EGq9
 =d/xV
 -----END PGP SIGNATURE-----

Merge tag 'printk-for-6.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux

Pull printk fix from Petr Mladek:

 - Prevent softlockup by restoring IRQs in atomic flush after each
   record

* tag 'printk-for-6.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
  printk/nbcon: Restore IRQ in atomic flush after each emitted record
2026-01-16 09:46:59 -08:00
Dan Williams bc62f5b308 dax/hmem, e820, resource: Defer Soft Reserved insertion until hmem is ready
Insert Soft Reserved memory into a dedicated soft_reserve_resource tree
instead of the iomem_resource tree at boot. Delay publishing these ranges
into the iomem hierarchy until ownership is resolved and the HMEM path
is ready to consume them.

Publishing Soft Reserved ranges into iomem too early conflicts with CXL
hotplug and prevents region assembly when those ranges overlap CXL
windows.

Follow up patches will reinsert Soft Reserved ranges into iomem after CXL
window publication is complete and HMEM is ready to claim the memory. This
provides a cleaner handoff between EFI-defined memory ranges and CXL
resource management without trimming or deleting resources later.

In the meantime "Soft Reserved" resources will no longer appear in
/proc/iomem, only their results. I.e. with "memmap=4G%4G+0xefffffff"

Before:
100000000-1ffffffff : Soft Reserved
  100000000-1ffffffff : dax1.0
    100000000-1ffffffff : System RAM (kmem)

After:
100000000-1ffffffff : dax1.0
  100000000-1ffffffff : System RAM (kmem)

The expectation is that this does not lead to a user visible regression
because the dax1.0 device is created in both instances.

Co-developed-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
[Smita: incorporate feedback from x86 maintainer review]
Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
Link: https://patch.msgid.link/20251120031925.87762-2-Smita.KoralahalliChannabasappa@amd.com
[djbw: cleanups and clarifications]
Link: https://lore.kernel.org/69443f707b025_1cee10022@dwillia2-mobl4.notmuch
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2026-01-16 09:02:36 -07:00
Oleg Nesterov d55c571e43 x86/uprobes: Fix XOL allocation failure for 32-bit tasks
This script

	#!/usr/bin/bash

	echo 0 > /proc/sys/kernel/randomize_va_space

	echo 'void main(void) {}' > TEST.c

	# -fcf-protection to ensure that the 1st endbr32 insn can't be emulated
	gcc -m32 -fcf-protection=branch TEST.c -o test

	bpftrace -e 'uprobe:./test:main {}' -c ./test

"hangs", the probed ./test task enters an endless loop.

The problem is that with randomize_va_space == 0
get_unmapped_area(TASK_SIZE - PAGE_SIZE) called by xol_add_vma() can not
just return the "addr == TASK_SIZE - PAGE_SIZE" hint, this addr is used
by the stack vma.

arch_get_unmapped_area_topdown() doesn't take TIF_ADDR32 into account and
in_32bit_syscall() is false, this leads to info.high_limit > TASK_SIZE.
vm_unmapped_area() happily returns the high address > TASK_SIZE and then
get_unmapped_area() returns -ENOMEM after the "if (addr > TASK_SIZE - len)"
check.

handle_swbp() doesn't report this failure (probably it should) and silently
restarts the probed insn. Endless loop.

I think that the right fix should change the x86 get_unmapped_area() paths
to rely on TIF_ADDR32 rather than in_32bit_syscall(). Note also that if
CONFIG_X86_X32_ABI=y, in_x32_syscall() falsely returns true in this case
because ->orig_ax = -1.

But we need a simple fix for -stable, so this patch just sets TS_COMPAT if
the probed task is 32-bit to make in_ia32_syscall() true.

Fixes: 1b028f784e ("x86/mm: Introduce mmap_compat_base() for 32-bit mmap()")
Reported-by: Paulo Andrade <pandrade@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/aV5uldEvV7pb4RA8@redhat.com/
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/aWO7Fdxn39piQnxu@redhat.com
2026-01-16 16:23:54 +01:00
Rafael J. Wysocki d51e68b700 Merge branch 'pm-em'
Merge fixes related to the energy model management for 6.19-rc6:

 - Fix a memory leak in em_create_pd() error path (Malaya Kumar Rout)

 - Fix stale description of the cost field in struct em_perf_state to
   reflect the current code (Yaxiong Tian)

 - Fix and revamp the energy model YNL specification added recently
   along with the energy model netlink interface (Changwoo Min)

* pm-em:
  PM: EM: Add dump to get-perf-domains in the EM YNL spec
  PM: EM: Change cpus' type from string to u64 array in the EM YNL spec
  PM: EM: Rename em.yaml to dev-energymodel.yaml
  PM: EM: Fix yamllint warnings in the EM YNL spec
  PM: EM: Fix memory leak in em_create_pd() error path
  PM: EM: Fix incorrect description of the cost field in struct em_perf_state
2026-01-16 16:16:24 +01:00
Tim Bird 330eb955ea kernel: add SPDX-License-Identifier lines
Add SPDX-License-Identifier lines to some old kernel
files.

Signed-off-by: Tim Bird <tim.bird@sony.com>
Acked-by: Karim Yaghmour <karim.yaghmour@opersys.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-16 15:32:16 +01:00
Tim Bird 84697bf553 kernel: cgroup: Add LGPL-2.1 SPDX license ID to legacy_freezer.c
Add an appropriate SPDX-License-Identifier line to the file,
and remove the GNU boilerplate text.

Signed-off-by: Tim Bird <tim.bird@sony.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-01-15 22:03:15 -10:00
Tim Bird a1b3421a02 kernel: cgroup: Add SPDX-License-Identifier lines
Add GPL-2.0 SPDX license id lines to a few old
files, replacing the reference to the COPYING file.

The COPYING file at the time of creation of these files
(2007 and 2005) was GPL-v2.0, with an additional clause
indicating that only v2 applied.

Signed-off-by: Tim Bird <tim.bird@sony.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-01-15 22:03:09 -10:00
Tim Bird 983d014aaf kernel: modules: Add SPDX license identifier to kmod.c
Add a GPL-2.0 license identifier line for this file.

kmod.c was originally introduced in the kernel in February
of 1998 by Linus Torvalds - who was familiar with kernel
licensing at the time this was introduced.

Signed-off-by: Tim Bird <tim.bird@sony.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-01-15 16:58:28 -08:00
Jiri Olsa 276f3b6daf arm64/ftrace,bpf: Fix partial regs after bpf_prog_run
Mahe reported issue with bpf_override_return helper not working when
executed from kprobe.multi bpf program on arm.

The problem is that on arm we use alternate storage for pt_regs object
that is passed to bpf_prog_run and if any register is changed (which
is the case of bpf_override_return) it's not propagated back to actual
pt_regs object.

Fixing this by introducing and calling ftrace_partial_regs_update function
to propagate the values of changed registers (ip and stack).

Reported-by: Mahe Tardy <mahe.tardy@gmail.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/bpf/20260112121157.854473-1-jolsa@kernel.org
2026-01-15 16:15:25 -08:00
Linus Torvalds 88e490913f ftrace fixes for v6.19:
- Fix allocation accounting on boot up
 
   The ftrace records for each function that ftrace can attach to is
   done in a group of pages. At boot up, the number of pages are
   calculated and allocated. After that, the pages are filled with data.
   It may allocate more than needed due to some functions not being
   recorded (because they are unused weak functions), this too is
   recorded.
 
   After the data is filled in, a check is made to make sure the right
   number of pages were allocated. But this was off due to the
   assumption that the same number of entries fit per every page.
   Because the size of an entry does not evenly divide into PAGE_SIZE,
   there is a rounding error when a large number of pages is allocated
   to hold the events. This causes the check to fail and triggers a
   warning.
 
   Fix the accounting by finding out how many pages are actually
   allocated from the functions that allocate them and use that to see
   if all the pages allocated were used and the ones not used are
   properly freed.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaWlyoxQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qoZoAQDuPOaly/RM9cS70zngDo/UsZmi2lQL
 RxuMR7sPVMwkrAEAp/gnpJenSRxCzaO4EdQYUYI28/ihkdNGp3KG8Q2MNw0=
 =v899
 -----END PGP SIGNATURE-----

Merge tag 'ftrace-v6.19-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull ftrace fix from Steven Rostedt:

 - Fix allocation accounting on boot up

   The ftrace records for each function that ftrace can attach to is
   done in a group of pages. At boot up, the number of pages are
   calculated and allocated. After that, the pages are filled with data.
   It may allocate more than needed due to some functions not being
   recorded (because they are unused weak functions), this too is
   recorded.

   After the data is filled in, a check is made to make sure the right
   number of pages were allocated. But this was off due to the
   assumption that the same number of entries fit per every page.
   Because the size of an entry does not evenly divide into PAGE_SIZE,
   there is a rounding error when a large number of pages is allocated
   to hold the events. This causes the check to fail and triggers a
   warning.

   Fix the accounting by finding out how many pages are actually
   allocated from the functions that allocate them and use that to see
   if all the pages allocated were used and the ones not used are
   properly freed.

* tag 'ftrace-v6.19-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  ftrace: Do not over-allocate ftrace memory
2026-01-15 15:13:05 -08:00
Shrikanth Hegde 5d86d542f6 sched/fair: Remove nohz.nr_cpus and use weight of cpumask instead
nohz.nr_cpus was observed as contended cacheline when running
enterprise workload on large systems.

Fundamental scalability challenge with nohz.idle_cpus_mask
and nohz.nr_cpus is the following:

 (1) nohz_balancer_kick() observes (reads) nohz.nr_cpus
     (or nohz.idle_cpu_mask) and nohz.has_blocked to  see whether there's
     any nohz balancing work to do, in every scheduler tick.

 (2) nohz_balance_enter_idle() and nohz_balance_exit_idle()
     (through nohz_balancer_kick() via sched_tick()) modify (write)
     nohz.nr_cpus (and/or nohz.idle_cpu_mask) and nohz.has_blocked.

The characteristic frequencies are the following:

 (1) nohz_balancer_kick() happens at scheduler (busy)tick frequency
     on CPU(which has not gone idle). This is a relatively constant
     frequency  in the ~1 kHz range or lower.

 (2) happens at idle enter/exit frequency on every CPU that goes to idle.
     This is workload dependent, but can easily be hundreds of kHz for
     IO-bound loads and high CPU counts. Ie. can be orders of magnitude
     higher than (1), in which case a cachemiss at every invocation of (1)
     is almost inevitable. idle exit will trigger (1) on the CPU
     which is coming out of idle.

There's two types of costs from these functions:

 (A) scheduler tick cost via (1): this happens on busy CPUs too, and is
     thus a primary scalability cost. But the rate here is constant and
     typically much lower than (B), hence the absolute benefit to workload
     scalability will be lower as well.

 (B) idle cost via (2): going-to-idle and coming-from-idle costs are
     secondary concerns, because they impact power efficiency more than
     they impact scalability. But in terms of absolute cost this scales
     up with nr_cpus as well, and a much faster rate, and thus may also
     approach and negatively impact system limits like
     memory bus/fabric bandwidth.

Note that nohz.idle_cpus_mask and nohz.nr_cpus may appear to reside in the
same cacheline, however under CONFIG_CPUMASK_OFFSTACK=y the backing storage
for nohz.idle_cpus_mask will be elsewhere. With CPUMASK_OFFSTACK=n,
the nohz.idle_cpus_mask and rest of nohz fields are in different cachelines
under typical NR_CPUS=512/2048. This implies two separate cachelines
being dirtied upon idle entry / exit.

nohz.nr_cpus can be derived from the mask itself. Its usage doesn't warrant
a functionally correct value. This means one less cacheline being dirtied in
idle entry/exit path which helps to save some bus bandwidth w.r.t to those
nohz functions(approx 50%). This in turn helps to improve enterprise
workload throughput.

On system with 480 CPUs, running "hackbench 40 process 10000 loops"
(Avg of 3 runs)
baseline:
     0.81%  hackbench          [k] nohz_balance_exit_idle
     0.21%  hackbench          [k] nohz_balancer_kick
     0.09%  swapper            [k] nohz_run_idle_balance

With patch:
     0.35%  hackbench          [k] nohz_balance_exit_idle
     0.09%  hackbench          [k] nohz_balancer_kick
     0.07%  swapper            [k] nohz_run_idle_balance

[Ingo Molnar: scalability analysis changlog]

Reviewed-and-tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://patch.msgid.link/20260115073524.376643-4-sshegde@linux.ibm.com
2026-01-15 22:41:27 +01:00
Shrikanth Hegde 94e70734b4 sched/fair: Change likelyhood of nohz.nr_cpus
These days most of the system have multi cores. The likelyhood of
at least one or more CPUs in nohz (idle state) is higher.

Give accurate hint to the branch predictor.

Reviewed-and-tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://patch.msgid.link/20260115073524.376643-3-sshegde@linux.ibm.com
2026-01-15 22:41:27 +01:00
Shrikanth Hegde 6b67c8a72e sched/fair: Move checking for nohz cpus after time check
Current code does.
- Read nohz.nr_cpus
- Check if the time has passed to do NOHZ idle balance

Instead do this.
- Check if the time has passed to do NOHZ idle balance
- Read nohz.nr_cpus

This will skip the read most of the time in normal system usage.
i.e when there are nohz.nr_cpus (system is not 100% busy).

Note that when there are no idle CPUs(100% busy), even if the flag gets
set to NOHZ_STATS_KICK | NOHZ_NEXT_KICK, find_new_ilb will fail and
there will be no NOHZ idle balance. In such cases there will be a very
narrow window where, kick_ilb will be called un-necessarily.
However current functionality is still retained.

Note: This patch doesn't solve any cacheline overheads. No improvement
in performance apart from saving a few cycles of reading nohz.nr_cpus

Reviewed-and-tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://patch.msgid.link/20260115073524.376643-2-sshegde@linux.ibm.com
2026-01-15 22:41:26 +01:00
Zhan Xusheng 553255cc85 sched/fair: Fix math notation errors in avg_vruntime comment
The avg_vruntime comment contains a couple of mathematical notation
issues:

 - The summation over w_i * (V - v_i) is written in an ambiguous form
 - The delta term refers to v instead of v0, which is inconsistent
   with the code and preceding explanation

Fix these to make the comment mathematically correct and consistent
with the implementation.

Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260114090035.19033-1-zhanxusheng@xiaomi.com
2026-01-15 22:41:26 +01:00
Gabriele Monaco 8d73732016 sched: Fix build for modules using set_tsk_need_resched()
Commit adcc3bfa88 ("sched: Adapt sched tracepoints for RV task model")
added a tracepoint to the need_resched action that can be triggered also
by set_tsk_need_resched.
This function was previously accessible from out-of-tree modules but
it's no longer available because the __trace_set_need_resched() symbol
is not exported (together with the tracepoint itself, which was exported
in a separate patch) and building such modules fails.

Export __trace_set_need_resched to modules to fix those build issues.

Fixes: adcc3bfa88 ("sched: Adapt sched tracepoints for RV task model")
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Link: https://patch.msgid.link/20260112140413.362202-1-gmonaco@redhat.com
2026-01-15 22:41:26 +01:00
Peter Zijlstra 627cc25f84 sched/deadline: Use ENQUEUE_MOVE to allow priority change
Pierre reported hitting balance callback warnings for deadline tasks
after commit 6455ad5346 ("sched: Move sched_class::prio_changed()
into the change pattern").

It turns out that DEQUEUE_SAVE+ENQUEUE_RESTORE does not preserve DL
priority and subsequently trips a balance pass -- where one was not
expected.

From discussion with Juri and Luca, the purpose of this clause was to
deal with tasks new to DL and all those sites will have MOVE set (as
well as CLASS, but MOVE is move conservative at this point).

Per the previous patches MOVE is audited to always run the balance
callbacks, so switch enqueue_dl_entity() to use MOVE for this case.

Fixes: 6455ad5346 ("sched: Move sched_class::prio_changed() into the change pattern")
Reported-by: Pierre Gondois <pierre.gondois@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Pierre Gondois <pierre.gondois@arm.com>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://patch.msgid.link/20260114130528.GB831285@noisy.programming.kicks-ass.net
2026-01-15 21:57:53 +01:00
Peter Zijlstra e008ec6c79 sched: Deadline has dynamic priority
While FIFO/RR have static priority, DEADLINE is a dynamic priority
scheme. Notably it has static priority -1. Do not assume the priority
doesn't change for deadline tasks just because the static priority
doesn't change.

This ensures DL always sees {DE,EN}QUEUE_MOVE where appropriate.

Fixes: ff77e46853 ("sched/rt: Fix PI handling vs. sched_setscheduler()")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Pierre Gondois <pierre.gondois@arm.com>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://patch.msgid.link/20260114130528.GB831285@noisy.programming.kicks-ass.net
2026-01-15 21:57:53 +01:00
Peter Zijlstra 53439363c0 sched: Audit MOVE vs balance_callbacks
The {DE,EN}QUEUE_MOVE flag indicates a task is allowed to change
priority, which means there could be balance callbacks queued.

Therefore audit all MOVE users and make sure they do run balance
callbacks before dropping rq-lock.

Fixes: 6455ad5346 ("sched: Move sched_class::prio_changed() into the change pattern")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Pierre Gondois <pierre.gondois@arm.com>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://patch.msgid.link/20260114130528.GB831285@noisy.programming.kicks-ass.net
2026-01-15 21:57:53 +01:00
Peter Zijlstra 49041e87f9 sched: Fold rq-pin swizzle into __balance_callbacks()
Prepare for more users needing the rq-pin swizzle.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Pierre Gondois <pierre.gondois@arm.com>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://patch.msgid.link/20260114130528.GB831285@noisy.programming.kicks-ass.net
2026-01-15 21:57:52 +01:00
Peter Zijlstra 4de9ff7606 sched/deadline: Avoid double update_rq_clock()
When setup_new_dl_entity() is called from enqueue_task_dl() ->
enqueue_dl_entity(), the rq-clock should already be updated, and
calling update_rq_clock() again is not right.

Move the update_rq_clock() to the one other caller of
setup_new_dl_entity(): sched_init_dl_server().

Fixes: 9f239df555 ("sched/deadline: Initialize dl_servers after SMP")
Reported-by: Pierre Gondois <pierre.gondois@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Pierre Gondois <pierre.gondois@arm.com>
Link: https://patch.msgid.link/20260113115622.GA831285@noisy.programming.kicks-ass.net
2026-01-15 21:57:52 +01:00
Peter Zijlstra 375410bb9a sched/deadline: Ensure get_prio_dl() is up-to-date
Pratheek tripped a WARN and noted the following issue:

> Inspecting the set of events that led to the warning being triggered
> showed the following:
>
>     systemd-1  [008] dN.31 ...: do_set_cpus_allowed: set_cpus_allowed begin!
>
>     systemd-1  [008] dN.31 ...: sched_change_begin: Begin!
>     systemd-1  [008] dN.31 ...: sched_change_begin: Before dequeue_task()!
>     systemd-1  [008] dN.31 ...: update_curr_dl_se: update_curr_dl_se: ENQUEUE_REPLENISH
>     systemd-1  [008] dN.31 ...: enqueue_dl_entity: enqueue_dl_entity: ENQUEUE_REPLENISH
>     systemd-1  [008] dN.31 ...: replenish_dl_entity: Replenish before: 14815760217
>     systemd-1  [008] dN.31 ...: replenish_dl_entity: Replenish after: 14816960047
>     systemd-1  [008] dN.31 ...: sched_change_begin: Before put_prev_task()!
>
>     systemd-1  [008] dN.31 ...: sched_change_end: Before enqueue_task()!
>     systemd-1  [008] dN.31 ...: sched_change_end: Before put_prev_task()!
>     systemd-1  [008] dN.31 ...: prio_changed_dl: Queuing pull task on prio change: 14815760217 -> 14816960047
>     systemd-1  [008] dN.31 ...: prio_changed_dl: Queuing balance callback!
>     systemd-1  [008] dN.31 ...: sched_change_end: End!
>
>     systemd-1  [008] dN.31 ...: do_set_cpus_allowed: set_cpus_allowed end!
>     systemd-1  [008] dN.21 ...: __schedule: Woops! Balance callback found!
>
> 1. sched_change_begin() from guard(sched_change) in
>    do_set_cpus_allowed() stashes the priority, which for the deadline
>    task, is "p->dl.deadline".
> 2. The dequeue of the deadline task replenishes the deadline.
> 3. The task is enqueued back after guard's scope ends and since there is
>    no *_CLASS flags set, sched_change_end() calls
>    dl_sched_class->prio_changed() which compares the deadline.
> 4. Since deadline was moved on dequeue, prio_changed_dl() sees the value
>    differ from the stashed value and queues a balance pull callback.
> 5. do_set_cpus_allowed() finishes and drops the rq_lock without doing a
>    do_balance_callbacks().
> 6. Grabbing the rq_lock() at subsequent __schedule() triggers the
>    warning since the balance pull callback was never executed before
>    dropping the lock.

Meaning get_prio_dl() ought to update current and return an up-to-date
value.

Fixes: 6455ad5346 ("sched: Move sched_class::prio_changed() into the change pattern")
Reported-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://patch.msgid.link/20260106104113.GX3707891@noisy.programming.kicks-ass.net
2026-01-15 21:57:52 +01:00
Linus Torvalds 13b2d15d99 32 hotfixes. 16 are cc:stable, 24 are for MM.
- four kerneldoc fixes from Bagas Sanjaya
 
 - four DAMON fixes from SeongJae
 
 - four mremap VMA-related fixes from Lorenzo
 
 - various singletons - please see the changelogs for details
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaWkP3gAKCRDdBJ7gKXxA
 jnmTAQCBiCyw7Pvvak5OM8Of0qQ8m/jQmNMBPRrd5M3tAAEOlwD9F7E6RjcmyItU
 k+sboDgNKTvgdRHTmRzj1t96aXJrSQQ=
 =OxTh
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2026-01-15-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:

 - kerneldoc fixes from Bagas Sanjaya

 - DAMON fixes from SeongJae

 - mremap VMA-related fixes from Lorenzo

 - various singletons - please see the changelogs for details

* tag 'mm-hotfixes-stable-2026-01-15-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (30 commits)
  drivers/dax: add some missing kerneldoc comment fields for struct dev_dax
  mm: numa,memblock: include <asm/numa.h> for 'numa_nodes_parsed'
  mailmap: add entry for Daniel Thompson
  tools/testing/selftests: fix gup_longterm for unknown fs
  mm/page_alloc: prevent pcp corruption with SMP=n
  iommu/sva: include mmu_notifier.h header
  mm: kmsan: fix poisoning of high-order non-compound pages
  tools/testing/selftests: add forked (un)/faulted VMA merge tests
  mm/vma: enforce VMA fork limit on unfaulted,faulted mremap merge too
  tools/testing/selftests: add tests for !tgt, src mremap() merges
  mm/vma: fix anon_vma UAF on mremap() faulted, unfaulted merge
  mm/zswap: fix error pointer free in zswap_cpu_comp_prepare()
  mm/damon/sysfs-scheme: cleanup access_pattern subdirs on scheme dir setup failure
  mm/damon/sysfs-scheme: cleanup quotas subdirs on scheme dir setup failure
  mm/damon/sysfs: cleanup attrs subdirs on context dir setup failure
  mm/damon/sysfs: cleanup intervals subdirs on attrs dir setup failure
  mm/damon/core: remove call_control in inactive contexts
  powerpc/watchdog: add support for hardlockup_sys_info sysctl
  mips: fix HIGHMEM initialization
  mm/hugetlb: ignore hugepage kernel args if hugepages are unsupported
  ...
2026-01-15 10:47:14 -08:00
Guenter Roeck be55257fab ftrace: Do not over-allocate ftrace memory
The pg_remaining calculation in ftrace_process_locs() assumes that
ENTRIES_PER_PAGE multiplied by 2^order equals the actual capacity of the
allocated page group. However, ENTRIES_PER_PAGE is PAGE_SIZE / ENTRY_SIZE
(integer division). When PAGE_SIZE is not a multiple of ENTRY_SIZE (e.g.
4096 / 24 = 170 with remainder 16), high-order allocations (like 256 pages)
have significantly more capacity than 256 * 170. This leads to pg_remaining
being underestimated, which in turn makes skip (derived from skipped -
pg_remaining) larger than expected, causing the WARN(skip != remaining)
to trigger.

Extra allocated pages for ftrace: 2 with 654 skipped
WARNING: CPU: 0 PID: 0 at kernel/trace/ftrace.c:7295 ftrace_process_locs+0x5bf/0x5e0

A similar problem in ftrace_allocate_records() can result in allocating
too many pages. This can trigger the second warning in
ftrace_process_locs().

Extra allocated pages for ftrace
WARNING: CPU: 0 PID: 0 at kernel/trace/ftrace.c:7276 ftrace_process_locs+0x548/0x580

Use the actual capacity of a page group to determine the number of pages
to allocate. Have ftrace_allocate_pages() return the number of allocated
pages to avoid having to calculate it. Use the actual page group capacity
when validating the number of unused pages due to skipped entries.
Drop the definition of ENTRIES_PER_PAGE since it is no longer used.

Cc: stable@vger.kernel.org
Fixes: 4a3efc6baf ("ftrace: Update the mcount_loc check of skipped entries")
Link: https://patch.msgid.link/20260113152243.3557219-1-linux@roeck-us.net
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-01-15 10:17:53 -05:00
Namhyung Kim 4960626f95 perf/core: Fix slow perf_event_task_exit() with LBR callstacks
I got a report that a task is stuck in perf_event_exit_task() waiting
for global_ctx_data_rwsem.  On large systems with lots threads, it'd
have performance issues when it grabs the lock to iterate all threads
in the system to allocate the context data.

And it'd block task exit path which is problematic especially under
memory pressure.

  perf_event_open
    perf_event_alloc
      attach_perf_ctx_data
        attach_global_ctx_data
          percpu_down_write (global_ctx_data_rwsem)
            for_each_process_thread
              alloc_task_ctx_data
                                               do_exit
                                                 perf_event_exit_task
                                                   percpu_down_read (global_ctx_data_rwsem)

It should not hold the global_ctx_data_rwsem on the exit path.  Let's
skip allocation for exiting tasks and free the data carefully.

Reported-by: Rosalie Fang <rosaliefang@google.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260112165157.1919624-1-namhyung@kernel.org
2026-01-15 10:04:26 +01:00
Feng Tang e561383a39 powerpc/watchdog: add support for hardlockup_sys_info sysctl
Commit a9af76a787 ("watchdog: add sys_info sysctls to dump sys info on
system lockup") adds 'hardlock_sys_info' systcl knob for general kernel
watchdog to control what kinds of system debug info to be dumped on
hardlockup.

Add similar support in powerpc watchdog code to make the sysctl knob more
general, which also fixes a compiling warning in general watchdog code
reported by 0day bot.

Link: https://lkml.kernel.org/r/20251231080309.39642-1-feng.tang@linux.alibaba.com
Fixes: a9af76a787 ("watchdog: add sys_info sysctls to dump sys info on system lockup")
Signed-off-by: Feng Tang <feng.tang@linux.alibaba.com>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202512030920.NFKtekA7-lkp@intel.com/
Suggested-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-14 22:16:22 -08:00
Pasha Tatashin 582f0f3864 kho: validate preserved memory map during population
If the previous kernel enabled KHO but did not call kho_finalize() (e.g.,
CONFIG_LIVEUPDATE=n or userspace skipped the finalization step), the
'preserved-memory-map' property in the FDT remains empty/zero.

Previously, kho_populate() would succeed regardless of the memory map's
state, reserving the incoming scratch regions in memblock.  However,
kho_memory_init() would later fail to deserialize the empty map.  By that
time, the scratch regions were already registered, leading to partial
initialization and subsequent list corruption (freeing scratch area twice)
during kho_init().

Move the validation of the preserved memory map earlier into
kho_populate(). If the memory map is empty/NULL:
1. Abort kho_populate() immediately with -ENOENT.
2. Do not register or reserve the incoming scratch memory, allowing the new
   kernel to reclaim those pages as standard free memory.
3. Leave the global 'kho_in' state uninitialized.

Consequently, kho_memory_init() sees no active KHO context
(kho_in.mem_chunks_phys is 0) and falls back to kho_reserve_scratch(),
allocating fresh scratch memory as if it were a standard cold boot.

Link: https://lkml.kernel.org/r/20251223140140.2090337-1-pasha.tatashin@soleen.com
Fixes: de51999e68 ("kho: allow memory preservation state updates after finalization")
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reported-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Closes: https://lore.kernel.org/all/20251218215613.GA17304@ranerica-svr.sc.intel.com
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Tested-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-14 22:16:21 -08:00
Anton Protopopov d1aab1ca57 bpf: Properly mark live registers for indirect jumps
For a `gotox rX` instruction the rX register should be marked as used
in the compute_insn_live_regs() function. Fix this.

Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Link: https://lore.kernel.org/r/20260114162544.83253-2-a.s.protopopov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-14 19:08:09 -08:00
Alexei Starovoitov e3d0dbb3b5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf after rc5
Cross-merge BPF and other fixes after downstream PR.

No conflicts.

Adjacent:
Auto-merging MAINTAINERS
Auto-merging Makefile
Auto-merging kernel/bpf/verifier.c
Auto-merging kernel/sched/ext.c
Auto-merging mm/memcontrol.c

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-14 15:22:01 -08:00
Robin Murphy c6ccd09880 dma/pool: Avoid allocating redundant pools
On smaller systems, e.g. embedded arm64, it is common for all memory
to end up in ZONE_DMA32 or even ZONE_DMA. In such cases it is redundant
to allocate a nominal pool for an empty higher zone that just ends up
coming from a lower zone that should already have its own pool anyway.
We already have logic to skip allocating a ZONE_DMA pool when that is
empty, so generalise that to save memory in the case of other zones too.

Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Tested-by: Vladimir Kondratiev <vladimir.kondratiev@mobileye.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/8ab8d8a620dee0109f33f5cb63d6bfeed35aac37.1768230104.git.robin.murphy@arm.com
2026-01-14 11:00:00 +01:00
Robin Murphy b31ac41b59 dma/pool: Improve pool lookup
If CONFIG_ZONE_DMA32 is enabled, but we have not allocated the
corresponding atomic_pool_dma32, dma_guess_pool() may return the NULL
value of that and fail a GFP_DMA32 allocation without trying to fall
back to other pools which may exist. Furthermore, if no GFP_DMA pool
exists, it is preferable to try GFP_DMA32 rather than immediately fall
back to GFP_KERNEL with even less chance of success. Improve matters
by encoding an explicit order of pool preference for each flag.

Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Tested-by: Vladimir Kondratiev <vladimir.kondratiev@mobileye.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/c846b1a2f43295cac926c7af2ce907f62baec518.1768230104.git.robin.murphy@arm.com
2026-01-14 11:00:00 +01:00
Thomas Weißschuh 7158fc54b2 vdso: Remove struct getcpu_cache
The cache parameter of getcpu() is useless nowadays for various reasons.

  * It is never passed by userspace for either the vDSO or syscalls.
  * It is never used by the kernel.
  * It could not be made to work on the current vDSO architecture.
  * The structure definition is not part of the UAPI headers.
  * vdso_getcpu() is superseded by restartable sequences in any case.

Remove the struct and its header.

As a side-effect this gets rid of an unwanted inclusion of the linux/
header namespace from vDSO code.

[ tglx: Adapt to s390 upstream changes */

Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Heiko Carstens <hca@linux.ibm.com> # s390
Link: https://patch.msgid.link/20251230-getcpu_cache-v3-1-fb9c5f880ebe@linutronix.de
2026-01-14 08:56:40 +01:00
Linus Torvalds c537e12dae bpf-fixes
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmlm+x8ACgkQ6rmadz2v
 bTrC5Q/+NSCjBpHAinVL/09pahh0YJKczxWnpHG0rs6hSSZ88o+1uY+z3PQ3lRX9
 cHSwjIicRvwpX+DosGcXESqspbvo5MLx5DcB0WBSCnLodJ5MDap0iWltBCrGzqme
 mpeuZiQMSpsAGQLjICiywxcCndoaKEMSRZFmjlzNVJPbYkbxbZCQvwUQqg+Z4rQX
 t5zBJjb0X3Oaz3TcSiiW+yulZPoejFyF4P+4eTmFy+mAqq5D7CpMXrrjVTf78mSf
 IqMPbY/PssTExzna2d/ZKtxmKQ+HF6lN7+JaTuqbQsqQtxwh4eXsswezY/UKX7j+
 FdAuwz6W7ybO8UC3i4fvA1jFR60Hi5bBBoSXu+0hb/UGQQYXc+gRtB8DxqeUEgeb
 oVByGwrBWLRSLuBbQVPl3PRTNgwpBn41DkyIpXS6DK6thzdAqZ8zM15vFTYbWanF
 W/skdubkgIdhREunXMpLfZD5h0b/pXIYxzA59x//hZ+W8qvTuaabICja18x3SnqY
 hytoIQMo5d1UytyngfcHvXe79DCi5xFAqqVj0AqYR+CDdt5n9vrXgavX76hrYeHe
 /yYUHZ0h5A564zMXITMmUWDWvgmvTs1Q2hiV+rmtWPJ1bpOpPFxUosIWEBl4edA3
 vKlGoh/CtU9NJp/E/7nYdUV1vXlLJaareGQ/UXpFBl4NYsnNe4w=
 =Vt4X
 -----END PGP SIGNATURE-----

Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Pull bpf fixes from Alexei Starovoitov:

 - Fix incorrect usage of BPF_TRAMP_F_ORIG_STACK in riscv JIT (Menglong
   Dong)

 - Fix reference count leak in bpf_prog_test_run_xdp() (Tetsuo Handa)

 - Fix metadata size check in bpf_test_run() (Toke Høiland-Jørgensen)

 - Check that BPF insn array is not allowed as a map for const strings
   (Deepanshu Kartikey)

* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  bpf: Fix reference count leak in bpf_prog_test_run_xdp()
  bpf: Reject BPF_MAP_TYPE_INSN_ARRAY in check_reg_const_str()
  selftests/bpf: Update xdp_context_test_run test to check maximum metadata size
  bpf, test_run: Subtract size of xdp_frame from allowed metadata size
  riscv, bpf: Fix incorrect usage of BPF_TRAMP_F_ORIG_STACK
2026-01-13 21:21:13 -08:00
Anton Protopopov 7e525860e7 bpf: Return EACCES for incorrect access to insn array
The insn_array_map_direct_value_addr() function currently returns
-EINVAL when the offset within the map is invalid. Change this to
return -EACCES, so that it is consistent with similar boundary access
checks in the verifier.

Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Link: https://lore.kernel.org/r/20260111153047.8388-3-a.s.protopopov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-13 19:36:18 -08:00
Anton Protopopov e3bd7bdf5f bpf: Return proper address for non-zero offsets in insn array
The map_direct_value_addr() function of the instruction
array map incorrectly adds offset to the resulting address.
This is a bug, because later the resolve_pseudo_ldimm64()
function adds the offset. Fix it. Corresponding selftests
are added in a consequent commit.

Fixes: 493d9e0d60 ("bpf, x86: add support for indirect jumps")
Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Link: https://lore.kernel.org/r/20260111153047.8388-2-a.s.protopopov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-13 19:35:47 -08:00
Matt Bobrowski f8ade2342e bpf: return PTR_TO_BTF_ID | PTR_TRUSTED from BPF kfuncs by default
Teach the BPF verifier to treat pointers to struct types returned from
BPF kfuncs as implicitly trusted (PTR_TO_BTF_ID | PTR_TRUSTED) by
default. Returning untrusted pointers to struct types from BPF kfuncs
should be considered an exception only, and certainly not the norm.

Update existing selftests to reflect the change in register type
printing (e.g. `ptr_` becoming `trusted_ptr_` in verifier error
messages).

Link: https://lore.kernel.org/bpf/aV4nbCaMfIoM0awM@google.com/
Signed-off-by: Matt Bobrowski <mattbobrowski@google.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260113083949.2502978-1-mattbobrowski@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-13 19:19:13 -08:00
Donglin Peng 434bcbc837 bpf: Optimize the performance of find_bpffs_btf_enums
Currently, vmlinux BTF is unconditionally sorted during
the build phase. The function btf_find_by_name_kind
executes the binary search branch, so find_bpffs_btf_enums
can be optimized by using btf_find_by_name_kind.

Signed-off-by: Donglin Peng <pengdonglin@xiaomi.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20260109130003.3313716-10-dolinux.peng@gmail.com
2026-01-13 16:21:36 -08:00
Donglin Peng dc893cfa39 bpf: Skip anonymous types in type lookup for performance
Currently, vmlinux and kernel module BTFs are unconditionally
sorted during the build phase, with named types placed at the
end. Thus, anonymous types should be skipped when starting the
search. In my vmlinux BTF, the number of anonymous types is
61,747, which means the loop count can be reduced by 61,747.

Signed-off-by: Donglin Peng <pengdonglin@xiaomi.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20260109130003.3313716-9-dolinux.peng@gmail.com
2026-01-13 16:21:36 -08:00
Donglin Peng 342bf525ba btf: Verify BTF sorting
This patch checks whether the BTF is sorted by name in ascending order.
If sorted, binary search will be used when looking up types.

Specifically, vmlinux and kernel module BTFs are always sorted during
the build phase with anonymous types placed before named types, so we
only need to identify the starting ID of named types.

Signed-off-by: Donglin Peng <pengdonglin@xiaomi.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20260109130003.3313716-8-dolinux.peng@gmail.com
2026-01-13 16:21:30 -08:00
Donglin Peng 8c3070e159 btf: Optimize type lookup with binary search
Improve btf_find_by_name_kind() performance by adding binary search
support for sorted types. Falls back to linear search for compatibility.

Signed-off-by: Donglin Peng <pengdonglin@xiaomi.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20260109130003.3313716-7-dolinux.peng@gmail.com
2026-01-13 16:20:38 -08:00
Jan H. Schönherr eebe6446cc perf/core: Speed up kexec shutdown by avoiding unnecessary cross CPU calls
There are typically a lot of PMUs registered, but in many cases only few
of them have an event registered (like the "cpu" PMU in the presence of
the watchdog). As the mutex is already held, it's safe to just check for
existing events before doing the cross CPU call.

This change saves tens of milliseconds from kexec time (perceived as
steal time during a hypervisor host update), with <2ms remaining for
this step in the shutdown. There might be additional potential for
parallelization or we could just disable performance monitoring during
the actual shutdown and be less graceful about it.

Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2026-01-13 21:39:01 +01:00
Imran Khan dd9f6d30c6 genirq/cpuhotplug: Notify about affinity changes breaking the affinity mask
During CPU offlining the interrupts affined to that CPU are moved to other
online CPUs, which might break the original affinity mask if the outgoing
CPU was the last online CPU in that mask. This change is not propagated to
irq_desc::affinity_notify(), which leaves users of the affinity notifier
mechanism with stale information.

Avoid this by scheduling affinity change notification work for interrupts
that were affined to the CPU being offlined, if the new target CPU is not
part of the original affinity mask.

Since irq_set_affinity_locked() uses the same logic to schedule affinity
change notification work, split out this logic into a dedicated function
and use that at both places.

[ tglx: Removed the EXPORT(), removed the !SMP stub, moved the prototype,
  	added a lockdep assert instead of a comment, fixed up coding style
  	and name space. Polished and clarified the change log ]

Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260113143727.1041265-1-imran.f.khan@oracle.com
2026-01-13 21:18:16 +01:00
Al Viro 47b3b9bf93 simplify the callers of file_open_name()
It accepts ERR_PTR() for name and does the right thing in that case.
That allows to simplify the logics in callers, making them trivial
to switch to CLASS(filename).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2026-01-13 15:18:08 -05:00
Al Viro 741c97fecb struct filename ->refcnt doesn't need to be atomic
... or visible outside of audit, really.  Note that references
held in delayed_filename always have refcount 1, and from the
moment of complete_getname() or equivalent point in getname...()
there won't be any references to struct filename instance left
in places visible to other threads.

Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2026-01-13 15:18:07 -05:00
Al Viro 41670a5900 get rid of audit_reusename()
Originally we tried to avoid multiple insertions into audit names array
during retry loop by a cute hack - memorize the userland pointer and
if there already is a match, just grab an extra reference to it.

Cute as it had been, it had problems - two identical pointers had
audit aux entries merged, two identical strings did not.  Having
different behaviour for syscalls that differ only by addresses of
otherwise identical string arguments is obviously wrong - if nothing
else, compiler can decide to merge identical string literals.

Besides, this hack does nothing for non-audited processes - they get
a fresh copy for retry.  It's not time-critical, but having behaviour
subtly differ that way is bogus.

These days we have very few places that import filename more than once
(9 functions total) and it's easy to massage them so we get rid of all
re-imports.  With that done, we don't need audit_reusename() anymore.
There's no need to memorize userland pointer either.

Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2026-01-13 15:16:44 -05:00
Song Chen c9c9f6bf7f bpf: Remove an unused parameter in check_func_proto
The func_id parameter is not needed in check_func_proto.
This patch removes it.

Signed-off-by: Song Chen <chensong_2000@189.cn>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260105155009.4581-1-chensong_2000@189.cn
2026-01-13 10:00:15 -08:00
Alexei Starovoitov bffacdb80b bpf: Recognize special arithmetic shift in the verifier
cilium bpf_wiregard.bpf.c when compiled with -O1 fails to load
with the following verifier log:

192: (79) r2 = *(u64 *)(r10 -304)     ; R2=pkt(r=40) R10=fp0 fp-304=pkt(r=40)
...
227: (85) call bpf_skb_store_bytes#9          ; R0=scalar()
228: (bc) w2 = w0                     ; R0=scalar() R2=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff))
229: (c4) w2 s>>= 31                  ; R2=scalar(smin=0,smax=umax=0xffffffff,smin32=-1,smax32=0,var_off=(0x0; 0xffffffff))
230: (54) w2 &= -134                  ; R2=scalar(smin=0,smax=umax=umax32=0xffffff7a,smax32=0x7fffff7a,var_off=(0x0; 0xffffff7a))
...
232: (66) if w2 s> 0xffffffff goto pc+125     ; R2=scalar(smin=umin=umin32=0x80000000,smax=umax=umax32=0xffffff7a,smax32=-134,var_off=(0x80000000; 0x7fffff7a))
...
238: (79) r4 = *(u64 *)(r10 -304)     ; R4=scalar() R10=fp0 fp-304=scalar()
239: (56) if w2 != 0xffffff78 goto pc+210     ; R2=0xffffff78 // -136
...
258: (71) r1 = *(u8 *)(r4 +0)
R4 invalid mem access 'scalar'

The error might confuse most bpf authors, since fp-304 slot had 'pkt'
pointer at insn 192 and became 'scalar' at 238. That happened because
bpf_skb_store_bytes() clears all packet pointers including those in
the stack. On the first glance it might look like a bug in the source
code, since ctx->data pointer should have been reloaded after the call
to bpf_skb_store_bytes().

The relevant part of cilium source code looks like this:

// bpf/lib/nodeport.h
int dsr_set_ipip6()
{
	if (ctx_adjust_hroom(...))
		return DROP_INVALID; // -134
	if (ctx_store_bytes(...))
		return DROP_WRITE_ERROR; // -141
	return 0;
}

bool dsr_fail_needs_reply(int code)
{
	if (code == DROP_FRAG_NEEDED) // -136
		return true;
	return false;
}

tail_nodeport_ipv6_dsr()
{
	ret = dsr_set_ipip6(...);
	if (!IS_ERR(ret)) {
		...
	} else {
		if (dsr_fail_needs_reply(ret))
			return dsr_reply_icmp6(...);
	}
}

The code doesn't have arithmetic shift by 31 and it reloads ctx->data
every time it needs to access it. So it's not a bug in the source code.

The reason is DAGCombiner::foldSelectCCToShiftAnd() LLVM transformation:

  // If this is a select where the false operand is zero and the compare is a
  // check of the sign bit, see if we can perform the "gzip trick":
  // select_cc setlt X, 0, A, 0 -> and (sra X, size(X)-1), A
  // select_cc setgt X, 0, A, 0 -> and (not (sra X, size(X)-1)), A

The conditional branch in dsr_set_ipip6() and its return values
are optimized into BPF_ARSH plus BPF_AND:

227: (85) call bpf_skb_store_bytes#9
228: (bc) w2 = w0
229: (c4) w2 s>>= 31   ; R2=scalar(smin=0,smax=umax=0xffffffff,smin32=-1,smax32=0,var_off=(0x0; 0xffffffff))
230: (54) w2 &= -134   ; R2=scalar(smin=0,smax=umax=umax32=0xffffff7a,smax32=0x7fffff7a,var_off=(0x0; 0xffffff7a))

after insn 230 the register w2 can only be 0 or -134,
but the verifier approximates it, since there is no way to
represent two scalars in bpf_reg_state.
After fallthough at insn 232 the w2 can only be -134,
hence the branch at insn
239: (56) if w2 != -136 goto pc+210
should be always taken, and trapping insn 258 should never execute.
LLVM generated correct code, but the verifier follows impossible
path and rejects valid program. To fix this issue recognize this
special LLVM optimization and fork the verifier state.
So after insn 229: (c4) w2 s>>= 31
the verifier has two states to explore:
one with w2 = 0 and another with w2 = 0xffffffff
which makes the verifier accept bpf_wiregard.c

A similar pattern exists were OR operation is used in place of the AND
operation, the verifier detects that pattern as well by forking the
state before the OR operation with a scalar in range [-1,0].

Note there are 20+ such patterns in bpf_wiregard.o compiled
with -O1 and -O2, but they're rarely seen in other production
bpf programs, so push_stack() approach is not a concern.

Reported-by: Hao Sun <sunhao.th@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Co-developed-by: Puranjay Mohan <puranjay@kernel.org>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20260112201424.816836-2-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-13 09:33:38 -08:00
Mykyta Yatsenko 7af3339948 bpf: Consistently use reg_state() for register access in the verifier
Replace the pattern of declaring a local regs array from cur_regs()
and then indexing into it with the more concise reg_state() helper.
This simplifies the code by eliminating intermediate variables and
makes register access more consistent throughout the verifier.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20260113134826.2214860-1-mykyta.yatsenko5@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-13 09:31:17 -08:00
Thomas Gleixner 3db5306b0b time/sched_clock: Use ACCESS_PRIVATE() to evaluate hrtimer::function
This dereference of sched_clock_timer::function was missed when the
hrtimer callback function pointer was marked private.

Fixes: 04257da0c9 ("hrtimers: Make callback function pointer private")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/875x95jw7q.ffs@tglx
Closes: https://lore.kernel.org/oe-kbuild-all/202601131713.KsxhXQ0M-lkp@intel.com/
2026-01-13 18:08:57 +01:00
Gabriele Monaco 6c125b85f3 sched: Export hidden tracepoints to modules
The tracepoints sched_entry, sched_exit and sched_set_need_resched
are not exported to tracefs as trace events, this allows only kernel
code to access them. Helper modules like [1] can be used to still have
the tracepoints available to ftrace for debugging purposes, but they do
rely on the tracepoints being exported.

Export the 3 not exported tracepoints.
Note that sched_set_state is already exported as the macro is called
from modules.

[1] - https://github.com/qais-yousef/sched_tp.git

Fixes: adcc3bfa88 ("sched: Adapt sched tracepoints for RV task model")
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Link: https://patch.msgid.link/20251205131621.135513-9-gmonaco@redhat.com
2026-01-13 11:37:53 +01:00
Gabriele Monaco ca1e8eede4 sched/deadline: Fix server stopping with runnable tasks
The deadline server can currently stop due to idle although fair tasks
are runnable. This happens essentially when:

 * the server is set to idle, a task wakes up, the server stops
 * a task wakes up, the server sets itself to idle and stops right away

Address both cases by clearing the server idle flag whenever a fair task
wakes up and accounting also for pending tasks in the definition of idle.

Fixes: f5a538c07d ("sched/deadline: Fix dl_server stop condition")
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260113085159.114226-3-gmonaco@redhat.com
2026-01-13 11:37:52 +01:00
Peter Zijlstra 1e0a2ba7af sched: Provide idle_rq() helper
A fix for the dl_server 'requires' idle_cpu() usage, which made me
note that it and available_idle_cpu() are extern function calls.

And while idle_cpu() is used outside of kernel/sched/,
available_idle_cpu() is not.

This makes it hard to make idle_cpu() an inline helper, so provide
idle_rq() and implement idle_cpu() and available_idle_cpu() using
that.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2026-01-13 11:37:52 +01:00
Pingfan Liu 64e6fa7661 sched/deadline: Fix potential race in dl_add_task_root_domain()
The access rule for local_cpu_mask_dl requires it to be called on the
local CPU with preemption disabled. However, dl_add_task_root_domain()
currently violates this rule.

Without preemption disabled, the following race can occur:

1. ThreadA calls dl_add_task_root_domain() on CPU 0
2. Gets pointer to CPU 0's local_cpu_mask_dl
3. ThreadA is preempted and migrated to CPU 1
4. ThreadA continues using CPU 0's local_cpu_mask_dl
5. Meanwhile, the scheduler on CPU 0 calls find_later_rq() which also
   uses local_cpu_mask_dl (with preemption properly disabled)
6. Both contexts now corrupt the same per-CPU buffer concurrently

Fix this by moving the local_cpu_mask_dl access to the preemption
disabled section.

Closes: https://lore.kernel.org/lkml/aSBjm3mN_uIy64nz@jlelli-thinkpadt14gen4.remote.csb
Fixes: 318e18ed22 ("sched/deadline: Walk up cpuset hierarchy to decide root domain when hot-unplug")
Reported-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Pingfan Liu <piliu@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Waiman Long <longman@redhat.com>
Link: https://patch.msgid.link/20251125032630.8746-3-piliu@redhat.com
2026-01-13 11:37:51 +01:00
Pingfan Liu 479972efc2 sched/deadline: Remove unnecessary comment in dl_add_task_root_domain()
The comments above dl_get_task_effective_cpus() and
dl_add_task_root_domain() already explain how to fetch a valid
root domain and protect against races. There's no need to repeat
this inside dl_add_task_root_domain(). Remove the redundant comment
to keep the code clean.

No functional change.

Signed-off-by: Pingfan Liu <piliu@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Waiman Long <longman@redhat.com>
Link: https://patch.msgid.link/20251125032630.8746-2-piliu@redhat.com
2026-01-13 11:37:51 +01:00
Thomas Weißschuh ae4535b0d9 hrtimer: Drop _tv64() helpers
Since ktime_t has become an alias to s64, these helpers are unnecessary.

Migrate the few remaining users to the regular helpers and remove the
now dead code.

Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260107-hrtimer-header-cleanup-v1-3-1a698ef0ddae@linutronix.de
2026-01-13 11:05:49 +01:00
Thomas Weißschuh 84663a5ad6 hrtimer: Remove public definition of HIGH_RES_NSEC
This constant is only used in a single place and is has a very generic
name polluting the global namespace.

Move the constant closer to its only user.

Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260107-hrtimer-header-cleanup-v1-2-1a698ef0ddae@linutronix.de
2026-01-13 11:05:48 +01:00
Thomas Weißschuh 05dc4a9fc8 hrtimer: Fix softirq base check in update_needs_ipi()
The 'clockid' field is not the correct way to check for a softirq base.

Fix the check to correctly compare the base type instead of the clockid.

Fixes: 1e7f7fbcd4 ("hrtimer: Avoid more SMP function calls in clock_was_set()")
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260107-hrtimer-clock-base-check-v1-1-afb5dbce94a1@linutronix.de
2026-01-13 11:04:41 +01:00
Luigi Rizzo fb11a2493e genirq: Move clear of kstat_irqs to free_desc()
desc_set_defaults() has a loop to clear the per-cpu counters kstats_irq.

This is only needed in free_desc(), which is used with non-sparse IRQs so
that the interrupt descriptor can be recycled. For newly allocated
descriptors, the memory comes from alloc_percpu() and is already zeroed
out.

Move the loop to free_desc() to avoid wasting time unnecessarily.

Signed-off-by: Luigi Rizzo <lrizzo@google.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260112083234.2665832-1-lrizzo@google.com
2026-01-13 10:16:29 +01:00
Radu Rendec df439718af genirq: Update effective affinity for redirected interrupts
For redirected interrupts, irq_chip_redirect_set_affinity() does not
update the effective affinity mask, which then triggers the warning in
irq_validate_effective_affinity(). Also, because the effective affinity
mask is empty, the cpumask_test_cpu(smp_processor_id(), m) condition in
demux_redirect_remote() is always false, and the interrupt is always
redirected, even if it's already running on the target CPU.

Set the effective affinity mask to be the same as the requested affinity
mask. It's worth noting that irq_do_set_affinity() filters out offline
CPUs before calling chip->irq_set_affinity() (unless `force` is set), so
the mask passed to irq_chip_redirect_set_affinity() is already filtered.

The solution is not ideal because it may lie about the effective
affinity of the demultiplexed ("child") interrupt. If the requested
affinity mask includes multiple CPUs, the effective affinity, in
reality, is the intersection between the requested mask and the
demultiplexing ("parent") interrupt's effective affinity mask, plus
the first CPU in the requested mask.

Accurately describing the effective affinity of the demultiplexed
interrupt is not trivial because it requires keeping track of the
demultiplexing interrupt's effective affinity. That is tricky in the
context of CPU hot(un)plugging, where interrupt migration ordering is
not guaranteed. The solution in the initial version of the fixed patch,
which stored the first CPU of the demultiplexing interrupt's effective
affinity in the `target_cpu` field, has its own drawbacks and
limitations.

Fixes: fcc1d0dabd ("genirq: Add interrupt redirection infrastructure")
Reported-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Radu Rendec <rrendec@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: Jon Hunter <jonathanh@nvidia.com>
Link: https://patch.msgid.link/20260112211402.2927336-1-rrendec@redhat.com
Closes: https://lore.kernel.org/all/44509520-f29b-4b8a-8986-5eae3e022eb7@nvidia.com/
2026-01-13 09:59:28 +01:00
Sebastian Andrzej Siewior aef30c8d56 genirq: Warn about using IRQF_ONESHOT without a threaded handler
IRQF_ONESHOT disables the interrupt source until after the threaded
handler completed its work. This is needed to allow the threaded handler
to run - otherwise the CPU will get back to the interrupt handler
because the interrupt source remains active and the threaded handler
will not able to do its work.

Specifying IRQF_ONESHOT without a threaded handler does not make sense.
It could be a leftover if the handler _was_ threaded and changed back to
primary and the flag was not removed. This can be problematic in the
`threadirqs' case because the handler is exempt from forced-threading.
This in turn can become a problem on a PREEMPT_RT system if the handler
attempts to acquire sleeping locks.

Warn about missing threaded handlers with the IRQF_ONESHOT flag.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Reviewed-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Link: https://patch.msgid.link/20260112134013.eQWyReHR@linutronix.de
2026-01-13 09:56:25 +01:00
Sami Tolvanen 99fde4d062 bpf, btf: Enforce destructor kfunc type with CFI
Ensure that registered destructor kfuncs have the same type
as btf_dtor_kfunc_t to avoid a kernel panic on systems with
CONFIG_CFI enabled.

Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20260110082548.113748-10-samitolvanen@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-12 18:53:57 -08:00
Sami Tolvanen b40a5d724f bpf: crypto: Use the correct destructor kfunc type
With CONFIG_CFI enabled, the kernel strictly enforces that indirect
function calls use a function pointer type that matches the target
function. I ran into the following type mismatch when running BPF
self-tests:

  CFI failure at bpf_obj_free_fields+0x190/0x238 (target:
    bpf_crypto_ctx_release+0x0/0x94; expected type: 0xa488ebfc)
  Internal error: Oops - CFI: 00000000f2008228 [#1]  SMP
  ...

As bpf_crypto_ctx_release() is also used in BPF programs and using
a void pointer as the argument would make the verifier unhappy, add
a simple stub function with the correct type and register it as the
destructor kfunc instead.

Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Tested-by: Viktor Malik <vmalik@redhat.com>
Link: https://lore.kernel.org/r/20260110082548.113748-7-samitolvanen@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-12 18:53:57 -08:00
Linus Torvalds b71e635fee cgroup: Fixes for v6.19-rc5
- Fix -Wflex-array-member-not-at-end warnings in cgroup_root.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaWVQhw4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGasOAQDmAej2XKpvd7zHxExKJ8xLYUc5dkqg00FLq86r
 iyikqAD9HYZNcTYE0pvqkgzpYbypQnMWs5F65tSjqJTyJNnL/Qw=
 =vI2H
 -----END PGP SIGNATURE-----

Merge tag 'cgroup-for-6.19-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup fix from Tejun Heo:

 - Fix -Wflex-array-member-not-at-end warnings in cgroup_root

* tag 'cgroup-for-6.19-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: Eliminate cgrp_ancestor_storage in cgroup_root
2026-01-12 09:56:17 -10:00
Zhao Mengmeng 090e0ae303 cpuset: replace direct lockdep_assert_held() with lockdep_assert_cpuset_lock_held()
We already added lockdep_assert_cpuset_lock_held(), use this new function
to keep consistency.

Signed-off-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-01-12 09:27:58 -10:00
Waiman Long 272bd81833 cgroup/cpuset: Move the v1 empty cpus/mems check to cpuset1_validate_change()
As stated in commit 1c09b195d3 ("cpuset: fix a regression in validating
config change"), it is not allowed to clear masks of a cpuset if
there're tasks in it. This is specific to v1 since empty "cpuset.cpus"
or "cpuset.mems" will cause the v2 cpuset to inherit the effective CPUs
or memory nodes from its parent. So it is OK to have empty cpus or mems
even if there are tasks in the cpuset.

Move this empty cpus/mems check in validate_change() to
cpuset1_validate_change() to allow more flexibility in setting
cpus or mems in v2. cpuset_is_populated() needs to be moved into
cpuset-internal.h as it is needed by the empty cpus/mems checking code.

Also add a test case to test_cpuset_prs.sh to verify that.

Reported-by: Chen Ridong <chenridong@huaweicloud.com>
Closes: https://lore.kernel.org/lkml/7a3ec392-2e86-4693-aa9f-1e668a668b9c@huaweicloud.com/
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-01-12 09:03:22 -10:00
Waiman Long 2a3602030d cgroup/cpuset: Don't invalidate sibling partitions on cpuset.cpus conflict
Currently, when setting a cpuset's cpuset.cpus to a value that conflicts
with the cpuset.cpus/cpuset.cpus.exclusive of a sibling partition,
the sibling's partition state becomes invalid. This is overly harsh and
is probably not necessary.

The cpuset.cpus.exclusive control file, if set, will override the
cpuset.cpus of the same cpuset when creating a cpuset partition.
So cpuset.cpus has less priority than cpuset.cpus.exclusive in setting up
a partition.  However, it cannot override a conflicting cpuset.cpus file
in a sibling cpuset and the partition creation process will fail. This
is inconsistent.  That will also make using cpuset.cpus.exclusive less
valuable as a tool to set up cpuset partitions as the users have to
check if such a cpuset.cpus conflict exists or not.

Fix these problems by making sure that once a cpuset.cpus.exclusive
is set without failure, it will always be allowed to form a valid
partition as long as at least one CPU can be granted from its parent
irrespective of the state of the siblings' cpuset.cpus values. Of
course, setting cpuset.cpus.exclusive will fail if it conflicts with
the cpuset.cpus.exclusive or the cpuset.cpus.exclusive.effective value
of a sibling.

Partition can still be created by setting only cpuset.cpus without
setting cpuset.cpus.exclusive. However, any conflicting CPUs in sibling's
cpuset.cpus.exclusive.effective and cpuset.cpus.exclusive values will
be removed from its cpuset.cpus.exclusive.effective as long as there
is still one or more CPUs left and can be granted from its parent. This
CPU stripping is currently done in rm_siblings_excl_cpus().

The new code will now try its best to enable the creation of new
partitions with only cpuset.cpus set without invalidating existing ones.
However it is not guaranteed that all the CPUs requested in cpuset.cpus
will be used in the new partition even when all these CPUs can be
granted from the parent.

This is similar to the fact that cpuset.cpus.effective may not be
able to include all the CPUs requested in cpuset.cpus. In this case,
the parent may not able to grant all the exclusive CPUs requested in
cpuset.cpus to cpuset.cpus.exclusive.effective if some of them have
already been granted to other partitions earlier.

With the creation of multiple sibling partitions by setting
only cpuset.cpus, this does have the side effect that their exact
cpuset.cpus.exclusive.effective settings will depend on the order of
partition creation if there are conflicts. Due to the exclusive nature
of the CPUs in a partition, it is not easy to make it fair other than
the old behavior of invalidating all the conflicting partitions.

For example,
  # echo "0-2" > A1/cpuset.cpus
  # echo "root" > A1/cpuset.cpus.partition
  # cat A1/cpuset.cpus.partition
  root
  # cat A1/cpuset.cpus.exclusive.effective
  0-2
  # echo "2-4" > B1/cpuset.cpus
  # echo "root" > B1/cpuset.cpus.partition
  # cat B1/cpuset.cpus.partition
  root
  # cat B1/cpuset.cpus.exclusive.effective
  3-4
  # cat B1/cpuset.cpus.effective
  3-4

For users who want to be sure that they can get most of the CPUs they
want, cpuset.cpus.exclusive should be used instead if they can set
it successfully without failure. Setting cpuset.cpus.exclusive will
guarantee that sibling conflicts from then onward is no longer possible.

To make this change, we have to separate out the is_cpu_exclusive()
check in cpus_excl_conflict() into a cgroup v1 only
cpuset1_cpus_excl_conflict() helper. The cpus_allowed_validate_change()
helper is now no longer needed and can be removed.

Some existing tests in test_cpuset_prs.sh are updated and new ones are
added to reflect the new behavior. The cgroup-v2.rst doc file is also
updated the clarify what exclusive CPUs will be used when a partition
is created.

Reported-by: Sun Shaojie <sunshaojie@kylinos.cn>
Closes: https://lore.kernel.org/lkml/20251117015708.977585-1-sunshaojie@kylinos.cn/
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-01-12 09:02:46 -10:00
Waiman Long 6e6f13f6d5 cgroup/cpuset: Don't fail cpuset.cpus change in v2
Commit fe8cd2736e ("cgroup/cpuset: Delay setting of CS_CPU_EXCLUSIVE
until valid partition") introduced a new check to disallow the setting
of a new cpuset.cpus.exclusive value that is a superset of a sibling's
cpuset.cpus value so that there will at least be one CPU left in the
sibling in case the cpuset becomes a valid partition root. This new
check does have the side effect of failing a cpuset.cpus change that
make it a subset of a sibling's cpuset.cpus.exclusive value.

With v2, users are supposed to be allowed to set whatever value they
want in cpuset.cpus without failure. To maintain this rule, the check
is now restricted to only when cpuset.cpus.exclusive is being changed
not when cpuset.cpus is changed.

The cgroup-v2.rst doc file is also updated to reflect this change.

Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-01-12 09:02:14 -10:00
Waiman Long a1a01793ae cgroup/cpuset: Consistently compute effective_xcpus in update_cpumasks_hier()
Since commit f62a5d3936 ("cgroup/cpuset: Remove remote_partition_check()
& make update_cpumasks_hier() handle remote partition"), the
compute_effective_exclusive_cpumask() helper was extended to
strip exclusive CPUs from siblings when computing effective_xcpus
(cpuset.cpus.exclusive.effective). This helper was later renamed to
compute_excpus() in commit 86bbbd1f33 ("cpuset: Refactor exclusive
CPU mask computation logic").

This helper is supposed to be used consistently to compute
effective_xcpus. However, there is an exception within the callback
critical section in update_cpumasks_hier() when exclusive_cpus of a
valid partition root is empty. This can cause effective_xcpus value to
differ depending on where exactly it is last computed. Fix this by using
compute_excpus() in this case to give a consistent result.

Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-01-12 09:01:44 -10:00
Waiman Long 18bc2425a8 cgroup/cpuset: Streamline rm_siblings_excl_cpus()
If exclusive_cpus is set, effective_xcpus must be a subset of
exclusive_cpus. Currently, rm_siblings_excl_cpus() checks both
exclusive_cpus and effective_xcpus consecutively. It is simpler
to check only exclusive_cpus if non-empty or just effective_xcpus
otherwise.

No functional change is expected.

Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-01-12 09:01:40 -10:00
Rafael J. Wysocki 2a7151942e Merge back material related to system sleep for 6.20 2026-01-12 19:37:19 +01:00
Juergen Gross e6b2aa6d40 sched: Move clock related paravirt code to kernel/sched
Paravirt clock related functions are available in multiple archs.

In order to share the common parts, move the common static keys
to kernel/sched/ and remove them from the arch specific files.

Make a common paravirt_steal_clock() implementation available in
kernel/sched/cputime.c, guarding it with a new config option
CONFIG_HAVE_PV_STEAL_CLOCK_GEN, which can be selected by an arch
in case it wants to use that common variant.

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260105110520.21356-7-jgross@suse.com
2026-01-12 15:39:14 +01:00
Juergen Gross 68b10fd40d paravirt: Remove asm/paravirt_api_clock.h
All architectures supporting CONFIG_PARAVIRT share the same contents
of asm/paravirt_api_clock.h:

  #include <asm/paravirt.h>

So remove all incarnations of asm/paravirt_api_clock.h and remove the
only place where it is included, as there asm/paravirt.h is included
anyway.

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com> # powerpc, scheduler bits
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260105110520.21356-6-jgross@suse.com
2026-01-12 15:34:33 +01:00
Gabriele Monaco 3fee5b320c verification/rvgen: Remove unused variable declaration from containers
The monitor container source files contained a declaration and a
definition for the rv_monitor variable. The former is superfluous and
can be removed.

Remove the variable declaration from the template as well as the
existing monitor containers.

Reviewed-by: Nam Cao <namcao@linutronix.de>
Link: https://lore.kernel.org/r/20251126104241.291258-9-gmonaco@redhat.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
2026-01-12 07:43:51 +01:00
Gabriele Monaco 3d2bfeeef3 verification/dot2c: Remove superfluous enum assignment and add last comma
The header files generated by dot2c currently create enums for states
and events assigning the first element to 0. This is superfluous as it
happens automatically if no value is specified.
Also it doesn't add a comma to the last enum elements, which slightly
complicates the diff if states or events are added.

Remove the assignment to 0 and add a comma to last elements, this
simplifies the logic for the code generator.

Reviewed-by: Nam Cao <namcao@linutronix.de>
Link: https://lore.kernel.org/r/20251126104241.291258-8-gmonaco@redhat.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
2026-01-12 07:43:50 +01:00
Gabriele Monaco 30984ccf31 rv: Refactor da_monitor to minimise macros
The da_monitor helper functions are generated from macros of the type:

DECLARE_DA_FUNCTION(name, type) \
static void da_func_x_##name(type arg) {} \
static void da_func_y_##name(type arg) {} \

This is good to minimise code duplication but the long macros made of
skipped end of lines is rather hard to parse. Since functions are
static, the advantage of naming them differently for each monitor is
minimal.

Refactor the da_monitor.h file to minimise macros, instead of declaring
functions from macros, we simply declare them with the same name for all
monitors (e.g. da_func_x) and for any remaining reference to the monitor
name (e.g. tracepoints, enums, global variables) we use the CONCATENATE
macro.
In this way the file is much easier to maintain while keeping the same
generality.
Functions depending on the monitor types are now conditionally compiled
according to the value of RV_MON_TYPE, which must be defined in the
monitor source.
The monitor type can be specified as in the original implementation,
although it's best to keep the default implementation (unsigned char) as
not all parts of code support larger data types, and likely there's no
need.

We keep the empty macro definitions to ease review of this change with
diff tools, but cleanup is required.

Also adapt existing monitors to keep the build working.

Reviewed-by: Nam Cao <namcao@linutronix.de>
Link: https://lore.kernel.org/r/20251126104241.291258-2-gmonaco@redhat.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
2026-01-12 07:43:49 +01:00
Linus Torvalds fac4bdbaca Fix a crash in sched_mm_cid_after_execve().
Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmljevkRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1iKTw//Y9jfjB5VrESjnQumkJeV9A6v/ueVMEhF
 lA6ktDOKJwUjIBqqTxDuMmbuTUVg1AEmfUX8GCv2DqS2ngXsbcJmR++3E/S9vWz9
 QvChizDNrw03fq7WDryPK/0Pk04VJsqFcpR89IlMPAOkygVZ06T0+JTSo9BVqMYb
 xs4B1TlP6JUz0AtIRKqIGVVt57Zx8kNoGHHhQIJ6ziZA30Srh1/n1Y+B3iJxRklO
 EalGmNQECVobjyYaGqNB1Z73fMmJIgOkZeUQF6TB+KWYPc9DelRTUS2JH9bBhh4O
 h6Pau9y45YHu5j5qlQxhnw8CmGVjPHYhMAUsuxje1AK4mvgg+ki4lqqWSjhtiuFr
 0TX4+Z1ZSF1AB0dD0CtunqZ1K++202LY0uzQ2sCHhS3hNzWOP6vYR+qvZ0fLBfxI
 trWsz9KUrM07POww6S3nNrGKqyEUwZbmaywX0WOTwQzPhCsU9gqwK/SwSoYEkzhs
 TEG2xjHeVmqRmJLK7WBKXxBwz6e/hd1YiCi+hSY7PYnCvly6vZOKZA8CuKueF495
 Th+IUX5obCNxAy0ZKrOclpGrfE0vl7JBpye6pnYDCvtCKLmF29rOnNDBo1SZvwsh
 xlu61B7ezdrNp9JneErUZbJRrQL+PQlCem3MF5UmMgRUfqwUP1s1tOjooMOfPE6o
 Jv9UIQO8MGs=
 =jEMw
 -----END PGP SIGNATURE-----

Merge tag 'sched-urgent-2026-01-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fix from Ingo Molnar:
 "Fix a crash in sched_mm_cid_after_execve()"

* tag 'sched-urgent-2026-01-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/mm_cid: Prevent NULL mm dereference in sched_mm_cid_after_execve()
2026-01-11 07:11:53 -10:00
Linus Torvalds fe948326e9 Fix perf swevent hrtimer deinit regression.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmljeiYRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1g9XhAAthSFL6Tl+fBLFYjwAFCeGUyfB5lqewT0
 z6Sf7s+TOrOCn/3zP4JoDbuV+golsazcKk/9e3VnzccS9F9VdryML1z01O8mz4ub
 yHTX53mWRu98d+o+B+IwVOlxW+yeyydyHXQu3Be5mStfdla3i6K95Xu314OQKH4F
 zSDO1X67zSDdUvT8wSDlhXh9tM3Pd+kFDAEzGAY7n/n/6PJMflSVid/qCOVEInaV
 WRZoHhZnQUNBuorPLKwhVcduHMS0fh1E5lFfN3WdiQCRBl8P1YXutN4KigS95M7x
 ymlNhzuQwh7doCursDL/Rbc9hIRviZsMSTtmHUMWf/rlthvtbh4knct+noNSF+bZ
 VTy6SAHNq0mgvQM1vBmWKqUIgy1ekpfainUeUXJ8cKNoKnUIo9+Y7cH1PwBCvfIx
 R1vgK9qb2SpfqisNBxMzCiovfXQ657ZNNAdRzWOUErCxDKalo+rZ+e5a5etetzVY
 YMAJfsjnwmiPDwXhWjBUqYS5gGbh9cHE5KGB6qI6UJQ3odxd3H6/nRLEcyweMMs2
 kxFqHaKNnIuwsGl+VAlgwTuUbaR7feVzcvV7yGFnTMclP7cXjVgJRSpXcbMd4FrQ
 sB03TuqBIpMAVrZ/kbBWJh43BPn+nH9dXlTGaWv+AAe5uq4a3two7u4XIibnoICg
 MG6uXO1xWtk=
 =XNWD
 -----END PGP SIGNATURE-----

Merge tag 'perf-urgent-2026-01-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf event fix from Ingo Molnar:
 "Fix perf swevent hrtimer deinit regression"

* tag 'perf-urgent-2026-01-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf: Ensure swevent hrtimer is properly destroyed
2026-01-11 06:55:27 -10:00
Thomas Gleixner 2e4b28c48f treewide: Update email address
In a vain attempt to consolidate the email zoo switch everything to the
kernel.org account.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-01-11 06:09:11 -10:00
Boqun Feng fe1d482884 Merge branch 'rcu-misc.20260111a'
* rcu-misc.20260111a:
  rcu: Reduce synchronize_rcu() latency by reporting GP kthread's CPU QS early
  srcu: Use suitable gfp_flags for the init_srcu_struct_nodes()
  rcu: Fix rcu_read_unlock() deadloop due to softirq
  rcutorture: Correctly compute probability to invoke ->exp_current()
  rcu: Make expedited RCU CPU stall warnings detect stall-end races
2026-01-11 20:15:07 +08:00
Joel Fernandes bc3705e209 rcu: Reduce synchronize_rcu() latency by reporting GP kthread's CPU QS early
The RCU grace period mechanism uses a two-phase FQS (Force Quiescent
State) design where the first FQS saves dyntick-idle snapshots and
the second FQS compares them. This results in long and unnecessary latency
for synchronize_rcu() on idle systems (two FQS waits of ~3ms each with
1000HZ) whenever one FQS wait sufficed.

Some investigations showed that the GP kthread's CPU is the holdout CPU
a lot of times after the first FQS as - it cannot be detected as "idle"
because it's actively running the FQS scan in the GP kthread.

Therefore, at the end of rcu_gp_init(), immediately report a quiescent
state for the GP kthread's CPU using rcu_qs() + rcu_report_qs_rdp(). The
GP kthread cannot be in an RCU read-side critical section while running
GP initialization, so this is safe and results in significant latency
improvements.

The following tests were performed:

(1) synchronize_rcu() benchmarking

    100 synchronize_rcu() calls with 32 CPUs, 10 runs each (default fqs
    jiffies settings):

    Baseline (without fix):
    | Run | Mean      | Min      | Max       |
    |-----|-----------|----------|-----------|
    | 1   | 10.088 ms | 9.989 ms | 18.848 ms |
    | 2   | 10.064 ms | 9.982 ms | 16.470 ms |
    | 3   | 10.051 ms | 9.988 ms | 15.113 ms |
    | 4   | 10.125 ms | 9.929 ms | 22.411 ms |
    | 5   |  8.695 ms | 5.996 ms | 15.471 ms |
    | 6   | 10.157 ms | 9.977 ms | 25.723 ms |
    | 7   | 10.102 ms | 9.990 ms | 20.224 ms |
    | 8   |  8.050 ms | 5.985 ms | 10.007 ms |
    | 9   | 10.059 ms | 9.978 ms | 15.934 ms |
    | 10  | 10.077 ms | 9.984 ms | 17.703 ms |

    With fix:
    | Run | Mean     | Min      | Max       |
    |-----|----------|----------|-----------|
    | 1   | 6.027 ms | 5.915 ms |  8.589 ms |
    | 2   | 6.032 ms | 5.984 ms |  9.241 ms |
    | 3   | 6.010 ms | 5.986 ms |  7.004 ms |
    | 4   | 6.076 ms | 5.993 ms | 10.001 ms |
    | 5   | 6.084 ms | 5.893 ms | 10.250 ms |
    | 6   | 6.034 ms | 5.908 ms |  9.456 ms |
    | 7   | 6.051 ms | 5.993 ms | 10.000 ms |
    | 8   | 6.057 ms | 5.941 ms | 10.001 ms |
    | 9   | 6.016 ms | 5.927 ms |  7.540 ms |
    | 10  | 6.036 ms | 5.993 ms |  9.579 ms |

    Summary:
    - Mean latency: 9.75 ms -> 6.04 ms (38% improvement)
    - Max latency:  25.72 ms -> 10.25 ms (60% improvement)

(2) Bridge setup/teardown latency (Uladzislau Rezki)

    x86_64 with 64 CPUs, 100 iterations of bridge add/configure/delete:

                                   real time
    1 - default:                   24.221s
    2 - this patch:                20.754s  (14% faster)
    3 - this patch + wake_from_gp: 15.895s  (34% faster)
    4 - wake_from_gp only:         18.947s  (22% faster)

    Per-synchronize_rcu() latency (in usec):
                  1         2         3       4
    median: 37249.5   31540.5   15765   22480
    min:    7881      7918      9803    7857
    max:    63651     55639     31861   32040

    This patch combined with rcu_normal_wake_from_gp reduces bridge
    setup/teardown time from 24 seconds to 16 seconds.

(3) CPU overhead verification (Uladzislau Rezki)

    System CPU time across 5 runs showed no measurable increase:
      default:     1.698s - 1.937s
      this patch:  1.667s - 1.930s
    Conclusion: variations are within noise, no CPU overhead regression.

(4) rcutorture

    Tested TREE and SRCU configurations - no regressions.

Reviewed-by: "Paul E. McKenney" <paulmck@kernel.org>
Tested-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Samir M <samir@linux.ibm.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2026-01-11 20:11:15 +08:00
Changwoo Min 380ff27af2 PM: EM: Add dump to get-perf-domains in the EM YNL spec
Add dump to get-perf-domains, so that a user can fetch either information
about a specific performance domain with do or information about all
performance domains with dump. Share the reply format of do and dump using
perf-domain-attrs, so remove perf-domains. The YNL spec, autogenerated
files, and the do implementation are updated, and the dump implementation
is added.

Suggested-by: Donald Hunter <donald.hunter@gmail.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Link: https://patch.msgid.link/20260108053212.642478-5-changwoo@igalia.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2026-01-09 21:44:46 +01:00
Changwoo Min d29b900cf4 PM: EM: Change cpus' type from string to u64 array in the EM YNL spec
Previously, the cpus attribute was a string format which was a "%*pb"
stringification of a bitmap. That is not very consumable for a UAPI,
so let’s change it to an u64 array of CPU ids.

Suggested-by: Donald Hunter <donald.hunter@gmail.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Link: https://patch.msgid.link/20260108053212.642478-4-changwoo@igalia.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2026-01-09 21:44:46 +01:00
Changwoo Min caa07a815d PM: EM: Rename em.yaml to dev-energymodel.yaml
The EM YNL specification used many acronyms, including ‘em’, ‘pd’,
‘ps’, etc. While the acronyms are short and convenient, they could be
confusing. So, let’s spell them out to be more specific. The following
changes were made in the spec. Note that the protocol name cannot exceed
GENL_NAMSIZ (16).

  em           -> dev-energymodel
  pds          -> perf-domains
  pd           -> perf-domain
  pd-id        -> perf-domain-id
  pd-table     -> perf-table
  ps           -> perf-state
  get-pds      -> get-perf-domains
  get-pd-table -> get-perf-table
  pd-created   -> perf-domain-created
  pd-updated   -> perf-domain-updated
  pd-deleted   -> perf-domain-deleted

In addition. doc strings were added to the spec. based on the comments in
energy_model.h. Two flag attributes (perf-state-flags and
perf-domain-flags) were added for easily interpreting the bit flags.

Finally, the autogenerated files and em_netlink.c were updated accordingly
to reflect the name changes.

Suggested-by: Donald Hunter <donald.hunter@gmail.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Link: https://patch.msgid.link/20260108053212.642478-3-changwoo@igalia.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2026-01-09 21:44:46 +01:00
Linus Torvalds 81c5ffec9e Power management fix for 6.19-rc5
Fix a crash in the hibernation image saving code that can be triggered
 when the given compression algorithm is unavailable (Malaya Kumar Rout)
 -----BEGIN PGP SIGNATURE-----
 
 iQFGBAABCAAwFiEEcM8Aw/RY0dgsiRUR7l+9nS/U47UFAmlhGicSHHJqd0Byand5
 c29ja2kubmV0AAoJEO5fvZ0v1OO1q1QH/A7sxNcitVM3I6bdoYyVDYpVnKiDQqTk
 ofzJ+eEEh2+5iD7iBoqQBVNLDWL3iOtSCk3u8VzhIoJMwvsmaxSQYqGaOMIHNSKx
 Lfdl+RsMv5LKK05uN9uxLCSIBQJkRhnKI3+AC2TEbSpmLwi0+W15XYHecj1JmyY0
 upsrRUq+XGBr2LQDoWiUmRmPay8+lp8zjoHaAorl7Jm4yr7pHV8kzrztrCOttuUx
 nUtJiCks+Srnm09uIf0ghVgzBTwnmAGYft/xOf+fRqJDy9t91Nf/oJsev28mFCsn
 qRVn/O4i8xAgba9mKiKWmHvzoqqW8omixDFShweeTdr7Lw4hADxMwn8=
 =8zHM
 -----END PGP SIGNATURE-----

Merge tag 'pm-6.19-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fix from Rafael Wysocki:
 "This fixes a crash in the hibernation image saving code that can be
  triggered when the given compression algorithm is unavailable (Malaya
  Kumar Rout)"

* tag 'pm-6.19-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PM: hibernate: Fix crash when freeing invalid crypto compressor
2026-01-09 06:18:05 -10:00
Cong Wang 2bdf777410 sched/mm_cid: Prevent NULL mm dereference in sched_mm_cid_after_execve()
sched_mm_cid_after_execve() is called in bprm_execve()'s cleanup path even
when exec_binprm() fails. For the init task's first execve(), this causes a
problem:

  1. current->mm is NULL (kernel threads don't have an mm)
  2. sched_mm_cid_before_execve() exits early because mm is NULL
  3. exec_binprm() fails (e.g., ENOENT for missing script interpreter)
  4. sched_mm_cid_after_execve() is called with mm still NULL
  5. sched_mm_cid_fork() is called unconditionally, triggering WARN_ON

This is easily reproduced by booting with an init that is a shell script
(#!/bin/sh) where the interpreter doesn't exist in the initramfs.

Fix this by checking if t->mm is NULL before calling sched_mm_cid_fork(),
matching the behavior of sched_mm_cid_before_execve() which already
handles this case via sched_mm_cid_exit()'s early return.

Fixes: b0c3d51b54 ("sched/mmcid: Provide precomputed maximal value")
Signed-off-by: Cong Wang <cwang@multikernel.io>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Will Deacon <will@kernel.org>
Link: https://patch.msgid.link/20251223215113.639686-1-xiyou.wangcong@gmail.com
2026-01-09 13:02:57 +01:00
Malaya Kumar Rout e25348c540 PM: EM: Fix memory leak in em_create_pd() error path
When ida_alloc() fails in em_create_pd(), the function returns without
freeing the previously allocated 'pd' structure, leading to a memory leak.
The 'pd' pointer is allocated either at line 436 (for CPU devices with
cpumask) or line 442 (for other devices) using kzalloc().

Additionally, the function incorrectly returns -ENOMEM when ida_alloc()
fails, ignoring the actual error code returned by ida_alloc(), which can
fail for reasons other than memory exhaustion.

Fix both issues by:
 1. Freeing the 'pd' structure with kfree() when ida_alloc() fails
 2. Returning the actual error code from ida_alloc() instead of -ENOMEM

This ensures proper cleanup on the error path and accurate error reporting.

Fixes: cbe5aeedec ("PM: EM: Assign a unique ID when creating a performance domain")
Signed-off-by: Malaya Kumar Rout <mrout@redhat.com>
Reviewed-by:  Changwoo Min <changwoo@igalia.com>
Link: https://patch.msgid.link/20260105103730.65626-1-mrout@redhat.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2026-01-08 16:55:21 +01:00
Peter Zijlstra 7dadeaa6e8 sched: Further restrict the preemption modes
The introduction of PREEMPT_LAZY was for multiple reasons:

  - PREEMPT_RT suffered from over-scheduling, hurting performance compared to
    !PREEMPT_RT.

  - the introduction of (more) features that rely on preemption; like
    folio_zero_user() which can do large memset() without preemption checks.

    (Xen already had a horrible hack to deal with long running hypercalls)

  - the endless and uncontrolled sprinkling of cond_resched() -- mostly cargo
    cult or in response to poor to replicate workloads.

By moving to a model that is fundamentally preemptable these things become
managable and avoid needing to introduce more horrible hacks.

Since this is a requirement; limit PREEMPT_NONE to architectures that do not
support preemption at all. Further limit PREEMPT_VOLUNTARY to those
architectures that do not yet have PREEMPT_LAZY support (with the eventual goal
to make this the empty set and completely remove voluntary preemption and
cond_resched() -- notably VOLUNTARY is already limited to !ARCH_NO_PREEMPT.)

This leaves up-to-date architectures (arm64, loongarch, powerpc, riscv, s390,
x86) with only two preemption models: full and lazy.

While Lazy has been the recommended setting for a while, not all distributions
have managed to make the switch yet. Force things along. Keep the patch minimal
in case of hard to address regressions that might pop up.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://patch.msgid.link/20251219101502.GB1132199@noisy.programming.kicks-ass.net
2026-01-08 12:43:57 +01:00
Blake Jones 89951fc1f8 sched: Reorder some fields in struct rq
This colocates some hot fields in "struct rq" to be on the same cache line
as others that are often accessed at the same time or in similar ways.

Using data from a Google-internal fleet-scale profiler, I found three
distinct groups of hot fields in struct rq:

- (1) The runqueue lock: __lock.

- (2) Those accessed from hot code in pick_next_task_fair():
      nr_running, nr_numa_running, nr_preferred_running,
      ttwu_pending, cpu_capacity, curr, idle.

- (3) Those accessed from some other hot codepaths, e.g.
      update_curr(), update_rq_clock(), and scheduler_tick():
      clock_task, clock_pelt, clock, lost_idle_time,
      clock_update_flags, clock_pelt_idle, clock_idle.

The cycles spent on accessing these different groups of fields broke down
roughly as follows:

- 50% on group (1) (the runqueue lock, always read-write)
- 39% on group (2) (load:store ratio around 38:1)
-  8% on group (3) (load:store ratio around 5:1)
-  3% on all the other fields

Most of the fields in group (3) are already in a cache line grouping; this
patch just adds "clock" and "clock_update_flags" to that group. The fields
in group (2) are scattered across several cache lines; the main effect of
this patch is to group them together, on a single line at the beginning of
the structure. A few other less performance-critical fields (nr_switches,
numa_migrate_on, has_blocked_load, nohz_csd, last_blocked_load_update_tick)
were also reordered to reduce holes in the data structure.

Since the runqueue lock is acquired from so many different contexts, and is
basically always accessed using an atomic operation, putting it on either
of the cache lines for groups (2) or (3) would slow down accesses to those
fields dramatically, since those groups are read-mostly accesses.

To test this, I wrote a focused load test that would put load on the
pick_next_task_fair() path. A parent process would fork many child
processes, and each child would nanosleep() for 1 msec many times in a
loop. The load test was monitored with "perf", and I looked at the amount
of cycles that were spent with sched_balance_rq() on the stack. The test
was reliably spending ~5% of all of its cycles there. I ran it 100 times
on a pair of 2-socket Intel Haswell machines (72 vCPUs per machine) - one
running the tip of sched/core, the other running this change - using 360
child processes and 8192 1-msec sleeps per child.  The mean cycle count
dropped from 5.14B to 4.91B, or a *4.6% decrease* in relevant scheduler
cycles.

Given that this change reduces cache misses in a very hot kernel codepath,
there's likely to be additional application performance improvement due to
reduced cache conflicts from kernel data structures.

On a Power11 system with 128-byte cache lines, my test showed a ~5%
decrease in relevant scheduler cycles, along with a slight increase in user
time - both positive indicators. This data comes from
https://lore.kernel.org/lkml/affdc6b1-9980-44d1-89db-d90730c1e384@linux.ibm.com/
This is the case even though the additional "____cacheline_aligned" that
puts the runqueue lock on the next cache line adds an additional 64 bytes
of padding on those machines. This patch does not change the size of
"struct rq" on machines with 64-byte cache lines.

I also ran "hackbench" to try to test this change, but it didn't show
conclusive results.  Looking at a CPU cycle profile of the hackbench run,
it was spending 95% of its cycles inside __alloc_skb(), __kfree_skb(), or
kmem_cache_free() - almost all of which was spent updating memcg counters
or contending on the list_lock in kmem_cache_node. And it spent less than
0.5% of its cycles inside either schedule() or try_to_wake_up().  So it's
not surprising that it didn't show useful results here.

The "__no_randomize_layout" was added to reflect the fact that performance
of code that references this data structure is unusually sensitive to
placement of its members.

Signed-off-by: Blake Jones <blakejones@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Reviewed-by: Josh Don <joshdon@google.com>
Tested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Link: https://patch.msgid.link/20251202023743.1524247-1-blakejones@google.com
2026-01-08 12:43:56 +01:00
Yury Norov (NVIDIA) 55b39b0cf1 sched/fair: Use cpumask_weight_and() in sched_balance_find_dst_group()
In the group_has_spare case, the function creates a temporary cpumask
to just calculate weight of (p->cpus_ptr & sched_group_span(local)).

We've got a dedicated helper for it.

Signed-off-by: Yury Norov (NVIDIA) <yury.norov@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reviewed-by: Fernand Sieber <sieberf@amazon.com>
Link: https://patch.msgid.link/20251207034247.402926-1-yury.norov@gmail.com
2026-01-08 12:43:56 +01:00
Yury Norov (NVIDIA) 0ab25ea2a3 sched/fair: Simplify task_numa_find_cpu()
Use for_each_cpu_and() and drop some housekeeping code.

Signed-off-by: Yury Norov (NVIDIA) <yury.norov@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reviewed-by: Phil Auld <pauld@redhat.com>
Link: https://patch.msgid.link/20251207033037.399608-1-yury.norov@gmail.com
2026-01-08 12:43:56 +01:00
Yury Norov (NVIDIA) ff1de90dd7 sched/fair: Drop useless cpumask_empty() in find_energy_efficient_cpu()
cpumask_empty() call is O(N) and useless because the previous
cpumask_and() returns false for empty 'cpus'. Drop it.

Signed-off-by: Yury Norov (NVIDIA) <yury.norov@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://patch.msgid.link/20251207040543.407695-1-yury.norov@gmail.com
2026-01-08 12:43:56 +01:00
Deepanshu Kartikey 9df5fad801 bpf: Reject BPF_MAP_TYPE_INSN_ARRAY in check_reg_const_str()
BPF_MAP_TYPE_INSN_ARRAY maps store instruction pointers in their
ips array, not string data. The map_direct_value_addr callback for
this map type returns the address of the ips array, which is not
suitable for use as a constant string argument.

When a BPF program passes a pointer to an insn_array map value as
ARG_PTR_TO_CONST_STR (e.g., to bpf_snprintf), the verifier's
null-termination check in check_reg_const_str() operates on the
wrong memory region, and at runtime bpf_bprintf_prepare() can read
out of bounds searching for a null terminator.

Reject BPF_MAP_TYPE_INSN_ARRAY in check_reg_const_str() since this
map type is not designed to hold string data.

Reported-by: syzbot+2c29addf92581b410079@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=2c29addf92581b410079
Tested-by: syzbot+2c29addf92581b410079@syzkaller.appspotmail.com
Fixes: 493d9e0d60 ("bpf, x86: add support for indirect jumps")
Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com>
Acked-by: Anton Protopopov <a.s.protopopov@gmail.com>
Link: https://lore.kernel.org/r/20260107021037.289644-1-kartikey406@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-07 19:03:46 -08:00
Michal Koutný ef56578274 cgroup: Eliminate cgrp_ancestor_storage in cgroup_root
The cgrp_ancestor_storage has two drawbacks:
- it's not guaranteed that the member immediately follows struct cgrp in
  cgroup_root (root cgroup's ancestors[0] might thus point to a padding
  and not in cgrp_ancestor_storage proper),
- this idiom raises warnings with -Wflex-array-member-not-at-end.

Instead of relying on the auxiliary member in cgroup_root, define the
0-th level ancestor inside struct cgroup (needed for static allocation
of cgrp_dfl_root), deeper cgroups would allocate flexible
_low_ancestors[].  Unionized alias through ancestors[] will
transparently join the two ranges.

The above change would still leave the flexible array at the end of
struct cgroup inside cgroup_root, so move cgrp also towards the end of
cgroup_root to resolve the -Wflex-array-member-not-at-end.

Link: https://lore.kernel.org/r/5fb74444-2fbb-476e-b1bf-3f3e279d0ced@embeddedor.com/
Reported-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Closes: https://lore.kernel.org/r/b3eb050d-9451-4b60-b06c-ace7dab57497@embeddedor.com/
Cc: David Laight <david.laight.linux@gmail.com>
Acked-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-01-07 15:11:03 -10:00
Robin Murphy 8a840ab056 dma-mapping: Remove dma_mark_clean (again)
With IA-64 now gone, there are no users of the dma_mark_clean hook,
so we can retire it for good.

Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/c004927f01962726ff1dcf94d1b4efff84db805a.1767727673.git.robin.murphy@arm.com
2026-01-08 00:19:08 +01:00
Ben Dooks 1e2ed4bfd5 trace: ftrace_dump_on_oops[] is not exported, make it static
The ftrace_dump_on_oops string is not used outside of trace.c so
make it static to avoid the export warning from sparse:

kernel/trace/trace.c:141:6: warning: symbol 'ftrace_dump_on_oops' was not declared. Should it be static?

Fixes: dd293df639 ("tracing: Move trace sysctls into trace.c")
Link: https://patch.msgid.link/20260106231054.84270-1-ben.dooks@codethink.co.uk
Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-01-07 14:52:22 -05:00
Steven Rostedt 5f1ef0dfcb tracing: Add recursion protection in kernel stack trace recording
A bug was reported about an infinite recursion caused by tracing the rcu
events with the kernel stack trace trigger enabled. The stack trace code
called back into RCU which then called the stack trace again.

Expand the ftrace recursion protection to add a set of bits to protect
events from recursion. Each bit represents the context that the event is
in (normal, softirq, interrupt and NMI).

Have the stack trace code use the interrupt context to protect against
recursion.

Note, the bug showed an issue in both the RCU code as well as the tracing
stacktrace code. This only handles the tracing stack trace side of the
bug. The RCU fix will be handled separately.

Link: https://lore.kernel.org/all/20260102122807.7025fc87@gandalf.local.home/

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Link: https://patch.msgid.link/20260105203141.515cd49f@gandalf.local.home
Reported-by: Yao Kai <yaokai34@huawei.com>
Tested-by: Yao Kai <yaokai34@huawei.com>
Fixes: 5f5fa7ea89 ("rcu: Don't use negative nesting depth in __rcu_read_unlock()")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-01-07 14:52:22 -05:00
Wupeng Ma 6435ffd6c7 ring-buffer: Avoid softlockup in ring_buffer_resize() during memory free
When user resize all trace ring buffer through file 'buffer_size_kb',
then in ring_buffer_resize(), kernel allocates buffer pages for each
cpu in a loop.

If the kernel preemption model is PREEMPT_NONE and there are many cpus
and there are many buffer pages to be freed, it may not give up cpu
for a long time and finally cause a softlockup.

To avoid it, call cond_resched() after each cpu buffer free as Commit
f6bd2c9248 ("ring-buffer: Avoid softlockup in ring_buffer_resize()")
does.

Detailed call trace as follow:

  rcu: INFO: rcu_sched self-detected stall on CPU
  rcu: 	24-....: (14837 ticks this GP) idle=521c/1/0x4000000000000000 softirq=230597/230597 fqs=5329
  rcu: 	(t=15004 jiffies g=26003221 q=211022 ncpus=96)
  CPU: 24 UID: 0 PID: 11253 Comm: bash Kdump: loaded Tainted: G            EL      6.18.2+ #278 NONE
  pc : arch_local_irq_restore+0x8/0x20
   arch_local_irq_restore+0x8/0x20 (P)
   free_frozen_page_commit+0x28c/0x3b0
   __free_frozen_pages+0x1c0/0x678
   ___free_pages+0xc0/0xe0
   free_pages+0x3c/0x50
   ring_buffer_resize.part.0+0x6a8/0x880
   ring_buffer_resize+0x3c/0x58
   __tracing_resize_ring_buffer.part.0+0x34/0xd8
   tracing_resize_ring_buffer+0x8c/0xd0
   tracing_entries_write+0x74/0xd8
   vfs_write+0xcc/0x288
   ksys_write+0x74/0x118
   __arm64_sys_write+0x24/0x38

Cc: <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251228065008.2396573-1-mawupeng1@huawei.com
Signed-off-by: Wupeng Ma <mawupeng1@huawei.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-01-07 14:52:22 -05:00
Julia Lawall 7cc3fe8e75 tracing: Drop unneeded assignment to soft_mode
soft_mode is not read in the enable case, so drop the assignment.
Drop also the comment text that refers to the assignment and realign
the comment.

Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: Gabriele Paoloni <gpaoloni@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251226110531.4129794-1-Julia.Lawall@inria.fr
Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-01-07 14:52:22 -05:00
Zqiang cee2557ae3 srcu: Use suitable gfp_flags for the init_srcu_struct_nodes()
For use the init_srcu_struct*() to initialized srcu structure,
the srcu structure's->srcu_sup and sda use GFP_KERNEL flags to
allocate memory. similarly, if set SRCU_SIZING_INIT, the
srcu_sup's->node can still use GFP_KERNEL flags to allocate
memory, not need to use GFP_ATOMIC flags all the time.

Signed-off-by: Zqiang <qiang.zhang@linux.dev>
Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2026-01-07 21:59:41 +08:00
Yao Kai d41e37f26b rcu: Fix rcu_read_unlock() deadloop due to softirq
Commit 5f5fa7ea89 ("rcu: Don't use negative nesting depth in
__rcu_read_unlock()") removes the recursion-protection code from
__rcu_read_unlock(). Therefore, we could invoke the deadloop in
raise_softirq_irqoff() with ftrace enabled as follows:

WARNING: CPU: 0 PID: 0 at kernel/trace/trace.c:3021 __ftrace_trace_stack.constprop.0+0x172/0x180
Modules linked in: my_irq_work(O)
CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Tainted: G O 6.18.0-rc7-dirty #23 PREEMPT(full)
Tainted: [O]=OOT_MODULE
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
RIP: 0010:__ftrace_trace_stack.constprop.0+0x172/0x180
RSP: 0018:ffffc900000034a8 EFLAGS: 00010002
RAX: 0000000000000000 RBX: 0000000000000004 RCX: 0000000000000000
RDX: 0000000000000003 RSI: ffffffff826d7b87 RDI: ffffffff826e9329
RBP: 0000000000090009 R08: 0000000000000005 R09: ffffffff82afbc4c
R10: 0000000000000008 R11: 0000000000011d7a R12: 0000000000000000
R13: ffff888003874100 R14: 0000000000000003 R15: ffff8880038c1054
FS:  0000000000000000(0000) GS:ffff8880fa8ea000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055b31fa7f540 CR3: 00000000078f4005 CR4: 0000000000770ef0
PKRU: 55555554
Call Trace:
 <IRQ>
 trace_buffer_unlock_commit_regs+0x6d/0x220
 trace_event_buffer_commit+0x5c/0x260
 trace_event_raw_event_softirq+0x47/0x80
 raise_softirq_irqoff+0x6e/0xa0
 rcu_read_unlock_special+0xb1/0x160
 unwind_next_frame+0x203/0x9b0
 __unwind_start+0x15d/0x1c0
 arch_stack_walk+0x62/0xf0
 stack_trace_save+0x48/0x70
 __ftrace_trace_stack.constprop.0+0x144/0x180
 trace_buffer_unlock_commit_regs+0x6d/0x220
 trace_event_buffer_commit+0x5c/0x260
 trace_event_raw_event_softirq+0x47/0x80
 raise_softirq_irqoff+0x6e/0xa0
 rcu_read_unlock_special+0xb1/0x160
 unwind_next_frame+0x203/0x9b0
 __unwind_start+0x15d/0x1c0
 arch_stack_walk+0x62/0xf0
 stack_trace_save+0x48/0x70
 __ftrace_trace_stack.constprop.0+0x144/0x180
 trace_buffer_unlock_commit_regs+0x6d/0x220
 trace_event_buffer_commit+0x5c/0x260
 trace_event_raw_event_softirq+0x47/0x80
 raise_softirq_irqoff+0x6e/0xa0
 rcu_read_unlock_special+0xb1/0x160
 unwind_next_frame+0x203/0x9b0
 __unwind_start+0x15d/0x1c0
 arch_stack_walk+0x62/0xf0
 stack_trace_save+0x48/0x70
 __ftrace_trace_stack.constprop.0+0x144/0x180
 trace_buffer_unlock_commit_regs+0x6d/0x220
 trace_event_buffer_commit+0x5c/0x260
 trace_event_raw_event_softirq+0x47/0x80
 raise_softirq_irqoff+0x6e/0xa0
 rcu_read_unlock_special+0xb1/0x160
 __is_insn_slot_addr+0x54/0x70
 kernel_text_address+0x48/0xc0
 __kernel_text_address+0xd/0x40
 unwind_get_return_address+0x1e/0x40
 arch_stack_walk+0x9c/0xf0
 stack_trace_save+0x48/0x70
 __ftrace_trace_stack.constprop.0+0x144/0x180
 trace_buffer_unlock_commit_regs+0x6d/0x220
 trace_event_buffer_commit+0x5c/0x260
 trace_event_raw_event_softirq+0x47/0x80
 __raise_softirq_irqoff+0x61/0x80
 __flush_smp_call_function_queue+0x115/0x420
 __sysvec_call_function_single+0x17/0xb0
 sysvec_call_function_single+0x8c/0xc0
 </IRQ>

Commit b41642c877 ("rcu: Fix rcu_read_unlock() deadloop due to IRQ work")
fixed the infinite loop in rcu_read_unlock_special() for IRQ work by
setting a flag before calling irq_work_queue_on(). We fix this issue by
setting the same flag before calling raise_softirq_irqoff() and rename the
flag to defer_qs_pending for more common.

Fixes: 5f5fa7ea89 ("rcu: Don't use negative nesting depth in __rcu_read_unlock()")
Reported-by: Tengda Wu <wutengda2@huawei.com>
Signed-off-by: Yao Kai <yaokai34@huawei.com>
Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2026-01-07 21:58:37 +08:00
Paul E. McKenney 37d9b47507 rcutorture: Correctly compute probability to invoke ->exp_current()
Lack of parentheses causes the ->exp_current() function, for example,
srcu_expedite_current(), to be called only once in four billion times
instead of the intended once in 256 times.  This commit therefore adds
the needed parentheses.

Reported-by: Chris Mason <clm@meta.com>
Reported-by: Joel Fernandes <joelagnelf@nvidia.com>
Fixes: 950063c6e8 ("rcutorture: Test srcu_expedite_current()")
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2026-01-07 21:58:34 +08:00
Paul E. McKenney 255019537c rcu: Make expedited RCU CPU stall warnings detect stall-end races
If an expedited RCU CPU stall ends just at the stall-warning timeout,
the current code will print an expedited stall-warning message, but one
that doesn't identify any CPUs or tasks causing the stall.  This is most
likely to happen for short-timeout stalls, for example, the 20-millisecond
timeouts that are sometimes used for small embedded devices.  Needless to
say, these semi-empty stall-warning messages can be rather confusing.

One option would be to suppress the stall-warning message entirely in
this case, but the near-miss information can be quite valuable.

Detect this race condition and emits a "INFO: Expedited stall ended
before state dump start" message to clarify matters.

[boqun: Apply feedback from Borislav]

Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2026-01-07 21:58:26 +08:00
Leon Hwang 47c79f05aa bpf: Add BPF_F_CPU and BPF_F_ALL_CPUS flags support for percpu_cgroup_storage maps
Introduce BPF_F_ALL_CPUS flag support for percpu_cgroup_storage maps to
allow updating values for all CPUs with a single value for update_elem
API.

Introduce BPF_F_CPU flag support for percpu_cgroup_storage maps to
allow:

* update value for specified CPU for update_elem API.
* lookup value for specified CPU for lookup_elem API.

The BPF_F_CPU flag is passed via map_flags along with embedded cpu info.

Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/20260107022022.12843-6-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-06 20:48:32 -08:00
Leon Hwang 8526397c3c bpf: Copy map value using copy_map_value_long for percpu_cgroup_storage maps
Copy map value using 'copy_map_value_long()'. It's to keep consistent
style with the way of other percpu maps.

No functional change intended.

Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/20260107022022.12843-5-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-06 20:48:32 -08:00
Leon Hwang c6936161fd bpf: Add BPF_F_CPU and BPF_F_ALL_CPUS flags support for percpu_hash and lru_percpu_hash maps
Introduce BPF_F_ALL_CPUS flag support for percpu_hash and lru_percpu_hash
maps to allow updating values for all CPUs with a single value for both
update_elem and update_batch APIs.

Introduce BPF_F_CPU flag support for percpu_hash and lru_percpu_hash
maps to allow:

* update value for specified CPU for both update_elem and update_batch
APIs.
* lookup value for specified CPU for both lookup_elem and lookup_batch
APIs.

The BPF_F_CPU flag is passed via:

* map_flags along with embedded cpu info.
* elem_flags along with embedded cpu info.

Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/20260107022022.12843-4-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-06 20:48:32 -08:00
Leon Hwang 8eb76cb03f bpf: Add BPF_F_CPU and BPF_F_ALL_CPUS flags support for percpu_array maps
Introduce support for the BPF_F_ALL_CPUS flag in percpu_array maps to
allow updating values for all CPUs with a single value for both
update_elem and update_batch APIs.

Introduce support for the BPF_F_CPU flag in percpu_array maps to allow:

* update value for specified CPU for both update_elem and update_batch
APIs.
* lookup value for specified CPU for both lookup_elem and lookup_batch
APIs.

The BPF_F_CPU flag is passed via:

* map_flags of lookup_elem and update_elem APIs along with embedded cpu
info.
* elem_flags of lookup_batch and update_batch APIs along with embedded
cpu info.

Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/20260107022022.12843-3-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-06 20:48:32 -08:00
Leon Hwang 2b421662c7 bpf: Introduce BPF_F_CPU and BPF_F_ALL_CPUS flags
Introduce BPF_F_CPU and BPF_F_ALL_CPUS flags and check them for
following APIs:

* 'map_lookup_elem()'
* 'map_update_elem()'
* 'generic_map_lookup_batch()'
* 'generic_map_update_batch()'

And, get the correct value size for these APIs.

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/20260107022022.12843-2-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-06 20:48:32 -08:00
Gaurav Batra 1471c517cf powerpc/iommu: bypass DMA APIs for coherent allocations for pre-mapped memory
Leverage ARCH_HAS_DMA_MAP_DIRECT config option for coherent allocations as
well. This will bypass DMA ops for memory allocations that have been
pre-mapped.

Always set device bus_dma_limit when memory is pre-mapped. In some
architectures, like PowerPC, pmemory can be converted to regular memory via
daxctl command. This will gate the coherent allocations to pre-mapped RAM
only, by dma_coherent_ok().

Signed-off-by: Gaurav Batra <gbatra@linux.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20251107161105.85999-1-gbatra@linux.ibm.com
2026-01-07 09:33:55 +05:30
Casey Schaufler 5547598e59 cred: remove unused set_security_override_from_ctx()
The function set_security_override_from_ctx() has no in-tree callers
since 6.14. Remove it.

Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
[PM: subject tweak, merge fuzz]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2026-01-06 20:52:57 -05:00
Emil Tsalapatis 39f77533b6 bpf: Allow calls to arena functions while holding spinlocks
The bpf_arena_*_pages() kfuncs can be called from sleepable contexts,
but the verifier still prevents BPF programs from calling them while
holding a spinlock. Amend the verifier to allow for BPF programs
calling arena page management functions while holding a lock.

Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com>
Link: https://lore.kernel.org/r/20260106-arena-under-lock-v2-2-378e9eab3066@etsalapatis.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-06 17:44:00 -08:00
Emil Tsalapatis b25b48c7d3 bpf: Check active lock count in in_sleepable_context()
The in_sleepable_context() function is used to specialize the BPF code
in do_misc_fixups(). With the addition of nonsleepable arena kfuncs,
there are kfuncs whose specialization depends on whether we are
holding a lock. We should use the nonsleepable version while
holding a lock and the sleepable one when not.

Add a check for active_locks to account for locking when specializing
arena kfuncs.

Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com>
Link: https://lore.kernel.org/r/20260106-arena-under-lock-v2-1-378e9eab3066@etsalapatis.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-06 17:43:19 -08:00
Keke Ming a491c02c27 uprobes: use kmap_local_page() for temporary page mappings
Replace deprecated kmap_atomic() with kmap_local_page().

Signed-off-by: Keke Ming <ming.jvle@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Link: https://patch.msgid.link/20260103084243.195125-6-ming.jvle@gmail.com
2026-01-06 16:34:28 +01:00
Joel Granados d174174c67 sysctl: replace SYSCTL_INT_CONV_CUSTOM macro with functions
Remove SYSCTL_INT_CONV_CUSTOM and replace it with proc_int_conv. This
converter function expects a negp argument as it can take on negative
values. Update all jiffies converters to use explicit function calls.
Remove SYSCTL_CONV_IDENTITY as it is no longer used.

Signed-off-by: Joel Granados <joel.granados@kernel.org>
2026-01-06 11:27:10 +01:00
Joel Granados ef153851af sysctl: Replace unidirectional INT converter macros with functions
Replace SYSCTL_USER_TO_KERN_INT_CONV and SYSCTL_KERN_TO_USER_INT_CONV
macros with function implementing the same logic.This makes debugging
easier and aligns with the functions preference described in
coding-style.rst. Update all jiffies converters to use explicit function
implementations instead of macro-generated versions.

Signed-off-by: Joel Granados <joel.granados@kernel.org>
2026-01-06 11:26:42 +01:00
Malaya Kumar Rout 7966cf0ebe PM: hibernate: Fix crash when freeing invalid crypto compressor
When crypto_alloc_acomp() fails, it returns an ERR_PTR value, not NULL.

The cleanup code in save_compressed_image() and load_compressed_image()
unconditionally calls crypto_free_acomp() without checking for ERR_PTR,
which causes crypto_acomp_tfm() to dereference an invalid pointer and
crash the kernel.

This can be triggered when the compression algorithm is unavailable
(e.g., CONFIG_CRYPTO_LZO not enabled).

Fix by adding IS_ERR_OR_NULL() checks before calling crypto_free_acomp()
and acomp_request_free(), similar to the existing kthread_stop() check.

Fixes: b03d542c3c ("PM: hibernate: Use crypto_acomp interface")
Signed-off-by: Malaya Kumar Rout <mrout@redhat.com>
Cc: 6.15+ <stable@vger.kernel.org> # 6.15+
[ rjw: Added 2 empty code lines ]
Link: https://patch.msgid.link/20251230115613.64080-1-mrout@redhat.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2026-01-05 19:12:56 +01:00
Marco Elver 04e49d926f sched: Enable context analysis for core.c and fair.c
This demonstrates a larger conversion to use Clang's context
analysis. The benefit is additional static checking of locking rules,
along with better documentation.

Notably, kernel/sched contains sufficiently complex synchronization
patterns, and application to core.c & fair.c demonstrates that the
latest Clang version has become powerful enough to start applying this
to more complex subsystems (with some modest annotations and changes).

Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251219154418.3592607-37-elver@google.com
2026-01-05 16:43:36 +01:00
Marco Elver 8ec56d9aab printk: Move locking annotation to printk.c
With Sparse support gone, Clang is a bit more strict and warns:

./include/linux/console.h:492:50: error: use of undeclared identifier 'console_mutex'
  492 | extern void console_list_unlock(void) __releases(console_mutex);

Since it does not make sense to make console_mutex itself global, move
the annotation to printk.c. Context analysis remains disabled for
printk.c.

This is needed to enable context analysis for modules that include
<linux/console.h>.

Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251219154418.3592607-34-elver@google.com
2026-01-05 16:43:36 +01:00
Marco Elver 0eaa911f89 kcsan: Enable context analysis
Enable context analysis for the KCSAN subsystem.

Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251219154418.3592607-31-elver@google.com
2026-01-05 16:43:35 +01:00
Marco Elver 6556fde265 kcov: Enable context analysis
Enable context analysis for the KCOV subsystem.

Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251219154418.3592607-30-elver@google.com
2026-01-05 16:43:34 +01:00
Marco Elver e4588c25c9 compiler-context-analysis: Remove __cond_lock() function-like helper
As discussed in [1], removing __cond_lock() will improve the readability
of trylock code. Now that Sparse context tracking support has been
removed, we can also remove __cond_lock().

Change existing APIs to either drop __cond_lock() completely, or make
use of the __cond_acquires() function attribute instead.

In particular, spinlock and rwlock implementations required switching
over to inline helpers rather than statement-expressions for their
trylock_* variants.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20250207082832.GU7145@noisy.programming.kicks-ass.net/ [1]
Link: https://patch.msgid.link/20251219154418.3592607-25-elver@google.com
2026-01-05 16:43:33 +01:00
Joel Granados b3af263b8a sysctl: Add kernel doc to proc_douintvec_conv
This commit is making sure that all the functions that are part of the
API are documented.

Signed-off-by: Joel Granados <joel.granados@kernel.org>
2026-01-05 14:10:32 +01:00
Joel Granados 8fc344a5af sysctl: Replace UINT converter macros with functions
Replace the SYSCTL_USER_TO_KERN_UINT_CONV and SYSCTL_UINT_CONV_CUSTOM
macros with functions with the same logic. This makes debugging easier
and aligns with the functions preference described in coding-style.rst.
Update the only user of this API: pipe.c.

Signed-off-by: Joel Granados <joel.granados@kernel.org>
2026-01-05 14:10:32 +01:00
Joel Granados ac3d6a4b60 sysctl: clarify proc_douintvec_minmax doc
Specify that the range check is only when assigning kernel variable

Signed-off-by: Joel Granados <joel.granados@kernel.org>
2026-01-05 14:10:32 +01:00
Joel Granados 11400f86c2 sysctl: Return -ENOSYS from proc_douintvec_conv when CONFIG_PROC_SYSCTL=n
Ensure an error if prco_douintvec_conv is erroneously called in a system
with CONFIG_PROC_SYSCTL=n

Signed-off-by: Joel Granados <joel.granados@kernel.org>
2026-01-05 14:10:32 +01:00
Joel Granados f7386f545e sysctl: Remove unused ctl_table forward declarations
Remove superfluous forward declarations of ctl_table from header files
where they are no longer needed. These declarations were left behind
after sysctl code refactoring and cleanup.

Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Muchun Song <muchun.song@linux.dev>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
2026-01-05 13:54:41 +01:00
Joel Granados 4864010524 sysctl: Add missing kernel-doc for proc_dointvec_conv
Add kernel-doc documentation for the proc_dointvec_conv function to
describe its parameters and return value.

Reviewed-by: Kees Cook <kees@kernel.org>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
2026-01-05 13:36:45 +01:00
Rafael J. Wysocki 10c3ab8cd8 Merge back a commit related to system sleep for 6.20 2026-01-05 12:01:29 +01:00
Peter Zijlstra ff5860f508 perf: Ensure swevent hrtimer is properly destroyed
With the change to hrtimer_try_to_cancel() in
perf_swevent_cancel_hrtimer() it appears possible for the hrtimer to
still be active by the time the event gets freed.

Make sure the event does a full hrtimer_cancel() on the free path by
installing a perf_event::destroy handler.

Fixes: eb3182ef04 ("perf/core: Fix system hang caused by cpu-clock usage")
Reported-by: CyberUnicorns <a101e_iotvul@163.com>
Tested-by: CyberUnicorns <a101e_iotvul@163.com>
Debugged-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2026-01-05 08:55:54 +01:00
Boqun Feng acb0b2f5d6 Merge branch 'rcu-torture.20260104a' into rcu-next
* rcu-torture.20260104a:
  rcutorture: Add --kill-previous option to terminate previous kvm.sh runs
  rcutorture: Prevent concurrent kvm.sh runs on same source tree
  torture: Include commit discription in testid.txt
  torture: Make config2csv.sh properly handle comments in .boot files
  torture: Make kvm-series.sh give run numbers and totals
  torture: Make kvm-series.sh give build numbers and totals
  torture: Parallelize kvm-series.sh guest-OS execution
  rcutorture: Add context checks to rcu_torture_timer()
2026-01-04 18:53:06 +08:00
Puranjay Mohan a069190b59 bpf: Replace __opt annotation with __nullable for kfuncs
The __opt annotation was originally introduced specifically for
buffer/size argument pairs in bpf_dynptr_slice() and
bpf_dynptr_slice_rdwr(), allowing the buffer pointer to be NULL while
still validating the size as a constant.  The __nullable annotation
serves the same purpose but is more general and is already used
throughout the BPF subsystem for raw tracepoints, struct_ops, and other
kfuncs.

This patch unifies the two annotations by replacing __opt with
__nullable.  The key change is in the verifier's
get_kfunc_ptr_arg_type() function, where mem/size pair detection is now
performed before the nullable check.  This ensures that buffer/size
pairs are correctly classified as KF_ARG_PTR_TO_MEM_SIZE even when the
buffer is nullable, while adding an !arg_mem_size condition to the
nullable check prevents interference with mem/size pair handling.

When processing KF_ARG_PTR_TO_MEM_SIZE arguments, the verifier now uses
is_kfunc_arg_nullable() instead of the removed is_kfunc_arg_optional()
to determine whether to skip size validation for NULL buffers.

This is the first documentation added for the __nullable annotation,
which has been in use since it was introduced but was previously
undocumented.

No functional changes to verifier behavior - nullable buffer/size pairs
continue to work exactly as before.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20260102221513.1961781-1-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-02 15:51:34 -08:00
Puranjay Mohan e66fe1bc6d bpf: arena: Reintroduce memcg accounting
When arena allocations were converted from bpf_map_alloc_pages() to
kmalloc_nolock() to support non-sleepable contexts, memcg accounting was
inadvertently lost. This commit restores proper memory accounting for
all arena-related allocations.

All arena related allocations are accounted into memcg of the process
that created bpf_arena.

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20260102200230.25168-3-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-02 14:31:59 -08:00
Puranjay Mohan 817593af7b bpf: syscall: Introduce memcg enter/exit helpers
Introduce bpf_map_memcg_enter() and bpf_map_memcg_exit() helpers to
reduce code duplication in memcg context management.

bpf_map_memcg_enter() gets the memcg from the map, sets it as active,
and returns both the previous and the now active memcg.

bpf_map_memcg_exit() restores the previous active memcg and releases the
reference obtained during enter.

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20260102200230.25168-2-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-02 14:31:59 -08:00
Linus Torvalds bbbc721033 Power management fix for 6.19-rc4
Fix a recent regression that affects system suspend testing at
 the "core" level (Rafael Wysocki)
 -----BEGIN PGP SIGNATURE-----
 
 iQFGBAABCAAwFiEEcM8Aw/RY0dgsiRUR7l+9nS/U47UFAmlYG/ISHHJqd0Byand5
 c29ja2kubmV0AAoJEO5fvZ0v1OO1mK0IAIrCiY5dvp9+72DvEWqS2uHHFVs3sHKR
 SOpJR3koYehZEn/PvnM2PgvWNCLtru4nU/Q3EnWFfFCFuFuAMQ6Zl5U7YyKkW1Uc
 bcTMsnLOTJm/3AYu3O+4TGASq1VF1xqE+AB/ie5fNz5gDSlblGKrqh0se3m5m1Vu
 PsLsm27wkLyEHCd3AdXRNSU54GssjTaABkVTQ/Unk4PznbBiKsckaThLjbjQaiqB
 KzqU0B3Q3Zx9Qj1lVzXwXaYushehGbs3bqw8+q2DPrV/jwLVLYX/ofwEkCH+lQ47
 tS+di//pFi/grWu/GtR4EQ0fCzgYPDaBfbQlOD2gA60EgplU4XY3804=
 =CK7L
 -----END PGP SIGNATURE-----

Merge tag 'pm-6.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fix from Rafael Wysocki:
 "Fix a recent regression that affects system suspend testing
  at the 'core' level (Rafael Wysocki)"

* tag 'pm-6.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PM: sleep: Fix suspend_test() at the TEST_CORE level
2026-01-02 12:35:29 -08:00
Puranjay Mohan 7646c7afd9 bpf: Remove redundant KF_TRUSTED_ARGS flag from all kfuncs
Now that KF_TRUSTED_ARGS is the default for all kfuncs, remove the
explicit KF_TRUSTED_ARGS flag from all kfunc definitions and remove the
flag itself.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20260102180038.2708325-3-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-02 12:04:28 -08:00
Puranjay Mohan 1a5c01d250 bpf: Make KF_TRUSTED_ARGS the default for all kfuncs
Change the verifier to make trusted args the default requirement for
all kfuncs by removing is_kfunc_trusted_args() assuming it be to always
return true.

This works because:
1. Context pointers (xdp_md, __sk_buff, etc.) are handled through their
   own KF_ARG_PTR_TO_CTX case label and bypass the trusted check
2. Struct_ops callback arguments are already marked as PTR_TRUSTED during
   initialization and pass is_trusted_reg()
3. KF_RCU kfuncs are handled separately via is_kfunc_rcu() checks at
   call sites (always checked with || alongside is_kfunc_trusted_args)

This simple change makes all kfuncs require trusted args by default
while maintaining correct behavior for all existing special cases.

Note: This change means kfuncs that previously accepted NULL pointers
without KF_TRUSTED_ARGS will now reject NULL at verification time.
Several netfilter kfuncs are affected: bpf_xdp_ct_lookup(),
bpf_skb_ct_lookup(), bpf_xdp_ct_alloc(), and bpf_skb_ct_alloc() all
accept NULL for their bpf_tuple and opts parameters internally (checked
in __bpf_nf_ct_lookup), but after this change the verifier rejects NULL
before the kfunc is even called. This is acceptable because these kfuncs
don't work with NULL parameters in their proper usage. Now they will be
rejected rather than returning an error, which shouldn't make a
difference to BPF programs that were using these kfuncs properly.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20260102180038.2708325-2-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-02 12:04:28 -08:00
Michael S. Tsirkin d5d8465131 dma-debug: track cache clean flag in entries
If a driver is buggy and has 2 overlapping mappings but only
sets cache clean flag on the 1st one of them, we warn.
But if it only does it for the 2nd one, we don't.

Fix by tracking cache clean flag in the entry.

Message-ID: <0ffb3513d18614539c108b4548cdfbc64274a7d1.1767601130.git.mst@redhat.com>
Reviewed-by: Petr Tesarik <ptesarik@suse.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2026-01-02 06:22:49 -05:00
Paul E. McKenney e8a534a671 rcutorture: Add context checks to rcu_torture_timer()
This commit adds irq, NMI, and softirq context checks to the
rcu_torture_timer() function.  Just because you are paranoid does not
mean that they are not out to get you...  ;-)

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2026-01-01 16:43:21 +08:00
Paul E. McKenney 760f05bc83 rcutorture: Test rcu_tasks_trace_expedite_current()
This commit adds a ->exp_current member to the tasks_tracing_ops structure
to test the rcu_tasks_trace_expedite_current() function.

[ paulmck: Apply kernel test robot feedback. ]

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: bpf@vger.kernel.org
Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2026-01-01 16:39:46 +08:00
Paul E. McKenney 1a72f4bb6f rcu: Add noinstr-fast rcu_read_{,un}lock_tasks_trace() APIs
When expressing RCU Tasks Trace in terms of SRCU-fast, it was
necessary to keep a nesting count and per-CPU srcu_ctr structure
pointer in the task_struct structure, which is slow to access.
But an alternative is to instead make rcu_read_lock_tasks_trace() and
rcu_read_unlock_tasks_trace(), which match the underlying SRCU-fast
semantics, avoiding the task_struct accesses.

When all callers have switched to the new API, the previous
rcu_read_lock_trace() and rcu_read_unlock_trace() APIs will be removed.

The rcu_read_{,un}lock_{,tasks_}trace() functions need to use smp_mb()
only if invoked where RCU is not watching, that is, from locations where
a call to rcu_is_watching() would return false.  In architectures that
define the ARCH_WANTS_NO_INSTR Kconfig option, use of noinstr and friends
ensures that tracing happens only where RCU is watching, so those
architectures can dispense entirely with the read-side calls to smp_mb().

Other architectures include these read-side calls by default, but in many
installations there might be either larger than average tolerance for
risk, prohibition of removing tracing on a running system, or careful
review and approval of removing of tracing.  Such installations can
build their kernels with CONFIG_TASKS_TRACE_RCU_NO_MB=y to avoid those
read-side calls to smp_mb(), thus accepting responsibility for run-time
removal of tracing from code regions that RCU is not watching.

Those wishing to disable read-side memory barriers for an entire
architecture can select this TASKS_TRACE_RCU_NO_MB Kconfig option,
hence the polarity.

[ paulmck: Apply Peter Zijlstra feedback. ]

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: bpf@vger.kernel.org
Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2026-01-01 16:39:46 +08:00
Paul E. McKenney 176a6aeaf1 rcu: Move rcu_tasks_trace_srcu_struct out of #ifdef CONFIG_TASKS_RCU_GENERIC
Moving the rcu_tasks_trace_srcu_struct structure instance out
from under the CONFIG_TASKS_RCU_GENERIC Kconfig option permits
the CONFIG_TASKS_TRACE_RCU Kconfig option to stop enabling this
CONFIG_TASKS_RCU_GENERIC Kconfig option.  This commit also therefore
makes it so.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: bpf@vger.kernel.org
Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2026-01-01 16:39:46 +08:00
Paul E. McKenney a73fc3dcc6 rcu: Clean up after the SRCU-fastification of RCU Tasks Trace
Now that RCU Tasks Trace has been re-implemented in terms of SRCU-fast,
the ->trc_ipi_to_cpu, ->trc_blkd_cpu, ->trc_blkd_node, ->trc_holdout_list,
and ->trc_reader_special task_struct fields are no longer used.

In addition, the rcu_tasks_trace_qs(), rcu_tasks_trace_qs_blkd(),
exit_tasks_rcu_finish_trace(), and rcu_spawn_tasks_trace_kthread(),
show_rcu_tasks_trace_gp_kthread(), rcu_tasks_trace_get_gp_data(),
rcu_tasks_trace_torture_stats_print(), and get_rcu_tasks_trace_gp_kthread()
functions and all the other functions that they invoke are no longer used.

Also, the TRC_NEED_QS and TRC_NEED_QS_CHECKED CPP macros are no longer used.
Neither are the rcu_tasks_trace_lazy_ms and rcu_task_ipi_delay rcupdate
module parameters and the TASKS_TRACE_RCU_READ_MB Kconfig option.

This commit therefore removes all of them.

[ paulmck: Apply Alexei Starovoitov feedback. ]

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: bpf@vger.kernel.org
Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2026-01-01 16:39:46 +08:00
Paul E. McKenney 46e3235999 context_tracking: Remove rcu_task_trace_heavyweight_{enter,exit}()
Because SRCU-fast does not use IPIs for its grace periods, there is
no need for real-time workloads to switch to an IPI-free mode, and
there is in turn no need for either rcu_task_trace_heavyweight_enter()
or rcu_task_trace_heavyweight_exit().  This commit therefore removes them.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: bpf@vger.kernel.org
Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2026-01-01 16:39:46 +08:00
Paul E. McKenney c27cea4416 rcu: Re-implement RCU Tasks Trace in terms of SRCU-fast
This commit saves more than 500 lines of RCU code by re-implementing
RCU Tasks Trace in terms of SRCU-fast.  Follow-up work will remove
more code that does not cause problems by its presence, but that is no
longer required.

This variant places smp_mb() in rcu_read_{,un}lock_trace(), and in the
same place that srcu_read_{,un}lock() would put them. These smp_mb()
calls will be removed on common-case architectures in a later commit.
In the meantime, it serves to enforce ordering between the underlying
srcu_read_{,un}lock_fast() markers and the intervening critical section,
even on architectures that permit attaching tracepoints on regions of
code not watched by RCU.  Such architectures defeat SRCU-fast's use of
implicit single-instruction, interrupts-disabled, and atomic-operation
RCU read-side critical sections, which have no effect when RCU is not
watching.  The aforementioned later commit will insert these smp_mb()
calls only on architectures that have not used noinstr to prevent
attaching tracepoints to code where RCU is not watching.

[ paulmck: Apply kernel test robot, Boqun Feng, and Zqiang feedback. ]
[ paulmck: Split out Tiny SRCU fixes per Andrii Nakryiko feedback. ]

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: kernel test robot <oliver.sang@intel.com>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: bpf@vger.kernel.org
Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2026-01-01 16:39:46 +08:00
Michael S. Tsirkin 61868dc55a dma-mapping: add DMA_ATTR_CPU_CACHE_CLEAN
When multiple small DMA_FROM_DEVICE or DMA_BIDIRECTIONAL buffers share a
cacheline, and DMA_API_DEBUG is enabled, we get this warning:
	cacheline tracking EEXIST, overlapping mappings aren't supported.

This is because when one of the mappings is removed, while another one
is active, CPU might write into the buffer.

Add an attribute for the driver to promise not to do this, making the
overlapping safe, and suppressing the warning.

Message-ID: <2d5d091f9d84b68ea96abd545b365dd1d00bbf48.1767601130.git.mst@redhat.com>
Reviewed-by: Petr Tesarik <ptesarik@suse.com>
Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-12-31 19:30:02 -05:00
Eduard Zingerman 840692326e bpf: allow states pruning for misc/invalid slots in iterator loops
Within an iterator or callback based loop, it should be safe to prune
the current state if the old state stack slot is marked as
STACK_INVALID or STACK_MISC:
- either all branches of the old state lead to a program exit;
- or some branch of the old state leads the current state.

This is the same logic as applied in non-loop cases when
states_equal() is called in NOT_EXACT mode.

The test case that exercises stacksafe() and demonstrates the
difference in verification performance is included in the next patch.
I'm not sure if it is possible to prepare a test case that exercises
regsafe(); it appears that the compute_live_registers() pass makes
this impossible.

Nevertheless, for code readability reasons, I think that stacksafe()
and regsafe() should handle STACK_INVALID / NOT_INIT symmetrically.
Hence, this commit changes both functions.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20251230-loop-stack-misc-pruning-v1-1-585cfd6cec51@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-12-31 09:01:13 -08:00
Eduard Zingerman f597664454 bpf: bpf_scc_visit instance and backedges accumulation for bpf_loop()
Calls like bpf_loop() or bpf_for_each_map_elem() introduce loops that
are not explicitly present in the control-flow graph. The verifier
processes such calls by repeatedly interpreting the callback function
body within the same verification path (until the current state
converges with a previous state).

Such loops require a bpf_scc_visit instance in order to allow the
accumulation of the state graph backedges. Otherwise, certain
checkpoint states created within the bodies of such loops will have
incomplete precision marks.

See the next patch for an example of a program that leads to the
verifier accepting an unsafe program.

Fixes: 96c6aa4c63 ("bpf: compute SCCs in program control flow graph")
Fixes: c9e31900b5 ("bpf: propagate read/precision marks over state graph backedges")
Reported-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Tested-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20251229-scc-for-callbacks-v1-1-ceadfe679900@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-12-30 15:42:42 -08:00
Linus Torvalds 0b34fd0fea 27 hotfixes. 12 are cc:stable, 18 are MM.
There's a three patch series from Jiayuan Chen which fixes some issues
 with KASAN and vmalloc.  Apart from that it's the usual shower of
 singletons - please see the respective changelogs for details.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaVIWxwAKCRDdBJ7gKXxA
 jt6HAP49s/mEIIbuZbqnX8hxDrvYdYffs+RsSLPEZoR+yKG/7gD/VqSZhMoPw53b
 rMZ56djXNjWxsOAfiVbZit3SFuQfnQ4=
 =5rGD
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2025-12-28-21-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "27 hotfixes.  12 are cc:stable, 18 are MM.

  There's a patch series from Jiayuan Chen which fixes some
  issues with KASAN and vmalloc. Apart from that it's the usual
  shower of singletons - please see the respective changelogs
  for details"

* tag 'mm-hotfixes-stable-2025-12-28-21-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (27 commits)
  mm/ksm: fix pte_unmap_unlock of wrong address in break_ksm_pmd_entry
  mm/page_owner: fix memory leak in page_owner_stack_fops->release()
  mm/memremap: fix spurious large folio warning for FS-DAX
  MAINTAINERS: notify the "Device Memory" community of memory hotplug changes
  sparse: update MAINTAINERS info
  mm/page_alloc: report 1 as zone_batchsize for !CONFIG_MMU
  mm: consider non-anon swap cache folios in folio_expected_ref_count()
  rust: maple_tree: rcu_read_lock() in destructor to silence lockdep
  mm: memcg: fix unit conversion for K() macro in OOM log
  mm: fixup pfnmap memory failure handling to use pgoff
  tools/mm/page_owner_sort: fix timestamp comparison for stable sorting
  selftests/mm: fix thread state check in uffd-unit-tests
  kernel/kexec: fix IMA when allocation happens in CMA area
  kernel/kexec: change the prototype of kimage_map_segment()
  MAINTAINERS: add ABI headers to KHO and LIVE UPDATE
  .mailmap: remove one of the entries for WangYuli
  mm/damon/vaddr: fix missing pte_unmap_unlock in damos_va_migrate_pmd_entry()
  MAINTAINERS: update one straggling entry for Bartosz Golaszewski
  mm/page_alloc: change all pageblocks migrate type on coalescing
  mm: leafops.h: correct kernel-doc function param. names
  ...
2025-12-29 11:40:38 -08:00
Linus Torvalds 7839932417 sched_ext: Fixes for v6.19-rc3
- Fix uninitialized @ret on alloc_percpu() failure leading to ERR_PTR(0).
 
 - Fix PREEMPT_RT warning when bypass load balancer sends IPI to offline
   CPU by using resched_cpu() instead of resched_curr().
 
 - Fix comment referring to renamed function.
 
 - Update scx_show_state.py for scx_root and scx_aborting changes.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaVHQAw4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGdJ2AP9nLMUa5Rw2hpcKCLvPjgkqe5fDpNteWrQB3ni9
 bu28jQD/XGMwooJbATlDEgCtFqCH74QbddqUABJxBw4FE8qpUgw=
 =hp6J
 -----END PGP SIGNATURE-----

Merge tag 'sched_ext-for-6.19-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext fixes from Tejun Heo:

 - Fix uninitialized @ret on alloc_percpu() failure leading to
   ERR_PTR(0)

 - Fix PREEMPT_RT warning when bypass load balancer sends IPI to offline
   CPU by using resched_cpu() instead of resched_curr()

 - Fix comment referring to renamed function

 - Update scx_show_state.py for scx_root and scx_aborting changes

* tag 'sched_ext-for-6.19-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  tools/sched_ext: update scx_show_state.py for scx_aborting change
  tools/sched_ext: fix scx_show_state.py for scx_root change
  sched_ext: Use the resched_cpu() to replace resched_curr() in the bypass_lb_node()
  sched_ext: Fix some comments in ext.c
  sched_ext: fix uninitialized ret on alloc_percpu() failure
2025-12-28 17:21:36 -08:00
Linus Torvalds bba0b6a1c4 cgroup: Fixes for v6.19-rc3
- cpuset: Fix spurious warning when disabling remote partition after CPU
   hotplug leaves subpartitions_cpus empty. Guard the warning and invalidate
   affected partitions.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaVHNBg4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGW13AQC7akGwBAjBbR5p0aoLXKl5vZ8UYiwH+7iyep2Z
 INn7ggEA/huYF4smSRk3oR8q7eiISa85mZc85EuVEAjPmBQhpAQ=
 =9UP0
 -----END PGP SIGNATURE-----

Merge tag 'cgroup-for-6.19-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup fix from Tejun Heo:

 - Fix a spurious cpuset warning when disabling remote partition after
   CPU hotplug leaves subpartitions_cpus empty. Guard the warning and
   invalidate affected partitions.

* tag 'cgroup-for-6.19-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cpuset: fix warning when disabling remote partition
2025-12-28 17:19:09 -08:00
Rafael J. Wysocki 684d3b2670 PM: sleep: Fix suspend_test() at the TEST_CORE level
Commit a10ad1b104 ("PM: suspend: Make pm_test delay interruptible by
wakeup events") replaced mdelay() in suspend_test() with msleep() which
does not work at the TEST_CORE test level that calls suspend_test()
while running on one CPU with interrupts off.

Address this by making suspend_test() check if the test level is
suitable for using msleep() and use mdelay() otherwise.

Fixes: a10ad1b104 ("PM: suspend: Make pm_test delay interruptible by wakeup events")
Reported-by: Sebastian Reichel <sebastian.reichel@collabora.com>
Closes: https://lore.kernel.org/linux-pm/aUsAk0k1N9hw8IkY@venus/
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Sebastian Reichel <sebastian.reichel@collabora.com>
Link: https://patch.msgid.link/6251576.lOV4Wx5bFT@rafael.j.wysocki
2025-12-28 13:01:39 +01:00
Linus Torvalds b63f4a4e95 EFI fixes for v6.19 #1
A couple of fixes for EFI regressions introduced this cycle:
 
 - Make EDID handling in the EFI stub mixed mode safe
 
 - Ensure that efi_mm.user_ns has a sane value - this is needed now that
   EFI runtime calls are preemptible on arm64
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQQm/3uucuRGn1Dmh0wbglWLn0tXAUCaU694AAKCRAwbglWLn0t
 XJHoAPsHkA4g+rRXXhqgysWvqQvGWTrWLiVHPeDOs3dAWTxHpQD+KTNVaW4H+BCz
 x+5S3unu5q9PxWOwqeoCi2JvkKKjJwE=
 =fb7r
 -----END PGP SIGNATURE-----

Merge tag 'efi-fixes-for-v6.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi

Pull EFI fixes from Ard Biesheuvel:
 "A couple of fixes for EFI regressions introduced this cycle:

   - Make EDID handling in the EFI stub mixed mode safe

   - Ensure that efi_mm.user_ns has a sane value - this is needed now
     that EFI runtime calls are preemptible on arm64"

* tag 'efi-fixes-for-v6.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
  kthread: Warn if mm_struct lacks user_ns in kthread_use_mm()
  arm64: efi: Fix NULL pointer dereference by initializing user_ns
  efi/libstub: gop: Fix EDID support in mixed-mode
2025-12-26 13:37:11 -08:00
Aaron Kling 92d661c36f irqdomain: Export irq_domain_free_irqs()
Export irq_domain_free_irqs() to allow PCI/MSI drivers like pci-tegra to be
built as a module.

Signed-off-by: Aaron Kling <webgeek1234@gmail.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/20250731-pci-tegra-module-v7-1-cad4b088b8fb@gmail.com
2025-12-26 10:57:17 -06:00
Breno Leitao cfe54f4591 kthread: Warn if mm_struct lacks user_ns in kthread_use_mm()
Add a WARN_ON_ONCE() check to detect mm_struct instances that are
missing user_ns initialization when passed to kthread_use_mm().

When a kthread adopts an mm via kthread_use_mm(), LSM hooks and
capability checks may access current->mm->user_ns for credential
validation. If user_ns is NULL, this leads to a NULL pointer
dereference crash.

This was observed with efi_mm on arm64, where commit a5baf582f4
("arm64/efi: Call EFI runtime services without disabling preemption")
introduced kthread_use_mm(&efi_mm), but efi_mm lacked user_ns
initialization, causing crashes during /proc access.

Adding this warning helps catch similar bugs early during development
rather than waiting for hard-to-debug NULL pointer crashes in
production.

Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2025-12-24 21:32:58 +01:00
Puranjay Mohan b8467290ed bpf: arena: make arena kfuncs any context safe
Make arena related kfuncs any context safe by the following changes:

bpf_arena_alloc_pages() and bpf_arena_reserve_pages():
Replace the usage of the mutex with a rqspinlock for range tree and use
kmalloc_nolock() wherever needed. Use free_pages_nolock() to free pages
from any context.
apply_range_set/clear_cb() with apply_to_page_range() has already made
populating the vm_area in bpf_arena_alloc_pages() any context safe.

bpf_arena_free_pages(): defer the main logic to a workqueue if it is
called from a non-sleepable context.

specialize_kfunc() is used to replace the sleepable arena_free_pages()
with bpf_arena_free_pages_non_sleepable() when the verifier detects the
call is from a non-sleepable context.

In the non-sleepable case, arena_free_pages() queues the address and the
page count to be freed to a lock-less list of struct arena_free_spans
and raises an irq_work. The irq_work handler calls schedules_work() as
it is safe to be called from irq context.  arena_free_worker() (the work
queue handler) iterates these spans and clears ptes, flushes tlb, zaps
pages, and calls __free_page().

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20251222195022.431211-4-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-12-23 11:30:00 -08:00
Puranjay Mohan 360c35f8ff bpf: arena: use kmalloc_nolock() in place of kvcalloc()
To make arena_alloc_pages() safe to be called from any context, replace
kvcalloc() with kmalloc_nolock() so as it doesn't sleep or take any
locks. kmalloc_nolock() returns NULL for allocations larger than
KMALLOC_MAX_CACHE_SIZE, which is (PAGE_SIZE * 2) = 8KB on systems with
4KB pages. So, round down the allocation done by kmalloc_nolock to 1024
* 8 and reuse the array in a loop.

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20251222195022.431211-3-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-12-23 11:29:59 -08:00
Puranjay Mohan c336b0b327 bpf: arena: populate vm_area without allocating memory
vm_area_map_pages() may allocate memory while inserting pages into bpf
arena's vm_area. In order to make bpf_arena_alloc_pages() kfunc
non-sleepable change bpf arena to populate pages without
allocating memory:
- at arena creation time populate all page table levels except
  the last level
- when new pages need to be inserted call apply_to_page_range() again
  with apply_range_set_cb() which will only set_pte_at() those pages and
  will not allocate memory.
- when freeing pages call apply_to_existing_page_range with
  apply_range_clear_cb() to clear the pte for the page to be removed. This
  doesn't free intermediate page table levels.

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20251222195022.431211-2-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-12-23 11:29:59 -08:00
Pingfan Liu a3785ae5d3 kernel/kexec: fix IMA when allocation happens in CMA area
*** Bug description ***

When I tested kexec with the latest kernel, I ran into the following warning:

[   40.712410] ------------[ cut here ]------------
[   40.712576] WARNING: CPU: 2 PID: 1562 at kernel/kexec_core.c:1001 kimage_map_segment+0x144/0x198
[...]
[   40.816047] Call trace:
[   40.818498]  kimage_map_segment+0x144/0x198 (P)
[   40.823221]  ima_kexec_post_load+0x58/0xc0
[   40.827246]  __do_sys_kexec_file_load+0x29c/0x368
[...]
[   40.855423] ---[ end trace 0000000000000000 ]---

*** How to reproduce ***

This bug is only triggered when the kexec target address is allocated in
the CMA area. If no CMA area is reserved in the kernel, use the "cma="
option in the kernel command line to reserve one.

*** Root cause ***
The commit 07d2490297 ("kexec: enable CMA based contiguous
allocation") allocates the kexec target address directly on the CMA area
to avoid copying during the jump. In this case, there is no IND_SOURCE
for the kexec segment.  But the current implementation of
kimage_map_segment() assumes that IND_SOURCE pages exist and map them
into a contiguous virtual address by vmap().

*** Solution ***
If IMA segment is allocated in the CMA area, use its page_address()
directly.

Link: https://lkml.kernel.org/r/20251216014852.8737-2-piliu@redhat.com
Fixes: 07d2490297 ("kexec: enable CMA based contiguous allocation")
Signed-off-by: Pingfan Liu <piliu@redhat.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: Steven Chen <chenste@linux.microsoft.com>
Cc: Mimi Zohar <zohar@linux.ibm.com>
Cc: Roberto Sassu <roberto.sassu@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-12-23 11:23:14 -08:00
Pingfan Liu fe55ea8593 kernel/kexec: change the prototype of kimage_map_segment()
The kexec segment index will be required to extract the corresponding
information for that segment in kimage_map_segment().  Additionally,
kexec_segment already holds the kexec relocation destination address and
size.  Therefore, the prototype of kimage_map_segment() can be changed.

Link: https://lkml.kernel.org/r/20251216014852.8737-1-piliu@redhat.com
Fixes: 07d2490297 ("kexec: enable CMA based contiguous allocation")
Signed-off-by: Pingfan Liu <piliu@redhat.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Mimi Zohar <zohar@linux.ibm.com>
Cc: Roberto Sassu <roberto.sassu@huawei.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: Steven Chen <chenste@linux.microsoft.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-12-23 11:23:13 -08:00
Daniel Gomez ac1c5bc7c4 bpf: crypto: replace -EEXIST with -EBUSY
The -EEXIST error code is reserved by the module loading infrastructure
to indicate that a module is already loaded. When a module's init
function returns -EEXIST, userspace tools like kmod interpret this as
"module already loaded" and treat the operation as successful, returning
0 to the user even though the module initialization actually failed.

This follows the precedent set by commit 54416fd767 ("netfilter:
conntrack: helper: Replace -EEXIST by -EBUSY") which fixed the same
issue in nf_conntrack_helper_register().

This affects bpf_crypto_skcipher module. While the configuration
required to build it as a module is unlikely in practice, it is
technically possible, so fix it for correctness.

Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
Acked-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://lore.kernel.org/r/20251220-dev-module-init-eexists-bpf-v1-1-7f186663dbe7@samsung.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-12-22 22:25:09 -08:00
Puranjay Mohan 342297d511 bpf: allow calling kfuncs from raw_tp programs
Associate raw tracepoint program type with the kfunc tracing hook. This
allows calling kfuncs from raw_tp programs.

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20251222133250.1890587-2-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-12-22 22:23:38 -08:00
Zqiang 714d81423e sched_ext: Avoid multiple irq_work_queue() calls in destroy_dsq()
llist_add() returns true only when adding to an empty list, which indicates
that no IRQ work is currently queued or running. Therefore, we only need to
call irq_work_queue() when llist_add() returns true, to avoid unnecessarily
re-queueing IRQ work that is already pending or executing.

Signed-off-by: Zqiang <qiang.zhang@linux.dev>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-22 17:55:41 -10:00
Zqiang ccaeeb585c sched_ext: Use the resched_cpu() to replace resched_curr() in the bypass_lb_node()
For the PREEMPT_RT kernels, the scx_bypass_lb_timerfn() running in the
preemptible per-CPU ktimer kthread context, this means that the following
scenarios will occur(for x86 platform):

       cpu1                          cpu2
				 ktimer kthread:
                                 ->scx_bypass_lb_timerfn
                                   ->bypass_lb_node
                                     ->for_each_cpu(cpu, resched_mask)

    migration/1:                       by preempt by migration/2:
    multi_cpu_stop()                     multi_cpu_stop()
    ->take_cpu_down()
      ->__cpu_disable()
	->set cpu1 offline

                                       ->rq1 = cpu_rq(cpu1)
                                       ->resched_curr(rq1)
                                         ->smp_send_reschedule(cpu1)
					   ->native_smp_send_reschedule(cpu1)
					     ->if(unlikely(cpu_is_offline(cpu))) {
                					WARN(1, "sched: Unexpected
							reschedule of offline CPU#%d!\n", cpu);
                					return;
        					}

This commit therefore use the resched_cpu() to replace resched_curr()
in the bypass_lb_node() to avoid send-ipi to offline CPUs.

Signed-off-by: Zqiang <qiang.zhang@linux.dev>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-22 17:51:51 -10:00
Chen Ridong 269679bdd1 cpuset: remove dead code in cpuset-v1.c
The commit 6e1d31ce49 ("cpuset: separate generate_sched_domains for v1
and v2") introduced dead code that was originally added for cpuset-v2
partition domain generation. Remove the redundant root_load_balance check.

Fixes: 6e1d31ce49 ("cpuset: separate generate_sched_domains for v1 and v2")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/cgroups/9a442808-ed53-4657-988b-882cc0014c0d@huaweicloud.com/T/
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-22 17:40:17 -10:00
Kees Cook 68e8555858 module/decompress: Avoid open-coded kvrealloc()
Replace open-coded allocate/copy with kvrealloc().

Signed-off-by: Kees Cook <kees@kernel.org>
Reviewed-by: Petr Pavlu <petr.pavlu@suse.com>
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
2025-12-22 16:35:54 +00:00
Petr Pavlu 148519a063 module: Remove SHA-1 support for module signing
SHA-1 is considered deprecated and insecure due to vulnerabilities that can
lead to hash collisions. Most distributions have already been using SHA-2
for module signing because of this. The default was also changed last year
from SHA-1 to SHA-512 in commit f3b93547b9 ("module: sign with sha512
instead of sha1 by default"). This was not reported to cause any issues.
Therefore, it now seems to be a good time to remove SHA-1 support for
module signing.

Commit 16ab7cb582 ("crypto: pkcs7 - remove sha1 support") previously
removed support for reading PKCS#7/CMS signed with SHA-1, along with the
ability to use SHA-1 for module signing. This change broke iwd and was
subsequently completely reverted in commit 203a6763ab ("Revert "crypto:
pkcs7 - remove sha1 support""). However, dropping only the support for
using SHA-1 for module signing is unrelated and can still be done
separately.

Note that this change only removes support for new modules to be SHA-1
signed, but already signed modules can still be loaded.

Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Reviewed-by: Aaron Tomlin <atomlin@atomlin.com>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
2025-12-22 16:35:53 +00:00
Marco Crivellari 581ac2d4a5 module: replace use of system_wq with system_dfl_wq
Currently if a user enqueues a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use
again of WORK_CPU_UNBOUND.

This lack of consistency cannot be addressed without refactoring the API.

This continues the effort to refactor workqueue APIs, which began with
the introduction of new workqueues and a new alloc_workqueue flag in:

commit 128ea9f6cc ("workqueue: Add system_percpu_wq and system_dfl_wq")
commit 930c2ea566 ("workqueue: Add new WQ_PERCPU flag")

Switch to using system_dfl_wq, the new unbound workqueue, because the
users do not benefit from a per-cpu workqueue.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Reviewed-by: Petr Pavlu <petr.pavlu@suse.com>
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
2025-12-22 16:35:53 +00:00
Petr Pavlu 3cb0c3bdea params: Replace __modinit with __init_or_module
Remove the custom __modinit macro from kernel/params.c and instead use the
common __init_or_module macro from include/linux/module.h. Both provide the
same functionality.

Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Reviewed-by: Aaron Tomlin <atomlin@atomlin.com>
Reviewed-by: Daniel Gomez <da.gomez@samsung.com>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
2025-12-22 16:35:53 +00:00
Linus Torvalds 610192c229 Fix IRQ thread affinity flags setup regression.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmlHrpkRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1hBGBAAuZG/SpA5k4Aexg3rfisxfenj4s4qeTO3
 HqMXKU5M1zEWOlG2RUx/+Gn0RCb2U/eHzwAXtu85TJjYPQ3tCN+G49/VbbKZm3/J
 wM1gvGvpdX9miSi5S2e6Ete0zf0bYXCP4dJOreekyzzngXd9lqncBA/u+vimt00l
 1axg7osB73puSKjeg/g8t3CIjRvQ6XgD9HC5dAlUebmFQz6KSgM6yZXac/SXbYqn
 nyGN9EIq7HegrBAOLW16viUK5dq5WmmUS02MA728soLEwwxn/yDr44MjPaqPAJlQ
 2cwDHiKvZRtzI9wDlPaVH/Dg2OoXvrEEjrjfrzTWpjHn6p8jjluuUYBqtTIJF+g7
 rUnbtVz4GwOMg7SetMHW9YOMo3R0LXC5ly6RFycRdYqgdnEwTML5Jh4qjxWBiqJD
 98DtPGxUEdGKAUkOE4wLSEJy9EgjSXMm0nc8V6MJMl6afEoKBSU/w73nxc+VVHvw
 x7B/G/qXhJdSqO1aEftHbeZUlpOc2NSMzMXEXqCfE7B/S3vfCbC0PSLQnEq90AjY
 yTT7gI9pI37ULMqqf1b0pfosQuBu0cdkT5r++RFuaZLRl/kBefIyRiWcgwNDAs5C
 /YNeSH+qU043/f55vOoHna8B+W+VSkTGZCQpWcYcRhYCxVIWupdY1WBH9AqBg1R4
 Ar81aRZK2hI=
 =HT4H
 -----END PGP SIGNATURE-----

Merge tag 'irq-urgent-2025-12-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq fix from Ingo Molnar:
 "Fix IRQ thread affinity flags setup regression"

* tag 'irq-urgent-2025-12-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq: Don't overwrite interrupt thread flags on setup
2025-12-21 14:34:13 -08:00
Matt Bobrowski 94e948b7e6 bpf: annotate file argument as __nullable in bpf_lsm_mmap_file
As reported in [0], anonymous memory mappings are not backed by a
struct file instance. Consequently, the struct file pointer passed to
the security_mmap_file() LSM hook is NULL in such cases.

The BPF verifier is currently unaware of this, allowing BPF LSM
programs to dereference this struct file pointer without needing to
perform an explicit NULL check. This leads to potential NULL pointer
dereference and a kernel crash.

Add a strong override for bpf_lsm_mmap_file() which annotates the
struct file pointer parameter with the __nullable suffix. This
explicitly informs the BPF verifier that this pointer (PTR_MAYBE_NULL)
can be NULL, forcing BPF LSM programs to perform a check on it before
dereferencing it.

[0] https://lore.kernel.org/bpf/5e460d3c.4c3e9.19adde547d8.Coremail.kaiyanm@hust.edu.cn/

Reported-by: Kaiyan Mei <M202472210@hust.edu.cn>
Reported-by: Yinhao Hu <dddddd@hust.edu.cn>
Reviewed-by: Dongliang Mu <dzm91@hust.edu.cn>
Closes: https://lore.kernel.org/bpf/5e460d3c.4c3e9.19adde547d8.Coremail.kaiyanm@hust.edu.cn/
Signed-off-by: Matt Bobrowski <mattbobrowski@google.com>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20251216133000.3690723-1-mattbobrowski@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-12-21 10:56:33 -08:00
Puranjay Mohan c3e34f88f9 bpf: arm64: Optimize recursion detection by not using atomics
BPF programs detect recursion using a per-CPU 'active' flag in struct
bpf_prog. The trampoline currently sets/clears this flag with atomic
operations.

On some arm64 platforms (e.g., Neoverse V2 with LSE), per-CPU atomic
operations are relatively slow. Unlike x86_64 - where per-CPU updates
can avoid cross-core atomicity, arm64 LSE atomics are always atomic
across all cores, which is unnecessary overhead for strictly per-CPU
state.

This patch removes atomics from the recursion detection path on arm64 by
changing 'active' to a per-CPU array of four u8 counters, one per
context: {NMI, hard-irq, soft-irq, normal}. The running context uses a
non-atomic increment/decrement on its element.  After increment,
recursion is detected by reading the array as a u32 and verifying that
only the expected element changed; any change in another element
indicates inter-context recursion, and a value > 1 in the same element
indicates same-context recursion.

For example, starting from {0,0,0,0}, a normal-context trigger changes
the array to {0,0,0,1}.  If an NMI arrives on the same CPU and triggers
the program, the array becomes {1,0,0,1}. When the NMI context checks
the u32 against the expected mask for normal (0x00000001), it observes
0x01000001 and correctly reports recursion. Same-context recursion is
detected analogously.

Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20251219184422.2899902-3-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-12-21 10:54:37 -08:00
Puranjay Mohan 93f0d09697 bpf: move recursion detection logic to helpers
BPF programs detect recursion by doing atomic inc/dec on a per-cpu
active counter from the trampoline. Create two helpers for operations on
this active counter, this makes it easy to changes the recursion
detection logic in future.

This commit makes no functional changes.

Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20251219184422.2899902-2-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-12-21 10:54:37 -08:00
Zqiang 12494e5e2a sched_ext: Fix some comments in ext.c
This commit update balance_scx() in the comments to balance_one().

Signed-off-by: Zqiang <qiang.zhang@linux.dev>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-19 13:11:22 -10:00
Peter Zijlstra 6ab7973f25 sched/fair: Fix sched_avg fold
After the robot reported a regression wrt commit: 089d84203a ("sched/fair:
Fold the sched_avg update"), Shrikanth noted that two spots missed a factor
se_weight().

Fixes: 089d84203a ("sched/fair: Fold the sched_avg update")
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202512181208.753b9f6e-lkp@intel.com
Debugged-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251218102020.GO3707891@noisy.programming.kicks-ass.net
2025-12-19 09:09:38 +01:00
Peter Zijlstra 01122b8936 perf: Use EXPORT_SYMBOL_FOR_KVM() for the mediated APIs
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251208115156.GE3707891@noisy.programming.kicks-ass.net
2025-12-19 08:54:59 +01:00
Greg Kroah-Hartman 90876d9b37 irqdomain: Fix up const problem in irq_domain_set_name()
In irq_domain_set_name() a const pointer is passed in, and then the
const is "lost" when container_of() is called.  Fix this up by properly
preserving the const pointer attribute when container_of() is used to
enforce the fact that this pointer should not have anything at it
changed.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/2025121731-facing-unhitched-63ae@gregkh
2025-12-19 00:39:39 +01:00
Linus Torvalds dd9b004b7f tracing fixes for v6.19:
- Add Documentation/core-api/tracepoint.rst to TRACING in MAINTAINERS file
 
   Updates to the tracepoint.rst document should be reviewed by the
   tracing maintainers.
 
 - Fix warning triggered by perf attaching to synthetic events
 
   The synthetic events do not add a function to be registered when
   perf attaches to them. This causes a warning when perf registers
   a synthetic event and passes a NULL pointer to the tracepoint register
   function. Ideally synthetic events should be updated to work with
   perf, but as that's a feature and not a bug fix, simply now return
   -ENODEV when perf tries to register an event that has a NULL pointer
   for its function. This no longer causes a kernel warning and simply
   causes the perf code to fail with an error message.
 
 - Fix 32bit overflow in option flag test
 
   The option's flags changed from 32 bits in size to 64 bits in size.
   Fix one of the places that shift 1 by the option bit number to
   to be 1ULL.
 
 - Fix the output of printing the direct jmp functions
 
   The enabled_functions that shows how functions are being attached by
   ftrace wasn't updated to accommodate the new direct jmp trampolines
   that set the LSB of the pointer, and outputs garbage. Update the
   output to handle the direct jmp trampolines.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaURuVxQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qhU1AQCe3AwEiPg4PAooR7D3TDgj4ypJXUwS
 J7PM0cvniKhJ2AD/VpSQCyj7sB8UQdVE04SK3hxduVwHIzrBhDe2voC8eQQ=
 =C9zF
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Add Documentation/core-api/tracepoint.rst to TRACING in MAINTAINERS
   file

   Updates to the tracepoint.rst document should be reviewed by the
   tracing maintainers.

 - Fix warning triggered by perf attaching to synthetic events

   The synthetic events do not add a function to be registered when perf
   attaches to them. This causes a warning when perf registers a
   synthetic event and passes a NULL pointer to the tracepoint register
   function.

   Ideally synthetic events should be updated to work with perf, but as
   that's a feature and not a bug fix, simply now return -ENODEV when
   perf tries to register an event that has a NULL pointer for its
   function. This no longer causes a kernel warning and simply causes
   the perf code to fail with an error message.

 - Fix 32bit overflow in option flag test

   The option's flags changed from 32 bits in size to 64 bits in size.
   Fix one of the places that shift 1 by the option bit number to to be
   1ULL.

 - Fix the output of printing the direct jmp functions

   The enabled_functions that shows how functions are being attached by
   ftrace wasn't updated to accommodate the new direct jmp trampolines
   that set the LSB of the pointer, and outputs garbage. Update the
   output to handle the direct jmp trampolines.

* tag 'trace-v6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  ftrace: Fix address for jmp mode in t_show()
  tracing: Fix UBSAN warning in __remove_instance()
  tracing: Do not register unsupported perf events
  MAINTAINERS: add tracepoint core-api doc files to TRACING
2025-12-19 09:30:55 +12:00
Linus Torvalds 7b8e9264f5 Including fixes from netfilter and CAN.
Current release - regressions:
 
   - netfilter: nf_conncount: fix leaked ct in error paths
 
   - sched: act_mirred: fix loop detection
 
   - sctp: fix potential deadlock in sctp_clone_sock()
 
   - can: fix build dependency
 
   - eth: mlx5e: do not update BQL of old txqs during channel reconfiguration
 
 Previous releases - regressions:
 
   - sched: ets: always remove class from active list before deleting it
 
   - inet: frags: flush pending skbs in fqdir_pre_exit()
 
   - netfilter:  nf_nat: remove bogus direction check
 
   - mptcp:
     - schedule rtx timer only after pushing data
     - avoid deadlock on fallback while reinjecting
 
   - can: gs_usb: fix error handling
 
   - eth: mlx5e:
     - avoid unregistering PSP twice
     - fix double unregister of HCA_PORTS component
 
   - eth: bnxt_en: fix XDP_TX path
 
   - eth: mlxsw: fix use-after-free when updating multicast route stats
 
 Previous releases - always broken:
 
   - ethtool: avoid overflowing userspace buffer on stats query
 
   - openvswitch: fix middle attribute validation in push_nsh() action
 
   - eth: mlx5: fw_tracer, validate format string parameters
 
   - eth: mlxsw: spectrum_router: fix neighbour use-after-free
 
   - eth: ipvlan: ignore PACKET_LOOPBACK in handle_mode_l2()
 
 Misc:
 
   - Jozsef Kadlecsik retires from maintaining netfilter
 
   - tools: ynl: fix build on systems with old kernel headers
 
 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmlEPS0SHHBhYmVuaUBy
 ZWRoYXQuY29tAAoJECkkeY3MjxOkHLEP/3TnVI9qLivOGsZ48bn5UUcIaIlX0wiw
 i5gwpUhbtW3zcLSphO0Nh/CmProme6dFhaMOOkk48bKAdIxRpOMbCC20bfYDLyd0
 ZJTUQqheKpuI23vOnhfs2TTOkcqz6cM7txUq681taQ8Nfo8Dbf0fgOO2HT5ljRD+
 JXlxrdvicZDtve68sSdnsAj5M15EPQGKTPMyPkymBmCKnCMQK9SQKaeTEtaW6NIO
 yM0KqDSNoulDc/6LYMhx8DTtyE7yTiTxwe2NixjdyYljXsk0KiRDirCzxnZuSgh8
 oh+cFLdFDq4mwdYEKjgC5c3ifQyLEpZvwzlY5MKoobsVnT5SbigeSK53l5rEkO3V
 sM84lITHfqTJvlid0AF/ixEc6iWwV7nGRBHh2FXNbfoIKt45eF77jPi9YFsq6Z95
 vlCzYIbY0f2L1y3mPvZbzGQbh2Z12b5kyK8QA1j7SK+zNzxgXbf4+ZtYxHg7O3Ne
 gecmIpTKXMWodaZyfsRQPjR/F6UIlMqsgl9Ci9bfUw+XwL8x+7bJxQAQz+yVjLla
 ng14BItiYKBavcPZBjYlhGKqD1fzGhVZqQecrCkF0VTbMusRd9RcwytU1NG4QGDx
 V5aL28ht85KtMednEWOBkrg+PeXnNyZHzLAf2Xtx3UkaGgiDC8G4IxUv3orlOLD1
 sFPfZnSiGCof
 =6pla
 -----END PGP SIGNATURE-----

Merge tag 'net-6.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
 "Including fixes from netfilter and CAN.

  Current release - regressions:

   - netfilter: nf_conncount: fix leaked ct in error paths

   - sched: act_mirred: fix loop detection

   - sctp: fix potential deadlock in sctp_clone_sock()

   - can: fix build dependency

   - eth: mlx5e: do not update BQL of old txqs during channel
     reconfiguration

  Previous releases - regressions:

   - sched: ets: always remove class from active list before deleting it

   - inet: frags: flush pending skbs in fqdir_pre_exit()

   - netfilter: nf_nat: remove bogus direction check

   - mptcp:
      - schedule rtx timer only after pushing data
      - avoid deadlock on fallback while reinjecting

   - can: gs_usb: fix error handling

   - eth:
      - mlx5e:
         - avoid unregistering PSP twice
         - fix double unregister of HCA_PORTS component
      - bnxt_en: fix XDP_TX path
      - mlxsw: fix use-after-free when updating multicast route stats

  Previous releases - always broken:

   - ethtool: avoid overflowing userspace buffer on stats query

   - openvswitch: fix middle attribute validation in push_nsh() action

   - eth:
      - mlx5: fw_tracer, validate format string parameters
      - mlxsw: spectrum_router: fix neighbour use-after-free
      - ipvlan: ignore PACKET_LOOPBACK in handle_mode_l2()

  Misc:

   - Jozsef Kadlecsik retires from maintaining netfilter

   - tools: ynl: fix build on systems with old kernel headers"

* tag 'net-6.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (83 commits)
  net: hns3: add VLAN id validation before using
  net: hns3: using the num_tqps to check whether tqp_index is out of range when vf get ring info from mbx
  net: hns3: using the num_tqps in the vf driver to apply for resources
  net: enetc: do not transmit redirected XDP frames when the link is down
  selftests/tc-testing: Test case exercising potential mirred redirect deadlock
  net/sched: act_mirred: fix loop detection
  sctp: Clear inet_opt in sctp_v6_copy_ip_options().
  sctp: Fetch inet6_sk() after setting ->pinet6 in sctp_clone_sock().
  net/handshake: duplicate handshake cancellations leak socket
  net/mlx5e: Don't include PSP in the hard MTU calculations
  net/mlx5e: Do not update BQL of old txqs during channel reconfiguration
  net/mlx5e: Trigger neighbor resolution for unresolved destinations
  net/mlx5e: Use ip6_dst_lookup instead of ipv6_dst_lookup_flow for MAC init
  net/mlx5: Serialize firmware reset with devlink
  net/mlx5: fw_tracer, Handle escaped percent properly
  net/mlx5: fw_tracer, Validate format string parameters
  net/mlx5: Drain firmware reset in shutdown callback
  net/mlx5: fw reset, clear reset requested on drain_fw_reset
  net: dsa: mxl-gsw1xx: manually clear RANEG bit
  net: dsa: mxl-gsw1xx: fix .shutdown driver operation
  ...
2025-12-19 07:55:35 +12:00
Chen Ridong 7cc1720589 cpuset: remove v1-specific code from generate_sched_domains
Following the introduction of cpuset1_generate_sched_domains() for v1
in the previous patch, v1-specific logic can now be removed from the
generic generate_sched_domains(). This patch cleans up the v1-only
code and ensures uf_node is only visible when CONFIG_CPUSETS_V1=y.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-18 08:36:15 -10:00
Chen Ridong 6e1d31ce49 cpuset: separate generate_sched_domains for v1 and v2
The generate_sched_domains() function currently handles both v1 and v2
logic. However, the underlying mechanisms for building scheduler domains
differ significantly between the two versions. For cpuset v2, scheduler
domains are straightforwardly derived from valid partitions, whereas
cpuset v1 employs a more complex union-find algorithm to merge overlapping
cpusets. Co-locating these implementations complicates maintenance.

This patch, along with subsequent ones, aims to separate the v1 and v2
logic. For ease of review, this patch first copies the
generate_sched_domains() function into cpuset-v1.c as
cpuset1_generate_sched_domains() and removes v2-specific code. Common
helpers and top_cpuset are declared in cpuset-internal.h. When operating
in v1 mode, the code now calls cpuset1_generate_sched_domains().

Currently there is some code duplication, which will be largely eliminated
once v1-specific code is removed from v2 in the following patch.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-18 08:36:15 -10:00
Chen Ridong cb33f8814c cpuset: move update_domain_attr_tree to cpuset_v1.c
Since relax_domain_level is only applicable to v1, move
update_domain_attr_tree() to cpuset-v1.c, which solely updates
relax_domain_level,

Additionally, relax_domain_level is now initialized in cpuset1_inited.
Accordingly, the initialization of relax_domain_level in top_cpuset is
removed. The unnecessary remote_partition initialization in top_cpuset
is also cleaned up.

As a result, relax_domain_level can be defined in cpuset only when
CONFIG_CPUSETS_V1=y.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-18 08:36:15 -10:00
Chen Ridong 4ef42c645f cpuset: add cpuset1_init helper for v1 initialization
This patch introduces the cpuset1_init helper in cpuset_v1.c to initialize
v1-specific fields, including the fmeter and relax_domain_level members.

The relax_domain_level related code will be moved to cpuset_v1.c in a
subsequent patch. After this move, v1-specific members will only be
visible when CONFIG_CPUSETS_V1=y.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-18 08:36:15 -10:00
Chen Ridong 56805c1bb1 cpuset: add cpuset1_online_css helper for v1-specific operations
This commit introduces the cpuset1_online_css helper to centralize
v1-specific handling during cpuset online. It performs operations such as
updating the CS_SPREAD_PAGE, CS_SPREAD_SLAB, and CGRP_CPUSET_CLONE_CHILDREN
flags, which are unique to the cpuset v1 control group interface.

The helper is now placed in cpuset-v1.c to maintain clear separation
between v1 and v2 logic.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-18 08:36:08 -10:00
Chen Ridong 14c11e1b2a cpuset: add lockdep_assert_cpuset_lock_held helper
Add lockdep_assert_cpuset_lock_held() to allow other subsystems to verify
that cpuset_mutex is held.

Suggested-by: Waiman Long <longman@redhat.com>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-18 08:35:57 -10:00
Chen Ridong aa7d3a56a2 cpuset: fix warning when disabling remote partition
A warning was triggered as follows:

WARNING: kernel/cgroup/cpuset.c:1651 at remote_partition_disable+0xf7/0x110
RIP: 0010:remote_partition_disable+0xf7/0x110
RSP: 0018:ffffc90001947d88 EFLAGS: 00000206
RAX: 0000000000007fff RBX: ffff888103b6e000 RCX: 0000000000006f40
RDX: 0000000000006f00 RSI: ffffc90001947da8 RDI: ffff888103b6e000
RBP: ffff888103b6e000 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000001 R11: ffff88810b2e2728 R12: ffffc90001947da8
R13: 0000000000000000 R14: ffffc90001947da8 R15: ffff8881081f1c00
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f55c8bbe0b2 CR3: 000000010b14c000 CR4: 00000000000006f0
Call Trace:
 <TASK>
 update_prstate+0x2d3/0x580
 cpuset_partition_write+0x94/0xf0
 kernfs_fop_write_iter+0x147/0x200
 vfs_write+0x35d/0x500
 ksys_write+0x66/0xe0
 do_syscall_64+0x6b/0x390
 entry_SYSCALL_64_after_hwframe+0x4b/0x53
RIP: 0033:0x7f55c8cd4887

Reproduction steps (on a 16-CPU machine):

        # cd /sys/fs/cgroup/
        # mkdir A1
        # echo +cpuset > A1/cgroup.subtree_control
        # echo "0-14" > A1/cpuset.cpus.exclusive
        # mkdir A1/A2
        # echo "0-14" > A1/A2/cpuset.cpus.exclusive
        # echo "root" > A1/A2/cpuset.cpus.partition
        # echo 0 > /sys/devices/system/cpu/cpu15/online
        # echo member > A1/A2/cpuset.cpus.partition

When CPU 15 is offlined, subpartitions_cpus gets cleared because no CPUs
remain available for the top_cpuset, forcing partitions to share CPUs with
the top_cpuset. In this scenario, disabling the remote partition triggers
a warning stating that effective_xcpus is not a subset of
subpartitions_cpus. Partitions should be invalidated in this case to
inform users that the partition is now invalid(cpus are shared with
top_cpuset).

To fix this issue:
1. Only emit the warning only if subpartitions_cpus is not empty and the
   effective_xcpus is not a subset of subpartitions_cpus.
2. During the CPU hotplug process, invalidate partitions if
   subpartitions_cpus is empty.

Fixes: f62a5d3936 ("cgroup/cpuset: Remove remote_partition_check() & make update_cpumasks_hier() handle remote partition")
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-18 06:38:54 -10:00
John Stultz de2c5a1523 test-ww_mutex: Allow test to be run (and re-run) from userland
In cases where the ww_mutex test was occasionally tripping on
hard to find issues, leaving qemu in a reboot loop was my best
way to reproduce problems. These reboots however wasted time
when I just wanted to run the test-ww_mutex logic.

So tweak the test-ww_mutex test so that it can be re-triggered
via a sysfs file, so the test can be run repeatedly without
doing module loads or restarting.

This has been particularly valuable to stressing and finding
issues with the proxy-exec series.

To use, run as root:
  echo 1 > /sys/kernel/test_ww_mutex/run_tests

Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251205013515.759030-4-jstultz@google.com
2025-12-18 10:45:23 +01:00
John Stultz d327e7166e test-ww_mutex: Move work to its own UNBOUND workqueue
The test-ww_mutex test already allocates its own workqueue
so be sure to use it for the mtx.work and abba.work rather
then the default system workqueue.

This resolves numerous messages of the sort:
"workqueue: test_abba_work hogged CPU... consider switching to WQ_UNBOUND"
"workqueue: test_mutex_work hogged CPU... consider switching to WQ_UNBOUND"

Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251205013515.759030-3-jstultz@google.com
2025-12-18 10:45:23 +01:00
John Stultz 34d80c93a5 test-ww_mutex: Extend ww_mutex tests to test both classes of ww_mutexes
Currently the test-ww_mutex tool only utilizes the wait-die
class of ww_mutexes, and thus isn't very helpful in exercising
the wait-wound class of ww_mutexes.

So extend the test to exercise both classes of ww_mutexes for
all of the subtests.

Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251205013515.759030-2-jstultz@google.com
2025-12-18 10:45:23 +01:00
Menglong Dong 39263f986d ftrace: Fix address for jmp mode in t_show()
The address from ftrace_find_rec_direct() is printed directly in t_show().
This can mislead symbol offsets if it has the "jmp" bit in the last bit.

Fix this by printing the address that returned by ftrace_jmp_get().

Link: https://patch.msgid.link/20251217030053.80343-1-dongml2@chinatelecom.cn
Fixes: 25e4e3565d ("ftrace: Introduce FTRACE_OPS_FL_JMP")
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-12-17 17:53:59 -05:00
Darrick J. Wong 74bf97e9a8 tracing: Fix UBSAN warning in __remove_instance()
xfs/558 triggers the following UBSAN warning:

 ------------[ cut here ]------------
 UBSAN: shift-out-of-bounds in kernel/trace/trace.c:10510:10
 shift exponent 32 is too large for 32-bit type 'int'
 CPU: 1 UID: 0 PID: 888674 Comm: rmdir Not tainted 6.19.0-rc1-xfsx #rc1 PREEMPT(lazy)  dbf607ef4c142c563f76d706e71af9731d7b9c90
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-4.module+el8.8.0+21164+ed375313 04/01/2014
 Call Trace:
  <TASK>
  dump_stack_lvl+0x4a/0x70
  ubsan_epilogue+0x5/0x2b
  __ubsan_handle_shift_out_of_bounds.cold+0x5e/0x113
  __remove_instance.part.0.constprop.0.cold+0x18/0x26f
  instance_rmdir+0xf3/0x110
  tracefs_syscall_rmdir+0x4d/0x90
  vfs_rmdir+0x139/0x230
  do_rmdir+0x143/0x230
  __x64_sys_rmdir+0x1d/0x20
  do_syscall_64+0x44/0x230
  entry_SYSCALL_64_after_hwframe+0x4b/0x53
 RIP: 0033:0x7f7ae8e51f17
 Code: f0 ff ff 73 01 c3 48 8b 0d de 2e 0e 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 b8 54 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 01 c3 48 8b 15 b1 2e 0e 00 f7 d8 64 89 02 b8
 RSP: 002b:00007ffd90743f08 EFLAGS: 00000246 ORIG_RAX: 0000000000000054
 RAX: ffffffffffffffda RBX: 00007ffd907440f8 RCX: 00007f7ae8e51f17
 RDX: 00007f7ae8f3c5c0 RSI: 00007ffd90744a21 RDI: 00007ffd90744a21
 RBP: 0000000000000002 R08: 0000000000000000 R09: 0000000000000000
 R10: 00007f7ae8f35ac0 R11: 0000000000000246 R12: 00007ffd90744a21
 R13: 0000000000000001 R14: 00007f7ae8f8b000 R15: 000055e5283e6a98
  </TASK>
 ---[ end trace ]---

whilst tearing down an ftrace instance.  TRACE_FLAGS_MAX_SIZE is now 64bit,
so the mask comparison expression must be typecast to a u64 value to
avoid an overflow.  AFAICT, ZEROED_TRACE_FLAGS is already cast to ULL
so this is ok.

Link: https://patch.msgid.link/20251216174950.GA7705@frogsfrogsfrogs
Fixes: bbec8e28ca ("tracing: Allow tracer to add more than 32 options")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-12-17 17:50:04 -05:00
Steven Rostedt ef7f38df89 tracing: Do not register unsupported perf events
Synthetic events currently do not have a function to register perf events.
This leads to calling the tracepoint register functions with a NULL
function pointer which triggers:

 ------------[ cut here ]------------
 WARNING: kernel/tracepoint.c:175 at tracepoint_add_func+0x357/0x370, CPU#2: perf/2272
 Modules linked in: kvm_intel kvm irqbypass
 CPU: 2 UID: 0 PID: 2272 Comm: perf Not tainted 6.18.0-ftest-11964-ge022764176fc-dirty #323 PREEMPTLAZY
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.17.0-debian-1.17.0-1 04/01/2014
 RIP: 0010:tracepoint_add_func+0x357/0x370
 Code: 28 9c e8 4c 0b f5 ff eb 0f 4c 89 f7 48 c7 c6 80 4d 28 9c e8 ab 89 f4 ff 31 c0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 cc cc cc cc cc <0f> 0b 49 c7 c6 ea ff ff ff e9 ee fe ff ff 0f 0b e9 f9 fe ff ff 0f
 RSP: 0018:ffffabc0c44d3c40 EFLAGS: 00010246
 RAX: 0000000000000001 RBX: ffff9380aa9e4060 RCX: 0000000000000000
 RDX: 000000000000000a RSI: ffffffff9e1d4a98 RDI: ffff937fcf5fd6c8
 RBP: 0000000000000001 R08: 0000000000000007 R09: ffff937fcf5fc780
 R10: 0000000000000003 R11: ffffffff9c193910 R12: 000000000000000a
 R13: ffffffff9e1e5888 R14: 0000000000000000 R15: ffffabc0c44d3c78
 FS:  00007f6202f5f340(0000) GS:ffff93819f00f000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 000055d3162281a8 CR3: 0000000106a56003 CR4: 0000000000172ef0
 Call Trace:
  <TASK>
  tracepoint_probe_register+0x5d/0x90
  synth_event_reg+0x3c/0x60
  perf_trace_event_init+0x204/0x340
  perf_trace_init+0x85/0xd0
  perf_tp_event_init+0x2e/0x50
  perf_try_init_event+0x6f/0x230
  ? perf_event_alloc+0x4bb/0xdc0
  perf_event_alloc+0x65a/0xdc0
  __se_sys_perf_event_open+0x290/0x9f0
  do_syscall_64+0x93/0x7b0
  ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
  ? trace_hardirqs_off+0x53/0xc0
  entry_SYSCALL_64_after_hwframe+0x76/0x7e

Instead, have the code return -ENODEV, which doesn't warn and has perf
error out with:

 # perf record -e synthetic:futex_wait
Error:
The sys_perf_event_open() syscall returned with 19 (No such device) for event (synthetic:futex_wait).
"dmesg | grep -i perf" may provide additional information.

Ideally perf should support synthetic events, but for now just fix the
warning. The support can come later.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: https://patch.msgid.link/20251216182440.147e4453@gandalf.local.home
Fixes: 4b147936fa ("tracing: Add support for 'synthetic' events")
Reported-by: Ian Rogers <irogers@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-12-17 15:47:35 -05:00
Peter Zijlstra 3cb3c2f688 perf: Clean up mediated vPMU accounting
The mediated_pmu_account_event() and perf_create_mediated_pmu()
functions implement the exclusion between '!exclude_guest' counters
and mediated vPMUs. Their implementation is basically identical,
except mirrored in what they count/check.

Make sure the actual implementations reflect this similarity.

Notably:
 - while perf_release_mediated_pmu() has an underflow check;
   mediated_pmu_unaccount_event() did not.
 - while perf_create_mediated_pmu() has an inc_not_zero() path;
   mediated_pmu_account_event() did not.

Also, the inc_not_zero() path can be outsite of
perf_mediated_pmu_mutex. The mutex must guard the 0->1 (of either
nr_include_guest_events or nr_mediated_pmu_vms) transition, but once a
counter is already non-zero, it can safely be incremented further.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251208115156.GE3707891@noisy.programming.kicks-ass.net
2025-12-17 13:31:09 +01:00
Jens Remus 2652f9a4b0 unwind_user/fp: Use dummies instead of ifdef
This simplifies the code.   unwind_user_next_fp() does not need to
return -EINVAL if config option HAVE_UNWIND_USER_FP is disabled, as
unwind_user_start() will then not select this unwind method and
unwind_user_next() will therefore not call it.

Provide (1) a dummy definition of ARCH_INIT_USER_FP_FRAME, if the unwind
user method HAVE_UNWIND_USER_FP is not enabled, (2) a common fallback
definition of unwind_user_at_function_start() which returns false, and
(3) a common dummy definition of ARCH_INIT_USER_FP_ENTRY_FRAME.

Note that enabling the config option HAVE_UNWIND_USER_FP without
defining ARCH_INIT_USER_FP_FRAME triggers a compile error, which is
helpful when implementing support for this unwind user method in an
architecture.  Enabling the config option when providing an arch-
specific unwind_user_at_function_start() definition makes it necessary
to also provide an arch-specific ARCH_INIT_USER_FP_ENTRY_FRAME
definition.

Signed-off-by: Jens Remus <jremus@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251208160352.1363040-3-jremus@linux.ibm.com
2025-12-17 13:31:07 +01:00
Jens Remus 2d6ad925fb unwind_user: Enhance comments on get CFA, FP, and RA
Move the comment "Get the Canonical Frame Address (CFA)" to the top
of the sequence of statements that actually get the CFA.  Reword the
comment "Find the Return Address (RA)" to "Get ...", as the statements
actually get the RA.  Add a respective comment to the statements that
get the FP.  This will be useful once future commits extend the logic
to get the RA and FP.

While at it align the comment on the "stack going in wrong direction"
check to the following one on the "address is word aligned" check.

Signed-off-by: Jens Remus <jremus@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251208160352.1363040-2-jremus@linux.ibm.com
2025-12-17 13:31:07 +01:00
Sean Christopherson a05385d84b perf/x86/core: Register a new vector for handling mediated guest PMIs
Wire up system vector 0xf5 for handling PMIs (i.e. interrupts delivered
through the LVTPC) while running KVM guests with a mediated PMU.  Perf
currently delivers all PMIs as NMIs, e.g. so that events that trigger while
IRQs are disabled aren't delayed and generate useless records, but due to
the multiplexing of NMIs throughout the system, correctly identifying NMIs
for a mediated PMU is practically infeasible.

To (greatly) simplify identifying guest mediated PMU PMIs, perf will
switch the CPU's LVTPC between PERF_GUEST_MEDIATED_PMI_VECTOR and NMI when
guest PMU context is loaded/put.  I.e. PMIs that are generated by the CPU
while the guest is active will be identified purely based on the IRQ
vector.

Route the vector through perf, e.g. as opposed to letting KVM attach a
handler directly a la posted interrupt notification vectors, as perf owns
the LVTPC and thus is the rightful owner of PERF_GUEST_MEDIATED_PMI_VECTOR.
Functionally, having KVM directly own the vector would be fine (both KVM
and perf will be completely aware of when a mediated PMU is active), but
would lead to an undesirable split in ownership: perf would be responsible
for installing the vector, but not handling the resulting IRQs.

Add a new perf_guest_info_callbacks hook (and static call) to allow KVM to
register its handler with perf when running guests with mediated PMUs.

Note, because KVM always runs guests with host IRQs enabled, there is no
danger of a PMI being delayed from the guest's perspective due to using a
regular IRQ instead of an NMI.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://patch.msgid.link/20251206001720.468579-9-seanjc@google.com
2025-12-17 13:31:05 +01:00
Kan Liang 42457a7fb6 perf: Add APIs to load/put guest mediated PMU context
Add exported APIs to load/put a guest mediated PMU context.  KVM will
load the guest PMU shortly before VM-Enter, and put the guest PMU shortly
after VM-Exit.

On the perf side of things, schedule out all exclude_guest events when the
guest context is loaded, and schedule them back in when the guest context
is put.  I.e. yield the hardware PMU resources to the guest, by way of KVM.

Note, perf is only responsible for managing host context.  KVM is
responsible for loading/storing guest state to/from hardware.

[sean: shuffle patches around, write changelog]
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://patch.msgid.link/20251206001720.468579-8-seanjc@google.com
2025-12-17 13:31:05 +01:00
Kan Liang 4593b4b6e2 perf: Add a EVENT_GUEST flag
Current perf doesn't explicitly schedule out all exclude_guest events
while the guest is running. There is no problem with the current
emulated vPMU. Because perf owns all the PMU counters. It can mask the
counter which is assigned to an exclude_guest event when a guest is
running (Intel way), or set the corresponding HOSTONLY bit in evsentsel
(AMD way). The counter doesn't count when a guest is running.

However, either way doesn't work with the introduced mediated vPMU.
A guest owns all the PMU counters when it's running. The host should not
mask any counters. The counter may be used by the guest. The evsentsel
may be overwritten.

Perf should explicitly schedule out all exclude_guest events to release
the PMU resources when entering a guest, and resume the counting when
exiting the guest.

It's possible that an exclude_guest event is created when a guest is
running. The new event should not be scheduled in as well.

The ctx time is shared among different PMUs. The time cannot be stopped
when a guest is running. It is required to calculate the time for events
from other PMUs, e.g., uncore events. Add timeguest to track the guest
run time. For an exclude_guest event, the elapsed time equals
the ctx time - guest time.
Cgroup has dedicated times. Use the same method to deduct the guest time
from the cgroup time as well.

[sean: massage comments]
Co-developed-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://patch.msgid.link/20251206001720.468579-7-seanjc@google.com
2025-12-17 13:31:05 +01:00
Kan Liang f5c7de8f84 perf: Clean up perf ctx time
The current perf tracks two timestamps for the normal ctx and cgroup.
The same type of variables and similar codes are used to track the
timestamps. In the following patch, the third timestamp to track the
guest time will be introduced.
To avoid the code duplication, add a new struct perf_time_ctx and factor
out a generic function update_perf_time_ctx().

No functional change.

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://patch.msgid.link/20251206001720.468579-6-seanjc@google.com
2025-12-17 13:31:04 +01:00
Kan Liang eff95e1702 perf: Add APIs to create/release mediated guest vPMUs
Currently, exposing PMU capabilities to a KVM guest is done by emulating
guest PMCs via host perf events, i.e. by having KVM be "just" another user
of perf.  As a result, the guest and host are effectively competing for
resources, and emulating guest accesses to vPMU resources requires
expensive actions (expensive relative to the native instruction).  The
overhead and resource competition results in degraded guest performance
and ultimately very poor vPMU accuracy.

To address the issues with the perf-emulated vPMU, introduce a "mediated
vPMU", where the data plane (PMCs and enable/disable knobs) is exposed
directly to the guest, but the control plane (event selectors and access
to fixed counters) is managed by KVM (via MSR interceptions).  To allow
host perf usage of the PMU to (partially) co-exist with KVM/guest usage
of the PMU, KVM and perf will coordinate to a world switch between host
perf context and guest vPMU context near VM-Enter/VM-Exit.

Add two exported APIs, perf_{create,release}_mediated_pmu(), to allow KVM
to create and release a mediated PMU instance (per VM).  Because host perf
context will be deactivated while the guest is running, mediated PMU usage
will be mutually exclusive with perf analysis of the guest, i.e. perf
events that do NOT exclude the guest will not behave as expected.

To avoid silent failure of !exclude_guest perf events, disallow creating a
mediated PMU if there are active !exclude_guest events, and on the perf
side, disallowing creating new !exclude_guest perf events while there is
at least one active mediated PMU.

Exempt PMU resources that do not support mediated PMU usage, i.e. that are
outside the scope/view of KVM's vPMU and will not be swapped out while the
guest is running.

Guard mediated PMU with a new kconfig to help readers identify code paths
that are unique to mediated PMU support, and to allow for adding arch-
specific hooks without stubs.  KVM x86 is expected to be the only KVM
architecture to support a mediated PMU in the near future (e.g. arm64 is
trending toward a partitioned PMU implementation), and KVM x86 will select
PERF_GUEST_MEDIATED_PMU unconditionally, i.e. won't need stubs.

Immediately select PERF_GUEST_MEDIATED_PMU when KVM x86 is enabled so that
all paths are compile tested.  Full KVM support is on its way...

[sean: add kconfig and WARNing, rewrite changelog, swizzle patch ordering]
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://patch.msgid.link/20251206001720.468579-5-seanjc@google.com
2025-12-17 13:31:04 +01:00
Sean Christopherson 991bdf7e9d perf: Move security_perf_event_free() call to __free_event()
Move the freeing of any security state associated with a perf event from
_free_event() to __free_event(), i.e. invoke security_perf_event_free() in
the error paths for perf_event_alloc().  This will allow adding potential
error paths in perf_event_alloc() that can occur after allocating security
state.

Note, kfree() and thus security_perf_event_free() is a nop if
event->security is NULL, i.e. calling security_perf_event_free() even if
security_perf_event_alloc() fails or is never reached is functionality ok.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://patch.msgid.link/20251206001720.468579-4-seanjc@google.com
2025-12-17 13:31:04 +01:00
Kan Liang b9e52b11d2 perf: Add generic exclude_guest support
Only KVM knows the exact time when a guest is entering/exiting. Expose
two interfaces to KVM to switch the ownership of the PMU resources.

All the pinned events must be scheduled in first. Extend the
perf_event_sched_in() helper to support extra flag, e.g., EVENT_GUEST.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://patch.msgid.link/20251206001720.468579-3-seanjc@google.com
2025-12-17 13:31:03 +01:00
Kan Liang b825444b61 perf: Skip pmu_ctx based on event_type
To optimize the cgroup context switch, the perf_event_pmu_context
iteration skips the PMUs without cgroup events. A bool cgroup was
introduced to indicate the case. It can work, but this way is hard to
extend for other cases, e.g. skipping non-mediated PMUs. It doesn't
make sense to keep adding bool variables.

Pass the event_type instead of the specific bool variable. Check both
the event_type and related pmu_ctx variables to decide whether skipping
a PMU.

Event flags, e.g., EVENT_CGROUP, should be cleard in the ctx->is_active.
Add EVENT_FLAGS to indicate such event flags.

No functional change.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://patch.msgid.link/20251206001720.468579-2-seanjc@google.com
2025-12-17 13:31:03 +01:00
Peter Zijlstra 1862d8e264 sched: Fix faulty assertion in sched_change_end()
Commit 47efe2ddcc ("sched/core: Add assertions to QUEUE_CLASS") added an
assert to sched_change_end() verifying that a class demotion would result in a
reschedule.

As it turns out; rt_mutex_setprio() does not force a resched on class
demontion. Furthermore, this is only relevant to running tasks.

Change the warning into a reschedule and make sure to only do so for running
tasks.

Fixes: 47efe2ddcc ("sched/core: Add assertions to QUEUE_CLASS")
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Tested-by:  Linux Kernel Functional Testing <lkft@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251216141725.GW3707837@noisy.programming.kicks-ass.net
2025-12-17 11:41:18 +01:00
Peter Zijlstra 704069649b sched/core: Rework sched_class::wakeup_preempt() and rq_modified_*()
Change sched_class::wakeup_preempt() to also get called for
cross-class wakeups, specifically those where the woken task
is of a higher class than the previous highest class.

In order to do this, track the current highest class of the runqueue
in rq::next_class and have wakeup_preempt() track this upwards for
each new wakeup. Additionally have schedule() re-set the value on
pick.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://patch.msgid.link/20251127154725.901391274@infradead.org
2025-12-17 10:53:25 +01:00
Alexei Starovoitov ec439c3801 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf after 6.19-rc1
Cross-merge BPF and other fixes after downstream PR.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-12-16 21:29:38 -08:00
Linus Torvalds ea1013c153 bpf-fixes
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmlCBmwACgkQ6rmadz2v
 bToUZA//ZY0IE1x1nCixEAqGF/nGpDzVX4YQQfjrUoXQOD4ykzt35yTNXl6B1IVA
 dliVSI6kUtdoThUa7xJUxMSkDsVBsEMT/zYXQEXJG1zXvJANCB9wTzsC3OCBWbXt
 BRczcEkq0OHC9/l5CrILR6ocwxKGDIMIysfeOSABgfqckSEhylWy3+EWZQCk08ka
 gNpXlDJUG7dYpcZD/zhuC7e5Rg1uNvN7WiTv+Biig8xZCsEtYOq+qC5C/sOnsypI
 nqfECfbx48cVl49SjatdgquuHn/INESdLRCHisshkurA2Mp5PQuCmrwlXbv4JG59
 v9b7lsFQlkpvEXMdo9VYe6K2gjfkOPRdWsVPu2oXA1qISRmrDqX8cKOpapUIwRhL
 p3ASruMOnz0KFqVaET8+5u2SwtALeW+c+1p1aHMfVGF/qbXuyG05qBkLoGFJR+Xr
 WznXUXY80Z7pjD57SpA6U3DigAkGqKCBXUwdifaOq8HQonwsnQGqkW/3NngNULGP
 IC4u0JXn61VgQsM/kAw+ucc4bdKI0g4oKJR56lT48elrj6Yxrjpde4oOqzZ0IQKu
 VQ0YnzWqqT2tjh4YNMOwkNPbFR4ALd329zI6TUkWib/jByEBNcfjSj9BRANd1KSx
 JgSHAE6agrbl6h3nOx584YCasX3Zq+nfv1Sj4Z/5GaHKKW3q/Vw=
 =wHLt
 -----END PGP SIGNATURE-----

Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Pull bpf fixes from Alexei Starovoitov:

 - Fix BPF builds due to -fms-extensions. selftests (Alexei
   Starovoitov), bpftool (Quentin Monnet).

 - Fix build of net/smc when CONFIG_BPF_SYSCALL=y, but CONFIG_BPF_JIT=n
   (Geert Uytterhoeven)

 - Fix livepatch/BPF interaction and support reliable unwinding through
   BPF stack frames (Josh Poimboeuf)

 - Do not audit capability check in arm64 JIT (Ondrej Mosnacek)

 - Fix truncated dmabuf BPF iterator reads (T.J. Mercier)

 - Fix verifier assumptions of bpf_d_path's output buffer (Shuran Liu)

 - Fix warnings in libbpf when built with -Wdiscarded-qualifiers under
   C23 (Mikhail Gavrilov)

* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  selftests/bpf: add regression test for bpf_d_path()
  bpf: Fix verifier assumptions of bpf_d_path's output buffer
  selftests/bpf: Add test for truncated dmabuf_iter reads
  bpf: Fix truncated dmabuf iterator reads
  x86/unwind/orc: Support reliable unwinding through BPF stack frames
  bpf: Add bpf_has_frame_pointer()
  bpf, arm64: Do not audit capability check in do_jit()
  libbpf: Fix -Wdiscarded-qualifiers under C23
  bpftool: Fix build warnings due to MS extensions
  net: smc: SMC_HS_CTRL_BPF should depend on BPF_JIT
  selftests/bpf: Add -fms-extensions to bpf build flags
2025-12-17 15:54:58 +12:00
Liang Jie b0101ccb5b sched_ext: fix uninitialized ret on alloc_percpu() failure
Smatch reported:

  kernel/sched/ext.c:5332 scx_alloc_and_add_sched() warn: passing zero to 'ERR_PTR'

In scx_alloc_and_add_sched(), the alloc_percpu() failure path jumps to
err_free_gdsqs without initializing @ret. That can lead to returning
ERR_PTR(0), which violates the ERR_PTR() convention and confuses
callers.

Set @ret to -ENOMEM before jumping to the error path when
alloc_percpu() fails.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/r/202512141601.yAXDAeA9-lkp@intel.com/
Reported-by: Dan Carpenter <error27@gmail.com>
Fixes: c201ea1578 ("sched_ext: Move event_stats_cpu into scx_sched")
Signed-off-by: Liang Jie <liangjie@lixiang.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-16 09:15:03 -10:00
Emil Tsalapatis 12a1fe6e12 bpf/verifier: Do not limit maximum direct offset into arena map
The verifier currently limits direct offsets into a map to 512MiB
to avoid overflow during pointer arithmetic. However, this prevents
arena maps from using direct addressing instructions to access data
at the end of > 512MiB arena maps. This is necessary when moving
arena globals to the end of the arena instead of the front.

Refactor the verifier code to remove the offset calculation during
direct value access calculations. This is possible because the only
two map types that implement .map_direct_value_addr() are arrays and
arenas, and they both do their own internal checks to ensure the
offset is within bounds.

Adjust selftests that expect the old error. These tests still fail
because the verifier identifies the access as out of bounds for the
map, so change them to expect an "invalid access to map value pointer"
error instead.

Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20251216173325.98465-3-emil@etsalapatis.com
2025-12-16 10:42:55 -08:00
Ricardo Robaina 15b0c43aa6 audit: include source and destination ports to NETFILTER_PKT
NETFILTER_PKT records show both source and destination
addresses, in addition to the associated networking protocol.
However, it lacks the ports information, which is often
valuable for troubleshooting.

This patch adds both source and destination port numbers,
'sport' and 'dport' respectively, to TCP, UDP, UDP-Lite and
SCTP-related NETFILTER_PKT records.

 $ TESTS="netfilter_pkt" make -e test &> /dev/null
 $ ausearch -i -ts recent |grep NETFILTER_PKT
 type=NETFILTER_PKT ... proto=icmp
 type=NETFILTER_PKT ... proto=ipv6-icmp
 type=NETFILTER_PKT ... proto=udp sport=46333 dport=42424
 type=NETFILTER_PKT ... proto=udp sport=35953 dport=42424
 type=NETFILTER_PKT ... proto=tcp sport=50314 dport=42424
 type=NETFILTER_PKT ... proto=tcp sport=57346 dport=42424

Link: https://github.com/linux-audit/audit-kernel/issues/162

Signed-off-by: Ricardo Robaina <rrobaina@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2025-12-16 11:04:14 -05:00
Ricardo Robaina f19590b07c audit: add audit_log_nf_skb helper function
Netfilter code (net/netfilter/nft_log.c and net/netfilter/xt_AUDIT.c)
have to be kept in sync. Both source files had duplicated versions of
audit_ip4() and audit_ip6() functions, which can result in lack of
consistency and/or duplicated work.

This patch adds a helper function in audit.c that can be called by
netfilter code commonly, aiming to improve maintainability and
consistency.

Suggested-by: Florian Westphal <fw@strlen.de>
Suggested-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Ricardo Robaina <rrobaina@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2025-12-16 11:04:14 -05:00
Linus Torvalds dbf89321bf sched_ext: Fixes for v6.19-rc1
- Fix memory leak when destroying helper kthread workers during scheduler
   disable.
 
 - Fix bypass depth accounting on scx_enable() failure which could leave
   the system permanently in bypass mode.
 
 - Fix missing preemption handling when moving tasks to local DSQs via
   scx_bpf_dsq_move().
 
 - Misc fixes including NULL check for put_prev_task(), flushing stdout in
   selftests, and removing unused code.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaUA9aQ4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGXWKAP9VeREr2ceRFd3WK5vFsFzCGh1L8rRu+t352MGL
 qWXkzQD/apLFSo5RQZ0tE8qORwKM9hNhS8QkVJhlsv5VZRWMBQI=
 =ueoE
 -----END PGP SIGNATURE-----

Merge tag 'sched_ext-for-6.19-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext fixes from Tejun Heo:

 - Fix memory leak when destroying helper kthread workers during
   scheduler disable

 - Fix bypass depth accounting on scx_enable() failure which could leave
   the system permanently in bypass mode

 - Fix missing preemption handling when moving tasks to local DSQs via
   scx_bpf_dsq_move()

 - Misc fixes including NULL check for put_prev_task(), flushing stdout
   in selftests, and removing unused code

* tag 'sched_ext-for-6.19-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: Remove unused code in the do_pick_task_scx()
  selftests/sched_ext: flush stdout before test to avoid log spam
  sched_ext: Fix missing post-enqueue handling in move_local_task_to_local_dsq()
  sched_ext: Factor out local_dsq_post_enq() from dispatch_enqueue()
  sched_ext: Fix bypass depth leak on scx_enable() failure
  sched/ext: Avoid null ptr traversal when ->put_prev_task() is called with NULL next
  sched_ext: Fix the memleak for sch->helper objects
2025-12-16 19:24:35 +12:00
Linus Torvalds 6b63f90fa2 cgroup: Fixes for v6.19-rc1
- Fix a race condition in css_rstat_updated() where CMPXCHG without LOCK
   prefix could cause lnode corruption when the flusher runs concurrently
   on another CPU. The issue was introduced in 6.17 and causes memcg stats
   to become corrupted in production.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaUA8mg4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGXhaAQDSk/ywLzJG807gzvhC5QLcY8kYzUUFHJ4TQAFp
 08UqOAEA9yDgBHPqkp6ucjEJMG+2esE9A5NJB2LeEjG8ZZOCDQM=
 =Kcfd
 -----END PGP SIGNATURE-----

Merge tag 'cgroup-for-6.19-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup fix from Tejun Heo:

 - Fix a race condition in css_rstat_updated() where CMPXCHG without
   LOCK prefix could cause lnode corruption when the flusher runs
   concurrently on another CPU. The issue was introduced in 6.17 and
   causes memcg stats to become corrupted in production.

* tag 'cgroup-for-6.19-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: rstat: use LOCK CMPXCHG in css_rstat_updated
2025-12-16 19:21:17 +12:00
Radu Rendec fcc1d0dabd genirq: Add interrupt redirection infrastructure
Add infrastructure to redirect interrupt handler execution to a
different CPU when the current CPU is not part of the interrupt's CPU
affinity mask.

This is primarily aimed at (de)multiplexed interrupts, where the child
interrupt handler runs in the context of the parent interrupt handler,
and therefore CPU affinity control for the child interrupt is typically
not available.

With the new infrastructure, the child interrupt is allowed to freely
change its affinity setting, independently of the parent. If the
interrupt handler happens to be triggered on an "incompatible" CPU (a
CPU that's not part of the child interrupt's affinity mask), the handler
is redirected and runs in IRQ work context on a "compatible" CPU.

No functional change is being made to any existing irqchip driver, and
irqchip drivers must be explicitly modified to use the newly added
infrastructure to support interrupt redirection.

Originally-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Radu Rendec <rrendec@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/linux-pci/878qpg4o4t.ffs@tglx/
Link: https://patch.msgid.link/20251128212055.1409093-2-rrendec@redhat.com
2025-12-15 22:30:48 +01:00
Marc Zyngier dbcc728e18 genirq: Remove setup_percpu_irq()
setup_percpu_irq() was always a bad kludge, and should have never
been there the first place. Now that the last users are gone,
remove it for good.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/20251210082242.360936-7-maz@kernel.org
2025-12-15 22:20:51 +01:00
Marc Zyngier e9b624ea31 genirq: Remove __request_percpu_irq() helper
With the IRQ timing stuff being gone, there is no need to specify a flag
when requesting a percpu interrupt. Not only IRQF_TIMER was the only flag
(set of flags actually) allowed, but nobody ever passed it.

Get rid of __request_percpu_irq(), which was only getting 0 as flags, and
promote request_percpu_irq_affinity() as its replacement.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jinjie Ruan <ruanjinjie@huawei.com>
Link: https://patch.msgid.link/20251210082242.360936-3-maz@kernel.org
2025-12-15 22:20:50 +01:00
Marc Zyngier c119e66853 genirq: Remove IRQ timing tracking infrastructure
The IRQ timing tracking infrastructure was merged in 2019, but was never
plumbed in, is not selectable, and is therefore never used.

As Daniel agrees that there is little hope for this infrastructure to be
completed in the near term, drop it altogether.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jinjie Ruan <ruanjinjie@huawei.com>
Link: https://lore.kernel.org/r/87zf7vex6h.wl-maz@kernel.org
Link: https://patch.msgid.link/20251210082242.360936-2-maz@kernel.org
2025-12-15 22:20:50 +01:00
Eric Dumazet 4725344462 time/timecounter: Inline timecounter_cyc2time()
New network transport protocols want NIC drivers to get hardware timestamps
of all incoming packets, and possibly all outgoing packets.

One example is the upcoming 'Swift congestion control' which is used by TCP
transport and is the primary need for timecounter_cyc2time(). This means
timecounter_cyc2time() can be called more than 100 million times per second
on a busy server.

Inlining timecounter_cyc2time() brings a 12% improvement on a UDP receive
stress test on a 100Gbit NIC.

Note that FDO, LTO, PGO are unable to magically help for this case,
presumably because NIC drivers are almost exclusively shipped as modules.

Add an unlikely() around the cc_cyc2ns_backwards() case, even if FDO (when
used) is able to take care of this optimization.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://research.google/pubs/swift-delay-is-simple-and-effective-for-congestion-control-in-the-datacenter/
Link: https://patch.msgid.link/20251129095740.3338476-1-edumazet@google.com
2025-12-15 20:16:49 +01:00
Zqiang bb27226f0d sched_ext: Remove unused code in the do_pick_task_scx()
The kick_idle variable is no longer used, this commit therefore remove
it and also remove associated code in the do_pick_task_scx().

Signed-off-by: Zqiang <qiang.zhang@linux.dev>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-15 05:53:49 -10:00
Petr Mladek 9bd18e1262 printk/nbcon: Restore IRQ in atomic flush after each emitted record
The commit d5d399efff ("printk/nbcon: Release nbcon consoles ownership
in atomic flush after each emitted record") prevented stall of a CPU
which lost nbcon console ownership because another CPU entered
an emergency flush.

But there is still the problem that the CPU doing the emergency flush
might cause a stall on its own.

Let's go even further and restore IRQ in the atomic flush after
each emitted record.

It is not a complete solution. The interrupts and/or scheduling might
still be blocked when the emergency atomic flush was called with
IRQs and/or scheduling disabled. But it should remove the following
lockup:

  mlx5_core 0000:03:00.0: Shutdown was called
  kvm: exiting hardware virtualization
  arm-smmu-v3 arm-smmu-v3.10.auto: CMD_SYNC timeout at 0x00000103 [hwprod 0x00000104, hwcons 0x00000102]
  smp: csd: Detected non-responsive CSD lock (#1) on CPU#4, waiting 5000000032 ns for CPU#00 do_nothing (kernel/smp.c:1057)
  smp:     csd: CSD lock (#1) unresponsive.
  [...]
  Call trace:
  pl011_console_write_atomic (./arch/arm64/include/asm/vdso/processor.h:12 drivers/tty/serial/amba-pl011.c:2540) (P)
  nbcon_emit_next_record (kernel/printk/nbcon.c:1049)
  __nbcon_atomic_flush_pending_con (kernel/printk/nbcon.c:1517)
  __nbcon_atomic_flush_pending.llvm.15488114865160659019 (./arch/arm64/include/asm/alternative-macros.h:254 ./arch/arm64/include/asm/cpufeature.h:808 ./arch/arm64/include/asm/irqflags.h:192 kernel/printk/nbcon.c:1562 kernel/printk/nbcon.c:1612)
  nbcon_atomic_flush_pending (kernel/printk/nbcon.c:1629)
  printk_kthreads_shutdown (kernel/printk/printk.c:?)
  syscore_shutdown (drivers/base/syscore.c:120)
  kernel_kexec (kernel/kexec_core.c:1045)
  __arm64_sys_reboot (kernel/reboot.c:794 kernel/reboot.c:722 kernel/reboot.c:722)
  invoke_syscall (arch/arm64/kernel/syscall.c:50)
  el0_svc_common.llvm.14158405452757855239 (arch/arm64/kernel/syscall.c:?)
  do_el0_svc (arch/arm64/kernel/syscall.c:152)
  el0_svc (./arch/arm64/include/asm/alternative-macros.h:254 ./arch/arm64/include/asm/cpufeature.h:808 ./arch/arm64/include/asm/irqflags.h:73 arch/arm64/kernel/entry-common.c:169 arch/arm64/kernel/entry-common.c:182 arch/arm64/kernel/entry-common.c:749)
  el0t_64_sync_handler (arch/arm64/kernel/entry-common.c:820)
  el0t_64_sync (arch/arm64/kernel/entry.S:600)

In this case, nbcon_atomic_flush_pending() is called from
printk_kthreads_shutdown() with IRQs and scheduling enabled.

Note that __nbcon_atomic_flush_pending_con() is directly called also from
nbcon_device_release() where the disabled IRQs might break PREEMPT_RT
guarantees. But the atomic flush is called only in emergency or panic
situations where the latencies are irrelevant anyway.

An ultimate solution would be a touching of watchdogs. But it would hide
all problems. Let's do it later when anyone reports a stall which does
not have a better solution.

Closes: https://lore.kernel.org/r/sqwajvt7utnt463tzxgwu2yctyn5m6bjwrslsnupfexeml6hkd@v6sqmpbu3vvu
Tested-by: Breno Leitao <leitao@debian.org>
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Link: https://patch.msgid.link/20251212124520.244483-1-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
2025-12-15 16:18:41 +01:00
Marcos Paulo de Souza bdfcca65e7 printk: nbcon: Check for device_{lock,unlock} callbacks
These callbacks are necessary to synchronize ->write_thread callback
against other operations using the same device.

Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Link: https://patch.msgid.link/20251208-nbcon-device-cb-fix-v2-1-36be8d195123@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
2025-12-15 15:10:33 +01:00
Mateusz Guzik 6d864a1b18
pid: only take pidmap_lock once on alloc
When spawning and killing threads in separate processes in parallel the
primary bottleneck on the stock kernel is pidmap_lock, largely because
of a back-to-back acquire in the common case. This aspect is fixed with
the patch.

Performance improvement varies between reboots. When benchmarking with
20 processes creating and killing threads in a loop, the unpatched
baseline hovers around 465k ops/s, while patched is anything between
~510k ops/s and ~560k depending on false-sharing (which I only minimally
sanitized). So this is at least 10% if you are unlucky.

The change also facilitated some cosmetic fixes.

It has an unintentional side effect of no longer issuing spurious
idr_preload() around idr_replace().

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251203092851.287617-3-mjguzik@gmail.com
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-12-15 14:33:38 +01:00
Rafael J. Wysocki 4fb352df14 PM: sleep: Do not flag runtime PM workqueue as freezable
Till now, the runtime PM workqueue has been flagged as freezable, so it
does not process work items during system-wide PM transitions like
system suspend and resume.  The original reason to do that was to
reduce the likelihood of runtime PM getting in the way of system-wide
PM processing, but now it is mostly an optimization because (1) runtime
suspend of devices is prevented by bumping up their runtime PM usage
counters in device_prepare() and (2) device drivers are expected to
disable runtime PM for the devices handled by them before they embark
on system-wide PM activities that may change the state of the hardware
or otherwise interfere with runtime PM.  However, it prevents
asynchronous runtime resume of devices from working during system-wide
PM transitions, which is confusing because synchronous runtime resume
is not prevented at the same time, and it also sometimes turns out to
be problematic.

For example, it has been reported that blk_queue_enter() may deadlock
during a system suspend transition because of the pm_request_resume()
usage in it [1].  It may also deadlock during a system resume transition
in a similar way.  That happens because the asynchronous runtime resume
of the given device is not processed due to the freezing of the runtime
PM workqueue.  While it may be better to address this particular issue
in the block layer, the very presence of it means that similar problems
may be expected to occur elsewhere.

For this reason, remove the WQ_FREEZABLE flag from the runtime PM
workqueue and make device_suspend_late() use the generic variant of
pm_runtime_disable() that will carry out runtime PM of the device
synchronously if there is pending resume work for it.

Also update the comment before the pm_runtime_disable() call in
device_suspend_late(), to document the fact that the runtime PM
should not be expected to work for the device until the end of
device_resume_early(), and update the related documentation.

This change may, even though it is not expected to, uncover some
latent issues related to queuing up asynchronous runtime resume
work items during system suspend or hibernation.  However, they
should be limited to the interference between runtime resume and
system-wide PM callbacks in the cases when device drivers start
to handle system-wide PM before disabling runtime PM as described
above.

Link: https://lore.kernel.org/linux-pm/20251126101636.205505-2-yang.yang@vivo.com/
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Link: https://patch.msgid.link/12794222.O9o76ZdvQC@rafael.j.wysocki
2025-12-15 12:20:02 +01:00
Ingo Molnar 527a521029 sched/fair: Sort out 'blocked_load*' namespace noise
There's three layers of logic in the scheduler that
deal with 'has_blocked' (load) handling of the NOHZ code:

  (1) nohz.has_blocked,
  (2) rq->has_blocked_load, deal with NOHZ idle balancing,
  (3) and cfs_rq_has_blocked(), which is part of the layer
      that is passing the SMP load-balancing signal to the
      NOHZ layers.

The 'has_blocked' and 'has_blocked_load' names are used
in a mixed fashion, sometimes within the same function.

Standardize on 'has_blocked_load' to make it all easy
to read and easy to grep.

No change in functionality.

Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Link: https://patch.msgid.link/aS6yvxyc3JfMxxQW@gmail.com
2025-12-15 07:52:45 +01:00
Ingo Molnar 5758e48eef sched/fair: Introduce and use the vruntime_cmp() and vruntime_op() wrappers for wrapped-signed aritmetics
We have to be careful with vruntime comparisons and subtraction,
due to the possibility of wrapping, so we have macros like:

   #define vruntime_gt(field, lse, rse) ({ (s64)((lse)->field - (rse)->field) > 0; })

Which is used like this:

		if (vruntime_gt(min_vruntime, se, rse))
			se->min_vruntime = rse->min_vruntime;

Replace this with an easier to read pattern that uses the regular
arithmetics operators:

		if (vruntime_cmp(se->min_vruntime, ">", rse->min_vruntime))
			se->min_vruntime = rse->min_vruntime;

Also replace vruntime subtractions with vruntime_op():

	-       delta = (s64)(sea->vruntime - seb->vruntime) +
	-               (s64)(cfs_rqb->zero_vruntime_fi - cfs_rqa->zero_vruntime_fi);
	+       delta = vruntime_op(sea->vruntime, "-", seb->vruntime) +
	+               vruntime_op(cfs_rqb->zero_vruntime_fi, "-", cfs_rqa->zero_vruntime_fi);

In the vruntime_cmp() and vruntime_op() macros use Use __builtin_strcmp(),
because of __HAVE_ARCH_STRCMP might turn off the compiler optimizations
we rely on here to catch usage bugs.

No change in functionality.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-12-15 07:52:45 +01:00
Ingo Molnar dcbc9d3f0e sched/fair: Rename cfs_rq::avg_vruntime to ::sum_w_vruntime, and helper functions
The ::avg_vruntime field is a  misnomer: it says it's an
'average vruntime', but in reality it's the momentary sum
of the weighted vruntimes of all queued tasks, which is
at least a division away from being an average.

This is clear from comments about the math of fair scheduling:

    * \Sum (v_i - v0) * w_i := cfs_rq->avg_vruntime

This confusion is increased by the cfs_avg_vruntime() function,
which does perform the division and returns a true average.

The sum of all weighted vruntimes should be named thusly,
so rename the field to ::sum_w_vruntime. (As arguably
::sum_weighted_vruntime would be a bit of a mouthful.)

Understanding the scheduler is hard enough already, without
extra layers of obfuscated naming. ;-)

Also rename related helper functions:

  sum_vruntime_add()    => sum_w_vruntime_add()
  sum_vruntime_sub()    => sum_w_vruntime_sub()
  sum_vruntime_update() => sum_w_vruntime_update()

With the notable exception of cfs_avg_vruntime(), which
was named accurately.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://patch.msgid.link/20251201064647.1851919-7-mingo@kernel.org
2025-12-15 07:52:44 +01:00
Ingo Molnar 4ff674fa98 sched/fair: Rename cfs_rq::avg_load to cfs_rq::sum_weight
The ::avg_load field is a long-standing misnomer: it says it's an
'average load', but in reality it's the momentary sum of the load
of all currently runnable tasks. We'd have to also perform a
division by nr_running (or use time-decay) to arrive at any sort
of average value.

This is clear from comments about the math of fair scheduling:

    *              \Sum w_i := cfs_rq->avg_load

The sum of all weights is ... the sum of all weights, not
the average of all weights.

To make it doubly confusing, there's also an ::avg_load
in the load-balancing struct sg_lb_stats, which *is* a
true average.

The second part of the field's name is a minor misnomer
as well: it says 'load', and it is indeed a load_weight
structure as it shares code with the load-balancer - but
it's only in an SMP load-balancing context where
load = weight, in the fair scheduling context the primary
purpose is the weighting of different nice levels.

So rename the field to ::sum_weight instead, which makes
the terminology of the EEVDF math match up with our
implementation of it:

    *              \Sum w_i := cfs_rq->sum_weight

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://patch.msgid.link/20251201064647.1851919-6-mingo@kernel.org
2025-12-15 07:52:44 +01:00
Ingo Molnar fb9a7458e5 sched/fair: Clean up comments in 'struct cfs_rq'
- Fix vertical alignment
 - Fix typos
 - Fix capitalization

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://patch.msgid.link/20251201064647.1851919-3-mingo@kernel.org
2025-12-15 07:52:44 +01:00
Ingo Molnar 2b8c3d3dc9 sched/fair: Join two #ifdef CONFIG_FAIR_GROUP_SCHED blocks
Join two identical #ifdef blocks:

  #ifdef CONFIG_FAIR_GROUP_SCHED
  ...
  #endif

  #ifdef CONFIG_FAIR_GROUP_SCHED
  ...
  #endif

Also mark nested #ifdef blocks in the usual fashion, to make
it more apparent where in a nested hierarchy of #ifdefs we
are at a glance.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Link: https://patch.msgid.link/20251201064647.1851919-2-mingo@kernel.org
2025-12-15 07:52:44 +01:00
Peter Zijlstra 47efe2ddcc sched/core: Add assertions to QUEUE_CLASS
Add some checks to the sched_change pattern to validate assumptions
around changing classes.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://patch.msgid.link/20251127154725.771691954@infradead.org
2025-12-14 08:25:02 +01:00
Peter Zijlstra 95a0155224 sched/fair: Limit hrtick work
The task_tick_fair() function does:

 - update the hierarchical runtimes
 - drive NUMA-balancing
 - update load-balance statistics
 - drive force-idle preemption

All but the very first can be limited to the periodic tick. Let hrtick
only update accounting and drive preemption, not load-balancing and
other bits.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://patch.msgid.link/20250918080205.563385766@infradead.org
2025-12-14 08:25:02 +01:00
Peter Zijlstra a03fee333a sched/fair: Remove superfluous rcu_read_lock()
With fair switched to rcu_dereference_all() validation, having IRQ or
preemption disabled is sufficient, remove the rcu_read_lock()
clutter.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://patch.msgid.link/20251127154725.647502625@infradead.org
2025-12-14 08:25:02 +01:00
Peter Zijlstra 71fedc41c2 sched/fair: Switch to rcu_dereference_all()
With the {rcu,sched,bh} RCU flavours being unified, it doesn't really
make sense to check for just the rcu one. Switch to the _all family of
verification which includes all 3 of the listed flavours.

Notably, this will enable us to remove some superfluous
rcu_read_lock() regions when we know they are inside preempt/IRQ
disabled regions.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-12-14 08:25:02 +01:00
Peter Zijlstra f24165bfa7 sched/headers: Rename rcu_dereference_check_sched_domain() => rcu_dereference_sched_domain()
Remove check from the name for being surplus to requirements.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-12-14 08:25:02 +01:00
Peter Zijlstra 45e0922508 sched/fair: Avoid rq->lock bouncing in sched_balance_newidle()
While poking at this code recently I noted we do a pointless
unlock+lock cycle in sched_balance_newidle(). We drop the rq->lock (so
we can balance) but then instantly grab the same rq->lock again in
sched_balance_update_blocked_averages().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://patch.msgid.link/20251127154725.532469061@infradead.org
2025-12-14 08:25:02 +01:00
Peter Zijlstra 089d84203a sched/fair: Fold the sched_avg update
Nine (and a half) instances of the same pattern is just silly, fold the lot.

Notably, the half instance in enqueue_load_avg() is right after setting
cfs_rq->avg.load_sum to cfs_rq->avg.load_avg * get_pelt_divider(&cfs_rq->avg).
Since get_pelt_divisor() >= PELT_MIN_DIVIDER, this ends up being a no-op
change.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://patch.msgid.link/20251127154725.413564507@infradead.org
2025-12-14 08:25:02 +01:00
T.J. Mercier 6f0b824a61 bpf: Fix bpf_seq_read docs for increased buffer size
Commit af65320948 ("bpf: Bump iter seq size to support BTF
representation of large data structures") increased the fixed buffer
size from PAGE_SIZE to PAGE_SIZE << 3, but the docs for the function
didn't get updated at the same time. Update them.

Signed-off-by: T.J. Mercier <tjmercier@google.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20251207091005.2829703-1-tjmercier@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-12-13 18:57:53 -08:00
Linus Torvalds 4a298a43f5 Fix CPU hotplug callbacks to disable interrupts on UP kernels.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmk74mgRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1jS2g/+NPDw1R7yAPHlTaZO6CduYSENXzTOjmG3
 BxdWERuta3Z78CNJZ5mtbrL1mBccG29oN+EZFGrKbDuxrR6SoglWUr1x1pYJRDyp
 h0DvBfzewXcVwOCiBQpqVzuQIzPqsF1IfdDP6eIGnkhnSAEi6ObIZeAbzBzZRovD
 dccU5e0LAIffTmAyKFYIqD4Fu90bzwJ3dAE1b0tP4mnm+jFGitdd8vb8mowiftOk
 Lij8hsy8bg2A5VThIlKow0ya4uc6Bb3ZGj6aiMHoGMPbB2pu3o8bNH/K9d1MsKIe
 AdZ6Y8mFriYYk3GEI3aWaoRkEXCZFX2IcOUS++yqvrOb88iJpW7jujgRIRvX0Px4
 5xgHomQwW9XQKJgNYU/xLgpQSvBH4IzGDX5Uf8oU+hqcjYJBJjy21Dk9JGJymjv/
 /jTq11Bd9xqzyv9W2Ysg2CfOA29+jGe8XlZU9ucOKE1bz09D+8R+H8X/LZtldjMa
 Xbr+7HG9AD13gLJW+PfRQbJR5CwoCvh9dPMzWtFLykxBYT/mQK5gs4OQDaG+JFsz
 8ov+F28O1KBCtcK+v2M/2mQlo51y6RI/t5cw1sVSosgORjHPIVpPjbbTGi0r4MKw
 PHBGdw0mb1dgsXMRNnCNwvfxZP/S1TnmbIkaT4THC0PqJHOloyrsVbXGD1u6F9In
 Gaodcvn7MbA=
 =jbkd
 -----END PGP SIGNATURE-----

Merge tag 'smp-urgent-2025-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull CPU hotplug fix from Ingo Molnar:

 - Fix CPU hotplug callbacks to disable interrupts on UP kernels

* tag 'smp-urgent-2025-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  cpu: Make atomic hotplug callbacks run with interrupts disabled on UP
2025-12-14 06:12:46 +12:00
Linus Torvalds cba09e3ed0 Misc fixes:
- Fix NULL pointer dereference crash in the Intel PMU driver
  - Fix missing read event generation on task exit
  - Fix AMD uncore driver init error handling
  - Fix whitespace noise
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmk74boRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1hpjw//cILE+pfjJKc+rhpY1DTBi4GMsIlL7SGH
 8sZhelj+u1N9+gg2xM2PJL+2MV5nSTmFnFoXyHn5seAEmaKwJHsELTIDGa37PmrQ
 SEI4X5TQQ3M6XD8AWgvc+okdoxh0eQSZj0MOOCU6X+2RaQi4BuMuRa8Omz18NEof
 2Sm6+rC1XICzsSKyJFBRAwPeQcrW7oVsaVD0z+bPVimn17IF+euKIkTbMEBahjGJ
 3ejN/l37Fc7UrfX9nSVLiS3WdueCdagT0QN6eRDiZDZ446Rac8cqtvs0vI+YG50V
 6lD+WcVGj2MF6rq7x7VXIxdijHQXE8ycH5x3FKqg5b4iX+l0oF2sBcwdaFa0hiUB
 a0gjUO6bCqPzaXFFbEGBhewn8W2imv4pdQbf8G3GS+Jpox5S6XxVNJyoRj6e5QZS
 xXO/6F50oMgeMiOREf/mmADjX6ZrLMcCEPQ4d1vqOmatvn46oiW/mifClK7Vnt3J
 hprqhtvZ/CEICe5nDXEbZRClzBcGIMYw2WyrlmFKJ1VCCZMuH5gwi866wIcT2nQk
 qhOeBJxPbAD8v59YL78rix4pLayXOi8DISedTtEzEzumaLEGJFo1YkXB5U3PJvcQ
 dps6fyPaGNVH7+rVHlmNfokOTiNm2n9vZOhqxFYouLIEzgmmY9LN7oUQsPORwA4Z
 aYgKfuEckV4=
 =8ILq
 -----END PGP SIGNATURE-----

Merge tag 'perf-urgent-2025-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf event fixes from Ingo Molnar:

 - Fix NULL pointer dereference crash in the Intel PMU driver

 - Fix missing read event generation on task exit

 - Fix AMD uncore driver init error handling

 - Fix whitespace noise

* tag 'perf-urgent-2025-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/intel: Fix NULL event dereference crash in handle_pmi_common()
  perf/core: Fix missing read event generation on task exit
  perf/x86/amd/uncore: Fix the return value of amd_uncore_df_event_init() on error
  perf/uprobes: Remove <space><Tab> whitespace noise
2025-12-14 06:10:35 +12:00
Linus Torvalds db0130185e Misc fixes:
- Fix error code in the irqchip/mchp-eic driver
  - Fix setup_percpu_irq() affinity assumptions
  - Remove the unused irq_domain_add_tree() function
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmk74RgRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1gSEw/+IwHsSH9a4fOJGwV+vlwljy3zw6Sbmfwh
 D8C+jtVBQwocYzb/snbTvWnHJuhGwyLh6gVcYDDwJj/8IRGXnKVm7S1UKpV/XpRl
 O/dhqo8uogjqVAlo7osqTlYlwSGVLBY++iYCZrt9cY/Hn1bCm0Oz19/wbSJupHqp
 G/aG05iX4SRDHhTI8oWp7oEnnS1Hyox/2CKYHEBQoR2BapJL10exHqnyDpQ48pgL
 60LCQbPeZp5KArQ+NRkc4RbcHBW6tsBJQUPc+m71H5HJTuetxwQ7uNuc3UsaR6jI
 0A/vBryzeS5d88+FO08WxAYaiSENaGBAZ7aAKVbuiSoddwuRBUhvW6OYvWIGNqJG
 gCk4YB9K5li/2slrQdS4BQhq8thdVGkgzcSDFiFLyYmg7NjECPpr4IWggKwFT2zn
 hTF6tjuanBL9jew4Zxw21WcoJa3Kz6FXwSbU+PfWUUb+xc0/zouai17vkP8qwXR6
 xeMqVJCZjb88VXUvL42xqKjIYD4x6UaoUqOtXjfO7xa40XC+2KVmALRInbSdTDpa
 9MsJg6Mi2C+hYPXM1fPCymbktZoBSWnzi8vGUbA9CegyP9J4Zhpl8vmDIn+mXMmB
 Lu7pGAotSWHnXf9oCqAJ4eioaAj4V6HtIPKkSpmQGozifYMvlEI+boIwfkS2sJWK
 lLGe7LH6QeU=
 =/KYE
 -----END PGP SIGNATURE-----

Merge tag 'irq-urgent-2025-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq fixes from Ingo Molnar:

 - Fix error code in the irqchip/mchp-eic driver

 - Fix setup_percpu_irq() affinity assumptions

 - Remove the unused irq_domain_add_tree() function

* tag 'irq-urgent-2025-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/mchp-eic: Fix error code in mchp_eic_domain_alloc()
  irqdomain: Delete irq_domain_add_tree()
  genirq: Allow NULL affinity for setup_percpu_irq()
2025-12-14 06:07:09 +12:00
Linus Torvalds 9d9c1cfec0 There are no significant series in this small merge. Please see the
individual changelogs for details.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaTsgBwAKCRDdBJ7gKXxA
 jkSIAP4jD66nEC2QyKTiv9XvXi8rpKz6RGAHNZSam0ucI5WKswEAoflmlqsoD/Kk
 sN3YLwLztzCJIYU7tT3IRaPK0irwDAo=
 =nkux
 -----END PGP SIGNATURE-----

Merge tag 'mm-nonmm-stable-2025-12-11-11-47' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc updates from Andrew Morton:
 "There are no significant series in this small merge. Please see the
  individual changelogs for details"

[ Editor's note: it's mainly ocfs2 and a couple of random fixes ]

* tag 'mm-nonmm-stable-2025-12-11-11-47' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  mm: memfd_luo: add CONFIG_SHMEM dependency
  mm: shmem: avoid build warning for CONFIG_SHMEM=n
  ocfs2: fix memory leak in ocfs2_merge_rec_left()
  ocfs2: invalidate inode if i_mode is zero after block read
  ocfs2: avoid -Wflex-array-member-not-at-end warning
  ocfs2: convert remaining read-only checks to ocfs2_emergency_state
  ocfs2: add ocfs2_emergency_state helper and apply to setattr
  checkpatch: add uninitialized pointer with __free attribute check
  args: fix documentation to reflect the correct numbers
  ocfs2: fix kernel BUG in ocfs2_find_victim_chain
  liveupdate: luo_core: fix redundant bound check in luo_ioctl()
  ocfs2: validate inline xattr size and entry count in ocfs2_xattr_ibody_list
  fs/fat: remove unnecessary wrapper fat_max_cache()
  ocfs2: replace deprecated strcpy with strscpy
  ocfs2: check tl_used after reading it from trancate log inode
  liveupdate: luo_file: don't use invalid list iterator
2025-12-13 20:55:12 +12:00
Thomas Gleixner fbbd7ce627 genirq: Don't overwrite interrupt thread flags on setup
Chris reported that the recent affinity management changes result in
overwriting the already initialized thread flags.

Use set_bit() to set the affinity bit instead of assigning the bit value to
the flags.

Fixes: 801afdfbfc ("genirq: Fix interrupt threads affinity vs. cpuset isolated partitions")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://patch.msgid.link/87ecp0e4cf.ffs@tglx
Closes: https://lore.kernel.org/all/20251212014848.3509622-1-clm@meta.com
2025-12-13 10:29:33 +09:00
Tejun Heo f5e1e5ec20 sched_ext: Fix missing post-enqueue handling in move_local_task_to_local_dsq()
move_local_task_to_local_dsq() is used when moving a task from a non-local
DSQ to a local DSQ on the same CPU. It directly manipulates the local DSQ
without going through dispatch_enqueue() and was missing the post-enqueue
handling that triggers preemption when SCX_ENQ_PREEMPT is set or the idle
task is running.

The function is used by move_task_between_dsqs() which backs
scx_bpf_dsq_move() and may be called while the CPU is busy.

Add local_dsq_post_enq() call to move_local_task_to_local_dsq(). As the
dispatch path doesn't need post-enqueue handling, add SCX_RQ_IN_BALANCE
early exit to keep consume_dispatch_q() behavior unchanged and avoid
triggering unnecessary resched when scx_bpf_dsq_move() is used from the
dispatch path.

Fixes: 4c30f5ce4f ("sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()")
Cc: stable@vger.kernel.org # v6.12+
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-12 06:26:42 -10:00
Tejun Heo 530b6637c7 sched_ext: Factor out local_dsq_post_enq() from dispatch_enqueue()
Factor out local_dsq_post_enq() which performs post-enqueue handling for
local DSQs - triggering resched_curr() if SCX_ENQ_PREEMPT is specified or if
the current CPU is idle. No functional change.

This will be used by the next patch to fix move_local_task_to_local_dsq().

Cc: stable@vger.kernel.org # v6.12+
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-12 06:26:07 -10:00
Tejun Heo 9f769637a9 sched_ext: Fix bypass depth leak on scx_enable() failure
scx_enable() calls scx_bypass(true) to initialize in bypass mode and then
scx_bypass(false) on success to exit. If scx_enable() fails during task
initialization - e.g. scx_cgroup_init() or scx_init_task() returns an error -
it jumps to err_disable while bypass is still active. scx_disable_workfn()
then calls scx_bypass(true/false) for its own bypass, leaving the bypass depth
at 1 instead of 0. This causes the system to remain permanently in bypass mode
after a failed scx_enable().

Failures after task initialization is complete - e.g. scx_tryset_enable_state()
at the end - already call scx_bypass(false) before reaching the error path and
are not affected. This only affects a subset of failure modes.

Fix it by tracking whether scx_enable() called scx_bypass(true) in a bool and
having scx_disable_workfn() call an extra scx_bypass(false) to clear it. This
is a temporary measure as the bypass depth will be moved into the sched
instance, which will make this tracking unnecessary.

Fixes: 8c2090c504 ("sched_ext: Initialize in bypass mode")
Cc: stable@vger.kernel.org # v6.12+
Reported-by: Chris Mason <clm@meta.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Link: https://lore.kernel.org/stable/286e6f7787a81239e1ce2989b52391ce%40kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-11 06:27:35 -10:00
Arnd Bergmann 601cc399a0 mm: memfd_luo: add CONFIG_SHMEM dependency
The new memfd code fails to link without SHMEM:

aarch64-linux-ld: mm/memfd_luo.o: in function `memfd_luo_retrieve_folios':
memfd_luo.c:(.text.memfd_luo_retrieve_folios+0xdc): undefined reference to `shmem_add_to_page_cache'
memfd_luo.c:(.text.memfd_luo_retrieve_folios+0x11c): undefined reference to `shmem_inode_acct_blocks'
memfd_luo.c:(.text.memfd_luo_retrieve_folios+0x134): undefined reference to `shmem_recalc_inode'

Add a Kconfig dependency to disallow that configuration.

Link: https://lkml.kernel.org/r/20251204100203.1034394-1-arnd@kernel.org
Fixes: b3749f174d ("mm: memfd_luo: allow preserving memfd")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-12-10 16:07:44 -08:00
Pasha Tatashin bf2c7bf5c4 liveupdate: luo_core: fix redundant bound check in luo_ioctl()
The kernel test robot reported a Smatch warning:
kernel/liveupdate/luo_core.c:402 luo_ioctl() warn: unsigned 'nr' is
never less than zero.

This occurs because 'nr' is unsigned and LIVEUPDATE_CMD_BASE is currently
defined as 0, making the check (nr < LIVEUPDATE_CMD_BASE) always false.

Remove the explicit lower bound check.  The logic remains correct because
'nr' is unsigned; if nr is less than LIVEUPDATE_CMD_BASE, the expression
(nr - LIVEUPDATE_CMD_BASE) will wrap around to a large positive value. 
This will inevitably be larger than ARRAY_SIZE(luo_ioctl_ops) and be
caught by the upper bound check.

Link: https://lkml.kernel.org/r/20251130010919.1488230-1-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202511280300.6pvBmXUS-lkp@intel.com/
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: David Matlack <dmatlack@google.com>
Cc: Pratyush Yadav <pratyush@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-12-10 16:07:42 -08:00
Dan Carpenter b2135d1cb0 liveupdate: luo_file: don't use invalid list iterator
If we exit a list_for_each_entry() without hitting a break then the list
iterator points to an offset from the list_head.  It's a non-NULL but
invalid pointer and dereferencing it isn't allowed.

Introduce a new "found" variable to test instead.

Link: https://lkml.kernel.org/r/aSlMc4SS09Re4_xn@stanley.mountain
Fixes: 3ee1d673194e ("liveupdate: luo_file: implement file systems callbacks")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/r/202511280420.y9O4fyhX-lkp@intel.com/
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Pratyush Yadav <pratyush@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-12-10 16:07:41 -08:00
Linus Torvalds 0723a166d1 more s390 updates for 6.19 merge window
- Use the MSI parent domain API instead of the legacy API for setup and
   teardown of PCI MSI IRQs
 
 - Select POSIX_CPU_TIMERS_TASK_WORK now that VIRT_XFER_TO_GUEST_WORK has
   been implemented for s390
 
 - Fix a KVM bug which can lead to guest memory corruption
 
 - Fix KASAN shadow memory mapping for hotplugged memory
 
 - Minor bug fixes and improvements
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEECMNfWEw3SLnmiLkZIg7DeRspbsIFAmk5WtQACgkQIg7DeRsp
 bsKOwRAAg7uNYB0nU5CYl/0GeExumHBxQT/B1Umxg8lCDf4rwAPwwSrV3Dpp04h5
 8FpKH42x8sig1yKuqjrBgHTlRofmSlZOrtDI9xv/dtp8yJWkAaMY0Ec8sf1VwEEO
 dgyEPH567ja+mWu5z74eFmxYSV5dsgcJ707DICTyDXB3DaIzmVH5D5f8yYbMs8eL
 gJ4bAuRkzwV+hp1UXLaWWELDrTX2KpQ3EBPz/2SuS8e3KbmL7MuDRdMbTJ7oEb1+
 fGGnm3qFhHszx5fGWUv/BrGsQidwZTe459JdQtKcQUsPkGqBjsbhmiq9ApomD2+7
 MJFDGDVHt623iBNwDwygAPCrWrKoVUFlw5MaKfztvSiuKt19N6OrwN0R4VA8P46I
 m9mVRSLeXTv4GzZcTmLsRb2cLJ3Z6co2HuDli4X1Jg3TTWBFzx0Jscp9dB1QKmoI
 EXBX44LoL5bUZeh5cihqGC5YBuXeaYyMnnsP56eJhq56QK9r4WDMcvGezBbzbkHE
 I3GjmFwEUr3U9hSx4xmUGh6pry5PLkuDch5So/vh1miAdHjZFFthXHK2s5UI+HUl
 OGjNxI9q2WzX4bb1C090zVljvYAsP6YzKnk7Ks76+0mEBvykTsGx3Ms4oS1L0k5+
 Z8R2J0BLfcBQU7pL0QoYE/fO2J5wo+atp/cBLwc2hd+G9YEvdTs=
 =Nbs5
 -----END PGP SIGNATURE-----

Merge tag 's390-6.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux

Pull more s390 updates from Heiko Carstens:

 - Use the MSI parent domain API instead of the legacy API for setup and
   teardown of PCI MSI IRQs

 - Select POSIX_CPU_TIMERS_TASK_WORK now that VIRT_XFER_TO_GUEST_WORK
   has been implemented for s390

 - Fix a KVM bug which can lead to guest memory corruption

 - Fix KASAN shadow memory mapping for hotplugged memory

 - Minor bug fixes and improvements

* tag 's390-6.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
  s390/bug: Add missing alignment
  s390/bug: Add missing CONFIG_BUG ifdef again
  KVM: s390: Fix gmap_helper_zap_one_page() again
  s390/pci: Migrate s390 IRQ logic to IRQ domain API
  genirq: Change hwirq parameter to irq_hw_number_t
  s390: Select POSIX_CPU_TIMERS_TASK_WORK
  s390: Unmap early KASAN shadow on memory offlining
  s390/vmem: Support 2G page splitting for KASAN shadow freeing
  s390/boot: Use entire page for PTEs
  s390/vmur: Use scnprintf() instead of sprintf()
2025-12-11 08:19:46 +09:00
Linus Torvalds 840b22edd5 dma-mapping fixes for Linux 6.19:
- last minute fix for missing parenthesis in the recently merged code
 (Hans de Goede) and removal of the excessive, non-fatal warnings (Dave
 Kleikamp)
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSrngzkoBtlA8uaaJ+Jp1EFxbsSRAUCaTmllwAKCRCJp1EFxbsS
 RNDxAQCZJ9/D9nJ+hbC2oVPyoenmhSfMqc9ZniiR8/z8UqBfqgEAsWWayVyJXiwp
 aIWXvW6lnUGzgNZ64ZKeNWvKK3uJOgg=
 =9lhd
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-6.19-2025-12-10' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux

Pull dma-mapping fixes from Marek Szyprowski:

 - last minute fix for missing parenthesis in recently merged code (Hans
   de Goede)

 - removal of excessive, non-fatal warnings (Dave Kleikamp)

* tag 'dma-mapping-6.19-2025-12-10' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
  dma-mapping: Fix DMA_BIT_MASK() macro being broken
  dma/pool: eliminate alloc_pages warning in atomic_pool_expand
2025-12-11 08:14:23 +09:00
Shuran Liu ac44dcc788 bpf: Fix verifier assumptions of bpf_d_path's output buffer
Commit 37cce22dbd ("bpf: verifier: Refactor helper access type
tracking") started distinguishing read vs write accesses performed by
helpers.

The second argument of bpf_d_path() is a pointer to a buffer that the
helper fills with the resulting path. However, its prototype currently
uses ARG_PTR_TO_MEM without MEM_WRITE.

Before 37cce22dbd, helper accesses were conservatively treated as
potential writes, so this mismatch did not cause issues. Since that
commit, the verifier may incorrectly assume that the buffer contents
are unchanged across the helper call and base its optimizations on this
wrong assumption. This can lead to misbehaviour in BPF programs that
read back the buffer, such as prefix comparisons on the returned path.

Fix this by marking the second argument of bpf_d_path() as
ARG_PTR_TO_MEM | MEM_WRITE so that the verifier correctly models the
write to the caller-provided buffer.

Fixes: 37cce22dbd ("bpf: verifier: Refactor helper access type tracking")
Co-developed-by: Zesen Liu <ftyg@live.com>
Signed-off-by: Zesen Liu <ftyg@live.com>
Co-developed-by: Peili Gao <gplhust955@gmail.com>
Signed-off-by: Peili Gao <gplhust955@gmail.com>
Co-developed-by: Haoran Ni <haoran.ni.cs@gmail.com>
Signed-off-by: Haoran Ni <haoran.ni.cs@gmail.com>
Signed-off-by: Shuran Liu <electronlsr@gmail.com>
Reviewed-by: Matt Bobrowski <mattbobrowski@google.com>
Link: https://lore.kernel.org/r/20251206141210.3148-2-electronlsr@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-12-10 01:34:04 -08:00
Linus Torvalds 0048fbb401 Futex changes for v6.19:
- Standardize on ktime_t in restart_block::time as well
    (Thomas Weißschuh)
 
  - Futex selftests:
    - Add robust list testcases (André Almeida)
    - Formatting fixes/cleanups (Carlos Llamas)
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmk5J9QRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1gimQ//VhPIVYqioY/opLXBWFAiyxWLF8RbzbsD
 URC4ppVqMa0d8np4VaoaNGTCyPOvhC5m/5q14BNgNhGfbvHOrx1W5cJ122v2vroC
 aNwBkapt655lHwgax4vLgM7jMPW1do9hoyqtjsUIQeKMDlyZZW22gY+ExRTPJZTm
 lP6PreWR74QTTKo5pTjNCoWSaLEmuoD5rbo52SkiTYHwBLL+Tqubl92mVFFUbE+Q
 9n/c/zDeRSaixKr64TvDAijGyjrH8XFhOoMyyOGCiiWyrXfjDPde0Cblt6OkBQeK
 /4/9uFo2TX3vFxQ2Zhj21gqMwQ84cdPvC86GkPh3FoxRNd66gArIpC14ME7rRn0V
 aDKmz+c8QhFdS9gp4+Gw2OkuzYylLl2RSoSjhz/ndrZh7OLn64cA9KjfhRscCfjD
 mcFMVFcjyK8BlgjXyTltC00tLftf6pyO7bqzO4kaYYbpls687ztfZ0mUSE60gLS4
 WuCnntD+Q2IADV0r5xU9ps1a37eaqiHfgtjON2wznJeyj46PbpkWxSpdTIogTRoS
 5JdLv+w1LdDMHaUuEPOgYYd6L69X0HpDauHotyRlGY+SoSoCoX6DbSXTg7CSqlqY
 VHeuLj2OKnELJpxLEo16pcW6qTd+BM/2jcG91H7VJxbb2sJnngT3XtgJagXE5KYR
 IYjcKq6DWig=
 =9KQQ
 -----END PGP SIGNATURE-----

Merge tag 'locking-futex-2025-12-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull futex updates from Ingo Molnar:

 - Standardize on ktime_t in restart_block::time as well (Thomas
   Weißschuh)

 - Futex selftests:
     - Add robust list testcases (André Almeida)
     - Formatting fixes/cleanups (Carlos Llamas)

* tag 'locking-futex-2025-12-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  futex: Store time as ktime_t in restart block
  selftests/futex: Create test for robust list
  selftests/futex: Skip tests if shmget unsupported
  selftests/futex: Add newline to ksft_exit_fail_msg()
  selftests/futex: Remove unused test_futex_mpol()
2025-12-10 17:21:30 +09:00
Cupertino Miranda d18dec4b89 bpf: verifier improvement in 32bit shift sign extension pattern
This patch improves the verifier to correctly compute bounds for
sign extension compiler pattern composed of left shift by 32bits
followed by a sign right shift by 32bits.  Pattern in the verifier was
limitted to positive value bounds and would reset bound computation for
negative values.  New code allows both positive and negative values for
sign extension without compromising bound computation and verifier to
pass.

This change is required by GCC which generate such pattern, and was
detected in the context of systemd, as described in the following GCC
bugzilla: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119731

Three new tests were added in verifier_subreg.c.

Signed-off-by: Cupertino Miranda  <cupertino.miranda@oracle.com>
Signed-off-by: Andrew Pinski  <andrew.pinski@oss.qualcomm.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Cc: David Faust  <david.faust@oracle.com>
Cc: Jose Marchesi  <jose.marchesi@oracle.com>
Cc: Elena Zannoni  <elena.zannoni@oracle.com>
Link: https://lore.kernel.org/r/20251202180220.11128-2-cupertino.miranda@oracle.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-12-10 00:12:09 -08:00
Kohei Enju 48e11bad9a bpf: cpumap: propagate underlying error in cpu_map_update_elem()
After commit 9216477449 ("bpf: cpumap: Add the possibility to attach
an eBPF program to cpumap"), __cpu_map_entry_alloc() may fail with
errors other than -ENOMEM, such as -EBADF or -EINVAL.

However, __cpu_map_entry_alloc() returns NULL on all failures, and
cpu_map_update_elem() unconditionally converts this NULL into -ENOMEM.
As a result, user space always receives -ENOMEM regardless of the actual
underlying error.

Examples of unexpected behavior:
  - Nonexistent fd  : -ENOMEM (should be -EBADF)
  - Non-BPF fd      : -ENOMEM (should be -EINVAL)
  - Bad attach type : -ENOMEM (should be -EINVAL)

Change __cpu_map_entry_alloc() to return ERR_PTR(err) instead of NULL
and have cpu_map_update_elem() propagate this error.

Signed-off-by: Kohei Enju <enjuk@amazon.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20251208131449.73036-2-enjuk@amazon.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-12-09 23:53:27 -08:00
T.J. Mercier 234483565d bpf: Fix truncated dmabuf iterator reads
If there is a large number (hundreds) of dmabufs allocated, the text
output generated from dmabuf_iter_seq_show can exceed common user buffer
sizes (e.g. PAGE_SIZE) necessitating multiple start/stop cycles to
iterate through all dmabufs. However the dmabuf iterator currently
returns NULL in dmabuf_iter_seq_start for all non-zero pos values, which
results in the truncation of the output before all dmabufs are handled.

After dma_buf_iter_begin / dma_buf_iter_next, the refcount of the buffer
is elevated so that the BPF iterator program can run without holding any
locks. When a stop occurs, instead of immediately dropping the reference
on the buffer, stash a pointer to the buffer in seq->priv until
either start is called or the iterator is released. This also enables
the resumption of iteration without first walking through the list of
dmabufs based on the pos value.

Fixes: 76ea955349 ("bpf: Add dmabuf iterator")
Signed-off-by: T.J. Mercier <tjmercier@google.com>
Link: https://lore.kernel.org/r/20251204000348.1413593-1-tjmercier@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-12-09 23:48:34 -08:00
Josh Poimboeuf ca45c84afb bpf: Add bpf_has_frame_pointer()
Introduce a bpf_has_frame_pointer() helper that unwinders can call to
determine whether a given instruction pointer is within the valid frame
pointer region of a BPF JIT program or trampoline (i.e., after the
prologue, before the epilogue).

This will enable livepatch (with the ORC unwinder) to reliably unwind
through BPF JIT frames.

Acked-by: Song Liu <song@kernel.org>
Acked-and-tested-by: Andrey Grodzovsky <andrey.grodzovsky@crowdstrike.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/r/fd2bc5b4e261a680774b28f6100509fd5ebad2f0.1764818927.git.jpoimboe@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
2025-12-09 23:29:42 -08:00
Sebastian Andrzej Siewior c94291914b cpu: Make atomic hotplug callbacks run with interrupts disabled on UP
On SMP systems the CPU hotplug callbacks in the "starting" range are
invoked while the CPU is brought up and interrupts are still
disabled. Callbacks which are added later are invoked via the
hotplug-thread on the target CPU and interrupts are explicitly disabled.

In the UP case callbacks which are added later are invoked directly without
the thread indirection. This is in principle okay since there is just one
CPU but those callbacks are invoked with interrupt disabled code. That's
incorrect as those callbacks assume interrupt disabled context.

Disable interrupts before invoking the callbacks on UP if the state is
atomic and interrupts are expected to be disabled.  The "save" part is
required because this is also invoked early in the boot process while
interrupts are disabled and must not be enabled prematurely.

Fixes: 06ddd17521 ("sched/smp: Always define is_percpu_thread() and scheduler_ipi()")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/20251127144723.ev9DuXXR@linutronix.de
2025-12-10 15:49:11 +09:00
Marc Zyngier 89acaa5537 genirq: Allow NULL affinity for setup_percpu_irq()
setup_percpu_irq() was forgotten when the percpu_devid infrastructure was
updated to deal with CPU affinities.

In order to keep ignoring users of this legacy API, provide sensible
defaults by setting the affinity to cpu_online_mask if none was provided by
the caller.

Fixes: bdf4e2ac29 ("genirq: Allow per-cpu interrupt sharing for non-overlapping affinities")
Reported-by: Daniel Thompson <danielt@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/20251205091814.3944205-1-maz@kernel.org
Closes: https://lore.kernel.org/r/aTFozefMQRg7lYxh@aspen.lan
2025-12-10 09:47:33 +09:00
Thaumy Cheng c418d8b4d7 perf/core: Fix missing read event generation on task exit
For events with inherit_stat enabled, a "read" event will be generated
to collect per task event counts on task exit.

The call chain is as follows:

do_exit
  -> perf_event_exit_task
    -> perf_event_exit_task_context
      -> perf_event_exit_event
        -> perf_remove_from_context
          -> perf_child_detach
            -> sync_child_event
              -> perf_event_read_event

However, the child event context detaches the task too early in
perf_event_exit_task_context, which causes sync_child_event to never
generate the read event in this case, since child_event->ctx->task is
always set to TASK_TOMBSTONE. Fix that by moving context lock section
backward to ensure ctx->task is not set to TASK_TOMBSTONE before
generating the read event.

Because perf_event_free_task calls perf_event_exit_task_context with
exit = false to tear down all child events from the context, and the
task never lived, accessing the task PID can lead to a use-after-free.

To fix that, let sync_child_event read task from argument and move the
call to the only place it should be triggered to avoid the effect of
setting ctx->task to TASK_TOMESTONE, and add a task parameter to
perf_event_exit_event to trigger the sync_child_event properly when
needed.

This bug can be reproduced by running "perf record -s" and attaching to
any program that generates perf events in its child tasks. If we check
the result with "perf report -T", the last line of the report will leave
an empty table like "# PID  TID", which is expected to contain the
per-task event counts by design.

Fixes: ef54c1a476 ("perf: Rework perf_event_exit_event()")
Signed-off-by: Thaumy Cheng <thaumy.love@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Cc: James Clark <james.clark@linaro.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: linux-perf-users@vger.kernel.org
Link: https://patch.msgid.link/20251209041600.963586-1-thaumy.love@gmail.com
2025-12-09 12:22:25 +01:00
Jakub Kicinski e56cadaa27 ynl: add regen hint to new headers
Recent commit 68e83f3472 ("tools: ynl-gen: add regeneration comment")
added a hint how to regenerate the code to the headers. Update
the new headers from this release cycle to also include it.

Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20251207004740.1657799-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-12-08 23:52:43 -08:00
Lai Jiangshan 51cd2d2dec workqueue: Process extra works in rescuer on memory pressure
Make the rescuer process more work on the last pwq when there are no
more to rescue for the whole workqueue to help the regular workers in
case it is a temporary memory pressure relief and to reduce relapse.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-08 09:18:02 -10:00
Lai Jiangshan e5a30c303b workqueue: Process rescuer work items one-by-one using a cursor
Previously, the rescuer scanned for all matching work items at once and
processed them within a single rescuer thread, which could cause one
blocking work item to stall all others.

Make the rescuer process work items one-by-one instead of slurping all
matches in a single pass.

Break the rescuer loop after finding and processing the first matching
work item, then restart the search to pick up the next. This gives
normal worker threads a chance to process other items which gives them
the opportunity to be processed instead of waiting on the rescuer's
queue and prevents a blocking work item from stalling the rest once
memory pressure is relieved.

Introduce a dummy cursor work item to avoid potentially O(N^2)
rescans of the work list.  The marker records the resume position for
the next scan, eliminating redundant traversals.

Also introduce RESCUER_BATCH to control the maximum number of work items
the rescuer processes in each turn, and move on to other PWQs when the
limit is reached.

Cc: ying chen <yc1082463@gmail.com>
Reported-by: ying chen <yc1082463@gmail.com>
Fixes: e22bee782b ("workqueue: implement concurrency managed dynamic worker pool")
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-08 09:17:49 -10:00
Lai Jiangshan fc5ff53d2a workqueue: Make send_mayday() take a PWQ argument directly
Make send_mayday() operate on a PWQ directly instead of taking a work
item, so that rescuer_thread() now calls send_mayday(pwq) instead of
open-coding the mayday list manipulation.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-08 09:17:11 -10:00
Chen Ridong 6ee43047e8 cpuset: Remove unnecessary checks in rebuild_sched_domains_locked
Commit 406100f3da ("cpuset: fix race between hotplug work and later CPU
offline") added a check for empty effective_cpus in partitions for cgroup
v2. However, this check did not account for remote partitions, which were
introduced later.

After commit 2125c0034c ("cgroup/cpuset: Make cpuset hotplug processing
synchronous"), cpuset hotplug handling is now synchronous. This eliminates
the race condition with subsequent CPU offline operations that the original
check aimed to fix.

Instead of extending the check to support remote partitions, this patch
removes all the redundant effective_cpus check. Additionally, it adds a
check and warning to verify that all generated sched domains consist of
active CPUs, preventing partition_sched_domains from being invoked with
offline CPUs.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-08 09:02:42 -10:00
Chen Ridong 82d7e59ea7 cgroup: switch to css_is_online() helper
Use the new css_is_online() helper that has been introduced to check css
online state, instead of testing the CSS_ONLINE flag directly. This
improves readability and centralizes the state check logic.

No functional changes intended.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-08 09:02:38 -10:00
Shakeel Butt 3309b63a22 cgroup: rstat: use LOCK CMPXCHG in css_rstat_updated
On x86-64, this_cpu_cmpxchg() uses CMPXCHG without LOCK prefix which
means it is only safe for the local CPU and not for multiple CPUs.
Recently the commit 36df6e3dbd ("cgroup: make css_rstat_updated nmi
safe") make css_rstat_updated lockless and uses lockless list to allow
reentrancy. Since css_rstat_updated can invoked from process context,
IRQ and NMI, it uses this_cpu_cmpxchg() to select the winner which will
inset the lockless lnode into the global per-cpu lockless list.

However the commit missed one case where lockless node of a cgroup can
be accessed and modified by another CPU doing the flushing. Basically
llist_del_first_init() in css_process_update_tree().

On a cursory look, it can be questioned how css_process_update_tree()
can see a lockless node in global lockless list where the updater is at
this_cpu_cmpxchg() and before llist_add() call in css_rstat_updated().
This can indeed happen in the presence of IRQs/NMI.

Consider this scenario: Updater for cgroup stat C on CPU A in process
context is after llist_on_list() check and before this_cpu_cmpxchg() in
css_rstat_updated() where it get interrupted by IRQ/NMI. In the IRQ/NMI
context, a new updater calls css_rstat_updated() for same cgroup C and
successfully inserts rstatc_pcpu->lnode.

Now concurrently CPU B is running the flusher and it calls
llist_del_first_init() for CPU A and got rstatc_pcpu->lnode of cgroup C
which was added by the IRQ/NMI updater.

Now imagine CPU B calling init_llist_node() on cgroup C's
rstatc_pcpu->lnode of CPU A and on CPU A, the process context updater
calling this_cpu_cmpxchg(rstatc_pcpu->lnode) concurrently.

The CMPXCNG without LOCK on CPU A is not safe and thus we need LOCK
prefix.

In Meta's fleet running the kernel with the commit 36df6e3dbd, we are
observing on some machines the memcg stats are getting skewed by more
than the actual memory on the system. On close inspection, we noticed
that lockless node for a workload for specific CPU was in the bad state
and thus all the updates on that CPU for that cgroup was being lost.

To confirm if this skew was indeed due to this CMPXCHG without LOCK in
css_rstat_updated(), we created a repro (using AI) at [1] which shows
that CMPXCHG without LOCK creates almost the same lnode corruption as
seem in Meta's fleet and with LOCK CMPXCHG the issue does not
reproduces.

Link: http://lore.kernel.org/efiagdwmzfwpdzps74fvcwq3n4cs36q33ij7eebcpssactv3zu@se4hqiwxcfxq [1]
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: stable@vger.kernel.org # v6.17+
Fixes: 36df6e3dbd ("cgroup: make css_rstat_updated nmi safe")
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-08 08:26:56 -10:00
John Stultz 12b5cd99a0 sched/ext: Avoid null ptr traversal when ->put_prev_task() is called with NULL next
Early when trying to get sched_ext and proxy-exe working together,
I kept tripping over NULL ptr in put_prev_task_scx() on the line:
  if (sched_class_above(&ext_sched_class, next->sched_class)) {

Which was due to put_prev_task() passes a NULL next, calling:
  prev->sched_class->put_prev_task(rq, prev, NULL);

put_prev_task_scx() already guards for a NULL next in the
switch_class case, but doesn't seem to have a guard for
sched_class_above() check.

I can't say I understand why this doesn't trip usually without
proxy-exec. And in newer kernels there are way fewer
put_prev_task(), and I can't easily reproduce the issue now
even with proxy-exec.

But we still have one put_prev_task() call left in core.c that
seems like it could trip this, so I wanted to send this out for
consideration.

tj: put_prev_task() can be called with NULL @next; however, when @p is
queued, that doesn't happen, so this condition shouldn't currently be
triggerable. The connection isn't straightforward or necessarily reliable,
so add the NULL check even if it can't currently be triggered.

Link: http://lkml.kernel.org/r/20251206022218.1541878-1-jstultz@google.com
Signed-off-by: John Stultz <jstultz@google.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-08 07:18:13 -10:00
Zqiang 517a44d185 sched_ext: Fix the memleak for sch->helper objects
This commit use kthread_destroy_worker() to release sch->helper
objects to fix the following kmemleak:

unreferenced object 0xffff888121ec7b00 (size 128):
  comm "scx_simple", pid 1197, jiffies 4295884415
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 ad 4e ad de  .............N..
    ff ff ff ff 00 00 00 00 ff ff ff ff ff ff ff ff  ................
  backtrace (crc 587b3352):
    kmemleak_alloc+0x62/0xa0
    __kmalloc_cache_noprof+0x28d/0x3e0
    kthread_create_worker_on_node+0xd5/0x1f0
    scx_enable.isra.210+0x6c2/0x25b0
    bpf_scx_reg+0x12/0x20
    bpf_struct_ops_link_create+0x2c3/0x3b0
    __sys_bpf+0x3102/0x4b00
    __x64_sys_bpf+0x79/0xc0
    x64_sys_call+0x15d9/0x1dd0
    do_syscall_64+0xf0/0x470
    entry_SYSCALL_64_after_hwframe+0x77/0x7f

Fixes: bff3b5aec1 ("sched_ext: Move disable machinery into scx_sched")
Cc: stable@vger.kernel.org # v6.16+
Signed-off-by: Zqiang <qiang.zhang@linux.dev>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-08 07:08:17 -10:00
Dave Kleikamp 463d439bec dma/pool: eliminate alloc_pages warning in atomic_pool_expand
atomic_pool_expand iteratively tries the allocation while decrementing
the page order. There is no need to issue a warning if an attempted
allocation fails.

Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Fixes: d7e673ec2c ("dma-pool: Only allocate from CMA when in same memory zone")
[mszyprow: fixed typo]
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/20251202152810.142370-1-dave.kleikamp@oracle.com
2025-12-08 09:40:57 +01:00
Tobias Schumacher 455a65260f genirq: Change hwirq parameter to irq_hw_number_t
The irqdomain implementation internally represents hardware IRQs as
irq_hw_number_t, which is defined as unsigned long int. When providing
an irq_hw_number_t to the generic_handle_domain() functions that expect
and unsigned int hwirq, this can lead to a loss of information. Change
the hwirq parameter to irq_hw_number_t to support the full range of
hwirqs.

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
Reviewed-by: Farhan Ali <alifm@linux.ibm.com>
Signed-off-by: Tobias Schumacher <ts@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-12-07 16:15:22 +01:00
Linus Torvalds 509d3f4584 Significant patch series in this pull request:
- The 6 patch series "panic: sys_info: Refactor and fix a potential
   issue" from Andy Shevchenko fixes a build issue and does some cleanup in
   ib/sys_info.c.
 
 - The 9 patch series "Implement mul_u64_u64_div_u64_roundup()" from
   David Laight enhances the 64-bit math code on behalf of a PWM driver and
   beefs up the test module for these library functions.
 
 - The 2 patch series "scripts/gdb/symbols: make BPF debug info available
   to GDB" from Ilya Leoshkevich makes BPF symbol names, sizes, and line
   numbers available to the GDB debugger.
 
 - The 4 patch series "Enable hung_task and lockup cases to dump system
   info on demand" from Feng Tang adds a sysctl which can be used to cause
   additional info dumping when the hung-task and lockup detectors fire.
 
 - The 6 patch series "lib/base64: add generic encoder/decoder, migrate
   users" from Kuan-Wei Chiu adds a general base64 encoder/decoder to lib/
   and migrates several users away from their private implementations.
 
 - The 2 patch series "rbree: inline rb_first() and rb_last()" from Eric
   Dumazet makes TCP a little faster.
 
 - The 9 patch series "liveupdate: Rework KHO for in-kernel users" from
   Pasha Tatashin reworks the KEXEC Handover interfaces in preparation for
   Live Update Orchestrator (LUO), and possibly for other future clients.
 
 - The 13 patch series "kho: simplify state machine and enable dynamic
   updates" from Pasha Tatashin increases the flexibility of KEXEC
   Handover.  Also preparation for LUO.
 
 - The 18 patch series "Live Update Orchestrator" from Pasha Tatashin is
   a major new feature targeted at cloud environments.  Quoting the [0/N]:
 
     This series introduces the Live Update Orchestrator, a kernel subsystem
     designed to facilitate live kernel updates using a kexec-based reboot.
     This capability is critical for cloud environments, allowing hypervisors
     to be updated with minimal downtime for running virtual machines.  LUO
     achieves this by preserving the state of selected resources, such as
     memory, devices and their dependencies, across the kernel transition.
 
     As a key feature, this series includes support for preserving memfd file
     descriptors, which allows critical in-memory data, such as guest RAM or
     any other large memory region, to be maintained in RAM across the kexec
     reboot.
 
   Mike Rappaport merits a mention here, for his extensive review and
   testing work.
 
 - The 3 patch series "kexec: reorganize kexec and kdump sysfs" from
   Sourabh Jain moves the kexec and kdump sysfs entries from /sys/kernel/
   to /sys/kernel/kexec/ and adds back-compatibility symlinks which can
   hopefully be removed one day.
 
 - The 2 patch series "kho: fixes for vmalloc restoration" from Mike
   Rapoport fixes a BUG which was being hit during KHO restoration of
   vmalloc() regions.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaTSAkQAKCRDdBJ7gKXxA
 jrkiAP9QKfsRv46XZaM5raScjY1ayjP+gqb2rgt6BQ/gZvb2+wD/cPAYOR6BiX52
 n0pVpQmG5P/KyOmpLztn96ejL4heKwQ=
 =JY96
 -----END PGP SIGNATURE-----

Merge tag 'mm-nonmm-stable-2025-12-06-11-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull non-MM updates from Andrew Morton:

 - "panic: sys_info: Refactor and fix a potential issue" (Andy Shevchenko)
   fixes a build issue and does some cleanup in ib/sys_info.c

 - "Implement mul_u64_u64_div_u64_roundup()" (David Laight)
   enhances the 64-bit math code on behalf of a PWM driver and beefs up
   the test module for these library functions

 - "scripts/gdb/symbols: make BPF debug info available to GDB" (Ilya Leoshkevich)
   makes BPF symbol names, sizes, and line numbers available to the GDB
   debugger

 - "Enable hung_task and lockup cases to dump system info on demand" (Feng Tang)
   adds a sysctl which can be used to cause additional info dumping when
   the hung-task and lockup detectors fire

 - "lib/base64: add generic encoder/decoder, migrate users" (Kuan-Wei Chiu)
   adds a general base64 encoder/decoder to lib/ and migrates several
   users away from their private implementations

 - "rbree: inline rb_first() and rb_last()" (Eric Dumazet)
   makes TCP a little faster

 - "liveupdate: Rework KHO for in-kernel users" (Pasha Tatashin)
   reworks the KEXEC Handover interfaces in preparation for Live Update
   Orchestrator (LUO), and possibly for other future clients

 - "kho: simplify state machine and enable dynamic updates" (Pasha Tatashin)
   increases the flexibility of KEXEC Handover. Also preparation for LUO

 - "Live Update Orchestrator" (Pasha Tatashin)
   is a major new feature targeted at cloud environments. Quoting the
   cover letter:

      This series introduces the Live Update Orchestrator, a kernel
      subsystem designed to facilitate live kernel updates using a
      kexec-based reboot. This capability is critical for cloud
      environments, allowing hypervisors to be updated with minimal
      downtime for running virtual machines. LUO achieves this by
      preserving the state of selected resources, such as memory,
      devices and their dependencies, across the kernel transition.

      As a key feature, this series includes support for preserving
      memfd file descriptors, which allows critical in-memory data, such
      as guest RAM or any other large memory region, to be maintained in
      RAM across the kexec reboot.

   Mike Rappaport merits a mention here, for his extensive review and
   testing work.

 - "kexec: reorganize kexec and kdump sysfs" (Sourabh Jain)
   moves the kexec and kdump sysfs entries from /sys/kernel/ to
   /sys/kernel/kexec/ and adds back-compatibility symlinks which can
   hopefully be removed one day

 - "kho: fixes for vmalloc restoration" (Mike Rapoport)
   fixes a BUG which was being hit during KHO restoration of vmalloc()
   regions

* tag 'mm-nonmm-stable-2025-12-06-11-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (139 commits)
  calibrate: update header inclusion
  Reinstate "resource: avoid unnecessary lookups in find_next_iomem_res()"
  vmcoreinfo: track and log recoverable hardware errors
  kho: fix restoring of contiguous ranges of order-0 pages
  kho: kho_restore_vmalloc: fix initialization of pages array
  MAINTAINERS: TPM DEVICE DRIVER: update the W-tag
  init: replace simple_strtoul with kstrtoul to improve lpj_setup
  KHO: fix boot failure due to kmemleak access to non-PRESENT pages
  Documentation/ABI: new kexec and kdump sysfs interface
  Documentation/ABI: mark old kexec sysfs deprecated
  kexec: move sysfs entries to /sys/kernel/kexec
  test_kho: always print restore status
  kho: free chunks using free_page() instead of kfree()
  selftests/liveupdate: add kexec test for multiple and empty sessions
  selftests/liveupdate: add simple kexec-based selftest for LUO
  selftests/liveupdate: add userspace API selftests
  docs: add documentation for memfd preservation via LUO
  mm: memfd_luo: allow preserving memfd
  liveupdate: luo_file: add private argument to store runtime state
  mm: shmem: export some functions to internal.h
  ...
2025-12-06 14:01:20 -08:00
Linus Torvalds 09670b8c38 tracing fixes for v6.19:
- Fix accounting of stop_count in file release
 
   On opening the trace file, if "pause-on-trace" option is set, it will
   increment the stop_count. On file release, it checks if stop_count is set,
   and if so it decrements it. Since this code was originally written, the
   stop_count can be incremented by other use cases. This makes just checking
   the stop_count not enough to know if it should be decremented.
 
   Add a new iterator flag called "PAUSE" and have it set if the open
   disables tracing and only decrement the stop_count if that flag is set on
   close.
 
 - Remove length field in trace_seq_printf() of print_synth_event()
 
   When printing the synthetic event that has a static length array field,
   the vsprintf() of the trace_seq_printf() triggered a "(efault)" in the
   output. That's because the print_fmt replaced the "%.*s" with "%s" causing
   the arguments to be off.
 
 - Fix a bunch of typos
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaTRsYBQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qrYpAQC3Qc5QMOlPjqGHXls/4IR4SBEAvsUi
 VZx3PdknfYCe3AD9HoYGOtrDDhSJ1tQbsWP5ud2jatHwL0zGAl3legNp7ww=
 =jlNP
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Fix accounting of stop_count in file release

   On opening the trace file, if "pause-on-trace" option is set, it will
   increment the stop_count. On file release, it checks if stop_count is
   set, and if so it decrements it. Since this code was originally
   written, the stop_count can be incremented by other use cases. This
   makes just checking the stop_count not enough to know if it should be
   decremented.

   Add a new iterator flag called "PAUSE" and have it set if the open
   disables tracing and only decrement the stop_count if that flag is
   set on close.

 - Remove length field in trace_seq_printf() of print_synth_event()

   When printing the synthetic event that has a static length array
   field, the vsprintf() of the trace_seq_printf() triggered a
   "(efault)" in the output. That's because the print_fmt replaced the
   "%.*s" with "%s" causing the arguments to be off.

 - Fix a bunch of typos

* tag 'trace-v6.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Fix typo in trace_seq.c
  tracing: Fix typo in trace_probe.c
  tracing: Fix multiple typos in trace_osnoise.c
  tracing: Fix multiple typos in trace_events_user.c
  tracing: Fix typo in trace_events_trigger.c
  tracing: Fix typo in trace_events_hist.c
  tracing: Fix typo in trace_events_filter.c
  tracing: Fix multiple typos in trace_events.c
  tracing: Fix multiple typos in trace.c
  tracing: Fix typo in ring_buffer_benchmark.c
  tracing: Fix multiple typos in ring_buffer.c
  tracing: Fix typo in fprobe.c
  tracing: Fix typo in fpgraph.c
  tracing: Fix fixed array of synthetic event
  tracing: Fix enabling of tracing on file release
2025-12-06 13:49:40 -08:00
Linus Torvalds 09bcd5ef66 Miscellaneous scheduler fixes/cleanups:
- Fix psi_dequeue() for Proxy Execution
   - Fix hrtick() vs. scheduling context bug
   - Fix unfairness caused by stalled tg_load_avg_contrib when the last task migrates out
   - Fix whitespace noise in headers
   - Remove a preempt-disable section in rt_mutex_setprio()
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmk0FhoRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1holg/8DQWTB5UcOSK8r4VR6xsDkxmnEA4RNJYg
 YWMAUCJbeAP1qciX9QbPH6T8aKAtrx5d7aLqGfDj7u0sRWTtkz32FcqWiny56vox
 Rc7UIHCL9n5r/ZMwy5sNN7Dxfdr2Eqmxyg5yaAT3VU3AkBQWzVfj8wQbzYL0FPy3
 aw4kwaRB8NMTANdcyVi3hTIDXbLeNb8WCvUKmH0YDfEpeKBikzLBn+yvlkGB/9Wb
 LZ3CzPUtLVdKyJ/W1VdJBokZ1DSUhMkLFFWXxFqt/5TgaXu5wYyCpUr0mvHSQDEd
 PITEMjhZ+NVMQLSSe3od1qa3vIpNe1W/t0FvlXPHXmZ/lmqC3UXwZuO1gzzCz9Jk
 7NluFcoFdSNPLlDEjzA1qls6YAaKHZ8FZfrWRjr2LnZZKZj1r6pfeVJjCWl6Lw+U
 7aOH+Z4TZexgdjZo9qC+S7VJUvt5+P6SipVE/F9a4xd85kEWxGUhKhJuDD7b0Ksq
 pe0dQcva7mW7/FpAZYdSycYTJ98fU4SiUeKp7Er4xX0mRLO7KVFPja9M5ZHvPGH3
 m+i0ONknEDtavgFmg0Z2L2+bpad+cv18hwR94Jhuw7q5ZciwV9vaO/4hbGb4AlAV
 /eJ4rHqPWhOTfe4iHDl82nRYeOxT2V5tX0Uu5oDAKhpAk9xNFkByQsl53gqdJT+Q
 3zVRbx2LIA4=
 =kr0U
 -----END PGP SIGNATURE-----

Merge tag 'sched-urgent-2025-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Ingo Molnar:
 "Miscellaneous scheduler fixes/cleanups:

   - Fix psi_dequeue() for Proxy Execution

   - Fix hrtick() vs. scheduling context bug

   - Fix unfairness caused by stalled tg_load_avg_contrib when the last
     task migrates out

   - Fix whitespace noise in headers

   - Remove a preempt-disable section in rt_mutex_setprio()"

* tag 'sched-urgent-2025-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/core: Fix psi_dequeue() for Proxy Execution
  sched/fair: Fix unfairness caused by stalled tg_load_avg_contrib when the last task migrates out
  sched/rt: Remove a preempt-disable section in rt_mutex_setprio()
  sched/hrtick: Fix hrtick() vs. scheduling context
  sched/headers: Remove whitespace noise from kernel/sched/sched.h
2025-12-06 12:31:21 -08:00
Linus Torvalds 08b8ddac1f Address various objtool scalability bugs/inefficiencies exposed by
allmodconfig builds, plus improve the quality of alternatives
 instructions generated code and disassembly.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmk0FVoRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1iClg//dHY58dvrsp5Fzo10XgU99/kwzNEgl2b5
 SMrSEbliTrehdpG4vBvig9tAMZurxOVIf6yDBEtV45XfD6w3tw6EFYpO1was9wTE
 R/80Ze6BAEeao782xN3sCpakU1Ogwbxhe4jYFZKE/WVbP9ZaeCI8qeBj3RAuOQ9y
 PCJzjD5fl9c2cAGDqCJEswxIptpP7eXoBo/V3Txf46M8/ffFcXdJbHN3HRBlszVs
 5I9Wb2/vFmwJ4Yi4EO8H7KfzwaXA8wW/MJSDcM24P2/+o5iTqSLNd+rADFMW3XF2
 /8b3uAy/6A6tT3ek1teNoM7qB9hRpM1pmpFwgjjTkjl8yamEp6P/W99qUN+UmfV+
 NTiW9sz7ShhVTMCdALIljyjmji318crKYQBDulAHuEACpodcBg/GUGfuUcrjSRB/
 C7PLatOpfMCODPRGPH4+8Wg8nnBGvOEjjODZBjAq2yU5aJnBeLPmbK2mtcaJtKi+
 R0T2LIsNgmnEa4wRZbH8i4jXsgcbe6gD45Tx3qZpss7D4d9IyRWPO8v6GegFUpvh
 dw8qBqhgi1FzryZ/5uwh5IzkVq+iXHqkPBsV9w7CVSFF1Kc5w1/l7MXsEjkc7Xe3
 qMjc43qsN0H/7ngoIA7yp4m7q87gqJMzReIfeIF4pGVtoULGQ+drN0jjQE/SHiKS
 /EM8IAAk0pU=
 =2DKc
 -----END PGP SIGNATURE-----

Merge tag 'objtool-urgent-2025-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull objtool fixes from Ingo Molnar:
 "Address various objtool scalability bugs/inefficiencies exposed by
  allmodconfig builds, plus improve the quality of alternatives
  instructions generated code and disassembly"

* tag 'objtool-urgent-2025-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  objtool: Simplify .annotate_insn code generation output some more
  objtool: Add more robust signal error handling, detect and warn about stack overflows
  objtool: Remove newlines and tabs from annotation macros
  objtool: Consolidate annotation macros
  x86/asm: Remove ANNOTATE_DATA_SPECIAL usage
  x86/alternative: Remove ANNOTATE_DATA_SPECIAL usage
  objtool: Fix stack overflow in validate_branch()
2025-12-06 11:56:51 -08:00
Linus Torvalds a7405aa92f dma-mapping updates for Linux 6.19:
- next part of DMA mapping API refactoring to physical addresses as the primary
 interface instead of page+offset parameters; this time dma_map_ops callbacks
 are converted to physical addresses, what in turn results also in some
 simplification of architecture specific code (Leon Romanovsky and Jason
 Gunthorpe)
 - clarify that dma_map_benchmark is not a kernel self-test, but standalone
 tool (Qinxin Xia)
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSrngzkoBtlA8uaaJ+Jp1EFxbsSRAUCaTLpeQAKCRCJp1EFxbsS
 RKlAAQCo/gVslheoJ+h4Hk5oUjLfVQaiamJOlzxw12EVHVAs3AEA7GILIAL1GbGn
 EpkgwjIuz/J/4aGhNbYO6J+C1qMAHww=
 =aM6P
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-6.19-2025-12-05' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux

Pull dma-mapping updates from Marek Szyprowski:

 - More DMA mapping API refactoring to physical addresses as the primary
   interface instead of page+offset parameters.

   This time dma_map_ops callbacks are converted to physical addresses,
   what in turn results also in some simplification of architecture
   specific code (Leon Romanovsky and Jason Gunthorpe)

 - Clarify that dma_map_benchmark is not a kernel self-test, but
   standalone tool (Qinxin Xia)

* tag 'dma-mapping-6.19-2025-12-05' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
  dma-mapping: remove unused map_page callback
  xen: swiotlb: Convert mapping routine to rely on physical address
  x86: Use physical address for DMA mapping
  sparc: Use physical address DMA mapping
  powerpc: Convert to physical address DMA mapping
  parisc: Convert DMA map_page to map_phys interface
  MIPS/jazzdma: Provide physical address directly
  alpha: Convert mapping routine to rely on physical address
  dma-mapping: remove unused mapping resource callbacks
  xen: swiotlb: Switch to physical address mapping callbacks
  ARM: dma-mapping: Switch to physical address mapping callbacks
  ARM: dma-mapping: Reduce struct page exposure in arch_sync_dma*()
  dma-mapping: convert dummy ops to physical address mapping
  dma-mapping: prepare dma_map_ops to conversion to physical address
  tools/dma: move dma_map_benchmark from selftests to tools/dma
2025-12-06 09:25:05 -08:00
John Stultz c2ae8b0df2 sched/core: Fix psi_dequeue() for Proxy Execution
Currently, if the sleep flag is set, psi_dequeue() doesn't
change any of the psi_flags.

This is because psi_task_switch() will clear TSK_ONCPU as well
as other potential flags (TSK_RUNNING), and the assumption is
that a voluntary sleep always consists of a task being dequeued
followed shortly there after with a psi_sched_switch() call.

Proxy Execution changes this expectation, as mutex-blocked tasks
that would normally sleep stay on the runqueue. But in the case
where the mutex-owning task goes to sleep, or the owner is on a
remote cpu, we will then deactivate the blocked task shortly
after.

In that situation, the mutex-blocked task will have had its
TSK_ONCPU cleared when it was switched off the cpu, but it will
stay TSK_RUNNING. Then if we later dequeue it (as currently done
if we hit a case find_proxy_task() can't yet handle, such as the
case of the owner being on another rq or a sleeping owner)
psi_dequeue() won't change any state (leaving it TSK_RUNNING),
as it incorrectly expects a psi_task_switch() call to
immediately follow.

Later on when the task get woken/re-enqueued, and psi_flags are
set for TSK_RUNNING, we hit an error as the task is already
TSK_RUNNING:

  psi: inconsistent task state! task=188:kworker/28:0 cpu=28 psi_flags=4 clear=0 set=4

To resolve this, extend the logic in psi_dequeue() so that
if the sleep flag is set, we also check if psi_flags have
TSK_ONCPU set (meaning the psi_task_switch is imminent) before
we do the shortcut return.

If TSK_ONCPU is not set, that means we've already switched away,
and this psi_dequeue call needs to clear the flags.

Fixes: be41bde4c3 ("sched: Add an initial sketch of the find_proxy_task() function")
Reported-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Haiyue Wang <haiyuewa@163.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://patch.msgid.link/20251205012721.756394-1-jstultz@google.com
Closes: https://lore.kernel.org/lkml/20251117185550.365156-1-kprateek.nayak@amd.com/
2025-12-06 10:13:16 +01:00
xupengbo ca125231dd sched/fair: Fix unfairness caused by stalled tg_load_avg_contrib when the last task migrates out
When a task is migrated out, there is a probability that the tg->load_avg
value will become abnormal. The reason is as follows:

1. Due to the 1ms update period limitation in update_tg_load_avg(), there
   is a possibility that the reduced load_avg is not updated to tg->load_avg
   when a task migrates out.

2. Even though __update_blocked_fair() traverses the leaf_cfs_rq_list and
   calls update_tg_load_avg() for cfs_rqs that are not fully decayed, the key
   function cfs_rq_is_decayed() does not check whether
   cfs->tg_load_avg_contrib is null. Consequently, in some cases,
   __update_blocked_fair() removes cfs_rqs whose avg.load_avg has not been
   updated to tg->load_avg.

Add a check of cfs_rq->tg_load_avg_contrib in cfs_rq_is_decayed(),
which fixes the case (2.) mentioned above.

Fixes: 1528c661c2 ("sched/fair: Ratelimit update to tg->load_avg")
Signed-off-by: xupengbo <xupengbo@oppo.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Aaron Lu <ziqianlu@bytedance.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Aaron Lu <ziqianlu@bytedance.com>
Link: https://patch.msgid.link/20250827022208.14487-1-xupengbo@oppo.com
2025-12-06 10:03:13 +01:00
Sebastian Andrzej Siewior 22abd83277 sched/rt: Remove a preempt-disable section in rt_mutex_setprio()
rt_mutex_setprio() has only one caller: rt_mutex_adjust_prio(). It
expects that task_struct::pi_lock and rt_mutex_base::wait_lock are held.
Both locks are raw_spinlock_t and are acquired with disabled interrupts.

Nevertheless rt_mutex_setprio() disables preemption while invoking
__balance_callbacks() and raw_spin_rq_unlock(). Even if one of the
balance callbacks unlocks the rq then it must not enable interrupts
because rt_mutex_base::wait_lock is still locked.
Therefore interrupts should remain disabled and disabling preemption is
not needed.

Commit 4c9a4bc89a ("sched: Allow balance callbacks for check_class_changed()")
adds a preempt-disable section to rt_mutex_setprio() and
__sched_setscheduler(). In __sched_setscheduler() the preemption is
disabled before rq is unlocked and interrupts enabled but I don't see
why it makes a difference in rt_mutex_setprio().

Remove the preempt_disable() section from rt_mutex_setprio().

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://patch.msgid.link/20251127155529.t_sTatE4@linutronix.de
2025-12-06 10:03:13 +01:00
Peter Zijlstra e38e529974 sched/hrtick: Fix hrtick() vs. scheduling context
The sched_class::task_tick() method is called on the donor
sched_class, and sched_tick() hands it rq->donor as argument,
which is consistent.

However, while hrtick() uses the donor sched_class, it then passes
rq->curr, which is inconsistent. Fix it.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: John Stultz <jstultz@google.com>
Link: https://patch.msgid.link/20250918080205.442967033@infradead.org
2025-12-06 10:03:13 +01:00
Ingo Molnar dde3763365 sched/headers: Remove whitespace noise from kernel/sched/sched.h
A single case of space-Tab noise snuck in recently.

Fixes: 36569780b0 ("sched: Change nr_uninterruptible type to unsigned long")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://patch.msgid.link/176478595428.498.13816176784792752599.tip-bot2@tip-bot2
2025-12-06 10:03:13 +01:00
Linus Torvalds 208eed95fc soc: driver updates for 6.19
This is the first half of the driver changes:
 
  - A treewide interface change to the "syscore" operations for
    power management, as a preparation for future Tegra specific
    changes.
 
  - Reset controller updates with added drivers for LAN969x, eic770
    and RZ/G3S SoCs.
 
  - Protection of system controller registers on Renesas and Google SoCs,
    to prevent trivially triggering a system crash from e.g. debugfs
    access.
 
  - soc_device identification updates on Nvidia, Exynos and Mediatek
 
  - debugfs support in the ST STM32 firewall driver
 
  - Minor updates for SoC drivers on AMD/Xilinx, Renesas,  Allwinner, TI
 
  - Cleanups for memory controller support on Nvidia and Renesas
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEo6/YBQwIrVS28WGKmmx57+YAGNkFAmky/8gACgkQmmx57+YA
 GNlqohAApPTLM6Q4gf1cIcsTVaP0uxx9CBgupCGuT5ORrOMKBghVWjTOTSxeEAab
 UQF465QwYUUu602GH34UmRaY9CKW2bMIsfmkgmxNB4Y4Qd7yCgQNJ/h/TnN0rBH+
 qTeEsRH/hax4miSNsh0oOZfVkZkg+23VF02d1VL0CcaX7y4oT45RPBQugrNx/gNS
 fHfVwgIq8vJ8WyrmM1h2nv1i1vgSzEy50B3kY674BBw83FcJTafNLvD7N5DSgD1H
 /I/2xeyEpb+oL1VfeHcXZaX/jf04O+cmvSzBi+MOH1tI3MpdxJib1vEYBdggoOWN
 K/FFGgsOY+DNmJPpSnPTTu8UpzksS8SxGBP7M9Q8roKZwA2c9wLotxySvjki5yv8
 2zvabRdzbrSaoYwsH9QnZdQ2hVkJ9W8MESu8PevD3yMNuFUzledPDWW0N1SbGm78
 0ZdB6NPdaBZYHMNMRdFhN8P275/Mx5e0XWN9oYMQqjPooH7YkyT7hJWz6ao2PCJP
 8mDmnW1RzL+LWf7mJ25ZEtS+YjmKA/PVmogRrGurKCadvdxXqCF09KNljICHhmmu
 t0KB4dqw02OXLPvBk21qCi0zL56w1JDgqtS8suFvDYo9sCceeAbAcmpyoUOFj2N+
 Upn976tb4iqFrr9mFswpmCJWPpqJkU+A+KnKsIRPU7N4kSrP35I=
 =HvlN
 -----END PGP SIGNATURE-----

Merge tag 'soc-drivers-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc

Pull SoC driver updates from Arnd Bergmann:
 "This is the first half of the driver changes:

   - A treewide interface change to the "syscore" operations for power
     management, as a preparation for future Tegra specific changes

   - Reset controller updates with added drivers for LAN969x, eic770 and
     RZ/G3S SoCs

   - Protection of system controller registers on Renesas and Google
     SoCs, to prevent trivially triggering a system crash from e.g.
     debugfs access

   - soc_device identification updates on Nvidia, Exynos and Mediatek

   - debugfs support in the ST STM32 firewall driver

   - Minor updates for SoC drivers on AMD/Xilinx, Renesas, Allwinner, TI

   - Cleanups for memory controller support on Nvidia and Renesas"

* tag 'soc-drivers-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (114 commits)
  memory: tegra186-emc: Fix missing put_bpmp
  Documentation: reset: Remove reset_controller_add_lookup()
  reset: fix BIT macro reference
  reset: rzg2l-usbphy-ctrl: Fix a NULL vs IS_ERR() bug in probe
  reset: th1520: Support reset controllers in more subsystems
  reset: th1520: Prepare for supporting multiple controllers
  dt-bindings: reset: thead,th1520-reset: Add controllers for more subsys
  dt-bindings: reset: thead,th1520-reset: Remove non-VO-subsystem resets
  reset: remove legacy reset lookup code
  clk: davinci: psc: drop unused reset lookup
  reset: rzg2l-usbphy-ctrl: Add support for RZ/G3S SoC
  reset: rzg2l-usbphy-ctrl: Add support for USB PWRRDY
  dt-bindings: reset: renesas,rzg2l-usbphy-ctrl: Document RZ/G3S support
  reset: eswin: Add eic7700 reset driver
  dt-bindings: reset: eswin: Documentation for eic7700 SoC
  reset: sparx5: add LAN969x support
  dt-bindings: reset: microchip: Add LAN969x support
  soc: rockchip: grf: Add select correct PWM implementation on RK3368
  soc/tegra: pmc: Add USB wake events for Tegra234
  amba: tegra-ahb: Fix device leak on SMMU enable
  ...
2025-12-05 17:29:04 -08:00
Amery Hung b5709f6d26 bpf: Support associating BPF program with struct_ops
Add a new BPF command BPF_PROG_ASSOC_STRUCT_OPS to allow associating
a BPF program with a struct_ops map. This command takes a file
descriptor of a struct_ops map and a BPF program and set
prog->aux->st_ops_assoc to the kdata of the struct_ops map.

The command does not accept a struct_ops program nor a non-struct_ops
map. Programs of a struct_ops map is automatically associated with the
map during map update. If a program is shared between two struct_ops
maps, prog->aux->st_ops_assoc will be poisoned to indicate that the
associated struct_ops is ambiguous. The pointer, once poisoned, cannot
be reset since we have lost track of associated struct_ops. For other
program types, the associated struct_ops map, once set, cannot be
changed later. This restriction may be lifted in the future if there is
a use case.

A kernel helper bpf_prog_get_assoc_struct_ops() can be used to retrieve
the associated struct_ops pointer. The returned pointer, if not NULL, is
guaranteed to be valid and point to a fully updated struct_ops struct.
For struct_ops program reused in multiple struct_ops map, the return
will be NULL.

prog->aux->st_ops_assoc is protected by bumping the refcount for
non-struct_ops programs and RCU for struct_ops programs. Since it would
be inefficient to track programs associated with a struct_ops map, every
non-struct_ops program will bump the refcount of the map to make sure
st_ops_assoc stays valid. For a struct_ops program, it is protected by
RCU as map_free will wait for an RCU grace period before disassociating
the program with the map. The helper must be called in BPF program
context or RCU read-side critical section.

struct_ops implementers should note that the struct_ops returned may not
be initialized nor attached yet. The struct_ops implementer will be
responsible for tracking and checking the state of the associated
struct_ops map if the use case expects an initialized or attached
struct_ops.

Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/bpf/20251203233748.668365-3-ameryhung@gmail.com
2025-12-05 16:17:57 -08:00
Amery Hung 1588c81b9f bpf: Allow verifier to fixup kernel module kfuncs
Allow verifier to fixup kfuncs in kernel module to support kfuncs with
__prog arguments. Currently, special kfuncs and kfuncs with __prog
arguments are kernel kfuncs. Allowing kernel module kfuncs should not
affect existing kfunc fixup as kernel module kfuncs have BTF IDs greater
than kernel kfuncs' BTF IDs.

Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20251203233748.668365-2-ameryhung@gmail.com
2025-12-05 16:17:57 -08:00
Linus Torvalds 7cd122b552 Some filesystems use a kinda-sorta controlled dentry refcount leak to pin
dentries of created objects in dcache (and undo it when removing those).
 Reference is grabbed and not released, but it's not actually _stored_
 anywhere.  That works, but it's hard to follow and verify; among other
 things, we have no way to tell _which_ of the increments is intended
 to be an unpaired one.  Worse, on removal we need to decide whether
 the reference had already been dropped, which can be non-trivial if
 that removal is on umount and we need to figure out if this dentry is
 pinned due to e.g. unlink() not done.  Usually that is handled by using
 kill_litter_super() as ->kill_sb(), but there are open-coded special
 cases of the same (consider e.g. /proc/self).
 
 Things get simpler if we introduce a new dentry flag (DCACHE_PERSISTENT)
 marking those "leaked" dentries.  Having it set claims responsibility
 for +1 in refcount.
 
 The end result this series is aiming for:
 
 * get these unbalanced dget() and dput() replaced with new primitives that
   would, in addition to adjusting refcount, set and clear persistency flag.
 * instead of having kill_litter_super() mess with removing the remaining
   "leaked" references (e.g. for all tmpfs files that hadn't been removed
   prior to umount), have the regular shrink_dcache_for_umount() strip
   DCACHE_PERSISTENT of all dentries, dropping the corresponding
   reference if it had been set.  After that kill_litter_super() becomes
   an equivalent of kill_anon_super().
 
 Doing that in a single step is not feasible - it would affect too many places
 in too many filesystems.  It has to be split into a series.
 
 This work has really started early in 2024; quite a few preliminary pieces
 have already gone into mainline.  This chunk is finally getting to the
 meat of that stuff - infrastructure and most of the conversions to it.
 
 Some pieces are still sitting in the local branches, but the bulk of
 that stuff is here.
 
 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCaTEq1wAKCRBZ7Krx/gZQ
 643uAQC1rRslhw5l7OjxEpIYbGG4M+QaadN4Nf5Sr2SuTRaPJQD/W4oj/u4C2eCw
 Dd3q071tqyvm/PXNgN2EEnIaxlFUlwc=
 =rKq+
 -----END PGP SIGNATURE-----

Merge tag 'pull-persistency' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull persistent dentry infrastructure and conversion from Al Viro:
 "Some filesystems use a kinda-sorta controlled dentry refcount leak to
  pin dentries of created objects in dcache (and undo it when removing
  those). A reference is grabbed and not released, but it's not actually
  _stored_ anywhere.

  That works, but it's hard to follow and verify; among other things, we
  have no way to tell _which_ of the increments is intended to be an
  unpaired one. Worse, on removal we need to decide whether the
  reference had already been dropped, which can be non-trivial if that
  removal is on umount and we need to figure out if this dentry is
  pinned due to e.g. unlink() not done. Usually that is handled by using
  kill_litter_super() as ->kill_sb(), but there are open-coded special
  cases of the same (consider e.g. /proc/self).

  Things get simpler if we introduce a new dentry flag
  (DCACHE_PERSISTENT) marking those "leaked" dentries. Having it set
  claims responsibility for +1 in refcount.

  The end result this series is aiming for:

   - get these unbalanced dget() and dput() replaced with new primitives
     that would, in addition to adjusting refcount, set and clear
     persistency flag.

   - instead of having kill_litter_super() mess with removing the
     remaining "leaked" references (e.g. for all tmpfs files that hadn't
     been removed prior to umount), have the regular
     shrink_dcache_for_umount() strip DCACHE_PERSISTENT of all dentries,
     dropping the corresponding reference if it had been set. After that
     kill_litter_super() becomes an equivalent of kill_anon_super().

  Doing that in a single step is not feasible - it would affect too many
  places in too many filesystems. It has to be split into a series.

  This work has really started early in 2024; quite a few preliminary
  pieces have already gone into mainline. This chunk is finally getting
  to the meat of that stuff - infrastructure and most of the conversions
  to it.

  Some pieces are still sitting in the local branches, but the bulk of
  that stuff is here"

* tag 'pull-persistency' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (54 commits)
  d_make_discardable(): warn if given a non-persistent dentry
  kill securityfs_recursive_remove()
  convert securityfs
  get rid of kill_litter_super()
  convert rust_binderfs
  convert nfsctl
  convert rpc_pipefs
  convert hypfs
  hypfs: swich hypfs_create_u64() to returning int
  hypfs: switch hypfs_create_str() to returning int
  hypfs: don't pin dentries twice
  convert gadgetfs
  gadgetfs: switch to simple_remove_by_name()
  convert functionfs
  functionfs: switch to simple_remove_by_name()
  functionfs: fix the open/removal races
  functionfs: need to cancel ->reset_work in ->kill_sb()
  functionfs: don't bother with ffs->ref in ffs_data_{opened,closed}()
  functionfs: don't abuse ffs_data_closed() on fs shutdown
  convert selinuxfs
  ...
2025-12-05 14:36:21 -08:00
Linus Torvalds 7203ca412f Significant patch series in this merge are as follows:
- The 10 patch series "__vmalloc()/kvmalloc() and no-block support" from
   Uladzislau Rezki reworks the vmalloc() code to support non-blocking
   allocations (GFP_ATOIC, GFP_NOWAIT).
 
 - The 2 patch series "ksm: fix exec/fork inheritance" from xu xin fixes
   a rare case where the KSM MMF_VM_MERGE_ANY prctl state is not inherited
   across fork/exec.
 
 - The 4 patch series "mm/zswap: misc cleanup of code and documentations"
   from SeongJae Park does some light maintenance work on the zswap code.
 
 - The 5 patch series "mm/page_owner: add debugfs files 'show_handles'
   and 'show_stacks_handles'" from Mauricio Faria de Oliveira enhances the
   /sys/kernel/debug/page_owner debug feature.  It adds unique identifiers
   to differentiate the various stack traces so that userspace monitoring
   tools can better match stack traces over time.
 
 - The 2 patch series "mm/page_alloc: pcp->batch cleanups" from Joshua
   Hahn makes some minor alterations to the page allocator's per-cpu-pages
   feature.
 
 - The 2 patch series "Improve UFFDIO_MOVE scalability by removing
   anon_vma lock" from Lokesh Gidra addresses a scalability issue in
   userfaultfd's UFFDIO_MOVE operation.
 
 - The 2 patch series "kasan: cleanups for kasan_enabled() checks" from
   Sabyrzhan Tasbolatov performs some cleanup in the KASAN code.
 
 - The 2 patch series "drivers/base/node: fold node register and
   unregister functions" from Donet Tom cleans up the NUMA node handling
   code a little.
 
 - The 4 patch series "mm: some optimizations for prot numa" from Kefeng
   Wang provides some cleanups and small optimizations to the NUMA
   allocation hinting code.
 
 - The 5 patch series "mm/page_alloc: Batch callers of
   free_pcppages_bulk" from Joshua Hahn addresses long lock hold times at
   boot on large machines.  These were causing (harmless) softlockup
   warnings.
 
 - The 2 patch series "optimize the logic for handling dirty file folios
   during reclaim" from Baolin Wang removes some now-unnecessary work from
   page reclaim.
 
 - The 10 patch series "mm/damon: allow DAMOS auto-tuned for per-memcg
   per-node memory usage" from SeongJae Park enhances the DAMOS auto-tuning
   feature.
 
 - The 2 patch series "mm/damon: fixes for address alignment issues in
   DAMON_LRU_SORT and DAMON_RECLAIM" from Quanmin Yan fixes DAMON_LRU_SORT
   and DAMON_RECLAIM with certain userspace configuration.
 
 - The 15 patch series "expand mmap_prepare functionality, port more
   users" from Lorenzo Stoakes enhances the new(ish)
   file_operations.mmap_prepare() method and ports additional callsites
   from the old ->mmap() over to ->mmap_prepare().
 
 - The 8 patch series "Fix stale IOTLB entries for kernel address space"
   from Lu Baolu fixes a bug (and possible security issue on non-x86) in
   the IOMMU code.  In some situations the IOMMU could be left hanging onto
   a stale kernel pagetable entry.
 
 - The 4 patch series "mm/huge_memory: cleanup __split_unmapped_folio()"
   from Wei Yang cleans up and optimizes the folio splitting code.
 
 - The 5 patch series "mm, swap: misc cleanup and bugfix" from Kairui
   Song implements some cleanups and a minor fix in the swap discard code.
 
 - The 8 patch series "mm/damon: misc documentation fixups" from SeongJae
   Park does as advertised.
 
 - The 9 patch series "mm/damon: support pin-point targets removal" from
   SeongJae Park permits userspace to remove a specific monitoring target
   in the middle of the current targets list.
 
 - The 2 patch series "mm: MISC follow-up patches for linux/pgalloc.h"
   from Harry Yoo implements a couple of cleanups related to mm header file
   inclusion.
 
 - The 2 patch series "mm/swapfile.c: select swap devices of default
   priority round robin" from Baoquan He improves the selection of swap
   devices for NUMA machines.
 
 - The 3 patch series "mm: Convert memory block states (MEM_*) macros to
   enums" from Israel Batista changes the memory block labels from macros
   to enums so they will appear in kernel debug info.
 
 - The 3 patch series "ksm: perform a range-walk to jump over holes in
   break_ksm" from Pedro Demarchi Gomes addresses an inefficiency when KSM
   unmerges an address range.
 
 - The 22 patch series "mm/damon/tests: fix memory bugs in kunit tests"
   from SeongJae Park fixes leaks and unhandled malloc() failures in DAMON
   userspace unit tests.
 
 - The 2 patch series "some cleanups for pageout()" from Baolin Wang
   cleans up a couple of minor things in the page scanner's
   writeback-for-eviction code.
 
 - The 2 patch series "mm/hugetlb: refactor sysfs/sysctl interfaces" from
   Hui Zhu moves hugetlb's sysfs/sysctl handling code into a new file.
 
 - The 9 patch series "introduce VM_MAYBE_GUARD and make it sticky" from
   Lorenzo Stoakes makes the VMA guard regions available in /proc/pid/smaps
   and improves the mergeability of guarded VMAs.
 
 - The 2 patch series "mm: perform guard region install/remove under VMA
   lock" from Lorenzo Stoakes reduces mmap lock contention for callers
   performing VMA guard region operations.
 
 - The 2 patch series "vma_start_write_killable" from Matthew Wilcox
   starts work in permitting applications to be killed when they are
   waiting on a read_lock on the VMA lock.
 
 - The 11 patch series "mm/damon/tests: add more tests for online
   parameters commit" from SeongJae Park adds additional userspace testing
   of DAMON's "commit" feature.
 
 - The 9 patch series "mm/damon: misc cleanups" from SeongJae Park does
   that.
 
 - The 2 patch series "make VM_SOFTDIRTY a sticky VMA flag" from Lorenzo
   Stoakes addresses the possible loss of a VMA's VM_SOFTDIRTY flag when
   that VMA is merged with another.
 
 - The 16 patch series "mm: support device-private THP" from Balbir Singh
   introduces support for Transparent Huge Page (THP) migration in zone
   device-private memory.
 
 - The 3 patch series "Optimize folio split in memory failure" from Zi
   Yan optimizes folio split operations in the memory failure code.
 
 - The 2 patch series "mm/huge_memory: Define split_type and consolidate
   split support checks" from Wei Yang provides some more cleanups in the
   folio splitting code.
 
 - The 16 patch series "mm: remove is_swap_[pte, pmd]() + non-swap
   entries, introduce leaf entries" from Lorenzo Stoakes cleans up our
   handling of pagetable leaf entries by introducing the concept of
   'software leaf entries', of type softleaf_t.
 
 - The 4 patch series "reparent the THP split queue" from Muchun Song
   reparents the THP split queue to its parent memcg.  This is in
   preparation for addressing the long-standing "dying memcg" problem,
   wherein dead memcg's linger for too long, consuming memory resources.
 
 - The 3 patch series "unify PMD scan results and remove redundant
   cleanup" from Wei Yang does a little cleanup in the hugepage collapse
   code.
 
 - The 6 patch series "zram: introduce writeback bio batching" from
   Sergey Senozhatsky improves zram writeback efficiency by introducing
   batched bio writeback support.
 
 - The 4 patch series "memcg: cleanup the memcg stats interfaces" from
   Shakeel Butt cleans up our handling of the interrupt safety of some
   memcg stats.
 
 - The 4 patch series "make vmalloc gfp flags usage more apparent" from
   Vishal Moola cleans up vmalloc's handling of incoming GFP flags.
 
 - The 6 patch series "mm: Add soft-dirty and uffd-wp support for RISC-V"
   from Chunyan Zhang teches soft dirty and userfaultfd write protect
   tracking to use RISC-V's Svrsw60t59b extension.
 
 - The 5 patch series "mm: swap: small fixes and comment cleanups" from
   Youngjun Park fixes a small bug and cleans up some of the swap code.
 
 - The 4 patch series "initial work on making VMA flags a bitmap" from
   Lorenzo Stoakes starts work on converting the vma struct's flags to a
   bitmap, so we stop running out of them, especially on 32-bit.
 
 - The 2 patch series "mm/swapfile: fix and cleanup swap list iterations"
   from Youngjun Park addresses a possible bug in the swap discard code and
   cleans things up a little.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaTEb0wAKCRDdBJ7gKXxA
 jjfIAP94W4EkCCwNOupnChoG+YWw/JW21anXt5NN+i5svn1yugEAwzvv6A+cAFng
 o+ug/fyrfPZG7PLp2R8WFyGIP0YoBA4=
 =IUzS
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2025-12-03-21-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

  "__vmalloc()/kvmalloc() and no-block support" (Uladzislau Rezki)
     Rework the vmalloc() code to support non-blocking allocations
     (GFP_ATOIC, GFP_NOWAIT)

  "ksm: fix exec/fork inheritance" (xu xin)
     Fix a rare case where the KSM MMF_VM_MERGE_ANY prctl state is not
     inherited across fork/exec

  "mm/zswap: misc cleanup of code and documentations" (SeongJae Park)
     Some light maintenance work on the zswap code

  "mm/page_owner: add debugfs files 'show_handles' and 'show_stacks_handles'" (Mauricio Faria de Oliveira)
     Enhance the /sys/kernel/debug/page_owner debug feature by adding
     unique identifiers to differentiate the various stack traces so
     that userspace monitoring tools can better match stack traces over
     time

  "mm/page_alloc: pcp->batch cleanups" (Joshua Hahn)
     Minor alterations to the page allocator's per-cpu-pages feature

  "Improve UFFDIO_MOVE scalability by removing anon_vma lock" (Lokesh Gidra)
     Address a scalability issue in userfaultfd's UFFDIO_MOVE operation

  "kasan: cleanups for kasan_enabled() checks" (Sabyrzhan Tasbolatov)

  "drivers/base/node: fold node register and unregister functions" (Donet Tom)
     Clean up the NUMA node handling code a little

  "mm: some optimizations for prot numa" (Kefeng Wang)
     Cleanups and small optimizations to the NUMA allocation hinting
     code

  "mm/page_alloc: Batch callers of free_pcppages_bulk" (Joshua Hahn)
     Address long lock hold times at boot on large machines. These were
     causing (harmless) softlockup warnings

  "optimize the logic for handling dirty file folios during reclaim" (Baolin Wang)
     Remove some now-unnecessary work from page reclaim

  "mm/damon: allow DAMOS auto-tuned for per-memcg per-node memory usage" (SeongJae Park)
     Enhance the DAMOS auto-tuning feature

  "mm/damon: fixes for address alignment issues in DAMON_LRU_SORT and DAMON_RECLAIM" (Quanmin Yan)
     Fix DAMON_LRU_SORT and DAMON_RECLAIM with certain userspace
     configuration

  "expand mmap_prepare functionality, port more users" (Lorenzo Stoakes)
     Enhance the new(ish) file_operations.mmap_prepare() method and port
     additional callsites from the old ->mmap() over to ->mmap_prepare()

  "Fix stale IOTLB entries for kernel address space" (Lu Baolu)
     Fix a bug (and possible security issue on non-x86) in the IOMMU
     code. In some situations the IOMMU could be left hanging onto a
     stale kernel pagetable entry

  "mm/huge_memory: cleanup __split_unmapped_folio()" (Wei Yang)
     Clean up and optimize the folio splitting code

  "mm, swap: misc cleanup and bugfix" (Kairui Song)
     Some cleanups and a minor fix in the swap discard code

  "mm/damon: misc documentation fixups" (SeongJae Park)

  "mm/damon: support pin-point targets removal" (SeongJae Park)
     Permit userspace to remove a specific monitoring target in the
     middle of the current targets list

  "mm: MISC follow-up patches for linux/pgalloc.h" (Harry Yoo)
     A couple of cleanups related to mm header file inclusion

  "mm/swapfile.c: select swap devices of default priority round robin" (Baoquan He)
     improve the selection of swap devices for NUMA machines

  "mm: Convert memory block states (MEM_*) macros to enums" (Israel Batista)
     Change the memory block labels from macros to enums so they will
     appear in kernel debug info

  "ksm: perform a range-walk to jump over holes in break_ksm" (Pedro Demarchi Gomes)
     Address an inefficiency when KSM unmerges an address range

  "mm/damon/tests: fix memory bugs in kunit tests" (SeongJae Park)
     Fix leaks and unhandled malloc() failures in DAMON userspace unit
     tests

  "some cleanups for pageout()" (Baolin Wang)
     Clean up a couple of minor things in the page scanner's
     writeback-for-eviction code

  "mm/hugetlb: refactor sysfs/sysctl interfaces" (Hui Zhu)
     Move hugetlb's sysfs/sysctl handling code into a new file

  "introduce VM_MAYBE_GUARD and make it sticky" (Lorenzo Stoakes)
     Make the VMA guard regions available in /proc/pid/smaps and
     improves the mergeability of guarded VMAs

  "mm: perform guard region install/remove under VMA lock" (Lorenzo Stoakes)
     Reduce mmap lock contention for callers performing VMA guard region
     operations

  "vma_start_write_killable" (Matthew Wilcox)
     Start work on permitting applications to be killed when they are
     waiting on a read_lock on the VMA lock

  "mm/damon/tests: add more tests for online parameters commit" (SeongJae Park)
     Add additional userspace testing of DAMON's "commit" feature

  "mm/damon: misc cleanups" (SeongJae Park)

  "make VM_SOFTDIRTY a sticky VMA flag" (Lorenzo Stoakes)
     Address the possible loss of a VMA's VM_SOFTDIRTY flag when that
     VMA is merged with another

  "mm: support device-private THP" (Balbir Singh)
     Introduce support for Transparent Huge Page (THP) migration in zone
     device-private memory

  "Optimize folio split in memory failure" (Zi Yan)

  "mm/huge_memory: Define split_type and consolidate split support checks" (Wei Yang)
     Some more cleanups in the folio splitting code

  "mm: remove is_swap_[pte, pmd]() + non-swap entries, introduce leaf entries" (Lorenzo Stoakes)
     Clean up our handling of pagetable leaf entries by introducing the
     concept of 'software leaf entries', of type softleaf_t

  "reparent the THP split queue" (Muchun Song)
     Reparent the THP split queue to its parent memcg. This is in
     preparation for addressing the long-standing "dying memcg" problem,
     wherein dead memcg's linger for too long, consuming memory
     resources

  "unify PMD scan results and remove redundant cleanup" (Wei Yang)
     A little cleanup in the hugepage collapse code

  "zram: introduce writeback bio batching" (Sergey Senozhatsky)
     Improve zram writeback efficiency by introducing batched bio
     writeback support

  "memcg: cleanup the memcg stats interfaces" (Shakeel Butt)
     Clean up our handling of the interrupt safety of some memcg stats

  "make vmalloc gfp flags usage more apparent" (Vishal Moola)
     Clean up vmalloc's handling of incoming GFP flags

  "mm: Add soft-dirty and uffd-wp support for RISC-V" (Chunyan Zhang)
     Teach soft dirty and userfaultfd write protect tracking to use
     RISC-V's Svrsw60t59b extension

  "mm: swap: small fixes and comment cleanups" (Youngjun Park)
     Fix a small bug and clean up some of the swap code

  "initial work on making VMA flags a bitmap" (Lorenzo Stoakes)
     Start work on converting the vma struct's flags to a bitmap, so we
     stop running out of them, especially on 32-bit

  "mm/swapfile: fix and cleanup swap list iterations" (Youngjun Park)
     Address a possible bug in the swap discard code and clean things
     up a little

[ This merge also reverts commit ebb9aeb980 ("vfio/nvgrace-gpu:
  register device memory for poison handling") because it looks
  broken to me, I've asked for clarification   - Linus ]

* tag 'mm-stable-2025-12-03-21-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits)
  mm: fix vma_start_write_killable() signal handling
  mm/swapfile: use plist_for_each_entry in __folio_throttle_swaprate
  mm/swapfile: fix list iteration when next node is removed during discard
  fs/proc/task_mmu.c: fix make_uffd_wp_huge_pte() huge pte handling
  mm/kfence: add reboot notifier to disable KFENCE on shutdown
  memcg: remove inc/dec_lruvec_kmem_state helpers
  selftests/mm/uffd: initialize char variable to Null
  mm: fix DEBUG_RODATA_TEST indentation in Kconfig
  mm: introduce VMA flags bitmap type
  tools/testing/vma: eliminate dependency on vma->__vm_flags
  mm: simplify and rename mm flags function for clarity
  mm: declare VMA flags by bit
  zram: fix a spelling mistake
  mm/page_alloc: optimize lowmem_reserve max lookup using its semantic monotonicity
  mm/vmscan: skip increasing kswapd_failures when reclaim was boosted
  pagemap: update BUDDY flag documentation
  mm: swap: remove scan_swap_map_slots() references from comments
  mm: swap: change swap_alloc_slow() to void
  mm, swap: remove redundant comment for read_swap_cache_async
  mm, swap: use SWP_SOLIDSTATE to determine if swap is rotational
  ...
2025-12-05 13:52:43 -08:00
Maurice Hieronymus c5108c58b9 tracing: Fix typo in trace_seq.c
Fix typo "wont" to "won't".

Link: https://patch.msgid.link/20251121221835.28032-15-mhi@mailbox.org
Signed-off-by: Maurice Hieronymus <mhi@mailbox.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-12-05 15:43:41 -05:00
Maurice Hieronymus 0f17df72a7 tracing: Fix typo in trace_probe.c
Fix typo "separater" to "separator".

Link: https://patch.msgid.link/20251121221835.28032-14-mhi@mailbox.org
Signed-off-by: Maurice Hieronymus <mhi@mailbox.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-12-05 15:43:41 -05:00
Maurice Hieronymus fa3f733d97 tracing: Fix multiple typos in trace_osnoise.c
Fix multiple typos in comments:
"Anotate" -> "Annotate"
"infor" -> "info"
"timestemp" -> "timestamp"
"tread" -> "thread"
"varaibles" -> "variables"
"wast" -> "waste"

Link: https://patch.msgid.link/20251121221835.28032-13-mhi@mailbox.org
Signed-off-by: Maurice Hieronymus <mhi@mailbox.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-12-05 15:43:41 -05:00
Maurice Hieronymus 6ce5725d73 tracing: Fix multiple typos in trace_events_user.c
Fix multiple typos in comments:
"ambigious" -> "ambiguous"
"explictly" -> "explicitly"
"Uknown" -> "Unknown"

Link: https://patch.msgid.link/20251121221835.28032-12-mhi@mailbox.org
Signed-off-by: Maurice Hieronymus <mhi@mailbox.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-12-05 15:43:41 -05:00
Maurice Hieronymus 0166d3e31a tracing: Fix typo in trace_events_trigger.c
Fix typo "componenents" to "components".

Link: https://patch.msgid.link/20251121221835.28032-11-mhi@mailbox.org
Signed-off-by: Maurice Hieronymus <mhi@mailbox.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-12-05 15:43:41 -05:00
Maurice Hieronymus c29e75532e tracing: Fix typo in trace_events_hist.c
Fix typo "tigger" to "trigger".

Link: https://patch.msgid.link/20251121221835.28032-10-mhi@mailbox.org
Signed-off-by: Maurice Hieronymus <mhi@mailbox.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-12-05 15:43:40 -05:00
Maurice Hieronymus 86f320904e tracing: Fix typo in trace_events_filter.c
Fix typo "singe" to "single".

Link: https://patch.msgid.link/20251121221835.28032-9-mhi@mailbox.org
Signed-off-by: Maurice Hieronymus <mhi@mailbox.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-12-05 15:43:40 -05:00
Maurice Hieronymus d4290963d5 tracing: Fix multiple typos in trace_events.c
Fix multiple typos in comments:
"appened" -> "appended"
"paranthesis" -> "parenthesis"
"parethesis" -> "parenthesis"
"wont" -> "won't"

Link: https://patch.msgid.link/20251121221835.28032-8-mhi@mailbox.org
Signed-off-by: Maurice Hieronymus <mhi@mailbox.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-12-05 15:43:40 -05:00
Maurice Hieronymus 8d4cdbd45c tracing: Fix multiple typos in trace.c
Fix multiple typos in comments:
"alse" -> "also"
"enabed" -> "enabled"
"instane" -> "instance"
"outputing" -> "outputting"
"seperated" -> "separated"

Link: https://patch.msgid.link/20251121221835.28032-7-mhi@mailbox.org
Signed-off-by: Maurice Hieronymus <mhi@mailbox.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-12-05 15:43:40 -05:00
Maurice Hieronymus 81354f6335 tracing: Fix typo in ring_buffer_benchmark.c
Fix typo "overwite" to "overwrite".

Link: https://patch.msgid.link/20251121221835.28032-6-mhi@mailbox.org
Signed-off-by: Maurice Hieronymus <mhi@mailbox.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-12-05 15:43:40 -05:00
Maurice Hieronymus 1edb820ae9 tracing: Fix multiple typos in ring_buffer.c
Fix multiple typos in comments:
"ording" -> "ordering"
"scatch" -> "scratch"
"wont" -> "won't"

Link: https://patch.msgid.link/20251121221835.28032-5-mhi@mailbox.org
Signed-off-by: Maurice Hieronymus <mhi@mailbox.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-12-05 15:43:40 -05:00
Maurice Hieronymus 2ec7345c2d tracing: Fix typo in fprobe.c
Fix typo "funciton" to "function".

Link: https://patch.msgid.link/20251121221835.28032-4-mhi@mailbox.org
Signed-off-by: Maurice Hieronymus <mhi@mailbox.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-12-05 15:43:39 -05:00
Maurice Hieronymus 9c3f3b8fea tracing: Fix typo in fpgraph.c
Fix typo "reservered" to "reserved".

Link: https://patch.msgid.link/20251121221835.28032-3-mhi@mailbox.org
Signed-off-by: Maurice Hieronymus <mhi@mailbox.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-12-05 15:43:39 -05:00
Steven Rostedt 47ef834209 tracing: Fix fixed array of synthetic event
The commit 4d38328eb4 ("tracing: Fix synth event printk format for str
fields") replaced "%.*s" with "%s" but missed removing the number size of
the dynamic and static strings. The commit e1a453a57b ("tracing: Do not
add length to print format in synthetic events") fixed the dynamic part
but did not fix the static part. That is, with the commands:

  # echo 's:wake_lat char[] wakee; u64 delta;' >> /sys/kernel/tracing/dynamic_events
  # echo 'hist:keys=pid:ts=common_timestamp.usecs if !(common_flags & 0x18)' > /sys/kernel/tracing/events/sched/sched_waking/trigger
  # echo 'hist:keys=next_pid:delta=common_timestamp.usecs-$ts:onmatch(sched.sched_waking).trace(wake_lat,next_comm,$delta)' > /sys/kernel/tracing/events/sched/sched_switch/trigger

That caused the output of:

          <idle>-0       [001] d..5.   193.428167: wake_lat: wakee=(efault)sshd-sessiondelta=155
    sshd-session-879     [001] d..5.   193.811080: wake_lat: wakee=(efault)kworker/u34:5delta=58
          <idle>-0       [002] d..5.   193.811198: wake_lat: wakee=(efault)bashdelta=91

The commit e1a453a57b fixed the part where the synthetic event had
"char[] wakee". But if one were to replace that with a static size string:

  # echo 's:wake_lat char[16] wakee; u64 delta;' >> /sys/kernel/tracing/dynamic_events

Where "wakee" is defined as "char[16]" and not "char[]" making it a static
size, the code triggered the "(efaul)" again.

Remove the added STR_VAR_LEN_MAX size as the string is still going to be
nul terminated.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Douglas Raillard <douglas.raillard@arm.com>
Link: https://patch.msgid.link/20251204151935.5fa30355@gandalf.local.home
Fixes: e1a453a57b ("tracing: Do not add length to print format in synthetic events")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-12-05 15:38:10 -05:00
Steven Rostedt 02e7769e38 tracing: Fix enabling of tracing on file release
The trace file will pause tracing if the tracing instance has the
"pause-on-trace" option is set. This happens when the file is opened, and
it is unpaused when the file is closed. When this was first added, there
was only one user that paused tracing. On open, the check to pause was:

   if (!iter->snapshot && (tr->trace_flags & TRACE_ITER(PAUSE_ON_TRACE)))

Where if it is not the snapshot tracer and the "pause-on-trace" option is
set, then it increments a "stop_count" of the trace instance.

On close, the check is:

   if (!iter->snapshot && tr->stop_count)

That is, if it is not the snapshot buffer and it was stopped, it will
re-enable tracing.

Now there's more places that stop tracing. This means, if something else
stops tracing the tr->stop_count will be non-zero, and that means if the
trace file is closed, it will decrement the stop_count even though it
never incremented it. This causes a warning because when the user that
stopped tracing enables it again, the stop_count goes below zero.

Instead of relying on the stop_count being set to know if the close of
the trace file should enable tracing again, add a new flag to the trace
iterator. The trace iterator is unique per open of the trace file, and if
the open stops tracing set the trace iterator PAUSE flag. On close, if the
PAUSE flag is set, then re-enable it again.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251202161751.24abaaf1@gandalf.local.home
Fixes: 06e0a548ba ("tracing: Do not disable tracing when reading the trace file")
Reported-by: syzbot+ccdec3bfe0beec58a38d@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/692f44a5.a70a0220.2ea503.00c8.GAE@google.com/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-12-05 15:17:56 -05:00
Linus Torvalds ac20755937 Summary
* Move jiffies converters out of kernel/sysctl.c
 
   Moved the jiffies converters into kernel/time/jiffies.c and replaced
   the pipe-max-size proc_handler converter with a macro based version.
   This is all part of the effort to relocate non-sysctl logic out of
   kernel/sysctl.c into more relevant subsystems. No functional changes.
 
 * Generalize proc handler converter creation
 
   Removed duplicated sysctl converter logic by consolidating it in
   macros. These are used inside sysctl core as well as in pipe.c and
   jiffies.c. Converter kernel and user space pointer args are now
   automatically const qualified for the convenience of the caller. No
   functional changes.
 
 * Miscellaneous
 
   Fixed kernel-doc format warnings, removed unnecessary __user
   qualifiers, and moved the nmi_watchdog sysctl into .rodata.
 
 * Testing
 
   This series was run through sysctl selftests/kunit test suite in
   x86_64. It went into linux-next after rc2, giving it a good 4/5 weeks
   of testing.
 -----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEErkcJVyXmMSXOyyeQupfNUreWQU8FAmktuJMACgkQupfNUreW
 QU9l8Qv+Noh/wLTqBEmHCrQ8k19YCNlBHO6a10Q5bFiAiGdTAMCZ7oFzoAwAjv5y
 pLtzS75G89zP0O6wgkxTsmoDNi4MRJenOCyjyEDFYvrK+qSTm0CWs0sZCsHqX4Dg
 7M+7PVK/EbMO5509J2ae6cYS9pjfwg3EBQZ978b/FATkuhRjxOIJhIv3ZoaFjme4
 0q/xqHw+oms5CUL035BfqtkoskIiRT19DAvM/DEjc2ByaHCTGURv00XLvSDHaRer
 O0Z8nXaxOOCscLunZbC3UL+hC7tB0nPE+XSzm9ylBEM7bTxeZmtvx2G6ru0+873U
 Sp+BwpFhe0RmzBFlclkd7UPtvGlFAY2QgAfpSaiLsodkoX0mctquTgpy99LhxKej
 EEyjl9tPVrYoH4MG562bZPGrQHtV4vnR9DXYx56vYtY2Fyr1GZmWawQoMZsHe9AU
 cVw5HrfeKeHBhk9hi3ZvT9z96ns3YBmIHnYNDeMy+mF/i+cVu///GwgGuFqUqKag
 3eWcTaPh
 =DXVD
 -----END PGP SIGNATURE-----

Merge tag 'sysctl-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl

Pull sysctl updates from Joel Granados:

 - Move jiffies converters out of kernel/sysctl.c

   Move the jiffies converters into kernel/time/jiffies.c and replace
   the pipe-max-size proc_handler converter with a macro based version.
   This is all part of the effort to relocate non-sysctl logic out of
   kernel/sysctl.c into more relevant subsystems. No functional changes.

 - Generalize proc handler converter creation

   Remove duplicated sysctl converter logic by consolidating it in
   macros. These are used inside sysctl core as well as in pipe.c and
   jiffies.c. Converter kernel and user space pointer args are now
   automatically const qualified for the convenience of the caller. No
   functional changes.

 - Miscellaneous

   Fix kernel-doc format warnings, remove unnecessary __user
   qualifiers, and move the nmi_watchdog sysctl into .rodata.

 - Testing

   This series was run through sysctl selftests/kunit test suite in
   x86_64. It went into linux-next after rc2, giving it a good 4/5 weeks
   of testing.

* tag 'sysctl-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl: (21 commits)
  sysctl: Wrap do_proc_douintvec with the public function proc_douintvec_conv
  sysctl: Create pipe-max-size converter using sysctl UINT macros
  sysctl: Move proc_doulongvec_ms_jiffies_minmax to kernel/time/jiffies.c
  sysctl: Move jiffies converters to kernel/time/jiffies.c
  sysctl: Move UINT converter macros to sysctl header
  sysctl: Move INT converter macros to sysctl header
  sysctl: Allow custom converters from outside sysctl
  sysctl: remove __user qualifier from stack_erasing_sysctl buffer argument
  sysctl: Create macro for user-to-kernel uint converter
  sysctl: Add optional range checking to SYSCTL_UINT_CONV_CUSTOM
  sysctl: Create unsigned int converter using new macro
  sysctl: Add optional range checking to SYSCTL_INT_CONV_CUSTOM
  sysctl: Create integer converters with one macro
  sysctl: Create converter functions with two new macros
  sysctl: Discriminate between kernel and user converter params
  sysctl: Indicate the direction of operation with macro names
  sysctl: Remove superfluous __do_proc_* indirection
  sysctl: Remove superfluous tbl_data param from "dovec" functions
  sysctl: Replace void pointer with const pointer to ctl_table
  sysctl: fix kernel-doc format warning
  ...
2025-12-05 11:15:37 -08:00
Linus Torvalds d1d36025a6 Probes for v6.19
- fprobe: Performance enhancement of the fprobe using rhltable
   . fprobe: use rhltable for fprobe_ip_table. The fprobe IP table has
     been converted to use an rhltable for improved performance when
     dealing with a large number of probed functions.
   . Fix a suspicious RCU usage warning of the above change in the
     fprobe entry handler.
   . Remove an unused local variable of the above change.
   . Fix to initialize fprobe_ip_table in core_initcall().
 
 - fprobe: Performance optimization of fprobe by ftrace
   . fprobe: Use ftrace instead of fgraph for entry only probes. This
     avoids the unneeded overhead of fgraph stack setup.
   . Also update fprobe selftest for entry-only probe.
   . fprobe: Use ftrace only if CONFIG_DYNAMIC_FTRACE_WITH_ARGS or
     WITH_REGS is defined.
 
 - probes: Cleanup probe event subsystems.
   . uprobe/eprobe: Allocate traceprobe_parse_context per probe instead
     of each probe argument parsing. This reduce memory allocation/free
     of temporary working memory.
   . uprobes: Cleanup code using __free().
   . eprobes: Cleanup code using __free().
   . probes: Cleanup code using __free(trace_probe_log_clear) to clear
     error log automatically.
   . probes: Replace strcpy() with memcpy() in __trace_probe_log_err().
 -----BEGIN PGP SIGNATURE-----
 
 iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmkvhSsbHG1hc2FtaS5o
 aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8bWhEH/23XM5Msjy5vopB+ECZb
 iCj8SkWrQzfiCBILUqxCkZdfJHFomGPHewxvxIOWdb7evtHuy0Ypne/Uw/TMAtAh
 xvDQmu03IV2jO7h7GExsnEh0nX0upYg4IVmN0sCSSWSfgLLTWO9ICClavV9adcva
 ZR+5TdZbK+W59n+ejxA9OMDt1G+nz1Ls9Qhx9ktf7odkJzBkQGPq/heZuPbF3+6k
 Vj2IHTuqWobDDt+ekKOBRWNh9cS61ybxvsr/vmkT6s904ortP6mZa3zEYPRVOUNG
 WJ/KGJwvExTcaG/Dy2g6q8tam1Bidx9/S6klyOGXQXxvaIT1VtBc66HzAUfso6jg
 yIc=
 =w6Kq
 -----END PGP SIGNATURE-----

Merge tag 'probes-v6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull probes updates from Masami Hiramatsu:
 "fprobe performance enhancement using rhltable:
   - use rhltable for fprobe_ip_table. The fprobe IP table has been
     converted to use an rhltable for improved performance when dealing
     with a large number of probed functions
   - Fix a suspicious RCU usage warning of the above change in the
     fprobe entry handler
   - Remove an unused local variable of the above change
   - Fix to initialize fprobe_ip_table in core_initcall()

  Performance optimization of fprobe by ftrace:
   - Use ftrace instead of fgraph for entry only probes. This avoids the
     unneeded overhead of fgraph stack setup
   - Also update fprobe selftest for entry-only probe
   - fprobe: Use ftrace only if CONFIG_DYNAMIC_FTRACE_WITH_ARGS or
     WITH_REGS is defined

  Cleanup probe event subsystems:
   - Allocate traceprobe_parse_context per probe instead of each probe
     argument parsing. This reduce memory allocation/free of temporary
     working memory
   - Cleanup code using __free()
   - Replace strcpy() with memcpy() in __trace_probe_log_err()"

* tag 'probes-v6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: fprobe: use ftrace if CONFIG_DYNAMIC_FTRACE_WITH_ARGS
  lib/test_fprobe: add testcase for mixed fprobe
  tracing: fprobe: optimization for entry only case
  tracing: fprobe: Fix to init fprobe_ip_table earlier
  tracing: fprobe: Remove unused local variable
  tracing: probes: Replace strcpy() with memcpy() in __trace_probe_log_err()
  tracing: fprobe: fix suspicious rcu usage in fprobe_entry
  tracing: uprobe: eprobes: Allocate traceprobe_parse_context per probe
  tracing: uprobes: Cleanup __trace_uprobe_create() with __free()
  tracing: eprobe: Cleanup eprobe event using __free()
  tracing: probes: Use __free() for trace_probe_log
  tracing: fprobe: use rhltable for fprobe_ip_table
2025-12-05 10:55:47 -08:00
Linus Torvalds 2ba59045fb - Add helper functions for allocations
The allocation of the per CPU buffer descriptor, the buffer page
   descriptors and the buffer page data itself can be pretty ugly.
   Add some helper macros and a function to have the code that allocates
   buffer pages and such look a little cleaner.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaTL3JxQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qvDgAP9HFxPe2EqGspnY0RungWDs3yCxqlUp
 Eqz7SaI9GCXdXgD/TKiz3YjNVxZveeDU6QHWsDl4svoBzjSAsaeTkXD+OQ8=
 =siR0
 -----END PGP SIGNATURE-----

Merge tag 'trace-ringbuffer-v6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull trace ring-buffer cleanup from Steven Rostedt:

 - Add helper functions for allocations

   The allocation of the per CPU buffer descriptor, the buffer page
   descriptors and the buffer page data itself can be pretty ugly.

   Add some helper macros and a function to have the code that allocates
   buffer pages and such look a little cleaner.

* tag 'trace-ringbuffer-v6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  ring-buffer: Add helper functions for allocations
2025-12-05 10:50:24 -08:00
Linus Torvalds 0b1b4a3d8e Runtime verifier updates for v6.19:
- Adapt the ftracetest script to be run from a different folder
 
   This uses the already existing OPT_TEST_DIR but extends it further to run
   independent tests, then add an --rv flag to allow using the script for
   testing RV (mostly) independently on ftrace.
 
 - Add basic RV selftests in selftests/verification for more validations
 
   Add more validations for available/enabled monitors and reactors. This
   could have caught the bug introducing kernel panic solved above. Tests use
   ftracetest.
 
 - Convert react() function in reactor to use va_list directly
 
   Use a central helper to handle the variadic arguments. Clean up macros
   and mark functions as static.
 
 - Add lockdep annotations to reactors to have lockdep complain of errors
 
   If the reactors are called from improper context. Useful to develop new
   reactors. This highlights a warning in the panic reactor that is related
   to the printk subsystem and not to RV.
 
 - Convert core RV code to use lock guards and __free helpers
 
   This completely removes goto statements.
 
 - Fix compilation if !CONFIG_RV_REACTORS
 
   Fix the warning by keeping LTL monitor variable as always static.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaTBoVxQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qtWpAQDxPQAJQvBZ41l9q9Cis7PqGGezT4Nv
 g6Fh/ydMOlJCsQD/R0Xd5JxPmBI8FLCwCfqHo7wYKUhP8GfL/ORPEWhU2gI=
 =EEot
 -----END PGP SIGNATURE-----

Merge tag 'trace-rv-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull runtime verifier updates from Steven Rostedt:

 - Adapt the ftracetest script to be run from a different folder

   This uses the already existing OPT_TEST_DIR but extends it further to
   run independent tests, then add an --rv flag to allow using the
   script for testing RV (mostly) independently on ftrace.

 - Add basic RV selftests in selftests/verification for more validations

   Add more validations for available/enabled monitors and reactors.
   This could have caught the bug introducing kernel panic solved above.
   Tests use ftracetest.

 - Convert react() function in reactor to use va_list directly

   Use a central helper to handle the variadic arguments. Clean up
   macros and mark functions as static.

 - Add lockdep annotations to reactors to have lockdep complain of
   errors

   If the reactors are called from improper context. Useful to develop
   new reactors. This highlights a warning in the panic reactor that is
   related to the printk subsystem and not to RV.

 - Convert core RV code to use lock guards and __free helpers

   This completely removes goto statements.

 - Fix compilation if !CONFIG_RV_REACTORS

   Fix the warning by keeping LTL monitor variable as always static.

* tag 'trace-rv-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  rv: Fix compilation if !CONFIG_RV_REACTORS
  rv: Convert to use __free
  rv: Convert to use lock guard
  rv: Add explicit lockdep context for reactors
  rv: Make rv_reacting_on() static
  rv: Pass va_list to reactors
  selftests/verification: Add initial RV tests
  selftest/ftrace: Generalise ftracetest to use with RV
2025-12-05 10:17:00 -08:00
Linus Torvalds 0771cee974 ftrace fixes for v6.19:
- Fix regression of pid filtering of function graph tracer
 
   When the function graph tracer allowed multiple instances of
   graph tracing using subops, the filtering by pid broke.
 
   The ftrace_ops->private that was used for pid filtering wasn't
   updated on creation.
 
   The wrong function entry callback was used when pid filtering was
   enabled when the function graph tracer started, which meant that
   the pid filtering wasn't happening.
 
 - Remove no longer needed ftrace_trace_task()
 
   With PID filtering working via ftrace_pids_enabled() and fgraph_pid_func(),
   the coarse-grained ftrace_trace_task() check in graph_entry() is obsolete.
 
   It was only a fallback for uninitialized op->private (now fixed), and its
   removal ensures consistent PID filtering with standard function tracing.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaS90FhQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qrqMAQDbU53VhvZ6rE0pNvu0Tlk+LDCu3gxg
 F2wisWr65389OgD/VFLTVRjCZh1iY7FFWjAPGRCMbetljmMgK5vpH6XSigA=
 =VKaD
 -----END PGP SIGNATURE-----

Merge tag 'ftrace-v6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull ftrace updates from Steven Rostedt:

 - Fix regression of pid filtering of function graph tracer

   When the function graph tracer allowed multiple instances of graph
   tracing using subops, the filtering by pid broke.

   The ftrace_ops->private that was used for pid filtering wasn't
   updated on creation.

   The wrong function entry callback was used when pid filtering was
   enabled when the function graph tracer started, which meant that
   the pid filtering wasn't happening.

 - Remove no longer needed ftrace_trace_task()

   With PID filtering working via ftrace_pids_enabled() and
   fgraph_pid_func(), the coarse-grained ftrace_trace_task()
   check in graph_entry() is obsolete.

   It was only a fallback for uninitialized op->private (now fixed),
   and its removal ensures consistent PID filtering with standard
   function tracing.

* tag 'ftrace-v6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  fgraph: Remove coarse PID filtering from graph_entry()
  fgraph: Check ftrace_pids_enabled on registration for early filtering
  fgraph: Initialize ftrace_ops->private for function graph ops
2025-12-05 10:13:04 -08:00
Linus Torvalds 69c5079b49 tracing updates for v6.19:
- Merge branch shared with kprobes on extending trace options
 
   The trace options were defined by a 32 bit variable. This limits the
   tracing instances to have a total of 32 different options. As that limit
   has been hit, and more options are being added, increase the option mask
   to a 64 bit number, doubling the number of options available.
 
   As this is required for the kprobe topic branches as well as the tracing
   topic branch, a separate branch was created and merged into both.
 
 - Make trace_user_fault_read() available for the rest of tracing
 
   The function trace_user_fault_read() is used by trace_marker file read to
   allow reading user space to be done fast and without locking or
   allocations. Make this available so that the system call trace events can
   use it too.
 
 - Have system call trace events read user space values
 
   Now that the system call trace events callbacks are called in a faultable
   context, take advantage of this and read the user space buffers for
   various system calls. For example, show the path name of the openat system
   call instead of just showing the pointer to that path name in user space.
   Also show the contents of the buffer of the write system call. Several
   system call trace events are updated to make tracing into a light weight
   strace tool for all applications in the system.
 
 - Update perf system call tracing to do the same
 
 - And a config and syscall_user_buf_size file to control the size of the buffer
 
   Limit the amount of data that can be read from user space. The default
   size is 63 bytes but that can be expanded to 165 bytes.
 
 - Allow the persistent ring buffer to print system calls normally
 
   The persistent ring buffer prints trace events by their type and ignores
   the print_fmt. This is because the print_fmt may change from kernel to
   kernel. As the system call output is fixed by the system call ABI itself,
   there's no reason to limit that. This makes reading the system call events
   in the persistent ring buffer much nicer and easier to understand.
 
 - Add options to show text offset to function profiler
 
   The function profiler that counts the number of times a function is hit
   currently lists all functions by its name and offset. But this becomes
   ambiguous when there are several functions with the same name. Add a
   tracing option that changes the output to be that of _text+offset
   instead. Now a user space tool can use this information to map the
   _text+offset to the unique function it is counting.
 
 - Report bad dynamic event command
 
   If a bad command is passed to the dynamic_events file, report it properly
   in the error log.
 
 - Clean up tracer options
 
   Clean up the tracer option code a bit, by removing some useless code and
   also using switch statements instead of a series of if statements.
 
 - Have tracing options be instance specific
 
   Tracers can have their own options (function tracer, irqsoff tracer,
   function graph tracer, etc). But now that the same tracer can be enabled
   in multiple trace instances, their options are still global. The API is
   per instance, thus changing one affects other instances. This isn't even
   consistent, as the option take affect differently depending on when an
   tracer started in an instance.  Make the options for instances only affect
   the instance it is changed under.
 
 - Optimize pid_list lock contention
 
   Whenever the pid_list is read, it uses a spin lock. This happens at every
   sched switch. Taking the lock at sched switch can be removed by instead
   using a seqlock counter.
 
 - Clean up the trace trigger structures
 
   The trigger code uses two different structures to implement a single
   tigger. This was due to trying to reuse code for the two different types
   of triggers (always on trigger, and count limited trigger). But by adding
   a single field to one structure, the other structure could be absorbed
   into the first structure making he code easier to understand.
 
 - Create a bulk garbage collector for trace triggers
 
   If user space has triggers for several hundreds of events and then removes
   them, it can take several seconds to complete. This is because each
   removal calls the slow tracepoint_synchronize_unregister() that can take
   hundreds of milliseconds to complete. Instead, create a helper thread that
   will do the clean up. When a trigger is removed, it will create the
   kthread if it isn't already created, and then add the trigger to a llist.
   The kthread will take the items off the llist, call
   tracepoint_synchronize_unregister(), and then remove the items it took
   off. It will then check if there's more items to free before sleeping.
 
   This makes user space removing all these triggers to finish in less than a
   second.
 
 - Allow function tracing of some of the tracing infrastructure code
 
   Because the tracing code can cause recursion issues if it is traced by the
   function tracer the entire tracing directory disables function tracing.
   But not all of tracing causes issues if it is traced. Namely, the event
   tracing code. Add a config that enables some of the tracing code to be
   traced to help in debugging it. Note, when this is enabled, it does add
   noise to general function tracing, especially if events are enabled as
   well (which is a common case).
 
 - Add boot-time backup instance for persistent buffer
 
   The persistent ring buffer is used mostly for kernel crash analysis in the
   field. One issue is that if there's a crash, the data in the persistent
   ring buffer must be read before tracing can begin using it. This slows
   down the boot process. Once tracing starts in the persistent ring buffer,
   the old data must be freed and the addresses no longer match and old
   events can't be in the buffer with new events.
 
   Create a way to create a backup buffer that copies the persistent ring
   buffer at boot up. Then after a crash, the always on tracer can begin
   immediately as well as the normal boot process while the crash analysis
   tooling uses the backup buffer. After the backup buffer is finished being
   read, it can be removed.
 
 - Enable function graph args and return address options at the same time
 
   Currently the when reading of arguments in the function graph tracer is
   enabled, the option to record the parent function in the entry event can
   not be enabled. Update the code so that it can.
 
 - Add new struct_offset() helper macro
 
   Add a new macro that takes a pointer to a structure and a name of one of
   its members and it will return the offset of that member. This allows the
   ring buffer code to simplify the following:
 
   From:  size = struct_size(entry, buf, cnt - sizeof(entry->id));
     To:  size = struct_offset(entry, id) + cnt;
 
   There should be other simplifications that this macro can help out with as
   well.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaS9xqxQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qj6tAQD4MR1lsE3XpH09asO4CDDfhbtRSQVD
 o8bVKVihWx/j5gD/XezjqE2Q2+DO6dhnsQY6pbtNdXoKgaMuQJGA+dvPsQc=
 =HilC
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing updates from Steven Rostedt:

 - Extend tracing option mask to 64 bits

   The trace options were defined by a 32 bit variable. This limits the
   tracing instances to have a total of 32 different options. As that
   limit has been hit, and more options are being added, increase the
   option mask to a 64 bit number, doubling the number of options
   available.

   As this is required for the kprobe topic branches as well as the
   tracing topic branch, a separate branch was created and merged into
   both.

 - Make trace_user_fault_read() available for the rest of tracing

   The function trace_user_fault_read() is used by trace_marker file
   read to allow reading user space to be done fast and without locking
   or allocations. Make this available so that the system call trace
   events can use it too.

 - Have system call trace events read user space values

   Now that the system call trace events callbacks are called in a
   faultable context, take advantage of this and read the user space
   buffers for various system calls. For example, show the path name of
   the openat system call instead of just showing the pointer to that
   path name in user space. Also show the contents of the buffer of the
   write system call. Several system call trace events are updated to
   make tracing into a light weight strace tool for all applications in
   the system.

 - Update perf system call tracing to do the same

 - And a config and syscall_user_buf_size file to control the size of
   the buffer

   Limit the amount of data that can be read from user space. The
   default size is 63 bytes but that can be expanded to 165 bytes.

 - Allow the persistent ring buffer to print system calls normally

   The persistent ring buffer prints trace events by their type and
   ignores the print_fmt. This is because the print_fmt may change from
   kernel to kernel. As the system call output is fixed by the system
   call ABI itself, there's no reason to limit that. This makes reading
   the system call events in the persistent ring buffer much nicer and
   easier to understand.

 - Add options to show text offset to function profiler

   The function profiler that counts the number of times a function is
   hit currently lists all functions by its name and offset. But this
   becomes ambiguous when there are several functions with the same
   name.

   Add a tracing option that changes the output to be that of
   '_text+offset' instead. Now a user space tool can use this
   information to map the '_text+offset' to the unique function it is
   counting.

 - Report bad dynamic event command

   If a bad command is passed to the dynamic_events file, report it
   properly in the error log.

 - Clean up tracer options

   Clean up the tracer option code a bit, by removing some useless code
   and also using switch statements instead of a series of if
   statements.

 - Have tracing options be instance specific

   Tracers can have their own options (function tracer, irqsoff tracer,
   function graph tracer, etc). But now that the same tracer can be
   enabled in multiple trace instances, their options are still global.
   The API is per instance, thus changing one affects other instances.
   This isn't even consistent, as the option take affect differently
   depending on when an tracer started in an instance. Make the options
   for instances only affect the instance it is changed under.

 - Optimize pid_list lock contention

   Whenever the pid_list is read, it uses a spin lock. This happens at
   every sched switch. Taking the lock at sched switch can be removed by
   instead using a seqlock counter.

 - Clean up the trace trigger structures

   The trigger code uses two different structures to implement a single
   tigger. This was due to trying to reuse code for the two different
   types of triggers (always on trigger, and count limited trigger). But
   by adding a single field to one structure, the other structure could
   be absorbed into the first structure making he code easier to
   understand.

 - Create a bulk garbage collector for trace triggers

   If user space has triggers for several hundreds of events and then
   removes them, it can take several seconds to complete. This is
   because each removal calls tracepoint_synchronize_unregister() that
   can take hundreds of milliseconds to complete.

   Instead, create a helper thread that will do the clean up. When a
   trigger is removed, it will create the kthread if it isn't already
   created, and then add the trigger to a llist. The kthread will take
   the items off the llist, call tracepoint_synchronize_unregister(),
   and then remove the items it took off. It will then check if there's
   more items to free before sleeping.

   This makes user space removing all these triggers to finish in less
   than a second.

 - Allow function tracing of some of the tracing infrastructure code

   Because the tracing code can cause recursion issues if it is traced
   by the function tracer the entire tracing directory disables function
   tracing. But not all of tracing causes issues if it is traced.
   Namely, the event tracing code. Add a config that enables some of the
   tracing code to be traced to help in debugging it. Note, when this is
   enabled, it does add noise to general function tracing, especially if
   events are enabled as well (which is a common case).

 - Add boot-time backup instance for persistent buffer

   The persistent ring buffer is used mostly for kernel crash analysis
   in the field. One issue is that if there's a crash, the data in the
   persistent ring buffer must be read before tracing can begin using
   it. This slows down the boot process. Once tracing starts in the
   persistent ring buffer, the old data must be freed and the addresses
   no longer match and old events can't be in the buffer with new
   events.

   Create a way to create a backup buffer that copies the persistent
   ring buffer at boot up. Then after a crash, the always on tracer can
   begin immediately as well as the normal boot process while the crash
   analysis tooling uses the backup buffer. After the backup buffer is
   finished being read, it can be removed.

 - Enable function graph args and return address options at the same
   time

   Currently the when reading of arguments in the function graph tracer
   is enabled, the option to record the parent function in the entry
   event can not be enabled. Update the code so that it can.

 - Add new struct_offset() helper macro

   Add a new macro that takes a pointer to a structure and a name of one
   of its members and it will return the offset of that member. This
   allows the ring buffer code to simplify the following:

   From:  size = struct_size(entry, buf, cnt - sizeof(entry->id));
     To:  size = struct_offset(entry, id) + cnt;

   There should be other simplifications that this macro can help out
   with as well

* tag 'trace-v6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (42 commits)
  overflow: Introduce struct_offset() to get offset of member
  function_graph: Enable funcgraph-args and funcgraph-retaddr to work simultaneously
  tracing: Add boot-time backup of persistent ring buffer
  ftrace: Allow tracing of some of the tracing code
  tracing: Use strim() in trigger_process_regex() instead of skip_spaces()
  tracing: Add bulk garbage collection of freeing event_trigger_data
  tracing: Remove unneeded event_mutex lock in event_trigger_regex_release()
  tracing: Merge struct event_trigger_ops into struct event_command
  tracing: Remove get_trigger_ops() and add count_func() from trigger ops
  tracing: Show the tracer options in boot-time created instance
  ftrace: Avoid redundant initialization in register_ftrace_direct
  tracing: Remove unused variable in tracing_trace_options_show()
  fgraph: Make fgraph_no_sleep_time signed
  tracing: Convert function graph set_flags() to use a switch() statement
  tracing: Have function graph tracer option sleep-time be per instance
  tracing: Move graph-time out of function graph options
  tracing: Have function graph tracer option funcgraph-irqs be per instance
  trace/pid_list: optimize pid_list->lock contention
  tracing: Have function graph tracer define options per instance
  tracing: Have function tracer define options per instance
  ...
2025-12-05 09:51:37 -08:00
Linus Torvalds a3ebb59eee VFIO updates for v6.19-rc1
- Move libvfio selftest artifacts in preparation of more tightly
    coupled integration with KVM selftests. (David Matlack)
 
  - Fix comment typo in mtty driver. (Chu Guangqing)
 
  - Support for new hardware revision in the hisi_acc vfio-pci variant
    driver where the migration registers can now be accessed via the PF.
    When enabled for this support, the full BAR can be exposed to the
    user. (Longfang Liu)
 
  - Fix vfio cdev support for VF token passing, using the correct size
    for the kernel structure, thereby actually allowing userspace to
    provide a non-zero UUID token.  Also set the match token callback for
    the hisi_acc, fixing VF token support for this this vfio-pci variant
    driver. (Raghavendra Rao Ananta)
 
  - Introduce internal callbacks on vfio devices to simplify and
    consolidate duplicate code for generating VFIO_DEVICE_GET_REGION_INFO
    data, removing various ioctl intercepts with a more structured
    solution. (Jason Gunthorpe)
 
  - Introduce dma-buf support for vfio-pci devices, allowing MMIO regions
    to be exposed through dma-buf objects with lifecycle managed through
    move operations.  This enables low-level interactions such as a
    vfio-pci based SPDK drivers interacting directly with dma-buf capable
    RDMA devices to enable peer-to-peer operations.  IOMMUFD is also now
    able to build upon this support to fill a long standing feature gap
    versus the legacy vfio type1 IOMMU backend with an implementation of
    P2P support for VM use cases that better manages the lifecycle of the
    P2P mapping. (Leon Romanovsky, Jason Gunthorpe, Vivek Kasireddy)
 
  - Convert eventfd triggering for error and request signals to use RCU
    mechanisms in order to avoid a 3-way lockdep reported deadlock issue.
    (Alex Williamson)
 
  - Fix a 32-bit overflow introduced via dma-buf support manifesting with
    large DMA buffers. (Alex Mastro)
 
  - Convert nvgrace-gpu vfio-pci variant driver to insert mappings on
    fault rather than at mmap time.  This conversion serves both to make
    use of huge PFNMAPs but also to both avoid corrected RAS events
    during reset by now being subject to vfio-pci-core's use of
    unmap_mapping_range(), and to enable a device readiness test after
    reset. (Ankit Agrawal)
 
  - Refactoring of vfio selftests to support multi-device tests and split
    code to provide better separation between IOMMU and device objects.
    This work also enables a new test suite addition to measure parallel
    device initialization latency. (David Matlack)
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEQvbATlQL0amee4qQI5ubbjuwiyIFAmkvV3IRHGFsZXhAc2hh
 emJvdC5vcmcACgkQI5ubbjuwiyIpIQ/9GwpjLH5Vdv0v2d9mkHmZIWFpG/tr3zJa
 +spQqOjO0etASc67PtIJArT9pWib+s6O8OaG7iFrdNR65HCSsXSZbIGbMThPODfy
 DdDj1ipAqMVwcaCZT8un2N8Sktu9YpFQMvc5IoXWWYhw88vili7bBx+OTrEFV2T0
 6qQijSBdhw1TXVFHG6BGSmqmisyMepIebA6GmPWdfYu6BfoWBYMdcMjDwd1J61Q5
 DDwFRzn/Dz2Tvb1jbXiiRMRuFIuegFQii+wtd30S/cRPFZhZLWzc+drimC6oOFiQ
 qL19vQQsBPnLtGvch40HsET/AbY5w0pLCkYX5qacxP3sq27+N+KuotzCvbnVMN+H
 e2BqOCujyoce8z1Br6BzV71Lr2yzPDcc5pXTuEuuBT+J/ptOY8hfEikOj85s5Wzj
 aKsTrdDRGMrn/o11NkGSzYwFcMs9MxCX9mo98U6OkWDr0+cmPLf4LGZgpJudWg4E
 POUlzPpnzJrTlX5d+OqCdKJG0a1hTlTa2udzRa5hCDANHaZWLoAssfgSEKfV9xt1
 PzOMf0UIJmPJmFcw/OpMO72/5xp8O4WslJS0ulSm6vrAJDtutLApHZ7bJ44KniNd
 4vte+gOjyZY8ibTDKRULhXVlCDxkEnZjRBbApgI9HJD61IElOzjqohRuRx77J09B
 7c8OSLI8d1U=
 =tpee
 -----END PGP SIGNATURE-----

Merge tag 'vfio-v6.19-rc1' of https://github.com/awilliam/linux-vfio

Pull VFIO updates from Alex Williamson:

 - Move libvfio selftest artifacts in preparation of more tightly
   coupled integration with KVM selftests (David Matlack)

 - Fix comment typo in mtty driver (Chu Guangqing)

 - Support for new hardware revision in the hisi_acc vfio-pci variant
   driver where the migration registers can now be accessed via the PF.
   When enabled for this support, the full BAR can be exposed to the
   user (Longfang Liu)

 - Fix vfio cdev support for VF token passing, using the correct size
   for the kernel structure, thereby actually allowing userspace to
   provide a non-zero UUID token. Also set the match token callback for
   the hisi_acc, fixing VF token support for this this vfio-pci variant
   driver (Raghavendra Rao Ananta)

 - Introduce internal callbacks on vfio devices to simplify and
   consolidate duplicate code for generating VFIO_DEVICE_GET_REGION_INFO
   data, removing various ioctl intercepts with a more structured
   solution (Jason Gunthorpe)

 - Introduce dma-buf support for vfio-pci devices, allowing MMIO regions
   to be exposed through dma-buf objects with lifecycle managed through
   move operations. This enables low-level interactions such as a
   vfio-pci based SPDK drivers interacting directly with dma-buf capable
   RDMA devices to enable peer-to-peer operations. IOMMUFD is also now
   able to build upon this support to fill a long standing feature gap
   versus the legacy vfio type1 IOMMU backend with an implementation of
   P2P support for VM use cases that better manages the lifecycle of the
   P2P mapping (Leon Romanovsky, Jason Gunthorpe, Vivek Kasireddy)

 - Convert eventfd triggering for error and request signals to use RCU
   mechanisms in order to avoid a 3-way lockdep reported deadlock issue
   (Alex Williamson)

 - Fix a 32-bit overflow introduced via dma-buf support manifesting with
   large DMA buffers (Alex Mastro)

 - Convert nvgrace-gpu vfio-pci variant driver to insert mappings on
   fault rather than at mmap time. This conversion serves both to make
   use of huge PFNMAPs but also to both avoid corrected RAS events
   during reset by now being subject to vfio-pci-core's use of
   unmap_mapping_range(), and to enable a device readiness test after
   reset (Ankit Agrawal)

 - Refactoring of vfio selftests to support multi-device tests and split
   code to provide better separation between IOMMU and device objects.
   This work also enables a new test suite addition to measure parallel
   device initialization latency (David Matlack)

* tag 'vfio-v6.19-rc1' of https://github.com/awilliam/linux-vfio: (65 commits)
  vfio: selftests: Add vfio_pci_device_init_perf_test
  vfio: selftests: Eliminate INVALID_IOVA
  vfio: selftests: Split libvfio.h into separate header files
  vfio: selftests: Move vfio_selftests_*() helpers into libvfio.c
  vfio: selftests: Rename vfio_util.h to libvfio.h
  vfio: selftests: Stop passing device for IOMMU operations
  vfio: selftests: Move IOVA allocator into iova_allocator.c
  vfio: selftests: Move IOMMU library code into iommu.c
  vfio: selftests: Rename struct vfio_dma_region to dma_region
  vfio: selftests: Upgrade driver logging to dev_err()
  vfio: selftests: Prefix logs with device BDF where relevant
  vfio: selftests: Eliminate overly chatty logging
  vfio: selftests: Support multiple devices in the same container/iommufd
  vfio: selftests: Introduce struct iommu
  vfio: selftests: Rename struct vfio_iommu_mode to iommu_mode
  vfio: selftests: Allow passing multiple BDFs on the command line
  vfio: selftests: Split run.sh into separate scripts
  vfio: selftests: Move run.sh into scripts directory
  vfio/nvgrace-gpu: wait for the GPU mem to be ready
  vfio/nvgrace-gpu: Inform devmem unmapped after reset
  ...
2025-12-04 18:42:48 -08:00
Linus Torvalds 52206f82d9 pmdomain core:
- Allow power-off for out-of-band wakeup-capable devices
  - Drop the redundant call to dev_pm_domain_detach() for the amba bus
  - Extend the genpd governor for CPUs to account for IPIs
 
 pmdomain providers:
  - bcm: Add support for BCM2712
  - mediatek: Add support for MFlexGraphics power domains
  - mediatek: Add support for MT8196 power domains
  - qcom: Add RPMh power domain support for Kaanapali
  - rockchip: Add support for RV1126B
 
 pmdomain consumers:
  - usb: dwc3: Enable out of band wakeup for i.MX95
  - usb: chipidea: Enable out of band wakeup for i.MX95
 -----BEGIN PGP SIGNATURE-----
 
 iQJLBAABCgA1FiEEugLDXPmKSktSkQsV/iaEJXNYjCkFAmkt1HYXHHVsZi5oYW5z
 c29uQGxpbmFyby5vcmcACgkQ/iaEJXNYjCk3PQ//W24ZdZ5cqXJASrw4YihTovbR
 Pk/cVBacua32r3uBAf+GKdhAXrC7zaPbohp4tou5tpuY6I54NDJ4mpaTLUr+iZ6A
 aiKGprBrNCNMU7Mu6Af23peQNo0t+0sdsxp3wjZAMcSK0/ioOnc2R+IZhjHI9dPh
 Us7Ke/Pa1DRu78T0P2quRvIIRPv5iGep2vL8LFkQtPNS4tAU5ZHTWghVip+Q0m7x
 660WwZ5a9Bkzii/mqxFlXnXMnFjaJwi7zEPEKEMKd88cTjZXZGqScq8Ojfxto4Jx
 mdn9kPqwUR2QILTVKyBy2nJTeCB5nLvLhldi5/1nVxhqopN5ll4VwFx8CVsEPVTQ
 l76IUMzdAmztnVqs1p7qZ8dyFKdYztNRSESapumpu10/DIIOgWW4eLNk5Sk+7d0K
 1fZDL/hWV2j4gnY0wSjGht2JQ2jhDBsgkxInxvQY2a8Ik91gwyUsD7r3mCq/YwaA
 97GJHk9YO75jolFPy6PniXttzVuoJXPEcvgpBR3br1iFRQhSoGIEfFcDnZEaH9Sw
 m7nSc1Q99a6rsmq+WavB5JVPZBr2UkBhE6x+IU+MEBPwykjNj/1cZ7x4rtqzX666
 eOn1FBDQ63CLEQl3hAPp1pxy3YB0nHcxKR9CPAZjECwZTfsYgJCrRNHyZb+yhPsU
 qKKXXMaOyyowoLraGbQ=
 =Fkue
 -----END PGP SIGNATURE-----

Merge tag 'pmdomain-v6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/linux-pm

Pull pmdomain updates from Ulf Hansson:
 "pmdomain core:
   - Allow power-off for out-of-band wakeup-capable devices
   - Drop the redundant call to dev_pm_domain_detach() for the amba bus
   - Extend the genpd governor for CPUs to account for IPIs

  pmdomain providers:
   - bcm: Add support for BCM2712
   - mediatek: Add support for MFlexGraphics power domains
   - mediatek: Add support for MT8196 power domains
   - qcom: Add RPMh power domain support for Kaanapali
   - rockchip: Add support for RV1126B

  pmdomain consumers:
   - usb: dwc3: Enable out of band wakeup for i.MX95
   - usb: chipidea: Enable out of band wakeup for i.MX95"

* tag 'pmdomain-v6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/linux-pm: (26 commits)
  pmdomain: Extend the genpd governor for CPUs to account for IPIs
  smp: Introduce a helper function to check for pending IPIs
  pmdomain: mediatek: convert from clk round_rate() to determine_rate()
  amba: bus: Drop dev_pm_domain_detach() call
  pmdomain: bcm: bcm2835-power: Prepare to support BCM2712
  pmdomain: mediatek: mtk-mfg: select MAILBOX in Kconfig
  pmdomain: mediatek: Add support for MFlexGraphics
  pmdomain: mediatek: Fix build-errors
  cpuidle: psci: Replace deprecated strcpy in psci_idle_init_cpu
  pmdomain: rockchip: Add support for RV1126B
  pmdomain: mediatek: Add support for MT8196 HFRPSYS power domains
  pmdomain: mediatek: Add support for MT8196 SCPSYS power domains
  pmdomain: mediatek: Add support for secure HWCCF infra power on
  pmdomain: mediatek: Add support for Hardware Voter power domains
  pmdomain: qcom: rpmhpd: Add RPMh power domain support for Kaanapali
  usb: dwc3: imx8mp: Set out of band wakeup for i.MX95
  usb: chipidea: ci_hdrc_imx: Set out of band wakeup for i.MX95
  usb: chipidea: core: detach power domain for ci_hdrc platform device
  pmdomain: core: Allow power-off for out-of-band wakeup-capable devices
  PM: wakeup: Add out-of-band system wakeup support for devices
  ...
2025-12-04 13:50:39 -08:00
Linus Torvalds 6dfafbd029 drm-next for 6.19-rc1:
new driver:
 - Arm Ethos-U65/U85 accel driver
 
 core:
 - support the drm color pipeline in vkms/amdgfx
 - add support for drm colorop pipeline
 - add COLOR PIPELINE plane property
 - add DRM_CLIENT_CAP_PLANE_COLOR_PIPELINE
 - throttle dirty worker with vblank
 - use drm_for_each_bridge_in_chain_scoped in drm's bridge code
 - Ensure drm_client_modeset tests are enabled in UML
 - add simulated vblank interrupt - use in drivers
 - dumb buffer sizing helper
 - move freeing of drm client memory to driver
 - crtc sharpness strength property
 - stop using system_wq in scheduler/drivers
 - support emergency restore in drm-client
 
 rust:
 - make slice::as_flattened usable on all supported rustc
 - add FromBytes::from_bytes_prefix() method
 - remove redundant device ptr from Rust GEM object
 - Change how AlwaysRefCounted is implemented for GEM objects
 
 gpuvm:
 - Add deferred vm_bo cleanup to GPUVM (for rust)
 
 atomic:
 - cleanup and improve state handling interfaces
 
 buddy:
 - optimize block management
 
 dma-buf:
 - heaps: Create heap per CMA reserved location
 - improve userspace documentation
 
 dp:
 - add POST_LT_ADJ_REQ training sequence
 - DPCD dSC quirk for synaptics panamera devices
 - helpers to query branch DSC max throughput
 
 ttm:
 - Rename ttm_bo_put to ttm_bo_fini
 - allow page protection flags on risc-v
 - rework pipelined eviction fence handling
 
 amdgpu:
 - enable amdgpu by default for SI/CI dGPUs
 - enable DC by default on SI
 - refactor CIK/SI enablement
 - add ABM KMS property
 - Re-enable DM idle optimizations
 - DC Analog encoders support
 - Powerplay fixes for fiji/iceland
 - Enable DC on bonaire by default
 - HMM cleanup
 - Add new RAS framework
 - DML2.1 updates
 - YCbCr420 fixes
 - DC FP fixes
 - DMUB fixes
 - LTTPR fixes
 - DTBCLK fixes
 - DMU cursor offload handling
 - Userq validation improvements
 - Unify shutdown callback handling
 - Suspend improvements
 - Power limit code cleanup
 - SR-IOV fixes
 - AUX backlight fixes
 - DCN 3.5 fixes
 - HDMI compliance fixes
 - DCN 4.0.1 cursor updates
 - DCN interrupt fix
 - DC KMS full update improvements
 - Add additional HDCP traces
 - DCN 3.2 fixes
 - DP MST fixes
 - Add support for new SR-IOV mailbox interface
 - UQ reset support
 - HDP flush rework
 - VCE1 support
 
 amdkfd:
 - HMM cleanups
 - Relax checks on save area overallocations
 - Fix GPU mappings after prefetch
 
 radeon:
 - refactor CIK/SI enablement.
 
 xe:
 - Initial Xe3P support
 - panic support on VRAM for display
 - fix stolen size check
 - Loosen used tracking restriction
 - New SR-IOV debugfs structure and debugfs updates
 - Hide the GPU madvise flag behind a VM_BIND flag
 - Always expose VRAM provisioning data on discrete GPUs
 - Allow VRAM mappings for userptr when used with SVM
 - Allow pinning of p2p dma-buf
 - Use per-tile debugfs where appropriate
 - Add documentation for Execution Queues
 - PF improvements
 - VF migration recovery redesign work
 - User / Kernel VRAM partitioning
 - Update Tile-based messages
 - Allow configfs to disable specific GT types
 - VF provisioning and migration improvements
 - use SVM range helpers in PT layer
 - Initial CRI support
 - access VF registers using dedicated MMIO view
 - limit number of jobs per exec queue
 - add sriov_admin sysfs tree
 - more crescent island specific support
 - debugfs residency counter
 - SRIOV migration work
 - runtime registers for GFX 35
 
 i915:
 - add initial Xe3p_LPD display version 35 support
 - Enable LNL+ content adaptive sharpness filter
 - Use optimized VRR guardband
 - Enable Xe3p LT PHY
 - enable FBC support for Xe3p_LPD display
 - add display 30.02 firmware support
 - refactor SKL+ watermark latency setup
 - refactor fbdev handling
 - call i915/xe runtime PM via function pointers
 - refactor i915/xe stolen memory/display interfaces
 - use display version instead of gfx version in display code
 - extend i915_display_info with Type-C port details
 - lots of display cleanups/refactorings
 - set O_LARGEFILE in __create_shmem
 - skuip guc communication warning on reset
 - fix time conversions
 - defeature DRRS on LNL+
 - refactor intel_frontbuffer split between i915/xe/display
 - convert inteL_rom interfaces to struct drm_device
 - unify display register polling interfaces
 - aovid lock inversion when pinning to GGTT on CHV/BXT+VTD
 
 panel:
 - Add KD116N3730A08/A12, chromebook mt8189
 - JT101TM023, LQ079L1SX01,
 - GLD070WX3-SL01 MIPI DSI
 - Samsung LTL106AL0, Samsung LTL106AL01
 - Raystar RFF500F-AWH-DNN
 - Winstar WF70A8SYJHLNGA,
 - Wanchanglong w552946aaa
 - Samsung SOFEF00
 - Lenovo X13s panel.
 - ilitek-ili9881c : add rpi 5" support
 - visionx-rm69299 - add backlight support
 - edp - support AUI B116XAN02.0
 
 bridge:
 - improve ref counting
 - ti-sn65dsi86 - add support for DP mode with HPD
 - synopsis: support CEC, init timer with correct freq
 - ASL CS5263 DP-to-HDMI bridge support
 
 nova-core:
 - introduce bitfield! macro
 - introduce safe integer converters
 - GSP inits to fully booted state on Ampere
 - Use more future-proof register for GPU identification
 
 nova-drm:
 - select NOVA_CORE
 - 64-bit only
 
 nouveau:
 - improve reclocking on tegra 186+
 - add large page and compression support
 
 msm:
 - GPU:
   - Gen8 support: A840 (Kaanapali) and X2-85 (Glymur)
   - A612 support
 - MDSS:
   - Added support for Glymur and QCS8300 platforms
 - DPU:
   - Enabled Quad-Pipe support, unlocking higher resolutions support
   - Added support for Glymur platform
   - Documented DPU on QCS8300 platform as supported
 - DisplayPort:
   - Added support for Glymur platform
   - Added support lame remapping inside DP block
   - Documented DisplayPort controller on QCS8300 and SM6150/QCS615 as
     supported
 
 tegra:
 - NVJPG driver
 
 panfrost:
 - display JM contexts over debugfs
 - export JM contexts to userspace
 - improve error and job handling
 
 panthor:
 - support custom ASN_HASH for mt8196
 - support mali-G1 GPU
 - flush shmem write before mapping buffers uncached
 - make timeout per-queue instead of per-job
 
 mediatek:
 - MT8195/88 HDMIv2/DDCv2 support
 
 rockchip:
 - dsi: add support for RK3368
 
 amdxdna:
 - enhance runtime PM
 - last hardware error reading uapi
 - support firmware debug output
 - add resource and telemetry data uapi
 - preemption support
 
 imx:
 - add driver for HDMI TX Parallel audio interface
 
 ivpu:
 - add support for user-managed preemption buffer
 - add userptr support
 - update JSM firware API to 3.33.0
 - add better alloc/free warnings
 - fix page fault in unbind all bos
 - rework bind/unbind of imported buffers
 - enable MCA ECC signalling
 - split fw runtime and global memory buffers
 - add fdinfo memory statistics
 
 tidss:
 - convert to drm logging
 - logging cleanup
 
 ast:
 - refactor generation init paths
 - add per chip generation detect_tx_chip
 - set quirks for each chip model
 
 atmel-hlcdc:
 - set LCDC_ATTRE register in plane disable
 - set correct values for plane scaler
 
 solomon:
 - use drm helper for get_modes and move_valid
 
 sitronix:
 - fix output position when clearing screens
 
 qaic:
 - support dma-buf exports
 - support new firmware's READ_DATA implementation
 - sahara AIC200 image table update
 - add sysfs support
 - add coredump support
 - add uevents support
 - PM support
 
 sun4i:
 - layer refactors to decouple plane from output
 - improve DE33 support
 
 vc4:
 - switch to generic CEC helpers
 
 komeda:
 - use drm_ logging functions
 
 vkms:
 - configfs support for display configuration
 
 vgem:
 - fix fence timer deadlock
 
 etnaviv:
 - add HWDB entry for GC8000 Nano Ultra VIP r6205
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEEKbZHaGwW9KfbeusDHTzWXnEhr4FAmkv4wAACgkQDHTzWXnE
 hr7GNw/9HITakN5B4DXKZg+Rz38WjsAsSbw1HdgoQR2NMYRw4oVXx3iAVTVLGbvV
 vvMCpgWtFE85IEtg96KWynpFjSHOx7xtXe/lckZ5s+VMi00K4c+wMmvFqKwv1Aul
 9VqPTe+28xNAzk+gglOtNjd4DRVuXQIhPMQdTELork32jQCteJorod+sZnM+RNBG
 0PiiYAKc0VwsPwklHQF7+dBt1yu+8r3ZH5Fc7SjHTuf8Q45Dpayd0HjatSgt9ycn
 s8uxAAY41iDLHnj4LwyYtBFYoMH65VqOAzDeaeVdM9xA/3KbIl9of8QnBpvYWhaR
 9dnh2Ceve3SjOufOBwK2p6H0RXu8Xo3OLnmw7+u0KwgvbmeBTILp8kgNQuk9beIp
 wvkjwpFsDnfo1rKevFMz4R+3U9W3NXP1fjU2tvmnD1lfaby2o3f3AdboaYPbX/xK
 YGh79KDsuAkKH4HdMJNg/FpJWIdSKaaWoFBzZqWBZgZBEPOquR8MHUAaLkVJYK/1
 wBcJZceQpHRssldsg3bBZxGHK811J46QlvMWaID6TSGIoZ8GC6+1Zt0+bwzxMyUR
 49jwRvOWbtYJmZEcR2JF+k1KteMXAp+bXIYqx9e+69c5Guz4w7JW7OLQFVvXszZW
 aObzfrkObPpCGYDS1r85/6vT8yqwVUWtwfPvAeDH9130sHBP8Xw=
 =gpnQ
 -----END PGP SIGNATURE-----

Merge tag 'drm-next-2025-12-03' of https://gitlab.freedesktop.org/drm/kernel

Pull drm updates from Dave Airlie:
 "There was a rather late merge of a new color pipeline feature, that
  some userspace projects are blocked on, and has seen a lot of work in
  amdgpu. This should have seen some time in -next. There is additional
  support for this for Intel, that if it arrives in the next day or two
  I'll pass it on in another pull request and you can decide if you want
  to take it.

  Highlights:
   - Arm Ethos NPU accelerator driver
   - new DRM color pipeline support
   - amdgpu will now run discrete SI/CIK cards instead of radeon, which
     enables vulkan support in userspace
   - msm gets gen8 gpu support
   - initial Xe3P support in xe

  Full detail summary:

  New driver:
   - Arm Ethos-U65/U85 accel driver

  Core:
   - support the drm color pipeline in vkms/amdgfx
   - add support for drm colorop pipeline
   - add COLOR PIPELINE plane property
   - add DRM_CLIENT_CAP_PLANE_COLOR_PIPELINE
   - throttle dirty worker with vblank
   - use drm_for_each_bridge_in_chain_scoped in drm's bridge code
   - Ensure drm_client_modeset tests are enabled in UML
   - add simulated vblank interrupt - use in drivers
   - dumb buffer sizing helper
   - move freeing of drm client memory to driver
   - crtc sharpness strength property
   - stop using system_wq in scheduler/drivers
   - support emergency restore in drm-client

  Rust:
   - make slice::as_flattened usable on all supported rustc
   - add FromBytes::from_bytes_prefix() method
   - remove redundant device ptr from Rust GEM object
   - Change how AlwaysRefCounted is implemented for GEM objects

  gpuvm:
   - Add deferred vm_bo cleanup to GPUVM (for rust)

  atomic:
   - cleanup and improve state handling interfaces

  buddy:
   - optimize block management

  dma-buf:
   - heaps: Create heap per CMA reserved location
   - improve userspace documentation

  dp:
   - add POST_LT_ADJ_REQ training sequence
   - DPCD dSC quirk for synaptics panamera devices
   - helpers to query branch DSC max throughput

  ttm:
   - Rename ttm_bo_put to ttm_bo_fini
   - allow page protection flags on risc-v
   - rework pipelined eviction fence handling

  amdgpu:
   - enable amdgpu by default for SI/CI dGPUs
   - enable DC by default on SI
   - refactor CIK/SI enablement
   - add ABM KMS property
   - Re-enable DM idle optimizations
   - DC Analog encoders support
   - Powerplay fixes for fiji/iceland
   - Enable DC on bonaire by default
   - HMM cleanup
   - Add new RAS framework
   - DML2.1 updates
   - YCbCr420 fixes
   - DC FP fixes
   - DMUB fixes
   - LTTPR fixes
   - DTBCLK fixes
   - DMU cursor offload handling
   - Userq validation improvements
   - Unify shutdown callback handling
   - Suspend improvements
   - Power limit code cleanup
   - SR-IOV fixes
   - AUX backlight fixes
   - DCN 3.5 fixes
   - HDMI compliance fixes
   - DCN 4.0.1 cursor updates
   - DCN interrupt fix
   - DC KMS full update improvements
   - Add additional HDCP traces
   - DCN 3.2 fixes
   - DP MST fixes
   - Add support for new SR-IOV mailbox interface
   - UQ reset support
   - HDP flush rework
   - VCE1 support

  amdkfd:
   - HMM cleanups
   - Relax checks on save area overallocations
   - Fix GPU mappings after prefetch

  radeon:
   - refactor CIK/SI enablement

  xe:
   - Initial Xe3P support
   - panic support on VRAM for display
   - fix stolen size check
   - Loosen used tracking restriction
   - New SR-IOV debugfs structure and debugfs updates
   - Hide the GPU madvise flag behind a VM_BIND flag
   - Always expose VRAM provisioning data on discrete GPUs
   - Allow VRAM mappings for userptr when used with SVM
   - Allow pinning of p2p dma-buf
   - Use per-tile debugfs where appropriate
   - Add documentation for Execution Queues
   - PF improvements
   - VF migration recovery redesign work
   - User / Kernel VRAM partitioning
   - Update Tile-based messages
   - Allow configfs to disable specific GT types
   - VF provisioning and migration improvements
   - use SVM range helpers in PT layer
   - Initial CRI support
   - access VF registers using dedicated MMIO view
   - limit number of jobs per exec queue
   - add sriov_admin sysfs tree
   - more crescent island specific support
   - debugfs residency counter
   - SRIOV migration work
   - runtime registers for GFX 35

  i915:
   - add initial Xe3p_LPD display version 35 support
   - Enable LNL+ content adaptive sharpness filter
   - Use optimized VRR guardband
   - Enable Xe3p LT PHY
   - enable FBC support for Xe3p_LPD display
   - add display 30.02 firmware support
   - refactor SKL+ watermark latency setup
   - refactor fbdev handling
   - call i915/xe runtime PM via function pointers
   - refactor i915/xe stolen memory/display interfaces
   - use display version instead of gfx version in display code
   - extend i915_display_info with Type-C port details
   - lots of display cleanups/refactorings
   - set O_LARGEFILE in __create_shmem
   - skuip guc communication warning on reset
   - fix time conversions
   - defeature DRRS on LNL+
   - refactor intel_frontbuffer split between i915/xe/display
   - convert inteL_rom interfaces to struct drm_device
   - unify display register polling interfaces
   - aovid lock inversion when pinning to GGTT on CHV/BXT+VTD

  panel:
   - Add KD116N3730A08/A12, chromebook mt8189
   - JT101TM023, LQ079L1SX01,
   - GLD070WX3-SL01 MIPI DSI
   - Samsung LTL106AL0, Samsung LTL106AL01
   - Raystar RFF500F-AWH-DNN
   - Winstar WF70A8SYJHLNGA
   - Wanchanglong w552946aaa
   - Samsung SOFEF00
   - Lenovo X13s panel
   - ilitek-ili9881c - add rpi 5" support
   - visionx-rm69299 - add backlight support
   - edp - support AUI B116XAN02.0

  bridge:
   - improve ref counting
   - ti-sn65dsi86 - add support for DP mode with HPD
   - synopsis: support CEC, init timer with correct freq
   - ASL CS5263 DP-to-HDMI bridge support

  nova-core:
   - introduce bitfield! macro
   - introduce safe integer converters
   - GSP inits to fully booted state on Ampere
   - Use more future-proof register for GPU identification

  nova-drm:
   - select NOVA_CORE
   - 64-bit only

  nouveau:
   - improve reclocking on tegra 186+
   - add large page and compression support

  msm:
   - GPU:
      - Gen8 support: A840 (Kaanapali) and X2-85 (Glymur)
      - A612 support
   - MDSS:
      - Added support for Glymur and QCS8300 platforms
   - DPU:
      - Enabled Quad-Pipe support, unlocking higher resolutions support
      - Added support for Glymur platform
      - Documented DPU on QCS8300 platform as supported
   - DisplayPort:
      - Added support for Glymur platform
      - Added support lame remapping inside DP block
      - Documented DisplayPort controller on QCS8300 and SM6150/QCS615
        as supported

  tegra:
   - NVJPG driver

  panfrost:
   - display JM contexts over debugfs
   - export JM contexts to userspace
   - improve error and job handling

  panthor:
   - support custom ASN_HASH for mt8196
   - support mali-G1 GPU
   - flush shmem write before mapping buffers uncached
   - make timeout per-queue instead of per-job

  mediatek:
   - MT8195/88 HDMIv2/DDCv2 support

  rockchip:
   - dsi: add support for RK3368

  amdxdna:
   - enhance runtime PM
   - last hardware error reading uapi
   - support firmware debug output
   - add resource and telemetry data uapi
   - preemption support

  imx:
   - add driver for HDMI TX Parallel audio interface

  ivpu:
   - add support for user-managed preemption buffer
   - add userptr support
   - update JSM firware API to 3.33.0
   - add better alloc/free warnings
   - fix page fault in unbind all bos
   - rework bind/unbind of imported buffers
   - enable MCA ECC signalling
   - split fw runtime and global memory buffers
   - add fdinfo memory statistics

  tidss:
   - convert to drm logging
   - logging cleanup

  ast:
   - refactor generation init paths
   - add per chip generation detect_tx_chip
   - set quirks for each chip model

  atmel-hlcdc:
   - set LCDC_ATTRE register in plane disable
   - set correct values for plane scaler

  solomon:
   - use drm helper for get_modes and move_valid

  sitronix:
   - fix output position when clearing screens

  qaic:
   - support dma-buf exports
   - support new firmware's READ_DATA implementation
   - sahara AIC200 image table update
   - add sysfs support
   - add coredump support
   - add uevents support
   - PM support

  sun4i:
   - layer refactors to decouple plane from output
   - improve DE33 support

  vc4:
   - switch to generic CEC helpers

  komeda:
   - use drm_ logging functions

  vkms:
   - configfs support for display configuration

  vgem:
   - fix fence timer deadlock

  etnaviv:
   - add HWDB entry for GC8000 Nano Ultra VIP r6205"

* tag 'drm-next-2025-12-03' of https://gitlab.freedesktop.org/drm/kernel: (1869 commits)
  Revert "drm/amd: Skip power ungate during suspend for VPE"
  drm/amdgpu: use common defines for HUB faults
  drm/amdgpu/gmc12: add amdgpu_vm_handle_fault() handling
  drm/amdgpu/gmc11: add amdgpu_vm_handle_fault() handling
  drm/amdgpu: use static ids for ACP platform devs
  drm/amdgpu/sdma6: Update SDMA 6.0.3 FW version to include UMQ protected-fence fix
  drm/amdgpu: Forward VMID reservation errors
  drm/amdgpu/gmc8: Delegate VM faults to soft IRQ handler ring
  drm/amdgpu/gmc7: Delegate VM faults to soft IRQ handler ring
  drm/amdgpu/gmc6: Delegate VM faults to soft IRQ handler ring
  drm/amdgpu/gmc6: Cache VM fault info
  drm/amdgpu/gmc6: Don't print MC client as it's unknown
  drm/amdgpu/cz_ih: Enable soft IRQ handler ring
  drm/amdgpu/tonga_ih: Enable soft IRQ handler ring
  drm/amdgpu/iceland_ih: Enable soft IRQ handler ring
  drm/amdgpu/cik_ih: Enable soft IRQ handler ring
  drm/amdgpu/si_ih: Enable soft IRQ handler ring
  drm/amd/display: fix typo in display_mode_core_structs.h
  drm/amd/display: fix Smart Power OLED not working after S4
  drm/amd/display: Move RGB-type check for audio sync to DCE HW sequence
  ...
2025-12-04 08:53:30 -08:00
Linus Torvalds cc25df3e2e for-6.19/block-20251201
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmktsoMQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpuiUD/92eivL+HmOh10o8trvxajB0yuyqfSjHHrL
 g+xUbF4s9bgAg/v+Upx7sTY8jdrTcMjKov+G9T6uPvBMqVmeVdZckA1PSAKQaIX1
 Zb7nS2LnO7F6JKbwpwVrrIaqVbcz8MfGIIMbN4yNNEOMCwdIVMp4fo7trPBknJNx
 WddNSGUFlIF3NqSI8AflSS/pYnGm+McfBHXBpJAKipI3iquKKubHv+FX9kLp7Tn4
 x27ZoCWOHglIBTJXU0mmXCVsLF8b5BA8DQcGtT62azb8+l0cRTkaHY0DFAv5BvhG
 TqcjrKdmR0cGSNt+nEmFrujE3atBRl0G0kiHA80YgA1MTtYzdPaUVOUtM9k/rEem
 gpiGMDpBypdxyJAyijPSaVJdfcg0psOlYbhIR4N2wbj/dq8268h+cWzXlF1spgVt
 /7ygoaCmfMNbTy9rKThTjH+es787AVXUAXXaPHhIFsnCKUj8xQl4pT7XltmgYeWx
 1/XD1NEJeLHHog5upAVlGX3H5tbvP1nIICxbZa9mDOJX1rwxxI7/s/RucPjbNXuY
 AiaKPTfxtB9+Ihd2HrJ/76RVMkckcOBc4GIKoFfwuKDbcdLXQ5FcZCmVRoI1V9SV
 KsH7JBgihLwR9XWKE1vp9+CBNe1Qlu3K4IjG/E7CNLeuDntIBu73ihqGP/DqV6Bq
 RX1Dc0OyAQ==
 =m22w
 -----END PGP SIGNATURE-----

Merge tag 'for-6.19/block-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull block updates from Jens Axboe:

 - Fix head insertion for mq-deadline, a regression from when priority
   support was added

 - Series simplifying and improving the ublk user copy code

 - Various ublk related cleanups

 - Fixup REQ_NOWAIT handling in loop/zloop, clearing NOWAIT when the
   request is punted to a thread for handling

 - Merge and then later revert loop dio nowait support, as it ended up
   causing excessive stack usage for when the inline issue code needs to
   dip back into the full file system code

 - Improve auto integrity code, making it less deadlock prone

 - Speedup polled IO handling, but manually managing the hctx lookups

 - Fixes for blk-throttle for SSD devices

 - Small series with fixes for the S390 dasd driver

 - Add support for caching zones, avoiding unnecessary report zone
   queries

 - MD pull requests via Yu:
      - fix null-ptr-dereference regression for dm-raid0
      - fix IO hang for raid5 when array is broken with IO inflight
      - remove legacy 1s delay to speed up system shutdown
      - change maintainer's email address
      - data can be lost if array is created with different lbs devices,
        fix this problem and record lbs of the array in metadata
      - fix rcu protection for md_thread
      - fix mddev kobject lifetime regression
      - enable atomic writes for md-linear
      - some cleanups

 - bcache updates via Coly
      - remove useless discard and cache device code
      - improve usage of per-cpu workqueues

 - Reorganize the IO scheduler switching code, fixing some lockdep
   reports as well

 - Improve the block layer P2P DMA support

 - Add support to the block tracing code for zoned devices

 - Segment calculation improves, and memory alignment flexibility
   improvements

 - Set of prep and cleanups patches for ublk batching support. The
   actual batching hasn't been added yet, but helps shrink down the
   workload of getting that patchset ready for 6.20

 - Fix for how the ps3 block driver handles segments offsets

 - Improve how block plugging handles batch tag allocations

 - nbd fixes for use-after-free of the configuration on device clear/put

 - Set of improvements and fixes for zloop

 - Add Damien as maintainer of the block zoned device code handling

 - Various other fixes and cleanups

* tag 'for-6.19/block-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (162 commits)
  block/rnbd: correct all kernel-doc complaints
  blk-mq: use queue_hctx in blk_mq_map_queue_type
  md: remove legacy 1s delay in md_notify_reboot
  md/raid5: fix IO hang when array is broken with IO inflight
  md: warn about updating super block failure
  md/raid0: fix NULL pointer dereference in create_strip_zones() for dm-raid
  sbitmap: fix all kernel-doc warnings
  ublk: add helper of __ublk_fetch()
  ublk: pass const pointer to ublk_queue_is_zoned()
  ublk: refactor auto buffer register in ublk_dispatch_req()
  ublk: add `union ublk_io_buf` with improved naming
  ublk: add parameter `struct io_uring_cmd *` to ublk_prep_auto_buf_reg()
  kfifo: add kfifo_alloc_node() helper for NUMA awareness
  blk-mq: fix potential uaf for 'queue_hw_ctx'
  blk-mq: use array manage hctx map instead of xarray
  ublk: prevent invalid access with DEBUG
  s390/dasd: Use scnprintf() instead of sprintf()
  s390/dasd: Move device name formatting into separate function
  s390/dasd: Remove unnecessary debugfs_create() return checks
  s390/dasd: Fix gendisk parent after copy pair swap
  ...
2025-12-03 19:26:18 -08:00
Linus Torvalds 8f7aa3d3c7 Networking changes for 6.19.
Core & protocols
 ----------------
 
  - Replace busylock at the Tx queuing layer with a lockless list. Resulting
    in a 300% (4x) improvement on heavy TX workloads, sending twice the
    number of packets per second, for half the cpu cycles.
 
  - Allow constantly busy flows to migrate to a more suitable CPU/NIC
    queue. Normally we perform queue re-selection when flow comes out
    of idle, but under extreme circumstances the flows may be constantly
    busy. Add sysctl to allow periodic rehashing even if it'd risk packet
    reordering.
 
  - Optimize the NAPI skb cache, make it larger, use it in more paths.
 
  - Attempt returning Tx skbs to the originating CPU (like we already did
    for Rx skbs).
 
  - Various data structure layout and prefetch optimizations from Eric.
 
  - Remove ktime_get() from the recvmsg() fast path, ktime_get() is sadly
    quite expensive on recent AMD machines.
 
  - Extend threaded NAPI polling to allow the kthread busy poll for packets.
 
  - Make MPTCP use Rx backlog processing. This lowers the lock pressure,
    improving the Rx performance.
 
  - Support memcg accounting of MPTCP socket memory.
 
  - Allow admin to opt sockets out of global protocol memory accounting
    (using a sysctl or BPF-based policy). The global limits are a poor fit
    for modern container workloads, where limits are imposed using cgroups.
 
  - Improve heuristics for when to kick off AF_UNIX garbage collection.
 
  - Allow users to control TCP SACK compression, and default to 33% of RTT.
 
  - Add tcp_rcvbuf_low_rtt sysctl to let datacenter users avoid unnecessarily
    aggressive rcvbuf growth and overshot when the connection RTT is low.
 
  - Preserve skb metadata space across skb_push / skb_pull operations.
 
  - Support for IPIP encapsulation in the nftables flowtable offload.
 
  - Support appending IP interface information to ICMP messages (RFC 5837).
 
  - Support setting max record size in TLS (RFC 8449).
 
  - Remove taking rtnl_lock from RTM_GETNEIGHTBL and RTM_SETNEIGHTBL.
 
  - Use a dedicated lock (and RCU) in MPLS, instead of rtnl_lock.
 
  - Let users configure the number of write buffers in SMC.
 
  - Add new struct sockaddr_unsized for sockaddr of unknown length,
    from Kees.
 
  - Some conversions away from the crypto_ahash API, from Eric Biggers.
 
  - Some preparations for slimming down struct page.
 
  - YAML Netlink protocol spec for WireGuard.
 
  - Add a tool on top of YAML Netlink specs/lib for reporting commonly
    computed derived statistics and summarized system state.
 
 Driver API
 ----------
 
  - Add CAN XL support to the CAN Netlink interface.
 
  - Add uAPI for reporting PHY Mean Square Error (MSE) diagnostics,
    as defined by the OPEN Alliance's "Advanced diagnostic features
    for 100BASE-T1 automotive Ethernet PHYs" specification.
 
  - Add DPLL phase-adjust-gran pin attribute (and implement it in zl3073x).
 
  - Refactor xfrm_input lock to reduce contention when NIC offloads IPsec
    and performs RSS.
 
  - Add info to devlink params whether the current setting is the default
    or a user override. Allow resetting back to default.
 
  - Add standard device stats for PSP crypto offload.
 
  - Leverage DSA frame broadcast to implement simple HSR frame duplication
    for a lot of switches without dedicated HSR offload.
 
  - Add uAPI defines for 1.6Tbps link modes.
 
 Device drivers
 --------------
 
  - Add Motorcomm YT921x gigabit Ethernet switch support.
 
  - Add MUCSE driver for N500/N210 1GbE NIC series.
 
  - Convert drivers to support dedicated ops for timestamping control,
    and away from the direct IOCTL handling. While at it support GET
    operations for PHY timestamping.
 
  - Add (and convert most drivers to) a dedicated ethtool callback
    for reading the Rx ring count.
 
  - Significant refactoring efforts in the STMMAC driver, which supports
    Synopsys turn-key MAC IP integrated into a ton of SoCs.
 
  - Ethernet high-speed NICs:
    - Broadcom (bnxt):
      - support PPS in/out on all pins
    - Intel (100G, ice, idpf):
      - ice: implement standard ethtool and timestamping stats
      - i40e: support setting the max number of MAC addresses per VF
      - iavf: support RSS of GTP tunnels for 5G and LTE deployments
    - nVidia/Mellanox (mlx5):
      - reduce downtime on interface reconfiguration
      - disable being an XDP redirect target by default (same as other
        drivers) to avoid wasting resources if feature is unused
    - Meta (fbnic):
      - add support for Linux-managed PCS on 25G, 50G, and 100G links
    - Wangxun:
      - support Rx descriptor merge, and Tx head writeback
      - support Rx coalescing offload
      - support 25G SPF and 40G QSFP modules
 
  - Ethernet virtual:
    - Google (gve):
      - allow ethtool to configure rx_buf_len
      - implement XDP HW RX Timestamping support for DQ descriptor format
    - Microsoft vNIC (mana):
      - support HW link state events
      - handle hardware recovery events when probing the device
 
  - Ethernet NICs consumer, and embedded:
    - usbnet: add support for Byte Queue Limits (BQL)
    - AMD (amd-xgbe):
      - add device selftests
    - NXP (enetc):
      - add i.MX94 support
    - Broadcom integrated MACs (bcmgenet, bcmasp):
      - bcmasp: add support for PHY-based Wake-on-LAN
    - Broadcom switches (b53):
      - support port isolation
      - support BCM5389/97/98 and BCM63XX ARL formats
    - Lantiq/MaxLinear switches:
      - support bridge FDB entries on the CPU port
      - use regmap for register access
      - allow user to enable/disable learning
      - support Energy Efficient Ethernet
      - support configuring RMII clock delays
      - add tagging driver for MaxLinear GSW1xx switches
    - Synopsys (stmmac):
      - support using the HW clock in free running mode
      - add Eswin EIC7700 support
      - add Rockchip RK3506 support
      - add Altera Agilex5 support
    - Cadence (macb):
      - cleanup and consolidate descriptor and DMA address handling
      - add EyeQ5 support
    - TI:
      - icssg-prueth: support AF_XDP
    - Airoha access points:
      - add missing Ethernet stats and link state callback
      - add AN7583 support
      - support out-of-order Tx completion processing
    - Power over Ethernet:
      - pd692x0: preserve PSE configuration across reboots
      - add support for TPS23881B devices
 
  - Ethernet PHYs:
    - Open Alliance OATC14 10BASE-T1S PHY cable diagnostic support
    - Support 50G SerDes and 100G interfaces in Linux-managed PHYs
    - micrel:
      - support for non PTP SKUs of lan8814
      - enable in-band auto-negotiation on lan8814
    - realtek:
      - cable testing support on RTL8224
      - interrupt support on RTL8221B
    - motorcomm: support for PHY LEDs on YT853
    - microchip: support for LAN867X Rev.D0 PHYs w/ SQI and cable diag
    - mscc: support for PHY LED control
 
  - CAN drivers:
    - m_can: add support for optional reset and system wake up
    - remove can_change_mtu() obsoleted by core handling
    - mcp251xfd: support GPIO controller functionality
 
  - Bluetooth:
    - add initial support for PASTa
 
  - WiFi:
    - split ieee80211.h file, it's way too big
    - improvements in VHT radiotap reporting, S1G, Channel Switch
      Announcement handling, rate tracking in mesh networks
    - improve multi-radio monitor mode support, and add a cfg80211 debugfs
      interface for it
    - HT action frame handling on 6 GHz
    - initial chanctx work towards NAN
    - MU-MIMO sniffer improvements
 
  - WiFi drivers:
    - RealTek (rtw89):
      - support USB devices RTL8852AU and RTL8852CU
      - initial work for RTL8922DE
      - improved injection support
    - Intel:
      - iwlwifi: new sniffer API support
    - MediaTek (mt76):
      - WED support for >32-bit DMA
      - airoha NPU support
      - regdomain improvements
      - continued WiFi7/MLO work
    - Qualcomm/Atheros:
      - ath10k: factory test support
      - ath11k: TX power insertion support
      - ath12k: BSS color change support
      - ath12k: statistics improvements
    - brcmfmac: Acer A1 840 tablet quirk
    - rtl8xxxu: 40 MHz connection fixes/support
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmkveRQACgkQMUZtbf5S
 IrvY7A/+Nb0o4BxLHjPkAl1m3t3q2d0Y29B7SNkwnwEtxAV8EkNeZ3GWrdtDnTQY
 MYhmc7LEzvz8/lihapr7UJkcokzSASUV54hbez5jDBKC8EEoyUk8FdWDPerwlcRI
 zmCFNAVFyh9GX8i7wcrzKbDTHT5+GZLbSlGl9U5mhLsDdRlJgH7d8PJ7vWcmtLFY
 XN0paDyaeHfCl8wReWNAYx4C/I0ODOvlscpO0tnAKhB0ngJbQCKY2t6tn3rOYdif
 ZSQ5KwVRnJtQ4fYOFMOy9+FSCjVXtyrxF8KLxD+mqom2ZhmO00UpOMl09tqhq3uT
 WnvwoHUVBt6F+iITHwg5kMgIDPUq1kpUvL4S4UbVSuUm9ZKD+4KRU2ZHRBYMx+MU
 bsqmtY8/IULClUoRz+tZhltA8eb0NEqNZE2JPOFDiJHn1YiCCkFwxibhir893oM3
 sB7x65D7LQI2ty2BBGVGYnwYDPtyaxOA/s3WTwPvLEi3+Y/TGNIIrS9lBLA4U+Yr
 Gi93WQGVjttMmVyaHgXBUGmi3L52hvolm0AZ8zSRGrnIEpecjhly2KfYuaOzuxXC
 IHEQ6AFLdRh6JzafXGb/mQwGCHNmhwsY8A49i94fakWQamaL/L6A+1dyPu4LXMqi
 NwqCmlVb/LKGlfNG+V4wT27srJ+yBA2Vk3tpR1sZQQytFh0LKHI=
 =UoDR
 -----END PGP SIGNATURE-----

Merge tag 'net-next-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Jakub Kicinski:
 "Core & protocols:

   - Replace busylock at the Tx queuing layer with a lockless list.

     Resulting in a 300% (4x) improvement on heavy TX workloads, sending
     twice the number of packets per second, for half the cpu cycles.

   - Allow constantly busy flows to migrate to a more suitable CPU/NIC
     queue.

     Normally we perform queue re-selection when flow comes out of idle,
     but under extreme circumstances the flows may be constantly busy.

     Add sysctl to allow periodic rehashing even if it'd risk packet
     reordering.

   - Optimize the NAPI skb cache, make it larger, use it in more paths.

   - Attempt returning Tx skbs to the originating CPU (like we already
     did for Rx skbs).

   - Various data structure layout and prefetch optimizations from Eric.

   - Remove ktime_get() from the recvmsg() fast path, ktime_get() is
     sadly quite expensive on recent AMD machines.

   - Extend threaded NAPI polling to allow the kthread busy poll for
     packets.

   - Make MPTCP use Rx backlog processing. This lowers the lock
     pressure, improving the Rx performance.

   - Support memcg accounting of MPTCP socket memory.

   - Allow admin to opt sockets out of global protocol memory accounting
     (using a sysctl or BPF-based policy). The global limits are a poor
     fit for modern container workloads, where limits are imposed using
     cgroups.

   - Improve heuristics for when to kick off AF_UNIX garbage collection.

   - Allow users to control TCP SACK compression, and default to 33% of
     RTT.

   - Add tcp_rcvbuf_low_rtt sysctl to let datacenter users avoid
     unnecessarily aggressive rcvbuf growth and overshot when the
     connection RTT is low.

   - Preserve skb metadata space across skb_push / skb_pull operations.

   - Support for IPIP encapsulation in the nftables flowtable offload.

   - Support appending IP interface information to ICMP messages (RFC
     5837).

   - Support setting max record size in TLS (RFC 8449).

   - Remove taking rtnl_lock from RTM_GETNEIGHTBL and RTM_SETNEIGHTBL.

   - Use a dedicated lock (and RCU) in MPLS, instead of rtnl_lock.

   - Let users configure the number of write buffers in SMC.

   - Add new struct sockaddr_unsized for sockaddr of unknown length,
     from Kees.

   - Some conversions away from the crypto_ahash API, from Eric Biggers.

   - Some preparations for slimming down struct page.

   - YAML Netlink protocol spec for WireGuard.

   - Add a tool on top of YAML Netlink specs/lib for reporting commonly
     computed derived statistics and summarized system state.

  Driver API:

   - Add CAN XL support to the CAN Netlink interface.

   - Add uAPI for reporting PHY Mean Square Error (MSE) diagnostics, as
     defined by the OPEN Alliance's "Advanced diagnostic features for
     100BASE-T1 automotive Ethernet PHYs" specification.

   - Add DPLL phase-adjust-gran pin attribute (and implement it in
     zl3073x).

   - Refactor xfrm_input lock to reduce contention when NIC offloads
     IPsec and performs RSS.

   - Add info to devlink params whether the current setting is the
     default or a user override. Allow resetting back to default.

   - Add standard device stats for PSP crypto offload.

   - Leverage DSA frame broadcast to implement simple HSR frame
     duplication for a lot of switches without dedicated HSR offload.

   - Add uAPI defines for 1.6Tbps link modes.

  Device drivers:

   - Add Motorcomm YT921x gigabit Ethernet switch support.

   - Add MUCSE driver for N500/N210 1GbE NIC series.

   - Convert drivers to support dedicated ops for timestamping control,
     and away from the direct IOCTL handling. While at it support GET
     operations for PHY timestamping.

   - Add (and convert most drivers to) a dedicated ethtool callback for
     reading the Rx ring count.

   - Significant refactoring efforts in the STMMAC driver, which
     supports Synopsys turn-key MAC IP integrated into a ton of SoCs.

   - Ethernet high-speed NICs:
      - Broadcom (bnxt):
         - support PPS in/out on all pins
      - Intel (100G, ice, idpf):
         - ice: implement standard ethtool and timestamping stats
         - i40e: support setting the max number of MAC addresses per VF
         - iavf: support RSS of GTP tunnels for 5G and LTE deployments
      - nVidia/Mellanox (mlx5):
         - reduce downtime on interface reconfiguration
         - disable being an XDP redirect target by default (same as
           other drivers) to avoid wasting resources if feature is
           unused
      - Meta (fbnic):
         - add support for Linux-managed PCS on 25G, 50G, and 100G links
      - Wangxun:
         - support Rx descriptor merge, and Tx head writeback
         - support Rx coalescing offload
         - support 25G SPF and 40G QSFP modules

   - Ethernet virtual:
      - Google (gve):
         - allow ethtool to configure rx_buf_len
         - implement XDP HW RX Timestamping support for DQ descriptor
           format
      - Microsoft vNIC (mana):
         - support HW link state events
         - handle hardware recovery events when probing the device

   - Ethernet NICs consumer, and embedded:
      - usbnet: add support for Byte Queue Limits (BQL)
      - AMD (amd-xgbe):
         - add device selftests
      - NXP (enetc):
         - add i.MX94 support
      - Broadcom integrated MACs (bcmgenet, bcmasp):
         - bcmasp: add support for PHY-based Wake-on-LAN
      - Broadcom switches (b53):
         - support port isolation
         - support BCM5389/97/98 and BCM63XX ARL formats
      - Lantiq/MaxLinear switches:
         - support bridge FDB entries on the CPU port
         - use regmap for register access
         - allow user to enable/disable learning
         - support Energy Efficient Ethernet
         - support configuring RMII clock delays
         - add tagging driver for MaxLinear GSW1xx switches
      - Synopsys (stmmac):
         - support using the HW clock in free running mode
         - add Eswin EIC7700 support
         - add Rockchip RK3506 support
         - add Altera Agilex5 support
      - Cadence (macb):
         - cleanup and consolidate descriptor and DMA address handling
         - add EyeQ5 support
      - TI:
         - icssg-prueth: support AF_XDP
      - Airoha access points:
         - add missing Ethernet stats and link state callback
         - add AN7583 support
         - support out-of-order Tx completion processing
      - Power over Ethernet:
         - pd692x0: preserve PSE configuration across reboots
         - add support for TPS23881B devices

   - Ethernet PHYs:
      - Open Alliance OATC14 10BASE-T1S PHY cable diagnostic support
      - Support 50G SerDes and 100G interfaces in Linux-managed PHYs
      - micrel:
         - support for non PTP SKUs of lan8814
         - enable in-band auto-negotiation on lan8814
      - realtek:
         - cable testing support on RTL8224
         - interrupt support on RTL8221B
      - motorcomm: support for PHY LEDs on YT853
      - microchip: support for LAN867X Rev.D0 PHYs w/ SQI and cable diag
      - mscc: support for PHY LED control

   - CAN drivers:
      - m_can: add support for optional reset and system wake up
      - remove can_change_mtu() obsoleted by core handling
      - mcp251xfd: support GPIO controller functionality

   - Bluetooth:
      - add initial support for PASTa

   - WiFi:
      - split ieee80211.h file, it's way too big
      - improvements in VHT radiotap reporting, S1G, Channel Switch
        Announcement handling, rate tracking in mesh networks
      - improve multi-radio monitor mode support, and add a cfg80211
        debugfs interface for it
      - HT action frame handling on 6 GHz
      - initial chanctx work towards NAN
      - MU-MIMO sniffer improvements

   - WiFi drivers:
      - RealTek (rtw89):
         - support USB devices RTL8852AU and RTL8852CU
         - initial work for RTL8922DE
         - improved injection support
      - Intel:
         - iwlwifi: new sniffer API support
      - MediaTek (mt76):
         - WED support for >32-bit DMA
         - airoha NPU support
         - regdomain improvements
         - continued WiFi7/MLO work
      - Qualcomm/Atheros:
         - ath10k: factory test support
         - ath11k: TX power insertion support
         - ath12k: BSS color change support
         - ath12k: statistics improvements
      - brcmfmac: Acer A1 840 tablet quirk
      - rtl8xxxu: 40 MHz connection fixes/support"

* tag 'net-next-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1381 commits)
  net: page_pool: sanitise allocation order
  net: page pool: xa init with destroy on pp init
  net/mlx5e: Support XDP target xmit with dummy program
  net/mlx5e: Update XDP features in switch channels
  selftests/tc-testing: Test CAKE scheduler when enqueue drops packets
  net/sched: sch_cake: Fix incorrect qlen reduction in cake_drop
  wireguard: netlink: generate netlink code
  wireguard: uapi: generate header with ynl-gen
  wireguard: uapi: move flag enums
  wireguard: uapi: move enum wg_cmd
  wireguard: netlink: add YNL specification
  selftests: drv-net: Fix tolerance calculation in devlink_rate_tc_bw.py
  selftests: drv-net: Fix and clarify TC bandwidth split in devlink_rate_tc_bw.py
  selftests: drv-net: Set shell=True for sysfs writes in devlink_rate_tc_bw.py
  selftests: drv-net: Use Iperf3Runner in devlink_rate_tc_bw.py
  selftests: drv-net: introduce Iperf3Runner for measurement use cases
  selftests: drv-net: Add devlink_rate_tc_bw.py to TEST_PROGS
  net: ps3_gelic_net: Use napi_alloc_skb() and napi_gro_receive()
  Documentation: net: dsa: mention simple HSR offload helpers
  Documentation: net: dsa: mention availability of RedBox
  ...
2025-12-03 17:24:33 -08:00
Linus Torvalds 015e7b0b0e bpf-next-6.19
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmktzC4ACgkQ6rmadz2v
 bTpA1w/+PZ45N3y6O+NQVIpBlpnHG7DEMK7Lw19On0xVLwH+XPHz6J5PEfzjyJR1
 SCbsV30qkJ1YCtgRHHf+ZCuWPWm58hY8dXYwSDyjNavdQyVGOdf17aBu9pvH45NW
 K20OhwQHpCHWIfDlijjPkDdiHnYf5S7Xy6ctt/3ztF0pMDHIaghGxJymG4wULcDT
 iLKnT37kwO8b2ihmw/HbcZPQYMWfHRye7X009K+wCv0dnhJ6q/Ny1m+Pg4kF92e6
 ON/RY26ep2dq7LpaNWa1rI1yOgFlI7uUlVojqrAuAb+xrg+64wUDBxeijvE37EN1
 s/+PuEKAR6xwz1dbY2cWAI0D633saz24UdV6kCBW9HrjHKVRQ7ZSsBF9ENkS4DTK
 nowx4wOe1ZHc/6YgTktZp9LEn/0YrmQtFxjqEAJiYUgD18FrBrSjmhHpBiL+HghP
 sTqy41qDQGoKtg3bRu42Co9wmNeeLsnxT8NQExCmTYQ4ufpdA/VMQux9cBVX3GBq
 EchJb465+AcvvCJUiKbnHLxDsHCQz1YYytz3RqyFLgGDFZnHOE0FjwPJmM8I5kkK
 gvDB3ZYdO3Halm8BZfZZBnKv5uK7myuAWwqRLgMRanuZcRgmIV1oUP5EP88HdH75
 fB20vZSVcfzB17SLyhiM20ivEWodJa9VCLEw9WDOmDoml+33Pks=
 =kaCJ
 -----END PGP SIGNATURE-----

Merge tag 'bpf-next-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Pull bpf updates from Alexei Starovoitov:

 - Convert selftests/bpf/test_tc_edt and test_tc_tunnel from .sh to
   test_progs runner (Alexis Lothoré)

 - Convert selftests/bpf/test_xsk to test_progs runner (Bastien
   Curutchet)

 - Replace bpf memory allocator with kmalloc_nolock() in
   bpf_local_storage (Amery Hung), and in bpf streams and range tree
   (Puranjay Mohan)

 - Introduce support for indirect jumps in BPF verifier and x86 JIT
   (Anton Protopopov) and arm64 JIT (Puranjay Mohan)

 - Remove runqslower bpf tool (Hoyeon Lee)

 - Fix corner cases in the verifier to close several syzbot reports
   (Eduard Zingerman, KaFai Wan)

 - Several improvements in deadlock detection in rqspinlock (Kumar
   Kartikeya Dwivedi)

 - Implement "jmp" mode for BPF trampoline and corresponding
   DYNAMIC_FTRACE_WITH_JMP. It improves "fexit" program type performance
   from 80 M/s to 136 M/s. With Steven's Ack. (Menglong Dong)

 - Add ability to test non-linear skbs in BPF_PROG_TEST_RUN (Paul
   Chaignon)

 - Do not let BPF_PROG_TEST_RUN emit invalid GSO types to stack (Daniel
   Borkmann)

 - Generalize buildid reader into bpf_dynptr (Mykyta Yatsenko)

 - Optimize bpf_map_update_elem() for map-in-map types (Ritesh
   Oedayrajsingh Varma)

 - Introduce overwrite mode for BPF ring buffer (Xu Kuohai)

* tag 'bpf-next-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (169 commits)
  bpf: optimize bpf_map_update_elem() for map-in-map types
  bpf: make kprobe_multi_link_prog_run always_inline
  selftests/bpf: do not hardcode target rate in test_tc_edt BPF program
  selftests/bpf: remove test_tc_edt.sh
  selftests/bpf: integrate test_tc_edt into test_progs
  selftests/bpf: rename test_tc_edt.bpf.c section to expose program type
  selftests/bpf: Add success stats to rqspinlock stress test
  rqspinlock: Precede non-head waiter queueing with AA check
  rqspinlock: Disable spinning for trylock fallback
  rqspinlock: Use trylock fallback when per-CPU rqnode is busy
  rqspinlock: Perform AA checks immediately
  rqspinlock: Enclose lock/unlock within lock entry acquisitions
  bpf: Remove runqslower tool
  selftests/bpf: Remove usage of lsm/file_alloc_security in selftest
  bpf: Disable file_alloc_security hook
  bpf: check for insn arrays in check_ptr_alignment
  bpf: force BPF_F_RDONLY_PROG on insn array creation
  bpf: Fix exclusive map memory leak
  selftests/bpf: Make CS length configurable for rqspinlock stress test
  selftests/bpf: Add lock wait time stats to rqspinlock stress test
  ...
2025-12-03 16:54:54 -08:00
Linus Torvalds 784faa8eca Rust changes for v6.19
Toolchain and infrastructure:
 
  - Add support for 'syn'.
 
      Syn is a parsing library for parsing a stream of Rust tokens into a
      syntax tree of Rust source code.
 
      Currently this library is geared toward use in Rust procedural
      macros, but contains some APIs that may be useful more generally.
 
    'syn' allows us to greatly simplify writing complex macros such as
    'pin-init' (Benno has already prepared the 'syn'-based version). We
    will use it in the 'macros' crate too.
 
    'syn' is the most downloaded Rust crate (according to crates.io), and
    it is also used by the Rust compiler itself. While the amount of code
    is substantial, there should not be many updates needed for these
    crates, and even if there are, they should not be too big, e.g. +7k
    -3k lines across the 3 crates in the last year.
 
    'syn' requires two smaller dependencies: 'quote' and 'proc-macro2'.
    I only modified their code to remove a third dependency
    ('unicode-ident') and to add the SPDX identifiers. The code can be
    easily verified to exactly match upstream with the provided scripts.
 
    They are all licensed under "Apache-2.0 OR MIT", like the other
    vendored 'alloc' crate we had for a while.
 
    Please see the merge commit with the cover letter for more context.
 
  - Allow 'unreachable_pub' and 'clippy::disallowed_names' for doctests.
 
    Examples (i.e. doctests) may want to do things like show public items
    and use names such as 'foo'.
 
    Nevertheless, we still try to keep examples as close to real code as
    possible (this is part of why running Clippy on doctests is important
    for us, e.g. for safety comments, which userspace Rust does not
    support yet but we are stricter).
 
 'kernel' crate:
 
  - Replace our custom 'CStr' type with 'core::ffi::CStr'.
 
    Using the standard library type reduces our custom code footprint,
    and we retain needed custom functionality through an extension trait
    and a new 'fmt!' macro which replaces the previous 'core' import.
 
    This started in 6.17 and continued in 6.18, and we finally land the
    replacement now. This required quite some stamina from Tamir, who
    split the changes in steps to prepare for the flag day change here.
 
  - Replace 'kernel::c_str!' with C string literals.
 
    C string literals were added in Rust 1.77, which produce '&CStr's
    (the 'core' one), so now we can write:
 
        c"hi"
 
    instead of:
 
        c_str!("hi")
 
  - Add 'num' module for numerical features.
 
    It includes the 'Integer' trait, implemented for all primitive
    integer types.
 
    It also includes the 'Bounded' integer wrapping type: an integer
    value that requires only the 'N' less significant bits of the wrapped
    type to be encoded:
 
        // An unsigned 8-bit integer, of which only the 4 LSBs are used.
        let v = Bounded::<u8, 4>:🆕:<15>();
        assert_eq!(v.get(), 15);
 
    'Bounded' is useful to e.g. enforce guarantees when working with
    bitfields that have an arbitrary number of bits.
 
    Values can be constructed from simple non-constant expressions or,
    for more complex ones, validated at runtime.
 
    'Bounded' also comes with comparison and arithmetic operations (with
    both their backing type and other 'Bounded's with a compatible
    backing type), casts to change the backing type, extending/shrinking
    and infallible/fallible conversions from/to primitives as applicable.
 
  - 'rbtree' module: add immutable cursor ('Cursor').
 
    It enables to use just an immutable tree reference where appropriate.
    The existing fully-featured mutable cursor is renamed to 'CursorMut'.
 
 kallsyms:
 
  - Fix wrong "big" kernel symbol type read from procfs.
 
 'pin-init' crate:
 
  - A couple minor fixes (Benno asked me to pick these patches up for
    him this cycle).
 
 Documentation:
 
  - Quick Start guide: add Debian 13 (Trixie).
 
    Debian Stable is now able to build Linux, since Debian 13 (released
    2025-08-09) packages Rust 1.85.0, which is recent enough.
 
    We are planning to propose that the minimum supported Rust version in
    Linux follows Debian Stable releases, with Debian 13 being the first
    one we upgrade to, i.e. Rust 1.85.
 
 MAINTAINERS:
 
  - Add entry for the new 'num' module.
 
  - Remove Alex as Rust maintainer: he hasn't had the time to contribute
    for a few years now, so it is a no-op change in practice.
 
 And a few other cleanups and improvements.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEPjU5OPd5QIZ9jqqOGXyLc2htIW0FAmks0WkACgkQGXyLc2ht
 IW3QQg/+LpBmrz0ZSKH24kcX3x/hpfA2Erd4cmn+qjXev9RSAM1bt9xf3dsAhhFd
 BStUpf0aglHOSEuWAvNHqEb5+yu+6qy6EFXXqH0ASexHK93t77Jbztjtf3SMlykT
 N/lSJ+LWw2WiRT0NRWoTfKaEWzZQ8j9fi9Jb/IGNZGdNMryisVUYWqzLwNupPuK+
 RMcEitHdO2NWjyodk2GGRyYQ7+XxQgbXZoxtgeubPSrrmGuGTXV42RlQKC2KHPx3
 gz6CwcO3Xd0bGHHSgP32QDtGRJtniO8iXBKxiooT+ys+M83fTKbwNrIrW3tHdheY
 765qsd/NvUmAkcgTCoLqj5biU6LCsepyimNg1vf4pYFohBoTaGeN+UqzbXBrSjy2
 pmrgxwMRVHsYz+zoSKAVKJl7ASba5BXFdI4Whgfqwwc9So/X7uyujIYXGbRoznCV
 W5vu7OUboBy26NvcsPrf6BqWcsJEpGV/M4z2UBRjAoJTRGQMcm/ckuo/GfYm3yW+
 bUW62UmVCdY5crpo7XPH/G4ZGBR/k3p9dLVt8OJxEoTlfw4KDE5BszJoXmejZqdi
 9LEMhzTWwoFp9NspQuEGdYdfGRlfG6XXqrwGZtQI+dlc4RvFEgBBu2Lxotq+Ods0
 EfCVCJQjWmyCodVdJ/QqbCRFuXtOFLr/hPdWnvlrRxVkPtF2CDw=
 =9nM+
 -----END PGP SIGNATURE-----

Merge tag 'rust-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux

Pull Rust updates from Miguel Ojeda:
 "Toolchain and infrastructure:

   - Add support for 'syn'.

     Syn is a parsing library for parsing a stream of Rust tokens into a
     syntax tree of Rust source code.

     Currently this library is geared toward use in Rust procedural
     macros, but contains some APIs that may be useful more generally.

     'syn' allows us to greatly simplify writing complex macros such as
     'pin-init' (Benno has already prepared the 'syn'-based version). We
     will use it in the 'macros' crate too.

     'syn' is the most downloaded Rust crate (according to crates.io),
     and it is also used by the Rust compiler itself. While the amount
     of code is substantial, there should not be many updates needed for
     these crates, and even if there are, they should not be too big,
     e.g. +7k -3k lines across the 3 crates in the last year.

     'syn' requires two smaller dependencies: 'quote' and 'proc-macro2'.
     I only modified their code to remove a third dependency
     ('unicode-ident') and to add the SPDX identifiers. The code can be
     easily verified to exactly match upstream with the provided
     scripts.

     They are all licensed under "Apache-2.0 OR MIT", like the other
     vendored 'alloc' crate we had for a while.

     Please see the merge commit with the cover letter for more context.

   - Allow 'unreachable_pub' and 'clippy::disallowed_names' for
     doctests.

     Examples (i.e. doctests) may want to do things like show public
     items and use names such as 'foo'.

     Nevertheless, we still try to keep examples as close to real code
     as possible (this is part of why running Clippy on doctests is
     important for us, e.g. for safety comments, which userspace Rust
     does not support yet but we are stricter).

  'kernel' crate:

   - Replace our custom 'CStr' type with 'core::ffi::CStr'.

     Using the standard library type reduces our custom code footprint,
     and we retain needed custom functionality through an extension
     trait and a new 'fmt!' macro which replaces the previous 'core'
     import.

     This started in 6.17 and continued in 6.18, and we finally land the
     replacement now. This required quite some stamina from Tamir, who
     split the changes in steps to prepare for the flag day change here.

   - Replace 'kernel::c_str!' with C string literals.

     C string literals were added in Rust 1.77, which produce '&CStr's
     (the 'core' one), so now we can write:

         c"hi"

     instead of:

         c_str!("hi")

   - Add 'num' module for numerical features.

     It includes the 'Integer' trait, implemented for all primitive
     integer types.

     It also includes the 'Bounded' integer wrapping type: an integer
     value that requires only the 'N' least significant bits of the
     wrapped type to be encoded:

         // An unsigned 8-bit integer, of which only the 4 LSBs are used.
         let v = Bounded::<u8, 4>:🆕:<15>();
         assert_eq!(v.get(), 15);

     'Bounded' is useful to e.g. enforce guarantees when working with
     bitfields that have an arbitrary number of bits.

     Values can also be constructed from simple non-constant expressions
     or, for more complex ones, validated at runtime.

     'Bounded' also comes with comparison and arithmetic operations
     (with both their backing type and other 'Bounded's with a
     compatible backing type), casts to change the backing type,
     extending/shrinking and infallible/fallible conversions from/to
     primitives as applicable.

   - 'rbtree' module: add immutable cursor ('Cursor').

     It enables to use just an immutable tree reference where
     appropriate. The existing fully-featured mutable cursor is renamed
     to 'CursorMut'.

  kallsyms:

   - Fix wrong "big" kernel symbol type read from procfs.

  'pin-init' crate:

   - A couple minor fixes (Benno asked me to pick these patches up for
     him this cycle).

  Documentation:

   - Quick Start guide: add Debian 13 (Trixie).

     Debian Stable is now able to build Linux, since Debian 13 (released
     2025-08-09) packages Rust 1.85.0, which is recent enough.

     We are planning to propose that the minimum supported Rust version
     in Linux follows Debian Stable releases, with Debian 13 being the
     first one we upgrade to, i.e. Rust 1.85.

  MAINTAINERS:

   - Add entry for the new 'num' module.

   - Remove Alex as Rust maintainer: he hasn't had the time to
     contribute for a few years now, so it is a no-op change in
     practice.

  And a few other cleanups and improvements"

* tag 'rust-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux: (53 commits)
  rust: macros: support `proc-macro2`, `quote` and `syn`
  rust: syn: enable support in kbuild
  rust: syn: add `README.md`
  rust: syn: remove `unicode-ident` dependency
  rust: syn: add SPDX License Identifiers
  rust: syn: import crate
  rust: quote: enable support in kbuild
  rust: quote: add `README.md`
  rust: quote: add SPDX License Identifiers
  rust: quote: import crate
  rust: proc-macro2: enable support in kbuild
  rust: proc-macro2: add `README.md`
  rust: proc-macro2: remove `unicode_ident` dependency
  rust: proc-macro2: add SPDX License Identifiers
  rust: proc-macro2: import crate
  rust: kbuild: support using libraries in `rustc_procmacro`
  rust: kbuild: support skipping flags in `rustc_test_library`
  rust: kbuild: add proc macro library support
  rust: kbuild: simplify `--cfg` handling
  rust: kbuild: introduce `core-flags` and `core-skip_flags`
  ...
2025-12-03 14:16:49 -08:00
Linus Torvalds 51ab33fc0a livepatching changes for 6.19
-----BEGIN PGP SIGNATURE-----
 
 iQJPBAABCAA5FiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAmkumVobFIAAAAAABAAO
 bWFudTIsMi41KzEuMTEsMiwyAAoJEFKgDEdIgJTypHMQALImoj39UZiLLVMn2d4P
 5droOQw+qMCfsVVY5m9Rqw3vrCsSds+7eC5Qa2GaXR480l4ZeMGyoYCr+2gDWcM3
 vwTWuj8UuUOxQlMXubIFyL9r/SUtYSiqfAPK5tGtzHuNNUEHWGed9r5RrOGKZOnj
 SDe6ZL5G2Kekuua1mtSZhw8qpW/nK2IuFTRPxN6O243dUGtX0wkzWr1tVRpDDzoT
 UNWZ2xoE1lzb/EAviyuSgSIc5YsT50bM+fMzOUHsY/E9u8fI42N1z5B+P3ZlTvGx
 Z5nJ79ZMtbF3/4o6OaUqvR9Y4zZ73TtkMoV4sjrjYg28q+n4/tqVlCsjy0Hys/dP
 MlaKbLSI54+RldTWx02Yu0jcn2Artua3N0bURyEenqkhDLW9RXVfx/h29QQYPpE9
 7FSlpUAQnFzFbGUGJcB02oClkxUSUcOuzVqC0n30wfdo+bVzDn19rr+Vi2bHe8TD
 MsASva96n2FhRVPkm8eBxnxpaYQj9Q6G1/7B1F30Et7PLzwD9qYUvXPXceyPATnd
 r3hdUiKgkPvUaDhApJRHwv/ZGnNDKujGBDc/YgBQrl3cPfG0ShGzJzWz892JjX0+
 NB+O+eabHVs6tFRFNvxS2fu/ROZxrGfBzZUKoaZjZh3JEgOhlLKVlL7aRqtctFvP
 Dz0QYiOqlkjho7FykCIEKXxQ
 =6o9X
 -----END PGP SIGNATURE-----

Merge tag 'livepatching-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching

Pull livepatching updates from Petr Mladek:

 - Support both paths where tracefs is typically mounted in selftests

 - Make old_sympos 0 and 1 equal. They both are valid when there is only
   one symbol with the given name.

* tag 'livepatching-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching:
  selftests: livepatch: use canonical ftrace path
  livepatch: Match old_sympos 0 and 1 in klp_find_func()
2025-12-03 13:46:48 -08:00
Linus Torvalds 02baaa67d9 sched_ext: Changes for v6.19
- Improve recovery from misbehaving BPF schedulers. When a scheduler puts many
   tasks with varying affinity restrictions on a shared DSQ, CPUs scanning
   through tasks they cannot run can overwhelm the system, causing lockups.
   Bypass mode now uses per-CPU DSQs with a load balancer to avoid this, and
   hooks into the hardlockup detector to attempt recovery. Add scx_cpu0 example
   scheduler to demonstrate this scenario.
 
 - Add lockless peek operation for DSQs to reduce lock contention for schedulers
   that need to query queue state during load balancing.
 
 - Allow scx_bpf_reenqueue_local() to be called from anywhere in preparation for
   deprecating cpu_acquire/release() callbacks in favor of generic BPF hooks.
 
 - Prepare for hierarchical scheduler support: add scx_bpf_task_set_slice() and
   scx_bpf_task_set_dsq_vtime() kfuncs, make scx_bpf_dsq_insert*() return bool,
   and wrap kfunc args in structs for future aux__prog parameter.
 
 - Implement cgroup_set_idle() callback to notify BPF schedulers when a cgroup's
   idle state changes.
 
 - Fix migration tasks being incorrectly downgraded from stop_sched_class to
   rt_sched_class across sched_ext enable/disable. Applied late as the fix is
   low risk and the bug subtle but needs stable backporting.
 
 - Various fixes and cleanups including cgroup exit ordering, SCX_KICK_WAIT
   reliability, and backward compatibility improvements.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaS4h1A4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGe/MAP9EZ0pLiTpmMtt6mI/11Fmi+aWfL84j1zt13cz9
 W4vb4gEA9eVEH6n9xyC4nhcOk9AQwSDuCWMOzLsnhW8TbEHVTww=
 =8W/B
 -----END PGP SIGNATURE-----

Merge tag 'sched_ext-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext updates from Tejun Heo:

 - Improve recovery from misbehaving BPF schedulers.

   When a scheduler puts many tasks with varying affinity restrictions
   on a shared DSQ, CPUs scanning through tasks they cannot run can
   overwhelm the system, causing lockups.

   Bypass mode now uses per-CPU DSQs with a load balancer to avoid this,
   and hooks into the hardlockup detector to attempt recovery.

   Add scx_cpu0 example scheduler to demonstrate this scenario.

 - Add lockless peek operation for DSQs to reduce lock contention for
   schedulers that need to query queue state during load balancing.

 - Allow scx_bpf_reenqueue_local() to be called from anywhere in
   preparation for deprecating cpu_acquire/release() callbacks in favor
   of generic BPF hooks.

 - Prepare for hierarchical scheduler support: add
   scx_bpf_task_set_slice() and scx_bpf_task_set_dsq_vtime() kfuncs,
   make scx_bpf_dsq_insert*() return bool, and wrap kfunc args in
   structs for future aux__prog parameter.

 - Implement cgroup_set_idle() callback to notify BPF schedulers when a
   cgroup's idle state changes.

 - Fix migration tasks being incorrectly downgraded from
   stop_sched_class to rt_sched_class across sched_ext enable/disable.
   Applied late as the fix is low risk and the bug subtle but needs
   stable backporting.

 - Various fixes and cleanups including cgroup exit ordering,
   SCX_KICK_WAIT reliability, and backward compatibility improvements.

* tag 'sched_ext-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (44 commits)
  sched_ext: Fix incorrect sched_class settings for per-cpu migration tasks
  sched_ext: tools: Removing duplicate targets during non-cross compilation
  sched_ext: Use kvfree_rcu() to release per-cpu ksyncs object
  sched_ext: Pass locked CPU parameter to scx_hardlockup() and add docs
  sched_ext: Update comments replacing breather with aborting mechanism
  sched_ext: Implement load balancer for bypass mode
  sched_ext: Factor out abbreviated dispatch dequeue into dispatch_dequeue_locked()
  sched_ext: Factor out scx_dsq_list_node cursor initialization into INIT_DSQ_LIST_CURSOR
  sched_ext: Add scx_cpu0 example scheduler
  sched_ext: Hook up hardlockup detector
  sched_ext: Make handle_lockup() propagate scx_verror() result
  sched_ext: Refactor lockup handlers into handle_lockup()
  sched_ext: Make scx_exit() and scx_vexit() return bool
  sched_ext: Exit dispatch and move operations immediately when aborting
  sched_ext: Simplify breather mechanism with scx_aborting flag
  sched_ext: Use per-CPU DSQs instead of per-node global DSQs in bypass mode
  sched_ext: Refactor do_enqueue_task() local and global DSQ paths
  sched_ext: Use shorter slice in bypass mode
  sched_ext: Mark racy bitfields to prevent adding fields that can't tolerate races
  sched_ext: Minor cleanups to scx_task_iter
  ...
2025-12-03 13:25:39 -08:00
Linus Torvalds 8449d3252c cgroup: Changes for v6.19
- Defer task cgroup unlink until after the dying task's final context switch
   so that controllers see the cgroup properly populated until the task is
   truly gone.
 
 - cpuset cleanups and simplifications. Enforce that domain isolated CPUs
   stay in root or isolated partitions and fail if isolated+nohz_full would
   leave no housekeeping CPU. Fix sched/deadline root domain handling during
   CPU hot-unplug and race for tasks in attaching cpusets.
 
 - Misc fixes including memory reclaim protection documentation and selftest
   KTAP conformance.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaS3pEQ4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGYbrAP9H0kVyWH5tK9VhjSZyqidic8NuvtmNOyhIRrg0
 8S8K0wD/YG9xlh2JUyRmS4B23ggc59+9y5xM2/sctrho51Pvsgg=
 =0MB+
 -----END PGP SIGNATURE-----

Merge tag 'cgroup-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup updates from Tejun Heo:

 - Defer task cgroup unlink until after the dying task's final context
   switch so that controllers see the cgroup properly populated until
   the task is truly gone

 - cpuset cleanups and simplifications.

   Enforce that domain isolated CPUs stay in root or isolated partitions
   and fail if isolated+nohz_full would leave no housekeeping CPU. Fix
   sched/deadline root domain handling during CPU hot-unplug and race
   for tasks in attaching cpusets

 - Misc fixes including memory reclaim protection documentation and
   selftest KTAP conformance

* tag 'cgroup-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits)
  cpuset: Treat cpusets in attaching as populated
  sched/deadline: Walk up cpuset hierarchy to decide root domain when hot-unplug
  cgroup/cpuset: Introduce cpuset_cpus_allowed_locked()
  docs: cgroup: No special handling of unpopulated memcgs
  docs: cgroup: Note about sibling relative reclaim protection
  docs: cgroup: Explain reclaim protection target
  selftests/cgroup: conform test to KTAP format output
  cpuset: remove need_rebuild_sched_domains
  cpuset: remove global remote_children list
  cpuset: simplify node setting on error
  cgroup: include missing header for struct irq_work
  cgroup: Fix sleeping from invalid context warning on PREEMPT_RT
  cgroup/cpuset: Globally track isolated_cpus update
  cgroup/cpuset: Ensure domain isolated CPUs stay in root or isolated partition
  cgroup/cpuset: Move up prstate_housekeeping_conflict() helper
  cgroup/cpuset: Fail if isolated and nohz_full don't leave any housekeeping
  cgroup/cpuset: Rename update_unbound_workqueue_cpumask() to update_isolation_cpumasks()
  cgroup: Defer task cgroup unlink until after the task is done switching out
  cgroup: Move dying_tasks cleanup from cgroup_task_release() to cgroup_task_free()
  cgroup: Rename cgroup lifecycle hooks to cgroup_task_*()
  ...
2025-12-03 13:04:07 -08:00
Linus Torvalds 2b60145734 workqueue: Changes for v6.19
- Rescuer affinity management: Affinity now updated only when detached using
   wq_unbound_cpumask consistently. DISASSOCIATED workers also follow unbound
   cpumask changes to avoid breaking CPU isolation.
 
 - Rescuer cleanups preparing for fetching work items one by one from pool list:
   Work assignment factored out, optimized to skip pwqs no longer needing
   rescue, and shutdown logic simplified.
 
 - Unused assert_rcu_or_wq_mutex_or_pool_mutex() removed.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaS3kYg4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGWkwAQC6zcYAbGTcEkfiDsg3GM6dCsHDk3hIXVguE7Em
 mOpnVQD/QxiYHZxt/SOY5LIIX9ZvTpBYeuW7MR5bCncSBdCLTw4=
 =VkvX
 -----END PGP SIGNATURE-----

Merge tag 'wq-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq

Pull workqueue updates from Tejun Heo:

 - Rescuer affinity management: Affinity is now updated only when
   detached using wq_unbound_cpumask consistently. DISASSOCIATED workers
   also follow unbound cpumask changes to avoid breaking CPU isolation

 - Rescuer cleanups preparing for fetching work items one by one from
   pool list: Work assignment factored out, optimized to skip pwqs no
   longer needing rescue, and shutdown logic simplified

 - Unused assert_rcu_or_wq_mutex_or_pool_mutex() removed

* tag 'wq-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: Don't rely on wq->rescuer to stop rescuer
  workqueue: Only assign rescuer work when really needed
  workqueue: Factor out assign_rescuer_work()
  workqueue: Init rescuer's affinities as wq_unbound_cpumask
  workqueue: Let DISASSOCIATED workers follow unbound wq cpumask changes
  workqueue: Update the rescuer's affinity only when it is detached
  workqueue: Remove unused assert_rcu_or_wq_mutex_or_pool_mutex
2025-12-03 12:50:10 -08:00
Linus Torvalds 4d38b88fd1 printk changes for 6.19
-----BEGIN PGP SIGNATURE-----
 
 iQJPBAABCAA5FiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAmktlbUbFIAAAAAABAAO
 bWFudTIsMi41KzEuMTEsMiwyAAoJEFKgDEdIgJTyevsP/1z98/wfCaSCquIq4H8S
 OTqFGybGgYQt1NmMj2cGPpbAE3LJNYORT0A4tcoqOTy1Z5xbQz63rO3clSI/e7Mf
 n4ZZ7NvkE40i8et1BjqtZa9dSkAv4QLYH73KrtNeuTr5tqvHo1x8FakUH6gQnb1k
 QOOebvbVXnOb+rh89j1GZShrLFcCil0psjp165WHAYE/3PyFBgYGLMCgwLqS+W3H
 re5Q4sl/ySXpMFF/XN1Kww48FWxy/h+YQFCxZwuWlUcXtVjqZ+BN+keb7AqaFQ7R
 dC2exV2W0RBoupEJR/FWHoXrm/bDDLhzqRaMvoggLJrMJ9L6V0WdIhaFA4qzoG63
 paJGFjUfmDX3dpPsAddq7kKeevCz4a2/HwFKhiBqqq4tdHuely7wZgnoFO7ovgmu
 DYDCXHtpJuWZR3WJ5I/V/sJ9i9KFXhhyWcKVf13QTAFiCaA09aeSAcUWNYNaaxbn
 nu6IkUxdIVnWIEBgcYH6jz1DrPGreYLYuD4bVb2gdZoP0r3tnMpG6xfSNIUueSGd
 VFAKW9PJYaj7Id+jgACH6V+gQ22L600xJDdL1bPjRbGE0LD7vlz2F1MZTq3BFJFn
 hUxJeOZplHX+TPophdvH4MO9VLmydWLUyJiDBP1yA8M9XZms/5s7IJJ1RYXqUCcf
 qEB4L7W1+Qy1R/lzf2PU9X4R
 =FnfO
 -----END PGP SIGNATURE-----

Merge tag 'printk-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux

Pull printk updates from Petr Mladek:

 - Allow creaing nbcon console drivers with an unsafe write_atomic()
   callback that can only be called by the final nbcon_atomic_flush_unsafe().
   Otherwise, the driver would rely on the kthread.

   It is going to be used as the-best-effort approach for an
   experimental nbcon netconsole driver, see

     https://lore.kernel.org/r/20251121-nbcon-v1-2-503d17b2b4af@debian.org

   Note that a safe .write_atomic() callback is supposed to work in NMI
   context. But some networking drivers are not safe even in IRQ
   context:

     https://lore.kernel.org/r/oc46gdpmmlly5o44obvmoatfqo5bhpgv7pabpvb6sjuqioymcg@gjsma3ghoz35

   In an ideal world, all networking drivers would be fixed first and
   the atomic flush would be blocked only in NMI context. But it brings
   the question how reliable networking drivers are when the system is
   in a bad state. They might block flushing more reliable serial
   consoles which are more suitable for serious debugging anyway.

 - Allow to use the last 4 bytes of the printk ring buffer.

 - Prevent queuing IRQ work and block printk kthreads when consoles are
   suspended. Otherwise, they create non-necessary churn or even block
   the suspend.

 - Release console_lock() between each record in the kthread used for
   legacy consoles on RT. It might significantly speed up the boot.

 - Release nbcon context between each record in the atomic flush. It
   prevents stalls of the related printk kthread after it has lost the
   ownership in the middle of a record

 - Add support for NBCON consoles into KDB

 - Add %ptsP modifier for printing struct timespec64 and use it where
   possible

 - Misc code clean up

* tag 'printk-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux: (48 commits)
  printk: Use console_is_usable on console_unblank
  arch: um: kmsg_dump: Use console_is_usable
  drivers: serial: kgdboc: Drop checks for CON_ENABLED and CON_BOOT
  lib/vsprintf: Unify FORMAT_STATE_NUM handlers
  printk: Avoid irq_work for printk_deferred() on suspend
  printk: Avoid scheduling irq_work on suspend
  printk: Allow printk_trigger_flush() to flush all types
  tracing: Switch to use %ptSp
  scsi: snic: Switch to use %ptSp
  scsi: fnic: Switch to use %ptSp
  s390/dasd: Switch to use %ptSp
  ptp: ocp: Switch to use %ptSp
  pps: Switch to use %ptSp
  PCI: epf-test: Switch to use %ptSp
  net: dsa: sja1105: Switch to use %ptSp
  mmc: mmc_test: Switch to use %ptSp
  media: av7110: Switch to use %ptSp
  ipmi: Switch to use %ptSp
  igb: Switch to use %ptSp
  e1000e: Switch to use %ptSp
  ...
2025-12-03 12:42:36 -08:00
Linus Torvalds 98e7dcbb82 RCU pull request for v6.19
SRCU
 ----
 
 _ Properly handle SRCU readers within IRQ disabled sections in tiny SRCU
 
 - Preparation to reimplement RCU Tasks Trace on top of SRCU fast:
 
     - Introduce API to expedite a grace period and test it through
       rcutorture.
 
     - Split srcu-fast in two flavours: SRCU-fast and SRCU-fast-updown.
       Both are still targeted toward faster readers (without full
       barriers on LOCK and UNLOCK) at the expense of heavier write side
       (using full RCU grace period ordering instead of simply full
       ordering) as compared to "traditional" non-fast SRCU. But those
       srcu-fast flavours are going to be optimized in two different
       ways:
 
          - SRCU-fast will become the reimplementation basis for
            RCU-TASK-TRACE for consolidation. Since RCU-TASK-TRACE must
            be NMI safe, SRCU-fast must be as well.
 
          - SRCU-fast-updown will be needed for uretprobes code in order
            to get rid of the read-side memory barriers while still
            allowing entering the reader at task level while exiting it
            in a timer handler. It is considered semaphore-like in that
            it can have different owners between LOCK and UNLOCK.
            However it is not NMI-safe.
 
       The actual optimizations are work in progress for the next cycle.
       Only the new interfaces are added for now, along with related
       torture and scalability test code.
 
 - Create/document/debug/torture new proper initializers for RCU fast:
   DEFINE_SRCU_FAST() and init_srcu_struct_fast()
 
   This allows for using right away the proper ordering on the write
   side (either full ordering or full RCU grace period ordering) without
   waiting for the read side to tell which to use. Also this optimizes
   the read side altogether with moving flavour debug checks under debug
   config and with removing a costly RmW operation on their first call.
 
 - Make some diagnostic functions tracing safe.
 
 REFSCALE
 -------
 
 Add performance testing for common context synchronizations
 (Preemption, IRQ, Softirq) and per-cpu increments. Those are
 relevant comparisons against SRCU-fast read side APIs, especially
 as they are planned to synchronize further tracing fast-path code.
 
 MISCELLANOUS
 ------
 
 - In order to prepare the layout for nohz_full work deferral to
   user exit, the context tracking state must shrink the counter
   of transitions to/from RCU not watching. The only possible hazard
   is to trigger wrap-around more easily, delaying a bit grace periods
   when that happens. This should be a rare event though. Yet add
   debugging and torture code to test that assumption.
 
 - Fix memory leak on locktorture module
 
 - Annotate accesses in rculist_nulls.h to prevent from KCSAN warnings.
   On recent discussions, we also concluded that all those WRITE_ONCE()
   and READ_ONCE() on list APIs deserve appropriate comments. Something
   to be expected for the next cycle.
 
 - Provide a script to apply several configs to several commits with torture.
 
 - Allow torture to reuse a build directory in order to save needless
   rebuild time.
 
 - Various cleanups.
 -----BEGIN PGP SIGNATURE-----
 
 iQJPBAABCAA5FiEEd76+gtGM8MbftQlOhSRUR1COjHcFAmksubgbFIAAAAAABAAO
 bWFudTIsMi41KzEuMTEsMiwyAAoJEIUkVEdQjox3MJ0P/Aq+AOPWuaE70B8SxVYY
 VfucjjzZwSyI30Vw4GE8skNVMOnyY2wPIrNRgmckghi9LM5luureMcks9esrdFfv
 csWj3AhnBu+/Lyd4EhP8AxMH2OPIX50FUrAQaY7LAcmc4nIng1MqRm35eT776XzR
 tN07+nsXjNumPcdN8JAv/VzInUf5DdMvhy+Qw0Krpvan/sUpzzpFpNUWdHsyZ2Re
 UCQr//AooQLfCIIIHUYlviThqigJETZlv0NKYrD0Zaxx5MCOwwWmz5otQzacSRk2
 Zgn6Bayp98s0972sOghiMRXEoSvqBSJe5eVTWcj1oNLruO3flpyREgE5Wuo8T8Bg
 BHvOiYaPNC5FW+btCWOQcKcSx1mxrgsX3/9OAth+kmtPqZdJSQEeY+zqWzdN7ZlV
 q5+oipsFONvWXL6+U6uZf44MG1pscrzn6QvXajjg98iaLZgY0+mkAhvtpjcfuTwZ
 2CXYQKacfwftpOXgM2QQGB4QE8CKXchlma381KbTHv0rX8zfK0BnWoe/oGPz22GL
 3kldmhfIC/87i8t9HHngkU6SZ+Mkvf9vC28eclnuXB1PKCRsBHePGvi3I671kbLj
 hycBwnwWriULEwKYTVQ8gmdBHSANB4IpmMYflADVqLuvp+7U6W5mNjWNTcXoSwFl
 NoR7AViEUNzOEoa8u6WMKoec
 =XnUt
 -----END PGP SIGNATURE-----

Merge tag 'rcu.release.v6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux

Pull RCU updates from Frederic Weisbecker:
 "SRCU:

   - Properly handle SRCU readers within IRQ disabled sections in tiny
     SRCU

   - Preparation to reimplement RCU Tasks Trace on top of SRCU fast:

      - Introduce API to expedite a grace period and test it through
        rcutorture

      - Split srcu-fast in two flavours: SRCU-fast and SRCU-fast-updown.

        Both are still targeted toward faster readers (without full
        barriers on LOCK and UNLOCK) at the expense of heavier write
        side (using full RCU grace period ordering instead of simply
        full ordering) as compared to "traditional" non-fast SRCU. But
        those srcu-fast flavours are going to be optimized in two
        different ways:

          - SRCU-fast will become the reimplementation basis for
            RCU-TASK-TRACE for consolidation. Since RCU-TASK-TRACE must
            be NMI safe, SRCU-fast must be as well.

          - SRCU-fast-updown will be needed for uretprobes code in order
            to get rid of the read-side memory barriers while still
            allowing entering the reader at task level while exiting it
            in a timer handler. It is considered semaphore-like in that
            it can have different owners between LOCK and UNLOCK.
            However it is not NMI-safe.

        The actual optimizations are work in progress for the next
        cycle. Only the new interfaces are added for now, along with
        related torture and scalability test code.

   - Create/document/debug/torture new proper initializers for RCU fast:
     DEFINE_SRCU_FAST() and init_srcu_struct_fast()

     This allows for using right away the proper ordering on the write
     side (either full ordering or full RCU grace period ordering)
     without waiting for the read side to tell which to use.

     This also optimizes the read side altogether with moving flavour
     debug checks under debug config and with removing a costly RmW
     operation on their first call.

   - Make some diagnostic functions tracing safe

  Refscale:

   - Add performance testing for common context synchronizations
     (Preemption, IRQ, Softirq) and per-cpu increments. Those are
     relevant comparisons against SRCU-fast read side APIs, especially
     as they are planned to synchronize further tracing fast-path code

  Miscellanous:

   - In order to prepare the layout for nohz_full work deferral to user
     exit, the context tracking state must shrink the counter of
     transitions to/from RCU not watching. The only possible hazard is
     to trigger wrap-around more easily, delaying a bit grace periods
     when that happens. This should be a rare event though. Yet add
     debugging and torture code to test that assumption

   - Fix memory leak on locktorture module

   - Annotate accesses in rculist_nulls.h to prevent from KCSAN
     warnings. On recent discussions, we also concluded that all those
     WRITE_ONCE() and READ_ONCE() on list APIs deserve appropriate
     comments. Something to be expected for the next cycle

   - Provide a script to apply several configs to several commits with
     torture

   - Allow torture to reuse a build directory in order to save needless
     rebuild time

   - Various cleanups"

* tag 'rcu.release.v6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux: (29 commits)
  refscale: Add SRCU-fast-updown readers
  refscale: Exercise DEFINE_STATIC_SRCU_FAST() and init_srcu_struct_fast()
  rcutorture: Make srcu{,d}_torture_init() announce the SRCU type
  srcu: Create an SRCU-fast-updown API
  refscale: Do not disable interrupts for tests involving local_bh_enable()
  refscale: Add non-atomic per-CPU increment readers
  refscale: Add this_cpu_inc() readers
  refscale: Add preempt_disable() readers
  refscale: Add local_bh_disable() readers
  refscale: Add local_irq_disable() and local_irq_save() readers
  torture: Permit negative kvm.sh --kconfig numberic arguments
  srcu: Add SRCU_READ_FLAVOR_FAST_UPDOWN CPP macro
  rcu: Mark diagnostic functions as notrace
  rcutorture: Make TREE04 use CONFIG_RCU_DYNTICKS_TORTURE
  rcutorture: Remove redundant rcutorture_one_extend() from rcu_torture_one_read()
  rcutorture: Permit kvm-again.sh to re-use the build directory
  torture: Add kvm-series.sh to test commit/scenario combination
  rcu: use WRITE_ONCE() for ->next and ->pprev of hlist_nulls
  locktorture: Fix memory leak in param_set_cpumask()
  doc: Update for SRCU-fast definitions and initialization
  ...
2025-12-03 12:18:07 -08:00
Linus Torvalds a619fe35ab This update includes the following changes:
API:
 
 - Rewrite memcpy_sglist from scratch.
 - Add on-stack AEAD request allocation.
 - Fix partial block processing in ahash.
 
 Algorithms:
 
 - Remove ansi_cprng.
 - Remove tcrypt tests for poly1305.
 - Fix EINPROGRESS processing in authenc.
 - Fix double-free in zstd.
 
 Drivers:
 
 - Use drbg ctr helper when reseeding xilinx-trng.
 - Add support for PCI device 0x115A to ccp.
 - Add support of paes in caam.
 - Add support for aes-xts in dthev2.
 
 Others:
 
 - Use likely in rhashtable lookup.
 - Fix lockdep false-positive in padata by removing a helper.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEn51F/lCuNhUwmDeSxycdCkmxi6cFAmktaHwACgkQxycdCkmx
 i6duthAAl4ZjsuSgt0P9ZPJXWgSH+QbNT/6fL1QzLEuzLVGn8Mt99LTQpaYU8HRh
 fced8+R7UpqA/FgZTYbRKopZJVJJqhmTf2zqjbe47CroRm2Wf5UO+6ZXBsiqbMwa
 6fNLilhcrq5G3DrIHepCpIQ7NM2+ucTMnPRIWP3cvzLwX0JzPtYIpYUSiVPAtkjh
 9g24oPz6LR/xZfyk+wPbHOSYeqz4sSXnGJkL+Vn33AtU5KJZLum9zMP4Lleim7HP
 XaNnUL/S/PYCspycrvfrnq6+YMLPw2USguttuZe0Dg0qhq/jPMyzdEkTAjcTD5LG
 NZavVUbQsf6BW+YjXgaE/ybcSs6WR3ySs8aza1Ev8QqsmpbJj9xdpF9fn4RsffGR
 mbhc5plJCKWzfiaparea8yY9n5vHwbOK4zoyF9P6kI5ykkoA+GmwRwTW73M9KCfa
 i1R6g97O+t4Yaq9JI9GG7dkm9bxJpY+XaKouW7rqv/MX0iND1ExDYaqdcA+Xa61c
 TNfdlVcGyX7Dolm2xnpvRv8EqF9NzeK4Vw1QslrdCijXfe7eJymabNKhLBlV4li0
 tVfmh4vyQFgruyiR7r7AkXIKzsLZbji030UoOsQqiMW7ualBUQ0dCDbBa8J6kUcX
 /vjbSmxV3LKgVgYvUBRRGIi9CJbKfs29RkS6RFtdqcq/YT4KsJU=
 =DHes
 -----END PGP SIGNATURE-----

Merge tag 'v6.19-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6

Pull crypto updates from Herbert Xu:
 "API:
   - Rewrite memcpy_sglist from scratch
   - Add on-stack AEAD request allocation
   - Fix partial block processing in ahash

  Algorithms:
   - Remove ansi_cprng
   - Remove tcrypt tests for poly1305
   - Fix EINPROGRESS processing in authenc
   - Fix double-free in zstd

  Drivers:
   - Use drbg ctr helper when reseeding xilinx-trng
   - Add support for PCI device 0x115A to ccp
   - Add support of paes in caam
   - Add support for aes-xts in dthev2

  Others:
   - Use likely in rhashtable lookup
   - Fix lockdep false-positive in padata by removing a helper"

* tag 'v6.19-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (71 commits)
  crypto: zstd - fix double-free in per-CPU stream cleanup
  crypto: ahash - Zero positive err value in ahash_update_finish
  crypto: ahash - Fix crypto_ahash_import with partial block data
  crypto: lib/mpi - use min() instead of min_t()
  crypto: ccp - use min() instead of min_t()
  hwrng: core - use min3() instead of nested min_t()
  crypto: aesni - ctr_crypt() use min() instead of min_t()
  crypto: drbg - Delete unused ctx from struct sdesc
  crypto: testmgr - Add missing DES weak and semi-weak key tests
  Revert "crypto: scatterwalk - Move skcipher walk and use it for memcpy_sglist"
  crypto: scatterwalk - Fix memcpy_sglist() to always succeed
  crypto: iaa - Request to add Kanchana P Sridhar to Maintainers.
  crypto: tcrypt - Remove unused poly1305 support
  crypto: ansi_cprng - Remove unused ansi_cprng algorithm
  crypto: asymmetric_keys - fix uninitialized pointers with free attribute
  KEYS: Avoid -Wflex-array-member-not-at-end warning
  crypto: ccree - Correctly handle return of sg_nents_for_len
  crypto: starfive - Correctly handle return of sg_nents_for_len
  crypto: iaa - Fix incorrect return value in save_iaa_wq()
  crypto: zstd - Remove unnecessary size_t cast
  ...
2025-12-03 11:28:38 -08:00
Linus Torvalds 777f817160 integrity-v6.19
-----BEGIN PGP SIGNATURE-----
 
 iIoEABYKADIWIQQdXVVFGN5XqKr1Hj7LwZzRsCrn5QUCaS896BQcem9oYXJAbGlu
 dXguaWJtLmNvbQAKCRDLwZzRsCrn5RDuAQDx4fmvctP8kc9PeRjd5X/UV1ip1pPD
 beMKt8ghEThQiAEAzjFJbNGUDKhfR8yWODifAvYRurU5YQJZZI9wJ8skNw0=
 =3Vc4
 -----END PGP SIGNATURE-----

Merge tag 'integrity-v6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity

Pull integrity updates from Mimi Zohar:
 "Bug fixes:

   - defer credentials checking from the bprm_check_security hook to the
     bprm_creds_from_file security hook

   - properly ignore IMA policy rules based on undefined SELinux labels

  IMA policy rule extensions:

   - extend IMA to limit including file hashes in the audit logs
     (dont_audit action)

   - define a new filesystem subtype policy option (fs_subtype)

  Misc:

   - extend IMA to support in-kernel module decompression by deferring
     the IMA signature verification in kernel_read_file() to after the
     kernel module is decompressed"

* tag 'integrity-v6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity:
  ima: Handle error code returned by ima_filter_rule_match()
  ima: Access decompressed kernel module to verify appended signature
  ima: add fs_subtype condition for distinguishing FUSE instances
  ima: add dont_audit action to suppress audit actions
  ima: Attach CREDS_CHECK IMA hook to bprm_creds_from_file LSM hook
2025-12-03 11:08:03 -08:00
Linus Torvalds 0eae3283c3 audit/stable-6.19 PR 20251201
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCgAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmkuAIcUHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXMWlhAAgydMY1/oYPKbP3xdJTPJU6uc4GLk
 o1lRaO59FkwogA2iupU6rltKiqhPYH7lIzebMcOuRLzaHpO0sGk0anyCMjNW2Agl
 P/Zu+xkF+uGnzpO3gpaOLYFbCkpT98hNbyZu33l6ftxr37+S+DWHuThUhmOnxcgQ
 z7Kguq0jaruiW8oc219HjNI/VCWW0F1W/+PVjFUSZogty2K2UttsabZQFMJ8MHg6
 9C2jP/f+tN2KD55u7oEA5QiucC/8BdNdyLGke4TvjhG38FG4bh71Q59LknHa5yMa
 6+NeftpE76+Inb8e+ze7iNv1InRccBXurm0p6lZ/lU5nYrjRT245CleQ7nq9ppD1
 hyhuGQP/fvvYExKdTl1VWXA0zGLb6+1rIn6f//MpDSbXShGj/5vK82Qo/ug1ZEGH
 QEAr6g2/S6xgudl2ui5OHSDb87nDWxzNo1t9TxGPBoQ6TG3ryPcm1ccTB0tb59+3
 Poej26MOuZUrTpiQl3SLnfwjN0WnkiqGX5y/Cjh9tHFVk2OzOe/VqImZk9oeFgxt
 O+IEB2cUO/U0D/3tdECqiRVS/RFe3jSn/qYCzP9fQaLhD0IpqlhfJ5TXbWiQam8Y
 vGjNCx17a7mwvfxCIofEANw0wu3ooajq68UjRVhTpoKka0USKvktAoFirAB8r54I
 oNSBfWKl4polguo=
 =3hKh
 -----END PGP SIGNATURE-----

Merge tag 'audit-pr-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit

Pull audit updates from Paul Moore:

 - Consolidate the loops in __audit_inode_child() to improve performance

   When logging a child inode in __audit_inode_child(), we first run
   through the list of recorded inodes looking for the parent and then
   we repeat the search looking for a matching child entry. This pull
   request consolidates both searches into one pass through the recorded
   inodes, resuling in approximately a 50% reduction in audit overhead.

   See the commit description for the testing details.

 - Combine kmalloc()/memset() into kzalloc() in audit_krule_to_data()

 - Comment fixes

* tag 'audit-pr-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
  audit: merge loops in __audit_inode_child()
  audit: Use kzalloc() instead of kmalloc()/memset() in audit_krule_to_data()
  audit: fix comment misindentation in audit.h
2025-12-03 10:52:01 -08:00
Ingo Molnar 92546f6b52 perf/uprobes: Remove <space><Tab> whitespace noise
A few cases of space-Tab noise snuck in.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://patch.msgid.link/176478594889.498.15611228524880763978.tip-bot2@tip-bot2
2025-12-03 19:23:01 +01:00
Josh Poimboeuf f387d0e102 x86/asm: Remove ANNOTATE_DATA_SPECIAL usage
Instead of manually annotating each __ex_table entry, just make the
section mergeable and store the entry size in the ELF section header.

Either way works for objtool create_fake_symbols(), this way produces
cleaner code generation.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://patch.msgid.link/b858cb7891c1ba0080e22a9c32595e6c302435e2.1764694625.git.jpoimboe@kernel.org
2025-12-03 16:53:19 +01:00
Linus Torvalds d348c22394 Power management updates for 6.19-rc1
- Introduce and document a QoS limit on CPU exit latency during wakeup
    from suspend-to-idle (Ulf Hansson)
 
  - Add support for building libcpupower statically (Zuo An)
 
  - Add support for sending netlink notifications to user space on energy
    model updates (Changwoo Mini, Peng Fan)
 
  - Minor improvements to the Rust OPP interface (Tamir Duberstein)
 
  - Fixes to scope-based pointers in the OPP library (Viresh Kumar)
 
  - Use residency threshold in polling state override decisions in the
    menu cpuidle governor (Aboorva Devarajan)
 
  - Add sanity check for exit latency and target residency in the cpufreq
    core (Rafael Wysocki)
 
  - Use this_cpu_ptr() where possible in the teo governor (Christian
    Loehle)
 
  - Rework the handling of tick wakeups in the teo cpuidle governor to
    increase the likelihood of stopping the scheduler tick in the cases
    when tick wakeups can be counted as non-timer ones (Rafael Wysocki)
 
  - Fix a reverse condition in the teo cpuidle governor and drop a
    misguided target residency check from it (Rafael Wysocki)
 
  - Clean up multiple minor defects in the teo cpuidle governor (Rafael
    Wysocki)
 
  - Update header inclusion to make it follow the Include What You Use
    principle (Andy Shevchenko)
 
  - Enable MSR-based RAPL PMU support in the intel_rapl power capping
    driver and arrange for using it on the Panther Lake and Wildcat Lake
    processors (Kuppuswamy Sathyanarayanan)
 
  - Add support for Nova Lake and Wildcat Lake processors to the
    intel_rapl power capping driver (Kaushlendra Kumar, Srinivas
    Pandruvada)
 
  - Add OPP and bandwidth support for Tegra186 (Aaron Kling)
 
  - Optimizations for parameter array handling in the amd-pstate cpufreq
    driver (Mario Limonciello)
 
  - Fix for mode changes with offline CPUs in the amd-pstate cpufreq
    driver (Gautham Shenoy)
 
  - Preserve freq_table_sorted across suspend/hibernate in the cpufreq
    core (Zihuan Zhang)
 
  - Adjust energy model rules for Intel hybrid platforms in the
    intel_pstate cpufreq driver and improve printing of debug messages
    in it (Rafael Wysocki)
 
  - Replace deprecated strcpy() in cpufreq_unregister_governor()
    (Thorsten Blum)
 
  - Fix duplicate hyperlink target errors in the intel_pstate cpufreq
    driver documentation and use :ref: directive for internal linking in
    it (Swaraj Gaikwad, Bagas Sanjaya)
 
  - Add Diamond Rapids OOB mode support to the intel_pstate cpufreq
    driver (Kuppuswamy Sathyanarayanan)
 
  - Use mutex guard for driver locking in the intel_pstate driver and
    eliminate some code duplication from it (Rafael Wysocki)
 
  - Replace udelay() with usleep_range() in ACPI cpufreq (Kaushlendra
    Kumar)
 
  - Minor improvements to various cpufreq drivers (Christian Marangi, Hal
    Feng, Jie Zhan, Marco Crivellari, Miaoqian Lin, and Shuhao Fu)
 
  - Replace snprintf() with scnprintf() in show_trace_dev_match()
    (Kaushlendra Kumar)
 
  - Fix memory allocation error handling in pm_vt_switch_required()
    (Malaya Kumar Rout)
 
  - Introduce CALL_PM_OP() macro and use it to simplify code in
    generic PM operations (Kaushlendra Kumar)
 
  - Add module param to backtrace all CPUs in the device power management
    watchdog (Sergey Senozhatsky)
 
  - Rework message printing in swsusp_save() (Rafael Wysocki)
 
  - Make it possible to change the number of hibernation compression
    threads (Xueqin Luo)
 
  - Clarify that only cgroup1 freezer uses PM freezer (Tejun Heo)
 
  - Add document on debugging shutdown hangs to PM documentation and
    correct a mistaken configuration option in it (Mario Limonciello)
 
  - Shut down wakeup source timer before removing the wakeup source from
    the list (Kaushlendra Kumar, Rafael Wysocki)
 
  - Introduce new PMSG_POWEROFF event for system shutdown handling with
    the help of PM device callbacks (Mario Limonciello)
 
  - Make pm_test delay interruptible by wakeup events (Riwen Lu)
 
  - Clean up kernel-doc comment style usage in the core hibernation
    code and remove unuseful comments from it (Sunday Adelodun, Rafael
    Wysocki)
 
  - Add support for handling wakeup events and aborting the suspend
    process while it is syncing file systems (Samuel Wu, Rafael Wysocki)
 
  - Add WQ_UNBOUND to pm_wq workqueue (Marco Crivellari)
 
  - Add runtime PM wrapper macros for ACQUIRE()/ACQUIRE_ERR() and use
    them in the PCI core and the ACPI TAD driver (Rafael Wysocki)
 
  - Improve runtime PM in the ACPI TAD driver (Rafael Wysocki)
 
  - Update pm_runtime_allow/forbid() documentation (Rafael Wysocki)
 
  - Fix typos in runtime.c comments (Malaya Kumar Rout)
 
  - Move governor.h from devfreq under include/linux/ and rename to
    devfreq-governor.h to allow devfreq governor definitions in out
    of drivers/devfreq/ (Dmitry Baryshkov)
 
  - Use min() to improve readability in tegra30-devfreq.c (Thorsten
    Blum)
 
  - Fix potential use-after-free issue of OPP handling in
    hisi_uncore_freq.c (Pengjie Zhang)
 
  - Fix typo in DFSO_DOWNDIFFERENTIAL macro name in
    governor_simpleondemand.c in devfreq (Riwen Lu)
 -----BEGIN PGP SIGNATURE-----
 
 iQFGBAABCAAwFiEEcM8Aw/RY0dgsiRUR7l+9nS/U47UFAmkp0BYSHHJqd0Byand5
 c29ja2kubmV0AAoJEO5fvZ0v1OO1Pc8H/2G5d0aD/ym1a8MDTpKqn7t3/rVMHa76
 YGfxXMBr1oY++r5GTJTKBxZBHmF89VH71kdyvsMidTAtHjR+iZAS1ajd2Q5VYjOF
 QNMld1qgPEzAZU8WSetDrBqMr89zls05Uubo4aCoNy6rFmgRaLHh3AmIKSS9aJuo
 C1eH8dRONME5I/rafkOUpFs1+/Agq1vePwPZmwVnZX9A3qI+UOhMRdU9A37kYkx9
 YwfQvR2fKTIPjZ6B9f/wGXPOvdrT37d4+dWT3EABOHMkxlpAPDMvmVzZsUaXSQMr
 0d9NGEjPGo33qciKJJpHqNOdDOhi90606WBBf7aaMF+GMhDX3PznOK4=
 =rzXO
 -----END PGP SIGNATURE-----

Merge tag 'pm-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:
 "There are quite a few interesting things here, including new hardware
  support, new features, some bug fixes and documentation updates. In
  addition, there are a usual bunch of minor fixes and cleanups all
  over.

  In the new hardware support category, there are intel_pstate and
  intel_rapl driver updates to support new processors, Panther Lake,
  Wildcat Lake, Noval Lake, and Diamond Rapids in the OOB mode, OPP and
  bandwidth allocation support in the tegra186 cpufreq driver, and
  JH7110S SOC support in dt-platdev cpufreq.

  The new features are the PM QoS CPU latency limit for suspend-to-idle,
  the netlink support for the energy model management, support for
  terminating system suspend via a wakeup event during the sync of file
  systems, configurable number of hibernation compression threads, the
  runtime PM auto-cleanup macros, and the "poweroff" PM event that is
  expected to be used during system shutdown.

  Bugs are mostly fixed in cpuidle governors, but there are also fixes
  elsewhere, like in the amd-pstate cpufreq driver.

  Documentation updates include, but are not limited to, a new doc on
  debugging shutdown hangs, cross-referencing fixes and cleanups in the
  intel_pstate documentation, and updates of comments in the core
  hibernation code.

  Specifics:

   - Introduce and document a QoS limit on CPU exit latency during
     wakeup from suspend-to-idle (Ulf Hansson)

   - Add support for building libcpupower statically (Zuo An)

   - Add support for sending netlink notifications to user space on
     energy model updates (Changwoo Mini, Peng Fan)

   - Minor improvements to the Rust OPP interface (Tamir Duberstein)

   - Fixes to scope-based pointers in the OPP library (Viresh Kumar)

   - Use residency threshold in polling state override decisions in the
     menu cpuidle governor (Aboorva Devarajan)

   - Add sanity check for exit latency and target residency in the
     cpufreq core (Rafael Wysocki)

   - Use this_cpu_ptr() where possible in the teo governor (Christian
     Loehle)

   - Rework the handling of tick wakeups in the teo cpuidle governor to
     increase the likelihood of stopping the scheduler tick in the cases
     when tick wakeups can be counted as non-timer ones (Rafael Wysocki)

   - Fix a reverse condition in the teo cpuidle governor and drop a
     misguided target residency check from it (Rafael Wysocki)

   - Clean up multiple minor defects in the teo cpuidle governor (Rafael
     Wysocki)

   - Update header inclusion to make it follow the Include What You Use
     principle (Andy Shevchenko)

   - Enable MSR-based RAPL PMU support in the intel_rapl power capping
     driver and arrange for using it on the Panther Lake and Wildcat
     Lake processors (Kuppuswamy Sathyanarayanan)

   - Add support for Nova Lake and Wildcat Lake processors to the
     intel_rapl power capping driver (Kaushlendra Kumar, Srinivas
     Pandruvada)

   - Add OPP and bandwidth support for Tegra186 (Aaron Kling)

   - Optimizations for parameter array handling in the amd-pstate
     cpufreq driver (Mario Limonciello)

   - Fix for mode changes with offline CPUs in the amd-pstate cpufreq
     driver (Gautham Shenoy)

   - Preserve freq_table_sorted across suspend/hibernate in the cpufreq
     core (Zihuan Zhang)

   - Adjust energy model rules for Intel hybrid platforms in the
     intel_pstate cpufreq driver and improve printing of debug messages
     in it (Rafael Wysocki)

   - Replace deprecated strcpy() in cpufreq_unregister_governor()
     (Thorsten Blum)

   - Fix duplicate hyperlink target errors in the intel_pstate cpufreq
     driver documentation and use :ref: directive for internal linking
     in it (Swaraj Gaikwad, Bagas Sanjaya)

   - Add Diamond Rapids OOB mode support to the intel_pstate cpufreq
     driver (Kuppuswamy Sathyanarayanan)

   - Use mutex guard for driver locking in the intel_pstate driver and
     eliminate some code duplication from it (Rafael Wysocki)

   - Replace udelay() with usleep_range() in ACPI cpufreq (Kaushlendra
     Kumar)

   - Minor improvements to various cpufreq drivers (Christian Marangi,
     Hal Feng, Jie Zhan, Marco Crivellari, Miaoqian Lin, and Shuhao Fu)

   - Replace snprintf() with scnprintf() in show_trace_dev_match()
     (Kaushlendra Kumar)

   - Fix memory allocation error handling in pm_vt_switch_required()
     (Malaya Kumar Rout)

   - Introduce CALL_PM_OP() macro and use it to simplify code in generic
     PM operations (Kaushlendra Kumar)

   - Add module param to backtrace all CPUs in the device power
     management watchdog (Sergey Senozhatsky)

   - Rework message printing in swsusp_save() (Rafael Wysocki)

   - Make it possible to change the number of hibernation compression
     threads (Xueqin Luo)

   - Clarify that only cgroup1 freezer uses PM freezer (Tejun Heo)

   - Add document on debugging shutdown hangs to PM documentation and
     correct a mistaken configuration option in it (Mario Limonciello)

   - Shut down wakeup source timer before removing the wakeup source
     from the list (Kaushlendra Kumar, Rafael Wysocki)

   - Introduce new PMSG_POWEROFF event for system shutdown handling with
     the help of PM device callbacks (Mario Limonciello)

   - Make pm_test delay interruptible by wakeup events (Riwen Lu)

   - Clean up kernel-doc comment style usage in the core hibernation
     code and remove unuseful comments from it (Sunday Adelodun, Rafael
     Wysocki)

   - Add support for handling wakeup events and aborting the suspend
     process while it is syncing file systems (Samuel Wu, Rafael
     Wysocki)

   - Add WQ_UNBOUND to pm_wq workqueue (Marco Crivellari)

   - Add runtime PM wrapper macros for ACQUIRE()/ACQUIRE_ERR() and use
     them in the PCI core and the ACPI TAD driver (Rafael Wysocki)

   - Improve runtime PM in the ACPI TAD driver (Rafael Wysocki)

   - Update pm_runtime_allow/forbid() documentation (Rafael Wysocki)

   - Fix typos in runtime.c comments (Malaya Kumar Rout)

   - Move governor.h from devfreq under include/linux/ and rename to
     devfreq-governor.h to allow devfreq governor definitions in out of
     drivers/devfreq/ (Dmitry Baryshkov)

   - Use min() to improve readability in tegra30-devfreq.c (Thorsten
     Blum)

   - Fix potential use-after-free issue of OPP handling in
     hisi_uncore_freq.c (Pengjie Zhang)

   - Fix typo in DFSO_DOWNDIFFERENTIAL macro name in
     governor_simpleondemand.c in devfreq (Riwen Lu)"

* tag 'pm-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (96 commits)
  PM / devfreq: Fix typo in DFSO_DOWNDIFFERENTIAL macro name
  cpuidle: Warn instead of bailing out if target residency check fails
  cpuidle: Update header inclusion
  Documentation: power/cpuidle: Document the CPU system wakeup latency QoS
  cpuidle: Respect the CPU system wakeup QoS limit for cpuidle
  sched: idle: Respect the CPU system wakeup QoS limit for s2idle
  pmdomain: Respect the CPU system wakeup QoS limit for cpuidle
  pmdomain: Respect the CPU system wakeup QoS limit for s2idle
  PM: QoS: Introduce a CPU system wakeup QoS limit
  cpuidle: governors: teo: Add missing space to the description
  PM: hibernate: Extra cleanup of comments in swap handling code
  PM / devfreq: tegra30: use min to simplify actmon_cpu_to_emc_rate
  PM / devfreq: hisi: Fix potential UAF in OPP handling
  PM / devfreq: Move governor.h to a public header location
  powercap: intel_rapl: Enable MSR-based RAPL PMU support
  powercap: intel_rapl: Prepare read_raw() interface for atomic-context callers
  cpufreq: qcom-nvmem: fix compilation warning for qcom_cpufreq_ipq806x_match_list
  PM: sleep: Call pm_sleep_fs_sync() instead of ksys_sync_helper()
  PM: sleep: Add support for wakeup during filesystem sync
  cpufreq: ACPI: Replace udelay() with usleep_range()
  ...
2025-12-02 17:31:22 -08:00
Steven Rostedt b1e7a590a0 ring-buffer: Add helper functions for allocations
The allocation of the per CPU buffer descriptor, the buffer page
descriptors and the buffer page data itself can be pretty ugly:

  kzalloc_node(ALIGN(sizeof(struct buffer_page), cache_line_size()),
               GFP_KERNEL, cpu_to_node(cpu));

And the data pages:

  page = alloc_pages_node(cpu_to_node(cpu),
                          GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_COMP | __GFP_ZERO, order);
  if (!page)
	return NULL;
  bpage->page = page_address(page);
  rb_init_page(bpage->page);

Add helper functions to make the code easier to read.

This does make all allocations of the data page (bpage->page) allocated
with the __GFP_RETRY_MAYFAIL flag (and not just the bulk allocator). Which
is actually better, as allocating the data page for the ring buffer tracing
should try hard but not trigger the OOM killer.

Link: https://lore.kernel.org/all/CAHk-=wjMMSAaqTjBSfYenfuzE1bMjLj+2DLtLWJuGt07UGCH_Q@mail.gmail.com/

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251125121153.35c07461@gandalf.local.home
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-12-02 15:49:35 -05:00
Linus Torvalds 7f8d5f70ff Tree wide cleanup of the remaining users of in_irq() which got replaced
by in_hardirq() and marked deprecated in 2020.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmkvDhUTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoW+XD/959KAIm2JpcEYUWuNBmlhEyuYWvPLw
 ZyOiraLYBNyWmfCO/Yz4Ff8VZSR9gdWQoNfvBb8uxkbSXa0UOEUhCbzWsuoTnqR5
 ObTIHCJ9QmPlRiFDvs4Sf5TGmy/4nXh6/PoH3JykNdlD3rZMTxiAz/k6QuO/S2iu
 ykA+DNtNL7jDkQHzrWa3rf597BkBN1Z+hUD8zHRt8LYKRfmLYWjCMggjPLMnuqcn
 240fnV/FubCLd9f5ZgNxHQMQCQH2qB7GYMk08YwXwCZQqIIXWqbNnhedkkNO3kWq
 Sws4TEO6yg9pgTFqkuiDU5QgYEboRY4pDT45KSkdTHHGZl2OAAl3eVIGCto72UEI
 Eyzn4k900hZ1iI/Rad5mx3D4XJZEXFgEbXhjph0odn6jVvmSj+Fmg3J67u1niO2a
 obzB+xeaIkbGNQIgJFy8+A9SSnZckvuPlXdZdUxS2S95zH7f9+vBY8HWJMuyursa
 3AJAKa82mN1i3A9FdSuMTdttQWkDmrwPKVzxvixs1mBu7kB70XaRIKsPjZj7LH6X
 CiqP9Kt5FO0hVA7K+nKTeUA5DdjB4HzYzOgMqzFUhExY3hksVsj8rQEO6B0bCp9t
 CfITA3BvU7GXxhXZHOq3dABQ21J/ZHgeuK3QdQSnOxSQOv2ElYIdKvYirJy2QdS1
 tSM3O3GXb4zWDg==
 =6LKf
 -----END PGP SIGNATURE-----

Merge tag 'core-core-2025-12-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull core irq cleanup from Thomas Gleixner:
 "Tree wide cleanup of the remaining users of in_irq() which got
  replaced by in_hardirq() and marked deprecated in 2020"

* tag 'core-core-2025-12-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  treewide: Remove in_irq()
2025-12-02 10:18:49 -08:00