cpustat_tail indexes cpustat_util[], which is a NUM_SAMPLE_PERIODS-sized
ring buffer. need_counting_irqs() currently wraps the index using
NUM_HARDIRQ_REPORT, which only happens to match NUM_SAMPLE_PERIODS.
Use NUM_SAMPLE_PERIODS for the wrap to keep the ring math correct even if
the NUM_HARDIRQ_REPORT or NUM_SAMPLE_PERIODS changes.
Link: https://lkml.kernel.org/r/tencent_7068189CB6D6689EB353F3D17BF5A5311A07@qq.com
Fixes: e9a9292e23 ("watchdog/softlockup: Report the most frequent interrupts")
Signed-off-by: Shengming Hu <hu.shengming@zte.com.cn>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Zhang Run <zhang.run@zte.com.cn>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This function returns NULL if kho_restore_page() returns NULL, which
happens in a couple of corner cases. It never returns an error code.
Link: https://lkml.kernel.org/r/20260123190506.1058669-1-tycho@kernel.org
Signed-off-by: Tycho Andersen (AMD) <tycho@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Introduce an in-kernel test module to validate the core logic of the Live
Update Orchestrator's File-Lifecycle-Bound feature. This provides a
low-level, controlled environment to test FLB registration and callback
invocation without requiring userspace interaction or actual kexec
reboots.
The test is enabled by the CONFIG_LIVEUPDATE_TEST Kconfig option.
Link: https://lkml.kernel.org/r/20251218155752.3045808-6-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: David Gow <davidgow@google.com>
Cc: David Matlack <dmatlack@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kees Cook <kees@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Pratyush Yadav <pratyush@kernel.org>
Cc: Samiullah Khawaja <skhawaja@google.com>
Cc: Tamir Duberstein <tamird@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Introduce a mechanism for managing global kernel state whose lifecycle is
tied to the preservation of one or more files. This is necessary for
subsystems where multiple preserved file descriptors depend on a single,
shared underlying resource.
An example is HugeTLB, where multiple file descriptors such as memfd and
guest_memfd may rely on the state of a single HugeTLB subsystem.
Preserving this state for each individual file would be redundant and
incorrect. The state should be preserved only once when the first file is
preserved, and restored/finished only once the last file is handled.
This patch introduces File-Lifecycle-Bound (FLB) objects to solve this
problem. An FLB is a global, reference-counted object with a defined set
of operations:
- A file handler (struct liveupdate_file_handler) declares a dependency
on one or more FLBs via a new registration function,
liveupdate_register_flb().
- When the first file depending on an FLB is preserved, the FLB's
.preserve() callback is invoked to save the shared global state. The
reference count is then incremented for each subsequent file.
- Conversely, when the last file is unpreserved (before reboot) or
finished (after reboot), the FLB's .unpreserve() or .finish() callback
is invoked to clean up the global resource.
The implementation includes:
- A new set of ABI definitions (luo_flb_ser, luo_flb_head_ser) and a
corresponding FDT node (luo-flb) to serialize the state of all active
FLBs and pass them via Kexec Handover.
- Core logic in luo_flb.c to manage FLB registration, reference
counting, and the invocation of lifecycle callbacks.
- An API (liveupdate_flb_get/_incoming/_outgoing) for other kernel
subsystems to safely access the live object managed by an FLB, both
before and after the live update.
This framework provides the necessary infrastructure for more complex
subsystems like IOMMU, VFIO, and KVM to integrate with the Live Update
Orchestrator.
Link: https://lkml.kernel.org/r/20251218155752.3045808-5-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: David Gow <davidgow@google.com>
Cc: David Matlack <dmatlack@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kees Cook <kees@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Pratyush Yadav <pratyush@kernel.org>
Cc: Samiullah Khawaja <skhawaja@google.com>
Cc: Tamir Duberstein <tamird@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Switch LUO to use the private list iterators.
Link: https://lkml.kernel.org/r/20251218155752.3045808-4-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: David Gow <davidgow@google.com>
Cc: David Matlack <dmatlack@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kees Cook <kees@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Pratyush Yadav <pratyush@kernel.org>
Cc: Samiullah Khawaja <skhawaja@google.com>
Cc: Tamir Duberstein <tamird@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The custom definition of 'struct timespec64' is incompatible with both the
kernel's internal definition and the glibc type, at least on big-endian
targets that have the tv_nsec field in a different place, and the
definition clashes with any userspace that also defines a timespec64
structure.
Running the header check with -Wpadding enabled produces this output that
warns about the incorrect padding:
usr/include/linux/taskstats.h:25:1: error: padding struct size to alignment boundary with 4 bytes [-Werror=padded]
Remove the hack and instead use the regular __kernel_timespec type that is
meant to be used in uapi definitions.
Link: https://lkml.kernel.org/r/20260202095906.1344100-1-arnd@kernel.org
Fixes: 29b63f6eff0e ("delayacct: add timestamp of delay max")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Fan Yu <fan.yu9@zte.com.cn>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: xu xin <xu.xin16@zte.com.cn>
Cc: Yang Yang <yang.yang29@zte.com.cn>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Jiang Kun <jiang.kun2@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
performance regressions in the recent rewrite
of the SCHED_MM_CID management code:
- Fix livelock triggered by BPF CI testing
- Fix hard lockup on weakly ordered systems
- Simplify the dropping of CIDs in the exit path
by removing an unintended transition phase.
- Fix performance/scalability regression on a
thread-pool benchmark by optimizing transitional
CIDs when scheduling out.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmmHDvQRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1hdPBAAgnl/L09wF8WCQLSoLrhr71FmS6fZApDB
Rvov2be8tGJR0BsrJF5uOKTNjulqUIr0mfO73fdHZftdFuhm/WLnWjBO62GhCKMg
d8kXOVZ7PudFN+QwL17pOAub8voh9s9/mceE/hZ3M5eNjXlG4sAcpyGvnrTLLYru
rfzO48NOpy5NMfbxU5/f9nojfr2t8fhnpX2QjquOhEPpl/BeYzexTZK7h2IJXqTK
tkU6IY9X8fT7y8LkKbTCIMJvEuWawHj1DSW2EiWNPJZkX+Hk5ZHttg28JjROavEy
orgairCSCT/cOETKugfToFd0Z4WlmemY6Nk5Kyx//WiFQ/u0HHlFVgMJoJfQEovV
MtIxLVygVbEoQyTszZyFUlTQjrnH8uKxXYhh1mX5wSj9lyDfpfJZycFFA2RpE4Rw
/+pvH08BfR4FgpqTfojfgOnuK/575VsomaVghritoNW3bAie1kpnWIeBaXS8lL4O
0pkK7XX8ng6hXuZTMxgXXfkfUB6oM1Yp1OZJAEzUvftsK0FQ5q3e0WxD+pdVza2s
PfQPaA7bT/G7y8k4LIXm59/tPX2QWPwe0yci00NbyfWiOdxHSgS7crQO8E1+VAiq
TcLGZNj/wFL6B5ghaiUIi22Mo+WnLX8fW+aiIjSiUQILmbNZXYmwtfEFsvsahh9W
/RkE/WQ492E=
=/PkF
-----END PGP SIGNATURE-----
Merge tag 'sched-urgent-2026-02-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
"Miscellaneous MMCID fixes to address bugs and performance regressions
in the recent rewrite of the SCHED_MM_CID management code:
- Fix livelock triggered by BPF CI testing
- Fix hard lockup on weakly ordered systems
- Simplify the dropping of CIDs in the exit path by removing an
unintended transition phase
- Fix performance/scalability regression on a thread-pool benchmark
by optimizing transitional CIDs when scheduling out"
* tag 'sched-urgent-2026-02-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/mmcid: Optimize transitional CIDs when scheduling out
sched/mmcid: Drop per CPU CID immediately when switching to per task mode
sched/mmcid: Protect transition on weakly ordered systems
sched/mmcid: Prevent live lock on task to CPU mode transition
Replace BUG_ON() with panic() in panic_on_wq_watchdog(). This is not
a bug condition but a deliberate forced panic requested by the user
via module parameters to crash the system for debugging purposes.
Using panic() instead of BUG_ON() makes this intent clearer and provides
more informative output about which threshold was exceeded and the actual
values, making it easier to diagnose the stall condition from crash dumps.
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Add a new module parameter 'panic_on_stall_time' that triggers a panic
when a workqueue stall persists for longer than the specified duration
in seconds.
Unlike 'panic_on_stall' which counts accumulated stall events, this
parameter triggers based on the duration of a single continuous stall.
This is useful for catching truly stuck workqueues rather than
accumulating transient stalls.
Usage:
workqueue.panic_on_stall_time=120
This would panic if any workqueue pool has been stalled for 120 seconds
or more.
The stall duration is measured from the workqueue last progress
(poll_ts) which accounts for legitimate system stalls.
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
- Bump up the Clang minimum version requirements
for livepatch builds, due to Clang assembler
section handling bugs causing silent
miscompilations.
- Strip livepatching symbol artifacts from
non-livepatch modules.
- Fix livepatch build warnings when certain
Clang LTO options are enabled.
- Fix livepatch build error when
CONFIG_MEM_ALLOC_PROFILING_DEBUG=y.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmmHCvwRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1hkTQ/8DhLI5m9CqhMaouuR5Vm9POKgFOXAe6uz
eHTuKhpJlw+anhNjeUA7PtYbnkrj0j+aNo5SmrfD4Yx8CW7dCt+oO3Y5ziVhkPHw
46Q/9KbmcT11uPbYywp/G4b15FF8YYu9slzhGau/Wa9H+oqU/WPoGapJsMPpUtBo
s0qGxdr2G3WZyD9H/wgCyhMwCOkAYMJ0sHxpGgRajLefDutsRvtlau/ktYU79vI3
nUFteD3YDIAUblBtZPogsCP36QJlx7TWCUNK02vPeOYRh3xPjf3iG+vgf1+sjZHV
P20psekpDwhh1KyeVziUyihUy8TmEVVozRvsrUKVlXmEqLqDtNrysqKTSp5/yRdT
MqgNwDrvv2wW/DKYJhefbuttx3ppAErrnJ3zC9TYSmdn27feKPDJcD1OmdS3BIpH
x9u/eVbOS8xbsOc/t3/Al7CRazvjLU0+OXsMJWbmAaO3tE7SwHq2aOnGbLGjvwWC
Ts1AYfNp4H41CLLFnmKR8q2t/DOBhefW8p3cR5U+cVQ7PdqKRT+TwKWVnCrbrBcJ
71702IrqoghwUrmhtxdZR0jZLtwb80s8zqdbrHojXSbXUFfYwwHNNaW1NSEpdSr4
W8xYfSPrK71OGZ6oh/v1Wce6D+mxqb6kYD8DRLgOdbxrnvLwUhGXxxNDjfXA0f4B
abjGHlAyoCs=
=cdu2
-----END PGP SIGNATURE-----
Merge tag 'objtool-urgent-2026-02-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull objtool fixes from Ingo Molnar::
- Bump up the Clang minimum version requirements for livepatch
builds, due to Clang assembler section handling bugs causing
silent miscompilations
- Strip livepatching symbol artifacts from non-livepatch modules
- Fix livepatch build warnings when certain Clang LTO options
are enabled
- Fix livepatch build error when CONFIG_MEM_ALLOC_PROFILING_DEBUG=y
* tag 'objtool-urgent-2026-02-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
objtool/klp: Fix unexported static call key access for manually built livepatch modules
objtool/klp: Fix symbol correlation for orphaned local symbols
livepatch: Free klp_{object,func}_ext data after initialization
livepatch: Fix having __klp_objects relics in non-livepatch modules
livepatch/klp-build: Require Clang assembler >= 20
- Export irq_domain_free_irqs() to allow PCI/MSI drivers that tear down
MSI domains to be built as modules (Aaron Kling)
- Export tegra_cpuidle_pcie_irqs_in_use(), which disables Tegra CC6 while
PCI IRQs are in use, so pci-tegra can be built as a module (Aaron Kling)
- Allow pci-tegra to be built as a module (Aaron Kling)
* pci/controller/tegra:
PCI: tegra: Allow building as a module
cpuidle: tegra: Export tegra_cpuidle_pcie_irqs_in_use()
irqdomain: Export irq_domain_free_irqs()
Take care of rqspinlock error in bpf_local_storage_{map_free, destroy}()
properly by switching to bpf_selem_unlink_nofail().
Both functions iterate their own RCU-protected list of selems and call
bpf_selem_unlink_nofail(). In map_free(), to prevent infinite loop when
both map_free() and destroy() fail to remove a selem from b->list
(extremely unlikely), switch to hlist_for_each_entry_rcu(). In destroy(),
also switch to hlist_for_each_entry_rcu() since we no longer iterate
local_storage->list under local_storage->lock.
bpf_selem_unlink() now becomes dedicated to helpers and syscalls paths
so reuse_now should always be false. Remove it from the argument and
hardcode it.
Acked-by: Alexei Starovoitov <ast@kernel.org>
Co-developed-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260205222916.1788211-12-ameryhung@gmail.com
Introduce bpf_selem_unlink_nofail() to properly handle errors returned
from rqspinlock in bpf_local_storage_map_free() and
bpf_local_storage_destroy() where the operation must succeeds.
The idea of bpf_selem_unlink_nofail() is to allow an selem to be
partially linked and use atomic operation on a bit field, selem->state,
to determine when and who can free the selem if any unlink under lock
fails. An selem initially is fully linked to a map and a local storage.
Under normal circumstances, bpf_selem_unlink_nofail() will be able to
grab locks and unlink a selem from map and local storage in sequeunce,
just like bpf_selem_unlink(), and then free it after an RCU grace period.
However, if any of the lock attempts fails, it will only clear
SDATA(selem)->smap or selem->local_storage depending on the caller and
set SELEM_MAP_UNLINKED or SELEM_STORAGE_UNLINKED according to the
caller. Then, after both map_free() and destroy() see the selem and the
state becomes SELEM_UNLINKED, one of two racing caller can succeed in
cmpxchg the state from SELEM_UNLINKED to SELEM_TOFREE, ensuring no
double free or memory leak.
To make sure bpf_obj_free_fields() is done only once and when map is
still present, it is called when unlinking an selem from b->list under
b->lock.
To make sure uncharging memory is done only when the owner is still
present in map_free(), block destroy() from returning until there is no
pending map_free().
Since smap may not be valid in destroy(), bpf_selem_unlink_nofail()
skips bpf_selem_unlink_storage_nolock_misc() when called from destroy().
This is okay as bpf_local_storage_destroy() will return the remaining
amount of memory charge tracked by mem_charge to the owner to uncharge.
It is also safe to skip clearing local_storage->owner and owner_storage
as the owner is being freed and no users or bpf programs should be able
to reference the owner and using local_storage.
Finally, access of selem, SDATA(selem)->smap and selem->local_storage
are racy. Callers will protect these fields with RCU.
Acked-by: Alexei Starovoitov <ast@kernel.org>
Co-developed-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260205222916.1788211-11-ameryhung@gmail.com
The next patch will introduce bpf_selem_unlink_nofail() to handle
rqspinlock errors. bpf_selem_unlink_nofail() will allow an selem to be
partially unlinked from map or local storage. Save memory allocation
method in selem so that later an selem can be correctly freed even when
SDATA(selem)->smap is init to NULL.
In addition, keep track of memory charge to the owner in local storage
so that later bpf_selem_unlink_nofail() can return the correct memory
charge to the owner. Updating local_storage->mem_charge is protected by
local_storage->lock.
Finally, extract miscellaneous tasks performed when unlinking an selem
from local_storage into bpf_selem_unlink_storage_nolock_misc(). It will
be reused by bpf_selem_unlink_nofail().
This patch also takes the chance to remove local_storage->smap, which
is no longer used since commit f484f4a3e0 ("bpf: Replace bpf memory
allocator with kmalloc_nolock() in local storage").
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260205222916.1788211-10-ameryhung@gmail.com
Percpu locks have been removed from cgroup and task local storage. Now
that all local storage no longer use percpu variables as locks preventing
recursion, there is no need to pass them to bpf_local_storage_map_free().
Remove the argument from the function.
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260205222916.1788211-9-ameryhung@gmail.com
The percpu counter in cgroup local storage is no longer needed as the
underlying bpf_local_storage can now handle deadlock with the help of
rqspinlock. Remove the percpu counter and related migrate_{disable,
enable}.
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260205222916.1788211-8-ameryhung@gmail.com
The percpu counter in task local storage is no longer needed as the
underlying bpf_local_storage can now handle deadlock with the help of
rqspinlock. Remove the percpu counter and related migrate_{disable,
enable}.
Since the percpu counter is removed, merge back bpf_task_storage_get()
and bpf_task_storage_get_recur(). This will allow the bpf syscalls and
helpers to run concurrently on the same CPU, removing the spurious
-EBUSY error. bpf_task_storage_get(..., F_CREATE) will now always
succeed with enough free memory unless being called recursively.
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260205222916.1788211-7-ameryhung@gmail.com
Change bpf_local_storage::lock and bpf_local_storage_map_bucket::lock
from raw_spin_lock to rqspinlock.
Finally, propagate errors from raw_res_spin_lock_irqsave() to syscall
return or BPF helper return.
In bpf_local_storage_destroy(), ignore return from
raw_res_spin_lock_irqsave() for now. A later patch will correctly
handle errors correctly in bpf_local_storage_destroy() so that it can
unlink selems even when failing to acquire locks.
For __bpf_local_storage_map_cache(), instead of handling the error,
skip updating the cache.
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260205222916.1788211-6-ameryhung@gmail.com
To prepare changing both bpf_local_storage_map_bucket::lock and
bpf_local_storage::lock to rqspinlock, convert bpf_selem_unlink() to
failable. It still always succeeds and returns 0 until the change
happens. No functional change.
Open code bpf_selem_unlink_storage() in the only caller,
bpf_selem_unlink(), since unlink_map and unlink_storage must be done
together after all the necessary locks are acquired.
For bpf_local_storage_map_free(), ignore the return from
bpf_selem_unlink() for now. A later patch will allow it to unlink selems
even when failing to acquire locks.
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260205222916.1788211-5-ameryhung@gmail.com
To prepare for changing bpf_local_storage_map_bucket::lock to rqspinlock,
convert bpf_selem_link_map() to failable. It still always succeeds and
returns 0 until the change happens. No functional change.
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260205222916.1788211-4-ameryhung@gmail.com
To prepare for changing bpf_local_storage_map_bucket::lock to rqspinlock,
convert bpf_selem_unlink_map() to failable. It still always succeeds and
returns 0 for now.
Since some operations updating local storage cannot fail in the middle,
open-code bpf_selem_unlink_map() to take the b->lock before the
operation. There are two such locations:
- bpf_local_storage_alloc()
The first selem will be unlinked from smap if cmpxchg owner_storage_ptr
fails, which should not fail. Therefore, hold b->lock when linking
until allocation complete. Helpers that assume b->lock is held by
callers are introduced: bpf_selem_link_map_nolock() and
bpf_selem_unlink_map_nolock().
- bpf_local_storage_update()
The three step update process: link_map(new_selem),
link_storage(new_selem), and unlink_map(old_selem) should not fail in
the middle.
In bpf_selem_unlink(), bpf_selem_unlink_map() and
bpf_selem_unlink_storage() should either all succeed or fail as a whole
instead of failing in the middle. So, return if unlink_map() failed.
Remove the selem_linked_to_map_lockless() check as an selem in the
common paths (not bpf_local_storage_map_free() or
bpf_local_storage_destroy()), will be unlinked under b->lock and
local_storage->lock and therefore no other threads can unlink the selem
from map at the same time.
In bpf_local_storage_destroy(), ignore the return of
bpf_selem_unlink_map() for now. A later patch will allow
bpf_local_storage_destroy() to unlink selems even when failing to
acquire locks.
Note that while this patch removes all callers of selem_linked_to_map(),
a later patch that introduces bpf_selem_unlink_nofail() will use it
again.
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260205222916.1788211-3-ameryhung@gmail.com
A later bpf_local_storage refactor will acquire all locks before
performing any update. To simplified the number of locks needed to take
in bpf_local_storage_map_update(), determine the bucket based on the
local_storage an selem belongs to instead of the selem pointer.
Currently, when a new selem needs to be created to replace the old selem
in bpf_local_storage_map_update(), locks of both buckets need to be
acquired to prevent racing. This can be simplified if the two selem
belongs to the same bucket so that only one bucket needs to be locked.
Therefore, instead of hashing selem, hashing the local_storage pointer
the selem belongs.
Performance wise, this is slightly better as update now requires locking
one bucket. It should not change the level of contention on one bucket
as the pointers to local storages of selems in a map are just as unique
as pointers to selems.
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260205222916.1788211-2-ameryhung@gmail.com
- Fix event format field alignments for 32 bit architectures
The fields in the event format files are used to parse the raw binary
buffer data by applications. If they are incorrect, then the application
produces garbage.
On 32 bit architectures, the function graph 64bit calltime and rettime
were off by 4bytes. That's because the actual fields are in a packed
structure but the macros used by the ftrace events did not mark them as
packed, and instead, gave them their natural alignment which made their
offsets off by 4 bytes.
There are macros to have a packed field within an embedded structure of
an event, but there's no macro for normal fields within a packed
structure of the event. The macro __field_packed() was used for the
packed embedded structure field. Rename that to __field_desc_pcaked() (to
match the non-packed embedded field macro __field_desc()), and make
__field_packed() for fields that are in a packed event structure (which
matches the unpacked __field() macro).
Switch the calltime and rettime fields of the function graph event to use
the new __field_packed() and this makes the offsets correct.
-----BEGIN PGP SIGNATURE-----
iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaYZKpRQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qk1UAP41H5PL24xmYp+34GIP6lHuJr6iZzUm
KbZi1Zx4zNmXSAD/e3Ra5SZopWszeMTf/tmxUXbl30oLdw4CJgS1WztBggk=
=m36h
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.19-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fix from Steven Rostedt:
- Fix event format field alignments for 32 bit architectures
The fields in the event format files are used to parse the raw binary
buffer data by applications. If they are incorrect, then the
application produces garbage.
On 32 bit architectures, the function graph 64bit calltime and
rettime were off by 4bytes. That's because the actual fields are in a
packed structure but the macros used by the ftrace events did not
mark them as packed, and instead, gave them their natural alignment
which made their offsets off by 4 bytes.
There are macros to have a packed field within an embedded structure
of an event, but there's no macro for normal fields within a packed
structure of the event. The macro __field_packed() was used for the
packed embedded structure field. Rename that to __field_desc_packed()
(to match the non-packed embedded field macro __field_desc()), and
make __field_packed() for fields that are in a packed event structure
(which matches the unpacked __field() macro).
Switch the calltime and rettime fields of the function graph event to
use the new __field_packed() and this makes the offsets correct.
* tag 'trace-v6.19-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Fix ftrace event field alignments
When the 'kprobe_event=' kernel command-line parameter is not provided,
there is no need to execute setup_boot_kprobe_events().
This change optimizes the initialization function init_kprobe_trace()
by skipping unnecessary work and effectively prevents potential blocking
that could arise from contention on the event_mutex lock in subsequent
operations.
Link: https://patch.msgid.link/20260204015401.163748-1-tianyaxiong@kylinos.cn
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Yaxiong Tian <tianyaxiong@kylinos.cn>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The init_blk_tracer() function causes significant boot delay as it
waits for the trace_event_sem lock held by trace_event_update_all().
Specifically, its child function register_trace_event() requires
this lock, which is occupied for an extended period during boot.
To resolve this, the execution of primary init_blk_tracer() is moved
to the trace_init_wq workqueue, allowing it to run asynchronously,
and prevent blocking the main boot thread.
Link: https://patch.msgid.link/20260204015353.163331-1-tianyaxiong@kylinos.cn
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Yaxiong Tian <tianyaxiong@kylinos.cn>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The eval_map_work_func() function, though queued in eval_map_wq,
holds the trace_event_sem read-write lock for a long time during
kernel boot. This causes blocking issues for other functions.
Rename eval_map_wq to trace_init_wq and make it global, thereby
allowing other parts of tracing to schedule work on this queue
asynchronously and avoiding blockage of the main boot thread.
Link: https://patch.msgid.link/20260204015344.162818-1-tianyaxiong@kylinos.cn
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Yaxiong Tian <tianyaxiong@kylinos.cn>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Two minor fixes for DMA-mapping subsystem:
- check for the rare case of the allocation failure of the global CMA pool
(Shanker Donthineni)
- avoid perf buffer overflow when tracing large scatter-gather lists
(Deepanshu Kartikey)
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSrngzkoBtlA8uaaJ+Jp1EFxbsSRAUCaYXTzAAKCRCJp1EFxbsS
REbUAP40hTgNNGjlbV/b6nES4P/SuZ5a8p05+YWF7bTOVf/pMwEA8EHFz5DLsKeS
1bX7/X2wzEYOyJ7v1S+PYxIswn9A5AY=
=2IQo
-----END PGP SIGNATURE-----
Merge tag 'dma-mapping-6.19-2026-02-06' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux
Pull dma-mapping fixes from Marek Szyprowski:
"Two minor fixes for the DMA-mapping subsystem:
- check for the rare case of the allocation failure of the global CMA
pool (Shanker Donthineni)
- avoid perf buffer overflow when tracing large scatter-gather lists
(Deepanshu Kartikey)"
* tag 'dma-mapping-6.19-2026-02-06' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
dma: contiguous: Check return value of dma_contiguous_reserve_area()
tracing/dma: Cap dma_map_sg tracepoint arrays to prevent buffer overflow
Called when copy_process() is called to copy state to a new child.
Right now this is just a stub, but will be used shortly to properly
handle fork'ing of task based io_uring restrictions.
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
call_rcu_tasks_trace() is not safe from in_nmi() and not reentrant.
To prevent deadlock on raw_spin_lock_rcu_node(rtpcp) or memory corruption
defer to irq_work when IRQs are disabled. call_rcu_tasks_generic()
protects itself with local_irq_save().
Note when bpf_async_cb->refcnt drops to zero it's safe to reuse
bpf_async_cb->worker for a different irq_work callback, since
bpf_async_schedule_op() -> irq_work_queue(&cb->worker);
is only called when refcnt >= 1.
Fixes: 1bfbc267ec ("bpf: Enable bpf_timer and bpf_wq in any context")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20260205190233.912-1-alexei.starovoitov@gmail.com
Currently, bpf_map_get_info_by_fd calculates and caches the hash of the
map regardless of the map's frozen state.
This leads to a TOCTOU bug where userspace can call
BPF_OBJ_GET_INFO_BY_FD to cache the hash and then modify the map
contents before freezing.
Therefore, a trusted loader can be tricked into verifying the stale hash
while loading the modified contents.
Fix this by returning -EPERM if the map is not frozen when the hash is
requested. This ensures the hash is only generated for the final,
immutable state of the map.
Fixes: ea2e6467ac ("bpf: Return hashes of maps in BPF_OBJ_GET_INFO_BY_FD")
Reported-by: Toshi Piazza <toshi.piazza@microsoft.com>
Signed-off-by: KP Singh <kpsingh@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20260205070755.695776-1-kpsingh@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Practical BPF signatures are significantly smaller than
KMALLOC_MAX_CACHE_SIZE
Allowing larger sizes opens the door for abuse by passing excessive
size values and forcing the kernel into expensive allocation paths (via
kmalloc_large or vmalloc).
Fixes: 3492715683 ("bpf: Implement signature verification for BPF programs")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: KP Singh <kpsingh@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20260205063807.690823-1-kpsingh@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The linker script scripts/module.lds.S specifies that all input
__klp_objects sections should be consolidated into an output section of
the same name, and start/stop symbols should be created to enable
scripts/livepatch/init.c to locate this data.
This start/stop pattern is not ideal for modules because the symbols are
created even if no __klp_objects input sections are present.
Consequently, a dummy __klp_objects section also appears in the
resulting module. This unnecessarily pollutes non-livepatch modules.
Instead, since modules are relocatable files, the usual method for
locating consolidated data in a module is to read its section table.
This approach avoids the aforementioned problem.
The klp_modinfo already stores a copy of the entire section table with
the final addresses. Introduce a helper function that
scripts/livepatch/init.c can call to obtain the location of the
__klp_objects section from this data.
Fixes: dd590d4d57 ("objtool/klp: Introduce klp diff subcommand for diffing object files")
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Acked-by: Joe Lawrence <joe.lawrence@redhat.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Aaron Tomlin <atomlin@atomlin.com>
Link: https://patch.msgid.link/20260123102825.3521961-2-petr.pavlu@suse.com
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
The fields of ftrace specific events (events used to save ftrace internal
events like function traces and trace_printk) are generated similarly to
how normal trace event fields are generated. That is, the fields are added
to a trace_events_fields array that saves the name, offset, size,
alignment and signness of the field. It is used to produce the output in
the format file in tracefs so that tooling knows how to parse the binary
data of the trace events.
The issue is that some of the ftrace event structures are packed. The
function graph exit event structures are one of them. The 64 bit calltime
and rettime fields end up 4 byte aligned, but the algorithm to show to
userspace shows them as 8 byte aligned.
The macros that create the ftrace events has one for embedded structure
fields. There's two macros for theses fields:
__field_desc() and __field_packed()
The difference of the latter macro is that it treats the field as packed.
Rename that field to __field_desc_packed() and create replace the
__field_packed() to be a normal field that is packed and have the calltime
and rettime use those.
This showed up on 32bit architectures for function graph time fields. It
had:
~# cat /sys/kernel/tracing/events/ftrace/funcgraph_exit/format
[..]
field:unsigned long func; offset:8; size:4; signed:0;
field:unsigned int depth; offset:12; size:4; signed:0;
field:unsigned int overrun; offset:16; size:4; signed:0;
field:unsigned long long calltime; offset:24; size:8; signed:0;
field:unsigned long long rettime; offset:32; size:8; signed:0;
Notice that overrun is at offset 16 with size 4, where in the structure
calltime is at offset 20 (16 + 4), but it shows the offset at 24. That's
because it used the alignment of unsigned long long when used as a
declaration and not as a member of a structure where it would be aligned
by word size (in this case 4).
By using the proper structure alignment, the format has it at the correct
offset:
~# cat /sys/kernel/tracing/events/ftrace/funcgraph_exit/format
[..]
field:unsigned long func; offset:8; size:4; signed:0;
field:unsigned int depth; offset:12; size:4; signed:0;
field:unsigned int overrun; offset:16; size:4; signed:0;
field:unsigned long long calltime; offset:20; size:8; signed:0;
field:unsigned long long rettime; offset:28; size:8; signed:0;
Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reported-by: "jempty.liang" <imntjempty@163.com>
Link: https://patch.msgid.link/20260204113628.53faec78@gandalf.local.home
Fixes: 04ae87a520 ("ftrace: Rework event_create_dir()")
Closes: https://lore.kernel.org/all/20260130015740.212343-1-imntjempty@163.com/
Closes: https://lore.kernel.org/all/20260202123342.2544795-1-imntjempty@163.com/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Replace prog and callback in bpf_async_cb after removing visibility of
bpf_async_cb in bpf_async_cancel_and_free() to increase the chances the
scheduled async callbacks short-circuit execution and exit early, and
not starting a RCU tasks trace section. This improves the overall time
spent in running the wq selftest.
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260205003853.527571-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
When freeing a bpf_async_cb in bpf_async_cb_rcu_tasks_trace_free(), in
case the wq callback is not scheduled, doing cancel_work() currently
returns false and leads to retry of RCU tasks trace grace period. If the
callback is never scheduled, we keep retrying indefinitely and don't put
the prog reference.
Since the only race we care about here is against a potentially running
wq callback in the first grace period, it should finish by the second
grace period, hence check work_busy() result to detect presence of
running wq callback if it's not pending, otherwise free the object
immediately without retrying.
Reasoning behind the check and its correctness with racing wq callback
invocation: cancel_work is supposed to be synchronized, hence calling it
first and getting false would mean that work is definitely not pending,
at this point, either the work is not scheduled at all or already
running, or we race and it already finished by the time we checked for
it using work_busy(). In case it is running, we synchronize using
pool->lock to check the current work running there, if we match, it
means we extend the wait by another grace period using retry = true,
otherwise either the work already finished running or was never
scheduled, so we can free the bpf_async_cb right away.
Fixes: 1bfbc267ec ("bpf: Enable bpf_timer and bpf_wq in any context")
Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260205003853.527571-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
All are singletons - please see the changelogs for details.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaYPcbwAKCRDdBJ7gKXxA
jpM9AQDRiBlZRBdYY8/nS2zMc8hE7s5O3koXu/UMf2O01aJjsgD6AssmcJzkbLir
O1mlBSD0wlR3TZLEqSOUYIxgw7evLww=
=ZHeq
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2026-02-04-15-55' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"Five hotfixes. Two are cc:stable, two are for MM.
All are singletons - please see the changelogs for details"
* tag 'mm-hotfixes-stable-2026-02-04-15-55' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
Documentation: document liveupdate cmdline parameter
mm, shmem: prevent infinite loop on truncate race
mailmap: update Alexander Mikhalitsyn's emails
liveupdate: luo_file: do not clear serialized_data on unfreeze
x86/kfence: fix booting on 32bit non-PAE systems
- Fix race where sched_class operations (sched_setscheduler() and friends)
could be invoked on dead tasks after sched_ext_dead() already ran, causing
invalid SCX task state transitions and NULL pointer dereferences. This was
a regression from the cgroup exit ordering fix which moved
sched_ext_free() to finish_task_switch().
-----BEGIN PGP SIGNATURE-----
iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaYPIhw4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGS3HAQChF4sOgoD67cul36LJaeiQzjLCh9iTU9vi2lB4
slJb5QD/dJhrC0T2ZVRm5rHVxckIx7KeFwbzhvlrUD7l+zEaAwo=
=ysQE
-----END PGP SIGNATURE-----
Merge tag 'sched_ext-for-6.19-rc8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fix from Tejun Heo:
- Fix race where sched_class operations (sched_setscheduler() and
friends) could be invoked on dead tasks after sched_ext_dead()
already ran, causing invalid SCX task state transitions and NULL
pointer dereferences.
This was a regression from the cgroup exit ordering fix which
moved sched_ext_free() to finish_task_switch().
* tag 'sched_ext-for-6.19-rc8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: Short-circuit sched_class operations on dead tasks
7900aa699c ("sched_ext: Fix cgroup exit ordering by moving sched_ext_free()
to finish_task_switch()") moved sched_ext_free() to finish_task_switch() and
renamed it to sched_ext_dead() to fix cgroup exit ordering issues. However,
this created a race window where certain sched_class ops may be invoked on
dead tasks leading to failures - e.g. sched_setscheduler() may try to switch a
task which finished sched_ext_dead() back into SCX triggering invalid SCX task
state transitions.
Add task_dead_and_done() which tests whether a task is TASK_DEAD and has
completed its final context switch, and use it to short-circuit sched_class
operations which may be called on dead tasks.
Fixes: 7900aa699c ("sched_ext: Fix cgroup exit ordering by moving sched_ext_free() to finish_task_switch()")
Reported-by: Andrea Righi <arighi@nvidia.com>
Link: http://lkml.kernel.org/r/20260202151341.796959-1-arighi@nvidia.com
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Previously, the verifier only tracked positive constant deltas between
linked registers using BPF_ADD. This limitation meant patterns like:
r1 = r0;
r1 += -4;
if r1 s>= 0 goto l0_%=; // r1 >= 0 implies r0 >= 4
// verifier couldn't propagate bounds back to r0
if r0 != 0 goto l0_%=;
r0 /= 0; // Verifier thinks this is reachable
l0_%=:
Similar limitation exists for 32-bit registers.
With this change, the verifier can now track negative deltas in reg->off
enabling bound propagation for the above pattern.
For alu32, we make sure the destination register has the upper 32 bits
as 0s before creating the link. BPF_ADD_CONST is split into
BPF_ADD_CONST64 and BPF_ADD_CONST32, the latter is used in case of alu32
and sync_linked_regs uses this to zext the result if known_reg has this
flag.
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260204151741.2678118-2-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch implements bitwise tracking (tnum analysis) for BPF_END
(byte swap) operation.
Currently, the BPF verifier does not track value for BPF_END operation,
treating the result as completely unknown. This limits the verifier's
ability to prove safety of programs that perform endianness conversions,
which are common in networking code.
For example, the following code pattern for port number validation:
int test(struct pt_regs *ctx) {
__u64 x = bpf_get_prandom_u32();
x &= 0x3f00; // Range: [0, 0x3f00], var_off: (0x0; 0x3f00)
x = bswap16(x); // Should swap to range [0, 0x3f], var_off: (0x0; 0x3f)
if (x > 0x3f) goto trap;
return 0;
trap:
return *(u64 *)NULL; // Should be unreachable
}
Currently generates verifier output:
1: (54) w0 &= 16128 ; R0=scalar(smin=smin32=0,smax=umax=smax32=umax32=16128,var_off=(0x0; 0x3f00))
2: (d7) r0 = bswap16 r0 ; R0=scalar()
3: (25) if r0 > 0x3f goto pc+2 ; R0=scalar(smin=smin32=0,smax=umax=smax32=umax32=63,var_off=(0x0; 0x3f))
Without this patch, even though the verifier knows `x` has certain bits
set, after bswap16, it loses all tracking information and treats port
as having a completely unknown value [0, 65535].
According to the BPF instruction set[1], there are 3 kinds of BPF_END:
1. `bswap(16|32|64)`: opcode=0xd7 (BPF_END | BPF_ALU64 | BPF_TO_LE)
- do unconditional swap
2. `le(16|32|64)`: opcode=0xd4 (BPF_END | BPF_ALU | BPF_TO_LE)
- on big-endian: do swap
- on little-endian: truncation (16/32-bit) or no-op (64-bit)
3. `be(16|32|64)`: opcode=0xdc (BPF_END | BPF_ALU | BPF_TO_BE)
- on little-endian: do swap
- on big-endian: truncation (16/32-bit) or no-op (64-bit)
Since BPF_END operations are inherently bit-wise permutations, tnum
(bitwise tracking) offers the most efficient and precise mechanism
for value analysis. By implementing `tnum_bswap16`, `tnum_bswap32`,
and `tnum_bswap64`, we can derive exact `var_off` values concisely,
directly reflecting the bit-level changes.
Here is the overview of changes:
1. In `tnum_bswap(16|32|64)` (kernel/bpf/tnum.c):
Call `swab(16|32|64)` function on the value and mask of `var_off`, and
do truncation for 16/32-bit cases.
2. In `adjust_scalar_min_max_vals` (kernel/bpf/verifier.c):
Call helper function `scalar_byte_swap`.
- Only do byte swap when
* alu64 (unconditional swap) OR
* switching between big-endian and little-endian machines.
- If need do byte swap:
* Firstly call `tnum_bswap(16|32|64)` to update `var_off`.
* Then reset the bound since byte swap scrambles the range.
- For 16/32-bit cases, truncate dst register to match the swapped size.
This enables better verification of networking code that frequently uses
byte swaps for protocol processing, reducing false positive rejections.
[1] https://www.kernel.org/doc/Documentation/bpf/standardization/instruction-set.rst
Co-developed-by: Shenghao Yuan <shenghaoyuan0928@163.com>
Signed-off-by: Shenghao Yuan <shenghaoyuan0928@163.com>
Co-developed-by: Yazhou Tang <tangyazhou518@outlook.com>
Signed-off-by: Yazhou Tang <tangyazhou518@outlook.com>
Signed-off-by: Tianci Cao <ziye@zju.edu.cn>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260204111503.77871-2-ziye@zju.edu.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Do not schedule timer/wq operation on a cpu that is in irq_work
callback that is processing async_cmds queue.
Otherwise the following loop is possible:
bpf_timer_start() -> bpf_async_schedule_op() -> irq_work_queue().
irqrestore -> bpf_async_irq_worker() -> tracepoint -> bpf_timer_start().
Fixes: 1bfbc267ec ("bpf: Enable bpf_timer and bpf_wq in any context")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20260204055147.54960-4-alexei.starovoitov@gmail.com
Though hrtimer_start/cancel() inlines all of the smaller helpers in
hrtimer.c and only call timerqueue_add/del() from lib/timerqueue.c where
everything is not traceable and not kprobe-able (because all files in
lib/ are not traceable), there are tracepoints within hrtimer that are
called with locks held. Therefore prevent the deadlock by tightening
conditions when timer/wq can be called synchronously.
hrtimer/wq are using raw_spin_lock_irqsave(), so irqs_disabled() is enough.
Fixes: 1bfbc267ec ("bpf: Enable bpf_timer and bpf_wq in any context")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20260204055147.54960-2-alexei.starovoitov@gmail.com
Merge updates related to runtime PM for 6.20-rc1/7.0-rc1:
- Make several drivers discard pm_runtime_put() return value in
preparation for converting that function to a void one (Rafael
Wysocki)
* pm-runtime:
drm: Discard pm_runtime_put() return value
genirq/chip: Change irq_chip_pm_put() return type to void
scsi: ufs: core: Discard pm_runtime_put() return values
platform/chrome: cros_hps_i2c: Discard pm_runtime_put() return value
coresight: Discard pm_runtime_put() return values
hwspinlock: omap: Discard pm_runtime_put() return value
watchdog: rzv2h_wdt: Discard pm_runtime_put() return value
watchdog: rz: Discard pm_runtime_put() return values
media: ccs: Discard pm_runtime_put() return value
drm/imagination: Discard pm_runtime_put() return value
USB: core: Discard pm_runtime_put() return value
Merge updates related to system suspend and hibernation for
6.20-rc1/7.0-rc1:
- Stop flagging the PM runtime workqueue as freezable to avoid system
suspend and resume deadlocks in subsystems that assume asynchronous
runtime PM to work during system-wide PM transitions (Rafael Wysocki)
- Drop redundant NULL pointer checks before acomp_request_free() from
the hibernation code handling image saving (Rafael Wysocki)
- Update wakeup_sources_walk_start() to handle empty lists of wakeup
sources as appropriate (Samuel Wu)
- Make dev_pm_clear_wake_irq() check the power.wakeirq value under
power.lock to avoid race conditions (Gui-Dong Han)
- Avoid bit field races related to power.work_in_progress in the core
device suspend code (Xuewen Yan)
* pm-sleep:
PM: sleep: core: Avoid bit field races related to work_in_progress
PM: sleep: wakeirq: harden dev_pm_clear_wake_irq() against races
PM: wakeup: Handle empty list in wakeup_sources_walk_start()
PM: hibernate: Drop NULL pointer checks before acomp_request_free()
PM: sleep: Do not flag runtime PM workqueue as freezable
sk->sk_family should be read with READ_ONCE() in
__cgroup_bpf_run_filter_sock_addr() due to IPV6_ADDRFORM.
Also, the comment there is a bit stale since commit 859051dd16
("bpf: Implement cgroup sockaddr hooks for unix sockets"), and the
kdoc has the same comment.
Let's use sk_is_inet() and sk_is_unix() and remove the comment.
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260203213442.682838-2-kuniyu@google.com
During the investigation of the various transition mode issues
instrumentation revealed that the amount of bitmap operations can be
significantly reduced when a task with a transitional CID schedules out
after the fixup function completed and disabled the transition mode.
At that point the mode is stable and therefore it is not required to drop
the transitional CID back into the pool. As the fixup is complete the
potential exhaustion of the CID pool is not longer possible, so the CID can
be transferred to the scheduling out task or to the CPU depending on the
current ownership mode.
The racy snapshot of mm_cid::mode which contains both the ownership state
and the transition bit is valid because runqueue lock is held and the fixup
function of a concurrent mode switch is serialized.
Assigning the ownership right there not only spares the bitmap access for
dropping the CID it also avoids it when the task is scheduled back in as it
directly hits the fast path in both modes when the CID is within the
optimal range. If it's outside the range the next schedule in will need to
converge so dropping it right away is sensible. In the good case this also
allows to go into the fast path on the next schedule in operation.
With a thread pool benchmark which is configured to cross the mode switch
boundaries frequently this reduces the number of bitmap operations by about
30% and increases the fastpath utilization in the low single digit
percentage range.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260201192835.100194627@kernel.org
When a exiting task initiates the switch from per CPU back to per task
mode, it has already dropped its CID and marked itself inactive. But a
leftover from an earlier iteration of the rework then reassigns the per
CPU CID to the exiting task with the transition bit set.
That's wrong as the task is already marked CID inactive, which means it is
inconsistent state. It's harmless because the CID is marked in transit and
therefore dropped back into the pool when the exiting task schedules out
either through preemption or the final schedule().
Simply drop the per CPU CID when the exiting task triggered the transition.
Fixes: fbd0e71dc3 ("sched/mmcid: Provide CID ownership mode fixup functions")
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260201192835.032221009@kernel.org
Shrikanth reported a hard lockup which he observed once. The stack trace
shows the following CID related participants:
watchdog: CPU 23 self-detected hard LOCKUP @ mm_get_cid+0xe8/0x188
NIP: mm_get_cid+0xe8/0x188
LR: mm_get_cid+0x108/0x188
mm_cid_switch_to+0x3c4/0x52c
__schedule+0x47c/0x700
schedule_idle+0x3c/0x64
do_idle+0x160/0x1b0
cpu_startup_entry+0x48/0x50
start_secondary+0x284/0x288
start_secondary_prolog+0x10/0x14
watchdog: CPU 11 self-detected hard LOCKUP @ plpar_hcall_norets_notrace+0x18/0x2c
NIP: plpar_hcall_norets_notrace+0x18/0x2c
LR: queued_spin_lock_slowpath+0xd88/0x15d0
_raw_spin_lock+0x80/0xa0
raw_spin_rq_lock_nested+0x3c/0xf8
mm_cid_fixup_cpus_to_tasks+0xc8/0x28c
sched_mm_cid_exit+0x108/0x22c
do_exit+0xf4/0x5d0
make_task_dead+0x0/0x178
system_call_exception+0x128/0x390
system_call_vectored_common+0x15c/0x2ec
The task on CPU11 is running the CID ownership mode change fixup function
and is stuck on a runqueue lock. The task on CPU23 is trying to get a CID
from the pool with the same runqueue lock held, but the pool is empty.
After decoding a similar issue in the opposite direction switching from per
task to per CPU mode the tool which models the possible scenarios failed to
come up with a similar loop hole.
This showed up only once, was not reproducible and according to tooling not
related to a overlooked scheduling scenario permutation. But the fact that
it was observed on a PowerPC system gave the right hint: PowerPC is a
weakly ordered architecture.
The transition mechanism does:
WRITE_ONCE(mm->mm_cid.transit, MM_CID_TRANSIT);
WRITE_ONCE(mm->mm_cid.percpu, new_mode);
fixup()
WRITE_ONCE(mm->mm_cid.transit, 0);
mm_cid_schedin() does:
if (!READ_ONCE(mm->mm_cid.percpu))
...
cid |= READ_ONCE(mm->mm_cid.transit);
so weakly ordered systems can observe percpu == false and transit == 0 even
if the fixup function has not yet completed. As a consequence the task will
not drop the CID when scheduling out before the fixup is completed, which
means the CID space can be exhausted and the next task scheduling in will
loop in mm_get_cid() and the fixup thread can livelock on the held runqueue
lock as above.
This could obviously be solved by using:
smp_store_release(&mm->mm_cid.percpu, true);
and
smp_load_acquire(&mm->mm_cid.percpu);
but that brings a memory barrier back into the scheduler hotpath, which was
just designed out by the CID rewrite.
That can be completely avoided by combining the per CPU mode and the
transit storage into a single mm_cid::mode member and ordering the stores
against the fixup functions to prevent the CPU from reordering them.
That makes the update of both states atomic and a concurrent read observes
always consistent state.
The price is an additional AND operation in mm_cid_schedin() to evaluate
the per CPU or the per task path, but that's in the noise even on strongly
ordered architectures as the actual load can be significantly more
expensive and the conditional branch evaluation is there anyway.
Fixes: fbd0e71dc3 ("sched/mmcid: Provide CID ownership mode fixup functions")
Closes: https://lore.kernel.org/bdfea828-4585-40e8-8835-247c6a8a76b0@linux.ibm.com
Reported-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260201192834.965217106@kernel.org
Ihor reported a BPF CI failure which turned out to be a live lock in the
MM_CID management. The scenario is:
A test program creates the 5th thread, which means the MM_CID users become
more than the number of CPUs (four in this example), so it switches to per
CPU ownership mode.
At this point each live task of the program has a CID associated. Assume
thread creation order assignment for simplicity.
T0 CID0 runs fork() and creates T4
T1 CID1
T2 CID2
T3 CID3
T4 --- not visible yet
T0 sets mm_cid::percpu = true and transfers its own CID to CPU0 where it
runs on and then starts the fixup which walks through the threads to
transfer the per task CIDs either to the CPU the task is running on or drop
it back into the pool if the task is not on a CPU.
During that T1 - T3 are free to schedule in and out before the fixup caught
up with them. Going through all possible permutations with a python script
revealed a few problematic cases. The most trivial one is:
T1 schedules in on CPU1 and observes percpu == true, so it transfers
its CID to CPU1
T1 is migrated to CPU2 and schedule in observes percpu == true, but
CPU2 does not have a CID associated and T1 transferred its own to
CPU1
So it has to allocate one with CPU2 runqueue lock held, but the
pool is empty, so it keeps looping in mm_get_cid().
Now T0 reaches T1 in the thread walk and tries to lock the corresponding
runqueue lock, which is held causing a full live lock.
There is a similar scenario in the reverse direction of switching from per
CPU to task mode which is way more obvious and got therefore addressed by
an intermediate mode. In this mode the CIDs are marked with MM_CID_TRANSIT,
which means that they are neither owned by the CPU nor by the task. When a
task schedules out with a transit CID it drops the CID back into the pool
making it available for others to use temporarily. Once the task which
initiated the mode switch finished the fixup it clears the transit mode and
the process goes back into per task ownership mode.
Unfortunately this insight was not mapped back to the task to CPU mode
switch as the above described scenario was not considered in the analysis.
Apply the same transit mechanism to the task to CPU mode switch to handle
these problematic cases correctly.
As with the CPU to task transition this results in a potential temporary
contention on the CID bitmap, but that's only for the time it takes to
complete the transition. After that it stays in steady mode which does not
touch the bitmap at all.
Fixes: fbd0e71dc3 ("sched/mmcid: Provide CID ownership mode fixup functions")
Closes: https://lore.kernel.org/2b7463d7-0f58-4e34-9775-6e2115cfb971@linux.dev
Reported-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260201192834.897115238@kernel.org
Extend the verifier to recognize struct bpf_timer as a valid kfunc
argument type. Previously, bpf_timer was only supported in BPF helpers.
This prepares for adding timer-related kfuncs in subsequent patches.
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20260201025403.66625-3-alexei.starovoitov@gmail.com
Refactor bpf_timer and bpf_wq to allow calling them from any context:
- add refcnt to bpf_async_cb
- map_delete_elem or map_free will drop refcnt to zero
via bpf_async_cancel_and_free()
- once refcnt is zero timer/wq_start is not allowed to make sure
that callback cannot rearm itself
- if in_hardirq defer to start/cancel operations to irq_work
Co-developed-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/bpf/20260201025403.66625-2-alexei.starovoitov@gmail.com
Add a kernel config option to set the default value of
workqueue.panic_on_stall, similar to CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC,
CONFIG_BOOTPARAM_HARDLOCKUP_PANIC and CONFIG_BOOTPARAM_HUNG_TASK_PANIC.
This allows setting the number of workqueue stalls before triggering
a kernel panic at build time, which is useful for high-availability
systems that need consistent panic-on-stall, in other words, those
servers which run with CONFIG_BOOTPARAM_*_PANIC=y already.
The default remains 0 (disabled). Setting it to 1 will panic on the
first stall, and higher values will panic after that many stall
warnings. The value can still be overridden at runtime via the
workqueue.panic_on_stall boot parameter or sysfs.
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
The BPF stream kfuncs bpf_stream_vprintk and bpf_stream_print_stack
do not sleep and so are safe to call while holding a lock. Amend
the verifier to allow that.
Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260203180424.14057-4-emil@etsalapatis.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add a new kfunc called bpf_stream_print_stack to be used by programs
that need to print out their current BPF stack. The kfunc is essentially
a wrapper around the existing bpf_stream_dump_stack functionality used
to generate stack traces for error events like may_goto violations and
BPF-side arena page faults.
Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260203180424.14057-2-emil@etsalapatis.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Scalar register IDs are used by the verifier to track relationships
between registers and enable bounds propagation across those
relationships. Once an ID becomes singular (i.e. only a single
register/stack slot carries it), it can no longer contribute to bounds
propagation and effectively becomes stale. The previous commit makes the
verifier clear such ids before caching the state.
When comparing the current and cached states for pruning, these stale
IDs can cause technically equivalent states to be considered different
and thus prevent pruning.
For example, in the selftest added in the next commit, two registers -
r6 and r7 are not linked to any other registers and get cached with
id=0, in the current state, they are both linked to each other with
id=A. Before this commit, check_scalar_ids would give temporary ids to
r6 and r7 (say tid1 and tid2) and then check_ids() would map tid1->A,
and when it would see tid2->A, it would not consider these state
equivalent.
Relax scalar ID equivalence by treating rold->id == 0 as "independent":
if the old state did not rely on any ID relationships for a register,
then any ID/linking present in the current state only adds constraints
and is always safe to accept for pruning. Implement this by returning
true immediately in check_scalar_ids() when old_id == 0.
Maintain correctness for the opposite direction (old_id != 0 && cur_id
== 0) by still allocating a temporary ID for cur_id == 0. This avoids
incorrectly allowing multiple independent current registers (id==0) to
satisfy a single linked old ID during mapping.
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20260203165102.2302462-5-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The maybe_widen_reg() function widens imprecise scalar registers to
unknown when their values differ between the cached and current states.
Previously, it used regs_exact() which also compared register IDs via
check_ids(), requiring registers to have matching IDs (or mapped IDs) to
be considered exact.
For scalar widening purposes, what matters is whether the value tracking
(bounds, tnum, var_off) is the same, not whether the IDs match. Two
scalars with identical value constraints but different IDs represent the
same abstract value and don't need to be widened.
Introduce scalars_exact_for_widen() that only compares the
value-tracking portion of bpf_reg_state (fields before 'id'). This
allows the verifier to preserve more scalar value information during
state merging when IDs differ but actual tracked values are identical,
reducing unnecessary widening and potentially improving verification
precision.
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20260203165102.2302462-4-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The verifier assigns ids to scalar registers/stack slots when they are
linked through a mov or stack spill/fill instruction. These ids are
later used to propagate newly found bounds from one register to all
registers that share the same id. The verifier also compares the ids of
these registers in current state and cached state when making pruning
decisions.
When an ID becomes singular (i.e., only a single register or stack slot
has that ID), it can no longer participate in bounds propagation. During
comparisons between current and cached states for pruning decisions,
however, such stale IDs can prevent pruning of otherwise equivalent
states.
Find and clear all singular ids before caching a state in
is_state_visited(). struct bpf_idset which is currently unused has been
repurposed for this use case.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20260203165102.2302462-3-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The next commit will allow clearing of scalar ids if no other
register/stack slot has that id. This is because if only one register
has a unique id, it can't participate in bounds propagation and is
equivalent to having no id.
But if the id of a stack slot is cleared by clear_singular_ids() in the
next commit, reading that stack slot into a register will not establish
a link because the stack slot's id is cleared.
This can happen in a situation where a register is spilled and later
loses its id due to a multiply operation (for example) and then the
stack slot's id becomes singular and can be cleared.
Make sure that scalar stack slots have an id before we read them into a
register.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20260203165102.2302462-2-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Some platforms require panic handling to execute on a specific CPU for
crash dump to work reliably. This can be due to firmware limitations,
interrupt routing constraints, or platform-specific requirements where
only a single CPU is able to safely enter the crash kernel.
Add the panic_force_cpu= kernel command-line parameter to redirect panic
execution to a designated CPU. When the parameter is provided, the CPU
that initially triggers panic forwards the panic context to the target CPU
via IPI, which then proceeds with the normal panic and kexec flow.
The IPI delivery is implemented as a weak function
(panic_smp_redirect_cpu) so architectures with NMI support can override it
for more reliable delivery.
If the specified CPU is invalid, offline, or a panic is already in
progress on another CPU, the redirection is skipped and panic continues on
the current CPU.
[pnina.feder@mobileye.com: fix unused variable warning]
Link: https://lkml.kernel.org/r/20260126122618.2967950-1-pnina.feder@mobileye.com
Link: https://lkml.kernel.org/r/20260122102457.1154599-1-pnina.feder@mobileye.com
Signed-off-by: Pnina Feder <pnina.feder@mobileye.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The documentation of this new API has been overlooked during its
introduction. Fill the gap.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
It may not appear obvious why kthread_affine_node() is not called before
the kthread creation completion instead of after the first wake-up.
The reason is that kthread_affine_node() applies a default affinity
behaviour that only takes place if no affinity preference have already
been passed by the kthread creation call site.
Add a comment to clarify that.
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
When cpuset isolated partitions get updated, unbound kthreads get
indifferently affine to all non isolated CPUs, regardless of their
individual affinity preferences.
For example kswapd is a per-node kthread that prefers to be affine to
the node it refers to. Whenever an isolated partition is created,
updated or deleted, kswapd's node affinity is going to be broken if any
CPU in the related node is not isolated because kswapd will be affine
globally.
Fix this with letting the consolidated kthread managed affinity code do
the affinity update on behalf of cpuset.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Waiman Long <longman@redhat.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: cgroups@vger.kernel.org
Unbound kthreads want to run neither on nohz_full CPUs nor on domain
isolated CPUs. And since nohz_full implies domain isolation, checking
the latter is enough to verify both.
Therefore exclude kthreads from domain isolation.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Waiman Long <longman@redhat.com>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
The unbound kthreads affinity management performed by cpuset is going to
be imported to the kthread core code for consolidation purposes.
Treat kthreadd just like any other kthread.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Waiman Long <longman@redhat.com>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
The managed affinity list currently contains only unbound kthreads that
have affinity preferences. Unbound kthreads globally affine by default
are outside of the list because their affinity is automatically managed
by the scheduler (through the fallback housekeeping mask) and by cpuset.
However in order to preserve the preferred affinity of kthreads, cpuset
will delegate the isolated partition update propagation to the
housekeeping and kthread code.
Prepare for that with including all unbound kthreads in the managed
affinity list.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Waiman Long <longman@redhat.com>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
The kthreads preferred affinity related fields use "hotplug" as the base
of their naming because the affinity management was initially deemed to
deal with CPU hotplug.
The scope of this role is going to broaden now and also deal with
cpuset isolated partition updates.
Switch the naming accordingly.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Waiman Long <longman@redhat.com>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
The set of cpuset isolated CPUs is now included in HK_TYPE_DOMAIN
housekeeping cpumask. There is no usecase left interested in just
checking what is isolated by cpuset and not by the isolcpus= kernel
boot parameter.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Waiman Long <longman@redhat.com>
Cc: "Michal Koutný" <mkoutny@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: cgroups@vger.kernel.org
Cpuset isolated partitions are now included in HK_TYPE_DOMAIN. Testing
if a CPU is part of an isolated partition alone is now useless.
Remove the superflous test.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Waiman Long <longman@redhat.com>
Until now, cpuset would propagate isolated partition changes to
timer migration so that unbound timers don't get migrated to isolated
CPUs.
Since housekeeping now centralizes, synchronize and propagates isolation
cpumask changes, perform the work from that subsystem for consolidation
and consistency purposes.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Until now, cpuset would propagate isolated partition changes to
workqueues so that unbound workers get properly reaffined.
Since housekeeping now centralizes, synchronize and propagates isolation
cpumask changes, perform the work from that subsystem for consolidation
and consistency purposes.
For simplification purpose, the target function is adapted to take the
new housekeeping mask instead of the isolated mask.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Waiman Long <longman@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: "Michal Koutný" <mkoutny@suse.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: cgroups@vger.kernel.org
The HK_TYPE_DOMAIN housekeeping cpumask is now modifiable at runtime. In
order to synchronize against PCI probe works and make sure that no
asynchronous probing is still pending or executing on a newly isolated
CPU, the housekeeping subsystem must flush the PCI probe works.
However the PCI probe works can't be flushed easily since they are
queued to the main per-CPU workqueue pool.
Solve this with creating a PCI probe-specific pool and provide and use
the appropriate flushing API.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: linux-pci@vger.kernel.org
The HK_TYPE_DOMAIN housekeeping cpumask is now modifiable at runtime.
In order to synchronize against vmstat workqueue to make sure
that no asynchronous vmstat work is still pending or executing on a
newly made isolated CPU, the housekeeping susbsystem must flush the
vmstat workqueues.
This involves flushing the whole mm_percpu_wq workqueue, shared with
LRU drain, introducing here a welcome side effect.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: linux-mm@kvack.org
The HK_TYPE_DOMAIN housekeeping cpumask is now modifiable at runtime. In
order to synchronize against memcg workqueue to make sure that no
asynchronous draining is still pending or executing on a newly made
isolated CPU, the housekeeping susbsystem must flush the memcg
workqueues.
However the memcg workqueues can't be flushed easily since they are
queued to the main per-CPU workqueue pool.
Solve this with creating a memcg specific pool and provide and use the
appropriate flushing API.
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
Until now, HK_TYPE_DOMAIN used to only include boot defined isolated
CPUs passed through isolcpus= boot option. Users interested in also
knowing the runtime defined isolated CPUs through cpuset must use
different APIs: cpuset_cpu_is_isolated(), cpu_is_isolated(), etc...
There are many drawbacks to that approach:
1) Most interested subsystems want to know about all isolated CPUs, not
just those defined on boot time.
2) cpuset_cpu_is_isolated() / cpu_is_isolated() are not synchronized with
concurrent cpuset changes.
3) Further cpuset modifications are not propagated to subsystems
Solve 1) and 2) and centralize all isolated CPUs within the
HK_TYPE_DOMAIN housekeeping cpumask.
Subsystems can rely on RCU to synchronize against concurrent changes.
The propagation mentioned in 3) will be handled in further patches.
[Chen Ridong: Fix cpu_hotplug_lock deadlock and use correct static
branch API]
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Waiman Long <longman@redhat.com>
Reviewed-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Cc: "Michal Koutný" <mkoutny@suse.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: cgroups@vger.kernel.org
HK_TYPE_DOMAIN's cpumask will soon be made modifiable by cpuset.
A synchronization mechanism is then needed to synchronize the updates
with the housekeeping cpumask readers.
Turn the housekeeping cpumasks into RCU pointers. Once a housekeeping
cpumask will be modified, the update side will wait for an RCU grace
period and propagate the change to interested subsystem when deemed
necessary.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
cpuset modifies partitions, including isolated, while holding the cpuset
mutex.
This means that holding the cpuset mutex is safe to synchronize against
housekeeping cpumask changes.
Provide a lockdep check to validate that.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: "Michal Koutný" <mkoutny@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: cgroups@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
cpuset modifies partitions, including isolated, while holding the cpu
hotplug lock read-held.
This means that write-holding the CPU hotplug lock is safe to
synchronize against housekeeping cpumask changes.
Provide a lockdep check to validate that.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Cc: linux-kernel@vger.kernel.org
Testing housekeeping_cpu() will soon require that either the RCU "lock"
is held or the cpuset mutex.
When CPUs get isolated through cpuset, the change is propagated to
timer migration such that isolation is also performed from the migration
tree. However that propagation is done using workqueue which tests if
the target is actually isolated before proceeding.
Lockdep doesn't know that the workqueue caller holds cpuset mutex and
that it waits for the work, making the housekeeping cpumask read safe.
Shut down the future warning by removing this test. It is unecessary
beyond hotplug, the workqueue is already targeted towards isolated CPUs.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Gabriele Monaco <gmonaco@redhat.com>
boot_hk_cpus is an ad-hoc copy of HK_TYPE_DOMAIN_BOOT. Remove it and use
the official version.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michal Koutny <mkoutny@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: cgroups@vger.kernel.org
HK_TYPE_DOMAIN will soon integrate not only boot defined isolcpus= CPUs
but also cpuset isolated partitions.
Housekeeping still needs a way to record what was initially passed
to isolcpus= in order to keep these CPUs isolated after a cpuset
isolated partition is modified or destroyed while containing some of
them.
Create a new HK_TYPE_DOMAIN_BOOT to keep track of those.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
0-day bot flagged the use of strcpy() in blk_trace_setup(), because the
source buffer can theoretically be bigger than the destination buffer.
While none of the current callers pass a string bigger than
BLKTRACE_BDEV_SIZE, use strscpy() to prevent eventual future misuse and
silence the checker warnings.
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202602020718.GUEIRyG9-lkp@intel.com/
Fixes: 113cbd6282 ("blktrace: pass blk_user_trace2 to setup functions")
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Consider the following sequence on a CPU configured with nohz_full:
1) A task P runs in cgroup A, and cgroup A becomes throttled due to CFS
bandwidth control. The gse (cgroup A) where the task P attached is
dequeued and the CPU switches to idle.
2) Before cgroup A is unthrottled, task P is migrated from cgroup A to
another cgroup B (not throttled).
During sched_move_task(), the task P is observed as queued but not
running, and therefore no resched_curr() is triggered.
3) Since the CPU is nohz_full, it remains in do_idle() waiting for an
explicit scheduling event, i.e., resched_curr().
4) For kernel <= 5.10: Later, cgroup A is unthrottled. However, the task
P has already been migrated out of cgroup A, so unthrottle_cfs_rq()
may observe load_weight == 0 and return early without resched_curr()
called. For kernel >= 6.6: The unthrottling path normally triggers
`resched_curr()` almost cases even when no runnable tasks remain in the
unthrottled cgroup, preventing the idle stall described above. However,
if cgroup A is removed before it gets unthrottled, the unthrottling path
for cgroup A is never executed. In a result, no `resched_curr()` can be
called.
5) At this point, the task P is runnable in cgroup B (not throttled), but
the CPU remains in do_idle() with no pending reschedule point. The
system stays in this state until an unrelated event (e.g. a new task
wakeup or any cases) that can trigger a resched_curr() breaks the
nohz_full idle state, and then the task P finally gets scheduled.
The root cause is that sched_move_task() may classify the task as only
queued, not running, and therefore fails to trigger a resched_curr(),
while the later unthrottling path no longer has visibility of the
migrated task.
Preserve the existing behavior for running tasks by issuing
resched_curr(), and explicitly invoke check_preempt_curr() for tasks
that were queued at the time of migration. This ensures that runnable
tasks are reconsidered for scheduling even when nohz_full suppresses
periodic ticks.
Fixes: 29f59db3a7 ("sched: group-scheduler core")
Signed-off-by: Zicheng Qu <quzicheng@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reviewed-by: Aaron Lu <ziqianlu@bytedance.com>
Tested-by: Aaron Lu <ziqianlu@bytedance.com>
Link: https://patch.msgid.link/20260130083438.1122457-1-quzicheng@huawei.com
Use %pe format specifier for printing PTR_ERR() error values
to make error messages more readable.
Found by Coccinelle:
./cpufreq_schedutil.c:685:49-56: WARNING: Consider using %pe to print PTR_ERR()
Signed-off-by: zenghongling <zenghongling@kylinos.cn>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260120083333.148385-1-zenghongling@kylinos.cn
CPU0 becomes overloaded when hosting a CPU-bound RT task, a non-CPU-bound
RT task, and a CFS task stuck in kernel space. When other CPUs switch from
RT to non-RT tasks, RT load balancing (LB) is triggered; with
HAVE_RT_PUSH_IPI enabled, they send IPIs to CPU0 to drive the execution
of rto_push_irq_work_func. During push_rt_task on CPU0,
if next_task->prio < rq->donor->prio, resched_curr() sets NEED_RESCHED
and after the push operation completes, CPU0 calls rto_next_cpu().
Since only CPU0 is overloaded in this scenario, rto_next_cpu() should
ideally return -1 (no further IPI needed).
However, multiple CPUs invoking tell_cpu_to_push() during LB increments
rd->rto_loop_next. Even when rd->rto_cpu is set to -1, the mismatch between
rd->rto_loop and rd->rto_loop_next forces rto_next_cpu() to restart its
search from -1. With CPU0 remaining overloaded (satisfying rt_nr_migratory
&& rt_nr_total > 1), it gets reselected, causing CPU0 to queue irq_work to
itself and send self-IPIs repeatedly. As long as CPU0 stays overloaded and
other CPUs run pull_rt_tasks(), it falls into an infinite self-IPI loop,
which triggers a CPU hardlockup due to continuous self-interrupts.
The trigging scenario is as follows:
cpu0 cpu1 cpu2
pull_rt_task
tell_cpu_to_push
<------------irq_work_queue_on
rto_push_irq_work_func
push_rt_task
resched_curr(rq) pull_rt_task
rto_next_cpu tell_cpu_to_push
<-------------------------- atomic_inc(rto_loop_next)
rd->rto_loop != next
rto_next_cpu
irq_work_queue_on
rto_push_irq_work_func
Fix redundant self-IPI by filtering the initiating CPU in rto_next_cpu().
This solution has been verified to effectively eliminate spurious self-IPIs
and prevent CPU hardlockup scenarios.
Fixes: 4bdced5c9a ("sched/rt: Simplify the IPI based RT balancing logic")
Suggested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Chen Jinghuang <chenjinghuang2@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://patch.msgid.link/20260122012533.673768-1-chenjinghuang2@huawei.com
Read-mostly sched_clock_irqtime may share the same cacheline with
frequently updated nohz struct. Make it as static_key to avoid
false sharing issue.
The only user of disable_sched_clock_irqtime()
is tsc_.*mark_unstable() which may be invoked under atomic context
and require a workqueue to disable static_key. But both of them
calls clear_sched_clock_stable() just before doing
disable_sched_clock_irqtime(). We can reuse
"sched_clock_work" to also disable sched_clock_irqtime().
One additional case need to handle is if the tsc is marked unstable
before late_initcall() phase, sched_clock_work will not be invoked
and sched_clock_irqtime will stay enabled although clock is unstable:
tsc_init()
enable_sched_clock_irqtime() # irqtime accounting is enabled here
...
if (unsynchronized_tsc()) # true
mark_tsc_unstable()
clear_sched_clock_stable()
__sched_clock_stable_early = 0;
...
if (static_key_count(&sched_clock_running.key) == 2)
# Only happens at sched_clock_init_late()
__clear_sched_clock_stable(); # Never executed
...
# late_initcall() phase
sched_clock_init_late()
if (__sched_clock_stable_early) # Already false
__set_sched_clock_stable(); # sched_clock is never marked stable
# TSC unstable, but sched_clock_work won't run to disable irqtime
So we need to disable_sched_clock_irqtime() in sched_clock_init_late()
if clock is unstable.
Reported-by: Benjamin Lei <benjamin.lei@intel.com>
Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Wangyang Guo <wangyang.guo@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Reviewed-by: Tianyou Li <tianyou.li@intel.com>
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://patch.msgid.link/20260127072509.2627346-1-wangyang.guo@intel.com
There are two problems with sched_server_write_common() that can cause the
dl_server to malfunction upon attempting to change the parameters:
1) when, after having disabled the dl_server by setting runtime=0, it is
enabled again while tasks are already enqueued. In this case is_active would
still be 0 and dl_server_start() would not be called.
2) when dl_server_apply_params() would fail, runtime is not applied and does
not reflect the new state.
Instead have dl_server_start() check its actual dl_runtime, and have
sched_server_write_common() unconditionally (re)start the dl_server. It will
automatically stop if there isn't anything to do, so spurious activation is
harmless -- while failing to start it is a problem.
While there, move the printk out of the locked region and make it symmetric,
also printing on enable.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260203103407.GK1282955@noisy.programming.kicks-ass.net
When a sched_ext server is loaded, tasks in the fair class are
automatically moved to the sched_ext class. Add support to modify the
ext server parameters similar to how the fair server parameters are
modified.
Re-use common code between ext and fair servers as needed.
Co-developed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/20260126100050.3854740-6-arighi@nvidia.com
sched_ext currently suffers starvation due to RT. The same workload when
converted to EXT can get zero runtime if RT is 100% running, causing EXT
processes to stall. Fix it by adding a DL server for EXT.
A kselftest is also included later to confirm that both DL servers are
functioning correctly:
# ./runner -t rt_stall
===== START =====
TEST: rt_stall
DESCRIPTION: Verify that RT tasks cannot stall SCHED_EXT tasks
OUTPUT:
TAP version 13
1..1
# Runtime of FAIR task (PID 1511) is 0.250000 seconds
# Runtime of RT task (PID 1512) is 4.750000 seconds
# FAIR task got 5.00% of total runtime
ok 1 PASS: FAIR task got more than 4.00% of runtime
TAP version 13
1..1
# Runtime of EXT task (PID 1514) is 0.250000 seconds
# Runtime of RT task (PID 1515) is 4.750000 seconds
# EXT task got 5.00% of total runtime
ok 2 PASS: EXT task got more than 4.00% of runtime
TAP version 13
1..1
# Runtime of FAIR task (PID 1517) is 0.250000 seconds
# Runtime of RT task (PID 1518) is 4.750000 seconds
# FAIR task got 5.00% of total runtime
ok 3 PASS: FAIR task got more than 4.00% of runtime
TAP version 13
1..1
# Runtime of EXT task (PID 1521) is 0.250000 seconds
# Runtime of RT task (PID 1522) is 4.750000 seconds
# EXT task got 5.00% of total runtime
ok 4 PASS: EXT task got more than 4.00% of runtime
ok 1 rt_stall #
===== END =====
Co-developed-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/20260126100050.3854740-5-arighi@nvidia.com
Currently the DL server interface for applying parameters checks
CFS-internals to identify if the server is active. This is error-prone
and makes it difficult when adding new servers in the future.
Fix it, by using dl_server_active() which is also used by the DL server
code to determine if the DL server was started.
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Tejun Heo <tj@kernel.org>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/20260126100050.3854740-4-arighi@nvidia.com
Updating "ppos" on error conditions does not make much sense. The pattern
is to return the error code directly without modifying the position, or
modify the position on success and return the number of bytes written.
Since on success, the return value of apply is 0, there is no point in
modifying ppos either. Fix it by removing all this and just returning
error code or number of bytes written on success.
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Tejun Heo <tj@kernel.org>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/20260126100050.3854740-3-arighi@nvidia.com
The defer params were not cleared in __dl_clear_params. Clear them.
Without this is some of my test cases are flaking and the DL timer is
not starting correctly AFAICS.
Fixes: a110a81c52 ("sched/deadline: Deferrable dl server")
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/20260126100050.3854740-2-arighi@nvidia.com
Patch series "liveupdate: fixes in error handling".
This series contains some fixes in LUO's error handling paths.
The first patch deals with failed freeze() attempts. The cleanup path
calls unfreeze, and that clears some data needed by later unpreserve
calls.
The second patch is a bit more involved. It deals with failed retrieve()
attempts. To do so properly, it reworks some of the error handling logic
in luo_file core.
Both these fixes are "theoretical" -- in the sense that I have not been
able to reproduce either of them in normal operation. The only supported
file type right now is memfd, and there is nothing userspace can do right
now to make it fail its retrieve or freeze. I need to make the retrieve
or freeze fail by artificially injecting errors. The injected errors
trigger a use-after-free and a double-free.
That said, once more complex file handlers are added or memfd preservation
is used in ways not currently expected or covered by the tests, we will be
able to see them on real systems.
This patch (of 2):
The unfreeze operation is supposed to undo the effects of the freeze
operation. serialized_data is not set by freeze, but by preserve.
Consequently, the unpreserve operation needs to access serialized_data to
undo the effects of the preserve operation. This includes freeing the
serialized data structures for example.
If a freeze callback fails, unfreeze is called for all frozen files. This
would clear serialized_data for them. Since live update has failed, it
can be expected that userspace aborts, releasing all sessions. When the
sessions are released, unpreserve will be called for all files. The
unfrozen files will see 0 in their serialized_data. This is not expected
by file handlers, and they might either fail, leaking data and state, or
might even crash or cause invalid memory access.
Do not clear serialized_data on unfreeze so it gets passed on to
unpreserve. There is no need to clear it on unpreserve since luo_file
will be freed immediately after.
Link: https://lkml.kernel.org/r/20260126230302.2936817-1-pratyush@kernel.org
Link: https://lkml.kernel.org/r/20260126230302.2936817-2-pratyush@kernel.org
Fixes: 7c722a7f44 ("liveupdate: luo_file: implement file systems callbacks")
Signed-off-by: Pratyush Yadav (Google) <pratyush@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Three dmem fixes from Chen Ridong addressing use-after-free, RCU warning,
and NULL pointer dereference issues introduced with the dmem controller.
All changes are confined to kernel/cgroup/dmem.c and can only affect dmem
controller users.
-----BEGIN PGP SIGNATURE-----
iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaYES4Q4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGQ3XAP92niQQAOj9ekLd9O9DnwA0KlkHKT30QO4oPJVR
0z0wuwEA4kY8/jUAWpKjxzmXse9m06MTvzfjuv/4k5IRnZ84cwY=
=ze/P
-----END PGP SIGNATURE-----
Merge tag 'cgroup-for-6.19-rc8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
"Three dmem fixes from Chen Ridong addressing use-after-free, RCU
warning, and NULL pointer dereference issues introduced with the dmem
controller.
All changes are confined to kernel/cgroup/dmem.c and can only affect
dmem controller users"
* tag 'cgroup-for-6.19-rc8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup/dmem: avoid pool UAF
cgroup/dmem: avoid rcu warning when unregister region
cgroup/dmem: fix NULL pointer dereference when setting max
The list_for_each_entry_rcu() in filter_chain() uses
rcu_read_lock_trace_held() as the lockdep condition, but the function
holds consumer_rwsem, not the RCU trace lock.
This gives me the following output when running with some locking debug
option enabled:
kernel/events/uprobes.c:1141 RCU-list traversed in non-reader section!!
filter_chain
register_for_each_vma
uprobe_unregister_nosync
__probe_event_disable
Remove the incorrect lockdep condition since the rwsem provides
sufficient protection for the list traversal.
Fixes: cc01bd044e ("uprobes: travers uprobe's consumer list locklessly under SRCU protection")
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260128-uprobe_rcu-v2-1-994ea6d32730@debian.org
An UAF issue was observed:
BUG: KASAN: slab-use-after-free in page_counter_uncharge+0x65/0x150
Write of size 8 at addr ffff888106715440 by task insmod/527
CPU: 4 UID: 0 PID: 527 Comm: insmod 6.19.0-rc7-next-20260129+ #11
Tainted: [O]=OOT_MODULE
Call Trace:
<TASK>
dump_stack_lvl+0x82/0xd0
kasan_report+0xca/0x100
kasan_check_range+0x39/0x1c0
page_counter_uncharge+0x65/0x150
dmem_cgroup_uncharge+0x1f/0x260
Allocated by task 527:
Freed by task 0:
The buggy address belongs to the object at ffff888106715400
which belongs to the cache kmalloc-512 of size 512
The buggy address is located 64 bytes inside of
freed 512-byte region [ffff888106715400, ffff888106715600)
The buggy address belongs to the physical page:
Memory state around the buggy address:
ffff888106715300: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff888106715380: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff888106715400: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff888106715480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff888106715500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
The issue occurs because a pool can still be held by a caller after its
associated memory region is unregistered. The current implementation frees
the pool even if users still hold references to it (e.g., before uncharge
operations complete).
This patch adds a reference counter to each pool, ensuring that a pool is
only freed when its reference count drops to zero.
Fixes: b168ed458d ("kernel/cgroup: Add "dmem" memory accounting cgroup")
Cc: stable@vger.kernel.org # v6.14+
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
A warnning was detected:
WARNING: suspicious RCU usage
6.19.0-rc7-next-20260129+ #1101 Tainted: G O
kernel/cgroup/dmem.c:456 suspicious rcu_dereference_check() usage!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
1 lock held by insmod/532:
#0: ffffffff85e78b38 (dmemcg_lock){+.+.}-dmem_cgroup_unregister_region+
stack backtrace:
CPU: 2 UID: 0 PID: 532 Comm: insmod Tainted: 6.19.0-rc7-next-
Tainted: [O]=OOT_MODULE
Call Trace:
<TASK>
dump_stack_lvl+0xb0/0xd0
lockdep_rcu_suspicious+0x151/0x1c0
dmem_cgroup_unregister_region+0x1e2/0x380
? __pfx_dmem_test_init+0x10/0x10 [dmem_uaf]
dmem_test_init+0x65/0xff0 [dmem_uaf]
do_one_initcall+0xbb/0x3a0
The macro list_for_each_rcu() must be used within an RCU read-side critical
section (between rcu_read_lock() and rcu_read_unlock()). Using it outside
that context, as seen in dmem_cgroup_unregister_region(), triggers the
lockdep warning because the RCU protection is not guaranteed.
Replace list_for_each_rcu() with list_for_each_entry_safe(), which is
appropriate for traversal under spinlock protection where nodes may be
deleted.
Fixes: b168ed458d ("kernel/cgroup: Add "dmem" memory accounting cgroup")
Cc: stable@vger.kernel.org # v6.14+
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The code local is_leap() helper was tried to be replaced by the RTC
is_leap_year() function. Unfortunately the two aren't exactly equivalent,
as the kunit variant uses a signed value for the year and the RTC an
unsigned one.
Since the KUnit tests cover a 16000 year range around the epoch they use
year values that are very comfortably negative and hence get mishandled
when passed into is_leap_year().
The change was reverted, so add a comment which prevents further attempts
to do so.
[ tglx: Adapted to the revert ]
Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260130-kunit-fix-leap-year-v1-1-92ddf55dffd7@kernel.org
Commit 8f1fc1bf1a ("dma: contiguous: Reserve default CMA heap")
introduced a bug where dma_heap_cma_register_heap() is called with
a NULL pointer when dma_contiguous_reserve_area() fails to reserve
the CMA area.
When dma_contiguous_reserve_area() fails, dma_contiguous_default_area
remains NULL (initialized as a global variable), but the code doesn't
check the return value and proceeds to call dma_heap_cma_register_heap()
with this NULL pointer.
Later during boot, add_cma_heaps() iterates through the dma_areas[]
array and attempts to register heaps. When it encounters the NULL
pointer stored by the earlier call, it crashes in __add_cma_heap()
-> dma_heap_add() when trying to dereference the NULL CMA pointer.
The crash manifests as:
Unable to handle kernel NULL pointer dereference at virtual address
0000000000000038
...
Call trace:
dma_heap_add+0x40/0x2b0
__add_cma_heap+0x80/0xe0
add_cma_heaps+0x64/0xb0
do_one_initcall+0x60/0x318
kernel_init_freeable+0x260/0x2f0
kernel_init+0x2c/0x168
ret_from_fork+0x10/0x20
Fix this by checking the return value of dma_contiguous_reserve_area()
and only calling dma_heap_cma_register_heap() when the reservation
succeeds.
Fixes: 8f1fc1bf1a ("dma: contiguous: Reserve default CMA heap")
Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
Reviewed-by: T.J. Mercier <tjmercier@google.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/20260129181317.2429196-1-sdonthineni@nvidia.com
dl_server to be stuck.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAml/FIkRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1jB7w/9GazyDrZhdmJp7NeJCEW82FGKljBuEBvP
RxN9lrhRuOIKEVuOfJI76Chd5iooSlKdDmaOcxFrkihwJ/AREHekTlYFLtncg5Zu
5JjeYLiuN04kkOClDz7mTgcSzJD7yM0m2mD5CS45lZ35shM64eLr5GtXaFTEs9SZ
E9R4cBiNpz4bVMNW4y3TphSlEx84sIRWoj5Gmzw3CG/52GbIcj1rgwO4b/RI4FYV
yio7V5+ol6oVijlul/UIMYkB6vWGbO8swa/YErhyCvXpHfR1pwsEDFcTjJvO9hCK
2QPwBo34m9mibybvQoyS+/Cybf/yWqeyc1KHqn2+wlO/Ectk97H3eBsbLO2ot/dX
Sv+y+vq3Wm5BczaKtyCeGvk+N5NSp3jhKUO4jH0Iivi+jRJ2yynJ+Dgg0F0ldIMM
0qBfEvaiN8Fbg8LLDevfqtGnctvRiwzKNstaZyUmq1g5NXPOGlerVtHP+osuPb/L
lIK7uSZMZNStJS7R7vcO+g2lAl15UwBmMPEepDdUsBilnKEnefl+j5wilhjLoNM7
G+XMSFLMoWX1/oY9f8TRs6ncZ+ays6tal8bePVyTrdp/RBAXCFlB9npNjqlTkZjT
M6zRqHcIOPZxEJP3ZqsGb0rT41v7gJbuHg2uUSQwyqEWmTB45IqTewi2Omw/duJ/
OOp7g2matLY=
=U5Zw
-----END PGP SIGNATURE-----
Merge tag 'sched-urgent-2026-02-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Ingo Molnar:
"Fix a regression in the deferrable dl_server code that can cause the
dl_server to be stuck"
* tag 'sched-urgent-2026-02-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/deadline: Fix 'stuck' dl_server
A warning was detect:
WARNING: kernel/cgroup/cpuset.c:825 at rebuild_sched_domains_locked
Modules linked in:
CPU: 12 UID: 0 PID: 681 Comm: rmdir 6.19.0-rc6-next-20260121+
RIP: 0010:rebuild_sched_domains_locked+0x309/0x4b0
RSP: 0018:ffffc900019bbd28 EFLAGS: 00000202
RAX: ffff888104413508 RBX: 0000000000000008 RCX: ffff888104413510
RDX: ffff888109b5f400 RSI: 000000000000ffcf RDI: 0000000000000001
RBP: 0000000000000002 R08: ffff888104413508 R09: 0000000000000002
R10: ffff888104413508 R11: 0000000000000001 R12: ffff888104413500
R13: 0000000000000002 R14: ffffc900019bbd78 R15: 0000000000000000
FS: 00007fe274b8d740(0000) GS:ffff8881b6b3c000(0000) knlGS:
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fe274c98b50 CR3: 00000001047a9000 CR4: 00000000000006f0
Call Trace:
<TASK>
update_prstate+0x1c7/0x580
cpuset_css_killed+0x2f/0x50
kill_css+0x32/0x180
cgroup_destroy_locked+0xa7/0x200
cgroup_rmdir+0x28/0x100
kernfs_iop_rmdir+0x4c/0x80
vfs_rmdir+0x12c/0x280
filename_rmdir+0x19e/0x200
__x64_sys_rmdir+0x23/0x40
do_syscall_64+0x6b/0x390
It can be reproduced by steps:
# cd /sys/fs/cgroup/
# mkdir A1
# mkdir B1
# mkdir C1
# echo 1-3 > A1/cpuset.cpus
# echo root > A1/cpuset.cpus.partition
# echo 3-5 > B1/cpuset.cpus
# echo root > B1/cpuset.cpus.partition
# echo 6 > C1/cpuset.cpus
# echo root > C1/cpuset.cpus.partition
# rmdir A1/
# rmdir C1/
Both A1 and B1 were initially configured with CPU 3, which was exclusively
assigned to A1's partition. When A1 was removed, CPU 3 was returned to the
root pool. However, B1 incorrectly regained access to CPU 3 when
update_cpumasks_hier was triggered during C1's removal, which also updated
sibling configurations.
The update_sibling_cpumasks function was called to synchronize siblings'
effective CPUs due to changes in their parent's effective CPUs. However,
parent effective CPU changes should not affect partition-effective CPUs.
To fix this issue, update_cpumasks_hier should only be invoked when the
sibling is not a valid partition in the update_sibling_cpumasks.
Fixes: 2a3602030d ("cgroup/cpuset: Don't invalidate sibling partitions on cpuset.cpus conflict")
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The current cgroup subsystem limit of 16 is insufficient, as the number of
existing subsystems has already reached this limit. When adding a new
subsystem that is not yet in the mainline kernel, building with
`make allmodconfig` requires first bypassing the
`BUILD_BUG_ON(CGROUP_SUBSYS_COUNT > 16)` restriction to allow compilation
to succeed. However, the kernel still fails to boot afterward.
This patch increases the maximum number of supported cgroup subsystems from
16 to 32, providing enough room for future subsystem additions.
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Waiman Long <longman@redhat.com>
Tested-by: JP Kobryn <inwardvessel@gmail.com>
Acked-by: JP Kobryn <inwardvessel@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
kho_reserve_scratch() iterates over all online NUMA nodes to allocate
per-node scratch memory. On systems with memoryless NUMA nodes (nodes
that have CPUs but no memory), memblock_alloc_range_nid() fails because
there is no memory available on that node. This causes KHO initialization
to fail and kho_enable to be set to false.
Some ARM64 systems have NUMA topologies where certain nodes contain only
CPUs without any associated memory. These configurations are valid and
should not prevent KHO from functioning.
Fix this by only counting nodes that have memory (N_MEMORY state) and skip
memoryless nodes in the per-node scratch allocation loop.
Link: https://lkml.kernel.org/r/20260120175913.34368-1-epetron@amazon.de
Fixes: 3dc92c3114 ("kexec: add Kexec HandOver (KHO) generation helpers").
Signed-off-by: Evangelos Petrongonas <epetron@amazon.de>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
crash_load_dm_crypt_keys() reads dm-crypt volume keys from the user
keyring. It uses user_key_payload_locked() without holding key->sem,
which makes lockdep complain when kexec_file_load() assembles the crash
image:
=============================
WARNING: suspicious RCU usage
-----------------------------
./include/keys/user-type.h:53 suspicious rcu_dereference_protected() usage!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
no locks held by kexec/4875.
stack backtrace:
Call Trace:
<TASK>
dump_stack_lvl+0x5d/0x80
lockdep_rcu_suspicious.cold+0x4e/0x96
crash_load_dm_crypt_keys+0x314/0x390
bzImage64_load+0x116/0x9a0
? __lock_acquire+0x464/0x1ba0
__do_sys_kexec_file_load+0x26a/0x4f0
do_syscall_64+0xbd/0x430
entry_SYSCALL_64_after_hwframe+0x77/0x7f
In addition, the key returned by request_key() is never key_put()'d,
leaking a key reference on each load attempt.
Take key->sem while copying the payload and drop the key reference
afterwards.
Link: https://lkml.kernel.org/r/patch.git-2d4d76083a5c.your-ad-here.call-01769426386-ext-2560@work.hours
Fixes: 479e58549b ("crash_dump: store dm crypt keys in kdump reserved memory")
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Coiby Xu <coxu@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* use dedicated labels for error handling instead of checking if a pointer
is not null to decide if it should be unmapped
* drop assignment of values to err that are only used to print a numeric
error code, there are pr_warn()s for each failure already so printing a
numeric error code in the next line does not add anything useful
Link: https://lkml.kernel.org/r/20260122121757.575987-1-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The user.* sysctls implement the ctl_table_root::permissions hook and they
override the file access mode based on the CAP_SYS_RESOURCE capability (at
most rwx if capable, at most r-- if not). The capability is being checked
unconditionally, so if an LSM denies the capability, an audit record may
be logged even when access is in fact granted.
Given the logic in the set_permissions() function in kernel/ucount.c and
the unfortunate way the permission checking is implemented, it doesn't
seem viable to avoid false positive denials by deferring the capability
check. Thus, do the same as in net_ctl_permissions() (net/sysctl_net.c) -
switch from ns_capable() to ns_capable_noaudit(), so that the check never
logs an audit record.
Link: https://lkml.kernel.org/r/20260122140745.239428-1-omosnace@redhat.com
Fixes: dbec28460a ("userns: Add per user namespace sysctls.")
Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
Reviewed-by: Paul Moore <paul@paul-moore.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Alexey Gladkov <legion@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
kexec_load_purgatory() derives image->start by locating e_entry inside an
SHF_EXECINSTR section. If the purgatory object contains multiple
executable sections with overlapping sh_addr, the entrypoint check can
match more than once and trigger a WARN.
Derive the entry section from the purgatory_start symbol when present and
compute image->start from its final placement. Keep the existing e_entry
fallback for purgatories that do not expose the symbol.
WARNING: kernel/kexec_file.c:1009 at kexec_load_purgatory+0x395/0x3c0, CPU#10: kexec/1784
Call Trace:
<TASK>
bzImage64_load+0x133/0xa00
__do_sys_kexec_file_load+0x2b3/0x5c0
do_syscall_64+0x81/0x610
entry_SYSCALL_64_after_hwframe+0x76/0x7e
[me@linux.beauty: move helper to avoid forward declaration, per Baoquan]
Link: https://lkml.kernel.org/r/20260128043511.316860-1-me@linux.beauty
Link: https://lkml.kernel.org/r/20260120124005.148381-1-me@linux.beauty
Fixes: 8652d44f46 ("kexec: support purgatories with .text.hot sections")
Signed-off-by: Li Chen <me@linux.beauty>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Li Chen <me@linux.beauty>
Cc: Philipp Rudo <prudo@redhat.com>
Cc: Ricardo Ribalda Delgado <ribalda@chromium.org>
Cc: Ross Zwisler <zwisler@google.com>
Cc: Sourabh Jain <sourabhjain@linux.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Problem
=======
Commit 658eb5ab91 ("delayacct: add delay max to record delay peak")
introduced the delay max for getdelays, which records abnormal latency
peaks and helps us understand the magnitude of such delays. However, the
peak latency value alone is insufficient for effective root cause
analysis. Without the precise timestamp of when the peak occurred, we
still lack the critical context needed to correlate it with other system
events.
Solution
========
To address this, we need to additionally record a precise timestamp when
the maximum latency occurs. By correlating this timestamp with system
logs and monitoring metrics, we can identify processes with abnormal
resource usage at the same moment, which can help us to pinpoint root
causes.
Use Case
========
bash-4.4# ./getdelays -d -t 227
print delayacct stats ON
TGID 227
CPU count real total virtual total delay total delay average delay max delay min delay max timestamp
46 188000000 192348334 4098012 0.089ms 0.429260ms 0.051205ms 2026-01-15T15:06:58
IO count delay total delay average delay max delay min delay max timestamp
0 0 0.000ms 0.000000ms 0.000000ms N/A
SWAP count delay total delay average delay max delay min delay max timestamp
0 0 0.000ms 0.000000ms 0.000000ms N/A
RECLAIM count delay total delay average delay max delay min delay max timestamp
0 0 0.000ms 0.000000ms 0.000000ms N/A
THRAS HING count delay total delay average delay max delay min delay max timestamp
0 0 0.000ms 0.000000ms 0.000000ms N/A
COMPACT count delay total delay average delay max delay min delay max timestamp
0 0 0.000ms 0.000000ms 0.000000ms N/A
WPCOPY count delay total delay average delay max delay min delay max timestamp
182 19413338 0.107ms 0.547353ms 0.022462ms 2026-01-15T15:05:24
IRQ count delay total delay average delay max delay min delay max timestamp
0 0 0.000ms 0.000000ms 0.000000ms N/A
Link: https://lkml.kernel.org/r/20260119100241520gWubW8-5QfhSf9gjqcc_E@zte.com.cn
Signed-off-by: Wang Yaxin <wang.yaxin@zte.com.cn>
Cc: Fan Yu <fan.yu9@zte.com.cn>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: xu xin <xu.xin16@zte.com.cn>
Cc: Yang Yang <yang.yang29@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The __trace_puts() function takes a string pointer and the size of the
string itself. All users currently simply pass in the strlen() of the
string it is also passing in. There's no reason to pass in the size.
Instead have the __trace_puts() function do the strlen() within the
function itself.
This fixes a header recursion issue where using strlen() in the macro
calling __trace_puts() requires adding #include <linux/string.h> in order
to use strlen(). Removing the use of strlen() from the header fixes the
recursion issue.
Link: https://lore.kernel.org/all/aUN8Hm377C5A0ILX@yury/
Link: https://lkml.kernel.org/r/20260116042510.241009-6-ynorov@nvidia.com
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Yury Norov <ynorov@nvidia.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Aaron Tomlin <atomlin@atomlin.com>
Cc: Andi Shyti <andi.shyti@linux.intel.com>
Cc: Christophe Leroy (CS GROUP) <chleroy@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: Petr Pavlu <petr.pavlu@suse.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When restoring a page (from kho_restore_pages()) or folio (from
kho_restore_folio()), KHO must initialize the struct page. The
initialization differs slightly depending on if a folio is requested or a
set of 0-order pages is requested.
Conceptually, it is quite simple to understand. When restoring 0-order
pages, each page gets a refcount of 1 and that's it. When restoring a
folio, head page gets a refcount of 1 and tail pages get 0.
kho_restore_page() tries to combine the two separate initialization flow
into one piece of code. While it works fine, it is more complicated to
read than it needs to be. Make the code simpler by splitting the two
initalization paths into two separate functions. This improves
readability by clearly showing how each type must be initialized.
Link: https://lkml.kernel.org/r/20260116112217.915803-3-pratyush@kernel.org
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "kho: clean up page initialization logic", v2.
This series simplifies the page initialization logic in
kho_restore_page(). It was originally only a single patch [0], but on
Pasha's suggestion, I added another patch to use unsigned long for
nr_pages.
Technically speaking, the patches aren't related and can be applied
independently, but bundling them together since patch 2 relies on 1 and it
is easier to manage them this way.
This patch (of 2):
With 4k pages, a 32-bit nr_pages can span up to 16 TiB. While it is a
lot, there exist systems with terabytes of RAM. gup is also moving to
using long for nr_pages. Use unsigned long and make KHO future-proof.
Link: https://lkml.kernel.org/r/20260116112217.915803-1-pratyush@kernel.org
Link: https://lkml.kernel.org/r/20260116112217.915803-2-pratyush@kernel.org
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>
Suggested-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The current swap entry allocation/freeing workflow has never had a clear
definition. This makes it hard to debug or add new optimizations.
This commit introduces a proper definition of how swap entries would be
allocated and freed. Now, most operations are folio based, so they will
never exceed one swap cluster, and we now have a cleaner border between
swap and the rest of mm, making it much easier to follow and debug,
especially with new added sanity checks. Also making more optimization
possible.
Swap entry will be mostly freed and free with a folio bound. The folio
lock will be useful for resolving many swap related races.
Now swap allocation (except hibernation) always starts with a folio in the
swap cache, and gets duped/freed protected by the folio lock:
- folio_alloc_swap() - The only allocation entry point now.
Context: The folio must be locked.
This allocates one or a set of continuous swap slots for a folio and
binds them to the folio by adding the folio to the swap cache. The
swap slots' swap count start with zero value.
- folio_dup_swap() - Increase the swap count of one or more entries.
Context: The folio must be locked and in the swap cache. For now, the
caller still has to lock the new swap entry owner (e.g., PTL).
This increases the ref count of swap entries allocated to a folio.
Newly allocated swap slots' count has to be increased by this helper
as the folio got unmapped (and swap entries got installed).
- folio_put_swap() - Decrease the swap count of one or more entries.
Context: The folio must be locked and in the swap cache. For now, the
caller still has to lock the new swap entry owner (e.g., PTL).
This decreases the ref count of swap entries allocated to a folio.
Typically, swapin will decrease the swap count as the folio got
installed back and the swap entry got uninstalled
This won't remove the folio from the swap cache and free the
slot. Lazy freeing of swap cache is helpful for reducing IO.
There is already a folio_free_swap() for immediate cache reclaim.
This part could be further optimized later.
The above locking constraints could be further relaxed when the swap table
is fully implemented. Currently dup still needs the caller to lock the
swap entry container (e.g. PTL), or a concurrent zap may underflow the
swap count.
Some swap users need to interact with swap count without involving folio
(e.g. forking/zapping the page table or mapping truncate without swapin).
In such cases, the caller has to ensure there is no race condition on
whatever owns the swap count and use the below helpers:
- swap_put_entries_direct() - Decrease the swap count directly.
Context: The caller must lock whatever is referencing the slots to
avoid a race.
Typically the page table zapping or shmem mapping truncate will need
to free swap slots directly. If a slot is cached (has a folio bound),
this will also try to release the swap cache.
- swap_dup_entry_direct() - Increase the swap count directly.
Context: The caller must lock whatever is referencing the entries to
avoid race, and the entries must already have a swap count > 1.
Typically, forking will need to copy the page table and hence needs to
increase the swap count of the entries in the table. The page table is
locked while referencing the swap entries, so the entries all have a
swap count > 1 and can't be freed.
Hibernation subsystem is a bit different, so two special wrappers are here:
- swap_alloc_hibernation_slot() - Allocate one entry from one device.
- swap_free_hibernation_slot() - Free one entry allocated by the above
helper.
All hibernation entries are exclusive to the hibernation subsystem and
should not interact with ordinary swap routines.
By separating the workflows, it will be possible to bind folio more
tightly with swap cache and get rid of the SWAP_HAS_CACHE as a temporary
pin.
This commit should not introduce any behavior change
[kasong@tencent.com: fix leak, per Chris Mason. Remove WARN_ON, per Lai Yi]
Link: https://lkml.kernel.org/r/CAMgjq7AUz10uETVm8ozDWcB3XohkOqf0i33KGrAquvEVvfp5cg@mail.gmail.com
[ryncsn@gmail.com: fix KSM copy pages for swapoff, per Chris]
Link: https://lkml.kernel.org/r/aXxkANcET3l2Xu6J@KASONG-MC4
Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-14-8862a265a033@tencent.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Signed-off-by: Kairui Song <ryncsn@gmail.com>
Acked-by: Rafael J. Wysocki (Intel) <rafael@kernel.org>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chris Li <chrisl@kernel.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Cc: Deepanshu Kartikey <kartikey406@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kairui Song <ryncsn@gmail.com>
Cc: Chris Mason <clm@fb.com>
Cc: Chris Mason <clm@meta.com>
Cc: Lai Yi <yi1.lai@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The added fsession does not prevent running on those architectures, that
haven't added fsession support.
For example, try to run fsession tests on arm64:
test_fsession_basic:PASS:fsession_test__open_and_load 0 nsec
test_fsession_basic:PASS:fsession_attach 0 nsec
check_result:FAIL:test_run_opts err unexpected error: -14 (errno 14)
In order to prevent such errors, add bpf_jit_supports_fsession() to guard
those architectures.
Fixes: 2d419c4465 ("bpf: add fsession support")
Acked-by: Puranjay Mohan <puranjay@kernel.org>
Tested-by: Puranjay Mohan <puranjay@kernel.org>
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/20260131144950.16294-2-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Consolidate all logic for verifying special map fields in the single
function check_map_field_pointer().
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Link: https://lore.kernel.org/r/20260130-verif_special_fields-v2-2-2c59e637da7d@meta.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Introduce struct bpf_map_desc to hold bpf_map pointer and map uid. Use
this struct in both bpf_call_arg_meta and bpf_kfunc_call_arg_meta
instead of having different representations:
- bpf_call_arg_meta had separate map_ptr and map_uid fields
- bpf_kfunc_call_arg_meta had an anonymous inline struct
This unifies the map fields layout across both metadata structures,
making the code more consistent and preparing for further refactoring of
map field pointer validation.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Link: https://lore.kernel.org/r/20260130-verif_special_fields-v2-1-2c59e637da7d@meta.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
In order to do a user space stacktrace the current task needs to be a user
task that has executed in user space. It use to be possible to test if a
task is a user task or not by simply checking the task_struct mm field. If
it was non NULL, it was a user task and if not it was a kernel task.
But things have changed over time, and some kernel tasks now have their
own mm field.
An idea was made to instead test PF_KTHREAD and two functions were used to
wrap this check in case it became more complex to test if a task was a
user task or not[1]. But this was rejected and the C code simply checked
the PF_KTHREAD directly.
It was later found that not all kernel threads set PF_KTHREAD. The io-uring
helpers instead set PF_USER_WORKER and this needed to be added as well.
But checking the flags is still not enough. There's a very small window
when a task exits that it frees its mm field and it is set back to NULL.
If perf were to trigger at this moment, the flags test would say its a
user space task but when perf would read the mm field it would crash with
at NULL pointer dereference.
Now there are flags that can be used to test if a task is exiting, but
they are set in areas that perf may still want to profile the user space
task (to see where it exited). The only real test is to check both the
flags and the mm field.
Instead of making this modification in every location, create a new
is_user_task() helper function that does all the tests needed to know if
it is safe to read the user space memory or not.
[1] https://lore.kernel.org/all/20250425204120.639530125@goodmis.org/
Fixes: 90942f9fac ("perf: Use current->flags & PF_KTHREAD|PF_USER_WORKER instead of current->mm == NULL")
Closes: https://lore.kernel.org/all/0d877e6f-41a7-4724-875d-0b0a27b8a545@roeck-us.net/
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260129102821.46484722@gandalf.local.home
Andrea reported the dl_server getting stuck for him. He tracked it
down to a state where dl_server_start() saw dl_defer_running==1, but
the dl_server's job is no longer valid at the time of
dl_server_start().
In the state diagram this corresponds to [4] D->A (or dl_server_stop()
due to no more runnable tasks) followed by [1], which in case of a
lapsed deadline must then be A->B.
Now our A has dl_defer_running==1, while B demands
dl_defer_running==0, therefore it must get cleared when the CBS wakeup
rules demand a replenish.
Fixes: a110a81c52 ("sched/deadline: Deferrable dl server")
Reported-by: Andrea Righi arighi@nvidia.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Tested-by: Andrea Righi arighi@nvidia.com
Link: https://lkml.kernel.org/r/20260123161645.2181752-1-arighi@nvidia.com
Link: https://patch.msgid.link/20260130124100.GC1079264@noisy.programming.kicks-ass.net
- important fix for ARM 32-bit based systems using cma= kernel parameter
(Oreoluwa Babatunde)
- a fix for the corner case of the DMA atomic pool based allocations
(Sai Sree Kartheek Adivi)
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSrngzkoBtlA8uaaJ+Jp1EFxbsSRAUCaXyw5wAKCRCJp1EFxbsS
RBN7AP9rEfEEB3JBOglcZG3TTrLoCKLnw+16uroyKuD95RLWrQD/bWeJnRYcEZB5
ox1peKYBA4SsDB3bCUWFDfW4I0OZFQQ=
=9Nx1
-----END PGP SIGNATURE-----
Merge tag 'dma-mapping-6.19-2026-01-30' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux
Pull dma-mapping fixes from Marek Szyprowski:
- important fix for ARM 32-bit based systems using cma= kernel
parameter (Oreoluwa Babatunde)
- a fix for the corner case of the DMA atomic pool based allocations
(Sai Sree Kartheek Adivi)
* tag 'dma-mapping-6.19-2026-01-30' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
dma/pool: distinguish between missing and exhausted atomic pools
of: reserved_mem: Allow reserved_mem framework detect "cma=" kernel param
There is no point in iterating through individual tick dependency bits when
the tick_stop tracepoint is disabled, which is the common case.
When the trace point is disabled, return immediately based on the atomic
value being zero or non-zero, skipping the per-bit evaluation.
This optimization improves the hot path performance of tick dependency
checks across all contexts (idle and non-idle), not just nohz_full CPUs.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ionut Nechita (Sunlight Linux) <sunlightlinux@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260128074558.15433-3-sunlightlinux@gmail.com
Allowing sleepable programs to use tail calls.
Making sure we can't mix sleepable and non-sleepable bpf programs
in tail call map (BPF_MAP_TYPE_PROG_ARRAY) and allowing it to be
used in sleepable programs.
Sleepable programs can be preempted and sleep which might bring
new source of race conditions, but both direct and indirect tail
calls should not be affected.
Direct tail calls work by patching direct jump to callee into bpf
caller program, so no problem there. We atomically switch from nop
to jump instruction.
Indirect tail call reads the callee from the map and then jumps to
it. The callee bpf program can't disappear (be released) from the
caller, because it is executed under rcu lock (rcu_read_lock_trace).
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/20260130081208.1130204-2-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add a appropriate kerneldoc to trace_event_buffer_reserve() to make it
easier to understand how that function is used.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260130103745.1126e4af@gandalf.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The current use of guard(preempt_notrace)() within __DECLARE_TRACE()
to protect invocation of __DO_TRACE_CALL() means that BPF programs
attached to tracepoints are non-preemptible. This is unhelpful in
real-time systems, whose users apparently wish to use BPF while also
achieving low latencies. (Who knew?)
One option would be to use preemptible RCU, but this introduces
many opportunities for infinite recursion, which many consider to
be counterproductive, especially given the relatively small stacks
provided by the Linux kernel. These opportunities could be shut down
by sufficiently energetic duplication of code, but this sort of thing
is considered impolite in some circles.
Therefore, use the shiny new SRCU-fast API, which provides somewhat faster
readers than those of preemptible RCU, at least on Paul E. McKenney's
laptop, where task_struct access is more expensive than access to per-CPU
variables. And SRCU-fast provides way faster readers than does SRCU,
courtesy of being able to avoid the read-side use of smp_mb(). Also,
it is quite straightforward to create srcu_read_{,un}lock_fast_notrace()
functions.
Link: https://lore.kernel.org/all/20250613152218.1924093-1-bigeasy@linutronix.de/
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Alexei Starovoitov <ast@kernel.org>
Link: https://patch.msgid.link/20260126231256.499701982@kernel.org
Co-developed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
SRCU-fast is designed to be used in NMI handlers, even going so far
as to use atomic operations for architectures supporting NMIs but not
providing NMI-safe per-CPU atomic operations. However, the WARN_ON_ONCE()
in __srcu_check_read_flavor() complains if SRCU-fast is used in an NMI
handler. This commit therefore modifies that WARN_ON_ONCE() to avoid
such complaints.
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Boqun Feng <boqun@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: bpf@vger.kernel.org
Link: https://patch.msgid.link/8232efe8-a7a3-446c-af0b-19f9b523b4f7@paulmck-laptop
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
In order to switch the protection of tracepoint callbacks from
preempt_disable() to srcu_read_lock_fast() the BPF callback from
tracepoints needs to have migration prevention as the BPF programs expect
to stay on the same CPU as they execute. Put together the RCU protection
with migration prevention and use rcu_read_lock_dont_migrate() in
__bpf_trace_run(). This will allow tracepoints callbacks to be
preemptible.
Link: https://lore.kernel.org/all/CAADnVQKvY026HSFGOsavJppm3-Ajm-VsLzY-OeFUe+BaKMRnDg@mail.gmail.com/
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Alexei Starovoitov <ast@kernel.org>
Link: https://patch.msgid.link/20260126231256.335034877@kernel.org
Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
After switching ARM64 to the generic entry code, a syscall_exit_work()
appeared as a profiling hotspot because it is not inlined.
Inlining both syscall_trace_enter() and syscall_exit_work() provides a
performance gain when any of the work items is enabled. With audit enabled
this results in a ~4% performance gain for perf bench basic syscall on
a kunpeng920 system:
| Metric | Baseline | Inlined | Change |
| ---------- | ----------- | ----------- | ------ |
| Total time | 2.353 [sec] | 2.264 [sec] | ↓3.8% |
| usecs/op | 0.235374 | 0.226472 | ↓3.8% |
| ops/sec | 4,248,588 | 4,415,554 | ↑3.9% |
Small gains can be observed on x86 as well, though the generated code
optimizes for the work case, which is counterproductive for high
performance scenarios where such entry/exit work is usually avoided.
Avoid this by marking the work check in syscall_enter_from_user_mode_work()
unlikely, which is what the corresponding check in the exit path does
already.
[ tglx: Massage changelog and add the unlikely() ]
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260128031934.3906955-14-ruanjinjie@huawei.com
ARM64 requires a architecture specific ptrace wrapper as it needs to save
and restore scratch registers.
Provide arch_ptrace_report_syscall_entry/exit() wrappers which fall back to
ptrace_report_syscall_entry/exit() if the architecture does not provide
them.
No functional change intended.
[ tglx: Massaged changelog and comments ]
Suggested-by: Mark Rutland <mark.rutland@arm.com>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com>
Link: https://patch.msgid.link/20260128031934.3906955-11-ruanjinjie@huawei.com
The 'syscall' argument of syscall_trace_enter() is immediately overwritten
before any real use and serves only as a local variable, so drop the
parameter.
No functional change intended.
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260128031934.3906955-2-ruanjinjie@huawei.com
Instead of using generic workqueue, use a dedicated kthread for optimizing
kprobes, because it can wait (sleep) for a long time inside the process
by synchronize_rcu_task(). This means other works can be stopped until it
finishes.
Link: https://lore.kernel.org/all/176970170302.114949.5175231591310436910.stgit@devnote2/
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
The interrupts which are handled by the redirection infrastructure provide
a irq_set_affinity() callback, which solely determines the target CPU for
redirection via irq_work and und updates the effective affinity mask.
Contrary to regular MSI interrupts this affinity setting does not change
the underlying interrupt message as the message is only created at setup
time to deliver to the demultiplexing interrupt.
Therefore the message write in msi_domain_set_affinity() is a pointless
exercise. In principle the write is harmless, but a Tegra system exposes a
full system hang during suspend due to that write.
It's unclear why the check for the PCI device state PCI_D0 in
pci_msi_domain_write_msg(), which prevents the actual hardware access if
a device is in powered down state, fails on this particular system, but
that's a different problem which needs to be investigated by the Tegra
experts.
The irq_set_affinity() callback can advise msi_domain_set_affinity() not to
write the MSI message by returning IRQ_SET_MASK_OK_DONE instead of
IRQ_SET_MASK_OK. Do exactly that.
Just to make it clear again:
This is not a correctness issue of the redirection code as returning
IRQ_SET_MASK_OK in that context is completely correct. From the core
code point of view this is solely a optimization to avoid an redundant
hardware write.
As a byproduct it papers over the underlying problem on the Tegra platform,
which fails to put the PCIe device[s] out of PCI_D0 despite the fact that
the devices and busses have been shut down. The redirect infrastructure
just unearthed the underlying issue, which is prone to happen in quite some
other code paths which use the PCI_D0 check to prevent hardware access to
powered down devices.
This therefore has neither a 'Fixes:' nor a 'Closes:' tag associated as the
underlying problem, which is outside the scope of the interrupt code, is
still unresolved.
Reported-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: Jon Hunter <jonathanh@nvidia.com>
Link: https://lore.kernel.org/all/4e5b349c-6599-4871-9e3b-e10352ae0ca0@nvidia.com
Link: https://patch.msgid.link/87tsw6aglz.ffs@tglx
- There's a 3 patch series from Pratyush Yadav which fixes a few things
in the new-in-6.19 LUO memfd code.
- Plus the usual shower of singletons - please see the changelogs for
details.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaXub9wAKCRDdBJ7gKXxA
jjoHAP48ag3lmJnpej977MxrA4VUBhv6ATmbFE2+czCzbRJaigEAiIMWtUlSVYbH
WuEYQvcFeSR0hYjs0ClKUiZYOUcj+wI=
=lHEL
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2026-01-29-09-41' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"16 hotfixes. 9 are cc:stable, 12 are for MM.
There's a patch series from Pratyush Yadav which fixes a few things in
the new-in-6.19 LUO memfd code.
Plus the usual shower of singletons - please see the changelogs for
details"
* tag 'mm-hotfixes-stable-2026-01-29-09-41' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
vmcoreinfo: make hwerr_data visible for debugging
mm/zone_device: reinitialize large zone device private folios
mm/mm_init: don't cond_resched() in deferred_init_memmap_chunk() if called from deferred_grow_zone()
mm/kfence: randomize the freelist on initialization
kho: kho_preserve_vmalloc(): don't return 0 when ENOMEM
kho: init alloc tags when restoring pages from reserved memory
mm: memfd_luo: restore and free memfd_luo_ser on failure
mm: memfd_luo: use memfd_alloc_file() instead of shmem_file_setup()
memfd: export alloc_file()
flex_proportions: make fprop_new_period() hardirq safe
mailmap: add entry for Viacheslav Bocharov
mm/memory-failure: teach kill_accessing_process to accept hugetlb tail page pfn
mm/memory-failure: fix missing ->mf_stats count in hugetlb poison
mm, swap: restore swap_space attr aviod kernel panic
mm/kasan: fix KASAN poisoning in vrealloc()
mm/shmem, swap: fix race of truncate and swap entry split
Three architectures (x86, aarch64, riscv) have support for indirect
branch tracking feature in a very similar fashion. On a very high
level, indirect branch tracking is a CPU feature where CPU tracks
branches which use a memory operand to transfer control. As part of
this tracking, during an indirect branch, the CPU expects a landing
pad instruction on the target PC, and if not found, the CPU raises
some fault (architecture-dependent).
x86 landing pad instr - 'ENDBRANCH'
arch64 landing pad instr - 'BTI'
riscv landing instr - 'lpad'
Given that three major architectures have support for indirect branch
tracking, this patch creates architecture-agnostic 'prctls' to allow
userspace to control this feature. They are:
- PR_GET_INDIR_BR_LP_STATUS: Get the current configured status for indirect
branch tracking.
- PR_SET_INDIR_BR_LP_STATUS: Set the configuration for indirect branch
tracking.
The following status options are allowed:
- PR_INDIR_BR_LP_ENABLE: Enables indirect branch tracking on user
thread.
- PR_INDIR_BR_LP_DISABLE: Disables indirect branch tracking on user
thread.
- PR_LOCK_INDIR_BR_LP_STATUS: Locks configured status for indirect branch
tracking for user thread.
Reviewed-by: Mark Brown <broonie@kernel.org>
Reviewed-by: Zong Li <zong.li@sifive.com>
Signed-off-by: Deepak Gupta <debug@rivosinc.com>
Tested-by: Andreas Korb <andreas.korb@aisec.fraunhofer.de> # QEMU, custom CVA6
Tested-by: Valentin Haudiquet <valentin.haudiquet@canonical.com>
Link: https://patch.msgid.link/20251112-v5_user_cfi_series-v23-13-b55691eacf4f@rivosinc.com
[pjw@kernel.org: cleaned up patch description, code comments]
Signed-off-by: Paul Walmsley <pjw@kernel.org>
Currently, dma_alloc_from_pool() unconditionally warns and dumps a stack
trace when an allocation fails, with the message "Failed to get suitable
pool".
This conflates two distinct failure modes:
1. Configuration error: No atomic pool is available for the requested
DMA mask (a fundamental system setup issue)
2. Resource Exhaustion: A suitable pool exists but is currently full (a
recoverable runtime state)
This lack of distinction prevents drivers from using __GFP_NOWARN to
suppress error messages during temporary pressure spikes, such as when
awaiting synchronous reclaim of descriptors.
Refactor the error handling to distinguish these cases:
- If no suitable pool is found, keep the unconditional WARN regarding
the missing pool.
- If a pool was found but is exhausted, respect __GFP_NOWARN and update
the warning message to explicitly state "DMA pool exhausted".
Fixes: 9420139f51 ("dma-pool: fix coherent pool allocations for IOMMU mappings")
Signed-off-by: Sai Sree Kartheek Adivi <s-adivi@ti.com>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/20260128133554.3056582-1-s-adivi@ti.com
The BPF verifier assumes `insn_aux->nospec_result` is only set for
direct memory writes (e.g., `*(u32*)(r1+off) = r2`). However, the
assertion fails to account for helper calls (e.g.,
`bpf_skb_load_bytes_relative`) that perform writes to stack memory. Make
the check more precise to resolve this.
The problem is that `BPF_CALL` instructions have `BPF_CLASS(insn->code)
== BPF_JMP`, which triggers the warning check:
- Helpers like `bpf_skb_load_bytes_relative` write to stack memory
- `check_helper_call()` loops through `meta.access_size`, calling
`check_mem_access(..., BPF_WRITE)`
- `check_stack_write()` sets `insn_aux->nospec_result = 1`
- Since `BPF_CALL` is encoded as `BPF_JMP | BPF_CALL`, the warning fires
Execution flow:
```
1. Drop capabilities → Enable Spectre mitigation
2. Load BPF program
└─> do_check()
├─> check_cond_jmp_op() → Marks dead branch as speculative
│ └─> push_stack(..., speculative=true)
├─> pop_stack() → state->speculative = 1
├─> check_helper_call() → Processes helper in dead branch
│ └─> check_mem_access(..., BPF_WRITE)
│ └─> insn_aux->nospec_result = 1
└─> Checks: state->speculative && insn_aux->nospec_result
└─> BPF_CLASS(insn->code) == BPF_JMP → WARNING
```
To fix the assert, it would be nice to be able to reuse
bpf_insn_successors() here, but bpf_insn_successors()->cnt is not
exactly what we want as it may also be 1 for BPF_JA. Instead, we could
check opcode_info.can_jump, but then we would have to share the table
between the functions. This would mean moving the table out of the
function and adding bpf_opcode_info(). As the verifier_bug_if() only
runs for insns with nospec_result set, the impact on verification time
would likely still be negligible. However, I assume sharing
bpf_opcode_info() between liveness.c and verifier.c will not be worth
it. It seems as only adjust_jmp_off() could also be simplified using it,
and there imm/off is touched. Thus it is maybe better to rely on exact
opcode/class matching there.
Therefore, to avoid this sharing only for a verifier_bug_if(), just
check the opcode. This should now cover all opcodes for which can_jump
in bpf_insn_successors() is true.
Parts of the description and example are taken from the bug report.
Fixes: dadb59104c ("bpf: Fix aux usage after do_check_insn()")
Signed-off-by: Luis Gerhorst <luis.gerhorst@fau.de>
Reported-by: Yinhao Hu <dddddd@hust.edu.cn>
Reported-by: Kaiyan Mei <M202472210@hust.edu.cn>
Reported-by: Dongliang Mu <dzm91@hust.edu.cn>
Closes: https://lore.kernel.org/bpf/7678017d-b760-4053-a2d8-a6879b0dbeeb@hust.edu.cn/
Link: https://lore.kernel.org/r/20260127115912.3026761-2-luis.gerhorst@fau.de
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The macros ENABLE_EVENT_STR and DISABLE_EVENT_STR were added to trace.h so
that more than one file can have access to them, but was never removed
from their original location. Remove the duplicates.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Link: https://patch.msgid.link/20260126130037.4ba201f9@gandalf.local.home
Fixes: d0bad49bb0 ("tracing: Add enable_hist/disable_hist triggers")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Recording stacktraces is very useful, but the size of 16 deep is very
restrictive. For example, in seeing where tasks schedule out in a non
running state, the following can be used:
~# cd /sys/kernel/tracing
~# echo 'hist:keys=common_stacktrace:vals=hitcount if prev_state & 3' > events/sched/sched_switch/trigger
~# cat events/sched/sched_switch/hist
[..]
{ common_stacktrace:
__schedule+0xdc0/0x1860
schedule+0x27/0xd0
schedule_timeout+0xb5/0x100
wait_for_completion+0x8a/0x140
xfs_buf_iowait+0x20/0xd0 [xfs]
xfs_buf_read_map+0x103/0x250 [xfs]
xfs_trans_read_buf_map+0x161/0x310 [xfs]
xfs_btree_read_buf_block+0xa0/0x120 [xfs]
xfs_btree_lookup_get_block+0xa3/0x1e0 [xfs]
xfs_btree_lookup+0xea/0x530 [xfs]
xfs_alloc_fixup_trees+0x72/0x570 [xfs]
xfs_alloc_ag_vextent_size+0x67f/0x800 [xfs]
xfs_alloc_vextent_iterate_ags.constprop.0+0x52/0x230 [xfs]
xfs_alloc_vextent_start_ag+0x9d/0x1b0 [xfs]
xfs_bmap_btalloc+0x2af/0x680 [xfs]
xfs_bmapi_allocate+0xdb/0x2c0 [xfs]
} hitcount: 1
[..]
The above stops at 16 functions where knowing more would be useful. As the
allocated storage for stacks is the same for strings, and that size is 256
bytes, there is a lot of space not being used for stacktraces.
16 * 8 = 128
Up the size to 31 (it requires the last slot to be zero, so it can't be 32).
Also change the BUILD_BUG_ON() to allow the size of the stacktrace storage
to be equal to the max size. One slot is used to hold the number of
elements in the stack.
BUILD_BUG_ON((HIST_STACKTRACE_DEPTH + 1) * sizeof(long) >= STR_VAR_LEN_MAX);
Change that from ">=" to just ">", as now they are equal.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260123105415.2be26bf4@gandalf.local.home
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
When debugging the synthetic events, being able to function trace its
functions is very useful (now that CONFIG_FUNCTION_SELF_TRACING is
available). For some reason trace_event_raw_event_synth() was marked as
"notrace", which was totally unnecessary as all of the tracing directory
had function tracing disabled until the recent FUNCTION_SELF_TRACING was
added.
Remove the notrace annotation from trace_event_raw_event_synth() as
there's no reason to not trace it when tracing synthetic event functions.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260122204526.068a98c9@gandalf.local.home
Acked-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
When CONFIG_HIST_TRIGGERS_DEBUG is enabled, each trace event has a
"hist_debug" file that explains the histogram internal data. This is very
useful for debugging histograms.
One bit of data that was missing from this file was what function a
histogram field uses to process its data. The hist_field structure now has
a fn_num that is used by a switch statement in hist_fn_call() to call a
function directly (to avoid spectre mitigations).
Instead of displaying that number, create a string array that maps to the
histogram function enums so that the function for a field may be
displayed:
~# cat /sys/kernel/tracing/events/sched/sched_switch/hist_debug
[..]
hist_data: 0000000043d62762
n_vals: 2
n_keys: 1
n_fields: 3
val fields:
hist_data->fields[0]:
flags:
VAL: HIST_FIELD_FL_HITCOUNT
type: u64
size: 8
is_signed: 0
function: hist_field_counter()
hist_data->fields[1]:
flags:
HIST_FIELD_FL_VAR
var.name: __arg_3921_2
var.idx (into tracing_map_elt.vars[]): 0
type: unsigned long[]
size: 128
is_signed: 0
function: hist_field_nop()
key fields:
hist_data->fields[2]:
flags:
HIST_FIELD_FL_KEY
ftrace_event_field name: prev_pid
type: pid_t
size: 8
is_signed: 1
function: hist_field_s32()
The "function:" field above is added.
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260122203822.58df4d80@gandalf.local.home
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Tested-by: Tom Zanussi <zanussi@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
In enable_boot_kprobe_events(), it returns directly when trace kprobes is
empty, thereby reducing the function's execution time. This function may
otherwise wait for the event_mutex lock for tens of milliseconds on certain
machines, which is unnecessary when trace kprobes is empty.
Link: https://lore.kernel.org/all/20260127053848.108473-1-sunliming@linux.dev/
Signed-off-by: sunliming <sunliming@kylinos.cn>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
When initializing the default cma region, the "cma=" kernel parameter
takes priority over a DT defined linux,cma-default region. Hence, give
the reserved_mem framework the ability to detect this so that the DT
defined cma region can skip initialization accordingly.
Signed-off-by: Oreoluwa Babatunde <oreoluwa.babatunde@oss.qualcomm.com>
Tested-by: Joy Zou <joy.zou@nxp.com>
Acked-by: Rob Herring (Arm) <robh@kernel.org>
Fixes: 8a6e02d0c0 ("of: reserved_mem: Restructure how the reserved memory regions are processed")
Fixes: 2c223f7239 ("of: reserved_mem: Restructure call site for dma_contiguous_early_fixup()")
Link: https://lore.kernel.org/r/20251210002027.1171519-1-oreoluwa.babatunde@oss.qualcomm.com
[mszyprow: rebased onto v6.19-rc1, added fixes tags, added a stub for
cma_skip_dt_default_reserved_mem() if no CONFIG_DMA_CMA is set]
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
cpufreq calls get_cpu_idle_time_us() just to know if idle cputime
accounting has a nanoseconds granularity.
Use the appropriate indicator instead to make that deduction.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://patch.msgid.link/aXozx0PXutnm8ECX@localhost.localdomain
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Since acomp_request_free() checks its argument against NULL, the NULL
pointer checks before calling it added by commit ("7966cf0ebe32 PM:
hibernate: Fix crash when freeing invalid crypto compressor") are
redundant, so drop them.
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://patch.msgid.link/6233709.lOV4Wx5bFT@rafael.j.wysocki
Convert lock initialization to scoped guarded initialization where
lock-guarded members are initialized in the same scope.
This ensures the context analysis treats the context as active during
member initialization. This is required to avoid errors once implicit
context assertion is removed.
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260119094029.1344361-4-elver@google.com
Using single ftrace_ops for direct calls update instead of allocating
ftrace_ops object for each trampoline.
With single ftrace_ops object we can use update_ftrace_direct_* api
that allows multiple ip sites updates on single ftrace_ops object.
Adding HAVE_SINGLE_FTRACE_DIRECT_OPS config option to be enabled on
each arch that supports this.
At the moment we can enable this only on x86 arch, because arm relies
on ftrace_ops object representing just single trampoline image (stored
in ftrace_ops::direct_call). Archs that do not support this will continue
to use *_ftrace_direct api.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/bpf/20251230145010.103439-10-jolsa@kernel.org
We are going to remove "ftrace_ops->private == bpf_trampoline" setup
in following changes.
Adding ip argument to ftrace_ops_func_t callback function, so we can
use it to look up the trampoline.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/bpf/20251230145010.103439-9-jolsa@kernel.org
Adding update_ftrace_direct_mod function that modifies all entries
(ip -> direct) provided in hash argument to direct ftrace ops and
updates its attachments.
The difference to current modify_ftrace_direct is:
- hash argument that allows to modify multiple ip -> direct
entries at once
This change will allow us to have simple ftrace_ops for all bpf
direct interface users in following changes.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/bpf/20251230145010.103439-7-jolsa@kernel.org
Adding update_ftrace_direct_del function that removes all entries
(ip -> addr) provided in hash argument to direct ftrace ops and
updates its attachments.
The difference to current unregister_ftrace_direct is
- hash argument that allows to unregister multiple ip -> direct
entries at once
- we can call update_ftrace_direct_del multiple times on the
same ftrace_ops object, becase we do not need to unregister
all entries at once, we can do it gradualy with the help of
ftrace_update_ops function
This change will allow us to have simple ftrace_ops for all bpf
direct interface users in following changes.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/bpf/20251230145010.103439-6-jolsa@kernel.org
Adding update_ftrace_direct_add function that adds all entries
(ip -> addr) provided in hash argument to direct ftrace ops
and updates its attachments.
The difference to current register_ftrace_direct is
- hash argument that allows to register multiple ip -> direct
entries at once
- we can call update_ftrace_direct_add multiple times on the
same ftrace_ops object, becase after first registration with
register_ftrace_function_nolock, it uses ftrace_update_ops to
update the ftrace_ops object
This change will allow us to have simple ftrace_ops for all bpf
direct interface users in following changes.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/bpf/20251230145010.103439-5-jolsa@kernel.org
We are going to use these functions in following changes.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/bpf/20251230145010.103439-4-jolsa@kernel.org
Make alloc_and_copy_ftrace_hash to copy also direct address
for each hash entry.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/bpf/20251230145010.103439-3-jolsa@kernel.org
At the moment the we allow the jmp attach only for ftrace_ops that
has FTRACE_OPS_FL_JMP set. This conflicts with following changes
where we use single ftrace_ops object for all direct call sites,
so all could be be attached via just call or jmp.
We already limit the jmp attach support with config option and bit
(LSB) set on the trampoline address. It turns out that's actually
enough to limit the jmp attach for architecture and only for chosen
addresses (with LSB bit set).
Each user of register_ftrace_direct or modify_ftrace_direct can set
the trampoline bit (LSB) to indicate it has to be attached by jmp.
The bpf trampoline generation code uses trampoline flags to generate
jmp-attach specific code and ftrace inner code uses the trampoline
bit (LSB) to handle return from jmp attachment, so there's no harm
to remove the FTRACE_OPS_FL_JMP bit.
The fexit/fmodret performance stays the same (did not drop),
current code:
fentry : 77.904 ± 0.546M/s
fexit : 62.430 ± 0.554M/s
fmodret : 66.503 ± 0.902M/s
with this change:
fentry : 80.472 ± 0.061M/s
fexit : 63.995 ± 0.127M/s
fmodret : 67.362 ± 0.175M/s
Fixes: 25e4e3565d ("ftrace: Introduce FTRACE_OPS_FL_JMP")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/bpf/20251230145010.103439-2-jolsa@kernel.org
This commit fixes a security issue where BPF_PROG_DETACH on tcx or
netkit devices could be executed by any user when no program fd was
provided, bypassing permission checks. The fix adds a capability
check for CAP_NET_ADMIN or CAP_SYS_ADMIN in this case.
Fixes: e420bed025 ("bpf: Add fd-based tcx multi-prog infra with link support")
Signed-off-by: Guillaume Gonnet <ggonnet.linux@gmail.com>
Link: https://lore.kernel.org/r/20260127160200.10395-1-ggonnet.linux@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
While debugging a PCI resource allocation issue, the resources for many
nested bridges and endpoints got flattened in /proc/iomem by
MAX_IORES_LEVEL that is set to 5. This made the iomem output hard to
read as the visual hierarchy cues were lost.
Increase MAX_IORES_LEVEL to 8 to avoid flattening PCI topologies with
nested bridges so aggressively (the case in the Link has the deepest
resource at level 7 so 8 looks a reasonable limit).
Link: https://bugzilla.kernel.org/show_bug.cgi?id=220775
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Link: https://patch.msgid.link/20251219174036.16738-5-ilpo.jarvinen@linux.intel.com
Currently, the BPF cgroup iterator supports walking descendants in
either pre-order (BPF_CGROUP_ITER_DESCENDANTS_PRE) or post-order
(BPF_CGROUP_ITER_DESCENDANTS_POST). These modes perform an exhaustive
depth-first search (DFS) of the hierarchy. In scenarios where a BPF
program may need to inspect only the direct children of a given parent
cgroup, a full DFS is unnecessarily expensive.
This patch introduces a new BPF cgroup iterator control option,
BPF_CGROUP_ITER_CHILDREN. This control option restricts the traversal
to the immediate children of a specified parent cgroup, allowing for
more targeted and efficient iteration, particularly when exhaustive
depth-first search (DFS) traversal is not required.
Signed-off-by: Matt Bobrowski <mattbobrowski@google.com>
Link: https://lore.kernel.org/r/20260127085112.3608687-1-mattbobrowski@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add GPL-2.0 license id to some files related to kdb and kgdb,
replacing references to GPL or COPYING.
These files were introduced into the kernel in 2008 and 2010.
Signed-off-by: Tim Bird <tim.bird@sony.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The GICv5 driver IRQ domain hierarchy requires adding a parent field to
struct irqchip_fwid so that core code can reference a fwnode_handle parent
for a given fwnode.
Add a parent field to struct irqchip_fwid and update the related kernel API
functions to initialize and handle it.
Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Acked-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260115-gicv5-host-acpi-v3-1-c13a9a150388@kernel.org
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Now that nodes_and() returns true if the result nodemask is not empty,
drop useless nodes_intersects() in guarantee_online_mems() and
nodes_empty() in update_nodemasks_hier(), which both are O(N).
Link: https://lkml.kernel.org/r/20260114172217.861204-4-ynorov@nvidia.com
Signed-off-by: Yury Norov <ynorov@nvidia.com>
Reviewed-by: Gregory Price <gourry@gourry.net>
Reviewed-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mathew Brost <matthew.brost@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Yury Norov (NVIDIA) <yury.norov@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When scratch area fails to reserve, KHO prints a message indicating that.
But it doesn't say which scratch failed to allocate. This can be useful
information for debugging. Even more so when the failure is hard to
reproduce.
Along with the current message, also print which exact scratch area failed
to be reserved.
Link: https://lkml.kernel.org/r/20260116165416.1262531-1-pratyush@kernel.org
Signed-off-by: Pratyush Yadav (Google) <pratyush@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: David Matlack <dmatlack@google.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Pratyush Yadav <pratyush@kernel.org>
Cc: Samiullah Khawaja <skhawaja@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "Align atomic storage", v7.
This series adds the __aligned attribute to atomic_t and atomic64_t
definitions in include/linux and include/asm-generic (respectively) to get
natural alignment of both types on csky, m68k, microblaze, nios2, openrisc
and sh.
This series also adds Kconfig options to enable a new run-time warning to
help reveal misaligned atomic accesses on platforms which don't trap that.
The performance impact is expected to vary across platforms and workloads.
The measurements I made on m68k show that some workloads run faster and
others slower.
This patch (of 4):
Align bpf_res_spin_lock to avoid a BUILD_BUG_ON() when the alignment
changes, as it will do on m68k when, in a subsequent patch, the minimum
alignment of the atomic_t member of struct rqspinlock gets increased from
2 to 4. Drop the BUILD_BUG_ON() as it becomes redundant.
Link: https://lkml.kernel.org/r/cover.1768281748.git.fthain@linux-m68k.org
Link: https://lkml.kernel.org/r/8a83876b07d1feacc024521e44059ae89abbb1ea.1768281748.git.fthain@linux-m68k.org
Signed-off-by: Finn Thain <fthain@linux-m68k.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Borislav Petkov (AMD)" <bp@alien8.de>
Cc: Daniel Borkman <daniel@iogearbox.net>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Eduard Zingerman <eddyz87@gmail.com>
Cc: Gary Guo <gary@garyguo.net>
Cc: Guo Ren <guoren@kernel.org>
Cc: Hao Luo <haoluo@google.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Marc Rutland <mark.rutland@arm.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Sasha Levin (Microsoft) <sashal@kernel.org>
Cc: Song Liu <song@kernel.org>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Stanislav Fomichev <sdf@fomichev.me>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yonghong Song <yonghong.song@linux.dev>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This patch is a preparation step for HPCC, for the OOM killer
improvements. I suspect that this patch is useful on its own, because it
really makes no sense to sum up accounting statistics of use_mm within
kernel threads which are only temporarily using those mm.
When we hit acct_account_cputime within a irq handler over a kthread that
happens to use a userspace mm, we end up summing up the mm's RSS into the
tsk acct_rss_mem1, which eventually decays.
I don't see a good rationale behind tracking the mm's rss in that way when
a kthread use a userspace mm temporarily through use_mm.
It causes issues with init_mm and efi_mm which only partially initialize
their mm_struct when introducing the new hierarchical percpu counters to
replace RSS counters, which requires a pointer dereference when reading
the approximate counter sum. The current percpu counters simply load a
zeroed atomic counter, which happen to work.
Skip all kernel threads in acct_account_cputime(), not just those that
happen to have a NULL mm.
This is a preparation step before introducing the hierarchical percpu
counters.
Link: https://lkml.kernel.org/r/20251224173810.648699-2-mathieu.desnoyers@efficios.com
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Aboorva Devarajan <aboorvad@linux.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Christan König <christian.koenig@amd.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Liam R . Howlett" <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Martin Liu <liumartin@google.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: SeongJae Park <sj@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
kexec_handover_internal.h is included twice in kexec_handover.c. Remove
the redundant first inclusion to eliminate the duplication.
Link: https://lkml.kernel.org/r/20251216114400.2677311-1-longwei27@huawei.com
Signed-off-by: Long Wei <longwei27@huawei.com>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: hewenliang <hewenliang4@huawei.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Pratyush Yadav <pratyush@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The comment for get_task_mm() in kernel/fork.c incorrectly references the
deprecated function `use_mm()`, which has been renamed to
`kthread_use_mm()` in kernel/kthread.c.
This patch updates the documentation to reflect the current function
names, ensuring accuracy when developers refer to the kernel thread memory
context API.
No functional changes were introduced.
Link: https://lkml.kernel.org/r/KUZPR04MB8965F954108B4DD7E8FFDB2B8F84A@KUZPR04MB8965.apcprd04.prod.outlook.com
Signed-off-by: mingzhu.wang <mingzhu.wang@transsion.com>
Cc: Ben Segall <bsegall@google.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiazi Li <jqqlijiazi@gmail.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Kees Cook <kees@kernel.org>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The `struct kho_vmalloc` defines the in-memory layout for preserving
vmalloc regions across kexec. This layout is a contract between kernels
and part of the KHO ABI.
To reflect this relationship, the related structs and helper macros are
relocated to the ABI header, `include/linux/kho/abi/kexec_handover.h`.
This move places the structure's definition under the protection of the
KHO_FDT_COMPATIBLE version string.
The structure and its components are now also documented within the ABI
header to describe the contract and prevent ABI breaks.
[rppt@kernel.org: update comment, per Pratyush]
Link: https://lkml.kernel.org/r/aW_Mqp6HcqLwQImS@kernel.org
Link: https://lkml.kernel.org/r/20260105165839.285270-6-rppt@kernel.org
Signed-off-by: Jason Miu <jasonmiu@google.com>
Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Pratyush Yadav <pratyush@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Introduce the `include/linux/kho/abi/kexec_handover.h` header file, which
defines the stable ABI for the KHO mechanism. This header specifies how
preserved data is passed between kernels using an FDT.
The ABI contract includes the FDT structure, node properties, and the
"kho-v1" compatible string. By centralizing these definitions, this
header serves as the foundational agreement for inter-kernel communication
of preserved states, ensuring forward compatibility and preventing
misinterpretation of data across kexec transitions.
Since the ABI definitions are now centralized in the header files, the
YAML files that previously described the FDT interfaces are redundant.
These redundant files have therefore been removed.
Link: https://lkml.kernel.org/r/20260105165839.285270-5-rppt@kernel.org
Signed-off-by: Jason Miu <jasonmiu@google.com>
Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently index.rst in KHO documentation looks empty and sad as it only
contains links to "Kexec Handover Concepts" and "KHO FDT" chapters.
Inline contents of these chapters into index.rst to provide a single
coherent chapter describing KHO.
While on it, drop parts of the KHO FDT description that will be superseded
by addition of KHO ABI documentation.
[rppt@kernel.org: fix Documentation/core-api/kho/index.rst]
Link: https://lkml.kernel.org/r/aV4bnHlBXGpT_FMc@kernel.org
Link: https://lkml.kernel.org/r/20260105165839.285270-4-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Jason Miu <jasonmiu@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Pratyush Yadav <pratyush@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Decouple memfd preservation support from the core Live Update Orchestrator
configuration.
Previously, enabling CONFIG_LIVEUPDATE forced a dependency on CONFIG_SHMEM
and unconditionally compiled memfd_luo.o. However, Live Update may be
used for purposes that do not require memfd-backed memory preservation.
Introduce CONFIG_LIVEUPDATE_MEMFD to gate memfd_luo.o. This moves the
SHMEM and MEMFD_CREATE dependencies to the specific feature that needs
them, allowing the base LIVEUPDATE option to be selected independently of
shared memory support.
Link: https://lkml.kernel.org/r/20251230161402.1542099-1-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
If the kernel is compiled with LTO, hwerr_data symbol might be lost, and
vmcoreinfo doesn't have it dumped. This is currently seen in some
production kernels with LTO enabled.
Remove the static qualifier from hwerr_data so that the information is
still preserved when the kernel is built with LTO. Making hwerr_data a
global symbol ensures its debug info survives the LTO link process and
appears in kallsyms. Also document it, so it doesn't get removed in
the future as suggested by akpm.
Link: https://lkml.kernel.org/r/20260122-fix_vmcoreinfo-v2-1-2d6311f9e36c@debian.org
Fixes: 3fa805c37d ("vmcoreinfo: track and log recoverable hardware errors")
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Omar Sandoval <osandov@osandov.com>
Cc: Shuai Xue <xueshuai@linux.alibaba.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Zhiquan Li <zhiquan1.li@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Memblock pages (including reserved memory) should have their allocation
tags initialized to CODETAG_EMPTY via clear_page_tag_ref() before being
released to the page allocator. When kho restores pages through
kho_restore_page(), missing this call causes mismatched
allocation/deallocation tracking and below warning message:
alloc_tag was not set
WARNING: include/linux/alloc_tag.h:164 at ___free_pages+0xb8/0x260, CPU#1: swapper/0/1
RIP: 0010:___free_pages+0xb8/0x260
kho_restore_vmalloc+0x187/0x2e0
kho_test_init+0x3c4/0xa30
do_one_initcall+0x62/0x2b0
kernel_init_freeable+0x25b/0x480
kernel_init+0x1a/0x1c0
ret_from_fork+0x2d1/0x360
Add missing clear_page_tag_ref() annotation in kho_restore_page() to
fix this.
Link: https://lkml.kernel.org/r/20260122132740.176468-1-ranxiaokai627@163.com
Fixes: fc33e4b44b ("kexec: enable KHO support for memory preservation")
Signed-off-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When /proc/sys/kernel/traceoff_on_warning is set to 1, the top level
tracing buffer is disabled when a warning happens. This is very useful
when debugging and want the tracing buffer to stop taking new data when a
warning triggers keeping the events that lead up to the warning from being
overwritten.
Now that there is also a persistent ring buffer and an option to have
trace_printk go to that buffer, the same holds true for that buffer. A
warning could happen just before a crash but still write enough events to
lose the events that lead up to the first warning that was the reason for
the crash.
When /proc/sys/kernel/traceoff_on_warning is set to 1 and a warning is
triggered, not only disable the top level tracing buffer, but also disable
the buffer that trace_printk()s are written to.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Link: https://patch.msgid.link/20260121093858.5c5d7e7b@gandalf.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
ENTRIES_PER_PAGE_GROUP() returns the number of dyn_ftrace entries in a page
group, identified by its order.
No functional change.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260113152243.3557219-2-linux@roeck-us.net
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
By doing:
# trace-cmd sqlhist -e -n futex_wait select TIMESTAMP_DELTA_USECS as lat from sys_enter_futex as start join sys_exit_futex as end on start.common_pid = end.common_pid
and
# trace-cmd start -e futex_wait -f 'lat > 100' -e page_pool_state_release -f 'pfn == 1'
The output of the show_event_trigger and show_event_filter files are well
aligned because of the inconsistent 'tab' spacing:
~# cat /sys/kernel/tracing/show_event_triggers
syscalls:sys_exit_futex hist:keys=common_pid:vals=hitcount:__lat_12046_2=common_timestamp.usecs-$__arg_12046_1:sort=hitcount:size=2048:clock=global:onmatch(syscalls.sys_enter_futex).trace(futex_wait,$__lat_12046_2) [active]
syscalls:sys_enter_futex hist:keys=common_pid:vals=hitcount:__arg_12046_1=common_timestamp.usecs:sort=hitcount:size=2048:clock=global [active]
~# cat /sys/kernel/tracing/show_event_filters
synthetic:futex_wait (lat > 100)
page_pool:page_pool_state_release (pfn == 1)
This makes it not so easy to read. Instead, force the spacing to be at
least 32 bytes from the beginning (one space if the system:event is longer
than 30 bytes):
~# cat /sys/kernel/tracing/show_event_triggers
syscalls:sys_exit_futex hist:keys=common_pid:vals=hitcount:__lat_8125_2=common_timestamp.usecs-$__arg_8125_1:sort=hitcount:size=2048:clock=global:onmatch(syscalls.sys_enter_futex).trace(futex_wait,$__lat_8125_2) [active]
syscalls:sys_enter_futex hist:keys=common_pid:vals=hitcount:__arg_8125_1=common_timestamp.usecs:sort=hitcount:size=2048:clock=global [active]
~# cat /sys/kernel/tracing/show_event_filters
synthetic:futex_wait (lat > 100)
page_pool:page_pool_state_release (pfn == 1)
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260112153408.18373e73@gandalf.local.home
Reviewed-by: Aaron Tomlin <atomlin@atomlin.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Avoid running the wakeup irq_work on an isolated CPU. Since the wakeup can
run on any CPU, let's pick a housekeeping CPU to do the job.
This change reduces additional noise when tracing isolated CPUs. For
example, the following ipi_send_cpu stack trace was captured with
nohz_full=2 on the isolated CPU:
<idle>-0 [002] d.h4. 1255.379293: ipi_send_cpu: cpu=2 callsite=irq_work_queue+0x2d/0x50 callback=rb_wake_up_waiters+0x0/0x80
<idle>-0 [002] d.h4. 1255.379329: <stack trace>
=> trace_event_raw_event_ipi_send_cpu
=> __irq_work_queue_local
=> irq_work_queue
=> ring_buffer_unlock_commit
=> trace_buffer_unlock_commit_regs
=> trace_event_buffer_commit
=> trace_event_raw_event_x86_irq_vector
=> __sysvec_apic_timer_interrupt
=> sysvec_apic_timer_interrupt
=> asm_sysvec_apic_timer_interrupt
=> pv_native_safe_halt
=> default_idle
=> default_idle_call
=> do_idle
=> cpu_startup_entry
=> start_secondary
=> common_startup_64
The IRQ work interrupt alone adds considerable noise, but the impact can
get even worse with PREEMPT_RT, because the IRQ work interrupt is then
handled by a separate kernel thread. This requires a task switch and makes
tracing useless for analyzing latency on an isolated CPU.
After applying the patch, the trace is similar, but ipi_send_cpu always
targets a non-isolated CPU.
Unfortunately, irq_work_queue_on() is not NMI-safe. When running in NMI
context, fall back to queuing the irq work on the local CPU.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Clark Williams <clrkwllms@kernel.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Link: https://patch.msgid.link/20260108132132.2473515-1-ptesarik@suse.com
Signed-off-by: Petr Tesarik <ptesarik@suse.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
To audit active event triggers, userspace currently must traverse the
events/ directory and read each individual trigger file. This is
cumbersome for system-wide auditing or debugging.
Introduce "show_event_triggers" at the trace root directory. This file
displays all events that currently have one or more triggers applied,
alongside the trigger configuration, in a consolidated
system:event [tab] trigger format.
The implementation leverages the existing trace_event_file iterators
and uses the trigger's own print() operation to ensure output
consistency with the per-event trigger files.
Link: https://patch.msgid.link/20260105142939.2655342-3-atomlin@atomlin.com
Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Currently, to audit active Ftrace event filters, userspace must
recursively traverse the events/ directory and read each individual
filter file. This is inefficient for monitoring tools and debugging.
Introduce "show_event_filters" at the trace root directory. This file
displays all events that currently have a filter applied, alongside the
actual filter string, in a consolidated system:event [tab] filter
format.
The implementation reuses the existing trace_event_file iterators to
ensure atomic traversal of the event list and utilises guard(rcu)() for
automatic, scope-based protection when accessing volatile filter
strings.
Link: https://patch.msgid.link/20260105142939.2655342-2-atomlin@atomlin.com
Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
This patch continues the effort to refactor workqueue APIs, which has begun
with the changes introducing new workqueues and a new alloc_workqueue flag:
commit 128ea9f6cc ("workqueue: Add system_percpu_wq and system_dfl_wq")
commit 930c2ea566 ("workqueue: Add new WQ_PERCPU flag")
The point of the refactoring is to eventually alter the default behavior of
workqueues to become unbound by default so that their workload placement is
optimized by the scheduler.
Before that to happen after a careful review and conversion of each individual
case, workqueue users must be converted to the better named new workqueues with
no intended behaviour changes:
system_wq -> system_percpu_wq
system_unbound_wq -> system_dfl_wq
This specific workflow has no benefits being per-cpu, so instead of
system_percpu_wq the new unbound workqueue has been used (system_dfl_wq).
This way the old obsolete workqueues (system_wq, system_unbound_wq) can be
removed in the future.
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251230142820.173712-1-marco.crivellari@suse.com
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Add support for displaying bitmasks in human-readable list format (e.g.,
0,2-5,7) in addition to the default hexadecimal bitmap representation.
This is particularly useful when tracing CPU masks and other large
bitmasks where individual bit positions are more meaningful than their
hexadecimal encoding.
When the "bitmask-list" option is enabled, the printk "%*pbl" format
specifier is used to render bitmasks as comma-separated ranges, making
trace output easier to interpret for complex CPU configurations and
large bitmask values.
Link: https://patch.msgid.link/20251226160724.2246493-2-atomlin@atomlin.com
Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
With the change to replace kfree() with trigger_data_free(), which starts
out doing the exact same thing as event_trigger_reset_filter(), there's no
reason to call event_trigger_reset_filter() before calling
trigger_data_free(). Remove the call to it.
Link: https://lore.kernel.org/linux-trace-kernel/20251211204520.0f3ba6d1@fedora/
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Miaoqian Lin <linmq006@gmail.com>
Link: https://patch.msgid.link/20260108174429.2d9ca51f@gandalf.local.home
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Memory allocated with trigger_data_alloc() requires trigger_data_free()
for proper cleanup.
Replace kfree() with trigger_data_free() to fix this.
Found via static analysis and code review.
This isn't a real bug due to the current code basically being an open
coded version of trigger_data_free() without the synchronization. The
synchronization isn't needed as this is the error path of creation and
there's nothing to synchronize against yet. Replace the kfree() to be
consistent with the allocation.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Link: https://patch.msgid.link/20251211100058.2381268-1-linmq006@gmail.com
Fixes: e1f187d09e ("tracing: Have existing event_command.parse() implementations use helpers")
Signed-off-by: Miaoqian Lin <linmq006@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Implement session cookie for fsession. The session cookies will be stored
in the stack, and the layout of the stack will look like this:
return value -> 8 bytes
argN -> 8 bytes
...
arg1 -> 8 bytes
nr_args -> 8 bytes
ip (optional) -> 8 bytes
cookie2 -> 8 bytes
cookie1 -> 8 bytes
The offset of the cookie for the current bpf program, which is in 8-byte
units, is stored in the
"(((u64 *)ctx)[-1] >> BPF_TRAMP_COOKIE_INDEX_SHIFT) & 0xFF". Therefore, we
can get the session cookie with ((u64 *)ctx)[-offset].
Implement and inline the bpf_session_cookie() for the fsession in the
verifier.
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Link: https://lore.kernel.org/r/20260124062008.8657-6-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
If fsession exists, we will use the bit (1 << BPF_TRAMP_IS_RETURN_SHIFT)
in ((u64 *)ctx)[-1] to store the "is_return" flag.
The logic of bpf_session_is_return() for fsession is implemented in the
verifier by inline following code:
bool bpf_session_is_return(void *ctx)
{
return (((u64 *)ctx)[-1] >> BPF_TRAMP_IS_RETURN_SHIFT) & 1;
}
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Co-developed-by: Leon Hwang <leon.hwang@linux.dev>
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/20260124062008.8657-5-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add the function argument of "void *ctx" to bpf_session_cookie() and
bpf_session_is_return(), which is a preparation of the next patch.
The two kfunc is seldom used now, so it will not introduce much effect
to change their function prototype.
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20260124062008.8657-4-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
For now, ((u64 *)ctx)[-1] is used to store the nr_args in the trampoline.
However, 1 byte is enough to store such information. Therefore, we use
only the least significant byte of ((u64 *)ctx)[-1] to store the nr_args,
and reserve the rest for other usages.
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Link: https://lore.kernel.org/r/20260124062008.8657-3-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The fsession is something that similar to kprobe session. It allow to
attach a single BPF program to both the entry and the exit of the target
functions.
Introduce the struct bpf_fsession_link, which allows to add the link to
both the fentry and fexit progs_hlist of the trampoline.
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Co-developed-by: Leon Hwang <leon.hwang@linux.dev>
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/20260124062008.8657-2-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
- Fix a crash with passing a stacktrace between synthetic events
A synthetic event is an event that combines two events into a single event
that can display fields from both events as well as the time delta that
took place between the events. It can also pass a stacktrace from the
first event so that it can be displayed by the synthetic event (this is
useful to get a stacktrace of a task scheduling out when blocked and
recording the time it was blocked for).
A synthetic event can also connect an existing synthetic event to another
event. An issue was found that if the first synthetic event had a stacktrace
as one of its fields, and that stacktrace field was passed to the new
synthetic event to be displayed, it would crash the kernel. This was due to
the stacktrace not being saved as a stacktrace but was still marked as one.
When the stacktrace was read, it would try to read an array but instead read
the integer metadata of the stacktrace and dereferenced a bad value.
Fix this by saving the stacktrace field as a stracktrace.
- Fix possible overflow in cmp_mod_entry() compare function
A binary search is used to find a module address and if the addresses are
greater than 2GB apart it could lead to truncation and cause a bad search
result. Use normal compares instead of a subtraction between addresses to
calculate the compare value.
- Fix output of entry arguments in function graph tracer
Depending on the configurations enabled, the entry can be two different
types that hold the argument array. The macro FGRAPH_ENTRY_ARGS() is used
to find the correct arguments from the given type. One location was missed
and still referenced the arguments directly via entry->args and could
produce the wrong value depending on how the kernel was configured.
- Fix memory leak in scripts/tracepoint-update build tool
If the array fails to allocate, the memory for the values needs to be
freed and was not. Free the allocated values if the array failed to
allocate.
-----BEGIN PGP SIGNATURE-----
iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaXUQLxQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qgsJAQDgtWH9DWUkJKgzXTkiOA0l8JArPOVf
tCSMla2wWJA70QD/as2ptacYAFU9v1oxO5YIgsKOLFBF68ZUIhJtvXpqtAE=
=JeC6
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix a crash with passing a stacktrace between synthetic events
A synthetic event is an event that combines two events into a single
event that can display fields from both events as well as the time
delta that took place between the events. It can also pass a
stacktrace from the first event so that it can be displayed by the
synthetic event (this is useful to get a stacktrace of a task
scheduling out when blocked and recording the time it was blocked
for).
A synthetic event can also connect an existing synthetic event to
another event. An issue was found that if the first synthetic event
had a stacktrace as one of its fields, and that stacktrace field was
passed to the new synthetic event to be displayed, it would crash the
kernel. This was due to the stacktrace not being saved as a
stacktrace but was still marked as one. When the stacktrace was read,
it would try to read an array but instead read the integer metadata
of the stacktrace and dereferenced a bad value.
Fix this by saving the stacktrace field as a stacktrace.
- Fix possible overflow in cmp_mod_entry() compare function
A binary search is used to find a module address and if the addresses
are greater than 2GB apart it could lead to truncation and cause a
bad search result. Use normal compares instead of a subtraction
between addresses to calculate the compare value.
- Fix output of entry arguments in function graph tracer
Depending on the configurations enabled, the entry can be two
different types that hold the argument array. The macro
FGRAPH_ENTRY_ARGS() is used to find the correct arguments from the
given type. One location was missed and still referenced the
arguments directly via entry->args and could produce the wrong value
depending on how the kernel was configured.
- Fix memory leak in scripts/tracepoint-update build tool
If the array fails to allocate, the memory for the values needs to be
freed and was not. Free the allocated values if the array failed to
allocate.
* tag 'trace-v6.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
scripts/tracepoint-update: Fix memory leak in add_string() on failure
function_graph: Fix args pointer mismatch in print_graph_retval()
tracing: Avoid possible signed 64-bit truncation
tracing: Fix crash on synthetic stacktrace field usage
- Fix PELT clock synchronization bug when entering idle
- Disable the NEXT_BUDDY feature, as during extensive testing
Mel found that the negatives outweigh the positives.
- Make wakeup preemption less aggressive, which resulted in
an unreasonable increase in preemption frequency.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAml0kYYRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1j/Bw/+M5UK6EOPzLTs0Tj87X/kI7vXLN/uf9+K
FpDtFNmoJnJKrQxPFfa8aT8OAz7zLfnjrGmRgxOG1fFkACMuB8s+TO4+i9lhzAsF
gHOg6lDczi4K14mkTyiP/Bdaf3lZThghZitdRCBZFN4gjBpAh1rRI4ikQ0a3pwJH
6y1NF8fX/xRBm6Bgprgbv/em+fKsO89i6NwunPtYaxHsOGA5U0FVPFSa3yghz+X2
7Sl4U8LRdkU3Z62M+I8DKWuAMMb9nLwEXrSiNy+KJHIJ7FNFb1+U10gPrLq+z2Oi
hvd25pzDzGUgOglZ7f1IRL5H48putMjCqv+A2wOZnzl7fHt4c02hpTBqMDle5hEP
DYHyNNC1ZIRQ1zuy8j33LzB10ycP6pX9nawt+S1trEZcDVwaKwq9TbYsBjbJKtmZ
V7C181fTUaNQ6M4wJY3gx+hD1ocrLv+Y0vhXJnK7sg+j4IhWJ7Yk6ne9hrpXbbUj
cK8gzvo4OkShYtMlB9ut/RIiuyvFThyJd9rcNkeN4tTM9DVFURvOlIK2W6U70Yty
OJDkhI6KBmkERPQoa43OsGvxZwF+g0KY8ISsLq3bJ2vLyATPdpJ2ns+zRg6I+B+q
zHZmI86Fc8gr+O/y9oqi2VJhgio34RcoSId/QcAPC1Gl4TaIcmqFFTSzMhI+kLk9
Ng7vHy/3KQg=
=pGnx
-----END PGP SIGNATURE-----
Merge tag 'sched-urgent-2026-01-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
- Fix PELT clock synchronization bug when entering idle
- Disable the NEXT_BUDDY feature, as during extensive testing
Mel found that the negatives outweigh the positives
- Make wakeup preemption less aggressive, which resulted in
an unreasonable increase in preemption frequency
* tag 'sched-urgent-2026-01-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Revert force wakeup preemption
sched/fair: Disable scheduler feature NEXT_BUDDY
sched/fair: Fix pelt clock sync when entering idle
- Fix mmap_count warning & bug when creating a group member event
with the PERF_FLAG_FD_OUTPUT flag.
- Disable the sample period == 1 branch events BTS optimization
on guests, because BTS is not virtualized.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAml0kF8RHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1icZxAAhsYBY0uB3OzAgJYRIVlw/vzF7ubHh8+r
PhQ+azap/uSk/dS8IXZ0FrP/vq9kgqEFpsYwWbB1EFRxB24nd8AAYarASyc22Dhb
Qe6tKh4hzlDC9Cw0vJ111PgKJLiPDrmkxzS4C2HkPwYwriNql81hQjLW5nF6dHwg
i8aP7+/uNB2h/xnPIRgkUFSUzBacRz1Snqgi/vSHmwEhk1GFI48rXoYywM9ItAnR
CGN7tGxJ45YwDYeknZf9Ngsd3q+/eh38ihPfMEKoulf4oFvhIsiG4rJ1Vee0V9fD
aD1NhukULtjw3lkKnMM75W+Jdb7fHDTOuxzUov9WpyIOIUTMqRkbvaKHgJzSD2SK
01TbP3kQixhGhiRXx78GDQwYGjX8JQxngbqyJvL708GNvxbcKYrLhqtr5Ho00pWH
ERxx/ajoDXB7Neo7XPhgYRHk/lrlnZsK1LoidhMzN1UX6C12VnhPcD+zUSqRxz5w
yFuJp0+7wF+G74FpO7kv8jv5KDB/lery5mdzWu0kqAKMfYWdw5eGoaI6T48DVoBy
IwNYU8bxDBRPzu64XudDE4xBuTuy4HpJmbvOUkxMUkBGMTZ+nysqHu8D7cJJqYtL
2xYJOpR+P4Se9fRyR9xo5vtTZy27TQqTp+AjA5RlIoVOJhYK5T9mAFCJaEhdupbM
2IA4xdYhbb8=
=lNSW
-----END PGP SIGNATURE-----
Merge tag 'perf-urgent-2026-01-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf events fixes from Ingo Molnar:
- Fix mmap_count warning & bug when creating a group member event
with the PERF_FLAG_FD_OUTPUT flag
- Disable the sample period == 1 branch events BTS optimization
on guests, because BTS is not virtualized
* tag 'perf-urgent-2026-01-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel: Do not enable BTS for guests
perf: Fix refcount warning on event->mmap_count increment
Merge in patches to support several patch series such as Soft Reserve
handling, type2 accelerator enabling, and LSA 2.1 labeling support.
Mainly addition of cxl_memdev_attach() to allow the memdev probe
to make a decision of proceed/fail depending success of CXL topology
enumeration.
dax/hmem, e820, resource: Defer Soft Reserved insertion until hmem is ready
cxl/mem: Introduce cxl_memdev_attach for CXL-dependent operation
cxl/mem: Drop @host argument to devm_cxl_add_memdev()
cxl/mem: Convert devm_cxl_add_memdev() to scope-based-cleanup
cxl/port: Arrange for always synchronous endpoint attach
cxl/mem: Arrange for always-synchronous memdev attach
cxl/mem: Fix devm_cxl_memdev_edac_release() confusion
The pattern of checking nocb_defer_wakeup and deleting the timer is
duplicated in __wake_nocb_gp() and nocb_gp_wait(). Extract this into a
common helper function nocb_defer_wakeup_cancel().
This removes code duplication and makes it easier to maintain.
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
During callback overload (exceeding qhimark), the NOCB code attempts
opportunistic advancement via rcu_advance_cbs_nowake(). Analysis shows
this code path is practically unreachable and serves no useful purpose.
Testing with 300,000 callback floods showed:
- 30 overload conditions triggered
- 0 advancements actually occurred
While a theoretical window exists where this code could execute (e.g.,
vCPU preemption between gp_seq update and rcu_nocb_gp_cleanup()), even
if it did, the advancement would be redundant. The rcuog kthread must
still run to wake the rcuoc callback thread - we would just be
duplicating work that rcuog will perform when it finally gets to run.
Since this path provides no meaningful benefit and extensive testing
confirms it is never useful, remove it entirely.
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
The WakeOvfIsDeferred code path in __call_rcu_nocb_wake() attempts to
wake rcuog when the callback count exceeds qhimark and callbacks aren't
done with their GP (newly queued or awaiting GP). However, a lot of
testing proves this wake is always redundant or useless.
In the flooding case, rcuog is always waiting for a GP to finish. So
waking up the rcuog thread is pointless. The timer wakeup adds overhead,
rcuog simply wakes up and goes back to sleep achieving nothing.
This path also adds a full memory barrier, and additional timer expiry
modifications unnecessarily.
The root cause is that WakeOvfIsDeferred fires when
!rcu_segcblist_ready_cbs() (GP not complete), but waking rcuog cannot
accelerate GP completion.
This commit therefore removes this path.
Tested with rcutorture scenarios: TREE01, TREE05, TREE08 (all NOCB
configurations) - all pass. Also stress tested using a kernel module
that floods call_rcu() to trigger the overload conditions and made the
observations confirming the findings.
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
When funcgraph-args and funcgraph-retaddr are both enabled, many kernel
functions display invalid parameters in trace logs.
The issue occurs because print_graph_retval() passes a mismatched args
pointer to print_function_args(). Fix this by retrieving the correct
args pointer using the FGRAPH_ENTRY_ARGS() macro.
Link: https://patch.msgid.link/20260112021601.1300479-1-dolinux.peng@gmail.com
Fixes: f83ac7544f ("function_graph: Enable funcgraph-args and funcgraph-retaddr to work simultaneously")
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Donglin Peng <pengdonglin@xiaomi.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
64-bit truncation to 32-bit can result in the sign of the truncated
value changing. The cmp_mod_entry is used in bsearch and so the
truncation could result in an invalid search order. This would only
happen were the addresses more than 2GB apart and so unlikely, but
let's fix the potentially broken compare anyway.
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260108002625.333331-1-irogers@google.com
Signed-off-by: Ian Rogers <irogers@google.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
When creating a synthetic event based on an existing synthetic event that
had a stacktrace field and the new synthetic event used that field a
kernel crash occurred:
~# cd /sys/kernel/tracing
~# echo 's:stack unsigned long stack[];' > dynamic_events
~# echo 'hist:keys=prev_pid:s0=common_stacktrace if prev_state & 3' >> events/sched/sched_switch/trigger
~# echo 'hist:keys=next_pid:s1=$s0:onmatch(sched.sched_switch).trace(stack,$s1)' >> events/sched/sched_switch/trigger
The above creates a synthetic event that takes a stacktrace when a task
schedules out in a non-running state and passes that stacktrace to the
sched_switch event when that task schedules back in. It triggers the
"stack" synthetic event that has a stacktrace as its field (called "stack").
~# echo 's:syscall_stack s64 id; unsigned long stack[];' >> dynamic_events
~# echo 'hist:keys=common_pid:s2=stack' >> events/synthetic/stack/trigger
~# echo 'hist:keys=common_pid:s3=$s2,i0=id:onmatch(synthetic.stack).trace(syscall_stack,$i0,$s3)' >> events/raw_syscalls/sys_exit/trigger
The above makes another synthetic event called "syscall_stack" that
attaches the first synthetic event (stack) to the sys_exit trace event and
records the stacktrace from the stack event with the id of the system call
that is exiting.
When enabling this event (or using it in a historgram):
~# echo 1 > events/synthetic/syscall_stack/enable
Produces a kernel crash!
BUG: unable to handle page fault for address: 0000000000400010
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: Oops: 0000 [#1] SMP PTI
CPU: 6 UID: 0 PID: 1257 Comm: bash Not tainted 6.16.3+deb14-amd64 #1 PREEMPT(lazy) Debian 6.16.3-1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.17.0-debian-1.17.0-1 04/01/2014
RIP: 0010:trace_event_raw_event_synth+0x90/0x380
Code: c5 00 00 00 00 85 d2 0f 84 e1 00 00 00 31 db eb 34 0f 1f 00 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 2e 0f 1f 84 00 00 00 00 00 <49> 8b 04 24 48 83 c3 01 8d 0c c5 08 00 00 00 01 cd 41 3b 5d 40 0f
RSP: 0018:ffffd2670388f958 EFLAGS: 00010202
RAX: ffff8ba1065cc100 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: fffff266ffda7b90 RDI: ffffd2670388f9b0
RBP: 0000000000000010 R08: ffff8ba104e76000 R09: ffffd2670388fa50
R10: ffff8ba102dd42e0 R11: ffffffff9a908970 R12: 0000000000400010
R13: ffff8ba10a246400 R14: ffff8ba10a710220 R15: fffff266ffda7b90
FS: 00007fa3bc63f740(0000) GS:ffff8ba2e0f48000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000400010 CR3: 0000000107f9e003 CR4: 0000000000172ef0
Call Trace:
<TASK>
? __tracing_map_insert+0x208/0x3a0
action_trace+0x67/0x70
event_hist_trigger+0x633/0x6d0
event_triggers_call+0x82/0x130
trace_event_buffer_commit+0x19d/0x250
trace_event_raw_event_sys_exit+0x62/0xb0
syscall_exit_work+0x9d/0x140
do_syscall_64+0x20a/0x2f0
? trace_event_raw_event_sched_switch+0x12b/0x170
? save_fpregs_to_fpstate+0x3e/0x90
? _raw_spin_unlock+0xe/0x30
? finish_task_switch.isra.0+0x97/0x2c0
? __rseq_handle_notify_resume+0xad/0x4c0
? __schedule+0x4b8/0xd00
? restore_fpregs_from_fpstate+0x3c/0x90
? switch_fpu_return+0x5b/0xe0
? do_syscall_64+0x1ef/0x2f0
? do_fault+0x2e9/0x540
? __handle_mm_fault+0x7d1/0xf70
? count_memcg_events+0x167/0x1d0
? handle_mm_fault+0x1d7/0x2e0
? do_user_addr_fault+0x2c3/0x7f0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
The reason is that the stacktrace field is not labeled as such, and is
treated as a normal field and not as a dynamic event that it is.
In trace_event_raw_event_synth() the event is field is still treated as a
dynamic array, but the retrieval of the data is considered a normal field,
and the reference is just the meta data:
// Meta data is retrieved instead of a dynamic array
str_val = (char *)(long)var_ref_vals[val_idx];
// Then when it tries to process it:
len = *((unsigned long *)str_val) + 1;
It triggers a kernel page fault.
To fix this, first when defining the fields of the first synthetic event,
set the filter type to FILTER_STACKTRACE. This is used later by the second
synthetic event to know that this field is a stacktrace. When creating
the field of the new synthetic event, have it use this FILTER_STACKTRACE
to know to create a stacktrace field to copy the stacktrace into.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Link: https://patch.msgid.link/20260122194824.6905a38e@gandalf.local.home
Fixes: 00cf3d672a ("tracing: Allow synthetic events to pass around stacktraces")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The TAS fallback can be invoked directly when queued spin locks are
disabled, and through the slow path when paravirt is enabled for queued
spin locks. In the latter case, the res_spin_lock macro will attempt the
fast path and already hold the entry when entering the slow path. This
will lead to creation of extraneous entries that are not released, which
may cause false positives for deadlock detection.
Fix this by always preceding invocation of the TAS fallback in every
case with the grabbing of the held lock entry, and add a comment to make
note of this.
Fixes: c9102a68c0 ("rqspinlock: Add a test-and-set fallback")
Reported-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Tested-by: Amery Hung <ameryhung@gmail.com>
Link: https://lore.kernel.org/r/20260122115911.3668985-1-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This agressively bypasses run_to_parity and slice protection with the
assumpiton that this is what waker wants but there is no garantee that
the wakee will be the next to run. It is a better choice to use
yield_to_task or WF_SYNC in such case.
This increases the number of resched and preemption because a task becomes
quickly "ineligible" when it runs; We update the task vruntime periodically
and before the task exhausted its slice or at least quantum.
Example:
2 tasks A and B wake up simultaneously with lag = 0. Both are
eligible. Task A runs 1st and wakes up task C. Scheduler updates task
A's vruntime which becomes greater than average runtime as all others
have a lag == 0 and didn't run yet. Now task A is ineligible because
it received more runtime than the other task but it has not yet
exhausted its slice nor a min quantum. We force preemption, disable
protection but Task B will run 1st not task C.
Sidenote, DELAY_ZERO increases this effect by clearing positive lag at
wake up.
Fixes: e837456fdc ("sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals")
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260123102858.52428-1-vincent.guittot@linaro.org
NEXT_BUDDY was disabled with the introduction of EEVDF and enabled again
after NEXT_BUDDY was rewritten for EEVDF by commit e837456fdc ("sched/fair:
Reimplement NEXT_BUDDY to align with EEVDF goals"). It was not expected
that this would be a universal win without a crystal ball instruction
but the reported regressions are a concern [1][2] even if gains were
also reported. Specifically;
o mysql with client/server running on different servers regresses
o specjbb reports lower peak metrics
o daytrader regresses
The mysql is realistic and a concern. It needs to be confirmed if
specjbb is simply shifting the point where peak performance is measured
but still a concern. daytrader is considered to be representative of a
real workload.
Access to test machines is currently problematic for verifying any fix to
this problem. Disable NEXT_BUDDY for now by default until the root causes
are addressed.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Link: https://lore.kernel.org/lkml/4b96909a-f1ac-49eb-b814-97b8adda6229@arm.com [1]
Link: https://lore.kernel.org/lkml/ec3ea66f-3a0d-4b5a-ab36-ce778f159b5b@linux.ibm.com [2]
Link: https://patch.msgid.link/fyqsk63pkoxpeaclyqsm5nwtz3dyejplr7rg6p74xwemfzdzuu@7m7xhs5aqpqw
These structs are never modified.
To prevent malicious or accidental modifications due to bugs,
mark them as const.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
When the kallsyms relative base was introduced, per-CPU variable
references on x86_64 SMP were implemented as offsets into the respective
per-CPU region, rather than offsets relative to the location of the
variable's template in the kernel image, which is how other
architectures implement it.
This required kallsyms to reason about the difference between the two,
and the sign of the value in the kallsyms_offsets[] array was used to
distinguish them. This meant that negative offsets were not permitted
for ordinary variables, and so it was crucial that the relative base was
chosen such that all offsets were positive numbers.
This is no longer needed: instead, the offsets can simply be encoded as
values in the range -/+ 2 GiB, which is precisely what PC32 relocations
provide on most architectures. So it is possible to simplify the logic,
and just use _text as the anchor directly, and let the linker calculate
the final value based on the location of the entry itself.
Some architectures (nios2, extensa) do not support place-relative
relocations at all, but these are all 32-bit and non-relocatable, and so
there is no need for place-relative relocations in the first place, and
the actual symbol values can just be stored directly.
This makes all entries in the kallsyms_offsets[] array visible as
place-relative references in the ELF metadata, which will be important
when implementing ELF-based fg-kaslr.
Reviewed-by: Kees Cook <kees@kernel.org>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://patch.msgid.link/20260116093359.2442297-6-ardb+git@google.com
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
1) The commit:
2b8272ff4a ("cpu/hotplug: Prevent self deadlock on CPU hot-unplug")
was added to fix an issue where the hotplug control task (BP) was
throttled between CPUHP_AP_IDLE_DEAD and CPUHP_HRTIMERS_PREPARE waiting
in the hrtimer blindspot for the bandwidth callback queued in the dead
CPU.
2) Later on, the commit:
38685e2a04 ("cpu/hotplug: Don't offline the last non-isolated CPU")
plugged on the target selection for the workqueue offloaded CPU down
process to prevent from destroying the last CPU domain.
3) Finally:
5c0930ccaa ("hrtimers: Push pending hrtimers away from outgoing CPU earlier")
removed entirely the conditions for the race exposed and partially fixed
in 1). The offloading of the CPU down process to a workqueue on another
CPU then becomes unnecessary. But the last CPU belonging to scheduler
domains must still remain online.
Therefore revert the now obsolete commit
2b8272ff4a and move the housekeeping check
under the cpu_hotplug_lock write held. Since HK_TYPE_DOMAIN will include
both isolcpus and cpuset isolated partition, the hotplug lock will
synchronize against concurrent cpuset partition updates.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Currently, rq->idle_stamp is only used to calculate avg_idle during
wakeups. This means other paths that move a task to an idle CPU such as
fork/clone, execve, or migrations, do not end the CPU's idle status in
the scheduler's eyes, leading to an inaccurate avg_idle.
This patch introduces update_rq_avg_idle() to provide a more accurate
measurement of CPU idle duration. By invoking this helper in
put_prev_task_idle(), we ensure avg_idle is updated whenever a CPU
stops being idle, regardless of how the new task arrived.
Testing on an 80-core Ampere Altra (ARMv8) with 6.19-rc5 baseline:
- Hackbench : +7.2% performance gain at 16 threads.
- Schbench: Reduced p99.9 tail latencies at high concurrency.
Signed-off-by: Shubhang Kaushik <shubhang@os.amperecomputing.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Shubhang Kaushik <shubhang@os.amperecomputing.com>
Link: https://patch.msgid.link/20260121-v8-patch-series-v8-1-b7f1cbee5055@os.amperecomputing.com
It turns out that __run_hrtimer() will trace like:
<idle>-0 [032] d.h2. 20705.474563: hrtimer_cancel: hrtimer=0xff2db8f77f8226e8
<idle>-0 [032] d.h1. 20705.474563: hrtimer_expire_entry: hrtimer=0xff2db8f77f8226e8 now=20699452001850 function=tick_nohz_handler/0x0
Which is a bit nonsensical, the timer doesn't get canceled on
expiration. The cause is the use of the incorrect debug helper.
Fixes: c6a2a17702 ("hrtimer: Add tracepoint for hrtimers")
Reported-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260121143208.219595606@infradead.org
Change the minimum slice extension to 5 usec.
Since slice_test selftest reaches a staggering ~350 nsec extension:
Task: slice_test Mean: 350.266 ns
Latency (us) | Count
------------------------------
EXPIRED | 238
0 us | 143189
1 us | 167
2 us | 26
3 us | 11
4 us | 28
5 us | 31
6 us | 22
7 us | 23
8 us | 32
9 us | 16
10 us | 35
Lower the minimal (and default) value to 5 usecs -- which is still massive.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260121143208.073200729@infradead.org
Since glibc cares about the number of syscalls required to initialize a new
thread, allow initializing rseq with slice extension on. This avoids having to
do another prctl().
Requested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260121143207.814193010@infradead.org
If a time slice extension is granted and the reschedule delayed, the kernel
has to ensure that user space cannot abuse the extension and exceed the
maximum granted time.
It was suggested to implement this via the existing hrtick() timer in the
scheduler, but that turned out to be problematic for several reasons:
1) It creates a dependency on CONFIG_SCHED_HRTICK, which can be disabled
independently of CONFIG_HIGHRES_TIMERS
2) HRTICK usage in the scheduler can be runtime disabled or is only used
for certain aspects of scheduling.
3) The function is calling into the scheduler code and that might have
unexpected consequences when this is invoked due to a time slice
enforcement expiry. Especially when the task managed to clear the
grant via sched_yield(0).
It would be possible to address #2 and #3 by storing state in the
scheduler, but that is extra complexity and fragility for no value.
Implement a dedicated per CPU hrtimer instead, which is solely used for the
purpose of time slice enforcement.
The timer is armed when an extension was granted right before actually
returning to user mode in rseq_exit_to_user_mode_restart().
It is disarmed, when the task relinquishes the CPU. This is expensive as
the timer is probably the first expiring timer on the CPU, which means it
has to reprogram the hardware. But that's less expensive than going through
a full hrtimer interrupt cycle for nothing.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251215155709.068329497@linutronix.de
The kernel sets SYSCALL_WORK_RSEQ_SLICE when it grants a time slice
extension. This allows to handle the rseq_slice_yield() syscall, which is
used by user space to relinquish the CPU after finishing the critical
section for which it requested an extension.
In case the kernel state is still GRANTED, the kernel resets both kernel
and user space state with a set of sanity checks. If the kernel state is
already cleared, then this raced against the timer or some other interrupt
and just clears the work bit.
Doing it in syscall entry work allows to catch misbehaving user space,
which issues an arbitrary syscall, i.e. not rseq_slice_yield(), from the
critical section. Contrary to the initial strict requirement to use
rseq_slice_yield() arbitrary syscalls are not considered a violation of the
ABI contract anymore to allow onion architecture applications, which cannot
control the code inside a critical section, to utilize this as well.
If the code detects inconsistent user space that result in a SIGSEGV for
the application.
If the grant was still active and the task was not preempted yet, the work
code reschedules immediately before continuing through the syscall.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251215155709.005777059@linutronix.de
Provide a new syscall which has the only purpose to yield the CPU after the
kernel granted a time slice extension.
sched_yield() is not suitable for that because it unconditionally
schedules, but the end of the time slice extension is not required to
schedule when the task was already preempted. This also allows to have a
strict check for termination to catch user space invoking random syscalls
including sched_yield() from a time slice extension region.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Link: https://patch.msgid.link/20251215155708.929634896@linutronix.de
Implement a prctl() so that tasks can enable the time slice extension
mechanism. This fails, when time slice extensions are disabled at compile
time or on the kernel command line and when no rseq pointer is registered
in the kernel.
That allows to implement a single trivial check in the exit to user mode
hotpath, to decide whether the whole mechanism needs to be invoked.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251215155708.858717691@linutronix.de
Guard the time slice extension functionality with a static key, which can
be disabled on the kernel command line.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251215155708.733429292@linutronix.de
Aside of a Kconfig knob add the following items:
- Two flag bits for the rseq user space ABI, which allow user space to
query the availability and enablement without a syscall.
- A new member to the user space ABI struct rseq, which is going to be
used to communicate request and grant between kernel and user space.
- A rseq state struct to hold the kernel state of this
- Documentation of the new mechanism
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251215155708.669472597@linutronix.de
Using kstrtouint_from_user() instead of copy_from_user() + kstrtouint()
makes the code simpler and less error-prone.
Suggested-by: Yury Norov <ynorov@nvidia.com>
Signed-off-by: Fushuai Wang <wangfushuai@baidu.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Yury Norov <ynorov@nvidia.com>
Link: https://patch.msgid.link/20260117145615.53455-2-fushuai.wang@linux.dev
bpf_strncasecmp() function performs same like bpf_strcasecmp() except
limiting the comparison to a specific length.
Signed-off-by: Yuzuki Ishiyama <ishiyama@hpc.is.uec.ac.jp>
Acked-by: Viktor Malik <vmalik@redhat.com>
Acked-by: Mykyta Yatsenko <mykyta.yatsenko5@gmail.com>
Link: https://lore.kernel.org/r/20260121033328.1850010-2-ishiyama@hpc.is.uec.ac.jp
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
For now, bpf_get_func_arg() and bpf_get_func_arg_cnt() is not supported by
the BPF_TRACE_RAW_TP, which is not convenient to get the argument of the
tracepoint, especially for the case that the position of the arguments in
a tracepoint can change.
The target tracepoint BTF type id is specified during loading time,
therefore we can get the function argument count from the function
prototype instead of the stack.
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20260121044348.113201-2-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
When calling refcount_inc(&event->mmap_count) inside perf_mmap_rb(), the
following warning is triggered:
refcount_t: addition on 0; use-after-free.
WARNING: lib/refcount.c:25
PoC:
struct perf_event_attr attr = {0};
int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
mmap(NULL, 0x3000, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
int victim = syscall(__NR_perf_event_open, &attr, 0, -1, fd,
PERF_FLAG_FD_OUTPUT);
mmap(NULL, 0x3000, PROT_READ | PROT_WRITE, MAP_SHARED, victim, 0);
This occurs when creating a group member event with the flag
PERF_FLAG_FD_OUTPUT. The group leader should be mmap-ed and then mmap-ing
the event triggers the warning.
Since the event has copied the output_event in perf_event_set_output(),
event->rb is set. As a result, perf_mmap_rb() calls
refcount_inc(&event->mmap_count) when event->mmap_count = 0.
Disallow the case when event->mmap_count = 0. This also prevents two
events from updating the same user_page.
Fixes: 448f97fba9 ("perf: Convert mmap() refcounts to refcount_t")
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Will Rosenberg <whrosenb@asu.edu>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260119184956.801238-1-whrosenb@asu.edu
The "valid" readout delay between the two reads of the watchdog is larger
than the valid delta between the resulting watchdog and clocksource
intervals, which results in false positive watchdog results.
Assume TSC is the clocksource and HPET is the watchdog and both have a
uncertainty margin of 250us (default). The watchdog readout does:
1) wdnow = read(HPET);
2) csnow = read(TSC);
3) wdend = read(HPET);
The valid window for the delta between #1 and #3 is calculated by the
uncertainty margins of the watchdog and the clocksource:
m = 2 * watchdog.uncertainty_margin + cs.uncertainty margin;
which results in 750us for the TSC/HPET case.
The actual interval comparison uses a smaller margin:
m = watchdog.uncertainty_margin + cs.uncertainty margin;
which results in 500us for the TSC/HPET case.
That means the following scenario will trigger the watchdog:
Watchdog cycle N:
1) wdnow[N] = read(HPET);
2) csnow[N] = read(TSC);
3) wdend[N] = read(HPET);
Assume the delay between #1 and #2 is 100us and the delay between #1 and
Watchdog cycle N + 1:
4) wdnow[N + 1] = read(HPET);
5) csnow[N + 1] = read(TSC);
6) wdend[N + 1] = read(HPET);
If the delay between #4 and #6 is within the 750us margin then any delay
between #4 and #5 which is larger than 600us will fail the interval check
and mark the TSC unstable because the intervals are calculated against the
previous value:
wd_int = wdnow[N + 1] - wdnow[N];
cs_int = csnow[N + 1] - csnow[N];
Putting the above delays in place this results in:
cs_int = (wdnow[N + 1] + 610us) - (wdnow[N] + 100us);
-> cs_int = wd_int + 510us;
which is obviously larger than the allowed 500us margin and results in
marking TSC unstable.
Fix this by using the same margin as the interval comparison. If the delay
between two watchdog reads is larger than that, then the readout was either
disturbed by interconnect congestion, NMIs or SMIs.
Fixes: 4ac1dd3245 ("clocksource: Set cs_watchdog_read() checks based on .uncertainty_margin")
Reported-by: Daniel J Blueman <daniel@quora.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/lkml/20250602223251.496591-1-daniel@quora.org/
Link: https://patch.msgid.link/87bjjxc9dq.ffs@tglx
The current comment "Clear TID on mm_release()?" ends with a question
mark, implying uncertainty about whether the TID is actually cleared in
mm_release().
However, the code flow is deterministic. When a task exits, mm_release()
explicitly checks 'tsk->clear_child_tid' and clears.
Since this behavior is unambiguous, remove the confusing question mark and
rephrase the comment to clearly state that TID is cleared in mm_release().
Link: https://lkml.kernel.org/r/20251125000407.24470-1-s9430939@naver.com
Signed-off-by: Minu Jin <s9430939@naver.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mel Gorman <mgorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
kallsyms_lookup_buildid() copies the symbol name into the given buffer so
that it can be safely read anytime later. But it just copies pointers to
mod->name and mod->build_id which might get reused after the related
struct module gets removed.
The lifetime of struct module is synchronized using RCU. Take the rcu
read lock for the entire __sprint_symbol().
Link: https://lkml.kernel.org/r/20251128135920.217303-8-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Aaron Tomlin <atomlin@atomlin.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkman <daniel@iogearbox.net>
Cc: Daniel Gomez <da.gomez@samsung.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Luis Chamberalin <mcgrof@kernel.org>
Cc: Marc Rutland <mark.rutland@arm.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Petr Pavlu <petr.pavlu@suse.com>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
__sprint_symbol() might access an invalid pointer when
kallsyms_lookup_buildid() returns a symbol found by
ftrace_mod_address_lookup().
The ftrace lookup function must set both @modname and @modbuildid the same
way as module_address_lookup().
Link: https://lkml.kernel.org/r/20251128135920.217303-7-pmladek@suse.com
Fixes: 9294523e37 ("module: add printk formats to add module build ID to stacktraces")
Signed-off-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Aaron Tomlin <atomlin@atomlin.com>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkman <daniel@iogearbox.net>
Cc: Daniel Gomez <da.gomez@samsung.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Luis Chamberalin <mcgrof@kernel.org>
Cc: Marc Rutland <mark.rutland@arm.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Petr Pavlu <petr.pavlu@suse.com>
Cc: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
bpf_address_lookup() has been used only in kallsyms_lookup_buildid(). It
was supposed to set @modname and @modbuildid when the symbol was in a
module.
But it always just cleared @modname because BPF symbols were never in a
module. And it did not clear @modbuildid because the pointer was not
passed.
The wrapper is no longer needed. Both @modname and @modbuildid are now
always initialized to NULL in kallsyms_lookup_buildid().
Remove the wrapper and rename __bpf_address_lookup() to
bpf_address_lookup() because this variant is used everywhere.
[akpm@linux-foundation.org: fix loongarch]
Link: https://lkml.kernel.org/r/20251128135920.217303-6-pmladek@suse.com
Fixes: 9294523e37 ("module: add printk formats to add module build ID to stacktraces")
Signed-off-by: Petr Mladek <pmladek@suse.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Aaron Tomlin <atomlin@atomlin.com>
Cc: Daniel Borkman <daniel@iogearbox.net>
Cc: Daniel Gomez <da.gomez@samsung.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Luis Chamberalin <mcgrof@kernel.org>
Cc: Marc Rutland <mark.rutland@arm.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Petr Pavlu <petr.pavlu@suse.com>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Put the code for appending the optional "buildid" into a helper function,
It makes __sprint_symbol() better readable.
Also print a warning when the "modname" is set and the "buildid" isn't.
It might catch a situation when some lookup function in
kallsyms_lookup_buildid() does not handle the "buildid".
Use pr_*_once() to avoid an infinite recursion when the function is called
from printk(). The recursion is rather theoretical but better be on the
safe side.
Link: https://lkml.kernel.org/r/20251128135920.217303-5-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Cc: Aaron Tomlin <atomlin@atomlin.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkman <daniel@iogearbox.net>
Cc: Daniel Gomez <da.gomez@samsung.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Luis Chamberalin <mcgrof@kernel.org>
Cc: Marc Rutland <mark.rutland@arm.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Petr Pavlu <petr.pavlu@suse.com>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Add a helper function for reading the optional "build_id" member of struct
module. It is going to be used also in ftrace_mod_address_lookup().
Use "#ifdef" instead of "#if IS_ENABLED()" to match the declaration of the
optional field in struct module.
Link: https://lkml.kernel.org/r/20251128135920.217303-4-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Daniel Gomez <da.gomez@samsung.com>
Reviewed-by: Petr Pavlu <petr.pavlu@suse.com>
Cc: Aaron Tomlin <atomlin@atomlin.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkman <daniel@iogearbox.net>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Luis Chamberalin <mcgrof@kernel.org>
Cc: Marc Rutland <mark.rutland@arm.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The @modname and @modbuildid optional return parameters are set only when
the symbol is in a module.
Always initialize them so that they do not need to be cleared when the
module is not in a module. It simplifies the logic and makes the code
even slightly more safe.
Note that bpf_address_lookup() function will get updated in a separate
patch.
Link: https://lkml.kernel.org/r/20251128135920.217303-3-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Cc: Aaron Tomlin <atomlin@atomlin.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkman <daniel@iogearbox.net>
Cc: Daniel Gomez <da.gomez@samsung.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Luis Chamberalin <mcgrof@kernel.org>
Cc: Marc Rutland <mark.rutland@arm.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Petr Pavlu <petr.pavlu@suse.com>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "kallsyms: Prevent invalid access when showing module
buildid", v3.
We have seen nested crashes in __sprint_symbol(), see below. They seem to
be caused by an invalid pointer to "buildid". This patchset cleans up
kallsyms code related to module buildid and fixes this invalid access when
printing backtraces.
I made an audit of __sprint_symbol() and found several situations
when the buildid might be wrong:
+ bpf_address_lookup() does not set @modbuildid
+ ftrace_mod_address_lookup() does not set @modbuildid
+ __sprint_symbol() does not take rcu_read_lock and
the related struct module might get removed before
mod->build_id is printed.
This patchset solves these problems:
+ 1st, 2nd patches are preparatory
+ 3rd, 4th, 6th patches fix the above problems
+ 5th patch cleans up a suspicious initialization code.
This is the backtrace, we have seen. But it is not really important.
The problems fixed by the patchset are obvious:
crash64> bt [62/2029]
PID: 136151 TASK: ffff9f6c981d4000 CPU: 367 COMMAND: "btrfs"
#0 [ffffbdb687635c28] machine_kexec at ffffffffb4c845b3
#1 [ffffbdb687635c80] __crash_kexec at ffffffffb4d86a6a
#2 [ffffbdb687635d08] hex_string at ffffffffb51b3b61
#3 [ffffbdb687635d40] crash_kexec at ffffffffb4d87964
#4 [ffffbdb687635d50] oops_end at ffffffffb4c41fc8
#5 [ffffbdb687635d70] do_trap at ffffffffb4c3e49a
#6 [ffffbdb687635db8] do_error_trap at ffffffffb4c3e6a4
#7 [ffffbdb687635df8] exc_stack_segment at ffffffffb5666b33
#8 [ffffbdb687635e20] asm_exc_stack_segment at ffffffffb5800cf9
...
This patch (of 7)
The function kallsyms_lookup_buildid() initializes the given @namebuf by
clearing the first and the last byte. It is not clear why.
The 1st byte makes sense because some callers ignore the return code and
expect that the buffer contains a valid string, for example:
- function_stat_show()
- kallsyms_lookup()
- kallsyms_lookup_buildid()
The initialization of the last byte does not make much sense because it
can later be overwritten. Fortunately, it seems that all called functions
behave correctly:
- kallsyms_expand_symbol() explicitly adds the trailing '\0'
at the end of the function.
- All *__address_lookup() functions either use the safe strscpy()
or they do not touch the buffer at all.
Document the reason for clearing the first byte. And remove the useless
initialization of the last byte.
Link: https://lkml.kernel.org/r/20251128135920.217303-2-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Aaron Tomlin <atomlin@atomlin.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkman <daniel@iogearbox.net>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Luis Chamberalin <mcgrof@kernel.org>
Cc: Marc Rutland <mark.rutland@arm.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Petr Pavlu <petr.pavlu@suse.com>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Daniel Gomez <da.gomez@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The softlockup_panic sysctl is currently a binary option: panic
immediately or never panic on soft lockups.
Panicking on any soft lockup, regardless of duration, can be overly
aggressive for brief stalls that may be caused by legitimate operations.
Conversely, never panicking may allow severe system hangs to persist
undetected.
Extend softlockup_panic to accept an integer threshold, allowing the
kernel to panic only when the normalized lockup duration exceeds N
watchdog threshold periods. This provides finer-grained control to
distinguish between transient delays and persistent system failures.
The accepted values are:
- 0: Don't panic (unchanged)
- 1: Panic when duration >= 1 * threshold (20s default, original behavior)
- N > 1: Panic when duration >= N * threshold (e.g., 2 = 40s, 3 = 60s.)
The original behavior is preserved for values 0 and 1, maintaining full
backward compatibility while allowing systems to tolerate brief lockups
while still catching severe, persistent hangs.
[lirongqing@baidu.com: v2]
Link: https://lkml.kernel.org/r/20251218074300.4080-1-lirongqing@baidu.com
Link: https://lkml.kernel.org/r/20251216074521.2796-1-lirongqing@baidu.com
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Cc: Eduard Zingerman <eddyz87@gmail.com>
Cc: Hao Luo <haoluo@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Song Liu <song@kernel.org>
Cc: Stanislav Fomichev <sdf@fomichev.me>
Cc: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
kimage_crash_copy_vmcoreinfo() currently assumes vmcoreinfo fits in a
single page. This breaks if VMCOREINFO_BYTES exceeds PAGE_SIZE.
Allocate the required order of control pages and vmap all pages needed to
safely copy vmcoreinfo into the crash kernel image.
Link: https://lkml.kernel.org/r/20251216132801.807260-3-pnina.feder@mobileye.com
Signed-off-by: Pnina Feder <pnina.feder@mobileye.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "vmcoreinfo: support VMCOREINFO_BYTES larger than PAGE_SIZE".
VMCOREINFO_BYTES is defined as a configurable size, but multiple
code paths implicitly assume it always fits into a single page.
This series removes that assumption by allocating and mapping
vmcoreinfo based on its actual size.
Patch 1 updates vmcoreinfo allocation to use get_order(VMCOREINFO_BYTES).
Patch 2 updates crash kernel handling to correctly allocate and map
multiple pages when copying vmcoreinfo.
This makes vmcoreinfo size consistent across the kernel and avoids
future breakage if VMCOREINFO_BYTES grows.
(No functional change when VMCOREINFO_BYTES == PAGE_SIZE.)
This patch (of 2):
VMCOREINFO_BYTES defines the size of vmcoreinfo data, but the current
implementation assumes a single page allocation.
Allocate vmcoreinfo_data using get_order(VMCOREINFO_BYTES) so that
vmcoreinfo can safely grow beyond PAGE_SIZE.
This avoids hidden assumptions and keeps vmcoreinfo size consistent across
the kernel.
Link: https://lkml.kernel.org/r/20251216132801.807260-1-pnina.feder@mobileye.com
Link: https://lkml.kernel.org/r/20251216132801.807260-2-pnina.feder@mobileye.com
Signed-off-by: Pnina Feder <pnina.feder@mobileye.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We were wasting a byte due to an off-by-one bug. s[c]nprintf() doesn't
write more than $2 bytes including the null byte, so trying to pass
'size-1' there is wasting one byte.
This is essentially the same as the previous commit, in a different
file.
Link: https://lkml.kernel.org/r/b4a945a4d40b7104364244f616eb9fb9f1fa691f.1765449750.git.alx@kernel.org
Signed-off-by: Alejandro Colomar <alx@kernel.org>
Cc: Marco Elver <elver@google.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Christopher Bazley <chris.bazley.wg14@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Marco Elver <elver@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Maciej W. Rozycki <macro@orcam.me.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Remove <linux/hex.h> from <linux/kernel.h> and update all users/callers of
hex.h interfaces to directly #include <linux/hex.h> as part of the process
of putting kernel.h on a diet.
Removing hex.h from kernel.h means that 36K C source files don't have to
pay the price of parsing hex.h for the roughly 120 C source files that
need it.
This change has been build-tested with allmodconfig on most ARCHes. Also,
all users/callers of <linux/hex.h> in the entire source tree have been
updated if needed (if not already #included).
Link: https://lkml.kernel.org/r/20251215005206.2362276-1-rdunlap@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Yury Norov (NVIDIA) <yury.norov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
'struct configfs_item_operations' and 'configfs_group_operations' are not
modified in this driver.
Constifying these structures moves some data to a read-only section, so
increases overall security, especially when the structure holds some
function pointers.
On a x86_64, with allmodconfig, as an example:
Before:
======
text data bss dec hex filename
16339 11001 384 27724 6c4c kernel/crash_dump_dm_crypt.o
After:
=====
text data bss dec hex filename
16499 10841 384 27724 6c4c kernel/crash_dump_dm_crypt.o
Link: https://lkml.kernel.org/r/d046ee5666d2f6b1a48ca1a222dfbd2f7c44462f.1765735035.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Coiby Xu <coxu@redhat.com>
Tested-by: Coiby Xu <coxu@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Remove lock from the bpf_timer_cancel() helper. The lock does not
protect from concurrent modification of the bpf_async_cb data fields as
those are modified in the callback without locking.
Use guard(rcu)() instead of pair of explicit lock()/unlock().
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Link: https://lore.kernel.org/r/20260120-timer_nolock-v6-4-670ffdd787b4@meta.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Introduce bpf_async_update_prog_callback(): lock-free update of cb->prog
and cb->callback_fn. This function allows updating prog and callback_fn
fields of the struct bpf_async_cb without holding lock.
For now use it under the lock from __bpf_async_set_callback(), in the
next patches that lock will be removed.
Lock-free algorithm:
* Acquire a guard reference on prog to prevent it from being freed
during the retry loop.
* Retry loop:
1. Each iteration acquires a new prog reference and stores it
in cb->prog via xchg. The previous prog is released.
2. The loop condition checks if both cb->prog and cb->callback_fn
match what we just wrote. If either differs, a concurrent writer
overwrote our value, and we must retry.
3. When we retry, our previously-stored prog was already released by
the concurrent writer or will be released by us after
overwriting.
* Release guard reference.
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Link: https://lore.kernel.org/r/20260120-timer_nolock-v6-3-670ffdd787b4@meta.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Move the timer deletion logic into a dedicated bpf_timer_delete()
helper so it can be reused by later patches.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Link: https://lore.kernel.org/r/20260120-timer_nolock-v6-1-670ffdd787b4@meta.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add check to ensure that ARG_PTR_TO_MEM is used with either MEM_WRITE or
MEM_RDONLY.
Using ARG_PTR_TO_MEM alone without flags does not make sense because:
- If the helper does not change the argument, missing MEM_RDONLY causes the
verifier to incorrectly reject a read-only buffer.
- If the helper does change the argument, missing MEM_WRITE causes the
verifier to incorrectly assume the memory is unchanged, leading to errors
in code optimization.
Co-developed-by: Shuran Liu <electronlsr@gmail.com>
Signed-off-by: Shuran Liu <electronlsr@gmail.com>
Co-developed-by: Peili Gao <gplhust955@gmail.com>
Signed-off-by: Peili Gao <gplhust955@gmail.com>
Co-developed-by: Haoran Ni <haoran.ni.cs@gmail.com>
Signed-off-by: Haoran Ni <haoran.ni.cs@gmail.com>
Signed-off-by: Zesen Liu <ftyghome@gmail.com>
Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260120-helper_proto-v3-2-27b0180b4e77@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
After commit 37cce22dbd ("bpf: verifier: Refactor helper access type tracking"),
the verifier started relying on the access type flags in helper
function prototypes to perform memory access optimizations.
Currently, several helper functions utilizing ARG_PTR_TO_MEM lack the
corresponding MEM_RDONLY or MEM_WRITE flags. This omission causes the
verifier to incorrectly assume that the buffer contents are unchanged
across the helper call. Consequently, the verifier may optimize away
subsequent reads based on this wrong assumption, leading to correctness
issues.
For bpf_get_stack_proto_raw_tp, the original MEM_RDONLY was incorrect
since the helper writes to the buffer. Change it to ARG_PTR_TO_UNINIT_MEM
which correctly indicates write access to potentially uninitialized memory.
Similar issues were recently addressed for specific helpers in commit
ac44dcc788 ("bpf: Fix verifier assumptions of bpf_d_path's output buffer")
and commit 2eb7648558 ("bpf: Specify access type of bpf_sysctl_get_name args").
Fix these prototypes by adding the correct memory access flags.
Fixes: 37cce22dbd ("bpf: verifier: Refactor helper access type tracking")
Co-developed-by: Shuran Liu <electronlsr@gmail.com>
Signed-off-by: Shuran Liu <electronlsr@gmail.com>
Co-developed-by: Peili Gao <gplhust955@gmail.com>
Signed-off-by: Peili Gao <gplhust955@gmail.com>
Co-developed-by: Haoran Ni <haoran.ni.cs@gmail.com>
Signed-off-by: Haoran Ni <haoran.ni.cs@gmail.com>
Signed-off-by: Zesen Liu <ftyghome@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260120-helper_proto-v3-1-27b0180b4e77@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch implements range tracking (interval analysis) for BPF_DIV and
BPF_MOD operations when the divisor is a constant, covering both signed
and unsigned variants.
While LLVM typically optimizes integer division and modulo by constants
into multiplication and shift sequences, this optimization is less
effective for the BPF target when dealing with 64-bit arithmetic.
Currently, the verifier does not track bounds for scalar division or
modulo, treating the result as "unbounded". This leads to false positive
rejections for safe code patterns.
For example, the following code (compiled with -O2):
```c
int test(struct pt_regs *ctx) {
char buffer[6] = {1};
__u64 x = bpf_ktime_get_ns();
__u64 res = x % sizeof(buffer);
char value = buffer[res];
bpf_printk("res = %llu, val = %d", res, value);
return 0;
}
```
Generates a raw `BPF_MOD64` instruction:
```asm
; __u64 res = x % sizeof(buffer);
1: 97 00 00 00 06 00 00 00 r0 %= 0x6
; char value = buffer[res];
2: 18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r1 = 0x0 ll
4: 0f 01 00 00 00 00 00 00 r1 += r0
5: 91 14 00 00 00 00 00 00 r4 = *(s8 *)(r1 + 0x0)
```
Without this patch, the verifier fails with "math between map_value
pointer and register with unbounded min value is not allowed" because
it cannot deduce that `r0` is within [0, 5].
According to the BPF instruction set[1], the instruction's offset field
(`insn->off`) is used to distinguish between signed (`off == 1`) and
unsigned division (`off == 0`). Moreover, we also follow the BPF division
and modulo runtime behavior (semantics) to handle special cases, such as
division by zero and signed division overflow.
- UDIV: dst = (src != 0) ? (dst / src) : 0
- SDIV: dst = (src == 0) ? 0 : ((src == -1 && dst == LLONG_MIN) ? LLONG_MIN : (dst / src))
- UMOD: dst = (src != 0) ? (dst % src) : dst
- SMOD: dst = (src == 0) ? dst : ((src == -1 && dst == LLONG_MIN) ? 0: (dst s% src))
Here is the overview of the changes made in this patch (See the code comments
for more details and examples):
1. For BPF_DIV: Firstly check whether the divisor is zero. If so, set the
destination register to zero (matching runtime behavior).
For non-zero constant divisors: goto `scalar(32)?_min_max_(u|s)div` functions.
- General cases: compute the new range by dividing max_dividend and
min_dividend by the constant divisor.
- Overflow case (SIGNED_MIN / -1) in signed division: mark the result
as unbounded if the dividend is not a single number.
2. For BPF_MOD: Firstly check whether the divisor is zero. If so, leave the
destination register unchanged (matching runtime behavior).
For non-zero constant divisors: goto `scalar(32)?_min_max_(u|s)mod` functions.
- General case: For signed modulo, the result's sign matches the
dividend's sign. And the result's absolute value is strictly bounded
by `min(abs(dividend), abs(divisor) - 1)`.
- Special care is taken when the divisor is SIGNED_MIN. By casting
to unsigned before negation and subtracting 1, we avoid signed
overflow and correctly calculate the maximum possible magnitude
(`res_max_abs` in the code).
- "Small dividend" case: If the dividend is already within the possible
result range (e.g., [-2, 5] % 10), the operation is an identity
function, and the destination register remains unchanged.
3. In `scalar(32)?_min_max_(u|s)(div|mod)` functions: After updating current
range, reset other ranges and tnum to unbounded/unknown.
e.g., in `scalar_min_max_sdiv`, signed 64-bit range is updated. Then reset
unsigned 64-bit range and 32-bit range to unbounded, and tnum to unknown.
Exception: in BPF_MOD's "small dividend" case, since the result remains
unchanged, we do not reset other ranges/tnum.
4. Also updated existing selftests based on the expected BPF_DIV and
BPF_MOD behavior.
[1] https://www.kernel.org/doc/Documentation/bpf/standardization/instruction-set.rst
Co-developed-by: Shenghao Yuan <shenghaoyuan0928@163.com>
Signed-off-by: Shenghao Yuan <shenghaoyuan0928@163.com>
Co-developed-by: Tianci Cao <ziye@zju.edu.cn>
Signed-off-by: Tianci Cao <ziye@zju.edu.cn>
Signed-off-by: Yazhou Tang <tangyazhou518@outlook.com>
Tested-by: syzbot@syzkaller.appspotmail.com
Link: https://lore.kernel.org/r/20260119085458.182221-2-tangyazhou@zju.edu.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Implement bpf_stream_vprintk with an implicit bpf_prog_aux argument,
and remote bpf_stream_vprintk_impl from the kernel.
Update the selftests to use the new API with implicit argument.
bpf_stream_vprintk macro is changed to use the new bpf_stream_vprintk
kfunc, and the extern definition of bpf_stream_vprintk_impl is
replaced accordingly.
Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Link: https://lore.kernel.org/r/20260120222638.3976562-11-ihor.solodrai@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Implement bpf_task_work_schedule_* with an implicit bpf_prog_aux
argument, and remove corresponding _impl funcs from the kernel.
Update special kfunc checks in the verifier accordingly.
Update the selftests to use the new API with implicit argument.
Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Link: https://lore.kernel.org/r/20260120222638.3976562-10-ihor.solodrai@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Implement bpf_wq_set_callback() with an implicit bpf_prog_aux
argument, and remove bpf_wq_set_callback_impl().
Update special kfunc checks in the verifier accordingly.
Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Link: https://lore.kernel.org/r/20260120222638.3976562-8-ihor.solodrai@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
A kernel function bpf_foo marked with KF_IMPLICIT_ARGS flag is
expected to have two associated types in BTF:
* `bpf_foo` with a function prototype that omits implicit arguments
* `bpf_foo_impl` with a function prototype that matches the kernel
declaration of `bpf_foo`, but doesn't have a ksym associated with
its name
In order to support kfuncs with implicit arguments, the verifier has
to know how to resolve a call of `bpf_foo` to the correct BTF function
prototype and address.
To implement this, in add_kfunc_call() kfunc flags are checked for
KF_IMPLICIT_ARGS. For such kfuncs a BTF func prototype is adjusted to
the one found for `bpf_foo_impl` (func_name + "_impl" suffix, by
convention) function in BTF.
This effectively changes the signature of the `bpf_foo` kfunc in the
context of verification: from one without implicit args to the one
with full argument list.
The values of implicit arguments by design are provided by the
verifier, and so they can only be of particular types. In this patch
the only allowed implicit arg type is a pointer to struct
bpf_prog_aux.
In order for the verifier to correctly set an implicit bpf_prog_aux
arg value at runtime, is_kfunc_arg_prog() is extended to check for the
arg type. At a point when prog arg is determined in check_kfunc_args()
the kfunc with implicit args already has a prototype with full
argument list, so the existing value patch mechanism just works.
If a new kfunc with KF_IMPLICIT_ARG is declared for an existing kfunc
that uses a __prog argument (a legacy case), the prototype
substitution works in exactly the same way, assuming the kfunc follows
the _impl naming convention. The difference is only in how _impl
prototype is added to the BTF, which is not the verifier's
concern. See a subsequent resolve_btfids patch for details.
__prog suffix is still supported at this point, but will be removed in
a subsequent patch, after current users are moved to KF_IMPLICIT_ARGS.
Introduction of KF_IMPLICIT_ARGS revealed an issue with zero-extension
tracking, because an explicit rX = 0 in place of the verifier-supplied
argument is now absent if the arg is implicit (the BPF prog doesn't
pass a dummy NULL anymore). To mitigate this, reset the subreg_def of
all caller saved registers in check_kfunc_call() [1].
[1] https://lore.kernel.org/bpf/b4a760ef828d40dac7ea6074d39452bb0dc82caa.camel@gmail.com/
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Link: https://lore.kernel.org/r/20260120222638.3976562-4-ihor.solodrai@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
There is code duplication between add_kfunc_call() and
fetch_kfunc_meta() collecting information about a kfunc from BTF.
Introduce struct bpf_kfunc_meta to hold common kfunc BTF data and
implement fetch_kfunc_meta() to fill it in, instead of struct
bpf_kfunc_call_arg_meta directly.
Then use these in add_kfunc_call() and (new) fetch_kfunc_arg_meta()
functions, and fixup previous usages of fetch_kfunc_meta() to
fetch_kfunc_arg_meta().
Besides the code dedup, this change enables add_kfunc_call() to access
kfunc->flags.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Link: https://lore.kernel.org/r/20260120222638.3976562-3-ihor.solodrai@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
btf_kfunc_id_set_contains() is called by fetch_kfunc_meta() in the BPF
verifier to get the kfunc flags stored in the .BTF_ids ELF section.
If it returns NULL instead of a valid pointer, it's interpreted as an
illegal kfunc usage failing the verification.
There are two potential reasons for btf_kfunc_id_set_contains() to
return NULL:
1. Provided kfunc BTF id is not present in relevant kfunc id sets.
2. The kfunc is not allowed, as determined by the program type
specific filter [1].
The filter functions accept a pointer to `struct bpf_prog`, so they
might implicitly depend on earlier stages of verification, when
bpf_prog members are set.
For example, bpf_qdisc_kfunc_filter() in linux/net/sched/bpf_qdisc.c
inspects prog->aux->st_ops [2], which is initialized in:
check_attach_btf_id() -> check_struct_ops_btf_id()
So far this hasn't been an issue, because fetch_kfunc_meta() is the
only caller of btf_kfunc_id_set_contains().
However in subsequent patches of this series it is necessary to
inspect kfunc flags earlier in BPF verifier, in the add_kfunc_call().
To resolve this, refactor btf_kfunc_id_set_contains() into two
interface functions:
* btf_kfunc_flags() that simply returns pointer to kfunc_flags
without applying the filters
* btf_kfunc_is_allowed() that both checks for kfunc_flags existence
(which is a requirement for a kfunc to be allowed) and applies the
prog filters
See [3] for the previous version of this patch.
[1] https://lore.kernel.org/all/20230519225157.760788-7-aditi.ghag@isovalent.com/
[2] https://lore.kernel.org/all/20250409214606.2000194-4-ameryhung@gmail.com/
[3] https://lore.kernel.org/bpf/20251029190113.3323406-3-ihor.solodrai@linux.dev/
Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Link: https://lore.kernel.org/r/20260120222638.3976562-2-ihor.solodrai@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
- A 4 patch series from David Hildenbrand which fixes a few things
realted to hugetlb PMD sharing
- The remainder are singletons, please see their changelogs for details.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaW/vEQAKCRDdBJ7gKXxA
jtMcAQCK1tKnINRmVjY3UJCqZMAaXvOdOoUIgHDaTXD/DWKm9AD9HRwWzYB4+TNr
k/Te8F33d418LcMBTW9CLhrplQpaIAI=
=+d1A
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2026-01-20-13-09' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
- A patch series from David Hildenbrand which fixes a few things
related to hugetlb PMD sharing
- The remainder are singletons, please see their changelogs for details
* tag 'mm-hotfixes-stable-2026-01-20-13-09' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm: restore per-memcg proactive reclaim with !CONFIG_NUMA
mm/kfence: fix potential deadlock in reboot notifier
Docs/mm/allocation-profiling: describe sysctrl limitations in debug mode
mm: do not copy page tables unnecessarily for VM_UFFD_WP
mm/hugetlb: fix excessive IPI broadcasts when unsharing PMD tables using mmu_gather
mm/rmap: fix two comments related to huge_pmd_unshare()
mm/hugetlb: fix two comments related to huge_pmd_unshare()
mm/hugetlb: fix hugetlb_pmd_shared()
mm: remove unnecessary and incorrect mmap lock assert
x86/kfence: avoid writing L1TF-vulnerable PTEs
mm/vma: do not leak memory when .mmap_prepare swaps the file
migrate: correct lock ordering for hugetlb file folios
panic: only warn about deprecated panic_print on write access
fs/writeback: skip AS_NO_DATA_INTEGRITY mappings in wait_sb_inodes()
mm: take into account mm_cid size for mm_struct static definitions
mm: rename cpu_bitmap field to flexible_array
mm: add missing static initializer for init_mm::mm_cid.lock
Currently, reset_idmap_scratch() performs a 4.7KB memset() in every
states_equal() call. Optimize this by using a counter to track used
ID mappings, replacing the O(N) memset() with an O(1) reset and
bounding the search loop in check_ids().
Signed-off-by: Qiliang Yuan <realwujing@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20260120023234.77673-1-realwujing@gmail.com
After commit 4fa8d68aa5 ("bpf: Convert hashtab.c to rqspinlock")
we no longer use HASHTAB_MAP_LOCK_{COUNT,MASK} as the per-CPU
map_locked[HASHTAB_MAP_LOCK_COUNT] array got removed from struct
bpf_htab. Right now it is still accounted for in htab_map_mem_usage.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/09703eb6bb249f12b1d5253b5a50a0c4fa239d27.1768913513.git.daniel@iogearbox.net
sync_linked_regs() is called after a conditional jump to propagate new
bounds of a register to all its liked registers. But the verifier log
only prints the state of the register that is part of the conditional
jump.
Make sync_linked_regs() scratch the registers whose bounds have been
updated by propagation from a known register.
Before:
0: (85) call bpf_get_prandom_u32#7 ; R0=scalar()
1: (57) r0 &= 255 ; R0=scalar(smin=smin32=0,smax=umax=smax32=umax32=255,var_off=(0x0; 0xff))
2: (bf) r1 = r0 ; R0=scalar(id=1,smin=smin32=0,smax=umax=smax32=umax32=255,var_off=(0x0; 0xff)) R1=scalar(id=1,smin=smin32=0,smax=umax=smax32=umax32=255,var_off=(0x0; 0xff))
3: (07) r1 += 4 ; R1=scalar(id=1+4,smin=umin=smin32=umin32=4,smax=umax=smax32=umax32=259,var_off=(0x0; 0x1ff))
4: (a5) if r1 < 0xa goto pc+2 ; R1=scalar(id=1+4,smin=umin=smin32=umin32=10,smax=umax=smax32=umax32=259,var_off=(0x0; 0x1ff))
5: (35) if r0 >= 0x6 goto pc+1
After:
0: (85) call bpf_get_prandom_u32#7 ; R0=scalar()
1: (57) r0 &= 255 ; R0=scalar(smin=smin32=0,smax=umax=smax32=umax32=255,var_off=(0x0; 0xff))
2: (bf) r1 = r0 ; R0=scalar(id=1,smin=smin32=0,smax=umax=smax32=umax32=255,var_off=(0x0; 0xff)) R1=scalar(id=1,smin=smin32=0,smax=umax=smax32=umax32=255,var_off=(0x0; 0xff))
3: (07) r1 += 4 ; R1=scalar(id=1+4,smin=umin=smin32=umin32=4,smax=umax=smax32=umax32=259,var_off=(0x0; 0x1ff))
4: (a5) if r1 < 0xa goto pc+2 ; R0=scalar(id=1+0,smin=umin=smin32=umin32=6,smax=umax=smax32=umax32=255) R1=scalar(id=1+4,smin=umin=smin32=umin32=10,smax=umax=smax32=umax32=259,var_off=(0x0; 0x1ff))
5: (35) if r0 >= 0x6 goto pc+1
The conditional jump in 4 updates the bound of R1 and the new bounds are
propogated to R0 as it is linked with the same id, before this change,
verifier only printed the state for R1 but after it prints for both R0
and R1.
Suggested-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/bpf/20260116141436.3715322-1-puranjay@kernel.org
- minor fixes for the corner cases of the SWIOTLB pool management (Robin Murphy)
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSrngzkoBtlA8uaaJ+Jp1EFxbsSRAUCaW+NbgAKCRCJp1EFxbsS
RKU3AQCIpNwIYN6evmnTXO/Y+AFRjsrb1pE1cUyHn5QTY4/o4wEA6BiSgMCrZEbv
HTEs1ZSHgLOwBwZre1Z11icGg3BkRgY=
=6fBC
-----END PGP SIGNATURE-----
Merge tag 'dma-mapping-6.19-2026-01-20' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux
Pull dma-mapping fixes from Marek Szyprowski:
- minor fixes for the corner cases of the SWIOTLB pool management
(Robin Murphy)
* tag 'dma-mapping-6.19-2026-01-20' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
dma/pool: Avoid allocating redundant pools
mm_zone: Generalise has_managed_dma()
dma/pool: Improve pool lookup
Once the aggrprobe is fully reverted in do_free_cleaned_kprobes(), retry
optimize_kprobe() on that sibling so it can return to OPTIMIZED.
Also remove the stale comment in __disarm_kprobe().
Link: https://lore.kernel.org/all/349359900266B25F+20260115023804.3951960-2-hongao@uniontech.com/
Signed-off-by: hongao <hongao@uniontech.com>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
When __do_ajdtimex() was introduced to handle adjtimex for any
timekeeper, this reference to tk_core was not updated. When called on an
auxiliary timekeeper, the core timekeeper would be updated incorrectly.
This gets caught by the lock debugging diagnostics because the
timekeepers sequence lock gets written to without holding its
associated spinlock:
WARNING: include/linux/seqlock.h:226 at __do_adjtimex+0x394/0x3b0, CPU#2: test/125
aux_clock_adj (kernel/time/timekeeping.c:2979)
__do_sys_clock_adjtime (kernel/time/posix-timers.c:1161 kernel/time/posix-timers.c:1173)
do_syscall_64 (arch/x86/entry/syscall_64.c:63 (discriminator 1) arch/x86/entry/syscall_64.c:94 (discriminator 1))
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:131)
Update the correct auxiliary timekeeper.
Fixes: 775f71ebed ("timekeeping: Make do_adjtimex() reusable")
Fixes: ecf3e70304 ("timekeeping: Provide adjtimex() for auxiliary clocks")
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260120-timekeeper-auxclock-leapstate-v1-1-5b358c6b3cfd@linutronix.de
The panic_print_deprecated() warning is being triggered on both read and
write operations to the panic_print parameter.
This causes spurious warnings when users run 'sysctl -a' to list all
sysctl values, since that command reads /proc/sys/kernel/panic_print and
triggers the deprecation notice.
Modify the handlers to only emit the deprecation warning when the
parameter is actually being set:
- sysctl_panic_print_handler(): check 'write' flag before warning.
- panic_print_get(): remove the deprecation call entirely.
This way, users are only warned when they actively try to use the
deprecated parameter, not when passively querying system state.
Link: https://lkml.kernel.org/r/20260106163321.83586-1-gal@nvidia.com
Fixes: ee13240cd7 ("panic: add note that panic_print sysctl interface is deprecated")
Fixes: 2683df6539 ("panic: add note that 'panic_print' parameter is deprecated")
Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Nimrod Oren <noren@nvidia.com>
Cc: Feng Tang <feng.tang@linux.alibaba.com>
Cc: Joel Granados <joel.granados@kernel.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
- Add Chen Ridong as cpuset reviewer.
- Add SPDX license identifiers to cgroup files that were missing them.
-----BEGIN PGP SIGNATURE-----
iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaW0ToA4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGel7AQD3eTbiR/OU0i5yJj6YZsIdVYcteqqsWBkDFnZ3
9pZZugEA+KCwi5eQz+cryKZ7y5fU5g5O6f4OP/lfLEcSmFDpfQU=
=QHwy
-----END PGP SIGNATURE-----
Merge tag 'cgroup-for-6.19-rc5-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
- Add Chen Ridong as cpuset reviewer
- Add SPDX license identifiers to cgroup files that were missing them
* tag 'cgroup-for-6.19-rc5-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
kernel: cgroup: Add LGPL-2.1 SPDX license ID to legacy_freezer.c
kernel: cgroup: Add SPDX-License-Identifier lines
MAINTAINERS: Add Chen Ridong as cpuset reviewer
may result in incorrect skipping of hrtimer IPIs.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmlsr+oRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1iLPw/+OuA8hPlFY8nxrpVKASoiP7u+mGQ82Qmf
xdf1Kimpmc5ydpGsxpRLKY19l3fLuGSJuOa2Url0Ro52JHxU+yPPOkN2ONjsPxLI
mFMdhyXiS1PIgYt307OmNmKFadxdg5cGJTKY0wIS9JCcPFxI5Y3NFhWZ4ybPlGHK
/bxtsjUctekUTDVahqP65lRIYMvjhO1I2LQfoPQyg/C+2n6Owy7TAX8xZovDZLHz
a9RkkX6FPc8aR8Kj1VYm1yABNlUhGvFvTE6wfcfOMjiHHeDv2qYdwNHF2tK2eoKC
QX609zBTCT/n0ioZlikLrKDZitj6uRgS2blBc+hodHEs4cY5gE2umWSU4IirT7HH
GDjvG/vns/BLRr97gPdMaJ+N/sjdCSQ5tsDRrvHwCGC27VXPu3O3DYU128eH4eXW
+KNJGQHJvR7slBKElQ4wxJaLhbLBB5rw/AqqdXa4PU8LqvMBBNcLZjHnrYlbD5Ei
ovdX5VtRYOjrVl7pCNrXXdzFfBpXK+sOPHQMoJxnLHKXO+cdsZz8zIB7AHZEVe20
CIOMRcfgw2nvpS+0MqaHzre0ixyq+U2XSIu7MRIlq8GZ+zdjiVMqx3pPa19QRdyr
hA9JupeRC9eZdymlms2MiSzqIVmWSVvgcyhIFSYw9yLe55vXcGyArF27RKGtBHRA
uF+K5aT8gqA=
=CAdu
-----END PGP SIGNATURE-----
Merge tag 'timers-urgent-2026-01-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Ingo Molnar:
"Fix the update_needs_ipi() check in the hrtimer code that may result
in incorrect skipping of hrtimer IPIs"
* tag 'timers-urgent-2026-01-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
hrtimer: Fix softirq base check in update_needs_ipi()
discovered and fixed recently:
- Fix a race condition in the DL server
- Fix a DL server bug which can result in incorrectly going
idle when there's work available
- Fix DL server bug which triggers a WARN() due to broken
get_prio_dl() logic and subsequent misbehavior
- Fix double update_rq_clock() calls
- Fix setscheduler() assumption about static priorities
- Make sure balancing callbacks are always called
- Plus a handful of preparatory commits for the fixes.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmlsrfIRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1hgUA/9HQi51sxF/TlhKeoj9Z0zSO/L5Fbd7d5O
ucsr+exoU7B+X95AnYC64pV3YlTNibD5y6HR4zpPzhlmEdFw8K9hTmDeSOF2cB0T
dPdEY2KtvNoQwVwytPYKMsXtBTcJtZZqOec9BgGBzSIpjPm6Zhxr1J2/lKFy/udA
00aqNuJwfQW94dxvyPf+jPXbm0JW9HHZGNuuOXkkmZpkH+HIzBDrTWcAjHUaZt9H
doPyEfO08dg3Uu4OxNBtxx94X7bQsn5RU1plSEc+xF7zynqwEIwDbelFftpHiM+z
UqfFgkzCV8A8o/72BfZVhQgaspMZ+M4XSsaGBbouOTg3Rgx8gRKC23s8/xhDJSpy
ayMWhuR5bsI04GD1xZVOwLvUi8oDIgAFIAKRI6gcW5eQELlejii0bAvs9Jw9/OjI
QhhrX54sJCPPKMZpKO2n+AzQvv9a6+/p7ikiA2U5dR4wBEERFUw8ffRguHwSm+BU
U1M+t+2tw+UiC7yML0zvIC3HdyfBhj33ZeW7PWyvFU0vjGm78U6NjqliAxbqW21l
GdtQUpi9aXCtijXqHhLFvBtWGmqowj9UQiHroC0+nq1BsTpBDe575vwJNYF2K9he
ht0hZqreUIf7xTsJHXlLFbtFxPQw0xAKzWJWUswaaqO8eS2JcWJbmHUAsHQuxmJB
MbkJAlU7j/w=
=DgiW
-----END PGP SIGNATURE-----
Merge tag 'sched-urgent-2026-01-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
"Misc deadline scheduler fixes, mainly for a new category of bugs that
were discovered and fixed recently:
- Fix a race condition in the DL server
- Fix a DL server bug which can result in incorrectly going idle when
there's work available
- Fix DL server bug which triggers a WARN() due to broken
get_prio_dl() logic and subsequent misbehavior
- Fix double update_rq_clock() calls
- Fix setscheduler() assumption about static priorities
- Make sure balancing callbacks are always called
- Plus a handful of preparatory commits for the fixes"
* tag 'sched-urgent-2026-01-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/deadline: Use ENQUEUE_MOVE to allow priority change
sched: Deadline has dynamic priority
sched: Audit MOVE vs balance_callbacks
sched: Fold rq-pin swizzle into __balance_callbacks()
sched/deadline: Avoid double update_rq_clock()
sched/deadline: Ensure get_prio_dl() is up-to-date
sched/deadline: Fix server stopping with runnable tasks
sched: Provide idle_rq() helper
sched/deadline: Fix potential race in dl_add_task_root_domain()
sched/deadline: Remove unnecessary comment in dl_add_task_root_domain()
Add GPL-2.0 SPDX-License-Identifier lines to some files,
and remove a reference to COPYING, and boilerplate warranty
text, from offload.c.
Signed-off-by: Tim Bird <tim.bird@sony.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20260115013129.598705-1-tim.bird@sony.com
Add __force annotations to casts that convert between __user and kernel
address spaces. These casts are intentional:
- In bpf_send_signal_common(), the value is stored in si_value.sival_ptr
which is typed as void __user *, but the value comes from a BPF
program parameter.
- In the bpf_*_dynptr() kfuncs, user pointers are cast to const void *
before being passed to copy helper functions that correctly handle
the user address space through copy_from_user variants.
Without __force, sparse reports:
warning: cast removes address space '__user' of expression
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20260115184509.3585759-1-mykyta.yatsenko5@gmail.com
Closes: https://lore.kernel.org/oe-kbuild-all/202601131740.6C3BdBaB-lkp@intel.com/
- Fix a memory leak in em_create_pd() error path (Malaya Kumar Rout)
- Fix stale description of the cost field in struct em_perf_state to
reflect the current code (Yaxiong Tian)
- Fix and revamp the energy model YNL specification added recently
along with the energy model netlink interface (Changwoo Min)
-----BEGIN PGP SIGNATURE-----
iQFGBAABCAAwFiEEcM8Aw/RY0dgsiRUR7l+9nS/U47UFAmlqWS8SHHJqd0Byand5
c29ja2kubmV0AAoJEO5fvZ0v1OO1BhAH/iSA/9YP9/xIW+rIZj2irsncGVc/hJ+V
8PcM2tar966AEQBlmDPc7ug2MXT0Mb/5WRtQo/IvZ1tdAALB+bC70FHjdu3y7SAX
IzFGyHKJ9OVJHGxYq0TCBbRXgYubUZiqvAnaBTBZ/ZIFlt3B4KEyatfFfxJMb1Pe
H9zQIyUVBN1tjPfgNVbc22XGfkwfmmla72le6GmHseKZRqpwutIwzxm1PoMNVeLb
fQv625LsWrDD9NU6j5uGq3WEq3xPltubtHN545FTFi4aGwBYJaG1FuIi+kPiQuCR
VZmxTWVEiVwjttiZf/xWbOkPfwQqdgeXlTk5j+iBlF1jkrrW75IuLYQ=
=eJ2p
-----END PGP SIGNATURE-----
Merge tag 'pm-6.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These fix an error path memory leak in the energy model management
code, fix a kerneldoc comment in it, and fix and revamp the energy
model YNL specification added recently along with the new energy model
management netlink interface (that received feedback after being
added):
- Fix a memory leak in em_create_pd() error path (Malaya Kumar Rout)
- Fix stale description of the cost field in struct em_perf_state to
reflect the current code (Yaxiong Tian)
- Fix and revamp the energy model YNL specification added recently
along with the energy model netlink interface (Changwoo Min)"
* tag 'pm-6.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM: EM: Add dump to get-perf-domains in the EM YNL spec
PM: EM: Change cpus' type from string to u64 array in the EM YNL spec
PM: EM: Rename em.yaml to dev-energymodel.yaml
PM: EM: Fix yamllint warnings in the EM YNL spec
PM: EM: Fix memory leak in em_create_pd() error path
PM: EM: Fix incorrect description of the cost field in struct em_perf_state
The irq_chip_pm_put() return value is only used in __irq_do_set_handler()
to trigger a WARN_ON() if it is negative, but doing so is not useful
because irq_chip_pm_put() simply passes the pm_runtime_put() return value
to its callers.
Returning an error code from pm_runtime_put() merely means that it has
not queued up a work item to check whether or not the device can be
suspended and there are many perfectly valid situations in which that
can happen, like after writing "on" to the devices' runtime PM "control"
attribute in sysfs for one example.
For this reason, modify irq_chip_pm_put() to discard the pm_runtime_put()
return value, change its return type to void, and drop the WARN_ON()
around the irq_chip_pm_put() invocation from __irq_do_set_handler().
Also update the irq_chip_pm_put() kerneldoc comment to be more accurate.
This will facilitate a planned change of the pm_runtime_put() return
type to void in the future.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/5075294.31r3eYUQgx@rafael.j.wysocki
sync_linked_regs() copies the id of known_reg to reg when propagating
bounds of known_reg to reg using the off of known_reg, but when
known_reg was linked to reg like:
known_reg = reg ; both known_reg and reg get same id
known_reg += 4 ; known_reg gets off = 4, and its id gets BPF_ADD_CONST
now when a call to sync_linked_regs() happens, let's say with the following:
if known_reg >= 10 goto pc+2
known_reg's new bounds are propagated to reg but now reg gets
BPF_ADD_CONST from the copy.
This means if another link to reg is created like:
another_reg = reg ; another_reg should get the id of reg but
assign_scalar_id_before_mov() sees
BPF_ADD_CONST on reg and assigns a new id to it.
As reg has a new id now, known_reg's link to reg is broken. If we find
new bounds for known_reg, they will not be propagated to reg.
This can be seen in the selftest added in the next commit:
0: (85) call bpf_get_prandom_u32#7 ; R0=scalar()
1: (57) r0 &= 255 ; R0=scalar(smin=smin32=0,smax=umax=smax32=umax32=255,var_off=(0x0; 0xff))
2: (bf) r1 = r0 ; R0=scalar(id=1,smin=smin32=0,smax=umax=smax32=umax32=255,var_off=(0x0; 0xff)) R1=scalar(id=1,smin=smin32=0,smax=umax=smax32=umax32=255,var_off=(0x0; 0xff))
3: (07) r1 += 4 ; R1=scalar(id=1+4,smin=umin=smin32=umin32=4,smax=umax=smax32=umax32=259,var_off=(0x0; 0x1ff))
4: (a5) if r1 < 0xa goto pc+4 ; R1=scalar(id=1+4,smin=umin=smin32=umin32=10,smax=umax=smax32=umax32=259,var_off=(0x0; 0x1ff))
5: (bf) r2 = r0 ; R0=scalar(id=2,smin=umin=smin32=umin32=6,smax=umax=smax32=umax32=255) R2=scalar(id=2,smin=umin=smin32=umin32=6,smax=umax=smax32=umax32=255)
6: (a5) if r1 < 0xe goto pc+2 ; R1=scalar(id=1+4,smin=umin=smin32=umin32=14,smax=umax=smax32=umax32=259,var_off=(0x0; 0x1ff))
7: (35) if r0 >= 0xa goto pc+1 ; R0=scalar(id=2,smin=umin=smin32=umin32=6,smax=umax=smax32=umax32=9,var_off=(0x0; 0xf))
8: (37) r0 /= 0
div by zero
When 4 is verified, r1's bounds are propagated to r0 but r0 also gets
BPF_ADD_CONST (bug).
When 5 is verified, r0 gets a new id (2) and its link with r1 is broken.
After 6 we know r1 has bounds [14, 259] and therefore r0 should have
bounds [10, 255], therefore the branch at 7 is always taken. But because
r0's id was changed to 2, r1's new bounds are not propagated to r0.
The verifier still thinks r0 has bounds [6, 255] before 7 and execution
can reach div by zero.
Fix this by preserving id in sync_linked_regs() like off and subreg_def.
Fixes: 98d7ca374b ("bpf: Track delta between "linked" registers.")
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260115151143.1344724-2-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJPBAABCAA5FiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAmlqGMYbFIAAAAAABAAO
bWFudTIsMi41KzEuMTEsMiwyAAoJEFKgDEdIgJTyT0UP/3Wn4tm0h0n3XeyujQEA
IPNisJCczF6aWIAuwccRigR8hQFriPNvkZ5AhMGtotZJrY3uUoe1/8aF+XIdqnl/
Yb1434AGwVNIpSaap+vtEclHrLDmzZH+Z75/FTWQwldM/hPkPWJI8fsEuRLqBZsn
v520NBFtrQVcOZKKNy0npBnHsC0DsAmqoZuOvLTx0mx5AyE029CfPbDMZuVnSNix
KjZ4U5KL0qDs2LIMpdB/mqprydGkHdogdIbrPK3WtzStVgNbi9VmnV19ZwbUlXJM
rYPbtbQg3htwuspgR+yM6O21qsthRf2qZF5+2/a929IzOBsD/qAXQbbxQWVpF7Qb
ELYXNV4N5hqm9EW8WeOOpLKUUG7k0fRPf81X/07uGVafPMQKQJ8kFNgLBkBFR4ya
RAMNxTPHbHQvVaLcRujxZXoC4Wh3ZTunQXpIouy0p9dKOzbsCAj0ZqeqDa09UsaW
rCEm50p/Pd1csML8a9A/2nNoWjQzuSVmML7F6obGCOWaW6p21GhSKHzqqDIjBab9
3wxhpllVeYRYS2yhkKjOPJkQKXo3idIdpieLpW8IVbJvp/gQgmeKjWdhnvNvUan9
hyCzfI8OZXVz0vItBfWsoX44+6UtpLHd4o16aDYjyDItflJOPuQbWzky+icT2R3x
8B5xPC08tmEGitf3miv2EGq9
=d/xV
-----END PGP SIGNATURE-----
Merge tag 'printk-for-6.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux
Pull printk fix from Petr Mladek:
- Prevent softlockup by restoring IRQs in atomic flush after each
record
* tag 'printk-for-6.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
printk/nbcon: Restore IRQ in atomic flush after each emitted record
Insert Soft Reserved memory into a dedicated soft_reserve_resource tree
instead of the iomem_resource tree at boot. Delay publishing these ranges
into the iomem hierarchy until ownership is resolved and the HMEM path
is ready to consume them.
Publishing Soft Reserved ranges into iomem too early conflicts with CXL
hotplug and prevents region assembly when those ranges overlap CXL
windows.
Follow up patches will reinsert Soft Reserved ranges into iomem after CXL
window publication is complete and HMEM is ready to claim the memory. This
provides a cleaner handoff between EFI-defined memory ranges and CXL
resource management without trimming or deleting resources later.
In the meantime "Soft Reserved" resources will no longer appear in
/proc/iomem, only their results. I.e. with "memmap=4G%4G+0xefffffff"
Before:
100000000-1ffffffff : Soft Reserved
100000000-1ffffffff : dax1.0
100000000-1ffffffff : System RAM (kmem)
After:
100000000-1ffffffff : dax1.0
100000000-1ffffffff : System RAM (kmem)
The expectation is that this does not lead to a user visible regression
because the dax1.0 device is created in both instances.
Co-developed-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
[Smita: incorporate feedback from x86 maintainer review]
Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
Link: https://patch.msgid.link/20251120031925.87762-2-Smita.KoralahalliChannabasappa@amd.com
[djbw: cleanups and clarifications]
Link: https://lore.kernel.org/69443f707b025_1cee10022@dwillia2-mobl4.notmuch
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
This script
#!/usr/bin/bash
echo 0 > /proc/sys/kernel/randomize_va_space
echo 'void main(void) {}' > TEST.c
# -fcf-protection to ensure that the 1st endbr32 insn can't be emulated
gcc -m32 -fcf-protection=branch TEST.c -o test
bpftrace -e 'uprobe:./test:main {}' -c ./test
"hangs", the probed ./test task enters an endless loop.
The problem is that with randomize_va_space == 0
get_unmapped_area(TASK_SIZE - PAGE_SIZE) called by xol_add_vma() can not
just return the "addr == TASK_SIZE - PAGE_SIZE" hint, this addr is used
by the stack vma.
arch_get_unmapped_area_topdown() doesn't take TIF_ADDR32 into account and
in_32bit_syscall() is false, this leads to info.high_limit > TASK_SIZE.
vm_unmapped_area() happily returns the high address > TASK_SIZE and then
get_unmapped_area() returns -ENOMEM after the "if (addr > TASK_SIZE - len)"
check.
handle_swbp() doesn't report this failure (probably it should) and silently
restarts the probed insn. Endless loop.
I think that the right fix should change the x86 get_unmapped_area() paths
to rely on TIF_ADDR32 rather than in_32bit_syscall(). Note also that if
CONFIG_X86_X32_ABI=y, in_x32_syscall() falsely returns true in this case
because ->orig_ax = -1.
But we need a simple fix for -stable, so this patch just sets TS_COMPAT if
the probed task is 32-bit to make in_ia32_syscall() true.
Fixes: 1b028f784e ("x86/mm: Introduce mmap_compat_base() for 32-bit mmap()")
Reported-by: Paulo Andrade <pandrade@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/aV5uldEvV7pb4RA8@redhat.com/
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/aWO7Fdxn39piQnxu@redhat.com
Merge fixes related to the energy model management for 6.19-rc6:
- Fix a memory leak in em_create_pd() error path (Malaya Kumar Rout)
- Fix stale description of the cost field in struct em_perf_state to
reflect the current code (Yaxiong Tian)
- Fix and revamp the energy model YNL specification added recently
along with the energy model netlink interface (Changwoo Min)
* pm-em:
PM: EM: Add dump to get-perf-domains in the EM YNL spec
PM: EM: Change cpus' type from string to u64 array in the EM YNL spec
PM: EM: Rename em.yaml to dev-energymodel.yaml
PM: EM: Fix yamllint warnings in the EM YNL spec
PM: EM: Fix memory leak in em_create_pd() error path
PM: EM: Fix incorrect description of the cost field in struct em_perf_state
Add SPDX-License-Identifier lines to some old kernel
files.
Signed-off-by: Tim Bird <tim.bird@sony.com>
Acked-by: Karim Yaghmour <karim.yaghmour@opersys.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Add an appropriate SPDX-License-Identifier line to the file,
and remove the GNU boilerplate text.
Signed-off-by: Tim Bird <tim.bird@sony.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Add GPL-2.0 SPDX license id lines to a few old
files, replacing the reference to the COPYING file.
The COPYING file at the time of creation of these files
(2007 and 2005) was GPL-v2.0, with an additional clause
indicating that only v2 applied.
Signed-off-by: Tim Bird <tim.bird@sony.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Add a GPL-2.0 license identifier line for this file.
kmod.c was originally introduced in the kernel in February
of 1998 by Linus Torvalds - who was familiar with kernel
licensing at the time this was introduced.
Signed-off-by: Tim Bird <tim.bird@sony.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mahe reported issue with bpf_override_return helper not working when
executed from kprobe.multi bpf program on arm.
The problem is that on arm we use alternate storage for pt_regs object
that is passed to bpf_prog_run and if any register is changed (which
is the case of bpf_override_return) it's not propagated back to actual
pt_regs object.
Fixing this by introducing and calling ftrace_partial_regs_update function
to propagate the values of changed registers (ip and stack).
Reported-by: Mahe Tardy <mahe.tardy@gmail.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/bpf/20260112121157.854473-1-jolsa@kernel.org
- Fix allocation accounting on boot up
The ftrace records for each function that ftrace can attach to is
done in a group of pages. At boot up, the number of pages are
calculated and allocated. After that, the pages are filled with data.
It may allocate more than needed due to some functions not being
recorded (because they are unused weak functions), this too is
recorded.
After the data is filled in, a check is made to make sure the right
number of pages were allocated. But this was off due to the
assumption that the same number of entries fit per every page.
Because the size of an entry does not evenly divide into PAGE_SIZE,
there is a rounding error when a large number of pages is allocated
to hold the events. This causes the check to fail and triggers a
warning.
Fix the accounting by finding out how many pages are actually
allocated from the functions that allocate them and use that to see
if all the pages allocated were used and the ones not used are
properly freed.
-----BEGIN PGP SIGNATURE-----
iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaWlyoxQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qoZoAQDuPOaly/RM9cS70zngDo/UsZmi2lQL
RxuMR7sPVMwkrAEAp/gnpJenSRxCzaO4EdQYUYI28/ihkdNGp3KG8Q2MNw0=
=v899
-----END PGP SIGNATURE-----
Merge tag 'ftrace-v6.19-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull ftrace fix from Steven Rostedt:
- Fix allocation accounting on boot up
The ftrace records for each function that ftrace can attach to is
done in a group of pages. At boot up, the number of pages are
calculated and allocated. After that, the pages are filled with data.
It may allocate more than needed due to some functions not being
recorded (because they are unused weak functions), this too is
recorded.
After the data is filled in, a check is made to make sure the right
number of pages were allocated. But this was off due to the
assumption that the same number of entries fit per every page.
Because the size of an entry does not evenly divide into PAGE_SIZE,
there is a rounding error when a large number of pages is allocated
to hold the events. This causes the check to fail and triggers a
warning.
Fix the accounting by finding out how many pages are actually
allocated from the functions that allocate them and use that to see
if all the pages allocated were used and the ones not used are
properly freed.
* tag 'ftrace-v6.19-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
ftrace: Do not over-allocate ftrace memory
nohz.nr_cpus was observed as contended cacheline when running
enterprise workload on large systems.
Fundamental scalability challenge with nohz.idle_cpus_mask
and nohz.nr_cpus is the following:
(1) nohz_balancer_kick() observes (reads) nohz.nr_cpus
(or nohz.idle_cpu_mask) and nohz.has_blocked to see whether there's
any nohz balancing work to do, in every scheduler tick.
(2) nohz_balance_enter_idle() and nohz_balance_exit_idle()
(through nohz_balancer_kick() via sched_tick()) modify (write)
nohz.nr_cpus (and/or nohz.idle_cpu_mask) and nohz.has_blocked.
The characteristic frequencies are the following:
(1) nohz_balancer_kick() happens at scheduler (busy)tick frequency
on CPU(which has not gone idle). This is a relatively constant
frequency in the ~1 kHz range or lower.
(2) happens at idle enter/exit frequency on every CPU that goes to idle.
This is workload dependent, but can easily be hundreds of kHz for
IO-bound loads and high CPU counts. Ie. can be orders of magnitude
higher than (1), in which case a cachemiss at every invocation of (1)
is almost inevitable. idle exit will trigger (1) on the CPU
which is coming out of idle.
There's two types of costs from these functions:
(A) scheduler tick cost via (1): this happens on busy CPUs too, and is
thus a primary scalability cost. But the rate here is constant and
typically much lower than (B), hence the absolute benefit to workload
scalability will be lower as well.
(B) idle cost via (2): going-to-idle and coming-from-idle costs are
secondary concerns, because they impact power efficiency more than
they impact scalability. But in terms of absolute cost this scales
up with nr_cpus as well, and a much faster rate, and thus may also
approach and negatively impact system limits like
memory bus/fabric bandwidth.
Note that nohz.idle_cpus_mask and nohz.nr_cpus may appear to reside in the
same cacheline, however under CONFIG_CPUMASK_OFFSTACK=y the backing storage
for nohz.idle_cpus_mask will be elsewhere. With CPUMASK_OFFSTACK=n,
the nohz.idle_cpus_mask and rest of nohz fields are in different cachelines
under typical NR_CPUS=512/2048. This implies two separate cachelines
being dirtied upon idle entry / exit.
nohz.nr_cpus can be derived from the mask itself. Its usage doesn't warrant
a functionally correct value. This means one less cacheline being dirtied in
idle entry/exit path which helps to save some bus bandwidth w.r.t to those
nohz functions(approx 50%). This in turn helps to improve enterprise
workload throughput.
On system with 480 CPUs, running "hackbench 40 process 10000 loops"
(Avg of 3 runs)
baseline:
0.81% hackbench [k] nohz_balance_exit_idle
0.21% hackbench [k] nohz_balancer_kick
0.09% swapper [k] nohz_run_idle_balance
With patch:
0.35% hackbench [k] nohz_balance_exit_idle
0.09% hackbench [k] nohz_balancer_kick
0.07% swapper [k] nohz_run_idle_balance
[Ingo Molnar: scalability analysis changlog]
Reviewed-and-tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://patch.msgid.link/20260115073524.376643-4-sshegde@linux.ibm.com
These days most of the system have multi cores. The likelyhood of
at least one or more CPUs in nohz (idle state) is higher.
Give accurate hint to the branch predictor.
Reviewed-and-tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://patch.msgid.link/20260115073524.376643-3-sshegde@linux.ibm.com
Current code does.
- Read nohz.nr_cpus
- Check if the time has passed to do NOHZ idle balance
Instead do this.
- Check if the time has passed to do NOHZ idle balance
- Read nohz.nr_cpus
This will skip the read most of the time in normal system usage.
i.e when there are nohz.nr_cpus (system is not 100% busy).
Note that when there are no idle CPUs(100% busy), even if the flag gets
set to NOHZ_STATS_KICK | NOHZ_NEXT_KICK, find_new_ilb will fail and
there will be no NOHZ idle balance. In such cases there will be a very
narrow window where, kick_ilb will be called un-necessarily.
However current functionality is still retained.
Note: This patch doesn't solve any cacheline overheads. No improvement
in performance apart from saving a few cycles of reading nohz.nr_cpus
Reviewed-and-tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://patch.msgid.link/20260115073524.376643-2-sshegde@linux.ibm.com
The avg_vruntime comment contains a couple of mathematical notation
issues:
- The summation over w_i * (V - v_i) is written in an ambiguous form
- The delta term refers to v instead of v0, which is inconsistent
with the code and preceding explanation
Fix these to make the comment mathematically correct and consistent
with the implementation.
Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260114090035.19033-1-zhanxusheng@xiaomi.com
Commit adcc3bfa88 ("sched: Adapt sched tracepoints for RV task model")
added a tracepoint to the need_resched action that can be triggered also
by set_tsk_need_resched.
This function was previously accessible from out-of-tree modules but
it's no longer available because the __trace_set_need_resched() symbol
is not exported (together with the tracepoint itself, which was exported
in a separate patch) and building such modules fails.
Export __trace_set_need_resched to modules to fix those build issues.
Fixes: adcc3bfa88 ("sched: Adapt sched tracepoints for RV task model")
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Link: https://patch.msgid.link/20260112140413.362202-1-gmonaco@redhat.com
Pierre reported hitting balance callback warnings for deadline tasks
after commit 6455ad5346 ("sched: Move sched_class::prio_changed()
into the change pattern").
It turns out that DEQUEUE_SAVE+ENQUEUE_RESTORE does not preserve DL
priority and subsequently trips a balance pass -- where one was not
expected.
From discussion with Juri and Luca, the purpose of this clause was to
deal with tasks new to DL and all those sites will have MOVE set (as
well as CLASS, but MOVE is move conservative at this point).
Per the previous patches MOVE is audited to always run the balance
callbacks, so switch enqueue_dl_entity() to use MOVE for this case.
Fixes: 6455ad5346 ("sched: Move sched_class::prio_changed() into the change pattern")
Reported-by: Pierre Gondois <pierre.gondois@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Pierre Gondois <pierre.gondois@arm.com>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://patch.msgid.link/20260114130528.GB831285@noisy.programming.kicks-ass.net
While FIFO/RR have static priority, DEADLINE is a dynamic priority
scheme. Notably it has static priority -1. Do not assume the priority
doesn't change for deadline tasks just because the static priority
doesn't change.
This ensures DL always sees {DE,EN}QUEUE_MOVE where appropriate.
Fixes: ff77e46853 ("sched/rt: Fix PI handling vs. sched_setscheduler()")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Pierre Gondois <pierre.gondois@arm.com>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://patch.msgid.link/20260114130528.GB831285@noisy.programming.kicks-ass.net
The {DE,EN}QUEUE_MOVE flag indicates a task is allowed to change
priority, which means there could be balance callbacks queued.
Therefore audit all MOVE users and make sure they do run balance
callbacks before dropping rq-lock.
Fixes: 6455ad5346 ("sched: Move sched_class::prio_changed() into the change pattern")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Pierre Gondois <pierre.gondois@arm.com>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://patch.msgid.link/20260114130528.GB831285@noisy.programming.kicks-ass.net
When setup_new_dl_entity() is called from enqueue_task_dl() ->
enqueue_dl_entity(), the rq-clock should already be updated, and
calling update_rq_clock() again is not right.
Move the update_rq_clock() to the one other caller of
setup_new_dl_entity(): sched_init_dl_server().
Fixes: 9f239df555 ("sched/deadline: Initialize dl_servers after SMP")
Reported-by: Pierre Gondois <pierre.gondois@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Pierre Gondois <pierre.gondois@arm.com>
Link: https://patch.msgid.link/20260113115622.GA831285@noisy.programming.kicks-ass.net
Pratheek tripped a WARN and noted the following issue:
> Inspecting the set of events that led to the warning being triggered
> showed the following:
>
> systemd-1 [008] dN.31 ...: do_set_cpus_allowed: set_cpus_allowed begin!
>
> systemd-1 [008] dN.31 ...: sched_change_begin: Begin!
> systemd-1 [008] dN.31 ...: sched_change_begin: Before dequeue_task()!
> systemd-1 [008] dN.31 ...: update_curr_dl_se: update_curr_dl_se: ENQUEUE_REPLENISH
> systemd-1 [008] dN.31 ...: enqueue_dl_entity: enqueue_dl_entity: ENQUEUE_REPLENISH
> systemd-1 [008] dN.31 ...: replenish_dl_entity: Replenish before: 14815760217
> systemd-1 [008] dN.31 ...: replenish_dl_entity: Replenish after: 14816960047
> systemd-1 [008] dN.31 ...: sched_change_begin: Before put_prev_task()!
>
> systemd-1 [008] dN.31 ...: sched_change_end: Before enqueue_task()!
> systemd-1 [008] dN.31 ...: sched_change_end: Before put_prev_task()!
> systemd-1 [008] dN.31 ...: prio_changed_dl: Queuing pull task on prio change: 14815760217 -> 14816960047
> systemd-1 [008] dN.31 ...: prio_changed_dl: Queuing balance callback!
> systemd-1 [008] dN.31 ...: sched_change_end: End!
>
> systemd-1 [008] dN.31 ...: do_set_cpus_allowed: set_cpus_allowed end!
> systemd-1 [008] dN.21 ...: __schedule: Woops! Balance callback found!
>
> 1. sched_change_begin() from guard(sched_change) in
> do_set_cpus_allowed() stashes the priority, which for the deadline
> task, is "p->dl.deadline".
> 2. The dequeue of the deadline task replenishes the deadline.
> 3. The task is enqueued back after guard's scope ends and since there is
> no *_CLASS flags set, sched_change_end() calls
> dl_sched_class->prio_changed() which compares the deadline.
> 4. Since deadline was moved on dequeue, prio_changed_dl() sees the value
> differ from the stashed value and queues a balance pull callback.
> 5. do_set_cpus_allowed() finishes and drops the rq_lock without doing a
> do_balance_callbacks().
> 6. Grabbing the rq_lock() at subsequent __schedule() triggers the
> warning since the balance pull callback was never executed before
> dropping the lock.
Meaning get_prio_dl() ought to update current and return an up-to-date
value.
Fixes: 6455ad5346 ("sched: Move sched_class::prio_changed() into the change pattern")
Reported-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://patch.msgid.link/20260106104113.GX3707891@noisy.programming.kicks-ass.net
- four kerneldoc fixes from Bagas Sanjaya
- four DAMON fixes from SeongJae
- four mremap VMA-related fixes from Lorenzo
- various singletons - please see the changelogs for details
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaWkP3gAKCRDdBJ7gKXxA
jnmTAQCBiCyw7Pvvak5OM8Of0qQ8m/jQmNMBPRrd5M3tAAEOlwD9F7E6RjcmyItU
k+sboDgNKTvgdRHTmRzj1t96aXJrSQQ=
=OxTh
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2026-01-15-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
- kerneldoc fixes from Bagas Sanjaya
- DAMON fixes from SeongJae
- mremap VMA-related fixes from Lorenzo
- various singletons - please see the changelogs for details
* tag 'mm-hotfixes-stable-2026-01-15-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (30 commits)
drivers/dax: add some missing kerneldoc comment fields for struct dev_dax
mm: numa,memblock: include <asm/numa.h> for 'numa_nodes_parsed'
mailmap: add entry for Daniel Thompson
tools/testing/selftests: fix gup_longterm for unknown fs
mm/page_alloc: prevent pcp corruption with SMP=n
iommu/sva: include mmu_notifier.h header
mm: kmsan: fix poisoning of high-order non-compound pages
tools/testing/selftests: add forked (un)/faulted VMA merge tests
mm/vma: enforce VMA fork limit on unfaulted,faulted mremap merge too
tools/testing/selftests: add tests for !tgt, src mremap() merges
mm/vma: fix anon_vma UAF on mremap() faulted, unfaulted merge
mm/zswap: fix error pointer free in zswap_cpu_comp_prepare()
mm/damon/sysfs-scheme: cleanup access_pattern subdirs on scheme dir setup failure
mm/damon/sysfs-scheme: cleanup quotas subdirs on scheme dir setup failure
mm/damon/sysfs: cleanup attrs subdirs on context dir setup failure
mm/damon/sysfs: cleanup intervals subdirs on attrs dir setup failure
mm/damon/core: remove call_control in inactive contexts
powerpc/watchdog: add support for hardlockup_sys_info sysctl
mips: fix HIGHMEM initialization
mm/hugetlb: ignore hugepage kernel args if hugepages are unsupported
...
The pg_remaining calculation in ftrace_process_locs() assumes that
ENTRIES_PER_PAGE multiplied by 2^order equals the actual capacity of the
allocated page group. However, ENTRIES_PER_PAGE is PAGE_SIZE / ENTRY_SIZE
(integer division). When PAGE_SIZE is not a multiple of ENTRY_SIZE (e.g.
4096 / 24 = 170 with remainder 16), high-order allocations (like 256 pages)
have significantly more capacity than 256 * 170. This leads to pg_remaining
being underestimated, which in turn makes skip (derived from skipped -
pg_remaining) larger than expected, causing the WARN(skip != remaining)
to trigger.
Extra allocated pages for ftrace: 2 with 654 skipped
WARNING: CPU: 0 PID: 0 at kernel/trace/ftrace.c:7295 ftrace_process_locs+0x5bf/0x5e0
A similar problem in ftrace_allocate_records() can result in allocating
too many pages. This can trigger the second warning in
ftrace_process_locs().
Extra allocated pages for ftrace
WARNING: CPU: 0 PID: 0 at kernel/trace/ftrace.c:7276 ftrace_process_locs+0x548/0x580
Use the actual capacity of a page group to determine the number of pages
to allocate. Have ftrace_allocate_pages() return the number of allocated
pages to avoid having to calculate it. Use the actual page group capacity
when validating the number of unused pages due to skipped entries.
Drop the definition of ENTRIES_PER_PAGE since it is no longer used.
Cc: stable@vger.kernel.org
Fixes: 4a3efc6baf ("ftrace: Update the mcount_loc check of skipped entries")
Link: https://patch.msgid.link/20260113152243.3557219-1-linux@roeck-us.net
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
I got a report that a task is stuck in perf_event_exit_task() waiting
for global_ctx_data_rwsem. On large systems with lots threads, it'd
have performance issues when it grabs the lock to iterate all threads
in the system to allocate the context data.
And it'd block task exit path which is problematic especially under
memory pressure.
perf_event_open
perf_event_alloc
attach_perf_ctx_data
attach_global_ctx_data
percpu_down_write (global_ctx_data_rwsem)
for_each_process_thread
alloc_task_ctx_data
do_exit
perf_event_exit_task
percpu_down_read (global_ctx_data_rwsem)
It should not hold the global_ctx_data_rwsem on the exit path. Let's
skip allocation for exiting tasks and free the data carefully.
Reported-by: Rosalie Fang <rosaliefang@google.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260112165157.1919624-1-namhyung@kernel.org