For now, we use ftrace for the fprobe if fp->exit_handler does not exist
and CONFIG_DYNAMIC_FTRACE_WITH_REGS is enabled.
However, CONFIG_DYNAMIC_FTRACE_WITH_REGS is not supported by some
architectures, such as arm. What we need in the fprobe is the function
arguments, so we can use ftrace for the fprobe if
CONFIG_DYNAMIC_FTRACE_WITH_ARGS is enabled.
Therefore, use ftrace if either CONFIG_DYNAMIC_FTRACE_WITH_REGS or
CONFIG_DYNAMIC_FTRACE_WITH_ARGS is enabled.
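As an illustration, a minimal sketch of the resulting condition (the helper
name below is hypothetical, not taken from the patch):

    /* Hypothetical helper sketching the condition described above: use
     * plain ftrace when there is no exit_handler and the architecture
     * provides either full registers or at least the function arguments. */
    static bool fprobe_can_use_ftrace(struct fprobe *fp)
    {
            if (fp->exit_handler)
                    return false;
            return IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_REGS) ||
                   IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_ARGS);
    }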
Link: https://lore.kernel.org/all/20251103063434.47388-1-dongml2@chinatelecom.cn/
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
For now, fgraph is used for the fprobe even if we only need to trace the
entry. However, the performance of ftrace is better than fgraph, so we can
use ftrace_ops for this case.
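A rough sketch of the entry-only path; the callback and ops names here are
illustrative, not the actual fprobe internals:

    /* Entry-only case: a plain ftrace_ops callback is enough, avoiding
     * the extra cost of the function-graph return hook. */
    static void fprobe_entry_callback(unsigned long ip, unsigned long parent_ip,
                                      struct ftrace_ops *ops,
                                      struct ftrace_regs *fregs)
    {
            /* look up the fprobe registered for 'ip' and call its entry_handler */
    }

    static struct ftrace_ops fprobe_ftrace_ops = {
            .func = fprobe_entry_callback,
    };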
With this change, the performance of kprobe-multi increases from 54M/s to
69M/s. Before this commit:
$ ./benchs/run_bench_trigger.sh kprobe-multi
kprobe-multi : 54.663 ± 0.493M/s
After this commit:
$ ./benchs/run_bench_trigger.sh kprobe-multi
kprobe-multi : 69.447 ± 0.143M/s
Mitigations were disabled during the benchmark run above.
Link: https://lore.kernel.org/all/20251015083238.2374294-2-dongml2@chinatelecom.cn/
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Since the fprobe_ip_table is used from module unloading in the failure
path of load_module(), it must be initialized earlier than
late_initcall(). Otherwise, fprobe_module_callback() will use an
uninitialized spinlock of fprobe_ip_table.
Initialize fprobe_ip_table in core_initcall(), which is the same timing as
ftrace.
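In essence, the change moves the table setup to an earlier initcall level;
a simplified sketch (the init function name and its body are assumed here):

    /* Simplified sketch: bring the table up before any module can fail
     * to load and trigger fprobe_module_callback(). */
    static int __init fprobe_table_init(void)
    {
            /* set up fprobe_ip_table (rhltable and its lock) here */
            return 0;
    }
    core_initcall(fprobe_table_init);       /* was late_initcall() */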
Link: https://lore.kernel.org/all/175939434403.3665022.13030530757238556332.stgit@mhiramat.tok.corp.google.com/
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202509301440.be4b3631-lkp@intel.com
Fixes: e5a4cc28a052 ("tracing: fprobe: use rhltable for fprobe_ip_table")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Menglong Dong <menglong8.dong@gmail.com>
The 'ret' local variable in fprobe_remove_node_in_module() was used
for checking the error state in the loop, but commit dfe0d675df82
("tracing: fprobe: use rhltable for fprobe_ip_table") removed the loop.
So we don't need it anymore.
Link: https://lore.kernel.org/all/175867358989.600222.6175459620045800878.stgit@devnote2/
Fixes: e5a4cc28a052 ("tracing: fprobe: use rhltable for fprobe_ip_table")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Menglong Dong <menglong8.dong@gmail.com>
Since traceprobe_parse_context is reusable across a probe's arguments, it
is more efficient to allocate it outside of the argument-parsing loop, as
kprobe and fprobe events do.
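The pattern, in a simplified and partly assumed form (the real parsing
helper takes more parameters than shown here):

    /* Allocate the shared parse context once and reuse it for every
     * argument. Call signatures are simplified for illustration. */
    ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
    if (!ctx)
            return -ENOMEM;
    for (i = 0; i < argc && !ret; i++)
            ret = traceprobe_parse_probe_arg(tp, i, argv[i], ctx);
    kfree(ctx);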
Link: https://lore.kernel.org/all/175509541393.193596.16330324746701582114.stgit@devnote2/
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
For now, all the kernel functions that are hooked by the fprobe are added
to the hash table "fprobe_ip_table". Its key is the function address, and
its value is "struct fprobe_hlist_node".
The bucket count of the hash table is FPROBE_IP_TABLE_SIZE, which is 256.
This means the overhead of a hash table lookup grows linearly once the
number of functions hooked by the fprobe exceeds 256. When we try to hook
all the kernel functions, the overhead becomes huge.
Therefore, replace the hash table with rhltable to reduce the overhead.
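For illustration, an rhltable-based lookup looks roughly like the sketch
below; the table parameters and the member names of struct
fprobe_hlist_node are assumed, not copied from the patch:

    /* Assumed table parameters: the key is the probed address. */
    static const struct rhashtable_params fprobe_rht_params = {
            .head_offset = offsetof(struct fprobe_hlist_node, hlist),
            .key_offset  = offsetof(struct fprobe_hlist_node, addr),
            .key_len     = sizeof(unsigned long),
            .automatic_shrinking = true,
    };

    /* Lookup for a probed address 'ip': all nodes for that address are
     * walked in O(1) expected time instead of scanning one of only 256
     * shared buckets. */
    struct rhlist_head *list, *pos;
    struct fprobe_hlist_node *node;

    rcu_read_lock();
    list = rhltable_lookup(&fprobe_ip_table, &ip, fprobe_rht_params);
    rhl_for_each_entry_rcu(node, pos, list, hlist)
            ;       /* invoke the fprobe attached to this address */
    rcu_read_unlock();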
Link: https://lore.kernel.org/all/20250819031825.55653-1-dongml2@chinatelecom.cn/
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Merge tag 'irq_urgent_for_v6.18_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Borislav Petkov:
- Restore the original buslock locking in a couple of places in the irq
core subsystem after a rework
* tag 'irq_urgent_for_v6.18_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq/manage: Add buslock back in to enable_irq()
genirq/manage: Add buslock back in to __disable_irq_nosync()
genirq/chip: Add buslock back in to irq_set_handler()
Merge tag 'sched_urgent_for_v6.18_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Borislav Petkov:
- Make sure a CFS runqueue on a throttled hierarchy has its PELT clock
throttled otherwise task movement and manipulation would lead to
dangling cfs_rq references and an eventual crash
* tag 'sched_urgent_for_v6.18_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Start a cfs_rq on throttled hierarchy with PELT clock throttled
Clang is not happy with a set-but-unused variable (this is visible with a
`make W=1` build):
kernel/sched/sched.h:3744:18: error: variable 'cpumask' set but not used [-Werror,-Wunused-but-set-variable]
It seems the variable was never used, and the assignment has no side
effects as far as I can see. Remove both altogether.
Fixes: 223baf9d17 ("sched: Fix performance regression introduced by mm_cid")
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Tested-by: Eric Biggers <ebiggers@kernel.org>
Reviewed-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The locking was changed from a buslock to a plain lock, but the patch
description states there was no functional change. Assuming this was
accidental, revert to using the buslock.
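For reference, the buslocked form being restored looks roughly like the
sketch below (body simplified; the buslock helpers live in
kernel/irq/internals.h):

    void enable_irq(unsigned int irq)
    {
            unsigned long flags;
            struct irq_desc *desc =
                    irq_get_desc_buslock(irq, &flags, IRQ_GET_DESC_CHECK_GLOBAL);

            if (!desc)
                    return;
            __enable_irq(desc);
            /* busunlock also drops the chip's slow-bus lock (e.g. I2C) */
            irq_put_desc_busunlock(desc, flags);
    }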
Fixes: bddd10c554 ("genirq/manage: Rework enable_irq()")
Signed-off-by: Charles Keepax <ckeepax@opensource.cirrus.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/20251023154901.1333755-4-ckeepax@opensource.cirrus.com
The locking was changed from a buslock to a plain lock, but the patch
description states there was no functional change. Assuming this was
accidental, revert to using the buslock.
Fixes: 1b74444467 ("genirq/manage: Rework __disable_irq_nosync()")
Signed-off-by: Charles Keepax <ckeepax@opensource.cirrus.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/20251023154901.1333755-3-ckeepax@opensource.cirrus.com
The locking was changed from a buslock to a plain lock, but the patch
description states there was no functional change. Assuming this was
accidental, revert to using the buslock.
Fixes: 5cd05f3e23 ("genirq/chip: Rework irq_set_handler() variants")
Signed-off-by: Charles Keepax <ckeepax@opensource.cirrus.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/20251023154901.1333755-2-ckeepax@opensource.cirrus.com
Merge tag 'trace-rv-v6.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
"A couple of fixes for Runtime Verification:
- A bug caused a kernel panic when reading enabled_monitors was
reported.
Change callback functions to always use list_head iterators and by
doing so, fix the wrong pointer that was leading to the panic.
- The rtapp/pagefault monitor relies on the MMU to be present
(pagefaults exist) but that was not enforced via kconfig, leading
to potential build errors on systems without an MMU.
Add that kconfig dependency"
* tag 'trace-rv-v6.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
rv: Make rtapp/pagefault monitor depends on CONFIG_MMU
rv: Fully convert enabled_monitors to use list_head as iterator
Merge tag 'mm-hotfixes-stable-2025-10-22-12-43' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull hotfixes from Andrew Morton:
"17 hotfixes. 12 are cc:stable and 14 are for MM.
There's a two-patch DAMON series from SeongJae Park which addresses a
missed check and possible memory leak. Apart from that it's all
singletons - please see the changelogs for details"
* tag 'mm-hotfixes-stable-2025-10-22-12-43' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
csky: abiv2: adapt to new folio flags field
mm/damon/core: use damos_commit_quota_goal() for new goal commit
mm/damon/core: fix potential memory leak by cleaning ops_filter in damon_destroy_scheme
hugetlbfs: move lock assertions after early returns in huge_pmd_unshare()
vmw_balloon: indicate success when effectively deflating during migration
mm/damon/core: fix list_add_tail() call on damon_call()
mm/mremap: correctly account old mapping after MREMAP_DONTUNMAP remap
mm: prevent poison consumption when splitting THP
ocfs2: clear extent cache after moving/defragmenting extents
mm: don't spin in add_stack_record when gfp flags don't allow
dma-debug: don't report false positives with DMA_BOUNCE_UNALIGNED_KMALLOC
mm/damon/sysfs: dealloc commit test ctx always
mm/damon/sysfs: catch commit test ctx alloc failure
hung_task: fix warnings caused by unaligned lock pointers
Matteo reported hitting the assert_list_leaf_cfs_rq() warning from
enqueue_task_fair() post commit fe8d238e64 ("sched/fair: Propagate
load for throttled cfs_rq") which transitioned to using
cfs_rq_pelt_clock_throttled() check for leaf cfs_rq insertions in
propagate_entity_cfs_rq().
The "cfs_rq->pelt_clock_throttled" flag is used to indicate if the
hierarchy has its PELT frozen. If a cfs_rq's PELT is marked frozen, all
its descendants should have their PELT frozen too or weird things can
happen as a result of children accumulating PELT signals when the
parents have their PELT clock stopped.
Another side effect of this is the loss of integrity of the leaf cfs_rq
list. As debugged by Aaron, consider the following hierarchy:
            root(#)
           /      \
         A(#)     B(*)
                   |
                   C <--- new cgroup
                   |
                   D <--- new cgroup

  # - Already on leaf cfs_rq list
  * - Throttled with PELT frozen
The newly created cgroups don't have their "pelt_clock_throttled" signal
synced with cgroup B. Next, the following series of events occur:
1. online_fair_sched_group() for cgroup D will call
propagate_entity_cfs_rq(). (Same can happen if a throttled task is
moved to cgroup C and enqueue_task_fair() returns early.)
propagate_entity_cfs_rq() adds the cfs_rq of cgroup C to
"rq->tmp_alone_branch" since its PELT clock is not marked throttled
and cfs_rq of cgroup B is not on the list.
cfs_rq of cgroup B is skipped since its PELT is throttled.
The root cfs_rq already exists on the leaf cfs_rq list, leading to
list_add_leaf_cfs_rq() returning early.
The cfs_rq of cgroup C is left dangling on the
"rq->tmp_alone_branch".
2. A new task wakes up on cgroup A. Since the whole hierarchy is already
on the leaf cfs_rq list, list_add_leaf_cfs_rq() keeps returning early
without any modifications to "rq->tmp_alone_branch".
The final assert_list_leaf_cfs_rq() in enqueue_task_fair() sees the
dangling reference to cgroup C's cfs_rq in "rq->tmp_alone_branch".
!!! Splat !!!
Syncing the "pelt_clock_throttled" indicator with parent cfs_rq is not
enough since the new cfs_rq is not yet enqueued on the hierarchy. A
dequeue on other subtree on the throttled hierarchy can freeze the PELT
clock for the parent hierarchy without setting the indicators for this
newly added cfs_rq which was never enqueued.
Since there are no tasks on the new hierarchy, start a cfs_rq on a
throttled hierarchy with its PELT clock throttled. The first enqueue, or
the distribution (whichever happens first) will unfreeze the PELT clock
and queue the cfs_rq on the leaf cfs_rq list.
While at it, add an assert_list_leaf_cfs_rq() in
propagate_entity_cfs_rq() to catch such cases in the future.
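Conceptually, the initialization ends up doing something like the
following; this is only a sketch based on the description above (the flag
name comes from the text, the hook point and helpers are assumed), not the
actual diff:

    /* When a new cfs_rq is attached under a hierarchy whose PELT clock
     * is already frozen, start it frozen too; the first enqueue or the
     * distribution will unfreeze it. */
    if (cfs_rq_throttled(parent) && parent->pelt_clock_throttled)
            cfs_rq->pelt_clock_throttled = 1;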
Closes: https://lore.kernel.org/lkml/58a587d694f33c2ea487c700b0d046fa@codethink.co.uk/
Fixes: e1fad12dcb ("sched/fair: Switch to task based throttle model")
Reported-by: Matteo Martelli <matteo.martelli@codethink.co.uk>
Suggested-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Aaron Lu <ziqianlu@bytedance.com>
Tested-by: Aaron Lu <ziqianlu@bytedance.com>
Tested-by: Matteo Martelli <matteo.martelli@codethink.co.uk>
Link: https://patch.msgid.link/20251021053522.37583-1-kprateek.nayak@amd.com
The loop in tk_aux_sysfs_init() uses `i <= MAX_AUX_CLOCKS` as the
termination condition, which results in 9 iterations (i=0 to 8) when
MAX_AUX_CLOCKS is defined as 8. However, the kernel is designed to support
only up to 8 auxiliary clocks.
This off-by-one error causes the creation of a 9th sysfs entry that exceeds
the intended auxiliary clock range.
Fix the loop bound to use `i < MAX_AUX_CLOCKS` to ensure exactly 8
auxiliary clock entries are created, matching the design specification.
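The fix itself is a one-character bound change; schematically:

    /* Before: "i <= MAX_AUX_CLOCKS" created a 9th, out-of-range entry. */
    for (i = 0; i < MAX_AUX_CLOCKS; i++) {
            /* create the sysfs entry for auxiliary clock i */
    }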
Fixes: 7b95663a3d ("timekeeping: Provide interface to control auxiliary clocks")
Signed-off-by: Haofeng Li <lihaofeng@kylinos.cn>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/tencent_2376993D9FC06A3616A4F981B3DE1C599607@qq.com
The callbacks in enabled_monitors_seq_ops are inconsistent. Some treat the
iterator as struct rv_monitor *, while others treat the iterator as struct
list_head *.
This causes a wrong type cast and crashes the system as reported by Nathan.
Convert everything to use struct list_head * as iterator. This also makes
enabled_monitors consistent with available_monitors.
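Schematically, with the seq_file list helpers (the monitor list and
structure member names below are assumed for illustration):

    /* Every callback agrees that the iterator is a struct list_head *. */
    static void *enabled_monitors_next(struct seq_file *m, void *p, loff_t *pos)
    {
            return seq_list_next(p, &rv_monitors_list, pos);
    }

    static int enabled_monitors_show(struct seq_file *m, void *p)
    {
            struct rv_monitor *mon = list_entry(p, struct rv_monitor, list);

            seq_printf(m, "%s\n", mon->name);
            return 0;
    }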
Fixes: de090d1cca ("rv: Fix wrong type cast in enabled_monitors_next()")
Reported-by: Nathan Chancellor <nathan@kernel.org>
Closes: https://lore.kernel.org/linux-trace-kernel/20250923002004.GA2836051@ax162/
Signed-off-by: Nam Cao <namcao@linutronix.de>
Cc: stable@vger.kernel.org
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Link: https://lore.kernel.org/r/20251002082235.973099-1-namcao@linutronix.de
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Merge tag 'sched_urgent_for_v6.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Borislav Petkov:
- Make sure the check for lost pelt idle time is done unconditionally
to have correct lost idle time accounting
- Stop the deadline server task before a CPU goes offline
* tag 'sched_urgent_for_v6.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Fix pelt lost idle time detection
sched/deadline: Stop dl_server before CPU goes offline
Merge tag 'perf_urgent_for_v6.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Borislav Petkov:
- Make sure perf reporting works correctly in setups using
overlayfs or FUSE
- Move the uprobe optimization to a better location logically
* tag 'perf_urgent_for_v6.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/core: Fix MMAP2 event device with backing files
perf/core: Fix MMAP event path names with backing files
perf/core: Fix address filter match with backing files
uprobe: Move arch_uprobe_optimize right after handlers execution
When __lookup_instance() allocates a func_instance structure but fails
to allocate the must_write_set array, it returns an error without freeing
the previously allocated func_instance. This causes a memory leak of 192
bytes (sizeof(struct func_instance)) each time this error path is triggered.
Fix by freeing 'result' on must_write_set allocation failure.
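The shape of the fix, sketched below; the allocator, element count and
return convention are assumptions, only func_instance and must_write_set
come from the description above:

    result->must_write_set = kcalloc(n_frames, sizeof(*result->must_write_set),
                                     GFP_KERNEL);
    if (!result->must_write_set) {
            kfree(result);          /* previously leaked on this path */
            return ERR_PTR(-ENOMEM);
    }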
Fixes: b3698c356a ("bpf: callchain sensitive stack liveness tracking using CFG")
Reported-by: BPF Runtime Fuzzer (BRF)
Signed-off-by: Shardul Bankar <shardulsb08@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://patch.msgid.link/20251016063330.4107547-1-shardulsb08@gmail.com
Commit 370645f41e ("dma-mapping: force bouncing if the kmalloc() size is
not cache-line-aligned") introduced DMA_BOUNCE_UNALIGNED_KMALLOC feature
and permitted architecture specific code configure kmalloc slabs with
sizes smaller than the value of dma_get_cache_alignment().
When that feature is enabled, the physical address of some small
kmalloc()-ed buffers might be not aligned to the CPU cachelines, thus not
really suitable for typical DMA. To properly handle that case a SWIOTLB
buffer bouncing is used, so no CPU cache corruption occurs. When that
happens, there is no point reporting a false-positive DMA-API warning that
the buffer is not properly aligned, as this is not a client driver fault.
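Conceptually, the dma-debug alignment check gains an early exit along
these lines; the placement and exact condition are assumptions, not the
literal patch:

    /* If this kmalloc() buffer is bounced through SWIOTLB, the device
     * never touches the unaligned original, so don't warn. */
    if (IS_ENABLED(CONFIG_DMA_BOUNCE_UNALIGNED_KMALLOC) &&
        is_swiotlb_active(dev))
            return;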
[m.szyprowski@samsung.com: replace is_swiotlb_allocated() with is_swiotlb_active(), per Catalin]
Link: https://lkml.kernel.org/r/20251010173009.3916215-1-m.szyprowski@samsung.com
Link: https://lkml.kernel.org/r/20251009141508.2342138-1-m.szyprowski@samsung.com
Fixes: 370645f41e ("dma-mapping: force bouncing if the kmalloc() size is not cache-line-aligned")
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Inki Dae <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: "Isaac J. Manjarres" <isaacmanjarres@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The following kmemleak splat:
[ 8.105530] kmemleak: Trying to color unknown object at 0xff11000100e918c0 as Black
[ 8.106521] Call Trace:
[ 8.106521] <TASK>
[ 8.106521] dump_stack_lvl+0x4b/0x70
[ 8.106521] kvfree_call_rcu+0xcb/0x3b0
[ 8.106521] ? hrtimer_cancel+0x21/0x40
[ 8.106521] bpf_obj_free_fields+0x193/0x200
[ 8.106521] htab_map_update_elem+0x29c/0x410
[ 8.106521] bpf_prog_cfc8cd0f42c04044_overwrite_cb+0x47/0x4b
[ 8.106521] bpf_prog_8c30cd7c4db2e963_overwrite_timer+0x65/0x86
[ 8.106521] bpf_prog_test_run_syscall+0xe1/0x2a0
happens due to the combination of features and fixes, but mainly due to
commit 6d78b4473c ("bpf: Tell memcg to use allow_spinning=false path in bpf_timer_init()")
It's using __GFP_HIGH, which instructs slub/kmemleak internals to skip
kmemleak_alloc_recursive() on allocation, so subsequent kfree_rcu()->
kvfree_call_rcu()->kmemleak_ignore() complains with the above splat.
To fix this imbalance, replace bpf_map_kmalloc_node() with
kmalloc_nolock() and kfree_rcu() with call_rcu() + kfree_nolock() to
make sure that the objects allocated with kmalloc_nolock() are freed
with kfree_nolock() rather than the implicit kfree() that kfree_rcu()
uses internally.
Note, the kmalloc_nolock() happens under bpf_spin_lock_irqsave(), so
it will always fail in PREEMPT_RT. This is not an issue at the moment,
since bpf_timers are disabled in PREEMPT_RT. In the future
bpf_spin_lock will be replaced with state machine similar to
bpf_task_work.
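A compressed sketch of the new allocate/free pairing (the rcu_head field
and callback names here are illustrative, not taken from the patch;
context and error handling omitted):

    /* allocation path: safe in any context, may fail under PREEMPT_RT
     * as noted above */
    t = kmalloc_nolock(sizeof(*t), __GFP_ZERO, map->numa_node);

    /* free path: defer via RCU, then free with the matching primitive */
    call_rcu(&t->rcu, bpf_timer_free_rcu);  /* callback does kfree_nolock(t) */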
Fixes: 6d78b4473c ("bpf: Tell memcg to use allow_spinning=false path in bpf_timer_init()")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Harry Yoo <harry.yoo@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: linux-mm@kvack.org
Link: https://lore.kernel.org/bpf/20251015000700.28988-1-alexei.starovoitov@gmail.com
The check for some lost idle PELT time should always be done when
pick_next_task_fair() fails to pick a task, and not only when we call it
from the fair fast-path.
The case happens when the last running task on the rq is an RT or DL
task. When the latter goes to sleep and the \Sum of util_sum of the rq is
at the max value, we don't account the lost idle time whereas we should.
Fixes: 67692435c4 ("sched: Rework pick_next_task() slow-path")
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
The IBM CI tool reported a kernel warning[1] when running a CPU removal
operation through drmgr[2], i.e. "drmgr -c cpu -r -q 1":
WARNING: CPU: 0 PID: 0 at kernel/sched/cpudeadline.c:219 cpudl_set+0x58/0x170
NIP [c0000000002b6ed8] cpudl_set+0x58/0x170
LR [c0000000002b7cb8] dl_server_timer+0x168/0x2a0
Call Trace:
[c000000002c2f8c0] init_stack+0x78c0/0x8000 (unreliable)
[c0000000002b7cb8] dl_server_timer+0x168/0x2a0
[c00000000034df84] __hrtimer_run_queues+0x1a4/0x390
[c00000000034f624] hrtimer_interrupt+0x124/0x300
[c00000000002a230] timer_interrupt+0x140/0x320
Git bisects to: commit 4ae8d9aa9f ("sched/deadline: Fix dl_server getting stuck")
This happens since:
- dl_server hrtimer gets enqueued close to cpu offline, when
kthread_park enqueues a fair task.
- CPU goes offline and drmgr removes it from cpu_present_mask.
- hrtimer fires and warning is hit.
Fix it by stopping the dl_server before CPU is marked dead.
[1]: https://lore.kernel.org/all/8218e149-7718-4432-9312-f97297c352b9@linux.ibm.com/
[2]: https://github.com/ibm-power-utilities/powerpc-utils/tree/next/src/drmgr
[sshegde: wrote the changelog and tested it]
Fixes: 4ae8d9aa9f ("sched/deadline: Fix dl_server getting stuck")
Closes: https://lore.kernel.org/all/8218e149-7718-4432-9312-f97297c352b9@linux.ibm.com
Reported-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Some file systems like FUSE-based ones or overlayfs may record the backing
file in struct vm_area_struct vm_file, instead of the user file that the
user mmapped.
That causes perf to misreport the device major/minor numbers of the file
system of the file, and the generation of the file, and potentially other
inode details. There is an existing helper file_user_inode() for that
situation.
Use file_user_inode() instead of file_inode() to get the inode for MMAP2
events.
Example:
Setup:
# cd /root
# mkdir test ; cd test ; mkdir lower upper work merged
# cp `which cat` lower
# mount -t overlay overlay -olowerdir=lower,upperdir=upper,workdir=work merged
# perf record -e cycles:u -- /root/test/merged/cat /proc/self/maps
...
55b2c91d0000-55b2c926b000 r-xp 00018000 00:1a 3419 /root/test/merged/cat
...
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.004 MB perf.data (5 samples) ]
#
# stat /root/test/merged/cat
File: /root/test/merged/cat
Size: 1127792 Blocks: 2208 IO Block: 4096 regular file
Device: 0,26 Inode: 3419 Links: 1
Access: (0755/-rwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2025-09-08 12:23:59.453309624 +0000
Modify: 2025-09-08 12:23:59.454309624 +0000
Change: 2025-09-08 12:23:59.454309624 +0000
Birth: 2025-09-08 12:23:59.453309624 +0000
Before:
Device reported 00:02 differs from stat output and /proc/self/maps
# perf script --show-mmap-events | grep /root/test/merged/cat
cat 377 [-01] 243.078558: PERF_RECORD_MMAP2 377/377: [0x55b2c91d0000(0x9b000) @ 0x18000 00:02 3419 2068525940]: r-xp /root/test/merged/cat
After:
Device reported 00:1a is the same as stat output and /proc/self/maps
# perf script --show-mmap-events | grep /root/test/merged/cat
cat 362 [-01] 127.755167: PERF_RECORD_MMAP2 362/362: [0x55ba6e781000(0x9b000) @ 0x18000 00:1a 3419 0]: r-xp /root/test/merged/cat
With respect to stable kernels, overlayfs mmap function ovl_mmap() was
added in v4.19 but file_user_inode() was not added until v6.8 and never
back-ported to stable kernels. FMODE_BACKING that it depends on was added
in v6.5. This issue has gone largely unnoticed, so back-porting before
v6.8 is probably not worth it, so put 6.8 as the stable kernel prerequisite
version, although in practice the next long term kernel is 6.12.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Amir Goldstein <amir73il@gmail.com>
Cc: stable@vger.kernel.org # 6.8
Some file systems like FUSE-based ones or overlayfs may record the backing
file in struct vm_area_struct vm_file, instead of the user file that the
user mmapped.
Since commit def3ae83da ("fs: store real path instead of fake path in
backing file f_path"), file_path() no longer returns the user file path
when applied to a backing file. There is an existing helper
file_user_path() for that situation.
Use file_user_path() instead of file_path() to get the path for MMAP
and MMAP2 events.
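Roughly, the path lookup becomes the following (details may differ from
the actual patch):

    /* was: name = file_path(vma->vm_file, buf, PATH_MAX); */
    const struct path *path = file_user_path(vma->vm_file);

    name = d_path(path, buf, PATH_MAX);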
Example:
Setup:
# cd /root
# mkdir test ; cd test ; mkdir lower upper work merged
# cp `which cat` lower
# mount -t overlay overlay -olowerdir=lower,upperdir=upper,workdir=work merged
# perf record -e intel_pt//u -- /root/test/merged/cat /proc/self/maps
...
55b0ba399000-55b0ba434000 r-xp 00018000 00:1a 3419 /root/test/merged/cat
...
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.060 MB perf.data ]
#
Before:
File name is wrong (/cat), so decoding fails:
# perf script --no-itrace --show-mmap-events
cat 367 [016] 100.491492: PERF_RECORD_MMAP2 367/367: [0x55b0ba399000(0x9b000) @ 0x18000 00:02 3419 489959280]: r-xp /cat
...
# perf script --itrace=e | wc -l
Warning:
19 instruction trace errors
19
#
After:
File name is correct (/root/test/merged/cat), so decoding is ok:
# perf script --no-itrace --show-mmap-events
cat 364 [016] 72.153006: PERF_RECORD_MMAP2 364/364: [0x55ce4003d000(0x9b000) @ 0x18000 00:02 3419 3132534314]: r-xp /root/test/merged/cat
# perf script --itrace=e
# perf script --itrace=e | wc -l
0
#
Fixes: def3ae83da ("fs: store real path instead of fake path in backing file f_path")
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Amir Goldstein <amir73il@gmail.com>
Cc: stable@vger.kernel.org
It was reported that Intel PT address filters do not work in Docker
containers. That relates to the use of overlayfs.
overlayfs records the backing file in struct vm_area_struct vm_file,
instead of the user file that the user mmapped. In order for an address
filter to match, it must compare to the user file inode. There is an
existing helper file_user_inode() for that situation.
Use file_user_inode() instead of file_inode() to get the inode for address
filter matching.
Example:
Setup:
# cd /root
# mkdir test ; cd test ; mkdir lower upper work merged
# cp `which cat` lower
# mount -t overlay overlay -olowerdir=lower,upperdir=upper,workdir=work merged
# perf record --buildid-mmap -e intel_pt//u --filter 'filter * @ /root/test/merged/cat' -- /root/test/merged/cat /proc/self/maps
...
55d61d246000-55d61d2e1000 r-xp 00018000 00:1a 3418 /root/test/merged/cat
...
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.015 MB perf.data ]
# perf buildid-cache --add /root/test/merged/cat
Before:
Address filter does not match so there are no control flow packets
# perf script --itrace=e
# perf script --itrace=b | wc -l
0
# perf script -D | grep 'TIP.PGE' | wc -l
0
#
After:
Address filter does match so there are control flow packets
# perf script --itrace=e
# perf script --itrace=b | wc -l
235
# perf script -D | grep 'TIP.PGE' | wc -l
57
#
With respect to stable kernels, overlayfs mmap function ovl_mmap() was
added in v4.19 but file_user_inode() was not added until v6.8 and never
back-ported to stable kernels. FMODE_BACKING that it depends on was added
in v6.5. This issue has gone largely unnoticed, so back-porting before
v6.8 is probably not worth it, so put 6.8 as the stable kernel prerequisite
version, although in practice the next long term kernel is 6.12.
Closes: https://lore.kernel.org/linux-perf-users/aBCwoq7w8ohBRQCh@fremen.lan
Reported-by: Edd Barrett <edd@theunixzoo.co.uk>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Amir Goldstein <amir73il@gmail.com>
Cc: stable@vger.kernel.org # 6.8
It's less confusing to optimize the uprobe right after handler execution
and before the check for a changed ip register, to avoid situations where
a changed ip register would skip uprobe optimization.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Merge tag 'trace-v6.18-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
"The previous fix to trace_marker required updating trace_marker_raw as
well. The difference between trace_marker_raw from trace_marker is
that the raw version is for applications to write binary structures
directly into the ring buffer instead of writing ASCII strings. This
is for applications that will read the raw data from the ring buffer
and get the data structures directly. It's a bit quicker than using
the ASCII version.
Unfortunately, it appears that our test suite has several tests that
test writes to the trace_marker file, but lacks any tests to the
trace_marker_raw file (this needs to be remedied). Two issues came
about the update to the trace_marker_raw file that syzbot found:
- Fix tracing_mark_raw_write() to use per CPU buffer
The fix to use the per CPU buffer to copy from user space was
needed for both the trace_marker and trace_marker_raw files.
The fix for reading from user space into per CPU buffers properly
fixed the trace_marker write function, but the trace_marker_raw
file wasn't fixed properly. The user space data was correctly
written into the per CPU buffer, but the code that wrote into the
ring buffer still used the user space pointer and not the per CPU
buffer that had the user space data already written.
- Stop the fortify string warning from writing into trace_marker_raw
After converting the copy_from_user_nofault() into a memcpy(),
another issue appeared. As writes to the trace_marker_raw expects
binary data, the first entry is a 4 byte identifier. The entry
structure is defined as:
struct {
struct trace_entry ent;
int id;
char buf[];
};
The size of this structure is reserved on the ring buffer with:
size = sizeof(*entry) + cnt;
Then it is copied from the buffer into the ring buffer with:
memcpy(&entry->id, buf, cnt);
This used to be a copy_from_user_nofault(), but now converting it to
a memcpy() triggers the fortify-string code, and causes a warning.
The allocated space is actually more than what is copied, as the
cnt used also includes the entry->id portion. Allocating
sizeof(*entry) plus cnt is actually allocating 4 bytes more than
what is needed.
Change the size function to:
size = struct_size(entry, buf, cnt - sizeof(entry->id));
And update the memcpy() to unsafe_memcpy()"
* tag 'trace-v6.18-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Stop fortify-string from warning in tracing_mark_raw_write()
tracing: Fix tracing_mark_raw_write() to use buf and not ubuf
Merge tag 'mm-nonmm-stable-2025-10-10-15-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull more updates from Andrew Morton:
"Just one series here - Mike Rappoport has taught KEXEC handover to
preserve vmalloc allocations across handover"
* tag 'mm-nonmm-stable-2025-10-10-15-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
lib/test_kho: use kho_preserve_vmalloc instead of storing addresses in fdt
kho: add support for preserving vmalloc allocations
kho: replace kho_preserve_phys() with kho_preserve_pages()
kho: check if kho is finalized in __kho_preserve_order()
MAINTAINERS, .mailmap: update Umang's email address
The fix to use a per CPU buffer to read user space tested only the writes
to trace_marker. But it appears that the selftests are missing tests for
the trace_marker_raw file. The trace_marker_raw file is used by
applications that write data structures, not strings, into the file, and
the tools read the raw ring buffer to process the structures they write.
The fix that reads into the per CPU buffers passes the new per CPU buffer
to the trace_marker file writes, but the update to the trace_marker_raw
write read the data from user space into the per CPU buffer and then
still passed the user space address to the function that records the data.
Pass in the per CPU buffer and not the user space address.
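In other words, the write path should look like this (the recording helper
name below is made up for illustration):

    /* copy the user data once into the per-CPU buffer ... */
    if (copy_from_user(buf, ubuf, cnt))
            return -EFAULT;
    /* ... and record from that buffer, not from the __user pointer */
    written = write_raw_marker_to_ring_buffer(tr, buf, cnt);    /* was: ubuf */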
TODO: Add a test to better test trace_marker_raw.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20251011035243.386098147@kernel.org
Fixes: 64cf7d058a ("tracing: Have trace_marker use per-cpu data to read user space")
Reported-by: syzbot+9a2ede1643175f350105@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/68e973f5.050a0220.1186a4.0010.GAE@google.com/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
When unpinning a BPF hash table (htab or htab_lru) that contains internal
structures (timer, workqueue, or task_work) in its values, a BUG warning
is triggered:
BUG: sleeping function called from invalid context at kernel/bpf/hashtab.c:244
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 14, name: ksoftirqd/0
...
The issue arises from the interaction between BPF object unpinning and
RCU callback mechanisms:
1. BPF object unpinning uses ->free_inode() which schedules cleanup via
call_rcu(), deferring the actual freeing to an RCU callback that
executes within the RCU_SOFTIRQ context.
2. During cleanup of hash tables containing internal structures,
htab_map_free_internal_structs() is invoked, which includes
cond_resched() or cond_resched_rcu() calls to yield the CPU during
potentially long operations.
However, cond_resched() or cond_resched_rcu() cannot be safely called from
atomic RCU softirq context, leading to the BUG warning when attempting
to reschedule.
Fix this by changing from ->free_inode() to ->destroy_inode() and rename
bpf_free_inode() to bpf_destroy_inode() for BPF objects (prog, map, link).
This allows direct inode freeing without RCU callback scheduling,
avoiding the invalid context warning.
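The super_operations hookup changes along these lines (based on the
existing bpf_super_ops in kernel/bpf/inode.c, shown simplified):

    static const struct super_operations bpf_super_ops = {
            .statfs         = simple_statfs,
            .drop_inode     = generic_delete_inode,
            .destroy_inode  = bpf_destroy_inode,  /* was: .free_inode = bpf_free_inode */
    };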
Reported-by: Le Chen <tom2cat@sjtu.edu.cn>
Closes: https://lore.kernel.org/all/1444123482.1827743.1750996347470.JavaMail.zimbra@sjtu.edu.cn/
Fixes: 68134668c1 ("bpf: Add map side support for bpf timers.")
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: KaFai Wan <kafai.wan@linux.dev>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20251008102628.808045-2-kafai.wan@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Merge tag 'trace-v6.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing clean up and fixes from Steven Rostedt:
- Have osnoise tracer use memdup_user_nul()
The function osnoise_cpus_write() open codes a kmalloc() and then a
copy_from_user() and then adds a nul byte at the end which is the
same as simply using memdup_user_nul().
- Fix wakeup and irq tracers when failing to acquire calltime
When the wakeup and irq tracers use the function graph tracer for
tracing function times, it saves a timestamp into the fgraph shadow
stack. It is possible that this could fail to be stored. If that
happens, it exits the routine early. These functions also disable
nesting of the operations by incrementing the data "disable" counter.
But if the calltime exits out early, it never increments the counter
back to what it needs to be.
Since there's only a couple of lines of code that does work after
acquiring the calltime, instead of exiting out early, reverse the if
statement to be true if calltime is acquired, and place the code that
is to be done within that if block. The clean up will always be done
after that.
- Fix ring_buffer_map() return value on failure of __rb_map_vma()
If __rb_map_vma() fails in ring_buffer_map(), it does not return an
error. This means the caller will be working against a bad vma
mapping. Have ring_buffer_map() return an error when __rb_map_vma()
fails.
- Fix regression of writing to the trace_marker file
A bug fix was made to change __copy_from_user_inatomic() to
copy_from_user_nofault() in the trace_marker write function. The
trace_marker file is used by applications to write into it (usually
with a file descriptor opened at the start of the program) to record
into the tracing system. It's usually used in critical sections so
the write to trace_marker is highly optimized.
The reason for copying in an atomic section is that the write
reserves space on the ring buffer and then writes directly into it.
After it writes, it commits the event. The time between reserve and
commit must have preemption disabled.
The trace marker write does not have any locking nor can it allocate
due to the nature of it being a critical path.
Unfortunately, converting __copy_from_user_inatomic() to
copy_from_user_nofault() caused a regression in Android. Now all the
writes from its applications trigger the fault that is rejected by
the _nofault() version that wasn't rejected by the _inatomic()
version. Instead of getting data, it now just gets a trace buffer
filled with:
tracing_mark_write: <faulted>
To fix this, on opening of the trace_marker file, allocate per CPU
buffers that can be used by the write call. Then when entering the
write call, do the following:
preempt_disable();
cpu = smp_processor_id();
buffer = per_cpu_ptr(cpu_buffers, cpu);
do {
cnt = nr_context_switches_cpu(cpu);
migrate_disable();
preempt_enable();
ret = copy_from_user(buffer, ptr, size);
preempt_disable();
migrate_enable();
} while (!ret && cnt != nr_context_switches_cpu(cpu));
if (!ret)
ring_buffer_write(buffer);
preempt_enable();
This works similarly to seqcount. As it must enable preemption to do
a copy_from_user() into a per CPU buffer, if it gets preempted, the
buffer could be corrupted by another task.
To handle this, read the number of context switches of the current
CPU, disable migration, enable preemption, copy the data from user
space, then immediately disable preemption again. If the number of
context switches is the same, the buffer is still valid. Otherwise it
must be assumed that the buffer may have been corrupted and it needs
to try again.
Now the trace_marker write can get the user data even if it has to
fault it in, and still not grab any locks of its own.
* tag 'trace-v6.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Have trace_marker use per-cpu data to read user space
ring buffer: Propagate __rb_map_vma return value to caller
tracing: Fix irqoff tracers on failure of acquiring calltime
tracing: Fix wakeup tracers on failure of acquiring calltime
tracing/osnoise: Replace kmalloc + copy_from_user with memdup_user_nul
It was reported that using __copy_from_user_inatomic() can actually
schedule, which is bad when preemption is disabled. Even though there is
logic to check whether in_atomic() is set, that check is a nop when the
kernel is configured with PREEMPT_NONE. This is due to page faulting: the
code could schedule with preemption disabled.
Link: https://lore.kernel.org/all/20250819105152.2766363-1-luogengkun@huaweicloud.com/
The solution was to change the __copy_from_user_inatomic() to
copy_from_user_nofault(). But then it was reported that this caused a
regression in Android. There's several applications writing into
trace_marker() in Android, but now instead of showing the expected data,
it is showing:
tracing_mark_write: <faulted>
After reverting the conversion to copy_from_user_nofault(), Android was
able to get the data again.
Writes to the trace_marker is a way to efficiently and quickly enter data
into the Linux tracing buffer. It takes no locks and was designed to be as
non-intrusive as possible. This means it cannot allocate memory, and must
use pre-allocated data.
A method that is actively being worked on to have faultable system call
tracepoints read user space data is to allocate per CPU buffers, and use
them in the callback. The method uses a technique similar to seqcount.
That is something like this:
preempt_disable();
cpu = smp_processor_id();
buffer = per_cpu_ptr(&pre_allocated_cpu_buffers, cpu);
do {
cnt = nr_context_switches_cpu(cpu);
migrate_disable();
preempt_enable();
ret = copy_from_user(buffer, ptr, size);
preempt_disable();
migrate_enable();
} while (!ret && cnt != nr_context_switches_cpu(cpu));
if (!ret)
ring_buffer_write(buffer);
preempt_enable();
It's a little more involved than that, but the above is the basic logic.
The idea is to acquire the current CPU buffer, disable migration, and then
enable preemption. At this moment, it can safely use copy_from_user().
After reading the data from user space, it disables preemption again. It
then checks to see if there was any new scheduling on this CPU. If there
was, it must assume that the buffer was corrupted by another task. If
there wasn't, then the buffer is still valid as only tasks in preemptable
context can write to this buffer and only those that are running on the
CPU.
By using this method, where trace_marker open allocates the per CPU
buffers, trace_marker writes can access user space and even fault it in,
without having to allocate or take any locks of its own.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Luo Gengkun <luogengkun@huaweicloud.com>
Cc: Wattson CI <wattson-external@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/20251008124510.6dba541a@gandalf.local.home
Fixes: 3d62ab32df ("tracing: Fix tracing_marker may trigger page fault during preempt_disable")
Reported-by: Runping Lai <runpinglai@google.com>
Tested-by: Runping Lai <runpinglai@google.com>
Closes: https://lore.kernel.org/linux-trace-kernel/20251007003417.3470979-2-runpinglai@google.com/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The functions irqsoff_graph_entry() and irqsoff_graph_return() both call
func_prolog_dec() that will test if the data->disable is already set and
if not, increment it and return. If it was set, it returns false and the
caller exits.
The caller of this function must decrement the disable counter, but misses
doing so if the calltime fails to be acquired.
Instead of exiting out when calltime is NULL, change the logic to do the
work if it is not NULL and still do the clean up at the end of the
function if it is NULL.
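Schematically, on the return side (variable names approximate the tracer
code; the same shape applies to the wakeup tracer fix below):

    int size;
    u64 *calltime = fgraph_retrieve_data(gops->idx, &size);

    if (calltime) {
            /* only the work that needs calltime stays conditional */
    }
    atomic_dec(&data->disabled);    /* the cleanup now runs on every path */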
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20251008114943.6f60f30f@gandalf.local.home
Fixes: a485ea9e3e ("tracing: Fix irqsoff and wakeup latency tracers when using function graph")
Reported-by: Sasha Levin <sashal@kernel.org>
Closes: https://lore.kernel.org/linux-trace-kernel/20251006175848.1906912-2-sashal@kernel.org/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The functions wakeup_graph_entry() and wakeup_graph_return() both call
func_prolog_preempt_disable() that will test if the data->disable is
already set and if not, increment it and disable preemption. If it was
set, it returns false and the caller exits.
The caller of this function must decrement the disable counter, but misses
doing so if the calltime fails to be acquired.
Instead of exiting out when calltime is NULL, change the logic to do the
work if it is not NULL and still do the clean up at the end of the
function if it is NULL.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20251008114835.027b878a@gandalf.local.home
Fixes: a485ea9e3e ("tracing: Fix irqsoff and wakeup latency tracers when using function graph")
Reported-by: Sasha Levin <sashal@kernel.org>
Closes: https://lore.kernel.org/linux-trace-kernel/20251006175848.1906912-1-sashal@kernel.org/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Replace kmalloc() followed by copy_from_user() with memdup_user_nul() to
simplify and improve osnoise_cpus_write(). Remove the manual
NUL-termination.
No functional changes intended.
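In essence, the open-coded kmalloc()/copy_from_user()/terminate sequence
collapses to the following (error handling simplified):

    char *buf = memdup_user_nul(ubuf, count);

    if (IS_ERR(buf))
            return PTR_ERR(buf);
    /* buf is NUL-terminated; parse it, then kfree(buf) */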
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20251001130907.364673-2-thorsten.blum@linux.dev
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
A vmalloc allocation is preserved using a binary structure similar to the
global KHO memory tracker: a linked list of pages where each page is an
array of physical addresses of the pages in the vmalloc area.
kho_preserve_vmalloc() hands out the physical address of the head page to
the caller. This address is used as the argument to kho_vmalloc_restore()
to restore the mapping in the vmalloc address space and populate it with
the preserved pages.
[pasha.tatashin@soleen.com: free chunks using free_page() not kfree()]
Link: https://lkml.kernel.org/r/mafs0a52idbeg.fsf@kernel.org
[akpm@linux-foundation.org: coding-style cleanups]
Link: https://lkml.kernel.org/r/20250921054458.4043761-4-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Changyuan Lyu <changyuanl@google.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Replace kho_preserve_phys() with kho_preserve_pages() to make it clear
that KHO operates on pages rather than on a random physical address.
The kho_preserve_pages() will be also used in upcoming support for vmalloc
preservation.
Link: https://lkml.kernel.org/r/20250921054458.4043761-3-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Changyuan Lyu <changyuanl@google.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>