mirror-linux

Commit Graph

Author	SHA1	Message	Date
Jiri Olsa	6b95cc562d	ftrace: Fix direct_functions leak in update_ftrace_direct_del Alexei reported memory leak in update_ftrace_direct_del. We miss cleanup of the replaced direct_functions in the success path in update_ftrace_direct_del, adding that. Fixes: `8d2c1233f3` ("ftrace: Add update_ftrace_direct_del function") Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Closes: https://lore.kernel.org/bpf/aX_BxG5EJTJdCMT9@krava/T/#m7c13f5a95f862ed7ab78e905fbb678d635306a0c Signed-off-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://lore.kernel.org/r/20260202075849.1684369-1-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-02-02 07:56:20 -08:00
Mark Brown	24989330fb	time/kunit: Document handling of negative years of is_leap() The code local is_leap() helper was tried to be replaced by the RTC is_leap_year() function. Unfortunately the two aren't exactly equivalent, as the kunit variant uses a signed value for the year and the RTC an unsigned one. Since the KUnit tests cover a 16000 year range around the epoch they use year values that are very comfortably negative and hence get mishandled when passed into is_leap_year(). The change was reverted, so add a comment which prevents further attempts to do so. [ tglx: Adapted to the revert ] Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260130-kunit-fix-leap-year-v1-1-92ddf55dffd7@kernel.org	2026-02-02 12:37:54 +01:00
Shanker Donthineni	c33efdfcfa	dma: contiguous: Check return value of dma_contiguous_reserve_area() Commit `8f1fc1bf1a` ("dma: contiguous: Reserve default CMA heap") introduced a bug where dma_heap_cma_register_heap() is called with a NULL pointer when dma_contiguous_reserve_area() fails to reserve the CMA area. When dma_contiguous_reserve_area() fails, dma_contiguous_default_area remains NULL (initialized as a global variable), but the code doesn't check the return value and proceeds to call dma_heap_cma_register_heap() with this NULL pointer. Later during boot, add_cma_heaps() iterates through the dma_areas[] array and attempts to register heaps. When it encounters the NULL pointer stored by the earlier call, it crashes in __add_cma_heap() -> dma_heap_add() when trying to dereference the NULL CMA pointer. The crash manifests as: Unable to handle kernel NULL pointer dereference at virtual address 0000000000000038 ... Call trace: dma_heap_add+0x40/0x2b0 __add_cma_heap+0x80/0xe0 add_cma_heaps+0x64/0xb0 do_one_initcall+0x60/0x318 kernel_init_freeable+0x260/0x2f0 kernel_init+0x2c/0x168 ret_from_fork+0x10/0x20 Fix this by checking the return value of dma_contiguous_reserve_area() and only calling dma_heap_cma_register_heap() when the reservation succeeds. Fixes: `8f1fc1bf1a` ("dma: contiguous: Reserve default CMA heap") Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com> Reviewed-by: T.J. Mercier <tjmercier@google.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20260129181317.2429196-1-sdonthineni@nvidia.com	2026-02-02 09:20:32 +01:00
Linus Torvalds	c00a879164	Fix a race in the user-callchains code. Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAml/E38RHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1gESw/+OvzSQGvWAnL7LDHF5kNvS8eedFQMIXVk dkAgQ0ZK2URqU9tJ4RJY8gEAuKwwT+TtA7Ve1vXLYk6s8gpJ5jzYyi760NAbclFj JpyPJtS4zIqC2AQD4nMw+xzpLaGjavPN3ewYPdYgnRL2cK2ezZxlxWWxweAShWGH CqIXEYxG6Ezx0pUXtgqnUNRNM0ayIWwI8ZDHpvcaFt8FC86aKWvZ4urODGxZAT3Z W5JcsJu/cpPjUv4KAkxc9xdeofbPo1YK3yTLA7ih1MsH7p/Q7u5zkNSeXNnF25Wh 3NrrJCXV3K4MMVwyPQJzoMn8EWsCuP7yeZ5R+ds5aDaOBy4LYoL+kp5HzJ4GzdFr YUs3C2cRraR5xB6U9VjgidLNDXIriSUjwj9D9ucV9hlJ3Z7NJw04ZuQP5Oglt1TQ Pdndx8OqUyWjBctAOe0+bx6iC40jHrEj9HI3M/LkHM3DVIi5l3KWZi4qTHVTN/f9 mH5Z4Ot5QSy+A6jFiVXuIfo7CEYow5tOTrEibEX/J5fI5RK/5OgJAPUs+J3VeAMc jKb/Hp0e1SbMsoZmggWnNvIUz5Fk6nkkxrvC5iVMuGz6fsD9M7Ms6ckZd3Mw7tpa d4F95syN4odTseZX4hMFKyMP8Be62WNXSgXV4a7T8d1+FyYsaEdMRew6pyJD1nUc aW5b5eiBeIE= =ruON -----END PGP SIGNATURE----- Merge tag 'perf-urgent-2026-02-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf events fix from Ingo Molnar: "Fix a race in the user-callchains code" * tag 'perf-urgent-2026-02-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf: sched: Fix perf crash with new is_user_task() helper	2026-02-01 10:47:21 -08:00
Linus Torvalds	e53ada651a	Fix a regression in the deferrable dl_server code that can cause the dl_server to be stuck. Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAml/FIkRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1jB7w/9GazyDrZhdmJp7NeJCEW82FGKljBuEBvP RxN9lrhRuOIKEVuOfJI76Chd5iooSlKdDmaOcxFrkihwJ/AREHekTlYFLtncg5Zu 5JjeYLiuN04kkOClDz7mTgcSzJD7yM0m2mD5CS45lZ35shM64eLr5GtXaFTEs9SZ E9R4cBiNpz4bVMNW4y3TphSlEx84sIRWoj5Gmzw3CG/52GbIcj1rgwO4b/RI4FYV yio7V5+ol6oVijlul/UIMYkB6vWGbO8swa/YErhyCvXpHfR1pwsEDFcTjJvO9hCK 2QPwBo34m9mibybvQoyS+/Cybf/yWqeyc1KHqn2+wlO/Ectk97H3eBsbLO2ot/dX Sv+y+vq3Wm5BczaKtyCeGvk+N5NSp3jhKUO4jH0Iivi+jRJ2yynJ+Dgg0F0ldIMM 0qBfEvaiN8Fbg8LLDevfqtGnctvRiwzKNstaZyUmq1g5NXPOGlerVtHP+osuPb/L lIK7uSZMZNStJS7R7vcO+g2lAl15UwBmMPEepDdUsBilnKEnefl+j5wilhjLoNM7 G+XMSFLMoWX1/oY9f8TRs6ncZ+ays6tal8bePVyTrdp/RBAXCFlB9npNjqlTkZjT M6zRqHcIOPZxEJP3ZqsGb0rT41v7gJbuHg2uUSQwyqEWmTB45IqTewi2Omw/duJ/ OOp7g2matLY= =U5Zw -----END PGP SIGNATURE----- Merge tag 'sched-urgent-2026-02-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Ingo Molnar: "Fix a regression in the deferrable dl_server code that can cause the dl_server to be stuck" * tag 'sched-urgent-2026-02-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/deadline: Fix 'stuck' dl_server	2026-02-01 10:39:52 -08:00
Chen Ridong	8b1f3c54f9	cpuset: fix overlap of partition effective CPUs A warning was detect: WARNING: kernel/cgroup/cpuset.c:825 at rebuild_sched_domains_locked Modules linked in: CPU: 12 UID: 0 PID: 681 Comm: rmdir 6.19.0-rc6-next-20260121+ RIP: 0010:rebuild_sched_domains_locked+0x309/0x4b0 RSP: 0018:ffffc900019bbd28 EFLAGS: 00000202 RAX: ffff888104413508 RBX: 0000000000000008 RCX: ffff888104413510 RDX: ffff888109b5f400 RSI: 000000000000ffcf RDI: 0000000000000001 RBP: 0000000000000002 R08: ffff888104413508 R09: 0000000000000002 R10: ffff888104413508 R11: 0000000000000001 R12: ffff888104413500 R13: 0000000000000002 R14: ffffc900019bbd78 R15: 0000000000000000 FS: 00007fe274b8d740(0000) GS:ffff8881b6b3c000(0000) knlGS: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fe274c98b50 CR3: 00000001047a9000 CR4: 00000000000006f0 Call Trace: <TASK> update_prstate+0x1c7/0x580 cpuset_css_killed+0x2f/0x50 kill_css+0x32/0x180 cgroup_destroy_locked+0xa7/0x200 cgroup_rmdir+0x28/0x100 kernfs_iop_rmdir+0x4c/0x80 vfs_rmdir+0x12c/0x280 filename_rmdir+0x19e/0x200 __x64_sys_rmdir+0x23/0x40 do_syscall_64+0x6b/0x390 It can be reproduced by steps: # cd /sys/fs/cgroup/ # mkdir A1 # mkdir B1 # mkdir C1 # echo 1-3 > A1/cpuset.cpus # echo root > A1/cpuset.cpus.partition # echo 3-5 > B1/cpuset.cpus # echo root > B1/cpuset.cpus.partition # echo 6 > C1/cpuset.cpus # echo root > C1/cpuset.cpus.partition # rmdir A1/ # rmdir C1/ Both A1 and B1 were initially configured with CPU 3, which was exclusively assigned to A1's partition. When A1 was removed, CPU 3 was returned to the root pool. However, B1 incorrectly regained access to CPU 3 when update_cpumasks_hier was triggered during C1's removal, which also updated sibling configurations. The update_sibling_cpumasks function was called to synchronize siblings' effective CPUs due to changes in their parent's effective CPUs. However, parent effective CPU changes should not affect partition-effective CPUs. To fix this issue, update_cpumasks_hier should only be invoked when the sibling is not a valid partition in the update_sibling_cpumasks. Fixes: `2a3602030d` ("cgroup/cpuset: Don't invalidate sibling partitions on cpuset.cpus conflict") Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-02-01 06:49:52 -10:00
Chen Ridong	5eab8c588b	cgroup: increase maximum subsystem count from 16 to 32 The current cgroup subsystem limit of 16 is insufficient, as the number of existing subsystems has already reached this limit. When adding a new subsystem that is not yet in the mainline kernel, building with `make allmodconfig` requires first bypassing the `BUILD_BUG_ON(CGROUP_SUBSYS_COUNT > 16)` restriction to allow compilation to succeed. However, the kernel still fails to boot afterward. This patch increases the maximum number of supported cgroup subsystems from 16 to 32, providing enough room for future subsystem additions. Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Waiman Long <longman@redhat.com> Tested-by: JP Kobryn <inwardvessel@gmail.com> Acked-by: JP Kobryn <inwardvessel@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-02-01 06:34:15 -10:00
Evangelos Petrongonas	427b2535f5	kho: skip memoryless NUMA nodes when reserving scratch areas kho_reserve_scratch() iterates over all online NUMA nodes to allocate per-node scratch memory. On systems with memoryless NUMA nodes (nodes that have CPUs but no memory), memblock_alloc_range_nid() fails because there is no memory available on that node. This causes KHO initialization to fail and kho_enable to be set to false. Some ARM64 systems have NUMA topologies where certain nodes contain only CPUs without any associated memory. These configurations are valid and should not prevent KHO from functioning. Fix this by only counting nodes that have memory (N_MEMORY state) and skip memoryless nodes in the per-node scratch allocation loop. Link: https://lkml.kernel.org/r/20260120175913.34368-1-epetron@amazon.de Fixes: `3dc92c3114` ("kexec: add Kexec HandOver (KHO) generation helpers"). Signed-off-by: Evangelos Petrongonas <epetron@amazon.de> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Alexander Graf <graf@amazon.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-31 16:16:08 -08:00
Vasily Gorbik	96a54b8ffc	crash_dump: fix dm_crypt keys locking and ref leak crash_load_dm_crypt_keys() reads dm-crypt volume keys from the user keyring. It uses user_key_payload_locked() without holding key->sem, which makes lockdep complain when kexec_file_load() assembles the crash image: ============================= WARNING: suspicious RCU usage ----------------------------- ./include/keys/user-type.h:53 suspicious rcu_dereference_protected() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 no locks held by kexec/4875. stack backtrace: Call Trace: <TASK> dump_stack_lvl+0x5d/0x80 lockdep_rcu_suspicious.cold+0x4e/0x96 crash_load_dm_crypt_keys+0x314/0x390 bzImage64_load+0x116/0x9a0 ? __lock_acquire+0x464/0x1ba0 __do_sys_kexec_file_load+0x26a/0x4f0 do_syscall_64+0xbd/0x430 entry_SYSCALL_64_after_hwframe+0x77/0x7f In addition, the key returned by request_key() is never key_put()'d, leaking a key reference on each load attempt. Take key->sem while copying the payload and drop the key reference afterwards. Link: https://lkml.kernel.org/r/patch.git-2d4d76083a5c.your-ad-here.call-01769426386-ext-2560@work.hours Fixes: `479e58549b` ("crash_dump: store dm crypt keys in kdump reserved memory") Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> Cc: Baoquan He <bhe@redhat.com> Cc: Coiby Xu <coxu@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-31 16:16:08 -08:00
Mike Rapoport (Microsoft)	b50634c5e8	kho: cleanup error handling in kho_populate() * use dedicated labels for error handling instead of checking if a pointer is not null to decide if it should be unmapped * drop assignment of values to err that are only used to print a numeric error code, there are pr_warn()s for each failure already so printing a numeric error code in the next line does not add anything useful Link: https://lkml.kernel.org/r/20260122121757.575987-1-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Mike Rapoport <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-31 16:16:08 -08:00
Ondrej Mosnacek	0895a000e4	ucount: check for CAP_SYS_RESOURCE using ns_capable_noaudit() The user.* sysctls implement the ctl_table_root::permissions hook and they override the file access mode based on the CAP_SYS_RESOURCE capability (at most rwx if capable, at most r-- if not). The capability is being checked unconditionally, so if an LSM denies the capability, an audit record may be logged even when access is in fact granted. Given the logic in the set_permissions() function in kernel/ucount.c and the unfortunate way the permission checking is implemented, it doesn't seem viable to avoid false positive denials by deferring the capability check. Thus, do the same as in net_ctl_permissions() (net/sysctl_net.c) - switch from ns_capable() to ns_capable_noaudit(), so that the check never logs an audit record. Link: https://lkml.kernel.org/r/20260122140745.239428-1-omosnace@redhat.com Fixes: `dbec28460a` ("userns: Add per user namespace sysctls.") Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com> Reviewed-by: Paul Moore <paul@paul-moore.com> Acked-by: Serge Hallyn <serge@hallyn.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Alexey Gladkov <legion@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-31 16:16:08 -08:00
Li Chen	480e1d5c64	kexec: derive purgatory entry from symbol kexec_load_purgatory() derives image->start by locating e_entry inside an SHF_EXECINSTR section. If the purgatory object contains multiple executable sections with overlapping sh_addr, the entrypoint check can match more than once and trigger a WARN. Derive the entry section from the purgatory_start symbol when present and compute image->start from its final placement. Keep the existing e_entry fallback for purgatories that do not expose the symbol. WARNING: kernel/kexec_file.c:1009 at kexec_load_purgatory+0x395/0x3c0, CPU#10: kexec/1784 Call Trace: <TASK> bzImage64_load+0x133/0xa00 __do_sys_kexec_file_load+0x2b3/0x5c0 do_syscall_64+0x81/0x610 entry_SYSCALL_64_after_hwframe+0x76/0x7e [me@linux.beauty: move helper to avoid forward declaration, per Baoquan] Link: https://lkml.kernel.org/r/20260128043511.316860-1-me@linux.beauty Link: https://lkml.kernel.org/r/20260120124005.148381-1-me@linux.beauty Fixes: `8652d44f46` ("kexec: support purgatories with .text.hot sections") Signed-off-by: Li Chen <me@linux.beauty> Acked-by: Baoquan He <bhe@redhat.com> Cc: Alexander Graf <graf@amazon.com> Cc: Eric Biggers <ebiggers@kernel.org> Cc: Li Chen <me@linux.beauty> Cc: Philipp Rudo <prudo@redhat.com> Cc: Ricardo Ribalda Delgado <ribalda@chromium.org> Cc: Ross Zwisler <zwisler@google.com> Cc: Sourabh Jain <sourabhjain@linux.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-31 16:16:07 -08:00
Wang Yaxin	503efe850c	delayacct: add timestamp of delay max Problem ======= Commit `658eb5ab91` ("delayacct: add delay max to record delay peak") introduced the delay max for getdelays, which records abnormal latency peaks and helps us understand the magnitude of such delays. However, the peak latency value alone is insufficient for effective root cause analysis. Without the precise timestamp of when the peak occurred, we still lack the critical context needed to correlate it with other system events. Solution ======== To address this, we need to additionally record a precise timestamp when the maximum latency occurs. By correlating this timestamp with system logs and monitoring metrics, we can identify processes with abnormal resource usage at the same moment, which can help us to pinpoint root causes. Use Case ======== bash-4.4# ./getdelays -d -t 227 print delayacct stats ON TGID 227 CPU count real total virtual total delay total delay average delay max delay min delay max timestamp 46 188000000 192348334 4098012 0.089ms 0.429260ms 0.051205ms 2026-01-15T15:06:58 IO count delay total delay average delay max delay min delay max timestamp 0 0 0.000ms 0.000000ms 0.000000ms N/A SWAP count delay total delay average delay max delay min delay max timestamp 0 0 0.000ms 0.000000ms 0.000000ms N/A RECLAIM count delay total delay average delay max delay min delay max timestamp 0 0 0.000ms 0.000000ms 0.000000ms N/A THRAS HING count delay total delay average delay max delay min delay max timestamp 0 0 0.000ms 0.000000ms 0.000000ms N/A COMPACT count delay total delay average delay max delay min delay max timestamp 0 0 0.000ms 0.000000ms 0.000000ms N/A WPCOPY count delay total delay average delay max delay min delay max timestamp 182 19413338 0.107ms 0.547353ms 0.022462ms 2026-01-15T15:05:24 IRQ count delay total delay average delay max delay min delay max timestamp 0 0 0.000ms 0.000000ms 0.000000ms N/A Link: https://lkml.kernel.org/r/20260119100241520gWubW8-5QfhSf9gjqcc_E@zte.com.cn Signed-off-by: Wang Yaxin <wang.yaxin@zte.com.cn> Cc: Fan Yu <fan.yu9@zte.com.cn> Cc: Jonathan Corbet <corbet@lwn.net> Cc: xu xin <xu.xin16@zte.com.cn> Cc: Yang Yang <yang.yang29@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-31 16:16:06 -08:00
Steven Rostedt	86e685ff36	tracing: remove size parameter in __trace_puts() The __trace_puts() function takes a string pointer and the size of the string itself. All users currently simply pass in the strlen() of the string it is also passing in. There's no reason to pass in the size. Instead have the __trace_puts() function do the strlen() within the function itself. This fixes a header recursion issue where using strlen() in the macro calling __trace_puts() requires adding #include <linux/string.h> in order to use strlen(). Removing the use of strlen() from the header fixes the recursion issue. Link: https://lore.kernel.org/all/aUN8Hm377C5A0ILX@yury/ Link: https://lkml.kernel.org/r/20260116042510.241009-6-ynorov@nvidia.com Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Yury Norov <ynorov@nvidia.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com> Cc: Aaron Tomlin <atomlin@atomlin.com> Cc: Andi Shyti <andi.shyti@linux.intel.com> Cc: Christophe Leroy (CS GROUP) <chleroy@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jani Nikula <jani.nikula@intel.com> Cc: Petr Pavlu <petr.pavlu@suse.com> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-31 16:16:05 -08:00
Pratyush Yadav	8f1081892d	kho: simplify page initialization in kho_restore_page() When restoring a page (from kho_restore_pages()) or folio (from kho_restore_folio()), KHO must initialize the struct page. The initialization differs slightly depending on if a folio is requested or a set of 0-order pages is requested. Conceptually, it is quite simple to understand. When restoring 0-order pages, each page gets a refcount of 1 and that's it. When restoring a folio, head page gets a refcount of 1 and tail pages get 0. kho_restore_page() tries to combine the two separate initialization flow into one piece of code. While it works fine, it is more complicated to read than it needs to be. Make the code simpler by splitting the two initalization paths into two separate functions. This improves readability by clearly showing how each type must be initialized. Link: https://lkml.kernel.org/r/20260116112217.915803-3-pratyush@kernel.org Signed-off-by: Pratyush Yadav <pratyush@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Alexander Graf <graf@amazon.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-31 16:16:04 -08:00
Pratyush Yadav	840fe43d37	kho: use unsigned long for nr_pages Patch series "kho: clean up page initialization logic", v2. This series simplifies the page initialization logic in kho_restore_page(). It was originally only a single patch [0], but on Pasha's suggestion, I added another patch to use unsigned long for nr_pages. Technically speaking, the patches aren't related and can be applied independently, but bundling them together since patch 2 relies on 1 and it is easier to manage them this way. This patch (of 2): With 4k pages, a 32-bit nr_pages can span up to 16 TiB. While it is a lot, there exist systems with terabytes of RAM. gup is also moving to using long for nr_pages. Use unsigned long and make KHO future-proof. Link: https://lkml.kernel.org/r/20260116112217.915803-1-pratyush@kernel.org Link: https://lkml.kernel.org/r/20260116112217.915803-2-pratyush@kernel.org Signed-off-by: Pratyush Yadav <pratyush@kernel.org> Suggested-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Alexander Graf <graf@amazon.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-31 16:16:04 -08:00
Andrew Morton	2eec08ff09	Merge branch 'mm-hotfixes-stable' into mm-nonmm-stable to pick up changes required to merge "kho: use unsigned long for nr_pages".	2026-01-31 16:12:21 -08:00
Kairui Song	3697615914	mm, swap: cleanup swap entry management workflow The current swap entry allocation/freeing workflow has never had a clear definition. This makes it hard to debug or add new optimizations. This commit introduces a proper definition of how swap entries would be allocated and freed. Now, most operations are folio based, so they will never exceed one swap cluster, and we now have a cleaner border between swap and the rest of mm, making it much easier to follow and debug, especially with new added sanity checks. Also making more optimization possible. Swap entry will be mostly freed and free with a folio bound. The folio lock will be useful for resolving many swap related races. Now swap allocation (except hibernation) always starts with a folio in the swap cache, and gets duped/freed protected by the folio lock: - folio_alloc_swap() - The only allocation entry point now. Context: The folio must be locked. This allocates one or a set of continuous swap slots for a folio and binds them to the folio by adding the folio to the swap cache. The swap slots' swap count start with zero value. - folio_dup_swap() - Increase the swap count of one or more entries. Context: The folio must be locked and in the swap cache. For now, the caller still has to lock the new swap entry owner (e.g., PTL). This increases the ref count of swap entries allocated to a folio. Newly allocated swap slots' count has to be increased by this helper as the folio got unmapped (and swap entries got installed). - folio_put_swap() - Decrease the swap count of one or more entries. Context: The folio must be locked and in the swap cache. For now, the caller still has to lock the new swap entry owner (e.g., PTL). This decreases the ref count of swap entries allocated to a folio. Typically, swapin will decrease the swap count as the folio got installed back and the swap entry got uninstalled This won't remove the folio from the swap cache and free the slot. Lazy freeing of swap cache is helpful for reducing IO. There is already a folio_free_swap() for immediate cache reclaim. This part could be further optimized later. The above locking constraints could be further relaxed when the swap table is fully implemented. Currently dup still needs the caller to lock the swap entry container (e.g. PTL), or a concurrent zap may underflow the swap count. Some swap users need to interact with swap count without involving folio (e.g. forking/zapping the page table or mapping truncate without swapin). In such cases, the caller has to ensure there is no race condition on whatever owns the swap count and use the below helpers: - swap_put_entries_direct() - Decrease the swap count directly. Context: The caller must lock whatever is referencing the slots to avoid a race. Typically the page table zapping or shmem mapping truncate will need to free swap slots directly. If a slot is cached (has a folio bound), this will also try to release the swap cache. - swap_dup_entry_direct() - Increase the swap count directly. Context: The caller must lock whatever is referencing the entries to avoid race, and the entries must already have a swap count > 1. Typically, forking will need to copy the page table and hence needs to increase the swap count of the entries in the table. The page table is locked while referencing the swap entries, so the entries all have a swap count > 1 and can't be freed. Hibernation subsystem is a bit different, so two special wrappers are here: - swap_alloc_hibernation_slot() - Allocate one entry from one device. - swap_free_hibernation_slot() - Free one entry allocated by the above helper. All hibernation entries are exclusive to the hibernation subsystem and should not interact with ordinary swap routines. By separating the workflows, it will be possible to bind folio more tightly with swap cache and get rid of the SWAP_HAS_CACHE as a temporary pin. This commit should not introduce any behavior change [kasong@tencent.com: fix leak, per Chris Mason. Remove WARN_ON, per Lai Yi] Link: https://lkml.kernel.org/r/CAMgjq7AUz10uETVm8ozDWcB3XohkOqf0i33KGrAquvEVvfp5cg@mail.gmail.com [ryncsn@gmail.com: fix KSM copy pages for swapoff, per Chris] Link: https://lkml.kernel.org/r/aXxkANcET3l2Xu6J@KASONG-MC4 Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-14-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Signed-off-by: Kairui Song <ryncsn@gmail.com> Acked-by: Rafael J. Wysocki (Intel) <rafael@kernel.org> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Cc: Chris Mason <clm@fb.com> Cc: Chris Mason <clm@meta.com> Cc: Lai Yi <yi1.lai@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-31 14:22:56 -08:00
Andrew Morton	f84b65b045	Merge branch 'mm-hotfixes-stable' into mm-stable to pick up "mm/shmem, swap: fix race of truncate and swap entry split", needed for merging "mm, swap: cleanup swap entry management workflow".	2026-01-31 14:20:03 -08:00
Leon Hwang	8798902f2b	bpf: Add bpf_jit_supports_fsession() The added fsession does not prevent running on those architectures, that haven't added fsession support. For example, try to run fsession tests on arm64: test_fsession_basic:PASS:fsession_test__open_and_load 0 nsec test_fsession_basic:PASS:fsession_attach 0 nsec check_result:FAIL:test_run_opts err unexpected error: -14 (errno 14) In order to prevent such errors, add bpf_jit_supports_fsession() to guard those architectures. Fixes: `2d419c4465` ("bpf: add fsession support") Acked-by: Puranjay Mohan <puranjay@kernel.org> Tested-by: Puranjay Mohan <puranjay@kernel.org> Signed-off-by: Leon Hwang <leon.hwang@linux.dev> Link: https://lore.kernel.org/r/20260131144950.16294-2-leon.hwang@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-31 13:51:04 -08:00
Mykyta Yatsenko	f4e72ad7c1	bpf: Consolidate special map field validation in verifier Consolidate all logic for verifying special map fields in the single function check_map_field_pointer(). Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Link: https://lore.kernel.org/r/20260130-verif_special_fields-v2-2-2c59e637da7d@meta.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-30 21:13:48 -08:00
Mykyta Yatsenko	98c4fd2963	bpf: Introduce struct bpf_map_desc in verifier Introduce struct bpf_map_desc to hold bpf_map pointer and map uid. Use this struct in both bpf_call_arg_meta and bpf_kfunc_call_arg_meta instead of having different representations: - bpf_call_arg_meta had separate map_ptr and map_uid fields - bpf_kfunc_call_arg_meta had an anonymous inline struct This unifies the map fields layout across both metadata structures, making the code more consistent and preparing for further refactoring of map field pointer validation. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Link: https://lore.kernel.org/r/20260130-verif_special_fields-v2-1-2c59e637da7d@meta.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-30 21:13:48 -08:00
Steven Rostedt	76ed27608f	perf: sched: Fix perf crash with new is_user_task() helper In order to do a user space stacktrace the current task needs to be a user task that has executed in user space. It use to be possible to test if a task is a user task or not by simply checking the task_struct mm field. If it was non NULL, it was a user task and if not it was a kernel task. But things have changed over time, and some kernel tasks now have their own mm field. An idea was made to instead test PF_KTHREAD and two functions were used to wrap this check in case it became more complex to test if a task was a user task or not[1]. But this was rejected and the C code simply checked the PF_KTHREAD directly. It was later found that not all kernel threads set PF_KTHREAD. The io-uring helpers instead set PF_USER_WORKER and this needed to be added as well. But checking the flags is still not enough. There's a very small window when a task exits that it frees its mm field and it is set back to NULL. If perf were to trigger at this moment, the flags test would say its a user space task but when perf would read the mm field it would crash with at NULL pointer dereference. Now there are flags that can be used to test if a task is exiting, but they are set in areas that perf may still want to profile the user space task (to see where it exited). The only real test is to check both the flags and the mm field. Instead of making this modification in every location, create a new is_user_task() helper function that does all the tests needed to know if it is safe to read the user space memory or not. [1] https://lore.kernel.org/all/20250425204120.639530125@goodmis.org/ Fixes: `90942f9fac` ("perf: Use current->flags & PF_KTHREAD\|PF_USER_WORKER instead of current->mm == NULL") Closes: https://lore.kernel.org/all/0d877e6f-41a7-4724-875d-0b0a27b8a545@roeck-us.net/ Reported-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Guenter Roeck <linux@roeck-us.net> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260129102821.46484722@gandalf.local.home	2026-01-30 23:06:07 +01:00
Peter Zijlstra	1151354225	sched/deadline: Fix 'stuck' dl_server Andrea reported the dl_server getting stuck for him. He tracked it down to a state where dl_server_start() saw dl_defer_running==1, but the dl_server's job is no longer valid at the time of dl_server_start(). In the state diagram this corresponds to [4] D->A (or dl_server_stop() due to no more runnable tasks) followed by [1], which in case of a lapsed deadline must then be A->B. Now our A has dl_defer_running==1, while B demands dl_defer_running==0, therefore it must get cleared when the CBS wakeup rules demand a replenish. Fixes: `a110a81c52` ("sched/deadline: Deferrable dl server") Reported-by: Andrea Righi arighi@nvidia.com Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Tested-by: Andrea Righi arighi@nvidia.com Link: https://lkml.kernel.org/r/20260123161645.2181752-1-arighi@nvidia.com Link: https://patch.msgid.link/20260130124100.GC1079264@noisy.programming.kicks-ass.net	2026-01-30 23:06:06 +01:00
Linus Torvalds	2b54ac9e0c	dma-mapping fixes for Linux 6.19 - important fix for ARM 32-bit based systems using cma= kernel parameter (Oreoluwa Babatunde) - a fix for the corner case of the DMA atomic pool based allocations (Sai Sree Kartheek Adivi) -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQSrngzkoBtlA8uaaJ+Jp1EFxbsSRAUCaXyw5wAKCRCJp1EFxbsS RBN7AP9rEfEEB3JBOglcZG3TTrLoCKLnw+16uroyKuD95RLWrQD/bWeJnRYcEZB5 ox1peKYBA4SsDB3bCUWFDfW4I0OZFQQ= =9Nx1 -----END PGP SIGNATURE----- Merge tag 'dma-mapping-6.19-2026-01-30' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux Pull dma-mapping fixes from Marek Szyprowski: - important fix for ARM 32-bit based systems using cma= kernel parameter (Oreoluwa Babatunde) - a fix for the corner case of the DMA atomic pool based allocations (Sai Sree Kartheek Adivi) * tag 'dma-mapping-6.19-2026-01-30' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux: dma/pool: distinguish between missing and exhausted atomic pools of: reserved_mem: Allow reserved_mem framework detect "cma=" kernel param	2026-01-30 13:15:04 -08:00
Ionut Nechita (Sunlight Linux)	56534673ce	tick/nohz: Optimize check_tick_dependency() with early return There is no point in iterating through individual tick dependency bits when the tick_stop tracepoint is disabled, which is the common case. When the trace point is disabled, return immediately based on the atomic value being zero or non-zero, skipping the per-bit evaluation. This optimization improves the hot path performance of tick dependency checks across all contexts (idle and non-idle), not just nohz_full CPUs. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ionut Nechita (Sunlight Linux) <sunlightlinux@gmail.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260128074558.15433-3-sunlightlinux@gmail.com	2026-01-30 22:13:13 +01:00
Jiri Olsa	0f0c332992	bpf: Allow sleepable programs to use tail calls Allowing sleepable programs to use tail calls. Making sure we can't mix sleepable and non-sleepable bpf programs in tail call map (BPF_MAP_TYPE_PROG_ARRAY) and allowing it to be used in sleepable programs. Sleepable programs can be preempted and sleep which might bring new source of race conditions, but both direct and indirect tail calls should not be affected. Direct tail calls work by patching direct jump to callee into bpf caller program, so no problem there. We atomically switch from nop to jump instruction. Indirect tail call reads the callee from the map and then jumps to it. The callee bpf program can't disappear (be released) from the caller, because it is executed under rcu lock (rcu_read_lock_trace). Signed-off-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Leon Hwang <leon.hwang@linux.dev> Link: https://lore.kernel.org/r/20260130081208.1130204-2-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-30 12:17:47 -08:00
Steven Rostedt	02b75ece53	tracing: Add kerneldoc to trace_event_buffer_reserve() Add a appropriate kerneldoc to trace_event_buffer_reserve() to make it easier to understand how that function is used. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20260130103745.1126e4af@gandalf.local.home Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2026-01-30 10:44:38 -05:00
Steven Rostedt	a46023d561	tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast The current use of guard(preempt_notrace)() within __DECLARE_TRACE() to protect invocation of __DO_TRACE_CALL() means that BPF programs attached to tracepoints are non-preemptible. This is unhelpful in real-time systems, whose users apparently wish to use BPF while also achieving low latencies. (Who knew?) One option would be to use preemptible RCU, but this introduces many opportunities for infinite recursion, which many consider to be counterproductive, especially given the relatively small stacks provided by the Linux kernel. These opportunities could be shut down by sufficiently energetic duplication of code, but this sort of thing is considered impolite in some circles. Therefore, use the shiny new SRCU-fast API, which provides somewhat faster readers than those of preemptible RCU, at least on Paul E. McKenney's laptop, where task_struct access is more expensive than access to per-CPU variables. And SRCU-fast provides way faster readers than does SRCU, courtesy of being able to avoid the read-side use of smp_mb(). Also, it is quite straightforward to create srcu_read_{,un}lock_fast_notrace() functions. Link: https://lore.kernel.org/all/20250613152218.1924093-1-bigeasy@linutronix.de/ Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Alexei Starovoitov <ast@kernel.org> Link: https://patch.msgid.link/20260126231256.499701982@kernel.org Co-developed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2026-01-30 10:44:11 -05:00
Paul E. McKenney	a77cb6a867	srcu: Fix warning to permit SRCU-fast readers in NMI handlers SRCU-fast is designed to be used in NMI handlers, even going so far as to use atomic operations for architectures supporting NMIs but not providing NMI-safe per-CPU atomic operations. However, the WARN_ON_ONCE() in __srcu_check_read_flavor() complains if SRCU-fast is used in an NMI handler. This commit therefore modifies that WARN_ON_ONCE() to avoid such complaints. Reported-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Tested-by: Steven Rostedt <rostedt@goodmis.org> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Boqun Feng <boqun@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: bpf@vger.kernel.org Link: https://patch.msgid.link/8232efe8-a7a3-446c-af0b-19f9b523b4f7@paulmck-laptop Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2026-01-30 10:43:58 -05:00
Steven Rostedt	f7d327654b	bpf: Have __bpf_trace_run() use rcu_read_lock_dont_migrate() In order to switch the protection of tracepoint callbacks from preempt_disable() to srcu_read_lock_fast() the BPF callback from tracepoints needs to have migration prevention as the BPF programs expect to stay on the same CPU as they execute. Put together the RCU protection with migration prevention and use rcu_read_lock_dont_migrate() in __bpf_trace_run(). This will allow tracepoints callbacks to be preemptible. Link: https://lore.kernel.org/all/CAADnVQKvY026HSFGOsavJppm3-Ajm-VsLzY-OeFUe+BaKMRnDg@mail.gmail.com/ Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Alexei Starovoitov <ast@kernel.org> Link: https://patch.msgid.link/20260126231256.335034877@kernel.org Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2026-01-30 10:43:48 -05:00
Thomas Gleixner	5c4378b7b0	Merge branch 'core/entry' into sched/core Pull the entry update to avoid merge conflicts with the time slice extension changes. Signed-off-by: Thomas Gleixner <tglx@kernel.org>	2026-01-30 15:40:05 +01:00
Jinjie Ruan	31c9387d0d	entry: Inline syscall_exit_work() and syscall_trace_enter() After switching ARM64 to the generic entry code, a syscall_exit_work() appeared as a profiling hotspot because it is not inlined. Inlining both syscall_trace_enter() and syscall_exit_work() provides a performance gain when any of the work items is enabled. With audit enabled this results in a ~4% performance gain for perf bench basic syscall on a kunpeng920 system: \| Metric \| Baseline \| Inlined \| Change \| \| ---------- \| ----------- \| ----------- \| ------ \| \| Total time \| 2.353 [sec] \| 2.264 [sec] \| ↓3.8% \| \| usecs/op \| 0.235374 \| 0.226472 \| ↓3.8% \| \| ops/sec \| 4,248,588 \| 4,415,554 \| ↑3.9% \| Small gains can be observed on x86 as well, though the generated code optimizes for the work case, which is counterproductive for high performance scenarios where such entry/exit work is usually avoided. Avoid this by marking the work check in syscall_enter_from_user_mode_work() unlikely, which is what the corresponding check in the exit path does already. [ tglx: Massage changelog and add the unlikely() ] Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260128031934.3906955-14-ruanjinjie@huawei.com	2026-01-30 15:38:10 +01:00
Jinjie Ruan	578b21fd3a	entry: Add arch_ptrace_report_syscall_entry/exit() ARM64 requires a architecture specific ptrace wrapper as it needs to save and restore scratch registers. Provide arch_ptrace_report_syscall_entry/exit() wrappers which fall back to ptrace_report_syscall_entry/exit() if the architecture does not provide them. No functional change intended. [ tglx: Massaged changelog and comments ] Suggested-by: Mark Rutland <mark.rutland@arm.com> Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com> Link: https://patch.msgid.link/20260128031934.3906955-11-ruanjinjie@huawei.com	2026-01-30 15:38:09 +01:00
Jinjie Ruan	03150a9f84	entry: Remove unused syscall argument from syscall_trace_enter() The 'syscall' argument of syscall_trace_enter() is immediately overwritten before any real use and serves only as a local variable, so drop the parameter. No functional change intended. Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260128031934.3906955-2-ruanjinjie@huawei.com	2026-01-30 15:38:09 +01:00
Thorsten Blum	2dfc417414	genirq/proc: Replace snprintf with strscpy in register_handler_proc Replace snprintf("%s", ...) with the faster and more direct strscpy(). Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260127224949.441391-2-thorsten.blum@linux.dev	2026-01-30 08:53:53 +01:00
Masami Hiramatsu (Google)	73c12f2094	kprobes: Use dedicated kthread for kprobe optimizer Instead of using generic workqueue, use a dedicated kthread for optimizing kprobes, because it can wait (sleep) for a long time inside the process by synchronize_rcu_task(). This means other works can be stopped until it finishes. Link: https://lore.kernel.org/all/176970170302.114949.5175231591310436910.stgit@devnote2/ Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>	2026-01-30 11:49:38 +09:00
Thomas Gleixner	37f9d5026c	genirq/redirect: Prevent writing MSI message on affinity change The interrupts which are handled by the redirection infrastructure provide a irq_set_affinity() callback, which solely determines the target CPU for redirection via irq_work and und updates the effective affinity mask. Contrary to regular MSI interrupts this affinity setting does not change the underlying interrupt message as the message is only created at setup time to deliver to the demultiplexing interrupt. Therefore the message write in msi_domain_set_affinity() is a pointless exercise. In principle the write is harmless, but a Tegra system exposes a full system hang during suspend due to that write. It's unclear why the check for the PCI device state PCI_D0 in pci_msi_domain_write_msg(), which prevents the actual hardware access if a device is in powered down state, fails on this particular system, but that's a different problem which needs to be investigated by the Tegra experts. The irq_set_affinity() callback can advise msi_domain_set_affinity() not to write the MSI message by returning IRQ_SET_MASK_OK_DONE instead of IRQ_SET_MASK_OK. Do exactly that. Just to make it clear again: This is not a correctness issue of the redirection code as returning IRQ_SET_MASK_OK in that context is completely correct. From the core code point of view this is solely a optimization to avoid an redundant hardware write. As a byproduct it papers over the underlying problem on the Tegra platform, which fails to put the PCIe device[s] out of PCI_D0 despite the fact that the devices and busses have been shut down. The redirect infrastructure just unearthed the underlying issue, which is prone to happen in quite some other code paths which use the PCI_D0 check to prevent hardware access to powered down devices. This therefore has neither a 'Fixes:' nor a 'Closes:' tag associated as the underlying problem, which is outside the scope of the interrupt code, is still unresolved. Reported-by: Jon Hunter <jonathanh@nvidia.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: Jon Hunter <jonathanh@nvidia.com> Link: https://lore.kernel.org/all/4e5b349c-6599-4871-9e3b-e10352ae0ca0@nvidia.com Link: https://patch.msgid.link/87tsw6aglz.ffs@tglx	2026-01-29 23:49:55 +01:00
Linus Torvalds	bcb6058a4b	16 hotfixes. 9 are cc:stable, 12 are for MM. - There's a 3 patch series from Pratyush Yadav which fixes a few things in the new-in-6.19 LUO memfd code. - Plus the usual shower of singletons - please see the changelogs for details. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaXub9wAKCRDdBJ7gKXxA jjoHAP48ag3lmJnpej977MxrA4VUBhv6ATmbFE2+czCzbRJaigEAiIMWtUlSVYbH WuEYQvcFeSR0hYjs0ClKUiZYOUcj+wI= =lHEL -----END PGP SIGNATURE----- Merge tag 'mm-hotfixes-stable-2026-01-29-09-41' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "16 hotfixes. 9 are cc:stable, 12 are for MM. There's a patch series from Pratyush Yadav which fixes a few things in the new-in-6.19 LUO memfd code. Plus the usual shower of singletons - please see the changelogs for details" * tag 'mm-hotfixes-stable-2026-01-29-09-41' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: vmcoreinfo: make hwerr_data visible for debugging mm/zone_device: reinitialize large zone device private folios mm/mm_init: don't cond_resched() in deferred_init_memmap_chunk() if called from deferred_grow_zone() mm/kfence: randomize the freelist on initialization kho: kho_preserve_vmalloc(): don't return 0 when ENOMEM kho: init alloc tags when restoring pages from reserved memory mm: memfd_luo: restore and free memfd_luo_ser on failure mm: memfd_luo: use memfd_alloc_file() instead of shmem_file_setup() memfd: export alloc_file() flex_proportions: make fprop_new_period() hardirq safe mailmap: add entry for Viacheslav Bocharov mm/memory-failure: teach kill_accessing_process to accept hugetlb tail page pfn mm/memory-failure: fix missing ->mf_stats count in hugetlb poison mm, swap: restore swap_space attr aviod kernel panic mm/kasan: fix KASAN poisoning in vrealloc() mm/shmem, swap: fix race of truncate and swap entry split	2026-01-29 11:09:13 -08:00
Deepak Gupta	5ca243f6e3	prctl: add arch-agnostic prctl()s for indirect branch tracking Three architectures (x86, aarch64, riscv) have support for indirect branch tracking feature in a very similar fashion. On a very high level, indirect branch tracking is a CPU feature where CPU tracks branches which use a memory operand to transfer control. As part of this tracking, during an indirect branch, the CPU expects a landing pad instruction on the target PC, and if not found, the CPU raises some fault (architecture-dependent). x86 landing pad instr - 'ENDBRANCH' arch64 landing pad instr - 'BTI' riscv landing instr - 'lpad' Given that three major architectures have support for indirect branch tracking, this patch creates architecture-agnostic 'prctls' to allow userspace to control this feature. They are: - PR_GET_INDIR_BR_LP_STATUS: Get the current configured status for indirect branch tracking. - PR_SET_INDIR_BR_LP_STATUS: Set the configuration for indirect branch tracking. The following status options are allowed: - PR_INDIR_BR_LP_ENABLE: Enables indirect branch tracking on user thread. - PR_INDIR_BR_LP_DISABLE: Disables indirect branch tracking on user thread. - PR_LOCK_INDIR_BR_LP_STATUS: Locks configured status for indirect branch tracking for user thread. Reviewed-by: Mark Brown <broonie@kernel.org> Reviewed-by: Zong Li <zong.li@sifive.com> Signed-off-by: Deepak Gupta <debug@rivosinc.com> Tested-by: Andreas Korb <andreas.korb@aisec.fraunhofer.de> # QEMU, custom CVA6 Tested-by: Valentin Haudiquet <valentin.haudiquet@canonical.com> Link: https://patch.msgid.link/20251112-v5_user_cfi_series-v23-13-b55691eacf4f@rivosinc.com [pjw@kernel.org: cleaned up patch description, code comments] Signed-off-by: Paul Walmsley <pjw@kernel.org>	2026-01-29 02:36:32 -07:00
Sai Sree Kartheek Adivi	56c430c7f0	dma/pool: distinguish between missing and exhausted atomic pools Currently, dma_alloc_from_pool() unconditionally warns and dumps a stack trace when an allocation fails, with the message "Failed to get suitable pool". This conflates two distinct failure modes: 1. Configuration error: No atomic pool is available for the requested DMA mask (a fundamental system setup issue) 2. Resource Exhaustion: A suitable pool exists but is currently full (a recoverable runtime state) This lack of distinction prevents drivers from using __GFP_NOWARN to suppress error messages during temporary pressure spikes, such as when awaiting synchronous reclaim of descriptors. Refactor the error handling to distinguish these cases: - If no suitable pool is found, keep the unconditional WARN regarding the missing pool. - If a pool was found but is exhausted, respect __GFP_NOWARN and update the warning message to explicitly state "DMA pool exhausted". Fixes: `9420139f51` ("dma-pool: fix coherent pool allocations for IOMMU mappings") Signed-off-by: Sai Sree Kartheek Adivi <s-adivi@ti.com> Reviewed-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20260128133554.3056582-1-s-adivi@ti.com	2026-01-29 10:23:45 +01:00
Luis Gerhorst	cd3b6a3d49	bpf: Fix verifier_bug_if to account for BPF_CALL The BPF verifier assumes `insn_aux->nospec_result` is only set for direct memory writes (e.g., `(u32)(r1+off) = r2`). However, the assertion fails to account for helper calls (e.g., `bpf_skb_load_bytes_relative`) that perform writes to stack memory. Make the check more precise to resolve this. The problem is that `BPF_CALL` instructions have `BPF_CLASS(insn->code) == BPF_JMP`, which triggers the warning check: - Helpers like `bpf_skb_load_bytes_relative` write to stack memory - `check_helper_call()` loops through `meta.access_size`, calling `check_mem_access(..., BPF_WRITE)` - `check_stack_write()` sets `insn_aux->nospec_result = 1` - Since `BPF_CALL` is encoded as `BPF_JMP \| BPF_CALL`, the warning fires Execution flow: ``` 1. Drop capabilities → Enable Spectre mitigation 2. Load BPF program └─> do_check() ├─> check_cond_jmp_op() → Marks dead branch as speculative │ └─> push_stack(..., speculative=true) ├─> pop_stack() → state->speculative = 1 ├─> check_helper_call() → Processes helper in dead branch │ └─> check_mem_access(..., BPF_WRITE) │ └─> insn_aux->nospec_result = 1 └─> Checks: state->speculative && insn_aux->nospec_result └─> BPF_CLASS(insn->code) == BPF_JMP → WARNING ``` To fix the assert, it would be nice to be able to reuse bpf_insn_successors() here, but bpf_insn_successors()->cnt is not exactly what we want as it may also be 1 for BPF_JA. Instead, we could check opcode_info.can_jump, but then we would have to share the table between the functions. This would mean moving the table out of the function and adding bpf_opcode_info(). As the verifier_bug_if() only runs for insns with nospec_result set, the impact on verification time would likely still be negligible. However, I assume sharing bpf_opcode_info() between liveness.c and verifier.c will not be worth it. It seems as only adjust_jmp_off() could also be simplified using it, and there imm/off is touched. Thus it is maybe better to rely on exact opcode/class matching there. Therefore, to avoid this sharing only for a verifier_bug_if(), just check the opcode. This should now cover all opcodes for which can_jump in bpf_insn_successors() is true. Parts of the description and example are taken from the bug report. Fixes: `dadb59104c` ("bpf: Fix aux usage after do_check_insn()") Signed-off-by: Luis Gerhorst <luis.gerhorst@fau.de> Reported-by: Yinhao Hu <dddddd@hust.edu.cn> Reported-by: Kaiyan Mei <M202472210@hust.edu.cn> Reported-by: Dongliang Mu <dzm91@hust.edu.cn> Closes: https://lore.kernel.org/bpf/7678017d-b760-4053-a2d8-a6879b0dbeeb@hust.edu.cn/ Link: https://lore.kernel.org/r/20260127115912.3026761-2-luis.gerhorst@fau.de Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-28 18:41:57 -08:00
Steven Rostedt	9df0e49c5b	tracing: Remove duplicate ENABLE_EVENT_STR and DISABLE_EVENT_STR macros The macros ENABLE_EVENT_STR and DISABLE_EVENT_STR were added to trace.h so that more than one file can have access to them, but was never removed from their original location. Remove the duplicates. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Tom Zanussi <zanussi@kernel.org> Link: https://patch.msgid.link/20260126130037.4ba201f9@gandalf.local.home Fixes: `d0bad49bb0` ("tracing: Add enable_hist/disable_hist triggers") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2026-01-28 21:01:10 -05:00
Steven Rostedt	e62750b6ab	tracing: Up the hist stacktrace size from 16 to 31 Recording stacktraces is very useful, but the size of 16 deep is very restrictive. For example, in seeing where tasks schedule out in a non running state, the following can be used: ~# cd /sys/kernel/tracing ~# echo 'hist:keys=common_stacktrace:vals=hitcount if prev_state & 3' > events/sched/sched_switch/trigger ~# cat events/sched/sched_switch/hist [..] { common_stacktrace: __schedule+0xdc0/0x1860 schedule+0x27/0xd0 schedule_timeout+0xb5/0x100 wait_for_completion+0x8a/0x140 xfs_buf_iowait+0x20/0xd0 [xfs] xfs_buf_read_map+0x103/0x250 [xfs] xfs_trans_read_buf_map+0x161/0x310 [xfs] xfs_btree_read_buf_block+0xa0/0x120 [xfs] xfs_btree_lookup_get_block+0xa3/0x1e0 [xfs] xfs_btree_lookup+0xea/0x530 [xfs] xfs_alloc_fixup_trees+0x72/0x570 [xfs] xfs_alloc_ag_vextent_size+0x67f/0x800 [xfs] xfs_alloc_vextent_iterate_ags.constprop.0+0x52/0x230 [xfs] xfs_alloc_vextent_start_ag+0x9d/0x1b0 [xfs] xfs_bmap_btalloc+0x2af/0x680 [xfs] xfs_bmapi_allocate+0xdb/0x2c0 [xfs] } hitcount: 1 [..] The above stops at 16 functions where knowing more would be useful. As the allocated storage for stacks is the same for strings, and that size is 256 bytes, there is a lot of space not being used for stacktraces. 16 * 8 = 128 Up the size to 31 (it requires the last slot to be zero, so it can't be 32). Also change the BUILD_BUG_ON() to allow the size of the stacktrace storage to be equal to the max size. One slot is used to hold the number of elements in the stack. BUILD_BUG_ON((HIST_STACKTRACE_DEPTH + 1) * sizeof(long) >= STR_VAR_LEN_MAX); Change that from ">=" to just ">", as now they are equal. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20260123105415.2be26bf4@gandalf.local.home Reviewed-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2026-01-28 21:01:10 -05:00
Steven Rostedt	ef742dc5f8	tracing: Remove notrace from trace_event_raw_event_synth() When debugging the synthetic events, being able to function trace its functions is very useful (now that CONFIG_FUNCTION_SELF_TRACING is available). For some reason trace_event_raw_event_synth() was marked as "notrace", which was totally unnecessary as all of the tracing directory had function tracing disabled until the recent FUNCTION_SELF_TRACING was added. Remove the notrace annotation from trace_event_raw_event_synth() as there's no reason to not trace it when tracing synthetic event functions. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20260122204526.068a98c9@gandalf.local.home Acked-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2026-01-28 21:01:09 -05:00
Steven Rostedt	45641096c9	tracing: Have hist_debug show what function a field uses When CONFIG_HIST_TRIGGERS_DEBUG is enabled, each trace event has a "hist_debug" file that explains the histogram internal data. This is very useful for debugging histograms. One bit of data that was missing from this file was what function a histogram field uses to process its data. The hist_field structure now has a fn_num that is used by a switch statement in hist_fn_call() to call a function directly (to avoid spectre mitigations). Instead of displaying that number, create a string array that maps to the histogram function enums so that the function for a field may be displayed: ~# cat /sys/kernel/tracing/events/sched/sched_switch/hist_debug [..] hist_data: 0000000043d62762 n_vals: 2 n_keys: 1 n_fields: 3 val fields: hist_data->fields[0]: flags: VAL: HIST_FIELD_FL_HITCOUNT type: u64 size: 8 is_signed: 0 function: hist_field_counter() hist_data->fields[1]: flags: HIST_FIELD_FL_VAR var.name: __arg_3921_2 var.idx (into tracing_map_elt.vars[]): 0 type: unsigned long[] size: 128 is_signed: 0 function: hist_field_nop() key fields: hist_data->fields[2]: flags: HIST_FIELD_FL_KEY ftrace_event_field name: prev_pid type: pid_t size: 8 is_signed: 1 function: hist_field_s32() The "function:" field above is added. Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20260122203822.58df4d80@gandalf.local.home Reviewed-by: Tom Zanussi <zanussi@kernel.org> Tested-by: Tom Zanussi <zanussi@kernel.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2026-01-28 19:32:55 -05:00
sunliming	b8121b9cdc	tracing: kprobe-event: Return directly when trace kprobes is empty In enable_boot_kprobe_events(), it returns directly when trace kprobes is empty, thereby reducing the function's execution time. This function may otherwise wait for the event_mutex lock for tens of milliseconds on certain machines, which is unnecessary when trace kprobes is empty. Link: https://lore.kernel.org/all/20260127053848.108473-1-sunliming@linux.dev/ Signed-off-by: sunliming <sunliming@kylinos.cn> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>	2026-01-29 09:23:43 +09:00
Oreoluwa Babatunde	0fd17e5983	of: reserved_mem: Allow reserved_mem framework detect "cma=" kernel param When initializing the default cma region, the "cma=" kernel parameter takes priority over a DT defined linux,cma-default region. Hence, give the reserved_mem framework the ability to detect this so that the DT defined cma region can skip initialization accordingly. Signed-off-by: Oreoluwa Babatunde <oreoluwa.babatunde@oss.qualcomm.com> Tested-by: Joy Zou <joy.zou@nxp.com> Acked-by: Rob Herring (Arm) <robh@kernel.org> Fixes: `8a6e02d0c0` ("of: reserved_mem: Restructure how the reserved memory regions are processed") Fixes: `2c223f7239` ("of: reserved_mem: Restructure call site for dma_contiguous_early_fixup()") Link: https://lore.kernel.org/r/20251210002027.1171519-1-oreoluwa.babatunde@oss.qualcomm.com [mszyprow: rebased onto v6.19-rc1, added fixes tags, added a stub for cma_skip_dt_default_reserved_mem() if no CONFIG_DMA_CMA is set] Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>	2026-01-29 00:26:36 +01:00
Frederic Weisbecker	a554a25e66	cpufreq: ondemand: Simplify idle cputime granularity test cpufreq calls get_cpu_idle_time_us() just to know if idle cputime accounting has a nanoseconds granularity. Use the appropriate indicator instead to make that deduction. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Link: https://patch.msgid.link/aXozx0PXutnm8ECX@localhost.localdomain Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2026-01-28 22:24:58 +01:00
Rafael J. Wysocki	1081c1649d	PM: hibernate: Drop NULL pointer checks before acomp_request_free() Since acomp_request_free() checks its argument against NULL, the NULL pointer checks before calling it added by commit ("7966cf0ebe32 PM: hibernate: Fix crash when freeing invalid crypto compressor") are redundant, so drop them. No intentional functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://patch.msgid.link/6233709.lOV4Wx5bFT@rafael.j.wysocki	2026-01-28 22:12:55 +01:00
Marco Elver	b7be9442a3	kcov: Use scoped init guard Convert lock initialization to scoped guarded initialization where lock-guarded members are initialized in the same scope. This ensures the context analysis treats the context as active during member initialization. This is required to avoid errors once implicit context assertion is removed. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260119094029.1344361-4-elver@google.com	2026-01-28 20:45:24 +01:00
Jiri Olsa	424f6a3610	bpf,x86: Use single ftrace_ops for direct calls Using single ftrace_ops for direct calls update instead of allocating ftrace_ops object for each trampoline. With single ftrace_ops object we can use update_ftrace_direct_* api that allows multiple ip sites updates on single ftrace_ops object. Adding HAVE_SINGLE_FTRACE_DIRECT_OPS config option to be enabled on each arch that supports this. At the moment we can enable this only on x86 arch, because arm relies on ftrace_ops object representing just single trampoline image (stored in ftrace_ops::direct_call). Archs that do not support this will continue to use *_ftrace_direct api. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://lore.kernel.org/bpf/20251230145010.103439-10-jolsa@kernel.org	2026-01-28 11:44:59 -08:00
Jiri Olsa	956747efd8	ftrace: Factor ftrace_ops ops_func interface We are going to remove "ftrace_ops->private == bpf_trampoline" setup in following changes. Adding ip argument to ftrace_ops_func_t callback function, so we can use it to look up the trampoline. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://lore.kernel.org/bpf/20251230145010.103439-9-jolsa@kernel.org	2026-01-28 11:44:57 -08:00
Jiri Olsa	7d0452497c	bpf: Add trampoline ip hash table Following changes need to lookup trampoline based on its ip address, adding hash table for that. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20251230145010.103439-8-jolsa@kernel.org	2026-01-28 11:44:57 -08:00
Jiri Olsa	e93672f770	ftrace: Add update_ftrace_direct_mod function Adding update_ftrace_direct_mod function that modifies all entries (ip -> direct) provided in hash argument to direct ftrace ops and updates its attachments. The difference to current modify_ftrace_direct is: - hash argument that allows to modify multiple ip -> direct entries at once This change will allow us to have simple ftrace_ops for all bpf direct interface users in following changes. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://lore.kernel.org/bpf/20251230145010.103439-7-jolsa@kernel.org	2026-01-28 11:44:54 -08:00
Jiri Olsa	8d2c1233f3	ftrace: Add update_ftrace_direct_del function Adding update_ftrace_direct_del function that removes all entries (ip -> addr) provided in hash argument to direct ftrace ops and updates its attachments. The difference to current unregister_ftrace_direct is - hash argument that allows to unregister multiple ip -> direct entries at once - we can call update_ftrace_direct_del multiple times on the same ftrace_ops object, becase we do not need to unregister all entries at once, we can do it gradualy with the help of ftrace_update_ops function This change will allow us to have simple ftrace_ops for all bpf direct interface users in following changes. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://lore.kernel.org/bpf/20251230145010.103439-6-jolsa@kernel.org	2026-01-28 11:44:51 -08:00
Jiri Olsa	05dc5e9c1f	ftrace: Add update_ftrace_direct_add function Adding update_ftrace_direct_add function that adds all entries (ip -> addr) provided in hash argument to direct ftrace ops and updates its attachments. The difference to current register_ftrace_direct is - hash argument that allows to register multiple ip -> direct entries at once - we can call update_ftrace_direct_add multiple times on the same ftrace_ops object, becase after first registration with register_ftrace_function_nolock, it uses ftrace_update_ops to update the ftrace_ops object This change will allow us to have simple ftrace_ops for all bpf direct interface users in following changes. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://lore.kernel.org/bpf/20251230145010.103439-5-jolsa@kernel.org	2026-01-28 11:44:48 -08:00
Jiri Olsa	0e860d07c2	ftrace: Export some of hash related functions We are going to use these functions in following changes. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://lore.kernel.org/bpf/20251230145010.103439-4-jolsa@kernel.org	2026-01-28 11:44:45 -08:00
Jiri Olsa	676bfeae7b	ftrace: Make alloc_and_copy_ftrace_hash direct friendly Make alloc_and_copy_ftrace_hash to copy also direct address for each hash entry. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://lore.kernel.org/bpf/20251230145010.103439-3-jolsa@kernel.org	2026-01-28 11:44:43 -08:00
Jiri Olsa	4be42c9222	ftrace,bpf: Remove FTRACE_OPS_FL_JMP ftrace_ops flag At the moment the we allow the jmp attach only for ftrace_ops that has FTRACE_OPS_FL_JMP set. This conflicts with following changes where we use single ftrace_ops object for all direct call sites, so all could be be attached via just call or jmp. We already limit the jmp attach support with config option and bit (LSB) set on the trampoline address. It turns out that's actually enough to limit the jmp attach for architecture and only for chosen addresses (with LSB bit set). Each user of register_ftrace_direct or modify_ftrace_direct can set the trampoline bit (LSB) to indicate it has to be attached by jmp. The bpf trampoline generation code uses trampoline flags to generate jmp-attach specific code and ftrace inner code uses the trampoline bit (LSB) to handle return from jmp attachment, so there's no harm to remove the FTRACE_OPS_FL_JMP bit. The fexit/fmodret performance stays the same (did not drop), current code: fentry : 77.904 ± 0.546M/s fexit : 62.430 ± 0.554M/s fmodret : 66.503 ± 0.902M/s with this change: fentry : 80.472 ± 0.061M/s fexit : 63.995 ± 0.127M/s fmodret : 67.362 ± 0.175M/s Fixes: `25e4e3565d` ("ftrace: Introduce FTRACE_OPS_FL_JMP") Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://lore.kernel.org/bpf/20251230145010.103439-2-jolsa@kernel.org	2026-01-28 11:44:35 -08:00
Guillaume Gonnet	ae23bc81dd	bpf: Fix tcx/netkit detach permissions when prog fd isn't given This commit fixes a security issue where BPF_PROG_DETACH on tcx or netkit devices could be executed by any user when no program fd was provided, bypassing permission checks. The fix adds a capability check for CAP_NET_ADMIN or CAP_SYS_ADMIN in this case. Fixes: `e420bed025` ("bpf: Add fd-based tcx multi-prog infra with link support") Signed-off-by: Guillaume Gonnet <ggonnet.linux@gmail.com> Link: https://lore.kernel.org/r/20260127160200.10395-1-ggonnet.linux@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-27 18:39:58 -08:00
Ilpo Järvinen	4326ab1806	resource: Increase MAX_IORES_LEVEL to 8 While debugging a PCI resource allocation issue, the resources for many nested bridges and endpoints got flattened in /proc/iomem by MAX_IORES_LEVEL that is set to 5. This made the iomem output hard to read as the visual hierarchy cues were lost. Increase MAX_IORES_LEVEL to 8 to avoid flattening PCI topologies with nested bridges so aggressively (the case in the Link has the deepest resource at level 7 so 8 looks a reasonable limit). Link: https://bugzilla.kernel.org/show_bug.cgi?id=220775 Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Link: https://patch.msgid.link/20251219174036.16738-5-ilpo.jarvinen@linux.intel.com	2026-01-27 16:36:51 -06:00
Matt Bobrowski	752b807028	bpf: add new BPF_CGROUP_ITER_CHILDREN control option Currently, the BPF cgroup iterator supports walking descendants in either pre-order (BPF_CGROUP_ITER_DESCENDANTS_PRE) or post-order (BPF_CGROUP_ITER_DESCENDANTS_POST). These modes perform an exhaustive depth-first search (DFS) of the hierarchy. In scenarios where a BPF program may need to inspect only the direct children of a given parent cgroup, a full DFS is unnecessarily expensive. This patch introduces a new BPF cgroup iterator control option, BPF_CGROUP_ITER_CHILDREN. This control option restricts the traversal to the immediate children of a specified parent cgroup, allowing for more targeted and efficient iteration, particularly when exhaustive depth-first search (DFS) traversal is not required. Signed-off-by: Matt Bobrowski <mattbobrowski@google.com> Link: https://lore.kernel.org/r/20260127085112.3608687-1-mattbobrowski@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-27 09:05:54 -08:00
Tim Bird	c86d39d680	kernel: debug: Add SPDX license ids to kdb files Add GPL-2.0 license id to some files related to kdb and kgdb, replacing references to GPL or COPYING. These files were introduced into the kernel in 2008 and 2010. Signed-off-by: Tim Bird <tim.bird@sony.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2026-01-27 15:57:20 +01:00
Lorenzo Pieralisi	0323897a88	irqdomain: Add parent field to struct irqchip_fwid The GICv5 driver IRQ domain hierarchy requires adding a parent field to struct irqchip_fwid so that core code can reference a fwnode_handle parent for a given fwnode. Add a parent field to struct irqchip_fwid and update the related kernel API functions to initialize and handle it. Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Acked-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260115-gicv5-host-acpi-v3-1-c13a9a150388@kernel.org Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2026-01-27 15:31:41 +01:00
Yury Norov	291487b753	cgroup: use nodes_and() output where appropriate Now that nodes_and() returns true if the result nodemask is not empty, drop useless nodes_intersects() in guarantee_online_mems() and nodes_empty() in update_nodemasks_hier(), which both are O(N). Link: https://lkml.kernel.org/r/20260114172217.861204-4-ynorov@nvidia.com Signed-off-by: Yury Norov <ynorov@nvidia.com> Reviewed-by: Gregory Price <gourry@gourry.net> Reviewed-by: Joshua Hahn <joshua.hahnjy@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: Byungchul Park <byungchul@sk.com> Cc: David Hildenbrand <david@kernel.org> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mathew Brost <matthew.brost@intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Michal Koutný <mkoutny@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Waiman Long <longman@redhat.com> Cc: Yury Norov (NVIDIA) <yury.norov@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-26 20:02:37 -08:00
Pratyush Yadav (Google)	6ca9de3600	kho: print which scratch buffer failed to be reserved When scratch area fails to reserve, KHO prints a message indicating that. But it doesn't say which scratch failed to allocate. This can be useful information for debugging. Even more so when the failure is hard to reproduce. Along with the current message, also print which exact scratch area failed to be reserved. Link: https://lkml.kernel.org/r/20260116165416.1262531-1-pratyush@kernel.org Signed-off-by: Pratyush Yadav (Google) <pratyush@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: David Matlack <dmatlack@google.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Pratyush Yadav <pratyush@kernel.org> Cc: Samiullah Khawaja <skhawaja@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-26 19:07:15 -08:00
Finn Thain	3bb83c9109	bpf: explicitly align bpf_res_spin_lock Patch series "Align atomic storage", v7. This series adds the __aligned attribute to atomic_t and atomic64_t definitions in include/linux and include/asm-generic (respectively) to get natural alignment of both types on csky, m68k, microblaze, nios2, openrisc and sh. This series also adds Kconfig options to enable a new run-time warning to help reveal misaligned atomic accesses on platforms which don't trap that. The performance impact is expected to vary across platforms and workloads. The measurements I made on m68k show that some workloads run faster and others slower. This patch (of 4): Align bpf_res_spin_lock to avoid a BUILD_BUG_ON() when the alignment changes, as it will do on m68k when, in a subsequent patch, the minimum alignment of the atomic_t member of struct rqspinlock gets increased from 2 to 4. Drop the BUILD_BUG_ON() as it becomes redundant. Link: https://lkml.kernel.org/r/cover.1768281748.git.fthain@linux-m68k.org Link: https://lkml.kernel.org/r/8a83876b07d1feacc024521e44059ae89abbb1ea.1768281748.git.fthain@linux-m68k.org Signed-off-by: Finn Thain <fthain@linux-m68k.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Daniel Borkman <daniel@iogearbox.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Eduard Zingerman <eddyz87@gmail.com> Cc: Gary Guo <gary@garyguo.net> Cc: Guo Ren <guoren@kernel.org> Cc: Hao Luo <haoluo@google.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: John Fastabend <john.fastabend@gmail.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonas Bonn <jonas@southpole.se> Cc: KP Singh <kpsingh@kernel.org> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rich Felker <dalias@libc.org> Cc: Sasha Levin (Microsoft) <sashal@kernel.org> Cc: Song Liu <song@kernel.org> Cc: Stafford Horne <shorne@gmail.com> Cc: Stanislav Fomichev <sdf@fomichev.me> Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Cc: Yonghong Song <yonghong.song@linux.dev> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-26 19:07:14 -08:00
Mathieu Desnoyers	5e65b5ca7d	tsacct: skip all kernel threads This patch is a preparation step for HPCC, for the OOM killer improvements. I suspect that this patch is useful on its own, because it really makes no sense to sum up accounting statistics of use_mm within kernel threads which are only temporarily using those mm. When we hit acct_account_cputime within a irq handler over a kthread that happens to use a userspace mm, we end up summing up the mm's RSS into the tsk acct_rss_mem1, which eventually decays. I don't see a good rationale behind tracking the mm's rss in that way when a kthread use a userspace mm temporarily through use_mm. It causes issues with init_mm and efi_mm which only partially initialize their mm_struct when introducing the new hierarchical percpu counters to replace RSS counters, which requires a pointer dereference when reading the approximate counter sum. The current percpu counters simply load a zeroed atomic counter, which happen to work. Skip all kernel threads in acct_account_cputime(), not just those that happen to have a NULL mm. This is a preparation step before introducing the hierarchical percpu counters. Link: https://lkml.kernel.org/r/20251224173810.648699-2-mathieu.desnoyers@efficios.com Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Mark Brown <broonie@kernel.org> Cc: Aboorva Devarajan <aboorvad@linux.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Christan König <christian.koenig@amd.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Liam R . Howlett" <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Martin Liu <liumartin@google.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: SeongJae Park <sj@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-26 19:07:13 -08:00
Long Wei	25929dae28	kho: remove duplicate header file references kexec_handover_internal.h is included twice in kexec_handover.c. Remove the redundant first inclusion to eliminate the duplication. Link: https://lkml.kernel.org/r/20251216114400.2677311-1-longwei27@huawei.com Signed-off-by: Long Wei <longwei27@huawei.com> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Alexander Graf <graf@amazon.com> Cc: hewenliang <hewenliang4@huawei.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Pratyush Yadav <pratyush@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-26 19:07:13 -08:00
mingzhu.wang(王明珠)	2bbd9e1d14	kernel/fork: update obsolete use_mm references to kthread_use_mm The comment for get_task_mm() in kernel/fork.c incorrectly references the deprecated function `use_mm()`, which has been renamed to `kthread_use_mm()` in kernel/kthread.c. This patch updates the documentation to reflect the current function names, ensuring accuracy when developers refer to the kernel thread memory context API. No functional changes were introduced. Link: https://lkml.kernel.org/r/KUZPR04MB8965F954108B4DD7E8FFDB2B8F84A@KUZPR04MB8965.apcprd04.prod.outlook.com Signed-off-by: mingzhu.wang <mingzhu.wang@transsion.com> Cc: Ben Segall <bsegall@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiazi Li <jqqlijiazi@gmail.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Kees Cook <kees@kernel.org> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-26 19:07:12 -08:00
Jason Miu	ac2d8102c4	kho: relocate vmalloc preservation structure to KHO ABI header The `struct kho_vmalloc` defines the in-memory layout for preserving vmalloc regions across kexec. This layout is a contract between kernels and part of the KHO ABI. To reflect this relationship, the related structs and helper macros are relocated to the ABI header, `include/linux/kho/abi/kexec_handover.h`. This move places the structure's definition under the protection of the KHO_FDT_COMPATIBLE version string. The structure and its components are now also documented within the ABI header to describe the contract and prevent ABI breaks. [rppt@kernel.org: update comment, per Pratyush] Link: https://lkml.kernel.org/r/aW_Mqp6HcqLwQImS@kernel.org Link: https://lkml.kernel.org/r/20260105165839.285270-6-rppt@kernel.org Signed-off-by: Jason Miu <jasonmiu@google.com> Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Pratyush Yadav <pratyush@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-26 19:07:12 -08:00
Jason Miu	5e1ea1e27b	kho: introduce KHO FDT ABI header Introduce the `include/linux/kho/abi/kexec_handover.h` header file, which defines the stable ABI for the KHO mechanism. This header specifies how preserved data is passed between kernels using an FDT. The ABI contract includes the FDT structure, node properties, and the "kho-v1" compatible string. By centralizing these definitions, this header serves as the foundational agreement for inter-kernel communication of preserved states, ensuring forward compatibility and preventing misinterpretation of data across kexec transitions. Since the ABI definitions are now centralized in the header files, the YAML files that previously described the FDT interfaces are redundant. These redundant files have therefore been removed. Link: https://lkml.kernel.org/r/20260105165839.285270-5-rppt@kernel.org Signed-off-by: Jason Miu <jasonmiu@google.com> Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-26 19:07:12 -08:00
Mike Rapoport (Microsoft)	a6f4e56828	kho: docs: combine concepts and FDT documentation Currently index.rst in KHO documentation looks empty and sad as it only contains links to "Kexec Handover Concepts" and "KHO FDT" chapters. Inline contents of these chapters into index.rst to provide a single coherent chapter describing KHO. While on it, drop parts of the KHO FDT description that will be superseded by addition of KHO ABI documentation. [rppt@kernel.org: fix Documentation/core-api/kho/index.rst] Link: https://lkml.kernel.org/r/aV4bnHlBXGpT_FMc@kernel.org Link: https://lkml.kernel.org/r/20260105165839.285270-4-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Jason Miu <jasonmiu@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Pratyush Yadav <pratyush@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-26 19:07:11 -08:00
Pasha Tatashin	998be0a4db	liveupdate: separate memfd support into LIVEUPDATE_MEMFD Decouple memfd preservation support from the core Live Update Orchestrator configuration. Previously, enabling CONFIG_LIVEUPDATE forced a dependency on CONFIG_SHMEM and unconditionally compiled memfd_luo.o. However, Live Update may be used for purposes that do not require memfd-backed memory preservation. Introduce CONFIG_LIVEUPDATE_MEMFD to gate memfd_luo.o. This moves the SHMEM and MEMFD_CREATE dependencies to the specific feature that needs them, allowing the base LIVEUPDATE option to be selected independently of shared memory support. Link: https://lkml.kernel.org/r/20251230161402.1542099-1-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-26 19:07:10 -08:00
Breno Leitao	bd58782995	vmcoreinfo: make hwerr_data visible for debugging If the kernel is compiled with LTO, hwerr_data symbol might be lost, and vmcoreinfo doesn't have it dumped. This is currently seen in some production kernels with LTO enabled. Remove the static qualifier from hwerr_data so that the information is still preserved when the kernel is built with LTO. Making hwerr_data a global symbol ensures its debug info survives the LTO link process and appears in kallsyms. Also document it, so it doesn't get removed in the future as suggested by akpm. Link: https://lkml.kernel.org/r/20260122-fix_vmcoreinfo-v2-1-2d6311f9e36c@debian.org Fixes: `3fa805c37d` ("vmcoreinfo: track and log recoverable hardware errors") Signed-off-by: Breno Leitao <leitao@debian.org> Acked-by: Baoquan He <bhe@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Omar Sandoval <osandov@osandov.com> Cc: Shuai Xue <xueshuai@linux.alibaba.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Zhiquan Li <zhiquan1.li@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-26 19:03:49 -08:00
Andrew Morton	412a32f0e5	kho: kho_preserve_vmalloc(): don't return 0 when ENOMEM kho_preserve_vmalloc() should return -ENOMEM when new_vmalloc_chunk() fails. Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/r/202601211636.IRaejjdw-lkp@intel.com/ Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-26 19:03:48 -08:00
Ran Xiaokai	e86436ad0a	kho: init alloc tags when restoring pages from reserved memory Memblock pages (including reserved memory) should have their allocation tags initialized to CODETAG_EMPTY via clear_page_tag_ref() before being released to the page allocator. When kho restores pages through kho_restore_page(), missing this call causes mismatched allocation/deallocation tracking and below warning message: alloc_tag was not set WARNING: include/linux/alloc_tag.h:164 at ___free_pages+0xb8/0x260, CPU#1: swapper/0/1 RIP: 0010:___free_pages+0xb8/0x260 kho_restore_vmalloc+0x187/0x2e0 kho_test_init+0x3c4/0xa30 do_one_initcall+0x62/0x2b0 kernel_init_freeable+0x25b/0x480 kernel_init+0x1a/0x1c0 ret_from_fork+0x2d1/0x360 Add missing clear_page_tag_ref() annotation in kho_restore_page() to fix this. Link: https://lkml.kernel.org/r/20260122132740.176468-1-ranxiaokai627@163.com Fixes: `fc33e4b44b` ("kexec: enable KHO support for memory preservation") Signed-off-by: Ran Xiaokai <ran.xiaokai@zte.com.cn> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-26 19:03:47 -08:00
Steven Rostedt	6bdf07302f	tracing: Disable trace_printk buffer on warning too When /proc/sys/kernel/traceoff_on_warning is set to 1, the top level tracing buffer is disabled when a warning happens. This is very useful when debugging and want the tracing buffer to stop taking new data when a warning triggers keeping the events that lead up to the warning from being overwritten. Now that there is also a persistent ring buffer and an option to have trace_printk go to that buffer, the same holds true for that buffer. A warning could happen just before a crash but still write enough events to lose the events that lead up to the first warning that was the reason for the crash. When /proc/sys/kernel/traceoff_on_warning is set to 1 and a warning is triggered, not only disable the top level tracing buffer, but also disable the buffer that trace_printk()s are written to. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Mark Rutland <mark.rutland@arm.com> Link: https://patch.msgid.link/20260121093858.5c5d7e7b@gandalf.local.home Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2026-01-26 17:45:17 -05:00
Guenter Roeck	a9e0c5897a	ftrace: Introduce and use ENTRIES_PER_PAGE_GROUP macro ENTRIES_PER_PAGE_GROUP() returns the number of dyn_ftrace entries in a page group, identified by its order. No functional change. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20260113152243.3557219-2-linux@roeck-us.net Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2026-01-26 17:45:12 -05:00
Steven Rostedt	2d8b7f9bf8	tracing: Have show_event_trigger/filter format a bit more in columns By doing: # trace-cmd sqlhist -e -n futex_wait select TIMESTAMP_DELTA_USECS as lat from sys_enter_futex as start join sys_exit_futex as end on start.common_pid = end.common_pid and # trace-cmd start -e futex_wait -f 'lat > 100' -e page_pool_state_release -f 'pfn == 1' The output of the show_event_trigger and show_event_filter files are well aligned because of the inconsistent 'tab' spacing: ~# cat /sys/kernel/tracing/show_event_triggers syscalls:sys_exit_futex hist:keys=common_pid:vals=hitcount:__lat_12046_2=common_timestamp.usecs-$__arg_12046_1:sort=hitcount:size=2048:clock=global:onmatch(syscalls.sys_enter_futex).trace(futex_wait,$__lat_12046_2) [active] syscalls:sys_enter_futex hist:keys=common_pid:vals=hitcount:__arg_12046_1=common_timestamp.usecs:sort=hitcount:size=2048:clock=global [active] ~# cat /sys/kernel/tracing/show_event_filters synthetic:futex_wait (lat > 100) page_pool:page_pool_state_release (pfn == 1) This makes it not so easy to read. Instead, force the spacing to be at least 32 bytes from the beginning (one space if the system:event is longer than 30 bytes): ~# cat /sys/kernel/tracing/show_event_triggers syscalls:sys_exit_futex hist:keys=common_pid:vals=hitcount:__lat_8125_2=common_timestamp.usecs-$__arg_8125_1:sort=hitcount:size=2048:clock=global:onmatch(syscalls.sys_enter_futex).trace(futex_wait,$__lat_8125_2) [active] syscalls:sys_enter_futex hist:keys=common_pid:vals=hitcount:__arg_8125_1=common_timestamp.usecs:sort=hitcount:size=2048:clock=global [active] ~# cat /sys/kernel/tracing/show_event_filters synthetic:futex_wait (lat > 100) page_pool:page_pool_state_release (pfn == 1) Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20260112153408.18373e73@gandalf.local.home Reviewed-by: Aaron Tomlin <atomlin@atomlin.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2026-01-26 17:45:06 -05:00
Petr Tesarik	8aa76aa415	ring-buffer: Use a housekeeping CPU to wake up waiters Avoid running the wakeup irq_work on an isolated CPU. Since the wakeup can run on any CPU, let's pick a housekeeping CPU to do the job. This change reduces additional noise when tracing isolated CPUs. For example, the following ipi_send_cpu stack trace was captured with nohz_full=2 on the isolated CPU: <idle>-0 [002] d.h4. 1255.379293: ipi_send_cpu: cpu=2 callsite=irq_work_queue+0x2d/0x50 callback=rb_wake_up_waiters+0x0/0x80 <idle>-0 [002] d.h4. 1255.379329: <stack trace> => trace_event_raw_event_ipi_send_cpu => __irq_work_queue_local => irq_work_queue => ring_buffer_unlock_commit => trace_buffer_unlock_commit_regs => trace_event_buffer_commit => trace_event_raw_event_x86_irq_vector => __sysvec_apic_timer_interrupt => sysvec_apic_timer_interrupt => asm_sysvec_apic_timer_interrupt => pv_native_safe_halt => default_idle => default_idle_call => do_idle => cpu_startup_entry => start_secondary => common_startup_64 The IRQ work interrupt alone adds considerable noise, but the impact can get even worse with PREEMPT_RT, because the IRQ work interrupt is then handled by a separate kernel thread. This requires a task switch and makes tracing useless for analyzing latency on an isolated CPU. After applying the patch, the trace is similar, but ipi_send_cpu always targets a non-isolated CPU. Unfortunately, irq_work_queue_on() is not NMI-safe. When running in NMI context, fall back to queuing the irq work on the local CPU. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Clark Williams <clrkwllms@kernel.org> Cc: Frederic Weisbecker <frederic@kernel.org> Link: https://patch.msgid.link/20260108132132.2473515-1-ptesarik@suse.com Signed-off-by: Petr Tesarik <ptesarik@suse.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2026-01-26 17:44:53 -05:00
Steven Rostedt	e4ef389e76	tracing: Check the return value of tracing_update_buffers() In the very unlikely event that tracing_update_buffers() fails in trace_printk_init_buffers(), report the failure so that it is known. Link: https://lore.kernel.org/all/20220917020353.3836285-1-floridsleeves@gmail.com/ Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20260107161510.4dc98b15@gandalf.local.home Suggested-by: Li Zhong <floridsleeves@gmail.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2026-01-26 17:44:40 -05:00
Aaron Tomlin	6a80838814	tracing: Add show_event_triggers to expose active event triggers To audit active event triggers, userspace currently must traverse the events/ directory and read each individual trigger file. This is cumbersome for system-wide auditing or debugging. Introduce "show_event_triggers" at the trace root directory. This file displays all events that currently have one or more triggers applied, alongside the trigger configuration, in a consolidated system:event [tab] trigger format. The implementation leverages the existing trace_event_file iterators and uses the trigger's own print() operation to ensure output consistency with the per-event trigger files. Link: https://patch.msgid.link/20260105142939.2655342-3-atomlin@atomlin.com Signed-off-by: Aaron Tomlin <atomlin@atomlin.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2026-01-26 17:44:24 -05:00
Aaron Tomlin	729757b96a	tracing: Add show_event_filters to expose active event filters Currently, to audit active Ftrace event filters, userspace must recursively traverse the events/ directory and read each individual filter file. This is inefficient for monitoring tools and debugging. Introduce "show_event_filters" at the trace root directory. This file displays all events that currently have a filter applied, alongside the actual filter string, in a consolidated system:event [tab] filter format. The implementation reuses the existing trace_event_file iterators to ensure atomic traversal of the event list and utilises guard(rcu)() for automatic, scope-based protection when accessing volatile filter strings. Link: https://patch.msgid.link/20260105142939.2655342-2-atomlin@atomlin.com Signed-off-by: Aaron Tomlin <atomlin@atomlin.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2026-01-26 17:44:15 -05:00
Marco Crivellari	e5136678b1	tracing: Replace use of system_wq with system_dfl_wq This patch continues the effort to refactor workqueue APIs, which has begun with the changes introducing new workqueues and a new alloc_workqueue flag: commit `128ea9f6cc` ("workqueue: Add system_percpu_wq and system_dfl_wq") commit `930c2ea566` ("workqueue: Add new WQ_PERCPU flag") The point of the refactoring is to eventually alter the default behavior of workqueues to become unbound by default so that their workload placement is optimized by the scheduler. Before that to happen after a careful review and conversion of each individual case, workqueue users must be converted to the better named new workqueues with no intended behaviour changes: system_wq -> system_percpu_wq system_unbound_wq -> system_dfl_wq This specific workflow has no benefits being per-cpu, so instead of system_percpu_wq the new unbound workqueue has been used (system_dfl_wq). This way the old obsolete workqueues (system_wq, system_unbound_wq) can be removed in the future. Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251230142820.173712-1-marco.crivellari@suse.com Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2026-01-26 17:44:05 -05:00
Aaron Tomlin	2cddfc2e8f	tracing: Add bitmask-list option for human-readable bitmask display Add support for displaying bitmasks in human-readable list format (e.g., 0,2-5,7) in addition to the default hexadecimal bitmap representation. This is particularly useful when tracing CPU masks and other large bitmasks where individual bit positions are more meaningful than their hexadecimal encoding. When the "bitmask-list" option is enabled, the printk "%*pbl" format specifier is used to render bitmasks as comma-separated ranges, making trace output easier to interpret for complex CPU configurations and large bitmask values. Link: https://patch.msgid.link/20251226160724.2246493-2-atomlin@atomlin.com Signed-off-by: Aaron Tomlin <atomlin@atomlin.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2026-01-26 17:00:50 -05:00
Steven Rostedt	a4e0ea0e10	tracing: Remove redundant call to event_trigger_reset_filter() in event_hist_trigger_parse() With the change to replace kfree() with trigger_data_free(), which starts out doing the exact same thing as event_trigger_reset_filter(), there's no reason to call event_trigger_reset_filter() before calling trigger_data_free(). Remove the call to it. Link: https://lore.kernel.org/linux-trace-kernel/20251211204520.0f3ba6d1@fedora/ Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Miaoqian Lin <linmq006@gmail.com> Link: https://patch.msgid.link/20260108174429.2d9ca51f@gandalf.local.home Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2026-01-26 17:00:50 -05:00
Miaoqian Lin	0550069cc2	tracing: Properly process error handling in event_hist_trigger_parse() Memory allocated with trigger_data_alloc() requires trigger_data_free() for proper cleanup. Replace kfree() with trigger_data_free() to fix this. Found via static analysis and code review. This isn't a real bug due to the current code basically being an open coded version of trigger_data_free() without the synchronization. The synchronization isn't needed as this is the error path of creation and there's nothing to synchronize against yet. Replace the kfree() to be consistent with the allocation. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Tom Zanussi <zanussi@kernel.org> Link: https://patch.msgid.link/20251211100058.2381268-1-linmq006@gmail.com Fixes: `e1f187d09e` ("tracing: Have existing event_command.parse() implementations use helpers") Signed-off-by: Miaoqian Lin <linmq006@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2026-01-26 17:00:50 -05:00
Menglong Dong	eeee4239db	bpf: support fsession for bpf_session_cookie Implement session cookie for fsession. The session cookies will be stored in the stack, and the layout of the stack will look like this: return value -> 8 bytes argN -> 8 bytes ... arg1 -> 8 bytes nr_args -> 8 bytes ip (optional) -> 8 bytes cookie2 -> 8 bytes cookie1 -> 8 bytes The offset of the cookie for the current bpf program, which is in 8-byte units, is stored in the "(((u64 )ctx)[-1] >> BPF_TRAMP_COOKIE_INDEX_SHIFT) & 0xFF". Therefore, we can get the session cookie with ((u64 )ctx)[-offset]. Implement and inline the bpf_session_cookie() for the fsession in the verifier. Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Link: https://lore.kernel.org/r/20260124062008.8657-6-dongml2@chinatelecom.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-24 18:49:36 -08:00
Menglong Dong	27d89baa6d	bpf: support fsession for bpf_session_is_return If fsession exists, we will use the bit (1 << BPF_TRAMP_IS_RETURN_SHIFT) in ((u64 )ctx)[-1] to store the "is_return" flag. The logic of bpf_session_is_return() for fsession is implemented in the verifier by inline following code: bool bpf_session_is_return(void ctx) { return (((u64 *)ctx)[-1] >> BPF_TRAMP_IS_RETURN_SHIFT) & 1; } Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Co-developed-by: Leon Hwang <leon.hwang@linux.dev> Signed-off-by: Leon Hwang <leon.hwang@linux.dev> Link: https://lore.kernel.org/r/20260124062008.8657-5-dongml2@chinatelecom.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-24 18:49:36 -08:00
Menglong Dong	8fe4dc4f64	bpf: change prototype of bpf_session_{cookie,is_return} Add the function argument of "void *ctx" to bpf_session_cookie() and bpf_session_is_return(), which is a preparation of the next patch. The two kfunc is seldom used now, so it will not introduce much effect to change their function prototype. Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20260124062008.8657-4-dongml2@chinatelecom.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-24 18:49:35 -08:00
Menglong Dong	f1b56b3cbd	bpf: use the least significant byte for the nr_args in trampoline For now, ((u64 )ctx)[-1] is used to store the nr_args in the trampoline. However, 1 byte is enough to store such information. Therefore, we use only the least significant byte of ((u64 )ctx)[-1] to store the nr_args, and reserve the rest for other usages. Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Link: https://lore.kernel.org/r/20260124062008.8657-3-dongml2@chinatelecom.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-24 18:49:35 -08:00
Menglong Dong	2d419c4465	bpf: add fsession support The fsession is something that similar to kprobe session. It allow to attach a single BPF program to both the entry and the exit of the target functions. Introduce the struct bpf_fsession_link, which allows to add the link to both the fentry and fexit progs_hlist of the trampoline. Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Co-developed-by: Leon Hwang <leon.hwang@linux.dev> Signed-off-by: Leon Hwang <leon.hwang@linux.dev> Link: https://lore.kernel.org/r/20260124062008.8657-2-dongml2@chinatelecom.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-24 18:49:35 -08:00
Linus Torvalds	b83a8ff87a	tracing fixes for v6.19: - Fix a crash with passing a stacktrace between synthetic events A synthetic event is an event that combines two events into a single event that can display fields from both events as well as the time delta that took place between the events. It can also pass a stacktrace from the first event so that it can be displayed by the synthetic event (this is useful to get a stacktrace of a task scheduling out when blocked and recording the time it was blocked for). A synthetic event can also connect an existing synthetic event to another event. An issue was found that if the first synthetic event had a stacktrace as one of its fields, and that stacktrace field was passed to the new synthetic event to be displayed, it would crash the kernel. This was due to the stacktrace not being saved as a stacktrace but was still marked as one. When the stacktrace was read, it would try to read an array but instead read the integer metadata of the stacktrace and dereferenced a bad value. Fix this by saving the stacktrace field as a stracktrace. - Fix possible overflow in cmp_mod_entry() compare function A binary search is used to find a module address and if the addresses are greater than 2GB apart it could lead to truncation and cause a bad search result. Use normal compares instead of a subtraction between addresses to calculate the compare value. - Fix output of entry arguments in function graph tracer Depending on the configurations enabled, the entry can be two different types that hold the argument array. The macro FGRAPH_ENTRY_ARGS() is used to find the correct arguments from the given type. One location was missed and still referenced the arguments directly via entry->args and could produce the wrong value depending on how the kernel was configured. - Fix memory leak in scripts/tracepoint-update build tool If the array fails to allocate, the memory for the values needs to be freed and was not. Free the allocated values if the array failed to allocate. -----BEGIN PGP SIGNATURE----- iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaXUQLxQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qgsJAQDgtWH9DWUkJKgzXTkiOA0l8JArPOVf tCSMla2wWJA70QD/as2ptacYAFU9v1oxO5YIgsKOLFBF68ZUIhJtvXpqtAE= =JeC6 -----END PGP SIGNATURE----- Merge tag 'trace-v6.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Fix a crash with passing a stacktrace between synthetic events A synthetic event is an event that combines two events into a single event that can display fields from both events as well as the time delta that took place between the events. It can also pass a stacktrace from the first event so that it can be displayed by the synthetic event (this is useful to get a stacktrace of a task scheduling out when blocked and recording the time it was blocked for). A synthetic event can also connect an existing synthetic event to another event. An issue was found that if the first synthetic event had a stacktrace as one of its fields, and that stacktrace field was passed to the new synthetic event to be displayed, it would crash the kernel. This was due to the stacktrace not being saved as a stacktrace but was still marked as one. When the stacktrace was read, it would try to read an array but instead read the integer metadata of the stacktrace and dereferenced a bad value. Fix this by saving the stacktrace field as a stacktrace. - Fix possible overflow in cmp_mod_entry() compare function A binary search is used to find a module address and if the addresses are greater than 2GB apart it could lead to truncation and cause a bad search result. Use normal compares instead of a subtraction between addresses to calculate the compare value. - Fix output of entry arguments in function graph tracer Depending on the configurations enabled, the entry can be two different types that hold the argument array. The macro FGRAPH_ENTRY_ARGS() is used to find the correct arguments from the given type. One location was missed and still referenced the arguments directly via entry->args and could produce the wrong value depending on how the kernel was configured. - Fix memory leak in scripts/tracepoint-update build tool If the array fails to allocate, the memory for the values needs to be freed and was not. Free the allocated values if the array failed to allocate. * tag 'trace-v6.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: scripts/tracepoint-update: Fix memory leak in add_string() on failure function_graph: Fix args pointer mismatch in print_graph_retval() tracing: Avoid possible signed 64-bit truncation tracing: Fix crash on synthetic stacktrace field usage	2026-01-24 17:18:57 -08:00
Linus Torvalds	12a0094839	Misc fixes: - Fix auxiliary timekeeper update & locking bug - Reduce the sensitivity of the clocksource watchdog, to fix false positive measurements that marked the TSC clocksource unstable. Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAml0ksERHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1gvtRAAvLWNMb9YPW/65Gn8dkyMQUPzXBjEaq8A yP4L1b8EujoM6fSeQC0Y367hpn1GKHhEGZyj9ksRcU4dsU5XWlzPZr9QXCETmMuh ffTCvrUGI6d95685S+R1VplmhoCQAkerQAFcPQDAQd0QgfoEJO+hf2AWHrilnicu gCcZGDE/+gLAPYjR7LaRu7vb0W6VtwqhXvz8xTCGALmMlU84BDT3deLzCmujxtbF PvNaShBAtppm468Ln6HY2mk4mN5kWthPnonNF4n0zVYy8uAHLEEUERr/LndZ60Ua KlFgKukfoPXyJoU0M0umNcX6oXaRw7DeyNcPtJovZUwtfXyjkTPWrcfZ4sD3r37K QWjFqmbTCtj70vlUFP2RiHusOmNkuzcWKww5KdpA+HoeXEI4zcjhZq7zObyjDPIZ t0Cs5sZoWWpL7o53ikMjsO2Fe/zSDRaocYyImCWh2U+DdBn3/fh8a0pboXQakujx kjmuDrHaLXFNMI9h7NvlP143IW8g7AHUpu0piDGLVFFkZoNcII/8g7qawemQw8T9 ZCUmL3oq1Zu0z3aGq9GRFz31ysVLXwDZdtY8CCuHxgVTuZQQnRNrLiNiTjZn75E/ PY63jtSgKNJsAOTHJZ5hnyvcGb8w05anU0T7M38kTJFtiX4R6JaaDJVmj3eFG3g8 es9cQ4gJGmo= =O1ly -----END PGP SIGNATURE----- Merge tag 'timers-urgent-2026-01-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fixes from Ingo Molnar: - Fix auxiliary timekeeper update & locking bug - Reduce the sensitivity of the clocksource watchdog, to fix false positive measurements that marked the TSC clocksource unstable * tag 'timers-urgent-2026-01-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: clocksource: Reduce watchdog readout delay limit to prevent false positives timekeeping: Adjust the leap state for the correct auxiliary timekeeper	2026-01-24 09:36:03 -08:00
Linus Torvalds	af5a3fae86	Miscellaneous scheduler fixes: - Fix PELT clock synchronization bug when entering idle - Disable the NEXT_BUDDY feature, as during extensive testing Mel found that the negatives outweigh the positives. - Make wakeup preemption less aggressive, which resulted in an unreasonable increase in preemption frequency. Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAml0kYYRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1j/Bw/+M5UK6EOPzLTs0Tj87X/kI7vXLN/uf9+K FpDtFNmoJnJKrQxPFfa8aT8OAz7zLfnjrGmRgxOG1fFkACMuB8s+TO4+i9lhzAsF gHOg6lDczi4K14mkTyiP/Bdaf3lZThghZitdRCBZFN4gjBpAh1rRI4ikQ0a3pwJH 6y1NF8fX/xRBm6Bgprgbv/em+fKsO89i6NwunPtYaxHsOGA5U0FVPFSa3yghz+X2 7Sl4U8LRdkU3Z62M+I8DKWuAMMb9nLwEXrSiNy+KJHIJ7FNFb1+U10gPrLq+z2Oi hvd25pzDzGUgOglZ7f1IRL5H48putMjCqv+A2wOZnzl7fHt4c02hpTBqMDle5hEP DYHyNNC1ZIRQ1zuy8j33LzB10ycP6pX9nawt+S1trEZcDVwaKwq9TbYsBjbJKtmZ V7C181fTUaNQ6M4wJY3gx+hD1ocrLv+Y0vhXJnK7sg+j4IhWJ7Yk6ne9hrpXbbUj cK8gzvo4OkShYtMlB9ut/RIiuyvFThyJd9rcNkeN4tTM9DVFURvOlIK2W6U70Yty OJDkhI6KBmkERPQoa43OsGvxZwF+g0KY8ISsLq3bJ2vLyATPdpJ2ns+zRg6I+B+q zHZmI86Fc8gr+O/y9oqi2VJhgio34RcoSId/QcAPC1Gl4TaIcmqFFTSzMhI+kLk9 Ng7vHy/3KQg= =pGnx -----END PGP SIGNATURE----- Merge tag 'sched-urgent-2026-01-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: - Fix PELT clock synchronization bug when entering idle - Disable the NEXT_BUDDY feature, as during extensive testing Mel found that the negatives outweigh the positives - Make wakeup preemption less aggressive, which resulted in an unreasonable increase in preemption frequency * tag 'sched-urgent-2026-01-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Revert force wakeup preemption sched/fair: Disable scheduler feature NEXT_BUDDY sched/fair: Fix pelt clock sync when entering idle	2026-01-24 09:29:41 -08:00
Linus Torvalds	ceaeaf66a2	Two perf events fixes: - Fix mmap_count warning & bug when creating a group member event with the PERF_FLAG_FD_OUTPUT flag. - Disable the sample period == 1 branch events BTS optimization on guests, because BTS is not virtualized. Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAml0kF8RHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1icZxAAhsYBY0uB3OzAgJYRIVlw/vzF7ubHh8+r PhQ+azap/uSk/dS8IXZ0FrP/vq9kgqEFpsYwWbB1EFRxB24nd8AAYarASyc22Dhb Qe6tKh4hzlDC9Cw0vJ111PgKJLiPDrmkxzS4C2HkPwYwriNql81hQjLW5nF6dHwg i8aP7+/uNB2h/xnPIRgkUFSUzBacRz1Snqgi/vSHmwEhk1GFI48rXoYywM9ItAnR CGN7tGxJ45YwDYeknZf9Ngsd3q+/eh38ihPfMEKoulf4oFvhIsiG4rJ1Vee0V9fD aD1NhukULtjw3lkKnMM75W+Jdb7fHDTOuxzUov9WpyIOIUTMqRkbvaKHgJzSD2SK 01TbP3kQixhGhiRXx78GDQwYGjX8JQxngbqyJvL708GNvxbcKYrLhqtr5Ho00pWH ERxx/ajoDXB7Neo7XPhgYRHk/lrlnZsK1LoidhMzN1UX6C12VnhPcD+zUSqRxz5w yFuJp0+7wF+G74FpO7kv8jv5KDB/lery5mdzWu0kqAKMfYWdw5eGoaI6T48DVoBy IwNYU8bxDBRPzu64XudDE4xBuTuy4HpJmbvOUkxMUkBGMTZ+nysqHu8D7cJJqYtL 2xYJOpR+P4Se9fRyR9xo5vtTZy27TQqTp+AjA5RlIoVOJhYK5T9mAFCJaEhdupbM 2IA4xdYhbb8= =lNSW -----END PGP SIGNATURE----- Merge tag 'perf-urgent-2026-01-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf events fixes from Ingo Molnar: - Fix mmap_count warning & bug when creating a group member event with the PERF_FLAG_FD_OUTPUT flag - Disable the sample period == 1 branch events BTS optimization on guests, because BTS is not virtualized * tag 'perf-urgent-2026-01-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/intel: Do not enable BTS for guests perf: Fix refcount warning on event->mmap_count increment	2026-01-24 09:24:17 -08:00
Dave Jiang	3f7938b1ae	Merge branch 'for-7.0/cxl-init' into cxl-for-next Merge in patches to support several patch series such as Soft Reserve handling, type2 accelerator enabling, and LSA 2.1 labeling support. Mainly addition of cxl_memdev_attach() to allow the memdev probe to make a decision of proceed/fail depending success of CXL topology enumeration. dax/hmem, e820, resource: Defer Soft Reserved insertion until hmem is ready cxl/mem: Introduce cxl_memdev_attach for CXL-dependent operation cxl/mem: Drop @host argument to devm_cxl_add_memdev() cxl/mem: Convert devm_cxl_add_memdev() to scope-based-cleanup cxl/port: Arrange for always synchronous endpoint attach cxl/mem: Arrange for always-synchronous memdev attach cxl/mem: Fix devm_cxl_memdev_edac_release() confusion	2026-01-23 14:13:16 -07:00
Boqun Feng	ed062c41df	Merge branch 'rcu-nocb.20260123a' * rcu-nocb.20260123a: rcu/nocb: Extract nocb_defer_wakeup_cancel() helper rcu/nocb: Remove dead callback overload handling rcu/nocb: Remove unnecessary WakeOvfIsDeferred wake path	2026-01-23 11:15:36 -08:00
Joel Fernandes	cc74050f13	rcu/nocb: Extract nocb_defer_wakeup_cancel() helper The pattern of checking nocb_defer_wakeup and deleting the timer is duplicated in __wake_nocb_gp() and nocb_gp_wait(). Extract this into a common helper function nocb_defer_wakeup_cancel(). This removes code duplication and makes it easier to maintain. Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2026-01-23 11:12:25 -08:00
Joel Fernandes	b11c1efa7f	rcu/nocb: Remove dead callback overload handling During callback overload (exceeding qhimark), the NOCB code attempts opportunistic advancement via rcu_advance_cbs_nowake(). Analysis shows this code path is practically unreachable and serves no useful purpose. Testing with 300,000 callback floods showed: - 30 overload conditions triggered - 0 advancements actually occurred While a theoretical window exists where this code could execute (e.g., vCPU preemption between gp_seq update and rcu_nocb_gp_cleanup()), even if it did, the advancement would be redundant. The rcuog kthread must still run to wake the rcuoc callback thread - we would just be duplicating work that rcuog will perform when it finally gets to run. Since this path provides no meaningful benefit and extensive testing confirms it is never useful, remove it entirely. Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2026-01-23 11:12:25 -08:00
Joel Fernandes	d92eca60fe	rcu/nocb: Remove unnecessary WakeOvfIsDeferred wake path The WakeOvfIsDeferred code path in __call_rcu_nocb_wake() attempts to wake rcuog when the callback count exceeds qhimark and callbacks aren't done with their GP (newly queued or awaiting GP). However, a lot of testing proves this wake is always redundant or useless. In the flooding case, rcuog is always waiting for a GP to finish. So waking up the rcuog thread is pointless. The timer wakeup adds overhead, rcuog simply wakes up and goes back to sleep achieving nothing. This path also adds a full memory barrier, and additional timer expiry modifications unnecessarily. The root cause is that WakeOvfIsDeferred fires when !rcu_segcblist_ready_cbs() (GP not complete), but waking rcuog cannot accelerate GP completion. This commit therefore removes this path. Tested with rcutorture scenarios: TREE01, TREE05, TREE08 (all NOCB configurations) - all pass. Also stress tested using a kernel module that floods call_rcu() to trigger the overload conditions and made the observations confirming the findings. Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2026-01-23 11:12:25 -08:00
Donglin Peng	c9703d17d2	function_graph: Fix args pointer mismatch in print_graph_retval() When funcgraph-args and funcgraph-retaddr are both enabled, many kernel functions display invalid parameters in trace logs. The issue occurs because print_graph_retval() passes a mismatched args pointer to print_function_args(). Fix this by retrieving the correct args pointer using the FGRAPH_ENTRY_ARGS() macro. Link: https://patch.msgid.link/20260112021601.1300479-1-dolinux.peng@gmail.com Fixes: `f83ac7544f` ("function_graph: Enable funcgraph-args and funcgraph-retaddr to work simultaneously") Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Donglin Peng <pengdonglin@xiaomi.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2026-01-23 13:34:38 -05:00
Ian Rogers	00f13e28a9	tracing: Avoid possible signed 64-bit truncation 64-bit truncation to 32-bit can result in the sign of the truncated value changing. The cmp_mod_entry is used in bsearch and so the truncation could result in an invalid search order. This would only happen were the addresses more than 2GB apart and so unlikely, but let's fix the potentially broken compare anyway. Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20260108002625.333331-1-irogers@google.com Signed-off-by: Ian Rogers <irogers@google.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2026-01-23 13:34:30 -05:00
Steven Rostedt	90f9f5d64c	tracing: Fix crash on synthetic stacktrace field usage When creating a synthetic event based on an existing synthetic event that had a stacktrace field and the new synthetic event used that field a kernel crash occurred: ~# cd /sys/kernel/tracing ~# echo 's:stack unsigned long stack[];' > dynamic_events ~# echo 'hist:keys=prev_pid:s0=common_stacktrace if prev_state & 3' >> events/sched/sched_switch/trigger ~# echo 'hist:keys=next_pid:s1=$s0:onmatch(sched.sched_switch).trace(stack,$s1)' >> events/sched/sched_switch/trigger The above creates a synthetic event that takes a stacktrace when a task schedules out in a non-running state and passes that stacktrace to the sched_switch event when that task schedules back in. It triggers the "stack" synthetic event that has a stacktrace as its field (called "stack"). ~# echo 's:syscall_stack s64 id; unsigned long stack[];' >> dynamic_events ~# echo 'hist:keys=common_pid:s2=stack' >> events/synthetic/stack/trigger ~# echo 'hist:keys=common_pid:s3=$s2,i0=id:onmatch(synthetic.stack).trace(syscall_stack,$i0,$s3)' >> events/raw_syscalls/sys_exit/trigger The above makes another synthetic event called "syscall_stack" that attaches the first synthetic event (stack) to the sys_exit trace event and records the stacktrace from the stack event with the id of the system call that is exiting. When enabling this event (or using it in a historgram): ~# echo 1 > events/synthetic/syscall_stack/enable Produces a kernel crash! BUG: unable to handle page fault for address: 0000000000400010 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: Oops: 0000 [#1] SMP PTI CPU: 6 UID: 0 PID: 1257 Comm: bash Not tainted 6.16.3+deb14-amd64 #1 PREEMPT(lazy) Debian 6.16.3-1 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.17.0-debian-1.17.0-1 04/01/2014 RIP: 0010:trace_event_raw_event_synth+0x90/0x380 Code: c5 00 00 00 00 85 d2 0f 84 e1 00 00 00 31 db eb 34 0f 1f 00 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 2e 0f 1f 84 00 00 00 00 00 <49> 8b 04 24 48 83 c3 01 8d 0c c5 08 00 00 00 01 cd 41 3b 5d 40 0f RSP: 0018:ffffd2670388f958 EFLAGS: 00010202 RAX: ffff8ba1065cc100 RBX: 0000000000000000 RCX: 0000000000000000 RDX: 0000000000000001 RSI: fffff266ffda7b90 RDI: ffffd2670388f9b0 RBP: 0000000000000010 R08: ffff8ba104e76000 R09: ffffd2670388fa50 R10: ffff8ba102dd42e0 R11: ffffffff9a908970 R12: 0000000000400010 R13: ffff8ba10a246400 R14: ffff8ba10a710220 R15: fffff266ffda7b90 FS: 00007fa3bc63f740(0000) GS:ffff8ba2e0f48000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000400010 CR3: 0000000107f9e003 CR4: 0000000000172ef0 Call Trace: <TASK> ? __tracing_map_insert+0x208/0x3a0 action_trace+0x67/0x70 event_hist_trigger+0x633/0x6d0 event_triggers_call+0x82/0x130 trace_event_buffer_commit+0x19d/0x250 trace_event_raw_event_sys_exit+0x62/0xb0 syscall_exit_work+0x9d/0x140 do_syscall_64+0x20a/0x2f0 ? trace_event_raw_event_sched_switch+0x12b/0x170 ? save_fpregs_to_fpstate+0x3e/0x90 ? _raw_spin_unlock+0xe/0x30 ? finish_task_switch.isra.0+0x97/0x2c0 ? __rseq_handle_notify_resume+0xad/0x4c0 ? __schedule+0x4b8/0xd00 ? restore_fpregs_from_fpstate+0x3c/0x90 ? switch_fpu_return+0x5b/0xe0 ? do_syscall_64+0x1ef/0x2f0 ? do_fault+0x2e9/0x540 ? __handle_mm_fault+0x7d1/0xf70 ? count_memcg_events+0x167/0x1d0 ? handle_mm_fault+0x1d7/0x2e0 ? do_user_addr_fault+0x2c3/0x7f0 entry_SYSCALL_64_after_hwframe+0x76/0x7e The reason is that the stacktrace field is not labeled as such, and is treated as a normal field and not as a dynamic event that it is. In trace_event_raw_event_synth() the event is field is still treated as a dynamic array, but the retrieval of the data is considered a normal field, and the reference is just the meta data: // Meta data is retrieved instead of a dynamic array str_val = (char )(long)var_ref_vals[val_idx]; // Then when it tries to process it: len = ((unsigned long *)str_val) + 1; It triggers a kernel page fault. To fix this, first when defining the fields of the first synthetic event, set the filter type to FILTER_STACKTRACE. This is used later by the second synthetic event to know that this field is a stacktrace. When creating the field of the new synthetic event, have it use this FILTER_STACKTRACE to know to create a stacktrace field to copy the stacktrace into. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Tom Zanussi <zanussi@kernel.org> Link: https://patch.msgid.link/20260122194824.6905a38e@gandalf.local.home Fixes: `00cf3d672a` ("tracing: Allow synthetic events to pass around stacktraces") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2026-01-23 13:34:21 -05:00
Kumar Kartikeya Dwivedi	82f3b142c9	rqspinlock: Fix TAS fallback lock entry creation The TAS fallback can be invoked directly when queued spin locks are disabled, and through the slow path when paravirt is enabled for queued spin locks. In the latter case, the res_spin_lock macro will attempt the fast path and already hold the entry when entering the slow path. This will lead to creation of extraneous entries that are not released, which may cause false positives for deadlock detection. Fix this by always preceding invocation of the TAS fallback in every case with the grabbing of the held lock entry, and add a comment to make note of this. Fixes: `c9102a68c0` ("rqspinlock: Add a test-and-set fallback") Reported-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Tested-by: Amery Hung <ameryhung@gmail.com> Link: https://lore.kernel.org/r/20260122115911.3668985-1-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-23 10:03:49 -08:00
Vincent Guittot	15257cc2f9	sched/fair: Revert force wakeup preemption This agressively bypasses run_to_parity and slice protection with the assumpiton that this is what waker wants but there is no garantee that the wakee will be the next to run. It is a better choice to use yield_to_task or WF_SYNC in such case. This increases the number of resched and preemption because a task becomes quickly "ineligible" when it runs; We update the task vruntime periodically and before the task exhausted its slice or at least quantum. Example: 2 tasks A and B wake up simultaneously with lag = 0. Both are eligible. Task A runs 1st and wakes up task C. Scheduler updates task A's vruntime which becomes greater than average runtime as all others have a lag == 0 and didn't run yet. Now task A is ineligible because it received more runtime than the other task but it has not yet exhausted its slice nor a min quantum. We force preemption, disable protection but Task B will run 1st not task C. Sidenote, DELAY_ZERO increases this effect by clearing positive lag at wake up. Fixes: `e837456fdc` ("sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals") Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260123102858.52428-1-vincent.guittot@linaro.org	2026-01-23 11:53:20 +01:00
Mel Gorman	4f70f106bc	sched/fair: Disable scheduler feature NEXT_BUDDY NEXT_BUDDY was disabled with the introduction of EEVDF and enabled again after NEXT_BUDDY was rewritten for EEVDF by commit `e837456fdc` ("sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals"). It was not expected that this would be a universal win without a crystal ball instruction but the reported regressions are a concern [1][2] even if gains were also reported. Specifically; o mysql with client/server running on different servers regresses o specjbb reports lower peak metrics o daytrader regresses The mysql is realistic and a concern. It needs to be confirmed if specjbb is simply shifting the point where peak performance is measured but still a concern. daytrader is considered to be representative of a real workload. Access to test machines is currently problematic for verifying any fix to this problem. Disable NEXT_BUDDY for now by default until the root causes are addressed. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com> Link: https://lore.kernel.org/lkml/4b96909a-f1ac-49eb-b814-97b8adda6229@arm.com [1] Link: https://lore.kernel.org/lkml/ec3ea66f-3a0d-4b5a-ab36-ce778f159b5b@linux.ibm.com [2] Link: https://patch.msgid.link/fyqsk63pkoxpeaclyqsm5nwtz3dyejplr7rg6p74xwemfzdzuu@7m7xhs5aqpqw	2026-01-23 11:53:19 +01:00
Thomas Weißschuh	c1b12cd933	padata: Constify padata_sysfs_entry structs These structs are never modified. To prevent malicious or accidental modifications due to bugs, mark them as const. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2026-01-23 13:48:44 +08:00
Ard Biesheuvel	a081b57892	kallsyms: Get rid of kallsyms relative base When the kallsyms relative base was introduced, per-CPU variable references on x86_64 SMP were implemented as offsets into the respective per-CPU region, rather than offsets relative to the location of the variable's template in the kernel image, which is how other architectures implement it. This required kallsyms to reason about the difference between the two, and the sign of the value in the kallsyms_offsets[] array was used to distinguish them. This meant that negative offsets were not permitted for ordinary variables, and so it was crucial that the relative base was chosen such that all offsets were positive numbers. This is no longer needed: instead, the offsets can simply be encoded as values in the range -/+ 2 GiB, which is precisely what PC32 relocations provide on most architectures. So it is possible to simplify the logic, and just use _text as the anchor directly, and let the linker calculate the final value based on the location of the entry itself. Some architectures (nios2, extensa) do not support place-relative relocations at all, but these are all 32-bit and non-relocatable, and so there is no need for place-relative relocations in the first place, and the actual symbol values can just be stored directly. This makes all entries in the kallsyms_offsets[] array visible as place-relative references in the ELF metadata, which will be important when implementing ELF-based fg-kaslr. Reviewed-by: Kees Cook <kees@kernel.org> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Link: https://patch.msgid.link/20260116093359.2442297-6-ardb+git@google.com Signed-off-by: Nathan Chancellor <nathan@kernel.org>	2026-01-22 15:58:22 -07:00
Frederic Weisbecker	de715325cc	cpu: Revert "cpu/hotplug: Prevent self deadlock on CPU hot-unplug" 1) The commit: `2b8272ff4a` ("cpu/hotplug: Prevent self deadlock on CPU hot-unplug") was added to fix an issue where the hotplug control task (BP) was throttled between CPUHP_AP_IDLE_DEAD and CPUHP_HRTIMERS_PREPARE waiting in the hrtimer blindspot for the bandwidth callback queued in the dead CPU. 2) Later on, the commit: `38685e2a04` ("cpu/hotplug: Don't offline the last non-isolated CPU") plugged on the target selection for the workqueue offloaded CPU down process to prevent from destroying the last CPU domain. 3) Finally: `5c0930ccaa` ("hrtimers: Push pending hrtimers away from outgoing CPU earlier") removed entirely the conditions for the race exposed and partially fixed in 1). The offloading of the CPU down process to a workqueue on another CPU then becomes unnecessary. But the last CPU belonging to scheduler domains must still remain online. Therefore revert the now obsolete commit `2b8272ff4a` and move the housekeeping check under the cpu_hotplug_lock write held. Since HK_TYPE_DOMAIN will include both isolcpus and cpuset isolated partition, the hotplug lock will synchronize against concurrent cpuset partition updates. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Marco Crivellari <marco.crivellari@suse.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman Long <longman@redhat.com>	2026-01-22 18:32:41 +01:00
Shubhang Kaushik	4b603f1551	sched: Update rq->avg_idle when a task is moved to an idle CPU Currently, rq->idle_stamp is only used to calculate avg_idle during wakeups. This means other paths that move a task to an idle CPU such as fork/clone, execve, or migrations, do not end the CPU's idle status in the scheduler's eyes, leading to an inaccurate avg_idle. This patch introduces update_rq_avg_idle() to provide a more accurate measurement of CPU idle duration. By invoking this helper in put_prev_task_idle(), we ensure avg_idle is updated whenever a CPU stops being idle, regardless of how the new task arrived. Testing on an 80-core Ampere Altra (ARMv8) with 6.19-rc5 baseline: - Hackbench : +7.2% performance gain at 16 threads. - Schbench: Reduced p99.9 tail latencies at high concurrency. Signed-off-by: Shubhang Kaushik <shubhang@os.amperecomputing.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-by: Shubhang Kaushik <shubhang@os.amperecomputing.com> Link: https://patch.msgid.link/20260121-v8-patch-series-v8-1-b7f1cbee5055@os.amperecomputing.com	2026-01-22 11:11:21 +01:00
Thomas Gleixner	5d6446f409	hrtimer: Fix trace oddity It turns out that __run_hrtimer() will trace like: <idle>-0 [032] d.h2. 20705.474563: hrtimer_cancel: hrtimer=0xff2db8f77f8226e8 <idle>-0 [032] d.h1. 20705.474563: hrtimer_expire_entry: hrtimer=0xff2db8f77f8226e8 now=20699452001850 function=tick_nohz_handler/0x0 Which is a bit nonsensical, the timer doesn't get canceled on expiration. The cause is the use of the incorrect debug helper. Fixes: `c6a2a17702` ("hrtimer: Add tracepoint for hrtimers") Reported-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260121143208.219595606@infradead.org	2026-01-22 11:11:20 +01:00
Peter Zijlstra	21c0e92d06	rseq: Lower default slice extension Change the minimum slice extension to 5 usec. Since slice_test selftest reaches a staggering ~350 nsec extension: Task: slice_test Mean: 350.266 ns Latency (us) \| Count ------------------------------ EXPIRED \| 238 0 us \| 143189 1 us \| 167 2 us \| 26 3 us \| 11 4 us \| 28 5 us \| 31 6 us \| 22 7 us \| 23 8 us \| 32 9 us \| 16 10 us \| 35 Lower the minimal (and default) value to 5 usecs -- which is still massive. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260121143208.073200729@infradead.org	2026-01-22 11:11:20 +01:00
Peter Zijlstra	e1d7f54900	rseq: Move slice_ext_nsec to debugfs Move changing the slice ext duration to debugfs, a sliglty less permanent interface. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260121143207.923520192@infradead.org	2026-01-22 11:11:20 +01:00
Peter Zijlstra	d6200245c7	rseq: Allow registering RSEQ with slice extension Since glibc cares about the number of syscalls required to initialize a new thread, allow initializing rseq with slice extension on. This avoids having to do another prctl(). Requested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260121143207.814193010@infradead.org	2026-01-22 11:11:19 +01:00
Thomas Gleixner	3c78aaec19	entry: Hook up rseq time slice extension Wire the grant decision function up in exit_to_user_mode_loop() Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251215155709.258157362@linutronix.de	2026-01-22 11:11:19 +01:00
Thomas Gleixner	0ac3b5c3dc	rseq: Implement time slice extension enforcement timer If a time slice extension is granted and the reschedule delayed, the kernel has to ensure that user space cannot abuse the extension and exceed the maximum granted time. It was suggested to implement this via the existing hrtick() timer in the scheduler, but that turned out to be problematic for several reasons: 1) It creates a dependency on CONFIG_SCHED_HRTICK, which can be disabled independently of CONFIG_HIGHRES_TIMERS 2) HRTICK usage in the scheduler can be runtime disabled or is only used for certain aspects of scheduling. 3) The function is calling into the scheduler code and that might have unexpected consequences when this is invoked due to a time slice enforcement expiry. Especially when the task managed to clear the grant via sched_yield(0). It would be possible to address #2 and #3 by storing state in the scheduler, but that is extra complexity and fragility for no value. Implement a dedicated per CPU hrtimer instead, which is solely used for the purpose of time slice enforcement. The timer is armed when an extension was granted right before actually returning to user mode in rseq_exit_to_user_mode_restart(). It is disarmed, when the task relinquishes the CPU. This is expensive as the timer is probably the first expiring timer on the CPU, which means it has to reprogram the hardware. But that's less expensive than going through a full hrtimer interrupt cycle for nothing. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251215155709.068329497@linutronix.de	2026-01-22 11:11:18 +01:00
Thomas Gleixner	dd0a046069	rseq: Implement syscall entry work for time slice extensions The kernel sets SYSCALL_WORK_RSEQ_SLICE when it grants a time slice extension. This allows to handle the rseq_slice_yield() syscall, which is used by user space to relinquish the CPU after finishing the critical section for which it requested an extension. In case the kernel state is still GRANTED, the kernel resets both kernel and user space state with a set of sanity checks. If the kernel state is already cleared, then this raced against the timer or some other interrupt and just clears the work bit. Doing it in syscall entry work allows to catch misbehaving user space, which issues an arbitrary syscall, i.e. not rseq_slice_yield(), from the critical section. Contrary to the initial strict requirement to use rseq_slice_yield() arbitrary syscalls are not considered a violation of the ABI contract anymore to allow onion architecture applications, which cannot control the code inside a critical section, to utilize this as well. If the code detects inconsistent user space that result in a SIGSEGV for the application. If the grant was still active and the task was not preempted yet, the work code reschedules immediately before continuing through the syscall. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251215155709.005777059@linutronix.de	2026-01-22 11:11:18 +01:00
Thomas Gleixner	99d2592023	rseq: Implement sys_rseq_slice_yield() Provide a new syscall which has the only purpose to yield the CPU after the kernel granted a time slice extension. sched_yield() is not suitable for that because it unconditionally schedules, but the end of the time slice extension is not required to schedule when the task was already preempted. This also allows to have a strict check for termination to catch user space invoking random syscalls including sched_yield() from a time slice extension region. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Link: https://patch.msgid.link/20251215155708.929634896@linutronix.de	2026-01-22 11:11:17 +01:00
Thomas Gleixner	28621ec2d4	rseq: Add prctl() to enable time slice extensions Implement a prctl() so that tasks can enable the time slice extension mechanism. This fails, when time slice extensions are disabled at compile time or on the kernel command line and when no rseq pointer is registered in the kernel. That allows to implement a single trivial check in the exit to user mode hotpath, to decide whether the whole mechanism needs to be invoked. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251215155708.858717691@linutronix.de	2026-01-22 11:11:17 +01:00
Thomas Gleixner	b5b8282441	rseq: Add statistics for time slice extensions Extend the quick statistics with time slice specific fields. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251215155708.795202254@linutronix.de	2026-01-22 11:11:17 +01:00
Thomas Gleixner	f8380f9768	rseq: Provide static branch for time slice extensions Guard the time slice extension functionality with a static key, which can be disabled on the kernel command line. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251215155708.733429292@linutronix.de	2026-01-22 11:11:16 +01:00
Thomas Gleixner	d7a5da7a0f	rseq: Add fields and constants for time slice extension Aside of a Kconfig knob add the following items: - Two flag bits for the rseq user space ABI, which allow user space to query the availability and enablement without a syscall. - A new member to the user space ABI struct rseq, which is going to be used to communicate request and grant between kernel and user space. - A rseq state struct to hold the kernel state of this - Documentation of the new mechanism Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251215155708.669472597@linutronix.de	2026-01-22 11:11:16 +01:00
Fushuai Wang	4fe82cf302	sched/debug: Convert copy_from_user() + kstrtouint() to kstrtouint_from_user() Using kstrtouint_from_user() instead of copy_from_user() + kstrtouint() makes the code simpler and less error-prone. Suggested-by: Yury Norov <ynorov@nvidia.com> Signed-off-by: Fushuai Wang <wangfushuai@baidu.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Yury Norov <ynorov@nvidia.com> Link: https://patch.msgid.link/20260117145615.53455-2-fushuai.wang@linux.dev	2026-01-22 11:11:16 +01:00
Yuzuki Ishiyama	1dc6696467	bpf: add bpf_strncasecmp kfunc bpf_strncasecmp() function performs same like bpf_strcasecmp() except limiting the comparison to a specific length. Signed-off-by: Yuzuki Ishiyama <ishiyama@hpc.is.uec.ac.jp> Acked-by: Viktor Malik <vmalik@redhat.com> Acked-by: Mykyta Yatsenko <mykyta.yatsenko5@gmail.com> Link: https://lore.kernel.org/r/20260121033328.1850010-2-ishiyama@hpc.is.uec.ac.jp Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-21 09:42:53 -08:00
Menglong Dong	85c7f91471	bpf: support bpf_get_func_arg() for BPF_TRACE_RAW_TP For now, bpf_get_func_arg() and bpf_get_func_arg_cnt() is not supported by the BPF_TRACE_RAW_TP, which is not convenient to get the argument of the tracepoint, especially for the case that the position of the arguments in a tracepoint can change. The target tracepoint BTF type id is specified during loading time, therefore we can get the function argument count from the function prototype instead of the stack. Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Acked-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20260121044348.113201-2-dongml2@chinatelecom.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-21 09:31:35 -08:00
Vincent Guittot	98c88dc8a1	sched/fair: Fix pelt clock sync when entering idle Samuel and Alex reported regressions of the util_avg of RT rq with commit `17e3e88ed0` ("sched/fair: Fix pelt lost idle time detection"). It happens that fair is updating and syncing the pelt clock with task one when pick_next_task_fair() fails to pick a task but before the prev scheduling class got a chance to update its pelt signals. Move update_idle_rq_clock_pelt() in set_next_task_idle() which is called after prev class has been called. Fixes: `17e3e88ed0` ("sched/fair: Fix pelt lost idle time detection") Closes: https://lore.kernel.org/all/CAG2KctpO6VKS6GN4QWDji0t92_gNBJ7HjjXrE+6H+RwRXt=iLg@mail.gmail.com/ Closes: https://lore.kernel.org/all/8cf19bf0e0054dcfed70e9935029201694f1bb5a.camel@mediatek.com/ Reported-by: Samuel Wu <wusamuel@google.com> Reported-by: Alex Hoh <Alex.Hoh@mediatek.com> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Samuel Wu <wusamuel@google.com> Tested-by: Alex Hoh <Alex.Hoh@mediatek.com> Link: https://patch.msgid.link/20260121163317.505635-1-vincent.guittot@linaro.org	2026-01-21 17:46:08 +01:00
Will Rosenberg	d06bf78e55	perf: Fix refcount warning on event->mmap_count increment When calling refcount_inc(&event->mmap_count) inside perf_mmap_rb(), the following warning is triggered: refcount_t: addition on 0; use-after-free. WARNING: lib/refcount.c:25 PoC: struct perf_event_attr attr = {0}; int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0); mmap(NULL, 0x3000, PROT_READ \| PROT_WRITE, MAP_SHARED, fd, 0); int victim = syscall(__NR_perf_event_open, &attr, 0, -1, fd, PERF_FLAG_FD_OUTPUT); mmap(NULL, 0x3000, PROT_READ \| PROT_WRITE, MAP_SHARED, victim, 0); This occurs when creating a group member event with the flag PERF_FLAG_FD_OUTPUT. The group leader should be mmap-ed and then mmap-ing the event triggers the warning. Since the event has copied the output_event in perf_event_set_output(), event->rb is set. As a result, perf_mmap_rb() calls refcount_inc(&event->mmap_count) when event->mmap_count = 0. Disallow the case when event->mmap_count = 0. This also prevents two events from updating the same user_page. Fixes: `448f97fba9` ("perf: Convert mmap() refcounts to refcount_t") Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Will Rosenberg <whrosenb@asu.edu> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260119184956.801238-1-whrosenb@asu.edu	2026-01-21 16:28:58 +01:00
Thomas Gleixner	c06343be0b	clocksource: Reduce watchdog readout delay limit to prevent false positives The "valid" readout delay between the two reads of the watchdog is larger than the valid delta between the resulting watchdog and clocksource intervals, which results in false positive watchdog results. Assume TSC is the clocksource and HPET is the watchdog and both have a uncertainty margin of 250us (default). The watchdog readout does: 1) wdnow = read(HPET); 2) csnow = read(TSC); 3) wdend = read(HPET); The valid window for the delta between #1 and #3 is calculated by the uncertainty margins of the watchdog and the clocksource: m = 2 * watchdog.uncertainty_margin + cs.uncertainty margin; which results in 750us for the TSC/HPET case. The actual interval comparison uses a smaller margin: m = watchdog.uncertainty_margin + cs.uncertainty margin; which results in 500us for the TSC/HPET case. That means the following scenario will trigger the watchdog: Watchdog cycle N: 1) wdnow[N] = read(HPET); 2) csnow[N] = read(TSC); 3) wdend[N] = read(HPET); Assume the delay between #1 and #2 is 100us and the delay between #1 and Watchdog cycle N + 1: 4) wdnow[N + 1] = read(HPET); 5) csnow[N + 1] = read(TSC); 6) wdend[N + 1] = read(HPET); If the delay between #4 and #6 is within the 750us margin then any delay between #4 and #5 which is larger than 600us will fail the interval check and mark the TSC unstable because the intervals are calculated against the previous value: wd_int = wdnow[N + 1] - wdnow[N]; cs_int = csnow[N + 1] - csnow[N]; Putting the above delays in place this results in: cs_int = (wdnow[N + 1] + 610us) - (wdnow[N] + 100us); -> cs_int = wd_int + 510us; which is obviously larger than the allowed 500us margin and results in marking TSC unstable. Fix this by using the same margin as the interval comparison. If the delay between two watchdog reads is larger than that, then the readout was either disturbed by interconnect congestion, NMIs or SMIs. Fixes: `4ac1dd3245` ("clocksource: Set cs_watchdog_read() checks based on .uncertainty_margin") Reported-by: Daniel J Blueman <daniel@quora.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Tested-by: Paul E. McKenney <paulmck@kernel.org> Link: https://lore.kernel.org/lkml/20250602223251.496591-1-daniel@quora.org/ Link: https://patch.msgid.link/87bjjxc9dq.ffs@tglx	2026-01-21 11:33:11 +01:00
Menglong Dong	eaedea154e	bpf, x86: inline bpf_get_current_task() for x86_64 Inline bpf_get_current_task() and bpf_get_current_task_btf() for x86_64 to obtain better performance. Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260120070555.233486-2-dongml2@chinatelecom.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-20 20:39:01 -08:00
Minu Jin	f34e19c34e	fork-comment-fix: remove ambiguous question mark in CLONE_CHILD_CLEARTID comment The current comment "Clear TID on mm_release()?" ends with a question mark, implying uncertainty about whether the TID is actually cleared in mm_release(). However, the code flow is deterministic. When a task exits, mm_release() explicitly checks 'tsk->clear_child_tid' and clears. Since this behavior is unambiguous, remove the confusing question mark and rephrase the comment to clearly state that TID is cleared in mm_release(). Link: https://lkml.kernel.org/r/20251125000407.24470-1-s9430939@naver.com Signed-off-by: Minu Jin <s9430939@naver.com> Cc: Ben Segall <bsegall@google.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Kees Cook <kees@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mel Gorman <mgorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-20 19:44:23 -08:00
Petr Mladek	3b07086444	kallsyms: prevent module removal when printing module name and buildid kallsyms_lookup_buildid() copies the symbol name into the given buffer so that it can be safely read anytime later. But it just copies pointers to mod->name and mod->build_id which might get reused after the related struct module gets removed. The lifetime of struct module is synchronized using RCU. Take the rcu read lock for the entire __sprint_symbol(). Link: https://lkml.kernel.org/r/20251128135920.217303-8-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Aaron Tomlin <atomlin@atomlin.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkman <daniel@iogearbox.net> Cc: Daniel Gomez <da.gomez@samsung.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Kees Cook <kees@kernel.org> Cc: Luis Chamberalin <mcgrof@kernel.org> Cc: Marc Rutland <mark.rutland@arm.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Petr Pavlu <petr.pavlu@suse.com> Cc: Sami Tolvanen <samitolvanen@google.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-20 19:44:23 -08:00
Petr Mladek	e8a1e7eaa1	kallsyms/ftrace: set module buildid in ftrace_mod_address_lookup() __sprint_symbol() might access an invalid pointer when kallsyms_lookup_buildid() returns a symbol found by ftrace_mod_address_lookup(). The ftrace lookup function must set both @modname and @modbuildid the same way as module_address_lookup(). Link: https://lkml.kernel.org/r/20251128135920.217303-7-pmladek@suse.com Fixes: `9294523e37` ("module: add printk formats to add module build ID to stacktraces") Signed-off-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Aaron Tomlin <atomlin@atomlin.com> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkman <daniel@iogearbox.net> Cc: Daniel Gomez <da.gomez@samsung.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Kees Cook <kees@kernel.org> Cc: Luis Chamberalin <mcgrof@kernel.org> Cc: Marc Rutland <mark.rutland@arm.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Petr Pavlu <petr.pavlu@suse.com> Cc: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-20 19:44:22 -08:00
Petr Mladek	cd6735896d	kallsyms/bpf: rename __bpf_address_lookup() to bpf_address_lookup() bpf_address_lookup() has been used only in kallsyms_lookup_buildid(). It was supposed to set @modname and @modbuildid when the symbol was in a module. But it always just cleared @modname because BPF symbols were never in a module. And it did not clear @modbuildid because the pointer was not passed. The wrapper is no longer needed. Both @modname and @modbuildid are now always initialized to NULL in kallsyms_lookup_buildid(). Remove the wrapper and rename __bpf_address_lookup() to bpf_address_lookup() because this variant is used everywhere. [akpm@linux-foundation.org: fix loongarch] Link: https://lkml.kernel.org/r/20251128135920.217303-6-pmladek@suse.com Fixes: `9294523e37` ("module: add printk formats to add module build ID to stacktraces") Signed-off-by: Petr Mladek <pmladek@suse.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Cc: Aaron Tomlin <atomlin@atomlin.com> Cc: Daniel Borkman <daniel@iogearbox.net> Cc: Daniel Gomez <da.gomez@samsung.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Kees Cook <kees@kernel.org> Cc: Luis Chamberalin <mcgrof@kernel.org> Cc: Marc Rutland <mark.rutland@arm.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Petr Pavlu <petr.pavlu@suse.com> Cc: Sami Tolvanen <samitolvanen@google.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-20 19:44:22 -08:00
Petr Mladek	8e81dac4cd	kallsyms: cleanup code for appending the module buildid Put the code for appending the optional "buildid" into a helper function, It makes __sprint_symbol() better readable. Also print a warning when the "modname" is set and the "buildid" isn't. It might catch a situation when some lookup function in kallsyms_lookup_buildid() does not handle the "buildid". Use pr_*_once() to avoid an infinite recursion when the function is called from printk(). The recursion is rather theoretical but better be on the safe side. Link: https://lkml.kernel.org/r/20251128135920.217303-5-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Cc: Aaron Tomlin <atomlin@atomlin.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkman <daniel@iogearbox.net> Cc: Daniel Gomez <da.gomez@samsung.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Kees Cook <kees@kernel.org> Cc: Luis Chamberalin <mcgrof@kernel.org> Cc: Marc Rutland <mark.rutland@arm.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Petr Pavlu <petr.pavlu@suse.com> Cc: Sami Tolvanen <samitolvanen@google.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-20 19:44:22 -08:00
Petr Mladek	acfdbb4ab2	module: add helper function for reading module_buildid() Add a helper function for reading the optional "build_id" member of struct module. It is going to be used also in ftrace_mod_address_lookup(). Use "#ifdef" instead of "#if IS_ENABLED()" to match the declaration of the optional field in struct module. Link: https://lkml.kernel.org/r/20251128135920.217303-4-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Daniel Gomez <da.gomez@samsung.com> Reviewed-by: Petr Pavlu <petr.pavlu@suse.com> Cc: Aaron Tomlin <atomlin@atomlin.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkman <daniel@iogearbox.net> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Kees Cook <kees@kernel.org> Cc: Luis Chamberalin <mcgrof@kernel.org> Cc: Marc Rutland <mark.rutland@arm.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Sami Tolvanen <samitolvanen@google.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-20 19:44:22 -08:00
Petr Mladek	fda024fb64	kallsyms: clean up modname and modbuildid initialization in kallsyms_lookup_buildid() The @modname and @modbuildid optional return parameters are set only when the symbol is in a module. Always initialize them so that they do not need to be cleared when the module is not in a module. It simplifies the logic and makes the code even slightly more safe. Note that bpf_address_lookup() function will get updated in a separate patch. Link: https://lkml.kernel.org/r/20251128135920.217303-3-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Cc: Aaron Tomlin <atomlin@atomlin.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkman <daniel@iogearbox.net> Cc: Daniel Gomez <da.gomez@samsung.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Kees Cook <kees@kernel.org> Cc: Luis Chamberalin <mcgrof@kernel.org> Cc: Marc Rutland <mark.rutland@arm.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Petr Pavlu <petr.pavlu@suse.com> Cc: Sami Tolvanen <samitolvanen@google.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-20 19:44:21 -08:00
Petr Mladek	426295ef18	kallsyms: clean up @namebuf initialization in kallsyms_lookup_buildid() Patch series "kallsyms: Prevent invalid access when showing module buildid", v3. We have seen nested crashes in __sprint_symbol(), see below. They seem to be caused by an invalid pointer to "buildid". This patchset cleans up kallsyms code related to module buildid and fixes this invalid access when printing backtraces. I made an audit of __sprint_symbol() and found several situations when the buildid might be wrong: + bpf_address_lookup() does not set @modbuildid + ftrace_mod_address_lookup() does not set @modbuildid + __sprint_symbol() does not take rcu_read_lock and the related struct module might get removed before mod->build_id is printed. This patchset solves these problems: + 1st, 2nd patches are preparatory + 3rd, 4th, 6th patches fix the above problems + 5th patch cleans up a suspicious initialization code. This is the backtrace, we have seen. But it is not really important. The problems fixed by the patchset are obvious: crash64> bt [62/2029] PID: 136151 TASK: ffff9f6c981d4000 CPU: 367 COMMAND: "btrfs" #0 [ffffbdb687635c28] machine_kexec at ffffffffb4c845b3 #1 [ffffbdb687635c80] __crash_kexec at ffffffffb4d86a6a #2 [ffffbdb687635d08] hex_string at ffffffffb51b3b61 #3 [ffffbdb687635d40] crash_kexec at ffffffffb4d87964 #4 [ffffbdb687635d50] oops_end at ffffffffb4c41fc8 #5 [ffffbdb687635d70] do_trap at ffffffffb4c3e49a #6 [ffffbdb687635db8] do_error_trap at ffffffffb4c3e6a4 #7 [ffffbdb687635df8] exc_stack_segment at ffffffffb5666b33 #8 [ffffbdb687635e20] asm_exc_stack_segment at ffffffffb5800cf9 ... This patch (of 7) The function kallsyms_lookup_buildid() initializes the given @namebuf by clearing the first and the last byte. It is not clear why. The 1st byte makes sense because some callers ignore the return code and expect that the buffer contains a valid string, for example: - function_stat_show() - kallsyms_lookup() - kallsyms_lookup_buildid() The initialization of the last byte does not make much sense because it can later be overwritten. Fortunately, it seems that all called functions behave correctly: - kallsyms_expand_symbol() explicitly adds the trailing '\0' at the end of the function. - All *__address_lookup() functions either use the safe strscpy() or they do not touch the buffer at all. Document the reason for clearing the first byte. And remove the useless initialization of the last byte. Link: https://lkml.kernel.org/r/20251128135920.217303-2-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Aaron Tomlin <atomlin@atomlin.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkman <daniel@iogearbox.net> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Kees Cook <kees@kernel.org> Cc: Luis Chamberalin <mcgrof@kernel.org> Cc: Marc Rutland <mark.rutland@arm.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Petr Pavlu <petr.pavlu@suse.com> Cc: Sami Tolvanen <samitolvanen@google.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Daniel Gomez <da.gomez@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-20 19:44:21 -08:00
Li RongQing	e700f5d156	watchdog: softlockup: panic when lockup duration exceeds N thresholds The softlockup_panic sysctl is currently a binary option: panic immediately or never panic on soft lockups. Panicking on any soft lockup, regardless of duration, can be overly aggressive for brief stalls that may be caused by legitimate operations. Conversely, never panicking may allow severe system hangs to persist undetected. Extend softlockup_panic to accept an integer threshold, allowing the kernel to panic only when the normalized lockup duration exceeds N watchdog threshold periods. This provides finer-grained control to distinguish between transient delays and persistent system failures. The accepted values are: - 0: Don't panic (unchanged) - 1: Panic when duration >= 1 * threshold (20s default, original behavior) - N > 1: Panic when duration >= N * threshold (e.g., 2 = 40s, 3 = 60s.) The original behavior is preserved for values 0 and 1, maintaining full backward compatibility while allowing systems to tolerate brief lockups while still catching severe, persistent hangs. [lirongqing@baidu.com: v2] Link: https://lkml.kernel.org/r/20251218074300.4080-1-lirongqing@baidu.com Link: https://lkml.kernel.org/r/20251216074521.2796-1-lirongqing@baidu.com Signed-off-by: Li RongQing <lirongqing@baidu.com> Cc: Eduard Zingerman <eddyz87@gmail.com> Cc: Hao Luo <haoluo@google.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: John Fastabend <john.fastabend@gmail.com> Cc: KP Singh <kpsingh@kernel.org> Cc: Lance Yang <lance.yang@linux.dev> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Song Liu <song@kernel.org> Cc: Stanislav Fomichev <sdf@fomichev.me> Cc: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-20 19:44:20 -08:00
Pnina Feder	b5bfcc1ffe	kernel/crash: handle multi-page vmcoreinfo in crash kernel copy kimage_crash_copy_vmcoreinfo() currently assumes vmcoreinfo fits in a single page. This breaks if VMCOREINFO_BYTES exceeds PAGE_SIZE. Allocate the required order of control pages and vmap all pages needed to safely copy vmcoreinfo into the crash kernel image. Link: https://lkml.kernel.org/r/20251216132801.807260-3-pnina.feder@mobileye.com Signed-off-by: Pnina Feder <pnina.feder@mobileye.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Baoquan He <bhe@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-20 19:44:20 -08:00
Pnina Feder	76103d1b26	kernel: vmcoreinfo: allocate vmcoreinfo_data based on VMCOREINFO_BYTES Patch series "vmcoreinfo: support VMCOREINFO_BYTES larger than PAGE_SIZE". VMCOREINFO_BYTES is defined as a configurable size, but multiple code paths implicitly assume it always fits into a single page. This series removes that assumption by allocating and mapping vmcoreinfo based on its actual size. Patch 1 updates vmcoreinfo allocation to use get_order(VMCOREINFO_BYTES). Patch 2 updates crash kernel handling to correctly allocate and map multiple pages when copying vmcoreinfo. This makes vmcoreinfo size consistent across the kernel and avoids future breakage if VMCOREINFO_BYTES grows. (No functional change when VMCOREINFO_BYTES == PAGE_SIZE.) This patch (of 2): VMCOREINFO_BYTES defines the size of vmcoreinfo data, but the current implementation assumes a single page allocation. Allocate vmcoreinfo_data using get_order(VMCOREINFO_BYTES) so that vmcoreinfo can safely grow beyond PAGE_SIZE. This avoids hidden assumptions and keeps vmcoreinfo size consistent across the kernel. Link: https://lkml.kernel.org/r/20251216132801.807260-1-pnina.feder@mobileye.com Link: https://lkml.kernel.org/r/20251216132801.807260-2-pnina.feder@mobileye.com Signed-off-by: Pnina Feder <pnina.feder@mobileye.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Baoquan He <bhe@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-20 19:44:20 -08:00
Alejandro Colomar	a9e5620c9a	kernel: fix off-by-one benign bugs We were wasting a byte due to an off-by-one bug. s[c]nprintf() doesn't write more than $2 bytes including the null byte, so trying to pass 'size-1' there is wasting one byte. This is essentially the same as the previous commit, in a different file. Link: https://lkml.kernel.org/r/b4a945a4d40b7104364244f616eb9fb9f1fa691f.1765449750.git.alx@kernel.org Signed-off-by: Alejandro Colomar <alx@kernel.org> Cc: Marco Elver <elver@google.com> Cc: Kees Cook <kees@kernel.org> Cc: Christopher Bazley <chris.bazley.wg14@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Jann Horn <jannh@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Marco Elver <elver@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Maciej W. Rozycki <macro@orcam.me.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-20 19:44:19 -08:00
Randy Dunlap	24c776355f	kernel.h: drop hex.h and update all hex.h users Remove <linux/hex.h> from <linux/kernel.h> and update all users/callers of hex.h interfaces to directly #include <linux/hex.h> as part of the process of putting kernel.h on a diet. Removing hex.h from kernel.h means that 36K C source files don't have to pay the price of parsing hex.h for the roughly 120 C source files that need it. This change has been build-tested with allmodconfig on most ARCHes. Also, all users/callers of <linux/hex.h> in the entire source tree have been updated if needed (if not already #included). Link: https://lkml.kernel.org/r/20251215005206.2362276-1-rdunlap@infradead.org Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Andy Shevchenko <andriy.shevchenko@intel.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Yury Norov (NVIDIA) <yury.norov@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-20 19:44:19 -08:00
Christophe JAILLET	b11052be3e	crash_dump: constify struct configfs_item_operations and configfs_group_operations 'struct configfs_item_operations' and 'configfs_group_operations' are not modified in this driver. Constifying these structures moves some data to a read-only section, so increases overall security, especially when the structure holds some function pointers. On a x86_64, with allmodconfig, as an example: Before: ====== text data bss dec hex filename 16339 11001 384 27724 6c4c kernel/crash_dump_dm_crypt.o After: ===== text data bss dec hex filename 16499 10841 384 27724 6c4c kernel/crash_dump_dm_crypt.o Link: https://lkml.kernel.org/r/d046ee5666d2f6b1a48ca1a222dfbd2f7c44462f.1765735035.git.christophe.jaillet@wanadoo.fr Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Coiby Xu <coxu@redhat.com> Tested-by: Coiby Xu <coxu@redhat.com> Cc: Baoquan He <bhe@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-20 19:44:15 -08:00
Mykyta Yatsenko	83c9030cdc	bpf: Simplify bpf_timer_cancel() Remove lock from the bpf_timer_cancel() helper. The lock does not protect from concurrent modification of the bpf_async_cb data fields as those are modified in the callback without locking. Use guard(rcu)() instead of pair of explicit lock()/unlock(). Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Link: https://lore.kernel.org/r/20260120-timer_nolock-v6-4-670ffdd787b4@meta.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-20 18:12:19 -08:00
Mykyta Yatsenko	8bb1e32b3f	bpf: Introduce lock-free bpf_async_update_prog_callback() Introduce bpf_async_update_prog_callback(): lock-free update of cb->prog and cb->callback_fn. This function allows updating prog and callback_fn fields of the struct bpf_async_cb without holding lock. For now use it under the lock from __bpf_async_set_callback(), in the next patches that lock will be removed. Lock-free algorithm: * Acquire a guard reference on prog to prevent it from being freed during the retry loop. * Retry loop: 1. Each iteration acquires a new prog reference and stores it in cb->prog via xchg. The previous prog is released. 2. The loop condition checks if both cb->prog and cb->callback_fn match what we just wrote. If either differs, a concurrent writer overwrote our value, and we must retry. 3. When we retry, our previously-stored prog was already released by the concurrent writer or will be released by us after overwriting. * Release guard reference. Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Link: https://lore.kernel.org/r/20260120-timer_nolock-v6-3-670ffdd787b4@meta.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-20 18:12:19 -08:00
Mykyta Yatsenko	57d31e72db	bpf: Remove unnecessary arguments from bpf_async_set_callback() Remove unused arguments from __bpf_async_set_callback(). Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Link: https://lore.kernel.org/r/20260120-timer_nolock-v6-2-670ffdd787b4@meta.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-20 18:12:19 -08:00
Mykyta Yatsenko	c1f2c449de	bpf: Factor out timer deletion helper Move the timer deletion logic into a dedicated bpf_timer_delete() helper so it can be reused by later patches. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Link: https://lore.kernel.org/r/20260120-timer_nolock-v6-1-670ffdd787b4@meta.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-20 18:12:19 -08:00
Zesen Liu	ed4724212f	bpf: Require ARG_PTR_TO_MEM with memory flag Add check to ensure that ARG_PTR_TO_MEM is used with either MEM_WRITE or MEM_RDONLY. Using ARG_PTR_TO_MEM alone without flags does not make sense because: - If the helper does not change the argument, missing MEM_RDONLY causes the verifier to incorrectly reject a read-only buffer. - If the helper does change the argument, missing MEM_WRITE causes the verifier to incorrectly assume the memory is unchanged, leading to errors in code optimization. Co-developed-by: Shuran Liu <electronlsr@gmail.com> Signed-off-by: Shuran Liu <electronlsr@gmail.com> Co-developed-by: Peili Gao <gplhust955@gmail.com> Signed-off-by: Peili Gao <gplhust955@gmail.com> Co-developed-by: Haoran Ni <haoran.ni.cs@gmail.com> Signed-off-by: Haoran Ni <haoran.ni.cs@gmail.com> Signed-off-by: Zesen Liu <ftyghome@gmail.com> Reviewed-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260120-helper_proto-v3-2-27b0180b4e77@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-20 16:59:25 -08:00
Zesen Liu	802eef5afb	bpf: Fix memory access flags in helper prototypes After commit `37cce22dbd` ("bpf: verifier: Refactor helper access type tracking"), the verifier started relying on the access type flags in helper function prototypes to perform memory access optimizations. Currently, several helper functions utilizing ARG_PTR_TO_MEM lack the corresponding MEM_RDONLY or MEM_WRITE flags. This omission causes the verifier to incorrectly assume that the buffer contents are unchanged across the helper call. Consequently, the verifier may optimize away subsequent reads based on this wrong assumption, leading to correctness issues. For bpf_get_stack_proto_raw_tp, the original MEM_RDONLY was incorrect since the helper writes to the buffer. Change it to ARG_PTR_TO_UNINIT_MEM which correctly indicates write access to potentially uninitialized memory. Similar issues were recently addressed for specific helpers in commit `ac44dcc788` ("bpf: Fix verifier assumptions of bpf_d_path's output buffer") and commit `2eb7648558` ("bpf: Specify access type of bpf_sysctl_get_name args"). Fix these prototypes by adding the correct memory access flags. Fixes: `37cce22dbd` ("bpf: verifier: Refactor helper access type tracking") Co-developed-by: Shuran Liu <electronlsr@gmail.com> Signed-off-by: Shuran Liu <electronlsr@gmail.com> Co-developed-by: Peili Gao <gplhust955@gmail.com> Signed-off-by: Peili Gao <gplhust955@gmail.com> Co-developed-by: Haoran Ni <haoran.ni.cs@gmail.com> Signed-off-by: Haoran Ni <haoran.ni.cs@gmail.com> Signed-off-by: Zesen Liu <ftyghome@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260120-helper_proto-v3-1-27b0180b4e77@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-20 16:59:25 -08:00
Yazhou Tang	44fdd581d2	bpf: Add range tracking for BPF_DIV and BPF_MOD This patch implements range tracking (interval analysis) for BPF_DIV and BPF_MOD operations when the divisor is a constant, covering both signed and unsigned variants. While LLVM typically optimizes integer division and modulo by constants into multiplication and shift sequences, this optimization is less effective for the BPF target when dealing with 64-bit arithmetic. Currently, the verifier does not track bounds for scalar division or modulo, treating the result as "unbounded". This leads to false positive rejections for safe code patterns. For example, the following code (compiled with -O2): ```c int test(struct pt_regs ctx) { char buffer[6] = {1}; __u64 x = bpf_ktime_get_ns(); __u64 res = x % sizeof(buffer); char value = buffer[res]; bpf_printk("res = %llu, val = %d", res, value); return 0; } ``` Generates a raw `BPF_MOD64` instruction: ```asm ; __u64 res = x % sizeof(buffer); 1: 97 00 00 00 06 00 00 00 r0 %= 0x6 ; char value = buffer[res]; 2: 18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r1 = 0x0 ll 4: 0f 01 00 00 00 00 00 00 r1 += r0 5: 91 14 00 00 00 00 00 00 r4 = (s8 *)(r1 + 0x0) ``` Without this patch, the verifier fails with "math between map_value pointer and register with unbounded min value is not allowed" because it cannot deduce that `r0` is within [0, 5]. According to the BPF instruction set[1], the instruction's offset field (`insn->off`) is used to distinguish between signed (`off == 1`) and unsigned division (`off == 0`). Moreover, we also follow the BPF division and modulo runtime behavior (semantics) to handle special cases, such as division by zero and signed division overflow. - UDIV: dst = (src != 0) ? (dst / src) : 0 - SDIV: dst = (src == 0) ? 0 : ((src == -1 && dst == LLONG_MIN) ? LLONG_MIN : (dst / src)) - UMOD: dst = (src != 0) ? (dst % src) : dst - SMOD: dst = (src == 0) ? dst : ((src == -1 && dst == LLONG_MIN) ? 0: (dst s% src)) Here is the overview of the changes made in this patch (See the code comments for more details and examples): 1. For BPF_DIV: Firstly check whether the divisor is zero. If so, set the destination register to zero (matching runtime behavior). For non-zero constant divisors: goto `scalar(32)?_min_max_(u\|s)div` functions. - General cases: compute the new range by dividing max_dividend and min_dividend by the constant divisor. - Overflow case (SIGNED_MIN / -1) in signed division: mark the result as unbounded if the dividend is not a single number. 2. For BPF_MOD: Firstly check whether the divisor is zero. If so, leave the destination register unchanged (matching runtime behavior). For non-zero constant divisors: goto `scalar(32)?_min_max_(u\|s)mod` functions. - General case: For signed modulo, the result's sign matches the dividend's sign. And the result's absolute value is strictly bounded by `min(abs(dividend), abs(divisor) - 1)`. - Special care is taken when the divisor is SIGNED_MIN. By casting to unsigned before negation and subtracting 1, we avoid signed overflow and correctly calculate the maximum possible magnitude (`res_max_abs` in the code). - "Small dividend" case: If the dividend is already within the possible result range (e.g., [-2, 5] % 10), the operation is an identity function, and the destination register remains unchanged. 3. In `scalar(32)?_min_max_(u\|s)(div\|mod)` functions: After updating current range, reset other ranges and tnum to unbounded/unknown. e.g., in `scalar_min_max_sdiv`, signed 64-bit range is updated. Then reset unsigned 64-bit range and 32-bit range to unbounded, and tnum to unknown. Exception: in BPF_MOD's "small dividend" case, since the result remains unchanged, we do not reset other ranges/tnum. 4. Also updated existing selftests based on the expected BPF_DIV and BPF_MOD behavior. [1] https://www.kernel.org/doc/Documentation/bpf/standardization/instruction-set.rst Co-developed-by: Shenghao Yuan <shenghaoyuan0928@163.com> Signed-off-by: Shenghao Yuan <shenghaoyuan0928@163.com> Co-developed-by: Tianci Cao <ziye@zju.edu.cn> Signed-off-by: Tianci Cao <ziye@zju.edu.cn> Signed-off-by: Yazhou Tang <tangyazhou518@outlook.com> Tested-by: syzbot@syzkaller.appspotmail.com Link: https://lore.kernel.org/r/20260119085458.182221-2-tangyazhou@zju.edu.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-20 16:41:53 -08:00
Ihor Solodrai	aed57a3638	bpf: Remove __prog kfunc arg annotation Now that all the __prog suffix users in the kernel tree migrated to KF_IMPLICIT_ARGS, remove it from the verifier. See prior discussion for context [1]. [1] https://lore.kernel.org/bpf/CAEf4BzbgPfRm9BX=TsZm-TsHFAHcwhPY4vTt=9OT-uhWqf8tqw@mail.gmail.com/ Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev> Link: https://lore.kernel.org/r/20260120222638.3976562-13-ihor.solodrai@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-20 16:22:38 -08:00
Ihor Solodrai	d806f31012	bpf: Migrate bpf_stream_vprintk() to KF_IMPLICIT_ARGS Implement bpf_stream_vprintk with an implicit bpf_prog_aux argument, and remote bpf_stream_vprintk_impl from the kernel. Update the selftests to use the new API with implicit argument. bpf_stream_vprintk macro is changed to use the new bpf_stream_vprintk kfunc, and the extern definition of bpf_stream_vprintk_impl is replaced accordingly. Reviewed-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev> Link: https://lore.kernel.org/r/20260120222638.3976562-11-ihor.solodrai@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-20 16:22:38 -08:00
Ihor Solodrai	6e663ffdf7	bpf: Migrate bpf_task_work_schedule_* kfuncs to KF_IMPLICIT_ARGS Implement bpf_task_work_schedule_* with an implicit bpf_prog_aux argument, and remove corresponding _impl funcs from the kernel. Update special kfunc checks in the verifier accordingly. Update the selftests to use the new API with implicit argument. Reviewed-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev> Link: https://lore.kernel.org/r/20260120222638.3976562-10-ihor.solodrai@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-20 16:22:20 -08:00
Ihor Solodrai	b97931a25a	bpf: Migrate bpf_wq_set_callback_impl() to KF_IMPLICIT_ARGS Implement bpf_wq_set_callback() with an implicit bpf_prog_aux argument, and remove bpf_wq_set_callback_impl(). Update special kfunc checks in the verifier accordingly. Reviewed-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev> Link: https://lore.kernel.org/r/20260120222638.3976562-8-ihor.solodrai@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-20 16:15:57 -08:00
Ihor Solodrai	64e1360524	bpf: Verifier support for KF_IMPLICIT_ARGS A kernel function bpf_foo marked with KF_IMPLICIT_ARGS flag is expected to have two associated types in BTF: * `bpf_foo` with a function prototype that omits implicit arguments * `bpf_foo_impl` with a function prototype that matches the kernel declaration of `bpf_foo`, but doesn't have a ksym associated with its name In order to support kfuncs with implicit arguments, the verifier has to know how to resolve a call of `bpf_foo` to the correct BTF function prototype and address. To implement this, in add_kfunc_call() kfunc flags are checked for KF_IMPLICIT_ARGS. For such kfuncs a BTF func prototype is adjusted to the one found for `bpf_foo_impl` (func_name + "_impl" suffix, by convention) function in BTF. This effectively changes the signature of the `bpf_foo` kfunc in the context of verification: from one without implicit args to the one with full argument list. The values of implicit arguments by design are provided by the verifier, and so they can only be of particular types. In this patch the only allowed implicit arg type is a pointer to struct bpf_prog_aux. In order for the verifier to correctly set an implicit bpf_prog_aux arg value at runtime, is_kfunc_arg_prog() is extended to check for the arg type. At a point when prog arg is determined in check_kfunc_args() the kfunc with implicit args already has a prototype with full argument list, so the existing value patch mechanism just works. If a new kfunc with KF_IMPLICIT_ARG is declared for an existing kfunc that uses a __prog argument (a legacy case), the prototype substitution works in exactly the same way, assuming the kfunc follows the _impl naming convention. The difference is only in how _impl prototype is added to the BTF, which is not the verifier's concern. See a subsequent resolve_btfids patch for details. __prog suffix is still supported at this point, but will be removed in a subsequent patch, after current users are moved to KF_IMPLICIT_ARGS. Introduction of KF_IMPLICIT_ARGS revealed an issue with zero-extension tracking, because an explicit rX = 0 in place of the verifier-supplied argument is now absent if the arg is implicit (the BPF prog doesn't pass a dummy NULL anymore). To mitigate this, reset the subreg_def of all caller saved registers in check_kfunc_call() [1]. [1] https://lore.kernel.org/bpf/b4a760ef828d40dac7ea6074d39452bb0dc82caa.camel@gmail.com/ Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev> Link: https://lore.kernel.org/r/20260120222638.3976562-4-ihor.solodrai@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-20 16:15:56 -08:00
Ihor Solodrai	08ca87d632	bpf: Introduce struct bpf_kfunc_meta There is code duplication between add_kfunc_call() and fetch_kfunc_meta() collecting information about a kfunc from BTF. Introduce struct bpf_kfunc_meta to hold common kfunc BTF data and implement fetch_kfunc_meta() to fill it in, instead of struct bpf_kfunc_call_arg_meta directly. Then use these in add_kfunc_call() and (new) fetch_kfunc_arg_meta() functions, and fixup previous usages of fetch_kfunc_meta() to fetch_kfunc_arg_meta(). Besides the code dedup, this change enables add_kfunc_call() to access kfunc->flags. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev> Link: https://lore.kernel.org/r/20260120222638.3976562-3-ihor.solodrai@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-20 16:15:56 -08:00
Ihor Solodrai	ea073d1818	bpf: Refactor btf_kfunc_id_set_contains btf_kfunc_id_set_contains() is called by fetch_kfunc_meta() in the BPF verifier to get the kfunc flags stored in the .BTF_ids ELF section. If it returns NULL instead of a valid pointer, it's interpreted as an illegal kfunc usage failing the verification. There are two potential reasons for btf_kfunc_id_set_contains() to return NULL: 1. Provided kfunc BTF id is not present in relevant kfunc id sets. 2. The kfunc is not allowed, as determined by the program type specific filter [1]. The filter functions accept a pointer to `struct bpf_prog`, so they might implicitly depend on earlier stages of verification, when bpf_prog members are set. For example, bpf_qdisc_kfunc_filter() in linux/net/sched/bpf_qdisc.c inspects prog->aux->st_ops [2], which is initialized in: check_attach_btf_id() -> check_struct_ops_btf_id() So far this hasn't been an issue, because fetch_kfunc_meta() is the only caller of btf_kfunc_id_set_contains(). However in subsequent patches of this series it is necessary to inspect kfunc flags earlier in BPF verifier, in the add_kfunc_call(). To resolve this, refactor btf_kfunc_id_set_contains() into two interface functions: * btf_kfunc_flags() that simply returns pointer to kfunc_flags without applying the filters * btf_kfunc_is_allowed() that both checks for kfunc_flags existence (which is a requirement for a kfunc to be allowed) and applies the prog filters See [3] for the previous version of this patch. [1] https://lore.kernel.org/all/20230519225157.760788-7-aditi.ghag@isovalent.com/ [2] https://lore.kernel.org/all/20250409214606.2000194-4-ameryhung@gmail.com/ [3] https://lore.kernel.org/bpf/20251029190113.3323406-3-ihor.solodrai@linux.dev/ Reviewed-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev> Link: https://lore.kernel.org/r/20260120222638.3976562-2-ihor.solodrai@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-20 16:15:56 -08:00
Linus Torvalds	c25f2fb1f4	17 hotfixes. 12 are cc:stable, 16 are for MM. - A 4 patch series from David Hildenbrand which fixes a few things realted to hugetlb PMD sharing - The remainder are singletons, please see their changelogs for details. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaW/vEQAKCRDdBJ7gKXxA jtMcAQCK1tKnINRmVjY3UJCqZMAaXvOdOoUIgHDaTXD/DWKm9AD9HRwWzYB4+TNr k/Te8F33d418LcMBTW9CLhrplQpaIAI= =+d1A -----END PGP SIGNATURE----- Merge tag 'mm-hotfixes-stable-2026-01-20-13-09' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: - A patch series from David Hildenbrand which fixes a few things related to hugetlb PMD sharing - The remainder are singletons, please see their changelogs for details * tag 'mm-hotfixes-stable-2026-01-20-13-09' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm: restore per-memcg proactive reclaim with !CONFIG_NUMA mm/kfence: fix potential deadlock in reboot notifier Docs/mm/allocation-profiling: describe sysctrl limitations in debug mode mm: do not copy page tables unnecessarily for VM_UFFD_WP mm/hugetlb: fix excessive IPI broadcasts when unsharing PMD tables using mmu_gather mm/rmap: fix two comments related to huge_pmd_unshare() mm/hugetlb: fix two comments related to huge_pmd_unshare() mm/hugetlb: fix hugetlb_pmd_shared() mm: remove unnecessary and incorrect mmap lock assert x86/kfence: avoid writing L1TF-vulnerable PTEs mm/vma: do not leak memory when .mmap_prepare swaps the file migrate: correct lock ordering for hugetlb file folios panic: only warn about deprecated panic_print on write access fs/writeback: skip AS_NO_DATA_INTEGRITY mappings in wait_sb_inodes() mm: take into account mm_cid size for mm_struct static definitions mm: rename cpu_bitmap field to flexible_array mm: add missing static initializer for init_mm::mm_cid.lock	2026-01-20 13:32:16 -08:00
Qiliang Yuan	f81c07a6e9	bpf/verifier: Optimize ID mapping reset in states_equal Currently, reset_idmap_scratch() performs a 4.7KB memset() in every states_equal() call. Optimize this by using a counter to track used ID mappings, replacing the O(N) memset() with an O(1) reset and bounding the search loop in check_ids(). Signed-off-by: Qiliang Yuan <realwujing@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20260120023234.77673-1-realwujing@gmail.com	2026-01-20 11:32:28 -08:00
Daniel Borkmann	713edc7144	bpf: Remove leftover accounting in htab_map_mem_usage after rqspinlock After commit `4fa8d68aa5` ("bpf: Convert hashtab.c to rqspinlock") we no longer use HASHTAB_MAP_LOCK_{COUNT,MASK} as the per-CPU map_locked[HASHTAB_MAP_LOCK_COUNT] array got removed from struct bpf_htab. Right now it is still accounted for in htab_map_mem_usage. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/09703eb6bb249f12b1d5253b5a50a0c4fa239d27.1768913513.git.daniel@iogearbox.net	2026-01-20 11:28:02 -08:00
Puranjay Mohan	ef7d4e42d1	bpf: verifier: Make sync_linked_regs() scratch registers sync_linked_regs() is called after a conditional jump to propagate new bounds of a register to all its liked registers. But the verifier log only prints the state of the register that is part of the conditional jump. Make sync_linked_regs() scratch the registers whose bounds have been updated by propagation from a known register. Before: 0: (85) call bpf_get_prandom_u32#7 ; R0=scalar() 1: (57) r0 &= 255 ; R0=scalar(smin=smin32=0,smax=umax=smax32=umax32=255,var_off=(0x0; 0xff)) 2: (bf) r1 = r0 ; R0=scalar(id=1,smin=smin32=0,smax=umax=smax32=umax32=255,var_off=(0x0; 0xff)) R1=scalar(id=1,smin=smin32=0,smax=umax=smax32=umax32=255,var_off=(0x0; 0xff)) 3: (07) r1 += 4 ; R1=scalar(id=1+4,smin=umin=smin32=umin32=4,smax=umax=smax32=umax32=259,var_off=(0x0; 0x1ff)) 4: (a5) if r1 < 0xa goto pc+2 ; R1=scalar(id=1+4,smin=umin=smin32=umin32=10,smax=umax=smax32=umax32=259,var_off=(0x0; 0x1ff)) 5: (35) if r0 >= 0x6 goto pc+1 After: 0: (85) call bpf_get_prandom_u32#7 ; R0=scalar() 1: (57) r0 &= 255 ; R0=scalar(smin=smin32=0,smax=umax=smax32=umax32=255,var_off=(0x0; 0xff)) 2: (bf) r1 = r0 ; R0=scalar(id=1,smin=smin32=0,smax=umax=smax32=umax32=255,var_off=(0x0; 0xff)) R1=scalar(id=1,smin=smin32=0,smax=umax=smax32=umax32=255,var_off=(0x0; 0xff)) 3: (07) r1 += 4 ; R1=scalar(id=1+4,smin=umin=smin32=umin32=4,smax=umax=smax32=umax32=259,var_off=(0x0; 0x1ff)) 4: (a5) if r1 < 0xa goto pc+2 ; R0=scalar(id=1+0,smin=umin=smin32=umin32=6,smax=umax=smax32=umax32=255) R1=scalar(id=1+4,smin=umin=smin32=umin32=10,smax=umax=smax32=umax32=259,var_off=(0x0; 0x1ff)) 5: (35) if r0 >= 0x6 goto pc+1 The conditional jump in 4 updates the bound of R1 and the new bounds are propogated to R0 as it is linked with the same id, before this change, verifier only printed the state for R1 but after it prints for both R0 and R1. Suggested-by: Andrii Nakryiko <andrii.nakryiko@gmail.com> Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/bpf/20260116141436.3715322-1-puranjay@kernel.org	2026-01-20 11:24:41 -08:00
Linus Torvalds	c03e9c42ae	dma-mapping fixes for Linux 6.19 - minor fixes for the corner cases of the SWIOTLB pool management (Robin Murphy) -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQSrngzkoBtlA8uaaJ+Jp1EFxbsSRAUCaW+NbgAKCRCJp1EFxbsS RKU3AQCIpNwIYN6evmnTXO/Y+AFRjsrb1pE1cUyHn5QTY4/o4wEA6BiSgMCrZEbv HTEs1ZSHgLOwBwZre1Z11icGg3BkRgY= =6fBC -----END PGP SIGNATURE----- Merge tag 'dma-mapping-6.19-2026-01-20' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux Pull dma-mapping fixes from Marek Szyprowski: - minor fixes for the corner cases of the SWIOTLB pool management (Robin Murphy) * tag 'dma-mapping-6.19-2026-01-20' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux: dma/pool: Avoid allocating redundant pools mm_zone: Generalise has_managed_dma() dma/pool: Improve pool lookup	2026-01-20 10:16:18 -08:00
hongao	f76d1c41b6	kprobes: retry blocked optprobe in do_free_cleaned_kprobes Once the aggrprobe is fully reverted in do_free_cleaned_kprobes(), retry optimize_kprobe() on that sibling so it can return to OPTIMIZED. Also remove the stale comment in __disarm_kprobe(). Link: https://lore.kernel.org/all/349359900266B25F+20260115023804.3951960-2-hongao@uniontech.com/ Signed-off-by: hongao <hongao@uniontech.com> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>	2026-01-20 23:53:07 +09:00
Thomas Weißschuh	e806f7dde8	timekeeping: Adjust the leap state for the correct auxiliary timekeeper When __do_ajdtimex() was introduced to handle adjtimex for any timekeeper, this reference to tk_core was not updated. When called on an auxiliary timekeeper, the core timekeeper would be updated incorrectly. This gets caught by the lock debugging diagnostics because the timekeepers sequence lock gets written to without holding its associated spinlock: WARNING: include/linux/seqlock.h:226 at __do_adjtimex+0x394/0x3b0, CPU#2: test/125 aux_clock_adj (kernel/time/timekeeping.c:2979) __do_sys_clock_adjtime (kernel/time/posix-timers.c:1161 kernel/time/posix-timers.c:1173) do_syscall_64 (arch/x86/entry/syscall_64.c:63 (discriminator 1) arch/x86/entry/syscall_64.c:94 (discriminator 1)) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:131) Update the correct auxiliary timekeeper. Fixes: `775f71ebed` ("timekeeping: Make do_adjtimex() reusable") Fixes: `ecf3e70304` ("timekeeping: Provide adjtimex() for auxiliary clocks") Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260120-timekeeper-auxclock-leapstate-v1-1-5b358c6b3cfd@linutronix.de	2026-01-20 10:18:53 +01:00
Gal Pressman	90f3c12324	panic: only warn about deprecated panic_print on write access The panic_print_deprecated() warning is being triggered on both read and write operations to the panic_print parameter. This causes spurious warnings when users run 'sysctl -a' to list all sysctl values, since that command reads /proc/sys/kernel/panic_print and triggers the deprecation notice. Modify the handlers to only emit the deprecation warning when the parameter is actually being set: - sysctl_panic_print_handler(): check 'write' flag before warning. - panic_print_get(): remove the deprecation call entirely. This way, users are only warned when they actively try to use the deprecated parameter, not when passively querying system state. Link: https://lkml.kernel.org/r/20260106163321.83586-1-gal@nvidia.com Fixes: `ee13240cd7` ("panic: add note that panic_print sysctl interface is deprecated") Fixes: `2683df6539` ("panic: add note that 'panic_print' parameter is deprecated") Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Nimrod Oren <noren@nvidia.com> Cc: Feng Tang <feng.tang@linux.alibaba.com> Cc: Joel Granados <joel.granados@kernel.org> Cc: Petr Mladek <pmladek@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-19 12:30:01 -08:00
Linus Torvalds	6f32aa9161	cgroup: Another fix for v6.19-rc5 - Add Chen Ridong as cpuset reviewer. - Add SPDX license identifiers to cgroup files that were missing them. -----BEGIN PGP SIGNATURE----- iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaW0ToA4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGel7AQD3eTbiR/OU0i5yJj6YZsIdVYcteqqsWBkDFnZ3 9pZZugEA+KCwi5eQz+cryKZ7y5fU5g5O6f4OP/lfLEcSmFDpfQU= =QHwy -----END PGP SIGNATURE----- Merge tag 'cgroup-for-6.19-rc5-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: - Add Chen Ridong as cpuset reviewer - Add SPDX license identifiers to cgroup files that were missing them * tag 'cgroup-for-6.19-rc5-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: kernel: cgroup: Add LGPL-2.1 SPDX license ID to legacy_freezer.c kernel: cgroup: Add SPDX-License-Identifier lines MAINTAINERS: Add Chen Ridong as cpuset reviewer	2026-01-18 14:30:27 -08:00
Linus Torvalds	b671c1dad2	Fix the update_needs_ipi() check in the hrtimer code that may result in incorrect skipping of hrtimer IPIs. Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmlsr+oRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1iLPw/+OuA8hPlFY8nxrpVKASoiP7u+mGQ82Qmf xdf1Kimpmc5ydpGsxpRLKY19l3fLuGSJuOa2Url0Ro52JHxU+yPPOkN2ONjsPxLI mFMdhyXiS1PIgYt307OmNmKFadxdg5cGJTKY0wIS9JCcPFxI5Y3NFhWZ4ybPlGHK /bxtsjUctekUTDVahqP65lRIYMvjhO1I2LQfoPQyg/C+2n6Owy7TAX8xZovDZLHz a9RkkX6FPc8aR8Kj1VYm1yABNlUhGvFvTE6wfcfOMjiHHeDv2qYdwNHF2tK2eoKC QX609zBTCT/n0ioZlikLrKDZitj6uRgS2blBc+hodHEs4cY5gE2umWSU4IirT7HH GDjvG/vns/BLRr97gPdMaJ+N/sjdCSQ5tsDRrvHwCGC27VXPu3O3DYU128eH4eXW +KNJGQHJvR7slBKElQ4wxJaLhbLBB5rw/AqqdXa4PU8LqvMBBNcLZjHnrYlbD5Ei ovdX5VtRYOjrVl7pCNrXXdzFfBpXK+sOPHQMoJxnLHKXO+cdsZz8zIB7AHZEVe20 CIOMRcfgw2nvpS+0MqaHzre0ixyq+U2XSIu7MRIlq8GZ+zdjiVMqx3pPa19QRdyr hA9JupeRC9eZdymlms2MiSzqIVmWSVvgcyhIFSYw9yLe55vXcGyArF27RKGtBHRA uF+K5aT8gqA= =CAdu -----END PGP SIGNATURE----- Merge tag 'timers-urgent-2026-01-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Ingo Molnar: "Fix the update_needs_ipi() check in the hrtimer code that may result in incorrect skipping of hrtimer IPIs" * tag 'timers-urgent-2026-01-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: hrtimer: Fix softirq base check in update_needs_ipi()	2026-01-18 10:56:32 -08:00
Linus Torvalds	837c8180e3	Misc DL scheduler fixes, mainly for a new category of bugs that were discovered and fixed recently: - Fix a race condition in the DL server - Fix a DL server bug which can result in incorrectly going idle when there's work available - Fix DL server bug which triggers a WARN() due to broken get_prio_dl() logic and subsequent misbehavior - Fix double update_rq_clock() calls - Fix setscheduler() assumption about static priorities - Make sure balancing callbacks are always called - Plus a handful of preparatory commits for the fixes. Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmlsrfIRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1hgUA/9HQi51sxF/TlhKeoj9Z0zSO/L5Fbd7d5O ucsr+exoU7B+X95AnYC64pV3YlTNibD5y6HR4zpPzhlmEdFw8K9hTmDeSOF2cB0T dPdEY2KtvNoQwVwytPYKMsXtBTcJtZZqOec9BgGBzSIpjPm6Zhxr1J2/lKFy/udA 00aqNuJwfQW94dxvyPf+jPXbm0JW9HHZGNuuOXkkmZpkH+HIzBDrTWcAjHUaZt9H doPyEfO08dg3Uu4OxNBtxx94X7bQsn5RU1plSEc+xF7zynqwEIwDbelFftpHiM+z UqfFgkzCV8A8o/72BfZVhQgaspMZ+M4XSsaGBbouOTg3Rgx8gRKC23s8/xhDJSpy ayMWhuR5bsI04GD1xZVOwLvUi8oDIgAFIAKRI6gcW5eQELlejii0bAvs9Jw9/OjI QhhrX54sJCPPKMZpKO2n+AzQvv9a6+/p7ikiA2U5dR4wBEERFUw8ffRguHwSm+BU U1M+t+2tw+UiC7yML0zvIC3HdyfBhj33ZeW7PWyvFU0vjGm78U6NjqliAxbqW21l GdtQUpi9aXCtijXqHhLFvBtWGmqowj9UQiHroC0+nq1BsTpBDe575vwJNYF2K9he ht0hZqreUIf7xTsJHXlLFbtFxPQw0xAKzWJWUswaaqO8eS2JcWJbmHUAsHQuxmJB MbkJAlU7j/w= =DgiW -----END PGP SIGNATURE----- Merge tag 'sched-urgent-2026-01-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "Misc deadline scheduler fixes, mainly for a new category of bugs that were discovered and fixed recently: - Fix a race condition in the DL server - Fix a DL server bug which can result in incorrectly going idle when there's work available - Fix DL server bug which triggers a WARN() due to broken get_prio_dl() logic and subsequent misbehavior - Fix double update_rq_clock() calls - Fix setscheduler() assumption about static priorities - Make sure balancing callbacks are always called - Plus a handful of preparatory commits for the fixes" * tag 'sched-urgent-2026-01-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/deadline: Use ENQUEUE_MOVE to allow priority change sched: Deadline has dynamic priority sched: Audit MOVE vs balance_callbacks sched: Fold rq-pin swizzle into __balance_callbacks() sched/deadline: Avoid double update_rq_clock() sched/deadline: Ensure get_prio_dl() is up-to-date sched/deadline: Fix server stopping with runnable tasks sched: Provide idle_rq() helper sched/deadline: Fix potential race in dl_add_task_root_domain() sched/deadline: Remove unnecessary comment in dl_add_task_root_domain()	2026-01-18 10:17:40 -08:00
Tim Bird	4787eaf7c1	bpf: Add SPDX license identifiers to a few files Add GPL-2.0 SPDX-License-Identifier lines to some files, and remove a reference to COPYING, and boilerplate warranty text, from offload.c. Signed-off-by: Tim Bird <tim.bird@sony.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20260115013129.598705-1-tim.bird@sony.com	2026-01-16 14:50:00 -08:00
Mykyta Yatsenko	1700147697	bpf: Add __force annotations to silence sparse warnings Add __force annotations to casts that convert between __user and kernel address spaces. These casts are intentional: - In bpf_send_signal_common(), the value is stored in si_value.sival_ptr which is typed as void __user , but the value comes from a BPF program parameter. - In the bpf__dynptr() kfuncs, user pointers are cast to const void * before being passed to copy helper functions that correctly handle the user address space through copy_from_user variants. Without __force, sparse reports: warning: cast removes address space '__user' of expression Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20260115184509.3585759-1-mykyta.yatsenko5@gmail.com Closes: https://lore.kernel.org/oe-kbuild-all/202601131740.6C3BdBaB-lkp@intel.com/	2026-01-16 14:21:11 -08:00
Linus Torvalds	b62ce2547f	Power management fixes for 6.19-rc6 - Fix a memory leak in em_create_pd() error path (Malaya Kumar Rout) - Fix stale description of the cost field in struct em_perf_state to reflect the current code (Yaxiong Tian) - Fix and revamp the energy model YNL specification added recently along with the energy model netlink interface (Changwoo Min) -----BEGIN PGP SIGNATURE----- iQFGBAABCAAwFiEEcM8Aw/RY0dgsiRUR7l+9nS/U47UFAmlqWS8SHHJqd0Byand5 c29ja2kubmV0AAoJEO5fvZ0v1OO1BhAH/iSA/9YP9/xIW+rIZj2irsncGVc/hJ+V 8PcM2tar966AEQBlmDPc7ug2MXT0Mb/5WRtQo/IvZ1tdAALB+bC70FHjdu3y7SAX IzFGyHKJ9OVJHGxYq0TCBbRXgYubUZiqvAnaBTBZ/ZIFlt3B4KEyatfFfxJMb1Pe H9zQIyUVBN1tjPfgNVbc22XGfkwfmmla72le6GmHseKZRqpwutIwzxm1PoMNVeLb fQv625LsWrDD9NU6j5uGq3WEq3xPltubtHN545FTFi4aGwBYJaG1FuIi+kPiQuCR VZmxTWVEiVwjttiZf/xWbOkPfwQqdgeXlTk5j+iBlF1jkrrW75IuLYQ= =eJ2p -----END PGP SIGNATURE----- Merge tag 'pm-6.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fixes from Rafael Wysocki: "These fix an error path memory leak in the energy model management code, fix a kerneldoc comment in it, and fix and revamp the energy model YNL specification added recently along with the new energy model management netlink interface (that received feedback after being added): - Fix a memory leak in em_create_pd() error path (Malaya Kumar Rout) - Fix stale description of the cost field in struct em_perf_state to reflect the current code (Yaxiong Tian) - Fix and revamp the energy model YNL specification added recently along with the energy model netlink interface (Changwoo Min)" * tag 'pm-6.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: PM: EM: Add dump to get-perf-domains in the EM YNL spec PM: EM: Change cpus' type from string to u64 array in the EM YNL spec PM: EM: Rename em.yaml to dev-energymodel.yaml PM: EM: Fix yamllint warnings in the EM YNL spec PM: EM: Fix memory leak in em_create_pd() error path PM: EM: Fix incorrect description of the cost field in struct em_perf_state	2026-01-16 12:08:19 -08:00
Rafael J. Wysocki	e9df6eba06	genirq/chip: Change irq_chip_pm_put() return type to void The irq_chip_pm_put() return value is only used in __irq_do_set_handler() to trigger a WARN_ON() if it is negative, but doing so is not useful because irq_chip_pm_put() simply passes the pm_runtime_put() return value to its callers. Returning an error code from pm_runtime_put() merely means that it has not queued up a work item to check whether or not the device can be suspended and there are many perfectly valid situations in which that can happen, like after writing "on" to the devices' runtime PM "control" attribute in sysfs for one example. For this reason, modify irq_chip_pm_put() to discard the pm_runtime_put() return value, change its return type to void, and drop the WARN_ON() around the irq_chip_pm_put() invocation from __irq_do_set_handler(). Also update the irq_chip_pm_put() kerneldoc comment to be more accurate. This will facilitate a planned change of the pm_runtime_put() return type to void in the future. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/5075294.31r3eYUQgx@rafael.j.wysocki	2026-01-16 20:28:05 +01:00
Puranjay Mohan	af9e89d8dd	bpf: Preserve id of register in sync_linked_regs() sync_linked_regs() copies the id of known_reg to reg when propagating bounds of known_reg to reg using the off of known_reg, but when known_reg was linked to reg like: known_reg = reg ; both known_reg and reg get same id known_reg += 4 ; known_reg gets off = 4, and its id gets BPF_ADD_CONST now when a call to sync_linked_regs() happens, let's say with the following: if known_reg >= 10 goto pc+2 known_reg's new bounds are propagated to reg but now reg gets BPF_ADD_CONST from the copy. This means if another link to reg is created like: another_reg = reg ; another_reg should get the id of reg but assign_scalar_id_before_mov() sees BPF_ADD_CONST on reg and assigns a new id to it. As reg has a new id now, known_reg's link to reg is broken. If we find new bounds for known_reg, they will not be propagated to reg. This can be seen in the selftest added in the next commit: 0: (85) call bpf_get_prandom_u32#7 ; R0=scalar() 1: (57) r0 &= 255 ; R0=scalar(smin=smin32=0,smax=umax=smax32=umax32=255,var_off=(0x0; 0xff)) 2: (bf) r1 = r0 ; R0=scalar(id=1,smin=smin32=0,smax=umax=smax32=umax32=255,var_off=(0x0; 0xff)) R1=scalar(id=1,smin=smin32=0,smax=umax=smax32=umax32=255,var_off=(0x0; 0xff)) 3: (07) r1 += 4 ; R1=scalar(id=1+4,smin=umin=smin32=umin32=4,smax=umax=smax32=umax32=259,var_off=(0x0; 0x1ff)) 4: (a5) if r1 < 0xa goto pc+4 ; R1=scalar(id=1+4,smin=umin=smin32=umin32=10,smax=umax=smax32=umax32=259,var_off=(0x0; 0x1ff)) 5: (bf) r2 = r0 ; R0=scalar(id=2,smin=umin=smin32=umin32=6,smax=umax=smax32=umax32=255) R2=scalar(id=2,smin=umin=smin32=umin32=6,smax=umax=smax32=umax32=255) 6: (a5) if r1 < 0xe goto pc+2 ; R1=scalar(id=1+4,smin=umin=smin32=umin32=14,smax=umax=smax32=umax32=259,var_off=(0x0; 0x1ff)) 7: (35) if r0 >= 0xa goto pc+1 ; R0=scalar(id=2,smin=umin=smin32=umin32=6,smax=umax=smax32=umax32=9,var_off=(0x0; 0xf)) 8: (37) r0 /= 0 div by zero When 4 is verified, r1's bounds are propagated to r0 but r0 also gets BPF_ADD_CONST (bug). When 5 is verified, r0 gets a new id (2) and its link with r1 is broken. After 6 we know r1 has bounds [14, 259] and therefore r0 should have bounds [10, 255], therefore the branch at 7 is always taken. But because r0's id was changed to 2, r1's new bounds are not propagated to r0. The verifier still thinks r0 has bounds [6, 255] before 7 and execution can reach div by zero. Fix this by preserving id in sync_linked_regs() like off and subreg_def. Fixes: `98d7ca374b` ("bpf: Track delta between "linked" registers.") Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260115151143.1344724-2-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-16 10:08:59 -08:00
Linus Torvalds	7a2c1b27cd	printk fixup for 6.19 rc6 -----BEGIN PGP SIGNATURE----- iQJPBAABCAA5FiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAmlqGMYbFIAAAAAABAAO bWFudTIsMi41KzEuMTEsMiwyAAoJEFKgDEdIgJTyT0UP/3Wn4tm0h0n3XeyujQEA IPNisJCczF6aWIAuwccRigR8hQFriPNvkZ5AhMGtotZJrY3uUoe1/8aF+XIdqnl/ Yb1434AGwVNIpSaap+vtEclHrLDmzZH+Z75/FTWQwldM/hPkPWJI8fsEuRLqBZsn v520NBFtrQVcOZKKNy0npBnHsC0DsAmqoZuOvLTx0mx5AyE029CfPbDMZuVnSNix KjZ4U5KL0qDs2LIMpdB/mqprydGkHdogdIbrPK3WtzStVgNbi9VmnV19ZwbUlXJM rYPbtbQg3htwuspgR+yM6O21qsthRf2qZF5+2/a929IzOBsD/qAXQbbxQWVpF7Qb ELYXNV4N5hqm9EW8WeOOpLKUUG7k0fRPf81X/07uGVafPMQKQJ8kFNgLBkBFR4ya RAMNxTPHbHQvVaLcRujxZXoC4Wh3ZTunQXpIouy0p9dKOzbsCAj0ZqeqDa09UsaW rCEm50p/Pd1csML8a9A/2nNoWjQzuSVmML7F6obGCOWaW6p21GhSKHzqqDIjBab9 3wxhpllVeYRYS2yhkKjOPJkQKXo3idIdpieLpW8IVbJvp/gQgmeKjWdhnvNvUan9 hyCzfI8OZXVz0vItBfWsoX44+6UtpLHd4o16aDYjyDItflJOPuQbWzky+icT2R3x 8B5xPC08tmEGitf3miv2EGq9 =d/xV -----END PGP SIGNATURE----- Merge tag 'printk-for-6.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux Pull printk fix from Petr Mladek: - Prevent softlockup by restoring IRQs in atomic flush after each record * tag 'printk-for-6.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux: printk/nbcon: Restore IRQ in atomic flush after each emitted record	2026-01-16 09:46:59 -08:00
Dan Williams	bc62f5b308	dax/hmem, e820, resource: Defer Soft Reserved insertion until hmem is ready Insert Soft Reserved memory into a dedicated soft_reserve_resource tree instead of the iomem_resource tree at boot. Delay publishing these ranges into the iomem hierarchy until ownership is resolved and the HMEM path is ready to consume them. Publishing Soft Reserved ranges into iomem too early conflicts with CXL hotplug and prevents region assembly when those ranges overlap CXL windows. Follow up patches will reinsert Soft Reserved ranges into iomem after CXL window publication is complete and HMEM is ready to claim the memory. This provides a cleaner handoff between EFI-defined memory ranges and CXL resource management without trimming or deleting resources later. In the meantime "Soft Reserved" resources will no longer appear in /proc/iomem, only their results. I.e. with "memmap=4G%4G+0xefffffff" Before: 100000000-1ffffffff : Soft Reserved 100000000-1ffffffff : dax1.0 100000000-1ffffffff : System RAM (kmem) After: 100000000-1ffffffff : dax1.0 100000000-1ffffffff : System RAM (kmem) The expectation is that this does not lead to a user visible regression because the dax1.0 device is created in both instances. Co-developed-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com> [Smita: incorporate feedback from x86 maintainer review] Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com> Link: https://patch.msgid.link/20251120031925.87762-2-Smita.KoralahalliChannabasappa@amd.com [djbw: cleanups and clarifications] Link: https://lore.kernel.org/69443f707b025_1cee10022@dwillia2-mobl4.notmuch Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Dave Jiang <dave.jiang@intel.com>	2026-01-16 09:02:36 -07:00
Oleg Nesterov	d55c571e43	x86/uprobes: Fix XOL allocation failure for 32-bit tasks This script #!/usr/bin/bash echo 0 > /proc/sys/kernel/randomize_va_space echo 'void main(void) {}' > TEST.c # -fcf-protection to ensure that the 1st endbr32 insn can't be emulated gcc -m32 -fcf-protection=branch TEST.c -o test bpftrace -e 'uprobe:./test:main {}' -c ./test "hangs", the probed ./test task enters an endless loop. The problem is that with randomize_va_space == 0 get_unmapped_area(TASK_SIZE - PAGE_SIZE) called by xol_add_vma() can not just return the "addr == TASK_SIZE - PAGE_SIZE" hint, this addr is used by the stack vma. arch_get_unmapped_area_topdown() doesn't take TIF_ADDR32 into account and in_32bit_syscall() is false, this leads to info.high_limit > TASK_SIZE. vm_unmapped_area() happily returns the high address > TASK_SIZE and then get_unmapped_area() returns -ENOMEM after the "if (addr > TASK_SIZE - len)" check. handle_swbp() doesn't report this failure (probably it should) and silently restarts the probed insn. Endless loop. I think that the right fix should change the x86 get_unmapped_area() paths to rely on TIF_ADDR32 rather than in_32bit_syscall(). Note also that if CONFIG_X86_X32_ABI=y, in_x32_syscall() falsely returns true in this case because ->orig_ax = -1. But we need a simple fix for -stable, so this patch just sets TS_COMPAT if the probed task is 32-bit to make in_ia32_syscall() true. Fixes: `1b028f784e` ("x86/mm: Introduce mmap_compat_base() for 32-bit mmap()") Reported-by: Paulo Andrade <pandrade@redhat.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/aV5uldEvV7pb4RA8@redhat.com/ Cc: stable@vger.kernel.org Link: https://patch.msgid.link/aWO7Fdxn39piQnxu@redhat.com	2026-01-16 16:23:54 +01:00
Rafael J. Wysocki	d51e68b700	Merge branch 'pm-em' Merge fixes related to the energy model management for 6.19-rc6: - Fix a memory leak in em_create_pd() error path (Malaya Kumar Rout) - Fix stale description of the cost field in struct em_perf_state to reflect the current code (Yaxiong Tian) - Fix and revamp the energy model YNL specification added recently along with the energy model netlink interface (Changwoo Min) * pm-em: PM: EM: Add dump to get-perf-domains in the EM YNL spec PM: EM: Change cpus' type from string to u64 array in the EM YNL spec PM: EM: Rename em.yaml to dev-energymodel.yaml PM: EM: Fix yamllint warnings in the EM YNL spec PM: EM: Fix memory leak in em_create_pd() error path PM: EM: Fix incorrect description of the cost field in struct em_perf_state	2026-01-16 16:16:24 +01:00
Tim Bird	330eb955ea	kernel: add SPDX-License-Identifier lines Add SPDX-License-Identifier lines to some old kernel files. Signed-off-by: Tim Bird <tim.bird@sony.com> Acked-by: Karim Yaghmour <karim.yaghmour@opersys.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2026-01-16 15:32:16 +01:00
Tim Bird	84697bf553	kernel: cgroup: Add LGPL-2.1 SPDX license ID to legacy_freezer.c Add an appropriate SPDX-License-Identifier line to the file, and remove the GNU boilerplate text. Signed-off-by: Tim Bird <tim.bird@sony.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-01-15 22:03:15 -10:00
Tim Bird	a1b3421a02	kernel: cgroup: Add SPDX-License-Identifier lines Add GPL-2.0 SPDX license id lines to a few old files, replacing the reference to the COPYING file. The COPYING file at the time of creation of these files (2007 and 2005) was GPL-v2.0, with an additional clause indicating that only v2 applied. Signed-off-by: Tim Bird <tim.bird@sony.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-01-15 22:03:09 -10:00
Tim Bird	983d014aaf	kernel: modules: Add SPDX license identifier to kmod.c Add a GPL-2.0 license identifier line for this file. kmod.c was originally introduced in the kernel in February of 1998 by Linus Torvalds - who was familiar with kernel licensing at the time this was introduced. Signed-off-by: Tim Bird <tim.bird@sony.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2026-01-15 16:58:28 -08:00
Jiri Olsa	276f3b6daf	arm64/ftrace,bpf: Fix partial regs after bpf_prog_run Mahe reported issue with bpf_override_return helper not working when executed from kprobe.multi bpf program on arm. The problem is that on arm we use alternate storage for pt_regs object that is passed to bpf_prog_run and if any register is changed (which is the case of bpf_override_return) it's not propagated back to actual pt_regs object. Fixing this by introducing and calling ftrace_partial_regs_update function to propagate the values of changed registers (ip and stack). Reported-by: Mahe Tardy <mahe.tardy@gmail.com> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Acked-by: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/bpf/20260112121157.854473-1-jolsa@kernel.org	2026-01-15 16:15:25 -08:00
Linus Torvalds	88e490913f	ftrace fixes for v6.19: - Fix allocation accounting on boot up The ftrace records for each function that ftrace can attach to is done in a group of pages. At boot up, the number of pages are calculated and allocated. After that, the pages are filled with data. It may allocate more than needed due to some functions not being recorded (because they are unused weak functions), this too is recorded. After the data is filled in, a check is made to make sure the right number of pages were allocated. But this was off due to the assumption that the same number of entries fit per every page. Because the size of an entry does not evenly divide into PAGE_SIZE, there is a rounding error when a large number of pages is allocated to hold the events. This causes the check to fail and triggers a warning. Fix the accounting by finding out how many pages are actually allocated from the functions that allocate them and use that to see if all the pages allocated were used and the ones not used are properly freed. -----BEGIN PGP SIGNATURE----- iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaWlyoxQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qoZoAQDuPOaly/RM9cS70zngDo/UsZmi2lQL RxuMR7sPVMwkrAEAp/gnpJenSRxCzaO4EdQYUYI28/ihkdNGp3KG8Q2MNw0= =v899 -----END PGP SIGNATURE----- Merge tag 'ftrace-v6.19-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull ftrace fix from Steven Rostedt: - Fix allocation accounting on boot up The ftrace records for each function that ftrace can attach to is done in a group of pages. At boot up, the number of pages are calculated and allocated. After that, the pages are filled with data. It may allocate more than needed due to some functions not being recorded (because they are unused weak functions), this too is recorded. After the data is filled in, a check is made to make sure the right number of pages were allocated. But this was off due to the assumption that the same number of entries fit per every page. Because the size of an entry does not evenly divide into PAGE_SIZE, there is a rounding error when a large number of pages is allocated to hold the events. This causes the check to fail and triggers a warning. Fix the accounting by finding out how many pages are actually allocated from the functions that allocate them and use that to see if all the pages allocated were used and the ones not used are properly freed. * tag 'ftrace-v6.19-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: ftrace: Do not over-allocate ftrace memory	2026-01-15 15:13:05 -08:00
Shrikanth Hegde	5d86d542f6	sched/fair: Remove nohz.nr_cpus and use weight of cpumask instead nohz.nr_cpus was observed as contended cacheline when running enterprise workload on large systems. Fundamental scalability challenge with nohz.idle_cpus_mask and nohz.nr_cpus is the following: (1) nohz_balancer_kick() observes (reads) nohz.nr_cpus (or nohz.idle_cpu_mask) and nohz.has_blocked to see whether there's any nohz balancing work to do, in every scheduler tick. (2) nohz_balance_enter_idle() and nohz_balance_exit_idle() (through nohz_balancer_kick() via sched_tick()) modify (write) nohz.nr_cpus (and/or nohz.idle_cpu_mask) and nohz.has_blocked. The characteristic frequencies are the following: (1) nohz_balancer_kick() happens at scheduler (busy)tick frequency on CPU(which has not gone idle). This is a relatively constant frequency in the ~1 kHz range or lower. (2) happens at idle enter/exit frequency on every CPU that goes to idle. This is workload dependent, but can easily be hundreds of kHz for IO-bound loads and high CPU counts. Ie. can be orders of magnitude higher than (1), in which case a cachemiss at every invocation of (1) is almost inevitable. idle exit will trigger (1) on the CPU which is coming out of idle. There's two types of costs from these functions: (A) scheduler tick cost via (1): this happens on busy CPUs too, and is thus a primary scalability cost. But the rate here is constant and typically much lower than (B), hence the absolute benefit to workload scalability will be lower as well. (B) idle cost via (2): going-to-idle and coming-from-idle costs are secondary concerns, because they impact power efficiency more than they impact scalability. But in terms of absolute cost this scales up with nr_cpus as well, and a much faster rate, and thus may also approach and negatively impact system limits like memory bus/fabric bandwidth. Note that nohz.idle_cpus_mask and nohz.nr_cpus may appear to reside in the same cacheline, however under CONFIG_CPUMASK_OFFSTACK=y the backing storage for nohz.idle_cpus_mask will be elsewhere. With CPUMASK_OFFSTACK=n, the nohz.idle_cpus_mask and rest of nohz fields are in different cachelines under typical NR_CPUS=512/2048. This implies two separate cachelines being dirtied upon idle entry / exit. nohz.nr_cpus can be derived from the mask itself. Its usage doesn't warrant a functionally correct value. This means one less cacheline being dirtied in idle entry/exit path which helps to save some bus bandwidth w.r.t to those nohz functions(approx 50%). This in turn helps to improve enterprise workload throughput. On system with 480 CPUs, running "hackbench 40 process 10000 loops" (Avg of 3 runs) baseline: 0.81% hackbench [k] nohz_balance_exit_idle 0.21% hackbench [k] nohz_balancer_kick 0.09% swapper [k] nohz_run_idle_balance With patch: 0.35% hackbench [k] nohz_balance_exit_idle 0.09% hackbench [k] nohz_balancer_kick 0.07% swapper [k] nohz_run_idle_balance [Ingo Molnar: scalability analysis changlog] Reviewed-and-tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://patch.msgid.link/20260115073524.376643-4-sshegde@linux.ibm.com	2026-01-15 22:41:27 +01:00
Shrikanth Hegde	94e70734b4	sched/fair: Change likelyhood of nohz.nr_cpus These days most of the system have multi cores. The likelyhood of at least one or more CPUs in nohz (idle state) is higher. Give accurate hint to the branch predictor. Reviewed-and-tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://patch.msgid.link/20260115073524.376643-3-sshegde@linux.ibm.com	2026-01-15 22:41:27 +01:00
Shrikanth Hegde	6b67c8a72e	sched/fair: Move checking for nohz cpus after time check Current code does. - Read nohz.nr_cpus - Check if the time has passed to do NOHZ idle balance Instead do this. - Check if the time has passed to do NOHZ idle balance - Read nohz.nr_cpus This will skip the read most of the time in normal system usage. i.e when there are nohz.nr_cpus (system is not 100% busy). Note that when there are no idle CPUs(100% busy), even if the flag gets set to NOHZ_STATS_KICK \| NOHZ_NEXT_KICK, find_new_ilb will fail and there will be no NOHZ idle balance. In such cases there will be a very narrow window where, kick_ilb will be called un-necessarily. However current functionality is still retained. Note: This patch doesn't solve any cacheline overheads. No improvement in performance apart from saving a few cycles of reading nohz.nr_cpus Reviewed-and-tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://patch.msgid.link/20260115073524.376643-2-sshegde@linux.ibm.com	2026-01-15 22:41:26 +01:00
Zhan Xusheng	553255cc85	sched/fair: Fix math notation errors in avg_vruntime comment The avg_vruntime comment contains a couple of mathematical notation issues: - The summation over w_i * (V - v_i) is written in an ambiguous form - The delta term refers to v instead of v0, which is inconsistent with the code and preceding explanation Fix these to make the comment mathematically correct and consistent with the implementation. Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260114090035.19033-1-zhanxusheng@xiaomi.com	2026-01-15 22:41:26 +01:00
Gabriele Monaco	8d73732016	sched: Fix build for modules using set_tsk_need_resched() Commit `adcc3bfa88` ("sched: Adapt sched tracepoints for RV task model") added a tracepoint to the need_resched action that can be triggered also by set_tsk_need_resched. This function was previously accessible from out-of-tree modules but it's no longer available because the __trace_set_need_resched() symbol is not exported (together with the tracepoint itself, which was exported in a separate patch) and building such modules fails. Export __trace_set_need_resched to modules to fix those build issues. Fixes: `adcc3bfa88` ("sched: Adapt sched tracepoints for RV task model") Signed-off-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Phil Auld <pauld@redhat.com> Link: https://patch.msgid.link/20260112140413.362202-1-gmonaco@redhat.com	2026-01-15 22:41:26 +01:00
Peter Zijlstra	627cc25f84	sched/deadline: Use ENQUEUE_MOVE to allow priority change Pierre reported hitting balance callback warnings for deadline tasks after commit `6455ad5346` ("sched: Move sched_class::prio_changed() into the change pattern"). It turns out that DEQUEUE_SAVE+ENQUEUE_RESTORE does not preserve DL priority and subsequently trips a balance pass -- where one was not expected. From discussion with Juri and Luca, the purpose of this clause was to deal with tasks new to DL and all those sites will have MOVE set (as well as CLASS, but MOVE is move conservative at this point). Per the previous patches MOVE is audited to always run the balance callbacks, so switch enqueue_dl_entity() to use MOVE for this case. Fixes: `6455ad5346` ("sched: Move sched_class::prio_changed() into the change pattern") Reported-by: Pierre Gondois <pierre.gondois@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Pierre Gondois <pierre.gondois@arm.com> Tested-by: Juri Lelli <juri.lelli@redhat.com> Link: https://patch.msgid.link/20260114130528.GB831285@noisy.programming.kicks-ass.net	2026-01-15 21:57:53 +01:00
Peter Zijlstra	e008ec6c79	sched: Deadline has dynamic priority While FIFO/RR have static priority, DEADLINE is a dynamic priority scheme. Notably it has static priority -1. Do not assume the priority doesn't change for deadline tasks just because the static priority doesn't change. This ensures DL always sees {DE,EN}QUEUE_MOVE where appropriate. Fixes: `ff77e46853` ("sched/rt: Fix PI handling vs. sched_setscheduler()") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Pierre Gondois <pierre.gondois@arm.com> Tested-by: Juri Lelli <juri.lelli@redhat.com> Link: https://patch.msgid.link/20260114130528.GB831285@noisy.programming.kicks-ass.net	2026-01-15 21:57:53 +01:00
Peter Zijlstra	53439363c0	sched: Audit MOVE vs balance_callbacks The {DE,EN}QUEUE_MOVE flag indicates a task is allowed to change priority, which means there could be balance callbacks queued. Therefore audit all MOVE users and make sure they do run balance callbacks before dropping rq-lock. Fixes: `6455ad5346` ("sched: Move sched_class::prio_changed() into the change pattern") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Pierre Gondois <pierre.gondois@arm.com> Tested-by: Juri Lelli <juri.lelli@redhat.com> Link: https://patch.msgid.link/20260114130528.GB831285@noisy.programming.kicks-ass.net	2026-01-15 21:57:53 +01:00
Peter Zijlstra	49041e87f9	sched: Fold rq-pin swizzle into __balance_callbacks() Prepare for more users needing the rq-pin swizzle. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Pierre Gondois <pierre.gondois@arm.com> Tested-by: Juri Lelli <juri.lelli@redhat.com> Link: https://patch.msgid.link/20260114130528.GB831285@noisy.programming.kicks-ass.net	2026-01-15 21:57:52 +01:00
Peter Zijlstra	4de9ff7606	sched/deadline: Avoid double update_rq_clock() When setup_new_dl_entity() is called from enqueue_task_dl() -> enqueue_dl_entity(), the rq-clock should already be updated, and calling update_rq_clock() again is not right. Move the update_rq_clock() to the one other caller of setup_new_dl_entity(): sched_init_dl_server(). Fixes: `9f239df555` ("sched/deadline: Initialize dl_servers after SMP") Reported-by: Pierre Gondois <pierre.gondois@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Pierre Gondois <pierre.gondois@arm.com> Link: https://patch.msgid.link/20260113115622.GA831285@noisy.programming.kicks-ass.net	2026-01-15 21:57:52 +01:00
Peter Zijlstra	375410bb9a	sched/deadline: Ensure get_prio_dl() is up-to-date Pratheek tripped a WARN and noted the following issue: > Inspecting the set of events that led to the warning being triggered > showed the following: > > systemd-1 [008] dN.31 ...: do_set_cpus_allowed: set_cpus_allowed begin! > > systemd-1 [008] dN.31 ...: sched_change_begin: Begin! > systemd-1 [008] dN.31 ...: sched_change_begin: Before dequeue_task()! > systemd-1 [008] dN.31 ...: update_curr_dl_se: update_curr_dl_se: ENQUEUE_REPLENISH > systemd-1 [008] dN.31 ...: enqueue_dl_entity: enqueue_dl_entity: ENQUEUE_REPLENISH > systemd-1 [008] dN.31 ...: replenish_dl_entity: Replenish before: 14815760217 > systemd-1 [008] dN.31 ...: replenish_dl_entity: Replenish after: 14816960047 > systemd-1 [008] dN.31 ...: sched_change_begin: Before put_prev_task()! > > systemd-1 [008] dN.31 ...: sched_change_end: Before enqueue_task()! > systemd-1 [008] dN.31 ...: sched_change_end: Before put_prev_task()! > systemd-1 [008] dN.31 ...: prio_changed_dl: Queuing pull task on prio change: 14815760217 -> 14816960047 > systemd-1 [008] dN.31 ...: prio_changed_dl: Queuing balance callback! > systemd-1 [008] dN.31 ...: sched_change_end: End! > > systemd-1 [008] dN.31 ...: do_set_cpus_allowed: set_cpus_allowed end! > systemd-1 [008] dN.21 ...: __schedule: Woops! Balance callback found! > > 1. sched_change_begin() from guard(sched_change) in > do_set_cpus_allowed() stashes the priority, which for the deadline > task, is "p->dl.deadline". > 2. The dequeue of the deadline task replenishes the deadline. > 3. The task is enqueued back after guard's scope ends and since there is > no *_CLASS flags set, sched_change_end() calls > dl_sched_class->prio_changed() which compares the deadline. > 4. Since deadline was moved on dequeue, prio_changed_dl() sees the value > differ from the stashed value and queues a balance pull callback. > 5. do_set_cpus_allowed() finishes and drops the rq_lock without doing a > do_balance_callbacks(). > 6. Grabbing the rq_lock() at subsequent __schedule() triggers the > warning since the balance pull callback was never executed before > dropping the lock. Meaning get_prio_dl() ought to update current and return an up-to-date value. Fixes: `6455ad5346` ("sched: Move sched_class::prio_changed() into the change pattern") Reported-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://patch.msgid.link/20260106104113.GX3707891@noisy.programming.kicks-ass.net	2026-01-15 21:57:52 +01:00
Linus Torvalds	13b2d15d99	32 hotfixes. 16 are cc:stable, 24 are for MM. - four kerneldoc fixes from Bagas Sanjaya - four DAMON fixes from SeongJae - four mremap VMA-related fixes from Lorenzo - various singletons - please see the changelogs for details -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaWkP3gAKCRDdBJ7gKXxA jnmTAQCBiCyw7Pvvak5OM8Of0qQ8m/jQmNMBPRrd5M3tAAEOlwD9F7E6RjcmyItU k+sboDgNKTvgdRHTmRzj1t96aXJrSQQ= =OxTh -----END PGP SIGNATURE----- Merge tag 'mm-hotfixes-stable-2026-01-15-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: - kerneldoc fixes from Bagas Sanjaya - DAMON fixes from SeongJae - mremap VMA-related fixes from Lorenzo - various singletons - please see the changelogs for details * tag 'mm-hotfixes-stable-2026-01-15-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (30 commits) drivers/dax: add some missing kerneldoc comment fields for struct dev_dax mm: numa,memblock: include <asm/numa.h> for 'numa_nodes_parsed' mailmap: add entry for Daniel Thompson tools/testing/selftests: fix gup_longterm for unknown fs mm/page_alloc: prevent pcp corruption with SMP=n iommu/sva: include mmu_notifier.h header mm: kmsan: fix poisoning of high-order non-compound pages tools/testing/selftests: add forked (un)/faulted VMA merge tests mm/vma: enforce VMA fork limit on unfaulted,faulted mremap merge too tools/testing/selftests: add tests for !tgt, src mremap() merges mm/vma: fix anon_vma UAF on mremap() faulted, unfaulted merge mm/zswap: fix error pointer free in zswap_cpu_comp_prepare() mm/damon/sysfs-scheme: cleanup access_pattern subdirs on scheme dir setup failure mm/damon/sysfs-scheme: cleanup quotas subdirs on scheme dir setup failure mm/damon/sysfs: cleanup attrs subdirs on context dir setup failure mm/damon/sysfs: cleanup intervals subdirs on attrs dir setup failure mm/damon/core: remove call_control in inactive contexts powerpc/watchdog: add support for hardlockup_sys_info sysctl mips: fix HIGHMEM initialization mm/hugetlb: ignore hugepage kernel args if hugepages are unsupported ...	2026-01-15 10:47:14 -08:00
Guenter Roeck	be55257fab	ftrace: Do not over-allocate ftrace memory The pg_remaining calculation in ftrace_process_locs() assumes that ENTRIES_PER_PAGE multiplied by 2^order equals the actual capacity of the allocated page group. However, ENTRIES_PER_PAGE is PAGE_SIZE / ENTRY_SIZE (integer division). When PAGE_SIZE is not a multiple of ENTRY_SIZE (e.g. 4096 / 24 = 170 with remainder 16), high-order allocations (like 256 pages) have significantly more capacity than 256 * 170. This leads to pg_remaining being underestimated, which in turn makes skip (derived from skipped - pg_remaining) larger than expected, causing the WARN(skip != remaining) to trigger. Extra allocated pages for ftrace: 2 with 654 skipped WARNING: CPU: 0 PID: 0 at kernel/trace/ftrace.c:7295 ftrace_process_locs+0x5bf/0x5e0 A similar problem in ftrace_allocate_records() can result in allocating too many pages. This can trigger the second warning in ftrace_process_locs(). Extra allocated pages for ftrace WARNING: CPU: 0 PID: 0 at kernel/trace/ftrace.c:7276 ftrace_process_locs+0x548/0x580 Use the actual capacity of a page group to determine the number of pages to allocate. Have ftrace_allocate_pages() return the number of allocated pages to avoid having to calculate it. Use the actual page group capacity when validating the number of unused pages due to skipped entries. Drop the definition of ENTRIES_PER_PAGE since it is no longer used. Cc: stable@vger.kernel.org Fixes: `4a3efc6baf` ("ftrace: Update the mcount_loc check of skipped entries") Link: https://patch.msgid.link/20260113152243.3557219-1-linux@roeck-us.net Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2026-01-15 10:17:53 -05:00
Namhyung Kim	4960626f95	perf/core: Fix slow perf_event_task_exit() with LBR callstacks I got a report that a task is stuck in perf_event_exit_task() waiting for global_ctx_data_rwsem. On large systems with lots threads, it'd have performance issues when it grabs the lock to iterate all threads in the system to allocate the context data. And it'd block task exit path which is problematic especially under memory pressure. perf_event_open perf_event_alloc attach_perf_ctx_data attach_global_ctx_data percpu_down_write (global_ctx_data_rwsem) for_each_process_thread alloc_task_ctx_data do_exit perf_event_exit_task percpu_down_read (global_ctx_data_rwsem) It should not hold the global_ctx_data_rwsem on the exit path. Let's skip allocation for exiting tasks and free the data carefully. Reported-by: Rosalie Fang <rosaliefang@google.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260112165157.1919624-1-namhyung@kernel.org	2026-01-15 10:04:26 +01:00
Feng Tang	e561383a39	powerpc/watchdog: add support for hardlockup_sys_info sysctl Commit `a9af76a787` ("watchdog: add sys_info sysctls to dump sys info on system lockup") adds 'hardlock_sys_info' systcl knob for general kernel watchdog to control what kinds of system debug info to be dumped on hardlockup. Add similar support in powerpc watchdog code to make the sysctl knob more general, which also fixes a compiling warning in general watchdog code reported by 0day bot. Link: https://lkml.kernel.org/r/20251231080309.39642-1-feng.tang@linux.alibaba.com Fixes: `a9af76a787` ("watchdog: add sys_info sysctls to dump sys info on system lockup") Signed-off-by: Feng Tang <feng.tang@linux.alibaba.com> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202512030920.NFKtekA7-lkp@intel.com/ Suggested-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-14 22:16:22 -08:00
Pasha Tatashin	582f0f3864	kho: validate preserved memory map during population If the previous kernel enabled KHO but did not call kho_finalize() (e.g., CONFIG_LIVEUPDATE=n or userspace skipped the finalization step), the 'preserved-memory-map' property in the FDT remains empty/zero. Previously, kho_populate() would succeed regardless of the memory map's state, reserving the incoming scratch regions in memblock. However, kho_memory_init() would later fail to deserialize the empty map. By that time, the scratch regions were already registered, leading to partial initialization and subsequent list corruption (freeing scratch area twice) during kho_init(). Move the validation of the preserved memory map earlier into kho_populate(). If the memory map is empty/NULL: 1. Abort kho_populate() immediately with -ENOENT. 2. Do not register or reserve the incoming scratch memory, allowing the new kernel to reclaim those pages as standard free memory. 3. Leave the global 'kho_in' state uninitialized. Consequently, kho_memory_init() sees no active KHO context (kho_in.mem_chunks_phys is 0) and falls back to kho_reserve_scratch(), allocating fresh scratch memory as if it were a standard cold boot. Link: https://lkml.kernel.org/r/20251223140140.2090337-1-pasha.tatashin@soleen.com Fixes: `de51999e68` ("kho: allow memory preservation state updates after finalization") Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reported-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Closes: https://lore.kernel.org/all/20251218215613.GA17304@ranerica-svr.sc.intel.com Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Tested-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Cc: Alexander Graf <graf@amazon.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-14 22:16:21 -08:00
Anton Protopopov	d1aab1ca57	bpf: Properly mark live registers for indirect jumps For a `gotox rX` instruction the rX register should be marked as used in the compute_insn_live_regs() function. Fix this. Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Link: https://lore.kernel.org/r/20260114162544.83253-2-a.s.protopopov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-14 19:08:09 -08:00
Alexei Starovoitov	e3d0dbb3b5	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf after rc5 Cross-merge BPF and other fixes after downstream PR. No conflicts. Adjacent: Auto-merging MAINTAINERS Auto-merging Makefile Auto-merging kernel/bpf/verifier.c Auto-merging kernel/sched/ext.c Auto-merging mm/memcontrol.c Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-14 15:22:01 -08:00
Robin Murphy	c6ccd09880	dma/pool: Avoid allocating redundant pools On smaller systems, e.g. embedded arm64, it is common for all memory to end up in ZONE_DMA32 or even ZONE_DMA. In such cases it is redundant to allocate a nominal pool for an empty higher zone that just ends up coming from a lower zone that should already have its own pool anyway. We already have logic to skip allocating a ZONE_DMA pool when that is empty, so generalise that to save memory in the case of other zones too. Signed-off-by: Robin Murphy <robin.murphy@arm.com> Tested-by: Vladimir Kondratiev <vladimir.kondratiev@mobileye.com> Reviewed-by: Baoquan He <bhe@redhat.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/8ab8d8a620dee0109f33f5cb63d6bfeed35aac37.1768230104.git.robin.murphy@arm.com	2026-01-14 11:00:00 +01:00
Robin Murphy	b31ac41b59	dma/pool: Improve pool lookup If CONFIG_ZONE_DMA32 is enabled, but we have not allocated the corresponding atomic_pool_dma32, dma_guess_pool() may return the NULL value of that and fail a GFP_DMA32 allocation without trying to fall back to other pools which may exist. Furthermore, if no GFP_DMA pool exists, it is preferable to try GFP_DMA32 rather than immediately fall back to GFP_KERNEL with even less chance of success. Improve matters by encoding an explicit order of pool preference for each flag. Signed-off-by: Robin Murphy <robin.murphy@arm.com> Tested-by: Vladimir Kondratiev <vladimir.kondratiev@mobileye.com> Reviewed-by: Baoquan He <bhe@redhat.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/c846b1a2f43295cac926c7af2ce907f62baec518.1768230104.git.robin.murphy@arm.com	2026-01-14 11:00:00 +01:00
Thomas Weißschuh	7158fc54b2	vdso: Remove struct getcpu_cache The cache parameter of getcpu() is useless nowadays for various reasons. * It is never passed by userspace for either the vDSO or syscalls. * It is never used by the kernel. * It could not be made to work on the current vDSO architecture. * The structure definition is not part of the UAPI headers. * vdso_getcpu() is superseded by restartable sequences in any case. Remove the struct and its header. As a side-effect this gets rid of an unwanted inclusion of the linux/ header namespace from vDSO code. [ tglx: Adapt to s390 upstream changes */ Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Heiko Carstens <hca@linux.ibm.com> # s390 Link: https://patch.msgid.link/20251230-getcpu_cache-v3-1-fb9c5f880ebe@linutronix.de	2026-01-14 08:56:40 +01:00
Linus Torvalds	c537e12dae	bpf-fixes -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmlm+x8ACgkQ6rmadz2v bTrC5Q/+NSCjBpHAinVL/09pahh0YJKczxWnpHG0rs6hSSZ88o+1uY+z3PQ3lRX9 cHSwjIicRvwpX+DosGcXESqspbvo5MLx5DcB0WBSCnLodJ5MDap0iWltBCrGzqme mpeuZiQMSpsAGQLjICiywxcCndoaKEMSRZFmjlzNVJPbYkbxbZCQvwUQqg+Z4rQX t5zBJjb0X3Oaz3TcSiiW+yulZPoejFyF4P+4eTmFy+mAqq5D7CpMXrrjVTf78mSf IqMPbY/PssTExzna2d/ZKtxmKQ+HF6lN7+JaTuqbQsqQtxwh4eXsswezY/UKX7j+ FdAuwz6W7ybO8UC3i4fvA1jFR60Hi5bBBoSXu+0hb/UGQQYXc+gRtB8DxqeUEgeb oVByGwrBWLRSLuBbQVPl3PRTNgwpBn41DkyIpXS6DK6thzdAqZ8zM15vFTYbWanF W/skdubkgIdhREunXMpLfZD5h0b/pXIYxzA59x//hZ+W8qvTuaabICja18x3SnqY hytoIQMo5d1UytyngfcHvXe79DCi5xFAqqVj0AqYR+CDdt5n9vrXgavX76hrYeHe /yYUHZ0h5A564zMXITMmUWDWvgmvTs1Q2hiV+rmtWPJ1bpOpPFxUosIWEBl4edA3 vKlGoh/CtU9NJp/E/7nYdUV1vXlLJaareGQ/UXpFBl4NYsnNe4w= =Vt4X -----END PGP SIGNATURE----- Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Pull bpf fixes from Alexei Starovoitov: - Fix incorrect usage of BPF_TRAMP_F_ORIG_STACK in riscv JIT (Menglong Dong) - Fix reference count leak in bpf_prog_test_run_xdp() (Tetsuo Handa) - Fix metadata size check in bpf_test_run() (Toke Høiland-Jørgensen) - Check that BPF insn array is not allowed as a map for const strings (Deepanshu Kartikey) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: bpf: Fix reference count leak in bpf_prog_test_run_xdp() bpf: Reject BPF_MAP_TYPE_INSN_ARRAY in check_reg_const_str() selftests/bpf: Update xdp_context_test_run test to check maximum metadata size bpf, test_run: Subtract size of xdp_frame from allowed metadata size riscv, bpf: Fix incorrect usage of BPF_TRAMP_F_ORIG_STACK	2026-01-13 21:21:13 -08:00
Anton Protopopov	7e525860e7	bpf: Return EACCES for incorrect access to insn array The insn_array_map_direct_value_addr() function currently returns -EINVAL when the offset within the map is invalid. Change this to return -EACCES, so that it is consistent with similar boundary access checks in the verifier. Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Link: https://lore.kernel.org/r/20260111153047.8388-3-a.s.protopopov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-13 19:36:18 -08:00
Anton Protopopov	e3bd7bdf5f	bpf: Return proper address for non-zero offsets in insn array The map_direct_value_addr() function of the instruction array map incorrectly adds offset to the resulting address. This is a bug, because later the resolve_pseudo_ldimm64() function adds the offset. Fix it. Corresponding selftests are added in a consequent commit. Fixes: `493d9e0d60` ("bpf, x86: add support for indirect jumps") Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Link: https://lore.kernel.org/r/20260111153047.8388-2-a.s.protopopov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-13 19:35:47 -08:00
Matt Bobrowski	f8ade2342e	bpf: return PTR_TO_BTF_ID \| PTR_TRUSTED from BPF kfuncs by default Teach the BPF verifier to treat pointers to struct types returned from BPF kfuncs as implicitly trusted (PTR_TO_BTF_ID \| PTR_TRUSTED) by default. Returning untrusted pointers to struct types from BPF kfuncs should be considered an exception only, and certainly not the norm. Update existing selftests to reflect the change in register type printing (e.g. `ptr_` becoming `trusted_ptr_` in verifier error messages). Link: https://lore.kernel.org/bpf/aV4nbCaMfIoM0awM@google.com/ Signed-off-by: Matt Bobrowski <mattbobrowski@google.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260113083949.2502978-1-mattbobrowski@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-13 19:19:13 -08:00
Donglin Peng	434bcbc837	bpf: Optimize the performance of find_bpffs_btf_enums Currently, vmlinux BTF is unconditionally sorted during the build phase. The function btf_find_by_name_kind executes the binary search branch, so find_bpffs_btf_enums can be optimized by using btf_find_by_name_kind. Signed-off-by: Donglin Peng <pengdonglin@xiaomi.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20260109130003.3313716-10-dolinux.peng@gmail.com	2026-01-13 16:21:36 -08:00
Donglin Peng	dc893cfa39	bpf: Skip anonymous types in type lookup for performance Currently, vmlinux and kernel module BTFs are unconditionally sorted during the build phase, with named types placed at the end. Thus, anonymous types should be skipped when starting the search. In my vmlinux BTF, the number of anonymous types is 61,747, which means the loop count can be reduced by 61,747. Signed-off-by: Donglin Peng <pengdonglin@xiaomi.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20260109130003.3313716-9-dolinux.peng@gmail.com	2026-01-13 16:21:36 -08:00
Donglin Peng	342bf525ba	btf: Verify BTF sorting This patch checks whether the BTF is sorted by name in ascending order. If sorted, binary search will be used when looking up types. Specifically, vmlinux and kernel module BTFs are always sorted during the build phase with anonymous types placed before named types, so we only need to identify the starting ID of named types. Signed-off-by: Donglin Peng <pengdonglin@xiaomi.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20260109130003.3313716-8-dolinux.peng@gmail.com	2026-01-13 16:21:30 -08:00
Donglin Peng	8c3070e159	btf: Optimize type lookup with binary search Improve btf_find_by_name_kind() performance by adding binary search support for sorted types. Falls back to linear search for compatibility. Signed-off-by: Donglin Peng <pengdonglin@xiaomi.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20260109130003.3313716-7-dolinux.peng@gmail.com	2026-01-13 16:20:38 -08:00
Jan H. Schönherr	eebe6446cc	perf/core: Speed up kexec shutdown by avoiding unnecessary cross CPU calls There are typically a lot of PMUs registered, but in many cases only few of them have an event registered (like the "cpu" PMU in the presence of the watchdog). As the mutex is already held, it's safe to just check for existing events before doing the cross CPU call. This change saves tens of milliseconds from kexec time (perceived as steal time during a hypervisor host update), with <2ms remaining for this step in the shutdown. There might be additional potential for parallelization or we could just disable performance monitoring during the actual shutdown and be less graceful about it. Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>	2026-01-13 21:39:01 +01:00
Imran Khan	dd9f6d30c6	genirq/cpuhotplug: Notify about affinity changes breaking the affinity mask During CPU offlining the interrupts affined to that CPU are moved to other online CPUs, which might break the original affinity mask if the outgoing CPU was the last online CPU in that mask. This change is not propagated to irq_desc::affinity_notify(), which leaves users of the affinity notifier mechanism with stale information. Avoid this by scheduling affinity change notification work for interrupts that were affined to the CPU being offlined, if the new target CPU is not part of the original affinity mask. Since irq_set_affinity_locked() uses the same logic to schedule affinity change notification work, split out this logic into a dedicated function and use that at both places. [ tglx: Removed the EXPORT(), removed the !SMP stub, moved the prototype, added a lockdep assert instead of a comment, fixed up coding style and name space. Polished and clarified the change log ] Signed-off-by: Imran Khan <imran.f.khan@oracle.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260113143727.1041265-1-imran.f.khan@oracle.com	2026-01-13 21:18:16 +01:00
Al Viro	47b3b9bf93	simplify the callers of file_open_name() It accepts ERR_PTR() for name and does the right thing in that case. That allows to simplify the logics in callers, making them trivial to switch to CLASS(filename). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2026-01-13 15:18:08 -05:00
Al Viro	741c97fecb	struct filename ->refcnt doesn't need to be atomic ... or visible outside of audit, really. Note that references held in delayed_filename always have refcount 1, and from the moment of complete_getname() or equivalent point in getname...() there won't be any references to struct filename instance left in places visible to other threads. Acked-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2026-01-13 15:18:07 -05:00
Al Viro	41670a5900	get rid of audit_reusename() Originally we tried to avoid multiple insertions into audit names array during retry loop by a cute hack - memorize the userland pointer and if there already is a match, just grab an extra reference to it. Cute as it had been, it had problems - two identical pointers had audit aux entries merged, two identical strings did not. Having different behaviour for syscalls that differ only by addresses of otherwise identical string arguments is obviously wrong - if nothing else, compiler can decide to merge identical string literals. Besides, this hack does nothing for non-audited processes - they get a fresh copy for retry. It's not time-critical, but having behaviour subtly differ that way is bogus. These days we have very few places that import filename more than once (9 functions total) and it's easy to massage them so we get rid of all re-imports. With that done, we don't need audit_reusename() anymore. There's no need to memorize userland pointer either. Acked-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2026-01-13 15:16:44 -05:00
Song Chen	c9c9f6bf7f	bpf: Remove an unused parameter in check_func_proto The func_id parameter is not needed in check_func_proto. This patch removes it. Signed-off-by: Song Chen <chensong_2000@189.cn> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20260105155009.4581-1-chensong_2000@189.cn	2026-01-13 10:00:15 -08:00
Alexei Starovoitov	bffacdb80b	bpf: Recognize special arithmetic shift in the verifier cilium bpf_wiregard.bpf.c when compiled with -O1 fails to load with the following verifier log: 192: (79) r2 = (u64 )(r10 -304) ; R2=pkt(r=40) R10=fp0 fp-304=pkt(r=40) ... 227: (85) call bpf_skb_store_bytes#9 ; R0=scalar() 228: (bc) w2 = w0 ; R0=scalar() R2=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff)) 229: (c4) w2 s>>= 31 ; R2=scalar(smin=0,smax=umax=0xffffffff,smin32=-1,smax32=0,var_off=(0x0; 0xffffffff)) 230: (54) w2 &= -134 ; R2=scalar(smin=0,smax=umax=umax32=0xffffff7a,smax32=0x7fffff7a,var_off=(0x0; 0xffffff7a)) ... 232: (66) if w2 s> 0xffffffff goto pc+125 ; R2=scalar(smin=umin=umin32=0x80000000,smax=umax=umax32=0xffffff7a,smax32=-134,var_off=(0x80000000; 0x7fffff7a)) ... 238: (79) r4 = (u64 )(r10 -304) ; R4=scalar() R10=fp0 fp-304=scalar() 239: (56) if w2 != 0xffffff78 goto pc+210 ; R2=0xffffff78 // -136 ... 258: (71) r1 = (u8 )(r4 +0) R4 invalid mem access 'scalar' The error might confuse most bpf authors, since fp-304 slot had 'pkt' pointer at insn 192 and became 'scalar' at 238. That happened because bpf_skb_store_bytes() clears all packet pointers including those in the stack. On the first glance it might look like a bug in the source code, since ctx->data pointer should have been reloaded after the call to bpf_skb_store_bytes(). The relevant part of cilium source code looks like this: // bpf/lib/nodeport.h int dsr_set_ipip6() { if (ctx_adjust_hroom(...)) return DROP_INVALID; // -134 if (ctx_store_bytes(...)) return DROP_WRITE_ERROR; // -141 return 0; } bool dsr_fail_needs_reply(int code) { if (code == DROP_FRAG_NEEDED) // -136 return true; return false; } tail_nodeport_ipv6_dsr() { ret = dsr_set_ipip6(...); if (!IS_ERR(ret)) { ... } else { if (dsr_fail_needs_reply(ret)) return dsr_reply_icmp6(...); } } The code doesn't have arithmetic shift by 31 and it reloads ctx->data every time it needs to access it. So it's not a bug in the source code. The reason is DAGCombiner::foldSelectCCToShiftAnd() LLVM transformation: // If this is a select where the false operand is zero and the compare is a // check of the sign bit, see if we can perform the "gzip trick": // select_cc setlt X, 0, A, 0 -> and (sra X, size(X)-1), A // select_cc setgt X, 0, A, 0 -> and (not (sra X, size(X)-1)), A The conditional branch in dsr_set_ipip6() and its return values are optimized into BPF_ARSH plus BPF_AND: 227: (85) call bpf_skb_store_bytes#9 228: (bc) w2 = w0 229: (c4) w2 s>>= 31 ; R2=scalar(smin=0,smax=umax=0xffffffff,smin32=-1,smax32=0,var_off=(0x0; 0xffffffff)) 230: (54) w2 &= -134 ; R2=scalar(smin=0,smax=umax=umax32=0xffffff7a,smax32=0x7fffff7a,var_off=(0x0; 0xffffff7a)) after insn 230 the register w2 can only be 0 or -134, but the verifier approximates it, since there is no way to represent two scalars in bpf_reg_state. After fallthough at insn 232 the w2 can only be -134, hence the branch at insn 239: (56) if w2 != -136 goto pc+210 should be always taken, and trapping insn 258 should never execute. LLVM generated correct code, but the verifier follows impossible path and rejects valid program. To fix this issue recognize this special LLVM optimization and fork the verifier state. So after insn 229: (c4) w2 s>>= 31 the verifier has two states to explore: one with w2 = 0 and another with w2 = 0xffffffff which makes the verifier accept bpf_wiregard.c A similar pattern exists were OR operation is used in place of the AND operation, the verifier detects that pattern as well by forking the state before the OR operation with a scalar in range [-1,0]. Note there are 20+ such patterns in bpf_wiregard.o compiled with -O1 and -O2, but they're rarely seen in other production bpf programs, so push_stack() approach is not a concern. Reported-by: Hao Sun <sunhao.th@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Co-developed-by: Puranjay Mohan <puranjay@kernel.org> Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Link: https://lore.kernel.org/r/20260112201424.816836-2-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-13 09:33:38 -08:00
Mykyta Yatsenko	7af3339948	bpf: Consistently use reg_state() for register access in the verifier Replace the pattern of declaring a local regs array from cur_regs() and then indexing into it with the more concise reg_state() helper. This simplifies the code by eliminating intermediate variables and makes register access more consistent throughout the verifier. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20260113134826.2214860-1-mykyta.yatsenko5@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-13 09:31:17 -08:00
Thomas Gleixner	3db5306b0b	time/sched_clock: Use ACCESS_PRIVATE() to evaluate hrtimer::function This dereference of sched_clock_timer::function was missed when the hrtimer callback function pointer was marked private. Fixes: `04257da0c9` ("hrtimers: Make callback function pointer private") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/875x95jw7q.ffs@tglx Closes: https://lore.kernel.org/oe-kbuild-all/202601131713.KsxhXQ0M-lkp@intel.com/	2026-01-13 18:08:57 +01:00
Gabriele Monaco	6c125b85f3	sched: Export hidden tracepoints to modules The tracepoints sched_entry, sched_exit and sched_set_need_resched are not exported to tracefs as trace events, this allows only kernel code to access them. Helper modules like [1] can be used to still have the tracepoints available to ftrace for debugging purposes, but they do rely on the tracepoints being exported. Export the 3 not exported tracepoints. Note that sched_set_state is already exported as the macro is called from modules. [1] - https://github.com/qais-yousef/sched_tp.git Fixes: `adcc3bfa88` ("sched: Adapt sched tracepoints for RV task model") Signed-off-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Phil Auld <pauld@redhat.com> Link: https://patch.msgid.link/20251205131621.135513-9-gmonaco@redhat.com	2026-01-13 11:37:53 +01:00
Gabriele Monaco	ca1e8eede4	sched/deadline: Fix server stopping with runnable tasks The deadline server can currently stop due to idle although fair tasks are runnable. This happens essentially when: * the server is set to idle, a task wakes up, the server stops * a task wakes up, the server sets itself to idle and stops right away Address both cases by clearing the server idle flag whenever a fair task wakes up and accounting also for pending tasks in the definition of idle. Fixes: `f5a538c07d` ("sched/deadline: Fix dl_server stop condition") Signed-off-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260113085159.114226-3-gmonaco@redhat.com	2026-01-13 11:37:52 +01:00
Peter Zijlstra	1e0a2ba7af	sched: Provide idle_rq() helper A fix for the dl_server 'requires' idle_cpu() usage, which made me note that it and available_idle_cpu() are extern function calls. And while idle_cpu() is used outside of kernel/sched/, available_idle_cpu() is not. This makes it hard to make idle_cpu() an inline helper, so provide idle_rq() and implement idle_cpu() and available_idle_cpu() using that. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>	2026-01-13 11:37:52 +01:00
Pingfan Liu	64e6fa7661	sched/deadline: Fix potential race in dl_add_task_root_domain() The access rule for local_cpu_mask_dl requires it to be called on the local CPU with preemption disabled. However, dl_add_task_root_domain() currently violates this rule. Without preemption disabled, the following race can occur: 1. ThreadA calls dl_add_task_root_domain() on CPU 0 2. Gets pointer to CPU 0's local_cpu_mask_dl 3. ThreadA is preempted and migrated to CPU 1 4. ThreadA continues using CPU 0's local_cpu_mask_dl 5. Meanwhile, the scheduler on CPU 0 calls find_later_rq() which also uses local_cpu_mask_dl (with preemption properly disabled) 6. Both contexts now corrupt the same per-CPU buffer concurrently Fix this by moving the local_cpu_mask_dl access to the preemption disabled section. Closes: https://lore.kernel.org/lkml/aSBjm3mN_uIy64nz@jlelli-thinkpadt14gen4.remote.csb Fixes: `318e18ed22` ("sched/deadline: Walk up cpuset hierarchy to decide root domain when hot-unplug") Reported-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Pingfan Liu <piliu@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Acked-by: Waiman Long <longman@redhat.com> Link: https://patch.msgid.link/20251125032630.8746-3-piliu@redhat.com	2026-01-13 11:37:51 +01:00
Pingfan Liu	479972efc2	sched/deadline: Remove unnecessary comment in dl_add_task_root_domain() The comments above dl_get_task_effective_cpus() and dl_add_task_root_domain() already explain how to fetch a valid root domain and protect against races. There's no need to repeat this inside dl_add_task_root_domain(). Remove the redundant comment to keep the code clean. No functional change. Signed-off-by: Pingfan Liu <piliu@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Acked-by: Waiman Long <longman@redhat.com> Link: https://patch.msgid.link/20251125032630.8746-2-piliu@redhat.com	2026-01-13 11:37:51 +01:00
Thomas Weißschuh	ae4535b0d9	hrtimer: Drop _tv64() helpers Since ktime_t has become an alias to s64, these helpers are unnecessary. Migrate the few remaining users to the regular helpers and remove the now dead code. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260107-hrtimer-header-cleanup-v1-3-1a698ef0ddae@linutronix.de	2026-01-13 11:05:49 +01:00
Thomas Weißschuh	84663a5ad6	hrtimer: Remove public definition of HIGH_RES_NSEC This constant is only used in a single place and is has a very generic name polluting the global namespace. Move the constant closer to its only user. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260107-hrtimer-header-cleanup-v1-2-1a698ef0ddae@linutronix.de	2026-01-13 11:05:48 +01:00
Thomas Weißschuh	05dc4a9fc8	hrtimer: Fix softirq base check in update_needs_ipi() The 'clockid' field is not the correct way to check for a softirq base. Fix the check to correctly compare the base type instead of the clockid. Fixes: `1e7f7fbcd4` ("hrtimer: Avoid more SMP function calls in clock_was_set()") Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260107-hrtimer-clock-base-check-v1-1-afb5dbce94a1@linutronix.de	2026-01-13 11:04:41 +01:00
Luigi Rizzo	fb11a2493e	genirq: Move clear of kstat_irqs to free_desc() desc_set_defaults() has a loop to clear the per-cpu counters kstats_irq. This is only needed in free_desc(), which is used with non-sparse IRQs so that the interrupt descriptor can be recycled. For newly allocated descriptors, the memory comes from alloc_percpu() and is already zeroed out. Move the loop to free_desc() to avoid wasting time unnecessarily. Signed-off-by: Luigi Rizzo <lrizzo@google.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260112083234.2665832-1-lrizzo@google.com	2026-01-13 10:16:29 +01:00
Radu Rendec	df439718af	genirq: Update effective affinity for redirected interrupts For redirected interrupts, irq_chip_redirect_set_affinity() does not update the effective affinity mask, which then triggers the warning in irq_validate_effective_affinity(). Also, because the effective affinity mask is empty, the cpumask_test_cpu(smp_processor_id(), m) condition in demux_redirect_remote() is always false, and the interrupt is always redirected, even if it's already running on the target CPU. Set the effective affinity mask to be the same as the requested affinity mask. It's worth noting that irq_do_set_affinity() filters out offline CPUs before calling chip->irq_set_affinity() (unless `force` is set), so the mask passed to irq_chip_redirect_set_affinity() is already filtered. The solution is not ideal because it may lie about the effective affinity of the demultiplexed ("child") interrupt. If the requested affinity mask includes multiple CPUs, the effective affinity, in reality, is the intersection between the requested mask and the demultiplexing ("parent") interrupt's effective affinity mask, plus the first CPU in the requested mask. Accurately describing the effective affinity of the demultiplexed interrupt is not trivial because it requires keeping track of the demultiplexing interrupt's effective affinity. That is tricky in the context of CPU hot(un)plugging, where interrupt migration ordering is not guaranteed. The solution in the initial version of the fixed patch, which stored the first CPU of the demultiplexing interrupt's effective affinity in the `target_cpu` field, has its own drawbacks and limitations. Fixes: `fcc1d0dabd` ("genirq: Add interrupt redirection infrastructure") Reported-by: Jon Hunter <jonathanh@nvidia.com> Signed-off-by: Radu Rendec <rrendec@redhat.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: Jon Hunter <jonathanh@nvidia.com> Link: https://patch.msgid.link/20260112211402.2927336-1-rrendec@redhat.com Closes: https://lore.kernel.org/all/44509520-f29b-4b8a-8986-5eae3e022eb7@nvidia.com/	2026-01-13 09:59:28 +01:00
Sebastian Andrzej Siewior	aef30c8d56	genirq: Warn about using IRQF_ONESHOT without a threaded handler IRQF_ONESHOT disables the interrupt source until after the threaded handler completed its work. This is needed to allow the threaded handler to run - otherwise the CPU will get back to the interrupt handler because the interrupt source remains active and the threaded handler will not able to do its work. Specifying IRQF_ONESHOT without a threaded handler does not make sense. It could be a leftover if the handler _was_ threaded and changed back to primary and the flag was not removed. This can be problematic in the `threadirqs' case because the handler is exempt from forced-threading. This in turn can become a problem on a PREEMPT_RT system if the handler attempts to acquire sleeping locks. Warn about missing threaded handlers with the IRQF_ONESHOT flag. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Reviewed-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com> Link: https://patch.msgid.link/20260112134013.eQWyReHR@linutronix.de	2026-01-13 09:56:25 +01:00
Sami Tolvanen	99fde4d062	bpf, btf: Enforce destructor kfunc type with CFI Ensure that registered destructor kfuncs have the same type as btf_dtor_kfunc_t to avoid a kernel panic on systems with CONFIG_CFI enabled. Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20260110082548.113748-10-samitolvanen@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-12 18:53:57 -08:00
Sami Tolvanen	b40a5d724f	bpf: crypto: Use the correct destructor kfunc type With CONFIG_CFI enabled, the kernel strictly enforces that indirect function calls use a function pointer type that matches the target function. I ran into the following type mismatch when running BPF self-tests: CFI failure at bpf_obj_free_fields+0x190/0x238 (target: bpf_crypto_ctx_release+0x0/0x94; expected type: 0xa488ebfc) Internal error: Oops - CFI: 00000000f2008228 [#1] SMP ... As bpf_crypto_ctx_release() is also used in BPF programs and using a void pointer as the argument would make the verifier unhappy, add a simple stub function with the correct type and register it as the destructor kfunc instead. Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Tested-by: Viktor Malik <vmalik@redhat.com> Link: https://lore.kernel.org/r/20260110082548.113748-7-samitolvanen@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-12 18:53:57 -08:00
Linus Torvalds	b71e635fee	cgroup: Fixes for v6.19-rc5 - Fix -Wflex-array-member-not-at-end warnings in cgroup_root. -----BEGIN PGP SIGNATURE----- iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaWVQhw4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGasOAQDmAej2XKpvd7zHxExKJ8xLYUc5dkqg00FLq86r iyikqAD9HYZNcTYE0pvqkgzpYbypQnMWs5F65tSjqJTyJNnL/Qw= =vI2H -----END PGP SIGNATURE----- Merge tag 'cgroup-for-6.19-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fix from Tejun Heo: - Fix -Wflex-array-member-not-at-end warnings in cgroup_root * tag 'cgroup-for-6.19-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: Eliminate cgrp_ancestor_storage in cgroup_root	2026-01-12 09:56:17 -10:00
Zhao Mengmeng	090e0ae303	cpuset: replace direct lockdep_assert_held() with lockdep_assert_cpuset_lock_held() We already added lockdep_assert_cpuset_lock_held(), use this new function to keep consistency. Signed-off-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-01-12 09:27:58 -10:00
Waiman Long	272bd81833	cgroup/cpuset: Move the v1 empty cpus/mems check to cpuset1_validate_change() As stated in commit `1c09b195d3` ("cpuset: fix a regression in validating config change"), it is not allowed to clear masks of a cpuset if there're tasks in it. This is specific to v1 since empty "cpuset.cpus" or "cpuset.mems" will cause the v2 cpuset to inherit the effective CPUs or memory nodes from its parent. So it is OK to have empty cpus or mems even if there are tasks in the cpuset. Move this empty cpus/mems check in validate_change() to cpuset1_validate_change() to allow more flexibility in setting cpus or mems in v2. cpuset_is_populated() needs to be moved into cpuset-internal.h as it is needed by the empty cpus/mems checking code. Also add a test case to test_cpuset_prs.sh to verify that. Reported-by: Chen Ridong <chenridong@huaweicloud.com> Closes: https://lore.kernel.org/lkml/7a3ec392-2e86-4693-aa9f-1e668a668b9c@huaweicloud.com/ Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-01-12 09:03:22 -10:00
Waiman Long	2a3602030d	cgroup/cpuset: Don't invalidate sibling partitions on cpuset.cpus conflict Currently, when setting a cpuset's cpuset.cpus to a value that conflicts with the cpuset.cpus/cpuset.cpus.exclusive of a sibling partition, the sibling's partition state becomes invalid. This is overly harsh and is probably not necessary. The cpuset.cpus.exclusive control file, if set, will override the cpuset.cpus of the same cpuset when creating a cpuset partition. So cpuset.cpus has less priority than cpuset.cpus.exclusive in setting up a partition. However, it cannot override a conflicting cpuset.cpus file in a sibling cpuset and the partition creation process will fail. This is inconsistent. That will also make using cpuset.cpus.exclusive less valuable as a tool to set up cpuset partitions as the users have to check if such a cpuset.cpus conflict exists or not. Fix these problems by making sure that once a cpuset.cpus.exclusive is set without failure, it will always be allowed to form a valid partition as long as at least one CPU can be granted from its parent irrespective of the state of the siblings' cpuset.cpus values. Of course, setting cpuset.cpus.exclusive will fail if it conflicts with the cpuset.cpus.exclusive or the cpuset.cpus.exclusive.effective value of a sibling. Partition can still be created by setting only cpuset.cpus without setting cpuset.cpus.exclusive. However, any conflicting CPUs in sibling's cpuset.cpus.exclusive.effective and cpuset.cpus.exclusive values will be removed from its cpuset.cpus.exclusive.effective as long as there is still one or more CPUs left and can be granted from its parent. This CPU stripping is currently done in rm_siblings_excl_cpus(). The new code will now try its best to enable the creation of new partitions with only cpuset.cpus set without invalidating existing ones. However it is not guaranteed that all the CPUs requested in cpuset.cpus will be used in the new partition even when all these CPUs can be granted from the parent. This is similar to the fact that cpuset.cpus.effective may not be able to include all the CPUs requested in cpuset.cpus. In this case, the parent may not able to grant all the exclusive CPUs requested in cpuset.cpus to cpuset.cpus.exclusive.effective if some of them have already been granted to other partitions earlier. With the creation of multiple sibling partitions by setting only cpuset.cpus, this does have the side effect that their exact cpuset.cpus.exclusive.effective settings will depend on the order of partition creation if there are conflicts. Due to the exclusive nature of the CPUs in a partition, it is not easy to make it fair other than the old behavior of invalidating all the conflicting partitions. For example, # echo "0-2" > A1/cpuset.cpus # echo "root" > A1/cpuset.cpus.partition # cat A1/cpuset.cpus.partition root # cat A1/cpuset.cpus.exclusive.effective 0-2 # echo "2-4" > B1/cpuset.cpus # echo "root" > B1/cpuset.cpus.partition # cat B1/cpuset.cpus.partition root # cat B1/cpuset.cpus.exclusive.effective 3-4 # cat B1/cpuset.cpus.effective 3-4 For users who want to be sure that they can get most of the CPUs they want, cpuset.cpus.exclusive should be used instead if they can set it successfully without failure. Setting cpuset.cpus.exclusive will guarantee that sibling conflicts from then onward is no longer possible. To make this change, we have to separate out the is_cpu_exclusive() check in cpus_excl_conflict() into a cgroup v1 only cpuset1_cpus_excl_conflict() helper. The cpus_allowed_validate_change() helper is now no longer needed and can be removed. Some existing tests in test_cpuset_prs.sh are updated and new ones are added to reflect the new behavior. The cgroup-v2.rst doc file is also updated the clarify what exclusive CPUs will be used when a partition is created. Reported-by: Sun Shaojie <sunshaojie@kylinos.cn> Closes: https://lore.kernel.org/lkml/20251117015708.977585-1-sunshaojie@kylinos.cn/ Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-01-12 09:02:46 -10:00
Waiman Long	6e6f13f6d5	cgroup/cpuset: Don't fail cpuset.cpus change in v2 Commit `fe8cd2736e` ("cgroup/cpuset: Delay setting of CS_CPU_EXCLUSIVE until valid partition") introduced a new check to disallow the setting of a new cpuset.cpus.exclusive value that is a superset of a sibling's cpuset.cpus value so that there will at least be one CPU left in the sibling in case the cpuset becomes a valid partition root. This new check does have the side effect of failing a cpuset.cpus change that make it a subset of a sibling's cpuset.cpus.exclusive value. With v2, users are supposed to be allowed to set whatever value they want in cpuset.cpus without failure. To maintain this rule, the check is now restricted to only when cpuset.cpus.exclusive is being changed not when cpuset.cpus is changed. The cgroup-v2.rst doc file is also updated to reflect this change. Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-01-12 09:02:14 -10:00
Waiman Long	a1a01793ae	cgroup/cpuset: Consistently compute effective_xcpus in update_cpumasks_hier() Since commit `f62a5d3936` ("cgroup/cpuset: Remove remote_partition_check() & make update_cpumasks_hier() handle remote partition"), the compute_effective_exclusive_cpumask() helper was extended to strip exclusive CPUs from siblings when computing effective_xcpus (cpuset.cpus.exclusive.effective). This helper was later renamed to compute_excpus() in commit `86bbbd1f33` ("cpuset: Refactor exclusive CPU mask computation logic"). This helper is supposed to be used consistently to compute effective_xcpus. However, there is an exception within the callback critical section in update_cpumasks_hier() when exclusive_cpus of a valid partition root is empty. This can cause effective_xcpus value to differ depending on where exactly it is last computed. Fix this by using compute_excpus() in this case to give a consistent result. Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-01-12 09:01:44 -10:00
Waiman Long	18bc2425a8	cgroup/cpuset: Streamline rm_siblings_excl_cpus() If exclusive_cpus is set, effective_xcpus must be a subset of exclusive_cpus. Currently, rm_siblings_excl_cpus() checks both exclusive_cpus and effective_xcpus consecutively. It is simpler to check only exclusive_cpus if non-empty or just effective_xcpus otherwise. No functional change is expected. Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-01-12 09:01:40 -10:00
Rafael J. Wysocki	2a7151942e	Merge back material related to system sleep for 6.20	2026-01-12 19:37:19 +01:00
Juergen Gross	e6b2aa6d40	sched: Move clock related paravirt code to kernel/sched Paravirt clock related functions are available in multiple archs. In order to share the common parts, move the common static keys to kernel/sched/ and remove them from the arch specific files. Make a common paravirt_steal_clock() implementation available in kernel/sched/cputime.c, guarding it with a new config option CONFIG_HAVE_PV_STEAL_CLOCK_GEN, which can be selected by an arch in case it wants to use that common variant. Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260105110520.21356-7-jgross@suse.com	2026-01-12 15:39:14 +01:00
Juergen Gross	68b10fd40d	paravirt: Remove asm/paravirt_api_clock.h All architectures supporting CONFIG_PARAVIRT share the same contents of asm/paravirt_api_clock.h: #include <asm/paravirt.h> So remove all incarnations of asm/paravirt_api_clock.h and remove the only place where it is included, as there asm/paravirt.h is included anyway. Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com> # powerpc, scheduler bits Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260105110520.21356-6-jgross@suse.com	2026-01-12 15:34:33 +01:00
Gabriele Monaco	3fee5b320c	verification/rvgen: Remove unused variable declaration from containers The monitor container source files contained a declaration and a definition for the rv_monitor variable. The former is superfluous and can be removed. Remove the variable declaration from the template as well as the existing monitor containers. Reviewed-by: Nam Cao <namcao@linutronix.de> Link: https://lore.kernel.org/r/20251126104241.291258-9-gmonaco@redhat.com Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>	2026-01-12 07:43:51 +01:00
Gabriele Monaco	3d2bfeeef3	verification/dot2c: Remove superfluous enum assignment and add last comma The header files generated by dot2c currently create enums for states and events assigning the first element to 0. This is superfluous as it happens automatically if no value is specified. Also it doesn't add a comma to the last enum elements, which slightly complicates the diff if states or events are added. Remove the assignment to 0 and add a comma to last elements, this simplifies the logic for the code generator. Reviewed-by: Nam Cao <namcao@linutronix.de> Link: https://lore.kernel.org/r/20251126104241.291258-8-gmonaco@redhat.com Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>	2026-01-12 07:43:50 +01:00
Gabriele Monaco	30984ccf31	rv: Refactor da_monitor to minimise macros The da_monitor helper functions are generated from macros of the type: DECLARE_DA_FUNCTION(name, type) \ static void da_func_x_##name(type arg) {} \ static void da_func_y_##name(type arg) {} \ This is good to minimise code duplication but the long macros made of skipped end of lines is rather hard to parse. Since functions are static, the advantage of naming them differently for each monitor is minimal. Refactor the da_monitor.h file to minimise macros, instead of declaring functions from macros, we simply declare them with the same name for all monitors (e.g. da_func_x) and for any remaining reference to the monitor name (e.g. tracepoints, enums, global variables) we use the CONCATENATE macro. In this way the file is much easier to maintain while keeping the same generality. Functions depending on the monitor types are now conditionally compiled according to the value of RV_MON_TYPE, which must be defined in the monitor source. The monitor type can be specified as in the original implementation, although it's best to keep the default implementation (unsigned char) as not all parts of code support larger data types, and likely there's no need. We keep the empty macro definitions to ease review of this change with diff tools, but cleanup is required. Also adapt existing monitors to keep the build working. Reviewed-by: Nam Cao <namcao@linutronix.de> Link: https://lore.kernel.org/r/20251126104241.291258-2-gmonaco@redhat.com Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>	2026-01-12 07:43:49 +01:00
Linus Torvalds	fac4bdbaca	Fix a crash in sched_mm_cid_after_execve(). Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmljevkRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1iKTw//Y9jfjB5VrESjnQumkJeV9A6v/ueVMEhF lA6ktDOKJwUjIBqqTxDuMmbuTUVg1AEmfUX8GCv2DqS2ngXsbcJmR++3E/S9vWz9 QvChizDNrw03fq7WDryPK/0Pk04VJsqFcpR89IlMPAOkygVZ06T0+JTSo9BVqMYb xs4B1TlP6JUz0AtIRKqIGVVt57Zx8kNoGHHhQIJ6ziZA30Srh1/n1Y+B3iJxRklO EalGmNQECVobjyYaGqNB1Z73fMmJIgOkZeUQF6TB+KWYPc9DelRTUS2JH9bBhh4O h6Pau9y45YHu5j5qlQxhnw8CmGVjPHYhMAUsuxje1AK4mvgg+ki4lqqWSjhtiuFr 0TX4+Z1ZSF1AB0dD0CtunqZ1K++202LY0uzQ2sCHhS3hNzWOP6vYR+qvZ0fLBfxI trWsz9KUrM07POww6S3nNrGKqyEUwZbmaywX0WOTwQzPhCsU9gqwK/SwSoYEkzhs TEG2xjHeVmqRmJLK7WBKXxBwz6e/hd1YiCi+hSY7PYnCvly6vZOKZA8CuKueF495 Th+IUX5obCNxAy0ZKrOclpGrfE0vl7JBpye6pnYDCvtCKLmF29rOnNDBo1SZvwsh xlu61B7ezdrNp9JneErUZbJRrQL+PQlCem3MF5UmMgRUfqwUP1s1tOjooMOfPE6o Jv9UIQO8MGs= =jEMw -----END PGP SIGNATURE----- Merge tag 'sched-urgent-2026-01-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Ingo Molnar: "Fix a crash in sched_mm_cid_after_execve()" * tag 'sched-urgent-2026-01-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/mm_cid: Prevent NULL mm dereference in sched_mm_cid_after_execve()	2026-01-11 07:11:53 -10:00
Linus Torvalds	fe948326e9	Fix perf swevent hrtimer deinit regression. Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmljeiYRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1g9XhAAthSFL6Tl+fBLFYjwAFCeGUyfB5lqewT0 z6Sf7s+TOrOCn/3zP4JoDbuV+golsazcKk/9e3VnzccS9F9VdryML1z01O8mz4ub yHTX53mWRu98d+o+B+IwVOlxW+yeyydyHXQu3Be5mStfdla3i6K95Xu314OQKH4F zSDO1X67zSDdUvT8wSDlhXh9tM3Pd+kFDAEzGAY7n/n/6PJMflSVid/qCOVEInaV WRZoHhZnQUNBuorPLKwhVcduHMS0fh1E5lFfN3WdiQCRBl8P1YXutN4KigS95M7x ymlNhzuQwh7doCursDL/Rbc9hIRviZsMSTtmHUMWf/rlthvtbh4knct+noNSF+bZ VTy6SAHNq0mgvQM1vBmWKqUIgy1ekpfainUeUXJ8cKNoKnUIo9+Y7cH1PwBCvfIx R1vgK9qb2SpfqisNBxMzCiovfXQ657ZNNAdRzWOUErCxDKalo+rZ+e5a5etetzVY YMAJfsjnwmiPDwXhWjBUqYS5gGbh9cHE5KGB6qI6UJQ3odxd3H6/nRLEcyweMMs2 kxFqHaKNnIuwsGl+VAlgwTuUbaR7feVzcvV7yGFnTMclP7cXjVgJRSpXcbMd4FrQ sB03TuqBIpMAVrZ/kbBWJh43BPn+nH9dXlTGaWv+AAe5uq4a3two7u4XIibnoICg MG6uXO1xWtk= =XNWD -----END PGP SIGNATURE----- Merge tag 'perf-urgent-2026-01-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf event fix from Ingo Molnar: "Fix perf swevent hrtimer deinit regression" * tag 'perf-urgent-2026-01-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf: Ensure swevent hrtimer is properly destroyed	2026-01-11 06:55:27 -10:00
Thomas Gleixner	2e4b28c48f	treewide: Update email address In a vain attempt to consolidate the email zoo switch everything to the kernel.org account. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2026-01-11 06:09:11 -10:00
Boqun Feng	fe1d482884	Merge branch 'rcu-misc.20260111a' * rcu-misc.20260111a: rcu: Reduce synchronize_rcu() latency by reporting GP kthread's CPU QS early srcu: Use suitable gfp_flags for the init_srcu_struct_nodes() rcu: Fix rcu_read_unlock() deadloop due to softirq rcutorture: Correctly compute probability to invoke ->exp_current() rcu: Make expedited RCU CPU stall warnings detect stall-end races	2026-01-11 20:15:07 +08:00
Joel Fernandes	bc3705e209	rcu: Reduce synchronize_rcu() latency by reporting GP kthread's CPU QS early The RCU grace period mechanism uses a two-phase FQS (Force Quiescent State) design where the first FQS saves dyntick-idle snapshots and the second FQS compares them. This results in long and unnecessary latency for synchronize_rcu() on idle systems (two FQS waits of ~3ms each with 1000HZ) whenever one FQS wait sufficed. Some investigations showed that the GP kthread's CPU is the holdout CPU a lot of times after the first FQS as - it cannot be detected as "idle" because it's actively running the FQS scan in the GP kthread. Therefore, at the end of rcu_gp_init(), immediately report a quiescent state for the GP kthread's CPU using rcu_qs() + rcu_report_qs_rdp(). The GP kthread cannot be in an RCU read-side critical section while running GP initialization, so this is safe and results in significant latency improvements. The following tests were performed: (1) synchronize_rcu() benchmarking 100 synchronize_rcu() calls with 32 CPUs, 10 runs each (default fqs jiffies settings): Baseline (without fix): \| Run \| Mean \| Min \| Max \| \|-----\|-----------\|----------\|-----------\| \| 1 \| 10.088 ms \| 9.989 ms \| 18.848 ms \| \| 2 \| 10.064 ms \| 9.982 ms \| 16.470 ms \| \| 3 \| 10.051 ms \| 9.988 ms \| 15.113 ms \| \| 4 \| 10.125 ms \| 9.929 ms \| 22.411 ms \| \| 5 \| 8.695 ms \| 5.996 ms \| 15.471 ms \| \| 6 \| 10.157 ms \| 9.977 ms \| 25.723 ms \| \| 7 \| 10.102 ms \| 9.990 ms \| 20.224 ms \| \| 8 \| 8.050 ms \| 5.985 ms \| 10.007 ms \| \| 9 \| 10.059 ms \| 9.978 ms \| 15.934 ms \| \| 10 \| 10.077 ms \| 9.984 ms \| 17.703 ms \| With fix: \| Run \| Mean \| Min \| Max \| \|-----\|----------\|----------\|-----------\| \| 1 \| 6.027 ms \| 5.915 ms \| 8.589 ms \| \| 2 \| 6.032 ms \| 5.984 ms \| 9.241 ms \| \| 3 \| 6.010 ms \| 5.986 ms \| 7.004 ms \| \| 4 \| 6.076 ms \| 5.993 ms \| 10.001 ms \| \| 5 \| 6.084 ms \| 5.893 ms \| 10.250 ms \| \| 6 \| 6.034 ms \| 5.908 ms \| 9.456 ms \| \| 7 \| 6.051 ms \| 5.993 ms \| 10.000 ms \| \| 8 \| 6.057 ms \| 5.941 ms \| 10.001 ms \| \| 9 \| 6.016 ms \| 5.927 ms \| 7.540 ms \| \| 10 \| 6.036 ms \| 5.993 ms \| 9.579 ms \| Summary: - Mean latency: 9.75 ms -> 6.04 ms (38% improvement) - Max latency: 25.72 ms -> 10.25 ms (60% improvement) (2) Bridge setup/teardown latency (Uladzislau Rezki) x86_64 with 64 CPUs, 100 iterations of bridge add/configure/delete: real time 1 - default: 24.221s 2 - this patch: 20.754s (14% faster) 3 - this patch + wake_from_gp: 15.895s (34% faster) 4 - wake_from_gp only: 18.947s (22% faster) Per-synchronize_rcu() latency (in usec): 1 2 3 4 median: 37249.5 31540.5 15765 22480 min: 7881 7918 9803 7857 max: 63651 55639 31861 32040 This patch combined with rcu_normal_wake_from_gp reduces bridge setup/teardown time from 24 seconds to 16 seconds. (3) CPU overhead verification (Uladzislau Rezki) System CPU time across 5 runs showed no measurable increase: default: 1.698s - 1.937s this patch: 1.667s - 1.930s Conclusion: variations are within noise, no CPU overhead regression. (4) rcutorture Tested TREE and SRCU configurations - no regressions. Reviewed-by: "Paul E. McKenney" <paulmck@kernel.org> Tested-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Tested-by: Paul E. McKenney <paulmck@kernel.org> Tested-by: Samir M <samir@linux.ibm.com> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2026-01-11 20:11:15 +08:00
Changwoo Min	380ff27af2	PM: EM: Add dump to get-perf-domains in the EM YNL spec Add dump to get-perf-domains, so that a user can fetch either information about a specific performance domain with do or information about all performance domains with dump. Share the reply format of do and dump using perf-domain-attrs, so remove perf-domains. The YNL spec, autogenerated files, and the do implementation are updated, and the dump implementation is added. Suggested-by: Donald Hunter <donald.hunter@gmail.com> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Signed-off-by: Changwoo Min <changwoo@igalia.com> Link: https://patch.msgid.link/20260108053212.642478-5-changwoo@igalia.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2026-01-09 21:44:46 +01:00
Changwoo Min	d29b900cf4	PM: EM: Change cpus' type from string to u64 array in the EM YNL spec Previously, the cpus attribute was a string format which was a "%*pb" stringification of a bitmap. That is not very consumable for a UAPI, so let’s change it to an u64 array of CPU ids. Suggested-by: Donald Hunter <donald.hunter@gmail.com> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Signed-off-by: Changwoo Min <changwoo@igalia.com> Link: https://patch.msgid.link/20260108053212.642478-4-changwoo@igalia.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2026-01-09 21:44:46 +01:00
Changwoo Min	caa07a815d	PM: EM: Rename em.yaml to dev-energymodel.yaml The EM YNL specification used many acronyms, including ‘em’, ‘pd’, ‘ps’, etc. While the acronyms are short and convenient, they could be confusing. So, let’s spell them out to be more specific. The following changes were made in the spec. Note that the protocol name cannot exceed GENL_NAMSIZ (16). em -> dev-energymodel pds -> perf-domains pd -> perf-domain pd-id -> perf-domain-id pd-table -> perf-table ps -> perf-state get-pds -> get-perf-domains get-pd-table -> get-perf-table pd-created -> perf-domain-created pd-updated -> perf-domain-updated pd-deleted -> perf-domain-deleted In addition. doc strings were added to the spec. based on the comments in energy_model.h. Two flag attributes (perf-state-flags and perf-domain-flags) were added for easily interpreting the bit flags. Finally, the autogenerated files and em_netlink.c were updated accordingly to reflect the name changes. Suggested-by: Donald Hunter <donald.hunter@gmail.com> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Signed-off-by: Changwoo Min <changwoo@igalia.com> Link: https://patch.msgid.link/20260108053212.642478-3-changwoo@igalia.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2026-01-09 21:44:46 +01:00
Linus Torvalds	81c5ffec9e	Power management fix for 6.19-rc5 Fix a crash in the hibernation image saving code that can be triggered when the given compression algorithm is unavailable (Malaya Kumar Rout) -----BEGIN PGP SIGNATURE----- iQFGBAABCAAwFiEEcM8Aw/RY0dgsiRUR7l+9nS/U47UFAmlhGicSHHJqd0Byand5 c29ja2kubmV0AAoJEO5fvZ0v1OO1q1QH/A7sxNcitVM3I6bdoYyVDYpVnKiDQqTk ofzJ+eEEh2+5iD7iBoqQBVNLDWL3iOtSCk3u8VzhIoJMwvsmaxSQYqGaOMIHNSKx Lfdl+RsMv5LKK05uN9uxLCSIBQJkRhnKI3+AC2TEbSpmLwi0+W15XYHecj1JmyY0 upsrRUq+XGBr2LQDoWiUmRmPay8+lp8zjoHaAorl7Jm4yr7pHV8kzrztrCOttuUx nUtJiCks+Srnm09uIf0ghVgzBTwnmAGYft/xOf+fRqJDy9t91Nf/oJsev28mFCsn qRVn/O4i8xAgba9mKiKWmHvzoqqW8omixDFShweeTdr7Lw4hADxMwn8= =8zHM -----END PGP SIGNATURE----- Merge tag 'pm-6.19-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fix from Rafael Wysocki: "This fixes a crash in the hibernation image saving code that can be triggered when the given compression algorithm is unavailable (Malaya Kumar Rout)" * tag 'pm-6.19-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: PM: hibernate: Fix crash when freeing invalid crypto compressor	2026-01-09 06:18:05 -10:00
Cong Wang	2bdf777410	sched/mm_cid: Prevent NULL mm dereference in sched_mm_cid_after_execve() sched_mm_cid_after_execve() is called in bprm_execve()'s cleanup path even when exec_binprm() fails. For the init task's first execve(), this causes a problem: 1. current->mm is NULL (kernel threads don't have an mm) 2. sched_mm_cid_before_execve() exits early because mm is NULL 3. exec_binprm() fails (e.g., ENOENT for missing script interpreter) 4. sched_mm_cid_after_execve() is called with mm still NULL 5. sched_mm_cid_fork() is called unconditionally, triggering WARN_ON This is easily reproduced by booting with an init that is a shell script (#!/bin/sh) where the interpreter doesn't exist in the initramfs. Fix this by checking if t->mm is NULL before calling sched_mm_cid_fork(), matching the behavior of sched_mm_cid_before_execve() which already handles this case via sched_mm_cid_exit()'s early return. Fixes: `b0c3d51b54` ("sched/mmcid: Provide precomputed maximal value") Signed-off-by: Cong Wang <cwang@multikernel.io> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Will Deacon <will@kernel.org> Link: https://patch.msgid.link/20251223215113.639686-1-xiyou.wangcong@gmail.com	2026-01-09 13:02:57 +01:00
Malaya Kumar Rout	e25348c540	PM: EM: Fix memory leak in em_create_pd() error path When ida_alloc() fails in em_create_pd(), the function returns without freeing the previously allocated 'pd' structure, leading to a memory leak. The 'pd' pointer is allocated either at line 436 (for CPU devices with cpumask) or line 442 (for other devices) using kzalloc(). Additionally, the function incorrectly returns -ENOMEM when ida_alloc() fails, ignoring the actual error code returned by ida_alloc(), which can fail for reasons other than memory exhaustion. Fix both issues by: 1. Freeing the 'pd' structure with kfree() when ida_alloc() fails 2. Returning the actual error code from ida_alloc() instead of -ENOMEM This ensures proper cleanup on the error path and accurate error reporting. Fixes: `cbe5aeedec` ("PM: EM: Assign a unique ID when creating a performance domain") Signed-off-by: Malaya Kumar Rout <mrout@redhat.com> Reviewed-by: Changwoo Min <changwoo@igalia.com> Link: https://patch.msgid.link/20260105103730.65626-1-mrout@redhat.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2026-01-08 16:55:21 +01:00
Peter Zijlstra	7dadeaa6e8	sched: Further restrict the preemption modes The introduction of PREEMPT_LAZY was for multiple reasons: - PREEMPT_RT suffered from over-scheduling, hurting performance compared to !PREEMPT_RT. - the introduction of (more) features that rely on preemption; like folio_zero_user() which can do large memset() without preemption checks. (Xen already had a horrible hack to deal with long running hypercalls) - the endless and uncontrolled sprinkling of cond_resched() -- mostly cargo cult or in response to poor to replicate workloads. By moving to a model that is fundamentally preemptable these things become managable and avoid needing to introduce more horrible hacks. Since this is a requirement; limit PREEMPT_NONE to architectures that do not support preemption at all. Further limit PREEMPT_VOLUNTARY to those architectures that do not yet have PREEMPT_LAZY support (with the eventual goal to make this the empty set and completely remove voluntary preemption and cond_resched() -- notably VOLUNTARY is already limited to !ARCH_NO_PREEMPT.) This leaves up-to-date architectures (arm64, loongarch, powerpc, riscv, s390, x86) with only two preemption models: full and lazy. While Lazy has been the recommended setting for a while, not all distributions have managed to make the switch yet. Force things along. Keep the patch minimal in case of hard to address regressions that might pop up. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://patch.msgid.link/20251219101502.GB1132199@noisy.programming.kicks-ass.net	2026-01-08 12:43:57 +01:00
Blake Jones	89951fc1f8	sched: Reorder some fields in struct rq This colocates some hot fields in "struct rq" to be on the same cache line as others that are often accessed at the same time or in similar ways. Using data from a Google-internal fleet-scale profiler, I found three distinct groups of hot fields in struct rq: - (1) The runqueue lock: __lock. - (2) Those accessed from hot code in pick_next_task_fair(): nr_running, nr_numa_running, nr_preferred_running, ttwu_pending, cpu_capacity, curr, idle. - (3) Those accessed from some other hot codepaths, e.g. update_curr(), update_rq_clock(), and scheduler_tick(): clock_task, clock_pelt, clock, lost_idle_time, clock_update_flags, clock_pelt_idle, clock_idle. The cycles spent on accessing these different groups of fields broke down roughly as follows: - 50% on group (1) (the runqueue lock, always read-write) - 39% on group (2) (load:store ratio around 38:1) - 8% on group (3) (load:store ratio around 5:1) - 3% on all the other fields Most of the fields in group (3) are already in a cache line grouping; this patch just adds "clock" and "clock_update_flags" to that group. The fields in group (2) are scattered across several cache lines; the main effect of this patch is to group them together, on a single line at the beginning of the structure. A few other less performance-critical fields (nr_switches, numa_migrate_on, has_blocked_load, nohz_csd, last_blocked_load_update_tick) were also reordered to reduce holes in the data structure. Since the runqueue lock is acquired from so many different contexts, and is basically always accessed using an atomic operation, putting it on either of the cache lines for groups (2) or (3) would slow down accesses to those fields dramatically, since those groups are read-mostly accesses. To test this, I wrote a focused load test that would put load on the pick_next_task_fair() path. A parent process would fork many child processes, and each child would nanosleep() for 1 msec many times in a loop. The load test was monitored with "perf", and I looked at the amount of cycles that were spent with sched_balance_rq() on the stack. The test was reliably spending ~5% of all of its cycles there. I ran it 100 times on a pair of 2-socket Intel Haswell machines (72 vCPUs per machine) - one running the tip of sched/core, the other running this change - using 360 child processes and 8192 1-msec sleeps per child. The mean cycle count dropped from 5.14B to 4.91B, or a 4.6% decrease in relevant scheduler cycles. Given that this change reduces cache misses in a very hot kernel codepath, there's likely to be additional application performance improvement due to reduced cache conflicts from kernel data structures. On a Power11 system with 128-byte cache lines, my test showed a ~5% decrease in relevant scheduler cycles, along with a slight increase in user time - both positive indicators. This data comes from https://lore.kernel.org/lkml/affdc6b1-9980-44d1-89db-d90730c1e384@linux.ibm.com/ This is the case even though the additional "____cacheline_aligned" that puts the runqueue lock on the next cache line adds an additional 64 bytes of padding on those machines. This patch does not change the size of "struct rq" on machines with 64-byte cache lines. I also ran "hackbench" to try to test this change, but it didn't show conclusive results. Looking at a CPU cycle profile of the hackbench run, it was spending 95% of its cycles inside __alloc_skb(), __kfree_skb(), or kmem_cache_free() - almost all of which was spent updating memcg counters or contending on the list_lock in kmem_cache_node. And it spent less than 0.5% of its cycles inside either schedule() or try_to_wake_up(). So it's not surprising that it didn't show useful results here. The "__no_randomize_layout" was added to reflect the fact that performance of code that references this data structure is unusually sensitive to placement of its members. Signed-off-by: Blake Jones <blakejones@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com> Reviewed-by: Josh Don <joshdon@google.com> Tested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com> Link: https://patch.msgid.link/20251202023743.1524247-1-blakejones@google.com	2026-01-08 12:43:56 +01:00
Yury Norov (NVIDIA)	55b39b0cf1	sched/fair: Use cpumask_weight_and() in sched_balance_find_dst_group() In the group_has_spare case, the function creates a temporary cpumask to just calculate weight of (p->cpus_ptr & sched_group_span(local)). We've got a dedicated helper for it. Signed-off-by: Yury Norov (NVIDIA) <yury.norov@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Reviewed-by: Fernand Sieber <sieberf@amazon.com> Link: https://patch.msgid.link/20251207034247.402926-1-yury.norov@gmail.com	2026-01-08 12:43:56 +01:00
Yury Norov (NVIDIA)	0ab25ea2a3	sched/fair: Simplify task_numa_find_cpu() Use for_each_cpu_and() and drop some housekeeping code. Signed-off-by: Yury Norov (NVIDIA) <yury.norov@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Reviewed-by: Phil Auld <pauld@redhat.com> Link: https://patch.msgid.link/20251207033037.399608-1-yury.norov@gmail.com	2026-01-08 12:43:56 +01:00
Yury Norov (NVIDIA)	ff1de90dd7	sched/fair: Drop useless cpumask_empty() in find_energy_efficient_cpu() cpumask_empty() call is O(N) and useless because the previous cpumask_and() returns false for empty 'cpus'. Drop it. Signed-off-by: Yury Norov (NVIDIA) <yury.norov@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://patch.msgid.link/20251207040543.407695-1-yury.norov@gmail.com	2026-01-08 12:43:56 +01:00
Deepanshu Kartikey	9df5fad801	bpf: Reject BPF_MAP_TYPE_INSN_ARRAY in check_reg_const_str() BPF_MAP_TYPE_INSN_ARRAY maps store instruction pointers in their ips array, not string data. The map_direct_value_addr callback for this map type returns the address of the ips array, which is not suitable for use as a constant string argument. When a BPF program passes a pointer to an insn_array map value as ARG_PTR_TO_CONST_STR (e.g., to bpf_snprintf), the verifier's null-termination check in check_reg_const_str() operates on the wrong memory region, and at runtime bpf_bprintf_prepare() can read out of bounds searching for a null terminator. Reject BPF_MAP_TYPE_INSN_ARRAY in check_reg_const_str() since this map type is not designed to hold string data. Reported-by: syzbot+2c29addf92581b410079@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=2c29addf92581b410079 Tested-by: syzbot+2c29addf92581b410079@syzkaller.appspotmail.com Fixes: `493d9e0d60` ("bpf, x86: add support for indirect jumps") Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com> Acked-by: Anton Protopopov <a.s.protopopov@gmail.com> Link: https://lore.kernel.org/r/20260107021037.289644-1-kartikey406@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-07 19:03:46 -08:00
Michal Koutný	ef56578274	cgroup: Eliminate cgrp_ancestor_storage in cgroup_root The cgrp_ancestor_storage has two drawbacks: - it's not guaranteed that the member immediately follows struct cgrp in cgroup_root (root cgroup's ancestors[0] might thus point to a padding and not in cgrp_ancestor_storage proper), - this idiom raises warnings with -Wflex-array-member-not-at-end. Instead of relying on the auxiliary member in cgroup_root, define the 0-th level ancestor inside struct cgroup (needed for static allocation of cgrp_dfl_root), deeper cgroups would allocate flexible _low_ancestors[]. Unionized alias through ancestors[] will transparently join the two ranges. The above change would still leave the flexible array at the end of struct cgroup inside cgroup_root, so move cgrp also towards the end of cgroup_root to resolve the -Wflex-array-member-not-at-end. Link: https://lore.kernel.org/r/5fb74444-2fbb-476e-b1bf-3f3e279d0ced@embeddedor.com/ Reported-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Closes: https://lore.kernel.org/r/b3eb050d-9451-4b60-b06c-ace7dab57497@embeddedor.com/ Cc: David Laight <david.laight.linux@gmail.com> Acked-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2026-01-07 15:11:03 -10:00
Robin Murphy	8a840ab056	dma-mapping: Remove dma_mark_clean (again) With IA-64 now gone, there are no users of the dma_mark_clean hook, so we can retire it for good. Signed-off-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/c004927f01962726ff1dcf94d1b4efff84db805a.1767727673.git.robin.murphy@arm.com	2026-01-08 00:19:08 +01:00
Ben Dooks	1e2ed4bfd5	trace: ftrace_dump_on_oops[] is not exported, make it static The ftrace_dump_on_oops string is not used outside of trace.c so make it static to avoid the export warning from sparse: kernel/trace/trace.c:141:6: warning: symbol 'ftrace_dump_on_oops' was not declared. Should it be static? Fixes: `dd293df639` ("tracing: Move trace sysctls into trace.c") Link: https://patch.msgid.link/20260106231054.84270-1-ben.dooks@codethink.co.uk Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2026-01-07 14:52:22 -05:00
Steven Rostedt	5f1ef0dfcb	tracing: Add recursion protection in kernel stack trace recording A bug was reported about an infinite recursion caused by tracing the rcu events with the kernel stack trace trigger enabled. The stack trace code called back into RCU which then called the stack trace again. Expand the ftrace recursion protection to add a set of bits to protect events from recursion. Each bit represents the context that the event is in (normal, softirq, interrupt and NMI). Have the stack trace code use the interrupt context to protect against recursion. Note, the bug showed an issue in both the RCU code as well as the tracing stacktrace code. This only handles the tracing stack trace side of the bug. The RCU fix will be handled separately. Link: https://lore.kernel.org/all/20260102122807.7025fc87@gandalf.local.home/ Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Boqun Feng <boqun.feng@gmail.com> Link: https://patch.msgid.link/20260105203141.515cd49f@gandalf.local.home Reported-by: Yao Kai <yaokai34@huawei.com> Tested-by: Yao Kai <yaokai34@huawei.com> Fixes: `5f5fa7ea89` ("rcu: Don't use negative nesting depth in __rcu_read_unlock()") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2026-01-07 14:52:22 -05:00
Wupeng Ma	6435ffd6c7	ring-buffer: Avoid softlockup in ring_buffer_resize() during memory free When user resize all trace ring buffer through file 'buffer_size_kb', then in ring_buffer_resize(), kernel allocates buffer pages for each cpu in a loop. If the kernel preemption model is PREEMPT_NONE and there are many cpus and there are many buffer pages to be freed, it may not give up cpu for a long time and finally cause a softlockup. To avoid it, call cond_resched() after each cpu buffer free as Commit `f6bd2c9248` ("ring-buffer: Avoid softlockup in ring_buffer_resize()") does. Detailed call trace as follow: rcu: INFO: rcu_sched self-detected stall on CPU rcu: 24-....: (14837 ticks this GP) idle=521c/1/0x4000000000000000 softirq=230597/230597 fqs=5329 rcu: (t=15004 jiffies g=26003221 q=211022 ncpus=96) CPU: 24 UID: 0 PID: 11253 Comm: bash Kdump: loaded Tainted: G EL 6.18.2+ #278 NONE pc : arch_local_irq_restore+0x8/0x20 arch_local_irq_restore+0x8/0x20 (P) free_frozen_page_commit+0x28c/0x3b0 __free_frozen_pages+0x1c0/0x678 ___free_pages+0xc0/0xe0 free_pages+0x3c/0x50 ring_buffer_resize.part.0+0x6a8/0x880 ring_buffer_resize+0x3c/0x58 __tracing_resize_ring_buffer.part.0+0x34/0xd8 tracing_resize_ring_buffer+0x8c/0xd0 tracing_entries_write+0x74/0xd8 vfs_write+0xcc/0x288 ksys_write+0x74/0x118 __arm64_sys_write+0x24/0x38 Cc: <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251228065008.2396573-1-mawupeng1@huawei.com Signed-off-by: Wupeng Ma <mawupeng1@huawei.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2026-01-07 14:52:22 -05:00
Julia Lawall	7cc3fe8e75	tracing: Drop unneeded assignment to soft_mode soft_mode is not read in the enable case, so drop the assignment. Drop also the comment text that refers to the assignment and realign the comment. Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Gabriele Paoloni <gpaoloni@redhat.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251226110531.4129794-1-Julia.Lawall@inria.fr Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2026-01-07 14:52:22 -05:00
Zqiang	cee2557ae3	srcu: Use suitable gfp_flags for the init_srcu_struct_nodes() For use the init_srcu_struct*() to initialized srcu structure, the srcu structure's->srcu_sup and sda use GFP_KERNEL flags to allocate memory. similarly, if set SRCU_SIZING_INIT, the srcu_sup's->node can still use GFP_KERNEL flags to allocate memory, not need to use GFP_ATOMIC flags all the time. Signed-off-by: Zqiang <qiang.zhang@linux.dev> Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com> Tested-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2026-01-07 21:59:41 +08:00
Yao Kai	d41e37f26b	rcu: Fix rcu_read_unlock() deadloop due to softirq Commit `5f5fa7ea89` ("rcu: Don't use negative nesting depth in __rcu_read_unlock()") removes the recursion-protection code from __rcu_read_unlock(). Therefore, we could invoke the deadloop in raise_softirq_irqoff() with ftrace enabled as follows: WARNING: CPU: 0 PID: 0 at kernel/trace/trace.c:3021 __ftrace_trace_stack.constprop.0+0x172/0x180 Modules linked in: my_irq_work(O) CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Tainted: G O 6.18.0-rc7-dirty #23 PREEMPT(full) Tainted: [O]=OOT_MODULE Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 RIP: 0010:__ftrace_trace_stack.constprop.0+0x172/0x180 RSP: 0018:ffffc900000034a8 EFLAGS: 00010002 RAX: 0000000000000000 RBX: 0000000000000004 RCX: 0000000000000000 RDX: 0000000000000003 RSI: ffffffff826d7b87 RDI: ffffffff826e9329 RBP: 0000000000090009 R08: 0000000000000005 R09: ffffffff82afbc4c R10: 0000000000000008 R11: 0000000000011d7a R12: 0000000000000000 R13: ffff888003874100 R14: 0000000000000003 R15: ffff8880038c1054 FS: 0000000000000000(0000) GS:ffff8880fa8ea000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055b31fa7f540 CR3: 00000000078f4005 CR4: 0000000000770ef0 PKRU: 55555554 Call Trace: <IRQ> trace_buffer_unlock_commit_regs+0x6d/0x220 trace_event_buffer_commit+0x5c/0x260 trace_event_raw_event_softirq+0x47/0x80 raise_softirq_irqoff+0x6e/0xa0 rcu_read_unlock_special+0xb1/0x160 unwind_next_frame+0x203/0x9b0 __unwind_start+0x15d/0x1c0 arch_stack_walk+0x62/0xf0 stack_trace_save+0x48/0x70 __ftrace_trace_stack.constprop.0+0x144/0x180 trace_buffer_unlock_commit_regs+0x6d/0x220 trace_event_buffer_commit+0x5c/0x260 trace_event_raw_event_softirq+0x47/0x80 raise_softirq_irqoff+0x6e/0xa0 rcu_read_unlock_special+0xb1/0x160 unwind_next_frame+0x203/0x9b0 __unwind_start+0x15d/0x1c0 arch_stack_walk+0x62/0xf0 stack_trace_save+0x48/0x70 __ftrace_trace_stack.constprop.0+0x144/0x180 trace_buffer_unlock_commit_regs+0x6d/0x220 trace_event_buffer_commit+0x5c/0x260 trace_event_raw_event_softirq+0x47/0x80 raise_softirq_irqoff+0x6e/0xa0 rcu_read_unlock_special+0xb1/0x160 unwind_next_frame+0x203/0x9b0 __unwind_start+0x15d/0x1c0 arch_stack_walk+0x62/0xf0 stack_trace_save+0x48/0x70 __ftrace_trace_stack.constprop.0+0x144/0x180 trace_buffer_unlock_commit_regs+0x6d/0x220 trace_event_buffer_commit+0x5c/0x260 trace_event_raw_event_softirq+0x47/0x80 raise_softirq_irqoff+0x6e/0xa0 rcu_read_unlock_special+0xb1/0x160 __is_insn_slot_addr+0x54/0x70 kernel_text_address+0x48/0xc0 __kernel_text_address+0xd/0x40 unwind_get_return_address+0x1e/0x40 arch_stack_walk+0x9c/0xf0 stack_trace_save+0x48/0x70 __ftrace_trace_stack.constprop.0+0x144/0x180 trace_buffer_unlock_commit_regs+0x6d/0x220 trace_event_buffer_commit+0x5c/0x260 trace_event_raw_event_softirq+0x47/0x80 __raise_softirq_irqoff+0x61/0x80 __flush_smp_call_function_queue+0x115/0x420 __sysvec_call_function_single+0x17/0xb0 sysvec_call_function_single+0x8c/0xc0 </IRQ> Commit `b41642c877` ("rcu: Fix rcu_read_unlock() deadloop due to IRQ work") fixed the infinite loop in rcu_read_unlock_special() for IRQ work by setting a flag before calling irq_work_queue_on(). We fix this issue by setting the same flag before calling raise_softirq_irqoff() and rename the flag to defer_qs_pending for more common. Fixes: `5f5fa7ea89` ("rcu: Don't use negative nesting depth in __rcu_read_unlock()") Reported-by: Tengda Wu <wutengda2@huawei.com> Signed-off-by: Yao Kai <yaokai34@huawei.com> Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com> Tested-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2026-01-07 21:58:37 +08:00
Paul E. McKenney	37d9b47507	rcutorture: Correctly compute probability to invoke ->exp_current() Lack of parentheses causes the ->exp_current() function, for example, srcu_expedite_current(), to be called only once in four billion times instead of the intended once in 256 times. This commit therefore adds the needed parentheses. Reported-by: Chris Mason <clm@meta.com> Reported-by: Joel Fernandes <joelagnelf@nvidia.com> Fixes: `950063c6e8` ("rcutorture: Test srcu_expedite_current()") Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2026-01-07 21:58:34 +08:00
Paul E. McKenney	255019537c	rcu: Make expedited RCU CPU stall warnings detect stall-end races If an expedited RCU CPU stall ends just at the stall-warning timeout, the current code will print an expedited stall-warning message, but one that doesn't identify any CPUs or tasks causing the stall. This is most likely to happen for short-timeout stalls, for example, the 20-millisecond timeouts that are sometimes used for small embedded devices. Needless to say, these semi-empty stall-warning messages can be rather confusing. One option would be to suppress the stall-warning message entirely in this case, but the near-miss information can be quite valuable. Detect this race condition and emits a "INFO: Expedited stall ended before state dump start" message to clarify matters. [boqun: Apply feedback from Borislav] Reported-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2026-01-07 21:58:26 +08:00
Leon Hwang	47c79f05aa	bpf: Add BPF_F_CPU and BPF_F_ALL_CPUS flags support for percpu_cgroup_storage maps Introduce BPF_F_ALL_CPUS flag support for percpu_cgroup_storage maps to allow updating values for all CPUs with a single value for update_elem API. Introduce BPF_F_CPU flag support for percpu_cgroup_storage maps to allow: * update value for specified CPU for update_elem API. * lookup value for specified CPU for lookup_elem API. The BPF_F_CPU flag is passed via map_flags along with embedded cpu info. Signed-off-by: Leon Hwang <leon.hwang@linux.dev> Link: https://lore.kernel.org/r/20260107022022.12843-6-leon.hwang@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-06 20:48:32 -08:00
Leon Hwang	8526397c3c	bpf: Copy map value using copy_map_value_long for percpu_cgroup_storage maps Copy map value using 'copy_map_value_long()'. It's to keep consistent style with the way of other percpu maps. No functional change intended. Signed-off-by: Leon Hwang <leon.hwang@linux.dev> Link: https://lore.kernel.org/r/20260107022022.12843-5-leon.hwang@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-06 20:48:32 -08:00
Leon Hwang	c6936161fd	bpf: Add BPF_F_CPU and BPF_F_ALL_CPUS flags support for percpu_hash and lru_percpu_hash maps Introduce BPF_F_ALL_CPUS flag support for percpu_hash and lru_percpu_hash maps to allow updating values for all CPUs with a single value for both update_elem and update_batch APIs. Introduce BPF_F_CPU flag support for percpu_hash and lru_percpu_hash maps to allow: * update value for specified CPU for both update_elem and update_batch APIs. * lookup value for specified CPU for both lookup_elem and lookup_batch APIs. The BPF_F_CPU flag is passed via: * map_flags along with embedded cpu info. * elem_flags along with embedded cpu info. Signed-off-by: Leon Hwang <leon.hwang@linux.dev> Link: https://lore.kernel.org/r/20260107022022.12843-4-leon.hwang@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-06 20:48:32 -08:00
Leon Hwang	8eb76cb03f	bpf: Add BPF_F_CPU and BPF_F_ALL_CPUS flags support for percpu_array maps Introduce support for the BPF_F_ALL_CPUS flag in percpu_array maps to allow updating values for all CPUs with a single value for both update_elem and update_batch APIs. Introduce support for the BPF_F_CPU flag in percpu_array maps to allow: * update value for specified CPU for both update_elem and update_batch APIs. * lookup value for specified CPU for both lookup_elem and lookup_batch APIs. The BPF_F_CPU flag is passed via: * map_flags of lookup_elem and update_elem APIs along with embedded cpu info. * elem_flags of lookup_batch and update_batch APIs along with embedded cpu info. Signed-off-by: Leon Hwang <leon.hwang@linux.dev> Link: https://lore.kernel.org/r/20260107022022.12843-3-leon.hwang@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-06 20:48:32 -08:00
Leon Hwang	2b421662c7	bpf: Introduce BPF_F_CPU and BPF_F_ALL_CPUS flags Introduce BPF_F_CPU and BPF_F_ALL_CPUS flags and check them for following APIs: * 'map_lookup_elem()' * 'map_update_elem()' * 'generic_map_lookup_batch()' * 'generic_map_update_batch()' And, get the correct value size for these APIs. Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Leon Hwang <leon.hwang@linux.dev> Link: https://lore.kernel.org/r/20260107022022.12843-2-leon.hwang@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-06 20:48:32 -08:00
Gaurav Batra	1471c517cf	powerpc/iommu: bypass DMA APIs for coherent allocations for pre-mapped memory Leverage ARCH_HAS_DMA_MAP_DIRECT config option for coherent allocations as well. This will bypass DMA ops for memory allocations that have been pre-mapped. Always set device bus_dma_limit when memory is pre-mapped. In some architectures, like PowerPC, pmemory can be converted to regular memory via daxctl command. This will gate the coherent allocations to pre-mapped RAM only, by dma_coherent_ok(). Signed-off-by: Gaurav Batra <gbatra@linux.ibm.com> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/20251107161105.85999-1-gbatra@linux.ibm.com	2026-01-07 09:33:55 +05:30
Casey Schaufler	5547598e59	cred: remove unused set_security_override_from_ctx() The function set_security_override_from_ctx() has no in-tree callers since 6.14. Remove it. Signed-off-by: Casey Schaufler <casey@schaufler-ca.com> [PM: subject tweak, merge fuzz] Signed-off-by: Paul Moore <paul@paul-moore.com>	2026-01-06 20:52:57 -05:00
Emil Tsalapatis	39f77533b6	bpf: Allow calls to arena functions while holding spinlocks The bpf_arena_*_pages() kfuncs can be called from sleepable contexts, but the verifier still prevents BPF programs from calling them while holding a spinlock. Amend the verifier to allow for BPF programs calling arena page management functions while holding a lock. Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com> Link: https://lore.kernel.org/r/20260106-arena-under-lock-v2-2-378e9eab3066@etsalapatis.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-06 17:44:00 -08:00
Emil Tsalapatis	b25b48c7d3	bpf: Check active lock count in in_sleepable_context() The in_sleepable_context() function is used to specialize the BPF code in do_misc_fixups(). With the addition of nonsleepable arena kfuncs, there are kfuncs whose specialization depends on whether we are holding a lock. We should use the nonsleepable version while holding a lock and the sleepable one when not. Add a check for active_locks to account for locking when specializing arena kfuncs. Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com> Link: https://lore.kernel.org/r/20260106-arena-under-lock-v2-1-378e9eab3066@etsalapatis.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-06 17:43:19 -08:00
Keke Ming	a491c02c27	uprobes: use kmap_local_page() for temporary page mappings Replace deprecated kmap_atomic() with kmap_local_page(). Signed-off-by: Keke Ming <ming.jvle@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Link: https://patch.msgid.link/20260103084243.195125-6-ming.jvle@gmail.com	2026-01-06 16:34:28 +01:00
Joel Granados	d174174c67	sysctl: replace SYSCTL_INT_CONV_CUSTOM macro with functions Remove SYSCTL_INT_CONV_CUSTOM and replace it with proc_int_conv. This converter function expects a negp argument as it can take on negative values. Update all jiffies converters to use explicit function calls. Remove SYSCTL_CONV_IDENTITY as it is no longer used. Signed-off-by: Joel Granados <joel.granados@kernel.org>	2026-01-06 11:27:10 +01:00
Joel Granados	ef153851af	sysctl: Replace unidirectional INT converter macros with functions Replace SYSCTL_USER_TO_KERN_INT_CONV and SYSCTL_KERN_TO_USER_INT_CONV macros with function implementing the same logic.This makes debugging easier and aligns with the functions preference described in coding-style.rst. Update all jiffies converters to use explicit function implementations instead of macro-generated versions. Signed-off-by: Joel Granados <joel.granados@kernel.org>	2026-01-06 11:26:42 +01:00
Malaya Kumar Rout	7966cf0ebe	PM: hibernate: Fix crash when freeing invalid crypto compressor When crypto_alloc_acomp() fails, it returns an ERR_PTR value, not NULL. The cleanup code in save_compressed_image() and load_compressed_image() unconditionally calls crypto_free_acomp() without checking for ERR_PTR, which causes crypto_acomp_tfm() to dereference an invalid pointer and crash the kernel. This can be triggered when the compression algorithm is unavailable (e.g., CONFIG_CRYPTO_LZO not enabled). Fix by adding IS_ERR_OR_NULL() checks before calling crypto_free_acomp() and acomp_request_free(), similar to the existing kthread_stop() check. Fixes: `b03d542c3c` ("PM: hibernate: Use crypto_acomp interface") Signed-off-by: Malaya Kumar Rout <mrout@redhat.com> Cc: 6.15+ <stable@vger.kernel.org> # 6.15+ [ rjw: Added 2 empty code lines ] Link: https://patch.msgid.link/20251230115613.64080-1-mrout@redhat.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2026-01-05 19:12:56 +01:00
Marco Elver	04e49d926f	sched: Enable context analysis for core.c and fair.c This demonstrates a larger conversion to use Clang's context analysis. The benefit is additional static checking of locking rules, along with better documentation. Notably, kernel/sched contains sufficiently complex synchronization patterns, and application to core.c & fair.c demonstrates that the latest Clang version has become powerful enough to start applying this to more complex subsystems (with some modest annotations and changes). Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251219154418.3592607-37-elver@google.com	2026-01-05 16:43:36 +01:00
Marco Elver	8ec56d9aab	printk: Move locking annotation to printk.c With Sparse support gone, Clang is a bit more strict and warns: ./include/linux/console.h:492:50: error: use of undeclared identifier 'console_mutex' 492 \| extern void console_list_unlock(void) __releases(console_mutex); Since it does not make sense to make console_mutex itself global, move the annotation to printk.c. Context analysis remains disabled for printk.c. This is needed to enable context analysis for modules that include <linux/console.h>. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251219154418.3592607-34-elver@google.com	2026-01-05 16:43:36 +01:00
Marco Elver	0eaa911f89	kcsan: Enable context analysis Enable context analysis for the KCSAN subsystem. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251219154418.3592607-31-elver@google.com	2026-01-05 16:43:35 +01:00
Marco Elver	6556fde265	kcov: Enable context analysis Enable context analysis for the KCOV subsystem. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251219154418.3592607-30-elver@google.com	2026-01-05 16:43:34 +01:00
Marco Elver	e4588c25c9	compiler-context-analysis: Remove __cond_lock() function-like helper As discussed in [1], removing __cond_lock() will improve the readability of trylock code. Now that Sparse context tracking support has been removed, we can also remove __cond_lock(). Change existing APIs to either drop __cond_lock() completely, or make use of the __cond_acquires() function attribute instead. In particular, spinlock and rwlock implementations required switching over to inline helpers rather than statement-expressions for their trylock_* variants. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/20250207082832.GU7145@noisy.programming.kicks-ass.net/ [1] Link: https://patch.msgid.link/20251219154418.3592607-25-elver@google.com	2026-01-05 16:43:33 +01:00
Joel Granados	b3af263b8a	sysctl: Add kernel doc to proc_douintvec_conv This commit is making sure that all the functions that are part of the API are documented. Signed-off-by: Joel Granados <joel.granados@kernel.org>	2026-01-05 14:10:32 +01:00
Joel Granados	8fc344a5af	sysctl: Replace UINT converter macros with functions Replace the SYSCTL_USER_TO_KERN_UINT_CONV and SYSCTL_UINT_CONV_CUSTOM macros with functions with the same logic. This makes debugging easier and aligns with the functions preference described in coding-style.rst. Update the only user of this API: pipe.c. Signed-off-by: Joel Granados <joel.granados@kernel.org>	2026-01-05 14:10:32 +01:00
Joel Granados	ac3d6a4b60	sysctl: clarify proc_douintvec_minmax doc Specify that the range check is only when assigning kernel variable Signed-off-by: Joel Granados <joel.granados@kernel.org>	2026-01-05 14:10:32 +01:00
Joel Granados	11400f86c2	sysctl: Return -ENOSYS from proc_douintvec_conv when CONFIG_PROC_SYSCTL=n Ensure an error if prco_douintvec_conv is erroneously called in a system with CONFIG_PROC_SYSCTL=n Signed-off-by: Joel Granados <joel.granados@kernel.org>	2026-01-05 14:10:32 +01:00
Joel Granados	f7386f545e	sysctl: Remove unused ctl_table forward declarations Remove superfluous forward declarations of ctl_table from header files where they are no longer needed. These declarations were left behind after sysctl code refactoring and cleanup. Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Muchun Song <muchun.song@linux.dev> Reviewed-by: Petr Mladek <pmladek@suse.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Joel Granados <joel.granados@kernel.org>	2026-01-05 13:54:41 +01:00
Joel Granados	4864010524	sysctl: Add missing kernel-doc for proc_dointvec_conv Add kernel-doc documentation for the proc_dointvec_conv function to describe its parameters and return value. Reviewed-by: Kees Cook <kees@kernel.org> Signed-off-by: Joel Granados <joel.granados@kernel.org>	2026-01-05 13:36:45 +01:00
Rafael J. Wysocki	10c3ab8cd8	Merge back a commit related to system sleep for 6.20	2026-01-05 12:01:29 +01:00
Peter Zijlstra	ff5860f508	perf: Ensure swevent hrtimer is properly destroyed With the change to hrtimer_try_to_cancel() in perf_swevent_cancel_hrtimer() it appears possible for the hrtimer to still be active by the time the event gets freed. Make sure the event does a full hrtimer_cancel() on the free path by installing a perf_event::destroy handler. Fixes: `eb3182ef04` ("perf/core: Fix system hang caused by cpu-clock usage") Reported-by: CyberUnicorns <a101e_iotvul@163.com> Tested-by: CyberUnicorns <a101e_iotvul@163.com> Debugged-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>	2026-01-05 08:55:54 +01:00
Boqun Feng	acb0b2f5d6	Merge branch 'rcu-torture.20260104a' into rcu-next * rcu-torture.20260104a: rcutorture: Add --kill-previous option to terminate previous kvm.sh runs rcutorture: Prevent concurrent kvm.sh runs on same source tree torture: Include commit discription in testid.txt torture: Make config2csv.sh properly handle comments in .boot files torture: Make kvm-series.sh give run numbers and totals torture: Make kvm-series.sh give build numbers and totals torture: Parallelize kvm-series.sh guest-OS execution rcutorture: Add context checks to rcu_torture_timer()	2026-01-04 18:53:06 +08:00
Puranjay Mohan	a069190b59	bpf: Replace __opt annotation with __nullable for kfuncs The __opt annotation was originally introduced specifically for buffer/size argument pairs in bpf_dynptr_slice() and bpf_dynptr_slice_rdwr(), allowing the buffer pointer to be NULL while still validating the size as a constant. The __nullable annotation serves the same purpose but is more general and is already used throughout the BPF subsystem for raw tracepoints, struct_ops, and other kfuncs. This patch unifies the two annotations by replacing __opt with __nullable. The key change is in the verifier's get_kfunc_ptr_arg_type() function, where mem/size pair detection is now performed before the nullable check. This ensures that buffer/size pairs are correctly classified as KF_ARG_PTR_TO_MEM_SIZE even when the buffer is nullable, while adding an !arg_mem_size condition to the nullable check prevents interference with mem/size pair handling. When processing KF_ARG_PTR_TO_MEM_SIZE arguments, the verifier now uses is_kfunc_arg_nullable() instead of the removed is_kfunc_arg_optional() to determine whether to skip size validation for NULL buffers. This is the first documentation added for the __nullable annotation, which has been in use since it was introduced but was previously undocumented. No functional changes to verifier behavior - nullable buffer/size pairs continue to work exactly as before. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Link: https://lore.kernel.org/r/20260102221513.1961781-1-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-02 15:51:34 -08:00
Puranjay Mohan	e66fe1bc6d	bpf: arena: Reintroduce memcg accounting When arena allocations were converted from bpf_map_alloc_pages() to kmalloc_nolock() to support non-sleepable contexts, memcg accounting was inadvertently lost. This commit restores proper memory accounting for all arena-related allocations. All arena related allocations are accounted into memcg of the process that created bpf_arena. Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Link: https://lore.kernel.org/r/20260102200230.25168-3-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-02 14:31:59 -08:00
Puranjay Mohan	817593af7b	bpf: syscall: Introduce memcg enter/exit helpers Introduce bpf_map_memcg_enter() and bpf_map_memcg_exit() helpers to reduce code duplication in memcg context management. bpf_map_memcg_enter() gets the memcg from the map, sets it as active, and returns both the previous and the now active memcg. bpf_map_memcg_exit() restores the previous active memcg and releases the reference obtained during enter. Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Link: https://lore.kernel.org/r/20260102200230.25168-2-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-02 14:31:59 -08:00
Linus Torvalds	bbbc721033	Power management fix for 6.19-rc4 Fix a recent regression that affects system suspend testing at the "core" level (Rafael Wysocki) -----BEGIN PGP SIGNATURE----- iQFGBAABCAAwFiEEcM8Aw/RY0dgsiRUR7l+9nS/U47UFAmlYG/ISHHJqd0Byand5 c29ja2kubmV0AAoJEO5fvZ0v1OO1mK0IAIrCiY5dvp9+72DvEWqS2uHHFVs3sHKR SOpJR3koYehZEn/PvnM2PgvWNCLtru4nU/Q3EnWFfFCFuFuAMQ6Zl5U7YyKkW1Uc bcTMsnLOTJm/3AYu3O+4TGASq1VF1xqE+AB/ie5fNz5gDSlblGKrqh0se3m5m1Vu PsLsm27wkLyEHCd3AdXRNSU54GssjTaABkVTQ/Unk4PznbBiKsckaThLjbjQaiqB KzqU0B3Q3Zx9Qj1lVzXwXaYushehGbs3bqw8+q2DPrV/jwLVLYX/ofwEkCH+lQ47 tS+di//pFi/grWu/GtR4EQ0fCzgYPDaBfbQlOD2gA60EgplU4XY3804= =CK7L -----END PGP SIGNATURE----- Merge tag 'pm-6.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fix from Rafael Wysocki: "Fix a recent regression that affects system suspend testing at the 'core' level (Rafael Wysocki)" * tag 'pm-6.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: PM: sleep: Fix suspend_test() at the TEST_CORE level	2026-01-02 12:35:29 -08:00
Puranjay Mohan	7646c7afd9	bpf: Remove redundant KF_TRUSTED_ARGS flag from all kfuncs Now that KF_TRUSTED_ARGS is the default for all kfuncs, remove the explicit KF_TRUSTED_ARGS flag from all kfunc definitions and remove the flag itself. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Link: https://lore.kernel.org/r/20260102180038.2708325-3-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-02 12:04:28 -08:00
Puranjay Mohan	1a5c01d250	bpf: Make KF_TRUSTED_ARGS the default for all kfuncs Change the verifier to make trusted args the default requirement for all kfuncs by removing is_kfunc_trusted_args() assuming it be to always return true. This works because: 1. Context pointers (xdp_md, __sk_buff, etc.) are handled through their own KF_ARG_PTR_TO_CTX case label and bypass the trusted check 2. Struct_ops callback arguments are already marked as PTR_TRUSTED during initialization and pass is_trusted_reg() 3. KF_RCU kfuncs are handled separately via is_kfunc_rcu() checks at call sites (always checked with \|\| alongside is_kfunc_trusted_args) This simple change makes all kfuncs require trusted args by default while maintaining correct behavior for all existing special cases. Note: This change means kfuncs that previously accepted NULL pointers without KF_TRUSTED_ARGS will now reject NULL at verification time. Several netfilter kfuncs are affected: bpf_xdp_ct_lookup(), bpf_skb_ct_lookup(), bpf_xdp_ct_alloc(), and bpf_skb_ct_alloc() all accept NULL for their bpf_tuple and opts parameters internally (checked in __bpf_nf_ct_lookup), but after this change the verifier rejects NULL before the kfunc is even called. This is acceptable because these kfuncs don't work with NULL parameters in their proper usage. Now they will be rejected rather than returning an error, which shouldn't make a difference to BPF programs that were using these kfuncs properly. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Link: https://lore.kernel.org/r/20260102180038.2708325-2-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-02 12:04:28 -08:00
Michael S. Tsirkin	d5d8465131	dma-debug: track cache clean flag in entries If a driver is buggy and has 2 overlapping mappings but only sets cache clean flag on the 1st one of them, we warn. But if it only does it for the 2nd one, we don't. Fix by tracking cache clean flag in the entry. Message-ID: <0ffb3513d18614539c108b4548cdfbc64274a7d1.1767601130.git.mst@redhat.com> Reviewed-by: Petr Tesarik <ptesarik@suse.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2026-01-02 06:22:49 -05:00
Paul E. McKenney	e8a534a671	rcutorture: Add context checks to rcu_torture_timer() This commit adds irq, NMI, and softirq context checks to the rcu_torture_timer() function. Just because you are paranoid does not mean that they are not out to get you... ;-) Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2026-01-01 16:43:21 +08:00
Paul E. McKenney	760f05bc83	rcutorture: Test rcu_tasks_trace_expedite_current() This commit adds a ->exp_current member to the tasks_tracing_ops structure to test the rcu_tasks_trace_expedite_current() function. [ paulmck: Apply kernel test robot feedback. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: bpf@vger.kernel.org Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2026-01-01 16:39:46 +08:00
Paul E. McKenney	1a72f4bb6f	rcu: Add noinstr-fast rcu_read_{,un}lock_tasks_trace() APIs When expressing RCU Tasks Trace in terms of SRCU-fast, it was necessary to keep a nesting count and per-CPU srcu_ctr structure pointer in the task_struct structure, which is slow to access. But an alternative is to instead make rcu_read_lock_tasks_trace() and rcu_read_unlock_tasks_trace(), which match the underlying SRCU-fast semantics, avoiding the task_struct accesses. When all callers have switched to the new API, the previous rcu_read_lock_trace() and rcu_read_unlock_trace() APIs will be removed. The rcu_read_{,un}lock_{,tasks_}trace() functions need to use smp_mb() only if invoked where RCU is not watching, that is, from locations where a call to rcu_is_watching() would return false. In architectures that define the ARCH_WANTS_NO_INSTR Kconfig option, use of noinstr and friends ensures that tracing happens only where RCU is watching, so those architectures can dispense entirely with the read-side calls to smp_mb(). Other architectures include these read-side calls by default, but in many installations there might be either larger than average tolerance for risk, prohibition of removing tracing on a running system, or careful review and approval of removing of tracing. Such installations can build their kernels with CONFIG_TASKS_TRACE_RCU_NO_MB=y to avoid those read-side calls to smp_mb(), thus accepting responsibility for run-time removal of tracing from code regions that RCU is not watching. Those wishing to disable read-side memory barriers for an entire architecture can select this TASKS_TRACE_RCU_NO_MB Kconfig option, hence the polarity. [ paulmck: Apply Peter Zijlstra feedback. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: bpf@vger.kernel.org Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2026-01-01 16:39:46 +08:00
Paul E. McKenney	176a6aeaf1	rcu: Move rcu_tasks_trace_srcu_struct out of #ifdef CONFIG_TASKS_RCU_GENERIC Moving the rcu_tasks_trace_srcu_struct structure instance out from under the CONFIG_TASKS_RCU_GENERIC Kconfig option permits the CONFIG_TASKS_TRACE_RCU Kconfig option to stop enabling this CONFIG_TASKS_RCU_GENERIC Kconfig option. This commit also therefore makes it so. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: bpf@vger.kernel.org Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2026-01-01 16:39:46 +08:00
Paul E. McKenney	a73fc3dcc6	rcu: Clean up after the SRCU-fastification of RCU Tasks Trace Now that RCU Tasks Trace has been re-implemented in terms of SRCU-fast, the ->trc_ipi_to_cpu, ->trc_blkd_cpu, ->trc_blkd_node, ->trc_holdout_list, and ->trc_reader_special task_struct fields are no longer used. In addition, the rcu_tasks_trace_qs(), rcu_tasks_trace_qs_blkd(), exit_tasks_rcu_finish_trace(), and rcu_spawn_tasks_trace_kthread(), show_rcu_tasks_trace_gp_kthread(), rcu_tasks_trace_get_gp_data(), rcu_tasks_trace_torture_stats_print(), and get_rcu_tasks_trace_gp_kthread() functions and all the other functions that they invoke are no longer used. Also, the TRC_NEED_QS and TRC_NEED_QS_CHECKED CPP macros are no longer used. Neither are the rcu_tasks_trace_lazy_ms and rcu_task_ipi_delay rcupdate module parameters and the TASKS_TRACE_RCU_READ_MB Kconfig option. This commit therefore removes all of them. [ paulmck: Apply Alexei Starovoitov feedback. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: bpf@vger.kernel.org Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2026-01-01 16:39:46 +08:00
Paul E. McKenney	46e3235999	context_tracking: Remove rcu_task_trace_heavyweight_{enter,exit}() Because SRCU-fast does not use IPIs for its grace periods, there is no need for real-time workloads to switch to an IPI-free mode, and there is in turn no need for either rcu_task_trace_heavyweight_enter() or rcu_task_trace_heavyweight_exit(). This commit therefore removes them. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: bpf@vger.kernel.org Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2026-01-01 16:39:46 +08:00
Paul E. McKenney	c27cea4416	rcu: Re-implement RCU Tasks Trace in terms of SRCU-fast This commit saves more than 500 lines of RCU code by re-implementing RCU Tasks Trace in terms of SRCU-fast. Follow-up work will remove more code that does not cause problems by its presence, but that is no longer required. This variant places smp_mb() in rcu_read_{,un}lock_trace(), and in the same place that srcu_read_{,un}lock() would put them. These smp_mb() calls will be removed on common-case architectures in a later commit. In the meantime, it serves to enforce ordering between the underlying srcu_read_{,un}lock_fast() markers and the intervening critical section, even on architectures that permit attaching tracepoints on regions of code not watched by RCU. Such architectures defeat SRCU-fast's use of implicit single-instruction, interrupts-disabled, and atomic-operation RCU read-side critical sections, which have no effect when RCU is not watching. The aforementioned later commit will insert these smp_mb() calls only on architectures that have not used noinstr to prevent attaching tracepoints to code where RCU is not watching. [ paulmck: Apply kernel test robot, Boqun Feng, and Zqiang feedback. ] [ paulmck: Split out Tiny SRCU fixes per Andrii Nakryiko feedback. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Tested-by: kernel test robot <oliver.sang@intel.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: bpf@vger.kernel.org Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2026-01-01 16:39:46 +08:00
Michael S. Tsirkin	61868dc55a	dma-mapping: add DMA_ATTR_CPU_CACHE_CLEAN When multiple small DMA_FROM_DEVICE or DMA_BIDIRECTIONAL buffers share a cacheline, and DMA_API_DEBUG is enabled, we get this warning: cacheline tracking EEXIST, overlapping mappings aren't supported. This is because when one of the mappings is removed, while another one is active, CPU might write into the buffer. Add an attribute for the driver to promise not to do this, making the overlapping safe, and suppressing the warning. Message-ID: <2d5d091f9d84b68ea96abd545b365dd1d00bbf48.1767601130.git.mst@redhat.com> Reviewed-by: Petr Tesarik <ptesarik@suse.com> Acked-by: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2025-12-31 19:30:02 -05:00
Eduard Zingerman	840692326e	bpf: allow states pruning for misc/invalid slots in iterator loops Within an iterator or callback based loop, it should be safe to prune the current state if the old state stack slot is marked as STACK_INVALID or STACK_MISC: - either all branches of the old state lead to a program exit; - or some branch of the old state leads the current state. This is the same logic as applied in non-loop cases when states_equal() is called in NOT_EXACT mode. The test case that exercises stacksafe() and demonstrates the difference in verification performance is included in the next patch. I'm not sure if it is possible to prepare a test case that exercises regsafe(); it appears that the compute_live_registers() pass makes this impossible. Nevertheless, for code readability reasons, I think that stacksafe() and regsafe() should handle STACK_INVALID / NOT_INIT symmetrically. Hence, this commit changes both functions. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251230-loop-stack-misc-pruning-v1-1-585cfd6cec51@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-12-31 09:01:13 -08:00
Eduard Zingerman	f597664454	bpf: bpf_scc_visit instance and backedges accumulation for bpf_loop() Calls like bpf_loop() or bpf_for_each_map_elem() introduce loops that are not explicitly present in the control-flow graph. The verifier processes such calls by repeatedly interpreting the callback function body within the same verification path (until the current state converges with a previous state). Such loops require a bpf_scc_visit instance in order to allow the accumulation of the state graph backedges. Otherwise, certain checkpoint states created within the bodies of such loops will have incomplete precision marks. See the next patch for an example of a program that leads to the verifier accepting an unsafe program. Fixes: `96c6aa4c63` ("bpf: compute SCCs in program control flow graph") Fixes: `c9e31900b5` ("bpf: propagate read/precision marks over state graph backedges") Reported-by: Breno Leitao <leitao@debian.org> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Tested-by: Breno Leitao <leitao@debian.org> Link: https://lore.kernel.org/r/20251229-scc-for-callbacks-v1-1-ceadfe679900@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-12-30 15:42:42 -08:00
Linus Torvalds	0b34fd0fea	27 hotfixes. 12 are cc:stable, 18 are MM. There's a three patch series from Jiayuan Chen which fixes some issues with KASAN and vmalloc. Apart from that it's the usual shower of singletons - please see the respective changelogs for details. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaVIWxwAKCRDdBJ7gKXxA jt6HAP49s/mEIIbuZbqnX8hxDrvYdYffs+RsSLPEZoR+yKG/7gD/VqSZhMoPw53b rMZ56djXNjWxsOAfiVbZit3SFuQfnQ4= =5rGD -----END PGP SIGNATURE----- Merge tag 'mm-hotfixes-stable-2025-12-28-21-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "27 hotfixes. 12 are cc:stable, 18 are MM. There's a patch series from Jiayuan Chen which fixes some issues with KASAN and vmalloc. Apart from that it's the usual shower of singletons - please see the respective changelogs for details" * tag 'mm-hotfixes-stable-2025-12-28-21-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (27 commits) mm/ksm: fix pte_unmap_unlock of wrong address in break_ksm_pmd_entry mm/page_owner: fix memory leak in page_owner_stack_fops->release() mm/memremap: fix spurious large folio warning for FS-DAX MAINTAINERS: notify the "Device Memory" community of memory hotplug changes sparse: update MAINTAINERS info mm/page_alloc: report 1 as zone_batchsize for !CONFIG_MMU mm: consider non-anon swap cache folios in folio_expected_ref_count() rust: maple_tree: rcu_read_lock() in destructor to silence lockdep mm: memcg: fix unit conversion for K() macro in OOM log mm: fixup pfnmap memory failure handling to use pgoff tools/mm/page_owner_sort: fix timestamp comparison for stable sorting selftests/mm: fix thread state check in uffd-unit-tests kernel/kexec: fix IMA when allocation happens in CMA area kernel/kexec: change the prototype of kimage_map_segment() MAINTAINERS: add ABI headers to KHO and LIVE UPDATE .mailmap: remove one of the entries for WangYuli mm/damon/vaddr: fix missing pte_unmap_unlock in damos_va_migrate_pmd_entry() MAINTAINERS: update one straggling entry for Bartosz Golaszewski mm/page_alloc: change all pageblocks migrate type on coalescing mm: leafops.h: correct kernel-doc function param. names ...	2025-12-29 11:40:38 -08:00
Linus Torvalds	7839932417	sched_ext: Fixes for v6.19-rc3 - Fix uninitialized @ret on alloc_percpu() failure leading to ERR_PTR(0). - Fix PREEMPT_RT warning when bypass load balancer sends IPI to offline CPU by using resched_cpu() instead of resched_curr(). - Fix comment referring to renamed function. - Update scx_show_state.py for scx_root and scx_aborting changes. -----BEGIN PGP SIGNATURE----- iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaVHQAw4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGdJ2AP9nLMUa5Rw2hpcKCLvPjgkqe5fDpNteWrQB3ni9 bu28jQD/XGMwooJbATlDEgCtFqCH74QbddqUABJxBw4FE8qpUgw= =hp6J -----END PGP SIGNATURE----- Merge tag 'sched_ext-for-6.19-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: - Fix uninitialized @ret on alloc_percpu() failure leading to ERR_PTR(0) - Fix PREEMPT_RT warning when bypass load balancer sends IPI to offline CPU by using resched_cpu() instead of resched_curr() - Fix comment referring to renamed function - Update scx_show_state.py for scx_root and scx_aborting changes * tag 'sched_ext-for-6.19-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: tools/sched_ext: update scx_show_state.py for scx_aborting change tools/sched_ext: fix scx_show_state.py for scx_root change sched_ext: Use the resched_cpu() to replace resched_curr() in the bypass_lb_node() sched_ext: Fix some comments in ext.c sched_ext: fix uninitialized ret on alloc_percpu() failure	2025-12-28 17:21:36 -08:00
Linus Torvalds	bba0b6a1c4	cgroup: Fixes for v6.19-rc3 - cpuset: Fix spurious warning when disabling remote partition after CPU hotplug leaves subpartitions_cpus empty. Guard the warning and invalidate affected partitions. -----BEGIN PGP SIGNATURE----- iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaVHNBg4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGW13AQC7akGwBAjBbR5p0aoLXKl5vZ8UYiwH+7iyep2Z INn7ggEA/huYF4smSRk3oR8q7eiISa85mZc85EuVEAjPmBQhpAQ= =9UP0 -----END PGP SIGNATURE----- Merge tag 'cgroup-for-6.19-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fix from Tejun Heo: - Fix a spurious cpuset warning when disabling remote partition after CPU hotplug leaves subpartitions_cpus empty. Guard the warning and invalidate affected partitions. * tag 'cgroup-for-6.19-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cpuset: fix warning when disabling remote partition	2025-12-28 17:19:09 -08:00
Rafael J. Wysocki	684d3b2670	PM: sleep: Fix suspend_test() at the TEST_CORE level Commit `a10ad1b104` ("PM: suspend: Make pm_test delay interruptible by wakeup events") replaced mdelay() in suspend_test() with msleep() which does not work at the TEST_CORE test level that calls suspend_test() while running on one CPU with interrupts off. Address this by making suspend_test() check if the test level is suitable for using msleep() and use mdelay() otherwise. Fixes: `a10ad1b104` ("PM: suspend: Make pm_test delay interruptible by wakeup events") Reported-by: Sebastian Reichel <sebastian.reichel@collabora.com> Closes: https://lore.kernel.org/linux-pm/aUsAk0k1N9hw8IkY@venus/ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Sebastian Reichel <sebastian.reichel@collabora.com> Link: https://patch.msgid.link/6251576.lOV4Wx5bFT@rafael.j.wysocki	2025-12-28 13:01:39 +01:00
Linus Torvalds	b63f4a4e95	EFI fixes for v6.19 #1 A couple of fixes for EFI regressions introduced this cycle: - Make EDID handling in the EFI stub mixed mode safe - Ensure that efi_mm.user_ns has a sane value - this is needed now that EFI runtime calls are preemptible on arm64 -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQQQm/3uucuRGn1Dmh0wbglWLn0tXAUCaU694AAKCRAwbglWLn0t XJHoAPsHkA4g+rRXXhqgysWvqQvGWTrWLiVHPeDOs3dAWTxHpQD+KTNVaW4H+BCz x+5S3unu5q9PxWOwqeoCi2JvkKKjJwE= =fb7r -----END PGP SIGNATURE----- Merge tag 'efi-fixes-for-v6.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi Pull EFI fixes from Ard Biesheuvel: "A couple of fixes for EFI regressions introduced this cycle: - Make EDID handling in the EFI stub mixed mode safe - Ensure that efi_mm.user_ns has a sane value - this is needed now that EFI runtime calls are preemptible on arm64" * tag 'efi-fixes-for-v6.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi: kthread: Warn if mm_struct lacks user_ns in kthread_use_mm() arm64: efi: Fix NULL pointer dereference by initializing user_ns efi/libstub: gop: Fix EDID support in mixed-mode	2025-12-26 13:37:11 -08:00
Aaron Kling	92d661c36f	irqdomain: Export irq_domain_free_irqs() Export irq_domain_free_irqs() to allow PCI/MSI drivers like pci-tegra to be built as a module. Signed-off-by: Aaron Kling <webgeek1234@gmail.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/20250731-pci-tegra-module-v7-1-cad4b088b8fb@gmail.com	2025-12-26 10:57:17 -06:00
Breno Leitao	cfe54f4591	kthread: Warn if mm_struct lacks user_ns in kthread_use_mm() Add a WARN_ON_ONCE() check to detect mm_struct instances that are missing user_ns initialization when passed to kthread_use_mm(). When a kthread adopts an mm via kthread_use_mm(), LSM hooks and capability checks may access current->mm->user_ns for credential validation. If user_ns is NULL, this leads to a NULL pointer dereference crash. This was observed with efi_mm on arm64, where commit `a5baf582f4` ("arm64/efi: Call EFI runtime services without disabling preemption") introduced kthread_use_mm(&efi_mm), but efi_mm lacked user_ns initialization, causing crashes during /proc access. Adding this warning helps catch similar bugs early during development rather than waiting for hard-to-debug NULL pointer crashes in production. Signed-off-by: Breno Leitao <leitao@debian.org> Acked-by: Rik van Riel <riel@surriel.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>	2025-12-24 21:32:58 +01:00
Puranjay Mohan	b8467290ed	bpf: arena: make arena kfuncs any context safe Make arena related kfuncs any context safe by the following changes: bpf_arena_alloc_pages() and bpf_arena_reserve_pages(): Replace the usage of the mutex with a rqspinlock for range tree and use kmalloc_nolock() wherever needed. Use free_pages_nolock() to free pages from any context. apply_range_set/clear_cb() with apply_to_page_range() has already made populating the vm_area in bpf_arena_alloc_pages() any context safe. bpf_arena_free_pages(): defer the main logic to a workqueue if it is called from a non-sleepable context. specialize_kfunc() is used to replace the sleepable arena_free_pages() with bpf_arena_free_pages_non_sleepable() when the verifier detects the call is from a non-sleepable context. In the non-sleepable case, arena_free_pages() queues the address and the page count to be freed to a lock-less list of struct arena_free_spans and raises an irq_work. The irq_work handler calls schedules_work() as it is safe to be called from irq context. arena_free_worker() (the work queue handler) iterates these spans and clears ptes, flushes tlb, zaps pages, and calls __free_page(). Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Link: https://lore.kernel.org/r/20251222195022.431211-4-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-12-23 11:30:00 -08:00
Puranjay Mohan	360c35f8ff	bpf: arena: use kmalloc_nolock() in place of kvcalloc() To make arena_alloc_pages() safe to be called from any context, replace kvcalloc() with kmalloc_nolock() so as it doesn't sleep or take any locks. kmalloc_nolock() returns NULL for allocations larger than KMALLOC_MAX_CACHE_SIZE, which is (PAGE_SIZE * 2) = 8KB on systems with 4KB pages. So, round down the allocation done by kmalloc_nolock to 1024 * 8 and reuse the array in a loop. Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Link: https://lore.kernel.org/r/20251222195022.431211-3-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-12-23 11:29:59 -08:00
Puranjay Mohan	c336b0b327	bpf: arena: populate vm_area without allocating memory vm_area_map_pages() may allocate memory while inserting pages into bpf arena's vm_area. In order to make bpf_arena_alloc_pages() kfunc non-sleepable change bpf arena to populate pages without allocating memory: - at arena creation time populate all page table levels except the last level - when new pages need to be inserted call apply_to_page_range() again with apply_range_set_cb() which will only set_pte_at() those pages and will not allocate memory. - when freeing pages call apply_to_existing_page_range with apply_range_clear_cb() to clear the pte for the page to be removed. This doesn't free intermediate page table levels. Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Link: https://lore.kernel.org/r/20251222195022.431211-2-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-12-23 11:29:59 -08:00
Pingfan Liu	a3785ae5d3	kernel/kexec: fix IMA when allocation happens in CMA area * Bug description * When I tested kexec with the latest kernel, I ran into the following warning: [ 40.712410] ------------[ cut here ]------------ [ 40.712576] WARNING: CPU: 2 PID: 1562 at kernel/kexec_core.c:1001 kimage_map_segment+0x144/0x198 [...] [ 40.816047] Call trace: [ 40.818498] kimage_map_segment+0x144/0x198 (P) [ 40.823221] ima_kexec_post_load+0x58/0xc0 [ 40.827246] __do_sys_kexec_file_load+0x29c/0x368 [...] [ 40.855423] ---[ end trace 0000000000000000 ]--- * How to reproduce * This bug is only triggered when the kexec target address is allocated in the CMA area. If no CMA area is reserved in the kernel, use the "cma=" option in the kernel command line to reserve one. * Root cause * The commit `07d2490297` ("kexec: enable CMA based contiguous allocation") allocates the kexec target address directly on the CMA area to avoid copying during the jump. In this case, there is no IND_SOURCE for the kexec segment. But the current implementation of kimage_map_segment() assumes that IND_SOURCE pages exist and map them into a contiguous virtual address by vmap(). * Solution * If IMA segment is allocated in the CMA area, use its page_address() directly. Link: https://lkml.kernel.org/r/20251216014852.8737-2-piliu@redhat.com Fixes: `07d2490297` ("kexec: enable CMA based contiguous allocation") Signed-off-by: Pingfan Liu <piliu@redhat.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Alexander Graf <graf@amazon.com> Cc: Steven Chen <chenste@linux.microsoft.com> Cc: Mimi Zohar <zohar@linux.ibm.com> Cc: Roberto Sassu <roberto.sassu@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-12-23 11:23:14 -08:00
Pingfan Liu	fe55ea8593	kernel/kexec: change the prototype of kimage_map_segment() The kexec segment index will be required to extract the corresponding information for that segment in kimage_map_segment(). Additionally, kexec_segment already holds the kexec relocation destination address and size. Therefore, the prototype of kimage_map_segment() can be changed. Link: https://lkml.kernel.org/r/20251216014852.8737-1-piliu@redhat.com Fixes: `07d2490297` ("kexec: enable CMA based contiguous allocation") Signed-off-by: Pingfan Liu <piliu@redhat.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Mimi Zohar <zohar@linux.ibm.com> Cc: Roberto Sassu <roberto.sassu@huawei.com> Cc: Alexander Graf <graf@amazon.com> Cc: Steven Chen <chenste@linux.microsoft.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-12-23 11:23:13 -08:00
Daniel Gomez	ac1c5bc7c4	bpf: crypto: replace -EEXIST with -EBUSY The -EEXIST error code is reserved by the module loading infrastructure to indicate that a module is already loaded. When a module's init function returns -EEXIST, userspace tools like kmod interpret this as "module already loaded" and treat the operation as successful, returning 0 to the user even though the module initialization actually failed. This follows the precedent set by commit `54416fd767` ("netfilter: conntrack: helper: Replace -EEXIST by -EBUSY") which fixed the same issue in nf_conntrack_helper_register(). This affects bpf_crypto_skcipher module. While the configuration required to build it as a module is unlikely in practice, it is technically possible, so fix it for correctness. Signed-off-by: Daniel Gomez <da.gomez@samsung.com> Acked-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Link: https://lore.kernel.org/r/20251220-dev-module-init-eexists-bpf-v1-1-7f186663dbe7@samsung.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-12-22 22:25:09 -08:00
Puranjay Mohan	342297d511	bpf: allow calling kfuncs from raw_tp programs Associate raw tracepoint program type with the kfunc tracing hook. This allows calling kfuncs from raw_tp programs. Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20251222133250.1890587-2-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-12-22 22:23:38 -08:00
Zqiang	714d81423e	sched_ext: Avoid multiple irq_work_queue() calls in destroy_dsq() llist_add() returns true only when adding to an empty list, which indicates that no IRQ work is currently queued or running. Therefore, we only need to call irq_work_queue() when llist_add() returns true, to avoid unnecessarily re-queueing IRQ work that is already pending or executing. Signed-off-by: Zqiang <qiang.zhang@linux.dev> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-12-22 17:55:41 -10:00
Zqiang	ccaeeb585c	sched_ext: Use the resched_cpu() to replace resched_curr() in the bypass_lb_node() For the PREEMPT_RT kernels, the scx_bypass_lb_timerfn() running in the preemptible per-CPU ktimer kthread context, this means that the following scenarios will occur(for x86 platform): cpu1 cpu2 ktimer kthread: ->scx_bypass_lb_timerfn ->bypass_lb_node ->for_each_cpu(cpu, resched_mask) migration/1: by preempt by migration/2: multi_cpu_stop() multi_cpu_stop() ->take_cpu_down() ->__cpu_disable() ->set cpu1 offline ->rq1 = cpu_rq(cpu1) ->resched_curr(rq1) ->smp_send_reschedule(cpu1) ->native_smp_send_reschedule(cpu1) ->if(unlikely(cpu_is_offline(cpu))) { WARN(1, "sched: Unexpected reschedule of offline CPU#%d!\n", cpu); return; } This commit therefore use the resched_cpu() to replace resched_curr() in the bypass_lb_node() to avoid send-ipi to offline CPUs. Signed-off-by: Zqiang <qiang.zhang@linux.dev> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-12-22 17:51:51 -10:00
Chen Ridong	269679bdd1	cpuset: remove dead code in cpuset-v1.c The commit `6e1d31ce49` ("cpuset: separate generate_sched_domains for v1 and v2") introduced dead code that was originally added for cpuset-v2 partition domain generation. Remove the redundant root_load_balance check. Fixes: `6e1d31ce49` ("cpuset: separate generate_sched_domains for v1 and v2") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/cgroups/9a442808-ed53-4657-988b-882cc0014c0d@huaweicloud.com/T/ Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-12-22 17:40:17 -10:00
Kees Cook	68e8555858	module/decompress: Avoid open-coded kvrealloc() Replace open-coded allocate/copy with kvrealloc(). Signed-off-by: Kees Cook <kees@kernel.org> Reviewed-by: Petr Pavlu <petr.pavlu@suse.com> Signed-off-by: Sami Tolvanen <samitolvanen@google.com>	2025-12-22 16:35:54 +00:00
Petr Pavlu	148519a063	module: Remove SHA-1 support for module signing SHA-1 is considered deprecated and insecure due to vulnerabilities that can lead to hash collisions. Most distributions have already been using SHA-2 for module signing because of this. The default was also changed last year from SHA-1 to SHA-512 in commit `f3b93547b9` ("module: sign with sha512 instead of sha1 by default"). This was not reported to cause any issues. Therefore, it now seems to be a good time to remove SHA-1 support for module signing. Commit `16ab7cb582` ("crypto: pkcs7 - remove sha1 support") previously removed support for reading PKCS#7/CMS signed with SHA-1, along with the ability to use SHA-1 for module signing. This change broke iwd and was subsequently completely reverted in commit `203a6763ab` ("Revert "crypto: pkcs7 - remove sha1 support""). However, dropping only the support for using SHA-1 for module signing is unrelated and can still be done separately. Note that this change only removes support for new modules to be SHA-1 signed, but already signed modules can still be loaded. Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Reviewed-by: Aaron Tomlin <atomlin@atomlin.com> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Sami Tolvanen <samitolvanen@google.com>	2025-12-22 16:35:53 +00:00
Marco Crivellari	581ac2d4a5	module: replace use of system_wq with system_dfl_wq Currently if a user enqueues a work item using schedule_delayed_work() the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work() that is using system_wq and queue_work(), that makes use again of WORK_CPU_UNBOUND. This lack of consistency cannot be addressed without refactoring the API. This continues the effort to refactor workqueue APIs, which began with the introduction of new workqueues and a new alloc_workqueue flag in: commit `128ea9f6cc` ("workqueue: Add system_percpu_wq and system_dfl_wq") commit `930c2ea566` ("workqueue: Add new WQ_PERCPU flag") Switch to using system_dfl_wq, the new unbound workqueue, because the users do not benefit from a per-cpu workqueue. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Reviewed-by: Petr Pavlu <petr.pavlu@suse.com> Signed-off-by: Sami Tolvanen <samitolvanen@google.com>	2025-12-22 16:35:53 +00:00
Petr Pavlu	3cb0c3bdea	params: Replace __modinit with __init_or_module Remove the custom __modinit macro from kernel/params.c and instead use the common __init_or_module macro from include/linux/module.h. Both provide the same functionality. Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Reviewed-by: Aaron Tomlin <atomlin@atomlin.com> Reviewed-by: Daniel Gomez <da.gomez@samsung.com> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Sami Tolvanen <samitolvanen@google.com>	2025-12-22 16:35:53 +00:00
Linus Torvalds	610192c229	Fix IRQ thread affinity flags setup regression. Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmlHrpkRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1hBGBAAuZG/SpA5k4Aexg3rfisxfenj4s4qeTO3 HqMXKU5M1zEWOlG2RUx/+Gn0RCb2U/eHzwAXtu85TJjYPQ3tCN+G49/VbbKZm3/J wM1gvGvpdX9miSi5S2e6Ete0zf0bYXCP4dJOreekyzzngXd9lqncBA/u+vimt00l 1axg7osB73puSKjeg/g8t3CIjRvQ6XgD9HC5dAlUebmFQz6KSgM6yZXac/SXbYqn nyGN9EIq7HegrBAOLW16viUK5dq5WmmUS02MA728soLEwwxn/yDr44MjPaqPAJlQ 2cwDHiKvZRtzI9wDlPaVH/Dg2OoXvrEEjrjfrzTWpjHn6p8jjluuUYBqtTIJF+g7 rUnbtVz4GwOMg7SetMHW9YOMo3R0LXC5ly6RFycRdYqgdnEwTML5Jh4qjxWBiqJD 98DtPGxUEdGKAUkOE4wLSEJy9EgjSXMm0nc8V6MJMl6afEoKBSU/w73nxc+VVHvw x7B/G/qXhJdSqO1aEftHbeZUlpOc2NSMzMXEXqCfE7B/S3vfCbC0PSLQnEq90AjY yTT7gI9pI37ULMqqf1b0pfosQuBu0cdkT5r++RFuaZLRl/kBefIyRiWcgwNDAs5C /YNeSH+qU043/f55vOoHna8B+W+VSkTGZCQpWcYcRhYCxVIWupdY1WBH9AqBg1R4 Ar81aRZK2hI= =HT4H -----END PGP SIGNATURE----- Merge tag 'irq-urgent-2025-12-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fix from Ingo Molnar: "Fix IRQ thread affinity flags setup regression" * tag 'irq-urgent-2025-12-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: genirq: Don't overwrite interrupt thread flags on setup	2025-12-21 14:34:13 -08:00
Matt Bobrowski	94e948b7e6	bpf: annotate file argument as __nullable in bpf_lsm_mmap_file As reported in [0], anonymous memory mappings are not backed by a struct file instance. Consequently, the struct file pointer passed to the security_mmap_file() LSM hook is NULL in such cases. The BPF verifier is currently unaware of this, allowing BPF LSM programs to dereference this struct file pointer without needing to perform an explicit NULL check. This leads to potential NULL pointer dereference and a kernel crash. Add a strong override for bpf_lsm_mmap_file() which annotates the struct file pointer parameter with the __nullable suffix. This explicitly informs the BPF verifier that this pointer (PTR_MAYBE_NULL) can be NULL, forcing BPF LSM programs to perform a check on it before dereferencing it. [0] https://lore.kernel.org/bpf/5e460d3c.4c3e9.19adde547d8.Coremail.kaiyanm@hust.edu.cn/ Reported-by: Kaiyan Mei <M202472210@hust.edu.cn> Reported-by: Yinhao Hu <dddddd@hust.edu.cn> Reviewed-by: Dongliang Mu <dzm91@hust.edu.cn> Closes: https://lore.kernel.org/bpf/5e460d3c.4c3e9.19adde547d8.Coremail.kaiyanm@hust.edu.cn/ Signed-off-by: Matt Bobrowski <mattbobrowski@google.com> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20251216133000.3690723-1-mattbobrowski@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-12-21 10:56:33 -08:00
Puranjay Mohan	c3e34f88f9	bpf: arm64: Optimize recursion detection by not using atomics BPF programs detect recursion using a per-CPU 'active' flag in struct bpf_prog. The trampoline currently sets/clears this flag with atomic operations. On some arm64 platforms (e.g., Neoverse V2 with LSE), per-CPU atomic operations are relatively slow. Unlike x86_64 - where per-CPU updates can avoid cross-core atomicity, arm64 LSE atomics are always atomic across all cores, which is unnecessary overhead for strictly per-CPU state. This patch removes atomics from the recursion detection path on arm64 by changing 'active' to a per-CPU array of four u8 counters, one per context: {NMI, hard-irq, soft-irq, normal}. The running context uses a non-atomic increment/decrement on its element. After increment, recursion is detected by reading the array as a u32 and verifying that only the expected element changed; any change in another element indicates inter-context recursion, and a value > 1 in the same element indicates same-context recursion. For example, starting from {0,0,0,0}, a normal-context trigger changes the array to {0,0,0,1}. If an NMI arrives on the same CPU and triggers the program, the array becomes {1,0,0,1}. When the NMI context checks the u32 against the expected mask for normal (0x00000001), it observes 0x01000001 and correctly reports recursion. Same-context recursion is detected analogously. Acked-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Link: https://lore.kernel.org/r/20251219184422.2899902-3-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-12-21 10:54:37 -08:00
Puranjay Mohan	93f0d09697	bpf: move recursion detection logic to helpers BPF programs detect recursion by doing atomic inc/dec on a per-cpu active counter from the trampoline. Create two helpers for operations on this active counter, this makes it easy to changes the recursion detection logic in future. This commit makes no functional changes. Acked-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Link: https://lore.kernel.org/r/20251219184422.2899902-2-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-12-21 10:54:37 -08:00
Zqiang	12494e5e2a	sched_ext: Fix some comments in ext.c This commit update balance_scx() in the comments to balance_one(). Signed-off-by: Zqiang <qiang.zhang@linux.dev> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-12-19 13:11:22 -10:00
Peter Zijlstra	6ab7973f25	sched/fair: Fix sched_avg fold After the robot reported a regression wrt commit: `089d84203a` ("sched/fair: Fold the sched_avg update"), Shrikanth noted that two spots missed a factor se_weight(). Fixes: `089d84203a` ("sched/fair: Fold the sched_avg update") Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202512181208.753b9f6e-lkp@intel.com Debugged-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251218102020.GO3707891@noisy.programming.kicks-ass.net	2025-12-19 09:09:38 +01:00
Peter Zijlstra	01122b8936	perf: Use EXPORT_SYMBOL_FOR_KVM() for the mediated APIs Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251208115156.GE3707891@noisy.programming.kicks-ass.net	2025-12-19 08:54:59 +01:00
Greg Kroah-Hartman	90876d9b37	irqdomain: Fix up const problem in irq_domain_set_name() In irq_domain_set_name() a const pointer is passed in, and then the const is "lost" when container_of() is called. Fix this up by properly preserving the const pointer attribute when container_of() is used to enforce the fact that this pointer should not have anything at it changed. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/2025121731-facing-unhitched-63ae@gregkh	2025-12-19 00:39:39 +01:00
Linus Torvalds	dd9b004b7f	tracing fixes for v6.19: - Add Documentation/core-api/tracepoint.rst to TRACING in MAINTAINERS file Updates to the tracepoint.rst document should be reviewed by the tracing maintainers. - Fix warning triggered by perf attaching to synthetic events The synthetic events do not add a function to be registered when perf attaches to them. This causes a warning when perf registers a synthetic event and passes a NULL pointer to the tracepoint register function. Ideally synthetic events should be updated to work with perf, but as that's a feature and not a bug fix, simply now return -ENODEV when perf tries to register an event that has a NULL pointer for its function. This no longer causes a kernel warning and simply causes the perf code to fail with an error message. - Fix 32bit overflow in option flag test The option's flags changed from 32 bits in size to 64 bits in size. Fix one of the places that shift 1 by the option bit number to to be 1ULL. - Fix the output of printing the direct jmp functions The enabled_functions that shows how functions are being attached by ftrace wasn't updated to accommodate the new direct jmp trampolines that set the LSB of the pointer, and outputs garbage. Update the output to handle the direct jmp trampolines. -----BEGIN PGP SIGNATURE----- iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaURuVxQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qhU1AQCe3AwEiPg4PAooR7D3TDgj4ypJXUwS J7PM0cvniKhJ2AD/VpSQCyj7sB8UQdVE04SK3hxduVwHIzrBhDe2voC8eQQ= =C9zF -----END PGP SIGNATURE----- Merge tag 'trace-v6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Add Documentation/core-api/tracepoint.rst to TRACING in MAINTAINERS file Updates to the tracepoint.rst document should be reviewed by the tracing maintainers. - Fix warning triggered by perf attaching to synthetic events The synthetic events do not add a function to be registered when perf attaches to them. This causes a warning when perf registers a synthetic event and passes a NULL pointer to the tracepoint register function. Ideally synthetic events should be updated to work with perf, but as that's a feature and not a bug fix, simply now return -ENODEV when perf tries to register an event that has a NULL pointer for its function. This no longer causes a kernel warning and simply causes the perf code to fail with an error message. - Fix 32bit overflow in option flag test The option's flags changed from 32 bits in size to 64 bits in size. Fix one of the places that shift 1 by the option bit number to to be 1ULL. - Fix the output of printing the direct jmp functions The enabled_functions that shows how functions are being attached by ftrace wasn't updated to accommodate the new direct jmp trampolines that set the LSB of the pointer, and outputs garbage. Update the output to handle the direct jmp trampolines. * tag 'trace-v6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: ftrace: Fix address for jmp mode in t_show() tracing: Fix UBSAN warning in __remove_instance() tracing: Do not register unsupported perf events MAINTAINERS: add tracepoint core-api doc files to TRACING	2025-12-19 09:30:55 +12:00
Linus Torvalds	7b8e9264f5	Including fixes from netfilter and CAN. Current release - regressions: - netfilter: nf_conncount: fix leaked ct in error paths - sched: act_mirred: fix loop detection - sctp: fix potential deadlock in sctp_clone_sock() - can: fix build dependency - eth: mlx5e: do not update BQL of old txqs during channel reconfiguration Previous releases - regressions: - sched: ets: always remove class from active list before deleting it - inet: frags: flush pending skbs in fqdir_pre_exit() - netfilter: nf_nat: remove bogus direction check - mptcp: - schedule rtx timer only after pushing data - avoid deadlock on fallback while reinjecting - can: gs_usb: fix error handling - eth: mlx5e: - avoid unregistering PSP twice - fix double unregister of HCA_PORTS component - eth: bnxt_en: fix XDP_TX path - eth: mlxsw: fix use-after-free when updating multicast route stats Previous releases - always broken: - ethtool: avoid overflowing userspace buffer on stats query - openvswitch: fix middle attribute validation in push_nsh() action - eth: mlx5: fw_tracer, validate format string parameters - eth: mlxsw: spectrum_router: fix neighbour use-after-free - eth: ipvlan: ignore PACKET_LOOPBACK in handle_mode_l2() Misc: - Jozsef Kadlecsik retires from maintaining netfilter - tools: ynl: fix build on systems with old kernel headers Signed-off-by: Paolo Abeni <pabeni@redhat.com> -----BEGIN PGP SIGNATURE----- iQJGBAABCgAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmlEPS0SHHBhYmVuaUBy ZWRoYXQuY29tAAoJECkkeY3MjxOkHLEP/3TnVI9qLivOGsZ48bn5UUcIaIlX0wiw i5gwpUhbtW3zcLSphO0Nh/CmProme6dFhaMOOkk48bKAdIxRpOMbCC20bfYDLyd0 ZJTUQqheKpuI23vOnhfs2TTOkcqz6cM7txUq681taQ8Nfo8Dbf0fgOO2HT5ljRD+ JXlxrdvicZDtve68sSdnsAj5M15EPQGKTPMyPkymBmCKnCMQK9SQKaeTEtaW6NIO yM0KqDSNoulDc/6LYMhx8DTtyE7yTiTxwe2NixjdyYljXsk0KiRDirCzxnZuSgh8 oh+cFLdFDq4mwdYEKjgC5c3ifQyLEpZvwzlY5MKoobsVnT5SbigeSK53l5rEkO3V sM84lITHfqTJvlid0AF/ixEc6iWwV7nGRBHh2FXNbfoIKt45eF77jPi9YFsq6Z95 vlCzYIbY0f2L1y3mPvZbzGQbh2Z12b5kyK8QA1j7SK+zNzxgXbf4+ZtYxHg7O3Ne gecmIpTKXMWodaZyfsRQPjR/F6UIlMqsgl9Ci9bfUw+XwL8x+7bJxQAQz+yVjLla ng14BItiYKBavcPZBjYlhGKqD1fzGhVZqQecrCkF0VTbMusRd9RcwytU1NG4QGDx V5aL28ht85KtMednEWOBkrg+PeXnNyZHzLAf2Xtx3UkaGgiDC8G4IxUv3orlOLD1 sFPfZnSiGCof =6pla -----END PGP SIGNATURE----- Merge tag 'net-6.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from netfilter and CAN. Current release - regressions: - netfilter: nf_conncount: fix leaked ct in error paths - sched: act_mirred: fix loop detection - sctp: fix potential deadlock in sctp_clone_sock() - can: fix build dependency - eth: mlx5e: do not update BQL of old txqs during channel reconfiguration Previous releases - regressions: - sched: ets: always remove class from active list before deleting it - inet: frags: flush pending skbs in fqdir_pre_exit() - netfilter: nf_nat: remove bogus direction check - mptcp: - schedule rtx timer only after pushing data - avoid deadlock on fallback while reinjecting - can: gs_usb: fix error handling - eth: - mlx5e: - avoid unregistering PSP twice - fix double unregister of HCA_PORTS component - bnxt_en: fix XDP_TX path - mlxsw: fix use-after-free when updating multicast route stats Previous releases - always broken: - ethtool: avoid overflowing userspace buffer on stats query - openvswitch: fix middle attribute validation in push_nsh() action - eth: - mlx5: fw_tracer, validate format string parameters - mlxsw: spectrum_router: fix neighbour use-after-free - ipvlan: ignore PACKET_LOOPBACK in handle_mode_l2() Misc: - Jozsef Kadlecsik retires from maintaining netfilter - tools: ynl: fix build on systems with old kernel headers" * tag 'net-6.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (83 commits) net: hns3: add VLAN id validation before using net: hns3: using the num_tqps to check whether tqp_index is out of range when vf get ring info from mbx net: hns3: using the num_tqps in the vf driver to apply for resources net: enetc: do not transmit redirected XDP frames when the link is down selftests/tc-testing: Test case exercising potential mirred redirect deadlock net/sched: act_mirred: fix loop detection sctp: Clear inet_opt in sctp_v6_copy_ip_options(). sctp: Fetch inet6_sk() after setting ->pinet6 in sctp_clone_sock(). net/handshake: duplicate handshake cancellations leak socket net/mlx5e: Don't include PSP in the hard MTU calculations net/mlx5e: Do not update BQL of old txqs during channel reconfiguration net/mlx5e: Trigger neighbor resolution for unresolved destinations net/mlx5e: Use ip6_dst_lookup instead of ipv6_dst_lookup_flow for MAC init net/mlx5: Serialize firmware reset with devlink net/mlx5: fw_tracer, Handle escaped percent properly net/mlx5: fw_tracer, Validate format string parameters net/mlx5: Drain firmware reset in shutdown callback net/mlx5: fw reset, clear reset requested on drain_fw_reset net: dsa: mxl-gsw1xx: manually clear RANEG bit net: dsa: mxl-gsw1xx: fix .shutdown driver operation ...	2025-12-19 07:55:35 +12:00
Chen Ridong	7cc1720589	cpuset: remove v1-specific code from generate_sched_domains Following the introduction of cpuset1_generate_sched_domains() for v1 in the previous patch, v1-specific logic can now be removed from the generic generate_sched_domains(). This patch cleans up the v1-only code and ensures uf_node is only visible when CONFIG_CPUSETS_V1=y. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-12-18 08:36:15 -10:00
Chen Ridong	6e1d31ce49	cpuset: separate generate_sched_domains for v1 and v2 The generate_sched_domains() function currently handles both v1 and v2 logic. However, the underlying mechanisms for building scheduler domains differ significantly between the two versions. For cpuset v2, scheduler domains are straightforwardly derived from valid partitions, whereas cpuset v1 employs a more complex union-find algorithm to merge overlapping cpusets. Co-locating these implementations complicates maintenance. This patch, along with subsequent ones, aims to separate the v1 and v2 logic. For ease of review, this patch first copies the generate_sched_domains() function into cpuset-v1.c as cpuset1_generate_sched_domains() and removes v2-specific code. Common helpers and top_cpuset are declared in cpuset-internal.h. When operating in v1 mode, the code now calls cpuset1_generate_sched_domains(). Currently there is some code duplication, which will be largely eliminated once v1-specific code is removed from v2 in the following patch. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-12-18 08:36:15 -10:00
Chen Ridong	cb33f8814c	cpuset: move update_domain_attr_tree to cpuset_v1.c Since relax_domain_level is only applicable to v1, move update_domain_attr_tree() to cpuset-v1.c, which solely updates relax_domain_level, Additionally, relax_domain_level is now initialized in cpuset1_inited. Accordingly, the initialization of relax_domain_level in top_cpuset is removed. The unnecessary remote_partition initialization in top_cpuset is also cleaned up. As a result, relax_domain_level can be defined in cpuset only when CONFIG_CPUSETS_V1=y. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-12-18 08:36:15 -10:00
Chen Ridong	4ef42c645f	cpuset: add cpuset1_init helper for v1 initialization This patch introduces the cpuset1_init helper in cpuset_v1.c to initialize v1-specific fields, including the fmeter and relax_domain_level members. The relax_domain_level related code will be moved to cpuset_v1.c in a subsequent patch. After this move, v1-specific members will only be visible when CONFIG_CPUSETS_V1=y. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-12-18 08:36:15 -10:00
Chen Ridong	56805c1bb1	cpuset: add cpuset1_online_css helper for v1-specific operations This commit introduces the cpuset1_online_css helper to centralize v1-specific handling during cpuset online. It performs operations such as updating the CS_SPREAD_PAGE, CS_SPREAD_SLAB, and CGRP_CPUSET_CLONE_CHILDREN flags, which are unique to the cpuset v1 control group interface. The helper is now placed in cpuset-v1.c to maintain clear separation between v1 and v2 logic. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-12-18 08:36:08 -10:00
Chen Ridong	14c11e1b2a	cpuset: add lockdep_assert_cpuset_lock_held helper Add lockdep_assert_cpuset_lock_held() to allow other subsystems to verify that cpuset_mutex is held. Suggested-by: Waiman Long <longman@redhat.com> Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-12-18 08:35:57 -10:00
Chen Ridong	aa7d3a56a2	cpuset: fix warning when disabling remote partition A warning was triggered as follows: WARNING: kernel/cgroup/cpuset.c:1651 at remote_partition_disable+0xf7/0x110 RIP: 0010:remote_partition_disable+0xf7/0x110 RSP: 0018:ffffc90001947d88 EFLAGS: 00000206 RAX: 0000000000007fff RBX: ffff888103b6e000 RCX: 0000000000006f40 RDX: 0000000000006f00 RSI: ffffc90001947da8 RDI: ffff888103b6e000 RBP: ffff888103b6e000 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000001 R11: ffff88810b2e2728 R12: ffffc90001947da8 R13: 0000000000000000 R14: ffffc90001947da8 R15: ffff8881081f1c00 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f55c8bbe0b2 CR3: 000000010b14c000 CR4: 00000000000006f0 Call Trace: <TASK> update_prstate+0x2d3/0x580 cpuset_partition_write+0x94/0xf0 kernfs_fop_write_iter+0x147/0x200 vfs_write+0x35d/0x500 ksys_write+0x66/0xe0 do_syscall_64+0x6b/0x390 entry_SYSCALL_64_after_hwframe+0x4b/0x53 RIP: 0033:0x7f55c8cd4887 Reproduction steps (on a 16-CPU machine): # cd /sys/fs/cgroup/ # mkdir A1 # echo +cpuset > A1/cgroup.subtree_control # echo "0-14" > A1/cpuset.cpus.exclusive # mkdir A1/A2 # echo "0-14" > A1/A2/cpuset.cpus.exclusive # echo "root" > A1/A2/cpuset.cpus.partition # echo 0 > /sys/devices/system/cpu/cpu15/online # echo member > A1/A2/cpuset.cpus.partition When CPU 15 is offlined, subpartitions_cpus gets cleared because no CPUs remain available for the top_cpuset, forcing partitions to share CPUs with the top_cpuset. In this scenario, disabling the remote partition triggers a warning stating that effective_xcpus is not a subset of subpartitions_cpus. Partitions should be invalidated in this case to inform users that the partition is now invalid(cpus are shared with top_cpuset). To fix this issue: 1. Only emit the warning only if subpartitions_cpus is not empty and the effective_xcpus is not a subset of subpartitions_cpus. 2. During the CPU hotplug process, invalidate partitions if subpartitions_cpus is empty. Fixes: `f62a5d3936` ("cgroup/cpuset: Remove remote_partition_check() & make update_cpumasks_hier() handle remote partition") Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-12-18 06:38:54 -10:00
John Stultz	de2c5a1523	test-ww_mutex: Allow test to be run (and re-run) from userland In cases where the ww_mutex test was occasionally tripping on hard to find issues, leaving qemu in a reboot loop was my best way to reproduce problems. These reboots however wasted time when I just wanted to run the test-ww_mutex logic. So tweak the test-ww_mutex test so that it can be re-triggered via a sysfs file, so the test can be run repeatedly without doing module loads or restarting. This has been particularly valuable to stressing and finding issues with the proxy-exec series. To use, run as root: echo 1 > /sys/kernel/test_ww_mutex/run_tests Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251205013515.759030-4-jstultz@google.com	2025-12-18 10:45:23 +01:00
John Stultz	d327e7166e	test-ww_mutex: Move work to its own UNBOUND workqueue The test-ww_mutex test already allocates its own workqueue so be sure to use it for the mtx.work and abba.work rather then the default system workqueue. This resolves numerous messages of the sort: "workqueue: test_abba_work hogged CPU... consider switching to WQ_UNBOUND" "workqueue: test_mutex_work hogged CPU... consider switching to WQ_UNBOUND" Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251205013515.759030-3-jstultz@google.com	2025-12-18 10:45:23 +01:00
John Stultz	34d80c93a5	test-ww_mutex: Extend ww_mutex tests to test both classes of ww_mutexes Currently the test-ww_mutex tool only utilizes the wait-die class of ww_mutexes, and thus isn't very helpful in exercising the wait-wound class of ww_mutexes. So extend the test to exercise both classes of ww_mutexes for all of the subtests. Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251205013515.759030-2-jstultz@google.com	2025-12-18 10:45:23 +01:00
Menglong Dong	39263f986d	ftrace: Fix address for jmp mode in t_show() The address from ftrace_find_rec_direct() is printed directly in t_show(). This can mislead symbol offsets if it has the "jmp" bit in the last bit. Fix this by printing the address that returned by ftrace_jmp_get(). Link: https://patch.msgid.link/20251217030053.80343-1-dongml2@chinatelecom.cn Fixes: `25e4e3565d` ("ftrace: Introduce FTRACE_OPS_FL_JMP") Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-12-17 17:53:59 -05:00
Darrick J. Wong	74bf97e9a8	tracing: Fix UBSAN warning in __remove_instance() xfs/558 triggers the following UBSAN warning: ------------[ cut here ]------------ UBSAN: shift-out-of-bounds in kernel/trace/trace.c:10510:10 shift exponent 32 is too large for 32-bit type 'int' CPU: 1 UID: 0 PID: 888674 Comm: rmdir Not tainted 6.19.0-rc1-xfsx #rc1 PREEMPT(lazy) dbf607ef4c142c563f76d706e71af9731d7b9c90 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-4.module+el8.8.0+21164+ed375313 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x4a/0x70 ubsan_epilogue+0x5/0x2b __ubsan_handle_shift_out_of_bounds.cold+0x5e/0x113 __remove_instance.part.0.constprop.0.cold+0x18/0x26f instance_rmdir+0xf3/0x110 tracefs_syscall_rmdir+0x4d/0x90 vfs_rmdir+0x139/0x230 do_rmdir+0x143/0x230 __x64_sys_rmdir+0x1d/0x20 do_syscall_64+0x44/0x230 entry_SYSCALL_64_after_hwframe+0x4b/0x53 RIP: 0033:0x7f7ae8e51f17 Code: f0 ff ff 73 01 c3 48 8b 0d de 2e 0e 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 b8 54 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 01 c3 48 8b 15 b1 2e 0e 00 f7 d8 64 89 02 b8 RSP: 002b:00007ffd90743f08 EFLAGS: 00000246 ORIG_RAX: 0000000000000054 RAX: ffffffffffffffda RBX: 00007ffd907440f8 RCX: 00007f7ae8e51f17 RDX: 00007f7ae8f3c5c0 RSI: 00007ffd90744a21 RDI: 00007ffd90744a21 RBP: 0000000000000002 R08: 0000000000000000 R09: 0000000000000000 R10: 00007f7ae8f35ac0 R11: 0000000000000246 R12: 00007ffd90744a21 R13: 0000000000000001 R14: 00007f7ae8f8b000 R15: 000055e5283e6a98 </TASK> ---[ end trace ]--- whilst tearing down an ftrace instance. TRACE_FLAGS_MAX_SIZE is now 64bit, so the mask comparison expression must be typecast to a u64 value to avoid an overflow. AFAICT, ZEROED_TRACE_FLAGS is already cast to ULL so this is ok. Link: https://patch.msgid.link/20251216174950.GA7705@frogsfrogsfrogs Fixes: `bbec8e28ca` ("tracing: Allow tracer to add more than 32 options") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-12-17 17:50:04 -05:00
Steven Rostedt	ef7f38df89	tracing: Do not register unsupported perf events Synthetic events currently do not have a function to register perf events. This leads to calling the tracepoint register functions with a NULL function pointer which triggers: ------------[ cut here ]------------ WARNING: kernel/tracepoint.c:175 at tracepoint_add_func+0x357/0x370, CPU#2: perf/2272 Modules linked in: kvm_intel kvm irqbypass CPU: 2 UID: 0 PID: 2272 Comm: perf Not tainted 6.18.0-ftest-11964-ge022764176fc-dirty #323 PREEMPTLAZY Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.17.0-debian-1.17.0-1 04/01/2014 RIP: 0010:tracepoint_add_func+0x357/0x370 Code: 28 9c e8 4c 0b f5 ff eb 0f 4c 89 f7 48 c7 c6 80 4d 28 9c e8 ab 89 f4 ff 31 c0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 cc cc cc cc cc <0f> 0b 49 c7 c6 ea ff ff ff e9 ee fe ff ff 0f 0b e9 f9 fe ff ff 0f RSP: 0018:ffffabc0c44d3c40 EFLAGS: 00010246 RAX: 0000000000000001 RBX: ffff9380aa9e4060 RCX: 0000000000000000 RDX: 000000000000000a RSI: ffffffff9e1d4a98 RDI: ffff937fcf5fd6c8 RBP: 0000000000000001 R08: 0000000000000007 R09: ffff937fcf5fc780 R10: 0000000000000003 R11: ffffffff9c193910 R12: 000000000000000a R13: ffffffff9e1e5888 R14: 0000000000000000 R15: ffffabc0c44d3c78 FS: 00007f6202f5f340(0000) GS:ffff93819f00f000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055d3162281a8 CR3: 0000000106a56003 CR4: 0000000000172ef0 Call Trace: <TASK> tracepoint_probe_register+0x5d/0x90 synth_event_reg+0x3c/0x60 perf_trace_event_init+0x204/0x340 perf_trace_init+0x85/0xd0 perf_tp_event_init+0x2e/0x50 perf_try_init_event+0x6f/0x230 ? perf_event_alloc+0x4bb/0xdc0 perf_event_alloc+0x65a/0xdc0 __se_sys_perf_event_open+0x290/0x9f0 do_syscall_64+0x93/0x7b0 ? entry_SYSCALL_64_after_hwframe+0x76/0x7e ? trace_hardirqs_off+0x53/0xc0 entry_SYSCALL_64_after_hwframe+0x76/0x7e Instead, have the code return -ENODEV, which doesn't warn and has perf error out with: # perf record -e synthetic:futex_wait Error: The sys_perf_event_open() syscall returned with 19 (No such device) for event (synthetic:futex_wait). "dmesg \| grep -i perf" may provide additional information. Ideally perf should support synthetic events, but for now just fix the warning. The support can come later. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Link: https://patch.msgid.link/20251216182440.147e4453@gandalf.local.home Fixes: `4b147936fa` ("tracing: Add support for 'synthetic' events") Reported-by: Ian Rogers <irogers@google.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-12-17 15:47:35 -05:00
Peter Zijlstra	3cb3c2f688	perf: Clean up mediated vPMU accounting The mediated_pmu_account_event() and perf_create_mediated_pmu() functions implement the exclusion between '!exclude_guest' counters and mediated vPMUs. Their implementation is basically identical, except mirrored in what they count/check. Make sure the actual implementations reflect this similarity. Notably: - while perf_release_mediated_pmu() has an underflow check; mediated_pmu_unaccount_event() did not. - while perf_create_mediated_pmu() has an inc_not_zero() path; mediated_pmu_account_event() did not. Also, the inc_not_zero() path can be outsite of perf_mediated_pmu_mutex. The mutex must guard the 0->1 (of either nr_include_guest_events or nr_mediated_pmu_vms) transition, but once a counter is already non-zero, it can safely be incremented further. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251208115156.GE3707891@noisy.programming.kicks-ass.net	2025-12-17 13:31:09 +01:00
Jens Remus	2652f9a4b0	unwind_user/fp: Use dummies instead of ifdef This simplifies the code. unwind_user_next_fp() does not need to return -EINVAL if config option HAVE_UNWIND_USER_FP is disabled, as unwind_user_start() will then not select this unwind method and unwind_user_next() will therefore not call it. Provide (1) a dummy definition of ARCH_INIT_USER_FP_FRAME, if the unwind user method HAVE_UNWIND_USER_FP is not enabled, (2) a common fallback definition of unwind_user_at_function_start() which returns false, and (3) a common dummy definition of ARCH_INIT_USER_FP_ENTRY_FRAME. Note that enabling the config option HAVE_UNWIND_USER_FP without defining ARCH_INIT_USER_FP_FRAME triggers a compile error, which is helpful when implementing support for this unwind user method in an architecture. Enabling the config option when providing an arch- specific unwind_user_at_function_start() definition makes it necessary to also provide an arch-specific ARCH_INIT_USER_FP_ENTRY_FRAME definition. Signed-off-by: Jens Remus <jremus@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251208160352.1363040-3-jremus@linux.ibm.com	2025-12-17 13:31:07 +01:00
Jens Remus	2d6ad925fb	unwind_user: Enhance comments on get CFA, FP, and RA Move the comment "Get the Canonical Frame Address (CFA)" to the top of the sequence of statements that actually get the CFA. Reword the comment "Find the Return Address (RA)" to "Get ...", as the statements actually get the RA. Add a respective comment to the statements that get the FP. This will be useful once future commits extend the logic to get the RA and FP. While at it align the comment on the "stack going in wrong direction" check to the following one on the "address is word aligned" check. Signed-off-by: Jens Remus <jremus@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251208160352.1363040-2-jremus@linux.ibm.com	2025-12-17 13:31:07 +01:00
Sean Christopherson	a05385d84b	perf/x86/core: Register a new vector for handling mediated guest PMIs Wire up system vector 0xf5 for handling PMIs (i.e. interrupts delivered through the LVTPC) while running KVM guests with a mediated PMU. Perf currently delivers all PMIs as NMIs, e.g. so that events that trigger while IRQs are disabled aren't delayed and generate useless records, but due to the multiplexing of NMIs throughout the system, correctly identifying NMIs for a mediated PMU is practically infeasible. To (greatly) simplify identifying guest mediated PMU PMIs, perf will switch the CPU's LVTPC between PERF_GUEST_MEDIATED_PMI_VECTOR and NMI when guest PMU context is loaded/put. I.e. PMIs that are generated by the CPU while the guest is active will be identified purely based on the IRQ vector. Route the vector through perf, e.g. as opposed to letting KVM attach a handler directly a la posted interrupt notification vectors, as perf owns the LVTPC and thus is the rightful owner of PERF_GUEST_MEDIATED_PMI_VECTOR. Functionally, having KVM directly own the vector would be fine (both KVM and perf will be completely aware of when a mediated PMU is active), but would lead to an undesirable split in ownership: perf would be responsible for installing the vector, but not handling the resulting IRQs. Add a new perf_guest_info_callbacks hook (and static call) to allow KVM to register its handler with perf when running guests with mediated PMUs. Note, because KVM always runs guests with host IRQs enabled, there is no danger of a PMI being delayed from the guest's perspective due to using a regular IRQ instead of an NMI. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Xudong Hao <xudong.hao@intel.com> Link: https://patch.msgid.link/20251206001720.468579-9-seanjc@google.com	2025-12-17 13:31:05 +01:00
Kan Liang	42457a7fb6	perf: Add APIs to load/put guest mediated PMU context Add exported APIs to load/put a guest mediated PMU context. KVM will load the guest PMU shortly before VM-Enter, and put the guest PMU shortly after VM-Exit. On the perf side of things, schedule out all exclude_guest events when the guest context is loaded, and schedule them back in when the guest context is put. I.e. yield the hardware PMU resources to the guest, by way of KVM. Note, perf is only responsible for managing host context. KVM is responsible for loading/storing guest state to/from hardware. [sean: shuffle patches around, write changelog] Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Mingwei Zhang <mizhang@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Xudong Hao <xudong.hao@intel.com> Link: https://patch.msgid.link/20251206001720.468579-8-seanjc@google.com	2025-12-17 13:31:05 +01:00
Kan Liang	4593b4b6e2	perf: Add a EVENT_GUEST flag Current perf doesn't explicitly schedule out all exclude_guest events while the guest is running. There is no problem with the current emulated vPMU. Because perf owns all the PMU counters. It can mask the counter which is assigned to an exclude_guest event when a guest is running (Intel way), or set the corresponding HOSTONLY bit in evsentsel (AMD way). The counter doesn't count when a guest is running. However, either way doesn't work with the introduced mediated vPMU. A guest owns all the PMU counters when it's running. The host should not mask any counters. The counter may be used by the guest. The evsentsel may be overwritten. Perf should explicitly schedule out all exclude_guest events to release the PMU resources when entering a guest, and resume the counting when exiting the guest. It's possible that an exclude_guest event is created when a guest is running. The new event should not be scheduled in as well. The ctx time is shared among different PMUs. The time cannot be stopped when a guest is running. It is required to calculate the time for events from other PMUs, e.g., uncore events. Add timeguest to track the guest run time. For an exclude_guest event, the elapsed time equals the ctx time - guest time. Cgroup has dedicated times. Use the same method to deduct the guest time from the cgroup time as well. [sean: massage comments] Co-developed-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Mingwei Zhang <mizhang@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Xudong Hao <xudong.hao@intel.com> Link: https://patch.msgid.link/20251206001720.468579-7-seanjc@google.com	2025-12-17 13:31:05 +01:00
Kan Liang	f5c7de8f84	perf: Clean up perf ctx time The current perf tracks two timestamps for the normal ctx and cgroup. The same type of variables and similar codes are used to track the timestamps. In the following patch, the third timestamp to track the guest time will be introduced. To avoid the code duplication, add a new struct perf_time_ctx and factor out a generic function update_perf_time_ctx(). No functional change. Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Mingwei Zhang <mizhang@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Xudong Hao <xudong.hao@intel.com> Link: https://patch.msgid.link/20251206001720.468579-6-seanjc@google.com	2025-12-17 13:31:04 +01:00
Kan Liang	eff95e1702	perf: Add APIs to create/release mediated guest vPMUs Currently, exposing PMU capabilities to a KVM guest is done by emulating guest PMCs via host perf events, i.e. by having KVM be "just" another user of perf. As a result, the guest and host are effectively competing for resources, and emulating guest accesses to vPMU resources requires expensive actions (expensive relative to the native instruction). The overhead and resource competition results in degraded guest performance and ultimately very poor vPMU accuracy. To address the issues with the perf-emulated vPMU, introduce a "mediated vPMU", where the data plane (PMCs and enable/disable knobs) is exposed directly to the guest, but the control plane (event selectors and access to fixed counters) is managed by KVM (via MSR interceptions). To allow host perf usage of the PMU to (partially) co-exist with KVM/guest usage of the PMU, KVM and perf will coordinate to a world switch between host perf context and guest vPMU context near VM-Enter/VM-Exit. Add two exported APIs, perf_{create,release}_mediated_pmu(), to allow KVM to create and release a mediated PMU instance (per VM). Because host perf context will be deactivated while the guest is running, mediated PMU usage will be mutually exclusive with perf analysis of the guest, i.e. perf events that do NOT exclude the guest will not behave as expected. To avoid silent failure of !exclude_guest perf events, disallow creating a mediated PMU if there are active !exclude_guest events, and on the perf side, disallowing creating new !exclude_guest perf events while there is at least one active mediated PMU. Exempt PMU resources that do not support mediated PMU usage, i.e. that are outside the scope/view of KVM's vPMU and will not be swapped out while the guest is running. Guard mediated PMU with a new kconfig to help readers identify code paths that are unique to mediated PMU support, and to allow for adding arch- specific hooks without stubs. KVM x86 is expected to be the only KVM architecture to support a mediated PMU in the near future (e.g. arm64 is trending toward a partitioned PMU implementation), and KVM x86 will select PERF_GUEST_MEDIATED_PMU unconditionally, i.e. won't need stubs. Immediately select PERF_GUEST_MEDIATED_PMU when KVM x86 is enabled so that all paths are compile tested. Full KVM support is on its way... [sean: add kconfig and WARNing, rewrite changelog, swizzle patch ordering] Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Mingwei Zhang <mizhang@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Xudong Hao <xudong.hao@intel.com> Link: https://patch.msgid.link/20251206001720.468579-5-seanjc@google.com	2025-12-17 13:31:04 +01:00
Sean Christopherson	991bdf7e9d	perf: Move security_perf_event_free() call to __free_event() Move the freeing of any security state associated with a perf event from _free_event() to __free_event(), i.e. invoke security_perf_event_free() in the error paths for perf_event_alloc(). This will allow adding potential error paths in perf_event_alloc() that can occur after allocating security state. Note, kfree() and thus security_perf_event_free() is a nop if event->security is NULL, i.e. calling security_perf_event_free() even if security_perf_event_alloc() fails or is never reached is functionality ok. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Xudong Hao <xudong.hao@intel.com> Link: https://patch.msgid.link/20251206001720.468579-4-seanjc@google.com	2025-12-17 13:31:04 +01:00
Kan Liang	b9e52b11d2	perf: Add generic exclude_guest support Only KVM knows the exact time when a guest is entering/exiting. Expose two interfaces to KVM to switch the ownership of the PMU resources. All the pinned events must be scheduled in first. Extend the perf_event_sched_in() helper to support extra flag, e.g., EVENT_GUEST. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Mingwei Zhang <mizhang@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Xudong Hao <xudong.hao@intel.com> Link: https://patch.msgid.link/20251206001720.468579-3-seanjc@google.com	2025-12-17 13:31:03 +01:00
Kan Liang	b825444b61	perf: Skip pmu_ctx based on event_type To optimize the cgroup context switch, the perf_event_pmu_context iteration skips the PMUs without cgroup events. A bool cgroup was introduced to indicate the case. It can work, but this way is hard to extend for other cases, e.g. skipping non-mediated PMUs. It doesn't make sense to keep adding bool variables. Pass the event_type instead of the specific bool variable. Check both the event_type and related pmu_ctx variables to decide whether skipping a PMU. Event flags, e.g., EVENT_CGROUP, should be cleard in the ctx->is_active. Add EVENT_FLAGS to indicate such event flags. No functional change. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Mingwei Zhang <mizhang@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Tested-by: Xudong Hao <xudong.hao@intel.com> Link: https://patch.msgid.link/20251206001720.468579-2-seanjc@google.com	2025-12-17 13:31:03 +01:00
Peter Zijlstra	1862d8e264	sched: Fix faulty assertion in sched_change_end() Commit `47efe2ddcc` ("sched/core: Add assertions to QUEUE_CLASS") added an assert to sched_change_end() verifying that a class demotion would result in a reschedule. As it turns out; rt_mutex_setprio() does not force a resched on class demontion. Furthermore, this is only relevant to running tasks. Change the warning into a reschedule and make sure to only do so for running tasks. Fixes: `47efe2ddcc` ("sched/core: Add assertions to QUEUE_CLASS") Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Tested-by: Linux Kernel Functional Testing <lkft@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251216141725.GW3707837@noisy.programming.kicks-ass.net	2025-12-17 11:41:18 +01:00
Peter Zijlstra	704069649b	sched/core: Rework sched_class::wakeup_preempt() and rq_modified_*() Change sched_class::wakeup_preempt() to also get called for cross-class wakeups, specifically those where the woken task is of a higher class than the previous highest class. In order to do this, track the current highest class of the runqueue in rq::next_class and have wakeup_preempt() track this upwards for each new wakeup. Additionally have schedule() re-set the value on pick. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://patch.msgid.link/20251127154725.901391274@infradead.org	2025-12-17 10:53:25 +01:00
Alexei Starovoitov	ec439c3801	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf after 6.19-rc1 Cross-merge BPF and other fixes after downstream PR. Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-12-16 21:29:38 -08:00
Linus Torvalds	ea1013c153	bpf-fixes -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmlCBmwACgkQ6rmadz2v bToUZA//ZY0IE1x1nCixEAqGF/nGpDzVX4YQQfjrUoXQOD4ykzt35yTNXl6B1IVA dliVSI6kUtdoThUa7xJUxMSkDsVBsEMT/zYXQEXJG1zXvJANCB9wTzsC3OCBWbXt BRczcEkq0OHC9/l5CrILR6ocwxKGDIMIysfeOSABgfqckSEhylWy3+EWZQCk08ka gNpXlDJUG7dYpcZD/zhuC7e5Rg1uNvN7WiTv+Biig8xZCsEtYOq+qC5C/sOnsypI nqfECfbx48cVl49SjatdgquuHn/INESdLRCHisshkurA2Mp5PQuCmrwlXbv4JG59 v9b7lsFQlkpvEXMdo9VYe6K2gjfkOPRdWsVPu2oXA1qISRmrDqX8cKOpapUIwRhL p3ASruMOnz0KFqVaET8+5u2SwtALeW+c+1p1aHMfVGF/qbXuyG05qBkLoGFJR+Xr WznXUXY80Z7pjD57SpA6U3DigAkGqKCBXUwdifaOq8HQonwsnQGqkW/3NngNULGP IC4u0JXn61VgQsM/kAw+ucc4bdKI0g4oKJR56lT48elrj6Yxrjpde4oOqzZ0IQKu VQ0YnzWqqT2tjh4YNMOwkNPbFR4ALd329zI6TUkWib/jByEBNcfjSj9BRANd1KSx JgSHAE6agrbl6h3nOx584YCasX3Zq+nfv1Sj4Z/5GaHKKW3q/Vw= =wHLt -----END PGP SIGNATURE----- Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Pull bpf fixes from Alexei Starovoitov: - Fix BPF builds due to -fms-extensions. selftests (Alexei Starovoitov), bpftool (Quentin Monnet). - Fix build of net/smc when CONFIG_BPF_SYSCALL=y, but CONFIG_BPF_JIT=n (Geert Uytterhoeven) - Fix livepatch/BPF interaction and support reliable unwinding through BPF stack frames (Josh Poimboeuf) - Do not audit capability check in arm64 JIT (Ondrej Mosnacek) - Fix truncated dmabuf BPF iterator reads (T.J. Mercier) - Fix verifier assumptions of bpf_d_path's output buffer (Shuran Liu) - Fix warnings in libbpf when built with -Wdiscarded-qualifiers under C23 (Mikhail Gavrilov) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: selftests/bpf: add regression test for bpf_d_path() bpf: Fix verifier assumptions of bpf_d_path's output buffer selftests/bpf: Add test for truncated dmabuf_iter reads bpf: Fix truncated dmabuf iterator reads x86/unwind/orc: Support reliable unwinding through BPF stack frames bpf: Add bpf_has_frame_pointer() bpf, arm64: Do not audit capability check in do_jit() libbpf: Fix -Wdiscarded-qualifiers under C23 bpftool: Fix build warnings due to MS extensions net: smc: SMC_HS_CTRL_BPF should depend on BPF_JIT selftests/bpf: Add -fms-extensions to bpf build flags	2025-12-17 15:54:58 +12:00
Liang Jie	b0101ccb5b	sched_ext: fix uninitialized ret on alloc_percpu() failure Smatch reported: kernel/sched/ext.c:5332 scx_alloc_and_add_sched() warn: passing zero to 'ERR_PTR' In scx_alloc_and_add_sched(), the alloc_percpu() failure path jumps to err_free_gdsqs without initializing @ret. That can lead to returning ERR_PTR(0), which violates the ERR_PTR() convention and confuses callers. Set @ret to -ENOMEM before jumping to the error path when alloc_percpu() fails. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/r/202512141601.yAXDAeA9-lkp@intel.com/ Reported-by: Dan Carpenter <error27@gmail.com> Fixes: `c201ea1578` ("sched_ext: Move event_stats_cpu into scx_sched") Signed-off-by: Liang Jie <liangjie@lixiang.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-12-16 09:15:03 -10:00
Emil Tsalapatis	12a1fe6e12	bpf/verifier: Do not limit maximum direct offset into arena map The verifier currently limits direct offsets into a map to 512MiB to avoid overflow during pointer arithmetic. However, this prevents arena maps from using direct addressing instructions to access data at the end of > 512MiB arena maps. This is necessary when moving arena globals to the end of the arena instead of the front. Refactor the verifier code to remove the offset calculation during direct value access calculations. This is possible because the only two map types that implement .map_direct_value_addr() are arrays and arenas, and they both do their own internal checks to ensure the offset is within bounds. Adjust selftests that expect the old error. These tests still fail because the verifier identifies the access as out of bounds for the map, so change them to expect an "invalid access to map value pointer" error instead. Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20251216173325.98465-3-emil@etsalapatis.com	2025-12-16 10:42:55 -08:00
Ricardo Robaina	15b0c43aa6	audit: include source and destination ports to NETFILTER_PKT NETFILTER_PKT records show both source and destination addresses, in addition to the associated networking protocol. However, it lacks the ports information, which is often valuable for troubleshooting. This patch adds both source and destination port numbers, 'sport' and 'dport' respectively, to TCP, UDP, UDP-Lite and SCTP-related NETFILTER_PKT records. $ TESTS="netfilter_pkt" make -e test &> /dev/null $ ausearch -i -ts recent \|grep NETFILTER_PKT type=NETFILTER_PKT ... proto=icmp type=NETFILTER_PKT ... proto=ipv6-icmp type=NETFILTER_PKT ... proto=udp sport=46333 dport=42424 type=NETFILTER_PKT ... proto=udp sport=35953 dport=42424 type=NETFILTER_PKT ... proto=tcp sport=50314 dport=42424 type=NETFILTER_PKT ... proto=tcp sport=57346 dport=42424 Link: https://github.com/linux-audit/audit-kernel/issues/162 Signed-off-by: Ricardo Robaina <rrobaina@redhat.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Paul Moore <paul@paul-moore.com>	2025-12-16 11:04:14 -05:00
Ricardo Robaina	f19590b07c	audit: add audit_log_nf_skb helper function Netfilter code (net/netfilter/nft_log.c and net/netfilter/xt_AUDIT.c) have to be kept in sync. Both source files had duplicated versions of audit_ip4() and audit_ip6() functions, which can result in lack of consistency and/or duplicated work. This patch adds a helper function in audit.c that can be called by netfilter code commonly, aiming to improve maintainability and consistency. Suggested-by: Florian Westphal <fw@strlen.de> Suggested-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Ricardo Robaina <rrobaina@redhat.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Paul Moore <paul@paul-moore.com>	2025-12-16 11:04:14 -05:00
Linus Torvalds	dbf89321bf	sched_ext: Fixes for v6.19-rc1 - Fix memory leak when destroying helper kthread workers during scheduler disable. - Fix bypass depth accounting on scx_enable() failure which could leave the system permanently in bypass mode. - Fix missing preemption handling when moving tasks to local DSQs via scx_bpf_dsq_move(). - Misc fixes including NULL check for put_prev_task(), flushing stdout in selftests, and removing unused code. -----BEGIN PGP SIGNATURE----- iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaUA9aQ4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGXWKAP9VeREr2ceRFd3WK5vFsFzCGh1L8rRu+t352MGL qWXkzQD/apLFSo5RQZ0tE8qORwKM9hNhS8QkVJhlsv5VZRWMBQI= =ueoE -----END PGP SIGNATURE----- Merge tag 'sched_ext-for-6.19-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: - Fix memory leak when destroying helper kthread workers during scheduler disable - Fix bypass depth accounting on scx_enable() failure which could leave the system permanently in bypass mode - Fix missing preemption handling when moving tasks to local DSQs via scx_bpf_dsq_move() - Misc fixes including NULL check for put_prev_task(), flushing stdout in selftests, and removing unused code * tag 'sched_ext-for-6.19-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: Remove unused code in the do_pick_task_scx() selftests/sched_ext: flush stdout before test to avoid log spam sched_ext: Fix missing post-enqueue handling in move_local_task_to_local_dsq() sched_ext: Factor out local_dsq_post_enq() from dispatch_enqueue() sched_ext: Fix bypass depth leak on scx_enable() failure sched/ext: Avoid null ptr traversal when ->put_prev_task() is called with NULL next sched_ext: Fix the memleak for sch->helper objects	2025-12-16 19:24:35 +12:00
Linus Torvalds	6b63f90fa2	cgroup: Fixes for v6.19-rc1 - Fix a race condition in css_rstat_updated() where CMPXCHG without LOCK prefix could cause lnode corruption when the flusher runs concurrently on another CPU. The issue was introduced in 6.17 and causes memcg stats to become corrupted in production. -----BEGIN PGP SIGNATURE----- iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaUA8mg4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGXhaAQDSk/ywLzJG807gzvhC5QLcY8kYzUUFHJ4TQAFp 08UqOAEA9yDgBHPqkp6ucjEJMG+2esE9A5NJB2LeEjG8ZZOCDQM= =Kcfd -----END PGP SIGNATURE----- Merge tag 'cgroup-for-6.19-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fix from Tejun Heo: - Fix a race condition in css_rstat_updated() where CMPXCHG without LOCK prefix could cause lnode corruption when the flusher runs concurrently on another CPU. The issue was introduced in 6.17 and causes memcg stats to become corrupted in production. * tag 'cgroup-for-6.19-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: rstat: use LOCK CMPXCHG in css_rstat_updated	2025-12-16 19:21:17 +12:00
Radu Rendec	fcc1d0dabd	genirq: Add interrupt redirection infrastructure Add infrastructure to redirect interrupt handler execution to a different CPU when the current CPU is not part of the interrupt's CPU affinity mask. This is primarily aimed at (de)multiplexed interrupts, where the child interrupt handler runs in the context of the parent interrupt handler, and therefore CPU affinity control for the child interrupt is typically not available. With the new infrastructure, the child interrupt is allowed to freely change its affinity setting, independently of the parent. If the interrupt handler happens to be triggered on an "incompatible" CPU (a CPU that's not part of the child interrupt's affinity mask), the handler is redirected and runs in IRQ work context on a "compatible" CPU. No functional change is being made to any existing irqchip driver, and irqchip drivers must be explicitly modified to use the newly added infrastructure to support interrupt redirection. Originally-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Radu Rendec <rrendec@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/linux-pci/878qpg4o4t.ffs@tglx/ Link: https://patch.msgid.link/20251128212055.1409093-2-rrendec@redhat.com	2025-12-15 22:30:48 +01:00
Marc Zyngier	dbcc728e18	genirq: Remove setup_percpu_irq() setup_percpu_irq() was always a bad kludge, and should have never been there the first place. Now that the last users are gone, remove it for good. Signed-off-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/20251210082242.360936-7-maz@kernel.org	2025-12-15 22:20:51 +01:00
Marc Zyngier	e9b624ea31	genirq: Remove __request_percpu_irq() helper With the IRQ timing stuff being gone, there is no need to specify a flag when requesting a percpu interrupt. Not only IRQF_TIMER was the only flag (set of flags actually) allowed, but nobody ever passed it. Get rid of __request_percpu_irq(), which was only getting 0 as flags, and promote request_percpu_irq_affinity() as its replacement. Signed-off-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jinjie Ruan <ruanjinjie@huawei.com> Link: https://patch.msgid.link/20251210082242.360936-3-maz@kernel.org	2025-12-15 22:20:50 +01:00
Marc Zyngier	c119e66853	genirq: Remove IRQ timing tracking infrastructure The IRQ timing tracking infrastructure was merged in 2019, but was never plumbed in, is not selectable, and is therefore never used. As Daniel agrees that there is little hope for this infrastructure to be completed in the near term, drop it altogether. Signed-off-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jinjie Ruan <ruanjinjie@huawei.com> Link: https://lore.kernel.org/r/87zf7vex6h.wl-maz@kernel.org Link: https://patch.msgid.link/20251210082242.360936-2-maz@kernel.org	2025-12-15 22:20:50 +01:00
Eric Dumazet	4725344462	time/timecounter: Inline timecounter_cyc2time() New network transport protocols want NIC drivers to get hardware timestamps of all incoming packets, and possibly all outgoing packets. One example is the upcoming 'Swift congestion control' which is used by TCP transport and is the primary need for timecounter_cyc2time(). This means timecounter_cyc2time() can be called more than 100 million times per second on a busy server. Inlining timecounter_cyc2time() brings a 12% improvement on a UDP receive stress test on a 100Gbit NIC. Note that FDO, LTO, PGO are unable to magically help for this case, presumably because NIC drivers are almost exclusively shipped as modules. Add an unlikely() around the cc_cyc2ns_backwards() case, even if FDO (when used) is able to take care of this optimization. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://research.google/pubs/swift-delay-is-simple-and-effective-for-congestion-control-in-the-datacenter/ Link: https://patch.msgid.link/20251129095740.3338476-1-edumazet@google.com	2025-12-15 20:16:49 +01:00
Zqiang	bb27226f0d	sched_ext: Remove unused code in the do_pick_task_scx() The kick_idle variable is no longer used, this commit therefore remove it and also remove associated code in the do_pick_task_scx(). Signed-off-by: Zqiang <qiang.zhang@linux.dev> Reviewed-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-12-15 05:53:49 -10:00
Petr Mladek	9bd18e1262	printk/nbcon: Restore IRQ in atomic flush after each emitted record The commit `d5d399efff` ("printk/nbcon: Release nbcon consoles ownership in atomic flush after each emitted record") prevented stall of a CPU which lost nbcon console ownership because another CPU entered an emergency flush. But there is still the problem that the CPU doing the emergency flush might cause a stall on its own. Let's go even further and restore IRQ in the atomic flush after each emitted record. It is not a complete solution. The interrupts and/or scheduling might still be blocked when the emergency atomic flush was called with IRQs and/or scheduling disabled. But it should remove the following lockup: mlx5_core 0000:03:00.0: Shutdown was called kvm: exiting hardware virtualization arm-smmu-v3 arm-smmu-v3.10.auto: CMD_SYNC timeout at 0x00000103 [hwprod 0x00000104, hwcons 0x00000102] smp: csd: Detected non-responsive CSD lock (#1) on CPU#4, waiting 5000000032 ns for CPU#00 do_nothing (kernel/smp.c:1057) smp: csd: CSD lock (#1) unresponsive. [...] Call trace: pl011_console_write_atomic (./arch/arm64/include/asm/vdso/processor.h:12 drivers/tty/serial/amba-pl011.c:2540) (P) nbcon_emit_next_record (kernel/printk/nbcon.c:1049) __nbcon_atomic_flush_pending_con (kernel/printk/nbcon.c:1517) __nbcon_atomic_flush_pending.llvm.15488114865160659019 (./arch/arm64/include/asm/alternative-macros.h:254 ./arch/arm64/include/asm/cpufeature.h:808 ./arch/arm64/include/asm/irqflags.h:192 kernel/printk/nbcon.c:1562 kernel/printk/nbcon.c:1612) nbcon_atomic_flush_pending (kernel/printk/nbcon.c:1629) printk_kthreads_shutdown (kernel/printk/printk.c:?) syscore_shutdown (drivers/base/syscore.c:120) kernel_kexec (kernel/kexec_core.c:1045) __arm64_sys_reboot (kernel/reboot.c:794 kernel/reboot.c:722 kernel/reboot.c:722) invoke_syscall (arch/arm64/kernel/syscall.c:50) el0_svc_common.llvm.14158405452757855239 (arch/arm64/kernel/syscall.c:?) do_el0_svc (arch/arm64/kernel/syscall.c:152) el0_svc (./arch/arm64/include/asm/alternative-macros.h:254 ./arch/arm64/include/asm/cpufeature.h:808 ./arch/arm64/include/asm/irqflags.h:73 arch/arm64/kernel/entry-common.c:169 arch/arm64/kernel/entry-common.c:182 arch/arm64/kernel/entry-common.c:749) el0t_64_sync_handler (arch/arm64/kernel/entry-common.c:820) el0t_64_sync (arch/arm64/kernel/entry.S:600) In this case, nbcon_atomic_flush_pending() is called from printk_kthreads_shutdown() with IRQs and scheduling enabled. Note that __nbcon_atomic_flush_pending_con() is directly called also from nbcon_device_release() where the disabled IRQs might break PREEMPT_RT guarantees. But the atomic flush is called only in emergency or panic situations where the latencies are irrelevant anyway. An ultimate solution would be a touching of watchdogs. But it would hide all problems. Let's do it later when anyone reports a stall which does not have a better solution. Closes: https://lore.kernel.org/r/sqwajvt7utnt463tzxgwu2yctyn5m6bjwrslsnupfexeml6hkd@v6sqmpbu3vvu Tested-by: Breno Leitao <leitao@debian.org> Reviewed-by: John Ogness <john.ogness@linutronix.de> Link: https://patch.msgid.link/20251212124520.244483-1-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com>	2025-12-15 16:18:41 +01:00
Marcos Paulo de Souza	bdfcca65e7	printk: nbcon: Check for device_{lock,unlock} callbacks These callbacks are necessary to synchronize ->write_thread callback against other operations using the same device. Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Reviewed-by: John Ogness <john.ogness@linutronix.de> Link: https://patch.msgid.link/20251208-nbcon-device-cb-fix-v2-1-36be8d195123@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com>	2025-12-15 15:10:33 +01:00
Mateusz Guzik	6d864a1b18	pid: only take pidmap_lock once on alloc When spawning and killing threads in separate processes in parallel the primary bottleneck on the stock kernel is pidmap_lock, largely because of a back-to-back acquire in the common case. This aspect is fixed with the patch. Performance improvement varies between reboots. When benchmarking with 20 processes creating and killing threads in a loop, the unpatched baseline hovers around 465k ops/s, while patched is anything between ~510k ops/s and ~560k depending on false-sharing (which I only minimally sanitized). So this is at least 10% if you are unlucky. The change also facilitated some cosmetic fixes. It has an unintentional side effect of no longer issuing spurious idr_preload() around idr_replace(). Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://patch.msgid.link/20251203092851.287617-3-mjguzik@gmail.com Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-12-15 14:33:38 +01:00
Rafael J. Wysocki	4fb352df14	PM: sleep: Do not flag runtime PM workqueue as freezable Till now, the runtime PM workqueue has been flagged as freezable, so it does not process work items during system-wide PM transitions like system suspend and resume. The original reason to do that was to reduce the likelihood of runtime PM getting in the way of system-wide PM processing, but now it is mostly an optimization because (1) runtime suspend of devices is prevented by bumping up their runtime PM usage counters in device_prepare() and (2) device drivers are expected to disable runtime PM for the devices handled by them before they embark on system-wide PM activities that may change the state of the hardware or otherwise interfere with runtime PM. However, it prevents asynchronous runtime resume of devices from working during system-wide PM transitions, which is confusing because synchronous runtime resume is not prevented at the same time, and it also sometimes turns out to be problematic. For example, it has been reported that blk_queue_enter() may deadlock during a system suspend transition because of the pm_request_resume() usage in it [1]. It may also deadlock during a system resume transition in a similar way. That happens because the asynchronous runtime resume of the given device is not processed due to the freezing of the runtime PM workqueue. While it may be better to address this particular issue in the block layer, the very presence of it means that similar problems may be expected to occur elsewhere. For this reason, remove the WQ_FREEZABLE flag from the runtime PM workqueue and make device_suspend_late() use the generic variant of pm_runtime_disable() that will carry out runtime PM of the device synchronously if there is pending resume work for it. Also update the comment before the pm_runtime_disable() call in device_suspend_late(), to document the fact that the runtime PM should not be expected to work for the device until the end of device_resume_early(), and update the related documentation. This change may, even though it is not expected to, uncover some latent issues related to queuing up asynchronous runtime resume work items during system suspend or hibernation. However, they should be limited to the interference between runtime resume and system-wide PM callbacks in the cases when device drivers start to handle system-wide PM before disabling runtime PM as described above. Link: https://lore.kernel.org/linux-pm/20251126101636.205505-2-yang.yang@vivo.com/ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org> Link: https://patch.msgid.link/12794222.O9o76ZdvQC@rafael.j.wysocki	2025-12-15 12:20:02 +01:00
Ingo Molnar	527a521029	sched/fair: Sort out 'blocked_load*' namespace noise There's three layers of logic in the scheduler that deal with 'has_blocked' (load) handling of the NOHZ code: (1) nohz.has_blocked, (2) rq->has_blocked_load, deal with NOHZ idle balancing, (3) and cfs_rq_has_blocked(), which is part of the layer that is passing the SMP load-balancing signal to the NOHZ layers. The 'has_blocked' and 'has_blocked_load' names are used in a mixed fashion, sometimes within the same function. Standardize on 'has_blocked_load' to make it all easy to read and easy to grep. No change in functionality. Suggested-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Shrikanth Hegde <sshegde@linux.ibm.com> Link: https://patch.msgid.link/aS6yvxyc3JfMxxQW@gmail.com	2025-12-15 07:52:45 +01:00
Ingo Molnar	5758e48eef	sched/fair: Introduce and use the vruntime_cmp() and vruntime_op() wrappers for wrapped-signed aritmetics We have to be careful with vruntime comparisons and subtraction, due to the possibility of wrapping, so we have macros like: #define vruntime_gt(field, lse, rse) ({ (s64)((lse)->field - (rse)->field) > 0; }) Which is used like this: if (vruntime_gt(min_vruntime, se, rse)) se->min_vruntime = rse->min_vruntime; Replace this with an easier to read pattern that uses the regular arithmetics operators: if (vruntime_cmp(se->min_vruntime, ">", rse->min_vruntime)) se->min_vruntime = rse->min_vruntime; Also replace vruntime subtractions with vruntime_op(): - delta = (s64)(sea->vruntime - seb->vruntime) + - (s64)(cfs_rqb->zero_vruntime_fi - cfs_rqa->zero_vruntime_fi); + delta = vruntime_op(sea->vruntime, "-", seb->vruntime) + + vruntime_op(cfs_rqb->zero_vruntime_fi, "-", cfs_rqa->zero_vruntime_fi); In the vruntime_cmp() and vruntime_op() macros use Use __builtin_strcmp(), because of __HAVE_ARCH_STRCMP might turn off the compiler optimizations we rely on here to catch usage bugs. No change in functionality. Signed-off-by: Ingo Molnar <mingo@kernel.org>	2025-12-15 07:52:45 +01:00
Ingo Molnar	dcbc9d3f0e	sched/fair: Rename cfs_rq::avg_vruntime to ::sum_w_vruntime, and helper functions The ::avg_vruntime field is a misnomer: it says it's an 'average vruntime', but in reality it's the momentary sum of the weighted vruntimes of all queued tasks, which is at least a division away from being an average. This is clear from comments about the math of fair scheduling: * \Sum (v_i - v0) * w_i := cfs_rq->avg_vruntime This confusion is increased by the cfs_avg_vruntime() function, which does perform the division and returns a true average. The sum of all weighted vruntimes should be named thusly, so rename the field to ::sum_w_vruntime. (As arguably ::sum_weighted_vruntime would be a bit of a mouthful.) Understanding the scheduler is hard enough already, without extra layers of obfuscated naming. ;-) Also rename related helper functions: sum_vruntime_add() => sum_w_vruntime_add() sum_vruntime_sub() => sum_w_vruntime_sub() sum_vruntime_update() => sum_w_vruntime_update() With the notable exception of cfs_avg_vruntime(), which was named accurately. Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://patch.msgid.link/20251201064647.1851919-7-mingo@kernel.org	2025-12-15 07:52:44 +01:00
Ingo Molnar	4ff674fa98	sched/fair: Rename cfs_rq::avg_load to cfs_rq::sum_weight The ::avg_load field is a long-standing misnomer: it says it's an 'average load', but in reality it's the momentary sum of the load of all currently runnable tasks. We'd have to also perform a division by nr_running (or use time-decay) to arrive at any sort of average value. This is clear from comments about the math of fair scheduling: * \Sum w_i := cfs_rq->avg_load The sum of all weights is ... the sum of all weights, not the average of all weights. To make it doubly confusing, there's also an ::avg_load in the load-balancing struct sg_lb_stats, which is a true average. The second part of the field's name is a minor misnomer as well: it says 'load', and it is indeed a load_weight structure as it shares code with the load-balancer - but it's only in an SMP load-balancing context where load = weight, in the fair scheduling context the primary purpose is the weighting of different nice levels. So rename the field to ::sum_weight instead, which makes the terminology of the EEVDF math match up with our implementation of it: * \Sum w_i := cfs_rq->sum_weight Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://patch.msgid.link/20251201064647.1851919-6-mingo@kernel.org	2025-12-15 07:52:44 +01:00

... 6 7 8 9 10 ...

51062 Commits (d78ddeb8938a366aabfabf60255c1a94de8d8ea1)