Commit Graph

49442 Commits (09cfd3c52ea76f43b3cb15e570aeddf633d65e80)

Author SHA1 Message Date
Christian Brauner 8e199cd6e3
pid: use ns_common_init()
Don't cargo-cult the same thing over and over.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-19 14:26:13 +02:00
Christian Brauner 0b40774ef0
cgroup: use ns_common_init()
Don't cargo-cult the same thing over and over.

Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-19 14:26:13 +02:00
Christian Göttsche b9cb7e59ac
pid: use ns_capable_noaudit() when determining net sysctl permissions
The capability check should not be audited since it is only being used
to determine the inode permissions. A failed check does not indicate a
violation of security policy, but, when an LSM is enabled, a denial audit
message is still generated.

The denial audit message can either lead to the capability being
unnecessarily allowed in a security policy, or to it being silenced,
potentially masking a legitimate capability check at a later point in time.

Similar to commit d6169b0206 ("net: Use ns_capable_noaudit() when
determining net sysctl permissions")

Fixes: 7863dcc72d ("pid: allow pid_max to be set per pid namespace")
CC: Christian Brauner <brauner@kernel.org>
CC: linux-security-module@vger.kernel.org
CC: selinux@vger.kernel.org
Signed-off-by: Christian Göttsche <cgzones@googlemail.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Reviewed-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-19 13:08:31 +02:00
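
As an illustration of the pattern (function and parameter names are
hypothetical), a sysctl permissions hook of this kind only consults the
capability to widen the reported mode bits, which is why the unaudited
variant fits:

  #include <linux/capability.h>
  #include <linux/user_namespace.h>

  /* Sketch: widen the reported mode for a privileged caller without
   * emitting an LSM denial record when the capability check fails. */
  static int example_sysctl_permissions(struct user_namespace *user_ns, int mode)
  {
          if (ns_capable_noaudit(user_ns, CAP_SYS_ADMIN)) {
                  int bits = (mode >> 6) & 7;     /* owner permission bits */

                  return (bits << 6) | (bits << 3) | bits;
          }
          return mode;    /* unprivileged: report the table mode unchanged */
  }
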
KP Singh 8cd189e414 bpf: Move the signature kfuncs to helpers.c
No functional changes, except for the addition of the headers for the
kfuncs so that they can be used for signature verification.

Signed-off-by: KP Singh <kpsingh@kernel.org>
Link: https://lore.kernel.org/r/20250914215141.15144-8-kpsingh@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-18 19:11:42 -07:00
KP Singh ea2e6467ac bpf: Return hashes of maps in BPF_OBJ_GET_INFO_BY_FD
Currently only array maps are supported, but the implementation can be
extended for other maps and objects. The hash is memoized only for
exclusive and frozen maps as their content is stable until the exclusive
program modifies the map.

This is required for BPF signing, enabling a trusted loader program to
verify a map's integrity. The loader retrieves
the map's runtime hash from the kernel and compares it against an
expected hash computed at build time.

Signed-off-by: KP Singh <kpsingh@kernel.org>
Link: https://lore.kernel.org/r/20250914215141.15144-7-kpsingh@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-18 19:11:42 -07:00
KP Singh baefdbdf68 bpf: Implement exclusive map creation
Exclusive maps allow maps to be accessed only by a program with a
matching hash, which is specified in the excl_prog_hash attr.

For the signing use-case, this allows the trusted loader program
to load the map and verify its integrity.

Signed-off-by: KP Singh <kpsingh@kernel.org>
Link: https://lore.kernel.org/r/20250914215141.15144-3-kpsingh@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-18 19:11:42 -07:00
KP Singh 603b441623 bpf: Update the bpf_prog_calc_tag to use SHA256
Exclusive maps restrict map access to specific programs using a hash.
The current hash used for this is SHA1, which is prone to collisions.
This patch uses SHA256, which is more resilient against
collisions. This new hash is stored in bpf_prog and used by the verifier
to determine if a program can access a given exclusive map.

The original 64-bit tags are kept, as they are used by users as a short,
possibly colliding program identifier for non-security purposes.

Signed-off-by: KP Singh <kpsingh@kernel.org>
Link: https://lore.kernel.org/r/20250914215141.15144-2-kpsingh@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-18 19:10:20 -07:00
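
For illustration only, the library SHA-256 helper can digest a program's
instruction image in one call; this sketch hashes the raw instructions and
ignores the masking of immediates and map references that the in-kernel
tag/hash computation performs (the digest destination is hypothetical):

  #include <crypto/sha2.h>
  #include <linux/filter.h>

  /* Sketch: compute a SHA-256 digest over the instruction image. */
  static void prog_sha256_sketch(const struct bpf_prog *prog,
                                 u8 digest[SHA256_DIGEST_SIZE])
  {
          sha256((const u8 *)prog->insnsi,
                 prog->len * sizeof(struct bpf_insn), digest);
  }
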
Kumar Kartikeya Dwivedi 1512231b6c bpf: Enforce RCU protection for KF_RCU_PROTECTED
Currently, KF_RCU_PROTECTED only applies to iterator APIs and that too
in a convoluted fashion: the presence of this flag on the kfunc is used
to set MEM_RCU in iterator type, and the lack of RCU protection results
in an error only later, once next() or destroy() methods are invoked on
the iterator. While there is no bug, this is certainly a bit
unintuitive, and makes the enforcement of the flag iterator specific.

In the interest of making this flag useful for other upcoming kfuncs,
e.g. scx_bpf_cpu_curr() [0][1], add enforcement for invoking the kfunc
in an RCU critical section in general.

This would also mean that iterator APIs using KF_RCU_PROTECTED will
error out earlier, instead of throwing an error for lack of RCU CS
protection when next() or destroy() methods are invoked.

In addition to this, if the kfuncs tagged KF_RCU_PROTECTED return a
pointer value, ensure that this pointer value is only usable in an RCU
critical section. There might be edge cases where the return value is
special and doesn't need to imply MEM_RCU semantics, but in general, the
assumption should hold for the majority of kfuncs, and we can revisit
things if necessary later.

  [0]: https://lore.kernel.org/all/20250903212311.369697-3-christian.loehle@arm.com
  [1]: https://lore.kernel.org/all/20250909195709.92669-1-arighi@nvidia.com

Tested-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250917032755.4068726-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-18 15:36:17 -07:00
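
From the BPF program side, the expected calling convention now looks like the
sketch below; the lookup kfunc is hypothetical, while the RCU lock/unlock
kfuncs are the existing bpf_rcu_read_lock()/bpf_rcu_read_unlock():

  #include <vmlinux.h>
  #include <bpf/bpf_helpers.h>

  void bpf_rcu_read_lock(void) __ksym;
  void bpf_rcu_read_unlock(void) __ksym;
  /* Hypothetical KF_RCU_PROTECTED kfunc returning an RCU-protected pointer. */
  struct task_struct *example_rcu_protected_lookup(int cpu) __ksym;

  SEC("tc")
  int use_rcu_protected_kfunc(struct __sk_buff *skb)
  {
          struct task_struct *p;

          bpf_rcu_read_lock();                    /* now required before the call */
          p = example_rcu_protected_lookup(0);
          if (p)
                  bpf_printk("pid=%d", p->pid);   /* MEM_RCU: only valid inside the CS */
          bpf_rcu_read_unlock();                  /* 'p' must not be used after this */
          return 0;
  }

  char LICENSE[] SEC("license") = "GPL";
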
Linus Torvalds 097a6c336d Runtime Verifier fixes for v6.17
- Fix build in some RISC-V flavours
 
   Some system calls are only available on 64-bit RISC-V machines.
   #ifdef out the cases of clock_nanosleep and futex in the sleep monitor
   if they are not supported by the architecture.
 
 - Fix wrong cast, obsolete after refactoring
 
   Use container_of() to get to the rv_monitor structure from the
   enable_monitors_next() 'p' pointer. The assignment worked only because
   the list field used happened to be the first field of the structure.
 
 - Remove redundant include files
 
   Some include files were listed twice. Remove the extra ones and sort
   the includes.
 
 - Fix missing unlock on failure
 
   There was an error path that exited the rv_register_monitor() function
   without releasing a lock. Change that to goto the lock release.
 
 - Add Gabriele Monaco to be Runtime Verifier maintainer
 
   Gabriele is doing most of the work on RV as well as collecting patches.
   Add him to the maintainers file for Runtime Verification.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaMxsBRQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qk5oAP4tlGnMBoLpZXVBpAubVUQOVRfQo5dI
 ar9LpXdgnj4xQAEA9Q5uIvhCI/CMXTK98gFhR31p9O4Sqtn0JlCViBbVSQg=
 =tUQG
 -----END PGP SIGNATURE-----

Merge tag 'trace-rv-v6.17-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull runtime verifier fixes from Steven Rostedt:

 - Fix build in some RISC-V flavours

   Some system calls are only available on 64-bit RISC-V machines.
   #ifdef out the cases of clock_nanosleep and futex in the sleep
   monitor if they are not supported by the architecture.

 - Fix wrong cast, obsolete after refactoring

   Use container_of() to get to the rv_monitor structure from the
   enable_monitors_next() 'p' pointer. The assignment worked only
   because the list field used happened to be the first field of the
   structure.

 - Remove redundant include files

   Some include files were listed twice. Remove the extra ones and sort
   the includes.

 - Fix missing unlock on failure

   There was an error path that exited the rv_register_monitor()
   function without releasing a lock. Change that to goto the lock
   release.

 - Add Gabriele Monaco to be Runtime Verifier maintainer

   Gabriele is doing most of the work on RV as well as collecting
   patches. Add him to the maintainers file for Runtime Verification.

* tag 'trace-rv-v6.17-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  rv: Add Gabriele Monaco as maintainer for Runtime Verification
  rv: Fix missing mutex unlock in rv_register_monitor()
  include/linux/rv.h: remove redundant include file
  rv: Fix wrong type cast in enabled_monitors_next()
  rv: Support systems with time64-only syscalls
2025-09-18 15:22:00 -07:00
Rafael J. Wysocki ccf09357ff smp: Fix up and expand the smp_call_function_many() kerneldoc
The smp_call_function_many() kerneldoc comment got out of sync with the
function definition (bool parameter "wait" is incorrectly described as a
bitmask in it), so fix it up by copying the "wait" description from the
smp_call_function() kerneldoc and add information regarding the handling
of the local CPU to it.

Fixes: 49b3bd213a ("smp: Fix all kernel-doc warnings")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2025-09-18 22:21:28 +02:00
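
A minimal usage sketch reflecting the corrected kerneldoc (names hypothetical):
'wait' is a plain bool, and the calling CPU is never part of the call, so the
caller covers it separately, which is what the on_each_cpu() wrapper does
internally:

  #include <linux/smp.h>
  #include <linux/cpumask.h>

  static void bump(void *info)
  {
          atomic_inc(info);       /* runs in IPI context on each remote CPU */
  }

  static void bump_on_all_cpus(atomic_t *counter)
  {
          preempt_disable();      /* smp_call_function_many() requires this */
          /* wait == true: return only after all remote handlers have finished */
          smp_call_function_many(cpu_online_mask, bump, counter, true);
          bump(counter);          /* the local CPU is skipped by the call above */
          preempt_enable();
  }
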
Andrea Righi ac6772e8bc sched_ext: Add migration-disabled counter to error state dump
Include the task's migration-disabled counter when dumping task state
during an error exit.

This can help diagnose cases where tasks can get stuck, because they're
unable to migrate elsewhere.

tj: s/nomig/no_mig/ for readability and consistency with other keys.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-18 08:54:57 -10:00
Jakub Kicinski f2cdc4c22b Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.17-rc7).

No conflicts.

Adjacent changes:

drivers/net/ethernet/mellanox/mlx5/core/en/fs.h
  9536fbe10c ("net/mlx5e: Add PSP steering in local NIC RX")
  7601a0a462 ("net/mlx5e: Add a miss level for ipsec crypto offload")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-18 11:26:06 -07:00
Linus Torvalds 992d4e481e Probes fixes for v6.17-rc6:
- kprobe-event: Fix null-ptr-deref in trace_kprobe_create_internal(),
   which handles NULL return of kmemdup() correctly.
 -----BEGIN PGP SIGNATURE-----
 
 iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmjLOcYbHG1hc2FtaS5o
 aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8b7PUH/0ao2r/pj/vfKDNWrGlY
 TJ59tabrQQ9AGB27jqL8nZPbie4Jn1UBKMsuvRcOfvbSLtmnrxOtqgx/RmJOVnjC
 JuLEWQt8XTiBatsLsPst/CNnzV9V/oLmZ7Fv8Z1QVqzCfpnyCW4HaHc6XaH8IM3r
 5x6fIrZFKFu7E58t2yo972L+tNIPFwr457VTt2nCdHXlL3mwnK+GtYeNBnWSk40+
 9k16xShDVx3tm+oPEJ2jyJApchR4wWU1vIshMYSu1ygp9UacdJWajKt1qOQOMEih
 H2sNlBwZLTWGfhS9exBPori9mthhH4wxzqEYzpPHw0WgNn0OF9QL5AXX44VGJzG5
 JqM=
 =1NqY
 -----END PGP SIGNATURE-----

Merge tag 'probes-fixes-v6.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull probe fix from Masami Hiramatsu:

 - kprobe-event: Fix null-ptr-deref in trace_kprobe_create_internal(),
   by handling NULL return of kmemdup() correctly

* tag 'probes-fixes-v6.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: kprobe-event: Fix null-ptr-deref in trace_kprobe_create_internal()
2025-09-17 16:52:26 -07:00
Wang Liang dc3382fffd tracing: kprobe-event: Fix null-ptr-deref in trace_kprobe_create_internal()
A crash was observed with the following output:

Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP KASAN PTI
KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
CPU: 1 UID: 0 PID: 2899 Comm: syz.2.399 Not tainted 6.17.0-rc5+ #5 PREEMPT(none)
RIP: 0010:trace_kprobe_create_internal+0x3fc/0x1440 kernel/trace/trace_kprobe.c:911
Call Trace:
 <TASK>
 trace_kprobe_create_cb+0xa2/0xf0 kernel/trace/trace_kprobe.c:1089
 trace_probe_create+0xf1/0x110 kernel/trace/trace_probe.c:2246
 dyn_event_create+0x45/0x70 kernel/trace/trace_dynevent.c:128
 create_or_delete_trace_kprobe+0x5e/0xc0 kernel/trace/trace_kprobe.c:1107
 trace_parse_run_command+0x1a5/0x330 kernel/trace/trace.c:10785
 vfs_write+0x2b6/0xd00 fs/read_write.c:684
 ksys_write+0x129/0x240 fs/read_write.c:738
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x5d/0x2d0 arch/x86/entry/syscall_64.c:94
 </TASK>

Function kmemdup() may return NULL in trace_kprobe_create_internal(), so add
a check for its return value.

Link: https://lore.kernel.org/all/20250916075816.3181175-1-wangliang74@huawei.com/

Fixes: 33b4e38baa ("tracing: kprobe-event: Allocate string buffers from heap")
Signed-off-by: Wang Liang <wangliang74@huawei.com>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2025-09-18 07:36:41 +09:00
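
The shape of the fix is the standard allocation-failure check; a hedged
fragment with hypothetical surrounding names:

  #include <linux/err.h>
  #include <linux/slab.h>
  #include <linux/string.h>

  static char *dup_event_buf(const char *src, size_t len)
  {
          char *buf = kmemdup(src, len, GFP_KERNEL);

          if (!buf)               /* kmemdup() returns NULL on allocation failure */
                  return ERR_PTR(-ENOMEM);
          return buf;
  }
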
Linus Torvalds 37889ceadd sched_ext: Fixes for v6.17-rc6
This contains 2 sched_ext fixes.
 
 - Fix build failure when !FAIR_GROUP_SCHED && EXT_GROUP_SCHED.
 
 - Revert "sched_ext: Skip per-CPU tasks in scx_bpf_reenqueue_local()" which
   was causing issues with per-CPU task scheduling and reenqueuing behavior.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaMsN6g4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGRHQAP4kRGIyxCaCnoSHHDyI8R2SLUzDKvvByaVNdbKO
 5VlCaAEAy0wKViyJDojpd5DXMFlFYCm8gXWQ0aD++hhYX1XfawI=
 =b+ar
 -----END PGP SIGNATURE-----

Merge tag 'sched_ext-for-6.17-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext fixes from Tejun Heo:

 - Fix build failure when !FAIR_GROUP_SCHED && EXT_GROUP_SCHED

 - Revert "sched_ext: Skip per-CPU tasks in scx_bpf_reenqueue_local()"
   which was causing issues with per-CPU task scheduling and reenqueuing
   behavior

* tag 'sched_ext-for-6.17-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext, sched/core: Fix build failure when !FAIR_GROUP_SCHED && EXT_GROUP_SCHED
  Revert "sched_ext: Skip per-CPU tasks in scx_bpf_reenqueue_local()"
2025-09-17 13:27:31 -07:00
Linus Torvalds 05950213a9 cgroup: Fixes for v6.17-rc6
This contains two cgroup changes. Both are pretty low risk.
 
 - Fix deadlock in cgroup destruction when repeatedly mounting/unmounting
   perf_event and net_prio controllers. The issue occurs because
   cgroup_destroy_wq has max_active=1, causing root destruction to wait for
   CSS offline operations that are queued behind it. The fix splits
   cgroup_destroy_wq into three separate workqueues to eliminate the
   blocking.
 
 - Set of->priv to NULL upon file release to make potential bugs manifest
   as NULL pointer dereferences rather than use-after-free errors.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaMsIBg4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGUlBAP95OycrYMUu1iIc37YfClvugsBZJpOV/qMQpNTm
 oPjGoQEAv5dzOTo+763ecFUfRjCT469Ke7wFapS1RCVL7hEd5As=
 =Kykt
 -----END PGP SIGNATURE-----

Merge tag 'cgroup-for-6.17-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup fixes from Tejun Heo:
 "This contains two cgroup changes. Both are pretty low risk.

  - Fix deadlock in cgroup destruction when repeatedly
    mounting/unmounting perf_event and net_prio controllers.

    The issue occurs because cgroup_destroy_wq has max_active=1, causing
    root destruction to wait for CSS offline operations that are queued
    behind it.

    The fix splits cgroup_destroy_wq into three separate workqueues to
    eliminate the blocking.

  - Set of->priv to NULL upon file release to make potential bugs
    manifest as NULL pointer dereferences rather than use-after-free
    errors"

* tag 'cgroup-for-6.17-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup/psi: Set of->priv to NULL upon file release
  cgroup: split cgroup_destroy_wq into 3 workqueues
2025-09-17 13:22:08 -07:00
Chen Ridong c49b5e89c4 cpuset: use partition_cpus_change for setting exclusive cpus
A previous patch has introduced a new helper function
partition_cpus_change(). Now replace the exclusive cpus setting logic
with this helper function.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-17 08:37:31 -10:00
Chen Ridong de9f15e21c cpuset: use parse_cpulist for setting cpus.exclusive
Previous patches made parse_cpulist handle empty cpu mask input.
Now use this helper for exclusive cpus setting. Also, compute_trialcs_xcpus
can be called with empty cpus and handles it correctly.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-17 08:37:31 -10:00
Chen Ridong 27db824600 cpuset: introduce partition_cpus_change
Introduce the partition_cpus_change function to handle both regular CPU
set updates and exclusive CPU modifications, either of which may trigger
partition state changes. This generalized function will also be utilized
for exclusive CPU updates in subsequent patches.

With the introduction of compute_trialcs_excpus in a previous patch,
the trialcs->effective_xcpus field is now consistently computed and
maintained. Consequently, the legacy logic which assigned
trialcs->allowed_cpus to a local 'xcpus' variable when
trialcs->effective_xcpus was empty has been removed.

This removal is safe because when trialcs is not a partition member,
trialcs->effective_xcpus is now correctly populated with the intersection
of user_xcpus and the parent's effective_xcpus. This calculation inherently
covers the scenario previously handled by the removed code.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-17 08:37:31 -10:00
Chen Ridong c636673980 cpuset: refactor cpus_allowed_validate_change
Refactor cpus_allowed_validate_change to handle the special case where
cpuset.cpus can be set even when violating partition sibling CPU
exclusivity rules. This differs from the general validation logic in
validate_change. Add a wrapper function to properly handle this
exceptional case.

The trialcs->prs_err field is cleared before performing validation checks
for both CPU changes and partition errors. If cpus_allowed_validate_change
fails its validation, trialcs->prs_err is set to PERR_NOTEXCL. If partition
validation fails, the specific error code returned by validate_partition
is assigned to trialcs->prs_err.

With the partition validation status now directly available through
trialcs->prs_err, the local boolean variable 'invalidate' becomes
redundant and can be safely removed.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-17 08:37:31 -10:00
Chen Ridong 7e05981ba3 cpuset: refactor out validate_partition
Refactor the validate_partition function to handle cpuset partition
validation when modifying cpuset.cpus. This refactoring also makes the
function reusable for handling cpuset.cpus.exclusive updates in subsequent
patches.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-17 08:37:30 -10:00
Chen Ridong 8daab66eb3 cpuset: introduce cpus_excl_conflict and mems_excl_conflict helpers
This patch adds cpus_excl_conflict() and mems_excl_conflict() helper
functions to improve code readability and maintainability. The exclusive
conflict checking follows these rules:

1. If either cpuset has the 'exclusive' flag set, their user_xcpus must
   not have any overlap.
2. If neither cpuset has the 'exclusive' flag set, their
   'cpuset.cpus.exclusive' (only for v2) values must not intersect.
3. The 'cpuset.cpus' of one cpuset must not form a subset of another
   cpuset's 'cpuset.cpus.exclusive'.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-17 08:37:30 -10:00
Chen Ridong c5866c9a00 cpuset: refactor CPU mask buffer parsing logic
The current implementation contains redundant handling for empty mask
inputs, as cpulist_parse() already properly handles these cases. This
refactoring introduces a new helper function parse_cpuset_cpulist() to
consolidate CPU list parsing logic and eliminate special-case checks for
empty inputs.

Additionally, the effective_xcpus computation for trial cpusets has been
simplified. Rather than computing effective_xcpus only when exclusive_cpus
is set or when the cpuset forms a valid partition, we now recalculate it
on every cpuset.cpus update. This approach ensures consistency and allows
removal of redundant effective_xcpus logic in subsequent patches.

The trial cpuset's effective_xcpus calculation follows two distinct cases:
1. For member cpusets: effective_xcpus is determined by the intersection
   of cpuset->exclusive_cpus and the parent's effective_xcpus.
2. For non-member cpusets: effective_xcpus is derived from the intersection
   of user_xcpus and the parent's effective_xcpus.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-17 08:37:30 -10:00
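
A hedged sketch of the consolidated parsing step (function name and the
validity check are hypothetical); cpulist_parse() copes with an empty string
on its own, which is what lets the special-case branches go away:

  #include <linux/cpumask.h>
  #include <linux/errno.h>

  static int parse_cpuset_cpulist_sketch(const char *buf, struct cpumask *out)
  {
          int err;

          err = cpulist_parse(buf, out);  /* "" parses to an empty mask */
          if (err)
                  return err;
          if (!cpumask_subset(out, cpu_possible_mask))
                  return -EINVAL;         /* reject CPUs that cannot exist */
          return 0;
  }
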
Chen Ridong 86bbbd1f33 cpuset: Refactor exclusive CPU mask computation logic
The current compute_effective_exclusive_cpumask function handles multiple
scenarios with different input parameters, making the code difficult to
follow. This patch refactors it into two separate functions:
compute_excpus and compute_trialcs_excpus.

The compute_excpus function calculates the exclusive CPU mask for a given
input and excludes exclusive CPUs from sibling cpusets when cs's
exclusive_cpus is not explicitly set.

The compute_trialcs_excpus function specifically handles exclusive CPU
computation for trial cpusets used during CPU mask configuration updates,
and always excludes exclusive CPUs from sibling cpusets.

This refactoring significantly improves code readability and clarity,
making it explicit which function to call for each use case and what
parameters should be provided.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-17 08:37:30 -10:00
Chen Ridong 6a59fc4a3a cpuset: change return type of is_partition_[in]valid to bool
The functions is_partition_valid() and is_partition_invalid() logically
return boolean values, but were previously declared with return type
'int'. This patch changes their return type to 'bool' to better reflect
their semantic meaning and improve type safety.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-17 08:37:30 -10:00
Chen Ridong bba0ccf829 cpuset: remove unused assignment to trialcs->partition_root_state
The trialcs->partition_root_state field is not used during the
configuration of 'cpuset.cpus' or 'cpuset.cpus.exclusive'. Therefore,
the assignment of values to this field can be safely removed.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-17 08:35:28 -10:00
Chen Ridong b783a62655 cpuset: move the root cpuset write check earlier
The 'cpus' or 'mems' lists of the top_cpuset cannot be modified.
This check can be moved before acquiring any locks as a common code
block to improve efficiency and maintainability.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-17 08:35:17 -10:00
Eduard Zingerman a3c73d629e bpf: dont report verifier bug for missing bpf_scc_visit on speculative path
Syzbot generated a program that triggers a verifier_bug() call in
maybe_exit_scc(). maybe_exit_scc() assumes that, when called for a
state with insn_idx in some SCC, there should be an instance of struct
bpf_scc_visit allocated for that SCC. Turns out the assumption does
not hold for speculative execution paths. See example in the next
patch.

maybe_scc_exit() is called from update_branch_counts() for states that
reach branch count of zero, meaning that path exploration for a
particular path is finished. Path exploration can finish in one of
three ways:
a. Verification error is found. In this case, update_branch_counts()
   is called only for non-speculative paths.
b. Top level BPF_EXIT is reached. Such instructions are never a part of
   an SCC, so compute_scc_callchain() in maybe_scc_exit() will return
   false, and maybe_scc_exit() will return early.
c. A checkpoint is reached and matched. Checkpoints are created by
   is_state_visited(), which calls maybe_enter_scc(), which allocates
   bpf_scc_visit instances for checkpoints within SCCs.

Hence, for non-speculative symbolic execution paths, the assumption
still holds: if maybe_scc_exit() is called for a state within an SCC,
bpf_scc_visit instance must exist.

This patch removes the verifier_bug() call for speculative paths.

Fixes: c9e31900b5 ("bpf: propagate read/precision marks over state graph backedges")
Reported-by: syzbot+3afc814e8df1af64b653@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/bpf/68c85acd.050a0220.2ff435.03a4.GAE@google.com/
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250916212251.3490455-1-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-17 11:19:58 -07:00
Sebastian Andrzej Siewior 3253cb49cb softirq: Allow to drop the softirq-BKL lock on PREEMPT_RT
softirqs are preemptible on PREEMPT_RT. There is synchronisation between
individual sections which disable bottom halves. This in turn means that
a forced threaded interrupt cannot preempt another forced threaded
interrupt. Instead it will PI-boost the other handler and wait for its
completion.

This is required because code within a softirq section is assumed to be
non-preemptible and may expect exclusive access to per-CPU resources
such as variables or pinned timers.

Code with such expectation has been identified and updated to use
local_lock_nested_bh() for locking of the per-CPU resource. This means the
softirq lock can be removed.

Disable the softirq synchronization, but add a new config switch
CONFIG_PREEMPT_RT_NEEDS_BH_LOCK which allows re-enabling the synchronized
behavior in case there are issues which haven't been detected yet.

The softirq_ctrl.cnt accounting remains to let the NOHZ code know if
softirqs are currently handled.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2025-09-17 16:25:41 +02:00
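
The per-CPU conversions referenced above follow this shape; a hedged sketch
with hypothetical data structure and function names, built on the existing
local_lock_nested_bh() API:

  #include <linux/local_lock.h>
  #include <linux/percpu.h>

  struct pcpu_stats {
          local_lock_t    lock;
          unsigned long   events;
  };
  static DEFINE_PER_CPU(struct pcpu_stats, pcpu_stats) = {
          .lock = INIT_LOCAL_LOCK(lock),
  };

  static void count_event(void)
  {
          /* Caller runs with bottom halves disabled; the nested-BH lock
           * scopes the per-CPU access instead of the global softirq lock. */
          local_lock_nested_bh(&pcpu_stats.lock);
          __this_cpu_inc(pcpu_stats.events);
          local_unlock_nested_bh(&pcpu_stats.lock);
  }
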
Sebastian Andrzej Siewior fd4e876f59 softirq: Provide a handshake for canceling tasklets via polling
The tasklet_unlock_spin_wait() via tasklet_disable_in_atomic() is
provided for a few legacy tasklet users. The interface is used from
atomic context (which is either softirq or disabled preemption) on
non-PREEMPT_RT and relies on spinning until the tasklet callback
completes.

On PREEMPT_RT the context is never atomic but the busy polling logic
remains. It is possible that the thread invoking tasklet_unlock_spin_wait()
has higher priority than the tasklet. If both run on the same CPU, the
tasklet makes no progress and the thread trying to cancel the tasklet will
live-lock the system.

To avoid the lockup tasklet_unlock_spin_wait() uses local_bh_disable()/
enable() which utilizes the local_lock_t for synchronisation. This lock is
a central per-CPU BKL and about to be removed.

Solve this by acquiring a lock in tasklet_action_common() which is held while
the tasklet's callback is invoked. This lock will be acquired from
tasklet_unlock_spin_wait() via tasklet_callback_cancel_wait_running().

After the tasklet has completed, tasklet_callback_sync_wait_running() drops the
lock and acquires it again. In order to avoid unlocking the lock even if
there is no cancel request, there is a cb_waiters counter which is
incremented during a cancel request.  Blocking on the lock will PI-boost
the tasklet if needed, ensuring progress is made.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2025-09-17 16:25:41 +02:00
Tejun Heo a1eab4d813 sched_ext, sched/core: Fix build failure when !FAIR_GROUP_SCHED && EXT_GROUP_SCHED
While collecting SCX related fields in struct task_group into struct
scx_task_group, 6e6558a6bc ("sched_ext, sched/core: Factor out struct
scx_task_group") forgot to update the tg->scx_weight usage in tg_weight(), which
leads to build failure when CONFIG_FAIR_GROUP_SCHED is disabled but
CONFIG_EXT_GROUP_SCHED is enabled. Fix it.

Fixes: 6e6558a6bc ("sched_ext, sched/core: Factor out struct scx_task_group")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202509170230.MwZsJSWa-lkp@intel.com/
Tested-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-16 23:07:27 -10:00
Jeremy Linton ba1afc94de uprobes: uprobe_warn should use passed task
uprobe_warn() is passed a task structure, yet it uses current. For
the most part this shouldn't matter, but since a task structure is
provided, let's use it.

Fixes: 248d3a7b2f ("uprobes: Change uprobe_copy_process() to dup return_instances")
Signed-off-by: Jeremy Linton <jeremy.linton@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Will Deacon <will@kernel.org>
2025-09-16 21:34:49 +01:00
Marco Crivellari dadb3ebcf3 workqueue: WQ_PERCPU added to alloc_workqueue users
Currently, if a user enqueues a work item using schedule_delayed_work(), the
wq used is "system_wq" (a per-cpu wq), while queue_delayed_work() uses
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work(), which uses system_wq, and queue_work(), which again makes
use of WORK_CPU_UNBOUND.
This lack of consistency cannot be addressed without refactoring the API.

alloc_workqueue() treats all queues as per-CPU by default, while unbound
workqueues must opt-in via WQ_UNBOUND.

This default is suboptimal: most workloads benefit from unbound queues,
allowing the scheduler to place worker threads where they’re needed and
reducing noise when CPUs are isolated.

This patch adds a new WQ_PERCPU flag to explicitly request the use of
the per-CPU behavior. Both flags coexist for one release cycle to allow
callers to transition their calls.

Once migration is complete, WQ_UNBOUND can be removed and unbound will
become the implicit default.

With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU.

All existing users have been updated accordingly.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-16 10:33:53 -10:00
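
A hedged example of the new call shape (workqueue name hypothetical): callers
that really want per-CPU execution now say so explicitly instead of relying
on the implicit default:

  #include <linux/workqueue.h>

  static struct workqueue_struct *example_wq;

  static int example_init(void)
  {
          /* Explicitly request per-CPU behavior; unbound users keep
           * WQ_UNBOUND until it becomes the implicit default. */
          example_wq = alloc_workqueue("example_wq", WQ_PERCPU, 0);
          if (!example_wq)
                  return -ENOMEM;
          return 0;
  }
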
Andrea Righi 0b47b6c354 Revert "sched_ext: Skip per-CPU tasks in scx_bpf_reenqueue_local()"
scx_bpf_reenqueue_local() can be called from ops.cpu_release() when a
CPU is taken by a higher scheduling class to give tasks queued to the
CPU's local DSQ a chance to be migrated somewhere else, instead of
waiting indefinitely for that CPU to become available again.

In doing so, we decided to skip migration-disabled tasks, under the
assumption that they cannot be migrated anyway.

However, when a higher scheduling class preempts a CPU, the running task
is always inserted at the head of the local DSQ as a migration-disabled
task. This means it is always skipped by scx_bpf_reenqueue_local(), and
ends up being confined to the same CPU even if that CPU is heavily
contended by other higher scheduling class tasks.

As an example, let's consider the following scenario:

 $ schedtool -a 0,1, -e yes > /dev/null
 $ sudo schedtool -F -p 99 -a 0, -e \
   stress-ng -c 1 --cpu-load 99 --cpu-load-slice 1000

The first task (SCHED_EXT) can run on CPU0 or CPU1. The second task
(SCHED_FIFO) is pinned to CPU0 and consumes ~99% of it. If the SCHED_EXT
task initially runs on CPU0, it will remain there because it always sees
CPU0 as "idle" in the short gaps left by the RT task, resulting in ~1%
utilization while CPU1 stays idle:

    0[||||||||||||||||||||||100.0%]   8[                        0.0%]
    1[                        0.0%]   9[                        0.0%]
    2[                        0.0%]  10[                        0.0%]
    3[                        0.0%]  11[                        0.0%]
    4[                        0.0%]  12[                        0.0%]
    5[                        0.0%]  13[                        0.0%]
    6[                        0.0%]  14[                        0.0%]
    7[                        0.0%]  15[                        0.0%]
  PID USER       PRI  NI  S CPU  CPU%▽MEM%   TIME+  Command
 1067 root        RT   0  R   0  99.0  0.2  0:31.16 stress-ng-cpu [run]
  975 arighi      20   0  R   0   1.0  0.0  0:26.32 yes

By allowing scx_bpf_reenqueue_local() to re-enqueue migration-disabled
tasks, the scheduler can choose to migrate them to other CPUs (CPU1 in
this case) via ops.enqueue(), leading to better CPU utilization:

    0[||||||||||||||||||||||100.0%]   8[                        0.0%]
    1[||||||||||||||||||||||100.0%]   9[                        0.0%]
    2[                        0.0%]  10[                        0.0%]
    3[                        0.0%]  11[                        0.0%]
    4[                        0.0%]  12[                        0.0%]
    5[                        0.0%]  13[                        0.0%]
    6[                        0.0%]  14[                        0.0%]
    7[                        0.0%]  15[                        0.0%]
  PID USER       PRI  NI  S CPU  CPU%▽MEM%   TIME+  Command
  577 root        RT   0  R   0 100.0  0.2  0:23.17 stress-ng-cpu [run]
  555 arighi      20   0  R   1 100.0  0.0  0:28.67 yes

It's debatable whether per-CPU tasks should be re-enqueued as well, but
doing so is probably safer: the scheduler can recognize re-enqueued
tasks through the %SCX_ENQ_REENQ flag, reassess their placement, and
either put them back at the head of the local DSQ or let another task
attempt to take the CPU.

This also prevents giving per-CPU tasks an implicit priority boost,
which would otherwise make them more likely to reclaim CPUs preempted by
higher scheduling classes.

Fixes: 97e13ecb02 ("sched_ext: Skip per-CPU tasks in scx_bpf_reenqueue_local()")
Cc: stable@vger.kernel.org # v6.15+
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-16 10:15:23 -10:00
pengdonglin 58ab6d25a1 cgroup/cpuset: Remove redundant rcu_read_lock/unlock() in spin_lock
Since commit a8bb74acd8 ("rcu: Consolidate RCU-sched update-side function definitions")
there is no difference between rcu_read_lock(), rcu_read_lock_bh() and
rcu_read_lock_sched() in terms of RCU read section and the relevant grace
period. That means that spin_lock(), which implies rcu_read_lock_sched(),
also implies rcu_read_lock().

There is no need to explicitly start an RCU read section if one has already
been started implicitly by spin_lock().

Simplify the code and remove the inner rcu_read_lock() invocation.

Cc: Waiman Long <longman@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: pengdonglin <pengdonglin@xiaomi.com>
Signed-off-by: pengdonglin <dolinux.peng@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-16 08:36:24 -10:00
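
The before/after shape of the simplification, as a hedged sketch with
hypothetical lock and helper names:

  /* Before: an explicit RCU read section nested inside the spinlock. */
  spin_lock(&example_lock);
  rcu_read_lock();
  lookup_under_rcu();
  rcu_read_unlock();
  spin_unlock(&example_lock);

  /* After: spin_lock() implies rcu_read_lock_sched(), which shares grace
   * periods with rcu_read_lock() since the RCU flavors were consolidated,
   * so the inner read section is redundant. */
  spin_lock(&example_lock);
  lookup_under_rcu();
  spin_unlock(&example_lock);
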
pengdonglin 3ee4211ef8 cgroup: Remove redundant rcu_read_lock/unlock() in spin_lock
Since commit a8bb74acd8 ("rcu: Consolidate RCU-sched update-side function definitions")
there is no difference between rcu_read_lock(), rcu_read_lock_bh() and
rcu_read_lock_sched() in terms of RCU read section and the relevant grace
period. That means that spin_lock(), which implies rcu_read_lock_sched(),
also implies rcu_read_lock().

There is no need to explicitly start an RCU read section if one has already
been started implicitly by spin_lock().

Simplify the code and remove the inner rcu_read_lock() invocation.

Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: pengdonglin <pengdonglin@xiaomi.com>
Signed-off-by: pengdonglin <dolinux.peng@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-16 08:36:14 -10:00
Eduard Zingerman b13448dd64 bpf: potential double-free of env->insn_aux_data
Function bpf_patch_insn_data() has the following structure:

  static struct bpf_prog *bpf_patch_insn_data(... env ...)
  {
        struct bpf_prog *new_prog;
        struct bpf_insn_aux_data *new_data = NULL;

        if (len > 1) {
                new_data = vrealloc(...);  // <--------- (1)
                if (!new_data)
                        return NULL;

                env->insn_aux_data = new_data;  // <---- (2)
        }

        new_prog = bpf_patch_insn_single(env->prog, off, patch, len);
        if (IS_ERR(new_prog)) {
                ...
                vfree(new_data);   // <----------------- (3)
                return NULL;
        }
        ... happy path ...
  }

In case if bpf_patch_insn_single() returns an error the `new_data`
allocated at (1) will be freed at (3). However, at (2) this pointer
is stored in `env->insn_aux_data`. Which is freed unconditionally
by verifier.c:bpf_check() on both happy and error paths.
Thus, leading to double-free.

Fix this by removing vfree() call at (3), ownership over `new_data` is
already passed to `env->insn_aux_data` at this point.

Fixes: 77620d1267 ("bpf: use realloc in bpf_patch_insn_data")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250912-patch-insn-data-double-free-v1-1-af05bd85a21a@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-15 13:04:21 -07:00
Kumar Kartikeya Dwivedi 2c89513395 bpf: Do not limit bpf_cgroup_from_id to current's namespace
The bpf_cgroup_from_id kfunc relies on cgroup_get_from_id to obtain the
cgroup corresponding to a given cgroup ID. This helper can be called in
a lot of contexts where the current thread can be random. A recent
example was its use in sched_ext's ops.tick(), to obtain the root cgroup
pointer. Since the current task can be whatever random user space task
preempted by the timer tick, this makes the behavior of the helper
unreliable.

Refactor out __cgroup_get_from_id as the non-namespace aware version of
cgroup_get_from_id, and change bpf_cgroup_from_id to make use of it.

There is no compatibility breakage here, since changing the namespace
against which the lookup is being done to the root cgroup namespace only
permits a wider set of lookups to succeed now. The cgroup IDs across
namespaces are globally unique, and thus don't need to be retranslated.

Reported-by: Dan Schatzberg <dschatzberg@meta.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20250915032618.1551762-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-15 10:53:15 -07:00
Mateusz Guzik f99b391778
fs: rename generic_delete_inode() and generic_drop_inode()
generic_delete_inode() is rather misleading for what the routine is
doing. inode_just_drop() should be much clearer.

The new naming is inconsistent with generic_drop_inode(), so rename that
one as well, with inode_ as the prefix.

No functional changes.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-15 16:09:42 +02:00
Rafael J. Wysocki bd03c7020d Merge back earlier material related to system sleep for 6.18 2025-09-15 12:03:26 +02:00
Christian Loehle 1ebe8f7e78 PM: EM: Fix late boot with holes in CPU topology
Commit e3f1164fc9 ("PM: EM: Support late CPUs booting and capacity
adjustment") added a mechanism to handle CPUs that come up late by
retrying when any of the `cpufreq_cpu_get()` calls fails.

However, if there are holes in the CPU topology (offline CPUs, e.g.
nosmt), the first missing CPU causes the loop to break, preventing
subsequent online CPUs from being updated.

Instead of aborting on the first missing CPU policy, loop through all
and retry if any were missing.

Fixes: e3f1164fc9 ("PM: EM: Support late CPUs booting and capacity adjustment")
Suggested-by: Kenneth Crudup <kenneth.crudup@gmail.com>
Reported-by: Kenneth Crudup <kenneth.crudup@gmail.com>
Link: https://lore.kernel.org/linux-pm/40212796-734c-4140-8a85-854f72b8144d@panix.com/
Cc: 6.9+ <stable@vger.kernel.org> # 6.9+
Signed-off-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/20250831214357.2020076-1-christian.loehle@arm.com
[ rjw: Drop the new pr_debug() message which is not very useful ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-09-15 12:02:24 +02:00
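
The fixed loop shape, sketched with hypothetical surrounding code: a missing
policy is remembered for a later retry instead of terminating the scan early:

  bool retry = false;
  int cpu;

  for_each_possible_cpu(cpu) {
          struct cpufreq_policy *policy = cpufreq_cpu_get(cpu);

          if (!policy) {
                  retry = true;   /* hole in the topology: revisit later */
                  continue;       /* previously: break, skipping later CPUs */
          }
          /* ... update this CPU's energy model capacity here ... */
          cpufreq_cpu_put(policy);
  }
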
Aaron Lu 0d4eaf8caf sched/fair: Do not balance task to a throttled cfs_rq
When doing load balancing and the target cfs_rq is in a throttled hierarchy,
whether to allow balancing there is a question.

The good side to allow balancing is: if the target CPU is idle or less
loaded and the being balanced task is holding some kernel resources,
then it seems a good idea to balance the task there and let the task get
the CPU earlier and release kernel resources sooner. The bad part is, if
the task is not holding any kernel resources, then the balance seems not
that useful.

While theoretically it's debatable, a performance test[0] which involves
200 cgroups and each cgroup runs hackbench(20 sender, 20 receiver) in
pipe mode showed a performance degradation on AMD Genoa when allowing
load balance to throttled cfs_rq. Analysis[1] showed hackbench doesn't
like task migration across LLC boundary. For this reason, add a check in
can_migrate_task() to forbid balancing to a cfs_rq that is in throttled
hierarchy. This reduced task migration a lot and performance restored.

[0]: https://lore.kernel.org/lkml/20250822110701.GB289@bytedance/
[1]: https://lore.kernel.org/lkml/20250903101102.GB42@bytedance/

Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
2025-09-15 09:38:38 +02:00
Aaron Lu 253b3f5872 sched/fair: Do not special case tasks in throttled hierarchy
With the introduction of the task based throttle model, a task in a throttled
hierarchy is allowed to continue to run till it gets throttled on its
ret2user path.

For this reason, remove those throttled_hierarchy() checks in the
following functions so that those tasks can get their turn as normal
tasks: dequeue_entities(), check_preempt_wakeup_fair() and
yield_to_task_fair().

The benefit of doing it this way is: if those tasks gets the chance to
run earlier and if they hold any kernel resources, they can release
those resources earlier. The downside is, if they don't hold any kernel
resouces, all they can do is to throttle themselves on their way back to
user space so the favor to let them run seems not that useful and for
check_preempt_wakeup_fair(), that favor may be bad for curr.

K Prateek Nayak pointed out prio_changed_fair() can send a throttled
task to check_preempt_wakeup_fair(), further tests showed the affinity
change path from move_queued_task() can also send a throttled task to
check_preempt_wakeup_fair(), that's why the check of task_is_throttled()
in that function.

Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2025-09-15 09:38:37 +02:00
Aaron Lu fcd394866e sched/fair: update_cfs_group() for throttled cfs_rqs
With task based throttle model, tasks in a throttled hierarchy are
allowed to continue to run if they are running in kernel mode. For this
reason, PELT clock is not stopped for these cfs_rqs in throttled
hierarchy when they still have tasks running or queued.

Since PELT clock is not stopped, whether to allow update_cfs_group()
doing its job for cfs_rqs which are in throttled hierarchy but still
have tasks running/queued is a question.

The good side is, continuing to run update_cfs_group() keeps these
cfs_rq entities with an up2date weight and that up2date weight can be
useful to derive an accurate load for the CPU as well as ensure fairness
if multiple tasks of different cgroups are running on the same CPU.
OTOH, as Benjamin Segall pointed: when unthrottle comes around the most
likely correct distribution is the distribution we had at the time of
throttle.

In reality, either way may not matter that much if tasks in throttled
hierarchy don't run in kernel mode for too long. But in case that
happens, let these cfs_rq entities have an up2date weight seems a good
thing to do.

Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2025-09-15 09:38:37 +02:00
Aaron Lu fe8d238e64 sched/fair: Propagate load for throttled cfs_rq
Before task based throttle model, propagating load will stop at a
throttled cfs_rq and that propagate will happen on unthrottle time by
update_load_avg().

Now that there is no update_load_avg() on unthrottle for throttled
cfs_rq and all load tracking is done by task related operations, let the
propagate happen immediately.

While at it, add a comment to explain why cfs_rqs that are not affected
by throttle have to be added to leaf cfs_rq list in
propagate_entity_cfs_rq() per my understanding of commit 0258bdfaff
("sched/fair: Fix unfairness caused by missing load decay").

Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
2025-09-15 09:38:37 +02:00
Zhen Ni 9b5096761c rv: Fix missing mutex unlock in rv_register_monitor()
If create_monitor_dir() fails, the function returns directly without
releasing rv_interface_lock. This leaves the mutex locked and causes
subsequent monitor registration attempts to deadlock.

Fix it by making the error path jump to out_unlock, ensuring that the
mutex is always released before returning.

Fixes: 24cbfe18d5 ("rv: Merge struct rv_monitor_def into struct rv_monitor")
Signed-off-by: Zhen Ni <zhen.ni@easystack.cn>
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Reviewed-by: Nam Cao <namcao@linutronix.de>
Link: https://lore.kernel.org/r/20250903065112.1878330-1-zhen.ni@easystack.cn
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
2025-09-15 08:36:35 +02:00
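
A hedged sketch of the corrected error path (arguments and the surrounding
registration code are simplified):

  mutex_lock(&rv_interface_lock);
  ret = create_monitor_dir(monitor);
  if (ret)
          goto out_unlock;        /* previously returned here with the mutex held */
  /* ... rest of the registration ... */
  out_unlock:
  mutex_unlock(&rv_interface_lock);
  return ret;
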
Nam Cao de090d1cca rv: Fix wrong type cast in enabled_monitors_next()
Argument 'p' of enabled_monitors_next() is not a pointer to struct
rv_monitor, it is actually a pointer to the list_head inside struct
rv_monitor. Therefore it is wrong to cast 'p' to struct rv_monitor *.

This wrong type cast has been there since the beginning. But it still
worked because the list_head was the first field in struct rv_monitor_def.
This is no longer true since commit 24cbfe18d5 ("rv: Merge struct
rv_monitor_def into struct rv_monitor") moved the list_head, and this wrong
type cast became a functional problem.

Properly use container_of() instead.

Fixes: 24cbfe18d5 ("rv: Merge struct rv_monitor_def into struct rv_monitor")
Signed-off-by: Nam Cao <namcao@linutronix.de>
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Link: https://lore.kernel.org/r/20250806120911.989365-1-namcao@linutronix.de
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
2025-09-15 08:36:35 +02:00
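
The fix in miniature, assuming the embedded member is named 'list':

  /* Wrong: only worked while the list_head was the first member.
   *     struct rv_monitor *mon = (struct rv_monitor *)p;
   * Right: recover the container from the embedded list_head. */
  struct rv_monitor *mon = container_of(p, struct rv_monitor, list);
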
Palmer Dabbelt 03ee64b5e5 rv: Support systems with time64-only syscalls
Some systems (like 32-bit RISC-V) only have the 64-bit time_t versions
of syscalls.  So handle the 32-bit time_t version of those being
undefined.

Fixes: f74f8bb246 ("rv: Add rtapp_sleep monitor")
Closes: https://lore.kernel.org/oe-kbuild-all/202508160204.SsFyNfo6-lkp@intel.com
Signed-off-by: Palmer Dabbelt <palmer@dabbelt.com>
Acked-by: Nam Cao <namcao@linutronix.de>
Link: https://lore.kernel.org/r/20250804194518.97620-2-palmer@dabbelt.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
2025-09-15 08:36:27 +02:00
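
The guard pattern, sketched with a hypothetical handler and variable: the
32-bit time_t syscall numbers are referenced only when the architecture
actually defines them:

  switch (syscall_nr) {
  #ifdef __NR_clock_nanosleep
  case __NR_clock_nanosleep:              /* absent on time64-only ABIs */
  #endif
  #ifdef __NR_clock_nanosleep_time64
  case __NR_clock_nanosleep_time64:       /* absent on 64-bit ABIs */
  #endif
          handle_sleeping_syscall();      /* hypothetical handler */
          break;
  default:
          break;
  }
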
Greg Kroah-Hartman c319c4ec06 Merge 6.17-rc6 into driver-core-next
We need the driver core fixes in here to build on top of.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-09-15 08:26:05 +02:00
Linus Torvalds 8378c89172 Fix a lost-timeout CPU hotplug bug in the hrtimer code, which
can trigger with certain hardware configs and regular HZ.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmjGirsRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1i91g//QWtFBkjSJgDaCwABsSIr7Z5vgzlidtx5
 ZI6bP7Fjt2nMnRXqp+KojMhoVx5WNi8XnQHL+tXIx685ILg8pusxJZ8RplwgccxZ
 jApW6PWEvrznWQn4i/uPsH2sDud1T7lI/K74JGPri3xaxpWxfk5LBzHDv296ygi6
 PShGP2RDjc1WpOtTs/K/1BHpwodDGB9k7V9CNjydaYKtbuRtNfbvpqTt+Syto8O8
 UuFG22j2ZRyPbwuw3PouwaZgBOrks0H9cXW9s6E3wHJA4p+90LVDdhSXcYo2YtJG
 VHJM1wUSN/Tth4DIUwxcXwk5ya5AKvQvokPv9n/FL3ceO2CdfyR/hJ2euB+l/dCl
 kNooAjaIQzTLowyMigMO+tT7jKTLuwUrq/l6rHSEIoLWoWLH9Ii55fXDlKr9zihu
 y5H/jjbNKULIPzZ03gfuIqz4/+t7hFthMcyH+x4xHgPNayT9BJ2X/T59i3wATeZu
 s7hscdCb1rNE11Or2mggSX/pMDJxDMzEaD1JmD/4qJeFSipLmKbxcZZgq96RF//2
 tU5zrZhm5GIqaK8o/xp1ps15xAwSnDyYjH59U7To0drfoV9uBg7lczPhaAoVAW/F
 lQoepSFrw9hV34YRAebimza7/+5IRk+SBCWwYg++xviGaxFt5TStJITOTk3UbHSx
 miCA3D/MDeM=
 =m+Ze
 -----END PGP SIGNATURE-----

Merge tag 'timers-urgent-2025-09-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fix from Ingo Molnar:
 "Fix a lost-timeout CPU hotplug bug in the hrtimer code, which can
  trigger with certain hardware configs and regular HZ"

* tag 'timers-urgent-2025-09-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  hrtimers: Unconditionally update target CPU base after offline timer migration
2025-09-14 08:38:05 -07:00
Ricardo B. Marliere 3b5eba544a perf: make pmu_bus const
Now that the driver core can properly handle constant struct bus_type,
move the pmu_bus variable to be a constant structure as well,
placing it into read-only memory which can not be modified at runtime.

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: "Ricardo B. Marliere" <ricardo@marliere.net>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20240204-bus_cleanup-events-v1-1-c779d1639c3a@marliere.net
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-09-14 16:01:37 +02:00
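
The change in miniature: with the driver core accepting const bus pointers,
the definition can live in read-only memory (callbacks and attribute groups
elided in this sketch):

  static const struct bus_type pmu_bus = {
          .name = "event_source",         /* appears as /sys/bus/event_source */
          /* remaining members unchanged */
  };
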
Evangelos Petrongonas d6d5116391 kexec: introduce is_kho_boot()
Patch series "efi: Fix EFI boot with kexec handover (KHO)", v3.

This patch series fixes a kernel panic that occurs when booting with both
EFI and KHO (Kexec HandOver) enabled.

The issue arises because EFI's `reserve_regions()` clears all memory
regions with `memblock_remove(0, PHYS_ADDR_MAX)` before rebuilding them
from EFI data.  This destroys KHO scratch regions that were set up early
during device tree scanning, causing a panic as the kernel has no valid
memory regions for early allocations.

The first patch introduces `is_kho_boot()` to allow early boot components
to reliably detect if the kernel was booted via KHO-enabled kexec.  The
existing `kho_is_enabled()` only checks the command line and doesn't
verify if an actual KHO FDT was passed.

The second patch modifies EFI's `reserve_regions()` to selectively remove
only non-KHO memory regions when KHO is active, preserving the critical
scratch regions while still allowing EFI to rebuild its memory map.


This patch (of 3):

During early initialisation after a kexec, other components, like EFI,
need to know if a KHO enabled kexec was performed.  The `kho_is_enabled`
function is not enough, as in the early stages it only reflects whether
the cmdline has KHO enabled, not whether an actual KHO FDT exists.

Extend the KHO API with `is_kho_boot()` to provide a way for components to
check if a KHO enabled kexec is performed.

Link: https://lkml.kernel.org/r/cover.1755721529.git.epetron@amazon.de
Link: https://lkml.kernel.org/r/7dc6674a76bf6e68cca0222ccff32427699cc02e.1755721529.git.epetron@amazon.de
Signed-off-by: Evangelos Petrongonas <epetron@amazon.de>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Changyuan Lyu <changyuanl@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13 17:32:56 -07:00
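
A hedged sketch of the intended consumer in EFI's reserve_regions(); the
helper doing the selective removal is hypothetical:

  if (!is_kho_boot()) {
          /* No KHO hand-over: safe to wipe and rebuild from EFI data. */
          memblock_remove(0, PHYS_ADDR_MAX);
  } else {
          /* KHO scratch regions registered earlier from the FDT must
           * survive; drop only the non-KHO ranges instead. */
          remove_non_kho_regions();       /* hypothetical helper */
  }
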
Coiby Xu 913e65a2fe crash: add KUnit tests for crash_exclude_mem_range
crash_exclude_mem_range seems to be a simple function but there have been
multiple attempts to fix it,
 - commit a2e9a95d21 ("kexec: Improve & fix crash_exclude_mem_range()
   to handle overlapping ranges")
 - commit 6dff315972 ("crash_core: fix and simplify the logic of
   crash_exclude_mem_range()")

So add a set of unit tests to verify the correctness of current
implementation.  Shall we change the function in the future, the unit
tests can also help prevent any regression.  For example, we may make the
function smarter by allocating extra crash_mem range on demand thus there
is no need for the caller to foresee any memory range split or address
-ENOMEM failure.

The testing strategy is to verify the correctness of the base case. The
base case is that there is one to-be-excluded range A and one existing range
B. Then we can exhaust all possibilities of the position of A regarding
B. For example, here are two combinations,
    Case: A is completely inside B (causes split)
      Original:       [----B----]
      Exclude:          {--A--}
      Result:         [B1] .. [B2]

    Case: A overlaps B's left part
      Original:       [----B----]
      Exclude:  {---A---}
      Result:           [..B..]

In theory we can prove the correctness by induction,
   - Base case: crash_exclude_mem_range is correct in the case where n=1
     (n is the number of existing ranges).
   - Inductive step: If crash_exclude_mem_range is correct for n=k
     existing ranges, then it's also correct for n=k+1 ranges.

But for the sake of simplicity, simply use unit tests to cover the base
case together with two regression tests.

Note most of the exclude_single_range_test() code is generated by Google
Gemini with some small tweaks.  The function specification, function body
and the exhaustive test strategy are presented as prompts.

[akpm@linux-foundation.org: export crash_exclude_mem_range() to modules, for kernel/crash_core_test.c]
Link: https://lkml.kernel.org/r/20250904093855.1180154-2-coxu@redhat.com
Signed-off-by: Coiby Xu <coxu@redhat.com>
Assisted-by: Google Gemini
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Dave Young <dyoung@redhat.com>
Cc: fuqiang wang <fuqiang.wang@easystack.cn>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13 17:32:55 -07:00
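
A hedged sketch of one such base case in KUnit form (suite, case and
constants are illustrative), exercising the "A completely inside B" split:

  #include <kunit/test.h>
  #include <linux/crash_core.h>
  #include <linux/slab.h>

  static void exclude_inside_splits_range(struct kunit *test)
  {
          struct crash_mem *mem;

          mem = kunit_kzalloc(test, struct_size(mem, ranges, 2), GFP_KERNEL);
          KUNIT_ASSERT_NOT_ERR_OR_NULL(test, mem);

          mem->max_nr_ranges = 2;
          mem->nr_ranges = 1;
          mem->ranges[0].start = 0x1000;  /* [----B----] */
          mem->ranges[0].end   = 0x9fff;

          /* Exclude {--A--} strictly inside B: expect [B1] .. [B2]. */
          KUNIT_ASSERT_EQ(test, crash_exclude_mem_range(mem, 0x3000, 0x3fff), 0);
          KUNIT_EXPECT_EQ(test, mem->nr_ranges, 2u);
          KUNIT_EXPECT_EQ(test, mem->ranges[0].end, 0x2fffULL);
          KUNIT_EXPECT_EQ(test, mem->ranges[1].start, 0x4000ULL);
  }

  static struct kunit_case crash_exclude_cases[] = {
          KUNIT_CASE(exclude_inside_splits_range),
          {}
  };

  static struct kunit_suite crash_exclude_suite = {
          .name = "crash_exclude_mem_range_sketch",
          .test_cases = crash_exclude_cases,
  };
  kunit_test_suite(crash_exclude_suite);
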
Sergey Senozhatsky 37aa782df9 panic: remove redundant panic-cpu backtrace
Backtraces from all CPUs are printed during panic() when
SYS_INFO_ALL_CPU_BT is set.  It shows the backtrace for the panic-CPU even
when it has already been explicitly printed before.

Do not change the legacy code which prints the backtrace in various
contexts, for example, as part of an Oops report, right after the panic message.
It will always be visible in the crash dump.

Instead, remember when the backtrace was printed, and skip it when dumping
the optional backtraces on all CPUs.

[akpm@linux-foundation.org: make panic_this_cpu_backtrace_printed static]
  Closes: https://lore.kernel.org/oe-kbuild-all/202509050048.FMpVvh1u-lkp@intel.com/
[pmladek@suse.com: Handle situations when the backtrace was not printed for the panic CPU]
Link: https://lkml.kernel.org/r/20250903100418.410026-1-pmladek@suse.com
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Link: https://lore.kernel.org/r/20250731030314.3818040-1-senozhatsky@chromium.org
Signed-off-by: Petr Mladek <pmladek@suse.com>
Tested-by: Feng Tang <feng.tang@linux.alibaba.com>
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13 17:32:54 -07:00
Jinchao Wang 652ab7c8fa panic: use angle-bracket include for panic.h
Replace quoted includes of panic.h with `#include <linux/panic.h>` for
consistency across the kernel.

Link: https://lkml.kernel.org/r/20250829051312.33773-1-wangjinchao600@gmail.com
Signed-off-by: Jinchao Wang <wangjinchao600@gmail.com>
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Cc: Qianqiang Liu <qianqiang.liu@163.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13 17:32:54 -07:00
Jinchao Wang 3d5f4f15b7 watchdog: skip checks when panic is in progress
This issue was found when an EFI pstore was configured for kdump logging
with the NMI hard lockup detector enabled.  The efi-pstore write operation
was slow, and with a large number of logs, the pstore dump callback within
kmsg_dump() took a long time.

This delay triggered the NMI watchdog, leading to a nested panic.  The
call flow demonstrates how the secondary panic caused an
emergency_restart() to be triggered before the initial pstore operation
could finish, leading to a failure to dump the logs:

  real panic() {
	kmsg_dump() {
		...
		pstore_dump() {
			start_dump();
			... // long time operation triggers NMI watchdog
			nmi panic() {
				...
				emergency_restart(); // pstore unfinished
			}
			...
			finish_dump(); // never reached
		}
	}
  }

Both watchdog_buddy_check_hardlockup() and watchdog_overflow_callback()
may trigger during a panic.  This can lead to recursive panic handling.

Add panic_in_progress() checks so watchdog activity is skipped once a
panic has begun.

This prevents recursive panic and keeps the panic path more reliable.
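
A minimal sketch of the guard:

    /* e.g. at the top of watchdog_overflow_callback() and
     * watchdog_buddy_check_hardlockup(): */
    if (panic_in_progress())
            return;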

Link: https://lkml.kernel.org/r/20250825022947.1596226-10-wangjinchao600@gmail.com
Signed-off-by: Jinchao Wang <wangjinchao600@gmail.com>
Reviewed-by: Yury Norov (NVIDIA) <yury.norov@gmail.com>
Cc: Anna Schumaker <anna.schumaker@oracle.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dave Young <dyoung@redhat.com>
Cc: Doug Anderson <dianders@chromium.org>
Cc: "Guilherme G. Piccoli" <gpiccoli@igalia.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Joanthan Cameron <Jonathan.Cameron@huawei.com>
Cc: Joel Granados <joel.granados@kernel.org>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Kees Cook <kees@kernel.org>
Cc: Li Huafei <lihuafei1@huawei.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Luo Gengkun <luogengkun@huaweicloud.com>
Cc: Max Kellermann <max.kellermann@ionos.com>
Cc: Nam Cao <namcao@linutronix.de>
Cc: oushixiong <oushixiong@kylinos.cn>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Qianqiang Liu <qianqiang.liu@163.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Sohil Mehta <sohil.mehta@intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Thomas Zimemrmann <tzimmermann@suse.de>
Cc: Thorsten Blum <thorsten.blum@linux.dev>
Cc: Ville Syrjala <ville.syrjala@linux.intel.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Yicong Yang <yangyicong@hisilicon.com>
Cc: Yunhui Cui <cuiyunhui@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13 17:32:53 -07:00
Jinchao Wang d4a36db563 panic/printk: replace other_cpu_in_panic() with panic_on_other_cpu()
The helper other_cpu_in_panic() duplicated logic already provided by
panic_on_other_cpu().

Remove other_cpu_in_panic() and update all users to call
panic_on_other_cpu() instead.

This removes redundant code and makes panic handling consistent.

Link: https://lkml.kernel.org/r/20250825022947.1596226-9-wangjinchao600@gmail.com
Signed-off-by: Jinchao Wang <wangjinchao600@gmail.com>
Cc: Anna Schumaker <anna.schumaker@oracle.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dave Young <dyoung@redhat.com>
Cc: Doug Anderson <dianders@chromium.org>
Cc: "Guilherme G. Piccoli" <gpiccoli@igalia.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Joanthan Cameron <Jonathan.Cameron@huawei.com>
Cc: Joel Granados <joel.granados@kernel.org>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Kees Cook <kees@kernel.org>
Cc: Li Huafei <lihuafei1@huawei.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Luo Gengkun <luogengkun@huaweicloud.com>
Cc: Max Kellermann <max.kellermann@ionos.com>
Cc: Nam Cao <namcao@linutronix.de>
Cc: oushixiong <oushixiong@kylinos.cn>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Qianqiang Liu <qianqiang.liu@163.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Sohil Mehta <sohil.mehta@intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Thomas Zimemrmann <tzimmermann@suse.de>
Cc: Thorsten Blum <thorsten.blum@linux.dev>
Cc: Ville Syrjala <ville.syrjala@linux.intel.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Yicong Yang <yangyicong@hisilicon.com>
Cc: Yunhui Cui <cuiyunhui@bytedance.com>
Cc: Yury Norov (NVIDIA) <yury.norov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13 17:32:52 -07:00
Jinchao Wang c6be36e299 panic/printk: replace this_cpu_in_panic() with panic_on_this_cpu()
The helper this_cpu_in_panic() duplicated logic already provided by
panic_on_this_cpu().

Remove this_cpu_in_panic() and switch all users to panic_on_this_cpu().

This simplifies the code and avoids having two helpers for the same check.

Link: https://lkml.kernel.org/r/20250825022947.1596226-8-wangjinchao600@gmail.com
Signed-off-by: Jinchao Wang <wangjinchao600@gmail.com>
Cc: Anna Schumaker <anna.schumaker@oracle.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dave Young <dyoung@redhat.com>
Cc: Doug Anderson <dianders@chromium.org>
Cc: "Guilherme G. Piccoli" <gpiccoli@igalia.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Joanthan Cameron <Jonathan.Cameron@huawei.com>
Cc: Joel Granados <joel.granados@kernel.org>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Kees Cook <kees@kernel.org>
Cc: Li Huafei <lihuafei1@huawei.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Luo Gengkun <luogengkun@huaweicloud.com>
Cc: Max Kellermann <max.kellermann@ionos.com>
Cc: Nam Cao <namcao@linutronix.de>
Cc: oushixiong <oushixiong@kylinos.cn>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Qianqiang Liu <qianqiang.liu@163.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Sohil Mehta <sohil.mehta@intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Thomas Zimemrmann <tzimmermann@suse.de>
Cc: Thorsten Blum <thorsten.blum@linux.dev>
Cc: Ville Syrjala <ville.syrjala@linux.intel.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Yicong Yang <yangyicong@hisilicon.com>
Cc: Yunhui Cui <cuiyunhui@bytedance.com>
Cc: Yury Norov (NVIDIA) <yury.norov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13 17:32:52 -07:00
Jinchao Wang 2325e8eadf printk/nbcon: use panic_on_this_cpu() helper
nbcon_context_try_acquire() compared panic_cpu directly with
smp_processor_id().  This open-coded check is now provided by
panic_on_this_cpu().

Switch to panic_on_this_cpu() to simplify the code and improve readability.
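
The conversion, sketched:

    bool on_panic_cpu;

    /* Before: open-coded comparison against the panic CPU */
    on_panic_cpu = atomic_read(&panic_cpu) == smp_processor_id();

    /* After: the helper makes the intent explicit */
    on_panic_cpu = panic_on_this_cpu();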

Link: https://lkml.kernel.org/r/20250825022947.1596226-7-wangjinchao600@gmail.com
Signed-off-by: Jinchao Wang <wangjinchao600@gmail.com>
Cc: Anna Schumaker <anna.schumaker@oracle.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dave Young <dyoung@redhat.com>
Cc: Doug Anderson <dianders@chromium.org>
Cc: "Guilherme G. Piccoli" <gpiccoli@igalia.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Joanthan Cameron <Jonathan.Cameron@huawei.com>
Cc: Joel Granados <joel.granados@kernel.org>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Kees Cook <kees@kernel.org>
Cc: Li Huafei <lihuafei1@huawei.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Luo Gengkun <luogengkun@huaweicloud.com>
Cc: Max Kellermann <max.kellermann@ionos.com>
Cc: Nam Cao <namcao@linutronix.de>
Cc: oushixiong <oushixiong@kylinos.cn>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Qianqiang Liu <qianqiang.liu@163.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Sohil Mehta <sohil.mehta@intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Thomas Zimemrmann <tzimmermann@suse.de>
Cc: Thorsten Blum <thorsten.blum@linux.dev>
Cc: Ville Syrjala <ville.syrjala@linux.intel.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Yicong Yang <yangyicong@hisilicon.com>
Cc: Yunhui Cui <cuiyunhui@bytedance.com>
Cc: Yury Norov (NVIDIA) <yury.norov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13 17:32:52 -07:00
Jinchao Wang 6f313b5585 panic: use panic_try_start() in vpanic()
vpanic() had open-coded logic to claim panic_cpu with atomic_try_cmpxchg. 
This is already handled by panic_try_start().

Switch to panic_try_start() and use panic_on_other_cpu() for the fallback
path.

This removes duplicate code and makes panic handling consistent across
functions.

Link: https://lkml.kernel.org/r/20250825022947.1596226-6-wangjinchao600@gmail.com
Signed-off-by: Jinchao Wang <wangjinchao600@gmail.com>
Cc: Anna Schumaker <anna.schumaker@oracle.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dave Young <dyoung@redhat.com>
Cc: Doug Anderson <dianders@chromium.org>
Cc: "Guilherme G. Piccoli" <gpiccoli@igalia.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Joanthan Cameron <Jonathan.Cameron@huawei.com>
Cc: Joel Granados <joel.granados@kernel.org>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Kees Cook <kees@kernel.org>
Cc: Li Huafei <lihuafei1@huawei.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Luo Gengkun <luogengkun@huaweicloud.com>
Cc: Max Kellermann <max.kellermann@ionos.com>
Cc: Nam Cao <namcao@linutronix.de>
Cc: oushixiong <oushixiong@kylinos.cn>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Qianqiang Liu <qianqiang.liu@163.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Sohil Mehta <sohil.mehta@intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Thomas Zimemrmann <tzimmermann@suse.de>
Cc: Thorsten Blum <thorsten.blum@linux.dev>
Cc: Ville Syrjala <ville.syrjala@linux.intel.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Yicong Yang <yangyicong@hisilicon.com>
Cc: Yunhui Cui <cuiyunhui@bytedance.com>
Cc: Yury Norov (NVIDIA) <yury.norov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13 17:32:52 -07:00
Jinchao Wang 6b69c7ef96 panic: use panic_try_start() in nmi_panic()
nmi_panic() duplicated the logic to claim panic_cpu with
atomic_try_cmpxchg.  This is already wrapped in panic_try_start().

Replace the open-coded logic with panic_try_start(), and use
panic_on_other_cpu() for the fallback path.

This removes duplication and keeps panic handling code consistent.
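
A sketch of the conversion (the "before" shape mirrors the current
open-coded logic; the "after" uses the new helpers):

    int old_cpu, this_cpu;

    /* Before: claim panic_cpu by hand */
    old_cpu = PANIC_CPU_INVALID;
    this_cpu = raw_smp_processor_id();
    if (atomic_try_cmpxchg(&panic_cpu, &old_cpu, this_cpu))
            panic("%s", msg);
    else if (old_cpu != this_cpu)
            nmi_panic_self_stop(regs);

    /* After */
    if (panic_try_start())
            panic("%s", msg);
    else if (panic_on_other_cpu())
            nmi_panic_self_stop(regs);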

Link: https://lkml.kernel.org/r/20250825022947.1596226-5-wangjinchao600@gmail.com
Signed-off-by: Jinchao Wang <wangjinchao600@gmail.com>
Cc: Anna Schumaker <anna.schumaker@oracle.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dave Young <dyoung@redhat.com>
Cc: Doug Anderson <dianders@chromium.org>
Cc: "Guilherme G. Piccoli" <gpiccoli@igalia.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Joanthan Cameron <Jonathan.Cameron@huawei.com>
Cc: Joel Granados <joel.granados@kernel.org>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Kees Cook <kees@kernel.org>
Cc: Li Huafei <lihuafei1@huawei.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Luo Gengkun <luogengkun@huaweicloud.com>
Cc: Max Kellermann <max.kellermann@ionos.com>
Cc: Nam Cao <namcao@linutronix.de>
Cc: oushixiong <oushixiong@kylinos.cn>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Qianqiang Liu <qianqiang.liu@163.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Sohil Mehta <sohil.mehta@intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Thomas Zimemrmann <tzimmermann@suse.de>
Cc: Thorsten Blum <thorsten.blum@linux.dev>
Cc: Ville Syrjala <ville.syrjala@linux.intel.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Yicong Yang <yangyicong@hisilicon.com>
Cc: Yunhui Cui <cuiyunhui@bytedance.com>
Cc: Yury Norov (NVIDIA) <yury.norov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13 17:32:51 -07:00
Jinchao Wang 33effbcaf1 crash_core: use panic_try_start() in crash_kexec()
crash_kexec() had its own code to exclude parallel execution by setting
panic_cpu.  This is already handled by panic_try_start().  Switch to
panic_try_start() to remove the duplication and keep the logic consistent.

Link: https://lkml.kernel.org/r/20250825022947.1596226-4-wangjinchao600@gmail.com
Signed-off-by: Jinchao Wang <wangjinchao600@gmail.com>
Cc: Anna Schumaker <anna.schumaker@oracle.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dave Young <dyoung@redhat.com>
Cc: Doug Anderson <dianders@chromium.org>
Cc: "Guilherme G. Piccoli" <gpiccoli@igalia.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Joanthan Cameron <Jonathan.Cameron@huawei.com>
Cc: Joel Granados <joel.granados@kernel.org>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Kees Cook <kees@kernel.org>
Cc: Li Huafei <lihuafei1@huawei.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Luo Gengkun <luogengkun@huaweicloud.com>
Cc: Max Kellermann <max.kellermann@ionos.com>
Cc: Nam Cao <namcao@linutronix.de>
Cc: oushixiong <oushixiong@kylinos.cn>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Qianqiang Liu <qianqiang.liu@163.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Sohil Mehta <sohil.mehta@intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Thomas Zimemrmann <tzimmermann@suse.de>
Cc: Thorsten Blum <thorsten.blum@linux.dev>
Cc: Ville Syrjala <ville.syrjala@linux.intel.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Yicong Yang <yangyicong@hisilicon.com>
Cc: Yunhui Cui <cuiyunhui@bytedance.com>
Cc: Yury Norov (NVIDIA) <yury.norov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13 17:32:51 -07:00
Jinchao Wang d0d9c72355 panic: introduce helper functions for panic state
Patch series "panic: introduce panic status function family", v2.

This series introduces a family of helper functions to manage panic state
and updates existing code to use them.

Before this series, panic state helpers were scattered and inconsistent. 
For example, panic_in_progress() was defined in printk/printk.c, not in
panic.c or panic.h.  As a result, developers had to look in unexpected
places to understand or re-use panic state logic.  Other checks were open-
coded, duplicating logic across panic, crash, and watchdog paths.

The new helpers centralize the functionality in panic.c/panic.h:
  - panic_try_start()
  - panic_reset()
  - panic_in_progress()
  - panic_on_this_cpu()
  - panic_on_other_cpu()

Patches 1–8 add the helpers and convert panic/crash and printk/nbcon
code to use them.

Patch 9 fixes a bug in the watchdog subsystem by skipping checks when a
panic is in progress, avoiding interference with the panic CPU.

Together, this makes panic state handling simpler, more discoverable, and
more robust.


This patch (of 9):

This patch introduces four new helper functions to abstract the management
of the panic_cpu variable.  These functions will be used in subsequent
patches to refactor existing code.

The direct use of panic_cpu can be error-prone and ambiguous, as it
requires manual checks to determine which CPU is handling the panic.  The
new helpers clarify intent:

panic_try_start():
Atomically sets the current CPU as the panicking CPU.

panic_reset():
Resets panic_cpu to PANIC_CPU_INVALID.

panic_in_progress():
Checks if a panic has been triggered.

panic_on_this_cpu():
Returns true if the current CPU is the panic originator.

panic_on_other_cpu():
Returns true if a panic is in progress on another CPU.

This change lays the groundwork for improved code readability
and robustness in the panic handling subsystem.
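
A minimal sketch of what these helpers look like in terms of the existing
panic_cpu atomic (illustrative only; the exact implementation lives in
panic.c/panic.h):

    static inline bool panic_try_start(void)
    {
            int old_cpu = PANIC_CPU_INVALID;

            /* Claim panic_cpu for this CPU; fails if another CPU won. */
            return atomic_try_cmpxchg(&panic_cpu, &old_cpu,
                                      raw_smp_processor_id());
    }

    static inline void panic_reset(void)
    {
            atomic_set(&panic_cpu, PANIC_CPU_INVALID);
    }

    static inline bool panic_in_progress(void)
    {
            return unlikely(atomic_read(&panic_cpu) != PANIC_CPU_INVALID);
    }

    static inline bool panic_on_this_cpu(void)
    {
            return atomic_read(&panic_cpu) == raw_smp_processor_id();
    }

    static inline bool panic_on_other_cpu(void)
    {
            int cpu = atomic_read(&panic_cpu);

            return cpu != PANIC_CPU_INVALID && cpu != raw_smp_processor_id();
    }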

Link: https://lkml.kernel.org/r/20250825022947.1596226-1-wangjinchao600@gmail.com
Link: https://lkml.kernel.org/r/20250825022947.1596226-2-wangjinchao600@gmail.com
Signed-off-by: Jinchao Wang <wangjinchao600@gmail.com>
Cc: Anna Schumaker <anna.schumaker@oracle.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dave Young <dyoung@redhat.com>
Cc: Doug Anderson <dianders@chromium.org>
Cc: "Guilherme G. Piccoli" <gpiccoli@igalia.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Joanthan Cameron <Jonathan.Cameron@huawei.com>
Cc: Joel Granados <joel.granados@kernel.org>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Kees Cook <kees@kernel.org>
Cc: Li Huafei <lihuafei1@huawei.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Luo Gengkun <luogengkun@huaweicloud.com>
Cc: Max Kellermann <max.kellermann@ionos.com>
Cc: Nam Cao <namcao@linutronix.de>
Cc: oushixiong <oushixiong@kylinos.cn>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Qianqiang Liu <qianqiang.liu@163.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Sohil Mehta <sohil.mehta@intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Thomas Zimemrmann <tzimmermann@suse.de>
Cc: Thorsten Blum <thorsten.blum@linux.dev>
Cc: Ville Syrjala <ville.syrjala@linux.intel.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Yicong Yang <yangyicong@hisilicon.com>
Cc: Yunhui Cui <cuiyunhui@bytedance.com>
Cc: Yury Norov (NVIDIA) <yury.norov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13 17:32:51 -07:00
Petr Mladek e40d2014b2 panic: clean up message about deprecated 'panic_print' parameter
Remove duplication of the message about deprecated 'panic_print'
parameter.

Also make the wording more direct.  Make it clear that the new parameters
already exist and should be used instead.

Link: https://lkml.kernel.org/r/20250825025701.81921-5-feng.tang@linux.alibaba.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Feng Tang <feng.tang@linux.alibaba.com>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Tested-by: Lance Yang <lance.yang@linux.dev>
Reviewed-by: Feng Tang <feng.tang@linux.alibaba.com>
Cc: Askar Safin <safinaskar@zohomail.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13 17:32:50 -07:00
Feng Tang 2683df6539 panic: add note that 'panic_print' parameter is deprecated
Just like for the 'panic_print' sysctl interface, add a similar note for
the kernel cmdline parameter setup and for the parameter under
/sys/module/kernel/.

Also add the __core_param_cb() macro, which makes it possible to add
special get/set operations for a kernel parameter.

Link: https://lkml.kernel.org/r/20250825025701.81921-4-feng.tang@linux.alibaba.com
Signed-off-by: Feng Tang <feng.tang@linux.alibaba.com>
Suggested-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Cc: Askar Safin <safinaskar@zohomail.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13 17:32:50 -07:00
Liao Yuanhong 13818f7b8c kexec_core: remove redundant 0 value initialization
The kimage struct is already zeroed by kzalloc(). It's redundant to
initialize image->head to 0.

Link: https://lkml.kernel.org/r/20250825123307.306634-1-liaoyuanhong@vivo.com
Signed-off-by: Liao Yuanhong <liaoyuanhong@vivo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13 17:32:49 -07:00
Oleg Nesterov f7071db2fe fork: kill the pointless lower_32_bits() in create_io_thread(), kernel_thread(), and user_mode_thread()
Unlike sys_clone(), these helpers have only in-kernel users, which should
pass the correct "flags" argument.  lower_32_bits(flags) just adds
unnecessary confusion and doesn't allow the use of CLONE_ flags which
don't fit into 32 bits.

create_io_thread() looks especially confusing because:

	- "flags" is a compile-time constant, so lower_32_bits() simply
	  has no effect

	- .exit_signal = (lower_32_bits(flags) & CSIGNAL) is harmless but
	  doesn't look right, copy_process(CLONE_THREAD) will ignore this
	  argument anyway.

None of these helpers actually need CLONE_UNTRACED or "& ~CSIGNAL", but
their presence does not add any confusion and improves code clarity.
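
For kernel_thread(), for example, the shape of the change is (a sketch;
the actual patch may also simplify the CSIGNAL handling):

    /* Before */
    .flags          = ((lower_32_bits(flags) | CLONE_VM |
                        CLONE_UNTRACED) & ~CSIGNAL),
    .exit_signal    = (lower_32_bits(flags) & CSIGNAL),

    /* After */
    .flags          = ((flags | CLONE_VM | CLONE_UNTRACED) & ~CSIGNAL),
    .exit_signal    = (flags & CSIGNAL),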

Link: https://lkml.kernel.org/r/20250820163946.GA18549@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Kees Cook <kees@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13 17:32:49 -07:00
Tio Zhang b32730e68d fork: remove #ifdef CONFIG_LOCKDEP in copy_process()
lockdep_init_task() is defined as a no-op when CONFIG_LOCKDEP is not set,
so the #ifdef here is redundant; remove it.

Link: https://lkml.kernel.org/r/20250820101826.GA2484@didi-ThinkCentre-M930t-N000
Signed-off-by: Tio Zhang <tiozhang@didiglobal.com>
Cc: Kees Cook <kees@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13 17:32:48 -07:00
ZhenguoYao 95f091274f watchdog/softlockup: fix incorrect CPU utilization output during softlockup
Since we use 16-bit precision, the raw data will undergo integer division,
which may sometimes result in data loss.  This can lead to slightly
inaccurate CPU utilization calculations.  Under normal circumstances, this
isn't an issue.  However, when CPU utilization reaches 100%, the
calculated result might exceed 100%.  For example, with raw data like the
following:

sample_period 400000134 new_stat 83648414036 old_stat 83247417494

sample_period=400000134/2^24=23
new_stat=83648414036/2^24=4985
old_stat=83247417494/2^24=4961
util=105%

Below log will output:

CPU#3 Utilization every 0s during lockup:
    #1:   0% system,          0% softirq,   105% hardirq,     0% idle
    #2:   0% system,          0% softirq,   105% hardirq,     0% idle
    #3:   0% system,          0% softirq,   100% hardirq,     0% idle
    #4:   0% system,          0% softirq,   105% hardirq,     0% idle
    #5:   0% system,          0% softirq,   105% hardirq,     0% idle

To avoid confusion, we enforce a 100% display cap when calculations exceed
this threshold.

We also round to the nearest multiple of 16.8 milliseconds to improve the
accuracy.
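
One way to implement the two tweaks (get_16bit_precision() is the helper
named in the fixup note below; the clamp helper is illustrative, the real
code adjusts the value in place in kernel/watchdog.c):

    /* Round to the nearest 2^24 ns (~16.8 ms) unit instead of truncating. */
    static u16 get_16bit_precision(u64 data_ns)
    {
            return (data_ns + (1 << 23)) >> 24;
    }

    /* Never display more than 100%, even if rounding overshoots. */
    static int cap_util(int util)
    {
            return min(util, 100);
    }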

[yaozhenguo1@gmail.com: make get_16bit_precision() more accurate, fix comment layout]
  Link: https://lkml.kernel.org/r/20250818081438.40540-1-yaozhenguo@jd.com
Link: https://lkml.kernel.org/r/20250812082510.32291-1-yaozhenguo@jd.com
Signed-off-by: ZhenguoYao <yaozhenguo1@gmail.com>
Cc: Bitao Hu <yaoma@linux.alibaba.com>
Cc: Li Huafei <lihuafei1@huawei.com>
Cc: Max Kellermann <max.kellermann@ionos.com>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13 17:32:47 -07:00
ZhenguoYao 41f88ddfd4 watchdog/softlockup: fix wrong output when watchdog_thresh < 3
When watchdog_thresh is below 3, sample_period will be less than 1 second,
so the following output is printed on a softlockup:

CPU#3 Utilization every 0s during lockup

Fix this by changing time unit from seconds to milliseconds.

Link: https://lkml.kernel.org/r/20250812074132.27810-1-yaozhenguo@jd.com
Signed-off-by: ZhenguoYao <yaozhenguo1@gmail.com>
Cc: Bitao Hu <yaoma@linux.alibaba.com>
Cc: Li Huafei <lihuafei1@huawei.com>
Cc: Max Kellermann <max.kellermann@ionos.com>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13 17:32:47 -07:00
Soham Bagchi c2fe368b6e kcov: use write memory barrier after memcpy() in kcov_move_area()
KCOV Remote uses two separate memory buffers, one private to the kernel
space (kcov_remote_areas) and the second one shared between user and
kernel space (kcov->area).  After every pair of kcov_remote_start() and
kcov_remote_stop(), the coverage data collected in the kcov_remote_areas
is copied to kcov->area so the user can read the collected coverage data. 
This memcpy() is located in kcov_move_area().

The load/store pattern on the kernel-side [1] is:

```
/* dst_area === kcov->area, dst_area[0] is where the count is stored */
dst_len = READ_ONCE(*(unsigned long *)dst_area);
...
memcpy(dst_entries, src_entries, ...);
...
WRITE_ONCE(*(unsigned long *)dst_area, dst_len + entries_moved);
```

And for the user [2]:

```
/* cover is equivalent to kcov->area */
n = __atomic_load_n(&cover[0], __ATOMIC_RELAXED);
```

Without a write memory barrier, the atomic load on the user side can
potentially read a fresh value of the count stored at cover[0], but
continue to read stale coverage data from the buffer itself.  Hence, we
recommend adding a write memory barrier between the memcpy() and the
WRITE_ONCE() in kcov_move_area().
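
A sketch of the recommended ordering, using the same names as the
kernel-side snippet above:

```
memcpy(dst_entries, src_entries, ...);
smp_wmb();	/* publish the copied entries before the new count */
WRITE_ONCE(*(unsigned long *)dst_area, dst_len + entries_moved);
```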

Link: https://lkml.kernel.org/r/20250728184318.1839137-1-soham.bagchi@utah.edu
Link: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/kernel/kcov.c?h=master#n978 [1]
Link: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/Documentation/dev-tools/kcov.rst#n364 [2]
Signed-off-by: Soham Bagchi <soham.bagchi@utah.edu>
Reviewed-by: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13 17:32:44 -07:00
Brian Mak f367474b58 x86/kexec: carry forward the boot DTB on kexec
Currently, the kexec_file_load syscall on x86 does not support passing a
device tree blob to the new kernel.  Some embedded x86 systems use device
trees.  On these systems, failing to pass a device tree to the new kernel
causes a boot failure.

To add support for this, we follow the behavior of ARM64 and PowerPC and
carry the current boot's device tree blob forward for use in the new
kernel.  We do
this on x86 by passing the device tree blob as a setup_data entry in
accordance with the x86 boot protocol.

This behavior is gated behind the KEXEC_FILE_FORCE_DTB flag.

Link: https://lkml.kernel.org/r/20250805211527.122367-3-makb@juniper.net
Signed-off-by: Brian Mak <makb@juniper.net>
Cc: Alexander Graf <graf@amazon.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Dave Young <dyoung@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Rob Herring <robh@kernel.org>
Cc: Saravana Kannan <saravanak@google.com>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13 17:32:43 -07:00
Masami Hiramatsu (Google) 1440648c0f hung_task: dump blocker task if it is not hung
Dump the lock blocker task only if it is not hung itself, because if the
blocker task is also hung, it will be dumped by the detector anyway.  This
de-duplicates identical stack dumps when the blocker task is itself
blocked by another task (and hung).

Link: https://lkml.kernel.org/r/175391351423.688839.11917911323784986774.stgit@devnote2
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Suggested-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Tested-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Lance Yang <lance.yang@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13 17:32:43 -07:00
Pratyush Yadav e76e09bdf9 kho: make sure kho_scratch argument is fully consumed
When specifying fixed sized scratch areas, the parser only parses the
three scratch sizes and ignores the rest of the argument.  This means the
argument can have any bogus trailing characters.

For example, "kho_scratch=256M,512M,512Mfoobar" results in successful
parsing:

    [    0.000000] KHO: scratch areas: lowmem: 256MiB global: 512MiB pernode: 512MiB

It is generally a good idea to parse arguments as strictly as possible. 
In addition, if bogus trailing characters are allowed in the kho_scratch
argument, it is possible that some people might end up using them and
later extensions to the argument format will cause unexpected breakages.

Make sure the argument is fully consumed after all three scratch sizes are
parsed.  With this change, the bogus argument
"kho_scratch=256M,512M,512Mfoobar" results in:

    [    0.000000] Malformed early option 'kho_scratch'
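
One way to enforce this in the early parameter handler (a sketch with
illustrative variable names; memparse() already returns the position just
past the parsed size):

    /* p points just past the second comma in the kho_scratch string */
    scratch_size_pernode = memparse(p, &p);

    /* reject any trailing junk such as "...512Mfoobar" */
    if (*p)
            return -EINVAL;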

Link: https://lkml.kernel.org/r/20250826123817.64681-1-pratyush@kernel.org
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Changyuan Lyu <changyuanl@google.com>
Cc: Pratyush Yadav <pratyush@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13 16:55:18 -07:00
David Hildenbrand 9dc21bbd62 prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE
Patch series "prctl: extend PR_SET_THP_DISABLE to only provide THPs when
advised", v5.

This will allow individual processes to opt-out of THP = "always" into THP
= "madvise", without affecting other workloads on the system.  This has
been extensively discussed on the mailing list and has been summarized
very well by David in the first patch which also includes the links to
alternatives, please refer to the first patch commit message for the
motivation for this series.

Patch 1 adds the PR_THP_DISABLE_EXCEPT_ADVISED flag to implement this,
along with the MMF changes.

Patch 2 is a cleanup patch for tva_flags that will allow the forced
collapse case to be transmitted to vma_thp_disabled (which is done in
patch 3).

Patch 4 adds documentation for PR_SET_THP_DISABLE/PR_GET_THP_DISABLE.

Patches 6-7 implement the selftests for PR_SET_THP_DISABLE for completely
disabling THPs (old behaviour) and only enabling it at advise
(PR_THP_DISABLE_EXCEPT_ADVISED).


This patch (of 7):

People want to make use of more THPs, for example, moving from the "never"
system policy to "madvise", or from "madvise" to "always".

While this is great news for every THP desperately waiting to get
allocated out there, apparently there are some workloads that require a
bit of care during that transition: individual processes may need to
opt-out from this behavior for various reasons, and this should be
permitted without needing to make all other workloads on the system
similarly opt-out.

The following scenarios are imaginable:

(1) Switch from "never" system policy to "madvise"/"always", but keep THPs
    disabled for selected workloads.

(2) Stay at "never" system policy, but enable THPs for selected
    workloads, making only these workloads use the "madvise" or "always"
    policy.

(3) Switch from "madvise" system policy to "always", but keep the
    "madvise" policy for selected workloads: allocate THPs only when
    advised.

(4) Stay at "madvise" system policy, but enable THPs even when not advised
    for selected workloads -- "always" policy.

One can emulate (2) through (1) by setting the system policy to
"madvise"/"always" while disabling THPs for all processes that don't want
THPs.  It requires configuring all workloads, but that is a user-space
problem to sort out.

(4) can be emulated through (3) in a similar way.

Back when (1) was relevant in the past, as people started enabling THPs,
we added PR_SET_THP_DISABLE, so relevant workloads that were not ready yet
(i.e., used by Redis) were able to just disable THPs completely.  Redis
still implements the option to use this interface to disable THPs
completely.

With PR_SET_THP_DISABLE, we added a way to force-disable THPs for a
workload -- a process, including fork+exec'ed process hierarchy.  That
essentially made us support (1): simply disable THPs for all workloads
that are not ready for THPs yet, while still enabling THPs system-wide.

The quest for handling (3) and (4) started, but current approaches
(completely new prctl, options to set other policies per process,
alternatives to prctl -- mctrl, cgroup handling) don't look particularly
promising.  Likely, the future will use bpf or something similar to
implement better policies, in particular to also make better decisions
about THP sizes to use, but this will certainly take a while as that work
just started.

Long story short: a simple enable/disable is not really suitable for the
future, so we're not willing to add completely new toggles.

While we could emulate (3)+(4) through (1)+(2) by simply disabling THPs
completely for these processes, this is a step backwards, because these
processes can no longer allocate THPs in regions where THPs were
explicitly advised: regions flagged as VM_HUGEPAGE.  Apparently, that
poses a problem for relevant workloads, because "no THPs" is certainly
worse than "THPs only when advised".

Could we simply relax PR_SET_THP_DISABLE to "disable THPs unless
explicitly advised by the app through MADV_HUGEPAGE"?  *Maybe*, but this
would change the documented semantics quite a bit, as well as the
versatility of using it for debugging purposes, so I am not 100% sure that
is what we want -- although it would certainly be much easier.

So instead, as an easy way forward for (3) and (4), add an option to
make PR_SET_THP_DISABLE disable *less* THPs for a process.

In essence, this patch:

(A) Adds PR_THP_DISABLE_EXCEPT_ADVISED, to be used as a flag in arg3
    of prctl(PR_SET_THP_DISABLE) when disabling THPs (arg2 != 0).

    prctl(PR_SET_THP_DISABLE, 1, PR_THP_DISABLE_EXCEPT_ADVISED).

(B) Makes prctl(PR_GET_THP_DISABLE) return 3 if
    PR_THP_DISABLE_EXCEPT_ADVISED was set while disabling.

    Previously, it would return 1 if THPs were disabled completely. Now
    it returns the set flags as well: 3 if PR_THP_DISABLE_EXCEPT_ADVISED
    was set.

(C) Renames MMF_DISABLE_THP to MMF_DISABLE_THP_COMPLETELY, to express
    the semantics clearly.

    Fortunately, there are only two instances outside of prctl() code.

(D) Adds MMF_DISABLE_THP_EXCEPT_ADVISED to express "no THP except for VMAs
    with VM_HUGEPAGE" -- essentially "thp=madvise" behavior

    Fortunately, we only have to extend vma_thp_disabled().

(E) Indicates "THP_enabled: 0" in /proc/pid/status only if THPs are
    disabled completely

    Only indicating that THPs are disabled when they are really disabled
    completely, not only partially.

    For now, we don't add another interface to query whether THPs
    are disabled partially (PR_THP_DISABLE_EXCEPT_ADVISED was set). If
    ever required, we could add a new entry.

The documented semantics in the man page for PR_SET_THP_DISABLE "is
inherited by a child created via fork(2) and is preserved across
execve(2)" is maintained.  This behavior, for example, allows for
disabling THPs for a workload through the launching process (e.g., systemd
where we fork() a helper process to then exec()).

For now, MADV_COLLAPSE will *fail* in regions without VM_HUGEPAGE and
VM_NOHUGEPAGE.  As MADV_COLLAPSE is a clear advise that user space thinks
a THP is a good idea, we'll enable that separately next (requiring a bit
of cleanup first).

There is currently no way to prevent a process from issuing
PR_SET_THP_DISABLE itself to re-enable THP.  There are no real known
users for re-enabling it, and it's against the purpose of the original
interface.  So if ever required, we could investigate just forbidding
re-enabling, or making this somehow configurable.
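
A minimal userspace sketch of the new mode described in (A) and (B) above
(it assumes uapi headers that define PR_THP_DISABLE_EXCEPT_ADVISED):

    #include <err.h>
    #include <stdio.h>
    #include <sys/prctl.h>

    int main(void)
    {
            /* Disable THPs for this process (and future children), except
             * in VMAs where THPs are explicitly advised (MADV_HUGEPAGE). */
            if (prctl(PR_SET_THP_DISABLE, 1, PR_THP_DISABLE_EXCEPT_ADVISED, 0, 0))
                    err(1, "PR_SET_THP_DISABLE");

            /* Reports 3: disabled + the except-advised flag. */
            printf("THP disable mode: %d\n",
                   prctl(PR_GET_THP_DISABLE, 0, 0, 0, 0));
            return 0;
    }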

Link: https://lkml.kernel.org/r/20250815135549.130506-1-usamaarif642@gmail.com
Link: https://lkml.kernel.org/r/20250815135549.130506-2-usamaarif642@gmail.com
Acked-by: Zi Yan <ziy@nvidia.com>
Acked-by: Usama Arif <usamaarif642@gmail.com>
Tested-by: Usama Arif <usamaarif642@gmail.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yafang <laoar.shao@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13 16:55:05 -07:00
Lorenzo Stoakes d14d3f535e mm: convert remaining users to mm_flags_*() accessors
As part of the effort to move to mm->flags becoming a bitmap field,
convert existing users to making use of the mm_flags_*() accessors which
will, when the conversion is complete, be the only means of accessing
mm_struct flags.

No functional change intended.
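
A sketch of what such a conversion looks like (the accessor name and
argument order are assumed here to be of the mm_flags_test(flag, mm) form;
the exact helpers are introduced earlier in this series):

    bool disabled;

    /* Before: direct bit operation on the mm->flags word */
    disabled = test_bit(MMF_DISABLE_THP, &mm->flags);

    /* After: go through the bitmap accessor */
    disabled = mm_flags_test(MMF_DISABLE_THP, mm);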

Link: https://lkml.kernel.org/r/cc67a56f9a8746a8ec7d9791853dc892c1c33e0b.1755012943.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Marc Rutland <mark.rutland@arm.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Namhyung kim <namhyung@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: xu xin <xu.xin16@zte.com.cn>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13 16:54:58 -07:00
Lorenzo Stoakes 19148a19da mm: update fork mm->flags initialisation to use bitmap
We now need to account for flag initialisation on fork.  We retain the
existing logic as much as we can, but dub the existing flag mask legacy.

These flags are therefore required to fit in the first 32 bits of the
flags field.

However, further flag propagation upon fork can be implemented in
mm_init() on a per-flag basis.

We ensure we clear the entire bitmap prior to setting it, and use
__mm_flags_get_word() and __mm_flags_set_word() to manipulate these legacy
fields efficiently.

Link: https://lkml.kernel.org/r/9fb8954a7a0f0184f012a8e66f8565bcbab014ba.1755012943.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Marc Rutland <mark.rutland@arm.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Namhyung kim <namhyung@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: xu xin <xu.xin16@zte.com.cn>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13 16:54:57 -07:00
Lorenzo Stoakes c0951573e0 mm: convert uprobes to mm_flags_*() accessors
As part of the effort to move to mm->flags becoming a bitmap field,
convert existing users to making use of the mm_flags_*() accessors which
will, when the conversion is complete, be the only means of accessing
mm_struct flags.

No functional change intended.

Link: https://lkml.kernel.org/r/1d4fe5963904cc0c707da1f53fbfe6471d3eff10.1755012943.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Marc Rutland <mark.rutland@arm.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Namhyung kim <namhyung@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: xu xin <xu.xin16@zte.com.cn>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13 16:54:57 -07:00
Lorenzo Stoakes 879d0d9954 mm: convert prctl to mm_flags_*() accessors
As part of the effort to move to mm->flags becoming a bitmap field,
convert existing users to making use of the mm_flags_*() accessors which
will, when the conversion is complete, be the only means of accessing
mm_struct flags.

No functional change intended.

Link: https://lkml.kernel.org/r/b64f07b94822d02beb88d0d21a6a85f9ee45fc69.1755012943.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Marc Rutland <mark.rutland@arm.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Namhyung kim <namhyung@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: xu xin <xu.xin16@zte.com.cn>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13 16:54:56 -07:00
Mike Rapoport (Microsoft) be564840bb kho: allow scratch areas with zero size
Patch series "kho: fixes and cleanups", v3.

These are small KHO and KHO test fixes and cleanups.


This patch (of 3):

Parsing of the kho_scratch parameter treats a zero size as an invalid
value, although it should be fine for the user to request a zero-sized
scratch area for some types of scratch memory, for example when there is
no need to create a scratch area in low memory.

Treat zero as a valid value for a scratch area size, but reject a
kho_scratch parameter that defines no scratch memory at all.

Link: https://lkml.kernel.org/r/20250811082510.4154080-1-rppt@kernel.org
Link: https://lkml.kernel.org/r/20250811082510.4154080-2-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Changyuan Lyu <changyuanl@google.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13 16:54:54 -07:00
Ye Liu 79e1c24285 mm: replace (20 - PAGE_SHIFT) with common macros for pages<->MB conversion
Replace repeated (20 - PAGE_SHIFT) calculations with standard macros:
- MB_TO_PAGES(mb)    converts MB to page count
- PAGES_TO_MB(pages) converts pages to MB

No functional change.
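
The macros are simple shift wrappers (a sketch; see the patch for the
exact definitions and the header they land in):

    #define MB_TO_PAGES(mb)         ((mb) << (20 - PAGE_SHIFT))
    #define PAGES_TO_MB(pages)      ((pages) >> (20 - PAGE_SHIFT))

    unsigned long nr_pages, size_mb = 512;

    /* Before */
    nr_pages = size_mb << (20 - PAGE_SHIFT);
    /* After */
    nr_pages = MB_TO_PAGES(size_mb);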

[akpm@linux-foundation.org: remove arc's private PAGES_TO_MB, remove its unused PAGES_TO_KB]
[akpm@linux-foundation.org: don't include mm.h due to include file ordering mess]
Link: https://lkml.kernel.org/r/20250718024134.1304745-1-ye.liu@linux.dev
Signed-off-by: Ye Liu <liuye@kylinos.cn>
Acked-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Chris Li <chrisl@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Kairui Song <kasong@tencent.com>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Lai jiangshan <jiangshanlai@gmail.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mel Gorman <mgorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13 16:54:42 -07:00
Ruan Shiyang 337135e612 mm: memory-tiering: fix PGPROMOTE_CANDIDATE counting
Goto-san reported confusing pgpromote statistics where the
pgpromote_success count significantly exceeded pgpromote_candidate.

On a system with three nodes (nodes 0-1: DRAM 4GB, node 2: NVDIMM 4GB):
 # Enable demotion only
 echo 1 > /sys/kernel/mm/numa/demotion_enabled
 numactl -m 0-1 memhog -r200 3500M >/dev/null &
 pid=$!
 sleep 2
 numactl memhog -r100 2500M >/dev/null &
 sleep 10
 kill -9 $pid # terminate the 1st memhog
 # Enable promotion
 echo 2 > /proc/sys/kernel/numa_balancing

After a few seconds, we observed `pgpromote_candidate < pgpromote_success`
$ grep -e pgpromote /proc/vmstat
pgpromote_success 2579
pgpromote_candidate 0

In this scenario, after terminating the first memhog, the conditions for
pgdat_free_space_enough() are quickly met, which triggers promotion.
However, these migrated pages are only counted in PGPROMOTE_SUCCESS,
not in PGPROMOTE_CANDIDATE.

To resolve these confusing statistics, introduce PGPROMOTE_CANDIDATE_NRL
to count the missed promotion pages.  These pages are deliberately not
counted in PGPROMOTE_CANDIDATE to avoid changing the existing promotion
rate-limit algorithm or its performance.

Link: https://lkml.kernel.org/r/20250901090122.124262-1-ruansy.fnst@fujitsu.com
Link: https://lkml.kernel.org/r/20250729035101.1601407-1-ruansy.fnst@fujitsu.com
Fixes: c6833e1000 ("memory tiering: rate limit NUMA migration throughput")
Co-developed-by: Li Zhijian <lizhijian@fujitsu.com>
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Signed-off-by: Ruan Shiyang <ruansy.fnst@fujitsu.com>
Reported-by: Yasunori Gotou (Fujitsu) <y-goto@fujitsu.com>
Suggested-by: Huang Ying <ying.huang@linux.alibaba.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Huang Ying <ying.huang@linux.alibaba.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13 16:54:42 -07:00
Thomas Gleixner 6eb350a223 rseq: Protect event mask against membarrier IPI
rseq_need_restart() reads and clears task::rseq_event_mask with preemption
disabled to guard against the scheduler.

But membarrier() uses an IPI and sets the PREEMPT bit in the event mask
from the IPI, which leaves that RMW operation unprotected.

Use guard(irq) if CONFIG_MEMBARRIER is enabled to fix that.

Fixes: 2a36ab717e ("rseq/membarrier: Add MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Boqun Feng <boqun.feng@gmail.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: stable@vger.kernel.org
2025-09-13 19:51:59 +02:00
Jakub Kicinski fc3a281041 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.17-rc6).

Conflicts:

net/netfilter/nft_set_pipapo.c
net/netfilter/nft_set_pipapo_avx2.c
  c4eaca2e10 ("netfilter: nft_set_pipapo: don't check genbit from packetpath lookups")
  84c1da7b38 ("netfilter: nft_set_pipapo: use avx2 algorithm for insertions too")

Only trivial adjacent changes (in a doc and a Makefile).

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-11 17:40:13 -07:00
Puranjay Mohan 5c5240d020 bpf: Report arena faults to BPF stderr
Begin reporting arena page faults and the faulting address to the BPF
program's stderr.  This patch adds support in the arm64 and x86-64 JITs;
support for other archs can be added later.

The fault handlers receive the 32-bit address in the arena region, so
the upper 32 bits of user_vm_start are added to it before printing the
address.  This is what the user would expect to see, as it is what
bpf_printk() prints if you pass it an address returned by
bpf_arena_alloc_pages().

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20250911145808.58042-4-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-11 13:00:43 -07:00
Puranjay Mohan 70f23546d2 bpf: core: introduce main_prog_aux for stream access
BPF streams are only valid for main programs.  To make it easier to
access streams from subprogs, introduce main_prog_aux in struct
bpf_prog_aux.

prog->aux->main_prog_aux = prog->aux, for main programs and
prog->aux->main_prog_aux = main_prog->aux, for subprograms.

Make bpf_prog_find_from_stack() use the added main_prog_aux to return
the mainprog when a subprog is found on the stack.

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250911145808.58042-3-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-11 13:00:43 -07:00
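
A toy userspace model of the pointer layout described above; the struct is
purely illustrative and only borrows the field name from the commit text:

  #include <stdio.h>

  struct aux_model {
          struct aux_model *main_prog_aux;
  };

  int main(void)
  {
          struct aux_model main_aux, sub_aux;

          main_aux.main_prog_aux = &main_aux;  /* main program: points at itself   */
          sub_aux.main_prog_aux  = &main_aux;  /* subprogram: points at main's aux */

          /* Both resolve to the same aux, so streams can be looked up uniformly. */
          printf("same aux: %d\n", sub_aux.main_prog_aux == main_aux.main_prog_aux);
          return 0;
  }
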
Alexei Starovoitov 5d87e96a49 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf after rc5
Cross-merge BPF and other fixes after downstream PR.

No conflicts.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-11 09:34:37 -07:00
Linus Torvalds a1228f048a Power management fixes for 6.17-rc6
- Restore a pm_restrict_gfp_mask() call in hibernation_snapshot() that
    was removed incorrectly during the 6.16 development cycle (Rafael
    Wysocki)
 
  - Introduce a function for registering a perf domain without triggering
    a system-wide CPU capacity update and make the intel_pstate driver
    use it to avoid recurring unsuccessful attempts to update capacities
    of all CPUs in the system (Rafael Wysocki)
 
  - Fix setting of CPPC.min_perf in the active mode with performance
    governor in the amd-pstate driver to restore its expected behavior
    changed recently (Gautham Shenoy)
 
  - Avoid mistakenly setting EPP to 0 in the amd-pstate driver after
    system resume as a result of recent code changes (Mario Limonciello)
 -----BEGIN PGP SIGNATURE-----
 
 iQFGBAABCAAwFiEEcM8Aw/RY0dgsiRUR7l+9nS/U47UFAmjCw/YSHHJqd0Byand5
 c29ja2kubmV0AAoJEO5fvZ0v1OO1SP8H/3O8D4ZFg7CxwuTn5MofPH5BBAg3FkwB
 RhXZ3WA/qjz+0CusYBJO3hsJVIfUrDUzow47zi0H4tbdqqdI7CUbOPoWnGt/N2hd
 ngxL4m+t91XkngLi0eOorxFPQ1/dA1p0g5BHXrzVpuMdE94V3gxb92g3SPrvjcAF
 N6fVCL3RMQqDwl5ZbadvWfXdE+07nxwogKTF/NKa+DF3SHSy3SOznKgn/AlhQEo3
 RKqAhaO3+RxzfTn8M0ie/flYUFApkMbdLdxYau2Lg4Ne3MhrID3ljhssGEFaQdQ0
 8z3OGmfOOOLu21F1iYaZiWWXc8wB6v47NchJvuU1FF/JQ+uXdsZZdNA=
 =ZGx1
 -----END PGP SIGNATURE-----

Merge tag 'pm-6.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
 "These fix a nasty hibernation regression introduced during the 6.16
  cycle, an issue related to energy model management occurring on Intel
  hybrid systems where some CPUs are offline to start with, and two
  regressions in the amd-pstate driver:

   - Restore a pm_restrict_gfp_mask() call in hibernation_snapshot()
     that was removed incorrectly during the 6.16 development cycle
     (Rafael Wysocki)

   - Introduce a function for registering a perf domain without
     triggering a system-wide CPU capacity update and make the
     intel_pstate driver use it to avoid recurring unsuccessful
     attempts to update capacities of all CPUs in the system (Rafael
     Wysocki)

   - Fix setting of CPPC.min_perf in the active mode with performance
     governor in the amd-pstate driver to restore its expected behavior
     changed recently (Gautham Shenoy)

   - Avoid mistakenly setting EPP to 0 in the amd-pstate driver after
     system resume as a result of recent code changes (Mario
     Limonciello)"

* tag 'pm-6.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PM: hibernate: Restrict GFP mask in hibernation_snapshot()
  PM: EM: Add function for registering a PD without capacity update
  cpufreq/amd-pstate: Fix a regression leading to EPP 0 after resume
  cpufreq/amd-pstate: Fix setting of CPPC.min_perf in active mode for performance governor
2025-09-11 08:11:16 -07:00
Jinjie Ruan 3c973c51bf entry: Add arch_irqentry_exit_need_resched() for arm64
Compared to the generic entry code, ARM64 does additional checks
when deciding to reschedule on return from interrupt. So introduce
arch_irqentry_exit_need_resched() in the need_resched()
condition of the generic raw_irqentry_exit_cond_resched(), with
a NOP default. This will allow ARM64 to implement the architecture
specific version for switching over to the generic entry code.

Suggested-by: Ada Couprie Diaz <ada.coupriediaz@arm.com>
Suggested-by: Mark Rutland <mark.rutland@arm.com>
Suggested-by: Kevin Brodsky <kevin.brodsky@arm.com>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
2025-09-11 15:55:34 +01:00
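
A compilable userspace sketch of the hook pattern described above (a default
that keeps current behaviour, overridable by the architecture); the *_model
names are stand-ins, not the kernel's generic entry code:

  #include <stdbool.h>
  #include <stdio.h>

  /* NOP default: architectures that need extra checks provide their own. */
  #ifndef arch_irqentry_exit_need_resched
  static inline bool arch_irqentry_exit_need_resched(void)
  {
          return true;
  }
  #endif

  static bool need_resched_model = true;   /* stand-in for need_resched() */

  static void raw_irqentry_exit_cond_resched_model(void)
  {
          if (need_resched_model && arch_irqentry_exit_need_resched())
                  puts("reschedule on return from interrupt");
  }

  int main(void)
  {
          raw_irqentry_exit_cond_resched_model();
          return 0;
  }
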
Linus Torvalds 02ffd6f89c bpf-fixes
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmjB/6gACgkQ6rmadz2v
 bToSCg//Z8Q2ToTV/BOLTzFYLvcTm2YRlqqIe3SFxyxLQCIhC0kxQAT94baVQHky
 /6ASbPjDWXdGVHNoMopA6lpMx22Tq4xi6qO5fzJHDuSqh5KTi8l5/GyJeA3egPzD
 7RIvKvvgePpCx0xm9rm5O5vvUeFrsxhQPRRiN/fsOibiTJjBpRAJDp9k+pvnK6mb
 HaZcHF+In5Vg7XozuHAUMzsp+4njzdLrMXL2Q54o2MrIoeBg8/oAnhLujskGMnXK
 mgUA+skW42IEkw+TYUu9888/5PMDkto3BZIx0plcAIVAIvcU5BFzLt11llQswgVl
 q740k50oRKrmwHyEVDwugV7WeGQMks48lMHtLKytYmdEhdTfEYUKHeBpcI87fUYy
 IpOdSUT49nBxOmGl59ccBcdzsndTjo7Zrl7dMf4umN0SSjfdohwj0uu7rmZCaOdd
 m/TxH13Ae7na4QzVx0N911qxBYw07uYNiq3Ati+x327ySozvvNfLIYK/sS/clJkd
 lOpz3kpjwgV+PUfv2NBqEJm4nSPTtW7fiEQ8p/yBvK90nB6NnHIbq9a2rPBKeDKx
 RpkDB9nhJ1J6OMKWkUDasMiP5tAXt9RrI+la/CgBxMcP/G4HxH6yf22Mf3Hzhe8V
 UfjAgHqXvmrjqpgIbO0AVkfwDvlOM37DGvY0H3bMFeOXCk0DDw4=
 =09/9
 -----END PGP SIGNATURE-----

Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Pull bpf fixes from Alexei Starovoitov:
 "A number of fixes accumulated due to summer vacations

   - Fix out-of-bounds dynptr write in bpf_crypto_crypt() kfunc which
     was misidentified as a security issue (Daniel Borkmann)

   - Update the list of BPF selftests maintainers (Eduard Zingerman)

   - Fix selftests warnings with icecc compiler (Ilya Leoshkevich)

   - Disable XDP/cpumap direct return optimization (Jesper Dangaard
     Brouer)

   - Fix unexpected get_helper_proto() result in unusual configuration
     BPF_SYSCALL=y and BPF_EVENTS=n (Jiri Olsa)

   - Allow fallback to interpreter when JIT support is limited (KaFai
     Wan)

   - Fix rqspinlock and choose trylock fallback for NMI waiters. Pick
     the simplest fix. More involved fix is targeted bpf-next (Kumar
     Kartikeya Dwivedi)

   - Fix cleanup when tcp_bpf_send_verdict() fails to allocate
     psock->cork (Kuniyuki Iwashima)

   - Disallow bpf_timer in PREEMPT_RT for now. Proper solution is being
     discussed for bpf-next. (Leon Hwang)

   - Fix XSK cq descriptor production (Maciej Fijalkowski)

   - Tell memcg to use allow_spinning=false path in bpf_timer_init() to
     avoid lockup in cgroup_file_notify() (Peilin Ye)

   - Fix bpf_strnstr() to handle suffix match cases (Rong Tao)"

* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  selftests/bpf: Skip timer cases when bpf_timer is not supported
  bpf: Reject bpf_timer for PREEMPT_RT
  tcp_bpf: Call sk_msg_free() when tcp_bpf_send_verdict() fails to allocate psock->cork.
  bpf: Tell memcg to use allow_spinning=false path in bpf_timer_init()
  bpf: Allow fall back to interpreter for programs with stack size <= 512
  rqspinlock: Choose trylock fallback for NMI waiters
  xsk: Fix immature cq descriptor production
  bpf: Update the list of BPF selftests maintainers
  selftests/bpf: Add tests for bpf_strnstr
  selftests/bpf: Fix "expression result unused" warnings with icecc
  bpf: Fix bpf_strnstr() to handle suffix match cases better
  selftests/bpf: Extend crypto_sanity selftest with invalid dst buffer
  bpf: Fix out-of-bounds dynptr write in bpf_crypto_crypt
  bpf: Check the helper function is valid in get_helper_proto
  bpf, cpumap: Disable page_pool direct xdp_return need larger scope
2025-09-11 07:54:16 -07:00
Rafael J. Wysocki bddce1c7a5 Merge branches 'pm-sleep' and 'pm-em'
Merge a hibernation regression fix and a fix related to energy model
management for 6.17-rc6

* pm-sleep:
  PM: hibernate: Restrict GFP mask in hibernation_snapshot()

* pm-em:
  PM: EM: Add function for registering a PD without capacity update
2025-09-11 14:22:35 +02:00
Gerald Yang d2c7731593 audit: fix skb leak when audit rate limit is exceeded
When configuring a small audit rate limit in
/etc/audit/rules.d/audit.rules:
-a always,exit -F arch=b64 -S openat -S truncate -S ftruncate
-F exit=-EACCES -F auid>=1000 -F auid!=4294967295 -k access -r 100

And then repeatedly triggering permission denied as a normal user:
while :; do cat /proc/1/environ; done

We can see the messages in kernel log:
  [ 2531.862184] audit: rate limit exceeded

The unreclaimable slab objects start to leak quickly. With kmemleak
enabled, many call traces appear like:
unreferenced object 0xffff99144b13f600 (size 232):
  comm "cat", pid 1100, jiffies 4294739144
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace (crc 8540ec4f):
    kmemleak_alloc+0x4a/0x90
    kmem_cache_alloc_node+0x2ea/0x390
    __alloc_skb+0x174/0x1b0
    audit_log_start+0x198/0x3d0
    audit_log_proctitle+0x32/0x160
    audit_log_exit+0x6c6/0x780
    __audit_syscall_exit+0xee/0x140
    syscall_exit_work+0x12b/0x150
    syscall_exit_to_user_mode_prepare+0x39/0x80
    syscall_exit_to_user_mode+0x11/0x260
    do_syscall_64+0x8c/0x180
    entry_SYSCALL_64_after_hwframe+0x78/0x80

This shows that the skb allocated in audit_log_start() and queued
onto skb_list is never freed.

In audit_log_end(), each skb is dequeued from skb_list and passed
to __audit_log_end(). However, when the audit rate limit is exceeded,
__audit_log_end() simply prints "rate limit exceeded" and returns
without processing the skb. Since the skb is already removed from
skb_list, audit_buffer_free() cannot free it later, leading to a
memory leak.

Fix this by freeing the skb when the rate limit is exceeded.

Fixes: eb59d494ee ("audit: add record for multiple task security contexts")
Signed-off-by: Gerald Yang <gerald.yang@canonical.com>
[PM: fixes tag, subj tweak]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2025-09-10 19:55:00 -04:00
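
A userspace model of the leak class fixed above: once a buffer has been
taken off its queue, every exit path must either hand it off or free it.
All names and sizes are illustrative:

  #include <stdbool.h>
  #include <stdio.h>
  #include <stdlib.h>

  struct msg {
          char *data;
  };

  static bool rate_limited = true;   /* stand-in for audit_rate_check() failing */

  static void log_end(struct msg *m)
  {
          if (rate_limited) {
                  fprintf(stderr, "rate limit exceeded\n");
                  free(m->data);     /* the step the buggy path was missing */
                  free(m);
                  return;
          }
          /* ... otherwise the message would be handed to the transport ... */
  }

  int main(void)
  {
          struct msg *m = malloc(sizeof(*m));

          if (!m)
                  return 1;
          m->data = malloc(232);     /* mirrors the 232-byte object in the report */
          log_end(m);                /* no leak on the rate-limited path */
          return 0;
  }
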
Leon Hwang e25ddfb388 bpf: Reject bpf_timer for PREEMPT_RT
When CONFIG_PREEMPT_RT is enabled, the kernel warns when running the timer
selftests with './test_progs -t timer':

BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48

In order to avoid such a warning, reject bpf_timer in the verifier when
PREEMPT_RT is enabled.

Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/20250910125740.52172-2-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-10 12:34:09 -07:00
Linus Torvalds 1b5d4661c7 Tracing fixes for v6.17:
- Remove redundant __GFP_NOWARN flag in kmalloc
 
   As __GFP_NOWARN is now part of GFP_NOWAIT, it can be removed from kmalloc
   as it is redundant.
 
 - Use copy_from_user_nofault() instead of _inatomic() for trace markers
 
   The trace_marker files are written to in order to allow user space to quickly write
   into the tracing ring buffer. Back in 2016, the get_user_pages_fast() and
   the kmap() logic was replaced by a __copy_from_user_inatomic(). But the
   _inatomic() is somewhat a misnomer, as if the data being read faults, it can
   cause a schedule. This is not something you want to do in an atomic context.
   Since the time this was added, copy_from_user_nofault() was added which is
   what is actually needed here. Replace the inatomic() with the nofault().
 
 - Fix the assembly markup in the ftrace direct sample code
 
   The ftrace direct sample code (which is also used for selftests), had the
   size directive between the "leave" and the "ret" instead of after the ret.
   This caused objtool to think the code was unreachable.
 
 - Only call unregister_pm_notifier() on outer most fgraph registration
 
   There was an error path in register_ftrace_graph() that did not call
   unregister_pm_notifier() on error, so it was added in the error path.
   The problem with that fix, is that register_pm_notifier() is only called by
   the initial user of fgraph. If that succeeds, but another fgraph
   registration were to fail, then unregister_pm_notifier() would be called
   incorrectly.
 
 - Fix a crash in osnoise when zero size cpumask is passed in
 
   If a zero size CPU mask is passed in, the kmalloc() would return
   ZERO_SIZE_PTR which is not checked, and the code would continue thinking it
   had real memory and crash. If zero is passed in as the size of the write,
   simply return 0.
 
 - Fix possible warning in trace_pid_write()
 
   If while processing a series of numbers passed to the "set_event_pid" file,
   and one of the updates fails to allocate (triggered by a fault injection),
   it can cause a warning to trigger. Check the return value of the call to
   trace_pid_list_set() and break out early with an error code if it fails.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaMCL8RQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qnNIAQD9lMNjbpKLpTyZw49ZYyfieNPhFLJ/
 94FxNKBfmXnYIwD/cBL8KPo9CApbdk5fG8NO2BAM/AK2MJIBKdfMdnseLw4=
 =v8gJ
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Remove redundant __GFP_NOWARN flag in kmalloc

   As __GFP_NOWARN is now part of GFP_NOWAIT, it can be removed from
   kmalloc as it is redundant.

 - Use copy_from_user_nofault() instead of _inatomic() for trace markers

   The trace_marker files are written to in order to allow user space to quickly
   write into the tracing ring buffer.

   Back in 2016, the get_user_pages_fast() and the kmap() logic was
   replaced by a __copy_from_user_inatomic(), but didn't properly
   disable page faults around it.

   Since the time this was added, copy_from_user_nofault() was added
   which does the required page fault disabling for us.

 - Fix the assembly markup in the ftrace direct sample code

   The ftrace direct sample code (which is also used for selftests), had
   the size directive between the "leave" and the "ret" instead of after
   the ret. This caused objtool to think the code was unreachable.

 - Only call unregister_pm_notifier() on outer most fgraph registration

   There was an error path in register_ftrace_graph() that did not call
   unregister_pm_notifier() on error, so it was added in the error path.
   The problem with that fix, is that register_pm_notifier() is only
   called by the initial user of fgraph. If that succeeds, but another
   fgraph registration were to fail, then unregister_pm_notifier() would
   be called incorrectly.

 - Fix a crash in osnoise when zero size cpumask is passed in

   If a zero size CPU mask is passed in, the kmalloc() would return
   ZERO_SIZE_PTR which is not checked, and the code would continue
   thinking it had real memory and crash. If zero is passed in as the
   size of the write, simply return 0.

 - Fix possible warning in trace_pid_write()

   If while processing a series of numbers passed to the "set_event_pid"
   file, and one of the updates fails to allocate (triggered by a fault
   injection), it can cause a warning to trigger. Check the return value
   of the call to trace_pid_list_set() and break out early with an error
   code if it fails.

* tag 'trace-v6.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Silence warning when chunk allocation fails in trace_pid_write
  tracing/osnoise: Fix null-ptr-deref in bitmap_parselist()
  trace/fgraph: Fix error handling
  ftrace/samples: Fix function size computation
  tracing: Fix tracing_marker may trigger page fault during preempt_disable
  trace: Remove redundant __GFP_NOWARN
2025-09-10 12:03:47 -07:00
Rafael J. Wysocki 449c9c0253 PM: hibernate: Restrict GFP mask in hibernation_snapshot()
Commit 12ffc3b151 ("PM: Restrict swap use to later in the suspend
sequence") incorrectly removed a pm_restrict_gfp_mask() call from
hibernation_snapshot(), so memory allocations involving swap are not
prevented from being carried out in this code path any more which may
lead to serious breakage.

The symptoms of such breakage have become visible after adding a
shrink_shmem_memory() call to hibernation_snapshot() in commit
2640e81947 ("PM: hibernate: shrink shmem pages after dev_pm_ops.prepare()")
which caused this problem to be much more likely to manifest itself.

However, since commit 2640e81947 was initially present in the DRM
tree that did not include commit 12ffc3b151, the symptoms of this
issue were not visible until merge commit 260f6f4fda ("Merge tag
'drm-next-2025-07-30' of https://gitlab.freedesktop.org/drm/kernel")
that exposed it through an entirely reasonable merge conflict
resolution.

Fixes: 12ffc3b151 ("PM: Restrict swap use to later in the suspend sequence")
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220555
Reported-by: Todd Brandt <todd.e.brandt@linux.intel.com>
Tested-by: Todd Brandt <todd.e.brandt@linux.intel.com>
Cc: 6.16+ <stable@vger.kernel.org> # 6.16+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
2025-09-10 20:36:43 +02:00
Yi Tao 0568f89d4f cgroup: replace global percpu_rwsem with per threadgroup rwsem when writing to cgroup.procs
The static usage pattern of creating a cgroup, enabling controllers,
and then seeding it with CLONE_INTO_CGROUP doesn't require write
locking cgroup_threadgroup_rwsem and thus doesn't benefit from this
patch.

To avoid affecting other users, the per threadgroup rwsem is only used
when the favordynmods is enabled.

As computer hardware advances, modern systems are typically equipped
with many CPU cores and large amounts of memory, enabling the deployment
of numerous applications. On such systems, container creation and
deletion become frequent operations, making cgroup process migration no
longer a cold path. This leads to noticeable contention with common
process operations such as fork, exec, and exit.

To alleviate the contention between cgroup process migration and
operations like process fork, this patch takes the write lock on
signal_struct->group_rwsem when writing a pid to cgroup.procs/threads
instead of holding a global write lock.

Cgroup process migration has historically relied on
signal_struct->group_rwsem to protect thread group integrity. In commit
<1ed1328792ff> ("sched, cgroup: replace signal_struct->group_rwsem with
a global percpu_rwsem"), this was changed to a global
cgroup_threadgroup_rwsem. The advantage of using a global lock was
simplified handling of process group migrations. This patch retains the
use of the global lock for protecting process group migration, while
reducing contention by using per thread group lock during
cgroup.procs/threads writes.

The locking behavior is as follows:

write cgroup.procs/threads  | process fork,exec,exit | process group migration
------------------------------------------------------------------------------
cgroup_lock()               | down_read(&g_rwsem)    | cgroup_lock()
down_write(&p_rwsem)        | down_read(&p_rwsem)    | down_write(&g_rwsem)
critical section            | critical section       | critical section
up_write(&p_rwsem)          | up_read(&p_rwsem)      | up_write(&g_rwsem)
cgroup_unlock()             | up_read(&g_rwsem)      | cgroup_unlock()

g_rwsem denotes cgroup_threadgroup_rwsem, p_rwsem denotes
signal_struct->group_rwsem.

This patch eliminates contention between cgroup migration and fork
operations for threads that belong to different thread groups, thereby
reducing the long-tail latency of cgroup migrations and lowering system
load.

With this patch, under heavy fork and exec interference, the long-tail
latency of cgroup migration has been reduced from milliseconds to
microseconds. Under heavy cgroup migration interference, the multi-CPU
score of the spawn test case in UnixBench increased by 9%.

tj: Update comment in cgroup_favor_dynmods() and switch WARN_ONCE() to
    pr_warn_once().

Signed-off-by: Yi Tao <escape@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-10 07:44:51 -10:00
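
A toy model (compile with -pthread) of the locking scheme in the table
above; g_rwsem and p_rwsem mirror the names used there, everything else is
illustrative:

  #include <pthread.h>
  #include <stdio.h>

  static pthread_mutex_t  cgroup_mutex = PTHREAD_MUTEX_INITIALIZER;
  static pthread_rwlock_t g_rwsem = PTHREAD_RWLOCK_INITIALIZER; /* cgroup_threadgroup_rwsem    */
  static pthread_rwlock_t p_rwsem = PTHREAD_RWLOCK_INITIALIZER; /* signal_struct->group_rwsem  */

  /* write cgroup.procs/threads: only the target threadgroup is write-locked */
  static void write_cgroup_procs(void)
  {
          pthread_mutex_lock(&cgroup_mutex);
          pthread_rwlock_wrlock(&p_rwsem);
          /* ... migrate this threadgroup ... */
          pthread_rwlock_unlock(&p_rwsem);
          pthread_mutex_unlock(&cgroup_mutex);
  }

  /* fork/exec/exit: read side of both locks, so unrelated threadgroups no
   * longer contend with a single global writer */
  static void fork_exec_exit(void)
  {
          pthread_rwlock_rdlock(&g_rwsem);
          pthread_rwlock_rdlock(&p_rwsem);
          /* ... critical section ... */
          pthread_rwlock_unlock(&p_rwsem);
          pthread_rwlock_unlock(&g_rwsem);
  }

  int main(void)
  {
          fork_exec_exit();
          write_cgroup_procs();
          puts("ok");
          return 0;
  }
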
Yi Tao 477abc2ec8 cgroup: relocate cgroup_attach_lock within cgroup_procs_write_start
Later patches will introduce a new parameter `task` to
cgroup_attach_lock(); in preparation, adjust the position of
cgroup_attach_lock() within cgroup_procs_write_start().

Between obtaining the threadgroup leader via PID and acquiring the
cgroup attach lock, the threadgroup leader may change, which could lead
to incorrect cgroup migration. Therefore, after acquiring the cgroup
attach lock, we check whether the threadgroup leader has changed, and if
so, retry the operation.

tj: Minor comment adjustments.

Signed-off-by: Yi Tao <escape@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-10 07:31:09 -10:00
Yi Tao a1ffc8ad31 cgroup: refactor the cgroup_attach_lock code to make it clearer
Dynamic cgroup migration involving threadgroup locks can be in one of
two states: no lock held, or holding the global lock. Explicitly
declaring the different lock modes makes the code easier to
understand and facilitates future extensions of the lock modes.

Signed-off-by: Yi Tao <escape@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-10 07:26:15 -10:00
Rafael J. Wysocki e042354147 PM: EM: Add function for registering a PD without capacity update
The intel_pstate driver manages CPU capacity changes itself and it does
not need an update of the capacity of all CPUs in the system to be
carried out after registering a PD.

Moreover, in some configurations (for instance, an SMT-capable
hybrid x86 system booted with nosmt in the kernel command line) the
em_check_capacity_update() call at the end of em_dev_register_perf_domain()
always fails and reschedules itself to run once again in 1 s, so
effectively it runs in vain every 1 s forever.

To address this, introduce a new variant of em_dev_register_perf_domain(),
called em_dev_register_pd_no_update(), that does not invoke
em_check_capacity_update(), and make intel_pstate use it instead of the
original.

Fixes: 7b010f9b90 ("cpufreq: intel_pstate: EAS support for hybrid platforms")
Closes: https://lore.kernel.org/linux-pm/40212796-734c-4140-8a85-854f72b8144d@panix.com/
Reported-by: Kenneth R. Crudup <kenny@panix.com>
Tested-by: Kenneth R. Crudup <kenny@panix.com>
Cc: 6.16+ <stable@vger.kernel.org> # 6.16+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-09-10 12:03:19 +02:00
Peilin Ye 6d78b4473c bpf: Tell memcg to use allow_spinning=false path in bpf_timer_init()
Currently, calling bpf_map_kmalloc_node() from __bpf_async_init() can
cause various locking issues; see the following stack trace (edited for
style) as one example:

...
 [10.011566]  do_raw_spin_lock.cold
 [10.011570]  try_to_wake_up             (5) double-acquiring the same
 [10.011575]  kick_pool                      rq_lock, causing a hardlockup
 [10.011579]  __queue_work
 [10.011582]  queue_work_on
 [10.011585]  kernfs_notify
 [10.011589]  cgroup_file_notify
 [10.011593]  try_charge_memcg           (4) memcg accounting raises an
 [10.011597]  obj_cgroup_charge_pages        MEMCG_MAX event
 [10.011599]  obj_cgroup_charge_account
 [10.011600]  __memcg_slab_post_alloc_hook
 [10.011603]  __kmalloc_node_noprof
...
 [10.011611]  bpf_map_kmalloc_node
 [10.011612]  __bpf_async_init
 [10.011615]  bpf_timer_init             (3) BPF calls bpf_timer_init()
 [10.011617]  bpf_prog_xxxxxxxxxxxxxxxx_fcg_runnable
 [10.011619]  bpf__sched_ext_ops_runnable
 [10.011620]  enqueue_task_scx           (2) BPF runs with rq_lock held
 [10.011622]  enqueue_task
 [10.011626]  ttwu_do_activate
 [10.011629]  sched_ttwu_pending         (1) grabs rq_lock
...

The above was reproduced on bpf-next (b338cf849e) by modifying
./tools/sched_ext/scx_flatcg.bpf.c to call bpf_timer_init() during
ops.runnable(), and hacking the memcg accounting code a bit to make
a bpf_timer_init() call more likely to raise an MEMCG_MAX event.

We have also run into other similar variants (both internally and on
bpf-next), including double-acquiring cgroup_file_kn_lock, the same
worker_pool::lock, etc.

As suggested by Shakeel, fix this by using __GFP_HIGH instead of
GFP_ATOMIC in __bpf_async_init(), so that e.g. if try_charge_memcg()
raises an MEMCG_MAX event, we call __memcg_memory_event() with
@allow_spinning=false and avoid calling cgroup_file_notify() there.

Depends on mm patch
"memcg: skip cgroup_file_notify if spinning is not allowed":
https://lore.kernel.org/bpf/20250905201606.66198-1-shakeel.butt@linux.dev/

v0 approach s/bpf_map_kmalloc_node/bpf_mem_alloc/
https://lore.kernel.org/bpf/20250905061919.439648-1-yepeilin@google.com/
v1 approach:
https://lore.kernel.org/bpf/20250905234547.862249-1-yepeilin@google.com/

Fixes: b00628b1c7 ("bpf: Introduce bpf timers.")
Suggested-by: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Peilin Ye <yepeilin@google.com>
Link: https://lore.kernel.org/r/20250909095222.2121438-1-yepeilin@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-09 15:24:34 -07:00
KaFai Wan df0cb5cb50 bpf: Allow fall back to interpreter for programs with stack size <= 512
OpenWRT users reported a regression on ARMv6 devices after updating to the
latest HEAD, where the following tcpdump filter:

tcpdump "not ether host 3c37121a2b3c and not ether host 184ecbca2a3a \
and not ether host 14130b4d3f47 and not ether host f0f61cf440b7 \
and not ether host a84b4dedf471 and not ether host d022be17e1d7 \
and not ether host 5c497967208b and not ether host 706655784d5b"

fails with warning: "Kernel filter failed: No error information"
when using config:
 # CONFIG_BPF_JIT_ALWAYS_ON is not set
 CONFIG_BPF_JIT_DEFAULT_ON=y

The issue arises because commits:
1. "bpf: Fix array bounds error with may_goto" changed default runtime to
   __bpf_prog_ret0_warn when jit_requested = 1
2. "bpf: Avoid __bpf_prog_ret0_warn when jit fails" returns error when
   jit_requested = 1 but jit fails

This change restores interpreter fallback capability for BPF programs with
stack size <= 512 bytes when jit fails.

Reported-by: Felix Fietkau <nbd@nbd.name>
Closes: https://lore.kernel.org/bpf/2e267b4b-0540-45d8-9310-e127bf95fc63@nbd.name/
Fixes: 6ebc5030e0 ("bpf: Fix array bounds error with may_goto")
Signed-off-by: KaFai Wan <kafai.wan@linux.dev>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250909144614.2991253-1-kafai.wan@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-09 15:12:16 -07:00
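
A toy decision helper mirroring the rule from the commit text; the constant
and parameter names are assumptions for illustration, not the kernel's:

  #include <stdbool.h>
  #include <stdio.h>

  #define INTERP_MAX_STACK 512    /* interpreter stack limit from the commit text */

  static bool fallback_to_interpreter(bool jit_failed, bool jit_always_on,
                                      unsigned int stack_depth)
  {
          return jit_failed && !jit_always_on && stack_depth <= INTERP_MAX_STACK;
  }

  int main(void)
  {
          printf("%d\n", fallback_to_interpreter(true, false, 384)); /* 1: interpret */
          printf("%d\n", fallback_to_interpreter(true, true,  384)); /* 0: must fail */
          printf("%d\n", fallback_to_interpreter(true, false, 768)); /* 0: too deep  */
          return 0;
  }
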
Kumar Kartikeya Dwivedi 0d80e7f951 rqspinlock: Choose trylock fallback for NMI waiters
Currently, out of all 3 types of waiters in the rqspinlock slow path
(i.e., pending bit waiter, wait queue head waiter, and wait queue
non-head waiter), only the pending bit waiter and wait queue head
waiters apply deadlock checks and a timeout on their waiting loop. The
assumption here was that the wait queue head's forward progress would be
sufficient to identify cases where the lock owner or pending bit waiter
is stuck, and non-head waiters relying on the head waiter would prove to
be sufficient for their own forward progress.

However, the head waiter itself can be preempted by a non-head waiter
for the same lock (AA) or a different lock (ABBA) in a manner that
impedes its forward progress. In such a case, non-head waiters not
performing deadlock and timeout checks becomes insufficient, and the
system can enter a state of lockup.

This is typically not a concern with non-NMI lock acquisitions, as lock
holders which run in different contexts (IRQ, non-IRQ) use "irqsave"
variants of the lock APIs, which naturally excludes such lock holders
from preempting one another on the same CPU.

It might seem likely that a similar case may occur for rqspinlock when
programs are attached to contention tracepoints (begin, end), however,
these tracepoints either precede the enqueue into the wait queue, or
succeed it, therefore cannot be used to preempt a head waiter's waiting
loop.

We must still be careful against nested kprobe and fentry programs that
may attach to the middle of the head's waiting loop to stall forward
progress and invoke another rqspinlock acquisition that proceeds as a
non-head waiter. To this end, drop CC_FLAGS_FTRACE from the rqspinlock.o
object file.

For now, this issue is resolved by falling back to a repeated trylock on
the lock word from NMI context, while performing the deadlock checks to
break out early in case forward progress is impossible, and use the
timeout as a final fallback.

A more involved fix to terminate the queue when such a condition occurs
will be made as a follow up. A selftest to stress this aspect of nested
NMI/non-NMI locking attempts will be added in a subsequent patch to the
bpf-next tree when this fix lands and trees are synchronized.

Reported-by: Josef Bacik <josef@toxicpanda.com>
Fixes: 164c246571 ("rqspinlock: Protect waiters in queue from stalls")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250909184959.3509085-1-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-09 15:10:28 -07:00
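
A userspace sketch of the fallback shape described above: an NMI-context
waiter repeatedly trylocks with a bounded number of attempts instead of
joining the wait queue; the spin bound stands in for the deadlock checks
and timeout:

  #include <stdatomic.h>
  #include <stdio.h>

  static atomic_flag lock_word = ATOMIC_FLAG_INIT;

  static int nmi_trylock_fallback(void)
  {
          for (unsigned long tries = 0; tries < 1000000; tries++) {
                  if (!atomic_flag_test_and_set_explicit(&lock_word,
                                                         memory_order_acquire))
                          return 0;        /* acquired */
                  /* a real implementation also runs its deadlock checks here */
          }
          return -1;                       /* timed out: the final fallback */
  }

  int main(void)
  {
          printf("%d\n", nmi_trylock_fallback());   /* 0: lock was free          */
          printf("%d\n", nmi_trylock_fallback());   /* -1: still held, gives up  */
          return 0;
  }
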
Rong Tao 7edfc02470 bpf: Fix bpf_strnstr() to handle suffix match cases better
bpf_strnstr() should not treat the terminating '\0' of s2 as a matching
character when the parameter 'len' equals the length of s2, for example:

    1. bpf_strnstr("openat", "open", 4) = -ENOENT
    2. bpf_strnstr("openat", "open", 5) = 0

This patch makes (1) return 0, fixing not just the `len == strlen(s2)`
case but also the more general case where s2 is a suffix of the first len
characters of s1.

Fixes: e91370550f ("bpf: Add kfuncs for read-only string operations")
Signed-off-by: Rong Tao <rongtao@cestc.cn>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/tencent_17DC57B9D16BC443837021BEACE84B7C1507@qq.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-09 15:07:58 -07:00
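
A userspace illustration of the intended semantics (not the in-kernel
kfunc): a match must lie entirely within the first len characters of s1,
and the terminating '\0' of s2 is not a matching character:

  #include <stdio.h>
  #include <string.h>

  static long strnstr_expected(const char *s1, const char *s2, size_t len)
  {
          size_t n = strlen(s2);

          if (n > len)
                  return -1;                      /* -ENOENT in the kfunc */
          for (size_t i = 0; i + n <= len && s1[i]; i++)
                  if (!strncmp(s1 + i, s2, n))
                          return (long)i;
          return -1;
  }

  int main(void)
  {
          printf("%ld\n", strnstr_expected("openat", "open", 4));  /* 0 after the fix */
          printf("%ld\n", strnstr_expected("openat", "open", 5));  /* 0 */
          return 0;
  }
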
Daniel Borkmann f9bb6ffa7f bpf: Fix out-of-bounds dynptr write in bpf_crypto_crypt
Stanislav reported that in bpf_crypto_crypt() the destination dynptr's
size is not validated to be at least as large as the source dynptr's
size before calling into the crypto backend with 'len = src_len'. This
can result in an OOB write when the destination is smaller than the
source.

Concretely, in mentioned function, psrc and pdst are both linear
buffers fetched from each dynptr:

  psrc = __bpf_dynptr_data(src, src_len);
  [...]
  pdst = __bpf_dynptr_data_rw(dst, dst_len);
  [...]
  err = decrypt ?
        ctx->type->decrypt(ctx->tfm, psrc, pdst, src_len, piv) :
        ctx->type->encrypt(ctx->tfm, psrc, pdst, src_len, piv);

The crypto backend expects pdst to be large enough with a src_len length
that can be written. Add an additional src_len > dst_len check and bail
out if it's the case. Note that these kfuncs are accessible under root
privileges only.

Fixes: 3e1c6f3540 ("bpf: make common crypto API for TC/XDP programs")
Reported-by: Stanislav Fort <disclosure@aisle.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://lore.kernel.org/r/20250829143657.318524-1-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-09 15:07:57 -07:00
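
A userspace model of the added bound check: validate that the destination
can hold src_len bytes before letting the backend write into it; memcpy()
stands in for the cipher callback:

  #include <errno.h>
  #include <stdio.h>
  #include <string.h>

  static int crypt_model(unsigned char *dst, size_t dst_len,
                         const unsigned char *src, size_t src_len)
  {
          if (src_len > dst_len)
                  return -EINVAL;          /* the previously missing check */
          memcpy(dst, src, src_len);       /* stand-in for ctx->type->encrypt() */
          return 0;
  }

  int main(void)
  {
          unsigned char small[4], big[16];
          unsigned char src[8] = "payload";

          printf("%d\n", crypt_model(small, sizeof(small), src, sizeof(src))); /* -22 */
          printf("%d\n", crypt_model(big,   sizeof(big),   src, sizeof(src))); /* 0   */
          return 0;
  }
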
Linus Torvalds 9dd1835ecd dma-mapping fix for Linux 6.17
- one more fix for DMA API debugging infrastructure (Baochen Qiang)
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSrngzkoBtlA8uaaJ+Jp1EFxbsSRAUCaMAGzwAKCRCJp1EFxbsS
 RJcvAPwO4NrxrUQdjaXOy//DhFNG4fhMmdn6kYqD+TjTJq9QnwEA/dMGxU1fZqvM
 Kqusn1iiQsHjFLRzEOExqRbL6MFDSgM=
 =ViHe
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-6.17-2025-09-09' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux

Pull dma-mapping fix from Marek Szyprowski:

 - one more fix for DMA API debugging infrastructure (Baochen Qiang)

* tag 'dma-mapping-6.17-2025-09-09' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
  dma-debug: don't enforce dma mapping check on noncoherent allocations
2025-09-09 11:03:04 -07:00
Jiri Wiesner b9aa93aa51 clocksource: Print durations for sync check unconditionally
A typical set of messages that gets printed as a result of the clocksource
watchdog finding the TSC unstable usually does not contain messages
indicating CPUs being ahead of or behind the CPU from which the check is
carried out. That fact suggests that the TSC does not experience time skew
between CPUs (if the clocksource.verify_n_cpus parameter is set to a
negative value) but quantitative information is missing.

The cs_nsec_max value printed by the "CPU %d check durations" message
actually provides a worst case estimate of the time skew. If all CPUs have
been checked, the cs_nsec_max value multiplied by 2 is the maximum
possible time skew between the TSCs of any two CPUs on the system. The
worst case estimate is derived from two boundary cases:

1. No time is consumed to execute instructions between csnow_begin and
csnow_mid while all the cs_nsec_max time is consumed by the code between
csnow_mid and csnow_end. In this case, the maximum undetectable time skew
of a CPU being ahead would be cs_nsec_max.

2. All the cs_nsec_max time is consumed to execute instructions between
csnow_begin and csnow_mid while no time is consumed by the code between
csnow_mid and csnow_end. In this case, the maximum undetectable time skew
of a CPU being behind would be cs_nsec_max.

The worst case estimate assumes a system experiencing a corner case
consisting of the two boundary cases.

Always print the "CPU %d check durations" message so that the maximum
possible time skew measured by the TSC sync check can be compared to the
time skew measured by the clocksource watchdog.

Signed-off-by: Jiri Wiesner <jwiesner@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/all/aIuXXfdITXdI0lLp@incl
2025-09-09 14:08:19 +02:00
Xiongfeng Wang e895f8e291 hrtimers: Unconditionally update target CPU base after offline timer migration
When testing softirq based hrtimers on an ARM32 board, with high resolution
mode and NOHZ inactive, softirq based hrtimers fail to expire after being
moved away from an offline CPU:

CPU0				CPU1
				hrtimer_start(..., HRTIMER_MODE_SOFT);
cpu_down(CPU1)			...
				hrtimers_cpu_dying()
				  // Migrate timers to CPU0
				  smp_call_function_single(CPU0, retrigger_next_event);
  retrigger_next_event()
    if (!highres && !nohz)
        return;

As retrigger_next_event() is a NOOP when both high resolution timers and
NOHZ are inactive CPU0's hrtimer_cpu_base::softirq_expires_next is not
updated and the migrated softirq timers never expire unless there is a
softirq based hrtimer queued on CPU0 later.

Fix this by removing the hrtimer_hres_active() and tick_nohz_active() check
in retrigger_next_event(), which enforces a full update of the CPU base.
As this is not a fast path the extra cost does not matter.

[ tglx: Massaged change log ]

Fixes: 5c0930ccaa ("hrtimers: Push pending hrtimers away from outgoing CPU earlier")
Co-developed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250805081025.54235-1-wangxiongfeng2@huawei.com
2025-09-09 14:05:16 +02:00
Bibo Mao fe2a449a45 tick: Do not set device to detached state in tick_shutdown()
tick_shutdown() sets the state of the clockevent device to detached
first and then invokes clockevents_exchange_device(), which in turn
invokes clockevents_switch_state().

But clockevents_switch_state() returns without invoking the device shutdown
callback as the device is already in detached state. As a consequence the
timer device is not shutdown when a CPU goes offline.

tick_shutdown() does this because it was originally invoked on a online CPU
and not on the outgoing CPU. It therefore could not access the clockevent
device of the already offlined CPU and just set the state.

Since commit 3b1596a21f tick_shutdown() is called on the outgoing CPU, so
the hardware device can be accessed.

Remove the state set before calling clockevents_exchange_device(), so that
the subsequent clockevents_switch_state() handles the state transition and
invokes the shutdown callback of the clockevent device.

[ tglx: Massaged change log ]

Fixes: 3b1596a21f ("clockevents: Shutdown and unregister current clockevents at CPUHP_AP_TICK_DYING")
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20250906064952.3749122-2-maobibo@loongson.cn
2025-09-09 13:39:00 +02:00
Thomas Weißschuh 3c3af563b3 hrtimer: Reorder branches in hrtimer_clockid_to_base()
Align the ordering to the one used for hrtimer_bases.

Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250821-hrtimer-cleanup-get_time-v2-9-3ae822e5bfbd@linutronix.de
2025-09-09 12:27:18 +02:00
Thomas Weißschuh 009eb5da29 hrtimer: Remove hrtimer_clock_base::get_time
The get_time() callbacks always need to match the base's clockid.
Instead of maintaining that association twice in hrtimer_bases,
use a helper.

Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20250821-hrtimer-cleanup-get_time-v2-8-3ae822e5bfbd@linutronix.de
2025-09-09 12:27:18 +02:00
Thomas Weißschuh b68b7f3e9b sched/core: Avoid direct access to hrtimer clockbase
The field timer->base->get_time is a private implementation detail and
should not be accessed outside of the hrtimer core.

Switch to the equivalent helper.

Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20250821-hrtimer-cleanup-get_time-v2-3-3ae822e5bfbd@linutronix.de
2025-09-09 12:27:18 +02:00
Thomas Weißschuh 5f531fe9cb timers/itimer: Avoid direct access to hrtimer clockbase
The field timer->base->get_time is a private implementation detail and
should not be accessed outside of the hrtimer core.

Switch to the equivalent helper.

Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20250821-hrtimer-cleanup-get_time-v2-2-3ae822e5bfbd@linutronix.de
2025-09-09 12:27:17 +02:00
Thomas Weißschuh 24fb08dcc4 posix-timers: Avoid direct access to hrtimer clockbase
The field timer->base->get_time is a private implementation detail and
should not be accessed outside of the hrtimer core.

Switch to the equivalent helpers.

Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20250821-hrtimer-cleanup-get_time-v2-1-3ae822e5bfbd@linutronix.de
2025-09-09 12:27:17 +02:00
Pu Lehui cd4453c5e9 tracing: Silence warning when chunk allocation fails in trace_pid_write
Syzkaller triggered a fault injection warning:

WARNING: CPU: 1 PID: 12326 at tracepoint_add_func+0xbfc/0xeb0
Modules linked in:
CPU: 1 UID: 0 PID: 12326 Comm: syz.6.10325 Tainted: G U 6.14.0-rc5-syzkaller #0
Tainted: [U]=USER
Hardware name: Google Compute Engine/Google Compute Engine
RIP: 0010:tracepoint_add_func+0xbfc/0xeb0 kernel/tracepoint.c:294
Code: 09 fe ff 90 0f 0b 90 0f b6 74 24 43 31 ff 41 bc ea ff ff ff
RSP: 0018:ffffc9000414fb48 EFLAGS: 00010283
RAX: 00000000000012a1 RBX: ffffffff8e240ae0 RCX: ffffc90014b78000
RDX: 0000000000080000 RSI: ffffffff81bbd78b RDI: 0000000000000001
RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000001 R12: ffffffffffffffef
R13: 0000000000000000 R14: dffffc0000000000 R15: ffffffff81c264f0
FS:  00007f27217f66c0(0000) GS:ffff8880b8700000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001b2e80dff8 CR3: 00000000268f8000 CR4: 00000000003526f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 tracepoint_probe_register_prio+0xc0/0x110 kernel/tracepoint.c:464
 register_trace_prio_sched_switch include/trace/events/sched.h:222 [inline]
 register_pid_events kernel/trace/trace_events.c:2354 [inline]
 event_pid_write.isra.0+0x439/0x7a0 kernel/trace/trace_events.c:2425
 vfs_write+0x24c/0x1150 fs/read_write.c:677
 ksys_write+0x12b/0x250 fs/read_write.c:731
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0xcd/0x250 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

We can reproduce the warning by following the steps below:
1. echo 8 >> set_event_notrace_pid. Let tr->filtered_pids own one pid
   and register the sched_switch tracepoint.
2. echo ' ' >> set_event_pid, and perform fault injection during chunk
   allocation in trace_pid_list_alloc. This leaves pid_list with no pid
   and assigns it to tr->filtered_pids.
3. echo ' ' >> set_event_pid. This makes pid_list NULL and assigns it to
   tr->filtered_pids.
4. echo 9 >> set_event_pid, which triggers the double-registration of the
   sched_switch tracepoint warning.

The reason is that syzkaller injects a fault into the chunk allocation
in trace_pid_list_alloc, causing a failure in trace_pid_list_set, which
may trigger a double registration of the same tracepoint. This only occurs
when the system is about to crash, but to suppress this warning, let's
add failure handling logic to trace_pid_list_set.

Link: https://lore.kernel.org/20250908024658.2390398-1-pulehui@huaweicloud.com
Fixes: 8d6e90983a ("tracing: Create a sparse bitmask for pid filtering")
Reported-by: syzbot+161412ccaeff20ce4dde@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/67cb890e.050a0220.d8275.022e.GAE@google.com
Signed-off-by: Pu Lehui <pulehui@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-09-08 14:56:43 -04:00
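
A toy version of the parse-loop fix: if adding a pid to the list fails
(modeling the injected chunk-allocation failure), break out with the error
instead of continuing. All names here are stand-ins:

  #include <errno.h>
  #include <stdbool.h>
  #include <stdio.h>

  static bool alloc_should_fail = true;   /* models the injected fault */

  /* Stand-in for trace_pid_list_set(): fails when the chunk allocation fails. */
  static int pid_list_set_model(int pid)
  {
          (void)pid;
          return alloc_should_fail ? -ENOMEM : 0;
  }

  int main(void)
  {
          int pids[] = { 8, 9, 11 };
          int ret = 0;

          for (unsigned int i = 0; i < sizeof(pids) / sizeof(pids[0]); i++) {
                  ret = pid_list_set_model(pids[i]);
                  if (ret < 0)
                          break;          /* the early-out the fix adds */
          }
          printf("ret = %d\n", ret);      /* -12 (ENOMEM) instead of a later warning */
          return 0;
  }
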
Marco Crivellari a857210b10 bpf: WQ_PERCPU added to alloc_workqueue users
Currently, if a user enqueues a work item using schedule_delayed_work(), the
wq used is "system_wq" (a per-cpu wq), while queue_delayed_work() uses
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work(), which uses system_wq, and queue_work(), which again makes
use of WORK_CPU_UNBOUND.
This lack of consistency cannot be addressed without refactoring the API.

alloc_workqueue() treats all queues as per-CPU by default, while unbound
workqueues must opt-in via WQ_UNBOUND.

This default is suboptimal: most workloads benefit from unbound queues,
allowing the scheduler to place worker threads where they’re needed and
reducing noise when CPUs are isolated.

This patch adds a new WQ_PERCPU flag to explicitly request the use of
the per-CPU behavior. Both flags coexist for one release cycle to allow
callers to transition their calls.

Once migration is complete, WQ_UNBOUND can be removed and unbound will
become the implicit default.

With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU.

All existing users have been updated accordingly.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://lore.kernel.org/r/20250905085309.94596-4-marco.crivellari@suse.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-08 10:04:37 -07:00
Marco Crivellari 0409819a00 bpf: replace use of system_unbound_wq with system_dfl_wq
Currently, if a user enqueues a work item using schedule_delayed_work(), the
wq used is "system_wq" (a per-cpu wq), while queue_delayed_work() uses
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work(), which uses system_wq, and queue_work(), which again makes
use of WORK_CPU_UNBOUND.

This lack of consistency cannot be addressed without refactoring the API.

system_unbound_wq should be the default workqueue so as not to enforce
locality constraints for random work whenever it's not required.

Adding system_dfl_wq to encourage its use when unbound work should be used.

queue_work() / queue_delayed_work() / mod_delayed_work() will now use the
new unbound wq: if the user still uses the old wq, a warning will be
printed along with a redirect to the new wq.

The old system_unbound_wq will be kept for a few release cycles.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://lore.kernel.org/r/20250905085309.94596-3-marco.crivellari@suse.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-08 10:04:37 -07:00
Marco Crivellari 34f86083a4 bpf: replace use of system_wq with system_percpu_wq
Currently, if a user enqueues a work item using schedule_delayed_work(), the
wq used is "system_wq" (a per-cpu wq), while queue_delayed_work() uses
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work(), which uses system_wq, and queue_work(), which again makes
use of WORK_CPU_UNBOUND.

This lack of consistency cannot be addressed without refactoring the API.

system_wq is a per-CPU workqueue, yet nothing in its name indicates that
CPU affinity constraint, which is very often not required by users. Make
it clear by adding a system_percpu_wq.

queue_work() / queue_delayed_work() / mod_delayed_work() will now use the
new per-cpu wq: if the user still sticks to the old name, a warning will
be printed along with a redirect to the new wq.

This patch adds the new system_percpu_wq except for the mm, fs and net
subsystems, which are handled in separate patches.

The old wq will be kept for a few release cycles.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://lore.kernel.org/r/20250905085309.94596-2-marco.crivellari@suse.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-08 10:04:37 -07:00
Linus Torvalds 6ab41fca2e Fix a severe slowdown regression in the timer vDSO code related
to the while() loop in __iter_div_u64_rem(), when the AUX-clock
 is enabled.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmi9XZURHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1gVzg/+N3T4qL8Vi3HwOzN862EqmVdaEauAVkQU
 vlIxFbUX26KHmwoSYpsVlaswhONMa6tkNEz3zeOfECrK6SAIg/9SGuk/1bRqERCc
 4KX3g0S0b7qfC7VJmjsmswXmxM9MrXLY5dx/XzhbACpGVoTJAMLlXL0OVwS361fP
 VOH4zSRCPUl+IbTJ6CkVQKZYWZAXxDuxBaSAsMaUz2QMZck6ikGOKP7CT9se0Fk5
 Rve7bC0Wy2vuUfchCIagSE9U28rf7HDawqJVEOygtRQm/Hl7yKSUxaHO5LdSI7c/
 k7mCoZkdyqj3Pdbe26CFSyvGdNJ+gHIhOSzaO43kHTf+bpgW/WBWm6DLjGWl8eKP
 EAW7QMkFoQaNWhSUbPwOR68R7fR3CsvmU1SOXufIRVKjuDy+246gu6y1nvJFr3RS
 BQyBPUh3qAJUoJj00nvjxtLB9p0pI2C+0ml7GTWEX3Ao8Y7rPPKh8yR2CvIhdmI4
 Oo4/vRtcs9NGYKzSu7gknYSkxITsjfEInCkmdsDVQ503QuUAgL653z8rjUIfTIti
 72fJ3pqVUyJuv9xyRkzQcavyZtU/l8uSfMNzgmHen8zmPjKA6rXg+BiuhP4uQ10W
 Wxw2s0cP3RgvuL3+y9oQHq3VHJ4JfdNVOKVssd97zCvmwFKCOXxi+0eKCy9ELQPq
 nNpuOC/9API=
 =5NRx
 -----END PGP SIGNATURE-----

Merge tag 'timers-urgent-2025-09-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fix from Ingo Molnar:
 "Fix a severe slowdown regression in the timer vDSO code related to the
  while() loop in __iter_div_u64_rem(), when the AUX-clock is enabled"

* tag 'timers-urgent-2025-09-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  vdso/vsyscall: Avoid slow division loop in auxiliary clock update
2025-09-07 08:29:44 -07:00
Linus Torvalds b7369eb731 Fix an 'allocation from atomic context' regression in the futex vmalloc
variant.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmi9WmkRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1h4ww/9FnZ2elqxUG2kfzmDMwMCEA+KDyVIfqLI
 DzNqB0oFOgHO1xcWx/DLWItWIycH/+XhwcQUpCRcBlrG1pzaRqrbD0ioYw0SZRFo
 JbNRIQVjPBKa72vcPUd5et1cJB0iB7irUEhEPDPxkQ5BeIQOwayLF/DSz4IsKvde
 os9W9D2QcXk/YlOvbT+eRMmkDCSrAeAufz7RzcoPgzBvtGAWZEyCaBBjWXiuE2Au
 T05aXf6y35/QPB+pq7+aLxaqP97Auj2ebmCSB7iKGq1XCYXqCFmo4vQzAU6zZgkK
 hriO3+YL1wgN524bWaGXPRm87zcCmLVOXFJXe0wbTm/9w3CWmMUE+LEhfEAM+9n9
 IZpoAEL540P3/ijYFxpie4GAz+ZjttUwLvh8f5xaM/2jPUq8bUXakR6dt5U5h3YA
 n4FG2/K215R3s9+wYV2KpNQ0gUqZn7EmqfMAkQ1N09rsW571ExUKWdrEr61TUOnx
 BTwC0nOti/M97sOHjajG/ylBmRaswc6mwiLVlVy3Gvi/kzAqjAVe+cRJnNKSd7hd
 kIUKUiPWrI4eSWi+0gmW0Xo5RcyEbMNf4iVGvlZ6aM5zoKjZo72TGPcqapxbaW6m
 hRRsFJdlXmBBOxKhB+Ne0EZIqQzk9ZK8tknN5voXp1UsupanubwS7H3YYppljCxg
 VdlccxGizuI=
 =rQLp
 -----END PGP SIGNATURE-----

Merge tag 'locking-urgent-2025-09-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking fix from Ingo Molnar:
 "Fix an 'allocation from atomic context' regression in the futex
  vmalloc variant"

* tag 'locking-urgent-2025-09-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  futex: Move futex_hash_free() back to __mmput()
2025-09-07 08:26:28 -07:00
Linus Torvalds 6a8a34a56a Fix regression where PERF_EVENT_IOC_REFRESH counters
miss a PMU-stop.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmi9T3QRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1gkjhAAljiQ9y7DBlxWonRPstqKy1AYY7ndn+Jl
 CMgvflxE5NzI2IKqu7k642KCeb1UcL3Jpuqx+ywZCn5TZGgQ1haGhU7B8aufbMPc
 tcUx1LyrANhnRc0baBo5mPly9a1muV+bKNeA6Qa7fOQlI6R8dkFw2NQ+veN0bXCl
 2tarzNcnwCjUu0zat8xlgxTo94P49reW/A/SgTlXBBFmHCezzDh348FYgWzFJ9iw
 3Ki0BDd53z6p9uvTVSIxvPSVjhKAGVCX+zYJNcIOlBHnEe2yV0FYuGJGlEM7Vr2z
 jCd7MuUQ+kxtUmRLQGMZntSHjweseuOyJvx04rpVvX4I7tC7BgAZqeepw6pYV9ay
 OBIx1w+mYgI9gKHQUbYagETUBBPkofEABotQ46QD3Ih5zT/lwUbHy3JPj7bXt0nm
 qrpEognULUlJkwULzUcnBRs03nf72U1yZYNKL/gYw3rEM+EKek5UburmBXcOJa8S
 33FK9vyEWnkZaSJZ//KoMIirjZqwAK1s/nxuxTp6ZjN0vh3VQld1GLmOKgtYJNEK
 g8/yUa0EXGEjsRZccPoJkRqoX4tHdEj6cOBFDuTvT+Ps6Wj4S30QTkfWyDltcyWT
 ziaP9fWxPkXS68DSHlx3Em0yIVU76f2dhZDZnJ2sODkuSLi+c0yBcy1lXeYJrkOj
 m9XlTj72KkY=
 =p8YX
 -----END PGP SIGNATURE-----

Merge tag 'perf-urgent-2025-09-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf event fix from Ingo Molnar:
 "Fix regression where PERF_EVENT_IOC_REFRESH counters miss a PMU-stop"

* tag 'perf-urgent-2025-09-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf: Fix the POLL_HUP delivery breakage
2025-09-07 08:24:20 -07:00
Wang Liang c1628c00c4 tracing/osnoise: Fix null-ptr-deref in bitmap_parselist()
A crash was observed with the following output:

BUG: kernel NULL pointer dereference, address: 0000000000000010
Oops: Oops: 0000 [#1] SMP NOPTI
CPU: 2 UID: 0 PID: 92 Comm: osnoise_cpus Not tainted 6.17.0-rc4-00201-gd69eb204c255 #138 PREEMPT(voluntary)
RIP: 0010:bitmap_parselist+0x53/0x3e0
Call Trace:
 <TASK>
 osnoise_cpus_write+0x7a/0x190
 vfs_write+0xf8/0x410
 ? do_sys_openat2+0x88/0xd0
 ksys_write+0x60/0xd0
 do_syscall_64+0xa4/0x260
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
 </TASK>

This issue can be reproduced with the code below:

fd=open("/sys/kernel/debug/tracing/osnoise/cpus", O_WRONLY);
write(fd, "0-2", 0);

When the user passes 'count=0' to osnoise_cpus_write(), kmalloc() will return
ZERO_SIZE_PTR (16) and cpulist_parse() treats it as a normal value, which
triggers the null pointer dereference. Add a check for the parameter 'count'.

Cc: <mhiramat@kernel.org>
Cc: <mathieu.desnoyers@efficios.com>
Cc: <tglozar@redhat.com>
Link: https://lore.kernel.org/20250906035610.3880282-1-wangliang74@huawei.com
Fixes: 17f89102fe ("tracing/osnoise: Allow arbitrarily long CPU string")
Signed-off-by: Wang Liang <wangliang74@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-09-06 12:12:38 -04:00
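
A userspace model of the early return added by the fix; the parsing details
are elided and all names are illustrative:

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  static long cpus_write_model(const char *ubuf, size_t count)
  {
          char *buf;

          if (count == 0)
                  return 0;       /* the missing early return: the kernel's
                                     kmalloc(0) hands back ZERO_SIZE_PTR,
                                     which must never be dereferenced */

          buf = malloc(count + 1);
          if (!buf)
                  return -1;
          memcpy(buf, ubuf, count);
          buf[count] = '\0';
          /* ... cpulist parsing would happen here ... */
          free(buf);
          return (long)count;
  }

  int main(void)
  {
          printf("%ld\n", cpus_write_model("0-2", 0));   /* 0, nothing parsed */
          printf("%ld\n", cpus_write_model("0-2", 3));   /* 3 */
          return 0;
  }
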
Guenter Roeck ab1396af75 trace/fgraph: Fix error handling
Commit edede7a6dc ("trace/fgraph: Fix the warning caused by missing
unregister notifier") added a call to unregister the PM notifier if
register_ftrace_graph() failed. It does so unconditionally. However,
the PM notifier is only registered with the first call to
register_ftrace_graph(). If the first registration was successful and
a subsequent registration failed, the notifier is now unregistered even
if ftrace graphs are still registered.

Fix the problem by only unregistering the PM notifier during error handling
if there are no active fgraph registrations.

Fixes: edede7a6dc ("trace/fgraph: Fix the warning caused by missing unregister notifier")
Closes: https://lore.kernel.org/all/63b0ba5a-a928-438e-84f9-93028dd72e54@roeck-us.net/
Cc: Ye Weihua <yeweihua4@huawei.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250906050618.2634078-1-linux@roeck-us.net
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-09-06 12:12:38 -04:00
Linus Torvalds 730c1451fb audit/stable-6.17 PR 20250905
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCgAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmi7AksUHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXMmaA/8ChtRoaZFLTFI7UTWmVo0aY7NDl77
 fqjwY1ZrPsqezLYt317LCJareQOfn1/NEc1Xxb7Caz6Z35eQPUDeUnd96zEarzDL
 4iYcZA1MJVO2jjnyj4lqVoSgkftQPN6qvga5osA9mMOQ24mNPAO5yZIVdzL/sekv
 ewosheC6spwFD26+/uWE00sdVRhtOnUPkLmelf4wyN0WX+lStTXozLpSW1Pr0FB1
 tZI/mjLyVmZ+YJtTMCMLhpM/hFIMymlSr7BZ7pO/G1w48OGzVBdUlS45bS5rMUIQ
 D/MdTB2YlkMOYFyRCgoSgxaFHGHf7F6MQO0J7y/hVu2QSvGoxneUaTIszpaj7rcg
 edaJKogA6DlXvauw32UQhjUTQetuppyFqAHQXsec/JVYGUJqAZWNdsThP86MGlim
 INwspptaBwOULg2OTYt/+jHblbhl8BI9ayC/S4lN89MCwszLYoFmjIo+mCf9qWfE
 uFsiyvvqoXaMKH/e4NkcUjT0AxzwAHF0DwU7Vh+apThOpdr7Qudr6meWEgJ+iRpO
 hhWrpyPAP1pann7VebwAqWs9iG97cuwARca4EWyuSy+i11qLQ19LwC/i5vfb58TJ
 Ozh9g01A65qWgA5/G14XWmN1oLYdWK+KqwmdTXtDgWXZ5hX3oMrlypRxbPxKIyVG
 PsEKU7GXjze/J1s=
 =g/UF
 -----END PGP SIGNATURE-----

Merge tag 'audit-pr-20250905' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit

Pull audit fix from Paul Moore:
 "A single small audit patch to fix a potential out-of-bounds read
  caused by a negative array index when comparing paths"

* tag 'audit-pr-20250905' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
  audit: fix out-of-bounds read in audit_compare_dname_path()
2025-09-05 12:35:25 -07:00
Marco Crivellari a2be943b46 workqueue: replace use of system_wq with system_percpu_wq
Currently, if a user enqueues a work item using schedule_delayed_work(), the
wq used is "system_wq" (a per-cpu wq), while queue_delayed_work() uses
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work(), which uses system_wq, and queue_work(), which again makes
use of WORK_CPU_UNBOUND.

This lack of consistency cannot be addressed without refactoring the API.

system_wq is a per-CPU workqueue, yet nothing in its name indicates that
CPU affinity constraint, which is very often not required by users. Make
it clear by adding a system_percpu_wq.

queue_work() / queue_delayed_work() / mod_delayed_work() will now use the
new per-cpu wq: if the user still sticks to the old name, a warning will
be printed along with a redirect to the new wq.

This patch adds the new system_percpu_wq except for the mm, fs and net
subsystems, which are handled in separate patches.

The old wq will be kept for a few release cycles.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-05 07:20:00 -10:00
Marco Crivellari f6cfa602d2 workqueue: replace use of system_unbound_wq with system_dfl_wq
Currently, if a user enqueues a work item using schedule_delayed_work(), the
wq used is "system_wq" (a per-cpu wq), while queue_delayed_work() uses
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work(), which uses system_wq, and queue_work(), which again makes
use of WORK_CPU_UNBOUND.

This lack of consistency cannot be addressed without refactoring the API.

system_unbound_wq should be the default workqueue so as not to enforce
locality constraints for random work whenever it's not required.

Adding system_dfl_wq to encourage its use when unbound work should be used.

queue_work() / queue_delayed_work() / mod_delayed_work() will now use the
new unbound wq: whether the user still use the old wq a warn will be
printed along with a wq redirect to the new one.

The old system_unbound_wq will be kept for a few release cycles.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-05 07:19:09 -10:00
Tejun Heo 4a3e62dfa7 cgroup: Merge branch 'for-6.17-fixes' into for-6.18
Pull for-6.17-fixes to receive 79f919a89c ("cgroup: split
cgroup_destroy_wq into 3 workqueues") to resolve its conflict with
7fa33aa3b0 ("cgroup: WQ_PERCPU added to alloc_workqueue users"). The
latter adds WQ_PERCPU when creating cgroup_destroy_wq and the former splits
the workqueue into three. Resolve by applying WQ_PERCPU to the three split
workqueues.

Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-05 07:08:26 -10:00
Marco Crivellari 7fa33aa3b0 cgroup: WQ_PERCPU added to alloc_workqueue users
Currently if a user enqueues a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() uses
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work(), which uses system_wq, and queue_work(), which again makes
use of WORK_CPU_UNBOUND.

This lack of consistency cannot be addressed without refactoring the API.

alloc_workqueue() treats all queues as per-CPU by default, while unbound
workqueues must opt-in via WQ_UNBOUND.

This default is suboptimal: most workloads benefit from unbound queues,
allowing the scheduler to place worker threads where they’re needed and
reducing noise when CPUs are isolated.

This patch adds a new WQ_PERCPU flag to explicitly request the use of
the per-CPU behavior. Both flags coexist for one release cycle to allow
callers to transition their calls.

Once migration is complete, WQ_UNBOUND can be removed and unbound will
become the implicit default.

With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU.

All existing users have been updated accordingly.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-05 06:40:25 -10:00
Marco Crivellari d6256771d1 cgroup: replace use of system_wq with system_percpu_wq
Currently if a user enqueues a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() uses
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work(), which uses system_wq, and queue_work(), which again makes
use of WORK_CPU_UNBOUND.

This lack of consistency cannot be addressed without refactoring the API.

system_wq is a per-CPU workqueue, yet nothing in its name tells about that
CPU affinity constraint, which is very often not required by users. Make
it clear by adding a system_percpu_wq.

queue_work() / queue_delayed_work() / mod_delayed_work() will now use the
new per-cpu wq: if the user still sticks to the old name, a warning will be
printed along with a redirect to the new wq.

This patch adds the new system_percpu_wq, except for the mm, fs and net
subsystems, which are handled in separate patches.

The old wq will be kept for a few release cycles.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-05 06:40:12 -10:00
Tejun Heo 222f83d5ab cgroup: Remove unused local variables from cgroup_procs_write_finish()
d8b269e009 ("cgroup: Remove unused cgroup_subsys::post_attach") made $ss
and $ssid unused but didn't drop them leading to compilation warnings. Drop
them.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Chuyi Zhou <zhouchuyi@bytedance.com>
2025-09-04 11:23:43 -10:00
Jakub Kicinski 5ef04a7b06 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.17-rc5).

No conflicts.

Adjacent changes:

include/net/sock.h
  c51613fa27 ("net: add sk->sk_drop_counters")
  5d6b58c932 ("net: lockless sock_i_ino()")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-04 13:33:00 -07:00
Andrea Righi 47d9f82128 sched_ext: Fix NULL dereference in scx_bpf_cpu_rq() warning
When printing the deprecation warning for scx_bpf_cpu_rq(), we may hit a
NULL pointer dereference if the kfunc is called before a BPF scheduler
is fully attached, for example, when invoked from a BPF timer or during
ops.init():

 [   50.752775] BUG: kernel NULL pointer dereference, address: 0000000000000331
 ...
 [   50.764205] RIP: 0010:scx_bpf_cpu_rq+0x30/0xa0
 ...
 [   50.787661] Call Trace:
 [   50.788398]  <TASK>
 [   50.789061]  bpf_prog_08f7fd2dcb187aaf_wakeup_timerfn+0x75/0x1a8
 [   50.792477]  bpf_timer_cb+0x7e/0x140
 [   50.796003]  hrtimer_run_softirq+0x91/0xe0
 [   50.796952]  handle_softirqs+0xce/0x3c0
 [   50.799087]  run_ksoftirqd+0x3e/0x70
 [   50.800197]  smpboot_thread_fn+0x133/0x290
 [   50.802320]  kthread+0x115/0x220
 [   50.804984]  ret_from_fork+0x17a/0x1d0
 [   50.806920]  ret_from_fork_asm+0x1a/0x30
 [   50.807799]  </TASK>

Fix this by only printing the warning once the scheduler is fully
registered.

Fixes: 5c48d88fe0 ("sched_ext: deprecation warn for scx_bpf_cpu_rq()")
Cc: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-04 10:27:16 -10:00
Samuel Wu 56a232d93c PM: sleep: Make pm_wakeup_clear() call more clear
Move pm_wakeup_clear() to the same location as other functions that do
bookkeeping prior to suspend_prepare().

Since calling pm_wakeup_clear() is a prerequisite to setting up for
suspend and enabling suspend functionality (like aborting during
suspend), moving pm_wakeup_clear() higher up the call stack makes its
intent clearer and makes it obvious that it is called prior to
suspend_prepare().

After this change, there is a slightly larger window when abort events
can be registered, but otherwise suspend functionality is the same.

Suggested-by: Saravana Kannan <saravanak@google.com>
Signed-off-by: Samuel Wu <wusamuel@google.com>
Link: https://patch.msgid.link/20250821004237.2712312-2-wusamuel@google.com
Reviewed-by: Saravana Kannan <saravanak@google.com>
[ rjw: Subject and changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-09-04 21:05:14 +02:00
Sebastian Andrzej Siewior ad7c7f4b9c workqueue: Provide a handshake for canceling BH workers
While a BH work item is canceled, the core code spins until it
determines that the item completed. On PREEMPT_RT the spinning relies on
a lock in local_bh_disable() to avoid a live lock if the canceling
thread has higher priority than the BH-worker and preempts it. This lock
ensures that the BH-worker makes progress by PI-boosting it.

This lock in local_bh_disable() is a central per-CPU BKL and about to be
removed.

To provide the required synchronisation, add a per-pool lock. The lock is
acquired by the bh_worker at the beginning, while the individual callbacks
are invoked. To enforce progress in case of interruption, __flush_work()
needs to acquire the lock.
This will flush all BH work items assigned to that pool.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-04 07:28:33 -10:00
Chuyi Zhou d8b269e009 cgroup: Remove unused cgroup_subsys::post_attach
cgroup_subsys::post_attach callback was introduced in commit 5cf1cacb49
("cgroup, cpuset: replace cpuset_post_attach_flush() with
cgroup_subsys->post_attach callback") and only cpuset would use this
callback to wait for the mm migration to complete at the end of
__cgroup_procs_write(). Since the previous patch defers the flush operation
until returning to userspace, no one uses this callback now. Remove this
callback from cgroup_subsys.

Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-04 07:25:20 -10:00
Chuyi Zhou 3514309e03 cpuset: Defer flushing of the cpuset_migrate_mm_wq to task_work
Now in cpuset_attach(), we need to synchronously wait for
flush_workqueue to complete. The execution time of flushing
cpuset_migrate_mm_wq depends on the amount of mm migration initiated by
cpusets at that time. When the cpuset.mems of a cgroup occupying a large
amount of memory is modified, it may trigger extensive mm migration,
causing cpuset_attach() to block on flush_workqueue for an extended period.
This could be dangerous because cpuset_attach() is within the critical
section of cgroup_mutex, which may ultimately cause all cgroup-related
operations in the system to be blocked.

This patch attempts to defer the flush_workqueue() operation until
returning to userspace using task_work, as originally proposed by
Tejun [1], so that the flush happens after cgroup_mutex is dropped. That
way we maintain the operation's synchronicity while avoiding bothering
anyone else.

[1]: https://lore.kernel.org/cgroups/ZgMFPMjZRZCsq9Q-@slm.duckdns.org/T/#m117f606fa24f66f0823a60f211b36f24bd9e1883

Originally-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-04 07:22:38 -10:00
Chuyi Zhou c0fb16ef88 cpuset: Don't always flush cpuset_migrate_mm_wq in cpuset_write_resmask
It is unnecessary to always wait for the flush operation of
cpuset_migrate_mm_wq to complete in cpuset_write_resmask, as modifying
cpuset.cpus or cpuset.exclusive does not trigger mm migrations. The
flush_workqueue() only needs to be executed when cpuset.mems is modified.

Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-04 07:15:30 -10:00
Zqiang cda2b2d647 workqueue: Remove rcu_read_lock/unlock() in wq_watchdog_timer_fn()
wq_watchdog_timer_fn() is executed in softirq context, which is already
an RCU read-side critical section. This commit therefore removes the
rcu_read_lock/unlock() calls in wq_watchdog_timer_fn().

Signed-off-by: Zqiang <qiang.zhang@linux.dev>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-04 06:18:00 -10:00
Zqiang fd5081f4ef workqueue: Remove redundant rcu_read_lock/unlock() in workqueue_congested()
preempt_disable/enable() already forms an RCU read-side critical
section. This commit therefore removes the rcu_read_lock/unlock() calls
in workqueue_congested().

Signed-off-by: Zqiang <qiang.zhang@linux.dev>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-04 06:17:52 -10:00
Rong Tao 19559e8441 bpf: add bpf_strcasecmp kfunc
The bpf_strcasecmp() function performs the same comparison as bpf_strcmp(),
except that it ignores the case of the characters.

Signed-off-by: Rong Tao <rongtao@cestc.cn>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Viktor Malik <vmalik@redhat.com>
Link: https://lore.kernel.org/r/tencent_292BD3682A628581AA904996D8E59F4ACD06@qq.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-04 09:00:57 -07:00
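
The semantics described above, a strcmp-style result that ignores character
case, can be illustrated with a small standalone C sketch; this is only a
model of the behaviour, not the kfunc implementation:

    #include <ctype.h>
    #include <stdio.h>

    /* Case-insensitive compare with a strcmp-like return value. */
    static int demo_strcasecmp(const char *s1, const char *s2)
    {
            unsigned char c1, c2;

            do {
                    c1 = tolower((unsigned char)*s1++);
                    c2 = tolower((unsigned char)*s2++);
            } while (c1 && c1 == c2);

            return c1 - c2;
    }

    int main(void)
    {
            printf("%d\n", demo_strcasecmp("BPF", "bpf"));     /* 0 */
            printf("%d\n", demo_strcasecmp("abc", "abd") < 0); /* 1 */
            return 0;
    }
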
Eric Dumazet 2aef21a6a6 audit: init ab->skb_list earlier in audit_buffer_alloc()
syzbot found a bug in audit_buffer_alloc() if nlmsg_new() returns NULL.

We need to initialize ab->skb_list before calling audit_buffer_free()
which will use both the skb_list spinlock and list pointers.

Fixes: eb59d494ee ("audit: add record for multiple task security contexts")
Reported-by: syzbot+bb185b018a51f8d91fd2@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/lkml/68b93e3c.a00a0220.eb3d.0000.GAE@google.com/T/#u
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Eric Paris <eparis@redhat.com>
Cc: audit@vger.kernel.org
Signed-off-by: Paul Moore <paul@paul-moore.com>
2025-09-04 11:06:33 -04:00
Thomas Weißschuh ea1a1fa919 time: Build generic update_vsyscall() only with generic time vDSO
The generic vDSO can be used without the time-related functionality.
In that case the generic update_vsyscall() from kernel/time/vsyscall.c
should not be built.

Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250826-vdso-cleanups-v1-5-d9b65750e49f@linutronix.de
2025-09-04 11:23:50 +02:00
Gatien Chevallier 96c88268b7 time: export timespec64_add_safe() symbol
Export the timespec64_add_safe() symbol so that this function can be used
in modules where time-related computations are done.

Signed-off-by: Gatien Chevallier <gatien.chevallier@foss.st.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/20250901-relative_flex_pps-v4-1-b874971dfe85@foss.st.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-03 16:51:08 -07:00
Christian Loehle 5c48d88fe0 sched_ext: deprecation warn for scx_bpf_cpu_rq()
scx_bpf_cpu_rq() works on an unlocked rq which generally isn't safe.
For the common use-cases scx_bpf_locked_rq() and
scx_bpf_cpu_curr() work, so add a deprecation warning
to scx_bpf_cpu_rq() so it can eventually be removed.

Signed-off-by: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-03 11:51:57 -10:00
Christian Loehle 20b158094a sched_ext: Introduce scx_bpf_cpu_curr()
Provide scx_bpf_cpu_curr() as a way for scx schedulers to check the curr
task of a remote rq without assuming its lock is held.

Many scx schedulers make use of scx_bpf_cpu_rq() to check a remote curr
(e.g. to see if it should be preempted). This is problematic because
scx_bpf_cpu_rq() provides access to all fields of struct rq, most of
which aren't safe to use without holding the associated rq lock.

Signed-off-by: Christian Loehle <christian.loehle@arm.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-03 11:50:42 -10:00
Christian Loehle e0ca169638 sched_ext: Introduce scx_bpf_locked_rq()
Most fields in scx_bpf_cpu_rq() assume that its rq_lock is held;
they become meaningless without the rq lock, too.
Make a safer version of scx_bpf_cpu_rq() that only returns an rq
if we hold the rq lock of that rq.

Also mark the new scx_bpf_locked_rq() as possibly returning NULL, as
scx_bpf_cpu_rq() should have been, too.

Signed-off-by: Christian Loehle <christian.loehle@arm.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-03 11:50:36 -10:00
Tejun Heo a5bd6ba30b sched_ext: Use cgroup_lock/unlock() to synchronize against cgroup operations
SCX hooks into CPU cgroup controller operations and read-locks
scx_cgroup_rwsem to exclude them while enabling and disabling schedulers.
While this works, it's unnecessarily complicated given that
cgroup_[un]lock() are available and thus the cgroup operations can be locked
out that way.

Drop scx_cgroup_rwsem locking from the tg on/offline and cgroup [can_]attach
operations. Instead, grab cgroup_lock() from scx_cgroup_lock(). Drop
scx_cgroup_finish_attach() which is no longer necessary. Drop the now
unnecessary rcu locking and css ref bumping in scx_cgroup_init() and
scx_cgroup_exit().

As scx_cgroup_set_weight/bandwidth() paths aren't protected by
cgroup_lock(), rename scx_cgroup_rwsem to scx_cgroup_ops_rwsem and retain
the locking there.

This is overall simpler and will also allow enable/disable paths to
synchronize against cgroup changes independent of the CPU controller.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-09-03 11:36:07 -10:00
Tejun Heo bcb7c23056 sched_ext: Put event_stats_cpu in struct scx_sched_pcpu
scx_sched.event_stats_cpu is the set of per-CPU counters used to track
stats. Introduce struct scx_sched_pcpu and move the counters inside. This
will ease adding more per-cpu fields. No functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-09-03 11:33:28 -10:00
Tejun Heo 0c2b8356e4 sched_ext: Move internal type and accessor definitions to ext_internal.h
There currently isn't a place to put SCX-internal types and accessors to
be shared between ext.c and ext_idle.c. Create kernel/sched/ext_internal.h
and move internal type and accessor definitions there. This trims ext.c a
bit and makes future additions easier. Pure code reorganization. No
functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-09-03 11:33:28 -10:00
Tejun Heo 4a1d9d73aa sched_ext: Keep bypass on between enable failure and scx_disable_workfn()
scx_enable() turns on the bypass mode while enable is in progress. If
enabling fails, it turns off the bypass mode and then triggers scx_error().
scx_error() will trigger scx_disable_workfn() which will turn on the bypass
mode again and unload the failed scheduler.

This moves the system out of bypass mode between the enable error path and
the disable path, which is unnecessary and can be brittle - e.g. the thread
running scx_enable() may already be on the failed scheduler and can be
switched out before it triggers scx_error() leading to a stall. The watchdog
would eventually kick in, so the situation isn't critical but is still
suboptimal.

There is nothing to be gained by turning off the bypass mode between
scx_enable() failure and scx_disable_workfn(). Keep bypass on.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-09-03 11:33:28 -10:00
Tejun Heo b7975c4869 sched_ext: Make explicit scx_task_iter_relock() calls unnecessary
During tasks iteration, the locks can be dropped using
scx_task_iter_unlock() to perform e.g. sleepable allocations. Afterwards,
scx_task_iter_relock() has to be called prior to other iteration operations,
which is error-prone. This can be easily automated by tracking whether
scx_tasks_lock is held in scx_task_iter and re-acquiring when necessary. It
already tracks whether the task's rq is locked after all.

- Add scx_task_iter->list_locked which remembers whether scx_tasks_lock is
  held.

- Rename scx_task_iter->locked to scx_task_iter->locked_task to better
  distinguish it from ->list_locked.

- Replace scx_task_iter_relock() with __scx_task_iter_maybe_relock() which
  is automatically called by scx_task_iter_next() and scx_task_iter_stop().

- Drop explicit scx_task_iter_relock() calls.

The resulting behavior should be equivalent.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-09-03 11:33:28 -10:00
Stanislav Fort 4540f1d23e audit: fix out-of-bounds read in audit_compare_dname_path()
When a watch on dir=/ is combined with an fsnotify event for a
single-character name directly under / (e.g., creating /a), an
out-of-bounds read can occur in audit_compare_dname_path().

The helper parent_len() returns 1 for "/". In audit_compare_dname_path(),
when parentlen equals the full path length (1), the code sets p = path + 1
and pathlen = 1 - 1 = 0. The subsequent loop then dereferences
p[pathlen - 1] (i.e., p[-1]), causing an out-of-bounds read.

Fix this by adding a pathlen > 0 check to the while loop condition
to prevent the out-of-bounds access.

Cc: stable@vger.kernel.org
Fixes: e92eebb0d6 ("audit: fix suffixed '/' filename matching")
Reported-by: Stanislav Fort <disclosure@aisle.com>
Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Signed-off-by: Stanislav Fort <stanislav.fort@aisle.com>
[PM: subject tweak, sign-off email fixes]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2025-09-03 16:46:23 -04:00
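
The out-of-bounds condition and the guard added by the fix can be modelled
in a few lines of standalone C; demo_compare() below is a simplified
stand-in for the kernel helper, not its actual code:

    #include <stdio.h>
    #include <string.h>

    /*
     * With path "/" and parentlen 1, pathlen starts at 0; without the
     * pathlen > 0 guard, p[pathlen - 1] would read p[-1].
     */
    static int demo_compare(const char *dname, const char *path,
                            size_t parentlen)
    {
            const char *p = path + parentlen;
            size_t pathlen = strlen(path) - parentlen;

            while (pathlen > 0 && p[pathlen - 1] == '/')    /* guarded */
                    pathlen--;

            if (pathlen != strlen(dname))
                    return 1;
            return memcmp(p, dname, pathlen) != 0;
    }

    int main(void)
    {
            /* Watch on "/", event for "/a": parent_len("/") is 1. */
            printf("%d\n", demo_compare("a", "/a", 1));  /* 0: match */
            printf("%d\n", demo_compare("a", "/", 1));   /* 1: no OOB read */
            return 0;
    }
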
Waiman Long e117ff1129 cgroup/cpuset: Prevent NULL pointer access in free_tmpmasks()
Commit 5806b3d051 ("cpuset: decouple tmpmasks and cpumasks freeing in
cgroup") separates out the freeing of tmpmasks into a new free_tmpmasks()
helper but removes the NULL pointer check in the process. Unfortunately a
NULL pointer can be passed to free_tmpmasks() in cpuset_handle_hotplug()
if cpuset v1 is active. This can cause a segmentation fault and crash
the kernel.

Fix that by adding the NULL pointer check to free_tmpmasks().

Fixes: 5806b3d051 ("cpuset: decouple tmpmasks and cpumasks freeing in cgroup")
Reported-by: Ashay Jaiswal <quic_ashayj@quicinc.com>
Closes: https://lore.kernel.org/lkml/20250902-cpuset-free-on-condition-v1-1-f46ffab53eac@quicinc.com/
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-03 08:40:11 -10:00
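
The fix boils down to restoring a NULL guard at the top of the helper. A
hedged sketch, assuming tmpmasks carries the usual add/del/new_cpus
cpumasks (field names are representative, not verbatim cpuset code):

    #include <linux/cpumask.h>

    struct demo_tmpmasks {
            cpumask_var_t addmask;
            cpumask_var_t delmask;
            cpumask_var_t new_cpus;
    };

    static void demo_free_tmpmasks(struct demo_tmpmasks *tmp)
    {
            /* cpuset_handle_hotplug() may pass NULL when cpuset v1 is active */
            if (!tmp)
                    return;

            free_cpumask_var(tmp->addmask);
            free_cpumask_var(tmp->delmask);
            free_cpumask_var(tmp->new_cpus);
    }
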
Christian Loehle 5ebf512f33 sched: Fix sched_numa_find_nth_cpu() if mask offline
sched_numa_find_nth_cpu() uses a bsearch to look for the 'closest'
CPU in sched_domains_numa_masks and the given cpus mask. However, they
might not intersect if all CPUs in the cpus mask are offline. bsearch
will return NULL in that case; bail out instead of dereferencing a
bogus pointer.

The previous behaviour led to this bug when using maxcpus=4 on an
rk3399 (LLLLbb) (i.e. booting with all big CPUs offline):

[    1.422922] Unable to handle kernel paging request at virtual address ffffff8000000000
[    1.423635] Mem abort info:
[    1.423889]   ESR = 0x0000000096000006
[    1.424227]   EC = 0x25: DABT (current EL), IL = 32 bits
[    1.424715]   SET = 0, FnV = 0
[    1.424995]   EA = 0, S1PTW = 0
[    1.425279]   FSC = 0x06: level 2 translation fault
[    1.425735] Data abort info:
[    1.425998]   ISV = 0, ISS = 0x00000006, ISS2 = 0x00000000
[    1.426499]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[    1.426952]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[    1.427428] swapper pgtable: 4k pages, 39-bit VAs, pgdp=0000000004a9f000
[    1.428038] [ffffff8000000000] pgd=18000000f7fff403, p4d=18000000f7fff403, pud=18000000f7fff403, pmd=0000000000000000
[    1.429014] Internal error: Oops: 0000000096000006 [#1]  SMP
[    1.429525] Modules linked in:
[    1.429813] CPU: 3 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.17.0-rc4-dirty #343 PREEMPT
[    1.430559] Hardware name: Pine64 RockPro64 v2.1 (DT)
[    1.431012] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[    1.431634] pc : sched_numa_find_nth_cpu+0x2a0/0x488
[    1.432094] lr : sched_numa_find_nth_cpu+0x284/0x488
[    1.432543] sp : ffffffc084e1b960
[    1.432843] x29: ffffffc084e1b960 x28: ffffff80078a8800 x27: ffffffc0846eb1d0
[    1.433495] x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000
[    1.434144] x23: 0000000000000000 x22: fffffffffff7f093 x21: ffffffc081de6378
[    1.434792] x20: 0000000000000000 x19: 0000000ffff7f093 x18: 00000000ffffffff
[    1.435441] x17: 3030303866666666 x16: 66663d736b73616d x15: ffffffc104e1b5b7
[    1.436091] x14: 0000000000000000 x13: ffffffc084712860 x12: 0000000000000372
[    1.436739] x11: 0000000000000126 x10: ffffffc08476a860 x9 : ffffffc084712860
[    1.437389] x8 : 00000000ffffefff x7 : ffffffc08476a860 x6 : 0000000000000000
[    1.438036] x5 : 000000000000bff4 x4 : 0000000000000000 x3 : 0000000000000000
[    1.438683] x2 : 0000000000000000 x1 : ffffffc0846eb000 x0 : ffffff8000407b68
[    1.439332] Call trace:
[    1.439559]  sched_numa_find_nth_cpu+0x2a0/0x488 (P)
[    1.440016]  smp_call_function_any+0xc8/0xd0
[    1.440416]  armv8_pmu_init+0x58/0x27c
[    1.440770]  armv8_cortex_a72_pmu_init+0x20/0x2c
[    1.441199]  arm_pmu_device_probe+0x1e4/0x5e8
[    1.441603]  armv8_pmu_device_probe+0x1c/0x28
[    1.442007]  platform_probe+0x5c/0xac
[    1.442347]  really_probe+0xbc/0x298
[    1.442683]  __driver_probe_device+0x78/0x12c
[    1.443087]  driver_probe_device+0xdc/0x160
[    1.443475]  __driver_attach+0x94/0x19c
[    1.443833]  bus_for_each_dev+0x74/0xd4
[    1.444190]  driver_attach+0x24/0x30
[    1.444525]  bus_add_driver+0xe4/0x208
[    1.444874]  driver_register+0x60/0x128
[    1.445233]  __platform_driver_register+0x24/0x30
[    1.445662]  armv8_pmu_driver_init+0x28/0x4c
[    1.446059]  do_one_initcall+0x44/0x25c
[    1.446416]  kernel_init_freeable+0x1dc/0x3bc
[    1.446820]  kernel_init+0x20/0x1d8
[    1.447151]  ret_from_fork+0x10/0x20
[    1.447493] Code: 90022e21 f000e5f5 910de2b5 2a1703e2 (f8767803)
[    1.448040] ---[ end trace 0000000000000000 ]---
[    1.448483] note: swapper/0[1] exited with preempt_count 1
[    1.449047] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[    1.449741] SMP: stopping secondary CPUs
[    1.450105] Kernel Offset: disabled
[    1.450419] CPU features: 0x000000,00080000,20002001,0400421b
[    1.450935] Memory Limit: none
[    1.451217] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]---

Yury: with the fix, the function returns cpu == nr_cpu_ids, and later in

	smp_call_function_any ->
	  smp_call_function_single ->
	     generic_exec_single

we test the cpu for '>= nr_cpu_ids' and return -ENXIO. So everything is
handled correctly.

Fixes: cd7f55359c ("sched: add sched_numa_find_nth_cpu()")
Cc: stable@vger.kernel.org
Signed-off-by: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Yury Norov (NVIDIA) <yury.norov@gmail.com>
2025-09-03 12:20:06 -04:00
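
The defensive pattern the fix applies, treating a NULL bsearch() result as
"no suitable CPU" instead of dereferencing it, shown with a standalone C
illustration over made-up data:

    #include <stdio.h>
    #include <stdlib.h>

    static int cmp_int(const void *key, const void *elem)
    {
            return *(const int *)key - *(const int *)elem;
    }

    int main(void)
    {
            int online[] = { 0, 1, 2, 3 };  /* stand-in for an online mask */
            int wanted = 5;                 /* CPU that is offline */
            int *found = bsearch(&wanted, online, 4, sizeof(int), cmp_int);

            if (!found) {           /* the fix: bail out, don't use *found */
                    printf("no matching CPU\n");
                    return 0;
            }
            printf("found CPU %d\n", *found);
            return 0;
    }
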
Brian Norris 8ad25ebfa7 genirq/test: Ensure CPU 1 is online for hotplug test
It's possible to run these tests on platforms that think they have a
hotpluggable CPU1, but for whatever reason, CPU1 is not online and can't be
brought online:

    # irq_cpuhotplug_test: EXPECTATION FAILED at kernel/irq/irq_test.c:210
    Expected remove_cpu(1) == 0, but
        remove_cpu(1) == 1 (0x1)
CPU1: failed to boot: -38
    # irq_cpuhotplug_test: EXPECTATION FAILED at kernel/irq/irq_test.c:214
    Expected add_cpu(1) == 0, but
        add_cpu(1) == -38 (0xffffffffffffffda)

Check that CPU1 is actually online before trying to run the test.

Fixes: 66067c3c8a ("genirq: Add kunit tests for depth counts")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Brian Norris <briannorris@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: David Gow <davidgow@google.com>
Link: https://lore.kernel.org/all/20250822190140.2154646-7-briannorris@chromium.org
2025-09-03 17:04:52 +02:00
Brian Norris add03fdb9d genirq/test: Drop CONFIG_GENERIC_IRQ_MIGRATION assumptions
Not all platforms use the generic IRQ migration code, even if they select
GENERIC_IRQ_MIGRATION. (See, for example, powerpc / pseries_cpu_disable().)

If such platforms don't perform managed shutdown the same way, the interrupt
may not actually shut down, and these tests fail:

[    4.357022][  T101]     # irq_cpuhotplug_test: EXPECTATION FAILED at kernel/irq/irq_test.c:211
[    4.357022][  T101]     Expected irqd_is_activated(data) to be false, but is true
[    4.358128][  T101]     # irq_cpuhotplug_test: EXPECTATION FAILED at kernel/irq/irq_test.c:212
[    4.358128][  T101]     Expected irqd_is_started(data) to be false, but is true
[    4.375558][  T101]     # irq_cpuhotplug_test: EXPECTATION FAILED at kernel/irq/irq_test.c:216
[    4.375558][  T101]     Expected irqd_is_activated(data) to be false, but is true
[    4.376088][  T101]     # irq_cpuhotplug_test: EXPECTATION FAILED at kernel/irq/irq_test.c:217
[    4.376088][  T101]     Expected irqd_is_started(data) to be false, but is true
[    4.377851][    T1]     # irq_cpuhotplug_test: pass:0 fail:1 skip:0 total:1
[    4.377901][    T1]     not ok 4 irq_cpuhotplug_test
[    4.378073][    T1] # irq_test_cases: pass:3 fail:1 skip:0 total:4

Rather than test that PowerPC performs migration the same way as the
interrupt core, just drop the state checks. The point of the test was to
ensure that the code kept |depth| balanced, which can still be tested for.

Fixes: 66067c3c8a ("genirq: Add kunit tests for depth counts")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Brian Norris <briannorris@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: David Gow <davidgow@google.com>
Link: https://lore.kernel.org/all/20250822190140.2154646-6-briannorris@chromium.org
2025-09-03 17:04:52 +02:00
Brian Norris 0c888bc86d genirq/test: Depend on SPARSE_IRQ
Some architectures have a static interrupt layout, with a limited number of
interrupts. Without SPARSE_IRQ, the test may not be able to allocate any
fake interrupts, and the test will fail. (This occurs on ARCH=m68k, for
example.)

Additionally, managed-affinity is only supported with CONFIG_SPARSE_IRQ=y,
so irq_shutdown_depth_test() and irq_cpuhotplug_test() would fail without
it.

Add a 'SPARSE_IRQ' dependency to avoid these problems.

Many architectures 'select SPARSE_IRQ', so this is easy to miss.

Notably, this also excludes ARCH=um from running any of these tests, even
though some of them might work.

Fixes: 66067c3c8a ("genirq: Add kunit tests for depth counts")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Brian Norris <briannorris@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: David Gow <davidgow@google.com>
Link: https://lore.kernel.org/all/20250822190140.2154646-5-briannorris@chromium.org
2025-09-03 17:04:52 +02:00
Brian Norris 988f45467f genirq/test: Fail early if interrupt request fails
Requesting an interrupt is part of the basic test setup. If it fails, most
of the subsequent tests are likely to fail, and the output gets noisy.

Use "assert" to fail early.

Signed-off-by: Brian Norris <briannorris@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: David Gow <davidgow@google.com>
Link: https://lore.kernel.org/all/20250822190140.2154646-4-briannorris@chromium.org
2025-09-03 17:04:52 +02:00
Brian Norris 59405c248a genirq/test: Factor out fake-virq setup
A few things need to be repeated in tests. Factor out the creation of fake
interrupts.

Signed-off-by: Brian Norris <briannorris@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: David Gow <davidgow@google.com>
Link: https://lore.kernel.org/all/20250822190140.2154646-3-briannorris@chromium.org
2025-09-03 17:04:52 +02:00
Brian Norris f8a44f9bab genirq/test: Select IRQ_DOMAIN
These tests use irq_domain_alloc_descs() and so require CONFIG_IRQ_DOMAIN.

Fixes: 66067c3c8a ("genirq: Add kunit tests for depth counts")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Brian Norris <briannorris@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: David Gow <davidgow@google.com>
Link: https://lore.kernel.org/all/20250822190140.2154646-2-briannorris@chromium.org
Closes: https://lore.kernel.org/lkml/ded44edf-eeb7-420c-b8a8-d6543b955e6e@roeck-us.net/
2025-09-03 17:04:52 +02:00
David Gow c9163915a9 genirq/test: Fix depth tests on architectures with NOREQUEST by default.
The new irq KUnit tests fail on some architectures (notably PowerPC and
32-bit ARM), as the request_irq() call fails due to the ARCH_IRQ_INIT_FLAGS
containing IRQ_NOREQUEST, yielding the following errors:

[10:17:45]     # irq_free_disabled_test: EXPECTATION FAILED at kernel/irq/irq_test.c:88
[10:17:45]     Expected ret == 0, but
[10:17:45]         ret == -22 (0xffffffffffffffea)
[10:17:45]     # irq_free_disabled_test: EXPECTATION FAILED at kernel/irq/irq_test.c:90
[10:17:45]     Expected desc->depth == 0, but
[10:17:45]         desc->depth == 1 (0x1)
[10:17:45]     # irq_free_disabled_test: EXPECTATION FAILED at kernel/irq/irq_test.c:93
[10:17:45]     Expected desc->depth == 1, but
[10:17:45]         desc->depth == 2 (0x2)

By clearing IRQ_NOREQUEST from the interrupt descriptor, these tests now
pass on ARM and PowerPC.

Fixes: 66067c3c8a ("genirq: Add kunit tests for depth counts")
Signed-off-by: David Gow <davidgow@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Tested-by: Brian Norris <briannorris@chromium.org>
Reviewed-by: Brian Norris <briannorris@chromium.org>
Link: https://lore.kernel.org/all/20250816094528.3560222-2-davidgow@google.com
2025-09-03 17:04:51 +02:00
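
A hedged sketch of the approach the fix takes: clear IRQ_NOREQUEST on the
freshly allocated test interrupt before requesting it. The scaffolding
below is illustrative, not the actual kunit test code:

    #include <linux/interrupt.h>
    #include <linux/irq.h>

    static irqreturn_t noop_handler(int irq, void *dev_id)
    {
            return IRQ_HANDLED;
    }

    static int demo_request_test_irq(unsigned int virq)
    {
            /*
             * Architectures with IRQ_NOREQUEST in ARCH_IRQ_INIT_FLAGS reject
             * request_irq() on a bare descriptor, so clear the flag first.
             * The caller is expected to free_irq() the line when done.
             */
            irq_clear_status_flags(virq, IRQ_NOREQUEST);

            return request_irq(virq, noop_handler, 0, "irq-kunit-demo", NULL);
    }
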
Wladislav Wiebe 673f1244b3 genirq: Add support for warning on long-running interrupt handlers
Introduce a mechanism to detect and warn about prolonged interrupt handlers.
With a new command-line parameter (irqhandler.duration_warn_us=), users can
configure the duration threshold in microseconds above which a warning in
the following format is emitted:

"[CPU14] long duration of IRQ[159:bad_irq_handler [long_irq]], took: 1330 us"

The implementation uses local_clock() to measure the execution duration of the
generic IRQ per-CPU event handler.

Signed-off-by: Wladislav Wiebe <wladislav.wiebe@nokia.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jiri Slaby <jirislaby@kernel.org>
Link: https://lore.kernel.org/all/20250804093525.851-1-wladislav.wiebe@nokia.com
2025-09-03 16:10:40 +02:00
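
Conceptually the mechanism is a pair of local_clock() timestamps around the
per-CPU handler invocation, compared against the boot-time threshold (e.g.
irqhandler.duration_warn_us=1000 for a 1 ms limit). A hedged sketch of that
idea follows; it is not the actual genirq code:

    #include <linux/math64.h>
    #include <linux/printk.h>
    #include <linux/sched/clock.h>
    #include <linux/smp.h>
    #include <linux/time64.h>
    #include <linux/types.h>

    /* warn_us corresponds to the irqhandler.duration_warn_us= threshold */
    static void demo_timed_handler_call(void (*handler)(void *), void *arg,
                                        u64 warn_us)
    {
            u64 ts = local_clock();         /* nanoseconds */

            handler(arg);

            if (warn_us) {
                    u64 took_us = div_u64(local_clock() - ts, NSEC_PER_USEC);

                    if (took_us > warn_us)
                            pr_warn("CPU%d: long duration IRQ handler, took: %llu us\n",
                                    smp_processor_id(),
                                    (unsigned long long)took_us);
            }
    }
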
Thomas Weißschuh 762af5a2aa vdso/vsyscall: Avoid slow division loop in auxiliary clock update
The call to __iter_div_u64_rem() in vdso_time_update_aux() is a wrapper
around subtraction. It cannot be used to divide large numbers, as that
introduces long, computationally expensive delays.  A regular u64 division
is also not possible in the timekeeper update path as it can be too slow.

Instead of splitting the ktime_t offset into second and subsecond
components during the timekeeper update fast-path, do it together with the
adjustment of tk->offs_aux in the slow-path. This is equivalent to the
handling of offs_boot and monotonic_to_boot.

Reuse the storage of monotonic_to_boot for the new field, as it is not used
by auxiliary timekeepers.

Fixes: 380b84e168 ("vdso/vsyscall: Update auxiliary clock data in the datapage")
Reported-by: Miroslav Lichvar <mlichvar@redhat.com>
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250825-vdso-auxclock-division-v1-1-a1d32a16a313@linutronix.de
Closes: https://lore.kernel.org/lkml/aKwsNNWsHJg8IKzj@localhost/
2025-09-03 11:55:11 +02:00
Kan Liang 18dbcbfabf perf: Fix the POLL_HUP delivery breakage
The event_limit can be set via PERF_EVENT_IOC_REFRESH to limit the
number of events. When the event_limit reaches 0, the POLL_HUP signal
should be sent, but it is currently missing.

The corresponding counter should be stopped when the event_limit reaches
0. It was implemented in the ARCH-specific code. However, since the
commit 9734e25fbf ("perf: Fix the throttle logic for a group"), all
the ARCH-specific code has been moved to the generic code. The code to
handle the event_limit was lost.

Add the event->pmu->stop(event, 0); back.

Fixes: 9734e25fbf ("perf: Fix the throttle logic for a group")
Closes: https://lore.kernel.org/lkml/aICYAqM5EQUlTqtX@li-2b55cdcc-350b-11b2-a85c-a78bff51fc11.ibm.com/
Reported-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Link: https://lkml.kernel.org/r/20250811182644.1305952-1-kan.liang@linux.intel.com
2025-09-03 10:10:59 +02:00
Aaron Lu 5b726e9bf9 sched/fair: Get rid of throttled_lb_pair()
Now that throttled tasks are dequeued and can not stay on rq's cfs_tasks
list, there is no need to take special care of these throttled tasks
anymore in load balance.

Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Matteo Martelli <matteo.martelli@codethink.co.uk>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20250829081120.806-6-ziqianlu@bytedance.com
2025-09-03 10:03:14 +02:00
Aaron Lu eb962f251f sched/fair: Task based throttle time accounting
With the task based throttle model, the previous way of checking cfs_rq's
nr_queued to decide if throttled time should be accounted doesn't work
as expected, e.g. when a cfs_rq which has a single task is throttled,
that task could later block in kernel mode instead of being dequeued on
the limbo list, and accounting this as throttled time is not accurate.

Rework throttle time accounting for a cfs_rq as follows:
- start accounting when the first task gets throttled in its hierarchy;
- stop accounting on unthrottle.

Note that there will be a time gap between when a cfs_rq is throttled
and when a task in its hierarchy is actually throttled. This accounting
mechanism only starts accounting in the latter case.

Suggested-by: Chengming Zhou <chengming.zhou@linux.dev> # accounting mechanism
Co-developed-by: K Prateek Nayak <kprateek.nayak@amd.com> # simplify implementation
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Matteo Martelli <matteo.martelli@codethink.co.uk>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20250829081120.806-5-ziqianlu@bytedance.com
2025-09-03 10:03:14 +02:00
Valentin Schneider e1fad12dcb sched/fair: Switch to task based throttle model
In the current throttle model, when a cfs_rq is throttled, its entity is
dequeued from the cpu's rq, making tasks attached to it unable to run,
thus achieving the throttle target.

This has a drawback though: assume a task is a reader of a percpu_rwsem
and is waiting. When it gets woken, it cannot run till its task group's
next period comes, which can be a relatively long time. The waiting writer
will have to wait longer due to this, and it also makes further readers
build up and eventually triggers a task hung.

To improve this situation, change the throttle model to task based, i.e.
when a cfs_rq is throttled, record its throttled status but do not remove
it from the cpu's rq. Instead, for tasks that belong to this cfs_rq, when
they get picked, add a task work to them so that when they return
to userspace, they can be dequeued there. In this way, throttled tasks will
not hold any kernel resources. And on unthrottle, enqueue back those
tasks so they can continue to run.

Throttled cfs_rq's PELT clock is handled differently now: previously the
cfs_rq's PELT clock is stopped once it entered throttled state but since
now tasks(in kernel mode) can continue to run, change the behaviour to
stop PELT clock when the throttled cfs_rq has no tasks left.

Suggested-by: Chengming Zhou <chengming.zhou@linux.dev> # tag on pick
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Chen Yu <yu.c.chen@intel.com>
Tested-by: Matteo Martelli <matteo.martelli@codethink.co.uk>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20250829081120.806-4-ziqianlu@bytedance.com
2025-09-03 10:03:14 +02:00
Valentin Schneider 7fc2d14392 sched/fair: Implement throttle task work and related helpers
Implement throttle_cfs_rq_work() task work which gets executed on task's
ret2user path where the task is dequeued and marked as throttled.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Matteo Martelli <matteo.martelli@codethink.co.uk>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20250829081120.806-3-ziqianlu@bytedance.com
2025-09-03 10:03:13 +02:00
Valentin Schneider 2cd571245b sched/fair: Add related data structure for task based throttle
Add related data structures for this new throttle functionality.

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Matteo Martelli <matteo.martelli@codethink.co.uk>
Link: https://lore.kernel.org/r/20250829081120.806-2-ziqianlu@bytedance.com
2025-09-03 10:03:13 +02:00
Peter Zijlstra 91c614f09a sched: Move SDTL_INIT() functions out-of-line
Since all these functions are address-taken in SDTL_INIT() and called
indirectly, it doesn't really make sense for them to be inline.

Suggested-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2025-09-03 10:03:13 +02:00
Peter Zijlstra 661f951e37 sched/fair: Get rid of sched_domains_curr_level hack for tl->cpumask()
Leon [1] and Vinicius [2] noted a topology_span_sane() warning during
their testing starting from v6.16-rc1. Debug that followed pointed to
the tl->mask() for the NODE domain being incorrectly resolved to that of
the highest NUMA domain.

tl->mask() for NODE is set to the sd_numa_mask() which depends on the
global "sched_domains_curr_level" hack. "sched_domains_curr_level" is
set to the "tl->numa_level" during tl traversal in build_sched_domains()
calling sd_init() but was not reset before topology_span_sane().

Since "tl->numa_level" still reflected the old value from
build_sched_domains(), topology_span_sane() for the NODE domain trips
when the span of the last NUMA domain overlaps.

Instead of replicating the "sched_domains_curr_level" hack, get rid of
it entirely and instead, pass the entire "sched_domain_topology_level"
object to tl->cpumask() function to prevent such mishap in the future.

sd_numa_mask() now directly references "tl->numa_level" instead of
relying on the global "sched_domains_curr_level" hack to index into
sched_domains_numa_masks[].

The original warning was reproducible on the following NUMA topology
reported by Leon:

    $ sudo numactl -H
    available: 5 nodes (0-4)
    node 0 cpus: 0 1
    node 0 size: 2927 MB
    node 0 free: 1603 MB
    node 1 cpus: 2 3
    node 1 size: 3023 MB
    node 1 free: 3008 MB
    node 2 cpus: 4 5
    node 2 size: 3023 MB
    node 2 free: 3007 MB
    node 3 cpus: 6 7
    node 3 size: 3023 MB
    node 3 free: 3002 MB
    node 4 cpus: 8 9
    node 4 size: 3022 MB
    node 4 free: 2718 MB
    node distances:
    node   0   1   2   3   4
      0:  10  39  38  37  36
      1:  39  10  38  37  36
      2:  38  38  10  37  36
      3:  37  37  37  10  36
      4:  36  36  36  36  10

The above topology can be mimicked using the following QEMU cmd that was
used to reproduce the warning and test the fix:

     sudo qemu-system-x86_64 -enable-kvm -cpu host \
     -m 20G -smp cpus=10,sockets=10 -machine q35 \
     -object memory-backend-ram,size=4G,id=m0 \
     -object memory-backend-ram,size=4G,id=m1 \
     -object memory-backend-ram,size=4G,id=m2 \
     -object memory-backend-ram,size=4G,id=m3 \
     -object memory-backend-ram,size=4G,id=m4 \
     -numa node,cpus=0-1,memdev=m0,nodeid=0 \
     -numa node,cpus=2-3,memdev=m1,nodeid=1 \
     -numa node,cpus=4-5,memdev=m2,nodeid=2 \
     -numa node,cpus=6-7,memdev=m3,nodeid=3 \
     -numa node,cpus=8-9,memdev=m4,nodeid=4 \
     -numa dist,src=0,dst=1,val=39 \
     -numa dist,src=0,dst=2,val=38 \
     -numa dist,src=0,dst=3,val=37 \
     -numa dist,src=0,dst=4,val=36 \
     -numa dist,src=1,dst=0,val=39 \
     -numa dist,src=1,dst=2,val=38 \
     -numa dist,src=1,dst=3,val=37 \
     -numa dist,src=1,dst=4,val=36 \
     -numa dist,src=2,dst=0,val=38 \
     -numa dist,src=2,dst=1,val=38 \
     -numa dist,src=2,dst=3,val=37 \
     -numa dist,src=2,dst=4,val=36 \
     -numa dist,src=3,dst=0,val=37 \
     -numa dist,src=3,dst=1,val=37 \
     -numa dist,src=3,dst=2,val=37 \
     -numa dist,src=3,dst=4,val=36 \
     -numa dist,src=4,dst=0,val=36 \
     -numa dist,src=4,dst=1,val=36 \
     -numa dist,src=4,dst=2,val=36 \
     -numa dist,src=4,dst=3,val=36 \
     ...

  [ prateek: Moved common functions to include/linux/sched/topology.h,
    reuse the common bits for s390 and ppc, commit message ]

Closes: https://lore.kernel.org/lkml/20250610110701.GA256154@unreal/ [1]
Fixes: ccf74128d6 ("sched/topology: Assert non-NUMA topology masks don't (partially) overlap") # ce29a7da84, f55dac1daf
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reported-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Tested-by: Valentin Schneider <vschneid@redhat.com> # x86
Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com> # powerpc
Link: https://lore.kernel.org/lkml/a3de98387abad28592e6ab591f3ff6107fe01dc1.1755893468.git.tim.c.chen@linux.intel.com/ [2]
2025-09-03 10:03:12 +02:00
Harshit Agarwal 8fd5485fb4 sched/deadline: Fix race in push_dl_task()
When a CPU chooses to call push_dl_task and picks a task to push to
another CPU's runqueue, it will call the find_lock_later_rq method,
which takes a double lock on both CPUs' runqueues. If one of the
locks isn't readily available, it may lead to dropping the current
runqueue lock and reacquiring both the locks at once. During this window
it is possible that the task is already migrated and is running on some
other CPU. These cases are already handled. However, if the task is
migrated and has already been executed and another CPU is now trying to
wake it up (ttwu) such that it is queued again on the runqueue
(on_rq is 1), and also if the task was run by the same CPU, then the
current checks will pass even though the task was migrated out and is no
longer in the pushable tasks list.
Please go through the original rt change for more details on the issue.

To fix this, after the lock is obtained inside find_lock_later_rq, ensure
that the task is still at the head of the pushable tasks list. Also remove
some checks that are no longer needed with the addition of this new check.
However, the new check of the pushable tasks list only applies when
find_lock_later_rq is called by push_dl_task. For the other caller, i.e.
dl_task_offline_migration, the existing checks are used.

Signed-off-by: Harshit Agarwal <harshit@nutanix.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250408045021.3283624-1-harshit@nutanix.com
2025-09-03 10:03:12 +02:00
Luo Gengkun 3d62ab32df tracing: Fix tracing_marker may trigger page fault during preempt_disable
Both tracing_mark_write and tracing_mark_raw_write call
__copy_from_user_inatomic while preemption is disabled. But in some cases,
__copy_from_user_inatomic may trigger a page fault and subtly end up calling
schedule(). And if the task is migrated to another cpu, the following warning
will be triggered:
        if (RB_WARN_ON(cpu_buffer,
                       !local_read(&cpu_buffer->committing)))

An example can illustrate this issue:

process flow						CPU
---------------------------------------------------------------------

tracing_mark_raw_write():				cpu:0
   ...
   ring_buffer_lock_reserve():				cpu:0
      ...
      cpu = raw_smp_processor_id()			cpu:0
      cpu_buffer = buffer->buffers[cpu]			cpu:0
      ...
   ...
   __copy_from_user_inatomic():				cpu:0
      ...
      # page fault
      do_mem_abort():					cpu:0
         ...
         # Call schedule
         schedule()					cpu:0
	 ...
   # the task schedule to cpu1
   __buffer_unlock_commit():				cpu:1
      ...
      ring_buffer_unlock_commit():			cpu:1
	 ...
	 cpu = raw_smp_processor_id()			cpu:1
	 cpu_buffer = buffer->buffers[cpu]		cpu:1

As shown above, the process reads the cpu id twice and the return values
are not the same.

Fix this problem by using copy_from_user_nofault instead of
__copy_from_user_inatomic, as the former performs an 'access_ok' check
before copying.

Link: https://lore.kernel.org/20250819105152.2766363-1-luogengkun@huaweicloud.com
Fixes: 656c7f0d2d ("tracing: Replace kmap with copy_from_user() in trace_marker writing")
Signed-off-by: Luo Gengkun <luogengkun@huaweicloud.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-09-02 12:02:42 -04:00
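
A hedged sketch of the substitution, with the ring-buffer handling reduced
to a stack buffer so only the copy primitive matters; this is not the
actual trace_marker code:

    #include <linux/errno.h>
    #include <linux/preempt.h>
    #include <linux/types.h>
    #include <linux/uaccess.h>

    static ssize_t demo_mark_write(const char __user *ubuf, size_t cnt)
    {
            char buf[64];
            long ret;

            if (cnt >= sizeof(buf))
                    cnt = sizeof(buf) - 1;

            preempt_disable();
            /*
             * copy_from_user_nofault() checks access_ok() and disables page
             * faults around the copy, so it returns an error instead of
             * faulting the page in, which could sleep and migrate the task.
             */
            ret = copy_from_user_nofault(buf, ubuf, cnt);
            preempt_enable();

            return ret ? -EFAULT : (ssize_t)cnt;
    }
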
Qianfeng Rong 81ac63321e trace: Remove redundant __GFP_NOWARN
Commit 16f5dfbc85 ("gfp: include __GFP_NOWARN in GFP_NOWAIT")
made GFP_NOWAIT implicitly include __GFP_NOWARN.

Therefore, explicit __GFP_NOWARN combined with GFP_NOWAIT
(e.g., `GFP_NOWAIT | __GFP_NOWARN`) is now redundant. Let's clean
up these redundant flags across subsystems.

No functional changes.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250805023630.335719-1-rongqianfeng@vivo.com
Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-09-02 11:48:16 -04:00
Feng Yang e4980fa646 bpf: Replace kvfree with kfree for kzalloc memory
These pointers are allocated by kzalloc. Therefore, replace kvfree() with
kfree() to avoid unnecessary is_vmalloc_addr() check in kvfree(). This is
the remaining unmodified part from [1].

Signed-off-by: Feng Yang <yangfeng@kylinos.cn>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20250811123949.552885-1-rongqianfeng@vivo.com [1]
Link: https://lore.kernel.org/bpf/20250827032812.498216-1-yangfeng59949@163.com
2025-09-02 17:29:52 +02:00
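
The rule being applied is simply that the free primitive should match the
allocator that produced the pointer; a minimal illustration with generic
names:

    #include <linux/mm.h>
    #include <linux/slab.h>

    static void demo_alloc_free(void)
    {
            void *a = kzalloc(128, GFP_KERNEL);             /* slab memory */
            void *b = kvzalloc(16 * PAGE_SIZE, GFP_KERNEL); /* may be vmalloc'ed */

            kfree(a);       /* plain kfree(), no is_vmalloc_addr() detour */
            kvfree(b);      /* kvfree() for kvmalloc-family allocations */
    }
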
Aleksa Sarai 7df8782012
pidns: move is-ancestor logic to helper
This check will be needed in later patches, and there's no point
open-coding it each time.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Link: https://lore.kernel.org/20250805-procfs-pidns-api-v4-1-705f984940e7@cyphar.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-02 11:37:24 +02:00
Baochen Qiang 7e2368a217 dma-debug: don't enforce dma mapping check on noncoherent allocations
As discussed in [1], there is no need to enforce the dma mapping check on
noncoherent allocations; a simple test on the returned CPU address is
good enough.

Add a new pair of debug helpers and use them for noncoherent alloc/free
to fix this issue.

Fixes: efa70f2fdc ("dma-mapping: add a new dma_alloc_pages API")
Link: https://lore.kernel.org/all/ff6c1fe6-820f-4e58-8395-df06aa91706c@oss.qualcomm.com # 1
Signed-off-by: Baochen Qiang <baochen.qiang@oss.qualcomm.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/20250828-dma-debug-fix-noncoherent-dma-check-v1-1-76e9be0dd7fc@oss.qualcomm.com
2025-09-02 10:18:16 +02:00
Simon Schuster edd3cb05c0 copy_process: pass clone_flags as u64 across calltree
With the introduction of clone3 in commit 7f192e3cd3 ("fork: add
clone3") the effective bit width of clone_flags on all architectures was
increased from 32-bit to 64-bit, with a new type of u64 for the flags.
However, for most consumers of clone_flags the interface was not
changed from the previous type of unsigned long.

While this works fine as long as none of the new 64-bit flag bits
(CLONE_CLEAR_SIGHAND and CLONE_INTO_CGROUP) are evaluated, this is still
undesirable in terms of the principle of least surprise.

Thus, this commit fixes all relevant interfaces of callees to
sys_clone3/copy_process (excluding the architecture-specific
copy_thread) to consistently pass clone_flags as u64, so that
no truncation to 32-bit integers occurs on 32-bit architectures.

Signed-off-by: Simon Schuster <schuster.simon@siemens-energy.com>
Link: https://lore.kernel.org/20250901-nios2-implement-clone3-v2-2-53fcf5577d57@siemens-energy.com
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-01 15:31:34 +02:00
Simon Schuster 04ff48239f
copy_sighand: Handle architectures where sizeof(unsigned long) < sizeof(u64)
With the introduction of clone3 in commit 7f192e3cd3 ("fork: add
clone3") the effective bit width of clone_flags on all architectures was
increased from 32-bit to 64-bit. However, the signature of the copy_*
helper functions (e.g., copy_sighand) used by copy_process was not
adapted.

As such, they truncate the flags on any 32-bit architectures that
supports clone3 (arc, arm, csky, m68k, microblaze, mips32, openrisc,
parisc32, powerpc32, riscv32, x86-32 and xtensa).

For copy_sighand with CLONE_CLEAR_SIGHAND being an actual u64
constant, this triggers an observable bug in kernel selftest
clone3_clear_sighand:

        if (clone_flags & CLONE_CLEAR_SIGHAND)

in function copy_sighand within fork.c will always fail given:

        unsigned long /* == uint32_t */ clone_flags
        #define CLONE_CLEAR_SIGHAND 0x100000000ULL

This commit fixes the bug by always passing clone_flags to copy_sighand
via their declared u64 type, invariant of architecture-dependent integer
sizes.

Fixes: b612e5df45 ("clone3: add CLONE_CLEAR_SIGHAND")
Cc: stable@vger.kernel.org # linux-5.5+
Signed-off-by: Simon Schuster <schuster.simon@siemens-energy.com>
Link: https://lore.kernel.org/20250901-nios2-implement-clone3-v2-1-53fcf5577d57@siemens-energy.com
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-01 15:31:33 +02:00
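
The truncation is easy to reproduce outside the kernel; a standalone C
example mirroring the CLONE_CLEAR_SIGHAND value quoted above (the macro is
prefixed to make clear it is a demo constant):

    #include <stdint.h>
    #include <stdio.h>

    #define DEMO_CLONE_CLEAR_SIGHAND 0x100000000ULL

    int main(void)
    {
            uint64_t flags64 = DEMO_CLONE_CLEAR_SIGHAND;
            uint32_t flags32 = (uint32_t)flags64;   /* 32-bit unsigned long */

            /* 1: the bit survives in a 64-bit flags word */
            printf("%d\n", (flags64 & DEMO_CLONE_CLEAR_SIGHAND) != 0);
            /* 0: the bit is silently lost after truncation */
            printf("%d\n", (flags32 & DEMO_CLONE_CLEAR_SIGHAND) != 0);
            return 0;
    }
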
Li Jun 98da8a4aec PM: hibernate: Fix typo in memory bitmaps description comment
Correct 'leave' to 'leaf' in memory bitmaps description comment.

Signed-off-by: Li Jun <lijun01@kylinos.cn>
Link: https://patch.msgid.link/20250819104038.1596952-1-lijun01@kylinos.cn
[ rjw: Subject and changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-09-01 11:55:53 +02:00
Qianfeng Rong 5545d56fd1 PM: hibernate: Use vmalloc_array() and vcalloc() to improve code
Remove array_size() calls and replace vmalloc() and vzalloc() with
vmalloc_array() and vcalloc() respectively to simplify the code in
save_compressed_image() and load_compressed_image().

vmalloc_array() is also optimized better, resulting in fewer
instructions being used, and its handling overhead is
lower [1].

Link: https://lore.kernel.org/lkml/abc66ec5-85a4-47e1-9759-2f60ab111971@vivo.com/ [1]
Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com>
Link: https://patch.msgid.link/20250817083636.53872-1-rongqianfeng@vivo.com
[ rjw: Changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-09-01 11:55:53 +02:00
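
A hedged before/after sketch of the conversion; the buffer and element type
are illustrative, the actual callers use their own sizes:

    #include <linux/vmalloc.h>

    static void *demo_alloc_page_ptrs(size_t nr_pages)
    {
            /* Before: vzalloc(array_size(nr_pages, sizeof(void *))); */

            /* After: the helper does the overflow-checked multiply itself. */
            return vcalloc(nr_pages, sizeof(void *));
    }
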
Linus Torvalds fe3ad7a58b - Fix a stall on the CPU offline path due to mis-counting a deadline server
task twice as part of the runqueue's running tasks count
 
 - Fix a realtime tasks starvation case where failure to enqueue a timer whose
   expiration time is already in the past would cause repeated attempts to
   re-enqueue a deadline server task which leads to starving the former,
   realtime one
 
 - Prevent a delayed deadline server task stop from breaking the per-runqueue
   bandwidth tracking
 
 - Have a function checking whether the deadline server task has stopped,
   return the correct value
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmi0GaYACgkQEsHwGGHe
 VUphpRAAuI0Ra8fDwy+VRLoHYVseGHw/SZuffgd6WVu1J5Vag9TSJzl4+mDBsLVU
 42+b5+KSWbnX5zMDKgOhx7MU6AOgR3UCDlQEAMtKYI1CFD7ADe+Jwr45jG32Z50s
 bAl4LHhFeTHJx+jLP5Ez5tTwCTc2/Q7UbhadpGgTLQOhvrFPDwsDjrlMgClXgYU4
 DNEF6s6m9X31UJ/jnZNJQ7VeXa6SdqNo2fBZU+SoY1J8GYzGgcUDlCNLD0SnQwMe
 CgGCnYyzOjl9oNdKV2Z14ruCZwkfhv3hlVt0qHwlRKiP8OOdNKWN0FMIAtvyD0QM
 IQXITVIsc+T/diihZEGNR7wHRd0vhZ/cZPLoPjUQ7mNYB4IuCtWJrbLZf4W17CbG
 0clZ/OxG0EmOTKSuxBOxjg5tUtWI9ZqBHPFvBXFFl+6AhHTb1QK0hriAqbaqe0t6
 rOmohWKqg55yQxuhr0VXUgHy4Oq4u4WBZCF1OH02wtk6w87EHawuWPrULp5jR2iM
 BUXazn8CiTc13IBm+NhO9X45GfH1wIHC0Uhul+gWhylzG6gFRWN0CNixqPjd/7M7
 GS5gpH7xVs6Qe5DmAG9WHIXGLHPhda8OkyvzYK5MMtwJ7zpdPvqH3LCbO/uXspMy
 qYJWG+z3ni09SBX6EnZGLjenzOApsRuYL85NvFK8lOv6LG/SSO0=
 =NXg1
 -----END PGP SIGNATURE-----

Merge tag 'sched_urgent_for_v6.17_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Borislav Petkov:

 - Fix a stall on the CPU offline path due to mis-counting a deadline
   server task twice as part of the runqueue's running tasks count

 - Fix a realtime task starvation case where failure to enqueue a timer
   whose expiration time is already in the past would cause repeated
   attempts to re-enqueue a deadline server task, which leads to starving
   the realtime one

 - Prevent a delayed deadline server task stop from breaking the
   per-runqueue bandwidth tracking

 - Have a function checking whether the deadline server task has
   stopped, return the correct value

* tag 'sched_urgent_for_v6.17_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/deadline: Don't count nr_running for dl_server proxy tasks
  sched/deadline: Fix RT task potential starvation when expiry time passed
  sched/deadline: Always stop dl-server before changing parameters
  sched/deadline: Fix dl_server_stopped()
2025-08-31 09:13:00 -07:00
Sebastian Andrzej Siewior d9b05321e2 futex: Move futex_hash_free() back to __mmput()
To avoid a memory leak via mm_alloc() + mmdrop(), the futex cleanup code
has been moved to __mmdrop(). This resulted in a warning when the futex
hash table had been allocated via vmalloc() and mmdrop() was invoked from
atomic context.
The free path must stay in __mmput() to ensure it is invoked from
preemptible context.

In order to avoid the memory leak, delay the allocation of
mm_struct::futex_ref to futex_hash_allocate(). This works because
neither the per-CPU counter nor the private hash has been allocated and
therefore
- futex_private_hash() callers (such as exit_pi_state_list()) don't
  acquire a reference if there is no private hash yet. There is also no
  reference put.

- Regular callers (futex_hash()) fallback to global hash. No reference
  counting here.

The futex_ref member can be allocated in futex_hash_allocate() before
the private hash itself is allocated. This happens either while the
first thread is created or on request. In both cases the process has
just a single thread, so there can be either a futex operation in progress
or a request to create a private hash, but not both at once.

Move futex_hash_free() back to __mmput().
Move the allocation of mm_struct::futex_ref to futex_hash_allocate().
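
A rough sketch of the resulting split, with a placeholder helper name,
counter type and allocation call (illustrative only, not the actual
futex code):

  static int futex_ref_allocate(struct mm_struct *mm)    /* hypothetical helper */
  {
          /* The process is still single-threaded here, so no futex
           * operation or concurrent hash request can race with this. */
          if (mm->futex_ref)
                  return 0;

          mm->futex_ref = alloc_percpu(unsigned int);     /* placeholder type */
          return mm->futex_ref ? 0 : -ENOMEM;
  }

  /* The free path stays in __mmput(), which always runs in preemptible
   * context, so a vmalloc()ed hash table can be freed safely:
   *
   *   __mmput(mm)
   *     -> futex_hash_free(mm)
   */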

  [ bp: Fold a follow-up fix to prevent a use-after-free:
    https://lore.kernel.org/r/20250830213806.sEKuuGSm@linutronix.de ]

Fixes: e703b7e247 ("futex: Move futex cleanup to __mmdrop()")
Closes: https://lore.kernel.org/all/20250821102721.6deae493@kernel.org/
Reported-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lkml.kernel.org/r/20250822141238.PfnkTjFb@linutronix.de
2025-08-31 11:48:19 +02:00
Casey Schaufler 0ffbc876d0 audit: add record for multiple object contexts
Create a new audit record AUDIT_MAC_OBJ_CONTEXTS.
An example of the MAC_OBJ_CONTEXTS record is:

    type=MAC_OBJ_CONTEXTS
      msg=audit(1601152467.009:1050):
      obj_selinux=unconfined_u:object_r:user_home_t:s0

When an audit event includes an AUDIT_MAC_OBJ_CONTEXTS record
the "obj=" field in other records in the event will be "obj=?".
An AUDIT_MAC_OBJ_CONTEXTS record is supplied when the system has
multiple security modules that may make access decisions based
on an object security context.

Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
[PM: subj tweak, audit example readability indents]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2025-08-30 10:15:30 -04:00
Casey Schaufler eb59d494ee audit: add record for multiple task security contexts
Replace the single skb pointer in an audit_buffer with a list of
skb pointers. Add the audit_stamp information to the audit_buffer as
there's no guarantee that there will be an audit_context containing
the stamp associated with the event. At audit_log_end() time create
auxiliary records as have been added to the list. Functions are
created to manage the skb list in the audit_buffer.

Create a new audit record AUDIT_MAC_TASK_CONTEXTS.
An example of the MAC_TASK_CONTEXTS record is:

    type=MAC_TASK_CONTEXTS
      msg=audit(1600880931.832:113)
      subj_apparmor=unconfined
      subj_smack=_

When an audit event includes an AUDIT_MAC_TASK_CONTEXTS record, the
"subj=" field in other records in the event will be "subj=?".
An AUDIT_MAC_TASK_CONTEXTS record is supplied when the system has
multiple security modules that may make access decisions based on a
subject security context.

Refactor audit_log_task_context(), creating a new audit_log_subj_ctx().
This is used in netlabel auditing to provide multiple subject security
contexts as necessary.

Suggested-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
[PM: subj tweak, audit example readability indents]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2025-08-30 10:15:30 -04:00
Casey Schaufler a59076f266 lsm: security_lsmblob_to_secctx module selection
Add a parameter lsmid to security_lsmblob_to_secctx() to identify which
of the security modules that may be active should provide the security
context. If the value of lsmid is LSM_ID_UNDEF the first LSM providing
a hook is used. security_secid_to_secctx() is unchanged, and will
always report the first LSM providing a hook.
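
The resulting call shape is roughly the following (treat the surrounding
parameter types as assumptions; the new lsmid argument is the point):

  /* LSM_ID_UNDEF lets the first LSM providing the hook answer;
   * a concrete ID (e.g. LSM_ID_SELINUX) selects that module. */
  int security_lsmblob_to_secctx(struct lsm_prop *prop,
                                 struct lsm_context *cp, int lsmid);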

Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
[PM: subj tweak]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2025-08-30 10:15:29 -04:00
Casey Schaufler 0a561e3904 audit: create audit_stamp structure
Replace the timestamp and serial number pair used in audit records
with a structure containing the two elements.
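
A minimal sketch of the structure described above (field names assumed):

  struct audit_stamp {
          struct timespec64       ctime;   /* time the record was created */
          unsigned int            serial;  /* serial number of the record */
  };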

Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
[PM: subj tweak]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2025-08-30 10:15:28 -04:00
Jakub Kicinski d23ad54de7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.17-rc4).

No conflicts.

Adjacent changes:

drivers/net/ethernet/intel/idpf/idpf_txrx.c
  02614eee26 ("idpf: do not linearize big TSO packets")
  6c4e684802 ("idpf: remove obsolete stashing code")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-29 11:48:01 -07:00
Linus Torvalds 4d28e28098 dma-mapping fixes for Linux 6.17
- another small fix relevant to arm64 systems with memory encryption
 (Shanker Donthineni)
 - fix relevant to arm32 systems with non-standard CMA configuration
 (Oreoluwa Babatunde)
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSrngzkoBtlA8uaaJ+Jp1EFxbsSRAUCaLAK7gAKCRCJp1EFxbsS
 REDxAQC+5hLiyzc/1rR5EQb1D6Xr1f/0VN3IFz3creHp3juFBAEApi1iFMdmahO7
 0YKG4KkzHpcNkGrxaXKP0VNtQsDLwww=
 =fVrB
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-6.17-2025-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux

Pull dma-mapping fixes from Marek Szyprowski:

 - another small fix for arm64 systems with memory encryption (Shanker
   Donthineni)

 - fix for arm32 systems with non-standard CMA configuration (Oreoluwa
   Babatunde)

* tag 'dma-mapping-6.17-2025-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
  dma/pool: Ensure DMA_DIRECT_REMAP allocations are decrypted
  of: reserved_mem: Restructure call site for dma_contiguous_early_fixup()
2025-08-28 16:04:14 -07:00
Nandakumar Edamana 1df7dad4d5 bpf: Improve the general precision of tnum_mul
Drop the value-mask decomposition technique and adopt straightforward
long-multiplication with a twist: when LSB(a) is uncertain, find the
two partial products (for LSB(a) = known 0 and LSB(a) = known 1) and
take a union.
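
A minimal sketch of the approach, built on the helpers in
include/linux/tnum.h and assuming a tnum_union() helper that returns the
smallest tnum containing both operands (this is not the exact kernel
implementation):

  struct tnum tnum_mul_sketch(struct tnum a, struct tnum b)
  {
          struct tnum acc = TNUM(0, 0);

          /* Classic shift-and-add long multiplication over tnums. */
          while (a.value || a.mask) {
                  if (a.value & 1) {
                          /* LSB of a is a known 1: add this partial product. */
                          acc = tnum_add(acc, b);
                  } else if (a.mask & 1) {
                          /* LSB of a is uncertain: the result is either acc
                           * (LSB == 0) or acc + b (LSB == 1); keep the union
                           * of the two partial products. */
                          acc = tnum_union(acc, tnum_add(acc, b));
                  }
                  /* A known-0 LSB contributes nothing. */
                  a = tnum_rshift(a, 1);
                  b = tnum_lshift(b, 1);
          }

          return acc;
  }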

Experiment shows that applying this technique in long multiplication
improves the precision in a significant number of cases (at the cost
of losing precision in a comparatively smaller number of cases).

Signed-off-by: Nandakumar Edamana <nandakumar@nandakumar.co.in>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Tested-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com>
Reviewed-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20250826034524.2159515-1-nandakumar@nandakumar.co.in
2025-08-27 15:00:26 -07:00
Marie Zhussupova b9a214b5f6 kunit: Pass parameterized test context to generate_params()
To enable more complex parameterized testing scenarios, the
generate_params() function needs additional context beyond just
the previously generated parameter. This patch modifies the
generate_params() function signature to include an extra
`struct kunit *test` argument, giving test users access to the
parameterized test context when generating parameters.

The `struct kunit *test` argument was added as the first parameter
to the function signature as it aligns with the convention of other
KUnit functions that accept `struct kunit *test` first. This also
mirrors the "this" or "self" reference found in object-oriented
programming languages.
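
A generator with the new signature looks roughly like this (the function
and parameter names here are illustrative):

  /* The parameterized test context is now the first argument, ahead of
   * the previously generated parameter and the description buffer. */
  static const void *example_gen_params(struct kunit *test,
                                        const void *prev, char *desc)
  {
          /* ... derive the next parameter, possibly consulting 'test' ... */
          return NULL;    /* returning NULL ends the parameter stream */
  }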

This patch also modifies xe_pci_live_device_gen_param() in xe_pci.c
and nthreads_gen_params() in kcsan_test.c to reflect this signature
change.

Link: https://lore.kernel.org/r/20250826091341.1427123-4-davidgow@google.com
Reviewed-by: David Gow <davidgow@google.com>
Reviewed-by: Rae Moar <rmoar@google.com>
Acked-by: Marco Elver <elver@google.com>
Acked-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Marie Zhussupova <marievic@google.com>
[Catch some additional gen_params signatures in drm/xe/tests --David]
Signed-off-by: David Gow <davidgow@google.com>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2025-08-26 23:36:03 -06:00
Yicong Yang 52d15521eb sched/deadline: Don't count nr_running for dl_server proxy tasks
On CPU offline the kernel stalled with below call trace:

  INFO: task kworker/0:1:11 blocked for more than 120 seconds.

cpuhp holds the cpu hotplug lock endlessly and stalls vmstat_shepherd.
This is because nr_running is counted twice when enqueuing cpuhp, which
makes the cpuhp wait condition fail:

  enqueue_task_fair() // pick cpuhp from idle, rq->nr_running = 0
    dl_server_start()
      [...]
      add_nr_running() // rq->nr_running = 1
    add_nr_running() // rq->nr_running = 2
  [switch to cpuhp, waiting on balance_hotplug_wait()]
  rcuwait_wait_event(rq->nr_running == 1 && ...) // failed, rq->nr_running=2
    schedule() // wait again

It doesn't make sense to count the dl_server towards runnable tasks,
since it runs other tasks.
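
The intent amounts to skipping the accounting for dl-server entities,
roughly (simplified, not the exact hunk):

  /* dl-server entities only proxy for the tasks they serve, which are
   * accounted separately, so don't count them as runnable themselves. */
  if (!dl_server(dl_se))
          add_nr_running(rq, 1);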

Fixes: 63ba8422f8 ("sched/deadline: Introduce deadline servers")
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250627035420.37712-1-yangyicong@huawei.com
2025-08-26 10:46:01 +02:00
kuyo chang 421fc59cf5 sched/deadline: Fix RT task potential starvation when expiry time passed
[Symptom]
The fair server mechanism is intended to prevent starvation of fair
tasks when higher-priority tasks monopolize the CPU. However, RT tasks
on the runqueue may not be scheduled as expected.

[Analysis]
The log "sched: DL replenish lagged too much" triggered.

By memory dump of dl_server:
    curr = 0xFFFFFF80D6A0AC00 (
      dl_server = 0xFFFFFF83CD5B1470(
        dl_runtime = 0x02FAF080,
        dl_deadline = 0x3B9ACA00,
        dl_period = 0x3B9ACA00,
        dl_bw = 0xCCCC,
        dl_density = 0xCCCC,
        runtime = 0x02FAF080,
        deadline = 0x0000082031EB0E80,
        flags = 0x0,
        dl_throttled = 0x0,
        dl_yielded = 0x0,
        dl_non_contending = 0x0,
        dl_overrun = 0x0,
        dl_server = 0x1,
        dl_server_active = 0x1,
        dl_defer = 0x1,
        dl_defer_armed = 0x0,
        dl_defer_running = 0x1,
        dl_timer = (
          node = (
            expires = 0x000008199756E700),
          _softexpires = 0x000008199756E700,
          function = 0xFFFFFFDB9AF44D30 = dl_task_timer,
          base = 0xFFFFFF83CD5A12C0,
          state = 0x0,
          is_rel = 0x0,
          is_soft = 0x0,
    clock_update_flags = 0x4,
    clock = 0x000008204A496900,

 - The timer expiration time (rq->curr->dl_server->dl_timer->expires)
   is already in the past, indicating the timer has expired.
 - The timer state (rq->curr->dl_server->dl_timer->state) is 0.

[Suspected Root Cause]
The relevant code flow in the throttle path of
update_curr_dl_se() as follows:

  dequeue_dl_entity(dl_se, 0);                // the DL entity is dequeued

  if (unlikely(is_dl_boosted(dl_se) || !start_dl_timer(dl_se))) {
      if (dl_server(dl_se))                   // timer registration fails
          enqueue_dl_entity(dl_se, ENQUEUE_REPLENISH);//enqueue immediately
      ...
  }

The failure of `start_dl_timer` is caused by attempting to register a
timer with an expiration time that is already in the past. When this
situation persists, the code repeatedly re-enqueues the DL entity
without properly replenishing or restarting the timer, so the RT
task may not be scheduled as expected.

[Proposed Solution]:
Instead of immediately re-enqueuing the DL entity on timer registration
failure, this change ensures the DL entity is properly replenished and
the timer is restarted, preventing potential RT starvation.
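
A sketch of the reworked failure path, assuming replenish_dl_new_period()
as the replenish helper (simplified, not the actual patch):

  if (unlikely(is_dl_boosted(dl_se) || !start_dl_timer(dl_se))) {
      if (dl_server(dl_se)) {
          /* Timer registration failed (expiry already in the past):
           * replenish to a fresh period before re-enqueueing instead
           * of re-enqueueing the stale entity over and over. */
          replenish_dl_new_period(dl_se, rq);
          enqueue_dl_entity(dl_se, ENQUEUE_REPLENISH);
      }
      ...
  }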

Fixes: 63ba8422f8 ("sched/deadline: Introduce deadline servers")
Signed-off-by: kuyo chang <kuyo.chang@mediatek.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Closes: https://lore.kernel.org/CAMuHMdXn4z1pioTtBGMfQM0jsLviqS2jwysaWXpoLxWYoGa82w@mail.gmail.com
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Jiri Slaby <jirislaby@kernel.org>
Tested-by: Diederik de Haas <didi.debian@cknow.org>
Link: https://lkml.kernel.org/r/20250615131129.954975-1-kuyo.chang@mediatek.com
2025-08-26 10:46:01 +02:00
Juri Lelli bb4700adc3 sched/deadline: Always stop dl-server before changing parameters
Commit cccb45d7c4 ("sched/deadline: Less agressive dl_server
handling") reduced dl-server overhead by delaying disabling servers only
after there are no fair task around for a whole period, which means that
deadline entities are not dequeued right away on a server stop event.
However, the delay opens up a window in which a request for changing
server parameters can break per-runqueue running_bw tracking, as
reported by Yuri.

Close the problematic window by unconditionally calling dl_server_stop()
before applying the new parameters (ensuring deadline entities go
through an actual dequeue).
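
The resulting ordering is roughly (dl_server_apply_params() shown here
as the assumed parameter-update helper):

  /* Stop the server first so its deadline entity is actually dequeued,
   * then apply the new parameters. */
  dl_server_stop(dl_se);
  retval = dl_server_apply_params(dl_se, runtime, period, 0);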

Fixes: cccb45d7c4 ("sched/deadline: Less agressive dl_server handling")
Reported-by: Yuri Andriaccio <yurand2000@gmail.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20250721-upstream-fix-dlserver-lessaggressive-b4-v1-1-4ebc10c87e40@redhat.com
2025-08-26 10:46:00 +02:00
Huacai Chen 4717432dfd sched/deadline: Fix dl_server_stopped()
Commit cccb45d7c4 ("sched/deadline: Less agressive dl_server handling")
introduces dl_server_stopped(). But it is obvious that dl_server_stopped()
should return true if dl_se->dl_server_active is 0.
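
The fix boils down to making the predicate return true for an inactive
server (simplified fragment):

  if (!dl_se->dl_server_active)
          return true;    /* an inactive server is a stopped one */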

Fixes: cccb45d7c4 ("sched/deadline: Less agressive dl_server handling")
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250809130419.1980742-1-chenhuacai@loongson.cn
2025-08-26 10:46:00 +02:00
Josh Poimboeuf 16ed389227 perf: Skip user unwind if the task is a kernel thread
If the task is not a user thread, there's no user stack to unwind.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250820180428.930791978@kernel.org
2025-08-26 09:51:13 +02:00
Josh Poimboeuf d77e3319e3 perf: Simplify get_perf_callchain() user logic
Simplify the get_perf_callchain() user logic a bit.  task_pt_regs()
should never be NULL.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20250820180428.760066227@kernel.org
2025-08-26 09:51:13 +02:00
Steven Rostedt 90942f9fac perf: Use current->flags & PF_KTHREAD|PF_USER_WORKER instead of current->mm == NULL
To determine if a task is a kernel thread or not, it is more reliable to
use (current->flags & (PF_KTHREAD|PF_USER_WORKER)) than to rely on
current->mm being NULL.  That is because some kernel tasks (io_uring
helpers) may have their mm field set.
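
A sketch of the check described above (the helper name is illustrative):

  /* True for kernel threads and user workers such as io_uring helpers,
   * even when they have an mm attached. */
  static inline bool task_is_kernel_thread(struct task_struct *p)
  {
          return p->flags & (PF_KTHREAD | PF_USER_WORKER);
  }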

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250820180428.592367294@kernel.org
2025-08-26 09:51:13 +02:00
Josh Poimboeuf 153f9e74de perf: Have get_perf_callchain() return NULL if crosstask and user are set
get_perf_callchain() doesn't support cross-task unwinding for user space
stacks, so have it return NULL if both the crosstask and user arguments are
set.
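
The guard amounts to an early exit of this shape (argument names as in
the changelog):

  /* Cross-task unwinding of user-space stacks isn't supported. */
  if (crosstask && user)
          return NULL;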

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250820180428.426423415@kernel.org
2025-08-26 09:51:12 +02:00
Josh Poimboeuf e649bcda25 perf: Remove get_perf_callchain() init_nr argument
The 'init_nr' argument does double duty: it's used to initialize both the
number of contexts and the number of stack entries.  That's confusing
and the callers always pass zero anyway.  Hard code the zero.
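
An illustrative before/after of the prototype (the trailing arguments
shown are an assumption about the surrounding signature):

  /* before */
  struct perf_callchain_entry *
  get_perf_callchain(struct pt_regs *regs, u32 init_nr, bool kernel,
                     bool user, u32 max_stack, bool crosstask, bool add_mark);

  /* after: init_nr is gone and the entry count simply starts at zero */
  struct perf_callchain_entry *
  get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
                     u32 max_stack, bool crosstask, bool add_mark);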

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Namhyung Kim <Namhyung@kernel.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/r/20250820180428.259565081@kernel.org
2025-08-26 09:51:12 +02:00
Menglong Dong 8e4f0b1ebc bpf: use rcu_read_lock_dont_migrate() for trampoline.c
Use rcu_read_lock_dont_migrate() and rcu_read_unlock_migrate() in
trampoline.c to obtain better performance when PREEMPT_RCU is not enabled.
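
The usage pattern of the helper pair named above (placement illustrative):

  rcu_read_lock_dont_migrate();
  /* ... look up and use the trampoline under RCU; migration is only
   * additionally disabled on kernels where rcu_read_lock() alone does
   * not already prevent it ... */
  rcu_read_unlock_migrate();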

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Link: https://lore.kernel.org/r/20250821090609.42508-8-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-08-25 18:52:16 -07:00