Commit Graph

5070 Commits (master)

Author SHA1 Message Date
Thomas Gleixner 1e83ccd592 sched/mmcid: Don't assume CID is CPU owned on mode switch
Shinichiro reported a KASAN UAF, which is actually an out of bounds access
in the MMCID management code.

   CPU0						CPU1
   						T1 runs in userspace
   T0: fork(T4) -> Switch to per CPU CID mode
         fixup() set MM_CID_TRANSIT on T1/CPU1
   T4 exit()
   T3 exit()
   T2 exit()
						T1 exit() switch to per task mode
						 ---> Out of bounds access.

As T1 has not scheduled after T0 set the TRANSIT bit, it exits with the
TRANSIT bit set. sched_mm_cid_remove_user() clears the TRANSIT bit in
the task and drops the CID, but it does not touch the per CPU storage.
That's functionally correct because a CID is only owned by the CPU when
the ONCPU bit is set, which is mutually exclusive with the TRANSIT flag.

Now sched_mm_cid_exit() assumes that the CID is CPU owned because the
prior mode was per CPU. It invokes mm_drop_cid_on_cpu() which clears the
not set ONCPU bit and then invokes clear_bit() with an insanely large
bit number because TRANSIT is set (bit 29).

Prevent that by actually validating that the CID is CPU owned in
mm_drop_cid_on_cpu().
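
A minimal sketch of that check, for illustration only: the flag bits, the
per CPU storage layout and the helper signature are assumptions derived
from this changelog, not the actual kernel code.

    /* Assumed flag bits in the CID value; TRANSIT is bit 29 as stated above */
    #define MM_CID_TRANSIT	BIT(29)
    #define MM_CID_ONCPU	BIT(30)

    static void mm_drop_cid_on_cpu(struct mm_struct *mm, unsigned int *pcpu_cid)
    {
        unsigned int cid = *pcpu_cid;

        /* Not CPU owned (e.g. only TRANSIT set): nothing to drop here */
        if (!(cid & MM_CID_ONCPU))
            return;

        *pcpu_cid = 0;
        /* Return the plain CID number to the pool */
        clear_bit(cid & ~MM_CID_ONCPU, mm_cidmask(mm));
    }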

Fixes: 007d84287c ("sched/mmcid: Drop per CPU CID immediately when switching to per task mode")
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Cc: stable@vger.kernel.org
Closes: https://lore.kernel.org/aYsZrixn9b6s_2zL@shinmob
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-11 12:59:56 -08:00
Linus Torvalds 57cb845067 - A nice cleanup to the paravirt code containing a unification of the paravirt
clock interface, taming the include hell by splitting the pv_ops structure
   and removing a bunch of obsolete code. Work by Juergen Gross.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmmLKHAACgkQEsHwGGHe
 VUrURg//Ucf+3EAIkLCmFkH0WwYmQl2JjRYww8bPAw3iJMIVxy4dMnaBbsUiAtUp
 kYza+pgEtvyAwwd8RIEs85c9VhZn0DKoaWV8goBH3zFH6YvIRiLwb0w2QvjkF+70
 FNU+4zlvt/I3FD+tWNElAgVtkFL3Gmzm44qyLLsPtlYaJ71xFl2XB7V+TlqXMHzE
 m8BMenP9/CrbTlBBdNJGzAkAbWi1uAP+IydvuFNolH/F2lqVM2z5Ta3gUWWCIk/q
 jWrPLDZCHr2WlBZNUGamKVVH9NEh+7YNwBAGUrSNYGZFoaFjqeX6lN3djzS+wXIj
 0nDoW35jN0QNKz239MdXZDf1mfpb6ZQd/iOhFjo4dAvbm+J8WPAMr98ac8wR3Dyb
 2LF/BxkoKWRabxQApXSCrLPXEuqT6Qc1+lDA0bNHg51zBoqP5vRNVZRwArnzGB+O
 LxDKx+o4VYOf+UCaB6oQHjylbSgFvIedZ9p822hBe3QG9act8indRE8LWip7Utld
 peoJGgvlQ0xtClh6FjVHpvmVfAvk7Zki5ywj2GwmB/TZ0yywuGStAjE3UqY168/M
 gb7MSajh+HHZNj1/2+b/se4CUYlAgIPDQ+SwHJPm5TqyopvnOVi/2XWmjbx8I5jT
 jS0nxaxD+SbESSZ6IMAsppnAAxAYbvRHGIS+6mtNCXVkaV1pMbA=
 =AeFt
 -----END PGP SIGNATURE-----

Merge tag 'x86_paravirt_for_v7.0_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 paravirt updates from Borislav Petkov:

 - A nice cleanup to the paravirt code containing a unification of the
   paravirt clock interface, taming the include hell by splitting the
   pv_ops structure and removing a bunch of obsolete code (Juergen
   Gross)

* tag 'x86_paravirt_for_v7.0_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
  x86/paravirt: Use XOR r32,r32 to clear register in pv_vcpu_is_preempted()
  x86/paravirt: Remove trailing semicolons from alternative asm templates
  x86/pvlocks: Move paravirt spinlock functions into own header
  x86/paravirt: Specify pv_ops array in paravirt macros
  x86/paravirt: Allow pv-calls outside paravirt.h
  objtool: Allow multiple pv_ops arrays
  x86/xen: Drop xen_mmu_ops
  x86/xen: Drop xen_cpu_ops
  x86/xen: Drop xen_irq_ops
  x86/paravirt: Move pv_native_*() prototypes to paravirt.c
  x86/paravirt: Introduce new paravirt-base.h header
  x86/paravirt: Move paravirt_sched_clock() related code into tsc.c
  x86/paravirt: Use common code for paravirt_steal_clock()
  riscv/paravirt: Use common code for paravirt_steal_clock()
  loongarch/paravirt: Use common code for paravirt_steal_clock()
  arm64/paravirt: Use common code for paravirt_steal_clock()
  arm/paravirt: Use common code for paravirt_steal_clock()
  sched: Move clock related paravirt code to kernel/sched
  paravirt: Remove asm/paravirt_api_clock.h
  x86/paravirt: Move thunk macros to paravirt_types.h
  ...
2026-02-10 19:01:45 -08:00
Linus Torvalds 36ae1c45b2 Scheduler changes for v7.0:
Scheduler Kconfig space updates:
 
  - Further consolidate configurable preemption modes: reduce
    the number of architectures that are allowed to offer
    PREEMPT_NONE and PREEMPT_VOLUNTARY, reducing the number
    of preemption models from four to just two: 'full' and 'lazy'
    on up-to-date architectures (arm64, loongarch, powerpc,
    riscv, s390, x86).
 
    None and voluntary are only available as legacy features
    on platforms that don't implement lazy preemption yet,
    or which don't even support preemption.
 
    The goal is to eventually remove cond_resched() and
    voluntary preemption altogether.
 
    (Peter Zijlstra)
 
 RSEQ based 'scheduler time slice extension' support:
 
 This allows a thread to request a time slice extension when it
 enters a critical section to avoid contention on a resource when
 the thread is scheduled out inside of the critical section.
 
  - Add fields and constants for time slice extension
  - Provide static branch for time slice extensions
  - Add statistics for time slice extensions
  - Add prctl() to enable time slice extensions
  - Implement sys_rseq_slice_yield()
  - Implement syscall entry work for time slice extensions
  - Implement time slice extension enforcement timer
  - Reset slice extension when scheduled
  - Implement rseq_grant_slice_extension()
  - entry: Hook up rseq time slice extension
  - selftests: Implement time slice extension test
 
    (Thomas Gleixner)
 
  - Allow registering RSEQ with slice extension
  - Move slice_ext_nsec to debugfs
  - Lower default slice extension
  - selftests/rseq: Add rseq slice histogram script
 
    (Peter Zijlstra)
 
 Scheduler performance/scalability improvements:
 
  - Update rq->avg_idle when a task is moved to an idle CPU,
    which improves the scalability of various workloads.
    (Shubhang Kaushik)
 
  - Reorder fields in 'struct rq' for better caching
    (Blake Jones)
 
  - Fair scheduler SMP NOHZ balancing code speedups:
 
    - Move checking for nohz cpus after time check
    - Change likelihood of nohz.nr_cpus
    - Remove nohz.nr_cpus and use weight of cpumask instead
 
      (Shrikanth Hegde)
 
  - Avoid false sharing for sched_clock_irqtime (Wangyang Guo)
 
  - Drop useless cpumask_empty() in find_energy_efficient_cpu()
  - Simplify task_numa_find_cpu()
  - Use cpumask_weight_and() in sched_balance_find_dst_group()
 
    (Yury Norov)
 
 DL scheduler updates:
 
  - Add a deadline server for sched_ext tasks (by Andrea Righi and
    Joel Fernandes, with fixes by Peter Zijlstra)
 
 RT scheduler updates:
 
  - Skip currently executing CPU in rto_next_cpu() (Chen Jinghuang)
 
 Entry code updates and performance improvements, which is part of the
 scheduler tree in this cycle due to interdependencies with the RSEQ
 based time slice extension work:
 
   - Remove unused syscall argument from syscall_trace_enter()
   - Rework syscall_exit_to_user_mode_work() for architecture reuse
   - Add arch_ptrace_report_syscall_entry/exit()
   - Inline syscall_exit_work() and syscall_trace_enter()
 
     (Jinjie Ruan)
 
 Scheduler core updates:
 
  - Rework sched_class::wakeup_preempt() and rq_modified_*()
  - Avoid rq->lock bouncing in sched_balance_newidle()
  - Rename rcu_dereference_check_sched_domain() =>
           rcu_dereference_sched_domain()
  - <linux/compiler_types.h>: Add the __signed_scalar_typeof() helper
 
    (Peter Zijlstra)
 
 Fair scheduler updates/refactoring:
 
  - Fold the sched_avg update
  - Change rcu_dereference_check_sched_domain() to rcu-sched
  - Switch to rcu_dereference_all()
  - Remove superfluous rcu_read_lock()
  - Limit hrtick work
 
    (Peter Zijlstra)
 
  - Join two #ifdef CONFIG_FAIR_GROUP_SCHED blocks
  - Clean up comments in 'struct cfs_rq'
  - Separate se->vlag from se->vprot
  - Rename cfs_rq::avg_load to cfs_rq::sum_weight
  - Rename cfs_rq::avg_vruntime to ::sum_w_vruntime & helper functions
  - Introduce and use the vruntime_cmp() and vruntime_op() wrappers
    for wrapped-signed arithmetic
  - Sort out 'blocked_load*' namespace noise
 
    (Ingo Molnar)
 
 Scheduler debugging code updates:
 
  - Export hidden tracepoints to modules (Gabriele Monaco)
 
  - Convert copy_from_user() + kstrtouint() to kstrtouint_from_user()
    (Fushuai Wang)
 
  - Add assertions to QUEUE_CLASS (Peter Zijlstra)
 
  - hrtimer: Fix tracing oddity (Thomas Gleixner)
 
 Misc fixes and cleanups:
 
  - Re-evaluate scheduling when migrating queued tasks out of
    throttled cgroups (Zicheng Qu)
 
  - Remove task_struct->faults_disabled_mapping (Christoph Hellwig)
 
  - Fix math notation errors in avg_vruntime comment (Zhan Xusheng)
 
  - sched/cpufreq: Use %pe format for PTR_ERR() printing (zenghongling)
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmmJj+IRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1grtQ//WyXYGVE/WicdqslfaCY2Mr0uJnL0tLSM
 CJp+0LROdkmy+ChJmftO8RgjCUSsjhC4/xcBhUQXApf/ffQi3b2jH6nkTp/Z64Ms
 p2IXLkBiZjwdcO6fGbB0JE2G1J4hGRC5BlqfgkZzWidMf3kIbmrHg99mVWGzODLY
 N/cPW4d0WGf9TScl1FgEiOqgF3czMLlqvTDJqaFMpsTzSUcRBnrG4xushb4W/bBx
 573eqxgZJ6urNSGu8niY9PAl9F7gskXW3YxI3k8SH7VmJKSevWlwI9vMEhcRDzud
 E0XxD7J8iPOKtr7ypXm7anMBv4jWVUdAnPbYi4TDsyDDU/HguqMqT1McTGn8wQ+F
 jmdhmMC9/TEIzq93SNLbCYieibqDsmJoNVFFi0FWfPLMtYbcZd5a884SIz532vx4
 DegdlDXdazUwhxzDiQR3sq1CsHXpxNS2YdrpadAtF/r2gU86DQjsEew8yBvXi7bb
 Wrkzpax70sU1AFI23wJQkEb/OnnXyehAHAhhQN6GVvuiGr9P7C02WLEGLlmSmJrx
 zl2F750P76yhTfGcvTfJ/5LTfSB+yRozGvcdXnIkyzWotY6a2D1MKNusAfVax+IR
 kyfAWqVdxBhlKnqYbu92lTogvnPh3Lymd6G4TZZRkSH2jixyGd2oS7nZaDBAeBEM
 NHQtr9R+KyU=
 =Xj2f
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:
 "Scheduler Kconfig space updates:

   - Further consolidate configurable preemption modes (Peter Zijlstra)

     Reduce the number of architectures that are allowed to offer
     PREEMPT_NONE and PREEMPT_VOLUNTARY, reducing the number of
     preemption models from four to just two: 'full' and 'lazy' on
     up-to-date architectures (arm64, loongarch, powerpc, riscv, s390,
     x86).

     None and voluntary are only available as legacy features on
     platforms that don't implement lazy preemption yet, or which don't
     even support preemption.

     The goal is to eventually remove cond_resched() and voluntary
     preemption altogether.

  RSEQ based 'scheduler time slice extension' support (Thomas Gleixner
  and Peter Zijlstra):

  This allows a thread to request a time slice extension when it enters
  a critical section to avoid contention on a resource when the thread
  is scheduled out inside of the critical section.

   - Add fields and constants for time slice extension
   - Provide static branch for time slice extensions
   - Add statistics for time slice extensions
   - Add prctl() to enable time slice extensions
   - Implement sys_rseq_slice_yield()
   - Implement syscall entry work for time slice extensions
   - Implement time slice extension enforcement timer
   - Reset slice extension when scheduled
   - Implement rseq_grant_slice_extension()
   - entry: Hook up rseq time slice extension
   - selftests: Implement time slice extension test
   - Allow registering RSEQ with slice extension
   - Move slice_ext_nsec to debugfs
   - Lower default slice extension
   - selftests/rseq: Add rseq slice histogram script

  Scheduler performance/scalability improvements:

   - Update rq->avg_idle when a task is moved to an idle CPU, which
     improves the scalability of various workloads (Shubhang Kaushik)

   - Reorder fields in 'struct rq' for better caching (Blake Jones)

   - Fair scheduler SMP NOHZ balancing code speedups (Shrikanth Hegde):
      - Move checking for nohz cpus after time check
      - Change likelihood of nohz.nr_cpus
      - Remove nohz.nr_cpus and use weight of cpumask instead

   - Avoid false sharing for sched_clock_irqtime (Wangyang Guo)

   - Cleanups (Yury Norov):
      - Drop useless cpumask_empty() in find_energy_efficient_cpu()
      - Simplify task_numa_find_cpu()
      - Use cpumask_weight_and() in sched_balance_find_dst_group()

  DL scheduler updates:

   - Add a deadline server for sched_ext tasks (by Andrea Righi and Joel
     Fernandes, with fixes by Peter Zijlstra)

  RT scheduler updates:

   - Skip currently executing CPU in rto_next_cpu() (Chen Jinghuang)

  Entry code updates and performance improvements (Jinjie Ruan)

  This is part of the scheduler tree in this cycle due to inter-
  dependencies with the RSEQ based time slice extension work:

    - Remove unused syscall argument from syscall_trace_enter()
    - Rework syscall_exit_to_user_mode_work() for architecture reuse
    - Add arch_ptrace_report_syscall_entry/exit()
    - Inline syscall_exit_work() and syscall_trace_enter()

  Scheduler core updates (Peter Zijlstra):

   - Rework sched_class::wakeup_preempt() and rq_modified_*()
   - Avoid rq->lock bouncing in sched_balance_newidle()
   - Rename rcu_dereference_check_sched_domain() =>
            rcu_dereference_sched_domain()
   - <linux/compiler_types.h>: Add the __signed_scalar_typeof() helper

  Fair scheduler updates/refactoring (Peter Zijlstra and Ingo Molnar):

   - Fold the sched_avg update
   - Change rcu_dereference_check_sched_domain() to rcu-sched
   - Switch to rcu_dereference_all()
   - Remove superfluous rcu_read_lock()
   - Limit hrtick work
   - Join two #ifdef CONFIG_FAIR_GROUP_SCHED blocks
   - Clean up comments in 'struct cfs_rq'
   - Separate se->vlag from se->vprot
   - Rename cfs_rq::avg_load to cfs_rq::sum_weight
   - Rename cfs_rq::avg_vruntime to ::sum_w_vruntime & helper functions
   - Introduce and use the vruntime_cmp() and vruntime_op() wrappers for
     wrapped-signed arithmetic
   - Sort out 'blocked_load*' namespace noise

  Scheduler debugging code updates:

   - Export hidden tracepoints to modules (Gabriele Monaco)

   - Convert copy_from_user() + kstrtouint() to kstrtouint_from_user()
     (Fushuai Wang)

   - Add assertions to QUEUE_CLASS (Peter Zijlstra)

   - hrtimer: Fix tracing oddity (Thomas Gleixner)

  Misc fixes and cleanups:

   - Re-evaluate scheduling when migrating queued tasks out of throttled
     cgroups (Zicheng Qu)

   - Remove task_struct->faults_disabled_mapping (Christoph Hellwig)

   - Fix math notation errors in avg_vruntime comment (Zhan Xusheng)

   - sched/cpufreq: Use %pe format for PTR_ERR() printing
     (zenghongling)"

* tag 'sched-core-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (64 commits)
  sched: Re-evaluate scheduling when migrating queued tasks out of throttled cgroups
  sched/cpufreq: Use %pe format for PTR_ERR() printing
  sched/rt: Skip currently executing CPU in rto_next_cpu()
  sched/clock: Avoid false sharing for sched_clock_irqtime
  selftests/sched_ext: Add test for DL server total_bw consistency
  selftests/sched_ext: Add test for sched_ext dl_server
  sched/debug: Fix dl_server (re)start conditions
  sched/debug: Add support to change sched_ext server params
  sched_ext: Add a DL server for sched_ext tasks
  sched/debug: Stop and start server based on if it was active
  sched/debug: Fix updating of ppos on server write ops
  sched/deadline: Clear the defer params
  entry: Inline syscall_exit_work() and syscall_trace_enter()
  entry: Add arch_ptrace_report_syscall_entry/exit()
  entry: Rework syscall_exit_to_user_mode_work() for architecture reuse
  entry: Remove unused syscall argument from syscall_trace_enter()
  sched: remove task_struct->faults_disabled_mapping
  sched: Update rq->avg_idle when a task is moved to an idle CPU
  selftests/rseq: Add rseq slice histogram script
  hrtimer: Fix trace oddity
  ...
2026-02-10 12:50:10 -08:00
Linus Torvalds 0923fd0419 Locking updates for v6.20:
Lock debugging:
 
  - Implement compiler-driven static analysis locking context
    checking, using the upcoming Clang 22 compiler's context
    analysis features. (Marco Elver)
 
    We removed Sparse context analysis support, because prior to
    removal even a defconfig kernel produced 1,700+ context
    tracking Sparse warnings, the overwhelming majority of which
    are false positives. On an allmodconfig kernel the number of
    false positive context tracking Sparse warnings grows to
    over 5,200... On the plus side of the balance, actual locking
    bugs found by Sparse context analysis are also rather ... sparse:
    I found only 3 such commits in the last 3 years. So the
    rate of false positives and the maintenance overhead is
    rather high and there appears to be no active policy in
    place to achieve a zero-warnings baseline to move the
    annotations & fixers to developers who introduce new code.
 
    Clang context analysis is more complete and more aggressive
    in trying to find bugs, at least in principle. Plus it has
    a different model to enabling it: it's enabled subsystem by
    subsystem, which results in zero warnings on all relevant
    kernel builds (as far as our testing managed to cover it).
    Which allowed us to enable it by default, similar to other
    compiler warnings, with the expectation that there are no
    warnings going forward. This enforces a zero-warnings baseline
    on clang-22+ builds. (Which are still limited in distribution,
    admittedly.)
 
    Hopefully the Clang approach can lead to a more maintainable
    zero-warnings status quo and policy, with more and more
    subsystems and drivers enabling the feature. Context tracking
    can be enabled for all kernel code via WARN_CONTEXT_ANALYSIS_ALL=y
    (default disabled), but this will generate a lot of false positives.
 
    ( Having said that, Sparse support could still be added back,
      if anyone is interested - the removal patch is still
      relatively straightforward to revert at this stage. )
 
 Rust integration updates: (Alice Ryhl, Fujita Tomonori, Boqun Feng)
 
   - Add support for Atomic<i8/i16/bool> and replace most Rust native
     AtomicBool usages with Atomic<bool>
 
   - Clean up LockClassKey and improve its documentation
 
   - Add missing Send and Sync trait implementation for SetOnce
 
   - Make ARef Unpin as it is supposed to be
 
   - Add __rust_helper to a few Rust helpers as a preparation for
     helper LTO
 
   - Inline various lock related functions to avoid additional
     function calls.
 
 WW mutexes:
 
   - Extend ww_mutex tests and other test-ww_mutex updates (John Stultz)
 
 Misc fixes and cleanups:
 
   - rcu: Mark lockdep_assert_rcu_helper() __always_inline
     (Arnd Bergmann)
 
   - locking/local_lock: Include more missing headers (Peter Zijlstra)
 
   - seqlock: fix scoped_seqlock_read kernel-doc (Randy Dunlap)
 
   - rust: sync: Replace `kernel::c_str!` with C-Strings
     (Tamir Duberstein)
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmmIXiURHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1gH+A/9GX5UmU6+HuDfDrCtXm9GDve6wkwahvcW
 jLDxOYjs764I2BhyjZnjKjyF5zw60hbykem7Wcf5EV2YH30nM4XRgEWVJfkr1UAI
 Pra415X4DdOzZ6qYQIpO8Udt1LtR7BMSaXITVLJaLicxEoOVtq3SKxjqyhCFs7UW
 MfJdqleB+RMLqq3LlzgB4l43eKk1xyeHh+oQwI0RSxuIpVZme3p4TObnCKjIWnK7
 Ihd+dkgC852WBjANgNL7F/sd5UsF5QX3wjtOrLhMKvkIgTPdXln0g398pivjN/G/
 Kpnw18SFeb159JfJu8eMotsYvVnQ0D5aOcTBfL4qvOHCImhpcu2s6ik9BcXqt2yT
 8IiuWk9xEM3Ok+I/I4ClT5cf5GYpyigV2QsXxn+IjDX5Na8v4zlHh0r8SElP8fOt
 7dpQx7iw8UghAib3AzA3suN78Oh39m8l5BNobj7LAjnqOQcVvoPo4o7/48ntuH7A
 38EucFrXfxQBMfGbMwvxEmgYuX7MyVfQLaPE06MHy1BkZkffT8Um38TB0iNtZmtf
 WUx01yLKWYspehlwFi319uVI4/Zp7FnTfqa5uKv1oSXVdL9vZojSXUzrgDV7FVqT
 Z4xAAw/kwNHpUG7y0zNOqd6PukovG1t+CjbLvK+eHPwc5c0vEGG2oTRAfEvvP1z/
 kesYDmCyJnk=
 =N1gA
 -----END PGP SIGNATURE-----

Merge tag 'locking-core-2026-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking updates from Ingo Molnar:
 "Lock debugging:

   - Implement compiler-driven static analysis locking context checking,
     using the upcoming Clang 22 compiler's context analysis features
     (Marco Elver)

     We removed Sparse context analysis support, because prior to
     removal even a defconfig kernel produced 1,700+ context tracking
     Sparse warnings, the overwhelming majority of which are false
     positives. On an allmodconfig kernel the number of false positive
     context tracking Sparse warnings grows to over 5,200... On the plus
     side of the balance, actual locking bugs found by Sparse context
     analysis are also rather ... sparse: I found only 3 such commits in
     the last 3 years. So the rate of false positives and the
     maintenance overhead is rather high and there appears to be no
     active policy in place to achieve a zero-warnings baseline to move
     the annotations & fixers to developers who introduce new code.

     Clang context analysis is more complete and more aggressive in
     trying to find bugs, at least in principle. Plus it has a different
     model to enabling it: it's enabled subsystem by subsystem, which
     results in zero warnings on all relevant kernel builds (as far as
     our testing managed to cover it). Which allowed us to enable it by
     default, similar to other compiler warnings, with the expectation
     that there are no warnings going forward. This enforces a
     zero-warnings baseline on clang-22+ builds (Which are still limited
     in distribution, admittedly)

     Hopefully the Clang approach can lead to a more maintainable
     zero-warnings status quo and policy, with more and more subsystems
     and drivers enabling the feature. Context tracking can be enabled
     for all kernel code via WARN_CONTEXT_ANALYSIS_ALL=y (default
     disabled), but this will generate a lot of false positives.

     ( Having said that, Sparse support could still be added back,
       if anyone is interested - the removal patch is still
       relatively straightforward to revert at this stage. )

  Rust integration updates: (Alice Ryhl, Fujita Tomonori, Boqun Feng)

    - Add support for Atomic<i8/i16/bool> and replace most Rust native
      AtomicBool usages with Atomic<bool>

    - Clean up LockClassKey and improve its documentation

    - Add missing Send and Sync trait implementation for SetOnce

    - Make ARef Unpin as it is supposed to be

    - Add __rust_helper to a few Rust helpers as a preparation for
      helper LTO

    - Inline various lock related functions to avoid additional function
      calls

  WW mutexes:

    - Extend ww_mutex tests and other test-ww_mutex updates (John
      Stultz)

  Misc fixes and cleanups:

    - rcu: Mark lockdep_assert_rcu_helper() __always_inline (Arnd
      Bergmann)

    - locking/local_lock: Include more missing headers (Peter Zijlstra)

    - seqlock: fix scoped_seqlock_read kernel-doc (Randy Dunlap)

    - rust: sync: Replace `kernel::c_str!` with C-Strings (Tamir
      Duberstein)"

* tag 'locking-core-2026-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (90 commits)
  locking/rwlock: Fix write_trylock_irqsave() with CONFIG_INLINE_WRITE_TRYLOCK
  rcu: Mark lockdep_assert_rcu_helper() __always_inline
  compiler-context-analysis: Remove __assume_ctx_lock from initializers
  tomoyo: Use scoped init guard
  crypto: Use scoped init guard
  kcov: Use scoped init guard
  compiler-context-analysis: Introduce scoped init guards
  cleanup: Make __DEFINE_LOCK_GUARD handle commas in initializers
  seqlock: fix scoped_seqlock_read kernel-doc
  tools: Update context analysis macros in compiler_types.h
  rust: sync: Replace `kernel::c_str!` with C-Strings
  rust: sync: Inline various lock related methods
  rust: helpers: Move #define __rust_helper out of atomic.c
  rust: wait: Add __rust_helper to helpers
  rust: time: Add __rust_helper to helpers
  rust: task: Add __rust_helper to helpers
  rust: sync: Add __rust_helper to helpers
  rust: refcount: Add __rust_helper to helpers
  rust: rcu: Add __rust_helper to helpers
  rust: processor: Add __rust_helper to helpers
  ...
2026-02-10 12:28:44 -08:00
Linus Torvalds f17b474e36 bpf-next-7.0
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmmGmrgACgkQ6rmadz2v
 bTq6NxAAkCHosxzGn9GYYBV8xhrBJoJJDCyEbQ4nR0XNY+zaWnuykmiPP9w1aOAM
 zm/po3mQB2pZjetvlrPrgG5RLgBCAUHzqVGy0r+phUvD3vbohKlmSlMm2kiXOb9N
 T01BgLWsyqN2ZcNFvORdSsftqIJUHcXxU6RdupGD60sO5XM9ty5cwyewLX8GBOas
 UN2bOhbK2DpqYWUvtv+3Q3ykxoStMSkXZvDRurwLKl4RHeLjXZXPo8NjnfBlk/F2
 vdFo/F4NO4TmhOave6UPXvKb4yo9IlBRmiPAl0RmNKBxenY8j9XuV/xZxU6YgzDn
 +SQfDK+CKQ4IYIygE+fqd4e5CaQrnjmPPcIw12AB2CF0LimY9Xxyyk6FSAhMN7wm
 GTVh5K2C3Dk3OiRQk4G58EvQ5QcxzX98IeeCpcckMUkPsFWHRvF402WMUcv9SWpD
 DsxxPkfENY/6N67EvH0qcSe/ikdUorQKFl4QjXKwsMCd5WhToeP4Z7Ck1gVSNkAh
 9CX++mLzg333Lpsc4SSIuk9bEPpFa5cUIKUY7GCsCiuOXciPeMDP3cGSd5LioqxN
 qWljs4Z88QDM2LJpAh8g4m3sA7bMhES3nPmdlI5CfgBcVyLW8D8CqQq4GEZ1McwL
 Ky084+lEosugoVjRejrdMMEOsqAfcbkTr2b8jpuAZdwJKm6p/bw=
 =cBdK
 -----END PGP SIGNATURE-----

Merge tag 'bpf-next-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Pull bpf updates from Alexei Starovoitov:

 - Support associating BPF program with struct_ops (Amery Hung)

 - Switch BPF local storage to rqspinlock and remove recursion detection
   counters which were causing false positives (Amery Hung)

 - Fix live registers marking for indirect jumps (Anton Protopopov)

 - Introduce execution context detection BPF helpers (Changwoo Min)

 - Improve verifier precision for 32bit sign extension pattern
   (Cupertino Miranda)

 - Optimize BTF type lookup by sorting vmlinux BTF and doing binary
   search (Donglin Peng)

 - Allow states pruning for misc/invalid slots in iterator loops (Eduard
   Zingerman)

 - In preparation for ASAN support in BPF arenas teach libbpf to move
   global BPF variables to the end of the region and enable arena kfuncs
   while holding locks (Emil Tsalapatis)

 - Introduce support for implicit arguments in kfuncs and migrate a
   number of them to new API. This is a prerequisite for cgroup
   sub-schedulers in sched-ext (Ihor Solodrai)

 - Fix incorrect copied_seq calculation in sockmap (Jiayuan Chen)

 - Fix ORC stack unwind from kprobe_multi (Jiri Olsa)

 - Speed up fentry attach by using single ftrace direct ops in BPF
   trampolines (Jiri Olsa)

 - Require frozen map for calculating map hash (KP Singh)

 - Fix lock entry creation in TAS fallback in rqspinlock (Kumar
   Kartikeya Dwivedi)

 - Allow user space to select cpu in lookup/update operations on per-cpu
   array and hash maps (Leon Hwang)

 - Make kfuncs return trusted pointers by default (Matt Bobrowski)

 - Introduce "fsession" support where single BPF program is executed
   upon entry and exit from traced kernel function (Menglong Dong)

 - Allow bpf_timer and bpf_wq use in all programs types (Mykyta
   Yatsenko, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Alexei
   Starovoitov)

 - Make KF_TRUSTED_ARGS the default for all kfuncs and clean up their
   definition across the tree (Puranjay Mohan)

 - Allow BPF arena calls from non-sleepable context (Puranjay Mohan)

 - Improve register id comparison logic in the verifier and extend
   linked registers with negative offsets (Puranjay Mohan)

 - In preparation for BPF-OOM introduce kfuncs to access memcg events
   (Roman Gushchin)

 - Use CFI compatible destructor kfunc type (Sami Tolvanen)

 - Add bitwise tracking for BPF_END in the verifier (Tianci Cao)

 - Add range tracking for BPF_DIV and BPF_MOD in the verifier (Yazhou
   Tang)

 - Make BPF selftests work with 64k page size (Yonghong Song)

* tag 'bpf-next-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (268 commits)
  selftests/bpf: Fix outdated test on storage->smap
  selftests/bpf: Choose another percpu variable in bpf for btf_dump test
  selftests/bpf: Remove test_task_storage_map_stress_lookup
  selftests/bpf: Update task_local_storage/task_storage_nodeadlock test
  selftests/bpf: Update task_local_storage/recursion test
  selftests/bpf: Update sk_storage_omem_uncharge test
  bpf: Switch to bpf_selem_unlink_nofail in bpf_local_storage_{map_free, destroy}
  bpf: Support lockless unlink when freeing map or local storage
  bpf: Prepare for bpf_selem_unlink_nofail()
  bpf: Remove unused percpu counter from bpf_local_storage_map_free
  bpf: Remove cgroup local storage percpu counter
  bpf: Remove task local storage percpu counter
  bpf: Change local_storage->lock and b->lock to rqspinlock
  bpf: Convert bpf_selem_unlink to failable
  bpf: Convert bpf_selem_link_map to failable
  bpf: Convert bpf_selem_unlink_map to failable
  bpf: Select bpf_local_storage_map_bucket based on bpf_local_storage
  selftests/xsk: fix number of Tx frags in invalid packet
  selftests/xsk: properly handle batch ending in the middle of a packet
  bpf: Prevent reentrance into call_rcu_tasks_trace()
  ...
2026-02-10 11:26:21 -08:00
Linus Torvalds d16738a4e7 The kthread code provides an infrastructure which manages the preferred
affinity of unbound kthreads (node or custom cpumask) against
 housekeeping (CPU isolation) constraints and CPU hotplug events.
 
 One crucial missing piece is the handling of cpuset: when an isolated
 partition is created, deleted, or its CPUs updated, all the unbound
 kthreads in the top cpuset become indifferently affine to _all_ the
 non-isolated CPUs, possibly breaking their preferred affinity along
 the way.
 
 Solve this by moving the kthreads affinity update out of cpuset and into
 the consolidated kthread affinity management code, so that preferred
 affinities are honoured and applied against the updated cpuset isolated
 partitions.
 
 The dispatch of the new isolated cpumasks to timers, workqueues and
 kthreads is performed by housekeeping, as per Tejun's nice suggestion.
 
 As a welcome side effect, HK_TYPE_DOMAIN then integrates both the set
 from boot defined domain isolation (through isolcpus=) and cpuset
 isolated partitions. Housekeeping cpumasks are now modifiable with a
 specific RCU based synchronization. A big step toward making nohz_full=
 also mutable through cpuset in the future.
 -----BEGIN PGP SIGNATURE-----
 
 iQJPBAABCAA5FiEEd76+gtGM8MbftQlOhSRUR1COjHcFAmmE0mYbFIAAAAAABAAO
 bWFudTIsMi41KzEuMTEsMiwyAAoJEIUkVEdQjox36eMP/0Ls/ArfYVi/MNAXWlpy
 rAt6m9Y/X9GBcDM/VI9BXq1ZX4qEr2XjJ8UUb8cM08uHEAt0ErlmpRxREwJFrKbI
 H4jzg5EwO0D0c6MnvgQJEAwkHxQVIjsxG9DovRIjxyW4ycx3aSsRg/f2VKyWoLvY
 7ZT7CbLFE+I/MQh2ZgUu/9pnCDQVR2anss2WYIej5mmgFL5pyEv3YvYgKYVyK08z
 sXyNxpP976g2d9ECJ9OtFJV9we6mlqxlG0MVCiv/Uxh7DBjxWWPsLvlmLAXggQ03
 +0GW+nnutDaKz83pgS7Z4zum/+Oa+I1dTLIN27pARUNcMCYip7njM2KNpJwPdov3
 +fAIODH2JVX1xewT+U1cCq6gdI55ejbwdQYGFV075dKBUxKQeIyrghvfC3Ga6aKQ
 Gw3y68jdrXOw6iyfHR5k/0Mnu2/FDKUW2fZxLKm55PvNZP5jQFmSlz9wyiwwyb3m
 UUSgThj6Ozodxks8hDX41rGVezCcm1ni+qNSiNIs8HPaaZQrwbnvKHQFBBJHQzJP
 rJ39VWBx3Hq/ly71BOR6pCzoZsfS1f85YKhJ4vsfjLO6BfhI16nBat89eROSRKcz
 XptyWqW0PgAD0teDuMCTPNuUym/viBHALXHKuSO12CIizacvftiGcmaQNPlLiiFZ
 /Dr2+aOhwYw3UD6djn3u94M9
 =nWGh
 -----END PGP SIGNATURE-----

Merge tag 'kthread-for-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks

Pull kthread updates from Frederic Weisbecker:
 "The kthread code provides an infrastructure which manages the
  preferred affinity of unbound kthreads (node or custom cpumask)
  against housekeeping (CPU isolation) constraints and CPU hotplug
  events.

  One crucial missing piece is the handling of cpuset: when an isolated
  partition is created, deleted, or its CPUs updated, all the unbound
  kthreads in the top cpuset become indifferently affine to _all_ the
  non-isolated CPUs, possibly breaking their preferred affinity along
  the way.

  Solve this by moving the kthreads affinity update out of cpuset and into
  the consolidated kthread affinity management code, so that preferred
  affinities are honoured and applied against the updated cpuset isolated
  partitions.

  The dispatch of the new isolated cpumasks to timers, workqueues and
  kthreads is performed by housekeeping, as per Tejun's nice suggestion.

  As a welcome side effect, HK_TYPE_DOMAIN then integrates both the set
  from boot defined domain isolation (through isolcpus=) and cpuset
  isolated partitions. Housekeeping cpumasks are now modifiable with a
  specific RCU based synchronization. A big step toward making
  nohz_full= also mutable through cpuset in the future"

* tag 'kthread-for-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks: (33 commits)
  doc: Add housekeeping documentation
  kthread: Document kthread_affine_preferred()
  kthread: Comment on the purpose and placement of kthread_affine_node() call
  kthread: Honour kthreads preferred affinity after cpuset changes
  sched/arm64: Move fallback task cpumask to HK_TYPE_DOMAIN
  sched: Switch the fallback task allowed cpumask to HK_TYPE_DOMAIN
  kthread: Rely on HK_TYPE_DOMAIN for preferred affinity management
  kthread: Include kthreadd to the managed affinity list
  kthread: Include unbound kthreads in the managed affinity list
  kthread: Refine naming of affinity related fields
  PCI: Remove superfluous HK_TYPE_WQ check
  sched/isolation: Remove HK_TYPE_TICK test from cpu_is_isolated()
  cpuset: Remove cpuset_cpu_is_isolated()
  timers/migration: Remove superfluous cpuset isolation test
  cpuset: Propagate cpuset isolation update to timers through housekeeping
  cpuset: Propagate cpuset isolation update to workqueue through housekeeping
  PCI: Flush PCI probe workqueue on cpuset isolated partition change
  sched/isolation: Flush vmstat workqueues on cpuset isolated partition change
  sched/isolation: Flush memcg workqueues on cpuset isolated partition change
  cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset
  ...
2026-02-09 19:57:30 -08:00
Linus Torvalds dda5df9823 Miscellaneous MMCID fixes to address bugs and
performance regressions in the recent rewrite
 of the SCHED_MM_CID management code:
 
  - Fix livelock triggered by BPF CI testing
 
  - Fix hard lockup on weakly ordered systems
 
  - Simplify the dropping of CIDs in the exit path
    by removing an unintended transition phase.
 
  - Fix performance/scalability regression on a
    thread-pool benchmark by optimizing transitional
    CIDs when scheduling out.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmmHDvQRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1hdPBAAgnl/L09wF8WCQLSoLrhr71FmS6fZApDB
 Rvov2be8tGJR0BsrJF5uOKTNjulqUIr0mfO73fdHZftdFuhm/WLnWjBO62GhCKMg
 d8kXOVZ7PudFN+QwL17pOAub8voh9s9/mceE/hZ3M5eNjXlG4sAcpyGvnrTLLYru
 rfzO48NOpy5NMfbxU5/f9nojfr2t8fhnpX2QjquOhEPpl/BeYzexTZK7h2IJXqTK
 tkU6IY9X8fT7y8LkKbTCIMJvEuWawHj1DSW2EiWNPJZkX+Hk5ZHttg28JjROavEy
 orgairCSCT/cOETKugfToFd0Z4WlmemY6Nk5Kyx//WiFQ/u0HHlFVgMJoJfQEovV
 MtIxLVygVbEoQyTszZyFUlTQjrnH8uKxXYhh1mX5wSj9lyDfpfJZycFFA2RpE4Rw
 /+pvH08BfR4FgpqTfojfgOnuK/575VsomaVghritoNW3bAie1kpnWIeBaXS8lL4O
 0pkK7XX8ng6hXuZTMxgXXfkfUB6oM1Yp1OZJAEzUvftsK0FQ5q3e0WxD+pdVza2s
 PfQPaA7bT/G7y8k4LIXm59/tPX2QWPwe0yci00NbyfWiOdxHSgS7crQO8E1+VAiq
 TcLGZNj/wFL6B5ghaiUIi22Mo+WnLX8fW+aiIjSiUQILmbNZXYmwtfEFsvsahh9W
 /RkE/WQ492E=
 =/PkF
 -----END PGP SIGNATURE-----

Merge tag 'sched-urgent-2026-02-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Ingo Molnar:
 "Miscellaneous MMCID fixes to address bugs and performance regressions
  in the recent rewrite of the SCHED_MM_CID management code:

   - Fix livelock triggered by BPF CI testing

   - Fix hard lockup on weakly ordered systems

   - Simplify the dropping of CIDs in the exit path by removing an
     unintended transition phase

   - Fix performance/scalability regression on a thread-pool benchmark
     by optimizing transitional CIDs when scheduling out"

* tag 'sched-urgent-2026-02-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/mmcid: Optimize transitional CIDs when scheduling out
  sched/mmcid: Drop per CPU CID immediately when switching to per task mode
  sched/mmcid: Protect transition on weakly ordered systems
  sched/mmcid: Prevent live lock on task to CPU mode transition
2026-02-07 09:10:42 -08:00
Linus Torvalds 3c7b4d1994 sched_ext: Fixes for v6.19-rc8
- Fix race where sched_class operations (sched_setscheduler() and friends)
   could be invoked on dead tasks after sched_ext_dead() already ran, causing
   invalid SCX task state transitions and NULL pointer dereferences. This was
   a regression from the cgroup exit ordering fix which moved
   sched_ext_free() to finish_task_switch().
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaYPIhw4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGS3HAQChF4sOgoD67cul36LJaeiQzjLCh9iTU9vi2lB4
 slJb5QD/dJhrC0T2ZVRm5rHVxckIx7KeFwbzhvlrUD7l+zEaAwo=
 =ysQE
 -----END PGP SIGNATURE-----

Merge tag 'sched_ext-for-6.19-rc8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext fix from Tejun Heo:

 - Fix race where sched_class operations (sched_setscheduler() and
   friends) could be invoked on dead tasks after sched_ext_dead()
   already ran, causing invalid SCX task state transitions and NULL
   pointer dereferences.

   This was a regression from the cgroup exit ordering fix which
   moved sched_ext_free() to finish_task_switch().

* tag 'sched_ext-for-6.19-rc8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: Short-circuit sched_class operations on dead tasks
2026-02-04 15:11:24 -08:00
Tejun Heo 0eca95cba2 sched_ext: Short-circuit sched_class operations on dead tasks
7900aa699c ("sched_ext: Fix cgroup exit ordering by moving sched_ext_free()
to finish_task_switch()") moved sched_ext_free() to finish_task_switch() and
renamed it to sched_ext_dead() to fix cgroup exit ordering issues. However,
this created a race window where certain sched_class ops may be invoked on
dead tasks leading to failures - e.g. sched_setscheduler() may try to switch a
task which finished sched_ext_dead() back into SCX triggering invalid SCX task
state transitions.

Add task_dead_and_done() which tests whether a task is TASK_DEAD and has
completed its final context switch, and use it to short-circuit sched_class
operations which may be called on dead tasks.
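
A minimal sketch of such a helper, assuming the "final context switch done"
condition can be derived from task_struct::__state and ::on_cpu; the actual
implementation may differ.

    static inline bool task_dead_and_done(struct task_struct *p)
    {
        /* TASK_DEAD and no longer on a CPU: the final context switch,
         * and therefore sched_ext_dead(), has completed. */
        return READ_ONCE(p->__state) == TASK_DEAD && !READ_ONCE(p->on_cpu);
    }

    /* In a sched_class operation which may see dead tasks: */
    if (task_dead_and_done(p))
        return;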

Fixes: 7900aa699c ("sched_ext: Fix cgroup exit ordering by moving sched_ext_free() to finish_task_switch()")
Reported-by: Andrea Righi <arighi@nvidia.com>
Link: http://lkml.kernel.org/r/20260202151341.796959-1-arighi@nvidia.com
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-04 12:22:11 -10:00
Thomas Gleixner 4463c7aa11 sched/mmcid: Optimize transitional CIDs when scheduling out
During the investigation of the various transition mode issues,
instrumentation revealed that the number of bitmap operations can be
significantly reduced when a task with a transitional CID schedules out
after the fixup function completed and disabled the transition mode.

At that point the mode is stable and therefore it is not required to drop
the transitional CID back into the pool. As the fixup is complete the
potential exhaustion of the CID pool is no longer possible, so the CID can
be transferred to the scheduling out task or to the CPU depending on the
current ownership mode.

The racy snapshot of mm_cid::mode which contains both the ownership state
and the transition bit is valid because runqueue lock is held and the fixup
function of a concurrent mode switch is serialized.

Assigning the ownership right there not only spares the bitmap access for
dropping the CID, it also avoids it when the task is scheduled back in as it
directly hits the fast path in both modes when the CID is within the
optimal range. If it's outside the range the next schedule in will need to
converge, so dropping it right away is sensible. In the good case this also
allows going into the fast path on the next schedule in operation.

With a thread pool benchmark which is configured to cross the mode switch
boundaries frequently this reduces the number of bitmap operations by about
30% and increases the fastpath utilization in the low single digit
percentage range.
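
A minimal sketch of that schedule out decision; the mm_cid::mode flags follow
the changelog, the helper names and the surrounding context are assumptions.

    /* Task schedules out with a transitional CID, runqueue lock held */
    unsigned int mode = READ_ONCE(mm->mm_cid.mode);

    if (mode & MM_CID_TRANSIT) {
        /* Fixup still in progress: drop the CID back into the pool */
        mm_drop_cid(mm, cid & ~MM_CID_TRANSIT);
    } else if (mode & MM_CID_PERCPU) {
        /* Mode is stable: hand the CID to the CPU without touching
         * the bitmap (assuming it is within the optimal range) */
        mm_cid_transfer_to_cpu(rq, cid & ~MM_CID_TRANSIT);
    } else {
        /* Mode is stable: the scheduling out task keeps the CID */
        t->mm_cid = cid & ~MM_CID_TRANSIT;
    }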

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260201192835.100194627@kernel.org
2026-02-04 12:21:12 +01:00
Thomas Gleixner 007d84287c sched/mmcid: Drop per CPU CID immediately when switching to per task mode
When an exiting task initiates the switch from per CPU back to per task
mode, it has already dropped its CID and marked itself inactive. But a
leftover from an earlier iteration of the rework then reassigns the per
CPU CID to the exiting task with the transition bit set.

That's wrong as the task is already marked CID inactive, which means it is
in an inconsistent state. It's harmless because the CID is marked in transit and
therefore dropped back into the pool when the exiting task schedules out
either through preemption or the final schedule().

Simply drop the per CPU CID when the exiting task triggered the transition.

Fixes: fbd0e71dc3 ("sched/mmcid: Provide CID ownership mode fixup functions")
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260201192835.032221009@kernel.org
2026-02-04 12:21:12 +01:00
Thomas Gleixner 47ee94efcc sched/mmcid: Protect transition on weakly ordered systems
Shrikanth reported a hard lockup which he observed once. The stack trace
shows the following CID related participants:

  watchdog: CPU 23 self-detected hard LOCKUP @ mm_get_cid+0xe8/0x188
  NIP: mm_get_cid+0xe8/0x188
  LR:  mm_get_cid+0x108/0x188
   mm_cid_switch_to+0x3c4/0x52c
   __schedule+0x47c/0x700
   schedule_idle+0x3c/0x64
   do_idle+0x160/0x1b0
   cpu_startup_entry+0x48/0x50
   start_secondary+0x284/0x288
   start_secondary_prolog+0x10/0x14

  watchdog: CPU 11 self-detected hard LOCKUP @ plpar_hcall_norets_notrace+0x18/0x2c
  NIP: plpar_hcall_norets_notrace+0x18/0x2c
  LR:  queued_spin_lock_slowpath+0xd88/0x15d0
   _raw_spin_lock+0x80/0xa0
   raw_spin_rq_lock_nested+0x3c/0xf8
   mm_cid_fixup_cpus_to_tasks+0xc8/0x28c
   sched_mm_cid_exit+0x108/0x22c
   do_exit+0xf4/0x5d0
   make_task_dead+0x0/0x178
   system_call_exception+0x128/0x390
   system_call_vectored_common+0x15c/0x2ec

The task on CPU11 is running the CID ownership mode change fixup function
and is stuck on a runqueue lock. The task on CPU23 is trying to get a CID
from the pool with the same runqueue lock held, but the pool is empty.

After decoding a similar issue in the opposite direction (switching from per
task to per CPU mode), the tool which models the possible scenarios failed to
come up with a similar loophole.

This showed up only once, was not reproducible and, according to tooling, not
related to an overlooked scheduling scenario permutation. But the fact that
it was observed on a PowerPC system gave the right hint: PowerPC is a
weakly ordered architecture.

The transition mechanism does:

    WRITE_ONCE(mm->mm_cid.transit, MM_CID_TRANSIT);
    WRITE_ONCE(mm->mm_cid.percpu, new_mode);

    fixup()

    WRITE_ONCE(mm->mm_cid.transit, 0);

mm_cid_schedin() does:

    if (!READ_ONCE(mm->mm_cid.percpu))
       ...
       cid |= READ_ONCE(mm->mm_cid.transit);

so weakly ordered systems can observe percpu == false and transit == 0 even
if the fixup function has not yet completed. As a consequence the task will
not drop the CID when scheduling out before the fixup is completed, which
means the CID space can be exhausted and the next task scheduling in will
loop in mm_get_cid() and the fixup thread can livelock on the held runqueue
lock as above.

This could obviously be solved by using:
     smp_store_release(&mm->mm_cid.percpu, true);
and
     smp_load_acquire(&mm->mm_cid.percpu);

but that brings a memory barrier back into the scheduler hotpath, which was
just designed out by the CID rewrite.

That can be completely avoided by combining the per CPU mode and the
transit storage into a single mm_cid::mode member and ordering the stores
against the fixup functions to prevent the CPU from reordering them.

That makes the update of both states atomic and a concurrent read observes
always consistent state.

The price is an additional AND operation in mm_cid_schedin() to evaluate
the per CPU or the per task path, but that's in the noise even on strongly
ordered architectures as the actual load can be significantly more
expensive and the conditional branch evaluation is there anyway.
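
A minimal sketch of the combined state word described above; the flag names
follow the changelog, their placement within mm_cid::mode is an assumption.

    /* Mode switch side: both states change with a single store which is
     * ordered against the fixup function */
    WRITE_ONCE(mm->mm_cid.mode, new_mode | MM_CID_TRANSIT);
    fixup();
    WRITE_ONCE(mm->mm_cid.mode, new_mode);

    /* mm_cid_schedin(): a single READ_ONCE() observes a consistent pair,
     * at the price of one additional AND to select the path */
    unsigned int mode = READ_ONCE(mm->mm_cid.mode);

    if (!(mode & MM_CID_PERCPU)) {
       ...
       cid |= mode & MM_CID_TRANSIT;
    }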

Fixes: fbd0e71dc3 ("sched/mmcid: Provide CID ownership mode fixup functions")
Closes: https://lore.kernel.org/bdfea828-4585-40e8-8835-247c6a8a76b0@linux.ibm.com
Reported-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260201192834.965217106@kernel.org
2026-02-04 12:21:12 +01:00
Thomas Gleixner 4327fb13fa sched/mmcid: Prevent live lock on task to CPU mode transition
Ihor reported a BPF CI failure which turned out to be a live lock in the
MM_CID management. The scenario is:

A test program creates the 5th thread, which means the MM_CID users become
more than the number of CPUs (four in this example), so it switches to per
CPU ownership mode.

At this point each live task of the program has a CID associated. Assume
CIDs were assigned in thread creation order for simplicity.

   T0     CID0  runs fork() and creates T4
   T1 	  CID1
   T2 	  CID2
   T3 	  CID3
   T4       ---   not visible yet

T0 sets mm_cid::percpu = true and transfers its own CID to CPU0 where it
runs on and then starts the fixup which walks through the threads to
transfer the per task CIDs either to the CPU the task is running on or drop
it back into the pool if the task is not on a CPU.

During that, T1 - T3 are free to schedule in and out before the fixup catches
up with them. Going through all possible permutations with a Python script
revealed a few problematic cases. The most trivial one is:

   T1 schedules in on CPU1 and observes percpu == true, so it transfers
      its CID to CPU1

   T1 is migrated to CPU2 and its schedule in observes percpu == true, but
      CPU2 does not have a CID associated and T1 transferred its own to
      CPU1

      So it has to allocate one with CPU2 runqueue lock held, but the
      pool is empty, so it keeps looping in mm_get_cid().

Now T0 reaches T1 in the thread walk and tries to lock the corresponding
runqueue lock, which is held causing a full live lock.

There is a similar scenario in the reverse direction of switching from per
CPU to task mode which is way more obvious and got therefore addressed by
an intermediate mode. In this mode the CIDs are marked with MM_CID_TRANSIT,
which means that they are neither owned by the CPU nor by the task. When a
task schedules out with a transit CID it drops the CID back into the pool
making it available for others to use temporarily. Once the task which
initiated the mode switch finished the fixup it clears the transit mode and
the process goes back into per task ownership mode.

Unfortunately this insight was not mapped back to the task to CPU mode
switch as the above described scenario was not considered in the analysis.

Apply the same transit mechanism to the task to CPU mode switch to handle
these problematic cases correctly.

As with the CPU to task transition this results in a potential temporary
contention on the CID bitmap, but that's only for the time it takes to
complete the transition. After that it stays in steady mode which does not
touch the bitmap at all.
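
A minimal sketch of the transit handling on schedule out which resolves the
scenario above; the names are assumptions based on this changelog.

    /* Task schedules out while the task to CPU fixup is still running */
    if (cid & MM_CID_TRANSIT) {
        /* The CID is neither task nor CPU owned: put it back into the
         * pool so that mm_get_cid() on another CPU cannot starve while
         * the fixup walk holds runqueue locks. */
        mm_drop_cid(mm, cid & ~MM_CID_TRANSIT);
        t->mm_cid = MM_CID_UNSET;	/* assumed "no CID" marker */
    }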

Fixes: fbd0e71dc3 ("sched/mmcid: Provide CID ownership mode fixup functions")
Closes: https://lore.kernel.org/2b7463d7-0f58-4e34-9775-6e2115cfb971@linux.dev
Reported-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260201192834.897115238@kernel.org
2026-02-04 12:21:11 +01:00
Frederic Weisbecker e894f63398 kthread: Honour kthreads preferred affinity after cpuset changes
When cpuset isolated partitions get updated, unbound kthreads become
indifferently affine to all non-isolated CPUs, regardless of their
individual affinity preferences.

For example kswapd is a per-node kthread that prefers to be affine to
the node it refers to. Whenever an isolated partition is created,
updated or deleted, kswapd's node affinity is going to be broken if any
CPU in the related node is not isolated because kswapd will be affine
globally.

Fix this by letting the consolidated kthread affinity management code do
the affinity update on behalf of cpuset.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Waiman Long <longman@redhat.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: cgroups@vger.kernel.org
2026-02-03 15:23:35 +01:00
Frederic Weisbecker f5c145ae4f cpuset: Propagate cpuset isolation update to timers through housekeeping
Until now, cpuset would propagate isolated partition changes to
timer migration so that unbound timers don't get migrated to isolated
CPUs.

Since housekeeping now centralizes, synchronizes and propagates isolation
cpumask changes, perform the work from that subsystem for consolidation
and consistency purposes.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2026-02-03 15:23:34 +01:00
Frederic Weisbecker 23f09dcc0a cpuset: Propagate cpuset isolation update to workqueue through housekeeping
Until now, cpuset would propagate isolated partition changes to
workqueues so that unbound workers get properly reaffined.

Since housekeeping now centralizes, synchronizes and propagates isolation
cpumask changes, perform the work from that subsystem for consolidation
and consistency purposes.

For simplification purposes, the target function is adapted to take the
new housekeeping mask instead of the isolated mask.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Waiman Long <longman@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: "Michal Koutný" <mkoutny@suse.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: cgroups@vger.kernel.org
2026-02-03 15:23:34 +01:00
Frederic Weisbecker 29b306c44e PCI: Flush PCI probe workqueue on cpuset isolated partition change
The HK_TYPE_DOMAIN housekeeping cpumask is now modifiable at runtime. In
order to synchronize against PCI probe works and make sure that no
asynchronous probing is still pending or executing on a newly isolated
CPU, the housekeeping subsystem must flush the PCI probe works.

However the PCI probe works can't be flushed easily since they are
queued to the main per-CPU workqueue pool.

Solve this by creating a PCI probe-specific pool and by providing and using
the appropriate flushing API.
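
A minimal sketch of that approach using the generic workqueue API; the queue
name, flags and the flush entry point called by housekeeping are assumptions.

    static struct workqueue_struct *pci_probe_wq;

    static int __init pci_probe_wq_init(void)
    {
        /* Dedicated pool so that asynchronous probing can be flushed */
        pci_probe_wq = alloc_workqueue("pci_probe_wq", WQ_UNBOUND, 0);
        return pci_probe_wq ? 0 : -ENOMEM;
    }

    /* Assumed entry point, called from housekeeping when a cpuset
     * isolated partition changes */
    void pci_probe_flush_workqueue(void)
    {
        flush_workqueue(pci_probe_wq);
    }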

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: linux-pci@vger.kernel.org
2026-02-03 15:23:34 +01:00
Frederic Weisbecker ce84ad5e99 sched/isolation: Flush vmstat workqueues on cpuset isolated partition change
The HK_TYPE_DOMAIN housekeeping cpumask is now modifiable at runtime.
In order to synchronize against the vmstat workqueue and make sure
that no asynchronous vmstat work is still pending or executing on a
newly made isolated CPU, the housekeeping subsystem must flush the
vmstat workqueues.

This involves flushing the whole mm_percpu_wq workqueue, which is shared
with the LRU drain, introducing a welcome side effect here.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: linux-mm@kvack.org
2026-02-03 15:23:34 +01:00
Frederic Weisbecker b7eb4edcc3 sched/isolation: Flush memcg workqueues on cpuset isolated partition change
The HK_TYPE_DOMAIN housekeeping cpumask is now modifiable at runtime. In
order to synchronize against the memcg workqueue and make sure that no
asynchronous draining is still pending or executing on a newly made
isolated CPU, the housekeeping subsystem must flush the memcg
workqueues.

However the memcg workqueues can't be flushed easily since they are
queued to the main per-CPU workqueue pool.

Solve this by creating a memcg-specific pool and by providing and using the
appropriate flushing API.

Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
2026-02-03 15:23:34 +01:00
Frederic Weisbecker 03ff735101 cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset
Until now, HK_TYPE_DOMAIN used to only include boot defined isolated
CPUs passed through isolcpus= boot option. Users interested in also
knowing the runtime defined isolated CPUs through cpuset must use
different APIs: cpuset_cpu_is_isolated(), cpu_is_isolated(), etc...

There are many drawbacks to that approach:

1) Most interested subsystems want to know about all isolated CPUs, not
  just those defined at boot time.

2) cpuset_cpu_is_isolated() / cpu_is_isolated() are not synchronized with
  concurrent cpuset changes.

3) Further cpuset modifications are not propagated to subsystems

Solve 1) and 2) by centralizing all isolated CPUs within the
HK_TYPE_DOMAIN housekeeping cpumask.

Subsystems can rely on RCU to synchronize against concurrent changes.

The propagation mentioned in 3) will be handled in further patches.

[Chen Ridong: Fix cpu_hotplug_lock deadlock and use correct static
branch API]
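
A minimal sketch of a reader synchronizing against runtime HK_TYPE_DOMAIN
updates via RCU as described above; the wrapper name is illustrative.

    /* The mask may now change at runtime, so sample it inside an RCU
     * read side critical section. */
    static bool cpu_in_hk_domain(int cpu)
    {
        bool ret;

        rcu_read_lock();
        ret = cpumask_test_cpu(cpu, housekeeping_cpumask(HK_TYPE_DOMAIN));
        rcu_read_unlock();

        return ret;
    }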

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Waiman Long <longman@redhat.com>
Reviewed-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Cc: "Michal Koutný" <mkoutny@suse.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: cgroups@vger.kernel.org
2026-02-03 15:23:34 +01:00
Frederic Weisbecker 27c3a5967f sched/isolation: Convert housekeeping cpumasks to rcu pointers
HK_TYPE_DOMAIN's cpumask will soon be made modifiable by cpuset.
A synchronization mechanism is then needed between the updaters and the
housekeeping cpumask readers.

Turn the housekeeping cpumasks into RCU pointers. Once a housekeeping
cpumask is modified, the update side will wait for an RCU grace
period and propagate the change to interested subsystems when deemed
necessary.
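
Roughly the access pattern this enables, sketched with hypothetical
symbol names:

  static struct cpumask __rcu *hk_domain_mask;    /* hypothetical */

  /* Reader: sample the current housekeeping mask under RCU. */
  bool hk_cpu_is_housekeeping(int cpu)
  {
          bool ret;

          rcu_read_lock();
          ret = cpumask_test_cpu(cpu, rcu_dereference(hk_domain_mask));
          rcu_read_unlock();

          return ret;
  }

  /* Updater: publish the new mask, wait a grace period, then recycle. */
  void hk_update_mask(struct cpumask *new, struct cpumask *old)
  {
          rcu_assign_pointer(hk_domain_mask, new);
          synchronize_rcu();
          /* 'old' is no longer referenced; propagate changes to subsystems. */
  }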

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
2026-02-03 15:23:33 +01:00
Frederic Weisbecker 4fca0e550d sched/isolation: Save boot defined domain flags
HK_TYPE_DOMAIN will soon integrate not only boot defined isolcpus= CPUs
but also cpuset isolated partitions.

Housekeeping still needs a way to record what was initially passed
to isolcpus= in order to keep these CPUs isolated after a cpuset
isolated partition is modified or destroyed while containing some of
them.

Create a new HK_TYPE_DOMAIN_BOOT to keep track of those.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
2026-02-03 15:23:33 +01:00
Zicheng Qu e34881c84c sched: Re-evaluate scheduling when migrating queued tasks out of throttled cgroups
Consider the following sequence on a CPU configured with nohz_full:

1) A task P runs in cgroup A, and cgroup A becomes throttled due to CFS
   bandwidth control. The gse (cgroup A) to which the task P is attached
   is dequeued and the CPU switches to idle.

2) Before cgroup A is unthrottled, task P is migrated from cgroup A to
   another cgroup B (not throttled).

   During sched_move_task(), the task P is observed as queued but not
   running, and therefore no resched_curr() is triggered.

3) Since the CPU is nohz_full, it remains in do_idle() waiting for an
   explicit scheduling event, i.e., resched_curr().

4) For kernel <= 5.10: Later, cgroup A is unthrottled. However, the task
   P has already been migrated out of cgroup A, so unthrottle_cfs_rq()
   may observe load_weight == 0 and return early without resched_curr()
   being called. For kernel >= 6.6: The unthrottling path normally
   triggers `resched_curr()` in almost all cases, even when no runnable
   tasks remain in the unthrottled cgroup, preventing the idle stall
   described above. However, if cgroup A is removed before it gets
   unthrottled, the unthrottling path for cgroup A is never executed. As
   a result, no `resched_curr()` is called.

5) At this point, the task P is runnable in cgroup B (not throttled), but
   the CPU remains in do_idle() with no pending reschedule point. The
   system stays in this state until an unrelated event (e.g. a new task
   wakeup or any other case) that can trigger a resched_curr() breaks the
   nohz_full idle state, and then the task P finally gets scheduled.

The root cause is that sched_move_task() may classify the task as only
queued, not running, and therefore fails to trigger a resched_curr(),
while the later unthrottling path no longer has visibility of the
migrated task.

Preserve the existing behavior for running tasks by issuing
resched_curr(), and explicitly invoke check_preempt_curr() for tasks
that were queued at the time of migration. This ensures that runnable
tasks are reconsidered for scheduling even when nohz_full suppresses
periodic ticks.
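
A sketch of the shape of the fix (simplified, not the exact upstream
hunk of sched_move_task()):

  /* sched_move_task(), after the task has been attached to the new group: */
  if (running) {
          /* Existing behaviour: the current task re-evaluates itself. */
          resched_curr(rq);
  } else if (queued) {
          /*
           * Runnable but not current: re-check preemption so a nohz_full
           * CPU sitting in do_idle() notices the newly runnable task.
           */
          check_preempt_curr(rq, tsk, 0);
  }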

Fixes: 29f59db3a7 ("sched: group-scheduler core")
Signed-off-by: Zicheng Qu <quzicheng@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reviewed-by: Aaron Lu <ziqianlu@bytedance.com>
Tested-by: Aaron Lu <ziqianlu@bytedance.com>
Link: https://patch.msgid.link/20260130083438.1122457-1-quzicheng@huawei.com
2026-02-03 12:04:19 +01:00
zenghongling 742fe830b7 sched/cpufreq: Use %pe format for PTR_ERR() printing
Use %pe format specifier for printing PTR_ERR() error values
to make error messages more readable.

Found by Coccinelle:
./cpufreq_schedutil.c:685:49-56: WARNING: Consider using %pe to print PTR_ERR()
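
For illustration, the kind of conversion this refers to (generic example,
not the exact hunk in cpufreq_schedutil.c; the variable name is made up):

  /* Before: prints a raw negative number, e.g. "-12". */
  pr_err("schedutil: kthread create failed: %ld\n", PTR_ERR(thread));

  /* After: %pe decodes the ERR_PTR into a symbolic name, e.g. "-ENOMEM". */
  pr_err("schedutil: kthread create failed: %pe\n", thread);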

Signed-off-by: zenghongling <zenghongling@kylinos.cn>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260120083333.148385-1-zenghongling@kylinos.cn
2026-02-03 12:04:19 +01:00
Chen Jinghuang 94894c9c47 sched/rt: Skip currently executing CPU in rto_next_cpu()
CPU0 becomes overloaded when hosting a CPU-bound RT task, a non-CPU-bound
RT task, and a CFS task stuck in kernel space. When other CPUs switch from
RT to non-RT tasks, RT load balancing (LB) is triggered; with
HAVE_RT_PUSH_IPI enabled, they send IPIs to CPU0 to drive the execution
of rto_push_irq_work_func. During push_rt_task on CPU0,
if next_task->prio < rq->donor->prio, resched_curr() sets NEED_RESCHED
and after the push operation completes, CPU0 calls rto_next_cpu().
Since only CPU0 is overloaded in this scenario, rto_next_cpu() should
ideally return -1 (no further IPI needed).

However, multiple CPUs invoking tell_cpu_to_push() during LB increment
rd->rto_loop_next. Even when rd->rto_cpu is set to -1, the mismatch between
rd->rto_loop and rd->rto_loop_next forces rto_next_cpu() to restart its
search from -1. With CPU0 remaining overloaded (satisfying rt_nr_migratory
&& rt_nr_total > 1), it gets reselected, causing CPU0 to queue irq_work to
itself and send self-IPIs repeatedly. As long as CPU0 stays overloaded and
other CPUs run pull_rt_tasks(), it falls into an infinite self-IPI loop,
which triggers a CPU hardlockup due to continuous self-interrupts.

The triggering scenario is as follows:

         cpu0                      cpu1                    cpu2
                                pull_rt_task
                              tell_cpu_to_push
                 <------------irq_work_queue_on
rto_push_irq_work_func
       push_rt_task
    resched_curr(rq)                                   pull_rt_task
    rto_next_cpu                                     tell_cpu_to_push
                      <-------------------------- atomic_inc(rto_loop_next)
rd->rto_loop != next
     rto_next_cpu
   irq_work_queue_on
rto_push_irq_work_func

Fix the redundant self-IPIs by filtering out the initiating CPU in
rto_next_cpu(). This solution has been verified to effectively eliminate
spurious self-IPIs and prevent CPU hardlockup scenarios.
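
A simplified sketch of the filtering, assuming the usual rto_next_cpu()
cpumask walk (not the exact upstream diff):

  static int rto_next_cpu(struct root_domain *rd)
  {
          int this_cpu = smp_processor_id();
          int cpu;

          /* ... existing rto_loop / rto_loop_next handling ... */

          cpu = cpumask_next(rd->rto_cpu, rd->rto_mask);
          /* Never hand the push irq_work back to the CPU already running it. */
          if (cpu == this_cpu)
                  cpu = cpumask_next(cpu, rd->rto_mask);

          if (cpu >= nr_cpu_ids)
                  return -1;

          rd->rto_cpu = cpu;
          return cpu;
  }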

Fixes: 4bdced5c9a ("sched/rt: Simplify the IPI based RT balancing logic")
Suggested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Chen Jinghuang <chenjinghuang2@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://patch.msgid.link/20260122012533.673768-1-chenjinghuang2@huawei.com
2026-02-03 12:04:19 +01:00
Wangyang Guo 505da66893 sched/clock: Avoid false sharing for sched_clock_irqtime
Read-mostly sched_clock_irqtime may share a cacheline with the
frequently updated nohz struct. Make it a static_key to avoid the
false sharing issue.

The only users of disable_sched_clock_irqtime() are tsc_.*mark_unstable(),
which may be invoked in atomic context and would require a workqueue to
disable the static_key. But both of them call clear_sched_clock_stable()
just before calling disable_sched_clock_irqtime(), so "sched_clock_work"
can be reused to also disable sched_clock_irqtime().

One additional case needs to be handled: if the TSC is marked unstable
before the late_initcall() phase, sched_clock_work will not be invoked
and sched_clock_irqtime will stay enabled although the clock is unstable:
  tsc_init()
    enable_sched_clock_irqtime() # irqtime accounting is enabled here
    ...
    if (unsynchronized_tsc()) # true
      mark_tsc_unstable()
        clear_sched_clock_stable()
          __sched_clock_stable_early = 0;
          ...
          if (static_key_count(&sched_clock_running.key) == 2)
            # Only happens at sched_clock_init_late()
            __clear_sched_clock_stable(); # Never executed
  ...

  # late_initcall() phase
  sched_clock_init_late()
    if (__sched_clock_stable_early) # Already false
      __set_sched_clock_stable(); # sched_clock is never marked stable
  # TSC unstable, but sched_clock_work won't run to disable irqtime

So disable_sched_clock_irqtime() also needs to be called from
sched_clock_init_late() if the clock is unstable.
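
A minimal sketch of the static-key conversion (helper names illustrative):

  DEFINE_STATIC_KEY_FALSE(sched_clock_irqtime);

  static inline bool irqtime_enabled(void)
  {
          /* Read-mostly: no shared cacheline with the nohz struct. */
          return static_branch_likely(&sched_clock_irqtime);
  }

  void enable_sched_clock_irqtime(void)
  {
          static_branch_enable(&sched_clock_irqtime);
  }

  /*
   * Disabling may be requested from atomic context, so piggy-back on the
   * existing sched_clock_work instead of toggling the key directly.
   */
  static void __sched_clock_work_fn(struct work_struct *work)
  {
          /* ... existing clock-unstable handling ... */
          static_branch_disable(&sched_clock_irqtime);
  }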

Reported-by: Benjamin Lei <benjamin.lei@intel.com>
Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Wangyang Guo <wangyang.guo@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Reviewed-by: Tianyou Li <tianyou.li@intel.com>
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://patch.msgid.link/20260127072509.2627346-1-wangyang.guo@intel.com
2026-02-03 12:04:19 +01:00
Peter Zijlstra 5a40a9bb56 sched/debug: Fix dl_server (re)start conditions
There are two problems with sched_server_write_common() that can cause the
dl_server to malfunction upon attempting to change the parameters:

1) when, after the dl_server has been disabled by setting runtime=0, it is
   enabled again while tasks are already enqueued. In this case, is_active
   would still be 0 and dl_server_start() would not be called.

2) when dl_server_apply_params() fails, runtime is not applied and does
   not reflect the new state.

Instead have dl_server_start() check its actual dl_runtime, and have
sched_server_write_common() unconditionally (re)start the dl_server. It will
automatically stop if there isn't anything to do, so spurious activation is
harmless -- while failing to start it is a problem.

While there, move the printk out of the locked region and make it symmetric,
also printing on enable.
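
Roughly the resulting logic, as a simplified sketch:

  void dl_server_start(struct sched_dl_entity *dl_se)
  {
          /* A disabled server (runtime == 0) must never be started. */
          if (!dl_se->dl_runtime)
                  return;
          /* ... enqueue the server entity as before ... */
  }

  /* sched_server_write_common(), once dl_server_apply_params() succeeded: */
  dl_server_stop(dl_se);
  dl_server_start(dl_se);  /* a spurious (re)start is harmless, it stops itself */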

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260203103407.GK1282955@noisy.programming.kicks-ass.net
2026-02-03 12:04:18 +01:00
Joel Fernandes 76d12132ba sched/debug: Add support to change sched_ext server params
When a sched_ext server is loaded, tasks in the fair class are
automatically moved to the sched_ext class. Add support to modify the
ext server parameters similar to how the fair server parameters are
modified.

Re-use common code between ext and fair servers as needed.

Co-developed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/20260126100050.3854740-6-arighi@nvidia.com
2026-02-03 12:04:17 +01:00
Andrea Righi cd959a3562 sched_ext: Add a DL server for sched_ext tasks
sched_ext currently suffers starvation due to RT. The same workload when
converted to EXT can get zero runtime if RT is 100% running, causing EXT
processes to stall. Fix it by adding a DL server for EXT.

A kselftest is also included later to confirm that both DL servers are
functioning correctly:

 # ./runner -t rt_stall
 ===== START =====
 TEST: rt_stall
 DESCRIPTION: Verify that RT tasks cannot stall SCHED_EXT tasks
 OUTPUT:
 TAP version 13
 1..1
 # Runtime of FAIR task (PID 1511) is 0.250000 seconds
 # Runtime of RT task (PID 1512) is 4.750000 seconds
 # FAIR task got 5.00% of total runtime
 ok 1 PASS: FAIR task got more than 4.00% of runtime
 TAP version 13
 1..1
 # Runtime of EXT task (PID 1514) is 0.250000 seconds
 # Runtime of RT task (PID 1515) is 4.750000 seconds
 # EXT task got 5.00% of total runtime
 ok 2 PASS: EXT task got more than 4.00% of runtime
 TAP version 13
 1..1
 # Runtime of FAIR task (PID 1517) is 0.250000 seconds
 # Runtime of RT task (PID 1518) is 4.750000 seconds
 # FAIR task got 5.00% of total runtime
 ok 3 PASS: FAIR task got more than 4.00% of runtime
 TAP version 13
 1..1
 # Runtime of EXT task (PID 1521) is 0.250000 seconds
 # Runtime of RT task (PID 1522) is 4.750000 seconds
 # EXT task got 5.00% of total runtime
 ok 4 PASS: EXT task got more than 4.00% of runtime
 ok 1 rt_stall #
 =====  END  =====

Co-developed-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/20260126100050.3854740-5-arighi@nvidia.com
2026-02-03 12:04:17 +01:00
Joel Fernandes 68ec89d0e9 sched/debug: Stop and start server based on if it was active
Currently the DL server interface for applying parameters checks
CFS-internals to identify if the server is active. This is error-prone
and makes it difficult when adding new servers in the future.

Fix it by using dl_server_active(), which is also used by the DL server
code to determine if the DL server was started.

Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Tejun Heo <tj@kernel.org>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/20260126100050.3854740-4-arighi@nvidia.com
2026-02-03 12:04:17 +01:00
Joel Fernandes 6080fb2116 sched/debug: Fix updating of ppos on server write ops
Updating "ppos" on error conditions does not make much sense. The pattern
is to return the error code directly without modifying the position, or
modify the position on success and return the number of bytes written.

Since the apply operation returns 0 on success, there is no point in
modifying ppos either. Fix it by removing all this and just returning the
error code, or the number of bytes written on success.
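
The resulting write-op pattern, sketched with a hypothetical helper:

  static ssize_t sched_server_write(struct file *filp, const char __user *ubuf,
                                    size_t cnt, loff_t *ppos)
  {
          int err = do_apply_params(filp, ubuf, cnt);  /* hypothetical */

          if (err)
                  return err;  /* error: leave *ppos untouched */

          return cnt;          /* success: report the bytes consumed */
  }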

Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Tejun Heo <tj@kernel.org>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/20260126100050.3854740-3-arighi@nvidia.com
2026-02-03 12:04:16 +01:00
Joel Fernandes 3cb3b27693 sched/deadline: Clear the defer params
The defer params were not cleared in __dl_clear_params. Clear them.

Without this, some of my test cases are flaky and the DL timer is not
starting correctly AFAICS.

Fixes: a110a81c52 ("sched/deadline: Deferrable dl server")
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/20260126100050.3854740-2-arighi@nvidia.com
2026-02-03 12:04:16 +01:00
Peter Zijlstra 3e4067169c Linux 6.19-rc8
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCgA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAml/zSkeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG+bwIAJ0jbbeKDyeJJxPo
 8PgScnPJx9vBL3hGpphZrhbV3GOe9bDhKM/0Xk9qMDbpAm9C6qiBMTiDWyvWv5Qi
 qzDlZfoymMaDLPMxw9WHjJ++i1Z2StNdrz57Vze98C3/iG6gBcKnUEUzvF9nigri
 HIoxoOKlbSXLPUIzt49xE7YX+CRJhLF/kXmfoauZn5ghpv+uqSpWvRbUQJa3dmc0
 S4Ie/nbPtdVHmy1Fz9LJFDOzsdhGyjzHF4kc4shDkjAs8RAr8fJh74mQHO5a3MWA
 3WZ7GAAAc4XXNqj76X2dnVlMWpQNJ4p2e+OalsuXGA6VQ7OgbrJGMX8P6dMFn5AF
 8hFsXn4=
 =IdZ1
 -----END PGP SIGNATURE-----

Merge branch 'v6.19-rc8'

Update to avoid conflicts with /urgent patches.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
2026-02-03 12:04:13 +01:00
Peter Zijlstra 1151354225 sched/deadline: Fix 'stuck' dl_server
Andrea reported the dl_server getting stuck for him. He tracked it
down to a state where dl_server_start() saw dl_defer_running==1, but
the dl_server's job is no longer valid at the time of
dl_server_start().

In the state diagram this corresponds to [4] D->A (or dl_server_stop()
due to no more runnable tasks) followed by [1], which in case of a
lapsed deadline must then be A->B.

Now our A has dl_defer_running==1, while B demands
dl_defer_running==0, therefore it must get cleared when the CBS wakeup
rules demand a replenish.

Fixes: a110a81c52 ("sched/deadline: Deferrable dl server")
Reported-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Tested-by: Andrea Righi <arighi@nvidia.com>
Link: https://lkml.kernel.org/r/20260123161645.2181752-1-arighi@nvidia.com
Link: https://patch.msgid.link/20260130124100.GC1079264@noisy.programming.kicks-ass.net
2026-01-30 23:06:06 +01:00
Vincent Guittot 15257cc2f9 sched/fair: Revert force wakeup preemption
This aggressively bypasses run_to_parity and slice protection with the
assumption that this is what the waker wants, but there is no guarantee
that the wakee will be the next to run. It is a better choice to use
yield_to_task or WF_SYNC in such a case.

This increases the number of reschedules and preemptions because a task
quickly becomes "ineligible" once it runs; the task's vruntime is updated
periodically and before the task has exhausted its slice or at least a
quantum.

Example:
Two tasks A and B wake up simultaneously with lag = 0. Both are
eligible. Task A runs first and wakes up task C. The scheduler updates
task A's vruntime, which becomes greater than the average runtime as all
others have a lag == 0 and didn't run yet. Now task A is ineligible
because it received more runtime than the other tasks, even though it
has not yet exhausted its slice nor a minimum quantum. We force
preemption and disable protection, but task B will run first, not task C.

Side note: DELAY_ZERO increases this effect by clearing positive lag at
wakeup.

Fixes: e837456fdc ("sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals")
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260123102858.52428-1-vincent.guittot@linaro.org
2026-01-23 11:53:20 +01:00
Mel Gorman 4f70f106bc sched/fair: Disable scheduler feature NEXT_BUDDY
NEXT_BUDDY was disabled with the introduction of EEVDF and enabled again
after NEXT_BUDDY was rewritten for EEVDF by commit e837456fdc ("sched/fair:
Reimplement NEXT_BUDDY to align with EEVDF goals"). It was not expected
that this would be a universal win without a crystal ball instruction,
but the reported regressions are a concern [1][2] even if gains were
also reported. Specifically:

o mysql with client/server running on different servers regresses
o specjbb reports lower peak metrics
o daytrader regresses

The mysql regression is realistic and a concern. It needs to be confirmed
whether specjbb is simply shifting the point where peak performance is
measured, but it is still a concern. daytrader is considered to be
representative of a real workload.

Access to test machines is currently problematic for verifying any fix to
this problem. Disable NEXT_BUDDY for now by default until the root causes
are addressed.
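
The change itself amounts to flipping the feature's default in
kernel/sched/features.h, roughly:

  /* Before */
  SCHED_FEAT(NEXT_BUDDY, true)

  /* After: off by default, still switchable via debugfs sched/features. */
  SCHED_FEAT(NEXT_BUDDY, false)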

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Link: https://lore.kernel.org/lkml/4b96909a-f1ac-49eb-b814-97b8adda6229@arm.com [1]
Link: https://lore.kernel.org/lkml/ec3ea66f-3a0d-4b5a-ab36-ce778f159b5b@linux.ibm.com [2]
Link: https://patch.msgid.link/fyqsk63pkoxpeaclyqsm5nwtz3dyejplr7rg6p74xwemfzdzuu@7m7xhs5aqpqw
2026-01-23 11:53:19 +01:00
Shubhang Kaushik 4b603f1551 sched: Update rq->avg_idle when a task is moved to an idle CPU
Currently, rq->idle_stamp is only used to calculate avg_idle during
wakeups. This means other paths that move a task to an idle CPU, such as
fork/clone, execve, or migrations, do not end the CPU's idle status in
the scheduler's eyes, leading to an inaccurate avg_idle.

This patch introduces update_rq_avg_idle() to provide a more accurate
measurement of CPU idle duration. By invoking this helper in
put_prev_task_idle(), we ensure avg_idle is updated whenever a CPU
stops being idle, regardless of how the new task arrived.
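
A sketch of such a helper, mirroring the existing wakeup-path logic (the
helper name follows the changelog; the body is illustrative):

  static void update_rq_avg_idle(struct rq *rq)
  {
          if (rq->idle_stamp) {
                  u64 delta = rq_clock(rq) - rq->idle_stamp;
                  u64 max = 2 * rq->max_idle_balance_cost;

                  update_avg(&rq->avg_idle, delta);
                  if (rq->avg_idle > max)
                          rq->avg_idle = max;

                  rq->idle_stamp = 0;
          }
  }

  /* Called from put_prev_task_idle(), so any exit from idle updates avg_idle. */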

Testing on an 80-core Ampere Altra (ARMv8) with 6.19-rc5 baseline:
 - Hackbench: +7.2% performance gain at 16 threads.
 - Schbench: Reduced p99.9 tail latencies at high concurrency.

Signed-off-by: Shubhang Kaushik <shubhang@os.amperecomputing.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Shubhang Kaushik <shubhang@os.amperecomputing.com>
Link: https://patch.msgid.link/20260121-v8-patch-series-v8-1-b7f1cbee5055@os.amperecomputing.com
2026-01-22 11:11:21 +01:00
Fushuai Wang 4fe82cf302 sched/debug: Convert copy_from_user() + kstrtouint() to kstrtouint_from_user()
Using kstrtouint_from_user() instead of copy_from_user() + kstrtouint()
makes the code simpler and less error-prone.
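
For illustration, the before/after shape of such a conversion (simplified):

  /* Before: copy into a local buffer, terminate, then parse. */
  char buf[16];
  unsigned int val;

  if (cnt >= sizeof(buf))
          return -EINVAL;
  if (copy_from_user(buf, ubuf, cnt))
          return -EFAULT;
  buf[cnt] = '\0';
  if (kstrtouint(buf, 10, &val))
          return -EINVAL;

  /* After: one helper does the copy, termination and parsing. */
  int ret = kstrtouint_from_user(ubuf, cnt, 10, &val);

  if (ret)
          return ret;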

Suggested-by: Yury Norov <ynorov@nvidia.com>
Signed-off-by: Fushuai Wang <wangfushuai@baidu.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Yury Norov <ynorov@nvidia.com>
Link: https://patch.msgid.link/20260117145615.53455-2-fushuai.wang@linux.dev
2026-01-22 11:11:16 +01:00
Vincent Guittot 98c88dc8a1 sched/fair: Fix pelt clock sync when entering idle
Samuel and Alex reported regressions of the util_avg of RT rq with
commit 17e3e88ed0 ("sched/fair: Fix pelt lost idle time detection").
It happens that fair updates and syncs the pelt clock with the task one
when pick_next_task_fair() fails to pick a task, but before the prev
scheduling class got a chance to update its pelt signals.

Move update_idle_rq_clock_pelt() into set_next_task_idle(), which is
called after the prev class has been called.

Fixes: 17e3e88ed0 ("sched/fair: Fix pelt lost idle time detection")
Closes: https://lore.kernel.org/all/CAG2KctpO6VKS6GN4QWDji0t92_gNBJ7HjjXrE+6H+RwRXt=iLg@mail.gmail.com/
Closes: https://lore.kernel.org/all/8cf19bf0e0054dcfed70e9935029201694f1bb5a.camel@mediatek.com/
Reported-by: Samuel Wu <wusamuel@google.com>
Reported-by: Alex Hoh <Alex.Hoh@mediatek.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Samuel Wu <wusamuel@google.com>
Tested-by: Alex Hoh <Alex.Hoh@mediatek.com>
Link: https://patch.msgid.link/20260121163317.505635-1-vincent.guittot@linaro.org
2026-01-21 17:46:08 +01:00
Shrikanth Hegde 5d86d542f6 sched/fair: Remove nohz.nr_cpus and use weight of cpumask instead
nohz.nr_cpus was observed as a contended cacheline when running an
enterprise workload on large systems.

The fundamental scalability challenge with nohz.idle_cpus_mask
and nohz.nr_cpus is the following:

 (1) nohz_balancer_kick() observes (reads) nohz.nr_cpus
     (or nohz.idle_cpus_mask) and nohz.has_blocked to see whether there's
     any nohz balancing work to do, in every scheduler tick.

 (2) nohz_balance_enter_idle() and nohz_balance_exit_idle()
     (through nohz_balancer_kick() via sched_tick()) modify (write)
     nohz.nr_cpus (and/or nohz.idle_cpus_mask) and nohz.has_blocked.

The characteristic frequencies are the following:

 (1) nohz_balancer_kick() happens at scheduler (busy) tick frequency
     on a CPU which has not gone idle. This is a relatively constant
     frequency in the ~1 kHz range or lower.

 (2) happens at idle enter/exit frequency on every CPU that goes to idle.
     This is workload dependent, but can easily be hundreds of kHz for
     IO-bound loads and high CPU counts. I.e. it can be orders of
     magnitude higher than (1), in which case a cachemiss at every
     invocation of (1) is almost inevitable. Idle exit will trigger (1)
     on the CPU which is coming out of idle.

There's two types of costs from these functions:

 (A) scheduler tick cost via (1): this happens on busy CPUs too, and is
     thus a primary scalability cost. But the rate here is constant and
     typically much lower than (B), hence the absolute benefit to workload
     scalability will be lower as well.

 (B) idle cost via (2): going-to-idle and coming-from-idle costs are
     secondary concerns, because they impact power efficiency more than
     they impact scalability. But in terms of absolute cost this scales
     up with nr_cpus as well, and a much faster rate, and thus may also
     approach and negatively impact system limits like
     memory bus/fabric bandwidth.

Note that nohz.idle_cpus_mask and nohz.nr_cpus may appear to reside in the
same cacheline, however under CONFIG_CPUMASK_OFFSTACK=y the backing storage
for nohz.idle_cpus_mask will be elsewhere. With CPUMASK_OFFSTACK=n,
the nohz.idle_cpus_mask and rest of nohz fields are in different cachelines
under typical NR_CPUS=512/2048. This implies two separate cachelines
being dirtied upon idle entry / exit.

nohz.nr_cpus can be derived from the mask itself, and its users don't
require an exactly up-to-date value for functional correctness. This
means one less cacheline being dirtied in the idle entry/exit path, which
helps save some bus bandwidth w.r.t. those nohz functions (approx 50%).
This in turn helps improve enterprise workload throughput.
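
The conversion amounts to querying the mask instead of a separate counter,
e.g. (sketch):

  /* Before: separate atomic counter, dirtied on every idle entry/exit. */
  if (unlikely(!atomic_read(&nohz.nr_cpus)))
          return;

  /* After: derive the same information from nohz.idle_cpus_mask. */
  if (unlikely(cpumask_empty(nohz.idle_cpus_mask)))
          return;

  /* Where an actual count is needed (staleness tolerated): */
  nr_idle = cpumask_weight(nohz.idle_cpus_mask);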

On system with 480 CPUs, running "hackbench 40 process 10000 loops"
(Avg of 3 runs)
baseline:
     0.81%  hackbench          [k] nohz_balance_exit_idle
     0.21%  hackbench          [k] nohz_balancer_kick
     0.09%  swapper            [k] nohz_run_idle_balance

With patch:
     0.35%  hackbench          [k] nohz_balance_exit_idle
     0.09%  hackbench          [k] nohz_balancer_kick
     0.07%  swapper            [k] nohz_run_idle_balance

[Ingo Molnar: scalability analysis changlog]

Reviewed-and-tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://patch.msgid.link/20260115073524.376643-4-sshegde@linux.ibm.com
2026-01-15 22:41:27 +01:00
Shrikanth Hegde 94e70734b4 sched/fair: Change likelyhood of nohz.nr_cpus
These days most systems have multiple cores. The likelihood of at least
one or more CPUs being in nohz (idle state) is higher.

Give an accurate hint to the branch predictor.

Reviewed-and-tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://patch.msgid.link/20260115073524.376643-3-sshegde@linux.ibm.com
2026-01-15 22:41:27 +01:00
Shrikanth Hegde 6b67c8a72e sched/fair: Move checking for nohz cpus after time check
Current code does:
- Read nohz.nr_cpus
- Check if the time has passed to do NOHZ idle balance

Instead do this:
- Check if the time has passed to do NOHZ idle balance
- Read nohz.nr_cpus

This will skip the read most of the time in normal system usage,
i.e. when there are nohz CPUs (the system is not 100% busy).

Note that when there are no idle CPUs (100% busy), even if the flag gets
set to NOHZ_STATS_KICK | NOHZ_NEXT_KICK, find_new_ilb() will fail and
there will be no NOHZ idle balance. In such cases there is a very
narrow window where kick_ilb() will be called unnecessarily. However,
the current functionality is still retained.

Note: This patch doesn't solve any cacheline overheads. There is no
improvement in performance apart from saving a few cycles by not reading
nohz.nr_cpus.

Reviewed-and-tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://patch.msgid.link/20260115073524.376643-2-sshegde@linux.ibm.com
2026-01-15 22:41:26 +01:00
Zhan Xusheng 553255cc85 sched/fair: Fix math notation errors in avg_vruntime comment
The avg_vruntime comment contains a couple of mathematical notation
issues:

 - The summation over w_i * (V - v_i) is written in an ambiguous form
 - The delta term refers to v instead of v0, which is inconsistent
   with the code and preceding explanation

Fix these to make the comment mathematically correct and consistent
with the implementation.
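
For reference, the relation the comment is meant to express, with v_0 the
per-runqueue reference vruntime (in LaTeX form):

  V = v_0 + \frac{\sum_i w_i \, (v_i - v_0)}{\sum_i w_i}

so tracking \sum_i w_i (v_i - v_0) and \sum_i w_i is enough to recover the
weighted average vruntime V.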

Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260114090035.19033-1-zhanxusheng@xiaomi.com
2026-01-15 22:41:26 +01:00
Gabriele Monaco 8d73732016 sched: Fix build for modules using set_tsk_need_resched()
Commit adcc3bfa88 ("sched: Adapt sched tracepoints for RV task model")
added a tracepoint to the need_resched action that can also be triggered
by set_tsk_need_resched().
This function was previously accessible from out-of-tree modules but
it's no longer available because the __trace_set_need_resched() symbol
is not exported (together with the tracepoint itself, which was exported
in a separate patch) and building such modules fails.

Export __trace_set_need_resched to modules to fix those build issues.
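
The fix boils down to exporting the helper next to its definition, e.g.:

  /* Allow (out-of-tree) modules using set_tsk_need_resched() to link. */
  EXPORT_SYMBOL_GPL(__trace_set_need_resched);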

Fixes: adcc3bfa88 ("sched: Adapt sched tracepoints for RV task model")
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Link: https://patch.msgid.link/20260112140413.362202-1-gmonaco@redhat.com
2026-01-15 22:41:26 +01:00
Peter Zijlstra 627cc25f84 sched/deadline: Use ENQUEUE_MOVE to allow priority change
Pierre reported hitting balance callback warnings for deadline tasks
after commit 6455ad5346 ("sched: Move sched_class::prio_changed()
into the change pattern").

It turns out that DEQUEUE_SAVE+ENQUEUE_RESTORE does not preserve DL
priority and subsequently trips a balance pass -- where one was not
expected.

From discussion with Juri and Luca, the purpose of this clause was to
deal with tasks new to DL, and all those sites will have MOVE set (as
well as CLASS, but MOVE is more conservative at this point).

Per the previous patches MOVE is audited to always run the balance
callbacks, so switch enqueue_dl_entity() to use MOVE for this case.

Fixes: 6455ad5346 ("sched: Move sched_class::prio_changed() into the change pattern")
Reported-by: Pierre Gondois <pierre.gondois@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Pierre Gondois <pierre.gondois@arm.com>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://patch.msgid.link/20260114130528.GB831285@noisy.programming.kicks-ass.net
2026-01-15 21:57:53 +01:00
Peter Zijlstra e008ec6c79 sched: Deadline has dynamic priority
While FIFO/RR have static priority, DEADLINE is a dynamic priority
scheme. Notably it has static priority -1. Do not assume the priority
doesn't change for deadline tasks just because the static priority
doesn't change.

This ensures DL always sees {DE,EN}QUEUE_MOVE where appropriate.

Fixes: ff77e46853 ("sched/rt: Fix PI handling vs. sched_setscheduler()")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Pierre Gondois <pierre.gondois@arm.com>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://patch.msgid.link/20260114130528.GB831285@noisy.programming.kicks-ass.net
2026-01-15 21:57:53 +01:00
Peter Zijlstra 53439363c0 sched: Audit MOVE vs balance_callbacks
The {DE,EN}QUEUE_MOVE flag indicates a task is allowed to change
priority, which means there could be balance callbacks queued.

Therefore audit all MOVE users and make sure they do run balance
callbacks before dropping rq-lock.

Fixes: 6455ad5346 ("sched: Move sched_class::prio_changed() into the change pattern")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Pierre Gondois <pierre.gondois@arm.com>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://patch.msgid.link/20260114130528.GB831285@noisy.programming.kicks-ass.net
2026-01-15 21:57:53 +01:00
Peter Zijlstra 49041e87f9 sched: Fold rq-pin swizzle into __balance_callbacks()
Prepare for more users needing the rq-pin swizzle.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Pierre Gondois <pierre.gondois@arm.com>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://patch.msgid.link/20260114130528.GB831285@noisy.programming.kicks-ass.net
2026-01-15 21:57:52 +01:00
Peter Zijlstra 4de9ff7606 sched/deadline: Avoid double update_rq_clock()
When setup_new_dl_entity() is called from enqueue_task_dl() ->
enqueue_dl_entity(), the rq-clock should already be updated, and
calling update_rq_clock() again is not right.

Move the update_rq_clock() to the one other caller of
setup_new_dl_entity(): sched_init_dl_server().

Fixes: 9f239df555 ("sched/deadline: Initialize dl_servers after SMP")
Reported-by: Pierre Gondois <pierre.gondois@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Pierre Gondois <pierre.gondois@arm.com>
Link: https://patch.msgid.link/20260113115622.GA831285@noisy.programming.kicks-ass.net
2026-01-15 21:57:52 +01:00
Peter Zijlstra 375410bb9a sched/deadline: Ensure get_prio_dl() is up-to-date
Pratheek tripped a WARN and noted the following issue:

> Inspecting the set of events that led to the warning being triggered
> showed the following:
>
>     systemd-1  [008] dN.31 ...: do_set_cpus_allowed: set_cpus_allowed begin!
>
>     systemd-1  [008] dN.31 ...: sched_change_begin: Begin!
>     systemd-1  [008] dN.31 ...: sched_change_begin: Before dequeue_task()!
>     systemd-1  [008] dN.31 ...: update_curr_dl_se: update_curr_dl_se: ENQUEUE_REPLENISH
>     systemd-1  [008] dN.31 ...: enqueue_dl_entity: enqueue_dl_entity: ENQUEUE_REPLENISH
>     systemd-1  [008] dN.31 ...: replenish_dl_entity: Replenish before: 14815760217
>     systemd-1  [008] dN.31 ...: replenish_dl_entity: Replenish after: 14816960047
>     systemd-1  [008] dN.31 ...: sched_change_begin: Before put_prev_task()!
>
>     systemd-1  [008] dN.31 ...: sched_change_end: Before enqueue_task()!
>     systemd-1  [008] dN.31 ...: sched_change_end: Before put_prev_task()!
>     systemd-1  [008] dN.31 ...: prio_changed_dl: Queuing pull task on prio change: 14815760217 -> 14816960047
>     systemd-1  [008] dN.31 ...: prio_changed_dl: Queuing balance callback!
>     systemd-1  [008] dN.31 ...: sched_change_end: End!
>
>     systemd-1  [008] dN.31 ...: do_set_cpus_allowed: set_cpus_allowed end!
>     systemd-1  [008] dN.21 ...: __schedule: Woops! Balance callback found!
>
> 1. sched_change_begin() from guard(sched_change) in
>    do_set_cpus_allowed() stashes the priority, which for the deadline
>    task, is "p->dl.deadline".
> 2. The dequeue of the deadline task replenishes the deadline.
> 3. The task is enqueued back after guard's scope ends and since there is
>    no *_CLASS flags set, sched_change_end() calls
>    dl_sched_class->prio_changed() which compares the deadline.
> 4. Since deadline was moved on dequeue, prio_changed_dl() sees the value
>    differ from the stashed value and queues a balance pull callback.
> 5. do_set_cpus_allowed() finishes and drops the rq_lock without doing a
>    do_balance_callbacks().
> 6. Grabbing the rq_lock() at subsequent __schedule() triggers the
>    warning since the balance pull callback was never executed before
>    dropping the lock.

Meaning get_prio_dl() ought to update current and return an up-to-date
value.

Fixes: 6455ad5346 ("sched: Move sched_class::prio_changed() into the change pattern")
Reported-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://patch.msgid.link/20260106104113.GX3707891@noisy.programming.kicks-ass.net
2026-01-15 21:57:52 +01:00