Commit Graph

339 Commits (09cfd3c52ea76f43b3cb15e570aeddf633d65e80)

Author SHA1 Message Date
Linus Torvalds 6c7340a7a8 Scheduler updates for v6.18:
Core scheduler changes:
 
  - Make migrate_{en,dis}able() inline, to improve performance
    (Menglong Dong)
 
  - Move STDL_INIT() functions out-of-line (Peter Zijlstra)
 
  - Unify the SCHED_{SMT,CLUSTER,MC} Kconfig (Peter Zijlstra)
 
 Fair scheduling:
 
  - Defer throttling when tasks exit to user-space, to reduce the
    chance & impact of throttle-preemption with held locks and
    other resources. (Aaron Lu, Valentin Schneider)
 
  - Get rid of sched_domains_curr_level hack for tl->cpumask(),
    as the warning was getting triggered on certain topologies.
    (Peter Zijlstra)
 
 Misc cleanups & fixes:
 
  - Header cleanups (Menglong Dong)
 
  - Fix race in push_dl_task() (Harshit Agarwal)
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmjWn0IRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1i9ng/+KJYEsNilPA6nGzZLurU3HXm6S5NrNBPc
 xJU43eo3QdFUymKqYaCoXCwaMah3m8fFW+DC8ZoEvWC5wHnlIw6+u/ZEtsWQ4R8N
 8+sjQddQeb9Gez+nkC7pXX8h1nqlhHyGWU88+9hOyEEMk21KgH7tJ5gOuGQdPhiA
 v24D8dkS1sT6CAeqZAycYIkos+kfNNV+wAEQCXAxfvSSslpzJVZcj9kdQInxF4O4
 9+WrvFZFVYJWdJgmgfbpBMw7N8a4Gpv+Lr6U0ZVXHNeLjNdn60I+5Me+9BW9GRcO
 pPL50RfGgGgVIqyEHvBhhz8p0tlKUJo9NkRJ9hmNB5SKT3P442SU789ppTYgI1il
 5P4g3TiuomqPVuo99/mYHYOpT6NkYC1fkwMP9Hfk4NRUX0EA6lKXelVnMXaC3Z7R
 U+0xVYtK4BBLRFBo4HIEM1HChSIcOnutuufFX4lPPt+ewnAmJwSt619w+lBF0v+n
 cpiLIlQ74LrYlrULw3bznnwzh5dV8XHm4CT3XAShm6qB97/SaLr0rI5W7njdSvte
 ym9z4yFWidawowvLWjFX1CykSnX/T5MV+uzuIH64MEIA39B4VKJWCAC3l3AMJn/5
 Ckhf93B5uG0q+wUtE66x/JrN7wtbz9PMpwh01fXNWs/eWc1VKkVrvq+PmHODngmz
 drSUoI1P570=
 =8ZTZ
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2025-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:
 "Core scheduler changes:

   - Make migrate_{en,dis}able() inline, to improve performance
     (Menglong Dong)

   - Move STDL_INIT() functions out-of-line (Peter Zijlstra)

   - Unify the SCHED_{SMT,CLUSTER,MC} Kconfig (Peter Zijlstra)

  Fair scheduling:

   - Defer throttling to when tasks exit to user-space, to reduce the
     chance & impact of throttle-preemption with held locks and other
     resources (Aaron Lu, Valentin Schneider)

   - Get rid of sched_domains_curr_level hack for tl->cpumask(), as the
     warning was getting triggered on certain topologies (Peter
     Zijlstra)

  Misc cleanups & fixes:

   - Header cleanups (Menglong Dong)

   - Fix race in push_dl_task() (Harshit Agarwal)"

* tag 'sched-core-2025-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Fix some typos in include/linux/preempt.h
  sched: Make migrate_{en,dis}able() inline
  rcu: Replace preempt.h with sched.h in include/linux/rcupdate.h
  arch: Add the macro COMPILE_OFFSETS to all the asm-offsets.c
  sched/fair: Do not balance task to a throttled cfs_rq
  sched/fair: Do not special case tasks in throttled hierarchy
  sched/fair: update_cfs_group() for throttled cfs_rqs
  sched/fair: Propagate load for throttled cfs_rq
  sched/fair: Get rid of throttled_lb_pair()
  sched/fair: Task based throttle time accounting
  sched/fair: Switch to task based throttle model
  sched/fair: Implement throttle task work and related helpers
  sched/fair: Add related data structure for task based throttle
  sched: Unify the SCHED_{SMT,CLUSTER,MC} Kconfig
  sched: Move STDL_INIT() functions out-of-line
  sched/fair: Get rid of sched_domains_curr_level hack for tl->cpumask()
  sched/deadline: Fix race in push_dl_task()
2025-09-30 10:35:11 -07:00
Peter Zijlstra a3a70caf79 sched/deadline: Fix dl_server behaviour
John reported undesirable behaviour with the dl_server since commit:
cccb45d7c4 ("sched/deadline: Less agressive dl_server handling").

When starving fair tasks on purpose (starting spinning FIFO tasks),
his fair workload, which often goes (briefly) idle, would delay fair
invocations for a second, running one invocation per second was both
unexpected and terribly slow.

The reason this happens is that when dl_se->server_pick_task() returns
NULL, indicating no runnable tasks, it would yield, pushing any later
jobs out a whole period (1 second).

Instead simply stop the server. This should restore behaviour in that
a later wakeup (which restarts the server) will be able to continue
running (subject to the CBS wakeup rules).

Notably, this does not re-introduce the behaviour cccb45d7c4 set
out to solve, any start/stop cycle is naturally throttled by the timer
period (no active cancel).

Fixes: cccb45d7c4 ("sched/deadline: Less agressive dl_server handling")
Reported-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: John Stultz <jstultz@google.com>
2025-09-25 09:51:50 +02:00
Peter Zijlstra 4ae8d9aa9f sched/deadline: Fix dl_server getting stuck
John found it was easy to hit lockup warnings when running locktorture
on a 2 CPU VM, which he bisected down to: commit cccb45d7c4
("sched/deadline: Less agressive dl_server handling").

While debugging it seems there is a chance where we end up with the
dl_server dequeued, with dl_se->dl_server_active. This causes
dl_server_start() to return without enqueueing the dl_server, thus it
fails to run when RT tasks starve the cpu.

When this happens, dl_server_timer() catches the
'!dl_se->server_has_tasks(dl_se)' case, which then calls
replenish_dl_entity() and dl_server_stopped() and finally return
HRTIMER_NO_RESTART.

This ends in no new timer and also no enqueue, leaving the dl_server
'dead', allowing starvation.

What should have happened is for the bandwidth timer to start the
zero-laxity timer, which in turn would enqueue the dl_server and cause
dl_se->server_pick_task() to be called -- which will stop the
dl_server if no fair tasks are observed for a whole period.

IOW, it is totally irrelevant if there are fair tasks at the moment of
bandwidth refresh.

This removes all dl_se->server_has_tasks() users, so remove the whole
thing.

Fixes: cccb45d7c4 ("sched/deadline: Less agressive dl_server handling")
Reported-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: John Stultz <jstultz@google.com>
2025-09-25 09:51:50 +02:00
Harshit Agarwal 8fd5485fb4 sched/deadline: Fix race in push_dl_task()
When a CPU chooses to call push_dl_task and picks a task to push to
another CPU's runqueue then it will call find_lock_later_rq method
which would take a double lock on both CPUs' runqueues. If one of the
locks aren't readily available, it may lead to dropping the current
runqueue lock and reacquiring both the locks at once. During this window
it is possible that the task is already migrated and is running on some
other CPU. These cases are already handled. However, if the task is
migrated and has already been executed and another CPU is now trying to
wake it up (ttwu) such that it is queued again on the runqeue
(on_rq is 1) and also if the task was run by the same CPU, then the
current checks will pass even though the task was migrated out and is no
longer in the pushable tasks list.
Please go through the original rt change for more details on the issue.

To fix this, after the lock is obtained inside the find_lock_later_rq,
it ensures that the task is still at the head of pushable tasks list.
Also removed some checks that are no longer needed with the addition of
this new check.
However, the new check of pushable tasks list only applies when
find_lock_later_rq is called by push_dl_task. For the other caller i.e.
dl_task_offline_migration, existing checks are used.

Signed-off-by: Harshit Agarwal <harshit@nutanix.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250408045021.3283624-1-harshit@nutanix.com
2025-09-03 10:03:12 +02:00
Yicong Yang 52d15521eb sched/deadline: Don't count nr_running for dl_server proxy tasks
On CPU offline the kernel stalled with below call trace:

  INFO: task kworker/0:1:11 blocked for more than 120 seconds.

cpuhp hold the cpu hotplug lock endless and stalled vmstat_shepherd.
This is because we count nr_running twice on cpuhp enqueuing and failed
the wait condition of cpuhp:

  enqueue_task_fair() // pick cpuhp from idle, rq->nr_running = 0
    dl_server_start()
      [...]
      add_nr_running() // rq->nr_running = 1
    add_nr_running() // rq->nr_running = 2
  [switch to cpuhp, waiting on balance_hotplug_wait()]
  rcuwait_wait_event(rq->nr_running == 1 && ...) // failed, rq->nr_running=2
    schedule() // wait again

It doesn't make sense to count the dl_server towards runnable tasks,
since it runs other tasks.

Fixes: 63ba8422f8 ("sched/deadline: Introduce deadline servers")
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250627035420.37712-1-yangyicong@huawei.com
2025-08-26 10:46:01 +02:00
kuyo chang 421fc59cf5 sched/deadline: Fix RT task potential starvation when expiry time passed
[Symptom]
The fair server mechanism, which is intended to prevent fair starvation
when higher-priority tasks monopolize the CPU.
Specifically, RT tasks on the runqueue may not be scheduled as expected.

[Analysis]
The log "sched: DL replenish lagged too much" triggered.

By memory dump of dl_server:
    curr = 0xFFFFFF80D6A0AC00 (
      dl_server = 0xFFFFFF83CD5B1470(
        dl_runtime = 0x02FAF080,
        dl_deadline = 0x3B9ACA00,
        dl_period = 0x3B9ACA00,
        dl_bw = 0xCCCC,
        dl_density = 0xCCCC,
        runtime = 0x02FAF080,
        deadline = 0x0000082031EB0E80,
        flags = 0x0,
        dl_throttled = 0x0,
        dl_yielded = 0x0,
        dl_non_contending = 0x0,
        dl_overrun = 0x0,
        dl_server = 0x1,
        dl_server_active = 0x1,
        dl_defer = 0x1,
        dl_defer_armed = 0x0,
        dl_defer_running = 0x1,
        dl_timer = (
          node = (
            expires = 0x000008199756E700),
          _softexpires = 0x000008199756E700,
          function = 0xFFFFFFDB9AF44D30 = dl_task_timer,
          base = 0xFFFFFF83CD5A12C0,
          state = 0x0,
          is_rel = 0x0,
          is_soft = 0x0,
    clock_update_flags = 0x4,
    clock = 0x000008204A496900,

 - The timer expiration time (rq->curr->dl_server->dl_timer->expires)
   is already in the past, indicating the timer has expired.
 - The timer state (rq->curr->dl_server->dl_timer->state) is 0.

[Suspected Root Cause]
The relevant code flow in the throttle path of
update_curr_dl_se() as follows:

  dequeue_dl_entity(dl_se, 0);                // the DL entity is dequeued

  if (unlikely(is_dl_boosted(dl_se) || !start_dl_timer(dl_se))) {
      if (dl_server(dl_se))                   // timer registration fails
          enqueue_dl_entity(dl_se, ENQUEUE_REPLENISH);//enqueue immediately
      ...
  }

The failure of `start_dl_timer` is caused by attempting to register a
timer with an expiration time that is already in the past. When this
situation persists, the code repeatedly re-enqueues the DL entity
without properly replenishing or restarting the timer, resulting in RT
task may not be scheduled as expected.

[Proposed Solution]:
Instead of immediately re-enqueuing the DL entity on timer registration
failure, this change ensures the DL entity is properly replenished and
the timer is restarted, preventing RT potential starvation.

Fixes: 63ba8422f8 ("sched/deadline: Introduce deadline servers")
Signed-off-by: kuyo chang <kuyo.chang@mediatek.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Closes: https://lore.kernel.org/CAMuHMdXn4z1pioTtBGMfQM0jsLviqS2jwysaWXpoLxWYoGa82w@mail.gmail.com
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Jiri Slaby <jirislaby@kernel.org>
Tested-by: Diederik de Haas <didi.debian@cknow.org>
Link: https://lkml.kernel.org/r/20250615131129.954975-1-kuyo.chang@mediatek.com
2025-08-26 10:46:01 +02:00
Huacai Chen 4717432dfd sched/deadline: Fix dl_server_stopped()
Commit cccb45d7c4 ("sched/deadline: Less agressive dl_server handling")
introduces dl_server_stopped(). But it is obvious that dl_server_stopped()
should return true if dl_se->dl_server_active is 0.

Fixes: cccb45d7c4 ("sched/deadline: Less agressive dl_server handling")
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250809130419.1980742-1-chenhuacai@loongson.cn
2025-08-26 10:46:00 +02:00
Valentin Schneider be39617e38 sched: Fix proxy/current (push,pull)ability
Proxy execution forms atomic pairs of tasks: The waiting donor
task (scheduling context) and a proxy (execution context). The
donor task, along with the rest of the blocked chain, follows
the proxy wrt CPU placement.

They can be the same task, in which case push/pull doesn't need any
modification. When they are different, however,
FIFO1 & FIFO42:

	      ,->  RT42
	      |     | blocked-on
	      |     v
blocked_donor |   mutex
	      |     | owner
	      |     v
	      `--  RT1

   RT1
   RT42

  CPU0            CPU1
   ^                ^
   |                |
  overloaded    !overloaded
  rq prio = 42  rq prio = 0

RT1 is eligible to be pushed to CPU1, but should that happen it will
"carry" RT42 along. Clearly here neither RT1 nor RT42 must be seen as
push/pullable.

Unfortunately, only the donor task is usually dequeued from the rq,
and the proxy'ed execution context (rq->curr) remains on the rq.
This can cause RT1 to be selected for migration from logic like the
rt pushable_list.

Thus, adda a dequeue/enqueue cycle on the proxy task before __schedule
returns, which allows the sched class logic to avoid adding the now
current task to the pushable_list.

Furthermore, tasks becoming blocked on a mutex don't need an explicit
dequeue/enqueue cycle to be made (push/pull)able: they have to be running
to block on a mutex, thus they will eventually hit put_prev_task().

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Connor O'Brien <connoro@google.com>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lkml.kernel.org/r/20250712033407.2383110-8-jstultz@google.com
2025-07-14 17:16:33 +02:00
Peter Zijlstra 8f2146159b Merge branch 'tip/sched/urgent'
Avoid merge conflicts

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
2025-07-14 17:16:28 +02:00
Juri Lelli 440989c10f sched/deadline: Fix accounting after global limits change
A global limits change (sched_rt_handler() logic) currently leaves stale
and/or incorrect values in variables related to accounting (e.g.
extra_bw).

Properly clean up per runqueue variables before implementing the change
and rebuild scheduling domains (so that accounting is also properly
restored) after such a change is complete.

Reported-by: Marcel Ziswiler <marcel.ziswiler@codethink.co.uk>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> # nuc & rock5b
Link: https://lore.kernel.org/r/20250627115118.438797-4-juri.lelli@redhat.com
2025-07-14 10:59:33 +02:00
Juri Lelli fcc9276c4d sched/deadline: Reset extra_bw to max_bw when clearing root domains
dl_clear_root_domain() doesn't take into account the fact that per-rq
extra_bw variables retain values computed before root domain changes,
resulting in broken accounting.

Fix it by resetting extra_bw to max_bw before restoring back dl-servers
contributions.

Fixes: 2ff899e351 ("sched/deadline: Rebuild root domain accounting after every update")
Reported-by: Marcel Ziswiler <marcel.ziswiler@codethink.co.uk>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> # nuc & rock5b
Link: https://lore.kernel.org/r/20250627115118.438797-3-juri.lelli@redhat.com
2025-07-14 10:59:32 +02:00
Juri Lelli 9f239df555 sched/deadline: Initialize dl_servers after SMP
dl-servers are currently initialized too early at boot when CPUs are not
fully up (only boot CPU is). This results in miscalculation of per
runqueue DEADLINE variables like extra_bw (which needs a stable CPU
count).

Move initialization of dl-servers later on after SMP has been
initialized and CPUs are all online, so that CPU count is stable and
DEADLINE variables can be computed correctly.

Fixes: d741f297bc ("sched/fair: Fair server interface")
Reported-by: Marcel Ziswiler <marcel.ziswiler@codethink.co.uk>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Tested-by: Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> # nuc & rock5b
Link: https://lore.kernel.org/r/20250627115118.438797-2-juri.lelli@redhat.com
2025-07-14 10:59:32 +02:00
Peter Zijlstra cccb45d7c4 sched/deadline: Less agressive dl_server handling
Chris reported that commit 5f6bd380c7 ("sched/rt: Remove default
bandwidth control") caused a significant dip in his favourite
benchmark of the day. Simply disabling dl_server cured things.

His workload hammers the 0->1, 1->0 transitions, and the
dl_server_{start,stop}() overhead kills it -- fairly obviously a bad
idea in hind sight and all that.

Change things around to only disable the dl_server when there has not
been a fair task around for a whole period. Since the default period
is 1 second, this ensures the benchmark never trips this, overhead
gone.

Fixes: 557a6bfc66 ("sched/fair: Add trivial fair server")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lkml.kernel.org/r/20250702121158.465086194@infradead.org
2025-07-09 13:40:21 +02:00
kuyo chang fc975cfb36 sched/deadline: Fix dl_server runtime calculation formula
In our testing with 6.12 based kernel on a big.LITTLE system, we were
seeing instances of RT tasks being blocked from running on the LITTLE
cpus for multiple seconds of time, apparently by the dl_server. This
far exceeds the default configured 50ms per second runtime.

This is due to the fair dl_server runtime calculation being scaled
for frequency & capacity of the cpu.

Consider the following case under a Big.LITTLE architecture:
Assume the runtime is: 50,000,000 ns, and Frequency/capacity
scale-invariance defined as below:
Frequency scale-invariance: 100
Capacity scale-invariance: 50
First by Frequency scale-invariance,
the runtime is scaled to 50,000,000 * 100 >> 10 = 4,882,812
Then by capacity scale-invariance,
it is further scaled to 4,882,812 * 50 >> 10 = 238,418.
So it will scaled to 238,418 ns.

This smaller "accounted runtime" value is what ends up being
subtracted against the fair-server's runtime for the current period.
Thus after 50ms of real time, we've only accounted ~238us against the
fair servers runtime. This 209:1 ratio in this example means that on
the smaller cpu the fair server is allowed to continue running,
blocking RT tasks, for over 10 seconds before it exhausts its supposed
50ms of runtime.  And on other hardware configurations it can be even
worse.

For the fair deadline_server, to prevent realtime tasks from being
unexpectedly delayed, we really do want to use fixed time, and not
scaled time for smaller capacity/frequency cpus. So remove the scaling
from the fair server's accounting to fix this.

Fixes: a110a81c52 ("sched/deadline: Deferrable dl server")
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: John Stultz <jstultz@google.com>
Signed-off-by: kuyo chang <kuyo.chang@mediatek.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: John Stultz <jstultz@google.com>
Tested-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/r/20250702021440.2594736-1-kuyo.chang@mediatek.com
2025-07-04 10:35:56 +02:00
Ingo Molnar 6324dce8f6 sched/smp: Use the SMP version of the deadline scheduling class
Simplify the scheduler by making CONFIG_SMP=y code
in prio_changed_dl() unconditional.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250528080924.2273858-30-mingo@kernel.org
2025-06-13 08:47:20 +02:00
Ingo Molnar cac5cefbad sched/smp: Make SMP unconditional
Simplify the scheduler by making CONFIG_SMP=y primitives and data
structures unconditional.

Introduce transitory wrappers for functionality not yet converted to SMP.

Note that this patch is pretty large, because there's no clear separation
between various aspects of the SMP scheduler, it's basically a huge block
of #ifdef CONFIG_SMP. A fair amount of it has to be switched on for it to
boot and work on UP systems.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250528080924.2273858-21-mingo@kernel.org
2025-06-13 08:47:18 +02:00
Ingo Molnar c503c3dc2d sched: Clean up and standardize #if/#else/#endif markers in sched/deadline.c
- Use the standard #ifdef marker format for larger blocks,
   where appropriate:

        #if CONFIG_FOO
        ...
        #else /* !CONFIG_FOO: */
        ...
        #endif /* !CONFIG_FOO */

 - Fix whitespace noise and other inconsistencies.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250528080924.2273858-8-mingo@kernel.org
2025-06-13 08:47:16 +02:00
Peter Zijlstra 3d7e10188a sched: Make clangd usable
Due to the weird Makefile setup of sched the various files do not
compile as stand alone units. The new generation of editors are trying
to do just this -- mostly to offer fancy things like completions but
also better syntax highlighting and code navigation.

Specifically, I've been playing around with neovim and clangd.

Setting up clangd on the kernel source is a giant pain in the arse
(this really should be improved), but once you do manage, you run into
dumb stuff like the above.

Fix up the scheduler files to at least pretend to work.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lkml.kernel.org/r/20250523164348.GN39944@noisy.programming.kicks-ass.net
2025-06-11 11:20:53 +02:00
Linus Torvalds a50b4fe095 A treewide hrtimer timer cleanup
hrtimers are initialized with hrtimer_init() and a subsequent store to
   the callback pointer. This turned out to be suboptimal for the upcoming
   Rust integration and is obviously a silly implementation to begin with.
 
   This cleanup replaces the hrtimer_init(T); T->function = cb; sequence
   with hrtimer_setup(T, cb);
 
   The conversion was done with Coccinelle and a few manual fixups.
 
   Once the conversion has completely landed in mainline, hrtimer_init()
   will be removed and the hrtimer::function becomes a private member.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmff5jQTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoVvRD/wKtuwmiA66NJFgXC0qVq82A6fO3bY8
 GBdbfysDJIbqGu5PTcULTbJ8qkqv3jeLUv6CcXvS4sZ7y/uJQl2lzf8yrD/0bbwc
 rLI6sHiPSZmK93kNVN4X5H7kvt7cE/DYC9nnEOgK3BY5FgKc4n9887d4aVBhL8Lv
 ODwVXvZ+xi351YCj7qRyPU24zt/p4tkkT1o2k4a0HBluqLI0D+V20fke9IERUL8r
 d1uWKlcn0TqYDesE8HXKIhbst3gx52rMJrXBJDHwFmG6v8Pj1fkTXCVpPo8QcBz8
 OTVkpomN9f/Tx4+GZwhZOF86LhLL3OhxD6pT7JhFCXdmSGv+Ez8uyk1YZysM/XpV
 Juy/1yAcBpDIDkmhMFGdAAn48Nn9Fotty0r4je60zSEp1d/4QMXcFme29qr2JTUE
 iWnQ/HD6DxUjVHqy7CYvvo26Xegg1C7qgyOVt4PYZwAM1VKF5P3kzYTb4SAdxtop
 Tpji1sfW9QV08jqMNo6XntD32DSP9S2HqjO9LwBw700jnx2jjJ35fcJs6iodMOUn
 gckIZLMn3L0OoglPdyA5O7SNTbKE7aFiRKdnT/cJtR3Fa39Qu27CwC5gfiyuie9I
 Q+LG8GLuYSBHXAR+PBK4GWlzJ7Dn8k3eqmbnLeKpRMsU6ZzcttgA64xhaviN2wN0
 iJbvLJeisXr3GA==
 =bYAX
 -----END PGP SIGNATURE-----

Merge tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer cleanups from Thomas Gleixner:
 "A treewide hrtimer timer cleanup

  hrtimers are initialized with hrtimer_init() and a subsequent store to
  the callback pointer. This turned out to be suboptimal for the
  upcoming Rust integration and is obviously a silly implementation to
  begin with.

  This cleanup replaces the hrtimer_init(T); T->function = cb; sequence
  with hrtimer_setup(T, cb);

  The conversion was done with Coccinelle and a few manual fixups.

  Once the conversion has completely landed in mainline, hrtimer_init()
  will be removed and the hrtimer::function becomes a private member"

* tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (100 commits)
  wifi: rt2x00: Switch to use hrtimer_update_function()
  io_uring: Use helper function hrtimer_update_function()
  serial: xilinx_uartps: Use helper function hrtimer_update_function()
  ASoC: fsl: imx-pcm-fiq: Switch to use hrtimer_setup()
  RDMA: Switch to use hrtimer_setup()
  virtio: mem: Switch to use hrtimer_setup()
  drm/vmwgfx: Switch to use hrtimer_setup()
  drm/xe/oa: Switch to use hrtimer_setup()
  drm/vkms: Switch to use hrtimer_setup()
  drm/msm: Switch to use hrtimer_setup()
  drm/i915/request: Switch to use hrtimer_setup()
  drm/i915/uncore: Switch to use hrtimer_setup()
  drm/i915/pmu: Switch to use hrtimer_setup()
  drm/i915/perf: Switch to use hrtimer_setup()
  drm/i915/gvt: Switch to use hrtimer_setup()
  drm/i915/huc: Switch to use hrtimer_setup()
  drm/amdgpu: Switch to use hrtimer_setup()
  stm class: heartbeat: Switch to use hrtimer_setup()
  i2c: Switch to use hrtimer_setup()
  iio: Switch to use hrtimer_setup()
  ...
2025-03-25 10:54:15 -07:00
Ingo Molnar dd5bdaf2b7 sched/debug: Make CONFIG_SCHED_DEBUG functionality unconditional
All the big Linux distros enable CONFIG_SCHED_DEBUG, because
the various features it provides help not just with kernel
development, but with system administration and user-space
software development as well.

Reflect this reality and enable this functionality
unconditionally.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250317104257.3496611-4-mingo@kernel.org
2025-03-19 22:20:53 +01:00
Ingo Molnar f7d2728cc0 sched/debug: Change SCHED_WARN_ON() to WARN_ON_ONCE()
The scheduler has this special SCHED_WARN() facility that
depends on CONFIG_SCHED_DEBUG.

Since CONFIG_SCHED_DEBUG is getting removed, convert
SCHED_WARN() to WARN_ON_ONCE().

Note that the warning output isn't 100% equivalent:

   #define SCHED_WARN_ON(x)      WARN_ONCE(x, #x)

Because SCHED_WARN_ON() would output the 'x' condition
as well, while WARN_ONCE() will only show a backtrace.

Hopefully these are rare enough to not really matter.

If it does, we should probably introduce a new WARN_ON()
variant that outputs the condition in stringified form,
or improve WARN_ON() itself.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250317104257.3496611-2-mingo@kernel.org
2025-03-19 22:20:53 +01:00
Juri Lelli 2ff899e351 sched/deadline: Rebuild root domain accounting after every update
Rebuilding of root domains accounting information (total_bw) is
currently broken on some cases, e.g. suspend/resume on aarch64. Problem
is that the way we keep track of domain changes and try to add bandwidth
back is convoluted and fragile.

Fix it by simplify things by making sure bandwidth accounting is cleared
and completely restored after root domains changes (after root domains
are again stable).

To be sure we always call dl_rebuild_rd_accounting while holding
cpuset_mutex we also add cpuset_reset_sched_domains() wrapper.

Fixes: 53916d5fd3 ("sched/deadline: Check bandwidth overflow earlier for hotplug")
Reported-by: Jon Hunter <jonathanh@nvidia.com>
Co-developed-by: Waiman Long <llong@redhat.com>
Signed-off-by: Waiman Long <llong@redhat.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/Z9MRfeJKJUOyUSto@jlelli-thinkpadt14gen4.remote.csb
2025-03-17 11:23:42 +01:00
Juri Lelli 45007c6fb5 sched/deadline: Generalize unique visiting of root domains
Bandwidth checks and updates that work on root domains currently employ
a cookie mechanism for efficiency. This mechanism is very much tied to
when root domains are first created and initialized.

Generalize the cookie mechanism so that it can be used also later at
runtime while updating root domains. Also, additionally guard it with
sched_domains_mutex, since domains need to be stable while updating them
(and it will be required for further dynamic changes).

Fixes: 53916d5fd3 ("sched/deadline: Check bandwidth overflow earlier for hotplug")
Reported-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Waiman Long <longman@redhat.com>
Tested-by: Jon Hunter <jonathanh@nvidia.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/Z9MQaiXPvEeW_v7x@jlelli-thinkpadt14gen4.remote.csb
2025-03-17 11:23:41 +01:00
Juri Lelli f6147af176 sched/deadline: Ignore special tasks when rebuilding domains
SCHED_DEADLINE special tasks get a fake bandwidth that is only used to
make sure sleeping and priority inheritance 'work', but it is ignored
for runtime enforcement and admission control.

Be consistent with it also when rebuilding root domains.

Fixes: 53916d5fd3 ("sched/deadline: Check bandwidth overflow earlier for hotplug")
Reported-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Waiman Long <longman@redhat.com>
Tested-by: Jon Hunter <jonathanh@nvidia.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20250313170011.357208-2-juri.lelli@redhat.com
2025-03-17 11:23:41 +01:00
Shrikanth Hegde 14672f059d sched/deadline: Use online cpus for validating runtime
The ftrace selftest reported a failure because writing -1 to
sched_rt_runtime_us returns -EBUSY. This happens when the possible
CPUs are different from active CPUs.

Active CPUs are part of one root domain, while remaining CPUs are part
of def_root_domain. Since active cpumask is being used, this results in
cpus=0 when a non active CPUs is used in the loop.

Fix it by looping over the online CPUs instead for validating the
bandwidth calculations.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/20250306052954.452005-2-sshegde@linux.ibm.com
2025-03-06 10:21:31 +01:00
Nam Cao ee13da875b sched: Switch to use hrtimer_setup()
hrtimer_setup() takes the callback function pointer as argument and
initializes the timer completely.

Replace hrtimer_init() and the open coded initialization of
hrtimer::function with the new setup mechanism.

Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/a55e849cba3c41b4c5708be6ea6be6f337d1a8fb.1738746821.git.namcao@linutronix.de
2025-02-18 10:32:32 +01:00
Joel Granados 1751f872cc treewide: const qualify ctl_tables where applicable
Add the const qualifier to all the ctl_tables in the tree except for
watchdog_hardlockup_sysctl, memory_allocation_profiling_sysctls,
loadpin_sysctl_table and the ones calling register_net_sysctl (./net,
drivers/inifiniband dirs). These are special cases as they use a
registration function with a non-const qualified ctl_table argument or
modify the arrays before passing them on to the registration function.

Constifying ctl_table structs will prevent the modification of
proc_handler function pointers as the arrays would reside in .rodata.
This is made possible after commit 78eb4ea25c ("sysctl: treewide:
constify the ctl_table argument of proc_handlers") constified all the
proc_handlers.

Created this by running an spatch followed by a sed command:
Spatch:
    virtual patch

    @
    depends on !(file in "net")
    disable optional_qualifier
    @

    identifier table_name != {
      watchdog_hardlockup_sysctl,
      iwcm_ctl_table,
      ucma_ctl_table,
      memory_allocation_profiling_sysctls,
      loadpin_sysctl_table
    };
    @@

    + const
    struct ctl_table table_name [] = { ... };

sed:
    sed --in-place \
      -e "s/struct ctl_table .table = &uts_kern/const struct ctl_table *table = \&uts_kern/" \
      kernel/utsname_sysctl.c

Reviewed-by: Song Liu <song@kernel.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> # for kernel/trace/
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> # SCSI
Reviewed-by: Darrick J. Wong <djwong@kernel.org> # xfs
Acked-by: Jani Nikula <jani.nikula@intel.com>
Acked-by: Corey Minyard <cminyard@mvista.com>
Acked-by: Wei Liu <wei.liu@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Acked-by: Baoquan He <bhe@redhat.com>
Acked-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Acked-by: Anna Schumaker <anna.schumaker@oracle.com>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
2025-01-28 13:48:37 +01:00
Linus Torvalds 62de6e1685 Scheduler enhancements for v6.14:
- Fair scheduler (SCHED_FAIR) enhancements:
 
    - Behavioral improvements:
      - Untangle NEXT_BUDDY and pick_next_task() (Peter Zijlstra)
 
    - Delayed-dequeue enhancements & fixes: (Vincent Guittot)
 
      - Rename h_nr_running into h_nr_queued
      - Add new cfs_rq.h_nr_runnable
      - Use the new cfs_rq.h_nr_runnable
      - Removed unsued cfs_rq.h_nr_delayed
      - Rename cfs_rq.idle_h_nr_running into h_nr_idle
      - Remove unused cfs_rq.idle_nr_running
      - Rename cfs_rq.nr_running into nr_queued
      - Do not try to migrate delayed dequeue task
      - Fix variable declaration position
      - Encapsulate set custom slice in a __setparam_fair() function
 
    - Fixes:
      - Fix race between yield_to() and try_to_wake_up() (Tianchen Ding)
      - Fix CPU bandwidth limit bypass during CPU hotplug (Vishal Chourasia)
 
    - Cleanups:
      - Clean up in migrate_degrades_locality() to improve
        readability (Peter Zijlstra)
      - Mark m*_vruntime() with __maybe_unused (Andy Shevchenko)
      - Update comments after sched_tick() rename (Sebastian Andrzej Siewior)
      - Remove CONFIG_CFS_BANDWIDTH=n definition of cfs_bandwidth_used()
        (Valentin Schneider)
 
  - Deadline scheduler (SCHED_DL) enhancements:
 
    - Restore dl_server bandwidth on non-destructive root domain
      changes (Juri Lelli)
 
    - Correctly account for allocated bandwidth during
      hotplug (Juri Lelli)
 
    - Check bandwidth overflow earlier for hotplug (Juri Lelli)
 
    - Clean up goto label in pick_earliest_pushable_dl_task()
      (John Stultz)
 
    - Consolidate timer cancellation (Wander Lairson Costa)
 
  - Load-balancer enhancements:
 
    - Improve performance by prioritizing migrating eligible
      tasks in sched_balance_rq() (Hao Jia)
 
    - Do not compute NUMA Balancing stats unnecessarily during
      load-balancing (K Prateek Nayak)
 
    - Do not compute overloaded status unnecessarily during
      load-balancing (K Prateek Nayak)
 
  - Generic scheduling code enhancements:
 
    - Use READ_ONCE() in task_on_rq_queued(), to consistently use
      the WRITE_ONCE() updated ->on_rq field (Harshit Agarwal)
 
  - Isolated CPUs support enhancements: (Waiman Long)
 
    - Make "isolcpus=nohz" equivalent to "nohz_full"
    - Consolidate housekeeping cpumasks that are always identical
    - Remove HK_TYPE_SCHED
    - Unify HK_TYPE_{TIMER|TICK|MISC} to HK_TYPE_KERNEL_NOISE
 
  - RSEQ enhancements:
 
    - Validate read-only fields under DEBUG_RSEQ config
      (Mathieu Desnoyers)
 
  - PSI enhancements:
 
    - Fix race when task wakes up before psi_sched_switch()
      adjusts flags (Chengming Zhou)
 
  - IRQ time accounting performance enhancements: (Yafang Shao)
 
    - Define sched_clock_irqtime as static key
    - Don't account irq time if sched_clock_irqtime is disabled
 
  - Virtual machine scheduling enhancements:
 
    - Don't try to catch up excess steal time (Suleiman Souhlal)
 
  - Heterogenous x86 CPU scheduling enhancements: (K Prateek Nayak)
 
    - Convert "sysctl_sched_itmt_enabled" to boolean
    - Use guard() for itmt_update_mutex
    - Move the "sched_itmt_enabled" sysctl to debugfs
    - Remove x86_smt_flags and use cpu_smt_flags directly
    - Use x86_sched_itmt_flags for PKG domain unconditionally
 
  - Debugging code & instrumentation enhancements:
 
    - Change need_resched warnings to pr_err() (David Rientjes)
    - Print domain name in /proc/schedstat (K Prateek Nayak)
    - Fix value reported by hot tasks pulled in /proc/schedstat (Peter Zijlstra)
    - Report the different kinds of imbalances in /proc/schedstat (Swapnil Sapkal)
    - Move sched domain name out of CONFIG_SCHED_DEBUG (Swapnil Sapkal)
    - Update Schedstat version to 17 (Swapnil Sapkal)
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmePSRcRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1hrdBAAjYiLl5Q8SHM0xnl+kbvuUkCTgEB/gSgA
 mfrZtHRUgRZuA89NZ9NljlCkQSlsLTOjnpNuaeFzs529GMg9iemc99dbnz3BP5F3
 V5qpYvWe7yIkJ3hd0TOGLmYEPMNQaAW57YBOrxcPjWNLJ4cr9iMdccVA1OQtcmqD
 ZUh3nibv81QI8HDmT2G+figxEIqH3yBV1+SmEIxbrdkQpIJ5702Ng6+0KQK5TShN
 xwjFELWZUl2TfkoCc4nkIpkImV6cI1DvXSw1xK6gbb1xEVOrsmFW3TYFw4trKHBu
 2RBG4wtmzNjh+12GmSdIBJHogPNcay+JIJW9EG/unT7jirqzkkeP1X2eJEbh+X1L
 CMa7GsD9Vy72jCzeJDMuiy7bKfG/MiKUtDXrAZQDo2atbw7H88QOzMuTE5a5WSV+
 tRxXGI/dgFVOk+JQUfctfJbYeXjmG8GAflawvXtGDAfDZsja6M+65fH8p0AOgW1E
 HHmXUzAe2E2xQBiSok/DYHPQeCDBAjoJvU93YhGiXv8UScb2UaD4BAfzfmc8P+Zs
 Eox6444ah5U0jiXmZ3HU707n1zO+Ql4qKoyyMJzSyP+oYHE/Do7NYTElw2QovVdN
 FX/9Uae8T4ttA/5lFe7FNoXgKvSxXDKYyKLZcysjVrWJF866Ui/TWtmxA6w8Osn7
 sfucuLawLPM=
 =5ZNW
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:
 "Fair scheduler (SCHED_FAIR) enhancements:

   - Behavioral improvements:
      - Untangle NEXT_BUDDY and pick_next_task() (Peter Zijlstra)

   - Delayed-dequeue enhancements & fixes: (Vincent Guittot)
      - Rename h_nr_running into h_nr_queued
      - Add new cfs_rq.h_nr_runnable
      - Use the new cfs_rq.h_nr_runnable
      - Removed unsued cfs_rq.h_nr_delayed
      - Rename cfs_rq.idle_h_nr_running into h_nr_idle
      - Remove unused cfs_rq.idle_nr_running
      - Rename cfs_rq.nr_running into nr_queued
      - Do not try to migrate delayed dequeue task
      - Fix variable declaration position
      - Encapsulate set custom slice in a __setparam_fair() function

   - Fixes:
      - Fix race between yield_to() and try_to_wake_up() (Tianchen Ding)
      - Fix CPU bandwidth limit bypass during CPU hotplug (Vishal
        Chourasia)

   - Cleanups:
      - Clean up in migrate_degrades_locality() to improve readability
        (Peter Zijlstra)
      - Mark m*_vruntime() with __maybe_unused (Andy Shevchenko)
      - Update comments after sched_tick() rename (Sebastian Andrzej
        Siewior)
      - Remove CONFIG_CFS_BANDWIDTH=n definition of cfs_bandwidth_used()
        (Valentin Schneider)

  Deadline scheduler (SCHED_DL) enhancements:

   - Restore dl_server bandwidth on non-destructive root domain changes
     (Juri Lelli)

   - Correctly account for allocated bandwidth during hotplug (Juri
     Lelli)

   - Check bandwidth overflow earlier for hotplug (Juri Lelli)

   - Clean up goto label in pick_earliest_pushable_dl_task() (John
     Stultz)

   - Consolidate timer cancellation (Wander Lairson Costa)

  Load-balancer enhancements:

   - Improve performance by prioritizing migrating eligible tasks in
     sched_balance_rq() (Hao Jia)

   - Do not compute NUMA Balancing stats unnecessarily during
     load-balancing (K Prateek Nayak)

   - Do not compute overloaded status unnecessarily during
     load-balancing (K Prateek Nayak)

  Generic scheduling code enhancements:

   - Use READ_ONCE() in task_on_rq_queued(), to consistently use the
     WRITE_ONCE() updated ->on_rq field (Harshit Agarwal)

  Isolated CPUs support enhancements: (Waiman Long)

   - Make "isolcpus=nohz" equivalent to "nohz_full"
   - Consolidate housekeeping cpumasks that are always identical
   - Remove HK_TYPE_SCHED
   - Unify HK_TYPE_{TIMER|TICK|MISC} to HK_TYPE_KERNEL_NOISE

  RSEQ enhancements:

   - Validate read-only fields under DEBUG_RSEQ config (Mathieu
     Desnoyers)

  PSI enhancements:

   - Fix race when task wakes up before psi_sched_switch() adjusts flags
     (Chengming Zhou)

  IRQ time accounting performance enhancements: (Yafang Shao)

   - Define sched_clock_irqtime as static key
   - Don't account irq time if sched_clock_irqtime is disabled

  Virtual machine scheduling enhancements:

   - Don't try to catch up excess steal time (Suleiman Souhlal)

  Heterogenous x86 CPU scheduling enhancements: (K Prateek Nayak)

   - Convert "sysctl_sched_itmt_enabled" to boolean
   - Use guard() for itmt_update_mutex
   - Move the "sched_itmt_enabled" sysctl to debugfs
   - Remove x86_smt_flags and use cpu_smt_flags directly
   - Use x86_sched_itmt_flags for PKG domain unconditionally

  Debugging code & instrumentation enhancements:

   - Change need_resched warnings to pr_err() (David Rientjes)
   - Print domain name in /proc/schedstat (K Prateek Nayak)
   - Fix value reported by hot tasks pulled in /proc/schedstat (Peter
     Zijlstra)
   - Report the different kinds of imbalances in /proc/schedstat
     (Swapnil Sapkal)
   - Move sched domain name out of CONFIG_SCHED_DEBUG (Swapnil Sapkal)
   - Update Schedstat version to 17 (Swapnil Sapkal)"

* tag 'sched-core-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (48 commits)
  rseq: Fix rseq unregistration regression
  psi: Fix race when task wakes up before psi_sched_switch() adjusts flags
  sched, psi: Don't account irq time if sched_clock_irqtime is disabled
  sched: Don't account irq time if sched_clock_irqtime is disabled
  sched: Define sched_clock_irqtime as static key
  sched/fair: Do not compute overloaded status unnecessarily during lb
  sched/fair: Do not compute NUMA Balancing stats unnecessarily during lb
  x86/topology: Use x86_sched_itmt_flags for PKG domain unconditionally
  x86/topology: Remove x86_smt_flags and use cpu_smt_flags directly
  x86/itmt: Move the "sched_itmt_enabled" sysctl to debugfs
  x86/itmt: Use guard() for itmt_update_mutex
  x86/itmt: Convert "sysctl_sched_itmt_enabled" to boolean
  sched/core: Prioritize migrating eligible tasks in sched_balance_rq()
  sched/debug: Change need_resched warnings to pr_err
  sched/fair: Encapsulate set custom slice in a __setparam_fair() function
  sched: Fix race between yield_to() and try_to_wake_up()
  docs: Update Schedstat version to 17
  sched/stats: Print domain name in /proc/schedstat
  sched: Move sched domain name out of CONFIG_SCHED_DEBUG
  sched: Report the different kinds of imbalances in /proc/schedstat
  ...
2025-01-21 11:32:36 -08:00
Vineeth Pillai (Google) b53127db1d sched/dlserver: Fix dlserver double enqueue
dlserver can get dequeued during a dlserver pick_task due to the delayed
deueue feature and this can lead to issues with dlserver logic as it
still thinks that dlserver is on the runqueue. The dlserver throttling
and replenish logic gets confused and can lead to double enqueue of
dlserver.

Double enqueue of dlserver could happend due to couple of reasons:

Case 1
------

Delayed dequeue feature[1] can cause dlserver being stopped during a
pick initiated by dlserver:
  __pick_next_task
   pick_task_dl -> server_pick_task
    pick_task_fair
     pick_next_entity (if (sched_delayed))
      dequeue_entities
       dl_server_stop

server_pick_task goes ahead with update_curr_dl_se without knowing that
dlserver is dequeued and this confuses the logic and may lead to
unintended enqueue while the server is stopped.

Case 2
------
A race condition between a task dequeue on one cpu and same task's enqueue
on this cpu by a remote cpu while the lock is released causing dlserver
double enqueue.

One cpu would be in the schedule() and releasing RQ-lock:

current->state = TASK_INTERRUPTIBLE();
        schedule();
          deactivate_task()
            dl_stop_server();
          pick_next_task()
            pick_next_task_fair()
              sched_balance_newidle()
                rq_unlock(this_rq)

at which point another CPU can take our RQ-lock and do:

        try_to_wake_up()
          ttwu_queue()
            rq_lock()
            ...
            activate_task()
              dl_server_start() --> first enqueue
            wakeup_preempt() := check_preempt_wakeup_fair()
              update_curr()
                update_curr_task()
                  if (current->dl_server)
                    dl_server_update()
                      enqueue_dl_entity() --> second enqueue

This bug was not apparent as the enqueue in dl_server_start doesn't
usually happen because of the defer logic. But as a side effect of the
first case(dequeue during dlserver pick), dl_throttled and dl_yield will
be set and this causes the time accounting of dlserver to messup and
then leading to a enqueue in dl_server_start.

Have an explicit flag representing the status of dlserver to avoid the
confusion. This is set in dl_server_start and reset in dlserver_stop.

Fixes: 63ba8422f8 ("sched/deadline: Introduce deadline servers")
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: "Vineeth Pillai (Google)" <vineeth@bitbyteword.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> # ROCK 5B
Link: https://lkml.kernel.org/r/20241213032244.877029-1-vineeth@bitbyteword.org
2024-12-13 12:57:34 +01:00
John Stultz 7675361ff9 sched: deadline: Cleanup goto label in pick_earliest_pushable_dl_task
Commit 8b5e770ed7 ("sched/deadline: Optimize pull_dl_task()")
added a goto label seems would be better written as a while
loop.

So replace the goto with a while loop, to make it easier to read.

Reported-by: Todd Kjos <tkjos@google.com>
Signed-off-by: John Stultz <jstultz@google.com>
Reviewed-and-tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/20241206000009.1226085-1-jstultz@google.com
2024-12-10 15:07:06 +01:00
Wander Lairson Costa 3a181f20fb sched/deadline: Consolidate Timer Cancellation
After commit b58652db66 ("sched/deadline: Fix task_struct reference
leak"), I identified additional calls to hrtimer_try_to_cancel that
might also require a dl_server check. It remains unclear whether this
omission was intentional or accidental in those contexts.

This patch consolidates the timer cancellation logic into dedicated
functions, ensuring consistent behavior across all calls.
Additionally, it reduces code duplication and improves overall code
cleanliness.

Note the use of the __always_inline keyword. In some instances, we
have a task_struct pointer, dereference the dl member, and then use
the container_of macro to retrieve the task_struct pointer again. By
inlining the code, the compiler can potentially optimize out this
redundant round trip.

Signed-off-by: Wander Lairson Costa <wander@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/20240724142253.27145-3-wander@redhat.com
2024-12-02 12:01:31 +01:00
Juri Lelli 53916d5fd3 sched/deadline: Check bandwidth overflow earlier for hotplug
Currently we check for bandwidth overflow potentially due to hotplug
operations at the end of sched_cpu_deactivate(), after the cpu going
offline has already been removed from scheduling, active_mask, etc.
This can create issues for DEADLINE tasks, as there is a substantial
race window between the start of sched_cpu_deactivate() and the moment
we possibly decide to roll-back the operation if dl_bw_deactivate()
returns failure in cpuset_cpu_inactive(). An example is a throttled
task that sees its replenishment timer firing while the cpu it was
previously running on is considered offline, but before
dl_bw_deactivate() had a chance to say no and roll-back happened.

Fix this by directly calling dl_bw_deactivate() first thing in
sched_cpu_deactivate() and do the required calculation in the former
function considering the cpu passed as an argument as offline already.

By doing so we also simplify sched_cpu_deactivate(), as there is no need
anymore for any kind of roll-back if we fail early.

Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Tested-by: Waiman Long <longman@redhat.com>
Link: https://lore.kernel.org/r/Zzc1DfPhbvqDDIJR@jlelli-thinkpadt14gen4.remote.csb
2024-12-02 12:01:31 +01:00
Juri Lelli d4742f6ed7 sched/deadline: Correctly account for allocated bandwidth during hotplug
For hotplug operations, DEADLINE needs to check that there is still enough
bandwidth left after removing the CPU that is going offline. We however
fail to do so currently.

Restore the correct behavior by restructuring dl_bw_manage() a bit, so
that overflow conditions (not enough bandwidth left) are properly
checked. Also account for dl_server bandwidth, i.e. discount such
bandwidth in the calculation since NORMAL tasks will be anyway moved
away from the CPU as a result of the hotplug operation.

Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Tested-by: Waiman Long <longman@redhat.com>
Link: https://lore.kernel.org/r/20241114142810.794657-3-juri.lelli@redhat.com
2024-12-02 12:01:31 +01:00
Juri Lelli 41d4200b71 sched/deadline: Restore dl_server bandwidth on non-destructive root domain changes
When root domain non-destructive changes (e.g., only modifying one of
the existing root domains while the rest is not touched) happen we still
need to clear DEADLINE bandwidth accounting so that it's then properly
restored, taking into account DEADLINE tasks associated to each cpuset
(associated to each root domain). After the introduction of dl_servers,
we fail to restore such servers contribution after non-destructive
changes (as they are only considered on destructive changes when
runqueues are attached to the new domains).

Fix this by making sure we iterate over the dl_servers attached to
domains that have not been destroyed and add their bandwidth
contribution back correctly.

Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Tested-by: Waiman Long <longman@redhat.com>
Link: https://lore.kernel.org/r/20241114142810.794657-2-juri.lelli@redhat.com
2024-12-02 12:01:30 +01:00
Wander Lairson Costa 0664e2c311 sched/deadline: Fix warning in migrate_enable for boosted tasks
When running the following command:

while true; do
    stress-ng --cyclic 30 --timeout 30s --minimize --quiet
done

a warning is eventually triggered:

WARNING: CPU: 43 PID: 2848 at kernel/sched/deadline.c:794
setup_new_dl_entity+0x13e/0x180
...
Call Trace:
 <TASK>
 ? show_trace_log_lvl+0x1c4/0x2df
 ? enqueue_dl_entity+0x631/0x6e0
 ? setup_new_dl_entity+0x13e/0x180
 ? __warn+0x7e/0xd0
 ? report_bug+0x11a/0x1a0
 ? handle_bug+0x3c/0x70
 ? exc_invalid_op+0x14/0x70
 ? asm_exc_invalid_op+0x16/0x20
 enqueue_dl_entity+0x631/0x6e0
 enqueue_task_dl+0x7d/0x120
 __do_set_cpus_allowed+0xe3/0x280
 __set_cpus_allowed_ptr_locked+0x140/0x1d0
 __set_cpus_allowed_ptr+0x54/0xa0
 migrate_enable+0x7e/0x150
 rt_spin_unlock+0x1c/0x90
 group_send_sig_info+0xf7/0x1a0
 ? kill_pid_info+0x1f/0x1d0
 kill_pid_info+0x78/0x1d0
 kill_proc_info+0x5b/0x110
 __x64_sys_kill+0x93/0xc0
 do_syscall_64+0x5c/0xf0
 entry_SYSCALL_64_after_hwframe+0x6e/0x76
 RIP: 0033:0x7f0dab31f92b

This warning occurs because set_cpus_allowed dequeues and enqueues tasks
with the ENQUEUE_RESTORE flag set. If the task is boosted, the warning
is triggered. A boosted task already had its parameters set by
rt_mutex_setprio, and a new call to setup_new_dl_entity is unnecessary,
hence the WARN_ON call.

Check if we are requeueing a boosted task and avoid calling
setup_new_dl_entity if that's the case.

Fixes: 295d6d5e37 ("sched/deadline: Fix switching to -deadline")
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/20240724142253.27145-2-wander@redhat.com
2024-12-02 12:01:29 +01:00
Juri Lelli 22368fe1f9 sched/deadline: Fix replenish_dl_new_period dl_server condition
The condition in replenish_dl_new_period() that checks if a reservation
(dl_server) is deferred and is not handling a starvation case is
obviously wrong.

Fix it.

Fixes: a110a81c52 ("sched/deadline: Deferrable dl server")
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20241127063740.8278-1-juri.lelli@redhat.com
2024-12-02 12:01:27 +01:00
Peter Zijlstra af0c8b2bf6 sched: Split scheduler and execution contexts
Let's define the "scheduling context" as all the scheduler state
in task_struct for the task chosen to run, which we'll call the
donor task, and the "execution context" as all state required to
actually run the task.

Currently both are intertwined in task_struct. We want to
logically split these such that we can use the scheduling
context of the donor task selected to be scheduled, but use
the execution context of a different task to actually be run.

To this purpose, introduce rq->donor field to point to the
task_struct chosen from the runqueue by the scheduler, and will
be used for scheduler state, and preserve rq->curr to indicate
the execution context of the task that will actually be run.

This patch introduces the donor field as a union with curr, so it
doesn't cause the contexts to be split yet, but adds the logic to
handle everything separately.

[add additional comments and update more sched_class code to use
 rq::proxy]
[jstultz: Rebased and resolved minor collisions, reworked to use
 accessors, tweaked update_curr_common to use rq_proxy fixing rt
 scheduling issues]

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Connor O'Brien <connoro@google.com>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Metin Kaya <metin.kaya@arm.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Metin Kaya <metin.kaya@arm.com>
Link: https://lore.kernel.org/r/20241009235352.1614323-8-jstultz@google.com
2024-10-14 12:52:42 +02:00
Connor O'Brien 18adad1dac sched: Consolidate pick_*_task to task_is_pushable helper
This patch consolidates rt and deadline pick_*_task functions to
a task_is_pushable() helper

This patch was broken out from a larger chain migration
patch originally by Connor O'Brien.

[jstultz: split out from larger chain migration patch,
 renamed helper function]

Signed-off-by: Connor O'Brien <connoro@google.com>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Metin Kaya <metin.kaya@arm.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Metin Kaya <metin.kaya@arm.com>
Link: https://lore.kernel.org/r/20241009235352.1614323-6-jstultz@google.com
2024-10-14 12:52:41 +02:00
Connor O'Brien 2b05a0b4c0 sched: Add move_queued_task_locked helper
Switch logic that deactivates, sets the task cpu,
and reactivates a task on a different rq to use a
helper that will be later extended to push entire
blocked task chains.

This patch was broken out from a larger chain migration
patch originally by Connor O'Brien.

[jstultz: split out from larger chain migration patch]
Signed-off-by: Connor O'Brien <connoro@google.com>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Metin Kaya <metin.kaya@arm.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Qais Yousef <qyousef@layalina.io>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Metin Kaya <metin.kaya@arm.com>
Link: https://lore.kernel.org/r/20241009235352.1614323-5-jstultz@google.com
2024-10-14 12:52:41 +02:00
Phil Auld d16b7eb6f5 sched/deadline: Use hrtick_enabled_dl() before start_hrtick_dl()
The deadline server code moved one of the start_hrtick_dl() calls
but dropped the dl specific hrtick_enabled check. This causes hrticks
to get armed even when sched_feat(HRTICK_DL) is false. Fix it.

Fixes: 63ba8422f8 ("sched/deadline: Introduce deadline servers")
Signed-off-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/20241004123729.460668-1-pauld@redhat.com
2024-10-11 10:49:32 +02:00
Peter Zijlstra b2d70222db sched: Add put_prev_task(.next)
In order to tell the previous sched_class what the next task is, add
put_prev_task(.next).

Notable SCX will use this to:

 1) determine the next task will leave the SCX sched class and push
    the current task to another CPU if possible.
 2) statistics on how often and which other classes preempt it

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240813224016.367421076@infradead.org
2024-09-03 15:26:32 +02:00
Peter Zijlstra bd9bbc96e8 sched: Rework dl_server
When a task is selected through a dl_server, it will have p->dl_server
set, such that it can account runtime to the dl_server, see
update_curr_task().

Currently p->dl_server is set in pick*task() whenever it goes through
the dl_server, clearing it is a bit of a mess though. The trivial
solution is clearing it on the final put (now that we have this
location).

However, this gives a problem when:

	p = pick_task(rq);
	if (p)
		put_prev_set_next_task(rq, prev, next);

picks the same task but through a different path, notably when it goes
from picking through the dl_server to a direct pick or vice-versa. In
that case we cannot readily determine wether we should clear or
preserve p->dl_server.

An additional complication is pick_*task() setting p->dl_server for a
remote pick, it might still need to update runtime before it schedules
the core_pick.

Close all these holes and remove all the random clearing of
p->dl_server by:

 - having pick_*task() manage rq->dl_server

 - having the final put_prev_task() clear p->dl_server

 - having the first set_next_task() set p->dl_server = rq->dl_server

 - complicate the core_sched code to save/restore rq->dl_server where
   appropriate.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240813224016.259853414@infradead.org
2024-09-03 15:26:32 +02:00
Peter Zijlstra fd03c5b858 sched: Rework pick_next_task()
The current rule is that:

  pick_next_task() := pick_task() + set_next_task(.first = true)

And many classes implement it directly as such. Change things around
to make pick_next_task() optional while also changing the definition to:

  pick_next_task(prev) := pick_task() + put_prev_task() + set_next_task(.first = true)

The reason is that sched_ext would like to have a 'final' call that
knows the next task. By placing put_prev_task() right next to
set_next_task() (as it already is for sched_core) this becomes
trivial.

As a bonus, this is a nice cleanup on its own.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240813224016.051225657@infradead.org
2024-09-03 15:26:31 +02:00
Peter Zijlstra 4686cc598f sched: Clean up DL server vs core sched
Abide by the simple rule:

  pick_next_task() := pick_task() + set_next_task(.first = true)

This allows us to trivially get rid of server_pick_next() and things
collapse nicely.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240813224015.837303391@infradead.org
2024-09-03 15:26:31 +02:00
Peter Zijlstra dae4320b29 sched: Fixup set_next_task() implementations
The rule is that:

  pick_next_task() := pick_task() + set_next_task(.first = true)

Turns out, there's still a few things in pick_next_task() that are
missing from that combination.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240813224015.724111109@infradead.org
2024-09-03 15:26:30 +02:00
Huang Shijie 9c602adb79 sched/deadline: Fix schedstats vs deadline servers
In dl_server_start(), when schedstats is enabled, the following
happens:

  dl_server_start()
    dl_se->dl_server = 1;
    enqueue_dl_entity()
      update_stats_enqueue_dl()
        __schedstats_from_dl_se()
          dl_task_of()
            BUG_ON(dl_server(dl_se));

Since only tasks have schedstats and internal entries do not, avoid
trying to update stats in this case.

Fixes: 63ba8422f8 ("sched/deadline: Introduce deadline servers")
Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lkml.kernel.org/r/20240829031111.12142-1-shijie@os.amperecomputing.com
2024-09-03 15:26:30 +02:00
Peter Zijlstra 152e11f6df sched/fair: Implement delayed dequeue
Extend / fix 86bfbb7ce4 ("sched/fair: Add lag based placement") by
noting that lag is fundamentally a temporal measure. It should not be
carried around indefinitely.

OTOH it should also not be instantly discarded, doing so will allow a
task to game the system by purposefully (micro) sleeping at the end of
its time quantum.

Since lag is intimately tied to the virtual time base, a wall-time
based decay is also insufficient, notably competition is required for
any of this to make sense.

Instead, delay the dequeue and keep the 'tasks' on the runqueue,
competing until they are eligible.

Strictly speaking, we only care about keeping them until the 0-lag
point, but that is a difficult proposition, instead carry them around
until they get picked again, and dequeue them at that point.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/20240727105030.226163742@infradead.org
2024-08-17 11:06:44 +02:00
Peter Zijlstra 863ccdbb91 sched: Allow sched_class::dequeue_task() to fail
Change the function signature of sched_class::dequeue_task() to return
a boolean, allowing future patches to 'fail' dequeue.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/20240727105028.864630153@infradead.org
2024-08-17 11:06:41 +02:00
Peter Zijlstra 5f6bd380c7 sched/rt: Remove default bandwidth control
Now that fair_server exists, we no longer need RT bandwidth control
unless RT_GROUP_SCHED.

Enable fair_server with parameters equivalent to RT throttling.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: "Vineeth Pillai (Google)" <vineeth@bitbyteword.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/14d562db55df5c3c780d91940743acb166895ef7.1716811044.git.bristot@kernel.org
2024-07-29 12:22:37 +02:00
Joel Fernandes (Google) c8a85394cf sched/core: Fix picking of tasks for core scheduling with DL server
* Use simple CFS pick_task for DL pick_task

  DL server's pick_task calls CFS's pick_next_task_fair(), this is wrong
  because core scheduling's pick_task only calls CFS's pick_task() for
  evaluation / checking of the CFS task (comparing across CPUs), not for
  actually affirmatively picking the next task. This causes RB tree
  corruption issues in CFS that were found by syzbot.

* Make pick_task_fair clear DL server

  A DL task pick might set ->dl_server, but it is possible the task will
  never run (say the other HT has a stop task). If the CFS task is picked
  in the future directly (say without DL server), ->dl_server will be
  set. So clear it in pick_task_fair().

This fixes the KASAN issue reported by syzbot in set_next_entity().

(DL refactoring suggestions by Vineeth Pillai).

Reported-by: Suleiman Souhlal <suleiman@google.com>
Signed-off-by: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vineeth Pillai <vineeth@bitbyteword.org>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/b10489ab1f03d23e08e6097acea47442e7d6466f.1716811044.git.bristot@kernel.org
2024-07-29 12:22:37 +02:00