diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst index d74c2c2b9ef3..03d595d178ea 100644 --- a/Documentation/scheduler/sched-ext.rst +++ b/Documentation/scheduler/sched-ext.rst @@ -93,6 +93,55 @@ scheduler has been loaded): # cat /sys/kernel/sched_ext/enable_seq 1 +Each running scheduler also exposes a per-scheduler ``events`` file under +``/sys/kernel/sched_ext//events`` that tracks diagnostic +counters. Each counter occupies one ``name value`` line: + +.. code-block:: none + + # cat /sys/kernel/sched_ext/simple/events + SCX_EV_SELECT_CPU_FALLBACK 0 + SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE 0 + SCX_EV_DISPATCH_KEEP_LAST 123 + SCX_EV_ENQ_SKIP_EXITING 0 + SCX_EV_ENQ_SKIP_MIGRATION_DISABLED 0 + SCX_EV_REENQ_IMMED 0 + SCX_EV_REENQ_LOCAL_REPEAT 0 + SCX_EV_REFILL_SLICE_DFL 456789 + SCX_EV_BYPASS_DURATION 0 + SCX_EV_BYPASS_DISPATCH 0 + SCX_EV_BYPASS_ACTIVATE 0 + SCX_EV_INSERT_NOT_OWNED 0 + SCX_EV_SUB_BYPASS_DISPATCH 0 + +The counters are described in ``kernel/sched/ext_internal.h``; briefly: + +* ``SCX_EV_SELECT_CPU_FALLBACK``: ops.select_cpu() returned a CPU unusable by + the task and the core scheduler silently picked a fallback CPU. +* ``SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE``: a local-DSQ dispatch was redirected + to the global DSQ because the target CPU went offline. +* ``SCX_EV_DISPATCH_KEEP_LAST``: a task continued running because no other + task was available (only when ``SCX_OPS_ENQ_LAST`` is not set). +* ``SCX_EV_ENQ_SKIP_EXITING``: an exiting task was dispatched to the local DSQ + directly, bypassing ops.enqueue() (only when ``SCX_OPS_ENQ_EXITING`` is not set). +* ``SCX_EV_ENQ_SKIP_MIGRATION_DISABLED``: a migration-disabled task was + dispatched to its local DSQ directly (only when + ``SCX_OPS_ENQ_MIGRATION_DISABLED`` is not set). +* ``SCX_EV_REENQ_IMMED``: a task dispatched with ``SCX_ENQ_IMMED`` was + re-enqueued because the target CPU was not available for immediate execution. 
+* ``SCX_EV_REENQ_LOCAL_REPEAT``: a reenqueue of the local DSQ triggered + another reenqueue; recurring counts indicate incorrect ``SCX_ENQ_REENQ`` + handling in the BPF scheduler. +* ``SCX_EV_REFILL_SLICE_DFL``: a task's time slice was refilled with the + default value (``SCX_SLICE_DFL``). +* ``SCX_EV_BYPASS_DURATION``: total nanoseconds spent in bypass mode. +* ``SCX_EV_BYPASS_DISPATCH``: number of tasks dispatched while in bypass mode. +* ``SCX_EV_BYPASS_ACTIVATE``: number of times bypass mode was activated. +* ``SCX_EV_INSERT_NOT_OWNED``: attempted to insert a task not owned by this + scheduler into a DSQ; such attempts are silently ignored. +* ``SCX_EV_SUB_BYPASS_DISPATCH``: tasks dispatched from sub-scheduler bypass + DSQs (only relevant with ``CONFIG_EXT_SUB_SCHED``). + ``tools/sched_ext/scx_show_state.py`` is a drgn script which shows more detailed information: @@ -228,16 +277,23 @@ The following briefly shows how a waking task is scheduled and executed. scheduler can wake up any cpu using the ``scx_bpf_kick_cpu()`` helper, using ``ops.select_cpu()`` judiciously can be simpler and more efficient. - A task can be immediately inserted into a DSQ from ``ops.select_cpu()`` - by calling ``scx_bpf_dsq_insert()``. If the task is inserted into - ``SCX_DSQ_LOCAL`` from ``ops.select_cpu()``, it will be inserted into the - local DSQ of whichever CPU is returned from ``ops.select_cpu()``. - Additionally, inserting directly from ``ops.select_cpu()`` will cause the - ``ops.enqueue()`` callback to be skipped. - Note that the scheduler core will ignore an invalid CPU selection, for example, if it's outside the allowed cpumask of the task. + A task can be immediately inserted into a DSQ from ``ops.select_cpu()`` + by calling ``scx_bpf_dsq_insert()`` or ``scx_bpf_dsq_insert_vtime()``. + + If the task is inserted into ``SCX_DSQ_LOCAL`` from + ``ops.select_cpu()``, it will be added to the local DSQ of whichever CPU + is returned from ``ops.select_cpu()``. 
Additionally, inserting directly + from ``ops.select_cpu()`` will cause the ``ops.enqueue()`` callback to + be skipped. + + Any other attempt to store a task in BPF-internal data structures from + ``ops.select_cpu()`` does not prevent ``ops.enqueue()`` from being + invoked. This is discouraged, as it can introduce racy behavior or + inconsistent state. + 2. Once the target CPU is selected, ``ops.enqueue()`` is invoked (unless the task was inserted directly from ``ops.select_cpu()``). ``ops.enqueue()`` can make one of the following decisions: @@ -251,6 +307,61 @@ The following briefly shows how a waking task is scheduled and executed. * Queue the task on the BPF side. + **Task State Tracking and ops.dequeue() Semantics** + + A task is in the "BPF scheduler's custody" when the BPF scheduler is + responsible for managing its lifecycle. A task enters custody when it is + dispatched to a user DSQ or stored in the BPF scheduler's internal data + structures. Custody is entered only from ``ops.enqueue()`` for those + operations. The only exception is dispatching to a user DSQ from + ``ops.select_cpu()``: although the task is not yet technically in BPF + scheduler custody at that point, the dispatch has the same semantic + effect as dispatching from ``ops.enqueue()`` for custody-related + purposes. + + Once ``ops.enqueue()`` is called, the task may or may not enter custody + depending on what the scheduler does: + + * **Directly dispatched to terminal DSQs** (``SCX_DSQ_LOCAL``, + ``SCX_DSQ_LOCAL_ON | cpu``, or ``SCX_DSQ_GLOBAL``): the BPF scheduler + is done with the task - it either goes straight to a CPU's local run + queue or to the global DSQ as a fallback. The task never enters (or + exits) BPF custody, and ``ops.dequeue()`` will not be called. + + * **Dispatch to user-created DSQs** (custom DSQs): the task enters the + BPF scheduler's custody. 
When the task later leaves BPF custody + (dispatched to a terminal DSQ, picked by core-sched, or dequeued for + sleep/property changes), ``ops.dequeue()`` will be called exactly + once. + + * **Stored in BPF data structures** (e.g., internal BPF queues): the + task is in BPF custody. ``ops.dequeue()`` will be called when it + leaves (e.g., when ``ops.dispatch()`` moves it to a terminal DSQ, or + on property change / sleep). + + When a task leaves BPF scheduler custody, ``ops.dequeue()`` is invoked. + The dequeue can happen for different reasons, distinguished by flags: + + 1. **Regular dispatch**: when a task in BPF custody is dispatched to a + terminal DSQ from ``ops.dispatch()`` (leaving BPF custody for + execution), ``ops.dequeue()`` is triggered without any special flags. + + 2. **Core scheduling pick**: when ``CONFIG_SCHED_CORE`` is enabled and + core scheduling picks a task for execution while it's still in BPF + custody, ``ops.dequeue()`` is called with the + ``SCX_DEQ_CORE_SCHED_EXEC`` flag. + + 3. **Scheduling property change**: when a task property changes (via + operations like ``sched_setaffinity()``, ``sched_setscheduler()``, + priority changes, CPU migrations, etc.) while the task is still in + BPF custody, ``ops.dequeue()`` is called with the + ``SCX_DEQ_SCHED_CHANGE`` flag set in ``deq_flags``. + + **Important**: Once a task has left BPF custody (e.g., after being + dispatched to a terminal DSQ), property changes will not trigger + ``ops.dequeue()``, since the task is no longer managed by the BPF + scheduler. + 3. When a CPU is ready to schedule, it first looks at its local DSQ. If empty, it then looks at the global DSQ. If there still isn't a task to run, ``ops.dispatch()`` is invoked which can use the following two @@ -264,9 +375,9 @@ The following briefly shows how a waking task is scheduled and executed. rather than performing them immediately. There can be up to ``ops.dispatch_max_batch`` pending tasks. 
- * ``scx_bpf_move_to_local()`` moves a task from the specified non-local + * ``scx_bpf_dsq_move_to_local()`` moves a task from the specified non-local DSQ to the dispatching DSQ. This function cannot be called with any BPF - locks held. ``scx_bpf_move_to_local()`` flushes the pending insertions + locks held. ``scx_bpf_dsq_move_to_local()`` flushes the pending insertions tasks before trying to move from the specified DSQ. 4. After ``ops.dispatch()`` returns, if there are tasks in the local DSQ, @@ -297,8 +408,8 @@ for more information. Task Lifecycle -------------- -The following pseudo-code summarizes the entire lifecycle of a task managed -by a sched_ext scheduler: +The following pseudo-code presents a rough overview of the entire lifecycle +of a task managed by a sched_ext scheduler: .. code-block:: c @@ -311,22 +422,37 @@ by a sched_ext scheduler: ops.runnable(); /* Task becomes ready to run */ - while (task is runnable) { - if (task is not in a DSQ && task->scx.slice == 0) { + while (task_is_runnable(task)) { + if (task is not in a DSQ || task->scx.slice == 0) { ops.enqueue(); /* Task can be added to a DSQ */ + /* Task property change (e.g., affinity, nice, etc.)? */ + if (sched_change(task)) { + ops.dequeue(); /* Exiting BPF scheduler custody */ + ops.quiescent(); + + /* Property change callback, e.g. 
ops.set_weight() */ + + ops.runnable(); + continue; + } + /* Any usable CPU becomes available */ - ops.dispatch(); /* Task is moved to a local DSQ */ + ops.dispatch(); /* Task is moved to a local DSQ */ + ops.dequeue(); /* Exiting BPF scheduler custody */ } + ops.running(); /* Task starts running on its assigned CPU */ - while (task->scx.slice > 0 && task is runnable) + + while (task_is_runnable(task) && task->scx.slice > 0) { ops.tick(); /* Called every 1/HZ seconds */ + + if (task->scx.slice == 0) + ops.dispatch(); /* task->scx.slice can be refilled */ + } + ops.stopping(); /* Task stops running (time slice expires or wait) */ - - /* Task's CPU becomes available */ - - ops.dispatch(); /* task->scx.slice can be refilled */ } ops.quiescent(); /* Task releases its assigned CPU (wait) */ @@ -335,6 +461,30 @@ by a sched_ext scheduler: ops.disable(); /* Disable BPF scheduling for the task */ ops.exit_task(); /* Task is destroyed */ +Note that the above pseudo-code does not cover all possible state transitions +and edge cases. To name a few examples: + +* ``ops.dispatch()`` may fail to move the task to a local DSQ due to a racing + property change on that task, in which case ``ops.dispatch()`` will be + retried. + +* The task may be direct-dispatched to a local DSQ from ``ops.enqueue()``, + in which case ``ops.dispatch()`` and ``ops.dequeue()`` are skipped and we go + straight to ``ops.running()``. + +* Property changes may occur at virtually any point during the task's lifecycle, + not just when the task is queued and waiting to be dispatched. For example, + changing a property of a running task will lead to the callback sequence + ``ops.stopping()`` -> ``ops.quiescent()`` -> (property change callback) -> + ``ops.runnable()`` -> ``ops.running()``. + +* A sched_ext task can be preempted by a task from a higher-priority scheduling + class, in which case it will exit the tick-dispatch loop even though it is runnable + and has a non-zero slice. 
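The two property-change paths described above (a change while the task is still queued in BPF custody, and a change while it is running) can be summarized as ordered callback lists. The following is only an illustrative userspace sketch: the ``demo_*`` names are hypothetical and nothing here is kernel API. The callback order is taken from the pseudo-code and the notes above.

```c
#include <stddef.h>

/* Hypothetical event kinds, used only for this illustration. */
enum demo_event {
	DEMO_CHANGE_WHILE_QUEUED,	/* property change while in BPF custody */
	DEMO_CHANGE_WHILE_RUNNING,	/* property change while on a CPU */
};

/*
 * Copy the expected callback names for @ev into @seq, in order.
 * Returns the number of entries written, or 0 if @cap is too small.
 */
static size_t demo_callback_sequence(enum demo_event ev,
				     const char **seq, size_t cap)
{
	/* Order taken from the sched_change() branch of the pseudo-code. */
	static const char *const queued[] = {
		"ops.dequeue()", "ops.quiescent()",
		"(property change callback)", "ops.runnable()",
	};
	/* Order taken from the running-task example in the notes. */
	static const char *const running[] = {
		"ops.stopping()", "ops.quiescent()",
		"(property change callback)", "ops.runnable()",
		"ops.running()",
	};
	const char *const *src;
	size_t n, i;

	if (ev == DEMO_CHANGE_WHILE_QUEUED) {
		src = queued;
		n = sizeof(queued) / sizeof(queued[0]);
	} else {
		src = running;
		n = sizeof(running) / sizeof(running[0]);
	}

	if (n > cap)
		return 0;

	for (i = 0; i < n; i++)
		seq[i] = src[i];
	return n;
}
```

A BPF scheduler that keeps per-task state may find it useful to check its own bookkeeping against both sequences, since only the queued case passes through ``ops.dequeue()``.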
+ +See the "Scheduling Cycle" section for a more detailed description of how +a freshly woken up task gets on a CPU. + Where to Look ============= @@ -377,6 +527,25 @@ Where to Look scheduling. Tasks with CPU affinity are direct-dispatched in FIFO order; all others are scheduled in user space by a simple vruntime scheduler. +Module Parameters +================= + +sched_ext exposes two module parameters under the ``sched_ext.`` prefix that +control bypass-mode behaviour. These knobs are primarily for debugging; there +is usually no reason to change them during normal operation. They can be read +and written at runtime (mode 0600) via +``/sys/module/sched_ext/parameters/``. + +``sched_ext.slice_bypass_us`` (default: 5000 µs) + The time slice assigned to all tasks when the scheduler is in bypass mode, + i.e. during BPF scheduler load, unload, and error recovery. Valid range is + 100 µs to 100 ms. + +``sched_ext.bypass_lb_intv_us`` (default: 500000 µs) + The interval at which the bypass-mode load balancer redistributes tasks + across CPUs. Set to 0 to disable load balancing during bypass mode. Valid + range is 0 to 10 s. 
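As a rough illustration of the ranges listed above, the following userspace sketch checks a candidate value before it would be written to a parameter file. The struct and helper names are hypothetical. Only the two parameter names and their bounds (100 µs to 100 ms, and 0 to 10 s) come from the text above, and the kernel's own validation may behave differently.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table of the documented bounds, in microseconds. */
struct demo_param_range {
	const char *name;
	unsigned int min_us;
	unsigned int max_us;
};

static const struct demo_param_range demo_param_ranges[] = {
	{ "slice_bypass_us",   100, 100000 },	/* 100 us .. 100 ms */
	{ "bypass_lb_intv_us", 0,   10000000 },	/* 0 (disabled) .. 10 s */
};

/* Return true if @val_us lies within the documented range for @name. */
static bool demo_param_value_ok(const char *name, unsigned int val_us)
{
	size_t i;

	for (i = 0; i < sizeof(demo_param_ranges) / sizeof(demo_param_ranges[0]); i++) {
		const struct demo_param_range *r = &demo_param_ranges[i];

		if (strcmp(r->name, name) == 0)
			return val_us >= r->min_us && val_us <= r->max_us;
	}

	/* Unknown parameter names are rejected. */
	return false;
}
```

A value of 0 is accepted for ``bypass_lb_intv_us`` because the text above documents it as the way to disable load balancing during bypass mode.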
+ ABI Instability =============== diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index f197ca104737..f42563739d2e 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -17,6 +17,7 @@ #include #include #include +#include #include #include #include @@ -628,6 +629,9 @@ struct cgroup { #ifdef CONFIG_BPF_SYSCALL struct bpf_local_storage __rcu *bpf_cgrp_storage; #endif +#ifdef CONFIG_EXT_SUB_SCHED + struct scx_sched __rcu *scx_sched; +#endif /* All ancestors including self */ union { diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h index bcb962d5ee7d..1a3af2ea2a79 100644 --- a/include/linux/sched/ext.h +++ b/include/linux/sched/ext.h @@ -62,6 +62,16 @@ enum scx_dsq_id_flags { SCX_DSQ_LOCAL_CPU_MASK = 0xffffffffLLU, }; +struct scx_deferred_reenq_user { + struct list_head node; + u64 flags; +}; + +struct scx_dsq_pcpu { + struct scx_dispatch_q *dsq; + struct scx_deferred_reenq_user deferred_reenq_user; +}; + /* * A dispatch queue (DSQ) can be either a FIFO or p->scx.dsq_vtime ordered * queue. A built-in DSQ is always a FIFO. 
The built-in local DSQs are used to @@ -78,30 +88,58 @@ struct scx_dispatch_q { u64 id; struct rhash_head hash_node; struct llist_node free_node; + struct scx_sched *sched; + struct scx_dsq_pcpu __percpu *pcpu; struct rcu_head rcu; }; -/* scx_entity.flags */ +/* sched_ext_entity.flags */ enum scx_ent_flags { SCX_TASK_QUEUED = 1 << 0, /* on ext runqueue */ + SCX_TASK_IN_CUSTODY = 1 << 1, /* in custody, needs ops.dequeue() when leaving */ SCX_TASK_RESET_RUNNABLE_AT = 1 << 2, /* runnable_at should be reset */ SCX_TASK_DEQD_FOR_SLEEP = 1 << 3, /* last dequeue was for SLEEP */ + SCX_TASK_SUB_INIT = 1 << 4, /* task being initialized for a sub sched */ + SCX_TASK_IMMED = 1 << 5, /* task is on local DSQ with %SCX_ENQ_IMMED */ - SCX_TASK_STATE_SHIFT = 8, /* bit 8 and 9 are used to carry scx_task_state */ + /* + * Bits 8 and 9 are used to carry task state: + * + * NONE ops.init_task() not called yet + * INIT ops.init_task() succeeded, but task can be cancelled + * READY fully initialized, but not in sched_ext + * ENABLED fully initialized and in sched_ext + */ + SCX_TASK_STATE_SHIFT = 8, /* bits 8 and 9 are used to carry task state */ SCX_TASK_STATE_BITS = 2, SCX_TASK_STATE_MASK = ((1 << SCX_TASK_STATE_BITS) - 1) << SCX_TASK_STATE_SHIFT, - SCX_TASK_CURSOR = 1 << 31, /* iteration cursor, not a task */ -}; + SCX_TASK_NONE = 0 << SCX_TASK_STATE_SHIFT, + SCX_TASK_INIT = 1 << SCX_TASK_STATE_SHIFT, + SCX_TASK_READY = 2 << SCX_TASK_STATE_SHIFT, + SCX_TASK_ENABLED = 3 << SCX_TASK_STATE_SHIFT, -/* scx_entity.flags & SCX_TASK_STATE_MASK */ -enum scx_task_state { - SCX_TASK_NONE, /* ops.init_task() not called yet */ - SCX_TASK_INIT, /* ops.init_task() succeeded, but task can be cancelled */ - SCX_TASK_READY, /* fully initialized, but not in sched_ext */ - SCX_TASK_ENABLED, /* fully initialized and in sched_ext */ + /* + * Bits 12 and 13 are used to carry reenqueue reason. 
In addition to + * %SCX_ENQ_REENQ flag, ops.enqueue() can also test for + * %SCX_TASK_REENQ_REASON_NONE to distinguish reenqueues. + * + * NONE not being reenqueued + * KFUNC reenqueued by scx_bpf_dsq_reenq() and friends + * IMMED reenqueued due to failed ENQ_IMMED + * PREEMPTED preempted while running + */ + SCX_TASK_REENQ_REASON_SHIFT = 12, + SCX_TASK_REENQ_REASON_BITS = 2, + SCX_TASK_REENQ_REASON_MASK = ((1 << SCX_TASK_REENQ_REASON_BITS) - 1) << SCX_TASK_REENQ_REASON_SHIFT, - SCX_TASK_NR_STATES, + SCX_TASK_REENQ_NONE = 0 << SCX_TASK_REENQ_REASON_SHIFT, + SCX_TASK_REENQ_KFUNC = 1 << SCX_TASK_REENQ_REASON_SHIFT, + SCX_TASK_REENQ_IMMED = 2 << SCX_TASK_REENQ_REASON_SHIFT, + SCX_TASK_REENQ_PREEMPTED = 3 << SCX_TASK_REENQ_REASON_SHIFT, + + /* iteration cursor, not a task */ + SCX_TASK_CURSOR = 1 << 31, }; /* scx_entity.dsq_flags */ @@ -109,33 +147,6 @@ enum scx_ent_dsq_flags { SCX_TASK_DSQ_ON_PRIQ = 1 << 0, /* task is queued on the priority queue of a dsq */ }; -/* - * Mask bits for scx_entity.kf_mask. Not all kfuncs can be called from - * everywhere and the following bits track which kfunc sets are currently - * allowed for %current. This simple per-task tracking works because SCX ops - * nest in a limited way. BPF will likely implement a way to allow and disallow - * kfuncs depending on the calling context which will replace this manual - * mechanism. See scx_kf_allow(). - */ -enum scx_kf_mask { - SCX_KF_UNLOCKED = 0, /* sleepable and not rq locked */ - /* ENQUEUE and DISPATCH may be nested inside CPU_RELEASE */ - SCX_KF_CPU_RELEASE = 1 << 0, /* ops.cpu_release() */ - /* - * ops.dispatch() may release rq lock temporarily and thus ENQUEUE and - * SELECT_CPU may be nested inside. ops.dequeue (in REST) may also be - * nested inside DISPATCH. 
- */ - SCX_KF_DISPATCH = 1 << 1, /* ops.dispatch() */ - SCX_KF_ENQUEUE = 1 << 2, /* ops.enqueue() and ops.select_cpu() */ - SCX_KF_SELECT_CPU = 1 << 3, /* ops.select_cpu() */ - SCX_KF_REST = 1 << 4, /* other rq-locked operations */ - - __SCX_KF_RQ_LOCKED = SCX_KF_CPU_RELEASE | SCX_KF_DISPATCH | - SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU | SCX_KF_REST, - __SCX_KF_TERMINAL = SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU | SCX_KF_REST, -}; - enum scx_dsq_lnode_flags { SCX_DSQ_LNODE_ITER_CURSOR = 1 << 0, @@ -149,19 +160,31 @@ struct scx_dsq_list_node { u32 priv; /* can be used by iter cursor */ }; -#define INIT_DSQ_LIST_CURSOR(__node, __flags, __priv) \ +#define INIT_DSQ_LIST_CURSOR(__cursor, __dsq, __flags) \ (struct scx_dsq_list_node) { \ - .node = LIST_HEAD_INIT((__node).node), \ + .node = LIST_HEAD_INIT((__cursor).node), \ .flags = SCX_DSQ_LNODE_ITER_CURSOR | (__flags), \ - .priv = (__priv), \ + .priv = READ_ONCE((__dsq)->seq), \ } +struct scx_sched; + /* * The following is embedded in task_struct and contains all fields necessary * for a task to be scheduled by SCX. */ struct sched_ext_entity { +#ifdef CONFIG_CGROUPS + /* + * Associated scx_sched. Updated either during fork or while holding + * both p->pi_lock and rq lock. 
+ */ + struct scx_sched __rcu *sched; +#endif struct scx_dispatch_q *dsq; + atomic_long_t ops_state; + u64 ddsp_dsq_id; + u64 ddsp_enq_flags; struct scx_dsq_list_node dsq_list; /* dispatch order */ struct rb_node dsq_priq; /* p->scx.dsq_vtime order */ u32 dsq_seq; @@ -171,9 +194,7 @@ struct sched_ext_entity { s32 sticky_cpu; s32 holding_cpu; s32 selected_cpu; - u32 kf_mask; /* see scx_kf_mask above */ struct task_struct *kf_tasks[2]; /* see SCX_CALL_OP_TASK() */ - atomic_long_t ops_state; struct list_head runnable_node; /* rq->scx.runnable_list */ unsigned long runnable_at; @@ -181,8 +202,6 @@ struct sched_ext_entity { #ifdef CONFIG_SCHED_CORE u64 core_sched_at; /* see scx_prio_less() */ #endif - u64 ddsp_dsq_id; - u64 ddsp_enq_flags; /* BPF scheduler modifiable fields */ diff --git a/init/Kconfig b/init/Kconfig index 43875ef36752..29752a1db717 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -1190,6 +1190,10 @@ config EXT_GROUP_SCHED endif #CGROUP_SCHED +config EXT_SUB_SCHED + def_bool y + depends on SCHED_CLASS_EXT && CGROUPS + config SCHED_MM_CID def_bool y depends on SMP && RSEQ diff --git a/kernel/fork.c b/kernel/fork.c index 8c61c8dd4372..fe3821160f9a 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2514,8 +2514,12 @@ __latent_entropy struct task_struct *copy_process( fd_install(pidfd, pidfile); proc_fork_connector(p); - sched_post_fork(p); + /* + * sched_ext needs @p to be associated with its cgroup in its post_fork + * hook. cgroup_post_fork() should come before sched_post_fork(). 
+ */ cgroup_post_fork(p, args); + sched_post_fork(p); perf_event_fork(p); trace_task_newtask(p, clone_flags); diff --git a/kernel/sched/core.c b/kernel/sched/core.c index f351296922ac..8952f5764517 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4776,7 +4776,7 @@ int sched_cgroup_fork(struct task_struct *p, struct kernel_clone_args *kargs) p->sched_class->task_fork(p); raw_spin_unlock_irqrestore(&p->pi_lock, flags); - return scx_fork(p); + return scx_fork(p, kargs); } void sched_cancel_fork(struct task_struct *p) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 04fc5c9fee14..012ca8bd70fb 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -9,6 +9,8 @@ #include #include "ext_idle.h" +static DEFINE_RAW_SPINLOCK(scx_sched_lock); + /* * NOTE: sched_ext is in the process of growing multiple scheduler support and * scx_root usage is in a transitional state. Naked dereferences are safe if the @@ -17,7 +19,23 @@ * are used as temporary markers to indicate that the dereferences need to be * updated to point to the associated scheduler instances rather than scx_root. */ -static struct scx_sched __rcu *scx_root; +struct scx_sched __rcu *scx_root; + +/* + * All scheds, writers must hold both scx_enable_mutex and scx_sched_lock. + * Readers can hold either or rcu_read_lock(). + */ +static LIST_HEAD(scx_sched_all); + +#ifdef CONFIG_EXT_SUB_SCHED +static const struct rhashtable_params scx_sched_hash_params = { + .key_len = sizeof_field(struct scx_sched, ops.sub_cgroup_id), + .key_offset = offsetof(struct scx_sched, ops.sub_cgroup_id), + .head_offset = offsetof(struct scx_sched, hash_node), +}; + +static struct rhashtable scx_sched_hash; +#endif /* * During exit, a task may schedule after losing its PIDs. 
When disabling the @@ -33,37 +51,39 @@ static DEFINE_MUTEX(scx_enable_mutex); DEFINE_STATIC_KEY_FALSE(__scx_enabled); DEFINE_STATIC_PERCPU_RWSEM(scx_fork_rwsem); static atomic_t scx_enable_state_var = ATOMIC_INIT(SCX_DISABLED); -static int scx_bypass_depth; +static DEFINE_RAW_SPINLOCK(scx_bypass_lock); static cpumask_var_t scx_bypass_lb_donee_cpumask; static cpumask_var_t scx_bypass_lb_resched_cpumask; -static bool scx_aborting; static bool scx_init_task_enabled; static bool scx_switching_all; DEFINE_STATIC_KEY_FALSE(__scx_switched_all); -/* - * Tracks whether scx_enable() called scx_bypass(true). Used to balance bypass - * depth on enable failure. Will be removed when bypass depth is moved into the - * sched instance. - */ -static bool scx_bypassed_for_enable; - static atomic_long_t scx_nr_rejected = ATOMIC_LONG_INIT(0); static atomic_long_t scx_hotplug_seq = ATOMIC_LONG_INIT(0); +#ifdef CONFIG_EXT_SUB_SCHED /* - * A monotically increasing sequence number that is incremented every time a - * scheduler is enabled. This can be used by to check if any custom sched_ext + * The sub sched being enabled. Used by scx_disable_and_exit_task() to exit + * tasks for the sub-sched being enabled. Use a global variable instead of a + * per-task field as all enables are serialized. + */ +static struct scx_sched *scx_enabling_sub_sched; +#else +#define scx_enabling_sub_sched (struct scx_sched *)NULL +#endif /* CONFIG_EXT_SUB_SCHED */ + +/* + * A monotonically increasing sequence number that is incremented every time a + * scheduler is enabled. This can be used to check if any custom sched_ext * scheduler has ever been used in the system. */ static atomic_long_t scx_enable_seq = ATOMIC_LONG_INIT(0); /* - * The maximum amount of time in jiffies that a task may be runnable without - * being scheduled on a CPU. If this timeout is exceeded, it will trigger - * scx_error(). + * Watchdog interval. 
All scx_sched's share a single watchdog timer and the + * interval is half of the shortest sch->watchdog_timeout. */ -static unsigned long scx_watchdog_timeout; +static unsigned long scx_watchdog_interval; /* * The last time the delayed work was run. This delayed work relies on @@ -106,25 +126,6 @@ static const struct rhashtable_params dsq_hash_params = { static LLIST_HEAD(dsqs_to_free); -/* dispatch buf */ -struct scx_dsp_buf_ent { - struct task_struct *task; - unsigned long qseq; - u64 dsq_id; - u64 enq_flags; -}; - -static u32 scx_dsp_max_batch; - -struct scx_dsp_ctx { - struct rq *rq; - u32 cursor; - u32 nr_tasks; - struct scx_dsp_buf_ent buf[]; -}; - -static struct scx_dsp_ctx __percpu *scx_dsp_ctx; - /* string formatting from BPF */ struct scx_bstr_buf { u64 data[MAX_BPRINTF_VARARGS]; @@ -135,6 +136,8 @@ static DEFINE_RAW_SPINLOCK(scx_exit_bstr_buf_lock); static struct scx_bstr_buf scx_exit_bstr_buf; /* ops debug dump */ +static DEFINE_RAW_SPINLOCK(scx_dump_lock); + struct scx_dump_data { s32 cpu; bool first; @@ -156,7 +159,6 @@ static struct kset *scx_kset; * There usually is no reason to modify these as normal scheduler operation * shouldn't be affected by them. The knobs are primarily for debugging. 
*/ -static u64 scx_slice_dfl = SCX_SLICE_DFL; static unsigned int scx_slice_bypass_us = SCX_SLICE_BYPASS / NSEC_PER_USEC; static unsigned int scx_bypass_lb_intv_us = SCX_BYPASS_LB_DFL_INTV_US; @@ -193,10 +195,10 @@ MODULE_PARM_DESC(bypass_lb_intv_us, "bypass load balance interval in microsecond #define CREATE_TRACE_POINTS #include -static void process_ddsp_deferred_locals(struct rq *rq); +static void run_deferred(struct rq *rq); static bool task_dead_and_done(struct task_struct *p); -static u32 reenq_local(struct rq *rq); static void scx_kick_cpu(struct scx_sched *sch, s32 cpu, u64 flags); +static void scx_disable(struct scx_sched *sch, enum scx_exit_kind kind); static bool scx_vexit(struct scx_sched *sch, enum scx_exit_kind kind, s64 exit_code, const char *fmt, va_list args); @@ -227,28 +229,109 @@ static long jiffies_delta_msecs(unsigned long at, unsigned long now) return -(long)jiffies_to_msecs(now - at); } -/* if the highest set bit is N, return a mask with bits [N+1, 31] set */ -static u32 higher_bits(u32 flags) -{ - return ~((1 << fls(flags)) - 1); -} - -/* return the mask with only the highest bit set */ -static u32 highest_bit(u32 flags) -{ - int bit = fls(flags); - return ((u64)1 << bit) >> 1; -} - static bool u32_before(u32 a, u32 b) { return (s32)(a - b) < 0; } -static struct scx_dispatch_q *find_global_dsq(struct scx_sched *sch, - struct task_struct *p) +#ifdef CONFIG_EXT_SUB_SCHED +/** + * scx_parent - Find the parent sched + * @sch: sched to find the parent of + * + * Returns the parent scheduler or %NULL if @sch is root. 
+ */ +static struct scx_sched *scx_parent(struct scx_sched *sch) { - return sch->global_dsqs[cpu_to_node(task_cpu(p))]; + if (sch->level) + return sch->ancestors[sch->level - 1]; + else + return NULL; +} + +/** + * scx_next_descendant_pre - find the next descendant for pre-order walk + * @pos: the current position (%NULL to initiate traversal) + * @root: sched whose descendants to walk + * + * To be used by scx_for_each_descendant_pre(). Find the next descendant to + * visit for pre-order traversal of @root's descendants. @root is included in + * the iteration and the first node to be visited. + */ +static struct scx_sched *scx_next_descendant_pre(struct scx_sched *pos, + struct scx_sched *root) +{ + struct scx_sched *next; + + lockdep_assert(lockdep_is_held(&scx_enable_mutex) || + lockdep_is_held(&scx_sched_lock)); + + /* if first iteration, visit @root */ + if (!pos) + return root; + + /* visit the first child if exists */ + next = list_first_entry_or_null(&pos->children, struct scx_sched, sibling); + if (next) + return next; + + /* no child, visit my or the closest ancestor's next sibling */ + while (pos != root) { + if (!list_is_last(&pos->sibling, &scx_parent(pos)->children)) + return list_next_entry(pos, sibling); + pos = scx_parent(pos); + } + + return NULL; +} + +static struct scx_sched *scx_find_sub_sched(u64 cgroup_id) +{ + return rhashtable_lookup(&scx_sched_hash, &cgroup_id, + scx_sched_hash_params); +} + +static void scx_set_task_sched(struct task_struct *p, struct scx_sched *sch) +{ + rcu_assign_pointer(p->scx.sched, sch); +} +#else /* CONFIG_EXT_SUB_SCHED */ +static struct scx_sched *scx_parent(struct scx_sched *sch) { return NULL; } +static struct scx_sched *scx_next_descendant_pre(struct scx_sched *pos, struct scx_sched *root) { return pos ? 
NULL : root; } +static struct scx_sched *scx_find_sub_sched(u64 cgroup_id) { return NULL; } +static void scx_set_task_sched(struct task_struct *p, struct scx_sched *sch) {} +#endif /* CONFIG_EXT_SUB_SCHED */ + +/** + * scx_is_descendant - Test whether sched is a descendant + * @sch: sched to test + * @ancestor: ancestor sched to test against + * + * Test whether @sch is a descendant of @ancestor. + */ +static bool scx_is_descendant(struct scx_sched *sch, struct scx_sched *ancestor) +{ + if (sch->level < ancestor->level) + return false; + return sch->ancestors[ancestor->level] == ancestor; +} + +/** + * scx_for_each_descendant_pre - pre-order walk of a sched's descendants + * @pos: iteration cursor + * @root: sched to walk the descendants of + * + * Walk @root's descendants. @root is included in the iteration and the first + * node to be visited. Must be called with either scx_enable_mutex or + * scx_sched_lock held. + */ +#define scx_for_each_descendant_pre(pos, root) \ + for ((pos) = scx_next_descendant_pre(NULL, (root)); (pos); \ + (pos) = scx_next_descendant_pre((pos), (root))) + +static struct scx_dispatch_q *find_global_dsq(struct scx_sched *sch, s32 cpu) +{ + return &sch->pnode[cpu_to_node(cpu)]->global_dsq; } static struct scx_dispatch_q *find_user_dsq(struct scx_sched *sch, u64 dsq_id) @@ -264,28 +347,106 @@ static const struct sched_class *scx_setscheduler_class(struct task_struct *p) return __setscheduler_class(p->policy, p->prio); } -/* - * scx_kf_mask enforcement. Some kfuncs can only be called from specific SCX - * ops. When invoking SCX ops, SCX_CALL_OP[_RET]() should be used to indicate - * the allowed kfuncs and those kfuncs should use scx_kf_allowed() to check - * whether it's running from an allowed context. - * - * @mask is constant, always inline to cull the mask calculations. 
- */ -static __always_inline void scx_kf_allow(u32 mask) +static struct scx_dispatch_q *bypass_dsq(struct scx_sched *sch, s32 cpu) { - /* nesting is allowed only in increasing scx_kf_mask order */ - WARN_ONCE((mask | higher_bits(mask)) & current->scx.kf_mask, - "invalid nesting current->scx.kf_mask=0x%x mask=0x%x\n", - current->scx.kf_mask, mask); - current->scx.kf_mask |= mask; - barrier(); + return &per_cpu_ptr(sch->pcpu, cpu)->bypass_dsq; } -static void scx_kf_disallow(u32 mask) +static struct scx_dispatch_q *bypass_enq_target_dsq(struct scx_sched *sch, s32 cpu) { - barrier(); - current->scx.kf_mask &= ~mask; +#ifdef CONFIG_EXT_SUB_SCHED + /* + * If @sch is a sub-sched which is bypassing, its tasks should go into + * the bypass DSQs of the nearest ancestor which is not bypassing. The + * not-bypassing ancestor is responsible for scheduling all tasks from + * bypassing sub-trees. If all ancestors including root are bypassing, + * all tasks should go to the root's bypass DSQs. + * + * Whenever a sched starts bypassing, all runnable tasks in its subtree + * are re-enqueued after scx_bypassing() is turned on, guaranteeing that + * all tasks are transferred to the right DSQs. + */ + while (scx_parent(sch) && scx_bypassing(sch, cpu)) + sch = scx_parent(sch); +#endif /* CONFIG_EXT_SUB_SCHED */ + + return bypass_dsq(sch, cpu); +} + +/** + * bypass_dsp_enabled - Check if bypass dispatch path is enabled + * @sch: scheduler to check + * + * When a descendant scheduler enters bypass mode, bypassed tasks are scheduled + * by the nearest non-bypassing ancestor, or the root scheduler if all ancestors + * are bypassing. In the former case, the ancestor is not itself bypassing but + * its bypass DSQs will be populated with bypassed tasks from descendants. Thus, + * the ancestor's bypass dispatch path must be active even though its own + * bypass_depth remains zero. 
+ * + * This function checks bypass_dsp_enable_depth which is managed separately from + * bypass_depth to enable this decoupling. See enable_bypass_dsp() and + * disable_bypass_dsp(). + */ +static bool bypass_dsp_enabled(struct scx_sched *sch) +{ + return unlikely(atomic_read(&sch->bypass_dsp_enable_depth)); +} + +/** + * rq_is_open - Is the rq available for immediate execution of an SCX task? + * @rq: rq to test + * @enq_flags: optional %SCX_ENQ_* of the task being enqueued + * + * Returns %true if @rq is currently open for executing an SCX task. After a + * %false return, @rq is guaranteed to invoke SCX dispatch path at least once + * before going to idle and not inserting a task into @rq's local DSQ after a + * %false return doesn't cause @rq to stall. + */ +static bool rq_is_open(struct rq *rq, u64 enq_flags) +{ + lockdep_assert_rq_held(rq); + + /* + * A higher-priority class task is either running or in the process of + * waking up on @rq. + */ + if (sched_class_above(rq->next_class, &ext_sched_class)) + return false; + + /* + * @rq is either in transition to or in idle and there is no + * higher-priority class task waking up on it. + */ + if (sched_class_above(&ext_sched_class, rq->next_class)) + return true; + + /* + * @rq is either picking, in transition to, or running an SCX task. + */ + + /* + * If we're in the dispatch path holding rq lock, $curr may or may not + * be ready depending on whether the on-going dispatch decides to extend + * $curr's slice. We say yes here and resolve it at the end of dispatch. + * See balance_one(). + */ + if (rq->scx.flags & SCX_RQ_IN_BALANCE) + return true; + + /* + * %SCX_ENQ_PREEMPT clears $curr's slice if on SCX and kicks dispatch, + * so allow it to avoid spuriously triggering reenq on a combined + * PREEMPT|IMMED insertion. + */ + if (enq_flags & SCX_ENQ_PREEMPT) + return true; + + /* + * @rq is either in transition to or running an SCX task and can't go + * idle without another SCX dispatch cycle. 
+ */ + return false; } /* @@ -308,119 +469,77 @@ static inline void update_locked_rq(struct rq *rq) __this_cpu_write(scx_locked_rq_state, rq); } -#define SCX_CALL_OP(sch, mask, op, rq, args...) \ +#define SCX_CALL_OP(sch, op, rq, args...) \ do { \ if (rq) \ update_locked_rq(rq); \ - if (mask) { \ - scx_kf_allow(mask); \ - (sch)->ops.op(args); \ - scx_kf_disallow(mask); \ - } else { \ - (sch)->ops.op(args); \ - } \ + (sch)->ops.op(args); \ if (rq) \ update_locked_rq(NULL); \ } while (0) -#define SCX_CALL_OP_RET(sch, mask, op, rq, args...) \ +#define SCX_CALL_OP_RET(sch, op, rq, args...) \ ({ \ __typeof__((sch)->ops.op(args)) __ret; \ \ if (rq) \ update_locked_rq(rq); \ - if (mask) { \ - scx_kf_allow(mask); \ - __ret = (sch)->ops.op(args); \ - scx_kf_disallow(mask); \ - } else { \ - __ret = (sch)->ops.op(args); \ - } \ + __ret = (sch)->ops.op(args); \ if (rq) \ update_locked_rq(NULL); \ __ret; \ }) /* - * Some kfuncs are allowed only on the tasks that are subjects of the - * in-progress scx_ops operation for, e.g., locking guarantees. To enforce such - * restrictions, the following SCX_CALL_OP_*() variants should be used when - * invoking scx_ops operations that take task arguments. These can only be used - * for non-nesting operations due to the way the tasks are tracked. + * SCX_CALL_OP_TASK*() invokes an SCX op that takes one or two task arguments + * and records them in current->scx.kf_tasks[] for the duration of the call. A + * kfunc invoked from inside such an op can then use + * scx_kf_arg_task_ok() to verify that its task argument is one of + * those subject tasks. * - * kfuncs which can only operate on such tasks can in turn use - * scx_kf_allowed_on_arg_tasks() to test whether the invocation is allowed on - * the specific task. + * Every SCX_CALL_OP_TASK*() call site invokes its op with @p's rq lock held - + * either via the @rq argument here, or (for ops.select_cpu()) via @p's pi_lock + * held by try_to_wake_up() with rq tracking via scx_rq.in_select_cpu. 
So if + * kf_tasks[] is set, @p's scheduler-protected fields are stable. + * + * kf_tasks[] can not stack, so task-based SCX ops must not nest. The + * WARN_ON_ONCE() in each macro catches a re-entry of any of the three variants + * while a previous one is still in progress. */ -#define SCX_CALL_OP_TASK(sch, mask, op, rq, task, args...) \ +#define SCX_CALL_OP_TASK(sch, op, rq, task, args...) \ do { \ - BUILD_BUG_ON((mask) & ~__SCX_KF_TERMINAL); \ + WARN_ON_ONCE(current->scx.kf_tasks[0]); \ current->scx.kf_tasks[0] = task; \ - SCX_CALL_OP((sch), mask, op, rq, task, ##args); \ + SCX_CALL_OP((sch), op, rq, task, ##args); \ current->scx.kf_tasks[0] = NULL; \ } while (0) -#define SCX_CALL_OP_TASK_RET(sch, mask, op, rq, task, args...) \ +#define SCX_CALL_OP_TASK_RET(sch, op, rq, task, args...) \ ({ \ __typeof__((sch)->ops.op(task, ##args)) __ret; \ - BUILD_BUG_ON((mask) & ~__SCX_KF_TERMINAL); \ + WARN_ON_ONCE(current->scx.kf_tasks[0]); \ current->scx.kf_tasks[0] = task; \ - __ret = SCX_CALL_OP_RET((sch), mask, op, rq, task, ##args); \ + __ret = SCX_CALL_OP_RET((sch), op, rq, task, ##args); \ current->scx.kf_tasks[0] = NULL; \ __ret; \ }) -#define SCX_CALL_OP_2TASKS_RET(sch, mask, op, rq, task0, task1, args...) \ +#define SCX_CALL_OP_2TASKS_RET(sch, op, rq, task0, task1, args...) 
\ ({ \ __typeof__((sch)->ops.op(task0, task1, ##args)) __ret; \ - BUILD_BUG_ON((mask) & ~__SCX_KF_TERMINAL); \ + WARN_ON_ONCE(current->scx.kf_tasks[0]); \ current->scx.kf_tasks[0] = task0; \ current->scx.kf_tasks[1] = task1; \ - __ret = SCX_CALL_OP_RET((sch), mask, op, rq, task0, task1, ##args); \ + __ret = SCX_CALL_OP_RET((sch), op, rq, task0, task1, ##args); \ current->scx.kf_tasks[0] = NULL; \ current->scx.kf_tasks[1] = NULL; \ __ret; \ }) -/* @mask is constant, always inline to cull unnecessary branches */ -static __always_inline bool scx_kf_allowed(struct scx_sched *sch, u32 mask) -{ - if (unlikely(!(current->scx.kf_mask & mask))) { - scx_error(sch, "kfunc with mask 0x%x called from an operation only allowing 0x%x", - mask, current->scx.kf_mask); - return false; - } - - /* - * Enforce nesting boundaries. e.g. A kfunc which can be called from - * DISPATCH must not be called if we're running DEQUEUE which is nested - * inside ops.dispatch(). We don't need to check boundaries for any - * blocking kfuncs as the verifier ensures they're only called from - * sleepable progs. 
- */ - if (unlikely(highest_bit(mask) == SCX_KF_CPU_RELEASE && - (current->scx.kf_mask & higher_bits(SCX_KF_CPU_RELEASE)))) { - scx_error(sch, "cpu_release kfunc called from a nested operation"); - return false; - } - - if (unlikely(highest_bit(mask) == SCX_KF_DISPATCH && - (current->scx.kf_mask & higher_bits(SCX_KF_DISPATCH)))) { - scx_error(sch, "dispatch kfunc called from a nested operation"); - return false; - } - - return true; -} - /* see SCX_CALL_OP_TASK() */ -static __always_inline bool scx_kf_allowed_on_arg_tasks(struct scx_sched *sch, - u32 mask, +static __always_inline bool scx_kf_arg_task_ok(struct scx_sched *sch, struct task_struct *p) { - if (!scx_kf_allowed(sch, mask)) - return false; - if (unlikely((p != current->scx.kf_tasks[0] && p != current->scx.kf_tasks[1]))) { scx_error(sch, "called on a task not being operated on"); @@ -430,9 +549,22 @@ static __always_inline bool scx_kf_allowed_on_arg_tasks(struct scx_sched *sch, return true; } +enum scx_dsq_iter_flags { + /* iterate in the reverse dispatch order */ + SCX_DSQ_ITER_REV = 1U << 16, + + __SCX_DSQ_ITER_HAS_SLICE = 1U << 30, + __SCX_DSQ_ITER_HAS_VTIME = 1U << 31, + + __SCX_DSQ_ITER_USER_FLAGS = SCX_DSQ_ITER_REV, + __SCX_DSQ_ITER_ALL_FLAGS = __SCX_DSQ_ITER_USER_FLAGS | + __SCX_DSQ_ITER_HAS_SLICE | + __SCX_DSQ_ITER_HAS_VTIME, +}; + /** * nldsq_next_task - Iterate to the next task in a non-local DSQ - * @dsq: user dsq being iterated + * @dsq: non-local dsq being iterated * @cur: current position, %NULL to start iteration * @rev: walk backwards * @@ -472,6 +604,85 @@ static struct task_struct *nldsq_next_task(struct scx_dispatch_q *dsq, for ((p) = nldsq_next_task((dsq), NULL, false); (p); \ (p) = nldsq_next_task((dsq), (p), false)) +/** + * nldsq_cursor_next_task - Iterate to the next task given a cursor in a non-local DSQ + * @cursor: scx_dsq_list_node initialized with INIT_DSQ_LIST_CURSOR() + * @dsq: non-local dsq being iterated + * + * Find the next task in a cursor based iteration. 
The caller must have + * initialized @cursor using INIT_DSQ_LIST_CURSOR() and can release the DSQ lock + * between the iteration steps. + * + * Only tasks which were queued before @cursor was initialized are visible. This + * bounds the iteration and guarantees that vtime never jumps in the other + * direction while iterating. + */ +static struct task_struct *nldsq_cursor_next_task(struct scx_dsq_list_node *cursor, + struct scx_dispatch_q *dsq) +{ + bool rev = cursor->flags & SCX_DSQ_ITER_REV; + struct task_struct *p; + + lockdep_assert_held(&dsq->lock); + BUG_ON(!(cursor->flags & SCX_DSQ_LNODE_ITER_CURSOR)); + + if (list_empty(&cursor->node)) + p = NULL; + else + p = container_of(cursor, struct task_struct, scx.dsq_list); + + /* skip cursors and tasks that were queued after @cursor init */ + do { + p = nldsq_next_task(dsq, p, rev); + } while (p && unlikely(u32_before(cursor->priv, p->scx.dsq_seq))); + + if (p) { + if (rev) + list_move_tail(&cursor->node, &p->scx.dsq_list.node); + else + list_move(&cursor->node, &p->scx.dsq_list.node); + } else { + list_del_init(&cursor->node); + } + + return p; +} + +/** + * nldsq_cursor_lost_task - Test whether someone else took the task since iteration + * @cursor: scx_dsq_list_node initialized with INIT_DSQ_LIST_CURSOR() + * @rq: rq @p was on + * @dsq: dsq @p was on + * @p: target task + * + * @p is a task returned by nldsq_cursor_next_task(). The locks may have been + * dropped and re-acquired in between. Verify that no one else took or is in the + * process of taking @p from @dsq. + * + * On %false return, the caller can assume full ownership of @p. + */ +static bool nldsq_cursor_lost_task(struct scx_dsq_list_node *cursor, + struct rq *rq, struct scx_dispatch_q *dsq, + struct task_struct *p) +{ + lockdep_assert_rq_held(rq); + lockdep_assert_held(&dsq->lock); + + /* + * @p could have already left @dsq, been re-enqueued, or be in the + * process of being consumed by someone else. 
+ */ + if (unlikely(p->scx.dsq != dsq || + u32_before(cursor->priv, p->scx.dsq_seq) || + p->scx.holding_cpu >= 0)) + return true; + + /* if @p has stayed on @dsq, its rq couldn't have changed */ + if (WARN_ON_ONCE(rq != task_rq(p))) + return true; + + return false; +} /* * BPF DSQ iterator. Tasks in a non-local DSQ can be iterated in [reverse] @@ -479,19 +690,6 @@ static struct task_struct *nldsq_next_task(struct scx_dispatch_q *dsq, * changes without breaking backward compatibility. Can be used with * bpf_for_each(). See bpf_iter_scx_dsq_*(). */ -enum scx_dsq_iter_flags { - /* iterate in the reverse dispatch order */ - SCX_DSQ_ITER_REV = 1U << 16, - - __SCX_DSQ_ITER_HAS_SLICE = 1U << 30, - __SCX_DSQ_ITER_HAS_VTIME = 1U << 31, - - __SCX_DSQ_ITER_USER_FLAGS = SCX_DSQ_ITER_REV, - __SCX_DSQ_ITER_ALL_FLAGS = __SCX_DSQ_ITER_USER_FLAGS | - __SCX_DSQ_ITER_HAS_SLICE | - __SCX_DSQ_ITER_HAS_VTIME, -}; - struct bpf_iter_scx_dsq_kern { struct scx_dsq_list_node cursor; struct scx_dispatch_q *dsq; @@ -514,14 +712,31 @@ struct scx_task_iter { struct rq_flags rf; u32 cnt; bool list_locked; +#ifdef CONFIG_EXT_SUB_SCHED + struct cgroup *cgrp; + struct cgroup_subsys_state *css_pos; + struct css_task_iter css_iter; +#endif }; /** * scx_task_iter_start - Lock scx_tasks_lock and start a task iteration * @iter: iterator to init + * @cgrp: Optional root of cgroup subhierarchy to iterate * - * Initialize @iter and return with scx_tasks_lock held. Once initialized, @iter - * must eventually be stopped with scx_task_iter_stop(). + * Initialize @iter. Once initialized, @iter must eventually be stopped with + * scx_task_iter_stop(). + * + * If @cgrp is %NULL, scx_tasks is used for iteration and this function returns + * with scx_tasks_lock held and @iter->cursor inserted into scx_tasks. + * + * If @cgrp is not %NULL, @cgrp and its descendants' tasks are walked using + * @iter->css_iter. The caller must be holding cgroup_lock() to prevent cgroup + * task migrations. 
+ * + * The two modes of iterations are largely independent and it's likely that + * scx_tasks can be removed in favor of always using cgroup iteration if + * CONFIG_SCHED_CLASS_EXT depends on CONFIG_CGROUPS. * * scx_tasks_lock and the rq lock may be released using scx_task_iter_unlock() * between this and the first next() call or between any two next() calls. If @@ -532,10 +747,19 @@ struct scx_task_iter { * All tasks which existed when the iteration started are guaranteed to be * visited as long as they are not dead. */ -static void scx_task_iter_start(struct scx_task_iter *iter) +static void scx_task_iter_start(struct scx_task_iter *iter, struct cgroup *cgrp) { memset(iter, 0, sizeof(*iter)); +#ifdef CONFIG_EXT_SUB_SCHED + if (cgrp) { + lockdep_assert_held(&cgroup_mutex); + iter->cgrp = cgrp; + iter->css_pos = css_next_descendant_pre(NULL, &iter->cgrp->self); + css_task_iter_start(iter->css_pos, 0, &iter->css_iter); + return; + } +#endif raw_spin_lock_irq(&scx_tasks_lock); iter->cursor = (struct sched_ext_entity){ .flags = SCX_TASK_CURSOR }; @@ -588,6 +812,14 @@ static void __scx_task_iter_maybe_relock(struct scx_task_iter *iter) */ static void scx_task_iter_stop(struct scx_task_iter *iter) { +#ifdef CONFIG_EXT_SUB_SCHED + if (iter->cgrp) { + if (iter->css_pos) + css_task_iter_end(&iter->css_iter); + __scx_task_iter_rq_unlock(iter); + return; + } +#endif __scx_task_iter_maybe_relock(iter); list_del_init(&iter->cursor.tasks_node); scx_task_iter_unlock(iter); @@ -611,6 +843,24 @@ static struct task_struct *scx_task_iter_next(struct scx_task_iter *iter) cond_resched(); } +#ifdef CONFIG_EXT_SUB_SCHED + if (iter->cgrp) { + while (iter->css_pos) { + struct task_struct *p; + + p = css_task_iter_next(&iter->css_iter); + if (p) + return p; + + css_task_iter_end(&iter->css_iter); + iter->css_pos = css_next_descendant_pre(iter->css_pos, + &iter->cgrp->self); + if (iter->css_pos) + css_task_iter_start(iter->css_pos, 0, &iter->css_iter); + } + return NULL; + } +#endif 
__scx_task_iter_maybe_relock(iter); list_for_each_entry(pos, cursor, tasks_node) { @@ -810,16 +1060,6 @@ static int ops_sanitize_err(struct scx_sched *sch, const char *ops_name, s32 err return -EPROTO; } -static void run_deferred(struct rq *rq) -{ - process_ddsp_deferred_locals(rq); - - if (local_read(&rq->scx.reenq_local_deferred)) { - local_set(&rq->scx.reenq_local_deferred, 0); - reenq_local(rq); - } -} - static void deferred_bal_cb_workfn(struct rq *rq) { run_deferred(rq); @@ -845,10 +1085,18 @@ static void deferred_irq_workfn(struct irq_work *irq_work) static void schedule_deferred(struct rq *rq) { /* - * Queue an irq work. They are executed on IRQ re-enable which may take - * a bit longer than the scheduler hook in schedule_deferred_locked(). + * This is the fallback when schedule_deferred_locked() can't use + * the cheaper balance callback or wakeup hook paths (the target + * CPU is not in balance or wakeup). Currently, this is primarily + * hit by reenqueue operations targeting a remote CPU. + * + * Queue on the target CPU. The deferred work can run from any CPU + * correctly - the _locked() path already processes remote rqs from + * the calling CPU - but targeting the owning CPU allows IPI delivery + * without waiting for the calling CPU to re-enable IRQs and is + * cheaper as the reenqueue runs locally. */ - irq_work_queue(&rq->scx.deferred_irq_work); + irq_work_queue_on(&rq->scx.deferred_irq_work, cpu_of(rq)); } /** @@ -898,6 +1146,81 @@ static void schedule_deferred_locked(struct rq *rq) schedule_deferred(rq); } +static void schedule_dsq_reenq(struct scx_sched *sch, struct scx_dispatch_q *dsq, + u64 reenq_flags, struct rq *locked_rq) +{ + struct rq *rq; + + /* + * Allowing reenqueues doesn't make sense while bypassing. This also + * blocks new reenqueues from being scheduled on dead scheds. 
+ */ + if (unlikely(READ_ONCE(sch->bypass_depth))) + return; + + if (dsq->id == SCX_DSQ_LOCAL) { + rq = container_of(dsq, struct rq, scx.local_dsq); + + struct scx_sched_pcpu *sch_pcpu = per_cpu_ptr(sch->pcpu, cpu_of(rq)); + struct scx_deferred_reenq_local *drl = &sch_pcpu->deferred_reenq_local; + + /* + * Pairs with smp_mb() in process_deferred_reenq_locals() and + * guarantees that there is a reenq_local() afterwards. + */ + smp_mb(); + + if (list_empty(&drl->node) || + (READ_ONCE(drl->flags) & reenq_flags) != reenq_flags) { + + guard(raw_spinlock_irqsave)(&rq->scx.deferred_reenq_lock); + + if (list_empty(&drl->node)) + list_move_tail(&drl->node, &rq->scx.deferred_reenq_locals); + WRITE_ONCE(drl->flags, drl->flags | reenq_flags); + } + } else if (!(dsq->id & SCX_DSQ_FLAG_BUILTIN)) { + rq = this_rq(); + + struct scx_dsq_pcpu *dsq_pcpu = per_cpu_ptr(dsq->pcpu, cpu_of(rq)); + struct scx_deferred_reenq_user *dru = &dsq_pcpu->deferred_reenq_user; + + /* + * Pairs with smp_mb() in process_deferred_reenq_users() and + * guarantees that there is a reenq_user() afterwards. 
+ */ + smp_mb(); + + if (list_empty(&dru->node) || + (READ_ONCE(dru->flags) & reenq_flags) != reenq_flags) { + + guard(raw_spinlock_irqsave)(&rq->scx.deferred_reenq_lock); + + if (list_empty(&dru->node)) + list_move_tail(&dru->node, &rq->scx.deferred_reenq_users); + WRITE_ONCE(dru->flags, dru->flags | reenq_flags); + } + } else { + scx_error(sch, "DSQ 0x%llx not allowed for reenq", dsq->id); + return; + } + + if (rq == locked_rq) + schedule_deferred_locked(rq); + else + schedule_deferred(rq); +} + +static void schedule_reenq_local(struct rq *rq, u64 reenq_flags) +{ + struct scx_sched *root = rcu_dereference_sched(scx_root); + + if (WARN_ON_ONCE(!root)) + return; + + schedule_dsq_reenq(root, &rq->scx.local_dsq, reenq_flags, rq); +} + /** * touch_core_sched - Update timestamp used for core-sched task ordering * @rq: rq to read clock from, must be locked @@ -974,28 +1297,105 @@ static bool scx_dsq_priq_less(struct rb_node *node_a, return time_before64(a->scx.dsq_vtime, b->scx.dsq_vtime); } -static void dsq_mod_nr(struct scx_dispatch_q *dsq, s32 delta) +static void dsq_inc_nr(struct scx_dispatch_q *dsq, struct task_struct *p, u64 enq_flags) { + /* scx_bpf_dsq_nr_queued() reads ->nr without locking, use WRITE_ONCE() */ + WRITE_ONCE(dsq->nr, dsq->nr + 1); + /* - * scx_bpf_dsq_nr_queued() reads ->nr without locking. Use READ_ONCE() - * on the read side and WRITE_ONCE() on the write side to properly - * annotate the concurrent lockless access and avoid KCSAN warnings. + * Once @p reaches a local DSQ, it can only leave it by being dispatched + * to the CPU or dequeued. In both cases, the only way @p can go back to + * the BPF sched is through enqueueing. If being inserted into a local + * DSQ with IMMED, persist the state until the next enqueueing event in + * do_enqueue_task() so that we can maintain IMMED protection through + * e.g. SAVE/RESTORE cycles and slice extensions. 
*/ - WRITE_ONCE(dsq->nr, READ_ONCE(dsq->nr) + delta); + if (enq_flags & SCX_ENQ_IMMED) { + if (unlikely(dsq->id != SCX_DSQ_LOCAL)) { + WARN_ON_ONCE(!(enq_flags & SCX_ENQ_GDSQ_FALLBACK)); + return; + } + p->scx.flags |= SCX_TASK_IMMED; + } + + if (p->scx.flags & SCX_TASK_IMMED) { + struct rq *rq = container_of(dsq, struct rq, scx.local_dsq); + + if (WARN_ON_ONCE(dsq->id != SCX_DSQ_LOCAL)) + return; + + rq->scx.nr_immed++; + + /* + * If @rq already had other tasks or the current task is not + * done yet, @p can't go on the CPU immediately. Re-enqueue. + */ + if (unlikely(dsq->nr > 1 || !rq_is_open(rq, enq_flags))) + schedule_reenq_local(rq, 0); + } +} + +static void dsq_dec_nr(struct scx_dispatch_q *dsq, struct task_struct *p) +{ + /* see dsq_inc_nr() */ + WRITE_ONCE(dsq->nr, dsq->nr - 1); + + if (p->scx.flags & SCX_TASK_IMMED) { + struct rq *rq = container_of(dsq, struct rq, scx.local_dsq); + + if (WARN_ON_ONCE(dsq->id != SCX_DSQ_LOCAL) || + WARN_ON_ONCE(rq->scx.nr_immed <= 0)) + return; + + rq->scx.nr_immed--; + } } static void refill_task_slice_dfl(struct scx_sched *sch, struct task_struct *p) { - p->scx.slice = READ_ONCE(scx_slice_dfl); + p->scx.slice = READ_ONCE(sch->slice_dfl); __scx_add_event(sch, SCX_EV_REFILL_SLICE_DFL, 1); } +/* + * Return true if @p is moving due to an internal SCX migration, false + * otherwise. + */ +static inline bool task_scx_migrating(struct task_struct *p) +{ + /* + * We only need to check sticky_cpu: it is set to the destination + * CPU in move_remote_task_to_local_dsq() before deactivate_task() + * and cleared when the task is enqueued on the destination, so it + * is only non-negative during an internal SCX migration. + */ + return p->scx.sticky_cpu >= 0; +} + +/* + * Call ops.dequeue() if the task is in BPF custody and not migrating. + * Clears %SCX_TASK_IN_CUSTODY when the callback is invoked. 
+ */ +static void call_task_dequeue(struct scx_sched *sch, struct rq *rq, + struct task_struct *p, u64 deq_flags) +{ + if (!(p->scx.flags & SCX_TASK_IN_CUSTODY) || task_scx_migrating(p)) + return; + + if (SCX_HAS_OP(sch, dequeue)) + SCX_CALL_OP_TASK(sch, dequeue, rq, p, deq_flags); + + p->scx.flags &= ~SCX_TASK_IN_CUSTODY; +} + static void local_dsq_post_enq(struct scx_dispatch_q *dsq, struct task_struct *p, u64 enq_flags) { struct rq *rq = container_of(dsq, struct rq, scx.local_dsq); bool preempt = false; + call_task_dequeue(scx_root, rq, p, 0); + /* * If @rq is in balance, the CPU is already vacant and looking for the * next task to run. No need to preempt or trigger resched after moving @@ -1014,8 +1414,9 @@ static void local_dsq_post_enq(struct scx_dispatch_q *dsq, struct task_struct *p resched_curr(rq); } -static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq, - struct task_struct *p, u64 enq_flags) +static void dispatch_enqueue(struct scx_sched *sch, struct rq *rq, + struct scx_dispatch_q *dsq, struct task_struct *p, + u64 enq_flags) { bool is_local = dsq->id == SCX_DSQ_LOCAL; @@ -1031,7 +1432,7 @@ static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq, scx_error(sch, "attempting to dispatch to a destroyed dsq"); /* fall back to the global dsq */ raw_spin_unlock(&dsq->lock); - dsq = find_global_dsq(sch, p); + dsq = find_global_dsq(sch, task_cpu(p)); raw_spin_lock(&dsq->lock); } } @@ -1106,20 +1507,37 @@ static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq, WRITE_ONCE(dsq->seq, dsq->seq + 1); p->scx.dsq_seq = dsq->seq; - dsq_mod_nr(dsq, 1); + dsq_inc_nr(dsq, p, enq_flags); p->scx.dsq = dsq; + /* + * Update custody and call ops.dequeue() before clearing ops_state: + * once ops_state is cleared, waiters in ops_dequeue() can proceed + * and dequeue_task_scx() will RMW p->scx.flags. If we clear + * ops_state first, both sides would modify p->scx.flags + * concurrently in a non-atomic way. 
+ */ + if (is_local) { + local_dsq_post_enq(dsq, p, enq_flags); + } else { + /* + * Task on global/bypass DSQ: leave custody, task on + * non-terminal DSQ: enter custody. + */ + if (dsq->id == SCX_DSQ_GLOBAL || dsq->id == SCX_DSQ_BYPASS) + call_task_dequeue(sch, rq, p, 0); + else + p->scx.flags |= SCX_TASK_IN_CUSTODY; + + raw_spin_unlock(&dsq->lock); + } + /* * We're transitioning out of QUEUEING or DISPATCHING. store_release to * match waiters' load_acquire. */ if (enq_flags & SCX_ENQ_CLEAR_OPSS) atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_NONE); - - if (is_local) - local_dsq_post_enq(dsq, p, enq_flags); - else - raw_spin_unlock(&dsq->lock); } static void task_unlink_from_dsq(struct task_struct *p, @@ -1134,7 +1552,7 @@ static void task_unlink_from_dsq(struct task_struct *p, } list_del_init(&p->scx.dsq_list.node); - dsq_mod_nr(dsq, -1); + dsq_dec_nr(dsq, p); if (!(dsq->id & SCX_DSQ_FLAG_BUILTIN) && dsq->first_task == p) { struct task_struct *first_task; @@ -1213,7 +1631,7 @@ static void dispatch_dequeue_locked(struct task_struct *p, static struct scx_dispatch_q *find_dsq_for_dispatch(struct scx_sched *sch, struct rq *rq, u64 dsq_id, - struct task_struct *p) + s32 tcpu) { struct scx_dispatch_q *dsq; @@ -1224,20 +1642,19 @@ static struct scx_dispatch_q *find_dsq_for_dispatch(struct scx_sched *sch, s32 cpu = dsq_id & SCX_DSQ_LOCAL_CPU_MASK; if (!ops_cpu_valid(sch, cpu, "in SCX_DSQ_LOCAL_ON dispatch verdict")) - return find_global_dsq(sch, p); + return find_global_dsq(sch, tcpu); return &cpu_rq(cpu)->scx.local_dsq; } if (dsq_id == SCX_DSQ_GLOBAL) - dsq = find_global_dsq(sch, p); + dsq = find_global_dsq(sch, tcpu); else dsq = find_user_dsq(sch, dsq_id); if (unlikely(!dsq)) { - scx_error(sch, "non-existent DSQ 0x%llx for %s[%d]", - dsq_id, p->comm, p->pid); - return find_global_dsq(sch, p); + scx_error(sch, "non-existent DSQ 0x%llx", dsq_id); + return find_global_dsq(sch, tcpu); } return dsq; @@ -1300,7 +1717,7 @@ static void direct_dispatch(struct scx_sched 
*sch, struct task_struct *p, { struct rq *rq = task_rq(p); struct scx_dispatch_q *dsq = - find_dsq_for_dispatch(sch, rq, p->scx.ddsp_dsq_id, p); + find_dsq_for_dispatch(sch, rq, p->scx.ddsp_dsq_id, task_cpu(p)); u64 ddsp_enq_flags; touch_core_sched_dispatch(rq, p); @@ -1345,7 +1762,7 @@ static void direct_dispatch(struct scx_sched *sch, struct task_struct *p, ddsp_enq_flags = p->scx.ddsp_enq_flags; clear_direct_dispatch(p); - dispatch_enqueue(sch, dsq, p, ddsp_enq_flags | SCX_ENQ_CLEAR_OPSS); + dispatch_enqueue(sch, rq, dsq, p, ddsp_enq_flags | SCX_ENQ_CLEAR_OPSS); } static bool scx_rq_online(struct rq *rq) @@ -1363,17 +1780,25 @@ static bool scx_rq_online(struct rq *rq) static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags, int sticky_cpu) { - struct scx_sched *sch = scx_root; + struct scx_sched *sch = scx_task_sched(p); struct task_struct **ddsp_taskp; struct scx_dispatch_q *dsq; unsigned long qseq; WARN_ON_ONCE(!(p->scx.flags & SCX_TASK_QUEUED)); - /* rq migration */ + /* internal movements - rq migration / RESTORE */ if (sticky_cpu == cpu_of(rq)) goto local_norefill; + /* + * Clear persistent TASK_IMMED for fresh enqueues, see dsq_inc_nr(). + * Note that exiting and migration-disabled tasks that skip + * ops.enqueue() below will lose IMMED protection unless + * %SCX_OPS_ENQ_EXITING / %SCX_OPS_ENQ_MIGRATION_DISABLED are set. + */ + p->scx.flags &= ~SCX_TASK_IMMED; + /* * If !scx_rq_online(), we already told the BPF scheduler that the CPU * is offline and are just running the hotplug path. 
Don't bother the @@ -1382,7 +1807,7 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags, if (!scx_rq_online(rq)) goto local; - if (scx_rq_bypassing(rq)) { + if (scx_bypassing(sch, cpu_of(rq))) { __scx_add_event(sch, SCX_EV_BYPASS_DISPATCH, 1); goto bypass; } @@ -1417,12 +1842,18 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags, WARN_ON_ONCE(*ddsp_taskp); *ddsp_taskp = p; - SCX_CALL_OP_TASK(sch, SCX_KF_ENQUEUE, enqueue, rq, p, enq_flags); + SCX_CALL_OP_TASK(sch, enqueue, rq, p, enq_flags); *ddsp_taskp = NULL; if (p->scx.ddsp_dsq_id != SCX_DSQ_INVALID) goto direct; + /* + * Task is now in BPF scheduler's custody. Set %SCX_TASK_IN_CUSTODY + * so ops.dequeue() is called when it leaves custody. + */ + p->scx.flags |= SCX_TASK_IN_CUSTODY; + /* * If not directly dispatched, QUEUEING isn't clear yet and dispatch or * dequeue may be waiting. The store_release matches their load_acquire. @@ -1434,16 +1865,16 @@ direct: direct_dispatch(sch, p, enq_flags); return; local_norefill: - dispatch_enqueue(sch, &rq->scx.local_dsq, p, enq_flags); + dispatch_enqueue(sch, rq, &rq->scx.local_dsq, p, enq_flags); return; local: dsq = &rq->scx.local_dsq; goto enqueue; global: - dsq = find_global_dsq(sch, p); + dsq = find_global_dsq(sch, task_cpu(p)); goto enqueue; bypass: - dsq = &task_rq(p)->scx.bypass_dsq; + dsq = bypass_enq_target_dsq(sch, task_cpu(p)); goto enqueue; enqueue: @@ -1455,7 +1886,7 @@ enqueue: touch_core_sched(rq, p); refill_task_slice_dfl(sch, p); clear_direct_dispatch(p); - dispatch_enqueue(sch, dsq, p, enq_flags); + dispatch_enqueue(sch, rq, dsq, p, enq_flags); } static bool task_runnable(const struct task_struct *p) @@ -1488,16 +1919,13 @@ static void clr_task_runnable(struct task_struct *p, bool reset_runnable_at) static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int core_enq_flags) { - struct scx_sched *sch = scx_root; + struct scx_sched *sch = scx_task_sched(p); int sticky_cpu = 
p->scx.sticky_cpu; u64 enq_flags = core_enq_flags | rq->scx.extra_enq_flags; if (enq_flags & ENQUEUE_WAKEUP) rq->scx.flags |= SCX_RQ_IN_WAKEUP; - if (sticky_cpu >= 0) - p->scx.sticky_cpu = -1; - /* * Restoring a running task will be immediately followed by * set_next_task_scx() which expects the task to not be on the BPF @@ -1518,7 +1946,7 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int core_enq_ add_nr_running(rq, 1); if (SCX_HAS_OP(sch, runnable) && !task_on_rq_migrating(p)) - SCX_CALL_OP_TASK(sch, SCX_KF_REST, runnable, rq, p, enq_flags); + SCX_CALL_OP_TASK(sch, runnable, rq, p, enq_flags); if (enq_flags & SCX_ENQ_WAKEUP) touch_core_sched(rq, p); @@ -1528,6 +1956,9 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int core_enq_ dl_server_start(&rq->ext_server); do_enqueue_task(rq, p, enq_flags, sticky_cpu); + + if (sticky_cpu >= 0) + p->scx.sticky_cpu = -1; out: rq->scx.flags &= ~SCX_RQ_IN_WAKEUP; @@ -1538,7 +1969,7 @@ out: static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags) { - struct scx_sched *sch = scx_root; + struct scx_sched *sch = scx_task_sched(p); unsigned long opss; /* dequeue is always temporary, don't reset runnable_at */ @@ -1557,10 +1988,8 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags) */ BUG(); case SCX_OPSS_QUEUED: - if (SCX_HAS_OP(sch, dequeue)) - SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, - p, deq_flags); - + /* A queued task must always be in BPF scheduler's custody */ + WARN_ON_ONCE(!(p->scx.flags & SCX_TASK_IN_CUSTODY)); if (atomic_long_try_cmpxchg(&p->scx.ops_state, &opss, SCX_OPSS_NONE)) break; @@ -1583,11 +2012,35 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags) BUG_ON(atomic_long_read(&p->scx.ops_state) != SCX_OPSS_NONE); break; } + + /* + * Call ops.dequeue() if the task is still in BPF custody. 
+ * + * The code that clears ops_state to %SCX_OPSS_NONE does not always + * clear %SCX_TASK_IN_CUSTODY: in dispatch_to_local_dsq(), when + * we're moving a task that was in %SCX_OPSS_DISPATCHING to a + * remote CPU's local DSQ, we only set ops_state to %SCX_OPSS_NONE + * so that a concurrent dequeue can proceed, but we clear + * %SCX_TASK_IN_CUSTODY only when we later enqueue or move the + * task. So we can see NONE + IN_CUSTODY here and we must handle + * it. Similarly, after waiting on %SCX_OPSS_DISPATCHING we see + * NONE but the task may still have %SCX_TASK_IN_CUSTODY set until + * it is enqueued on the destination. + */ + call_task_dequeue(sch, rq, p, deq_flags); } -static bool dequeue_task_scx(struct rq *rq, struct task_struct *p, int deq_flags) +static bool dequeue_task_scx(struct rq *rq, struct task_struct *p, int core_deq_flags) { - struct scx_sched *sch = scx_root; + struct scx_sched *sch = scx_task_sched(p); + u64 deq_flags = core_deq_flags; + + /* + * Set %SCX_DEQ_SCHED_CHANGE when the dequeue is due to a property + * change (not sleep or core-sched pick). 
+ */ + if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC))) + deq_flags |= SCX_DEQ_SCHED_CHANGE; if (!(p->scx.flags & SCX_TASK_QUEUED)) { WARN_ON_ONCE(task_runnable(p)); @@ -1610,11 +2063,11 @@ static bool dequeue_task_scx(struct rq *rq, struct task_struct *p, int deq_flags */ if (SCX_HAS_OP(sch, stopping) && task_current(rq, p)) { update_curr_scx(rq); - SCX_CALL_OP_TASK(sch, SCX_KF_REST, stopping, rq, p, false); + SCX_CALL_OP_TASK(sch, stopping, rq, p, false); } if (SCX_HAS_OP(sch, quiescent) && !task_on_rq_migrating(p)) - SCX_CALL_OP_TASK(sch, SCX_KF_REST, quiescent, rq, p, deq_flags); + SCX_CALL_OP_TASK(sch, quiescent, rq, p, deq_flags); if (deq_flags & SCX_DEQ_SLEEP) p->scx.flags |= SCX_TASK_DEQD_FOR_SLEEP; @@ -1632,27 +2085,50 @@ static bool dequeue_task_scx(struct rq *rq, struct task_struct *p, int deq_flags static void yield_task_scx(struct rq *rq) { - struct scx_sched *sch = scx_root; struct task_struct *p = rq->donor; + struct scx_sched *sch = scx_task_sched(p); if (SCX_HAS_OP(sch, yield)) - SCX_CALL_OP_2TASKS_RET(sch, SCX_KF_REST, yield, rq, p, NULL); + SCX_CALL_OP_2TASKS_RET(sch, yield, rq, p, NULL); else p->scx.slice = 0; } static bool yield_to_task_scx(struct rq *rq, struct task_struct *to) { - struct scx_sched *sch = scx_root; struct task_struct *from = rq->donor; + struct scx_sched *sch = scx_task_sched(from); - if (SCX_HAS_OP(sch, yield)) - return SCX_CALL_OP_2TASKS_RET(sch, SCX_KF_REST, yield, rq, - from, to); + if (SCX_HAS_OP(sch, yield) && sch == scx_task_sched(to)) + return SCX_CALL_OP_2TASKS_RET(sch, yield, rq, from, to); else return false; } +static void wakeup_preempt_scx(struct rq *rq, struct task_struct *p, int wake_flags) +{ + /* + * Preemption between SCX tasks is implemented by resetting the victim + * task's slice to 0 and triggering reschedule on the target CPU. + * Nothing to do. + */ + if (p->sched_class == &ext_sched_class) + return; + + /* + * Getting preempted by a higher-priority class. Reenqueue IMMED tasks. 
+ * This captures all preemption cases including: + * + * - An SCX task is currently running. + * + * - @rq is waking from idle due to an SCX task waking to it. + * + * - A higher-priority class task wakes up while SCX dispatch is in progress. + */ + if (rq->scx.nr_immed) + schedule_reenq_local(rq, 0); +} + static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags, struct scx_dispatch_q *src_dsq, struct rq *dst_rq) @@ -1670,7 +2146,7 @@ static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags, else list_add_tail(&p->scx.dsq_list.node, &dst_dsq->list); - dsq_mod_nr(dst_dsq, 1); + dsq_inc_nr(dst_dsq, p, enq_flags); p->scx.dsq = dst_dsq; local_dsq_post_enq(dst_dsq, p, enq_flags); @@ -1690,10 +2166,13 @@ static void move_remote_task_to_local_dsq(struct task_struct *p, u64 enq_flags, { lockdep_assert_rq_held(src_rq); - /* the following marks @p MIGRATING which excludes dequeue */ + /* + * Set sticky_cpu before deactivate_task() to properly mark the + * beginning of an SCX-internal migration. 
+ */ + p->scx.sticky_cpu = cpu_of(dst_rq); deactivate_task(src_rq, p, 0); set_task_cpu(p, cpu_of(dst_rq)); - p->scx.sticky_cpu = cpu_of(dst_rq); raw_spin_rq_unlock(src_rq); raw_spin_rq_lock(dst_rq); @@ -1733,7 +2212,7 @@ static bool task_can_run_on_remote_rq(struct scx_sched *sch, struct task_struct *p, struct rq *rq, bool enforce) { - int cpu = cpu_of(rq); + s32 cpu = cpu_of(rq); WARN_ON_ONCE(task_cpu(p) == cpu); @@ -1827,13 +2306,14 @@ static bool unlink_dsq_and_lock_src_rq(struct task_struct *p, !WARN_ON_ONCE(src_rq != task_rq(p)); } -static bool consume_remote_task(struct rq *this_rq, struct task_struct *p, +static bool consume_remote_task(struct rq *this_rq, + struct task_struct *p, u64 enq_flags, struct scx_dispatch_q *dsq, struct rq *src_rq) { raw_spin_rq_unlock(this_rq); if (unlink_dsq_and_lock_src_rq(p, dsq, src_rq)) { - move_remote_task_to_local_dsq(p, 0, src_rq, this_rq); + move_remote_task_to_local_dsq(p, enq_flags, src_rq, this_rq); return true; } else { raw_spin_rq_unlock(src_rq); @@ -1873,8 +2353,9 @@ static struct rq *move_task_between_dsqs(struct scx_sched *sch, dst_rq = container_of(dst_dsq, struct rq, scx.local_dsq); if (src_rq != dst_rq && unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) { - dst_dsq = find_global_dsq(sch, p); + dst_dsq = find_global_dsq(sch, task_cpu(p)); dst_rq = src_rq; + enq_flags |= SCX_ENQ_GDSQ_FALLBACK; } } else { /* no need to migrate if destination is a non-local DSQ */ @@ -1905,14 +2386,14 @@ static struct rq *move_task_between_dsqs(struct scx_sched *sch, dispatch_dequeue_locked(p, src_dsq); raw_spin_unlock(&src_dsq->lock); - dispatch_enqueue(sch, dst_dsq, p, enq_flags); + dispatch_enqueue(sch, dst_rq, dst_dsq, p, enq_flags); } return dst_rq; } static bool consume_dispatch_q(struct scx_sched *sch, struct rq *rq, - struct scx_dispatch_q *dsq) + struct scx_dispatch_q *dsq, u64 enq_flags) { struct task_struct *p; retry: @@ -1937,18 +2418,18 @@ retry: * the system into the bypass mode. 
This can easily live-lock the * machine. If aborting, exit from all non-bypass DSQs. */ - if (unlikely(READ_ONCE(scx_aborting)) && dsq->id != SCX_DSQ_BYPASS) + if (unlikely(READ_ONCE(sch->aborting)) && dsq->id != SCX_DSQ_BYPASS) break; if (rq == task_rq) { task_unlink_from_dsq(p, dsq); - move_local_task_to_local_dsq(p, 0, dsq, rq); + move_local_task_to_local_dsq(p, enq_flags, dsq, rq); raw_spin_unlock(&dsq->lock); return true; } if (task_can_run_on_remote_rq(sch, p, rq, false)) { - if (likely(consume_remote_task(rq, p, dsq, task_rq))) + if (likely(consume_remote_task(rq, p, enq_flags, dsq, task_rq))) return true; goto retry; } @@ -1962,7 +2443,7 @@ static bool consume_global_dsq(struct scx_sched *sch, struct rq *rq) { int node = cpu_to_node(cpu_of(rq)); - return consume_dispatch_q(sch, rq, sch->global_dsqs[node]); + return consume_dispatch_q(sch, rq, &sch->pnode[node]->global_dsq, 0); } /** @@ -1995,15 +2476,15 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq, * If dispatching to @rq that @p is already on, no lock dancing needed. 
*/ if (rq == src_rq && rq == dst_rq) { - dispatch_enqueue(sch, dst_dsq, p, + dispatch_enqueue(sch, rq, dst_dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS); return; } if (src_rq != dst_rq && unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) { - dispatch_enqueue(sch, find_global_dsq(sch, p), p, - enq_flags | SCX_ENQ_CLEAR_OPSS); + dispatch_enqueue(sch, rq, find_global_dsq(sch, task_cpu(p)), p, + enq_flags | SCX_ENQ_CLEAR_OPSS | SCX_ENQ_GDSQ_FALLBACK); return; } @@ -2040,7 +2521,7 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq, */ if (src_rq == dst_rq) { p->scx.holding_cpu = -1; - dispatch_enqueue(sch, &dst_rq->scx.local_dsq, p, + dispatch_enqueue(sch, dst_rq, &dst_rq->scx.local_dsq, p, enq_flags); } else { move_remote_task_to_local_dsq(p, enq_flags, @@ -2110,6 +2591,12 @@ retry: if ((opss & SCX_OPSS_QSEQ_MASK) != qseq_at_dispatch) return; + /* see SCX_EV_INSERT_NOT_OWNED definition */ + if (unlikely(!scx_task_on_sched(sch, p))) { + __scx_add_event(sch, SCX_EV_INSERT_NOT_OWNED, 1); + return; + } + /* * While we know @p is accessible, we don't yet have a claim on * it - the BPF scheduler is allowed to dispatch tasks @@ -2134,17 +2621,17 @@ retry: BUG_ON(!(p->scx.flags & SCX_TASK_QUEUED)); - dsq = find_dsq_for_dispatch(sch, this_rq(), dsq_id, p); + dsq = find_dsq_for_dispatch(sch, this_rq(), dsq_id, task_cpu(p)); if (dsq->id == SCX_DSQ_LOCAL) dispatch_to_local_dsq(sch, rq, dsq, p, enq_flags); else - dispatch_enqueue(sch, dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS); + dispatch_enqueue(sch, rq, dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS); } static void flush_dispatch_buf(struct scx_sched *sch, struct rq *rq) { - struct scx_dsp_ctx *dspc = this_cpu_ptr(scx_dsp_ctx); + struct scx_dsp_ctx *dspc = &this_cpu_ptr(sch->pcpu)->dsp_ctx; u32 u; for (u = 0; u < dspc->cursor; u++) { @@ -2171,13 +2658,117 @@ static inline void maybe_queue_balance_callback(struct rq *rq) rq->scx.flags &= ~SCX_RQ_BAL_CB_PENDING; } +/* + * One user of this function is 
scx_bpf_dispatch() which can be called + * recursively as sub-sched dispatches nest. Always inline to reduce stack usage + * from the call frame. + */ +static __always_inline bool +scx_dispatch_sched(struct scx_sched *sch, struct rq *rq, + struct task_struct *prev, bool nested) +{ + struct scx_dsp_ctx *dspc = &this_cpu_ptr(sch->pcpu)->dsp_ctx; + int nr_loops = SCX_DSP_MAX_LOOPS; + s32 cpu = cpu_of(rq); + bool prev_on_sch = (prev->sched_class == &ext_sched_class) && + scx_task_on_sched(sch, prev); + + if (consume_global_dsq(sch, rq)) + return true; + + if (bypass_dsp_enabled(sch)) { + /* if @sch is bypassing, only the bypass DSQs are active */ + if (scx_bypassing(sch, cpu)) + return consume_dispatch_q(sch, rq, bypass_dsq(sch, cpu), 0); + +#ifdef CONFIG_EXT_SUB_SCHED + /* + * If @sch isn't bypassing but its children are, @sch is + * responsible for making forward progress for both its own + * tasks that aren't bypassing and the bypassing descendants' + * tasks. The following implements a simple built-in behavior - + * let each CPU try to run the bypass DSQ every Nth time. + * + * Later, if necessary, we can add an ops flag to suppress the + * auto-consumption and a kfunc to consume the bypass DSQ + * so that the BPF scheduler can fully control scheduling of + * bypassed tasks. + */ + struct scx_sched_pcpu *pcpu = per_cpu_ptr(sch->pcpu, cpu); + + if (!(pcpu->bypass_host_seq++ % SCX_BYPASS_HOST_NTH) && + consume_dispatch_q(sch, rq, bypass_dsq(sch, cpu), 0)) { + __scx_add_event(sch, SCX_EV_SUB_BYPASS_DISPATCH, 1); + return true; + } +#endif /* CONFIG_EXT_SUB_SCHED */ + } + + if (unlikely(!SCX_HAS_OP(sch, dispatch)) || !scx_rq_online(rq)) + return false; + + dspc->rq = rq; + + /* + * The dispatch loop. Because flush_dispatch_buf() may drop the rq lock, + * the local DSQ might still end up empty after a successful + * ops.dispatch(). If the local DSQ is empty even after ops.dispatch() + * produced some tasks, retry. 
The BPF scheduler may depend on this + * looping behavior to simplify its implementation. + */ + do { + dspc->nr_tasks = 0; + + if (nested) { + SCX_CALL_OP(sch, dispatch, rq, cpu, prev_on_sch ? prev : NULL); + } else { + /* stash @prev so that nested invocations can access it */ + rq->scx.sub_dispatch_prev = prev; + SCX_CALL_OP(sch, dispatch, rq, cpu, prev_on_sch ? prev : NULL); + rq->scx.sub_dispatch_prev = NULL; + } + + flush_dispatch_buf(sch, rq); + + if ((prev->scx.flags & SCX_TASK_QUEUED) && prev->scx.slice) { + rq->scx.flags |= SCX_RQ_BAL_KEEP; + return true; + } + if (rq->scx.local_dsq.nr) + return true; + if (consume_global_dsq(sch, rq)) + return true; + + /* + * ops.dispatch() can trap us in this loop by repeatedly + * dispatching ineligible tasks. Break out once in a while to + * allow the watchdog to run. As IRQ can't be enabled in + * balance(), we want to complete this scheduling cycle and then + * start a new one. IOW, we want to call resched_curr() on the + * next, most likely idle, task, not the current one. Use + * scx_kick_cpu() for deferred kicking. + */ + if (unlikely(!--nr_loops)) { + scx_kick_cpu(sch, cpu, 0); + break; + } + } while (dspc->nr_tasks); + + /* + * Prevent the CPU from going idle while bypassed descendants have tasks + * queued. Without this fallback, bypassed tasks could stall if the host + * scheduler's ops.dispatch() doesn't yield any tasks. 
+ */ + if (bypass_dsp_enabled(sch)) + return consume_dispatch_q(sch, rq, bypass_dsq(sch, cpu), 0); + + return false; +} + static int balance_one(struct rq *rq, struct task_struct *prev) { struct scx_sched *sch = scx_root; - struct scx_dsp_ctx *dspc = this_cpu_ptr(scx_dsp_ctx); - bool prev_on_scx = prev->sched_class == &ext_sched_class; - bool prev_on_rq = prev->scx.flags & SCX_TASK_QUEUED; - int nr_loops = SCX_DSP_MAX_LOOPS; + s32 cpu = cpu_of(rq); lockdep_assert_rq_held(rq); rq->scx.flags |= SCX_RQ_IN_BALANCE; @@ -2192,12 +2783,11 @@ static int balance_one(struct rq *rq, struct task_struct *prev) * emitted in switch_class(). */ if (SCX_HAS_OP(sch, cpu_acquire)) - SCX_CALL_OP(sch, SCX_KF_REST, cpu_acquire, rq, - cpu_of(rq), NULL); + SCX_CALL_OP(sch, cpu_acquire, rq, cpu, NULL); rq->scx.cpu_released = false; } - if (prev_on_scx) { + if (prev->sched_class == &ext_sched_class) { update_curr_scx(rq); /* @@ -2210,7 +2800,8 @@ static int balance_one(struct rq *rq, struct task_struct *prev) * See scx_disable_workfn() for the explanation on the bypassing * test. */ - if (prev_on_rq && prev->scx.slice && !scx_rq_bypassing(rq)) { + if ((prev->scx.flags & SCX_TASK_QUEUED) && prev->scx.slice && + !scx_bypassing(sch, cpu)) { rq->scx.flags |= SCX_RQ_BAL_KEEP; goto has_tasks; } @@ -2220,67 +2811,15 @@ static int balance_one(struct rq *rq, struct task_struct *prev) if (rq->scx.local_dsq.nr) goto has_tasks; - if (consume_global_dsq(sch, rq)) + if (scx_dispatch_sched(sch, rq, prev, false)) goto has_tasks; - if (scx_rq_bypassing(rq)) { - if (consume_dispatch_q(sch, rq, &rq->scx.bypass_dsq)) - goto has_tasks; - else - goto no_tasks; - } - - if (unlikely(!SCX_HAS_OP(sch, dispatch)) || !scx_rq_online(rq)) - goto no_tasks; - - dspc->rq = rq; - - /* - * The dispatch loop. Because flush_dispatch_buf() may drop the rq lock, - * the local DSQ might still end up empty after a successful - * ops.dispatch(). If the local DSQ is empty even after ops.dispatch() - * produced some tasks, retry. 
The BPF scheduler may depend on this - * looping behavior to simplify its implementation. - */ - do { - dspc->nr_tasks = 0; - - SCX_CALL_OP(sch, SCX_KF_DISPATCH, dispatch, rq, - cpu_of(rq), prev_on_scx ? prev : NULL); - - flush_dispatch_buf(sch, rq); - - if (prev_on_rq && prev->scx.slice) { - rq->scx.flags |= SCX_RQ_BAL_KEEP; - goto has_tasks; - } - if (rq->scx.local_dsq.nr) - goto has_tasks; - if (consume_global_dsq(sch, rq)) - goto has_tasks; - - /* - * ops.dispatch() can trap us in this loop by repeatedly - * dispatching ineligible tasks. Break out once in a while to - * allow the watchdog to run. As IRQ can't be enabled in - * balance(), we want to complete this scheduling cycle and then - * start a new one. IOW, we want to call resched_curr() on the - * next, most likely idle, task, not the current one. Use - * scx_kick_cpu() for deferred kicking. - */ - if (unlikely(!--nr_loops)) { - scx_kick_cpu(sch, cpu_of(rq), 0); - break; - } - } while (dspc->nr_tasks); - -no_tasks: /* * Didn't find another task to run. Keep running @prev unless * %SCX_OPS_ENQ_LAST is in effect. */ - if (prev_on_rq && - (!(sch->ops.flags & SCX_OPS_ENQ_LAST) || scx_rq_bypassing(rq))) { + if ((prev->scx.flags & SCX_TASK_QUEUED) && + (!(sch->ops.flags & SCX_OPS_ENQ_LAST) || scx_bypassing(sch, cpu))) { rq->scx.flags |= SCX_RQ_BAL_KEEP; __scx_add_event(sch, SCX_EV_DISPATCH_KEEP_LAST, 1); goto has_tasks; @@ -2289,42 +2828,26 @@ no_tasks: return false; has_tasks: + /* + * @rq may have extra IMMED tasks without reenq scheduled: + * + * - rq_is_open() can't reliably tell when and how slice is going to be + * modified for $curr and allows IMMED tasks to be queued while + * dispatch is in progress. + * + * - A non-IMMED HEAD task can get queued in front of an IMMED task + * between the IMMED queueing and the subsequent scheduling event. 
+ */ + if (unlikely(rq->scx.local_dsq.nr > 1 && rq->scx.nr_immed)) + schedule_reenq_local(rq, 0); + rq->scx.flags &= ~SCX_RQ_IN_BALANCE; return true; } -static void process_ddsp_deferred_locals(struct rq *rq) -{ - struct task_struct *p; - - lockdep_assert_rq_held(rq); - - /* - * Now that @rq can be unlocked, execute the deferred enqueueing of - * tasks directly dispatched to the local DSQs of other CPUs. See - * direct_dispatch(). Keep popping from the head instead of using - * list_for_each_entry_safe() as dispatch_local_dsq() may unlock @rq - * temporarily. - */ - while ((p = list_first_entry_or_null(&rq->scx.ddsp_deferred_locals, - struct task_struct, scx.dsq_list.node))) { - struct scx_sched *sch = scx_root; - struct scx_dispatch_q *dsq; - u64 dsq_id = p->scx.ddsp_dsq_id; - u64 enq_flags = p->scx.ddsp_enq_flags; - - list_del_init(&p->scx.dsq_list.node); - clear_direct_dispatch(p); - - dsq = find_dsq_for_dispatch(sch, rq, dsq_id, p); - if (!WARN_ON_ONCE(dsq->id != SCX_DSQ_LOCAL)) - dispatch_to_local_dsq(sch, rq, dsq, p, enq_flags); - } -} - static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first) { - struct scx_sched *sch = scx_root; + struct scx_sched *sch = scx_task_sched(p); if (p->scx.flags & SCX_TASK_QUEUED) { /* @@ -2339,7 +2862,7 @@ static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first) /* see dequeue_task_scx() on why we skip when !QUEUED */ if (SCX_HAS_OP(sch, running) && (p->scx.flags & SCX_TASK_QUEUED)) - SCX_CALL_OP_TASK(sch, SCX_KF_REST, running, rq, p); + SCX_CALL_OP_TASK(sch, running, rq, p); clr_task_runnable(p, true); @@ -2411,8 +2934,7 @@ static void switch_class(struct rq *rq, struct task_struct *next) .task = next, }; - SCX_CALL_OP(sch, SCX_KF_CPU_RELEASE, cpu_release, rq, - cpu_of(rq), &args); + SCX_CALL_OP(sch, cpu_release, rq, cpu_of(rq), &args); } rq->scx.cpu_released = true; } @@ -2421,7 +2943,7 @@ static void switch_class(struct rq *rq, struct task_struct *next) static void 
put_prev_task_scx(struct rq *rq, struct task_struct *p, struct task_struct *next) { - struct scx_sched *sch = scx_root; + struct scx_sched *sch = scx_task_sched(p); /* see kick_sync_wait_bal_cb() */ smp_store_release(&rq->scx.kick_sync, rq->scx.kick_sync + 1); @@ -2430,7 +2952,7 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p, /* see dequeue_task_scx() on why we skip when !QUEUED */ if (SCX_HAS_OP(sch, stopping) && (p->scx.flags & SCX_TASK_QUEUED)) - SCX_CALL_OP_TASK(sch, SCX_KF_REST, stopping, rq, p, true); + SCX_CALL_OP_TASK(sch, stopping, rq, p, true); if (p->scx.flags & SCX_TASK_QUEUED) { set_task_runnable(rq, p); @@ -2439,11 +2961,17 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p, * If @p has slice left and is being put, @p is getting * preempted by a higher priority scheduler class or core-sched * forcing a different task. Leave it at the head of the local - * DSQ. + * DSQ unless it was an IMMED task. IMMED tasks should not + * linger on a busy CPU, reenqueue them to the BPF scheduler. 
*/ - if (p->scx.slice && !scx_rq_bypassing(rq)) { - dispatch_enqueue(sch, &rq->scx.local_dsq, p, - SCX_ENQ_HEAD); + if (p->scx.slice && !scx_bypassing(sch, cpu_of(rq))) { + if (p->scx.flags & SCX_TASK_IMMED) { + p->scx.flags |= SCX_TASK_REENQ_PREEMPTED; + do_enqueue_task(rq, p, SCX_ENQ_REENQ, -1); + p->scx.flags &= ~SCX_TASK_REENQ_REASON_MASK; + } else { + dispatch_enqueue(sch, rq, &rq->scx.local_dsq, p, SCX_ENQ_HEAD); + } goto switch_class; } @@ -2568,16 +3096,17 @@ do_pick_task_scx(struct rq *rq, struct rq_flags *rf, bool force_scx) if (keep_prev) { p = prev; if (!p->scx.slice) - refill_task_slice_dfl(rcu_dereference_sched(scx_root), p); + refill_task_slice_dfl(scx_task_sched(p), p); } else { p = first_local_task(rq); if (!p) return NULL; if (unlikely(!p->scx.slice)) { - struct scx_sched *sch = rcu_dereference_sched(scx_root); + struct scx_sched *sch = scx_task_sched(p); - if (!scx_rq_bypassing(rq) && !sch->warned_zero_slice) { + if (!scx_bypassing(sch, cpu_of(rq)) && + !sch->warned_zero_slice) { printk_deferred(KERN_WARNING "sched_ext: %s[%d] has zero slice in %s()\n", p->comm, p->pid, __func__); sch->warned_zero_slice = true; @@ -2643,16 +3172,17 @@ void ext_server_init(struct rq *rq) bool scx_prio_less(const struct task_struct *a, const struct task_struct *b, bool in_fi) { - struct scx_sched *sch = scx_root; + struct scx_sched *sch_a = scx_task_sched(a); + struct scx_sched *sch_b = scx_task_sched(b); /* * The const qualifiers are dropped from task_struct pointers when * calling ops.core_sched_before(). Accesses are controlled by the * verifier. 
*/ - if (SCX_HAS_OP(sch, core_sched_before) && - !scx_rq_bypassing(task_rq(a))) - return SCX_CALL_OP_2TASKS_RET(sch, SCX_KF_REST, core_sched_before, + if (sch_a == sch_b && SCX_HAS_OP(sch_a, core_sched_before) && + !scx_bypassing(sch_a, task_cpu(a))) + return SCX_CALL_OP_2TASKS_RET(sch_a, core_sched_before, NULL, (struct task_struct *)a, (struct task_struct *)b); @@ -2663,8 +3193,8 @@ bool scx_prio_less(const struct task_struct *a, const struct task_struct *b, static int select_task_rq_scx(struct task_struct *p, int prev_cpu, int wake_flags) { - struct scx_sched *sch = scx_root; - bool rq_bypass; + struct scx_sched *sch = scx_task_sched(p); + bool bypassing; /* * sched_exec() calls with %WF_EXEC when @p is about to exec(2) as it @@ -2679,8 +3209,8 @@ static int select_task_rq_scx(struct task_struct *p, int prev_cpu, int wake_flag if (unlikely(wake_flags & WF_EXEC)) return prev_cpu; - rq_bypass = scx_rq_bypassing(task_rq(p)); - if (likely(SCX_HAS_OP(sch, select_cpu)) && !rq_bypass) { + bypassing = scx_bypassing(sch, task_cpu(p)); + if (likely(SCX_HAS_OP(sch, select_cpu)) && !bypassing) { s32 cpu; struct task_struct **ddsp_taskp; @@ -2688,10 +3218,9 @@ static int select_task_rq_scx(struct task_struct *p, int prev_cpu, int wake_flag WARN_ON_ONCE(*ddsp_taskp); *ddsp_taskp = p; - cpu = SCX_CALL_OP_TASK_RET(sch, - SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU, - select_cpu, NULL, p, prev_cpu, - wake_flags); + this_rq()->scx.in_select_cpu = true; + cpu = SCX_CALL_OP_TASK_RET(sch, select_cpu, NULL, p, prev_cpu, wake_flags); + this_rq()->scx.in_select_cpu = false; p->scx.selected_cpu = cpu; *ddsp_taskp = NULL; if (ops_cpu_valid(sch, cpu, "from ops.select_cpu()")) @@ -2710,7 +3239,7 @@ static int select_task_rq_scx(struct task_struct *p, int prev_cpu, int wake_flag } p->scx.selected_cpu = cpu; - if (rq_bypass) + if (bypassing) __scx_add_event(sch, SCX_EV_BYPASS_DISPATCH, 1); return cpu; } @@ -2724,7 +3253,7 @@ static void task_woken_scx(struct rq *rq, struct task_struct *p) static void 
set_cpus_allowed_scx(struct task_struct *p, struct affinity_context *ac) { - struct scx_sched *sch = scx_root; + struct scx_sched *sch = scx_task_sched(p); set_cpus_allowed_common(p, ac); @@ -2740,14 +3269,13 @@ static void set_cpus_allowed_scx(struct task_struct *p, * designation pointless. Cast it away when calling the operation. */ if (SCX_HAS_OP(sch, set_cpumask)) - SCX_CALL_OP_TASK(sch, SCX_KF_REST, set_cpumask, NULL, - p, (struct cpumask *)p->cpus_ptr); + SCX_CALL_OP_TASK(sch, set_cpumask, task_rq(p), p, (struct cpumask *)p->cpus_ptr); } static void handle_hotplug(struct rq *rq, bool online) { struct scx_sched *sch = scx_root; - int cpu = cpu_of(rq); + s32 cpu = cpu_of(rq); atomic_long_inc(&scx_hotplug_seq); @@ -2763,9 +3291,9 @@ static void handle_hotplug(struct rq *rq, bool online) scx_idle_update_selcpu_topology(&sch->ops); if (online && SCX_HAS_OP(sch, cpu_online)) - SCX_CALL_OP(sch, SCX_KF_UNLOCKED, cpu_online, NULL, cpu); + SCX_CALL_OP(sch, cpu_online, NULL, cpu); else if (!online && SCX_HAS_OP(sch, cpu_offline)) - SCX_CALL_OP(sch, SCX_KF_UNLOCKED, cpu_offline, NULL, cpu); + SCX_CALL_OP(sch, cpu_offline, NULL, cpu); else scx_exit(sch, SCX_EXIT_UNREG_KERN, SCX_ECODE_ACT_RESTART | SCX_ECODE_RSN_HOTPLUG, @@ -2793,7 +3321,6 @@ static void rq_offline_scx(struct rq *rq) rq->scx.flags &= ~SCX_RQ_ONLINE; } - static bool check_rq_for_timeouts(struct rq *rq) { struct scx_sched *sch; @@ -2807,10 +3334,11 @@ static bool check_rq_for_timeouts(struct rq *rq) goto out_unlock; list_for_each_entry(p, &rq->scx.runnable_list, scx.runnable_node) { + struct scx_sched *sch = scx_task_sched(p); unsigned long last_runnable = p->scx.runnable_at; if (unlikely(time_after(jiffies, - last_runnable + READ_ONCE(scx_watchdog_timeout)))) { + last_runnable + READ_ONCE(sch->watchdog_timeout)))) { u32 dur_ms = jiffies_to_msecs(jiffies - last_runnable); scx_exit(sch, SCX_EXIT_ERROR_STALL, 0, @@ -2827,6 +3355,7 @@ out_unlock: static void scx_watchdog_workfn(struct work_struct *work) { + 
unsigned long intv; int cpu; WRITE_ONCE(scx_watchdog_timestamp, jiffies); @@ -2837,28 +3366,30 @@ static void scx_watchdog_workfn(struct work_struct *work) cond_resched(); } - queue_delayed_work(system_dfl_wq, to_delayed_work(work), - READ_ONCE(scx_watchdog_timeout) / 2); + + intv = READ_ONCE(scx_watchdog_interval); + if (intv < ULONG_MAX) + queue_delayed_work(system_dfl_wq, to_delayed_work(work), intv); } void scx_tick(struct rq *rq) { - struct scx_sched *sch; + struct scx_sched *root; unsigned long last_check; if (!scx_enabled()) return; - sch = rcu_dereference_bh(scx_root); - if (unlikely(!sch)) + root = rcu_dereference_bh(scx_root); + if (unlikely(!root)) return; last_check = READ_ONCE(scx_watchdog_timestamp); if (unlikely(time_after(jiffies, - last_check + READ_ONCE(scx_watchdog_timeout)))) { + last_check + READ_ONCE(root->watchdog_timeout)))) { u32 dur_ms = jiffies_to_msecs(jiffies - last_check); - scx_exit(sch, SCX_EXIT_ERROR_STALL, 0, + scx_exit(root, SCX_EXIT_ERROR_STALL, 0, "watchdog failed to check in for %u.%03us", dur_ms / 1000, dur_ms % 1000); } @@ -2868,7 +3399,7 @@ void scx_tick(struct rq *rq) static void task_tick_scx(struct rq *rq, struct task_struct *curr, int queued) { - struct scx_sched *sch = scx_root; + struct scx_sched *sch = scx_task_sched(curr); update_curr_scx(rq); @@ -2876,11 +3407,11 @@ static void task_tick_scx(struct rq *rq, struct task_struct *curr, int queued) * While disabling, always resched and refresh core-sched timestamp as * we can't trust the slice management or ops.core_sched_before(). 
*/ - if (scx_rq_bypassing(rq)) { + if (scx_bypassing(sch, cpu_of(rq))) { curr->scx.slice = 0; touch_core_sched(rq, curr); } else if (SCX_HAS_OP(sch, tick)) { - SCX_CALL_OP_TASK(sch, SCX_KF_REST, tick, rq, curr); + SCX_CALL_OP_TASK(sch, tick, rq, curr); } if (!curr->scx.slice) @@ -2909,18 +3440,16 @@ static struct cgroup *tg_cgrp(struct task_group *tg) #endif /* CONFIG_EXT_GROUP_SCHED */ -static enum scx_task_state scx_get_task_state(const struct task_struct *p) +static u32 scx_get_task_state(const struct task_struct *p) { - return (p->scx.flags & SCX_TASK_STATE_MASK) >> SCX_TASK_STATE_SHIFT; + return p->scx.flags & SCX_TASK_STATE_MASK; } -static void scx_set_task_state(struct task_struct *p, enum scx_task_state state) +static void scx_set_task_state(struct task_struct *p, u32 state) { - enum scx_task_state prev_state = scx_get_task_state(p); + u32 prev_state = scx_get_task_state(p); bool warn = false; - BUILD_BUG_ON(SCX_TASK_NR_STATES > (1 << SCX_TASK_STATE_BITS)); - switch (state) { case SCX_TASK_NONE: break; @@ -2934,42 +3463,45 @@ static void scx_set_task_state(struct task_struct *p, enum scx_task_state state) warn = prev_state != SCX_TASK_READY; break; default: - warn = true; + WARN_ONCE(1, "sched_ext: Invalid task state %d -> %d for %s[%d]", + prev_state, state, p->comm, p->pid); return; } - WARN_ONCE(warn, "sched_ext: Invalid task state transition %d -> %d for %s[%d]", + WARN_ONCE(warn, "sched_ext: Invalid task state transition 0x%x -> 0x%x for %s[%d]", prev_state, state, p->comm, p->pid); p->scx.flags &= ~SCX_TASK_STATE_MASK; - p->scx.flags |= state << SCX_TASK_STATE_SHIFT; + p->scx.flags |= state; } -static int scx_init_task(struct task_struct *p, struct task_group *tg, bool fork) +static int __scx_init_task(struct scx_sched *sch, struct task_struct *p, bool fork) { - struct scx_sched *sch = scx_root; int ret; p->scx.disallow = false; if (SCX_HAS_OP(sch, init_task)) { struct scx_init_task_args args = { - SCX_INIT_TASK_ARGS_CGROUP(tg) + 
SCX_INIT_TASK_ARGS_CGROUP(task_group(p)) .fork = fork, }; - ret = SCX_CALL_OP_RET(sch, SCX_KF_UNLOCKED, init_task, NULL, - p, &args); + ret = SCX_CALL_OP_RET(sch, init_task, NULL, p, &args); if (unlikely(ret)) { ret = ops_sanitize_err(sch, "init_task", ret); return ret; } } - scx_set_task_state(p, SCX_TASK_INIT); - if (p->scx.disallow) { - if (!fork) { + if (unlikely(scx_parent(sch))) { + scx_error(sch, "non-root ops.init_task() set task->scx.disallow for %s[%d]", + p->comm, p->pid); + } else if (unlikely(fork)) { + scx_error(sch, "ops.init_task() set task->scx.disallow for %s[%d] during fork", + p->comm, p->pid); + } else { struct rq *rq; struct rq_flags rf; @@ -2988,24 +3520,42 @@ static int scx_init_task(struct task_struct *p, struct task_group *tg, bool fork } task_rq_unlock(rq, p, &rf); - } else if (p->policy == SCHED_EXT) { - scx_error(sch, "ops.init_task() set task->scx.disallow for %s[%d] during fork", - p->comm, p->pid); } } - p->scx.flags |= SCX_TASK_RESET_RUNNABLE_AT; return 0; } -static void scx_enable_task(struct task_struct *p) +static int scx_init_task(struct scx_sched *sch, struct task_struct *p, bool fork) +{ + int ret; + + ret = __scx_init_task(sch, p, fork); + if (!ret) { + /* + * While @p's rq is not locked, @p is not visible to the rest of + * SCX yet and it's safe to update the flags and state. + */ + p->scx.flags |= SCX_TASK_RESET_RUNNABLE_AT; + scx_set_task_state(p, SCX_TASK_INIT); + } + return ret; +} + +static void __scx_enable_task(struct scx_sched *sch, struct task_struct *p) { - struct scx_sched *sch = scx_root; struct rq *rq = task_rq(p); u32 weight; lockdep_assert_rq_held(rq); + /* + * Verify the task is not in BPF scheduler's custody. If flag + * transitions are consistent, the flag should always be clear + * here. + */ + WARN_ON_ONCE(p->scx.flags & SCX_TASK_IN_CUSTODY); + /* * Set the weight before calling ops.enable() so that the scheduler * doesn't see a stale value if they inspect the task struct. 
@@ -3018,17 +3568,20 @@ static void scx_enable_task(struct task_struct *p) p->scx.weight = sched_weight_to_cgroup(weight); if (SCX_HAS_OP(sch, enable)) - SCX_CALL_OP_TASK(sch, SCX_KF_REST, enable, rq, p); - scx_set_task_state(p, SCX_TASK_ENABLED); + SCX_CALL_OP_TASK(sch, enable, rq, p); if (SCX_HAS_OP(sch, set_weight)) - SCX_CALL_OP_TASK(sch, SCX_KF_REST, set_weight, rq, - p, p->scx.weight); + SCX_CALL_OP_TASK(sch, set_weight, rq, p, p->scx.weight); } -static void scx_disable_task(struct task_struct *p) +static void scx_enable_task(struct scx_sched *sch, struct task_struct *p) +{ + __scx_enable_task(sch, p); + scx_set_task_state(p, SCX_TASK_ENABLED); +} + +static void scx_disable_task(struct scx_sched *sch, struct task_struct *p) { - struct scx_sched *sch = scx_root; struct rq *rq = task_rq(p); lockdep_assert_rq_held(rq); @@ -3037,17 +3590,25 @@ static void scx_disable_task(struct task_struct *p) clear_direct_dispatch(p); if (SCX_HAS_OP(sch, disable)) - SCX_CALL_OP_TASK(sch, SCX_KF_REST, disable, rq, p); + SCX_CALL_OP_TASK(sch, disable, rq, p); scx_set_task_state(p, SCX_TASK_READY); + + /* + * Verify the task is not in BPF scheduler's custody. If flag + * transitions are consistent, the flag should always be clear + * here. 
+ */ + WARN_ON_ONCE(p->scx.flags & SCX_TASK_IN_CUSTODY); } -static void scx_exit_task(struct task_struct *p) +static void __scx_disable_and_exit_task(struct scx_sched *sch, + struct task_struct *p) { - struct scx_sched *sch = scx_root; struct scx_exit_task_args args = { .cancelled = false, }; + lockdep_assert_held(&p->pi_lock); lockdep_assert_rq_held(task_rq(p)); switch (scx_get_task_state(p)) { @@ -3059,7 +3620,7 @@ static void scx_exit_task(struct task_struct *p) case SCX_TASK_READY: break; case SCX_TASK_ENABLED: - scx_disable_task(p); + scx_disable_task(sch, p); break; default: WARN_ON_ONCE(true); @@ -3067,8 +3628,26 @@ static void scx_exit_task(struct task_struct *p) } if (SCX_HAS_OP(sch, exit_task)) - SCX_CALL_OP_TASK(sch, SCX_KF_REST, exit_task, task_rq(p), - p, &args); + SCX_CALL_OP_TASK(sch, exit_task, task_rq(p), p, &args); +} + +static void scx_disable_and_exit_task(struct scx_sched *sch, + struct task_struct *p) +{ + __scx_disable_and_exit_task(sch, p); + + /* + * If set, @p exited between __scx_init_task() and scx_enable_task() in + * scx_sub_enable() and is initialized for both the associated sched and + * its parent. Disable and exit for the child too. 
+ */ + if ((p->scx.flags & SCX_TASK_SUB_INIT) && + !WARN_ON_ONCE(!scx_enabling_sub_sched)) { + __scx_disable_and_exit_task(scx_enabling_sub_sched, p); + p->scx.flags &= ~SCX_TASK_SUB_INIT; + } + + scx_set_task_sched(p, NULL); scx_set_task_state(p, SCX_TASK_NONE); } @@ -3082,7 +3661,7 @@ void init_scx_entity(struct sched_ext_entity *scx) INIT_LIST_HEAD(&scx->runnable_node); scx->runnable_at = jiffies; scx->ddsp_dsq_id = SCX_DSQ_INVALID; - scx->slice = READ_ONCE(scx_slice_dfl); + scx->slice = SCX_SLICE_DFL; } void scx_pre_fork(struct task_struct *p) @@ -3096,14 +3675,25 @@ void scx_pre_fork(struct task_struct *p) percpu_down_read(&scx_fork_rwsem); } -int scx_fork(struct task_struct *p) +int scx_fork(struct task_struct *p, struct kernel_clone_args *kargs) { + s32 ret; + percpu_rwsem_assert_held(&scx_fork_rwsem); - if (scx_init_task_enabled) - return scx_init_task(p, task_group(p), true); - else - return 0; + if (scx_init_task_enabled) { +#ifdef CONFIG_EXT_SUB_SCHED + struct scx_sched *sch = kargs->cset->dfl_cgrp->scx_sched; +#else + struct scx_sched *sch = scx_root; +#endif + ret = scx_init_task(sch, p, true); + if (!ret) + scx_set_task_sched(p, sch); + return ret; + } + + return 0; } void scx_post_fork(struct task_struct *p) @@ -3121,7 +3711,7 @@ void scx_post_fork(struct task_struct *p) struct rq *rq; rq = task_rq_lock(p, &rf); - scx_enable_task(p); + scx_enable_task(scx_task_sched(p), p); task_rq_unlock(rq, p, &rf); } } @@ -3141,7 +3731,7 @@ void scx_cancel_fork(struct task_struct *p) rq = task_rq_lock(p, &rf); WARN_ON_ONCE(scx_get_task_state(p) >= SCX_TASK_READY); - scx_exit_task(p); + scx_disable_and_exit_task(scx_task_sched(p), p); task_rq_unlock(rq, p, &rf); } @@ -3192,15 +3782,15 @@ void sched_ext_dead(struct task_struct *p) raw_spin_unlock_irqrestore(&scx_tasks_lock, flags); /* - * @p is off scx_tasks and wholly ours. scx_enable()'s READY -> ENABLED - * transitions can't race us. Disable ops for @p. + * @p is off scx_tasks and wholly ours. 
scx_root_enable()'s READY -> + * ENABLED transitions can't race us. Disable ops for @p. */ if (scx_get_task_state(p) != SCX_TASK_NONE) { struct rq_flags rf; struct rq *rq; rq = task_rq_lock(p, &rf); - scx_exit_task(p); + scx_disable_and_exit_task(scx_task_sched(p), p); task_rq_unlock(rq, p, &rf); } } @@ -3208,7 +3798,7 @@ void sched_ext_dead(struct task_struct *p) static void reweight_task_scx(struct rq *rq, struct task_struct *p, const struct load_weight *lw) { - struct scx_sched *sch = scx_root; + struct scx_sched *sch = scx_task_sched(p); lockdep_assert_rq_held(task_rq(p)); @@ -3217,8 +3807,7 @@ static void reweight_task_scx(struct rq *rq, struct task_struct *p, p->scx.weight = sched_weight_to_cgroup(scale_load_down(lw->weight)); if (SCX_HAS_OP(sch, set_weight)) - SCX_CALL_OP_TASK(sch, SCX_KF_REST, set_weight, rq, - p, p->scx.weight); + SCX_CALL_OP_TASK(sch, set_weight, rq, p, p->scx.weight); } static void prio_changed_scx(struct rq *rq, struct task_struct *p, u64 oldprio) @@ -3227,20 +3816,19 @@ static void prio_changed_scx(struct rq *rq, struct task_struct *p, u64 oldprio) static void switching_to_scx(struct rq *rq, struct task_struct *p) { - struct scx_sched *sch = scx_root; + struct scx_sched *sch = scx_task_sched(p); if (task_dead_and_done(p)) return; - scx_enable_task(p); + scx_enable_task(sch, p); /* * set_cpus_allowed_scx() is not called while @p is associated with a * different scheduler class. Keep the BPF scheduler up-to-date. 
*/ if (SCX_HAS_OP(sch, set_cpumask)) - SCX_CALL_OP_TASK(sch, SCX_KF_REST, set_cpumask, rq, - p, (struct cpumask *)p->cpus_ptr); + SCX_CALL_OP_TASK(sch, set_cpumask, rq, p, (struct cpumask *)p->cpus_ptr); } static void switched_from_scx(struct rq *rq, struct task_struct *p) @@ -3248,11 +3836,9 @@ static void switched_from_scx(struct rq *rq, struct task_struct *p) if (task_dead_and_done(p)) return; - scx_disable_task(p); + scx_disable_task(scx_task_sched(p), p); } -static void wakeup_preempt_scx(struct rq *rq, struct task_struct *p, int wake_flags) {} - static void switched_to_scx(struct rq *rq, struct task_struct *p) {} int scx_check_setscheduler(struct task_struct *p, int policy) @@ -3267,17 +3853,327 @@ int scx_check_setscheduler(struct task_struct *p, int policy) return 0; } +static void process_ddsp_deferred_locals(struct rq *rq) +{ + struct task_struct *p; + + lockdep_assert_rq_held(rq); + + /* + * Now that @rq can be unlocked, execute the deferred enqueueing of + * tasks directly dispatched to the local DSQs of other CPUs. See + * direct_dispatch(). Keep popping from the head instead of using + * list_for_each_entry_safe() as dispatch_local_dsq() may unlock @rq + * temporarily. + */ + while ((p = list_first_entry_or_null(&rq->scx.ddsp_deferred_locals, + struct task_struct, scx.dsq_list.node))) { + struct scx_sched *sch = scx_task_sched(p); + struct scx_dispatch_q *dsq; + u64 dsq_id = p->scx.ddsp_dsq_id; + u64 enq_flags = p->scx.ddsp_enq_flags; + + list_del_init(&p->scx.dsq_list.node); + clear_direct_dispatch(p); + + dsq = find_dsq_for_dispatch(sch, rq, dsq_id, task_cpu(p)); + if (!WARN_ON_ONCE(dsq->id != SCX_DSQ_LOCAL)) + dispatch_to_local_dsq(sch, rq, dsq, p, enq_flags); + } +} + +/* + * Determine whether @p should be reenqueued from a local DSQ. + * + * @reenq_flags is mutable and accumulates state across the DSQ walk: + * + * - %SCX_REENQ_TSR_NOT_FIRST: Set after the first task is visited. "First" + * tracks position in the DSQ list, not among IMMED tasks. 
A non-IMMED task at + * the head consumes the first slot. + * + * - %SCX_REENQ_TSR_RQ_OPEN: Set by reenq_local() before the walk if + * rq_is_open() is true. + * + * An IMMED task is kept (returns %false) only if it's the first task in the DSQ + * AND the current task is done — i.e. it will execute immediately. All other + * IMMED tasks are reenqueued. This means if a non-IMMED task sits at the head, + * every IMMED task behind it gets reenqueued. + * + * Reenqueued tasks go through ops.enqueue() with %SCX_ENQ_REENQ | + * %SCX_TASK_REENQ_IMMED. If the BPF scheduler dispatches back to the same local + * DSQ with %SCX_ENQ_IMMED while the CPU is still unavailable, this triggers + * another reenq cycle. Repetitions are bounded by %SCX_REENQ_LOCAL_MAX_REPEAT + * in process_deferred_reenq_locals(). + */ +static bool local_task_should_reenq(struct task_struct *p, u64 *reenq_flags, u32 *reason) +{ + bool first; + + first = !(*reenq_flags & SCX_REENQ_TSR_NOT_FIRST); + *reenq_flags |= SCX_REENQ_TSR_NOT_FIRST; + + *reason = SCX_TASK_REENQ_KFUNC; + + if ((p->scx.flags & SCX_TASK_IMMED) && + (!first || !(*reenq_flags & SCX_REENQ_TSR_RQ_OPEN))) { + __scx_add_event(scx_task_sched(p), SCX_EV_REENQ_IMMED, 1); + *reason = SCX_TASK_REENQ_IMMED; + return true; + } + + return *reenq_flags & SCX_REENQ_ANY; +} + +static u32 reenq_local(struct scx_sched *sch, struct rq *rq, u64 reenq_flags) +{ + LIST_HEAD(tasks); + u32 nr_enqueued = 0; + struct task_struct *p, *n; + + lockdep_assert_rq_held(rq); + + if (WARN_ON_ONCE(reenq_flags & __SCX_REENQ_TSR_MASK)) + reenq_flags &= ~__SCX_REENQ_TSR_MASK; + if (rq_is_open(rq, 0)) + reenq_flags |= SCX_REENQ_TSR_RQ_OPEN; + + /* + * The BPF scheduler may choose to dispatch tasks back to + * @rq->scx.local_dsq. Move all candidate tasks off to a private list + * first to avoid processing the same tasks repeatedly. 
+ */ + list_for_each_entry_safe(p, n, &rq->scx.local_dsq.list, + scx.dsq_list.node) { + struct scx_sched *task_sch = scx_task_sched(p); + u32 reason; + + /* + * If @p is being migrated, @p's current CPU may not agree with + * its allowed CPUs and the migration_cpu_stop is about to + * deactivate and re-activate @p anyway. Skip re-enqueueing. + * + * While racing sched property changes may also dequeue and + * re-enqueue a migrating task while its current CPU and allowed + * CPUs disagree, they use %ENQUEUE_RESTORE which is bypassed to + * the current local DSQ for running tasks and thus are not + * visible to the BPF scheduler. + */ + if (p->migration_pending) + continue; + + if (!scx_is_descendant(task_sch, sch)) + continue; + + if (!local_task_should_reenq(p, &reenq_flags, &reason)) + continue; + + dispatch_dequeue(rq, p); + + if (WARN_ON_ONCE(p->scx.flags & SCX_TASK_REENQ_REASON_MASK)) + p->scx.flags &= ~SCX_TASK_REENQ_REASON_MASK; + p->scx.flags |= reason; + + list_add_tail(&p->scx.dsq_list.node, &tasks); + } + + list_for_each_entry_safe(p, n, &tasks, scx.dsq_list.node) { + list_del_init(&p->scx.dsq_list.node); + + do_enqueue_task(rq, p, SCX_ENQ_REENQ, -1); + + p->scx.flags &= ~SCX_TASK_REENQ_REASON_MASK; + nr_enqueued++; + } + + return nr_enqueued; +} + +static void process_deferred_reenq_locals(struct rq *rq) +{ + u64 seq = ++rq->scx.deferred_reenq_locals_seq; + + lockdep_assert_rq_held(rq); + + while (true) { + struct scx_sched *sch; + u64 reenq_flags; + bool skip = false; + + scoped_guard (raw_spinlock, &rq->scx.deferred_reenq_lock) { + struct scx_deferred_reenq_local *drl = + list_first_entry_or_null(&rq->scx.deferred_reenq_locals, + struct scx_deferred_reenq_local, + node); + struct scx_sched_pcpu *sch_pcpu; + + if (!drl) + return; + + sch_pcpu = container_of(drl, struct scx_sched_pcpu, + deferred_reenq_local); + sch = sch_pcpu->sch; + + reenq_flags = drl->flags; + WRITE_ONCE(drl->flags, 0); + list_del_init(&drl->node); + + if (likely(drl->seq != seq)) { 
+ drl->seq = seq; + drl->cnt = 0; + } else { + if (unlikely(++drl->cnt > SCX_REENQ_LOCAL_MAX_REPEAT)) { + scx_error(sch, "SCX_ENQ_REENQ on SCX_DSQ_LOCAL repeated %u times", + drl->cnt); + skip = true; + } + + __scx_add_event(sch, SCX_EV_REENQ_LOCAL_REPEAT, 1); + } + } + + if (!skip) { + /* see schedule_dsq_reenq() */ + smp_mb(); + + reenq_local(sch, rq, reenq_flags); + } + } +} + +static bool user_task_should_reenq(struct task_struct *p, u64 reenq_flags, u32 *reason) +{ + *reason = SCX_TASK_REENQ_KFUNC; + return reenq_flags & SCX_REENQ_ANY; +} + +static void reenq_user(struct rq *rq, struct scx_dispatch_q *dsq, u64 reenq_flags) +{ + struct rq *locked_rq = rq; + struct scx_sched *sch = dsq->sched; + struct scx_dsq_list_node cursor = INIT_DSQ_LIST_CURSOR(cursor, dsq, 0); + struct task_struct *p; + s32 nr_enqueued = 0; + + lockdep_assert_rq_held(rq); + + raw_spin_lock(&dsq->lock); + + while (likely(!READ_ONCE(sch->bypass_depth))) { + struct rq *task_rq; + u32 reason; + + p = nldsq_cursor_next_task(&cursor, dsq); + if (!p) + break; + + if (!user_task_should_reenq(p, reenq_flags, &reason)) + continue; + + task_rq = task_rq(p); + + if (locked_rq != task_rq) { + if (locked_rq) + raw_spin_rq_unlock(locked_rq); + if (unlikely(!raw_spin_rq_trylock(task_rq))) { + raw_spin_unlock(&dsq->lock); + raw_spin_rq_lock(task_rq); + raw_spin_lock(&dsq->lock); + } + locked_rq = task_rq; + + /* did we lose @p while switching locks? 
*/ + if (nldsq_cursor_lost_task(&cursor, task_rq, dsq, p)) + continue; + } + + /* @p is on @dsq, its rq and @dsq are locked */ + dispatch_dequeue_locked(p, dsq); + raw_spin_unlock(&dsq->lock); + + if (WARN_ON_ONCE(p->scx.flags & SCX_TASK_REENQ_REASON_MASK)) + p->scx.flags &= ~SCX_TASK_REENQ_REASON_MASK; + p->scx.flags |= reason; + + do_enqueue_task(task_rq, p, SCX_ENQ_REENQ, -1); + + p->scx.flags &= ~SCX_TASK_REENQ_REASON_MASK; + + if (!(++nr_enqueued % SCX_TASK_ITER_BATCH)) { + raw_spin_rq_unlock(locked_rq); + locked_rq = NULL; + cpu_relax(); + } + + raw_spin_lock(&dsq->lock); + } + + list_del_init(&cursor.node); + raw_spin_unlock(&dsq->lock); + + if (locked_rq != rq) { + if (locked_rq) + raw_spin_rq_unlock(locked_rq); + raw_spin_rq_lock(rq); + } +} + +static void process_deferred_reenq_users(struct rq *rq) +{ + lockdep_assert_rq_held(rq); + + while (true) { + struct scx_dispatch_q *dsq; + u64 reenq_flags; + + scoped_guard (raw_spinlock, &rq->scx.deferred_reenq_lock) { + struct scx_deferred_reenq_user *dru = + list_first_entry_or_null(&rq->scx.deferred_reenq_users, + struct scx_deferred_reenq_user, + node); + struct scx_dsq_pcpu *dsq_pcpu; + + if (!dru) + return; + + dsq_pcpu = container_of(dru, struct scx_dsq_pcpu, + deferred_reenq_user); + dsq = dsq_pcpu->dsq; + reenq_flags = dru->flags; + WRITE_ONCE(dru->flags, 0); + list_del_init(&dru->node); + } + + /* see schedule_dsq_reenq() */ + smp_mb(); + + BUG_ON(dsq->id & SCX_DSQ_FLAG_BUILTIN); + reenq_user(rq, dsq, reenq_flags); + } +} + +static void run_deferred(struct rq *rq) +{ + process_ddsp_deferred_locals(rq); + + if (!list_empty(&rq->scx.deferred_reenq_locals)) + process_deferred_reenq_locals(rq); + + if (!list_empty(&rq->scx.deferred_reenq_users)) + process_deferred_reenq_users(rq); +} + #ifdef CONFIG_NO_HZ_FULL bool scx_can_stop_tick(struct rq *rq) { struct task_struct *p = rq->curr; - - if (scx_rq_bypassing(rq)) - return false; + struct scx_sched *sch = scx_task_sched(p); if (p->sched_class != 
&ext_sched_class) return true; + if (scx_bypassing(sch, cpu_of(rq))) + return false; + /* * @rq can dispatch from different DSQs, so we can't tell whether it * needs the tick or not by looking at nr_running. Allow stopping ticks @@ -3315,7 +4211,7 @@ int scx_tg_online(struct task_group *tg) .bw_quota_us = tg->scx.bw_quota_us, .bw_burst_us = tg->scx.bw_burst_us }; - ret = SCX_CALL_OP_RET(sch, SCX_KF_UNLOCKED, cgroup_init, + ret = SCX_CALL_OP_RET(sch, cgroup_init, NULL, tg->css.cgroup, &args); if (ret) ret = ops_sanitize_err(sch, "cgroup_init", ret); @@ -3337,8 +4233,7 @@ void scx_tg_offline(struct task_group *tg) if (scx_cgroup_enabled && SCX_HAS_OP(sch, cgroup_exit) && (tg->scx.flags & SCX_TG_INITED)) - SCX_CALL_OP(sch, SCX_KF_UNLOCKED, cgroup_exit, NULL, - tg->css.cgroup); + SCX_CALL_OP(sch, cgroup_exit, NULL, tg->css.cgroup); tg->scx.flags &= ~(SCX_TG_ONLINE | SCX_TG_INITED); } @@ -3367,8 +4262,7 @@ int scx_cgroup_can_attach(struct cgroup_taskset *tset) continue; if (SCX_HAS_OP(sch, cgroup_prep_move)) { - ret = SCX_CALL_OP_RET(sch, SCX_KF_UNLOCKED, - cgroup_prep_move, NULL, + ret = SCX_CALL_OP_RET(sch, cgroup_prep_move, NULL, p, from, css->cgroup); if (ret) goto err; @@ -3383,7 +4277,7 @@ err: cgroup_taskset_for_each(p, css, tset) { if (SCX_HAS_OP(sch, cgroup_cancel_move) && p->scx.cgrp_moving_from) - SCX_CALL_OP(sch, SCX_KF_UNLOCKED, cgroup_cancel_move, NULL, + SCX_CALL_OP(sch, cgroup_cancel_move, NULL, p, p->scx.cgrp_moving_from, css->cgroup); p->scx.cgrp_moving_from = NULL; } @@ -3404,7 +4298,7 @@ void scx_cgroup_move_task(struct task_struct *p) */ if (SCX_HAS_OP(sch, cgroup_move) && !WARN_ON_ONCE(!p->scx.cgrp_moving_from)) - SCX_CALL_OP_TASK(sch, SCX_KF_UNLOCKED, cgroup_move, NULL, + SCX_CALL_OP_TASK(sch, cgroup_move, task_rq(p), p, p->scx.cgrp_moving_from, tg_cgrp(task_group(p))); p->scx.cgrp_moving_from = NULL; @@ -3422,7 +4316,7 @@ void scx_cgroup_cancel_attach(struct cgroup_taskset *tset) cgroup_taskset_for_each(p, css, tset) { if (SCX_HAS_OP(sch, 
cgroup_cancel_move) && p->scx.cgrp_moving_from) - SCX_CALL_OP(sch, SCX_KF_UNLOCKED, cgroup_cancel_move, NULL, + SCX_CALL_OP(sch, cgroup_cancel_move, NULL, p, p->scx.cgrp_moving_from, css->cgroup); p->scx.cgrp_moving_from = NULL; } @@ -3436,8 +4330,7 @@ void scx_group_set_weight(struct task_group *tg, unsigned long weight) if (scx_cgroup_enabled && SCX_HAS_OP(sch, cgroup_set_weight) && tg->scx.weight != weight) - SCX_CALL_OP(sch, SCX_KF_UNLOCKED, cgroup_set_weight, NULL, - tg_cgrp(tg), weight); + SCX_CALL_OP(sch, cgroup_set_weight, NULL, tg_cgrp(tg), weight); tg->scx.weight = weight; @@ -3451,8 +4344,7 @@ void scx_group_set_idle(struct task_group *tg, bool idle) percpu_down_read(&scx_cgroup_ops_rwsem); if (scx_cgroup_enabled && SCX_HAS_OP(sch, cgroup_set_idle)) - SCX_CALL_OP(sch, SCX_KF_UNLOCKED, cgroup_set_idle, NULL, - tg_cgrp(tg), idle); + SCX_CALL_OP(sch, cgroup_set_idle, NULL, tg_cgrp(tg), idle); /* Update the task group's idle state */ tg->scx.idle = idle; @@ -3471,7 +4363,7 @@ void scx_group_set_bandwidth(struct task_group *tg, (tg->scx.bw_period_us != period_us || tg->scx.bw_quota_us != quota_us || tg->scx.bw_burst_us != burst_us)) - SCX_CALL_OP(sch, SCX_KF_UNLOCKED, cgroup_set_bandwidth, NULL, + SCX_CALL_OP(sch, cgroup_set_bandwidth, NULL, tg_cgrp(tg), period_us, quota_us, burst_us); tg->scx.bw_period_us = period_us; @@ -3480,33 +4372,55 @@ void scx_group_set_bandwidth(struct task_group *tg, percpu_up_read(&scx_cgroup_ops_rwsem); } +#endif /* CONFIG_EXT_GROUP_SCHED */ + +#if defined(CONFIG_EXT_GROUP_SCHED) || defined(CONFIG_EXT_SUB_SCHED) +static struct cgroup *root_cgroup(void) +{ + return &cgrp_dfl_root.cgrp; +} + +static struct cgroup *sch_cgroup(struct scx_sched *sch) +{ + return sch->cgrp; +} + +/* for each descendant of @cgrp including self, set ->scx_sched to @sch */ +static void set_cgroup_sched(struct cgroup *cgrp, struct scx_sched *sch) +{ + struct cgroup *pos; + struct cgroup_subsys_state *css; + + cgroup_for_each_live_descendant_pre(pos, css, 
cgrp) + rcu_assign_pointer(pos->scx_sched, sch); +} static void scx_cgroup_lock(void) { +#ifdef CONFIG_EXT_GROUP_SCHED percpu_down_write(&scx_cgroup_ops_rwsem); +#endif cgroup_lock(); } static void scx_cgroup_unlock(void) { cgroup_unlock(); +#ifdef CONFIG_EXT_GROUP_SCHED percpu_up_write(&scx_cgroup_ops_rwsem); +#endif } - -#else /* CONFIG_EXT_GROUP_SCHED */ - +#else /* CONFIG_EXT_GROUP_SCHED || CONFIG_EXT_SUB_SCHED */ +static struct cgroup *root_cgroup(void) { return NULL; } +static struct cgroup *sch_cgroup(struct scx_sched *sch) { return NULL; } +static void set_cgroup_sched(struct cgroup *cgrp, struct scx_sched *sch) {} static void scx_cgroup_lock(void) {} static void scx_cgroup_unlock(void) {} - -#endif /* CONFIG_EXT_GROUP_SCHED */ +#endif /* CONFIG_EXT_GROUP_SCHED || CONFIG_EXT_SUB_SCHED */ /* * Omitted operations: * - * - wakeup_preempt: NOOP as it isn't useful in the wakeup path because the task - * isn't tied to the CPU at that point. Preemption is implemented by resetting - * the victim task's slice to 0 and triggering reschedule on the target CPU. - * * - migrate_task_rq: Unnecessary as task to cpu mapping is transient. 
* * - task_fork/dead: We need fork/dead notifications for all tasks regardless of @@ -3547,13 +4461,60 @@ DEFINE_SCHED_CLASS(ext) = { #endif }; -static void init_dsq(struct scx_dispatch_q *dsq, u64 dsq_id) +static s32 init_dsq(struct scx_dispatch_q *dsq, u64 dsq_id, + struct scx_sched *sch) { + s32 cpu; + memset(dsq, 0, sizeof(*dsq)); raw_spin_lock_init(&dsq->lock); INIT_LIST_HEAD(&dsq->list); dsq->id = dsq_id; + dsq->sched = sch; + + dsq->pcpu = alloc_percpu(struct scx_dsq_pcpu); + if (!dsq->pcpu) + return -ENOMEM; + + for_each_possible_cpu(cpu) { + struct scx_dsq_pcpu *pcpu = per_cpu_ptr(dsq->pcpu, cpu); + + pcpu->dsq = dsq; + INIT_LIST_HEAD(&pcpu->deferred_reenq_user.node); + } + + return 0; +} + +static void exit_dsq(struct scx_dispatch_q *dsq) +{ + s32 cpu; + + for_each_possible_cpu(cpu) { + struct scx_dsq_pcpu *pcpu = per_cpu_ptr(dsq->pcpu, cpu); + struct scx_deferred_reenq_user *dru = &pcpu->deferred_reenq_user; + struct rq *rq = cpu_rq(cpu); + + /* + * There must have been an RCU grace period since the last + * insertion and @dsq should be off the deferred list by now.
+ */ + if (WARN_ON_ONCE(!list_empty(&dru->node))) { + guard(raw_spinlock_irqsave)(&rq->scx.deferred_reenq_lock); + list_del_init(&dru->node); + } + } + + free_percpu(dsq->pcpu); +} + +static void free_dsq_rcufn(struct rcu_head *rcu) +{ + struct scx_dispatch_q *dsq = container_of(rcu, struct scx_dispatch_q, rcu); + + exit_dsq(dsq); + kfree(dsq); } static void free_dsq_irq_workfn(struct irq_work *irq_work) @@ -3562,7 +4523,7 @@ static void free_dsq_irq_workfn(struct irq_work *irq_work) struct scx_dispatch_q *dsq, *tmp_dsq; llist_for_each_entry_safe(dsq, tmp_dsq, to_free, free_node) - kfree_rcu(dsq, rcu); + call_rcu(&dsq->rcu, free_dsq_rcufn); } static DEFINE_IRQ_WORK(free_dsq_irq_work, free_dsq_irq_workfn); @@ -3627,8 +4588,7 @@ static void scx_cgroup_exit(struct scx_sched *sch) if (!sch->ops.cgroup_exit) continue; - SCX_CALL_OP(sch, SCX_KF_UNLOCKED, cgroup_exit, NULL, - css->cgroup); + SCX_CALL_OP(sch, cgroup_exit, NULL, css->cgroup); } } @@ -3659,7 +4619,7 @@ static int scx_cgroup_init(struct scx_sched *sch) continue; } - ret = SCX_CALL_OP_RET(sch, SCX_KF_UNLOCKED, cgroup_init, NULL, + ret = SCX_CALL_OP_RET(sch, cgroup_init, NULL, css->cgroup, &args); if (ret) { scx_error(sch, "ops.cgroup_init() failed (%d)", ret); @@ -3738,6 +4698,7 @@ static const struct attribute_group scx_global_attr_group = { .attrs = scx_global_attrs, }; +static void free_pnode(struct scx_sched_pnode *pnode); static void free_exit_info(struct scx_exit_info *ei); static void scx_sched_free_rcu_work(struct work_struct *work) @@ -3746,22 +4707,42 @@ static void scx_sched_free_rcu_work(struct work_struct *work) struct scx_sched *sch = container_of(rcu_work, struct scx_sched, rcu_work); struct rhashtable_iter rht_iter; struct scx_dispatch_q *dsq; - int node; + int cpu, node; - irq_work_sync(&sch->error_irq_work); + irq_work_sync(&sch->disable_irq_work); kthread_destroy_worker(sch->helper); + timer_shutdown_sync(&sch->bypass_lb_timer); + +#ifdef CONFIG_EXT_SUB_SCHED + kfree(sch->cgrp_path); + if 
(sch_cgroup(sch)) + cgroup_put(sch_cgroup(sch)); +#endif /* CONFIG_EXT_SUB_SCHED */ + + for_each_possible_cpu(cpu) { + struct scx_sched_pcpu *pcpu = per_cpu_ptr(sch->pcpu, cpu); + + /* + * $sch would have entered bypass mode before the RCU grace + * period. As that blocks new deferrals, all + * deferred_reenq_local_node's must be off-list by now. + */ + WARN_ON_ONCE(!list_empty(&pcpu->deferred_reenq_local.node)); + + exit_dsq(bypass_dsq(sch, cpu)); + } free_percpu(sch->pcpu); for_each_node_state(node, N_POSSIBLE) - kfree(sch->global_dsqs[node]); - kfree(sch->global_dsqs); + free_pnode(sch->pnode[node]); + kfree(sch->pnode); rhashtable_walk_enter(&sch->dsq_hash, &rht_iter); do { rhashtable_walk_start(&rht_iter); - while ((dsq = rhashtable_walk_next(&rht_iter)) && !IS_ERR(dsq)) + while (!IS_ERR_OR_NULL((dsq = rhashtable_walk_next(&rht_iter)))) destroy_dsq(sch, dsq->id); rhashtable_walk_stop(&rht_iter); @@ -3778,7 +4759,7 @@ static void scx_kobj_release(struct kobject *kobj) struct scx_sched *sch = container_of(kobj, struct scx_sched, kobj); INIT_RCU_WORK(&sch->rcu_work, scx_sched_free_rcu_work); - queue_rcu_work(system_unbound_wq, &sch->rcu_work); + queue_rcu_work(system_dfl_wq, &sch->rcu_work); } static ssize_t scx_attr_ops_show(struct kobject *kobj, @@ -3807,10 +4788,14 @@ static ssize_t scx_attr_events_show(struct kobject *kobj, at += scx_attr_event_show(buf, at, &events, SCX_EV_DISPATCH_KEEP_LAST); at += scx_attr_event_show(buf, at, &events, SCX_EV_ENQ_SKIP_EXITING); at += scx_attr_event_show(buf, at, &events, SCX_EV_ENQ_SKIP_MIGRATION_DISABLED); + at += scx_attr_event_show(buf, at, &events, SCX_EV_REENQ_IMMED); + at += scx_attr_event_show(buf, at, &events, SCX_EV_REENQ_LOCAL_REPEAT); at += scx_attr_event_show(buf, at, &events, SCX_EV_REFILL_SLICE_DFL); at += scx_attr_event_show(buf, at, &events, SCX_EV_BYPASS_DURATION); at += scx_attr_event_show(buf, at, &events, SCX_EV_BYPASS_DISPATCH); at += scx_attr_event_show(buf, at, &events, SCX_EV_BYPASS_ACTIVATE); + at 
+= scx_attr_event_show(buf, at, &events, SCX_EV_INSERT_NOT_OWNED); + at += scx_attr_event_show(buf, at, &events, SCX_EV_SUB_BYPASS_DISPATCH); return at; } SCX_ATTR(events); @@ -3830,7 +4815,17 @@ static const struct kobj_type scx_ktype = { static int scx_uevent(const struct kobject *kobj, struct kobj_uevent_env *env) { - const struct scx_sched *sch = container_of(kobj, struct scx_sched, kobj); + const struct scx_sched *sch; + + /* + * scx_uevent() can be reached by both scx_sched kobjects (scx_ktype) + * and sub-scheduler kset kobjects (kset_ktype) through the parent + * chain walk. Filter out the latter to avoid invalid casts. + */ + if (kobj->ktype != &scx_ktype) + return 0; + + sch = container_of(kobj, struct scx_sched, kobj); return add_uevent_var(env, "SCXOPS=%s", sch->ops.name); } @@ -3859,7 +4854,7 @@ bool scx_allow_ttwu_queue(const struct task_struct *p) if (!scx_enabled()) return true; - sch = rcu_dereference_sched(scx_root); + sch = scx_task_sched(p); if (unlikely(!sch)) return true; @@ -3952,7 +4947,7 @@ void scx_softlockup(u32 dur_s) * a good state before taking more drastic actions. * * Returns %true if sched_ext is enabled and abort was initiated, which may - * resolve the reported hardlockdup. %false if sched_ext is not enabled or + * resolve the reported hardlockup. %false if sched_ext is not enabled or * someone else already initiated abort. 
*/ bool scx_hardlockup(int cpu) @@ -3965,13 +4960,14 @@ bool scx_hardlockup(int cpu) return true; } -static u32 bypass_lb_cpu(struct scx_sched *sch, struct rq *rq, +static u32 bypass_lb_cpu(struct scx_sched *sch, s32 donor, struct cpumask *donee_mask, struct cpumask *resched_mask, u32 nr_donor_target, u32 nr_donee_target) { - struct scx_dispatch_q *donor_dsq = &rq->scx.bypass_dsq; + struct rq *donor_rq = cpu_rq(donor); + struct scx_dispatch_q *donor_dsq = bypass_dsq(sch, donor); struct task_struct *p, *n; - struct scx_dsq_list_node cursor = INIT_DSQ_LIST_CURSOR(cursor, 0, 0); + struct scx_dsq_list_node cursor = INIT_DSQ_LIST_CURSOR(cursor, donor_dsq, 0); s32 delta = READ_ONCE(donor_dsq->nr) - nr_donor_target; u32 nr_balanced = 0, min_delta_us; @@ -3985,7 +4981,7 @@ static u32 bypass_lb_cpu(struct scx_sched *sch, struct rq *rq, if (delta < DIV_ROUND_UP(min_delta_us, READ_ONCE(scx_slice_bypass_us))) return 0; - raw_spin_rq_lock_irq(rq); + raw_spin_rq_lock_irq(donor_rq); raw_spin_lock(&donor_dsq->lock); list_add(&cursor.node, &donor_dsq->list); resume: @@ -3993,7 +4989,6 @@ resume: n = nldsq_next_task(donor_dsq, n, false); while ((p = n)) { - struct rq *donee_rq; struct scx_dispatch_q *donee_dsq; int donee; @@ -4009,14 +5004,13 @@ resume: if (donee >= nr_cpu_ids) continue; - donee_rq = cpu_rq(donee); - donee_dsq = &donee_rq->scx.bypass_dsq; + donee_dsq = bypass_dsq(sch, donee); /* * $p's rq is not locked but $p's DSQ lock protects its * scheduling properties making this test safe. */ - if (!task_can_run_on_remote_rq(sch, p, donee_rq, false)) + if (!task_can_run_on_remote_rq(sch, p, cpu_rq(donee), false)) continue; /* @@ -4031,7 +5025,7 @@ resume: * between bypass DSQs. */ dispatch_dequeue_locked(p, donor_dsq); - dispatch_enqueue(sch, donee_dsq, p, SCX_ENQ_NESTED); + dispatch_enqueue(sch, cpu_rq(donee), donee_dsq, p, SCX_ENQ_NESTED); /* * $donee might have been idle and need to be woken up. 
No need @@ -4046,9 +5040,9 @@ resume: if (!(nr_balanced % SCX_BYPASS_LB_BATCH) && n) { list_move_tail(&cursor.node, &n->scx.dsq_list.node); raw_spin_unlock(&donor_dsq->lock); - raw_spin_rq_unlock_irq(rq); + raw_spin_rq_unlock_irq(donor_rq); cpu_relax(); - raw_spin_rq_lock_irq(rq); + raw_spin_rq_lock_irq(donor_rq); raw_spin_lock(&donor_dsq->lock); goto resume; } @@ -4056,7 +5050,7 @@ resume: list_del_init(&cursor.node); raw_spin_unlock(&donor_dsq->lock); - raw_spin_rq_unlock_irq(rq); + raw_spin_rq_unlock_irq(donor_rq); return nr_balanced; } @@ -4074,7 +5068,7 @@ static void bypass_lb_node(struct scx_sched *sch, int node) /* count the target tasks and CPUs */ for_each_cpu_and(cpu, cpu_online_mask, node_mask) { - u32 nr = READ_ONCE(cpu_rq(cpu)->scx.bypass_dsq.nr); + u32 nr = READ_ONCE(bypass_dsq(sch, cpu)->nr); nr_tasks += nr; nr_cpus++; @@ -4096,24 +5090,21 @@ static void bypass_lb_node(struct scx_sched *sch, int node) cpumask_clear(donee_mask); for_each_cpu_and(cpu, cpu_online_mask, node_mask) { - if (READ_ONCE(cpu_rq(cpu)->scx.bypass_dsq.nr) < nr_target) + if (READ_ONCE(bypass_dsq(sch, cpu)->nr) < nr_target) cpumask_set_cpu(cpu, donee_mask); } /* iterate !donee CPUs and see if they should be offloaded */ cpumask_clear(resched_mask); for_each_cpu_and(cpu, cpu_online_mask, node_mask) { - struct rq *rq = cpu_rq(cpu); - struct scx_dispatch_q *donor_dsq = &rq->scx.bypass_dsq; - if (cpumask_empty(donee_mask)) break; if (cpumask_test_cpu(cpu, donee_mask)) continue; - if (READ_ONCE(donor_dsq->nr) <= nr_donor_target) + if (READ_ONCE(bypass_dsq(sch, cpu)->nr) <= nr_donor_target) continue; - nr_balanced += bypass_lb_cpu(sch, rq, donee_mask, resched_mask, + nr_balanced += bypass_lb_cpu(sch, cpu, donee_mask, resched_mask, nr_donor_target, nr_target); } @@ -4121,7 +5112,7 @@ static void bypass_lb_node(struct scx_sched *sch, int node) resched_cpu(cpu); for_each_cpu_and(cpu, cpu_online_mask, node_mask) { - u32 nr = READ_ONCE(cpu_rq(cpu)->scx.bypass_dsq.nr); + u32 nr = 
READ_ONCE(bypass_dsq(sch, cpu)->nr); after_min = min(nr, after_min); after_max = max(nr, after_max); @@ -4143,12 +5134,11 @@ static void bypass_lb_node(struct scx_sched *sch, int node) */ static void scx_bypass_lb_timerfn(struct timer_list *timer) { - struct scx_sched *sch; + struct scx_sched *sch = container_of(timer, struct scx_sched, bypass_lb_timer); int node; u32 intv_us; - sch = rcu_dereference_all(scx_root); - if (unlikely(!sch) || !READ_ONCE(scx_bypass_depth)) + if (!bypass_dsp_enabled(sch)) return; for_each_node_with_cpus(node) @@ -4159,10 +5149,102 @@ static void scx_bypass_lb_timerfn(struct timer_list *timer) mod_timer(timer, jiffies + usecs_to_jiffies(intv_us)); } -static DEFINE_TIMER(scx_bypass_lb_timer, scx_bypass_lb_timerfn); +static bool inc_bypass_depth(struct scx_sched *sch) +{ + lockdep_assert_held(&scx_bypass_lock); + + WARN_ON_ONCE(sch->bypass_depth < 0); + WRITE_ONCE(sch->bypass_depth, sch->bypass_depth + 1); + if (sch->bypass_depth != 1) + return false; + + WRITE_ONCE(sch->slice_dfl, READ_ONCE(scx_slice_bypass_us) * NSEC_PER_USEC); + sch->bypass_timestamp = ktime_get_ns(); + scx_add_event(sch, SCX_EV_BYPASS_ACTIVATE, 1); + return true; +} + +static bool dec_bypass_depth(struct scx_sched *sch) +{ + lockdep_assert_held(&scx_bypass_lock); + + WARN_ON_ONCE(sch->bypass_depth < 1); + WRITE_ONCE(sch->bypass_depth, sch->bypass_depth - 1); + if (sch->bypass_depth != 0) + return false; + + WRITE_ONCE(sch->slice_dfl, SCX_SLICE_DFL); + scx_add_event(sch, SCX_EV_BYPASS_DURATION, + ktime_get_ns() - sch->bypass_timestamp); + return true; +} + +static void enable_bypass_dsp(struct scx_sched *sch) +{ + struct scx_sched *host = scx_parent(sch) ?: sch; + u32 intv_us = READ_ONCE(scx_bypass_lb_intv_us); + s32 ret; + + /* + * @sch->bypass_depth transitioning from 0 to 1 triggers enabling. + * Shouldn't stagger. 
+ */ + if (WARN_ON_ONCE(test_and_set_bit(0, &sch->bypass_dsp_claim))) + return; + + /* + * When a sub-sched bypasses, its tasks are queued on the bypass DSQs of + * the nearest non-bypassing ancestor or root. As enable_bypass_dsp() is + * called iff @sch is not already bypassed due to an ancestor bypassing, + * we can assume that the parent is not bypassing and thus will be the + * host of the bypass DSQs. + * + * While the situation may change in the future, the following + * guarantees that the nearest non-bypassing ancestor or root has bypass + * dispatch enabled while a descendant is bypassing, which is all that's + * required. + * + * bypass_dsp_enabled() test is used to determine whether to enter the + * bypass dispatch handling path from both bypassing and hosting scheds. + * Bump enable depth on both @sch and bypass dispatch host. + */ + ret = atomic_inc_return(&sch->bypass_dsp_enable_depth); + WARN_ON_ONCE(ret <= 0); + + if (host != sch) { + ret = atomic_inc_return(&host->bypass_dsp_enable_depth); + WARN_ON_ONCE(ret <= 0); + } + + /* + * The LB timer will stop running if bypass dispatch is disabled. Start + * after enabling bypass dispatch. 
+ */ + if (intv_us && !timer_pending(&host->bypass_lb_timer)) + mod_timer(&host->bypass_lb_timer, + jiffies + usecs_to_jiffies(intv_us)); +} + +/* may be called without holding scx_bypass_lock */ +static void disable_bypass_dsp(struct scx_sched *sch) +{ + s32 ret; + + if (!test_and_clear_bit(0, &sch->bypass_dsp_claim)) + return; + + ret = atomic_dec_return(&sch->bypass_dsp_enable_depth); + WARN_ON_ONCE(ret < 0); + + if (scx_parent(sch)) { + ret = atomic_dec_return(&scx_parent(sch)->bypass_dsp_enable_depth); + WARN_ON_ONCE(ret < 0); + } +} /** * scx_bypass - [Un]bypass scx_ops and guarantee forward progress + * @sch: sched to bypass * @bypass: true for bypass, false for unbypass * * Bypassing guarantees that all runnable tasks make forward progress without @@ -4192,49 +5274,42 @@ static DEFINE_TIMER(scx_bypass_lb_timer, scx_bypass_lb_timerfn); * * - scx_prio_less() reverts to the default core_sched_at order. */ -static void scx_bypass(bool bypass) +static void scx_bypass(struct scx_sched *sch, bool bypass) { - static DEFINE_RAW_SPINLOCK(bypass_lock); - static unsigned long bypass_timestamp; - struct scx_sched *sch; + struct scx_sched *pos; unsigned long flags; int cpu; - raw_spin_lock_irqsave(&bypass_lock, flags); - sch = rcu_dereference_bh(scx_root); + raw_spin_lock_irqsave(&scx_bypass_lock, flags); if (bypass) { - u32 intv_us; - - WRITE_ONCE(scx_bypass_depth, scx_bypass_depth + 1); - WARN_ON_ONCE(scx_bypass_depth <= 0); - if (scx_bypass_depth != 1) + if (!inc_bypass_depth(sch)) goto unlock; - WRITE_ONCE(scx_slice_dfl, READ_ONCE(scx_slice_bypass_us) * NSEC_PER_USEC); - bypass_timestamp = ktime_get_ns(); - if (sch) - scx_add_event(sch, SCX_EV_BYPASS_ACTIVATE, 1); - intv_us = READ_ONCE(scx_bypass_lb_intv_us); - if (intv_us && !timer_pending(&scx_bypass_lb_timer)) { - scx_bypass_lb_timer.expires = - jiffies + usecs_to_jiffies(intv_us); - add_timer_global(&scx_bypass_lb_timer); - } + enable_bypass_dsp(sch); } else { - WRITE_ONCE(scx_bypass_depth, scx_bypass_depth - 1); 
- WARN_ON_ONCE(scx_bypass_depth < 0); - if (scx_bypass_depth != 0) + if (!dec_bypass_depth(sch)) goto unlock; - WRITE_ONCE(scx_slice_dfl, SCX_SLICE_DFL); - if (sch) - scx_add_event(sch, SCX_EV_BYPASS_DURATION, - ktime_get_ns() - bypass_timestamp); } + /* + * Bypass state is propagated to all descendants - an scx_sched bypasses + * if itself or any of its ancestors are in bypass mode. + */ + raw_spin_lock(&scx_sched_lock); + scx_for_each_descendant_pre(pos, sch) { + if (pos == sch) + continue; + if (bypass) + inc_bypass_depth(pos); + else + dec_bypass_depth(pos); + } + raw_spin_unlock(&scx_sched_lock); + /* * No task property is changing. We just need to make sure all currently - * queued tasks are re-queued according to the new scx_rq_bypassing() + * queued tasks are re-queued according to the new scx_bypassing() * state. As an optimization, walk each rq's runnable_list instead of * the scx_tasks list. * @@ -4246,19 +5321,23 @@ static void scx_bypass(bool bypass) struct task_struct *p, *n; raw_spin_rq_lock(rq); + raw_spin_lock(&scx_sched_lock); - if (bypass) { - WARN_ON_ONCE(rq->scx.flags & SCX_RQ_BYPASSING); - rq->scx.flags |= SCX_RQ_BYPASSING; - } else { - WARN_ON_ONCE(!(rq->scx.flags & SCX_RQ_BYPASSING)); - rq->scx.flags &= ~SCX_RQ_BYPASSING; + scx_for_each_descendant_pre(pos, sch) { + struct scx_sched_pcpu *pcpu = per_cpu_ptr(pos->pcpu, cpu); + + if (pos->bypass_depth) + pcpu->flags |= SCX_SCHED_PCPU_BYPASSING; + else + pcpu->flags &= ~SCX_SCHED_PCPU_BYPASSING; } + raw_spin_unlock(&scx_sched_lock); + /* * We need to guarantee that no tasks are on the BPF scheduler * while bypassing. Either we see enabled or the enable path - * sees scx_rq_bypassing() before moving tasks to SCX. + * sees scx_bypassing() before moving tasks to SCX. 
*/ if (!scx_enabled()) { raw_spin_rq_unlock(rq); @@ -4274,6 +5353,9 @@ static void scx_bypass(bool bypass) */ list_for_each_entry_safe_reverse(p, n, &rq->scx.runnable_list, scx.runnable_node) { + if (!scx_is_descendant(scx_task_sched(p), sch)) + continue; + /* cycling deq/enq is enough, see the function comment */ scoped_guard (sched_change, p, DEQUEUE_SAVE | DEQUEUE_MOVE) { /* nothing */ ; @@ -4287,8 +5369,11 @@ static void scx_bypass(bool bypass) raw_spin_rq_unlock(rq); } + /* disarming must come after moving all tasks out of the bypass DSQs */ + if (!bypass) + disable_bypass_dsp(sch); unlock: - raw_spin_unlock_irqrestore(&bypass_lock, flags); + raw_spin_unlock_irqrestore(&scx_bypass_lock, flags); } static void free_exit_info(struct scx_exit_info *ei) @@ -4330,6 +5415,8 @@ static const char *scx_exit_reason(enum scx_exit_kind kind) return "unregistered from the main kernel"; case SCX_EXIT_SYSRQ: return "disabled by sysrq-S"; + case SCX_EXIT_PARENT: + return "parent exiting"; case SCX_EXIT_ERROR: return "runtime error"; case SCX_EXIT_ERROR_BPF: @@ -4355,28 +5442,279 @@ static void free_kick_syncs(void) } } -static void scx_disable_workfn(struct kthread_work *work) +static void refresh_watchdog(void) +{ + struct scx_sched *sch; + unsigned long intv = ULONG_MAX; + + /* take the shortest timeout and use its half for watchdog interval */ + rcu_read_lock(); + list_for_each_entry_rcu(sch, &scx_sched_all, all) + intv = max(min(intv, sch->watchdog_timeout / 2), 1); + rcu_read_unlock(); + + WRITE_ONCE(scx_watchdog_timestamp, jiffies); + WRITE_ONCE(scx_watchdog_interval, intv); + + if (intv < ULONG_MAX) + mod_delayed_work(system_dfl_wq, &scx_watchdog_work, intv); + else + cancel_delayed_work_sync(&scx_watchdog_work); +} + +static s32 scx_link_sched(struct scx_sched *sch) +{ + scoped_guard(raw_spinlock_irq, &scx_sched_lock) { +#ifdef CONFIG_EXT_SUB_SCHED + struct scx_sched *parent = scx_parent(sch); + s32 ret; + + if (parent) { + /* + * scx_claim_exit() propagates exit_kind 
transition to + * its sub-scheds while holding scx_sched_lock - either + * we can see the parent's non-NONE exit_kind or the + * parent can shoot us down. + */ + if (atomic_read(&parent->exit_kind) != SCX_EXIT_NONE) { + scx_error(sch, "parent disabled"); + return -ENOENT; + } + + ret = rhashtable_lookup_insert_fast(&scx_sched_hash, + &sch->hash_node, scx_sched_hash_params); + if (ret) { + scx_error(sch, "failed to insert into scx_sched_hash (%d)", ret); + return ret; + } + + list_add_tail(&sch->sibling, &parent->children); + } +#endif /* CONFIG_EXT_SUB_SCHED */ + + list_add_tail_rcu(&sch->all, &scx_sched_all); + } + + refresh_watchdog(); + return 0; +} + +static void scx_unlink_sched(struct scx_sched *sch) +{ + scoped_guard(raw_spinlock_irq, &scx_sched_lock) { +#ifdef CONFIG_EXT_SUB_SCHED + if (scx_parent(sch)) { + rhashtable_remove_fast(&scx_sched_hash, &sch->hash_node, + scx_sched_hash_params); + list_del_init(&sch->sibling); + } +#endif /* CONFIG_EXT_SUB_SCHED */ + list_del_rcu(&sch->all); + } + + refresh_watchdog(); +} + +/* + * Called to disable future dumps and wait for in-progress one while disabling + * @sch. Once @sch becomes empty during disable, there's no point in dumping it. + * This prevents calling dump ops on a dead sch. + */ +static void scx_disable_dump(struct scx_sched *sch) +{ + guard(raw_spinlock_irqsave)(&scx_dump_lock); + sch->dump_disabled = true; +} + +#ifdef CONFIG_EXT_SUB_SCHED +static DECLARE_WAIT_QUEUE_HEAD(scx_unlink_waitq); + +static void drain_descendants(struct scx_sched *sch) +{ + /* + * Child scheds that finished the critical part of disabling will take + * themselves off @sch->children. Wait for it to drain. As propagation + * is recursive, empty @sch->children means that all proper descendant + * scheds reached unlinking stage. 
+ */ + wait_event(scx_unlink_waitq, list_empty(&sch->children)); +} + +static void scx_fail_parent(struct scx_sched *sch, + struct task_struct *failed, s32 fail_code) +{ + struct scx_sched *parent = scx_parent(sch); + struct scx_task_iter sti; + struct task_struct *p; + + scx_error(parent, "ops.init_task() failed (%d) for %s[%d] while disabling a sub-scheduler", + fail_code, failed->comm, failed->pid); + + /* + * Once $parent is bypassed, it's safe to put SCX_TASK_NONE tasks into + * it. This may cause downstream failures on the BPF side but $parent is + * dying anyway. + */ + scx_bypass(parent, true); + + scx_task_iter_start(&sti, sch->cgrp); + while ((p = scx_task_iter_next_locked(&sti))) { + if (scx_task_on_sched(parent, p)) + continue; + + scoped_guard (sched_change, p, DEQUEUE_SAVE | DEQUEUE_MOVE) { + scx_disable_and_exit_task(sch, p); + rcu_assign_pointer(p->scx.sched, parent); + } + } + scx_task_iter_stop(&sti); +} + +static void scx_sub_disable(struct scx_sched *sch) +{ + struct scx_sched *parent = scx_parent(sch); + struct scx_task_iter sti; + struct task_struct *p; + int ret; + + /* + * Guarantee forward progress and wait for descendants to be disabled. + * To limit disruptions, $parent is not bypassed. Tasks are fully + * prepped and then inserted back into $parent. + */ + scx_bypass(sch, true); + drain_descendants(sch); + + /* + * Here, every runnable task is guaranteed to make forward progress and + * we can safely use blocking synchronization constructs. Actually + * disable ops. + */ + mutex_lock(&scx_enable_mutex); + percpu_down_write(&scx_fork_rwsem); + scx_cgroup_lock(); + + set_cgroup_sched(sch_cgroup(sch), parent); + + scx_task_iter_start(&sti, sch->cgrp); + while ((p = scx_task_iter_next_locked(&sti))) { + struct rq *rq; + struct rq_flags rf; + + /* filter out duplicate visits */ + if (scx_task_on_sched(parent, p)) + continue; + + /* + * By the time control reaches here, all descendant schedulers + * should already have been disabled. 
+ */ + WARN_ON_ONCE(!scx_task_on_sched(sch, p)); + + /* + * If $p is about to be freed, nothing prevents $sch from + * unloading before $p reaches sched_ext_free(). Disable and + * exit $p right away. + */ + if (!tryget_task_struct(p)) { + scx_disable_and_exit_task(sch, p); + continue; + } + + scx_task_iter_unlock(&sti); + + /* + * $p is READY or ENABLED on @sch. Initialize for $parent, + * disable and exit from @sch, and then switch over to $parent. + * + * If a task fails to initialize for $parent, the only available + * action is disabling $parent too. While this allows disabling + * of a child sched to cause the parent scheduler to fail, the + * failure can only originate from ops.init_task() of the + * parent. A child can't directly affect the parent through its + * own failures. + */ + ret = __scx_init_task(parent, p, false); + if (ret) { + scx_fail_parent(sch, p, ret); + put_task_struct(p); + break; + } + + rq = task_rq_lock(p, &rf); + scoped_guard (sched_change, p, DEQUEUE_SAVE | DEQUEUE_MOVE) { + /* + * $p is initialized for $parent and still attached to + * @sch. Disable and exit for @sch, switch over to + * $parent, override the state to READY to account for + * $p having already been initialized, and then enable. + */ + scx_disable_and_exit_task(sch, p); + scx_set_task_state(p, SCX_TASK_INIT); + rcu_assign_pointer(p->scx.sched, parent); + scx_set_task_state(p, SCX_TASK_READY); + scx_enable_task(parent, p); + } + task_rq_unlock(rq, p, &rf); + + put_task_struct(p); + } + scx_task_iter_stop(&sti); + + scx_disable_dump(sch); + + scx_cgroup_unlock(); + percpu_up_write(&scx_fork_rwsem); + + /* + * All tasks are moved off of @sch but there may still be on-going + * operations (e.g. ops.select_cpu()). Drain them by flushing RCU. Use + * the expedited version as ancestors may be waiting in bypass mode. + * Also, tell the parent that there is no need to keep running bypass + * DSQs for us. 
+ */ + synchronize_rcu_expedited(); + disable_bypass_dsp(sch); + + scx_unlink_sched(sch); + + mutex_unlock(&scx_enable_mutex); + + /* + * @sch is now unlinked from the parent's children list. Notify and call + * ops.sub_detach/exit(). Note that ops.sub_detach/exit() must be called + * after unlinking and releasing all locks. See scx_claim_exit(). + */ + wake_up_all(&scx_unlink_waitq); + + if (parent->ops.sub_detach && sch->sub_attached) { + struct scx_sub_detach_args sub_detach_args = { + .ops = &sch->ops, + .cgroup_path = sch->cgrp_path, + }; + SCX_CALL_OP(parent, sub_detach, NULL, + &sub_detach_args); + } + + if (sch->ops.exit) + SCX_CALL_OP(sch, exit, NULL, sch->exit_info); + kobject_del(&sch->kobj); +} +#else /* CONFIG_EXT_SUB_SCHED */ +static void drain_descendants(struct scx_sched *sch) { } +static void scx_sub_disable(struct scx_sched *sch) { } +#endif /* CONFIG_EXT_SUB_SCHED */ + +static void scx_root_disable(struct scx_sched *sch) { - struct scx_sched *sch = container_of(work, struct scx_sched, disable_work); struct scx_exit_info *ei = sch->exit_info; struct scx_task_iter sti; struct task_struct *p; - int kind, cpu; + int cpu; - kind = atomic_read(&sch->exit_kind); - while (true) { - if (kind == SCX_EXIT_DONE) /* already disabled? */ - return; - WARN_ON_ONCE(kind == SCX_EXIT_NONE); - if (atomic_try_cmpxchg(&sch->exit_kind, &kind, SCX_EXIT_DONE)) - break; - } - ei->kind = kind; - ei->reason = scx_exit_reason(ei->kind); - - /* guarantee forward progress by bypassing scx_ops */ - scx_bypass(true); - WRITE_ONCE(scx_aborting, false); + /* guarantee forward progress and wait for descendants to be disabled */ + scx_bypass(sch, true); + drain_descendants(sch); switch (scx_set_enable_state(SCX_DISABLING)) { case SCX_DISABLING: @@ -4403,7 +5741,7 @@ static void scx_disable_workfn(struct kthread_work *work) /* * Shut down cgroup support before tasks so that the cgroup attach path - * doesn't race against scx_exit_task(). 
+ * doesn't race against scx_disable_and_exit_task(). */ scx_cgroup_lock(); scx_cgroup_exit(sch); @@ -4417,7 +5755,7 @@ static void scx_disable_workfn(struct kthread_work *work) scx_init_task_enabled = false; - scx_task_iter_start(&sti); + scx_task_iter_start(&sti, NULL); while ((p = scx_task_iter_next_locked(&sti))) { unsigned int queue_flags = DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK; const struct sched_class *old_class = p->sched_class; @@ -4432,9 +5770,16 @@ static void scx_disable_workfn(struct kthread_work *work) p->sched_class = new_class; } - scx_exit_task(p); + scx_disable_and_exit_task(scx_task_sched(p), p); } scx_task_iter_stop(&sti); + + scx_disable_dump(sch); + + scx_cgroup_lock(); + set_cgroup_sched(sch_cgroup(sch), NULL); + scx_cgroup_unlock(); + percpu_up_write(&scx_fork_rwsem); /* @@ -4467,9 +5812,9 @@ static void scx_disable_workfn(struct kthread_work *work) } if (sch->ops.exit) - SCX_CALL_OP(sch, SCX_KF_UNLOCKED, exit, NULL, ei); + SCX_CALL_OP(sch, exit, NULL, ei); - cancel_delayed_work_sync(&scx_watchdog_work); + scx_unlink_sched(sch); /* * scx_root clearing must be inside cpus_read_lock(). 
See @@ -4486,21 +5831,13 @@ static void scx_disable_workfn(struct kthread_work *work) */ kobject_del(&sch->kobj); - free_percpu(scx_dsp_ctx); - scx_dsp_ctx = NULL; - scx_dsp_max_batch = 0; free_kick_syncs(); - if (scx_bypassed_for_enable) { - scx_bypassed_for_enable = false; - scx_bypass(false); - } - mutex_unlock(&scx_enable_mutex); WARN_ON_ONCE(scx_set_enable_state(SCX_DISABLED) != SCX_DISABLING); done: - scx_bypass(false); + scx_bypass(sch, false); } /* @@ -4516,6 +5853,9 @@ static bool scx_claim_exit(struct scx_sched *sch, enum scx_exit_kind kind) lockdep_assert_preemption_disabled(); + if (WARN_ON_ONCE(kind == SCX_EXIT_NONE || kind == SCX_EXIT_DONE)) + kind = SCX_EXIT_ERROR; + if (!atomic_try_cmpxchg(&sch->exit_kind, &none, kind)) return false; @@ -4524,25 +5864,61 @@ static bool scx_claim_exit(struct scx_sched *sch, enum scx_exit_kind kind) * flag to break potential live-lock scenarios, ensuring we can * successfully reach scx_bypass(). */ - WRITE_ONCE(scx_aborting, true); + WRITE_ONCE(sch->aborting, true); + + /* + * Propagate exits to descendants immediately. Each has a dedicated + * helper kthread and can run in parallel. While most of disabling is + * serialized, running them in separate threads allows parallelizing + * ops.exit(), which can take arbitrarily long prolonging bypass mode. + * + * To guarantee forward progress, this propagation must be in-line so + * that ->aborting is synchronously asserted for all sub-scheds. The + * propagation is also the interlocking point against sub-sched + * attachment. See scx_link_sched(). + * + * This doesn't cause recursions as propagation only takes place for + * non-propagation exits. 
+ */ + if (kind != SCX_EXIT_PARENT) { + scoped_guard (raw_spinlock_irqsave, &scx_sched_lock) { + struct scx_sched *pos; + scx_for_each_descendant_pre(pos, sch) + scx_disable(pos, SCX_EXIT_PARENT); + } + } + return true; } -static void scx_disable(enum scx_exit_kind kind) +static void scx_disable_workfn(struct kthread_work *work) { - struct scx_sched *sch; + struct scx_sched *sch = container_of(work, struct scx_sched, disable_work); + struct scx_exit_info *ei = sch->exit_info; + int kind; - if (WARN_ON_ONCE(kind == SCX_EXIT_NONE || kind == SCX_EXIT_DONE)) - kind = SCX_EXIT_ERROR; - - rcu_read_lock(); - sch = rcu_dereference(scx_root); - if (sch) { - guard(preempt)(); - scx_claim_exit(sch, kind); - kthread_queue_work(sch->helper, &sch->disable_work); + kind = atomic_read(&sch->exit_kind); + while (true) { + if (kind == SCX_EXIT_DONE) /* already disabled? */ + return; + WARN_ON_ONCE(kind == SCX_EXIT_NONE); + if (atomic_try_cmpxchg(&sch->exit_kind, &kind, SCX_EXIT_DONE)) + break; } - rcu_read_unlock(); + ei->kind = kind; + ei->reason = scx_exit_reason(ei->kind); + + if (scx_parent(sch)) + scx_sub_disable(sch); + else + scx_root_disable(sch); +} + +static void scx_disable(struct scx_sched *sch, enum scx_exit_kind kind) +{ + guard(preempt)(); + if (scx_claim_exit(sch, kind)) + irq_work_queue(&sch->disable_irq_work); } static void dump_newline(struct seq_buf *s) @@ -4560,7 +5936,7 @@ static __printf(2, 3) void dump_line(struct seq_buf *s, const char *fmt, ...) 
#ifdef CONFIG_TRACEPOINTS if (trace_sched_ext_dump_enabled()) { - /* protected by scx_dump_state()::dump_lock */ + /* protected by scx_dump_lock */ static char line_buf[SCX_EXIT_MSG_LEN]; va_start(args, fmt); @@ -4656,25 +6032,38 @@ static void ops_dump_exit(void) scx_dump_data.cpu = -1; } -static void scx_dump_task(struct seq_buf *s, struct scx_dump_ctx *dctx, +static void scx_dump_task(struct scx_sched *sch, + struct seq_buf *s, struct scx_dump_ctx *dctx, struct task_struct *p, char marker) { static unsigned long bt[SCX_EXIT_BT_LEN]; - struct scx_sched *sch = scx_root; + struct scx_sched *task_sch = scx_task_sched(p); + const char *own_marker; + char sch_id_buf[32]; char dsq_id_buf[19] = "(n/a)"; unsigned long ops_state = atomic_long_read(&p->scx.ops_state); unsigned int bt_len = 0; + own_marker = task_sch == sch ? "*" : ""; + + if (task_sch->level == 0) + scnprintf(sch_id_buf, sizeof(sch_id_buf), "root"); + else + scnprintf(sch_id_buf, sizeof(sch_id_buf), "sub%d-%llu", + task_sch->level, task_sch->ops.sub_cgroup_id); + if (p->scx.dsq) scnprintf(dsq_id_buf, sizeof(dsq_id_buf), "0x%llx", (unsigned long long)p->scx.dsq->id); dump_newline(s); - dump_line(s, " %c%c %s[%d] %+ldms", + dump_line(s, " %c%c %s[%d] %s%s %+ldms", marker, task_state_to_char(p), p->comm, p->pid, + own_marker, sch_id_buf, jiffies_delta_msecs(p->scx.runnable_at, dctx->at_jiffies)); dump_line(s, " scx_state/flags=%u/0x%x dsq_flags=0x%x ops_state/qseq=%lu/%lu", - scx_get_task_state(p), p->scx.flags & ~SCX_TASK_STATE_MASK, + scx_get_task_state(p) >> SCX_TASK_STATE_SHIFT, + p->scx.flags & ~SCX_TASK_STATE_MASK, p->scx.dsq_flags, ops_state & SCX_OPSS_STATE_MASK, ops_state >> SCX_OPSS_QSEQ_SHIFT); dump_line(s, " sticky/holding_cpu=%d/%d dsq_id=%s", @@ -4686,7 +6075,7 @@ static void scx_dump_task(struct seq_buf *s, struct scx_dump_ctx *dctx, if (SCX_HAS_OP(sch, dump_task)) { ops_dump_init(s, " "); - SCX_CALL_OP(sch, SCX_KF_REST, dump_task, NULL, dctx, p); + SCX_CALL_OP(sch, dump_task, NULL, dctx, p); 
ops_dump_exit(); } @@ -4699,11 +6088,17 @@ static void scx_dump_task(struct seq_buf *s, struct scx_dump_ctx *dctx, } } -static void scx_dump_state(struct scx_exit_info *ei, size_t dump_len) +/* + * Dump scheduler state. If @dump_all_tasks is true, dump all tasks regardless + * of which scheduler they belong to. If false, only dump tasks owned by @sch. + * For SysRq-D dumps, @dump_all_tasks=false since all schedulers are dumped + * separately. For error dumps, @dump_all_tasks=true since only the failing + * scheduler is dumped. + */ +static void scx_dump_state(struct scx_sched *sch, struct scx_exit_info *ei, + size_t dump_len, bool dump_all_tasks) { - static DEFINE_SPINLOCK(dump_lock); static const char trunc_marker[] = "\n\n~~~~ TRUNCATED ~~~~\n"; - struct scx_sched *sch = scx_root; struct scx_dump_ctx dctx = { .kind = ei->kind, .exit_code = ei->exit_code, @@ -4713,14 +6108,24 @@ static void scx_dump_state(struct scx_exit_info *ei, size_t dump_len) }; struct seq_buf s; struct scx_event_stats events; - unsigned long flags; char *buf; int cpu; - spin_lock_irqsave(&dump_lock, flags); + guard(raw_spinlock_irqsave)(&scx_dump_lock); + + if (sch->dump_disabled) + return; seq_buf_init(&s, ei->dump, dump_len); +#ifdef CONFIG_EXT_SUB_SCHED + if (sch->level == 0) + dump_line(&s, "%s: root", sch->ops.name); + else + dump_line(&s, "%s: sub%d-%llu %s", + sch->ops.name, sch->level, sch->ops.sub_cgroup_id, + sch->cgrp_path); +#endif if (ei->kind == SCX_EXIT_NONE) { dump_line(&s, "Debug dump triggered by %s", ei->reason); } else { @@ -4734,7 +6139,7 @@ static void scx_dump_state(struct scx_exit_info *ei, size_t dump_len) if (SCX_HAS_OP(sch, dump)) { ops_dump_init(&s, ""); - SCX_CALL_OP(sch, SCX_KF_UNLOCKED, dump, NULL, &dctx); + SCX_CALL_OP(sch, dump, NULL, &dctx); ops_dump_exit(); } @@ -4794,7 +6199,7 @@ static void scx_dump_state(struct scx_exit_info *ei, size_t dump_len) used = seq_buf_used(&ns); if (SCX_HAS_OP(sch, dump_cpu)) { ops_dump_init(&ns, " "); - SCX_CALL_OP(sch, 
SCX_KF_REST, dump_cpu, NULL, + SCX_CALL_OP(sch, dump_cpu, NULL, &dctx, cpu, idle); ops_dump_exit(); } @@ -4816,11 +6221,13 @@ static void scx_dump_state(struct scx_exit_info *ei, size_t dump_len) seq_buf_set_overflow(&s); } - if (rq->curr->sched_class == &ext_sched_class) - scx_dump_task(&s, &dctx, rq->curr, '*'); + if (rq->curr->sched_class == &ext_sched_class && + (dump_all_tasks || scx_task_on_sched(sch, rq->curr))) + scx_dump_task(sch, &s, &dctx, rq->curr, '*'); list_for_each_entry(p, &rq->scx.runnable_list, scx.runnable_node) - scx_dump_task(&s, &dctx, p, ' '); + if (dump_all_tasks || scx_task_on_sched(sch, p)) + scx_dump_task(sch, &s, &dctx, p, ' '); next: rq_unlock_irqrestore(rq, &rf); } @@ -4835,25 +6242,27 @@ static void scx_dump_state(struct scx_exit_info *ei, size_t dump_len) scx_dump_event(s, &events, SCX_EV_DISPATCH_KEEP_LAST); scx_dump_event(s, &events, SCX_EV_ENQ_SKIP_EXITING); scx_dump_event(s, &events, SCX_EV_ENQ_SKIP_MIGRATION_DISABLED); + scx_dump_event(s, &events, SCX_EV_REENQ_IMMED); + scx_dump_event(s, &events, SCX_EV_REENQ_LOCAL_REPEAT); scx_dump_event(s, &events, SCX_EV_REFILL_SLICE_DFL); scx_dump_event(s, &events, SCX_EV_BYPASS_DURATION); scx_dump_event(s, &events, SCX_EV_BYPASS_DISPATCH); scx_dump_event(s, &events, SCX_EV_BYPASS_ACTIVATE); + scx_dump_event(s, &events, SCX_EV_INSERT_NOT_OWNED); + scx_dump_event(s, &events, SCX_EV_SUB_BYPASS_DISPATCH); if (seq_buf_has_overflowed(&s) && dump_len >= sizeof(trunc_marker)) memcpy(ei->dump + dump_len - sizeof(trunc_marker), trunc_marker, sizeof(trunc_marker)); - - spin_unlock_irqrestore(&dump_lock, flags); } -static void scx_error_irq_workfn(struct irq_work *irq_work) +static void scx_disable_irq_workfn(struct irq_work *irq_work) { - struct scx_sched *sch = container_of(irq_work, struct scx_sched, error_irq_work); + struct scx_sched *sch = container_of(irq_work, struct scx_sched, disable_irq_work); struct scx_exit_info *ei = sch->exit_info; if (ei->kind >= SCX_EXIT_ERROR) - scx_dump_state(ei, 
sch->ops.exit_dump_len); + scx_dump_state(sch, ei, sch->ops.exit_dump_len, true); kthread_queue_work(sch->helper, &sch->disable_work); } @@ -4883,7 +6292,7 @@ static bool scx_vexit(struct scx_sched *sch, ei->kind = kind; ei->reason = scx_exit_reason(ei->kind); - irq_work_queue(&sch->error_irq_work); + irq_work_queue(&sch->disable_irq_work); return true; } @@ -4914,14 +6323,47 @@ static int alloc_kick_syncs(void) return 0; } -static struct scx_sched *scx_alloc_and_add_sched(struct sched_ext_ops *ops) +static void free_pnode(struct scx_sched_pnode *pnode) +{ + if (!pnode) + return; + exit_dsq(&pnode->global_dsq); + kfree(pnode); +} + +static struct scx_sched_pnode *alloc_pnode(struct scx_sched *sch, int node) +{ + struct scx_sched_pnode *pnode; + + pnode = kzalloc_node(sizeof(*pnode), GFP_KERNEL, node); + if (!pnode) + return NULL; + + if (init_dsq(&pnode->global_dsq, SCX_DSQ_GLOBAL, sch)) { + kfree(pnode); + return NULL; + } + + return pnode; +} + +/* + * Allocate and initialize a new scx_sched. @cgrp's reference is always + * consumed whether the function succeeds or fails. + */ +static struct scx_sched *scx_alloc_and_add_sched(struct sched_ext_ops *ops, + struct cgroup *cgrp, + struct scx_sched *parent) { struct scx_sched *sch; - int node, ret; + s32 level = parent ? 
parent->level + 1 : 0; + s32 node, cpu, ret, bypass_fail_cpu = nr_cpu_ids; - sch = kzalloc_obj(*sch); - if (!sch) - return ERR_PTR(-ENOMEM); + sch = kzalloc_flex(*sch, ancestors, level + 1); + if (!sch) { + ret = -ENOMEM; + goto err_put_cgrp; + } sch->exit_info = alloc_exit_info(ops->exit_dump_len); if (!sch->exit_info) { @@ -4933,29 +6375,42 @@ static struct scx_sched *scx_alloc_and_add_sched(struct sched_ext_ops *ops) if (ret < 0) goto err_free_ei; - sch->global_dsqs = kzalloc_objs(sch->global_dsqs[0], nr_node_ids); - if (!sch->global_dsqs) { + sch->pnode = kzalloc_objs(sch->pnode[0], nr_node_ids); + if (!sch->pnode) { ret = -ENOMEM; goto err_free_hash; } for_each_node_state(node, N_POSSIBLE) { - struct scx_dispatch_q *dsq; - - dsq = kzalloc_node(sizeof(*dsq), GFP_KERNEL, node); - if (!dsq) { + sch->pnode[node] = alloc_pnode(sch, node); + if (!sch->pnode[node]) { ret = -ENOMEM; - goto err_free_gdsqs; + goto err_free_pnode; } - - init_dsq(dsq, SCX_DSQ_GLOBAL); - sch->global_dsqs[node] = dsq; } - sch->pcpu = alloc_percpu(struct scx_sched_pcpu); + sch->dsp_max_batch = ops->dispatch_max_batch ?: SCX_DSP_DFL_MAX_BATCH; + sch->pcpu = __alloc_percpu(struct_size_t(struct scx_sched_pcpu, + dsp_ctx.buf, sch->dsp_max_batch), + __alignof__(struct scx_sched_pcpu)); if (!sch->pcpu) { ret = -ENOMEM; - goto err_free_gdsqs; + goto err_free_pnode; + } + + for_each_possible_cpu(cpu) { + ret = init_dsq(bypass_dsq(sch, cpu), SCX_DSQ_BYPASS, sch); + if (ret) { + bypass_fail_cpu = cpu; + goto err_free_pcpu; + } + } + + for_each_possible_cpu(cpu) { + struct scx_sched_pcpu *pcpu = per_cpu_ptr(sch->pcpu, cpu); + + pcpu->sch = sch; + INIT_LIST_HEAD(&pcpu->deferred_reenq_local.node); } sch->helper = kthread_run_worker(0, "sched_ext_helper"); @@ -4966,33 +6421,98 @@ static struct scx_sched *scx_alloc_and_add_sched(struct sched_ext_ops *ops) sched_set_fifo(sch->helper->task); + if (parent) + memcpy(sch->ancestors, parent->ancestors, + level * sizeof(parent->ancestors[0])); + 
sch->ancestors[level] = sch; + sch->level = level; + + if (ops->timeout_ms) + sch->watchdog_timeout = msecs_to_jiffies(ops->timeout_ms); + else + sch->watchdog_timeout = SCX_WATCHDOG_MAX_TIMEOUT; + + sch->slice_dfl = SCX_SLICE_DFL; atomic_set(&sch->exit_kind, SCX_EXIT_NONE); - init_irq_work(&sch->error_irq_work, scx_error_irq_workfn); + init_irq_work(&sch->disable_irq_work, scx_disable_irq_workfn); kthread_init_work(&sch->disable_work, scx_disable_workfn); + timer_setup(&sch->bypass_lb_timer, scx_bypass_lb_timerfn, 0); sch->ops = *ops; - ops->priv = sch; + rcu_assign_pointer(ops->priv, sch); sch->kobj.kset = scx_kset; - ret = kobject_init_and_add(&sch->kobj, &scx_ktype, NULL, "root"); - if (ret < 0) - goto err_stop_helper; +#ifdef CONFIG_EXT_SUB_SCHED + char *buf = kzalloc(PATH_MAX, GFP_KERNEL); + if (!buf) { + ret = -ENOMEM; + goto err_stop_helper; + } + cgroup_path(cgrp, buf, PATH_MAX); + sch->cgrp_path = kstrdup(buf, GFP_KERNEL); + kfree(buf); + if (!sch->cgrp_path) { + ret = -ENOMEM; + goto err_stop_helper; + } + + sch->cgrp = cgrp; + INIT_LIST_HEAD(&sch->children); + INIT_LIST_HEAD(&sch->sibling); + + if (parent) + ret = kobject_init_and_add(&sch->kobj, &scx_ktype, + &parent->sub_kset->kobj, + "sub-%llu", cgroup_id(cgrp)); + else + ret = kobject_init_and_add(&sch->kobj, &scx_ktype, NULL, "root"); + + if (ret < 0) { + kobject_put(&sch->kobj); + return ERR_PTR(ret); + } + + if (ops->sub_attach) { + sch->sub_kset = kset_create_and_add("sub", NULL, &sch->kobj); + if (!sch->sub_kset) { + kobject_put(&sch->kobj); + return ERR_PTR(-ENOMEM); + } + } +#else /* CONFIG_EXT_SUB_SCHED */ + ret = kobject_init_and_add(&sch->kobj, &scx_ktype, NULL, "root"); + if (ret < 0) { + kobject_put(&sch->kobj); + return ERR_PTR(ret); + } +#endif /* CONFIG_EXT_SUB_SCHED */ return sch; +#ifdef CONFIG_EXT_SUB_SCHED err_stop_helper: kthread_destroy_worker(sch->helper); +#endif err_free_pcpu: + for_each_possible_cpu(cpu) { + if (cpu == bypass_fail_cpu) + break; + exit_dsq(bypass_dsq(sch, 
cpu)); + } free_percpu(sch->pcpu); -err_free_gdsqs: +err_free_pnode: for_each_node_state(node, N_POSSIBLE) - kfree(sch->global_dsqs[node]); - kfree(sch->global_dsqs); + free_pnode(sch->pnode[node]); + kfree(sch->pnode); err_free_hash: rhashtable_free_and_destroy(&sch->dsq_hash, NULL, NULL); err_free_ei: free_exit_info(sch->exit_info); err_free_sch: kfree(sch); +err_put_cgrp: +#if defined(CONFIG_EXT_GROUP_SCHED) || defined(CONFIG_EXT_SUB_SCHED) + cgroup_put(cgrp); +#endif return ERR_PTR(ret); } @@ -5041,9 +6561,6 @@ static int validate_ops(struct scx_sched *sch, const struct sched_ext_ops *ops) return -EINVAL; } - if (ops->flags & SCX_OPS_HAS_CGROUP_WEIGHT) - pr_warn("SCX_OPS_HAS_CGROUP_WEIGHT is deprecated and a noop\n"); - if (ops->cpu_acquire || ops->cpu_release) pr_warn("ops->cpu_acquire/release() are deprecated, use sched_switch TP instead\n"); @@ -5063,15 +6580,14 @@ struct scx_enable_cmd { int ret; }; -static void scx_enable_workfn(struct kthread_work *work) +static void scx_root_enable_workfn(struct kthread_work *work) { - struct scx_enable_cmd *cmd = - container_of(work, struct scx_enable_cmd, work); + struct scx_enable_cmd *cmd = container_of(work, struct scx_enable_cmd, work); struct sched_ext_ops *ops = cmd->ops; + struct cgroup *cgrp = root_cgroup(); struct scx_sched *sch; struct scx_task_iter sti; struct task_struct *p; - unsigned long timeout; int i, cpu, ret; mutex_lock(&scx_enable_mutex); @@ -5085,7 +6601,10 @@ static void scx_enable_workfn(struct kthread_work *work) if (ret) goto err_unlock; - sch = scx_alloc_and_add_sched(ops); +#if defined(CONFIG_EXT_GROUP_SCHED) || defined(CONFIG_EXT_SUB_SCHED) + cgroup_get(cgrp); +#endif + sch = scx_alloc_and_add_sched(ops, cgrp, NULL); if (IS_ERR(sch)) { ret = PTR_ERR(sch); goto err_free_ksyncs; @@ -5097,13 +6616,15 @@ static void scx_enable_workfn(struct kthread_work *work) */ WARN_ON_ONCE(scx_set_enable_state(SCX_ENABLING) != SCX_DISABLED); WARN_ON_ONCE(scx_root); - if (WARN_ON_ONCE(READ_ONCE(scx_aborting))) 
- WRITE_ONCE(scx_aborting, false); atomic_long_set(&scx_nr_rejected, 0); - for_each_possible_cpu(cpu) - cpu_rq(cpu)->scx.cpuperf_target = SCX_CPUPERF_ONE; + for_each_possible_cpu(cpu) { + struct rq *rq = cpu_rq(cpu); + + rq->scx.local_dsq.sched = sch; + rq->scx.cpuperf_target = SCX_CPUPERF_ONE; + } /* * Keep CPUs stable during enable so that the BPF scheduler can track @@ -5117,10 +6638,14 @@ static void scx_enable_workfn(struct kthread_work *work) */ rcu_assign_pointer(scx_root, sch); + ret = scx_link_sched(sch); + if (ret) + goto err_disable; + scx_idle_enable(ops); if (sch->ops.init) { - ret = SCX_CALL_OP_RET(sch, SCX_KF_UNLOCKED, init, NULL); + ret = SCX_CALL_OP_RET(sch, init, NULL); if (ret) { ret = ops_sanitize_err(sch, "init", ret); cpus_read_unlock(); @@ -5147,34 +6672,13 @@ static void scx_enable_workfn(struct kthread_work *work) if (ret) goto err_disable; - WARN_ON_ONCE(scx_dsp_ctx); - scx_dsp_max_batch = ops->dispatch_max_batch ?: SCX_DSP_DFL_MAX_BATCH; - scx_dsp_ctx = __alloc_percpu(struct_size_t(struct scx_dsp_ctx, buf, - scx_dsp_max_batch), - __alignof__(struct scx_dsp_ctx)); - if (!scx_dsp_ctx) { - ret = -ENOMEM; - goto err_disable; - } - - if (ops->timeout_ms) - timeout = msecs_to_jiffies(ops->timeout_ms); - else - timeout = SCX_WATCHDOG_MAX_TIMEOUT; - - WRITE_ONCE(scx_watchdog_timeout, timeout); - WRITE_ONCE(scx_watchdog_timestamp, jiffies); - queue_delayed_work(system_dfl_wq, &scx_watchdog_work, - READ_ONCE(scx_watchdog_timeout) / 2); - /* * Once __scx_enabled is set, %current can be switched to SCX anytime. * This can lead to stalls as some BPF schedulers (e.g. userspace * scheduling) may not function correctly before all tasks are switched. * Init in bypass mode to guarantee forward progress. 
*/ - scx_bypass(true); - scx_bypassed_for_enable = true; + scx_bypass(sch, true); for (i = SCX_OPI_NORMAL_BEGIN; i < SCX_OPI_NORMAL_END; i++) if (((void (**)(void))ops)[i]) @@ -5206,11 +6710,12 @@ static void scx_enable_workfn(struct kthread_work *work) * never sees uninitialized tasks. */ scx_cgroup_lock(); + set_cgroup_sched(sch_cgroup(sch), sch); ret = scx_cgroup_init(sch); if (ret) goto err_disable_unlock_all; - scx_task_iter_start(&sti); + scx_task_iter_start(&sti, NULL); while ((p = scx_task_iter_next_locked(&sti))) { /* * @p may already be dead, have lost all its usages counts and @@ -5222,7 +6727,7 @@ static void scx_enable_workfn(struct kthread_work *work) scx_task_iter_unlock(&sti); - ret = scx_init_task(p, task_group(p), false); + ret = scx_init_task(sch, p, false); if (ret) { put_task_struct(p); scx_task_iter_stop(&sti); @@ -5231,6 +6736,7 @@ static void scx_enable_workfn(struct kthread_work *work) goto err_disable_unlock_all; } + scx_set_task_sched(p, sch); scx_set_task_state(p, SCX_TASK_READY); put_task_struct(p); @@ -5252,7 +6758,7 @@ static void scx_enable_workfn(struct kthread_work *work) * scx_tasks_lock. 
*/ percpu_down_write(&scx_fork_rwsem); - scx_task_iter_start(&sti); + scx_task_iter_start(&sti, NULL); while ((p = scx_task_iter_next_locked(&sti))) { unsigned int queue_flags = DEQUEUE_SAVE | DEQUEUE_MOVE; const struct sched_class *old_class = p->sched_class; @@ -5265,15 +6771,14 @@ static void scx_enable_workfn(struct kthread_work *work) queue_flags |= DEQUEUE_CLASS; scoped_guard (sched_change, p, queue_flags) { - p->scx.slice = READ_ONCE(scx_slice_dfl); + p->scx.slice = READ_ONCE(sch->slice_dfl); p->sched_class = new_class; } } scx_task_iter_stop(&sti); percpu_up_write(&scx_fork_rwsem); - scx_bypassed_for_enable = false; - scx_bypass(false); + scx_bypass(sch, false); if (!scx_tryset_enable_state(SCX_ENABLED, SCX_ENABLING)) { WARN_ON_ONCE(atomic_read(&sch->exit_kind) == SCX_EXIT_NONE); @@ -5315,12 +6820,318 @@ err_disable: * Flush scx_disable_work to ensure that error is reported before init * completion. sch's base reference will be put by bpf_scx_unreg(). */ - scx_error(sch, "scx_enable() failed (%d)", ret); + scx_error(sch, "scx_root_enable() failed (%d)", ret); kthread_flush_work(&sch->disable_work); cmd->ret = 0; } -static int scx_enable(struct sched_ext_ops *ops, struct bpf_link *link) +#ifdef CONFIG_EXT_SUB_SCHED +/* verify that a scheduler can be attached to @cgrp and return the parent */ +static struct scx_sched *find_parent_sched(struct cgroup *cgrp) +{ + struct scx_sched *parent = cgrp->scx_sched; + struct scx_sched *pos; + + lockdep_assert_held(&scx_sched_lock); + + /* can't attach twice to the same cgroup */ + if (parent->cgrp == cgrp) + return ERR_PTR(-EBUSY); + + /* does $parent allow sub-scheds? 
*/ + if (!parent->ops.sub_attach) + return ERR_PTR(-EOPNOTSUPP); + + /* can't insert between $parent and its exiting children */ + list_for_each_entry(pos, &parent->children, sibling) + if (cgroup_is_descendant(pos->cgrp, cgrp)) + return ERR_PTR(-EBUSY); + + return parent; +} + +static bool assert_task_ready_or_enabled(struct task_struct *p) +{ + u32 state = scx_get_task_state(p); + + switch (state) { + case SCX_TASK_READY: + case SCX_TASK_ENABLED: + return true; + default: + WARN_ONCE(true, "sched_ext: Invalid task state %d for %s[%d] during enabling sub sched", + state, p->comm, p->pid); + return false; + } +} + +static void scx_sub_enable_workfn(struct kthread_work *work) +{ + struct scx_enable_cmd *cmd = container_of(work, struct scx_enable_cmd, work); + struct sched_ext_ops *ops = cmd->ops; + struct cgroup *cgrp; + struct scx_sched *parent, *sch; + struct scx_task_iter sti; + struct task_struct *p; + s32 i, ret; + + mutex_lock(&scx_enable_mutex); + + if (!scx_enabled()) { + ret = -ENODEV; + goto out_unlock; + } + + cgrp = cgroup_get_from_id(ops->sub_cgroup_id); + if (IS_ERR(cgrp)) { + ret = PTR_ERR(cgrp); + goto out_unlock; + } + + raw_spin_lock_irq(&scx_sched_lock); + parent = find_parent_sched(cgrp); + if (IS_ERR(parent)) { + raw_spin_unlock_irq(&scx_sched_lock); + ret = PTR_ERR(parent); + goto out_put_cgrp; + } + kobject_get(&parent->kobj); + raw_spin_unlock_irq(&scx_sched_lock); + + /* scx_alloc_and_add_sched() consumes @cgrp whether it succeeds or not */ + sch = scx_alloc_and_add_sched(ops, cgrp, parent); + kobject_put(&parent->kobj); + if (IS_ERR(sch)) { + ret = PTR_ERR(sch); + goto out_unlock; + } + + ret = scx_link_sched(sch); + if (ret) + goto err_disable; + + if (sch->level >= SCX_SUB_MAX_DEPTH) { + scx_error(sch, "max nesting depth %d violated", + SCX_SUB_MAX_DEPTH); + goto err_disable; + } + + if (sch->ops.init) { + ret = SCX_CALL_OP_RET(sch, init, NULL); + if (ret) { + ret = ops_sanitize_err(sch, "init", ret); + scx_error(sch, "ops.init() failed 
(%d)", ret); + goto err_disable; + } + sch->exit_info->flags |= SCX_EFLAG_INITIALIZED; + } + + if (validate_ops(sch, ops)) + goto err_disable; + + struct scx_sub_attach_args sub_attach_args = { + .ops = &sch->ops, + .cgroup_path = sch->cgrp_path, + }; + + ret = SCX_CALL_OP_RET(parent, sub_attach, NULL, + &sub_attach_args); + if (ret) { + ret = ops_sanitize_err(sch, "sub_attach", ret); + scx_error(sch, "parent rejected (%d)", ret); + goto err_disable; + } + sch->sub_attached = true; + + scx_bypass(sch, true); + + for (i = SCX_OPI_BEGIN; i < SCX_OPI_END; i++) + if (((void (**)(void))ops)[i]) + set_bit(i, sch->has_op); + + percpu_down_write(&scx_fork_rwsem); + scx_cgroup_lock(); + + /* + * Set cgroup->scx_sched's and check CSS_ONLINE. Either we see + * !CSS_ONLINE or scx_cgroup_lifetime_notify() sees and shoots us down. + */ + set_cgroup_sched(sch_cgroup(sch), sch); + if (!(cgrp->self.flags & CSS_ONLINE)) { + scx_error(sch, "cgroup is not online"); + goto err_unlock_and_disable; + } + + /* + * Initialize tasks for the new child $sch without exiting them for + * $parent so that the tasks can always be reverted back to $parent + * sched on child init failure. + */ + WARN_ON_ONCE(scx_enabling_sub_sched); + scx_enabling_sub_sched = sch; + + scx_task_iter_start(&sti, sch->cgrp); + while ((p = scx_task_iter_next_locked(&sti))) { + struct rq *rq; + struct rq_flags rf; + + /* + * Task iteration may visit the same task twice when racing + * against exiting. Use %SCX_TASK_SUB_INIT to mark tasks which + * finished __scx_init_task() and skip if set. + * + * A task may exit and get freed between __scx_init_task() + * completion and scx_enable_task(). In such cases, + * scx_disable_and_exit_task() must exit the task for both the + * parent and child scheds. 
+ */ + if (p->scx.flags & SCX_TASK_SUB_INIT) + continue; + + /* see scx_root_enable() */ + if (!tryget_task_struct(p)) + continue; + + if (!assert_task_ready_or_enabled(p)) { + ret = -EINVAL; + goto abort; + } + + scx_task_iter_unlock(&sti); + + /* + * As $p is still on $parent, it can't be transitioned to INIT. + * Let's worry about task state later. Use __scx_init_task(). + */ + ret = __scx_init_task(sch, p, false); + if (ret) + goto abort; + + rq = task_rq_lock(p, &rf); + p->scx.flags |= SCX_TASK_SUB_INIT; + task_rq_unlock(rq, p, &rf); + + put_task_struct(p); + } + scx_task_iter_stop(&sti); + + /* + * All tasks are prepped. Disable/exit tasks for $parent and enable for + * the new @sch. + */ + scx_task_iter_start(&sti, sch->cgrp); + while ((p = scx_task_iter_next_locked(&sti))) { + /* + * Use clearing of %SCX_TASK_SUB_INIT to detect and skip + * duplicate iterations. + */ + if (!(p->scx.flags & SCX_TASK_SUB_INIT)) + continue; + + scoped_guard (sched_change, p, DEQUEUE_SAVE | DEQUEUE_MOVE) { + /* + * $p must be either READY or ENABLED. If ENABLED, + * __scx_disable_and_exit_task() first disables and + * makes it READY. However, after exiting $p, it will + * leave $p as READY. + */ + assert_task_ready_or_enabled(p); + __scx_disable_and_exit_task(parent, p); + + /* + * $p is now only initialized for @sch and READY, which + * is what we want. Assign it to @sch and enable. 
+ */ + rcu_assign_pointer(p->scx.sched, sch); + scx_enable_task(sch, p); + + p->scx.flags &= ~SCX_TASK_SUB_INIT; + } + } + scx_task_iter_stop(&sti); + + scx_enabling_sub_sched = NULL; + + scx_cgroup_unlock(); + percpu_up_write(&scx_fork_rwsem); + + scx_bypass(sch, false); + + pr_info("sched_ext: BPF sub-scheduler \"%s\" enabled\n", sch->ops.name); + kobject_uevent(&sch->kobj, KOBJ_ADD); + ret = 0; + goto out_unlock; + +out_put_cgrp: + cgroup_put(cgrp); +out_unlock: + mutex_unlock(&scx_enable_mutex); + cmd->ret = ret; + return; + +abort: + put_task_struct(p); + scx_task_iter_stop(&sti); + scx_enabling_sub_sched = NULL; + + scx_task_iter_start(&sti, sch->cgrp); + while ((p = scx_task_iter_next_locked(&sti))) { + if (p->scx.flags & SCX_TASK_SUB_INIT) { + __scx_disable_and_exit_task(sch, p); + p->scx.flags &= ~SCX_TASK_SUB_INIT; + } + } + scx_task_iter_stop(&sti); +err_unlock_and_disable: + /* we'll soon enter disable path, keep bypass on */ + scx_cgroup_unlock(); + percpu_up_write(&scx_fork_rwsem); +err_disable: + mutex_unlock(&scx_enable_mutex); + kthread_flush_work(&sch->disable_work); + cmd->ret = 0; +} + +static s32 scx_cgroup_lifetime_notify(struct notifier_block *nb, + unsigned long action, void *data) +{ + struct cgroup *cgrp = data; + struct cgroup *parent = cgroup_parent(cgrp); + + if (!cgroup_on_dfl(cgrp)) + return NOTIFY_OK; + + switch (action) { + case CGROUP_LIFETIME_ONLINE: + /* inherit ->scx_sched from $parent */ + if (parent) + rcu_assign_pointer(cgrp->scx_sched, parent->scx_sched); + break; + case CGROUP_LIFETIME_OFFLINE: + /* if there is a sched attached, shoot it down */ + if (cgrp->scx_sched && cgrp->scx_sched->cgrp == cgrp) + scx_exit(cgrp->scx_sched, SCX_EXIT_UNREG_KERN, + SCX_ECODE_RSN_CGROUP_OFFLINE, + "cgroup %llu going offline", cgroup_id(cgrp)); + break; + } + + return NOTIFY_OK; +} + +static struct notifier_block scx_cgroup_lifetime_nb = { + .notifier_call = scx_cgroup_lifetime_notify, +}; + +static s32 __init 
scx_cgroup_lifetime_notifier_init(void) +{ + return blocking_notifier_chain_register(&cgroup_lifetime_notifier, + &scx_cgroup_lifetime_nb); +} +core_initcall(scx_cgroup_lifetime_notifier_init); +#endif /* CONFIG_EXT_SUB_SCHED */ + +static s32 scx_enable(struct sched_ext_ops *ops, struct bpf_link *link) { static struct kthread_worker *helper; static DEFINE_MUTEX(helper_mutex); @@ -5347,7 +7158,12 @@ static int scx_enable(struct sched_ext_ops *ops, struct bpf_link *link) mutex_unlock(&helper_mutex); } - kthread_init_work(&cmd.work, scx_enable_workfn); +#ifdef CONFIG_EXT_SUB_SCHED + if (ops->sub_cgroup_id > 1) + kthread_init_work(&cmd.work, scx_sub_enable_workfn); + else +#endif /* CONFIG_EXT_SUB_SCHED */ + kthread_init_work(&cmd.work, scx_root_enable_workfn); cmd.ops = ops; kthread_queue_work(READ_ONCE(helper), &cmd.work); @@ -5388,12 +7204,17 @@ static int bpf_scx_btf_struct_access(struct bpf_verifier_log *log, t = btf_type_by_id(reg->btf, reg->btf_id); if (t == task_struct_type) { - if (off >= offsetof(struct task_struct, scx.slice) && - off + size <= offsetofend(struct task_struct, scx.slice)) - return SCALAR_VALUE; - if (off >= offsetof(struct task_struct, scx.dsq_vtime) && - off + size <= offsetofend(struct task_struct, scx.dsq_vtime)) + /* + * COMPAT: Will be removed in v6.23. 
+ */ + if ((off >= offsetof(struct task_struct, scx.slice) && + off + size <= offsetofend(struct task_struct, scx.slice)) || + (off >= offsetof(struct task_struct, scx.dsq_vtime) && + off + size <= offsetofend(struct task_struct, scx.dsq_vtime))) { + pr_warn("sched_ext: Writing directly to p->scx.slice/dsq_vtime is deprecated, use scx_bpf_task_set_slice/dsq_vtime()"); return SCALAR_VALUE; + } + if (off >= offsetof(struct task_struct, scx.disallow) && off + size <= offsetofend(struct task_struct, scx.disallow)) return SCALAR_VALUE; @@ -5449,11 +7270,30 @@ static int bpf_scx_init_member(const struct btf_type *t, case offsetof(struct sched_ext_ops, hotplug_seq): ops->hotplug_seq = *(u64 *)(udata + moff); return 1; +#ifdef CONFIG_EXT_SUB_SCHED + case offsetof(struct sched_ext_ops, sub_cgroup_id): + ops->sub_cgroup_id = *(u64 *)(udata + moff); + return 1; +#endif /* CONFIG_EXT_SUB_SCHED */ } return 0; } +#ifdef CONFIG_EXT_SUB_SCHED +static void scx_pstack_recursion_on_dispatch(struct bpf_prog *prog) +{ + struct scx_sched *sch; + + guard(rcu)(); + sch = scx_prog_sched(prog->aux); + if (unlikely(!sch)) + return; + + scx_error(sch, "dispatch recursion detected"); +} +#endif /* CONFIG_EXT_SUB_SCHED */ + static int bpf_scx_check_member(const struct btf_type *t, const struct btf_member *member, const struct bpf_prog *prog) @@ -5471,12 +7311,30 @@ static int bpf_scx_check_member(const struct btf_type *t, case offsetof(struct sched_ext_ops, cpu_offline): case offsetof(struct sched_ext_ops, init): case offsetof(struct sched_ext_ops, exit): + case offsetof(struct sched_ext_ops, sub_attach): + case offsetof(struct sched_ext_ops, sub_detach): break; default: if (prog->sleepable) return -EINVAL; } +#ifdef CONFIG_EXT_SUB_SCHED + /* + * Enable private stack for operations that can nest along the + * hierarchy. + * + * XXX - Ideally, we should only do this for scheds that allow + * sub-scheds and sub-scheds themselves but I don't know how to access + * struct_ops from here. 
+ */ + switch (moff) { + case offsetof(struct sched_ext_ops, dispatch): + prog->aux->priv_stack_requested = true; + prog->aux->recursion_detected = scx_pstack_recursion_on_dispatch; + } +#endif /* CONFIG_EXT_SUB_SCHED */ + return 0; } @@ -5488,10 +7346,11 @@ static int bpf_scx_reg(void *kdata, struct bpf_link *link) static void bpf_scx_unreg(void *kdata, struct bpf_link *link) { struct sched_ext_ops *ops = kdata; - struct scx_sched *sch = ops->priv; + struct scx_sched *sch = rcu_dereference_protected(ops->priv, true); - scx_disable(SCX_EXIT_UNREG); + scx_disable(sch, SCX_EXIT_UNREG); kthread_flush_work(&sch->disable_work); + RCU_INIT_POINTER(ops->priv, NULL); kobject_put(&sch->kobj); } @@ -5548,7 +7407,9 @@ static void sched_ext_ops__cgroup_cancel_move(struct task_struct *p, struct cgro static void sched_ext_ops__cgroup_set_weight(struct cgroup *cgrp, u32 weight) {} static void sched_ext_ops__cgroup_set_bandwidth(struct cgroup *cgrp, u64 period_us, u64 quota_us, u64 burst_us) {} static void sched_ext_ops__cgroup_set_idle(struct cgroup *cgrp, bool idle) {} -#endif +#endif /* CONFIG_EXT_GROUP_SCHED */ +static s32 sched_ext_ops__sub_attach(struct scx_sub_attach_args *args) { return -EINVAL; } +static void sched_ext_ops__sub_detach(struct scx_sub_detach_args *args) {} static void sched_ext_ops__cpu_online(s32 cpu) {} static void sched_ext_ops__cpu_offline(s32 cpu) {} static s32 sched_ext_ops__init(void) { return -EINVAL; } @@ -5588,6 +7449,8 @@ static struct sched_ext_ops __bpf_ops_sched_ext_ops = { .cgroup_set_bandwidth = sched_ext_ops__cgroup_set_bandwidth, .cgroup_set_idle = sched_ext_ops__cgroup_set_idle, #endif + .sub_attach = sched_ext_ops__sub_attach, + .sub_detach = sched_ext_ops__sub_detach, .cpu_online = sched_ext_ops__cpu_online, .cpu_offline = sched_ext_ops__cpu_offline, .init = sched_ext_ops__init, @@ -5618,7 +7481,15 @@ static struct bpf_struct_ops bpf_sched_ext_ops = { static void sysrq_handle_sched_ext_reset(u8 key) { - scx_disable(SCX_EXIT_SYSRQ); + 
struct scx_sched *sch; + + rcu_read_lock(); + sch = rcu_dereference(scx_root); + if (likely(sch)) + scx_disable(sch, SCX_EXIT_SYSRQ); + else + pr_info("sched_ext: BPF schedulers not loaded\n"); + rcu_read_unlock(); } static const struct sysrq_key_op sysrq_sched_ext_reset_op = { @@ -5631,9 +7502,10 @@ static const struct sysrq_key_op sysrq_sched_ext_reset_op = { static void sysrq_handle_sched_ext_dump(u8 key) { struct scx_exit_info ei = { .kind = SCX_EXIT_NONE, .reason = "SysRq-D" }; + struct scx_sched *sch; - if (scx_enabled()) - scx_dump_state(&ei, 0); + list_for_each_entry_rcu(sch, &scx_sched_all, all) + scx_dump_state(sch, &ei, 0, false); } static const struct sysrq_key_op sysrq_sched_ext_dump_op = { @@ -5728,10 +7600,9 @@ static void kick_cpus_irq_workfn(struct irq_work *irq_work) unsigned long *ksyncs; s32 cpu; - if (unlikely(!ksyncs_pcpu)) { - pr_warn_once("kick_cpus_irq_workfn() called with NULL scx_kick_syncs"); + /* can race with free_kick_syncs() during scheduler disable */ + if (unlikely(!ksyncs_pcpu)) return; - } ksyncs = rcu_dereference_bh(ksyncs_pcpu)->syncs; @@ -5772,14 +7643,18 @@ static void kick_cpus_irq_workfn(struct irq_work *irq_work) */ void print_scx_info(const char *log_lvl, struct task_struct *p) { - struct scx_sched *sch = scx_root; + struct scx_sched *sch; enum scx_enable_state state = scx_enable_state(); const char *all = READ_ONCE(scx_switching_all) ? 
"+all" : ""; char runnable_at_buf[22] = "?"; struct sched_class *class; unsigned long runnable_at; - if (state == SCX_DISABLED) + guard(rcu)(); + + sch = scx_task_sched_rcu(p); + + if (!sch) return; /* @@ -5806,6 +7681,14 @@ void print_scx_info(const char *log_lvl, struct task_struct *p) static int scx_pm_handler(struct notifier_block *nb, unsigned long event, void *ptr) { + struct scx_sched *sch; + + guard(rcu)(); + + sch = rcu_dereference(scx_root); + if (!sch) + return NOTIFY_OK; + /* * SCX schedulers often have userspace components which are sometimes * involved in critial scheduling paths. PM operations involve freezing @@ -5816,12 +7699,12 @@ static int scx_pm_handler(struct notifier_block *nb, unsigned long event, void * case PM_HIBERNATION_PREPARE: case PM_SUSPEND_PREPARE: case PM_RESTORE_PREPARE: - scx_bypass(true); + scx_bypass(sch, true); break; case PM_POST_HIBERNATION: case PM_POST_SUSPEND: case PM_POST_RESTORE: - scx_bypass(false); + scx_bypass(sch, false); break; } @@ -5850,8 +7733,9 @@ void __init init_sched_ext_class(void) struct rq *rq = cpu_rq(cpu); int n = cpu_to_node(cpu); - init_dsq(&rq->scx.local_dsq, SCX_DSQ_LOCAL); - init_dsq(&rq->scx.bypass_dsq, SCX_DSQ_BYPASS); + /* local_dsq's sch will be set during scx_root_enable() */ + BUG_ON(init_dsq(&rq->scx.local_dsq, SCX_DSQ_LOCAL, NULL)); + INIT_LIST_HEAD(&rq->scx.runnable_list); INIT_LIST_HEAD(&rq->scx.ddsp_deferred_locals); @@ -5860,6 +7744,9 @@ void __init init_sched_ext_class(void) BUG_ON(!zalloc_cpumask_var_node(&rq->scx.cpus_to_preempt, GFP_KERNEL, n)); BUG_ON(!zalloc_cpumask_var_node(&rq->scx.cpus_to_wait, GFP_KERNEL, n)); BUG_ON(!zalloc_cpumask_var_node(&rq->scx.cpus_to_sync, GFP_KERNEL, n)); + raw_spin_lock_init(&rq->scx.deferred_reenq_lock); + INIT_LIST_HEAD(&rq->scx.deferred_reenq_locals); + INIT_LIST_HEAD(&rq->scx.deferred_reenq_users); rq->scx.deferred_irq_work = IRQ_WORK_INIT_HARD(deferred_irq_workfn); rq->scx.kick_cpus_irq_work = IRQ_WORK_INIT_HARD(kick_cpus_irq_workfn); @@ 
-5870,18 +7757,36 @@ void __init init_sched_ext_class(void) register_sysrq_key('S', &sysrq_sched_ext_reset_op); register_sysrq_key('D', &sysrq_sched_ext_dump_op); INIT_DELAYED_WORK(&scx_watchdog_work, scx_watchdog_workfn); + +#ifdef CONFIG_EXT_SUB_SCHED + BUG_ON(rhashtable_init(&scx_sched_hash, &scx_sched_hash_params)); +#endif /* CONFIG_EXT_SUB_SCHED */ } /******************************************************************************** * Helpers that can be called from the BPF scheduler. */ -static bool scx_dsq_insert_preamble(struct scx_sched *sch, struct task_struct *p, - u64 enq_flags) +static bool scx_vet_enq_flags(struct scx_sched *sch, u64 dsq_id, u64 *enq_flags) { - if (!scx_kf_allowed(sch, SCX_KF_ENQUEUE | SCX_KF_DISPATCH)) - return false; + bool is_local = dsq_id == SCX_DSQ_LOCAL || + (dsq_id & SCX_DSQ_LOCAL_ON) == SCX_DSQ_LOCAL_ON; + if (*enq_flags & SCX_ENQ_IMMED) { + if (unlikely(!is_local)) { + scx_error(sch, "SCX_ENQ_IMMED on a non-local DSQ 0x%llx", dsq_id); + return false; + } + } else if ((sch->ops.flags & SCX_OPS_ALWAYS_ENQ_IMMED) && is_local) { + *enq_flags |= SCX_ENQ_IMMED; + } + + return true; +} + +static bool scx_dsq_insert_preamble(struct scx_sched *sch, struct task_struct *p, + u64 dsq_id, u64 *enq_flags) +{ lockdep_assert_irqs_disabled(); if (unlikely(!p)) { @@ -5889,18 +7794,27 @@ static bool scx_dsq_insert_preamble(struct scx_sched *sch, struct task_struct *p return false; } - if (unlikely(enq_flags & __SCX_ENQ_INTERNAL_MASK)) { - scx_error(sch, "invalid enq_flags 0x%llx", enq_flags); + if (unlikely(*enq_flags & __SCX_ENQ_INTERNAL_MASK)) { + scx_error(sch, "invalid enq_flags 0x%llx", *enq_flags); return false; } + /* see SCX_EV_INSERT_NOT_OWNED definition */ + if (unlikely(!scx_task_on_sched(sch, p))) { + __scx_add_event(sch, SCX_EV_INSERT_NOT_OWNED, 1); + return false; + } + + if (!scx_vet_enq_flags(sch, dsq_id, enq_flags)) + return false; + return true; } static void scx_dsq_insert_commit(struct scx_sched *sch, struct task_struct *p, 
u64 dsq_id, u64 enq_flags) { - struct scx_dsp_ctx *dspc = this_cpu_ptr(scx_dsp_ctx); + struct scx_dsp_ctx *dspc = &this_cpu_ptr(sch->pcpu)->dsp_ctx; struct task_struct *ddsp_task; ddsp_task = __this_cpu_read(direct_dispatch_task); @@ -5909,7 +7823,7 @@ static void scx_dsq_insert_commit(struct scx_sched *sch, struct task_struct *p, return; } - if (unlikely(dspc->cursor >= scx_dsp_max_batch)) { + if (unlikely(dspc->cursor >= sch->dsp_max_batch)) { scx_error(sch, "dispatch buffer overflow"); return; } @@ -5930,6 +7844,7 @@ __bpf_kfunc_start_defs(); * @dsq_id: DSQ to insert into * @slice: duration @p can run for in nsecs, 0 to keep the current value * @enq_flags: SCX_ENQ_* + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs * * Insert @p into the FIFO queue of the DSQ identified by @dsq_id. It is safe to * call this function spuriously. Can be called from ops.enqueue(), @@ -5964,16 +7879,17 @@ __bpf_kfunc_start_defs(); * to check the return value. */ __bpf_kfunc bool scx_bpf_dsq_insert___v2(struct task_struct *p, u64 dsq_id, - u64 slice, u64 enq_flags) + u64 slice, u64 enq_flags, + const struct bpf_prog_aux *aux) { struct scx_sched *sch; guard(rcu)(); - sch = rcu_dereference(scx_root); + sch = scx_prog_sched(aux); if (unlikely(!sch)) return false; - if (!scx_dsq_insert_preamble(sch, p, enq_flags)) + if (!scx_dsq_insert_preamble(sch, p, dsq_id, &enq_flags)) return false; if (slice) @@ -5990,15 +7906,16 @@ __bpf_kfunc bool scx_bpf_dsq_insert___v2(struct task_struct *p, u64 dsq_id, * COMPAT: Will be removed in v6.23 along with the ___v2 suffix. 
*/ __bpf_kfunc void scx_bpf_dsq_insert(struct task_struct *p, u64 dsq_id, - u64 slice, u64 enq_flags) + u64 slice, u64 enq_flags, + const struct bpf_prog_aux *aux) { - scx_bpf_dsq_insert___v2(p, dsq_id, slice, enq_flags); + scx_bpf_dsq_insert___v2(p, dsq_id, slice, enq_flags, aux); } static bool scx_dsq_insert_vtime(struct scx_sched *sch, struct task_struct *p, u64 dsq_id, u64 slice, u64 vtime, u64 enq_flags) { - if (!scx_dsq_insert_preamble(sch, p, enq_flags)) + if (!scx_dsq_insert_preamble(sch, p, dsq_id, &enq_flags)) return false; if (slice) @@ -6029,6 +7946,7 @@ struct scx_bpf_dsq_insert_vtime_args { * @args->slice: duration @p can run for in nsecs, 0 to keep the current value * @args->vtime: @p's ordering inside the vtime-sorted queue of the target DSQ * @args->enq_flags: SCX_ENQ_* + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs * * Wrapper kfunc that takes arguments via struct to work around BPF's 5 argument * limit. BPF programs should use scx_bpf_dsq_insert_vtime() which is provided @@ -6053,13 +7971,14 @@ struct scx_bpf_dsq_insert_vtime_args { */ __bpf_kfunc bool __scx_bpf_dsq_insert_vtime(struct task_struct *p, - struct scx_bpf_dsq_insert_vtime_args *args) + struct scx_bpf_dsq_insert_vtime_args *args, + const struct bpf_prog_aux *aux) { struct scx_sched *sch; guard(rcu)(); - sch = rcu_dereference(scx_root); + sch = scx_prog_sched(aux); if (unlikely(!sch)) return false; @@ -6081,44 +8000,61 @@ __bpf_kfunc void scx_bpf_dsq_insert_vtime(struct task_struct *p, u64 dsq_id, if (unlikely(!sch)) return; +#ifdef CONFIG_EXT_SUB_SCHED + /* + * Disallow if any sub-scheds are attached. There is no way to tell + * which scheduler called us, just error out @p's scheduler. 
+ */ + if (unlikely(!list_empty(&sch->children))) { + scx_error(scx_task_sched(p), "__scx_bpf_dsq_insert_vtime() must be used"); + return; + } +#endif + scx_dsq_insert_vtime(sch, p, dsq_id, slice, vtime, enq_flags); } __bpf_kfunc_end_defs(); BTF_KFUNCS_START(scx_kfunc_ids_enqueue_dispatch) -BTF_ID_FLAGS(func, scx_bpf_dsq_insert, KF_RCU) -BTF_ID_FLAGS(func, scx_bpf_dsq_insert___v2, KF_RCU) -BTF_ID_FLAGS(func, __scx_bpf_dsq_insert_vtime, KF_RCU) +BTF_ID_FLAGS(func, scx_bpf_dsq_insert, KF_IMPLICIT_ARGS | KF_RCU) +BTF_ID_FLAGS(func, scx_bpf_dsq_insert___v2, KF_IMPLICIT_ARGS | KF_RCU) +BTF_ID_FLAGS(func, __scx_bpf_dsq_insert_vtime, KF_IMPLICIT_ARGS | KF_RCU) BTF_ID_FLAGS(func, scx_bpf_dsq_insert_vtime, KF_RCU) BTF_KFUNCS_END(scx_kfunc_ids_enqueue_dispatch) static const struct btf_kfunc_id_set scx_kfunc_set_enqueue_dispatch = { .owner = THIS_MODULE, .set = &scx_kfunc_ids_enqueue_dispatch, + .filter = scx_kfunc_context_filter, }; static bool scx_dsq_move(struct bpf_iter_scx_dsq_kern *kit, struct task_struct *p, u64 dsq_id, u64 enq_flags) { - struct scx_sched *sch = scx_root; struct scx_dispatch_q *src_dsq = kit->dsq, *dst_dsq; + struct scx_sched *sch = src_dsq->sched; struct rq *this_rq, *src_rq, *locked_rq; bool dispatched = false; bool in_balance; unsigned long flags; - if (!scx_kf_allowed_if_unlocked() && - !scx_kf_allowed(sch, SCX_KF_DISPATCH)) + if (!scx_vet_enq_flags(sch, dsq_id, &enq_flags)) return false; /* * If the BPF scheduler keeps calling this function repeatedly, it can * cause similar live-lock conditions as consume_dispatch_q(). */ - if (unlikely(READ_ONCE(scx_aborting))) + if (unlikely(READ_ONCE(sch->aborting))) return false; + if (unlikely(!scx_task_on_sched(sch, p))) { + scx_error(sch, "scx_bpf_dsq_move[_vtime]() on %s[%d] but the task belongs to a different scheduler", + p->comm, p->pid); + return false; + } + /* * Can be called from either ops.dispatch() locking this_rq() or any * context where no rq lock is held. 
If latter, lock @p's task_rq which @@ -6142,20 +8078,14 @@ static bool scx_dsq_move(struct bpf_iter_scx_dsq_kern *kit, locked_rq = src_rq; raw_spin_lock(&src_dsq->lock); - /* - * Did someone else get to it? @p could have already left $src_dsq, got - * re-enqueud, or be in the process of being consumed by someone else. - */ - if (unlikely(p->scx.dsq != src_dsq || - u32_before(kit->cursor.priv, p->scx.dsq_seq) || - p->scx.holding_cpu >= 0) || - WARN_ON_ONCE(src_rq != task_rq(p))) { + /* did someone else get to it while we dropped the locks? */ + if (nldsq_cursor_lost_task(&kit->cursor, src_rq, src_dsq, p)) { raw_spin_unlock(&src_dsq->lock); goto out; } /* @p is still on $src_dsq and stable, determine the destination */ - dst_dsq = find_dsq_for_dispatch(sch, this_rq, dsq_id, p); + dst_dsq = find_dsq_for_dispatch(sch, this_rq, dsq_id, task_cpu(p)); /* * Apply vtime and slice updates before moving so that the new time is @@ -6189,44 +8119,42 @@ __bpf_kfunc_start_defs(); /** * scx_bpf_dispatch_nr_slots - Return the number of remaining dispatch slots + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs * * Can only be called from ops.dispatch(). */ -__bpf_kfunc u32 scx_bpf_dispatch_nr_slots(void) +__bpf_kfunc u32 scx_bpf_dispatch_nr_slots(const struct bpf_prog_aux *aux) { struct scx_sched *sch; guard(rcu)(); - sch = rcu_dereference(scx_root); + sch = scx_prog_sched(aux); if (unlikely(!sch)) return 0; - if (!scx_kf_allowed(sch, SCX_KF_DISPATCH)) - return 0; - - return scx_dsp_max_batch - __this_cpu_read(scx_dsp_ctx->cursor); + return sch->dsp_max_batch - __this_cpu_read(sch->pcpu->dsp_ctx.cursor); } /** * scx_bpf_dispatch_cancel - Cancel the latest dispatch + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs * * Cancel the latest dispatch. Can be called multiple times to cancel further * dispatches. Can only be called from ops.dispatch(). 
*/ -__bpf_kfunc void scx_bpf_dispatch_cancel(void) +__bpf_kfunc void scx_bpf_dispatch_cancel(const struct bpf_prog_aux *aux) { - struct scx_dsp_ctx *dspc = this_cpu_ptr(scx_dsp_ctx); struct scx_sched *sch; + struct scx_dsp_ctx *dspc; guard(rcu)(); - sch = rcu_dereference(scx_root); + sch = scx_prog_sched(aux); if (unlikely(!sch)) return; - if (!scx_kf_allowed(sch, SCX_KF_DISPATCH)) - return; + dspc = &this_cpu_ptr(sch->pcpu)->dsp_ctx; if (dspc->cursor > 0) dspc->cursor--; @@ -6236,10 +8164,21 @@ __bpf_kfunc void scx_bpf_dispatch_cancel(void) /** * scx_bpf_dsq_move_to_local - move a task from a DSQ to the current CPU's local DSQ - * @dsq_id: DSQ to move task from + * @dsq_id: DSQ to move task from. Must be a user-created DSQ + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs + * @enq_flags: %SCX_ENQ_* * * Move a task from the non-local DSQ identified by @dsq_id to the current CPU's - * local DSQ for execution. Can only be called from ops.dispatch(). + * local DSQ for execution with @enq_flags applied. Can only be called from + * ops.dispatch(). + * + * Built-in DSQs (%SCX_DSQ_GLOBAL and %SCX_DSQ_LOCAL*) are not supported as + * sources. Local DSQs support reenqueueing (a task can be picked up for + * execution, dequeued for property changes, or reenqueued), but the BPF + * scheduler cannot directly iterate or move tasks from them. %SCX_DSQ_GLOBAL + * is similar but also doesn't support reenqueueing, as it maps to multiple + * per-node DSQs making the scope difficult to define; this may change in the + * future. * * This function flushes the in-flight dispatches from scx_bpf_dsq_insert() * before trying to move from the specified DSQ. It may also grab rq locks and @@ -6248,21 +8187,24 @@ __bpf_kfunc void scx_bpf_dispatch_cancel(void) * Returns %true if a task has been moved, %false if there isn't any task to * move. 
*/ -__bpf_kfunc bool scx_bpf_dsq_move_to_local(u64 dsq_id) +__bpf_kfunc bool scx_bpf_dsq_move_to_local___v2(u64 dsq_id, u64 enq_flags, + const struct bpf_prog_aux *aux) { - struct scx_dsp_ctx *dspc = this_cpu_ptr(scx_dsp_ctx); struct scx_dispatch_q *dsq; struct scx_sched *sch; + struct scx_dsp_ctx *dspc; guard(rcu)(); - sch = rcu_dereference(scx_root); + sch = scx_prog_sched(aux); if (unlikely(!sch)) return false; - if (!scx_kf_allowed(sch, SCX_KF_DISPATCH)) + if (!scx_vet_enq_flags(sch, SCX_DSQ_LOCAL, &enq_flags)) return false; + dspc = &this_cpu_ptr(sch->pcpu)->dsp_ctx; + flush_dispatch_buf(sch, dspc->rq); dsq = find_user_dsq(sch, dsq_id); @@ -6271,7 +8213,7 @@ __bpf_kfunc bool scx_bpf_dsq_move_to_local(u64 dsq_id) return false; } - if (consume_dispatch_q(sch, dspc->rq, dsq)) { + if (consume_dispatch_q(sch, dspc->rq, dsq, enq_flags)) { /* * A successfully consumed task can be dequeued before it starts * running while the CPU is trying to migrate other dispatched @@ -6285,6 +8227,14 @@ __bpf_kfunc bool scx_bpf_dsq_move_to_local(u64 dsq_id) } } +/* + * COMPAT: ___v2 was introduced in v7.1. Remove this and ___v2 tag in the future. + */ +__bpf_kfunc bool scx_bpf_dsq_move_to_local(u64 dsq_id, const struct bpf_prog_aux *aux) +{ + return scx_bpf_dsq_move_to_local___v2(dsq_id, 0, aux); +} + /** * scx_bpf_dsq_move_set_slice - Override slice when moving between DSQs * @it__iter: DSQ iterator in progress @@ -6380,105 +8330,104 @@ __bpf_kfunc bool scx_bpf_dsq_move_vtime(struct bpf_iter_scx_dsq *it__iter, p, dsq_id, enq_flags | SCX_ENQ_DSQ_PRIQ); } +#ifdef CONFIG_EXT_SUB_SCHED +/** + * scx_bpf_sub_dispatch - Trigger dispatching on a child scheduler + * @cgroup_id: cgroup ID of the child scheduler to dispatch + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs + * + * Allows a parent scheduler to trigger dispatching on one of its direct + * child schedulers. 
The child scheduler runs its dispatch operation to + * move tasks from dispatch queues to the local runqueue. + * + * Returns: true on success, false if cgroup_id is invalid, not a direct + * child, or caller lacks dispatch permission. + */ +__bpf_kfunc bool scx_bpf_sub_dispatch(u64 cgroup_id, const struct bpf_prog_aux *aux) +{ + struct rq *this_rq = this_rq(); + struct scx_sched *parent, *child; + + guard(rcu)(); + parent = scx_prog_sched(aux); + if (unlikely(!parent)) + return false; + + child = scx_find_sub_sched(cgroup_id); + + if (unlikely(!child)) + return false; + + if (unlikely(scx_parent(child) != parent)) { + scx_error(parent, "trying to dispatch a distant sub-sched on cgroup %llu", + cgroup_id); + return false; + } + + return scx_dispatch_sched(child, this_rq, this_rq->scx.sub_dispatch_prev, + true); +} +#endif /* CONFIG_EXT_SUB_SCHED */ + __bpf_kfunc_end_defs(); BTF_KFUNCS_START(scx_kfunc_ids_dispatch) -BTF_ID_FLAGS(func, scx_bpf_dispatch_nr_slots) -BTF_ID_FLAGS(func, scx_bpf_dispatch_cancel) -BTF_ID_FLAGS(func, scx_bpf_dsq_move_to_local) +BTF_ID_FLAGS(func, scx_bpf_dispatch_nr_slots, KF_IMPLICIT_ARGS) +BTF_ID_FLAGS(func, scx_bpf_dispatch_cancel, KF_IMPLICIT_ARGS) +BTF_ID_FLAGS(func, scx_bpf_dsq_move_to_local, KF_IMPLICIT_ARGS) +BTF_ID_FLAGS(func, scx_bpf_dsq_move_to_local___v2, KF_IMPLICIT_ARGS) +/* scx_bpf_dsq_move*() also in scx_kfunc_ids_unlocked: callable from unlocked contexts */ BTF_ID_FLAGS(func, scx_bpf_dsq_move_set_slice, KF_RCU) BTF_ID_FLAGS(func, scx_bpf_dsq_move_set_vtime, KF_RCU) BTF_ID_FLAGS(func, scx_bpf_dsq_move, KF_RCU) BTF_ID_FLAGS(func, scx_bpf_dsq_move_vtime, KF_RCU) +#ifdef CONFIG_EXT_SUB_SCHED +BTF_ID_FLAGS(func, scx_bpf_sub_dispatch, KF_IMPLICIT_ARGS) +#endif BTF_KFUNCS_END(scx_kfunc_ids_dispatch) static const struct btf_kfunc_id_set scx_kfunc_set_dispatch = { .owner = THIS_MODULE, .set = &scx_kfunc_ids_dispatch, + .filter = scx_kfunc_context_filter, }; -static u32 reenq_local(struct rq *rq) -{ - LIST_HEAD(tasks); - u32 
nr_enqueued = 0; - struct task_struct *p, *n; - - lockdep_assert_rq_held(rq); - - /* - * The BPF scheduler may choose to dispatch tasks back to - * @rq->scx.local_dsq. Move all candidate tasks off to a private list - * first to avoid processing the same tasks repeatedly. - */ - list_for_each_entry_safe(p, n, &rq->scx.local_dsq.list, - scx.dsq_list.node) { - /* - * If @p is being migrated, @p's current CPU may not agree with - * its allowed CPUs and the migration_cpu_stop is about to - * deactivate and re-activate @p anyway. Skip re-enqueueing. - * - * While racing sched property changes may also dequeue and - * re-enqueue a migrating task while its current CPU and allowed - * CPUs disagree, they use %ENQUEUE_RESTORE which is bypassed to - * the current local DSQ for running tasks and thus are not - * visible to the BPF scheduler. - */ - if (p->migration_pending) - continue; - - dispatch_dequeue(rq, p); - list_add_tail(&p->scx.dsq_list.node, &tasks); - } - - list_for_each_entry_safe(p, n, &tasks, scx.dsq_list.node) { - list_del_init(&p->scx.dsq_list.node); - do_enqueue_task(rq, p, SCX_ENQ_REENQ, -1); - nr_enqueued++; - } - - return nr_enqueued; -} - __bpf_kfunc_start_defs(); /** * scx_bpf_reenqueue_local - Re-enqueue tasks on a local DSQ + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs * * Iterate over all of the tasks currently enqueued on the local DSQ of the * caller's CPU, and re-enqueue them in the BPF scheduler. Returns the number of * processed tasks. Can only be called from ops.cpu_release(). - * - * COMPAT: Will be removed in v6.23 along with the ___v2 suffix on the void - * returning variant that can be called from anywhere. 
*/ -__bpf_kfunc u32 scx_bpf_reenqueue_local(void) +__bpf_kfunc u32 scx_bpf_reenqueue_local(const struct bpf_prog_aux *aux) { struct scx_sched *sch; struct rq *rq; guard(rcu)(); - sch = rcu_dereference(scx_root); + sch = scx_prog_sched(aux); if (unlikely(!sch)) return 0; - if (!scx_kf_allowed(sch, SCX_KF_CPU_RELEASE)) - return 0; - rq = cpu_rq(smp_processor_id()); lockdep_assert_rq_held(rq); - return reenq_local(rq); + return reenq_local(sch, rq, SCX_REENQ_ANY); } __bpf_kfunc_end_defs(); BTF_KFUNCS_START(scx_kfunc_ids_cpu_release) -BTF_ID_FLAGS(func, scx_bpf_reenqueue_local) +BTF_ID_FLAGS(func, scx_bpf_reenqueue_local, KF_IMPLICIT_ARGS) BTF_KFUNCS_END(scx_kfunc_ids_cpu_release) static const struct btf_kfunc_id_set scx_kfunc_set_cpu_release = { .owner = THIS_MODULE, .set = &scx_kfunc_ids_cpu_release, + .filter = scx_kfunc_context_filter, }; __bpf_kfunc_start_defs(); @@ -6487,11 +8436,12 @@ __bpf_kfunc_start_defs(); * scx_bpf_create_dsq - Create a custom DSQ * @dsq_id: DSQ to create * @node: NUMA node to allocate from + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs * * Create a custom DSQ identified by @dsq_id. Can be called from any sleepable * scx callback, and any BPF_PROG_TYPE_SYSCALL prog. */ -__bpf_kfunc s32 scx_bpf_create_dsq(u64 dsq_id, s32 node) +__bpf_kfunc s32 scx_bpf_create_dsq(u64 dsq_id, s32 node, const struct bpf_prog_aux *aux) { struct scx_dispatch_q *dsq; struct scx_sched *sch; @@ -6508,36 +8458,54 @@ __bpf_kfunc s32 scx_bpf_create_dsq(u64 dsq_id, s32 node) if (!dsq) return -ENOMEM; - init_dsq(dsq, dsq_id); + /* + * init_dsq() must be called in GFP_KERNEL context. Init it with NULL + * @sch and update afterwards. 
+ */ + ret = init_dsq(dsq, dsq_id, NULL); + if (ret) { + kfree(dsq); + return ret; + } rcu_read_lock(); - sch = rcu_dereference(scx_root); - if (sch) + sch = scx_prog_sched(aux); + if (sch) { + dsq->sched = sch; ret = rhashtable_lookup_insert_fast(&sch->dsq_hash, &dsq->hash_node, dsq_hash_params); - else + } else { ret = -ENODEV; + } rcu_read_unlock(); - if (ret) + if (ret) { + exit_dsq(dsq); kfree(dsq); + } return ret; } __bpf_kfunc_end_defs(); BTF_KFUNCS_START(scx_kfunc_ids_unlocked) -BTF_ID_FLAGS(func, scx_bpf_create_dsq, KF_SLEEPABLE) +BTF_ID_FLAGS(func, scx_bpf_create_dsq, KF_IMPLICIT_ARGS | KF_SLEEPABLE) +/* also in scx_kfunc_ids_dispatch: also callable from ops.dispatch() */ BTF_ID_FLAGS(func, scx_bpf_dsq_move_set_slice, KF_RCU) BTF_ID_FLAGS(func, scx_bpf_dsq_move_set_vtime, KF_RCU) BTF_ID_FLAGS(func, scx_bpf_dsq_move, KF_RCU) BTF_ID_FLAGS(func, scx_bpf_dsq_move_vtime, KF_RCU) +/* also in scx_kfunc_ids_select_cpu: also callable from ops.select_cpu()/ops.enqueue() */ +BTF_ID_FLAGS(func, __scx_bpf_select_cpu_and, KF_IMPLICIT_ARGS | KF_RCU) +BTF_ID_FLAGS(func, scx_bpf_select_cpu_and, KF_RCU) +BTF_ID_FLAGS(func, scx_bpf_select_cpu_dfl, KF_IMPLICIT_ARGS | KF_RCU) BTF_KFUNCS_END(scx_kfunc_ids_unlocked) static const struct btf_kfunc_id_set scx_kfunc_set_unlocked = { .owner = THIS_MODULE, .set = &scx_kfunc_ids_unlocked, + .filter = scx_kfunc_context_filter, }; __bpf_kfunc_start_defs(); @@ -6546,12 +8514,21 @@ __bpf_kfunc_start_defs(); * scx_bpf_task_set_slice - Set task's time slice * @p: task of interest * @slice: time slice to set in nsecs + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs * * Set @p's time slice to @slice. Returns %true on success, %false if the * calling scheduler doesn't have authority over @p. 
*/ -__bpf_kfunc bool scx_bpf_task_set_slice(struct task_struct *p, u64 slice) +__bpf_kfunc bool scx_bpf_task_set_slice(struct task_struct *p, u64 slice, + const struct bpf_prog_aux *aux) { + struct scx_sched *sch; + + guard(rcu)(); + sch = scx_prog_sched(aux); + if (unlikely(!scx_task_on_sched(sch, p))) + return false; + p->scx.slice = slice; return true; } @@ -6560,12 +8537,21 @@ __bpf_kfunc bool scx_bpf_task_set_slice(struct task_struct *p, u64 slice) * scx_bpf_task_set_dsq_vtime - Set task's virtual time for DSQ ordering * @p: task of interest * @vtime: virtual time to set + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs * * Set @p's virtual time to @vtime. Returns %true on success, %false if the * calling scheduler doesn't have authority over @p. */ -__bpf_kfunc bool scx_bpf_task_set_dsq_vtime(struct task_struct *p, u64 vtime) +__bpf_kfunc bool scx_bpf_task_set_dsq_vtime(struct task_struct *p, u64 vtime, + const struct bpf_prog_aux *aux) { + struct scx_sched *sch; + + guard(rcu)(); + sch = scx_prog_sched(aux); + if (unlikely(!scx_task_on_sched(sch, p))) + return false; + p->scx.dsq_vtime = vtime; return true; } @@ -6587,7 +8573,7 @@ static void scx_kick_cpu(struct scx_sched *sch, s32 cpu, u64 flags) * lead to irq_work_queue() malfunction such as infinite busy wait for * IRQ status update. Suppress kicking. */ - if (scx_rq_bypassing(this_rq)) + if (scx_bypassing(sch, cpu_of(this_rq))) goto out; /* @@ -6627,18 +8613,19 @@ out: * scx_bpf_kick_cpu - Trigger reschedule on a CPU * @cpu: cpu to kick * @flags: %SCX_KICK_* flags + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs * * Kick @cpu into rescheduling. This can be used to wake up an idle CPU or * trigger rescheduling on a busy CPU. This can be called from any online * scx_ops operation and the actual kicking is performed asynchronously through * an irq work. 
*/ -__bpf_kfunc void scx_bpf_kick_cpu(s32 cpu, u64 flags) +__bpf_kfunc void scx_bpf_kick_cpu(s32 cpu, u64 flags, const struct bpf_prog_aux *aux) { struct scx_sched *sch; guard(rcu)(); - sch = rcu_dereference(scx_root); + sch = scx_prog_sched(aux); if (likely(sch)) scx_kick_cpu(sch, cpu, flags); } @@ -6712,13 +8699,14 @@ __bpf_kfunc void scx_bpf_destroy_dsq(u64 dsq_id) * @it: iterator to initialize * @dsq_id: DSQ to iterate * @flags: %SCX_DSQ_ITER_* + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs * * Initialize BPF iterator @it which can be used with bpf_for_each() to walk * tasks in the DSQ specified by @dsq_id. Iteration using @it only includes * tasks which are already queued when this function is invoked. */ __bpf_kfunc int bpf_iter_scx_dsq_new(struct bpf_iter_scx_dsq *it, u64 dsq_id, - u64 flags) + u64 flags, const struct bpf_prog_aux *aux) { struct bpf_iter_scx_dsq_kern *kit = (void *)it; struct scx_sched *sch; @@ -6736,7 +8724,7 @@ __bpf_kfunc int bpf_iter_scx_dsq_new(struct bpf_iter_scx_dsq *it, u64 dsq_id, */ kit->dsq = NULL; - sch = rcu_dereference_check(scx_root, rcu_read_lock_bh_held()); + sch = scx_prog_sched(aux); if (unlikely(!sch)) return -ENODEV; @@ -6747,8 +8735,7 @@ __bpf_kfunc int bpf_iter_scx_dsq_new(struct bpf_iter_scx_dsq *it, u64 dsq_id, if (!kit->dsq) return -ENOENT; - kit->cursor = INIT_DSQ_LIST_CURSOR(kit->cursor, flags, - READ_ONCE(kit->dsq->seq)); + kit->cursor = INIT_DSQ_LIST_CURSOR(kit->cursor, kit->dsq, flags); return 0; } @@ -6762,41 +8749,13 @@ __bpf_kfunc int bpf_iter_scx_dsq_new(struct bpf_iter_scx_dsq *it, u64 dsq_id, __bpf_kfunc struct task_struct *bpf_iter_scx_dsq_next(struct bpf_iter_scx_dsq *it) { struct bpf_iter_scx_dsq_kern *kit = (void *)it; - bool rev = kit->cursor.flags & SCX_DSQ_ITER_REV; - struct task_struct *p; - unsigned long flags; if (!kit->dsq) return NULL; - raw_spin_lock_irqsave(&kit->dsq->lock, flags); + guard(raw_spinlock_irqsave)(&kit->dsq->lock); - if 
(list_empty(&kit->cursor.node)) - p = NULL; - else - p = container_of(&kit->cursor, struct task_struct, scx.dsq_list); - - /* - * Only tasks which were queued before the iteration started are - * visible. This bounds BPF iterations and guarantees that vtime never - * jumps in the other direction while iterating. - */ - do { - p = nldsq_next_task(kit->dsq, p, rev); - } while (p && unlikely(u32_before(kit->cursor.priv, p->scx.dsq_seq))); - - if (p) { - if (rev) - list_move_tail(&kit->cursor.node, &p->scx.dsq_list.node); - else - list_move(&kit->cursor.node, &p->scx.dsq_list.node); - } else { - list_del_init(&kit->cursor.node); - } - - raw_spin_unlock_irqrestore(&kit->dsq->lock, flags); - - return p; + return nldsq_cursor_next_task(&kit->cursor, kit->dsq); } /** @@ -6825,6 +8784,7 @@ __bpf_kfunc void bpf_iter_scx_dsq_destroy(struct bpf_iter_scx_dsq *it) /** * scx_bpf_dsq_peek - Lockless peek at the first element. * @dsq_id: DSQ to examine. + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs * * Read the first element in the DSQ. This is semantically equivalent to using * the DSQ iterator, but is lockfree. Of course, like any lockless operation, @@ -6833,12 +8793,13 @@ __bpf_kfunc void bpf_iter_scx_dsq_destroy(struct bpf_iter_scx_dsq *it) * * Returns the pointer, or NULL indicates an empty queue OR internal error. 
*/ -__bpf_kfunc struct task_struct *scx_bpf_dsq_peek(u64 dsq_id) +__bpf_kfunc struct task_struct *scx_bpf_dsq_peek(u64 dsq_id, + const struct bpf_prog_aux *aux) { struct scx_sched *sch; struct scx_dispatch_q *dsq; - sch = rcu_dereference(scx_root); + sch = scx_prog_sched(aux); if (unlikely(!sch)) return NULL; @@ -6856,6 +8817,62 @@ __bpf_kfunc struct task_struct *scx_bpf_dsq_peek(u64 dsq_id) return rcu_dereference(dsq->first_task); } +/** + * scx_bpf_dsq_reenq - Re-enqueue tasks on a DSQ + * @dsq_id: DSQ to re-enqueue + * @reenq_flags: %SCX_REENQ_* + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs + * + * Iterate over all of the tasks currently enqueued on the DSQ identified by + * @dsq_id, and re-enqueue them in the BPF scheduler. The following DSQs are + * supported: + * + * - Local DSQs (%SCX_DSQ_LOCAL or %SCX_DSQ_LOCAL_ON | $cpu) + * - User DSQs + * + * Re-enqueues are performed asynchronously. Can be called from anywhere. + */ +__bpf_kfunc void scx_bpf_dsq_reenq(u64 dsq_id, u64 reenq_flags, + const struct bpf_prog_aux *aux) +{ + struct scx_sched *sch; + struct scx_dispatch_q *dsq; + + guard(preempt)(); + + sch = scx_prog_sched(aux); + if (unlikely(!sch)) + return; + + if (unlikely(reenq_flags & ~__SCX_REENQ_USER_MASK)) { + scx_error(sch, "invalid SCX_REENQ flags 0x%llx", reenq_flags); + return; + } + + /* not specifying any filter bits is the same as %SCX_REENQ_ANY */ + if (!(reenq_flags & __SCX_REENQ_FILTER_MASK)) + reenq_flags |= SCX_REENQ_ANY; + + dsq = find_dsq_for_dispatch(sch, this_rq(), dsq_id, smp_processor_id()); + schedule_dsq_reenq(sch, dsq, reenq_flags, scx_locked_rq()); +} + +/** + * scx_bpf_reenqueue_local - Re-enqueue tasks on a local DSQ + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs + * + * Iterate over all of the tasks currently enqueued on the local DSQ of the + * caller's CPU, and re-enqueue them in the BPF scheduler. Can be called from + * anywhere.
+ * + * This is now a special case of scx_bpf_dsq_reenq() and may be removed in the + * future. + */ +__bpf_kfunc void scx_bpf_reenqueue_local___v2(const struct bpf_prog_aux *aux) +{ + scx_bpf_dsq_reenq(SCX_DSQ_LOCAL, 0, aux); +} + __bpf_kfunc_end_defs(); static s32 __bstr_format(struct scx_sched *sch, u64 *data_buf, char *line_buf, @@ -6910,18 +8927,20 @@ __bpf_kfunc_start_defs(); * @fmt: error message format string * @data: format string parameters packaged using ___bpf_fill() macro * @data__sz: @data len, must end in '__sz' for the verifier + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs * * Indicate that the BPF scheduler wants to exit gracefully, and initiate ops * disabling. */ __bpf_kfunc void scx_bpf_exit_bstr(s64 exit_code, char *fmt, - unsigned long long *data, u32 data__sz) + unsigned long long *data, u32 data__sz, + const struct bpf_prog_aux *aux) { struct scx_sched *sch; unsigned long flags; raw_spin_lock_irqsave(&scx_exit_bstr_buf_lock, flags); - sch = rcu_dereference_bh(scx_root); + sch = scx_prog_sched(aux); if (likely(sch) && bstr_format(sch, &scx_exit_bstr_buf, fmt, data, data__sz) >= 0) scx_exit(sch, SCX_EXIT_UNREG_BPF, exit_code, "%s", scx_exit_bstr_buf.line); @@ -6933,18 +8952,19 @@ __bpf_kfunc void scx_bpf_exit_bstr(s64 exit_code, char *fmt, * @fmt: error message format string * @data: format string parameters packaged using ___bpf_fill() macro * @data__sz: @data len, must end in '__sz' for the verifier + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs * * Indicate that the BPF scheduler encountered a fatal error and initiate ops * disabling. 
*/ __bpf_kfunc void scx_bpf_error_bstr(char *fmt, unsigned long long *data, - u32 data__sz) + u32 data__sz, const struct bpf_prog_aux *aux) { struct scx_sched *sch; unsigned long flags; raw_spin_lock_irqsave(&scx_exit_bstr_buf_lock, flags); - sch = rcu_dereference_bh(scx_root); + sch = scx_prog_sched(aux); if (likely(sch) && bstr_format(sch, &scx_exit_bstr_buf, fmt, data, data__sz) >= 0) scx_exit(sch, SCX_EXIT_ERROR_BPF, 0, "%s", scx_exit_bstr_buf.line); @@ -6956,6 +8976,7 @@ __bpf_kfunc void scx_bpf_error_bstr(char *fmt, unsigned long long *data, * @fmt: format string * @data: format string parameters packaged using ___bpf_fill() macro * @data__sz: @data len, must end in '__sz' for the verifier + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs * * To be called through scx_bpf_dump() helper from ops.dump(), dump_cpu() and * dump_task() to generate extra debug dump specific to the BPF scheduler. @@ -6964,7 +8985,7 @@ __bpf_kfunc void scx_bpf_error_bstr(char *fmt, unsigned long long *data, * multiple calls. The last line is automatically terminated. */ __bpf_kfunc void scx_bpf_dump_bstr(char *fmt, unsigned long long *data, - u32 data__sz) + u32 data__sz, const struct bpf_prog_aux *aux) { struct scx_sched *sch; struct scx_dump_data *dd = &scx_dump_data; @@ -6973,7 +8994,7 @@ __bpf_kfunc void scx_bpf_dump_bstr(char *fmt, unsigned long long *data, guard(rcu)(); - sch = rcu_dereference(scx_root); + sch = scx_prog_sched(aux); if (unlikely(!sch)) return; @@ -7009,39 +9030,22 @@ __bpf_kfunc void scx_bpf_dump_bstr(char *fmt, unsigned long long *data, ops_dump_flush(); } -/** - * scx_bpf_reenqueue_local - Re-enqueue tasks on a local DSQ - * - * Iterate over all of the tasks currently enqueued on the local DSQ of the - * caller's CPU, and re-enqueue them in the BPF scheduler. Can be called from - * anywhere. 
- */ -__bpf_kfunc void scx_bpf_reenqueue_local___v2(void) -{ - struct rq *rq; - - guard(preempt)(); - - rq = this_rq(); - local_set(&rq->scx.reenq_local_deferred, 1); - schedule_deferred(rq); -} - /** * scx_bpf_cpuperf_cap - Query the maximum relative capacity of a CPU * @cpu: CPU of interest + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs * * Return the maximum relative capacity of @cpu in relation to the most * performant CPU in the system. The return value is in the range [1, * %SCX_CPUPERF_ONE]. See scx_bpf_cpuperf_cur(). */ -__bpf_kfunc u32 scx_bpf_cpuperf_cap(s32 cpu) +__bpf_kfunc u32 scx_bpf_cpuperf_cap(s32 cpu, const struct bpf_prog_aux *aux) { struct scx_sched *sch; guard(rcu)(); - sch = rcu_dereference(scx_root); + sch = scx_prog_sched(aux); if (likely(sch) && ops_cpu_valid(sch, cpu, NULL)) return arch_scale_cpu_capacity(cpu); else @@ -7051,6 +9055,7 @@ __bpf_kfunc u32 scx_bpf_cpuperf_cap(s32 cpu) /** * scx_bpf_cpuperf_cur - Query the current relative performance of a CPU * @cpu: CPU of interest + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs * * Return the current relative performance of @cpu in relation to its maximum. * The return value is in the range [1, %SCX_CPUPERF_ONE]. @@ -7062,13 +9067,13 @@ __bpf_kfunc u32 scx_bpf_cpuperf_cap(s32 cpu) * * The result is in the range [1, %SCX_CPUPERF_ONE]. 
*/ -__bpf_kfunc u32 scx_bpf_cpuperf_cur(s32 cpu) +__bpf_kfunc u32 scx_bpf_cpuperf_cur(s32 cpu, const struct bpf_prog_aux *aux) { struct scx_sched *sch; guard(rcu)(); - sch = rcu_dereference(scx_root); + sch = scx_prog_sched(aux); if (likely(sch) && ops_cpu_valid(sch, cpu, NULL)) return arch_scale_freq_capacity(cpu); else @@ -7079,6 +9084,7 @@ __bpf_kfunc u32 scx_bpf_cpuperf_cur(s32 cpu) * scx_bpf_cpuperf_set - Set the relative performance target of a CPU * @cpu: CPU of interest * @perf: target performance level [0, %SCX_CPUPERF_ONE] + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs * * Set the target performance level of @cpu to @perf. @perf is in linear * relative scale between 0 and %SCX_CPUPERF_ONE. This determines how the @@ -7089,13 +9095,13 @@ __bpf_kfunc u32 scx_bpf_cpuperf_cur(s32 cpu) * use. Consult hardware and cpufreq documentation for more information. The * current performance level can be monitored using scx_bpf_cpuperf_cur(). */ -__bpf_kfunc void scx_bpf_cpuperf_set(s32 cpu, u32 perf) +__bpf_kfunc void scx_bpf_cpuperf_set(s32 cpu, u32 perf, const struct bpf_prog_aux *aux) { struct scx_sched *sch; guard(rcu)(); - sch = rcu_dereference(scx_root); + sch = scx_prog_sched(aux); if (unlikely(!sch)) return; @@ -7205,14 +9211,15 @@ __bpf_kfunc s32 scx_bpf_task_cpu(const struct task_struct *p) /** * scx_bpf_cpu_rq - Fetch the rq of a CPU * @cpu: CPU of the rq + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs */ -__bpf_kfunc struct rq *scx_bpf_cpu_rq(s32 cpu) +__bpf_kfunc struct rq *scx_bpf_cpu_rq(s32 cpu, const struct bpf_prog_aux *aux) { struct scx_sched *sch; guard(rcu)(); - sch = rcu_dereference(scx_root); + sch = scx_prog_sched(aux); if (unlikely(!sch)) return NULL; @@ -7231,18 +9238,19 @@ __bpf_kfunc struct rq *scx_bpf_cpu_rq(s32 cpu) /** * scx_bpf_locked_rq - Return the rq currently locked by SCX + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs * * Returns the rq if a rq 
lock is currently held by SCX. * Otherwise emits an error and returns NULL. */ -__bpf_kfunc struct rq *scx_bpf_locked_rq(void) +__bpf_kfunc struct rq *scx_bpf_locked_rq(const struct bpf_prog_aux *aux) { struct scx_sched *sch; struct rq *rq; guard(preempt)(); - sch = rcu_dereference_sched(scx_root); + sch = scx_prog_sched(aux); if (unlikely(!sch)) return NULL; @@ -7258,16 +9266,17 @@ __bpf_kfunc struct rq *scx_bpf_locked_rq(void) /** * scx_bpf_cpu_curr - Return remote CPU's curr task * @cpu: CPU of interest + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs * * Callers must hold RCU read lock (KF_RCU). */ -__bpf_kfunc struct task_struct *scx_bpf_cpu_curr(s32 cpu) +__bpf_kfunc struct task_struct *scx_bpf_cpu_curr(s32 cpu, const struct bpf_prog_aux *aux) { struct scx_sched *sch; guard(rcu)(); - sch = rcu_dereference(scx_root); + sch = scx_prog_sched(aux); if (unlikely(!sch)) return NULL; @@ -7277,41 +9286,6 @@ __bpf_kfunc struct task_struct *scx_bpf_cpu_curr(s32 cpu) return rcu_dereference(cpu_rq(cpu)->curr); } -/** - * scx_bpf_task_cgroup - Return the sched cgroup of a task - * @p: task of interest - * - * @p->sched_task_group->css.cgroup represents the cgroup @p is associated with - * from the scheduler's POV. SCX operations should use this function to - * determine @p's current cgroup as, unlike following @p->cgroups, - * @p->sched_task_group is protected by @p's rq lock and thus atomic w.r.t. all - * rq-locked operations. Can be called on the parameter tasks of rq-locked - * operations. The restriction guarantees that @p's rq is locked by the caller. 
- */ -#ifdef CONFIG_CGROUP_SCHED -__bpf_kfunc struct cgroup *scx_bpf_task_cgroup(struct task_struct *p) -{ - struct task_group *tg = p->sched_task_group; - struct cgroup *cgrp = &cgrp_dfl_root.cgrp; - struct scx_sched *sch; - - guard(rcu)(); - - sch = rcu_dereference(scx_root); - if (unlikely(!sch)) - goto out; - - if (!scx_kf_allowed_on_arg_tasks(sch, __SCX_KF_RQ_LOCKED, p)) - goto out; - - cgrp = tg_cgrp(tg); - -out: - cgroup_get(cgrp); - return cgrp; -} -#endif - /** * scx_bpf_now - Returns a high-performance monotonically non-decreasing * clock for the current CPU. The clock returned is in nanoseconds. @@ -7388,10 +9362,14 @@ static void scx_read_events(struct scx_sched *sch, struct scx_event_stats *event scx_agg_event(events, e_cpu, SCX_EV_DISPATCH_KEEP_LAST); scx_agg_event(events, e_cpu, SCX_EV_ENQ_SKIP_EXITING); scx_agg_event(events, e_cpu, SCX_EV_ENQ_SKIP_MIGRATION_DISABLED); + scx_agg_event(events, e_cpu, SCX_EV_REENQ_IMMED); + scx_agg_event(events, e_cpu, SCX_EV_REENQ_LOCAL_REPEAT); scx_agg_event(events, e_cpu, SCX_EV_REFILL_SLICE_DFL); scx_agg_event(events, e_cpu, SCX_EV_BYPASS_DURATION); scx_agg_event(events, e_cpu, SCX_EV_BYPASS_DISPATCH); scx_agg_event(events, e_cpu, SCX_EV_BYPASS_ACTIVATE); + scx_agg_event(events, e_cpu, SCX_EV_INSERT_NOT_OWNED); + scx_agg_event(events, e_cpu, SCX_EV_SUB_BYPASS_DISPATCH); } } @@ -7425,25 +9403,62 @@ __bpf_kfunc void scx_bpf_events(struct scx_event_stats *events, memcpy(events, &e_sys, events__sz); } +#ifdef CONFIG_CGROUP_SCHED +/** + * scx_bpf_task_cgroup - Return the sched cgroup of a task + * @p: task of interest + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs + * + * @p->sched_task_group->css.cgroup represents the cgroup @p is associated with + * from the scheduler's POV. SCX operations should use this function to + * determine @p's current cgroup as, unlike following @p->cgroups, + * @p->sched_task_group is stable for the duration of the SCX op. See + * SCX_CALL_OP_TASK() for details. 
+ */ +__bpf_kfunc struct cgroup *scx_bpf_task_cgroup(struct task_struct *p, + const struct bpf_prog_aux *aux) +{ + struct task_group *tg = p->sched_task_group; + struct cgroup *cgrp = &cgrp_dfl_root.cgrp; + struct scx_sched *sch; + + guard(rcu)(); + + sch = scx_prog_sched(aux); + if (unlikely(!sch)) + goto out; + + if (!scx_kf_arg_task_ok(sch, p)) + goto out; + + cgrp = tg_cgrp(tg); + +out: + cgroup_get(cgrp); + return cgrp; +} +#endif /* CONFIG_CGROUP_SCHED */ + __bpf_kfunc_end_defs(); BTF_KFUNCS_START(scx_kfunc_ids_any) -BTF_ID_FLAGS(func, scx_bpf_task_set_slice, KF_RCU); -BTF_ID_FLAGS(func, scx_bpf_task_set_dsq_vtime, KF_RCU); -BTF_ID_FLAGS(func, scx_bpf_kick_cpu) +BTF_ID_FLAGS(func, scx_bpf_task_set_slice, KF_IMPLICIT_ARGS | KF_RCU); +BTF_ID_FLAGS(func, scx_bpf_task_set_dsq_vtime, KF_IMPLICIT_ARGS | KF_RCU); +BTF_ID_FLAGS(func, scx_bpf_kick_cpu, KF_IMPLICIT_ARGS) BTF_ID_FLAGS(func, scx_bpf_dsq_nr_queued) BTF_ID_FLAGS(func, scx_bpf_destroy_dsq) -BTF_ID_FLAGS(func, scx_bpf_dsq_peek, KF_RCU_PROTECTED | KF_RET_NULL) -BTF_ID_FLAGS(func, bpf_iter_scx_dsq_new, KF_ITER_NEW | KF_RCU_PROTECTED) +BTF_ID_FLAGS(func, scx_bpf_dsq_peek, KF_IMPLICIT_ARGS | KF_RCU_PROTECTED | KF_RET_NULL) +BTF_ID_FLAGS(func, scx_bpf_dsq_reenq, KF_IMPLICIT_ARGS) +BTF_ID_FLAGS(func, scx_bpf_reenqueue_local___v2, KF_IMPLICIT_ARGS) +BTF_ID_FLAGS(func, bpf_iter_scx_dsq_new, KF_IMPLICIT_ARGS | KF_ITER_NEW | KF_RCU_PROTECTED) BTF_ID_FLAGS(func, bpf_iter_scx_dsq_next, KF_ITER_NEXT | KF_RET_NULL) BTF_ID_FLAGS(func, bpf_iter_scx_dsq_destroy, KF_ITER_DESTROY) -BTF_ID_FLAGS(func, scx_bpf_exit_bstr) -BTF_ID_FLAGS(func, scx_bpf_error_bstr) -BTF_ID_FLAGS(func, scx_bpf_dump_bstr) -BTF_ID_FLAGS(func, scx_bpf_reenqueue_local___v2) -BTF_ID_FLAGS(func, scx_bpf_cpuperf_cap) -BTF_ID_FLAGS(func, scx_bpf_cpuperf_cur) -BTF_ID_FLAGS(func, scx_bpf_cpuperf_set) +BTF_ID_FLAGS(func, scx_bpf_exit_bstr, KF_IMPLICIT_ARGS) +BTF_ID_FLAGS(func, scx_bpf_error_bstr, KF_IMPLICIT_ARGS) +BTF_ID_FLAGS(func, scx_bpf_dump_bstr, 
KF_IMPLICIT_ARGS) +BTF_ID_FLAGS(func, scx_bpf_cpuperf_cap, KF_IMPLICIT_ARGS) +BTF_ID_FLAGS(func, scx_bpf_cpuperf_cur, KF_IMPLICIT_ARGS) +BTF_ID_FLAGS(func, scx_bpf_cpuperf_set, KF_IMPLICIT_ARGS) BTF_ID_FLAGS(func, scx_bpf_nr_node_ids) BTF_ID_FLAGS(func, scx_bpf_nr_cpu_ids) BTF_ID_FLAGS(func, scx_bpf_get_possible_cpumask, KF_ACQUIRE) @@ -7451,14 +9466,14 @@ BTF_ID_FLAGS(func, scx_bpf_get_online_cpumask, KF_ACQUIRE) BTF_ID_FLAGS(func, scx_bpf_put_cpumask, KF_RELEASE) BTF_ID_FLAGS(func, scx_bpf_task_running, KF_RCU) BTF_ID_FLAGS(func, scx_bpf_task_cpu, KF_RCU) -BTF_ID_FLAGS(func, scx_bpf_cpu_rq) -BTF_ID_FLAGS(func, scx_bpf_locked_rq, KF_RET_NULL) -BTF_ID_FLAGS(func, scx_bpf_cpu_curr, KF_RET_NULL | KF_RCU_PROTECTED) -#ifdef CONFIG_CGROUP_SCHED -BTF_ID_FLAGS(func, scx_bpf_task_cgroup, KF_RCU | KF_ACQUIRE) -#endif +BTF_ID_FLAGS(func, scx_bpf_cpu_rq, KF_IMPLICIT_ARGS) +BTF_ID_FLAGS(func, scx_bpf_locked_rq, KF_IMPLICIT_ARGS | KF_RET_NULL) +BTF_ID_FLAGS(func, scx_bpf_cpu_curr, KF_IMPLICIT_ARGS | KF_RET_NULL | KF_RCU_PROTECTED) BTF_ID_FLAGS(func, scx_bpf_now) BTF_ID_FLAGS(func, scx_bpf_events) +#ifdef CONFIG_CGROUP_SCHED +BTF_ID_FLAGS(func, scx_bpf_task_cgroup, KF_IMPLICIT_ARGS | KF_RCU | KF_ACQUIRE) +#endif BTF_KFUNCS_END(scx_kfunc_ids_any) static const struct btf_kfunc_id_set scx_kfunc_set_any = { @@ -7466,6 +9481,115 @@ static const struct btf_kfunc_id_set scx_kfunc_set_any = { .set = &scx_kfunc_ids_any, }; +/* + * Per-op kfunc allow flags. Each bit corresponds to a context-sensitive kfunc + * group; an op may permit zero or more groups, with the union expressed in + * scx_kf_allow_flags[]. The verifier-time filter (scx_kfunc_context_filter()) + * consults this table to decide whether a context-sensitive kfunc is callable + * from a given SCX op. 
+ */ +enum scx_kf_allow_flags { + SCX_KF_ALLOW_UNLOCKED = 1 << 0, + SCX_KF_ALLOW_CPU_RELEASE = 1 << 1, + SCX_KF_ALLOW_DISPATCH = 1 << 2, + SCX_KF_ALLOW_ENQUEUE = 1 << 3, + SCX_KF_ALLOW_SELECT_CPU = 1 << 4, +}; + +/* + * Map each SCX op to the union of kfunc groups it permits, indexed by + * SCX_OP_IDX(op). Ops not listed only permit kfuncs that are not + * context-sensitive. + */ +static const u32 scx_kf_allow_flags[] = { + [SCX_OP_IDX(select_cpu)] = SCX_KF_ALLOW_SELECT_CPU | SCX_KF_ALLOW_ENQUEUE, + [SCX_OP_IDX(enqueue)] = SCX_KF_ALLOW_SELECT_CPU | SCX_KF_ALLOW_ENQUEUE, + [SCX_OP_IDX(dispatch)] = SCX_KF_ALLOW_ENQUEUE | SCX_KF_ALLOW_DISPATCH, + [SCX_OP_IDX(cpu_release)] = SCX_KF_ALLOW_CPU_RELEASE, + [SCX_OP_IDX(init_task)] = SCX_KF_ALLOW_UNLOCKED, + [SCX_OP_IDX(dump)] = SCX_KF_ALLOW_UNLOCKED, +#ifdef CONFIG_EXT_GROUP_SCHED + [SCX_OP_IDX(cgroup_init)] = SCX_KF_ALLOW_UNLOCKED, + [SCX_OP_IDX(cgroup_exit)] = SCX_KF_ALLOW_UNLOCKED, + [SCX_OP_IDX(cgroup_prep_move)] = SCX_KF_ALLOW_UNLOCKED, + [SCX_OP_IDX(cgroup_cancel_move)] = SCX_KF_ALLOW_UNLOCKED, + [SCX_OP_IDX(cgroup_set_weight)] = SCX_KF_ALLOW_UNLOCKED, + [SCX_OP_IDX(cgroup_set_bandwidth)] = SCX_KF_ALLOW_UNLOCKED, + [SCX_OP_IDX(cgroup_set_idle)] = SCX_KF_ALLOW_UNLOCKED, +#endif /* CONFIG_EXT_GROUP_SCHED */ + [SCX_OP_IDX(sub_attach)] = SCX_KF_ALLOW_UNLOCKED, + [SCX_OP_IDX(sub_detach)] = SCX_KF_ALLOW_UNLOCKED, + [SCX_OP_IDX(cpu_online)] = SCX_KF_ALLOW_UNLOCKED, + [SCX_OP_IDX(cpu_offline)] = SCX_KF_ALLOW_UNLOCKED, + [SCX_OP_IDX(init)] = SCX_KF_ALLOW_UNLOCKED, + [SCX_OP_IDX(exit)] = SCX_KF_ALLOW_UNLOCKED, +}; + +/* + * Verifier-time filter for context-sensitive SCX kfuncs. Registered via the + * .filter field on each per-group btf_kfunc_id_set. The BPF core invokes this + * for every kfunc call in the registered hook (BPF_PROG_TYPE_STRUCT_OPS or + * BPF_PROG_TYPE_SYSCALL), regardless of which set originally introduced the + * kfunc - so the filter must short-circuit on kfuncs it doesn't govern (e.g. 
+ * scx_kfunc_ids_any) by falling through to "allow" when none of the + * context-sensitive sets contain the kfunc. + */ +int scx_kfunc_context_filter(const struct bpf_prog *prog, u32 kfunc_id) +{ + bool in_unlocked = btf_id_set8_contains(&scx_kfunc_ids_unlocked, kfunc_id); + bool in_select_cpu = btf_id_set8_contains(&scx_kfunc_ids_select_cpu, kfunc_id); + bool in_enqueue = btf_id_set8_contains(&scx_kfunc_ids_enqueue_dispatch, kfunc_id); + bool in_dispatch = btf_id_set8_contains(&scx_kfunc_ids_dispatch, kfunc_id); + bool in_cpu_release = btf_id_set8_contains(&scx_kfunc_ids_cpu_release, kfunc_id); + u32 moff, flags; + + /* Not a context-sensitive kfunc (e.g. from scx_kfunc_ids_any) - allow. */ + if (!(in_unlocked || in_select_cpu || in_enqueue || in_dispatch || in_cpu_release)) + return 0; + + /* SYSCALL progs (e.g. BPF test_run()) may call unlocked and select_cpu kfuncs. */ + if (prog->type == BPF_PROG_TYPE_SYSCALL) + return (in_unlocked || in_select_cpu) ? 0 : -EACCES; + + if (prog->type != BPF_PROG_TYPE_STRUCT_OPS) + return -EACCES; + + /* + * add_subprog_and_kfunc() collects all kfunc calls, including dead code + * guarded by bpf_ksym_exists(), before check_attach_btf_id() sets + * prog->aux->st_ops. Allow all kfuncs when st_ops is not yet set; + * do_check_main() re-runs the filter with st_ops set and enforces the + * actual restrictions. + */ + if (!prog->aux->st_ops) + return 0; + + /* + * Non-SCX struct_ops: only unlocked kfuncs are safe. The other + * context-sensitive kfuncs assume the rq lock is held by the SCX + * dispatch path, which doesn't apply to other struct_ops users. + */ + if (prog->aux->st_ops != &bpf_sched_ext_ops) + return in_unlocked ? 0 : -EACCES; + + /* SCX struct_ops: check the per-op allow list. 
*/ + moff = prog->aux->attach_st_ops_member_off; + flags = scx_kf_allow_flags[SCX_MOFF_IDX(moff)]; + + if ((flags & SCX_KF_ALLOW_UNLOCKED) && in_unlocked) + return 0; + if ((flags & SCX_KF_ALLOW_CPU_RELEASE) && in_cpu_release) + return 0; + if ((flags & SCX_KF_ALLOW_DISPATCH) && in_dispatch) + return 0; + if ((flags & SCX_KF_ALLOW_ENQUEUE) && in_enqueue) + return 0; + if ((flags & SCX_KF_ALLOW_SELECT_CPU) && in_select_cpu) + return 0; + + return -EACCES; +} + static int __init scx_init(void) { int ret; @@ -7475,11 +9599,12 @@ static int __init scx_init(void) * register_btf_kfunc_id_set() needs most of the system to be up. * * Some kfuncs are context-sensitive and can only be called from - * specific SCX ops. They are grouped into BTF sets accordingly. - * Unfortunately, BPF currently doesn't have a way of enforcing such - * restrictions. Eventually, the verifier should be able to enforce - * them. For now, register them the same and make each kfunc explicitly - * check using scx_kf_allowed(). + * specific SCX ops. They are grouped into per-context BTF sets, each + * registered with scx_kfunc_context_filter as its .filter callback. The + * BPF core dedups identical filter pointers per hook + * (btf_populate_kfunc_set()), so the filter is invoked exactly once per + * kfunc lookup; it consults scx_kf_allow_flags[] to enforce per-op + * restrictions at verify time. 
*/ if ((ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &scx_kfunc_set_enqueue_dispatch)) || diff --git a/kernel/sched/ext.h b/kernel/sched/ext.h index 43429b33e52c..0b7fc46aee08 100644 --- a/kernel/sched/ext.h +++ b/kernel/sched/ext.h @@ -11,7 +11,7 @@ void scx_tick(struct rq *rq); void init_scx_entity(struct sched_ext_entity *scx); void scx_pre_fork(struct task_struct *p); -int scx_fork(struct task_struct *p); +int scx_fork(struct task_struct *p, struct kernel_clone_args *kargs); void scx_post_fork(struct task_struct *p); void scx_cancel_fork(struct task_struct *p); bool scx_can_stop_tick(struct rq *rq); @@ -44,7 +44,7 @@ bool scx_prio_less(const struct task_struct *a, const struct task_struct *b, static inline void scx_tick(struct rq *rq) {} static inline void scx_pre_fork(struct task_struct *p) {} -static inline int scx_fork(struct task_struct *p) { return 0; } +static inline int scx_fork(struct task_struct *p, struct kernel_clone_args *kargs) { return 0; } static inline void scx_post_fork(struct task_struct *p) {} static inline void scx_cancel_fork(struct task_struct *p) {} static inline u32 scx_cpuperf_target(s32 cpu) { return 0; } diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c index 44c3a50c542c..443d12a3df67 100644 --- a/kernel/sched/ext_idle.c +++ b/kernel/sched/ext_idle.c @@ -368,7 +368,7 @@ void scx_idle_update_selcpu_topology(struct sched_ext_ops *ops) /* * Enable NUMA optimization only when there are multiple NUMA domains - * among the online CPUs and the NUMA domains don't perfectly overlaps + * among the online CPUs and the NUMA domains don't perfectly overlap * with the LLC domains. * * If all CPUs belong to the same NUMA node and the same LLC domain, @@ -424,18 +424,24 @@ static inline bool task_affinity_all(const struct task_struct *p) * - prefer the last used CPU to take advantage of cached data (L1, L2) and * branch prediction optimizations. * - * 3. Pick a CPU within the same LLC (Last-Level Cache): + * 3. 
Prefer @prev_cpu's SMT sibling: + * - if @prev_cpu is busy and no fully idle core is available, try to + * place the task on an idle SMT sibling of @prev_cpu; keeping the + * task on the same core makes migration cheaper, preserves L1 cache + * locality and reduces wakeup latency. + * + * 4. Pick a CPU within the same LLC (Last-Level Cache): * - if the above conditions aren't met, pick a CPU that shares the same * LLC, if the LLC domain is a subset of @cpus_allowed, to maintain * cache locality. * - * 4. Pick a CPU within the same NUMA node, if enabled: + * 5. Pick a CPU within the same NUMA node, if enabled: * - choose a CPU from the same NUMA node, if the node cpumask is a * subset of @cpus_allowed, to reduce memory access latency. * - * 5. Pick any idle CPU within the @cpus_allowed domain. + * 6. Pick any idle CPU within the @cpus_allowed domain. * - * Step 3 and 4 are performed only if the system has, respectively, + * Step 4 and 5 are performed only if the system has, respectively, * multiple LLCs / multiple NUMA nodes (see scx_selcpu_topo_llc and * scx_selcpu_topo_numa) and they don't contain the same subset of CPUs. * @@ -616,6 +622,20 @@ s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags, goto out_unlock; } +#ifdef CONFIG_SCHED_SMT + /* + * Use @prev_cpu's sibling if it's idle. + */ + if (sched_smt_active()) { + for_each_cpu_and(cpu, cpu_smt_mask(prev_cpu), allowed) { + if (cpu == prev_cpu) + continue; + if (scx_idle_test_and_clear_cpu(cpu)) + goto out_unlock; + } + } +#endif + /* * Search for any idle CPU in the same LLC domain. */ @@ -767,8 +787,9 @@ void __scx_update_idle(struct rq *rq, bool idle, bool do_notify) * either enqueue() sees the idle bit or update_idle() sees the task * that enqueue() queued. 
*/ - if (SCX_HAS_OP(sch, update_idle) && do_notify && !scx_rq_bypassing(rq)) - SCX_CALL_OP(sch, SCX_KF_REST, update_idle, rq, cpu_of(rq), idle); + if (SCX_HAS_OP(sch, update_idle) && do_notify && + !scx_bypassing(sch, cpu_of(rq))) + SCX_CALL_OP(sch, update_idle, rq, cpu_of(rq), idle); } static void reset_idle_masks(struct sched_ext_ops *ops) @@ -892,8 +913,8 @@ static s32 select_cpu_from_kfunc(struct scx_sched *sch, struct task_struct *p, s32 prev_cpu, u64 wake_flags, const struct cpumask *allowed, u64 flags) { - struct rq *rq; - struct rq_flags rf; + unsigned long irq_flags; + bool we_locked = false; s32 cpu; if (!ops_cpu_valid(sch, prev_cpu, NULL)) @@ -903,27 +924,20 @@ static s32 select_cpu_from_kfunc(struct scx_sched *sch, struct task_struct *p, return -EBUSY; /* - * If called from an unlocked context, acquire the task's rq lock, - * so that we can safely access p->cpus_ptr and p->nr_cpus_allowed. + * Accessing p->cpus_ptr / p->nr_cpus_allowed needs either @p's rq + * lock or @p's pi_lock. Three cases: * - * Otherwise, allow to use this kfunc only from ops.select_cpu() - * and ops.select_enqueue(). + * - inside ops.select_cpu(): try_to_wake_up() holds @p's pi_lock. + * - other rq-locked SCX op: scx_locked_rq() points at the held rq. + * - truly unlocked (UNLOCKED ops, SYSCALL, non-SCX struct_ops): + * nothing held, take pi_lock ourselves. */ - if (scx_kf_allowed_if_unlocked()) { - rq = task_rq_lock(p, &rf); - } else { - if (!scx_kf_allowed(sch, SCX_KF_SELECT_CPU | SCX_KF_ENQUEUE)) - return -EPERM; - rq = scx_locked_rq(); - } - - /* - * Validate locking correctness to access p->cpus_ptr and - * p->nr_cpus_allowed: if we're holding an rq lock, we're safe; - * otherwise, assert that p->pi_lock is held. 
- */ - if (!rq) + if (this_rq()->scx.in_select_cpu) { lockdep_assert_held(&p->pi_lock); + } else if (!scx_locked_rq()) { + raw_spin_lock_irqsave(&p->pi_lock, irq_flags); + we_locked = true; + } /* * This may also be called from ops.enqueue(), so we need to handle @@ -942,8 +956,8 @@ static s32 select_cpu_from_kfunc(struct scx_sched *sch, struct task_struct *p, allowed ?: p->cpus_ptr, flags); } - if (scx_kf_allowed_if_unlocked()) - task_rq_unlock(rq, p, &rf); + if (we_locked) + raw_spin_unlock_irqrestore(&p->pi_lock, irq_flags); return cpu; } @@ -952,14 +966,15 @@ static s32 select_cpu_from_kfunc(struct scx_sched *sch, struct task_struct *p, * scx_bpf_cpu_node - Return the NUMA node the given @cpu belongs to, or * trigger an error if @cpu is invalid * @cpu: target CPU + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs */ -__bpf_kfunc int scx_bpf_cpu_node(s32 cpu) +__bpf_kfunc s32 scx_bpf_cpu_node(s32 cpu, const struct bpf_prog_aux *aux) { struct scx_sched *sch; guard(rcu)(); - sch = rcu_dereference(scx_root); + sch = scx_prog_sched(aux); if (unlikely(!sch) || !ops_cpu_valid(sch, cpu, NULL)) return NUMA_NO_NODE; return cpu_to_node(cpu); @@ -971,6 +986,7 @@ __bpf_kfunc int scx_bpf_cpu_node(s32 cpu) * @prev_cpu: CPU @p was on previously * @wake_flags: %SCX_WAKE_* flags * @is_idle: out parameter indicating whether the returned CPU is idle + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs * * Can be called from ops.select_cpu(), ops.enqueue(), or from an unlocked * context such as a BPF test_run() call, as long as built-in CPU selection @@ -981,14 +997,15 @@ __bpf_kfunc int scx_bpf_cpu_node(s32 cpu) * currently idle and thus a good candidate for direct dispatching. 
*/ __bpf_kfunc s32 scx_bpf_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, - u64 wake_flags, bool *is_idle) + u64 wake_flags, bool *is_idle, + const struct bpf_prog_aux *aux) { struct scx_sched *sch; s32 cpu; guard(rcu)(); - sch = rcu_dereference(scx_root); + sch = scx_prog_sched(aux); if (unlikely(!sch)) return -ENODEV; @@ -1016,6 +1033,7 @@ struct scx_bpf_select_cpu_and_args { * @args->prev_cpu: CPU @p was on previously * @args->wake_flags: %SCX_WAKE_* flags * @args->flags: %SCX_PICK_IDLE* flags + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs * * Wrapper kfunc that takes arguments via struct to work around BPF's 5 argument * limit. BPF programs should use scx_bpf_select_cpu_and() which is provided @@ -1034,13 +1052,14 @@ struct scx_bpf_select_cpu_and_args { */ __bpf_kfunc s32 __scx_bpf_select_cpu_and(struct task_struct *p, const struct cpumask *cpus_allowed, - struct scx_bpf_select_cpu_and_args *args) + struct scx_bpf_select_cpu_and_args *args, + const struct bpf_prog_aux *aux) { struct scx_sched *sch; guard(rcu)(); - sch = rcu_dereference(scx_root); + sch = scx_prog_sched(aux); if (unlikely(!sch)) return -ENODEV; @@ -1062,6 +1081,17 @@ __bpf_kfunc s32 scx_bpf_select_cpu_and(struct task_struct *p, s32 prev_cpu, u64 if (unlikely(!sch)) return -ENODEV; +#ifdef CONFIG_EXT_SUB_SCHED + /* + * Disallow if any sub-scheds are attached. There is no way to tell + * which scheduler called us, just error out @p's scheduler. + */ + if (unlikely(!list_empty(&sch->children))) { + scx_error(scx_task_sched(p), "__scx_bpf_select_cpu_and() must be used"); + return -EINVAL; + } +#endif + return select_cpu_from_kfunc(sch, p, prev_cpu, wake_flags, cpus_allowed, flags); } @@ -1070,18 +1100,20 @@ __bpf_kfunc s32 scx_bpf_select_cpu_and(struct task_struct *p, s32 prev_cpu, u64 * scx_bpf_get_idle_cpumask_node - Get a referenced kptr to the * idle-tracking per-CPU cpumask of a target NUMA node. 
* @node: target NUMA node + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs * * Returns an empty cpumask if idle tracking is not enabled, if @node is * not valid, or running on a UP kernel. In this case the actual error will * be reported to the BPF scheduler via scx_error(). */ -__bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask_node(int node) +__bpf_kfunc const struct cpumask * +scx_bpf_get_idle_cpumask_node(s32 node, const struct bpf_prog_aux *aux) { struct scx_sched *sch; guard(rcu)(); - sch = rcu_dereference(scx_root); + sch = scx_prog_sched(aux); if (unlikely(!sch)) return cpu_none_mask; @@ -1095,17 +1127,18 @@ __bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask_node(int node) /** * scx_bpf_get_idle_cpumask - Get a referenced kptr to the idle-tracking * per-CPU cpumask. + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs * * Returns an empty mask if idle tracking is not enabled, or running on a * UP kernel. */ -__bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void) +__bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(const struct bpf_prog_aux *aux) { struct scx_sched *sch; guard(rcu)(); - sch = rcu_dereference(scx_root); + sch = scx_prog_sched(aux); if (unlikely(!sch)) return cpu_none_mask; @@ -1125,18 +1158,20 @@ __bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void) * idle-tracking, per-physical-core cpumask of a target NUMA node. Can be * used to determine if an entire physical core is free. * @node: target NUMA node + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs * * Returns an empty cpumask if idle tracking is not enabled, if @node is * not valid, or running on a UP kernel. In this case the actual error will * be reported to the BPF scheduler via scx_error(). 
*/ -__bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask_node(int node) +__bpf_kfunc const struct cpumask * +scx_bpf_get_idle_smtmask_node(s32 node, const struct bpf_prog_aux *aux) { struct scx_sched *sch; guard(rcu)(); - sch = rcu_dereference(scx_root); + sch = scx_prog_sched(aux); if (unlikely(!sch)) return cpu_none_mask; @@ -1154,17 +1189,18 @@ __bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask_node(int node) * scx_bpf_get_idle_smtmask - Get a referenced kptr to the idle-tracking, * per-physical-core cpumask. Can be used to determine if an entire physical * core is free. + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs * * Returns an empty mask if idle tracking is not enabled, or running on a * UP kernel. */ -__bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask(void) +__bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask(const struct bpf_prog_aux *aux) { struct scx_sched *sch; guard(rcu)(); - sch = rcu_dereference(scx_root); + sch = scx_prog_sched(aux); if (unlikely(!sch)) return cpu_none_mask; @@ -1200,6 +1236,7 @@ __bpf_kfunc void scx_bpf_put_idle_cpumask(const struct cpumask *idle_mask) /** * scx_bpf_test_and_clear_cpu_idle - Test and clear @cpu's idle state * @cpu: cpu to test and clear idle for + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs * * Returns %true if @cpu was idle and its idle state was successfully cleared. * %false otherwise. @@ -1207,13 +1244,13 @@ __bpf_kfunc void scx_bpf_put_idle_cpumask(const struct cpumask *idle_mask) * Unavailable if ops.update_idle() is implemented and * %SCX_OPS_KEEP_BUILTIN_IDLE is not set. 
*/ -__bpf_kfunc bool scx_bpf_test_and_clear_cpu_idle(s32 cpu) +__bpf_kfunc bool scx_bpf_test_and_clear_cpu_idle(s32 cpu, const struct bpf_prog_aux *aux) { struct scx_sched *sch; guard(rcu)(); - sch = rcu_dereference(scx_root); + sch = scx_prog_sched(aux); if (unlikely(!sch)) return false; @@ -1231,6 +1268,7 @@ __bpf_kfunc bool scx_bpf_test_and_clear_cpu_idle(s32 cpu) * @cpus_allowed: Allowed cpumask * @node: target NUMA node * @flags: %SCX_PICK_IDLE_* flags + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs * * Pick and claim an idle cpu in @cpus_allowed from the NUMA node @node. * @@ -1246,13 +1284,14 @@ __bpf_kfunc bool scx_bpf_test_and_clear_cpu_idle(s32 cpu) * %SCX_OPS_BUILTIN_IDLE_PER_NODE is not set. */ __bpf_kfunc s32 scx_bpf_pick_idle_cpu_node(const struct cpumask *cpus_allowed, - int node, u64 flags) + s32 node, u64 flags, + const struct bpf_prog_aux *aux) { struct scx_sched *sch; guard(rcu)(); - sch = rcu_dereference(scx_root); + sch = scx_prog_sched(aux); if (unlikely(!sch)) return -ENODEV; @@ -1267,6 +1306,7 @@ __bpf_kfunc s32 scx_bpf_pick_idle_cpu_node(const struct cpumask *cpus_allowed, * scx_bpf_pick_idle_cpu - Pick and claim an idle cpu * @cpus_allowed: Allowed cpumask * @flags: %SCX_PICK_IDLE_CPU_* flags + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs * * Pick and claim an idle cpu in @cpus_allowed. Returns the picked idle cpu * number on success. -%EBUSY if no matching cpu was found. @@ -1286,13 +1326,13 @@ __bpf_kfunc s32 scx_bpf_pick_idle_cpu_node(const struct cpumask *cpus_allowed, * scx_bpf_pick_idle_cpu_node() instead. 
*/ __bpf_kfunc s32 scx_bpf_pick_idle_cpu(const struct cpumask *cpus_allowed, - u64 flags) + u64 flags, const struct bpf_prog_aux *aux) { struct scx_sched *sch; guard(rcu)(); - sch = rcu_dereference(scx_root); + sch = scx_prog_sched(aux); if (unlikely(!sch)) return -ENODEV; @@ -1313,6 +1353,7 @@ __bpf_kfunc s32 scx_bpf_pick_idle_cpu(const struct cpumask *cpus_allowed, * @cpus_allowed: Allowed cpumask * @node: target NUMA node * @flags: %SCX_PICK_IDLE_CPU_* flags + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs * * Pick and claim an idle cpu in @cpus_allowed. If none is available, pick any * CPU in @cpus_allowed. Guaranteed to succeed and returns the picked idle cpu @@ -1329,14 +1370,15 @@ __bpf_kfunc s32 scx_bpf_pick_idle_cpu(const struct cpumask *cpus_allowed, * CPU. */ __bpf_kfunc s32 scx_bpf_pick_any_cpu_node(const struct cpumask *cpus_allowed, - int node, u64 flags) + s32 node, u64 flags, + const struct bpf_prog_aux *aux) { struct scx_sched *sch; s32 cpu; guard(rcu)(); - sch = rcu_dereference(scx_root); + sch = scx_prog_sched(aux); if (unlikely(!sch)) return -ENODEV; @@ -1362,6 +1404,7 @@ __bpf_kfunc s32 scx_bpf_pick_any_cpu_node(const struct cpumask *cpus_allowed, * scx_bpf_pick_any_cpu - Pick and claim an idle cpu if available or pick any CPU * @cpus_allowed: Allowed cpumask * @flags: %SCX_PICK_IDLE_CPU_* flags + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs * * Pick and claim an idle cpu in @cpus_allowed. If none is available, pick any * CPU in @cpus_allowed. Guaranteed to succeed and returns the picked idle cpu @@ -1376,14 +1419,14 @@ __bpf_kfunc s32 scx_bpf_pick_any_cpu_node(const struct cpumask *cpus_allowed, * scx_bpf_pick_any_cpu_node() instead. 
*/ __bpf_kfunc s32 scx_bpf_pick_any_cpu(const struct cpumask *cpus_allowed, - u64 flags) + u64 flags, const struct bpf_prog_aux *aux) { struct scx_sched *sch; s32 cpu; guard(rcu)(); - sch = rcu_dereference(scx_root); + sch = scx_prog_sched(aux); if (unlikely(!sch)) return -ENODEV; @@ -1408,20 +1451,17 @@ __bpf_kfunc s32 scx_bpf_pick_any_cpu(const struct cpumask *cpus_allowed, __bpf_kfunc_end_defs(); BTF_KFUNCS_START(scx_kfunc_ids_idle) -BTF_ID_FLAGS(func, scx_bpf_cpu_node) -BTF_ID_FLAGS(func, scx_bpf_get_idle_cpumask_node, KF_ACQUIRE) -BTF_ID_FLAGS(func, scx_bpf_get_idle_cpumask, KF_ACQUIRE) -BTF_ID_FLAGS(func, scx_bpf_get_idle_smtmask_node, KF_ACQUIRE) -BTF_ID_FLAGS(func, scx_bpf_get_idle_smtmask, KF_ACQUIRE) +BTF_ID_FLAGS(func, scx_bpf_cpu_node, KF_IMPLICIT_ARGS) +BTF_ID_FLAGS(func, scx_bpf_get_idle_cpumask_node, KF_IMPLICIT_ARGS | KF_ACQUIRE) +BTF_ID_FLAGS(func, scx_bpf_get_idle_cpumask, KF_IMPLICIT_ARGS | KF_ACQUIRE) +BTF_ID_FLAGS(func, scx_bpf_get_idle_smtmask_node, KF_IMPLICIT_ARGS | KF_ACQUIRE) +BTF_ID_FLAGS(func, scx_bpf_get_idle_smtmask, KF_IMPLICIT_ARGS | KF_ACQUIRE) BTF_ID_FLAGS(func, scx_bpf_put_idle_cpumask, KF_RELEASE) -BTF_ID_FLAGS(func, scx_bpf_test_and_clear_cpu_idle) -BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu_node, KF_RCU) -BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu, KF_RCU) -BTF_ID_FLAGS(func, scx_bpf_pick_any_cpu_node, KF_RCU) -BTF_ID_FLAGS(func, scx_bpf_pick_any_cpu, KF_RCU) -BTF_ID_FLAGS(func, __scx_bpf_select_cpu_and, KF_RCU) -BTF_ID_FLAGS(func, scx_bpf_select_cpu_and, KF_RCU) -BTF_ID_FLAGS(func, scx_bpf_select_cpu_dfl, KF_RCU) +BTF_ID_FLAGS(func, scx_bpf_test_and_clear_cpu_idle, KF_IMPLICIT_ARGS) +BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu_node, KF_IMPLICIT_ARGS | KF_RCU) +BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu, KF_IMPLICIT_ARGS | KF_RCU) +BTF_ID_FLAGS(func, scx_bpf_pick_any_cpu_node, KF_IMPLICIT_ARGS | KF_RCU) +BTF_ID_FLAGS(func, scx_bpf_pick_any_cpu, KF_IMPLICIT_ARGS | KF_RCU) BTF_KFUNCS_END(scx_kfunc_ids_idle) static const struct 
btf_kfunc_id_set scx_kfunc_set_idle = { @@ -1429,13 +1469,38 @@ static const struct btf_kfunc_id_set scx_kfunc_set_idle = { .set = &scx_kfunc_ids_idle, }; +/* + * The select_cpu kfuncs internally take @p's pi_lock when invoked from an + * unlocked context, and thus cannot be safely called from arbitrary tracing + * contexts where @p's pi_lock state is unknown. Keep them out of + * BPF_PROG_TYPE_TRACING by registering them in their own set which is exposed + * only to STRUCT_OPS and SYSCALL programs. + * + * These kfuncs are also members of scx_kfunc_ids_unlocked (see ext.c) because + * they're callable from unlocked contexts in addition to ops.select_cpu() and + * ops.enqueue(). + */ +BTF_KFUNCS_START(scx_kfunc_ids_select_cpu) +BTF_ID_FLAGS(func, __scx_bpf_select_cpu_and, KF_IMPLICIT_ARGS | KF_RCU) +BTF_ID_FLAGS(func, scx_bpf_select_cpu_and, KF_RCU) +BTF_ID_FLAGS(func, scx_bpf_select_cpu_dfl, KF_IMPLICIT_ARGS | KF_RCU) +BTF_KFUNCS_END(scx_kfunc_ids_select_cpu) + +static const struct btf_kfunc_id_set scx_kfunc_set_select_cpu = { + .owner = THIS_MODULE, + .set = &scx_kfunc_ids_select_cpu, + .filter = scx_kfunc_context_filter, +}; + int scx_idle_init(void) { int ret; ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &scx_kfunc_set_idle) || register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &scx_kfunc_set_idle) || - register_btf_kfunc_id_set(BPF_PROG_TYPE_SYSCALL, &scx_kfunc_set_idle); + register_btf_kfunc_id_set(BPF_PROG_TYPE_SYSCALL, &scx_kfunc_set_idle) || + register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &scx_kfunc_set_select_cpu) || + register_btf_kfunc_id_set(BPF_PROG_TYPE_SYSCALL, &scx_kfunc_set_select_cpu); return ret; } diff --git a/kernel/sched/ext_idle.h b/kernel/sched/ext_idle.h index fa583f141f35..dc35f850481e 100644 --- a/kernel/sched/ext_idle.h +++ b/kernel/sched/ext_idle.h @@ -12,6 +12,8 @@ struct sched_ext_ops; +extern struct btf_id_set8 scx_kfunc_ids_select_cpu; + void scx_idle_update_selcpu_topology(struct sched_ext_ops *ops); void
scx_idle_init_masks(void); diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h index 00b450597f3e..62ce4eaf6a3f 100644 --- a/kernel/sched/ext_internal.h +++ b/kernel/sched/ext_internal.h @@ -6,6 +6,7 @@ * Copyright (c) 2025 Tejun Heo */ #define SCX_OP_IDX(op) (offsetof(struct sched_ext_ops, op) / sizeof(void (*)(void))) +#define SCX_MOFF_IDX(moff) ((moff) / sizeof(void (*)(void))) enum scx_consts { SCX_DSP_DFL_MAX_BATCH = 32, @@ -24,10 +25,16 @@ enum scx_consts { */ SCX_TASK_ITER_BATCH = 32, + SCX_BYPASS_HOST_NTH = 2, + SCX_BYPASS_LB_DFL_INTV_US = 500 * USEC_PER_MSEC, SCX_BYPASS_LB_DONOR_PCT = 125, SCX_BYPASS_LB_MIN_DELTA_DIV = 4, SCX_BYPASS_LB_BATCH = 256, + + SCX_REENQ_LOCAL_MAX_REPEAT = 256, + + SCX_SUB_MAX_DEPTH = 4, }; enum scx_exit_kind { @@ -38,6 +45,7 @@ enum scx_exit_kind { SCX_EXIT_UNREG_BPF, /* BPF-initiated unregistration */ SCX_EXIT_UNREG_KERN, /* kernel-initiated unregistration */ SCX_EXIT_SYSRQ, /* requested by 'S' sysrq */ + SCX_EXIT_PARENT, /* parent exiting */ SCX_EXIT_ERROR = 1024, /* runtime error, error msg contains details */ SCX_EXIT_ERROR_BPF, /* ERROR but triggered through scx_bpf_error() */ @@ -62,6 +70,7 @@ enum scx_exit_kind { enum scx_exit_code { /* Reasons */ SCX_ECODE_RSN_HOTPLUG = 1LLU << 32, + SCX_ECODE_RSN_CGROUP_OFFLINE = 2LLU << 32, /* Actions */ SCX_ECODE_ACT_RESTART = 1LLU << 48, @@ -175,9 +184,10 @@ enum scx_ops_flags { SCX_OPS_BUILTIN_IDLE_PER_NODE = 1LLU << 6, /* - * CPU cgroup support flags + * If set, %SCX_ENQ_IMMED is assumed to be set on all local DSQ + * enqueues. 
*/ - SCX_OPS_HAS_CGROUP_WEIGHT = 1LLU << 16, /* DEPRECATED, will be removed on 6.18 */ + SCX_OPS_ALWAYS_ENQ_IMMED = 1LLU << 7, SCX_OPS_ALL_FLAGS = SCX_OPS_KEEP_BUILTIN_IDLE | SCX_OPS_ENQ_LAST | @@ -186,7 +196,7 @@ enum scx_ops_flags { SCX_OPS_ALLOW_QUEUED_WAKEUP | SCX_OPS_SWITCH_PARTIAL | SCX_OPS_BUILTIN_IDLE_PER_NODE | - SCX_OPS_HAS_CGROUP_WEIGHT, + SCX_OPS_ALWAYS_ENQ_IMMED, /* high 8 bits are internal, don't include in SCX_OPS_ALL_FLAGS */ __SCX_OPS_INTERNAL_MASK = 0xffLLU << 56, @@ -213,7 +223,7 @@ struct scx_exit_task_args { bool cancelled; }; -/* argument container for ops->cgroup_init() */ +/* argument container for ops.cgroup_init() */ struct scx_cgroup_init_args { /* the weight of the cgroup [1..10000] */ u32 weight; @@ -236,12 +246,12 @@ enum scx_cpu_preempt_reason { }; /* - * Argument container for ops->cpu_acquire(). Currently empty, but may be + * Argument container for ops.cpu_acquire(). Currently empty, but may be * expanded in the future. */ struct scx_cpu_acquire_args {}; -/* argument container for ops->cpu_release() */ +/* argument container for ops.cpu_release() */ struct scx_cpu_release_args { /* the reason the CPU was preempted */ enum scx_cpu_preempt_reason reason; @@ -250,9 +260,7 @@ struct scx_cpu_release_args { struct task_struct *task; }; -/* - * Informational context provided to dump operations. 
- */ +/* informational context provided to dump operations */ struct scx_dump_ctx { enum scx_exit_kind kind; s64 exit_code; @@ -261,6 +269,18 @@ struct scx_dump_ctx { u64 at_jiffies; }; +/* argument container for ops.sub_attach() */ +struct scx_sub_attach_args { + struct sched_ext_ops *ops; + char *cgroup_path; +}; + +/* argument container for ops.sub_detach() */ +struct scx_sub_detach_args { + struct sched_ext_ops *ops; + char *cgroup_path; +}; + /** * struct sched_ext_ops - Operation table for BPF scheduler implementation * @@ -721,6 +741,20 @@ struct sched_ext_ops { #endif /* CONFIG_EXT_GROUP_SCHED */ + /** + * @sub_attach: Attach a sub-scheduler + * @args: argument container, see the struct definition + * + * Return 0 to accept the sub-scheduler. -errno to reject. + */ + s32 (*sub_attach)(struct scx_sub_attach_args *args); + + /** + * @sub_detach: Detach a sub-scheduler + * @args: argument container, see the struct definition + */ + void (*sub_detach)(struct scx_sub_detach_args *args); + /* * All online ops must come before ops.cpu_online(). */ @@ -762,6 +796,10 @@ struct sched_ext_ops { */ void (*exit)(struct scx_exit_info *info); + /* + * Data fields must come after all ops fields. + */ + /** * @dispatch_max_batch: Max nr of tasks that dispatch() can dispatch */ @@ -796,6 +834,12 @@ struct sched_ext_ops { */ u64 hotplug_seq; + /** + * @sub_cgroup_id: When >1, attach the scheduler as a sub-scheduler on the + * specified cgroup. + */ + u64 sub_cgroup_id; + /** * @name: BPF scheduler's name * @@ -806,7 +850,7 @@ struct sched_ext_ops { char name[SCX_OPS_NAME_LEN]; /* internal use only, must be NULL */ - void *priv; + void __rcu *priv; }; enum scx_opi { @@ -853,6 +897,24 @@ struct scx_event_stats { */ s64 SCX_EV_ENQ_SKIP_MIGRATION_DISABLED; + /* + * The number of times a task, enqueued on a local DSQ with + * SCX_ENQ_IMMED, was re-enqueued because the CPU was not available for + * immediate execution.
+ */ + s64 SCX_EV_REENQ_IMMED; + + /* + * The number of times a reenq of local DSQ caused another reenq of + * local DSQ. This can happen when %SCX_ENQ_IMMED races against a higher + * priority class task even if the BPF scheduler always satisfies the + * prerequisites for %SCX_ENQ_IMMED at the time of enqueue. However, + * that scenario is very unlikely and this count going up regularly + * indicates that the BPF scheduler is handling %SCX_ENQ_REENQ + * incorrectly causing recursive reenqueues. + */ + s64 SCX_EV_REENQ_LOCAL_REPEAT; + /* * Total number of times a task's time slice was refilled with the * default value (SCX_SLICE_DFL). @@ -873,15 +935,77 @@ struct scx_event_stats { * The number of times the bypassing mode has been activated. */ s64 SCX_EV_BYPASS_ACTIVATE; + + /* + * The number of times the scheduler attempted to insert a task that it + * doesn't own into a DSQ. Such attempts are ignored. + * + * As BPF schedulers are allowed to ignore dequeues, it's difficult to + * tell whether such an attempt is from a scheduler malfunction or an + * ignored dequeue around sub-sched enabling. If this count keeps going + * up regardless of sub-sched enabling, it likely indicates a bug in the + * scheduler. + */ + s64 SCX_EV_INSERT_NOT_OWNED; + + /* + * The number of times tasks from bypassing descendants are scheduled + * from sub_bypass_dsq's. 
+ */ + s64 SCX_EV_SUB_BYPASS_DISPATCH; +}; + +struct scx_sched; + +enum scx_sched_pcpu_flags { + SCX_SCHED_PCPU_BYPASSING = 1LLU << 0, +}; + +/* dispatch buf */ +struct scx_dsp_buf_ent { + struct task_struct *task; + unsigned long qseq; + u64 dsq_id; + u64 enq_flags; +}; + +struct scx_dsp_ctx { + struct rq *rq; + u32 cursor; + u32 nr_tasks; + struct scx_dsp_buf_ent buf[]; +}; + +struct scx_deferred_reenq_local { + struct list_head node; + u64 flags; + u64 seq; + u32 cnt; }; struct scx_sched_pcpu { + struct scx_sched *sch; + u64 flags; /* protected by rq lock */ + /* * The event counters are in a per-CPU variable to minimize the * accounting overhead. A system-wide view on the event counter is * constructed when requested by scx_bpf_events(). */ struct scx_event_stats event_stats; + + struct scx_deferred_reenq_local deferred_reenq_local; + struct scx_dispatch_q bypass_dsq; +#ifdef CONFIG_EXT_SUB_SCHED + u32 bypass_host_seq; +#endif + + /* must be the last entry - contains flex array */ + struct scx_dsp_ctx dsp_ctx; +}; + +struct scx_sched_pnode { + struct scx_dispatch_q global_dsq; }; struct scx_sched { @@ -897,15 +1021,50 @@ struct scx_sched { * per-node split isn't sufficient, it can be further split. */ struct rhashtable dsq_hash; - struct scx_dispatch_q **global_dsqs; + struct scx_sched_pnode **pnode; struct scx_sched_pcpu __percpu *pcpu; + u64 slice_dfl; + u64 bypass_timestamp; + s32 bypass_depth; + + /* bypass dispatch path enable state, see bypass_dsp_enabled() */ + unsigned long bypass_dsp_claim; + atomic_t bypass_dsp_enable_depth; + + bool aborting; + bool dump_disabled; /* protected by scx_dump_lock */ + u32 dsp_max_batch; + s32 level; + /* * Updates to the following warned bitfields can race causing RMW issues * but it doesn't really matter. 
*/ bool warned_zero_slice:1; bool warned_deprecated_rq:1; + bool warned_unassoc_progs:1; + + struct list_head all; + +#ifdef CONFIG_EXT_SUB_SCHED + struct rhash_head hash_node; + + struct list_head children; + struct list_head sibling; + struct cgroup *cgrp; + char *cgrp_path; + struct kset *sub_kset; + + bool sub_attached; +#endif /* CONFIG_EXT_SUB_SCHED */ + + /* + * The maximum amount of time in jiffies that a task may be runnable + * without being scheduled on a CPU. If this timeout is exceeded, it + * will trigger scx_error(). + */ + unsigned long watchdog_timeout; atomic_t exit_kind; struct scx_exit_info *exit_info; @@ -913,9 +1072,13 @@ struct scx_sched { struct kobject kobj; struct kthread_worker *helper; - struct irq_work error_irq_work; + struct irq_work disable_irq_work; struct kthread_work disable_work; + struct timer_list bypass_lb_timer; struct rcu_work rcu_work; + + /* all ancestors including self */ + struct scx_sched *ancestors[]; }; enum scx_wake_flags { @@ -942,13 +1105,27 @@ enum scx_enq_flags { SCX_ENQ_PREEMPT = 1LLU << 32, /* - * The task being enqueued was previously enqueued on the current CPU's - * %SCX_DSQ_LOCAL, but was removed from it in a call to the - * scx_bpf_reenqueue_local() kfunc. If scx_bpf_reenqueue_local() was - * invoked in a ->cpu_release() callback, and the task is again - * dispatched back to %SCX_LOCAL_DSQ by this current ->enqueue(), the - * task will not be scheduled on the CPU until at least the next invocation - * of the ->cpu_acquire() callback. + * Only allowed on local DSQs. Guarantees that the task either gets + * on the CPU immediately and stays on it, or gets reenqueued back + * to the BPF scheduler. It will never linger on a local DSQ or be + * silently put back after preemption. + * + * The protection persists until the next fresh enqueue - it + * survives SAVE/RESTORE cycles, slice extensions and preemption. + * If the task can't stay on the CPU for any reason, it gets + * reenqueued back to the BPF scheduler. 
+ * + * Exiting and migration-disabled tasks bypass ops.enqueue() and + * are placed directly on a local DSQ without IMMED protection + * unless %SCX_OPS_ENQ_EXITING and %SCX_OPS_ENQ_MIGRATION_DISABLED + * are set respectively. + */ + SCX_ENQ_IMMED = 1LLU << 33, + + /* + * The task being enqueued was previously enqueued on a DSQ, but was + * removed and is being re-enqueued. See SCX_TASK_REENQ_* flags to find + * out why a given task is being reenqueued. */ SCX_ENQ_REENQ = 1LLU << 40, @@ -969,6 +1146,7 @@ enum scx_enq_flags { SCX_ENQ_CLEAR_OPSS = 1LLU << 56, SCX_ENQ_DSQ_PRIQ = 1LLU << 57, SCX_ENQ_NESTED = 1LLU << 58, + SCX_ENQ_GDSQ_FALLBACK = 1LLU << 59, /* fell back to global DSQ */ }; enum scx_deq_flags { @@ -982,6 +1160,28 @@ enum scx_deq_flags { * it hasn't been dispatched yet. Dequeue from the BPF side. */ SCX_DEQ_CORE_SCHED_EXEC = 1LLU << 32, + + /* + * The task is being dequeued due to a property change (e.g., + * sched_setaffinity(), sched_setscheduler(), set_user_nice(), + * etc.). + */ + SCX_DEQ_SCHED_CHANGE = 1LLU << 33, +}; + +enum scx_reenq_flags { + /* low 16bits determine which tasks should be reenqueued */ + SCX_REENQ_ANY = 1LLU << 0, /* all tasks */ + + __SCX_REENQ_FILTER_MASK = 0xffffLLU, + + __SCX_REENQ_USER_MASK = SCX_REENQ_ANY, + + /* bits 32-35 used by task_should_reenq() */ + SCX_REENQ_TSR_RQ_OPEN = 1LLU << 32, + SCX_REENQ_TSR_NOT_FIRST = 1LLU << 33, + + __SCX_REENQ_TSR_MASK = 0xfLLU << 32, }; enum scx_pick_idle_cpu_flags { @@ -1161,8 +1361,11 @@ enum scx_ops_state { #define SCX_OPSS_STATE_MASK ((1LU << SCX_OPSS_QSEQ_SHIFT) - 1) #define SCX_OPSS_QSEQ_MASK (~SCX_OPSS_STATE_MASK) +extern struct scx_sched __rcu *scx_root; DECLARE_PER_CPU(struct rq *, scx_locked_rq_state); +int scx_kfunc_context_filter(const struct bpf_prog *prog, u32 kfunc_id); + /* * Return the rq currently locked from an scx callback, or NULL if no rq is * locked. 
@@ -1172,12 +1375,107 @@ static inline struct rq *scx_locked_rq(void) return __this_cpu_read(scx_locked_rq_state); } -static inline bool scx_kf_allowed_if_unlocked(void) +static inline bool scx_bypassing(struct scx_sched *sch, s32 cpu) { - return !current->scx.kf_mask; + return unlikely(per_cpu_ptr(sch->pcpu, cpu)->flags & + SCX_SCHED_PCPU_BYPASSING); } -static inline bool scx_rq_bypassing(struct rq *rq) +#ifdef CONFIG_EXT_SUB_SCHED +/** + * scx_task_sched - Find scx_sched scheduling a task + * @p: task of interest + * + * Return @p's scheduler instance. Must be called with @p's pi_lock or rq lock + * held. + */ +static inline struct scx_sched *scx_task_sched(const struct task_struct *p) { - return unlikely(rq->scx.flags & SCX_RQ_BYPASSING); + return rcu_dereference_protected(p->scx.sched, + lockdep_is_held(&p->pi_lock) || + lockdep_is_held(__rq_lockp(task_rq(p)))); } + +/** + * scx_task_sched_rcu - Find scx_sched scheduling a task + * @p: task of interest + * + * Return @p's scheduler instance. The returned scx_sched is RCU protected. + */ +static inline struct scx_sched *scx_task_sched_rcu(const struct task_struct *p) +{ + return rcu_dereference_all(p->scx.sched); +} + +/** + * scx_task_on_sched - Is a task on the specified sched? + * @sch: sched to test against + * @p: task of interest + * + * Returns %true if @p is on @sch, %false otherwise. + */ +static inline bool scx_task_on_sched(struct scx_sched *sch, + const struct task_struct *p) +{ + return rcu_access_pointer(p->scx.sched) == sch; +} + +/** + * scx_prog_sched - Find scx_sched associated with a BPF prog + * @aux: aux passed in from BPF to a kfunc + * + * To be called from kfuncs. Return the scheduler instance associated with the + * BPF program given the implicit kfunc argument aux. The returned scx_sched is + * RCU protected. 
+ */ +static inline struct scx_sched *scx_prog_sched(const struct bpf_prog_aux *aux) +{ + struct sched_ext_ops *ops; + struct scx_sched *root; + + ops = bpf_prog_get_assoc_struct_ops(aux); + if (likely(ops)) + return rcu_dereference_all(ops->priv); + + root = rcu_dereference_all(scx_root); + if (root) { + /* + * COMPAT-v6.19: Schedulers built before sub-sched support was + * introduced may have unassociated non-struct_ops programs. + */ + if (!root->ops.sub_attach) + return root; + + if (!root->warned_unassoc_progs) { + printk_deferred(KERN_WARNING "sched_ext: Unassociated program %s (id %d)\n", + aux->name, aux->id); + root->warned_unassoc_progs = true; + } + } + + return NULL; +} +#else /* CONFIG_EXT_SUB_SCHED */ +static inline struct scx_sched *scx_task_sched(const struct task_struct *p) +{ + return rcu_dereference_protected(scx_root, + lockdep_is_held(&p->pi_lock) || + lockdep_is_held(__rq_lockp(task_rq(p)))); +} + +static inline struct scx_sched *scx_task_sched_rcu(const struct task_struct *p) +{ + return rcu_dereference_all(scx_root); +} + +static inline bool scx_task_on_sched(struct scx_sched *sch, + const struct task_struct *p) +{ + return true; +} + +static struct scx_sched *scx_prog_sched(const struct bpf_prog_aux *aux) +{ + return rcu_dereference_all(scx_root); +} +#endif /* CONFIG_EXT_SUB_SCHED */ diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 88e0c93b9e21..9f63b15d309d 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -783,7 +783,6 @@ enum scx_rq_flags { SCX_RQ_ONLINE = 1 << 0, SCX_RQ_CAN_STOP_TICK = 1 << 1, SCX_RQ_BAL_KEEP = 1 << 3, /* balance decided to keep current */ - SCX_RQ_BYPASSING = 1 << 4, SCX_RQ_CLK_VALID = 1 << 5, /* RQ clock is fresh and valid */ SCX_RQ_BAL_CB_PENDING = 1 << 6, /* must queue a cb after dispatching */ @@ -799,8 +798,10 @@ struct scx_rq { u64 extra_enq_flags; /* see move_task_to_local_dsq() */ u32 nr_running; u32 cpuperf_target; /* [0, SCHED_CAPACITY_SCALE] */ + bool in_select_cpu; bool 
cpu_released; u32 flags; + u32 nr_immed; /* ENQ_IMMED tasks on local_dsq */ u64 clock; /* current per-rq clock -- see scx_bpf_now() */ cpumask_var_t cpus_to_kick; cpumask_var_t cpus_to_kick_if_idle; @@ -809,12 +810,17 @@ struct scx_rq { cpumask_var_t cpus_to_sync; bool kick_sync_pending; unsigned long kick_sync; - local_t reenq_local_deferred; + + struct task_struct *sub_dispatch_prev; + + raw_spinlock_t deferred_reenq_lock; + u64 deferred_reenq_locals_seq; + struct list_head deferred_reenq_locals; /* scheds requesting reenq of local DSQ */ + struct list_head deferred_reenq_users; /* user DSQs requesting reenq */ struct balance_callback deferred_bal_cb; struct balance_callback kick_sync_bal_cb; struct irq_work deferred_irq_work; struct irq_work kick_cpus_irq_work; - struct scx_dispatch_q bypass_dsq; }; #endif /* CONFIG_SCHED_CLASS_EXT */ diff --git a/tools/sched_ext/include/scx/bpf_arena_common.bpf.h b/tools/sched_ext/include/scx/bpf_arena_common.bpf.h index 4366fb3c91ce..2043d66940ea 100644 --- a/tools/sched_ext/include/scx/bpf_arena_common.bpf.h +++ b/tools/sched_ext/include/scx/bpf_arena_common.bpf.h @@ -15,7 +15,9 @@ #endif #if defined(__BPF_FEATURE_ADDR_SPACE_CAST) && !defined(BPF_ARENA_FORCE_ASM) +#ifndef __arena #define __arena __attribute__((address_space(1))) +#endif #define __arena_global __attribute__((address_space(1))) #define cast_kern(ptr) /* nop for bpf prog. emitted by LLVM */ #define cast_user(ptr) /* nop for bpf prog. emitted by LLVM */ @@ -81,12 +83,13 @@ void __arena* bpf_arena_alloc_pages(void *map, void __arena *addr, __u32 page_cnt, int node_id, __u64 flags) __ksym __weak; void bpf_arena_free_pages(void *map, void __arena *ptr, __u32 page_cnt) __ksym __weak; +int bpf_arena_reserve_pages(void *map, void __arena *ptr, __u32 page_cnt) __ksym __weak; /* * Note that cond_break can only be portably used in the body of a breakable * construct, whereas can_loop can be used anywhere. 
*/ -#ifdef TEST +#ifdef SCX_BPF_UNITTEST #define can_loop true #define __cond_break(expr) expr #else @@ -165,7 +168,7 @@ void bpf_arena_free_pages(void *map, void __arena *ptr, __u32 page_cnt) __ksym _ }) #endif /* __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__ */ #endif /* __BPF_FEATURE_MAY_GOTO */ -#endif /* TEST */ +#endif /* SCX_BPF_UNITTEST */ #define cond_break __cond_break(break) #define cond_break_label(label) __cond_break(goto label) @@ -173,3 +176,4 @@ void bpf_arena_free_pages(void *map, void __arena *ptr, __u32 page_cnt) __ksym _ void bpf_preempt_disable(void) __weak __ksym; void bpf_preempt_enable(void) __weak __ksym; +ssize_t bpf_arena_mapping_nr_pages(void *p__map) __weak __ksym; diff --git a/tools/sched_ext/include/scx/common.bpf.h b/tools/sched_ext/include/scx/common.bpf.h index 821d5791bd42..19459dedde41 100644 --- a/tools/sched_ext/include/scx/common.bpf.h +++ b/tools/sched_ext/include/scx/common.bpf.h @@ -291,6 +291,50 @@ BPF_PROG(name, ##args) }) #endif /* ARRAY_ELEM_PTR */ +/** + * __sink - Hide @expr's value from the compiler and BPF verifier + * @expr: The expression whose value should be opacified + * + * No-op at runtime. The empty inline assembly with a read-write constraint + * ("+g") has two effects at compile/verify time: + * + * 1. Compiler: treats @expr as both read and written, preventing dead-code + * elimination and keeping @expr (and any side effects that produced it) + * alive. + * + * 2. BPF verifier: forgets the precise value/range of @expr ("makes it + * imprecise"). The verifier normally tracks exact ranges for every register + * and stack slot. While useful, precision means each distinct value creates a + * separate verifier state. Inside loops this leads to state explosion - each + * iteration carries different precise values so states never merge and the + * verifier explores every iteration individually. 
+ * + * Example - preventing loop state explosion:: + * + * u32 nr_intersects = 0, nr_covered = 0; + * __sink(nr_intersects); + * __sink(nr_covered); + * bpf_for(i, 0, nr_nodes) { + * if (intersects(cpumask, node_mask[i])) + * nr_intersects++; + * if (covers(cpumask, node_mask[i])) + * nr_covered++; + * } + * + * Without __sink(), the verifier tracks every possible (nr_intersects, + * nr_covered) pair across iterations, causing "BPF program is too large". With + * __sink(), the values become unknown scalars so all iterations collapse into + * one reusable state. + * + * Example - keeping a reference alive:: + * + * struct task_struct *t = bpf_task_acquire(task); + * __sink(t); + * + * Follows the convention from BPF selftests (bpf_misc.h). + */ +#define __sink(expr) asm volatile ("" : "+g"(expr)) + /* * BPF declarations and helpers */ @@ -336,6 +380,7 @@ void bpf_task_release(struct task_struct *p) __ksym; /* cgroup */ struct cgroup *bpf_cgroup_ancestor(struct cgroup *cgrp, int level) __ksym; +struct cgroup *bpf_cgroup_acquire(struct cgroup *cgrp) __ksym; void bpf_cgroup_release(struct cgroup *cgrp) __ksym; struct cgroup *bpf_cgroup_from_id(u64 cgid) __ksym; @@ -741,6 +786,73 @@ static inline u64 __sqrt_u64(u64 x) return r; } +/* + * ctzll -- Counts trailing zeros in an unsigned long long. If the input value + * is zero, the return value is undefined. + */ +static inline int ctzll(u64 v) +{ +#if (!defined(__BPF__) && defined(__SCX_TARGET_ARCH_x86)) || \ + (defined(__BPF__) && defined(__clang_major__) && __clang_major__ >= 19) + /* + * Use the ctz builtin when: (1) building for native x86, or + * (2) building for BPF with clang >= 19 (BPF backend supports + * the intrinsic from clang 19 onward; earlier versions hit + * "unimplemented opcode" in the backend). + */ + return __builtin_ctzll(v); +#else + /* + * If neither the target architecture nor the toolchains support ctzll, + * use software-based emulation. 
Let's use the De Bruijn sequence-based + * approach to find LSB fastly. See the details of De Bruijn sequence: + * + * https://en.wikipedia.org/wiki/De_Bruijn_sequence + * https://www.chessprogramming.org/BitScan#De_Bruijn_Multiplication + */ + const int lookup_table[64] = { + 0, 1, 48, 2, 57, 49, 28, 3, 61, 58, 50, 42, 38, 29, 17, 4, + 62, 55, 59, 36, 53, 51, 43, 22, 45, 39, 33, 30, 24, 18, 12, 5, + 63, 47, 56, 27, 60, 41, 37, 16, 54, 35, 52, 21, 44, 32, 23, 11, + 46, 26, 40, 15, 34, 20, 31, 10, 25, 14, 19, 9, 13, 8, 7, 6, + }; + const u64 DEBRUIJN_CONSTANT = 0x03f79d71b4cb0a89ULL; + unsigned int index; + u64 lowest_bit; + const int *lt; + + if (v == 0) + return -1; + + /* + * Isolate the least significant bit (LSB). + * For example, if v = 0b...10100, then v & -v = 0b...00100 + */ + lowest_bit = v & -v; + + /* + * Each isolated bit produces a unique 6-bit value, guaranteed by the + * De Bruijn property. Calculate a unique index into the lookup table + * using the magic constant and a right shift. + * + * Multiplying by the 64-bit constant "spreads out" that 1-bit into a + * unique pattern in the top 6 bits. This uniqueness property is + * exactly what a De Bruijn sequence guarantees: Every possible 6-bit + * pattern (in top bits) occurs exactly once for each LSB position. So, + * the constant 0x03f79d71b4cb0a89ULL is carefully chosen to be a + * De Bruijn sequence, ensuring no collisions in the table index. + */ + index = (lowest_bit * DEBRUIJN_CONSTANT) >> 58; + + /* + * Lookup in a precomputed table. No collision is guaranteed by the + * De Bruijn property. + */ + lt = MEMBER_VPTR(lookup_table, [index]); + return (lt)? *lt : -1; +#endif +} + /* * Return a value proportionally scaled to the task's weight. */ @@ -758,6 +870,171 @@ static inline u64 scale_by_task_weight_inverse(const struct task_struct *p, u64 } +/* + * Get a random u64 from the kernel's pseudo-random generator. 
+ */ +static inline u64 get_prandom_u64() +{ + return ((u64)bpf_get_prandom_u32() << 32) | bpf_get_prandom_u32(); +} + +/* + * Define the shadow structure to avoid a compilation error when + * vmlinux.h does not enable necessary kernel configs. The ___local + * suffix is a CO-RE convention that tells the loader to match this + * against the base struct rq in the kernel. The attribute + * preserve_access_index tells the compiler to generate a CO-RE + * relocation for these fields. + */ +struct rq___local { + /* + * A monotonically increasing clock per CPU. It is rq->clock minus + * cumulative IRQ time and hypervisor steal time. Unlike rq->clock, + * it does not advance during IRQ processing or hypervisor preemption. + * It does advance during idle (the idle task counts as a running task + * for this purpose). + */ + u64 clock_task; + /* + * Invariant version of clock_task scaled by CPU capacity and + * frequency. For example, clock_pelt advances 2x slower on a CPU + * with half the capacity. + * + * At idle exit, rq->clock_pelt jumps forward to resync with + * clock_task. The kernel's rq_clock_pelt() corrects for this jump + * by subtracting lost_idle_time, yielding a clock that appears + * continuous across idle transitions. scx_clock_pelt() mirrors + * rq_clock_pelt() by performing the same subtraction. + */ + u64 clock_pelt; + /* + * Accumulates the magnitude of each clock_pelt jump at idle exit. + * Subtracting this from clock_pelt gives rq_clock_pelt(): a + * continuous, capacity-invariant clock suitable for both task + * execution time stamping and cross-idle measurements. + */ + unsigned long lost_idle_time; + /* + * Shadow of paravirt_steal_clock() (the hypervisor's cumulative + * stolen time counter). Stays frozen while the hypervisor preempts + * the vCPU; catches up the next time update_rq_clock_task() is + * called. The delta is the stolen time not yet subtracted from + * clock_task. 
+ * + * Unlike irqtime->total (a plain kernel-side field), the live stolen + * time counter lives in hypervisor-specific shared memory and has no + * kernel-side equivalent readable from BPF in a hypervisor-agnostic + * way. This field is therefore the only portable BPF-accessible + * approximation of cumulative steal time. + * + * Available only when CONFIG_PARAVIRT_TIME_ACCOUNTING is on. + */ + u64 prev_steal_time_rq; +} __attribute__((preserve_access_index)); + +extern struct rq runqueues __ksym; + +/* + * Define the shadow structure to avoid a compilation error when + * vmlinux.h does not enable necessary kernel configs. + */ +struct irqtime___local { + /* + * Cumulative IRQ time counter for this CPU, in nanoseconds. Advances + * immediately at the exit of every hardirq and non-ksoftirqd softirq + * via irqtime_account_irq(). ksoftirqd time is counted as normal + * task time and is NOT included. NMI time is also NOT included. + * + * The companion field irqtime->sync (struct u64_stats_sync) protects + * against 64-bit tearing on 32-bit architectures. On 64-bit kernels, + * u64_stats_sync is an empty struct and all seqcount operations are + * no-ops, so a plain BPF_CORE_READ of this field is safe. + * + * Available only when CONFIG_IRQ_TIME_ACCOUNTING is on. + */ + u64 total; +} __attribute__((preserve_access_index)); + +/* + * cpu_irqtime is a per-CPU variable defined only when + * CONFIG_IRQ_TIME_ACCOUNTING is on. Declare it as __weak so the BPF + * loader sets its address to 0 (rather than failing) when the symbol + * is absent from the running kernel. + */ +extern struct irqtime___local cpu_irqtime __ksym __weak; + +static inline struct rq___local *get_current_rq(u32 cpu) +{ + /* + * This is a workaround to get an rq pointer since we decided to + * deprecate scx_bpf_cpu_rq(). + * + * WARNING: The caller must hold the rq lock for @cpu. 
This is + * guaranteed when called from scheduling callbacks (ops.running, + * ops.stopping, ops.enqueue, ops.dequeue, ops.dispatch, etc.). + * There is no runtime check available in BPF for kernel spinlock + * state — correctness is enforced by calling context only. + */ + return (void *)bpf_per_cpu_ptr(&runqueues, cpu); +} + +static inline u64 scx_clock_task(u32 cpu) +{ + struct rq___local *rq = get_current_rq(cpu); + + /* Equivalent to the kernel's rq_clock_task(). */ + return rq ? rq->clock_task : 0; +} + +static inline u64 scx_clock_pelt(u32 cpu) +{ + struct rq___local *rq = get_current_rq(cpu); + + /* + * Equivalent to the kernel's rq_clock_pelt(): subtracts + * lost_idle_time from clock_pelt to absorb the jump that occurs + * when clock_pelt resyncs with clock_task at idle exit. The result + * is a continuous, capacity-invariant clock safe for both task + * execution time stamping and cross-idle measurements. + */ + return rq ? (rq->clock_pelt - rq->lost_idle_time) : 0; +} + +static inline u64 scx_clock_virt(u32 cpu) +{ + struct rq___local *rq; + + /* + * Check field existence before calling get_current_rq() so we avoid + * the per_cpu lookup entirely on kernels built without + * CONFIG_PARAVIRT_TIME_ACCOUNTING. + */ + if (!bpf_core_field_exists(((struct rq___local *)0)->prev_steal_time_rq)) + return 0; + + /* Lagging shadow of the kernel's paravirt_steal_clock(). */ + rq = get_current_rq(cpu); + return rq ? BPF_CORE_READ(rq, prev_steal_time_rq) : 0; +} + +static inline u64 scx_clock_irq(u32 cpu) +{ + struct irqtime___local *irqt; + + /* + * bpf_core_type_exists() resolves at load time: if struct irqtime is + * absent from kernel BTF (CONFIG_IRQ_TIME_ACCOUNTING off), the loader + * patches this into an unconditional return 0, making the + * bpf_per_cpu_ptr() call below dead code that the verifier never sees. + */ + if (!bpf_core_type_exists(struct irqtime___local)) + return 0; + + /* Equivalent to the kernel's irq_time_read(). 
*/ + irqt = bpf_per_cpu_ptr(&cpu_irqtime, cpu); + return irqt ? BPF_CORE_READ(irqt, total) : 0; +} + #include "compat.bpf.h" #include "enums.bpf.h" diff --git a/tools/sched_ext/include/scx/common.h b/tools/sched_ext/include/scx/common.h index b3c6372bcf81..60f5513787d6 100644 --- a/tools/sched_ext/include/scx/common.h +++ b/tools/sched_ext/include/scx/common.h @@ -67,6 +67,7 @@ typedef int64_t s64; bpf_map__set_value_size((__skel)->maps.elfsec##_##arr, \ sizeof((__skel)->elfsec##_##arr->arr[0]) * (n)); \ (__skel)->elfsec##_##arr = \ + (typeof((__skel)->elfsec##_##arr)) \ bpf_map__initial_value((__skel)->maps.elfsec##_##arr, &__sz); \ } while (0) @@ -74,10 +75,6 @@ typedef int64_t s64; #include "compat.h" #include "enums.h" -/* not available when building kernel tools/sched_ext */ -#if __has_include() #include "bpf_arena_common.h" -#include -#endif #endif /* __SCHED_EXT_COMMON_H */ diff --git a/tools/sched_ext/include/scx/compat.bpf.h b/tools/sched_ext/include/scx/compat.bpf.h index f2969c3061a7..8977b5a2caa1 100644 --- a/tools/sched_ext/include/scx/compat.bpf.h +++ b/tools/sched_ext/include/scx/compat.bpf.h @@ -28,8 +28,11 @@ struct cgroup *scx_bpf_task_cgroup___new(struct task_struct *p) __ksym __weak; * * scx_bpf_dispatch_from_dsq() and friends were added during v6.12 by * 4c30f5ce4f7a ("sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()"). + * + * v7.1: scx_bpf_dsq_move_to_local___v2() to add @enq_flags. 
*/ -bool scx_bpf_dsq_move_to_local___new(u64 dsq_id) __ksym __weak; +bool scx_bpf_dsq_move_to_local___v2(u64 dsq_id, u64 enq_flags) __ksym __weak; +bool scx_bpf_dsq_move_to_local___v1(u64 dsq_id) __ksym __weak; void scx_bpf_dsq_move_set_slice___new(struct bpf_iter_scx_dsq *it__iter, u64 slice) __ksym __weak; void scx_bpf_dsq_move_set_vtime___new(struct bpf_iter_scx_dsq *it__iter, u64 vtime) __ksym __weak; bool scx_bpf_dsq_move___new(struct bpf_iter_scx_dsq *it__iter, struct task_struct *p, u64 dsq_id, u64 enq_flags) __ksym __weak; @@ -41,10 +44,12 @@ void scx_bpf_dispatch_from_dsq_set_vtime___old(struct bpf_iter_scx_dsq *it__iter bool scx_bpf_dispatch_from_dsq___old(struct bpf_iter_scx_dsq *it__iter, struct task_struct *p, u64 dsq_id, u64 enq_flags) __ksym __weak; bool scx_bpf_dispatch_vtime_from_dsq___old(struct bpf_iter_scx_dsq *it__iter, struct task_struct *p, u64 dsq_id, u64 enq_flags) __ksym __weak; -#define scx_bpf_dsq_move_to_local(dsq_id) \ - (bpf_ksym_exists(scx_bpf_dsq_move_to_local___new) ? \ - scx_bpf_dsq_move_to_local___new((dsq_id)) : \ - scx_bpf_consume___old((dsq_id))) +#define scx_bpf_dsq_move_to_local(dsq_id, enq_flags) \ + (bpf_ksym_exists(scx_bpf_dsq_move_to_local___v2) ? \ + scx_bpf_dsq_move_to_local___v2((dsq_id), (enq_flags)) : \ + (bpf_ksym_exists(scx_bpf_dsq_move_to_local___v1) ? \ + scx_bpf_dsq_move_to_local___v1((dsq_id)) : \ + scx_bpf_consume___old((dsq_id)))) #define scx_bpf_dsq_move_set_slice(it__iter, slice) \ (bpf_ksym_exists(scx_bpf_dsq_move_set_slice___new) ? \ @@ -103,6 +108,19 @@ static inline struct task_struct *__COMPAT_scx_bpf_dsq_peek(u64 dsq_id) return p; } +/* + * v7.1: scx_bpf_sub_dispatch() for sub-sched dispatch. Preserve until + * we drop the compat layer for older kernels that lack the kfunc. 
+ */ +bool scx_bpf_sub_dispatch___compat(u64 cgroup_id) __ksym __weak; + +static inline bool scx_bpf_sub_dispatch(u64 cgroup_id) +{ + if (bpf_ksym_exists(scx_bpf_sub_dispatch___compat)) + return scx_bpf_sub_dispatch___compat(cgroup_id); + return false; +} + /** * __COMPAT_is_enq_cpu_selected - Test if SCX_ENQ_CPU_SELECTED is on * in a compatible way. We will preserve this __COMPAT helper until v6.16. @@ -266,6 +284,14 @@ scx_bpf_select_cpu_and(struct task_struct *p, s32 prev_cpu, u64 wake_flags, } } +/* + * scx_bpf_select_cpu_and() is now an inline wrapper. Use this instead of + * bpf_ksym_exists(scx_bpf_select_cpu_and) to test availability. + */ +#define __COMPAT_HAS_scx_bpf_select_cpu_and \ + (bpf_core_type_exists(struct scx_bpf_select_cpu_and_args) || \ + bpf_ksym_exists(scx_bpf_select_cpu_and___compat)) + /** * scx_bpf_dsq_insert_vtime - Insert a task into the vtime priority queue of a DSQ * @p: task_struct to insert @@ -375,6 +401,27 @@ static inline void scx_bpf_reenqueue_local(void) scx_bpf_reenqueue_local___v1(); } +/* + * v6.20: New scx_bpf_dsq_reenq() that allows re-enqueues on more DSQs. This + * will eventually deprecate scx_bpf_reenqueue_local(). + */ +void scx_bpf_dsq_reenq___compat(u64 dsq_id, u64 reenq_flags, const struct bpf_prog_aux *aux__prog) __ksym __weak; + +static inline bool __COMPAT_has_generic_reenq(void) +{ + return bpf_ksym_exists(scx_bpf_dsq_reenq___compat); +} + +static inline void scx_bpf_dsq_reenq(u64 dsq_id, u64 reenq_flags) +{ + if (bpf_ksym_exists(scx_bpf_dsq_reenq___compat)) + scx_bpf_dsq_reenq___compat(dsq_id, reenq_flags, NULL); + else if (dsq_id == SCX_DSQ_LOCAL && reenq_flags == 0) + scx_bpf_reenqueue_local(); + else + scx_bpf_error("kernel too old to reenqueue foreign local or user DSQs"); +} + /* * Define sched_ext_ops. This may be expanded to define multiple variants for * backward compatibility. See compat.h::SCX_OPS_LOAD/ATTACH(). 
diff --git a/tools/sched_ext/include/scx/compat.h b/tools/sched_ext/include/scx/compat.h index edccc99c7294..039854c490d5 100644 --- a/tools/sched_ext/include/scx/compat.h +++ b/tools/sched_ext/include/scx/compat.h @@ -8,6 +8,7 @@ #define __SCX_COMPAT_H #include +#include #include #include #include @@ -115,6 +116,7 @@ static inline bool __COMPAT_struct_has_field(const char *type, const char *field #define SCX_OPS_ENQ_MIGRATION_DISABLED SCX_OPS_FLAG(SCX_OPS_ENQ_MIGRATION_DISABLED) #define SCX_OPS_ALLOW_QUEUED_WAKEUP SCX_OPS_FLAG(SCX_OPS_ALLOW_QUEUED_WAKEUP) #define SCX_OPS_BUILTIN_IDLE_PER_NODE SCX_OPS_FLAG(SCX_OPS_BUILTIN_IDLE_PER_NODE) +#define SCX_OPS_ALWAYS_ENQ_IMMED SCX_OPS_FLAG(SCX_OPS_ALWAYS_ENQ_IMMED) #define SCX_PICK_IDLE_FLAG(name) __COMPAT_ENUM_OR_ZERO("scx_pick_idle_cpu_flags", #name) @@ -158,6 +160,7 @@ static inline long scx_hotplug_seq(void) * COMPAT: * - v6.17: ops.cgroup_set_bandwidth() * - v6.19: ops.cgroup_set_idle() + * - v7.1: ops.sub_attach(), ops.sub_detach(), ops.sub_cgroup_id */ #define SCX_OPS_OPEN(__ops_name, __scx_name) ({ \ struct __scx_name *__skel; \ @@ -179,18 +182,65 @@ static inline long scx_hotplug_seq(void) fprintf(stderr, "WARNING: kernel doesn't support ops.cgroup_set_idle()\n"); \ __skel->struct_ops.__ops_name->cgroup_set_idle = NULL; \ } \ + if (__skel->struct_ops.__ops_name->sub_attach && \ + !__COMPAT_struct_has_field("sched_ext_ops", "sub_attach")) { \ + fprintf(stderr, "WARNING: kernel doesn't support ops.sub_attach()\n"); \ + __skel->struct_ops.__ops_name->sub_attach = NULL; \ + } \ + if (__skel->struct_ops.__ops_name->sub_detach && \ + !__COMPAT_struct_has_field("sched_ext_ops", "sub_detach")) { \ + fprintf(stderr, "WARNING: kernel doesn't support ops.sub_detach()\n"); \ + __skel->struct_ops.__ops_name->sub_detach = NULL; \ + } \ + if (__skel->struct_ops.__ops_name->sub_cgroup_id > 0 && \ + !__COMPAT_struct_has_field("sched_ext_ops", "sub_cgroup_id")) { \ + fprintf(stderr, "WARNING: kernel doesn't support 
ops.sub_cgroup_id\n"); \ + __skel->struct_ops.__ops_name->sub_cgroup_id = 0; \ + } \ __skel; \ }) +/* + * Associate non-struct_ops BPF programs with the scheduler's struct_ops map so + * that scx_prog_sched() can determine which scheduler a BPF program belongs + * to. Requires libbpf >= 1.7. + */ +#if LIBBPF_MAJOR_VERSION > 1 || \ + (LIBBPF_MAJOR_VERSION == 1 && LIBBPF_MINOR_VERSION >= 7) +static inline void __scx_ops_assoc_prog(struct bpf_program *prog, + struct bpf_map *map, + const char *ops_name) +{ + s32 err = bpf_program__assoc_struct_ops(prog, map, NULL); + if (err) + fprintf(stderr, + "ERROR: Failed to associate %s with %s: %d\n", + bpf_program__name(prog), ops_name, err); +} +#else +static inline void __scx_ops_assoc_prog(struct bpf_program *prog, + struct bpf_map *map, + const char *ops_name) +{ +} +#endif + #define SCX_OPS_LOAD(__skel, __ops_name, __scx_name, __uei_name) ({ \ + struct bpf_program *__prog; \ UEI_SET_SIZE(__skel, __ops_name, __uei_name); \ SCX_BUG_ON(__scx_name##__load((__skel)), "Failed to load skel"); \ + bpf_object__for_each_program(__prog, (__skel)->obj) { \ + if (bpf_program__type(__prog) == BPF_PROG_TYPE_STRUCT_OPS) \ + continue; \ + __scx_ops_assoc_prog(__prog, (__skel)->maps.__ops_name, \ + #__ops_name); \ + } \ }) /* * New versions of bpftool now emit additional link placeholders for BPF maps, * and set up BPF skeleton in such a way that libbpf will auto-attach BPF maps - * automatically, assumming libbpf is recent enough (v1.5+). Old libbpf will do + * automatically, assuming libbpf is recent enough (v1.5+). Old libbpf will do * nothing with those links and won't attempt to auto-attach maps. 
* * To maintain compatibility with older libbpf while avoiding trying to attach diff --git a/tools/sched_ext/include/scx/enum_defs.autogen.h b/tools/sched_ext/include/scx/enum_defs.autogen.h index c2c33df9292c..da4b459820fd 100644 --- a/tools/sched_ext/include/scx/enum_defs.autogen.h +++ b/tools/sched_ext/include/scx/enum_defs.autogen.h @@ -14,18 +14,27 @@ #define HAVE_SCX_EXIT_MSG_LEN #define HAVE_SCX_EXIT_DUMP_DFL_LEN #define HAVE_SCX_CPUPERF_ONE -#define HAVE_SCX_OPS_TASK_ITER_BATCH +#define HAVE_SCX_TASK_ITER_BATCH +#define HAVE_SCX_BYPASS_HOST_NTH +#define HAVE_SCX_BYPASS_LB_DFL_INTV_US +#define HAVE_SCX_BYPASS_LB_DONOR_PCT +#define HAVE_SCX_BYPASS_LB_MIN_DELTA_DIV +#define HAVE_SCX_BYPASS_LB_BATCH +#define HAVE_SCX_REENQ_LOCAL_MAX_REPEAT +#define HAVE_SCX_SUB_MAX_DEPTH #define HAVE_SCX_CPU_PREEMPT_RT #define HAVE_SCX_CPU_PREEMPT_DL #define HAVE_SCX_CPU_PREEMPT_STOP #define HAVE_SCX_CPU_PREEMPT_UNKNOWN #define HAVE_SCX_DEQ_SLEEP #define HAVE_SCX_DEQ_CORE_SCHED_EXEC +#define HAVE_SCX_DEQ_SCHED_CHANGE #define HAVE_SCX_DSQ_FLAG_BUILTIN #define HAVE_SCX_DSQ_FLAG_LOCAL_ON #define HAVE_SCX_DSQ_INVALID #define HAVE_SCX_DSQ_GLOBAL #define HAVE_SCX_DSQ_LOCAL +#define HAVE_SCX_DSQ_BYPASS #define HAVE_SCX_DSQ_LOCAL_ON #define HAVE_SCX_DSQ_LOCAL_CPU_MASK #define HAVE_SCX_DSQ_ITER_REV @@ -35,31 +44,55 @@ #define HAVE___SCX_DSQ_ITER_ALL_FLAGS #define HAVE_SCX_DSQ_LNODE_ITER_CURSOR #define HAVE___SCX_DSQ_LNODE_PRIV_SHIFT +#define HAVE_SCX_ENABLING +#define HAVE_SCX_ENABLED +#define HAVE_SCX_DISABLING +#define HAVE_SCX_DISABLED #define HAVE_SCX_ENQ_WAKEUP #define HAVE_SCX_ENQ_HEAD #define HAVE_SCX_ENQ_CPU_SELECTED #define HAVE_SCX_ENQ_PREEMPT +#define HAVE_SCX_ENQ_IMMED #define HAVE_SCX_ENQ_REENQ #define HAVE_SCX_ENQ_LAST #define HAVE___SCX_ENQ_INTERNAL_MASK #define HAVE_SCX_ENQ_CLEAR_OPSS #define HAVE_SCX_ENQ_DSQ_PRIQ +#define HAVE_SCX_ENQ_NESTED +#define HAVE_SCX_ENQ_GDSQ_FALLBACK #define HAVE_SCX_TASK_DSQ_ON_PRIQ #define HAVE_SCX_TASK_QUEUED +#define 
HAVE_SCX_TASK_IN_CUSTODY #define HAVE_SCX_TASK_RESET_RUNNABLE_AT #define HAVE_SCX_TASK_DEQD_FOR_SLEEP +#define HAVE_SCX_TASK_SUB_INIT +#define HAVE_SCX_TASK_IMMED #define HAVE_SCX_TASK_STATE_SHIFT #define HAVE_SCX_TASK_STATE_BITS #define HAVE_SCX_TASK_STATE_MASK +#define HAVE_SCX_TASK_NONE +#define HAVE_SCX_TASK_INIT +#define HAVE_SCX_TASK_READY +#define HAVE_SCX_TASK_ENABLED +#define HAVE_SCX_TASK_REENQ_REASON_SHIFT +#define HAVE_SCX_TASK_REENQ_REASON_BITS +#define HAVE_SCX_TASK_REENQ_REASON_MASK +#define HAVE_SCX_TASK_REENQ_NONE +#define HAVE_SCX_TASK_REENQ_KFUNC +#define HAVE_SCX_TASK_REENQ_IMMED +#define HAVE_SCX_TASK_REENQ_PREEMPTED #define HAVE_SCX_TASK_CURSOR #define HAVE_SCX_ECODE_RSN_HOTPLUG +#define HAVE_SCX_ECODE_RSN_CGROUP_OFFLINE #define HAVE_SCX_ECODE_ACT_RESTART +#define HAVE_SCX_EFLAG_INITIALIZED #define HAVE_SCX_EXIT_NONE #define HAVE_SCX_EXIT_DONE #define HAVE_SCX_EXIT_UNREG #define HAVE_SCX_EXIT_UNREG_BPF #define HAVE_SCX_EXIT_UNREG_KERN #define HAVE_SCX_EXIT_SYSRQ +#define HAVE_SCX_EXIT_PARENT #define HAVE_SCX_EXIT_ERROR #define HAVE_SCX_EXIT_ERROR_BPF #define HAVE_SCX_EXIT_ERROR_STALL @@ -80,40 +113,42 @@ #define HAVE_SCX_OPI_CPU_HOTPLUG_BEGIN #define HAVE_SCX_OPI_CPU_HOTPLUG_END #define HAVE_SCX_OPI_END -#define HAVE_SCX_OPS_ENABLING -#define HAVE_SCX_OPS_ENABLED -#define HAVE_SCX_OPS_DISABLING -#define HAVE_SCX_OPS_DISABLED #define HAVE_SCX_OPS_KEEP_BUILTIN_IDLE #define HAVE_SCX_OPS_ENQ_LAST #define HAVE_SCX_OPS_ENQ_EXITING #define HAVE_SCX_OPS_SWITCH_PARTIAL #define HAVE_SCX_OPS_ENQ_MIGRATION_DISABLED #define HAVE_SCX_OPS_ALLOW_QUEUED_WAKEUP -#define HAVE_SCX_OPS_HAS_CGROUP_WEIGHT +#define HAVE_SCX_OPS_BUILTIN_IDLE_PER_NODE +#define HAVE_SCX_OPS_ALWAYS_ENQ_IMMED #define HAVE_SCX_OPS_ALL_FLAGS +#define HAVE___SCX_OPS_INTERNAL_MASK +#define HAVE_SCX_OPS_HAS_CPU_PREEMPT #define HAVE_SCX_OPSS_NONE #define HAVE_SCX_OPSS_QUEUEING #define HAVE_SCX_OPSS_QUEUED #define HAVE_SCX_OPSS_DISPATCHING #define HAVE_SCX_OPSS_QSEQ_SHIFT #define 
HAVE_SCX_PICK_IDLE_CORE +#define HAVE_SCX_PICK_IDLE_IN_NODE #define HAVE_SCX_OPS_NAME_LEN #define HAVE_SCX_SLICE_DFL +#define HAVE_SCX_SLICE_BYPASS #define HAVE_SCX_SLICE_INF +#define HAVE_SCX_REENQ_ANY +#define HAVE___SCX_REENQ_FILTER_MASK +#define HAVE___SCX_REENQ_USER_MASK +#define HAVE_SCX_REENQ_TSR_RQ_OPEN +#define HAVE_SCX_REENQ_TSR_NOT_FIRST +#define HAVE___SCX_REENQ_TSR_MASK #define HAVE_SCX_RQ_ONLINE #define HAVE_SCX_RQ_CAN_STOP_TICK -#define HAVE_SCX_RQ_BAL_PENDING #define HAVE_SCX_RQ_BAL_KEEP -#define HAVE_SCX_RQ_BYPASSING #define HAVE_SCX_RQ_CLK_VALID +#define HAVE_SCX_RQ_BAL_CB_PENDING #define HAVE_SCX_RQ_IN_WAKEUP #define HAVE_SCX_RQ_IN_BALANCE -#define HAVE_SCX_TASK_NONE -#define HAVE_SCX_TASK_INIT -#define HAVE_SCX_TASK_READY -#define HAVE_SCX_TASK_ENABLED -#define HAVE_SCX_TASK_NR_STATES +#define HAVE_SCX_SCHED_PCPU_BYPASSING #define HAVE_SCX_TG_ONLINE #define HAVE_SCX_TG_INITED #define HAVE_SCX_WAKE_FORK diff --git a/tools/sched_ext/include/scx/enums.autogen.bpf.h b/tools/sched_ext/include/scx/enums.autogen.bpf.h index 2f8002bcc19a..dafccbb6b69d 100644 --- a/tools/sched_ext/include/scx/enums.autogen.bpf.h +++ b/tools/sched_ext/include/scx/enums.autogen.bpf.h @@ -67,6 +67,12 @@ const volatile u64 __SCX_TASK_RESET_RUNNABLE_AT __weak; const volatile u64 __SCX_TASK_DEQD_FOR_SLEEP __weak; #define SCX_TASK_DEQD_FOR_SLEEP __SCX_TASK_DEQD_FOR_SLEEP +const volatile u64 __SCX_TASK_SUB_INIT __weak; +#define SCX_TASK_SUB_INIT __SCX_TASK_SUB_INIT + +const volatile u64 __SCX_TASK_IMMED __weak; +#define SCX_TASK_IMMED __SCX_TASK_IMMED + const volatile u64 __SCX_TASK_STATE_SHIFT __weak; #define SCX_TASK_STATE_SHIFT __SCX_TASK_STATE_SHIFT @@ -115,6 +121,9 @@ const volatile u64 __SCX_ENQ_HEAD __weak; const volatile u64 __SCX_ENQ_PREEMPT __weak; #define SCX_ENQ_PREEMPT __SCX_ENQ_PREEMPT +const volatile u64 __SCX_ENQ_IMMED __weak; +#define SCX_ENQ_IMMED __SCX_ENQ_IMMED + const volatile u64 __SCX_ENQ_REENQ __weak; #define SCX_ENQ_REENQ __SCX_ENQ_REENQ @@ -127,3 +136,5 
@@ const volatile u64 __SCX_ENQ_CLEAR_OPSS __weak; const volatile u64 __SCX_ENQ_DSQ_PRIQ __weak; #define SCX_ENQ_DSQ_PRIQ __SCX_ENQ_DSQ_PRIQ +const volatile u64 __SCX_DEQ_SCHED_CHANGE __weak; +#define SCX_DEQ_SCHED_CHANGE __SCX_DEQ_SCHED_CHANGE diff --git a/tools/sched_ext/include/scx/enums.autogen.h b/tools/sched_ext/include/scx/enums.autogen.h index fedec938584b..bbd4901f4fce 100644 --- a/tools/sched_ext/include/scx/enums.autogen.h +++ b/tools/sched_ext/include/scx/enums.autogen.h @@ -26,6 +26,8 @@ SCX_ENUM_SET(skel, scx_ent_flags, SCX_TASK_QUEUED); \ SCX_ENUM_SET(skel, scx_ent_flags, SCX_TASK_RESET_RUNNABLE_AT); \ SCX_ENUM_SET(skel, scx_ent_flags, SCX_TASK_DEQD_FOR_SLEEP); \ + SCX_ENUM_SET(skel, scx_ent_flags, SCX_TASK_SUB_INIT); \ + SCX_ENUM_SET(skel, scx_ent_flags, SCX_TASK_IMMED); \ SCX_ENUM_SET(skel, scx_ent_flags, SCX_TASK_STATE_SHIFT); \ SCX_ENUM_SET(skel, scx_ent_flags, SCX_TASK_STATE_BITS); \ SCX_ENUM_SET(skel, scx_ent_flags, SCX_TASK_STATE_MASK); \ @@ -42,8 +44,10 @@ SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_WAKEUP); \ SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_HEAD); \ SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_PREEMPT); \ + SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_IMMED); \ SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_REENQ); \ SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_LAST); \ SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_CLEAR_OPSS); \ SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_DSQ_PRIQ); \ + SCX_ENUM_SET(skel, scx_deq_flags, SCX_DEQ_SCHED_CHANGE); \ } while (0) diff --git a/tools/sched_ext/include/scx/enums.h b/tools/sched_ext/include/scx/enums.h index 8e7c91575f0b..c3b09acce824 100644 --- a/tools/sched_ext/include/scx/enums.h +++ b/tools/sched_ext/include/scx/enums.h @@ -9,7 +9,7 @@ #ifndef __SCX_ENUMS_H #define __SCX_ENUMS_H -static inline void __ENUM_set(u64 *val, char *type, char *name) +static inline void __ENUM_set(u64 *val, const char *type, const char *name) { bool res; diff --git a/tools/sched_ext/scx_central.bpf.c 
b/tools/sched_ext/scx_central.bpf.c index 1c2376b75b5d..4efcce099bd5 100644 --- a/tools/sched_ext/scx_central.bpf.c +++ b/tools/sched_ext/scx_central.bpf.c @@ -60,6 +60,7 @@ const volatile u32 nr_cpu_ids = 1; /* !0 for veristat, set during init */ const volatile u64 slice_ns; bool timer_pinned = true; +bool timer_started; u64 nr_total, nr_locals, nr_queued, nr_lost_pids; u64 nr_timers, nr_dispatches, nr_mismatches, nr_retries; u64 nr_overflows; @@ -179,9 +180,47 @@ static bool dispatch_to_cpu(s32 cpu) return false; } +static void start_central_timer(void) +{ + struct bpf_timer *timer; + u32 key = 0; + int ret; + + if (likely(timer_started)) + return; + + timer = bpf_map_lookup_elem(&central_timer, &key); + if (!timer) { + scx_bpf_error("failed to lookup central timer"); + return; + } + + ret = bpf_timer_start(timer, TIMER_INTERVAL_NS, BPF_F_TIMER_CPU_PIN); + /* + * BPF_F_TIMER_CPU_PIN is pretty new (>=6.7). If we're running in a + * kernel which doesn't have it, bpf_timer_start() will return -EINVAL. + * Retry without the PIN. This would be the perfect use case for + * bpf_core_enum_value_exists() but the enum type doesn't have a name + * and can't be used with bpf_core_enum_value_exists(). Oh well...
+ */ + if (ret == -EINVAL) { + timer_pinned = false; + ret = bpf_timer_start(timer, TIMER_INTERVAL_NS, 0); + } + + if (ret) { + scx_bpf_error("bpf_timer_start failed (%d)", ret); + return; + } + + timer_started = true; +} + void BPF_STRUCT_OPS(central_dispatch, s32 cpu, struct task_struct *prev) { if (cpu == central_cpu) { + start_central_timer(); + /* dispatch for all other CPUs first */ __sync_fetch_and_add(&nr_dispatches, 1); @@ -214,13 +253,13 @@ void BPF_STRUCT_OPS(central_dispatch, s32 cpu, struct task_struct *prev) } /* look for a task to run on the central CPU */ - if (scx_bpf_dsq_move_to_local(FALLBACK_DSQ_ID)) + if (scx_bpf_dsq_move_to_local(FALLBACK_DSQ_ID, 0)) return; dispatch_to_cpu(central_cpu); } else { bool *gimme; - if (scx_bpf_dsq_move_to_local(FALLBACK_DSQ_ID)) + if (scx_bpf_dsq_move_to_local(FALLBACK_DSQ_ID, 0)) return; gimme = ARRAY_ELEM_PTR(cpu_gimme_task, cpu, nr_cpu_ids); @@ -310,29 +349,12 @@ int BPF_STRUCT_OPS_SLEEPABLE(central_init) if (!timer) return -ESRCH; - if (bpf_get_smp_processor_id() != central_cpu) { - scx_bpf_error("init from non-central CPU"); - return -EINVAL; - } - bpf_timer_init(timer, &central_timer, CLOCK_MONOTONIC); bpf_timer_set_callback(timer, central_timerfn); - ret = bpf_timer_start(timer, TIMER_INTERVAL_NS, BPF_F_TIMER_CPU_PIN); - /* - * BPF_F_TIMER_CPU_PIN is pretty new (>=6.7). If we're running in a - * kernel which doesn't have it, bpf_timer_start() will return -EINVAL. - * Retry without the PIN. This would be the perfect use case for - * bpf_core_enum_value_exists() but the enum type doesn't have a name - * and can't be used with bpf_core_enum_value_exists(). Oh well...
- */ - if (ret == -EINVAL) { - timer_pinned = false; - ret = bpf_timer_start(timer, TIMER_INTERVAL_NS, 0); - } - if (ret) - scx_bpf_error("bpf_timer_start failed (%d)", ret); - return ret; + scx_bpf_kick_cpu(central_cpu, 0); + + return 0; } void BPF_STRUCT_OPS(central_exit, struct scx_exit_info *ei) diff --git a/tools/sched_ext/scx_central.c b/tools/sched_ext/scx_central.c index 710fa03376e2..4a72df39500d 100644 --- a/tools/sched_ext/scx_central.c +++ b/tools/sched_ext/scx_central.c @@ -5,7 +5,6 @@ * Copyright (c) 2022 David Vernet */ #define _GNU_SOURCE -#include #include #include #include @@ -21,7 +20,7 @@ const char help_fmt[] = "\n" "See the top-level comment in .bpf.c for more details.\n" "\n" -"Usage: %s [-s SLICE_US] [-c CPU]\n" +"Usage: %s [-s SLICE_US] [-c CPU] [-v]\n" "\n" " -s SLICE_US Override slice duration\n" " -c CPU Override the central CPU (default: 0)\n" @@ -49,8 +48,6 @@ int main(int argc, char **argv) struct bpf_link *link; __u64 seq = 0, ecode; __s32 opt; - cpu_set_t *cpuset; - size_t cpuset_size; libbpf_set_print(libbpf_print_fn); signal(SIGINT, sigint_handler); @@ -96,27 +93,6 @@ restart: SCX_OPS_LOAD(skel, central_ops, scx_central, uei); - /* - * Affinitize the loading thread to the central CPU, as: - * - That's where the BPF timer is first invoked in the BPF program. - * - We probably don't want this user space component to take up a core - * from a task that would benefit from avoiding preemption on one of - * the tickless cores. - * - * Until BPF supports pinning the timer, it's not guaranteed that it - * will always be invoked on the central CPU. In practice, this - * suffices the majority of the time. 
- */ - cpuset = CPU_ALLOC(skel->rodata->nr_cpu_ids); - SCX_BUG_ON(!cpuset, "Failed to allocate cpuset"); - cpuset_size = CPU_ALLOC_SIZE(skel->rodata->nr_cpu_ids); - CPU_ZERO_S(cpuset_size, cpuset); - CPU_SET_S(skel->rodata->central_cpu, cpuset_size, cpuset); - SCX_BUG_ON(sched_setaffinity(0, cpuset_size, cpuset), - "Failed to affinitize to central CPU %d (max %d)", - skel->rodata->central_cpu, skel->rodata->nr_cpu_ids - 1); - CPU_FREE(cpuset); - link = SCX_OPS_ATTACH(skel, central_ops, scx_central); if (!skel->data->timer_pinned) diff --git a/tools/sched_ext/scx_cpu0.bpf.c b/tools/sched_ext/scx_cpu0.bpf.c index 9b67ab11b04c..0b1a7ce879b0 100644 --- a/tools/sched_ext/scx_cpu0.bpf.c +++ b/tools/sched_ext/scx_cpu0.bpf.c @@ -66,7 +66,7 @@ void BPF_STRUCT_OPS(cpu0_enqueue, struct task_struct *p, u64 enq_flags) void BPF_STRUCT_OPS(cpu0_dispatch, s32 cpu, struct task_struct *prev) { if (cpu == 0) - scx_bpf_dsq_move_to_local(DSQ_CPU0); + scx_bpf_dsq_move_to_local(DSQ_CPU0, 0); } s32 BPF_STRUCT_OPS_SLEEPABLE(cpu0_init) diff --git a/tools/sched_ext/scx_flatcg.bpf.c b/tools/sched_ext/scx_flatcg.bpf.c index 0e785cff0f24..fec359581826 100644 --- a/tools/sched_ext/scx_flatcg.bpf.c +++ b/tools/sched_ext/scx_flatcg.bpf.c @@ -18,7 +18,7 @@ * 100/(100+100) == 1/2. At its parent level, A is competing against D and A's * share in that competition is 100/(200+100) == 1/3. B's eventual share in the * system can be calculated by multiplying the two shares, 1/2 * 1/3 == 1/6. C's - * eventual shaer is the same at 1/6. D is only competing at the top level and + * eventual share is the same at 1/6. D is only competing at the top level and * its share is 200/(100+200) == 2/3. * * So, instead of hierarchically scheduling level-by-level, we can consider it @@ -551,9 +551,11 @@ void BPF_STRUCT_OPS(fcg_stopping, struct task_struct *p, bool runnable) * too much, determine the execution time by taking explicit timestamps * instead of depending on @p->scx.slice. 
*/ - if (!fifo_sched) - p->scx.dsq_vtime += - (SCX_SLICE_DFL - p->scx.slice) * 100 / p->scx.weight; + if (!fifo_sched) { + u64 delta = scale_by_task_weight_inverse(p, SCX_SLICE_DFL - p->scx.slice); + + scx_bpf_task_set_dsq_vtime(p, p->scx.dsq_vtime + delta); + } taskc = bpf_task_storage_get(&task_ctx, p, 0, 0); if (!taskc) { @@ -660,7 +662,7 @@ static bool try_pick_next_cgroup(u64 *cgidp) goto out_free; } - if (!scx_bpf_dsq_move_to_local(cgid)) { + if (!scx_bpf_dsq_move_to_local(cgid, 0)) { bpf_cgroup_release(cgrp); stat_inc(FCG_STAT_PNC_EMPTY); goto out_stash; @@ -740,7 +742,7 @@ void BPF_STRUCT_OPS(fcg_dispatch, s32 cpu, struct task_struct *prev) goto pick_next_cgroup; if (time_before(now, cpuc->cur_at + cgrp_slice_ns)) { - if (scx_bpf_dsq_move_to_local(cpuc->cur_cgid)) { + if (scx_bpf_dsq_move_to_local(cpuc->cur_cgid, 0)) { stat_inc(FCG_STAT_CNS_KEEP); return; } @@ -780,7 +782,7 @@ void BPF_STRUCT_OPS(fcg_dispatch, s32 cpu, struct task_struct *prev) pick_next_cgroup: cpuc->cur_at = now; - if (scx_bpf_dsq_move_to_local(FALLBACK_DSQ)) { + if (scx_bpf_dsq_move_to_local(FALLBACK_DSQ, 0)) { cpuc->cur_cgid = 0; return; } @@ -822,7 +824,7 @@ s32 BPF_STRUCT_OPS(fcg_init_task, struct task_struct *p, if (!(cgc = find_cgrp_ctx(args->cgroup))) return -ENOENT; - p->scx.dsq_vtime = cgc->tvtime_now; + scx_bpf_task_set_dsq_vtime(p, cgc->tvtime_now); return 0; } @@ -919,12 +921,12 @@ void BPF_STRUCT_OPS(fcg_cgroup_move, struct task_struct *p, struct fcg_cgrp_ctx *from_cgc, *to_cgc; s64 delta; - /* find_cgrp_ctx() triggers scx_ops_error() on lookup failures */ + /* find_cgrp_ctx() triggers scx_bpf_error() on lookup failures */ if (!(from_cgc = find_cgrp_ctx(from)) || !(to_cgc = find_cgrp_ctx(to))) return; delta = time_delta(p->scx.dsq_vtime, from_cgc->tvtime_now); - p->scx.dsq_vtime = to_cgc->tvtime_now + delta; + scx_bpf_task_set_dsq_vtime(p, to_cgc->tvtime_now + delta); } s32 BPF_STRUCT_OPS_SLEEPABLE(fcg_init) @@ -960,5 +962,5 @@ SCX_OPS_DEFINE(flatcg_ops, .cgroup_move = (void 
*)fcg_cgroup_move, .init = (void *)fcg_init, .exit = (void *)fcg_exit, - .flags = SCX_OPS_HAS_CGROUP_WEIGHT | SCX_OPS_ENQ_EXITING, + .flags = SCX_OPS_ENQ_EXITING, .name = "flatcg"); diff --git a/tools/sched_ext/scx_pair.c b/tools/sched_ext/scx_pair.c index 2e509391f3da..41b136d43a55 100644 --- a/tools/sched_ext/scx_pair.c +++ b/tools/sched_ext/scx_pair.c @@ -21,7 +21,7 @@ const char help_fmt[] = "\n" "See the top-level comment in .bpf.c for more details.\n" "\n" -"Usage: %s [-S STRIDE]\n" +"Usage: %s [-S STRIDE] [-v]\n" "\n" " -S STRIDE Override CPU pair stride (default: nr_cpus_ids / 2)\n" " -v Print libbpf debug messages\n" @@ -48,6 +48,7 @@ int main(int argc, char **argv) struct bpf_link *link; __u64 seq = 0, ecode; __s32 stride, i, opt, outer_fd; + __u32 pair_id = 0; libbpf_set_print(libbpf_print_fn); signal(SIGINT, sigint_handler); @@ -82,6 +83,14 @@ restart: scx_pair__destroy(skel); return -1; } + + if (skel->rodata->nr_cpu_ids & 1) { + fprintf(stderr, "scx_pair requires an even CPU count, got %u\n", + skel->rodata->nr_cpu_ids); + scx_pair__destroy(skel); + return -1; + } + bpf_map__set_max_entries(skel->maps.pair_ctx, skel->rodata->nr_cpu_ids / 2); /* Resize arrays so their element count is equal to cpu count. */ @@ -109,10 +118,11 @@ restart: skel->rodata_pair_cpu->pair_cpu[i] = j; skel->rodata_pair_cpu->pair_cpu[j] = i; - skel->rodata_pair_id->pair_id[i] = i; - skel->rodata_pair_id->pair_id[j] = i; + skel->rodata_pair_id->pair_id[i] = pair_id; + skel->rodata_pair_id->pair_id[j] = pair_id; skel->rodata_in_pair_idx->in_pair_idx[i] = 0; skel->rodata_in_pair_idx->in_pair_idx[j] = 1; + pair_id++; printf("[%d, %d] ", i, j); } diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c index d51d8c38f1cf..b68abb9e760b 100644 --- a/tools/sched_ext/scx_qmap.bpf.c +++ b/tools/sched_ext/scx_qmap.bpf.c @@ -11,8 +11,6 @@ * * - BPF-side queueing using PIDs. * - Sleepable per-task storage allocation using ops.prep_enable(). 
- * - Using ops.cpu_release() to handle a higher priority scheduling class taking - * the CPU away. * - Core-sched support. * * This scheduler is primarily for demonstration and testing of sched_ext @@ -26,8 +24,11 @@ enum consts { ONE_SEC_IN_NS = 1000000000, + ONE_MSEC_IN_NS = 1000000, + LOWPRI_INTV_NS = 10 * ONE_MSEC_IN_NS, SHARED_DSQ = 0, HIGHPRI_DSQ = 1, + LOWPRI_DSQ = 2, HIGHPRI_WEIGHT = 8668, /* this is what -20 maps to */ }; @@ -41,12 +42,18 @@ const volatile u32 dsp_batch; const volatile bool highpri_boosting; const volatile bool print_dsqs_and_events; const volatile bool print_msgs; +const volatile u64 sub_cgroup_id; const volatile s32 disallow_tgid; const volatile bool suppress_dump; +const volatile bool always_enq_immed; +const volatile u32 immed_stress_nth; u64 nr_highpri_queued; u32 test_error_cnt; +#define MAX_SUB_SCHEDS 8 +u64 sub_sched_cgroup_ids[MAX_SUB_SCHEDS]; + UEI_DEFINE(uei); struct qmap { @@ -127,7 +134,7 @@ struct { } cpu_ctx_stor SEC(".maps"); /* Statistics */ -u64 nr_enqueued, nr_dispatched, nr_reenqueued, nr_dequeued, nr_ddsp_from_enq; +u64 nr_enqueued, nr_dispatched, nr_reenqueued, nr_reenqueued_cpu0, nr_dequeued, nr_ddsp_from_enq; u64 nr_core_sched_execed; u64 nr_expedited_local, nr_expedited_remote, nr_expedited_lost, nr_expedited_from_timer; u32 cpuperf_min, cpuperf_avg, cpuperf_max; @@ -137,8 +144,10 @@ static s32 pick_direct_dispatch_cpu(struct task_struct *p, s32 prev_cpu) { s32 cpu; - if (p->nr_cpus_allowed == 1 || - scx_bpf_test_and_clear_cpu_idle(prev_cpu)) + if (!always_enq_immed && p->nr_cpus_allowed == 1) + return prev_cpu; + + if (scx_bpf_test_and_clear_cpu_idle(prev_cpu)) return prev_cpu; cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, 0); @@ -168,6 +177,9 @@ s32 BPF_STRUCT_OPS(qmap_select_cpu, struct task_struct *p, if (!(tctx = lookup_task_ctx(p))) return -ESRCH; + if (p->scx.weight < 2 && !(p->flags & PF_KTHREAD)) + return prev_cpu; + cpu = pick_direct_dispatch_cpu(p, prev_cpu); if (cpu >= 0) { @@ -202,8 +214,11 @@ void 
BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags) void *ring; s32 cpu; - if (enq_flags & SCX_ENQ_REENQ) + if (enq_flags & SCX_ENQ_REENQ) { __sync_fetch_and_add(&nr_reenqueued, 1); + if (scx_bpf_task_cpu(p) == 0) + __sync_fetch_and_add(&nr_reenqueued_cpu0, 1); + } if (p->flags & PF_KTHREAD) { if (stall_kernel_nth && !(++kernel_cnt % stall_kernel_nth)) @@ -225,6 +240,22 @@ void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags) */ tctx->core_sched_seq = core_sched_tail_seqs[idx]++; + /* + * IMMED stress testing: Every immed_stress_nth'th enqueue, dispatch + * directly to prev_cpu's local DSQ even when busy to force dsq->nr > 1 + * and exercise the kernel IMMED reenqueue trigger paths. + */ + if (immed_stress_nth && !(enq_flags & SCX_ENQ_REENQ)) { + static u32 immed_stress_cnt; + + if (!(++immed_stress_cnt % immed_stress_nth)) { + tctx->force_local = false; + scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL_ON | scx_bpf_task_cpu(p), + slice_ns, enq_flags); + return; + } + } + /* * If qmap_select_cpu() is telling us to or this is the last runnable * task on the CPU, enqueue locally. 
@@ -235,6 +266,13 @@ void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags) return; } + /* see lowpri_timerfn() */ + if (__COMPAT_has_generic_reenq() && + p->scx.weight < 2 && !(p->flags & PF_KTHREAD) && !(enq_flags & SCX_ENQ_REENQ)) { + scx_bpf_dsq_insert(p, LOWPRI_DSQ, slice_ns, enq_flags); + return; + } + /* if select_cpu() wasn't called, try direct dispatch */ if (!__COMPAT_is_enq_cpu_selected(enq_flags) && (cpu = pick_direct_dispatch_cpu(p, scx_bpf_task_cpu(p))) >= 0) { @@ -375,7 +413,7 @@ void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev) if (dispatch_highpri(false)) return; - if (!nr_highpri_queued && scx_bpf_dsq_move_to_local(SHARED_DSQ)) + if (!nr_highpri_queued && scx_bpf_dsq_move_to_local(SHARED_DSQ, 0)) return; if (dsp_inf_loop_after && nr_dispatched > dsp_inf_loop_after) { @@ -433,6 +471,46 @@ void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev) __sync_fetch_and_add(&nr_dispatched, 1); scx_bpf_dsq_insert(p, SHARED_DSQ, slice_ns, 0); + + /* + * scx_qmap uses a global BPF queue that any CPU's + * dispatch can pop from. If this CPU popped a task that + * can't run here, it gets stranded on SHARED_DSQ after + * consume_dispatch_q() skips it. Kick the task's home + * CPU so it drains SHARED_DSQ. + * + * There's a race between the pop and the flush of the + * buffered dsq_insert: + * + * CPU 0 (dispatching) CPU 1 (home, idle) + * ~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~ + * pop from BPF queue + * dsq_insert(buffered) + * balance: + * SHARED_DSQ empty + * BPF queue empty + * -> goes idle + * flush -> on SHARED + * kick CPU 1 + * wakes, drains task + * + * The kick prevents indefinite stalls but a per-CPU + * kthread like ksoftirqd can be briefly stranded when + * its home CPU enters idle with softirq pending, + * triggering: + * + * "NOHZ tick-stop error: local softirq work is pending, handler #N!!!" + * + * from report_idle_softirq(). The kick lands shortly + * after and the home CPU drains the task. 
This could be + * avoided by e.g. dispatching pinned tasks to local or + * global DSQs, but the current code is left as-is to + * document this class of issue -- other schedulers + * seeing similar warnings can use this as a reference. + */ + if (!bpf_cpumask_test_cpu(cpu, p->cpus_ptr)) + scx_bpf_kick_cpu(scx_bpf_task_cpu(p), 0); + bpf_task_release(p); batch--; @@ -440,7 +518,7 @@ void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev) if (!batch || !scx_bpf_dispatch_nr_slots()) { if (dispatch_highpri(false)) return; - scx_bpf_dsq_move_to_local(SHARED_DSQ); + scx_bpf_dsq_move_to_local(SHARED_DSQ, 0); return; } if (!cpuc->dsp_cnt) @@ -450,6 +528,12 @@ void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev) cpuc->dsp_cnt = 0; } + for (i = 0; i < MAX_SUB_SCHEDS; i++) { + if (sub_sched_cgroup_ids[i] && + scx_bpf_sub_dispatch(sub_sched_cgroup_ids[i])) + return; + } + /* * No other tasks. @prev will keep running. Update its core_sched_seq as * if the task were enqueued and dispatched immediately. @@ -532,36 +616,11 @@ bool BPF_STRUCT_OPS(qmap_core_sched_before, return task_qdist(a) > task_qdist(b); } -SEC("tp_btf/sched_switch") -int BPF_PROG(qmap_sched_switch, bool preempt, struct task_struct *prev, - struct task_struct *next, unsigned long prev_state) -{ - if (!__COMPAT_scx_bpf_reenqueue_local_from_anywhere()) - return 0; - - /* - * If @cpu is taken by a higher priority scheduling class, it is no - * longer available for executing sched_ext tasks. As we don't want the - * tasks in @cpu's local dsq to sit there until @cpu becomes available - * again, re-enqueue them into the global dsq. See %SCX_ENQ_REENQ - * handling in qmap_enqueue(). 
- */ - switch (next->policy) { - case 1: /* SCHED_FIFO */ - case 2: /* SCHED_RR */ - case 6: /* SCHED_DEADLINE */ - scx_bpf_reenqueue_local(); - } - - return 0; -} - -void BPF_STRUCT_OPS(qmap_cpu_release, s32 cpu, struct scx_cpu_release_args *args) -{ - /* see qmap_sched_switch() to learn how to do this on newer kernels */ - if (!__COMPAT_scx_bpf_reenqueue_local_from_anywhere()) - scx_bpf_reenqueue_local(); -} +/* + * sched_switch tracepoint and cpu_release handlers are no longer needed. + * With SCX_OPS_ALWAYS_ENQ_IMMED, wakeup_preempt_scx() reenqueues IMMED + * tasks when a higher-priority scheduling class takes the CPU. + */ s32 BPF_STRUCT_OPS(qmap_init_task, struct task_struct *p, struct scx_init_task_args *args) @@ -856,13 +915,35 @@ static int monitor_timerfn(void *map, int *key, struct bpf_timer *timer) return 0; } +struct lowpri_timer { + struct bpf_timer timer; +}; + +struct { + __uint(type, BPF_MAP_TYPE_ARRAY); + __uint(max_entries, 1); + __type(key, u32); + __type(value, struct lowpri_timer); +} lowpri_timer SEC(".maps"); + +/* + * Nice 19 tasks are put into the lowpri DSQ. Every 10ms, reenq is triggered and + * the tasks are transferred to SHARED_DSQ. 
+ */ +static int lowpri_timerfn(void *map, int *key, struct bpf_timer *timer) +{ + scx_bpf_dsq_reenq(LOWPRI_DSQ, 0); + bpf_timer_start(timer, LOWPRI_INTV_NS, 0); + return 0; +} + s32 BPF_STRUCT_OPS_SLEEPABLE(qmap_init) { u32 key = 0; struct bpf_timer *timer; s32 ret; - if (print_msgs) + if (print_msgs && !sub_cgroup_id) print_cpus(); ret = scx_bpf_create_dsq(SHARED_DSQ, -1); @@ -877,14 +958,32 @@ s32 BPF_STRUCT_OPS_SLEEPABLE(qmap_init) return ret; } + ret = scx_bpf_create_dsq(LOWPRI_DSQ, -1); + if (ret) + return ret; + timer = bpf_map_lookup_elem(&monitor_timer, &key); if (!timer) return -ESRCH; - bpf_timer_init(timer, &monitor_timer, CLOCK_MONOTONIC); bpf_timer_set_callback(timer, monitor_timerfn); + ret = bpf_timer_start(timer, ONE_SEC_IN_NS, 0); + if (ret) + return ret; - return bpf_timer_start(timer, ONE_SEC_IN_NS, 0); + if (__COMPAT_has_generic_reenq()) { + /* see lowpri_timerfn() */ + timer = bpf_map_lookup_elem(&lowpri_timer, &key); + if (!timer) + return -ESRCH; + bpf_timer_init(timer, &lowpri_timer, CLOCK_MONOTONIC); + bpf_timer_set_callback(timer, lowpri_timerfn); + ret = bpf_timer_start(timer, LOWPRI_INTV_NS, 0); + if (ret) + return ret; + } + + return 0; } void BPF_STRUCT_OPS(qmap_exit, struct scx_exit_info *ei) @@ -892,6 +991,36 @@ void BPF_STRUCT_OPS(qmap_exit, struct scx_exit_info *ei) UEI_RECORD(uei, ei); } +s32 BPF_STRUCT_OPS(qmap_sub_attach, struct scx_sub_attach_args *args) +{ + s32 i; + + for (i = 0; i < MAX_SUB_SCHEDS; i++) { + if (!sub_sched_cgroup_ids[i]) { + sub_sched_cgroup_ids[i] = args->ops->sub_cgroup_id; + bpf_printk("attaching sub-sched[%d] on %s", + i, args->cgroup_path); + return 0; + } + } + + return -ENOSPC; +} + +void BPF_STRUCT_OPS(qmap_sub_detach, struct scx_sub_detach_args *args) +{ + s32 i; + + for (i = 0; i < MAX_SUB_SCHEDS; i++) { + if (sub_sched_cgroup_ids[i] == args->ops->sub_cgroup_id) { + sub_sched_cgroup_ids[i] = 0; + bpf_printk("detaching sub-sched[%d] on %s", + i, args->cgroup_path); + break; + } + } +} + 
SCX_OPS_DEFINE(qmap_ops, .select_cpu = (void *)qmap_select_cpu, .enqueue = (void *)qmap_enqueue, @@ -899,7 +1028,6 @@ SCX_OPS_DEFINE(qmap_ops, .dispatch = (void *)qmap_dispatch, .tick = (void *)qmap_tick, .core_sched_before = (void *)qmap_core_sched_before, - .cpu_release = (void *)qmap_cpu_release, .init_task = (void *)qmap_init_task, .dump = (void *)qmap_dump, .dump_cpu = (void *)qmap_dump_cpu, @@ -907,6 +1035,8 @@ SCX_OPS_DEFINE(qmap_ops, .cgroup_init = (void *)qmap_cgroup_init, .cgroup_set_weight = (void *)qmap_cgroup_set_weight, .cgroup_set_bandwidth = (void *)qmap_cgroup_set_bandwidth, + .sub_attach = (void *)qmap_sub_attach, + .sub_detach = (void *)qmap_sub_detach, .cpu_online = (void *)qmap_cpu_online, .cpu_offline = (void *)qmap_cpu_offline, .init = (void *)qmap_init, diff --git a/tools/sched_ext/scx_qmap.c b/tools/sched_ext/scx_qmap.c index ef701d45ba43..e7c89a2bc3d8 100644 --- a/tools/sched_ext/scx_qmap.c +++ b/tools/sched_ext/scx_qmap.c @@ -10,6 +10,7 @@ #include #include #include +#include #include #include #include "scx_qmap.bpf.skel.h" @@ -20,7 +21,7 @@ const char help_fmt[] = "See the top-level comment in .bpf.c for more details.\n" "\n" "Usage: %s [-s SLICE_US] [-e COUNT] [-t COUNT] [-T COUNT] [-l COUNT] [-b COUNT]\n" -" [-P] [-M] [-d PID] [-D LEN] [-p] [-v]\n" +" [-P] [-M] [-H] [-d PID] [-D LEN] [-S] [-p] [-I] [-F COUNT] [-v]\n" "\n" " -s SLICE_US Override slice duration\n" " -e COUNT Trigger scx_bpf_error() after COUNT enqueues\n" @@ -35,6 +36,8 @@ const char help_fmt[] = " -D LEN Set scx_exit_info.dump buffer length\n" " -S Suppress qmap-specific debug dump\n" " -p Switch only tasks on SCHED_EXT policy instead of all\n" +" -I Turn on SCX_OPS_ALWAYS_ENQ_IMMED\n" +" -F COUNT IMMED stress: force every COUNT'th enqueue to a busy local DSQ (use with -I)\n" " -v Print libbpf debug messages\n" " -h Display this help and exit\n"; @@ -67,7 +70,7 @@ int main(int argc, char **argv) skel->rodata->slice_ns = __COMPAT_ENUM_OR_ZERO("scx_public_consts", 
"SCX_SLICE_DFL"); - while ((opt = getopt(argc, argv, "s:e:t:T:l:b:PMHd:D:Spvh")) != -1) { + while ((opt = getopt(argc, argv, "s:e:t:T:l:b:PMHc:d:D:SpIF:vh")) != -1) { switch (opt) { case 's': skel->rodata->slice_ns = strtoull(optarg, NULL, 0) * 1000; @@ -96,6 +99,16 @@ int main(int argc, char **argv) case 'H': skel->rodata->highpri_boosting = true; break; + case 'c': { + struct stat st; + if (stat(optarg, &st) < 0) { + perror("stat"); + return 1; + } + skel->struct_ops.qmap_ops->sub_cgroup_id = st.st_ino; + skel->rodata->sub_cgroup_id = st.st_ino; + break; + } case 'd': skel->rodata->disallow_tgid = strtol(optarg, NULL, 0); if (skel->rodata->disallow_tgid < 0) @@ -110,6 +123,13 @@ int main(int argc, char **argv) case 'p': skel->struct_ops.qmap_ops->flags |= SCX_OPS_SWITCH_PARTIAL; break; + case 'I': + skel->rodata->always_enq_immed = true; + skel->struct_ops.qmap_ops->flags |= SCX_OPS_ALWAYS_ENQ_IMMED; + break; + case 'F': + skel->rodata->immed_stress_nth = strtoul(optarg, NULL, 0); + break; case 'v': verbose = true; break; @@ -126,9 +146,10 @@ int main(int argc, char **argv) long nr_enqueued = skel->bss->nr_enqueued; long nr_dispatched = skel->bss->nr_dispatched; - printf("stats : enq=%lu dsp=%lu delta=%ld reenq=%"PRIu64" deq=%"PRIu64" core=%"PRIu64" enq_ddsp=%"PRIu64"\n", + printf("stats : enq=%lu dsp=%lu delta=%ld reenq/cpu0=%"PRIu64"/%"PRIu64" deq=%"PRIu64" core=%"PRIu64" enq_ddsp=%"PRIu64"\n", nr_enqueued, nr_dispatched, nr_enqueued - nr_dispatched, - skel->bss->nr_reenqueued, skel->bss->nr_dequeued, + skel->bss->nr_reenqueued, skel->bss->nr_reenqueued_cpu0, + skel->bss->nr_dequeued, skel->bss->nr_core_sched_execed, skel->bss->nr_ddsp_from_enq); printf(" exp_local=%"PRIu64" exp_remote=%"PRIu64" exp_timer=%"PRIu64" exp_lost=%"PRIu64"\n", diff --git a/tools/sched_ext/scx_sdt.bpf.c b/tools/sched_ext/scx_sdt.bpf.c index 31b09958e8d5..a1e33e6c412b 100644 --- a/tools/sched_ext/scx_sdt.bpf.c +++ b/tools/sched_ext/scx_sdt.bpf.c @@ -317,7 +317,8 @@ int 
scx_alloc_free_idx(struct scx_allocator *alloc, __u64 idx) }; /* Zero out one word at a time. */ - for (i = zero; i < alloc->pool.elem_size / 8 && can_loop; i++) { + for (i = zero; i < (alloc->pool.elem_size - sizeof(struct sdt_data)) / 8 + && can_loop; i++) { data->payload[i] = 0; } } @@ -643,7 +644,7 @@ void BPF_STRUCT_OPS(sdt_enqueue, struct task_struct *p, u64 enq_flags) void BPF_STRUCT_OPS(sdt_dispatch, s32 cpu, struct task_struct *prev) { - scx_bpf_dsq_move_to_local(SHARED_DSQ); + scx_bpf_dsq_move_to_local(SHARED_DSQ, 0); } s32 BPF_STRUCT_OPS_SLEEPABLE(sdt_init_task, struct task_struct *p, diff --git a/tools/sched_ext/scx_sdt.c b/tools/sched_ext/scx_sdt.c index a36405d8df30..bf664b2d3785 100644 --- a/tools/sched_ext/scx_sdt.c +++ b/tools/sched_ext/scx_sdt.c @@ -20,7 +20,7 @@ const char help_fmt[] = "\n" "Modified version of scx_simple that demonstrates arena-based data structures.\n" "\n" -"Usage: %s [-f] [-v]\n" +"Usage: %s [-v]\n" "\n" " -v Print libbpf debug messages\n" " -h Display this help and exit\n"; diff --git a/tools/sched_ext/scx_simple.bpf.c b/tools/sched_ext/scx_simple.bpf.c index b456bd7cae77..cc40552b2b5f 100644 --- a/tools/sched_ext/scx_simple.bpf.c +++ b/tools/sched_ext/scx_simple.bpf.c @@ -89,7 +89,7 @@ void BPF_STRUCT_OPS(simple_enqueue, struct task_struct *p, u64 enq_flags) void BPF_STRUCT_OPS(simple_dispatch, s32 cpu, struct task_struct *prev) { - scx_bpf_dsq_move_to_local(SHARED_DSQ); + scx_bpf_dsq_move_to_local(SHARED_DSQ, 0); } void BPF_STRUCT_OPS(simple_running, struct task_struct *p) @@ -121,12 +121,14 @@ void BPF_STRUCT_OPS(simple_stopping, struct task_struct *p, bool runnable) * too much, determine the execution time by taking explicit timestamps * instead of depending on @p->scx.slice. 
*/ - p->scx.dsq_vtime += (SCX_SLICE_DFL - p->scx.slice) * 100 / p->scx.weight; + u64 delta = scale_by_task_weight_inverse(p, SCX_SLICE_DFL - p->scx.slice); + + scx_bpf_task_set_dsq_vtime(p, p->scx.dsq_vtime + delta); } void BPF_STRUCT_OPS(simple_enable, struct task_struct *p) { - p->scx.dsq_vtime = vtime_now; + scx_bpf_task_set_dsq_vtime(p, vtime_now); } s32 BPF_STRUCT_OPS_SLEEPABLE(simple_init) diff --git a/tools/sched_ext/scx_userland.c b/tools/sched_ext/scx_userland.c index 3f2aba658b4a..616043c165e6 100644 --- a/tools/sched_ext/scx_userland.c +++ b/tools/sched_ext/scx_userland.c @@ -38,7 +38,7 @@ const char help_fmt[] = "\n" "Try to reduce `sysctl kernel.pid_max` if this program triggers OOMs.\n" "\n" -"Usage: %s [-b BATCH]\n" +"Usage: %s [-b BATCH] [-v]\n" "\n" " -b BATCH The number of tasks to batch when dispatching (default: 8)\n" " -v Print libbpf debug messages\n" diff --git a/tools/testing/selftests/sched_ext/Makefile b/tools/testing/selftests/sched_ext/Makefile index 1c9ca328cca1..789037be44c7 100644 --- a/tools/testing/selftests/sched_ext/Makefile +++ b/tools/testing/selftests/sched_ext/Makefile @@ -163,6 +163,7 @@ all_test_bpfprogs := $(foreach prog,$(wildcard *.bpf.c),$(INCLUDE_DIR)/$(patsubs auto-test-targets := \ create_dsq \ + dequeue \ enq_last_no_enq_fails \ ddsp_bogus_dsq_fail \ ddsp_vtimelocal_fail \ diff --git a/tools/testing/selftests/sched_ext/dequeue.bpf.c b/tools/testing/selftests/sched_ext/dequeue.bpf.c new file mode 100644 index 000000000000..624e2ccb0688 --- /dev/null +++ b/tools/testing/selftests/sched_ext/dequeue.bpf.c @@ -0,0 +1,389 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * A scheduler that validates ops.dequeue() is called correctly: + * - Tasks dispatched to terminal DSQs (local, global) bypass the BPF + * scheduler entirely: no ops.dequeue() should be called + * - Tasks dispatched to user DSQs from ops.enqueue() enter BPF custody: + * ops.dequeue() must be called when they leave custody + * - Every ops.enqueue() dispatch to 
non-terminal DSQs is followed by + * exactly one ops.dequeue() (validate 1:1 pairing and state machine) + * + * Copyright (c) 2026 NVIDIA Corporation. + */ + +#include + +#define SHARED_DSQ 0 + +/* + * BPF internal queue. + * + * Tasks are stored here and consumed from ops.dispatch(), validating that + * tasks on BPF internal structures still get ops.dequeue() when they + * leave. + */ +struct { + __uint(type, BPF_MAP_TYPE_QUEUE); + __uint(max_entries, 32768); + __type(value, s32); +} global_queue SEC(".maps"); + +char _license[] SEC("license") = "GPL"; + +UEI_DEFINE(uei); + +/* + * Counters to track the lifecycle of tasks: + * - enqueue_cnt: Number of times ops.enqueue() was called + * - dequeue_cnt: Number of times ops.dequeue() was called (any type) + * - dispatch_dequeue_cnt: Number of regular dispatch dequeues (no flag) + * - change_dequeue_cnt: Number of property change dequeues + * - bpf_queue_full: Number of times the BPF internal queue was full + */ +u64 enqueue_cnt, dequeue_cnt, dispatch_dequeue_cnt, change_dequeue_cnt, bpf_queue_full; + +/* + * Test scenarios: + * 0) Dispatch to local DSQ from ops.select_cpu() (terminal DSQ, bypasses BPF + * scheduler, no dequeue callbacks) + * 1) Dispatch to global DSQ from ops.select_cpu() (terminal DSQ, bypasses BPF + * scheduler, no dequeue callbacks) + * 2) Dispatch to shared user DSQ from ops.select_cpu() (enters BPF scheduler, + * dequeue callbacks expected) + * 3) Dispatch to local DSQ from ops.enqueue() (terminal DSQ, bypasses BPF + * scheduler, no dequeue callbacks) + * 4) Dispatch to global DSQ from ops.enqueue() (terminal DSQ, bypasses BPF + * scheduler, no dequeue callbacks) + * 5) Dispatch to shared user DSQ from ops.enqueue() (enters BPF scheduler, + * dequeue callbacks expected) + * 6) BPF internal queue from ops.enqueue(): store task PIDs in ops.enqueue(), + * consume in ops.dispatch() and dispatch to local DSQ (validates dequeue + * for tasks stored in internal BPF data structures) + */ +u32 
test_scenario; + +/* + * Per-task state to track lifecycle and validate workflow semantics. + * State transitions: + * NONE -> ENQUEUED (on enqueue) + * NONE -> DISPATCHED (on direct dispatch to terminal DSQ) + * ENQUEUED -> DISPATCHED (on dispatch dequeue) + * DISPATCHED -> NONE (on property change dequeue or re-enqueue) + * ENQUEUED -> NONE (on property change dequeue before dispatch) + */ +enum task_state { + TASK_NONE = 0, + TASK_ENQUEUED, + TASK_DISPATCHED, +}; + +struct task_ctx { + enum task_state state; /* Current state in the workflow */ + u64 enqueue_seq; /* Sequence number for debugging */ +}; + +struct { + __uint(type, BPF_MAP_TYPE_TASK_STORAGE); + __uint(map_flags, BPF_F_NO_PREALLOC); + __type(key, int); + __type(value, struct task_ctx); +} task_ctx_stor SEC(".maps"); + +static struct task_ctx *try_lookup_task_ctx(struct task_struct *p) +{ + return bpf_task_storage_get(&task_ctx_stor, p, 0, 0); +} + +s32 BPF_STRUCT_OPS(dequeue_select_cpu, struct task_struct *p, + s32 prev_cpu, u64 wake_flags) +{ + struct task_ctx *tctx; + + tctx = try_lookup_task_ctx(p); + if (!tctx) + return prev_cpu; + + switch (test_scenario) { + case 0: + /* + * Direct dispatch to the local DSQ. + * + * Task bypasses BPF scheduler entirely: no enqueue + * tracking, no ops.dequeue() callbacks. + */ + scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0); + tctx->state = TASK_DISPATCHED; + break; + case 1: + /* + * Direct dispatch to the global DSQ. + * + * Task bypasses BPF scheduler entirely: no enqueue + * tracking, no ops.dequeue() callbacks. + */ + scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, 0); + tctx->state = TASK_DISPATCHED; + break; + case 2: + /* + * Dispatch to a shared user DSQ. + * + * Task enters BPF scheduler management: track + * enqueue/dequeue lifecycle and validate state + * transitions. 
+ */ + if (tctx->state == TASK_ENQUEUED) + scx_bpf_error("%d (%s): enqueue while in ENQUEUED state seq=%llu", + p->pid, p->comm, tctx->enqueue_seq); + + scx_bpf_dsq_insert(p, SHARED_DSQ, SCX_SLICE_DFL, 0); + + __sync_fetch_and_add(&enqueue_cnt, 1); + + tctx->state = TASK_ENQUEUED; + tctx->enqueue_seq++; + break; + } + + return prev_cpu; +} + +void BPF_STRUCT_OPS(dequeue_enqueue, struct task_struct *p, u64 enq_flags) +{ + struct task_ctx *tctx; + s32 pid = p->pid; + + tctx = try_lookup_task_ctx(p); + if (!tctx) + return; + + switch (test_scenario) { + case 3: + /* + * Direct dispatch to the local DSQ. + * + * Task bypasses BPF scheduler entirely: no enqueue + * tracking, no ops.dequeue() callbacks. + */ + scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, enq_flags); + tctx->state = TASK_DISPATCHED; + break; + case 4: + /* + * Direct dispatch to the global DSQ. + * + * Task bypasses BPF scheduler entirely: no enqueue + * tracking, no ops.dequeue() callbacks. + */ + scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags); + tctx->state = TASK_DISPATCHED; + break; + case 5: + /* + * Dispatch to shared user DSQ. + * + * Task enters BPF scheduler management: track + * enqueue/dequeue lifecycle and validate state + * transitions. + */ + if (tctx->state == TASK_ENQUEUED) + scx_bpf_error("%d (%s): enqueue while in ENQUEUED state seq=%llu", + p->pid, p->comm, tctx->enqueue_seq); + + scx_bpf_dsq_insert(p, SHARED_DSQ, SCX_SLICE_DFL, enq_flags); + + __sync_fetch_and_add(&enqueue_cnt, 1); + + tctx->state = TASK_ENQUEUED; + tctx->enqueue_seq++; + break; + case 6: + /* + * Store task in BPF internal queue. + * + * Task enters BPF scheduler management: track + * enqueue/dequeue lifecycle and validate state + * transitions. 
+ */ + if (tctx->state == TASK_ENQUEUED) + scx_bpf_error("%d (%s): enqueue while in ENQUEUED state seq=%llu", + p->pid, p->comm, tctx->enqueue_seq); + + if (bpf_map_push_elem(&global_queue, &pid, 0)) { + scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags); + __sync_fetch_and_add(&bpf_queue_full, 1); + + tctx->state = TASK_DISPATCHED; + } else { + __sync_fetch_and_add(&enqueue_cnt, 1); + + tctx->state = TASK_ENQUEUED; + tctx->enqueue_seq++; + } + break; + default: + /* For all other scenarios, dispatch to the global DSQ */ + scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags); + tctx->state = TASK_DISPATCHED; + break; + } + + scx_bpf_kick_cpu(scx_bpf_task_cpu(p), SCX_KICK_IDLE); +} + +void BPF_STRUCT_OPS(dequeue_dequeue, struct task_struct *p, u64 deq_flags) +{ + struct task_ctx *tctx; + + __sync_fetch_and_add(&dequeue_cnt, 1); + + tctx = try_lookup_task_ctx(p); + if (!tctx) + return; + + /* + * For scenarios 0, 1, 3, and 4 (terminal DSQs: local and global), + * ops.dequeue() should never be called because tasks bypass the + * BPF scheduler entirely. If we get here, it's a kernel bug. + */ + if (test_scenario == 0 || test_scenario == 3) { + scx_bpf_error("%d (%s): dequeue called for local DSQ scenario", + p->pid, p->comm); + return; + } + + if (test_scenario == 1 || test_scenario == 4) { + scx_bpf_error("%d (%s): dequeue called for global DSQ scenario", + p->pid, p->comm); + return; + } + + if (deq_flags & SCX_DEQ_SCHED_CHANGE) { + /* + * Property change interrupting the workflow. Valid from + * both ENQUEUED and DISPATCHED states. Transitions task + * back to NONE state. + */ + __sync_fetch_and_add(&change_dequeue_cnt, 1); + + /* Validate state transition */ + if (tctx->state != TASK_ENQUEUED && tctx->state != TASK_DISPATCHED) + scx_bpf_error("%d (%s): invalid property change dequeue state=%d seq=%llu", + p->pid, p->comm, tctx->state, tctx->enqueue_seq); + + /* + * Transition back to NONE: task outside scheduler control. 
+ * + * Scenario 6: dispatch() checks tctx->state after popping a + * PID; if the task is in state NONE, it was dequeued by + * property change and must not be dispatched (this + * prevents "target CPU not allowed"). + */ + tctx->state = TASK_NONE; + } else { + /* + * Regular dispatch dequeue: kernel is moving the task from + * BPF custody to a terminal DSQ. Normally we come from + * ENQUEUED state. We can also see TASK_NONE if the task + * was dequeued by property change (SCX_DEQ_SCHED_CHANGE) + * while it was already on a DSQ (dispatched but not yet + * consumed); in that case we just leave state as NONE. + */ + __sync_fetch_and_add(&dispatch_dequeue_cnt, 1); + + /* + * Must be ENQUEUED (normal path) or NONE (already dequeued + * by property change while on a DSQ). + */ + if (tctx->state != TASK_ENQUEUED && tctx->state != TASK_NONE) + scx_bpf_error("%d (%s): dispatch dequeue from state %d seq=%llu", + p->pid, p->comm, tctx->state, tctx->enqueue_seq); + + if (tctx->state == TASK_ENQUEUED) + tctx->state = TASK_DISPATCHED; + + /* NONE: leave as-is, task was already property-change dequeued */ + } +} + +void BPF_STRUCT_OPS(dequeue_dispatch, s32 cpu, struct task_struct *prev) +{ + if (test_scenario == 6) { + struct task_ctx *tctx; + struct task_struct *p; + s32 pid; + + if (bpf_map_pop_elem(&global_queue, &pid)) + return; + + p = bpf_task_from_pid(pid); + if (!p) + return; + + /* + * If the task was dequeued by property change + * (ops.dequeue() set tctx->state = TASK_NONE), skip + * dispatch. + */ + tctx = try_lookup_task_ctx(p); + if (!tctx || tctx->state == TASK_NONE) { + bpf_task_release(p); + return; + } + + /* + * Dispatch to this CPU's local DSQ if allowed, otherwise + * fall back to the global DSQ. 
+ */ + if (bpf_cpumask_test_cpu(cpu, p->cpus_ptr)) + scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL_ON | cpu, SCX_SLICE_DFL, 0); + else + scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, 0); + + bpf_task_release(p); + } else { + scx_bpf_dsq_move_to_local(SHARED_DSQ, 0); + } +} + +s32 BPF_STRUCT_OPS(dequeue_init_task, struct task_struct *p, + struct scx_init_task_args *args) +{ + struct task_ctx *tctx; + + tctx = bpf_task_storage_get(&task_ctx_stor, p, 0, + BPF_LOCAL_STORAGE_GET_F_CREATE); + if (!tctx) + return -ENOMEM; + + return 0; +} + +s32 BPF_STRUCT_OPS_SLEEPABLE(dequeue_init) +{ + s32 ret; + + ret = scx_bpf_create_dsq(SHARED_DSQ, -1); + if (ret) + return ret; + + return 0; +} + +void BPF_STRUCT_OPS(dequeue_exit, struct scx_exit_info *ei) +{ + UEI_RECORD(uei, ei); +} + +SEC(".struct_ops.link") +struct sched_ext_ops dequeue_ops = { + .select_cpu = (void *)dequeue_select_cpu, + .enqueue = (void *)dequeue_enqueue, + .dequeue = (void *)dequeue_dequeue, + .dispatch = (void *)dequeue_dispatch, + .init_task = (void *)dequeue_init_task, + .init = (void *)dequeue_init, + .exit = (void *)dequeue_exit, + .flags = SCX_OPS_ENQ_LAST, + .name = "dequeue_test", +}; diff --git a/tools/testing/selftests/sched_ext/dequeue.c b/tools/testing/selftests/sched_ext/dequeue.c new file mode 100644 index 000000000000..4e93262703ca --- /dev/null +++ b/tools/testing/selftests/sched_ext/dequeue.c @@ -0,0 +1,274 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (c) 2025 NVIDIA Corporation. + */ +#define _GNU_SOURCE +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include "scx_test.h" +#include "dequeue.bpf.skel.h" + +#define NUM_WORKERS 8 +#define AFFINITY_HAMMER_MS 500 + +/* + * Worker function that creates enqueue/dequeue events via CPU work and + * sleep. 
+ */ +static void worker_fn(int id) +{ + int i; + volatile int sum = 0; + + for (i = 0; i < 1000; i++) { + volatile int j; + + /* Do some work to trigger scheduling events */ + for (j = 0; j < 10000; j++) + sum += j; + + /* Sleep to trigger dequeue */ + usleep(1000 + (id * 100)); + } + + exit(0); +} + +/* + * This thread changes workers' affinity from outside so that some changes + * hit tasks while they are still in the scheduler's queue and trigger + * property-change dequeues. + */ +static void *affinity_hammer_fn(void *arg) +{ + pid_t *pids = arg; + cpu_set_t cpuset; + int i = 0, n = NUM_WORKERS; + struct timespec start, now; + + clock_gettime(CLOCK_MONOTONIC, &start); + while (1) { + int w = i % n; + int cpu = (i / n) % 4; + + CPU_ZERO(&cpuset); + CPU_SET(cpu, &cpuset); + sched_setaffinity(pids[w], sizeof(cpuset), &cpuset); + i++; + + /* Check elapsed time every 256 iterations to limit gettime cost */ + if ((i & 255) == 0) { + long long elapsed_ms; + + clock_gettime(CLOCK_MONOTONIC, &now); + elapsed_ms = (now.tv_sec - start.tv_sec) * 1000LL + + (now.tv_nsec - start.tv_nsec) / 1000000; + if (elapsed_ms >= AFFINITY_HAMMER_MS) + break; + } + } + return NULL; +} + +static enum scx_test_status run_scenario(struct dequeue *skel, u32 scenario, + const char *scenario_name) +{ + struct bpf_link *link; + pid_t pids[NUM_WORKERS]; + pthread_t hammer; + + int i, status; + u64 enq_start, deq_start, + dispatch_deq_start, change_deq_start, bpf_queue_full_start; + u64 enq_delta, deq_delta, + dispatch_deq_delta, change_deq_delta, bpf_queue_full_delta; + + /* Set the test scenario */ + skel->bss->test_scenario = scenario; + + /* Record starting counts */ + enq_start = skel->bss->enqueue_cnt; + deq_start = skel->bss->dequeue_cnt; + dispatch_deq_start = skel->bss->dispatch_dequeue_cnt; + change_deq_start = skel->bss->change_dequeue_cnt; + bpf_queue_full_start = skel->bss->bpf_queue_full; + + link = bpf_map__attach_struct_ops(skel->maps.dequeue_ops); + SCX_FAIL_IF(!link, "Failed to 
attach struct_ops for scenario %s", scenario_name); + + /* Fork worker processes to generate enqueue/dequeue events */ + for (i = 0; i < NUM_WORKERS; i++) { + pids[i] = fork(); + SCX_FAIL_IF(pids[i] < 0, "Failed to fork worker %d", i); + + if (pids[i] == 0) { + worker_fn(i); + /* Should not reach here */ + exit(1); + } + } + + /* + * Run an "affinity hammer" so that some property changes hit tasks + * while they are still in BPF custody (e.g., in user DSQ or BPF + * queue), triggering SCX_DEQ_SCHED_CHANGE dequeues. + */ + SCX_FAIL_IF(pthread_create(&hammer, NULL, affinity_hammer_fn, pids) != 0, + "Failed to create affinity hammer thread"); + pthread_join(hammer, NULL); + + /* Wait for all workers to complete */ + for (i = 0; i < NUM_WORKERS; i++) { + SCX_FAIL_IF(waitpid(pids[i], &status, 0) != pids[i], + "Failed to wait for worker %d", i); + SCX_FAIL_IF(status != 0, "Worker %d exited with status %d", i, status); + } + + bpf_link__destroy(link); + + SCX_EQ(skel->data->uei.kind, EXIT_KIND(SCX_EXIT_UNREG)); + + /* Calculate deltas */ + enq_delta = skel->bss->enqueue_cnt - enq_start; + deq_delta = skel->bss->dequeue_cnt - deq_start; + dispatch_deq_delta = skel->bss->dispatch_dequeue_cnt - dispatch_deq_start; + change_deq_delta = skel->bss->change_dequeue_cnt - change_deq_start; + bpf_queue_full_delta = skel->bss->bpf_queue_full - bpf_queue_full_start; + + printf("%s:\n", scenario_name); + printf(" enqueues: %lu\n", (unsigned long)enq_delta); + printf(" dequeues: %lu (dispatch: %lu, property_change: %lu)\n", + (unsigned long)deq_delta, + (unsigned long)dispatch_deq_delta, + (unsigned long)change_deq_delta); + printf(" BPF queue full: %lu\n", (unsigned long)bpf_queue_full_delta); + + /* + * Validate enqueue/dequeue lifecycle tracking. + * + * For scenarios 0, 1, 3, 4 (local and global DSQs from + * ops.select_cpu() and ops.enqueue()), both enqueues and dequeues + * should be 0 because tasks bypass the BPF scheduler entirely: + * tasks never enter BPF scheduler's custody. 
+ * + * For scenarios 2, 5, 6 (user DSQ or BPF internal queue) we expect + * both enqueues and dequeues. + * + * The BPF code does strict state machine validation with + * scx_bpf_error() to ensure the workflow semantics are correct. + * + * If we reach this point without errors, the semantics are + * validated correctly. + */ + if (scenario == 0 || scenario == 1 || + scenario == 3 || scenario == 4) { + /* Tasks bypass BPF scheduler completely */ + SCX_EQ(enq_delta, 0); + SCX_EQ(deq_delta, 0); + SCX_EQ(dispatch_deq_delta, 0); + SCX_EQ(change_deq_delta, 0); + } else { + /* + * User DSQ from ops.enqueue() or ops.select_cpu(): tasks + * enter BPF scheduler's custody. + * + * Also validate 1:1 enqueue/dequeue pairing. + */ + SCX_GT(enq_delta, 0); + SCX_GT(deq_delta, 0); + SCX_EQ(enq_delta, deq_delta); + } + + return SCX_TEST_PASS; +} + +static enum scx_test_status setup(void **ctx) +{ + struct dequeue *skel; + + skel = dequeue__open(); + SCX_FAIL_IF(!skel, "Failed to open skel"); + SCX_ENUM_INIT(skel); + SCX_FAIL_IF(dequeue__load(skel), "Failed to load skel"); + + *ctx = skel; + + return SCX_TEST_PASS; +} + +static enum scx_test_status run(void *ctx) +{ + struct dequeue *skel = ctx; + enum scx_test_status status; + + status = run_scenario(skel, 0, "Scenario 0: Local DSQ from ops.select_cpu()"); + if (status != SCX_TEST_PASS) + return status; + + status = run_scenario(skel, 1, "Scenario 1: Global DSQ from ops.select_cpu()"); + if (status != SCX_TEST_PASS) + return status; + + status = run_scenario(skel, 2, "Scenario 2: User DSQ from ops.select_cpu()"); + if (status != SCX_TEST_PASS) + return status; + + status = run_scenario(skel, 3, "Scenario 3: Local DSQ from ops.enqueue()"); + if (status != SCX_TEST_PASS) + return status; + + status = run_scenario(skel, 4, "Scenario 4: Global DSQ from ops.enqueue()"); + if (status != SCX_TEST_PASS) + return status; + + status = run_scenario(skel, 5, "Scenario 5: User DSQ from ops.enqueue()"); + if (status != SCX_TEST_PASS) + return 
status; + + status = run_scenario(skel, 6, "Scenario 6: BPF queue from ops.enqueue()"); + if (status != SCX_TEST_PASS) + return status; + + printf("\n=== Summary ===\n"); + printf("Total enqueues: %lu\n", (unsigned long)skel->bss->enqueue_cnt); + printf("Total dequeues: %lu\n", (unsigned long)skel->bss->dequeue_cnt); + printf(" Dispatch dequeues: %lu (no flag, normal workflow)\n", + (unsigned long)skel->bss->dispatch_dequeue_cnt); + printf(" Property change dequeues: %lu (SCX_DEQ_SCHED_CHANGE flag)\n", + (unsigned long)skel->bss->change_dequeue_cnt); + printf(" BPF queue full: %lu\n", + (unsigned long)skel->bss->bpf_queue_full); + printf("\nAll scenarios passed - no state machine violations detected\n"); + printf("-> Validated: Local DSQ dispatch bypasses BPF scheduler\n"); + printf("-> Validated: Global DSQ dispatch bypasses BPF scheduler\n"); + printf("-> Validated: User DSQ dispatch triggers ops.dequeue() callbacks\n"); + printf("-> Validated: Dispatch dequeues have no flags (normal workflow)\n"); + printf("-> Validated: Property change dequeues have SCX_DEQ_SCHED_CHANGE flag\n"); + printf("-> Validated: No duplicate enqueues or invalid state transitions\n"); + + return SCX_TEST_PASS; +} + +static void cleanup(void *ctx) +{ + struct dequeue *skel = ctx; + + dequeue__destroy(skel); +} + +struct scx_test dequeue_test = { + .name = "dequeue", + .description = "Verify ops.dequeue() semantics", + .setup = setup, + .run = run, + .cleanup = cleanup, +}; + +REGISTER_SCX_TEST(&dequeue_test) diff --git a/tools/testing/selftests/sched_ext/exit.bpf.c b/tools/testing/selftests/sched_ext/exit.bpf.c index 4bc36182d3ff..2e848820a44b 100644 --- a/tools/testing/selftests/sched_ext/exit.bpf.c +++ b/tools/testing/selftests/sched_ext/exit.bpf.c @@ -41,7 +41,7 @@ void BPF_STRUCT_OPS(exit_dispatch, s32 cpu, struct task_struct *p) if (exit_point == EXIT_DISPATCH) EXIT_CLEANLY(); - scx_bpf_dsq_move_to_local(DSQ_ID); + scx_bpf_dsq_move_to_local(DSQ_ID, 0); } void 
BPF_STRUCT_OPS(exit_enable, struct task_struct *p) diff --git a/tools/testing/selftests/sched_ext/exit.c b/tools/testing/selftests/sched_ext/exit.c index ee25824b1cbe..b987611789d1 100644 --- a/tools/testing/selftests/sched_ext/exit.c +++ b/tools/testing/selftests/sched_ext/exit.c @@ -33,7 +33,7 @@ static enum scx_test_status run(void *ctx) skel = exit__open(); SCX_ENUM_INIT(skel); skel->rodata->exit_point = tc; - exit__load(skel); + SCX_FAIL_IF(exit__load(skel), "Failed to load skel"); link = bpf_map__attach_struct_ops(skel->maps.exit_ops); if (!link) { SCX_ERR("Failed to attach scheduler"); diff --git a/tools/testing/selftests/sched_ext/exit_test.h b/tools/testing/selftests/sched_ext/exit_test.h index 94f0268b9cb8..2723e0fda801 100644 --- a/tools/testing/selftests/sched_ext/exit_test.h +++ b/tools/testing/selftests/sched_ext/exit_test.h @@ -17,4 +17,4 @@ enum exit_test_case { NUM_EXITS, }; -#endif // # __EXIT_TEST_H__ +#endif // __EXIT_TEST_H__ diff --git a/tools/testing/selftests/sched_ext/maximal.bpf.c b/tools/testing/selftests/sched_ext/maximal.bpf.c index 01cf4f3da4e0..04a369078aac 100644 --- a/tools/testing/selftests/sched_ext/maximal.bpf.c +++ b/tools/testing/selftests/sched_ext/maximal.bpf.c @@ -30,7 +30,7 @@ void BPF_STRUCT_OPS(maximal_dequeue, struct task_struct *p, u64 deq_flags) void BPF_STRUCT_OPS(maximal_dispatch, s32 cpu, struct task_struct *prev) { - scx_bpf_dsq_move_to_local(DSQ_ID); + scx_bpf_dsq_move_to_local(DSQ_ID, 0); } void BPF_STRUCT_OPS(maximal_runnable, struct task_struct *p, u64 enq_flags) @@ -67,13 +67,12 @@ void BPF_STRUCT_OPS(maximal_set_cpumask, struct task_struct *p, void BPF_STRUCT_OPS(maximal_update_idle, s32 cpu, bool idle) {} -void BPF_STRUCT_OPS(maximal_cpu_acquire, s32 cpu, - struct scx_cpu_acquire_args *args) -{} - -void BPF_STRUCT_OPS(maximal_cpu_release, s32 cpu, - struct scx_cpu_release_args *args) -{} +SEC("tp_btf/sched_switch") +int BPF_PROG(maximal_sched_switch, bool preempt, struct task_struct *prev, + struct 
task_struct *next, unsigned int prev_state) +{ + return 0; +} void BPF_STRUCT_OPS(maximal_cpu_online, s32 cpu) {} @@ -150,8 +149,6 @@ struct sched_ext_ops maximal_ops = { .set_weight = (void *) maximal_set_weight, .set_cpumask = (void *) maximal_set_cpumask, .update_idle = (void *) maximal_update_idle, - .cpu_acquire = (void *) maximal_cpu_acquire, - .cpu_release = (void *) maximal_cpu_release, .cpu_online = (void *) maximal_cpu_online, .cpu_offline = (void *) maximal_cpu_offline, .init_task = (void *) maximal_init_task, diff --git a/tools/testing/selftests/sched_ext/maximal.c b/tools/testing/selftests/sched_ext/maximal.c index c6be50a9941d..1dc369224670 100644 --- a/tools/testing/selftests/sched_ext/maximal.c +++ b/tools/testing/selftests/sched_ext/maximal.c @@ -19,6 +19,9 @@ static enum scx_test_status setup(void **ctx) SCX_ENUM_INIT(skel); SCX_FAIL_IF(maximal__load(skel), "Failed to load skel"); + bpf_map__set_autoattach(skel->maps.maximal_ops, false); + SCX_FAIL_IF(maximal__attach(skel), "Failed to attach skel"); + *ctx = skel; return SCX_TEST_PASS; diff --git a/tools/testing/selftests/sched_ext/numa.bpf.c b/tools/testing/selftests/sched_ext/numa.bpf.c index a79d86ed54a1..78cc49a7f9a6 100644 --- a/tools/testing/selftests/sched_ext/numa.bpf.c +++ b/tools/testing/selftests/sched_ext/numa.bpf.c @@ -68,7 +68,7 @@ void BPF_STRUCT_OPS(numa_dispatch, s32 cpu, struct task_struct *prev) { int node = __COMPAT_scx_bpf_cpu_node(cpu); - scx_bpf_dsq_move_to_local(node); + scx_bpf_dsq_move_to_local(node, 0); } s32 BPF_STRUCT_OPS_SLEEPABLE(numa_init) diff --git a/tools/testing/selftests/sched_ext/peek_dsq.bpf.c b/tools/testing/selftests/sched_ext/peek_dsq.bpf.c index 784f2f6c1af9..7f23fb17b1e0 100644 --- a/tools/testing/selftests/sched_ext/peek_dsq.bpf.c +++ b/tools/testing/selftests/sched_ext/peek_dsq.bpf.c @@ -95,7 +95,7 @@ static int scan_dsq_pool(void) record_peek_result(task->pid); /* Try to move this task to local */ - if (!moved && scx_bpf_dsq_move_to_local(dsq_id) == 
0) { + if (!moved && scx_bpf_dsq_move_to_local(dsq_id, 0) == 0) { moved = 1; break; } @@ -156,19 +156,19 @@ void BPF_STRUCT_OPS(peek_dsq_dispatch, s32 cpu, struct task_struct *prev) dsq_peek_result2_pid = peek_result ? peek_result->pid : -1; /* Now consume the task since we've peeked at it */ - scx_bpf_dsq_move_to_local(test_dsq_id); + scx_bpf_dsq_move_to_local(test_dsq_id, 0); /* Mark phase 1 as complete */ phase1_complete = 1; bpf_printk("Phase 1 complete, starting phase 2 stress testing"); } else if (!phase1_complete) { /* Still in phase 1, use real DSQ */ - scx_bpf_dsq_move_to_local(real_dsq_id); + scx_bpf_dsq_move_to_local(real_dsq_id, 0); } else { /* Phase 2: Scan all DSQs in the pool and try to move a task */ if (!scan_dsq_pool()) { /* No tasks found in DSQ pool, fall back to real DSQ */ - scx_bpf_dsq_move_to_local(real_dsq_id); + scx_bpf_dsq_move_to_local(real_dsq_id, 0); } } } @@ -197,7 +197,7 @@ s32 BPF_STRUCT_OPS_SLEEPABLE(peek_dsq_init) } err = scx_bpf_create_dsq(real_dsq_id, -1); if (err) { - scx_bpf_error("Failed to create DSQ %d: %d", test_dsq_id, err); + scx_bpf_error("Failed to create DSQ %d: %d", real_dsq_id, err); return err; } diff --git a/tools/testing/selftests/sched_ext/reload_loop.c b/tools/testing/selftests/sched_ext/reload_loop.c index 308211d80436..49297b83d748 100644 --- a/tools/testing/selftests/sched_ext/reload_loop.c +++ b/tools/testing/selftests/sched_ext/reload_loop.c @@ -23,6 +23,9 @@ static enum scx_test_status setup(void **ctx) SCX_ENUM_INIT(skel); SCX_FAIL_IF(maximal__load(skel), "Failed to load skel"); + bpf_map__set_autoattach(skel->maps.maximal_ops, false); + SCX_FAIL_IF(maximal__attach(skel), "Failed to attach skel"); + return SCX_TEST_PASS; } diff --git a/tools/testing/selftests/sched_ext/rt_stall.c b/tools/testing/selftests/sched_ext/rt_stall.c index 81ea9b4883e5..a5041fc2e44f 100644 --- a/tools/testing/selftests/sched_ext/rt_stall.c +++ b/tools/testing/selftests/sched_ext/rt_stall.c @@ -119,6 +119,11 @@ static enum 
scx_test_status setup(void **ctx) { struct rt_stall *skel; + if (!__COMPAT_struct_has_field("rq", "ext_server")) { + fprintf(stderr, "SKIP: ext DL server not supported\n"); + return SCX_TEST_SKIP; + } + skel = rt_stall__open(); SCX_FAIL_IF(!skel, "Failed to open"); SCX_ENUM_INIT(skel); diff --git a/tools/testing/selftests/sched_ext/runner.c b/tools/testing/selftests/sched_ext/runner.c index 761c21f96404..c264807caa91 100644 --- a/tools/testing/selftests/sched_ext/runner.c +++ b/tools/testing/selftests/sched_ext/runner.c @@ -18,7 +18,7 @@ const char help_fmt[] = "It's required for the testcases to be serial, as only a single host-wide sched_ext\n" "scheduler may be loaded at any given time." "\n" -"Usage: %s [-t TEST] [-h]\n" +"Usage: %s [-t TEST] [-s] [-l] [-q]\n" "\n" " -t TEST Only run tests whose name includes this string\n" " -s Include print output for skipped tests\n" @@ -133,6 +133,8 @@ static bool test_valid(const struct scx_test *test) int main(int argc, char **argv) { const char *filter = NULL; + const char *failed_tests[MAX_SCX_TESTS]; + const char *skipped_tests[MAX_SCX_TESTS]; unsigned testnum = 0, i; unsigned passed = 0, skipped = 0, failed = 0; int opt; @@ -162,6 +164,26 @@ int main(int argc, char **argv) } } + if (optind < argc) { + fprintf(stderr, "Unexpected argument '%s'. 
Use -t to filter tests.\n", + argv[optind]); + return 1; + } + + if (filter) { + for (i = 0; i < __scx_num_tests; i++) { + if (!should_skip_test(&__scx_tests[i], filter)) + break; + } + if (i == __scx_num_tests) { + fprintf(stderr, "No tests matched filter '%s'\n", filter); + fprintf(stderr, "Available tests (use -l to list):\n"); + for (i = 0; i < __scx_num_tests; i++) + fprintf(stderr, " %s\n", __scx_tests[i].name); + return 1; + } + } + for (i = 0; i < __scx_num_tests; i++) { enum scx_test_status status; struct scx_test *test = &__scx_tests[i]; @@ -198,10 +220,10 @@ int main(int argc, char **argv) passed++; break; case SCX_TEST_SKIP: - skipped++; + skipped_tests[skipped++] = test->name; break; case SCX_TEST_FAIL: - failed++; + failed_tests[failed++] = test->name; break; } } @@ -210,8 +232,18 @@ int main(int argc, char **argv) printf("PASSED: %u\n", passed); printf("SKIPPED: %u\n", skipped); printf("FAILED: %u\n", failed); + if (skipped > 0) { + printf("\nSkipped tests:\n"); + for (i = 0; i < skipped; i++) + printf(" - %s\n", skipped_tests[i]); + } + if (failed > 0) { + printf("\nFailed tests:\n"); + for (i = 0; i < failed; i++) + printf(" - %s\n", failed_tests[i]); + } - return 0; + return failed > 0 ? 
1 : 0; } void scx_test_register(struct scx_test *test) diff --git a/tools/testing/selftests/sched_ext/select_cpu_vtime.bpf.c b/tools/testing/selftests/sched_ext/select_cpu_vtime.bpf.c index bfcb96cd4954..eec70d388cbf 100644 --- a/tools/testing/selftests/sched_ext/select_cpu_vtime.bpf.c +++ b/tools/testing/selftests/sched_ext/select_cpu_vtime.bpf.c @@ -53,7 +53,7 @@ ddsp: void BPF_STRUCT_OPS(select_cpu_vtime_dispatch, s32 cpu, struct task_struct *p) { - if (scx_bpf_dsq_move_to_local(VTIME_DSQ)) + if (scx_bpf_dsq_move_to_local(VTIME_DSQ, 0)) consumed = true; } @@ -66,12 +66,14 @@ void BPF_STRUCT_OPS(select_cpu_vtime_running, struct task_struct *p) void BPF_STRUCT_OPS(select_cpu_vtime_stopping, struct task_struct *p, bool runnable) { - p->scx.dsq_vtime += (SCX_SLICE_DFL - p->scx.slice) * 100 / p->scx.weight; + u64 delta = scale_by_task_weight_inverse(p, SCX_SLICE_DFL - p->scx.slice); + + scx_bpf_task_set_dsq_vtime(p, p->scx.dsq_vtime + delta); } void BPF_STRUCT_OPS(select_cpu_vtime_enable, struct task_struct *p) { - p->scx.dsq_vtime = vtime_now; + scx_bpf_task_set_dsq_vtime(p, vtime_now); } s32 BPF_STRUCT_OPS_SLEEPABLE(select_cpu_vtime_init) diff --git a/tools/testing/selftests/sched_ext/util.h b/tools/testing/selftests/sched_ext/util.h index bc13dfec1267..681cec04b439 100644 --- a/tools/testing/selftests/sched_ext/util.h +++ b/tools/testing/selftests/sched_ext/util.h @@ -10,4 +10,4 @@ long file_read_long(const char *path); int file_write_long(const char *path, long val); -#endif // __SCX_TEST_H__ +#endif // __SCX_TEST_UTIL_H__