Start to flesh out the real find_proxy_task() implementation,
but avoid the migration cases for now, in those cases just
deactivate the donor task and pick again.
To ensure the donor task or other blocked tasks in the chain
aren't migrated away while we're running the proxy, also tweak
the fair class logic to avoid migrating donor or mutex blocked
tasks.
[jstultz: This change was split out from the larger proxy patch]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Connor O'Brien <connoro@google.com>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lkml.kernel.org/r/20250712033407.2383110-9-jstultz@google.com
Without proxy-exec, we normally charge the "current" task for
both its vruntime as well as its sum_exec_runtime.
With proxy, however, we have two "current" contexts: the
scheduler context and the execution context. We want to charge
the execution context rq->curr (ie: proxy/lock holder) execution
time to its sum_exec_runtime (so it's clear to userland the
rq->curr task *is* running), as well as its thread group.
However the rest of the time accounting (such a vruntime and
cgroup accounting), we charge against the scheduler context
(rq->donor) task, because it is from that task that the time
is being "donated".
If the donor and curr tasks are the same, then it's the same as
without proxy.
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lkml.kernel.org/r/20250712033407.2383110-6-jstultz@google.com
Absorb update_curr_task() into update_curr_se(), and
in the process simplify update_curr_common().
This will make the next step a bit easier.
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lkml.kernel.org/r/20250712033407.2383110-5-jstultz@google.com
Support for overlapping domains added in commit e3589f6c81 ("sched:
Allow for overlapping sched_domain spans") also allowed forcefully
setting SD_OVERLAP for !NUMA domains via FORCE_SD_OVERLAP sched_feat().
Since NUMA domains had to be presumed overlapping to ensure correct
behavior, "sched_domain_topology_level::flags" was introduced. NUMA
domains added the SDTL_OVERLAP flag would ensure SD_OVERLAP was always
added during build_sched_domains() for these domains, even when
FORCE_SD_OVERLAP was off.
Condition for adding the SD_OVERLAP flag at the aforementioned commit
was as follows:
if (tl->flags & SDTL_OVERLAP || sched_feat(FORCE_SD_OVERLAP))
sd->flags |= SD_OVERLAP;
The FORCE_SD_OVERLAP debug feature was removed in commit af85596c74
("sched/topology: Remove FORCE_SD_OVERLAP") which left the NUMA domains
as the exclusive users of SDTL_OVERLAP, SD_OVERLAP, and SD_NUMA flags.
Get rid of SDTL_OVERLAP and SD_OVERLAP as they have become redundant
and instead rely on SD_NUMA to detect the only overlapping domain
currently supported. Since SDTL_OVERLAP was the only user of
"tl->flags", get rid of "sched_domain_topology_level::flags" too.
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/ba4dbdf8-bc37-493d-b2e0-2efb00ea3e19@amd.com
Always trigger a resched after a protected period even if the entity is
still eligible. It can happen that an entity remains eligible at the end
of the protected period but must let an entity with a shorter slice to run
in order to keep its lag shorter than slice. This is particulalry true
with run to parity which tries to maximize the lag.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250708165630.1948751-7-vincent.guittot@linaro.org
When an entity is enqueued without preempting current, we must ensure
that the slice protection is updated to take into account the slice
duration of the newly enqueued task so that its lag will not exceed
its slice (+ tick).
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250708165630.1948751-6-vincent.guittot@linaro.org
Run to parity ensures that current will get a chance to run its full
slice in one go but this can create large latency and/or lag for
entities with shorter slice that have exhausted their previous slice
and wait to run their next slice.
Clamp the run to parity to the shortest slice of all enqueued entities.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250708165630.1948751-5-vincent.guittot@linaro.org
Even if the waking task can preempt current, it might not be the one
selected by pick_task_fair. Check that the waking task will be selected
if we cancel the slice protection before doing so.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250708165630.1948751-4-vincent.guittot@linaro.org
EEVDF expects the scheduler to allocate a time quantum to the selected
entity and then pick a new entity for next quantum.
Although this notion of time quantum is not strictly doable in our case,
we can ensure a minimum runtime for each task most of the time and pick a
new entity after a minimum time has elapsed.
Reuse the slice protection of run to parity to ensure such runtime
quantum.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250708165630.1948751-3-vincent.guittot@linaro.org
Replace the test by the relevant protect_slice() function.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dhaval Giani (AMD) <dhaval@gianis.ca>
Link: https://lkml.kernel.org/r/20250708165630.1948751-2-vincent.guittot@linaro.org
Chris reported that commit 5f6bd380c7 ("sched/rt: Remove default
bandwidth control") caused a significant dip in his favourite
benchmark of the day. Simply disabling dl_server cured things.
His workload hammers the 0->1, 1->0 transitions, and the
dl_server_{start,stop}() overhead kills it -- fairly obviously a bad
idea in hind sight and all that.
Change things around to only disable the dl_server when there has not
been a fair task around for a whole period. Since the default period
is 1 second, this ensures the benchmark never trips this, overhead
gone.
Fixes: 557a6bfc66 ("sched/fair: Add trivial fair server")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lkml.kernel.org/r/20250702121158.465086194@infradead.org
schbench (https://github.com/masoncl/schbench.git) is showing a
regression from previous production kernels that bisected down to:
sched/fair: Remove sysctl_sched_migration_cost condition (c5b0a7eefc)
The schbench command line was:
schbench -L -m 4 -M auto -t 256 -n 0 -r 0 -s 0
This creates 4 message threads pinned to CPUs 0-3, and 256x4 worker
threads spread across the rest of the CPUs. Neither the worker threads
or the message threads do any work, they just wake each other up and go
back to sleep as soon as possible.
The end result is the first 4 CPUs are pegged waking up those 1024
workers, and the rest of the CPUs are constantly banging in and out of
idle. If I take a v6.9 Linus kernel and revert that one commit,
performance goes from 3.4M RPS to 5.4M RPS.
schedstat shows there are ~100x more new idle balance operations, and
profiling shows the worker threads are spending ~20% of their CPU time
on new idle balance. schedstats also shows that almost all of these new
idle balance attemps are failing to find busy groups.
The fix used here is to crank up the cost of the newidle balance whenever it
fails. Since we don't want sd->max_newidle_lb_cost to grow out of
control, this also changes update_newidle_cost() to use
sysctl_sched_migration_cost as the upper limit on max_newidle_lb_cost.
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20250626144017.1510594-2-clm@fb.com
- Move input parameter validation from tg_set_cfs_bandwidth() to the new
outer function tg_set_bandwidth(). The outer function handles parameters
in usecs, validates them and calls tg_set_cfs_bandwidth() which converts
them into nsecs. This matches tg_bandwidth() on the read side.
- max/min_cfs_* consts are now used by tg_set_bandwidth(). Relocate, convert
into usecs and drop "cfs" from the names.
- Reimplement cpu_cfs_{period|quote|burst}_write_*() using tg_bandwidth()
and tg_set_bandwidth() and replace "cfs" in the names with "bw".
- Update cpu_max_write() to use tg_set_bandiwdth(). cpu_period_quota_parse()
is updated to drop nsec conversion accordingly. This aligns the behavior
with cfs_period_quota_print().
- Drop now unused tg_set_cfs_{period|quota|burst}().
- While at it, for consistency, rename default_cfs_period() to
default_bw_period_us() and make it return usecs.
This is to prepare for adding bandwidth control support to sched_ext.
tg_set_bandwidth() will be used as the muxing point. No functional changes
intended.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250614012346.2358261-5-tj@kernel.org
max_cfs_quota_period is defined in core.c but has a declaration in fair.c.
Move the declaration to kernel/sched/sched.h. Also, move
default_cfs_period() from fair.c to sched.h. No functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250614012346.2358261-2-tj@kernel.org
Simplify the scheduler by making CONFIG_SMP=y primitives and data
structures unconditional.
Introduce transitory wrappers for functionality not yet converted to SMP.
Note that this patch is pretty large, because there's no clear separation
between various aspects of the SMP scheduler, it's basically a huge block
of #ifdef CONFIG_SMP. A fair amount of it has to be switched on for it to
boot and work on UP systems.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250528080924.2273858-21-mingo@kernel.org
Sergey Senozhatsky adds infrastructure for passing algorithm-specific
parameters into zram. A single parameter `winbits' is implemented at
this time.
- The 5 patch series "memcg: nmi-safe kmem charging" from Shakeel Butt
makes memcg charging nmi-safe, which is required by BFP, which can
operate in NMI context.
- The 5 patch series "Some random fixes and cleanup to shmem" from
Kemeng Shi implements small fixes and cleanups in the shmem code.
- The 2 patch series "Skip mm selftests instead when kernel features are
not present" from Zi Yan fixes some issues in the MM selftest code.
- The 2 patch series "mm/damon: build-enable essential DAMON components
by default" from SeongJae Park reworks DAMON Kconfig to make it easier
to enable CONFIG_DAMON.
- The 2 patch series "sched/numa: add statistics of numa balance task
migration" from Libo Chen adds more info into sysfs and procfs files to
improve visibility into the NUMA balancer's task migration activity.
- The 4 patch series "selftests/mm: cow and gup_longterm cleanups" from
Mark Brown provides various updates to some of the MM selftests to make
them play better with the overall containing framework.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaDzA9wAKCRDdBJ7gKXxA
js8sAP9V3COg+vzTmimzP3ocTkkbbIJzDfM6nXpE2EQ4BR3ejwD+NsIT2ZLtTF6O
LqAZpgO7ju6wMjR/lM30ebCq5qFbZAw=
=oruw
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2025-06-01-14-06' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull more MM updates from Andrew Morton:
- "zram: support algorithm-specific parameters" from Sergey Senozhatsky
adds infrastructure for passing algorithm-specific parameters into
zram. A single parameter `winbits' is implemented at this time.
- "memcg: nmi-safe kmem charging" from Shakeel Butt makes memcg
charging nmi-safe, which is required by BFP, which can operate in NMI
context.
- "Some random fixes and cleanup to shmem" from Kemeng Shi implements
small fixes and cleanups in the shmem code.
- "Skip mm selftests instead when kernel features are not present" from
Zi Yan fixes some issues in the MM selftest code.
- "mm/damon: build-enable essential DAMON components by default" from
SeongJae Park reworks DAMON Kconfig to make it easier to enable
CONFIG_DAMON.
- "sched/numa: add statistics of numa balance task migration" from Libo
Chen adds more info into sysfs and procfs files to improve visibility
into the NUMA balancer's task migration activity.
- "selftests/mm: cow and gup_longterm cleanups" from Mark Brown
provides various updates to some of the MM selftests to make them
play better with the overall containing framework.
* tag 'mm-stable-2025-06-01-14-06' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (43 commits)
mm/khugepaged: clean up refcount check using folio_expected_ref_count()
selftests/mm: fix test result reporting in gup_longterm
selftests/mm: report unique test names for each cow test
selftests/mm: add helper for logging test start and results
selftests/mm: use standard ksft_finished() in cow and gup_longterm
selftests/damon/_damon_sysfs: skip testcases if CONFIG_DAMON_SYSFS is disabled
sched/numa: add statistics of numa balance task
sched/numa: fix task swap by skipping kernel threads
tools/testing: check correct variable in open_procmap()
tools/testing/vma: add missing function stub
mm/gup: update comment explaining why gup_fast() disables IRQs
selftests/mm: two fixes for the pfnmap test
mm/khugepaged: fix race with folio split/free using temporary reference
mm: add CONFIG_PAGE_BLOCK_ORDER to select page block order
mmu_notifiers: remove leftover stub macros
selftests/mm: deduplicate test names in madv_populate
kcov: rust: add flags for KCOV with Rust
mm: rust: make CONFIG_MMU ifdefs more narrow
mmu_gather: move tlb flush for VM_PFNMAP/VM_MIXEDMAP vmas into free_pgtables()
mm/damon/Kconfig: enable CONFIG_DAMON by default
...
Patch series "sched/numa: add statistics of numa balance task migration",
v6.
Introduce task migration and swap statistics in the following places:
/sys/fs/cgroup/{GROUP}/memory.stat
/proc/{PID}/sched
/proc/vmstat
These statistics facilitate a rapid evaluation of the performance and
resource utilization of the target workload.
This patch (of 2):
Task swapping is triggered when there are no idle CPUs in task A's
preferred node. In this case, the NUMA load balancer chooses a task B
on A's preferred node and swaps B with A. This helps improve NUMA
locality without introducing load imbalance between nodes. In the
current implementation, B's NUMA node preference is not mandatory.
That is to say, a kernel thread might be incorrectly chosen as B.
However, kernel thread and user space thread that does not have mm are
not supposed to be covered by NUMA balancing because NUMA balancing
only considers user pages via VMAs.
According to Peter's suggestion for fixing this issue, we use
PF_KTHREAD to skip the kernel thread. curr->mm is also checked because
it is possible that user_mode_thread() might create a user thread
without an mm. As per Prateek's analysis, after adding the PF_KTHREAD
check, there is no need to further check the PF_IDLE flag:
: - play_idle_precise() already ensures PF_KTHREAD is set before adding
: PF_IDLE
:
: - cpu_startup_entry() is only called from the startup thread which
: should be marked with PF_KTHREAD (based on my understanding looking at
: commit cff9b2332a ("kernel/sched: Modify initial boot task idle
: setup"))
In summary, the check in task_numa_compare() now aligns with
task_tick_numa().
Link: https://lkml.kernel.org/r/cover.1748493462.git.yu.c.chen@intel.com
Link: https://lkml.kernel.org/r/43d68b356b25d124f0d222ebedf3859e86eefb9f.1748493462.git.yu.c.chen@intel.com
Link: https://lkml.kernel.org/r/cover.1748002400.git.yu.c.chen@intel.com
Link: https://lkml.kernel.org/r/eaacc9c9bd37bac92d43a671867d85b2fdad3b06.1748002400.git.yu.c.chen@intel.com
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Libo Chen <libo.chen@oracle.com>
Suggested-by: Michal Koutný <mkoutny@suse.com>
Tested-by: Ayush Jain <Ayush.jain3@amd.com>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Aubrey Li <aubrey.li@intel.com>
Cc: "Chen, Tim C" <tim.c.chen@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Cc: Mel Gorman <mgorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
simplifies the act of creating a pte which addresses the first page in a
folio and reduces the amount of plumbing which architecture must
implement to provide this.
- The 8 patch series "Misc folio patches for 6.16" from Matthew Wilcox
is a shower of largely unrelated folio infrastructure changes which
clean things up and better prepare us for future work.
- The 3 patch series "memory,x86,acpi: hotplug memory alignment
advisement" from Gregory Price adds early-init code to prevent x86 from
leaving physical memory unused when physical address regions are not
aligned to memory block size.
- The 2 patch series "mm/compaction: allow more aggressive proactive
compaction" from Michal Clapinski provides some tuning of the (sadly,
hard-coded (more sadly, not auto-tuned)) thresholds for our invokation
of proactive compaction. In a simple test case, the reduction of a guest
VM's memory consumption was dramatic.
- The 8 patch series "Minor cleanups and improvements to swap freeing
code" from Kemeng Shi provides some code cleaups and a small efficiency
improvement to this part of our swap handling code.
- The 6 patch series "ptrace: introduce PTRACE_SET_SYSCALL_INFO API"
from Dmitry Levin adds the ability for a ptracer to modify syscalls
arguments. At this time we can alter only "system call information that
are used by strace system call tampering, namely, syscall number,
syscall arguments, and syscall return value.
This series should have been incorporated into mm.git's "non-MM"
branch, but I goofed.
- The 3 patch series "fs/proc: extend the PAGEMAP_SCAN ioctl to report
guard regions" from Andrei Vagin extends the info returned by the
PAGEMAP_SCAN ioctl against /proc/pid/pagemap. This permits CRIU to more
efficiently get at the info about guard regions.
- The 2 patch series "Fix parameter passed to page_mapcount_is_type()"
from Gavin Shan implements that fix. No runtime effect is expected
because validate_page_before_insert() happens to fix up this error.
- The 3 patch series "kernel/events/uprobes: uprobe_write_opcode()
rewrite" from David Hildenbrand basically brings uprobe text poking into
the current decade. Remove a bunch of hand-rolled implementation in
favor of using more current facilities.
- The 3 patch series "mm/ptdump: Drop assumption that pxd_val() is u64"
from Anshuman Khandual provides enhancements and generalizations to the
pte dumping code. This might be needed when 128-bit Page Table
Descriptors are enabled for ARM.
- The 12 patch series "Always call constructor for kernel page tables"
from Kevin Brodsky "ensures that the ctor/dtor is always called for
kernel pgtables, as it already is for user pgtables". This permits the
addition of more functionality such as "insert hooks to protect page
tables". This change does result in various architectures performing
unnecesary work, but this is fixed up where it is anticipated to occur.
- The 9 patch series "Rust support for mm_struct, vm_area_struct, and
mmap" from Alice Ryhl adds plumbing to permit Rust access to core MM
structures.
- The 3 patch series "fix incorrectly disallowed anonymous VMA merges"
from Lorenzo Stoakes takes advantage of some VMA merging opportunities
which we've been missing for 15 years.
- The 4 patch series "mm/madvise: batch tlb flushes for MADV_DONTNEED
and MADV_FREE" from SeongJae Park optimizes process_madvise()'s TLB
flushing. Instead of flushing each address range in the provided iovec,
we batch the flushing across all the iovec entries. The syscall's cost
was approximately halved with a microbenchmark which was designed to
load this particular operation.
- The 6 patch series "Track node vacancy to reduce worst case allocation
counts" from Sidhartha Kumar makes the maple tree smarter about its node
preallocation. stress-ng mmap performance increased by single-digit
percentages and the amount of unnecessarily preallocated memory was
dramaticelly reduced.
- The 3 patch series "mm/gup: Minor fix, cleanup and improvements" from
Baoquan He removes a few unnecessary things which Baoquan noted when
reading the code.
- The 3 patch series ""Enhance sysfs handling for memory hotplug in
weighted interleave" from Rakie Kim "enhances the weighted interleave
policy in the memory management subsystem by improving sysfs handling,
fixing memory leaks, and introducing dynamic sysfs updates for memory
hotplug support". Fixes things on error paths which we are unlikely to
hit.
- The 7 patch series "mm/damon: auto-tune DAMOS for NUMA setups
including tiered memory" from SeongJae Park introduces new DAMOS quota
goal metrics which eliminate the manual tuning which is required when
utilizing DAMON for memory tiering.
- The 5 patch series "mm/vmalloc.c: code cleanup and improvements" from
Baoquan He provides cleanups and small efficiency improvements which
Baoquan found via code inspection.
- The 2 patch series "vmscan: enforce mems_effective during demotion"
from Gregory Price "changes reclaim to respect cpuset.mems_effective
during demotion when possible". because "presently, reclaim explicitly
ignores cpuset.mems_effective when demoting, which may cause the cpuset
settings to violated." "This is useful for isolating workloads on a
multi-tenant system from certain classes of memory more consistently."
- The 2 patch series ""Clean up split_huge_pmd_locked() and remove
unnecessary folio pointers" from Gavin Guo provides minor cleanups and
efficiency gains in in the huge page splitting and migrating code.
- The 3 patch series "Use kmem_cache for memcg alloc" from Huan Yang
creates a slab cache for `struct mem_cgroup', yielding improved memory
utilization.
- The 4 patch series "add max arg to swappiness in memory.reclaim and
lru_gen" from Zhongkun He adds a new "max" argument to the "swappiness="
argument for memory.reclaim MGLRU's lru_gen. This directs proactive
reclaim to reclaim from only anon folios rather than file-backed folios.
- The 17 patch series "kexec: introduce Kexec HandOver (KHO)" from Mike
Rapoport is the first step on the path to permitting the kernel to
maintain existing VMs while replacing the host kernel via file-based
kexec. At this time only memblock's reserve_mem is preserved.
- The 7 patch series "mm: Introduce for_each_valid_pfn()" from David
Woodhouse provides and uses a smarter way of looping over a pfn range.
By skipping ranges of invalid pfns.
- The 2 patch series "sched/numa: Skip VMA scanning on memory pinned to
one NUMA node via cpuset.mems" from Libo Chen removes a lot of pointless
VMA scanning when a task is pinned a single NUMA mode. Dramatic
performance benefits were seen in some real world cases.
- The 2 patch series "JFS: Implement migrate_folio for
jfs_metapage_aops" from Shivank Garg addresses a warning which occurs
during memory compaction when using JFS.
- The 4 patch series "move all VMA allocation, freeing and duplication
logic to mm" from Lorenzo Stoakes moves some VMA code from kernel/fork.c
into the more appropriate mm/vma.c.
- The 6 patch series "mm, swap: clean up swap cache mapping helper" from
Kairui Song provides code consolidation and cleanups related to the
folio_index() function.
- The 2 patch series "mm/gup: Cleanup memfd_pin_folios()" from Vishal
Moola does that.
- The 8 patch series "memcg: Fix test_memcg_min/low test failures" from
Waiman Long addresses some bogus failures which are being reported by
the test_memcontrol selftest.
- The 3 patch series "eliminate mmap() retry merge, add .mmap_prepare
hook" from Lorenzo Stoakes commences the deprecation of
file_operations.mmap() in favor of the new
file_operations.mmap_prepare(). The latter is more restrictive and
prevents drivers from messing with things in ways which, amongst other
problems, may defeat VMA merging.
- The 4 patch series "memcg: decouple memcg and objcg stocks"" from
Shakeel Butt decouples the per-cpu memcg charge cache from the objcg's
one. This is a step along the way to making memcg and objcg charging
NMI-safe, which is a BPF requirement.
- The 6 patch series "mm/damon: minor fixups and improvements for code,
tests, and documents" from SeongJae Park is "yet another batch of
miscellaneous DAMON changes. Fix and improve minor problems in code,
tests and documents."
- The 7 patch series "memcg: make memcg stats irq safe" from Shakeel
Butt converts memcg stats to be irq safe. Another step along the way to
making memcg charging and stats updates NMI-safe, a BPF requirement.
- The 4 patch series "Let unmap_hugepage_range() and several related
functions take folio instead of page" from Fan Ni provides folio
conversions in the hugetlb code.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaDt5qgAKCRDdBJ7gKXxA
ju6XAP9nTiSfRz8Cz1n5LJZpFKEGzLpSihCYyR6P3o1L9oe3mwEAlZ5+XAwk2I5x
Qqb/UGMEpilyre1PayQqOnct3aSL9Ao=
=tYYm
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2025-05-31-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- "Add folio_mk_pte()" from Matthew Wilcox simplifies the act of
creating a pte which addresses the first page in a folio and reduces
the amount of plumbing which architecture must implement to provide
this.
- "Misc folio patches for 6.16" from Matthew Wilcox is a shower of
largely unrelated folio infrastructure changes which clean things up
and better prepare us for future work.
- "memory,x86,acpi: hotplug memory alignment advisement" from Gregory
Price adds early-init code to prevent x86 from leaving physical
memory unused when physical address regions are not aligned to memory
block size.
- "mm/compaction: allow more aggressive proactive compaction" from
Michal Clapinski provides some tuning of the (sadly, hard-coded (more
sadly, not auto-tuned)) thresholds for our invokation of proactive
compaction. In a simple test case, the reduction of a guest VM's
memory consumption was dramatic.
- "Minor cleanups and improvements to swap freeing code" from Kemeng
Shi provides some code cleaups and a small efficiency improvement to
this part of our swap handling code.
- "ptrace: introduce PTRACE_SET_SYSCALL_INFO API" from Dmitry Levin
adds the ability for a ptracer to modify syscalls arguments. At this
time we can alter only "system call information that are used by
strace system call tampering, namely, syscall number, syscall
arguments, and syscall return value.
This series should have been incorporated into mm.git's "non-MM"
branch, but I goofed.
- "fs/proc: extend the PAGEMAP_SCAN ioctl to report guard regions" from
Andrei Vagin extends the info returned by the PAGEMAP_SCAN ioctl
against /proc/pid/pagemap. This permits CRIU to more efficiently get
at the info about guard regions.
- "Fix parameter passed to page_mapcount_is_type()" from Gavin Shan
implements that fix. No runtime effect is expected because
validate_page_before_insert() happens to fix up this error.
- "kernel/events/uprobes: uprobe_write_opcode() rewrite" from David
Hildenbrand basically brings uprobe text poking into the current
decade. Remove a bunch of hand-rolled implementation in favor of
using more current facilities.
- "mm/ptdump: Drop assumption that pxd_val() is u64" from Anshuman
Khandual provides enhancements and generalizations to the pte dumping
code. This might be needed when 128-bit Page Table Descriptors are
enabled for ARM.
- "Always call constructor for kernel page tables" from Kevin Brodsky
ensures that the ctor/dtor is always called for kernel pgtables, as
it already is for user pgtables.
This permits the addition of more functionality such as "insert hooks
to protect page tables". This change does result in various
architectures performing unnecesary work, but this is fixed up where
it is anticipated to occur.
- "Rust support for mm_struct, vm_area_struct, and mmap" from Alice
Ryhl adds plumbing to permit Rust access to core MM structures.
- "fix incorrectly disallowed anonymous VMA merges" from Lorenzo
Stoakes takes advantage of some VMA merging opportunities which we've
been missing for 15 years.
- "mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE" from
SeongJae Park optimizes process_madvise()'s TLB flushing.
Instead of flushing each address range in the provided iovec, we
batch the flushing across all the iovec entries. The syscall's cost
was approximately halved with a microbenchmark which was designed to
load this particular operation.
- "Track node vacancy to reduce worst case allocation counts" from
Sidhartha Kumar makes the maple tree smarter about its node
preallocation.
stress-ng mmap performance increased by single-digit percentages and
the amount of unnecessarily preallocated memory was dramaticelly
reduced.
- "mm/gup: Minor fix, cleanup and improvements" from Baoquan He removes
a few unnecessary things which Baoquan noted when reading the code.
- ""Enhance sysfs handling for memory hotplug in weighted interleave"
from Rakie Kim "enhances the weighted interleave policy in the memory
management subsystem by improving sysfs handling, fixing memory
leaks, and introducing dynamic sysfs updates for memory hotplug
support". Fixes things on error paths which we are unlikely to hit.
- "mm/damon: auto-tune DAMOS for NUMA setups including tiered memory"
from SeongJae Park introduces new DAMOS quota goal metrics which
eliminate the manual tuning which is required when utilizing DAMON
for memory tiering.
- "mm/vmalloc.c: code cleanup and improvements" from Baoquan He
provides cleanups and small efficiency improvements which Baoquan
found via code inspection.
- "vmscan: enforce mems_effective during demotion" from Gregory Price
changes reclaim to respect cpuset.mems_effective during demotion when
possible. because presently, reclaim explicitly ignores
cpuset.mems_effective when demoting, which may cause the cpuset
settings to violated.
This is useful for isolating workloads on a multi-tenant system from
certain classes of memory more consistently.
- "Clean up split_huge_pmd_locked() and remove unnecessary folio
pointers" from Gavin Guo provides minor cleanups and efficiency gains
in in the huge page splitting and migrating code.
- "Use kmem_cache for memcg alloc" from Huan Yang creates a slab cache
for `struct mem_cgroup', yielding improved memory utilization.
- "add max arg to swappiness in memory.reclaim and lru_gen" from
Zhongkun He adds a new "max" argument to the "swappiness=" argument
for memory.reclaim MGLRU's lru_gen.
This directs proactive reclaim to reclaim from only anon folios
rather than file-backed folios.
- "kexec: introduce Kexec HandOver (KHO)" from Mike Rapoport is the
first step on the path to permitting the kernel to maintain existing
VMs while replacing the host kernel via file-based kexec. At this
time only memblock's reserve_mem is preserved.
- "mm: Introduce for_each_valid_pfn()" from David Woodhouse provides
and uses a smarter way of looping over a pfn range. By skipping
ranges of invalid pfns.
- "sched/numa: Skip VMA scanning on memory pinned to one NUMA node via
cpuset.mems" from Libo Chen removes a lot of pointless VMA scanning
when a task is pinned a single NUMA mode.
Dramatic performance benefits were seen in some real world cases.
- "JFS: Implement migrate_folio for jfs_metapage_aops" from Shivank
Garg addresses a warning which occurs during memory compaction when
using JFS.
- "move all VMA allocation, freeing and duplication logic to mm" from
Lorenzo Stoakes moves some VMA code from kernel/fork.c into the more
appropriate mm/vma.c.
- "mm, swap: clean up swap cache mapping helper" from Kairui Song
provides code consolidation and cleanups related to the folio_index()
function.
- "mm/gup: Cleanup memfd_pin_folios()" from Vishal Moola does that.
- "memcg: Fix test_memcg_min/low test failures" from Waiman Long
addresses some bogus failures which are being reported by the
test_memcontrol selftest.
- "eliminate mmap() retry merge, add .mmap_prepare hook" from Lorenzo
Stoakes commences the deprecation of file_operations.mmap() in favor
of the new file_operations.mmap_prepare().
The latter is more restrictive and prevents drivers from messing with
things in ways which, amongst other problems, may defeat VMA merging.
- "memcg: decouple memcg and objcg stocks"" from Shakeel Butt decouples
the per-cpu memcg charge cache from the objcg's one.
This is a step along the way to making memcg and objcg charging
NMI-safe, which is a BPF requirement.
- "mm/damon: minor fixups and improvements for code, tests, and
documents" from SeongJae Park is yet another batch of miscellaneous
DAMON changes. Fix and improve minor problems in code, tests and
documents.
- "memcg: make memcg stats irq safe" from Shakeel Butt converts memcg
stats to be irq safe. Another step along the way to making memcg
charging and stats updates NMI-safe, a BPF requirement.
- "Let unmap_hugepage_range() and several related functions take folio
instead of page" from Fan Ni provides folio conversions in the
hugetlb code.
* tag 'mm-stable-2025-05-31-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (285 commits)
mm: pcp: increase pcp->free_count threshold to trigger free_high
mm/hugetlb: convert use of struct page to folio in __unmap_hugepage_range()
mm/hugetlb: refactor __unmap_hugepage_range() to take folio instead of page
mm/hugetlb: refactor unmap_hugepage_range() to take folio instead of page
mm/hugetlb: pass folio instead of page to unmap_ref_private()
memcg: objcg stock trylock without irq disabling
memcg: no stock lock for cpu hot-unplug
memcg: make __mod_memcg_lruvec_state re-entrant safe against irqs
memcg: make count_memcg_events re-entrant safe against irqs
memcg: make mod_memcg_state re-entrant safe against irqs
memcg: move preempt disable to callers of memcg_rstat_updated
memcg: memcg_rstat_updated re-entrant safe against irqs
mm: khugepaged: decouple SHMEM and file folios' collapse
selftests/eventfd: correct test name and improve messages
alloc_tag: check mem_profiling_support in alloc_tag_init
Docs/damon: update titles and brief introductions to explain DAMOS
selftests/damon/_damon_sysfs: read tried regions directories in order
mm/damon/tests/core-kunit: add a test for damos_set_filters_default_reject()
mm/damon/paddr: remove unused variable, folio_list, in damon_pa_stat()
mm/damon/sysfs-schemes: fix wrong comment on damons_sysfs_quota_goal_metric_strs
...
To prevent double enqueue/dequeue of the util-est for sched_delayed tasks,
commit 729288bc68 ("kernel/sched: Fix util_est accounting for DELAY_DEQUEUE")
added the corresponding check. This check excludes double en/dequeue during
task migration and priority changes.
In fact, these conditions can be simplified.
For util_est_dequeue, we know that sched_delayed flag is set in dequeue_entity.
When the task is sleeping, we need to call util_est_dequeue to subtract
util-est from the cfs_rq. At this point, sched_delayed has not yet been set.
If we find that sched_delayed is already set, it indicates that this task
has already called dequeue_task_fair once. In this case, there is no need to
call util_est_dequeue again. Therefore, simply checking the sched_delayed flag
should be sufficient to prevent unnecessary util_est updates during the dequeue.
For util_est_enqueue, our goal is to add the util_est to the cfs_rq
when task enqueue. However, we don't want to add the util_est of a
sched_delayed task to the cfs_rq because the task is sleeping.
Therefore, we can exclude the util_est_enqueue for sched_delayed tasks
by checking the sched_delayed flag. However, when waking up a delayed task,
the sched_delayed flag is cleared after util_est_enqueue. As a result,
if we only check the sched_delayed flag, we would miss the util_est_enqueue.
Since waking up a sched_delayed task calls enqueue_task with the ENQUEUE_DELAYED flag,
we can determine whether to call util_est_enqueue by checking if the
enqueue_flag contains ENQUEUE_DELAYED.
Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20250417043457.10632-2-xuewen.yan@unisoc.com
Delayed dequeued feature keeps a sleeping task enqueued until its
lag has elapsed. As a result, it stays also visible in rq->nr_running.
So when in wake_affine_idle(), we should use the real running-tasks
in rq to check whether we should place the wake-up task to
current cpu.
On the other hand, add a helper function to return the nr-delayed.
Fixes: 152e11f6df ("sched/fair: Implement delayed dequeue")
Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
Reviewed-and-tested-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250303105241.17251-2-xuewen.yan@unisoc.com
Patch series "sched/numa: Skip VMA scanning on memory pinned to one NUMA
node via cpuset.mems", v5.
This patch (of 2):
When the memory of the current task is pinned to one NUMA node by cgroup,
there is no point in continuing the rest of VMA scanning and hinting page
faults as they will just be overhead. With this change, there will be no
more unnecessary PTE updates or page faults in this scenario.
We have seen up to a 6x improvement on a typical java workload running on
VMs with memory and CPU pinned to one NUMA node via cpuset in a two-socket
AARCH64 system. With the same pinning, on a 18-cores-per-socket Intel
platform, we have seen 20% improvment in a microbench that creates a
30-vCPU selftest KVM guest with 4GB memory, where each vCPU reads 4KB
pages in a fixed number of loops.
Link: https://lkml.kernel.org/r/20250424024523.2298272-1-libo.chen@oracle.com
Link: https://lkml.kernel.org/r/20250424024523.2298272-2-libo.chen@oracle.com
Signed-off-by: Libo Chen <libo.chen@oracle.com>
Tested-by: Chen Yu <yu.c.chen@intel.com>
Tested-by: Srikanth Aithal <sraithal@amd.com>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Cc: "Chen, Tim C" <tim.c.chen@intel.com>
Cc: Chris Hyser <chris.hyser@oracle.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Cc: Mel Gorman <mgorman <mgorman@suse.de>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Raghavendra K T <raghavendra.kt@amd.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
There is a code path in dequeue_entities() that can set the slice of a
sched_entity to U64_MAX, which sometimes results in a crash.
The offending case is when dequeue_entities() is called to dequeue a
delayed group entity, and then the entity's parent's dequeue is delayed.
In that case:
1. In the if (entity_is_task(se)) else block at the beginning of
dequeue_entities(), slice is set to
cfs_rq_min_slice(group_cfs_rq(se)). If the entity was delayed, then
it has no queued tasks, so cfs_rq_min_slice() returns U64_MAX.
2. The first for_each_sched_entity() loop dequeues the entity.
3. If the entity was its parent's only child, then the next iteration
tries to dequeue the parent.
4. If the parent's dequeue needs to be delayed, then it breaks from the
first for_each_sched_entity() loop _without updating slice_.
5. The second for_each_sched_entity() loop sets the parent's ->slice to
the saved slice, which is still U64_MAX.
This throws off subsequent calculations with potentially catastrophic
results. A manifestation we saw in production was:
6. In update_entity_lag(), se->slice is used to calculate limit, which
ends up as a huge negative number.
7. limit is used in se->vlag = clamp(vlag, -limit, limit). Because limit
is negative, vlag > limit, so se->vlag is set to the same huge
negative number.
8. In place_entity(), se->vlag is scaled, which overflows and results in
another huge (positive or negative) number.
9. The adjusted lag is subtracted from se->vruntime, which increases or
decreases se->vruntime by a huge number.
10. pick_eevdf() calls entity_eligible()/vruntime_eligible(), which
incorrectly returns false because the vruntime is so far from the
other vruntimes on the queue, causing the
(vruntime - cfs_rq->min_vruntime) * load calulation to overflow.
11. Nothing appears to be eligible, so pick_eevdf() returns NULL.
12. pick_next_entity() tries to dereference the return value of
pick_eevdf() and crashes.
Dumping the cfs_rq states from the core dumps with drgn showed tell-tale
huge vruntime ranges and bogus vlag values, and I also traced se->slice
being set to U64_MAX on live systems (which was usually "benign" since
the rest of the runqueue needed to be in a particular state to crash).
Fix it in dequeue_entities() by always setting slice from the first
non-empty cfs_rq.
Fixes: aef6987d89 ("sched/eevdf: Propagate min_slice up the cgroup hierarchy")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/f0c2d1072be229e1bdddc73c0703919a8b00c652.1745570998.git.osandov@fb.com
Mike reports that commit 6d71a9c616 ("sched/fair: Fix EEVDF entity
placement bug causing scheduling lag") relies on commit 4423af84b2
("sched/fair: optimize the PLACE_LAG when se->vlag is zero") to not
trip a WARN in place_entity().
What happens is that the lag of the very last entity is 0 per
definition -- the average of one element matches the value of that
element. Therefore place_entity() will match the condition skipping
the lag adjustment:
if (sched_feat(PLACE_LAG) && cfs_rq->nr_queued && se->vlag) {
Without the 'se->vlag' condition -- it will attempt to adjust the zero
lag even though we're inserting into an empty tree.
Notably, we should have failed the 'cfs_rq->nr_queued' condition, but
don't because they didn't get updated.
Additionally, move update_load_add() after placement() as is
consistent with other place_entity() users -- this change is
non-functional, place_entity() does not use cfs_rq->load.
Fixes: 6d71a9c616 ("sched/fair: Fix EEVDF entity placement bug causing scheduling lag")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/c216eb4ef0e0e0029c600aefc69d56681cee5581.camel@gmx.de
Subsequent commits add the support to dynamically update the sched_group
struct's "asym_prefer_cpu" member from a remote CPU. Use READ_ONCE()
when reading the "sg->asym_prefer_cpu" to ensure load balancer always
reads the latest value.
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250409053446.23367-2-kprateek.nayak@amd.com
commit 10a35e6812 ("sched/pelt: Skip updating util_est when
utilization is higher than CPU's capacity")
prevents util_est from being updated if util_avg is higher than the
underlying CPU capacity to avoid overestimating the task when the CPU
is capped (due to thermal issue for instance). In this scenario, the
task will miss its deadlines and start overlapping its wake-up events
for instance. The task will appear as always running when the CPU is
just not powerful enough to allow having a good estimation of the
task.
commit b8c9636140 ("sched/fair/util_est: Implement faster ramp-up
EWMA on utilization increases")
sets ewma to util_avg when ewma > util_avg, allowing ewma to quickly
grow instead of slowly converge to the new util_avg value when a task
profile changes from small to big.
However, the 2 conditions:
- Check util_avg against max CPU capacity
- Check whether util_est > util_avg
are placed in an order such as it is possible to set util_est to a
value higher than the CPU capacity if util_est > util_avg, but
util_est is prevented to decay as long as:
CPU capacity < util_avg < util_est.
Just remove the check as either:
1.
There is idle time on the CPU. In that case the util_avg value of the
task is actually correct. It is possible that the task missed a
deadline and appears bigger, but this is also the case when the
util_avg of the task is lower than the maximum CPU capacity.
2.
There is no idle time. In that case, the util_avg value might aswell
be an under estimation of the size of the task.
It is possible that undesired frequency spikes will appear when the
task is later enqueued with an inflated util_est value, but the
frequency spike might aswell be deserved. The absence of idle time
prevents from drawing any conclusion.
Signed-off-by: Pierre Gondois <pierre.gondois@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.rog>
Link: https://lore.kernel.org/r/20250325150542.1077344-1-pierre.gondois@arm.com
hrtimers are initialized with hrtimer_init() and a subsequent store to
the callback pointer. This turned out to be suboptimal for the upcoming
Rust integration and is obviously a silly implementation to begin with.
This cleanup replaces the hrtimer_init(T); T->function = cb; sequence
with hrtimer_setup(T, cb);
The conversion was done with Coccinelle and a few manual fixups.
Once the conversion has completely landed in mainline, hrtimer_init()
will be removed and the hrtimer::function becomes a private member.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmff5jQTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoVvRD/wKtuwmiA66NJFgXC0qVq82A6fO3bY8
GBdbfysDJIbqGu5PTcULTbJ8qkqv3jeLUv6CcXvS4sZ7y/uJQl2lzf8yrD/0bbwc
rLI6sHiPSZmK93kNVN4X5H7kvt7cE/DYC9nnEOgK3BY5FgKc4n9887d4aVBhL8Lv
ODwVXvZ+xi351YCj7qRyPU24zt/p4tkkT1o2k4a0HBluqLI0D+V20fke9IERUL8r
d1uWKlcn0TqYDesE8HXKIhbst3gx52rMJrXBJDHwFmG6v8Pj1fkTXCVpPo8QcBz8
OTVkpomN9f/Tx4+GZwhZOF86LhLL3OhxD6pT7JhFCXdmSGv+Ez8uyk1YZysM/XpV
Juy/1yAcBpDIDkmhMFGdAAn48Nn9Fotty0r4je60zSEp1d/4QMXcFme29qr2JTUE
iWnQ/HD6DxUjVHqy7CYvvo26Xegg1C7qgyOVt4PYZwAM1VKF5P3kzYTb4SAdxtop
Tpji1sfW9QV08jqMNo6XntD32DSP9S2HqjO9LwBw700jnx2jjJ35fcJs6iodMOUn
gckIZLMn3L0OoglPdyA5O7SNTbKE7aFiRKdnT/cJtR3Fa39Qu27CwC5gfiyuie9I
Q+LG8GLuYSBHXAR+PBK4GWlzJ7Dn8k3eqmbnLeKpRMsU6ZzcttgA64xhaviN2wN0
iJbvLJeisXr3GA==
=bYAX
-----END PGP SIGNATURE-----
Merge tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer cleanups from Thomas Gleixner:
"A treewide hrtimer timer cleanup
hrtimers are initialized with hrtimer_init() and a subsequent store to
the callback pointer. This turned out to be suboptimal for the
upcoming Rust integration and is obviously a silly implementation to
begin with.
This cleanup replaces the hrtimer_init(T); T->function = cb; sequence
with hrtimer_setup(T, cb);
The conversion was done with Coccinelle and a few manual fixups.
Once the conversion has completely landed in mainline, hrtimer_init()
will be removed and the hrtimer::function becomes a private member"
* tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (100 commits)
wifi: rt2x00: Switch to use hrtimer_update_function()
io_uring: Use helper function hrtimer_update_function()
serial: xilinx_uartps: Use helper function hrtimer_update_function()
ASoC: fsl: imx-pcm-fiq: Switch to use hrtimer_setup()
RDMA: Switch to use hrtimer_setup()
virtio: mem: Switch to use hrtimer_setup()
drm/vmwgfx: Switch to use hrtimer_setup()
drm/xe/oa: Switch to use hrtimer_setup()
drm/vkms: Switch to use hrtimer_setup()
drm/msm: Switch to use hrtimer_setup()
drm/i915/request: Switch to use hrtimer_setup()
drm/i915/uncore: Switch to use hrtimer_setup()
drm/i915/pmu: Switch to use hrtimer_setup()
drm/i915/perf: Switch to use hrtimer_setup()
drm/i915/gvt: Switch to use hrtimer_setup()
drm/i915/huc: Switch to use hrtimer_setup()
drm/amdgpu: Switch to use hrtimer_setup()
stm class: heartbeat: Switch to use hrtimer_setup()
i2c: Switch to use hrtimer_setup()
iio: Switch to use hrtimer_setup()
...
All the big Linux distros enable CONFIG_SCHED_DEBUG, because
the various features it provides help not just with kernel
development, but with system administration and user-space
software development as well.
Reflect this reality and enable this functionality
unconditionally.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250317104257.3496611-4-mingo@kernel.org
The scheduler has this special SCHED_WARN() facility that
depends on CONFIG_SCHED_DEBUG.
Since CONFIG_SCHED_DEBUG is getting removed, convert
SCHED_WARN() to WARN_ON_ONCE().
Note that the warning output isn't 100% equivalent:
#define SCHED_WARN_ON(x) WARN_ONCE(x, #x)
Because SCHED_WARN_ON() would output the 'x' condition
as well, while WARN_ONCE() will only show a backtrace.
Hopefully these are rare enough to not really matter.
If it does, we should probably introduce a new WARN_ON()
variant that outputs the condition in stringified form,
or improve WARN_ON() itself.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250317104257.3496611-2-mingo@kernel.org
child_cfs_rq_on_list attempts to convert a 'prev' pointer to a cfs_rq.
This 'prev' pointer can originate from struct rq's leaf_cfs_rq_list,
making the conversion invalid and potentially leading to memory
corruption. Depending on the relative positions of leaf_cfs_rq_list and
the task group (tg) pointer within the struct, this can cause a memory
fault or access garbage data.
The issue arises in list_add_leaf_cfs_rq, where both
cfs_rq->leaf_cfs_rq_list and rq->leaf_cfs_rq_list are added to the same
leaf list. Also, rq->tmp_alone_branch can be set to rq->leaf_cfs_rq_list.
This adds a check `if (prev == &rq->leaf_cfs_rq_list)` after the main
conditional in child_cfs_rq_on_list. This ensures that the container_of
operation will convert a correct cfs_rq struct.
This check is sufficient because only cfs_rqs on the same CPU are added
to the list, so verifying the 'prev' pointer against the current rq's list
head is enough.
Fixes a potential memory corruption issue that due to current struct
layout might not be manifesting as a crash but could lead to unpredictable
behavior when the layout changes.
Fixes: fdaba61ef8 ("sched/fair: Ensure that the CFS parent is added after unthrottling")
Signed-off-by: Zecheng Li <zecheng@google.com>
Reviewed-and-tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250304214031.2882646-1-zecheng@google.com
hrtimer_setup() takes the callback function pointer as argument and
initializes the timer completely.
Replace hrtimer_init() and the open coded initialization of
hrtimer::function with the new setup mechanism.
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/a55e849cba3c41b4c5708be6ea6be6f337d1a8fb.1738746821.git.namcao@linutronix.de
The function "can_migrate_task()" utilize "for_each_cpu_and" with a
"if" statement inside to find the destination cpu. It's the same logic
to find the first set bit of the result of the bitwise-AND of
"env->dst_grpmask", "env->cpus" and "p->cpus_ptr".
Refactor it by using "cpumask_first_and_and()" to perform bitwise-AND
for "env->dst_grpmask", "env->cpus" and "p->cpus_ptr" and pick the
first cpu within the intersection as the destination cpu, so we can
elimate the need of looping and multiple times of branch.
After the refactoring this part of the code can speed up from ~115ns
to ~54ns, according to the test below.
Ran the test for 5 times and the result is showned in the following
table, and the test script is paste in next section.
-------------------------------------------------------
|Old method| 130| 118| 115| 109| 106| avg ~115ns|
-------------------------------------------------------
|New method| 58| 55| 54| 48| 55| avg ~54ns|
-------------------------------------------------------
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250210103019.283824-1-richard120310@gmail.com
When a task is enqueued and its parent cgroup se is already on_rq, this
parent cgroup se will not be enqueued again, and hence the root->min_slice
leaves unchanged. The same issue happens when a task is dequeued and its
parent cgroup se has other runnable entities, and the parent cgroup se
will not be dequeued.
Force propagating min_slice when se doesn't need to be enqueued or
dequeued. Ensure the se hierarchy always get the latest min_slice.
Fixes: aef6987d89 ("sched/eevdf: Propagate min_slice up the cgroup hierarchy")
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250211063659.7180-1-dtcccc@linux.alibaba.com
The old default value for slice is 0.75 msec * (1 + ilog(ncpus)) which
means that we have a default slice of:
0.75 for 1 cpu
1.50 up to 3 cpus
2.25 up to 7 cpus
3.00 for 8 cpus and above.
For HZ=250 and HZ=100, because of the tick accuracy, the runtime of
tasks is far higher than their slice.
For HZ=1000 with 8 cpus or more, the accuracy of tick is already
satisfactory, but there is still an issue that tasks will get an extra
tick because the tick often arrives a little faster than expected. In
this case, the task can only wait until the next tick to consider that it
has reached its deadline, and will run 1ms longer.
vruntime + sysctl_sched_base_slice = deadline
|-----------|-----------|-----------|-----------|
1ms 1ms 1ms 1ms
^ ^ ^ ^
tick1 tick2 tick3 tick4(nearly 4ms)
There are two reasons for tick error: clockevent precision and the
CONFIG_IRQ_TIME_ACCOUNTING/CONFIG_PARAVIRT_TIME_ACCOUNTING. with
CONFIG_IRQ_TIME_ACCOUNTING every tick will be less than 1ms, but even
without it, because of clockevent precision, tick still often less than
1ms.
In order to make scheduling more precise, we changed 0.75 to 0.70,
Using 0.70 instead of 0.75 should not change much for other configs
and would fix this issue:
0.70 for 1 cpu
1.40 up to 3 cpus
2.10 up to 7 cpus
2.8 for 8 cpus and above.
This does not guarantee that tasks can run the slice time accurately
every time, but occasionally running an extra tick has little impact.
Signed-off-by: zihan zhou <15645113830zzh@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20250208075322.13139-1-15645113830zzh@gmail.com
A wakeup non-idle entity should preempt idle entity at any time,
but because of the slice protection of the idle entity, the non-idle
entity has to wait, so just cancel it.
This patch is aimed at minimizing the impact of SCHED_IDLE on
SCHED_NORMAL. For example, a task with SCHED_IDLE policy that sleeps for
1s and then runs for 3 ms, running cyclictest on the same cpu, has a
maximum latency of 3 ms, which is caused by the slice protection of the
idle entity. It is unreasonable. With this patch, the cyclictest latency
under the same conditions is basically the same on the cpu with idle
processes and on empty cpu.
[peterz: add helpers]
Fixes: 63304558ba ("sched/eevdf: Curb wakeup-preemption")
Signed-off-by: zihan zhou <15645113830zzh@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20250208080850.16300-1-15645113830zzh@gmail.com
defensive SCHED_WARN_ON() on certain workloads. The bug is
believed to be (accidentally) self-correcting, hence no
behavioral side effects are expected.
Also print se.slice in debug output, since this value can now be
set via the syscall ABI and can be useful to track.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmenIcsRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1iyZA//ZpzGiiZ8coXBk4PQ77c0+BOgSGdkbWeR
zicKvqWd+j9skOTVrIk8MPs7D3C5uNuDuVAeNWYalBHRO7ndLfCg36pHqR8tQ3xN
2GwziJzKpi1r4WXBSCtFe/abKrMhIsMYiqKDQz3Ry2Zfn+TuY4t4bzEYgfm481mO
FkLcXs97RI5sFsCBI0uhRu76A9jrNRZKW2zczHTb2MS4zjwGq1yL3vkx2M11d0Su
BqrcPjz6mZFuzCrA6pW0QeSP4Mn5xL8c2seoMYnvFrPYzq47Thxm4nOWPURCN9uX
6MmqNXisJjTe00noswhbcUia1n2cr37uu5d21tXhn7Rf2XTO4+WRE7zjzvxLKQCr
DtcH1XqvnChPGPFTsSlso4vbygfEZIk14DWZ71mXQJto6t06A/yjFiAR82ZIKzLA
ZlK6hnX2rQ8WST3XSixiRPU/5xDu3ptJ0aDRV/dScCMzb7c0Y2YSeqA4xBMcZDPq
A1TrNUsnsvH0TJXXqeslDSxxEFUKHNiOI2itzMCxAteEZXPysa8Rvcl4zFOZ5A6F
nb8hDN94dydNnPpf4g+ZIVEHs5ES3Yeexh1dGwCjQSp72qAJK3wkHQ6KfPN/T3Gb
PbDVw93cYJI/fhB2BX9DcnN+uNLTrqr1kfY7uFBE1feMdnQNBHILdjbByILfQAQ+
0r7+aYwn0F0=
=5mot
-----END PGP SIGNATURE-----
Merge tag 'sched-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
"Fix a cfs_rq->h_nr_runnable accounting bug that trips up a defensive
SCHED_WARN_ON() on certain workloads. The bug is believed to be
(accidentally) self-correcting, hence no behavioral side effects are
expected.
Also print se.slice in debug output, since this value can now be set
via the syscall ABI and can be useful to track"
* tag 'sched-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/debug: Provide slice length for fair tasks
sched/fair: Fix inaccurate h_nr_runnable accounting with delayed dequeue
Add the const qualifier to all the ctl_tables in the tree except for
watchdog_hardlockup_sysctl, memory_allocation_profiling_sysctls,
loadpin_sysctl_table and the ones calling register_net_sysctl (./net,
drivers/inifiniband dirs). These are special cases as they use a
registration function with a non-const qualified ctl_table argument or
modify the arrays before passing them on to the registration function.
Constifying ctl_table structs will prevent the modification of
proc_handler function pointers as the arrays would reside in .rodata.
This is made possible after commit 78eb4ea25c ("sysctl: treewide:
constify the ctl_table argument of proc_handlers") constified all the
proc_handlers.
Created this by running an spatch followed by a sed command:
Spatch:
virtual patch
@
depends on !(file in "net")
disable optional_qualifier
@
identifier table_name != {
watchdog_hardlockup_sysctl,
iwcm_ctl_table,
ucma_ctl_table,
memory_allocation_profiling_sysctls,
loadpin_sysctl_table
};
@@
+ const
struct ctl_table table_name [] = { ... };
sed:
sed --in-place \
-e "s/struct ctl_table .table = &uts_kern/const struct ctl_table *table = \&uts_kern/" \
kernel/utsname_sysctl.c
Reviewed-by: Song Liu <song@kernel.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> # for kernel/trace/
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> # SCSI
Reviewed-by: Darrick J. Wong <djwong@kernel.org> # xfs
Acked-by: Jani Nikula <jani.nikula@intel.com>
Acked-by: Corey Minyard <cminyard@mvista.com>
Acked-by: Wei Liu <wei.liu@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Acked-by: Baoquan He <bhe@redhat.com>
Acked-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Acked-by: Anna Schumaker <anna.schumaker@oracle.com>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
- Fair scheduler (SCHED_FAIR) enhancements:
- Behavioral improvements:
- Untangle NEXT_BUDDY and pick_next_task() (Peter Zijlstra)
- Delayed-dequeue enhancements & fixes: (Vincent Guittot)
- Rename h_nr_running into h_nr_queued
- Add new cfs_rq.h_nr_runnable
- Use the new cfs_rq.h_nr_runnable
- Removed unsued cfs_rq.h_nr_delayed
- Rename cfs_rq.idle_h_nr_running into h_nr_idle
- Remove unused cfs_rq.idle_nr_running
- Rename cfs_rq.nr_running into nr_queued
- Do not try to migrate delayed dequeue task
- Fix variable declaration position
- Encapsulate set custom slice in a __setparam_fair() function
- Fixes:
- Fix race between yield_to() and try_to_wake_up() (Tianchen Ding)
- Fix CPU bandwidth limit bypass during CPU hotplug (Vishal Chourasia)
- Cleanups:
- Clean up in migrate_degrades_locality() to improve
readability (Peter Zijlstra)
- Mark m*_vruntime() with __maybe_unused (Andy Shevchenko)
- Update comments after sched_tick() rename (Sebastian Andrzej Siewior)
- Remove CONFIG_CFS_BANDWIDTH=n definition of cfs_bandwidth_used()
(Valentin Schneider)
- Deadline scheduler (SCHED_DL) enhancements:
- Restore dl_server bandwidth on non-destructive root domain
changes (Juri Lelli)
- Correctly account for allocated bandwidth during
hotplug (Juri Lelli)
- Check bandwidth overflow earlier for hotplug (Juri Lelli)
- Clean up goto label in pick_earliest_pushable_dl_task()
(John Stultz)
- Consolidate timer cancellation (Wander Lairson Costa)
- Load-balancer enhancements:
- Improve performance by prioritizing migrating eligible
tasks in sched_balance_rq() (Hao Jia)
- Do not compute NUMA Balancing stats unnecessarily during
load-balancing (K Prateek Nayak)
- Do not compute overloaded status unnecessarily during
load-balancing (K Prateek Nayak)
- Generic scheduling code enhancements:
- Use READ_ONCE() in task_on_rq_queued(), to consistently use
the WRITE_ONCE() updated ->on_rq field (Harshit Agarwal)
- Isolated CPUs support enhancements: (Waiman Long)
- Make "isolcpus=nohz" equivalent to "nohz_full"
- Consolidate housekeeping cpumasks that are always identical
- Remove HK_TYPE_SCHED
- Unify HK_TYPE_{TIMER|TICK|MISC} to HK_TYPE_KERNEL_NOISE
- RSEQ enhancements:
- Validate read-only fields under DEBUG_RSEQ config
(Mathieu Desnoyers)
- PSI enhancements:
- Fix race when task wakes up before psi_sched_switch()
adjusts flags (Chengming Zhou)
- IRQ time accounting performance enhancements: (Yafang Shao)
- Define sched_clock_irqtime as static key
- Don't account irq time if sched_clock_irqtime is disabled
- Virtual machine scheduling enhancements:
- Don't try to catch up excess steal time (Suleiman Souhlal)
- Heterogenous x86 CPU scheduling enhancements: (K Prateek Nayak)
- Convert "sysctl_sched_itmt_enabled" to boolean
- Use guard() for itmt_update_mutex
- Move the "sched_itmt_enabled" sysctl to debugfs
- Remove x86_smt_flags and use cpu_smt_flags directly
- Use x86_sched_itmt_flags for PKG domain unconditionally
- Debugging code & instrumentation enhancements:
- Change need_resched warnings to pr_err() (David Rientjes)
- Print domain name in /proc/schedstat (K Prateek Nayak)
- Fix value reported by hot tasks pulled in /proc/schedstat (Peter Zijlstra)
- Report the different kinds of imbalances in /proc/schedstat (Swapnil Sapkal)
- Move sched domain name out of CONFIG_SCHED_DEBUG (Swapnil Sapkal)
- Update Schedstat version to 17 (Swapnil Sapkal)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmePSRcRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1hrdBAAjYiLl5Q8SHM0xnl+kbvuUkCTgEB/gSgA
mfrZtHRUgRZuA89NZ9NljlCkQSlsLTOjnpNuaeFzs529GMg9iemc99dbnz3BP5F3
V5qpYvWe7yIkJ3hd0TOGLmYEPMNQaAW57YBOrxcPjWNLJ4cr9iMdccVA1OQtcmqD
ZUh3nibv81QI8HDmT2G+figxEIqH3yBV1+SmEIxbrdkQpIJ5702Ng6+0KQK5TShN
xwjFELWZUl2TfkoCc4nkIpkImV6cI1DvXSw1xK6gbb1xEVOrsmFW3TYFw4trKHBu
2RBG4wtmzNjh+12GmSdIBJHogPNcay+JIJW9EG/unT7jirqzkkeP1X2eJEbh+X1L
CMa7GsD9Vy72jCzeJDMuiy7bKfG/MiKUtDXrAZQDo2atbw7H88QOzMuTE5a5WSV+
tRxXGI/dgFVOk+JQUfctfJbYeXjmG8GAflawvXtGDAfDZsja6M+65fH8p0AOgW1E
HHmXUzAe2E2xQBiSok/DYHPQeCDBAjoJvU93YhGiXv8UScb2UaD4BAfzfmc8P+Zs
Eox6444ah5U0jiXmZ3HU707n1zO+Ql4qKoyyMJzSyP+oYHE/Do7NYTElw2QovVdN
FX/9Uae8T4ttA/5lFe7FNoXgKvSxXDKYyKLZcysjVrWJF866Ui/TWtmxA6w8Osn7
sfucuLawLPM=
=5ZNW
-----END PGP SIGNATURE-----
Merge tag 'sched-core-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"Fair scheduler (SCHED_FAIR) enhancements:
- Behavioral improvements:
- Untangle NEXT_BUDDY and pick_next_task() (Peter Zijlstra)
- Delayed-dequeue enhancements & fixes: (Vincent Guittot)
- Rename h_nr_running into h_nr_queued
- Add new cfs_rq.h_nr_runnable
- Use the new cfs_rq.h_nr_runnable
- Removed unsued cfs_rq.h_nr_delayed
- Rename cfs_rq.idle_h_nr_running into h_nr_idle
- Remove unused cfs_rq.idle_nr_running
- Rename cfs_rq.nr_running into nr_queued
- Do not try to migrate delayed dequeue task
- Fix variable declaration position
- Encapsulate set custom slice in a __setparam_fair() function
- Fixes:
- Fix race between yield_to() and try_to_wake_up() (Tianchen Ding)
- Fix CPU bandwidth limit bypass during CPU hotplug (Vishal
Chourasia)
- Cleanups:
- Clean up in migrate_degrades_locality() to improve readability
(Peter Zijlstra)
- Mark m*_vruntime() with __maybe_unused (Andy Shevchenko)
- Update comments after sched_tick() rename (Sebastian Andrzej
Siewior)
- Remove CONFIG_CFS_BANDWIDTH=n definition of cfs_bandwidth_used()
(Valentin Schneider)
Deadline scheduler (SCHED_DL) enhancements:
- Restore dl_server bandwidth on non-destructive root domain changes
(Juri Lelli)
- Correctly account for allocated bandwidth during hotplug (Juri
Lelli)
- Check bandwidth overflow earlier for hotplug (Juri Lelli)
- Clean up goto label in pick_earliest_pushable_dl_task() (John
Stultz)
- Consolidate timer cancellation (Wander Lairson Costa)
Load-balancer enhancements:
- Improve performance by prioritizing migrating eligible tasks in
sched_balance_rq() (Hao Jia)
- Do not compute NUMA Balancing stats unnecessarily during
load-balancing (K Prateek Nayak)
- Do not compute overloaded status unnecessarily during
load-balancing (K Prateek Nayak)
Generic scheduling code enhancements:
- Use READ_ONCE() in task_on_rq_queued(), to consistently use the
WRITE_ONCE() updated ->on_rq field (Harshit Agarwal)
Isolated CPUs support enhancements: (Waiman Long)
- Make "isolcpus=nohz" equivalent to "nohz_full"
- Consolidate housekeeping cpumasks that are always identical
- Remove HK_TYPE_SCHED
- Unify HK_TYPE_{TIMER|TICK|MISC} to HK_TYPE_KERNEL_NOISE
RSEQ enhancements:
- Validate read-only fields under DEBUG_RSEQ config (Mathieu
Desnoyers)
PSI enhancements:
- Fix race when task wakes up before psi_sched_switch() adjusts flags
(Chengming Zhou)
IRQ time accounting performance enhancements: (Yafang Shao)
- Define sched_clock_irqtime as static key
- Don't account irq time if sched_clock_irqtime is disabled
Virtual machine scheduling enhancements:
- Don't try to catch up excess steal time (Suleiman Souhlal)
Heterogenous x86 CPU scheduling enhancements: (K Prateek Nayak)
- Convert "sysctl_sched_itmt_enabled" to boolean
- Use guard() for itmt_update_mutex
- Move the "sched_itmt_enabled" sysctl to debugfs
- Remove x86_smt_flags and use cpu_smt_flags directly
- Use x86_sched_itmt_flags for PKG domain unconditionally
Debugging code & instrumentation enhancements:
- Change need_resched warnings to pr_err() (David Rientjes)
- Print domain name in /proc/schedstat (K Prateek Nayak)
- Fix value reported by hot tasks pulled in /proc/schedstat (Peter
Zijlstra)
- Report the different kinds of imbalances in /proc/schedstat
(Swapnil Sapkal)
- Move sched domain name out of CONFIG_SCHED_DEBUG (Swapnil Sapkal)
- Update Schedstat version to 17 (Swapnil Sapkal)"
* tag 'sched-core-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (48 commits)
rseq: Fix rseq unregistration regression
psi: Fix race when task wakes up before psi_sched_switch() adjusts flags
sched, psi: Don't account irq time if sched_clock_irqtime is disabled
sched: Don't account irq time if sched_clock_irqtime is disabled
sched: Define sched_clock_irqtime as static key
sched/fair: Do not compute overloaded status unnecessarily during lb
sched/fair: Do not compute NUMA Balancing stats unnecessarily during lb
x86/topology: Use x86_sched_itmt_flags for PKG domain unconditionally
x86/topology: Remove x86_smt_flags and use cpu_smt_flags directly
x86/itmt: Move the "sched_itmt_enabled" sysctl to debugfs
x86/itmt: Use guard() for itmt_update_mutex
x86/itmt: Convert "sysctl_sched_itmt_enabled" to boolean
sched/core: Prioritize migrating eligible tasks in sched_balance_rq()
sched/debug: Change need_resched warnings to pr_err
sched/fair: Encapsulate set custom slice in a __setparam_fair() function
sched: Fix race between yield_to() and try_to_wake_up()
docs: Update Schedstat version to 17
sched/stats: Print domain name in /proc/schedstat
sched: Move sched domain name out of CONFIG_SCHED_DEBUG
sched: Report the different kinds of imbalances in /proc/schedstat
...
set_delayed() adjusts cfs_rq->h_nr_runnable for the hierarchy when an
entity is delayed irrespective of whether the entity corresponds to a
task or a cfs_rq.
Consider the following scenario:
root
/ \
A B (*) delayed since B is no longer eligible on root
| |
Task0 Task1 <--- dequeue_task_fair() - task blocks
When Task1 blocks (dequeue_entity() for task's se returns true),
dequeue_entities() will continue adjusting cfs_rq->h_nr_* for the
hierarchy of Task1. However, when the sched_entity corresponding to
cfs_rq B is delayed, set_delayed() will adjust the h_nr_runnable for the
hierarchy too leading to both dequeue_entity() and set_delayed()
decrementing h_nr_runnable for the dequeue of the same task.
A SCHED_WARN_ON() to inspect h_nr_runnable post its update in
dequeue_entities() like below:
cfs_rq->h_nr_runnable -= h_nr_runnable;
SCHED_WARN_ON(((int) cfs_rq->h_nr_runnable) < 0);
is consistently tripped when running wakeup intensive workloads like
hackbench in a cgroup.
This error is self correcting since cfs_rq are per-cpu and cannot
migrate. The entitiy is either picked for full dequeue or is requeued
when a task wakes up below it. Both those paths call clear_delayed()
which again increments h_nr_runnable of the hierarchy without
considering if the entity corresponds to a task or not.
h_nr_runnable will eventually reflect the correct value however in the
interim, the incorrect values can still influence PELT calculation which
uses se->runnable_weight or cfs_rq->h_nr_runnable.
Since only delayed tasks take the early return path in
dequeue_entities() and enqueue_task_fair(), adjust the
h_nr_runnable in {set,clear}_delayed() only when a task is delayed as
this path skips the h_nr_* update loops and returns early.
For entities corresponding to cfs_rq, the h_nr_* update loop in the
caller will do the right thing.
Fixes: 76f2f78329 ("sched/eevdf: More PELT vs DELAYED_DEQUEUE")
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Tested-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
Link: https://lkml.kernel.org/r/20250117105852.23908-1-kprateek.nayak@amd.com
Only set sg_overloaded when computing sg_lb_stats() at the highest sched
domain since rd->overloaded status is updated only when load balancing
at the highest domain. While at it, move setting of sg_overloaded below
idle_cpu() check since an idle CPU can never be overloaded.
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Link: https://lore.kernel.org/r/20241223043407.1611-8-kprateek.nayak@amd.com
Aggregate nr_numa_running and nr_preferred_running when load balancing
at NUMA domains only. While at it, also move the aggregation below the
idle_cpu() check since an idle CPU cannot have any preferred tasks.
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20241223043407.1611-7-kprateek.nayak@amd.com
When the PLACE_LAG scheduling feature is enabled and
dst_cfs_rq->nr_queued is greater than 1, if a task is
ineligible (lag < 0) on the source cpu runqueue, it will
also be ineligible when it is migrated to the destination
cpu runqueue. Because we will keep the original equivalent
lag of the task in place_entity(). So if the task was
ineligible before, it will still be ineligible after
migration.
So in sched_balance_rq(), we prioritize migrating eligible
tasks, and we soft-limit ineligible tasks, allowing them
to migrate only when nr_balance_failed is non-zero to
avoid load-balancing trying very hard to balance the load.
Below are some benchmark test results. From my test results,
this patch shows a slight improvement on hackbench.
Benchmark
=========
All of the benchmarks are done inside a normal cpu cgroup in a
clean environment with cpu turbo disabled, and test machine is:
Single NUMA machine model is 13th Gen Intel(R) Core(TM)
i7-13700, 12 Core/24 HT.
Based on master b86545e02e.
Results
=======
hackbench-process-pipes
vanilla patched
Amean 1 0.5837 ( 0.00%) 0.5733 ( 1.77%)
Amean 4 1.4423 ( 0.00%) 1.4503 ( -0.55%)
Amean 7 2.5147 ( 0.00%) 2.4773 ( 1.48%)
Amean 12 3.9347 ( 0.00%) 3.8880 ( 1.19%)
Amean 21 5.3943 ( 0.00%) 5.3873 ( 0.13%)
Amean 30 6.7840 ( 0.00%) 6.6660 ( 1.74%)
Amean 48 9.8313 ( 0.00%) 9.6100 ( 2.25%)
Amean 79 15.4403 ( 0.00%) 14.9580 ( 3.12%)
Amean 96 18.4970 ( 0.00%) 17.9533 ( 2.94%)
hackbench-process-sockets
vanilla patched
Amean 1 0.6297 ( 0.00%) 0.6223 ( 1.16%)
Amean 4 2.1517 ( 0.00%) 2.0887 ( 2.93%)
Amean 7 3.6377 ( 0.00%) 3.5670 ( 1.94%)
Amean 12 6.1277 ( 0.00%) 5.9290 ( 3.24%)
Amean 21 10.0380 ( 0.00%) 9.7623 ( 2.75%)
Amean 30 14.1517 ( 0.00%) 13.7513 ( 2.83%)
Amean 48 24.7253 ( 0.00%) 24.2287 ( 2.01%)
Amean 79 43.9523 ( 0.00%) 43.2330 ( 1.64%)
Amean 96 54.5310 ( 0.00%) 53.7650 ( 1.40%)
tbench4 Throughput
vanilla patched
Hmean 1 255.97 ( 0.00%) 275.01 ( 7.44%)
Hmean 2 511.60 ( 0.00%) 544.27 ( 6.39%)
Hmean 4 996.70 ( 0.00%) 1006.57 ( 0.99%)
Hmean 8 1646.46 ( 0.00%) 1649.15 ( 0.16%)
Hmean 16 2259.42 ( 0.00%) 2274.35 ( 0.66%)
Hmean 32 4725.48 ( 0.00%) 4735.57 ( 0.21%)
Hmean 64 4411.47 ( 0.00%) 4400.05 ( -0.26%)
Hmean 96 4284.31 ( 0.00%) 4267.39 ( -0.39%)
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Hao Jia <jiahao1@lixiang.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20241223091446.90208-1-jiahao.kernel@gmail.com
Similarly to dl, create a __setparam_fair() function to set parameters
related to fair class and move it in the fair.c file.
No functional changes expected
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Link: https://lore.kernel.org/r/20250110144656.484601-1-vincent.guittot@linaro.org
Normally dequeue_entities() will continue to dequeue an empty group entity;
except DELAY_DEQUEUE changes things -- it retains empty entities such that they
might continue to compete and burn off some lag.
However, doing this results in update_cfs_group() re-computing the cgroup
weight 'slice' for an empty group, which it (rightly) figures isn't much at
all. This in turn means that the delayed entity is not competing at the
expected weight. Worse, the very low weight causes its lag to be inflated,
which combined with avg_vruntime() using scale_load_down(), leads to artifacts.
As such, don't adjust the weight for empty group entities and let them compete
at their original weight.
Fixes: 152e11f6df ("sched/fair: Implement delayed dequeue")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250110115720.GA17405@noisy.programming.kicks-ass.net
I noticed this in my traces today:
turbostat-1222 [006] d..2. 311.935649: reweight_entity: (ffff888108f13e00-ffff88885ef38440-6)
{ weight: 1048576 avg_vruntime: 3184159639071 vruntime: 3184159640194 (-1123) deadline: 3184162621107 } ->
{ weight: 2 avg_vruntime: 3184177463330 vruntime: 3184748414495 (-570951165) deadline: 4747605329439 }
turbostat-1222 [006] d..2. 311.935651: reweight_entity: (ffff888108f13e00-ffff88885ef38440-6)
{ weight: 2 avg_vruntime: 3184177463330 vruntime: 3184748414495 (-570951165) deadline: 4747605329439 } ->
{ weight: 1048576 avg_vruntime: 3184176414812 vruntime: 3184177464419 (-1049607) deadline: 3184180445332 }
Which is a weight transition: 1048576 -> 2 -> 1048576.
One would expect the lag to shoot out *AND* come back, notably:
-1123*1048576/2 = -588775424
-588775424*2/1048576 = -1123
Except the trace shows it is all off. Worse, subsequent cycles shoot it
out further and further.
This made me have a very hard look at reweight_entity(), and
specifically the ->on_rq case, which is more prominent with
DELAY_DEQUEUE.
And indeed, it is all sorts of broken. While the computation of the new
lag is correct, the computation for the new vruntime, using the new lag
is broken for it does not consider the logic set out in place_entity().
With the below patch, I now see things like:
migration/12-55 [012] d..3. 309.006650: reweight_entity: (ffff8881e0e6f600-ffff88885f235f40-12)
{ weight: 977582 avg_vruntime: 4860513347366 vruntime: 4860513347908 (-542) deadline: 4860516552475 } ->
{ weight: 2 avg_vruntime: 4860528915984 vruntime: 4860793840706 (-264924722) deadline: 6427157349203 }
migration/14-62 [014] d..3. 309.006698: reweight_entity: (ffff8881e0e6cc00-ffff88885f3b5f40-15)
{ weight: 2 avg_vruntime: 4874472992283 vruntime: 4939833828823 (-65360836540) deadline: 6316614641111 } ->
{ weight: 967149 avg_vruntime: 4874217684324 vruntime: 4874217688559 (-4235) deadline: 4874220535650 }
Which isn't perfect yet, but much closer.
Reported-by: Doug Smythies <dsmythies@telus.net>
Reported-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fixes: eab03c23c2 ("sched/eevdf: Fix vruntime adjustment on reweight")
Link: https://lore.kernel.org/r/20250109105959.GA2981@noisy.programming.kicks-ass.net