Similar to sk/inode/task storage, implement similar cgroup local storage.
There already exists a local storage implementation for cgroup-attached
bpf programs. See map type BPF_MAP_TYPE_CGROUP_STORAGE and helper
bpf_get_local_storage(). But there are use cases such that non-cgroup
attached bpf progs wants to access cgroup local storage data. For example,
tc egress prog has access to sk and cgroup. It is possible to use
sk local storage to emulate cgroup local storage by storing data in socket.
But this is a waste as it could be lots of sockets belonging to a particular
cgroup. Alternatively, a separate map can be created with cgroup id as the key.
But this will introduce additional overhead to manipulate the new map.
A cgroup local storage, similar to existing sk/inode/task storage,
should help for this use case.
The life-cycle of storage is managed with the life-cycle of the
cgroup struct. i.e. the storage is destroyed along with the owning cgroup
with a call to bpf_cgrp_storage_free() when cgroup itself
is deleted.
The userspace map operations can be done by using a cgroup fd as a key
passed to the lookup, update and delete operations.
Typically, the following code is used to get the current cgroup:
struct task_struct *task = bpf_get_current_task_btf();
... task->cgroups->dfl_cgrp ...
and in structure task_struct definition:
struct task_struct {
....
struct css_set __rcu *cgroups;
....
}
With sleepable program, accessing task->cgroups is not protected by rcu_read_lock.
So the current implementation only supports non-sleepable program and supporting
sleepable program will be the next step together with adding rcu_read_lock
protection for rcu tagged structures.
Since map name BPF_MAP_TYPE_CGROUP_STORAGE has been used for old cgroup local
storage support, the new map name BPF_MAP_TYPE_CGRP_STORAGE is used
for cgroup storage available to non-cgroup-attached bpf programs. The old
cgroup storage supports bpf_get_local_storage() helper to get the cgroup data.
The new cgroup storage helper bpf_cgrp_storage_get() can provide similar
functionality. While old cgroup storage pre-allocates storage memory, the new
mechanism can also pre-allocate with a user space bpf_map_update_elem() call
to avoid potential run-time memory allocation failure.
Therefore, the new cgroup storage can provide all functionality w.r.t.
the old one. So in uapi bpf.h, the old BPF_MAP_TYPE_CGROUP_STORAGE is alias to
BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED to indicate the old cgroup storage can
be deprecated since the new one can provide the same functionality.
Acked-by: David Vernet <void@manifault.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20221026042850.673791-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* Fix a recent regression where a sleeping kernfs function is called with
css_set_lock (spinlock) held.
* Revert the commit to enable cgroup1 support for cgroup_get_from_fd/file().
Multiple users assume that the lookup only works for cgroup2 and breaks
when fed a cgroup1 file. Instead, introduce a separate set of functions to
lookup both v1 and v2 and use them where the user explicitly wants to
support both versions.
* Compat update for tools/perf/util/bpf_skel/bperf_cgroup.bpf.c.
* Add Josef Bacik as a blkcg maintainer.
-----BEGIN PGP SIGNATURE-----
iIQEABYIACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCY03MlA4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGTkUAQD7fNcSPuc2m/BvW+gySKQkp9PZMA6E6yOIqirc
QKmIgwEAwWECW7RR1alhOGD50RtYkuYVsLD1+6Ka4eMHe+EhwA4=
=kGLI
-----END PGP SIGNATURE-----
Merge tag 'cgroup-for-6.1-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
- Fix a recent regression where a sleeping kernfs function is called
with css_set_lock (spinlock) held
- Revert the commit to enable cgroup1 support for cgroup_get_from_fd/file()
Multiple users assume that the lookup only works for cgroup2 and
breaks when fed a cgroup1 file. Instead, introduce a separate set of
functions to lookup both v1 and v2 and use them where the user
explicitly wants to support both versions.
- Compat update for tools/perf/util/bpf_skel/bperf_cgroup.bpf.c.
- Add Josef Bacik as a blkcg maintainer.
* tag 'cgroup-for-6.1-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
blkcg: Update MAINTAINERS entry
mm: cgroup: fix comments for get from fd/file helpers
perf stat: Support old kernels for bperf cgroup counting
bpf: cgroup_iter: support cgroup1 using cgroup fd
cgroup: add cgroup_v1v2_get_from_[fd/file]()
Revert "cgroup: enable cgroup_get_from_file() on cgroup1"
cgroup: Reorganize css_set_lock and kernfs path processing
- Various performance optimizations, resulting in a 4%-9% speedup
in the mmtests/config-scheduler-perfpipe micro-benchmark.
- New interface to turn PSI on/off on a per cgroup level.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmNJKPsRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1iPmg//aovCitAQX2lLoHJDIgdQibU40oaEpKTX
wM549EGz3Dr6qmwF8+qT1U2Ge6af/hHQc5G/ZqDpKbuTjUIc3RmBkqX80dNKFLuH
uyi9UtfsSriw+ks8fWuDdjr+S4oppwW9ZoIXvK8v4bisd3F31DNGvKPTayNxt73m
lExfzJiD1oJixDxGX8MGO9QpcoywmjWjzjrB2P+J8hnTpArouHx/HOKdQOpG6wXq
ZRr9kZvju6ucDpXCTa1HJrfVRxNAh35tx/b4cDtXbBFifVAeKaPOrHapMTVsqfel
Z7T+2DymhidNYK0hrRJoGUwa/vkz+2Sm1ZLG9LlgUCXVco/9S1zw1ZuQakVvzPen
wriuxRaAkR+szCP0L8js5+/DAkGa43MjKsvQHmDVnetQtlsAD4eYnn+alQ837SXv
MP3jwFqF+e4mcWdoQcfh0OWUgGec5XZzdgRYrFkBKyTWGLB2iPivcAMNf0X/h82Q
xxv4DQJIIJ017GOQ/ho2saq+GbtFCvX8YnGYas9T47Bjjluhjo7jgTVtPTo+mhtN
RfwMdG718Ap/gvnAX7wMe/t+L/4AP8AIgDRi5L35dTRqETwOjH+LAvOYjleQFYgu
kMVtLMyzU+TGwHscuzPFRh7TnvSJ4sD48Ll1BPnyZsh3SS9u0gAs1bml7Cu7JbmW
SIZD/S/hzdI=
=91tB
-----END PGP SIGNATURE-----
Merge tag 'sched-psi-2022-10-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull PSI updates from Ingo Molnar:
- Various performance optimizations, resulting in a 4%-9% speedup in
the mmtests/config-scheduler-perfpipe micro-benchmark.
- New interface to turn PSI on/off on a per cgroup level.
* tag 'sched-psi-2022-10-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/psi: Per-cgroup PSI accounting disable/re-enable interface
sched/psi: Cache parent psi_group to speed up group iteration
sched/psi: Consolidate cgroup_psi()
sched/psi: Add PSI_IRQ to track IRQ/SOFTIRQ pressure
sched/psi: Remove NR_ONCPU task accounting
sched/psi: Optimize task switch inside shared cgroups again
sched/psi: Move private helpers to sched/stats.h
sched/psi: Save percpu memory when !psi_cgroups_enabled
sched/psi: Don't create cgroup PSI files when psi_disabled
sched/psi: Fix periodic aggregation shut off
Fix the documentation comments for cgroup_[v1v2_]get_from_[fd/file]().
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Add cgroup_v1v2_get_from_fd() and cgroup_v1v2_get_from_file() that
support both cgroup1 and cgroup2.
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
linux-next for a couple of months without, to my knowledge, any negative
reports (or any positive ones, come to that).
- Also the Maple Tree from Liam R. Howlett. An overlapping range-based
tree for vmas. It it apparently slight more efficient in its own right,
but is mainly targeted at enabling work to reduce mmap_lock contention.
Liam has identified a number of other tree users in the kernel which
could be beneficially onverted to mapletrees.
Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat
(https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com).
This has yet to be addressed due to Liam's unfortunately timed
vacation. He is now back and we'll get this fixed up.
- Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses
clang-generated instrumentation to detect used-unintialized bugs down to
the single bit level.
KMSAN keeps finding bugs. New ones, as well as the legacy ones.
- Yang Shi adds a userspace mechanism (madvise) to induce a collapse of
memory into THPs.
- Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to support
file/shmem-backed pages.
- userfaultfd updates from Axel Rasmussen
- zsmalloc cleanups from Alexey Romanov
- cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and memory-failure
- Huang Ying adds enhancements to NUMA balancing memory tiering mode's
page promotion, with a new way of detecting hot pages.
- memcg updates from Shakeel Butt: charging optimizations and reduced
memory consumption.
- memcg cleanups from Kairui Song.
- memcg fixes and cleanups from Johannes Weiner.
- Vishal Moola provides more folio conversions
- Zhang Yi removed ll_rw_block() :(
- migration enhancements from Peter Xu
- migration error-path bugfixes from Huang Ying
- Aneesh Kumar added ability for a device driver to alter the memory
tiering promotion paths. For optimizations by PMEM drivers, DRM
drivers, etc.
- vma merging improvements from Jakub Matěn.
- NUMA hinting cleanups from David Hildenbrand.
- xu xin added aditional userspace visibility into KSM merging activity.
- THP & KSM code consolidation from Qi Zheng.
- more folio work from Matthew Wilcox.
- KASAN updates from Andrey Konovalov.
- DAMON cleanups from Kaixu Xia.
- DAMON work from SeongJae Park: fixes, cleanups.
- hugetlb sysfs cleanups from Muchun Song.
- Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY0HaPgAKCRDdBJ7gKXxA
joPjAQDZ5LlRCMWZ1oxLP2NOTp6nm63q9PWcGnmY50FjD/dNlwEAnx7OejCLWGWf
bbTuk6U2+TKgJa4X7+pbbejeoqnt5QU=
=xfWx
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- Yu Zhao's Multi-Gen LRU patches are here. They've been under test in
linux-next for a couple of months without, to my knowledge, any
negative reports (or any positive ones, come to that).
- Also the Maple Tree from Liam Howlett. An overlapping range-based
tree for vmas. It it apparently slightly more efficient in its own
right, but is mainly targeted at enabling work to reduce mmap_lock
contention.
Liam has identified a number of other tree users in the kernel which
could be beneficially onverted to mapletrees.
Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat
at [1]. This has yet to be addressed due to Liam's unfortunately
timed vacation. He is now back and we'll get this fixed up.
- Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses
clang-generated instrumentation to detect used-unintialized bugs down
to the single bit level.
KMSAN keeps finding bugs. New ones, as well as the legacy ones.
- Yang Shi adds a userspace mechanism (madvise) to induce a collapse of
memory into THPs.
- Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to
support file/shmem-backed pages.
- userfaultfd updates from Axel Rasmussen
- zsmalloc cleanups from Alexey Romanov
- cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and
memory-failure
- Huang Ying adds enhancements to NUMA balancing memory tiering mode's
page promotion, with a new way of detecting hot pages.
- memcg updates from Shakeel Butt: charging optimizations and reduced
memory consumption.
- memcg cleanups from Kairui Song.
- memcg fixes and cleanups from Johannes Weiner.
- Vishal Moola provides more folio conversions
- Zhang Yi removed ll_rw_block() :(
- migration enhancements from Peter Xu
- migration error-path bugfixes from Huang Ying
- Aneesh Kumar added ability for a device driver to alter the memory
tiering promotion paths. For optimizations by PMEM drivers, DRM
drivers, etc.
- vma merging improvements from Jakub Matěn.
- NUMA hinting cleanups from David Hildenbrand.
- xu xin added aditional userspace visibility into KSM merging
activity.
- THP & KSM code consolidation from Qi Zheng.
- more folio work from Matthew Wilcox.
- KASAN updates from Andrey Konovalov.
- DAMON cleanups from Kaixu Xia.
- DAMON work from SeongJae Park: fixes, cleanups.
- hugetlb sysfs cleanups from Muchun Song.
- Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core.
Link: https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com [1]
* tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (555 commits)
hugetlb: allocate vma lock for all sharable vmas
hugetlb: take hugetlb vma_lock when clearing vma_lock->vma pointer
hugetlb: fix vma lock handling during split vma and range unmapping
mglru: mm/vmscan.c: fix imprecise comments
mm/mglru: don't sync disk for each aging cycle
mm: memcontrol: drop dead CONFIG_MEMCG_SWAP config symbol
mm: memcontrol: use do_memsw_account() in a few more places
mm: memcontrol: deprecate swapaccounting=0 mode
mm: memcontrol: don't allocate cgroup swap arrays when memcg is disabled
mm/secretmem: remove reduntant return value
mm/hugetlb: add available_huge_pages() func
mm: remove unused inline functions from include/linux/mm_inline.h
selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory
selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared pmd
selftests/vm: add thp collapse shmem testing
selftests/vm: add thp collapse file and tmpfs testing
selftests/vm: modularize thp collapse memory operations
selftests/vm: dedup THP helpers
mm/khugepaged: add tracepoint to hpage_collapse_scan_file()
mm/madvise: add file and shmem support to MADV_COLLAPSE
...
This reverts commit f3a2aebdd6.
The commit enabled looking up v1 cgroups via cgroup_get_from_file().
However, there are multiple users, including CLONE_INTO_CGROUP, which have
been assuming that it would only look up v2 cgroups. Returning v1 cgroups
breaks them.
Let's revert the commit and retry later with a separate lookup interface
which allows both v1 and v2.
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/000000000000385cbf05ea3f1862@google.com
Cc: Yosry Ahmed <yosryahmed@google.com>
The commit 74e4b956eb incorrectly wrapped kernfs_walk_and_get
(might_sleep) under css_set_lock (spinlock). css_set_lock is needed by
__cset_cgroup_from_root to ensure stable cset->cgrp_links but not for
kernfs_walk_and_get.
We only need to make sure that the returned root_cgrp won't be freed
under us. This is given in the case of global root because it is static
(cgrp_dfl_root.cgrp). When the root_cgrp is lower in the hierarchy, it
is pinned by cgroup_ns->root_cset (and `current` task cannot switch
namespace asynchronously so ns_proxy pins cgroup_ns).
Note this reasoning won't hold for root cgroups in v1 hierarchies,
therefore create a special-cased helper function just for the default
hierarchy.
Fixes: 74e4b956eb ("cgroup: Honor caller's cgroup NS when resolving path")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
* cpuset now support isolated cpus.partition type, which will enable dynamic
CPU isolation.
* pids.peak added to remember the max number of pids used.
* Holes in cgroup namespace plugged.
* Internal cleanups.
Note that for-6.1-fixes was pulled into for-6.1 twice. Both were for
follow-up cleanups and each merge commit has details.
Also, 8a693f7766 ("cgroup: Remove CFTYPE_PRESSURE") removes the flag used
by PSI changes in the tip tree and the merged result won't compile due to
the missing flag. Simply removing the struct init lines specifying the flag
is the correct resolution. linux-next already contains the correct fix:
https://lkml.kernel.org/r/20220912161812.072aaa3b@canb.auug.org.au
-----BEGIN PGP SIGNATURE-----
iIQEABYIACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCYzsl7w4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGYsxAP4kad4YPw+CueLyyEMiYgBHouqDt8cG0+FJWK3X
svTC7wD/eCLfxZM8TjjSrMmvaMrml586mr3NoQaFeW0x3twptQQ=
=LERu
-----END PGP SIGNATURE-----
Merge tag 'cgroup-for-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
- cpuset now support isolated cpus.partition type, which will enable
dynamic CPU isolation
- pids.peak added to remember the max number of pids used
- holes in cgroup namespace plugged
- internal cleanups
* tag 'cgroup-for-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (25 commits)
cgroup: use strscpy() is more robust and safer
iocost_monitor: reorder BlkgIterator
cgroup: simplify code in cgroup_apply_control
cgroup: Make cgroup_get_from_id() prettier
cgroup/cpuset: remove unreachable code
cgroup: Remove CFTYPE_PRESSURE
cgroup: Improve cftype add/rm error handling
kselftest/cgroup: Add cpuset v2 partition root state test
cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst
cgroup/cpuset: Make partition invalid if cpumask change violates exclusivity rule
cgroup/cpuset: Relocate a code block in validate_change()
cgroup/cpuset: Show invalid partition reason string
cgroup/cpuset: Add a new isolated cpus.partition type
cgroup/cpuset: Relax constraints to partition & cpus changes
cgroup/cpuset: Allow no-task partition to have empty cpuset.cpus.effective
cgroup/cpuset: Miscellaneous cleanups & add helper functions
cgroup/cpuset: Enable update_tasks_cpumask() on top_cpuset
cgroup: add pids.peak interface for pids controller
cgroup: Remove data-race around cgrp_dfl_visible
cgroup: Fix build failure when CONFIG_SHRINKER_DEBUG
...
- Debuggability:
- Change most occurances of BUG_ON() to WARN_ON_ONCE()
- Reorganize & fix TASK_ state comparisons, turn it into a bitmap
- Update/fix misc scheduler debugging facilities
- Load-balancing & regular scheduling:
- Improve the behavior of the scheduler in presence of lot of
SCHED_IDLE tasks - in particular they should not impact other
scheduling classes.
- Optimize task load tracking, cleanups & fixes
- Clean up & simplify misc load-balancing code
- Freezer:
- Rewrite the core freezer to behave better wrt thawing and be simpler
in general, by replacing PF_FROZEN with TASK_FROZEN & fixing/adjusting
all the fallout.
- Deadline scheduler:
- Fix the DL capacity-aware code
- Factor out dl_task_is_earliest_deadline() & replenish_dl_new_period()
- Relax/optimize locking in task_non_contending()
- Cleanups:
- Factor out the update_current_exec_runtime() helper
- Various cleanups, simplifications
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmM/01cRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1geZA/+PB4KC1T9aVxzaTHI36R03YgJYZmIdtxw
wTf02MixePmz+gQCbepJbempGOh5ST28aOcI0xhdYOql5B63MaUBBMlB0HvGUyDG
IU3zETqLMRtAbnSTdQFv8m++ECUtZYp8/x1FCel4WO7ya4ETkRu1NRfCoUepEhpZ
aVAlae9LH3NBaF9t7s0PT2lTjf3pIzMFRkddJ0ywJhbFR3VnWat05fAK+J6fGY8+
LS54coefNlJD4oDh5TY8uniL1j5SmWmmwbk9Cdj7bLU5P3dFSS0/+5FJNHJPVGDE
srGT7wstRUcDrN0CnZo48VIUBiApJCCDqTfJYi9wNYd0NAHvwY6MIJJgEIY8mKsI
L/qH26H81Wt+ezSZ/5JIlGlZ/LIeNaa6OO/fbWEYABBQogvvx3nxsRNUYKSQzumH
CnSBasBjLnjWyLlK4qARM9cI7NFSEK6NUigrEx/7h8JFu/8T4DlSy6LsF1HUyKgq
4+FJLAqG6cL0tcwB/fHYd0oRESN8dStnQhGxSojgufwLc7dlFULvCYF5JM/dX+/V
IKwbOfIOeOn6ViMtSOXAEGdII+IQ2/ZFPwr+8Z5JC7NzvTVL6xlu/3JXkLZR3L7o
yaXTSaz06h1vil7Z+GRf7RHc+wUeGkEpXh5vnarGZKXivhFdWsBdROIJANK+xR0i
TeSLCxQxXlU=
=KjMD
-----END PGP SIGNATURE-----
Merge tag 'sched-core-2022-10-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"Debuggability:
- Change most occurances of BUG_ON() to WARN_ON_ONCE()
- Reorganize & fix TASK_ state comparisons, turn it into a bitmap
- Update/fix misc scheduler debugging facilities
Load-balancing & regular scheduling:
- Improve the behavior of the scheduler in presence of lot of
SCHED_IDLE tasks - in particular they should not impact other
scheduling classes.
- Optimize task load tracking, cleanups & fixes
- Clean up & simplify misc load-balancing code
Freezer:
- Rewrite the core freezer to behave better wrt thawing and be
simpler in general, by replacing PF_FROZEN with TASK_FROZEN &
fixing/adjusting all the fallout.
Deadline scheduler:
- Fix the DL capacity-aware code
- Factor out dl_task_is_earliest_deadline() &
replenish_dl_new_period()
- Relax/optimize locking in task_non_contending()
Cleanups:
- Factor out the update_current_exec_runtime() helper
- Various cleanups, simplifications"
* tag 'sched-core-2022-10-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (41 commits)
sched: Fix more TASK_state comparisons
sched: Fix TASK_state comparisons
sched/fair: Move call to list_last_entry() in detach_tasks
sched/fair: Cleanup loop_max and loop_break
sched/fair: Make sure to try to detach at least one movable task
sched: Show PF_flag holes
freezer,sched: Rewrite core freezer logic
sched: Widen TAKS_state literals
sched/wait: Add wait_event_state()
sched/completion: Add wait_for_completion_state()
sched: Add TASK_ANY for wait_task_inactive()
sched: Change wait_task_inactive()s match_state
freezer,umh: Clean up freezer/initrd interaction
freezer: Have {,un}lock_system_sleep() save/restore flags
sched: Rename task_running() to task_on_cpu()
sched/fair: Cleanup for SIS_PROP
sched/fair: Default to false in test_idle_cores()
sched/fair: Remove useless check in select_idle_core()
sched/fair: Avoid double search on same cpu
sched/fair: Remove redundant check in select_idle_smt()
...
Here is the big set of driver core and debug printk changes for 6.1-rc1.
Included in here is:
- dynamic debug updates for the core and the drm subsystem. The
drm changes have all been acked by the relevant maintainers.
- kernfs fixes for syzbot reported problems
- kernfs refactors and updates for cgroup requirements
- magic number cleanups and removals from the kernel tree (they
were not being used and they really did not actually do
anything.)
- other tiny cleanups
All of these have been in linux-next for a while with no reported
issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCY0BYUA8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ylozwCdFRlcghaf7XBUyNgRZRwMC+oQI8EAn1G/nEDE
6aFd2er41uK0IGQnSmYO
=OK0k
-----END PGP SIGNATURE-----
Merge tag 'driver-core-6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here is the big set of driver core and debug printk changes for
6.1-rc1. Included in here is:
- dynamic debug updates for the core and the drm subsystem. The drm
changes have all been acked by the relevant maintainers
- kernfs fixes for syzbot reported problems
- kernfs refactors and updates for cgroup requirements
- magic number cleanups and removals from the kernel tree (they were
not being used and they really did not actually do anything)
- other tiny cleanups
All of these have been in linux-next for a while with no reported
issues"
* tag 'driver-core-6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (74 commits)
docs: filesystems: sysfs: Make text and code for ->show() consistent
Documentation: NBD_REQUEST_MAGIC isn't a magic number
a.out: restore CMAGIC
device property: Add const qualifier to device_get_match_data() parameter
drm_print: add _ddebug descriptor to drm_*dbg prototypes
drm_print: prefer bare printk KERN_DEBUG on generic fn
drm_print: optimize drm_debug_enabled for jump-label
drm-print: add drm_dbg_driver to improve namespace symmetry
drm-print.h: include dyndbg header
drm_print: wrap drm_*_dbg in dyndbg descriptor factory macro
drm_print: interpose drm_*dbg with forwarding macros
drm: POC drm on dyndbg - use in core, 2 helpers, 3 drivers.
drm_print: condense enum drm_debug_category
debugfs: use DEFINE_SHOW_ATTRIBUTE to define debugfs_regset32_fops
driver core: use IS_ERR_OR_NULL() helper in device_create_groups_vargs()
Documentation: ENI155_MAGIC isn't a magic number
Documentation: NBD_REPLY_MAGIC isn't a magic number
nbd: remove define-only NBD_MAGIC, previously magic number
Documentation: FW_HEADER_MAGIC isn't a magic number
Documentation: EEPROM_MAGIC_VALUE isn't a magic number
...
Add /sys/kernel/mm/lru_gen/enabled as a kill switch. Components that
can be disabled include:
0x0001: the multi-gen LRU core
0x0002: walking page table, when arch_has_hw_pte_young() returns
true
0x0004: clearing the accessed bit in non-leaf PMD entries, when
CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y
[yYnN]: apply to all the components above
E.g.,
echo y >/sys/kernel/mm/lru_gen/enabled
cat /sys/kernel/mm/lru_gen/enabled
0x0007
echo 5 >/sys/kernel/mm/lru_gen/enabled
cat /sys/kernel/mm/lru_gen/enabled
0x0005
NB: the page table walks happen on the scale of seconds under heavy memory
pressure, in which case the mmap_lock contention is a lesser concern,
compared with the LRU lock contention and the I/O congestion. So far the
only well-known case of the mmap_lock contention happens on Android, due
to Scudo [1] which allocates several thousand VMAs for merely a few
hundred MBs. The SPF and the Maple Tree also have provided their own
assessments [2][3]. However, if walking page tables does worsen the
mmap_lock contention, the kill switch can be used to disable it. In this
case the multi-gen LRU will suffer a minor performance degradation, as
shown previously.
Clearing the accessed bit in non-leaf PMD entries can also be disabled,
since this behavior was not tested on x86 varieties other than Intel and
AMD.
[1] https://source.android.com/devices/tech/debug/scudo
[2] https://lore.kernel.org/r/20220128131006.67712-1-michel@lespinasse.org/
[3] https://lore.kernel.org/r/20220426150616.3937571-1-Liam.Howlett@oracle.com/
Link: https://lkml.kernel.org/r/20220918080010.2920238-11-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The implementation of strscpy() is more robust and safer.
That's now the recommended way to copy NUL terminated strings.
Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn>
Signed-off-by: Tejun Heo <tj@kernel.org>
It could directly return 'cgroup_update_dfl_csses' to simplify code.
Signed-off-by: William Dean <williamsukatube@163.com>
Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
After merging 836ac87d ("cgroup: fix cgroup_get_from_id") into for-6.1, its
combination with two commits in for-6.1 - 4534dee9 ("cgroup: cgroup: Honor
caller's cgroup NS when resolving cgroup id") and fa7e439c ("cgroup:
Homogenize cgroup_get_from_id() return value") - makes the gotos in the
error handling path too ugly while not adding anything of value.
All that the gotos are saving is one extra kernfs_put() call. Let's remove
the gotos and perform error returns directly.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Michal Koutný <mkoutny@suse.com>
for-6.0 has the following fix for cgroup_get_from_id().
836ac87d ("cgroup: fix cgroup_get_from_id")
which conflicts with the following two commits in for-6.1.
4534dee9 ("cgroup: cgroup: Honor caller's cgroup NS when resolving cgroup id")
fa7e439c ("cgroup: Homogenize cgroup_get_from_id() return value")
While the resolution is straightforward, the code ends up pretty ugly
afterwards. Let's pull for-6.0-fixes into for-6.1 so that the code can be
fixed up there.
Signed-off-by: Tejun Heo <tj@kernel.org>
cgroup has to be one kernfs dir, otherwise kernel panic is caused,
especially cgroup id is provide from userspace.
Reported-by: Marco Patalano <mpatalan@redhat.com>
Fixes: 6b658c4863 ("scsi: cgroup: Add cgroup_get_from_id()")
Cc: Muneendra <muneendra.kumar@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Acked-by: Mukesh Ojha <quic_mojha@quicinc.com>
Cc: stable@vger.kernel.org # v5.14+
Signed-off-by: Tejun Heo <tj@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmMeQ2keHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGYRMH+gLNHiGirGZlm2GQ
tKaZQUy7MiXuIP0hGDonDIIIAmIVhnjm9MDG8KT4W8AvEd7ukncyYqJfwWeWQPhP
4mZcf6l3Z8Ke+qiaFpXpMPCxTyWcln1ox0EoNx2g9gdPxZntaRuuaTQVljUfTiey
aVPHxve8ip3G7jDoJnuLSxESOqWxkb8v/SshBP1E5bF5BZ+cgZRqq7FNigFqxjbk
wF29K09BVOPjdgkSvY/b0/SnL5KlSdMAv+FrPcJNGivcdIPgf/qJks5cI2HRUo7o
CpKgbcLorCVyD+d+zLonJBwIy3arbmKD8JqYnfdTSIqVOUqHXWUDfeydsH32u1Gu
lPSI2Hw=
=7LTL
-----END PGP SIGNATURE-----
Merge 6.0-rc5 into driver-core-next
We need the driver core and debugfs changes in this branch.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
PSI accounts stalls for each cgroup separately and aggregates it
at each level of the hierarchy. This may cause non-negligible overhead
for some workloads when under deep level of the hierarchy.
commit 3958e2d0c3 ("cgroup: make per-cgroup pressure stall tracking configurable")
make PSI to skip per-cgroup stall accounting, only account system-wide
to avoid this each level overhead.
But for our use case, we also want leaf cgroup PSI stats accounted for
userspace adjustment on that cgroup, apart from only system-wide adjustment.
So this patch introduce a per-cgroup PSI accounting disable/re-enable
interface "cgroup.pressure", which is a read-write single value file that
allowed values are "0" and "1", the defaults is "1" so per-cgroup
PSI stats is enabled by default.
Implementation details:
It should be relatively straight-forward to disable and re-enable
state aggregation, time tracking, averaging on a per-cgroup level,
if we can live with losing history from while it was disabled.
I.e. the avgs will restart from 0, total= will have gaps.
But it's hard or complex to stop/restart groupc->tasks[] updates,
which is not implemented in this patch. So we always update
groupc->tasks[] and PSI_ONCPU bit in psi_group_change() even when
the cgroup PSI stats is disabled.
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/20220907090332.2078-1-zhouchengming@bytedance.com
cgroup_psi() can't return psi_group for root cgroup, so we have many
open code "psi = cgroup_ino(cgrp) == 1 ? &psi_system : cgrp->psi".
This patch move cgroup_psi() definition to <linux/psi.h>, in which
we can return psi_system for root cgroup, so can handle all cgroups.
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/r/20220825164111.29534-9-zhouchengming@bytedance.com
Now PSI already tracked workload pressure stall information for
CPU, memory and IO. Apart from these, IRQ/SOFTIRQ could have
obvious impact on some workload productivity, such as web service
workload.
When CONFIG_IRQ_TIME_ACCOUNTING, we can get IRQ/SOFTIRQ delta time
from update_rq_clock_task(), in which we can record that delta
to CPU curr task's cgroups as PSI_IRQ_FULL status.
Note we don't use PSI_IRQ_SOME since IRQ/SOFTIRQ always happen in
the current task on the CPU, make nothing productive could run
even if it were runnable, so we only use PSI_IRQ_FULL.
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/r/20220825164111.29534-8-zhouchengming@bytedance.com
commit 3958e2d0c3 ("cgroup: make per-cgroup pressure stall tracking configurable")
make PSI can be configured to skip per-cgroup stall accounting. And
doesn't expose PSI files in cgroup hierarchy.
This patch do the same thing when psi_disabled.
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/r/20220825164111.29534-3-zhouchengming@bytedance.com
Rewrite the core freezer to behave better wrt thawing and be simpler
in general.
By replacing PF_FROZEN with TASK_FROZEN, a special block state, it is
ensured frozen tasks stay frozen until thawed and don't randomly wake
up early, as is currently possible.
As such, it does away with PF_FROZEN and PF_FREEZER_SKIP, freeing up
two PF_flags (yay!).
Specifically; the current scheme works a little like:
freezer_do_not_count();
schedule();
freezer_count();
And either the task is blocked, or it lands in try_to_freezer()
through freezer_count(). Now, when it is blocked, the freezer
considers it frozen and continues.
However, on thawing, once pm_freezing is cleared, freezer_count()
stops working, and any random/spurious wakeup will let a task run
before its time.
That is, thawing tries to thaw things in explicit order; kernel
threads and workqueues before doing bringing SMP back before userspace
etc.. However due to the above mentioned races it is entirely possible
for userspace tasks to thaw (by accident) before SMP is back.
This can be a fatal problem in asymmetric ISA architectures (eg ARMv9)
where the userspace task requires a special CPU to run.
As said; replace this with a special task state TASK_FROZEN and add
the following state transitions:
TASK_FREEZABLE -> TASK_FROZEN
__TASK_STOPPED -> TASK_FROZEN
__TASK_TRACED -> TASK_FROZEN
The new TASK_FREEZABLE can be set on any state part of TASK_NORMAL
(IOW. TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE) -- any such state
is already required to deal with spurious wakeups and the freezer
causes one such when thawing the task (since the original state is
lost).
The special __TASK_{STOPPED,TRACED} states *can* be restored since
their canonical state is in ->jobctl.
With this, frozen tasks need an explicit TASK_FROZEN wakeup and are
free of undue (early / spurious) wakeups.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lore.kernel.org/r/20220822114649.055452969@infradead.org
Daniel Borkmann says:
====================
pull-request: bpf-next 2022-09-05
The following pull-request contains BPF updates for your *net-next* tree.
We've added 106 non-merge commits during the last 18 day(s) which contain
a total of 159 files changed, 5225 insertions(+), 1358 deletions(-).
There are two small merge conflicts, resolve them as follows:
1) tools/testing/selftests/bpf/DENYLIST.s390x
Commit 27e23836ce ("selftests/bpf: Add lru_bug to s390x deny list") in
bpf tree was needed to get BPF CI green on s390x, but it conflicted with
newly added tests on bpf-next. Resolve by adding both hunks, result:
[...]
lru_bug # prog 'printk': failed to auto-attach: -524
setget_sockopt # attach unexpected error: -524 (trampoline)
cb_refs # expected error message unexpected error: -524 (trampoline)
cgroup_hierarchical_stats # JIT does not support calling kernel function (kfunc)
htab_update # failed to attach: ERROR: strerror_r(-524)=22 (trampoline)
[...]
2) net/core/filter.c
Commit 1227c1771d ("net: Fix data-races around sysctl_[rw]mem_(max|default).")
from net tree conflicts with commit 29003875bd ("bpf: Change bpf_setsockopt(SOL_SOCKET)
to reuse sk_setsockopt()") from bpf-next tree. Take the code as it is from
bpf-next tree, result:
[...]
if (getopt) {
if (optname == SO_BINDTODEVICE)
return -EINVAL;
return sk_getsockopt(sk, SOL_SOCKET, optname,
KERNEL_SOCKPTR(optval),
KERNEL_SOCKPTR(optlen));
}
return sk_setsockopt(sk, SOL_SOCKET, optname,
KERNEL_SOCKPTR(optval), *optlen);
[...]
The main changes are:
1) Add any-context BPF specific memory allocator which is useful in particular for BPF
tracing with bonus of performance equal to full prealloc, from Alexei Starovoitov.
2) Big batch to remove duplicated code from bpf_{get,set}sockopt() helpers as an effort
to reuse the existing core socket code as much as possible, from Martin KaFai Lau.
3) Extend BPF flow dissector for BPF programs to just augment the in-kernel dissector
with custom logic. In other words, allow for partial replacement, from Shmulik Ladkani.
4) Add a new cgroup iterator to BPF with different traversal options, from Hao Luo.
5) Support for BPF to collect hierarchical cgroup statistics efficiently through BPF
integration with the rstat framework, from Yosry Ahmed.
6) Support bpf_{g,s}et_retval() under more BPF cgroup hooks, from Stanislav Fomichev.
7) BPF hash table and local storages fixes under fully preemptible kernel, from Hou Tao.
8) Add various improvements to BPF selftests and libbpf for compilation with gcc BPF
backend, from James Hilliard.
9) Fix verifier helper permissions and reference state management for synchronous
callbacks, from Kumar Kartikeya Dwivedi.
10) Add support for BPF selftest's xskxceiver to also be used against real devices that
support MAC loopback, from Maciej Fijalkowski.
11) Various fixes to the bpf-helpers(7) man page generation script, from Quentin Monnet.
12) Document BPF verifier's tnum_in(tnum_range(), ...) gotchas, from Shung-Hsi Yu.
13) Various minor misc improvements all over the place.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (106 commits)
bpf: Optimize rcu_barrier usage between hash map and bpf_mem_alloc.
bpf: Remove usage of kmem_cache from bpf_mem_cache.
bpf: Remove prealloc-only restriction for sleepable bpf programs.
bpf: Prepare bpf_mem_alloc to be used by sleepable bpf programs.
bpf: Remove tracing program restriction on map types
bpf: Convert percpu hash map to per-cpu bpf_mem_alloc.
bpf: Add percpu allocation support to bpf_mem_alloc.
bpf: Batch call_rcu callbacks instead of SLAB_TYPESAFE_BY_RCU.
bpf: Adjust low/high watermarks in bpf_mem_cache
bpf: Optimize call_rcu in non-preallocated hash map.
bpf: Optimize element count in non-preallocated hash map.
bpf: Relax the requirement to use preallocated hash maps in tracing progs.
samples/bpf: Reduce syscall overhead in map_perf_test.
selftests/bpf: Improve test coverage of test_maps
bpf: Convert hash map to bpf_mem_alloc.
bpf: Introduce any context BPF specific memory allocator.
selftest/bpf: Add test for bpf_getsockopt()
bpf: Change bpf_getsockopt(SOL_IPV6) to reuse do_ipv6_getsockopt()
bpf: Change bpf_getsockopt(SOL_IP) to reuse do_ip_getsockopt()
bpf: Change bpf_getsockopt(SOL_TCP) to reuse do_tcp_getsockopt()
...
====================
Link: https://lore.kernel.org/r/20220905161136.9150-1-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
CFTYPE_PRESSURE is used to flag PSI related files so that they are not
created if PSI is disabled during boot. It's a bit weird to use a generic
flag to mark a specific file type. Let's instead move the PSI files into its
own cftypes array and add/rm them conditionally. This is a bit more code but
cleaner.
No userland visible changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Let's track whether a cftype is currently added or not using a new flag
__CFTYPE_ADDED so that duplicate operations can be failed safely and
consistently allow using empty cftypes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Currently, changes in "cpust.cpus" of a partition root is not allowed if
it violates the sibling cpu exclusivity rule when the check is done
in the validate_change() function. That is inconsistent with the
other cpuset changes that are always allowed but may make a partition
invalid.
Update the cpuset code to allow cpumask change even if it violates the
sibling cpu exclusivity rule, but invalidate the partition instead
just like the other changes. However, other sibling partitions with
conflicting cpumask will also be invalidated in order to not violating
the exclusivity rule. This behavior is specific to this partition
rule violation.
Note that a previous commit has made sibling cpu exclusivity rule check
the last check of validate_change(). So if -EINVAL is returned, we can
be sure that sibling cpu exclusivity rule violation is the only rule
that is broken.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
This patch moves down the exclusive cpu and memory check in
validate_change(). There is no functional change.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
There are a number of different reasons which can cause a partition to
become invalid. A user seeing an invalid partition may not know exactly
why. To help user to get a better understanding of the underlying reason,
The cpuset.cpus.partition control file, when read, will now report the
reason why a partition become invalid. When a partition does become
invalid, reading the control file will show "root invalid (<reason>)"
where <reason> is a string that describes why the partition is invalid.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cpuset v1 uses the sched_load_balance control file to determine if load
balancing should be enabled. Cpuset v2 gets rid of sched_load_balance
as its use may require disabling load balancing at cgroup root.
For workloads that require very low latency like DPDK, the latency
jitters caused by periodic load balancing may exceed the desired
latency limit.
When cpuset v2 is in use, the only way to avoid this latency cost is to
use the "isolcpus=" kernel boot option to isolate a set of CPUs. After
the kernel boot, however, there is no way to add or remove CPUs from
this isolated set. For workloads that are more dynamic in nature, that
means users have to provision enough CPUs for the worst case situation
resulting in excess idle CPUs.
To address this issue for cpuset v2, a new cpuset.cpus.partition type
"isolated" is added which allows the creation of a cpuset partition
without load balancing. This will allow system administrators to
dynamically adjust the size of isolated partition to the current need
of the workload without rebooting the system.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Currently, enabling a partition root is only allowed if all the
constraints of a valid partition are satisfied. Even changes to
"cpuset.cpus" may not be allowed in some cases. Moreover, there are
limits to changes made to a parent cpuset if it is a valid partition
root. This is contrary to the general cgroup v2 philosophy.
This patch relaxes the constraints of changing the state of "cpuset.cpus"
and "cpuset.cpus.partition". Now all valid changes ("member" or "root")
to "cpuset.cpus.partition" are allowed even if there are child cpusets
underneath it.
Trying to make a cpuset a partition root, however, will cause its state
to become invalid if the following constraints of a valid partition
root are not satisfied.
1) The "cpuset.cpus" is non-empty and exclusive.
2) The parent cpuset is a valid partition root.
3) The "cpuset.cpus" overlaps parent's "cpuset.cpus".
Similarly, almost all changes to "cpuset.cpus" are allowed with the
exception that if the underlying CS_CPU_EXCLUSIVE flag is set, the
exclusivity rule will still apply.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Currently, a partition root cannot have empty "cpuset.cpus.effective".
As a result, a parent partition root cannot distribute out all its
CPUs to child partitions with no CPUs left. However in most cases,
there shouldn't be any tasks associated with intermediate nodes of the
default hierarchy. So the current rule is too restrictive and can waste
valuable CPU resource.
To address this issue, we are now allowing a partition to have empty
"cpuset.cpus.effective" as long as it has no task. Since cpuset is
threaded, no-internal-process rule does not apply. So it is possible
to have tasks in a partition root with child sub-partitions even though
that should be uncommon.
A parent partition with no task can now have all its CPUs distributed out
to its child partitions. The top cpuset always have some house-keeping
tasks running and so its list of effective cpu can't be empty.
Once a partition with empty "cpuset.cpus.effective" is formed, no
new task can be moved into it until "cpuset.cpus.effective" becomes
non-empty.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The partition root state (PRS) macro names do not currently match the
external names. Change them to match the external names and add helper
functions to read or change the state.
Shorten the cpuset argument of update_parent_subparts_cpumask() to cs
to match other cpuset functions.
Remove the new_prs argument from notify_partition_change() as the
cs->partition_root_state has already been set to new_prs before it
is called.
There is no functional change.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Previously, update_tasks_cpumask() is not supposed to be called with
top cpuset. With cpuset partition that takes CPUs away from the top
cpuset, adjusting the cpus_mask of the tasks in the top cpuset is
necessary. Percpu kthreads, however, are ignored.
Fixes: ee8dde0cd2 ("cpuset: Add new v2 cpuset.sched.partition flag")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
pids.peak tracks the high watermark of usage for number of pids. This
helps give a better baseline on which to set pids.max. Polling
pids.current isn't really feasible, since it would potentially miss
short-lived spikes.
This interface is analogous to memory.peak.
Signed-off-by: Josh Don <joshdon@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
There's a seemingly harmless data-race around cgrp_dfl_visible detected by
kernel concurrency sanitizer. Let's remove it by throwing WRITE/READ_ONCE at
it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Abhishek Shah <abhishek.shah@columbia.edu>
Cc: Gabriel Ryan <gabe@cs.columbia.edu>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Link: https://lore.kernel.org/netdev/20220819072256.fn7ctciefy4fc4cu@wittgenstein/
Add cgroup_file_show() which allows toggling visibility of a cgroup file
using the new kernfs_show(). This will be used to hide psi interface files
on cgroups where it's disabled.
Cc: Chengming Zhou <zhouchengming@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220828050440.734579-10-tj@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cgroup id is user provided datum hence extend its return domain to
include possible error reason (similar to cgroup_get_from_fd()).
This change also fixes commit d4ccaf58a8 ("bpf: Introduce cgroup
iter") that would use NULL instead of proper error handling in
d4ccaf58a8 ("bpf: Introduce cgroup iter").
Additionally, neither of: fc_appid_store, bpf_iter_attach_cgroup,
mem_cgroup_get_from_ino (callers of cgroup_get_from_fd) is built without
CONFIG_CGROUPS (depends via CONFIG_BLK_CGROUP, direct, transitive
CONFIG_MEMCG respectively) transitive, so drop the singular definition
not needed with !CONFIG_CGROUPS.
Fixes: d4ccaf58a8 ("bpf: Introduce cgroup iter")
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cgroup ids are resolved in the global scope. That may be needed sometime
(in future) but currently it violates virtual view provided through
cgroup namespaces.
There are currently following users of the resolution:
- fc_appid_store
- bpf_iter_attach_cgroup
- mem_cgroup_get_from_ino
None of the is a called on behalf of kernel but the resolution is made
with proper userspace context, hence the default to current->nsproxy
makes sens. (This doesn't rule out cgroup_get_from_id with cgroup NS
parameter in the future.)
Since cgroup ids are defined on v2 hierarchy only, we simply check
existence in the cgroup namespace by looking at ancestry on the default
hierarchy.
Fixes: 6b658c4863 ("scsi: cgroup: Add cgroup_get_from_id()")
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
cgroup_get_from_path() is not widely used function. Its callers presume
the path is resolved under cgroup namespace. (There is one caller
currently and resolving in init NS won't make harm (netfilter). However,
future users may be subject to different effects when resolving
globally.)
Since, there's currently no use for the global resolution, modify the
existing function to take cgroup NS into account.
Fixes: a79a908fd2 ("cgroup: introduce cgroup namespaces")
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Enable bpf programs to make use of rstat to collect cgroup hierarchical
stats efficiently:
- Add cgroup_rstat_updated() kfunc, for bpf progs that collect stats.
- Add cgroup_rstat_flush() sleepable kfunc, for bpf progs that read stats.
- Add an empty bpf_rstat_flush() hook that is called during rstat
flushing, for bpf progs that flush stats to attach to. Attaching a bpf
prog to this hook effectively registers it as a flush callback.
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Hao Luo <haoluo@google.com>
Link: https://lore.kernel.org/r/20220824233117.1312810-4-haoluo@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Root cause:
The rebind_subsystems() is no lock held when move css object from A
list to B list,then let B's head be treated as css node at
list_for_each_entry_rcu().
Solution:
Add grace period before invalidating the removed rstat_css_node.
Reported-by: Jing-Ting Wu <jing-ting.wu@mediatek.com>
Suggested-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Jing-Ting Wu <jing-ting.wu@mediatek.com>
Tested-by: Jing-Ting Wu <jing-ting.wu@mediatek.com>
Link: https://lore.kernel.org/linux-arm-kernel/d8f0bc5e2fb6ed259f9334c83279b4c011283c41.camel@mediatek.com/T/
Acked-by: Mukesh Ojha <quic_mojha@quicinc.com>
Fixes: a7df69b81a ("cgroup: rstat: support cgroup1")
Cc: stable@vger.kernel.org # v5.13+
Signed-off-by: Tejun Heo <tj@kernel.org>
Andrii Nakryiko says:
====================
bpf-next 2022-08-17
We've added 45 non-merge commits during the last 14 day(s) which contain
a total of 61 files changed, 986 insertions(+), 372 deletions(-).
The main changes are:
1) New bpf_ktime_get_tai_ns() BPF helper to access CLOCK_TAI, from Kurt
Kanzenbach and Jesper Dangaard Brouer.
2) Few clean ups and improvements for libbpf 1.0, from Andrii Nakryiko.
3) Expose crash_kexec() as kfunc for BPF programs, from Artem Savkov.
4) Add ability to define sleepable-only kfuncs, from Benjamin Tissoires.
5) Teach libbpf's bpf_prog_load() and bpf_map_create() to gracefully handle
unsupported names on old kernels, from Hangbin Liu.
6) Allow opting out from auto-attaching BPF programs by libbpf's BPF skeleton,
from Hao Luo.
7) Relax libbpf's requirement for shared libs to be marked executable, from
Henqgi Chen.
8) Improve bpf_iter internals handling of error returns, from Hao Luo.
9) Few accommodations in libbpf to support GCC-BPF quirks, from James Hilliard.
10) Fix BPF verifier logic around tracking dynptr ref_obj_id, from Joanne Koong.
11) bpftool improvements to handle full BPF program names better, from Manu
Bretelle.
12) bpftool fixes around libcap use, from Quentin Monnet.
13) BPF map internals clean ups and improvements around memory allocations,
from Yafang Shao.
14) Allow to use cgroup_get_from_file() on cgroupv1, allowing BPF cgroup
iterator to work on cgroupv1, from Yosry Ahmed.
15) BPF verifier internal clean ups, from Dave Marchevsky and Joanne Koong.
16) Various fixes and clean ups for selftests/bpf and vmtest.sh, from Daniel
Xu, Artem Savkov, Joanne Koong, Andrii Nakryiko, Shibin Koikkara Reeny.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (45 commits)
selftests/bpf: Few fixes for selftests/bpf built in release mode
libbpf: Clean up deprecated and legacy aliases
libbpf: Streamline bpf_attr and perf_event_attr initialization
libbpf: Fix potential NULL dereference when parsing ELF
selftests/bpf: Tests libbpf autoattach APIs
libbpf: Allows disabling auto attach
selftests/bpf: Fix attach point for non-x86 arches in test_progs/lsm
libbpf: Making bpf_prog_load() ignore name if kernel doesn't support
selftests/bpf: Update CI kconfig
selftests/bpf: Add connmark read test
selftests/bpf: Add existing connection bpf_*_ct_lookup() test
bpftool: Clear errno after libcap's checks
bpf: Clear up confusion in bpf_skb_adjust_room()'s documentation
bpftool: Fix a typo in a comment
libbpf: Add names for auxiliary maps
bpf: Use bpf_map_area_alloc consistently on bpf map creation
bpf: Make __GFP_NOWARN consistent in bpf map creation
bpf: Use bpf_map_area_free instread of kvfree
bpf: Remove unneeded memset in queue_stack_map creation
libbpf: preserve errno across pr_warn/pr_info/pr_debug
...
====================
Link: https://lore.kernel.org/r/20220817215656.1180215-1-andrii@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Bringing up a CPU may involve creating and destroying tasks which requires
read-locking threadgroup_rwsem, so threadgroup_rwsem nests inside
cpus_read_lock(). However, cpuset's ->attach(), which may be called with
thredagroup_rwsem write-locked, also wants to disable CPU hotplug and
acquires cpus_read_lock(), leading to a deadlock.
Fix it by guaranteeing that ->attach() is always called with CPU hotplug
disabled and removing cpus_read_lock() call from cpuset_attach().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-and-tested-by: Imran Khan <imran.f.khan@oracle.com>
Reported-and-tested-by: Xuewen Yan <xuewen.yan@unisoc.com>
Fixes: 05c7b7a92c ("cgroup/cpuset: Fix a race between cpuset_attach() and cpu hotplug")
Cc: stable@vger.kernel.org # v5.17+
psi_trigger_create()'s 'nbytes' parameter is not used, so we can remove it.
Signed-off-by: Hao Jia <jiahao.os@bytedance.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Every cgroup knows all its ancestors through its ->ancestor_ids[]. There's
no advantage to remembering the IDs instead of the pointers directly and
this makes the array useless for finding an actual ancestor cgroup forcing
cgroup_ancestor() to iteratively walk up the hierarchy instead. Let's
replace cgroup->ancestor_ids[] with ->ancestors[] and remove the walking-up
from cgroup_ancestor().
While at it, improve comments around cgroup_root->cgrp_ancestor_storage.
This patch shouldn't cause user-visible behavior differences.
v2: Update cgroup_ancestor() to use ->ancestors[].
v3: cgroup_root->cgrp_ancestor_storage's type is updated to match
cgroup->ancestors[]. Better comments.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
cgroup_get_from_file() currently fails with -EBADF if called on cgroup
v1. However, the current implementation works on cgroup v1 as well, so
the restriction is unnecessary.
This enabled cgroup_get_from_fd() to work on cgroup v1, which would be
the only thing stopping bpf cgroup_iter from supporting cgroup v1.
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Hao Luo <haoluo@google.com>
Link: https://lore.kernel.org/r/20220805214821.1058337-3-haoluo@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmLuvmwRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1gONQ/+KkkPTeKgGDvrahTfeYZlmRyvcI1R78r9
yooa8v+DtifznBW2eXDBc8WTruzqr78VyUY+1YSjfKS6FRQWYMficJ3qk3hxgBru
998KZbvl3jXBBlRkqgGeFlF5Ty2KaryEZgX97a7IF/0xWDgpm972jFkJ/KCo/YTY
WSQrzutz2FKe71EjK4cAplYxPZIiy/zo2hSGTbsso4M7bO5VLc1Y4qMtFGcCZ7JB
s9JYkj2Rfz+AS5wioDRcGuec4A4SrroxKszZA6QDDBuhMJukqexO02xs/fxZ2W4Z
DF4U5MFOrtz9AWSGsf1P6XXbgJO8qTgQXZchFsEcJwypV13w8U0IViXQfD/Pvx2X
y+WHdnZVIO2sDwOJ15ew7IuoJZ2LsVygrBNFJJaIFOtIz3RzprI0BJN7LeWFALOa
IPmbtiY8hVwhKmjRgMHWDwJhMEHLuhGx3idiD89w1pknzTUnKDiwLyEUtyynxeGd
ft9uCvPefrYQVx9AiH7wf0W+fg334FCccC+0f8LyduyftUyQCfZIZY6LUSKuKded
Odm7k0ngLDPbdZwAHs0Nf/ilRwd91Z7b6hGt5U3ptx+8BPMKB+/k1VoKog7OISPc
zGaP7DrtuC4sEdX4X6bqX+mEQhpkLcQw15gVGxhKoHqygWNSZrV634aSSXwfVXJx
eT5m/K9a7L0=
=CYl5
-----END PGP SIGNATURE-----
Merge tag 'sched-urgent-2022-08-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
"Various fixes: a deadline scheduler fix, a migration fix, a Sparse fix
and a comment fix"
* tag 'sched-urgent-2022-08-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/core: Do not requeue task on CPU excluded from cpus_mask
sched/rt: Fix Sparse warnings due to undefined rt.c declarations
exit: Fix typo in comment: s/sub-theads/sub-threads
sched, cpuset: Fix dl_cpu_busy() panic due to empty cs->cpus_allowed
* threadgroup_rwsem write locking is skipped when configuring controllers in
empty subtrees. Combined with CLONE_INTO_CGROUP, this allows the common
static usage pattern to not grab threadgroup_rwsem at all (glibc still
doesn't seem ready for CLONE_INTO_CGROUP unfortunately).
* threadgroup_rwsem used to be put into non-percpu mode by default due to
latency concerns in specific use cases. There's no reason for everyone
else to pay for it. Make the behavior optional.
* psi no longer allocates memory when disabled.
along with some code cleanups.
-----BEGIN PGP SIGNATURE-----
iIQEABYIACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCYugHIQ4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGd+oAP9lfD3fTRdNo4qWV2VsZsYzoOxzNIuJSwN/dnYx
IEbQOwD/cd2YMfeo6zcb427U/VfTFqjJjFK04OeljYtJU8fFywo=
=sucy
-----END PGP SIGNATURE-----
Merge tag 'cgroup-for-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
"Several core optimizations:
- threadgroup_rwsem write locking is skipped when configuring
controllers in empty subtrees.
Combined with CLONE_INTO_CGROUP, this allows the common static
usage pattern to not grab threadgroup_rwsem at all (glibc still
doesn't seem ready for CLONE_INTO_CGROUP unfortunately).
- threadgroup_rwsem used to be put into non-percpu mode by default
due to latency concerns in specific use cases. There's no reason
for everyone else to pay for it. Make the behavior optional.
- psi no longer allocates memory when disabled.
... along with some code cleanups"
* tag 'cgroup-for-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: Skip subtree root in cgroup_update_dfl_csses()
cgroup: remove "no" prefixed mount options
cgroup: Make !percpu threadgroup_rwsem operations optional
cgroup: Add "no" prefixed mount options
cgroup: Elide write-locking threadgroup_rwsem when updating csses on an empty subtree
cgroup.c: remove redundant check for mixable cgroup in cgroup_migrate_vet_dst
cgroup.c: add helper __cset_cgroup_from_root to cleanup duplicated codes
psi: dont alloc memory for psi by default
With cgroup v2, the cpuset's cpus_allowed mask can be empty indicating
that the cpuset will just use the effective CPUs of its parent. So
cpuset_can_attach() can call task_can_attach() with an empty mask.
This can lead to cpumask_any_and() returns nr_cpu_ids causing the call
to dl_bw_of() to crash due to percpu value access of an out of bound
CPU value. For example:
[80468.182258] BUG: unable to handle page fault for address: ffffffff8b6648b0
:
[80468.191019] RIP: 0010:dl_cpu_busy+0x30/0x2b0
:
[80468.207946] Call Trace:
[80468.208947] cpuset_can_attach+0xa0/0x140
[80468.209953] cgroup_migrate_execute+0x8c/0x490
[80468.210931] cgroup_update_dfl_csses+0x254/0x270
[80468.211898] cgroup_subtree_control_write+0x322/0x400
[80468.212854] kernfs_fop_write_iter+0x11c/0x1b0
[80468.213777] new_sync_write+0x11f/0x1b0
[80468.214689] vfs_write+0x1eb/0x280
[80468.215592] ksys_write+0x5f/0xe0
[80468.216463] do_syscall_64+0x5c/0x80
[80468.224287] entry_SYSCALL_64_after_hwframe+0x44/0xae
Fix that by using effective_cpus instead. For cgroup v1, effective_cpus
is the same as cpus_allowed. For v2, effective_cpus is the real cpumask
to be used by tasks within the cpuset anyway.
Also update task_can_attach()'s 2nd argument name to cs_effective_cpus to
reflect the change. In addition, a check is added to task_can_attach()
to guard against the possibility that cpumask_any_and() may return a
value >= nr_cpu_ids.
Fixes: 7f51412a41 ("sched/deadline: Fix bandwidth check/update when migrating tasks between exclusive cpusets")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/20220803015451.2219567-1-longman@redhat.com
Load-balancing improvements:
============================
- Improve NUMA balancing on AMD Zen systems for affine workloads.
- Improve the handling of reduced-capacity CPUs in load-balancing.
- Energy Model improvements: fix & refine all the energy fairness metrics (PELT),
and remove the conservative threshold requiring 6% energy savings to
migrate a task. Doing this improves power efficiency for most workloads,
and also increases the reliability of energy-efficiency scheduling.
- Optimize/tweak select_idle_cpu() to spend (much) less time searching
for an idle CPU on overloaded systems. There's reports of several
milliseconds spent there on large systems with large workloads ...
[ Since the search logic changed, there might be behavioral side effects. ]
- Improve NUMA imbalance behavior. On certain systems
with spare capacity, initial placement of tasks is non-deterministic,
and such an artificial placement imbalance can persist for a long time,
hurting (and sometimes helping) performance.
The fix is to make fork-time task placement consistent with runtime
NUMA balancing placement.
Note that some performance regressions were reported against this,
caused by workloads that are not memory bandwith limited, which benefit
from the artificial locality of the placement bug(s). Mel Gorman's
conclusion, with which we concur, was that consistency is better than
random workload benefits from non-deterministic bugs:
"Given there is no crystal ball and it's a tradeoff, I think it's
better to be consistent and use similar logic at both fork time
and runtime even if it doesn't have universal benefit."
- Improve core scheduling by fixing a bug in sched_core_update_cookie() that
caused unnecessary forced idling.
- Improve wakeup-balancing by allowing same-LLC wakeup of idle CPUs for newly
woken tasks.
- Fix a newidle balancing bug that introduced unnecessary wakeup latencies.
ABI improvements/fixes:
=======================
- Do not check capabilities and do not issue capability check denial messages
when a scheduler syscall doesn't require privileges. (Such as increasing niceness.)
- Add forced-idle accounting to cgroups too.
- Fix/improve the RSEQ ABI to not just silently accept unknown flags.
(No existing tooling is known to have learned to rely on the previous behavior.)
- Depreciate the (unused) RSEQ_CS_FLAG_NO_RESTART_ON_* flags.
Optimizations:
==============
- Optimize & simplify leaf_cfs_rq_list()
- Micro-optimize set_nr_{and_not,if}_polling() via try_cmpxchg().
Misc fixes & cleanups:
======================
- Fix the RSEQ self-tests on RISC-V and Glibc 2.35 systems.
- Fix a full-NOHZ bug that can in some cases result in the tick not being
re-enabled when the last SCHED_RT task is gone from a runqueue but there's
still SCHED_OTHER tasks around.
- Various PREEMPT_RT related fixes.
- Misc cleanups & smaller fixes.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmLn2ywRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1iNfxAAhPJMwM4tYCpIM6PhmxKiHl6kkiT2tt42
HhEmiJVLjczLybWaWwmGA2dSFkv1f4+hG7nqdZTm9QYn0Pqat2UTSRcwoKQc+gpB
x85Hwt2IUmnUman52fRl5r1miH9LTdCI6agWaFLQae5ds1XmOugFo52t2ahax+Gn
dB8LxS2fa/GrKj229EhkJSPWAK4Y94asoTProwpKLuKEeXhDkqUNrOWbKhz+wEnA
pVZySpA9uEOdNLVSr1s0VB6mZoh5/z6yQefj5YSNntsG71XWo9jxKCIm5buVdk2U
wjdn6UzoTThOy/5Ygm64eYRexMHG71UamF1JYUdmvDeUJZ5fhG6RD0FECUQNVcJB
Msu2fce6u1AV0giZGYtiooLGSawB/+e6MoDkjTl8guFHi/peve9CezKX1ZgDWPfE
eGn+EbYkUS9RMafXCKuEUBAC1UUqAavGN9sGGN1ufyR4za6ogZplOqAFKtTRTGnT
/Ne3fHTtvv73DLGW9ohO5vSS2Rp7zhAhB6FunhibhxCWlt7W6hA4Ze2vU9hf78Yn
SJDLAJjOEilLaKUkRG/d9uM3FjKJM1tqxuT76+sUbM0MNxdyiKcviQlP1b8oq5Um
xE1KNZUevnr/WXqOTGDKHH/HNPFgwxbwavMiP7dNFn8h/hEk4t9dkf5siDmVHtn4
nzDVOob1LgE=
=xr2b
-----END PGP SIGNATURE-----
Merge tag 'sched-core-2022-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"Load-balancing improvements:
- Improve NUMA balancing on AMD Zen systems for affine workloads.
- Improve the handling of reduced-capacity CPUs in load-balancing.
- Energy Model improvements: fix & refine all the energy fairness
metrics (PELT), and remove the conservative threshold requiring 6%
energy savings to migrate a task. Doing this improves power
efficiency for most workloads, and also increases the reliability
of energy-efficiency scheduling.
- Optimize/tweak select_idle_cpu() to spend (much) less time
searching for an idle CPU on overloaded systems. There's reports of
several milliseconds spent there on large systems with large
workloads ...
[ Since the search logic changed, there might be behavioral side
effects. ]
- Improve NUMA imbalance behavior. On certain systems with spare
capacity, initial placement of tasks is non-deterministic, and such
an artificial placement imbalance can persist for a long time,
hurting (and sometimes helping) performance.
The fix is to make fork-time task placement consistent with runtime
NUMA balancing placement.
Note that some performance regressions were reported against this,
caused by workloads that are not memory bandwith limited, which
benefit from the artificial locality of the placement bug(s). Mel
Gorman's conclusion, with which we concur, was that consistency is
better than random workload benefits from non-deterministic bugs:
"Given there is no crystal ball and it's a tradeoff, I think
it's better to be consistent and use similar logic at both fork
time and runtime even if it doesn't have universal benefit."
- Improve core scheduling by fixing a bug in
sched_core_update_cookie() that caused unnecessary forced idling.
- Improve wakeup-balancing by allowing same-LLC wakeup of idle CPUs
for newly woken tasks.
- Fix a newidle balancing bug that introduced unnecessary wakeup
latencies.
ABI improvements/fixes:
- Do not check capabilities and do not issue capability check denial
messages when a scheduler syscall doesn't require privileges. (Such
as increasing niceness.)
- Add forced-idle accounting to cgroups too.
- Fix/improve the RSEQ ABI to not just silently accept unknown flags.
(No existing tooling is known to have learned to rely on the
previous behavior.)
- Depreciate the (unused) RSEQ_CS_FLAG_NO_RESTART_ON_* flags.
Optimizations:
- Optimize & simplify leaf_cfs_rq_list()
- Micro-optimize set_nr_{and_not,if}_polling() via try_cmpxchg().
Misc fixes & cleanups:
- Fix the RSEQ self-tests on RISC-V and Glibc 2.35 systems.
- Fix a full-NOHZ bug that can in some cases result in the tick not
being re-enabled when the last SCHED_RT task is gone from a
runqueue but there's still SCHED_OTHER tasks around.
- Various PREEMPT_RT related fixes.
- Misc cleanups & smaller fixes"
* tag 'sched-core-2022-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (32 commits)
rseq: Kill process when unknown flags are encountered in ABI structures
rseq: Deprecate RSEQ_CS_FLAG_NO_RESTART_ON_* flags
sched/core: Fix the bug that task won't enqueue into core tree when update cookie
nohz/full, sched/rt: Fix missed tick-reenabling bug in dequeue_task_rt()
sched/core: Always flush pending blk_plug
sched/fair: fix case with reduced capacity CPU
sched/core: Use try_cmpxchg in set_nr_{and_not,if}_polling
sched/core: add forced idle accounting for cgroups
sched/fair: Remove the energy margin in feec()
sched/fair: Remove task_util from effective utilization in feec()
sched/fair: Use the same cpumask per-PD throughout find_energy_efficient_cpu()
sched/fair: Rename select_idle_mask to select_rq_mask
sched, drivers: Remove max param from effective_cpu_util()/sched_cpu_util()
sched/fair: Decay task PELT values during wakeup migration
sched/fair: Provide u64 read for 32-bits arch helper
sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg
sched: only perform capability check on privileged operation
sched: Remove unused function group_first_cpu()
sched/fair: Remove redundant word " *"
selftests/rseq: check if libc rseq support is registered
...
The cgroup_update_dfl_csses() function updates css associations when a
cgroup's subtree_control file is modified. Any changes made to a cgroup's
subtree_control file, however, will only affect its descendants but not
the cgroup itself. So there is no point in migrating csses associated
with that cgroup. We can skip them instead.
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
30312730bd ("cgroup: Add "no" prefixed mount options") added "no" prefixed
mount options to allow turning them off and 6a010a49b6 ("cgroup: Make
!percpu threadgroup_rwsem operations optional") added one more "no" prefixed
mount option. However, Michal pointed out that the "no" prefixed options
aren't necessary in allowing mount options to be turned off:
# grep group /proc/mounts
cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,relatime,nsdelegate,memory_recursiveprot 0 0
# mount -o remount,nsdelegate,memory_recursiveprot none /sys/fs/cgroup
# grep cgroup /proc/mounts
cgroup2 /sys/fs/cgroup cgroup2 rw,relatime,nsdelegate,memory_recursiveprot 0 0
Note that this is different from the remount behavior when the mount(1) is
invoked without the device argument - "none":
# grep cgroup /proc/mounts
cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot 0 0
# mount -o remount,nsdelegate,memory_recursiveprot /sys/fs/cgroup
# grep cgroup /proc/mounts
cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot 0 0
While a bit confusing, given that there is a way to turn off the options,
there's no reason to have the explicit "no" prefixed options. Let's remove
them.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
3942a9bd7b ("locking, rcu, cgroup: Avoid synchronize_sched() in
__cgroup_procs_write()") disabled percpu operations on threadgroup_rwsem
because the impiled synchronize_rcu() on write locking was pushing up the
latencies too much for android which constantly moves processes between
cgroups.
This makes the hotter paths - fork and exit - slower as they're always
forced into the slow path. There is no reason to force this on everyone
especially given that more common static usage pattern can now completely
avoid write-locking the rwsem. Write-locking is elided when turning on and
off controllers on empty sub-trees and CLONE_INTO_CGROUP enables seeding a
cgroup without grabbing the rwsem.
Restore the default percpu operations and introduce the mount option
"favordynmods" and config option CGROUP_FAVOR_DYNMODS for users who need
lower latencies for the dynamic operations.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Michal Koutn� <mkoutny@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Dmitry Shmidt <dimitrysh@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
We allow modifying these mount options via remount. Let's add "no" prefixed
variants so that they can be turned off too.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
cgroup_update_dfl_csses() write-lock the threadgroup_rwsem as updating the
csses can trigger process migrations. However, if the subtree doesn't
contain any tasks, there aren't gonna be any cgroup migrations. This
condition can be trivially detected by testing whether
mgctx.preloaded_src_csets is empty. Elide write-locking threadgroup_rwsem if
the subtree is empty.
After this optimization, the usage pattern of creating a cgroup, enabling
the necessary controllers, and then seeding it with CLONE_INTO_CGROUP and
then removing the cgroup after it becomes empty doesn't need to write-lock
threadgroup_rwsem at all.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
4feee7d126 previously added per-task forced idle accounting. This patch
extends this to also include cgroups.
rstat is used for cgroup accounting, except for the root, which uses
kcpustat in order to bypass the need for doing an rstat flush when
reading root stats.
Only cgroup v2 is supported. Similar to the task accounting, the cgroup
accounting requires that schedstats is enabled.
Signed-off-by: Josh Don <joshdon@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lkml.kernel.org/r/20220629211426.3329954-1-joshdon@google.com
We have:
int cgroup_migrate_vet_dst(struct cgroup *dst_cgrp)
{
...
/* mixables don't care */
if (cgroup_is_mixable(dst_cgrp))
return 0;
/*
* If @dst_cgrp is already or can become a thread root or is
* threaded, it doesn't matter.
*/
if (cgroup_can_be_thread_root(dst_cgrp) || cgroup_is_threaded(dst_cgrp))
return 0;
...
}
but in fact the entry of cgroup_can_be_thread_root() covers case that
checking cgroup_is_mixable() as following:
static bool cgroup_can_be_thread_root(struct cgroup *cgrp)
{
/* mixables don't care */
if (cgroup_is_mixable(cgrp))
return true;
...
}
so explicitly checking in cgroup_migrate_vet_dst is unnecessary.
Signed-off-by: Lin Feng <linf@wangsu.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Each cset (css_set) is pinned by its tasks. When we're moving tasks around
across csets for a migration, we need to hold the source and destination
csets to ensure that they don't go away while we're moving tasks about. This
is done by linking cset->mg_preload_node on either the
mgctx->preloaded_src_csets or mgctx->preloaded_dst_csets list. Using the
same cset->mg_preload_node for both the src and dst lists was deemed okay as
a cset can't be both the source and destination at the same time.
Unfortunately, this overloading becomes problematic when multiple tasks are
involved in a migration and some of them are identity noop migrations while
others are actually moving across cgroups. For example, this can happen with
the following sequence on cgroup1:
#1> mkdir -p /sys/fs/cgroup/misc/a/b
#2> echo $$ > /sys/fs/cgroup/misc/a/cgroup.procs
#3> RUN_A_COMMAND_WHICH_CREATES_MULTIPLE_THREADS &
#4> PID=$!
#5> echo $PID > /sys/fs/cgroup/misc/a/b/tasks
#6> echo $PID > /sys/fs/cgroup/misc/a/cgroup.procs
the process including the group leader back into a. In this final migration,
non-leader threads would be doing identity migration while the group leader
is doing an actual one.
After #3, let's say the whole process was in cset A, and that after #4, the
leader moves to cset B. Then, during #6, the following happens:
1. cgroup_migrate_add_src() is called on B for the leader.
2. cgroup_migrate_add_src() is called on A for the other threads.
3. cgroup_migrate_prepare_dst() is called. It scans the src list.
4. It notices that B wants to migrate to A, so it tries to A to the dst
list but realizes that its ->mg_preload_node is already busy.
5. and then it notices A wants to migrate to A as it's an identity
migration, it culls it by list_del_init()'ing its ->mg_preload_node and
putting references accordingly.
6. The rest of migration takes place with B on the src list but nothing on
the dst list.
This means that A isn't held while migration is in progress. If all tasks
leave A before the migration finishes and the incoming task pins it, the
cset will be destroyed leading to use-after-free.
This is caused by overloading cset->mg_preload_node for both src and dst
preload lists. We wanted to exclude the cset from the src list but ended up
inadvertently excluding it from the dst list too.
This patch fixes the issue by separating out cset->mg_preload_node into
->mg_src_preload_node and ->mg_dst_preload_node, so that the src and dst
preloadings don't interfere with each other.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Mukesh Ojha <quic_mojha@quicinc.com>
Reported-by: shisiyuan <shisiyuan19870131@gmail.com>
Link: http://lkml.kernel.org/r/1654187688-27411-1-git-send-email-shisiyuan@xiaomi.com
Link: https://www.spinics.net/lists/cgroups/msg33313.html
Fixes: f817de9851 ("cgroup: prepare migration path for unified hierarchy")
Cc: stable@vger.kernel.org # v3.16+
No funtionality change, but save us some lines.
Signed-off-by: Lin Feng <linf@wangsu.com>
Acked-by: Mukesh Ojha <quic_mojha@quicinc.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Memory about struct psi_group is allocated by default for
each cgroup even if psi_disabled is true, in this case, these
allocated memory is waste, so alloc memory for struct psi_group
only when psi_disabled is false.
Signed-off-by: Chen Wandun <chenwandun@huawei.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull cgroup updates from Tejun Heo:
"Nothing too interesting. This adds cpu controller selftests and there
are a couple code cleanup patches"
* 'for-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: remove the superfluous judgment
cgroup: Make cgroup_debug static
kseltest/cgroup: Make test_stress.sh work if run interactively
kselftest/cgroup: fix test_stress.sh to use OUTPUT dir
cgroup: Add config file to cgroup selftest suite
cgroup: Add test_cpucg_max_nested() testcase
cgroup: Add test_cpucg_max() testcase
cgroup: Add test_cpucg_nested_weight_underprovisioned() testcase
cgroup: Adding test_cpucg_nested_weight_overprovisioned() testcase
cgroup: Add test_cpucg_weight_underprovisioned() testcase
cgroup: Add test_cpucg_weight_overprovisioned() testcase
cgroup: Add test_cpucg_stats() testcase to cgroup cpu selftests
cgroup: Add new test_cpu.c test suite in cgroup selftests
Remove the superfluous judgment since the function is
never called for a root cgroup, as suggested by Tejun.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Shida Zhang <zhangshida@kylinos.cn>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Make cgroup_debug static since it's only used in cgroup.c
Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
There are 3 places where the cpu and node masks of the top cpuset can
be initialized in the order they are executed:
1) start_kernel -> cpuset_init()
2) start_kernel -> cgroup_init() -> cpuset_bind()
3) kernel_init_freeable() -> do_basic_setup() -> cpuset_init_smp()
The first cpuset_init() call just sets all the bits in the masks.
The second cpuset_bind() call sets cpus_allowed and mems_allowed to the
default v2 values. The third cpuset_init_smp() call sets them back to
v1 values.
For systems with cgroup v2 setup, cpuset_bind() is called once. As a
result, cpu and memory node hot add may fail to update the cpu and node
masks of the top cpuset to include the newly added cpu or node in a
cgroup v2 environment.
For systems with cgroup v1 setup, cpuset_bind() is called again by
rebind_subsystem() when the v1 cpuset filesystem is mounted as shown
in the dmesg log below with an instrumented kernel.
[ 2.609781] cpuset_bind() called - v2 = 1
[ 3.079473] cpuset_init_smp() called
[ 7.103710] cpuset_bind() called - v2 = 0
smp_init() is called after the first two init functions. So we don't
have a complete list of active cpus and memory nodes until later in
cpuset_init_smp() which is the right time to set up effective_cpus
and effective_mems.
To fix this cgroup v2 mask setup problem, the potentially incorrect
cpus_allowed & mems_allowed setting in cpuset_init_smp() are removed.
For cgroup v2 systems, the initial cpuset_bind() call will set the masks
correctly. For cgroup v1 systems, the second call to cpuset_bind()
will do the right setup.
cc: stable@vger.kernel.org
Signed-off-by: Waiman Long <longman@redhat.com>
Tested-by: Feng Tang <feng.tang@intel.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Here is the set of driver core changes for 5.18-rc1.
Not much here, primarily it was a bunch of cleanups and small updates:
- kobj_type cleanups for default_groups
- documentation updates
- firmware loader minor changes
- component common helper added and take advantage of it in many
drivers (the largest part of this pull request).
There will be a merge conflict in drivers/power/supply/ab8500_chargalg.c
with your tree, the merge conflict should be easy (take all the
changes).
All of these have been in linux-next for a while with no reported
problems.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCYkG6PA8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ylMFwCfSIyAU4oLEgj+/Rfmx4o45cAVIWMAnit3zbdU
wUUCGqKcOnTJEcW6dMPh
=1VVi
-----END PGP SIGNATURE-----
Merge tag 'driver-core-5.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here is the set of driver core changes for 5.18-rc1.
Not much here, primarily it was a bunch of cleanups and small updates:
- kobj_type cleanups for default_groups
- documentation updates
- firmware loader minor changes
- component common helper added and take advantage of it in many
drivers (the largest part of this pull request).
All of these have been in linux-next for a while with no reported
problems"
* tag 'driver-core-5.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (54 commits)
Documentation: update stable review cycle documentation
drivers/base/dd.c : Remove the initial value of the global variable
Documentation: update stable tree link
Documentation: add link to stable release candidate tree
devres: fix typos in comments
Documentation: add note block surrounding security patch note
samples/kobject: Use sysfs_emit instead of sprintf
base: soc: Make soc_device_match() simpler and easier to read
driver core: dd: fix return value of __setup handler
driver core: Refactor sysfs and drv/bus remove hooks
driver core: Refactor multiple copies of device cleanup
scripts: get_abi.pl: Fix typo in help message
kernfs: fix typos in comments
kernfs: remove unneeded #if 0 guard
ALSA: hda/realtek: Make use of the helper component_compare_dev_name
video: omapfb: dss: Make use of the helper component_compare_dev
power: supply: ab8500: Make use of the helper component_compare_dev
ASoC: codecs: wcd938x: Make use of the helper component_compare/release_of
iommu/mediatek: Make use of the helper component_compare/release_of
drm: of: Make use of the helper component_release_of
...
Merge more updates from Andrew Morton:
"Various misc subsystems, before getting into the post-linux-next
material.
41 patches.
Subsystems affected by this patch series: procfs, misc, core-kernel,
lib, checkpatch, init, pipe, minix, fat, cgroups, kexec, kdump,
taskstats, panic, kcov, resource, and ubsan"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (41 commits)
Revert "ubsan, kcsan: Don't combine sanitizer with kcov on clang"
kernel/resource: fix kfree() of bootmem memory again
kcov: properly handle subsequent mmap calls
kcov: split ioctl handling into locked and unlocked parts
panic: move panic_print before kmsg dumpers
panic: add option to dump all CPUs backtraces in panic_print
docs: sysctl/kernel: add missing bit to panic_print
taskstats: remove unneeded dead assignment
kasan: no need to unset panic_on_warn in end_report()
ubsan: no need to unset panic_on_warn in ubsan_epilogue()
panic: unset panic_on_warn inside panic()
docs: kdump: add scp example to write out the dump file
docs: kdump: update description about sysfs file system support
arm64: mm: use IS_ENABLED(CONFIG_KEXEC_CORE) instead of #ifdef
x86/setup: use IS_ENABLED(CONFIG_KEXEC_CORE) instead of #ifdef
riscv: mm: init: use IS_ENABLED(CONFIG_KEXEC_CORE) instead of #ifdef
kexec: make crashk_res, crashk_low_res and crash_notes symbols always visible
cgroup: use irqsave in cgroup_rstat_flush_locked().
fat: use pointer to simple type in put_user()
minix: fix bug when opening a file with O_DIRECT
...
All callers of cgroup_rstat_flush_locked() acquire cgroup_rstat_lock
either with spin_lock_irq() or spin_lock_irqsave().
cgroup_rstat_flush_locked() itself acquires cgroup_rstat_cpu_lock which
is a raw_spin_lock. This lock is also acquired in
cgroup_rstat_updated() in IRQ context and therefore requires _irqsave()
locking suffix in cgroup_rstat_flush_locked().
Since there is no difference between spin_lock_t and raw_spin_lock_t on
!RT lockdep does not complain here. On RT lockdep complains because the
interrupts were not disabled here and a deadlock is possible.
Acquire the raw_spin_lock_t with disabled interrupts.
Link: https://lkml.kernel.org/r/20220301122143.1521823-2-bigeasy@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Zefan Li <lizefan.x@bytedance.com>
From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Subject: cgroup: add a comment to cgroup_rstat_flush_locked().
Add a comment why spin_lock_irq() -> raw_spin_lock_irqsave() is needed.
Link: https://lkml.kernel.org/r/Yh+DOK73hfVV5ThX@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull cgroup updates from Tejun Heo:
"All trivial cleanups without meaningful behavior changes"
* 'for-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: cleanup comments
cgroup: Fix cgroup_can_fork() and cgroup_post_fork() kernel-doc comment
cgroup: rstat: retrieve current bstat to delta directly
cgroup: rstat: use same convention to assign cgroup_base_stat
for spdx, add a space before //
replacements
judgement to judgment
transofrmed to transformed
partitition to partition
histrical to historical
migratecd to migrated
Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
There is no need to have struct kernfs_root be part of kernfs.h for
the whole kernel to see and poke around it. Move it internal to kernfs
code and provide a helper function, kernfs_root_to_node(), to handle the
one field that kernfs users were directly accessing from the structure.
Cc: Imran Khan <imran.f.khan@oracle.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220222070713.3517679-1-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Pull cgroup fixes from Tejun Heo:
- Fix for a subtle bug in the recent release_agent permission check
update
- Fix for a long-standing race condition between cpuset and cpu hotplug
- Comment updates
* 'for-5.17-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cpuset: Fix kernel-doc
cgroup-v1: Correct privileges check in release_agent writes
cgroup: clarify cgroup_css_set_fork()
cgroup/cpuset: Fix a race between cpuset_attach() and cpu hotplug
Fix the following W=1 kernel warnings:
kernel/cgroup/cpuset.c:3718: warning: expecting prototype for
cpuset_memory_pressure_bump(). Prototype was for
__cpuset_memory_pressure_bump() instead.
kernel/cgroup/cpuset.c:3568: warning: expecting prototype for
cpuset_node_allowed(). Prototype was for __cpuset_node_allowed()
instead.
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The idea is to check: a) the owning user_ns of cgroup_ns, b)
capabilities in init_user_ns.
The commit 24f6008564 ("cgroup-v1: Require capabilities to set
release_agent") got this wrong in the write handler of release_agent
since it checked user_ns of the opener (may be different from the owning
user_ns of cgroup_ns).
Secondly, to avoid possibly confused deputy, the capability of the
opener must be checked.
Fixes: 24f6008564 ("cgroup-v1: Require capabilities to set release_agent")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/stable/20220216121142.GB30035@blackbody.suse.cz/
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Reviewed-by: Masami Ichikawa(CIP) <masami.ichikawa@cybertrust.co.jp>
Signed-off-by: Tejun Heo <tj@kernel.org>
With recent fixes for the permission checking when moving a task into a cgroup
using a file descriptor to a cgroup's cgroup.procs file and calling write() it
seems a good idea to clarify CLONE_INTO_CGROUP permission checking with a
comment.
Cc: Tejun Heo <tj@kernel.org>
Cc: <cgroups@vger.kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmISrYgeHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGg20IAKDZr7rfSHBopjQV
Cocw744tom0XuxpvSZpp2GGOOXF+tkswcNNaRIrbGOl1mkyxA7eBZCTMpDeDS9aQ
wB0D0Gxx8QBAJp4KgB1W7TB+hIGes/rs8Ve+6iO4ulLLdCVWX/q2boI0aZ7QX9O9
qNi8OsoZQtk6falRvciZFHwV5Av1p2Sy1AW57udQ7DvJ4H98AfKf1u8/z208WWW8
1ixC+qJxQcUcM9vI+7P9Tt7NbFSKv8SvAmqjFY7P+DxQAsVw6KXoqVXykDzeOv0t
fUNOE/t0oFZafwtn8h7KBQnwS9lH03+3KkslVZs+iMFyUj/Bar+NVVyKoDhWXtVg
/PuMhEg=
=eU1o
-----END PGP SIGNATURE-----
Merge tag 'v5.17-rc5' into sched/core, to resolve conflicts
New conflicts in sched/core due to the following upstream fixes:
44585f7bc0 ("psi: fix "defined but not used" warnings when CONFIG_PROC_FS=n")
a06247c680 ("psi: Fix uaf issue when psi trigger is destroyed while being polled")
Conflicts:
include/linux/psi_types.h
kernel/sched/psi.c
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Refer to housekeeping APIs using single feature types instead of flags.
This prevents from passing multiple isolation features at once to
housekeeping interfaces, which soon won't be possible anymore as each
isolation features will have their own cpumask.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Reviewed-by: Phil Auld <pauld@redhat.com>
Link: https://lore.kernel.org/r/20220207155910.527133-5-frederic@kernel.org
As previously discussed(https://lkml.org/lkml/2022/1/20/51),
cpuset_attach() is affected with similar cpu hotplug race,
as follow scenario:
cpuset_attach() cpu hotplug
--------------------------- ----------------------
down_write(cpuset_rwsem)
guarantee_online_cpus() // (load cpus_attach)
sched_cpu_deactivate
set_cpu_active()
// will change cpu_active_mask
set_cpus_allowed_ptr(cpus_attach)
__set_cpus_allowed_ptr_locked()
// (if the intersection of cpus_attach and
cpu_active_mask is empty, will return -EINVAL)
up_write(cpuset_rwsem)
To avoid races such as described above, protect cpuset_attach() call
with cpu_hotplug_lock.
Fixes: be367d0992 ("cgroups: let ss->can_attach and ss->attach do whole threadgroups at a time")
Cc: stable@vger.kernel.org # v2.6.32+
Reported-by: Zhao Gongyi <zhaogongyi@huawei.com>
Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
Acked-by: Waiman Long <longman@redhat.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull cgroup fixes from Tejun Heo:
- Eric's fix for a long standing cgroup1 permission issue where it only
checks for uid 0 instead of CAP which inadvertently allows
unprivileged userns roots to modify release_agent userhelper
- Fixes for the fallout from Waiman's recent cpuset work
* 'for-5.17-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup/cpuset: Fix "suspicious RCU usage" lockdep warning
cgroup-v1: Require capabilities to set release_agent
cpuset: Fix the bug that subpart_cpus updated wrongly in update_cpumask()
cgroup/cpuset: Make child cpusets restrict parents on v1 hierarchy
It was found that a "suspicious RCU usage" lockdep warning was issued
with the rcu_read_lock() call in update_sibling_cpumasks(). It is
because the update_cpumasks_hier() function may sleep. So we have
to release the RCU lock, call update_cpumasks_hier() and reacquire
it afterward.
Also add a percpu_rwsem_assert_held() in update_sibling_cpumasks()
instead of stating that in the comment.
Fixes: 4716909cc5 ("cpuset: Track cpusets that use parent's effective_cpus")
Signed-off-by: Waiman Long <longman@redhat.com>
Tested-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The cgroup release_agent is called with call_usermodehelper. The function
call_usermodehelper starts the release_agent with a full set fo capabilities.
Therefore require capabilities when setting the release_agaent.
Reported-by: Tabitha Sable <tabitha.c.sable@gmail.com>
Tested-by: Tabitha Sable <tabitha.c.sable@gmail.com>
Fixes: 81a6a5cdd2 ("Task Control Groups: automatic userspace notification of idle cgroups")
Cc: stable@vger.kernel.org # v2.6.24+
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
subparts_cpus should be limited as a subset of cpus_allowed, but it is
updated wrongly by using cpumask_andnot(). Use cpumask_and() instead to
fix it.
Fixes: ee8dde0cd2 ("cpuset: Add new v2 cpuset.sched.partition flag")
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
With write operation on psi files replacing old trigger with a new one,
the lifetime of its waitqueue is totally arbitrary. Overwriting an
existing trigger causes its waitqueue to be freed and pending poll()
will stumble on trigger->event_wait which was destroyed.
Fix this by disallowing to redefine an existing psi trigger. If a write
operation is used on a file descriptor with an already existing psi
trigger, the operation will fail with EBUSY error.
Also bypass a check for psi_disabled in the psi_trigger_destroy as the
flag can be flipped after the trigger is created, leading to a memory
leak.
Fixes: 0e94682b73 ("psi: introduce psi monitor")
Reported-by: syzbot+cdb5dd11c97cc532efad@syzkaller.appspotmail.com
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Analyzed-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20220111232309.1786347-1-surenb@google.com
The commit 1f1562fcd0 ("cgroup/cpuset: Don't let child cpusets
restrict parent in default hierarchy") inteded to relax the check only
on the default hierarchy (or v2 mode) but it dropped the check in v1
too.
This patch returns and separates the legacy-only validations so that
they can be considered only in the v1 mode, which should enforce the old
constraints for the sake of compatibility.
Fixes: 1f1562fcd0 ("cgroup/cpuset: Don't let child cpusets restrict parent in default hierarchy")
Suggested-by: Waiman Long <longman@redhat.com>
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Add the description of @kargs in cgroup_can_fork() and
cgroup_post_fork() kernel-doc comment to remove warnings found
by running scripts/kernel-doc, which is caused by using 'make W=1'.
kernel/cgroup/cgroup.c:6235: warning: Function parameter or member
'kargs' not described in 'cgroup_can_fork'
kernel/cgroup/cgroup.c:6296: warning: Function parameter or member
'kargs' not described in 'cgroup_post_fork'
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Instead of retrieve current bstat to cur and copy it to delta, let's use
delta directly.
This saves one copy operation and has the same code convention as
propagating delta to parent.
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
In function cgroup_base_stat_flush(), we update cgroup_base_stat by
getting rstatc->bstat and adjust delta to related fields.
There are two convention to assign cgroup_base_stat in this function:
* rstat2 = rstat1
* rstat2.cputime = rstat1.cputime
The second convention may make audience think just field "cputime" is
updated, while cputime is the only field in cgroup_base_stat.
Let's use the same convention to eliminate this confusion.
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull cgroup updates from Tejun Heo:
"Nothing too interesting. The only two noticeable changes are a subtle
cpuset behavior fix and trace event id field being expanded to u64
from int. Most others are code cleanups"
* 'for-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cpuset: convert 'allowed' in __cpuset_node_allowed() to be boolean
cgroup/rstat: check updated_next only for root
cgroup: rstat: explicitly put loop variant in while
cgroup: return early if it is already on preloaded list
cgroup/cpuset: Don't let child cpusets restrict parent in default hierarchy
cgroup: Trace event cgroup id fields should be u64
cgroup: fix a typo in comment
cgroup: get the wrong css for css_alloc() during cgroup_init_subsys()
cgroup: rstat: Mark benign data race to silence KCSAN
Core
----
- Defer freeing TCP skbs to the BH handler, whenever possible,
or at least perform the freeing outside of the socket lock section
to decrease cross-CPU allocator work and improve latency.
- Add netdevice refcount tracking to locate sources of netdevice
and net namespace refcount leaks.
- Make Tx watchdog less intrusive - avoid pausing Tx and restarting
all queues from a single CPU removing latency spikes.
- Various small optimizations throughout the stack from Eric Dumazet.
- Make netdev->dev_addr[] constant, force modifications to go via
appropriate helpers to allow us to keep addresses in ordered data
structures.
- Replace unix_table_lock with per-hash locks, improving performance
of bind() calls.
- Extend skb drop tracepoint with a drop reason.
- Allow SO_MARK and SO_PRIORITY setsockopt under CAP_NET_RAW.
BPF
---
- New helpers:
- bpf_find_vma(), find and inspect VMAs for profiling use cases
- bpf_loop(), runtime-bounded loop helper trading some execution
time for much faster (if at all converging) verification
- bpf_strncmp(), improve performance, avoid compiler flakiness
- bpf_get_func_arg(), bpf_get_func_ret(), bpf_get_func_arg_cnt()
for tracing programs, all inlined by the verifier
- Support BPF relocations (CO-RE) in the kernel loader.
- Further the support for BTF_TYPE_TAG annotations.
- Allow access to local storage in sleepable helpers.
- Convert verifier argument types to a composable form with different
attributes which can be shared across types (ro, maybe-null).
- Prepare libbpf for upcoming v1.0 release by cleaning up APIs,
creating new, extensible ones where missing and deprecating those
to be removed.
Protocols
---------
- WiFi (mac80211/cfg80211):
- notify user space about long "come back in N" AP responses,
allow it to react to such temporary rejections
- allow non-standard VHT MCS 10/11 rates
- use coarse time in airtime fairness code to save CPU cycles
- Bluetooth:
- rework of HCI command execution serialization to use a common
queue and work struct, and improve handling errors reported
in the middle of a batch of commands
- rework HCI event handling to use skb_pull_data, avoiding packet
parsing pitfalls
- support AOSP Bluetooth Quality Report
- SMC:
- support net namespaces, following the RDMA model
- improve connection establishment latency by pre-clearing buffers
- introduce TCP ULP for automatic redirection to SMC
- Multi-Path TCP:
- support ioctls: SIOCINQ, OUTQ, and OUTQNSD
- support socket options: IP_TOS, IP_FREEBIND, IP_TRANSPARENT,
IPV6_FREEBIND, and IPV6_TRANSPARENT, TCP_CORK and TCP_NODELAY
- support cmsgs: TCP_INQ
- improvements in the data scheduler (assigning data to subflows)
- support fastclose option (quick shutdown of the full MPTCP
connection, similar to TCP RST in regular TCP)
- MCTP (Management Component Transport) over serial, as defined by
DMTF spec DSP0253 - "MCTP Serial Transport Binding".
Driver API
----------
- Support timestamping on bond interfaces in active/passive mode.
- Introduce generic phylink link mode validation for drivers which
don't have any quirks and where MAC capability bits fully express
what's supported. Allow PCS layer to participate in the validation.
Convert a number of drivers.
- Add support to set/get size of buffers on the Rx rings and size of
the tx copybreak buffer via ethtool.
- Support offloading TC actions as first-class citizens rather than
only as attributes of filters, improve sharing and device resource
utilization.
- WiFi (mac80211/cfg80211):
- support forwarding offload (ndo_fill_forward_path)
- support for background radar detection hardware
- SA Query Procedures offload on the AP side
New hardware / drivers
----------------------
- tsnep - FPGA based TSN endpoint Ethernet MAC used in PLCs with
real-time requirements for isochronous communication with protocols
like OPC UA Pub/Sub.
- Qualcomm BAM-DMUX WWAN - driver for data channels of modems
integrated into many older Qualcomm SoCs, e.g. MSM8916 or
MSM8974 (qcom_bam_dmux).
- Microchip LAN966x multi-port Gigabit AVB/TSN Ethernet Switch
driver with support for bridging, VLANs and multicast forwarding
(lan966x).
- iwlmei driver for co-operating between Intel's WiFi driver and
Intel's Active Management Technology (AMT) devices.
- mse102x - Vertexcom MSE102x Homeplug GreenPHY chips
- Bluetooth:
- MediaTek MT7921 SDIO devices
- Foxconn MT7922A
- Realtek RTL8852AE
Drivers
-------
- Significantly improve performance in the datapaths of:
lan78xx, ax88179_178a, lantiq_xrx200, bnxt.
- Intel Ethernet NICs:
- igb: support PTP/time PEROUT and EXTTS SDP functions on
82580/i354/i350 adapters
- ixgbevf: new PF -> VF mailbox API which avoids the risk of
mailbox corruption with ESXi
- iavf: support configuration of VLAN features of finer granularity,
stacked tags and filtering
- ice: PTP support for new E822 devices with sub-ns precision
- ice: support firmware activation without reboot
- Mellanox Ethernet NICs (mlx5):
- expose control over IRQ coalescing mode (CQE vs EQE) via ethtool
- support TC forwarding when tunnel encap and decap happen between
two ports of the same NIC
- dynamically size and allow disabling various features to save
resources for running in embedded / SmartNIC scenarios
- Broadcom Ethernet NICs (bnxt):
- use page frag allocator to improve Rx performance
- expose control over IRQ coalescing mode (CQE vs EQE) via ethtool
- Other Ethernet NICs:
- amd-xgbe: add Ryzen 6000 (Yellow Carp) Ethernet support
- Microsoft cloud/virtual NIC (mana):
- add XDP support (PASS, DROP, TX)
- Mellanox Ethernet switches (mlxsw):
- initial support for Spectrum-4 ASICs
- VxLAN with IPv6 underlay
- Marvell Ethernet switches (prestera):
- support flower flow templates
- add basic IP forwarding support
- NXP embedded Ethernet switches (ocelot & felix):
- support Per-Stream Filtering and Policing (PSFP)
- enable cut-through forwarding between ports by default
- support FDMA to improve packet Rx/Tx to CPU
- Other embedded switches:
- hellcreek: improve trapping management (STP and PTP) packets
- qca8k: support link aggregation and port mirroring
- Qualcomm 802.11ax WiFi (ath11k):
- qca6390, wcn6855: enable 802.11 power save mode in station mode
- BSS color change support
- WCN6855 hw2.1 support
- 11d scan offload support
- scan MAC address randomization support
- full monitor mode, only supported on QCN9074
- qca6390/wcn6855: report signal and tx bitrate
- qca6390: rfkill support
- qca6390/wcn6855: regdb.bin support
- Intel WiFi (iwlwifi):
- support SAR GEO Offset Mapping (SGOM) and Time-Aware-SAR (TAS)
in cooperation with the BIOS
- support for Optimized Connectivity Experience (OCE) scan
- support firmware API version 68
- lots of preparatory work for the upcoming Bz device family
- MediaTek WiFi (mt76):
- Specific Absorption Rate (SAR) support
- mt7921: 160 MHz channel support
- RealTek WiFi (rtw88):
- Specific Absorption Rate (SAR) support
- scan offload
- Other WiFi NICs
- ath10k: support fetching (pre-)calibration data from nvmem
- brcmfmac: configure keep-alive packet on suspend
- wcn36xx: beacon filter support
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmHbkZAACgkQMUZtbf5S
IruYkQ//XX7BggcwBfukPK83j0dONolClijqKcKR08g4vB5L8GXvv6OErKIWrh4k
h8JanCH352ZkbCSw3MvFdm825UYQv8vPMd6Qks/LJ4aSKqCuy4MIlAo+yOw4Km3O
i7++lRfma6DqHHI59wvLjWoxZSPu8lL+rI8UsZ5qMOlnNlGAOXsNrzRjaqQ3FddY
AMxZeBUtrPqUCCQZFq3U8apkYzUp7CA/3XR9zRcja3uPbrtOV2G+4whRF90qGNWz
Tm/QvJ9F/Ab292cbhxR4KuaQ3hUhaCQyDjbZk3+FZzZpAVhYTVqcNjny6+yXmbiP
NXRtwemnl1NlWKMnJM8lEeY48u626tRIkxA/Wtd61uoO5uKUSxfGP+UpUi+DfXbF
yIw50VQ7L2bpxXP/HjtmhVgZDaWKYyh22Zw4Hp/muMJz0hgUB0KODY3tf2jUWbjJ
0oEgocWyzhhwMQKqupTDCIaRgIs2ewYr4ZrFDhI3HnHC/vv1VjoPRUPIyxwppD2N
cXvZb3B1sWK8iX5gCbISGzyU4bB7I0rvJSTU42ueti7n6NqRFZ79qHQpYnnY+JdO
z1qOwY/d/yWfBoXVKRtRg2qz6CdEt5BQklwAgVEBgrFpf58gp694EwGMb1htY14J
r/k9bVpmyIFpUnBH2CPMRfBVA3tUTqzyzzFV4AMw40NYLKmhLdo=
=KLm3
-----END PGP SIGNATURE-----
Merge tag '5.17-net-next' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"Core
----
- Defer freeing TCP skbs to the BH handler, whenever possible, or at
least perform the freeing outside of the socket lock section to
decrease cross-CPU allocator work and improve latency.
- Add netdevice refcount tracking to locate sources of netdevice and
net namespace refcount leaks.
- Make Tx watchdog less intrusive - avoid pausing Tx and restarting
all queues from a single CPU removing latency spikes.
- Various small optimizations throughout the stack from Eric Dumazet.
- Make netdev->dev_addr[] constant, force modifications to go via
appropriate helpers to allow us to keep addresses in ordered data
structures.
- Replace unix_table_lock with per-hash locks, improving performance
of bind() calls.
- Extend skb drop tracepoint with a drop reason.
- Allow SO_MARK and SO_PRIORITY setsockopt under CAP_NET_RAW.
BPF
---
- New helpers:
- bpf_find_vma(), find and inspect VMAs for profiling use cases
- bpf_loop(), runtime-bounded loop helper trading some execution
time for much faster (if at all converging) verification
- bpf_strncmp(), improve performance, avoid compiler flakiness
- bpf_get_func_arg(), bpf_get_func_ret(), bpf_get_func_arg_cnt()
for tracing programs, all inlined by the verifier
- Support BPF relocations (CO-RE) in the kernel loader.
- Further the support for BTF_TYPE_TAG annotations.
- Allow access to local storage in sleepable helpers.
- Convert verifier argument types to a composable form with different
attributes which can be shared across types (ro, maybe-null).
- Prepare libbpf for upcoming v1.0 release by cleaning up APIs,
creating new, extensible ones where missing and deprecating those
to be removed.
Protocols
---------
- WiFi (mac80211/cfg80211):
- notify user space about long "come back in N" AP responses,
allow it to react to such temporary rejections
- allow non-standard VHT MCS 10/11 rates
- use coarse time in airtime fairness code to save CPU cycles
- Bluetooth:
- rework of HCI command execution serialization to use a common
queue and work struct, and improve handling errors reported in
the middle of a batch of commands
- rework HCI event handling to use skb_pull_data, avoiding packet
parsing pitfalls
- support AOSP Bluetooth Quality Report
- SMC:
- support net namespaces, following the RDMA model
- improve connection establishment latency by pre-clearing buffers
- introduce TCP ULP for automatic redirection to SMC
- Multi-Path TCP:
- support ioctls: SIOCINQ, OUTQ, and OUTQNSD
- support socket options: IP_TOS, IP_FREEBIND, IP_TRANSPARENT,
IPV6_FREEBIND, and IPV6_TRANSPARENT, TCP_CORK and TCP_NODELAY
- support cmsgs: TCP_INQ
- improvements in the data scheduler (assigning data to subflows)
- support fastclose option (quick shutdown of the full MPTCP
connection, similar to TCP RST in regular TCP)
- MCTP (Management Component Transport) over serial, as defined by
DMTF spec DSP0253 - "MCTP Serial Transport Binding".
Driver API
----------
- Support timestamping on bond interfaces in active/passive mode.
- Introduce generic phylink link mode validation for drivers which
don't have any quirks and where MAC capability bits fully express
what's supported. Allow PCS layer to participate in the validation.
Convert a number of drivers.
- Add support to set/get size of buffers on the Rx rings and size of
the tx copybreak buffer via ethtool.
- Support offloading TC actions as first-class citizens rather than
only as attributes of filters, improve sharing and device resource
utilization.
- WiFi (mac80211/cfg80211):
- support forwarding offload (ndo_fill_forward_path)
- support for background radar detection hardware
- SA Query Procedures offload on the AP side
New hardware / drivers
----------------------
- tsnep - FPGA based TSN endpoint Ethernet MAC used in PLCs with
real-time requirements for isochronous communication with protocols
like OPC UA Pub/Sub.
- Qualcomm BAM-DMUX WWAN - driver for data channels of modems
integrated into many older Qualcomm SoCs, e.g. MSM8916 or MSM8974
(qcom_bam_dmux).
- Microchip LAN966x multi-port Gigabit AVB/TSN Ethernet Switch driver
with support for bridging, VLANs and multicast forwarding
(lan966x).
- iwlmei driver for co-operating between Intel's WiFi driver and
Intel's Active Management Technology (AMT) devices.
- mse102x - Vertexcom MSE102x Homeplug GreenPHY chips
- Bluetooth:
- MediaTek MT7921 SDIO devices
- Foxconn MT7922A
- Realtek RTL8852AE
Drivers
-------
- Significantly improve performance in the datapaths of: lan78xx,
ax88179_178a, lantiq_xrx200, bnxt.
- Intel Ethernet NICs:
- igb: support PTP/time PEROUT and EXTTS SDP functions on
82580/i354/i350 adapters
- ixgbevf: new PF -> VF mailbox API which avoids the risk of
mailbox corruption with ESXi
- iavf: support configuration of VLAN features of finer
granularity, stacked tags and filtering
- ice: PTP support for new E822 devices with sub-ns precision
- ice: support firmware activation without reboot
- Mellanox Ethernet NICs (mlx5):
- expose control over IRQ coalescing mode (CQE vs EQE) via ethtool
- support TC forwarding when tunnel encap and decap happen between
two ports of the same NIC
- dynamically size and allow disabling various features to save
resources for running in embedded / SmartNIC scenarios
- Broadcom Ethernet NICs (bnxt):
- use page frag allocator to improve Rx performance
- expose control over IRQ coalescing mode (CQE vs EQE) via ethtool
- Other Ethernet NICs:
- amd-xgbe: add Ryzen 6000 (Yellow Carp) Ethernet support
- Microsoft cloud/virtual NIC (mana):
- add XDP support (PASS, DROP, TX)
- Mellanox Ethernet switches (mlxsw):
- initial support for Spectrum-4 ASICs
- VxLAN with IPv6 underlay
- Marvell Ethernet switches (prestera):
- support flower flow templates
- add basic IP forwarding support
- NXP embedded Ethernet switches (ocelot & felix):
- support Per-Stream Filtering and Policing (PSFP)
- enable cut-through forwarding between ports by default
- support FDMA to improve packet Rx/Tx to CPU
- Other embedded switches:
- hellcreek: improve trapping management (STP and PTP) packets
- qca8k: support link aggregation and port mirroring
- Qualcomm 802.11ax WiFi (ath11k):
- qca6390, wcn6855: enable 802.11 power save mode in station mode
- BSS color change support
- WCN6855 hw2.1 support
- 11d scan offload support
- scan MAC address randomization support
- full monitor mode, only supported on QCN9074
- qca6390/wcn6855: report signal and tx bitrate
- qca6390: rfkill support
- qca6390/wcn6855: regdb.bin support
- Intel WiFi (iwlwifi):
- support SAR GEO Offset Mapping (SGOM) and Time-Aware-SAR (TAS)
in cooperation with the BIOS
- support for Optimized Connectivity Experience (OCE) scan
- support firmware API version 68
- lots of preparatory work for the upcoming Bz device family
- MediaTek WiFi (mt76):
- Specific Absorption Rate (SAR) support
- mt7921: 160 MHz channel support
- RealTek WiFi (rtw88):
- Specific Absorption Rate (SAR) support
- scan offload
- Other WiFi NICs
- ath10k: support fetching (pre-)calibration data from nvmem
- brcmfmac: configure keep-alive packet on suspend
- wcn36xx: beacon filter support"
* tag '5.17-net-next' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2048 commits)
tcp: tcp_send_challenge_ack delete useless param `skb`
net/qla3xxx: Remove useless DMA-32 fallback configuration
rocker: Remove useless DMA-32 fallback configuration
hinic: Remove useless DMA-32 fallback configuration
lan743x: Remove useless DMA-32 fallback configuration
net: enetc: Remove useless DMA-32 fallback configuration
cxgb4vf: Remove useless DMA-32 fallback configuration
cxgb4: Remove useless DMA-32 fallback configuration
cxgb3: Remove useless DMA-32 fallback configuration
bnx2x: Remove useless DMA-32 fallback configuration
et131x: Remove useless DMA-32 fallback configuration
be2net: Remove useless DMA-32 fallback configuration
vmxnet3: Remove useless DMA-32 fallback configuration
bna: Simplify DMA setting
net: alteon: Simplify DMA setting
myri10ge: Simplify DMA setting
qlcnic: Simplify DMA setting
net: allwinner: Fix print format
page_pool: remove spinlock in page_pool_refill_alloc_cache()
amt: fix wrong return type of amt_send_membership_update()
...
Convert 'allowed' in __cpuset_node_allowed() to be boolean since the
return types of node_isset() and __cpuset_node_allowed() are both
boolean.
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
After commit dc26532aed ("cgroup: rstat: punt root-level optimization to
individual controllers"), each rstat on updated_children list has its
->updated_next not NULL.
This means we can remove the check on ->updated_next, if we make sure
the subtree from @root is on list, which could be done by checking
updated_next for root.
tj: Coding style fixes.
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>