mirror-linux

Tejun Heo 93618edf75 cgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated

A chain of commits going back to v7.0 reworked rmdir to satisfy the controller invariant that a subsystem's ->css_offline() must not run while tasks are still doing kernel-side work in the cgroup.

  [1] `d245698d72` ("cgroup: Defer task cgroup unlink until after the task is done switching out")
  [2] `a72f73c4dd` ("cgroup: Don't expose dead tasks in cgroup")
  [3] `1b164b876c` ("cgroup: Wait for dying tasks to leave on rmdir")
  [4] `4c56a8ac68` ("cgroup: Fix cgroup_drain_dying() testing the wrong condition")
  [5] `13e786b64b` ("cgroup: Increment nr_dying_subsys_* from rmdir context")

[1] moved task cset unlink from do_exit() to finish_task_switch() so a task's cset link drops only after the task has fully stopped scheduling. That made tasks past exit_signals() linger on cset->tasks until their final context switch, which led to a series of problems as what userspace expected to see after rmdir diverged from what the kernel needs to wait for.

[2]-[5] tried to bridge that divergence: [2] filtered the exiting tasks from cgroup.procs; [3] had rmdir(2) sleep in TASK_UNINTERRUPTIBLE for them; [4] fixed the wait's condition; [5] made nr_dying_subsys_* visible synchronously.

The cgroup_drain_dying() wait in [3] turned out to be a dead end. When the rmdir caller is also the reaper of a zombie that pins a pidns teardown (e.g. host PID 1 systemd reaping orphan pids that were re-parented to it during the same teardown), rmdir blocks in TASK_UNINTERRUPTIBLE waiting for those pids to free, the pids can't free because PID 1 is the reaper and it's stuck in rmdir, and the system A-A deadlocks. No internal lock ordering breaks this; the wait itself is the bug.

The css killing side that drove the original reorder, however, can be made cleanly asynchronous: ->css_offline() is already async, run from css_killed_work_fn() driven by percpu_ref_kill_and_confirm().

The fix is to make that chain start only after all tasks have left the cgroup. rmdir's user-visible side then returns as soon as cgroup.procs and friends are empty, while ->css_offline() still runs only after the cgroup is fully drained.

Verified by the original reproducer (pidns teardown + zombie reaper, runs under vng) which hangs vanilla and succeeds here, and by per-commit deterministic repros for [2], [3], [4], [5] with a boot parameter that widens the post-exit_signals() window so each state is reliably reachable. Some stress tests on top of that.

cgroup_apply_control_disable() has the same shape of pre-existing race: when a controller is disabled via subtree_control, kill_css() ran synchronously while tasks past exit_signals() could still be linked to the cgroup's csets, and ->css_offline() could fire before they drained. This patch preserves the existing synchronous behavior at that call site (kill_css_sync() + kill_css_finish() back-to-back) and a follow-up patch will defer kill_css_finish() there using a per-css trigger.

This seems like the right approach and I don't see problems with it. The changes are somewhat invasive but not excessively so, so backporting to -stable should be okay. If something does turn out to be wrong, the fallback is to revert the entire chain ([1]-[5]) and rework in the development branch instead.

v2: Pin cgrp across the deferred destroy work with explicit cgroup_get()/cgroup_put() around queue_work() and the work_fn. v1 wasn't actually broken (ordered cgroup_offline_wq + queue_work order in cgroup_task_dead() saved it) but the explicit ref removes the dependency on those non-obvious invariants. Also note the pre-existing cgroup_apply_control_disable() race in the description; a follow-up will defer kill_css_finish() there.

Fixes: `1b164b876c` ("cgroup: Wait for dying tasks to leave on rmdir")
Cc: stable@vger.kernel.org # v7.0+
Reported-and-tested-by: Martin Pitt <martin@piware.de>
Link: https://lore.kernel.org/all/afHNg2VX2jy9bW7y@piware.de/
Link: https://lore.kernel.org/all/35e0670adb4abeab13da2c321582af9f@kernel.org/
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

2026-05-04 08:52:26 -10:00
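
The ordering described in the commit message above can be illustrated with a toy user-space model. This is only a sketch under stated assumptions, not the kernel implementation: every name in it (toy_cgroup, toy_rmdir, toy_task_dead, toy_kill_css_finish, nr_dying, kill_pending) is hypothetical, and it tracks only exiting tasks that are still linked to the cgroup. It shows rmdir returning without waiting, while the stand-in for ->css_offline() runs only once the last lingering task has dropped its link.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only; none of these names are kernel API. */
struct toy_cgroup {
	int  nr_dying;      /* exiting tasks still linked to the cgroup */
	bool rmdir_done;    /* rmdir has returned to the caller */
	bool kill_pending;  /* css kill deferred until the cgroup drains */
	bool css_offlined;  /* stands in for ->css_offline() having run */
};

/* Stand-in for starting the kill chain that ends in ->css_offline(). */
static void toy_kill_css_finish(struct toy_cgroup *cg)
{
	/* The invariant from the commit message: no tasks left at this point. */
	assert(cg->nr_dying == 0);
	cg->kill_pending = false;
	cg->css_offlined = true;
}

/* rmdir never sleeps here: it kills right away or records a deferred kill. */
static void toy_rmdir(struct toy_cgroup *cg)
{
	cg->rmdir_done = true;
	if (cg->nr_dying == 0)
		toy_kill_css_finish(cg);
	else
		cg->kill_pending = true;
}

/* Called when an exiting task finally drops its link to the cgroup. */
static void toy_task_dead(struct toy_cgroup *cg)
{
	assert(cg->nr_dying > 0);
	if (--cg->nr_dying == 0 && cg->kill_pending)
		toy_kill_css_finish(cg);
}
```

With two lingering tasks, toy_rmdir() returns with kill_pending set and css_offlined still false; css_offlined becomes true only on the second toy_task_dead() call. Because the caller never blocks, a caller that is also the reaper of those tasks cannot deadlock against itself in this model.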
Makefile	kernel/cgroup: Add "dmem" memory accounting cgroup	2025-01-06 17:24:38 +01:00
cgroup-internal.h	cgroup: Expose some cgroup helpers	2026-03-05 18:15:58 -10:00
cgroup-v1.c	Convert 'alloc_obj' family to use the new default GFP_KERNEL argument	2026-02-21 17:09:51 -08:00
cgroup.c	cgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated	2026-05-04 08:52:26 -10:00
cpuset-internal.h	cgroup/cpuset: record DL BW alloc CPU for attach rollback	2026-04-17 08:57:37 -10:00
cpuset-v1.c	Convert 'alloc_obj' family to use the new default GFP_KERNEL argument	2026-02-21 17:09:51 -08:00
cpuset.c	cgroup/cpuset: record DL BW alloc CPU for attach rollback	2026-04-17 08:57:37 -10:00
debug.c	Convert 'alloc_obj' family to use the new default GFP_KERNEL argument	2026-02-21 17:09:51 -08:00
dmem.c	cgroup/dmem: remove region parameter from dmemcg_parse_limit	2026-03-21 09:24:02 -10:00
freezer.c	cgroup: cgroup.stat.local time accounting	2025-08-22 07:50:43 -10:00
legacy_freezer.c	Convert 'alloc_obj' family to use the new default GFP_KERNEL argument	2026-02-21 17:09:51 -08:00
misc.c	Convert 'alloc_obj' family to use the new default GFP_KERNEL argument	2026-02-21 17:09:51 -08:00
namespace.c	treewide: Replace kmalloc with kmalloc_obj for non-scalar types	2026-02-21 01:02:28 -08:00
pids.c	Convert 'alloc_obj' family to use the new default GFP_KERNEL argument	2026-02-21 17:09:51 -08:00
rdma.c	cgroup/rdma: fix integer overflow in rdmacg_try_charge()	2026-04-17 07:25:27 -10:00
rstat.c	cgroup: rstat: use LOCK CMPXCHG in css_rstat_updated	2025-12-08 08:26:56 -10:00