mirror-linux/kernel/cgroup
Shakeel Butt 3309b63a22 cgroup: rstat: use LOCK CMPXCHG in css_rstat_updated
On x86-64, this_cpu_cmpxchg() uses CMPXCHG without LOCK prefix which
means it is only safe for the local CPU and not for multiple CPUs.
Recently the commit 36df6e3dbd ("cgroup: make css_rstat_updated nmi
safe") made css_rstat_updated lockless and uses a lockless list to allow
reentrancy. Since css_rstat_updated can be invoked from process context,
IRQ and NMI, it uses this_cpu_cmpxchg() to select the winner which will
insert the lockless lnode into the global per-cpu lockless list.

However the commit missed one case where the lockless node of a cgroup
can be accessed and modified by another CPU doing the flushing, namely
llist_del_first_init() in css_process_update_tree().
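
The claim-then-insert pattern and the flusher's pop-and-reinit can be
sketched in userspace C11. This is an illustrative model only, not the
kernel code: every model_* name is hypothetical, and C11 atomics stand in
for the kernel's per-CPU and llist primitives. It assumes, as the kernel's
llist helpers do, that an off-list node points to itself and that only one
flusher pops from a given list at a time. On x86-64,
atomic_compare_exchange_strong() compiles to a LOCK-prefixed CMPXCHG.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for llist_node and llist_head. */
struct model_node { _Atomic(struct model_node *) next; };
struct model_list { _Atomic(struct model_node *) first; };

/* An off-list node points to itself; anything else reads as "on list". */
static bool model_on_list(struct model_node *n)
{
	return atomic_load(&n->next) != n;
}

/* Winner selection: only one caller can swing next from self to NULL. */
static bool model_claim(struct model_node *n)
{
	struct model_node *expected = n;
	struct model_node *claimed = NULL;

	return atomic_compare_exchange_strong(&n->next, &expected, claimed);
}

/* Lockless push onto the head of the list. */
static void model_push(struct model_list *h, struct model_node *n)
{
	struct model_node *first = atomic_load(&h->first);

	do {
		atomic_store(&n->next, first);
	} while (!atomic_compare_exchange_weak(&h->first, &first, n));
}

/* Updater side: cheap check, then claim, then insert. */
static bool model_updated(struct model_list *h, struct model_node *n)
{
	if (model_on_list(n))
		return false;
	if (!model_claim(n))
		return false;
	model_push(h, n);
	return true;
}

/* Flusher side: pop the first node and mark it off-list again. */
static struct model_node *model_pop_init(struct model_list *h)
{
	struct model_node *first = atomic_load(&h->first);

	while (first) {
		struct model_node *next = atomic_load(&first->next);

		if (atomic_compare_exchange_weak(&h->first, &first, next))
			break;
	}
	if (first)
		atomic_store(&first->next, first);
	return first;
}

/* Two updates for the same node must yield exactly one list entry. */
static int model_demo(void)
{
	struct model_list list;
	struct model_node c;
	int popped = 0;

	atomic_init(&list.first, (struct model_node *)NULL);
	atomic_init(&c.next, &c);

	model_updated(&list, &c);
	model_updated(&list, &c);
	while (model_pop_init(&list))
		popped++;

	assert(!model_on_list(&c));
	return popped;
}
```

In this model the store in model_pop_init() runs on the flusher while
model_claim() may run on the updater, so the node's next pointer is shared
between CPUs and the compare-and-exchange has to be atomic across them.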

On a cursory look, it can be questioned how css_process_update_tree()
can see a lockless node in global lockless list where the updater is at
this_cpu_cmpxchg() and before llist_add() call in css_rstat_updated().
This can indeed happen in the presence of IRQs/NMI.

Consider this scenario: An updater for cgroup stat C on CPU A in process
context is after the llist_on_list() check and before this_cpu_cmpxchg()
in css_rstat_updated() when it gets interrupted by an IRQ/NMI. In the
IRQ/NMI context, a new updater calls css_rstat_updated() for the same
cgroup C and successfully inserts rstatc_pcpu->lnode.

Now concurrently CPU B is running the flusher and it calls
llist_del_first_init() for CPU A and gets rstatc_pcpu->lnode of cgroup C
which was added by the IRQ/NMI updater.

Now imagine CPU B calling init_llist_node() on cgroup C's
rstatc_pcpu->lnode of CPU A and on CPU A, the process context updater
calling this_cpu_cmpxchg(rstatc_pcpu->lnode) concurrently.

The CMPXCHG without LOCK on CPU A is not safe and thus we need the LOCK
prefix.
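
One way the scenario above can go wrong is replayed below as a
deterministic, single-threaded C sketch. It is an assumption-laden
illustration, not something taken from the commit: it models the unlocked
CMPXCHG as a separate load and store, relies on the documented x86
behaviour that CMPXCHG writes the destination even when the comparison
fails, and picks one specific interleaving. All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for llist_node and llist_head. */
struct sim_node { struct sim_node *next; };
struct sim_list { struct sim_node *first; };

/* An off-list node points to itself. */
static bool sim_marked_on_list(const struct sim_node *n)
{
	return n->next != n;
}

/*
 * Replay the interleaving from the commit message. Returns true when the
 * node ends up marked as on a list while being on no list at all.
 */
static bool sim_node_is_stuck(bool locked)
{
	struct sim_list head = { NULL };
	struct sim_node c;
	struct sim_node *old;

	c.next = &c;			/* cgroup C's node starts off-list */

	/* CPU A, process context: on-list check passes, then the NMI hits. */
	if (sim_marked_on_list(&c))
		return false;

	/* CPU A, NMI context: claim the node and push it. */
	c.next = NULL;			/* claim: self -> NULL */
	c.next = head.first;		/* push: link to the old head */
	head.first = &c;

	/* CPU B, flusher: pop the node from CPU A's list. */
	head.first = c.next;

	if (locked) {
		/* Load, compare and store are indivisible on CPU A ... */
		old = c.next;
		c.next = (old == &c) ? NULL : old;
		/* ... so CPU B's re-init lands afterwards and sticks. */
		c.next = &c;
	} else {
		old = c.next;		/* CPU A: load half of CMPXCHG */
		c.next = &c;		/* CPU B: re-init node to self */
		/* CPU A: store half; compare failed, old value written back */
		c.next = (old == &c) ? NULL : old;
	}

	return head.first != &c && sim_marked_on_list(&c);
}
```

With locked set to false the node is left looking on-list forever, so
every later on-list check for it returns early; that would match updates
being dropped. The locked branch shows only one of the two legal orders;
in the other, the compare succeeds after the re-init and the updater
simply re-inserts the node.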

In Meta's fleet running the kernel with the commit 36df6e3dbd, we are
observing on some machines that the memcg stats are getting skewed by
more than the actual memory on the system. On close inspection, we
noticed that the lockless node of a workload's cgroup for a specific CPU
was in a bad state and thus all the updates on that CPU for that cgroup
were being lost.

To confirm if this skew was indeed due to this CMPXCHG without LOCK in
css_rstat_updated(), we created a repro (using AI) at [1] which shows
that CMPXCHG without LOCK creates almost the same lnode corruption as
seen in Meta's fleet and with LOCK CMPXCHG the issue does not
reproduce.

Link: http://lore.kernel.org/efiagdwmzfwpdzps74fvcwq3n4cs36q33ij7eebcpssactv3zu@se4hqiwxcfxq [1]
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: stable@vger.kernel.org # v6.17+
Fixes: 36df6e3dbd ("cgroup: make css_rstat_updated nmi safe")
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-08 08:26:56 -10:00
..
Makefile kernel/cgroup: Add "dmem" memory accounting cgroup 2025-01-06 17:24:38 +01:00
cgroup-internal.h cgroup: replace global percpu_rwsem with per threadgroup resem when writing to cgroup.procs 2025-09-10 07:44:51 -10:00
cgroup-v1.c cgroup: replace global percpu_rwsem with per threadgroup resem when writing to cgroup.procs 2025-09-10 07:44:51 -10:00
cgroup.c Significant patch series in this merge are as follows: 2025-12-05 13:52:43 -08:00
cpuset-internal.h cpuset: remove global remote_children list 2025-11-11 11:47:08 -10:00
cpuset-v1.c cpuset: add helpers for cpus read and cpuset_mutex locks 2025-08-25 08:20:22 -10:00
cpuset.c cgroup: Changes for v6.19 2025-12-03 13:04:07 -08:00
debug.c cgroup: Remove redundant rcu_read_lock/unlock() in spin_lock 2025-09-16 08:36:14 -10:00
dmem.c rculist: move list_for_each_rcu() to where it belongs 2025-08-25 10:13:26 -07:00
freezer.c cgroup: cgroup.stat.local time accounting 2025-08-22 07:50:43 -10:00
legacy_freezer.c freezer: Clarify that only cgroup1 freezer uses PM freezer 2025-10-30 20:10:27 +01:00
misc.c Merge branch 'kvm-tdx-initial' into HEAD 2025-04-07 07:36:33 -04:00
namespace.c cgroup: add cgroup namespace to tree after owner is set 2025-10-31 10:16:24 +01:00
pids.c cgroup/pids: Remove unreachable paths of pids_{can,cancel}_fork 2024-08-05 10:32:16 -10:00
rdma.c rdmacg: fix kernel-doc warnings in rdmacg 2023-06-05 09:45:14 -10:00
rstat.c cgroup: rstat: use LOCK CMPXCHG in css_rstat_updated 2025-12-08 08:26:56 -10:00