A use-after-free (UAF) issue was observed:
BUG: KASAN: slab-use-after-free in page_counter_uncharge+0x65/0x150
Write of size 8 at addr ffff888106715440 by task insmod/527
CPU: 4 UID: 0 PID: 527 Comm: insmod 6.19.0-rc7-next-20260129+ #11
Tainted: [O]=OOT_MODULE
Call Trace:
<TASK>
dump_stack_lvl+0x82/0xd0
kasan_report+0xca/0x100
kasan_check_range+0x39/0x1c0
page_counter_uncharge+0x65/0x150
dmem_cgroup_uncharge+0x1f/0x260
Allocated by task 527:
Freed by task 0:
The buggy address belongs to the object at ffff888106715400
which belongs to the cache kmalloc-512 of size 512
The buggy address is located 64 bytes inside of
freed 512-byte region [ffff888106715400, ffff888106715600)
The buggy address belongs to the physical page:
Memory state around the buggy address:
ffff888106715300: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff888106715380: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff888106715400: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff888106715480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff888106715500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
The issue occurs because a pool can still be held by a caller after its
associated memory region is unregistered. The current implementation frees
the pool even while callers may still be using it (for example, an uncharge
operation that has not yet completed).
Add a reference counter to each pool, ensuring that a pool is only freed
when its reference count drops to zero.
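A rough sketch of the idea (struct layout and helper names here are
illustrative, not the exact patch):

    struct dmem_cgroup_pool_state {
            /* ... existing members ... */
            refcount_t ref;         /* one count per holder of the pool */
    };

    static void dmem_pool_get(struct dmem_cgroup_pool_state *pool)
    {
            refcount_inc(&pool->ref);
    }

    static void dmem_pool_put(struct dmem_cgroup_pool_state *pool)
    {
            /* last holder (e.g. a pending uncharge) gone: safe to free */
            if (refcount_dec_and_test(&pool->ref))
                    kfree(pool);
    }

With this, unregistering a region drops the region's reference instead of
freeing the pool directly, and uncharge drops the caller's reference once
it is finished with the pool.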
Fixes: b168ed458d ("kernel/cgroup: Add "dmem" memory accounting cgroup")
Cc: stable@vger.kernel.org # v6.14+
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
A warning was detected:
WARNING: suspicious RCU usage
6.19.0-rc7-next-20260129+ #1101 Tainted: G O
kernel/cgroup/dmem.c:456 suspicious rcu_dereference_check() usage!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
1 lock held by insmod/532:
#0: ffffffff85e78b38 (dmemcg_lock){+.+.}-dmem_cgroup_unregister_region+
stack backtrace:
CPU: 2 UID: 0 PID: 532 Comm: insmod Tainted: 6.19.0-rc7-next-
Tainted: [O]=OOT_MODULE
Call Trace:
<TASK>
dump_stack_lvl+0xb0/0xd0
lockdep_rcu_suspicious+0x151/0x1c0
dmem_cgroup_unregister_region+0x1e2/0x380
? __pfx_dmem_test_init+0x10/0x10 [dmem_uaf]
dmem_test_init+0x65/0xff0 [dmem_uaf]
do_one_initcall+0xbb/0x3a0
The macro list_for_each_rcu() must be used within an RCU read-side critical
section (between rcu_read_lock() and rcu_read_unlock()). Using it outside
that context, as seen in dmem_cgroup_unregister_region(), triggers the
lockdep warning because the RCU protection is not guaranteed.
Replace list_for_each_rcu() with list_for_each_entry_safe(), which is
appropriate for traversal under spinlock protection where nodes may be
deleted.
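The resulting traversal looks roughly like this (the list head and node
names, ->pools and ->region_node, are assumptions for illustration):

    struct dmem_cgroup_pool_state *pool, *next;

    spin_lock(&dmemcg_lock);
    /*
     * The list is protected by dmemcg_lock here, not by RCU; the _safe
     * iterator caches the next node so the current entry can be
     * unlinked while walking.
     */
    list_for_each_entry_safe(pool, next, &region->pools, region_node)
            list_del(&pool->region_node);
    spin_unlock(&dmemcg_lock);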
Fixes: b168ed458d ("kernel/cgroup: Add "dmem" memory accounting cgroup")
Cc: stable@vger.kernel.org # v6.14+
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
list_for_each_rcu() relies on the rcu_dereference() API, which list.h does
not provide. At the same time, list.h is a low-level basic header that must
not grow dependencies such as RCU, not least because of the potential for
circular includes in some cases. With all that said, move this RCU-related
API to rculist.h, where it belongs.
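For reference, the macro reads roughly as follows; the rcu_dereference()
calls are what drag in the RCU machinery:

    #define list_for_each_rcu(pos, head)                  \
            for (pos = rcu_dereference((head)->next);     \
                 !list_is_head(pos, (head));              \
                 pos = rcu_dereference(pos->next))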
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Simona Vetter <simona.vetter@ffwll.ch>
Reviewed-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org>
The current implementation has a bug: If the current css doesn't
contain any pool that is a descendant of the "pool" (i.e. when
found_descendant == false), then "pool" will point to some unrelated
pool. If the current css has a child, we'll overwrite parent_pool with
this unrelated pool on the next iteration.
Since we can just check whether a pool refers to the same region to
determine whether or not it's related, all the additional pool tracking
is unnecessary, so just switch to using css_for_each_descendant_pre for
traversal.
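The traversal then becomes roughly the following (the ->cs member and the
find_region_pool()/apply_protection() helpers are placeholders for
illustration):

    struct cgroup_subsys_state *css;

    rcu_read_lock();
    css_for_each_descendant_pre(css, &limit_pool->cs->css) {
            /* Pre-order: a parent css is visited before its children. */
            struct dmem_cgroup_pool_state *pool =
                    find_region_pool(css, limit_pool->region);

            /* Only pools charging the same region are related. */
            if (pool)
                    apply_protection(limit_pool, pool);
    }
    rcu_read_unlock();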
Fixes: b168ed458d ("kernel/cgroup: Add "dmem" memory accounting cgroup")
Signed-off-by: Friedrich Vock <friedrich.vock@gmx.de>
Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250127152754.21325-1-friedrich.vock@gmx.de
Signed-off-by: Maarten Lankhorst <dev@lankhorst.se>
During the dmem cgroup development, the parameters to the
dmem_cgroup_state_evict_valuable() and dmem_cgroup_try_charge() were
changed, but the documentation wasn't adjusted accordingly.
This results in a documentation build warning. Adjust the documentation
to reflect the final function parameters.
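For example, the kerneldoc for dmem_cgroup_state_evict_valuable() has to
describe the pool-state arguments the final prototype takes; a sketch
(parameter names assumed from the final prototypes, wording illustrative):

    /**
     * dmem_cgroup_state_evict_valuable() - check if evicting from a pool helps
     * @limit_pool: the pool whose limit was hit
     * @test_pool: the pool the eviction candidate is charged to
     * @ignore_low: whether the low watermark may be ignored
     * @ret_hit_low: set when only the low watermark prevented eviction
     */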
Fixes: b168ed458d ("kernel/cgroup: Add "dmem" memory accounting cgroup")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Closes: https://lore.kernel.org/r/20250113160334.1f09f881@canb.auug.org.au/
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Simona Vetter <simona.vetter@ffwll.ch>
Link: https://patchwork.freedesktop.org/patch/msgid/20250113092608.1349287-2-mripard@kernel.org
Signed-off-by: Maxime Ripard <mripard@kernel.org>
This code is based on the RDMA and misc cgroup initially, but now
uses page_counter. It uses the same min/low/max semantics as the memory
cgroup as a result.
There's a small mismatch, as TTM uses u64 while page_counter counts in
longs (units of pages). In practice it's not a problem: 32-bit systems
don't really come with >=4GB cards, and as long as we're consistently
wrong with the units, it's fine. The device page size may not be in the
same units as the kernel page size, and each region might also have a
different page size (VRAM vs GART, for example).
The interface is simple (a kernel-side usage sketch follows the list):
- Call dmem_cgroup_register_region().
- Use dmem_cgroup_try_charge() to check if you can allocate a chunk of
  memory, and dmem_cgroup_uncharge() when freeing it. try_charge may return
  an error code, or -EAGAIN when the cgroup limit is reached; in that case
  a reference to the limiting pool is returned.
- The limiting pool state can be used as the comparison argument for
  dmem_cgroup_state_evict_valuable().
- After having evicted enough, drop the reference to the limiting pool
  state with dmem_cgroup_pool_state_put().
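A rough driver-side sketch (error handling trimmed; the region name,
sizes, and exact return conventions are illustrative):

    struct dmem_cgroup_region *region;
    struct dmem_cgroup_pool_state *pool, *limit_pool;
    int ret;

    /* Once, at device init; the format string is just an example. */
    region = dmem_cgroup_register_region(vram_size, "drm/%s/vram",
                                         dev_name(dev));

    /* For each allocation: */
    ret = dmem_cgroup_try_charge(region, size, &pool, &limit_pool);
    if (ret == -EAGAIN) {
            /*
             * A cgroup limit was hit; limit_pool identifies the limiting
             * cgroup. Evict buffers that dmem_cgroup_state_evict_valuable()
             * deems worth evicting, then drop the reference and retry.
             */
            dmem_cgroup_pool_state_put(limit_pool);
    }

    /* When the allocation is freed again: */
    dmem_cgroup_uncharge(pool, size);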
This API allows you to limit device resources with cgroups.
You can see the supported cards in /sys/fs/cgroup/dmem.capacity.
You need to echo +dmem to cgroup.subtree_control, and then you can
partition device memory.
Co-developed-by: Friedrich Vock <friedrich.vock@gmx.de>
Signed-off-by: Friedrich Vock <friedrich.vock@gmx.de>
Co-developed-by: Maxime Ripard <mripard@kernel.org>
Signed-off-by: Maarten Lankhorst <dev@lankhorst.se>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20241204143112.1250983-1-dev@lankhorst.se
Signed-off-by: Maxime Ripard <mripard@kernel.org>