mirror-linux/Documentation
Baoquan He 8e689f8ea4 mm/swap: do not choose swap device according to numa node
Patch series "mm/swapfile.c: select swap devices of default priority round
robin", v5.

Currently, on a system with multiple swap devices, swap allocation selects
a swap device according to priority.  The swap device with the highest
priority is allocated from first.

A priority from 0 to 32767 can be specified when a swap device is enabled
with swapon; otherwise the system assigns default priorities starting at
-2 and going downwards.  Meanwhile, on a NUMA system, a swap device that
belongs to a given node is considered first for allocations on that node.

In the current code, an array of plists, swap_avail_heads[nid], is used to
organize swap devices on each NUMA node.  For each NUMA node, there is a
plist organizing all swap devices.  The 'prio' value in the plist is the
negated value of the device's priority, because a plist is sorted from low
to high.  A swap device that belongs to a node is promoted to the front of
that node's list; the other swap devices follow in order of their default
priority.

E.g., on a system with 8 NUMA nodes, I set up 4 zram partitions as swap
devices.

Current behaviour:
Their priorities will be (note that -1 is skipped):
NAME       TYPE      SIZE USED PRIO
/dev/zram0 partition  16G   0B   -2
/dev/zram1 partition  16G   0B   -3
/dev/zram2 partition  16G   0B   -4
/dev/zram3 partition  16G   0B   -5

And their positions in the 8 swap_avail_lists[nid] will be:
swap_avail_lists[0]: /* node 0's available swap device list */
zram0   -> zram1   -> zram2   -> zram3
prio:1     prio:3     prio:4     prio:5
swap_avail_lists[1]: /* node 1's available swap device list */
zram1   -> zram0   -> zram2   -> zram3
prio:1     prio:2     prio:4     prio:5
swap_avail_lists[2]: /* node 2's available swap device list */
zram2   -> zram0   -> zram1   -> zram3
prio:1     prio:2     prio:3     prio:5
swap_avail_lists[3]: /* node 3's available swap device list */
zram3   -> zram0   -> zram1   -> zram2
prio:1     prio:2     prio:3     prio:4
swap_avail_lists[4-7]: /* node 4,5,6,7's available swap device list */
zram0   -> zram1   -> zram2   -> zram3
prio:2     prio:3     prio:4     prio:5
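
The ordering rule behind the lists above can be modelled in a few lines.
This is only an illustrative sketch in Python, not kernel code: the names
(devices, PROMOTED_KEY, avail_list) are made up for the example, and the
device-to-node mapping is inferred from the listing above.

```python
# Illustrative model of the per-node ordering; not kernel code.
NR_NODES = 8
PROMOTED_KEY = 1  # plist key a device gets on its own node (assumption from the listing)

# device name -> (default priority, NUMA node the device belongs to)
devices = {
    "zram0": (-2, 0),
    "zram1": (-3, 1),
    "zram2": (-4, 2),
    "zram3": (-5, 3),
}


def avail_list(nid):
    """Return [(name, plist_key)] for node `nid`, lowest key first."""
    entries = []
    for name, (prio, dev_node) in devices.items():
        # The plist key is the negated priority, except on the device's own node.
        key = PROMOTED_KEY if dev_node == nid else -prio
        entries.append((name, key))
    return sorted(entries, key=lambda entry: entry[1])


for nid in range(NR_NODES):
    print(nid, avail_list(nid))
```

Nodes 4 to 7 have no device of their own, so their lists are ordered purely
by the negated default priority.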

The adjustment for swap devices with a node_id was intended to reduce lock
contention on any single swap device by having different nodes use
different swap devices.  The adjustment was introduced in commit
a2468cc9bf ("swap: choose swap device according to numa node").
However, the adjustment is rather coarse-grained.  On a node that has a
swap device of its own, that device is always selected first by the node's
CPUs until it is exhausted, then the next one.  On the other nodes, where
no swap device shares the node id, the swap device with priority '-2' is
selected first until it is exhausted, then the one with priority '-3'.
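
A rough way to see the imbalance is to count how many nodes start with each
device.  The sketch below is illustrative Python, not kernel code; the
first_choice table is copied from the heads of the eight lists shown
earlier, and it assumes swap allocations come from all nodes about equally.

```python
from collections import Counter

# Head of each node's available list, taken from the listing above.
first_choice = {
    0: "zram0",
    1: "zram1",
    2: "zram2",
    3: "zram3",
    4: "zram0",
    5: "zram0",
    6: "zram0",
    7: "zram0",
}

# Number of nodes whose allocations start on each device.
nodes_per_device = Counter(first_choice.values())
print(nodes_per_device)
```

Under that assumption, five of the eight nodes start on zram0 while each
other device is the first choice of a single node.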

This is the swapon output while a high-pressure vm-scalability test is
running.  It clearly shows that zram0 is heavily used until it is
exhausted.

===================================
[root@hp-dl385g10-03 ~]# swapon
NAME       TYPE      SIZE  USED PRIO
/dev/zram0 partition  16G 15.7G   -2
/dev/zram1 partition  16G  3.4G   -3
/dev/zram2 partition  16G  3.4G   -4
/dev/zram3 partition  16G  2.6G   -5

The node-based strategy for selecting a swap device is much better than
the old way of using swap devices one by one.  However, it is still
unreasonable, because swap devices are assumed to have similar access
speed if no priority is specified at swapon time.  It is unfair, and makes
little sense, that a swap device gets a higher priority than another
merely because it was swapped on earlier.

So in this patchset, swap devices of default priority are selected round
robin.  In code, the plist array swap_avail_heads[nid] is replaced with a
single plist swap_avail_head, which reverts commit a2468cc9bf.
On top of the revert, a further change makes every device without a
specified priority get the same default priority '-1'.  Swap devices with
a specified priority are still always placed first; this is not affected.
If your swap devices differ in access speed, use 'swapon -p xx' to assign
priorities to them.

New behaviour:

swap_avail_list: /* one global available swap device list */
zram0   -> zram1   -> zram2   -> zram3
prio:1     prio:1     prio:1     prio:1
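
Round-robin selection among entries with an equal key can be sketched as
"take the head, then move it behind every entry with the same key".  The
Python below is a simplified model under that assumption, not the kernel
implementation; pick_and_requeue is a hypothetical helper, and device
exhaustion is not modelled.

```python
def pick_and_requeue(avail):
    """Pick the head of `avail` and requeue it after all entries with the same key.

    `avail` is a list of (plist_key, name) tuples sorted by key, lowest first.
    """
    key, name = avail.pop(0)
    # Find the position just past the last entry whose key is <= the picked key.
    idx = 0
    while idx < len(avail) and avail[idx][0] <= key:
        idx += 1
    avail.insert(idx, (key, name))
    return name


# All four devices share default priority -1, i.e. plist key 1.
avail = [(1, "zram0"), (1, "zram1"), (1, "zram2"), (1, "zram3")]
picks = [pick_and_requeue(avail) for _ in range(8)]
print(picks)
```

In this model a device with a specified priority, which has a lower key,
stays at the head and keeps being picked ahead of the default-priority
devices.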

This is the swapon output while the high-pressure vm-scalability test is
running; all devices are selected round robin:
=======================================
[root@hp-dl385g10-03 linux]# swapon
NAME       TYPE      SIZE  USED PRIO
/dev/zram0 partition  16G 12.6G   -1
/dev/zram1 partition  16G 12.6G   -1
/dev/zram2 partition  16G 12.6G   -1
/dev/zram3 partition  16G 12.6G   -1

With the change, we can see an efficiency improvement of about 18%, as
shown below:

vm-scalability test:
====================
Test with:
usemem --init-time -O -y -x -n 31 2G (4G memcg, zram as swap)
                           Before:          After:
System time:               637.92 s         526.74 s      (lower is better)
Sum Throughput:            3546.56 MB/s     4207.56 MB/s  (higher is better)
Single process Throughput: 114.40 MB/s      135.72 MB/s   (higher is better)
free latency:              10138455.99 us   6810119.01 us (lower is better)
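
The relative changes can be recomputed from the table.  The Python below
is just that arithmetic on the figures above; the variable and function
names are made up for the example.

```python
# (before, after) pairs copied from the table above.
system_time = (637.92, 526.74)               # seconds, lower is better
sum_throughput = (3546.56, 4207.56)          # MB/s, higher is better
single_throughput = (114.40, 135.72)         # MB/s, higher is better
free_latency = (10138455.99, 6810119.01)     # microseconds, lower is better


def pct_decrease(before, after):
    """Percentage by which `after` is lower than `before`."""
    return (before - after) / before * 100


def pct_increase(before, after):
    """Percentage by which `after` is higher than `before`."""
    return (after - before) / before * 100


print(f"system time:       -{pct_decrease(*system_time):.1f}%")
print(f"sum throughput:    +{pct_increase(*sum_throughput):.1f}%")
print(f"single throughput: +{pct_increase(*single_throughput):.1f}%")
print(f"free latency:      -{pct_decrease(*free_latency):.1f}%")
```

Throughput rises by roughly 18.6% and system time falls by roughly 17.4%,
which is where the figure of about 18% comes from; free latency falls by
about a third.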


This patch (of 2):

This reverts commit a2468cc9bf ("swap: choose swap device according to
numa node").

After this patch, the behaviour changes back to what it was before commit
a2468cc9bf.  That is, priorities are assigned from -1 downwards by
default, and when swapping, the swap devices are exhausted one by one in
order of priority from high to low.  This is preparation work for the
later change.

[root@hp-dl385g10-03 ~]# swapon
NAME       TYPE      SIZE   USED PRIO
/dev/zram0 partition  16G    16G   -1
/dev/zram1 partition  16G 966.2M   -2
/dev/zram2 partition  16G     0B   -3
/dev/zram3 partition  16G     0B   -4
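
The one-by-one behaviour after the revert can be modelled as filling the
devices strictly in priority order.  This is an illustrative Python sketch,
not kernel code; fill_in_priority_order is a hypothetical helper, and the
demand of 17 GiB is an arbitrary value chosen for the example, not a
measurement.

```python
def fill_in_priority_order(devices, demand):
    """Spread `demand` over devices, exhausting each before moving to the next.

    `devices` is a list of (name, priority, capacity) tuples; a higher
    priority is used first. Returns {name: amount used}.
    """
    used = {}
    for name, _prio, capacity in sorted(devices, key=lambda d: d[1], reverse=True):
        take = min(capacity, demand)
        used[name] = take
        demand -= take
    return used


# Capacities in GiB; default priorities assigned from -1 downwards.
devices = [
    ("zram0", -1, 16),
    ("zram1", -2, 16),
    ("zram2", -3, 16),
    ("zram3", -4, 16),
]
used = fill_in_priority_order(devices, demand=17)  # hypothetical demand
print(used)
```

As in the swapon output above, the first device fills completely before
the second one is touched, and the remaining devices stay unused.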

Link: https://lkml.kernel.org/r/20251028034308.929550-1-bhe@redhat.com
Link: https://lkml.kernel.org/r/20251028034308.929550-2-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Suggested-by: Chris Li <chrisl@kernel.org>
Acked-by: Chris Li <chrisl@kernel.org>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Reviewed-by: Kairui Song <kasong@tencent.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-16 17:28:27 -08:00
..
ABI Docs/ABI/damon: document obsolete_target sysfs file 2025-11-16 17:28:24 -08:00
PCI pci-v6.18-changes 2025-10-06 10:41:03 -07:00
RCU RCU pull request for v6.18 2025-10-04 11:28:45 -07:00
accel
accounting
admin-guide mm/swap: do not choose swap device according to numa node 2025-11-16 17:28:27 -08:00
arch - Make TDX and kexec work together 2025-10-04 10:01:30 -07:00
block
bpf bpf: disable and remove registers chain based liveness 2025-09-19 09:27:23 -07:00
cdrom
core-api dma-mapping updates for Linux 6.18: 2025-10-03 17:41:12 -07:00
cpu-freq cpufreq: Drop unused symbol CPUFREQ_ETERNAL 2025-10-01 13:57:22 +02:00
crypto crypto: doc - Add explicit title heading to API docs 2025-09-28 11:54:48 +08:00
dev-tools It has been a relatively busy cycle in docsland, with changes all over: 2025-10-03 17:16:13 -07:00
devicetree dt-bindings: gpio: ti,twl4030: Correct the schema $id path 2025-11-03 11:48:30 +01:00
doc-guide
driver-api pci-v6.18-changes 2025-10-06 10:41:03 -07:00
edac
fault-injection
fb fbdev fixes & enhancements for 6.18-rc1: 2025-10-10 09:36:23 -07:00
features OpenRISC updates for 6.18 2025-10-05 10:02:54 -07:00
filesystems doc: update porting, vfs documentation for mmap_prepare actions 2025-11-16 17:28:13 -08:00
firmware-guide Documentation: ACPI: i2c-muxes: fix I2C device references 2025-11-03 17:01:05 +01:00
firmware_class
fpga
gpu drm next for 6.18-rc1 2025-10-02 12:47:25 -07:00
hid
hwmon hwmon: (cros_ec) register fans into thermal framework cooling devices 2025-09-25 08:08:14 -07:00
i2c i2c: i801: Add support for Intel Wildcat Lake-U 2025-09-28 00:45:53 +02:00
iio
images
infiniband
input
isdn
kbuild Kbuild updates for 6.18 2025-10-01 20:58:51 -07:00
kernel-hacking
leds
litmus-tests
livepatch
locking
maintainer
mhi
misc-devices
mm Docs/mm/damon/design: fix wrong link to intervals goal section 2025-11-16 17:28:21 -08:00
netlabel
netlink dpll: spec: add missing module-name and clock-id to pin-get reply 2025-10-27 18:20:36 -07:00
networking Documentation: netconsole: Remove obsolete contact people 2025-10-29 17:40:19 -07:00
nvdimm
nvme
pcmcia
peci
power It has been a relatively busy cycle in docsland, with changes all over: 2025-10-03 17:16:13 -07:00
process It has been a relatively busy cycle in docsland, with changes all over: 2025-10-03 17:16:13 -07:00
rust docs: rust: add section on imports formatting 2025-10-17 00:56:20 +02:00
scheduler
scsi
security
sound ALSA: emu10k1: Fix typo in docs 2025-10-04 15:47:24 +02:00
sphinx docs: remove cdomain.py 2025-09-21 16:35:57 -06:00
sphinx-static
spi
staging It has been a relatively busy cycle in docsland, with changes all over: 2025-10-03 17:16:13 -07:00
sunrpc/xdr
target
tee
timers
tools tools/rtla: Add remaining support for osnoise actions 2025-09-27 04:53:48 -04:00
trace
translations More power management updates for 6.18-rc1 2025-10-07 09:39:51 -07:00
usb
userspace-api It has been a relatively busy cycle in docsland, with changes all over: 2025-10-03 17:16:13 -07:00
virt KVM x86 fixes for 6.18: 2025-10-18 10:25:43 +02:00
w1
watchdog
wmi
.gitignore
.renames.txt
Changes
CodingStyle
Kconfig
Makefile
SubmittingPatches
atomic_bitops.txt
atomic_t.txt
conf.py
docutils.conf
index.rst
memory-barriers.txt
subsystem-apis.rst