mirror-linux

Baoquan He 8e689f8ea4 mm/swap: do not choose swap device according to numa node

Patch series "mm/swapfile.c: select swap devices of default priority round robin", v5.

Currently, on a system with multiple swap devices, swap allocation selects one swap device according to priority. The swap device with the highest priority is chosen for allocation first. People can specify a priority from 0 to 32767 when running swapon on a swap device; otherwise the system sets it by default, starting at -2 and going downwards. Meanwhile, on a NUMA system, the swap device with a node_id is considered first on the NUMA node with that node_id.

In the current code, an array of plists, swap_avail_heads[nid], is used to organize swap devices on each NUMA node. For each NUMA node, there is a plist organizing all swap devices. The 'prio' value in the plist is the negated value of the device's priority, because a plist is sorted from low to high. The swap device owning a node_id is promoted to the front position on that NUMA node, then the other swap devices are put in order of their default priority.

E.g. I have a system with 8 NUMA nodes, and I set up 4 zram partitions as swap devices.
Current behaviour: their priorities will be (note that -1 is skipped):

    NAME       TYPE      SIZE USED PRIO
    /dev/zram0 partition  16G   0B   -2
    /dev/zram1 partition  16G   0B   -3
    /dev/zram2 partition  16G   0B   -4
    /dev/zram3 partition  16G   0B   -5

And their positions in the 8 swap_avail_lists[nid] will be:

    swap_avail_lists[0]: /* node 0's available swap device list */
    zram0   -> zram1   -> zram2   -> zram3
    prio:1     prio:3     prio:4     prio:5
    swap_avail_lists[1]: /* node 1's available swap device list */
    zram1   -> zram0   -> zram2   -> zram3
    prio:1     prio:2     prio:4     prio:5
    swap_avail_lists[2]: /* node 2's available swap device list */
    zram2   -> zram0   -> zram1   -> zram3
    prio:1     prio:2     prio:3     prio:5
    swap_avail_lists[3]: /* node 3's available swap device list */
    zram3   -> zram0   -> zram1   -> zram2
    prio:1     prio:2     prio:3     prio:4
    swap_avail_lists[4-7]: /* node 4,5,6,7's available swap device list */
    zram0   -> zram1   -> zram2   -> zram3
    prio:2     prio:3     prio:4     prio:5

The adjustment for a swap device with a node_id is intended to decrease the pressure of lock contention on one swap device by using a different swap device on each node. The adjustment was introduced in commit `a2468cc9bf` ("swap: choose swap device according to numa node").

However, the adjustment is a little coarse-grained. On a node, the swap device sharing the node's id is always selected first by the node's CPUs until it is exhausted, then the next one is used. And on the other nodes, where no swap device shares the node id, the swap device with priority '-2' is selected first until it is exhausted, then the next one with priority '-3'.

This is the swapon output while a high-pressure vm-scalability test is running. It clearly shows that zram0 is heavily exploited until exhausted.
===================================

    [root@hp-dl385g10-03 ~]# swapon
    NAME       TYPE      SIZE  USED PRIO
    /dev/zram0 partition  16G 15.7G   -2
    /dev/zram1 partition  16G  3.4G   -3
    /dev/zram2 partition  16G  3.4G   -4
    /dev/zram3 partition  16G  2.6G   -5

The node-based strategy for selecting a swap device is much better than the old way of selecting swap devices one by one. However, it is still unreasonable, because swap devices are assumed to have similar access speed if no priority is specified at swapon. It is unfair, and makes no sense, that one swap device gets a higher priority than another just because it was swapped on earlier.

So in this patchset, the change is to select swap devices round robin if they have the default priority. In code, the plist array swap_avail_heads[nid] is replaced with a single plist, swap_avail_head, which reverts commit `a2468cc9bf`. Meanwhile, on top of the revert, a further change makes any device without a specified priority get the same default priority '-1'. Swap devices with a specified priority are still always put foremost; this is not impacted. If you care about their different access speed, then use 'swapon -p xx' to assign priorities to your swap devices.
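The two orderings described above can be sketched in a few lines of Python. This is only an illustrative model, not the kernel code: the function names `per_node_order` and `round_robin_pick` are hypothetical, each zramN is assumed to sit on NUMA node N (as the example lists imply), and only devices with a default (negative) priority are assumed to be promoted on their own node.

```python
def per_node_order(devices, nr_nodes):
    """Old scheme: one sorted list per NUMA node.

    devices is a list of (name, priority, node_id) tuples. The sort key
    is the negated priority, except that a default-priority device on
    its own node gets key 1 so that it sorts to the front.
    """
    lists = {}
    for nid in range(nr_nodes):
        keyed = []
        for name, priority, dev_node in devices:
            # Assumption: only default (negative) priorities are promoted.
            if priority < 0 and dev_node == nid:
                key = 1
            else:
                key = -priority
            keyed.append((key, name))
        lists[nid] = sorted(keyed)
    return lists


def round_robin_pick(avail):
    """New scheme: one global list, rotated among equal-priority peers.

    avail is a list of (key, name) sorted by key. The head is returned
    and moved behind the other devices that share its key, so the next
    call picks a peer. Exhaustion of a device is not modelled.
    """
    key, name = avail[0]
    peers = sum(1 for k, _ in avail if k == key)
    head = avail.pop(0)
    avail.insert(peers - 1, head)
    return name


if __name__ == "__main__":
    devs = [("zram0", -2, 0), ("zram1", -3, 1),
            ("zram2", -4, 2), ("zram3", -5, 3)]
    for nid, order in per_node_order(devs, 8).items():
        print(nid, order)

    # With the series applied, every default device has priority -1 (key 1).
    avail = [(1, "zram%d" % i) for i in range(4)]
    print([round_robin_pick(avail) for _ in range(8)])
```

In this model a device with an explicit priority, for example 10, gets key -10 and therefore stays ahead of every default-priority device, which matches the statement that devices with a specified priority are always put foremost.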
New behaviour:

    swap_avail_list: /* one global available swap device list */
    zram0   -> zram1   -> zram2   -> zram3
    prio:1     prio:1     prio:1     prio:1

This is the swapon output while the high-pressure vm-scalability test is running; all devices are selected round robin:

=======================================

    [root@hp-dl385g10-03 linux]# swapon
    NAME       TYPE      SIZE  USED PRIO
    /dev/zram0 partition  16G 12.6G   -1
    /dev/zram1 partition  16G 12.6G   -1
    /dev/zram2 partition  16G 12.6G   -1
    /dev/zram3 partition  16G 12.6G   -1

With the change, we can see about 18% efficiency improvement as below:

vm-scalability test:
====================
Test with:
usemem --init-time -O -y -x -n 31 2G (4G memcg, zram as swap)

                                Before:         After:
    System time:                637.92 s        526.74 s       (lower is better)
    Sum Throughput:             3546.56 MB/s    4207.56 MB/s   (higher is better)
    Single process Throughput:  114.40 MB/s     135.72 MB/s    (higher is better)
    free latency:               10138455.99 us  6810119.01 us  (lower is better)

This patch (of 2):

This reverts commit `a2468cc9bf` ("swap: choose swap device according to numa node").

After this patch, the behaviour changes back to what it was before commit `a2468cc9bf`. That means the priority is set by default starting at -1 and going downwards, and when swapping, swap devices are exhausted one by one according to priority, from high to low. This is preparation work for the later change.
    [root@hp-dl385g10-03 ~]# swapon
    NAME       TYPE      SIZE   USED PRIO
    /dev/zram0 partition  16G    16G   -1
    /dev/zram1 partition  16G 966.2M   -2
    /dev/zram2 partition  16G     0B   -3
    /dev/zram3 partition  16G     0B   -4

Link: https://lkml.kernel.org/r/20251028034308.929550-1-bhe@redhat.com
Link: https://lkml.kernel.org/r/20251028034308.929550-2-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Suggested-by: Chris Li <chrisl@kernel.org>
Acked-by: Chris Li <chrisl@kernel.org>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Reviewed-by: Kairui Song <kasong@tencent.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

2025-11-16 17:28:27 -08:00
..
ABI	Docs/ABI/damon: document obsolete_target sysfs file	2025-11-16 17:28:24 -08:00
PCI	pci-v6.18-changes	2025-10-06 10:41:03 -07:00
RCU	RCU pull request for v6.18	2025-10-04 11:28:45 -07:00
accel	…
accounting	…
admin-guide	mm/swap: do not choose swap device according to numa node	2025-11-16 17:28:27 -08:00
arch	- Make TDX and kexec work together	2025-10-04 10:01:30 -07:00
block	…
bpf	bpf: disable and remove registers chain based liveness	2025-09-19 09:27:23 -07:00
cdrom	…
core-api	dma-mapping updates for Linux 6.18:	2025-10-03 17:41:12 -07:00
cpu-freq	cpufreq: Drop unused symbol CPUFREQ_ETERNAL	2025-10-01 13:57:22 +02:00
crypto	crypto: doc - Add explicit title heading to API docs	2025-09-28 11:54:48 +08:00
dev-tools	It has been a relatively busy cycle in docsland, with changes all over:	2025-10-03 17:16:13 -07:00
devicetree	dt-bindings: gpio: ti,twl4030: Correct the schema $id path	2025-11-03 11:48:30 +01:00
doc-guide	…
driver-api	pci-v6.18-changes	2025-10-06 10:41:03 -07:00
edac	…
fault-injection	…
fb	fbdev fixes & enhancements for 6.18-rc1:	2025-10-10 09:36:23 -07:00
features	OpenRISC updates for 6.18	2025-10-05 10:02:54 -07:00
filesystems	doc: update porting, vfs documentation for mmap_prepare actions	2025-11-16 17:28:13 -08:00
firmware-guide	Documentation: ACPI: i2c-muxes: fix I2C device references	2025-11-03 17:01:05 +01:00
firmware_class	…
fpga	…
gpu	drm next for 6.18-rc1	2025-10-02 12:47:25 -07:00
hid	…
hwmon	hwmon: (cros_ec) register fans into thermal framework cooling devices	2025-09-25 08:08:14 -07:00
i2c	i2c: i801: Add support for Intel Wildcat Lake-U	2025-09-28 00:45:53 +02:00
iio	…
images	…
infiniband	…
input	…
isdn	…
kbuild	Kbuild updates for 6.18	2025-10-01 20:58:51 -07:00
kernel-hacking	…
leds	…
litmus-tests	…
livepatch	…
locking	…
maintainer	…
mhi	…
misc-devices	…
mm	Docs/mm/damon/design: fix wrong link to intervals goal section	2025-11-16 17:28:21 -08:00
netlabel	…
netlink	dpll: spec: add missing module-name and clock-id to pin-get reply	2025-10-27 18:20:36 -07:00
networking	Documentation: netconsole: Remove obsolete contact people	2025-10-29 17:40:19 -07:00
nvdimm	…
nvme	…
pcmcia	…
peci	…
power	It has been a relatively busy cycle in docsland, with changes all over:	2025-10-03 17:16:13 -07:00
process	It has been a relatively busy cycle in docsland, with changes all over:	2025-10-03 17:16:13 -07:00
rust	docs: rust: add section on imports formatting	2025-10-17 00:56:20 +02:00
scheduler	…
scsi	…
security	…
sound	ALSA: emu10k1: Fix typo in docs	2025-10-04 15:47:24 +02:00
sphinx	docs: remove cdomain.py	2025-09-21 16:35:57 -06:00
sphinx-static	…
spi	…
staging	It has been a relatively busy cycle in docsland, with changes all over:	2025-10-03 17:16:13 -07:00
sunrpc/xdr	…
target	…
tee	…
timers	…
tools	tools/rtla: Add remaining support for osnoise actions	2025-09-27 04:53:48 -04:00
trace	…
translations	More power management updates for 6.18-rc1	2025-10-07 09:39:51 -07:00
usb	…
userspace-api	It has been a relatively busy cycle in docsland, with changes all over:	2025-10-03 17:16:13 -07:00
virt	KVM x86 fixes for 6.18:	2025-10-18 10:25:43 +02:00
w1	…
watchdog	…
wmi	…
.gitignore	…
.renames.txt	…
Changes	…
CodingStyle	…
Kconfig	…
Makefile	…
SubmittingPatches	…
atomic_bitops.txt	…
atomic_t.txt	…
conf.py	…
docutils.conf	…
index.rst	…
memory-barriers.txt	…
subsystem-apis.rst	…