mirror-linux/drivers
Nikolay Aleksandrov acb16b395c virtio_net: fix wrong buf address calculation when using xdp
We received a report[1] of kernel crashes when Cilium is used in XDP
mode with virtio_net after updating to newer kernels. After
investigating the reason it turned out that when using mergeable bufs
with an XDP program which adjusts xdp.data or xdp.data_meta page_to_buf()
calculates the build_skb address wrong because the offset can become less
than the headroom so it gets the address of the previous page (-X bytes
depending on how lower offset is):
 page_to_skb: page addr ffff9eb2923e2000 buf ffff9eb2923e1ffc offset 252 headroom 256

This is a pr_err() I added in the beginning of page_to_skb which clearly
shows offset that is less than headroom by adding 4 bytes of metadata
via an xdp prog. The calculations done are:
 receive_mergeable():
 headroom = VIRTIO_XDP_HEADROOM; // VIRTIO_XDP_HEADROOM == 256 bytes
 offset = xdp.data - page_address(xdp_page) -
          vi->hdr_len - metasize;

 page_to_skb():
 p = page_address(page) + offset;
 ...
 buf = p - headroom;

Now buf goes -4 bytes from the page's starting address as can be seen
above which is set as skb->head and skb->data by build_skb later. Depending
on what's done with the skb (when it's freed most often) we get all kinds
of corruptions and BUG_ON() triggers in mm[2]. We have to recalculate
the new headroom after the xdp program has run, similar to how offset
and len are recalculated. Headroom is directly related to
data_hard_start, data and data_meta, so we use them to get the new size.
The result is correct (similar pr_err() in page_to_skb, one case of
xdp_page and one case of virtnet buf):
 a) Case with 4 bytes of metadata
 [  115.949641] page_to_skb: page addr ffff8b4dcfad2000 offset 252 headroom 252
 [  121.084105] page_to_skb: page addr ffff8b4dcf018000 offset 20732 headroom 252
 b) Case of pushing data +32 bytes
 [  153.181401] page_to_skb: page addr ffff8b4dd0c4d000 offset 288 headroom 288
 [  158.480421] page_to_skb: page addr ffff8b4dd00b0000 offset 24864 headroom 288
 c) Case of pushing data -33 bytes
 [  835.906830] page_to_skb: page addr ffff8b4dd3270000 offset 223 headroom 223
 [  840.839910] page_to_skb: page addr ffff8b4dcdd68000 offset 12511 headroom 223

Offset and headroom are equal because offset points to the start of
reserved bytes for the virtio_net header which are at buf start +
headroom, while data points at buf start + vnet hdr size + headroom so
when data or data_meta are adjusted by the xdp prog both the headroom size
and the offset change equally. We can use data_hard_start to compute the
new headroom after the xdp prog (linearized / page start case, the
virtnet buf case is similar just with bigger base offset):
 xdp.data_hard_start = page_address + vnet_hdr
 xdp.data = page_address + vnet_hdr + headroom
 new headroom after xdp prog = xdp.data - xdp.data_hard_start - metasize

An example reproducer xdp prog[3] is below.

[1] https://github.com/cilium/cilium/issues/19453

[2] Two of the many traces:
 [   40.437400] BUG: Bad page state in process swapper/0  pfn:14940
 [   40.916726] BUG: Bad page state in process systemd-resolve  pfn:053b7
 [   41.300891] kernel BUG at include/linux/mm.h:720!
 [   41.301801] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
 [   41.302784] CPU: 1 PID: 1181 Comm: kubelet Kdump: loaded Tainted: G    B   W         5.18.0-rc1+ #37
 [   41.304458] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
 [   41.306018] RIP: 0010:page_frag_free+0x79/0xe0
 [   41.306836] Code: 00 00 75 ea 48 8b 07 a9 00 00 01 00 74 e0 48 8b 47 48 48 8d 50 ff a8 01 48 0f 45 fa eb d0 48 c7 c6 18 b8 30 a6 e8 d7 f8 fc ff <0f> 0b 48 8d 78 ff eb bc 48 8b 07 a9 00 00 01 00 74 3a 66 90 0f b6
 [   41.310235] RSP: 0018:ffffac05c2a6bc78 EFLAGS: 00010292
 [   41.311201] RAX: 000000000000003e RBX: 0000000000000000 RCX: 0000000000000000
 [   41.312502] RDX: 0000000000000001 RSI: ffffffffa6423004 RDI: 00000000ffffffff
 [   41.313794] RBP: ffff993c98823600 R08: 0000000000000000 R09: 00000000ffffdfff
 [   41.315089] R10: ffffac05c2a6ba68 R11: ffffffffa698ca28 R12: ffff993c98823600
 [   41.316398] R13: ffff993c86311ebc R14: 0000000000000000 R15: 000000000000005c
 [   41.317700] FS:  00007fe13fc56740(0000) GS:ffff993cdd900000(0000) knlGS:0000000000000000
 [   41.319150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 [   41.320152] CR2: 000000c00008a000 CR3: 0000000014908000 CR4: 0000000000350ee0
 [   41.321387] Call Trace:
 [   41.321819]  <TASK>
 [   41.322193]  skb_release_data+0x13f/0x1c0
 [   41.322902]  __kfree_skb+0x20/0x30
 [   41.343870]  tcp_recvmsg_locked+0x671/0x880
 [   41.363764]  tcp_recvmsg+0x5e/0x1c0
 [   41.384102]  inet_recvmsg+0x42/0x100
 [   41.406783]  ? sock_recvmsg+0x1d/0x70
 [   41.428201]  sock_read_iter+0x84/0xd0
 [   41.445592]  ? 0xffffffffa3000000
 [   41.462442]  new_sync_read+0x148/0x160
 [   41.479314]  ? 0xffffffffa3000000
 [   41.496937]  vfs_read+0x138/0x190
 [   41.517198]  ksys_read+0x87/0xc0
 [   41.535336]  do_syscall_64+0x3b/0x90
 [   41.551637]  entry_SYSCALL_64_after_hwframe+0x44/0xae
 [   41.568050] RIP: 0033:0x48765b
 [   41.583955] Code: e8 4a 35 fe ff eb 88 cc cc cc cc cc cc cc cc e8 fb 7a fe ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
 [   41.632818] RSP: 002b:000000c000a2f5b8 EFLAGS: 00000212 ORIG_RAX: 0000000000000000
 [   41.664588] RAX: ffffffffffffffda RBX: 000000c000062000 RCX: 000000000048765b
 [   41.681205] RDX: 0000000000005e54 RSI: 000000c000e66000 RDI: 0000000000000016
 [   41.697164] RBP: 000000c000a2f608 R08: 0000000000000001 R09: 00000000000001b4
 [   41.713034] R10: 00000000000000b6 R11: 0000000000000212 R12: 00000000000000e9
 [   41.728755] R13: 0000000000000001 R14: 000000c000a92000 R15: ffffffffffffffff
 [   41.744254]  </TASK>
 [   41.758585] Modules linked in: br_netfilter bridge veth netconsole virtio_net

 and

 [   33.524802] BUG: Bad page state in process systemd-network  pfn:11e60
 [   33.528617] page ffffe05dc0147b00 ffffe05dc04e7a00 ffff8ae9851ec000 (1) len 82 offset 252 metasize 4 hroom 0 hdr_len 12 data ffff8ae9851ec10c data_meta ffff8ae9851ec108 data_end ffff8ae9851ec14e
 [   33.529764] page:000000003792b5ba refcount:0 mapcount:-512 mapping:0000000000000000 index:0x0 pfn:0x11e60
 [   33.532463] flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
 [   33.532468] raw: 000fffffc0000000 0000000000000000 dead000000000122 0000000000000000
 [   33.532470] raw: 0000000000000000 0000000000000000 00000000fffffdff 0000000000000000
 [   33.532471] page dumped because: nonzero mapcount
 [   33.532472] Modules linked in: br_netfilter bridge veth netconsole virtio_net
 [   33.532479] CPU: 0 PID: 791 Comm: systemd-network Kdump: loaded Not tainted 5.18.0-rc1+ #37
 [   33.532482] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
 [   33.532484] Call Trace:
 [   33.532496]  <TASK>
 [   33.532500]  dump_stack_lvl+0x45/0x5a
 [   33.532506]  bad_page.cold+0x63/0x94
 [   33.532510]  free_pcp_prepare+0x290/0x420
 [   33.532515]  free_unref_page+0x1b/0x100
 [   33.532518]  skb_release_data+0x13f/0x1c0
 [   33.532524]  kfree_skb_reason+0x3e/0xc0
 [   33.532527]  ip6_mc_input+0x23c/0x2b0
 [   33.532531]  ip6_sublist_rcv_finish+0x83/0x90
 [   33.532534]  ip6_sublist_rcv+0x22b/0x2b0

[3] XDP program to reproduce(xdp_pass.c):
 #include <linux/bpf.h>
 #include <bpf/bpf_helpers.h>

 SEC("xdp_pass")
 int xdp_pkt_pass(struct xdp_md *ctx)
 {
          bpf_xdp_adjust_head(ctx, -(int)32);
          return XDP_PASS;
 }

 char _license[] SEC("license") = "GPL";

 compile: clang -O2 -g -Wall -target bpf -c xdp_pass.c -o xdp_pass.o
 load on virtio_net: ip link set enp1s0 xdpdrv obj xdp_pass.o sec xdp_pass

CC: stable@vger.kernel.org
CC: Jason Wang <jasowang@redhat.com>
CC: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
CC: Daniel Borkmann <daniel@iogearbox.net>
CC: "Michael S. Tsirkin" <mst@redhat.com>
CC: virtualization@lists.linux-foundation.org
Fixes: 8fb7da9e99 ("virtio_net: get build_skb() buf by data ptr")
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://lore.kernel.org/r/20220425103703.3067292-1-razor@blackwall.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-04-26 13:24:44 +02:00
..
accessibility
acpi Merge branch 'acpi-bus' 2022-04-08 19:50:44 +02:00
amba
android
ata ata: ahci: Rename CONFIG_SATA_LPM_POLICY configuration item back 2022-04-06 11:08:04 +09:00
atm Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-03-17 13:56:58 -07:00
auxdisplay auxdisplay: lcd2s: Use array size explicitly in lcd2s_gotoxy() 2022-03-18 20:31:14 +01:00
base net: mdio: don't defer probe forever if PHY IRQ provider is missing 2022-04-08 14:17:55 -07:00
bcma Core MTD changes: 2022-03-25 13:35:34 -07:00
block block: null_blk: end timed out poll request 2022-04-14 10:16:33 -06:00
bluetooth Bluetooth: ath3k: remove superfluous header files 2022-03-18 17:12:09 +01:00
bus Char/Misc and other driver updates for 5.18-rc1 2022-03-28 12:27:35 -07:00
cdrom cdrom: remove unused variable 2022-04-06 08:47:52 -06:00
char random: use memmove instead of memcpy for remaining 32 bytes 2022-04-16 12:53:31 +02:00
clk A single revert to fix a boot regression seen when clk_put() started 2022-04-03 12:21:14 -07:00
clocksource asm-generic updates for 5.18 2022-03-23 18:03:08 -07:00
comedi
connector
counter Char/Misc and other driver updates for 5.18-rc1 2022-03-28 12:27:35 -07:00
cpufreq Merge branch 'cpufreq/arm/linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm 2022-03-22 12:15:47 +01:00
cpuidle RISC-V CPU Idle Support 2022-03-30 16:17:54 -07:00
crypto virtio: features, fixes 2022-03-31 13:57:15 -07:00
cxl cxl/pci: Drop shadowed variable 2022-04-08 12:59:43 -07:00
dax dax for 5.18 2022-03-24 18:12:09 -07:00
dca
devfreq
dio
dma dmaengine updates for v5.18-rc1 2022-03-30 10:54:49 -07:00
dma-buf dma-buf: handle empty dma_fence_arrays gracefully 2022-03-29 09:14:30 +02:00
edac Merge branch 'edac-amd64' into edac-updates-for-v5.18 2022-03-21 10:34:57 +01:00
eisa
extcon
firewire
firmware firmware: arm_scmi: Fix sparse warnings in OPTEE transport driver 2022-04-04 23:06:37 +01:00
fpga
fsi
gnss
gpio intel-gpio for v5.18-2 2022-04-16 21:57:00 +02:00
gpu Merge tag 'amd-drm-fixes-5.18-2022-04-13' of https://gitlab.freedesktop.org/agd5f/linux into drm-fixes 2022-04-15 07:14:20 +10:00
greybus
hid Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input 2022-04-01 10:14:32 -07:00
hsi
hv hyperv-fixes for 5.18-rc2 2022-04-07 06:35:34 -10:00
hwmon Char/Misc and other driver updates for 5.18-rc1 2022-03-28 12:27:35 -07:00
hwspinlock hwspinlock: sprd: Use struct_size() helper in devm_kzalloc() 2022-03-11 14:56:57 -06:00
hwtracing Char/Misc and other driver updates for 5.18-rc1 2022-03-28 12:27:35 -07:00
i2c i2c: ismt: Fix undefined behavior due to shift overflowing the constant 2022-04-15 23:49:02 +02:00
i3c
idle cpuidle: intel_idle: Drop redundant backslash at line end 2022-03-17 14:32:59 +01:00
iio Char/Misc and other driver updates for 5.18-rc1 2022-03-28 12:27:35 -07:00
infiniband RDMA/hfi1: Fix use-after-free bug for mm struct 2022-04-08 15:40:06 -03:00
input Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input 2022-04-01 10:14:32 -07:00
interconnect
iommu iommu/omap: Fix regression in probe for NULL pointer dereference 2022-04-08 11:16:29 +02:00
ipack
irqchip irqchip/gic, gic-v3: Prevent GSI to SGI translations 2022-04-05 16:33:47 +01:00
isdn mISDN: fix typo "frame to short" -> "frame too short" 2022-03-21 13:26:38 +00:00
leds LED updates for 5.18-rc1. Nothing major here, there are two drivers 2022-03-27 14:09:48 -07:00
macintosh
mailbox mailbox: ti-msgmgr: Operate mailbox in polled mode during system suspend 2022-03-12 19:33:30 -06:00
mcb
md dm: fix bio length of empty flush 2022-04-15 16:16:09 -04:00
media media: si2157: unknown chip version Si2147-A30 ROM 0x50 2022-04-09 17:45:49 +02:00
memory memory: fsl_ifc: populate child nodes of buses and mfd devices 2022-04-06 09:39:16 +02:00
memstick
message scsi: message: fusion: Remove redundant variable dmp 2022-04-06 22:28:07 -04:00
mfd - New Drivers 2022-03-25 13:56:18 -07:00
misc habanalabs: Fix test build failures 2022-04-04 17:03:04 +02:00
mmc mmc: core: improve API to make clear mmc_hw_reset is for cards 2022-04-08 11:00:08 +02:00
most
mtd This pull request contains fixes for JFFS2, UBI and UBIFS 2022-03-31 16:09:41 -07:00
mux
net virtio_net: fix wrong buf address calculation when using xdp 2022-04-26 13:24:44 +02:00
nfc spi: Updates for v5.18 2022-03-21 18:33:57 -07:00
ntb
nubus
nvdimm libnvdimm for 5.18 2022-03-30 10:04:11 -07:00
nvme nvme-pci: disable namespace identifiers for Qemu controllers 2022-04-15 06:56:17 +02:00
nvmem nvmem: brcm_nvram: parse NVRAM content into NVMEM cells 2022-03-18 14:08:36 +01:00
of Char/Misc and other driver updates for 5.18-rc1 2022-03-28 12:27:35 -07:00
opp
parisc parisc: Fix CPU affinity for Lasi, WAX and Dino chips 2022-03-29 21:37:12 +02:00
parport parport_pc: Also enable driver for PCI systems 2022-03-18 14:01:41 +01:00
pci hyperv-fixes for 5.18-rc2 2022-04-07 06:35:34 -10:00
pcmcia
peci
perf perf/imx_ddr: Fix undefined behavior due to shift overflowing the constant 2022-04-08 14:17:57 +01:00
phy phy: PHY_FSL_LYNX_28G should depend on ARCH_LAYERSCAPE 2022-03-29 08:45:16 -07:00
pinctrl Pin control bulk changes for the v5.18 kernel cycle 2022-03-28 11:52:53 -07:00
platform platform/x86: amd-pmc: Fix compilation without CONFIG_SUSPEND 2022-04-04 16:26:09 +02:00
pnp PNP update for 5.18-rc1 2022-03-21 14:46:01 -07:00
power power: supply: Reset err after not finding static battery 2022-04-13 12:05:22 +02:00
powercap
pps pps: generators: pps_gen_parport: Switch to use module_parport_driver() 2022-03-18 14:01:19 +01:00
ps3
ptp ptp: ocp: handle error from nvmem_device_find 2022-03-30 12:08:11 -07:00
pwm
rapidio
ras
regulator regulator: atc260x: Fix missing active_discharge_on setting 2022-04-04 08:59:43 +01:00
remoteproc remoteproc updates for v5.18 2022-03-30 10:50:48 -07:00
reset reset: tegra-bpmp: Restore Handle errors in BPMP response 2022-04-04 11:14:13 +02:00
rpmsg rpmsg: ctrl: Introduce new RPMSG_CREATE/RELEASE_DEV_IOCTL controls 2022-03-13 11:49:53 -05:00
rtc RTC for 5.18 2022-04-01 09:37:18 -07:00
s390 s390: cleanup timer API use 2022-03-27 22:18:39 +02:00
sbus
scsi scsi: qedi: Fix failed disconnect handling 2022-04-11 22:09:35 -04:00
sh
siox
slimbus
soc Networking changes for 5.18. 2022-03-24 13:13:26 -07:00
soundwire Char/Misc and other driver updates for 5.18-rc1 2022-03-28 12:27:35 -07:00
spi spi: Fixes for v5.18 2022-04-19 10:30:43 -07:00
spmi
ssb
staging staging: r8188eu: Fix PPPoE tag insertion on little endian systems 2022-04-04 16:35:20 +02:00
target Merge branch '5.18/scsi-queue' into 5.18/scsi-fixes 2022-04-06 21:46:54 -04:00
tc
tee ARM driver updates for 5.18 2022-03-23 18:23:13 -07:00
thermal Merge branch 'thermal-hfi' 2022-03-18 19:00:26 +01:00
thunderbolt Char/Misc and other driver updates for 5.18-rc1 2022-03-28 12:27:35 -07:00
tty tty: serial: mpc52xx_uart: make rx/tx hooks return unsigned, part II. 2022-04-04 10:33:02 +02:00
uio
usb xen: branch for v5.18-rc1 2022-03-28 14:32:39 -07:00
vdpa virtio: fixes, cleanups 2022-04-05 10:40:52 -07:00
vfio vfio/pci: Fix vf_token mechanism when device-specific VF drivers are used 2022-04-13 11:37:44 -06:00
vhost virtio: features, fixes 2022-03-31 13:57:15 -07:00
video fbdev: Fix unregistering of framebuffers without device 2022-04-06 21:12:28 +02:00
virt Random number generator fixes for Linux 5.18-rc1. 2022-03-31 14:51:34 -07:00
virtio virtio: fixes, cleanups 2022-04-05 10:40:52 -07:00
visorbus
vlynq
vme
w1 w1: w1_therm: Add support for Maxim MAX31850 thermoelement IF. 2022-03-18 14:07:09 +01:00
watchdog linux-watchdog 5.18-rc1 tag 2022-03-31 14:14:03 -07:00
xen xen/balloon: don't use PV mode extra memory for zone device allocations 2022-04-07 15:08:37 -05:00
zorro
Kconfig
Makefile