mirror-linux/drivers
Jesper Dangaard Brouer dc82a33297 veth: apply qdisc backpressure on full ptr_ring to reduce TX drops
In production, we're seeing TX drops on veth devices when the ptr_ring
fills up. This can occur when NAPI mode is enabled, though it's
relatively rare. However, with threaded NAPI - which we use in
production - the drops become significantly more frequent.

The underlying issue is that with threaded NAPI, the consumer often runs
on a different CPU than the producer. This increases the likelihood of
the ring filling up before the consumer gets scheduled, especially under
load, leading to drops in veth_xmit() (ndo_start_xmit()).

This patch introduces backpressure by returning NETDEV_TX_BUSY when the
ring is full, signaling the qdisc layer to requeue the packet. The txq
(netdev queue) is stopped in this condition and restarted once
veth_poll() drains entries from the ring, ensuring coordination between
NAPI and qdisc.
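
The stop/wake handshake described above can be sketched as a small
userspace model. This is not the driver code: the names model_ring,
model_xmit() and model_poll() are hypothetical, TX_BUSY stands in for
NETDEV_TX_BUSY, and a plain bool stands in for the stopped txq. It is
single-threaded, so the memory ordering a real producer/consumer pair
needs is deliberately left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only; every name here is hypothetical. */
enum tx_status { TX_OK, TX_BUSY };

#define RING_SIZE 4

struct model_ring {
	void *slot[RING_SIZE];
	size_t head;       /* next slot to produce into */
	size_t tail;       /* next slot to consume from */
	size_t count;      /* entries currently queued */
	bool txq_stopped;  /* stands in for the stopped netdev queue */
};

static bool ring_full(const struct model_ring *r)
{
	return r->count == RING_SIZE;
}

/* Producer side: models the xmit path. */
static enum tx_status model_xmit(struct model_ring *r, void *pkt)
{
	if (ring_full(r)) {
		/* Stop the queue and ask the caller to requeue the packet. */
		r->txq_stopped = true;
		return TX_BUSY;
	}
	r->slot[r->head] = pkt;
	r->head = (r->head + 1) % RING_SIZE;
	r->count++;
	return TX_OK;
}

/* Consumer side: models the poll routine; drains up to budget entries. */
static size_t model_poll(struct model_ring *r, size_t budget)
{
	size_t done = 0;

	while (done < budget && r->count > 0) {
		r->slot[r->tail] = NULL;
		r->tail = (r->tail + 1) % RING_SIZE;
		r->count--;
		done++;
	}
	/* Space was freed, so let the producer run again. */
	if (done > 0 && r->txq_stopped)
		r->txq_stopped = false;
	return done;
}
```

In this model a caller that sees TX_BUSY keeps the packet and retries
only after txq_stopped has been cleared, which is the role the qdisc
requeue plays in the description above.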

Backpressure is only enabled when a qdisc is attached. Without a qdisc,
the driver retains its original behavior - dropping packets immediately
when the ring is full. This avoids unexpected behavior changes in setups
without a configured qdisc.
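
As a rough illustration of the two behaviors, the following userspace
sketch counts what happens to a burst that arrives while the consumer is
not running. The names burst_result and offer_burst() are hypothetical
and appear nowhere in the driver. It also assumes the qdisc can hold
every requeued packet, whereas a real qdisc applies its own limits and
may drop.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical accounting for one burst of packets. */
struct burst_result {
	size_t enqueued;  /* accepted into the ring */
	size_t requeued;  /* handed back to the qdisc (backpressure) */
	size_t dropped;   /* dropped by the driver (no qdisc) */
};

/*
 * Offer `burst` packets to a ring with `ring_free` free slots while the
 * consumer is idle. Overflow is requeued when a qdisc is attached and
 * dropped otherwise.
 */
static struct burst_result offer_burst(size_t burst, size_t ring_free,
				       bool qdisc_attached)
{
	struct burst_result res = {0, 0, 0};
	size_t overflow;

	res.enqueued = burst < ring_free ? burst : ring_free;
	overflow = burst - res.enqueued;

	if (qdisc_attached)
		res.requeued = overflow;
	else
		res.dropped = overflow;

	return res;
}
```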

With a qdisc in place (e.g. fq, sfq) this allows Active Queue Management
(AQM) to fairly schedule packets across flows and reduce collateral
damage from elephant flows.

A known limitation of this approach is that the full ring sits in front
of the qdisc layer, effectively forming a FIFO buffer that introduces
base latency. While AQM still improves fairness and mitigates flow
dominance, the latency impact is measurable.

In hardware drivers, this issue is typically addressed using BQL (Byte
Queue Limits), which limits the bytes in flight to what the physical link
rate requires. For virtual drivers like veth, however, there is no fixed
bandwidth constraint: the bottleneck is CPU availability and the
scheduler's ability to run the NAPI thread. It is unclear how effective
BQL would be in this context.

This patch serves as a first step toward addressing TX drops. Future work
may explore adapting a BQL-like mechanism to better suit virtual devices
like veth.

Reported-by: Yan Zhai <yan@cloudflare.com>
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
Link: https://patch.msgid.link/174559294022.827981.1282809941662942189.stgit@firesoul
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-28 14:06:58 -07:00
accel accel/ivpu: Add cmdq_id to job related logs 2025-04-11 12:07:44 +02:00
accessibility treewide: Switch/rename to timer_delete[_sync]() 2025-04-05 10:30:12 +02:00
acpi gcc-15: disable '-Wunterminated-string-initialization' entirely for now 2025-04-20 15:30:53 -07:00
amba
android
ata ata: libata-sata: Save all fields from sense data descriptor 2025-04-16 17:33:17 +09:00
atm treewide: Switch/rename to timer_delete[_sync]() 2025-04-05 10:30:12 +02:00
auxdisplay treewide: Switch/rename to timer_delete[_sync]() 2025-04-05 10:30:12 +02:00
base treewide: Switch/rename to timer_delete[_sync]() 2025-04-05 10:30:12 +02:00
bcma
block block-6.15-20250417 2025-04-18 09:21:14 -07:00
bluetooth Bluetooth: vhci: Avoid needless snprintf() calls 2025-04-16 16:50:47 -04:00
bus treewide: Switch/rename to timer_delete[_sync]() 2025-04-05 10:30:12 +02:00
cache
cdrom
cdx Merge branches 'apple/dart', 'arm/smmu/updates', 'arm/smmu/bindings', 'rockchip', 's390', 'core', 'intel/vt-d' and 'amd/amd-vi' into next 2025-03-20 09:11:09 +01:00
char virtio_console: fix order of fields cols and rows 2025-04-18 10:08:11 -04:00
clk ARM and clkdev updates for 6.15-rc1 2025-04-03 12:21:44 -07:00
clocksource RISC-V Patches for the 6.15 Merge Window, Part 1 2025-04-04 09:49:17 -07:00
comedi treewide: Switch/rename to timer_delete[_sync]() 2025-04-05 10:30:12 +02:00
connector
counter Char/Misc fixes for 6.15-rc1 2025-04-02 18:03:34 -07:00
cpufreq amd-pstate content for 6.15 (4/15/25) 2025-04-17 17:55:09 +02:00
cpuidle pmdomain core: 2025-03-25 20:40:51 -07:00
crypto crypto: atmel-sha204a - Set hwrng quality to lowest possible 2025-04-23 09:32:57 +08:00
cxl cxl for v6.15 2025-04-02 20:04:43 -07:00
dax device/dax: properly refcount device dax pages when mapping 2025-03-17 22:06:41 -07:00
dca
devfreq
dio
dma treewide: Switch/rename to timer_delete[_sync]() 2025-04-05 10:30:12 +02:00
dma-buf dma-buf/sw_sync: Decrement refcount on error in sw_sync_ioctl_get_deadline() 2025-04-11 14:22:22 +02:00
dpll Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2025-03-20 21:38:01 +01:00
edac - Add infrastructure support to EDAC in order to be able to register memory 2025-03-25 14:00:26 -07:00
eisa
extcon
firewire treewide: Switch/rename to timer_delete[_sync]() 2025-04-05 10:30:12 +02:00
firmware sound fixes for 6.15-rc3 2025-04-17 10:14:51 -07:00
fpga fpga: tests: add module descriptions 2025-04-11 17:32:38 -07:00
fsi
fwctl fwctl: Fix repeated device word in log message 2025-04-11 20:47:45 -03:00
gnss
gpio gpiolib: Allow to use setters with return value for output-only gpios 2025-04-14 20:31:00 +02:00
gpu virtio, vhost: fixes 2025-04-23 08:25:56 -07:00
greybus treewide: Switch/rename to timer_delete[_sync]() 2025-04-05 10:30:12 +02:00
hid treewide: Switch/rename to timer_delete[_sync]() 2025-04-05 10:30:12 +02:00
hsi treewide: Switch/rename to timer_delete[_sync]() 2025-04-05 10:30:12 +02:00
hte treewide: Switch/rename to timer_delete[_sync]() 2025-04-05 10:30:12 +02:00
hv - The 6 patch series "Enable strict percpu address space checks" from 2025-04-01 09:29:18 -07:00
hwmon treewide: Switch/rename to timer_delete[_sync]() 2025-04-05 10:30:12 +02:00
hwspinlock hwspinlock: Remove unused hwspin_lock_get_id() 2025-03-21 17:12:04 -05:00
hwtracing Char/Misc/IIO driver updates for 6.15-rc1 2025-04-01 11:26:08 -07:00
i2c i2c-host-fixes for v6.15-rc3 2025-04-18 23:42:56 +02:00
i3c i3c: Add NULL pointer check in i3c_master_queue_ibi() 2025-03-31 11:44:00 +02:00
idle Power management updates for 6.15-rc1 2025-03-25 15:00:18 -07:00
iio gcc-15: add '__nonstring' markers to byte arrays 2025-04-20 11:57:54 -07:00
infiniband RDMA/bnxt_re: Remove unusable nq variable 2025-04-10 14:47:55 -03:00
input gcc-15: add '__nonstring' markers to byte arrays 2025-04-20 11:57:54 -07:00
interconnect
iommu iommu/tegra241-cmdqv: Fix warnings due to dmam_free_coherent() 2025-04-11 12:44:27 +02:00
ipack
irqchip irqchip/irq-bcm2712-mip: Enable driver when ARCH_BCM2835 is enabled 2025-04-16 14:39:25 +02:00
isdn treewide: Switch/rename to timer_delete[_sync]() 2025-04-05 10:30:12 +02:00
leds treewide: Switch/rename to timer_delete[_sync]() 2025-04-05 10:30:12 +02:00
macintosh treewide: Switch/rename to timer_delete[_sync]() 2025-04-05 10:30:12 +02:00
mailbox treewide: Switch/rename to timer_delete[_sync]() 2025-04-05 10:30:12 +02:00
mcb
md gcc-15: get rid of misc extra NUL character padding 2025-04-20 11:57:54 -07:00
media treewide: Switch/rename to timer_delete[_sync]() 2025-04-05 10:30:12 +02:00
memory treewide: Switch/rename to timer_delete[_sync]() 2025-04-05 10:30:12 +02:00
memstick treewide: Switch/rename to timer_delete[_sync]() 2025-04-05 10:30:12 +02:00
message SCSI misc on 20250326 2025-03-26 19:57:34 -07:00
mfd * Maxim MAX77705: 2025-03-29 14:33:13 -07:00
misc treewide: Switch/rename to timer_delete[_sync]() 2025-04-05 10:30:12 +02:00
mmc treewide: Switch/rename to timer_delete[_sync]() 2025-04-05 10:30:12 +02:00
most treewide: Switch/rename to timer_delete[_sync]() 2025-04-05 10:30:12 +02:00
mtd mtd: rawnand: Add status chack in r852_ready() 2025-04-07 09:02:49 +02:00
mux
net veth: apply qdisc backpressure on full ptr_ring to reduce TX drops 2025-04-28 14:06:58 -07:00
nfc treewide: Switch/rename to timer_delete[_sync]() 2025-04-05 10:30:12 +02:00
ntb Bug fixes for NTB Switchtec driver mw negative shift, Intel NTB link 2025-04-04 14:23:07 -07:00
nubus
nvdimm libnvdimm additions for 6.15 2025-04-02 20:27:18 -07:00
nvme block-6.15-20250417 2025-04-18 09:21:14 -07:00
nvmem net, treewide: define and use MAC_ADDR_STR_LEN 2025-03-19 19:17:58 +01:00
of Devicetree for v6.15: 2025-03-29 11:23:16 -07:00
opp
parisc
parport treewide: Switch/rename to timer_delete[_sync]() 2025-04-05 10:30:12 +02:00
pci Miscellaneous fixes: 2025-04-18 13:28:41 -07:00
pcmcia treewide: Switch/rename to timer_delete[_sync]() 2025-04-05 10:30:12 +02:00
peci
perf pci-v6.15-changes 2025-03-28 19:36:53 -07:00
phy phy-for-6.15 2025-04-01 12:47:11 -07:00
pinctrl Pin control changes for the v6.15 kernel cycle: 2025-03-29 16:59:16 -07:00
platform platform/x86: msi-wmi-platform: Workaround a ACPI firmware bug 2025-04-16 11:15:22 +03:00
pmdomain pmdomain: arm: scmi_pm_domain: Remove redundant state verification 2025-03-17 11:12:01 +01:00
pnp Staging driver updates for 6.15-rc1 2025-04-02 18:09:17 -07:00
power gcc-15: get rid of misc extra NUL character padding 2025-04-20 11:57:54 -07:00
powercap Power management updates for 6.15-rc1 2025-03-25 15:00:18 -07:00
pps treewide: Convert new and leftover hrtimer_init() users 2025-04-05 10:30:17 +02:00
ps3
ptp ptp: Do not enable by default during compile testing 2025-04-22 18:43:10 -07:00
pwm pwm: A set of fixes for pwm core and various drivers 2025-04-12 08:11:19 -07:00
rapidio
ras RAS/AMD/FMPM: Get masked address 2025-04-08 19:30:58 +02:00
regulator These are objtool fixes and updates by Josh Poimboeuf, centered 2025-04-02 10:30:10 -07:00
remoteproc remoteproc: qcom_q6v5_pas: Make single-PD handling more robust 2025-03-22 08:42:39 -05:00
reset remoteproc updates for v6.15 2025-03-29 17:18:50 -07:00
rpmsg
rtc treewide: Switch/rename to timer_delete[_sync]() 2025-04-05 10:30:12 +02:00
s390 s390: ism: Pass string literal as format argument of dev_set_name() 2025-04-21 18:36:54 -07:00
sbus
scsi Merge branch '6.15/scsi-queue' into 6.15/scsi-fixes 2025-04-08 22:04:31 -04:00
sh
siox
slimbus
soc soc: drivers for 6.15, part 2 2025-04-04 09:06:32 -07:00
soundwire soundwire updates for 6.15 2025-04-01 12:43:13 -07:00
spi spi: spi-imx: Add check for spi_imx_setupxfer() 2025-04-17 12:25:12 +01:00
spmi
ssb
staging treewide: Switch/rename to timer_delete[_sync]() 2025-04-05 10:30:12 +02:00
target treewide: Switch/rename to timer_delete[_sync]() 2025-04-05 10:30:12 +02:00
tc
tee
thermal thermal: intel: int340x: Fix Panther Lake DLVR support 2025-04-15 18:57:25 +02:00
thunderbolt USB/Thunderbolt update for 6.15-rc1 2025-04-02 18:23:31 -07:00
tty treewide: Switch/rename to timer_delete[_sync]() 2025-04-05 10:30:12 +02:00
ufs Merge branch '6.15/scsi-queue' into 6.15/scsi-fixes 2025-04-08 22:04:31 -04:00
uio
usb treewide: Switch/rename to timer_delete[_sync]() 2025-04-05 10:30:12 +02:00
vdpa
vfio vfio/pci: Virtualize zero INTx PIN if no pdev->irq 2025-04-14 08:31:45 -06:00
vhost vhost-scsi: Fix vhost_scsi_send_status() 2025-04-18 10:08:11 -04:00
video treewide: Switch/rename to timer_delete[_sync]() 2025-04-05 10:30:12 +02:00
virt treewide: Switch/rename to timer_delete[_sync]() 2025-04-05 10:30:12 +02:00
virtio virtgpu: don't reset on shutdown 2025-04-18 10:05:49 -04:00
w1
watchdog treewide: Switch/rename to timer_delete[_sync]() 2025-04-05 10:30:12 +02:00
xen x86/xen: fix balloon target initialization for PVH dom0 2025-04-07 11:24:12 +02:00
zorro
Kconfig
Makefile