mirror-linux/drivers
Vitaly Prosyak e47b0056a0 drm/amdgpu: set noretry=1 as default for GFX 10.1.x (Navi10/12/14)
Problem:
While developing the amd_close_race IGT test (which intentionally triggers
execute permission faults by removing VM_PAGE_EXECUTABLE from GPU page table
entries), we discovered that on Navi10 (GFX 10.1.x) these faults produce
zero diagnostic output. The GPU simply hangs silently for ~10s until the
scheduler timeout fires. There is no way to distinguish an execute
permission fault from any other type of GPU hang.

Root cause:
GFX 10.1.x defaults to noretry=0, which sets
RETRY_PERMISSION_OR_INVALID_PAGE_FAULT=1 in the GFXHUB UTCL2 registers
(gfxhub_v2_0.c line 313). With this bit set, permission faults (valid PTE,
wrong R/W/X bits) are handled entirely within the UTCL1/UTCL2 hardware
loop: UTCL2 returns an XNACK to UTCL1, and UTCL1 re-requests the
translation indefinitely, expecting software to eventually fix the
permission bits (as happens in SVM/HMM recovery). No interrupt of any kind
reaches the IH ring.

This differs from invalid-page faults (V=0), which DO generate a retry
fault interrupt that the driver can escalate to a no-retry fault. Permission
faults with valid PTEs loop silently forever in hardware.

GFX 10.3+ already defaults to noretry=1, which makes permission faults
generate immediate L2 protection fault interrupts. GFX 10.1.x was
inadvertently left out of this default.

Fix:
Change the noretry=1 threshold from IP_VERSION(10, 3, 0) to
IP_VERSION(10, 1, 0) in amdgpu_gmc_noretry_set(). This is a one-line
change that aligns GFX 10.1.x behavior with GFX 10.3+ and all newer
generations.
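
The threshold change described above can be sketched in isolation. This is
a minimal illustration, not the driver code: the IP_VERSION() packing below
is an assumption made for this sketch, noretry_default_for() is a
hypothetical helper, and the real amdgpu_gmc_noretry_set() may take other
conditions into account besides this comparison.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: pack major/minor/revision into one comparable integer. */
#define IP_VERSION(mj, mn, rv) \
	(((uint32_t)(mj) << 16) | ((uint32_t)(mn) << 8) | (uint32_t)(rv))

/* Thresholds named in the commit message: before and after the fix. */
#define OLD_NORETRY_THRESHOLD IP_VERSION(10, 3, 0)
#define NEW_NORETRY_THRESHOLD IP_VERSION(10, 1, 0)

/*
 * Hypothetical helper: returns true when a GC IP version at or above
 * the given threshold should default to noretry=1.
 */
static bool noretry_default_for(uint32_t gc_ver, uint32_t threshold)
{
	return gc_ver >= threshold;
}
```

Under these assumptions, noretry_default_for(IP_VERSION(10, 1, 10),
OLD_NORETRY_THRESHOLD) is false, while the same call with
NEW_NORETRY_THRESHOLD is true, so Navi10 moves to the non-retry path.
GFX 10.3+ parts are unaffected because they pass both thresholds.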

With noretry=1, the existing non-retry fault handler
(gmc_v10_0_process_interrupt) already decodes and prints the full
GCVM_L2_PROTECTION_FAULT_STATUS register including PERMISSION_FAULTS,
faulting address, VMID, PASID, and process name. No additional logging
code is needed: the fix only routes permission faults to the existing,
fully capable non-retry interrupt handler.

v2: Dropped GFX10-specific logging from gmc_v10_0.c and
kfd_int_process_v10.c (Felix Kuehling). v1 added logging in the retry
fault handler, but with noretry=1 permission faults take the non-retry
path, so the v1 retry handler code was dead and would never execute.

Tested on Navi10 (GFX 10.1.10):
- Execute permission faults now produce immediate, clear output:
    [gfxhub] page fault (src_id:0 ring:64 vmid:4 pasid:592)
     Process amd_close_race pid 13380 thread amd_close_race pid 13384
      in page at address 0x40001000 from client 0x1b (UTCL2)
    GCVM_L2_PROTECTION_FAULT_STATUS:0x00700881
         PERMISSION_FAULTS: 0x8
- No regressions with properly-mapped GPU workloads
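
The relationship between the two logged values above can be checked by
hand. The sketch below assumes PERMISSION_FAULTS occupies bits 7:4 of
GCVM_L2_PROTECTION_FAULT_STATUS; that layout and the helper name are
assumptions made for illustration, chosen because they reproduce the
logged 0x8 from the logged status word.

```c
#include <stdint.h>

/* Assumption: PERMISSION_FAULTS is the 4-bit field at bits 7:4. */
#define PERMISSION_FAULTS_SHIFT 4u
#define PERMISSION_FAULTS_MASK  0x000000F0u

/* Hypothetical helper: extract the PERMISSION_FAULTS field. */
static uint32_t permission_faults(uint32_t status)
{
	return (status & PERMISSION_FAULTS_MASK) >> PERMISSION_FAULTS_SHIFT;
}
```

With the status value from the log, permission_faults(0x00700881) yields
0x8, matching the PERMISSION_FAULTS line printed by the handler.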

Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit eb21edd24c40d81066753f8ac6f23bce15745395)
Cc: stable@vger.kernel.org
2026-06-03 14:54:05 -04:00
accel accel/ivpu: prevent uninitialized data bug in debugfs 2026-05-26 08:04:07 +02:00
accessibility
acpi ACPI: button: Add missing device class clearing on probe failures 2026-05-25 09:52:34 +02:00
amba
android rust_binder: Avoid holding lock when dropping delivered_death 2026-05-22 11:55:48 +02:00
ata ata: libata-scsi: do not needlessly defer commands when using PMP with FBS 2026-05-18 12:26:51 +02:00
atm net: remove unused ATM protocols and legacy ATM device drivers 2026-04-23 12:21:14 -07:00
auxdisplay
base regmap: reject volatile update_bits() in cache-only mode 2026-05-28 15:15:46 +01:00
bcma
block 13 hotfixes. 9 are for MM. 9 are cc:stable and the remaining 4 address 2026-05-26 08:23:19 -07:00
bluetooth Bluetooth: hci_qca: Use 100 ms SSR delay for rampatch and NVM loading 2026-05-27 16:44:02 -04:00
bus Char/Misc/IIO/and others driver updates for 7.1-rc1 2026-04-24 13:23:50 -07:00
cache
cdrom cdrom, scsi: sr: propagate read-only status to block layer via set_disk_ro() 2026-04-27 15:52:51 -06:00
cdx
char IPMI: Fix a number of issues that came up recently 2026-05-04 12:48:30 -07:00
clk clk: rk808: fix OF node reference imbalance 2026-04-28 20:55:53 -07:00
clocksource
comedi comedi: comedi_test: fix check for valid scan_begin_src in waveform_ai_cmdtest() 2026-05-21 10:34:04 +02:00
connector
counter counter: Fix refcount leak in counter_alloc() error path 2026-05-03 13:48:39 +09:00
cpufreq cpufreq/amd-pstate-ut: Disable dynamic_epp after the mode switch 2026-05-26 12:39:28 +02:00
cpuidle
crypto
cxl
dax dax changes for 7.1 2026-04-21 14:12:01 -07:00
dca
devfreq
dibs
dio
dma
dma-buf dma-buf: fix UAF in dma_buf_fd() tracepoint 2026-05-28 20:05:43 +05:30
dpll dpll: zl3073x: make frequency monitor a per-device attribute 2026-05-28 14:05:29 +02:00
edac EDAC/versalnet: Fix device name memory leak 2026-05-05 14:49:48 +02:00
eisa
extcon
firewire
firmware LoongArch fixes for v7.1-rc5 2026-05-23 09:13:00 -07:00
fpga
fsi
fwctl fwctl: pds: Validate RPC input size before parsing 2026-05-19 10:44:32 -03:00
gnss
gpib Revert "gpib: cb7210: Fix region leak when request_irq fails" 2026-05-30 12:25:36 +02:00
gpio gpio: rockchip: teardown bugs and resource leaks 2026-05-28 15:23:40 +02:00
gpu drm/amdgpu: set noretry=1 as default for GFX 10.1.x (Navi10/12/14) 2026-06-03 14:54:05 -04:00
greybus
hid hid-for-linus-2026052801 2026-05-29 08:51:46 -07:00
hsi
hte
hv drm fixes for 7.1-rc1 2026-04-24 11:44:52 -07:00
hwmon hwmon: (pmbus/adm1266) serialize sequencer_state debugfs read with pmbus_lock 2026-05-21 07:00:39 -07:00
hwspinlock
hwtracing Char/Misc/IIO/and others driver updates for 7.1-rc1 2026-04-24 13:23:50 -07:00
i2c i2c: virtio: mark device ready before registering the adapter 2026-05-30 15:56:07 +02:00
i3c
idle
iio iio: adc: viperboard: Fix error handling in vprbrd_iio_read_raw 2026-05-15 12:05:35 +01:00
infiniband RDMA v7.1 first rc window 2026-05-23 07:17:27 -07:00
input Input updates for v7.1-rc5 2026-05-31 08:27:18 -07:00
interconnect
iommu iommu, debugobjects: avoid gcc-16.1 section mismatch warnings 2026-05-28 09:07:12 +02:00
ipack
irqchip irqchip/renesas-rzt2h: Use pm_runtime_put_sync() in probe error path 2026-05-21 20:11:29 +02:00
leds
macintosh
mailbox
mcb
md - fix crashes in dm-vdo if GFP_NOWAIT allocation fails 2026-05-25 12:45:40 -07:00
media media: rc: igorplugusb: fix control request setup packet 2026-05-30 18:21:47 +02:00
memory
memstick
message
mfd MFD for v7.1 2026-04-20 11:31:01 -07:00
misc misc: rp1: Send IACK on IRQ activate to fix kdump/kexec 2026-05-22 12:19:02 +02:00
mmc
most
mtd mtd: spinand: winbond: Fix ODTR write VCR on W35NxxJW 2026-04-27 15:08:04 +02:00
mux Char/Misc/IIO/and others driver updates for 7.1-rc1 2026-04-24 13:23:50 -07:00
net wireguard: send: append trailer after expanding head 2026-05-29 13:01:27 -07:00
nfc nfc: nxp-nci: i2c: use rising-edge IRQ on ACPI systems 2026-05-18 18:30:36 +02:00
ntb
nubus
nvdimm
nvme Including fixes from netfilter. 2026-05-28 13:13:48 -07:00
nvmem
of
opp
parisc parisc: Fix IRQ leak in LASI driver 2026-05-04 11:48:12 +02:00
parport parport: Fix race between port and client registration 2026-05-22 12:19:02 +02:00
pci pci-v7.1-fixes-2 2026-05-21 15:02:12 -07:00
pcmcia PCMCIA fixes and cleanups for v7.1 2026-04-23 11:22:16 -07:00
peci
perf
phy phy: qcom: qmp-usbc: Fix out-of-bounds array access in dp swing config 2026-05-19 15:42:11 +05:30
pinctrl pinctrl-amd: enable IRQ for WACF2200 touchscreen on Lenovo Yoga 7 14AGP11 2026-05-13 09:34:55 +02:00
platform platform-drivers-x86 for v7.1-4 2026-05-22 15:45:26 -07:00
pmdomain pmdomain: mediatek: fix use-after-free in scpsys_get_bus_protection_legacy() 2026-04-27 14:53:30 +02:00
pnp
power USB / Thunderbolt changes for 7.1-rc1 2026-04-19 08:47:40 -07:00
powercap
pps
ps3
ptp
pwm pwm: Two driver fixes 2026-04-23 08:37:07 -07:00
rapidio
ras
regulator regulator: tps65219: fix irq_data.rdev not being assigned 2026-05-18 10:52:24 +01:00
remoteproc
resctrl arm_mpam: Check whether the config array is allocated before destroying it 2026-05-14 09:52:05 +01:00
reset reset: eyeq: drop device_set_of_node_from_dev() done by parent 2026-04-28 19:03:50 -07:00
rpmsg
rtc RTC for 7.1 2026-04-25 16:39:03 -07:00
s390 s390/cio: Restore GFP_DMA for CHSC allocation 2026-05-11 16:27:25 +02:00
sbus
scsi SCSI fixes on 20260531 2026-05-31 08:45:08 -07:00
sh
siox
slimbus
soc
soundwire
spi spi: spi-mem: avoid mutating op template in spi_mem_supports_op() 2026-05-28 13:49:00 +01:00
spmi
ssb
staging hid-for-linus-2026051401 2026-05-14 14:30:01 -07:00
target scsi: target: iscsi: Validate CHAP_R length before base64 decode 2026-05-22 23:06:00 -04:00
tc
tee
thermal
thunderbolt thunderbolt: property: Cap recursion depth in __tb_property_parse_dir() 2026-05-11 11:32:03 +02:00
tty serial: dz: Enable modular build 2026-05-22 11:52:34 +02:00
ufs scsi: ufs: core: Fix bRefClkFreq write failure in HS-LSS mode 2026-04-21 20:58:06 -04:00
uio uio: uio_pci_generic_sva: fix double free of devm_kzalloc() memory 2026-05-22 12:19:02 +02:00
usb USB serial fixes for 7.1-rc5 2026-05-23 13:21:56 +02:00
vdpa
vfio vfio/pci: Check BAR resources before exporting a DMABUF 2026-05-14 11:39:03 -06:00
vhost Including fixes from Netfilter. 2026-04-23 16:50:42 -07:00
video fbdev: udlfb: add vm_ops to dlfb_ops_mmap to prevent use-after-free 2026-05-04 10:35:55 +02:00
virt virt: sev-guest: Explicitly leak pages in unknown state 2026-05-20 18:03:17 -07:00
virtio
w1
watchdog
xen ACPI: PAD: xen: Check ACPI_COMPANION() against NULL 2026-05-12 19:01:37 +02:00
zorro
Kconfig net: remove ISDN subsystem and Bluetooth CMTP 2026-04-23 10:24:02 -07:00
Makefile net: remove ISDN subsystem and Bluetooth CMTP 2026-04-23 10:24:02 -07:00