Commit Graph

16421 Commits (09cfd3c52ea76f43b3cb15e570aeddf633d65e80)

Author SHA1 Message Date
Srinivasan Shanmugam 70e33073d9 drm/amdgpu: Fix kdoc style in amdgpu_fence.c
The initial comment block before
amdgpu_fence_driver_guilty_force_completion() incorrectly used '/**' but
is not a kernel-doc comment, causing build warnings.

Fixes the below with gcc W=1:
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c:742: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
 * Kernel queue reset handling

Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-04 14:27:28 -04:00
Perry Yuan 8e3967a71e drm/amdgpu: Fix build error when CONFIG_SUSPEND is disabled
The variable `pm_suspend_target_state` is conditionally defined only when
`CONFIG_SUSPEND` is enabled (see `include/linux/suspend.h`). Directly
referencing it without guarding by `#ifdef CONFIG_SUSPEND` causes build
failures when suspend functionality is disabled (e.g., `CONFIG_SUSPEND=n`).

Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Perry Yuan <perry.yuan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-04 14:26:38 -04:00
Christian König 6716a823d1 drm/amdgpu: rework how PTE flags are generated v3
Previously we tried to keep the HW specific PTE flags in each mapping,
but for CRIU that isn't sufficient any more since the original value is
needed for the checkpoint procedure.

So rework the whole handling, nuke the early mapping function, keep the
UAPI flags in each mapping instead of the HW flags and translate them to
the HW flags while filling in the PTEs.

Only tested on Navi 23 for now, so probably needs quite a bit of more
work.

v2: fix KFD and SVN handling
v3: one more SVN fix pointed out by Felix
v4: squash in gfx12 fix from David

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-04 14:26:38 -04:00
Linus Torvalds 89748acdf2 drm fixes for 6.17-rc1
amdgpu:
 - DSC divide by 0 fix
 - clang fix
 - DC debugfs fix
 - Userq fixes
 - Avoid extra evict-restore with KFD
 - Backlight fix
 - Documentation fix
 - RAS fix
 - Add new kicker handling
 - DSC fix for DCN 3.1.4
 - PSR fix
 - Atomic fix
 - DC reset fixes
 - DCN 3.0.1 fix
 - MMHUB client mapping fix
 
 xe:
 - Fix BMG probe on unsupported mailbox command
 - Fix OA static checker warning about null gt
 - Fix a NULL vs IS_ERR() bug in xe_i2c_register_adapter
 - Fix missing unwind goto in GuC/HuC
 - Don't register I2C devices if VF
 - Clear whole GuC g2h_fence during initialization
 - Avoid call kfree for drmm_kzalloc
 - Fix pci_dev reference leak on configfs
 - SRIOV: Disable CSC support on VF
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEEKbZHaGwW9KfbeusDHTzWXnEhr4FAmiL9c8ACgkQDHTzWXnE
 hr5V3RAAji19x0wr2RsNTZRLP5m9+8vAkevIcUv5Y3ir8dUTXONN00ynE7/J/GHt
 X2MI2A+KrS0jvWlNw1vKCRNY/r6kiPd+HpPFwpN7y98Hn9o8H3byKHCvCcV6z469
 XJlXRO7yLe71Bn+iVkgQRSHnsAJVvvtoV0eeJR39wW0pvscWlV/wnTVx9XExNODJ
 lQFehlcbPQB10VRQrRBcWuPtqTX9+sIefEYERKDSXgOKcngPXTwx4QDnbBkuGy5e
 mJ0Xz9SjB554IL2pWo+gOT8FYZLw7lypVAhdCJjR9o4WzqjlxLWK2h3vcLRvo6Pq
 PyzvH0MXIEwfTT6HExXZXYtzt3ovUJ6ZaIJ6Y5Y6z8PcNzeO0+qWyQBwuKSXsN1S
 OOgHYhDnbjz8bS7J9/FPr6yE8RAl391ZWQldG3qXNp8OoXVoQv+TRcX80A9HVwai
 y4f5Ob0bRSKsnJIddsldDZz3usJ7K3xhRjcw2F03mkQKNLcmAQ1YwE0ak573N7gG
 wUj85wa9xREfWiA0ZuhXlV4BVv8a6U8JdolYsJ+HIl0l7xsFKkwLT6adzVSovE2b
 u52fV8MTjJMV17tEF7pzu0Hv3Uar0lecXMKx4G1gYOzBhZCuzeq5+SPepIAc05Ad
 QUR2LwvH0cnd5/GFPkXT8+GsDnjIJKUjEEsJUxoQS966IVTzI4A=
 =AgPB
 -----END PGP SIGNATURE-----

Merge tag 'drm-next-2025-08-01' of https://gitlab.freedesktop.org/drm/kernel

Pull drm fixes from Dave Airlie:
 "Just a bunch of amdgpu and xe fixes.

  amdgpu:
   - DSC divide by 0 fix
   - clang fix
   - DC debugfs fix
   - Userq fixes
   - Avoid extra evict-restore with KFD
   - Backlight fix
   - Documentation fix
   - RAS fix
   - Add new kicker handling
   - DSC fix for DCN 3.1.4
   - PSR fix
   - Atomic fix
   - DC reset fixes
   - DCN 3.0.1 fix
   - MMHUB client mapping fix

  xe:
   - Fix BMG probe on unsupported mailbox command
   - Fix OA static checker warning about null gt
   - Fix a NULL vs IS_ERR() bug in xe_i2c_register_adapter
   - Fix missing unwind goto in GuC/HuC
   - Don't register I2C devices if VF
   - Clear whole GuC g2h_fence during initialization
   - Avoid call kfree for drmm_kzalloc
   - Fix pci_dev reference leak on configfs
   - SRIOV: Disable CSC support on VF

* tag 'drm-next-2025-08-01' of https://gitlab.freedesktop.org/drm/kernel: (24 commits)
  drm/xe/vf: Disable CSC support on VF
  drm/amdgpu: update mmhub 4.1.0 client id mappings
  drm/amd/display: Allow DCN301 to clear update flags
  drm/amd/display: Pass up errors for reset GPU that fails to init HW
  drm/amd/display: Only finalize atomic_obj if it was initialized
  drm/amd/display: Avoid configuring PSR granularity if PSR-SU not supported
  drm/amd/display: Disable dsc_power_gate for dcn314 by default
  drm/amdgpu: add kicker fws loading for gfx12/smu14/psp14
  drm/amd/amdgpu: fix missing lock for cper.ring->rptr/wptr access
  drm/amd/display: Fix misuse of /** to /* in 'dce_i2c_hw.c'
  drm/amd/display: fix initial backlight brightness calculation
  drm/amdgpu: Avoid extra evict-restore process.
  drm/amdgpu: track whether a queue is a kernel queue in amdgpu_mqd_prop
  drm/amdgpu: check if hubbub is NULL in debugfs/amdgpu_dm_capabilities
  drm/amdgpu: Initialize data to NULL in imu_v12_0_program_rlc_ram()
  drm/amd/display: Fix divide by zero when calculating min ODM factor
  drm/xe/configfs: Fix pci_dev reference leak
  drm/xe/hw_engine_group: Avoid call kfree() for drmm_kzalloc()
  drm/xe/guc: Clear whole g2h_fence during initialization
  drm/xe/vf: Don't register I2C devices if VF
  ...
2025-07-31 21:47:36 -07:00
Linus Torvalds 260f6f4fda drm for 6.17-rc1
non-drm:
 rust:
 - make ETIMEDOUT available
 - add size constants up to SZ_2G
 - add DMA coherent allocation bindings
 mtd:
 - driver for Intel GPU non-volatile storage
 i2c
 - designware quirk for Intel xe
 
 core:
 - atomic helpers: tune enable/disable sequences
 - add task info to wedge API
 - refactor EDID quirks
 - connector: move HDR sink to drm_display_info
 - fourcc: half-float and 32-bit float formats
 - mode_config: pass format info to simplify
 
 dma-buf:
 - heaps: Give CMA heap a stable name
 
 ci:
 - add device tree validation and kunit
 
 displayport:
 - change AUX DPCD access probe address
 - add quirk for DPCD probe
 - add panel replay definitions
 - backlight control helpers
 
 fbdev:
 - make CONFIG_FIRMWARE_EDID available on all arches
 
 fence:
 - fix UAF issues
 
 format-helper:
 - improve tests
 
 gpusvm:
 - introduce devmem only flag for allocation
 - add timeslicing support to GPU SVM
 
 ttm:
 - improve eviction
 
 sched:
 - tracing improvements
 - kunit improvements
 - memory leak fixes
 - reset handling improvements
 
 color mgmt:
 - add hardware gamma LUT handling helpers
 
 bridge:
 - add destroy hook
 - switch to reference counted drm_bridge allocations
 - tc358767: convert to devm_drm_bridge_alloc
 - improve CEC handling
 
 panel:
 - switch to reference counter drm_panel allocations
 - fwnode panel lookup
 - Huiling hl055fhv028c support
 - Raspberry Pi 7" 720x1280 support
 - edp: KDC KD116N3730A05, N160JCE-ELL CMN, N116BCJ-EAK
 - simple: AUO P238HAN01
 - st7701: Winstar wf40eswaa6mnn0
 - visionox: rm69299-shift
 - Renesas R61307, Renesas R69328 support
 - DJN HX83112B
 
 hdmi:
 - add CEC handling
 - YUV420 output support
 
 xe:
 - WildCat Lake support
 - Enable PanthorLake by default
 - mark BMG as SRIOV capable
 - update firmware recommendations
 - Expose media OA units
 - aux-bux support for non-volatile memory
 - MTD intel-dg driver for non-volatile memory
 - Expose fan control and voltage regulator in sysfs
 - restructure migration for multi-device
 - Restore GuC submit UAF fix
 - make GEM shrinker drm managed
 - SRIOV VF Post-migration recovery of GGTT nodes
 - W/A additions/reworks
 - Prefetch support for svm ranges
 - Don't allocate managed BO for each policy change
 - HWMON fixes for BMG
 - Create LRC BO without VM
 - PCI ID updates
 - make SLPC debugfs files optional
 - rework eviction rejection of bound external BOs
 - consolidate PAT programming logic for pre/post Xe2
 - init changes for flicker-free boot
 - Enable GuC Dynamic Inhibit Context switch
 
 i915:
 - drm_panic support for i915/xe
 - initial flip queue off by default for LNL/PNL
 - Wildcat Lake Display support
 - Support for DSC fractional link bpp
 - Support for simultaneous Panel Replay and Adaptive sync
 - Support for PTL+ double buffer LUT
 - initial PIPEDMC event handling
 - drm_panel_follower support
 - DPLL interface renames
 - allocate struct intel_display dynamically
 - flip queue preperation
 - abstract DRAM detection better
 - avoid GuC scheduling stalls
 - remove DG1 force probe requirement
 - fix MEI interrupt handler on RT kernels
 - use backlight control helpers for eDP
 - more shared display code refactoring
 
 amdgpu:
 - add userq slot to INFO ioctl
 - SR-IOV hibernation support
 - Suspend improvements
 - Backlight improvements
 - Use scaling for non-native eDP modes
 - cleaner shader updates for GC 9.x
 - Remove fence slab
 - SDMA fw checks for userq support
 - RAS updates
 - DMCUB updates
 - DP tunneling fixes
 - Display idle D3 support
 - Per queue reset improvements
 - initial smartmux support
 
 amdkfd:
 - enable KFD on loongarch
 - mtype fix for ext coherent system memory
 
 radeon:
 - CS validation additional GL extensions
 - drop console lock during suspend/resume
 - bump driver version
 
 msm:
 - VM BIND support
 - CI: infrastructure updates
 - UBWC single source of truth
 - decouple GPU and KMS support
 - DP: rework I/O accessors
 - DPU: SM8750 support
 - DSI: SM8750 support
 - GPU: X1-45 support and speedbin support for X1-85
 - MDSS: SM8750 support
 
 nova:
 - register! macro improvements
 - DMA object abstraction
 - VBIOS parser + fwsec lookup
 - sysmem flush page support
 - falcon: generic falcon boot code and HAL
 - FWSEC-FRTS: fb setup and load/execute
 
 ivpu:
 - Add Wildcat Lake support
 - Add turbo flag
 
 ast:
 - improve hardware generations implementation
 
 imx:
 - IMX8qxq Display Controller support
 
 lima:
 - Rockchip RK3528 GPU support
 
 nouveau:
 - fence handling cleanup
 
 panfrost:
 - MT8370 support
 - bo labeling
 - 64-bit register access
 
 qaic:
 - add RAS support
 
 rockchip:
 - convert inno_hdmi to a bridge
 
 rz-du:
 - add RZ/V2H(P) support
 - MIPI-DSI DCS support
 
 sitronix:
 - ST7567 support
 
 sun4i:
 - add H616 support
 
 tidss:
 - add TI AM62L support
 - AM65x OLDI bridge support
 
 bochs:
 - drm panic support
 
 vkms:
 - YUV and R* format support
 - use faux device
 
 vmwgfx:
 - fence improvements
 
 hyperv:
 - move out of simple
 - add drm_panic support
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEEKbZHaGwW9KfbeusDHTzWXnEhr4FAmiJM/0ACgkQDHTzWXnE
 hr6MpA/+JJKGdSdrE95QkaMcOZh/3e3areGXZ0V/RrrJXdB4/DoAfQSHhF0H7m7y
 MhBGVLGNMXq7KHrz28p1MjLHrE1mwmvJ6hZ4J076ed4u9naoCD0m6k5w5wiue+KL
 HyPR54ADxN0BYmgV0l/B0wj42KsHyTO4x4hdqPJu02V9Dtmx6FCh2ujkOF3p9nbK
 GMwWDttl4KEKljD0IvQ9YIYJ66crYGx/XmZi7JoWRrS104K/h1u8qZuXBp5jVKTy
 OZRAVyLdmJqdTOLH7l599MBBcEd/bNV37/LVwF4T5iFunEKOAiyN0QY0OR+IeRVh
 ZfOv2/gp4UNyIfyahQ7LKLgEilNPGHoPitvDJPvBZxW2UjwXVNvA1QfdK5DAlVRS
 D5NoFRjlFFCz8/c2hQwlKJ9o7eVgH3/pK0mwR7SPGQTuqzLFCrAfCuzUvg/gV++6
 JFqmGKMHeCoxO2o4GMrwjFttStP41usxtV/D+grcbPteNO9UyKJS4C38n4eamJXM
 a9Sy9APuAb6F0w5+yMItEF7TQifgmhIbm5AZHlxE1KoDQV6TdiIf1Gou5LeDGoL6
 OACbXHJPL52tUnfCRpbfI4tE/IVyYsfL01JnvZ5cZZWItXfcIz76ykJri+E0G60g
 yRl/zkimHKO4B0l/HSzal5xROXr+3VzeWehEiz/ot1VriP5OesA=
 =n9MO
 -----END PGP SIGNATURE-----

Merge tag 'drm-next-2025-07-30' of https://gitlab.freedesktop.org/drm/kernel

Pull drm updates from Dave Airlie:
 "Highlights:

   - Intel xe enable Panthor Lake, started adding WildCat Lake

   - amdgpu has a bunch of reset improvments along with the usual IP
     updates

   - msm got VM_BIND support which is important for vulkan sparse memory

   - more drm_panic users

   - gpusvm common code to handle a bunch of core SVM work outside
     drivers.

  Detail summary:

  Changes outside drm subdirectory:
   - 'shrink_shmem_memory()' for better shmem/hibernate interaction
   - Rust support infrastructure:
      - make ETIMEDOUT available
      - add size constants up to SZ_2G
      - add DMA coherent allocation bindings
   - mtd driver for Intel GPU non-volatile storage
   - i2c designware quirk for Intel xe

  core:
   - atomic helpers: tune enable/disable sequences
   - add task info to wedge API
   - refactor EDID quirks
   - connector: move HDR sink to drm_display_info
   - fourcc: half-float and 32-bit float formats
   - mode_config: pass format info to simplify

  dma-buf:
   - heaps: Give CMA heap a stable name

  ci:
   - add device tree validation and kunit

  displayport:
   - change AUX DPCD access probe address
   - add quirk for DPCD probe
   - add panel replay definitions
   - backlight control helpers

  fbdev:
   - make CONFIG_FIRMWARE_EDID available on all arches

  fence:
   - fix UAF issues

  format-helper:
   - improve tests

  gpusvm:
   - introduce devmem only flag for allocation
   - add timeslicing support to GPU SVM

  ttm:
   - improve eviction

  sched:
   - tracing improvements
   - kunit improvements
   - memory leak fixes
   - reset handling improvements

  color mgmt:
   - add hardware gamma LUT handling helpers

  bridge:
   - add destroy hook
   - switch to reference counted drm_bridge allocations
   - tc358767: convert to devm_drm_bridge_alloc
   - improve CEC handling

  panel:
   - switch to reference counter drm_panel allocations
   - fwnode panel lookup
   - Huiling hl055fhv028c support
   - Raspberry Pi 7" 720x1280 support
   - edp: KDC KD116N3730A05, N160JCE-ELL CMN, N116BCJ-EAK
   - simple: AUO P238HAN01
   - st7701: Winstar wf40eswaa6mnn0
   - visionox: rm69299-shift
   - Renesas R61307, Renesas R69328 support
   - DJN HX83112B

  hdmi:
   - add CEC handling
   - YUV420 output support

  xe:
   - WildCat Lake support
   - Enable PanthorLake by default
   - mark BMG as SRIOV capable
   - update firmware recommendations
   - Expose media OA units
   - aux-bux support for non-volatile memory
   - MTD intel-dg driver for non-volatile memory
   - Expose fan control and voltage regulator in sysfs
   - restructure migration for multi-device
   - Restore GuC submit UAF fix
   - make GEM shrinker drm managed
   - SRIOV VF Post-migration recovery of GGTT nodes
   - W/A additions/reworks
   - Prefetch support for svm ranges
   - Don't allocate managed BO for each policy change
   - HWMON fixes for BMG
   - Create LRC BO without VM
   - PCI ID updates
   - make SLPC debugfs files optional
   - rework eviction rejection of bound external BOs
   - consolidate PAT programming logic for pre/post Xe2
   - init changes for flicker-free boot
   - Enable GuC Dynamic Inhibit Context switch

  i915:
   - drm_panic support for i915/xe
   - initial flip queue off by default for LNL/PNL
   - Wildcat Lake Display support
   - Support for DSC fractional link bpp
   - Support for simultaneous Panel Replay and Adaptive sync
   - Support for PTL+ double buffer LUT
   - initial PIPEDMC event handling
   - drm_panel_follower support
   - DPLL interface renames
   - allocate struct intel_display dynamically
   - flip queue preperation
   - abstract DRAM detection better
   - avoid GuC scheduling stalls
   - remove DG1 force probe requirement
   - fix MEI interrupt handler on RT kernels
   - use backlight control helpers for eDP
   - more shared display code refactoring

  amdgpu:
   - add userq slot to INFO ioctl
   - SR-IOV hibernation support
   - Suspend improvements
   - Backlight improvements
   - Use scaling for non-native eDP modes
   - cleaner shader updates for GC 9.x
   - Remove fence slab
   - SDMA fw checks for userq support
   - RAS updates
   - DMCUB updates
   - DP tunneling fixes
   - Display idle D3 support
   - Per queue reset improvements
   - initial smartmux support

  amdkfd:
   - enable KFD on loongarch
   - mtype fix for ext coherent system memory

  radeon:
   - CS validation additional GL extensions
   - drop console lock during suspend/resume
   - bump driver version

  msm:
   - VM BIND support
   - CI: infrastructure updates
   - UBWC single source of truth
   - decouple GPU and KMS support
   - DP: rework I/O accessors
   - DPU: SM8750 support
   - DSI: SM8750 support
   - GPU: X1-45 support and speedbin support for X1-85
   - MDSS: SM8750 support

  nova:
   - register! macro improvements
   - DMA object abstraction
   - VBIOS parser + fwsec lookup
   - sysmem flush page support
   - falcon: generic falcon boot code and HAL
   - FWSEC-FRTS: fb setup and load/execute

  ivpu:
   - Add Wildcat Lake support
   - Add turbo flag

  ast:
   - improve hardware generations implementation

  imx:
   - IMX8qxq Display Controller support

  lima:
   - Rockchip RK3528 GPU support

  nouveau:
   - fence handling cleanup

  panfrost:
   - MT8370 support
   - bo labeling
   - 64-bit register access

  qaic:
   - add RAS support

  rockchip:
   - convert inno_hdmi to a bridge

  rz-du:
   - add RZ/V2H(P) support
   - MIPI-DSI DCS support

  sitronix:
   - ST7567 support

  sun4i:
   - add H616 support

  tidss:
   - add TI AM62L support
   - AM65x OLDI bridge support

  bochs:
   - drm panic support

  vkms:
   - YUV and R* format support
   - use faux device

  vmwgfx:
   - fence improvements

  hyperv:
   - move out of simple
   - add drm_panic support"

* tag 'drm-next-2025-07-30' of https://gitlab.freedesktop.org/drm/kernel: (1479 commits)
  drm/tidss: oldi: convert to devm_drm_bridge_alloc() API
  drm/tidss: encoder: convert to devm_drm_bridge_alloc()
  drm/amdgpu: move reset support type checks into the caller
  drm/amdgpu/sdma7: re-emit unprocessed state on ring reset
  drm/amdgpu/sdma6: re-emit unprocessed state on ring reset
  drm/amdgpu/sdma5.2: re-emit unprocessed state on ring reset
  drm/amdgpu/sdma5: re-emit unprocessed state on ring reset
  drm/amdgpu/gfx12: re-emit unprocessed state on ring reset
  drm/amdgpu/gfx11: re-emit unprocessed state on ring reset
  drm/amdgpu/gfx10: re-emit unprocessed state on ring reset
  drm/amdgpu/gfx9.4.3: re-emit unprocessed state on kcq reset
  drm/amdgpu/gfx9: re-emit unprocessed state on kcq reset
  drm/amdgpu: Add WARN_ON to the resource clear function
  drm/amd/pm: Use cached metrics data on SMUv13.0.6
  drm/amd/pm: Use cached data for min/max clocks
  gpu: nova-core: fix bounds check in PmuLookupTableEntry::new
  drm/amdgpu: Replace HQD terminology with slots naming
  drm/amdgpu: Add user queue instance count in HW IP info
  drm/amd/amdgpu: Add helper functions for isp buffers
  drm/amd/amdgpu: Initialize swnode for ISP MFD device
  ...
2025-07-30 19:26:49 -07:00
Linus Torvalds 22c5696e3f Driver core changes for 6.17-rc1
- DEBUGFS
 
   - Remove unneeded debugfs_file_{get,put}() instances
 
   - Remove last remnants of debugfs_real_fops()
 
   - Allow storing non-const void * in struct debugfs_inode_info::aux
 
 - SYSFS
 
   - Switch back to attribute_group::bin_attrs (treewide)
 
   - Switch back to bin_attribute::read()/write() (treewide)
 
   - Constify internal references to 'struct bin_attribute'
 
 - Support cache-ids for device-tree systems
 
   - Add arch hook arch_compact_of_hwid()
 
   - Use arch_compact_of_hwid() to compact MPIDR values on arm64
 
 - Rust
 
   - Device
 
     - Introduce CoreInternal device context (for bus internal methods)
 
     - Provide generic drvdata accessors for bus devices
 
     - Provide Driver::unbind() callbacks
 
     - Use the infrastructure above for auxiliary, PCI and platform
 
     - Implement Device::as_bound()
 
     - Rename Device::as_ref() to Device::from_raw() (treewide)
 
     - Implement fwnode and device property abstractions
 
       - Implement example usage in the Rust platform sample driver
 
   - Devres
 
     - Remove the inner reference count (Arc) and use pin-init instead
 
     - Replace Devres::new_foreign_owned() with devres::register()
 
     - Require T to be Send in Devres<T>
 
     - Initialize the data kept inside a Devres last
 
     - Provide an accessor for the Devres associated Device
 
   - Device ID
 
     - Add support for ACPI device IDs and driver match tables
 
     - Split up generic device ID infrastructure
 
     - Use generic device ID infrastructure in net::phy
 
   - DMA
 
     - Implement the dma::Device trait
 
     - Add DMA mask accessors to dma::Device
 
     - Implement dma::Device for PCI and platform devices
 
     - Use DMA masks from the DMA sample module
 
   - I/O
 
     - Implement abstraction for resource regions (struct resource)
 
     - Implement resource-based ioremap() abstractions
 
     - Provide platform device accessors for I/O (remap) requests
 
   - Misc
 
     - Support fallible PinInit types in Revocable
 
     - Implement Wrapper<T> for Opaque<T>
 
     - Merge pin-init blanket dependencies (for Devres)
 
 - Misc
 
   - Fix OF node leak in auxiliary_device_create()
 
   - Use util macros in device property iterators
 
   - Improve kobject sample code
 
   - Add device_link_test() for testing device link flags
 
   - Fix typo in Documentation/ABI/testing/sysfs-kernel-address_bits
 
   - Hint to prefer container_of_const() over container_of()
 -----BEGIN PGP SIGNATURE-----
 
 iHQEABYKAB0WIQS2q/xV6QjXAdC7k+1FlHeO1qrKLgUCaIjkhwAKCRBFlHeO1qrK
 LpXuAP9RWwfD9ZGgQZ9OsMk/0pZ2mDclaK97jcmI9TAeSxeZMgD1FHnOMTY7oSIi
 iG7Muq0yLD+A5gk9HUnMUnFNrngWCg==
 =jgRj
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core

Pull driver core updates from Danilo Krummrich:
 "debugfs:
   - Remove unneeded debugfs_file_{get,put}() instances
   - Remove last remnants of debugfs_real_fops()
   - Allow storing non-const void * in struct debugfs_inode_info::aux

  sysfs:
   - Switch back to attribute_group::bin_attrs (treewide)
   - Switch back to bin_attribute::read()/write() (treewide)
   - Constify internal references to 'struct bin_attribute'

  Support cache-ids for device-tree systems:
   - Add arch hook arch_compact_of_hwid()
   - Use arch_compact_of_hwid() to compact MPIDR values on arm64

  Rust:
   - Device:
       - Introduce CoreInternal device context (for bus internal methods)
       - Provide generic drvdata accessors for bus devices
       - Provide Driver::unbind() callbacks
       - Use the infrastructure above for auxiliary, PCI and platform
       - Implement Device::as_bound()
       - Rename Device::as_ref() to Device::from_raw() (treewide)
       - Implement fwnode and device property abstractions
       - Implement example usage in the Rust platform sample driver
   - Devres:
       - Remove the inner reference count (Arc) and use pin-init instead
       - Replace Devres::new_foreign_owned() with devres::register()
       - Require T to be Send in Devres<T>
       - Initialize the data kept inside a Devres last
       - Provide an accessor for the Devres associated Device
   - Device ID:
       - Add support for ACPI device IDs and driver match tables
       - Split up generic device ID infrastructure
       - Use generic device ID infrastructure in net::phy
   - DMA:
       - Implement the dma::Device trait
       - Add DMA mask accessors to dma::Device
       - Implement dma::Device for PCI and platform devices
       - Use DMA masks from the DMA sample module
   - I/O:
       - Implement abstraction for resource regions (struct resource)
       - Implement resource-based ioremap() abstractions
       - Provide platform device accessors for I/O (remap) requests
   - Misc:
       - Support fallible PinInit types in Revocable
       - Implement Wrapper<T> for Opaque<T>
       - Merge pin-init blanket dependencies (for Devres)

  Misc:
   - Fix OF node leak in auxiliary_device_create()
   - Use util macros in device property iterators
   - Improve kobject sample code
   - Add device_link_test() for testing device link flags
   - Fix typo in Documentation/ABI/testing/sysfs-kernel-address_bits
   - Hint to prefer container_of_const() over container_of()"

* tag 'driver-core-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core: (84 commits)
  rust: io: fix broken intra-doc links to `platform::Device`
  rust: io: fix broken intra-doc link to missing `flags` module
  rust: io: mem: enable IoRequest doc-tests
  rust: platform: add resource accessors
  rust: io: mem: add a generic iomem abstraction
  rust: io: add resource abstraction
  rust: samples: dma: set DMA mask
  rust: platform: implement the `dma::Device` trait
  rust: pci: implement the `dma::Device` trait
  rust: dma: add DMA addressing capabilities
  rust: dma: implement `dma::Device` trait
  rust: net::phy Change module_phy_driver macro to use module_device_table macro
  rust: net::phy represent DeviceId as transparent wrapper over mdio_device_id
  rust: device_id: split out index support into a separate trait
  device: rust: rename Device::as_ref() to Device::from_raw()
  arm64: cacheinfo: Provide helper to compress MPIDR value into u32
  cacheinfo: Add arch hook to compress CPU h/w id into 32 bits for cache-id
  cacheinfo: Set cache 'id' based on DT data
  container_of: Document container_of() is not to be used in new code
  driver core: auxiliary bus: fix OF node leak
  ...
2025-07-29 12:15:39 -07:00
Yann Dirson 9f1f7cd467 drm/amdgpu: fix module parameter description
Fix dcdebugmask description.

Signed-off-by: Yann Dirson <ydirson@free.fr>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-28 16:40:32 -04:00
Amber Lin 216e4cff54 drm/amdgpu: Add chain runlists support to GC9.4.2
Starting from MEC v97, GC 9.4.2 supports chain runlists of XNACK+/XNACK-
processes.

Signed-off-by: Amber Lin <Amber.Lin@amd.com>
Reviewed-by: Philip Yang<Philip.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-28 16:40:32 -04:00
Lijo Lazar 21c6764ed4 drm/amdgpu: Update external revid for GC v9.5.0
Use different external revid for GC v9.5.0 SOCs.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-28 16:40:07 -04:00
YiPeng Chai 3fc96f60b6 drm/amdgpu: add critical address check for bad page retirement
Add critical address check for bad page retirement.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-28 16:40:07 -04:00
Sathishkumar S faab5ea083 drm/amdgpu: Check vcn sram load return value
Log an error when vcn sram load fails in indirect mode
and return the same error value.

Signed-off-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Reviewed-by: Leo Liu <leo.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-28 16:40:06 -04:00
Lijo Lazar 9d1ac25c7f drm/amdgpu: Update supported modes for GC v9.5.0
For GC v9.5.0 SOCs, both CPX and QPX compute modes are also supported in
NPS2 mode.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Acked-by: Mangesh Gadre <Mangesh.Gadre@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-28 16:40:06 -04:00
YiPeng Chai f348691897 drm/amdgpu: support ras critical address check
Support ras critical address check.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-28 16:40:06 -04:00
Tao Zhou d45c5e6845 drm/amdgpu: adjust the update of RAS bad page number
One eeprom record may not map to unit number of bad pages, the accurate
bad page number is gotten after bad page address check.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-28 16:40:06 -04:00
Tao Zhou 2b17c240e8 drm/amdgpu: add range check for RAS bad page address
Exclude invalid bad pages.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-28 16:40:06 -04:00
YiPeng Chai a813437c33 drm/amdgpu: add command to check address validity
Add command to check address validity and remove
unused command codes.

v2:
 The command interface adds new parameters to support
 multiple check address strategies.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-28 16:40:06 -04:00
YiPeng Chai 020ad3a4ed drm/amdgpu: query the allocated vram address block info
The bad pages that need to be retired are not all allocated
in the same poison consumption process, so an interface is
needed to query the processes that allocate the bad pages.
By killing all the processes that allocate the bad pages,
the bad pages can be reserved immediately.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-28 16:40:06 -04:00
Alex Deucher a0b34e4c86 drm/amdgpu: update mmhub 4.1.0 client id mappings
Update the client id mapping so the correct clients
get printed when there is a mmhub page fault.

Tested-by: David (Ming Qiang) Wu <David.Wu3@amd.com>
Reviewed-by: David (Ming Qiang) Wu <David.Wu3@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
2025-07-28 16:39:49 -04:00
Frank Min 0395cde08e drm/amdgpu: add kicker fws loading for gfx12/smu14/psp14
1. Add kicker firmwares loading for gfx12/smu14/psp14
2. Register additional MODULE_FIRMWARE entries for kicker fws
   - gc_12_0_1_rlc_kicker.bin
   - gc_12_0_1_imu_kicker.bin
   - psp_14_0_3_sos_kicker.bin
   - psp_14_0_3_ta_kicker.bin
   - smu_14_0_3_kicker.bin

Signed-off-by: Frank Min <Frank.Min@amd.com>
Reviewed-by: Gui Chengming <Jack.Gui@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
2025-07-28 16:27:46 -04:00
Yang Wang 8e0d1edb5c drm/amd/amdgpu: fix missing lock for cper.ring->rptr/wptr access
Add lock protection for 'ring->wptr'/'ring->rptr' to ensure the correct execution.

Fixes: 8652920d2c ("drm/amdgpu: add mutex lock for cper ring")
Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
2025-07-28 16:26:52 -04:00
Gang Ba 1f02f2044b drm/amdgpu: Avoid extra evict-restore process.
If vm belongs to another process, this is fclose after fork,
wait may enable signaling KFD eviction fence and cause parent process queue evicted.

[677852.634569]  amdkfd_fence_enable_signaling+0x56/0x70 [amdgpu]
[677852.634814]  __dma_fence_enable_signaling+0x3e/0xe0
[677852.634820]  dma_fence_wait_timeout+0x3a/0x140
[677852.634825]  amddma_resv_wait_timeout+0x7f/0xf0 [amdkcl]
[677852.634831]  amdgpu_vm_wait_idle+0x2d/0x60 [amdgpu]
[677852.635026]  amdgpu_flush+0x34/0x50 [amdgpu]
[677852.635208]  filp_flush+0x38/0x90
[677852.635213]  filp_close+0x14/0x30
[677852.635216]  do_close_on_exec+0xdd/0x130
[677852.635221]  begin_new_exec+0x1da/0x490
[677852.635225]  load_elf_binary+0x307/0xea0
[677852.635231]  ? srso_alias_return_thunk+0x5/0xfbef5
[677852.635235]  ? ima_bprm_check+0xa2/0xd0
[677852.635240]  search_binary_handler+0xda/0x260
[677852.635245]  exec_binprm+0x58/0x1a0
[677852.635249]  bprm_execve.part.0+0x16f/0x210
[677852.635254]  bprm_execve+0x45/0x80
[677852.635257]  do_execveat_common.isra.0+0x190/0x200

Suggested-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Gang Ba <Gang.Ba@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
2025-07-28 16:25:19 -04:00
Alex Deucher 284d4dfe85 drm/amdgpu: track whether a queue is a kernel queue in amdgpu_mqd_prop
Used to to set the MQD appropriately for each queue type.
Kernel queues have additional privileges.

Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org # 6.16.x
2025-07-28 16:25:04 -04:00
Nathan Chancellor c90f2e1172 drm/amdgpu: Initialize data to NULL in imu_v12_0_program_rlc_ram()
After a recent change in clang to expose uninitialized warnings from
const variables and pointers [1], there is a warning in
imu_v12_0_program_rlc_ram() because data is passed uninitialized to
program_imu_rlc_ram():

  drivers/gpu/drm/amd/amdgpu/imu_v12_0.c:374:30: error: variable 'data' is uninitialized when used here [-Werror,-Wuninitialized]
    374 |                         program_imu_rlc_ram(adev, data, (const u32)size);
        |                                                   ^~~~

As this warning happens early in clang's frontend, it does not realize
that due to the assignment of r to -EINVAL, program_imu_rlc_ram() is
never actually called, and even if it were, data would not be
dereferenced because size is 0.

Just initialize data to NULL to silence the warning, as the commit that
added program_imu_rlc_ram() mentioned it would eventually be used over
the old method, at which point data can be properly initialized and
used.

Cc: stable@vger.kernel.org
Closes: https://github.com/ClangBuiltLinux/linux/issues/2107
Fixes: 56159fffaa ("drm/amdgpu: use new method to program rlc ram")
Link: 2464313eef [1]
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-28 16:24:22 -04:00
Dave Airlie 337666c522 Merge tag 'drm-misc-fixes-2025-07-23' of https://gitlab.freedesktop.org/drm/misc/kernel into drm-fixes
drm-misc-fixes for v6.16-rc8/final?:
- Revert all uses of drm_gem_object->dmabuf to
  drm_gem_object->import_attach->dmabuf.
- Fix amdgpu returning BIOS cluttered VRAM after resume.
- Scheduler hang fix.
- Revert nouveau ioctl fix as it caused regressions.
- Fix null pointer deref in nouveau.
- Fix unnecessary semicolon in ti_sn_bridge_probe.

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Link: https://lore.kernel.org/r/72235afd-c849-49fe-9cc1-2b1781abdf08@linux.intel.com
2025-07-24 06:49:38 +10:00
Dave Airlie acab5fbd77 amd-drm-next-6.17-2025-07-17:
amdgpu:
 - Partition fixes
 - Reset fixes
 - RAS fixes
 - i2c fix
 - MPC updates
 - DSC cleanup
 - EDID fixes
 - Display idle D3 update
 - IPS updates
 - DMUB updates
 - Retimer fix
 - Replay fixes
 - Fix DC memory leak
 - Initial support for smartmux
 - DCN 4.0.1 degamma LUT fix
 - Per queue reset cleanups
 - Track ring state associated with a fence
 - SR-IOV fixes
 - SMU fixes
 - Per queue reset improvements for GC 9+ compute
 - Per queue reset improvements for GC 10+ gfx
 - Per queue reset improvements for SDMA 5+
 - Per queue reset improvements for JPEG 2+
 - Per queue reset improvements for VCN 2+
 - GC 8 fix
 - ISP updates
 
 amdkfd:
 - Enable KFD on LoongArch
 
 radeon:
 - Drop console lock during suspend/resume
 
 UAPI:
 - Add userq slot info to INFO IOCTL
   Used for IGT userq validation tests (https://lists.freedesktop.org/archives/igt-dev/2025-July/093228.html)
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQgO5Idg2tXNTSZAr293/aFa7yZ2AUCaHlrvgAKCRC93/aFa7yZ
 2LuDAP9c5phb6TBIFEmLlQEcZ+Ngx8LVkEl1JIfAhwuAwN2V0wD9H1pVVNDXmFlh
 jfB8hLYikqnxYznXuImvutGEBHM8BgQ=
 =W2D8
 -----END PGP SIGNATURE-----

Merge tag 'amd-drm-next-6.17-2025-07-17' of https://gitlab.freedesktop.org/agd5f/linux into drm-next

amd-drm-next-6.17-2025-07-17:

amdgpu:
- Partition fixes
- Reset fixes
- RAS fixes
- i2c fix
- MPC updates
- DSC cleanup
- EDID fixes
- Display idle D3 update
- IPS updates
- DMUB updates
- Retimer fix
- Replay fixes
- Fix DC memory leak
- Initial support for smartmux
- DCN 4.0.1 degamma LUT fix
- Per queue reset cleanups
- Track ring state associated with a fence
- SR-IOV fixes
- SMU fixes
- Per queue reset improvements for GC 9+ compute
- Per queue reset improvements for GC 10+ gfx
- Per queue reset improvements for SDMA 5+
- Per queue reset improvements for JPEG 2+
- Per queue reset improvements for VCN 2+
- GC 8 fix
- ISP updates

amdkfd:
- Enable KFD on LoongArch

radeon:
- Drop console lock during suspend/resume

UAPI:
- Add userq slot info to INFO IOCTL
  Used for IGT userq validation tests (https://lists.freedesktop.org/archives/igt-dev/2025-July/093228.html)

From: Alex Deucher <alexander.deucher@amd.com>
Link: https://lore.kernel.org/r/20250717213827.2061581-1-alexander.deucher@amd.com
Signed-off-by: Dave Airlie <airlied@redhat.com>
2025-07-21 11:57:43 +10:00
Dave Airlie be3cd668ff drm-misc-next for 6.17:
UAPI Changes:
 
 Cross-subsystem Changes:
 
 Core Changes:
 
 - mode_config: Change fb_create prototype to pass the drm_format_info
   and avoid redundant lookups in drivers
 - sched: kunit improvements, memory leak fixes, reset handling
   improvements
 - tests: kunit EDID update
 
 Driver Changes:
 
 - amdgpu: Hibernation fixes, structure lifetime fixes
 - nouveau: sched improvements
 - sitronix: Add Sitronix ST7567 Support
 
 - bridge:
   - Make connector available to bridge detect hook
 
 - panel:
   - More refcounting changes
   - New panels: BOE NE14QDM
 -----BEGIN PGP SIGNATURE-----
 
 iJUEABMJAB0WIQTkHFbLp4ejekA/qfgnX84Zoj2+dgUCaHisvgAKCRAnX84Zoj2+
 dihcAX49H551lQ42amN11jN4pR2tpKLjRMCSsNnXzkokJYx8adEaGTcZq+0Oi9rZ
 muGQKJABf2SSb/bem7qEu0JVL3/Pz8DOLbKIE4ltPMlHfYF5NzJhy3AKIMIjMLH7
 XqSDXOMgjA==
 =CfQG
 -----END PGP SIGNATURE-----

Merge tag 'drm-misc-next-2025-07-17' of https://gitlab.freedesktop.org/drm/misc/kernel into drm-next

drm-misc-next for 6.17:

UAPI Changes:

Cross-subsystem Changes:

Core Changes:

- mode_config: Change fb_create prototype to pass the drm_format_info
  and avoid redundant lookups in drivers
- sched: kunit improvements, memory leak fixes, reset handling
  improvements
- tests: kunit EDID update

Driver Changes:

- amdgpu: Hibernation fixes, structure lifetime fixes
- nouveau: sched improvements
- sitronix: Add Sitronix ST7567 Support

- bridge:
  - Make connector available to bridge detect hook

- panel:
  - More refcounting changes
  - New panels: BOE NE14QDM

Signed-off-by: Dave Airlie <airlied@redhat.com>

From: Maxime Ripard <mripard@redhat.com>
Link: https://lore.kernel.org/r/20250717-efficient-kudu-of-fantasy-ff95e0@houat
2025-07-21 09:16:51 +10:00
Alex Deucher 6ac55eab4f drm/amdgpu: move reset support type checks into the caller
Rather than checking in the callbacks, check if the reset
type is supported in the caller.

Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-17 12:36:56 -04:00
Alex Deucher ea2791d05a drm/amdgpu/sdma7: re-emit unprocessed state on ring reset
Re-emit the unprocessed state after resetting the queue.

Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-17 12:36:54 -04:00
Alex Deucher 9753078f54 drm/amdgpu/sdma6: re-emit unprocessed state on ring reset
Re-emit the unprocessed state after resetting the queue.

Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-17 12:36:48 -04:00
Alex Deucher 1b49bddc58 drm/amdgpu/sdma5.2: re-emit unprocessed state on ring reset
Re-emit the unprocessed state after resetting the queue.

Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-17 12:36:46 -04:00
Alex Deucher 4b1df3bad2 drm/amdgpu/sdma5: re-emit unprocessed state on ring reset
Re-emit the unprocessed state after resetting the queue.

Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-17 12:36:43 -04:00
Alex Deucher 4da11b92d7 drm/amdgpu/gfx12: re-emit unprocessed state on ring reset
Re-emit the unprocessed state after resetting the queue.
Drop the soft_recovery callbacks as the queue reset replaces
it.

Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-17 12:36:39 -04:00
Alex Deucher fa3385ac15 drm/amdgpu/gfx11: re-emit unprocessed state on ring reset
Re-emit the unprocessed state after resetting the queue.
Drop the soft_recovery callbacks as the queue reset replaces
it.

Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-17 12:36:32 -04:00
Alex Deucher f410731d5c drm/amdgpu/gfx10: re-emit unprocessed state on ring reset
Re-emit the unprocessed state after resetting the queue.
Drop the soft_recovery callbacks as the queue reset replaces
it.

Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-17 12:36:28 -04:00
Alex Deucher e22631b53a drm/amdgpu/gfx9.4.3: re-emit unprocessed state on kcq reset
Re-emit the unprocessed state after resetting the queue.

Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-17 12:36:25 -04:00
Alex Deucher ee60209b6f drm/amdgpu/gfx9: re-emit unprocessed state on kcq reset
Re-emit the unprocessed state after resetting the queue.

Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-17 12:36:19 -04:00
Arunpravin Paneer Selvam 81df6bfad6 drm/amdgpu: Add WARN_ON to the resource clear function
Set the dirty bit when the memory resource is not cleared
during BO release.

v2(Christian):
  - Drop the cleared flag set to false.
  - Improve the amdgpu_vram_mgr_set_clear_state() function.

v3:
  - Add back the resource clear flag set function call after
    being cleared during eviction (Christian).
  - Modified the patch subject name.

Signed-off-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Suggested-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-17 12:36:12 -04:00
Eeli Haapalainen 8326193401 drm/amdgpu/gfx8: reset compute ring wptr on the GPU on resume
Commit 42cdf6f687 ("drm/amdgpu/gfx8: always restore kcq MQDs") made the
ring pointer always to be reset on resume from suspend. This caused compute
rings to fail since the reset was done without also resetting it for the
firmware. Reset wptr on the GPU to avoid a disconnect between the driver
and firmware wptr.

Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3911
Fixes: 42cdf6f687 ("drm/amdgpu/gfx8: always restore kcq MQDs")
Signed-off-by: Eeli Haapalainen <eeli.haapalainen@protonmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 2becafc319)
Cc: stable@vger.kernel.org
2025-07-16 16:50:45 -04:00
Lijo Lazar 86790e300d drm/amdgpu: Increase reset counter only on success
Increment the reset counter only if soft recovery succeeded. This is
consistent with a ring hard reset behaviour where counter gets
incremented only if hard reset succeeded.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 25c314aa3e)
Cc: stable@vger.kernel.org
2025-07-16 16:49:04 -04:00
Jesse Zhang 9ffab039bc drm/amdgpu: Replace HQD terminology with slots naming
The term "HQD" is CP-specific and doesn't
accurately describe the queue resources for other IP blocks like SDMA,
VCN, or VPE. This change:

1. Renames `num_hqds` to `num_slots` in amdgpu_kms.c to better reflect
   the generic nature of the resource counting
2. Updates the UAPI struct member from `userq_num_hqds` to `userq_num_slots`
3. Maintains the same functionality while using more appropriate terminology

Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:17:36 -04:00
Jesse Zhang 78d0a27ae0 drm/amdgpu: Add user queue instance count in HW IP info
This change exposes the number of available user queue instances
for each hardware IP type (GFX, COMPUTE, SDMA) through the
drm_amdgpu_info_hw_ip interface.

Key changes:
1. Added userq_num_instance field to drm_amdgpu_info_hw_ip structure
2. Implemented counting of available HQD slots using:
   - mes.gfx_hqd_mask for GFX queues
   - mes.compute_hqd_mask for COMPUTE queues
   - mes.sdma_hqd_mask for SDMA queues
3. Only counts available instances when user queues are enabled
   (!disable_uq)

v2: using the adev->mes.gfx_hqd_mask[]/compute_hqd_mask[]/sdma_hqd_mask[] masks
  to determine the number of queue slots available for each engine type (Alex)
v3: rename userq_num_instance to userq_num_hqds (Alex)

Suggested-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:17:35 -04:00
Pratap Nirujogi 55d42f6169 drm/amd/amdgpu: Add helper functions for isp buffers
Accessing amdgpu internal data structures "struct amdgpu_device"
and "struct amdgpu_bo" in ISP V4L2 driver to alloc/free GART
buffers is not recommended.

Add new amdgpu_isp helper functions that takes opaque params
from ISP V4L2 driver and calls the amdgpu internal functions
amdgpu_bo_create_isp_user() and amdgpu_bo_create_kernel() to
alloc/free GART buffers.

Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Pratap Nirujogi <pratap.nirujogi@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:17:35 -04:00
Pratap Nirujogi e36519f5c8 drm/amd/amdgpu: Initialize swnode for ISP MFD device
Create amd_isp_capture MFD device with swnode initialized to
isp specific software_node part of fwnode graph in amd_isp4
x86/platform driver. The isp driver use this swnode handle
to retrieve the critical properties (data-lanes, mipi phyid,
link-frequencies etc.) required for camera to work on AMD ISP4
based targets.

Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Pratap Nirujogi <pratap.nirujogi@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:17:35 -04:00
Eeli Haapalainen 2becafc319 drm/amdgpu/gfx8: reset compute ring wptr on the GPU on resume
Commit 42cdf6f687 ("drm/amdgpu/gfx8: always restore kcq MQDs") made the
ring pointer always to be reset on resume from suspend. This caused compute
rings to fail since the reset was done without also resetting it for the
firmware. Reset wptr on the GPU to avoid a disconnect between the driver
and firmware wptr.

Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3911
Fixes: 42cdf6f687 ("drm/amdgpu/gfx8: always restore kcq MQDs")
Signed-off-by: Eeli Haapalainen <eeli.haapalainen@protonmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:17:28 -04:00
Alex Deucher 82a7c94fce drm/amdgpu/jpeg: clean up reset type handling
Make the handling consistent with other IPs and across
JPEG versions.

Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:17:16 -04:00
Christian König 084300fef5 drm/amdgpu: rework gmc_v9_0_get_coherence_flags v2
Avoid using the mapping here.

v2: use amdgpu_xgmi_same_hive() as suggested by Felix

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:17:10 -04:00
Alex Deucher d7767a1fd4 drm/amdgpu/vcn3: implement ring reset
Use the new helpers to handle engine resets for VCN.

Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Tested-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:17:07 -04:00
Alex Deucher 63b8c9fdfb drm/amdgpu/vcn2.5: implement ring reset
Use the new helpers to handle engine resets for VCN.

Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Tested-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:17:04 -04:00
Alex Deucher 64ac009747 drm/amdgpu/vcn2: implement ring reset
Use the new helpers to handle engine resets for VCN.

Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Tested-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:17:01 -04:00
Alex Deucher 7b6cde7f4e drm/amdgpu/vcn: add a helper framework for engine resets
With engine resets we reset all queues on the engine rather
than just a single queue.  Add a framework to handle this
similar to SDMA.

Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Tested-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:16:58 -04:00
Alex Deucher 3871149081 drm/amdgpu/vcn5: re-emit unprocessed state on ring reset
Re-emit the unprocessed state after resetting the queue.

Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Tested-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:16:55 -04:00
Alex Deucher 6166e37afd drm/amdgpu/vcn4.0.5: re-emit unprocessed state on ring reset
Re-emit the unprocessed state after resetting the queue.

Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Tested-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:16:47 -04:00
Alex Deucher 64c54f0aa2 drm/amdgpu/vcn4.0.3: re-emit unprocessed state on ring reset
Re-emit the unprocessed state after resetting the queue.

Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:16:44 -04:00
Alex Deucher d156ba3970 drm/amdgpu/vcn4: re-emit unprocessed state on ring reset
Re-emit the unprocessed state after resetting the queue.

Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Tested-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:16:41 -04:00
Alex Deucher 8bea669e67 drm/amdgpu/jpeg5.0.1: re-emit unprocessed state on ring reset
Re-emit the unprocessed state after resetting the queue.

Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:16:36 -04:00
Alex Deucher e708f2cb56 drm/amdgpu/jpeg5: add queue reset
Add queue reset support for jpeg 5.0.0.
Use the new helpers to re-emit the unprocessed state
after resetting the queue.

Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Tested-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:16:33 -04:00
Alex Deucher cf07ece3a8 drm/amdgpu/jpeg4.0.5: add queue reset
Add queue reset support for jpeg 4.0.5.
Use the new helpers to re-emit the unprocessed state
after resetting the queue.

Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Tested-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:16:25 -04:00
Alex Deucher 98f16636a2 drm/amdgpu/jpeg4.0.3: re-emit unprocessed state on ring reset
Re-emit the unprocessed state after resetting the queue.

Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:16:08 -04:00
Alex Deucher 429ccbf6f4 drm/amdgpu/jpeg4: re-emit unprocessed state on ring reset
Re-emit the unprocessed state after resetting the queue.

Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Tested-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:16:03 -04:00
Alex Deucher b81891589b drm/amdgpu/jpeg3: re-emit unprocessed state on ring reset
Re-emit the unprocessed state after resetting the queue.

Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Tested-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:15:22 -04:00
Alex Deucher bb7928f9fc drm/amdgpu/jpeg2.5: re-emit unprocessed state on ring reset
Re-emit the unprocessed state after resetting the queue.

Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Tested-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:15:19 -04:00
Alex Deucher 3c9e205f32 drm/amdgpu/jpeg2: re-emit unprocessed state on ring reset
Re-emit the unprocessed state after resetting the queue.

Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Tested-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:15:12 -04:00
Lijo Lazar 25c314aa3e drm/amdgpu: Increase reset counter only on success
Increment the reset counter only if soft recovery succeeded. This is
consistent with a ring hard reset behaviour where counter gets
incremented only if hard reset succeeded.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:14:44 -04:00
Alex Deucher ec8fbb44b5 drm/amdgpu: make compute timeouts consistent
For kernel compute queues, align the timeout with
other kernel queues (10 sec).  This had previously
been set higher for OpenCL when it used kernel
queues, but now OpenCL uses KFD user queues which
don't have a timeout limitation. This also aligns
with SR-IOV which already used a shorter timeout.

Additionally the longer timeout negatively impacts
the user experience with kernel queues for interactive
applications.

Reviewed-by: Kent Russell <kent.russell@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:14:29 -04:00
Tony Yi 991f2e0c63 drm/amdgpu: Check SQ_CONFIG register support on SRIOV
On SRIOV environments, check if RLCG supports
SQ_CONFIG register programming.

Signed-off-by: Tony Yi <Tony.Yi@amd.com>
Reviewed-by: Zhigang Luo <zhigang.luo@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:14:21 -04:00
Alex Deucher 77cc0da39c drm/amdgpu: track ring state associated with a fence
We need to know the wptr and sequence number associated
with a fence so that we can re-emit the unprocessed state
after a ring reset.  Pre-allocate storage space for
the ring buffer contents and add helpers to save off
and re-emit the unprocessed state so that it can be
re-emitted after the queue is reset.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:14:11 -04:00
Alex Deucher bc29c03b28 drm/amdgpu: clean up GC reset functions
Make them consistent and use the reset flags.

Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:10:10 -04:00
Alex Deucher e3f15cfd8b drm/amdgpu: clean up jpeg reset functions
Make them consistent and use the reset flags.

Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:10:01 -04:00
Alex Deucher 290ccae52d drm/amdgpu/vcn: don't enable per queue resets on SR-IOV
Power control is only available in bare metal.  SR-IOV
will need a different method.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:09:48 -04:00
Alex Deucher 94ee19ea14 drm/amdgpu/jpeg4: add additional ring reset error checking
Start and stop can fail, so add checks.

Fixes: 74894ffc7d ("drm/amdgpu: Add ring reset callback for JPEG4_0_0")
Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: Sathishkumar S <sathishkumar.sundararaju@amd.com>
2025-07-16 16:09:25 -04:00
Alex Deucher 2918487455 drm/amdgpu/jpeg3: add additional ring reset error checking
Start and stop can fail, so add checks.

Fixes: 03399d0bff ("drm/amdgpu: Add ring reset callback for JPEG3_0_0")
Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: Sathishkumar S <sathishkumar.sundararaju@amd.com>
2025-07-16 16:08:59 -04:00
Alex Deucher c9bfafc1a6 drm/amdgpu/jpeg2: add additional ring reset error checking
Start and stop can fail, so add checks.

Fixes: 500c04d2a7 ("drm/amdgpu: Add ring reset callback for JPEG2_0_0")
Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: Sathishkumar S <sathishkumar.sundararaju@amd.com>
2025-07-16 16:08:17 -04:00
Alex Deucher d18e1faef6 drm/amdgpu: clean up sdma reset functions
Make them consistent and drop unneeded extra variables.

Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:08:05 -04:00
Christophe JAILLET 28c5c48638 drm/amdgpu: Fix missing unlocking in an error path in amdgpu_userq_create()
If kasprintf() fails, some mutex still need to be released to avoid locking
issue, as already done in all other error handling path.

Fixes: c03ea34cbf ("drm/amdgpu: add support of debugfs for mqd information")
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Link: https://lore.kernel.org/all/366557fa7ca8173fd78c58336986ca56953369b9.1752087753.git.christophe.jaillet@wanadoo.fr/
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 15:46:04 -04:00
Ville Syrjälä b4d360701b drm/amdgpu: Pass along the format info from .fb_create() to drm_helper_mode_fill_fb_struct()
Plumb the format info from .fb_create() all the way to
drm_helper_mode_fill_fb_struct() to avoid the redundant
lookup.

Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: amd-gfx@lists.freedesktop.org
Reviewed-by: Thomas Zimmermann <tzimmermann@suse.de>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250701090722.13645-10-ville.syrjala@linux.intel.com
2025-07-16 20:07:03 +03:00
Ville Syrjälä a34cc7bf10 drm: Allow the caller to pass in the format info to drm_helper_mode_fill_fb_struct()
Soon all drivers should have the format info already available in the
places where they call drm_helper_mode_fill_fb_struct(). Allow it to
be passed along into drm_helper_mode_fill_fb_struct() instead of doing
yet another redundant lookup.

Start by always passing in NULL and still doing the extra lookup.
The actual changes to avoid the lookup will follow.

Done with cocci (with some manual fixups):
@@
identifier dev, fb, mode_cmd;
expression get_format_info;
@@
void drm_helper_mode_fill_fb_struct(struct drm_device *dev,
                                    struct drm_framebuffer *fb,
+                                    const struct drm_format_info *info,
                                    const struct drm_mode_fb_cmd2 *mode_cmd)
{
...
- fb->format = get_format_info;
+ fb->format = info ?: get_format_info;
...
}

@@
identifier dev, fb, mode_cmd;
@@
void drm_helper_mode_fill_fb_struct(struct drm_device *dev,
                                    struct drm_framebuffer *fb,
+                                    const struct drm_format_info *info,
                                    const struct drm_mode_fb_cmd2 *mode_cmd);

@@
expression dev, fb, mode_cmd;
@@
drm_helper_mode_fill_fb_struct(dev, fb
+	       ,NULL
	       ,mode_cmd);

Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Liviu Dudau <liviu.dudau@arm.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Inki Dae <inki.dae@samsung.com>
Cc: Seung-Woo Kim <sw0312.kim@samsung.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Patrik Jakobsson <patrik.r.jakobsson@gmail.com>
Cc: Rob Clark <robdclark@gmail.com>
Cc: Abhinav Kumar <quic_abhinavk@quicinc.com>
Cc: Dmitry Baryshkov <lumag@kernel.org>
Cc: Sean Paul <sean@poorly.run>
Cc: Marijn Suijten <marijn.suijten@somainline.org>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: Tomi Valkeinen <tomi.valkeinen@ideasonboard.com>
Cc: Thierry Reding <thierry.reding@gmail.com>
Cc: Mikko Perttunen <mperttunen@nvidia.com>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Cc: Gurchetan Singh <gurchetansingh@chromium.org>
Cc: Chia-I Wu <olvaffe@gmail.com>
Cc: Zack Rusin <zack.rusin@broadcom.com>
Cc: Broadcom internal kernel review list <bcm-kernel-feedback-list@broadcom.com>
Cc: amd-gfx@lists.freedesktop.org
Cc: linux-arm-msm@vger.kernel.org
Cc: freedreno@lists.freedesktop.org
Cc: nouveau@lists.freedesktop.org
Cc: linux-tegra@vger.kernel.org
Cc: virtualization@lists.linux.dev
Reviewed-by: Thomas Zimmermann <tzimmermann@suse.de>
Reviewed-by: Laurent Pinchart <laurent.pinchart+renesas@ideasonboard.com>
Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@oss.qualcomm.com>
Reviewed-by: Liviu Dudau <liviu.dudau@arm.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Acked-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250701090722.13645-6-ville.syrjala@linux.intel.com
2025-07-16 20:04:45 +03:00
Ville Syrjälä 81112eaac5 drm: Pass the format info to .fb_create()
Pass along the format information from the top to .fb_create()
so that we can avoid redundant (and somewhat expensive) lookups
in the drivers.

Done with cocci (with some manual fixups):
@@
identifier func =~ ".*create.*";
identifier dev, file, mode_cmd;
@@
struct drm_framebuffer *func(
       struct drm_device *dev,
       struct drm_file *file,
+      const struct drm_format_info *info,
       const struct drm_mode_fb_cmd2 *mode_cmd)
{
...
(
- const struct drm_format_info *info = drm_get_format_info(...);
|
- const struct drm_format_info *info;
...
- info = drm_get_format_info(...);
)
<...
- if (!info)
-    return ...;
...>
}

@@
identifier func =~ ".*create.*";
identifier dev, file, mode_cmd;
@@
struct drm_framebuffer *func(
       struct drm_device *dev,
       struct drm_file *file,
+      const struct drm_format_info *info,
       const struct drm_mode_fb_cmd2 *mode_cmd)
{
...
}

@find@
identifier fb_create_func =~ ".*create.*";
identifier dev, file, mode_cmd;
@@
struct drm_framebuffer *fb_create_func(
       struct drm_device *dev,
       struct drm_file *file,
+      const struct drm_format_info *info,
       const struct drm_mode_fb_cmd2 *mode_cmd);

@@
identifier find.fb_create_func;
expression dev, file, mode_cmd;
@@
fb_create_func(dev, file
+	       ,info
	       ,mode_cmd)

@@
expression dev, file, mode_cmd;
@@
drm_gem_fb_create(dev, file
+	       ,info
	       ,mode_cmd)

@@
expression dev, file, mode_cmd;
@@
drm_gem_fb_create_with_dirty(dev, file
+	       ,info
	       ,mode_cmd)

@@
expression dev, file_priv, mode_cmd;
identifier info, fb;
@@
info = drm_get_format_info(...);
...
fb = dev->mode_config.funcs->fb_create(dev, file_priv
+                                      ,info
                                       ,mode_cmd);

@@
identifier dev, file_priv, mode_cmd;
@@
struct drm_mode_config_funcs {
...
struct drm_framebuffer *(*fb_create)(struct drm_device *dev,
                                     struct drm_file *file_priv,
+                                     const struct drm_format_info *info,
                                     const struct drm_mode_fb_cmd2 *mode_cmd);
...
};

v2: Fix kernel docs (Laurent)
    Fix commit msg (Geert)

Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Liviu Dudau <liviu.dudau@arm.com>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Inki Dae <inki.dae@samsung.com>
Cc: Seung-Woo Kim <sw0312.kim@samsung.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Patrik Jakobsson <patrik.r.jakobsson@gmail.com>
Cc: Chun-Kuang Hu <chunkuang.hu@kernel.org>
Cc: Philipp Zabel <p.zabel@pengutronix.de>
Cc: Rob Clark <robdclark@gmail.com>
Cc: Abhinav Kumar <quic_abhinavk@quicinc.com>
Cc: Dmitry Baryshkov <lumag@kernel.org>
Cc: Sean Paul <sean@poorly.run>
Cc: Marijn Suijten <marijn.suijten@somainline.org>
Cc: Marek Vasut <marex@denx.de>
Cc: Stefan Agner <stefan@agner.ch>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: Tomi Valkeinen <tomi.valkeinen@ideasonboard.com>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Kieran Bingham <kieran.bingham+renesas@ideasonboard.com>
Cc: Biju Das <biju.das.jz@bp.renesas.com>
Cc: Sandy Huang <hjc@rock-chips.com>
Cc: "Heiko Stübner" <heiko@sntech.de>
Cc: Andy Yan <andy.yan@rock-chips.com>
Cc: Thierry Reding <thierry.reding@gmail.com>
Cc: Mikko Perttunen <mperttunen@nvidia.com>
Cc: Dave Stevenson <dave.stevenson@raspberrypi.com>
Cc: "Maíra Canal" <mcanal@igalia.com>
Cc: Raspberry Pi Kernel Maintenance <kernel-list@raspberrypi.com>
Cc: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Cc: Gurchetan Singh <gurchetansingh@chromium.org>
Cc: Chia-I Wu <olvaffe@gmail.com>
Cc: Zack Rusin <zack.rusin@broadcom.com>
Cc: Broadcom internal kernel review list <bcm-kernel-feedback-list@broadcom.com>
Cc: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Cc: amd-gfx@lists.freedesktop.org
Cc: linux-arm-msm@vger.kernel.org
Cc: freedreno@lists.freedesktop.org
Cc: nouveau@lists.freedesktop.org
Cc: virtualization@lists.linux.dev
Cc: spice-devel@lists.freedesktop.org
Cc: linux-renesas-soc@vger.kernel.org
Cc: linux-tegra@vger.kernel.org
Cc: Laurent Pinchart <laurent.pinchart+renesas@ideasonboard.com>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Thomas Zimmermann <tzimmermann@suse.de>
Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@oss.qualcomm.com>
Acked-by: Liviu Dudau <liviu.dudau@arm.com>
Reviewed-by: Laurent Pinchart <laurent.pinchart+renesas@ideasonboard.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Acked-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250701090722.13645-5-ville.syrjala@linux.intel.com
2025-07-16 20:03:14 +03:00
Arunpravin Paneer Selvam 95a16160ca drm/amdgpu: Reset the clear flag in buddy during resume
- Added a handler in DRM buddy manager to reset the cleared
  flag for the blocks in the freelist.

- This is necessary because, upon resuming, the VRAM becomes
  cluttered with BIOS data, yet the VRAM backend manager
  believes that everything has been cleared.

v2:
  - Add lock before accessing drm_buddy_clear_reset_blocks()(Matthew Auld)
  - Force merge the two dirty blocks.(Matthew Auld)
  - Add a new unit test case for this issue.(Matthew Auld)
  - Having this function being able to flip the state either way would be
    good. (Matthew Brost)

v3(Matthew Auld):
  - Do merge step first to avoid the use of extra reset flag.

Signed-off-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Suggested-by: Christian König <christian.koenig@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Cc: stable@vger.kernel.org
Fixes: a68c7eaa7a ("drm/amdgpu: Enable clear page functionality")
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3812
Signed-off-by: Christian König <christian.koenig@amd.com>
Link: https://lore.kernel.org/r/20250716075125.240637-2-Arunpravin.PaneerSelvam@amd.com
2025-07-16 12:50:32 +02:00
ganglxie 48ee3d8e5e drm/amdgpu: refine bad page loading when in the same nps mode
when loading bad page in the same nps mode, need to set the other fields
fields in eeprom records manually besides retired_page

Signed-off-by: ganglxie <ganglxie@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-15 14:07:53 -04:00
ganglxie 660261df61 drm/amdgpu: refine eeprom data check
add eeprom data checksum check before driver unload. reset eeprom
and save correct data to eeprom when check failed

Signed-off-by: ganglxie <ganglxie@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-15 14:07:53 -04:00
Ce Sun 48cb9c3b21 drm/amdgpu: The interrupt source was not released
When the driver is unloaded, the interrupt source of
the rma device is not released, resulting in the failure
of hw_init when loading again using bad_page_threshold.

Signed-off-by: Ce Sun <cesun102@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-15 14:07:51 -04:00
Alex Deucher 7a5b69d60e drm/amdgpu/vcn5: add additional ring reset error checking
Start and stop can fail, so add checks.

Fixes: b54695dae9 ("drm/amd: Add per-ring reset for vcn v5.0.0 use")
Reviewed-by: Mario Limonciello <mari.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: Mario Limonciello <mario.limonciello@amd.com>
2025-07-15 14:07:43 -04:00
Alex Deucher 1b556bcc38 drm/amdgpu/vcn4.0.5: add additional ring reset error checking
Start and stop can fail, so add checks.

Fixes: d1a46cdd00 ("drm/amd: Add per-ring reset for vcn v4.0.5 use")
Reviewed-by: Mario Limonciello <mari.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: Mario Limonciello <mario.limonciello@amd.com>
2025-07-15 14:07:39 -04:00
Alex Deucher d115a63f81 drm/amdgpu/vcn4: add additional ring reset error checking
Start and stop can fail, so add checks.

Fixes: b8b6e6f165 ("drm/amd: Add per-ring reset for vcn v4.0.0 use")
Reviewed-by: Mario Limonciello <mari.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: Mario Limonciello <mario.limonciello@amd.com>
2025-07-15 14:07:34 -04:00
Alex Deucher a4b2ba8f63 drm/amdgpu/gfx10: fix kiq locking in KCQ reset
The ring test needs to be inside the lock.

Fixes: 097af47d3c ("drm/amdgpu/gfx10: wait for reset done before remap")
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: Jiadong Zhu <Jiadong.Zhu@amd.com>
2025-07-15 14:07:28 -04:00
Alex Deucher 08f116c593 drm/amdgpu/gfx9.4.3: fix kiq locking in KCQ reset
The ring test needs to be inside the lock.

Fixes: 4c953e53cc ("drm/amdgpu/gfx_9.4.3: wait for reset done before remap")
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: Jiadong Zhu <Jiadong.Zhu@amd.com>
2025-07-15 14:07:23 -04:00
Alex Deucher 730ea5074d drm/amdgpu/gfx9: fix kiq locking in KCQ reset
The ring test needs to be inside the lock.

Fixes: fdbd69486b ("drm/amdgpu/gfx9: wait for reset done before remap")
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: Jiadong Zhu <Jiadong.Zhu@amd.com>
2025-07-15 14:07:17 -04:00
Lijo Lazar 8ff4a4b98d drm/amdgpu: Use cached partition mode, if valid
For current partition mode queries, return the mode cached in partition
manager whenever it's valid.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-15 14:07:17 -04:00
Maíra Canal 0a5dc1b67e
drm/sched: Rename DRM_GPU_SCHED_STAT_NOMINAL to DRM_GPU_SCHED_STAT_RESET
Among the scheduler's statuses, the only one that indicates an error is
DRM_GPU_SCHED_STAT_ENODEV. Any status other than DRM_GPU_SCHED_STAT_ENODEV
signifies that the operation succeeded and the GPU is in a nominal state.

However, to provide more information about the GPU's status, it is needed
to convey more information than just "OK".

Therefore, rename DRM_GPU_SCHED_STAT_NOMINAL to
DRM_GPU_SCHED_STAT_RESET, which better communicates the meaning of this
status. The status DRM_GPU_SCHED_STAT_RESET indicates that the GPU has
hung, but it has been successfully reset and is now in a nominal state
again.

Reviewed-by: Philipp Stanner <phasta@kernel.org>
Link: https://lore.kernel.org/r/20250714-sched-skip-reset-v6-1-5c5ba4f55039@igalia.com
Signed-off-by: Maíra Canal <mcanal@igalia.com>
2025-07-15 08:27:00 -03:00
Simona Vetter 7e11e01d1f amd-drm-next-6.17-2025-07-11:
amdgpu:
 - Clean up function signatures
 - GC 10 KGQ reset fix
 - SDMA reset cleanups
 - Misc fixes
 - LVDS fixes
 - UserQ fix
 
 amdkfd:
 - Reset fix
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQgO5Idg2tXNTSZAr293/aFa7yZ2AUCaHF5fgAKCRC93/aFa7yZ
 2FbQAPsH59r6i1K6TfyLYuCIOwFXyqx/AEy2/HmlBx3r2ElC2AEA8R0rrKTVpzlP
 ZN6WaCDLEBOlk21S3U3QCHuSlf/G1Ac=
 =H/st
 -----END PGP SIGNATURE-----

Merge tag 'amd-drm-next-6.17-2025-07-11' of https://gitlab.freedesktop.org/agd5f/linux into drm-next

amd-drm-next-6.17-2025-07-11:

amdgpu:
- Clean up function signatures
- GC 10 KGQ reset fix
- SDMA reset cleanups
- Misc fixes
- LVDS fixes
- UserQ fix

amdkfd:
- Reset fix

Signed-off-by: Simona Vetter <simona.vetter@ffwll.ch>
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250711205548.21052-1-alexander.deucher@amd.com
2025-07-11 23:55:40 +02:00
André Almeida 667efb3419 drm/amdgpu: Fix lifetime of struct amdgpu_task_info after ring reset
When a ring reset happens, amdgpu calls drm_dev_wedged_event() using
struct amdgpu_task_info *ti as one of the arguments. After using *ti, a
call to amdgpu_vm_put_task_info(ti) is required to correctly track its
lifetime.

However, it's called from a place that the ring reset path never reaches
due to a goto after drm_dev_wedged_event() is called. Move
amdgpu_vm_put_task_info() bellow the exit label to make sure that it's
called regardless of the code path.

amdgpu_vm_put_task_info() can only accept a valid address or NULL as
argument, so initialise *ti to make sure we can call this function if
*ti isn't used.

Fixes: a72002cb18 ("drm/amdgpu: Make use of drm_wedge_task_info")
Reported-by: Dave Airlie <airlied@gmail.com>
Closes: https://lore.kernel.org/dri-devel/CAPM=9tz0rQP8VZWKWyuF8kUMqRScxqoa6aVdwWw9=5yYxyYQ2Q@mail.gmail.com/
Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://lore.kernel.org/r/20250704030629.1064397-1-andrealmeid@igalia.com
Signed-off-by: André Almeida <andrealmeid@igalia.com>
2025-07-10 18:41:21 -03:00
Samuel Zhang 530694f54d drm/amdgpu: do not resume device in thaw for normal hibernation
For normal hibernation, GPU do not need to be resumed in thaw since it is
not involved in writing the hibernation image. Skip resume in this case
can reduce the hibernation time.

On VM with 8 * 192GB VRAM dGPUs, 98% VRAM usage and 1.7TB system memory,
this can save 50 minutes.

Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
Tested-by: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Link: https://lore.kernel.org/r/20250710062313.3226149-6-guoqing.zhang@amd.com
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
2025-07-10 10:49:45 -05:00
Samuel Zhang 924dda024f drm/amdgpu: move GTT to shmem after eviction for hibernation
When hibernate with data center dGPUs, huge number of VRAM BOs evicted
to GTT and takes too much system memory. This will cause hibernation
fail due to insufficient memory for creating the hibernation image.

Move GTT BOs to shmem in KMD, then shmem to swap disk in kernel
hibernation code to make room for hibernation image.

Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://lore.kernel.org/r/20250710062313.3226149-3-guoqing.zhang@amd.com
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
2025-07-10 10:49:45 -05:00
Sunil Khatri 03d5236014 drm/amdgpu: fix the logic to validate fpriv and root bo
Fix the smatch warning,
smatch warnings:
drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c:2146 amdgpu_pt_info_read()
error: we previously assumed 'fpriv' could be null (see line 2146)

"if (!fpriv && !fpriv->vm.root.bo)", It has to be an OR condition
rather than an AND which makes an NULL dereference in case fpriv is NULL.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202507090525.9rDWGhz3-lkp@intel.com/
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Link: https://lore.kernel.org/r/20250709071618.591866-1-sunil.khatri@amd.com
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Christian König <christian.koenig@amd.com>
2025-07-09 10:15:30 +02:00
Sunil Khatri 8f9abaff41 drm/amdgpu: fix MQD debugfs undefined symbol when DEBUG_FS=n
Fix undefined reference to amdgpu_mqd_info_fops during
debugfs_create_file if DEBUG_FS=n

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Link: https://lore.kernel.org/r/20250708101551.68033-1-sunil.khatri@amd.com
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Christian König <christian.koenig@amd.com>
2025-07-09 10:14:19 +02:00
Maarten Lankhorst e21354aea4 Merge remote-tracking branch 'drm/drm-next' into drm-misc-next
Pull in drm-intel-next for the updates to drm panic handling.

Signed-off-by: Maarten Lankhorst <dev@lankhorst.se>
2025-07-08 16:49:07 +02:00
Vitaly Prosyak a886d26f2c drm/amdgpu: fix use-after-free in amdgpu_userq_suspend+0x51a/0x5a0
[  +0.000020] BUG: KASAN: slab-use-after-free in amdgpu_userq_suspend+0x51a/0x5a0 [amdgpu]
[  +0.000817] Read of size 8 at addr ffff88812eec8c58 by task amd_pci_unplug/1733

[  +0.000027] CPU: 10 UID: 0 PID: 1733 Comm: amd_pci_unplug Tainted: G        W          6.14.0+ #2
[  +0.000009] Tainted: [W]=WARN
[  +0.000003] Hardware name: ASUS System Product Name/ROG STRIX B550-F GAMING (WI-FI), BIOS 1401 12/03/2020
[  +0.000004] Call Trace:
[  +0.000004]  <TASK>
[  +0.000003]  dump_stack_lvl+0x76/0xa0
[  +0.000011]  print_report+0xce/0x600
[  +0.000009]  ? srso_return_thunk+0x5/0x5f
[  +0.000006]  ? kasan_complete_mode_report_info+0x76/0x200
[  +0.000007]  ? kasan_addr_to_slab+0xd/0xb0
[  +0.000006]  ? amdgpu_userq_suspend+0x51a/0x5a0 [amdgpu]
[  +0.000707]  kasan_report+0xbe/0x110
[  +0.000006]  ? amdgpu_userq_suspend+0x51a/0x5a0 [amdgpu]
[  +0.000541]  __asan_report_load8_noabort+0x14/0x30
[  +0.000005]  amdgpu_userq_suspend+0x51a/0x5a0 [amdgpu]
[  +0.000535]  ? stop_cpsch+0x396/0x600 [amdgpu]
[  +0.000556]  ? stop_cpsch+0x429/0x600 [amdgpu]
[  +0.000536]  ? __pfx_amdgpu_userq_suspend+0x10/0x10 [amdgpu]
[  +0.000536]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? kgd2kfd_suspend+0x132/0x1d0 [amdgpu]
[  +0.000542]  amdgpu_device_fini_hw+0x581/0xe90 [amdgpu]
[  +0.000485]  ? down_write+0xbb/0x140
[  +0.000007]  ? __mutex_unlock_slowpath.constprop.0+0x317/0x360
[  +0.000005]  ? __pfx_amdgpu_device_fini_hw+0x10/0x10 [amdgpu]
[  +0.000482]  ? __kasan_check_write+0x14/0x30
[  +0.000004]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? up_write+0x55/0xb0
[  +0.000007]  ? srso_return_thunk+0x5/0x5f
[  +0.000005]  ? blocking_notifier_chain_unregister+0x6c/0xc0
[  +0.000008]  amdgpu_driver_unload_kms+0x69/0x90 [amdgpu]
[  +0.000484]  amdgpu_pci_remove+0x93/0x130 [amdgpu]
[  +0.000482]  pci_device_remove+0xae/0x1e0
[  +0.000008]  device_remove+0xc7/0x180
[  +0.000008]  device_release_driver_internal+0x3d4/0x5a0
[  +0.000007]  device_release_driver+0x12/0x20
[  +0.000004]  pci_stop_bus_device+0x104/0x150
[  +0.000006]  pci_stop_and_remove_bus_device_locked+0x1b/0x40
[  +0.000005]  remove_store+0xd7/0xf0
[  +0.000005]  ? __pfx_remove_store+0x10/0x10
[  +0.000006]  ? __pfx__copy_from_iter+0x10/0x10
[  +0.000006]  ? __pfx_dev_attr_store+0x10/0x10
[  +0.000006]  dev_attr_store+0x3f/0x80
[  +0.000006]  sysfs_kf_write+0x125/0x1d0
[  +0.000004]  ? srso_return_thunk+0x5/0x5f
[  +0.000005]  ? __kasan_check_write+0x14/0x30
[  +0.000005]  kernfs_fop_write_iter+0x2ea/0x490
[  +0.000005]  ? rw_verify_area+0x70/0x420
[  +0.000005]  ? __pfx_kernfs_fop_write_iter+0x10/0x10
[  +0.000006]  vfs_write+0x90d/0xe70
[  +0.000005]  ? srso_return_thunk+0x5/0x5f
[  +0.000005]  ? __pfx_vfs_write+0x10/0x10
[  +0.000004]  ? local_clock+0x15/0x30
[  +0.000008]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? __kasan_slab_free+0x5f/0x80
[  +0.000005]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? __kasan_check_read+0x11/0x20
[  +0.000004]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? fdget_pos+0x1d3/0x500
[  +0.000007]  ksys_write+0x119/0x220
[  +0.000005]  ? putname+0x1c/0x30
[  +0.000006]  ? __pfx_ksys_write+0x10/0x10
[  +0.000007]  __x64_sys_write+0x72/0xc0
[  +0.000006]  x64_sys_call+0x18ab/0x26f0
[  +0.000006]  do_syscall_64+0x7c/0x170
[  +0.000004]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? __pfx___x64_sys_openat+0x10/0x10
[  +0.000006]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? __kasan_check_read+0x11/0x20
[  +0.000003]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? fpregs_assert_state_consistent+0x21/0xb0
[  +0.000006]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? syscall_exit_to_user_mode+0x4e/0x240
[  +0.000005]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? do_syscall_64+0x88/0x170
[  +0.000003]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? irqentry_exit+0x43/0x50
[  +0.000004]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? exc_page_fault+0x7c/0x110
[  +0.000006]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  +0.000006] RIP: 0033:0x7480c0b14887
[  +0.000005] Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[  +0.000005] RSP: 002b:00007fff142b0058 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  +0.000006] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007480c0b14887
[  +0.000003] RDX: 0000000000000001 RSI: 00007480c0e7365a RDI: 0000000000000004
[  +0.000003] RBP: 00007fff142b0080 R08: 0000563b2e73c170 R09: 0000000000000000
[  +0.000003] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fff142b02f8
[  +0.000003] R13: 0000563b159a72a9 R14: 0000563b159a9d48 R15: 00007480c0f19040
[  +0.000008]  </TASK>

[  +0.000445] Allocated by task 427 on cpu 5 at 29.342331s:
[  +0.000011]  kasan_save_stack+0x28/0x60
[  +0.000006]  kasan_save_track+0x18/0x70
[  +0.000006]  kasan_save_alloc_info+0x38/0x60
[  +0.000005]  __kasan_kmalloc+0xc1/0xd0
[  +0.000006]  __kmalloc_cache_noprof+0x1bd/0x430
[  +0.000007]  amdgpu_driver_open_kms+0x172/0x760 [amdgpu]
[  +0.000493]  drm_file_alloc+0x569/0x9a0
[  +0.000007]  drm_client_init+0x1b7/0x410
[  +0.000007]  drm_fbdev_client_setup+0x174/0x470
[  +0.000006]  drm_client_setup+0x8a/0xf0
[  +0.000006]  amdgpu_pci_probe+0x510/0x10c0 [amdgpu]
[  +0.000483]  local_pci_probe+0xe7/0x1b0
[  +0.000006]  pci_device_probe+0x5bf/0x890
[  +0.000006]  really_probe+0x1fd/0x950
[  +0.000005]  __driver_probe_device+0x307/0x410
[  +0.000006]  driver_probe_device+0x4e/0x150
[  +0.000005]  __driver_attach+0x223/0x510
[  +0.000006]  bus_for_each_dev+0x102/0x1a0
[  +0.000005]  driver_attach+0x3d/0x60
[  +0.000006]  bus_add_driver+0x309/0x650
[  +0.000005]  driver_register+0x13d/0x490
[  +0.000006]  __pci_register_driver+0x1ee/0x2b0
[  +0.000006]  rfcomm_dlc_clear_state+0x69/0x220 [rfcomm]
[  +0.000011]  do_one_initcall+0x9c/0x3e0
[  +0.000007]  do_init_module+0x29e/0x7f0
[  +0.000006]  load_module+0x5c75/0x7c80
[  +0.000006]  init_module_from_file+0x106/0x180
[  +0.000006]  idempotent_init_module+0x377/0x740
[  +0.000006]  __x64_sys_finit_module+0xd7/0x180
[  +0.000006]  x64_sys_call+0x1f0b/0x26f0
[  +0.000006]  do_syscall_64+0x7c/0x170
[  +0.000005]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

[  +0.000013] Freed by task 1733 on cpu 5 at 59.907086s:
[  +0.000011]  kasan_save_stack+0x28/0x60
[  +0.000006]  kasan_save_track+0x18/0x70
[  +0.000005]  kasan_save_free_info+0x3b/0x60
[  +0.000005]  __kasan_slab_free+0x54/0x80
[  +0.000006]  kfree+0x127/0x470
[  +0.000006]  amdgpu_driver_postclose_kms+0x455/0x760 [amdgpu]
[  +0.000493]  drm_file_free.part.0+0x5b1/0xba0
[  +0.000006]  drm_file_free+0x13/0x30
[  +0.000006]  drm_client_release+0x1c4/0x2b0
[  +0.000006]  drm_fbdev_ttm_fb_destroy+0xd2/0x120 [drm_ttm_helper]
[  +0.000007]  put_fb_info+0x97/0xe0
[  +0.000007]  unregister_framebuffer+0x197/0x380
[  +0.000005]  drm_fb_helper_unregister_info+0x94/0x100
[  +0.000005]  drm_fbdev_client_unregister+0x3c/0x80
[  +0.000007]  drm_client_dev_unregister+0x144/0x330
[  +0.000006]  drm_dev_unregister+0x49/0x1b0
[  +0.000006]  drm_dev_unplug+0x4c/0xd0
[  +0.000006]  amdgpu_pci_remove+0x58/0x130 [amdgpu]
[  +0.000484]  pci_device_remove+0xae/0x1e0
[  +0.000008]  device_remove+0xc7/0x180
[  +0.000007]  device_release_driver_internal+0x3d4/0x5a0
[  +0.000006]  device_release_driver+0x12/0x20
[  +0.000007]  pci_stop_bus_device+0x104/0x150
[  +0.000006]  pci_stop_and_remove_bus_device_locked+0x1b/0x40
[  +0.000006]  remove_store+0xd7/0xf0
[  +0.000006]  dev_attr_store+0x3f/0x80
[  +0.000005]  sysfs_kf_write+0x125/0x1d0
[  +0.000006]  kernfs_fop_write_iter+0x2ea/0x490
[  +0.000006]  vfs_write+0x90d/0xe70
[  +0.000006]  ksys_write+0x119/0x220
[  +0.000006]  __x64_sys_write+0x72/0xc0
[  +0.000006]  x64_sys_call+0x18ab/0x26f0
[  +0.000005]  do_syscall_64+0x7c/0x170
[  +0.000006]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

[  +0.000012] The buggy address belongs to the object at ffff88812eec8000
               which belongs to the cache kmalloc-rnd-07-4k of size 4096
[  +0.000016] The buggy address is located 3160 bytes inside of
               freed 4096-byte region [ffff88812eec8000, ffff88812eec9000)

[  +0.000023] The buggy address belongs to the physical page:
[  +0.000009] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x12eec8
[  +0.000007] head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[  +0.000005] flags: 0x17ffffc0000040(head|node=0|zone=2|lastcpupid=0x1fffff)
[  +0.000007] page_type: f5(slab)
[  +0.000008] raw: 0017ffffc0000040 ffff888100054500 dead000000000122 0000000000000000
[  +0.000005] raw: 0000000000000000 0000000080040004 00000000f5000000 0000000000000000
[  +0.000006] head: 0017ffffc0000040 ffff888100054500 dead000000000122 0000000000000000
[  +0.000005] head: 0000000000000000 0000000080040004 00000000f5000000 0000000000000000
[  +0.000006] head: 0017ffffc0000003 ffffea0004bbb201 ffffffffffffffff 0000000000000000
[  +0.000005] head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000
[  +0.000005] page dumped because: kasan: bad access detected

[  +0.000010] Memory state around the buggy address:
[  +0.000009]  ffff88812eec8b00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  +0.000012]  ffff88812eec8b80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  +0.000011] >ffff88812eec8c00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  +0.000011]                                                     ^
[  +0.000010]  ffff88812eec8c80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  +0.000011]  ffff88812eec8d00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  +0.000011] ==================================================================

The use-after-free occurs because a delayed work item (`suspend_work`) may still
be pending or running when resources it accesses are freed during device removal
or file close. The previous code used `flush_work(&fpriv->evf_mgr.suspend_work.work)`,
which does not wait for delayed work that has not yet started. As a result, the
delayed work could run after its memory was freed, causing a use-after-free.
By switching to `flush_delayed_work(&fpriv->evf_mgr.suspend_work)`, we ensure that
the kernel waits for both queued and delayed work to finish before
freeing memory, closing this race.

Fixes: adba092973 ("drm/amdgpu: Fix Illegal opcode in command stream Error")
Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-07 13:50:42 -04:00
Vitaly Prosyak a73345b866 Revert "drm/amdgpu: fix slab-use-after-free in amdgpu_userq_mgr_fini"
This reverts commit 5fb90421fa.

The original patch moved `amdgpu_userq_mgr_fini()` to the driver's
`postclose` callback, which is called after `drm_gem_release()` in
the DRM file cleanup sequence.If a user application crashes or aborts
without cleaning up its user queues, 'drm_gem_release()` may free
GEM objects that are still referenced by active user queues, leading
to use-after-free. By reverting, we ensure that user queues are
disabled and cleaned up before any GEM objects are released,
preventing this class of bug. However, this reintroduces a race
during PCI hot-unplug, where device removal can race with per-file
cleanup, leading to use-after-free in suspend/unplug paths.
This will be fixed in the next patch.

Fixes: 5fb90421fa ("drm/amdgpu: fix slab-use-after-free in amdgpu_userq_mgr_fini+0x70c")
Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-07 13:48:50 -04:00
Alex Deucher 0c3c2e334c drm/amdgpu/sdma: allow caller to handle kernel rings in engine reset
Add a parameter to amdgpu_sdma_reset_engine() to let the
caller handle the kernel rings.  This allows the kernel
rings to back up their unprocessed state if the reset comes in
via the drm scheduler rather than KFD.

Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-07 13:48:25 -04:00
Alex Deucher f8410a17d3 drm/amdgpu/sdma: consolidate engine reset handling
Move the force completion handling into the common
engine reset function.  No need to duplicate it for
every IP version.

Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-07 13:48:20 -04:00
Lijo Lazar 9888f73679 drm/amdgpu: Add a noverbose flag to psp_wait_for
For extended wait with retries on a PSP register value, add a noverbose
flag to avoid excessive error messages on each timeout.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-07 13:48:08 -04:00
Alex Deucher 14b2d71a9a drm/amdgpu/gfx10: fix KGQ reset sequence
Need to reinit the ring before remapping it and all of
the KIQ handling needs to be within the kiq lock.

Fixes: 1741281a15 ("drm/amdgpu/gfx10: add ring reset callbacks")
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-07 13:42:07 -04:00
Lijo Lazar 127ed492ad drm/amdgpu: Pass adev pointer to functions
Pass amdgpu device context instead of drm device context to some
amdgpu_device_* functions. DRM device context is not required in those
functions. No functional change.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-07 13:41:37 -04:00
Sunil Khatri c03ea34cbf drm/amdgpu: add support of debugfs for mqd information
Add debugfs support for mqd for each queue of the client.

The address exposed to debugfs could be used to dump
the mqd.

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://lore.kernel.org/r/20250704075548.1549849-5-sunil.khatri@amd.com
Signed-off-by: Christian König <christian.koenig@amd.com>
2025-07-04 16:00:48 +02:00
Sunil Khatri 719b378d37 drm/amdgpu: add debugfs support for VM pagetable per client
Add a debugfs file under the client directory which shares
the root page table base address of the VM.

This address could be used to dump the pagetable for debug
memory issues.

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://lore.kernel.org/r/20250704075548.1549849-4-sunil.khatri@amd.com
Signed-off-by: Christian König <christian.koenig@amd.com>
2025-07-04 16:00:06 +02:00
Dave Airlie 7e2818386a amd-drm-next-6.17-2025-07-01:
amdgpu:
 - FAMS2 fixes
 - OLED fixes
 - Misc cleanups
 - AUX fixes
 - DMCUB updates
 - SR-IOV hibernation support
 - RAS updates
 - DP tunneling fixes
 - DML2 fixes
 - Backlight improvements
 - Suspend improvements
 - Use scaling for non-native modes on eDP
 - SDMA 4.4.x fixes
 - PCIe DPM fixes
 - SDMA 5.x fixes
 - Cleaner shader updates for GC 9.x
 - Remove fence slab
 - ISP genpd support
 - Parition handling rework
 - SDMA FW checks for userq support
 - Add missing firmware declaration
 - Fix leak in amdgpu_ctx_mgr_entity_fini()
 - Freesync fix
 - Ring reset refactoring
 - Legacy dpm verbosity changes
 
 amdkfd:
 - GWS fix
 - mtype fix for ext coherent system memory
 - MMU notifier fix
 - gfx7/8 fix
 
 radeon:
 - CS validation support for additional GL extensions
 - Bump driver version for new CS validation checks
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQgO5Idg2tXNTSZAr293/aFa7yZ2AUCaGQ6hwAKCRC93/aFa7yZ
 2JVJAQC/IpZoSmW22VrtXjBiQ3yoYt61oOM/WJVXPV11FmisyQEAqQ9nc+/lY9SM
 s5/RskaQGlfCGqafnaSGuoILkD/YDQ4=
 =i8mT
 -----END PGP SIGNATURE-----

Merge tag 'amd-drm-next-6.17-2025-07-01' of https://gitlab.freedesktop.org/agd5f/linux into drm-next

amd-drm-next-6.17-2025-07-01:

amdgpu:
- FAMS2 fixes
- OLED fixes
- Misc cleanups
- AUX fixes
- DMCUB updates
- SR-IOV hibernation support
- RAS updates
- DP tunneling fixes
- DML2 fixes
- Backlight improvements
- Suspend improvements
- Use scaling for non-native modes on eDP
- SDMA 4.4.x fixes
- PCIe DPM fixes
- SDMA 5.x fixes
- Cleaner shader updates for GC 9.x
- Remove fence slab
- ISP genpd support
- Parition handling rework
- SDMA FW checks for userq support
- Add missing firmware declaration
- Fix leak in amdgpu_ctx_mgr_entity_fini()
- Freesync fix
- Ring reset refactoring
- Legacy dpm verbosity changes

amdkfd:
- GWS fix
- mtype fix for ext coherent system memory
- MMU notifier fix
- gfx7/8 fix

radeon:
- CS validation support for additional GL extensions
- Bump driver version for new CS validation checks

From: Alex Deucher <alexander.deucher@amd.com>
Link: https://lore.kernel.org/r/20250701194707.32905-1-alexander.deucher@amd.com
Signed-off-by: Dave Airlie <airlied@redhat.com>
2025-07-04 10:06:29 +10:00
Alex Deucher 34659c1a1f drm/amdkfd: add hqd_sdma_get_doorbell callbacks for gfx7/8
These were missed when support was added for other generations.
The callbacks are called unconditionally so we need to make
sure all generations have them.

Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4304
Link: https://github.com/ROCm/ROCm/issues/4965
Fixes: bac38ca8c4 ("drm/amdkfd: implement per queue sdma reset for gfx 9.4+")
Cc: Jonathan Kim <jonathan.kim@amd.com>
Reported-by: Johl Brown <johlbrown@gmail.com>
Reviewed-by: Jonathan Kim <jonathan.kim@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 1e9d17a5dc)
Cc: stable@vger.kernel.org
2025-06-30 13:57:54 -04:00
Lin.Cao f3e58d8e15 drm/amdgpu: Fix memory leak in amdgpu_ctx_mgr_entity_fini
patch dd64956685 ("drm/amdgpu: Remove duplicated "context still
alive" check") removed ctx put, which will cause amdgpu_ctx_fini()
cannot be called and then cause some finished fence that added by
amdgpu_ctx_add_fence() cannot be released and cause memleak.

Fixes: dd64956685 ("drm/amdgpu: Remove duplicated "context still alive" check")
Signed-off-by: Lin.Cao <lincao12@amd.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 8cf66089e2)
Cc: stable@vger.kernel.org
2025-06-30 13:57:31 -04:00
Kent Russell e54c5de901 drm/amdgpu: Include sdma_4_4_4.bin
This got missed during SDMA 4.4.4 support.

Fixes: 968e3811c3 ("drm/amdgpu: add initial support for sdma444")
Signed-off-by: Kent Russell <kent.russell@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 51526efe02)
Cc: stable@vger.kernel.org
2025-06-30 13:52:01 -04:00
Alex Deucher 905967e359 drm/amdgpu/sdma5.x: suspend KFD queues in ring reset
SDMA 5.x only supports engine soft reset which resets
all queues on the engine.  As such, we need to suspend
KFD queues around resets like we do for SDMA 4.x.

Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 61feed0baa)
2025-06-30 13:51:07 -04:00
Alex Deucher 2ecdb61f76 drm/amdgpu/sdma6: add more ucode version checks for userq support
Fill in the SDMA ucode version checks for more SDMA 6.x parts.

v2: squash in fixes (Alex)

Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-30 12:08:00 -04:00
YiPeng Chai a6d6a86e94 drm/amdgpu: Remove useless timeout error message
The timeout is only used to interrupt polling and
not need to print a error message.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-30 12:08:00 -04:00
Ce Sun 3b3afba42f drm/amdgpu: Fix code style issue
cocci warnings: (new ones prefixed by >>)
>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:6088:8-9:
Unneeded variable: "r". Return "0" on line 6141

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202506281925.HHIpXiO7-lkp@intel.com/
Signed-off-by: Ce Sun <cesun102@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-30 12:08:00 -04:00
ganglxie cfce8f4fa7 drm/amdgpu: refine ras error injection when eeprom initialization failed
when eeprom initialization failed, we still support ras error injection,
and reserve bad pages, but do not save bad pages to eeprom

Signed-off-by: ganglxie <ganglxie@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-30 12:08:00 -04:00
Lijo Lazar 0b7f13551e drm/amdgpu: Fix error with dev_info_once usage
Fixes error with dev_info_once usage in amdgpu_device_asic_has_dc_support.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202506281140.mXfWT3EN-lkp@intel.com/
Fixes: a3e510fd69 ("drm/amdgpu: Convert from DRM_* to dev_*")
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-30 12:08:00 -04:00
Xiang Liu 4a33ca3f6e drm/amdgpu: Use correct severity for BP threshold exceed event
The severity of CPER for BP threshold exceed event should be set as
CPER_SEV_FATAL to match the OOB implementation.

Signed-off-by: Xiang Liu <xiang.liu@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-30 12:08:00 -04:00
Alex Deucher 38b20968f3 drm/amdgpu: move scheduler wqueue handling into callbacks
Move the scheduler wqueue stopping and starting into
the ring reset callbacks.  On some IPs we have to reset
an engine which may have multiple queues.  Move the wqueue
handling into the backend so we can handle them as needed
based on the type of reset available.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-30 11:58:22 -04:00
Alex Deucher 43ca5eb94b drm/amdgpu: move guilty handling into ring resets
Move guilty logic into the ring reset callbacks.  This
allows each ring reset callback to better handle fence
errors and force completions in line with the reset
behavior for each IP.  It also allows us to remove
the ring guilty callback since that logic now lives
in the reset callback.

Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-30 11:57:36 -04:00
Alex Deucher 2dee58ca47 drm/amdgpu: move force completion into ring resets
Move the force completion handling into each ring
reset function so that each engine can determine
whether or not it needs to force completion on the
jobs in the ring.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-30 11:57:29 -04:00
Christian König 821aacb2dc drm/amdgpu: rework queue reset scheduler interaction
Stopping the scheduler for queue reset is generally a good idea because
it prevents any worker from touching the ring buffer.

Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-30 11:57:21 -04:00
Alex Deucher 787e2ce10f drm/amdgpu: update ring reset function signature
Going forward, we'll need more than just the vmid.  Add the
guilty amdgpu_fence.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-30 11:57:16 -04:00
Alex Deucher d0c35c84dc drm/amdgpu: remove job parameter from amdgpu_fence_emit()
What we actually care about is the amdgpu_fence object
so pass that in explicitly to avoid possible mistakes
in the future.

The job_run_counter handling can be safely removed at this
point as we no longer support job resubmission.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-30 11:57:06 -04:00
Alex Deucher 1e9d17a5dc drm/amdkfd: add hqd_sdma_get_doorbell callbacks for gfx7/8
These were missed when support was added for other generations.
The callbacks are called unconditionally so we need to make
sure all generations have them.

Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4304
Link: https://github.com/ROCm/ROCm/issues/4965
Fixes: bac38ca8c4 ("drm/amdkfd: implement per queue sdma reset for gfx 9.4+")
Cc: Jonathan Kim <jonathan.kim@amd.com>
Reported-by: Johl Brown <johlbrown@gmail.com>
Reviewed-by: Jonathan Kim <jonathan.kim@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-30 11:56:26 -04:00
Lin.Cao 8cf66089e2 drm/amdgpu: Fix memory leak in amdgpu_ctx_mgr_entity_fini
patch dd64956685 ("drm/amdgpu: Remove duplicated "context still
alive" check") removed ctx put, which will cause amdgpu_ctx_fini()
cannot be called and then cause some finished fence that added by
amdgpu_ctx_add_fence() cannot be released and cause memleak.

Fixes: dd64956685 ("drm/amdgpu: Remove duplicated "context still alive" check")
Signed-off-by: Lin.Cao <lincao12@amd.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-30 11:55:28 -04:00
Dan Carpenter 26143d2992 drm/amdgpu: indent an if statement
The "return true;" line wasn't indented.  Also checkpatch likes when
we align the && conditions.

Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-30 11:55:21 -04:00
Thomas Zimmermann 515986100d drm/amdgpu: Use dma_buf from GEM object instance
Avoid dereferencing struct drm_gem_object.import_attach for the
imported dma-buf. The dma_buf field in the GEM object instance refers
to the same buffer. Prepares to make import_attach an implementation
detail of PRIME.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-30 11:55:16 -04:00
Thomas Zimmermann 4948e6c7fb drm/amdgpu: Test for imported buffers with drm_gem_is_imported()
Instead of testing import_attach for imported GEM buffers, invoke
drm_gem_is_imported() to do the test.

v2:
- keep amdgpu_bo_print_info() as-is (Christian)

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-30 11:54:59 -04:00
Lijo Lazar a3e510fd69 drm/amdgpu: Convert from DRM_* to dev_*
Convert from generic DRM_* to dev_* calls to have device context info.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-30 11:54:55 -04:00
Kent Russell 51526efe02 drm/amdgpu: Include sdma_4_4_4.bin
This got missed during SDMA 4.4.4 support.

Signed-off-by: Kent Russell <kent.russell@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-30 11:53:55 -04:00
André Almeida 2847237429 drm/amd: Include <linux/export.h> when needed
Fix the following compile time warning when building with W=1:

warning: EXPORT_SYMBOL() is used, but #include <linux/export.h> is missing

Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: André Almeida <andrealmeid@igalia.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-30 11:53:51 -04:00
André Almeida 6531fd55f3 drm/amd: Do not include <linux/export.h> when unused
Fix the following compile time warning when building with W=1:

warning: EXPORT_SYMBOL() is not used, but #include <linux/export.h> is present

Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: André Almeida <andrealmeid@igalia.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-30 11:53:45 -04:00
Alex Deucher 61feed0baa drm/amdgpu/sdma5.x: suspend KFD queues in ring reset
SDMA 5.x only supports engine soft reset which resets
all queues on the engine.  As such, we need to suspend
KFD queues around resets like we do for SDMA 4.x.

Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-30 11:53:34 -04:00
Alex Deucher 31135cc99c drm/amdgpu/sdma7: add ucode version checks for userq support
SDMA 7.0.0/1: 7836028

Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 8c011408ed)
2025-06-24 10:38:05 -04:00
Alex Deucher 899dec4e88 drm/amdgpu/sdma6: add ucode version checks for userq support
SDMA 6.0.0 version 24
SDMA 6.0.2 version 21
SDMA 6.0.3 version 25

Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit e8cca30d8b)
2025-06-24 10:37:58 -04:00
Mario Limonciello 73eab78721 drm/amd: Adjust output for discovery error handling
commit 017fbb6690 ("drm/amdgpu/discovery: check ip_discovery fw file
available") added support for reading an amdgpu IP discovery bin file
for some specific products. If it's not found then it will fallback to
hardcoded values. However if it's not found there is also a lot of noise
about missing files and errors.

Adjust the error handling to decrease most messages to DEBUG and to show
users less about missing files.

Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Reported-by: Marcus Seyfarth <m.seyfarth@gmail.com>
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4312
Tested-by: Marcus Seyfarth <m.seyfarth@gmail.com>
Fixes: 017fbb6690 ("drm/amdgpu/discovery: check ip_discovery fw file available")
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Link: https://lore.kernel.org/r/20250617183052.1692059-1-superm1@kernel.org
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 49f1f9f6c3)
2025-06-24 10:37:32 -04:00
Alex Deucher 99579c55c3 drm/amdgpu/mes: add compatibility checks for set_hw_resource_1
Seems some older MES firmware versions do not properly support
this packet.  Add back some the compatibility checks.

v2: switch to fw version check (Shaoyun)

Fixes: f81cd79311 ("drm/amd/amdgpu: Fix MES init sequence")
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4295
Cc: Shaoyun Liu <shaoyun.liu@amd.com>
Reviewed-by: shaoyun.liu <shaoyun.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 0180e0a5dd)
Cc: stable@vger.kernel.org
2025-06-24 10:34:56 -04:00
Srinivasan Shanmugam 0043ec26d8 drm/amdgpu/gfx9: Add Cleaner Shader Support for GFX9.x GPUs
Enable the cleaner shader for other GFX9.x series of GPUs to provide
data isolation between GPU workloads. The cleaner shader is responsible
for clearing the Local Data Store (LDS), Vector General Purpose
Registers (VGPRs), and Scalar General Purpose Registers (SGPRs), which
helps prevent data leakage and ensures accurate computation results.

This update extends cleaner shader support to GFX9.x GPUs, previously
available for GFX9.4.2. It enhances security by clearing GPU memory
between processes and maintains a consistent GPU state across KGD and
KFD workloads.

Cc: Manu Rastogi <manu.rastogi@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 99808926d0)
2025-06-24 10:34:44 -04:00
Colin Ian King 518f13f8e3 drm/amd: Fix spelling mistake "correctalbe" -> "correctable"
There is a spelling mistake in a pr_info message. Fix it.

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-24 10:04:32 -04:00
Alex Deucher 8c011408ed drm/amdgpu/sdma7: add ucode version checks for userq support
SDMA 7.0.0/1: 7836028

Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-24 10:04:30 -04:00
Alex Deucher e8cca30d8b drm/amdgpu/sdma6: add ucode version checks for userq support
SDMA 6.0.0 version 24
SDMA 6.0.2 version 21
SDMA 6.0.3 version 25

Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-24 10:04:23 -04:00
Lijo Lazar 8345a71fc5 drm/amdgpu: Add more checks to PSP mailbox
Instead of checking the response flag, use status mask also to check
against any unexpected failures like a device drop. Also, log error if
waiting on a psp response fails/times out.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-24 10:04:19 -04:00
Hawking Zhang 5562b66992 drm/amdgpu: Convert init_mem_ranges into common helpers
They can be shared across multiple products

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-24 10:04:16 -04:00
Hawking Zhang b9c58f4e32 drm/amdgpu: Generalize is_multi_chiplet with a common helper v2
It is not necessary to be ip generation specific

v2: rename the helper to is_multi_aid (Lijo)

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-24 10:04:13 -04:00
Hawking Zhang c9df2dcf90 drm/amdgpu: Convert query_memory_partition into common helpers
The query_memory_partition does not need to remain
as soc specific callbacks. They can be shared across
multiple products

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-24 10:04:09 -04:00
Hawking Zhang 97c894758b drm/amdgpu: Move MAX_MEM_RANGES to amdgpu_gmc.h
This relocation allows MAX_MEM_RANGES to be shared
across multiple products

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-24 10:04:06 -04:00
Hawking Zhang f268cef77e drm/amdgpu: Convert pre|post_partition_switch into common helpers
So they can be reused for future products

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Likun Gao <Likun.Gao@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-24 10:04:03 -04:00
Hawking Zhang e0f14a2abf drm/amdgpu: Convert update_supported_modes into a common helper
So it can be used for future products

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Likun Gao <Likun.Gao@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-24 10:03:57 -04:00
Hawking Zhang 4dbc17b455 drm/amdgpu: Convert update_partition_sched_list into a common helper v3
The update_partition_sched_list function does not
need to remain as a soc specific callback. It can
be reused for future products.

v2: bypass the function if xcp_mgr is not available (Likun)

v3: Let caller check the availability of xcp_mgr (Lijo)

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-24 10:03:41 -04:00
Hawking Zhang bf587417ff drm/amdgpu: Convert select_sched into a common helper v3
The xcp select_sched function does not need to
remain as a soc specific callback. It can be reused
for future products

v2: bypass the function if xcp_mgr is not available (Likun)

v3: Let caller check the availability of xcp mgr (Lijo)

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-24 10:03:32 -04:00
Likun Gao 37b791d667 drm/amdgpu: use common function to map ip for aqua_vanjaram
Transfer to use function amdgpu_ip_map_init to map ip
instance for aqua_vanjaram instead of operation on
different ASIC.

Signed-off-by: Likun Gao <Likun.Gao@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-24 10:03:25 -04:00
Likun Gao 20905edb24 drm/amdgpu: make ip map init to common function
IP instance map init function can be an common function
instead of operation on different ASIC.
V2: Create amdgpu_ip.[ch] file for ip related functions.

Signed-off-by: Likun Gao <Likun.Gao@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-24 10:03:16 -04:00
Pratap Nirujogi f0ebe9e578 drm/amd/amdgpu: Refine isp_v4_1_1 logging
Replace DRM_ERROR with drm_err function and update log
messages to drop __func__ and print return value.

Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Pratap Nirujogi <pratap.nirujogi@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-24 10:02:55 -04:00
Pratap Nirujogi fd14786071 drm/amd/amdgpu: Add ISP Generic PM Domain (genpd) support
AMDISP I2C device requires to power on ISP HW to probe the sensor
device. Instead of using the exported symbols from ISP driver to
control the power and clocks remotely,added Generic PM Domain (genpd)
support in amdgpu_isp device for its child devices (amd_isp_capture,
amd_isp_i2c_designware) to set power and clocks using PM methods.

Co-developed-by: Bin Du <bin.du@amd.com>
Signed-off-by: Bin Du <bin.du@amd.com>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Pratap Nirujogi <pratap.nirujogi@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-24 10:02:50 -04:00
Vitaly Prosyak 5fb90421fa drm/amdgpu: fix slab-use-after-free in amdgpu_userq_mgr_fini+0x70c
The issue was reproduced on NV10 using IGT pci_unplug test.
It is expected that `amdgpu_driver_postclose_kms()` is called prior to `amdgpu_drm_release()`.
However, the bug is that `amdgpu_fpriv` was freed in `amdgpu_driver_postclose_kms()`, and then
later accessed in `amdgpu_drm_release()` via a call to `amdgpu_userq_mgr_fini()`.
As a result, KASAN detected a use-after-free condition, as shown in the log below.
The proposed fix is to move the calls to `amdgpu_eviction_fence_destroy()` and
`amdgpu_userq_mgr_fini()` into `amdgpu_driver_postclose_kms()`, so they are invoked before
`amdgpu_fpriv` is freed.

This also ensures symmetry with the initialization path in `amdgpu_driver_open_kms()`,
where the following components are initialized:
- `amdgpu_userq_mgr_init()`
- `amdgpu_eviction_fence_init()`
- `amdgpu_ctx_mgr_init()`

Correspondingly, in `amdgpu_driver_postclose_kms()` we should clean up using:
- `amdgpu_userq_mgr_fini()`
- `amdgpu_eviction_fence_destroy()`
- `amdgpu_ctx_mgr_fini()`

This change eliminates the use-after-free and improves consistency in resource management between open and close paths.

[  +0.094367] ==================================================================
[  +0.000026] BUG: KASAN: slab-use-after-free in amdgpu_userq_mgr_fini+0x70c/0x730 [amdgpu]
[  +0.000866] Write of size 8 at addr ffff88811c068c60 by task amd_pci_unplug/1737
[  +0.000026] CPU: 3 UID: 0 PID: 1737 Comm: amd_pci_unplug Not tainted 6.14.0+ #2
[  +0.000008] Hardware name: ASUS System Product Name/ROG STRIX B550-F GAMING (WI-FI), BIOS 1401 12/03/2020
[  +0.000004] Call Trace:
[  +0.000004]  <TASK>
[  +0.000003]  dump_stack_lvl+0x76/0xa0
[  +0.000010]  print_report+0xce/0x600
[  +0.000009]  ? amdgpu_userq_mgr_fini+0x70c/0x730 [amdgpu]
[  +0.000790]  ? srso_return_thunk+0x5/0x5f
[  +0.000007]  ? kasan_complete_mode_report_info+0x76/0x200
[  +0.000008]  ? amdgpu_userq_mgr_fini+0x70c/0x730 [amdgpu]
[  +0.000684]  kasan_report+0xbe/0x110
[  +0.000007]  ? amdgpu_userq_mgr_fini+0x70c/0x730 [amdgpu]
[  +0.000601]  __asan_report_store8_noabort+0x17/0x30
[  +0.000007]  amdgpu_userq_mgr_fini+0x70c/0x730 [amdgpu]
[  +0.000801]  ? __pfx_amdgpu_userq_mgr_fini+0x10/0x10 [amdgpu]
[  +0.000819]  ? srso_return_thunk+0x5/0x5f
[  +0.000008]  amdgpu_drm_release+0xa3/0xe0 [amdgpu]
[  +0.000604]  __fput+0x354/0xa90
[  +0.000010]  __fput_sync+0x59/0x80
[  +0.000005]  __x64_sys_close+0x7d/0xe0
[  +0.000006]  x64_sys_call+0x2505/0x26f0
[  +0.000006]  do_syscall_64+0x7c/0x170
[  +0.000004]  ? kasan_record_aux_stack+0xae/0xd0
[  +0.000005]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? kmem_cache_free+0x398/0x580
[  +0.000006]  ? __fput+0x543/0xa90
[  +0.000006]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? __fput+0x543/0xa90
[  +0.000004]  ? __kasan_check_read+0x11/0x20
[  +0.000007]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? __kasan_check_read+0x11/0x20
[  +0.000003]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? fpregs_assert_state_consistent+0x21/0xb0
[  +0.000006]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? syscall_exit_to_user_mode+0x4e/0x240
[  +0.000005]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? do_syscall_64+0x88/0x170
[  +0.000003]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? do_syscall_64+0x88/0x170
[  +0.000004]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? irqentry_exit+0x43/0x50
[  +0.000004]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? exc_page_fault+0x7c/0x110
[  +0.000006]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  +0.000005] RIP: 0033:0x7ffff7b14f67
[  +0.000005] Code: ff e8 0d 16 02 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 41 c3 48 83 ec 18 89 7c 24 0c e8 73 ba f7 ff
[  +0.000004] RSP: 002b:00007fffffffe358 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[  +0.000006] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ffff7b14f67
[  +0.000003] RDX: 0000000000000000 RSI: 00007ffff7f5755a RDI: 0000000000000003
[  +0.000003] RBP: 00007fffffffe380 R08: 0000555555568170 R09: 0000000000000000
[  +0.000003] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fffffffe5c8
[  +0.000003] R13: 00005555555552a9 R14: 0000555555557d48 R15: 00007ffff7ffd040
[  +0.000007]  </TASK>

[  +0.000286] Allocated by task 425 on cpu 11 at 29.751192s:
[  +0.000013]  kasan_save_stack+0x28/0x60
[  +0.000008]  kasan_save_track+0x18/0x70
[  +0.000006]  kasan_save_alloc_info+0x38/0x60
[  +0.000006]  __kasan_kmalloc+0xc1/0xd0
[  +0.000005]  __kmalloc_cache_noprof+0x1bd/0x430
[  +0.000006]  amdgpu_driver_open_kms+0x172/0x760 [amdgpu]
[  +0.000521]  drm_file_alloc+0x569/0x9a0
[  +0.000008]  drm_client_init+0x1b7/0x410
[  +0.000007]  drm_fbdev_client_setup+0x174/0x470
[  +0.000007]  drm_client_setup+0x8a/0xf0
[  +0.000006]  amdgpu_pci_probe+0x50b/0x10d0 [amdgpu]
[  +0.000482]  local_pci_probe+0xe7/0x1b0
[  +0.000008]  pci_device_probe+0x5bf/0x890
[  +0.000005]  really_probe+0x1fd/0x950
[  +0.000007]  __driver_probe_device+0x307/0x410
[  +0.000005]  driver_probe_device+0x4e/0x150
[  +0.000006]  __driver_attach+0x223/0x510
[  +0.000005]  bus_for_each_dev+0x102/0x1a0
[  +0.000006]  driver_attach+0x3d/0x60
[  +0.000005]  bus_add_driver+0x309/0x650
[  +0.000005]  driver_register+0x13d/0x490
[  +0.000006]  __pci_register_driver+0x1ee/0x2b0
[  +0.000006]  xfrm_ealg_get_byidx+0x43/0x50 [xfrm_algo]
[  +0.000008]  do_one_initcall+0x9c/0x3e0
[  +0.000007]  do_init_module+0x29e/0x7f0
[  +0.000006]  load_module+0x5c75/0x7c80
[  +0.000006]  init_module_from_file+0x106/0x180
[  +0.000007]  idempotent_init_module+0x377/0x740
[  +0.000006]  __x64_sys_finit_module+0xd7/0x180
[  +0.000006]  x64_sys_call+0x1f0b/0x26f0
[  +0.000006]  do_syscall_64+0x7c/0x170
[  +0.000005]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

[  +0.000013] Freed by task 1737 on cpu 9 at 76.455063s:
[  +0.000010]  kasan_save_stack+0x28/0x60
[  +0.000006]  kasan_save_track+0x18/0x70
[  +0.000005]  kasan_save_free_info+0x3b/0x60
[  +0.000006]  __kasan_slab_free+0x54/0x80
[  +0.000005]  kfree+0x127/0x470
[  +0.000006]  amdgpu_driver_postclose_kms+0x455/0x760 [amdgpu]
[  +0.000485]  drm_file_free.part.0+0x5b1/0xba0
[  +0.000007]  drm_file_free+0x13/0x30
[  +0.000006]  drm_client_release+0x1c4/0x2b0
[  +0.000006]  drm_fbdev_ttm_fb_destroy+0xd2/0x120 [drm_ttm_helper]
[  +0.000007]  put_fb_info+0x97/0xe0
[  +0.000006]  unregister_framebuffer+0x197/0x380
[  +0.000005]  drm_fb_helper_unregister_info+0x94/0x100
[  +0.000005]  drm_fbdev_client_unregister+0x3c/0x80
[  +0.000007]  drm_client_dev_unregister+0x144/0x330
[  +0.000006]  drm_dev_unregister+0x49/0x1b0
[  +0.000006]  drm_dev_unplug+0x4c/0xd0
[  +0.000006]  amdgpu_pci_remove+0x58/0x130 [amdgpu]
[  +0.000482]  pci_device_remove+0xae/0x1e0
[  +0.000006]  device_remove+0xc7/0x180
[  +0.000006]  device_release_driver_internal+0x3d4/0x5a0
[  +0.000007]  device_release_driver+0x12/0x20
[  +0.000006]  pci_stop_bus_device+0x104/0x150
[  +0.000006]  pci_stop_and_remove_bus_device_locked+0x1b/0x40
[  +0.000005]  remove_store+0xd7/0xf0
[  +0.000007]  dev_attr_store+0x3f/0x80
[  +0.000006]  sysfs_kf_write+0x125/0x1d0
[  +0.000005]  kernfs_fop_write_iter+0x2ea/0x490
[  +0.000007]  vfs_write+0x90d/0xe70
[  +0.000006]  ksys_write+0x119/0x220
[  +0.000006]  __x64_sys_write+0x72/0xc0
[  +0.000006]  x64_sys_call+0x18ab/0x26f0
[  +0.000005]  do_syscall_64+0x7c/0x170
[  +0.000005]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

[  +0.000013] The buggy address belongs to the object at ffff88811c068000
               which belongs to the cache kmalloc-rnd-01-4k of size 4096
[  +0.000016] The buggy address is located 3168 bytes inside of
               freed 4096-byte region [ffff88811c068000, ffff88811c069000)

[  +0.000022] The buggy address belongs to the physical page:
[  +0.000010] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0xffff88811c06e000 pfn:0x11c068
[  +0.000006] head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[  +0.000006] flags: 0x17ffffc0000040(head|node=0|zone=2|lastcpupid=0x1fffff)
[  +0.000007] page_type: f5(slab)
[  +0.000007] raw: 0017ffffc0000040 ffff88810004c140 dead000000000122 0000000000000000
[  +0.000005] raw: ffff88811c06e000 0000000080040002 00000000f5000000 0000000000000000
[  +0.000006] head: 0017ffffc0000040 ffff88810004c140 dead000000000122 0000000000000000
[  +0.000005] head: ffff88811c06e000 0000000080040002 00000000f5000000 0000000000000000
[  +0.000006] head: 0017ffffc0000003 ffffea0004701a01 ffffffffffffffff 0000000000000000
[  +0.000005] head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000
[  +0.000004] page dumped because: kasan: bad access detected

[  +0.000011] Memory state around the buggy address:
[  +0.000009]  ffff88811c068b00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  +0.000012]  ffff88811c068b80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  +0.000011] >ffff88811c068c00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  +0.000011]                                                        ^
[  +0.000010]  ffff88811c068c80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  +0.000011]  ffff88811c068d00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  +0.000011] ==================================================================

Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Lijo Lazar <lijo.lazar@amd.com>
Cc: Jesse Zhang <Jesse.Zhang@amd.com>
Cc: Arvind Yadav <arvind.yadav@amd.com>

v2: drop amdgpu_drm_release() and assign drm_release()
    as the callback directly.(Alex)

Fixes: adba092973 ("drm/amdgpu: Fix Illegal opcode in command stream Error")
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-24 10:01:28 -04:00
Alex Deucher 684385273d drm/amdgpu: remove fence slab
Just use kmalloc for the fences in the rare case we need
an independent fence.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-24 10:00:03 -04:00
Mario Limonciello 49f1f9f6c3 drm/amd: Adjust output for discovery error handling
commit 017fbb6690 ("drm/amdgpu/discovery: check ip_discovery fw file
available") added support for reading an amdgpu IP discovery bin file
for some specific products. If it's not found then it will fallback to
hardcoded values. However if it's not found there is also a lot of noise
about missing files and errors.

Adjust the error handling to decrease most messages to DEBUG and to show
users less about missing files.

Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Reported-by: Marcus Seyfarth <m.seyfarth@gmail.com>
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4312
Tested-by: Marcus Seyfarth <m.seyfarth@gmail.com>
Fixes: 017fbb6690 ("drm/amdgpu/discovery: check ip_discovery fw file available")
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Link: https://lore.kernel.org/r/20250617183052.1692059-1-superm1@kernel.org
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-24 09:58:35 -04:00
Alex Deucher 0180e0a5dd drm/amdgpu/mes: add compatibility checks for set_hw_resource_1
Seems some older MES firmware versions do not properly support
this packet.  Add back some the compatibility checks.

v2: switch to fw version check (Shaoyun)

Fixes: f81cd79311 ("drm/amd/amdgpu: Fix MES init sequence")
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4295
Cc: Shaoyun Liu <shaoyun.liu@amd.com>
Reviewed-by: shaoyun.liu <shaoyun.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-24 09:54:33 -04:00
Srinivasan Shanmugam 99808926d0 drm/amdgpu/gfx9: Add Cleaner Shader Support for GFX9.x GPUs
Enable the cleaner shader for other GFX9.x series of GPUs to provide
data isolation between GPU workloads. The cleaner shader is responsible
for clearing the Local Data Store (LDS), Vector General Purpose
Registers (VGPRs), and Scalar General Purpose Registers (SGPRs), which
helps prevent data leakage and ensures accurate computation results.

This update extends cleaner shader support to GFX9.x GPUs, previously
available for GFX9.4.2. It enhances security by clearing GPU memory
between processes and maintains a consistent GPU state across KGD and
KFD workloads.

Cc: Manu Rastogi <manu.rastogi@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-24 09:54:23 -04:00
Greg Kroah-Hartman 63dafeb392 Merge 6.16-rc3 into driver-core-next
We need the driver-core fixes that are in 6.16-rc3 into here as well
to build on top of.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-23 07:53:36 +02:00
Alex Deucher fe79ef3530 drm/amdgpu/sdma5.2: init engine reset mutex
Missing the mutex init.

Fixes: 47454f2dc0 ("drm/amdgpu: Register the new sdma function pointers for sdma_v5_2")
Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit ea685ff30a)
2025-06-18 13:17:49 -04:00
Alex Deucher 49cc5beeab drm/amdgpu/sdma5: init engine reset mutex
Missing the mutex init.

Fixes: e56d4bf57f ("drm/amdgpu/: drm/amdgpu: Register the new sdma function pointers for sdma_v5_0")
Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 3f4caf092f)
2025-06-18 13:17:00 -04:00
Alex Deucher ebe4354270 drm/amdgpu: switch job hw_fence to amdgpu_fence
Use the amdgpu fence container so we can store additional
data in the fence.  This also fixes the start_time handling
for MCBP since we were casting the fence to an amdgpu_fence
and it wasn't.

Fixes: 3f4c175d62 ("drm/amdgpu: MCBP based on DRM scheduler (v9)")
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit bf1cd14f9e)
Cc: stable@vger.kernel.org
2025-06-18 13:15:26 -04:00
Jesse Zhang 7f3b16f3f2 drm/amdgpu: Fix SDMA UTC_L1 handling during start/stop sequences
This commit makes two key fixes to SDMA v4.4.2 handling:

1. disable UTC_L1 in sdma_cntl register when stopping SDMA engines
   by reading the current value before modifying UTC_L1_ENABLE bit.

2. Ensure UTC_L1_ENABLE is consistently managed by:
   - Adding the missing register write when enabling UTC_L1 during start
   - Keeping UTC_L1 enabled by default as per hardware requirements

v2: Correct SDMA_CNTL setting (Philip)

Suggested-by: Jonathan Kim <jonathan.kim@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 375bf56465)
Cc: stable@vger.kernel.org
2025-06-18 13:14:40 -04:00
Lijo Lazar 785c536c31 drm/amdgpu: Release reset locks during failures
Make sure to release reset domain lock in case of failures.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Ce Sun <cesun102@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Fixes: 11bb33766f ("drm/amdgpu: refactor amdgpu_device_gpu_recover")
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 1ab11a8268)
2025-06-18 13:14:10 -04:00
Sonny Jiang 46e15197b5 drm/amdgpu: VCN v5_0_1 to prevent FW checking RB during DPG pause
Add a protection to ensure programming are all complete prior VCPU
starting. This is a WA for an unintended VCPU running.

Signed-off-by: Sonny Jiang <sonny.jiang@amd.com>
Acked-by: Leo Liu <leo.liu@amd.com>
Reviewed-by: Ruijing Dong <ruijing.dong@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit c29521b529)
Cc: stable@vger.kernel.org
2025-06-18 13:13:13 -04:00
Jesse Zhang caade9d69f drm/amdgpu: Use logical instance ID for SDMA v4_4_2 queue operations
Simplify SDMA v4_4_2 queue reset and stop operations by:
1. Removing GET_INST(SDMA0) conversion for ring->me
2. Using the logical instance ID (ring->me) directly
3. Maintaining consistent behavior with other SDMA queue operations

This change aligns with the existing queue handling logic where
ring->me already represents the correct instance identifier.

Signed-off-by:  Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 3bab282dfe)
Cc: stable@vger.kernel.org
2025-06-18 13:11:15 -04:00
Jesse Zhang 09b585592f drm/amdgpu: Fix SDMA engine reset with logical instance ID
This commit makes the following improvements to SDMA engine reset handling:

1. Clarifies in the function documentation that instance_id refers to a logical ID
2. Adds conversion from logical to physical instance ID before performing reset
   using GET_INST(SDMA0, instance_id) macro
3. Improves error messaging to indicate when a logical instance reset fails
4. Adds better code organization with blank lines for readability

The change ensures proper SDMA engine reset by using the correct physical
instance ID while maintaining the logical ID interface for callers.

V2: Remove harvest_config check and convert directly to physical instance (Lijo)

Suggested-by: Jonathan Kim <jonathan.kim@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 5efa6217c2)
Cc: stable@vger.kernel.org
2025-06-18 13:10:44 -04:00
Frank Min 854171405e drm/amdgpu: add kicker fws loading for gfx11/smu13/psp13
1. Add kicker firmwares loading for gfx11/smu13/psp13
2. Register additional MODULE_FIRMWARE entries for kicker fws
   - gc_11_0_0_rlc_kicker.bin
   - gc_11_0_0_imu_kicker.bin
   - psp_13_0_0_sos_kicker.bin
   - psp_13_0_0_ta_kicker.bin
   - smu_13_0_0_kicker.bin

Signed-off-by: Frank Min <Frank.Min@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit fb5ec2174d)
Cc: stable@vger.kernel.org
2025-06-18 13:09:41 -04:00
Frank Min 0bbf5fd86c drm/amdgpu: Add kicker device detection
1. add kicker device list
2. add kicker device checking helper function

Signed-off-by: Frank Min <Frank.Min@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 09aa2b408f)
Cc: stable@vger.kernel.org
2025-06-18 13:08:52 -04:00
Alex Deucher ea685ff30a drm/amdgpu/sdma5.2: init engine reset mutex
Missing the mutex init.

Fixes: 47454f2dc0 ("drm/amdgpu: Register the new sdma function pointers for sdma_v5_2")
Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:22 -04:00
Alex Deucher 3f4caf092f drm/amdgpu/sdma5: init engine reset mutex
Missing the mutex init.

Fixes: e56d4bf57f ("drm/amdgpu/: drm/amdgpu: Register the new sdma function pointers for sdma_v5_0")
Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:21 -04:00
Alex Deucher bf1cd14f9e drm/amdgpu: switch job hw_fence to amdgpu_fence
Use the amdgpu fence container so we can store additional
data in the fence.  This also fixes the start_time handling
for MCBP since we were casting the fence to an amdgpu_fence
and it wasn't.

Fixes: 3f4c175d62 ("drm/amdgpu: MCBP based on DRM scheduler (v9)")
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:21 -04:00
Lijo Lazar 9750ad5aee drm/amdgpu: Add xgmi API to set max speed/width
Add an API to set the max possible xgmi speed/width.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:21 -04:00
Lijo Lazar 8c9eb6ce50 drm/amdgpu: Deprecate xgmi_link_speed enum
xgmi doesn't have discrete max speeds defined. Speed numbers can be
arbitrary based on SOC. Deprecate the enum.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:21 -04:00
Lijo Lazar 04141c05f3 drm/amdgpu: Extend bus status check to more cases
In case of unexpected errors, check if device is alive on the bus.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:21 -04:00
Frank Min a3b7f9c306 drm/amdgpu: reclaim psp fw reservation memory region
PSP v14 fw update introduced changes on memory reservation region, according
to the change driver reclaim some non-reserved region.

1. introduce 2 new psp commands to query fw reservation regions
2. add a new reservation region for psp
3. reclaim psp non-used region

Signed-off-by: Frank Min <Frank.Min@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:21 -04:00
ganglxie e2d1e96c53 drm/amdgpu: refine usage of amdgpu_bad_page_threshold
when amdgpu_bad_page_threshold == -1 or -2, driver will issue a warning
message when threshold is reached and continue runtime services.

Signed-off-by: ganglxie <ganglxie@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:21 -04:00
Jesse Zhang 375bf56465 drm/amdgpu: Fix SDMA UTC_L1 handling during start/stop sequences
This commit makes two key fixes to SDMA v4.4.2 handling:

1. disable UTC_L1 in sdma_cntl register when stopping SDMA engines
   by reading the current value before modifying UTC_L1_ENABLE bit.

2. Ensure UTC_L1_ENABLE is consistently managed by:
   - Adding the missing register write when enabling UTC_L1 during start
   - Keeping UTC_L1 enabled by default as per hardware requirements

v2: Correct SDMA_CNTL setting (Philip)

Suggested-by: Jonathan Kim <jonathan.kim@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:21 -04:00
Lijo Lazar 1ab11a8268 drm/amdgpu: Release reset locks during failures
Make sure to release reset domain lock in case of failures.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Ce Sun <cesun102@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Fixes: 11bb33766f ("drm/amdgpu: refactor amdgpu_device_gpu_recover")
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:21 -04:00
Alex Deucher 9a9e87d152 drm/amdgpu/sdma: handle paging queues in amdgpu_sdma_reset_engine()
Need to properly start and stop paging queues if they are present.

This is not an issue today since we don't support a paging queue
on any chips with queue reset.

Fixes: b22659d5d3 ("drm/amdgpu: switch amdgpu_sdma_reset_engine to use the new sdma function pointers")
Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:20 -04:00
Emily Deng 54f7a24e14 drm/amdkfd: Move the process suspend and resume out of full access
For the suspend and resume process, exclusive access is not required.
Therefore, it can be moved out of the full access section to reduce the
duration of exclusive access.

v3:
Move suspend processes before hardware fini.
Remove twice call for bare metal.

v4:
Refine code

Signed-off-by: Emily Deng <Emily.Deng@amd.com>
Acked-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:19 -04:00
Sonny Jiang c29521b529 drm/amdgpu: VCN v5_0_1 to prevent FW checking RB during DPG pause
Add a protection to ensure programming are all complete prior VCPU
starting. This is a WA for an unintended VCPU running.

Signed-off-by: Sonny Jiang <sonny.jiang@amd.com>
Acked-by: Leo Liu <leo.liu@amd.com>
Reviewed-by: Ruijing Dong <ruijing.dong@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:19 -04:00
Jesse Zhang 0c3f972394 drm/amdgpu: Add soft reset callback to SDMA v4.4.x
Implement soft reset engine callback for SDMA 4.4.x IPs. This avoids IP
version check in generic implementation.

V2: Correct physical instance ID calculation in soft_reset_engine (Jesse)

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:19 -04:00
Jesse Zhang 3bab282dfe drm/amdgpu: Use logical instance ID for SDMA v4_4_2 queue operations
Simplify SDMA v4_4_2 queue reset and stop operations by:
1. Removing GET_INST(SDMA0) conversion for ring->me
2. Using the logical instance ID (ring->me) directly
3. Maintaining consistent behavior with other SDMA queue operations

This change aligns with the existing queue handling logic where
ring->me already represents the correct instance identifier.

Signed-off-by:  Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:19 -04:00
Jesse Zhang 5efa6217c2 drm/amdgpu: Fix SDMA engine reset with logical instance ID
This commit makes the following improvements to SDMA engine reset handling:

1. Clarifies in the function documentation that instance_id refers to a logical ID
2. Adds conversion from logical to physical instance ID before performing reset
   using GET_INST(SDMA0, instance_id) macro
3. Improves error messaging to indicate when a logical instance reset fails
4. Adds better code organization with blank lines for readability

The change ensures proper SDMA engine reset by using the correct physical
instance ID while maintaining the logical ID interface for callers.

V2: Remove harvest_config check and convert directly to physical instance (Lijo)

Suggested-by: Jonathan Kim <jonathan.kim@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:18 -04:00
Lijo Lazar 3f1e81ecb6 drm/amdgpu: Suspend IH during mode-2 reset
On multi-aid SOCs, there could be a continuous stream of interrupts from
GC after poison consumption. Suspend IH to disable them before doing
mode-2 reset. This avoids conflicts in hardware accesses during
interrupt handlers while a reset is ongoing.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:18 -04:00
Mario Limonciello 64c3e4a868 drm/amd: Add support for a complete pmops action
complete() callbacks are supposed to handle reversing anything
that occurred during prepare() callbacks.  They'll be called on every
power state transition, and will also be called if the sequence is
failed (such as an aborted suspend).

Add support for IP blocks to support this action.

Reviewed-by: Alex Hung <alex.hung@amd.com>
Link: https://lore.kernel.org/r/20250602014432.3538345-2-superm1@kernel.org
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:18 -04:00
Xiang Liu f43411978d drm/amdgpu: Add debug mask to disable CE logs
Add debug mask to disable kernel logs of RAS correctable errors,
including both ACA and CE error counter kernel messages.

Signed-off-by: Xiang Liu <xiang.liu@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:18 -04:00
Frank Min fb5ec2174d drm/amdgpu: add kicker fws loading for gfx11/smu13/psp13
1. Add kicker firmwares loading for gfx11/smu13/psp13
2. Register additional MODULE_FIRMWARE entries for kicker fws
   - gc_11_0_0_rlc_kicker.bin
   - gc_11_0_0_imu_kicker.bin
   - psp_13_0_0_sos_kicker.bin
   - psp_13_0_0_ta_kicker.bin
   - smu_13_0_0_kicker.bin

Signed-off-by: Frank Min <Frank.Min@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:18 -04:00
Frank Min 09aa2b408f drm/amdgpu: Add kicker device detection
1. add kicker device list
2. add kicker device checking helper function

Signed-off-by: Frank Min <Frank.Min@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:18 -04:00
Lijo Lazar 3bdf8dd84e drm/amdgpu: Clear reset flags from ras context
Once RAS errors are cleared with appropriate recovery mechanism, clear
reset flags also from RAS context. Otherwise, stale flag values could
affect the subsequent RAS reset handling on the device.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:18 -04:00
Alex Deucher 87fbe3a548 drm/amdgpu/gfx9: drop reset_kgq
It doesn't work reliably and we have soft recover and
full adapter reset so drop this.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:18 -04:00
Alex Deucher fda02c911a drm/amdgpu/gfx8: drop reset_kgq
It doesn't work reliably and we have soft recover and
full adapter reset so drop this.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:18 -04:00
Alex Deucher 18d321c1dc drm/amdgpu/gfx7: drop reset_kgq
It doesn't work reliably and we have soft recover and
full adapter reset so drop this.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:17 -04:00
Jonathan Kim 96f75f9594 drm/amdkfd: allow compute partition mode switch with cgroup exclusions
The KFD currently bars a compute partition mode switch while a KFD
process exists.

Since cgroup excluded devices remain excluded for the lifetime of a KFD
process and user space is able to mode switch single devices, allow
users to mode switch a device with any running process that has been
cgroup excluded from this device.

Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Harish Kasiviswanathan <harish.kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:17 -04:00
ganglxie d0cc8d2b7d drm/amdgpu: clear pa and mca record counter when resetting eeprom
clear pa and mca record counter when resetting eeprom, so that
ras_num_bad_pages can be calculated correctly

Signed-off-by: ganglxie <ganglxie@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:15 -04:00
Samuel Zhang 4108c2be12 drm/amdgpu: fix fence fallback timer expired error
IH is not working after switching a new gpu index for the first time.

During VM resume, QEMU programming of VF MSIX table (register GFXMSIX_VECT0_ADDR_LO)
may not work.The access could be blocked by nBIF protection as VF isn't in
exclusive access mode. Exclusive access is enabled now, disable/enable MSIX
so that QEMU reprograms MSIX table.

call amdgpu_restore_msix on resume to restore msix table.

Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
Acked-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:15 -04:00
Samuel Zhang 2f405eb45c drm/amdgpu: enable pdb0 for hibernation on SRIOV
When switching to new GPU index after hibernation and then resume,
VRAM offset of each VRAM BO will be changed, and the cached gpu
addresses needed to updated.

This is to enable pdb0 and switch to use pdb0-based virtual gpu
address by default in amdgpu_bo_create_reserved(). since the virtual
addresses do not change, this can avoid the need to update all
cached gpu addresses all over the codebase.

Signed-off-by: Emily Deng <Emily.Deng@amd.com>
Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:15 -04:00
Samuel Zhang 18b66a6c2a drm/amdgpu: update GPU addresses for SMU and PSP
add amdgpu_bo_fb_aper_addr() and update the cached GPU addresses to use
the FB aperture address for SMU and PSP.

2 reasons for this change:
1. when pdb0 is enabled, gpu addr from amdgpu_bo_create_kernel() is GART
aperture address, it is not compatible with SMU and PSP, it need to be
updated to use FB aperture address.
2. Since FB aperture address will change after switching to new GPU
index after hibernation, it need to be updated on resume.

Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:15 -04:00
Lijo Lazar 0f566f0e9c drm/amdgpu: Remove nbiov7.9 replay count reporting
Direct pcie replay count reporting is not available on nbio v7.9.
Reporting is done through firmware.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Acked-by: Mangesh Gadre <Mangesh.Gadre@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Fixes: 50709d18f4 ("drm/amdgpu: Add pci replay count to nbio v7.9")
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:09 -04:00
Lijo Lazar 196aefea44 drm/amdgpu: Check pcie replays reporting support
Check if pcie replay count reporting is supported before creating sysfs
attribute.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Acked-by: Mangesh Gadre <Mangesh.Gadre@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:02 -04:00
Shiwu Zhang c09910b511 drm/amdgpu: Enable IFWI update support for PSPv14.0.2 and v14.0.3
Make the psp_vbflash and psp_vbflash_status available in sysfs.

v2: make it available for v14.0.2 as well (hawking)

Signed-off-by: Shiwu Zhang <shiwu.zhang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:01 -04:00
Samuel Zhang 855a2a029a drm/amdgpu: update xgmi info and vram_base_offset on resume
For SRIOV VM env with XGMI enabled systems, XGMI physical node id may
change when hibernate and resume with different VF.

Update XGMI info and vram_base_offset on resume for gfx444 SRIOV env.
Add amdgpu_virt_xgmi_migrate_enabled() as the feature flag.

Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18 12:19:01 -04:00
André Almeida a72002cb18 drm/amdgpu: Make use of drm_wedge_task_info
To notify userspace about which task (if any) made the device get in a
wedge state, make use of drm_wedge_task_info parameter, filling it with
the task PID and name.

Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://lore.kernel.org/r/20250617124949.2151549-7-andrealmeid@igalia.com
Signed-off-by: André Almeida <andrealmeid@igalia.com>
2025-06-17 11:32:48 -03:00
André Almeida 35dc4ce200 drm: amdgpu: Use struct drm_wedge_task_info inside of struct amdgpu_task_info
To avoid a cast when calling drm_dev_wedged_event(), replace pid and
task name inside of struct amdgpu_task_info with struct
drm_wedge_task_info.

Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://lore.kernel.org/r/20250617124949.2151549-6-andrealmeid@igalia.com
Signed-off-by: André Almeida <andrealmeid@igalia.com>
2025-06-17 11:32:47 -03:00
André Almeida 183bccafa1 drm: Create a task info option for wedge events
When a device get wedged, it might be caused by a guilty application.
For userspace, knowing which task was involved can be useful for some
situations, like for implementing a policy, logs or for giving a chance
for the compositor to let the user know what task was involved in the
problem.  This is an optional argument, when the task info is not
available, the PID and TASK string won't appear in the event string.

Sometimes just the PID isn't enough giving that the task might be already
dead by the time userspace will try to check what was this PID's name,
so to make the life easier also notify what's the task's name in the user
event.

Acked-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Krzysztof Karas <krzysztof.karas@intel.com>
Reviewed-by: Raag Jadav <raag.jadav@intel.com>
Acked-by: Christian König <christian.koenig@amd.com>
Link: https://lore.kernel.org/r/20250617124949.2151549-4-andrealmeid@igalia.com
Signed-off-by: André Almeida <andrealmeid@igalia.com>
2025-06-17 11:32:47 -03:00
André Almeida 3bfd1af74a drm: amdgpu: Create amdgpu_vm_print_task_info()
To avoid repetitive code in amdgpu, create a function that prints the
content of struct amdgpu_task_info.

Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://lore.kernel.org/r/20250617124949.2151549-3-andrealmeid@igalia.com
Signed-off-by: André Almeida <andrealmeid@igalia.com>
2025-06-17 11:32:47 -03:00
André Almeida 2a4f069d0f drm: amdgpu: Allow NULL pointers at amdgpu_vm_put_task_info()
Allow NULL pointers at amdgpu_vm_put_task_info() as it common practice
for "put" or "free" functions. This avoid an extra check for NULL for
callers.

Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://lore.kernel.org/r/20250617124949.2151549-2-andrealmeid@igalia.com
Signed-off-by: André Almeida <andrealmeid@igalia.com>
2025-06-17 11:32:47 -03:00
Thomas Weißschuh fb506e31b3 sysfs: treewide: switch back to attribute_group::bin_attrs
The normal bin_attrs field can now handle const pointers.
This makes the _new variant unnecessary.
Switch all users back.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Link: https://lore.kernel.org/r/20250530-sysfs-const-bin_attr-final-v3-4-724bfcf05b99@weissschuh.net
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-17 10:44:15 +02:00
Thomas Weißschuh 2fbe82037a sysfs: treewide: switch back to bin_attribute::read()/write()
The bin_attribute argument of bin_attribute::read() is now const.
This makes the _new() callbacks unnecessary. Switch all users back.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Link: https://lore.kernel.org/r/20250530-sysfs-const-bin_attr-final-v3-3-724bfcf05b99@weissschuh.net
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-17 10:44:13 +02:00
Thomas Zimmermann c598d5eb9f Merge drm/drm-next into drm-misc-next
Backmerging to forward to v6.16-rc1

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
2025-06-11 09:01:34 +02:00
Ingo Molnar 41cb08555c treewide, timers: Rename from_timer() to timer_container_of()
Move this API to the canonical timer_*() namespace.

[ tglx: Redone against pre rc1 ]

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/aB2X0jCKQO56WdMt@gmail.com
2025-06-08 09:07:37 +02:00
Linus Torvalds e332935a54 drm fixes for 6.16-rc1
(amdkfd on riscv is more a feature).
 
 panel:
 - nt37801: fix IS_ERR
 - nt37801: fix KConfig
 
 connector:
 - Fix null deref in HDMI audio helper.
 
 bridge:
 - analogix_dp: fixup clk-disable removal
 
 msm:
 - mailmap updates
 
 i915:
 - Fix the enabling/disabling of DP audio SDP splitting
 - Fix PSR register definitions for ALPM
 - Fix u32 overflow in SNPS PHY HDMI PLL setup
 - Fix GuC pending message underflow when submit fails
 - Fix GuC wakeref underflow race during reset
 
 xe:
 - Two documentation fixes
 - A couple of vm init fixes
 - Hwmon fixes
 - Drop reduntant conversion to bool
 - Fix CONFIG_INTEL_VSEC dependency
 - Rework eviction rejection of bound external bos
 - Stop re-submitting signalled jobs
 - A couple of pxp fixes
 - Add back a fix that got lost in a merge
 - Create LRC bo without VM
 - Fix for the above fix
 
 amdgpu:
 - UserQ fixes
 - SMU 13.x fixes
 - VCN fixes
 - JPEG fixes
 - Misc cleanups
 - runtime pm fix
 - DCN 4.0.1 fixes
 - Misc display fixes
 - ISP fix
 - VRAM manager fix
 - RAS fixes
 - IP discovery fix
 - Cleaner shader fix for GC 10.1.x
 - OD fix
 - Non-OLED panel fix
 - Misc display fixes
 - Brightness fixes
 
 amdkfd:
 - Enable CONFIG_HSA_AMD on RISCV
 - SVM fix
 - Misc cleanups
 - Ref leak fix
 - WPTR BO fix
 
 radeon:
 - Misc cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEEKbZHaGwW9KfbeusDHTzWXnEhr4FAmhCg/4ACgkQDHTzWXnE
 hr7Q0g//an3wQGf8KgZxCs8DxVVA3zSrUDLiAbs5hsZJDNtd9uGqzy9pZzIV+cVK
 rguAcM/AEVvY/ET1PCVh1FlJ8jMadlGGX6MuegUzdzQ/wB7puwZ+KRZAMmSVEiY6
 7PVKceeJ2bnCK+Vn/SdXpD1s4AXn3hMCyuTfvOC4fJuee/qW62H/wl4ivXzilvvf
 DBSSlpjEcTSKJVRveOw1AL678Z34JhoUB3oek0kpx9TyF4rdKs5qDStEUxMIhpD8
 22vN5oF0UOU93N53udCt4gGQ/Xfqyyl03XP2JYnNmCMJB+BGSR/u/u59cjnvkwDs
 TQBBS8gXfAdRCEPrvtDGNZOLxEhPl+ZaKoTqRp6qi4uL7nUc8NTVBE3UTkt6LVcx
 W1HY5+QzuLPH73QUSSHL609qz1X1aRLWgFh+/Fo82LYh3ORtO6BwbQLP6ZGkbNzm
 GTRqLAmzprL2XisrxP0gsdvgpRplXjwxx7RzCE6evr/u+lMRr4dxoSx1k2C0vVhS
 sFoFjHdrWvHO8KtM14vTt/F7J79suqgQBqF37s8s1e5ptDra4aDQEzCAXxJYx6Pg
 2Q7tamvwaJndQUojd858+OU8lHVWDKm6eYuA4WrbbomT31CVkAWWrmcIiS3CBBX1
 6U0J4h8JcGilbCuPHCP2c9ibakkF/jkO+tZAgW88C/enF9r59r8=
 =jZKo
 -----END PGP SIGNATURE-----

Merge tag 'drm-next-2025-06-06' of https://gitlab.freedesktop.org/drm/kernel

Pull drm fixes from Dave Airlie:
 "This is pretty much two weeks worth of fixes, plus one thing that
  might be considered next: amdkfd is now able to be enabled on risc-v
  platforms.

  Otherwise, amdgpu and xe with the majority of fixes, and then a
  smattering all over.

  panel:
   - nt37801: fix IS_ERR
   - nt37801: fix KConfig

  connector:
   - Fix null deref in HDMI audio helper.

  bridge:
   - analogix_dp: fixup clk-disable removal

  nouveau:
   - minor typo fix (',' vs ';')

  msm:
   - mailmap updates

  i915:
   - Fix the enabling/disabling of DP audio SDP splitting
   - Fix PSR register definitions for ALPM
   - Fix u32 overflow in SNPS PHY HDMI PLL setup
   - Fix GuC pending message underflow when submit fails
   - Fix GuC wakeref underflow race during reset

  xe:
   - Two documentation fixes
   - A couple of vm init fixes
   - Hwmon fixes
   - Drop reduntant conversion to bool
   - Fix CONFIG_INTEL_VSEC dependency
   - Rework eviction rejection of bound external bos
   - Stop re-submitting signalled jobs
   - A couple of pxp fixes
   - Add back a fix that got lost in a merge
   - Create LRC bo without VM
   - Fix for the above fix

  amdgpu:
   - UserQ fixes
   - SMU 13.x fixes
   - VCN fixes
   - JPEG fixes
   - Misc cleanups
   - runtime pm fix
   - DCN 4.0.1 fixes
   - Misc display fixes
   - ISP fix
   - VRAM manager fix
   - RAS fixes
   - IP discovery fix
   - Cleaner shader fix for GC 10.1.x
   - OD fix
   - Non-OLED panel fix
   - Misc display fixes
   - Brightness fixes

  amdkfd:
   - Enable CONFIG_HSA_AMD on RISCV
   - SVM fix
   - Misc cleanups
   - Ref leak fix
   - WPTR BO fix

  radeon:
   - Misc cleanups"

* tag 'drm-next-2025-06-06' of https://gitlab.freedesktop.org/drm/kernel: (105 commits)
  drm/nouveau/vfn/r535: Convert comma to semicolon
  drm/xe: remove unmatched xe_vm_unlock() from __xe_exec_queue_init()
  drm/xe: Create LRC BO without VM
  drm/xe/guc_submit: add back fix
  drm/xe/pxp: Clarify PXP queue creation behavior if PXP is not ready
  drm/xe/pxp: Use the correct define in the set_property_funcs array
  drm/xe/sched: stop re-submitting signalled jobs
  drm/xe: Rework eviction rejection of bound external bos
  drm/xe/vsec: fix CONFIG_INTEL_VSEC dependency
  drm/xe: drop redundant conversion to bool
  drm/xe/hwmon: Move card reactive critical power under channel card
  drm/xe/hwmon: Add support to manage power limits though mailbox
  drm/xe/vm: move xe_svm_init() earlier
  drm/xe/vm: move rebind_work init earlier
  MAINTAINERS: .mailmap: update Rob Clark's email address
  mailmap: Update entry for Akhil P Oommen
  MAINTAINERS: update my email address
  MAINTAINERS: drop myself as maintainer
  drm/i915/display: Fix u32 overflow in SNPS PHY HDMI PLL setup
  drm/amd/display: Fix default DC and AC levels
  ...
2025-06-06 08:09:56 -07:00
Linus Torvalds 3719a04a80 pci-v6.16-changes
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCgAyFiEEgMe7l+5h9hnxdsnuWYigwDrT+vwFAmhAa9EUHGJoZWxnYWFz
 QGdvb2dsZS5jb20ACgkQWYigwDrT+vyA3w//aX8d73z/xVxkYLMN/6XQA5fdmd4d
 Dv4n0Pjf0WCMKbsgRCdXEYLvcHV8VhH5iCR/b2UsFm9LjxSIRuqE5XosY3bNhrHn
 xVKEh2prq2XZOibWrFkJ+RZ0FF7Ogq1Uy5gUBbBHbE1q1byZzrOALaF3FWGaDIZQ
 6QLLAFtd3UtqOOUu8J8P9N15uFR8gunyfuM9U7TLMcy4B8txk6T6m/9xAWtRURuJ
 I6WN8lO+g8Nl2mL9m27+wyWiVT3tKqoMwp8rVtym/L5JQOmHycYhn0WQAr2dPCMs
 Xbgmoeei0je7mZvk5btpt68NAKQ3ZnCVkxbbINBkUxAjI0dbI6h37EhW18ShYVUk
 CCo4fmaFtwP8qNN9tSvDN8vZdGB44fN5tIz4lmGzKk5gt+oV50RC/APrzC+PJBQ0
 +2SdDVKj71Gr2H1VnI6uLB7oQ+tp7TOdhg+DGV4bdc6QFnsM+BpKWRq5f1UQcau/
 XVDmorM/2t6z0DNktAv3NFwSodUjk1loWESr/pRBH1AqAWZTK98PWIg97XYsal59
 zbJ3dLrnCqUNozeVgjtZo1LWD2FZaVTvhq2NY7D+QPpnMGhFUhHxNliZUXiQa1q4
 boI2hEFdu3IQP/OC2a1zGJyMRLU43d5rhZ1U5xQSVtM0c3lgCY7rn/t26LymQVPA
 SYdg2jBcnhe6gXo=
 =eWJw
 -----END PGP SIGNATURE-----

Merge tag 'pci-v6.16-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci

Pull pci updates from Bjorn Helgaas:
 "Enumeration:

   - Print the actual delay time in pci_bridge_wait_for_secondary_bus()
     instead of assuming it was 1000ms (Wilfred Mallawa)

   - Revert 'iommu/amd: Prevent binding other PCI drivers to IOMMU PCI
     devices', which broke resume from system sleep on AMD platforms and
     has been fixed by other commits (Lukas Wunner)

  Resource management:

   - Remove mtip32xx use of pcim_iounmap_regions(), which is deprecated
     and unnecessary (Philipp Stanner)

   - Remove pcim_iounmap_regions() and pcim_request_region_exclusive()
     and related flags since all uses have been removed (Philipp
     Stanner)

   - Rework devres 'request' functions so they are no longer 'hybrid',
     i.e., their behavior no longer depends on whether
     pcim_enable_device or pci_enable_device() was used, and remove
     related code (Philipp Stanner)

   - Warn (not BUG()) about failure to assign optional resources (Ilpo
     Järvinen)

  Error handling:

   - Log the DPC Error Source ID only when it's actually valid (when
     ERR_FATAL or ERR_NONFATAL was received from a downstream device)
     and decode into bus/device/function (Bjorn Helgaas)

   - Determine AER log level once and save it so all related messages
     use the same level (Karolina Stolarek)

   - Use KERN_WARNING, not KERN_ERR, when logging PCIe Correctable
     Errors (Karolina Stolarek)

   - Ratelimit PCIe Correctable and Non-Fatal error logging, with sysfs
     controls on interval and burst count, to avoid flooding logs and
     RCU stall warnings (Jon Pan-Doh)

  Power management:

   - Increment PM usage counter when probing reset methods so we don't
     try to read config space of a powered-off device (Alex Williamson)

   - Set all devices to D0 during enumeration to ensure ACPI opregion is
     connected via _REG (Mario Limonciello)

  Power control:

   - Rename pwrctrl Kconfig symbols from 'PWRCTL' to 'PWRCTRL' to match
     the filename paths. Retain old deprecated symbols for
     compatibility, except for the pwrctrl slot driver
     (PCI_PWRCTRL_SLOT) (Johan Hovold)

   - When unregistering pwrctrl, cancel outstanding rescan work before
     cleaning up data structures to avoid use-after-free issues (Brian
     Norris)

  Bandwidth control:

   - Simplify link bandwidth controller by replacing the count of Link
     Bandwidth Management Status (LBMS) events with a PCI_LINK_LBMS_SEEN
     flag (Ilpo Järvinen)

   - Update the Link Speed after retraining, since the Link Speed may
     have changed (Ilpo Järvinen)

  PCIe native device hotplug:

   - Ignore Presence Detect Changed caused by DPC.

     pciehp already ignores Link Down/Up events caused by DPC, but on
     slots using in-band presence detect, DPC causes a spurious Presence
     Detect Changed event (Lukas Wunner)

   - Ignore Link Down/Up caused by Secondary Bus Reset.

     On hotplug ports using in-band presence detect, the reset causes a
     Presence Detect Changed event, which mistakenly caused teardown and
     re-enumeration of the device. Drivers may need to annotate code
     that resets their device (Lukas Wunner)

  Virtualization:

   - Add an ACS quirk for Loongson Root Ports that don't advertise ACS
     but don't allow peer-to-peer transactions between Root Ports; the
     quirk allows each Root Port to be in a separate IOMMU group (Huacai
     Chen)

  Endpoint framework:

   - For fixed-size BARs, retain both the actual size and the possibly
     larger size allocated to accommodate iATU alignment requirements
     (Jerome Brunet)

   - Simplify ctrl/SPAD space allocation and avoid allocating more space
     than needed (Jerome Brunet)

   - Correct MSI-X PBA offset calculations for DesignWare and Cadence
     endpoint controllers (Niklas Cassel)

   - Align the return value (number of interrupts) encoding for
     pci_epc_get_msi()/pci_epc_ops::get_msi() and
     pci_epc_get_msix()/pci_epc_ops::get_msix() (Niklas Cassel)

   - Align the nr_irqs parameter encoding for
     pci_epc_set_msi()/pci_epc_ops::set_msi() and
     pci_epc_set_msix()/pci_epc_ops::set_msix() (Niklas Cassel)

  Common host controller library:

   - Convert pci-host-common to a library so platforms that don't need
     native host controller drivers don't need to include these helper
     functions (Manivannan Sadhasivam)

  Apple PCIe controller driver:

   - Extract ECAM bridge creation helper from pci_host_common_probe() to
     separate driver-specific things like MSI from PCI things (Marc
     Zyngier)

   - Dynamically allocate RID-to_SID bitmap to prepare for SoCs with
     varying capabilities (Marc Zyngier)

   - Skip ports disabled in DT when setting up ports (Janne Grunau)

   - Add t6020 compatible string (Alyssa Rosenzweig)

   - Add T602x PCIe support (Hector Martin)

   - Directly set/clear INTx mask bits because T602x dropped the
     accessors that could do this without locking (Marc Zyngier)

   - Move port PHY registers to their own reg items to accommodate
     T602x, which moves them around; retain default offsets for existing
     DTs that lack phy%d entries with the reg offsets (Hector Martin)

   - Stop polling for core refclk, which doesn't work on T602x and the
     bootloader has already done anyway (Hector Martin)

   - Use gpiod_set_value_cansleep() when asserting PERST# in probe
     because we're allowed to sleep there (Hector Martin)

  Cadence PCIe controller driver:

   - Drop a runtime PM 'put' to resolve a runtime atomic count underflow
     (Hans Zhang)

   - Make the cadence core buildable as a module (Kishon Vijay Abraham I)

   - Add cdns_pcie_host_disable() and cdns_pcie_ep_disable() for use by
     loadable drivers when they are removed (Siddharth Vadapalli)

  Freescale i.MX6 PCIe controller driver:

   - Apply link training workaround only on IMX6Q, IMX6SX, IMX6SP
     (Richard Zhu)

   - Remove redundant dw_pcie_wait_for_link() from
     imx_pcie_start_link(); since the DWC core does this, imx6 only
     needs it when retraining for a faster link speed (Richard Zhu)

   - Toggle i.MX95 core reset to align with PHY powerup (Richard Zhu)

   - Set SYS_AUX_PWR_DET to work around i.MX95 ERR051624 erratum: in
     some cases, the controller can't exit 'L23 Ready' through Beacon or
     PERST# deassertion (Richard Zhu)

   - Clear GEN3_ZRXDC_NONCOMPL to work around i.MX95 ERR051586 erratum:
     controller can't meet 2.5 GT/s ZRX-DC timing when operating at 8
     GT/s, causing timeouts in L1 (Richard Zhu)

   - Wait for i.MX95 PLL lock before enabling controller (Richard Zhu)

   - Save/restore i.MX95 LUT for suspend/resume (Richard Zhu)

  Mobiveil PCIe controller driver:

   - Return bool (not int) for link-up check in
     mobiveil_pab_ops.link_up() and layerscape-gen4, mobiveil (Hans
     Zhang)

  NVIDIA Tegra194 PCIe controller driver:

   - Create debugfs directory for 'aspm_state_cnt' only when
     CONFIG_PCIEASPM is enabled, since there are no other entries (Hans
     Zhang)

  Qualcomm PCIe controller driver:

   - Add OF support for parsing DT 'eq-presets-<N>gts' property for lane
     equalization presets (Krishna Chaitanya Chundru)

   - Read Maximum Link Width from the Link Capabilities register if DT
     lacks 'num-lanes' property (Krishna Chaitanya Chundru)

   - Add Physical Layer 64 GT/s Capability ID and register offsets for
     8, 32, and 64 GT/s lane equalization registers (Krishna Chaitanya
     Chundru)

   - Add generic dwc support for configuring lane equalization presets
     (Krishna Chaitanya Chundru)

   - Add DT and driver support for PCIe on IPQ5018 SoC (Nitheesh Sekar)

  Renesas R-Car PCIe controller driver:

   - Describe endpoint BAR 4 as being fixed size (Jerome Brunet)

   - Document how to obtain R-Car V4H (r8a779g0) controller firmware
     (Yoshihiro Shimoda)

  Rockchip PCIe controller driver:

   - Reorder rockchip_pci_core_rsts because
     reset_control_bulk_deassert() deasserts in reverse order, to fix a
     link training regression (Jensen Huang)

   - Mark RK3399 as being capable of raising INTx interrupts (Niklas
     Cassel)

  Rockchip DesignWare PCIe controller driver:

   - Check only PCIE_LINKUP, not LTSSM status, to determine whether the
     link is up (Shawn Lin)

   - Increase N_FTS (used in L0s->L0 transitions) and enable ASPM L0s
     for Root Complex and Endpoint modes (Shawn Lin)

   - Hide the broken ATS Capability in rockchip_pcie_ep_init() instead
     of rockchip_pcie_ep_pre_init() so it stays hidden after PERST#
     resets non-sticky registers (Shawn Lin)

   - Call phy_power_off() before phy_exit() in rockchip_pcie_phy_deinit()
     (Diederik de Haas)

  Synopsys DesignWare PCIe controller driver:

   - Set PORT_LOGIC_LINK_WIDTH to one lane to make initial link training
     more robust; this will not affect the intended link width if all
     lanes are functional (Wenbin Yao)

   - Return bool (not int) for link-up check in dw_pcie_ops.link_up()
     and armada8k, dra7xx, dw-rockchip, exynos, histb, keembay,
     keystone, kirin, meson, qcom, qcom-ep, rcar_gen4, spear13xx,
     tegra194, uniphier, visconti (Hans Zhang)

   - Add debugfs support for exposing DWC device-specific PTM context
     (Manivannan Sadhasivam)

  TI J721E PCIe driver:

   - Make j721e buildable as a loadable and removable module (Siddharth
     Vadapalli)

   - Fix j721e host/endpoint dependencies that result in link failures
     in some configs (Arnd Bergmann)

  Device tree bindings:

   - Add qcom DT binding for 'global' interrupt (PCIe controller and
     link-specific events) for ipq8074, ipq8074-gen3, ipq6018, sa8775p,
     sc7280, sc8180x sdm845, sm8150, sm8250, sm8350 (Manivannan
     Sadhasivam)

   - Add qcom DT binding for 8 MSI SPI interrupts for msm8998, ipq8074,
     ipq8074-gen3, ipq6018 (Manivannan Sadhasivam)

   - Add dw rockchip DT binding for rk3576 and rk3562 (Kever Yang)

   - Correct indentation and style of examples in brcm,stb-pcie,
     cdns,cdns-pcie-ep, intel,keembay-pcie-ep, intel,keembay-pcie,
     microchip,pcie-host, rcar-pci-ep, rcar-pci-host, xilinx-versal-cpm
     (Krzysztof Kozlowski)

   - Convert Marvell EBU (dove, kirkwood, armada-370, armada-xp) and
     armada8k from text to schema DT bindings (Rob Herring)

   - Remove obsolete .txt DT bindings for content that has been moved to
     schemas (Rob Herring)

   - Add qcom DT binding for MHI registers in IPQ5332, IPQ6018, IPQ8074
     and IPQ9574 (Varadarajan Narayanan)

   - Convert v3,v360epc-pci from text to DT schema binding (Rob Herring)

   - Change microchip,pcie-host DT binding to be 'dma-noncoherent' since
     PolarFire may be configured that way (Conor Dooley)

  Miscellaneous:

   - Drop 'pci' suffix from intel_mid_pci.c filename to match similar
     files (Andy Shevchenko)

   - All platforms with PCI have an MMU, so add PCI Kconfig dependency
     on MMU to simplify build testing and avoid inadvertent build
     regressions (Arnd Bergmann)

   - Update Krzysztof Wilczyński's email address in MAINTAINERS
     (Krzysztof Wilczyński)

   - Update Manivannan Sadhasivam's email address in MAINTAINERS
     (Manivannan Sadhasivam)"

* tag 'pci-v6.16-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci: (147 commits)
  MAINTAINERS: Update Manivannan Sadhasivam email address
  PCI: j721e: Fix host/endpoint dependencies
  PCI: j721e: Add support to build as a loadable module
  PCI: cadence-ep: Introduce cdns_pcie_ep_disable() helper for cleanup
  PCI: cadence-host: Introduce cdns_pcie_host_disable() helper for cleanup
  PCI: cadence: Add support to build pcie-cadence library as a kernel module
  MAINTAINERS: Update Krzysztof Wilczyński email address
  PCI: Remove unnecessary linesplit in __pci_setup_bridge()
  PCI: WARN (not BUG()) when we fail to assign optional resources
  PCI: Remove unused pci_printk()
  PCI: qcom: Replace PERST# sleep time with proper macro
  PCI: dw-rockchip: Replace PERST# sleep time with proper macro
  PCI: host-common: Convert to library for host controller drivers
  PCI/ERR: Remove misleading TODO regarding kernel panic
  PCI: cadence: Remove duplicate message code definitions
  PCI: endpoint: Align pci_epc_set_msix(), pci_epc_ops::set_msix() nr_irqs encoding
  PCI: endpoint: Align pci_epc_set_msi(), pci_epc_ops::set_msi() nr_irqs encoding
  PCI: endpoint: Align pci_epc_get_msix(), pci_epc_ops::get_msix() return value encoding
  PCI: endpoint: Align pci_epc_get_msi(), pci_epc_ops::get_msi() return value encoding
  PCI: cadence-ep: Correct PBA offset in .set_msix() callback
  ...
2025-06-04 11:26:17 -07:00
Arunpravin Paneer Selvam e34bcf1594 drm/amdgpu: Add userq fence support to SDMAv7.0
- Add userq fence support to SDMAv7.0.
- GFX12's user fence irq src id differs from GFX11's,
  hence we need create a new irq srcid header file for GFX12.

  User fence irq src id information-
  GFX11 and SDMA6.0 - 0x43
  GFX12 and SDMA7.0 - 0x46

Signed-off-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-03 15:32:50 -04:00
Dan Carpenter 335f1e797c drm/amdgpu: Fix integer overflow in amdgpu_gem_add_input_fence()
The "num_syncobj_handles" is a u32 value that comes from the user via the
ioctl.  On 32bit systems the "sizeof(uint32_t) * num_syncobj_handles"
multiplication can have an integer overflow.  Use size_mul() to fix that.

Fixes: 38c67ec9aa ("drm/amdgpu: Add input fence to sync bo map/unmap")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-03 15:31:13 -04:00
Dan Carpenter 98a46a4089 drm/amdgpu: Fix integer overflow issues in amdgpu_userq_fence.c
This patch only affects 32bit systems.  There are several integer
overflows bugs here but only the "sizeof(u32) * num_syncobj"
multiplication is a problem at runtime.  (The last lines of this patch).

These variables are u32 variables that come from the user.  The issue
is the multiplications can overflow leading to us allocating a smaller
buffer than intended.  For the first couple integer overflows, the
syncobj_handles = memdup_user() allocation is immediately followed by
a kmalloc_array():

	syncobj = kmalloc_array(num_syncobj_handles, sizeof(*syncobj), GFP_KERNEL);

In that situation the kmalloc_array() works as a bounds check and we
haven't accessed the syncobj_handlesp[] array yet so the integer overflow
is harmless.

But the "num_syncobj" multiplication doesn't have that and the integer
overflow could lead to an out of bounds access.

Fixes: a292fdecd7 ("drm/amdgpu: Implement userqueue signal/wait IOCTL")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-03 15:06:35 -04:00
Alex Deucher 5cccf10f65 drm/amdgpu: disable workload profile switching when OD is enabled
Users have reported that they have to reduce the level of undervolting
to acheive stability when dynamic workload profiles are enabled on
GC 10.3.x. Disable dynamic workload profiles if the user has enabled
OD.

Fixes: b9467983b7 ("drm/amdgpu: add dynamic workload profile switching for gfx10")
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4262
Reviewed-by: Kenneth Feng <kenneth.feng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org # 6.15.x
2025-06-03 15:04:24 -04:00
Vitaly Prosyak d26625d034 drm/amdgpu/gfx10: Refine Cleaner Shader for GFX10.1.10
This patch updates the cleaner shader, which is responsible for
initializing GPU resources such as Local Data Share (LDS), Vector
General Purpose Registers (VGPRs), and Scalar General Purpose Registers
(SGPRs). Changes include adjustments to register clearing and shader
configuration.

- Updated GPU resource initialization addresses in the cleaner shader
  from `be803080` to `be803000`.
- Simplified the logic in the SGPR clearing section, ensuring all SGPRs
  are set to zero.

Fixes: 25961bad92 ("drm/amdgpu/gfx10: Add cleaner shader for GFX10.1.10")
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Manu Rastogi <manu.rastogi@amd.com>
Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-03 15:03:09 -04:00
Lijo Lazar 719d84f8a8 drm/amdgpu: Add more checks to discovery fetch
Add more checks for valid vram size and log error, if any.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-03 15:02:58 -04:00
Tvrtko Ursulin bf33a0003d dma-fence: Use a flag for 64-bit seqnos
With the goal of reducing the need for drivers to touch (and dereference)
fence->ops, we move the 64-bit seqnos flag from struct dma_fence_ops to
the fence->flags.

Drivers which were setting this flag are changed to use new
dma_fence_init64() instead of dma_fence_init().

v2:
 * Streamlined init and added kerneldoc.
 * Rebase for amdgpu userq which landed since.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Reviewed-by: Christian König <christian.koenig@amd.com> # v1
Signed-off-by: Tvrtko Ursulin <tursulin@ursulin.net>
Link: https://lore.kernel.org/r/20250515095004.28318-3-tvrtko.ursulin@igalia.com
2025-06-03 17:38:04 +01:00
Maxime Ripard 7b1166dee8 drm for 6.16-rc1
new drivers:
 - bring in the asahi uapi header standalone
 - nova-drm: stub driver
 
 rust dependencies (for nova-core):
 - auxiliary
   - bus abstractions
   - driver registration
   - sample driver
 - devres changes from driver-core
 - revocable changes
 
 core:
 - add Apple fourcc modifiers
 - add virtio capset definitions
 - extend EXPORT_SYNC_FILE for timeline syncobjs
 - convert to devm_platform_ioremap_resource
 - refactor shmem helper page pinning
 - DP powerup/down link helpers
 - remove disgusting turds
 - extended %p4cc in vsprintf.c to support fourcc prints
 - change vsprintf %p4cn to %p4chR, remove %p4cn
 - Add drm_file_err function
 - IN_FORMATS_ASYNC property
 - move sitronix from tiny to their own subdir
 
 rust:
 - add drm core infrastructure rust abstractions
   (device/driver, ioctl, file, gem)
 
 dma-buf:
 - adjust sg handling to not cache map on attach
 - allow setting dma-device for import
 - Add a helper to sort and deduplicate dma_fence arrays
 
 docs:
 - updated drm scheduler docs
 - fbdev todo update
 - fb rendering
 - actual brightness
 
 ttm:
 - fix delayed destroy resv object
 
 bridge:
 - add kunit tests
 - convert tc358775 to atomic
 - convert drivers to devm_drm_bridge_alloc
 - convert rk3066_hdmi to bridge driver
 
 scheduler:
 - add kunit tests
 
 panel:
 - refcount panels to improve lifetime handling
 - Powertip PH128800T004-ZZA01
 - NLT NL13676BC25-03F, Tianma TM070JDHG34-00
 - Himax HX8279/HX8279-D DDIC
 - Visionox G2647FB105
 - Sitronix ST7571
 - ZOTAC rotation quirk
 
 vkms:
 - allow attaching more displays
 
 i915:
 - xe3lpd display updates
 - vrr refactor
 - intel_display struct conversions
 - xe2hpd memory type identification
 - add link rate/count to i915_display_info
 - cleanup VGA plane handling
 - refactor HDCP GSC
 - fix SLPC wait boosting reference counting
 - add 20ms delay to engine reset
 - fix fence release on early probe errors
 
 xe:
 - SRIOV updates
 - BMG PCI ID update
 - support separate firmware for each GT
 - SVM fix, prelim SVM multi-device work
 - export fan speed
 - temp disable d3cold on BMG
 - backup VRAM in PM notifier instead of suspend/freeze
 - update xe_ttm_access_memory to use GPU for non-visible access
 - fix guc_info debugfs for VFs
 - use copy_from_user instead of __copy_from_user
 - append PCIe gen5 limitations to xe_firmware document
 
 amdgpu:
 - DSC cleanup
 - DC Scaling updates
 - Fused I2C-over-AUX updates
 - DMUB updates
 - Use drm_file_err in amdgpu
 - Enforce isolation updates
 - Use new dma_fence helpers
 - USERQ fixes
 - Documentation updates
 - SR-IOV updates
 - RAS updates
 - PSP 12 cleanups
 - GC 9.5 updates
 - SMU 13.x updates
 - VCN / JPEG SR-IOV updates
 
 amdkfd:
 - Update error messages for SDMA
 - Userptr updates
 - XNACK fixes
 
 radeon:
 - CIK doorbell cleanup
 
 nouveau:
 - add support for NVIDIA r570 GSP firmware
 - enable Hopper/Blackwell support
 
 nova-core:
 - fix task list
 - register definition infrastructure
 - move firmware into own rust module
 - register auxiliary device for nova-drm
 
 nova-drm:
 - initial driver skeleton
 
 msm:
 - GPU:
   - ACD (adaptive clock distribution) for X1-85
   - drop fictional address_space_size
   - improve GMU HFI response time out robustness
   - fix crash when throttling during boot
 - DPU:
   - use single CTL path for flushing on DPU 5.x+
   - improve SSPP allocation code for better sharing
   - Enabled SmartDMA on SM8150, SC8180X, SC8280XP, SM8550
   - Added SAR2130P support
   - Disabled DSC support on MSM8937, MSM8917, MSM8953, SDM660
 - DP:
   - switch to new audio helpers
   - better LTTPR handling
 - DSI:
   - Added support for SA8775P
   - Added SAR2130P support
 - HDMI:
   - Switched to use new helpers for ACR data
   - Fixed old standing issue of HPD not working in some cases
 
 amdxdna:
 - add dma-buf support
 - allow empty command submits
 
 renesas:
 - add dma-buf support
 - add zpos, alpha, blend support
 
 panthor:
 - fail properly for NO_MMAP bos
 - add SET_LABEL ioctl
 - debugfs BO dumping support
 
 imagination:
 - update DT bindings
 - support TI AM68 GPU
 
 hibmc:
 - improve interrupt handling and HPD support
 
 virtio:
 - add panic handler support
 
 rockchip:
 - add RK3588 support
 - add DP AUX bus panel support
 
 ivpu:
 - add heartbeat based hangcheck
 
 mediatek:
 - prepares support for MT8195/99 HDMIv2/DDCv2
 
 anx7625:
 - improve HPD
 
 tegra:
 - speed up firmware loading
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEEKbZHaGwW9KfbeusDHTzWXnEhr4FAmg2aVAACgkQDHTzWXnE
 hr6DjhAApr2fZjugU3EmpsARdcIWgEd+X65R97ef7RlUGqBKm2joSwZGOhH0oBsG
 9WyO92Qzu6XMe8OibKqY4D2hir9UPz5v+uEWe3q9CzZGbNyAwyVRjVkaKpnI9upv
 1dmHFI7HgPu6qbz6RfPIfgALBLXvVXMaQ4+ZgN/cLtZFa+OLAV5ByqWsRPPXZFb0
 F/pQGQ4ursglfA+LH3SVPfnTN53lu93IlM5/Os9OQQGj+44w94zQ6DCm7CY1AugH
 n+RM/0Yv7WaoF1ByeOtq4FcrmLRrd+ozsvITbRZqhOx7zS/mhP8LRzAwgKWOYzSh
 puKunyQiSdHR7FSqSi8uyY3YumcLWNa/17LMKoTf+KqweJbKGE7RVBuFBn6WUdPb
 AYHZrSB4USAeyahdrrsU+q7ltu5urs5ckpbXsRurMiaUz/BLim1PIm3N5FDLPY7B
 PD1n1FcMUv3CmJT5Y+aNIQgmf1/dETESRTSAgSoOo3gNp6jdRCYqSuWIBsppibWT
 26+tyz0/FGhE50QviHzg0Sv+jd/g93fN6snNlV8wNFMviq3bC69Toa+y3qJ5e7UC
 /42R7nCWdkCZJfr6E67rOaahe9TDV/LXLqPErwptOkdK8sMchaIgF+deybgTtTi/
 zGRBfjLvb5ocYBmPbeGX4mtXNRpyZ3o9I0QUyGUO4zMwFXmFwn0=
 =jpVr
 -----END PGP SIGNATURE-----
gpgsig -----BEGIN PGP SIGNATURE-----
 
 iJUEABMJAB0WIQTkHFbLp4ejekA/qfgnX84Zoj2+dgUCaD7zuQAKCRAnX84Zoj2+
 dv21AX4qAXMoS1eQQOzx5/MN0LhibwHO8lq0HgyhKKCMZTUvFP91hvuB6qKGzxEU
 +RJmN5cBgPGNuXwr9zLe5A/Lv1LWgfSj1DaAlauYvduFh1xyLOLuo0H3xfTsKrcl
 Onjxi5QVsg==
 =bMa5
 -----END PGP SIGNATURE-----

Merge drm-next-2025-05-28 into drm-misc-next

Christian needs a recent drm-next branch to merge fence patches.

Signed-off-by: Maxime Ripard <mripard@kernel.org>
2025-06-03 15:07:39 +02:00
Lang Yu 30837a49bd drm/amdkfd: Map wptr BO to GART unconditionally
For simulation C models that don't run CP FW where adev->mes.sched_version
is not populated correctly. This causes NULL dereference in
amdgpu_amdkfd_free_gtt_mem(dev->adev, (void **)&pqn->q->wptr_bo_gart)
and warning on unpinned BO in amdgpu_bo_gpu_offset(q->properties.wptr_bo).

Compared with adding version check here and there,
always map wptr BO to GART simplifies things.

v2: Add NULL check in amdgpu_amdkfd_free_gtt_mem.(Philip)

Signed-off-by: Lang Yu <lang.yu@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-29 10:58:44 -04:00
Alex Deucher 684530526f drm/amdgpu/mes: remove some unused functions
Nothing uses them so remove them.  Leftover from
MES bring up.

Reviewed-by: Michael Chen <michael.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-29 10:58:39 -04:00
Alex Deucher 40f970ba7a drm/amdgpu/mes: add missing locking in helper functions
We need to take the MES lock.

Reviewed-by: Michael Chen <michael.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
2025-05-29 10:58:10 -04:00
Mario Limonciello 82a277d529 drm/amd: Export DMCUB version to sysfs
For supported ASICs DMCU version is exported, but ASICs that support
DMCUB there is no information exported to sysfs.

Add an attribute for DMCUB.

Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Link: https://lore.kernel.org/r/20250527155942.476354-1-mario.limonciello@amd.com
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-29 10:57:42 -04:00
ganglxie fce0afca35 drm/amdgpu: Get mca address for old eeprom records
after getting mca address for old eeprom records with 'address==0', it can be
correctly parsed under none-nps1, or it will be dropped.

Signed-off-by: ganglxie <ganglxie@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-29 10:57:09 -04:00
ganglxie 31e837d242 drm/amdgpu: handle old RAS eeprom data in non-nps1 mode
Get MCA address from PA in nps1, then convert MCA address to PA in specific nps
mode.

Signed-off-by: ganglxie <ganglxie@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-29 10:57:05 -04:00
Arunpravin Paneer Selvam 5ae9de5867 drm/amdgpu: Add userq fence support to SDMAv6.0
Add userq fence support to SDMAv6.0

Signed-off-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-29 10:56:58 -04:00
John Olender 4d2f6b4e4c drm/amdgpu: amdgpu_vram_mgr_new(): Clamp lpfn to total vram
The drm_mm allocator tolerated being passed end > mm->size, but the
drm_buddy allocator does not.

Restore the pre-buddy-allocator behavior of allowing such placements.

Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3448
Signed-off-by: John Olender <john.olender@gmail.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
2025-05-29 10:56:23 -04:00
David (Ming Qiang) Wu bf394d2854 drm/amdgpu/vcn5.0.1: read back register after written
The addition of register read-back in VCN v5.0.1 is intended to prevent
potential race conditions.

Reviewed-by: Ruijing Dong <ruijing.dong@amd.com>
Signed-off-by: David (Ming Qiang) Wu <David.Wu3@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-29 10:56:19 -04:00
David (Ming Qiang) Wu a8bce9b7a2 drm/amdgpu/vcn5: read back register after written
The addition of register read-back in VCN v5.0.0 is intended to prevent
potential race conditions.

Reviewed-by: Ruijing Dong <ruijing.dong@amd.com>
Signed-off-by: David (Ming Qiang) Wu <David.Wu3@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-29 10:56:17 -04:00
David (Ming Qiang) Wu 4d4275a038 drm/amdgpu/vcn4.0.5: read back register after written
The addition of register read-back in VCN v4.0.5 is intended to prevent
potential race conditions.

Reviewed-by: Ruijing Dong <ruijing.dong@amd.com>
Signed-off-by: David (Ming Qiang) Wu <David.Wu3@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-29 10:56:14 -04:00
David (Ming Qiang) Wu 5b4c6413c8 drm/amdgpu/vcn4.0.3: read back register after written
The addition of register read-back in VCN v4.0.3 is intended to prevent
potential race conditions.

Reviewed-by: Ruijing Dong <ruijing.dong@amd.com>
Signed-off-by: David (Ming Qiang) Wu <David.Wu3@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-29 10:56:12 -04:00
David (Ming Qiang) Wu a3810a5e37 drm/amdgpu/vcn4: read back register after written
The addition of register read-back in VCN v4.0.0 is intended to prevent
potential race conditions.

Reviewed-by: Ruijing Dong <ruijing.dong@amd.com>
Signed-off-by: David (Ming Qiang) Wu <David.Wu3@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-29 10:56:10 -04:00
David (Ming Qiang) Wu b7a4842a91 drm/amdgpu/vcn3: read back register after written
The addition of register read-back in VCN v3.0 is intended to prevent
potential race conditions.

Reviewed-by: Ruijing Dong <ruijing.dong@amd.com>
Signed-off-by: David (Ming Qiang) Wu <David.Wu3@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-29 10:56:08 -04:00
David (Ming Qiang) Wu d9e688b914 drm/amdgpu/vcn2.5: read back register after written
The addition of register read-back in VCN v2.5 is intended to prevent
potential race conditions.

Reviewed-by: Ruijing Dong <ruijing.dong@amd.com>
Signed-off-by: David (Ming Qiang) Wu <David.Wu3@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-29 10:56:05 -04:00
David (Ming Qiang) Wu 8c5ed7f5ab drm/amdgpu/vcn2: read back register after written
The addition of register read-back in VCN v2.0 is intended to prevent
potential race conditions.

Reviewed-by: Ruijing Dong <ruijing.dong@amd.com>
Signed-off-by: David (Ming Qiang) Wu <David.Wu3@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-29 10:56:03 -04:00
David (Ming Qiang) Wu 0ef2803173 drm/amdgpu/vcn1: read back register after written
V3: drop changes where readbacks have implemented. This patch set
    is to add readbacks only.

V2: use common register UVD_STATUS for readback (standard PCI MMIO
    behavior, i.e. readback post all writes to let the writes hit
    the hardware)
    add readback in ..._stop() for more coverage.

Similar to the changes made for VCN v4.0.5 where readback to post the
writes to avoid race with the doorbell, the addition of register
readback support in other VCN versions is intended to prevent potential
race conditions, even though such issues have not been observed yet.
This change ensures consistency across different VCN variants and helps
avoid similar issues. The overhead introduced is negligible.

Reviewed-by: Ruijing Dong <ruijing.dong@amd.com>
Signed-off-by: David (Ming Qiang) Wu <David.Wu3@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-29 10:55:39 -04:00
Pratap Nirujogi 3e9d9df850 drm/amd/amdgpu: Add GPIO resources required for amdisp
ISP is a child device to GFX, and its device specific information
is not available in ACPI. Adding the 2 GPIO resources required for
ISP_v4_1_1 in amdgpu_isp driver.

- GPIO 0 to allow sensor driver to enable and disable sensor module.
- GPIO 85 to allow ISP driver to enable and disable ISP RGB streaming mode.

Signed-off-by: Pratap Nirujogi <pratap.nirujogi@amd.com>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-28 16:01:39 -04:00
Linus Torvalds b08494a8f7 drm for 6.16-rc1
new drivers:
 - bring in the asahi uapi header standalone
 - nova-drm: stub driver
 
 rust dependencies (for nova-core):
 - auxiliary
   - bus abstractions
   - driver registration
   - sample driver
 - devres changes from driver-core
 - revocable changes
 
 core:
 - add Apple fourcc modifiers
 - add virtio capset definitions
 - extend EXPORT_SYNC_FILE for timeline syncobjs
 - convert to devm_platform_ioremap_resource
 - refactor shmem helper page pinning
 - DP powerup/down link helpers
 - remove disgusting turds
 - extended %p4cc in vsprintf.c to support fourcc prints
 - change vsprintf %p4cn to %p4chR, remove %p4cn
 - Add drm_file_err function
 - IN_FORMATS_ASYNC property
 - move sitronix from tiny to their own subdir
 
 rust:
 - add drm core infrastructure rust abstractions
   (device/driver, ioctl, file, gem)
 
 dma-buf:
 - adjust sg handling to not cache map on attach
 - allow setting dma-device for import
 - Add a helper to sort and deduplicate dma_fence arrays
 
 docs:
 - updated drm scheduler docs
 - fbdev todo update
 - fb rendering
 - actual brightness
 
 ttm:
 - fix delayed destroy resv object
 
 bridge:
 - add kunit tests
 - convert tc358775 to atomic
 - convert drivers to devm_drm_bridge_alloc
 - convert rk3066_hdmi to bridge driver
 
 scheduler:
 - add kunit tests
 
 panel:
 - refcount panels to improve lifetime handling
 - Powertip PH128800T004-ZZA01
 - NLT NL13676BC25-03F, Tianma TM070JDHG34-00
 - Himax HX8279/HX8279-D DDIC
 - Visionox G2647FB105
 - Sitronix ST7571
 - ZOTAC rotation quirk
 
 vkms:
 - allow attaching more displays
 
 i915:
 - xe3lpd display updates
 - vrr refactor
 - intel_display struct conversions
 - xe2hpd memory type identification
 - add link rate/count to i915_display_info
 - cleanup VGA plane handling
 - refactor HDCP GSC
 - fix SLPC wait boosting reference counting
 - add 20ms delay to engine reset
 - fix fence release on early probe errors
 
 xe:
 - SRIOV updates
 - BMG PCI ID update
 - support separate firmware for each GT
 - SVM fix, prelim SVM multi-device work
 - export fan speed
 - temp disable d3cold on BMG
 - backup VRAM in PM notifier instead of suspend/freeze
 - update xe_ttm_access_memory to use GPU for non-visible access
 - fix guc_info debugfs for VFs
 - use copy_from_user instead of __copy_from_user
 - append PCIe gen5 limitations to xe_firmware document
 
 amdgpu:
 - DSC cleanup
 - DC Scaling updates
 - Fused I2C-over-AUX updates
 - DMUB updates
 - Use drm_file_err in amdgpu
 - Enforce isolation updates
 - Use new dma_fence helpers
 - USERQ fixes
 - Documentation updates
 - SR-IOV updates
 - RAS updates
 - PSP 12 cleanups
 - GC 9.5 updates
 - SMU 13.x updates
 - VCN / JPEG SR-IOV updates
 
 amdkfd:
 - Update error messages for SDMA
 - Userptr updates
 - XNACK fixes
 
 radeon:
 - CIK doorbell cleanup
 
 nouveau:
 - add support for NVIDIA r570 GSP firmware
 - enable Hopper/Blackwell support
 
 nova-core:
 - fix task list
 - register definition infrastructure
 - move firmware into own rust module
 - register auxiliary device for nova-drm
 
 nova-drm:
 - initial driver skeleton
 
 msm:
 - GPU:
   - ACD (adaptive clock distribution) for X1-85
   - drop fictional address_space_size
   - improve GMU HFI response time out robustness
   - fix crash when throttling during boot
 - DPU:
   - use single CTL path for flushing on DPU 5.x+
   - improve SSPP allocation code for better sharing
   - Enabled SmartDMA on SM8150, SC8180X, SC8280XP, SM8550
   - Added SAR2130P support
   - Disabled DSC support on MSM8937, MSM8917, MSM8953, SDM660
 - DP:
   - switch to new audio helpers
   - better LTTPR handling
 - DSI:
   - Added support for SA8775P
   - Added SAR2130P support
 - HDMI:
   - Switched to use new helpers for ACR data
   - Fixed old standing issue of HPD not working in some cases
 
 amdxdna:
 - add dma-buf support
 - allow empty command submits
 
 renesas:
 - add dma-buf support
 - add zpos, alpha, blend support
 
 panthor:
 - fail properly for NO_MMAP bos
 - add SET_LABEL ioctl
 - debugfs BO dumping support
 
 imagination:
 - update DT bindings
 - support TI AM68 GPU
 
 hibmc:
 - improve interrupt handling and HPD support
 
 virtio:
 - add panic handler support
 
 rockchip:
 - add RK3588 support
 - add DP AUX bus panel support
 
 ivpu:
 - add heartbeat based hangcheck
 
 mediatek:
 - prepares support for MT8195/99 HDMIv2/DDCv2
 
 anx7625:
 - improve HPD
 
 tegra:
 - speed up firmware loading
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEEKbZHaGwW9KfbeusDHTzWXnEhr4FAmg2aVAACgkQDHTzWXnE
 hr6DjhAApr2fZjugU3EmpsARdcIWgEd+X65R97ef7RlUGqBKm2joSwZGOhH0oBsG
 9WyO92Qzu6XMe8OibKqY4D2hir9UPz5v+uEWe3q9CzZGbNyAwyVRjVkaKpnI9upv
 1dmHFI7HgPu6qbz6RfPIfgALBLXvVXMaQ4+ZgN/cLtZFa+OLAV5ByqWsRPPXZFb0
 F/pQGQ4ursglfA+LH3SVPfnTN53lu93IlM5/Os9OQQGj+44w94zQ6DCm7CY1AugH
 n+RM/0Yv7WaoF1ByeOtq4FcrmLRrd+ozsvITbRZqhOx7zS/mhP8LRzAwgKWOYzSh
 puKunyQiSdHR7FSqSi8uyY3YumcLWNa/17LMKoTf+KqweJbKGE7RVBuFBn6WUdPb
 AYHZrSB4USAeyahdrrsU+q7ltu5urs5ckpbXsRurMiaUz/BLim1PIm3N5FDLPY7B
 PD1n1FcMUv3CmJT5Y+aNIQgmf1/dETESRTSAgSoOo3gNp6jdRCYqSuWIBsppibWT
 26+tyz0/FGhE50QviHzg0Sv+jd/g93fN6snNlV8wNFMviq3bC69Toa+y3qJ5e7UC
 /42R7nCWdkCZJfr6E67rOaahe9TDV/LXLqPErwptOkdK8sMchaIgF+deybgTtTi/
 zGRBfjLvb5ocYBmPbeGX4mtXNRpyZ3o9I0QUyGUO4zMwFXmFwn0=
 =jpVr
 -----END PGP SIGNATURE-----

Merge tag 'drm-next-2025-05-28' of https://gitlab.freedesktop.org/drm/kernel

Pull drm updates from Dave Airlie:
 "As part of building up nova-core/nova-drm pieces we've brought in some
  rust abstractions through this tree, aux bus being the main one, with
  devres changes also in the driver-core tree. Along with the drm core
  abstractions and enough nova-core/nova-drm to use them. This is still
  all stub work under construction, to build the nova driver upstream.

  The other big NVIDIA related one is nouveau adds support for
  Hopper/Blackwell GPUs, this required a new GSP firmware update to
  570.144, and a bunch of rework in order to support multiple fw
  interfaces.

  There is also the introduction of an asahi uapi header file as a
  precursor to getting the real driver in later, but to unblock
  userspace mesa packages while the driver is trapped behind rust
  enablement.

  Otherwise it's the usual mixture of stuff all over, amdgpu, i915/xe,
  and msm being the main ones, and some changes to vsprintf.

  new drivers:
   - bring in the asahi uapi header standalone
   - nova-drm: stub driver

  rust dependencies (for nova-core):
   - auxiliary
       - bus abstractions
       - driver registration
       - sample driver
   - devres changes from driver-core
   - revocable changes

  core:
   - add Apple fourcc modifiers
   - add virtio capset definitions
   - extend EXPORT_SYNC_FILE for timeline syncobjs
   - convert to devm_platform_ioremap_resource
   - refactor shmem helper page pinning
   - DP powerup/down link helpers
   - extended %p4cc in vsprintf.c to support fourcc prints
   - change vsprintf %p4cn to %p4chR, remove %p4cn
   - Add drm_file_err function
   - IN_FORMATS_ASYNC property
   - move sitronix from tiny to their own subdir

  rust:
   - add drm core infrastructure rust abstractions
     (device/driver, ioctl, file, gem)

  dma-buf:
   - adjust sg handling to not cache map on attach
   - allow setting dma-device for import
   - Add a helper to sort and deduplicate dma_fence arrays

  docs:
   - updated drm scheduler docs
   - fbdev todo update
   - fb rendering
   - actual brightness

  ttm:
   - fix delayed destroy resv object

  bridge:
   - add kunit tests
   - convert tc358775 to atomic
   - convert drivers to devm_drm_bridge_alloc
   - convert rk3066_hdmi to bridge driver

  scheduler:
   - add kunit tests

  panel:
   - refcount panels to improve lifetime handling
   - Powertip PH128800T004-ZZA01
   - NLT NL13676BC25-03F, Tianma TM070JDHG34-00
   - Himax HX8279/HX8279-D DDIC
   - Visionox G2647FB105
   - Sitronix ST7571
   - ZOTAC rotation quirk

  vkms:
   - allow attaching more displays

  i915:
   - xe3lpd display updates
   - vrr refactor
   - intel_display struct conversions
   - xe2hpd memory type identification
   - add link rate/count to i915_display_info
   - cleanup VGA plane handling
   - refactor HDCP GSC
   - fix SLPC wait boosting reference counting
   - add 20ms delay to engine reset
   - fix fence release on early probe errors

  xe:
   - SRIOV updates
   - BMG PCI ID update
   - support separate firmware for each GT
   - SVM fix, prelim SVM multi-device work
   - export fan speed
   - temp disable d3cold on BMG
   - backup VRAM in PM notifier instead of suspend/freeze
   - update xe_ttm_access_memory to use GPU for non-visible access
   - fix guc_info debugfs for VFs
   - use copy_from_user instead of __copy_from_user
   - append PCIe gen5 limitations to xe_firmware document

  amdgpu:
   - DSC cleanup
   - DC Scaling updates
   - Fused I2C-over-AUX updates
   - DMUB updates
   - Use drm_file_err in amdgpu
   - Enforce isolation updates
   - Use new dma_fence helpers
   - USERQ fixes
   - Documentation updates
   - SR-IOV updates
   - RAS updates
   - PSP 12 cleanups
   - GC 9.5 updates
   - SMU 13.x updates
   - VCN / JPEG SR-IOV updates

  amdkfd:
   - Update error messages for SDMA
   - Userptr updates
   - XNACK fixes

  radeon:
   - CIK doorbell cleanup

  nouveau:
   - add support for NVIDIA r570 GSP firmware
   - enable Hopper/Blackwell support

  nova-core:
   - fix task list
   - register definition infrastructure
   - move firmware into own rust module
   - register auxiliary device for nova-drm

  nova-drm:
   - initial driver skeleton

  msm:
   - GPU:
       - ACD (adaptive clock distribution) for X1-85
       - drop fictional address_space_size
       - improve GMU HFI response time out robustness
       - fix crash when throttling during boot
   - DPU:
       - use single CTL path for flushing on DPU 5.x+
       - improve SSPP allocation code for better sharing
       - Enabled SmartDMA on SM8150, SC8180X, SC8280XP, SM8550
       - Added SAR2130P support
       - Disabled DSC support on MSM8937, MSM8917, MSM8953, SDM660
   - DP:
       - switch to new audio helpers
       - better LTTPR handling
   - DSI:
       - Added support for SA8775P
       - Added SAR2130P support
   - HDMI:
       - Switched to use new helpers for ACR data
       - Fixed old standing issue of HPD not working in some cases

  amdxdna:
   - add dma-buf support
   - allow empty command submits

  renesas:
   - add dma-buf support
   - add zpos, alpha, blend support

  panthor:
   - fail properly for NO_MMAP bos
   - add SET_LABEL ioctl
   - debugfs BO dumping support

  imagination:
   - update DT bindings
   - support TI AM68 GPU

  hibmc:
   - improve interrupt handling and HPD support

  virtio:
   - add panic handler support

  rockchip:
   - add RK3588 support
   - add DP AUX bus panel support

  ivpu:
   - add heartbeat based hangcheck

  mediatek:
   - prepares support for MT8195/99 HDMIv2/DDCv2

  anx7625:
   - improve HPD

  tegra:
   - speed up firmware loading

* tag 'drm-next-2025-05-28' of https://gitlab.freedesktop.org/drm/kernel: (1627 commits)
  drm/nouveau/tegra: Fix error pointer vs NULL return in nvkm_device_tegra_resource_addr()
  drm/xe: Default auto_link_downgrade status to false
  drm/xe/guc: Make creation of SLPC debugfs files conditional
  drm/i915/display: Add check for alloc_ordered_workqueue() and alloc_workqueue()
  drm/i915/dp_mst: Work around Thunderbolt sink disconnect after SINK_COUNT_ESI read
  drm/i915/ptl: Use everywhere the correct DDI port clock select mask
  drm/nouveau/kms: add support for GB20x
  drm/dp: add option to disable zero sized address only transactions.
  drm/nouveau: add support for GB20x
  drm/nouveau/gsp: add hal for fifo.chan.doorbell_handle
  drm/nouveau: add support for GB10x
  drm/nouveau/gf100-: track chan progress with non-WFI semaphore release
  drm/nouveau/nv50-: separate CHANNEL_GPFIFO handling out from CHANNEL_DMA
  drm/nouveau: add helper functions for allocating pinned/cpu-mapped bos
  drm/nouveau: add support for GH100
  drm/nouveau: improve handling of 64-bit BARs
  drm/nouveau/gv100-: switch to volta semaphore methods
  drm/nouveau/gsp: support deeper page tables in COPY_SERVER_RESERVED_PDES
  drm/nouveau/gsp: init client VMMs with NV0080_CTRL_DMA_SET_PAGE_DIRECTORY
  drm/nouveau/gsp: fetch level shift and PDE from BAR2 VMM
  ...
2025-05-28 09:46:39 -07:00
Pierre-Eric Pelloux-Prayer 6c8e8a1c43 drm/amdgpu: update trace format to match gpu_scheduler_trace
Log fences using the same format for coherency.

Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Arvind Yadav <arvind.yadav@amd.com>
Signed-off-by: Philipp Stanner <phasta@kernel.org>
Link: https://lore.kernel.org/r/20250526125505.2360-11-pierre-eric.pelloux-prayer@amd.com
2025-05-28 16:16:20 +02:00
Pierre-Eric Pelloux-Prayer 4f7fa5fa41 drm: Get rid of drm_sched_job.id
Its only purpose was for trace events, but jobs can already be
uniquely identified using their fence.

The downside of using the fence is that it's only available
after 'drm_sched_job_arm' was called which is true for all trace
events that used job.id so they can safely switch to using it.

Suggested-by: Tvrtko Ursulin <tursulin@igalia.com>
Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Arvind Yadav <arvind.yadav@amd.com>
Signed-off-by: Philipp Stanner <phasta@kernel.org>
Link: https://lore.kernel.org/r/20250526125505.2360-9-pierre-eric.pelloux-prayer@amd.com
2025-05-28 16:16:15 +02:00
Pierre-Eric Pelloux-Prayer 2956554823 drm/sched: Store the drm client_id in drm_sched_fence
This will be used in a later commit to trace the drm client_id in
some of the gpu_scheduler trace events.

This requires changing all the users of drm_sched_job_init to
add an extra parameter.

The newly added drm_client_id field in the drm_sched_fence is a bit
of a duplicate of the owner one. One suggestion I received was to
merge those 2 fields - this can't be done right now as amdgpu uses
some special values (AMDGPU_FENCE_OWNER_*) that can't really be
translated into a client id. Christian is working on getting rid of
those; when it's done we should be able to squash owner/drm_client_id
together.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Signed-off-by: Philipp Stanner <phasta@kernel.org>
Link: https://lore.kernel.org/r/20250526125505.2360-3-pierre-eric.pelloux-prayer@amd.com
2025-05-28 16:15:58 +02:00
Linus Torvalds 2bd1bea5fa A set of cleanups for the generic interrupt subsystem:
- Consolidate on one set of functions for the interrupt domain code to
     get rid of pointlessly duplicated code with only marginal different
     semantics.
 
   - Update the documentation accordingly and consolidate the coding style
     of the irqdomain header.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmgzd+MTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYodTRD/0RmG5tngCbEJmTw6lPDQzRZH4OO3ja
 yRYlyBipemoRmvJRGjV4uHqN2QPrdOuoqMuyBO1aWcMdkpww5bAHcbgSFrlGM1lW
 kqtaxVMbufPiLQSGYe7OQf478CE1ykoBd5Va8whFKrtA73qEUdEMfWT0stspg780
 7BlmQOemL91p7Ytf03FbDdo8tZ5Xu9uXGAulwY9FZsFtsCNyvhl7nOv5Sk8ZQtGO
 xHRCeunjZLWR+IaK59hdakvQybXwSnjT6jODp96nlyKABEKSPShGSPFDWd3g9px7
 4911QwgnvTbcrsk6YmQEmPIOgXZzypjbnjpJr8tFpTbkVIy+6chi5cBJzXoqsUaM
 ylTwFcUQNvcP8yF447qb+nyPFKM5xsC07W0UpZMuJUDmhhPRtDm5pK0jpsif96GP
 l4aMsWe65PUmXHQqLdE89RJXAa8XQ2qspKVtNKq9DmEVgTviQ09Z9SSQIx4U0yIx
 w+YPde8kH2+O+YtMUn/MmfHhUP4MKya7j5zd8Bnv8wLBi7XGPPA5EKKh9I0dz9m+
 X94lweNXyH+Q8U9mt2cQf8VG8Yzgk0eeC0sliJIlybwRgEgRcQbVWw0VvZUA1ySa
 VBlaj3SinO90FEQ0CctT51ss2mUJ/XsGCnxpiGZXfqIZzFbyD1YfZQnXJH0H67DI
 CqdHw22I27Mu/A==
 =9nLp
 -----END PGP SIGNATURE-----

Merge tag 'irq-cleanups-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq cleanups from Thomas Gleixner:
 "A set of cleanups for the generic interrupt subsystem:

   - Consolidate on one set of functions for the interrupt domain code
     to get rid of pointlessly duplicated code with only marginal
     different semantics.

   - Update the documentation accordingly and consolidate the coding
     style of the irqdomain header"

* tag 'irq-cleanups-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits)
  irqdomain: Consolidate coding style
  irqdomain: Fix kernel-doc and add it to Documentation
  Documentation: irqdomain: Update it
  Documentation: irq-domain.rst: Simple improvements
  Documentation: irq/concepts: Minor improvements
  Documentation: irq/concepts: Add commas and reflow
  irqdomain: Improve kernel-docs of functions
  irqdomain: Make struct irq_domain_info variables const
  irqdomain: Use irq_domain_instantiate()'s return value as initializers
  irqdomain: Drop irq_linear_revmap()
  pinctrl: keembay: Switch to irq_find_mapping()
  irqchip/armada-370-xp: Switch to irq_find_mapping()
  gpu: ipu-v3: Switch to irq_find_mapping()
  gpio: idt3243x: Switch to irq_find_mapping()
  sh: Switch to irq_find_mapping()
  powerpc: Switch to irq_find_mapping()
  irqdomain: Drop irq_domain_add_*() functions
  powerpc: Switch irq_domain_add_nomap() to use fwnode
  thermal: Switch to irq_domain_create_linear()
  soc: Switch to irq_domain_create_*()
  ...
2025-05-27 08:07:32 -07:00
Philip Yang a359288ccb drm/amdgpu: seq64 memory unmap uses uninterruptible lock
To unmap and free seq64 memory when drm node close to free vm, if there
is signal accepted, then taking vm lock failed and leaking seq64 va
mapping, and then dmesg has error log "still active bo inside vm".

Change to use uninterruptible lock fix the mapping leaking and no dmesg
error log.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-22 12:03:02 -04:00
Mangesh Gadre b758667f55 drm/amdgpu: update ras support check
update ras support check for vcn 5.0.1

Signed-off-by: Mangesh Gadre <Mangesh.Gadre@amd.com>
Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-22 12:02:55 -04:00
Mangesh Gadre 25e9fb6e3a drm/amdgpu: Enable RAS for jpeg 5.0.1
Enable jpeg ras posion processing and aca error logging

Signed-off-by: Mangesh Gadre <Mangesh.Gadre@amd.com>
Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-22 12:02:52 -04:00
Mangesh Gadre 5035caf18d drm/amdgpu: Enable RAS for vcn 5.0.1
Enable vcn ras posion processing and aca error logging

Signed-off-by: Mangesh Gadre <Mangesh.Gadre@amd.com>
Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-22 12:02:46 -04:00
Tvrtko Ursulin dd64956685 drm/amdgpu: Remove duplicated "context still alive" check
When amdgpu_ctx_mgr_fini() calls amdgpu_ctx_mgr_entity_fini() it contains
the exact same "context still alive" check as it will do next. Remove the
duplicated copy.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-22 12:02:25 -04:00
Tvrtko Ursulin 16f2c942b6 drm/amdgpu: Make amdgpu_ctx_mgr_entity_fini static
Function amdgpu_ctx_mgr_entity_fini() only has a single local caller so
lets make it local.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-22 12:02:21 -04:00
Alex Deucher e90bd6d898 drm/amdgpu: Update runtime pm checks
Don't enable BACO when in passthrough. PCI resets don't work
correctly when in BACO.

Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-22 12:02:14 -04:00
Lijo Lazar 74956242a0 drm/amd/pm: Use external link order for xgmi data
xgmi_port_num interface reports external link number for port number. To
be consistent, use the external link number for reporting other XGMI
link data also.

v2: For invalid link number return -EINVAL (Kevin)

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Acked-by: Yang Wang <kevinyang.wang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-22 12:02:04 -04:00
Lijo Lazar cbbab29246 drm/amdgpu: Add sysfs nodes for partition
Add sysfs nodes to provide compute paritition specific data.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-22 12:01:36 -04:00
Stanley.Yang 1b2231de41 drm/amdgpu: Register aqua vanjaram jpeg poison irq
Register aqua vanjaram jpeg poison irq, add jpeg poison handle.

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-22 12:01:28 -04:00
Stanley.Yang 4c4a891496 drm/amdgpu: Register aqua vanjaram vcn poison irq
Register aqua vanjaram vcn poison irq, add vcn poison handle.

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-22 12:01:24 -04:00
Jesse.Zhang 0132ba7ff0 drm/amdgpu: Fix eviction fence worker race during fd close
The current cleanup order during file descriptor close can lead to
a race condition where the eviction fence worker attempts to access
a destroyed mutex from the user queue manager:

[  517.294055] DEBUG_LOCKS_WARN_ON(lock->magic != lock)
[  517.294060] WARNING: CPU: 8 PID: 2030 at kernel/locking/mutex.c:564
[  517.294094] Workqueue: events amdgpu_eviction_fence_suspend_worker [amdgpu]

The issue occurs because:
1. We destroy the user queue manager (including its mutex) first
2. Then try to destroy eviction fences which may have pending work
3. The eviction fence worker may try to access the already-destroyed mutex

Fix this by reordering the cleanup to:
1. First mark the fd as closing and destroy eviction fences,
   which flushes any pending work
2. Then safely destroy the user queue manager after we're certain
   no more fence work will be executed

The copy in amdgpu_driver_postclose_kms() needs to be removed (Christian)

Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Arvind Yadav <Arvind.Yadav@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-22 12:00:52 -04:00
Prike Liang b2c11e2708 drm/amdgpu: lock the eviction fence for wq signals it
Lock and refer to the eviction fence before the eviction fence
schedules work queue tries to signal it.

Suggested-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Arvind Yadav <Arvind.Yadav@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-22 12:00:44 -04:00
Jiri Slaby (SUSE) 493e109267 gpu: Switch to irq_domain_create_linear()
irq_domain_add_linear() is going away as being obsolete now. Switch to
the preferred irq_domain_create_linear(). That differs in the first
parameter: It takes more generic struct fwnode_handle instead of struct
device_node. Therefore, of_fwnode_handle() is added around the
parameter.

Note some of the users can likely use dev->fwnode directly instead of
indirect of_fwnode_handle(dev->of_node). But dev->fwnode is not
guaranteed to be set for all, so this has to be investigated on case to
case basis (by people who can actually test with the HW).

[ tglx: Fix up subject prefix ]

Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250319092951.37667-19-jirislaby@kernel.org
2025-05-16 21:06:09 +02:00
fanhuang 2f0268ca1c drm/amdgpu/jpeg: sriov support for jpeg_v5_0_1
initialization table handshake with mmsch

Signed-off-by: fanhuang <FangSheng.Huang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-16 13:39:14 -04:00
fanhuang 56fc141a5c drm/amdgpu/vcn: sriov support for vcn_v5_0_1
initialization table handshake with mmsch

Signed-off-by: fanhuang <FangSheng.Huang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-16 13:39:10 -04:00
Arvind Yadav 13d0724f0f drm/amdgpu: fix use-after-unlock in eviction fence destroy
The eviction fence destroy path incorrectly calls dma_fence_put() on
evf_mgr->ev_fence after releasing the ev_fence_lock. This introduces a
potential use-after-unlock or race because another thread concurrently
modifies evf_mgr->ev_fence.

Fix this by grabbing a local reference to evf_mgr->ev_fence under the
lock and using that for dma_fence_put() after waiting.

Cc: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Arvind Yadav <Arvind.Yadav@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-16 13:38:10 -04:00
Lijo Lazar cc473057bb drm/amdgpu: Allow NPS2-CPX combination for VFs
CPX partition mode is compatible with NPS2 on aquavanjaram VFs.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-16 13:38:05 -04:00
fanhuang 67cc7f9096 drm/amdgpu/mmsch: Add MMSCH v5_0 support for sriov
These structures are basically ported from MMSCH v4_0
The structures are the same as v4_0 except for the
init header

Signed-off-by: fanhuang <FangSheng.Huang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-16 13:37:57 -04:00
Lijo Lazar 3aa37922c6 drm/amdgpu: Use compatible NPS mode info
Compatible NPS modes for a partition mode are exposed through xcp_config
interface. To determine if a compute partition mode is valid, check if
the current NPS mode is part of compatible NPS modes.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-16 13:37:50 -04:00
Asad Kamal 58c397890f drm/amdgpu: Add pldm version reporting
Add pldm version reporting through sysfs node

Signed-off-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-16 13:37:38 -04:00
Amber Lin e3d0870a90 drm/amdkfd: Support chain runlists of XNACK+/XNACK-
If the MEC firmware supports chaining runlists of XNACK+/XNACK-
processes, set SQ_CONFIG1 chicken bit and SET_RESOURCES bit 28.

When the MEC/HWS supports it, KFD checks the XNACK+/XNACK- processes mix
happens or not. If it does, enter over-subscription.

Signed-off-by: Amber Lin <Amber.Lin@amd.com>
Reviewed-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-16 13:37:29 -04:00
David (Ming Qiang) Wu ee7360fc27 drm/amdgpu: read back register after written for VCN v4.0.5
On VCN v4.0.5 there is a race condition where the WPTR is not
updated after starting from idle when doorbell is used. Adding
register read-back after written at function end is to ensure
all register writes are done before they can be used.

Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/12528
Signed-off-by: David (Ming Qiang) Wu <David.Wu3@amd.com>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Tested-by: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Ruijing Dong <ruijing.dong@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 07c9db090b)
Cc: stable@vger.kernel.org
2025-05-14 11:51:31 -04:00
Shiwu Zhang 72ea78335e drm/amdgpu: add debugfs for spirom IFWI dump
Expose the debugfs file node for user space to dump the IFWI image
on spirom.

For one transaction between PSP and host, it will read out the
images on both active and inactive partitions so a buffer with two
times the size of maximum IFWI image (currently 16MByte) is needed.

v2: move the vbios gfl macros to the common header and rename the
    bo triplet struct to spirom_bo for this specific usage (Hawking)

v3: return directly the result of last command execution (Lijo)

Signed-off-by: Shiwu Zhang <shiwu.zhang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-14 11:30:15 -04:00
Prike Liang 64db767013 drm/amdgpu: fix userq resource double freed
As the userq resource was already freed at the drm_release
early phase, it should avoid freeing userq resource again
at the later kms postclose callback.

Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-14 11:30:03 -04:00
Jesse.Zhang 96a86dcb5b drm/amdgpu: Fix circular locking in userq creation
A circular locking dependency was detected between the global
`adev->userq_mutex` and per-file `userq_mgr->userq_mutex` when
creating user queues. The issue occurs because:

1. `amdgpu_userq_suspend()` and `amdgpu_userq_resume` take `adev->userq_mutex` first, then
   `userq_mgr->userq_mutex`
2. While `amdgpu_userq_create()` takes them in reverse order

This patch resolves the issue by:
1. Moving the `adev->userq_mutex` lock earlier in `amdgpu_userq_create()`
   to cover the `amdgpu_userq_ensure_ev_fence()` call
2. Releasing it after we're done with both queue creation and the
   scheduling halt check

v2: remove unused adev->userq_mutex lock (Prike)

Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Reviewed-by: Prike Liang <Prike.Liang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-14 11:29:38 -04:00
David (Ming Qiang) Wu 07c9db090b drm/amdgpu: read back register after written for VCN v4.0.5
On VCN v4.0.5 there is a race condition where the WPTR is not
updated after starting from idle when doorbell is used. Adding
register read-back after written at function end is to ensure
all register writes are done before they can be used.

Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/12528
Signed-off-by: David (Ming Qiang) Wu <David.Wu3@amd.com>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Tested-by: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Ruijing Dong <ruijing.dong@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-14 11:28:39 -04:00
Tim Huang 2d73b0845a drm/amdgpu: fix incorrect MALL size for GFX1151
On GFX1151, the reported MALL cache size reflects only
half of its actual size; this adjustment corrects the discrepancy.

Signed-off-by: Tim Huang <tim.huang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 0a5c060b59)
Cc: stable@vger.kernel.org
2025-05-13 14:16:43 -04:00
Philip Yang a0fa7873f2 drm/amdgpu: csa unmap use uninterruptible lock
After process exit to unmap csa and free GPU vm, if signal is accepted
and then waiting to take vm lock is interrupted and return, it causes
memory leaking and below warning backtrace.

Change to use uninterruptible wait lock fix the issue.

WARNING: CPU: 69 PID: 167800 at amd/amdgpu/amdgpu_kms.c:1525
 amdgpu_driver_postclose_kms+0x294/0x2a0 [amdgpu]
 Call Trace:
  <TASK>
  drm_file_free.part.0+0x1da/0x230 [drm]
  drm_close_helper.isra.0+0x65/0x70 [drm]
  drm_release+0x6a/0x120 [drm]
  amdgpu_drm_release+0x51/0x60 [amdgpu]
  __fput+0x9f/0x280
  ____fput+0xe/0x20
  task_work_run+0x67/0xa0
  do_exit+0x217/0x3c0
  do_group_exit+0x3b/0xb0
  get_signal+0x14a/0x8d0
  arch_do_signal_or_restart+0xde/0x100
  exit_to_user_mode_loop+0xc1/0x1a0
  exit_to_user_mode_prepare+0xf4/0x100
  syscall_exit_to_user_mode+0x17/0x40
  do_syscall_64+0x69/0xc0

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 7dbbfb3c17)
Cc: stable@vger.kernel.org
2025-05-13 14:16:30 -04:00
Arunpravin Paneer Selvam 553ad6fc2b drm/amdgpu/userq: Fix DEBUG_LOCKS_WARN_ON(lock->magic != lock)
Fix DEBUG_LOCKS_WARN_ON(lock->magic != lock) warning logs.

Signed-off-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-13 13:39:52 -04:00
Arunpravin Paneer Selvam bc5bab82d3 drm/amdgpu: Fix userq ttm_bo_pin and ttm_bo_unpin lockdep warnings
The ttm_bo_pin and ttm_bo_unpin warnings are resolved by moving the
doorbell bo reserve up before pin/unpin.

WARNING: CPU: 11 PID: 1818 at drivers/gpu/drm/ttm/ttm_bo.c:592 ttm_bo_pin+0x1f6/0x270 [ttm]
[  +0.000277] CPU: 11 UID: 1000 PID: 1818 Comm: Xwayland Tainted: G        W          6.12.0+ #15
[  +0.000006] Tainted: [W]=WARN
[  +0.000004] Hardware name: ASUS System Product Name/TUF GAMING B650-PLUS, BIOS 3072 12/20/2024
[  +0.000004] RIP: 0010:ttm_bo_pin+0x1f6/0x270 [ttm]
[  +0.000005] RSP: 0018:ffff88846ca879d0 EFLAGS: 00010246
[  +0.000007] RAX: 0000000000000000 RBX: ffff88810b7ca848 RCX: 0000000000000000
[  +0.000004] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[  +0.000005] RBP: ffff88846ca879e8 R08: 0000000000000000 R09: 0000000000000000
[  +0.000004] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88810b7ca848
[  +0.000004] R13: ffff88846c666250 R14: 1ffff1108d950f44 R15: ffff88846ca87aa0
[  +0.000005] FS:  00007c45ff436d00(0000) GS:ffff888409580000(0000) knlGS:0000000000000000
[  +0.000004] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000005] CR2: 00005b0c142a60e0 CR3: 000000012ce5a000 CR4: 0000000000f50ef0
[  +0.000004] PKRU: 55555554
[  +0.000004] Call Trace:
[  +0.000004]  <TASK>
[  +0.000005]  ? show_regs+0x6c/0x80
[  +0.000007]  ? __warn+0xd2/0x2d0
[  +0.000007]  ? ttm_bo_pin+0x1f6/0x270 [ttm]
[  +0.000031]  ? report_bug+0x282/0x2f0
[  +0.000012]  ? handle_bug+0x6e/0xc0
[  +0.000007]  ? exc_invalid_op+0x18/0x50
[  +0.000007]  ? asm_exc_invalid_op+0x1b/0x20
[  +0.000017]  ? ttm_bo_pin+0x1f6/0x270 [ttm]
[  +0.000014]  amdgpu_bo_pin+0x365/0x9d0 [amdgpu]
[  +0.000191]  ? __pfx_amdgpu_bo_pin+0x10/0x10 [amdgpu]
[  +0.000185]  ? drm_gem_object_lookup+0x81/0xc0
[  +0.000008]  ? kasan_save_alloc_info+0x37/0x60
[  +0.000007]  ? __kasan_kmalloc+0xc3/0xd0
[  +0.000013]  amdgpu_userqueue_get_doorbell_index+0xee/0x5f0 [amdgpu]
[  +0.000209]  amdgpu_userq_ioctl+0x6b4/0xd40 [amdgpu]
[  +0.000193]  ? __pfx_amdgpu_userq_ioctl+0x10/0x10 [amdgpu]
[  +0.000211]  ? lock_acquire+0x7c/0xc0
[  +0.000006]  ? drm_dev_enter+0x51/0x190
[  +0.000015]  drm_ioctl_kernel+0x18b/0x330
[  +0.000007]  ? __pfx_amdgpu_userq_ioctl+0x10/0x10 [amdgpu]
[  +0.000190]  ? __pfx_drm_ioctl_kernel+0x10/0x10
[  +0.000005]  ? lock_acquire+0x7c/0xc0
[  +0.000009]  ? srso_alias_return_thunk+0x5/0xfbef5
[  +0.000005]  ? __kasan_check_write+0x14/0x30
[  +0.000005]  ? srso_alias_return_thunk+0x5/0xfbef5
[  +0.000011]  drm_ioctl+0x589/0xd00
[  +0.000005]  ? srso_alias_return_thunk+0x5/0xfbef5
[  +0.000006]  ? __pfx_amdgpu_userq_ioctl+0x10/0x10 [amdgpu]
[  +0.000194]  ? __pfx_drm_ioctl+0x10/0x10
[  +0.000006]  ? __pm_runtime_resume+0x80/0x110
[  +0.000021]  ? srso_alias_return_thunk+0x5/0xfbef5
[  +0.000005]  ? trace_hardirqs_on+0x53/0x60
[  +0.000005]  ? srso_alias_return_thunk+0x5/0xfbef5
[  +0.000005]  ? _raw_spin_unlock_irqrestore+0x51/0x80
[  +0.000013]  amdgpu_drm_ioctl+0xd2/0x1c0 [amdgpu]
[  +0.000185]  __x64_sys_ioctl+0x13a/0x1c0
[  +0.000010]  x64_sys_call+0x11ad/0x25f0
[  +0.000007]  do_syscall_64+0x91/0x180
[  +0.000007]  ? srso_alias_return_thunk+0x5/0xfbef5
[  +0.000005]  ? irqentry_exit+0x77/0xb0
[  +0.000005]  ? srso_alias_return_thunk+0x5/0xfbef5
[  +0.000005]  ? exc_page_fault+0x93/0x150
[  +0.000009]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  +0.000005] RIP: 0033:0x7c45ff924ded
[  +0.000005] RSP: 002b:00007ffff7167810 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  +0.000008] RAX: ffffffffffffffda RBX: 00000000c0486456 RCX: 00007c45ff924ded
[  +0.000004] RDX: 00007ffff7167870 RSI: 00000000c0486456 RDI: 000000000000000b
[  +0.000004] RBP: 00007ffff7167860 R08: ffff800100000000 R09: 0000000000010000
[  +0.000005] R10: 00007ffff7167950 R11: 0000000000000246 R12: 00005b0c2a51bc48
[  +0.000004] R13: 000000000000000b R14: 0000000000000000 R15: 00007ffff7167950
[  +0.000022]  </TASK>
[  +0.000004] irq event stamp: 80693
[  +0.000004] hardirqs last  enabled at (80699): [<ffffffff86a693a9>] __up_console_sem+0x79/0xa0
[  +0.000005] hardirqs last disabled at (80704): [<ffffffff86a6938e>] __up_console_sem+0x5e/0xa0
[  +0.000005] softirqs last  enabled at (80390): [<ffffffff8687377e>] __irq_exit_rcu+0x17e/0x1d0
[  +0.000005] softirqs last disabled at (80385): [<ffffffff8687377e>] __irq_exit_rcu+0x17e/0x1d0
[  +0.000006] ---[ end trace 0000000000000000 ]---
------------------------------------------------------------------------------------------------------

[  +0.000006] WARNING: CPU: 10 PID: 1818 at drivers/gpu/drm/ttm/ttm_bo.c:611 ttm_bo_unpin+0x21f/0x2c0 [ttm]
[  +0.000280] CPU: 10 UID: 1000 PID: 1818 Comm: Xwayland Tainted: G        W          6.12.0+ #15
[  +0.000006] Tainted: [W]=WARN
[  +0.000004] Hardware name: ASUS System Product Name/TUF GAMING B650-PLUS, BIOS 3072 12/20/2024
[  +0.000004] RIP: 0010:ttm_bo_unpin+0x21f/0x2c0 [ttm]
[  +0.000005] RSP: 0018:ffff88846ca87888 EFLAGS: 00010246
[  +0.000007] RAX: 0000000000000000 RBX: ffff88810b7ca848 RCX: 0000000000000000
[  +0.000005] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[  +0.000004] RBP: ffff88846ca878a0 R08: 0000000000000000 R09: 0000000000000000
[  +0.000004] R10: 0000000000000000 R11: 0000000000000000 R12: ffff888164e90050
[  +0.000005] R13: ffff88846c666200 R14: 0000000000000001 R15: ffff888168402d28
[  +0.000004] FS:  00007c45ff436d00(0000) GS:ffff888409500000(0000) knlGS:0000000000000000
[  +0.000005] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000004] CR2: 00007c45f7373b20 CR3: 000000012ce5a000 CR4: 0000000000f50ef0
[  +0.000005] PKRU: 55555554
[  +0.000004] Call Trace:
[  +0.000004]  <TASK>
[  +0.000005]  ? show_regs+0x6c/0x80
[  +0.000008]  ? __warn+0xd2/0x2d0
[  +0.000007]  ? ttm_bo_unpin+0x21f/0x2c0 [ttm]
[  +0.000012]  ? report_bug+0x282/0x2f0
[  +0.000013]  ? handle_bug+0x6e/0xc0
[  +0.000006]  ? exc_invalid_op+0x18/0x50
[  +0.000008]  ? asm_exc_invalid_op+0x1b/0x20
[  +0.000017]  ? ttm_bo_unpin+0x21f/0x2c0 [ttm]
[  +0.000011]  ? ttm_bo_unpin+0x217/0x2c0 [ttm]
[  +0.000011]  amdgpu_bo_unpin+0x45/0x250 [amdgpu]
[  +0.000216]  amdgpu_userq_ioctl+0x2c3/0xd40 [amdgpu]
[  +0.000226]  ? drm_dev_exit+0x2d/0x60
[  +0.000010]  ? __pfx_amdgpu_userq_ioctl+0x10/0x10 [amdgpu]
[  +0.000201]  ? srso_alias_return_thunk+0x5/0xfbef5
[  +0.000005]  ? lock_acquire+0x7c/0xc0
[  +0.000006]  ? drm_dev_enter+0x51/0x190
[  +0.000015]  drm_ioctl_kernel+0x18b/0x330
[  +0.000007]  ? __pfx_amdgpu_userq_ioctl+0x10/0x10 [amdgpu]
[  +0.000188]  ? __pfx_drm_ioctl_kernel+0x10/0x10
[  +0.000006]  ? lock_acquire+0x7c/0xc0
[  +0.000008]  ? srso_alias_return_thunk+0x5/0xfbef5
[  +0.000005]  ? __kasan_check_write+0x14/0x30
[  +0.000006]  ? srso_alias_return_thunk+0x5/0xfbef5
[  +0.000010]  drm_ioctl+0x589/0xd00
[  +0.000005]  ? srso_alias_return_thunk+0x5/0xfbef5
[  +0.000006]  ? __pfx_amdgpu_userq_ioctl+0x10/0x10 [amdgpu]
[  +0.000211]  ? __pfx_drm_ioctl+0x10/0x10
[  +0.000006]  ? __pm_runtime_resume+0x80/0x110
[  +0.000020]  ? srso_alias_return_thunk+0x5/0xfbef5
[  +0.000006]  ? trace_hardirqs_on+0x53/0x60
[  +0.000005]  ? srso_alias_return_thunk+0x5/0xfbef5
[  +0.000005]  ? _raw_spin_unlock_irqrestore+0x51/0x80
[  +0.000013]  amdgpu_drm_ioctl+0xd2/0x1c0 [amdgpu]
[  +0.000186]  __x64_sys_ioctl+0x13a/0x1c0
[  +0.000010]  x64_sys_call+0x11ad/0x25f0
[  +0.000007]  do_syscall_64+0x91/0x180
[  +0.000007]  ? srso_alias_return_thunk+0x5/0xfbef5
[  +0.000005]  ? do_syscall_64+0x9d/0x180
[  +0.000007]  ? srso_alias_return_thunk+0x5/0xfbef5
[  +0.000010]  ? __pfx___rseq_handle_notify_resume+0x10/0x10
[  +0.000005]  ? __pfx_blkcg_maybe_throttle_current+0x10/0x10
[  +0.000013]  ? srso_alias_return_thunk+0x5/0xfbef5
[  +0.000009]  ? srso_alias_return_thunk+0x5/0xfbef5
[  +0.000008]  ? srso_alias_return_thunk+0x5/0xfbef5
[  +0.000005]  ? syscall_exit_to_user_mode+0x95/0x260
[  +0.000008]  ? srso_alias_return_thunk+0x5/0xfbef5
[  +0.000005]  ? do_syscall_64+0x9d/0x180
[  +0.000007]  ? srso_alias_return_thunk+0x5/0xfbef5
[  +0.000005]  ? do_syscall_64+0x9d/0x180
[  +0.000011]  ? srso_alias_return_thunk+0x5/0xfbef5
[  +0.000010]  ? srso_alias_return_thunk+0x5/0xfbef5
[  +0.000009]  ? srso_alias_return_thunk+0x5/0xfbef5
[  +0.000008]  ? srso_alias_return_thunk+0x5/0xfbef5
[  +0.000005]  ? irqentry_exit_to_user_mode+0x8b/0x260
[  +0.000007]  ? srso_alias_return_thunk+0x5/0xfbef5
[  +0.000006]  ? irqentry_exit+0x77/0xb0
[  +0.000004]  ? srso_alias_return_thunk+0x5/0xfbef5
[  +0.000005]  ? exc_page_fault+0x93/0x150
[  +0.000010]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  +0.000005] RIP: 0033:0x7c45ff924ded
[  +0.000005] RSP: 002b:00007ffff7168790 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  +0.000008] RAX: ffffffffffffffda RBX: 00000000c0486456 RCX: 00007c45ff924ded
[  +0.000005] RDX: 00007ffff71687f0 RSI: 00000000c0486456 RDI: 000000000000000b
[  +0.000004] RBP: 00007ffff71687e0 R08: 00005b0c2a49b010 R09: 0000000000000007
[  +0.000004] R10: 00005b0c2a4d7140 R11: 0000000000000246 R12: 000000000000000b
[  +0.000004] R13: 00007c45ff19e5cc R14: 00005b0c2a51c538 R15: 00005b0c2a51bbd8
[  +0.000022]  </TASK>
[  +0.000005] irq event stamp: 87419
[  +0.000004] hardirqs last  enabled at (87425): [<ffffffff86a693a9>] __up_console_sem+0x79/0xa0
[  +0.000005] hardirqs last disabled at (87430): [<ffffffff86a6938e>] __up_console_sem+0x5e/0xa0
[  +0.000005] softirqs last  enabled at (87058): [<ffffffff8687377e>] __irq_exit_rcu+0x17e/0x1d0
[  +0.000006] softirqs last disabled at (87053): [<ffffffff8687377e>] __irq_exit_rcu+0x17e/0x1d0
[  +0.000005] ---[ end trace 0000000000000000 ]---

Signed-off-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-13 13:39:46 -04:00
Arunpravin Paneer Selvam 218caca4ba drm/amdgpu/userq: Fix lock contention in userq fence
Fix lockdep warnings.

[  +0.000637] ================================
[  +0.000004] WARNING: inconsistent lock state
[  +0.000004] 6.12.0+ #18 Tainted: G        W  OE
[  +0.000004] --------------------------------
[  +0.000004] inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
[  +0.000004] Xwayland/1952 [HC0[0]:SC0[0]:HE1:SE1] takes:
[  +0.000005] ffff8884636f4740 (&fence_drv->fence_list_lock){?...}-{2:2}, at: amdgpu_userq_fence_driver_destroy+0xb8/0x540 [amdgpu]
[  +0.000208] {IN-HARDIRQ-W} state was registered at:
[  +0.000004]   lock_acquire.part.0+0x116/0x360
[  +0.000005]   lock_acquire+0x7c/0xc0
[  +0.000005]   _raw_spin_lock+0x2f/0x60
[  +0.000005]   amdgpu_userq_fence_driver_process+0x75/0x400 [amdgpu]
[  +0.000185]   gfx_v12_0_eop_irq+0x29f/0x420 [amdgpu]
[  +0.000210]   amdgpu_irq_dispatch+0x2a4/0x7b0 [amdgpu]
[  +0.000191]   amdgpu_ih_process+0x1e1/0x3d0 [amdgpu]
[  +0.000185]   amdgpu_irq_handler+0x28/0xc0 [amdgpu]
[  +0.000183]   __handle_irq_event_percpu+0x1bb/0x590
[  +0.000005]   handle_irq_event+0xab/0x1d0
[  +0.000005]   handle_edge_irq+0x1fd/0xc10
[  +0.000005]   __common_interrupt+0x83/0x190
[  +0.000004]   common_interrupt+0xb1/0xe0
[  +0.000005]   asm_common_interrupt+0x27/0x40
[  +0.000004]   cpuidle_enter_state+0x2ba/0x530
[  +0.000005]   cpuidle_enter+0x4f/0xb0
[  +0.000006]   call_cpuidle+0x46/0xd0
[  +0.000005]   do_idle+0x367/0x430
[  +0.000004]   cpu_startup_entry+0x58/0x70
[  +0.000005]   start_secondary+0x224/0x2b0
[  +0.000005]   common_startup_64+0x13e/0x141
[  +0.000005] irq event stamp: 88271
[  +0.000004] hardirqs last  enabled at (88271): [<ffffffffad9ca7a1>] _raw_spin_unlock_irqrestore+0x51/0x80
[  +0.000005] hardirqs last disabled at (88270): [<ffffffffad9ca424>] _raw_spin_lock_irqsave+0x74/0x80
[  +0.000005] softirqs last  enabled at (87858): [<ffffffffaa67377e>] __irq_exit_rcu+0x17e/0x1d0
[  +0.000005] softirqs last disabled at (87849): [<ffffffffaa67377e>] __irq_exit_rcu+0x17e/0x1d0
[  +0.000005]
              other info that might help us debug this:
[  +0.000004]  Possible unsafe locking scenario:

[  +0.000003]        CPU0
[  +0.000004]        ----
[  +0.000003]   lock(&fence_drv->fence_list_lock);

v2:
  Drop fence_list_flags and use xa_lock_irqsave() flags parameter (Christian)
  Fix merge conflicts.

Signed-off-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-13 13:39:20 -04:00
Jesse.Zhang 3b63602614 drm/amdgpu: Add GFX 9.5.0 support for per-queue/pipe reset
This patch enables per-queue and per-pipe reset functionality for
GFX IP v9.5.0 when using MEC firmware version 21 (0x15) or later.

This change:
1. Refactors the pipe reset support check in gfx_v9_4_3_pipe_reset_support()
   to use the compute_supported_reset flags instead of hardcoding
   version checks.
2. Adds support for GFX9.5.0 (IP 9.5.0) with MEC firmware version >= 21
   to enable per-queue and per-pipe reset capabilities.

v2: Replaced mec version check with !!(adev->gfx.compute_supported_reset & AMDGPU_RESET_TYPE_PER_PIPE) (Lijo)

Signed-off-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Jesse Zhang <jesse.zhang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-13 09:37:32 -04:00
Tao Zhou 5d6fddac55 drm/amdgpu: set vram type for GC 9.5.0
Set vram type so we can take different actions according to the type.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-13 09:37:28 -04:00
Tao Zhou dc111f8fb1 drm/amdgpu: set flip bits for RAS bad pages
Make the code more general, user doesn't need to pay attention to the
detail of flip bits setting.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-13 09:37:19 -04:00
Ce Sun 533aa8bdbe drm/amdgpu: Modify the count method of defer error
The number of newly added de counts and the number of
newly added error addresses remain consistent

Signed-off-by: Ce Sun <cesun102@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-13 09:35:12 -04:00
Lijo Lazar 937467b7d5 drm/amdgpu: Log RAS errors during load
During driver load, RAS event manager may not be initialized. This will
cause any ATHUB event during driver load to be skipped in dmesg log. Log
the error in dmesg log for easier diagnosis.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-13 09:34:02 -04:00
Jesse.Zhang 648a0dc0d7 drm/amdgpu: Fix user queue deadlock by reordering mutex locking
This resolves a deadlock between user queue management and GPU reset
paths by enforcing consistent lock ordering.

The deadlock occurred when:

1. Process exit path (amdgpu_userq_mgr_fini) would:
   - Take uqm->userq_mutex
   - Then try to take adev->userq_mutex for list operations

2. GPU reset path (amdgpu_userq_pre_reset) would:
   - Take adev->userq_mutex first (for list traversal)
   - Then take uqm->userq_mutex

The solution establishes a strict top-down locking order:
1. Always take adev->userq_mutex before any uqm->userq_mutex
2. Maintain this order consistently across all code paths

Changes made:
- Reordered locking in amdgpu_userq_mgr_fini() to take device lock first
- Kept existing proper order in amdgpu_userq_pre_reset()
- Simplified the fini flow by removing redundant operations

This prevents circular dependencies while maintaining thread safety
during both normal operation and GPU reset scenarios.

Fixes: 4ce60dbada ("drm/amdgpu: store userq_managers in a list in adev")
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Arvind Yadav <Arvind.Yadav@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-13 09:32:25 -04:00
Ce Sun f71509fdd0 drm/amdgpu: Fix the kernel panic caused by RAS records exceed threshold
kernel panic caused by RAS records exceeding the threshold when load driver
specifying RMA(bad_page_threshold=128)

1.Fix the warnings caused by disabling the interrupt source
before it was enabled
2.Fix kernel panic when xcp sysfs is not initialized,null pointer
appears during fini
3.Fix the memory leak caused by the device's early exit due to rma

The first reason:
[ 2744.246650] ------------[ cut here ]------------
[ 2744.246651] WARNING: CPU: 0 PID: 289 at /tmp/amd.BkfTLqYV/amd/amdgpu/amdgpu_irq.c:635 amdgpu_irq_put.cold+0x42/0x6e [amdgpu]
[ 2744.247108] Modules linked in: amdgpu(OE+) amddrm_ttm_helper(OE) amdttm(OE) amdxcp(OE) amddrm_buddy(OE) amddrm_exec(OE) amd_sched(OE) amdkcl(OE) xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo nft_counter xt_addrtype nft_compat nf_tables nfnetlink br_netfilter bridge stp llc overlay binfmt_misc intel_rapl_msr intel_rapl_common i10nm_edac nfit x86_pkg_temp_thermal intel_powerclamp coretemp ipmi_ssif kvm_intel nls_iso8859_1 kvm rapl isst_if_mbox_pci isst_if_mmio pmt_telemetry pmt_crashlog isst_if_common pmt_class mei_me mei acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad mac_hid sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua msr ramoops reed_solomon pstore_blk pstore_zone efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath
[ 2744.247167]  linear mlx5_ib ib_uverbs ib_core ast i2c_algo_bit drm_vram_helper drm_ttm_helper ttm drm_kms_helper crct10dif_pclmul syscopyarea crc32_pclmul ghash_clmulni_intel mlx5_core sysfillrect sysimgblt aesni_intel mlxfw fb_sys_fops psample cec crypto_simd cryptd rc_core i2c_i801 nvme xhci_pci tls intel_pmt drm pci_hyperv_intf nvme_core i2c_smbus i2c_ismt xhci_pci_renesas wmi pinctrl_emmitsburg
[ 2744.247194] CPU: 0 PID: 289 Comm: kworker/0:1 Tainted: G           OE     5.15.0-70-generic #77-Ubuntu
[ 2744.247197] Hardware name: Microsoft C278A/C278A, BIOS C2789.5.BS.1C23.AG.2 11/21/2024
[ 2744.247198] Workqueue: events work_for_cpu_fn
[ 2744.247206] RIP: 0010:amdgpu_irq_put.cold+0x42/0x6e [amdgpu]
[ 2744.247634] Code: 79 7f ff 44 89 ee 48 c7 c7 4d 5a 42 c2 89 55 d4 e8 90 09 bc bf 8b 55 d4 4c 89 e6 4c 89 ff e8 3c 76 7f ff 8b 55 d4 84 c0 75 07 <0f> 0b e9 95 79 7f ff 49 03 5c 24 08 f0 ff 0b 75 13 4c 89 e6 4c 89
[ 2744.247636] RSP: 0018:ffa0000019e27cb0 EFLAGS: 00010246
[ 2744.247639] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ff11000150fa87c0
[ 2744.247641] RDX: 0000000000000000 RSI: ffffffffc2222430 RDI: ff1100019f200000
[ 2744.247642] RBP: ffa0000019e27ce0 R08: 0000000000000003 R09: ffffffffffe41a08
[ 2744.247643] R10: 0000000000ffff0a R11: 0000000000000001 R12: ff1100019f22ce60
[ 2744.247644] R13: 0000000000000000 R14: 00000000ffffffea R15: ff1100019f200000
[ 2744.247645] FS:  0000000000000000(0000) GS:ff11007e7e400000(0000) knlGS:0000000000000000
[ 2744.247647] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2744.247649] CR2: 00007f3d2002819c CR3: 0000000006810003 CR4: 0000000000771ef0
[ 2744.247650] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2744.247651] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 2744.247652] PKRU: 55555554
[ 2744.247653] Call Trace:
[ 2744.247654]  <TASK>
[ 2744.247656]  sdma_v4_4_2_hw_fini+0x7a/0xc0 [amdgpu]
[ 2744.247997]  ? vcn_v4_0_3_hw_fini+0x5f/0xa0 [amdgpu]
[ 2744.248336]  amdgpu_ip_block_hw_fini+0x31/0x61 [amdgpu]
[ 2744.248776]  amdgpu_device_fini_hw+0x3bb/0x47b [amdgpu]
[ 2744.249197]  ? blocking_notifier_chain_unregister+0x56/0xb0
[ 2744.249202]  amdgpu_driver_unload_kms+0x51/0x60 [amdgpu]
[ 2744.249482]  amdgpu_driver_load_kms.cold+0x18/0x2e [amdgpu]
[ 2744.249913]  amdgpu_pci_probe+0x23e/0x590 [amdgpu]
[ 2744.250187]  local_pci_probe+0x48/0x90
[ 2744.250191]  work_for_cpu_fn+0x17/0x30
[ 2744.250196]  process_one_work+0x228/0x3d0
[ 2744.250198]  worker_thread+0x223/0x420
[ 2744.250200]  ? process_one_work+0x3d0/0x3d0
[ 2744.250201]  kthread+0x127/0x150
[ 2744.250204]  ? set_kthread_struct+0x50/0x50
[ 2744.250207]  ret_from_fork+0x1f/0x30
[ 2744.250212]  </TASK>
[ 2744.250213] ---[ end trace 488c997a88508bc3 ]---

The second reason:
[ 5139.303446] Memory manager not clean during takedown.
[ 5139.303509] WARNING: CPU: 145 PID: 117699 at drivers/gpu/drm/drm_mm.c:998 drm_mm_takedown+0x27/0x30 [drm]
[ 5139.303542] Modules linked in: amdgpu(OE+) amddrm_ttm_helper(OE) amdttm(OE) amdxcp(OE) amddrm_buddy(OE) amddrm_exec(OE) amd_sched(OE) amdkcl(OE) xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo nft_counter xt_addrtype nft_compat nf_tables nfnetlink br_netfilter bridge stp llc overlay intel_rapl_msr intel_rapl_common i10nm_edac nfit x86_pkg_temp_thermal intel_powerclamp coretemp ipmi_ssif kvm_intel binfmt_misc kvm nls_iso8859_1 rapl isst_if_mbox_pci pmt_telemetry pmt_crashlog isst_if_mmio pmt_class isst_if_common mei_me mei acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_pad acpi_power_meter mac_hid sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua msr ramoops reed_solomon pstore_blk pstore_zone efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath
[ 5139.303572]  linear mlx5_ib ib_uverbs ib_core crct10dif_pclmul ast crc32_pclmul i2c_algo_bit ghash_clmulni_intel aesni_intel crypto_simd drm_vram_helper cryptd drm_ttm_helper mlx5_core ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec rc_core mlxfw psample intel_pmt nvme xhci_pci drm tls i2c_i801 pci_hyperv_intf nvme_core i2c_smbus i2c_ismt xhci_pci_renesas wmi pinctrl_emmitsburg [last unloaded: amdkcl]
[ 5139.303588] CPU: 145 PID: 117699 Comm: modprobe Tainted: G     U     OE     5.15.0-70-generic #77-Ubuntu
[ 5139.303590] Hardware name: Microsoft C278A/C278A, BIOS C2789.5.BS.1C23.AG.2 11/21/2024
[ 5139.303591] RIP: 0010:drm_mm_takedown+0x27/0x30 [drm]
[ 5139.303605] Code: cc 66 90 0f 1f 44 00 00 48 8b 47 38 48 83 c7 38 48 39 f8 75 05 c3 cc cc cc cc 55 48 c7 c7 18 d0 10 c0 48 89 e5 e8 5a bc c3 c1 <0f> 0b 5d c3 cc cc cc cc 90 0f 1f 44 00 00 55 b9 15 00 00 00 48 89
[ 5139.303607] RSP: 0018:ffa00000325c3940 EFLAGS: 00010286
[ 5139.303608] RAX: 0000000000000000 RBX: ff1100012f5cecb0 RCX: 0000000000000027
[ 5139.303609] RDX: ff11007e7fa60588 RSI: 0000000000000001 RDI: ff11007e7fa60580
[ 5139.303610] RBP: ffa00000325c3940 R08: 0000000000000003 R09: fffffffff00c2b78
[ 5139.303610] R10: 000000000000002b R11: 0000000000000001 R12: ff1100012f5cec00
[ 5139.303611] R13: ff1100012138f068 R14: 0000000000000000 R15: ff1100012f5cec90
[ 5139.303611] FS:  00007f42ffca0000(0000) GS:ff11007e7fa40000(0000) knlGS:0000000000000000
[ 5139.303612] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5139.303613] CR2: 00007f23d945ab68 CR3: 00000001212ce005 CR4: 0000000000771ee0
[ 5139.303614] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 5139.303615] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 5139.303615] PKRU: 55555554
[ 5139.303616] Call Trace:
[ 5139.303617]  <TASK>
[ 5139.303619]  amdttm_range_man_fini_nocheck+0xfe/0x1c0 [amdttm]
[ 5139.303625]  amdgpu_ttm_fini+0x2ed/0x390 [amdgpu]
[ 5139.303800]  amdgpu_bo_fini+0x27/0xc0 [amdgpu]
[ 5139.303959]  gmc_v9_0_sw_fini+0x63/0x90 [amdgpu]
[ 5139.304144]  amdgpu_device_fini_sw+0x125/0x6a0 [amdgpu]
[ 5139.304302]  amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
[ 5139.304455]  devm_drm_dev_init_release+0x4a/0x80 [drm]
[ 5139.304472]  devm_action_release+0x12/0x20
[ 5139.304476]  release_nodes+0x3d/0xb0
[ 5139.304478]  devres_release_all+0x9b/0xd0
[ 5139.304480]  really_probe+0x11d/0x420
[ 5139.304483]  __driver_probe_device+0x119/0x190
[ 5139.304485]  driver_probe_device+0x23/0xc0
[ 5139.304487]  __driver_attach+0xf7/0x1f0
[ 5139.304489]  ? __device_attach_driver+0x140/0x140
[ 5139.304491]  bus_for_each_dev+0x7c/0xd0
[ 5139.304493]  driver_attach+0x1e/0x30
[ 5139.304494]  bus_add_driver+0x148/0x220
[ 5139.304496]  driver_register+0x95/0x100
[ 5139.304498]  __pci_register_driver+0x68/0x70
[ 5139.304500]  amdgpu_init+0xbc/0x1000 [amdgpu]
[ 5139.304655]  ? 0xffffffffc0b8f000
[ 5139.304657]  do_one_initcall+0x46/0x1e0
[ 5139.304659]  ? kmem_cache_alloc_trace+0x19e/0x2e0
[ 5139.304663]  do_init_module+0x52/0x260
[ 5139.304665]  load_module+0xb2b/0xbc0
[ 5139.304667]  __do_sys_finit_module+0xbf/0x120
[ 5139.304669]  __x64_sys_finit_module+0x18/0x20
[ 5139.304670]  do_syscall_64+0x59/0xc0
[ 5139.304673]  ? exit_to_user_mode_prepare+0x37/0xb0
[ 5139.304676]  ? syscall_exit_to_user_mode+0x27/0x50
[ 5139.304678]  ? __x64_sys_mmap+0x33/0x50
[ 5139.304680]  ? do_syscall_64+0x69/0xc0
[ 5139.304681]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
[ 5139.304684] RIP: 0033:0x7f42ffdbf88d
[ 5139.304686] Code: 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 73 b5 0f 00 f7 d8 64 89 01 48
[ 5139.304687] RSP: 002b:00007ffcb7427158 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[ 5139.304688] RAX: ffffffffffffffda RBX: 000055ce8b8f3150 RCX: 00007f42ffdbf88d
[ 5139.304689] RDX: 0000000000000000 RSI: 000055ce8b8f9a70 RDI: 000000000000000a
[ 5139.304690] RBP: 0000000000040000 R08: 0000000000000000 R09: 0000000000000011
[ 5139.304690] R10: 000000000000000a R11: 0000000000000246 R12: 000055ce8b8f9a70
[ 5139.304691] R13: 000055ce8b8f2ec0 R14: 000055ce8b8f2ab0 R15: 000055ce8b8f9aa0
[ 5139.304692]  </TASK>
[ 5139.304693] ---[ end trace 8536b052f7883003 ]---

Signed-off-by: Ce Sun <cesun102@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-13 09:32:11 -04:00
Tao Zhou b7674ae75b drm/amdgu: get RAS retire flip bits for new type of HBM
Get RAS retire flip bits for HBM with different types in various NPS modes.
Also set flip row bit and MCA R13 bit in PA in different NPS modes.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-13 09:32:08 -04:00
Tao Zhou 9b5b71895b drm/amdgpu: implement get_retire_flip_bits for UMC v12
The RAS bad page retire flip bits can be set per vram type,
vram vendor and NPS mode.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-13 09:32:05 -04:00
Tao Zhou 699bff37a5 drm/amdgpu: add get_retire_flip_bits for UMC
Add the general interface to get flip bits for RAS bad page retirement.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-13 09:32:01 -04:00
Tao Zhou 4ce5b99128 drm/amdgpu: adjust high bits for RAS retired page
Per UMC address conversion algorithm, the high row bits of UMC MCA
address are changed when they're converted into normalized address
on specific ASICs.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-13 09:31:46 -04:00
Tao Zhou 1df57411a6 drm/amd: add definition for new memory type
Support new version of HBM.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-13 09:31:40 -04:00
ganglxie f5db59067c Refine RAS bad page records counting and parsing in eeprom V3
there is only MCA records in V3, no need to care about PA records.
recalculate the value of ras_num_bad_pages when parsing failed and
go on with the left records instead of quit.

Signed-off-by: ganglxie <ganglxie@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-13 09:31:33 -04:00
Tim Huang 0a5c060b59 drm/amdgpu: fix incorrect MALL size for GFX1151
On GFX1151, the reported MALL cache size reflects only
half of its actual size; this adjustment corrects the discrepancy.

Signed-off-by: Tim Huang <tim.huang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-13 09:23:23 -04:00
Arvind Yadav 010503a3cb drm/amdgpu: Fix amdgpu_userq_wait_ioctl() warn missing error code 'r'
To resolve the warning regarding the missing error code 'r' in
amdgpu_userq_wait_ioctl(), assign the value 'r = -EINVAL'.

Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202505080458.rnV8YfiY-lkp@intel.com/
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Arvind Yadav <Arvind.Yadav@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-13 09:21:56 -04:00
Arvind Yadav f10eb185ad drm/amdgpu: Fix NULL dereference in amdgpu_userq_restore_worker
Switch cancel_delayed_work() to cancel_delayed_work_sync() to ensure
the delayed work has finished executing before proceeding with
resource cleanup. This prevents a potential use-after-free or
NULL dereference if the resume_work is still running during finalization.

BUG: kernel NULL pointer dereference, address: 0000000000000140
[  +0.000050] #PF: supervisor read access in kernel mode
[  +0.000019] #PF: error_code(0x0000) - not-present page
[  +0.000021] PGD 0 P4D 0
[  +0.000015] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
[  +0.000021] CPU: 17 UID: 0 PID: 196299 Comm: kworker/17:0 Tainted: G     U             6.14.0-org-staging #1
[  +0.000032] Tainted: [U]=USER
[  +0.000015] Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS ELITE/X570 AORUS ELITE, BIOS F39 03/22/2024
[  +0.000029] Workqueue: events amdgpu_userq_restore_worker [amdgpu]
[  +0.000426] RIP: 0010:drm_exec_lock_obj+0x32/0x210 [drm_exec]
[  +0.000025] Code: e5 41 57 41 56 41 55 49 89 f5 41 54 49 89 fc 48 83 ec 08 4c 8b 77 30 4d 85 f6 0f 85 c0 00 00 00 4c 8d 7f 08 48 39 77 38 74 54 <49> 8b bd f8 00 00 00 4c 89 fe 41 f6 04 24 01 75 3c e8 08 50 bc e0
[  +0.000046] RSP: 0018:ffffab1b04da3ce8 EFLAGS: 00010297
[  +0.000020] RAX: 0000000000000001 RBX: ffff930cc60e4bc0 RCX: 0000000000000000
[  +0.000025] RDX: 0000000000000004 RSI: 0000000000000048 RDI: ffffab1b04da3d88
[  +0.000028] RBP: ffffab1b04da3d10 R08: ffff930cc60e4000 R09: 0000000000000000
[  +0.000022] R10: ffffab1b04da3d18 R11: 0000000000000001 R12: ffffab1b04da3d88
[  +0.000023] R13: 0000000000000048 R14: 0000000000000000 R15: ffffab1b04da3d90
[  +0.000023] FS:  0000000000000000(0000) GS:ffff9313dea80000(0000) knlGS:0000000000000000
[  +0.000024] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000021] CR2: 0000000000000140 CR3: 000000018351a000 CR4: 0000000000350ef0
[  +0.000025] Call Trace:
[  +0.000018]  <TASK>
[  +0.000015]  ? show_regs+0x69/0x80
[  +0.000022]  ? __die+0x25/0x70
[  +0.000019]  ? page_fault_oops+0x15d/0x510
[  +0.000024]  ? do_user_addr_fault+0x312/0x690
[  +0.000024]  ? sched_clock_cpu+0x10/0x1a0
[  +0.000028]  ? exc_page_fault+0x78/0x1b0
[  +0.000025]  ? asm_exc_page_fault+0x27/0x30
[  +0.000024]  ? drm_exec_lock_obj+0x32/0x210 [drm_exec]
[  +0.000024]  drm_exec_prepare_obj+0x21/0x60 [drm_exec]
[  +0.000021]  amdgpu_vm_lock_pd+0x22/0x30 [amdgpu]
[  +0.000266]  amdgpu_userq_validate_bos+0x6c/0x320 [amdgpu]
[  +0.000333]  amdgpu_userq_restore_worker+0x4a/0x120 [amdgpu]
[  +0.000316]  process_one_work+0x189/0x3c0
[  +0.000021]  worker_thread+0x2a4/0x3b0
[  +0.000022]  kthread+0x109/0x220
[  +0.000018]  ? __pfx_worker_thread+0x10/0x10
[  +0.000779]  ? _raw_spin_unlock_irq+0x1f/0x40
[  +0.000560]  ? __pfx_kthread+0x10/0x10
[  +0.000543]  ret_from_fork+0x3c/0x60
[  +0.000507]  ? __pfx_kthread+0x10/0x10
[  +0.000515]  ret_from_fork_asm+0x1a/0x30
[  +0.000515]  </TASK>

v2: Replace cancel_delayed_work() to cancel_delayed_work_sync()
    in amdgpu_userq_destroy() and amdgpu_userq_evict().

Cc: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Arvind Yadav <arvind.yadav@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-13 09:21:39 -04:00
Philip Yang 7dbbfb3c17 drm/amdgpu: csa unmap use uninterruptible lock
After process exit to unmap csa and free GPU vm, if signal is accepted
and then waiting to take vm lock is interrupted and return, it causes
memory leaking and below warning backtrace.

Change to use uninterruptible wait lock fix the issue.

WARNING: CPU: 69 PID: 167800 at amd/amdgpu/amdgpu_kms.c:1525
 amdgpu_driver_postclose_kms+0x294/0x2a0 [amdgpu]
 Call Trace:
  <TASK>
  drm_file_free.part.0+0x1da/0x230 [drm]
  drm_close_helper.isra.0+0x65/0x70 [drm]
  drm_release+0x6a/0x120 [drm]
  amdgpu_drm_release+0x51/0x60 [amdgpu]
  __fput+0x9f/0x280
  ____fput+0xe/0x20
  task_work_run+0x67/0xa0
  do_exit+0x217/0x3c0
  do_group_exit+0x3b/0xb0
  get_signal+0x14a/0x8d0
  arch_do_signal_or_restart+0xde/0x100
  exit_to_user_mode_loop+0xc1/0x1a0
  exit_to_user_mode_prepare+0xf4/0x100
  syscall_exit_to_user_mode+0x17/0x40
  do_syscall_64+0x69/0xc0

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-13 09:21:31 -04:00
Dave Airlie 1faeeb315f amd-drm-next-6.16-2025-05-09:
amdgpu:
 - IPS fixes
 - DSC cleanup
 - DC Scaling updates
 - DC FP fixes
 - Fused I2C-over-AUX updates
 - SubVP fixes
 - Freesync fix
 - DMUB AUX fixes
 - VCN fix
 - Hibernation fixes
 - HDP fixes
 - DCN 2.1 fixes
 - DPIA fixes
 - DMUB updates
 - Use drm_file_err in amdgpu
 - Enforce isolation updates
 - Use new dma_fence helpers
 - USERQ fixes
 - Documentation updates
 - Misc code cleanups
 - SR-IOV updates
 - RAS updates
 - PSP 12 cleanups
 
 amdkfd:
 - Update error messages for SDMA
 - Userptr updates
 
 drm:
 - Add drm_file_err function
 
 dma-buf:
 - Add a helper to sort and deduplicate dma_fence arrays
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQgO5Idg2tXNTSZAr293/aFa7yZ2AUCaB6KhwAKCRC93/aFa7yZ
 2FYIAQD43sIYoZCo6l7lZt0m54d6Mt1jangVhMK/jH31eOgKYQEAoRJKSMjT6Ktl
 FLBzG8FXekwQELNu6EgyG8Ywwim9AQ0=
 =dfkK
 -----END PGP SIGNATURE-----

Merge tag 'amd-drm-next-6.16-2025-05-09' of https://gitlab.freedesktop.org/agd5f/linux into drm-next

amd-drm-next-6.16-2025-05-09:

amdgpu:
- IPS fixes
- DSC cleanup
- DC Scaling updates
- DC FP fixes
- Fused I2C-over-AUX updates
- SubVP fixes
- Freesync fix
- DMUB AUX fixes
- VCN fix
- Hibernation fixes
- HDP fixes
- DCN 2.1 fixes
- DPIA fixes
- DMUB updates
- Use drm_file_err in amdgpu
- Enforce isolation updates
- Use new dma_fence helpers
- USERQ fixes
- Documentation updates
- Misc code cleanups
- SR-IOV updates
- RAS updates
- PSP 12 cleanups

amdkfd:
- Update error messages for SDMA
- Userptr updates

drm:
- Add drm_file_err function

dma-buf:
- Add a helper to sort and deduplicate dma_fence arrays

From: Alex Deucher <alexander.deucher@amd.com>
Link: https://lore.kernel.org/r/20250509230951.3871914-1-alexander.deucher@amd.com
Signed-off-by: Dave Airlie <airlied@redhat.com>
2025-05-12 07:14:34 +10:00
Alex Deucher 5a11a27677 drm/amdgpu/hdp7: use memcfg register to post the write for HDP flush
Reading back the remapped HDP flush register seems to cause
problems on some platforms. All we need is a read, so read back
the memcfg register.

Fixes: 689275140c ("drm/amdgpu/hdp7.0: do a posting read when flushing HDP")
Reported-by: Alexey Klimov <alexey.klimov@linaro.org>
Link: https://lists.freedesktop.org/archives/amd-gfx/2025-April/123150.html
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4119
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3908
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit dbc064adfc)
Cc: stable@vger.kernel.org
2025-05-08 11:48:12 -04:00
Alex Deucher ca28e80abe drm/amdgpu/hdp6: use memcfg register to post the write for HDP flush
Reading back the remapped HDP flush register seems to cause
problems on some platforms. All we need is a read, so read back
the memcfg register.

Fixes: abe1cbaec6 ("drm/amdgpu/hdp6.0: do a posting read when flushing HDP")
Reported-by: Alexey Klimov <alexey.klimov@linaro.org>
Link: https://lists.freedesktop.org/archives/amd-gfx/2025-April/123150.html
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4119
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3908
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 84141ff615)
Cc: stable@vger.kernel.org
2025-05-08 11:47:54 -04:00
Alex Deucher dbc988c689 drm/amdgpu/hdp5.2: use memcfg register to post the write for HDP flush
Reading back the remapped HDP flush register seems to cause
problems on some platforms. All we need is a read, so read back
the memcfg register.

Fixes: f756dbac1c ("drm/amdgpu/hdp5.2: do a posting read when flushing HDP")
Reported-by: Alexey Klimov <alexey.klimov@linaro.org>
Link: https://lists.freedesktop.org/archives/amd-gfx/2025-April/123150.html
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4119
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3908
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 4a89b7698e)
Cc: stable@vger.kernel.org
2025-05-08 11:47:23 -04:00
Alex Deucher 0e33e0f339 drm/amdgpu/hdp5: use memcfg register to post the write for HDP flush
Reading back the remapped HDP flush register seems to cause
problems on some platforms. All we need is a read, so read back
the memcfg register.

Fixes: cf424020e0 ("drm/amdgpu/hdp5.0: do a posting read when flushing HDP")
Reported-by: Alexey Klimov <alexey.klimov@linaro.org>
Link: https://lists.freedesktop.org/archives/amd-gfx/2025-April/123150.html
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4119
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3908
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit a5cb344033)
Cc: stable@vger.kernel.org
2025-05-08 11:46:57 -04:00
Lijo Lazar afc6053d4c Reapply: drm/amdgpu: Use generic hdp flush function
Except HDP v5.2 all use a common logic for HDP flush. Use a generic
function. HDP v5.2 forces NO_KIQ logic, revisit it later.

Reapply after fixing up an HDP regression.

v2: merge the fix (Alex)

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> (v1)
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-05-08 11:21:37 -04:00