Commit Graph

16421 Commits (09cfd3c52ea76f43b3cb15e570aeddf633d65e80)

Author SHA1 Message Date
Alex Deucher c767d74a9c drm/amdgpu/userq: fix error handling of invalid doorbell
If the doorbell is invalid, be sure to set the r to an error
state so the function returns an error.

Reviewed-by: David (Ming Qiang) Wu <David.Wu3@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 7e2a5b0a9a)
Cc: stable@vger.kernel.org
2025-08-27 14:01:52 -04:00
Jesse.Zhang ee38ea0ae4 drm/amdgpu: update firmware version checks for user queue support
The minimum firmware versions required for user queue functionality
have been increased to address an issue where the queue privilege
state was lost during queue connect operations.

The problem occurred because the privilege state was being restored
to its initial value at the beginning of the function, overwriting
the state that was properly set during the queue connect case.

This commit updates the minimum version requirements:
- ME firmware from 2390 to 2420
- PFP firmware from 2530 to 2580
- MEC firmware from 2600 to 2650
- MES firmware remains at 120

These updated firmware versions contain the necessary fixes to
properly maintain queue privilege state throughout connect operations.

Fixes: 61ca97e959 ("drm/amdgpu: Add fw minimum version check for usermode queue")
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 5f976c9939)
Cc: stable@vger.kernel.org
2025-08-27 14:01:32 -04:00
Alex Deucher ac4ed2da4c Revert "drm/amdgpu: fix incorrect vm flags to map bo"
This reverts commit b08425fa77.

Revert this to align with 6.17 because the fixes tag
was wrong on this commit.

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit be33e8a239)
2025-08-27 14:00:38 -04:00
Alex Deucher 29f155c5e8 drm/amdgpu/gfx12: set MQD as appriopriate for queue types
Set the MQD as appropriate for the kernel vs user queues.

Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 7b9110f289)
Cc: stable@vger.kernel.org
2025-08-27 13:59:53 -04:00
Alex Deucher 27f5e0c132 drm/amdgpu/gfx11: set MQD as appriopriate for queue types
Set the MQD as appropriate for the kernel vs user queues.

Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 063d668320)
Cc: stable@vger.kernel.org
2025-08-27 13:59:25 -04:00
Alex Deucher 7e2a5b0a9a drm/amdgpu/userq: fix error handling of invalid doorbell
If the doorbell is invalid, be sure to set the r to an error
state so the function returns an error.

Reviewed-by: David (Ming Qiang) Wu <David.Wu3@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-27 13:57:51 -04:00
Jesse.Zhang 5f976c9939 drm/amdgpu: update firmware version checks for user queue support
The minimum firmware versions required for user queue functionality
have been increased to address an issue where the queue privilege
state was lost during queue connect operations.

The problem occurred because the privilege state was being restored
to its initial value at the beginning of the function, overwriting
the state that was properly set during the queue connect case.

This commit updates the minimum version requirements:
- ME firmware from 2390 to 2420
- PFP firmware from 2530 to 2580
- MEC firmware from 2600 to 2650
- MES firmware remains at 120

These updated firmware versions contain the necessary fixes to
properly maintain queue privilege state throughout connect operations.

Fixes: 61ca97e959 ("drm/amdgpu: Add fw minimum version check for usermode queue")
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-27 13:57:51 -04:00
Alex Deucher ec813f384b drm/amdgpu/vpe: cancel delayed work in hw_fini
We need to cancel any outstanding work at both suspend
and driver teardown. Move the cancel to hw_fini which
gets called in both cases.

Reviewed-by: David (Ming Qiang) Wu <David.Wu3@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-27 13:57:51 -04:00
David (Ming Qiang) Wu 061a09b4dc drm/amdgpu/vcn: remove unused code in vcn_v1_0.c
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: David (Ming Qiang) Wu <David.Wu3@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-27 13:57:51 -04:00
Shaoyun Liu 87e6505261 drm/amd/amdgpu : Use the MES INV_TLBS API for tlb invalidation on gfx12
From MES version 0x81, it provide the new API INV_TLBS that support
invalidate tlbs with PASID.

Signed-off-by: Shaoyun Liu <shaoyun.liu@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-27 13:57:51 -04:00
Jesse.Zhang a7a411e246 drm/amdgpu: fix shift-out-of-bounds in amdgpu_debugfs_jpeg_sched_mask_set
Fix a UBSAN shift-out-of-bounds warning in amdgpu_debugfs_jpeg_sched_mask_set
when the shift exponent reaches or exceeds 32 bits. The issue occurred because
a 32-bit integer '1' was being shifted by up to 32 bits, which is undefined
behavior.

Replace '1' with '1ULL' to ensure 64-bit arithmetic, matching the u64 type of
'val' and preventing the shift overflow. This is consistent with the existing
mask calculation that already uses 1ULL.

The error manifested as:
UBSAN: shift-out-of-bounds in drivers/gpu/drm/amd/amdgpu/amdgpu_jpeg.c:373:17
shift exponent 32 is too large for 32-bit type 'int'
v2: remove debug log

Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Suggested-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-27 13:57:51 -04:00
Jack Xiao c350d9e268 Reapply "drm/amdgpu: fix incorrect vm flags to map bo"
It should use vm flags instead of pte flags
to specify bo vm attributes.

This reverts commit 1263ceea2a1327014d9de2858a122f3c27dfa4dd.

Reapply this patch with the proper fixes tag.

Fixes: 6716a823d1 ("drm/amdgpu: rework how PTE flags are generated v3")
Signed-off-by: Jack Xiao <Jack.Xiao@amd.com>
Reviewed-by: Likun Gao <Likun.Gao@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-27 13:57:50 -04:00
Alex Deucher be33e8a239 Revert "drm/amdgpu: fix incorrect vm flags to map bo"
This reverts commit b08425fa77.

Revert this to align with 6.17 because the fixes tag
was wrong on this commit.

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-27 13:57:50 -04:00
Alex Deucher 1c65502f81 drm/amdgpu/vpe: add ring reset support
Implement ring reset for VPE.  Similar to VCN and JPEG,
just powergate the the IP to reset it.

v2: Properly set per queue reset flag

Reviewed-by: Jesse Zhang <jesse.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-27 13:57:49 -04:00
Alex Deucher fa064d50b7 drm/amdgpu/vcn: drop extra cancel_delayed_work_sync()
We already call this in the hw_fini() methods for all
VCN instances, so no need to call it again in
amdgpu_vcn_suspend().

Tested-by: David (Ming Qiang) Wu <David.Wu3@amd.com>
Reviewed-by: Leo Liu <leo.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-27 13:57:49 -04:00
Yifan Zhang e68197aa2b drm/amdgpu: remove redundant AMDGPU_HAS_VRAM
AMDGPU_HAS_VRAM is redundant with is_app_apu, as both refer to
APUs with no carve-out. Since AMDGPU_HAS_VRAM only occurs once,
remove AMDGPU_HAS_VRAM definition. The tmr allocation can be covered
with AMDGPU_GEM_DOMAIN_GTT | AMDGPU_GEM_DOMAIN_VRAM in both vram and
non vram ASICs.

Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-27 13:57:48 -04:00
Ce Sun d8442bcad0 drm/amdgpu: Correct the loss of aca bank reg info
By polling, poll ACA bank count to ensure that valid
ACA bank reg info can be obtained

v2: add corresponding delay before send msg to SMU to query mca bank info
(Stanley)

v3: the loop cannot exit. (Thomas)

v4: remove amdgpu_aca_clear_bank_count. (Kevin)

v5: continuously inject ce. If a creation interruption
occurs at this time, bank reg info will be lost. (Thomas)
v5: each cycle is delayed by 100ms. (Tao)

Signed-off-by: Ce Sun <cesun102@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-27 13:57:47 -04:00
Ce Sun 0989b764f4 drm/amdgpu: Add a mutex lock to protect poison injection
When poison is triggered multiple times, competition will occur.
Add a mutex lock to protect poison injection

Signed-off-by: Ce Sun <cesun102@amd.com>
Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-27 13:57:47 -04:00
Ce Sun 907813e5d7 drm/amdgpu: Correct the counts of nr_banks and nr_errors
Correct the counts of nr_banks and nr_errors

Signed-off-by: Ce Sun <cesun102@amd.com>
Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-27 13:57:47 -04:00
Liao Yuanhong ee6ba1e69d drm/amdgpu/fence: Remove redundant 0 value initialization
The amdgpu_fence struct is already zeroed by kzalloc(). It's redundant to
initialize am_fence->context to 0.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Liao Yuanhong <liaoyuanhong@vivo.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-27 13:57:47 -04:00
Hawking Zhang 22dcb283d6 drm/amdgpu: Allocate psp fw private buffer in vram
It's not necessarily to allocate psp firmware private
buffer in different memory domain in sriov and bare
metal environment

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Le Ma <le.ma@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-27 13:57:47 -04:00
Alex Deucher 7b9110f289 drm/amdgpu/gfx12: set MQD as appriopriate for queue types
Set the MQD as appropriate for the kernel vs user queues.

Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-27 13:57:42 -04:00
Alex Deucher 063d668320 drm/amdgpu/gfx11: set MQD as appriopriate for queue types
Set the MQD as appropriate for the kernel vs user queues.

Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-27 13:57:37 -04:00
Thomas Zimmermann 08fb45446e drm/amdgpu: Pin buffers while vmap'ing exported dma-buf objects
Current dma-buf vmap semantics require that the mapped buffer remains
in place until the corresponding vunmap has completed.

For GEM-SHMEM, this used to be guaranteed by a pin operation while creating
an S/G table in import. GEM-SHMEN can now import dma-buf objects without
creating the S/G table, so the pin is missing. Leads to page-fault errors,
such as the one shown below.

[  102.101726] BUG: unable to handle page fault for address: ffffc90127000000
[...]
[  102.157102] RIP: 0010:udl_compress_hline16+0x219/0x940 [udl]
[...]
[  102.243250] Call Trace:
[  102.245695]  <TASK>
[  102.2477V95]  ? validate_chain+0x24e/0x5e0
[  102.251805]  ? __lock_acquire+0x568/0xae0
[  102.255807]  udl_render_hline+0x165/0x341 [udl]
[  102.260338]  ? __pfx_udl_render_hline+0x10/0x10 [udl]
[  102.265379]  ? local_clock_noinstr+0xb/0x100
[  102.269642]  ? __lock_release.isra.0+0x16c/0x2e0
[  102.274246]  ? mark_held_locks+0x40/0x70
[  102.278177]  udl_primary_plane_helper_atomic_update+0x43e/0x680 [udl]
[  102.284606]  ? __pfx_udl_primary_plane_helper_atomic_update+0x10/0x10 [udl]
[  102.291551]  ? lockdep_hardirqs_on_prepare.part.0+0x92/0x170
[  102.297208]  ? lockdep_hardirqs_on+0x88/0x130
[  102.301554]  ? _raw_spin_unlock_irq+0x24/0x50
[  102.305901]  ? wait_for_completion_timeout+0x2bb/0x3a0
[  102.311028]  ? drm_atomic_helper_calc_timestamping_constants+0x141/0x200
[  102.317714]  ? drm_atomic_helper_commit_planes+0x3b6/0x1030
[  102.323279]  drm_atomic_helper_commit_planes+0x3b6/0x1030
[  102.328664]  drm_atomic_helper_commit_tail+0x41/0xb0
[  102.333622]  commit_tail+0x204/0x330
[...]
[  102.529946] ---[ end trace 0000000000000000 ]---
[  102.651980] RIP: 0010:udl_compress_hline16+0x219/0x940 [udl]

In this stack strace, udl (based on GEM-SHMEM) imported and vmap'ed a
dma-buf from amdgpu. Amdgpu relocated the buffer, thereby invalidating the
mapping.

Provide a custom dma-buf vmap method in amdgpu that pins the object before
mapping it's buffer's pages into kernel address space. Do the opposite in
vunmap.

Note that dma-buf vmap differs from GEM vmap in how it handles relocation.
While dma-buf vmap keeps the buffer in place, GEM vmap requires the caller
to keep the buffer in place. Hence, this fix is in amdgpu's dma-buf code
instead of its GEM code.

A discussion of various approaches to solving the problem is available
at [1].

v3:
- try (GTT | VRAM); drop CPU domain (Christian)
v2:
- only use mapable domains (Christian)
- try pinning to domains in preferred order

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Fixes: 660cd44659 ("drm/shmem-helper: Import dmabuf without mapping its sg_table")
Reported-by: Thomas Zimmermann <tzimmermann@suse.de>
Closes: https://lore.kernel.org/dri-devel/ba1bdfb8-dbf7-4372-bdcb-df7e0511c702@suse.de/
Cc: Shixiong Ou <oushixiong@kylinos.cn>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: dri-devel@lists.freedesktop.org
Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Link: https://lore.kernel.org/dri-devel/9792c6c3-a2b8-4b2b-b5ba-fba19b153e21@suse.de/ # [1]
Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://lore.kernel.org/r/20250821064031.39090-1-tzimmermann@suse.de
2025-08-22 14:05:31 +02:00
Maxime Ripard 1a2cf179e2
Merge drm/drm-fixes into drm-misc-fixes
Update drm-misc-fixes to -rc2.

Signed-off-by: Maxime Ripard <mripard@kernel.org>
2025-08-20 16:08:49 +02:00
Rodrigo Siqueira 3856a53db6 drm/amdgpu/vcn: Remove unnecessary check
The function amdgpu_vcn_sysfs_reset_mask_init already returns 0, which
makes the check of the result unnecessary in the vcn_v4_0_3_sw_init().
Just return the amdgpu_vcn_sysfs_reset_mask_init directly.

Signed-off-by: Rodrigo Siqueira <siqueira@igalia.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-18 17:06:19 -04:00
Chenglei Xie d2fa0ec6e0 drm/amdgpu: refactor bad_page_work for corner case handling
When a poison is consumed on the guest before the guest receives the host's poison creation msg, a corner case may occur to have poison_handler complete processing earlier than it should to cause the guest to hang waiting for the req_bad_pages reply during a VF FLR, resulting in the VM becoming inaccessible in stress tests.

To fix this issue, this patch refactored the mailbox sequence by seperating the bad_page_work into two parts req_bad_pages_work and handle_bad_pages_work.
Old sequence:
  1.Stop data exchange work
  2.Guest sends MB_REQ_RAS_BAD_PAGES to host and keep polling for IDH_RAS_BAD_PAGES_READY
  3.If the IDH_RAS_BAD_PAGES_READY arrives within timeout limit, re-init the data exchange region for updated bad page info
    else timeout with error message
New sequence:
req_bad_pages_work:
  1.Stop data exhange work
  2.Guest sends MB_REQ_RAS_BAD_PAGES to host
Once Guest receives IDH_RAS_BAD_PAGES_READY event
handle_bad_pages_work:
  3.re-init the data exchange region for updated bad page info

Signed-off-by: Chenglei Xie <Chenglei.Xie@amd.com>
Reviewed-by: Shravan Kumar Gande <Shravankumar.Gande@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-15 13:07:30 -04:00
Qiang Liu fc4e990a32 drm/amdgpu: remove duplicated argument wptr_va
The duplicate judgment of wptr_va could be removed to simplify the logic

Signed-off-by: Qiang Liu <liuqiang@kylinos.cn>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-15 13:07:25 -04:00
Jesse.Zhang 655d6403ad drm/amd/vcn: Add late_init callback for VCN v4.0.3 reset handling
This change reorganizes VCN reset capability detection by:

1. Moving reset mask configuration from sw_init to new late_init phase
2. Adding vcn_v4_0_3_late_init() to properly check for per-queue reset support
3. Only setting soft full reset mask as fallback when per-queue reset isn't supported
4. Removing TODO comment now that queue reset support is implemented

V2: Removed unrelated changes. Keep amdgpu_get_soft_full_reset_mask in place
    and remove TODO comment. (Alex)
v3: set the flags at one place (all in late_init) (Lijo)

Suggested-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Ruili Ji <ruiliji2@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-15 13:05:06 -04:00
Heng Zhou 859958a7fa drm/amdgpu: fix nullptr err of vm_handle_moved
If a amdgpu_bo_va is fpriv->prt_va, the bo of this one is always NULL.
So, such kind of amdgpu_bo_va should be updated separately before
amdgpu_vm_handle_moved.

Signed-off-by: Heng Zhou <Heng.Zhou@amd.com>
Reviewed-by: Kasiviswanathan, Harish <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-15 13:04:07 -04:00
Thomas Zimmermann 3eb61d7cb7 Revert "drm/amdgpu: Use dma_buf from GEM object instance"
This reverts commit 515986100d.

The dma_buf field in struct drm_gem_object is not stable over the
object instance's lifetime. The field becomes NULL when user space
releases the final GEM handle on the buffer object. This resulted
in a NULL-pointer deref.

Workarounds in commit 5307dce878 ("drm/gem: Acquire references on
GEM handles for framebuffers") and commit f6bfc9afc7 ("drm/framebuffer:
Acquire internal references on GEM handles") only solved the problem
partially. They especially don't work for buffer objects without a DRM
framebuffer associated.

Hence, this revert to going back to using .import_attach->dmabuf.

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Reviewed-by: Simona Vetter <simona.vetter@ffwll.ch>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Link: https://lore.kernel.org/r/20250715082635.34974-1-tzimmermann@suse.de
2025-08-13 11:14:17 +02:00
Liu01 Tong aa5fc4362f drm/amdgpu: fix task hang from failed job submission during process kill
During process kill, drm_sched_entity_flush() will kill the vm
entities. The following job submissions of this process will fail, and
the resources of these jobs have not been released, nor have the fences
been signalled, causing tasks to hang and timeout.

Fix by check entity status in amdgpu_vm_ready() and avoid submit jobs to
stopped entity.

v2: add amdgpu_vm_ready() check before amdgpu_vm_clear_freed() in
function amdgpu_cs_vm_handling().

Fixes: 1f02f2044b ("drm/amdgpu: Avoid extra evict-restore process.")
Signed-off-by: Liu01 Tong <Tong.Liu01@amd.com>
Signed-off-by: Lin.Cao <lincao12@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit f101c13a87)
2025-08-12 16:16:36 -04:00
Jack Xiao 040bc6d0e0 drm/amdgpu: fix incorrect vm flags to map bo
It should use vm flags instead of pte flags
to specify bo vm attributes.

Fixes: 7946340fa3 ("drm/amdgpu: Move csa related code to separate file")
Signed-off-by: Jack Xiao <Jack.Xiao@amd.com>
Reviewed-by: Likun Gao <Likun.Gao@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit b08425fa77)
2025-08-12 16:08:04 -04:00
YiPeng Chai 10ef476aad drm/amdgpu: fix vram reservation issue
The vram block allocation flag must be cleared
before making vram reservation, otherwise reserving
addresses within the currently freed memory range
will always fail.

Fixes: c9cad937c0 ("drm/amdgpu: add drm buddy support to amdgpu")
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit d38eaf27de)
2025-08-12 16:07:45 -04:00
Frank Min e67b8afcb6 drm/amdgpu: Add PSP fw version check for fw reserve GFX command
The fw reserved GFX command is only supported starting from PSP fw
version 0x3a0e14 and 0x3b0e0d. Older versions do not support this command.

Add a version guard to ensure the command is only used when the running
PSP fw meets the minimum version requirement.

This ensures backward compatibility and safe operation across fw
revisions.

Fixes: a3b7f9c306 ("drm/amdgpu: reclaim psp fw reservation memory region")
Signed-off-by: Frank Min <Frank.Min@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 065e23170a)
2025-08-12 16:07:31 -04:00
Liu01 Tong f101c13a87 drm/amdgpu: fix task hang from failed job submission during process kill
During process kill, drm_sched_entity_flush() will kill the vm
entities. The following job submissions of this process will fail, and
the resources of these jobs have not been released, nor have the fences
been signalled, causing tasks to hang and timeout.

Fix by check entity status in amdgpu_vm_ready() and avoid submit jobs to
stopped entity.

v2: add amdgpu_vm_ready() check before amdgpu_vm_clear_freed() in
function amdgpu_cs_vm_handling().

Signed-off-by: Liu01 Tong <Tong.Liu01@amd.com>
Signed-off-by: Lin.Cao <lincao12@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-12 14:22:54 -04:00
Jack Xiao b08425fa77 drm/amdgpu: fix incorrect vm flags to map bo
It should use vm flags instead of pte flags
to specify bo vm attributes.

Fixes: 7946340fa3 ("drm/amdgpu: Move csa related code to separate file")
Signed-off-by: Jack Xiao <Jack.Xiao@amd.com>
Reviewed-by: Likun Gao <Likun.Gao@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-12 14:19:57 -04:00
YiPeng Chai d38eaf27de drm/amdgpu: fix vram reservation issue
The vram block allocation flag must be cleared
before making vram reservation, otherwise reserving
addresses within the currently freed memory range
will always fail.

Fixes: c9cad937c0 ("drm/amdgpu: add drm buddy support to amdgpu")
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-12 14:15:03 -04:00
Frank Min 065e23170a drm/amdgpu: Add PSP fw version check for fw reserve GFX command
The fw reserved GFX command is only supported starting from PSP fw
version 0x3a0e14 and 0x3b0e0d. Older versions do not support this command.

Add a version guard to ensure the command is only used when the running
PSP fw meets the minimum version requirement.

This ensures backward compatibility and safe operation across fw
revisions.

Fixes: a3b7f9c306 ("drm/amdgpu: reclaim psp fw reservation memory region")
Signed-off-by: Frank Min <Frank.Min@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-12 14:13:03 -04:00
Lijo Lazar d543489aa1 drm/amdgpu: Add description for partition commands
Add string description for partition commands.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-12 14:12:57 -04:00
Cryolitia PukNgae 388b68aef7 drm/amdgpu: fix incorrect comment format
Comments should not have a leading plus sign.

Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Cryolitia PukNgae <cryolitia@uniontech.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-11 11:13:41 -04:00
Vitaly Prosyak c31f486bc8 drm/amdgpu: add to custom amdgpu_drm_release drm_dev_enter/exit
User queues are disabled before GEM objects are released
(protecting against user app crashes).
No races with PCI hot-unplug (because drm_dev_enter prevents cleanup
if iewdevice is being removed).

Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-11 11:13:03 -04:00
Lijo Lazar 1dd2fa0e00 drm/amdgpu: Save and restore switch state
During a DPC error kernel waits for the link to be active before
notifying downstream devices. On certain platforms with Broadcom switch
in synthetiic mode, switch responds with values even though the link is
not fully ready. The config space restoration done by pcie port driver
for SWUS/DS of dGPU is thus not effective as the switch is still doing
internal enumeration.

As a workaround, save state of SWUS/DS device in driver. Add additional
check to see if link is active and restore the values during DPC error
callbacks.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-11 11:12:53 -04:00
Linus Torvalds ffe8ac927d drm fixes for 6.17-rc1
i915:
 - DP LPFS fixes
 
 xe:
 - SRIOV: PF fixes and removal of need of module param
 - Fix driver unbind around Devcoredump
 - Mark xe driver as BROKEN if kernel page size is not 4kB
 
 amdgpu:
 - GC 9.5.0 fixes
 - SMU fix
 - DCE 6 DC fixes
 - mmhub client ID fixes
 - VRR fix
 - Backlight fix
 - UserQ fix
 - Legacy reset fix
 - Misc fixes
 
 amdkfd:
 - CRIU fix
 - Debugfs fix
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEEKbZHaGwW9KfbeusDHTzWXnEhr4FAmiVQYoACgkQDHTzWXnE
 hr4OVg/+OT59WwFrqodc79YRnGkbMllYfhhsC+hgczRWS1WV6GCp5yYSZVqWjs5I
 H44mM/rUEywXGf5m6lNx9GCCLF+ASgTAosVY/+cSDDpdB+Chl5CSYMbwmmlaS86a
 yzOxbNz7LC+gxrD9ZSmFxaWOop789ISjRhiafAWeKcJeVIo/tuW7rX2wJsN+LdNo
 ioUnNR4UHPKhdk7SQhm6ho3izL02e3H72fK/jcUj2FMehqJRLW6ilNTSntJ+bwp4
 WmtSHSHTfeFximemIiU1l11cilwpAkWqmDjR9ZBW/F6psBU3FX3eOP4MaKL1Io2X
 0gKXcnrz8XSLlZ46rKwHLj3IB3cNuGd2jO/rTcHDTQWGhrjPgn5fZnorOsfMRlTU
 ySaEeVsgCupQyiB1hOYFMECse+tqzV++05Bu/BL3NpYn6GMikghPkianAKlwpwxS
 lVDeOTmC9WX/jD9FeW8a69nsGZJoT3oSp9Ro2xjqlo1bwm1FVmUlxKgXSRIdHofg
 oZ8PGGDQGM5hf69Y/iLNFSPC+Z22n8cDrA0in92Gek965/h2HHkttQVWEMsaVc/x
 gkxb6rWv6ZMeLMrRY4ZQFd8JaOFfEzrc7aWemknEmF0HIdyJJUu2zWiUwm2MgpI/
 tshUMJdgoJaPoTinHFJeID8HrdzgFy9owHnKUne9nNNkPFjskV4=
 =rAxo
 -----END PGP SIGNATURE-----

Merge tag 'drm-next-2025-08-08' of https://gitlab.freedesktop.org/drm/kernel

Pull drm fixes from Dave Airlie:
 "This is the fixes that built up in the merge window, mostly amdgpu and
  xe with one i915 display fix, seems like things are pretty good for
  rc1.

  i915:
   - DP LPFS fixes

  xe:
   - SRIOV: PF fixes and removal of need of module param
   - Fix driver unbind around Devcoredump
   - Mark xe driver as BROKEN if kernel page size is not 4kB

  amdgpu:
   - GC 9.5.0 fixes
   - SMU fix
   - DCE 6 DC fixes
   - mmhub client ID fixes
   - VRR fix
   - Backlight fix
   - UserQ fix
   - Legacy reset fix
   - Misc fixes

  amdkfd:
   - CRIU fix
   - Debugfs fix"

* tag 'drm-next-2025-08-08' of https://gitlab.freedesktop.org/drm/kernel: (28 commits)
  drm/amdgpu: add missing vram lost check for LEGACY RESET
  drm/amdgpu/discovery: fix fw based ip discovery
  drm/amdkfd: Destroy KFD debugfs after destroy KFD wq
  amdgpu/amdgpu_discovery: increase timeout limit for IFWI init
  drm/amdgpu: Update SDMA firmware version check for user queue support
  drm/amdgpu: Add NULL check for asic_funcs
  drm/amd/display: Revert "drm/amd/display: Fix AMDGPU_MAX_BL_LEVEL value"
  drm/amd/display: fix a Null pointer dereference vulnerability
  drm/amd/display: Add primary plane to commits for correct VRR handling
  drm/amdgpu: update mmhub 3.3 client id mappings
  drm/amdgpu: update mmhub 3.0.1 client id mappings
  drm/amdgpu: Retain job->vm in amdgpu_job_prepare_job
  drm/amd/display: Fix DCE 6.0 and 6.4 PLL programming.
  drm/amd/display: Don't overwrite dce60_clk_mgr
  drm/amdkfd: Fix checkpoint-restore on multi-xcc
  drm/amd: Restore cached manual clock settings during resume
  drm/amd: Restore cached power limit during resume
  drm/amdgpu: Update external revid for GC v9.5.0
  drm/amdgpu: Update supported modes for GC v9.5.0
  Mark xe driver as BROKEN if kernel page size is not 4kB
  ...
2025-08-08 06:48:14 +03:00
Alex Deucher 81699fe81b drm/amdgpu: add missing vram lost check for LEGACY RESET
Legacy resets reset the memory controllers so VRAM contents
may be unreliable after reset.

Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit aae94897b6)
Cc: stable@vger.kernel.org
2025-08-06 16:54:25 -04:00
Alex Deucher 514678da56 drm/amdgpu/discovery: fix fw based ip discovery
We only need the fw based discovery table for sysfs.  No
need to parse it.  Additionally parsing some of the board
specific tables may result in incorrect data on some boards.
just load the binary and don't parse it on those boards.

Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4441
Fixes: 80a0e82829 ("drm/amdgpu/discovery: optionally use fw based ip discovery")
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 62eedd150f)
Cc: stable@vger.kernel.org
2025-08-06 16:54:04 -04:00
Xaver Hugl 928587381b amdgpu/amdgpu_discovery: increase timeout limit for IFWI init
With a timeout of only 1 second, my rx 5700XT fails to initialize,
so this increases the timeout to 2s.

Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3697
Signed-off-by: Xaver Hugl <xaver.hugl@kde.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 9ed3d7bdf2)
Cc: stable@vger.kernel.org
2025-08-06 16:51:26 -04:00
Sathishkumar S 111821e4b5 drm/amdgpu/vcn: Hold pg_lock before vcn power off
Acquire vcn_pg_lock before changes to vcn power state
and release it after power off in idle work handler.

Signed-off-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Reviewed-by: Leo Liu <leo.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-06 14:34:54 -04:00
Sathishkumar S 0e7581eda8 drm/amdgpu/jpeg: Hold pg_lock before jpeg poweroff
Acquire jpeg_pg_lock before changes to jpeg power state
and release it after power off from idle work handler.

Signed-off-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Reviewed-by: Leo Liu <leo.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-06 14:34:50 -04:00
Lijo Lazar 0b4d79dafa drm/amdgpu: Assign unique id to compute partition
Assign unique id to compute partition. This is the unique id of the
first XCD instance belonging to the partition.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-06 14:34:47 -04:00
Alex Deucher aae94897b6 drm/amdgpu: add missing vram lost check for LEGACY RESET
Legacy resets reset the memory controllers so VRAM contents
may be unreliable after reset.

Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-06 14:34:14 -04:00
Alex Deucher 62eedd150f drm/amdgpu/discovery: fix fw based ip discovery
We only need the fw based discovery table for sysfs.  No
need to parse it.  Additionally parsing some of the board
specific tables may result in incorrect data on some boards.
just load the binary and don't parse it on those boards.

Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4441
Fixes: 80a0e82829 ("drm/amdgpu/discovery: optionally use fw based ip discovery")
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-06 14:32:38 -04:00
Lijo Lazar 589ea8a1fd drm/amdgpu: Add helpers to set/get unique ids
Add a struct to store unique id information for each type. Add helper
to fetch the unique id.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-06 14:22:08 -04:00
Lijo Lazar 892bac995b drm/amdgpu: Prevent hardware access in dpc state
Don't allow hardware access while in dpc state.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Ce Sun <cesun102@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-06 14:22:03 -04:00
Lijo Lazar 1a0e57eb96 drm/amdgpu/vcn: Fix double-free of vcn dump buffer
The buffer is already freed as part of amdgpu_vcn_reg_dump_fini(). The
issue is introduced by below patch series.

Fixes: de55cbff5c ("drm/amdgpu/vcn: Add regdump helper functions")
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-06 14:21:16 -04:00
Lijo Lazar c5c62160a5 drm/amdgpu: Log reset source during recovery
To get more context, add reset source to identify the source of gpu
recovery - job timeout, RAS, HWS hang etc.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-06 14:21:05 -04:00
Xiang Liu b3505c2c48 drm/amdgpu: Generate BP threshold exceed CPER once threshold exceeded
The bad pages threshold exceed CPER should be generated once threshold
exceeded, no matter the bad_page_threshold setted or not.

Signed-off-by: Xiang Liu <xiang.liu@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-06 14:21:01 -04:00
Lijo Lazar 91c4fd4164 drm/amdgpu: Set dpc status appropriately
Set the dpc status based on hardware state. Also, clear the status before
reinitialization after a successful reset.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Ce Sun <cesun102@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-06 14:18:35 -04:00
Lijo Lazar 32f73741d6 drm/amdgpu: Wait for bootloader after PSPv11 reset
Some PSPv11 SOCs take a longer time for PSP based mode-1 reset. Instead
of checking for C2PMSG_33 status, add the callback wait_for_bootloader.
Wait for bootloader to be back to steady state is already part of the
generic mode-1 reset flow. Increase the retry count for bootloader wait
and also fix the mask to prevent fake pass.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-06 14:17:50 -04:00
Ethan Carter Edwards 66f92d1035 drm/amdgpu/gfx9.4.3: remove redundant repeated nested 0 check
The repeated checks on grbm_soft_reset are unnecessary. Remove them.

Signed-off-by: Ethan Carter Edwards <ethan@ethancedwards.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-06 14:17:47 -04:00
Ethan Carter Edwards 90e1d0324a drm/amdgpu/gfx9: remove redundant repeated nested 0 check
The repeated checks on grbm_soft_reset are unnecessary. Remove them.

Signed-off-by: Ethan Carter Edwards <ethan@ethancedwards.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-06 14:17:45 -04:00
Ethan Carter Edwards 8802ec0eba drm/amdgpu/gfx10: remove redundant repeated nested 0 check
The repeated checks on grbm_soft_reset are unnecessary. Remove them.

Signed-off-by: Ethan Carter Edwards <ethan@ethancedwards.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-06 14:17:40 -04:00
Xaver Hugl 9ed3d7bdf2 amdgpu/amdgpu_discovery: increase timeout limit for IFWI init
With a timeout of only 1 second, my rx 5700XT fails to initialize,
so this increases the timeout to 2s.

Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3697
Signed-off-by: Xaver Hugl <xaver.hugl@kde.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-06 14:17:03 -04:00
Jesse.Zhang 124ffa2970 drm/amdgpu: Update SDMA firmware version check for user queue support
This commit fixes a firmware version check for enabling user queue
support in SDMA v7.0. The previous version check (7836028) was
incorrect and could lead to issues with PROTECTED_FENCE_SIGNAL
commands causing register conflicts between MCU_DBG0 and MCU_DBG1.

Fixes: 8c011408ed ("drm/amdgpu/sdma7: add ucode version checks for userq support")
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 92e2449241)
Cc: stable@vger.kernel.org
2025-08-04 15:48:14 -04:00
Lijo Lazar c2fe914d50 drm/amdgpu: Add NULL check for asic_funcs
If driver load fails too early, asic_funcs pointer remains unassigned.
Add NULL check to sanitize unwind path.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 582bf7c515)
Cc: stable@vger.kernel.org
2025-08-04 15:47:50 -04:00
Alex Deucher 9f9bddfa31 drm/amdgpu: update mmhub 3.3 client id mappings
Update the client id mapping so the correct clients
get printed when there is a mmhub page fault.

v2: fix typos spotted by David Wu.
v3: fix additional typo spotted by David.

Reviewed-by: David (Ming Qiang) Wu <David.Wu3@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit e932f4779a)
Cc: stable@vger.kernel.org
2025-08-04 15:42:15 -04:00
Alex Deucher 0bae62cc98 drm/amdgpu: update mmhub 3.0.1 client id mappings
Update the client id mapping so the correct clients
get printed when there is a mmhub page fault.

Reviewed-by: David (Ming Qiang) Wu <David.Wu3@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 2a2681eda7)
Cc: stable@vger.kernel.org
2025-08-04 15:41:52 -04:00
YuanShang c00d8b79fd drm/amdgpu: Retain job->vm in amdgpu_job_prepare_job
The field job->vm is used in function amdgpu_job_run to get the page
table re-generation counter and decide whether the job should be skipped.

Specifically, function amdgpu_vm_generation checks if the VM is valid for this job to use.
For instance, if a gfx job depends on a cancelled sdma job from entity vm->delayed,
then the gfx job should be skipped.

Fixes: 26c95e838e ("drm/amdgpu: set the VM pointer to NULL in amdgpu_job_prepare")
Signed-off-by: YuanShang <YuanShang.Mao@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit ed76936c6b)
Cc: stable@vger.kernel.org
2025-08-04 15:41:09 -04:00
Lijo Lazar 05c8b69051 drm/amdgpu: Update external revid for GC v9.5.0
Use different external revid for GC v9.5.0 SOCs.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 21c6764ed4)
Cc: stable@vger.kernel.org
2025-08-04 15:32:37 -04:00
Lijo Lazar 389d79a195 drm/amdgpu: Update supported modes for GC v9.5.0
For GC v9.5.0 SOCs, both CPX and QPX compute modes are also supported in
NPS2 mode.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Acked-by: Mangesh Gadre <Mangesh.Gadre@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 9d1ac25c7f)
Cc: stable@vger.kernel.org
2025-08-04 15:32:07 -04:00
Xiang Liu 58364f01db drm/amdgpu: Fix vcn v4.0.3 poison irq call trace on sriov guest
Sriov guest side doesn't init ras feature hence the poison irq shouldn't
be put during hw fini.

[25209.468816] Call Trace:
[25209.468817]  <TASK>
[25209.468818]  ? srso_alias_return_thunk+0x5/0x7f
[25209.468820]  ? show_trace_log_lvl+0x28e/0x2ea
[25209.468822]  ? show_trace_log_lvl+0x28e/0x2ea
[25209.468825]  ? vcn_v4_0_3_hw_fini+0xaf/0xe0 [amdgpu]
[25209.468936]  ? show_regs.part.0+0x23/0x29
[25209.468939]  ? show_regs.cold+0x8/0xd
[25209.468940]  ? amdgpu_irq_put+0x9e/0xc0 [amdgpu]
[25209.469038]  ? __warn+0x8c/0x100
[25209.469040]  ? amdgpu_irq_put+0x9e/0xc0 [amdgpu]
[25209.469135]  ? report_bug+0xa4/0xd0
[25209.469138]  ? handle_bug+0x39/0x90
[25209.469140]  ? exc_invalid_op+0x19/0x70
[25209.469142]  ? asm_exc_invalid_op+0x1b/0x20
[25209.469146]  ? amdgpu_irq_put+0x9e/0xc0 [amdgpu]
[25209.469241]  vcn_v4_0_3_hw_fini+0xaf/0xe0 [amdgpu]
[25209.469343]  amdgpu_ip_block_hw_fini+0x34/0x61 [amdgpu]
[25209.469511]  amdgpu_device_fini_hw+0x3b3/0x467 [amdgpu]

Fixes: 4c4a891496 ("drm/amdgpu: Register aqua vanjaram vcn poison irq")
Signed-off-by: Xiang Liu <xiang.liu@amd.com>
Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-04 14:43:51 -04:00
Xiang Liu d3d73bdb02 drm/amdgpu: Fix jpeg v4.0.3 poison irq call trace on sriov guest
Sriov guest side doesn't init ras feature hence the poison irq shouldn't
be put during hw fini.

[25209.467154] Call Trace:
[25209.467156]  <TASK>
[25209.467158]  ? srso_alias_return_thunk+0x5/0x7f
[25209.467162]  ? show_trace_log_lvl+0x28e/0x2ea
[25209.467166]  ? show_trace_log_lvl+0x28e/0x2ea
[25209.467171]  ? jpeg_v4_0_3_hw_fini+0x6f/0x90 [amdgpu]
[25209.467300]  ? show_regs.part.0+0x23/0x29
[25209.467303]  ? show_regs.cold+0x8/0xd
[25209.467304]  ? amdgpu_irq_put+0x9e/0xc0 [amdgpu]
[25209.467403]  ? __warn+0x8c/0x100
[25209.467407]  ? amdgpu_irq_put+0x9e/0xc0 [amdgpu]
[25209.467503]  ? report_bug+0xa4/0xd0
[25209.467508]  ? handle_bug+0x39/0x90
[25209.467511]  ? exc_invalid_op+0x19/0x70
[25209.467513]  ? asm_exc_invalid_op+0x1b/0x20
[25209.467518]  ? amdgpu_irq_put+0x9e/0xc0 [amdgpu]
[25209.467613]  ? amdgpu_irq_put+0x5f/0xc0 [amdgpu]
[25209.467709]  jpeg_v4_0_3_hw_fini+0x6f/0x90 [amdgpu]
[25209.467805]  amdgpu_ip_block_hw_fini+0x34/0x61 [amdgpu]
[25209.467971]  amdgpu_device_fini_hw+0x3b3/0x467 [amdgpu]

Fixes: 1b2231de41 ("drm/amdgpu: Register aqua vanjaram jpeg poison irq")
Signed-off-by: Xiang Liu <xiang.liu@amd.com>
Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-04 14:42:25 -04:00
Lijo Lazar 5c2b3226d0 drm/amdgpu: Add wrapper function for dpc state
Use wrapper functions to set/indicate dpc status.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Ce Sun <cesun102@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-04 14:42:21 -04:00
Jesse.Zhang 92e2449241 drm/amdgpu: Update SDMA firmware version check for user queue support
This commit fixes a firmware version check for enabling user queue
support in SDMA v7.0. The previous version check (7836028) was
incorrect and could lead to issues with PROTECTED_FENCE_SIGNAL
commands causing register conflicts between MCU_DBG0 and MCU_DBG1.

Fixes: 8c011408ed ("drm/amdgpu/sdma7: add ucode version checks for userq support")
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-04 14:41:06 -04:00
Lijo Lazar 582bf7c515 drm/amdgpu: Add NULL check for asic_funcs
If driver load fails too early, asic_funcs pointer remains unassigned.
Add NULL check to sanitize unwind path.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-04 14:40:39 -04:00
Mangesh Gadre 82594ac858 drm/amdgpu: Initialize vcn v5_0_1 ras function
Initialize vcn v5_0_1 ras function

Signed-off-by: Mangesh Gadre <Mangesh.Gadre@amd.com>
Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-04 14:37:48 -04:00
Yunxiang Li ba5e322b26 drm/amdgpu: skip mgpu fan boost for multi-vf
On multi-vf setup if the VM have two vf assigned, perhaps from two
different gpus, mgpu fan boost will fail.

Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-04 14:37:27 -04:00
Mangesh Gadre 01fa9758c8 drm/amdgpu: Initialize jpeg v5_0_1 ras function
Initialize jpeg v5_0_1 ras function

Signed-off-by: Mangesh Gadre <Mangesh.Gadre@amd.com>
Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-04 14:37:24 -04:00
Xiang Liu 8e8e08c831 drm/amdgpu: Skip poison aca bank from UE channel
Avoid GFX poison consumption errors logged when fatal error occurs.

Signed-off-by: Xiang Liu <xiang.liu@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-04 14:37:20 -04:00
Arnd Bergmann 4d22db6d07 drm/amdgpu: fix link error for !PM_SLEEP
When power management is not enabled in the kernel build, the newly
added hibernation changes cause a link failure:

arm-linux-gnueabi-ld: drivers/gpu/drm/amd/amdgpu/amdgpu_drv.o: in function `amdgpu_pmops_thaw':
amdgpu_drv.c:(.text+0x1514): undefined reference to `pm_hibernate_is_recovering'

Make the power management code in this driver conditional on
CONFIG_PM and CONFIG_PM_SLEEP

Fixes: 530694f54d ("drm/amdgpu: do not resume device in thaw for normal hibernation")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Link: https://lore.kernel.org/r/20250714081635.4071570-1-arnd@kernel.org
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-04 14:36:54 -04:00
Alex Deucher e932f4779a drm/amdgpu: update mmhub 3.3 client id mappings
Update the client id mapping so the correct clients
get printed when there is a mmhub page fault.

v2: fix typos spotted by David Wu.
v3: fix additional typo spotted by David.

Reviewed-by: David (Ming Qiang) Wu <David.Wu3@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-04 14:36:47 -04:00
Alex Deucher 2a2681eda7 drm/amdgpu: update mmhub 3.0.1 client id mappings
Update the client id mapping so the correct clients
get printed when there is a mmhub page fault.

Reviewed-by: David (Ming Qiang) Wu <David.Wu3@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-04 14:36:40 -04:00
Sathishkumar S 26a63590fe drm/amdgpu/vcn: Register dump cleanup in VCN2_5
Use generic vcn devcoredump helper functions for VCN2_5 and VCN2_6

Signed-off-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Acked-by: Leo Liu <leo.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-04 14:36:36 -04:00
Sathishkumar S 53c4be7a59 drm/amdgpu/vcn: Register dump cleanup in VCN2_0_0
Use generic vcn devcoredump helper functions for VCN2_0_0

Signed-off-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Acked-by: Leo Liu <leo.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-04 14:36:31 -04:00
Sathishkumar S b2d532b588 drm/amdgpu/vcn: Register dump cleanup in VCN3_0
Use generic vcn devcoredump helper functions for VCN3_0

Signed-off-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Acked-by: Leo Liu <leo.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-04 14:36:28 -04:00
Sathishkumar S 69cc37647b drm/amdgpu/vcn: Register dump cleanup in VCN4_0_3
Use generic vcn devcoredump helper functions for VCN4_0_3

Signed-off-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Acked-by: Leo Liu <leo.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-04 14:36:23 -04:00
Sathishkumar S 793b97c4ad drm/amdgpu/vcn: Register dump cleanup in VCN4_0_5
Use generic vcn devcoredump helper functions for VCN4_0_5

Signed-off-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Acked-by: Leo Liu <leo.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-04 14:36:20 -04:00
Sathishkumar S 4e011af912 drm/amdgpu/vcn: Register dump cleanup in VCN4_0_0
Use generic vcn devcoredump helper functions for VCN4_0_0

Signed-off-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Acked-by: Leo Liu <leo.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-04 14:36:17 -04:00
Sathishkumar S f4c3be28d5 drm/amdgpu/vcn: Register dump cleanup in VCN5
Use generic vcn devcoredump helper functions for VCN5

Signed-off-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Acked-by: Leo Liu <leo.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-04 14:36:14 -04:00
Stanley.Yang 08e27c9d92 drm/amdgpu: Add new error code for VCN/JPEG new chain
Add VIDS and JPEG8/9 S|D chain error code for VCN/JPEG v5.0.1.

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-04 14:36:06 -04:00
Stanley.Yang b1b29aa88f drm/amdgpu: Fix vcn v5.0.1 poison irq call trace
Why:
    [13014.890792] Call Trace:
    [13014.890793]  <TASK>
    [13014.890795]  ? show_trace_log_lvl+0x1d6/0x2ea
    [13014.890799]  ? show_trace_log_lvl+0x1d6/0x2ea
    [13014.890800]  ? vcn_v5_0_1_hw_fini+0xe9/0x110 [amdgpu]
    [13014.890872]  ? show_regs.part.0+0x23/0x29
    [13014.890873]  ? show_regs.cold+0x8/0xd
    [13014.890874]  ? amdgpu_irq_put+0xc6/0xe0 [amdgpu]
    [13014.890934]  ? __warn+0x8c/0x100
    [13014.890936]  ? amdgpu_irq_put+0xc6/0xe0 [amdgpu]
    [13014.890995]  ? report_bug+0xa4/0xd0
    [13014.890999]  ? handle_bug+0x39/0x90
    [13014.891001]  ? exc_invalid_op+0x19/0x70
    [13014.891003]  ? asm_exc_invalid_op+0x1b/0x20
    [13014.891005]  ? amdgpu_irq_put+0xc6/0xe0 [amdgpu]
    [13014.891065]  ? amdgpu_irq_put+0x63/0xe0 [amdgpu]
    [13014.891124]  vcn_v5_0_1_hw_fini+0xe9/0x110 [amdgpu]
    [13014.891189]  amdgpu_ip_block_hw_fini+0x3b/0x78 [amdgpu]
    [13014.891309]  amdgpu_device_fini_hw+0x3c1/0x479 [amdgpu]
How:
    Add omitted vcn poison irq get call.

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-04 14:36:00 -04:00
Sathishkumar S de55cbff5c drm/amdgpu/vcn: Add regdump helper functions
Add generic helper functions for vcn devcoredump support
which can be re-used for all vcn versions.

Signed-off-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Acked-by: Leo Liu <leo.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-04 14:35:54 -04:00
Meng Li e6c2b0f232 drm/amd/amdgpu: Release xcp drm memory after unplug
Add a new API amdgpu_xcp_drm_dev_free().
After unplug xcp device, need to release xcp drm memory etc.

Co-developed-by: Jiang Liu <gerry@linux.alibaba.com>
Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
Signed-off-by: Meng Li <li.meng@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-04 14:35:48 -04:00
YuanShang ed76936c6b drm/amdgpu: Retain job->vm in amdgpu_job_prepare_job
The field job->vm is used in function amdgpu_job_run to get the page
table re-generation counter and decide whether the job should be skipped.

Specifically, function amdgpu_vm_generation checks if the VM is valid for this job to use.
For instance, if a gfx job depends on a cancelled sdma job from entity vm->delayed,
then the gfx job should be skipped.

Fixes: 26c95e838e ("drm/amdgpu: set the VM pointer to NULL in amdgpu_job_prepare")
Signed-off-by: YuanShang <YuanShang.Mao@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-04 14:31:34 -04:00
Mario Limonciello cc51bbc7d7 drm/amd: Use drm_*() macros instead of DRM_*() for amdgpu_cs
Some of the IOCTL messages can be called for different GPUs and it might
not be obvious which one called them from a problem.  Using the drm_*()
macros the correct device will be shown in the messages.

Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Link: https://lore.kernel.org/r/20250715212420.2254925-1-superm1@kernel.org
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-04 14:31:25 -04:00
Yunshui Jiang 130c7ed88f drm/amdgpu: use kmalloc_array() instead of kmalloc()
Use kmalloc_array() instead of kmalloc() with multiplication.
kmalloc_array() is a safer way because of its multiply overflow check.

Signed-off-by: Yunshui Jiang <jiangyunshui@kylinos.cn>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-04 14:30:31 -04:00
Sathishkumar S 46b0e6b9d7 drm/amdgpu: Fix unintended error log in VCN5_0_0
The error log is supposed to be gaurded under if failure condition.

Fixes: faab5ea083 ("drm/amdgpu: Check vcn sram load return value")
Signed-off-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Reviewed-by: Leo Liu <leo.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-04 14:29:14 -04:00
Ce Sun da46735229 drm/amdgpu: Effective health check before reset
Move amdgpu_device_health_check into amdgpu_device_gpu_recover to
ensure that if the device is present can be checked before reset

The reason is:
1.During the dpc event, the device where the dpc event occurs is not
present on the bus
2.When both dpc event and ATHUB event occur simultaneously,the dpc thread
holds the reset domain lock when detecting error,and the gpu recover thread
acquires the hive lock.The device is simultaneously in the states of
amdgpu_ras_in_recovery and occurs_dpc,so gpu recover thread will not go to
amdgpu_device_health_check.It waits for the reset domain lock held by the
dpc thread, but dpc thread has not released the reset domain lock.In the dpc
callback slot_reset,to obtain the hive lock, the hive lock is held by the
gpu recover thread at this time.So a deadlock occurred

Signed-off-by: Ce Sun <cesun102@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-04 14:27:49 -04:00
Ce Sun 21c0ffa612 drm/amdgpu: Avoid rma causes GPU duplicate reset
Try to ensure poison creation handle is completed in time
to set device rma value.

Signed-off-by: Ce Sun <cesun102@amd.com>
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-04 14:27:47 -04:00
Xiang Liu 8f0245ee95 drm/amdgpu: Update IPID value for bad page threshold CPER
Update the IPID register value for bad page threshold CPER according to
the latest definition.

Signed-off-by: Xiang Liu <xiang.liu@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-04 14:27:41 -04:00
Srinivasan Shanmugam 70e33073d9 drm/amdgpu: Fix kdoc style in amdgpu_fence.c
The initial comment block before
amdgpu_fence_driver_guilty_force_completion() incorrectly used '/**' but
is not a kernel-doc comment, causing build warnings.

Fixes the below with gcc W=1:
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c:742: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
 * Kernel queue reset handling

Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-04 14:27:28 -04:00
Perry Yuan 8e3967a71e drm/amdgpu: Fix build error when CONFIG_SUSPEND is disabled
The variable `pm_suspend_target_state` is conditionally defined only when
`CONFIG_SUSPEND` is enabled (see `include/linux/suspend.h`). Directly
referencing it without guarding by `#ifdef CONFIG_SUSPEND` causes build
failures when suspend functionality is disabled (e.g., `CONFIG_SUSPEND=n`).

Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Perry Yuan <perry.yuan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-04 14:26:38 -04:00
Christian König 6716a823d1 drm/amdgpu: rework how PTE flags are generated v3
Previously we tried to keep the HW specific PTE flags in each mapping,
but for CRIU that isn't sufficient any more since the original value is
needed for the checkpoint procedure.

So rework the whole handling, nuke the early mapping function, keep the
UAPI flags in each mapping instead of the HW flags and translate them to
the HW flags while filling in the PTEs.

Only tested on Navi 23 for now, so probably needs quite a bit of more
work.

v2: fix KFD and SVN handling
v3: one more SVN fix pointed out by Felix
v4: squash in gfx12 fix from David

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-04 14:26:38 -04:00
Linus Torvalds 89748acdf2 drm fixes for 6.17-rc1
amdgpu:
 - DSC divide by 0 fix
 - clang fix
 - DC debugfs fix
 - Userq fixes
 - Avoid extra evict-restore with KFD
 - Backlight fix
 - Documentation fix
 - RAS fix
 - Add new kicker handling
 - DSC fix for DCN 3.1.4
 - PSR fix
 - Atomic fix
 - DC reset fixes
 - DCN 3.0.1 fix
 - MMHUB client mapping fix
 
 xe:
 - Fix BMG probe on unsupported mailbox command
 - Fix OA static checker warning about null gt
 - Fix a NULL vs IS_ERR() bug in xe_i2c_register_adapter
 - Fix missing unwind goto in GuC/HuC
 - Don't register I2C devices if VF
 - Clear whole GuC g2h_fence during initialization
 - Avoid call kfree for drmm_kzalloc
 - Fix pci_dev reference leak on configfs
 - SRIOV: Disable CSC support on VF
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEEKbZHaGwW9KfbeusDHTzWXnEhr4FAmiL9c8ACgkQDHTzWXnE
 hr5V3RAAji19x0wr2RsNTZRLP5m9+8vAkevIcUv5Y3ir8dUTXONN00ynE7/J/GHt
 X2MI2A+KrS0jvWlNw1vKCRNY/r6kiPd+HpPFwpN7y98Hn9o8H3byKHCvCcV6z469
 XJlXRO7yLe71Bn+iVkgQRSHnsAJVvvtoV0eeJR39wW0pvscWlV/wnTVx9XExNODJ
 lQFehlcbPQB10VRQrRBcWuPtqTX9+sIefEYERKDSXgOKcngPXTwx4QDnbBkuGy5e
 mJ0Xz9SjB554IL2pWo+gOT8FYZLw7lypVAhdCJjR9o4WzqjlxLWK2h3vcLRvo6Pq
 PyzvH0MXIEwfTT6HExXZXYtzt3ovUJ6ZaIJ6Y5Y6z8PcNzeO0+qWyQBwuKSXsN1S
 OOgHYhDnbjz8bS7J9/FPr6yE8RAl391ZWQldG3qXNp8OoXVoQv+TRcX80A9HVwai
 y4f5Ob0bRSKsnJIddsldDZz3usJ7K3xhRjcw2F03mkQKNLcmAQ1YwE0ak573N7gG
 wUj85wa9xREfWiA0ZuhXlV4BVv8a6U8JdolYsJ+HIl0l7xsFKkwLT6adzVSovE2b
 u52fV8MTjJMV17tEF7pzu0Hv3Uar0lecXMKx4G1gYOzBhZCuzeq5+SPepIAc05Ad
 QUR2LwvH0cnd5/GFPkXT8+GsDnjIJKUjEEsJUxoQS966IVTzI4A=
 =AgPB
 -----END PGP SIGNATURE-----

Merge tag 'drm-next-2025-08-01' of https://gitlab.freedesktop.org/drm/kernel

Pull drm fixes from Dave Airlie:
 "Just a bunch of amdgpu and xe fixes.

  amdgpu:
   - DSC divide by 0 fix
   - clang fix
   - DC debugfs fix
   - Userq fixes
   - Avoid extra evict-restore with KFD
   - Backlight fix
   - Documentation fix
   - RAS fix
   - Add new kicker handling
   - DSC fix for DCN 3.1.4
   - PSR fix
   - Atomic fix
   - DC reset fixes
   - DCN 3.0.1 fix
   - MMHUB client mapping fix

  xe:
   - Fix BMG probe on unsupported mailbox command
   - Fix OA static checker warning about null gt
   - Fix a NULL vs IS_ERR() bug in xe_i2c_register_adapter
   - Fix missing unwind goto in GuC/HuC
   - Don't register I2C devices if VF
   - Clear whole GuC g2h_fence during initialization
   - Avoid call kfree for drmm_kzalloc
   - Fix pci_dev reference leak on configfs
   - SRIOV: Disable CSC support on VF

* tag 'drm-next-2025-08-01' of https://gitlab.freedesktop.org/drm/kernel: (24 commits)
  drm/xe/vf: Disable CSC support on VF
  drm/amdgpu: update mmhub 4.1.0 client id mappings
  drm/amd/display: Allow DCN301 to clear update flags
  drm/amd/display: Pass up errors for reset GPU that fails to init HW
  drm/amd/display: Only finalize atomic_obj if it was initialized
  drm/amd/display: Avoid configuring PSR granularity if PSR-SU not supported
  drm/amd/display: Disable dsc_power_gate for dcn314 by default
  drm/amdgpu: add kicker fws loading for gfx12/smu14/psp14
  drm/amd/amdgpu: fix missing lock for cper.ring->rptr/wptr access
  drm/amd/display: Fix misuse of /** to /* in 'dce_i2c_hw.c'
  drm/amd/display: fix initial backlight brightness calculation
  drm/amdgpu: Avoid extra evict-restore process.
  drm/amdgpu: track whether a queue is a kernel queue in amdgpu_mqd_prop
  drm/amdgpu: check if hubbub is NULL in debugfs/amdgpu_dm_capabilities
  drm/amdgpu: Initialize data to NULL in imu_v12_0_program_rlc_ram()
  drm/amd/display: Fix divide by zero when calculating min ODM factor
  drm/xe/configfs: Fix pci_dev reference leak
  drm/xe/hw_engine_group: Avoid call kfree() for drmm_kzalloc()
  drm/xe/guc: Clear whole g2h_fence during initialization
  drm/xe/vf: Don't register I2C devices if VF
  ...
2025-07-31 21:47:36 -07:00
Linus Torvalds 260f6f4fda drm for 6.17-rc1
non-drm:
 rust:
 - make ETIMEDOUT available
 - add size constants up to SZ_2G
 - add DMA coherent allocation bindings
 mtd:
 - driver for Intel GPU non-volatile storage
 i2c
 - designware quirk for Intel xe
 
 core:
 - atomic helpers: tune enable/disable sequences
 - add task info to wedge API
 - refactor EDID quirks
 - connector: move HDR sink to drm_display_info
 - fourcc: half-float and 32-bit float formats
 - mode_config: pass format info to simplify
 
 dma-buf:
 - heaps: Give CMA heap a stable name
 
 ci:
 - add device tree validation and kunit
 
 displayport:
 - change AUX DPCD access probe address
 - add quirk for DPCD probe
 - add panel replay definitions
 - backlight control helpers
 
 fbdev:
 - make CONFIG_FIRMWARE_EDID available on all arches
 
 fence:
 - fix UAF issues
 
 format-helper:
 - improve tests
 
 gpusvm:
 - introduce devmem only flag for allocation
 - add timeslicing support to GPU SVM
 
 ttm:
 - improve eviction
 
 sched:
 - tracing improvements
 - kunit improvements
 - memory leak fixes
 - reset handling improvements
 
 color mgmt:
 - add hardware gamma LUT handling helpers
 
 bridge:
 - add destroy hook
 - switch to reference counted drm_bridge allocations
 - tc358767: convert to devm_drm_bridge_alloc
 - improve CEC handling
 
 panel:
 - switch to reference counter drm_panel allocations
 - fwnode panel lookup
 - Huiling hl055fhv028c support
 - Raspberry Pi 7" 720x1280 support
 - edp: KDC KD116N3730A05, N160JCE-ELL CMN, N116BCJ-EAK
 - simple: AUO P238HAN01
 - st7701: Winstar wf40eswaa6mnn0
 - visionox: rm69299-shift
 - Renesas R61307, Renesas R69328 support
 - DJN HX83112B
 
 hdmi:
 - add CEC handling
 - YUV420 output support
 
 xe:
 - WildCat Lake support
 - Enable PanthorLake by default
 - mark BMG as SRIOV capable
 - update firmware recommendations
 - Expose media OA units
 - aux-bux support for non-volatile memory
 - MTD intel-dg driver for non-volatile memory
 - Expose fan control and voltage regulator in sysfs
 - restructure migration for multi-device
 - Restore GuC submit UAF fix
 - make GEM shrinker drm managed
 - SRIOV VF Post-migration recovery of GGTT nodes
 - W/A additions/reworks
 - Prefetch support for svm ranges
 - Don't allocate managed BO for each policy change
 - HWMON fixes for BMG
 - Create LRC BO without VM
 - PCI ID updates
 - make SLPC debugfs files optional
 - rework eviction rejection of bound external BOs
 - consolidate PAT programming logic for pre/post Xe2
 - init changes for flicker-free boot
 - Enable GuC Dynamic Inhibit Context switch
 
 i915:
 - drm_panic support for i915/xe
 - initial flip queue off by default for LNL/PNL
 - Wildcat Lake Display support
 - Support for DSC fractional link bpp
 - Support for simultaneous Panel Replay and Adaptive sync
 - Support for PTL+ double buffer LUT
 - initial PIPEDMC event handling
 - drm_panel_follower support
 - DPLL interface renames
 - allocate struct intel_display dynamically
 - flip queue preperation
 - abstract DRAM detection better
 - avoid GuC scheduling stalls
 - remove DG1 force probe requirement
 - fix MEI interrupt handler on RT kernels
 - use backlight control helpers for eDP
 - more shared display code refactoring
 
 amdgpu:
 - add userq slot to INFO ioctl
 - SR-IOV hibernation support
 - Suspend improvements
 - Backlight improvements
 - Use scaling for non-native eDP modes
 - cleaner shader updates for GC 9.x
 - Remove fence slab
 - SDMA fw checks for userq support
 - RAS updates
 - DMCUB updates
 - DP tunneling fixes
 - Display idle D3 support
 - Per queue reset improvements
 - initial smartmux support
 
 amdkfd:
 - enable KFD on loongarch
 - mtype fix for ext coherent system memory
 
 radeon:
 - CS validation additional GL extensions
 - drop console lock during suspend/resume
 - bump driver version
 
 msm:
 - VM BIND support
 - CI: infrastructure updates
 - UBWC single source of truth
 - decouple GPU and KMS support
 - DP: rework I/O accessors
 - DPU: SM8750 support
 - DSI: SM8750 support
 - GPU: X1-45 support and speedbin support for X1-85
 - MDSS: SM8750 support
 
 nova:
 - register! macro improvements
 - DMA object abstraction
 - VBIOS parser + fwsec lookup
 - sysmem flush page support
 - falcon: generic falcon boot code and HAL
 - FWSEC-FRTS: fb setup and load/execute
 
 ivpu:
 - Add Wildcat Lake support
 - Add turbo flag
 
 ast:
 - improve hardware generations implementation
 
 imx:
 - IMX8qxq Display Controller support
 
 lima:
 - Rockchip RK3528 GPU support
 
 nouveau:
 - fence handling cleanup
 
 panfrost:
 - MT8370 support
 - bo labeling
 - 64-bit register access
 
 qaic:
 - add RAS support
 
 rockchip:
 - convert inno_hdmi to a bridge
 
 rz-du:
 - add RZ/V2H(P) support
 - MIPI-DSI DCS support
 
 sitronix:
 - ST7567 support
 
 sun4i:
 - add H616 support
 
 tidss:
 - add TI AM62L support
 - AM65x OLDI bridge support
 
 bochs:
 - drm panic support
 
 vkms:
 - YUV and R* format support
 - use faux device
 
 vmwgfx:
 - fence improvements
 
 hyperv:
 - move out of simple
 - add drm_panic support
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEEKbZHaGwW9KfbeusDHTzWXnEhr4FAmiJM/0ACgkQDHTzWXnE
 hr6MpA/+JJKGdSdrE95QkaMcOZh/3e3areGXZ0V/RrrJXdB4/DoAfQSHhF0H7m7y
 MhBGVLGNMXq7KHrz28p1MjLHrE1mwmvJ6hZ4J076ed4u9naoCD0m6k5w5wiue+KL
 HyPR54ADxN0BYmgV0l/B0wj42KsHyTO4x4hdqPJu02V9Dtmx6FCh2ujkOF3p9nbK
 GMwWDttl4KEKljD0IvQ9YIYJ66crYGx/XmZi7JoWRrS104K/h1u8qZuXBp5jVKTy
 OZRAVyLdmJqdTOLH7l599MBBcEd/bNV37/LVwF4T5iFunEKOAiyN0QY0OR+IeRVh
 ZfOv2/gp4UNyIfyahQ7LKLgEilNPGHoPitvDJPvBZxW2UjwXVNvA1QfdK5DAlVRS
 D5NoFRjlFFCz8/c2hQwlKJ9o7eVgH3/pK0mwR7SPGQTuqzLFCrAfCuzUvg/gV++6
 JFqmGKMHeCoxO2o4GMrwjFttStP41usxtV/D+grcbPteNO9UyKJS4C38n4eamJXM
 a9Sy9APuAb6F0w5+yMItEF7TQifgmhIbm5AZHlxE1KoDQV6TdiIf1Gou5LeDGoL6
 OACbXHJPL52tUnfCRpbfI4tE/IVyYsfL01JnvZ5cZZWItXfcIz76ykJri+E0G60g
 yRl/zkimHKO4B0l/HSzal5xROXr+3VzeWehEiz/ot1VriP5OesA=
 =n9MO
 -----END PGP SIGNATURE-----

Merge tag 'drm-next-2025-07-30' of https://gitlab.freedesktop.org/drm/kernel

Pull drm updates from Dave Airlie:
 "Highlights:

   - Intel xe enable Panthor Lake, started adding WildCat Lake

   - amdgpu has a bunch of reset improvments along with the usual IP
     updates

   - msm got VM_BIND support which is important for vulkan sparse memory

   - more drm_panic users

   - gpusvm common code to handle a bunch of core SVM work outside
     drivers.

  Detail summary:

  Changes outside drm subdirectory:
   - 'shrink_shmem_memory()' for better shmem/hibernate interaction
   - Rust support infrastructure:
      - make ETIMEDOUT available
      - add size constants up to SZ_2G
      - add DMA coherent allocation bindings
   - mtd driver for Intel GPU non-volatile storage
   - i2c designware quirk for Intel xe

  core:
   - atomic helpers: tune enable/disable sequences
   - add task info to wedge API
   - refactor EDID quirks
   - connector: move HDR sink to drm_display_info
   - fourcc: half-float and 32-bit float formats
   - mode_config: pass format info to simplify

  dma-buf:
   - heaps: Give CMA heap a stable name

  ci:
   - add device tree validation and kunit

  displayport:
   - change AUX DPCD access probe address
   - add quirk for DPCD probe
   - add panel replay definitions
   - backlight control helpers

  fbdev:
   - make CONFIG_FIRMWARE_EDID available on all arches

  fence:
   - fix UAF issues

  format-helper:
   - improve tests

  gpusvm:
   - introduce devmem only flag for allocation
   - add timeslicing support to GPU SVM

  ttm:
   - improve eviction

  sched:
   - tracing improvements
   - kunit improvements
   - memory leak fixes
   - reset handling improvements

  color mgmt:
   - add hardware gamma LUT handling helpers

  bridge:
   - add destroy hook
   - switch to reference counted drm_bridge allocations
   - tc358767: convert to devm_drm_bridge_alloc
   - improve CEC handling

  panel:
   - switch to reference counter drm_panel allocations
   - fwnode panel lookup
   - Huiling hl055fhv028c support
   - Raspberry Pi 7" 720x1280 support
   - edp: KDC KD116N3730A05, N160JCE-ELL CMN, N116BCJ-EAK
   - simple: AUO P238HAN01
   - st7701: Winstar wf40eswaa6mnn0
   - visionox: rm69299-shift
   - Renesas R61307, Renesas R69328 support
   - DJN HX83112B

  hdmi:
   - add CEC handling
   - YUV420 output support

  xe:
   - WildCat Lake support
   - Enable PanthorLake by default
   - mark BMG as SRIOV capable
   - update firmware recommendations
   - Expose media OA units
   - aux-bux support for non-volatile memory
   - MTD intel-dg driver for non-volatile memory
   - Expose fan control and voltage regulator in sysfs
   - restructure migration for multi-device
   - Restore GuC submit UAF fix
   - make GEM shrinker drm managed
   - SRIOV VF Post-migration recovery of GGTT nodes
   - W/A additions/reworks
   - Prefetch support for svm ranges
   - Don't allocate managed BO for each policy change
   - HWMON fixes for BMG
   - Create LRC BO without VM
   - PCI ID updates
   - make SLPC debugfs files optional
   - rework eviction rejection of bound external BOs
   - consolidate PAT programming logic for pre/post Xe2
   - init changes for flicker-free boot
   - Enable GuC Dynamic Inhibit Context switch

  i915:
   - drm_panic support for i915/xe
   - initial flip queue off by default for LNL/PNL
   - Wildcat Lake Display support
   - Support for DSC fractional link bpp
   - Support for simultaneous Panel Replay and Adaptive sync
   - Support for PTL+ double buffer LUT
   - initial PIPEDMC event handling
   - drm_panel_follower support
   - DPLL interface renames
   - allocate struct intel_display dynamically
   - flip queue preperation
   - abstract DRAM detection better
   - avoid GuC scheduling stalls
   - remove DG1 force probe requirement
   - fix MEI interrupt handler on RT kernels
   - use backlight control helpers for eDP
   - more shared display code refactoring

  amdgpu:
   - add userq slot to INFO ioctl
   - SR-IOV hibernation support
   - Suspend improvements
   - Backlight improvements
   - Use scaling for non-native eDP modes
   - cleaner shader updates for GC 9.x
   - Remove fence slab
   - SDMA fw checks for userq support
   - RAS updates
   - DMCUB updates
   - DP tunneling fixes
   - Display idle D3 support
   - Per queue reset improvements
   - initial smartmux support

  amdkfd:
   - enable KFD on loongarch
   - mtype fix for ext coherent system memory

  radeon:
   - CS validation additional GL extensions
   - drop console lock during suspend/resume
   - bump driver version

  msm:
   - VM BIND support
   - CI: infrastructure updates
   - UBWC single source of truth
   - decouple GPU and KMS support
   - DP: rework I/O accessors
   - DPU: SM8750 support
   - DSI: SM8750 support
   - GPU: X1-45 support and speedbin support for X1-85
   - MDSS: SM8750 support

  nova:
   - register! macro improvements
   - DMA object abstraction
   - VBIOS parser + fwsec lookup
   - sysmem flush page support
   - falcon: generic falcon boot code and HAL
   - FWSEC-FRTS: fb setup and load/execute

  ivpu:
   - Add Wildcat Lake support
   - Add turbo flag

  ast:
   - improve hardware generations implementation

  imx:
   - IMX8qxq Display Controller support

  lima:
   - Rockchip RK3528 GPU support

  nouveau:
   - fence handling cleanup

  panfrost:
   - MT8370 support
   - bo labeling
   - 64-bit register access

  qaic:
   - add RAS support

  rockchip:
   - convert inno_hdmi to a bridge

  rz-du:
   - add RZ/V2H(P) support
   - MIPI-DSI DCS support

  sitronix:
   - ST7567 support

  sun4i:
   - add H616 support

  tidss:
   - add TI AM62L support
   - AM65x OLDI bridge support

  bochs:
   - drm panic support

  vkms:
   - YUV and R* format support
   - use faux device

  vmwgfx:
   - fence improvements

  hyperv:
   - move out of simple
   - add drm_panic support"

* tag 'drm-next-2025-07-30' of https://gitlab.freedesktop.org/drm/kernel: (1479 commits)
  drm/tidss: oldi: convert to devm_drm_bridge_alloc() API
  drm/tidss: encoder: convert to devm_drm_bridge_alloc()
  drm/amdgpu: move reset support type checks into the caller
  drm/amdgpu/sdma7: re-emit unprocessed state on ring reset
  drm/amdgpu/sdma6: re-emit unprocessed state on ring reset
  drm/amdgpu/sdma5.2: re-emit unprocessed state on ring reset
  drm/amdgpu/sdma5: re-emit unprocessed state on ring reset
  drm/amdgpu/gfx12: re-emit unprocessed state on ring reset
  drm/amdgpu/gfx11: re-emit unprocessed state on ring reset
  drm/amdgpu/gfx10: re-emit unprocessed state on ring reset
  drm/amdgpu/gfx9.4.3: re-emit unprocessed state on kcq reset
  drm/amdgpu/gfx9: re-emit unprocessed state on kcq reset
  drm/amdgpu: Add WARN_ON to the resource clear function
  drm/amd/pm: Use cached metrics data on SMUv13.0.6
  drm/amd/pm: Use cached data for min/max clocks
  gpu: nova-core: fix bounds check in PmuLookupTableEntry::new
  drm/amdgpu: Replace HQD terminology with slots naming
  drm/amdgpu: Add user queue instance count in HW IP info
  drm/amd/amdgpu: Add helper functions for isp buffers
  drm/amd/amdgpu: Initialize swnode for ISP MFD device
  ...
2025-07-30 19:26:49 -07:00
Linus Torvalds 22c5696e3f Driver core changes for 6.17-rc1
- DEBUGFS
 
   - Remove unneeded debugfs_file_{get,put}() instances
 
   - Remove last remnants of debugfs_real_fops()
 
   - Allow storing non-const void * in struct debugfs_inode_info::aux
 
 - SYSFS
 
   - Switch back to attribute_group::bin_attrs (treewide)
 
   - Switch back to bin_attribute::read()/write() (treewide)
 
   - Constify internal references to 'struct bin_attribute'
 
 - Support cache-ids for device-tree systems
 
   - Add arch hook arch_compact_of_hwid()
 
   - Use arch_compact_of_hwid() to compact MPIDR values on arm64
 
 - Rust
 
   - Device
 
     - Introduce CoreInternal device context (for bus internal methods)
 
     - Provide generic drvdata accessors for bus devices
 
     - Provide Driver::unbind() callbacks
 
     - Use the infrastructure above for auxiliary, PCI and platform
 
     - Implement Device::as_bound()
 
     - Rename Device::as_ref() to Device::from_raw() (treewide)
 
     - Implement fwnode and device property abstractions
 
       - Implement example usage in the Rust platform sample driver
 
   - Devres
 
     - Remove the inner reference count (Arc) and use pin-init instead
 
     - Replace Devres::new_foreign_owned() with devres::register()
 
     - Require T to be Send in Devres<T>
 
     - Initialize the data kept inside a Devres last
 
     - Provide an accessor for the Devres associated Device
 
   - Device ID
 
     - Add support for ACPI device IDs and driver match tables
 
     - Split up generic device ID infrastructure
 
     - Use generic device ID infrastructure in net::phy
 
   - DMA
 
     - Implement the dma::Device trait
 
     - Add DMA mask accessors to dma::Device
 
     - Implement dma::Device for PCI and platform devices
 
     - Use DMA masks from the DMA sample module
 
   - I/O
 
     - Implement abstraction for resource regions (struct resource)
 
     - Implement resource-based ioremap() abstractions
 
     - Provide platform device accessors for I/O (remap) requests
 
   - Misc
 
     - Support fallible PinInit types in Revocable
 
     - Implement Wrapper<T> for Opaque<T>
 
     - Merge pin-init blanket dependencies (for Devres)
 
 - Misc
 
   - Fix OF node leak in auxiliary_device_create()
 
   - Use util macros in device property iterators
 
   - Improve kobject sample code
 
   - Add device_link_test() for testing device link flags
 
   - Fix typo in Documentation/ABI/testing/sysfs-kernel-address_bits
 
   - Hint to prefer container_of_const() over container_of()
 -----BEGIN PGP SIGNATURE-----
 
 iHQEABYKAB0WIQS2q/xV6QjXAdC7k+1FlHeO1qrKLgUCaIjkhwAKCRBFlHeO1qrK
 LpXuAP9RWwfD9ZGgQZ9OsMk/0pZ2mDclaK97jcmI9TAeSxeZMgD1FHnOMTY7oSIi
 iG7Muq0yLD+A5gk9HUnMUnFNrngWCg==
 =jgRj
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core

Pull driver core updates from Danilo Krummrich:
 "debugfs:
   - Remove unneeded debugfs_file_{get,put}() instances
   - Remove last remnants of debugfs_real_fops()
   - Allow storing non-const void * in struct debugfs_inode_info::aux

  sysfs:
   - Switch back to attribute_group::bin_attrs (treewide)
   - Switch back to bin_attribute::read()/write() (treewide)
   - Constify internal references to 'struct bin_attribute'

  Support cache-ids for device-tree systems:
   - Add arch hook arch_compact_of_hwid()
   - Use arch_compact_of_hwid() to compact MPIDR values on arm64

  Rust:
   - Device:
       - Introduce CoreInternal device context (for bus internal methods)
       - Provide generic drvdata accessors for bus devices
       - Provide Driver::unbind() callbacks
       - Use the infrastructure above for auxiliary, PCI and platform
       - Implement Device::as_bound()
       - Rename Device::as_ref() to Device::from_raw() (treewide)
       - Implement fwnode and device property abstractions
       - Implement example usage in the Rust platform sample driver
   - Devres:
       - Remove the inner reference count (Arc) and use pin-init instead
       - Replace Devres::new_foreign_owned() with devres::register()
       - Require T to be Send in Devres<T>
       - Initialize the data kept inside a Devres last
       - Provide an accessor for the Devres associated Device
   - Device ID:
       - Add support for ACPI device IDs and driver match tables
       - Split up generic device ID infrastructure
       - Use generic device ID infrastructure in net::phy
   - DMA:
       - Implement the dma::Device trait
       - Add DMA mask accessors to dma::Device
       - Implement dma::Device for PCI and platform devices
       - Use DMA masks from the DMA sample module
   - I/O:
       - Implement abstraction for resource regions (struct resource)
       - Implement resource-based ioremap() abstractions
       - Provide platform device accessors for I/O (remap) requests
   - Misc:
       - Support fallible PinInit types in Revocable
       - Implement Wrapper<T> for Opaque<T>
       - Merge pin-init blanket dependencies (for Devres)

  Misc:
   - Fix OF node leak in auxiliary_device_create()
   - Use util macros in device property iterators
   - Improve kobject sample code
   - Add device_link_test() for testing device link flags
   - Fix typo in Documentation/ABI/testing/sysfs-kernel-address_bits
   - Hint to prefer container_of_const() over container_of()"

* tag 'driver-core-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core: (84 commits)
  rust: io: fix broken intra-doc links to `platform::Device`
  rust: io: fix broken intra-doc link to missing `flags` module
  rust: io: mem: enable IoRequest doc-tests
  rust: platform: add resource accessors
  rust: io: mem: add a generic iomem abstraction
  rust: io: add resource abstraction
  rust: samples: dma: set DMA mask
  rust: platform: implement the `dma::Device` trait
  rust: pci: implement the `dma::Device` trait
  rust: dma: add DMA addressing capabilities
  rust: dma: implement `dma::Device` trait
  rust: net::phy Change module_phy_driver macro to use module_device_table macro
  rust: net::phy represent DeviceId as transparent wrapper over mdio_device_id
  rust: device_id: split out index support into a separate trait
  device: rust: rename Device::as_ref() to Device::from_raw()
  arm64: cacheinfo: Provide helper to compress MPIDR value into u32
  cacheinfo: Add arch hook to compress CPU h/w id into 32 bits for cache-id
  cacheinfo: Set cache 'id' based on DT data
  container_of: Document container_of() is not to be used in new code
  driver core: auxiliary bus: fix OF node leak
  ...
2025-07-29 12:15:39 -07:00
Yann Dirson 9f1f7cd467 drm/amdgpu: fix module parameter description
Fix dcdebugmask description.

Signed-off-by: Yann Dirson <ydirson@free.fr>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-28 16:40:32 -04:00
Amber Lin 216e4cff54 drm/amdgpu: Add chain runlists support to GC9.4.2
Starting from MEC v97, GC 9.4.2 supports chain runlists of XNACK+/XNACK-
processes.

Signed-off-by: Amber Lin <Amber.Lin@amd.com>
Reviewed-by: Philip Yang<Philip.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-28 16:40:32 -04:00
Lijo Lazar 21c6764ed4 drm/amdgpu: Update external revid for GC v9.5.0
Use different external revid for GC v9.5.0 SOCs.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-28 16:40:07 -04:00
YiPeng Chai 3fc96f60b6 drm/amdgpu: add critical address check for bad page retirement
Add critical address check for bad page retirement.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-28 16:40:07 -04:00
Sathishkumar S faab5ea083 drm/amdgpu: Check vcn sram load return value
Log an error when vcn sram load fails in indirect mode
and return the same error value.

Signed-off-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Reviewed-by: Leo Liu <leo.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-28 16:40:06 -04:00
Lijo Lazar 9d1ac25c7f drm/amdgpu: Update supported modes for GC v9.5.0
For GC v9.5.0 SOCs, both CPX and QPX compute modes are also supported in
NPS2 mode.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Acked-by: Mangesh Gadre <Mangesh.Gadre@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-28 16:40:06 -04:00
YiPeng Chai f348691897 drm/amdgpu: support ras critical address check
Support ras critical address check.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-28 16:40:06 -04:00
Tao Zhou d45c5e6845 drm/amdgpu: adjust the update of RAS bad page number
One eeprom record may not map to unit number of bad pages, the accurate
bad page number is gotten after bad page address check.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-28 16:40:06 -04:00
Tao Zhou 2b17c240e8 drm/amdgpu: add range check for RAS bad page address
Exclude invalid bad pages.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-28 16:40:06 -04:00
YiPeng Chai a813437c33 drm/amdgpu: add command to check address validity
Add command to check address validity and remove
unused command codes.

v2:
 The command interface adds new parameters to support
 multiple check address strategies.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-28 16:40:06 -04:00
YiPeng Chai 020ad3a4ed drm/amdgpu: query the allocated vram address block info
The bad pages that need to be retired are not all allocated
in the same poison consumption process, so an interface is
needed to query the processes that allocate the bad pages.
By killing all the processes that allocate the bad pages,
the bad pages can be reserved immediately.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-28 16:40:06 -04:00
Alex Deucher a0b34e4c86 drm/amdgpu: update mmhub 4.1.0 client id mappings
Update the client id mapping so the correct clients
get printed when there is a mmhub page fault.

Tested-by: David (Ming Qiang) Wu <David.Wu3@amd.com>
Reviewed-by: David (Ming Qiang) Wu <David.Wu3@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
2025-07-28 16:39:49 -04:00
Frank Min 0395cde08e drm/amdgpu: add kicker fws loading for gfx12/smu14/psp14
1. Add kicker firmwares loading for gfx12/smu14/psp14
2. Register additional MODULE_FIRMWARE entries for kicker fws
   - gc_12_0_1_rlc_kicker.bin
   - gc_12_0_1_imu_kicker.bin
   - psp_14_0_3_sos_kicker.bin
   - psp_14_0_3_ta_kicker.bin
   - smu_14_0_3_kicker.bin

Signed-off-by: Frank Min <Frank.Min@amd.com>
Reviewed-by: Gui Chengming <Jack.Gui@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
2025-07-28 16:27:46 -04:00
Yang Wang 8e0d1edb5c drm/amd/amdgpu: fix missing lock for cper.ring->rptr/wptr access
Add lock protection for 'ring->wptr'/'ring->rptr' to ensure the correct execution.

Fixes: 8652920d2c ("drm/amdgpu: add mutex lock for cper ring")
Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
2025-07-28 16:26:52 -04:00
Gang Ba 1f02f2044b drm/amdgpu: Avoid extra evict-restore process.
If vm belongs to another process, this is fclose after fork,
wait may enable signaling KFD eviction fence and cause parent process queue evicted.

[677852.634569]  amdkfd_fence_enable_signaling+0x56/0x70 [amdgpu]
[677852.634814]  __dma_fence_enable_signaling+0x3e/0xe0
[677852.634820]  dma_fence_wait_timeout+0x3a/0x140
[677852.634825]  amddma_resv_wait_timeout+0x7f/0xf0 [amdkcl]
[677852.634831]  amdgpu_vm_wait_idle+0x2d/0x60 [amdgpu]
[677852.635026]  amdgpu_flush+0x34/0x50 [amdgpu]
[677852.635208]  filp_flush+0x38/0x90
[677852.635213]  filp_close+0x14/0x30
[677852.635216]  do_close_on_exec+0xdd/0x130
[677852.635221]  begin_new_exec+0x1da/0x490
[677852.635225]  load_elf_binary+0x307/0xea0
[677852.635231]  ? srso_alias_return_thunk+0x5/0xfbef5
[677852.635235]  ? ima_bprm_check+0xa2/0xd0
[677852.635240]  search_binary_handler+0xda/0x260
[677852.635245]  exec_binprm+0x58/0x1a0
[677852.635249]  bprm_execve.part.0+0x16f/0x210
[677852.635254]  bprm_execve+0x45/0x80
[677852.635257]  do_execveat_common.isra.0+0x190/0x200

Suggested-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Gang Ba <Gang.Ba@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
2025-07-28 16:25:19 -04:00
Alex Deucher 284d4dfe85 drm/amdgpu: track whether a queue is a kernel queue in amdgpu_mqd_prop
Used to to set the MQD appropriately for each queue type.
Kernel queues have additional privileges.

Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org # 6.16.x
2025-07-28 16:25:04 -04:00
Nathan Chancellor c90f2e1172 drm/amdgpu: Initialize data to NULL in imu_v12_0_program_rlc_ram()
After a recent change in clang to expose uninitialized warnings from
const variables and pointers [1], there is a warning in
imu_v12_0_program_rlc_ram() because data is passed uninitialized to
program_imu_rlc_ram():

  drivers/gpu/drm/amd/amdgpu/imu_v12_0.c:374:30: error: variable 'data' is uninitialized when used here [-Werror,-Wuninitialized]
    374 |                         program_imu_rlc_ram(adev, data, (const u32)size);
        |                                                   ^~~~

As this warning happens early in clang's frontend, it does not realize
that due to the assignment of r to -EINVAL, program_imu_rlc_ram() is
never actually called, and even if it were, data would not be
dereferenced because size is 0.

Just initialize data to NULL to silence the warning, as the commit that
added program_imu_rlc_ram() mentioned it would eventually be used over
the old method, at which point data can be properly initialized and
used.

Cc: stable@vger.kernel.org
Closes: https://github.com/ClangBuiltLinux/linux/issues/2107
Fixes: 56159fffaa ("drm/amdgpu: use new method to program rlc ram")
Link: 2464313eef [1]
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-28 16:24:22 -04:00
Dave Airlie 337666c522 Merge tag 'drm-misc-fixes-2025-07-23' of https://gitlab.freedesktop.org/drm/misc/kernel into drm-fixes
drm-misc-fixes for v6.16-rc8/final?:
- Revert all uses of drm_gem_object->dmabuf to
  drm_gem_object->import_attach->dmabuf.
- Fix amdgpu returning BIOS cluttered VRAM after resume.
- Scheduler hang fix.
- Revert nouveau ioctl fix as it caused regressions.
- Fix null pointer deref in nouveau.
- Fix unnecessary semicolon in ti_sn_bridge_probe.

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Link: https://lore.kernel.org/r/72235afd-c849-49fe-9cc1-2b1781abdf08@linux.intel.com
2025-07-24 06:49:38 +10:00
Dave Airlie acab5fbd77 amd-drm-next-6.17-2025-07-17:
amdgpu:
 - Partition fixes
 - Reset fixes
 - RAS fixes
 - i2c fix
 - MPC updates
 - DSC cleanup
 - EDID fixes
 - Display idle D3 update
 - IPS updates
 - DMUB updates
 - Retimer fix
 - Replay fixes
 - Fix DC memory leak
 - Initial support for smartmux
 - DCN 4.0.1 degamma LUT fix
 - Per queue reset cleanups
 - Track ring state associated with a fence
 - SR-IOV fixes
 - SMU fixes
 - Per queue reset improvements for GC 9+ compute
 - Per queue reset improvements for GC 10+ gfx
 - Per queue reset improvements for SDMA 5+
 - Per queue reset improvements for JPEG 2+
 - Per queue reset improvements for VCN 2+
 - GC 8 fix
 - ISP updates
 
 amdkfd:
 - Enable KFD on LoongArch
 
 radeon:
 - Drop console lock during suspend/resume
 
 UAPI:
 - Add userq slot info to INFO IOCTL
   Used for IGT userq validation tests (https://lists.freedesktop.org/archives/igt-dev/2025-July/093228.html)
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQgO5Idg2tXNTSZAr293/aFa7yZ2AUCaHlrvgAKCRC93/aFa7yZ
 2LuDAP9c5phb6TBIFEmLlQEcZ+Ngx8LVkEl1JIfAhwuAwN2V0wD9H1pVVNDXmFlh
 jfB8hLYikqnxYznXuImvutGEBHM8BgQ=
 =W2D8
 -----END PGP SIGNATURE-----

Merge tag 'amd-drm-next-6.17-2025-07-17' of https://gitlab.freedesktop.org/agd5f/linux into drm-next

amd-drm-next-6.17-2025-07-17:

amdgpu:
- Partition fixes
- Reset fixes
- RAS fixes
- i2c fix
- MPC updates
- DSC cleanup
- EDID fixes
- Display idle D3 update
- IPS updates
- DMUB updates
- Retimer fix
- Replay fixes
- Fix DC memory leak
- Initial support for smartmux
- DCN 4.0.1 degamma LUT fix
- Per queue reset cleanups
- Track ring state associated with a fence
- SR-IOV fixes
- SMU fixes
- Per queue reset improvements for GC 9+ compute
- Per queue reset improvements for GC 10+ gfx
- Per queue reset improvements for SDMA 5+
- Per queue reset improvements for JPEG 2+
- Per queue reset improvements for VCN 2+
- GC 8 fix
- ISP updates

amdkfd:
- Enable KFD on LoongArch

radeon:
- Drop console lock during suspend/resume

UAPI:
- Add userq slot info to INFO IOCTL
  Used for IGT userq validation tests (https://lists.freedesktop.org/archives/igt-dev/2025-July/093228.html)

From: Alex Deucher <alexander.deucher@amd.com>
Link: https://lore.kernel.org/r/20250717213827.2061581-1-alexander.deucher@amd.com
Signed-off-by: Dave Airlie <airlied@redhat.com>
2025-07-21 11:57:43 +10:00
Dave Airlie be3cd668ff drm-misc-next for 6.17:
UAPI Changes:
 
 Cross-subsystem Changes:
 
 Core Changes:
 
 - mode_config: Change fb_create prototype to pass the drm_format_info
   and avoid redundant lookups in drivers
 - sched: kunit improvements, memory leak fixes, reset handling
   improvements
 - tests: kunit EDID update
 
 Driver Changes:
 
 - amdgpu: Hibernation fixes, structure lifetime fixes
 - nouveau: sched improvements
 - sitronix: Add Sitronix ST7567 Support
 
 - bridge:
   - Make connector available to bridge detect hook
 
 - panel:
   - More refcounting changes
   - New panels: BOE NE14QDM
 -----BEGIN PGP SIGNATURE-----
 
 iJUEABMJAB0WIQTkHFbLp4ejekA/qfgnX84Zoj2+dgUCaHisvgAKCRAnX84Zoj2+
 dihcAX49H551lQ42amN11jN4pR2tpKLjRMCSsNnXzkokJYx8adEaGTcZq+0Oi9rZ
 muGQKJABf2SSb/bem7qEu0JVL3/Pz8DOLbKIE4ltPMlHfYF5NzJhy3AKIMIjMLH7
 XqSDXOMgjA==
 =CfQG
 -----END PGP SIGNATURE-----

Merge tag 'drm-misc-next-2025-07-17' of https://gitlab.freedesktop.org/drm/misc/kernel into drm-next

drm-misc-next for 6.17:

UAPI Changes:

Cross-subsystem Changes:

Core Changes:

- mode_config: Change fb_create prototype to pass the drm_format_info
  and avoid redundant lookups in drivers
- sched: kunit improvements, memory leak fixes, reset handling
  improvements
- tests: kunit EDID update

Driver Changes:

- amdgpu: Hibernation fixes, structure lifetime fixes
- nouveau: sched improvements
- sitronix: Add Sitronix ST7567 Support

- bridge:
  - Make connector available to bridge detect hook

- panel:
  - More refcounting changes
  - New panels: BOE NE14QDM

Signed-off-by: Dave Airlie <airlied@redhat.com>

From: Maxime Ripard <mripard@redhat.com>
Link: https://lore.kernel.org/r/20250717-efficient-kudu-of-fantasy-ff95e0@houat
2025-07-21 09:16:51 +10:00
Alex Deucher 6ac55eab4f drm/amdgpu: move reset support type checks into the caller
Rather than checking in the callbacks, check if the reset
type is supported in the caller.

Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-17 12:36:56 -04:00
Alex Deucher ea2791d05a drm/amdgpu/sdma7: re-emit unprocessed state on ring reset
Re-emit the unprocessed state after resetting the queue.

Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-17 12:36:54 -04:00
Alex Deucher 9753078f54 drm/amdgpu/sdma6: re-emit unprocessed state on ring reset
Re-emit the unprocessed state after resetting the queue.

Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-17 12:36:48 -04:00
Alex Deucher 1b49bddc58 drm/amdgpu/sdma5.2: re-emit unprocessed state on ring reset
Re-emit the unprocessed state after resetting the queue.

Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-17 12:36:46 -04:00
Alex Deucher 4b1df3bad2 drm/amdgpu/sdma5: re-emit unprocessed state on ring reset
Re-emit the unprocessed state after resetting the queue.

Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-17 12:36:43 -04:00
Alex Deucher 4da11b92d7 drm/amdgpu/gfx12: re-emit unprocessed state on ring reset
Re-emit the unprocessed state after resetting the queue.
Drop the soft_recovery callbacks as the queue reset replaces
it.

Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-17 12:36:39 -04:00
Alex Deucher fa3385ac15 drm/amdgpu/gfx11: re-emit unprocessed state on ring reset
Re-emit the unprocessed state after resetting the queue.
Drop the soft_recovery callbacks as the queue reset replaces
it.

Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-17 12:36:32 -04:00
Alex Deucher f410731d5c drm/amdgpu/gfx10: re-emit unprocessed state on ring reset
Re-emit the unprocessed state after resetting the queue.
Drop the soft_recovery callbacks as the queue reset replaces
it.

Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-17 12:36:28 -04:00
Alex Deucher e22631b53a drm/amdgpu/gfx9.4.3: re-emit unprocessed state on kcq reset
Re-emit the unprocessed state after resetting the queue.

Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-17 12:36:25 -04:00
Alex Deucher ee60209b6f drm/amdgpu/gfx9: re-emit unprocessed state on kcq reset
Re-emit the unprocessed state after resetting the queue.

Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-17 12:36:19 -04:00
Arunpravin Paneer Selvam 81df6bfad6 drm/amdgpu: Add WARN_ON to the resource clear function
Set the dirty bit when the memory resource is not cleared
during BO release.

v2(Christian):
  - Drop the cleared flag set to false.
  - Improve the amdgpu_vram_mgr_set_clear_state() function.

v3:
  - Add back the resource clear flag set function call after
    being cleared during eviction (Christian).
  - Modified the patch subject name.

Signed-off-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Suggested-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-17 12:36:12 -04:00
Eeli Haapalainen 8326193401 drm/amdgpu/gfx8: reset compute ring wptr on the GPU on resume
Commit 42cdf6f687 ("drm/amdgpu/gfx8: always restore kcq MQDs") made the
ring pointer always to be reset on resume from suspend. This caused compute
rings to fail since the reset was done without also resetting it for the
firmware. Reset wptr on the GPU to avoid a disconnect between the driver
and firmware wptr.

Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3911
Fixes: 42cdf6f687 ("drm/amdgpu/gfx8: always restore kcq MQDs")
Signed-off-by: Eeli Haapalainen <eeli.haapalainen@protonmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 2becafc319)
Cc: stable@vger.kernel.org
2025-07-16 16:50:45 -04:00
Lijo Lazar 86790e300d drm/amdgpu: Increase reset counter only on success
Increment the reset counter only if soft recovery succeeded. This is
consistent with a ring hard reset behaviour where counter gets
incremented only if hard reset succeeded.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 25c314aa3e)
Cc: stable@vger.kernel.org
2025-07-16 16:49:04 -04:00
Jesse Zhang 9ffab039bc drm/amdgpu: Replace HQD terminology with slots naming
The term "HQD" is CP-specific and doesn't
accurately describe the queue resources for other IP blocks like SDMA,
VCN, or VPE. This change:

1. Renames `num_hqds` to `num_slots` in amdgpu_kms.c to better reflect
   the generic nature of the resource counting
2. Updates the UAPI struct member from `userq_num_hqds` to `userq_num_slots`
3. Maintains the same functionality while using more appropriate terminology

Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:17:36 -04:00
Jesse Zhang 78d0a27ae0 drm/amdgpu: Add user queue instance count in HW IP info
This change exposes the number of available user queue instances
for each hardware IP type (GFX, COMPUTE, SDMA) through the
drm_amdgpu_info_hw_ip interface.

Key changes:
1. Added userq_num_instance field to drm_amdgpu_info_hw_ip structure
2. Implemented counting of available HQD slots using:
   - mes.gfx_hqd_mask for GFX queues
   - mes.compute_hqd_mask for COMPUTE queues
   - mes.sdma_hqd_mask for SDMA queues
3. Only counts available instances when user queues are enabled
   (!disable_uq)

v2: using the adev->mes.gfx_hqd_mask[]/compute_hqd_mask[]/sdma_hqd_mask[] masks
  to determine the number of queue slots available for each engine type (Alex)
v3: rename userq_num_instance to userq_num_hqds (Alex)

Suggested-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:17:35 -04:00
Pratap Nirujogi 55d42f6169 drm/amd/amdgpu: Add helper functions for isp buffers
Accessing amdgpu internal data structures "struct amdgpu_device"
and "struct amdgpu_bo" in ISP V4L2 driver to alloc/free GART
buffers is not recommended.

Add new amdgpu_isp helper functions that takes opaque params
from ISP V4L2 driver and calls the amdgpu internal functions
amdgpu_bo_create_isp_user() and amdgpu_bo_create_kernel() to
alloc/free GART buffers.

Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Pratap Nirujogi <pratap.nirujogi@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:17:35 -04:00
Pratap Nirujogi e36519f5c8 drm/amd/amdgpu: Initialize swnode for ISP MFD device
Create amd_isp_capture MFD device with swnode initialized to
isp specific software_node part of fwnode graph in amd_isp4
x86/platform driver. The isp driver use this swnode handle
to retrieve the critical properties (data-lanes, mipi phyid,
link-frequencies etc.) required for camera to work on AMD ISP4
based targets.

Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Pratap Nirujogi <pratap.nirujogi@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:17:35 -04:00
Eeli Haapalainen 2becafc319 drm/amdgpu/gfx8: reset compute ring wptr on the GPU on resume
Commit 42cdf6f687 ("drm/amdgpu/gfx8: always restore kcq MQDs") made the
ring pointer always to be reset on resume from suspend. This caused compute
rings to fail since the reset was done without also resetting it for the
firmware. Reset wptr on the GPU to avoid a disconnect between the driver
and firmware wptr.

Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3911
Fixes: 42cdf6f687 ("drm/amdgpu/gfx8: always restore kcq MQDs")
Signed-off-by: Eeli Haapalainen <eeli.haapalainen@protonmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:17:28 -04:00
Alex Deucher 82a7c94fce drm/amdgpu/jpeg: clean up reset type handling
Make the handling consistent with other IPs and across
JPEG versions.

Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:17:16 -04:00
Christian König 084300fef5 drm/amdgpu: rework gmc_v9_0_get_coherence_flags v2
Avoid using the mapping here.

v2: use amdgpu_xgmi_same_hive() as suggested by Felix

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:17:10 -04:00
Alex Deucher d7767a1fd4 drm/amdgpu/vcn3: implement ring reset
Use the new helpers to handle engine resets for VCN.

Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Tested-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:17:07 -04:00
Alex Deucher 63b8c9fdfb drm/amdgpu/vcn2.5: implement ring reset
Use the new helpers to handle engine resets for VCN.

Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Tested-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:17:04 -04:00
Alex Deucher 64ac009747 drm/amdgpu/vcn2: implement ring reset
Use the new helpers to handle engine resets for VCN.

Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Tested-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:17:01 -04:00
Alex Deucher 7b6cde7f4e drm/amdgpu/vcn: add a helper framework for engine resets
With engine resets we reset all queues on the engine rather
than just a single queue.  Add a framework to handle this
similar to SDMA.

Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Tested-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:16:58 -04:00
Alex Deucher 3871149081 drm/amdgpu/vcn5: re-emit unprocessed state on ring reset
Re-emit the unprocessed state after resetting the queue.

Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Tested-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:16:55 -04:00
Alex Deucher 6166e37afd drm/amdgpu/vcn4.0.5: re-emit unprocessed state on ring reset
Re-emit the unprocessed state after resetting the queue.

Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Tested-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:16:47 -04:00
Alex Deucher 64c54f0aa2 drm/amdgpu/vcn4.0.3: re-emit unprocessed state on ring reset
Re-emit the unprocessed state after resetting the queue.

Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:16:44 -04:00
Alex Deucher d156ba3970 drm/amdgpu/vcn4: re-emit unprocessed state on ring reset
Re-emit the unprocessed state after resetting the queue.

Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Tested-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:16:41 -04:00
Alex Deucher 8bea669e67 drm/amdgpu/jpeg5.0.1: re-emit unprocessed state on ring reset
Re-emit the unprocessed state after resetting the queue.

Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:16:36 -04:00
Alex Deucher e708f2cb56 drm/amdgpu/jpeg5: add queue reset
Add queue reset support for jpeg 5.0.0.
Use the new helpers to re-emit the unprocessed state
after resetting the queue.

Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Tested-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:16:33 -04:00
Alex Deucher cf07ece3a8 drm/amdgpu/jpeg4.0.5: add queue reset
Add queue reset support for jpeg 4.0.5.
Use the new helpers to re-emit the unprocessed state
after resetting the queue.

Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Tested-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:16:25 -04:00
Alex Deucher 98f16636a2 drm/amdgpu/jpeg4.0.3: re-emit unprocessed state on ring reset
Re-emit the unprocessed state after resetting the queue.

Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:16:08 -04:00
Alex Deucher 429ccbf6f4 drm/amdgpu/jpeg4: re-emit unprocessed state on ring reset
Re-emit the unprocessed state after resetting the queue.

Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Tested-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:16:03 -04:00
Alex Deucher b81891589b drm/amdgpu/jpeg3: re-emit unprocessed state on ring reset
Re-emit the unprocessed state after resetting the queue.

Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Tested-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:15:22 -04:00
Alex Deucher bb7928f9fc drm/amdgpu/jpeg2.5: re-emit unprocessed state on ring reset
Re-emit the unprocessed state after resetting the queue.

Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Tested-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:15:19 -04:00
Alex Deucher 3c9e205f32 drm/amdgpu/jpeg2: re-emit unprocessed state on ring reset
Re-emit the unprocessed state after resetting the queue.

Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Tested-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:15:12 -04:00
Lijo Lazar 25c314aa3e drm/amdgpu: Increase reset counter only on success
Increment the reset counter only if soft recovery succeeded. This is
consistent with a ring hard reset behaviour where counter gets
incremented only if hard reset succeeded.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:14:44 -04:00
Alex Deucher ec8fbb44b5 drm/amdgpu: make compute timeouts consistent
For kernel compute queues, align the timeout with
other kernel queues (10 sec).  This had previously
been set higher for OpenCL when it used kernel
queues, but now OpenCL uses KFD user queues which
don't have a timeout limitation. This also aligns
with SR-IOV which already used a shorter timeout.

Additionally the longer timeout negatively impacts
the user experience with kernel queues for interactive
applications.

Reviewed-by: Kent Russell <kent.russell@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:14:29 -04:00
Tony Yi 991f2e0c63 drm/amdgpu: Check SQ_CONFIG register support on SRIOV
On SRIOV environments, check if RLCG supports
SQ_CONFIG register programming.

Signed-off-by: Tony Yi <Tony.Yi@amd.com>
Reviewed-by: Zhigang Luo <zhigang.luo@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:14:21 -04:00
Alex Deucher 77cc0da39c drm/amdgpu: track ring state associated with a fence
We need to know the wptr and sequence number associated
with a fence so that we can re-emit the unprocessed state
after a ring reset.  Pre-allocate storage space for
the ring buffer contents and add helpers to save off
and re-emit the unprocessed state so that it can be
re-emitted after the queue is reset.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:14:11 -04:00
Alex Deucher bc29c03b28 drm/amdgpu: clean up GC reset functions
Make them consistent and use the reset flags.

Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:10:10 -04:00
Alex Deucher e3f15cfd8b drm/amdgpu: clean up jpeg reset functions
Make them consistent and use the reset flags.

Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:10:01 -04:00
Alex Deucher 290ccae52d drm/amdgpu/vcn: don't enable per queue resets on SR-IOV
Power control is only available in bare metal.  SR-IOV
will need a different method.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:09:48 -04:00
Alex Deucher 94ee19ea14 drm/amdgpu/jpeg4: add additional ring reset error checking
Start and stop can fail, so add checks.

Fixes: 74894ffc7d ("drm/amdgpu: Add ring reset callback for JPEG4_0_0")
Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: Sathishkumar S <sathishkumar.sundararaju@amd.com>
2025-07-16 16:09:25 -04:00
Alex Deucher 2918487455 drm/amdgpu/jpeg3: add additional ring reset error checking
Start and stop can fail, so add checks.

Fixes: 03399d0bff ("drm/amdgpu: Add ring reset callback for JPEG3_0_0")
Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: Sathishkumar S <sathishkumar.sundararaju@amd.com>
2025-07-16 16:08:59 -04:00
Alex Deucher c9bfafc1a6 drm/amdgpu/jpeg2: add additional ring reset error checking
Start and stop can fail, so add checks.

Fixes: 500c04d2a7 ("drm/amdgpu: Add ring reset callback for JPEG2_0_0")
Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: Sathishkumar S <sathishkumar.sundararaju@amd.com>
2025-07-16 16:08:17 -04:00
Alex Deucher d18e1faef6 drm/amdgpu: clean up sdma reset functions
Make them consistent and drop unneeded extra variables.

Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 16:08:05 -04:00
Christophe JAILLET 28c5c48638 drm/amdgpu: Fix missing unlocking in an error path in amdgpu_userq_create()
If kasprintf() fails, some mutex still need to be released to avoid locking
issue, as already done in all other error handling path.

Fixes: c03ea34cbf ("drm/amdgpu: add support of debugfs for mqd information")
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Link: https://lore.kernel.org/all/366557fa7ca8173fd78c58336986ca56953369b9.1752087753.git.christophe.jaillet@wanadoo.fr/
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16 15:46:04 -04:00
Ville Syrjälä b4d360701b drm/amdgpu: Pass along the format info from .fb_create() to drm_helper_mode_fill_fb_struct()
Plumb the format info from .fb_create() all the way to
drm_helper_mode_fill_fb_struct() to avoid the redundant
lookup.

Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: amd-gfx@lists.freedesktop.org
Reviewed-by: Thomas Zimmermann <tzimmermann@suse.de>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250701090722.13645-10-ville.syrjala@linux.intel.com
2025-07-16 20:07:03 +03:00
Ville Syrjälä a34cc7bf10 drm: Allow the caller to pass in the format info to drm_helper_mode_fill_fb_struct()
Soon all drivers should have the format info already available in the
places where they call drm_helper_mode_fill_fb_struct(). Allow it to
be passed along into drm_helper_mode_fill_fb_struct() instead of doing
yet another redundant lookup.

Start by always passing in NULL and still doing the extra lookup.
The actual changes to avoid the lookup will follow.

Done with cocci (with some manual fixups):
@@
identifier dev, fb, mode_cmd;
expression get_format_info;
@@
void drm_helper_mode_fill_fb_struct(struct drm_device *dev,
                                    struct drm_framebuffer *fb,
+                                    const struct drm_format_info *info,
                                    const struct drm_mode_fb_cmd2 *mode_cmd)
{
...
- fb->format = get_format_info;
+ fb->format = info ?: get_format_info;
...
}

@@
identifier dev, fb, mode_cmd;
@@
void drm_helper_mode_fill_fb_struct(struct drm_device *dev,
                                    struct drm_framebuffer *fb,
+                                    const struct drm_format_info *info,
                                    const struct drm_mode_fb_cmd2 *mode_cmd);

@@
expression dev, fb, mode_cmd;
@@
drm_helper_mode_fill_fb_struct(dev, fb
+	       ,NULL
	       ,mode_cmd);

Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Liviu Dudau <liviu.dudau@arm.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Inki Dae <inki.dae@samsung.com>
Cc: Seung-Woo Kim <sw0312.kim@samsung.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Patrik Jakobsson <patrik.r.jakobsson@gmail.com>
Cc: Rob Clark <robdclark@gmail.com>
Cc: Abhinav Kumar <quic_abhinavk@quicinc.com>
Cc: Dmitry Baryshkov <lumag@kernel.org>
Cc: Sean Paul <sean@poorly.run>
Cc: Marijn Suijten <marijn.suijten@somainline.org>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: Tomi Valkeinen <tomi.valkeinen@ideasonboard.com>
Cc: Thierry Reding <thierry.reding@gmail.com>
Cc: Mikko Perttunen <mperttunen@nvidia.com>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Cc: Gurchetan Singh <gurchetansingh@chromium.org>
Cc: Chia-I Wu <olvaffe@gmail.com>
Cc: Zack Rusin <zack.rusin@broadcom.com>
Cc: Broadcom internal kernel review list <bcm-kernel-feedback-list@broadcom.com>
Cc: amd-gfx@lists.freedesktop.org
Cc: linux-arm-msm@vger.kernel.org
Cc: freedreno@lists.freedesktop.org
Cc: nouveau@lists.freedesktop.org
Cc: linux-tegra@vger.kernel.org
Cc: virtualization@lists.linux.dev
Reviewed-by: Thomas Zimmermann <tzimmermann@suse.de>
Reviewed-by: Laurent Pinchart <laurent.pinchart+renesas@ideasonboard.com>
Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@oss.qualcomm.com>
Reviewed-by: Liviu Dudau <liviu.dudau@arm.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Acked-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250701090722.13645-6-ville.syrjala@linux.intel.com
2025-07-16 20:04:45 +03:00
Ville Syrjälä 81112eaac5 drm: Pass the format info to .fb_create()
Pass along the format information from the top to .fb_create()
so that we can avoid redundant (and somewhat expensive) lookups
in the drivers.

Done with cocci (with some manual fixups):
@@
identifier func =~ ".*create.*";
identifier dev, file, mode_cmd;
@@
struct drm_framebuffer *func(
       struct drm_device *dev,
       struct drm_file *file,
+      const struct drm_format_info *info,
       const struct drm_mode_fb_cmd2 *mode_cmd)
{
...
(
- const struct drm_format_info *info = drm_get_format_info(...);
|
- const struct drm_format_info *info;
...
- info = drm_get_format_info(...);
)
<...
- if (!info)
-    return ...;
...>
}

@@
identifier func =~ ".*create.*";
identifier dev, file, mode_cmd;
@@
struct drm_framebuffer *func(
       struct drm_device *dev,
       struct drm_file *file,
+      const struct drm_format_info *info,
       const struct drm_mode_fb_cmd2 *mode_cmd)
{
...
}

@find@
identifier fb_create_func =~ ".*create.*";
identifier dev, file, mode_cmd;
@@
struct drm_framebuffer *fb_create_func(
       struct drm_device *dev,
       struct drm_file *file,
+      const struct drm_format_info *info,
       const struct drm_mode_fb_cmd2 *mode_cmd);

@@
identifier find.fb_create_func;
expression dev, file, mode_cmd;
@@
fb_create_func(dev, file
+	       ,info
	       ,mode_cmd)

@@
expression dev, file, mode_cmd;
@@
drm_gem_fb_create(dev, file
+	       ,info
	       ,mode_cmd)

@@
expression dev, file, mode_cmd;
@@
drm_gem_fb_create_with_dirty(dev, file
+	       ,info
	       ,mode_cmd)

@@
expression dev, file_priv, mode_cmd;
identifier info, fb;
@@
info = drm_get_format_info(...);
...
fb = dev->mode_config.funcs->fb_create(dev, file_priv
+                                      ,info
                                       ,mode_cmd);

@@
identifier dev, file_priv, mode_cmd;
@@
struct drm_mode_config_funcs {
...
struct drm_framebuffer *(*fb_create)(struct drm_device *dev,
                                     struct drm_file *file_priv,
+                                     const struct drm_format_info *info,
                                     const struct drm_mode_fb_cmd2 *mode_cmd);
...
};

v2: Fix kernel docs (Laurent)
    Fix commit msg (Geert)

Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Liviu Dudau <liviu.dudau@arm.com>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Inki Dae <inki.dae@samsung.com>
Cc: Seung-Woo Kim <sw0312.kim@samsung.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Patrik Jakobsson <patrik.r.jakobsson@gmail.com>
Cc: Chun-Kuang Hu <chunkuang.hu@kernel.org>
Cc: Philipp Zabel <p.zabel@pengutronix.de>
Cc: Rob Clark <robdclark@gmail.com>
Cc: Abhinav Kumar <quic_abhinavk@quicinc.com>
Cc: Dmitry Baryshkov <lumag@kernel.org>
Cc: Sean Paul <sean@poorly.run>
Cc: Marijn Suijten <marijn.suijten@somainline.org>
Cc: Marek Vasut <marex@denx.de>
Cc: Stefan Agner <stefan@agner.ch>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: Tomi Valkeinen <tomi.valkeinen@ideasonboard.com>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Kieran Bingham <kieran.bingham+renesas@ideasonboard.com>
Cc: Biju Das <biju.das.jz@bp.renesas.com>
Cc: Sandy Huang <hjc@rock-chips.com>
Cc: "Heiko Stübner" <heiko@sntech.de>
Cc: Andy Yan <andy.yan@rock-chips.com>
Cc: Thierry Reding <thierry.reding@gmail.com>
Cc: Mikko Perttunen <mperttunen@nvidia.com>
Cc: Dave Stevenson <dave.stevenson@raspberrypi.com>
Cc: "Maíra Canal" <mcanal@igalia.com>
Cc: Raspberry Pi Kernel Maintenance <kernel-list@raspberrypi.com>
Cc: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Cc: Gurchetan Singh <gurchetansingh@chromium.org>
Cc: Chia-I Wu <olvaffe@gmail.com>
Cc: Zack Rusin <zack.rusin@broadcom.com>
Cc: Broadcom internal kernel review list <bcm-kernel-feedback-list@broadcom.com>
Cc: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Cc: amd-gfx@lists.freedesktop.org
Cc: linux-arm-msm@vger.kernel.org
Cc: freedreno@lists.freedesktop.org
Cc: nouveau@lists.freedesktop.org
Cc: virtualization@lists.linux.dev
Cc: spice-devel@lists.freedesktop.org
Cc: linux-renesas-soc@vger.kernel.org
Cc: linux-tegra@vger.kernel.org
Cc: Laurent Pinchart <laurent.pinchart+renesas@ideasonboard.com>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Thomas Zimmermann <tzimmermann@suse.de>
Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@oss.qualcomm.com>
Acked-by: Liviu Dudau <liviu.dudau@arm.com>
Reviewed-by: Laurent Pinchart <laurent.pinchart+renesas@ideasonboard.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Acked-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250701090722.13645-5-ville.syrjala@linux.intel.com
2025-07-16 20:03:14 +03:00
Arunpravin Paneer Selvam 95a16160ca drm/amdgpu: Reset the clear flag in buddy during resume
- Added a handler in DRM buddy manager to reset the cleared
  flag for the blocks in the freelist.

- This is necessary because, upon resuming, the VRAM becomes
  cluttered with BIOS data, yet the VRAM backend manager
  believes that everything has been cleared.

v2:
  - Add lock before accessing drm_buddy_clear_reset_blocks()(Matthew Auld)
  - Force merge the two dirty blocks.(Matthew Auld)
  - Add a new unit test case for this issue.(Matthew Auld)
  - Having this function being able to flip the state either way would be
    good. (Matthew Brost)

v3(Matthew Auld):
  - Do merge step first to avoid the use of extra reset flag.

Signed-off-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Suggested-by: Christian König <christian.koenig@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Cc: stable@vger.kernel.org
Fixes: a68c7eaa7a ("drm/amdgpu: Enable clear page functionality")
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3812
Signed-off-by: Christian König <christian.koenig@amd.com>
Link: https://lore.kernel.org/r/20250716075125.240637-2-Arunpravin.PaneerSelvam@amd.com
2025-07-16 12:50:32 +02:00
ganglxie 48ee3d8e5e drm/amdgpu: refine bad page loading when in the same nps mode
when loading bad page in the same nps mode, need to set the other fields
fields in eeprom records manually besides retired_page

Signed-off-by: ganglxie <ganglxie@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-15 14:07:53 -04:00
ganglxie 660261df61 drm/amdgpu: refine eeprom data check
add eeprom data checksum check before driver unload. reset eeprom
and save correct data to eeprom when check failed

Signed-off-by: ganglxie <ganglxie@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-15 14:07:53 -04:00
Ce Sun 48cb9c3b21 drm/amdgpu: The interrupt source was not released
When the driver is unloaded, the interrupt source of
the rma device is not released, resulting in the failure
of hw_init when loading again using bad_page_threshold.

Signed-off-by: Ce Sun <cesun102@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-15 14:07:51 -04:00
Alex Deucher 7a5b69d60e drm/amdgpu/vcn5: add additional ring reset error checking
Start and stop can fail, so add checks.

Fixes: b54695dae9 ("drm/amd: Add per-ring reset for vcn v5.0.0 use")
Reviewed-by: Mario Limonciello <mari.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: Mario Limonciello <mario.limonciello@amd.com>
2025-07-15 14:07:43 -04:00
Alex Deucher 1b556bcc38 drm/amdgpu/vcn4.0.5: add additional ring reset error checking
Start and stop can fail, so add checks.

Fixes: d1a46cdd00 ("drm/amd: Add per-ring reset for vcn v4.0.5 use")
Reviewed-by: Mario Limonciello <mari.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: Mario Limonciello <mario.limonciello@amd.com>
2025-07-15 14:07:39 -04:00
Alex Deucher d115a63f81 drm/amdgpu/vcn4: add additional ring reset error checking
Start and stop can fail, so add checks.

Fixes: b8b6e6f165 ("drm/amd: Add per-ring reset for vcn v4.0.0 use")
Reviewed-by: Mario Limonciello <mari.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: Mario Limonciello <mario.limonciello@amd.com>
2025-07-15 14:07:34 -04:00
Alex Deucher a4b2ba8f63 drm/amdgpu/gfx10: fix kiq locking in KCQ reset
The ring test needs to be inside the lock.

Fixes: 097af47d3c ("drm/amdgpu/gfx10: wait for reset done before remap")
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: Jiadong Zhu <Jiadong.Zhu@amd.com>
2025-07-15 14:07:28 -04:00
Alex Deucher 08f116c593 drm/amdgpu/gfx9.4.3: fix kiq locking in KCQ reset
The ring test needs to be inside the lock.

Fixes: 4c953e53cc ("drm/amdgpu/gfx_9.4.3: wait for reset done before remap")
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: Jiadong Zhu <Jiadong.Zhu@amd.com>
2025-07-15 14:07:23 -04:00
Alex Deucher 730ea5074d drm/amdgpu/gfx9: fix kiq locking in KCQ reset
The ring test needs to be inside the lock.

Fixes: fdbd69486b ("drm/amdgpu/gfx9: wait for reset done before remap")
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: Jiadong Zhu <Jiadong.Zhu@amd.com>
2025-07-15 14:07:17 -04:00
Lijo Lazar 8ff4a4b98d drm/amdgpu: Use cached partition mode, if valid
For current partition mode queries, return the mode cached in partition
manager whenever it's valid.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-15 14:07:17 -04:00
Maíra Canal 0a5dc1b67e
drm/sched: Rename DRM_GPU_SCHED_STAT_NOMINAL to DRM_GPU_SCHED_STAT_RESET
Among the scheduler's statuses, the only one that indicates an error is
DRM_GPU_SCHED_STAT_ENODEV. Any status other than DRM_GPU_SCHED_STAT_ENODEV
signifies that the operation succeeded and the GPU is in a nominal state.

However, to provide more information about the GPU's status, it is needed
to convey more information than just "OK".

Therefore, rename DRM_GPU_SCHED_STAT_NOMINAL to
DRM_GPU_SCHED_STAT_RESET, which better communicates the meaning of this
status. The status DRM_GPU_SCHED_STAT_RESET indicates that the GPU has
hung, but it has been successfully reset and is now in a nominal state
again.

Reviewed-by: Philipp Stanner <phasta@kernel.org>
Link: https://lore.kernel.org/r/20250714-sched-skip-reset-v6-1-5c5ba4f55039@igalia.com
Signed-off-by: Maíra Canal <mcanal@igalia.com>
2025-07-15 08:27:00 -03:00
Simona Vetter 7e11e01d1f amd-drm-next-6.17-2025-07-11:
amdgpu:
 - Clean up function signatures
 - GC 10 KGQ reset fix
 - SDMA reset cleanups
 - Misc fixes
 - LVDS fixes
 - UserQ fix
 
 amdkfd:
 - Reset fix
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQgO5Idg2tXNTSZAr293/aFa7yZ2AUCaHF5fgAKCRC93/aFa7yZ
 2FbQAPsH59r6i1K6TfyLYuCIOwFXyqx/AEy2/HmlBx3r2ElC2AEA8R0rrKTVpzlP
 ZN6WaCDLEBOlk21S3U3QCHuSlf/G1Ac=
 =H/st
 -----END PGP SIGNATURE-----

Merge tag 'amd-drm-next-6.17-2025-07-11' of https://gitlab.freedesktop.org/agd5f/linux into drm-next

amd-drm-next-6.17-2025-07-11:

amdgpu:
- Clean up function signatures
- GC 10 KGQ reset fix
- SDMA reset cleanups
- Misc fixes
- LVDS fixes
- UserQ fix

amdkfd:
- Reset fix

Signed-off-by: Simona Vetter <simona.vetter@ffwll.ch>
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250711205548.21052-1-alexander.deucher@amd.com
2025-07-11 23:55:40 +02:00
André Almeida 667efb3419 drm/amdgpu: Fix lifetime of struct amdgpu_task_info after ring reset
When a ring reset happens, amdgpu calls drm_dev_wedged_event() using
struct amdgpu_task_info *ti as one of the arguments. After using *ti, a
call to amdgpu_vm_put_task_info(ti) is required to correctly track its
lifetime.

However, it's called from a place that the ring reset path never reaches
due to a goto after drm_dev_wedged_event() is called. Move
amdgpu_vm_put_task_info() bellow the exit label to make sure that it's
called regardless of the code path.

amdgpu_vm_put_task_info() can only accept a valid address or NULL as
argument, so initialise *ti to make sure we can call this function if
*ti isn't used.

Fixes: a72002cb18 ("drm/amdgpu: Make use of drm_wedge_task_info")
Reported-by: Dave Airlie <airlied@gmail.com>
Closes: https://lore.kernel.org/dri-devel/CAPM=9tz0rQP8VZWKWyuF8kUMqRScxqoa6aVdwWw9=5yYxyYQ2Q@mail.gmail.com/
Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://lore.kernel.org/r/20250704030629.1064397-1-andrealmeid@igalia.com
Signed-off-by: André Almeida <andrealmeid@igalia.com>
2025-07-10 18:41:21 -03:00
Samuel Zhang 530694f54d drm/amdgpu: do not resume device in thaw for normal hibernation
For normal hibernation, GPU do not need to be resumed in thaw since it is
not involved in writing the hibernation image. Skip resume in this case
can reduce the hibernation time.

On VM with 8 * 192GB VRAM dGPUs, 98% VRAM usage and 1.7TB system memory,
this can save 50 minutes.

Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
Tested-by: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Link: https://lore.kernel.org/r/20250710062313.3226149-6-guoqing.zhang@amd.com
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
2025-07-10 10:49:45 -05:00
Samuel Zhang 924dda024f drm/amdgpu: move GTT to shmem after eviction for hibernation
When hibernate with data center dGPUs, huge number of VRAM BOs evicted
to GTT and takes too much system memory. This will cause hibernation
fail due to insufficient memory for creating the hibernation image.

Move GTT BOs to shmem in KMD, then shmem to swap disk in kernel
hibernation code to make room for hibernation image.

Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://lore.kernel.org/r/20250710062313.3226149-3-guoqing.zhang@amd.com
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
2025-07-10 10:49:45 -05:00
Sunil Khatri 03d5236014 drm/amdgpu: fix the logic to validate fpriv and root bo
Fix the smatch warning,
smatch warnings:
drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c:2146 amdgpu_pt_info_read()
error: we previously assumed 'fpriv' could be null (see line 2146)

"if (!fpriv && !fpriv->vm.root.bo)", It has to be an OR condition
rather than an AND which makes an NULL dereference in case fpriv is NULL.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202507090525.9rDWGhz3-lkp@intel.com/
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Link: https://lore.kernel.org/r/20250709071618.591866-1-sunil.khatri@amd.com
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Christian König <christian.koenig@amd.com>
2025-07-09 10:15:30 +02:00
Sunil Khatri 8f9abaff41 drm/amdgpu: fix MQD debugfs undefined symbol when DEBUG_FS=n
Fix undefined reference to amdgpu_mqd_info_fops during
debugfs_create_file if DEBUG_FS=n

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Link: https://lore.kernel.org/r/20250708101551.68033-1-sunil.khatri@amd.com
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Christian König <christian.koenig@amd.com>
2025-07-09 10:14:19 +02:00
Maarten Lankhorst e21354aea4 Merge remote-tracking branch 'drm/drm-next' into drm-misc-next
Pull in drm-intel-next for the updates to drm panic handling.

Signed-off-by: Maarten Lankhorst <dev@lankhorst.se>
2025-07-08 16:49:07 +02:00
Vitaly Prosyak a886d26f2c drm/amdgpu: fix use-after-free in amdgpu_userq_suspend+0x51a/0x5a0
[  +0.000020] BUG: KASAN: slab-use-after-free in amdgpu_userq_suspend+0x51a/0x5a0 [amdgpu]
[  +0.000817] Read of size 8 at addr ffff88812eec8c58 by task amd_pci_unplug/1733

[  +0.000027] CPU: 10 UID: 0 PID: 1733 Comm: amd_pci_unplug Tainted: G        W          6.14.0+ #2
[  +0.000009] Tainted: [W]=WARN
[  +0.000003] Hardware name: ASUS System Product Name/ROG STRIX B550-F GAMING (WI-FI), BIOS 1401 12/03/2020
[  +0.000004] Call Trace:
[  +0.000004]  <TASK>
[  +0.000003]  dump_stack_lvl+0x76/0xa0
[  +0.000011]  print_report+0xce/0x600
[  +0.000009]  ? srso_return_thunk+0x5/0x5f
[  +0.000006]  ? kasan_complete_mode_report_info+0x76/0x200
[  +0.000007]  ? kasan_addr_to_slab+0xd/0xb0
[  +0.000006]  ? amdgpu_userq_suspend+0x51a/0x5a0 [amdgpu]
[  +0.000707]  kasan_report+0xbe/0x110
[  +0.000006]  ? amdgpu_userq_suspend+0x51a/0x5a0 [amdgpu]
[  +0.000541]  __asan_report_load8_noabort+0x14/0x30
[  +0.000005]  amdgpu_userq_suspend+0x51a/0x5a0 [amdgpu]
[  +0.000535]  ? stop_cpsch+0x396/0x600 [amdgpu]
[  +0.000556]  ? stop_cpsch+0x429/0x600 [amdgpu]
[  +0.000536]  ? __pfx_amdgpu_userq_suspend+0x10/0x10 [amdgpu]
[  +0.000536]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? kgd2kfd_suspend+0x132/0x1d0 [amdgpu]
[  +0.000542]  amdgpu_device_fini_hw+0x581/0xe90 [amdgpu]
[  +0.000485]  ? down_write+0xbb/0x140
[  +0.000007]  ? __mutex_unlock_slowpath.constprop.0+0x317/0x360
[  +0.000005]  ? __pfx_amdgpu_device_fini_hw+0x10/0x10 [amdgpu]
[  +0.000482]  ? __kasan_check_write+0x14/0x30
[  +0.000004]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? up_write+0x55/0xb0
[  +0.000007]  ? srso_return_thunk+0x5/0x5f
[  +0.000005]  ? blocking_notifier_chain_unregister+0x6c/0xc0
[  +0.000008]  amdgpu_driver_unload_kms+0x69/0x90 [amdgpu]
[  +0.000484]  amdgpu_pci_remove+0x93/0x130 [amdgpu]
[  +0.000482]  pci_device_remove+0xae/0x1e0
[  +0.000008]  device_remove+0xc7/0x180
[  +0.000008]  device_release_driver_internal+0x3d4/0x5a0
[  +0.000007]  device_release_driver+0x12/0x20
[  +0.000004]  pci_stop_bus_device+0x104/0x150
[  +0.000006]  pci_stop_and_remove_bus_device_locked+0x1b/0x40
[  +0.000005]  remove_store+0xd7/0xf0
[  +0.000005]  ? __pfx_remove_store+0x10/0x10
[  +0.000006]  ? __pfx__copy_from_iter+0x10/0x10
[  +0.000006]  ? __pfx_dev_attr_store+0x10/0x10
[  +0.000006]  dev_attr_store+0x3f/0x80
[  +0.000006]  sysfs_kf_write+0x125/0x1d0
[  +0.000004]  ? srso_return_thunk+0x5/0x5f
[  +0.000005]  ? __kasan_check_write+0x14/0x30
[  +0.000005]  kernfs_fop_write_iter+0x2ea/0x490
[  +0.000005]  ? rw_verify_area+0x70/0x420
[  +0.000005]  ? __pfx_kernfs_fop_write_iter+0x10/0x10
[  +0.000006]  vfs_write+0x90d/0xe70
[  +0.000005]  ? srso_return_thunk+0x5/0x5f
[  +0.000005]  ? __pfx_vfs_write+0x10/0x10
[  +0.000004]  ? local_clock+0x15/0x30
[  +0.000008]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? __kasan_slab_free+0x5f/0x80
[  +0.000005]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? __kasan_check_read+0x11/0x20
[  +0.000004]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? fdget_pos+0x1d3/0x500
[  +0.000007]  ksys_write+0x119/0x220
[  +0.000005]  ? putname+0x1c/0x30
[  +0.000006]  ? __pfx_ksys_write+0x10/0x10
[  +0.000007]  __x64_sys_write+0x72/0xc0
[  +0.000006]  x64_sys_call+0x18ab/0x26f0
[  +0.000006]  do_syscall_64+0x7c/0x170
[  +0.000004]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? __pfx___x64_sys_openat+0x10/0x10
[  +0.000006]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? __kasan_check_read+0x11/0x20
[  +0.000003]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? fpregs_assert_state_consistent+0x21/0xb0
[  +0.000006]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? syscall_exit_to_user_mode+0x4e/0x240
[  +0.000005]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? do_syscall_64+0x88/0x170
[  +0.000003]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? irqentry_exit+0x43/0x50
[  +0.000004]  ? srso_return_thunk+0x5/0x5f
[  +0.000004]  ? exc_page_fault+0x7c/0x110
[  +0.000006]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  +0.000006] RIP: 0033:0x7480c0b14887
[  +0.000005] Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[  +0.000005] RSP: 002b:00007fff142b0058 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  +0.000006] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007480c0b14887
[  +0.000003] RDX: 0000000000000001 RSI: 00007480c0e7365a RDI: 0000000000000004
[  +0.000003] RBP: 00007fff142b0080 R08: 0000563b2e73c170 R09: 0000000000000000
[  +0.000003] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fff142b02f8
[  +0.000003] R13: 0000563b159a72a9 R14: 0000563b159a9d48 R15: 00007480c0f19040
[  +0.000008]  </TASK>

[  +0.000445] Allocated by task 427 on cpu 5 at 29.342331s:
[  +0.000011]  kasan_save_stack+0x28/0x60
[  +0.000006]  kasan_save_track+0x18/0x70
[  +0.000006]  kasan_save_alloc_info+0x38/0x60
[  +0.000005]  __kasan_kmalloc+0xc1/0xd0
[  +0.000006]  __kmalloc_cache_noprof+0x1bd/0x430
[  +0.000007]  amdgpu_driver_open_kms+0x172/0x760 [amdgpu]
[  +0.000493]  drm_file_alloc+0x569/0x9a0
[  +0.000007]  drm_client_init+0x1b7/0x410
[  +0.000007]  drm_fbdev_client_setup+0x174/0x470
[  +0.000006]  drm_client_setup+0x8a/0xf0
[  +0.000006]  amdgpu_pci_probe+0x510/0x10c0 [amdgpu]
[  +0.000483]  local_pci_probe+0xe7/0x1b0
[  +0.000006]  pci_device_probe+0x5bf/0x890
[  +0.000006]  really_probe+0x1fd/0x950
[  +0.000005]  __driver_probe_device+0x307/0x410
[  +0.000006]  driver_probe_device+0x4e/0x150
[  +0.000005]  __driver_attach+0x223/0x510
[  +0.000006]  bus_for_each_dev+0x102/0x1a0
[  +0.000005]  driver_attach+0x3d/0x60
[  +0.000006]  bus_add_driver+0x309/0x650
[  +0.000005]  driver_register+0x13d/0x490
[  +0.000006]  __pci_register_driver+0x1ee/0x2b0
[  +0.000006]  rfcomm_dlc_clear_state+0x69/0x220 [rfcomm]
[  +0.000011]  do_one_initcall+0x9c/0x3e0
[  +0.000007]  do_init_module+0x29e/0x7f0
[  +0.000006]  load_module+0x5c75/0x7c80
[  +0.000006]  init_module_from_file+0x106/0x180
[  +0.000006]  idempotent_init_module+0x377/0x740
[  +0.000006]  __x64_sys_finit_module+0xd7/0x180
[  +0.000006]  x64_sys_call+0x1f0b/0x26f0
[  +0.000006]  do_syscall_64+0x7c/0x170
[  +0.000005]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

[  +0.000013] Freed by task 1733 on cpu 5 at 59.907086s:
[  +0.000011]  kasan_save_stack+0x28/0x60
[  +0.000006]  kasan_save_track+0x18/0x70
[  +0.000005]  kasan_save_free_info+0x3b/0x60
[  +0.000005]  __kasan_slab_free+0x54/0x80
[  +0.000006]  kfree+0x127/0x470
[  +0.000006]  amdgpu_driver_postclose_kms+0x455/0x760 [amdgpu]
[  +0.000493]  drm_file_free.part.0+0x5b1/0xba0
[  +0.000006]  drm_file_free+0x13/0x30
[  +0.000006]  drm_client_release+0x1c4/0x2b0
[  +0.000006]  drm_fbdev_ttm_fb_destroy+0xd2/0x120 [drm_ttm_helper]
[  +0.000007]  put_fb_info+0x97/0xe0
[  +0.000007]  unregister_framebuffer+0x197/0x380
[  +0.000005]  drm_fb_helper_unregister_info+0x94/0x100
[  +0.000005]  drm_fbdev_client_unregister+0x3c/0x80
[  +0.000007]  drm_client_dev_unregister+0x144/0x330
[  +0.000006]  drm_dev_unregister+0x49/0x1b0
[  +0.000006]  drm_dev_unplug+0x4c/0xd0
[  +0.000006]  amdgpu_pci_remove+0x58/0x130 [amdgpu]
[  +0.000484]  pci_device_remove+0xae/0x1e0
[  +0.000008]  device_remove+0xc7/0x180
[  +0.000007]  device_release_driver_internal+0x3d4/0x5a0
[  +0.000006]  device_release_driver+0x12/0x20
[  +0.000007]  pci_stop_bus_device+0x104/0x150
[  +0.000006]  pci_stop_and_remove_bus_device_locked+0x1b/0x40
[  +0.000006]  remove_store+0xd7/0xf0
[  +0.000006]  dev_attr_store+0x3f/0x80
[  +0.000005]  sysfs_kf_write+0x125/0x1d0
[  +0.000006]  kernfs_fop_write_iter+0x2ea/0x490
[  +0.000006]  vfs_write+0x90d/0xe70
[  +0.000006]  ksys_write+0x119/0x220
[  +0.000006]  __x64_sys_write+0x72/0xc0
[  +0.000006]  x64_sys_call+0x18ab/0x26f0
[  +0.000005]  do_syscall_64+0x7c/0x170
[  +0.000006]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

[  +0.000012] The buggy address belongs to the object at ffff88812eec8000
               which belongs to the cache kmalloc-rnd-07-4k of size 4096
[  +0.000016] The buggy address is located 3160 bytes inside of
               freed 4096-byte region [ffff88812eec8000, ffff88812eec9000)

[  +0.000023] The buggy address belongs to the physical page:
[  +0.000009] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x12eec8
[  +0.000007] head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[  +0.000005] flags: 0x17ffffc0000040(head|node=0|zone=2|lastcpupid=0x1fffff)
[  +0.000007] page_type: f5(slab)
[  +0.000008] raw: 0017ffffc0000040 ffff888100054500 dead000000000122 0000000000000000
[  +0.000005] raw: 0000000000000000 0000000080040004 00000000f5000000 0000000000000000
[  +0.000006] head: 0017ffffc0000040 ffff888100054500 dead000000000122 0000000000000000
[  +0.000005] head: 0000000000000000 0000000080040004 00000000f5000000 0000000000000000
[  +0.000006] head: 0017ffffc0000003 ffffea0004bbb201 ffffffffffffffff 0000000000000000
[  +0.000005] head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000
[  +0.000005] page dumped because: kasan: bad access detected

[  +0.000010] Memory state around the buggy address:
[  +0.000009]  ffff88812eec8b00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  +0.000012]  ffff88812eec8b80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  +0.000011] >ffff88812eec8c00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  +0.000011]                                                     ^
[  +0.000010]  ffff88812eec8c80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  +0.000011]  ffff88812eec8d00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  +0.000011] ==================================================================

The use-after-free occurs because a delayed work item (`suspend_work`) may still
be pending or running when resources it accesses are freed during device removal
or file close. The previous code used `flush_work(&fpriv->evf_mgr.suspend_work.work)`,
which does not wait for delayed work that has not yet started. As a result, the
delayed work could run after its memory was freed, causing a use-after-free.
By switching to `flush_delayed_work(&fpriv->evf_mgr.suspend_work)`, we ensure that
the kernel waits for both queued and delayed work to finish before
freeing memory, closing this race.

Fixes: adba092973 ("drm/amdgpu: Fix Illegal opcode in command stream Error")
Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-07 13:50:42 -04:00
Vitaly Prosyak a73345b866 Revert "drm/amdgpu: fix slab-use-after-free in amdgpu_userq_mgr_fini"
This reverts commit 5fb90421fa.

The original patch moved `amdgpu_userq_mgr_fini()` to the driver's
`postclose` callback, which is called after `drm_gem_release()` in
the DRM file cleanup sequence.If a user application crashes or aborts
without cleaning up its user queues, 'drm_gem_release()` may free
GEM objects that are still referenced by active user queues, leading
to use-after-free. By reverting, we ensure that user queues are
disabled and cleaned up before any GEM objects are released,
preventing this class of bug. However, this reintroduces a race
during PCI hot-unplug, where device removal can race with per-file
cleanup, leading to use-after-free in suspend/unplug paths.
This will be fixed in the next patch.

Fixes: 5fb90421fa ("drm/amdgpu: fix slab-use-after-free in amdgpu_userq_mgr_fini+0x70c")
Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-07 13:48:50 -04:00
Alex Deucher 0c3c2e334c drm/amdgpu/sdma: allow caller to handle kernel rings in engine reset
Add a parameter to amdgpu_sdma_reset_engine() to let the
caller handle the kernel rings.  This allows the kernel
rings to back up their unprocessed state if the reset comes in
via the drm scheduler rather than KFD.

Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-07 13:48:25 -04:00
Alex Deucher f8410a17d3 drm/amdgpu/sdma: consolidate engine reset handling
Move the force completion handling into the common
engine reset function.  No need to duplicate it for
every IP version.

Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-07 13:48:20 -04:00