mirror-linux

Commit Graph

Author	SHA1	Message	Date
Alex Deucher	94bd7bf2c9	drm/amdgpu: don't enable SMU on cyan skillfish Cyan skillfish uses different SMU firmware. Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-05 17:38:39 -04:00
Alex Deucher	fa819e3a7c	drm/amdgpu: add support for cyan skillfish gpu_info Some SOCs which are part of the cyan skillfish family rely on an explicit firmware for IP discovery. Add support for the gpu_info firmware. Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-05 17:38:39 -04:00
Alex Deucher	9e6a5cf1a2	drm/amdgpu: add support for cyan skillfish without IP discovery For platforms without an IP discovery table. Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-05 17:38:38 -04:00
Alex Deucher	e8529dbc75	drm/amdgpu: add ip offset support for cyan skillfish For chips that don't have IP discovery tables. Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-05 17:38:38 -04:00
Srinivasan Shanmugam	38ab33dbea	drm/amdgpu: Fix function header names in amdgpu_connectors.c Align the function headers for `amdgpu_max_hdmi_pixel_clock` and `amdgpu_connector_dvi_mode_valid` with the function implementations so they match the expected kdoc style. Fixes the below: drivers/gpu/drm/amd/amdgpu/amdgpu_connectors.c:1199: warning: This comment starts with '/*', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst Returns the maximum supported HDMI (TMDS) pixel clock in KHz. drivers/gpu/drm/amd/amdgpu/amdgpu_connectors.c:1212: warning: This comment starts with '/*', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst Validates the given display mode on DVI and HDMI connectors. Fixes: `585b2f685c` ("drm/amdgpu: Respect max pixel clock for HDMI and DVI-D (v2)") Cc: Christian König <christian.koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-05 17:38:38 -04:00
Yifan Zhang	fa7c99f04f	amd/amdkfd: correct mem limit calculation for small APUs Current mem limit check leaks some GTT memory (reserved_for_pt reserved_for_ras + adev->vram_pin_size) for small APUs. Since carveout VRAM is tunable on APUs, there are three case regarding the carveout VRAM size relative to GTT: 1. 0 < carveout < gtt apu_prefer_gtt = true, is_app_apu = false 2. carveout > gtt / 2 apu_prefer_gtt = false, is_app_apu = false 3. 0 = carveout apu_prefer_gtt = true, is_app_apu = true It doesn't make sense to check below limitation in case 1 (default case, small carveout) because the values in the below expression are mixed with carveout and gtt. adev->kfd.vram_used[xcp_id] + vram_needed > vram_size - reserved_for_pt - reserved_for_ras - atomic64_read(&adev->vram_pin_size) gtt: kfd.vram_used, vram_needed, vram_size carveout: reserved_for_pt, reserved_for_ras, adev->vram_pin_size In case 1, vram allocation will go to gtt domain, skip vram check since ttm_mem_limit check already cover this allocation. Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com> Reviewed-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-05 17:38:38 -04:00
Alex Deucher	276e8beb2a	drm/amdgpu/userq: add force completion helpers Add support for forcing completion of userq fences. This is needed for userq resets and asic resets so that we can set the error on the fence and force completion. Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-05 17:38:38 -04:00
Alex Deucher	c5da9e9c02	drm/amdgpu: add user queue reset source Track resets from user queues. Reviewed-by: Christian König <christian.koenig@amd.com> Reviewed-by: Sunil Khatri <sunil.khatri@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-05 17:38:38 -04:00
Jesse.Zhang	724471254e	drm/amdgpu/mes12: implement detect and reset callback Implement support for the hung queue detect and reset functionality. v2: Always use AMDGPU_MES_SCHED_PIPE Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-05 17:38:38 -04:00
Jesse.Zhang	b28cfc8305	drm/amdgpu/mes11: implement detect and reset callback Implement support for the hung queue detect and reset functionality. Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-05 17:38:38 -04:00
Jesse.Zhang	78e1222fbf	drm/amdgpu/mes: add front end for detect and reset hung queue Helper function to detect and reset hung queues. MES will return an array of doorbell indices of which queues are hung and were optionally reset. v2: Clear the doorbell array before detection Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-05 17:38:31 -04:00
Jesse.Zhang	6abd725fdf	drm/amd/amdgpu: Implement MES suspend/resume gang functionality for v12 This commit implements the actual MES (Micro Engine Scheduler) suspend and resume gang operations for version 12 hardware. Previously these functions were just stubs returning success. v2: Always use AMDGPU_MES_SCHED_PIPE Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>	2025-09-05 17:38:07 -04:00
Jesse.Zhang	2318336573	drm/amdgpu: Add preempt and restore callbacks to userq funcs Add two new function pointers to struct amdgpu_userq_funcs: - preempt: To handle preemption of user mode queues - restore: To restore preempted user mode queues These callbacks will allow the driver to properly manage queue preemption and restoration when needed, such as during context switching or priority changes. Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-05 17:37:50 -04:00
Sunil Khatri	878f33f390	drm/amdgpu: fix the formating for debugfs print Fix the format of debugfs print in the mqd. Need to add a colon so parser can parse it properly. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-05 16:07:02 -04:00
Alex Deucher	1e18746381	drm/amd: add more cyan skillfish PCI ids Add additional PCI IDs to the cyan skillfish family. Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-05 16:06:58 -04:00
Sunil Khatri	7e0fc7b2b7	drm/amdgpu: add more information in debugfs to pagetable dump Add more information in the debugfs which is needed to dump a pagetable correctly for userqueues where vmid is not known in the kernel. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-05 16:06:52 -04:00
Xiang Liu	f320ed01cf	drm/amdgpu: Correct info field of bad page threshold exceed CPER Correct valid_bits and ms_chk_bits of section info field for bad page threshold exceed CPER to match OOB's behavior. Signed-off-by: Xiang Liu <xiang.liu@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-05 16:06:49 -04:00
Liao Yuanhong	086f66edf9	drm/amdgpu/vcn: Remove redundant ternary operators For ternary operators in the form of "a ? true : false", if 'a' itself returns a boolean result, the ternary operator can be omitted. Remove redundant ternary operators to clean up the code. Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Liao Yuanhong <liaoyuanhong@vivo.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-05 16:06:34 -04:00
Liao Yuanhong	9502b09933	drm/amdgpu/jpeg: Remove redundant ternary operators For ternary operators in the form of "a ? true : false", if 'a' itself returns a boolean result, the ternary operator can be omitted. Remove redundant ternary operators to clean up the code. Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Liao Yuanhong <liaoyuanhong@vivo.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-05 16:06:29 -04:00
Liao Yuanhong	8a4bc4508c	drm/amdgpu/ih: Remove redundant ternary operators For ternary operators in the form of "a ? false : true", if 'a' itself returns a boolean result, the ternary operator can be omitted. Remove redundant ternary operators to clean up the code. Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Liao Yuanhong <liaoyuanhong@vivo.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-05 16:00:15 -04:00
Liao Yuanhong	60df3e81d7	drm/amdgpu/gmc: Remove redundant ternary operators For ternary operators in the form of "a ? false : true", if 'a' itself returns a boolean result, the ternary operator can be omitted. Remove redundant ternary operators to clean up the code. Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Liao Yuanhong <liaoyuanhong@vivo.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-05 16:00:13 -04:00
Liao Yuanhong	d261e744af	drm/amdgpu/gfx: Remove redundant ternary operators For ternary operators in the form of "a ? false : true", if 'a' itself returns a boolean result, the ternary operator can be omitted. Remove redundant ternary operators to clean up the code. Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Liao Yuanhong <liaoyuanhong@vivo.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-05 16:00:09 -04:00
Liao Yuanhong	7670daf65a	drm/amdgpu/amdgpu_cper: Remove redundant ternary operators For ternary operators in the form of "a ? false : true", if 'a' itself returns a boolean result, the ternary operator can be omitted. Remove redundant ternary operators to clean up the code. Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Liao Yuanhong <liaoyuanhong@vivo.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-05 16:00:04 -04:00
Colin Ian King	4320fd9e0d	drm/amd/amdgpu: Fix a less than zero check on a uint32_t struct field Currently the error check from the call to mes_v12_inv_tlb_convert_hub_id is always false because a uint32_t struct field hub_id is being used to to perform the less than zero error check. Fix this by using the int variable ret to perform the check. Fixes: `87e6505261` ("drm/amd/amdgpu : Use the MES INV_TLBS API for tlb invalidation on gfx12") Reviewed-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-05 15:59:11 -04:00
Dave Airlie	6dc1d3c191	drm-misc-next for v6.18: Cross-subsystem Changes: - Update a number of DT bindings for STM32MP25 Arm SoC Core Changes: gem: - Simplify locking for GPUVM panel-backlight-quirks: - Add additional quirks for EDID, DMI, brightness sched: - Fix race condition in trace code - Clean up sysfb: - Clean up Driver Changes: amdgpu: - Give kernel jobs a unique id for better tracing amdxdna: - Improve error reporting bridge: - Improve ref counting on bridge management - adv7511: Provide SPD and HDMI infoframes - it6505: Replace crypto_shash with sha() - synopsys: Add support for DW DPTX Controller plus DT bindings gud: - Replace simple-KMS pipe with regular atomic helpers imagination: - Improve power management - Add support for TH1520 GPU - Support Risc-V architectures ivpu: - Clean up nouveau: - Improve error reporting panthor: - Fail VM bind if BO has offset - Clean up rcar-du: - Make number of lanes configurable rockchip: - Add support for RK3588 DPTX output rocket: - Use kfree() and sizeof() correctly - Test DMA status - Clean up sitronix: - st7571-i2c: Add support for inverted displays and 2-bit grayscale - Clean up stm: - ltdc: Add support support for STM32MP257F-EV1 plus DT bindings tidss: - Convert to kernel's FIELD_ macros v3d: - Improve job management and locking -----BEGIN PGP SIGNATURE----- iQFPBAABCgA5FiEEchf7rIzpz2NEoWjlaA3BHVMLeiMFAmi5Vt0bFIAAAAAABAAO bWFudTIsMi41KzEuMTEsMiwyAAoJEGgNwR1TC3ojdpgIAJE+8M0PiEcOX6b5gRqW kK8ymkVHXXyNbMYst/KsAHLStgyXnI7l2Rhd1ZY3lPO5pm+a7ehwKe+GTpdWoPTF Eq7uMIgdjNX2Zqwr9QfsRAYAwMvcn6tNVvCbAeUzo7PNHaZXj1Q8K1VHpJV3BbWd 1w+OR2b12IpDx7KA1dFYVkL2ZrAE9B7FOLdrHFAOW75N/bcLwrQ4WREfirOsXGN9 TbqCknB3PQKhjlFC86UU7DGImKdPTYvt72K3pPF0Y4eCYmQoKFYqQsFmsdqUEQsa Z2jSbXAFKV3DcJwMzEmNFC//6QeqEJFJMvYYqaqzTW9kMnA4anO/8JML0uTWLxRd 0rw= =NzwW -----END PGP SIGNATURE----- Merge tag 'drm-misc-next-2025-09-04' of https://gitlab.freedesktop.org/drm/misc/kernel into drm-next drm-misc-next for v6.18: Cross-subsystem Changes: - Update a number of DT bindings for STM32MP25 Arm SoC Core Changes: gem: - Simplify locking for GPUVM panel-backlight-quirks: - Add additional quirks for EDID, DMI, brightness sched: - Fix race condition in trace code - Clean up sysfb: - Clean up Driver Changes: amdgpu: - Give kernel jobs a unique id for better tracing amdxdna: - Improve error reporting bridge: - Improve ref counting on bridge management - adv7511: Provide SPD and HDMI infoframes - it6505: Replace crypto_shash with sha() - synopsys: Add support for DW DPTX Controller plus DT bindings gud: - Replace simple-KMS pipe with regular atomic helpers imagination: - Improve power management - Add support for TH1520 GPU - Support Risc-V architectures ivpu: - Clean up nouveau: - Improve error reporting panthor: - Fail VM bind if BO has offset - Clean up rcar-du: - Make number of lanes configurable rockchip: - Add support for RK3588 DPTX output rocket: - Use kfree() and sizeof() correctly - Test DMA status - Clean up sitronix: - st7571-i2c: Add support for inverted displays and 2-bit grayscale - Clean up stm: - ltdc: Add support support for STM32MP257F-EV1 plus DT bindings tidss: - Convert to kernel's FIELD_ macros v3d: - Improve job management and locking Signed-off-by: Dave Airlie <airlied@redhat.com> From: Thomas Zimmermann <tzimmermann@suse.de> Link: https://lore.kernel.org/r/20250904090932.GA193997@linux.fritz.box	2025-09-05 11:49:01 +10:00
Colin Ian King	467e00b30d	drm/amd/amdgpu: Fix missing error return on kzalloc failure Currently the kzalloc failure check just sets reports the failure and sets the variable ret to -ENOMEM, which is not checked later for this specific error. Fix this by just returning -ENOMEM rather than setting ret. Fixes: `4fb9307154` ("drm/amd/amdgpu: remove redundant host to psp cmd buf allocations") Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit `1ee9d1a096`)	2025-09-03 16:27:56 -04:00
Gustavo A. R. Silva	1e6d36e15b	drm/amdgpu/amdkfd: Avoid a couple hundred -Wflex-array-member-not-at-end warnings -Wflex-array-member-not-at-end was introduced in GCC-14, and we are getting ready to enable it, globally. Move the conflicting declarations to the end of the corresponding structures. Notice that `struct dev_pagemap` is a flexible structure, this is a structure that contains a flexible-array member. struct dev_pagemap always has room for at least one range. amdgpu only uses a single range. Therefore no change are needed to the allocation of struct amdgpu_device. Fix 283 of the following type of warnings: 283 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h:111:28: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end] Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Felix Kuehling <felix.kuehling@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-02 17:36:20 -04:00
Colin Ian King	1ee9d1a096	drm/amd/amdgpu: Fix missing error return on kzalloc failure Currently the kzalloc failure check just sets reports the failure and sets the variable ret to -ENOMEM, which is not checked later for this specific error. Fix this by just returning -ENOMEM rather than setting ret. Fixes: `4fb9307154` ("drm/amd/amdgpu: remove redundant host to psp cmd buf allocations") Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-02 15:57:10 -04:00
Timur Kristóf	6df0768c0d	drm/amd/pm: Remove wm_low and wm_high fields from amdgpu_crtc (v2) These fields were only used by si_dpm and are not necessary anymore. They also may have been incorrect because: - wm_high was set to the LOW_WATERMARK field of watermark A. - wm_low was not set on DCE 6 and was always zero. Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Timur Kristóf <timur.kristof@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-02 15:57:01 -04:00
Timur Kristóf	c661219cd7	drm/amdgpu: Power up UVD 3 for FW validation (v2) Unlike later versions, UVD 3 has firmware validation. For this to work, the UVD should be powered up correctly. When DPM is enabled and the display clock is off, the SMU may choose a power state which doesn't power the UVD, which can result in failure to initialize UVD. v2: Add code comments to explain about the UVD power state and how UVD clock is turned on/off. Fixes: `b38f3e80ec` ("drm amdgpu: SI UVD v3_1 (v2)") Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Timur Kristóf <timur.kristof@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-02 15:54:03 -04:00
David Francis	4d82724f7f	drm/amdgpu: Add mapping info option for GEM_OP ioctl Add new GEM_OP_IOCTL option GET_MAPPING_INFO, which returns a list of mappings associated with a given bo, along with their positions and offsets. Userspace for this and the previous change can be found at: https://github.com/checkpoint-restore/criu/pull/2613 Signed-off-by: David Francis <David.Francis@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-02 15:53:33 -04:00
David Francis	f9db1fc52c	drm/amdgpu: Add ioctl to get all gem handles for a process Add new ioctl DRM_IOCTL_AMDGPU_GEM_LIST_HANDLES. This ioctl returns a list of bos with their handles, sizes, and flags and domains. This ioctl is meant to be used during CRIU checkpoint and provide information needed to reconstruct the bos in CRIU restore. Userspace for this and the next change can be found at https://github.com/checkpoint-restore/criu/pull/2613 Signed-off-by: David Francis <David.Francis@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-02 15:34:00 -04:00
David Francis	0317e0e224	drm/amdgpu: Allow more flags to be set on gem create. The GEM create flag AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE specifies that gem memory contains sensitive information and should be cleared to prevent snooping. The COHERENT and UNCACHED gem create flags enable memory features related to sharing memory across devices. For CRIU we need to re-create KFD BOs through the GEM_CREATE IOCTL, so allow those KFD specific flags here as well. This will also aid us in the future and allows to move the KFD components over using the render node for allocations. Signed-off-by: David Francis <David.Francis@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-09-02 15:34:00 -04:00
Dave Airlie	14579a6f18	amd-drm-next-6.18-2025-08-29: amdgpu: - Replay fixes - RAS updates - VCN SRAM load fixes - EDID read fixes - eDP ALPM support - AUX fixes - Documenation updates - Rework how PTE flags are generated - DCE6 fixes - VCN devcoredump cleanup - MMHUB client id fixes - SR-IOV fixes - VRR fixes - VCN 5.0.1 RAS support - Backlight fixes - UserQ fixes - Misc code cleanups - SMU 13.0.12 updates - Expanded PCIe DPC support - Expanded VCN reset support - SMU 13.0.x Updates - VPE per queue reset support - Cusor rotation fix - DSC fixes - GC 12 MES TLB invalidation update - Cursor fixes - Non-DC TMDS clock validation fix amdkfd: - debugfs fixes - Misc code cleanups - Page migration fixes - Partition fixes - SVM fixes radeon: - Misc code cleanups -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQQgO5Idg2tXNTSZAr293/aFa7yZ2AUCaLH5kwAKCRC93/aFa7yZ 2KiRAQDGoc3l+19yNjONLmFDlsX5McDw518ivMXMU7JAe25nwgEAk22/agrISdF9 1ADoZJv6s3iakMC+PieRNvqIZS/lGAs= =2Iba -----END PGP SIGNATURE----- Merge tag 'amd-drm-next-6.18-2025-08-29' of https://gitlab.freedesktop.org/agd5f/linux into drm-next amd-drm-next-6.18-2025-08-29: amdgpu: - Replay fixes - RAS updates - VCN SRAM load fixes - EDID read fixes - eDP ALPM support - AUX fixes - Documenation updates - Rework how PTE flags are generated - DCE6 fixes - VCN devcoredump cleanup - MMHUB client id fixes - SR-IOV fixes - VRR fixes - VCN 5.0.1 RAS support - Backlight fixes - UserQ fixes - Misc code cleanups - SMU 13.0.12 updates - Expanded PCIe DPC support - Expanded VCN reset support - SMU 13.0.x Updates - VPE per queue reset support - Cusor rotation fix - DSC fixes - GC 12 MES TLB invalidation update - Cursor fixes - Non-DC TMDS clock validation fix amdkfd: - debugfs fixes - Misc code cleanups - Page migration fixes - Partition fixes - SVM fixes radeon: - Misc code cleanups Signed-off-by: Dave Airlie <airlied@redhat.com> From: Alex Deucher <alexander.deucher@amd.com> Link: https://lore.kernel.org/r/20250829190848.1921648-1-alexander.deucher@amd.com	2025-09-02 09:35:54 +10:00
Pierre-Eric Pelloux-Prayer	256576ed68	drm/amdgpu: give each kernel job a unique id Userspace jobs have drm_file.client_id as a unique identifier as job's owners. For kernel jobs, we can allocate arbitrary values - the risk of overlap with userspace ids is small (given that it's a u64 value). In the unlikely case the overlap happens, it'll only impact trace events. Since this ID is traced in the gpu_scheduler trace events, this allows to determine the source of each job sent to the hardware. To make grepping easier, the IDs are defined as they will appear in the trace output. Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com> Link: https://lore.kernel.org/r/20250604122827.2191-1-pierre-eric.pelloux-prayer@amd.com	2025-09-01 10:49:31 +05:30
Alex Deucher	71403f58b4	drm/amdgpu: drop hw access in non-DC audio fini We already disable the audio pins in hw_fini so there is no need to do it again in sw_fini. Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4481 Cc: oushixiong <oushixiong1025@163.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit `5eeb16ca72`) Cc: stable@vger.kernel.org	2025-08-29 11:13:38 -04:00
Alex Deucher	5171848bdf	drm/amdgpu/mes11: make MES_MISC_OP_CHANGE_CONFIG failure non-fatal If the firmware is too old, just warn and return success. Fixes: `27b7915147` ("drm/amdgpu/mes: keep enforce isolation up to date") Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4414 Cc: shaoyun.Liu@amd.com Reviewed-by: Shaoyun.liu <Shaoyun.liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit `9f28af76fa`) Cc: stable@vger.kernel.org	2025-08-29 11:12:44 -04:00
Jesse.Zhang	2d41a4bfee	drm/amdgpu/sdma: bump firmware version checks for user queue support Using the previous firmware could lead to problems with PROTECTED_FENCE_SIGNAL commands, specifically causing register conflicts between MCU_DBG0 and MCU_DBG1. The updated firmware versions ensure proper alignment and unification of the SDMA_SUBOP_PROTECTED_FENCE_SIGNAL value with SDMA 7.x, resolving these hardware coordination issues Fixes: `e8cca30d8b` ("drm/amdgpu/sdma6: add ucode version checks for userq support") Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit `aab8b689ad`) Cc: stable@vger.kernel.org	2025-08-29 11:11:45 -04:00
Timur Kristóf	585b2f685c	drm/amdgpu: Respect max pixel clock for HDMI and DVI-D (v2) Update the legacy (non-DC) display code to respect the maximum pixel clock for HDMI and DVI-D. Reject modes that would require a higher pixel clock than can be supported. Also update the maximum supported HDMI clock value depending on the ASIC type. For reference, see the DC code: check max_hdmi_pixel_clock in dce*_resource.c v2: Fix maximum clocks for DVI-D and DVI/HDMI adapters. Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Timur Kristóf <timur.kristof@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-29 10:23:09 -04:00
Alex Deucher	5eeb16ca72	drm/amdgpu: drop hw access in non-DC audio fini We already disable the audio pins in hw_fini so there is no need to do it again in sw_fini. Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4481 Cc: oushixiong <oushixiong1025@163.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-29 10:13:42 -04:00
Alex Deucher	9f28af76fa	drm/amdgpu/mes11: make MES_MISC_OP_CHANGE_CONFIG failure non-fatal If the firmware is too old, just warn and return success. Fixes: `27b7915147` ("drm/amdgpu/mes: keep enforce isolation up to date") Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4414 Cc: shaoyun.Liu@amd.com Reviewed-by: Shaoyun.liu <Shaoyun.liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-29 10:12:00 -04:00
Lijo Lazar	9c0442286f	drm/amdgpu: Check vcn state before profile switch The patch uses power state of VCN instances for requesting video profile. In idle worker of a vcn instance, when there is no outstanding submisssion or fence, the instance is put to power gated state. When all instances are powered off that means video profile is no longer required. A request is made to turn off video profile. A job submission starts with begin_use of ring, and at that time vcn instance state is changed to power on. Subsequently a check is made for active video profile, and if not active, a request is made. Fixes: `3b669df92c` ("drm/amdgpu/vcn: adjust workload profile handling") Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-29 10:09:33 -04:00
Mangesh Gadre	37551277df	drm/amdgpu: Avoid vcn v5.0.1 poison irq call trace on sriov guest Sriov guest side doesn't init ras feature hence the poison irq shouldn't be put during hw fini Signed-off-by: Mangesh Gadre <Mangesh.Gadre@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-29 10:09:29 -04:00
Mangesh Gadre	01152c30ee	drm/amdgpu: Avoid jpeg v5.0.1 poison irq call trace on sriov guest Sriov guest side doesn't init ras feature hence the poison irq shouldn't be put during hw fini Signed-off-by: Mangesh Gadre <Mangesh.Gadre@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-29 10:09:23 -04:00
Yang Wang	89c3503bc6	drm/amd/amdgpu: unified amdgpu ip block name v1: 1. Unified amdgpu ip block name print with format "{ip_type}_v{major}_{minor}_{rev}" 2. Avoid IP block name conflicts for SMU/PSP ip block v2: Update IP block print format to keep legacy IP block name (Alex) "{ip_type}_v{major}_{minor}_{rev} ({funcs->name})" Signed-off-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Asad Kamal <asad.kamal@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-29 10:09:16 -04:00
Jesse.Zhang	aab8b689ad	drm/amdgpu/sdma: bump firmware version checks for user queue support Using the previous firmware could lead to problems with PROTECTED_FENCE_SIGNAL commands, specifically causing register conflicts between MCU_DBG0 and MCU_DBG1. The updated firmware versions ensure proper alignment and unification of the SDMA_SUBOP_PROTECTED_FENCE_SIGNAL value with SDMA 7.x, resolving these hardware coordination issues Fixes: `e8cca30d8b` ("drm/amdgpu/sdma6: add ucode version checks for userq support") Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-29 10:07:24 -04:00
Xiang Liu	c8d6e90abe	drm/amdgpu: Notify pmfw bad page threshold exceeded Notify pmfw when bad page threshold is exceeded, no matter the module parameter 'bad_page_threshold' is set or not. Signed-off-by: Xiang Liu <xiang.liu@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-29 10:07:19 -04:00
David (Ming Qiang) Wu	b1d83546cf	drm/amdgpu/vcn: add instance number to VCN version message For multiple VCN instances case we get multiple lines of the same message like below: amdgpu 0000:43:00.0: amdgpu: Found VCN firmware Version ENC: 1.24 DEC: 9 VEP: 0 Revision: 11 amdgpu 0000:43:00.0: amdgpu: Found VCN firmware Version ENC: 1.24 DEC: 9 VEP: 0 Revision: 11 By adding instance number to the log message for multiple VCN instances, each line will clearly indicate which VCN instance it refers to. Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: David (Ming Qiang) Wu <David.Wu3@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-29 10:07:10 -04:00
David (Ming Qiang) Wu	010219ccec	drm/amdgpu/vcn: remove unused code in vcn_v4_0.c Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: David (Ming Qiang) Wu <David.Wu3@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-29 10:07:05 -04:00
Dave Airlie	4b1c24ef50	amd-drm-fixes-6.17-2025-08-28: amdgpu: - UserQ fixes - Revert CSA fix - SR-IOV fix -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQQgO5Idg2tXNTSZAr293/aFa7yZ2AUCaLCRbwAKCRC93/aFa7yZ 2F31APwNWU1LwSwWl/bBF2Up8kbmwK3IIDJ/C4mBUJDtlUzsBAEAnb/AB2Aj0ppU Msod+68vx5TUituJJUSv+XjeNa/brQg= =Qqsb -----END PGP SIGNATURE----- Merge tag 'amd-drm-fixes-6.17-2025-08-28' of https://gitlab.freedesktop.org/agd5f/linux into drm-fixes amd-drm-fixes-6.17-2025-08-28: amdgpu: - UserQ fixes - Revert CSA fix - SR-IOV fix Signed-off-by: Dave Airlie <airlied@redhat.com> From: Alex Deucher <alexander.deucher@amd.com> Link: https://lore.kernel.org/r/20250828173904.75850-1-alexander.deucher@amd.com	2025-08-29 08:50:44 +10:00
Alex Deucher	c767d74a9c	drm/amdgpu/userq: fix error handling of invalid doorbell If the doorbell is invalid, be sure to set the r to an error state so the function returns an error. Reviewed-by: David (Ming Qiang) Wu <David.Wu3@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit `7e2a5b0a9a`) Cc: stable@vger.kernel.org	2025-08-27 14:01:52 -04:00
Jesse.Zhang	ee38ea0ae4	drm/amdgpu: update firmware version checks for user queue support The minimum firmware versions required for user queue functionality have been increased to address an issue where the queue privilege state was lost during queue connect operations. The problem occurred because the privilege state was being restored to its initial value at the beginning of the function, overwriting the state that was properly set during the queue connect case. This commit updates the minimum version requirements: - ME firmware from 2390 to 2420 - PFP firmware from 2530 to 2580 - MEC firmware from 2600 to 2650 - MES firmware remains at 120 These updated firmware versions contain the necessary fixes to properly maintain queue privilege state throughout connect operations. Fixes: `61ca97e959` ("drm/amdgpu: Add fw minimum version check for usermode queue") Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit `5f976c9939`) Cc: stable@vger.kernel.org	2025-08-27 14:01:32 -04:00
Alex Deucher	ac4ed2da4c	Revert "drm/amdgpu: fix incorrect vm flags to map bo" This reverts commit `b08425fa77`. Revert this to align with 6.17 because the fixes tag was wrong on this commit. Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit `be33e8a239`)	2025-08-27 14:00:38 -04:00
Alex Deucher	29f155c5e8	drm/amdgpu/gfx12: set MQD as appriopriate for queue types Set the MQD as appropriate for the kernel vs user queues. Acked-by: Christian König <christian.koenig@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit `7b9110f289`) Cc: stable@vger.kernel.org	2025-08-27 13:59:53 -04:00
Alex Deucher	27f5e0c132	drm/amdgpu/gfx11: set MQD as appriopriate for queue types Set the MQD as appropriate for the kernel vs user queues. Acked-by: Christian König <christian.koenig@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit `063d668320`) Cc: stable@vger.kernel.org	2025-08-27 13:59:25 -04:00
Alex Deucher	7e2a5b0a9a	drm/amdgpu/userq: fix error handling of invalid doorbell If the doorbell is invalid, be sure to set the r to an error state so the function returns an error. Reviewed-by: David (Ming Qiang) Wu <David.Wu3@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-27 13:57:51 -04:00
Jesse.Zhang	5f976c9939	drm/amdgpu: update firmware version checks for user queue support The minimum firmware versions required for user queue functionality have been increased to address an issue where the queue privilege state was lost during queue connect operations. The problem occurred because the privilege state was being restored to its initial value at the beginning of the function, overwriting the state that was properly set during the queue connect case. This commit updates the minimum version requirements: - ME firmware from 2390 to 2420 - PFP firmware from 2530 to 2580 - MEC firmware from 2600 to 2650 - MES firmware remains at 120 These updated firmware versions contain the necessary fixes to properly maintain queue privilege state throughout connect operations. Fixes: `61ca97e959` ("drm/amdgpu: Add fw minimum version check for usermode queue") Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-27 13:57:51 -04:00
Alex Deucher	ec813f384b	drm/amdgpu/vpe: cancel delayed work in hw_fini We need to cancel any outstanding work at both suspend and driver teardown. Move the cancel to hw_fini which gets called in both cases. Reviewed-by: David (Ming Qiang) Wu <David.Wu3@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-27 13:57:51 -04:00
David (Ming Qiang) Wu	061a09b4dc	drm/amdgpu/vcn: remove unused code in vcn_v1_0.c Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: David (Ming Qiang) Wu <David.Wu3@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-27 13:57:51 -04:00
Shaoyun Liu	87e6505261	drm/amd/amdgpu : Use the MES INV_TLBS API for tlb invalidation on gfx12 From MES version 0x81, it provide the new API INV_TLBS that support invalidate tlbs with PASID. Signed-off-by: Shaoyun Liu <shaoyun.liu@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-27 13:57:51 -04:00
Jesse.Zhang	a7a411e246	drm/amdgpu: fix shift-out-of-bounds in amdgpu_debugfs_jpeg_sched_mask_set Fix a UBSAN shift-out-of-bounds warning in amdgpu_debugfs_jpeg_sched_mask_set when the shift exponent reaches or exceeds 32 bits. The issue occurred because a 32-bit integer '1' was being shifted by up to 32 bits, which is undefined behavior. Replace '1' with '1ULL' to ensure 64-bit arithmetic, matching the u64 type of 'val' and preventing the shift overflow. This is consistent with the existing mask calculation that already uses 1ULL. The error manifested as: UBSAN: shift-out-of-bounds in drivers/gpu/drm/amd/amdgpu/amdgpu_jpeg.c:373:17 shift exponent 32 is too large for 32-bit type 'int' v2: remove debug log Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Suggested-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-27 13:57:51 -04:00
Jack Xiao	c350d9e268	Reapply "drm/amdgpu: fix incorrect vm flags to map bo" It should use vm flags instead of pte flags to specify bo vm attributes. This reverts commit 1263ceea2a1327014d9de2858a122f3c27dfa4dd. Reapply this patch with the proper fixes tag. Fixes: `6716a823d1` ("drm/amdgpu: rework how PTE flags are generated v3") Signed-off-by: Jack Xiao <Jack.Xiao@amd.com> Reviewed-by: Likun Gao <Likun.Gao@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-27 13:57:50 -04:00
Alex Deucher	be33e8a239	Revert "drm/amdgpu: fix incorrect vm flags to map bo" This reverts commit `b08425fa77`. Revert this to align with 6.17 because the fixes tag was wrong on this commit. Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-27 13:57:50 -04:00
Alex Deucher	1c65502f81	drm/amdgpu/vpe: add ring reset support Implement ring reset for VPE. Similar to VCN and JPEG, just powergate the the IP to reset it. v2: Properly set per queue reset flag Reviewed-by: Jesse Zhang <jesse.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-27 13:57:49 -04:00
Alex Deucher	fa064d50b7	drm/amdgpu/vcn: drop extra cancel_delayed_work_sync() We already call this in the hw_fini() methods for all VCN instances, so no need to call it again in amdgpu_vcn_suspend(). Tested-by: David (Ming Qiang) Wu <David.Wu3@amd.com> Reviewed-by: Leo Liu <leo.liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-27 13:57:49 -04:00
Yifan Zhang	e68197aa2b	drm/amdgpu: remove redundant AMDGPU_HAS_VRAM AMDGPU_HAS_VRAM is redundant with is_app_apu, as both refer to APUs with no carve-out. Since AMDGPU_HAS_VRAM only occurs once, remove AMDGPU_HAS_VRAM definition. The tmr allocation can be covered with AMDGPU_GEM_DOMAIN_GTT \| AMDGPU_GEM_DOMAIN_VRAM in both vram and non vram ASICs. Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-27 13:57:48 -04:00
Ce Sun	d8442bcad0	drm/amdgpu: Correct the loss of aca bank reg info By polling, poll ACA bank count to ensure that valid ACA bank reg info can be obtained v2: add corresponding delay before send msg to SMU to query mca bank info (Stanley) v3: the loop cannot exit. (Thomas) v4: remove amdgpu_aca_clear_bank_count. (Kevin) v5: continuously inject ce. If a creation interruption occurs at this time, bank reg info will be lost. (Thomas) v5: each cycle is delayed by 100ms. (Tao) Signed-off-by: Ce Sun <cesun102@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-27 13:57:47 -04:00
Ce Sun	0989b764f4	drm/amdgpu: Add a mutex lock to protect poison injection When poison is triggered multiple times, competition will occur. Add a mutex lock to protect poison injection Signed-off-by: Ce Sun <cesun102@amd.com> Reviewed-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-27 13:57:47 -04:00
Ce Sun	907813e5d7	drm/amdgpu: Correct the counts of nr_banks and nr_errors Correct the counts of nr_banks and nr_errors Signed-off-by: Ce Sun <cesun102@amd.com> Reviewed-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-27 13:57:47 -04:00
Liao Yuanhong	ee6ba1e69d	drm/amdgpu/fence: Remove redundant 0 value initialization The amdgpu_fence struct is already zeroed by kzalloc(). It's redundant to initialize am_fence->context to 0. Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Liao Yuanhong <liaoyuanhong@vivo.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-27 13:57:47 -04:00
Hawking Zhang	22dcb283d6	drm/amdgpu: Allocate psp fw private buffer in vram It's not necessarily to allocate psp firmware private buffer in different memory domain in sriov and bare metal environment Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Le Ma <le.ma@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-27 13:57:47 -04:00
Alex Deucher	7b9110f289	drm/amdgpu/gfx12: set MQD as appriopriate for queue types Set the MQD as appropriate for the kernel vs user queues. Acked-by: Christian König <christian.koenig@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-27 13:57:42 -04:00
Alex Deucher	063d668320	drm/amdgpu/gfx11: set MQD as appriopriate for queue types Set the MQD as appropriate for the kernel vs user queues. Acked-by: Christian König <christian.koenig@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-27 13:57:37 -04:00
Thomas Zimmermann	08fb45446e	drm/amdgpu: Pin buffers while vmap'ing exported dma-buf objects Current dma-buf vmap semantics require that the mapped buffer remains in place until the corresponding vunmap has completed. For GEM-SHMEM, this used to be guaranteed by a pin operation while creating an S/G table in import. GEM-SHMEN can now import dma-buf objects without creating the S/G table, so the pin is missing. Leads to page-fault errors, such as the one shown below. [ 102.101726] BUG: unable to handle page fault for address: ffffc90127000000 [...] [ 102.157102] RIP: 0010:udl_compress_hline16+0x219/0x940 [udl] [...] [ 102.243250] Call Trace: [ 102.245695] <TASK> [ 102.2477V95] ? validate_chain+0x24e/0x5e0 [ 102.251805] ? __lock_acquire+0x568/0xae0 [ 102.255807] udl_render_hline+0x165/0x341 [udl] [ 102.260338] ? __pfx_udl_render_hline+0x10/0x10 [udl] [ 102.265379] ? local_clock_noinstr+0xb/0x100 [ 102.269642] ? __lock_release.isra.0+0x16c/0x2e0 [ 102.274246] ? mark_held_locks+0x40/0x70 [ 102.278177] udl_primary_plane_helper_atomic_update+0x43e/0x680 [udl] [ 102.284606] ? __pfx_udl_primary_plane_helper_atomic_update+0x10/0x10 [udl] [ 102.291551] ? lockdep_hardirqs_on_prepare.part.0+0x92/0x170 [ 102.297208] ? lockdep_hardirqs_on+0x88/0x130 [ 102.301554] ? _raw_spin_unlock_irq+0x24/0x50 [ 102.305901] ? wait_for_completion_timeout+0x2bb/0x3a0 [ 102.311028] ? drm_atomic_helper_calc_timestamping_constants+0x141/0x200 [ 102.317714] ? drm_atomic_helper_commit_planes+0x3b6/0x1030 [ 102.323279] drm_atomic_helper_commit_planes+0x3b6/0x1030 [ 102.328664] drm_atomic_helper_commit_tail+0x41/0xb0 [ 102.333622] commit_tail+0x204/0x330 [...] [ 102.529946] ---[ end trace 0000000000000000 ]--- [ 102.651980] RIP: 0010:udl_compress_hline16+0x219/0x940 [udl] In this stack strace, udl (based on GEM-SHMEM) imported and vmap'ed a dma-buf from amdgpu. Amdgpu relocated the buffer, thereby invalidating the mapping. Provide a custom dma-buf vmap method in amdgpu that pins the object before mapping it's buffer's pages into kernel address space. Do the opposite in vunmap. Note that dma-buf vmap differs from GEM vmap in how it handles relocation. While dma-buf vmap keeps the buffer in place, GEM vmap requires the caller to keep the buffer in place. Hence, this fix is in amdgpu's dma-buf code instead of its GEM code. A discussion of various approaches to solving the problem is available at [1]. v3: - try (GTT \| VRAM); drop CPU domain (Christian) v2: - only use mapable domains (Christian) - try pinning to domains in preferred order Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de> Fixes: `660cd44659` ("drm/shmem-helper: Import dmabuf without mapping its sg_table") Reported-by: Thomas Zimmermann <tzimmermann@suse.de> Closes: https://lore.kernel.org/dri-devel/ba1bdfb8-dbf7-4372-bdcb-df7e0511c702@suse.de/ Cc: Shixiong Ou <oushixiong@kylinos.cn> Cc: Thomas Zimmermann <tzimmermann@suse.de> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: Maxime Ripard <mripard@kernel.org> Cc: David Airlie <airlied@gmail.com> Cc: Simona Vetter <simona@ffwll.ch> Cc: Sumit Semwal <sumit.semwal@linaro.org> Cc: "Christian König" <christian.koenig@amd.com> Cc: dri-devel@lists.freedesktop.org Cc: linux-media@vger.kernel.org Cc: linaro-mm-sig@lists.linaro.org Link: https://lore.kernel.org/dri-devel/9792c6c3-a2b8-4b2b-b5ba-fba19b153e21@suse.de/ # [1] Reviewed-by: Christian König <christian.koenig@amd.com> Link: https://lore.kernel.org/r/20250821064031.39090-1-tzimmermann@suse.de	2025-08-22 14:05:31 +02:00
Maxime Ripard	1a2cf179e2	Merge drm/drm-fixes into drm-misc-fixes Update drm-misc-fixes to -rc2. Signed-off-by: Maxime Ripard <mripard@kernel.org>	2025-08-20 16:08:49 +02:00
Rodrigo Siqueira	3856a53db6	drm/amdgpu/vcn: Remove unnecessary check The function amdgpu_vcn_sysfs_reset_mask_init already returns 0, which makes the check of the result unnecessary in the vcn_v4_0_3_sw_init(). Just return the amdgpu_vcn_sysfs_reset_mask_init directly. Signed-off-by: Rodrigo Siqueira <siqueira@igalia.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-18 17:06:19 -04:00
Chenglei Xie	d2fa0ec6e0	drm/amdgpu: refactor bad_page_work for corner case handling When a poison is consumed on the guest before the guest receives the host's poison creation msg, a corner case may occur to have poison_handler complete processing earlier than it should to cause the guest to hang waiting for the req_bad_pages reply during a VF FLR, resulting in the VM becoming inaccessible in stress tests. To fix this issue, this patch refactored the mailbox sequence by seperating the bad_page_work into two parts req_bad_pages_work and handle_bad_pages_work. Old sequence: 1.Stop data exchange work 2.Guest sends MB_REQ_RAS_BAD_PAGES to host and keep polling for IDH_RAS_BAD_PAGES_READY 3.If the IDH_RAS_BAD_PAGES_READY arrives within timeout limit, re-init the data exchange region for updated bad page info else timeout with error message New sequence: req_bad_pages_work: 1.Stop data exhange work 2.Guest sends MB_REQ_RAS_BAD_PAGES to host Once Guest receives IDH_RAS_BAD_PAGES_READY event handle_bad_pages_work: 3.re-init the data exchange region for updated bad page info Signed-off-by: Chenglei Xie <Chenglei.Xie@amd.com> Reviewed-by: Shravan Kumar Gande <Shravankumar.Gande@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-15 13:07:30 -04:00
Qiang Liu	fc4e990a32	drm/amdgpu: remove duplicated argument wptr_va The duplicate judgment of wptr_va could be removed to simplify the logic Signed-off-by: Qiang Liu <liuqiang@kylinos.cn> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-15 13:07:25 -04:00
Jesse.Zhang	655d6403ad	drm/amd/vcn: Add late_init callback for VCN v4.0.3 reset handling This change reorganizes VCN reset capability detection by: 1. Moving reset mask configuration from sw_init to new late_init phase 2. Adding vcn_v4_0_3_late_init() to properly check for per-queue reset support 3. Only setting soft full reset mask as fallback when per-queue reset isn't supported 4. Removing TODO comment now that queue reset support is implemented V2: Removed unrelated changes. Keep amdgpu_get_soft_full_reset_mask in place and remove TODO comment. (Alex) v3: set the flags at one place (all in late_init) (Lijo) Suggested-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Ruili Ji <ruiliji2@amd.com> Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-15 13:05:06 -04:00
Heng Zhou	859958a7fa	drm/amdgpu: fix nullptr err of vm_handle_moved If a amdgpu_bo_va is fpriv->prt_va, the bo of this one is always NULL. So, such kind of amdgpu_bo_va should be updated separately before amdgpu_vm_handle_moved. Signed-off-by: Heng Zhou <Heng.Zhou@amd.com> Reviewed-by: Kasiviswanathan, Harish <Harish.Kasiviswanathan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-15 13:04:07 -04:00
Thomas Zimmermann	3eb61d7cb7	Revert "drm/amdgpu: Use dma_buf from GEM object instance" This reverts commit `515986100d`. The dma_buf field in struct drm_gem_object is not stable over the object instance's lifetime. The field becomes NULL when user space releases the final GEM handle on the buffer object. This resulted in a NULL-pointer deref. Workarounds in commit `5307dce878` ("drm/gem: Acquire references on GEM handles for framebuffers") and commit `f6bfc9afc7` ("drm/framebuffer: Acquire internal references on GEM handles") only solved the problem partially. They especially don't work for buffer objects without a DRM framebuffer associated. Hence, this revert to going back to using .import_attach->dmabuf. Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de> Reviewed-by: Simona Vetter <simona.vetter@ffwll.ch> Acked-by: Alex Deucher <alexander.deucher@amd.com> Link: https://lore.kernel.org/r/20250715082635.34974-1-tzimmermann@suse.de	2025-08-13 11:14:17 +02:00
Liu01 Tong	aa5fc4362f	drm/amdgpu: fix task hang from failed job submission during process kill During process kill, drm_sched_entity_flush() will kill the vm entities. The following job submissions of this process will fail, and the resources of these jobs have not been released, nor have the fences been signalled, causing tasks to hang and timeout. Fix by check entity status in amdgpu_vm_ready() and avoid submit jobs to stopped entity. v2: add amdgpu_vm_ready() check before amdgpu_vm_clear_freed() in function amdgpu_cs_vm_handling(). Fixes: `1f02f2044b` ("drm/amdgpu: Avoid extra evict-restore process.") Signed-off-by: Liu01 Tong <Tong.Liu01@amd.com> Signed-off-by: Lin.Cao <lincao12@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit `f101c13a87`)	2025-08-12 16:16:36 -04:00
Jack Xiao	040bc6d0e0	drm/amdgpu: fix incorrect vm flags to map bo It should use vm flags instead of pte flags to specify bo vm attributes. Fixes: `7946340fa3` ("drm/amdgpu: Move csa related code to separate file") Signed-off-by: Jack Xiao <Jack.Xiao@amd.com> Reviewed-by: Likun Gao <Likun.Gao@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit `b08425fa77`)	2025-08-12 16:08:04 -04:00
YiPeng Chai	10ef476aad	drm/amdgpu: fix vram reservation issue The vram block allocation flag must be cleared before making vram reservation, otherwise reserving addresses within the currently freed memory range will always fail. Fixes: `c9cad937c0` ("drm/amdgpu: add drm buddy support to amdgpu") Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit `d38eaf27de`)	2025-08-12 16:07:45 -04:00
Frank Min	e67b8afcb6	drm/amdgpu: Add PSP fw version check for fw reserve GFX command The fw reserved GFX command is only supported starting from PSP fw version 0x3a0e14 and 0x3b0e0d. Older versions do not support this command. Add a version guard to ensure the command is only used when the running PSP fw meets the minimum version requirement. This ensures backward compatibility and safe operation across fw revisions. Fixes: `a3b7f9c306` ("drm/amdgpu: reclaim psp fw reservation memory region") Signed-off-by: Frank Min <Frank.Min@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit `065e23170a`)	2025-08-12 16:07:31 -04:00
Liu01 Tong	f101c13a87	drm/amdgpu: fix task hang from failed job submission during process kill During process kill, drm_sched_entity_flush() will kill the vm entities. The following job submissions of this process will fail, and the resources of these jobs have not been released, nor have the fences been signalled, causing tasks to hang and timeout. Fix by check entity status in amdgpu_vm_ready() and avoid submit jobs to stopped entity. v2: add amdgpu_vm_ready() check before amdgpu_vm_clear_freed() in function amdgpu_cs_vm_handling(). Signed-off-by: Liu01 Tong <Tong.Liu01@amd.com> Signed-off-by: Lin.Cao <lincao12@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-12 14:22:54 -04:00
Jack Xiao	b08425fa77	drm/amdgpu: fix incorrect vm flags to map bo It should use vm flags instead of pte flags to specify bo vm attributes. Fixes: `7946340fa3` ("drm/amdgpu: Move csa related code to separate file") Signed-off-by: Jack Xiao <Jack.Xiao@amd.com> Reviewed-by: Likun Gao <Likun.Gao@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-12 14:19:57 -04:00
YiPeng Chai	d38eaf27de	drm/amdgpu: fix vram reservation issue The vram block allocation flag must be cleared before making vram reservation, otherwise reserving addresses within the currently freed memory range will always fail. Fixes: `c9cad937c0` ("drm/amdgpu: add drm buddy support to amdgpu") Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-12 14:15:03 -04:00
Frank Min	065e23170a	drm/amdgpu: Add PSP fw version check for fw reserve GFX command The fw reserved GFX command is only supported starting from PSP fw version 0x3a0e14 and 0x3b0e0d. Older versions do not support this command. Add a version guard to ensure the command is only used when the running PSP fw meets the minimum version requirement. This ensures backward compatibility and safe operation across fw revisions. Fixes: `a3b7f9c306` ("drm/amdgpu: reclaim psp fw reservation memory region") Signed-off-by: Frank Min <Frank.Min@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-12 14:13:03 -04:00
Lijo Lazar	d543489aa1	drm/amdgpu: Add description for partition commands Add string description for partition commands. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Asad Kamal <asad.kamal@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-12 14:12:57 -04:00
Cryolitia PukNgae	388b68aef7	drm/amdgpu: fix incorrect comment format Comments should not have a leading plus sign. Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Cryolitia PukNgae <cryolitia@uniontech.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-11 11:13:41 -04:00
Vitaly Prosyak	c31f486bc8	drm/amdgpu: add to custom amdgpu_drm_release drm_dev_enter/exit User queues are disabled before GEM objects are released (protecting against user app crashes). No races with PCI hot-unplug (because drm_dev_enter prevents cleanup if iewdevice is being removed). Cc: Christian König <christian.koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-11 11:13:03 -04:00
Lijo Lazar	1dd2fa0e00	drm/amdgpu: Save and restore switch state During a DPC error kernel waits for the link to be active before notifying downstream devices. On certain platforms with Broadcom switch in synthetiic mode, switch responds with values even though the link is not fully ready. The config space restoration done by pcie port driver for SWUS/DS of dGPU is thus not effective as the switch is still doing internal enumeration. As a workaround, save state of SWUS/DS device in driver. Add additional check to see if link is active and restore the values during DPC error callbacks. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Yang Wang <kevinyang.wang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-11 11:12:53 -04:00
Linus Torvalds	ffe8ac927d	drm fixes for 6.17-rc1 i915: - DP LPFS fixes xe: - SRIOV: PF fixes and removal of need of module param - Fix driver unbind around Devcoredump - Mark xe driver as BROKEN if kernel page size is not 4kB amdgpu: - GC 9.5.0 fixes - SMU fix - DCE 6 DC fixes - mmhub client ID fixes - VRR fix - Backlight fix - UserQ fix - Legacy reset fix - Misc fixes amdkfd: - CRIU fix - Debugfs fix -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEEKbZHaGwW9KfbeusDHTzWXnEhr4FAmiVQYoACgkQDHTzWXnE hr4OVg/+OT59WwFrqodc79YRnGkbMllYfhhsC+hgczRWS1WV6GCp5yYSZVqWjs5I H44mM/rUEywXGf5m6lNx9GCCLF+ASgTAosVY/+cSDDpdB+Chl5CSYMbwmmlaS86a yzOxbNz7LC+gxrD9ZSmFxaWOop789ISjRhiafAWeKcJeVIo/tuW7rX2wJsN+LdNo ioUnNR4UHPKhdk7SQhm6ho3izL02e3H72fK/jcUj2FMehqJRLW6ilNTSntJ+bwp4 WmtSHSHTfeFximemIiU1l11cilwpAkWqmDjR9ZBW/F6psBU3FX3eOP4MaKL1Io2X 0gKXcnrz8XSLlZ46rKwHLj3IB3cNuGd2jO/rTcHDTQWGhrjPgn5fZnorOsfMRlTU ySaEeVsgCupQyiB1hOYFMECse+tqzV++05Bu/BL3NpYn6GMikghPkianAKlwpwxS lVDeOTmC9WX/jD9FeW8a69nsGZJoT3oSp9Ro2xjqlo1bwm1FVmUlxKgXSRIdHofg oZ8PGGDQGM5hf69Y/iLNFSPC+Z22n8cDrA0in92Gek965/h2HHkttQVWEMsaVc/x gkxb6rWv6ZMeLMrRY4ZQFd8JaOFfEzrc7aWemknEmF0HIdyJJUu2zWiUwm2MgpI/ tshUMJdgoJaPoTinHFJeID8HrdzgFy9owHnKUne9nNNkPFjskV4= =rAxo -----END PGP SIGNATURE----- Merge tag 'drm-next-2025-08-08' of https://gitlab.freedesktop.org/drm/kernel Pull drm fixes from Dave Airlie: "This is the fixes that built up in the merge window, mostly amdgpu and xe with one i915 display fix, seems like things are pretty good for rc1. i915: - DP LPFS fixes xe: - SRIOV: PF fixes and removal of need of module param - Fix driver unbind around Devcoredump - Mark xe driver as BROKEN if kernel page size is not 4kB amdgpu: - GC 9.5.0 fixes - SMU fix - DCE 6 DC fixes - mmhub client ID fixes - VRR fix - Backlight fix - UserQ fix - Legacy reset fix - Misc fixes amdkfd: - CRIU fix - Debugfs fix" * tag 'drm-next-2025-08-08' of https://gitlab.freedesktop.org/drm/kernel: (28 commits) drm/amdgpu: add missing vram lost check for LEGACY RESET drm/amdgpu/discovery: fix fw based ip discovery drm/amdkfd: Destroy KFD debugfs after destroy KFD wq amdgpu/amdgpu_discovery: increase timeout limit for IFWI init drm/amdgpu: Update SDMA firmware version check for user queue support drm/amdgpu: Add NULL check for asic_funcs drm/amd/display: Revert "drm/amd/display: Fix AMDGPU_MAX_BL_LEVEL value" drm/amd/display: fix a Null pointer dereference vulnerability drm/amd/display: Add primary plane to commits for correct VRR handling drm/amdgpu: update mmhub 3.3 client id mappings drm/amdgpu: update mmhub 3.0.1 client id mappings drm/amdgpu: Retain job->vm in amdgpu_job_prepare_job drm/amd/display: Fix DCE 6.0 and 6.4 PLL programming. drm/amd/display: Don't overwrite dce60_clk_mgr drm/amdkfd: Fix checkpoint-restore on multi-xcc drm/amd: Restore cached manual clock settings during resume drm/amd: Restore cached power limit during resume drm/amdgpu: Update external revid for GC v9.5.0 drm/amdgpu: Update supported modes for GC v9.5.0 Mark xe driver as BROKEN if kernel page size is not 4kB ...	2025-08-08 06:48:14 +03:00
Alex Deucher	81699fe81b	drm/amdgpu: add missing vram lost check for LEGACY RESET Legacy resets reset the memory controllers so VRAM contents may be unreliable after reset. Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit `aae94897b6`) Cc: stable@vger.kernel.org	2025-08-06 16:54:25 -04:00
Alex Deucher	514678da56	drm/amdgpu/discovery: fix fw based ip discovery We only need the fw based discovery table for sysfs. No need to parse it. Additionally parsing some of the board specific tables may result in incorrect data on some boards. just load the binary and don't parse it on those boards. Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4441 Fixes: `80a0e82829` ("drm/amdgpu/discovery: optionally use fw based ip discovery") Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit `62eedd150f`) Cc: stable@vger.kernel.org	2025-08-06 16:54:04 -04:00
Xaver Hugl	928587381b	amdgpu/amdgpu_discovery: increase timeout limit for IFWI init With a timeout of only 1 second, my rx 5700XT fails to initialize, so this increases the timeout to 2s. Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3697 Signed-off-by: Xaver Hugl <xaver.hugl@kde.org> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit `9ed3d7bdf2`) Cc: stable@vger.kernel.org	2025-08-06 16:51:26 -04:00
Sathishkumar S	111821e4b5	drm/amdgpu/vcn: Hold pg_lock before vcn power off Acquire vcn_pg_lock before changes to vcn power state and release it after power off in idle work handler. Signed-off-by: Sathishkumar S <sathishkumar.sundararaju@amd.com> Reviewed-by: Leo Liu <leo.liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-06 14:34:54 -04:00
Sathishkumar S	0e7581eda8	drm/amdgpu/jpeg: Hold pg_lock before jpeg poweroff Acquire jpeg_pg_lock before changes to jpeg power state and release it after power off from idle work handler. Signed-off-by: Sathishkumar S <sathishkumar.sundararaju@amd.com> Reviewed-by: Leo Liu <leo.liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-06 14:34:50 -04:00
Lijo Lazar	0b4d79dafa	drm/amdgpu: Assign unique id to compute partition Assign unique id to compute partition. This is the unique id of the first XCD instance belonging to the partition. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Asad Kamal <asad.kamal@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-06 14:34:47 -04:00
Alex Deucher	aae94897b6	drm/amdgpu: add missing vram lost check for LEGACY RESET Legacy resets reset the memory controllers so VRAM contents may be unreliable after reset. Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-06 14:34:14 -04:00
Alex Deucher	62eedd150f	drm/amdgpu/discovery: fix fw based ip discovery We only need the fw based discovery table for sysfs. No need to parse it. Additionally parsing some of the board specific tables may result in incorrect data on some boards. just load the binary and don't parse it on those boards. Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4441 Fixes: `80a0e82829` ("drm/amdgpu/discovery: optionally use fw based ip discovery") Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-06 14:32:38 -04:00
Lijo Lazar	589ea8a1fd	drm/amdgpu: Add helpers to set/get unique ids Add a struct to store unique id information for each type. Add helper to fetch the unique id. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Asad Kamal <asad.kamal@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-06 14:22:08 -04:00
Lijo Lazar	892bac995b	drm/amdgpu: Prevent hardware access in dpc state Don't allow hardware access while in dpc state. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Ce Sun <cesun102@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-06 14:22:03 -04:00
Lijo Lazar	1a0e57eb96	drm/amdgpu/vcn: Fix double-free of vcn dump buffer The buffer is already freed as part of amdgpu_vcn_reg_dump_fini(). The issue is introduced by below patch series. Fixes: `de55cbff5c` ("drm/amdgpu/vcn: Add regdump helper functions") Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-06 14:21:16 -04:00
Lijo Lazar	c5c62160a5	drm/amdgpu: Log reset source during recovery To get more context, add reset source to identify the source of gpu recovery - job timeout, RAS, HWS hang etc. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Asad Kamal <asad.kamal@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-06 14:21:05 -04:00
Xiang Liu	b3505c2c48	drm/amdgpu: Generate BP threshold exceed CPER once threshold exceeded The bad pages threshold exceed CPER should be generated once threshold exceeded, no matter the bad_page_threshold setted or not. Signed-off-by: Xiang Liu <xiang.liu@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-06 14:21:01 -04:00
Lijo Lazar	91c4fd4164	drm/amdgpu: Set dpc status appropriately Set the dpc status based on hardware state. Also, clear the status before reinitialization after a successful reset. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Ce Sun <cesun102@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-06 14:18:35 -04:00
Lijo Lazar	32f73741d6	drm/amdgpu: Wait for bootloader after PSPv11 reset Some PSPv11 SOCs take a longer time for PSP based mode-1 reset. Instead of checking for C2PMSG_33 status, add the callback wait_for_bootloader. Wait for bootloader to be back to steady state is already part of the generic mode-1 reset flow. Increase the retry count for bootloader wait and also fix the mask to prevent fake pass. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-06 14:17:50 -04:00
Ethan Carter Edwards	66f92d1035	drm/amdgpu/gfx9.4.3: remove redundant repeated nested 0 check The repeated checks on grbm_soft_reset are unnecessary. Remove them. Signed-off-by: Ethan Carter Edwards <ethan@ethancedwards.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-06 14:17:47 -04:00
Ethan Carter Edwards	90e1d0324a	drm/amdgpu/gfx9: remove redundant repeated nested 0 check The repeated checks on grbm_soft_reset are unnecessary. Remove them. Signed-off-by: Ethan Carter Edwards <ethan@ethancedwards.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-06 14:17:45 -04:00
Ethan Carter Edwards	8802ec0eba	drm/amdgpu/gfx10: remove redundant repeated nested 0 check The repeated checks on grbm_soft_reset are unnecessary. Remove them. Signed-off-by: Ethan Carter Edwards <ethan@ethancedwards.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-06 14:17:40 -04:00
Xaver Hugl	9ed3d7bdf2	amdgpu/amdgpu_discovery: increase timeout limit for IFWI init With a timeout of only 1 second, my rx 5700XT fails to initialize, so this increases the timeout to 2s. Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3697 Signed-off-by: Xaver Hugl <xaver.hugl@kde.org> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-06 14:17:03 -04:00
Jesse.Zhang	124ffa2970	drm/amdgpu: Update SDMA firmware version check for user queue support This commit fixes a firmware version check for enabling user queue support in SDMA v7.0. The previous version check (7836028) was incorrect and could lead to issues with PROTECTED_FENCE_SIGNAL commands causing register conflicts between MCU_DBG0 and MCU_DBG1. Fixes: `8c011408ed` ("drm/amdgpu/sdma7: add ucode version checks for userq support") Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit `92e2449241`) Cc: stable@vger.kernel.org	2025-08-04 15:48:14 -04:00
Lijo Lazar	c2fe914d50	drm/amdgpu: Add NULL check for asic_funcs If driver load fails too early, asic_funcs pointer remains unassigned. Add NULL check to sanitize unwind path. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit `582bf7c515`) Cc: stable@vger.kernel.org	2025-08-04 15:47:50 -04:00
Alex Deucher	9f9bddfa31	drm/amdgpu: update mmhub 3.3 client id mappings Update the client id mapping so the correct clients get printed when there is a mmhub page fault. v2: fix typos spotted by David Wu. v3: fix additional typo spotted by David. Reviewed-by: David (Ming Qiang) Wu <David.Wu3@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit `e932f4779a`) Cc: stable@vger.kernel.org	2025-08-04 15:42:15 -04:00
Alex Deucher	0bae62cc98	drm/amdgpu: update mmhub 3.0.1 client id mappings Update the client id mapping so the correct clients get printed when there is a mmhub page fault. Reviewed-by: David (Ming Qiang) Wu <David.Wu3@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit `2a2681eda7`) Cc: stable@vger.kernel.org	2025-08-04 15:41:52 -04:00
YuanShang	c00d8b79fd	drm/amdgpu: Retain job->vm in amdgpu_job_prepare_job The field job->vm is used in function amdgpu_job_run to get the page table re-generation counter and decide whether the job should be skipped. Specifically, function amdgpu_vm_generation checks if the VM is valid for this job to use. For instance, if a gfx job depends on a cancelled sdma job from entity vm->delayed, then the gfx job should be skipped. Fixes: `26c95e838e` ("drm/amdgpu: set the VM pointer to NULL in amdgpu_job_prepare") Signed-off-by: YuanShang <YuanShang.Mao@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit `ed76936c6b`) Cc: stable@vger.kernel.org	2025-08-04 15:41:09 -04:00
Lijo Lazar	05c8b69051	drm/amdgpu: Update external revid for GC v9.5.0 Use different external revid for GC v9.5.0 SOCs. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Asad Kamal <asad.kamal@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit `21c6764ed4`) Cc: stable@vger.kernel.org	2025-08-04 15:32:37 -04:00
Lijo Lazar	389d79a195	drm/amdgpu: Update supported modes for GC v9.5.0 For GC v9.5.0 SOCs, both CPX and QPX compute modes are also supported in NPS2 mode. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Acked-by: Mangesh Gadre <Mangesh.Gadre@amd.com> Reviewed-by: Asad Kamal <asad.kamal@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit `9d1ac25c7f`) Cc: stable@vger.kernel.org	2025-08-04 15:32:07 -04:00
Xiang Liu	58364f01db	drm/amdgpu: Fix vcn v4.0.3 poison irq call trace on sriov guest Sriov guest side doesn't init ras feature hence the poison irq shouldn't be put during hw fini. [25209.468816] Call Trace: [25209.468817] <TASK> [25209.468818] ? srso_alias_return_thunk+0x5/0x7f [25209.468820] ? show_trace_log_lvl+0x28e/0x2ea [25209.468822] ? show_trace_log_lvl+0x28e/0x2ea [25209.468825] ? vcn_v4_0_3_hw_fini+0xaf/0xe0 [amdgpu] [25209.468936] ? show_regs.part.0+0x23/0x29 [25209.468939] ? show_regs.cold+0x8/0xd [25209.468940] ? amdgpu_irq_put+0x9e/0xc0 [amdgpu] [25209.469038] ? __warn+0x8c/0x100 [25209.469040] ? amdgpu_irq_put+0x9e/0xc0 [amdgpu] [25209.469135] ? report_bug+0xa4/0xd0 [25209.469138] ? handle_bug+0x39/0x90 [25209.469140] ? exc_invalid_op+0x19/0x70 [25209.469142] ? asm_exc_invalid_op+0x1b/0x20 [25209.469146] ? amdgpu_irq_put+0x9e/0xc0 [amdgpu] [25209.469241] vcn_v4_0_3_hw_fini+0xaf/0xe0 [amdgpu] [25209.469343] amdgpu_ip_block_hw_fini+0x34/0x61 [amdgpu] [25209.469511] amdgpu_device_fini_hw+0x3b3/0x467 [amdgpu] Fixes: `4c4a891496` ("drm/amdgpu: Register aqua vanjaram vcn poison irq") Signed-off-by: Xiang Liu <xiang.liu@amd.com> Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-04 14:43:51 -04:00
Xiang Liu	d3d73bdb02	drm/amdgpu: Fix jpeg v4.0.3 poison irq call trace on sriov guest Sriov guest side doesn't init ras feature hence the poison irq shouldn't be put during hw fini. [25209.467154] Call Trace: [25209.467156] <TASK> [25209.467158] ? srso_alias_return_thunk+0x5/0x7f [25209.467162] ? show_trace_log_lvl+0x28e/0x2ea [25209.467166] ? show_trace_log_lvl+0x28e/0x2ea [25209.467171] ? jpeg_v4_0_3_hw_fini+0x6f/0x90 [amdgpu] [25209.467300] ? show_regs.part.0+0x23/0x29 [25209.467303] ? show_regs.cold+0x8/0xd [25209.467304] ? amdgpu_irq_put+0x9e/0xc0 [amdgpu] [25209.467403] ? __warn+0x8c/0x100 [25209.467407] ? amdgpu_irq_put+0x9e/0xc0 [amdgpu] [25209.467503] ? report_bug+0xa4/0xd0 [25209.467508] ? handle_bug+0x39/0x90 [25209.467511] ? exc_invalid_op+0x19/0x70 [25209.467513] ? asm_exc_invalid_op+0x1b/0x20 [25209.467518] ? amdgpu_irq_put+0x9e/0xc0 [amdgpu] [25209.467613] ? amdgpu_irq_put+0x5f/0xc0 [amdgpu] [25209.467709] jpeg_v4_0_3_hw_fini+0x6f/0x90 [amdgpu] [25209.467805] amdgpu_ip_block_hw_fini+0x34/0x61 [amdgpu] [25209.467971] amdgpu_device_fini_hw+0x3b3/0x467 [amdgpu] Fixes: `1b2231de41` ("drm/amdgpu: Register aqua vanjaram jpeg poison irq") Signed-off-by: Xiang Liu <xiang.liu@amd.com> Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-04 14:42:25 -04:00
Lijo Lazar	5c2b3226d0	drm/amdgpu: Add wrapper function for dpc state Use wrapper functions to set/indicate dpc status. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Ce Sun <cesun102@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-04 14:42:21 -04:00
Jesse.Zhang	92e2449241	drm/amdgpu: Update SDMA firmware version check for user queue support This commit fixes a firmware version check for enabling user queue support in SDMA v7.0. The previous version check (7836028) was incorrect and could lead to issues with PROTECTED_FENCE_SIGNAL commands causing register conflicts between MCU_DBG0 and MCU_DBG1. Fixes: `8c011408ed` ("drm/amdgpu/sdma7: add ucode version checks for userq support") Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-04 14:41:06 -04:00
Lijo Lazar	582bf7c515	drm/amdgpu: Add NULL check for asic_funcs If driver load fails too early, asic_funcs pointer remains unassigned. Add NULL check to sanitize unwind path. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-04 14:40:39 -04:00
Mangesh Gadre	82594ac858	drm/amdgpu: Initialize vcn v5_0_1 ras function Initialize vcn v5_0_1 ras function Signed-off-by: Mangesh Gadre <Mangesh.Gadre@amd.com> Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-04 14:37:48 -04:00
Yunxiang Li	ba5e322b26	drm/amdgpu: skip mgpu fan boost for multi-vf On multi-vf setup if the VM have two vf assigned, perhaps from two different gpus, mgpu fan boost will fail. Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-04 14:37:27 -04:00
Mangesh Gadre	01fa9758c8	drm/amdgpu: Initialize jpeg v5_0_1 ras function Initialize jpeg v5_0_1 ras function Signed-off-by: Mangesh Gadre <Mangesh.Gadre@amd.com> Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-04 14:37:24 -04:00
Xiang Liu	8e8e08c831	drm/amdgpu: Skip poison aca bank from UE channel Avoid GFX poison consumption errors logged when fatal error occurs. Signed-off-by: Xiang Liu <xiang.liu@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-04 14:37:20 -04:00
Arnd Bergmann	4d22db6d07	drm/amdgpu: fix link error for !PM_SLEEP When power management is not enabled in the kernel build, the newly added hibernation changes cause a link failure: arm-linux-gnueabi-ld: drivers/gpu/drm/amd/amdgpu/amdgpu_drv.o: in function `amdgpu_pmops_thaw': amdgpu_drv.c:(.text+0x1514): undefined reference to `pm_hibernate_is_recovering' Make the power management code in this driver conditional on CONFIG_PM and CONFIG_PM_SLEEP Fixes: `530694f54d` ("drm/amdgpu: do not resume device in thaw for normal hibernation") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Mario Limonciello <mario.limonciello@amd.com> Link: https://lore.kernel.org/r/20250714081635.4071570-1-arnd@kernel.org Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-04 14:36:54 -04:00
Alex Deucher	e932f4779a	drm/amdgpu: update mmhub 3.3 client id mappings Update the client id mapping so the correct clients get printed when there is a mmhub page fault. v2: fix typos spotted by David Wu. v3: fix additional typo spotted by David. Reviewed-by: David (Ming Qiang) Wu <David.Wu3@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-04 14:36:47 -04:00
Alex Deucher	2a2681eda7	drm/amdgpu: update mmhub 3.0.1 client id mappings Update the client id mapping so the correct clients get printed when there is a mmhub page fault. Reviewed-by: David (Ming Qiang) Wu <David.Wu3@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-04 14:36:40 -04:00
Sathishkumar S	26a63590fe	drm/amdgpu/vcn: Register dump cleanup in VCN2_5 Use generic vcn devcoredump helper functions for VCN2_5 and VCN2_6 Signed-off-by: Sathishkumar S <sathishkumar.sundararaju@amd.com> Acked-by: Leo Liu <leo.liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-04 14:36:36 -04:00
Sathishkumar S	53c4be7a59	drm/amdgpu/vcn: Register dump cleanup in VCN2_0_0 Use generic vcn devcoredump helper functions for VCN2_0_0 Signed-off-by: Sathishkumar S <sathishkumar.sundararaju@amd.com> Acked-by: Leo Liu <leo.liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-04 14:36:31 -04:00
Sathishkumar S	b2d532b588	drm/amdgpu/vcn: Register dump cleanup in VCN3_0 Use generic vcn devcoredump helper functions for VCN3_0 Signed-off-by: Sathishkumar S <sathishkumar.sundararaju@amd.com> Acked-by: Leo Liu <leo.liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-04 14:36:28 -04:00
Sathishkumar S	69cc37647b	drm/amdgpu/vcn: Register dump cleanup in VCN4_0_3 Use generic vcn devcoredump helper functions for VCN4_0_3 Signed-off-by: Sathishkumar S <sathishkumar.sundararaju@amd.com> Acked-by: Leo Liu <leo.liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-04 14:36:23 -04:00
Sathishkumar S	793b97c4ad	drm/amdgpu/vcn: Register dump cleanup in VCN4_0_5 Use generic vcn devcoredump helper functions for VCN4_0_5 Signed-off-by: Sathishkumar S <sathishkumar.sundararaju@amd.com> Acked-by: Leo Liu <leo.liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-04 14:36:20 -04:00
Sathishkumar S	4e011af912	drm/amdgpu/vcn: Register dump cleanup in VCN4_0_0 Use generic vcn devcoredump helper functions for VCN4_0_0 Signed-off-by: Sathishkumar S <sathishkumar.sundararaju@amd.com> Acked-by: Leo Liu <leo.liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-04 14:36:17 -04:00
Sathishkumar S	f4c3be28d5	drm/amdgpu/vcn: Register dump cleanup in VCN5 Use generic vcn devcoredump helper functions for VCN5 Signed-off-by: Sathishkumar S <sathishkumar.sundararaju@amd.com> Acked-by: Leo Liu <leo.liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-04 14:36:14 -04:00
Stanley.Yang	08e27c9d92	drm/amdgpu: Add new error code for VCN/JPEG new chain Add VIDS and JPEG8/9 S\|D chain error code for VCN/JPEG v5.0.1. Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-04 14:36:06 -04:00
Stanley.Yang	b1b29aa88f	drm/amdgpu: Fix vcn v5.0.1 poison irq call trace Why: [13014.890792] Call Trace: [13014.890793] <TASK> [13014.890795] ? show_trace_log_lvl+0x1d6/0x2ea [13014.890799] ? show_trace_log_lvl+0x1d6/0x2ea [13014.890800] ? vcn_v5_0_1_hw_fini+0xe9/0x110 [amdgpu] [13014.890872] ? show_regs.part.0+0x23/0x29 [13014.890873] ? show_regs.cold+0x8/0xd [13014.890874] ? amdgpu_irq_put+0xc6/0xe0 [amdgpu] [13014.890934] ? __warn+0x8c/0x100 [13014.890936] ? amdgpu_irq_put+0xc6/0xe0 [amdgpu] [13014.890995] ? report_bug+0xa4/0xd0 [13014.890999] ? handle_bug+0x39/0x90 [13014.891001] ? exc_invalid_op+0x19/0x70 [13014.891003] ? asm_exc_invalid_op+0x1b/0x20 [13014.891005] ? amdgpu_irq_put+0xc6/0xe0 [amdgpu] [13014.891065] ? amdgpu_irq_put+0x63/0xe0 [amdgpu] [13014.891124] vcn_v5_0_1_hw_fini+0xe9/0x110 [amdgpu] [13014.891189] amdgpu_ip_block_hw_fini+0x3b/0x78 [amdgpu] [13014.891309] amdgpu_device_fini_hw+0x3c1/0x479 [amdgpu] How: Add omitted vcn poison irq get call. Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-04 14:36:00 -04:00
Sathishkumar S	de55cbff5c	drm/amdgpu/vcn: Add regdump helper functions Add generic helper functions for vcn devcoredump support which can be re-used for all vcn versions. Signed-off-by: Sathishkumar S <sathishkumar.sundararaju@amd.com> Acked-by: Leo Liu <leo.liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-04 14:35:54 -04:00
Meng Li	e6c2b0f232	drm/amd/amdgpu: Release xcp drm memory after unplug Add a new API amdgpu_xcp_drm_dev_free(). After unplug xcp device, need to release xcp drm memory etc. Co-developed-by: Jiang Liu <gerry@linux.alibaba.com> Signed-off-by: Jiang Liu <gerry@linux.alibaba.com> Signed-off-by: Meng Li <li.meng@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-04 14:35:48 -04:00
YuanShang	ed76936c6b	drm/amdgpu: Retain job->vm in amdgpu_job_prepare_job The field job->vm is used in function amdgpu_job_run to get the page table re-generation counter and decide whether the job should be skipped. Specifically, function amdgpu_vm_generation checks if the VM is valid for this job to use. For instance, if a gfx job depends on a cancelled sdma job from entity vm->delayed, then the gfx job should be skipped. Fixes: `26c95e838e` ("drm/amdgpu: set the VM pointer to NULL in amdgpu_job_prepare") Signed-off-by: YuanShang <YuanShang.Mao@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-04 14:31:34 -04:00
Mario Limonciello	cc51bbc7d7	drm/amd: Use drm_() macros instead of DRM_() for amdgpu_cs Some of the IOCTL messages can be called for different GPUs and it might not be obvious which one called them from a problem. Using the drm_*() macros the correct device will be shown in the messages. Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Link: https://lore.kernel.org/r/20250715212420.2254925-1-superm1@kernel.org Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-04 14:31:25 -04:00
Yunshui Jiang	130c7ed88f	drm/amdgpu: use kmalloc_array() instead of kmalloc() Use kmalloc_array() instead of kmalloc() with multiplication. kmalloc_array() is a safer way because of its multiply overflow check. Signed-off-by: Yunshui Jiang <jiangyunshui@kylinos.cn> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-04 14:30:31 -04:00
Sathishkumar S	46b0e6b9d7	drm/amdgpu: Fix unintended error log in VCN5_0_0 The error log is supposed to be gaurded under if failure condition. Fixes: `faab5ea083` ("drm/amdgpu: Check vcn sram load return value") Signed-off-by: Sathishkumar S <sathishkumar.sundararaju@amd.com> Reviewed-by: Leo Liu <leo.liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-04 14:29:14 -04:00
Ce Sun	da46735229	drm/amdgpu: Effective health check before reset Move amdgpu_device_health_check into amdgpu_device_gpu_recover to ensure that if the device is present can be checked before reset The reason is: 1.During the dpc event, the device where the dpc event occurs is not present on the bus 2.When both dpc event and ATHUB event occur simultaneously,the dpc thread holds the reset domain lock when detecting error,and the gpu recover thread acquires the hive lock.The device is simultaneously in the states of amdgpu_ras_in_recovery and occurs_dpc,so gpu recover thread will not go to amdgpu_device_health_check.It waits for the reset domain lock held by the dpc thread, but dpc thread has not released the reset domain lock.In the dpc callback slot_reset,to obtain the hive lock, the hive lock is held by the gpu recover thread at this time.So a deadlock occurred Signed-off-by: Ce Sun <cesun102@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-04 14:27:49 -04:00
Ce Sun	21c0ffa612	drm/amdgpu: Avoid rma causes GPU duplicate reset Try to ensure poison creation handle is completed in time to set device rma value. Signed-off-by: Ce Sun <cesun102@amd.com> Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-04 14:27:47 -04:00
Xiang Liu	8f0245ee95	drm/amdgpu: Update IPID value for bad page threshold CPER Update the IPID register value for bad page threshold CPER according to the latest definition. Signed-off-by: Xiang Liu <xiang.liu@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2025-08-04 14:27:41 -04:00

1 2 3 4 5 ...

16421 Commits (09cfd3c52ea76f43b3cb15e570aeddf633d65e80)