Merge tag 'edac_updates_for_v6.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras
Pull EDAC updates from Borislav Petkov:
- imh_edac: Add a new EDAC driver for Intel Diamond Rapids and future
incarnations of this memory controller architecture
- amd64_edac: Remove the legacy csrow sysfs interface which has been
deprecated and unused (we assume) for at least a decade
- Add the capability to fall back to BIOS-provided address translation
functionality (ACPI PRM) which can be used on systems unsupported by
the current AMD address translation library
- The usual fixes, fixlets, cleanups and improvements all over the
place
* tag 'edac_updates_for_v6.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
RAS/AMD/ATL: Replace bitwise_xor_bits() with hweight16()
EDAC/igen6: Fix error handling in igen6_edac driver
EDAC/imh: Setup 'imh_test' debugfs testing node
EDAC/{skx_comm,imh}: Detect 2-level memory configuration
EDAC/skx_common: Extend the maximum number of DRAM chip row bits
EDAC/{skx_common,imh}: Add EDAC driver for Intel Diamond Rapids servers
EDAC/skx_common: Prepare for skx_set_hi_lo()
EDAC/skx_common: Prepare for skx_get_edac_list()
EDAC/{skx_common,skx,i10nm}: Make skx_register_mci() independent of pci_dev
EDAC/ghes: Replace deprecated strcpy() in ghes_edac_report_mem_error()
EDAC/ie31200: Fix error handling in ie31200_register_mci
RAS/CEC: Replace use of system_wq with system_percpu_wq
EDAC: Remove the legacy EDAC sysfs interface
EDAC/amd64: Remove NUM_CONTROLLERS macro
EDAC/amd64: Generate ctl_name string at runtime
RAS/AMD/ATL: Require PRM support for future systems
ACPI: PRM: Add acpi_prm_handler_available()
RAS/AMD/ATL: Return error codes from helper functions
Drop the driver-specific field_get() macro, in favor of the globally
available variant from <linux/bitfield.h>.
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Yury Norov (NVIDIA) <yury.norov@gmail.com>
Prepare for the advent of a globally available common field_get() macro
by undefining the symbol before defining a local variant. This prevents
redefinition warnings from the C preprocessor when introducing the common
macro later.
Suggested-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Yury Norov (NVIDIA) <yury.norov@gmail.com>
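The undefine-then-define pattern described in this commit can be sketched in plain C as follows. This is an illustration only: the placeholder first definition, the body of the local macro, and the extract_field() wrapper are assumptions made for the example, not the driver's actual code.

```c
#include <stdint.h>

/* Hypothetical stand-in for a definition that a shared header may provide later. */
#define field_get(mask, reg)	(0u)

/* Undefine first, so the local definition below cannot clash with an earlier one. */
#undef field_get
#define field_get(mask, reg)	(((reg) & (mask)) / ((mask) & (~(mask) + 1u)))

/* Extract the field selected by a non-zero mask from a register value. */
static inline uint32_t extract_field(uint32_t mask, uint32_t reg)
{
	return field_get(mask, reg);
}
```

Without the #undef, the second #define would produce a redefinition warning because the two macro bodies differ.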
The igen6_edac driver calls device_initialize() for all memory
controllers in igen6_register_mci(), but misses corresponding
put_device() calls in error paths and during normal shutdown in
igen6_unregister_mcis().
Adding the missing put_device() calls improves code readability and
ensures proper reference counting for the device structure.
Found by code review.
Signed-off-by: Ma Ke <make24@iscas.ac.cn>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://patch.msgid.link/20251105090244.23327-1-make24@iscas.ac.cn
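The pairing rule behind this fix can be modelled in userspace. The sketch below does not use the kernel device API; every name in it (mock_device, mock_register_mci() and so on) is hypothetical and only mimics the reference-count behaviour of device_initialize() and put_device().

```c
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for a refcounted device; all names are hypothetical. */
struct mock_device {
	int refcount;
	bool released;
};

/* Models device_initialize(): the caller now owns one reference. */
static void mock_device_initialize(struct mock_device *dev)
{
	dev->refcount = 1;
	dev->released = false;
}

/* Models put_device(): drop one reference, release on the last one. */
static void mock_put_device(struct mock_device *dev)
{
	if (--dev->refcount == 0)
		dev->released = true;
}

/* Registration that may fail after initialization. */
static int mock_register_mci(struct mock_device *dev, bool fail_setup)
{
	mock_device_initialize(dev);

	if (fail_setup) {
		mock_put_device(dev);	/* error path: balance the initialize */
		return -1;
	}
	return 0;
}

/* Shutdown path: drop the reference taken at registration. */
static void mock_unregister_mci(struct mock_device *dev)
{
	mock_put_device(dev);
}
```

In both the failing and the successful case the reference taken at initialization is dropped exactly once, which is the invariant the commit restores.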
Setup the following debugfs testing node to enable fake memory error
address decoding tests for the imh_edac driver.
/sys/kernel/debug/edac/imh_test/addr
Tested-by: Yi Lai <yi1.lai@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://patch.msgid.link/20251119134132.2389472-8-qiuxu.zhuo@intel.com
Detect 2-level memory configurations and notify the 'skx_common' library
to enable ADXL 2-level memory error decoding.
Tested-by: Yi Lai <yi1.lai@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://patch.msgid.link/20251119134132.2389472-7-qiuxu.zhuo@intel.com
The allowed maximum number of row bits for DRAM chips in the Diamond
Rapids server processor is up to 19. Extend the current maximum row
bits from 18 to 19.
Tested-by: Yi Lai <yi1.lai@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://patch.msgid.link/20251119134132.2389472-6-qiuxu.zhuo@intel.com
Intel Diamond Rapids CPUs include Integrated Memory and I/O Hubs (IMH).
The memory controllers within the IMHs provide memory stacks to the
processor. Create a new driver for these IMH-based memory controllers
rather than applying additional patches to the existing i10nm_edac.c
for the following reasons:
1) The memory controllers are not presented as PCI devices; instead,
the detection and all their registers have been transitioned to
MMIO-based memory spaces.
2) Validation processes are costly. Modifications to i10nm_edac would
require extensive validation checks against multiple platforms,
including Ice Lake, Sapphire Rapids, Emerald Rapids, Granite Rapids,
Sierra Forest, and Grand Ridge.
3) Future Intel CPUs will likely only need patches on top of this new
EDAC driver. Validation can be limited to Diamond Rapids servers
and future Intel CPU generations.
[Tony: Fix kerneldoc for struct local_reg]
[randconfig: Added dependencies on NFIT and DMI]
Tested-by: Yi Lai <yi1.lai@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://patch.msgid.link/20251119134132.2389472-5-qiuxu.zhuo@intel.com
The upcoming imh_edac driver for Intel Diamond Rapids servers cannot
use skx_get_hi_lo() in skx_common to retrieve the TOHM (Top of High
Memory) and TOLM (Top of Low Memory) parameters. Instead, it obtains
these parameters within its own EDAC driver. To accommodate this,
prepare skx_set_hi_lo() to allow the driver to notify skx_common of
these parameters.
Tested-by: Yi Lai <yi1.lai@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://patch.msgid.link/20251119134132.2389472-4-qiuxu.zhuo@intel.com
The Intel EDAC library 'skx_common' maintains the Intel server EDAC device
list for {skx, i10nm}_edac drivers, which use skx_get_all_bus_mappings()
to build and retrieve the EDAC device list.
However, the upcoming Intel EDAC driver, imh_edac, for Diamond Rapids
servers is designed for memory controllers that are MMIO-based devices
rather than PCI devices. Consequently, it can't use
skx_get_all_bus_mappings() due to the absence of a PCI bus. To accommodate
this, prepare skx_get_edac_list() to enable the upcoming imh_edac driver
to obtain the EDAC device list from the skx_common library and build the
EDAC device list independently.
Tested-by: Yi Lai <yi1.lai@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://patch.msgid.link/20251119134132.2389472-3-qiuxu.zhuo@intel.com
Memory controllers in the new Intel server CPUs, such as Diamond Rapids,
are presented as MMIO-based devices rather than PCI devices.
Modify skx_register_mci() to be independent of 'pci_dev' and use a generic
'dev' of 'struct device' to prepare for support of such MMIO-based memory
controllers.
Tested-by: Yi Lai <yi1.lai@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://patch.msgid.link/20251119134132.2389472-2-qiuxu.zhuo@intel.com
The current single-bit error injection mechanism flips bits directly in ECC RAM
by performing write and read operations. When the ECC RAM is actively used by
the Ethernet or USB controller, this approach sometimes triggers a false
double-bit error.
Switch both Ethernet and USB EDAC devices to use the INTTEST register
(altr_edac_a10_device_inject_fops) for single-bit error injection, similar to
the existing double-bit error injection method.
Fixes: 064acbd4f4 ("EDAC, altera: Add Stratix10 peripheral support")
Signed-off-by: Niravkumar L Rabara <niravkumarlaxmidas.rabara@altera.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Dinh Nguyen <dinguyen@kernel.org>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20251111081333.1279635-1-niravkumarlaxmidas.rabara@altera.com
The OCRAM ECC is always enabled either by the BootROM or by the Secure Device
Manager (SDM) during a power-on reset on SoCFPGA.
However, during a warm reset, the OCRAM content is retained to preserve data,
while the control and status registers are reset to their default values. As
a result, ECC must be explicitly re-enabled after a warm reset.
Fixes: 17e47dc6db ("EDAC/altera: Add Stratix10 OCRAM ECC support")
Signed-off-by: Niravkumar L Rabara <niravkumarlaxmidas.rabara@altera.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Dinh Nguyen <dinguyen@kernel.org>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20251111080801.1279401-1-niravkumarlaxmidas.rabara@altera.com
ie31200_register_mci() calls device_initialize() for priv->dev
unconditionally. However, in the error path, put_device() is not
called, leading to an imbalance. Similarly, in the unload path,
put_device() is missing.
Although edac_mc_free() eventually frees the memory, it does not
release the device initialized by device_initialize(). For code
readability and proper pairing of device_initialize()/put_device(),
add put_device() calls in both error and unload paths.
Found by code review.
Signed-off-by: Ma Ke <make24@iscas.ac.cn>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Link: https://patch.msgid.link/20251106084735.35017-1-make24@iscas.ac.cn
The current code assumes that only DDR errors have split messages. Ensure
proper logging of non-standard event errors that may be split across multiple
messages too.
[ bp: Massage, move comment too, fix it up. ]
Fixes: d5fe2fec6c ("EDAC: Add a driver for the AMD Versal NET DDR controller")
Signed-off-by: Shubhrajyoti Datta <shubhrajyoti.datta@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://patch.msgid.link/20251023113108.3467132-1-shubhrajyoti.datta@amd.com
Commit
1997471069 ("edac: add a new per-dimm API and make the old per-virtual-rank API obsolete")
introduced a new per-DIMM sysfs interface for EDAC making the old
per-virtual-rank sysfs interface obsolete.
Since this new sysfs interface was introduced more than a decade ago, remove
the obsolete legacy interface.
Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20251106015727.1987246-1-avadhut.naik@amd.com
Currently, the NUM_CONTROLLERS macro is used to limit the number of memory
controllers (UMCs) available per node. The number of UMCs available per node,
however, is already cached by the max_mcs variable of struct amd64_pvt.
Allocate the relevant data structures dynamically using the variable instead
of static allocation through the macro.
The max_mcs variable is used for legacy systems too. These systems have a max
of 2 controllers. Since the default value of max_mcs, set in per_family_init(),
is 2, these legacy systems are also covered.
Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20251106015727.1987246-1-avadhut.naik@amd.com
Currently, the ctl_name string is statically assigned based on the family and
model of the SOC when the amd64_edac module is loaded.
This is not strictly necessary, however, as the string can be generated and
assigned at runtime through scnprintf().
Remove all static assignments and generate the string at runtime. Also,
clean up the switch cases which became defunct and consolidate identical cases.
Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20251106015727.1987246-1-avadhut.naik@amd.com
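Generating the name at runtime could look roughly like the userspace sketch below. The format string and the helper name format_ctl_name() are assumptions made for illustration, and snprintf() stands in for the kernel's scnprintf().

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Build a controller name from family and model instead of selecting
 * one of many static strings. The "F..h_M..h" layout is an assumption.
 */
static int format_ctl_name(char *buf, size_t size,
			   unsigned int fam, unsigned int model)
{
	return snprintf(buf, size, "F%02Xh_M%02Xh", fam, model);
}
```

A single formatting call of this kind replaces one static string per family and model combination.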
The priv->mci[] array has NUM_CONTROLLERS so this > comparison needs to be >=
to prevent an out of bounds access.
Fixes: d5fe2fec6c ("EDAC: Add a driver for the AMD Versal NET DDR controller")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
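The off-by-one described here is easy to reproduce in isolation. In the sketch below, the array size, the table, and the helper lookup_controller() are illustrative assumptions and are not taken from the driver.

```c
#include <stddef.h>

#define NUM_CONTROLLERS 8	/* illustrative size, not taken from the driver */

static int mci_table[NUM_CONTROLLERS];

/* Return the slot for idx, or NULL when idx is out of range. */
static int *lookup_controller(unsigned int idx)
{
	/* '>' would let idx == NUM_CONTROLLERS through, one past the end. */
	if (idx >= NUM_CONTROLLERS)
		return NULL;

	return &mci_table[idx];
}
```

Valid indices run from 0 to NUM_CONTROLLERS - 1, so the rejecting comparison has to include NUM_CONTROLLERS itself.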
Merge tag 'edac_updates_for_v6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras
Pull EDAC updates from Borislav Petkov:
- Add support for new AMD family 0x1a models to amd64_edac
- Add an EDAC driver for the AMD VersalNET memory controller which
reports hw errors from different IP blocks in the fabric using an
IPC-type transport
- Drop the silly static number of memory controllers in the Intel EDAC
drivers (skx, i10nm) in favor of a flexible array so that the former
doesn't need to be increased with every new generation which adds
more memory controllers; along with a proper refactoring
- Add support for two Alder Lake-S SOCs to ie31200_edac
- Add an EDAC driver for ARM Cortex A72 cores, and specifically for
reporting L1 and L2 cache errors
- Last but not least, the usual fixes, cleanups and improvements all
over the subsystem
* tag 'edac_updates_for_v6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras: (23 commits)
EDAC/versalnet: Return the correct error in mc_probe()
EDAC/mc_sysfs: Increase legacy channel support to 16
EDAC/amd64: Add support for AMD family 1Ah-based newer models
EDAC: Add a driver for the AMD Versal NET DDR controller
dt-bindings: memory-controllers: Add support for Versal NET EDAC
RAS: Export log_non_standard_event() to drivers
cdx: Export Symbols for MCDI RPC and Initialization
cdx: Split mcdi.h and reorganize headers
EDAC/skx_common: Use topology_physical_package_id() instead of open coding
EDAC: Fix wrong executable file modes for C source files
EDAC/altera: Use dev_fwnode()
EDAC/skx_common: Remove unused *NUM*_IMC macros
EDAC/i10nm: Reallocate skx_dev list if preconfigured cnt != runtime cnt
EDAC/skx_common: Remove redundant upper bound check for res->imc
EDAC/skx_common: Make skx_dev->imc[] a flexible array
EDAC/skx_common: Swap memory controller index mapping
EDAC/skx_common: Move mc_mapping to be a field inside struct skx_imc
EDAC/{skx_common,skx}: Use configuration data, not global macros
EDAC/i10nm: Skip DIMM enumeration on a disabled memory controller
EDAC/ie31200: Add two more Intel Alder Lake-S SoCs for EDAC support
...
* edac-drivers:
EDAC/versalnet: Return the correct error in mc_probe()
EDAC/mc_sysfs: Increase legacy channel support to 16
EDAC/amd64: Add support for AMD family 1Ah-based newer models
EDAC: Add a driver for the AMD Versal NET DDR controller
dt-bindings: memory-controllers: Add support for Versal NET EDAC
RAS: Export log_non_standard_event() to drivers
cdx: Export Symbols for MCDI RPC and Initialization
cdx: Split mcdi.h and reorganize headers
EDAC/skx_common: Use topology_physical_package_id() instead of open coding
EDAC/altera: Use dev_fwnode()
EDAC/skx_common: Remove unused *NUM*_IMC macros
EDAC/i10nm: Reallocate skx_dev list if preconfigured cnt != runtime cnt
EDAC/skx_common: Remove redundant upper bound check for res->imc
EDAC/skx_common: Make skx_dev->imc[] a flexible array
EDAC/skx_common: Swap memory controller index mapping
EDAC/skx_common: Move mc_mapping to be a field inside struct skx_imc
EDAC/{skx_common,skx}: Use configuration data, not global macros
EDAC/i10nm: Skip DIMM enumeration on a disabled memory controller
EDAC/ie31200: Add two more Intel Alder Lake-S SoCs for EDAC support
dt-bindings: arm: cpus: Add edac-enabled property
EDAC: Add EDAC driver for ARM Cortex A72 cores
* edac-misc:
EDAC: Fix wrong executable file modes for C source files
MAINTAINERS: EDAC: Drop inactive reviewers
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Newer AMD systems can support up to 16 channels per EDAC "mc" device.
These are detected by the EDAC module running on the device, and the
current EDAC interface is appropriately enumerated.
The legacy EDAC sysfs interface, however, provides device attributes for
channels 0 through 11 only. Consequently, the last four channels, 12
through 15, will not be enumerated and will not be visible through the
legacy sysfs interface.
Add additional device attributes to ensure that all 16 channels, if
present, are enumerated by and visible through the legacy EDAC sysfs
interface.
Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250916203242.1281036-1-avadhut.naik@amd.com
Add support for family 1Ah-based models 50h-57h, 90h-9Fh, A0h-AFh, and
C0h-C7h.
Also, raise the maximum number of memory controllers, as those machines
support that many.
Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250916203242.1281036-1-avadhut.naik@amd.com
Add a driver for the AMD Versal NET DDR memory controller which supports
single bit error correction, double bit error detection and other system
errors from various IP subsystems (e.g., RPU, NOCs, HNICX, PL).
The driver listens for notifications from the NMC (Network management
controller) using RPMsg (Remote Processor Messaging).
The channel used for communicating to RPMsg is named "error_edac". Upon
receipt of a notification, the driver sends a RAS event trace.
[ bp:
- Fixup title
- Rewrite commit message
- Fixup Kconfig text
- Zap unused defines and align them
- Simplify rpmsg_cb() considerably
- Drop silly double-brackets in conditionals
- Use proper void * type in mcdi_request()
- Do not clear chinfo in rpmsg_probe() unnecessarily
- Fix indentation
- Do a proper err unwind path in init_versalnet()
- Redo the error unwind path in mc_probe() properly
- Fix the ordering in mc_remove()
]
Signed-off-by: Shubhrajyoti Datta <shubhrajyoti.datta@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250908115649.22903-1-shubhrajyoti.datta@amd.com
Link: https://lore.kernel.org/r/20250703173105.GLaGa-WQCESDNsqygm@fat_crate.local
Use topology_physical_package_id() to get the CPU package ID instead of
open coding.
Suggested-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20250903030648.3285935-1-qiuxu.zhuo@intel.com
Three EDAC source files were mistakenly marked as executable when adding the
EDAC scrub controls.
These are plain C source files and should not carry the executable bit.
Correcting their modes follows the principle of least privilege and avoids
unnecessary execute permissions in the repository.
[ bp: Massage commit message. ]
Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250828191954.903125-1-visitorckw@gmail.com
dma_free_coherent() must only be called if the corresponding
dma_alloc_coherent() call has succeeded. Calling it when the allocation fails
leads to undefined behavior.
Delete the wrong call.
[ bp: Massage commit message. ]
Fixes: 71bcada88b ("edac: altera: Add Altera SDRAM EDAC support")
Signed-off-by: Salah Triki <salah.triki@gmail.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Dinh Nguyen <dinguyen@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/aIrfzzqh4IzYtDVC@pc
irq_domain_create_simple() takes fwnode as the first argument. It can be
extracted from the struct device using dev_fwnode() helper instead of using
of_node with of_fwnode_handle().
So use the dev_fwnode() helper.
Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Acked-by: Dinh Nguyen <dinguyen@kernel.org>
Link: https://lore.kernel.org/20250723062631.1830757-1-jirislaby@kernel.org
Ideally, read the present DDR memory controller count first and then
allocate the skx_dev list using this count. However, this approach
requires adding a significant amount of code similar to
skx_get_all_bus_mappings() to obtain the PCI bus mappings for the first
socket and use these mappings along with the related PCI register offset
to read the memory controller count.
Given that the Granite Rapids CPU is the only one that can detect the
count of memory controllers at runtime (other CPUs use the count in the
configuration data), to reduce code complexity, reallocate the skx_dev
list only if the preconfigured count of DDR memory controllers differs
from the count read at runtime for Granite Rapids CPU.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20250731145534.2759334-7-qiuxu.zhuo@intel.com
The following upper bound check for the memory controller physical index
decoded by ADXL is the only place where the macro 'NUM_IMC' is used:
res->imc > NUM_IMC - 1
Since this check is already covered by skx_get_mc_mapping(), meaning no
memory controller logical index exists for an invalid memory controller
physical index decoded by ADXL, remove the redundant upper bound check
so that the definition for 'NUM_IMC' can be cleaned up (in another patch).
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20250731145534.2759334-6-qiuxu.zhuo@intel.com
The current skx->imc[NUM_IMC] array of memory controller instances is
sized using the macro NUM_IMC. Each time EDAC support is added for a
new CPU, NUM_IMC needs to be updated to ensure it is greater than or
equal to the number of memory controllers for the new CPU. This approach
is inconvenient and results in memory waste for older CPUs with fewer
memory controllers.
To address this, make skx->imc[] a flexible array and determine its size
from configuration data or at runtime.
Suggested-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20250731145534.2759334-5-qiuxu.zhuo@intel.com
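A flexible array member sized at allocation time, as described in this commit, can be sketched in userspace as follows. The struct and function names are hypothetical, and calloc() stands in for the kernel allocator; in kernel code the size would typically be computed with struct_size().

```c
#include <stddef.h>
#include <stdlib.h>

struct imc_sketch {
	int mc;				/* logical controller index */
};

struct dev_sketch {
	int num_imc;
	struct imc_sketch imc[];	/* flexible array member, sized at allocation time */
};

/* Allocate a device with exactly num_imc controller slots. */
static struct dev_sketch *dev_sketch_alloc(int num_imc)
{
	struct dev_sketch *d;
	int i;

	if (num_imc < 0)
		return NULL;

	d = calloc(1, sizeof(*d) + (size_t)num_imc * sizeof(d->imc[0]));
	if (!d)
		return NULL;

	d->num_imc = num_imc;
	for (i = 0; i < num_imc; i++)
		d->imc[i].mc = i;

	return d;
}
```

A CPU model with four controllers then allocates four slots rather than a compile-time maximum shared by all models.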
The current mapping of memory controller indices is from physical index [1]
to logical index [2], as shown below:
skx_dev->imc[pmc].mc_mapping = lmc
Since skx_dev->imc[] is an array of present memory controller instances,
mapping memory controller indices from logical index to physical index,
as shown below, is more reasonable. This is also a preparatory step for
making skx_dev->imc[] a flexible array.
skx_dev->imc[lmc].mc_mapping = pmc
Both mappings are equivalent. No functional changes intended.
[1] Indices for memory controllers include both those present to the
OS and those disabled by BIOS.
[2] Indices for memory controllers present to the OS.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20250731145534.2759334-4-qiuxu.zhuo@intel.com
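The logical-to-physical direction can be illustrated with a small standalone sketch. The array size, the "present" table, and both helper names are assumptions for the example; the real driver stores the mapping in skx_dev->imc[lmc].mc_mapping.

```c
#define MAX_PMC 4	/* illustrative number of physical controller slots */

/*
 * Fill l2p[] so that l2p[lmc] holds the physical index of the lmc-th
 * controller present to the OS. Returns the number of present controllers.
 */
static int build_l2p(const int present[MAX_PMC], int l2p[MAX_PMC])
{
	int lmc = 0;

	for (int pmc = 0; pmc < MAX_PMC; pmc++)
		if (present[pmc])
			l2p[lmc++] = pmc;

	return lmc;
}

/* Find the logical index for a physical index, or -1 if it is not present. */
static int pmc_to_lmc(const int l2p[], int count, int pmc)
{
	for (int lmc = 0; lmc < count; lmc++)
		if (l2p[lmc] == pmc)
			return lmc;

	return -1;
}
```

Because only present controllers get a logical index, the table needs one entry per present controller, which is what allows the array to become flexible. A lookup for a disabled or invalid physical index simply finds no match.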
The mc_mapping and imc fields of struct skx_dev have the same size,
NUM_IMC. Move mc_mapping to be a field inside struct skx_imc to prepare
for making the imc array of memory controller instances a flexible array.
No functional changes intended.
Suggested-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20250731145534.2759334-3-qiuxu.zhuo@intel.com
Use model-specific configuration data for the number of memory controllers
per socket, channels per memory controller, and DIMMs per channel as
intended, instead of relying on global macros for maximum values.
No functional changes intended.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20250731145534.2759334-2-qiuxu.zhuo@intel.com
When loading the i10nm_edac driver on some Intel Granite Rapids servers,
a call trace may appear as follows:
UBSAN: shift-out-of-bounds in drivers/edac/skx_common.c:453:16
shift exponent -66 is negative
...
__ubsan_handle_shift_out_of_bounds+0x1e3/0x390
skx_get_dimm_info.cold+0x47/0xd40 [skx_edac_common]
i10nm_get_dimm_config+0x23e/0x390 [i10nm_edac]
skx_register_mci+0x159/0x220 [skx_edac_common]
i10nm_init+0xcb0/0x1ff0 [i10nm_edac]
...
This occurs because some BIOSes may disable a memory controller if there
aren't any DIMMs populated on this memory controller. The DIMMMTR
register of this disabled memory controller contains the invalid value
~0, resulting in the call trace above.
Fix this call trace by skipping DIMM enumeration on a disabled memory
controller.
Fixes: ba987eaaab ("EDAC/i10nm: Add Intel Granite Rapids server support")
Reported-by: Jose Jesus Ambriz Meza <jose.jesus.ambriz.meza@intel.com>
Reported-by: Chia-Lin Kao (AceLan) <acelan.kao@canonical.com>
Closes: https://lore.kernel.org/all/20250730063155.2612379-1-acelan.kao@canonical.com/
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Chia-Lin Kao (AceLan) <acelan.kao@canonical.com>
Link: https://lore.kernel.org/r/20250806065707.3533345-1-qiuxu.zhuo@intel.com
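The idea of skipping a controller whose register reads back as ~0 can be sketched as below. The register layout in rows_from_reg() is invented for the example and is not the real DIMMMTR format; the helper names are hypothetical as well.

```c
#include <stdint.h>

#define REG_INVALID 0xFFFFFFFFu	/* all-ones read from a disabled controller */

/* Hypothetical layout: bits [1:0] encode (row bits - 12). */
static int rows_from_reg(uint32_t reg)
{
	return (int)(reg & 0x3u) + 12;
}

/* Count the controllers whose register holds a decodable value. */
static int count_enumerated(const uint32_t *regs, int n)
{
	int enumerated = 0;

	for (int i = 0; i < n; i++) {
		if (regs[i] == REG_INVALID)
			continue;	/* disabled controller: nothing to decode */

		(void)rows_from_reg(regs[i]);
		enumerated++;
	}

	return enumerated;
}
```

Checking for the all-ones value before decoding keeps invalid field values from reaching any shift or size computation.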
The driver is designed to support error detection and reporting for
Cortex A72 cores, specifically within their L1 and L2 cache systems.
The errors are detected by reading CPU/L2 memory error syndrome
registers.
Unfortunately there is no robust way to inject errors into the caches,
so this driver doesn't contain any code to actually test it. It has
been tested though with code taken from an older version [1] of this
driver. For reasons stated in thread [1], the error injection code is
not suitable for mainline, so it is removed from the driver.
[1] https://lore.kernel.org/all/1521073067-24348-1-git-send-email-york.sun@nxp.com/#t
[ bp: minor touchups. ]
Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Co-developed-by: Vijay Balakrishna <vijayb@linux.microsoft.com>
Signed-off-by: Vijay Balakrishna <vijayb@linux.microsoft.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/1752714390-27389-2-git-send-email-vijayb@linux.microsoft.com
Merge tag 'edac_updates_for_v6.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras
Pull EDAC updates from Borislav Petkov:
- i10nm:
- switch to using scnprintf()
- Add Granite Rapids-D support
- synopsys: Make sure ECC error and counter registers are cleared
during init/probing to avoid reporting stale errors
- igen6: Add Wildcat Lake SoCs support
- Make sure scrub features sysfs attributes are initialized properly
- Allocate memory repair sysfs attributes statically to reduce stack
usage
- Fix DIMM module size computation for DIMMs with total capacity which
is a non power-of-two number, in amd64_edac
- Do not be too dramatic when reporting disabled memory controllers in
igen6_edac
- Add support to ie31200_edac for the following SoCs:
- Core i5-14[67]00
- Bartlett Lake-S SoCs
- Raptor Lake-HX
* tag 'edac_updates_for_v6.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
EDAC/{skx_common,i10nm}: Use scnprintf() for safer buffer handling
EDAC/synopsys: Clear the ECC counters on init
EDAC/ie31200: Add Intel Raptor Lake-HX SoCs support
EDAC/igen6: Add Intel Wildcat Lake SoCs support
EDAC/i10nm: Add Intel Granite Rapids-D support
EDAC/mem_repair: Reduce stack usage in edac_mem_repair_get_desc()
EDAC/igen6: Reduce log level to debug for absent memory controllers
EDAC/ie31200: Document which CPUs correspond to each Raptor Lake-S device ID
EDAC/ie31200: Enable support for Core i5-14600 and i7-14700
ie31200/EDAC: Add Intel Bartlett Lake-S SoCs support
snprintf() is fragile when its return value will be used to append
additional data to a buffer. Use scnprintf() instead.
Signed-off-by: Wang Haoran <haoranwangsec@gmail.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Link: https://lore.kernel.org/r/20250715131700.1092720-1-haoranwangsec@gmail.com
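The difference matters when the return value is used as an offset for the next write. The userspace sketch below approximates the kernel's scnprintf() semantics with vsnprintf(); my_scnprintf() and build_label() are hypothetical names introduced for the example.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Userspace approximation of scnprintf(): return the number of characters
 * actually stored (excluding the NUL), never the would-be length.
 */
static int my_scnprintf(char *buf, size_t size, const char *fmt, ...)
{
	va_list args;
	int n;

	if (size == 0)
		return 0;

	va_start(args, fmt);
	n = vsnprintf(buf, size, fmt, args);
	va_end(args);

	if (n < 0)
		return 0;
	if ((size_t)n >= size)
		return (int)(size - 1);	/* output was truncated */
	return n;
}

/* Append two pieces to a buffer, advancing by the stored length each time. */
static size_t build_label(char *buf, size_t size)
{
	size_t len = 0;

	len += my_scnprintf(buf + len, size - len, "%s", "channel:");
	len += my_scnprintf(buf + len, size - len, "%d", 12345);

	return len;
}
```

With plain snprintf(), a truncated first write would return the untruncated length, so len could exceed size and the second call would receive a pointer past the buffer and a wrapped-around size.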
Clear the ECC error and counter registers during initialization/probe to avoid
reporting stale errors that may have occurred before EDAC registration.
For that, unify the Zynq and ZynqMP ECC state reading paths and simplify the
code.
[ bp: Massage commit message.
Fix an -Wsometimes-uninitialized warning as reported by
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202507141048.obUv3ZUm-lkp@intel.com ]
Signed-off-by: Shubhrajyoti Datta <shubhrajyoti.datta@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250713050753.7042-1-shubhrajyoti.datta@amd.com
Intel Raptor Lake-HX SoC shares the same memory controller registers
as Raptor Lake-S SoC. Add a compute die ID for Raptor Lake-HX SoCs with
Out-of-Band ECC capability for EDAC support.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Laurens SEGHERS <laurens@rale.com>
Link: https://lore.kernel.org/r/20250704151609.7833-4-qiuxu.zhuo@intel.com
Intel Wildcat Lake is a mobile derivative of Panther Lake with one
memory controller. Wildcat Lake SoCs share the same IBECC registers
with Meteor Lake-P SoCs.
Add a compute die ID and a new configuration structure for Wildcat
Lake SoCs with In-Band ECC capability for EDAC support.
Signed-off-by: Lili Li <lili.li@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20250704151609.7833-3-qiuxu.zhuo@intel.com
The Granite Rapids-D CPU model uses memory controller registers similar
to those of the Granite Rapids server CPU but with a different memory
controller MMIO base.
Add the Granite Rapids-D CPU model ID and use the new memory controller
MMIO base for EDAC support.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: VikasX Chougule <vikasx.chougule@intel.com>
Link: https://lore.kernel.org/r/20250704151609.7833-2-qiuxu.zhuo@intel.com
Constructing an array on the stack adds complexity and can exceed the
warning limit for per-function stack usage:
drivers/edac/mem_repair.c:361:5: error: stack frame size (1296) exceeds
limit (1280) in 'edac_mem_repair_get_desc' [-Werror,-Wframe-larger-than]
Change this to have the actual attribute array allocated statically and then
just add the instance number on the per-instance copy.
Fixes: 699ea5219c ("EDAC: Add a memory repair control feature")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250620114135.4017183-1-arnd@kernel.org
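The approach described in the mem_repair commit above (a statically allocated
attribute array plus a per-instance copy that only gets the instance number
filled in) might look roughly like the following userspace sketch. The struct,
the attribute names and the function name are all hypothetical and much simpler
than the real EDAC types.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified attribute type; not the kernel's struct. */
struct repair_attr {
	const char *name;
	unsigned int instance;
};

/* The template lives in static storage, so it costs no stack space. */
static const struct repair_attr repair_attr_template[] = {
	{ .name = "repair_type" },
	{ .name = "persist_mode" },
	{ .name = "repair" },
};

#define NUM_REPAIR_ATTRS \
	(sizeof(repair_attr_template) / sizeof(repair_attr_template[0]))

/*
 * Build the per-instance copy on the heap: copy the static template,
 * then stamp the instance number on each entry. The caller frees it.
 */
static struct repair_attr *repair_attrs_for_instance(unsigned int instance)
{
	struct repair_attr *attrs;
	size_t i;

	/* The template must describe at least one attribute. */
	assert(NUM_REPAIR_ATTRS > 0);

	attrs = malloc(sizeof(repair_attr_template));
	if (!attrs)
		return NULL;

	memcpy(attrs, repair_attr_template, sizeof(repair_attr_template));
	for (i = 0; i < NUM_REPAIR_ATTRS; i++)
		attrs[i].instance = instance;

	return attrs;
}
```

The function's stack frame then holds only a pointer and a loop counter,
regardless of how many attributes the template describes.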
Each Chip-Select (CS) of a Unified Memory Controller (UMC) on AMD Zen-based
SoCs has an Address Mask and a Secondary Address Mask register associated with
it. The amd64_edac module logs DIMM sizes on a per-UMC per-CS granularity
during init using these two registers.
Currently, the module primarily considers only the Address Mask register for
computing DIMM sizes. The Secondary Address Mask register is only considered
for odd CS. Additionally, if it has been considered, the Address Mask register
is ignored altogether for that CS. For power-of-two DIMMs i.e. DIMMs whose
total capacity is a power of two (32GB, 64GB, etc), this is not an issue
since only the Address Mask register is used.
For non-power-of-two DIMMs, i.e. DIMMs whose total capacity is not a power of
two (48GB, 96GB, etc.), however, the Secondary Address Mask register is used
in conjunction with the Address Mask register. Since the module considers only
one of the two registers for a given CS, the size it computes is incorrect:
the Secondary Address Mask register is not considered for even CS, and the
Address Mask register is not considered for odd CS.
Introduce a new helper function so that both Address Mask and Secondary
Address Mask registers are considered, when valid, for computing DIMM sizes.
Also rename some variables for greater clarity.
Fixes: 81f5090db8 ("EDAC/amd64: Support asymmetric dual-rank DIMMs")
Closes: https://lore.kernel.org/dbec22b6-00f2-498b-b70d-ab6f8a5ec87e@natrix.lt
Reported-by: Žilvinas Žaltiena <zilvinas@natrix.lt>
Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Tested-by: Žilvinas Žaltiena <zilvinas@natrix.lt>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/20250529205013.403450-1-avadhut.naik@amd.com
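The idea in the amd64_edac commit above (take both mask registers into account
when the secondary one is valid) can be illustrated with a simplified model.
This is a sketch under loud assumptions, not the driver's helper: it assumes
register bits [31:1] map to address bits [39:9], that each mask is contiguous
(no interleaving bits to strip), and that a zero secondary mask means "not
valid". The function names and the mask values in the usage note are invented.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Size in kB covered by one mask register. ASSUMPTION: register bits
 * [31:1] map to address bits [39:9] and the mask is contiguous, so
 * size_kB = (mask >> 2) + 1.
 */
static uint32_t mask_to_size_kb(uint32_t addr_mask)
{
	return (addr_mask >> 2) + 1;
}

/*
 * Chip-select size in MB: the primary mask always contributes, and the
 * secondary mask contributes in addition when it is valid (non-zero).
 */
static uint32_t cs_size_mb(uint32_t addr_mask, uint32_t addr_mask_sec)
{
	uint32_t size_kb;

	assert(addr_mask != 0);
	size_kb = mask_to_size_kb(addr_mask);
	if (addr_mask_sec)
		size_kb += mask_to_size_kb(addr_mask_sec);

	return size_kb >> 10;
}
```

Under these assumptions, cs_size_mb(0x3FFFFFE, 0) yields 16384 MB for a
power-of-two rank, while cs_size_mb(0x3FFFFFE, 0x1FFFFFE) yields 24576 MB
(16GB + 8GB), the per-rank size one would expect for a dual-rank 48GB DIMM.
Considering only one of the two masks would report 16GB or 8GB instead.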