.. SPDX-License-Identifier: GPL-2.0

=================================================
Recoverable Hardware Error Tracking in vmcoreinfo
=================================================

Overview
--------

This feature provides a generic infrastructure within the Linux kernel to track
and log recoverable hardware errors. These are recoverable hardware errors,
visible to the kernel, that might not cause an immediate panic but may affect
system health, mainly because handling them causes the kernel to execute code
paths it would not otherwise take.

By recording counts and timestamps of recoverable errors into the vmcoreinfo
crash dump notes, this infrastructure aids post-mortem crash analysis tools in
correlating hardware events with kernel failures. This enables faster triage
and better understanding of root causes, especially in large-scale cloud
environments where hardware issues are common.

Benefits
--------

- Facilitates correlation of recoverable hardware errors with kernel panics or
  unusual code paths that lead to system crashes.
- Provides operators and cloud providers quick insights, improving reliability
  and reducing troubleshooting time.
- Complements existing full hardware diagnostics without replacing them.

Data Exposure and Consumption
-----------------------------

- The tracked error data consists of per-error-type counts and the timestamp of
  the last occurrence.
- This data is stored in the ``hwerror_data`` array, with one entry per error
  source area: CPU, memory, PCI, CXL, and others.
- It is exposed via vmcoreinfo crash dump notes and can be read using tools
  like ``crash``, ``drgn``, or other kernel crash analysis utilities.
- This data can be read only from crash dumps; no other interface exposes it.

Typical usage example (in drgn REPL):

.. code-block:: python

  >>> prog['hwerror_data']
  (struct hwerror_info[HWERR_RECOV_MAX]){
          {
                  .count = (int)844,
                  .timestamp = (time64_t)1752852018,
          },
          ...
  }
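
The raw array above can be turned into a short report. The following is a
minimal sketch, not part of the kernel or of drgn: ``summarize_hwerrors`` and
``ASSUMED_AREAS`` are hypothetical names, the index-to-area order is an
assumption that should be checked against the kernel source being debugged, and
the timestamp is assumed to be wall-clock seconds since the Unix epoch, as the
``time64_t`` type suggests. The sample values are taken from the output above
and placed at index 0 purely for illustration.

```python
from datetime import datetime, timezone

# Hypothetical index-to-area mapping (assumption): the real order is defined
# by the kernel's enum and must be verified against the kernel source.
ASSUMED_AREAS = ["cpu", "memory", "pci", "cxl", "others"]


def summarize_hwerrors(entries, areas=ASSUMED_AREAS):
    """Return one readable line per area that recorded at least one error.

    entries is a sequence of (count, timestamp) pairs; timestamp is taken to
    be wall-clock seconds since the Unix epoch (an assumption).
    """
    lines = []
    for area, (count, timestamp) in zip(areas, entries):
        if count == 0:
            continue  # nothing was recorded for this area
        last = datetime.fromtimestamp(timestamp, tz=timezone.utc)
        lines.append(f"{area}: {count} error(s), last at {last.isoformat()}")
    return lines


# In a drgn session the pairs could be built from the array, for example:
#   entries = [(e.count.value_(), e.timestamp.value_())
#              for e in prog["hwerror_data"]]
# Here the values from the example output stand in for a real dump.
entries = [(844, 1752852018), (0, 0), (0, 0), (0, 0), (0, 0)]
report = summarize_hwerrors(entries)
print("\n".join(report))
```

Keeping the formatting logic separate from the drgn access means it can be
exercised without a crash dump at hand; only the commented list comprehension
depends on drgn.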

Enabling
--------

- This feature is enabled when CONFIG_VMCORE_INFO is set.
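
Whether a given kernel was built with this option can be checked from its
configuration file. The sketch below is only an illustration:
``config_option_enabled`` is a hypothetical helper, and the sample text stands
in for the contents of a real config file, which distributions commonly install
as ``/boot/config-<release>`` (the location varies and is an assumption here).

```python
def config_option_enabled(config_text, option="CONFIG_VMCORE_INFO"):
    """Return True if option is set to 'y' in the given kernel config text."""
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options appear as comments: "# CONFIG_X is not set".
        if line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if name == option:
            return value == "y"
    return False


# Sample config text (illustrative); CONFIG_EXAMPLE_OPTION is a made-up name.
sample = """\
# CONFIG_EXAMPLE_OPTION is not set
CONFIG_VMCORE_INFO=y
"""
print(config_option_enabled(sample))
```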