.. SPDX-License-Identifier: GPL-2.0+

============================
DRM RAS over Generic Netlink
============================

The DRM RAS (Reliability, Availability, Serviceability) interface provides a
standardized way for GPU/accelerator drivers to expose error counters and
other reliability nodes to user space via Generic Netlink. This allows
diagnostic tools, monitoring daemons, or test infrastructure to query hardware
health in a uniform way across different DRM drivers.

Key Goals:

* Provide a standardized RAS solution for GPU and accelerator drivers, enabling
  data center monitoring and reliability operations.
* Implement a single drm-ras Generic Netlink family to meet modern Netlink YAML
  specifications and centralize all RAS-related communication in one namespace.
* Support a basic error counter interface, addressing the immediate, essential
  monitoring needs.
* Offer a flexible, future-proof interface that can be extended to support
  additional types of RAS data in the future.
* Allow multiple nodes per driver, enabling drivers to register separate
  nodes for different IP blocks, sub-blocks, or other logical subdivisions
  as applicable.

Nodes
=====

Nodes are logical abstractions representing an error type or error source within
the device. Currently, only error counter nodes are supported.

Drivers are responsible for registering and unregistering nodes via the
`drm_ras_node_register()` and `drm_ras_node_unregister()` APIs.

Node Management
---------------

.. kernel-doc:: drivers/gpu/drm/drm_ras.c
   :doc: DRM RAS Node Management

.. kernel-doc:: drivers/gpu/drm/drm_ras.c
   :internal:

Generic Netlink Usage
=====================

The interface is implemented as a Generic Netlink family named ``drm-ras``.
User space tools can:

* List registered nodes with the ``list-nodes`` command.
* List all error counters in a node with the ``get-error-counter`` command,
  passing ``node-id`` as a parameter.
* Query a specific error counter value with the ``get-error-counter`` command,
  passing both ``node-id`` and ``error-id`` as parameters.

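
The three request forms listed above can be illustrated with a toy, in-memory
model. This is only a sketch of the request semantics as described in this
document, not the kernel implementation: the ``NODES`` table, the helpers
``list_nodes()`` and ``get_error_counter()``, and the sample values are
hypothetical and merely reuse the attribute names that appear in this file.

.. code-block:: python

    # Hypothetical stand-in for the nodes registered by a driver.
    # Keys are node IDs; "counters" maps error ID -> (error name, value).
    NODES = {
        0: {"device-name": "0000:03:00.0",
            "node-name": "correctable-errors",
            "node-type": "error-counter",
            "counters": {1: ("error_name1", 0), 2: ("error_name2", 0)}},
        1: {"device-name": "0000:03:00.0",
            "node-name": "uncorrectable-errors",
            "node-type": "error-counter",
            "counters": {1: ("error_name1", 0)}},
    }


    def list_nodes():
        """Model of ``list-nodes``: one entry per registered node."""
        return [{"device-name": node["device-name"],
                 "node-id": node_id,
                 "node-name": node["node-name"],
                 "node-type": node["node-type"]}
                for node_id, node in sorted(NODES.items())]


    def get_error_counter(node_id, error_id=None):
        """Model of ``get-error-counter``.

        With only ``node_id`` it returns every counter of the node (the
        dump form); with ``error_id`` as well it returns a single counter.
        """
        counters = NODES[node_id]["counters"]
        if error_id is None:
            return [{"error-id": eid, "error-name": name, "error-value": value}
                    for eid, (name, value) in sorted(counters.items())]
        name, value = counters[error_id]
        return {"error-id": error_id, "error-name": name, "error-value": value}

In this model an unknown ``node_id`` or ``error_id`` simply raises
``KeyError``; how the real family reports such errors is not covered here.
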
YAML-based Interface
--------------------

The interface is described in the YAML specification
``Documentation/netlink/specs/drm_ras.yaml``.

This YAML is used to auto-generate user space bindings via
``tools/net/ynl/pyynl/ynl_gen_c.py``, and drives the structure of netlink
attributes and operations.

Usage Notes
-----------

* User space must first enumerate nodes to obtain their IDs.
* Node IDs or node names can be used for all further queries, such as error
  counters.
* Error counters can be queried by either the error ID or the error name.
* Query parameters should be defined as part of the uAPI to ensure user
  interface stability.
* The interface supports future extension by adding new node types and
  additional attributes.

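
The enumerate-then-query flow from the notes above could be scripted roughly
as follows. This is a sketch, not a supported tool. It assumes that the
``ynl`` command used in the examples of this document is on ``PATH``, that it
accepts ``--output-json`` to print replies as JSON, and that the caller has
sufficient privileges. The helper names ``ynl_cmd()``, ``run_ynl()`` and
``dump_all_counters()`` are hypothetical.

.. code-block:: python

    import json
    import subprocess

    FAMILY = "drm_ras"  # spec name, as passed to --family in this document


    def ynl_cmd(op, params=None, dump=False):
        """Build the ynl argument vector for one request."""
        cmd = ["ynl", "--family", FAMILY, "--output-json",
               "--dump" if dump else "--do", op]
        if params:
            cmd += ["--json", json.dumps(params)]
        return cmd


    def run_ynl(op, params=None, dump=False):
        """Run ynl and decode its JSON reply (needs a kernel with drm-ras)."""
        proc = subprocess.run(ynl_cmd(op, params, dump), check=True,
                              capture_output=True, text=True)
        return json.loads(proc.stdout)


    def dump_all_counters(run=run_ynl):
        """Enumerate nodes first, then dump the counters of each node.

        Returns a dict keyed by node ID, mapping error name to value.
        """
        result = {}
        for node in run("list-nodes", dump=True):
            counters = run("get-error-counter",
                           {"node-id": node["node-id"]}, dump=True)
            result[node["node-id"]] = {c["error-name"]: c["error-value"]
                                       for c in counters}
        return result

Passing the transport in as ``run`` keeps the enumeration logic testable
without hardware: a stub returning canned replies can replace ``run_ynl()``.
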
Example: List nodes using ynl

.. code-block:: bash

    sudo ynl --family drm_ras --dump list-nodes
    [{'device-name': '0000:03:00.0',
      'node-id': 0,
      'node-name': 'correctable-errors',
      'node-type': 'error-counter'},
     {'device-name': '0000:03:00.0',
      'node-id': 1,
      'node-name': 'uncorrectable-errors',
      'node-type': 'error-counter'}]

Example: List all error counters using ynl

.. code-block:: bash

    sudo ynl --family drm_ras --dump get-error-counter --json '{"node-id":0}'
    [{'error-id': 1, 'error-name': 'error_name1', 'error-value': 0},
     {'error-id': 2, 'error-name': 'error_name2', 'error-value': 0}]

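
The reply in the example above is printed as a Python literal, so a captured
reply can be turned back into data with ``ast.literal_eval()``. The sketch
below is illustrative only: ``nonzero_counters()`` is a hypothetical helper,
and the sample string repeats the reply shown above with one value changed to
a made-up non-zero number for demonstration.

.. code-block:: python

    import ast


    def nonzero_counters(reply_text):
        """Return {error name: value} for counters with a non-zero value."""
        counters = ast.literal_eval(reply_text)
        return {c["error-name"]: c["error-value"]
                for c in counters if c["error-value"] != 0}


    # Reply text in the format shown above; the value 5 is invented.
    sample = """[{'error-id': 1, 'error-name': 'error_name1', 'error-value': 0},
     {'error-id': 2, 'error-name': 'error_name2', 'error-value': 5}]"""

    print(nonzero_counters(sample))  # {'error_name2': 5}
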
Example: Query an error counter for a given node

.. code-block:: bash

    sudo ynl --family drm_ras --do get-error-counter --json '{"node-id":0, "error-id":1}'
    {'error-id': 1, 'error-name': 'error_name1', 'error-value': 0}