Tariq Toukan says:
====================
mlx5-next updates 2025-09-28
* tag 'mlx5-next-lag' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
net/mlx5: IFC add balance ID and LAG per MP group bits
net/mlx5: Add IFC bit for TIR/SQ order capability
====================
Link: https://patch.msgid.link/1759093989-841873-1-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add interface definitions for load balance ID and LAG per multiplane group
functionality. This patch introduces the hardware capability bits needed
to support balance ID in multiplane LAG configurations.
The new fields include:
- load_balance_id: 4-bit field for balance identifier.
- lag_per_mp_group: capability bit for LAG per multiplane group support.
These interface additions are prerequisites for implementing balance ID
support in the MLX5 driver.
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Shay Drori <shayd@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1758521191-814350-3-git-send-email-tariqt@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Before this cap, firmware requested a certain creation order between TIR
objects and SQs of the same transport domain to properly support the
self loopback prevention feature. If order is not preserved, explicit
modify_tir operations are necessary after the opening of the SQs.
When set, this cap bit indicates that this firmware requirement /
limitation no longer holds.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1758521191-814350-2-git-send-email-tariqt@nvidia.com
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Fix a kernel trace [1] caused by releasing an HWS action of a local flow
counter in mlx5_cmd_hws_delete_fte(), where the HWS action refcount and
mutex were not initialized and the counter struct could already be freed
when deleting the rule.
Fix it by adding the missing initializations and adding refcount for the
local flow counter struct.
[1] Kernel log:
Call Trace:
<TASK>
dump_stack_lvl+0x34/0x48
mlx5_fs_put_hws_action.part.0.cold+0x21/0x94 [mlx5_core]
mlx5_fc_put_hws_action+0x96/0xad [mlx5_core]
mlx5_fs_destroy_fs_actions+0x8b/0x152 [mlx5_core]
mlx5_cmd_hws_delete_fte+0x5a/0xa0 [mlx5_core]
del_hw_fte+0x1ce/0x260 [mlx5_core]
mlx5_del_flow_rules+0x12d/0x240 [mlx5_core]
? ttwu_queue_wakelist+0xf4/0x110
mlx5_ib_destroy_flow+0x103/0x1b0 [mlx5_ib]
uverbs_free_flow+0x20/0x50 [ib_uverbs]
destroy_hw_idr_uobject+0x1b/0x50 [ib_uverbs]
uverbs_destroy_uobject+0x34/0x1a0 [ib_uverbs]
uobj_destroy+0x3c/0x80 [ib_uverbs]
ib_uverbs_run_method+0x23e/0x360 [ib_uverbs]
? uverbs_finalize_object+0x60/0x60 [ib_uverbs]
ib_uverbs_cmd_verbs+0x14f/0x2c0 [ib_uverbs]
? do_tty_write+0x1a9/0x270
? file_tty_write.constprop.0+0x98/0xc0
? new_sync_write+0xfc/0x190
ib_uverbs_ioctl+0xd7/0x160 [ib_uverbs]
__x64_sys_ioctl+0x87/0xc0
do_syscall_64+0x59/0x90
Fixes: b581f42669 ("net/mlx5: fs, manage flow counters HWS action sharing by refcount")
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1758525094-816583-2-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Cross-merge networking fixes after downstream PR (net-6.17-rc7).
No conflicts.
Adjacent changes:
drivers/net/ethernet/mellanox/mlx5/core/en/fs.h
9536fbe10c ("net/mlx5e: Add PSP steering in local NIC RX")
7601a0a462 ("net/mlx5e: Add a miss level for ipsec crypto offload")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
First, allocate more doorbells in mlx5e_create_mdev_resources:
- one doorbell remains 'global' and will be used by all non-channel
associated SQs (e.g. ASO, HWS, PTP, ...).
- allocate additional 'num_doorbells' doorbells. This defaults to
minimum between 8 and max number of channels.
mlx5e_channel_pick_doorbell() now spreads out channel SQs across
available doorbells.
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Completion queues (CQs) in mlx5 use the same global doorbell, which may
become contended when accessed concurrently from many cores.
This patch prepares the CQ management code for supporting different
doorbells per CQ. This will be used in downstream patches to allow
separate doorbells to be used by channels CQs.
The main change is moving the 'uar' pointer from struct mlx5_core_cq to
struct mlx5e_cq, as the uar page to be used is better off stored
directly there. Other users of mlx5_core_cq also store the UAR to be
used separately and therefore the pointer being removed is dead weight
for them. As evidence, in this patch there are two users which set the
mcq.uar pointer but didn't use it, Software Steering and old Innova CQ
creation code. Instead, they rang the doorbell directly from another
pointer.
The 'uar' pointer added to struct mlx5e_cq remains in a hot cacheline
(as before), because it may get accessed for each packet.
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The global doorbell is used for more than just Ethernet resources, so
move it out of mlx5e_hw_objs into a common place (mlx5_priv), to avoid
non-Ethernet modules (e.g. HWS, ASO) depending on Ethernet structs.
Use this opportunity to consolidate it with the 'uar' pointer already
there, which was used as an RX doorbell. Underneath the 'uar' pointer is
identical to 'bfreg->up', so store a single resource and use that
instead.
For CQ doorbells, care is taken to always use bfreg->up->index instead
of bfreg->index, which may refer to a subsequent UAR page from the same
ALLOC_UAR batch on some NICs.
This paves the way for cleanly supporting multiple doorbells in the
Ethernet driver.
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The 'offset' field was introduced in the original commit [1] and never
used until commit [2], which added an unnecessary use.
Remove the field and refactor the write-combining test to use a local
variable instead.
[1] commit a6d51b6861 ("net/mlx5: Introduce blue flame register
allocator")
[2] commit d98995b4bf ("net/mlx5: Reimplement write combining test")
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Update the WQE metadata assignment to avoid overriding existing
metadata when setting the sysport timestamp ID. Since timestamp IDs are
limited to 256 values, they use only the lower 8 bits of the metadata
field.
To avoid conflicts, move IPsec and MACsec metadata ID to bits 8 and 9,
and shift the MACsec fs_id accordingly. This ensures safe coexistence
of timestamping and offload features that use the same metadata field.
Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1757574619-604874-4-git-send-email-tariqt@nvidia.com
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Introduce MLX5_ETH_WQE_FT_META_SHIFT as a shared base offset for
features that use the lower 8 bits of the WQE flow_table_metadata
field, currently used for timestamping, IPsec, and MACsec.
Define MLX5_ETH_WQE_FT_META_MACSEC_FS_ID_MASK so that fs_id occupies
bits 2–5, making it clear that fs_id occupies bits in the metadata.
Set MLX5_ETH_WQE_FT_META_MACSEC_MASK as the OR of the MACsec flag and
MLX5_ETH_WQE_FT_META_MACSEC_FS_ID_MASK, corresponding to the original
0x3E mask.
Update the fs_id macro to right-shift the MACsec flag by
MLX5_ETH_WQE_FT_META_SHIFT and update the RoCE modify-header action to
use it.
Introduce the helper macro MLX5_MACSEC_TX_METADATA(fs_id) to compose
the full shifted MACsec metadata value.
These changes make it explicit exactly which metadata bits carry MACsec
information, simplifying future feature exclusions when multiple
features share the WQE flowtable metadata.
In addition, drop the incorrect “RX flow steering” comment, since this
applies to TX flow steering.
Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1757574619-604874-3-git-send-email-tariqt@nvidia.com
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Now that the driver no longer uses VLAN TX insertion via the WQE
Ethernet segment, the related fields and flags can be removed.
Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1757574619-604874-2-git-send-email-tariqt@nvidia.com
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
The function mlx5_uplink_netdev_get() gets the uplink netdevice
pointer from mdev->mlx5e_res.uplink_netdev. However, the netdevice can
be removed and its pointer cleared when unbound from the mlx5_core.eth
driver. This results in a NULL pointer, causing a kernel panic.
BUG: unable to handle page fault for address: 0000000000001300
at RIP: 0010:mlx5e_vport_rep_load+0x22a/0x270 [mlx5_core]
Call Trace:
<TASK>
mlx5_esw_offloads_rep_load+0x68/0xe0 [mlx5_core]
esw_offloads_enable+0x593/0x910 [mlx5_core]
mlx5_eswitch_enable_locked+0x341/0x420 [mlx5_core]
mlx5_devlink_eswitch_mode_set+0x17e/0x3a0 [mlx5_core]
devlink_nl_eswitch_set_doit+0x60/0xd0
genl_family_rcv_msg_doit+0xe0/0x130
genl_rcv_msg+0x183/0x290
netlink_rcv_skb+0x4b/0xf0
genl_rcv+0x24/0x40
netlink_unicast+0x255/0x380
netlink_sendmsg+0x1f3/0x420
__sock_sendmsg+0x38/0x60
__sys_sendto+0x119/0x180
do_syscall_64+0x53/0x1d0
entry_SYSCALL_64_after_hwframe+0x4b/0x53
Ensure the pointer is valid before use by checking it for NULL. If it
is valid, immediately call netdev_hold() to take a reference, and
preventing the netdevice from being freed while it is in use.
Fixes: 7a9fb35e8c ("net/mlx5e: Do not reload ethernet ports when changing eswitch mode")
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1757939074-617281-2-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Selects which algorithm should be used by the NIC in order to decide rate of
CQE compression dependeng on PCIe bus conditions.
Supported values:
1) balanced, merges fewer CQEs, resulting in a moderate compression ratio
but maintaining a balance between bandwidth savings and performance
2) aggressive, merges more CQEs into a single entry, achieving a higher
compression rate and maximizing performance, particularly under high
traffic loads.
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250907012953.301746-3-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Define the Ports Phy Histogram Configuration Register (PPHCR) to expose
RS-FEC histogram bin ranges, and expose a new counter group in the Ports
Performance Counters Register (PPCNT) to report the corresponding
histogram values.
Co-developed-by: Yael Chemla <ychemla@nvidia.com>
Signed-off-by: Yael Chemla <ychemla@nvidia.com>
Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1756884600-520195-1-git-send-email-tariqt@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
vhca id is already cached in the vport structure no need to query on
every mlx5 layer, use the mlx5_vport_get_vhca_id, where possible.
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Alexei Lazar <alazar@nvidia.com>
Reviewed-by: Feng Liu <feliu@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Next patches will implement the discovery and creation of adjacent
functions vports, this patch introduces the hardware structures
definitions needed for the driver implementation.
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jack Morgenstein <jackm@nvidia.com>
Signed-off-by: Alexei Lazar <alazar@nvidia.com>
From Yishai:
This patch series introduces a new DMA Handle (DMAH) object, along with
corresponding APIs for its allocation and deallocation.
The DMAH object encapsulates attributes relevant for DMA transactions.
While initially intended to support TLP Processing Hints (TPH) [1], the
design is extensible to accommodate future features such as PCI
multipath for DMA, PCI UIO configurations, traffic class selection, and
more.
Additionally, we introduce a new ioctl method on the MR object:
UVERBS_METHOD_REG_MR.
This method consolidates multiple reg_mr variants under a single
user-space ioctl interface, supporting: ibv_reg_mr(), ibv_reg_mr_iova(),
ibv_reg_mr_iova2() and ibv_reg_dmabuf_mr(). It also enables passing a
DMA handle as part of the registration process.
Throughout the patch series, the following DMAH-related stuff can also
be observed in the IB layer:
- Association with a CPU ID and its memory type, for use with Steering
Tags [2].
- Inclusion of Processing Hints (PH) data for TPH functionality [3].
- Enforces security by ensuring that only tasks allowed to run on a
given CPU may request a DMA handle for it.
- Reference counting for DMAH life cycle management and safe usage
across memory regions.
mlx5 driver implementation:
--------------------------
The series includes implementation of the above functionality in the
mlx5 driver.
In mlx5_core:
- Enables TPH over PCIe when both firmware and OS support it.
- Manages Steering Tags and corresponding indices by writing tag values
to the PCI configuration space.
- Exposes APIs to upper layers (e.g., mlx5_ib) to enable the PCIe TPH
functionality.
In mlx5_ib:
- Adds full support for DMAH operations.
- Utilizes mlx5_core's Steering Tag APIs to derive tag indices from
input.
- Stores the resulting index in a mlx5_dmah structure for use during
MKEY creation with a DMA handle.
- Adds support for allowing MKEYs to be created in conjunction with DMA
handles.
Additional details are provided in the commit messages.
[1] Background, from PCIe specification 6.2.
TLP Processing Hints (TPH)
--------------------------
TLP Processing Hints is an optional feature that provides hints in
Request TLP headers to facilitate optimized processing of Requests that
target Memory Space. These Processing Hints enable the system hardware
(e.g., the Root Complex and/ or Endpoints) to optimize platform
resources such as system and memory interconnect on a per TLP basis.
Steering Tags are system-specific values used to identify a processing
resource that a Requester explicitly targets. System software discovers
and identifies TPH capabilities to determine the Steering Tag allocation
for each Function that supports TPH
[2] Steering Tags
Functions that intend to target a TLP towards a specific processing
resource such as a host processor or system cache hierarchy require
topological information of the target cache (e.g., which host cache).
Steering Tags are system-specific values that provide information about
the host or cache structure in the system cache hierarchy. These values
are used to associate processing elements within the platform with the
processing of Requests.
[3] Processing Hints
The Requester provides hints to the Root Complex or other targets about
the intended use of data and data structures by the host and/or device.
The hints are provided by the Requester, which has knowledge of upcoming
Request patterns, and which the Completer would not be able to deduce
autonomously (with good accuracy)
Yishai
Signed-off-by: Leon Romanovsky <leon@kernel.org>
* mlx5-next:
net/mlx5: Add support for device steering tag
net/mlx5: Expose IFC bits for TPH
PCI/TPH: Expose pcie_tph_get_st_table_size()
net/mlx5: Expose cable_length field in PFCC register
net/mlx5: Add IFC bits and enums for buf_ownership
net/mlx5: Add IFC bits to support RSS for IPSec offload
net/mlx5: IFC updates for disabled host PF
net/mlx5: Expose disciplined_fr_counter through HCA capabilities in mlx5_ifc
Background, from PCIe specification 6.2.
TLP Processing Hints (TPH)
--------------------------
TLP Processing Hints is an optional feature that provides hints in
Request TLP headers to facilitate optimized processing of Requests that
target Memory Space. These Processing Hints enable the system hardware
(e.g., the Root Complex and/or Endpoints) to optimize platform
resources such as system and memory interconnect on a per TLP basis.
Steering Tags are system-specific values used to identify a processing
resource that a Requester explicitly targets. System software discovers
and identifies TPH capabilities to determine the Steering Tag allocation
for each Function that supports TPH.
This patch adds steering tag support for mlx5 based NICs by:
- Enabling the TPH functionality over PCI if both FW and OS support it.
- Managing steering tags and their matching steering indexes by
writing a ST to an ST index over the PCI configuration space.
- Exposing APIs to upper layers (e.g.,mlx5_ib) to allow usage of
the PCI TPH infrastructure.
Further details:
- Upon probing of a device, the feature will be enabled based
on both capability detection and OS support.
- It will retrieve the appropriate ST for a given CPU ID and memory
type using the pcie_tph_get_cpu_st() API.
- It will track available ST indices according to the configuration
space table size (expected to be 63 entries), reserving index 0 to
indicate non-TPH use.
- It will assign a free ST index with a ST using the
pcie_tph_set_st_entry() API.
- It will reuse the same index for identical (CPU ID + memory type)
combinations by maintaining a reference count per entry.
- It will expose APIs to upper layers (e.g., mlx5_ib) to allow usage of
the PCI TPH infrastructure.
- SF will use its parent PF stuff.
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://patch.msgid.link/de1ae7398e9e34eacd8c10845683df44fc9e32f8.1752752567.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Introduce new "cable_length" field in PFCC register and related fields
to enhance rx buffer configuration management:
1. cable_length: Shifts cable length handling to fw by storing a
manually entered length from user in PFCC.cable_length
2. lane_rate_oper: In a case where PFCC.cable_length is not supported,
helps compute a default cable length
Signed-off-by: Oren Sidi <osidi@nvidia.com>
Reviewed-by: Alex Lazar <alazar@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1752734895-257735-4-git-send-email-tariqt@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Extend structure layouts and defines buf_ownership.
buf_ownership indicates whether the buffer is managed by SW or FW.
Signed-off-by: Oren Sidi <osidi@nvidia.com>
Reviewed-by: Alex Lazar <alazar@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1752734895-257735-3-git-send-email-tariqt@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
This adds the capabilities, ipsec_next_header and inner/outer
l4_type_ext fields to support RSS for the decrypted packets.
These fields are specifically for firmware steering. HWS validation
logic is updated to correctly handle the changes, ensuring the
unsupported fields are not set.
Besides, reserved_at_c4 is fixed to reserved_at_d4 to reflect the
accurate offset within the structure.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1752734895-257735-2-git-send-email-tariqt@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
The port 2 host PF can be disabled, this bit reflects that setting.
Signed-off-by: Daniel Jurgens <danielj@nvidia.com>
Reviewed-by: William Tu <witu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1752064867-16874-3-git-send-email-tariqt@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Introduce the `disciplined_fr_counter` capability bit to indicate that
the device’s free-running cycle counter is disciplined to real-time.
Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1752064867-16874-2-git-send-email-tariqt@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
From Edward:
This patch series enables the mlx5 driver to dynamically choose the
optimal page size for a DMABUF-based memory key (mkey), rather than
always registering with a fixed page size.
Previously, DMABUF memory registration used a fixed 4K page size for
mkeys which could lead to suboptimal performance when the underlying
memory layout may offer better page sizes.
This approach did not take advantage of larger page size capabilities
advertised by the HCA, and the driver was not setting the proper page
size mask in the mkey mask when performing page size changes,
potentially
leading to invalid registrations when updating to a very large pages.
This series improves DMABUF performance by dynamically selecting the
best page size for a given memory region (MR) both at creation time and
on page fault occurrences, based on the underlying layout and fixing
related gaps and bugs.
By doing so, we reduce the number of page table entries (and thus MTT/
KSM descriptors) that the HCA must traverse, which in turn reduces
cache-line fetches.
Thanks
* mlx5-next:
RDMA/mlx5: Fix UMR modifying of mkey page size
net/mlx5: Expose HCA capability bits for mkey max page size
Signed-off-by: Leon Romanovsky <leon@kernel.org>
When changing the page size on an mkey, the driver needs to set the
appropriate bits in the mkey mask to indicate which fields are being
modified.
The 6th bit of a page size in mlx5 driver is considered an extension,
and this bit has a dedicated capability and mask bits.
Previously, the driver was not setting this mask in the mkey mask when
performing page size changes, regardless of its hardware support,
potentially leading to an incorrect page size updates.
This fixes the issue by setting the relevant bit in the mkey mask when
performing page size changes on an mkey and the 6th bit of this field is
supported by the hardware.
Fixes: cef7dde883 ("net/mlx5: Expand mkey page size to support 6 bits")
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/9f43a9c73bf2db6085a99dc836f7137e76579f09.1751979184.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Expose the HCA capability for maximal page size that can be configured
for an mkey. Used for enforcing capabilities when working with highly
contiguous memory and using large page sizes.
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/3e4d3fda37934430f65f72601519e22bf396fd05.1751979184.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Use the new ib_alloc_device_with_net() API to allocate the IB device
so that it is properly bound to the network namespace obtained via
mlx5_core_net(). This change ensures correct namespace association
(e.g., for containerized setups).
Additionally, expose mlx5_core_net so that RDMA driver can use it.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Make enum for capability bits of general object types depend on
the type definitions themselves.
Make sure that capabilities in the [64,127] bit range are
properly calculated (type id - 64).
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250619113721.60201-2-mbloch@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
RDMA TRANSPORT domains were initially limited to a single priority.
This change allows the domains to have multiple priorities, making
it possible to add several rules and control the order in which
they're evaluated.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/b299cbb4c8678a33da6e6b6988b5bf6145c54b88.1750148083.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
- Usual minor updates and fixes for bnxt_re, hfi1, rxe, mana, iser, mlx5,
vmw_pvrdma, hns
- Make rxe work on tun devices
- mana gains more standard verbs as it moves toward supporting in-kernel
verbs
- DMABUF support for mana
- Fix page size calculations when memory registration exceeds 4G
- On Demand Paging support for rxe
- mlx5 support for RDMA TRANSPORT flow tables and a new ucap mechanism to
access control use of them
- Optional RDMA_TX/RX counters per QP in mlx5
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCZ+ap4gAKCRCFwuHvBreF
YaFHAP9wyeZCZIbnWaGcbNdbsmkEgy7aTVILRHf1NA7VSJ211gD9Ha60E+mkwtvA
i7IJ49R2BdqzKaO9oTutj2Lw+8rABwQ=
=qXhh
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma updates from Jason Gunthorpe:
- Usual minor updates and fixes for bnxt_re, hfi1, rxe, mana, iser,
mlx5, vmw_pvrdma, hns
- Make rxe work on tun devices
- mana gains more standard verbs as it moves toward supporting
in-kernel verbs
- DMABUF support for mana
- Fix page size calculations when memory registration exceeds 4G
- On Demand Paging support for rxe
- mlx5 support for RDMA TRANSPORT flow tables and a new ucap mechanism
to access control use of them
- Optional RDMA_TX/RX counters per QP in mlx5
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (73 commits)
IB/mad: Check available slots before posting receive WRs
RDMA/mana_ib: Fix integer overflow during queue creation
RDMA/mlx5: Fix calculation of total invalidated pages
RDMA/mlx5: Fix mlx5_poll_one() cur_qp update flow
RDMA/mlx5: Fix page_size variable overflow
RDMA/mlx5: Drop access_flags from _mlx5_mr_cache_alloc()
RDMA/mlx5: Fix cache entry update on dereg error
RDMA/mlx5: Fix MR cache initialization error flow
RDMA/mlx5: Support optional-counters binding for QPs
RDMA/mlx5: Compile fs.c regardless of INFINIBAND_USER_ACCESS config
RDMA/core: Pass port to counter bind/unbind operations
RDMA/core: Add support to optional-counters binding configuration
RDMA/core: Create and destroy rdma_counter using rdma_zalloc_drv_obj()
RDMA/mlx5: Add optional counters for RDMA_TX/RX_packets/bytes
RDMA/core: Fix use-after-free when rename device name
RDMA/bnxt_re: Support perf management counters
RDMA/rxe: Fix incorrect return value of rxe_odp_atomic_op()
RDMA/uverbs: Propagate errors from rdma_lookup_get_uobject()
RDMA/mana_ib: Handle net event for pointing to the current netdev
net: mana: Change the function signature of mana_get_primary_netdev_rcu
...
Add support to allow optional-counters binding to a QP, whereas when
a bind operation is requested depending on the counter optional-counter
binding state the driver will determine if to also add optional-counters
to this QP binding.
The optional-counter binding is done by simply adding a steering
rule for the specific optional-counter condition with the additional
match over that QP number.
Note that optional-counters per QP rules are handled on an earlier prio
than per device counters, and per device counter correctness is maintained
by core whereas it is responsible to sum active counters when checking device
counter and to add them to history count when they are deallocated.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/2cad1b891a6641ae61fe8d92f867e1059121813a.1741875070.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Add the following optional counters:
rdma_tx_packets,rdma_rx_bytes,rdma_rx_packets,rdma_tx_bytes.
Which counts all RDMA packets/bytes sent and received per link.
Note that since each direction packet and byte counter are shared,
the counter is only reset when both counters of that direction
are removed. But from user-perspective each can be enabled/disabled separately.
The counters can be enabled using:
sudo rdma stat set link rocep8s0f0/1 optional-counters rdma_tx_packets
And can be seen using:
rdma stat -j show link rocep8s0f0/1
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/9f2753ad636f21704416df64b47395c8991d1123.1741875070.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Add support for HW Steering action of flow meter range. Flow meters
range can use one HWS action for the whole range. Thus, share a cached
HWS action among rules that use same flow meter object range. Hold
refcount for each rule using the cached action.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/1741543663-22123-3-git-send-email-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Tariq Toukan says:
====================
mlx5-next updates 2025-03-10
The following pull-request contains common mlx5 updates for your *net-next* tree.
Please pull and let me know of any problem.
* 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
net/mlx5: Add IFC bits for PPCNT recovery counters group
net/mlx5: fs, add RDMA TRANSPORT steering domain support
net/mlx5: Query ADV_RDMA capabilities
net/mlx5: Limit non-privileged commands
net/mlx5: Allow the throttle mechanism to be more dynamic
net/mlx5: Add RDMA_CTRL HW capabilities
====================
Link: https://patch.msgid.link/1741608293-41436-1-git-send-email-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Add recovery counters group layout of PPCNT (Ports Performance Counters
Register). This group counts recovery events per link. Also add the
corresponding bit in PCAM to indicate this group is supported.
Signed-off-by: Yael Chemla <ychemla@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1741545697-23041-1-git-send-email-tariqt@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Add RX and TX RDMA_TRANSPORT flow table namespace, and the ability
to create flow tables in those namespaces.
The RDMA_TRANSPORT RX and TX are per vport.
Packets will traverse through RDMA_TRANSPORT_RX after RDMA_RX and through
RDMA_TRANSPORT_TX before RDMA_TX, ensuring proper control and management.
RDMA_TRANSPORT domains are managed by the vport group manager.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/a6b550d9859a197eafa804b9a8d76916ca481da9.1740574103.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Limit non-privileged UID commands to half of the available command slots
when privileged UIDs are present.
Privileged throttle commands will not be limited.
Use an xarray to store privileged UIDs. Add insert and remove functions
for privileged UIDs management.
Non-user commands (with uid 0) are not limited.
Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/d2f3dd9a0dbad3c9f2b4bb0723837995e4e06de2.1740574103.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Previously, throttle commands were identified and limited based on
opcode. These commands were limited to half the command slots using a
semaphore, and callback commands checked the opcode to determine
semaphore release.
To allow exceptions, we introduce a variable to indicate when the
throttle lock is held. This allows scenarios where throttle commands
are not limited. Callback functions use this variable to determine
if the throttle semaphore needs to be released.
This patch contains no functional changes. It's a preparation for the
next patch.
Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Link: https://patch.msgid.link/055d975edeb816ac4c0fd1e665c6157d11947d26.1740574103.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>