Start removing content from virt/queues.c that will be kept in the
main virtchnl file.
Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Add copy of virtchnl.c under the original name/location.
Now both virt/virtchnl.c and virt/queues.c have the same content,
and only the former of the two is in use.
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Temporary rename of virtchnl.c into queues.c
In order to split virtchnl.c in a way that keeps the new file easy
to git-blame, we do it via multiple git steps.
Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Move the default graceful period from a parameter of
devlink_health_reporter_create() to a field in the
devlink_health_reporter_ops structure.
This change improves consistency, as the graceful period is inherently
tied to the reporter's behavior and recovery policy. It simplifies the
signature of devlink_health_reporter_create() and its internal helper
functions. It also centralizes the reporter configuration at the ops
structure, preparing the groundwork for a downstream patch that will
introduce a devlink health reporter burst period attribute whose
default value will similarly be provided by the driver via the ops
structure.
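A hedged sketch of the resulting driver-side usage (the reporter ops and
value shown are illustrative; the field name follows the commit
description):

  static const struct devlink_health_reporter_ops mlx5_tx_reporter_ops = {
          .name = "tx",
          .recover = mlx5e_tx_reporter_recover,
          /* previously a devlink_health_reporter_create() parameter */
          .default_graceful_period = 500,
  };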
Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250824084354.533182-2-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The current layout of struct ixgbe_orom_civd_info causes incorrect data
storage due to compiler-inserted padding. This results in issues when
writing OROM data into the structure.
Add the __packed attribute to ensure the structure layout matches the
expected binary format without padding.
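A minimal sketch of the change; the field layout here is illustrative,
not the exact driver definition:

  struct ixgbe_orom_civd_info {
          u8 signature[4];        /* OROM section signature */
          u8 flags;
          __le32 combo_ver;       /* compiler would otherwise pad before this */
          u8 combo_name_len;
          u8 combo_name[32];
  } __packed;                     /* layout now matches the on-flash format */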
Fixes: 70db0788a2 ("ixgbe: read the OROM version information")
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Currently, the driver increments `alloc_page_failed` when buffer allocation fails
in `ice_clean_rx_irq()`. However, this counter is intended for page allocation
failures, not buffer allocation issues.
This patch corrects the counter by incrementing `alloc_buf_failed` instead,
ensuring accurate statistics reporting for buffer allocation failures.
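A hedged sketch of the one-line correction in ice_clean_rx_irq()
(surrounding context abbreviated):

  if (unlikely(!buf)) {   /* Rx buffer, not page, allocation failed */
          rx_ring->ring_stats->rx_stats.alloc_buf_failed++;
          /* was: rx_ring->ring_stats->rx_stats.alloc_page_failed++; */
          break;
  }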
Fixes: 2fba7dc515 ("ice: Add support for XDP multi-buffer on Rx side")
Reported-by: Jacob Keller <jacob.e.keller@intel.com>
Suggested-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Michal Kubiak <michal.kubiak@intel.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Priya Singh <priyax.singh@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
The ice_adapter structure is used by the ice driver to connect multiple
physical functions of a device in software. It was introduced by
commit 0e2bddf9e5 ("ice: add ice_adapter for shared data across PFs on
the same NIC") and is primarily used for PTP support, as well as for
handling certain cross-PF synchronization.
The original design of ice_adapter used PCI address information to
determine which devices should be connected. This was extended to support
E825C devices by commit fdb7f54700 ("ice: Initial support for E825C
hardware in ice_adapter"), which used the device ID for E825C devices
instead of the PCI address.
Later, commit 0093cb194a ("ice: use DSN instead of PCI BDF for
ice_adapter index") replaced the use of Bus/Device/Function addressing with
use of the device serial number.
E825C devices may appear in "Dual NAC" configuration which has multiple
physical devices tied to the same clock source and which need to use the
same ice_adapter. Unfortunately, each "NAC" has its own NVM which has its
own unique Device Serial Number. Thus, use of the DSN for connecting
ice_adapter does not work properly. It "worked" in the pre-production
systems because the DSN was not initialized on the test NVMs and all the
NACs had the same zeroed serial number.
Since we cannot rely on the DSN, let's fall back to the logic in the
original E825C support which used the device ID. This is safe for E825C
only because of the embedded nature of the device. It isn't a discrete
adapter that can be plugged into an arbitrary system. All E825C devices on
a given system are connected to the same clock source and need to be
configured through the same PTP clock.
To make this separation clear, reserve bit 63 of the 64-bit index values as
a "fixed index" indicator. Always clear this bit when using the device
serial number as an index.
For E825C, use a fixed value defined as the 0x579C E825C backplane device
ID bitwise ORed with the fixed index indicator. This is slightly different
than the original logic of just using the device ID directly. Doing so
prevents a potential issue with systems where only one of the NACs is
connected with an external PHY over SGMII. In that case, one NAC would
have the E825C_SGMII device ID, but the other would not.
Separate the determination of the full 64-bit index from the 32-bit
reduction logic. Provide both ice_adapter_index() and a wrapping
ice_adapter_xa_index() which handles reducing the index to a long on 32-bit
systems. As before, cache the full index value in the adapter structure to
warn about collisions.
This fixes issues with E825C not initializing PTP on both NACs, due to
failure to connect the appropriate devices to the same ice_adapter.
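A hedged sketch of the two helpers described above (the device ID macros
exist in the driver; the exact code may differ):

  #define ICE_ADAPTER_FIXED_INDEX  BIT_ULL(63)
  #define ICE_ADAPTER_INDEX_E825C  \
          (ICE_DEV_ID_E825C_BACKPLANE | ICE_ADAPTER_FIXED_INDEX)

  static u64 ice_adapter_index(struct pci_dev *pdev)
  {
          switch (pdev->device) {
          case ICE_DEV_ID_E825C_BACKPLANE:
          case ICE_DEV_ID_E825C_QSFP:
          case ICE_DEV_ID_E825C_SFP:
          case ICE_DEV_ID_E825C_SGMII:
                  /* all E825C on a system share one clock source */
                  return ICE_ADAPTER_INDEX_E825C;
          default:
                  /* DSN-based index; keep the fixed-index bit clear */
                  return pci_get_dsn(pdev) & ~ICE_ADAPTER_FIXED_INDEX;
          }
  }

  static unsigned long ice_adapter_xa_index(struct pci_dev *pdev)
  {
          u64 index = ice_adapter_index(pdev);

  #if BITS_PER_LONG == 64
          return index;
  #else
          return (u32)index ^ (u32)(index >> 32);  /* fold to 32 bits */
  #endif
  }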
Fixes: 0093cb194a ("ice: use DSN instead of PCI BDF for ice_adapter index")
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
The ice_cfg_tx_topo function attempts to apply Tx scheduler topology
configuration based on NVM parameters, selecting either a 5 or 9 layer
topology.
As part of this flow, the driver acquires the "Global Configuration Lock",
which is a hardware resource associated with programming the DDP package
to the device. This "lock" is implemented by firmware as a way to
guarantee that only one PF can program the DDP for a device. Unlike a
traditional lock, once a PF has acquired this lock, no other PF will be
able to acquire it again (including that PF) until a CORER of the device.
Future requests to acquire the lock report that global configuration has
already completed.
The following flow is used to program the Tx topology:
* Read the DDP package for scheduler configuration data
* Acquire the global configuration lock
* Program Tx scheduler topology according to DDP package data
* Trigger a CORER which clears the global configuration lock
This is followed by the flow for programming the DDP package:
* Acquire the global configuration lock (again)
* Download the DDP package to the device
* Release the global configuration lock.
However, if configuration of the Tx topology fails (i.e.,
ice_get_set_tx_topo() returns an error code), the driver exits
ice_cfg_tx_topo() immediately and fails to trigger CORER.
While the global configuration lock is held, the firmware rejects most
AdminQ commands, as it is waiting for the DDP package download (or Tx
scheduler topology programming) to occur.
The current driver flows assume that the global configuration lock has been
reset by CORER after programming the Tx topology. Thus, the same PF
attempts to acquire the global lock again, and fails. This results in the
driver reporting "an unknown error occurred when loading the DDP package".
It then attempts to enter safe mode, but ultimately fails to finish
ice_probe() since nearly all AdminQ commands report error codes, and the
driver stops loading the device at some point during its initialization.
The only currently known way that ice_get_set_tx_topo() can fail is with
certain older DDP packages which contain invalid topology configuration, on
firmware versions which strictly validate this data. The most recent
releases of the DDP have resolved the invalid data. However, it is still
poor practice to essentially brick the device, and prevent access to the
device even through safe mode or recovery mode. It is also plausible that
this command could fail for some other reason in the future.
We cannot simply release the global lock after a failed call to
ice_get_set_tx_topo(). Releasing the lock indicates to firmware that global
configuration (downloading of the DDP) has completed. Future attempts by
this or other PFs to load the DDP will fail with a report that the DDP
package has already been downloaded. Then, PFs will enter safe mode as they
realize that the package on the device does not meet the minimum version
requirement to load. The reported error messages are confusing, as they
indicate the version of the default "safe mode" package in the NVM, rather
than the version of the file loaded from /lib/firmware.
Instead, we need to trigger CORER to clear global configuration. This is
the lowest level of hardware reset which clears the global configuration
lock and related state. It also clears any already downloaded DDP.
Crucially, it does *not* clear the Tx scheduler topology configuration.
Refactor ice_cfg_tx_topo() to always trigger a CORER after acquiring the
global lock, regardless of success or failure of the topology
configuration.
We need to re-initialize the HW structure when we trigger the CORER. Thus,
it makes sense for this to be the responsibility of ice_cfg_tx_topo()
rather than its caller, ice_init_tx_topology(). This avoids needless
re-initialization in cases where we don't attempt to update the Tx
scheduler topology, such as if it has already been programmed.
There is one catch: failure to re-initialize the HW struct should stop
ice_probe(). If this function fails, we won't have a valid HW structure and
cannot ensure the device is functioning properly. To handle this, ensure
ice_cfg_tx_topo() returns a limited set of error codes. Set aside one
specifically, -ENODEV, to indicate that ice_init_tx_topology() should
fail and stop probe.
Other error codes indicate failure to apply the Tx scheduler topology. This
is treated as a non-fatal error, with an informational message informing
the system administrator that the updated Tx topology did not apply. This
allows the device to load and function with the default Tx scheduler
topology, rather than failing to load entirely.
Note that this use of CORER will not result in loops with future PFs
attempting to also load the invalid Tx topology configuration. The first PF
will acquire the global configuration lock as part of programming the DDP.
Each PF after this will attempt to acquire the global lock as part of
programming the Tx topology, and will fail with the indication from
firmware that global configuration is already complete. Tx scheduler
topology configuration is only performed during driver init (probe or
devlink reload) and not during cleanup for a CORER that happens after probe
completes.
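A hedged sketch of the reworked control flow (helpers simplified; the
real function carries more state than shown):

  err = ice_get_set_tx_topo(hw, new_topo, size, NULL, NULL, true);

  /* Trigger CORER whether or not topology programming succeeded: it is
   * the only way to release the global configuration lock so the
   * subsequent DDP download can proceed. */
  ice_deinit_hw(hw);
  ice_reset(hw, ICE_RESET_CORER);

  if (ice_init_hw(hw))
          return -ENODEV; /* fatal: ice_init_tx_topology() must stop probe */

  return err ? -EIO : 0;  /* non-fatal: default topology stays in use */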
Fixes: 91427e6d90 ("ice: Support 5 layer topology")
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Issuing a reset when the driver is loaded without RDMA support will
result in a crash as it attempts to remove RDMA's non-existent auxbus
device:
echo 1 > /sys/class/net/<if>/device/reset
BUG: kernel NULL pointer dereference, address: 0000000000000008
...
RIP: 0010:ice_unplug_aux_dev+0x29/0x70 [ice]
...
Call Trace:
<TASK>
ice_prepare_for_reset+0x77/0x260 [ice]
pci_dev_save_and_disable+0x2c/0x70
pci_reset_function+0x88/0x130
reset_store+0x5a/0xa0
kernfs_fop_write_iter+0x15e/0x210
vfs_write+0x273/0x520
ksys_write+0x6b/0xe0
do_syscall_64+0x79/0x3b0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
ice_unplug_aux_dev() checks pf->cdev_info->adev for NULL pointer, but
pf->cdev_info will also be NULL, leading to the deref in the trace above.
Introduce a flag to be set when the creation of the auxbus device is
successful, to avoid multiple NULL pointer checks in ice_unplug_aux_dev().
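A hedged sketch of the guard (the flag name is illustrative; the driver
may track this differently):

  void ice_unplug_aux_dev(struct ice_pf *pf)
  {
          /* Bail out unless the auxbus device was actually created;
           * this also covers pf->cdev_info being NULL. */
          if (!test_and_clear_bit(ICE_FLAG_AUX_DEV_CREATED, pf->flags))
                  return;
          ...
  }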
Fixes: c24a65b6a2 ("iidc/ice/irdma: Update IDC to support multiple consumers")
Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Introduce virt/ directory to collect virtchnl files.
We are going to implement a few sizable extensions soon, each of them
increasing virt/ size, so it looks sensible to introduce a new dir.
Suggested-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
idpf has a limit on number of scatter-gather frags
that can be used per segment.
Currently, idpf_tx_start() checks if the limit is hit
and forces a linearization of the whole packet.
This requires high order allocations that can fail
under memory pressure. A full size BIG-TCP packet
would require an order-7 allocation on x86_64.
We can instead perform the check earlier, in idpf_features_check(),
for TSO packets, forcing GSO in this case and removing the
cost of a big copy.
This means that a linearization will eventually happen
with sizes smaller than one MSS.
__idpf_chk_linearize() is renamed to idpf_chk_tso_segment()
and moved to idpf_lib.c
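A hedged sketch of the resulting check in idpf_features_check() (the
frag-limit constant shown is illustrative):

  /* Too many frags per TSO segment: clear the GSO feature bits so the
   * stack segments the skb in software, instead of the driver doing a
   * high-order linearization later. */
  if (skb_is_gso(skb) &&
      idpf_chk_tso_segment(skb, IDPF_TX_MAX_BUFS_PER_SEG))
          features &= ~NETIF_F_GSO_MASK;

  return features;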
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Cc: Jacob Keller <jacob.e.keller@intel.com>
Cc: Madhu Chittim <madhu.chittim@intel.com>
Cc: Pavan Kumar Linga <pavan.kumar.linga@intel.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Andrew Lunn <andrew+netdev@lunn.ch>
Reviewed-by: Joshua Hay <joshua.a.hay@intel.com>
Tested-by: Brian Vazquez <brianvv@google.com>
Acked-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250818195934.757936-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
With the new Tx buffer management scheme, there is no need for all of
the stashing mechanisms, the hash table, the reserve buffer stack, etc.
Remove all of that.
Signed-off-by: Joshua Hay <joshua.a.hay@intel.com>
Reviewed-by: Madhu Chittim <madhu.chittim@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Samuel Salin <Samuel.salin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
The Tx refillq logic will cause packets to be silently dropped if there
are not enough buffer resources available to send a packet in flow
scheduling mode. Instead, determine how many buffers are needed along
with the number of descriptors. Make sure there are enough of both
resources to send the packet, and stop the queue if not.
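A hedged sketch of the combined resource check (helper names
illustrative, not the exact driver code):

  /* Stop the queue unless both descriptors and free buf_ids fit. */
  if (desc_count > IDPF_DESC_UNUSED(tx_q) ||
      buf_count > idpf_tx_splitq_get_free_bufs(tx_q->refillq)) {
          netif_tx_stop_queue(netdev_get_tx_queue(dev, tx_q->idx));
          return -EBUSY;
  }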
Fixes: 7292af042b ("idpf: fix a race in txq wakeup")
Signed-off-by: Joshua Hay <joshua.a.hay@intel.com>
Reviewed-by: Madhu Chittim <madhu.chittim@intel.com>
Tested-by: Samuel Salin <Samuel.salin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Replace the TxQ buffer ring with one large pool/array of buffers (only
for flow scheduling). This eliminates the tag generation and makes it
impossible for a tag to be associated with more than one packet.
The completion tag passed to HW through the descriptor is the index into
the array. That same completion tag is posted back to the driver in the
completion descriptor, and used to index into the array to quickly
retrieve the buffer during cleaning. In this way, the tags are treated
as a fixed-size resource. If all tags are in use, no more packets can be
sent on that particular queue (until some are freed up). The tag pool
size is 64K since the completion tag width is 16 bits.
For each packet, the driver pulls a free tag from the refillq to get the
next free buffer index. When cleaning is complete, the tag is posted
back to the refillq. A multi-frag packet spans multiple buffers in the
driver, therefore it uses multiple buffer indexes/tags from the pool.
Each frag pulls from the refillq to get the next free buffer index.
These are tracked in a next_buf field that replaces the completion tag
field in the buffer struct. This chains the buffers together so that the
packet can be cleaned from the starting completion tag taken from the
completion descriptor, then from the next_buf field for each subsequent
buffer.
If a dma_mapping_error occurs or the refillq runs out of free
buf_ids, the packet will take the rollback error path. This unmaps
any buffers previously mapped for the packet. Since several free
buf_ids could have already been pulled from the refillq, we need to
restore its original state as well. Otherwise, the buf_ids/tags
will be leaked and not used again until the queue is reallocated.
Descriptor completions only advance the descriptor ring index to "clean"
the descriptors. The packet completions only clean the buffers
associated with the given packet completion tag and do not update the
descriptor ring index.
When operating in queue based scheduling mode, the array still acts as a
ring and will only have TxQ descriptor count entries. The tx_bufs are
still associated 1:1 with the descriptor ring entries and we can use the
conventional indexing mechanisms.
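A hedged sketch of cleaning a multi-frag packet via the tag-indexed pool
(helper and sentinel names are illustrative):

  u16 buf_id = le16_to_cpu(compl_tag);  /* from the completion descriptor */

  do {
          struct idpf_tx_buf *tx_buf = &txq->tx_buf[buf_id];
          u16 next = tx_buf->next_buf;    /* chains the frags together */

          /* unmap the buffer, free the skb on the last frag */
          idpf_tx_clean_buf(txq, tx_buf);
          /* the tag is free again: post it back to the refillq */
          idpf_post_buf_refill(txq->refillq, buf_id);
          buf_id = next;
  } while (buf_id != IDPF_TX_INVALID_BUF_ID);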
Fixes: c2d548cad1 ("idpf: add TX splitq napi poll support")
Signed-off-by: Luigi Rizzo <lrizzo@google.com>
Signed-off-by: Brian Vazquez <brianvv@google.com>
Signed-off-by: Joshua Hay <joshua.a.hay@intel.com>
Reviewed-by: Madhu Chittim <madhu.chittim@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Samuel Salin <Samuel.salin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Move (and rename) the existing rollback logic to singleq.c since that
will be the only consumer. Create a simplified splitq specific rollback
function to loop through and unmap tx_bufs based on the completion tag.
This is critical before replacing the Tx buffer ring with the buffer
pool since the previous rollback indexing will not work to unmap the
chained buffers from the pool.
Cache the next_to_use index before any portion of the packet is put on
the descriptor ring. In case of an error, the rollback will bump tail to
the correct next_to_use value. Because the splitq path now supports
different types of context descriptors (and potentially multiple in the
future), this will take care of rolling back any and all context
descriptors encoded on the ring for the erroneous packet. The previous
rollback logic was broken for PTP packets since it would not account for
the PTP context descriptor.
Fixes: 1a49cf814f ("idpf: add Tx timestamp flows")
Signed-off-by: Joshua Hay <joshua.a.hay@intel.com>
Reviewed-by: Madhu Chittim <madhu.chittim@intel.com>
Tested-by: Samuel Salin <Samuel.salin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Track the gap between next_to_use and the last RE index. Set RE again
if the gap is large enough to ensure RE bit is set frequently. This is
critical before removing the stashing mechanisms because the
opportunistic descriptor ring cleaning from the out-of-order completions
will go away. Previously the descriptors would be "cleaned" by both the
descriptor (RE) completion and the out-of-order completions. Without the
latter, we must ensure the RE bit is set more frequently. Otherwise,
it's theoretically possible for the descriptor ring next_to_clean to
never advance. The previous implementation was dependent on the start
of a packet falling on a 64th index in the descriptor ring, which is not
guaranteed with large packets.
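A hedged sketch of the gap tracking (field and threshold names
illustrative):

  u16 gap = (tx_q->next_to_use - tx_q->last_re_idx) &
            (tx_q->desc_count - 1);

  if (gap >= IDPF_TX_SPLITQ_RE_MIN_GAP) {
          /* request a descriptor completion so next_to_clean keeps
           * advancing even without out-of-order completions */
          td_cmd |= IDPF_TXD_FLEX_FLOW_CMD_RE;
          tx_q->last_re_idx = tx_q->next_to_use;
  }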
Signed-off-by: Luigi Rizzo <lrizzo@google.com>
Signed-off-by: Brian Vazquez <brianvv@google.com>
Signed-off-by: Joshua Hay <joshua.a.hay@intel.com>
Reviewed-by: Madhu Chittim <madhu.chittim@intel.com>
Tested-by: Samuel Salin <Samuel.salin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
In certain production environments, it is possible for completion tags
to collide, meaning N packets with the same completion tag are in flight
at the same time. In this environment, any given Tx queue is effectively
used to send both slower traffic and higher throughput traffic
simultaneously. This is the result of a customer's specific
configuration in the device pipeline, the details of which Intel cannot
provide. This configuration results in a small number of out-of-order
completions, i.e., a small number of packets in flight. The existing
guardrails in the driver only protect against a large number of packets
in flight. The slower flow completions are delayed which causes the
out-of-order completions. The fast flow will continue sending traffic
and generating tags. Because tags are generated on the fly, the fast
flow eventually uses the same tag for a packet that is still in flight
from the slower flow. The driver has no idea which packet it should
clean when it processes the completion with that tag, but it will look
for the packet on the buffer ring before the hash table. If the slower
flow packet completion is processed first, it will end up cleaning the
fast flow packet on the ring prematurely. This leaves the descriptor
ring in a bad state resulting in a crash or Tx timeout.
In summary, generating a tag when a packet is sent can lead to the same
tag being associated with multiple packets. This can lead to resource
leaks, crashes, and/or Tx timeouts.
Before we can replace the tag generation, we need a new mechanism for
the send path to know what tag to use next. The driver will allocate and
initialize a refillq for each TxQ with all of the possible free tag
values. During send, the driver grabs the next free tag from the
refillq's next_to_clean. While cleaning the packet, the clean routine
posts the tag back to the refillq's next_to_use to indicate that it is
now free to use.
This mechanism works exactly the same way as the existing Rx refill
queues, which post the cleaned buffer IDs back to the buffer queue to be
reposted to HW. Since we're using the refillqs for both Rx and Tx now,
genericize some of the existing refillq support.
Note: the refillqs will not be used yet. This is only demonstrating how
they will be used to pass free tags back to the send path.
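A hedged sketch of the give/take cycle (generation-bit bookkeeping
omitted):

  /* send path: take the next free tag from next_to_clean */
  u32 buf_id = refillq->ring[refillq->next_to_clean];

  if (unlikely(++refillq->next_to_clean == refillq->desc_count))
          refillq->next_to_clean = 0;

  /* clean path: post the tag back at next_to_use once the buffer
   * has been unmapped and its skb freed */
  refillq->ring[refillq->next_to_use] = buf_id;
  if (unlikely(++refillq->next_to_use == refillq->desc_count))
          refillq->next_to_use = 0;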
Signed-off-by: Joshua Hay <joshua.a.hay@intel.com>
Reviewed-by: Madhu Chittim <madhu.chittim@intel.com>
Tested-by: Samuel Salin <Samuel.salin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Device ID comparison in igc_is_device_id_i226 is performed before
the ID is set, resulting in an always-failing check on init.
Before the patch:
* L1.2 is not disabled on init
* L1.2 is properly disabled after suspend-resume cycle
With the patch:
* L1.2 is properly disabled both on init and after suspend-resume
How to test:
Connect to the 1G link with 300+ Mbit/s Internet speed, and run
the download speed test, such as:
curl -o /dev/null http://speedtest.selectel.ru/1GB
Without L1.2 disabled, the speed would be no more than ~200 Mbit/s.
With L1.2 disabled, the speed would reach 1 Gbit/s.
Note: it's required that the latency between your host and the remote
be around 3-5 ms, the test inside LAN (<1 ms latency) won't trigger the
issue.
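A hedged sketch of the fix: assign the device ID in igc_probe() before
the check runs (context abbreviated):

  hw->vendor_id = pdev->vendor;
  hw->device_id = pdev->device;   /* must happen before the check below */

  if (igc_is_device_id_i226(hw))
          pci_disable_link_state(pdev, PCIE_LINK_STATE_L1_2);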
Link: https://lore.kernel.org/intel-wired-lan/15248b4f-3271-42dd-8e35-02bfc92b25e1@intel.com
Fixes: 0325143b59 ("igc: disable L1.2 PCI-E link substate to avoid performance issue")
Signed-off-by: ValdikSS <iam@valdikss.org.ru>
Reviewed-by: Vitaly Lifshits <vitaly.lifshits@intel.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250819222000.3504873-6-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently the ixgbe driver periodically checks in its watchdog subtask
whether there is anything to be transmitted (considering both Tx and XDP
rings) while the carrier is not 'ok'. Such an event is interpreted as a
Tx hang and therefore results in an interface reset.
This is currently problematic for ndo_xdp_xmit() as it is allowed to
produce descriptors while the interface is going through a reset or its
carrier is turned off.
Furthermore, XDP rings should not really be objects of Tx hang
detection. This mechanism is rather a matter of ndo_tx_timeout() being
called from dev_watchdog against Tx rings exposed to the networking
stack.
Taking into account the issues described above, let us have a two-fold
fix: do not consider XDP rings in the local ixgbe watchdog and do not
produce Tx descriptors in the ndo_xdp_xmit callback when there is a
problem with the carrier. For now, keep the Tx hang checks in the clean
Tx irq routine, but adjust them to not execute for XDP rings.
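A hedged sketch of the ndo_xdp_xmit() guard (ixgbe already rejects work
while the adapter is down; the carrier check is the new part):

  if (unlikely(test_bit(__IXGBE_DOWN, &adapter->state)))
          return -ENETDOWN;

  if (unlikely(!netif_carrier_ok(adapter->netdev)))
          return -ENETDOWN;       /* don't queue Tx work without carrier */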
Cc: Tobias Böhm <tobias.boehm@hetzner-cloud.de>
Reported-by: Marcus Wichelmann <marcus.wichelmann@hetzner-cloud.de>
Closes: https://lore.kernel.org/netdev/eca1880f-253a-4955-afe6-732d7c6926ee@hetzner-cloud.de/
Fixes: 6453073987 ("ixgbe: add initial support for xdp redirect")
Fixes: 33fdc82f08 ("ixgbe: add support for XDP_TX action")
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Marcus Wichelmann <marcus.wichelmann@hetzner-cloud.de>
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250819222000.3504873-5-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Resolve the negative overflow of budget which leads to ixgbe_xmit_zc
returning true even when the budget of descriptors is fully consumed.
Before this patch, when the budget has been decreased to zero after
sending the last allowed desc in ixgbe_xmit_zc, the code will always
turn back and enter the while() statement to see if it should keep
processing packets, but in the meantime it unexpectedly decrements the
value again, from unsigned int 0 to UINT_MAX. Finally, ixgbe_xmit_zc returns
true, showing 'we complete cleaning the budget'. That also means
'clean_complete = true' in ixgbe_poll.
The underlying logic is that if the full budget of descriptors is
consumed, it implies that we might have more descriptors to process. So
we should return false in ixgbe_xmit_zc to tell the napi poll to run
again and handle the rest of the descriptors. On the contrary, returning
true here means the job is done: we finished all the possible
descriptors this time and don't intend to start a new napi poll, which
is against our expectations. Please also see how
ixgbe_clean_tx_irq() handles the problem: it uses a do..while()
statement to make sure the budget is decremented at most to zero, so the
negative overflow never happens.
The patch adds 'likely' because we rarely would not hit the loop condition
since the standard budget is 256.
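A hedged sketch of the overflow-safe loop shape (not the literal diff):

  while (likely(budget)) {
          if (!xsk_tx_peek_desc(pool, &desc))
                  break;
          /* ... DMA map and fill one Tx descriptor ... */
          budget--;
  }

  return !!budget;        /* false when the budget was fully consumed */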
Fixes: 8221c5eba8 ("ixgbe: add AF_XDP zero-copy Tx support")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Priya Singh <priyax.singh@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250819222000.3504873-4-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Remove array_size() calls and replace vmalloc() with vmalloc_array() to
simplify the code and maintain consistency with existing kmalloc_array()
usage.
vmalloc_array() is also optimized better, resulting in fewer
instructions being used.
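An illustration of the conversion (representative, not a specific hunk
from the patch):

  -       buf = vmalloc(array_size(count, sizeof(*buf)));
  +       buf = vmalloc_array(count, sizeof(*buf));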
Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com>
Link: https://patch.msgid.link/20250816090659.117699-2-rongqianfeng@vivo.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Tony Nguyen says:
====================
ice: implement SRIOV VF Active-Active LAG
Dave Ertman says:
Implement support for SRIOV VFs over an Active-Active link aggregate.
The same restrictions apply as are in place for the support of
Active-Backup bonds.
- the two interfaces must be on the same NIC
- the FW LLDP engine needs to be disabled
- the DDP package that supports VF LAG must be loaded on device
- the two interfaces must have the same QoS config
- only the first interface added to the bond will have VF support
- the interface with VFs must be in switchdev mode
With the additional requirement of
- the version of the FW on the NIC needs to have VF Active/Active support
The balancing of traffic between the two interfaces is done on a queue
basis. Taking the queues allocated to all of the VFs as a whole, one
half of them will be distributed to each interface. When a link goes
down, then the queues allocated to the down interface will migrate to
the active port. When the down port comes back up, then the same
queues as were originally assigned there will be moved back.
Patch 1 cleans up void pointer casts
Patch 2 utilizes bool over u8 when appropriate
Patch 3 adds a driver prefix to a LAG define
Patch 4 pre-moves a function to reduce delta in the implementation patch
Patch 5 cleans up variable initialization in declaration block
Patch 6 cleans up capability parsing for LAG feature
Patch 7 is the implementation of the new functionality
* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
ice: Implement support for SRIOV VFs across Active/Active bonds
ice: cleanup capabilities evaluation
ice: Cleanup variable initialization in LAG code
ice: move LAG function in code to prepare for Active-Active
ice: Add driver specific prefix to LAG defines
ice: replace u8 elements with bool where appropriate
ice: Remove casts on void pointers in LAG code
====================
Link: https://patch.msgid.link/20250814230855.128068-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch implements the software flows to handle SRIOV VF
communication across an Active/Active link aggregate. The same
restrictions apply as are in place for the support of Active/Backup
bonds.
- the two interfaces must be on the same NIC
- the FW LLDP engine needs to be disabled
- the DDP package that supports VF LAG must be loaded on device
- the two interfaces must have the same QoS config
- only the first interface added to the bond will have VF support
- the interface with VFs must be in switchdev mode
With the additional requirement of
- the version of the FW on the NIC needs to have VF Active/Active support
This requirement is indicated in the capabilities struct associated
with the NVM loaded on the NIC.
The balancing of traffic between the two interfaces is done on a queue
basis. Taking the queues allocated to all of the VFs as a whole, one
half of them will be distributed to each interface. When a link goes
down, then the queues allocated to the down interface will migrate to
the active port. When the down port comes back up, then the same
queues as were originally assigned there will be moved back.
Co-developed-by: Marcin Szycik <marcin.szycik@linux.intel.com>
Signed-off-by: Marcin Szycik <marcin.szycik@linux.intel.com>
Signed-off-by: Dave Ertman <david.m.ertman@intel.com>
Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Cross-merge networking fixes after downstream PR (net-6.17-rc2).
No conflicts.
Adjacent changes:
drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c
d7a276a576 ("net: stmmac: rk: convert to suspend()/resume() methods")
de1e963ad0 ("net: stmmac: rk: put the PHY clock on remove")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In preparation for implementing SRIOV Active-Active LAG support,
cleanup several unneeded variable initializations in declaration
blocks.
Also move a couple of variable initializations into the declaration
blocks where they belong.
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Dave Ertman <david.m.ertman@intel.com>
Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
For SRIOV LAG Active-Active, the contents of ice_lag_cfg_pf_fltr()
and ice_lag_config_eswitch() are moved to earlier locations in the
source file. Also, ice_lag_cfg_pf_fltr() is renamed, and its flow is
changed.
To reduce the delta in the larger patch, move the original functions
to their new location so that only functional changes are needed in
the larger patch.
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Dave Ertman <david.m.ertman@intel.com>
Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
A define in the LAG code is missing a driver-specific prefix.
Add a prefix to the define.
Also shorten a define's name and move it to a more logical place.
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Dave Ertman <david.m.ertman@intel.com>
Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
In preparation for the new LAG functionality implementation, there are
a couple of existing LAG elements of the capabilities struct that should
be bool instead of u8. Since we are adding a new element to this struct
that should also be a bool, fix the existing LAG u8 elements in this
patch and eliminate !! operators where possible.
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Dave Ertman <david.m.ertman@intel.com>
Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
This series will be touching the LAG code in the ice driver; to
prevent moving or propagating casts on void pointers, clean them up
first.
This also allows for moving the variable initialization into the
variable declaration.
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Dave Ertman <david.m.ertman@intel.com>
Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
In the past %pK was preferable to %p as it would not leak raw pointer
values into the kernel log.
Since commit ad67b74d24 ("printk: hash addresses printed with %p")
the regular %p has been improved to avoid this issue.
Furthermore, restricted pointers ("%pK") were never meant to be used
through printk(). They can still unintentionally leak raw pointers or
acquire sleeping locks in atomic contexts.
Switch to the regular pointer formatting which is safer and
easier to reason about.
There are still a few users of %pK left, but these use it through seq_file,
for which its usage is safe.
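An illustration of the conversion (representative, not a specific hunk
from the patch):

  -       netdev_dbg(netdev, "ring %pK\n", ring);
  +       netdev_dbg(netdev, "ring %p\n", ring);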
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Acked-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20250811-restricted-pointers-net-v5-1-2e2fdc7d3f2c@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Users of the ixgbe driver report that after devlink support was added
by commit a0285236ab ("ixgbe: add initial devlink support"), their
configs got broken due to unwanted changes of interface names. This is
caused by automatic phys_port_name generation during the devlink port
initialization flow.
To prevent that, set the no_phys_port_name flag for ixgbe devlink ports.
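A hedged sketch of the port setup change (the flag name is taken from
the commit text; surrounding port setup omitted):

  struct devlink_port_attrs attrs = {};

  attrs.flavour = DEVLINK_PORT_FLAVOUR_PHYSICAL;
  attrs.no_phys_port_name = 1;    /* keep existing interface names */
  devlink_port_attrs_set(devlink_port, &attrs);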
Reported-by: David Howells <dhowells@redhat.com>
Closes: https://lore.kernel.org/netdev/3452224.1745518016@warthog.procyon.org.uk/
Reported-by: David Kaplan <David.Kaplan@amd.com>
Closes: https://lore.kernel.org/netdev/LV3PR12MB92658474624CCF60220157199470A@LV3PR12MB9265.namprd12.prod.outlook.com/
Fixes: a0285236ab ("ixgbe: add initial devlink support")
Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Merge tag 'net-next-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"Core & protocols:
- Wrap datapath globals into net_aligned_data, to avoid false sharing
- Preserve MSG_ZEROCOPY in forwarding (e.g. out of a container)
- Add SO_INQ and SCM_INQ support to AF_UNIX
- Add SIOCINQ support to AF_VSOCK
- Add TCP_MAXSEG sockopt to MPTCP
- Add IPv6 force_forwarding sysctl to enable forwarding per interface
- Make TCP validation of whether packet fully fits in the receive
window and the rcv_buf more strict. With increased use of HW
aggregation a single "packet" can be multiple 100s of kB
- Add MSG_MORE flag to optimize large TCP transmissions via sockmap,
improves latency up to 33% for sockmap users
- Convert TCP send queue handling from tasklet to BH workque
- Improve BPF iteration over TCP sockets to see each socket exactly
once
- Remove obsolete and unused TCP RFC3517/RFC6675 loss recovery code
- Support enabling kernel threads for NAPI processing on per-NAPI
instance basis rather than a whole device. Fully stop the kernel
NAPI thread when threaded NAPI gets disabled. Previously thread
would stick around until ifdown due to tricky synchronization
- Allow multicast routing to take effect on locally-generated packets
- Add output interface argument for End.X in segment routing
- MCTP: add support for gateway routing, improve bind() handling
- Don't require rtnl_lock when fetching an IPv6 neighbor over Netlink
- Add a new neighbor flag ("extern_valid"), which cedes refresh
responsibilities to userspace. This is needed for EVPN multi-homing
where a neighbor entry for a multi-homed host needs to be synced
across all the VTEPs among which the host is multi-homed
- Support NUD_PERMANENT for proxy neighbor entries
- Add a new queuing discipline for IETF RFC9332 DualQ Coupled AQM
- Add sequence numbers to netconsole messages. Unregister
netconsole's console when all net targets are removed. Code
refactoring. Add a number of selftests
- Align IPSec inbound SA lookup to RFC 4301. Only SPI and protocol
should be used for an inbound SA lookup
- Support inspecting ref_tracker state via DebugFS
- Don't force bonding advertisement frames tx to ~333 ms boundaries.
Add broadcast_neighbor option to send ARP/ND on all bonded links
- Allow providing upcall pid for the 'execute' command in openvswitch
- Remove DCCP support from Netfilter's conntrack
- Disallow multiple packet duplications in the queuing layer
- Prevent use of deprecated iptables code on PREEMPT_RT
Driver API:
- Support RSS and hashing configuration over ethtool Netlink
- Add dedicated ethtool callbacks for getting and setting hashing
fields
- Add support for power budget evaluation strategy in PSE /
Power-over-Ethernet. Generate Netlink events for overcurrent etc
- Support DPLL phase offset monitoring across all device inputs.
Support providing clock reference and SYNC over separate DPLL
inputs
- Support traffic classes in devlink rate API for bandwidth
management
- Remove rtnl_lock dependency from UDP tunnel port configuration
Device drivers:
- Add a new Broadcom driver for 800G Ethernet (bnge)
- Add a standalone driver for Microchip ZL3073x DPLL
- Remove IBM's NETIUCV device driver
- Ethernet high-speed NICs:
- Broadcom (bnxt):
- support zero-copy Tx of DMABUF memory
- take page size into account for page pool recycling rings
- Intel (100G, ice, idpf):
- idpf: XDP and AF_XDP support preparations
- idpf: add flow steering
- add link_down_events statistic
- clean up the TSPLL code
- preparations for live VM migration
- nVidia/Mellanox:
- support zero-copy Rx/Tx interfaces (DMABUF and io_uring)
- optimize context memory usage for matchers
- expose serial numbers in devlink info
- support PCIe congestion metrics
- Meta (fbnic):
- add 25G, 50G, and 100G link modes to phylink
- support dumping FW logs
- Marvell/Cavium:
- support for CN20K generation of the Octeon chips
- Amazon:
- add HW clock (without timestamping, just hypervisor time access)
- Ethernet virtual:
- VirtIO net:
- support segmentation of UDP-tunnel-encapsulated packets
- Google (gve):
- support packet timestamping and clock synchronization
- Microsoft vNIC:
- add handler for device-originated servicing events
- allow dynamic MSI-X vector allocation
- support Tx bandwidth clamping
- Ethernet NICs consumer, and embedded:
- AMD:
- amd-xgbe: hardware timestamping and PTP clock support
- Broadcom integrated MACs (bcmgenet, bcmasp):
- use napi_complete_done() return value to support NAPI polling
- add support for re-starting auto-negotiation
- Broadcom switches (b53):
- support BCM5325 switches
- add bcm63xx EPHY power control
- Synopsys (stmmac):
- lots of code refactoring and cleanups
- TI:
- icssg-prueth: read firmware-names from device tree
- icssg: PRP offload support
- Microchip:
- lan78xx: convert to PHYLINK for improved PHY and MAC management
- ksz: add KSZ8463 switch support
- Intel:
- support similar queue priority scheme in multi-queue and
time-sensitive networking (taprio)
- support packet pre-emption in both
- RealTek (r8169):
- enable EEE at 5Gbps on RTL8126
- Airoha:
- add PPPoE offload support
- MDIO bus controller for Airoha AN7583
- Ethernet PHYs:
- support for the IPQ5018 internal GE PHY
- micrel KSZ9477 switch-integrated PHYs:
- add MDI/MDI-X control support
- add RX error counters
- add cable test support
- add Signal Quality Indicator (SQI) reporting
- dp83tg720: improve reset handling and reduce link recovery time
- support bcm54811 (and its MII-Lite interface type)
- air_en8811h: support resume/suspend
- support PHY counters for QCA807x and QCA808x
- support WoL for QCA807x
- CAN drivers:
- rcar_canfd: support for Transceiver Delay Compensation
- kvaser: report FW versions via devlink dev info
- WiFi:
- extended regulatory info support (6 GHz)
- add statistics and beacon monitor for Multi-Link Operation (MLO)
- support S1G aggregation, improve S1G support
- add Radio Measurement action fields
- support per-radio RTS threshold
- some work around how FIPS affects wifi, which was wrong (RC4 is
used by TKIP, not only WEP)
- improvements for unsolicited probe response handling
- WiFi drivers:
- RealTek (rtw88):
- IBSS mode for SDIO devices
- RealTek (rtw89):
- BT coexistence for MLO/WiFi7
- concurrent station + P2P support
- support for USB devices RTL8851BU/RTL8852BU
- Intel (iwlwifi):
- use embedded PNVM in (to be released) FW images to fix
compatibility issues
- many cleanups (unused FW APIs, PCIe code, WoWLAN)
- some FIPS interoperability
- MediaTek (mt76):
- firmware recovery improvements
- more MLO work
- Qualcomm/Atheros (ath12k):
- fix scan on multi-radio devices
- more EHT/Wi-Fi 7 features
- encapsulation/decapsulation offload
- Broadcom (brcm80211):
- support SDIO 43751 device
- Bluetooth:
- hci_event: add support for handling LE BIG Sync Lost event
- ISO: add socket option to report packet seqnum via CMSG
- ISO: support SCM_TIMESTAMPING for ISO TS
- Bluetooth drivers:
- intel_pcie: support Function Level Reset
- nxpuart: add support for 4M baudrate
- nxpuart: implement powerup sequence, reset, FW dump, and FW loading"
* tag 'net-next-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1742 commits)
dpll: zl3073x: Fix build failure
selftests: bpf: fix legacy netfilter options
ipv6: annotate data-races around rt->fib6_nsiblings
ipv6: fix possible infinite loop in fib6_info_uses_dev()
ipv6: prevent infinite loop in rt6_nlmsg_size()
ipv6: add a retry logic in net6_rt_notify()
vrf: Drop existing dst reference in vrf_ip6_input_dst
net/sched: taprio: align entry index attr validation with mqprio
net: fsl_pq_mdio: use dev_err_probe
selftests: rtnetlink.sh: remove esp4_offload after test
vsock: remove unnecessary null check in vsock_getname()
igb: xsk: solve negative overflow of nb_pkts in zerocopy mode
stmmac: xsk: fix negative overflow of budget in zerocopy mode
dt-bindings: ieee802154: Convert at86rf230.txt yaml format
net: dsa: microchip: Disable PTP function of KSZ8463
net: dsa: microchip: Setup fiber ports for KSZ8463
net: dsa: microchip: Write switch MAC address differently for KSZ8463
net: dsa: microchip: Use different registers for KSZ8463
net: dsa: microchip: Add KSZ8463 switch support to KSZ DSA driver
dt-bindings: net: dsa: microchip: Add KSZ8463 switch support
...
Merge tag 'timers-cleanups-2025-07-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer cleanups from Thomas Gleixner:
"A treewide cleanup of struct cycle_counter const annotations.
The initial idea of making them const was correct as they were
separate instances. When they got embedded into larger data
structures, which are even modified by the callback, this got moot. The
only reason why this went unnoticed is that the required
container_of() casts the const attribute forcefully away.
Stop pretending that it is const"
* tag 'timers-cleanups-2025-07-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
time/timecounter: Fix the lie that struct cyclecounter is const
Merge in late fixes to prepare for the 6.17 net-next PR.
Conflicts:
net/core/neighbour.c
1bbb76a899 ("neighbour: Fix null-ptr-deref in neigh_flush_dev().")
13a936bb99 ("neighbour: Protect tbl->phash_buckets[] with a dedicated mutex.")
03dc03fa04 ("neighbor: Add NTF_EXT_VALIDATED flag for externally validated entries")
Adjacent changes:
drivers/net/usb/usbnet.c
0d9cfc9b8c ("net: usbnet: Avoid potential RCU stall on LINK_CHANGE event")
2c04d279e8 ("net: usb: Convert tasklet API to new bottom half workqueue mechanism")
net/ipv6/route.c
31d7d67ba1 ("ipv6: annotate data-races around rt->fib6_nsiblings")
1caf272972 ("ipv6: adopt dst_dev() helper")
3b3ccf9ed0 ("net: Remove unnecessary NULL check for lwtunnel_fill_encap()")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There is no early break in the while() loop, so every time at the end
of igb_xmit_zc() a negative overflow of nb_pkts occurs, which renders
the return value always false. But theoretically, the result should be
set after calling xsk_tx_peek_release_desc_batch(). We can take
i40e_xmit_zc() as a good example.
Returning false means we're not done with transmission and we need one
more poll, which is exactly what igb_xmit_zc() always did before this
patch. After this patch, the return value depends on the nb_pkts value.
Two cases might happen then:
1. if (nb_pkts < budget), it means we process all the possible data, so
return true and no more necessary poll will be triggered because of
this.
2. if (nb_pkts == budget), it means we might have more data, so return
false to let another poll run again.
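A hedged sketch following the referenced i40e_xmit_zc() pattern:

  nb_pkts = xsk_tx_peek_release_desc_batch(xdp_ring->xsk_pool, budget);
  if (!nb_pkts)
          return true;    /* nothing pending: transmission is done */

  /* ... fill nb_pkts descriptors and bump the tail ... */

  return nb_pkts < budget;        /* false means poll again */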
Fixes: f8e284a02a ("igb: Add AF_XDP zero-copy Tx support")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Link: https://patch.msgid.link/20250723142327.85187-3-kerneljasonxing@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Tony Nguyen says:
====================
libie: commonize adminq structure
Michal Swiatkowski says:
It is a prework to allow reusing some specific Intel code (eq. fwlog).
Move the common *_aq_desc structure to a libie header and change it
in ice, ixgbe, i40e and iavf.
Only generic adminq commands can easily be moved to the common header,
as the rest is slightly different. The format remains the same. It will
be better to move those correctly when they are needed to commonize
other parts of the code.
Move *_aq_str() to new libie module (libie_adminq) and use it across
drivers. The functions are exactly the same in each driver. Some more
adminq helpers/functions can be moved to libie_adminq when needed.
* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
i40e: use libie_aq_str
iavf: use libie_aq_str
ice: use libie_aq_str
libie: add adminq helper for converting err to str
iavf: use libie adminq descriptors
i40e: use libie adminq descriptors
ixgbe: use libie adminq descriptors
ice, libie: move generic adminq descriptors to lib
====================
Link: https://patch.msgid.link/20250724182826.3758850-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Fix typos in comments and error messages.
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: David Arinzon <darinzon@amazon.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250723201528.2908218-1-helgaas@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There is no need to store the err string in hw->err_str. Simplify it and
use the common helper. hw->err_str is still used for other purposes.
It should be noted that previously, for an unknown error, the numeric
value was passed as a string. Now "LIBIE_AQ_RC_UNKNOWN" is used in such
cases.
Add the libie_adminq module to the i40e Kconfig.
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com>
Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
There is no need to store the err string in hw->err_str. Simplify it and
use the common helper. hw->err_str is still used for other purposes.
It should be noted that previously, for an unknown error, the numeric
value was passed as a string. Now "LIBIE_AQ_RC_UNKNOWN" is used in such
cases.
Add the libie_adminq module to the iavf Kconfig.
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com>
Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Simple:
s/ice_aq_str/libie_aq_str
Add the libie_adminq module to the ice Kconfig.
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Add a new module for common handling of Admin Queue related logic.
Start with a helper for error to string conversion. This lives inside
libie/, but is a separate module, which follows our logic of splitting
into topical modules to avoid pulling in unneeded code and to keep
better organization in general.
Olek suggested how to better solve the error to string conversion.
It will be used in follow-up patches in ice, i40e and iavf.
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Suggested-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Use libie_aq_desc instead of iavf_aq_desc. Make the changes needed for
a clean build.
Use libie_aq_raw() wherever it can be used.
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Use libie_aq_desc instead of i40e_aq_desc. Make the changes needed for
a clean build.
The get version descriptor is a little less detailed on i40e. To avoid
messing with shifting or a union inside the libie descriptor, keep
using the get version descriptor from i40e.
Move the additional caps for i40e to libie.
Fix reverse Christmas tree ordering in declarations using libie_aq_desc.
Use libie_aq_raw() wherever it can be used.
The libie aq error set is extended; cover it in the ice driver just for
a clean build. In the next patches the libie code for that will be used
in each Intel driver.
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Use libie_aq_desc instead of ixgbe_aci_desc. Make the changes needed
for a clean build.
Move the additional caps used in ixgbe to libie.
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
The descriptor structure is the same in ice, ixgbe and i40e. Move it to
the common libie header to use it across the different drivers.
Leave device specific adminq commands in separate folders. This leads to
a change that needs to be made when filling/getting a descriptor:
- previous: struct specific_desc *cmd;
            cmd = &desc.params.specific_desc;
- now:      struct specific_desc *cmd;
            cmd = libie_aq_raw(&desc);
Apply these changes across the drivers to allow a clean build. The cast
only has to be done for specific descriptors; for generic ones the
union can still be used.
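A sketch of what such a helper can look like (the exact upstream
definition may differ):

  static inline void *libie_aq_raw(struct libie_aq_desc *desc)
  {
          return &desc->params.raw;
  }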
Changes besides code moving:
- change the ICE_ prefix to the LIBIE_ prefix (ice_ and libie_ too)
- remove shift variables that are not otherwise needed (in libie_aq_flags)
- fill/get descriptor data based on desc.params.raw whenever the
  descriptor isn't defined in libie
- move defines out of the libie_aq_sth structures
- add the libie_aq_raw() helper and use it instead of explicit casting
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
To eliminate the use of struct page in page pool, the page pool users
should use netmem descriptor and APIs instead.
Make idpf access ->pp through netmem_desc instead of page.
Signed-off-by: Byungchul Park <byungchul@sk.com>
Link: https://patch.msgid.link/20250721021835.63939-10-byungchul@sk.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
To eliminate the use of struct page in page pool, the page pool users
should use netmem descriptor and APIs instead.
Make iavf access ->pp through netmem_desc instead of page.
Signed-off-by: Byungchul Park <byungchul@sk.com>
Link: https://patch.msgid.link/20250721021835.63939-9-byungchul@sk.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As described by Vitaly Lifshits:
> Starting from Tiger Lake, LAN NVM is locked for writes by SW, so the
> driver cannot perform checksum validation and correction. This means
> that all NVM images must leave the factory with correct checksum and
> checksum valid bit set.
Unfortunately, some systems have left the factory with an uninitialized
value of 0xFFFF at register address 0x3F (the checksum word location).
So on the Tiger Lake platform we ignore the computed checksum when such
a condition is encountered.
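A sketch of the idea as applied in the NVM validation path; the exact
MAC-type check and variable names here are illustrative, not the
driver's verbatim code:

  /* The checksum word at 0x3F was left uninitialized at the factory;
   * the NVM is write-locked on Tiger Lake, so skip validation instead
   * of failing the device.
   */
  if (hw->mac.type == e1000_pch_tgp && nvm_data == 0xFFFF)
          return 0;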
Signed-off-by: Jacek Kowalski <jacek@jacekk.info>
Tested-by: Vlad URSU <vlad@ursu.me>
Fixes: 4051f68318 ("e1000e: Do not take care about recovery NVM checksum")
Cc: stable@vger.kernel.org
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Vitaly Lifshits <vitaly.lifshits@intel.com>
Tested-by: Mor Bar-Gabay <morx.bar.gabay@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
As described by Vitaly Lifshits:
> Starting from Tiger Lake, LAN NVM is locked for writes by SW, so the
> driver cannot perform checksum validation and correction. This means
> that all NVM images must leave the factory with correct checksum and
> checksum valid bit set. Since Tiger Lake devices were the first to have
> this lock, some systems in the field did not meet this requirement.
> Therefore, for these transitional devices we skip checksum update and
> verification, if the valid bit is not set.
Signed-off-by: Jacek Kowalski <jacek@jacekk.info>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Vitaly Lifshits <vitaly.lifshits@intel.com>
Fixes: 4051f68318 ("e1000e: Do not take care about recovery NVM checksum")
Cc: stable@vger.kernel.org
Tested-by: Mor Bar-Gabay <morx.bar.gabay@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Add a check for the return value of devm_kmemdup()
to prevent a potential NULL pointer dereference.
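A minimal sketch of the added check (variable names illustrative):

  seg = devm_kmemdup(dev, src, size, GFP_KERNEL);
  if (!seg)
          return -ENOMEM;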
Fixes: c764881096 ("ice: Implement Dynamic Device Personalization (DDP) download")
Cc: stable@vger.kernel.org
Signed-off-by: Haoxiang Li <haoxiang_li2024@163.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
When the PF is processing an Admin Queue message to delete a VF's MACs
from the MAC filter, we currently check if the PF set the MAC and if
the VF is trusted.
This results in undesirable behaviour: if a trusted VF with a PF-set
MAC sets itself down (which sends an AQ message to delete the VF's MAC
filters), the VF MAC is erased from the interface and the VF loses its
PF-set MAC, which should not happen.
There is no need to check for trust at all, because an untrusted VF
cannot change its own MAC. The only check needed is whether the PF set
the MAC. If the PF set the MAC, then don't erase the MAC on link-down.
Resolve this by making the deletion check consider only whether the PF
set the MAC.
(The out-of-tree driver intentionally removed the check for VF trust
here as of OOT driver version 2.26.8; this change aligns the Linux
kernel driver behaviour and comment with the OOT driver behaviour.)
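A sketch of the adjusted check; the field names follow the upstream
driver but should be treated as illustrative:

  /* A PF-set MAC must survive the VF deleting its filters. VF trust
   * is irrelevant here: an untrusted VF cannot change its own MAC.
   */
  if (ether_addr_equal(addr, vf->default_lan_addr.addr) &&
      vf->pf_set_mac) {
          dev_err(&pf->pdev->dev,
                  "VF attempting to remove its PF-set MAC address\n");
          return -EPERM;
  }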
Fixes: ea2a1cfc3b ("i40e: Fix VF MAC filter removal")
Signed-off-by: Jamie Bainbridge <jamie.bainbridge@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Currently the tx_dropped field in VF stats is not updated correctly
when reading stats from the PF. This is because it reads from
i40e_eth_stats.tx_discards which seems to be unused for per VSI stats,
as it is not updated by i40e_update_eth_stats() and the corresponding
register, GLV_TDPC, is not implemented[1].
Use i40e_eth_stats.tx_errors instead, which is actually updated by
i40e_update_eth_stats() by reading from GLV_TEPC.
To test, create a VF and try to send bad packets through it:
$ echo 1 > /sys/class/net/enp2s0f0/device/sriov_numvfs
$ cat test.py
from scapy.all import *
vlan_pkt = Ether(dst="ff:ff:ff:ff:ff:ff") / Dot1Q(vlan=999) / IP(dst="192.168.0.1") / ICMP()
ttl_pkt = IP(dst="8.8.8.8", ttl=0) / ICMP()
print("Send packet with bad VLAN tag")
sendp(vlan_pkt, iface="enp2s0f0v0")
print("Send packet with TTL=0")
sendp(ttl_pkt, iface="enp2s0f0v0")
$ ip -s link show dev enp2s0f0
16: enp2s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether 3c:ec:ef:b7:e0:ac brd ff:ff:ff:ff:ff:ff
RX: bytes packets errors dropped missed mcast
0 0 0 0 0 0
TX: bytes packets errors dropped carrier collsns
0 0 0 0 0 0
vf 0 link/ether e2:c6:fd:c1:1e:92 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
RX: bytes packets mcast bcast dropped
0 0 0 0 0
TX: bytes packets dropped
0 0 0
$ python test.py
Send packet with bad VLAN tag
.
Sent 1 packets.
Send packet with TTL=0
.
Sent 1 packets.
$ ip -s link show dev enp2s0f0
16: enp2s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether 3c:ec:ef:b7:e0:ac brd ff:ff:ff:ff:ff:ff
RX: bytes packets errors dropped missed mcast
0 0 0 0 0 0
TX: bytes packets errors dropped carrier collsns
0 0 0 0 0 0
vf 0 link/ether e2:c6:fd:c1:1e:92 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
RX: bytes packets mcast bcast dropped
0 0 0 0 0
TX: bytes packets dropped
0 0 0
A packet with non-existent VLAN tag and a packet with TTL = 0 are sent,
but tx_dropped is not incremented.
After patch:
$ ip -s link show dev enp2s0f0
19: enp2s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether 3c:ec:ef:b7:e0:ac brd ff:ff:ff:ff:ff:ff
RX: bytes packets errors dropped missed mcast
0 0 0 0 0 0
TX: bytes packets errors dropped carrier collsns
0 0 0 0 0 0
vf 0 link/ether 4a:b7:3d:37:f7:56 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
RX: bytes packets mcast bcast dropped
0 0 0 0 0
TX: bytes packets dropped
0 0 2
Fixes: dc645daef9 ("i40e: implement VF stats NDO")
Signed-off-by: Dennis Chen <dechen@redhat.com>
Link: https://www.intel.com/content/www/us/en/content-details/596333/intel-ethernet-controller-x710-tm4-at2-carlsville-datasheet.html
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Remove hw_rx_no_dma_resources and eitr_param fields from struct
ixgbevf_adapter since these fields are never referenced in the driver.
Note that the interrupt throttle rate is controlled by the
rx_itr_setting and tx_itr_setting variables.
This change simplifies the ixgbevf driver by removing unused fields,
which improves maintainability.
Signed-off-by: Yuto Ohnuki <ytohnuki@amazon.com>
Reviewed-by: Dawid Osuchowski <dawid.osuchowski@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Remove the following unused fields from struct igbvf_adapter, which are
never referenced in the driver.
- blink_timer
- eeprom_wol
- fc_autoneg
- int_mode
- led_status
- mng_vlan_id
- polling_interval
- rx_dma_failed
- test_icr
- test_rx_ring
- test_tx_ring
- tx_dma_failed
- tx_fifo_head
- tx_fifo_size
- tx_head_addr
Also remove the following fields from struct igbvf_adapter since they
are never read or used after initialization by igbvf_probe() and
igbvf_sw_init().
- bd_number
- rx_abs_int_delay
- tx_abs_int_delay
- rx_int_delay
- tx_int_delay
These changes simplify the igbvf driver by removing unused fields,
which improves maintainability.
Tested-by: Kohei Enju <enjuk@amazon.com>
Signed-off-by: Yuto Ohnuki <ytohnuki@amazon.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Introduce support for a lowest priority wildcard (catch-all) rule in
ethtool's Network Flow Classification (NFC) for the igc driver. The
wildcard rule directs all unmatched network traffic, including traffic not
captured by Receive Side Scaling (RSS), to a specified queue. This
functionality utilizes the Default Queue feature available in I225/I226
hardware.
The implementation has been validated on Intel ADL-S systems with two
back-to-back connected I226 network interfaces.
Testing Procedure:
1. On the Device Under Test (DUT), verify the initial statistic:
$ ethtool -S enp1s0 | grep rx_q.*packets
rx_queue_0_packets: 0
rx_queue_1_packets: 0
rx_queue_2_packets: 0
rx_queue_3_packets: 0
2. From the Link Partner, send 10 ARP packets:
$ arping -c 10 -I enp170s0 169.254.1.2
3. On the DUT, verify the packet reception on Queue 0:
$ ethtool -S enp1s0 | grep rx_q.*packets
rx_queue_0_packets: 10
rx_queue_1_packets: 0
rx_queue_2_packets: 0
rx_queue_3_packets: 0
4. On the DUT, add a wildcard rule to route all packets to Queue 3:
$ sudo ethtool -N enp1s0 flow-type ether queue 3
5. From the Link Partner, send another 10 ARP packets:
$ arping -c 10 -I enp170s0 169.254.1.2
6. Now, packets are routed to Queue 3 by the wildcard (Default Queue) rule:
$ ethtool -S enp1s0 | grep rx_q.*packets
rx_queue_0_packets: 10
rx_queue_1_packets: 0
rx_queue_2_packets: 0
rx_queue_3_packets: 10
7. On the DUT, add an EtherType rule to route ARP packets to Queue 1:
$ sudo ethtool -N enp1s0 flow-type ether proto 0x0806 queue 1
8. From the Link Partner, send another 10 ARP packets:
$ arping -c 10 -I enp170s0 169.254.1.2
9. Now, packets are routed to Queue 1 by the EtherType rule because it is
higher priority than the wildcard (Default Queue) rule:
$ ethtool -S enp1s0 | grep rx_q.*packets
rx_queue_0_packets: 10
rx_queue_1_packets: 10
rx_queue_2_packets: 0
rx_queue_3_packets: 10
10. On the DUT, delete all the NFC rules:
$ sudo ethtool -N enp1s0 delete 63
$ sudo ethtool -N enp1s0 delete 64
11. From the Link Partner, send another 10 ARP packets:
$ arping -c 10 -I enp170s0 169.254.1.2
12. Now, packets are routed to Queue 0 because the value of Default Queue
is reset back to 0:
$ ethtool -S enp1s0 | grep rx_q.*packets
rx_queue_0_packets: 20
rx_queue_1_packets: 10
rx_queue_2_packets: 0
rx_queue_3_packets: 10
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Co-developed-by: Blanco Alcaine Hector <hector.blanco.alcaine@intel.com>
Signed-off-by: Blanco Alcaine Hector <hector.blanco.alcaine@intel.com>
Signed-off-by: Song Yoong Siang <yoong.siang.song@intel.com>
Tested-by: Mor Bar-Gabay <morx.bar.gabay@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Move the RSS field definitions related to IPv4 and IPv6 UDP from igc.h to
igc_defines.h to consolidate the RSS field definitions in a single header
file, improving code organization and maintainability.
This refactoring does not alter the functionality of the driver but
enhances the logical grouping of related constants.
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: Song Yoong Siang <yoong.siang.song@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Brett Creeley <brett.creeley@amd.com>
Tested-by: Mor Bar-Gabay <morx.bar.gabay@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
In the VF handling code, parts of the LAG code can be broken out into
helper functions to reduce code duplication. Break this code out into
helper functions.
Reviewed-by: Marcin Szycik <marcin.szycik@linux.intel.com>
Signed-off-by: Dave Ertman <david.m.ertman@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Previously, ice_add_prof() took an array of u8 and looped over it with
for_each_set_bit(), examining each 8-bit value as a bitmap.
This was just hard to understand and unnecessary, and was triggering
undefined behavior sanitizers with unaligned accesses within bitmap
fields (on our internal tools/builds). Since the @ptype being passed in
was already declared as a bitmap, refactor this to use native types with
the advantage of simplifying the code to use a single loop.
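A sketch of the simplified iteration, assuming the ptype argument is a
proper bitmap (the bound and the loop body are illustrative):

  DECLARE_BITMAP(ptypes, ICE_FLOW_PTYPE_MAX);
  unsigned long bit;

  /* one loop over the whole bitmap, instead of looping over u8
   * chunks and treating each byte as its own little bitmap
   */
  for_each_set_bit(bit, ptypes, ICE_FLOW_PTYPE_MAX) {
          /* ... add packet type 'bit' to the profile ... */
  }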
Co-developed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
CC: Jesse Brandeburg <jbrandeburg@cloudflare.com>
Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
E835 is an enhanced version of the E830.
It continues to use the same set of commands, registers and interfaces
as other devices in the 800 Series.
The following device IDs are added:
- 0x1248: Intel(R) Ethernet Controller E835-CC for backplane
- 0x1249: Intel(R) Ethernet Controller E835-CC for QSFP
- 0x124A: Intel(R) Ethernet Controller E835-CC for SFP
- 0x1261: Intel(R) Ethernet Controller E835-C for backplane
- 0x1262: Intel(R) Ethernet Controller E835-C for QSFP
- 0x1263: Intel(R) Ethernet Controller E835-C for SFP
- 0x1265: Intel(R) Ethernet Controller E835-L for backplane
- 0x1266: Intel(R) Ethernet Controller E835-L for QSFP
- 0x1267: Intel(R) Ethernet Controller E835-L for SFP
Reviewed-by: Konrad Knitter <konrad.knitter@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Dawid Osuchowski <dawid.osuchowski@linux.intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Introduce the ICE_AQC_PORT_OPT_MAX_LANE_40G constant and update the code
to process this new option in both the devlink and the Admin Queue Command
GET PORT OPTION (opcode 0x06EA) message, similar to existing constants like
ICE_AQC_PORT_OPT_MAX_LANE_50G, ICE_AQC_PORT_OPT_MAX_LANE_100G, and so on.
This feature allows the driver to correctly report configuration options
for 2x40G on E823 and other cards in the future via devlink.
Example command:
devlink port split pci/0000:01:00.0/0 count 2
Example dmesg:
ice 0000:01:00.0: Available port split options and max port speeds (Gbps):
ice 0000:01:00.0: Status Split Quad 0 Quad 1
ice 0000:01:00.0: count L0 L1 L2 L3 L4 L5 L6 L7
ice 0000:01:00.0: 2 40 - - - 40 - - -
ice 0000:01:00.0: 2 50 - 50 - - - - -
ice 0000:01:00.0: 4 25 25 25 25 - - - -
ice 0000:01:00.0: 4 25 25 - - 25 25 - -
ice 0000:01:00.0: Active 8 10 10 10 10 10 10 10 10
ice 0000:01:00.0: 1 100 - - - - - - -
Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
The IRQ coalescing config currently resides only inside struct
idpf_q_vector. However, all idpf_q_vector structs are de-allocated and
re-allocated during resets. This causes the user-set coalesce
configuration to be lost.
Add new fields to struct idpf_vport_user_config_data to save the user
settings and re-apply them after reset.
Reviewed-by: Madhu Chittim <madhu.chittim@intel.com>
Signed-off-by: Ahmed Zaki <ahmed.zaki@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Tested-by: Samuel Salin <Samuel.salin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Add cross timestamp support through virtchnl mailbox messages and
directly through PCIe BAR registers. Cross timestamping assumes that
both the system time and device clock time values are cached
simultaneously, which is triggered by HW. The feature is enabled for
both ARM and x86 architectures.
Signed-off-by: Milena Olech <milena.olech@intel.com>
Reviewed-by: Karol Kolacinski <karol.kolacinski@intel.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Tested-by: Samuel Salin <Samuel.salin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Use the new virtchnl2 OP codes to communicate with the Control Plane to
add flow steering filters. We add the basic functionality for add/delete
with TCP/UDP IPv4 only. Support for other OP codes and protocols will be
added later.
Standard 'ethtool -N|--config-ntuple' should be used, for example:
# ethtool -N ens801f0d1 flow-type tcp4 src-ip 10.0.0.1 action 6
to route all IPv4/TCP traffic from IP 10.0.0.1 to queue 6.
Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: Ahmed Zaki <ahmed.zaki@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Add opcodes and corresponding message structure to add and delete
flow steering rules. Flow steering enables configuration
of rules to take an action or subset of actions based on match
criteria. Actions could be redirect to queue, redirect to queue
group, drop packet or mark.
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Co-developed-by: Dinesh Kumar <dinesh.kumar@intel.com>
Signed-off-by: Dinesh Kumar <dinesh.kumar@intel.com>
Signed-off-by: Sudheer Mogilappagari <sudheer.mogilappagari@intel.com>
Signed-off-by: Ahmed Zaki <ahmed.zaki@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
The "enum virtchnl2_cap_rss" will be used for negotiating flow
steering capabilities. Instead of adding a new enum, rename
virtchnl2_cap_rss to virtchnl2_flow_types. Also rename the enum's
constants.
Flow steering will use this enum in the next patches.
Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: Ahmed Zaki <ahmed.zaki@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Tony Nguyen says:
====================
Add RDMA support for Intel IPU E2000 in idpf
Tatyana Nikolova says:
This idpf patch series is the second part of the staged submission for
introducing RDMA RoCEv2 support for the IPU E2000 line of products,
referred to as GEN3.
To support RDMA GEN3 devices, the idpf driver uses common definitions
of the IIDC interface and implements specific device functionality in
iidc_rdma_idpf.h.
The IPU model can host one or more logical network endpoints called
vPorts per PCI function that are flexibly associated with a physical
port or an internal communication port.
Other features pertaining to GEN3 devices include:
* MMIO learning
* RDMA capability negotiation
* RDMA vectors discovery between idpf and control plane
These patches are split from the submission "Add RDMA support for Intel
IPU E2000 (GEN3)" [1]. The patches have been tested on a range of hosts
and platforms with a variety of general RDMA applications which include
standalone verbs (rping, perftest, etc.), storage and HPC applications.
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
[1] https://lore.kernel.org/all/20240724233917.704-1-tatyana.e.nikolova@intel.com/
IWL reviews:
v3: https://lore.kernel.org/all/20250708210554.1662-1-tatyana.e.nikolova@intel.com/
v2: https://lore.kernel.org/all/20250612220002.1120-1-tatyana.e.nikolova@intel.com/
v1 (split from previous series):
https://lore.kernel.org/all/20250523170435.668-1-tatyana.e.nikolova@intel.com/
v3: https://lore.kernel.org/all/20250207194931.1569-1-tatyana.e.nikolova@intel.com/
RFC v2: https://lore.kernel.org/all/20240824031924.421-1-tatyana.e.nikolova@intel.com/
RFC: https://lore.kernel.org/all/20240724233917.704-1-tatyana.e.nikolova@intel.com/
* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/linux:
idpf: implement get LAN MMIO memory regions
idpf: implement IDC vport aux driver MTU change handler
idpf: implement remaining IDC RDMA core callbacks and handlers
idpf: implement RDMA vport auxiliary dev create, init, and destroy
idpf: implement core RDMA auxiliary dev create, init, and destroy
idpf: use reserved RDMA vectors from control plane
====================
Link: https://patch.msgid.link/20250714181002.2865694-1-anthony.l.nguyen@intel.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
pf->ice_debugfs_pf_fwlog should be checked for an error here.
Fixes: 96a9a9341c ("ice: configure FW logging")
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
The function ice_lag_is_switchdev_running() is being called from outside of
the LAG event handler code. This results in the lag->upper_netdev being
NULL sometimes. To avoid a NULL-pointer dereference, there needs to be a
check before it is dereferenced.
Fixes: 776fe19953 ("ice: block default rule setting on LAG interface")
Signed-off-by: Dave Ertman <david.m.ertman@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
With large values of CONFIG_NR_CPUS, three Intel ethernet drivers fail to
compile like:
In function ‘i40e_free_q_vector’,
inlined from ‘i40e_vsi_alloc_q_vectors’ at drivers/net/ethernet/intel/i40e/i40e_main.c:12112:3:
571 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
include/linux/rcupdate.h:1084:17: note: in expansion of macro ‘BUILD_BUG_ON’
1084 | BUILD_BUG_ON(offsetof(typeof(*(ptr)), rhf) >= 4096); \
drivers/net/ethernet/intel/i40e/i40e_main.c:5113:9: note: in expansion of macro ‘kfree_rcu’
5113 | kfree_rcu(q_vector, rcu);
| ^~~~~~~~~
The problem is that the 'rcu' member in 'q_vector' is too far from the start
of the structure. Move this member before the CPU mask instead, in all three
drivers.
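An illustrative layout of the fix; kfree_rcu() requires the rcu_head to
sit within the first 4096 bytes of the object, and a large cpumask with
a big CONFIG_NR_CPUS can push it past that (members trimmed):

  struct i40e_q_vector {
          /* members that stay near the start omitted */
          struct rcu_head rcu;            /* moved before the CPU mask */
          cpumask_t affinity_mask;        /* KiB-sized with large NR_CPUS */
          /* remaining members unchanged */
  };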
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Tested-by: Sunitha Mekala <sunithax.d.mekala@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
The RDMA driver needs to map its own MMIO regions for the sake of
performance, meaning the IDPF needs to avoid mapping portions of the BAR
space. However, to be HW agnostic, the IDPF cannot assume where
these are and must avoid mapping hard coded regions as much as possible.
The IDPF maps the bare minimum to load and communicate with the
control plane, i.e., the mailbox registers and the reset state
registers. Because of how and when mailbox register offsets are
initialized, it is easier to adjust the existing defines to be relative
to the mailbox region starting address. Use a specific mailbox register
write function that uses these relative offsets. The reset state
register addresses are calculated the same way as for other registers,
described below.
The IDPF then calls a new virtchnl op to fetch a list of MMIO regions
that it should map. The addresses for the registers in these regions are
calculated by determining what region the register resides in, adjusting
the offset to be relative to that region, and then adding the
register's offset to that region's mapped address.
If the new virtchnl op is not supported, the IDPF will fall back to
mapping the whole BAR. However, it will still map them as separate
regions outside the mailbox and reset state registers. This way we can
use the same logic in both cases to access the MMIO space.
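A hedged sketch of the region lookup described above; the structure and
field names are illustrative, not the driver's exact API:

  void __iomem *idpf_get_reg_addr(struct idpf_adapter *adapter,
                                  resource_size_t reg_offset)
  {
          int i;

          for (i = 0; i < adapter->num_regions; i++) {
                  struct idpf_mmio_region *r = &adapter->regions[i];

                  if (reg_offset >= r->offset &&
                      reg_offset < r->offset + r->size)
                          /* mapped base + region-relative offset */
                          return r->vaddr + (reg_offset - r->offset);
          }

          return NULL;
  }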
Reviewed-by: Madhu Chittim <madhu.chittim@intel.com>
Signed-off-by: Joshua Hay <joshua.a.hay@intel.com>
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
The only event an RDMA vport aux driver cares about right now is an MTU
change on its underlying vport. Implement and plumb the handler to
signal the pre MTU change event and post MTU change events to the RDMA
vport aux driver.
Reviewed-by: Madhu Chittim <madhu.chittim@intel.com>
Signed-off-by: Joshua Hay <joshua.a.hay@intel.com>
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Implement the idpf_idc_request_reset and idpf_idc_rdma_vc_send_sync
callbacks for the rdma core auxiliary driver to issue reset events to
the idpf and send (synchronous) virtchnl messages to the control plane
respectively.
Implement and plumb the reset handler for the opposite flow as well,
i.e. when the idpf is resetting and needs to notify the RDMA core
auxiliary driver.
Reviewed-by: Madhu Chittim <madhu.chittim@intel.com>
Signed-off-by: Joshua Hay <joshua.a.hay@intel.com>
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Implement the functions to create, initialize, and destroy an RDMA vport
auxiliary device. The vport aux dev creation is dependent on the
core aux device to call idpf_idc_vport_dev_ctrl to signal that it is
ready for vport aux devices. Implement that core callback to either
create and initialize the vport aux dev or deinitialize.
RDMA vport aux dev creation is also dependent on the control plane to
tell us the vport is RDMA enabled. Add a flag in the create vport
message to signal individual vport RDMA capabilities.
Reviewed-by: Madhu Chittim <madhu.chittim@intel.com>
Signed-off-by: Joshua Hay <joshua.a.hay@intel.com>
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Add the initial idpf_idc.c file with the functions to kick off the IDC
initialization, create and initialize a core RDMA auxiliary device, and
destroy said device.
The RDMA core has a dependency on the vports being created by the
control plane before it can be initialized. Therefore, once all the
vports are up after a hard reset (either during driver load or a
function level reset), the core RDMA device info will be created. It is
populated with the function type (as distinguished by the IDC
initialization function pointer), the core idc_ops function pointers
(just stubs for now), the reserved RDMA MSIX table, and various other
info the core RDMA auxiliary driver will need. It is then plugged on to
the bus.
During a function level reset or driver unload, the device will be
unplugged from the bus and destroyed.
Reviewed-by: Madhu Chittim <madhu.chittim@intel.com>
Signed-off-by: Joshua Hay <joshua.a.hay@intel.com>
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Fetch the number of reserved RDMA vectors from the control plane.
Adjust the number of reserved LAN vectors if necessary. Adjust the
minimum number of vectors the OS should reserve to include RDMA; and
fail if the OS cannot reserve enough vectors for the minimum number of
LAN and RDMA vectors required. Create a separate msix table for the
reserved RDMA vectors, which will just get handed off to the RDMA core
device to do with what it will.
Reviewed-by: Madhu Chittim <madhu.chittim@intel.com>
Signed-off-by: Joshua Hay <joshua.a.hay@intel.com>
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
The ice_get_vf_by_id() function is used to obtain a reference to a VF
structure based on its ID. The ice_sriov_set_msix_vec_count() function
needs to get a VF reference starting from the VF PCI device, and uses
pci_iov_vf_id() to get the VF ID. This pattern is currently uncommon in the
ice driver. However, the live migration module will introduce many more
such locations.
Add a helper wrapper ice_get_vf_by_dev() which takes the VF PCI device and
calls ice_get_vf_by_id() using pci_iov_vf_id() to get the VF ID.
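A sketch of such a wrapper; treat the exact signature as an assumption:

  static inline struct ice_vf *ice_get_vf_by_dev(struct ice_pf *pf,
                                                 struct pci_dev *vf_dev)
  {
          int vf_id = pci_iov_vf_id(vf_dev);

          if (vf_id < 0)
                  return NULL;

          return ice_get_vf_by_id(pf, vf_id);
  }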
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Commit 05c16687e0 ("ice: set MSI-X vector count on VF") added support to
change the vector count for VFs as part of ice_sriov_set_msix_vec_count().
This function modifies and rebuilds the target VF with the requested number
of MSI-X vectors.
Future support for live migration will add a call to
ice_sriov_set_msix_vec_count() to ensure that a migrated VF has the proper
MSI-X vector count. In most cases, this request will be to set the MSI-X
vector count to its current value. In that case, no work is necessary.
Rather than requiring the caller to check this, update the function to
check and exit early if the vector count is already at the requested value.
This avoids an unnecessary VF rebuild.
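The early exit is essentially a one-liner at the top of
ice_sriov_set_msix_vec_count(), sketched here with assumed field names:

  if (msix_vec_count == vf->num_msix)
          return 0;       /* already at the requested count, skip the rebuild */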
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
ice_sriov_set_msix_vec_count() obtains the VF device ID in a roundabout
way by iterating over the possible VF IDs, calling
pci_iov_virtfn_devfn() to calculate the device and function combos, and
comparing them to pdev->devfn.
This is unnecessary. The pci_iov_vf_id() helper already exists which does
the reverse calculation of pci_iov_virtfn_devfn(), which is much simpler
and avoids the loop construction. Use this instead.
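Roughly, the simplification looks like this (sketch):

  /* before: derive the VF ID by brute force */
  for (vf_id = 0; vf_id < pci_num_vf(pdev->physfn); vf_id++)
          if (pci_iov_virtfn_devfn(pdev->physfn, vf_id) == pdev->devfn)
                  break;

  /* after: ask the PCI core for the reverse mapping directly */
  vf_id = pci_iov_vf_id(pdev);
  if (vf_id < 0)
          return vf_id;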
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
The live migration process will require configuring the target VF with the
data provided from the source host. A few helper functions in ice_sriov.c
and ice_virtchnl.c will be needed for this process, but are currently
static.
Expose these functions in their respective headers so that the live
migration module can use them during the migration process.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
A future change is going to need to call ice_vsi_update_l2tsel() from a
new context outside of ice_virtchnl.c.
Since this function deals with a generic VSI, move it into ice_lib.c to
enable calling it from other places in the ice driver.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Madhu Chittim <madhu.chittim@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
The VF can program the RSS hash configuration over virtchnl. It does this
by sending a u64 bitmask which represents the current hash configuration.
It is not trivial to reverse the hardware configuration back to this hash
set for migration. Instead, save the value to the ice_vf structure when
it is modified by the VF.
The rss_hashcfg value is an 8-byte field. Make room for it in ice_vf by
re-arranging some of the existing fields. There is a 4-byte gap after
first_vector_idx, and a 4-byte gap between max_tx_rate and vf_states.
Move first_vector_idx into the latter 4-byte gap, creating an 8-byte
area where rss_hashcfg can be placed. Also move the num_msix field near
min_tx_rate, filling 2 bytes of a 3-byte hole.
The end result of these changes enables placing the rss_hashcfg field into
the structure while also saving 8 bytes in size. It looks like there are a
handful of more possible cleanups to reduce the size even further, but
those have been left as a future cleanup.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Madhu Chittim <madhu.chittim@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
The live migration driver will need to save and restore the Tx queue
context state from the hardware registers. This state contains both static
fields which do not change during Tx traffic as well as dynamic fields
which may change during Tx traffic.
Unlike the Rx context, the Tx queue context is accessed indirectly from
GLCOMM_QTX_CNTX_CTL and GLCOMM_QTX_CNTX_DATA registers. These registers are
shared by multiple PFs on the same PCIe card. Multiple PFs cannot safely
access the registers simultaneously, and there is no hardware semaphore or
logic to control access. To handle this, introduce the txq_ctx_lock to the
ice_adapter structure. This is similar to the ptp_gltsyn_time_lock. All PFs
on the same adapter share this structure, and use it to serialize access to
the registers to prevent error.
Add new functions to get and set the Tx queue context through the
GLCOMM_QTX_CNTX_CTL interface. The hardware context values are stored in
the registers using the same packed format as the Admin Queue buffer.
The hardware buffer is 40 bytes wide, as it contains an additional 18 bytes
of internal state not sent with the Admin Queue buffer. For this reason, a
separate typedef and packing function must be used. We can share the same
packed fields definitions because we never need to unpack the internal
state. This is preferred, as it ensures the internal state is zero'd when
writing into HW, and avoids issues with reading u32 registers into a
buffer of 22 bytes in length. Thanks to the typedefs, misuse of the API
with the wrong size buffer can easily be caught at compile time.
Note reading this data from hardware is essential because the current Tx
queue context may be different from the context as initially programmed by
the driver during VF initialization. When migrating a VF we must ensure the
target VF has identical context as the source VF did.
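A sketch of the serialized indirect read; GLCOMM_QTX_CNTX_CTL and
GLCOMM_QTX_CNTX_DATA are named in the text above, while the command
encoding, the dword count and the lock placement are illustrative:

  /* all PFs on the adapter share txq_ctx_lock, since the indirect
   * interface has no hardware arbitration
   */
  spin_lock(&pf->adapter->txq_ctx_lock);

  /* select the queue and trigger a read into the DATA registers */
  wr32(hw, GLCOMM_QTX_CNTX_CTL, cmd);
  ice_flush(hw);

  for (i = 0; i < 10; i++)        /* 40-byte context, 10 dwords */
          buf[i] = rd32(hw, GLCOMM_QTX_CNTX_DATA(i));

  spin_unlock(&pf->adapter->txq_ctx_lock);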
Co-developed-by: Yahui Cao <yahui.cao@intel.com>
Signed-off-by: Yahui Cao <yahui.cao@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Madhu Chittim <madhu.chittim@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
In order to support live migration, the ice driver will need to read
certain data from the Rx queue context. This is stored in the hardware in a
packed format.
Since we use <linux/packing.h> for the mapping between the packed hardware
format and the unpacked structure, it is trivial to enable unpacking
support via the unpack_fields() function.
Add the ice_unpack_rxq_ctx() function based on the unpack_fields() API.
Re-use the same field definitions from the packing implementation.
Add ice_copy_rxq_ctx_from_hw() to copy the Rx queue context data from the
hardware registers.
Use these to implement ice_read_rxq_ctx() which will return the Rx queue
context to the caller in its unpacked ice_rlan_ctx struct.
This will enable the migration logic access to the relevant data about the
Rx device queues. It can easily be copied to the target system as part of
the migration payload, where it will be used to configure the Rx queues.
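A sketch of the unpack helper, mirroring the existing pack side; the
field-table name and quirk flags are assumptions based on the packing
implementation the text refers to:

  static void ice_unpack_rxq_ctx(const ice_rxq_ctx_buf_t *buf,
                                 struct ice_rlan_ctx *ctx)
  {
          unpack_fields(buf, sizeof(*buf), ctx, ice_rlan_ctx_fields,
                        QUIRK_LITTLE_ENDIAN | QUIRK_LSW32_IS_FIRST);
  }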
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Madhu Chittim <madhu.chittim@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
ICE appears to have some odd form of rss_context use plumbed
in for .get_rxfh. The .set_rxfh side does not support creating
contexts, however, so this must be dead code. For at least a year
now (since commit 7964e78846 ("net: ethtool: use the tracking
array for get_rxfh on custom RSS contexts")) we have not been
calling .get_rxfh with a non-zero rss_context. We just get
the info from the RSS XArray under dev->ethtool.
Remove what must be dead code in the driver, clear the support flags.
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250707184115.2285277-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Tony Nguyen says:
====================
Intel Wired LAN Driver Updates 2025-07-03
Vladimir Oltean converts Intel drivers (ice, igc, igb, ixgbe, i40e) to
utilize new timestamping API (ndo_hwtstamp_get() and ndo_hwtstamp_set()).
For ixgbe:
Paul, Don, Slawomir, and Radoslaw add Malicious Driver Detection (MDD)
support for X550 and E610 devices to detect, report, and handle
potentially malicious VFs.
Simon Horman corrects spelling mistakes.
For igbvf:
Kohei Enju removes a couple of unreported counters and adds reporting
of Tx timeouts.
* '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
igbvf: add tx_timeout_count to ethtool statistics
igbvf: remove unused interrupt counter fields from struct igbvf_adapter
ixgbe: spelling corrections
ixgbe: turn off MDD while modifying SRRCTL
ixgbe: add Tx hang detection unhandled MDD
ixgbe: check for MDD events
ixgbe: add MDD support
i40e: convert to ndo_hwtstamp_get() and ndo_hwtstamp_set()
ixgbe: convert to ndo_hwtstamp_get() and ndo_hwtstamp_set()
igb: convert to ndo_hwtstamp_get() and ndo_hwtstamp_set()
igc: convert to ndo_hwtstamp_get() and ndo_hwtstamp_set()
ice: convert to ndo_hwtstamp_get() and ndo_hwtstamp_set()
====================
Link: https://patch.msgid.link/20250703174242.3829277-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add `tx_timeout_count` to ethtool statistics to provide visibility into
transmit timeout events, bringing igbvf in line with other Intel
ethernet drivers.
Currently `tx_timeout_count` is incremented in igbvf_watchdog_task() and
igbvf_tx_timeout() but is not exposed to userspace nor used elsewhere in
the driver.
Before:
# ethtool -S ens5 | grep tx
tx_packets: 43
tx_bytes: 4408
tx_restart_queue: 0
After:
# ethtool -S ens5 | grep tx
tx_packets: 41
tx_bytes: 4241
tx_restart_queue: 0
tx_timeout_count: 0
Tested-by: Kohei Enju <enjuk@amazon.com>
Signed-off-by: Kohei Enju <enjuk@amazon.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Remove `int_counter0` and `int_counter1` from struct igbvf_adapter since
they are only incremented in interrupt handlers igbvf_intr_msix_rx() and
igbvf_msix_other(), but never read or used anywhere in the driver.
Note that igbvf_intr_msix_tx() does not have similar counter increments,
suggesting that these were likely overlooked during development.
Eliminate the fields and their unnecessary accesses in interrupt
handlers.
Tested-by: Kohei Enju <enjuk@amazon.com>
Signed-off-by: Kohei Enju <enjuk@amazon.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Add Tx hang detection for unhandled MDD events.
Previously, a malicious VF could disable the entire port, causing Tx to
hang on the E610 card. Events that caused the PF to freeze were not
detected as MDD events and usually required the Tx hang watchdog timer
to catch the stall and perform a physical function reset.
Implement flows in the affected PF driver to check the cause of the
hang, detect it as an MDD event and log an entry for the malicious VF
that caused it. The PF blocks the malicious VF if it continues to be
the source of several MDD events.
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Marcin Szycik <marcin.szycik@linux.intel.com>
Signed-off-by: Slawomir Mrozowicz <slawomirx.mrozowicz@intel.com>
Co-developed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
When an event is detected it is logged and, for the time being, the
queue is immediately re-enabled. This is due to the lack of an API
through which the hypervisor could deal with the event as it chooses.
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Reviewed-by: Marcin Szycik <marcin.szycik@linux.intel.com>
Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com>
Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Add malicious driver detection to ixgbe driver. The supported devices
are E610 and X550.
Handling MDD events is enabled when VFs are created and turned off
when they are disabled. There is no runtime command to enable or
disable MDD independently.
An MDD event is logged when a malicious VF driver is detected. For
example, a VF can try to send an incorrect Tx descriptor (TSO on, but
the length field not correct). It can be reproduced by manipulating the
driver, or by using a driver with incorrect descriptor values.
Example log:
"Malicious event on VF 0 tx:128 rx:128"
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Reviewed-by: Marcin Szycik <marcin.szycik@linux.intel.com>
Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com>
Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
New timestamping API was introduced in commit 66f7223039 ("net: add
NDOs for configuring hardware timestamping") from kernel v6.6.
It is time to convert the Intel i40e driver to the new API, so that
timestamping configuration can be removed from the ndo_eth_ioctl() path
completely.
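The shape of the conversion, sketched generically (the i40e-specific
bodies are elided; the hook names are illustrative):

  static int i40e_hwtstamp_get(struct net_device *netdev,
                               struct kernel_hwtstamp_config *config)
  {
          /* fill *config from the stored PTP configuration */
          return 0;
  }

  static int i40e_hwtstamp_set(struct net_device *netdev,
                               struct kernel_hwtstamp_config *config,
                               struct netlink_ext_ack *extack)
  {
          /* apply *config to HW; no more struct hwtstamp_config
           * copying in the ndo_eth_ioctl() path
           */
          return 0;
  }

  static const struct net_device_ops i40e_netdev_ops = {
          /* ... */
          .ndo_hwtstamp_get = i40e_hwtstamp_get,
          .ndo_hwtstamp_set = i40e_hwtstamp_set,
  };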
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Reviewed-by: Milena Olech <milena.olech@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
New timestamping API was introduced in commit 66f7223039 ("net: add
NDOs for configuring hardware timestamping") from kernel v6.6.
It is time to convert the Intel ixgbe driver to the new API, so that
timestamping configuration can be removed from the ndo_eth_ioctl() path
completely.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Reviewed-by: Milena Olech <milena.olech@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
New timestamping API was introduced in commit 66f7223039 ("net: add
NDOs for configuring hardware timestamping") from kernel v6.6.
It is time to convert the Intel igb driver to the new API, so that
timestamping configuration can be removed from the ndo_eth_ioctl() path
completely.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Reviewed-by: Milena Olech <milena.olech@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
New timestamping API was introduced in commit 66f7223039 ("net: add
NDOs for configuring hardware timestamping") from kernel v6.6.
It is time to convert the Intel igc driver to the new API, so that
timestamping configuration can be removed from the ndo_eth_ioctl() path
completely.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Vitaly Lifshits <vitaly.lifshits@intel.com>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Reviewed-by: Milena Olech <milena.olech@intel.com>
Tested-by: Mor Bar-Gabay <morx.bar.gabay@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
New timestamping API was introduced in commit 66f7223039 ("net: add
NDOs for configuring hardware timestamping") from kernel v6.6.
It is time to convert the Intel ice driver to the new API, so that
timestamping configuration can be removed from the ndo_eth_ioctl() path
completely.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Reviewed-by: Milena Olech <milena.olech@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
I226 devices advertise support for the PCI-E link L1.2 substate. However,
due to a hardware limitation, the exit latency from this low-power state
is longer than the packet buffer can tolerate under high traffic
conditions. This can lead to packet loss and degraded performance.
To mitigate this, disable the L1.2 substate. The increased power draw
between L1.1 and L1.2 is insignificant.
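A sketch of the mitigation, assuming it is applied at probe time for
the affected device IDs (the device-check helper is illustrative):

  /* HW limitation: L1.2 exit latency exceeds what the packet buffer
   * tolerates under high traffic, so keep the device out of L1.2
   */
  if (igc_is_i226(hw))
          pci_disable_link_state(pdev, PCIE_LINK_STATE_L1_2);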
Fixes: 4354621173 ("igc: Add new device ID's")
Link: https://lore.kernel.org/intel-wired-lan/15248b4f-3271-42dd-8e35-02bfc92b25e1@intel.com
Signed-off-by: Vitaly Lifshits <vitaly.lifshits@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Mor Bar-Gabay <morx.bar.gabay@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
In both the read callback for struct cyclecounter, and in struct
timecounter, struct cyclecounter is declared as a const pointer.
Unfortunately, a number of users of this pointer treat it as a non-const
pointer as it is buried in a larger structure that is heavily modified by
the callback function when accessed. This lie had been hidden by the fact
that container_of() "casts away" a const attribute of a pointer without any
compiler warning happening at all.
Fix this all up by removing the const attribute in the needed places so
that everyone can see that the structure really isn't const, but can,
and is, modified by the users of it.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/2025070124-backyard-hurt-783a@gregkh
The driver currently defaults to the internal oscillator as the clock
source for E825-C hardware. While this clock source is labeled TCXO,
indicating a temperature compensated oscillator, this is only true for some
board designs. Many board designs have a less capable oscillator. The
E825-C hardware may also have its clock source set to the TIME_REF pin.
This pin is connected to the DPLL and is often a more stable clock source.
The choice of the internal oscillator is not suitable for all systems,
especially those which want to enable SyncE support.
There is currently no interface available for users to configure the clock
source. Other variants of the E82x board have the clock source configured
in the NVM, but E825-C lacks this capability, so different board designs
cannot select a different default clock via firmware.
In most setups, the TIME_REF is a suitable default clock source.
Additionally, we now fall back to the internal oscillator automatically if
the TIME_REF clock source cannot be locked.
Change the default clock source for E825-C to TIME_REF. Note that the
driver logs a dev_dbg message upon configuring the TSPLL which includes the
clock source and frequency. This can be enabled to confirm which clock
source is in use.
Long term, a proper interface to dynamically introspect and change the
clock source will be designed (perhaps some extension of the DPLL
subsystem?).
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Initialize the TSPLL after initializing the PHC in ice_ptp.c, instead
of calling it for each product in the PHC init in ice_ptp_hw.c.
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Reviewed-by: Milena Olech <milena.olech@intel.com>
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
TSPLL can fail when trying to lock to TIME_REF as a clock source, e.g.
when the external clock source is not stable or not connected to the
board.
To continue operation after a failure, try to lock again to the
internal TCXO and inform the user about this.
Reviewed-by: Milena Olech <milena.olech@intel.com>
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
To ensure proper operation, wait for 10 to 20 microseconds before
enabling TSPLL.
Adjust wait time after enabling TSPLL from 1-5 ms to 1-2 ms.
Those values are empirical and tested on multiple HW configurations.
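Sketched with the usual delay helpers (the surrounding enable code and
register names are illustrative):

	usleep_range(10, 20);		/* settle before enabling the TSPLL */
	wr32(hw, TSPLL_ENABLE_REG, val | TSPLL_ENABLE_BIT);
	usleep_range(1000, 2000);	/* post-enable wait, was 1000-5000 us */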
Reviewed-by: Milena Olech <milena.olech@intel.com>
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Add helpers for checking TSPLL params, disabling sticky bits,
configuring TSPLL and getting default clock frequency to simplify
the code flows.
Reviewed-by: Milena Olech <milena.olech@intel.com>
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Switch from unions with bitfield structs to definitions with bitfield
masks. This is necessary because some registers have different
field definitions, or even use a different register for the same fields,
depending on HW type.
Remove unused register fields.
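The shape of the change, with illustrative register and field names:

	/* before: one fixed layout baked into a union */
	union tspll_ro_lock {
		struct {
			u32 bw_ssl_step : 4;
			u32 true_lock : 1;
			u32 unlock : 1;
			u32 reserved : 26;
		} field;
		u32 val;
	};

	/* after: bitfield masks, selectable per HW type at runtime */
	#define TSPLL_RO_LOCK_TRUE_LOCK		BIT(4)
	#define TSPLL_RO_LOCK_UNLOCK		BIT(5)

	locked = !!(val & TSPLL_RO_LOCK_TRUE_LOCK);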
Reviewed-by: Milena Olech <milena.olech@intel.com>
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
After programming the TSPLL, re-read the registers before reporting status.
This ensures the debug log message will show what was actually programmed,
rather than relying on a cached value.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
When programming the Clock Generation Unit for E825-C hardware, we need
to clear the time_sync_en bit of DWORD 9 before we set the
frequency.
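A sketch of the sequence; cgu_read()/cgu_write() stand in for the
driver's CGU register accessors, and the register/field names are
illustrative:

	err = cgu_read(hw, TSPLL_CGU_DWORD9, &val);
	if (err)
		return err;

	val &= ~TSPLL_CGU_DWORD9_TIME_SYNC_EN;
	err = cgu_write(hw, TSPLL_CGU_DWORD9, val);
	/* ...then program the frequency... */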
Co-developed-by: Karol Kolacinski <karol.kolacinski@intel.com>
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Tony Nguyen says:
====================
ice: Separate TSPLL from PTP and clean up [part]
Jake Keller says:
Separate TSPLL related functions and definitions from all PTP-related
files and clean up the code by implementing multiple helpers.
* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
ice: add TSPLL log config helper
ice: use designated initializers for TSPLL consts
ice: remove ice_tspll_params_e825 definitions
ice: fix E825-C TSPLL register definitions
ice: rename TSPLL and CGU functions and definitions
ice: move TSPLL functions to a separate file
====================
Link: https://patch.msgid.link/20250618174231.3100231-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Don't populate the const read-only array supported_sizes on the
stack at run time; instead, make it static.
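The pattern (the array contents here are illustrative):

	/* before: the table is rebuilt on the stack on every call */
	const u32 supported_sizes[] = { 64, 128, 256, 512 };

	/* after: a single read-only copy placed in .rodata */
	static const u32 supported_sizes[] = { 64, 128, 256, 512 };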
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Vitaly Lifshits <vitaly.lifshits@intel.com>
Link: https://patch.msgid.link/20250618135408.1784120-1-colin.i.king@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Drivers that are using the ops lock and don't depend on the RTNL lock
still need to manage it because of udp_tunnel's RTNL dependency.
Introduce new udp_tunnel_nic_lock and use it instead of
rtnl_lock. Drop non-UDP_TUNNEL_NIC_INFO_MAY_SLEEP mode from
udp_tunnel infra (udp_tunnel_nic_device_sync_work needs to
grab udp_tunnel_nic_lock mutex and might sleep).
Cover more places in v4:
- netlink
- udp_tunnel_notify_add_rx_port (ndo_open)
- triggers udp_tunnel_nic_device_sync_work
- udp_tunnel_notify_del_rx_port (ndo_stop)
- triggers udp_tunnel_nic_device_sync_work
- udp_tunnel_get_rx_info (__netdev_update_features)
- triggers NETDEV_UDP_TUNNEL_PUSH_INFO
- udp_tunnel_drop_rx_info (__netdev_update_features)
- triggers NETDEV_UDP_TUNNEL_DROP_INFO
- udp_tunnel_nic_reset_ntf (ndo_open)
- notifiers
- udp_tunnel_nic_netdevice_event, depending on the event:
- triggers NETDEV_UDP_TUNNEL_PUSH_INFO
- triggers NETDEV_UDP_TUNNEL_DROP_INFO
- ethnl_tunnel_info_reply_size
- udp_tunnel_nic_set_port_priv (two intel drivers)
Cc: Michael Chan <michael.chan@broadcom.com>
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Stanislav Fomichev <stfomichev@gmail.com>
Link: https://patch.msgid.link/20250616162117.287806-4-stfomichev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a helper function to print new/current TSPLL config. This helps
avoid unnecessary casts from u8 to enums.
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Reviewed-by: Milena Olech <milena.olech@intel.com>
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Instead of multiple comments, use designated initializers for TSPLL
consts.
Adjust ice_tspll_params_e82x field sizes.
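The shape of the change (enum names and values are illustrative):

	/* before: the mapping to the enum carried only by comments */
	static const u64 tspll_freq[] = {
		25000000ULL,	/* ICE_TSPLL_FREQ_25_000 */
		156250000ULL,	/* ICE_TSPLL_FREQ_156_250 */
	};

	/* after: the index is self-documenting */
	static const u64 tspll_freq[] = {
		[ICE_TSPLL_FREQ_25_000]	 = 25000000ULL,
		[ICE_TSPLL_FREQ_156_250] = 156250000ULL,
	};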
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Reviewed-by: Milena Olech <milena.olech@intel.com>
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Remove the ice_tspll_params_e825 definitions because, according to the EDS
(Electrical Design Specification) document, E825 devices support only the
156.25 MHz TSPLL frequency for both the TCXO and TIME_REF clock sources.
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
The E825-C hardware has a slightly different register layout for register
19 of the Clock Generation Unit and TSPLL. The fbdiv_intgr value can be 10
bits wide.
Additionally, most of the fields that were in register 24 are made
available in register 23 instead. The programming logic already has a
corrected definition for register 23, but it incorrectly still used the
8-bit definition of fbdiv_intgr. This results in truncating some of the
values of fbdiv_intgr, including the value used for the 156.25 MHz signal.
The driver only used register 24 to obtain the enable status, which we
should read from register 23. This results in an incorrect output for the
log messages, but does not change any functionality besides
disabled-by-default dynamic debug messages.
Fix the register definitions, and adjust the code to properly reflect the
enable/disable status in the log messages.
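The gist of the fix, with illustrative macro names:

	/* the E825-C integer feedback divisor is 10 bits, not 8 */
	#define TSPLL_CGU_R19_FBDIV_INTGR_E82X	GENMASK(7, 0)
	#define TSPLL_CGU_R19_FBDIV_INTGR_E825	GENMASK(9, 0)

	/* and the enable status lives in register 23 on E825-C */
	enabled = !!(r23 & TSPLL_CGU_R23_PLL_ENABLE);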
Co-developed-by: Karol Kolacinski <karol.kolacinski@intel.com>
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Rename TSPLL and CGU functions, definitions etc. to match the file name
and to have a consistent naming scheme.
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Reviewed-by: Milena Olech <milena.olech@intel.com>
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Collect TSPLL related functions and definitions and move them to
a separate file to have all TSPLL functionality in one place.
Move CGU related functions and definitions to ice_common.*
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Reviewed-by: Milena Olech <milena.olech@intel.com>
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Tony Nguyen says:
====================
libeth: add libeth_xdp helper lib
Alexander Lobakin says:
Time to add XDP helpers infra to libeth to greatly simplify adding
XDP to idpf and iavf, as well as improve and extend XDP in ice and
i40e. Any vendor is free to reuse helpers. If this happens, I'm fine
with moving the folder out of intel/.
The helpers greatly simplify building an xdp_buff, running a prog,
handling the verdict, and implementing XDP_TX, .ndo_xdp_xmit, and XDP
buffer completion. The same applies to XSk (with XSk xmit instead of
.ndo_xdp_xmit, plus stuff like XSk wakeup).
They are entirely generic with no HW definitions or assumptions.
HW-specific stuff like parsing Rx desc / filling Tx desc is passed
from the driver as inline callbacks.
For now, the key assumptions that optimize performance / avoid code
bloat, but might not fit every driver in drivers/net/:
* netmem holding the buffers is always order-0;
* the driver has separate XDP Tx queues and doesn't use stack queues
for that. For best efficiency, you may want to have nr_cpu_ids XDP
queues, but fewer (with queue sharing) is also supported;
* XDP Tx queues are interrupt-less and use "lazy" cleaning only
when fewer than 1/4 of the queue size in Tx descriptors are
free;
* the main target platforms are 64-bit; 32-bit is also fully
supported, but the code might not be as optimized for it.
The library code already supports multi-buffer for all kinds of Tx, and
both header split and no split for Rx and Tx. Frags can come from
devmem/io_uring etc.; a direct `struct page *` is used only for header
buffers, for which this always holds.
Drivers are free to pass their own Rx hints and XSK xmit hints ops.
XDP_TX and ndo_xdp_xmit use an onstack bulk for the frames to be sent
and send them in batches of 16 buffers. This eats ~280 bytes of
stack, but gives good boosts and allows greatly optimizing the main
sending function, leaving it without any error/exception paths.
XSk xmit fills Tx descriptors in a loop unrolled by 8. This was
proven to improve perf on ice and i40e. XDP_TX and ndo_xdp_xmit
don't use unrolling, as I wasn't able to get any improvements in
those scenarios from it, while +1 Kb to their sending functions
for nothing doesn't sound reasonable.
XSk wakeup, instead of the traditionally used "SW interrupts" provided
by NICs, uses an IPI to schedule NAPI on the CPU corresponding to the
given queue pair. It gives better control over CPU distribution and
in general performs way better than "SW interrupts", plus allows us
to not pass any HW-specific callbacks there.
The code is built in such a way that all callbacks passed from drivers
get inlined; in general, most of the hotpath gets inlined. Everything
slow/exceptional lands in .c files in the libeth folder, doesn't
create copies in the drivers themselves, and doesn't bloat the
hotpath.
Sure, inlining means that the hotpath will be compiled into every driver
that uses the lib, but the core code is written in one place, so no
copying of bugs happens. Fixed once -- works everywhere.
The last commit might look like sort of a hack, but it gives really good
boosts and decreases object code size, plus there are checks that
all those wider accesses are fully safe, so I don't feel anything
bad about it.
An example of using libeth_xdp can be found either on my GitHub or
on the mailing lists here ("XDP for idpf"). The macros for building
driver XDP functions mean that some implementations (XDP_TX,
ndo_xdp_xmit etc.) consist of really only a few lines.
* '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
libeth: xdp, xsk: access adjacent u32s as u64 where applicable
libeth: xsk: add XSkFQ refill and XSk wakeup helpers
libeth: xsk: add XSk Rx processing support
libeth: xsk: add XSk xmit functions
libeth: xsk: add XSk XDP_TX sending helpers
libeth: xdp: add RSS hash hint and XDP features setup helpers
libeth: xdp: add templates for building driver-side callbacks
libeth: xdp: add XDP prog run and verdict result handling
libeth: xdp: add helpers for preparing/processing &libeth_xdp_buff
libeth: xdp: add XDPSQ cleanup timers
libeth: xdp: add XDPSQ locking helpers
libeth: xdp: add XDPSQE completion helpers
libeth: xdp: add .ndo_xdp_xmit() helpers
libeth: xdp: add XDP_TX buffers sending
libeth: support native XDP and register memory model
libeth: convert to netmem
libeth, libie: clean symbol exports up a little
====================
Link: https://patch.msgid.link/20250616201639.710420-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
On some systems with Nahum 11 and Nahum 13 the value of the XTAL clock in
the software STRAP is incorrect. This causes the PTP timer to run at the
wrong rate and can lead to synchronization issues.
The STRAP value is configured by the system firmware, and a firmware
update is not always possible. Since the XTAL clock on these systems
always runs at 38.4 MHz, the driver may ignore the STRAP and just set
the correct value.
Fixes: cc23f4f0b6 ("e1000e: Add support for Meteor Lake")
Signed-off-by: Vitaly Lifshits <vitaly.lifshits@intel.com>
Tested-by: Mor Bar-Gabay <morx.bar.gabay@intel.com>
Reviewed-by: Gil Fine <gil.fine@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
This patch fixes an issue seen in large-scale deployments under heavy
incoming packet load, where aRFS wrongly matches a flow and reprograms the
NIC with the wrong settings. That mis-steering causes RX-path latency
spikes and noisy-neighbor effects when many connections collide on the
same hash (some of our production servers have 20-30K connections).
set_rps_cpu() calls ndo_rx_flow_steer() with a flow_id that is calculated
by hashing the skb and reducing it to the per-rx-queue table size. This
results in multiple connections (even across different rx-queues) getting
the same hash value. The driver's steering function then modifies the
wrong flow to use this rx-queue, e.g.: Flow#1 is added first:
Flow#1: <ip1, port1, ip2, port2>, Hash 'h', q#10
Later when a new flow needs to be added:
Flow#2: <ip3, port3, ip4, port4>, Hash 'h', q#20
The driver finds hash 'h' from Flow#1 and updates it to use q#20. This
results in both flows being un-optimized: packets for Flow#1 go to q#20
and are then reprogrammed back to q#10 later, and so on; Flow#2
programming is never done, as Flow#1 is matched first for all misses.
Many flows may wrongly share the same hash and reprogram the rules of the
original flow, each with their own q#.
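A hedged sketch of the corrected lookup (structure and field names are
illustrative, not the exact driver code):

	/* on a hash-bucket hit, reuse the entry only if the full
	 * 4-tuple also matches
	 */
	hlist_for_each_entry(e, &arfs_tbl[hash], list_entry) {
		if (e->src_ip == fk->addrs.v4addrs.src &&
		    e->dst_ip == fk->addrs.v4addrs.dst &&
		    e->src_port == fk->ports.src &&
		    e->dst_port == fk->ports.dst)
			return arfs_update_flow(e, rxq_idx);
	}

	return arfs_add_flow(fk, rxq_idx);	/* genuinely new flow */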
Tested on two 144-core servers with 16K netperf sessions for 180s. Netperf
clients are pinned to cores 0-71 sequentially (so that wrong packets on q#s
72-143 can be measured). IRQs are mapped 1:1 queues -> CPUs, XPS is
enabled, and aRFS is enabled (the global value is 144 * rps_flow_cnt).
Test notes about results from ice_rx_flow_steer():
---------------------------------------------------
1. "Skip:" counter increments here:
if (fltr_info->q_index == rxq_idx ||
arfs_entry->fltr_state != ICE_ARFS_ACTIVE)
goto out;
2. "Add:" counter increments here:
ret = arfs_entry->fltr_info.fltr_id;
INIT_HLIST_NODE(&arfs_entry->list_entry);
3. "Update:" counter increments here:
/* update the queue to forward to on an already existing flow */
Runtime comparison: original code vs with the patch for different
rps_flow_cnt values.
+-------------------------------+--------------+--------------+
| rps_flow_cnt | 512 | 2048 |
+-------------------------------+--------------+--------------+
| Ratio of Pkts on Good:Bad q's | 214 vs 822K | 1.1M vs 980K |
| Avoid wrong aRFS programming | 0 vs 310K | 0 vs 30K |
| CPU User | 216 vs 183 | 216 vs 206 |
| CPU System | 1441 vs 1171 | 1447 vs 1320 |
| CPU Softirq | 1245 vs 920 | 1238 vs 961 |
| CPU Total | 29 vs 22.7 | 29 vs 24.9 |
| aRFS Update | 533K vs 59 | 521K vs 32 |
| aRFS Skip | 82M vs 77M | 7.2M vs 4.5M |
+-------------------------------+--------------+--------------+
A separate TCP_STREAM and TCP_RR with 1,4,8,16,64,128,256,512 connections
showed no performance degradation.
Some points on the patch/aRFS behavior:
1. Enabling full-tuple matching ensures flows are always correctly matched,
even with smaller hash sizes.
2. 5-6% drop in CPU utilization, as packets arrive at the correct CPUs and
there are fewer calls to the driver for programming on misses.
3. Larger hash tables reduce mis-steering due to more unique flow hashes,
but clashes still occur. However, with a larger per-device rps_flow_cnt,
old flows take more time to expire and new aRFS flows cannot be added if
h/w limits are reached (rps_may_expire_flow() succeeds when 10*rps_flow_cnt
pkts have been processed by this cpu that are not part of the flow).
Fixes: 28bf26724f ("ice: Implement aRFS")
Signed-off-by: Krishna Kumar <krikku@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Migrate to new callbacks added by commit 9bb00786fc ("net: ethtool:
add dedicated callbacks for getting and setting rxfh fields").
I'm deleting all the boilerplate kdoc from the affected functions.
It is somewhere between pointless and incorrect, just a burden for
people refactoring the code.
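The shape of the migration (the driver callback names are illustrative;
the ops themselves come from the referenced commit):

	static const struct ethtool_ops foo_ethtool_ops = {
		/* the ETHTOOL_GRXFH/ETHTOOL_SRXFH cases move out of
		 * .get_rxnfc/.set_rxnfc into dedicated callbacks
		 */
		.get_rxfh_fields	= foo_get_rxfh_fields,
		.set_rxfh_fields	= foo_set_rxfh_fields,
	};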
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Joe Damato <joe@dama.to>
Reviewed-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20250614180907.4167714-8-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Migrate to new callbacks added by commit 9bb00786fc ("net: ethtool:
add dedicated callbacks for getting and setting rxfh fields").
I'm deleting all the boilerplate kdoc from the affected functions.
It is somewhere between pointless and incorrect, just a burden for
people refactoring the code.
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Joe Damato <joe@dama.to>
Reviewed-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20250614180907.4167714-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Migrate to new callbacks added by commit 9bb00786fc ("net: ethtool:
add dedicated callbacks for getting and setting rxfh fields").
I'm deleting all the boilerplate kdoc from the affected functions.
It is somewhere between pointless and incorrect, just a burden for
people refactoring the code.
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Joe Damato <joe@dama.to>
Reviewed-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20250614180907.4167714-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Migrate to new callbacks added by commit 9bb00786fc ("net: ethtool:
add dedicated callbacks for getting and setting rxfh fields").
The .get callback moves out of the switch, and set_rxnfc disappears,
as ETHTOOL_SRXFH was its only functionality.
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Joe Damato <joe@dama.to>
Reviewed-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20250614180907.4167714-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Migrate to new callbacks added by commit 9bb00786fc ("net: ethtool:
add dedicated callbacks for getting and setting rxfh fields").
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Joe Damato <joe@dama.to>
Reviewed-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20250614180907.4167714-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Migrate to new callbacks added by commit 9bb00786fc ("net: ethtool:
add dedicated callbacks for getting and setting rxfh fields").
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Joe Damato <joe@dama.to>
Reviewed-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20250614180907.4167714-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Migrate to new callbacks added by commit 9bb00786fc ("net: ethtool:
add dedicated callbacks for getting and setting rxfh fields").
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Joe Damato <joe@dama.to>
Reviewed-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20250614180907.4167714-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Migrate to new callbacks added by commit 9bb00786fc ("net: ethtool:
add dedicated callbacks for getting and setting rxfh fields").
This driver's RXFH config is read only / fixed and it's the only
get_rxnfc sub-command the driver supports. So convert the get_rxnfc
handler into a get_rxfh_fields handler.
Reviewed-by: Joe Damato <joe@dama.to>
Reviewed-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250614180638.4166766-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
XSkFQ refill is pretty generic across the drivers minus FQ descriptor
filling and can easily be unified with one inline callback.
XSk wakeup is usually not, but here, instead of the commonly used
"SW interrupts", I picked firing an IPI. In most tests, it showed better
performance; it also provides better control for userspace on which CPU
will handle the xmit, as SW interrupts honor IRQ affinity no matter
which core produces XSk xmit descs (while XDPSQs are associated 1:1
with cores having the same ID).
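A sketch of the wakeup path (the qid-to-NAPI lookup is illustrative):

	static void my_xsk_napi_kick(void *napi)
	{
		napi_schedule(napi);
	}

	static int my_xsk_wakeup(struct net_device *dev, u32 qid, u32 flags)
	{
		struct napi_struct *napi = my_qid_to_napi(dev, qid);

		/* XDPSQ <-> CPU mapping is 1:1 by ID, see above */
		smp_call_function_single(qid, my_xsk_napi_kick, napi, false);

		return 0;
	}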
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Add XSk counterparts for preparing an XSk &libeth_xdp_buff (adding head
and frags), running the program, and handling the verdict, incl. XDP_PASS.
Shortcuts in comparison with regular Rx: frags and all verdicts except
XDP_REDIRECT are under unlikely() and out of line; no checks for XDP
program presence as it's always true for XSk.
Suggested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> # optimizations
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Reuse the core sending functions to send XSk xmit frames.
Both metadata and no-metadata pools/drivers are supported. libeth_xdp
also provides generic XSk metadata ops, currently with the checksum
offload only and for cases when HW doesn't require supplying L3/L4
checksum offsets. Drivers are free to pass their own ops.
&libeth_xdp_tx_bulk is not used here as it would be redundant;
pool->tx_descs are accessed directly.
Fake "libeth_xsktmo" is needed to hide implementation details from the
drivers when they want to use the generic ops: the original struct is
defined in the same file where dev->xsk_tx_metadata_ops gets set to
avoid duplication of the slowpath; at the same time, the XSk xmit
functions use a local "fast" copy to inline the XMO callbacks.
Tx descriptor filling loop is unrolled by 8.
Suggested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> # optimizations
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Add XSk counterparts for XDP_TX buffer sending and completion.
The same base structures and functions from the libeth_xdp core are
used, with the adjustment that XSk Rx always operates on &xdp_buff_xsk
for both head and frags. Unlike regular Rx, unlikely() is used here
for frags, as header split gives no benefits for XSk Rx, at
least for now.
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
End the XDP section by adding helpers to set up XDP features, flip
.ndo_xdp_xmit() support at runtime (in case it's not always on),
and calculate the queue clean/refill threshold.
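For the runtime flip, the core already provides the knobs (sketch;
`have_xdpsqs` is an illustrative condition):

	/* advertise .ndo_xdp_xmit() only while XDPSQs actually exist */
	if (have_xdpsqs)
		xdp_features_set_redirect_target(dev, false);
	else
		xdp_features_clear_redirect_target(dev);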
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Running a prog and handling the verdicts, up to napi_gro_receive(),
is also pretty generic code that doesn't really differ between vendors
(except for Tx descriptor filling and Rx descriptor parsing).
Define a couple of inlines to do that. The inline callbacks a driver
needs to pass are mentioned above: Tx descriptor filling for XDP_TX,
populating an skb with the descriptor data for XDP_PASS, and finalizing
XDPSQs after the polling loop for XDP_TX (kicking the HW to start
sending).
The populate callback is passed only a &libeth_xdp_buff, assuming the
buff::desc pointer is enough; plus, you can always get the corresponding
Rx queue structure via container_of(buff::rxq). If not, a driver can extend
the buff with more fields directly on the stack without touching
libeth_xdp definitions.
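The flow those inlines wrap, sketched with illustrative driver
callbacks (error handling omitted):

	act = bpf_prog_run_xdp(prog, &xdp->base);
	switch (act) {
	case XDP_PASS:
		skb = drv_populate_skb(rxq, xdp);	/* driver callback */
		napi_gro_receive(napi, skb);
		break;
	case XDP_TX:
		drv_queue_xdp_tx(bulk, xdp);		/* driver callback */
		break;
	case XDP_REDIRECT:
		xdp_do_redirect(dev, &xdp->base, prog);
		break;
	default:
		bpf_warn_invalid_xdp_action(dev, prog, act);
		fallthrough;
	case XDP_ABORTED:
		trace_xdp_exception(dev, prog, act);
		fallthrough;
	case XDP_DROP:
		xdp_return_buff(&xdp->base);
		break;
	}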
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Add convenience helpers to build an &xdp_buff. This means: general
initialization before the NAPI loop, adding head, adding frags etc.
libeth_xdp_process_buff() is the same as what everybody has in their
drivers:
	dma_sync_for_cpu();

	if (!frag) {
		add_head();
		prefetch();
	} else {
		add_frag();
	}
Note that I don't use net_prefetch(), sticking to the original
prefetch(). In none of my tests did prefetching 128 bytes yield better
perf than 64 bytes. That might differ if the headers are huge enough,
but then additional tunneling etc. overhead takes place, so you won't
win a lot either way.
&libeth_xdp_stash is for cases when you exit the polling loop without
finishing building the buff. If that happens, you need to store the
buffer in the queue structure until the next loop and then restore it.
It makes no sense to place a whole &xdp_buff there. Define a
minimal structure which stores only the fields essential to
restore it.
I was able to pack it into 16 bytes, which is only 8 bytes bigger
than `struct sk_buff *skb` on x86_64.
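A stash along these lines (the exact field split is a sketch):

	struct libeth_xdp_buff_stash {
		void	*data;
		u16	headroom;
		u16	len;
		u32	frame_sz:24;
		u32	flags:8;
	};	/* 8 + 2 + 2 + 4 = 16 bytes on x86_64 */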
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
When XDP Tx queues are not interrupt-driven but use lazy cleaning,
i.e. clean only when there are fewer than `threshold` free descriptors
left, we also need cleanup timers to avoid &xdp_buff and &xdp_frame
stalling for too long, especially with Page Pool (it warns about
inflight pages every 60 seconds).
Let's say we sent 256 frames and don't need to send more, but we clean
only when the number of pending items >= 384. In that case, those 256
will stall until 128 more are sent. For this, add simple helpers to
run a timer which will clean the queue regardless, 1 second after
the last send.
The timer is triggered when finalizing the queue. As long as there is
regular active traffic, the timer doesn't fire.
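A sketch of the helpers' idea (the queue structure and cleaning
callback are illustrative):

	static void my_xdpsq_clean_timer(struct timer_list *t)
	{
		struct my_xdpsq *sq = from_timer(sq, t, timer);

		my_clean_xdpsq(sq, sq->pending);
	}

	/* in the XDPSQ finalize path: re-arm, so it only fires on idle */
	mod_timer(&sq->timer, jiffies + HZ);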
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Unfortunately, it's not always possible to allocate
max(num_rxqs, nr_cpu_ids) queues even on high-end NICs.
To mitigate this, add simple locking helpers to libeth_xdp.
As long as XDPSQs are not shared, the whole functionality is gated
behind a static key. Otherwise, each bulk flush locks the queue for
the time of cleaning and filling the descriptors.
As long as this particular queue is not used by more than 1 CPU,
the impact is minimal (a runtime boolean check twice per 16+
descriptors).
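A sketch of the gating (names are illustrative):

	DEFINE_STATIC_KEY_FALSE(my_xdpsq_share);

	static void my_xdpsq_lock(struct my_xdpsq *sq)
	{
		/* compiles to a no-op until a shared queue is created */
		if (static_branch_unlikely(&my_xdpsq_share) && sq->share)
			spin_lock(&sq->lock);
	}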
Suggested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> # static key
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Similarly to libeth_tx_complete(), add libeth_xdp_complete_tx() to
handle XDP_TX and xmit buffers. Both use bulk return under the hood.
Also add an out-of-line libeth_tx_complete_any() which handles both
regular and XDP frames (if libeth_xdp is loaded), for example
to call on queue destroy, where we don't need inlining, just
convenience.
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Add helpers for implementing .ndo_xdp_xmit().
Same as for XDP_TX, accumulate up to 16 DMA-mapped frames on the stack,
then flush. If DMA mapping fails for some reason, don't try mapping
further frames, but still flush what was already prepared.
The DMA address of a head frame is stored in its headroom, assuming it
has enough room for an 8 (or 4) byte value.
In addition to @prep and @xmit driver callbacks in XDP_TX, xmit also
needs @finalize to kick the XDPSQ after filling.
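A sketch of the mapping loop (the bulk handling helpers are
illustrative):

	dma = dma_map_single(dev, xdpf->data, xdpf->len, DMA_TO_DEVICE);
	if (dma_mapping_error(dev, dma))
		break;		/* stop mapping, flush what we have */

	/* stash the address in the headroom (assumes >= 8 bytes free) */
	*(dma_addr_t *)(xdpf->data - sizeof(dma_addr_t)) = dma;

	bulk[cnt++] = xdpf;
	if (cnt == 16)
		my_flush_bulk(bulk, &cnt);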
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Start adding XDP-specific code to libeth, namely handling XDP_TX buffers
(only sending).
The idea is that we accumulate up to 16 buffers on the stack, then,
if either the limit is reached or the polling is finished, flush them
at once with only one XDPSQ cleaning (if needed). The main sending
function will be aware of the sending budget and already have all the
info to send the buffers, so it can't fail.
Drivers need to provide 2 inline callbacks to the main sending function:
for cleaning an XDPSQ and for filling descriptors; the library code
takes care of the rest.
Note that unlike in the generic code, multi-buffer support is not wrapped
here in unlikely(), so as not to hurt header split setups.
&libeth_xdp_buff is a simple extension over &xdp_buff which has a direct
pointer to the corresponding Rx descriptor (and, luckily, is precisely one
cacheline in size with 16-byte alignment on x86_64).
Suggested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> # xmit logic
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Expand libeth's Page Pool functionality by adding native XDP support.
This means picking the appropriate headroom and DMA direction.
Also, register all the created &page_pools as XDP memory models.
A driver can then call xdp_rxq_info_attach_page_pool() when registering
its RxQ info.
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>