ice devlink support

This document describes the devlink features implemented by the ice device driver.

Parameters

Generic parameters implemented

Name: enable_roce
Mode: runtime
Notes: mutually exclusive with enable_iwarp

Name: enable_iwarp
Mode: runtime
Notes: mutually exclusive with enable_roce

Name: tx_scheduling_layers
Mode: permanent
Notes:

The ice hardware uses hierarchical scheduling for Tx with a fixed number of layers in the scheduling tree. Each layer is a decision point. The root node represents a port, while the leaves represent the queues. Configuring the Tx scheduler this way allows features like DCB or devlink-rate (documented below) to control how much bandwidth is given to any queue or group of queues, enabling fine-grained control because scheduling parameters can be configured at any layer of the tree.

The default 9-layer tree topology was deemed best for most workloads, as it gives an optimal ratio of performance to configurability. However, for some specific cases, this 9-layer topology might not be desired. One example is sending traffic to a number of queues that is not a multiple of 8. Because the maximum radix is limited to 8 in the 9-layer topology, the 9th queue has a different parent than the rest, and it is given more bandwidth credits. This causes a problem when the system is sending traffic to 9 queues:

tx_queue_0_packets: 24163396
tx_queue_1_packets: 24164623
tx_queue_2_packets: 24163188
tx_queue_3_packets: 24163701
tx_queue_4_packets: 24163683
tx_queue_5_packets: 24164668
tx_queue_6_packets: 23327200
tx_queue_7_packets: 24163853
tx_queue_8_packets: 91101417 < Too much traffic is sent from 9th

To address this need, you can switch to a 5-layer topology, which changes the maximum topology radix to 512. With this change, bandwidth is distributed equally, as all queues can be assigned to the same parent in the tree. The obvious drawback of this solution is a lower configuration depth of the tree.

Use the tx_scheduling_layers parameter with the devlink command to change the transmit scheduler topology. To use the 5-layer topology, use a value of 5. For example:

$ devlink dev param set pci/0000:16:00.0 name tx_scheduling_layers value 5 cmode permanent

Use a value of 9 to set it back to the default value.

You must power cycle the PCI slot for the selected topology to take effect.

To verify that the value has been set:

$ devlink dev param show pci/0000:16:00.0 name tx_scheduling_layers
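The radix argument above can be illustrated with a small sketch. It only models how many queues end up under each parent for a given maximum radix, and it computes the share of traffic taken by the 9th queue from the counters listed earlier. The helper names `group_sizes` and `share_of` are hypothetical, and filling parents one after another is an assumption, not a description of how the firmware actually places queues.

```python
import math

# Per-queue Tx packet counters copied from the 9-queue example above.
TX_PACKETS = [
    24163396, 24164623, 24163188, 24163701, 24163683,
    24164668, 23327200, 24163853, 91101417,
]


def group_sizes(num_queues: int, radix: int) -> list[int]:
    """Split num_queues leaves among parents holding at most `radix` children.

    Assumption: parents are filled one after another. Real placement is
    decided by the driver and firmware and may differ.
    """
    if num_queues < 0 or radix < 1:
        raise ValueError("num_queues must be >= 0 and radix must be >= 1")
    parents = math.ceil(num_queues / radix)
    return [min(radix, num_queues - i * radix) for i in range(parents)]


def share_of(counters: list[int], index: int) -> float:
    """Fraction of all transmitted packets that came from one queue."""
    return counters[index] / sum(counters)


if __name__ == "__main__":
    print(group_sizes(9, 8))    # 9-layer topology, maximum radix 8
    print(group_sizes(9, 512))  # 5-layer topology, maximum radix 512
    print(f"{share_of(TX_PACKETS, 8):.1%}")  # share taken by the 9th queue
```

With radix 8 the ninth queue sits alone under a second parent, which matches the imbalance in the counters; with radix 512 all nine queues share one parent.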

Name: msix_vec_per_pf_max
Mode: driverinit
Notes: Sets the maximum number of MSI-X vectors that can be used by the PF; the rest can be used for SR-IOV. The range is from the minimum value set in msix_vec_per_pf_min to 2k/number of ports.

Name: msix_vec_per_pf_min
Mode: driverinit
Notes: Sets the minimum number of MSI-X vectors that will be used by the PF. This value determines how many MSI-X vectors are allocated statically. The range is from 2 to the value set in msix_vec_per_pf_max.
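The two MSI-X ranges depend on each other, which a short check can make concrete. This is a minimal sketch of the ranges exactly as worded above, not the driver's validation code. The function names are hypothetical, and reading "2k" as 2048 vectors is an assumption exposed as the `total_vectors` argument.

```python
def msix_ranges(pf_min: int, pf_max: int, num_ports: int,
                total_vectors: int = 2048) -> dict[str, range]:
    """Allowed values for both parameters, given their current settings.

    total_vectors=2048 is an assumed reading of "2k" in the text above.
    """
    if num_ports < 1:
        raise ValueError("num_ports must be at least 1")
    upper = total_vectors // num_ports
    return {
        # min may go from 2 up to the configured max
        "msix_vec_per_pf_min": range(2, pf_max + 1),
        # max may go from the configured min up to 2k / number of ports
        "msix_vec_per_pf_max": range(pf_min, upper + 1),
    }


def is_valid(pf_min: int, pf_max: int, num_ports: int,
             total_vectors: int = 2048) -> bool:
    """True when both values fall inside the ranges described above."""
    allowed = msix_ranges(pf_min, pf_max, num_ports, total_vectors)
    return (pf_min in allowed["msix_vec_per_pf_min"]
            and pf_max in allowed["msix_vec_per_pf_max"])


if __name__ == "__main__":
    print(is_valid(8, 64, num_ports=4))    # inside both ranges
    print(is_valid(8, 1024, num_ports=4))  # max above 2048 / 4
```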

Driver specific parameters implemented

Name: local_forwarding
Mode: runtime
Description: Controls loopback behavior by tuning scheduler bandwidth. It impacts all kinds of functions: physical, virtual and subfunctions. Supported values are:

enabled - loopback traffic is allowed on this port

disabled - loopback traffic is not allowed on this port

prioritized - loopback traffic is prioritized on this port

The default value of the local_forwarding parameter is enabled. The prioritized value makes it possible to adjust the loopback traffic rate to increase the capacity of one port at the cost of another. The user needs to disable local forwarding on one of the ports in order to have increased capacity on the prioritized port.

Info versions

The ice driver reports the following versions:

devlink info versions implemented

Name: board.id
Type: fixed
Example: K65390-000
Description: The Product Board Assembly (PBA) identifier of the board.

Name: cgu.id
Type: fixed
Example: 36
Description: The Clock Generation Unit (CGU) hardware revision identifier.

Name: fw.mgmt
Type: running
Example: 2.1.7
Description: 3-digit version number of the management firmware running on the Embedded Management Processor of the device. It controls the PHY, link, access to device resources, etc. Intel documentation refers to this as the EMP firmware.

Name: fw.mgmt.api
Type: running
Example: 1.5.1
Description: 3-digit version number (major.minor.patch) of the API exported over the AdminQ by the management firmware. Used by the driver to identify what commands are supported. Historical versions of the kernel only displayed a 2-digit version number (major.minor).

Name: fw.mgmt.build
Type: running
Example: 0x305d955f
Description: Unique identifier of the source for the management firmware.

Name: fw.undi
Type: running
Example: 1.2581.0
Description: Version of the Option ROM containing the UEFI driver. The version is reported in major.minor.patch format. The major version is incremented whenever a major breaking change occurs, or when the minor version would overflow. The minor version is incremented for non-breaking changes and reset to 1 when the major version is incremented. The patch version is normally 0 but is incremented when a fix is delivered as a patch against an older base Option ROM.

Name: fw.psid.api
Type: running
Example: 0.80
Description: Version defining the format of the flash contents.

Name: fw.bundle_id
Type: running
Example: 0x80002ec0
Description: Unique identifier of the firmware image file that was loaded onto the device. Also referred to as the EETRACK identifier of the NVM.

Name: fw.app.name
Type: running
Example: ICE OS Default Package
Description: The name of the DDP package that is active in the device. The DDP package is loaded by the driver during initialization. Each variation of the DDP package has a unique name.

Name: fw.app
Type: running
Example: 1.3.1.0
Description: The version of the DDP package that is active in the device. Note that both the name (as reported by fw.app.name) and version are required to uniquely identify the package.

Name: fw.app.bundle_id
Type: running
Example: 0xc0000001
Description: Unique identifier for the DDP package loaded in the device. Also referred to as the DDP Track ID. Can be used to uniquely identify the specific DDP package.

Name: fw.netlist
Type: running
Example: 1.1.2000-6.7.0
Description: The version of the netlist module. This module defines the device’s Ethernet capabilities and default settings, and is used by the management firmware as part of managing link and device connectivity.

Name: fw.netlist.build
Type: running
Example: 0xee16ced7
Description: The first 4 bytes of the hash of the netlist module contents.

Name: fw.cgu
Type: running
Example: 8032.16973825.6021
Description: The version of the Clock Generation Unit (CGU). Format: <CGU type>.<configuration version>.<firmware version>.
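Two of the version strings above have a documented field structure: fw.mgmt.api (major.minor.patch, or major.minor on older kernels) and fw.cgu (CGU type, configuration version, firmware version). The sketch below splits them into named fields. The class and function names are hypothetical, and treating every field as a decimal integer is an assumption based only on the example values shown.

```python
from typing import NamedTuple, Optional


class MgmtApiVersion(NamedTuple):
    major: int
    minor: int
    patch: Optional[int]  # None when only major.minor was reported


class CguVersion(NamedTuple):
    cgu_type: int
    config_version: int
    firmware_version: int


def parse_mgmt_api(text: str) -> MgmtApiVersion:
    """Parse fw.mgmt.api in either the 3-digit or the historical 2-digit form."""
    parts = [int(p) for p in text.strip().split(".")]
    if len(parts) == 2:
        return MgmtApiVersion(parts[0], parts[1], None)
    if len(parts) == 3:
        return MgmtApiVersion(parts[0], parts[1], parts[2])
    raise ValueError(f"unexpected fw.mgmt.api format: {text!r}")


def parse_cgu(text: str) -> CguVersion:
    """Parse fw.cgu as <CGU type>.<configuration version>.<firmware version>."""
    parts = [int(p) for p in text.strip().split(".")]
    if len(parts) != 3:
        raise ValueError(f"unexpected fw.cgu format: {text!r}")
    return CguVersion(parts[0], parts[1], parts[2])


if __name__ == "__main__":
    print(parse_mgmt_api("1.5.1"))
    print(parse_mgmt_api("1.5"))
    print(parse_cgu("8032.16973825.6021"))
```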

Flash Update

The ice driver implements support for flash update using the devlink-flash interface. It supports updating the device flash using a combined flash image that contains the fw.mgmt, fw.undi, and fw.netlist components.

List of supported overwrite modes

Bits: DEVLINK_FLASH_OVERWRITE_SETTINGS
Behavior: Do not preserve settings stored in the flash components being updated. This includes overwriting the port configuration that determines the number of physical functions the device will initialize with.

Bits: DEVLINK_FLASH_OVERWRITE_SETTINGS and DEVLINK_FLASH_OVERWRITE_IDENTIFIERS
Behavior: Do not preserve either settings or identifiers. Overwrite everything in the flash with the contents from the provided image, without performing any preservation. This includes overwriting device identifying fields such as the MAC address, VPD area, and device serial number. It is expected that this combination be used with an image customized for the specific device.

The ice hardware does not support overwriting only identifiers while preserving settings, and thus DEVLINK_FLASH_OVERWRITE_IDENTIFIERS on its own will be rejected. If no overwrite mask is provided, the firmware will be instructed to preserve all settings and identifying fields when updating.
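The accepted and rejected overwrite masks can be summarized as a small decision function. This is a sketch of the rules stated above, not the driver's code. The `Overwrite` enum and `ice_overwrite_behavior` are hypothetical names, and the bit positions are assumed to mirror the devlink UAPI header; check them against your kernel headers before relying on them.

```python
from enum import IntFlag


class Overwrite(IntFlag):
    # Assumed bit positions, modelled on the devlink UAPI definitions.
    SETTINGS = 1 << 0
    IDENTIFIERS = 1 << 1


def ice_overwrite_behavior(mask: Overwrite) -> str:
    """Map an overwrite mask to the behavior described for the ice driver."""
    if mask == Overwrite(0):
        # No mask: firmware preserves all settings and identifying fields.
        return "preserve settings and identifiers"
    if mask == Overwrite.SETTINGS:
        return "overwrite settings, preserve identifiers"
    if mask == (Overwrite.SETTINGS | Overwrite.IDENTIFIERS):
        return "overwrite settings and identifiers"
    # Identifiers alone (or any unknown bit) is rejected.
    raise ValueError(f"overwrite mask not supported by ice: {mask!r}")


if __name__ == "__main__":
    print(ice_overwrite_behavior(Overwrite(0)))
    print(ice_overwrite_behavior(Overwrite.SETTINGS | Overwrite.IDENTIFIERS))
```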

Reload

The ice driver supports activating new firmware after a flash update using DEVLINK_CMD_RELOAD with the DEVLINK_RELOAD_ACTION_FW_ACTIVATE action.

$ devlink dev reload pci/0000:01:00.0 action fw_activate

The new firmware is activated by issuing a device specific Embedded Management Processor reset which requests the device to reset and reload the EMP firmware image.

The driver does not currently support reloading the driver via DEVLINK_RELOAD_ACTION_DRIVER_REINIT.

Port split

The ice driver supports port splitting only for port 0, as the FW has a predefined set of available port split options for the whole device.

A system reboot is required for port split to be applied.

The following command will select the port split option with 4 ports:

$ devlink port split pci/0000:16:00.0/0 count 4

The list of all available port options will be printed to dynamic debug after each split and unsplit command. The first option is the default.

ice 0000:16:00.0: Available port split options and max port speeds (Gbps):
ice 0000:16:00.0: Status  Split      Quad 0          Quad 1
ice 0000:16:00.0:         count  L0  L1  L2  L3  L4  L5  L6  L7
ice 0000:16:00.0: Active  2     100   -   -   - 100   -   -   -
ice 0000:16:00.0:         2      50   -  50   -   -   -   -   -
ice 0000:16:00.0: Pending 4      25  25  25  25   -   -   -   -
ice 0000:16:00.0:         4      25  25   -   -  25  25   -   -
ice 0000:16:00.0:         8      10  10  10  10  10  10  10  10
ice 0000:16:00.0:         1     100   -   -   -   -   -   -   -

There could be multiple FW port options with the same port split count. When the same port split count request is issued again, the next FW port option with the same port split count will be selected.

devlink port unsplit will select the option with a split count of 1. If there is no FW option available with split count 1, you will receive an error.
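The rule that a repeated request moves on to the next option with the same split count can be sketched as a circular search. The function name `next_port_option` is hypothetical. Starting the search just after the current (active or pending) option and wrapping around the end of the list are assumptions about ordering that the text above does not spell out.

```python
from typing import Optional, Sequence

# Split counts of the six options in the example log above, in listed order.
EXAMPLE_OPTIONS = [2, 2, 4, 4, 8, 1]


def next_port_option(split_counts: Sequence[int], current: int,
                     requested: int) -> Optional[int]:
    """Return the index of the next option whose split count matches.

    The search starts after `current` and wraps around, so repeating the
    same request cycles through every option with that count. Returns None
    when no option has the requested count.
    """
    total = len(split_counts)
    for step in range(1, total + 1):
        candidate = (current + step) % total
        if split_counts[candidate] == requested:
            return candidate
    return None


if __name__ == "__main__":
    # Option 2 is pending in the log; asking for 4 ports again moves to option 3.
    print(next_port_option(EXAMPLE_OPTIONS, current=2, requested=4))
    # Unsplit corresponds to asking for a split count of 1.
    print(next_port_option(EXAMPLE_OPTIONS, current=0, requested=1))
```

A `None` result corresponds to the error case described for unsplit when no option with split count 1 exists.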

Regions

The ice driver implements the following regions for accessing internal device data.

regions implemented

Name: nvm-flash
Description: The contents of the entire flash chip, sometimes referred to as the device’s Non Volatile Memory.

Name: shadow-ram
Description: The contents of the Shadow RAM, which is loaded from the beginning of the flash. Although the contents are primarily from the flash, this area also contains data generated during device boot which is not stored in flash.

Name: device-caps
Description: The contents of the device firmware’s capabilities buffer. Useful to determine the current state and configuration of the device.

Both the nvm-flash and shadow-ram regions can be accessed without a snapshot. The device-caps region requires a snapshot as the contents are sent by firmware and can’t be split into separate reads.

Users can request an immediate capture of a snapshot for all three regions via the DEVLINK_CMD_REGION_NEW command.

$ devlink region show
pci/0000:01:00.0/nvm-flash: size 10485760 snapshot [] max 1
pci/0000:01:00.0/device-caps: size 4096 snapshot [] max 10

$ devlink region new pci/0000:01:00.0/nvm-flash snapshot 1
$ devlink region dump pci/0000:01:00.0/nvm-flash snapshot 1
0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30
0000000000000010 0000 0000 ffff ff04 0029 8c00 0028 8cc8
0000000000000020 0016 0bb8 0016 1720 0000 0000 c00f 3ffc
0000000000000030 bada cce5 bada cce5 bada cce5 bada cce5

$ devlink region read pci/0000:01:00.0/nvm-flash snapshot 1 address 0 length 16
0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30

$ devlink region delete pci/0000:01:00.0/nvm-flash snapshot 1

$ devlink region new pci/0000:01:00.0/device-caps snapshot 1
$ devlink region dump pci/0000:01:00.0/device-caps snapshot 1
0000000000000000 01 00 01 00 00 00 00 00 01 00 00 00 00 00 00 00
0000000000000010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0000000000000020 02 00 02 01 32 03 00 00 0a 00 00 00 25 00 00 00
0000000000000030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0000000000000040 04 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
0000000000000050 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0000000000000060 05 00 01 00 03 00 00 00 00 00 00 00 00 00 00 00
0000000000000070 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0000000000000080 06 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
0000000000000090 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000000000000a0 08 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00
00000000000000b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000000000000c0 12 00 01 00 01 00 00 00 01 00 01 00 00 00 00 00
00000000000000d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000000000000e0 13 00 01 00 00 01 00 00 00 00 00 00 00 00 00 00
00000000000000f0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0000000000000100 14 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
0000000000000110 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0000000000000120 15 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
0000000000000130 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0000000000000140 16 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
0000000000000150 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0000000000000160 17 00 01 00 06 00 00 00 00 00 00 00 00 00 00 00
0000000000000170 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0000000000000180 18 00 01 00 01 00 00 00 01 00 00 00 08 00 00 00
0000000000000190 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000000000001a0 22 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
00000000000001b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000000000001c0 40 00 01 00 00 08 00 00 08 00 00 00 00 00 00 00
00000000000001d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000000000001e0 41 00 01 00 00 08 00 00 00 00 00 00 00 00 00 00
00000000000001f0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0000000000000200 42 00 01 00 00 08 00 00 00 00 00 00 00 00 00 00
0000000000000210 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

$ devlink region delete pci/0000:01:00.0/device-caps snapshot 1
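The device-caps dump above repeats in 32-byte units, which suggests it can be decoded record by record. The sketch below converts the hex lines back into bytes and unpacks each unit. The field layout (a little-endian 16-bit ID, two version bytes, three 32-bit values, 16 unused bytes) and all field names are assumptions inferred from the pattern in the dump, not something this document defines.

```python
import struct
from typing import Iterator, NamedTuple

# Assumed layout of one record: <u16 id, u8, u8, u32, u32, u32, 16 pad bytes>.
RECORD = struct.Struct("<HBBIII16x")


class CapRecord(NamedTuple):
    cap_id: int
    major: int
    minor: int
    number: int
    logical_id: int
    phys_id: int


def dump_to_bytes(dump_text: str) -> bytes:
    """Turn byte-wise `devlink region dump` output lines into raw bytes."""
    out = bytearray()
    for line in dump_text.splitlines():
        fields = line.split()
        # Skip blank lines and shell prompt lines such as "$ devlink ...".
        if len(fields) < 2 or fields[0].startswith("$"):
            continue
        # fields[0] is the offset column; the rest are hex byte values.
        out += bytes.fromhex("".join(fields[1:]))
    return bytes(out)


def iter_records(data: bytes) -> Iterator[CapRecord]:
    """Yield one CapRecord per complete 32-byte unit in the buffer."""
    for offset in range(0, len(data) - RECORD.size + 1, RECORD.size):
        yield CapRecord(*RECORD.unpack_from(data, offset))


if __name__ == "__main__":
    sample = (
        "0000000000000020 02 00 02 01 32 03 00 00 0a 00 00 00 25 00 00 00\n"
        "0000000000000030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00\n"
    )
    for record in iter_records(dump_to_bytes(sample)):
        print(record)
```

The same byte conversion would also accept the word-grouped nvm-flash output, but the byte order inside each 16-bit group is not specified here, so the sketch is only meant for the byte-wise device-caps form.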

Devlink Rate

The ice driver implements the devlink-rate API. It allows offloading Hierarchical QoS to the hardware. It enables the user to group Virtual Functions in a tree structure and assign the supported parameters tx_share, tx_max, tx_priority and tx_weight to each node in the tree. Effectively, the user gains the ability to control how much bandwidth is allocated to each VF group. This is then enforced by the HW.

It is assumed that this feature is mutually exclusive with DCB performed in FW and with ADQ, or with any driver feature that would trigger changes in QoS, for example the creation of a new traffic class. The driver will prevent DCB or ADQ configuration if the user has started making any changes to the nodes using the devlink-rate API. To configure those features, a driver reload is necessary. Correspondingly, if ADQ or DCB is configured, the driver won’t export the hierarchy at all, or will remove the untouched hierarchy if those features are enabled after the hierarchy is exported but before any changes are made.

This feature is also dependent on switchdev being enabled in the system. It’s required because devlink-rate requires devlink-port objects to be present, and those objects are only created in switchdev mode.

If the driver is set to switchdev mode, it will export the internal hierarchy the moment VFs are created. The root of the tree is always represented by node_0. This node can’t be deleted by the user. Leaf nodes and nodes with children also can’t be deleted.

Attributes supported

Name: tx_max
Description: Maximum bandwidth to be consumed by the tree node. The rate limit is an absolute number specifying the maximum number of bytes a node may consume during the course of one second. The rate limit guarantees that a link will not oversaturate the receiver on the remote end and also enforces an SLA between the subscriber and the network provider.

Name: tx_share
Description: Minimum bandwidth allocated to a tree node when it is not blocked. It specifies an absolute BW. While tx_max defines the maximum bandwidth the node may consume, tx_share marks the committed BW for the node.

Name: tx_priority
Description: Allows the use of a strict priority arbiter among siblings. This arbitration scheme attempts to schedule nodes based on their priority as long as the nodes remain within their bandwidth limit. Range 0-7. Nodes with priority 7 have the highest priority and are selected first, while nodes with priority 0 have the lowest priority. Nodes that have the same priority are treated equally.

Name: tx_weight
Description: Allows the use of the Weighted Fair Queuing arbitration scheme among siblings. This arbitration scheme can be used simultaneously with strict priority. Range 1-200. Only relative values matter for arbitration.

tx_priority and tx_weight can be used simultaneously. In that case nodes with the same priority form a WFQ subgroup in the sibling group and arbitration among them is based on assigned weights.
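How priority and weight combine can be shown with an idealized model: the highest-priority siblings are served first, and bandwidth among them is split in proportion to their weights. This is a sketch under strong assumptions (every node always has traffic queued, and tx_max and tx_share limits are ignored), so it is not a description of the hardware arbiter. The function name `sibling_shares` is hypothetical.

```python
from typing import Dict, Tuple


def sibling_shares(nodes: Dict[str, Tuple[int, int]]) -> Dict[str, float]:
    """Idealized bandwidth fractions for siblings under one parent.

    `nodes` maps a node name to (tx_priority, tx_weight). Only the nodes in
    the highest priority class receive bandwidth in this model, split by
    relative weight (the WFQ subgroup described above).
    """
    for name, (priority, weight) in nodes.items():
        if not 0 <= priority <= 7:
            raise ValueError(f"{name}: tx_priority must be in 0-7")
        if not 1 <= weight <= 200:
            raise ValueError(f"{name}: tx_weight must be in 1-200")
    if not nodes:
        return {}
    top = max(priority for priority, _ in nodes.values())
    total = sum(weight for priority, weight in nodes.values() if priority == top)
    return {
        name: (weight / total if priority == top else 0.0)
        for name, (priority, weight) in nodes.items()
    }


if __name__ == "__main__":
    shares = sibling_shares({"vf0": (7, 100), "vf1": (7, 50), "vf2": (0, 200)})
    for name, fraction in shares.items():
        print(f"{name}: {fraction:.2f}")
```

Because only relative weights matter, weights of 100 and 50 give the same split as 2 and 1.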

# enable switchdev
$ devlink dev eswitch set pci/0000:4b:00.0 mode switchdev

# at this point driver should export internal hierarchy
$ echo 2 > /sys/class/net/ens785np0/device/sriov_numvfs

$ devlink port function rate show
pci/0000:4b:00.0/node_25: type node parent node_24
pci/0000:4b:00.0/node_24: type node parent node_0
pci/0000:4b:00.0/node_32: type node parent node_31
pci/0000:4b:00.0/node_31: type node parent node_30
pci/0000:4b:00.0/node_30: type node parent node_16
pci/0000:4b:00.0/node_19: type node parent node_18
pci/0000:4b:00.0/node_18: type node parent node_17
pci/0000:4b:00.0/node_17: type node parent node_16
pci/0000:4b:00.0/node_14: type node parent node_5
pci/0000:4b:00.0/node_5: type node parent node_3
pci/0000:4b:00.0/node_13: type node parent node_4
pci/0000:4b:00.0/node_12: type node parent node_4
pci/0000:4b:00.0/node_11: type node parent node_4
pci/0000:4b:00.0/node_10: type node parent node_4
pci/0000:4b:00.0/node_9: type node parent node_4
pci/0000:4b:00.0/node_8: type node parent node_4
pci/0000:4b:00.0/node_7: type node parent node_4
pci/0000:4b:00.0/node_6: type node parent node_4
pci/0000:4b:00.0/node_4: type node parent node_3
pci/0000:4b:00.0/node_3: type node parent node_16
pci/0000:4b:00.0/node_16: type node parent node_15
pci/0000:4b:00.0/node_15: type node parent node_0
pci/0000:4b:00.0/node_2: type node parent node_1
pci/0000:4b:00.0/node_1: type node parent node_0
pci/0000:4b:00.0/node_0: type node
pci/0000:4b:00.0/1: type leaf parent node_25
pci/0000:4b:00.0/2: type leaf parent node_25

# let's create some custom node
$ devlink port function rate add pci/0000:4b:00.0/node_custom parent node_0

# second custom node
$ devlink port function rate add pci/0000:4b:00.0/node_custom_1 parent node_custom

# reassign second VF to newly created branch
$ devlink port function rate set pci/0000:4b:00.0/2 parent node_custom_1

# assign tx_weight to the VF
$ devlink port function rate set pci/0000:4b:00.0/2 tx_weight 5

# assign tx_share to the VF
$ devlink port function rate set pci/0000:4b:00.0/2 tx_share 500Mbps
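The exported hierarchy is easier to reason about once the `devlink port function rate show` output is turned into a parent map. The sketch below parses lines in the exact form shown above and walks from a leaf up to the root. The function names are hypothetical, and the regular expression only covers the `type ... parent ...` layout that appears in this example; lines carrying extra attributes may need a looser pattern.

```python
import re
from typing import Dict, List, Optional

# Matches e.g. "pci/0000:4b:00.0/node_25: type node parent node_24".
LINE_RE = re.compile(
    r"^(?P<dev>\S+)/(?P<name>[^/\s]+): type (?P<type>node|leaf)"
    r"(?: parent (?P<parent>\S+))?"
)


def parse_rate_show(text: str) -> Dict[str, Optional[str]]:
    """Map each node or leaf name to its parent (None for the root)."""
    parents: Dict[str, Optional[str]] = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            parents[match["name"]] = match["parent"]
    return parents


def path_to_root(parents: Dict[str, Optional[str]], name: str) -> List[str]:
    """Follow parent links from `name` until a node without a parent."""
    path = [name]
    seen = {name}
    while parents.get(path[-1]) is not None:
        parent = parents[path[-1]]
        if parent in seen:
            raise ValueError(f"cycle detected at {parent}")
        path.append(parent)
        seen.add(parent)
    return path


if __name__ == "__main__":
    sample = "\n".join([
        "pci/0000:4b:00.0/node_25: type node parent node_24",
        "pci/0000:4b:00.0/node_24: type node parent node_0",
        "pci/0000:4b:00.0/node_0: type node",
        "pci/0000:4b:00.0/1: type leaf parent node_25",
    ])
    print(" -> ".join(path_to_root(parse_rate_show(sample), "1")))
```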

© Copyright The kernel development community.
