VMware ESXi Administration Guide

SR-IOV, LAGs, and affinity

For use cases that do not currently benefit from vSPU, the best that you can do is load balancing across all CPUs. You can best achieve this using SR-IOV, LAGs, and CPU affinity settings.

You likely need link aggregation, if not for throughput, then for resiliency. LAG considerations differ when VFs are involved, but the main concepts are the same. The following diagram represents an example LAG-based topology based on two NIC cards, each with two ports, in a single NUMA node.

This scenario tolerates the following:

  • NIC port/link failure
  • NIC card failure
  • Switch failure

The design also stresses the need for the trust on setting discussed earlier, because the VF must react to the status of the PF; LACP does not provide the functionality that it would in an appliance-based deployment.

You must configure LACP mode as static in this deployment scenario.
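The following is a minimal sketch of a static LAG on the FortiGate-VM. The interface name and the VF-backed members (port2 and port6 here) are illustrative and depend on your deployment:

config system interface
    edit "lag1"
        set vdom "root"
        set type aggregate
        set member "port2" "port6"
        set lacp-mode static
    next
end

With lacp-mode static, no LACPDUs are exchanged, so the member state follows the PF link status that the trusted VF reports, which is why the trust on setting matters.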

In this diagram, the PF uses an external VLAN tag to separate traffic to the respective VFs, and the VM is unaware of this external VLAN.

Without vSPU, there is no PMD, and the NIC uses interrupts to signal that there is network traffic that the CPU must process. To get a performant system without vSPU, you must take care to balance the number of interrupts that each CPU receives.

Using the same layout as the diagram displays, find the relevant system interrupts/queues:

diagnose hardware sysinfo interrupts | grep "CPU\|port"
           CPU0       CPU1   <...>   CPU15
 47:     119912          0   <...>       0   PCI-MSI-edge      iavf-port2-TxRx-0
 48:          0     200309   <...>       0   PCI-MSI-edge      iavf-port2-TxRx-1
 49:          0          0   <...>       0   PCI-MSI-edge      iavf-port2-TxRx-2
 50:          0          0   <...>       0   PCI-MSI-edge      iavf-port2-TxRx-3
 <...>
 67:     254849          0   <...>       0   PCI-MSI-edge      iavf-port6-TxRx-0
 68:          0     443186   <...>       0   PCI-MSI-edge      iavf-port6-TxRx-1
 69:          0          0   <...>       0   PCI-MSI-edge      iavf-port6-TxRx-2
 70:          0          0   <...>       0   PCI-MSI-edge      iavf-port6-TxRx-3
 <...>
 87:      72971          0   <...>       0   PCI-MSI-edge      iavf-port10-TxRx-0
 88:          0     376044   <...>       0   PCI-MSI-edge      iavf-port10-TxRx-1
 89:          0          0   <...>       0   PCI-MSI-edge      iavf-port10-TxRx-2
 90:          0          0   <...>       0   PCI-MSI-edge      iavf-port10-TxRx-3
 <...>
 107:    197132          0   <...>       0   PCI-MSI-edge      iavf-port14-TxRx-0
 108:         0     421851   <...>       0   PCI-MSI-edge      iavf-port14-TxRx-1
 109:         0          0   <...>       0   PCI-MSI-edge      iavf-port14-TxRx-2
 110:         0          0   <...>       0   PCI-MSI-edge      iavf-port14-TxRx-3
 <...>
 122:         0          0   <...>       0   PCI-MSI-edge      iavf-port17-TxRx-0
 123:         0          0   <...>       0   PCI-MSI-edge      iavf-port17-TxRx-1
 124:         0          0   <...>       0   PCI-MSI-edge      iavf-port17-TxRx-2
 125:         0          0   <...>  345768   PCI-MSI-edge      iavf-port17-TxRx-3
 <...>
Note

The interrupt names can differ. For example, the Mellanox ConnectX-5 NIC card has ten interrupts/queues per port named port2-0 through to port2-9.

The idea is to spread the interrupts across CPUs to balance the load across all system resources. Using the interrupt names from the output above, you can pin them to particular CPUs:

config system affinity-interrupt
    edit 20
        set interrupt "iavf-port2-TxRx-0"
        set affinity-cpumask "0x0000000000000001"
    next
    edit 21
        set interrupt "iavf-port2-TxRx-1"
        set affinity-cpumask "0x0000000000000002"
    next
    edit 22
        set interrupt "iavf-port2-TxRx-2"
        set affinity-cpumask "0x0000000000000004"
    next
    edit 23
        set interrupt "iavf-port2-TxRx-3"
        set affinity-cpumask "0x0000000000000008"
    next
<...>
    edit 60
        set interrupt "iavf-port6-TxRx-0"
        set affinity-cpumask "0x0000000000000001"
    next
    edit 61
        set interrupt "iavf-port6-TxRx-1"
        set affinity-cpumask "0x0000000000000002"
    next
    edit 62
        set interrupt "iavf-port6-TxRx-2"
        set affinity-cpumask "0x0000000000000004"
    next
    edit 63
        set interrupt "iavf-port6-TxRx-3"
        set affinity-cpumask "0x0000000000000008"
    next
<...>
    edit 170
        set interrupt "iavf-port17-TxRx-0"
        set affinity-cpumask "0x0000000000001000"
    next
    edit 171
        set interrupt "iavf-port17-TxRx-1"
        set affinity-cpumask "0x0000000000002000"
    next
    edit 172
        set interrupt "iavf-port17-TxRx-2"
        set affinity-cpumask "0x0000000000004000"
    next
    edit 173
        set interrupt "iavf-port17-TxRx-3"
        set affinity-cpumask "0x0000000000008000"
    next
end
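
After applying the affinity settings, you can rerun the same diagnose command to confirm that the interrupt counters are now incrementing on the intended CPUs:

diagnose hardware sysinfo interrupts | grep "CPU\|port"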

This maps each of the four queues on an interface to one of the four CPUs in a group, while reusing the same group of four CPUs across four interfaces, as the following diagrams show. This interleaving of the functions produces an even interrupt distribution, which gives the most performant deployment scenario.
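
For reference, affinity-cpumask is a hexadecimal bitmask in which bit N selects CPUN. In the 16-CPU example above, the masks map as follows:

 0x0000000000000001  ->  CPU0    (bit 0)
 0x0000000000000002  ->  CPU1    (bit 1)
 0x0000000000000004  ->  CPU2    (bit 2)
 0x0000000000000008  ->  CPU3    (bit 3)
 <...>
 0x0000000000001000  ->  CPU12   (bit 12)
 0x0000000000008000  ->  CPU15   (bit 15)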

In case of a failure, for example of a NIC card, this interleaving model ensures that the interfaces where most traffic is expected are processed by different CPUs, as the diagram shows, keeping performance at a maximum.

Working out how best to balance the interrupts is the main thing to address in these circumstances. In the example case, each port has four queues/interrupts that you can map, which makes a VM16 effective with four PFs (four ports with four queues each covers all 16 CPUs). The SR-IOV VLAN filtering and resultant LAG configuration provide the interleaving, which helps balance the load across all CPUs.

Similarly, a VM32 may be best serviced with eight PFs. The NIC card may also allow you to configure how many PFs it presents. For example, you may be able to use a NIC presenting 4 x 10G across the CPUs more effectively than one presenting 1 x 40G.

If there is not much flexibility in the use of transparent VLANs or the number of PFs, the best option may be to affine some services, such as IPS, logging, or Web Filter, to CPUs that are not used for traffic, providing effective CPU use.
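
The following is a minimal sketch of this approach, assuming a 16-vCPU VM where CPU0 through CPU7 carry the traffic interrupts and CPU8 through CPU15 are left for services. The availability and format of these options depend on the FortiOS version, so check the CLI reference for your build:

config system global
    # Example masks only: 0xff00 selects CPU8-CPU15, leaving CPU0-CPU7 for NIC interrupts.
    set av-affinity ff00
    set ips-affinity ff00
    set miglog-affinity ff00
end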

Effectively, there is significant flexibility, which should allow you to find a sweet spot of performance in most scenarios.
