QPI Bypass and CPU Socket Load Balancing

Feature Set: N-ANL10
Platform: Napatech SmartNIC
Content Type: Feature Description
Capture Software Version: N-ANL10

In this chapter

On NUMA hosts, distributing traffic across multiple CPUs can incur a performance penalty because only the NUMA node local to an accelerator benefits from direct data I/O to that node's L3 cache. An expansion bus enables two accelerators to be bonded, ensuring local cache access.
Note: This chapter only applies to the following:
  • The NT200C01 solution consisting of two NT100E3-1-PTP accelerators.
  • A pair of NT40A01-4×10/1-SLB accelerators.

QPI bypass concept

On modern Intel architecture NUMA hosts, two technologies have improved I/O performance significantly:
  • QuickPath Interconnect (QPI) provides fast connections between multiple processors and I/O hubs.

  • Direct Data I/O (DDIO) is an enhancement of DMA that allows data to be transferred directly between a device in a PCIe slot and the L3 cache of the processor local to that PCIe slot (see the sketch after this list).
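
Whether an application benefits from DDIO therefore depends on whether its processing stays on the NUMA node local to the accelerator's PCIe slot. The following sketch shows one generic way to look up that node on Linux by reading sysfs; the PCIe address used is a hypothetical placeholder, not a value taken from this document.

    /* Minimal sketch: query the NUMA node local to a PCIe device via sysfs.
     * The PCIe address "0000:81:00.0" is a hypothetical placeholder; substitute
     * the address of the accelerator as reported by lspci. */
    #include <stdio.h>

    int main(void)
    {
        const char *path = "/sys/bus/pci/devices/0000:81:00.0/numa_node";
        FILE *f = fopen(path, "r");
        int node = -1;

        if (f == NULL) {
            perror("fopen");
            return 1;
        }
        if (fscanf(f, "%d", &node) != 1)
            node = -1;          /* -1 also means no NUMA affinity reported */
        fclose(f);

        printf("Accelerator is local to NUMA node %d\n", node);
        return 0;
    }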

But writing from a device in a PCIe slot to memory on a remote NUMA node through QPI incurs latency in several ways:
  • The local L3 cache is polluted by data destined for the remote NUMA node.

  • The QPI itself causes latency.

  • The QPI memory write causes a flush of remote L3 cache lines (enforced by the cache coherency protocol).

To bypass the QPI and ensure local cache access, data destined for a remote NUMA node must be transferred to the accelerator local to that NUMA node before it enters the PCIe bus. Some Napatech accelerators enable this through an expansion bus that allows two accelerators to be interconnected (see the host-side sketch after the list below).
  • In the NT200C01 solution, two NT100E3-1-PTPs can be bonded as peers so that streams can be redirected to the peer accelerator.
  • A pair of NT40A01-4×10/1-SLB accelerators can be bonded in a master-slave configuration so that all frames received on the master accelerator are copied to the slave accelerator.
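
With either bonding configuration, the intent is that each NUMA node consumes its traffic through the accelerator local to it. The sketch below is a generic host-side illustration of that principle, not Napatech-specific API code: it pins one worker thread per NUMA node using libnuma, assuming a hypothetical setup with one accelerator local to node 0 and one local to node 1; the stream handling itself is left as a placeholder.

    /* Generic sketch: one worker thread per NUMA node, each pinned to its node
     * so that packet data delivered by the local accelerator stays in that
     * node's L3 cache. Stream handling is a placeholder; link with -lnuma
     * and -lpthread. */
    #include <numa.h>
    #include <pthread.h>
    #include <stdio.h>

    static void *worker(void *arg)
    {
        int node = *(int *)arg;

        /* Restrict this thread and its memory allocations to the given node. */
        if (numa_run_on_node(node) != 0) {
            perror("numa_run_on_node");
            return NULL;
        }
        numa_set_preferred(node);

        /* Placeholder: open and service the capture streams that the local
         * (or peer-bonded) accelerator delivers to this NUMA node. */
        printf("Worker running on NUMA node %d\n", node);
        return NULL;
    }

    int main(void)
    {
        pthread_t threads[2];
        int nodes[2] = { 0, 1 };   /* hypothetical: one accelerator per node */

        if (numa_available() < 0) {
            fprintf(stderr, "libnuma not available\n");
            return 1;
        }
        for (int i = 0; i < 2; i++)
            pthread_create(&threads[i], NULL, worker, &nodes[i]);
        for (int i = 0; i < 2; i++)
            pthread_join(threads[i], NULL);
        return 0;
    }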