QPI Bypass

Feature Set: N-ANL9
Platform: Napatech SmartNIC
Content Type: Feature Description
Capture Software Version: N-ANL9

In this chapter

On NUMA hosts, distributing traffic across multiple CPUs can incur a performance penalty, because only the NUMA node local to an accelerator benefits from direct data I/O into its L3 cache. An expansion bus makes it possible to bond two NT100E3-1-PTP accelerators, ensuring local cache access for both NUMA nodes.
Note: This chapter only applies to the NT200C01 solution consisting of two NT100E3-1-PTP accelerators.

QPI bypass concept

On modern Intel architecture NUMA hosts, two technologies have improved I/O performance significantly:
  • QuickPath Interconnect (QPI) provides fast connections between multiple processors and I/O hubs.

  • Direct Data I/O (DDIO) is an enhancement of DMA that allows data to be transferred directly between a device in a PCIe slot and the L3 cache of the processor local to that PCIe slot.

Writing from a device in a PCIe slot to memory on a remote NUMA node through QPI incurs latency in several ways:
  • The local L3 cache is polluted by data destined for the remote NUMA node.

  • The QPI itself causes latency.

  • The QPI memory write causes a flush of remote L3 cache lines (enforced by the cache coherency protocol).

To bypass the QPI, data destined for a remote NUMA node must be transferred to the accelerator local to that NUMA node before it enters the PCIe bus. Two NT100E3-1-PTPs can be bonded through an expansion bus that allows streams to be redirected to the peer accelerator, ensuring local cache access.

QPI bypass example

In this example, two bonded NT100E3-1-PTPs are used to merge and distribute up- and downstream traffic in a 100G network.

The process is described in this table:

Stage  Description
1      The bonding is a static configuration set up in ntservice.ini (see below).
2      Using NTPL commands (see below), the application sets up affinity between streams and NUMA nodes, sets up distribution and creates the streams.
3      The application creates a thread for each stream on the NUMA nodes and sets the appropriate affinity (see the sketch after this table).
4      Each thread opens its stream using the packet interface.
5      When a packet is received, all time-stamping, packet processing and statistics are performed by the receiving accelerator.
6      If the packet belongs to a stream that should be handled by the remote NUMA node, the packet is transferred over the expansion bus to the other accelerator.
7      The peer accelerator DMAs the packet over its PCIe interface into the L3 cache of its local NUMA node, just as if the packet had been received on that accelerator.
8      The packet is received by the thread.
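
Stages 3 to 8 can be sketched as one receive thread per stream. This is a minimal sketch only, assuming the NTAPI packet interface from nt.h (NT_Init, NT_NetRxOpen, NT_NetRxGet, NT_NetRxRelease) and libnuma for NUMA affinity; the rx_worker and worker_args names are illustrative, and error handling and the actual packet processing are omitted.

#include <nt.h>        /* NTAPI packet interface */
#include <numa.h>      /* numa_run_on_node(), link with -lnuma */
#include <pthread.h>
#include <stdint.h>

struct worker_args {
  uint32_t stream_id;  /* StreamId assigned in NTPL */
  int numa_node;       /* NUMA node the stream is affine to */
};

static void *rx_worker(void *p) {
  struct worker_args *args = p;
  NtNetStreamRx_t rx;
  NtNetBuf_t buf;

  /* Stage 3: run this thread on the NUMA node the stream belongs to. */
  numa_run_on_node(args->numa_node);

  /* Stage 4: open the stream through the packet interface. */
  if (NT_NetRxOpen(&rx, "qpi-bypass-rx", NT_NET_INTERFACE_PACKET,
                   args->stream_id, -1) != NT_SUCCESS)
    return NULL;

  /* Stages 5-8: packets arrive in host buffers local to this NUMA node. */
  for (;;) {
    int status = NT_NetRxGet(rx, &buf, 1000 /* ms timeout */);
    if (status == NT_SUCCESS) {
      /* ... process the packet ... */
      NT_NetRxRelease(rx, buf);
    } else if (status != NT_STATUS_TIMEOUT) {
      break;  /* stream closed or error */
    }
  }
  NT_NetRxClose(rx);
  return NULL;
}

int main(void) {
  pthread_t threads[32];
  struct worker_args args[32];

  NT_Init(NTAPI_VERSION);
  for (uint32_t i = 0; i < 32; i++) {
    args[i].stream_id = i;
    args[i].numa_node = (i < 16) ? 0 : 1;  /* matches the NTPL setup below */
    pthread_create(&threads[i], NULL, rx_worker, &args[i]);
  }
  for (int i = 0; i < 32; i++)
    pthread_join(threads[i], NULL);
  NT_Done();
  return 0;
}

The per-stream NUMA node chosen here mirrors the NTPL setup shown later in this chapter: streams 0-15 on NUMA node 0 and streams 16-31 on NUMA node 1.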

ntservice.ini

[Adapter0]
AdapterType = NT100E3_1_PTP
BondingType = Peer
RemoteAdapter = 1                     # The bonded peer
NumaNode = 0                          # Local NUMA node for this PCIe slot
BusId = 0000:06:00.0
TimeSyncReferencePriority = OSTime    # Or another reference clock source
TimeSyncConnectorInt1 = NttsOut 

[Adapter1]
AdapterType = NT100E3_1_PTP
BondingType = Peer
RemoteAdapter = 0                     # The bonded peer
NumaNode = 1                          # Local NUMA node for this PCIe slot
BusId = 0000:84:00.0
TimeSyncReferencePriority = Int1      # Synced to Adapter0
TimeSyncConnectorInt1 = NttsIn

[NT100E3_1_PTP]
HostBufferSegmentSizeRx = dynamic     # Peer bonding requires dynamic segment size
HostBuffersRx = [32,16,$Local$]       # Host buffers allocated on the local NUMA node
NumaNode = ...

Specifies the NUMA node local to each PCIe slot. It can be identified from hardware documentation, with a utility such as lstopo from the Portable Hardware Locality (hwloc) package, or by reading sysfs as shown below.
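
As an alternative to lstopo, on Linux the NUMA node of a PCIe slot can be read from sysfs. The helper below is purely illustrative and assumes the standard /sys/bus/pci/devices/<BusId>/numa_node attribute:

#include <stdio.h>

/* Illustrative only: read the NUMA node of a PCIe device from sysfs. */
static int pci_numa_node(const char *bus_id) {
  char path[128];
  int node = -1;
  FILE *f;

  snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/numa_node", bus_id);
  f = fopen(path, "r");
  if (f != NULL) {
    if (fscanf(f, "%d", &node) != 1)
      node = -1;
    fclose(f);
  }
  return node;  /* -1 if unknown or not a NUMA system */
}

int main(void) {
  printf("NumaNode for 0000:06:00.0: %d\n", pci_numa_node("0000:06:00.0"));
  printf("NumaNode for 0000:84:00.0: %d\n", pci_numa_node("0000:84:00.0"));
  return 0;
}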

RemoteAdapter = ...

Specifies the bonded peer accelerator.

TimeSyncReferencePriority = ...
TimeSyncConnectorInt1 = ...

Sets up time synchronization of the accelerators to ensure proper merging of traffic.

HostBuffersRx = [32,16,$Local$]

Allocates 32 host buffers of 16 Mbytes each on the NUMA node local to each accelerator.

A relatively small host buffer size helps avoid evicting packet data from the L3 cache.

NTPL

Setup[NUMANode=0] = StreamId==(0..15)
Setup[NUMANode=1] = StreamId==(16..31)
HashMode = Hash5TupleSorted
Assign[StreamId=(0..31)] = All
Setup[NUMANode=...] = ...

Sets up the affinity between streams and NUMA nodes: streams 0-15 are handled on NUMA node 0, and streams 16-31 on NUMA node 1.

HashMode = Hash5TupleSorted

The hash mode Hash5TupleSorted ensures that up- and downstream traffic for each service is merged to the same stream, because the 5-tuple is sorted before hashing and the result is therefore independent of traffic direction (illustrated below).
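
The following sketch only illustrates the principle of a sorted (symmetric) 5-tuple hash; it is not the accelerator's actual Hash5TupleSorted algorithm. Sorting the two IP/port endpoints before hashing makes both directions of a flow produce the same hash value and therefore land in the same stream.

#include <stdint.h>

/* Toy hash, standing in for the real implementation: sort the two
 * IP/port endpoints first so that (src, dst) and (dst, src) give the
 * same result. */
uint32_t hash5tuple_sorted(uint32_t ip_a, uint16_t port_a,
                           uint32_t ip_b, uint16_t port_b,
                           uint8_t protocol) {
  if (ip_a > ip_b || (ip_a == ip_b && port_a > port_b)) {
    uint32_t ip = ip_a;   ip_a = ip_b;   ip_b = ip;
    uint16_t po = port_a; port_a = port_b; port_b = po;
  }
  /* Simple FNV-1a style mixing, for illustration only. The stream is
   * then selected as the hash modulo the number of streams, e.g.
   * h % 32 for StreamId 0..31. */
  uint32_t h = 2166136261u;
  h = (h ^ ip_a) * 16777619u;
  h = (h ^ ip_b) * 16777619u;
  h = (h ^ port_a) * 16777619u;
  h = (h ^ port_b) * 16777619u;
  h = (h ^ protocol) * 16777619u;
  return h;
}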

Assign[StreamId=(0..31)] = All

Creates 32 streams and distributes all received traffic across them according to the configured hash mode.
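
The NTPL lines above are typically submitted by the application at startup (stage 2 in the table). A minimal sketch, assuming the NTAPI configuration stream functions NT_ConfigOpen, NT_NTPL and NT_ConfigClose; error handling is reduced to returning the status code:

#include <nt.h>

static const char *ntpl_lines[] = {
  "Setup[NUMANode=0] = StreamId==(0..15)",
  "Setup[NUMANode=1] = StreamId==(16..31)",
  "HashMode = Hash5TupleSorted",
  "Assign[StreamId=(0..31)] = All",
};

int apply_ntpl(void) {
  NtConfigStream_t cfg;
  NtNtplInfo_t info;
  unsigned i;
  int status;

  status = NT_ConfigOpen(&cfg, "qpi-bypass-ntpl");
  if (status != NT_SUCCESS)
    return status;
  for (i = 0; i < sizeof(ntpl_lines) / sizeof(ntpl_lines[0]); i++) {
    /* Submit one NTPL expression; info carries parser feedback on errors. */
    status = NT_NTPL(cfg, ntpl_lines[i], &info, NT_NTPL_PARSER_VALIDATE_NORMAL);
    if (status != NT_SUCCESS)
      break;
  }
  NT_ConfigClose(cfg);
  return status;
}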