CPU Socket Load Balancing

Feature Set: N-ANL10
Platform: Napatech SmartNIC
Content Type: Feature Description
Capture Software Version: N-ANL10

A pair of NT40A01-4×10/1-SLB accelerators can be bonded in a master-slave configuration so that all traffic received on the master accelerator is replicated to the slave accelerator, giving applications on both NUMA nodes of a dual CPU socket server local cache access to the traffic.

Purpose

The CPU Socket Load Balancing feature can be used to optimize performance on a dual CPU socket server for demanding data processing at 4×10 Gbit/s.

Features

To avoid the large performance penalty of moving data over the QPI (QuickPath Interconnect) link between applications running on different NUMA nodes, a pair of NT40A01-4×10/1-SLB accelerators allows traffic received on one accelerator (the master) to be transferred at full line rate to the other accelerator (the slave) through an expansion bus.
Note: This section applies only to NT40A01-4×10/1-SLB accelerators.
Note: The NT40A01-4×10/1-SLB accelerator is supported on Linux only.
[Figure: CPU Socket Load Balancing. 4×10G traffic is load-distributed into up to 128 streams on each of CPU socket 0 and CPU socket 1 with direct cache access, avoiding transfers over the Quick Path Interconnect.]
  • 4×10 Gbit/s traffic received on a master card is replicated to a slave card. This allows load distribution of the traffic into up to 128 streams per CPU socket (NUMA node), for a total of 256 streams in a dual CPU socket server.

  • Time-stamping and port statistics are handled by the master accelerator, while frame processing, filtering and stream distribution are done by each accelerator individually. In effect, this works as if an optical splitter had been inserted before each pair of ports, with the advantage that time-stamping of the received frames is guaranteed to be identical.

The ports on the slave accelerator are disabled and cannot be used.

Configuration

The static configuration of CPU Socket Load Balancing is set up in ntservice.ini for each NT40A01-4×10/1-SLB accelerator in a pair:
  • Master or Slave.
  • NUMA node locality.
  • Accelerators in a pair must be consecutively numbered, with the slave number = master number + 1.

Filtering, distribution of traffic to streams, and assignment of stream affinity to NUMA nodes are done as usual using NTPL.

Traffic received on port 0 of the master accelerator is replicated and made available as if it had been received on port 0 of the slave accelerator. From the host's point of view, since ports are enumerated consecutively in accelerator order, port n and port n+4 carry the same traffic.
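As an illustration, an NTPL setup along the following lines could distribute the replicated traffic into streams local to each CPU socket. This is a minimal sketch: the hash mode, stream counts and NUMA node numbers are example values, and the exact Setup/Assign syntax should be verified against the NTPL reference for the installed software version.

// Example hash-based flow distribution (hash mode chosen for illustration)
HashMode = Hash5TupleSorted

// Master accelerator ports 0..3: streams 0..127 with host buffers on NUMA node 0
Setup[NUMANode = 0] = StreamId == (0..127)
Assign[StreamId = (0..127)] = Port == (0..3)

// Slave accelerator ports 4..7 carry the replicated traffic:
// streams 128..255 with host buffers on NUMA node 1
Setup[NUMANode = 1] = StreamId == (128..255)
Assign[StreamId = (128..255)] = Port == (4..7)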

Example

[Adapter0]
BondingType = Master # Traffic received on this accelerator is replicated to the slave
NumaNode = 0         # Local NUMA node for this PCIe slot

[Adapter1]
BondingType = Slave  # Must be numbered as master + 1; its ports are disabled
NumaNode = 1         # Local NUMA node for this PCIe slot
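Since ntservice.ini holds the static configuration, changed bonding settings normally take effect only after ntservice is restarted. The NumaNode values should match the NUMA locality of the PCIe slots holding the master and slave accelerators, so that each accelerator's host buffers are allocated on the CPU socket closest to it.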