Tips/Tweaking
ntservice.ini
- Performance
- Amount of host buffers
Performance is very dependent on the number of host buffers defined. If the configured number of host buffers is higher than the number actually required, performance will suffer: even though no traffic is assigned to the unused host buffers, they still need to be serviced, which consumes CPU cycles and PCIe bandwidth.
To get an idea of the number of host buffers required, use the Profiling tool located in the tools package. The tool shows how many host buffers are in use and by which applications/streams.
- NUMA node
PCIe performance can be limited by the server layout. If the server has multiple CPUs and multiple IOHs (I/O hubs), pay attention to the NUMA nodes and assign the host buffers of an adapter to the NUMA node with the shortest QPI (QuickPath Interconnect) distance. In ntservice.ini the NUMA node for host buffers is selected via the third parameter of the HostBuffersRx/HostBuffersTx options (see the examples later in this section).
- One or multiple IOH
Below is an extract from ntlog on a system with 2 NUMA nodes but only one IOH.

    # /opt/napatech3/bin/ntlog | cut -d- -f2
    ....
    PCIe settings:
      Supported max payload of 256 bytes. Negotiated 256 bytes
      PCIgen in use 2. Adapter supports PCIgen 2.
      Adapter support 8 lanes @ 5.0 GT/s. Link negotiated to 8 lanes @ 5.0 GT/s.
      PCIe measurements on NUMA node 0
        RX only max expected throughput is: 29041 Mbps
        TX only max expected throughput is: 28259 Mbps
        RX combined max expected throughput is: 27406 Mbps
        TX combined max expected throughput is: 27821 Mbps
      PCIe measurements on NUMA node 1
        RX only max expected throughput is: 29057 Mbps
        TX only max expected throughput is: 28259 Mbps
        RX combined max expected throughput is: 27399 Mbps
        TX combined max expected throughput is: 27804 Mbps
    ....
The extract below is the same dump but on a system with 2 IOH and 2 NUMA nodes.

    # /opt/napatech3/bin/ntlog | cut -d- -f2
    ....
    PCIe settings:
      Supported max payload of 256 bytes. Negotiated 256 bytes
      PCIgen in use 2. Adapter supports PCIgen 2.
      Adapter support 8 lanes @ 5.0 GT/s. Link negotiated to 8 lanes @ 5.0 GT/s.
      PCIe measurements on NUMA node 0
        RX only max expected throughput is: 29041 Mbps
        TX only max expected throughput is: 28259 Mbps
        RX combined max expected throughput is: 27406 Mbps
        TX combined max expected throughput is: 27821 Mbps
      PCIe measurements on NUMA node 1
        RX only max expected throughput is: 19057 Mbps
        TX only max expected throughput is: 18259 Mbps
        RX combined max expected throughput is: 17399 Mbps
        TX combined max expected throughput is: 17804 Mbps
    ....

Note how the expected throughput measured from NUMA node 1 is roughly 10 Gbps lower than from NUMA node 0 on the 2-IOH system; the host buffers for this adapter should therefore be assigned to NUMA node 0.
- Intel Turbo Boost
Availability and the frequency upside of the Turbo Boost state depend on several factors, including, but not limited to, the following:
- Type of workload
- Number of active cores
- Estimated current consumption
- Estimated power consumption
- Processor temperature
The operation is dependent on headroom available in one or more cores. The amount of time the system spends in Turbo Boost mode depends on workload, environment, platform design, and overall system configuration.
A maximum frequency for processor operation cannot be specified; the maximum frequency is determined automatically and depends on the working conditions.
Due to varying power characteristics, some parts with Turbo Boost may not achieve maximum turbo frequencies when running heavy workloads and using multiple cores concurrently.
The less total power the CPU is using, the higher it can set the clocks in turbo mode. The more cores that are idle, the better the Turbo Boost performance on a single core.
- Network macros
The Network macros are by default able to run on all supported packet descriptors. This might not be the most efficient approach, because each macro then needs to determine the descriptor format at run time and walk through an if-then-else chain until the correct version is located. Below is an example of a network macro implementation that illustrates the if-then-else case:

    #if defined(_NTAPI_EXTDESCR_7_)
    #define NT_NET_GET_PKT_HASH(_h_) _NT_NET_GET_PKT_HASH_EXT7(_h_)
    #elif defined(_NTAPI_EXTDESCR_8_)
    #define NT_NET_GET_PKT_HASH(_h_) _NT_NET_GET_PKT_HASH_EXT8(_h_)
    #elif defined(_NTAPI_EXTDESCR_9_)
    #define NT_NET_GET_PKT_HASH(_h_) _NT_NET_GET_PKT_HASH_EXT9(_h_)
    #else
    #define NT_NET_GET_PKT_HASH(_h_) \
      ((NT_NET_GET_PKT_DESCRIPTOR_FORMAT(_h_) == 7) ? _NT_NET_GET_PKT_HASH_EXT7(_h_) : \
       (NT_NET_GET_PKT_DESCRIPTOR_FORMAT(_h_) == 8) ? _NT_NET_GET_PKT_HASH_EXT8(_h_) : \
       (NT_NET_GET_PKT_DESCRIPTOR_FORMAT(_h_) == 9) ? _NT_NET_GET_PKT_HASH_EXT9(_h_) : assert(0))
    #endif

Two options exist to prevent the macros from performing the if-then-else:
- Define _NTAPI_EXTDESCR_<n>_ to match the extended descriptor configured in ntservice.ini. Macros not available for the chosen descriptor type will return -1.
- Use NT_NET_GET_PKT_DESCRIPTOR_FORMAT() to detect the descriptor format and then use the format-specific macros directly, as sketched below.
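Below is a minimal sketch of the second option, assuming extended descriptor 9 is configured in ntservice.ini and that hNetRx is an RX stream that has already been opened with NT_NetRxOpen(); the 1000 ms timeout is an arbitrary illustrative value.

    #include <assert.h>
    #include <stdint.h>
    #include <nt.h>  /* NTAPI header providing NT_NetRxGet() and the NT_NET_* macros */

    /* Sketch: detect the descriptor format once per packet and call the
     * EXT9-specific macro directly instead of the generic dispatching macro. */
    static void rx_loop(NtNetStreamRx_t hNetRx)
    {
      NtNetBuf_t hNetBuf;

      while (NT_NetRxGet(hNetRx, &hNetBuf, 1000) == NT_SUCCESS) {
        /* The configuration is assumed to guarantee extended descriptor 9. */
        assert(NT_NET_GET_PKT_DESCRIPTOR_FORMAT(hNetBuf) == 9);

        uint32_t hash = _NT_NET_GET_PKT_HASH_EXT9(hNetBuf);
        (void)hash;  /* application-specific processing goes here */

        NT_NetRxRelease(hNetRx, hNetBuf);
      }
    }

The first option achieves a similar effect at compile time, e.g. by passing -D_NTAPI_EXTDESCR_9_ to the compiler so the generic macros resolve to the EXT9 variants directly.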
- Memory limit on 32-bit systems
Systems running on 32 bits have a limited host buffer memory pool because of the 1G/3G kernel/user-space memory split. If the system contains multiple adapters, ntservice.ini must be tweaked to limit the number of host buffers required. One obvious thing to do is to disable the TX host buffers if TX is not used:

    [Adapter0]
    AdapterType=NT4E
    BusId=00:0a:00:00
    HostBuffersTx=[0,16,0] # 0 TX host buffers, disable the TX host buffers for adapter 0
or disable the RX host buffers if only TX is used:

    [Adapter0]
    AdapterType=NT4E
    BusId=00:0a:00:00
    HostBuffersRx=[0,16,0] # 0 RX host buffers, disable the RX host buffers for adapter 0
Host Buffer Memory
Large Host Buffer Memory Configurations
Linux considerations:
- Linux kernel sysctl "vm.max_map_count":
Especially for large host buffer configurations it is necessary to adjust the kernel sysctl "vm.max_map_count" (/proc/sys/vm/max_map_count).
It should be set to at least the total configured host buffer memory in MB multiplied by four.
Example for total host buffer size 128GB (131072MB): 131072*4 = 524288. Hence the minimum value for "vm.max_map_count" is 524288.
If your kernel is already using an even higher value, just leave it like that.
If your application has memory-mapping needs of its own, add them on top of this setting.
- Linux kernel sysctl "kernel.numa_balancing":
Especially for large host buffer configurations it may be necessary to disable the kernel sysctl "kernel.numa_balancing" (/proc/sys/kernel/numa_balancing).
In some large host buffer scenarios the automatic NUMA page balancing and page migration do not work properly and impose a heavy load on the system.
Symptoms seen: idle user-space threads consume close to 100% system time, and calls to sleep(), usleep() and nanosleep() become inaccurate, yet their return value is zero (uninterrupted).
Disabling NUMA balancing fixes this (kernel sysctl: kernel.numa_balancing=0).
Data Sharing and Host Buffer Allowance
A more thorough description of Host Buffer Allowance can be found on YouTube: Host Buffer Allowance explained
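As a rough sketch of where the setting is applied (the exact semantics are covered in the video above), the host buffer allowance is passed as a parameter when the RX stream is opened; the stream name, stream ID and 50% allowance below are illustrative values, and the parameter may differ between driver versions.

    #include <nt.h>

    /* Sketch: open an RX stream with a host buffer allowance so that a slow
     * consumer sharing a host buffer does not hold back the other applications. */
    static int open_rx_with_allowance(NtNetStreamRx_t *phNetRx)
    {
      /* The last parameter is the host buffer allowance in percent;
       * -1 disables the mechanism. */
      return NT_NetRxOpen(phNetRx, "example_stream", NT_NET_INTERFACE_PACKET,
                          0 /* stream ID */, 50 /* host buffer allowance */);
    }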
NTPL
Debugging
Application Crash
Start up Synchronization of Host Buffers
Transmit to one port using multiple apps
Latency stability
Packet interleaving from a 4GA multi-threaded Tx application
Sending a packet is a multi-stage process. When using the “packet interface”, the stages are roughly as follows (a short code sketch follows the list):
1. The application calculates the packet size and then requests a netbuf of the right size from the driver.
2. The application creates the packet in the netbuf.
3. The application “releases” the netbuf (which signals that the packet can be transmitted).
4. As part of the “release”, the NTAPI library code advances the so-called write pointer (aka head pointer). That pointer is stored at a specific address in memory, and that address is also known to the FPGA.
5. The FPGA periodically (and asynchronously) reads the head/write pointer (the pointer gets DMAed from host memory to the FPGA) to see whether there are new packets to send.
6. The FPGA sends pending data/packets and advances/updates (asynchronously) the so-called tail/read pointer (again via DMA, from FPGA to host memory) to signal that the data has been transmitted.
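As a minimal sketch of stages 1 to 3 seen from the application (assuming hNetTx is a TX stream already opened with NT_NetTxOpen(), that port 0 is valid for that stream, and that size is the full wire length of the frame; port and timeout values are illustrative):

    #include <stdint.h>
    #include <string.h>
    #include <nt.h>  /* NTAPI header */

    /* Sketch: transmit one frame of 'size' bytes on port 0 of an already
     * opened TX stream. */
    static int tx_one_packet(NtNetStreamTx_t hNetTx, const uint8_t *frame, size_t size)
    {
      NtNetBuf_t hNetBuf;
      int status;

      /* Stage 1: request a netbuf of the right size from the driver. */
      status = NT_NetTxGet(hNetTx, &hNetBuf, 0 /* port */, size,
                           NT_NETTX_PACKET_OPTION_DEFAULT, 1000 /* ms timeout */);
      if (status != NT_SUCCESS)
        return status;

      /* Stage 2: create the packet in the netbuf. */
      memcpy(NT_NET_GET_PKT_L2_PTR(hNetBuf), frame, size);

      /* Stage 3: release the netbuf; the NTAPI library advances the
       * write/head pointer (stage 4), and the FPGA picks the packet up
       * and updates the read/tail pointer asynchronously (stages 5 and 6). */
      return NT_NetTxRelease(hNetTx, hNetBuf);
    }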
Packets can be sent as single packets from each TX host buffer, but the asynchronous part of stage 5 means that software cannot know whether the adapter is transmitting the packet(s) or not. On top of that, when multiple TX host buffers are in play, an application, whether single-threaded or multi-threaded, cannot know in which sequence the adapter transmits the packets from the individual host buffers. This is due to the unpredictable behavior of the OS scheduler (stage 4 may occur in a random sequence when multiple threads are in use, unless the threads are synchronized, which may hurt performance) and also because a Napatech adapter does not handle multiple TX host buffers in a deterministic way. A further concern is the asynchronous step from stage 4 to stage 5, which makes it impossible to ensure that the FPGA sees/reads the updated head/write pointer in the order the application assumes.
So if an application uses two TX host buffers and wants to send (say UDP) packets in a specific sequence, it should send those ordered packets via the same TX host buffer; if the application spreads them across both TX host buffers, the adapter may send (some of) the packets in the incorrect order.
Note that stages 1 through 6 are the same for the “segment” interface, except that a netbuf represents a one-megabyte segment.
Another 4GA property is that a TX host buffer is inherently a circular buffer, which simply means that a packet’s position in the buffer determines the sequence in which it gets transmitted (for that TX host buffer). If an application uses the segment interface, allocates multiple segments and releases them “out of order”, the NTAPI driver library holds back the segments until it can release a contiguous range of segments. For instance, if an application allocates segments s1 and s2 but releases them as s2 followed by s1, the library code holds back s2 until s1 is also ready for transmission.
To use timestamp-based transmission, one must use a single TX host buffer and a dispatcher thread. This is more difficult but allows full control. Note that segments must be filled entirely before they are released; this sometimes means padding with “dummy” packets for which the packet descriptor’s TxIgnore bit is set. Dummy packets must be less than 10,000 bytes.