Tips/Tweaking
ntservice.ini
- Performance
- Amount of host buffers
Performance is very dependent on the number of host buffers defined. If the configured number of host buffers is higher than the number actually required, performance will suffer: even though no traffic is assigned to the unused host buffers, they still need to be serviced, which consumes CPU cycles and PCIe bandwidth.
To get an idea of the number of host buffers required, use the Profiling tool located in the tools package. The tool shows how many host buffers are in use and by which applications/streams.
- NUMA node
PCIe performance can be limited by the server layout. If the server has multiple CPUs and multiple IOHs (I/O hubs), pay attention to the NUMA nodes and assign the host buffers of an adapter to the NUMA node with the shortest QPI (QuickPath Interconnect) distance. In ntservice.ini the NUMA node for host buffers is selected via the third parameter of the HostBuffersRx/HostBuffersTx options (see the examples later in this section).
- One or multiple IOH
Below is an extract from ntlog on a system with 2 NUMA nodes but only one IOH.

    # /opt/napatech3/bin/ntlog | cut -d- -f2
    ....
    PCIe settings:
      Supported max payload of 256 bytes. Negotiated 256 bytes
      PCIgen in use 2. Adapter supports PCIgen 2.
      Adapter support 8 lanes @ 5.0 GT/s. Link negotiated to 8 lanes @ 5.0 GT/s.
      PCIe measurements on NUMA node 0
        RX only max expected throughput is: 29041 Mbps
        TX only max expected throughput is: 28259 Mbps
        RX combined max expected throughput is: 27406 Mbps
        TX combined max expected throughput is: 27821 Mbps
      PCIe measurements on NUMA node 1
        RX only max expected throughput is: 29057 Mbps
        TX only max expected throughput is: 28259 Mbps
        RX combined max expected throughput is: 27399 Mbps
        TX combined max expected throughput is: 27804 Mbps
    ....
The extract below is the same dump but on a system with 2 IOH and 2 NUMA nodes.

    # /opt/napatech3/bin/ntlog | cut -d- -f2
    ....
    PCIe settings:
      Supported max payload of 256 bytes. Negotiated 256 bytes
      PCIgen in use 2. Adapter supports PCIgen 2.
      Adapter support 8 lanes @ 5.0 GT/s. Link negotiated to 8 lanes @ 5.0 GT/s.
      PCIe measurements on NUMA node 0
        RX only max expected throughput is: 29041 Mbps
        TX only max expected throughput is: 28259 Mbps
        RX combined max expected throughput is: 27406 Mbps
        TX combined max expected throughput is: 27821 Mbps
      PCIe measurements on NUMA node 1
        RX only max expected throughput is: 19057 Mbps
        TX only max expected throughput is: 18259 Mbps
        RX combined max expected throughput is: 17399 Mbps
        TX combined max expected throughput is: 17804 Mbps
    ....

Note how the expected throughput measured from NUMA node 1 is roughly 10 Gbps lower than from NUMA node 0 on the 2-IOH system; the host buffers for this adapter should therefore be assigned to NUMA node 0.
- Intel Turbo Boost
Availability and the frequency upside of the Turbo Boost state depend on several factors, including, but not limited to, the following:
- Type of workload
- Number of active cores
- Estimated current consumption
- Estimated power consumption
- Processor temperature
The operation is dependent on headroom available in one or more cores. The amount of time the system spends in Turbo Boost mode depends on workload, environment, platform design, and overall system configuration.
A maximum frequency for processor operation cannot be specified; the maximum frequency is determined automatically and depends on the working conditions.
Due to varying power characteristics, some parts with Turbo Boost may not achieve maximum turbo frequencies when running heavy workloads and using multiple cores concurrently.
The less total power the CPU is using, the higher it can set the clocks in turbo mode. The more cores that are idle, the better the Turbo Boost performance on a single core.
- Network macros
The Network macros are by default able to run on all supported packet descriptors. This might not be the most efficient approach, because each macro then needs to determine the descriptor format at run time and walk through an if-then-else chain until the correct version is located. Below is an example of a network macro implementation that illustrates the if-then-else case:

    #if defined(_NTAPI_EXTDESCR_7_)
    #define NT_NET_GET_PKT_HASH(_h_) _NT_NET_GET_PKT_HASH_EXT7(_h_)
    #elif defined(_NTAPI_EXTDESCR_8_)
    #define NT_NET_GET_PKT_HASH(_h_) _NT_NET_GET_PKT_HASH_EXT8(_h_)
    #elif defined(_NTAPI_EXTDESCR_9_)
    #define NT_NET_GET_PKT_HASH(_h_) _NT_NET_GET_PKT_HASH_EXT9(_h_)
    #else
    #define NT_NET_GET_PKT_HASH(_h_) \
      ((NT_NET_GET_PKT_DESCRIPTOR_FORMAT(_h_) == 7) ? _NT_NET_GET_PKT_HASH_EXT7(_h_) : \
       (NT_NET_GET_PKT_DESCRIPTOR_FORMAT(_h_) == 8) ? _NT_NET_GET_PKT_HASH_EXT8(_h_) : \
       (NT_NET_GET_PKT_DESCRIPTOR_FORMAT(_h_) == 9) ? _NT_NET_GET_PKT_HASH_EXT9(_h_) : assert(0))
    #endif

Two options exist to prevent the macros from performing the if-then-else:
- Define _NTAPI_EXTDESCR_<n>_ to match the extended descriptor configured in ntservice.ini. Macros not available for the chosen descriptor type will return -1.
- Use NT_NET_GET_PKT_DESCRIPTOR_FORMAT() to detect the descriptor format and then use the format-specific macros directly, as sketched below.
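Below is a minimal sketch of the second option, assuming extended descriptor 9 is configured in ntservice.ini and that hNetRx is an RX stream that has already been opened with NT_NetRxOpen(); the 1000 ms timeout is an arbitrary illustrative value.

    #include <assert.h>
    #include <stdint.h>
    #include <nt.h>  /* NTAPI header providing NT_NetRxGet() and the NT_NET_* macros */

    /* Sketch: detect the descriptor format once per packet and call the
     * EXT9-specific macro directly instead of the generic dispatching macro. */
    static void rx_loop(NtNetStreamRx_t hNetRx)
    {
      NtNetBuf_t hNetBuf;

      while (NT_NetRxGet(hNetRx, &hNetBuf, 1000) == NT_SUCCESS) {
        /* The configuration is assumed to guarantee extended descriptor 9. */
        assert(NT_NET_GET_PKT_DESCRIPTOR_FORMAT(hNetBuf) == 9);

        uint32_t hash = _NT_NET_GET_PKT_HASH_EXT9(hNetBuf);
        (void)hash;  /* application-specific processing goes here */

        NT_NetRxRelease(hNetRx, hNetBuf);
      }
    }

The first option achieves a similar effect at compile time, e.g. by passing -D_NTAPI_EXTDESCR_9_ to the compiler so the generic macros resolve to the EXT9 variants directly.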
- Memory limit on 32-bit systems
Systems running on 32 bits have a limited host buffer memory pool because of the 1G/3G kernel/user-space memory split. If the system contains multiple adapters, ntservice.ini must be tweaked to limit the number of host buffers required. One obvious thing to do is to disable the TX host buffers if TX is not used:

    [Adapter0]
    AdapterType=NT4E
    BusId=00:0a:00:00
    HostBuffersTx=[0,16,0] # 0 TX host buffers, disable the TX host buffers for adapter 0
or disable the RX host buffers if only TX is used:

    [Adapter0]
    AdapterType=NT4E
    BusId=00:0a:00:00
    HostBuffersRx=[0,16,0] # 0 RX host buffers, disable the RX host buffers for adapter 0
Host Buffer Memory
Large Host Buffer Memory Configurations
Linux considerations:
- Linux kernel sysctl "vm.max_map_count":
Especially for large host buffer configurations it is necessary to adjust the kernel sysctl "vm.max_map_count" (/proc/sys/vm/max_map_count).
It should be set to at least the total configured host buffer memory in MB multiplied by four.
Example for total host buffer size 128GB (131072MB): 131072*4 = 524288. Hence the minimum value for "vm.max_map_count" is 524288.
If your kernel is already using an even higher value, just leave it like that.
If your application has memory-mapping needs of its own, add them on top of this setting.
- Linux kernel sysctl "kernel.numa_balancing":
Especially for large host buffer configurations it may be necessary to disable the kernel sysctl "kernel.numa_balancing" (/proc/sys/kernel/numa_balancing).
In some large host buffer scenarios the automatic NUMA page balancing and page migration do not work properly and impose a heavy load on the system.
Symptoms seen: idle user-space threads consume close to 100% system time, and calls to sleep(), usleep() and nanosleep() become inaccurate, yet their return value is zero (uninterrupted).
Disabling NUMA balancing fixes this (kernel sysctl: kernel.numa_balancing=0).
Data Sharing and Host Buffer Allowance
A more thorough description of Host Buffer Allowance can be found on YouTube: Host Buffer Allowance explained
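As a rough sketch of where the setting is applied (the exact semantics are covered in the video above), the host buffer allowance is passed as a parameter when the RX stream is opened; the stream name, stream ID and 50% allowance below are illustrative values, and the parameter may differ between driver versions.

    #include <nt.h>

    /* Sketch: open an RX stream with a host buffer allowance so that a slow
     * consumer sharing a host buffer does not hold back the other applications. */
    static int open_rx_with_allowance(NtNetStreamRx_t *phNetRx)
    {
      /* The last parameter is the host buffer allowance in percent;
       * -1 disables the mechanism. */
      return NT_NetRxOpen(phNetRx, "example_stream", NT_NET_INTERFACE_PACKET,
                          0 /* stream ID */, 50 /* host buffer allowance */);
    }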
NTPL
Debugging
Application Crash
Start up Synchronization of Host Buffers
Transmit to one port using multiple apps
Latency stability
Packet interleaving from a 4GA multi-threaded Tx application
Sending a packet is a multi-stage process. When using the “packet interface”, the stages are roughly as follows (a short code sketch follows the list):
1. The application calculates the packet size and then requests a netbuf of the right size from the driver.
2. The application creates the packet in the netbuf.
3. The application “releases” the netbuf (which signals that the packet can be transmitted).
4. As part of the “release”, the NTAPI library code advances the so-called write pointer (aka head pointer). That pointer is stored at a specific address in memory, and that address is also known to the FPGA.
5. The FPGA periodically (and asynchronously) reads the head/write pointer (the pointer gets DMAed from host memory to the FPGA) to see whether there are new packets to send.
6. The FPGA sends pending data/packets and advances/updates (asynchronously) the so-called tail/read pointer (again via DMA, from FPGA to host memory) to signal that the data has been transmitted.
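As a minimal sketch of stages 1 to 3 seen from the application (assuming hNetTx is a TX stream already opened with NT_NetTxOpen(), that port 0 is valid for that stream, and that size is the full wire length of the frame; port and timeout values are illustrative):

    #include <stdint.h>
    #include <string.h>
    #include <nt.h>  /* NTAPI header */

    /* Sketch: transmit one frame of 'size' bytes on port 0 of an already
     * opened TX stream. */
    static int tx_one_packet(NtNetStreamTx_t hNetTx, const uint8_t *frame, size_t size)
    {
      NtNetBuf_t hNetBuf;
      int status;

      /* Stage 1: request a netbuf of the right size from the driver. */
      status = NT_NetTxGet(hNetTx, &hNetBuf, 0 /* port */, size,
                           NT_NETTX_PACKET_OPTION_DEFAULT, 1000 /* ms timeout */);
      if (status != NT_SUCCESS)
        return status;

      /* Stage 2: create the packet in the netbuf. */
      memcpy(NT_NET_GET_PKT_L2_PTR(hNetBuf), frame, size);

      /* Stage 3: release the netbuf; the NTAPI library advances the
       * write/head pointer (stage 4), and the FPGA picks the packet up
       * and updates the read/tail pointer asynchronously (stages 5 and 6). */
      return NT_NetTxRelease(hNetTx, hNetBuf);
    }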
Packets can be sent as single packets from each TX host buffer, but the asynchronous part of stage 5 means that software cannot know whether the adapter is transmitting the packet(s) or not. On top of that, when multiple TX host buffers are in play, an application, whether single-threaded or multi-threaded, cannot know in which sequence the adapter transmits the packets from the individual host buffers. This is due to the unpredictable behavior of the OS scheduler (stage 4 may occur in a random sequence when multiple threads are in use, unless the threads are synchronized, which may hurt performance) and also because a Napatech adapter does not handle multiple TX host buffers in a deterministic way. A further concern is the asynchronous step from stage 4 to stage 5, which makes it impossible to ensure that the FPGA sees/reads the updated head/write pointer in the order the application assumes.
So if an application uses two TX host buffers and wants to send (say UDP) packets in a specific sequence, it should send those ordered packets via the same TX host buffer; if the application spreads them across both TX host buffers, the adapter may send (some of) the packets in the incorrect order.
Note that stages 1 through 6 are the same for the “segment” interface, except that a netbuf represents a one-megabyte segment.
Another 4GA property is that a TX host buffer is inherently a circular buffer, which simply means that a packet’s position in the buffer determines the sequence in which it gets transmitted (for that TX host buffer). If an application uses the segment interface, allocates multiple segments and releases them “out of order”, the NTAPI driver library holds back the segments until it can release a contiguous range of segments. For instance, if an application allocates segments s1 and s2 but releases them as s2 followed by s1, the library code holds back s2 until s1 is also ready for transmission.
To use timestamp-based transmission, one must use a single TX host buffer and a dispatcher thread. This is more difficult but allows full control. Note that segments must be filled entirely before they are released; this sometimes means padding with “dummy” packets for which the packet descriptor’s TxIgnore bit is set. Dummy packets must be less than 10,000 bytes.